Grid and Cooperative Computing - GCC 2005: 4th International Conference, Beijing, China, November 30 -- December 3, 2005, Proceedings (Lecture Notes in Computer Science, 3795) 3540305106, 9783540305101

English Pages 1224 [1222] Year 2005

Table of contents :
Frontmatter
Towards Global Collaborative Computing: Opportunities and Challenges of Peer to Peer Networks and Applications
Management of Real-Time Streaming Data Grid Services
Session 1: Grid Service and Grid Security
A QoS-Satisfied Interdomain Overlay Multicast Algorithm for Live Media Service Grid
Automated Immunization Against Denial-of-Service Attacks Featuring Stochastic Packet Inspection
Mobile-Agent-Based Web Service Composition
Trust Shaping: Adapting Trust Establishment and Management to Application Requirements in a Service-Oriented Grid Environment
SVM Approach with CTNT to Detect DDoS Attacks in Grid Computing
Model Transformation Based Verification of Web Services Composition
A Worm Behavioral Approach to Susceptible Host Detection
A Dynamic Web Service Selection Strategy with QoS Global Optimization Based on Multi-objective Genetic Algorithm
A Formal Model for Grid Service Deployment in Grid Service Mining Based on Installation Strategies
A Grid Accounting Information Service for Site Autonomy
A KPN Based Cooperative Composition Model of Services
A Layered Architecture of Service Organization in AegisGrid
A Multi-agent Framework for Grid Service Workflow Embedded with Coloured Petri Nets
A New United Certificate Revocation Scheme in Grid Environments
A Novel Secure Routing System in Overlay Environment
A Semantic Metadata Catalog Service for Grid
An ECA-Rule-Based Workflow Management Approach for Web Services Composition
An Efficient Password Authentication Schemes Without Using the Server Public Key for Grid Computing
Certificate-Driven Grid Workflow Paradigm Based on Service Computing
Dynamic-Role Based Access Control Framework Across Multi-domains in Grid Environment
An Automatic Policy Refinement Mechanism for Policy-Driven Grid Service Systems
Grid Services Adaptation in a Grid Workflow
BlogGrid: Towards an Efficient Information Pushing Service on Blogspace
Research of Security Architecture for P2P Network Based on Trust Management System
A Time-Frame Based Trust Model for Grids
Application of Control Engineering Methods to Congestion Control in Differentiated Service Networks
Research on Semantic-Based Web Services Registry Federation
A Proxy-Based Dynamic Inheritance of Soft-Device
Temporal Logical-Based Web Services Architecture Description
The Design and Implementation of GIS Grid Services
The Minimization of QoS Deviation in Grid Environment
The Research on MPC-WS, a Web Service for the Simulation of Metal Powder Compaction Process
Towards a Framework for Automatic Service Composition in Manufacturing Grid
Characterizing Services Composeability and OWL-S Based Services Composition
Session 2: Grid Middleware and Applications
An Efficient Collective Communication Method Using a Shortest Path Algorithm in a Computational Grid
MAG: A Mobile Agent Based Computational Grid Platform
Experiences in Running Workloads over Grid3
An Efficient Network Information Model Using NWS for Grid Computing Environments
Flexible Temporal Consistency for Fixed-Time Constraint Verification in Grid Workflow Systems
An Adaptive Scheduling Algorithm for Molecule Docking Design on Grid
XML-Based Digital Signature Accelerator in Open Mobile Grid Computing
Experiences on Parallel Replicated Discrete-Event Simulation on a GRID
Towards an End-User Programming Environment for the Grid
TCP/IP Offload Engine Module Supporting Binary Compatibility for Standard Socket Interfaces
A Hybrid Parallel Loop Scheduling Scheme on Grid Environments
A Conceptual Modeling Approach to Virtual Organizations in the Grid
Incorporating Data Movement into Grid Task Scheduling
An Integration of Global and Enterprise Grid Computing: Gridbus Broker and Xgrid Perspective
Design and Implementation of a Middleware for Hybrid Switching Networks
A Dynamic Grid Workflow Model Based On Workflow Component Reuse
Coordinated Placement and Replacement for Grid-Based Hierarchical Web Caches
A XML-Based Composition Event Approach as an Integration and Cooperation Middleware
An Infrastructure for Grid Job Monitoring
Grid Enabled Master Slave Task Scheduling for Heterogeneous Processor Paradigm
Optimizing Large File Transfer on Data Grid
A Parallel Collaborative Algorithm Based on Partial Duality in Interconnected Power Grids
Monitoring MPI Running Nodes Status for Load Balance
Scheduling and Executing Heterogeneous Task Graph in Grid Computing Environment
Agent Technology and Generic Workflow Management in an e-Science Environment
Session 3: Knowledge Grid and Semantic Grid
Query Optimization in Database Grid
Pushing Scientific Documents by Discovering Interest in Information Flow Within E-Science Knowledge Grid
Schema Adaptation Under Multi-relation Dependencies
Dart-Dataflow: Towards Communicating Data Semantics in Sensor Grid
Data Distribution Management Modeling and Implementation on Computational Grid
Differentiated Application Independent Data Aggregation in Wireless Sensor Networks
Dynamic Models of Knowledge in Virtual Organizations
Scientific Data Management Architecture for Grid Computing Environments
Efficient Join Algorithms for Integrating XML Data in Grid Environment
Integrated k-NN Query Processing Based on Geospatial Data Services
SGII: Towards Semantic Grid-Based Enterprise Information Integration
The Architecture of SIG Computing Environment and Its Application to Image Processing
The Computation of Semantic Data Cube
Knowledge Acquisition Based on the Global Concept of Fuzzy Cognitive Maps
The Architecture and Implementation of Resource Space Model System
Using Fuzzy Cognitive Map to Effectively Classify E-Documents and Application
Session 4: Resource Management
A Scalable Resource Locating Service in Vega Grid
rBundle: An Iterative Combinatorial Auction-Based Approach to Supporting Advance Reservation
Decentralized Grid Resource Locating Protocol Based on Grid Resource Space Model
A Constellation Resource Discovery Model Based on Scalable Multi-tape Universal Turing Machine
Replica Placement in Data Grid: A Multi-objective Approach
Grid Resource Discovery Using Semantic Communities
Dynamic Multi-stage Resource Selection with Preference Factors in Grid Economy
On-Demand Resource Allocation for Service Level Guarantee in Grid Environment
A Prediction-Based Parallel Replication Algorithm in Distributed Storage System
Reliability-Latency Tradeoffs for Data Gathering in Random-Access Wireless Sensor Networks
An Optimistic Replication Algorithm to Improve Consistency for Massive Data
A SLA-Based Resource Donation Mechanism for Service Hosting Utility Center
Credit in the Grid Resource Management
Grid Resource Trade Network: Effective Resource Management Model in Grid Computing
Survivability Analysis of Grid Resource Management System Topology
SATOR: A Scalable Resource Registration Mechanism Enabling Virtual Organizations of Enterprise Applications
Collaborating Semantic Link Network with Resource Space Model
RSM and SLN: Transformation, Normalization and Cooperation
Contingent Pricing for Resource Advance Reservation Under Capacity Constraints
Session 5: P2P Computing and Automatic Computing
Anonymous Communication Systems in P2P Network with Random Agent Nodes
An Efficient Cluster-Hierarchy Architecture Model ECHP2P for P2P Networks
Building Efficient Super-Peer Overlay Network for DHT Systems
Exploiting the Heterogeneity in Structured Peer-to-Peer Systems
Dynamic Scheduling Mechanism for Result Certification in Peer to Peer Grid Computing
A Hybrid Peer-to-Peer Media Streaming
Trust Model Based on Similarity Measure of Vectors in P2P Networks
A Large Scale Distributed Platform for High Performance Computing
An Adaptive Service Strategy Based on User Rating in P2P
P2PGrid: Integrating P2P Networks into the Grid Environment
An Efficient Content-Based Notification Service Routed over P2P Network
Distribution of Mobile Agents in Vulnerable Networks
A Mathematical Foundation for Topology Awareness of P2P Overlay Networks
SChord: Handling Churn in Chord by Exploiting Node Session Time
Towards Reputation-Aware Resource Discovery in Peer-to-Peer Networks
Constructing Fair-Exchange P2P File Market
A Novel Behavior-Based Peer-to-Peer Trust Model
A Topology Adaptation Protocol for Structured Superpeer Overlay Construction
A Routing Protocol Based on Trust for MANETs
Dynamic Zone-Balancing of Topology-Aware Peer-to-Peer Networks
A Localized Algorithm for Minimum-Energy Broadcasting Problem in MANET
Multipath Traffic Allocation Based on Ant Optimization Algorithm with Reusing Abilities in MANET
Routing Algorithm Using SkipNet and Small-World for Peer-to-Peer System
Smart Search over Desirable Topologies: Towards Scalable and Efficient P2P File Sharing
A Scalable Version Control Layer in P2P File System
A Framework for Transactional Mobile Agent Execution
Session 6: Performance Evaluation and Modeling
Design of the Force Field Task Assignment Method and Associated Performance Evaluation for Desktop Grids
Performance Investigation of Weighted Meta-scheduling Algorithm for Scientific Grid
Performance Analysis of Domain Decomposition Applications Using Unbalanced Strategies in Grid Environments
Cooperative Determination on Cache Replacement Candidates for Transcoding Proxy Caching
Mathematics Model and Performance Evaluation of a Scalable TCP Congestion Control Protocol to LNCS/LNAI Proceedings
An Active Measurement Approach for Link Faults Monitoring in ISP Networks
GT-Based Performance Improving for Resource Management of Computational Grid
The PARNEM: Using Network Emulation to Predict the Correctness and Performance of Applications
Session 7: Software Engineering and Cooperative Computing
A Hybrid Workflow Paradigm for Integrating Self-managing Domain-Specific Applications
Supporting Remote Collaboration Through Structured Activity Logging
The Implementation of Component Based Web Courseware in Middleware Systems
A Single-Pass Online Data Mining Algorithm Combined with Control Theory with Limited Memory in Dynamic Data Streams
An Efficient Heuristic Algorithm for Constructing Delay- and Degree-Bounded Application-Level Multicast Tree
The Batch Patching Method Using Dynamic Cache of Proxy Cache for Streaming Media
A Rule-Based Analysis Method for Cooperative Business Applications
Retargetable Machine-Description System: Multi-layer Architecture Approach
An Unbalanced Partitioning Scheme for Graph in Heterogeneous Computing
A Connector Interaction for Software Component Composition with Message Central Processing
Research on the Fault Tolerance Deployment in Sensor Networks
The Effect of Router Buffer Size on Queue Length-Based AQM Schemes
Parallel Web Spiders for Cooperative Information Gathering
Backmatter

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, New York University, NY, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

3795

Hai Zhuge Geoffrey C. Fox (Eds.)

Grid and Cooperative Computing – GCC 2005
4th International Conference
Beijing, China, November 30 – December 3, 2005
Proceedings

Volume Editors

Hai Zhuge
Chinese Academy of Sciences, Institute of Computing Technology
P.O. Box 2704-28, Beijing, China
E-mail: [email protected]

Geoffrey C. Fox
Indiana University, Community Grid Computing Laboratory
501 North Morton Street, Suite 224, Bloomington, IN 47404, USA
E-mail: [email protected]

Library of Congress Control Number: 2005936339
CR Subject Classification (1998): C.2, D.4, I.2.11, H.4, H.3, H.5, K.6.5
ISSN 0302-9743
ISBN-10 3-540-30510-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-30510-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 11590354 06/3142 543210

Preface

This volume presents the accepted papers for the 4th International Conference on Grid and Cooperative Computing (GCC 2005), held in Beijing, China, during November 30 – December 3, 2005. The conference series of GCC aims to provide an international forum for the presentation and discussion of research trends on the theory, method, and design of Grid and cooperative computing as well as their scientific, engineering and commercial applications. It has become a major annual event in this area.

The First International Conference on Grid and Cooperative Computing (GCC 2002) received 168 submissions. GCC 2003 received 550 submissions, from which 176 regular papers and 173 short papers were accepted. The acceptance rate of regular papers was 32%, and the total acceptance rate was 64%. GCC 2004 received 427 main-conference submissions and 154 workshop submissions. The main conference accepted 96 regular papers and 62 short papers. The acceptance rate of the regular papers was 23%. The total acceptance rate of the main conference was 37%.

For this conference, we received 576 submissions. Each was reviewed by two independent members of the International Program Committee. After carefully evaluating their originality and quality, we accepted 57 regular papers and 84 short papers. The acceptance rate of regular papers was 10%. The total acceptance rate was 25%.

We are pleased to thank the authors whose submissions and participation made this conference possible. We also want to express our thanks to the Program Committee members, for their dedication in helping to organize the conference and reviewing the submissions. We owe special thanks to the keynote speakers for their impressive speeches. We would like to thank the co-chairs Ian Foster and Tony Hey who provided continuous support for this conference.

Finally we would like to thank the China Knowledge Grid Research Group, especially Xiaofeng Wang, Jie Liu, Jin Liu, Chao He, and Liang Feng for their excellent work in organizing this conference.

October 2005

Hai Zhuge, Geoffrey C. Fox

Conference Committees

General Co-chairs
Ian Foster, University of Chicago, USA
Tony Hey, University of Southampton, UK

Program Committee Co-chairs
Hai Zhuge, Chinese Academy of Sciences, China
Geoffrey Fox, Indiana University, USA

Steering Committee
Andrew Chien, University of California at San Diego, USA
Hai Jin, Huazhong University of Science and Technology, China
Guojie Li, China Computer Federation, China
Zhiwei Xu, Chinese Academy of Sciences, China
Xiaodong Zhang, College of William and Mary, USA

Publicity Chair
Cho-Li Wang, University of Hong Kong, China

Program Committee Members
Mark Baker (University of Portsmouth, UK)
Yaodong Bi (University of Scranton, USA)
Rajkumar Buyya (The University of Melbourne, Australia)
Wentong Cai (Nanyang Technological University, Singapore)
Jiannong Cao (Hong Kong Polytechnic University, Hong Kong, China)
Guihai Chen (Nanjing University, China)
Guangrong Gao (University of Delaware Newark, USA)
Ning Gu (Fudan University, China)
Minyi Guo (University of Aizu, Japan)
Jun Han (Swinburne University of Technology, Australia)
Yanbo Han (Institute of Computing Tech., CAS, China)
Chun-Hsi Huang (University of Connecticut, USA)
Weijia Jia (City University of Hong Kong, Hong Kong, China)
Hai Jin (HuaZhong University of Sci.&Tech., China)
Francis Lau (Hong Kong University, Hong Kong, China)
Keqin Li (State University of New York, USA)
Minglu Li (Shanghai Jiao Tong University, China)
Qing Li (City University of Hong Kong, Hong Kong, China)
Xiaoming Li (Peking University, China)
Xiaola Lin (City University of Hong Kong, China)
Junzhou Luo (Southeast University, China)
Huaikou Miao (ShangHai University, China)
Geyong Min (University of Bradford, UK)
Jun Ni (University of Iowa, USA)
Lionel Ni (Hong Kong University of Science and Technology, Hong Kong)
Yi Pan (Georgia State University, USA)
Depei Qian (Xi’an Jiaotong University, China)
Yuzhong Qu (Southeast University, China)
Hong Shen (Japan Advanced Institute of Science and Technology, Japan)
Alexander V. Smirnov (St.-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia)
Xian-He Sun (Illinois Institute of Technology, USA)
Yuzhong Sun (Institute of Computing Technology, CAS, China)
David Taniar (Monash University, Australia)
Huaglory Tianfield (Glasgow Caledonian University, UK)
David W. Walker (Cardiff University, UK)
Shaowen Wang (University of Iowa, USA)
Jie Wu (Florida Atlantic University, USA)
Zhaohui Wu (Zhejiang University, China)
Nong Xiao (National University of Defense Technology, China)
Cheng-Zhong Xu (Wayne State University, USA)
Guangwen Yang (Tsinghua University, China)
Laurence Tianruo Yang (St. Francis Xavier University, Canada)
Zhonghua Yang (Nanyang Technological University, Singapore)
Xiaodong Zhang (NSF, USA and College of William and Mary, USA)
Weimin Zheng (Tsinghua University, China)
Wanlei Zhou (Deakin University, Australia)
Xinrong Zhou (Abo Akademi University, Finland)
Jianping Zhu (The University of Akron, USA)

Table of Contents

Towards Global Collaborative Computing: Opportunities and Challenges of Peer to Peer Networks and Applications
  Ling Liu

Management of Real-Time Streaming Data Grid Services
  Geoffrey Fox, Galip Aydin, Harshawardhan Gadgil, Shrideep Pallickara, Marlon Pierce, Wenjun Wu

Session 1: Grid Service and Grid Security

A QoS-Satisfied Interdomain Overlay Multicast Algorithm for Live Media Service Grid
  Yuhui Zhao, Yuyan An, Cuirong Wang, Yuan Gao

Automated Immunization Against Denial-of-Service Attacks Featuring Stochastic Packet Inspection
  Jongho Kim, Jaeik Cho, Jongsub Moon

Mobile-Agent-Based Web Service Composition
  Zhuzhong Qian, SangLu Lu, Li Xie

Trust Shaping: Adapting Trust Establishment and Management to Application Requirements in a Service-Oriented Grid Environment
  E. Papalilo, T. Friese, M. Smith, B. Freisleben

SVM Approach with CTNT to Detect DDoS Attacks in Grid Computing
  Jungtaek Seo, Cheolho Lee, Taeshik Shon, Jongsub Moon

Model Transformation Based Verification of Web Services Composition
  YanPing Yang, QingPing Tan, Yong Xiao

A Worm Behavioral Approach to Susceptible Host Detection
  BaiLing Wang, BinXing Fang, XiaoChun Yun

A Dynamic Web Service Selection Strategy with QoS Global Optimization Based on Multi-objective Genetic Algorithm
  Shulei Liu, Yunxiang Liu, Ning Jing, Guifen Tang, Yu Tang

A Formal Model for Grid Service Deployment in Grid Service Mining Based on Installation Strategies
  Tun Lu, Zhishu Li, Chunlin Xu, Xuemei Huang

A Grid Accounting Information Service for Site Autonomy
  Beob Kyun Kim, Haeng Jin Jang, Tingting Li, Dong Un An, Seung Jong Chung

A KPN Based Cooperative Composition Model of Services
  Xiuguo Zhang, Weishi Zhang, Jinyu Shi

A Layered Architecture of Service Organization in AegisGrid
  Li Liu, Zhong Zhou, Wei Wu

A Multi-agent Framework for Grid Service Workflow Embedded with Coloured Petri Nets
  Zhengli Zhai, Lei Zhou, Yang Yang, Zhimin Tian

A New United Certificate Revocation Scheme in Grid Environments
  Ying Liu, Sheng-rong Wang, Jing-bo Xia, Jun Wei

A Novel Secure Routing System in Overlay Environment
  Han Su, Yun Wang

A Semantic Metadata Catalog Service for Grid
  Kewei Wei, Ming Zhang, Yaping Zhu

An ECA-Rule-Based Workflow Management Approach for Web Services Composition
  Yi Wang, Minglu Li, Jian Cao, Feilong Tang, Lin Chen, Lei Cao

An Efficient Password Authentication Schemes Without Using the Server Public Key for Grid Computing
  Eun-Jun Yoon, Kee-Young Yoo

Certificate-Driven Grid Workflow Paradigm Based on Service Computing
  Wanchun Dou, S.C. Cheung, Guihai Chen, Shijie Cai

Dynamic-Role Based Access Control Framework Across Multi-domains in Grid Environment
  Ying Chen, Shoubao Yang, Leitao Guo

An Automatic Policy Refinement Mechanism for Policy-Driven Grid Service Systems
  Bei-shui Liao, Ji Gao

Grid Services Adaptation in a Grid Workflow
  Wencai Guo, Yang Yang, Zhengli Zhai

BlogGrid: Towards an Efficient Information Pushing Service on Blogspace
  Jason J. Jung, Inay Ha, Geun-Sik Jo

Research of Security Architecture for P2P Network Based on Trust Management System
  Zhang Dehua, Yuqing Zhang, Yiyu Zhou

A Time-Frame Based Trust Model for Grids
  Woodas W.K. Lai, Kam-Wing Ng

Application of Control Engineering Methods to Congestion Control in Differentiated Service Networks
  F. Habibipou, M. Khajepour, M. Galily

Research on Semantic-Based Web Services Registry Federation
  Bing Li, Fei He, Wudong Liu, KeQing He, Jin Liu

A Proxy-Based Dynamic Inheritance of Soft-Device
  Jia Bi, Yanyan Li, Yunpeng Xing, Xiang Li, Xue Chen

Temporal Logical-Based Web Services Architecture Description
  Yuan Rao

The Design and Implementation of GIS Grid Services
  Wen-jun Li, Yong-ji Li, Zhi-wei Liang, Chu-wei Huang, Ying-wen Wen

The Minimization of QoS Deviation in Grid Environment
  YongZhong Zhang, Yinliang Zhao, FangFang Wu, ZengZhi Li

The Research on MPC-WS, a Web Service for the Simulation of Metal Powder Compaction Process
  Puqing Chen, Kejing He, Zhaoyao Zhou, Yuanyuan Li

Towards a Framework for Automatic Service Composition in Manufacturing Grid
  Lei Zhang, Weizheng Yuan, Wei Wang

Characterizing Services Composeability and OWL-S Based Services Composition
  Zhonghua Yang, Jing Bing Zhang, Jiao Tao, Robert Gay

Session 2: Grid Middleware and Applications

An Efficient Collective Communication Method Using a Shortest Path Algorithm in a Computational Grid
  Yong Hee Yeom, Seok Myun Kwon, Jin Suk Kim

MAG: A Mobile Agent Based Computational Grid Platform
  Rafael Fernandes Lopes, Francisco José da Silva e Silva, Bysmarck Barros de Sousa

Experiences in Running Workloads over Grid3
  Catalin L. Dumitrescu, Ioan Raicu, Ian Foster

An Efficient Network Information Model Using NWS for Grid Computing Environments
  Chao-Tung Yang, Po-Chi Shih, Sung-Yi Chen, Wen-Chung Shih

Flexible Temporal Consistency for Fixed-Time Constraint Verification in Grid Workflow Systems
  Jinjun Chen, Yun Yang

An Adaptive Scheduling Algorithm for Molecule Docking Design on Grid
  Yan-Li Hu, Liang Bai, Wei-Ming Zhang, Wei-Dong Xiao, Zhong Liu

XML-Based Digital Signature Accelerator in Open Mobile Grid Computing
  Namje Park, Kiyoung Moon, Kyoil Chung, Seungjoo Kim, Dongho Won

Experiences on Parallel Replicated Discrete-Event Simulation on a GRID
  Ángel Perles, Antonio Martí, Francisco Rodríguez, Juan José Serrano, Miguel A. Mateo

Towards an End-User Programming Environment for the Grid
  Chengchun Shu, Haiyan Yu, Lijuan Xiao, Haozhi Liu, Zhiwei Xu

TCP/IP Offload Engine Module Supporting Binary Compatibility for Standard Socket Interfaces
  Dong-Jae Kang, Kang-Ho Kim, Sung-In Jung, Hae-Young Bae

A Hybrid Parallel Loop Scheduling Scheme on Grid Environments
  Wen-Chung Shih, Chao-Tung Yang, Shian-Shyong Tseng

A Conceptual Modeling Approach to Virtual Organizations in the Grid
  William Song, Xiaoming Li

Incorporating Data Movement into Grid Task Scheduling
  Xiaoshan He, Xian-He Sun

An Integration of Global and Enterprise Grid Computing: Gridbus Broker and Xgrid Perspective
  Marcos Dias de Assunção, Krishna Nadiminti, Srikumar Venugopal, Tianchi Ma, Rajkumar Buyya

Design and Implementation of a Middleware for Hybrid Switching Networks
  Yueming Lu, Yuefeng Ji, Aibo Liu

A Dynamic Grid Workflow Model Based on Workflow Component Reuse
  Jian Cao, Yujie Mou, Jie Wang, Shensheng Zhang, Minglu Li

Coordinated Placement and Replacement for Grid-Based Hierarchical Web Caches
  Wenzhong Li, Kun Wu, Xu Ping, Ye Tao, Sanglu Lu, Daoxu Chen

A XML-Based Composition Event Approach as an Integration and Cooperation Middleware
  Gang Xu, JianGang Ma, Tao Huang

An Infrastructure for Grid Job Monitoring
  Cuiju Luan, Guanghua Song, Yao Zheng

Grid Enabled Master Slave Task Scheduling for Heterogeneous Processor Paradigm
  Ching-Hsien Hsu, Tai-Lung Chen, Guan-Hao Lin

Optimizing Large File Transfer on Data Grid
  Teng Ma, Junzhou Luo

A Parallel Collaborative Algorithm Based on Partial Duality in Interconnected Power Grids
  Ke-yan Liu, Wan-xing Sheng, Yun-hua Li

Monitoring MPI Running Nodes Status for Load Balance
  Qianni Deng, Xugang Wang, Dehua Zang

Scheduling and Executing Heterogeneous Task Graph in Grid Computing Environment
  Weiguang Qiao, Guosun Zeng, An Hua, Fei Zhang

Agent Technology and Generic Workflow Management in an e-Science Environment
  Zhiming Zhao, Adam Belloum, Peter Sloot, Bob Hertzberger

Session 3: Knowledge Grid and Semantic Grid

Query Optimization in Database Grid
  Xiaoqing Zheng, Huajun Chen, Zhaohui Wu, Yuxin Mao

Pushing Scientific Documents by Discovering Interest in Information Flow Within E-Science Knowledge Grid
  Lianhong Ding, Xiang Li, Yunpeng Xing

Schema Adaptation Under Multi-relation Dependencies
  MingHong Zhou, HuaMing Liao, Feng Li

Dart-Dataflow: Towards Communicating Data Semantics in Sensor Grid
  Zhiyong Ye, Huajun Chen, Zhaohui Wu

Data Distribution Management Modeling and Implementation on Computational Grid
  Jong Sik Lee

Differentiated Application Independent Data Aggregation in Wireless Sensor Networks
  Jianlin Qiu, Ye Tao, Sanglu Lu

Dynamic Models of Knowledge in Virtual Organizations
  Yan Ren, Xueshan Luo

Scientific Data Management Architecture for Grid Computing Environments
  Jaechun No, Nguyen Tien Cuong, Sung Soon Park

Efficient Join Algorithms for Integrating XML Data in Grid Environment
  Hongzhi Wang, Jianzhong Li, Shuguang Xiong

Integrated k-NN Query Processing Based on Geospatial Data Services
  Guifen Tang, Luo Chen, Yunxiang Liu, Shulei Liu, Ning Jing

SGII: Towards Semantic Grid-Based Enterprise Information Integration
  Jingtao Zhou, Shusheng Zhang, Han Zhao, Mingwei Wang

The Architecture of SIG Computing Environment and Its Application to Image Processing
  Chunhui Yang, Deke Guo, Yan Ren, Xueshan Luo, Jinfeng Men

The Computation of Semantic Data Cube
  Yubao Liu, Jian Yin

Knowledge Acquisition Based on the Global Concept of Fuzzy Cognitive Maps
  Xiang-Feng Luo

The Architecture and Implementation of Resource Space Model System
  Peng Shi, Yunpeng Xing, Erlin Yao, Zhen Wang, Kehua Yuan, Junsheng Zhang, Jianzeng Wang, Fei Guo

Using Fuzzy Cognitive Map to Effectively Classify E-Documents and Application
  Jianzeng Wang, Yunpeng Xing, Peng Shi, Fei Guo, Zhen Wang, Erlin Yao, Kehua Yuan, Junsheng Zhang

Session 4: Resource Management

A Scalable Resource Locating Service in Vega Grid
  Hai Mo, Zha Li, Liu Haozhi

rBundle: An Iterative Combinatorial Auction-Based Approach to Supporting Advance Reservation
  Zhixing Huang, Yuhui Qiu

Decentralized Grid Resource Locating Protocol Based on Grid Resource Space Model
  Deke Guo, Honghui Chen, Chenggang Xie, Hongtao Lei, Tao Chen, Xueshan Luo

A Constellation Resource Discovery Model Based on Scalable Multi-tape Universal Turing Machine
  Yinfeng Wang, Xiaoshe Dong, Hua Guo, Xiuqiang He, GuoRong Liu

Replica Placement in Data Grid: A Multi-objective Approach
  Rashedur M. Rahman, Ken Barker, Reda Alhajj

Grid Resource Discovery Using Semantic Communities
Juan Li, Son Vuong

Dynamic Multi-stage Resource Selection with Preference Factors in Grid Economy
Yu Hua, Chanle Wu

On-Demand Resource Allocation for Service Level Guarantee in Grid Environment
Hailan Yang, Gongyi Wu, Jianzhong Zhang

A Prediction-Based Parallel Replication Algorithm in Distributed Storage System
Yijie Wang, Xiaoming Zhang

Reliability-Latency Tradeoffs for Data Gathering in Random-Access Wireless Sensor Networks
Haibo Zhang, Hong Shen, Haibin Kan

An Optimistic Replication Algorithm to Improve Consistency for Massive Data
Jing Zhou, Yijie Wang, Sikun Li

A SLA-Based Resource Donation Mechanism for Service Hosting Utility Center
Yufeng Wang, Huaimin Wang, Yan Jia, Dianxi Shi, Bixin Liu

Credit in the Grid Resource Management
Manfu Ma, Jian Wu, Shuyu Li, Dingjian Chen, Zhengguo Hu

Grid Resource Trade Network: Effective Resource Management Model in Grid Computing
Sung Ho Jang, Da Hye Park, Jong Sik Lee

Survivability Analysis of Grid Resource Management System Topology
Yang Qu, Chuang Lin, Yajuan Li, Zhiguang Shan

SATOR: A Scalable Resource Registration Mechanism Enabling Virtual Organizations of Enterprise Applications
Chen Liu, Fanke Cheng, Yanbo Han

Collaborating Semantic Link Network with Resource Space Model
Yunpeng Xing, Jie Liu, Xiaoping Sun, Erlin Yao

RSM and SLN: Transformation, Normalization and Cooperation
Erlin Yao, Yunpeng Xing, Jie Liu, Xiaoping Sun

Contingent Pricing for Resource Advance Reservation Under Capacity Constraints
Zhixing Huang, Yuhui Qiu

Session 5: P2P Computing and Automatic Computing

Anonymous Communication Systems in P2P Network with Random Agent Nodes
Byung Ryong Kim, Ki Chang Kim

An Efficient Cluster-Hierarchy Architecture Model ECHP2P for P2P Networks
Guangxue Yue, Renfa Li, Zude Zhou, Ronghui Wu

Building Efficient Super-Peer Overlay Network for DHT Systems
Yin Li, Xinli Huang, Fanyuan Ma, Futai Zou

Exploiting the Heterogeneity in Structured Peer-to-Peer Systems
Tongqing Qiu, Guihai Chen

Dynamic Scheduling Mechanism for Result Certification in Peer to Peer Grid Computing
SungJin Choi, MaengSoon Baik, JoonMin Gil, ChanYeol Park, SoonYoung Jung, ChongSun Hwang

A Hybrid Peer-to-Peer Media Streaming
Sunghoon Son

Trust Model Based on Similarity Measure of Vectors in P2P Networks
Leitao Guo, Shoubao Yang, Jing Wang, Jinyang Zhou

A Large Scale Distributed Platform for High Performance Computing
Nabil Abdennadher, Régis Boesch

An Adaptive Service Strategy Based on User Rating in P2P
Jianming Fu, Lei Zhang, Weinan Li, Huanguo Zhang

P2PGrid: Integrating P2P Networks into the Grid Environment
Jiannong Cao, Fred B. Liu

An Efficient Content-Based Notification Service Routed over P2P Network
Xixiang Hu, Yuexuan Wang, Yunhe Pan

Distribution of Mobile Agents in Vulnerable Networks
Wenyu Qu, Hong Shen, Yingwei Jin

A Mathematical Foundation for Topology Awareness of P2P Overlay Networks
Habib Rostami, Jafar Habibi

SChord: Handling Churn in Chord by Exploiting Node Session Time
Feng Hong, Minglu Li, Jiadi Yu

Towards Reputation-Aware Resource Discovery in Peer-to-Peer Networks
Jinyang Zhou, Shoubao Yang, Leitao Guo, Jing Wang, Ying Chen

Constructing Fair-Exchange P2P File Market
Min Zuo, Jianhua Li

A Novel Behavior-Based Peer-to-Peer Trust Model
Tao Wang, Xianliang Lu, Hancong Duan

A Topology Adaptation Protocol for Structured Superpeer Overlay Construction
Changyong Niu, Jian Wang, Ruimin Shen

A Routing Protocol Based on Trust for MANETs
Cuirong Wang, Xiaozong Yang, Yuan Gao

Dynamic Zone-Balancing of Topology-Aware Peer-to-Peer Networks
Gang Wu, Jianli Liu

A Localized Algorithm for Minimum-Energy Broadcasting Problem in MANET
Chao Peng, Hong Shen

Multipath Traffic Allocation Based on Ant Optimization Algorithm with Reusing Abilities in MANET
Hui-Yao An, Xi-Cheng Lu, Wei Peng

Routing Algorithm Using SkipNet and Small-World for Peer-to-Peer System
Xiaoqin Huang, Lin Chen, Linpeng Huang, Minglu Li

Smart Search over Desirable Topologies: Towards Scalable and Efficient P2P File Sharing
Xinli Huang, Yin Li, Wenju Zhang, Fanyuan Ma

A Scalable Version Control Layer in P2P File System
Xin Lin, Shanping Li, Wei Shi, Jie Teng

A Framework for Transactional Mobile Agent Execution
Jin Yang, Jiannong Cao, Weigang Wu, Chengzhong Xu

Session 6: Performance Evaluation and Modeling

Design of the Force Field Task Assignment Method and Associated Performance Evaluation for Desktop Grids
Edscott Wilson García, Guillermo Morales-Luna

Performance Investigation of Weighted Meta-scheduling Algorithm for Scientific Grid
Jie Song, Chee-Kian Koh, Simon See, Gay Kheng Leng

Performance Analysis of Domain Decomposition Applications Using Unbalanced Strategies in Grid Environments
Beatriz Otero, José M. Cela, Rosa M. Badía, Jesús Labarta

Cooperative Determination on Cache Replacement Candidates for Transcoding Proxy Caching
Keqiu Li, Hong Shen, Di Wu

Mathematics Model and Performance Evaluation of a Scalable TCP Congestion Control Protocol to LNCS/LNAI Proceedings
Li-Song Shao, He-Ying Zhang, Yan-Xin Zheng, Wen-Hua Dou

An Active Measurement Approach for Link Faults Monitoring in ISP Networks
Wenwei Li, Dafang Zhang, Jinmin Yang, Gaogang Xie

GT-Based Performance Improving for Resource Management of Computational Grid
Xiu-chuan Wu, Li-jie Sha, Dong Guo, Lan-fang Lou, Liang Hu

The PARNEM: Using Network Emulation to Predict the Correctness and Performance of Applications
Yue Li, Depei Qian, Chunxiao Xing, Ying He

Session 7: Software Engineering and Cooperative Computing

A Hybrid Workflow Paradigm for Integrating Self-managing Domain-Specific Applications
Wanchun Dou, S.C. Chueng, Guihai Chen, J. Wang, S.J. Cai

Supporting Remote Collaboration Through Structured Activity Logging
Matt-Mouley Bouamrane, Saturnino Luz, Masood Masoodian, David King

The Implementation of Component Based Web Courseware in Middleware Systems
Hwa-Young Jeong

A Single-Pass Online Data Mining Algorithm Combined with Control Theory with Limited Memory in Dynamic Data Streams
Yanxiang He, Naixue Xiong, Xavier Défago, Yan Yang, Jing He

An Efficient Heuristic Algorithm for Constructing Delay- and Degree-Bounded Application-Level Multicast Tree
Feng Liu, Xicheng Lu, Yuxing Peng

The Batch Patching Method Using Dynamic Cache of Proxy Cache for Streaming Media
Zhiwen Xu, Xiaoxin Guo, Xiangjiu Che, Zhengxuan Wang, Yunjie Pang

A Rule-Based Analysis Method for Cooperative Business Applications
Yonghwan Lee, Eunmi Choi, Dugki Min

Retargetable Machine-Description System: Multi-layer Architecture Approach
Dan Wu, Kui Dai, Zhiying Wang

An Unbalanced Partitioning Scheme for Graph in Heterogeneous Computing
Yiwei Shen, Guosun Zeng

A Connector Interaction for Software Component Composition with Message Central Processing
Hwa-Young Jeong

Research on the Fault Tolerance Deployment in Sensor Networks
Juhua Pu, Zhang Xiong

The Effect of Router Buffer Size on Queue Length-Based AQM Schemes
Ming Liu, Wen-hua Dou, He-ying Zhang

Parallel Web Spiders for Cooperative Information Gathering
Jiewen Luo, Zhongzhi Shi, Maoguang Wang, Wei Wang

Author Index

Towards Global Collaborative Computing: Opportunities and Challenges of Peer to Peer Networks and Applications

Ling Liu
College of Computing, Georgia Institute of Technology, USA

Abstract. Collaborative computing has emerged as a promising paradigm for developing large-scale distributed systems. Peer to Peer (P2P) and Grid computing represent a significant step towards global collaboration, a fundamental capability of network computing. P2P systems are decentralized, self-organizing, and self-repairing distributed systems that cooperate to exchange data and accomplish computing tasks. These systems have emerged as the dominant consumer of residential Internet subscribers' bandwidth, and are increasingly used in many different application domains. With rapid advances in wireless and mobile communication technologies, such as wireless mesh networks, wireless LANs, and 3G cellular networks, P2P computing is moving into wireless networking, mobile computing, and sensor network applications. In this keynote, I will discuss some important opportunities and challenges of Peer to Peer networks and applications on the way towards a global collaborative computing paradigm. I will first review P2P research and development over the past few years, focusing on the remarkable results produced in P2P system scalability, robustness, distributed storage, and system measurements, the continued evolution of P2P systems, and how today's state-of-the-art developments differ from earlier instantiations such as Napster, Gnutella, KaZaA, and Morpheus. Then I will discuss some important challenges for wide deployment of P2P computing in mission-critical applications and future computing environments.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1-2, 2005. © Springer-Verlag Berlin Heidelberg 2005

Management of Real-Time Streaming Data Grid Services

Geoffrey Fox, Galip Aydin, Harshawardhan Gadgil, Shrideep Pallickara, Marlon Pierce, and Wenjun Wu
Community Grids Laboratory, Indiana University, 501 North Morton Street, Suite 224, Bloomington, IN 47404
{gcf, gaydin, hgadgil, spallick, marpierc, wewu}@indiana.edu

Abstract. We discuss architectural and management support for real-time data stream applications, in terms of both lower-level messaging and higher-level service, filter, and session structures. In our approach, messaging systems act as a Grid substrate that can provide qualities of service to streaming applications ranging from audio-video collaboration to sensor grids. The messaging substrate is composed of distributed, hierarchically arranged message brokers that form networks. We discuss approaches to managing both broker networks and application filters: broker network topologies must be created and maintained, and distributed filters must be arranged in appropriate sequences. These managed broker networks may be applied to a wide range of problems. We discuss applications to audio/video collaboration in some detail and also describe applications to streaming Global Positioning System data.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 3-12, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

A growing number of applications involve real-time streams of information that need to be transported in a dynamic, high-performance, reliable, and secure fashion. Examples include sensor nets for both scientific and military applications, mobile devices on ad-hoc networks, and collaborative applications. In the latter case the streams consist of a set of “change events” for a collaborative entity, multicast to the participating clients. These events could be the frames of audio-video streams, encoded changed pixels in a shared display, or high-level semantic events such as signals of PowerPoint slide changes. Here we describe our research into ways of managing such streams, which we think are a critical component of both sensor nets and real-time synchronous collaboration environments.

We develop real-time streaming technology assuming that the sources, sinks, and filters of these streams are Web or Grid services. This allows us to share the support technology between streaming applications and to benefit from the pervasive interoperability of a service-oriented architecture. Further, this allows a simple model of collaborative Web and Grid services obtained by “just” sharing the input or output ports. As services expose their changes through explicit messages (using what we call a message-based Model-View-Controller architecture [1]), it is much easier to make them collaborative than traditional desktop applications, in which events are often buried in the application. Traditional collaborative applications can be made service oriented, in particular with a set of services implementing traditional H.323 functionality and interoperating with Access Grid and Polycom systems. This required development of an XML equivalent of the H.323 protocol [2]. Our other major motivation is the sensor networks of military, scientific, and social infrastructure. These are well suited to a service architecture, as exemplified by the US military Global Information Grid with its service-based Network Centric Operations and Warfare Architecture [3, 4].

We have developed general-purpose, open source software to support distributed streams, described in Sec. 2. NaradaBrokering [5] forms a distributed set of message brokers that implement a publish-subscribe software overlay network. This environment supports multiple protocols (including UDP, TCP, and parallel TCP) and provides reliable message delivery with a scalable architecture. Our architecture supports the interesting concept of hybrid streams, where multiple “simple streams” are intrinsically linked: examples are the linkage of a stream of annotation whiteboards with the original audio/video stream [7] and the combination of lossless and lossy codec streams (using perhaps parallel TCP and UDP respectively) to represent a large dynamic shared display.

Several applications drive the development of our technology. These include collaboration services with audio, video, and shared display streams, as well as linkages of real-time Global Positioning System sensors to Geographical Information Systems implemented as Web services. Other examples include the integration of hand-held devices into a Grid [6] and the linkage of annotations to video streams, showing how composite streams can be supported for real-time annotation [7]. The first two applications are described in Sections 4 and 5 and illustrate the need for high-level session and filter infrastructure on top of the messaging infrastructure. The messaging infrastructure supports the application services with their filters, gateways, and sessions reflecting both collaborative and workflow functions. However, we have found the need for a set of services that manage the messaging itself and so control broker deployment and quality of service. Section 3 describes the integration of the management of messaging and higher-level services.

2 NaradaBrokering: A Distributed Messaging Substrate

NaradaBrokering [5, 9] is a messaging infrastructure that is based on the publish/subscribe paradigm. The system efficiently routes messages [10] from the originators to the consumers that are interested in the message. The system places no restrictions on the size of these messages or the rate at which they are issued. Consumers can express their interests (or specify subscriptions) using simple formats such as character strings. Subscriptions may also be based on sophisticated queries involving XPath, SQL, or regular expressions. Support for these subscription formats enables consumers to precisely narrow the type of messages that they are interested in. The substrate incorporates support for enterprise messaging specifications such as the Java Message Service. The substrate also incorporates support for a very wide array of transports (TCP, UDP, Multicast, SSL, HTTP, and ParallelTCP among others), which enables the infrastructure to be leveraged by entities in a wide variety of settings. To cope with very large payloads the system leverages ParallelTCP at the transport level and services such as compression and fragmentation to reduce individual message sizes. The fragments (compressed or otherwise) are reconstituted by appropriate services (coalescing and de-compression) prior to delivery to the application.
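The subscription formats described above can be illustrated with a minimal, single-process sketch. It is not the NaradaBrokering API: the class `ToyBroker`, its methods, and the topic names are hypothetical, and only two of the formats mentioned (plain character strings and regular expressions) are shown.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Pattern, Tuple

Handler = Callable[[str, bytes], None]


class ToyBroker:
    """Hypothetical in-process broker; illustrates subscription matching only."""

    def __init__(self) -> None:
        self._exact: Dict[str, List[Handler]] = defaultdict(list)
        self._patterns: List[Tuple[Pattern[str], Handler]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Simple format: the subscription is a plain character string.
        self._exact[topic].append(handler)

    def subscribe_regex(self, pattern: str, handler: Handler) -> None:
        # Richer format: the subscription is a regular expression over topics.
        self._patterns.append((re.compile(pattern), handler))

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver to every matching subscriber; return the delivery count."""
        delivered = 0
        for handler in self._exact.get(topic, []):
            handler(topic, payload)
            delivered += 1
        for pattern, handler in self._patterns:
            if pattern.fullmatch(topic):
                handler(topic, payload)
                delivered += 1
        return delivered


received = []
broker = ToyBroker()
broker.subscribe("gps/station42", lambda t, p: received.append(("exact", t)))
broker.subscribe_regex(r"gps/.*", lambda t, p: received.append(("regex", t)))
count_gps = broker.publish("gps/station42", b"position fix")
count_av = broker.publish("av/session1", b"frame")
```

In this sketch a message on `gps/station42` reaches both subscribers, while a message on `av/session1` reaches none. A real broker network would also route between brokers, which is omitted here.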

The most fundamental unit in NaradaBrokering is a message. A stream can be thought of as a series of messages, each with causal and ordering correlations to previous messages in the stream. The inter-broker latency for routing typical messages is around 1 millisecond. In a controlled cluster setting a single broker was found to support up to 400 UDP-based A/V clients concurrently with adequate latency [11]. Among the services most relevant for collaboration within the system are the following.

1. Support for replay and recording services: The recording service is used to store messages reliably to the archival system. The recording is done in such a way that all events issued by the recording entity are stored in the order in which they were published. The replay service facilitates the replay of these previously stored messages and supports replays in multiple flavors. Entities may request replays based on sequencing information, timing information, the content of the message, or the topics to which these messages were published. In some cases several of these parameters can be combined in a single request.
2. Support for consistent global timestamps [12] through an implementation of the Network Time Protocol (NTP). This implementation ensures that timestamps at the distributed entities are within a few milliseconds of each other, which allows us to order messages based on these global timestamps. This is especially useful during replays, when we can precisely determine the order in which messages should be released to the application.
3. Support for buffering and subsequent time-spaced release of messages to reduce jitter. The typical lower bound for time-spacing resolution is a millisecond. However, we have also been able to successfully time-space events on the order of several microseconds. By buffering and releasing messages we reduce the jitter that may have been introduced by the network.
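Items 2 and 3 above can be sketched together: messages are buffered, ordered by their global timestamps, and released at evenly spaced times. This is only an illustration under simplifying assumptions. The class `ReleaseBuffer` and its methods are hypothetical names, timestamps are plain numbers in milliseconds, and the sketch computes a release schedule instead of sleeping in real time.

```python
import heapq
from typing import List, Tuple


class ReleaseBuffer:
    """Hypothetical jitter buffer: orders by timestamp, releases at fixed spacing."""

    def __init__(self, spacing_ms: float) -> None:
        self.spacing_ms = spacing_ms
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker so equal timestamps keep arrival order

    def add(self, timestamp_ms: float, message: str) -> None:
        heapq.heappush(self._heap, (timestamp_ms, self._counter, message))
        self._counter += 1

    def schedule(self, start_ms: float) -> List[Tuple[float, str]]:
        """Drain the buffer and return (release_time_ms, message) pairs."""
        plan = []
        release_at = start_ms
        while self._heap:
            _, _, message = heapq.heappop(self._heap)
            plan.append((release_at, message))
            release_at += self.spacing_ms
        return plan


buf = ReleaseBuffer(spacing_ms=20.0)
# Messages arrive out of order because of network jitter.
for ts, msg in [(40.0, "frame-3"), (0.0, "frame-1"), (20.0, "frame-2")]:
    buf.add(ts, msg)
plan = buf.schedule(start_ms=100.0)
```

Here the three frames arrive out of order but are scheduled for release in timestamp order, 20 ms apart. The ordering step is only meaningful if the timestamps from different sources are comparable, which is the role the NTP-based service plays in the text.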
More recently, we have incorporated support for Web Services within the substrate. Entities can send SOAP messages directly to the brokers that are part of the messaging infrastructure. We have incorporated support for Web Service specifications such as WS-Eventing, WS-ReliableMessaging, and WS-Reliability. Work on implementing the WS-Notification suite of specifications is currently underway. The implementation of these specifications also had to cope with other specifications, such as WS-Addressing and WS-Policy, that are leveraged by these applications. In addition to the rules governing SOAP messages and the implemented protocols, rules governing WS-Addressing are also enforced.

In our support for SOAP within NaradaBrokering we have introduced filters and filter-pipelines. A filter is the smallest processing unit for a SOAP message. Several filters can be cascaded together to constitute a filter-pipeline, and the filters within a filter-pipeline can be dynamically shuffled and reorganized. The system allows a filter-pipeline to be registered for every role that the node (functioning as a SOAP intermediary) intends to perform. Upon receipt of a SOAP message that is targeted to multiple roles (as indicated by the SOAP 1.2 role attribute), the corresponding filter-pipelines are cascaded so that the appropriate functions are performed. The SOAP message is first parsed to determine the roles that need to be performed. Next, we check whether there are any pipelines registered for each role. The scheme allows developers to develop their own filters and filter-pipelines and target them at specialized roles. For example, a developer may wish to develop a filter that performs message transformations between the competing notification specifications WS-Eventing and WS-Notification. By providing an extensible framework for the creation of filters and the registration of roles, sophisticated applications can be built.
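The role-based cascading of filter-pipelines can be sketched as follows. This is a simplified illustration, not the NaradaBrokering implementation: the message is a plain dictionary instead of a parsed SOAP envelope, and the names `FilterPipeline`, `Intermediary`, `tag`, and the role names are all hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, List[str]]
Filter = Callable[[Message], Message]


def tag(name: str) -> Filter:
    """Build a toy filter that records its name in the message trace."""
    def _filter(message: Message) -> Message:
        updated = dict(message)
        updated["trace"] = list(message.get("trace", [])) + [name]
        return updated
    return _filter


class FilterPipeline:
    def __init__(self, filters: List[Filter]) -> None:
        self.filters = list(filters)

    def reorder(self, new_order: List[int]) -> None:
        # Filters within a pipeline can be shuffled at runtime.
        self.filters = [self.filters[i] for i in new_order]

    def apply(self, message: Message) -> Message:
        for f in self.filters:
            message = f(message)
        return message


class Intermediary:
    """Toy node that keeps one pipeline per role it is willing to perform."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, FilterPipeline] = {}

    def register(self, role: str, pipeline: FilterPipeline) -> None:
        self._pipelines[role] = pipeline

    def process(self, message: Message) -> Message:
        # Determine the targeted roles, then cascade the registered pipelines.
        for role in message.get("roles", []):
            pipeline = self._pipelines.get(role)
            if pipeline is not None:
                message = pipeline.apply(message)
        return message


node = Intermediary()
translator = FilterPipeline([tag("parse"), tag("convert")])
node.register("logger", FilterPipeline([tag("log")]))
node.register("translator", translator)

request = {"roles": ["logger", "translator", "auditor"], "trace": []}
first = node.process(request)

translator.reorder([1, 0])  # swap the two filters in the translator pipeline
second = node.process(request)
```

Roles without a registered pipeline (here `auditor`) are simply skipped, and reordering a pipeline changes the order in which its filters run on later messages.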

3 HPSearch: Managing Broker Networks and Service Grids

As discussed in the previous section, NaradaBrokering provides a software messaging infrastructure. In a related project, we have been developing HPSearch [14] as a scripting-based management console for broker networks and their services. At one end of the spectrum are services that help manage the messaging middleware, while at the other end are services that leverage capabilities of the middleware (WSProxy). The management of both sets of services is handled by a scripting medium that binds Uniform Resource Identifiers (URIs) to the scripting language. By binding the URI as a first-class object we can use the scripting language to manage the resource identified by the URI. We discuss these functions in detail below.

In order to deploy a distributed application that uses NaradaBrokering, the middleware must be set up and a broker network topology must be deployed. The broker network topology may also be changed at runtime using HPSearch by adding or deleting links between brokers. Once the middleware is set up, we leverage the broker network to deploy the distributed application. To fulfill this requirement we have been developing a specialized Web Service called the Broker Service Adapter (BSA) that helps us deploy brokers on distributed nodes and set up links between them. The BSA is a Web Service that enables management of the middleware via WS-Management. Further, the BSA network is a scalable network that periodically restructures itself to achieve a tree-based structure. A management engine simply sends the appropriate commands to the root BSA node, and each command is then routed to the correct BSA. Errors and other conditions are similarly handled and reported to the management engine using WS-Eventing.

HPSearch uses NaradaBrokering to route data between components of a distributed application. This data transfer is managed transparently by the HPSearch runtime component, the Web Service Proxy (WSProxy) [14]. Thus, each of the distributed components is exposed as a Web Service that can be initialized and steered by simple SOAP requests. WSProxy can either wrap existing applications or create new data processing and data filtering services. WSProxy handles streaming data transfer using NaradaBrokering on behalf of the services, thus enabling streaming data transfer for any service. Because the streaming is carried by the NaradaBrokering middleware, a distributed routing substrate, there are no central bottlenecks, and when a broker node fails the data stream is rerouted through alternate routes, if available. Further, NaradaBrokering supports reliable delivery via persistent storage [13], thus enabling guaranteed delivery for data streams.
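The rerouting behaviour described above can be illustrated with a small sketch that finds a path through a broker network while avoiding failed brokers. The text does not specify the routing algorithm, so breadth-first search is used here purely as an assumed stand-in; the function `find_route` and the four-broker topology are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional


def find_route(
    links: Dict[str, List[str]],
    source: str,
    target: str,
    failed: FrozenSet[str] = frozenset(),
) -> Optional[List[str]]:
    """Return a shortest hop-count path that avoids failed brokers, or None."""
    if source in failed or target in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in links.get(node, []):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical topology: two disjoint routes from broker A to broker D.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
primary = find_route(links, "A", "D")
backup = find_route(links, "A", "D", failed=frozenset({"B"}))
cut_off = find_route(links, "A", "D", failed=frozenset({"B", "C"}))
```

With broker B marked as failed, the stream in this sketch is carried through C instead. When both intermediate brokers fail no route remains, which corresponds to the "if available" qualification in the text.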

4 Global-MMCS: Audio and Video Stream Services and Management Global-MMCS, as a service-oriented multimedia collaboration system, mainly processes multimedia streams: video, audio, whiteboards and so on. “Events” in video or audio are usually called video frames or audio samples. Generally speaking, there are a lot of similarities between multimedia streams and other data streams such as sensor data. All streaming data require significant Quality of Service (QoS) constraints and dynamic filtering. These are both particularly demanding and well-understood for multimedia streams for both communication and processing. Because of high bandwidth generated by raw multimedia bit-streams, complicated codecs must be used to compress the streams and transmit them over the Internet. Further, multimedia streams are typically used collaboratively and so stress the infrastructure needed to support the efficient software or hardware of multicasting required by the delivery of a given stream to multiple clients. Due to the diversity of collaboration clients supported by Global-MMCS, the services for multimedia streams need to adapt the streams to different clients. We note that many relevant web service specifications like those for reliable messaging and notification appear not well designed for scalable efficient multicast as needed by Global-MMCS. Thus we suggest that multimedia collaboration is an excellent proving ground for general streaming data grid infrastructure. Streaming Filters: A media service or filter is a functional entity, which can receive one or multiple media streams, perform some processing, and output one or multiple media streams. Each service is characterized by a set of input and output stream interfaces and a processing unit. According to the number of fan-in and fan-out filters, they can be divided into three categories: one-in-one-out filters, multiple-in-one out filters, and one-in-multiple-out. In addition, there is a final “sink” filter category. 
We discuss each of these below.

One-In-One-Out filters implement the basic transformation operation. For instance, a filter can receive as input a video stream in YUV4:1:1 format, resize it and deliver the modified video as output. Each filter provides a very basic adaptation on a stream in an intermediate format. Complex stream transformations can be built by combining several basic filters and creating a filtering workflow pipeline. Below are examples of one-in-one-out filters. Decoder/Encoder transcoder filters compress or uncompress the data into a chosen intermediate format (e.g. RGB24, YUV4:1:1, Linear Audio). Common codecs include H.261, H.263, MPEG1, MPEG2, MPEG4, H.264, and RealMedia. Transcoding generates a new stream which is encoded in the format wanted by the user. For example, if a RealPlayer user needs to receive a video encoded in H.261 RTP, a RealStream producer is needed to first decode the H.261 video and generate a new RealFormat stream. Image-scaling filters resize video frames, which is useful to adapt a stream for devices with limited display capacities. They are sometimes required to enable transcoding operations. For example, MPEG videos may be transmitted in any size, while H.261 videos require predefined sizes such as CIF, QCIF or SQCIF. Color-space-scaling filters reduce the number of entries in the color space, for example from 24 to 12 bits, gray-scale or black-and-white. Frame-rate filters can reduce the frame rate in a video stream to suit low-end clients such as PDAs. For example, we can discard B-frames or P-frames in a 24 fps MPEG-4 video stream to create a new stream with a lower frame rate.

Multiple-In-One-Out filters, also known as mixer filters, combine multiple streams. A video mixer can create a mixed video stream from several input sources. Each element of the resulting mixed video (typically displayed as a grid of images) results from an image-scaling adaptation of a particular stream. An audio mixer can create a mixed audio stream by summing up several input sources. Audio mixing is very important to those clients that cannot receive multiple RTP audio streams and mix them. A video mixing service improves visual collaboration, especially for limited clients that can only handle a single video stream. Multiplexors/Demultiplexors are used to aggregate or separate audio and video data in a multimedia stream. For instance, an MPEG multiplexor allows merging an MP3 audio and an MPEG-1 video into an MPEG2 stream. Multiplexing and demultiplexing are quite useful for guaranteeing stream synchronization in unpredictable network environments.

One-In-Multiple-Out filters, or duplicator filters, are used to replicate an output media stream. Duplication is useful when a stream has different targets with different requirements. In most cases, multiple simple media filters should be organized in a media filter chain. Filters can be as simple as bit-stream parsing, or as complicated as decoding and encoding. Composite media services are usually acyclic computation graphs consisting of multiple filter chains.

There is also another type of bit-stream service, called a sink service, which does not change the bits in the stream. Examples of sink services include buffering and replaying services. These can buffer real-time multimedia streams in memory caches or disk storage, and allow users to replay or fast-forward these streams through an RTSP session.
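The filter categories above can be illustrated with a minimal Python sketch. It is not the Global-MMCS implementation: the frame record (a dict with a "type" key), the function names, and the 16-bit sample range are assumptions made for illustration only. It shows a one-in-one-out frame-rate filter that drops B-frames, a multiple-in-one-out audio mixer that sums samples, a one-in-multiple-out duplicator, and a helper that composes one-in-one-out filters into a chain.

```python
import itertools


def frame_rate_filter(frames, keep_types=("I", "P")):
    """One-in-one-out: pass only frames whose (hypothetical) "type" is kept.

    With the default keep_types, B-frames are discarded, lowering the rate.
    """
    for frame in frames:
        if frame["type"] in keep_types:
            yield frame


def audio_mixer(streams, lo=-32768, hi=32767):
    """Multiple-in-one-out: sum time-aligned samples, clipped to 16-bit range."""
    for samples in zip(*streams):
        yield max(lo, min(hi, sum(samples)))


def duplicator(stream, copies):
    """One-in-multiple-out: replicate one stream into independent iterators."""
    return itertools.tee(stream, copies)


def chain(stream, *filters):
    """Compose one-in-one-out filters into a filter chain (pipeline)."""
    for apply_filter in filters:
        stream = apply_filter(stream)
    return stream
```

Because each filter is a generator, a chain processes one frame at a time and never needs the whole stream in memory, which matches the streaming setting described in the text.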
Sink filters can handle single or multiple streams. When multiple streams flow into a sink entity, all the streams can be synchronized and replayed. Based on such a composite sink service, an annotation service can be developed. Through annotation, users can attach text and image streams to the original video and audio stream to convey additional meaning in collaboration.

Global-MMCS Workflow Management: There is substantial literature on Grid and Service-based workflow [16]. Unlike many of these systems, Global-MMCS’s streaming workflow, especially conferencing workflow, is implicit and can be determined by the system at run time based on the sinks and sources specified in XGSP and their QoS. For example, when a PDA with limited network and processing capability wants to receive an H.261 encoded, 24 fps, CIF video stream, a customized workflow is needed to transcode the H.261 stream to a JPEG picture stream or a low-bitrate RealMedia stream. An intelligent workflow engine can easily build a filter chain automatically based on the format description of the source stream and the capability description of the PDA. Such an engine usually follows a graph search algorithm and tries to find a route from the graph node representing the format of the source stream to the destination node representing the format needed by the receiver. No user involvement is needed to define an explicit workflow. Furthermore, in order to minimize traffic and delay, most one-in-one-out filter chains should be confined to a single service container. One needs a distributed implementation to

Fig. 1. Workflow and filters in GlobalMMCS

orchestrate multiple-in and multiple-out filters for different clients. Therefore the key issue in Global-MMCS media service management is how to locate the best service container based on the streaming QoS requirements and how to share the service provider among the participants in XGSP sessions.

Session Management and NaradaBrokering Integration: As shown in Figure 1, NaradaBrokering can publish performance monitoring data in the form of XML on a topic to which the AV Session Server subscribes. From these performance data and broker network maps, the Session Server can estimate the delay and bandwidth between the service candidates and the requesting user. Based on the workload of the media service providers and the estimated performance metrics, the Session Server can find the best service provider and initiate a media service instance. Furthermore, the AV Session Server has to monitor the health of each media service instance. Through a specific NaradaBrokering topic, an active media service instance can publish status meta-data to notify the session server. If the instance fails to respond within a period of time, the AV Session Server restarts it, or locates a new service provider and starts a new instance. Note that the messaging infrastructure supports both TCP control and UDP media streams and their reliable delivery; the session can choose a separate QoS for each type of stream. Each session server may host a limited number of active XGSP AV sessions. The exact number depends upon the workload and the computational power of the machine. The session initiator first locates the right session provider to create a session service instance for a particular XGSP AV session. Then, this session server locates the necessary media service resources on demand. In the current implementation, a default audio mixer is created to handle all the audio in the session.
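As a rough illustration of how a session server might choose among service candidates, the following Python sketch filters candidates by estimated delay, bandwidth and workload, and then prefers the least loaded one. The `Candidate` fields, the thresholds and the tie-breaking rule are all assumptions for this example; the text does not specify the actual selection policy.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical view of one media service provider, as seen by the server."""
    name: str
    delay_ms: float        # estimated delay to the requesting user
    bandwidth_kbps: float  # estimated bandwidth to the requesting user
    load: float            # current workload, 0.0 (idle) to 1.0 (saturated)


def pick_provider(candidates, min_bandwidth_kbps, max_delay_ms, max_load=0.9):
    """Return the least loaded feasible candidate, or None if none qualifies."""
    feasible = [
        c for c in candidates
        if c.bandwidth_kbps >= min_bandwidth_kbps
        and c.delay_ms <= max_delay_ms
        and c.load <= max_load
    ]
    if not feasible:
        return None
    # Assumed policy: prefer low workload, break ties by lower delay.
    return min(feasible, key=lambda c: (c.load, c.delay_ms))
```

A real deployment would refresh these estimates from the published monitoring data before each decision; the sketch only captures the final comparison.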


Private audio mixers can be created on demand for private sessions supporting subgroups in the session. Further, multiple video mixers can be created by the session server at the request of the client. An image grabber (thumbnail) service is created when a new video stream is detected in the session. Further, customized transcoding services can be created when a user sends a request to access particular streams. For example, when a mobile client with limited processing power, such as a PDA connected over Wi-Fi, chooses a 24 fps 4-CIF MPEG-4 video, a transcoding pipeline consisting of a frame rate adapter, a video size down-sampler and a color transformation is needed to create this stream. As another example, when an H.323 terminal, which can only handle the H.261 and H.263 codecs, wants to display an MPEG-4 video, it asks the session server to start an MPEG-4-to-H.261 transcoder. Sink services like buffering, archiving and replaying services can also be initiated by real-time XGSP sessions. Buffering and archiving services store events in the distributed cache and file storage attached to NaradaBrokering overlay networks. Once stream data flow into these “sinks”, the replaying service can pull the data out of the sinks and send it to clients based on the RTSP request of the user. The events are accessed in an ordered fashion and resynchronized using their timestamps, which have been unified using the NaradaBrokering NTP service. The list, with time-stamps, of these archived and annotated streams is kept in the WS-Context dynamic meta-data service. Through the recording manager service, a component of the AV session server, users can choose streams to be buffered and archived. Through the replay and RTSP services, users can initiate RTSP sessions and replay those buffered streams. After the streams are buffered, users can add annotations to the streams and archive the new composite streams for later replay.
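The transcoding pipelines in these examples are the kind of chain that a workflow engine could derive by graph search over stream formats. The sketch below is a minimal breadth-first search under stated assumptions: the format labels, the filter names and the `FILTERS` catalog are hypothetical, and a real engine would also weigh QoS and container placement rather than only the number of filters.

```python
from collections import deque

# Hypothetical catalog of filters: (input format, output format, filter name).
FILTERS = [
    ("H.261/CIF/24fps", "YUV/CIF/24fps", "h261_decoder"),
    ("YUV/CIF/24fps", "YUV/QCIF/24fps", "image_scaler"),
    ("YUV/QCIF/24fps", "YUV/QCIF/5fps", "frame_rate_adapter"),
    ("YUV/QCIF/5fps", "JPEG/QCIF/5fps", "jpeg_encoder"),
    ("YUV/CIF/24fps", "RealMedia/CIF/24fps", "real_encoder"),
]


def find_filter_chain(source_format, target_format, filters=FILTERS):
    """Breadth-first search for a shortest filter chain between two formats.

    Returns the list of filter names, [] if no conversion is needed,
    or None if the target format is unreachable.
    """
    graph = {}
    for src, dst, name in filters:
        graph.setdefault(src, []).append((dst, name))

    queue = deque([(source_format, [])])
    seen = {source_format}
    while queue:
        fmt, chain_so_far = queue.popleft()
        if fmt == target_format:
            return chain_so_far
        for next_fmt, name in graph.get(fmt, []):
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, chain_so_far + [name]))
    return None
```

Breadth-first search returns a chain with the fewest filters; replacing the queue with a priority queue keyed on estimated delay or cost would be one way to take QoS into account.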

5 Supporting Real Time Sensor Grid Services The basic services needed to support audio-video collaboration, such as reliable delivery, multicasting and replay, can also be applied to problems in real-time delivery of sensor grid data. In Fig. 2, we depict our work to develop filters on live Global Positioning System data. OCRTN, RICRTN, and SDCRTN represent GPS networks for Orange, Riverside, and San Diego Counties in Southern California. These stations are maintained by the Scripps Orbit and Permanent Array Center (SOPAC) and are published to an RTD server, where they are made publicly available. Data is published from these stations in the binary RYO format. By connecting a sequence of filters, we convert and republish the data as ASCII and as Geography Markup Language (GML) formatted data. The data can be further subdivided into individual station position measurements. We are currently developing more sophisticated real-time data filters for data mining. Tools such as RDAHMM [17] may be used to detect state changes in archived GPS time signals. These may be associated with both seismic and aseismic causes. We are currently working to develop an RDAHMM filter that can be applied to real-time signals and link them in streaming fashion to the Open Geospatial Consortium standard services supporting integration of maps, features and sensors.


Fig. 2. NaradaBrokering may be used to support filters of real-time GPS data

6 Future Work

Conventional support of SOAP messages using the verbose “angle-bracket” representation is too slow for many applications. Thus we and others are researching [6, 18] a systematic use of “fast XML and SOAP”, where services negotiate the use of efficient representations for SOAP messages. All messages rigorously support the service WSDL and transport the SOAP Infoset, using the angle-bracket form in the initial negotiation but an efficient representation where possible for streamed data.

Another interesting area is structuring the system so that it can be implemented either with standalone services, message brokers and clients, or in a Peer-to-Peer mode. These two implementations have tradeoffs between performance and flexibility, and both are important. The core architecture “naturally” works in both modes, but the details are not trivial and require substantial further research.

References

1. Qiu, X.: Message-based MVC Architecture for Distributed and Desktop Applications. PhD thesis, Syracuse University (March 2, 2005)
2. Wu, W., Bulut, H., Uyar, A., and Fox, G.: Adapting H.323 Terminals in a Service-Oriented Collaboration System. In special "Internet Media" issue of IEEE Internet Computing, July-August 2005, Vol 9, No. 4, pages 43-50 (2005)
3. Fox, G., Ho, A., Pallickara, S., Pierce, M., and Wu, W.: Grids for the GiG and Real Time Simulations. Proceedings of Ninth IEEE International Symposium DS-RT 2005 on Distributed Simulation and Real Time Applications (2005)
4. Birman, K., Hillman, R., and Pleisch, S.: Building Network-Centric Military Applications Over Service Oriented Architectures. In proceedings of SPIE Conference Defense Transformation and Network-Centric Systems (2005) http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/GIGonWS_final.pdf
5. Fox, G., Pallickara, S., Pierce, M., and Gadgil, H.: Building Messaging Substrates for Web and Grid Applications. To be published in special issue on Scientific Applications of Grid Computing in Philosophical Transactions of the Royal Society of London (2005)
6. Oh, S., Bulut, H., Uyar, A., Wu, W., and Fox, G.: Optimized Communication using the SOAP Infoset For Mobile Multimedia Collaboration Applications. In proceedings of the International Symposium on Collaborative Technologies and Systems CTS05 (2005)
7. For discussion, see http://grids.ucs.indiana.edu/ptliupages/presentations/DoDGrids Aug25-05.ppt
8. Aktas, M. S., Fox, G., and Pierce, M.: An Architecture for Supporting Information in Dynamically Assembled Semantic Grids. Technical report (2005)
9. Pallickara, S. and Fox, G.: NaradaBrokering: A Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids. Proceedings of ACM/IFIP/USENIX International Middleware Conference Middleware (2003)
10. Pallickara, S. and Fox, G.: On the Matching Of Events in Distributed Brokering Systems. Proceedings of IEEE ITCC Conference on Information Technology, Volume II (2004) 68-76
11. Uyar, A. and Fox, G.: Investigating the Performance of Audio/Video Service Architecture I: Single Broker. Proceedings of the International Symposium on Collaborative Technologies and Systems CTS05 (2005)
12. Bulut, H., Pallickara, S., and Fox, G.: Implementing a NTP-Based Time Service within a Distributed Brokering System. ACM International Conference on the Principles and Practice of Programming in Java (2004) 126-134
13. Pallickara, S. and Fox, G.: A Scheme for Reliable Delivery of Events in Distributed Middleware Systems. In Proceedings of the IEEE International Conference on Autonomic Computing (2004)
14. Gadgil, H., Fox, G., Pallickara, S., Pierce, M., and Granat, R.: A Scripting based Architecture for Management of Streams and Services in Real-time Grid Applications. In Proceedings of the IEEE/ACM Cluster Computing and Grid 2005 Conference, CCGrid (2005)
15. Wu, W., Fox, G., Bulut, H., Uyar, A., and Altay, H.: Design and Implementation of a Collaboration Web-services system. Journal of Neural, Parallel & Scientific Computations, Volume 12 (2004)
16. Grid workflow is summarized in GGF10 Berlin meeting http://www.extreme.indiana.edu/groc/ggf10-ww/ with longer papers to appear in a special issue of Concurrency & Computation: Practice & Experience at http://www.cc-pe.net/iuhome/workflow2004index.html. See also Gannon, D. and Fox, G.: Workflow in Grid Systems.
17. Granat, R.: Regularized Deterministic Annealing EM for Hidden Markov Models. Doctoral Dissertation, University of California Los Angeles (2004)
18. Chiu, K., Govindaraju, M., and Bramley, R.: Investigating the Limits of SOAP Performance for Scientific Computing. Proc. of 11th IEEE International Symp. on High Performance Distributed Computing HPDC-11 (2002) 256

A QoS-Satisfied Interdomain Overlay Multicast Algorithm for Live Media Service Grid

Yuhui Zhao (1,2), Yuyan An (2), Cuirong Wang (3), and Yuan Gao (1)

(1) School of Information Science and Engineering, Northeastern University, Shenyang 110004, China [email protected]
(2) QinHuangDao Foreign Language Professional College, 066311, QinHuangDao, China
(3) Northeastern University at QinHuangDao, 066000, QinHuangDao, China

Abstract. We are developing a live media service Grid (LMSG) as an extensible middleware architecture for supporting live media applications in a grid computing environment. The LMSG service is provided by service broker nodes (SvBs), which are strategically deployed by Internet Service Providers (ISPs). This paper presents a QoS-satisfied interdomain overlay multicast algorithm (QIOM), which organizes the SvBs into a QoS-aware multicast service tree and is the most important part of the LMSG middleware. QIOM is an overlay-based multicast connection solution for relaying live media across different ASes, and it can overcome the difficulties of the Border Gateway Multicast Protocol. The simulation results show that the QIOM algorithm can effectively find and provide QoS-assured overlay services and balance the overlay traffic burden among the SvBs.

1 Introduction

Distributed collaboration based on live media is an important class of Internet applications. Many educational and research institutions have begun to webcast lectures and seminars over the Internet, and distributed collaboration systems such as the Access Grid (AG) are increasingly popular. Although there are many webcast and distributed collaboration products [1,2], several problems remain unsolved. First, they lack a feasible live media delivery protocol that is QoS-aware, scalable, and suited to synchronous group communication. Second, there is no efficient multicast model that lets ISPs benefit from providing the service. Third, applications that manage a production environment are difficult to build because detailed knowledge of the environment is required. Fourth, applications are often tightly coupled with the environment and cannot adapt well to changes. Furthermore, an application written for one environment cannot be easily reused in another environment.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 13–24, 2005. © Springer-Verlag Berlin Heidelberg 2005


We are developing the live media service Grid (LMSG) to address the above challenges. LMSG aims to bring together large communities of users across networks and to enable them to work together closely and successfully. The LMSG service is provided by service broker nodes, which are strategically deployed by ISPs and run the LMSG middleware. The LMSG middleware is the key technology for supporting live media applications in a grid computing environment. It supports interactive applications composed from atomic service components according to the user’s dynamic service requirements. LMSG provides a layer of abstraction between the audio/video environments and the applications, offering abstractions and an API for managing entities in the environment without revealing the underlying details.

In this paper, we mainly discuss a QoS-satisfied Interdomain Overlay Multicast Optimizing Algorithm (QIOM), which organizes the SvBs into a QoS-aware multicast service tree and is the most important part of the LMSG middleware. It is an overlay-based multicast connection solution for relaying live media across different ASes, and it can overcome the difficulties of the Border Gateway Multicast Protocol at the IP level. QIOM provides a novel probing scheme to achieve QoS-aware multicast service in a scalable and efficient fashion. QIOM has three main features. First, it provides multi-constrained statistical QoS assurances [3] for distributed live media delivery services. Second, it achieves good load balancing among the SvBs, which improves overall resource utilization and reduces the traffic cost among the ASes for the ISP. Third, it makes interdomain multicast feasible, which is the basis of synchronous interdomain group communication.

The rest of the paper is organized as follows. Related work is discussed in Section 2. In Section 3, we introduce the Live Media Service Grid. The basic idea of the QoS-satisfied interdomain overlay multicast algorithm (QIOM) is described in Section 4. A series of simulations and their results are discussed in Section 5. Finally, we draw conclusions in Section 6.

2 Related Work

Recently, various overlay networks have been proposed, such as peer-to-peer data lookup overlay networks [4, 5, 6, 7], locality-aware overlay networks [8], application-level multicast networks [9, 10], resilient overlay networks [11], and the Internet Indirection Infrastructure [12]. The major difference between LMSG and the above work is that LMSG focuses on integrated collaborative live media service delivery rather than merely data transport or lookup. In doing so, LMSG tackles the challenge of scalable decentralized resource and QoS management while instantiating the composed service. The Opus project [13] proposed an overlay utility service that performs wide-area resource mapping based on the application’s performance and availability requirements. Unlike Opus, which uses hierarchy, aggregation, and approximation for tracking global states, LMSG proposes on-demand state collection, which is more suitable for a flat overlay structure and QoS-aware integrated service delivery. LMSG complements previous work by integrating interdomain resource management into dynamic cooperated services.

At the IP layer, protocols such as BGMP [14] and MASC [15] are still being investigated and tested for interdomain multicast. The main objections to them include protocol complexity, which makes implementation difficult and costly, the limited multicast address space, and additional shortcomings such as weak support for QoS, security and billing. In contrast to IP multicast, QIOM is a feasible live media delivery protocol that is QoS-aware, scalable, and suited to synchronous group communication. At the same time, by deploying service brokers it provides an efficient multicast model that lets ISPs benefit from providing the service. Through its probing scheme, it can achieve QoS-aware multicast service in a scalable and efficient fashion.

3 The Live Media Service Grid (LMSG)

As more and more large-scale multimedia applications emerge in the Internet, ISPs need a multicast service with assured quality of service (QoS) in order to save resources and provide better service. To provide the Live Media Service Grid (LMSG), ISPs need to build an overlay service network by deploying service broker nodes. One or more SvBs are deployed in each Autonomous System (AS); the number depends on the needs of live media communication. This is different from the IP multicast protocol. The SvBs run the LMSG middleware components, which provide efficient functions for QoS management, live media organization and network accounting.

There are three tiers in the LMSG middleware framework. The top tier is the Abstract Service Interface Layer. It is composed of a manager and a set of agents, and is the interface to the application software. The second tier is the Transaction Service Layer. It is composed of function entities, including components for Authentication, Registration, Resource Discovery, Scheduling, Logging, audio, video, etc. The substrate tier is the Overlay Multicast Layer, which sits above the TCP/IP protocol; SvBs use its components to build an overlay network and to carry out QoS-aware live media delivery. Figure 1 shows the tiers of LMSG.

The Abstract Service Interface Layer is in charge of the user’s service requirements. The user can specify a service request using a function agent (FA) and a QoS requirement vector ($Q_R$). The FA specifies the required service functions ($F_i$) and the inter-service dependency and commutative relations. We use $Q^{target} = [\langle C^{q_1}, P^{q_1} \rangle, \ldots, \langle C^{q_m}, P^{q_m} \rangle]$ to define the user’s QoS requirements for the integrated service, where $\langle C^{q_i}, P^{q_i} \rangle$ specifies the bound $C^{q_i}$ and the satisfaction probability $P^{q_i}$ for the metric $q_i$, which represents a QoS metric such as delay or loss rate. Users can either specify the integrated service request directly in extensible markup language (XML) or use visual specification tools [11].

The Transaction Service Layer consists of distributed live media services that are dynamically composed from existing service components. A live media service component ($s_i$) is a self-contained multimedia application unit providing certain functionality. Each service component has several input and output ports for receiving input messages and sending output messages, respectively. Each input port is associated with a message queue for asynchronous communication between service components. The input quality requirements of the service component, such as media format and frame rate, are denoted by $Q^{in} = [q_1^{in}, \ldots, q_d^{in}]$; the output quality properties of the service component are denoted by $Q^{out} = [q_1^{out}, \ldots, q_d^{out}]$. We describe an aggregate distributed live media service using a Directed Acyclic Graph called a ServStreaming ($\lambda$). The nodes in the ServStreaming represent the service components, and the links in the ServStreaming represent application-level connections called service links. Each service link is mapped to an overlay path by the Overlay Multicast Layer.

The Overlay Multicast Layer consists of the service brokers, which are distributed overlay nodes ($v_i$) connected by application-level connections called overlay links ($l_j$). The overlay network topology can be formed by connecting each SvB node with a number of other nodes, called neighbors, via overlay links. To support large-scale live media communication and enhance scalability, the SvBs have a hierarchical architecture. Contiguous SvBs are grouped to form clusters, which in turn are grouped together to form super-clusters. An end host can form a cluster with other end hosts close by. Within the cluster, overlay multicast is used for efficient data delivery among the limited number of end users. The user clusters connect to the LMSG through the edge SvBs. The LMSG middleware is the essential software on both the SvBs and the user end hosts. Each SvB can provide one or more multimedia service components.

Fig. 1. The tiers of Live Media Service Grid

Each node $v_i$ is associated with a statistical resource availability vector $[r_1^{v_i}, \ldots, r_n^{v_i}]$, where $r_k^{v_i}$ is a random variable describing the statistical availability of the k-th end-system resource type (e.g., CPU, memory, disk storage). Each node $v_i$ also maintains the statistical bandwidth availability $bw^{l_j}$ for each of its adjacent overlay links $l_j$. For scalability, each node maintains the above histograms locally; they are not disseminated to other overlay nodes.
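To make the statistical QoS requirement concrete, the sketch below checks a set of locally observed samples against a target of (bound, satisfaction probability) pairs, one per metric. It is only an illustration under stated assumptions: the empirical fraction of samples within the bound stands in for the satisfaction probability, and the metric names and dictionary layout are hypothetical.

```python
def satisfaction_probability(samples, bound):
    """Empirical fraction of observed samples that do not exceed the bound."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if s <= bound) / len(samples)


def meets_target(observations, q_target):
    """Check every metric against its (bound C, required probability P) pair.

    observations: {metric name: list of observed values}   (assumed layout)
    q_target:     {metric name: (bound, probability)}       (assumed layout)
    """
    return all(
        satisfaction_probability(observations[metric], bound) >= probability
        for metric, (bound, probability) in q_target.items()
    )
```

For example, a target of `{"delay_ms": (100, 0.8)}` is met when at least 80% of the observed delays are at most 100 ms.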

4 QoS-Satisfied Interdomain Overlay Multicast Algorithm

4.1 Problem Description

The function agent (FA) and the QoS requirement vector ($Q_R$) in the Abstract Service Interface Layer map to an overlay weighted directed acyclic graph $G' = (V', E')$ in QIOM. To formalize the QoS constraints for the overlay network between the ASes, we define functions for performance measurements among the SvBs. The link delay function is $Delay: E' \to R^+$, where $Delay(l)$ denotes the delay of the packets passing over the overlay link $l$, $l \in E'$. The node performance function is $Perfo: CPU \to 100\%$, where $Perfo(SvB)$ denotes the OccupyTime of the CPU of the SvB, $SvB \in V'$.

To control the number of spawned probes, each probe carries a probing budget ($\beta$) that defines how many probes can be used for a composition request. The probing budget represents the trade-off between probing overhead and composition optimality. A larger probing budget allows us to examine more candidate ServStreamings, and thus to find a better qualified ServStreaming.

Let $M \subseteq V'$ be a set of nodes involved in a group communication. $M$ is called a multicast group, and each node $V_M \in M$ is a group member. Packets originating from a source node $V_s$ have to be delivered to the set of receiver nodes $M - \{V_s\}$. A multicast tree $T(V_M, E_M)$ is a subgraph of $G'$ that is centered at a node $c_i$, $c_i \in G'$, and spans all the nodes in $M$. The path from a source node $V_s$ to a receiver node $V_d$ in the tree $T$ is denoted by $P_T(V_s, V_d)$, where $V_s, V_d \in M$. The multicast tree should meet the following conditions:

$$\beta(SvB_i) \le \gamma, \ \text{and} \ Perfo(SvB_i) \le \tau, \quad \forall\, SvB_i \in M \ \&\ SvB_i \in D_x \qquad (1)$$

$$\min_{SvB_i \in M} \big( Perfo(SvB_1), Perfo(SvB_2), \ldots, Perfo(SvB_j) \big) \qquad (2)$$

$$\min_{l_j \in P_T} \Big( \sum_{j=1} Perfo(SvB_{l_j,i}), \ldots, \sum_{j=m} Perfo(SvB_{l_j,i}) \Big), \quad \forall\, i, j, m \in N \qquad (3)$$

$$\sum_{l \in P_T(V_s, V_d)} delay(l) \le \Delta, \quad \forall\, V_s, V_d \in M \qquad (4)$$

$$\Big| \sum_{l \in P_T(u,v)} delay(l) - \sum_{l \in P_T(x,y)} delay(l) \Big| \le \delta, \quad \forall\, u, v, x, y \in M \qquad (5)$$

where $\gamma$ is the maximum probing budget, $\tau$ is the maximum occupy time of the SvB’s CPU, $\Delta$ is the maximum delay, and $\delta$ is the maximum delay variation between any two different paths.

To quantify the load balancing property of an instantiated ServStreaming, we further define a resource cost aggregation metric, denoted by $\psi^{\lambda}$, which is the weighted sum of the ratios between the resource requirements of the SvB links and the resource availabilities on the overlay paths. We use $C^{s_i}_{r_k}$ and $P^{s_i}_{r_k}$ to represent the resource requirement threshold and satisfaction probability of the service component $s_i$ for the k-th end-system resource type (e.g., CPU, memory, disk storage), respectively. Similarly, we use $C^{l_i}_{bw}$ and $P^{l_i}_{bw}$ to denote the required threshold and satisfaction probability for the network bandwidth on the service link $l_i$, respectively. The resource requirements of a service component depend on its implementation and the current workload. In contrast to a conventional data routing path, the resource requirements along a ServStreaming are not uniform, because the service functionalities on the ServStreaming are not uniform. Different service components can have different resource requirements due to heterogeneous functions and implementations. The bandwidth requirements also vary among service links, since the value-added service instances can change the original media content (e.g., image scaling, color filter, information embedding). We use $M^{v_j}_{r_k}$ to denote the mean availability of the k-th end-system resource type on the overlay node $v_j$. We use $M^{\phi_i}_{bw}$ to denote the mean availability of the bandwidth on the overlay path $\phi_i$, which is defined as the minimum mean available bandwidth among all overlay links $e_i \in \phi_i$. The mean values can be calculated from the p.d.f.’s of the corresponding statistical metrics. Hence, the resource cost aggregation metric $\psi^{\lambda}$ is defined as follows:

$$\psi^{\lambda} = \sum_{s_i, v_j \in \lambda} \sum_{k=1}^{n} w_k \frac{C^{s_i}_{r_k}}{M^{v_j}_{r_k}} + w_{n+1} \sum_{l_i, \phi_i \in \lambda} \frac{C^{l_i}_{bw}}{M^{\phi_i}_{bw}}, \quad \text{where } \sum_{k=1}^{n+1} w_k = 1, \ 0 \le w_k \le 1 \qquad (6)$$
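As a worked illustration of the resource cost aggregation metric, the following Python sketch evaluates the weighted sum of requirement-to-availability ratios for one candidate ServStreaming. The argument layout is an assumption made for this example: each service component is given as a pair of per-resource lists (requirement thresholds, mean availabilities on its hosting node), each service link as a pair (required bandwidth, mean available bandwidth on its overlay path), and the last weight is taken to be the bandwidth weight.

```python
def resource_cost(components, links, weights):
    """Weighted sum of requirement/availability ratios for one ServStreaming.

    components: list of (required, available) pairs; each is a list of n
                per-resource values for a component and its hosting node.
    links:      list of (required_bw, mean_available_bw) pairs.
    weights:    n + 1 weights in [0, 1] summing to 1; the last one is the
                bandwidth weight (assumed ordering).
    """
    if abs(sum(weights) - 1.0) > 1e-9 or any(not 0 <= w <= 1 for w in weights):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    *resource_weights, bandwidth_weight = weights

    node_term = sum(
        w * required_k / available_k
        for required, available in components
        for w, required_k, available_k in zip(resource_weights, required, available)
    )
    link_term = bandwidth_weight * sum(req / avail for req, avail in links)
    return node_term + link_term
```

With one component needing `[2, 4]` units on a node offering `[4, 8]`, one link needing 10 of 40 bandwidth units, and weights `[0.25, 0.25, 0.5]`, the cost is 0.25 * 0.5 + 0.25 * 0.5 + 0.5 * 0.25 = 0.375. A candidate with a smaller value leaves more headroom.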

The weights $w_k$, $1 \le k \le n + 1$, represent the importance of the different resource types. We can customize $\psi^{\lambda}$ by assigning higher weights to more critical resource types. A ServStreaming with a smaller resource cost aggregation value has a better load balancing property, because the resource availabilities exceed the resource requirements by a larger margin.

4.2 Algorithm Suggestion

LMSG executes a QoS-satisfied Interdomain Overlay Multicast Algorithm (QIOM) to perform live media service. Given an aggregated service request, the source node invokes the QIOM algorithm, which includes major steps as following. Step 1. The Initialization of Service Broker Probing The source SvB first generates a QoS-requested probing message, called probe. The probe carries the information of Function Agent and the user’s resource requirements. The probe can spawn new probes in order to concurrently examine multiple next-hop choices. If some terminal in ASx belongs to a multicast group, then QIOM can select one or more adjacent SvBs as the multicast relay node, and the selecting SvB need to satisfy Expressions 1, which means only to select the free SvB as a relay node. Step 2. Search for Candidate ServStreamings The step’s goal is to collect needed information and perform intelligent parallel searching of multiple candidate ServStreamings. We adopt the Probing ServStreamings to finish this.

A QoS-Satisfied Interdomain Overlay Multicast Algorithm for LMSG

19

Step 2.1: SvB QoS check and allocation. When an SvB receives a probe, it first checks, using Expressions 1 and 4, whether the QoS and resource values of the probed ServStreaming already violate the user's requirements. If the accumulated QoS and resource values violate the requirements, the probe is dropped immediately. Otherwise, the SvB temporarily allocates the required resources to the expected application session. The allocation is cancelled after a timeout period if the SvB does not receive a confirmation message. Thus, we can guarantee that the probed resources are still available at the end of the probing process.

Step 2.2: Derive next-hop node. The SvB then derives the next-hop service functions according to the dependency and commutative relations in the Function Agent. All functions that depend on the current function are considered next-hop functions. For each next-hop function Fk derived in this way, if there is an exchange link between Fk and Fl, then Fl is also a possible next-hop function. The probing budget is distributed proportionally among the next-hop functions.

Step 2.3: Check QoS consistency. Based on the service discovery results, the service node performs a QoS consistency check between the current SvB and each next-hop candidate SvB. The check covers two aspects: (1) the consistency between the output QoS parameters Qout of the current service component and the input QoS parameters Qin of the next-hop service component; and (2) the compatibility between the adaptation policies of the two connected SvBs. Unlike the IP layer, where all routers provide a uniform data forwarding service, the SvBs in the service overlay can provide different live media services, which makes a QoS consistency check between two connected service components necessary. The check can use Expression 6.

Step 3. Optimal selection and setup of the multicast tree service session. The destination collects the probes for a request within a timeout period. It then selects the best qualified ServStreaming based on the resource and QoS states collected by the probes. The destination sends an acknowledgement message along the reverse of the selected ServStreaming to confirm resource allocations and initialize service components at each intermediate SvB. The application sender then starts to send the application data stream along the selected ServStreaming. If no qualified ServStreaming is found, the destination returns a failure message to the source directly. The tree is built as follows.

Step 3.1: The selected SvBs compose the multicast group.

Step 3.2: Taking each node SvBi in the multicast group as the center, create a multicast tree T using Dijkstra's algorithm and compute its delay. If the delay satisfies Expression 4, compute the delay variation. If the delay variation satisfies Expression 5, add the node to the set of center nodes.

Step 3.3: Compute in turn the cost order of each related multicast tree lj using Expression 3.


Y. Zhao et al.

Step 3.4: For the set of center nodes, compute in turn the order of each candidate center node using Expression 2.

Step 3.5: Select the freest SvB as the center and its related tree as the current multicast tree; the others are kept as candidates.
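Steps 3.1 to 3.5 can be pictured with the minimal sketch below. Expressions 2 to 5 are defined earlier in the paper and are not reproduced here, so every concrete rule in the sketch is an assumption made for illustration: `max_delay` stands in for the delay bound of Expression 4, `max_variation` (largest minus smallest center-to-member delay) for Expression 5, the sum of delays for the tree cost, and a hypothetical per-node `load` value for the "freest SvB" ordering. The function names are likewise hypothetical.

```python
import heapq


def shortest_delays(graph, source):
    """Dijkstra: smallest delay from source to every reachable node.

    graph maps node -> {neighbor: link delay}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def rank_centers(graph, group, load, max_delay, max_variation):
    """Return feasible center nodes, least loaded first (assumed ordering)."""
    feasible = []
    for center in group:
        dist = shortest_delays(graph, center)
        delays = [dist.get(m, float("inf")) for m in group if m != center]
        if not delays:
            continue
        if max(delays) > max_delay:                   # stand-in for Expression 4
            continue
        if max(delays) - min(delays) > max_variation:  # stand-in for Expression 5
            continue
        cost = sum(delays)                             # assumed tree cost
        feasible.append((load[center], cost, center))
    feasible.sort()
    return [center for _, _, center in feasible]


graph = {
    "A": {"B": 1, "C": 3},
    "B": {"A": 1, "C": 1},
    "C": {"A": 3, "B": 1},
}
load = {"A": 0.5, "B": 0.2, "C": 0.9}
ranked = rank_centers(graph, ["A", "B", "C"], load, max_delay=2, max_variation=1)
```

The first entry of `ranked` plays the role of the current center, and the remaining entries are the candidates kept in reserve.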

5 Performance Evaluations

5.1 Simulation Setup

We have implemented a session-level event-driven simulator to evaluate the performance of QIOM. The network topologies are generated with the Georgia Tech Internetwork Topology Models (GT-ITM) [16]. These topologies have 50 transit domain routers and 500-2000 stub domain routers. Each end host is attached to a stub router chosen uniformly at random. To test the scalability of different schemes, we focus on large group sizes and vary the number of members in each group from 200 to 1000. When simulating the algorithms, the SvBs form a hierarchical topology. Each cluster in the hierarchy has an average of ten members, including subclusters or SvBs.

The dynamics of the overlay multicasting are modeled as follows. Overlay multicasting requests arrive at a random accSvB node according to a Poisson process with rate η. The destination domain is chosen at random. The holding time of an overlay session is exponentially distributed with a mean of 2 min. Similar to [17], the offered load of the overlay routing requests is defined as ρ = (η ∗ h/u ∗ (Li)), where h is the mean number of overlay path hops (the number of SvBs in the path) and u is the sum of the overlay link capacities in the corresponding overlay topology. During the simulation, we vary the value of ρ to test QIOM's performance under different offered loads.

The bandwidths of the physical links are randomly selected between 40 and 280 units with a delay of 2 ms, while the SvBs' capacities are uniformly set to 900 units. The non-overlay traffic occupies around 50% of each physical link's capacity and varies its volume by ±20% every 500 ms. The SvBs exchange their state information every 1000 ms. We assume that the error of the available bandwidth measurement is within ±10%. For each overlay routing request, we use QIOM to set up an overlay path connecting the source SvB and the destination SvB.

5.2 Simulation Results and Discussions

QoS-Satisfaction Ratio (QSR): Because of the unbalanced distribution of Internet traffic, in many situations the shortest-path-based routing protocol cannot provide a QoS-satisfied path connecting the source and destination domains. To quantify this factor, QSR is defined as

$$\mathrm{QSR} = \frac{\text{Number of QoS-satisfied overlay paths}}{\text{Number of overlay request paths}} \tag{7}$$
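As a small illustration of Equation (7), the sketch below counts QoS-satisfied paths over a batch of requests. The paper does not spell out the per-path test here, so the criterion used (available bandwidth at least `required_bw` and delay at most `max_delay`) and the function name are assumptions.

```python
def qos_satisfaction_ratio(paths, required_bw, max_delay):
    """QSR per Equation (7).

    paths holds one entry per overlay request: a (available_bw, delay)
    pair for the path that was set up, or None if no path was found.
    The bandwidth/delay test is an assumed notion of "QoS-satisfied".
    """
    if not paths:
        return 0.0
    satisfied = sum(
        1
        for p in paths
        if p is not None and p[0] >= required_bw and p[1] <= max_delay
    )
    return satisfied / len(paths)


requests = [(100, 10), (50, 10), None, (120, 30)]
qsr = qos_satisfaction_ratio(requests, required_bw=80, max_delay=20)
```

Only the first request meets both bounds in this example, so one of four request paths counts as satisfied.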


The results obtained for QIOM are compared with those of the shortest-path routing (SPR) algorithm, where the shortest path is taken in the overlay network, not in the IP layer. Fig. 2 shows the QSR of QIOM compared with that of SPR. The figure shows that QIOM greatly improves the QSR. In addition to finding QoS-satisfying overlay paths, QIOM also helps in finding paths that are not significantly affected by the non-overlay traffic.

Fig. 2. QoS-Satisfaction Ratio and Offered Load comparison

Multicast tree cost: Multicast tree cost is measured by the number of links in a multicast distribution tree. It quantifies the efficiency of multicast routing schemes. Application-level multicast trees and unicast paths may traverse an underlying link more than once, and thus they usually have a higher cost than IP multicast trees. In Fig. 3, we plot the average tree cost of QIOM, NICE [18], and IP BGMP as the group size increases from 100 to 900. As a reference, we also include the total link cost for unicast. Compared with the cost of unicast paths, NICE trees reduce the cost by 35%, QIOM trees reduce the cost by approximately 45-55%, and IP multicast BGMP trees reduce the cost by 68-80%. Clearly, the performance of QIOM is comparable to that of IP multicast. In addition, QIOM outperforms NICE in all cases, and the difference grows as the group size increases.

Fig. 3. The relationship of the Group size and tree cost

Average link stress: Link stress is defined as the number of identical data packets delivered over each link. Fig. 4 shows the average link stress as the group size varies. IP multicast trees have the least link stress: since no duplicate packets are transmitted on the same link, only a single copy of a data packet is sent over each link, and IP multicast maintains unit stress. QIOM trees exhibit an average link stress between 1.2 and 1.6, whereas the average link stress of NICE trees is always higher than 2.00. For both QIOM and NICE, the link stress does not vary greatly with group size. Unicast, however, is not as scalable as QIOM and NICE, since its link stress keeps increasing as the group size grows.

Fig. 4. The Group Size and average link stress
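The two metrics just defined, tree cost and link stress, can be computed from a mapping of overlay edges onto physical links. The sketch below is only an illustration: the data layout (`underlay_path` as a dictionary from overlay edge to a list of physical links) and the function names are assumptions, and links are treated as undirected.

```python
from collections import Counter


def link_loads(overlay_edges, underlay_path):
    """Count how many copies of a packet cross each physical link.

    overlay_edges: overlay tree edges as (parent, child) pairs.
    underlay_path: maps each overlay edge to the physical links it uses.
    """
    loads = Counter()
    for edge in overlay_edges:
        for link in underlay_path[edge]:
            loads[frozenset(link)] += 1  # undirected physical link
    return loads


def tree_cost(loads):
    """Number of link traversals, counting repeated links each time."""
    return sum(loads.values())


def average_link_stress(loads):
    """Mean number of identical packets per physical link in use."""
    return sum(loads.values()) / len(loads) if loads else 0.0


overlay_edges = [("S", "A"), ("S", "B")]
underlay_path = {
    ("S", "A"): [("S", "r1"), ("r1", "A")],
    ("S", "B"): [("S", "r1"), ("r1", "B")],
}
loads = link_loads(overlay_edges, underlay_path)
cost = tree_cost(loads)
stress = average_link_stress(loads)
```

In this toy topology both overlay edges share the physical link between S and r1, which is why the average stress exceeds the unit stress of IP multicast.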

Average path length: Path length is the number of links on the path from the source to a member. Unicast and shortest-path multicast schemes are usually optimized for this metric and thus have the smallest path lengths. In the simulation experiments, end hosts join the multicast group during an interval of 200 seconds. The results for average path length are shown in Fig. 5. As expected, IP multicast has the shortest end-to-end paths. Additionally, the path lengths of QIOM trees are shorter than those of NICE trees on average. For instance, at a group size of 1000, the average path lengths of QIOM and NICE trees are approximately 20 and 24, respectively.

Fig. 5. The relationship of Group Size and average Path Length

6 Conclusions and Future Works

As an extensible middleware architecture, LMSG aims to support emerging large-scale live media applications and to build an overlay multicast service network for ISPs. The LMSG service is provided by service broker nodes that are strategically deployed by ISPs. QIOM is the key to organizing the SvBs into a QoS-aware multicast service tree and is the most important part of the LMSG middleware. It is an overlay-based multicast connection solution for live media relay across different ASes and can overcome the difficulties of the Border Gateway Multicast Protocol. The simulation results show that the QIOM algorithm can effectively find and provide QoS-assured overlay services and balance the overlay traffic burden among the SvBs and the overlay links for live media service. Our work is at an early stage: only limited simulation has been done, and the proposal still needs to be tested in a distributed collaboration environment. Further optimization for additional QoS problems in the search for multicast trees is left to future work.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant No. 60273078 and the Doctor Fund of Hebei under grant No. 05547010D-3.

References

1. Machnicki, E., and Rowe, L.A.: Virtual director: Automating a webcast. In Proceedings of the SPIE Multimedia Computing and Networking 2002, Vol. 4673, San Jose, CA, January 2002
2. Perry, M., and Agarwal, D.: Remote control for videoconferencing. In Proceedings of the 11th International Conference of the Information Resources Management Association, Anchorage, AK, May 2000
3. Knightly, E., and Shroff, N.B.: Admission Control for Statistical QoS: Theory and Practice. IEEE Network, March 1999, 13(2):20-29
4. Rowstron, A., and Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. of IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), November 2001
5. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. Proc. of ACM SIGCOMM 2001, San Diego, California, August 2001
6. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content addressable network. Proc. of the ACM SIGCOMM 2001, San Diego, CA, August 2001
7. Kubiatowicz, D., Zhao, B.Y., and Joseph, A.D.: Tapestry: An infrastructure for fault tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, U.C. Berkeley, April 2001
8. Harvey, N.J., Jones, M.B., Saroiu, S., Theimer, M., and Wolman, A.: SkipNet: A Scalable Overlay Network with Practical Locality Properties. Proc. of the Fourth USENIX Symposium on Internet Technologies and Systems (USITS '03), Seattle, WA, March 2003
9. Chu, Y., Rao, S.G., and Zhang, H.: A Case For End System Multicast. In Proc. of ACM SIGMETRICS, Santa Clara, CA, June 2000
10. Castro, M., Druschel, P., Kermarrec, A.M., and Rowstron, A.: SCRIBE: A large-scale and decentralised application-level multicast infrastructure. IEEE Journal on Selected Areas in Communication (JSAC), Vol. 20, No. 8, October 2000
11. Andersen, D., Balakrishnan, H., Kaashoek, F., Morris, R.: Resilient Overlay Networks. In Proc. 18th ACM SOSP 2001, Banff, Canada, October 2001
12. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., and Surana, S.: Internet Indirection Infrastructure. Proc. of ACM SIGCOMM 2002, August 2002
13. Braynard, R., Kostic, D., Rodriguez, A., Chase, J., and Vahdat, A.: Opus: an Overlay Peer Utility Service. Proc. of International Conference on Open Architectures and Network Programming (OPENARCH), June 2002
14. Thaler, D.: RFC 3913 - Border Gateway Multicast Protocol (BGMP): Protocol Specification, September 2004
15. Radoslavov, P., Estrin, D., Govindan, R., Handley, M., Kumar, S., Thaler, D.: RFC 2909 - The Multicast Address-Set Claim (MASC) Protocol, September 2000
16. GT-ITM: Modeling Topology of Large Internetworks [Online]. Available: http://www.cc.gatech.edu/projects/gtitm/
17. Shaikh, A., Rexford, J., and Shin, K.: Evaluating the overheads of source directed quality-of-service routing. Proc. 6th IEEE ICNP, Oct. 1998, pp. 42-51
18. Banerjee, S., Bhattacharjee, B., and Kommareddy, C.: Scalable Application Layer Multicast. Proc. of ACM SIGCOMM '02, 2002

Automated Immunization Against Denial-of-Service Attacks Featuring Stochastic Packet Inspection

Jongho Kim, Jaeik Cho, and Jongsub Moon

Center for Information Security Technologies (CIST), Korea University
{sky45k, chojaeik, jsmoon}@korea.ac.kr

Abstract. Denial of Service attacks are easy to implement, difficult to trace, and inflict serious damage on target networks in a short amount of time. This model eliminates attack packets from a router using probabilistic packet inspection as an automated defense against DoS. The detection module begins with an initial probability for inspecting packets. As an attack commences and the occupied bandwidth of the channel increases, the detection module optimizes the inspection probability.

1 Introduction

In recent years, information security has become an increasingly important aspect of network expansion. While security solutions such as IDS, firewalls, VPN, and ESM have been employed to protect these infrastructures, many problems remain unsolved. Prominent among the remaining threats are Denial of Service attacks. Denial of Service attacks are easy to implement, difficult to trace, and inflict serious damage on target networks in a short amount of time. Over half of DoS attacks escalate within ten minutes [1], making it difficult for administrators to respond using manual defense mechanisms. Hence, there is a need to research automated defense systems that are capable of identifying and responding to these threats.

One option is to monitor all packets; however, this method is time consuming and inefficient. First, DoS attacks do not occur frequently, and second, there is a risk that the filtering process will drop valid packets unnecessarily. Furthermore, the network equipment used to detect and filter attack packets could easily be overloaded and shut down in the case of a real attack.

In this paper we propose a more efficient automated defense model. This model removes attack packets from a router using a stochastic packet inspection method as an automated defense against DoS. The detection module begins with an initial probability for inspecting packets; as an attack escalates and the occupied bandwidth of the channel increases, the detection module optimizes the inspection probability to balance network safety and quality of service.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 25 – 34, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Denial of Service Attacks

2.1 DoS Attack Trends

Traditional Denial of Service attacks, such as Ping of Death, exploited systems' inability to process irregular packets. Security patches were distributed that protected operating systems from these threats. Accordingly, Denial of Service attacks relying on single-packet morphologies have become less and less common. In February 2000, a Distributed Denial of Service (DDoS) [2] scheme attacked Yahoo, eBay, and others using a system resource consumption method. This attack abused the fact that systems commit a great deal of resources to performing 3-way handshakes, reassembling fragmented packets, sustaining half-open connections, and sending fragmented-packet exception messages. In 2002, a new type of attack appeared: Distributed Reflective Denial of Service (DRDoS) [3]. This technique remedies weaknesses found in prior DoS attacks, offers more options and control to attackers, and neutralizes existing techniques for defending against DoS.

2.2 Existing Research on Defense Against Denial of Service Attacks

Approaches for defending against Denial of Service attacks fall into two categories: passive, such as attack detection [4,5], and active, which includes tools such as IP trace-back, traffic filtering, and packet filtering [11]. Because attack detection does not offer any defense in and of itself, this technique must be used in conjunction with attack response mechanisms. Likewise, IP trace-back methods such as PPM and ICMP trace-back [6,7], which search out and locate an attacker's position, are generally employed after an attack; they are more effective as forensic tools than as network defense mechanisms. Filtering is another option; however, filtering every packet can cause the detection modules to overload and malfunction. Filtering based on traffic frequency analysis is an efficient isolation technique, but in the case of ingress filtering [8] it is applied only to certain packet fields. Source IP analysis, for example, is an option; however, it is problematic against attacks such as DRDoS, which use normal TCP servers to reflect and propagate attacks. Approaches such as ACC [9] and hop-count filtering [10] are effective, but only when they can establish the path of routers traversed during the attack. A truly effective defense against Denial of Service attacks must consist of an automated system that both systematically detects and eliminates attack packets and is itself resistant to Denial of Service attacks.

3 Proposed Model

Our model removes attack packets from a router using stochastic packet inspection as an automated defense against DoS attacks. As long as no attack packets are detected, the detection module continues to inspect with a low initial probability. Once an attack packet is detected and the occupied bandwidth of the channel increases, the detection module increases the packet inspection probability.

3.1 Assumptions

First, we do not consider attacks that exceed the bandwidth of the external router. At present, there are no known methods for defending against attacks that exceed the bandwidth available between the internet service provider and the external router. To defend against this kind of attack, we would have to consider the service provider; however, these issues are not addressed in this paper.

Second, we assume that there is an interval between the arrival of the first attack packet and the peak of the attack. Although an attacker could select agents and issue the command to attack simultaneously, attack packets will arrive at different times. Therefore, we assume that traffic attacking a target network will not reach its peak instantaneously.

Third, we assume that servers can process traffic below a certain acceptable bandwidth. The purpose of this paper is to present a defense against attacks that target exterior routers and consume internal network bandwidth. Efficient techniques for combating system resource attacks exist, but are not included within the scope of this paper [12].

Fourth, we assume the existence of an efficient detection module. Methods for classifying attack packets are plentiful and well implemented [13, 14]. In this paper, we assume that such a module can classify normal and attack packets with ease.

3.2 Stochastic Packet Inspection

During a non-attack situation, this model inspects packets randomly with an initial probability ($P_i$). The detection module adjusts the inspection probability according to the bandwidth available.

Selection of Inspection Probability

From time $t$ to $t + \Delta t$, if the module inspects packets with probability $P_t$, we can compute normal traffic ($NT_t$), attack traffic ($AT_t$), and total traffic ($TT_t$) at each time $t + \Delta t$.

$$NT_t\,(\mathrm{bps}) = \frac{\sum_{t}^{t+\Delta t} \text{Length of inspected normal packets (bit)}}{P_t} \tag{1}$$

$$AT_t\,(\mathrm{bps}) = \frac{\sum_{t}^{t+\Delta t} \text{Length of inspected attack packets (bit)}}{P_t} \tag{2}$$

$$TT_t\,(\mathrm{bps}) = NT_t + AT_t \tag{3}$$

From time $t + \Delta t$, we can infer that the module will drop $P_t \cdot AT_t$ attack packet bits. Therefore, the available bandwidth ($ABW$) on $t + \Delta t$ is calculated as shown below. Here $PBW$ refers to the acceptable bandwidth on this channel.

$$ABW\ (\text{Available Bandwidth}) = PBW\ (\text{Acceptable Bandwidth}) - NT_t - (1 - P_t) \cdot AT_t \tag{4}$$


If $T$ is the minimum available bandwidth required for communication, then our goal is to maintain $ABW$ at a value larger than or equal to $T$.

$$ABW \ge T \tag{5}$$

Combining (4) and (5) yields equations (6) and (7).

$$P_t \ge \frac{NT_t + AT_t - PBW + T}{AT_t} \tag{6}$$

$$P_t \ge 1 - \frac{PBW - NT_t - T}{AT_t} \tag{7}$$

The shape of equation (7) is shown in Figure 1. For example, if attack traffic is four times $PBW - NT_t - T$ ($x = 4$), then in order not to exceed the acceptable bandwidth, we must inspect at least 75% of all packets. But we cannot know by how much the attack traffic will vary after $t + \Delta t$. Therefore this model calculates $P_{t+\Delta t}$ on $t + \Delta t$ based on the network environment between $t$ and $t + \Delta t$. From equation (6), the following equations (8) and (9) are derived.

$$P_t = \frac{NT_{t-\Delta t} + AT_{t-\Delta t} - PBW + T}{AT_{t-\Delta t}} \tag{8}$$

$$P_{t+\Delta t} = \frac{NT_t + AT_t - PBW + T}{AT_t} \tag{9}$$

Equation (10) is derived using $P_t$ from equation (8) and $P_{t+\Delta t}$ from equation (9).

$$P_{t+\Delta t} = \frac{TT_t - TT_{t-\Delta t} + P_t \cdot AT_{t-\Delta t}}{AT_t} \tag{10}$$

In the case of $AT_{t-\Delta t} > AT_t$, with each $t$, $AT_t$ will become smaller, each $P_{t+\Delta t}$ will be smaller than the last $P_t$, the available bandwidth ($ABW$) will increase, and no normal packet loss will occur. In the case that $AT_{t-\Delta t} = AT_t$, the probability is appropriate, there will be no loss of normal packets, and the same probability will be applied in the period of $t + \Delta t$. In the case of $AT_{t-\Delta t} < AT_t$, $P_t$ is smaller than the probability required to achieve $T$, and there is a loss of normal packets. However, $P_{t+\Delta t}$ is adjusted to be larger than $P_t$.
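The update rule can be checked numerically with the short sketch below, which evaluates equation (6) directly and equation (10) incrementally. Two details are assumptions added for the sketch and are not stated in the derivation: the result is clamped to at most 1, and it never falls below a hypothetical initial probability `p_init`. The function and parameter names are also illustrative.

```python
def required_probability(nt, at, pbw, t_min, p_init=0.01):
    """Smallest inspection probability satisfying equation (6).

    nt, at: measured normal and attack traffic; pbw: acceptable bandwidth;
    t_min: minimum available bandwidth T. Clamping to [p_init, 1] is an
    assumption of this sketch.
    """
    if at <= 0:
        return p_init  # no attack traffic observed
    p = (nt + at - pbw + t_min) / at
    return max(p_init, min(1.0, p))


def next_probability(p_t, tt_prev, at_prev, tt_now, at_now, p_init=0.01):
    """Incremental form, equation (10), using totals TT = NT + AT."""
    if at_now <= 0:
        return p_init
    p = (tt_now - tt_prev + p_t * at_prev) / at_now
    return max(p_init, min(1.0, p))


# Worked example: PBW - NT - T = 100 - 40 - 20 = 40, attack traffic 4x that.
p_t = required_probability(nt=40, at=160, pbw=100, t_min=20)

# Attack grows from 160 to 200 while normal traffic stays at 40.
p_next = next_probability(p_t, tt_prev=200, at_prev=160, tt_now=240, at_now=200)
p_direct = required_probability(nt=40, at=200, pbw=100, t_min=20)
```

With these numbers the first call reproduces the 75% figure from the example in the text, and the incremental and direct computations agree for the following interval.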


Fig. 1. Illustration of equation (7)


Fig. 2. Illustration of equation (11)

Selection of the Probability Adjustment Interval ($\Delta t$)

Using this approach, $P_{t+\Delta t}$ is the result of past and current traffic, not a prediction of future traffic. However, if the model calculates $P_{t+\Delta t}$ incorrectly, an adjustment can be made at $t + 2\Delta t$. Appropriate selection of the interval ($\Delta t$) is therefore important.

Limitation of Inspection Probability

From $t$ to $t + \Delta t$, assume the total number of packets is $N_t$ and the maximum number of packets that can be inspected by the detection module is $L$.

Pt
From the definition, we can see that a planning is composed of a nonempty set of activities. Essentially, in this sequence, Pre, execution of a web service, and Post appear alternately, and together they compose the workflow of the composite process. At run time, the agents are assigned to carry out the web services according to the planning. In the MAWSC model, the planning is very important because it specifies the logical actions of the mobile agents.

Definition 3 WSC-MA. A WSC-MA is a 5-tuple , where ScopeNum is an integer used to distinguish which agent scope it belongs to; the DataBlock stores a set of useful data, namely intermediate results or global parameters; ST is the status of the agent, and it is an enumeration of {BEGIN, READY, WAITING, RUN, ABORT, TRANSITION}.

For one WSC process, several WSC-MAs may implement the task in parallel. We use ScopeNum to identify the WSC-MAs that belong to the same group. ST refers to the status of the WSC-MA. DataBlock is an important component at run time: the variables in DataBlock reflect the current context of the composite process, and the behavior of the agent depends mainly on this information.

3.2 Action Rules

As mentioned above, the planning needs to describe the complex control flow that corresponds to the procedural rules specified in the WSC specification. For example, when one agent finishes booking the train ticket and moves to the target host to invoke the register service, it must make sure that another agent has finished booking the hotel. Here we give the detailed definition of the action rules.

Definition 4 Pre. A Pre is a precondition rule defined as , where Operator is an enumeration of {AND, XOR} that explains the relationship of the expression; A-Set is a nonempty set of activities, which should be completed before the execution of this activity.

Before the execution of the web service, a set of related activities must be completed first. There are two main reasons for a WSC-MA to wait when it is ready. One is that some of the inputs required by the web service are outputs of other web services, so the WSC-MA has to wait for the intermediate results. The other is simply to follow the logical order of the WSC specification. The two operators express different control logic. AND is the synchronization of the execution along different paths: before the WSC-MA invokes the web service, all the activities in A-Set must be finished. XOR is a simple merge: the completion of any one of the activities can trigger the web service. If there is only one activity in the A-Set, the operator is arbitrary and the rule indicates a sequence.

Definition 5 Post. The Post is defined as a set of rules < (Rule)*>. One Rule represents one succeeding branch of the process. Rule is a 2-tuple < Condition | , A> where a "Condition" is a logical expression which may be evaluated by the agent to decide whether a succeeding activity should be started.
The Post tells the agent what to do next after the activity is completed on one host. Post is the logical transition of the activity, which drives the behavior of the agents. The basic workflow constructs "sequence" and "AND/OR-split" can be expressed.

3.3 Behaviors of WSC-MA

In the MAWSC model, four basic physical behaviors of WSC-MA are defined: execution, merging, migration, and clone. These basic behaviors are the basis of MAWSC, and new behaviors can be added in order to realize special WSC applications.

(1) Execution. This behavior includes two steps: invoking a web service and collecting the results into the DataBlock.

(2) Merging. When the ST of an agent is WAITING, the WSC-MA is waiting for the completion of another activity. Once another WSC-MA arrives, the former WSC-MA checks the ScopeNum of the incoming WSC-MA. If the two WSC-MAs have the same ScopeNum, a Merging operation takes place. This behavior corresponds to the "AND-join" and "XOR-join" in the workflow.

(3) Migration. After a WSC-MA completes the activity on one host, it moves on to achieve another activity. A migration indicates that the WSC-MA migrates to a certain host and accesses the web service locally. If the agent fails to migrate, it throws an exception and the ST of the agent turns to "ABORT".

(4) Clone. In this function, a WSC-MA is replicated. The input parameter is one of the rules in the Post of the primary WSC-MA, so any cloned WSC-MA has only one rule of Post. Suppose that at one node in the process a single thread is split into n threads; then n agents are cloned to achieve these threads respectively. Each new agent has the same DataBlock and ST. After the clone, the primary agent itself is destroyed.

3.4 Architecture of WSC-MA

We have designed WSC-MA to realize our approach. We omit the description of the mobile agent platform, which is not the focus of this paper.

Fig. 2. Architecture of WSC-MA

WSC-MA is divided into two parts. As shown in Figure 2, the left part is the runtime control. According to the planning, behavior matching (detailed in Section 5) converts the actions of WSC-MA into the physical behaviors of WSC-MA. The behavior control module then implements these behaviors. It calls the execution module to invoke the web service and transfers the instructions to the mobile agent platform. The execution module invokes the web service, collects the results, and catches exceptions. The exception processing module calls the functions defined in the planning to handle the exception. If the exception is fatal, it sends a report to the WSC portal and terminates the run.

The right part is the agent guideline. This part is preloaded before the WSC-MA begins its task. It includes the planning, the behavior matching algorithms, and the behaviors. By changing the planning, we obtain different WSC applications; by changing the behavior matching algorithms and behaviors, we obtain a new WSC-MA system suited to a specific scenario. For example, we can import a new agent behavior called "replication" from the MAFTM model [12] and add a corresponding behavior matching algorithm. When there is more than one web service in set I (that is, there are several candidates that offer the same function), the WSC-MA executes "replication", splitting into several WSC-MAs that access these web services respectively. This extended MAWSC model is thus a fault-tolerant WSC model.

Mobile-Agent-Based Web Service Composition


4 Process Mapping Rules

In the MAWSC model, we need to convert the input specification of the WSC into a planning. The literature [13] summarizes 20 workflow patterns and indicates that any complex process can be divided into these patterns. Consequently, once the mapping rules between the patterns and the actions of WSC-MA are defined, the specification can be converted to the planning. Among these patterns, there are 5 basic ones that closely match the definitions of elementary control flow concepts provided by the WfMC in [14]. Because of space limitations, we give only the 5 basic mapping rules.

Rule 1 Sequence. Suppose activity T1 is enabled after the completion of activity T2 (Figure 3(b)). Then T2.Post and T1.Pre are defined as Figure 3(a) shows.

Fig. 3. Sequence. (a) T2.Post names T1 as the succeeding activity; T1.Pre has an arbitrary operator and A-Set {T2}. (b) The corresponding control flow.

Rule 2 Parallel Split (AND-split). Suppose that at activity T1 the single thread of control splits into multiple threads of control that can be executed in parallel, and that these threads start with T2, T3, … respectively (Figure 4(b)). Then T1.Post is defined as in Figure 4(a).

Fig. 4. Parallel split. (a) T1.Post holds one rule for each of T2, T3, …. (b) The corresponding control flow.

Rule 3 Synchronization (AND-join). Suppose that at activity T1 multiple parallel activities converge into one single thread of control, thus synchronizing multiple threads, and that the activities preceding T1 in these parallel threads are T2, T3, … (Figure 5(b)). Then T1.Pre is defined as in Figure 5(a).

Fig. 5. Synchronization. (a) T1.Pre has operator "AND" and A-Set {T2, T3, …}. (b) The corresponding control flow.


Z. Qian, S. Lu, and L. Xie

Rule 4 Exclusive Choice (XOR-split). Suppose that at activity T1 one of several branches is chosen according to the procedural rules. The conditions C2, C3, … correspond to the branches that start with T2, T3, … (Figure 6(b)). Then T1.Post is defined as in Figure 6(a).

Fig. 6. Exclusive choice. (a) T1.Post holds the rules (C2, T2), (C3, T3), …. (b) The corresponding control flow.

Rule 5 Simple Merge (XOR-join). Suppose that at activity T1 two or more alternative branches come together without synchronization, and that the activities preceding T1 in these alternative branches are T2, T3, … (Figure 7(b)). Then T1.Pre is defined as in Figure 7(a).

Fig. 7. Simple merge. (a) T1.Pre has operator "XOR" and A-Set {T2, T3, …}. (b) The corresponding control flow.

The Pre and the Post always appear in pairs: the Post of the prior activity and the Pre of its successor together define the relationship between the two activities. Note that in Rule 2 and Rule 4 we omit the Pre of T2 and T3; clearly, their Pre should be T1. Similarly, we omit the Post of T2 and T3 in Rule 3 and Rule 5; their Post should be T1.
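The five mapping rules can be read as small functions that fill in the Pre and Post entries of a planning. The sketch below is one possible encoding, not the authors' implementation: the class names `Pre`, `Rule`, and `Plan`, the use of the string "True" for an unconditional rule, and the dictionary layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pre:
    operator: str        # "AND", "XOR", or "arbitrary" (single predecessor)
    a_set: frozenset     # activities that must complete first


@dataclass
class Rule:
    condition: str       # logical expression as text; "True" is assumed
    activity: str        # succeeding activity


@dataclass
class Plan:
    pre: dict = field(default_factory=dict)    # activity -> Pre
    post: dict = field(default_factory=dict)   # activity -> list of Rule


def _add_rule(plan, src, condition, dst):
    plan.post.setdefault(src, []).append(Rule(condition, dst))


def sequence(plan, first, then):                      # Rule 1
    _add_rule(plan, first, "True", then)
    plan.pre[then] = Pre("arbitrary", frozenset({first}))


def and_split(plan, t1, branches):                    # Rule 2
    for b in branches:
        _add_rule(plan, t1, "True", b)
        plan.pre[b] = Pre("arbitrary", frozenset({t1}))


def and_join(plan, t1, predecessors):                 # Rule 3
    plan.pre[t1] = Pre("AND", frozenset(predecessors))
    for p in predecessors:
        _add_rule(plan, p, "True", t1)


def xor_split(plan, t1, conditional_branches):        # Rule 4
    for condition, b in conditional_branches:
        _add_rule(plan, t1, condition, b)
        plan.pre[b] = Pre("arbitrary", frozenset({t1}))


def xor_join(plan, t1, predecessors):                 # Rule 5
    plan.pre[t1] = Pre("XOR", frozenset(predecessors))
    for p in predecessors:
        _add_rule(plan, p, "True", t1)


# The travel example from Section 3.2: book train and hotel, then register.
plan = Plan()
and_split(plan, "start", ["book_train", "book_hotel"])
and_join(plan, "register", ["book_train", "book_hotel"])
```

In this encoding each split or join call writes both halves of the pair, so the Pre and Post entries that the rules leave implicit are filled in as well.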

5 Behavior Matching

The WSC portal system converts the WSC specification into the planning and creates WSC-MAs to implement the task. In this section we give "Pre2Behavior" and "Post2Behavior", which convert the action rules into the physical behaviors of WSC-MA.


Pre2Behavior() {
    Probe the existing WSC-MAs in this host
    If (this.ScopeNum == ExistingAgent.ScopeNum)
        If (this.A in ExistingAgent.Pre.A-Set) return;

    Switch (Operator) {
        case XOR:
        case AND: {
            A-Set temp;
            this.ST = WAITING;
            temp = this.A.Pre.A-Set;
            delete this.A from temp;
            While (temp != empty) {
                Waiting for incoming agent …
                If (AgentIncom.ScopeNum == this.ScopeNum)
                    If (AgentIncom.A in temp) {
                        Delete AgentIncom.A from temp;
                        Merging(this, AgentIncom);
                    }
            }
            this.ST = READY;
            U = choose(this.I);
            Execution(U);
        }
    }
}

Fig. 8. Algorithm of Pre to behavior
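The waiting-and-merging loop of Fig. 8 can be exercised with the following sketch. It simplifies the setting considerably and rests on several assumptions: agents are plain records, an iterable of later arrivals stands in for the host's agent queue, exhaustion of that iterable stands in for the timeout, and Merging is modelled as a union of the two DataBlocks. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    scope_num: int
    finished: str                          # activity this agent just completed
    data_block: dict = field(default_factory=dict)
    status: str = "READY"


def pre_to_behavior(agent, a_set, incoming):
    """Wait for the other predecessors in a_set and merge their data.

    Returns the set of predecessor activities that never arrived
    (empty when the join is complete).
    """
    pending = set(a_set) - {agent.finished}
    agent.status = "WAITING"
    arrivals = iter(incoming)
    while pending:
        other = next(arrivals, None)
        if other is None:                  # stands in for the timeout
            break
        if other.scope_num != agent.scope_num:
            continue                       # agent of a different WSC process
        if other.finished not in pending:
            continue
        pending.discard(other.finished)
        agent.data_block.update(other.data_block)   # assumed Merging semantics
    agent.status = "READY"
    return pending


a1 = Agent(1, "T1", {"train": "ok"})
a2 = Agent(1, "T2", {"hotel": "ok"})
stranger = Agent(2, "T2", {"x": 1})
missing = pre_to_behavior(a1, {"T1", "T2"}, [stranger, a2])
```

After the call, `a1` carries both intermediate results and would go on to choose and execute the web service once, which is how the duplicate invocation in the XOR scenario is avoided.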

Figure 8 shows Pre2Behavior, which maps Pre to behaviors. Here "this" represents the WSC-MA itself. Consider the following scenario. Two activities T1 and T2 are the predecessors of T3, and the operator of T3.Pre is "XOR". That means that whichever of T1 or T2 completes, T3 will be triggered. Suppose that, after finishing T1, WSC-MA A1 invokes T3, gets the result, and migrates to the next host. Then WSC-MA A2 accomplishes T2, arrives, and invokes T3 again, so T3 is invoked twice. To address this problem, in this algorithm the WSC-MA waits for the others until all the WSC-MAs arrive or a timeout occurs.

Post2Behavior() {
    RuleSet: set of Rule;
    RuleSet = this.A.Post;
    if (card(RuleSet) == 1) {
        check DataBlock to find out
        if (rule.Condition == True) {
            Migration(rule.A);
            return;
        }
        else {
            destroy this;
            return;
        }
    }
    for each rule in RuleSet {
        Clone(rule);
    }
    destroy this;
}

Fig. 9. Algorithm of Post to behavior

Figure 9 shows Post2Behavior, which converts the Post into transition actions. First, a WSC-MA with multiple successor activities is replicated by “Clone”, and each cloned WSC-MA has a single successor activity. That is, a WSC-MA with n successor activities is replicated n times, creating n clones with a single successor activity each. These new WSC-MAs then check the “condition” of the rule to decide whether to perform “Migration”. After the “Clone”, the original WSC-MA is destroyed.
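The two matching steps of Figs. 8 and 9 can be illustrated with a small Python sketch. It is not the authors' implementation: the `Rule` class and the function names `pre_to_behavior` and `post_to_behavior` are hypothetical, agents are reduced to the names of their activities, and running out of arrivals stands in for the timeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One Post rule: a successor activity and its already evaluated condition."""
    successor: str
    condition: bool = True


def pre_to_behavior(prior_activities, arrivals):
    """Join step: wait until an agent has arrived for every prior activity.

    `arrivals` lists the prior activity of each incoming agent in arrival
    order. Returns the merged arrivals once the join is complete, so the
    successor is invoked exactly once, or None if the arrivals run out
    first (a stand-in for the timeout).
    """
    pending = set(prior_activities)
    merged = []
    for activity in arrivals:
        if activity in pending:
            pending.discard(activity)  # corresponds to Merging(this, AgentIncom)
            merged.append(activity)
        if not pending:
            return merged              # state READY: execute once
    return None


def post_to_behavior(rules):
    """Split step: one clone per rule; a clone migrates only if its condition holds."""
    clones = list(rules)                                  # Clone(rule) for each rule
    return [r.successor for r in clones if r.condition]   # Migration(rule.A)
```

With priors T1 and T2, the join completes only after both agents have arrived, which is how the double invocation of T3 in the XOR scenario is avoided.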


6 Related Work
Research on the orchestration of WSC has mainly concentrated on frameworks for composing web services. The survey in [15] discusses three approaches to building composite services: the peer-to-peer, mediated and brokered approaches. Widely used web service composition languages such as BPEL4WS and BPML are designed for specifying the behavior of the mediator service, namely the central peer. The brokered approach is an extension of the mediated approach. The central peer (the “broker”) controls the other peers, but data can be passed directly between any pair of peers. This approach is embodied in GSFL [16] for scientific workflow and AZTEC [17] for telecommunications applications. It relieves the central peer of the data communication. However, the mediated and brokered approaches are both centralized and share the same shortcomings: the central peer becomes a performance bottleneck, and once it is down, the execution of the composite process fails. Furthermore, in the mediated approach both data and control flow pass through the center, so the network traffic is heavy. In contrast to these two approaches, our WSC model is a decentralized orchestration model. In the MAWSC model, the WSC portal is just an entrance to the WSC system; at run time, the WSC-MA executes the WSC task according to the planning and needs no “central peer” to control the execution, and the intermediate results are carried directly to the service host. Our approach therefore works in peer-to-peer mode and avoids the bottleneck. Another advantage is that the agent keeps doing the WSC work even if the client user is offline at run time. Agent-based WSC systems have also appeared. In the IG-JADE-PKSlib toolkit [18], four kinds of agents are defined. Among them, the Executor Agent (EA) implements the arrangements, and the JADE agent represents the web service. Essentially, this model is still a mediated approach; the mediator is composed of the EA and the JADE agents. In [19], team-based agents are introduced to achieve WSC; the team leader forms a team to honor a complex service. This is a typical broker architecture. Strictly speaking, the agents in these systems are not mobile agents; they only act as connectors between the web services and the WSC control center. In [20], Fuyuki Ishikawa presents a way to add a description of the physical behaviors of agents to the integration logic based on BPEL, aiming to use mobile agents to realize the web service workflow defined in BPEL. However, as mentioned above, BPEL is designed for centralized management, so simply adding some agent actions to BPEL is far-fetched. Furthermore, [20] explains only two kinds of agent behaviors and does not describe how the system works. In our model, we use the planning, which is much better suited to specifying agent behaviors than BPEL extended with agent actions, and we also give the detailed orchestration of the execution of our model. In [12], MAFTM is proposed, a mobile-agent-based fault-tolerant model for WSC. However, as stated in that paper, MAFTM only describes the execution of a single mobile agent and does not deal with the control of the composite web service. In section 3.4, we describe an advanced MAWSC model that imports the fault-tolerance technologies of MAFTM. Consequently, MAFTM can be a module of a WSC model, but is not itself a WSC model. In fact, in our model, WSC-MA executes


the web services in set I (as mentioned in section 3.1, the web services in set I offer the same function) one by one, until one of them executes successfully. This is also a fault-tolerance policy. Compared with MAFTM, our approach is much simpler, while MAFTM works with good efficiency. In summary, MAWSC is a novel WSC orchestration model; compared with the other WSC approaches, it has three main features:

• Peer-to-peer Mode. The migration of the mobile agent realizes direct communication between the web services, which reduces the network traffic.
• Execution Independently. There is no “center” controlling the execution of the WSC in the MAWSC model. The WSC-MA carries out the task independently, so there is no performance bottleneck and one WSC portal can provide WSC services to a large number of clients.
• Extensibility. The architecture of the WSC-MA is easy to extend. By changing the physical behaviors and the corresponding matching algorithms, new features are easily added to the WSC-MA. Besides the fault-tolerance technologies, we can add other behaviors (such as communication primitives, interoperability, etc.), so that the WSC-MA can be used in special application environments.

7 Conclusion and Future Work
In this paper, we present a novel web service composition model, MAWSC, which is based on mobile agents. It is a distributed WSC orchestration model that makes the execution of WSC more flexible and efficient. It avoids the dependence on the “central peer” and reduces the dependence on network traffic. The MAWSC model can therefore be applied directly to pervasive networks, whose connections are relatively narrow and unstable. The cooperative office automation (OA) system we are currently developing is based on the MAWSC model. Cooperative OA systems pose two challenges: (1) different kinds of client devices are used, including PCs, mobile phones, PDAs and laptops; (2) large numbers of documents are transferred. By adopting the MAWSC model, we solve these two problems easily. In section 4, we presented five basic processing mapping rules; we are now investigating the control patterns [13] of workflow in order to find relationships among them and to give unified mapping rules.

References
1. X. X. Ma, P. Yu, X. P. Tao, J. Lu: A Service-Oriented Dynamic Coordination Architecture and Its Supporting System. Chinese Journal of Computers, 4(2005) 467-477
2. M. Gudgin, M. Hadley, N. Mendelsohn, et al: SOAP Version 1.2 Part 1: Messaging Framework. W3C. (2003)
3. E. Christensen, et al: Web Services Description Language (WSDL) 1.1. W3C. (2001)
4. L. Clement, et al: UDDI Version 3.0.2: UDDI Spec Technical Committee Draft. W3C. (2004)


5. D. Berardi, D. Calvanese, G. D. Giacomo, et al: Automatic Composition of E-services That Export Their Behavior. In Proceedings of International Conference of Service Oriented Computing 2003. Lecture Notes in Computer Science, Vol. 2910. Springer-Verlag, Berlin Heidelberg (2003) 43-58
6. T. Andrews, et al: Specification: Business Process Execution Language for Web Services Version 1.1. Microsoft, IBM, America (2003)
7. F. Leymann: Web Services Flow Language (WSFL) 1.0. IBM, America (2001)
8. A. Arkin, et al: Web Service Choreography Interface (WSCI) 1.0. W3C. (2002)
9. A. Arkin: Business Process Modeling Language. BPMI. (2002)
10. D. Milojicic: Mobile agent applications. IEEE Concurrency, 3(1999) 7-13
11. M. Oshima, G. Karjoth, K. Ono: Aglets specification. http://www.trl.ibm.com/aglets/spec11.htm. (1998)
12. W. Xu, B. H. Jin, J. Li, J. N. Cao: A Mobile Agent-Based Fault-Tolerant Model for Composite Web Service. Chinese Journal of Computers, 4(2005) 558-567
13. W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros: Workflow Patterns. Distributed and Parallel Databases, Kluwer Academic Publishers, USA. (2003) 5-51
14. WfMC: Terminology & Glossary, WFMC-TC-1011 issue 3.0. (1999)
15. R. Hull, M. Benedikt, V. Christophides, J. W. Su: E-Services: A Look Behind the Curtain. In Proceedings of the 22nd ACM Symposium on Principles of Database Systems, California (2003) 1-14
16. S. Krishnan, P. Wagstrom, G. von Laszewski: GSFL: A workflow framework for Grid services. Technical Report Preprint ANL/MCS-P980-0802, Argonne National Laboratory. (2002)
17. V. Christophides, R. Hull, G. Karvounarakis, A. Kumar, G. Tong, M. Xiong: Beyond discrete e-services: Composing session-oriented services in telecommunications. In Proceedings of Workshop on Technologies for E-Services (TES), Lecture Notes in Computer Science, Vol. 2193. Rome, Italy. (2001)
18. E. Martinez, Y. Lesperance: IG-JADE-PKSlib: An Agent-Based Framework for Advanced Web Service Composition and Provisioning. In Proceedings of Workshop on Web Services and Agent-based Engineering, New York, USA. (2004)
19. X. Fan, K. Umapathy, J. Yen, S. Purao: An Agent-based Approach for Interleaved Composition and Execution of Web Services. In Proceedings of the 23rd International Conference on Conceptual Modeling, Shanghai, China. (2004)
20. F. Ishikawa et al: Mobile Agent System for Web Services Integration in Pervasive Networks. In Proceedings of International Workshop on Ubiquitous Computing. Miami, USA. (2004) 38-47

Trust Shaping: Adapting Trust Establishment and Management to Application Requirements in a Service-Oriented Grid Environment

E. Papalilo, T. Friese, M. Smith, and B. Freisleben

Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str., D-35032 Marburg, Germany
{elvis, friese, matthew, freisleb}@informatik.uni-marburg.de

Abstract. In this paper, an approach for establishing and managing trust among the interaction partners (i.e. users, nodes and/or services) in a service-oriented Grid environment is presented. The approach is based on a flexible trust model and system architecture for collecting and managing multidimensional trust values. Both identity and behaviour trust of the interaction partners are considered, and different sources are used to determine the overall trust value of an interaction partner. A proposal for establishing the first trust between interaction partners is made, and the possibility to continuously monitor the partners’ behaviour trust during an interaction is provided. The proposed trust architecture can be configured to the domain specific trust requirements by the use of several separate trust profiles covering the entire lifecycle of trust establishment and management.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 47 – 58, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction
The Grid computing paradigm is aimed at providing flexible, secure, coordinated resource sharing among dynamic collections of geographically distributed collaborating individuals, institutions and resources [1]. In such a distributed and heterogeneous environment, trust is a major requirement for enabling collaboration among the interaction partners (i.e. users, nodes and/or services). Azzedin et al. [2] have classified trust into two categories: identity trust and behaviour trust. Identity trust is concerned with verifying the authenticity of an interaction partner, whereas behaviour trust deals with the trustworthiness of an interaction partner. The overall behaviour trust of an interaction partner can be built up by considering several factors, such as accuracy or reliability. These factors of behaviour trust should be continuously tested and verified. In this way, it is possible to collect a history of past collaborations that can be used for future decisions on further collaborations between interaction partners. This kind of experience can also be shared as recommendations to other participants. Furthermore, the overall decision whether to trust an interaction partner or not may be affected by other non-functional aspects that cannot be generally determined for every possible situation, but should rather be under the control of the user when requesting such a decision. In addition, while the basic functionalities of two


applications could be similar, differences in application behaviour could be caused by different domain specific trust requirements. Therefore, a trust system for a Grid environment should offer flexible and easy to use components that can be configured to the specific needs of a user on a per case basis. In this paper, an approach that allows adapting trust establishment and management to user and application requirements in a service-oriented Grid environment is presented. The approach is based on a flexible trust model which includes identity and behaviour trust of the interaction partners and considers different sources to calculate the overall trust value for an interaction partner. A proposal for establishing the first trust between interaction partners is made, and the possibility to monitor the partners' behaviour trust during an interaction is provided. Finally, a system architecture for collecting and managing multidimensional trust values is proposed. It consists of two main components, a trust engine and a verification engine. The trust engine manages trust values and offers partner discovery and rating functionality to higher level applications. The verification engine handles the verification of Grid service results and other criteria (e.g. availability, latency etc.) and generates the necessary feedback for the trust engine regarding the partner. The proposed system architecture can be configured to the domain specific trust requirements by the use of several separate trust profiles covering the entire lifecycle of trust establishment and management. The paper is organized as follows. Related work is discussed in section 2. In section 3, our trust model is introduced. In section 4, two sample application scenarios are discussed. Section 5 presents a system architecture supporting the proposed trust management model in a service-oriented Grid environment. Section 6 concludes the paper and outlines areas for future research.

2 Related Work Trust is an important factor for establishing an interaction among participants in a collaborative computing environment, especially in virtual societies [3]. In the field of Grid computing, trust management has been discussed by Azzedin and Maheswaran [2] who separate the "Grid domain" into a "client domain" and a "resource domain". They define the notion of trust as consisting of identity trust and behaviour trust. Identity trust is considered only during the authentication and authorization process, and behaviour trust is limited to the abusive or abnormal behaviour of the entities during the collaboration. The authors present a formal definition of behaviour trust where for trust establishment among entities direct and indirect experiences together with a decay function that reflects the trust decay with time are considered. In their model, a recommender's trust is calculated at the end of a transaction, according to the directly accumulated experience. Hwang and Tanachaiwiwat [4] propose a trust model based on the trust that Grid resources place to each others' identity. They make use of a central certification authority which bridges different PKI domains for cross-certification. Goel and Sobolewski [5] present trust and security models for Grid environments using trust metrics based on e-business criteria. They offer a centralized and a distributed scheme for estimating trust among entities.


In [6], Thompson et al. examine the basis for the trust a relying party places on an identity certificate signed by a certification authority. In [7], Lin et al. develop a trust management architecture for addressing security issues in Grid environments based on subjective logic. They use the belief, disbelief and uncertainty to weight the trustworthiness of the collaborating parties. Identity trust, although considered in the model, is implied to belong only to the authentication process, without offering any possibility to measure it. The authors deal with a general notion of behaviour trust that is established before a collaboration takes place among participants, using trust information from direct and recommenders' sources. Quillinan et al. [8] propose to use trust management for establishing relationships between Grid administrators and users for automating administration tasks. In [9], Jennings et al. develop a general trust model for agent-based virtual organizations. The authors make use of direct and indirect trust sources. A distinctive feature of their model is that recommendations coming from inaccurate sources are not given the same importance as those received from accurate sources. They suppose that the agents do not change their behaviour over time. Finally, Tie-Yan et al. [10] propose a trust management model that supports the process of authentication between entities by considering their identity trust. Their model consists of two tiers. In the upper tier, trust among different virtual organizations is managed, and in the lower tier the trust among entities that belong to the same Grid domain or individuals is managed.

3 Establishing and Managing Trust in Grids All approaches presented in the previous section have limitations. First, they only use part of the trust sources, more precisely only direct and indirect trust ([9] uses also prejudice), and either identity or behaviour trust. A collective characteristic is the lack of mutual verification of the partners involved in an interaction. An interaction is considered as successful only if an interaction partner X receives a response from an interaction partner Y. None of the proposals includes checks on the accuracy of the received responses. The trust models that manage identity trust involve a mutual verification of the identities of the collaborating partners at the beginning of the collaboration, but there still is the need for a continuous mutual verification strategy during the collaboration. This verification affects the behaviours of the partners involved and, as a result, influences their decision to continue the collaboration. Thus, there is a need for a flexible trust model whose properties reflect the requirements of Grid applications and the preferences and needs of their users. Our approach to achieve this goal in a service-oriented Grid environment is described in the following. The following terminology is used in this section: A collaboration in a Grid takes places between interaction partners. An interaction partner is either a service provider (e.g. a node to host and provide a service, or a service instance running on the provider node) or a service consumer (e.g. a node that requests a service from a provider (which includes the request to deploy and perform a service at the provider), or a service instance running on the consumer node). There are two major aspects that influence the selection or acceptance of an interaction partner:


− The identity of the interaction partner, or more specifically the trust that one can put in the credibility of the identity an interaction partner claims to have.
− The past behaviour of the interaction partner as an indicator for its future behaviour. This behaviour can be rated along a multitude of dimensions, such as the accuracy of delivered results, actual costs compared to expected costs, availability of the service, response time, or fault and intrusion properties. Furthermore, the trust values might be different for different applications/services the interaction partner offers or requests.

In most cases, an interaction partner is not able to judge trustworthiness based on personal and direct experiences only. A socially inspired model using several dimensions of trust that builds on exchanges of experiences and recommendations is useful to decide whether to trust another interaction partner or not. In the following, a flexible trust model and system architecture for collecting and managing such multidimensional trust values are presented.

3.1 A Grid Trust Model
We assume that the probability for a successful future interaction among partners is closely related to the mutual trust values the partners assign to each other. These values vary in [0,1] ⊂ ℜ, where 0 means that the other partner is not trusted at all or there are uncertainties due to the lack of information, and 1 means that it can be fully trusted and gives certainty about the success of the interaction that is going to take place. It is clear that a service consumer will, for example, try to find the most trusted of all available providers for a given task. The trust T that an interaction partner X has for partner Y is influenced by both identity trust $T^I$ and behaviour trust $T^B$:

$$T_X(Y) = T_X^I(Y) \cdot T_X^B(Y) \qquad (1)$$

Although identity trust is part of the Grid authentication and authorization process, its value is nevertheless related to the overall trust value of a partner. It expresses the belief that the partner is who it claims it is. In a Grid environment, a participant is typically authenticated through a certification authority (CA) which could be: (a) a trusted and dedicated entity (e.g. Globus, IBM, etc.), (b) another participant in the environment, (c) the participant itself which could issue self-signed certificates. To give an interaction partner X the possibility to determine the identity trust of an interaction partner Y, a collaboration subgraph is centred at X’s CA. Similar to Erdös numbers [11], we define the degree of separation D of partner Y from partner X as the path length of the shortest certificate chain from X’s CA to Y. Any partner who has no path to X’s CA is said to have an infinite relationship with the centre of this “collaboration graph” and thus ∞ is assigned. A relationship is of first order (i.e. 1 assigned) between trusted and dedicated CAs like Globus, IBM, etc. In a well established Grid infrastructure, we do believe that all participants authenticated through CAs of first order are completely identified. In this case, if partner X needs to gather information regarding the identity trust of another partner Y, after establishing the degree of separation of partner Y with its CA, equation (2) can be used:

$$T_X^I(Y) = \frac{1}{D_{X,Y}} \qquad (2)$$
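As an illustration of equation (2), the following Python sketch computes the degree of separation with a breadth-first search over a certificate graph and derives the identity trust from it. The graph representation (a dictionary mapping each issuer to the parties it has certified) and the function names are assumptions made for this example, as is the choice to return 1.0 when the partner is X's CA itself (D = 0), a case the text does not define.

```python
import math
from collections import deque


def degree_of_separation(cert_graph, root_ca, partner):
    """Length of the shortest certificate chain from root_ca to partner.

    cert_graph maps an issuer to the parties it has certified.
    Returns math.inf when no chain exists.
    """
    if partner == root_ca:
        return 0
    seen = {root_ca}
    queue = deque([(root_ca, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in cert_graph.get(node, ()):
            if nxt == partner:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf


def identity_trust(cert_graph, root_ca, partner):
    """Identity trust 1/D; 0.0 when there is no path (D is infinite)."""
    d = degree_of_separation(cert_graph, root_ca, partner)
    if d == 0:
        return 1.0  # assumption: the CA itself is fully identified
    return 0.0 if math.isinf(d) else 1.0 / d
```

A partner certified directly by X's CA gets identity trust 1, a partner two certificates away gets 0.5, and a partner with no chain to X's CA gets 0.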


Participants can have different preferences and requirements for future interaction partners. Considering the relationship between quality of service (QoS) and trust [12], different QoS properties like availability of the service, accessibility of the service, accuracy of the response provided by the service, response time, cost of the services, security etc., can be considered and modelled as behaviour trust elements that a consumer uses to rate a provider. In a similar way, the total number of (concurrent) requests coming from a consumer or the size of the packets received from it can be considered as behaviour trust elements from the point of view of a provider. Trust is a multidimensional value that can be derived from different sources. We distinguish between three such sources. First, there is direct and personal experience from past collaborations with a partner. Second, there are recommendations from known sources (i.e. partners for which direct experiences exist). Finally, recommendations from other nodes/services in the Grid may be considered, and then a path can be found using the known partners of known partners and so on. A collaboration graph similar to the one used in the determination of $T^I$ can be constructed with a participant at its centre. These three classes can be determined using the degree of separation from the participant (i.e. personal experience is a recommendation from a partner with D=0, all known partners are defined to have D=1, and all unknown sources are identified by the value D>1). Now assume that the trust based on direct experience of an interaction partner X with a partner Y is given by $T^B_{X,D=0}(Y)$. Each partner in the network can now calculate the weighted recommendations coming from the set of known partners $N_k$ according to equation (3).

$$R_{X,D=1}(Y) = \frac{\sum_{k \in N_k} \left( T^B_{X,D=0}(k) \cdot T^B_{k,D=0}(Y) \right)}{|N_k|} \qquad (3)$$
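A minimal Python sketch of the weighted recommendation in equation (3) follows. The nested dictionary `direct`, where `direct[a][b]` holds the direct-experience trust of a in b, and the function name are assumptions for illustration. The sketch also assumes that the set of known partners consists of every partner X has direct experience with (other than Y), and that a known partner without experience of Y contributes 0.

```python
def known_recommendation(direct, x, y):
    """Average, over X's known partners k, of X's trust in k times k's trust in Y."""
    known = [k for k in direct.get(x, {}) if k != y]
    if not known:
        return 0.0  # no known partners, hence no recommendation
    total = sum(direct[x][k] * direct.get(k, {}).get(y, 0.0) for k in known)
    return total / len(known)
```

For example, if X trusts A with 1.0 and B with 0.5, while A and B rate Y with 0.8 and 0.4, the recommendation is (1.0 · 0.8 + 0.5 · 0.4) / 2 = 0.5.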

There are two strategies for calculating the total weighted recommendation coming from the set of unknown partners: either considering the experience values of each participant $p_i$ along the shortest path $P = \{p_1, \ldots, p_D\}$ between partner X and partner Y, or taking only the experience of the participant in the path preceding partner Y into account. This value is weighted based on the degree of separation of this participant from partner X:

$$R_{X,D>1}(Y) = \frac{\sum_{u \in N_u} \left( \prod_{i=1}^{|P_u|} T^B_{p_{u,i},D=0}(p_{u,i+1}) \right)}{|N_u|} \qquad (4)$$

where $P_u = \{p_{u,1}, \ldots, p_{u,D+1}\}$ is the shortest path from X to Y with $p_{u,D} = u$. Equation (4) requires several subjective decisions in the determination of every partner's trust along the path, which are based on experience and, as we will see, on several uncertainties regarding the evaluation of these experiences. Therefore, equation (5) represents a more prejudiced evaluation of the recommendations based on the idea that a more objective metric is needed to weight recommendations originating from unknown sources.

$$R^p_{X,D>1}(Y) = \frac{\sum_{u \in N_u} \left( T^I_X(u) \cdot T^B_{u,D=0}(Y) \right)}{|N_u|} \qquad (5)$$
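The two ways of weighting recommendations from unknown sources can be contrasted in a short Python sketch. It reflects one reading of equations (4) and (5), not the authors' code: the path-based variant multiplies the direct-experience trust over every consecutive pair of a shortest path from X to Y and averages over the paths, while the prejudiced variant weights each unknown recommender by X's identity trust in it. The data layout (`direct[a][b]` for direct experience, `identity[u]` for the identity trust X has in u) and the function names are hypothetical.

```python
from math import prod


def path_recommendation(direct, paths):
    """Path-based variant: product of direct trust along each path, averaged.

    Each path is a list [X, ..., u, Y] of partner names.
    """
    if not paths:
        return 0.0
    total = sum(
        prod(direct[a][b] for a, b in zip(path, path[1:])) for path in paths
    )
    return total / len(paths)


def prejudiced_recommendation(identity, direct, y, unknown):
    """Prejudiced variant: identity trust in u times u's direct trust in Y, averaged."""
    if not unknown:
        return 0.0
    total = sum(identity[u] * direct[u][y] for u in unknown)
    return total / len(unknown)
```

The prejudiced variant needs only the recommender's own experience with Y and its certificate distance from X, so it avoids relying on the subjective ratings of every intermediate partner on the path.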

Users need the flexibility to weight the different sources for the total trust differently in different situations. For this purpose, we define the vector of all trust sources an interaction partner X may use for rating an interaction partner Y as:


$$T^S_X(Y) := \left( T^B_{X,D=0},\; R_{X,D=1},\; R_{X,D>1},\; R^p_{X,D>1} \right) \qquad (6)$$

Now, interaction partner X can calculate the resulting normalized trust to put into an interaction partner Y using a profile vector $\vec{P} \in [0,1]^4$:

$$\vec{P} = \left( w_{D=0},\; w_{D=1},\; w_{D>1},\; w^p_{D>1} \right) \qquad (7)$$

where the relative weights express direct experience (D=0), indirect experience of “known” sources (D=1), experiences of “unknown” sources (D>1) without prejudice and, as the last weight, experiences of “unknown” sources with prejudice about how they have gathered their experience with interaction partner Y. The weights can have values in [0,1]. Having defined the relative weights for the different sources, interaction partner X can calculate the resulting normalized trust to put into an interaction partner Y:

$$T^B_X(Y) = \frac{T^S_X \cdot \vec{P}_X}{|\vec{P}_X|} \qquad (8)$$
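The combination of the trust sources with a profile can be sketched in Python as follows. The sketch assumes that the normalization in equation (8) divides by the sum of the weights (an L1 norm), which keeps the result in [0,1] when all sources and weights lie in [0,1]; the text does not state which norm is meant. The function names are hypothetical, and the final step applies equation (1).

```python
def behaviour_trust(sources, weights):
    """Weighted combination of the trust sources, normalized by the weight sum.

    sources: (direct, known, unknown, unknown_prejudiced), each in [0, 1]
    weights: the profile vector, same length, each weight in [0, 1]
    """
    if len(sources) != len(weights):
        raise ValueError("sources and weights must have the same length")
    norm = sum(weights)  # assumption: L1 norm of the profile vector
    if norm == 0:
        return 0.0       # an all-zero profile expresses no trust source at all
    return sum(s * w for s, w in zip(sources, weights)) / norm


def overall_trust(identity, sources, weights):
    """Equation (1): identity trust multiplied by behaviour trust."""
    return identity * behaviour_trust(sources, weights)
```

A profile of (1, 1, 0, 0), for instance, ignores all unknown sources and weights direct experience and known recommenders equally.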

The resulting trust value is only used in the decision to interact with a certain interaction partner Y. It does not affect the experience value, because this value only depends on the outcome of the subsequent interaction. 3.2 First-Trust Problem Consider the situation when a user completely new to a Grid environment enters the network. He or she has no personal experience with any of the service providers, therefore all of the trust sources referenced in (3) are equal to 0. The usual strategies for selecting a partner do not apply in this situation. We distinguish two different basic strategies for “initializing” trust. One is a rather open approach to assign an initial trust value slightly above the minimal trust threshold to every partner, effectively giving the partner a chance to prove itself trustworthy without prior verification. We refer to this method as “warm start phase”. In contrast, there might be scenarios in which a partner is tested by performing validation checks and deriving initial behaviour trust from these interactions. Obviously, this trust establishment phase through a “cold start” comes at a comparably high price. The problem of establishing first trust may be seen both from a service consumer as well as a service provider point of view. We believe that a trust management environment for service-oriented Grids must be flexible enough to allow specification of the strategy to be used in either role and on an application basis. In addition to these two basic strategies, further strategies for first trust establishment may be specified in the system. 3.3 Verification Techniques It might be desirable to verify that a particular partner stands up to an assumed or previously offered behaviour. The extent to which verification is performed may vary depending on application scenarios or various user requirements. Also, the need for verification of the partners’ behaviour may arise in both roles (i.e. 
consumer or provider) of a service consumption scenario. Partners will continuously monitor the interaction process among each other, and in case of discovered anomalies in the


behaviour of the other, the consumers and/or providers will reorganize their scheduling or access policies accordingly. The different aspects of the partners’ behaviour (e.g. availability, response time, accuracy, etc.) are criteria for developing verification strategies. In the following, we will only consider the accuracy of the responses coming from a service provider as an example and refer to this dimension as behaviour trust for brevity (note, however, that this is only one dimension of behaviour trust to be considered between partners). The strategy to use for the verification of the accuracy of responses to be expected from one provider may vary depending on certain constraints such as the additional acceptable cost for performing the verification operations. The following verification strategies can be applied: Challenge with known tasks. A service consumer may prepare a particular test challenge for which it knows the correct result. In this case, the consumer can directly verify if a service provider was able to calculate the correct response for this challenge. However, a malicious provider may be able to guess the correct answers to the set of challenges and thereby manipulate the behaviour trust assigned to it, as the computational resources of the consumer for test preparation may be very limited. Best of n replies. A more feasible verification technique is similar to the one that is used by SETI@HOME (http://setiathome.ssl.berkeley.edu). The validity of the computed results is checked by letting different entities work on the same data unit. At the end, a majority vote establishes the correct response. This technique results in a significant increase of the total computational costs to solve a particular problem, since the overall workload is multiplied by the replication factor used to reassign subtasks. 
If there is a service charge collected by a service provider for every processed work unit, this also leads to a considerable increase of the overall cost.

Human in the loop. In some applications, it might be impossible to construct automatic verification or result rating modules. In such cases, it can still be helpful to involve human users in the process of verification. This technique relies on presenting the results to the user, offering the ability to directly express a value to be taken into the trust calculation.

In our approach, it is possible for each of the partners to develop their personalized trust preferences towards the interaction partners. These preferences include the initialization values that the user is willing to assign to each of the new partners, the selection of sources for getting trust information from (recommendations), the interaction partners the participant collaborates with and verification strategies for all the trust elements. The consumer may choose between verifying the accuracy of every single answer coming from the provider (“trust no one”) or verifying the accuracy of only a part of the responses coming from the provider (“optimistic trust”). In order to minimize added costs, we propose to couple the frequency of this partial verification technique with the behaviour trust associated with a particular partner in the environment. This relationship is expressed by:

$$f = -\left( (1 - V_{min}) \cdot T^B_{last} \right) + 1 \qquad (9)$$

where $V_{min}$ is the minimal verification rate set by the consumer and $T^B_{last}$ represents the trust value of a provider at a certain moment of time.


E. Papalilo et al.

From the consumer side this means that for a non-trusted provider every single response is verified and for a fully trusted provider only a minimum of the responses coming from that specific provider has to be verified. The result of the verification operations will directly be used to alter the behaviour trust regarding accuracy.
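The coupling between verification frequency and behaviour trust in equation (9) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the trust value is assumed to be normalized to [0, 1] (as suggested by the "non-trusted" and "fully trusted" extremes), and the random sampling in `should_verify` is one possible way to turn the frequency into a per-response decision, not something the text prescribes.

```python
import random


def verification_frequency(t_last: float, v_min: float) -> float:
    """Fraction of responses to verify, f = 1 - (1 - v_min) * t_last (equation 9).

    t_last: behaviour trust of the provider, assumed to lie in [0, 1].
    v_min:  minimal verification rate chosen by the consumer, in [0, 1].
    """
    if not 0.0 <= t_last <= 1.0:
        raise ValueError("t_last is assumed to lie in [0, 1]")
    if not 0.0 <= v_min <= 1.0:
        raise ValueError("v_min must lie in [0, 1]")
    return 1.0 - (1.0 - v_min) * t_last


def should_verify(t_last: float, v_min: float, rng=random.random) -> bool:
    """Hypothetical helper: verify a response with probability f."""
    return rng() < verification_frequency(t_last, v_min)
```

With this reading, a provider with trust 0 is always verified, while a provider with trust 1 is verified at the rate `v_min`.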

4 Application Scenarios

We now present two cases, one from media sciences and one from medicine, to illustrate how different application requirements may arise depending on the field of application.

Grid Services for Video Cut Detection. An example from video content analysis is the process of identifying “cuts” in videos [13]. To identify such cuts, the video file is split and all parts are sent to a manager, which assigns the jobs (split video files) to remote services that will take care of the analysis and the identification of the cuts. After all the jobs have been processed, the resulting cut lists are merged together, some boundary processing takes place in the manager, and the user is notified that the work has been finished. Video analysis is a collaborative task with moderate security requirements and even moderate requirements on the accuracy of results. In this case, an open attitude accepting recommendations from strangers and requiring only occasional verification of the data returned from individual data processing services may be sufficient to satisfy the users of the application.

Grid Services for Medical Imaging. An example from medical imaging is finding similar cases (images) in a set of mammograms [14] for diagnosis support. Images resulting from the mammography are passed to remote services representing image analysis algorithms which calculate similarities in image sets. These image analysis algorithms may be basically similar to those used in the identification of video cuts or in object recognition tasks within frames of video sequences. However, different standards are required in the medical scenario compared to the video application:

− The radiologist may be interested in simply performing standard image processing techniques on generated data sets without placing any special trust requirements on the interaction partners. A subset of images stored on a remote image server is retrieved, and viewing and analysis of image data on the local workstation is performed. Only occasional verification of the data returned from individual data processing services may be required.
− The analysis can be extended to the application of several different statistical analysis methods on data from multiple different studies to view the effects of a treatment method across a patient group. In this case, trusting the accuracy of the responses coming from the interaction partners is important. The radiologist may consider only own experience or recommendations from highly trusted sources for the selection of partners. A high frequency of verification of the data returned from the individual data processing services is required.

While the basic application is similar in both cases (i.e. applying a number of image processing algorithms to a set of images and merging the individual results), the domain specific trust requirements lead to different application behaviour.

Trust Shaping: Adapting Trust Establishment and Management


Furthermore, the initialization of trust values in the above cases will vary. While it is feasible to assume a certain level of trust for new participants (assigning a positive initial trust value > 0 to them) in the video application, a more cautious behaviour is appropriate in the medical application (assigning a positive but small initial trust value ε to them). Another interesting aspect in the trust calculation is the selection of trust sources. In the video application, the user has a very open attitude, accepting recommendations from other parties (even unknown sources) with a small bias against the less known recommenders and thus may choose a trust source vector of (1,0.5,0.25,0.25). This means that personal experience makes up 50% of the trust value, direct recommendation accounts for 25% and the more remote recommendation sources for 12.5% each. In the medical application, only highly trusted sources for the decision are desired, therefore the trust source vector could be (0.75,0.25,0,0) meaning that personal experience makes up 75% of the trust value, recommendations coming from directly known parties enter into the trust calculation with 25%, and other trust sources are disregarded. As a verification strategy, a user may opt for verifying every result by choosing $V_{min} = 1$ in the medical application, while in the video application only a moderate value of $V_{min} = 0.3$ will be chosen, leading to a verification ratio of $1 - 0.7 \cdot T^{B}_{last}$.
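The percentages quoted for the two trust source vectors follow from normalizing each vector by its sum. The sketch below reproduces that arithmetic; `source_shares` and `combined_trust` are hypothetical names, and combining the per-source trust values as a weighted average is an assumption made for illustration, not a formula taken from the text.

```python
def source_shares(weights):
    """Normalize a trust source vector so that its entries sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one trust source needs a positive weight")
    return [w / total for w in weights]


def combined_trust(weights, values):
    """Assumed combination rule: weighted average of per-source trust values."""
    if len(weights) != len(values):
        raise ValueError("one trust value per source is required")
    return sum(s * v for s, v in zip(source_shares(weights), values))


# Video application: (1, 0.5, 0.25, 0.25) gives shares of 50%, 25%, 12.5%, 12.5%.
video_shares = source_shares([1, 0.5, 0.25, 0.25])
# Medical application: (0.75, 0.25, 0, 0) gives shares of 75%, 25%, 0%, 0%.
medical_shares = source_shares([0.75, 0.25, 0, 0])
```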

5 System Architecture A system architecture supporting trust management in service-oriented Grid applications is presented in Fig. 1. The system consists of two main components, the trust engine and the verification engine. The trust engine manages trust values and offers partner discovery and rating functionality to higher level applications, such as workflow engines or job scheduling systems. The verification engine handles the verification of Grid service results and generates the necessary feedback for the trust engine regarding the partner. For brevity, we will focus our discussion on the service consumer use of those platform components.

Fig. 1. Architecture of a grid system supporting our trust model


The user starts with specifying his or her trust requirements along with the input data to a trust enabled Grid application (step 1), which in turn uses the workflow engine of the local service-oriented Grid platform (step 2). To enable the selection of trusted services, the decision is made based on a rated list of potential partner services that is obtained from the trust engine (step 3). The trust engine uses its service discovery component to discover individual services (step 4) and to collect recommendations from other trust engines (step 5). These values are stored in the local trust pool to be used in subsequent interactions. The user specified trust profile is also stored in a trust pool for later reference and use by other components in the trust engine. The information gathered by the trust engine is now processed according to the user’s trust profile specification and passed on to the workflow engine which then can use the partner services according to the rating generated by the trust engine. Invocation of external services is then delegated to an invocation handler (step 6). The invocation handler consults the verification engine (step 7) to determine whether a call has to be replicated or redirected (e.g. to perform the best of n verification strategy). The verification engine considers the trust profile managed by the trust engine (step 7), allowing, for example, cost-trust-ratio relations to be taken into account. The resulting invocation is carried out at the selected partner services and results - both synchronous and asynchronous (notification) results - are then collected by the invocation handler (step 8) and verified through the verification engine, using a strategy and verification module consistent with the user supplied trust profile (step 9). The overall result of this process is then passed to the workflow engine that collects results for the application to present them to the end user. 
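The best of n strategy that the invocation handler and verification engine may apply to replicated calls can be sketched as a majority vote. This is a minimal illustration under stated assumptions: provider identifiers and the `best_of_n` function are hypothetical, results are assumed to be directly comparable (hashable) values, and a strict majority is required before any result is accepted.

```python
from collections import Counter


def best_of_n(replies):
    """Majority vote over replicated invocations.

    replies maps a (hypothetical) provider id to the result it returned.
    Returns (result, agreeing, dissenting); result is None when no value
    is returned by a strict majority of the providers.
    """
    if not replies:
        raise ValueError("at least one reply is required")
    counts = Counter(replies.values())
    result, votes = counts.most_common(1)[0]
    if 2 * votes <= len(replies):
        return None, [], []
    agreeing = [p for p, r in replies.items() if r == result]
    dissenting = [p for p, r in replies.items() if r != result]
    return result, agreeing, dissenting
```

The two provider lists are the kind of feedback a verification engine could pass on to the trust engine: agreement would raise, and disagreement lower, the behaviour trust regarding accuracy.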
The configuration of the trust engine by use of trust requirement profiles influences three phases during execution of an application workflow. These main phases are addressed by the three arrows in Fig. 2. The initialization profile determines the influence and scope of factors for initializing the trust values to be used in an interaction. It allows trust values to be assigned manually to certain partners and specifies how trust recommendations of partners are handled and weighted. This profile specifies the behaviour of the local platform in a situation that requires the establishment of first trust.

Fig. 2. Trust profile elements influencing the stored trust values and application decisions


The source selection profile determines the selection of behaviour trust dimensions (e.g. availability, accuracy) as well as trust sources (e.g. personal experience, recommendations from directly known partners) to determine a partner ranking according to the application needs. This allows a user to take accuracy trust recommendations from known partners into account with a higher weight than, for example, availability values (which might be caused by the different network locations) coming from the same partner. The verification profile specifies which verification strategies are to be applied to the results of partner service invocations and the feedback parameters into the trust engine. In this profile, the user specifies how breaches of assumed service level agreements should influence the future interactions with a partner since they are fed back into the trust store for this application and partner service. This profile also dynamically determines the frequency of verification to allow a fine grained control over costs incurred by result verification.

6 Conclusions In this paper, a flexible trust model and system architecture for collecting and managing multidimensional trust values have been presented. Both identity and behaviour trust of the interaction partners were considered, and different sources were used to determine the overall trust value of a partner. A proposal for establishing the first trust between interaction partners was made, and the possibility to monitor the partners' behaviour trust during an interaction has been provided. Our trust system can be configured to the domain specific trust requirements by the use of several separate trust profiles covering the entire lifecycle of trust establishment and management. There are several areas of future work. First, the implementation of the proposed trust system architecture needs to be finished. Second, experiments involving a large number of participants in a Grid environment should be performed to evaluate the properties of our approach in real-life scenarios. Third, the system architecture should be extended and adapted to the needs of the service providers. Finally, further research will concentrate on increasing the security of the interaction among participants, especially message level security through using XML encryption. Acknowledgements. This work is financially supported by Siemens AG (Corporate Technology), by DFG (SFB/FK 615, Teilprojekt MT), and by BMBF (D-Grid Initiative).

References

1. Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. on Supercomputer Applications 15 (2001)
2. Azzedin, F., Maheswaran, M.: Evolving and Managing Trust in Grid Computing Systems. In: Conference on Electrical and Computer Engineering, Canada, IEEE (2002) 1424-1429
3. Sabater, J., Sierra, C.: Review on Computational Trust and Reputation Models. Artificial Intelligence Review 24 (2005) 33-60
4. Hwang, K., Tanachaiwiwat, S.: Trust Models and NetShield Architecture for Securing Grid Computing. Journal of Grid Computing (2003)
5. Goel, S., Sobolewski, M.: Trust and Security in Enterprise Grid Computing Environment. In: Proc. of IASTED Int. Conf. on Communication, Network and Information Security, New York, USA (2003)
6. Thompson, M.R., Olson, D., Cowles, R., Mullen, S., Helm, M.: CA-Based Trust Model for Grid Authentication and Identity Delegation. Global Grid Forum CA Operations WG Community Practices Document (2002)
7. Lin, C., Varadharajan, V., Wang, Y., Pruthi, V.: Enhancing Grid Security with Trust Management. In: Proceedings of the IEEE International Conference on Services Computing (SCC), Shanghai, China, IEEE (2004) 303-310
8. Quillinan, T., Clayton, B., Foley, S.: GridAdmin: Decentralizing Grid Administration Using Trust Management. In: Third International Symposium on Parallel and Distributed Computing (ISPDC/HeteroPar), Cork, Ireland (2004) 184-192
9. Patel, J., Teacy, W.T.L., Jennings, N.R., Luck, M.: A Probabilistic Trust Model for Handling Inaccurate Reputation Sources. In: Proceedings of Third International Conference on Trust Management, Rocquencourt, France (2005) 193-209
10. Tie-Yan, L., Huafei, Z., Kwok-Yan, L.: A Novel Two-Level Trust Model for Grid. In: Proceedings of Fifth International Conference on Information and Communications Security (ICICS), Huhehaote, China (2003) 214-225
11. Erdös: Erdös Number Project. (2005) http://www.oakland.edu/enp/.
12. Ali, A.S., Rana, O., Walker, D.W.: WS-QoC: Measuring Quality of Service Compliance. In: Proceeding of the Second International Conference on Service-Oriented Computing Short Papers (ICSOC), New York, USA (2004) 16-25
13. Ewerth, R., Friese, T., Grube, M., Freisleben, B.: Grid Services for Distributed Video Cut Detection. In: Proceedings of the Sixth IEEE International Symposium on Multimedia Software Engineering, Miami, USA, IEEE (2004) 164-168
14. Amendolia, S., Estrella, F., Hassan, W., Hauer, T., Manset, D., McClatchey, R., Rogulin, D., Solomonides, T.: MammoGrid: A Service Oriented Architecture Based Medical Grid Application. In: Proceedings of the Third International Conference on Grid and Cooperative Computing, Wuhan, China (2004) 939-942

SVM Approach with CTNT to Detect DDoS Attacks in Grid Computing*

Jungtaek Seo1, Cheolho Lee1, Taeshik Shon2, and Jongsub Moon2

1 National Security Research Institute, KT 463-1, Jeonmin-dong, Yuseong-gu, Daejeon, 305-811, Republic of Korea
{seojt, chlee}@etri.re.kr
2 CIST, Korea University, 1-Ga, Anam-dong, Sungbuk-Gu, Seoul, Republic of Korea
{743zh2k, jsmoon}@korea.ac.kr

Abstract. In the last several years, DDoS attack methods have become more sophisticated and effective, and DDoS attacks have therefore become more difficult to detect. To cope with this problem, there has been much research on DDoS detection mechanisms. However, the common shortcoming of the previous detection mechanisms is that they cannot detect new attacks. In this paper, we propose a new DDoS detection model based on Support Vector Machine (SVM). The proposed model uses SVM to automatically detect new DDoS attacks and uses Concentration Tendency of Network Traffic (CTNT) to analyze the characteristics of network traffic for DDoS attacks. Experimental results show that the proposed model can be highly useful for detecting various DDoS attacks.

1 Introduction

Over the last several years, a new breed of network worms such as CodeRed, SQL Slammer, and Nimda have launched widespread attacks on commercial web sites such as Yahoo, CNN, and Amazon [1], [2], [3]. These incidents temporarily disable network services or damage systems by flooding them with a huge number of network packets for several minutes or longer. These attacks are harmful to almost all networked systems, especially open resource sites such as computational grids, which could well be the next wave of targets. Thus, now more than ever, we need to provide a secure grid computing environment over the Internet. Recently, many security models and systems have been developed for secure Grid computing in the encryption and authentication areas. However, there is little research on the availability of grid computing, even though malicious intrusions may easily destroy most of the valuable hosts, network, and storage resources in the grids. The vulnerability of grid computing to DDoS attacks arises from each grid’s limited resources

* This work was supported by the Ministry of Information Communication, Korea, under the Information Technology Research Center Support Program supervised by the IITA.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 59 – 70, 2005. © Springer-Verlag Berlin Heidelberg 2005


J. Seo et al.

that can be exhausted by a sufficient number of users. Thus, most flooding attacks that exhaust the resources (e.g., network bandwidth, computing power, and storage) of the victim are possible in a grid [4]. To cope with this threat, there has been much research on defense mechanisms, including various DDoS detection mechanisms [5], [6], [7]. However, the common shortcoming of the previous detection mechanisms is that they cannot automatically detect new attacks. To solve this problem, we adopt the Concentration Tendency of Network Traffic (CTNT) method and the Support Vector Machine (SVM) method [8], [9], [10], [11]. In our earlier research, we presented the CTNT method to analyze the characteristics of network traffic for DDoS attacks [12]. CTNT monitors the ratio of a specific type of packet to the total amount of network packets, and computes the TCP flag rate and the protocol rate. The result of analyzing network traffic using CTNT showed that there are predictable differences between normal traffic and DDoS attack traffic. In a normal situation, for instance, SYN and FIN are in the ratio of 1:1 and TCP and UDP traffic are in the ratio of 9:1. However, in an abnormal situation (e.g., SYN flooding, UDP flooding), these ratios are broken. Thus, CTNT is a good feature extraction method for SVM, which is a machine learning algorithm. Using the traffic analysis result as the input feature of SVM, we were able to automatically generate effective DDoS detection rules. Our detection model showed a high degree of performance, and detected various DDoS attacks successfully. We introduce related research in section 2, and explain CTNT in section 3. The background knowledge of SVM is discussed in section 4. In section 5, the experimental environment is introduced and the detection performance of SVM and other machine learning algorithms is tested and compared. Lastly, we present the conclusion of this research and the direction of future work in section 6.

2 Related Work

There has been some research on defending Grid environments against DDoS attacks. Xiang and Zhou introduce the vulnerability of grids to DDoS attacks and propose a distributed defense system to protect grids from DDoS attacks. The proposed system is built with a protecting communication network, an attack control system, an intrusion surveillance system, and a traceback system [4]. Hwang et al. proposed the GridSec architecture. The architecture is built with distributed firewalls, packet filters, security managers, and an intrusion detection and response system [13]. It adopts a DHT-based overlay architecture as its backbone. Its overlay network maintains a robust virtual inter-networking topology. The GridSec system functions as a cooperative anomaly and intrusion detection system. Intrusion information is exchanged over the overlay topology with confidentiality and integrity. The system reacts automatically to malicious intrusions and attacks using this information. Traditional defense mechanisms against DDoS attacks include defending, detecting, and reacting mechanisms. Above all, there has been much research on detecting DDoS attacks, since detecting the DDoS attacks is an essential step to defend against


DDoS attacks [5], [6], [7]. When DDoS attacks occur, there is a big mismatch between the packet flows “to-rate” toward the victim and “from-rate” from the victim. Gil et al. propose a method that examines the disproportion between “to-rate” and “from-rate” in order to detect DDoS attacks [5]. Kulkarni et al. [6] present DDoS detection methods based on the randomness of IP spoofing. Almost all DDoS attackers use IP spoofing to hide their real IP addresses and locations. Since spoofed IP addresses are generated randomly, this characteristic of randomness may be used to reveal the occurrence of DDoS attacks. Kulkarni’s method uses Kolmogorov complexity metrics to measure the randomness of source IP addresses in network packet headers [14]. Wang et al. proposed a method that detects DDoS attacks based on the protocol behavior of SYN-FIN(RST) pairs [7]. In the normal situation, the ratio of SYN and FIN is balanced because of the characteristic of the TCP 3-Way handshake. However, the ratio of SYN packets increases drastically during a SYN flooding attack. By monitoring sudden changes of the ratio of SYN and FIN, the method detects SYN flooding attacks. However, these approaches are based on specific characteristics of the attacks, such as a mismatch of “to-rate” and “from-rate”, the effect of IP spoofing, and an imbalance of the ratio of SYN and FIN packets. Thus, they may not properly detect attacks that use an undefined characteristic. For example, Gil’s method is not applicable to detecting attacks that use IP spoofing, since the method cannot discriminate between legitimate and spoofed packets, and Wang’s method is only applicable to SYN flooding attacks. On the other hand, the proposed detection model automatically generates detection rules using CTNT and SVM. The advantage of the proposed model will be discussed in section 5.

3 Concentration Tendency of Network Traffic

In a normal situation, the network traffic rate has specific characteristics. For example, SYN and FIN are in the ratio of 1:1 and TCP and UDP traffic are in the ratio of 9:1. However, in an abnormal situation such as SYN flooding and UDP flooding, these ratios are broken. Using this fact, the CTNT method can effectively discriminate between normal and abnormal situations. Thus, the traffic analysis result obtained with the CTNT method is a good input feature for SVM. Details of CTNT and the differences between normal traffic and attack traffic are explained in sections 3.1 and 3.2.

3.1 Concentration Tendency of Network Traffic

The Internet provides the users with various types of network services such as WWW, E-mail, FTP, P2P, streaming services, and so on. However, network traffic found at endpoint servers has specific characteristics according to the services they provide. CTNT (Concentration Tendency of Network Traffic) is defined as a phenomenon in which network traffic is mainly composed of one or more specific types of network packets. For instance, almost all TCP packets have ACK flags in their


headers during their connection sessions. Moreover, since the Internet has dominant network services such as WWW, E-mail, FTP which are dependent on specific network protocols, CTNT can be found not only on endpoint clients and servers but also on core backbone networks [15]. To analyze network traffic on Web servers when various types of DDoS attacks occur, we consider CTNT as the TCP flag rate and the IP protocol rate [10]. They examine the occurrence rate of a specific type of packet within the stream of monitored network traffic. TCP flag rate is defined in the following equation.

$$ R_{td}[F_{i|o}] = \frac{\sum \text{flag}(F) \text{ in a TCP header}}{\sum \text{TCP packets}} \tag{1} $$

TCP flag rate means the ratio of the number of a specific TCP flag to the total number of TCP packets. In equation (1), a TCP flag ‘F’ can be one of SYN, FIN, RST, ACK, PSH, URG, and NULL, and ‘td’ is the time interval used to calculate the value. In this paper, we omit the time duration ‘td’ when the interval is one. The direction of network traffic is expressed as ‘i’ (inbound) and ‘o’ (outbound).

$$ R_{td}[[TCP|UDP|ICMP]_{i|o}] = \frac{\sum [TCP|UDP|ICMP] \text{ packets}}{\sum \text{IP packets}} \tag{2} $$

IP protocol rate is defined in equation (2). It means the ratio of specific Transport-Layer protocol (e.g. TCP, UDP, and ICMP) packets to total Network-Layer (IP) protocol packets.

3.2 Network Traffic Changes Under DDoS Attacks

In this section, we analyze normal Web traffic and DDoS attack traffic using CTNT and show the differences between them. Since web service is based on TCP connections, the number of HTTP requests in a TCP session (R/C: Requests per Connection) and the number of TCP sessions simultaneously established (SC: Simultaneous Connections) are the key features of web traffic in terms of network traffic analysis. Thus, we can simulate various web traffic environments by adjusting these two features (R/C and SC). In the experiments, we used SPECweb99 as the web traffic generating tool [16]. It sends HTTP requests to the Web server and receives HTTP replies from the Web server as real Web browsers do. Fig. 1 shows the experimental results of SPECweb99. We changed SC to 5, 10, 50, 100, and 150, and R/C to 1, 2, 5, and 10. As a result, the experiments show that normal Web service traffic has a constant pattern regardless of SC, R/C, and time. The resulting rates of SYN and FIN are almost identical. The other distinguishing result is that the rate of ACK is very high. This is because HTTP is based on TCP, which is a connection-oriented protocol. These results show that network traffic of normal Web services has a specific pattern.

Fig. 1. Web service traffic (average value) using SPECweb99: (a) Inbound Traffic, (b) Outbound Traffic

Fig. 2 shows the change of network traffic when a SYN flooding attack occurs. We generated Web service traffic for 72 seconds, starting 10 seconds after the start of the simulation, and a SYN flooding attack was generated for 40 seconds, starting 17 seconds after the start of the Web service traffic. As shown in Fig. 2-(a), the rates of SYN and URG increased to almost 1.0 and the rates of other flags, especially the ACK rate, decreased to almost 0.0 during SYN flooding attacks.

Fig. 2. SYN flooding attacks against the Web server: (a) Inbound TCP flag rate, (b) Outbound TCP flag rate. Under SYN flooding attacks, the rates of SYN and ACK of inbound traffic change significantly.

Furthermore, we can also see big changes of network traffic during other types of DDoS attacks such as ICMP flooding attacks or UDP flooding attacks [8], [9].
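The TCP flag rate and IP protocol rate of equations (1) and (2) can be sketched for a window of captured packets as follows. The packet representation (a `(protocol, flags)` tuple per packet, with flags given as a set of flag names) and the function names are assumptions made for this illustration; a real monitor would obtain these fields from a capture library and would compute the rates separately for inbound and outbound traffic.

```python
from collections import Counter


def tcp_flag_rates(packets):
    """Equation (1): per-flag count divided by the number of TCP packets.

    packets: iterable of (protocol, flags) tuples observed in one interval.
    A TCP packet without any flag set is counted under the name "NULL".
    """
    tcp_flag_sets = [flags or {"NULL"} for proto, flags in packets if proto == "TCP"]
    if not tcp_flag_sets:
        return {}
    counts = Counter(flag for flags in tcp_flag_sets for flag in flags)
    return {flag: n / len(tcp_flag_sets) for flag, n in counts.items()}


def protocol_rates(packets):
    """Equation (2): per-protocol packet count divided by all IP packets."""
    counts = Counter(proto for proto, _ in packets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {proto: n / total for proto, n in counts.items()}


# A tiny hand-made window: four TCP packets and one UDP packet.
window = [
    ("TCP", {"SYN"}),
    ("TCP", {"SYN", "ACK"}),
    ("TCP", {"ACK"}),
    ("TCP", {"FIN", "ACK"}),
    ("UDP", set()),
]
flag_rates = tcp_flag_rates(window)
proto_rates = protocol_rates(window)
```

The resulting dictionaries are the kind of per-interval feature values that could be fed to a classifier; because one packet can carry several flags, the flag rates need not sum to 1.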

4 Support Vector Machine We have chosen Support Vector Machine (SVM) among various other machine learning algorithms as the focus of this paper. Support Vector Machine (SVM) is a learning machine that plots the training vectors in high-dimensional feature space, and labels each vector by its class. SVM classifies data by determining a set of support vectors, which are members of the set of training inputs that outline a hyper plane in feature space. The SVM is based on the idea of structural risk minimization, which


minimizes the generalization error, i.e. true error on unseen examples. The number of free parameters used in the SVM depends on the margin that separates the data points into classes but not on the number of input features. Thus SVM does not require a reduction in the number of features in order to avoid overfitting [17]. Details of SVM are explained in section 4.1.

4.1 SVM for Categorization

In this section we review some basic ideas of SVM. Given the training data set $\{(x_i, d_i)\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^N$ and corresponding binary class labels $d_i \in \{-1, 1\}$, the SVM classifier formulation starts from the following assumption. The classes represented by the subsets $d_i = 1$ and $d_i = -1$ are linearly separable, where $\exists w \in \mathbb{R}^N, b \in \mathbb{R}$ such that

$$ \exists w, b \ \text{s.t.} \ \begin{cases} w^T x_i + b > 0 & \text{for } d_i = +1 \\ w^T x_i + b < 0 & \text{for } d_i = -1 \end{cases} \tag{3} $$

The goal of SVM is to find an optimal hyperplane for which the margin of separation, $\rho$, is maximized. $\rho$ is defined by the separation between the separating hyperplane and the closest data point. If the optimal hyperplane is defined by $w_0^T \cdot x + b_0 = 0$, then the function $g(x) = w_0^T \cdot x + b_0$ gives a measure of the distance from x to the optimal hyperplane.

Support vectors are defined as the data points $x^{(s)}$ that lie closest to the decision surface. For a support vector $x^{(s)}$ and the canonical optimal hyperplane g, we have

$$ r = \frac{g(x^{(s)})}{\|w_0\|} = \begin{cases} +\frac{1}{\|w_0\|} & \text{for } d^{(s)} = +1 \\ -\frac{1}{\|w_0\|} & \text{for } d^{(s)} = -1 \end{cases} \tag{4} $$

Since the margin of separation is $\rho \propto \frac{1}{\|w_0\|}$, $\|w_0\|$ should be minimal to achieve the maximal separation margin. The mathematical formulation for finding the canonical optimal separating hyperplane, given the training data set $\{(x_i, d_i)\}_{i=1}^{N}$, solves the following quadratic problem:

$$ \begin{cases} \min \tau(w, \zeta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \zeta_i \\ \text{s.t. } d_i (w^T x_i + b) \ge 1 - \zeta_i \ \text{for } \zeta_i \ge 0, \ i = 1, \ldots, l \end{cases} \tag{5} $$

Note that the global minimum of the above problem must exist, because $\Phi(w) = \frac{1}{2}\|w_0\|^2$ is convex in w and the constraints are linear in w and b. This constrained optimization problem is dealt with by introducing Lagrange multipliers $a_i \ge 0$ and a Lagrangian function given by

$$ L(w, b, \zeta, a, v) = \tau(w, \zeta) - \sum_{i=1}^{l} a_i \left[ d_i (w^T x_i + b) - 1 + \zeta_i \right] - \sum_{i=1}^{l} v_i \zeta_i \tag{6} $$

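which leads to

$$ \frac{\partial L}{\partial w} = 0 \iff w - \sum_{i=1}^{l} a_i d_i x_i = 0 \quad \left(\therefore \ w = \sum_{i=1}^{l} a_i d_i x_i\right) \tag{7} $$

$$ \frac{\partial L}{\partial b} = 0 \iff \sum_{i=1}^{l} a_i d_i = 0 \tag{8} $$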

The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose $a_i$ is non-zero, called Support Vectors. By the Karush-Kuhn-Tucker complementarity conditions, we have

$$ a_i \left[ d_i (w^T x_i + b) - 1 \right] = 0 \quad \text{for } i = 1, \ldots, N \tag{9} $$
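Equations (7) and (9) can be exercised on a toy problem. The sketch below rebuilds the weight vector from given multipliers via (7) and recovers the bias from one support vector via the complementarity condition (9); the two-point data set, the multiplier values (worked out by hand for this data set) and all function names are illustrative assumptions, and no optimizer is involved.

```python
def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def weight_vector(a, d, xs):
    """Equation (7): w = sum_i a_i * d_i * x_i."""
    dim = len(xs[0])
    return [sum(a[i] * d[i] * xs[i][k] for i in range(len(xs))) for k in range(dim)]


def bias_from_support_vector(w, x_s, d_s):
    """Equation (9) with a_s > 0 gives d_s * (w.x_s + b) = 1.

    Because d_s is +1 or -1, this is equivalent to b = d_s - w.x_s.
    """
    return d_s - dot(w, x_s)


def classify(w, b, x):
    """Sign of w.x + b, i.e. the side of the hyperplane on which x lies."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy data: one point per class, both of which are support vectors.
xs = [(1.0, 1.0), (-1.0, -1.0)]
d = [1, -1]
a = [0.25, 0.25]  # hand-derived multipliers for this data set; sum_i a_i d_i = 0

w = weight_vector(a, d, xs)
b = bias_from_support_vector(w, xs[0], d[0])
```

For this data set both training points satisfy $d_i (w^T x_i + b) = 1$, so either of them could be used to recover b.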

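By substituting (7), (8) and (9) into equation (6), find multipliers $a_i$ for which

$$ \max \Theta(a) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} a_i a_j d_i d_j \, x_i \cdot x_j \tag{10} $$

$$ \text{s.t. } 0 \le a_i \le C, \ i = 1, \ldots, l \ \text{ and } \ \sum_{i=1}^{l} a_i y_i = 0 \tag{11} $$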

The hyperplane decision function can thus be written as

$$ f(x) = \mathrm{sgn}\left( \sum y_i a_i \cdot (x \cdot x_i) + b \right) \tag{12} $$

where b is computed using (9). To construct the SVM, the optimal hyperplane algorithm has to be augmented by a method for computing dot products in feature spaces nonlinearly related to input space. The basic idea is to map the data into some other dot product space (called the feature space) F via a nonlinear map Φ, and to perform the above linear algorithm in F, i.e. for nonseparable data $\{(x_i, d_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^N$, $d_i \in \{+1, -1\}$, preprocess the data with $\Phi : \mathbb{R}^N \to \Theta(x)$ where N

q, we say that Sip is an upgraded version of Siq. When a user submits a new application Ak (Ak ∉ SASys) and Ak is inserted into SASys, we call it the start of application Ak. When a service Sjp (Sjp ∈ GSk) is installed in GSSys, we call it the installation of Sjp. The installation of Sjp is more complex, because it concerns how to install Sjp into GSSys. The problem is how to install these services in GSM, and what conditions can guarantee a safe (no existing application is damaged) or successful (newly started applications work properly) configuration of different forms with the given realize(Ak, GSk).

3.2 Installation Strategies

Installation Strategy ISjp determines how grid service Sjp is installed into GSSys. We have defined four kinds of installation strategies in FGSM:

(1) DIS (Direct Installation Strategy), ISjp=DIS: Directly insert Sjp into GSSys as a new element.
(2) TIS (Total Installation Strategy), ISjp=TIS: If (GSF(j) ∩ GSSys) ≠ ∅, then replace all grid services in (GSF(j) ∩ GSSys) by Sjp; if (GSF(j) ∩ GSSys) = ∅, the result of TIS is equal to that of DIS.
(3) INS (Installation of New-version Strategy), ISjp=INS: If (GSF(j) ∩ GSSys) ≠ ∅, replace grid services in {Sjr | Sjr ∈ (GSF(j) ∩ GSSys) and p>r} by Sjp; if (GSF(j) ∩ GSSys) = ∅, the result of INS is equal to that of DIS.
(4) NIS (None Installation Strategy), ISjp=NIS: Do nothing.
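The four installation strategies can be sketched on a set-based model of the installed services. Several points are assumptions made for this illustration: a service Sjp is represented as a `(family, version)` tuple, the condition in TIS and INS is read as "the system already holds services of the same family", and INS leaves the system unchanged when no older version of that family is installed. The function name `install` is hypothetical.

```python
def install(gs_sys, service, strategy):
    """Return the new set of installed services after applying a strategy.

    gs_sys:   set of (family, version) tuples currently installed.
    service:  the (family, version) tuple to install.
    strategy: one of "DIS", "TIS", "INS", "NIS".
    """
    family, version = service
    result = set(gs_sys)
    # Installed services that belong to the same family as the new service.
    same_family = {s for s in gs_sys if s[0] == family}

    if strategy == "NIS":
        return result  # do nothing
    if strategy == "DIS" or not same_family:
        result.add(service)  # direct insertion; TIS and INS fall back to this
        return result
    if strategy == "TIS":
        return (result - same_family) | {service}  # replace the whole family
    if strategy == "INS":
        older = {s for s in same_family if s[1] < version}
        if not older:
            return result  # assumption: nothing to upgrade, so nothing changes
        return (result - older) | {service}  # replace only older versions
    raise ValueError(f"unknown installation strategy: {strategy}")
```

As a usage example, with versions 1 and 3 of family "A" installed, installing version 2 under INS replaces version 1 and keeps version 3, whereas TIS leaves version 2 as the only member of the family.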
There are two kinds of installation results for a certain grid service Sjp: a) Sjp is inserted into GSSys; b) Sjp is not inserted into GSSys. In case of a), Sjp can be inserted directly into GSSys or can replace some grid services. If ISjp=DIS or (GSF(j) ∩ GSSys) = ∅, the installation outcome is that Sjp is inserted directly into GSSys. If (GSF(j) ∩ GSSys) ≠ ∅ and ISjp=TIS, the installation outcome is that the grid services in (GSF(j) ∩ GSSys) are replaced by Sjp. Or if (GSF(j) ∩ GSSys) ≠ ∅ and ISjp=INS, the installation outcome is that the grid services in (GSF(j) ∩ GSSys) are replaced by Sjp. In case of b), if ISjp=NIS, or (GSF(j) ∩ GSSys) ≠ ∅, ISjp=INS and Sjr ∈ (GSF(j) ∩ GSSys) (pp). If realize(Ak, GSk’), we say that Sir is the backward compatibility (BC). Definition 4 (forward compatibility). Given realize(Ak, GSk), GSk’ denotes the set in which Sir replace ∀ Sir ∈ GSk (r60 min 50%≤2.4 min, 10%>28.25 min

The session times observed in the works referenced above are summarized in Table 1. They show that session times differ greatly between nodes: for 5-10% of nodes, the session time is more than 6 times the average session time. The three experimental studies above have also shown that the probability distribution of node session time resembles an exponential distribution, which gives us a way to analyze node session time theoretically.

2.2 Model of Node Session Time

To analyze the probability distribution of node session time in a p2p system quantitatively, we model the session time, as in [8], by the cumulative distribution function of an exponential distribution with parameter μ:

\[ F(t) = 1 - e^{-\mu t} \tag{1} \]

F(t) is the probability that the node session time is no more than t, and 1 − F(t) is the probability that the session time of the node is longer than t. With the distribution of node session time modelled, the failure rate of the node at time t can be defined as follows.

Definition 1. R(t), the failure rate at time t, is defined as the probability of the event that the node fails at exactly time t.


F. Hong, M. Li, and J. Yu

The failure rate at time t can be calculated as the probability density that the session ends at t, given that the node is still alive at t, i.e.

\[ R(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \, (1 - F(t))} = \mu \tag{2} \]

Equation 2 shows that the failure rate is a constant independent of the time t, which means that the time at which a node fails does not depend on how long the node has already been in the p2p system. We now compare the probability that a node which has been alive for time x in the p2p system stays alive for no more than a further time y with the probability that a node's session time is no longer than y. To state these two events clearly, we give Definition 2.

Definition 2. P(x, y) is the probability of the event that the node, having been alive for time x in the p2p system, will be alive for no more than time y more.

Therefore, the probability of the event that the node's session time is no longer than y can be calculated as P(0, y).

Theorem 1. P(x, y) = P(0, y).

Proof. P(x, y) is the probability that the node session time is less than x + y, given that it is more than x. The probability that the node session time is more than x + y, given that it is more than x, is (1 − F(x + y)) / (1 − F(x)). Therefore

\[ P(x, y) = 1 - \frac{1 - F(x + y)}{1 - F(x)} = 1 - \frac{1 - (1 - e^{-\mu (x + y)})}{1 - (1 - e^{-\mu x})} = 1 - e^{-\mu y}. \]

Because P(0, y) = F(y) = 1 − e^{−μy}, it follows that P(x, y) = P(0, y).
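Equation 2 and Theorem 1 can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names, the finite-difference step `dt` and the one-hour mean session time are assumptions chosen for illustration.

```python
import math

def cdf(t, mu):
    """Exponential CDF F(t) = 1 - exp(-mu * t), as in Equation (1)."""
    return 1.0 - math.exp(-mu * t)

def failure_rate(t, mu, dt=1e-3):
    """Finite-difference approximation of R(t) in Equation (2)."""
    return (cdf(t + dt, mu) - cdf(t, mu)) / (dt * (1.0 - cdf(t, mu)))

def p_leave_within(x, y, mu):
    """P(x, y): the node has been alive for x and leaves within the next y."""
    return 1.0 - (1.0 - cdf(x + y, mu)) / (1.0 - cdf(x, mu))

mu = 1.0 / 3600.0  # assumed: mean session time of one hour, in seconds

# The failure rate stays close to mu whatever the age of the node.
for age in (0.0, 600.0, 7200.0):
    print(age, failure_rate(age, mu))

# Theorem 1: P(x, y) equals P(0, y) = F(y) for any x.
print(p_leave_within(10000.0, 600.0, mu), cdf(600.0, mu))
```

The finite difference only approximates the limit in Equation 2, so the printed failure rates agree with μ up to a small discretization error.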

Theorem 1 shows that, once a node has been alive in the p2p system for some time x, the probability that it remains in the system for any further time does not depend on x.

Lemma 1. If a node has been in the system for time t, the probability that it will still be in the p2p system for time t more is the same as the probability that a node's session time exceeds t.

Proof. From Theorem 1, P(t, t) = P(0, t) = F(t), so 1 − P(t, t) = 1 − F(t). □

Lemma 1 shows that if a node has been in the p2p system for time t, it will remain in the system for time t more with the same probability with which it survived the first t.

2.3 Distinguishing Nodes of Long Session Time

After analyzing the experimental studies and the theoretical model of node session time, we can design a way to distinguish nodes of long session time from other nodes. As a reminder to the reader, this section is concerned only with

SChord: Handling Churn in Chord by Exploiting Node Session Time


how to distinguish the nodes with long session time; how to exploit them is described in Section 3. First, we restrict the nodes with long session time to the 10% of nodes with the longest session times in the p2p system, and we call this set of nodes the target set. From Equation 1,

\[ F(t_{target}) = 1 - e^{-\mu t_{target}} = 90\% \tag{3} \]

the target time t_target can be calculated as t_target = (1/μ) ln 10 = 2.3026/μ. As the average session time of a node is 1/μ, the session time of these 10% of nodes is more than 2 times the average session time. From Lemma 1 we can conclude that if a node has been alive for ½ t_target, it will live ½ t_target more with the same probability. Therefore, we take t_point = ½ t_target as the distinguishing point for long-session nodes, which means that a node in the target set will take on more functions in the p2p system's routing and maintenance protocols for at least t_point, which is longer than 1/μ. As not every node whose session time has passed t_point will be in the target set, we must let nodes increase their functions in the routing and maintenance protocols gradually. To make the word "gradually" precise, we use another time segment, t_gradual. Let F(t_gradual) = 10%, so t_gradual = (1/μ) ln(10/9) = 0.1054/μ. According to Equation 1 and Theorem 1, t_gradual can be interpreted as follows: a node is still alive after the next time segment of t_gradual with 90% probability. The whole method to distinguish the target set from other nodes can therefore be summarized as follows: when a node has been alive for more than t_point in the p2p system, it begins to play extra roles in the system's routing and maintenance protocols. As long as the node stays alive, its role in these protocols grows each time a further segment of t_gradual passes. The next section explains what extra functions such a node takes on as its alive time increases. The simulation section gives the concrete values of each of the above parameters.
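The three time parameters follow directly from Equation 1. A minimal sketch, assuming a hypothetical helper `session_thresholds` that takes the mean session time 1/μ in any time unit:

```python
import math

def session_thresholds(mean_session_time):
    """Return (t_target, t_point, t_gradual) for a mean session time 1/mu.

    t_target solves F(t) = 90%, t_point is half of t_target,
    and t_gradual solves F(t) = 10%.
    """
    t_target = mean_session_time * math.log(10.0)
    t_point = 0.5 * t_target
    t_gradual = mean_session_time * math.log(10.0 / 9.0)
    return t_target, t_point, t_gradual

# With the mean session time normalized to 1, the results are the
# coefficients of 1/mu used in the text.
t_target, t_point, t_gradual = session_thresholds(1.0)
print(round(t_target, 4), round(t_point, 4), round(t_gradual, 4))
```

With the mean normalized to 1, t_point is about 1.15, which confirms that it is longer than the average session time.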

3 Core Design of SChord

3.1 Virtual Nodes

We believe that if the nodes of long session time contribute more to the routing and maintenance procedures, the whole p2p system will keep its control and routing information more accurate and so handle churn better. We extend the use of virtual nodes [9] to let the nodes of long session time play a larger role. With virtual nodes, one real node can act as multiple virtual nodes, and the Chord protocol operates at the virtual node level. In [9], virtual nodes are used for load balancing on the Chord ring. Here we let the nodes of long session time hold multiple virtual nodes according to a virtual-node


expanding rule. A node with a longer session time will have more virtual nodes. As the Chord protocol operates at the virtual node level, a node with multiple virtual nodes takes a larger part in the routing and maintenance procedures of the p2p system. As the virtual nodes of one real node do not fail until the real node fails, this mechanism builds a comparatively stable part of the Chord ring. Meanwhile, a real node that holds multiple virtual nodes has the control and routing information of all its virtual nodes. As a virtual node has only its own routing information, which is limited compared to that of the real node it belongs to, the virtual node should ask the real node for help in its routing and maintenance procedures. Therefore the common Chord protocol must be changed to make full use of the virtual nodes and real nodes of SChord. We first describe the expanding rule of virtual nodes in Section 3.2 and then the main modifications of SChord to Chord in Section 3.3.

3.2 Expanding Rule of Virtual Nodes

As mentioned above, a real node with long session time takes a larger part in the routing and maintenance procedures of the whole system each time its alive time passes a segment of t_gradual. The real node takes on this growing function by holding more virtual nodes. Because every additional virtual node takes part in the p2p system, the number of nodes on the Chord ring increases. As many important parameters are determined by the number of nodes on the Chord ring, adding virtual nodes increases the overhead of the whole p2p system. Moreover, additional virtual nodes lead to more severe churn when the real node holding them fails. Therefore, we choose a linear expanding sequence for adding virtual nodes to a real node, i.e. additional virtual nodes are added to the real node after each time segment of t_gradual that it has spent in the system, according to the sequence C, 2C, 3C, ... (C is a constant decided by the application). The linear sequence is somewhat conservative, but it avoids the severe churn that would occur when a real node with a large number of virtual nodes fails. Hence the number of virtual nodes that a real node holds reflects the session time of the real node, and a real node with more virtual nodes takes a correspondingly larger part in the routing and maintenance procedures. The difference in nodes' session times is thus described quantitatively by virtual nodes and is exploited by the routing and maintenance algorithms of SChord.

3.3 Main Modification of SChord to Chord

Background on Chord. Chord [4] [8] is a typical DHT-based peer-to-peer system. Chord uses a one-dimensional circular key space. The node responsible for a key is the node whose identifier most closely follows the key (numerically); that node is called the key's successor. Routing correctness is achieved with the successor pointer. Routing efficiency is achieved with the finger list of O(log N) nodes spaced exponentially around the key space, where N is the number of nodes in Chord. A node's predecessor and successor list are used in the maintenance algorithm. The routing and maintenance algorithms of Chord are described in Fig. 1.

Fig. 1. Pseudocode of Chord Overlay

Routing Algorithm. Fig. 1 shows that n.find_successor(id) is the key procedure of Chord: it is used not only in routing, but also in the node joining procedure n.build_fingers() and in the node maintenance procedure n.fix_fingers(). Moreover, n.closest_preceding_node(id) is the key procedure used in n.find_successor(id); its function is to decide the next hop of routing from the Chord node's local routing information. As mentioned above, a real node with long session time holds multiple virtual nodes in SChord, so this real node has more routing information to use in the routing process. The key modification of SChord to Chord is therefore to change n.closest_preceding_node(id) so that it exploits this extra routing information. The routing algorithm of SChord's virtual nodes and real nodes is shown in Fig. 2. Fig. 2 shows that the real node's r.closest_preceding_node(id) makes use of the routing information of all the virtual nodes it holds. If the next hop is still an entry from the original virtual node's local routing information, this hop of the routing process is the same as Chord's, which proceeds

// ask virtual node v to find the successor of id
v.find_successor(id) {
    if (id ∈ (v, v.successor])
        return v.successor;
    else {
        v' = v.closest_preceding_node(id);
        return v'.find_successor(id);
    }
}

// ask virtual node v to find the closest preceding node to id
v.closest_preceding_node(id) {
    // get the real node r that virtual node v belongs to
    r = v.get_realnode();
    // ask real node r for the closest finger to id over all its virtual nodes
    return r.closest_preceding_node(id);
}

// ask real node r to find the closest preceding node to id
r.closest_preceding_node(id) {
    // find the closest finger to id known to each virtual node v on r
    for each v ∈ r.virtual_node_list {
        closest_preceding_list[v] = v.closest_preceding_finger(id);
    } // end for
    return the largest node u preceding id in closest_preceding_list;
}

// search v's local routing information for the highest predecessor of id
v.closest_preceding_finger(id) {
    return the largest node u in finger[1..m] or successor_list such that u ∈ (v, id);
}

Fig. 2. Pseudocode of the routing algorithm of SChord

along to the node whose identifier precedes the target id more closely than the current virtual node. Otherwise, the next hop is an entry from the local routing information of another virtual node on the same real node. In that case, the next hop has an identifier closer in key space to the successor of the target id than the hop that plain Chord would choose. The next hop is therefore the local optimum for message routing over all the routing information of the real node. As a result, the total number of hops and the total number of messages in the routing process decrease, and so does the probability that the routing path crosses a failed node, whether the procedure is used for routing or for maintenance.
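The next-hop choice described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the ring size `RING`, the class names and the example identifiers are assumptions, and the successor list is folded into the `fingers` list for brevity.

```python
RING = 2 ** 16  # assumed size of the circular identifier space

def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # the interval wraps around zero

def gap(u, target):
    """Clockwise distance from u to target; smaller means u precedes target more closely."""
    d = (target - u) % RING
    return d if d else RING  # a node equal to the target does not precede it

class VirtualNode:
    def __init__(self, ident, fingers):
        self.ident = ident
        self.fingers = fingers  # known node identifiers (fingers and successors)

    def closest_preceding_finger(self, target):
        """Best next hop using only this virtual node's routing information."""
        candidates = [f for f in self.fingers if between(f, self.ident, target)]
        return min(candidates, key=lambda f: gap(f, target)) if candidates else self.ident

class RealNode:
    def __init__(self, virtual_nodes):
        self.virtual_nodes = virtual_nodes

    def closest_preceding_node(self, target):
        """Best next hop over the routing information of all hosted virtual nodes."""
        best = [v.closest_preceding_finger(target) for v in self.virtual_nodes]
        return min(best, key=lambda u: gap(u, target))

v1 = VirtualNode(100, [110, 150, 400])
v2 = VirtualNode(5000, [5100, 6000, 9000])
real = RealNode([v1, v2])
print(v1.closest_preceding_finger(7000), real.closest_preceding_node(7000))
```

In this example a lookup for key 7000 that arrives at virtual node 100 would be forwarded to 400 under plain Chord, while the real node can forward it to 6000, a finger of its other virtual node, which is much closer to the key.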

4 Simulation

We implement the simulation in a discrete-event simulator. The simulated network consists of 8,192 nodes. To model churn, nodes crash and rejoin at exponentially distributed intervals with a mean session time of one hour. This choice of mean session time is consistent with [5]. All experiments involve only key lookup, as opposed to data retrieval. Nodes issue lookups for random keys at intervals exponentially distributed with a mean of ten minutes, so that nodes perform several lookups per session. The experiment runs for six hours of simulated time, and nodes keep their IP address and ID for the duration of the experiment. We simulate both Chord and SChord on this simulated network to compare the results. Both use the same periods for running the maintenance protocols: ten seconds for successor list maintenance and one minute for finger list maintenance. In SChord, the maintenance operations are carried out by the virtual nodes.

In the simulation, we adopt the failure-stop model for lookups. A lookup is recorded as a failure if node v' has failed when node v makes the remote procedure call v'.find_successor(id) of Fig. 1 and Fig. 2. In traditional p2psim, if a failed v' is met, v runs v.closest_preceding_node(id) again to select another node from its local routing information, until a node v' that can be reached is found. Under the failure-stop model used here, the lookup failure rate Fl is defined as the ratio of the number of failed lookups to the number of all lookups. Fl reflects the error rate of the control and routing information of the whole p2p system under churn, and this error rate in turn reflects the system's ability to handle churn. Fl therefore directly reflects the churn-handling ability of the p2p system. The hop number of each successful lookup is also logged; it reflects the routing efficiency of the p2p system.

As we use the same exponential distribution to model crashes and rejoins, the system reaches a dynamic steady state in which the total number of nodes is about 4096. For SChord, we can now calculate the parameters of the expanding rule of virtual nodes. Let 1/μ = 3600 s, since the mean session time of a node is one hour. Then t_target = 8289.4 s, t_point = 4144.7 s and t_gradual = 379.44 s. The expanding sequence is defined as 10, 20, 30, ..., i.e. 10 additional virtual nodes are added to the real node each time its alive time passes one time segment of t_gradual.

As we care only about the failure rate of all lookups and the hop number of successful lookups, the latency between every pair of nodes is set to the same value of 76 ms, which is the average latency estimated in [10].
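The expanding rule with these simulation parameters can be sketched as follows. The function name is hypothetical, and the sketch rests on one reading of the rule that the text does not spell out: a node holds a single virtual node until it has completed at least one segment of t_gradual beyond t_point, and k·C virtual nodes after k completed segments, which matches the column values 1, 10, 20, ... of Table 2.

```python
import math

def virtual_node_count(alive_time, mean_session_time, c):
    """Virtual nodes held by a real node under the linear expanding rule.

    Assumption: k completed segments of t_gradual beyond t_point give k * c
    virtual nodes; before the first completed segment the node holds one.
    """
    t_point = 0.5 * mean_session_time * math.log(10.0)
    t_gradual = mean_session_time * math.log(10.0 / 9.0)
    if alive_time <= t_point:
        return 1
    segments = int((alive_time - t_point) // t_gradual)
    return max(1, c * segments)

# Simulation parameters from the text: mean session time 3600 s, C = 10.
for alive in (4000, 4200, 4700, 5400):
    print(alive, virtual_node_count(alive, 3600, 10))
```

Whether the original virtual node is counted in addition to the k·C added ones is not stated in the text; the sketch follows the totals listed in Table 2.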

Fig. 3. Comparing routing hop number per successful lookup between SChord and Chord (plot of the PDF of hop number against hop number, one curve each for SChord and Chord)


Table 2. Distribution of virtual nodes on real nodes in SChord when the simulation ends

Number of virtual nodes on one real node    1     10   20   30   40   50   60   70   80
Number of real nodes                        4038  3    3    3    2    1    1    1    1

Fig. 3 shows the probability density function (PDF) of the hop number of successful lookups in SChord compared to Chord. The average hop number of a successful lookup is 5.81 in SChord and 6.07 in Chord, so SChord improves routing efficiency by only 4.3% over Chord. The decrease in hop number is small because the overhead in SChord is higher than in Chord. As mentioned above, the number of nodes in the system is an important measure of the overhead of a p2p system. When the simulation ends, SChord has 4038 real nodes and 4558 virtual nodes, and Chord has 4044 nodes. The distribution of virtual nodes in SChord is shown in Table 2. During the simulation, too, the number of virtual nodes in SChord is higher than the number of nodes in Chord. As the SChord protocol operates at the virtual node level, that is, on a larger ring than Chord, it can be concluded that the modified routing algorithm of SChord helps to decrease the hop number of lookups. Fig. 4 shows the lookup failure rate and the details of lookups in SChord and Chord. The failure rate of lookups is 34.48% on Chord and 27.59% on SChord, so SChord reduces the lookup failure rate by 19.98% relative to Chord. Because the lookup failure rate directly reflects the correctness of the control and routing information of all nodes in the p2p system, it is the key metric, and it shows that SChord handles churn better than Chord, as expected.
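The percentages quoted above are relative reductions with Chord as the baseline, and the virtual node total follows from Table 2. A short sketch of the arithmetic, using only numbers reported in the text (the helper name is an assumption):

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

hop_gain = relative_reduction(6.07, 5.81)       # average hops: Chord vs SChord
failure_gain = relative_reduction(34.48, 27.59) # lookup failure rate in percent

# Table 2: virtual nodes per real node -> number of real nodes
table2 = {1: 4038, 10: 3, 20: 3, 30: 3, 40: 2, 50: 1, 60: 1, 70: 1, 80: 1}
total_virtual = sum(k * n for k, n in table2.items())

print(round(100 * hop_gain, 2), round(100 * failure_gain, 2), total_virtual)
```

The hop-count figure of 4.3% in the text is this value rounded to one decimal place.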

Fig. 4. lookup result comparing SChord to Chord

5 Conclusion

In this paper, we present SChord to handle the churn problem of p2p systems. We analyze past experimental studies of churn in p2p systems and analyze a theoretical model of node session time. Based on this analysis, SChord distinguishes nodes of long session time from other p2p nodes and exploits these long-session nodes with its special routing algorithm. The simulation shows that SChord handles churn better than Chord, as expected.

References

1. B. Y. Zhao, J. Kubiatowicz, and A. D. Joseph: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical report, UCB/CSD-011141, University of California at Berkeley, Computer Science Department (2001)
2. A. Rowstron and P. Druschel: Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (2001)
3. S. Ratnasamy, P. Francis, and M. Handley: A scalable content-addressable network. In Proceedings of NGC'01 (2001)
4. I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet. IEEE/ACM Transactions on Networking, Vol. 11, No. 1, 17-32 (2003)
5. S. Saroiu, P. K. Gummadi, and S. D. Gribble: A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Conferencing and Networking (2002)
6. J. Chu, K. Labonte, and B. N. Levine: Availability and locality measurements of peer-to-peer file systems. In Proc. of ITCom: Scalability and Traffic Control in IP Networks (2002)
7. K. P. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan: Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In Proc. ACM SOSP (2003)
8. D. Liben-Nowell, H. Balakrishnan, and D. Karger: Analysis of the Evolution of Peer-to-Peer Systems. In ACM Conf. on Principles of Distributed Computing (PODC) (2002)
9. F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica: Wide-area cooperative storage with CFS. In SOSP'01 (2001)
10. K. P. Gummadi, S. Saroiu, and S. D. Gribble: King: Estimating latency between arbitrary Internet end hosts. In Proceedings of the 2002 SIGCOMM Internet Measurement Workshop (2002)

Towards Reputation-Aware Resource Discovery in Peer-to-Peer Networks¹

Jinyang Zhou, Shoubao Yang, Leitao Guo, Jing Wang, and Ying Chen

Computer Science Department, University of Science and Technology of China, Hefei 230026, P.R. China
[email protected], [email protected], {ltguo, joycew, ychen34}@mail.ustc.edu.cn

Abstract. Resource discovery algorithms in Peer-to-Peer networks assume that each peer provides reliable resources. The features that contribute most to the success of many P2P applications are dynamism, anonymity and self-organization. However, these features also allow malicious nodes to provide untrustworthy and fake services. To address this problem, this paper introduces a robust and flexible reputation mechanism for unstructured P2P networks and presents a heuristic, reputation-aware resource discovery algorithm that ensures that resource requesters obtain reliable resources and services. The new algorithm can effectively suppress deceptive and fake services in P2P networks, improve reliability and security, and decrease network load.

1 Introduction

A P2P network is a distributed application paradigm for data sharing, collaborative computing and large-scale storage. There is no central server or control. Each peer has equal capabilities and responsibilities for providing resources, consuming resources and communicating information. Peers can communicate with each other directly, rather than through a central server. But because a P2P network is dynamic, self-organizing and anonymous, it cannot guarantee that all peers provide good-quality services and reliable resources. Some peers provide services that differ from their description, or even fake services that harm the requester and other peers with illegal and uncertain resources. Therefore, in a P2P network with many dynamic and anonymous users, it is necessary to find a way to avoid fraudulent services and to guarantee that peers provide reliable resources and services. To meet these requirements, this paper presents the Reputation-Aware Resource Discovery Algorithm (RARDA), based on the Directed-BFS (breadth-first traversal) algorithm [1] and the concept of reputation, and reports simulation experiments and an analysis of their results. The analysis shows that the new resource discovery algorithm can effectively suppress deceptive and fake services in P2P networks, improve reliability and security, and decrease network load as well.

¹ This paper is supported by the National Natural Science Foundation of China under Grant No.60273041 and the National ‘863’ High-Tech Program of China under Grant No. 2002AA104560.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 930 – 940, 2005. © Springer-Verlag Berlin Heidelberg 2005


The rest of this paper is organized as follows. Section 2 introduces resource discovery algorithms in current unstructured P2P networks. Section 3 describes the trust concept and how to compute reputation values in a P2P network based on polling, and presents RARDA, which is based on the Directed-BFS algorithm and integrates a reputation factor. Simulations and an analysis of the results are given in Section 4. Conclusions and future work are in Section 5.

2 Related Works

Peer-to-Peer networks essentially come in three flavors: 1) centralized P2P networks, such as Napster [2]; 2) distributed unstructured P2P networks, such as Gnutella [3] and Freenet [4]; 3) distributed structured P2P networks, such as CAN [5] and Chord [6]. All P2P networks rely on the collaboration of peers. P2P is mainly used for file sharing, distributed processing and instant messaging. All these uses center on resource discovery and resource usage. Thus, resource discovery is the primitive service in a P2P system. Resource discovery means giving a detailed description of the expected resource and searching for all resources that match the description. In a structured P2P system, file placement is tightly coupled to the network topology. Files are placed in the network precisely, according to the logical addresses of the P2P topology. A Distributed Hash Table (DHT) is used to locate resources in structured P2P networks such as CAN. This method ensures that each peer can be reached within a bounded number of hops, which can be derived from log(n) (n is the number of peers in the P2P network). In an unstructured P2P system, file placement is only loosely related to the network topology. BFS (breadth-first traversal) can be used to locate resources. With BFS, each node forwards a received query message to all its neighbors by simple flooding, and the query ends when a predefined number of hops is reached. Figure 1 shows the query process of the flooding algorithm.

Fig. 1. Flooding Algorithm in Unstructured P2P Network

To find a file, peer A generates a query message and broadcasts it to its neighbor peers B and C; B and C then broadcast the message to their own neighbors, and so on. As soon as a peer finds the file, it returns the result along the original path. If TTL (Time to Live) is equal to 2, the query cannot reach peer F and peer G. Flooding locates resources only within a certain radius of the source peer; when the query goes beyond this radius, the search stops. Flooding therefore cannot guarantee that a search succeeds. But flooding is completely distributed, so it avoids a single point of failure. BFS is ideal in theory because it passes the query message to each node as quickly as possible, but the message traffic grows exponentially, which wastes resources and network bandwidth, and the system does not scale well. Hence, paper [1] introduced a compromise that does not degrade the quality of query results: the Directed-BFS algorithm. Directed-BFS forwards the query to a subset of a node's neighbors, chosen according to past experience, to reduce network cost. The key of this technique is to select neighbor nodes intelligently so that query quality is not degraded. But Directed-BFS cannot guarantee that each responding peer provides reliable and good services. Some malicious peers will provide fraudulent and unreliable services [7]. To ensure the reliability of the P2P system and provide good Quality of Service, this paper proposes the Reputation-Aware Resource Discovery Algorithm (RARDA), based on Directed-BFS. RARDA evaluates the reputation of each peer based on the history of past exchanges and takes reputation into account when selecting the neighbor subset and the resource providers. The purpose of the algorithm is to obtain honest service, decrease the communication load and enhance the scalability of the P2P system.
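TTL-limited flooding as described above can be sketched as a bounded breadth-first search. The topology below is hypothetical (the figure is not reproduced here); it is chosen only so that peers F and G lie three hops from A, as in the text. The function name and the message-counting convention (every forwarded copy counts, including copies sent to peers that have already seen the query) are assumptions.

```python
from collections import deque

def flood(graph, source, has_file, ttl):
    """Flood a query from `source` for at most `ttl` hops.

    Returns (hits, messages): the peers holding the file that were reached,
    and the number of query messages sent.
    """
    visited = {source}
    frontier = deque([(source, 0)])
    hits, messages = [], 0
    while frontier:
        peer, depth = frontier.popleft()
        if depth == ttl:
            continue  # TTL exhausted: this peer does not forward the query
        for neighbor in graph[peer]:
            messages += 1
            if neighbor in visited:
                continue  # duplicate copy of the query, dropped by the receiver
            visited.add(neighbor)
            if neighbor in has_file:
                hits.append(neighbor)
            frontier.append((neighbor, depth + 1))
    return hits, messages

# Hypothetical topology: F and G are three hops away from A.
graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"],
    "D": ["B", "F"], "E": ["C", "G"], "F": ["D"], "G": ["E"],
}
print(flood(graph, "A", {"G"}, ttl=2))
print(flood(graph, "A", {"G"}, ttl=3))
```

With TTL 2 the query never reaches G, so the search fails even though the file exists; raising the TTL to 3 finds it at the cost of more messages.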

3 Heuristic Resource Discovery Algorithm

A P2P system has no fixed network topology. Its topology forms spontaneously as peers join and leave dynamically. Each peer sends messages to its neighbor peers when it joins or leaves the system, which lets the P2P system update its topology dynamically. To evaluate reputation values objectively and validly, the original P2P system structure can be extended with a reputation mechanism. Figure 2 shows the extended P2P structure.

Fig. 2. The Extended P2P Structure (components shown: user, query message, reputation management, Experience DB, Reputation DB, File DB, Neighbour DB)

Each peer maintains not only a file sharing information database and a neighbor information database, but also a history experience information database, whose data is managed by the reputation management mechanism. Figure 3 shows the data structures that each peer maintains in the P2P system.

Fig. 3. Data Structure of Peer in Unstructured P2P Network

a. The file table includes the description and download count of each file, and the file keywords generated with the hash function SHA-1 [8].
b. The neighbor table holds the identifiers of neighbor peers and the history of exchanges with them, including how many times each neighbor has forwarded messages and how many times it has responded to messages.
c. The experience table includes the IDs of the neighbor peers that have interacted directly with the current peer, the trust class, the feedback information and the time decay factor. The reputation value declines gradually as time passes.
d. The reputation table stores the trust degree between peers according to the trust evaluation algorithms.

3.1 The Trust Evaluation Algorithm

This paper uses Diego Gambetta's definition of trust [9]. There are two kinds of trust between peers: Direct Trust and Recommendation Trust [10]. Direct Trust means that two peers have exchanged information directly; the reputation value is based on direct experience of those exchanges. It is shown by a solid line. Recommendation Trust means that two peers have never exchanged information directly; it is shown by a dotted line. They establish a trust relationship based on recommendations from other peers, and the reputation value comes from other peers' evaluations. This paper uses Trust Class and Reputation Value to describe the trust relationship between each pair of peers. Trust Class is a kind of service provided by a service provider. Reputation Value is the degree of trust between two peers with respect to a trust class; it quantifies the trust between peers. The formula is:

\[ \mathit{ReputationValue}_A^{\mathit{TrustClass}_A^B}(B) = \lambda \times \mathit{DRv}_A^{\mathit{TrustClass}_A^B}(B) + (1 - \lambda) \times \mathit{RRv}_A^{\mathit{TrustClass}_A^B}(B) \tag{1} \]

DRv is the direct reputation value, RRv is the recommendation reputation value, and λ is the degree of trust a peer places in its own direct-exchange evaluation.

3.1.1 Direct Trust Evaluation Algorithm

The Direct Reputation Value between two peers is evaluated from the history of their exchanges and gives the trust that one peer places in another for a certain behavior. It is written \( \mathit{DRv}_A^{\mathit{TrustClass}_A^B}(B) \). The formula is:

\[ \mathit{DRv}_A^{\mathit{TrustClass}_A^B}(B) = \frac{1}{n} \times \sum_{n} \mathit{Score}_A^{\mathit{TrustClass}_A^B}(B) \tag{2} \]

\( \mathit{Score}_A^{\mathit{TrustClass}_A^B}(B) \) is the evaluation value given by peer A to peer B for one past exchange, with \( \mathit{Score}_A^{\mathit{TrustClass}_A^B}(B) \in [-1, 1] \). For example, when peer A downloads a file from peer B and the file is authentic, the evaluation value is positive; if the file is unauthentic or modified, or the download is interrupted, the evaluation value is negative. n is the number of past exchanges between peer A and peer B.

3.1.2 Recommendation Trust Evaluation Algorithm

Besides drawing on direct experience from its own history, a peer accepts recommendations about the target peer from other peers. This paper obtains the recommendation reputation by passing reputation values along paths. It is written \( \mathit{RRv}_A^{\mathit{TrustClass}_A^B}(B) \). Figure 4 shows multi-path recommendation.

Fig. 4. Multi-path Recommendation in P2P networks

Peer A can reach peer G through different paths. Assume there are three paths between A and G, and let α, β, γ be the weights of the recommended paths, with α + β + γ = 1. The recommendation reputation value is given by the following formula:

\[ \mathit{RRv}_A^{\mathit{TrustClass}_A^G}(G) = \alpha \times (RV_{AB} \times RV_{BE} \times RV_{EG}) + \beta \times (RV_{AC} \times RV_{CE} \times RV_{EG}) + \gamma \times (RV_{AD} \times RV_{DF} \times RV_{FG}) \tag{3} \]


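From the above, the Reputation Value \( RV_A^{\mathit{TrustClass}_A^B}(B) \) between request peer A and respond peer B is

\[ RV_A^{\mathit{TrustClass}_A^B}(B) = \lambda \times \frac{\sum_{n} \mathit{Score}_A^{\mathit{TrustClass}_A^B}(B)}{n} + (1 - \lambda) \times \bigl( \alpha \times (RV_{AB} \times RV_{BE} \times RV_{EG}) + \beta \times (RV_{AC} \times RV_{CE} \times RV_{EG}) + \gamma \times (RV_{AD} \times RV_{DF} \times RV_{FG}) \bigr) \tag{4} \]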
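Formulas (2) to (4) combine into a short computation. The sketch below is illustrative only: the function names, the neutral value returned for an empty history, and all numeric inputs are assumptions, not values from the paper.

```python
import math

def direct_reputation(scores):
    """Formula (2): mean of past interaction scores, each in [-1, 1]."""
    if not scores:
        return 0.0  # assumption: no history is treated as neutral
    return sum(scores) / len(scores)

def recommendation_reputation(paths):
    """Formula (3): weighted sum, over paths, of the product of link reputation values.

    `paths` is a list of (weight, [rv, rv, ...]) pairs whose weights sum to 1.
    """
    return sum(weight * math.prod(links) for weight, links in paths)

def reputation_value(scores, paths, lam):
    """Formula (4): blend of direct and recommended reputation with weight lam."""
    return lam * direct_reputation(scores) + (1.0 - lam) * recommendation_reputation(paths)

# Hypothetical data: four past downloads and three recommendation paths.
scores = [1.0, 1.0, -1.0, 1.0]
paths = [
    (0.5, [0.9, 0.8, 0.5]),
    (0.3, [0.5, 0.5, 1.0]),
    (0.2, [1.0, 0.5, 0.5]),
]
print(reputation_value(scores, paths, lam=0.6))
```

Because each path contributes a product of values, a long path or a single weak link lowers the recommendation quickly, so direct experience dominates unless λ is small.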

3.2 Resource Discovery Work Flow

Compared with the flooding algorithm, Directed-BFS reduces network bandwidth consumption and processing cost. But it cannot guarantee that service-providing peers provide authentic services and reliable resources. Therefore, this paper presents the Reputation-Aware Resource Discovery Algorithm, RARDA. The reputation concept is used in the following ways:
a. Select the neighbor set to which the query message is forwarded intelligently, according to reputation value.
b. Select the peer with the higher reputation value as the service peer when more than one peer responds to the request.
c. If forwarding peers do not hold the reputation values of neighbor peers, update the peers' reputation table based on Formula (4).

Figure 5 illustrates the workflow of the improved resource discovery algorithm:

Fig. 5. Resource Discovery Algorithm Workflow

(1) The request peer sends out a query message, Message = {SourcePeerId, FileId, Rvmin}, where SourcePeerId is the identifier of the request peer, FileId identifies the requested file resource, and Rvmin is the minimum reputation value that the request peer requires.


J. Zhou et al.

(2) Select the iteration depth policy; if the policy is Policy = {Depth1, Depth2}, select Policy.Depth1 as the number of flooding hops for the iteration.
(3) Update the peer's reputation table based on the exchange history in the experience table, and select the peers whose reputation value is greater than or equal to the Rvmin in the message as the neighbor subset to which the message is forwarded. Then, within the neighbor subset, select the peers with more responding times or forwarding times during message querying as the transmitting peers.
(4) Check whether any peer meets the search requirement. If more than one peer meets it, select the peer with the higher reputation value to respond to the request, and the algorithm terminates; otherwise, go to (5).
(5) Check whether the number of hops has exceeded Policy.Depth1; if not, go to (2). Otherwise, resend the message, freeze all searching peers within Policy.Depth1, go to (1), select Policy.Depth2 as the number of flooding hops, and repeat steps (2) and (3).
(6) If the response time runs out, or no peer responds to the resource query when the iteration depth is reached, fall back to the flooding algorithm.

3.3 Pseudo Code Descriptions

This paper applies the reputation-aware resource discovery algorithm to unstructured P2P systems. Instead of changing the underlying routing algorithm used to locate and forward information, the heuristic reputation-aware algorithm adds an extra overlay on top of the underlying network topology. Furthermore, the algorithm is built on the flooding algorithm: in the worst case, it degenerates to simple flooding. Thus, the algorithm is flexible. Reputation values are computed from the history of exchanges. This helps keep malicious peers from providing fraudulent or fake services and so protects Quality of Service, without lowering the hit rate of the search; at the same time, it reduces network load and increases system scalability.

Pseudo code of RARDA is shown as follows:

Depth-policy: p = {depth1, depth2, depth3}
Message = {SourcePeerID, FileId, Rvmin}
Heuristic Resource-Discovery Algorithm (Message):
  i = 0;
  Do {
    hops = SelectPolicyDepth(p);
    While (i=SelectPolicyDepth)or(overtime))
      Flooding(message);
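The workflow in steps (1) to (6) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names and data layout are hypothetical, each deeper iteration restarts from the request peer instead of freezing and resuming peers, and the secondary ranking by responding and forwarding times in step (3) is omitted.

```python
from collections import deque


def flood(source, file_id, neighbours, files):
    """Plain breadth-first flooding; returns the first peer found holding the file."""
    queue, visited = deque([source]), {source}
    while queue:
        peer = queue.popleft()
        for n in neighbours.get(peer, ()):
            if n in visited:
                continue
            visited.add(n)
            if file_id in files.get(n, ()):
                return n
            queue.append(n)
    return None


def select_neighbours(peer, neighbours, reputation, rv_min):
    """Step (3): keep only neighbours whose reputation, as seen by peer, is >= rv_min."""
    return [n for n in neighbours.get(peer, ())
            if reputation.get((peer, n), 0.0) >= rv_min]


def rarda_search(source, file_id, rv_min, neighbours, files, reputation, depths):
    """Try each depth of the policy in turn; fall back to flooding (step (6))."""
    for depth in depths:                                  # steps (2) and (5)
        frontier, visited, hits = [source], {source}, []
        for _ in range(depth):
            next_frontier = []
            for peer in frontier:
                for n in select_neighbours(peer, neighbours, reputation, rv_min):
                    if n in visited:
                        continue
                    visited.add(n)
                    next_frontier.append(n)
                    if file_id in files.get(n, ()):
                        hits.append(n)
            frontier = next_frontier
        if hits:                                          # step (4)
            return max(hits, key=lambda p: reputation.get((source, p), 0.0))
    return flood(source, file_id, neighbours, files)


# Hypothetical five-peer overlay: C has low reputation as seen by A.
neighbours = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
files = {"D": {"f1"}, "E": {"f1"}}
reputation = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "D"): 0.8, ("C", "E"): 0.9}
provider = rarda_search("A", "f1", 0.5, neighbours, files, reputation, depths=[1, 2])
```

In this example the query is never forwarded to C, so peer E is not reached even though it holds the file; the file is found at D on the second iteration.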


The details of step 5 are given by the following code, which illustrates the intelligent neighbor selection algorithm.

IntelligentSelectNeighbourSets(peersets):
  for (all NeighbourPeers) {
    Select and lookup ReputationTable;
    if (ReputationTable.ReputationValue > Message.Rvmin)
      return (peer NeighbourPeerSets)
    else {RandomSelectPeers ReputationTable.RV > num1};
    while ((Respondtime >= a) and (Forwardtime >= b))
      return (select peers from NeighbourPeerSets);
  }

When a resource query message is sent out, the request peer selects an iteration depth policy and then consults the node's reputation table, neighbor table and experience table. If a peer's file table includes the requested resource and the peer meets the reputation requirement, the peer responds to the request message. Otherwise, the peer intelligently selects neighbor peers to forward the message to, according to reputation value, responding times and forwarding times. If the reputation table has no reputation value for a pair of peers, the value is computed and updated according to Formula (4).

3.4 Algorithm Performance Analysis

To a certain extent, RARDA can guarantee that peers provide reliable and honest service. At the same time, it decreases the communication load of the P2P network. The performance analysis is as follows. Assume that there is an edge between two peers A and B. S(A) is the set consisting of peer A and its neighbor peers, and S(B) is the set consisting of peer B and its neighbor peers. If peer A is the request peer and $S(A)_{translate}$ is the set of forwarding neighbors of peer A chosen by the intelligent selection policy, then after one hop the set of searched peers is $S(A)_{translate} \cup S(B)$. In the worst case, the algorithm performs flooding. If the maximum diameter of the connection graph of a network with n peers is d(n), then the round complexity is d(n). The communication complexity of the algorithm depends on the number of edges in the network. Let $E_{num}(n)$ denote the initial number of edges in a network with n peers. The worst case is that all peers start searching at the same time and all searches traverse the longest path d(n). The network communication complexity is then $\Omega(E_{num}(n) \times d(n))$.
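To make the worst-case term concrete, the following Python sketch computes the number of edges, the diameter and their product for a small example overlay. It assumes a connected, undirected graph given as an adjacency list; the function names and the four-peer ring are hypothetical.

```python
from collections import deque


def eccentricity(graph, start):
    """Largest hop distance from start to any other peer, found by BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def complexity_term(graph):
    """Return (E_num, d, E_num * d) for a connected undirected graph."""
    edges = sum(len(nbrs) for nbrs in graph.values()) // 2  # each edge is listed twice
    diameter = max(eccentricity(graph, u) for u in graph)
    return edges, diameter, edges * diameter


# Hypothetical overlay: four peers connected in a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
edges, diameter, term = complexity_term(ring)  # 4 edges, diameter 2, term 8
```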

4 Simulations and Results Analysis

The BRITE [11] network generator is used to perform the simulations. It generates topologies that are similar to real network structures and follow power laws [12]: $y = x^{\alpha}$. The simulated network uses a bottom-up two-layer topology. The whole structure contains 1000 peers. The Waxman model is used as the route structure. The maximum and minimum bandwidths are 1024 and 10. This paper uses Otter 0.9, a visualization tool from the Cooperative Association for Internet Data Analysis (CAIDA) [13], to display the topology. In the simulation, resource searching is simplified to searching the shared files of peers. Thus, the tests are designed and evaluated on an unstructured file sharing system based on the Gnutella protocol. It is assumed that 10,000 files are distributed randomly among the 1000 peers, without considering the trust class among peers, and that each peer is assigned 100 files with different content. The reputation value between any two peers depends on the evaluation information of their past exchanges. The simulation runs the flooding algorithm, the Directed-BFS algorithm and the Reputation-Aware Resource Discovery Algorithm in turn.

Fig. 6. The Proportion of Successfully Downloading (x-axis: the number of downloaded files, 0 to 1000; y-axis: the number of successful downloads, 0.0 to 1.0; curves: Flooding, Directed-BFS, RARDA)

Fig. 7. The Proportion of Successfully Downloading (x-axis: the percentage of deceptive peers, 0.0 to 1.0; y-axis: percentage of files successfully downloaded, 0.0 to 1.0; curves: Flooding, Directed-BFS, RARDA)

Figure 6 shows the proportion of reputable peers among all responding peers for the three resource searching algorithms. The three curves show that RARDA achieves a higher ratio of successfully downloaded files than the other two algorithms. The experimental result indicates that the Reputation-Aware Resource Discovery Algorithm ensures that more downloaded resources are provided by authentic and reliable peers, which supports reliable Quality of Service. It is assumed that 1000 files are downloaded under different percentages of deceptive peers. Figure 7 shows the percentage of authentic files downloaded under different percentages of malicious peers. The result shows that authentic and reliable files can still be downloaded even when 90% of the peers are malicious.

Fig. 8. Ratio of Resource Successful Discovery (x-axis: test time/minute, 0 to 60; y-axis: percentage of successfully downloaded files, 0.0 to 1.0; curves: Flooding, Directed-BFS, RARDA)

The success rate is the number of successful queries divided by the total number of queries during a given period. Figure 8 shows no large difference among the three algorithms, which indicates that RARDA does not reduce the search success rate and remains usable. Figure 9 shows the number of system messages while the algorithms are running. The x coordinate is the number of requesting peers in the P2P network in a given time, and the y coordinate is the number of messages. The result shows that RARDA clearly decreases the system message load and improves system scalability.

Fig. 9. System Message Load Compare (x-axis: the amount of requestor, 0 to 110; y-axis: the number of query message, 0 to 40000; curves: Flooding, Directed-BFS, RARDA)


From the above analysis, RARDA can, to a certain degree, identify more unauthentic and unreliable peers and avoid selecting them during resource searching, which reduces fraudulent and fake services and improves system reliability and scalability.

5 Conclusions and Future Work

This paper presents a Reputation-Aware Resource Discovery Algorithm for unstructured P2P systems that counters deceitful and fake services during resource searching. The algorithm combines flooding with a reputation mechanism to improve security and reliability, reduce imposture and fake services, and provide reliable Quality of Service. The experimental results show that the algorithm can reduce network load and avoid fake services from malicious peers. Algorithm efficiency analysis, the cost and delay of computing reputation values, peer collusion and peer bad-mouthing are directions for our future research.

References

1. Beverly Yang, Hector Garcia-Molina: Improving Search in Peer-to-Peer Networks. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02), 2002.
2. Napster Inc, http://www.napster.com/
3. Gnutella, http://www.gnutelliums.com/
4. Freenet documentation, http://freenet.sourceforge.net/doc/book.html
5. Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker: A scalable content-addressable network. In Proc. ACM SIGCOMM (San Diego, CA, August 2001), pp. 161–172.
6. Ion Stoica, Robert Morris and David Karger: Chord: a scalable peer-to-peer lookup service for Internet applications. Proceedings of ACM SIGCOMM'01, San Diego.
7. M.S. Khambatti, K.D. Ryu, P. Dasgupta: Efficient Discovery of Implicitly formed Peer-to-Peer Communities. International Journal of Parallel and Distributed Systems and Networks, Vol. 5, No. 4, 2002, pp. 155-164.
8. William Stallings: Cryptography and Network Security: Principles and Practice, Second Edition. Publishing House of Electronics Industry, 2001, pp. 223-228.
9. Diego Gambetta: Can We Trust? In: Trust: Making and Breaking Cooperative Relations, Gambetta, D (ed.). Basil Blackwell, Oxford, 1990, pp. 213-237.
10. Abdul-Rahman A, Hailes S: A distributed trust model. In Proceedings of the 1997 New security Paradigms Workshop Cambrian, ACM Press, 1998.
11. Brite: a network topology generator. http://www.cs.bu.edu/brite/
12. Faloutsos M, Faloutsos P, Faloutsos C: On power-law relationships of the Internet topology. In: Chapin L, Sterbenz JPG, Parulkar G, Turner JS, eds. Proc. of the ACM SIGCOMM'99. New York: ACM Press, 1999, pp. 251-262.
13. http://www.caida.org/

Constructing Fair-Exchange P2P File Market

Min Zuo and Jianhua Li
Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai, China
{zuomin, lijh888}@sjtu.edu.cn

Abstract. P2P is a promising technology to construct the underlying supporting layer of a Grid. It is known that contribution from all the peers is vital for the sustainability of a P2P community, but peers are often selfish and unwilling to contribute. In this paper we describe how to construct a fair file-exchanging P2P community. We name this community a P2P file market. Our scheme forces peers to contribute by a micropayment-based incentive mechanism.

1 Introduction

P2P (Peer-to-Peer) networking is a promising way to construct the underlying supporting layer of a Grid [1]. Some of the most attractive features of P2P networks are that they are independent of servers (totally or partly), they are self-organized and robust, and they are rich in resources (computing power, bandwidth, storage, valuable files, etc.). However, none of these features can be achieved without the cooperation of the peers. In theory, a peer in a P2P community can be both a consumer and a server. But serving others usually means sacrificing some of one's own resources. If peers are autonomous and self-interested, or if they act on behalf of rational individuals (as opposed to altruistic ones), they will tend to act as pure consumers (called “free-riders” in [2]). Free-riding makes a P2P community unfair. It will eventually paralyze the whole community and lead to a “Tragedy of the Commons” [3]. To solve this problem, some kind of incentive mechanism is needed.

In this paper, we take the file-sharing application as an example. We suggest that a P2P file-sharing community be upgraded into a “file market”. In this market, every peer is allowed to share and download files, but a peer has to pay the provider in order to download a file. Peers can earn the necessary “money” by sharing their own files in the community.

The rest of this paper is organized as follows: we present some design considerations in Section 2; we then describe the setup, file-trading and accounting processes in Sections 3-5; finally, we give some remarks and conclude in Section 6.

2 Design Considerations

There are some design considerations we would like to mention before going into details. The first is the overall architecture. There are basically three kinds of entities in the market: peers (P) who exchange files, a trusted third party (TTP) that resolves conflicts, and an accounting center (AC) acting as a central bank. Note that any entity trusted by both parties involved could act as the TTP in a transaction.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 941 – 946, 2005. © Springer-Verlag Berlin Heidelberg 2005


M. Zuo and J.H. Li

The second is the problem of identity and key management. We assume that each peer is identified by a unique ID and that each ID is bound to a pair of public and private keys. Anonymous peers may be permitted, but they can only download and provide cost-free files (which are usually deemed less valuable than priced ones). The following notation is used in this paper: SIGX(M) denotes peer X’s signature with its private key on a message M; EX(M) and DX(M) denote the encryption of M with peer X’s public key and the decryption with the corresponding private key; SEK(M) and SDK(M) denote the encryption and decryption of M with symmetric key K; H() is a public hash function, such as MD5, used to generate message digests. We suggest adopting an Identity-based Cryptosystem [4] because of its ease of key management (IDs are also the public keys). In our file market, the AC also acts as the KGC (Key Generation Center). It sets the public parameters and holds the secret master key used to compute a private key for each ID.

The last is the problem of fair exchange. One of our most important tasks is to ensure the fairness of the trading process. We borrow the ideas in [5] to design a fair exchange protocol for our system. It is an optimistic protocol: it guarantees the fair exchange of two electronic items between two mutually distrusting parties through the mediation of a third party (TTP), but the TTP needs to be involved only when a conflict happens. Therefore, if most of the peers are well-intentioned, the TTP should be involved only infrequently. With this protocol, the trading process can proceed in a P2P fashion as much as possible.

3 System Setup

Each new peer has to register at the AC before it can participate in the file market. During each registration, the AC performs the following operations: (1) choosing a unique ID for the peer; (2) computing the private key for the ID and sending this private key to the peer; (3) issuing an initial capital certificate to the peer; (4) adding an entry for the new peer to its account database. The communication of a registration session must be performed over a secure link (e.g. SSL/TLS).

A capital certificate (CC) is a digital certificate. It contains the subject’s ID, the expiration time of the certificate (Cash-Exp), the upper bound on the value (Max-Value) and the number (Max-Num) of valid cheques during this period, and the signature of the AC. It can be compared to a cheque-book issued by a bank. It allows the peer to issue at most Max-Num cheques, each valued at no more than Max-Value, and these cheques must be cashed before the time limit Cash-Exp. The initial values of Max-Value, Max-Num and Cash-Exp are set to system defaults.

This initial capital is necessary for the market to start functioning. However, malicious users could exploit the initial capital to free-ride. For example, a user could abandon an ID every time its initial capital is used up and register as a newcomer to get another share. This is called “white-washing” in [6]. To combat such abuse, we require that the AC perform some more thorough checking when a newcomer registers its ID. One possible method is e-mail confirmation, which is widely adopted by applications on today’s Internet.
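The cheque-book analogy can be illustrated with a small Python sketch of a capital certificate and the limits it imposes. The field names, the use of a Unix timestamp for Cash-Exp, integer amounts and serial numbers running from 1 to Max-Num are assumptions made for the example; verification of the AC's signature is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapitalCertificate:
    subject_id: str
    cash_exp: float      # Cash-Exp as a Unix timestamp (assumed representation)
    max_value: int       # Max-Value: upper bound on the value of one cheque
    max_num: int         # Max-Num: number of cheques that may be issued
    ac_signature: bytes  # signature of the AC; not verified in this sketch


def may_issue(cc, cheque_sn, price, now):
    """True if cheque number cheque_sn worth price fits within the certificate."""
    return (1 <= cheque_sn <= cc.max_num
            and 0 < price <= cc.max_value
            and now < cc.cash_exp)


# Hypothetical certificate: at most 10 cheques of value 50 or less.
cc = CapitalCertificate("peer-B", cash_exp=1000.0, max_value=50, max_num=10,
                        ac_signature=b"")
ok = may_issue(cc, cheque_sn=1, price=50, now=999.0)
```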


4 Downloading and Paying

When a file is put on the market for sale, it is divided into several pieces, as in BitTorrent. The metadata (file ID, file name, description, size, hash values, provider ID, address, port, accepted TTP list, price, etc.) is recorded in the community’s index service and made public to all. A peer issues search queries to find what it wants, and a list of matching records is returned to the querier. The querier then checks the list to see whether a file is likely to be what it wants (file name, description, size, etc.), whether the price is acceptable and affordable, and whether there is a TTP that both sides trust.

4.1 Negotiating

After a peer has decided which file to download, there is a negotiating phase between the downloader (peer B) and the provider (peer A) before the download starts. The message flow of this phase is illustrated in Fig. 1. CCX is peer X’s capital certificate (see Section 3). The mark “||” denotes the concatenation of two or more parts.

Fig. 1. Negotiating Protocol

DownREQ = FileID || DownloaderID || ChequeSN || TS.   (1)

Each peer maintains a personal counter. It is reset to zero each time the peer gets an updated capital certificate and increased by one each time the peer issues a new payment cheque. “ChequeSN” is the current value of this counter. It must not be greater than the Max-Num of the peer’s current capital certificate, or the request will be rejected by other peers. “TS” is the timestamp.

When A receives the signed DownREQ and the capital certificate from B, it first verifies the signatures using B’s and the AC’s public keys respectively. Then A makes sure that ChequeSN is no greater than the Max-Num in the certificate and that there is still enough time before Cash-Exp. If all these checks succeed, A randomly chooses a piece of the file (to be encrypted) and sends its sequence number in the DownRES message. Otherwise, A sends an error message to B or simply ignores the request.

4.2 Downloading

Then the download begins. If there are several concurrent downloaders, the provider redirects some of the data requests among them, so that more bandwidth can be spared to serve the rarest pieces (due to space limitations, we do not dwell on the details here). For each downloader there is an encrypted key piece (KP). This key piece can only be obtained from the original provider. When a downloader requests the key piece, the provider randomly chooses a key K, encrypts the piece with K and sends the encrypted piece to the downloader.

4.3 Paying

When the download finishes, downloader B holds the requested file with one of its pieces encrypted. To reconstruct the file, B has to obtain the decryption key (K). To get the key, B has to give the provider (A) a payment cheque, as negotiated in Section 4.1. The exchange of K and the cheque must be fair. The following protocol ensures that B gets K if and only if A gets the cheque. The message flow is illustrated in Fig. 2. “C” denotes the payment cheque:

C = IDPayee || IDPayer || IDTTP || ChequeSN || Price || TS || H(KP) || H(SEK(KP)).   (2)

The value of “Price” must not be greater than the Max-Value in the payer’s capital certificate. If the price is greater than this maximum, two or more cheques (signed together) are needed to cover the entire payment.
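How a price above Max-Value is divided among cheques is not specified in the text. One simple possibility, shown here purely as an assumption, is a greedy split into as many full-value cheques as needed plus one cheque for the remainder. Integer amounts and the function name are hypothetical.

```python
def split_price(price, max_value):
    """Split price into cheque amounts, each at most max_value (greedy split)."""
    if price <= 0 or max_value <= 0:
        raise ValueError("price and max_value must be positive")
    full, rest = divmod(price, max_value)
    return [max_value] * full + ([rest] if rest else [])


amounts = split_price(120, 50)  # [50, 50, 20]
```

Each resulting cheque would consume one serial number, so a payer close to Max-Num may be unable to cover a large price even if its balance is sufficient.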

Fig. 2. Paying Protocol - a Fair Exchange Protocol

If the provider (payee) and the downloader (payer) are both honest, the protocol needs only three message exchanges directly between them, without the involvement of a third party (optimistic scenario):

A1. Peer A first constructs C and computes Z = ETTP(A,B,K). Then it sends C, Z and SIGA(C,Z) to peer B.
B1. After B receives C, Z and SIGA(C,Z), it first checks the contents of C and makes sure that all fields of C are correct. Then B checks the validity of the signature. If all the checks succeed, B generates SIGB(C,Z) and sends it to A.
A2. If A receives a properly signed SIGB(C,Z) from B, it sends K to B as plaintext.


If the provider is dishonest, it may refuse to give K to the downloader (step A2) after it receives the downloader’s signature on the cheque. In this case the downloader can contact the TTP (dotted lines in Fig. 2):

B2. B sends C, Z, SIGA(C,Z) and SIGB(C,Z) to the TTP.
TTP1. The TTP parses C to learn the IDs of the provider and the downloader. Then it verifies the two signatures. If both signatures are correct, it decrypts Z with its own private key. If DTTP(Z) is a triplet whose first two parts are A and B, then it (i) sends the third part (K) to B and (ii) sends SIGB(C,Z) to A.

Conversely, if the downloader is dishonest and, after receiving C, Z and SIGA(C,Z), tries to skip step B1 and obtain K directly from the TTP, it still has to send both SIGA(C,Z) and SIGB(C,Z) to the TTP. Then, if K is sent to the downloader, SIGB(C,Z) is also sent to the provider. When the protocol finishes, A has the payment and B can reconstruct the file using the key. A fair file exchange is thus completed.
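The dispute path (steps B2 and TTP1) can be simulated in a single process with the Python sketch below. It uses toy stand-ins rather than real cryptography: an HMAC under a per-party secret plays the role of SIGX, and an XOR keystream derived from the TTP's secret plays the role of ETTP and DTTP. All names, the JSON encoding and the reduced set of cheque fields are assumptions for illustration only.

```python
import hashlib
import hmac
import json
import secrets

# Toy stand-ins, NOT real cryptography: a real system would use public-key
# signatures and public-key encryption to the TTP.
SECRETS = {name: secrets.token_bytes(32) for name in ("A", "B", "TTP")}


def sign(who, c, z):
    """Stand-in for SIG_who(C, Z)."""
    return hmac.new(SECRETS[who], c + z, hashlib.sha256).digest()


def verify(who, c, z, sig):
    return hmac.compare_digest(sign(who, c, z), sig)


def _keystream(n):
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(SECRETS["TTP"] + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def e_ttp(payee, payer, key):
    """Stand-in for Z = E_TTP(A, B, K)."""
    plain = json.dumps([payee, payer, key.hex()]).encode()
    return bytes(p ^ s for p, s in zip(plain, _keystream(len(plain))))


def ttp_resolve(c, z, sig_a, sig_b):
    """Step TTP1: release K to B and SIG_B(C, Z) to A, or refuse (None)."""
    cheque = json.loads(c)
    a, b = cheque["payee"], cheque["payer"]
    if not (verify(a, c, z, sig_a) and verify(b, c, z, sig_b)):
        return None
    plain = bytes(p ^ s for p, s in zip(z, _keystream(len(z))))
    payee, payer, key_hex = json.loads(plain)
    if (payee, payer) != (a, b):
        return None
    return {"to_B": bytes.fromhex(key_hex), "to_A": sig_b}


k = secrets.token_bytes(16)                    # piece key K chosen by A
c = json.dumps({"payee": "A", "payer": "B", "ttp": "TTP", "price": 5}).encode()
z = e_ttp("A", "B", k)                         # step A1
sig_a = sign("A", c, z)
b_accepts = verify("A", c, z, sig_a)           # step B1: B checks A's signature
sig_b = sign("B", c, z)                        # B's signature on (C, Z)
# Suppose A withholds K (skips step A2): B turns to the TTP (steps B2, TTP1).
outcome = ttp_resolve(c, z, sig_a, sig_b)
```

In this sketch the TTP hands out K and SIGB(C,Z) together, which mirrors the property that neither party can obtain its item through the TTP without the other party receiving its own.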

5 Accounting

Accounting operations are initiated by peers periodically. To reduce the burden on the AC, an accounting fee is charged for each accounting request, which discourages peers from contacting the AC too frequently. When a peer contacts the AC to request accounting, it hands over a package of all the payment cheques it has earned since its last accounting. Each cheque in the package should be a quadruplet of the form (C, Z, SIGx(C,Z), K).

The AC maintains an account database for all the registered peers in the community. For each peer, the AC stores its balance, its capital certificate, and the credit-cheques (with this peer as the payee) and debit-cheques (with this peer as the payer) that have been cashed but have not yet expired, etc.

When the AC receives the package of cheques, it first rejects the expired ones. Then it checks the IDs and sequence numbers in the cheques to detect “double-accounting” and “double-spending”. If either is detected, the corresponding cheque is rejected and a debit (for example, 10% of the cheque’s nominal value) is imposed on the payee (double-accounting) or the payer (double-spending) as a punishment. For each remaining cheque, the AC further checks that (1) Z = ETTP(IDpayee, IDpayer, K) and (2) SIGpayer(C, Z) is correct. Cheques that do not satisfy these conditions are deemed invalid and rejected.

After all these checks, the AC updates the balances of the peers involved according to the “Price” in the valid cheques. An accounting fee is also subtracted from the requester’s account. Finally, the AC sends a confirmation message back to the requester, together with a list of accepted cheques.

If the requester’s current capital certificate is about to expire, or the requester has issued Max-Num cheques, it also applies for a new certificate. If its balance is greater than a certain lower bound, the AC issues a new certificate to it and updates the corresponding information in the database. Otherwise, the requester has to “sell” more files to earn enough “money”, or it can no longer download any priced files from the community once its current certificate expires.
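The bookkeeping part of an accounting request can be sketched in Python as follows. The checks on Z and on the payer's signature are omitted, and the rule used to tell the two kinds of fraud apart is an assumption: a repeated (payer, serial number) pair with identical contents is treated as double-accounting by the payee, and with different contents as double-spending by the payer. Field names and amounts are hypothetical.

```python
def settle(cheques, balances, seen, requester, now, fee=1.0, penalty_rate=0.10):
    """Process one accounting request and return the accepted cheques.

    cheques  -- dicts with keys payee, payer, sn, price, expires
    balances -- dict peer -> balance, updated in place
    seen     -- dict (payer, sn) -> (payee, price) of cheques already cashed
    """
    accepted = []
    for ch in cheques:
        if ch["expires"] <= now:
            continue                                   # expired: rejected
        key = (ch["payer"], ch["sn"])
        ident = (ch["payee"], ch["price"])
        if key in seen:
            # Same cheque again: fine the payee; different cheque: fine the payer.
            culprit = ch["payee"] if seen[key] == ident else ch["payer"]
            balances[culprit] -= penalty_rate * ch["price"]
            continue
        seen[key] = ident
        balances[ch["payer"]] -= ch["price"]
        balances[ch["payee"]] += ch["price"]
        accepted.append(ch)
    balances[requester] -= fee                         # accounting fee
    return accepted


balances = {"A": 100.0, "B": 100.0}
cheque = {"payee": "A", "payer": "B", "sn": 1, "price": 50, "expires": 10}
stale = {"payee": "A", "payer": "B", "sn": 2, "price": 20, "expires": 1}
accepted = settle([cheque, dict(cheque), stale], balances, seen={},
                  requester="A", now=5)
```

Here the second copy of the cheque is rejected and costs A a 10% penalty, and the third cheque is rejected as expired.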


There is another problem to be considered. Because of the debits and accounting fees, the per-peer capital of the community decreases as time passes. This means that the whole community becomes poorer and poorer until no one can afford a download. Intuitively, however, as more files are shared and exchanged, the per-peer capital should increase gradually. We therefore add a “credit interest” of 0.1% per day to each non-negative account. This interest also serves as an additional incentive for peers to share more files and accumulate more capital.
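The effect of the 0.1% daily credit interest can be estimated with a short Python sketch. Whether the interest compounds is not stated in the text; daily compounding is assumed here, and the function name is hypothetical.

```python
def with_interest(balance, days, daily_rate=0.001):
    """Balance after the given number of days, assuming daily compounding.
    Negative balances earn no interest."""
    if balance < 0:
        return balance
    return balance * (1 + daily_rate) ** days


after_month = with_interest(100.0, 30)  # roughly 103.04
```

Under this assumption, an untouched balance grows by about 3% in a month, which can offset occasional accounting fees.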

6 Remarks and Conclusion

In this paper we describe how to construct a fair P2P file-sharing community, which we call a P2P file market. Our scheme is a “pay-per-transfer” scheme [7] that forces peers to contribute through a micropayment-based incentive mechanism. The most important features of our scheme are a fair-exchange protocol that ensures the fairness of the peer-to-peer trading process (to the best of our knowledge, most P2P micropayment schemes do not guarantee fair exchange), a sophisticated accounting policy, and support for multi-source parallel download (one of the most attractive merits of P2P technology). In the future, we will further examine whether other “rational attacks” [8] are possible in our system. We will also try to find an efficient way to distribute the role of the accounting center over several peers. Finally, we hope to implement a prototype in a real user environment soon and observe its performance in practice.

References

1. H. Zhuge, J. Liu, L. Feng, X. Sun and C. He: “Query Routing in a Peer-to-Peer Semantic Link Network”. Computational Intelligence, 21(2), pp. 197-216, (2005)
2. E. Adar and B. Huberman: “Free Riding on Gnutella”. First Monday, 5(10), (2000)
3. G. Hardin: “The Tragedy of the Commons”. Science, vol. 162, pp. 1243–1248, (1968)
4. A. Shamir: “Identity-Based Cryptosystems and Signature Schemes”. In: Proc. of Crypto'84, LNCS-196, Springer Verlag, (1985)
5. S. Micali: “Simple and fast optimistic protocols for fair electronic exchange”. In: Proc. of ACM PODC, (2003)
6. M. Feldman, C. Papadimitriou, J. Chuang, and I. Stoica: “Free-Riding and Whitewashing in Peer-to-Peer Systems”. In: Proc. of ACM SIGCOMM'04 Workshop on Practice and Theory of Incentives in Networked Systems (PINS), (2004)
7. B. Yang and H. Garcia-Molina: “PPay: Micropayments for Peer-to-Peer Systems”. In: Proc. of ACM CCS’03, (2003)
8. SJ Nielson, SA Crosby, and DS Wallach: “A Taxonomy of Rational Attacks”. In: Proc. of the 4th International Workshop on Peer-To-Peer Systems, (2005)

A Novel Behavior-Based Peer-to-Peer Trust Model

Tao Wang, Xianliang Lu, and Hancong Duan
College of Computer Science and Engineering, UESTC, Chengdu, 610054
[email protected]

Abstract. In P2P systems, every user can release files freely, which makes large-scale file-sharing feasible. However, as the number of malicious nodes increases, large numbers of fake files and deceptive behaviors restrict the development of P2P applications. Current P2P trust models cannot guarantee Quality of Service (QoS), and they do not take trust decay or cooperative cheating into account. To address these problems, this paper presents a novel behavior-based P2P trust model. Both direct trust and reputation trust are considered, and the model provides a time decay function and a reputation adjustment factor. Simulation results show that, compared with current trust models, the proposed model is more efficient and scalable.

Keywords: P2P, trust, decay function, QoS.

1 Introduction

In P2P systems, every node is both client and server, and users can make use of resources freely. Openness and anonymity are characteristics of P2P systems, so nodes can join or leave the system at any time and, moreover, are not held responsible for their behavior. On the one hand, many malicious nodes provide fake files and spread useless or harmful files, such as Trojan horses and viruses. These nodes respond positively to all queries in the P2P system and then provide decoy files. On the other hand, irresponsible clients may interrupt downloads arbitrarily. In all of these situations, the Quality of Service (QoS) of P2P file-sharing cannot be guaranteed.

Marsh proposed a trust model [1] built on the characteristics of social trust, which was complex and infeasible. Another trust model [2], based on local reputation, used distributed polling to evaluate the provider’s trust value. This model has the advantage of simplicity, but the nodes’ trust values were always local and one-sided. In the EigenRep trust model [3], when node I wants to know the trust value of node K, the system computes the global trust value of node K from the local trust values of the nodes that have traded with node K. The Dou model [4] is a nonlinear trust model proposed to address the problems of EigenRep. However, both models have convergence problems, and their heavy system load makes them suitable only for small-scale networks.

Current models do not take trust decay into account. If there is no direct or indirect contact between nodes for a long time, their trust relationship decays. This factor is considered in the calculation of our trust model below.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 947 – 952, 2005. © Springer-Verlag Berlin Heidelberg 2005


T. Wang, X. Lu, and H. Duan

The rest of the paper is organized as follows. Section 2 gives a detailed analysis of trust in P2P networks. Section 3 presents our P2P trust model. Experiments and simulations are shown in Section 4. Finally, Section 5 summarizes our model.

2 Trust in P2P

Trust in P2P networks can be classified into identity trust and behavior trust. Identity trust mainly deals with identity authentication, user rights and so on. It can be realized with encryption techniques, digital signatures and authentication mechanisms, and there are many solutions in the field of network security, such as authentication [5] and signatures [6]. Behavior trust concerns trust problems in a broader sense, which are more important in P2P systems. Users can draw on the past behavior between partners and on other users’ evaluations to adjust the trust value dynamically. In this paper, we focus on behavior trust.

Behavior trust in P2P is composed of direct trust and reputation trust (indirect trust) [7]. Direct trust is based on the past behavior of direct trades between nodes. If there is no trade history between two nodes, recommendations from other nodes are the only basis available; this is called reputation trust.

3 P2P Trust Model

Definition 1. Let $DT_{ij}$ denote the direct trust value between node i and node j, which is calculated from the past direct behavior between the nodes. $DT_{ij}$ is defined as:

$$DT_{ij} = \frac{Sat_{ij} - Dissat_{ij}}{Sat_{ij} + Dissat_{ij}} \times \theta(t - t_{ij}) \qquad (1)$$

where $Dissat_{ij}$ is the number of dissatisfying trades between node i and node j (fake files, viruses and interrupted downloads lead to dissatisfaction) and $Sat_{ij}$ is the number of satisfying trades. In the time decay function $\theta(t - t_{ij})$, $t_{ij}$ is the time of the latest trade between node i and node j.

Definition 2. Let $R_j$ denote the reputation value of node j, which is an average evaluated over all nodes that have traded with node j. $R_j$ is defined as:

$$R_j = \frac{\sum_{k=1}^{n} DT_{kj}}{\sum_{k=1}^{n} N_{kj}} \qquad (2)$$

where $N_{kj} = 1$ if there is direct trust between node k and node j, and $N_{kj} = 0$ if there is no direct trust between node k and node j. Since the time decay factor is already included in $DT_{ij}$ (Equation 1), $\theta(t - t_{kj})$ is not needed here.
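Definitions 1 and 2 can be sketched in Python as follows. The concrete form of the time decay function θ is not given in this part of the text, so an exponential decay with a hypothetical half-life is assumed here; returning None for node pairs without trade history and 0 for a node with no reports are likewise assumptions.

```python
def decay(elapsed, half_life=30.0):
    """Hypothetical time decay function theta: 1 at elapsed = 0, halving
    every half_life time units."""
    return 0.5 ** (elapsed / half_life)


def direct_trust(sat, dissat, elapsed):
    """DT_ij from Equation (1); None when nodes i and j have never traded."""
    total = sat + dissat
    if total == 0:
        return None
    return (sat - dissat) / total * decay(elapsed)


def reputation(direct_trusts):
    """R_j from Equation (2): average of DT_kj over the nodes k that have
    direct trust with j (entries of None count as N_kj = 0)."""
    known = [dt for dt in direct_trusts if dt is not None]
    return sum(known) / len(known) if known else 0.0


fresh = direct_trust(sat=8, dissat=2, elapsed=0)    # 0.6, no decay yet
stale = direct_trust(sat=8, dissat=2, elapsed=30)   # halved after one half-life
r_j = reputation([fresh, stale, None])
```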


• Cooperative cheat Cooperative cheat means node i and j raise up the reputation values of partners by giving the faked positive response frequently. To address the problem, we define a reputation adjustment factor σ ij , which is inverse proportion with the familiar degrees between nodes. The more frequent node i and j submit high reputation values mutually, the less σ ij is. The value of σ ij lies in [0,1]. Generally, we set σ ij = 1 . When system detects that there are frequent high evaluation from the same node, the value of σ ij will be dropped down reasonably. So Rj can be rewrited as:

$$R_j = \frac{\sum_{k=1}^{n} \left( \frac{Sat_{kj} - Dissat_{kj}}{Sat_{kj} + Dissat_{kj}} \right) \times \theta(t - t_{kj}) \times \sigma_{kj}}{\sum_{k=1}^{n} N_{kj}} \qquad (3)$$

Definition 3. Let $Trust_{ij}$ denote the global trust value between node i and j.

$$Trust_{ij} = \alpha \times DT_{ij} + \beta \times R_j, \qquad \alpha + \beta = 1, \quad \alpha, \beta \in [0, 1] \qquad (4)$$

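Furthermore, we can get

$$Trust_{ij} = \alpha \times \frac{Sat_{ij} - Dissat_{ij}}{Sat_{ij} + Dissat_{ij}} \times \theta(t - t_{ij}) + \beta \times \frac{\sum_{k=1}^{n} \left( \frac{Sat_{kj} - Dissat_{kj}}{Sat_{kj} + Dissat_{kj}} \right) \times \theta(t - t_{kj}) \times \sigma_{kj}}{\sum_{k=1}^{n} N_{kj}}, \qquad \alpha + \beta = 1, \quad \alpha, \beta \in [0, 1] \qquad (5)$$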
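Equations (1) to (5) can be traced with a short numerical sketch. This section does not give a closed form for the decay function θ, so the exponential `decay` below and its `half_life` parameter are assumptions made only for illustration, as are all function names. N_kj is taken to be 1 exactly when nodes k and j have at least one recorded trade.

```python
def decay(elapsed, half_life=30.0):
    """Hypothetical time decay theta(t - t_ij): halves every `half_life` time units."""
    return 0.5 ** (elapsed / half_life)


def direct_trust(sat, dissat, now, last_trade):
    """Equation (1): (Sat - Dissat) / (Sat + Dissat) * theta(t - t_ij)."""
    total = sat + dissat
    if total == 0:  # no trade history, so no direct trust is available
        return 0.0
    return (sat - dissat) / total * decay(now - last_trade)


def reputation(records, now):
    """Equation (3): sigma-weighted direct trust averaged over the raters of j.

    `records` holds one (sat, dissat, last_trade, sigma) tuple per rater k.
    Raters without any trade have N_kj = 0 and are skipped.
    """
    rated = [r for r in records if r[0] + r[1] > 0]
    if not rated:
        return 0.0
    weighted = sum(direct_trust(s, d, now, t) * sigma for s, d, t, sigma in rated)
    return weighted / len(rated)


def global_trust(direct, rep, alpha=0.5):
    """Equation (4): alpha * DT_ij + beta * R_j with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * direct + (1.0 - alpha) * rep


# Node i has 8 satisfying and 2 dissatisfying trades with j, the latest at t = 10.
dt_ij = direct_trust(8, 2, now=10, last_trade=10)
# Two other nodes rated j; the second is suspected of collusion (sigma = 0.5).
r_j = reputation([(9, 1, 10, 1.0), (10, 0, 10, 0.5)], now=10)
trust_ij = global_trust(dt_ij, r_j, alpha=0.6)
```

Lowering σ for the second rater halves the weight of its perfect score, which is how the adjustment factor dampens mutual praise without discarding the rating.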

4 Experiments and Analyses

To analyse our trust model, we conduct simulation experiments for file sharing in a P2P system. Users query files and download them from the node with the highest trust value. A trade is judged successful when the download completes and the file is authentic. We simulate several kinds of problem nodes: identity-changed nodes [8], trust decay nodes, and cooperative cheat nodes.

We simulate a P2P network with 1000 nodes. 5000 shared files are distributed among the nodes randomly, and the queries for shared files are also random. After a download finishes, a replica of the file is stored on the local node. Every user must complete more than 50 downloads, and every download is served by the node with the highest trust value among those that store the queried file. In the simulation experiments, the final criterion is the Successful Downloads Ratio (SDR):

SDR = (number of successful downloads) / (total number of downloads in the P2P system)

First, we compare P2P systems with and without the trust model under different proportions of malicious nodes. From Figure 1, we can conclude that as the ratio of malicious nodes increases, our trust model keeps the successful downloads ratio at a high level: even when malicious nodes make up 50% of all nodes, SDR is still above 85%.

T. Wang, X. Lu, and H. Duan

Fig. 1. SDR in P2P system with and without trust model (x-axis: malicious nodes ratio (%); y-axis: successful downloads ratio)
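The selection rule and the SDR criterion used in the experiments can be written down in a few lines. This is only a sketch: the function names are hypothetical, and the default `initial_trust` of 0.5 for nodes without a recorded trust value is an assumption chosen for the example.

```python
def pick_provider(holders, trust, initial_trust=0.5):
    """Return the file holder with the highest trust value.

    Nodes missing from `trust` are treated as new and get `initial_trust`.
    """
    return max(holders, key=lambda node: trust.get(node, initial_trust))


def successful_downloads_ratio(outcomes):
    """SDR = successful downloads / all downloads; `outcomes` is a list of booleans."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


trust = {"n1": 0.9, "n2": -0.4}
provider = pick_provider(["n1", "n2", "n3"], trust)  # "n3" is new and gets 0.5
sdr = successful_downloads_ratio([True, True, False, True])
```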

Second, we simulate the three kinds of problem nodes separately.

1. Identity-changed nodes. We simulate different initial trust values of nodes and record their effects on our model in the presence of identity-changed nodes. Results are shown in Figure 2.

Fig. 2. SDR with different initial trust values in trust model (x-axis: identity-changed nodes ratio; y-axis: successful downloads ratio)

In Figure 2, we can see that the initial trust value of nodes strongly affects a P2P system with identity-changed nodes. The SDR with initial value 1 is markedly lower than the SDR with initial value 0.5. This is because a higher initial trust value encourages malicious nodes to re-enter the system under a new identity, which then receives a high trust value from the start. However, in a cooperative P2P system we should encourage new nodes to join, so the initial trust value should not be lower than 0.

2. Trust decay nodes. We simulate different proportions of trust decay nodes in a P2P system with our trust model to observe the effect of time decay. For comparison, we also simulate the EigenRep trust model [3] from Stanford, which does not take decay into account.

Fig. 3. SDR with trust decay nodes in two trust models (x-axis: trust decay nodes ratio; y-axis: successful downloads ratio)

In Figure 3, compared with EigenRep, the decay function $\theta(t - t_{ij})$ in our trust model is effective in the trust decay situation.

3. Cooperative cheat nodes. We simulate different proportions of cooperative cheat nodes in our trust model, EigenRep, and the Dou-trust model [4]. Results are shown in Figure 4.

Fig. 4. SDR with different scale cooperative cheat nodes in three models (x-axis: cooperative cheat nodes ratio; y-axis: successful downloads ratio)

In Figure 4, our trust model and the Dou-trust model both cope well with cooperative cheat nodes, and SDR is maintained at about 80%~85%. However, the SDR of EigenRep drops markedly as the share of cooperative cheat nodes increases and ends below 60%. The difference between our trust model and the Dou-trust model [4] is as follows. We use the reputation adjustment factor $\sigma_{ij}$ to adjust our model, whereas in the Dou-trust model, to restrain cooperative cheat, the applicant must submit an evaluation after a download finishes and the provider must also send a confirmation. This policy adds extra load to the system and needs more bandwidth, so the efficiency of the Dou model cannot be guaranteed, especially in a large-scale P2P system.

5 Conclusion

In this paper, we propose a behavior-based P2P trust model that takes both direct trust and reputation trust into account. Solutions are provided to address the trust decay and cooperative cheat problems. Simulation results show that our model is more effective and scalable than current trust models.

References

1. Marsh S, Formalising Trust as a Computational Concept, Ph.D. Thesis, University of Stirling, 1994
2. Damiani E, De Capitani di Vimercati, A reputation-based approach for choosing reliable resources in peer-to-peer networks, the 9th ACM Conference on Computer and Communications Security, Nov. 2002
3. Kamvar S D, Schlosser M, and Garcia-Molina H, Eigenrep: Reputation management in p2p networks, Proc. of the 12th International World Wide Web Conf., 2003
4. Dou W, Wang HM, et al., A recommendation-based peer-to-peer trust model, Journal of Software, 2004, 15(4), pp.571-583, April 2004
5. Amit Basu, Steve Muylle, Authentication in e-commerce, Communications of the ACM, Volume 46, Issue 12, pp.159-166, December 2003
6. Giuseppe Ateniese, Verifiable encryption of digital signatures and applications, ACM Transactions on Information and System Security, Volume 7, Issue 1, pp.1-20, February 2004
7. Karl Aberer, Zoran Despotovic, Managing Trust in a Peer-2-Peer Information System, Proceedings of the tenth international conference on Information and knowledge management, Oct. 2001
8. Ernesto Damiani, De Capitani di Vimercati, A reputation-based approach for choosing reliable resources in peer-to-peer networks, Proceedings of the 9th ACM conference on Computer and communications security, November 2002

A Topology Adaptation Protocol for Structured Superpeer Overlay Construction

Changyong Niu, Jian Wang, and Ruimin Shen

Department of Computer Science and Technology, Shanghai Jiaotong University, Shanghai, China
{cyniu, jwang, rmshen}@sjtu.edu.cn

Abstract. Peer-to-peer networks can be divided into structured and unstructured based on their overlay topologies. In reality, unstructured p2p networks with superpeers have proved their capacities to support millions of users simultaneously. However, applications deployed on this kind of overlay networks, such as file-sharing, require flooding or gossip-based message routing, which puts more overhead on underlying networks and provides no guarantee on resource discovery. In this paper we propose an overlay adaptation protocol which allows structured superpeer overlay construction from unstructured p2p overlay networks with the potential to leverage the advantages of structured p2p overlay networks such as efficiency, scalability and guaranteed look-up services. The simulation result shows that our protocol can build the structured superpeer overlay with efficiency and scalability.

1 Introduction

P2P networks can be classified as being either structured or unstructured. Recent developments of structured [1,2] and unstructured [3,4] overlay networks point to a new direction for overlay research that addresses major challenges such as scalability, efficiency and flexibility. Structured P2P networks implement efficient distributed lookup services by organizing the peers in a structured overlay network and routing messages to the peer that is responsible for a given key. These implementations of distributed lookup service are often referred to as Distributed Hash Tables (DHTs). In contrast, nodes in an unstructured overlay network have the flexibility to choose the number and destinations of their connections, and to adapt them to network heterogeneity for improved network performance. In particular, introducing the concept of superpeer [5, 6] into unstructured overlay networks exploits the heterogeneity of P2P networks further without compromising their decentralized nature. However, unstructured overlay networks often require flooding or gossip to route messages, which limits their efficiency and puts more overhead on the underlying physical networks.

In this paper we explore the integration of structured and unstructured overlay networks: superpeers are selected from an operational unstructured overlay network and further organized into a structured network to tackle the challenges of scalability, efficiency and flexibility. The potential advantages of this method include:

− Load Balance: load balance among superpeers is achieved not only in the number of object references maintained but also in the number of clients handled by superpeers.
− DHT Enhanced: by organizing superpeers into a structured overlay, applications over the resulting overlay networks can easily take advantage of DHTs.
− Efficiency: simulation shows that the presented protocol is efficient for superpeer overlay construction in terms of underlying network overhead.

The rest of this paper is organized as follows. Related superpeer overlay construction protocols are presented in section 2. In section 3, the proposed protocol for building a structured superpeer overlay is described in detail. In section 4, the protocol performance is analyzed and experimental results are illustrated.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 953 – 958, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Related Work

The concept of superpeer has been introduced to improve the performance of P2P overlay networks and the applications built on them. A superpeer is expected to be more stable and powerful, so that it can operate as a server for a group of common peers and as an equal to other superpeers. Several superpeer overlay designs have been introduced in the literature.

The system in [7] is based on a two-tier hierarchical overlay network. All nodes are part of the bottom tier, a Chord ring. In a Chord ring, each node keeps track of other nodes that are 1/2-ring away, 1/4-ring away, 1/8-ring away, etc. To construct the upper superpeer tier, the Chord ring is divided into a number of arcs of equal length. In each arc, a node is selected as the superpeer of that arc, and all superpeers form the upper tier.

Another two-tier superpeer overlay is the system in [8]. Each node belongs to exactly one group of the bottom tier, and different groups form separate overlays which may or may not have the same structure. One or more superpeers are selected from each group, and all superpeers form the upper-tier overlay, which is a modified Chord ring. The relationship between superpeers and group peers is many-to-many.

Other superpeer overlay networks include HONet [9]. In HONet, peers are first clustered into groups based on locality in a structured way. Then superpeers selected from each group are connected using random walks.

The superpeer selection and hierarchical overlay construction protocol presented in [6] is gossip based. The resulting overlay (also called the target superpeer overlay) is characterized by a one-to-many relationship between superpeer and group peers, a random network structure of the superpeer overlay, and a minimized number of superpeers. As in HONet, nodes in groups are not totally separated but are connected through gossip-based links, so the robustness of the protocol is achieved at the price of network overhead. However, the total overhead of overlay construction scales sub-linearly with the size of the network, which is very efficient.

P2P overlay networks have been used in many applications, such as application-level multicast, file-sharing, media streaming, data storage and, most recently, the semantic layer of the Knowledge Grid [10].


3 Structured Superpeer Overlay Construction

We start the superpeer overlay construction from a network consisting of a large collection of nodes, in which each node has a unique identifier and can potentially communicate with any other node through message exchanges. The network is highly dynamic: nodes may join, leave, or even crash at any time. Nodes in the network are heterogeneous, with different capacities. To differentiate the nodes, each node is associated with a value representing its capacity according to its own characteristics. Based on this information, we can select the nodes with larger capacities to act as superpeers and determine the number of clients that a superpeer may handle. The protocol presented in this paper gradually adapts an unstructured overlay into a structured superpeer overlay, which has a number of important advantages compared to flat unstructured overlay networks.

• Handling heterogeneity: the upper-layer overlay consists of more powerful and stable superpeers.
• Transparency: clients joining, leaving and crashing under one superpeer are totally transparent to other groups. Client floating, i.e. moving from one superpeer to another, is also transparent to object lookup services.
• Efficiency: the structured organization of superpeers can provide more efficient lookup services, and this efficiency can in turn lead to less network overhead.

3.1 Superpeer Selection and Structurization

Instead of building a superpeer overlay from scratch, we build the superpeer overlay as an additional overlay, superimposed over an existing connected topology. The rationale is that even if all superpeers crashed simultaneously, the network would not be disconnected and the superpeer overlay could still be rebuilt from scratch. We use Tapestry as the superpeer overlay structure and implement the adaptation algorithm in PeerSim. Although a specific structured overlay protocol is used in this work, the overlay adaptation paradigm is general enough to adopt other DHT protocols.

The construction is divided into two stages. The idea of the first stage, superpeer selection, is fairly simple. We assume each node in the existing overlay knows its own capacity. Besides the neighbor table, each node also maintains a superpeer table and a superpeer/client table. The superpeer table of a node A holds a sample of the superpeers that A has already seen in the current overlay; the superpeer/client table holds the current clients of A if A is a superpeer, or the current superpeer of A if A is a client. The roles of superpeer and client are determined by competition. The competition strategy is illustrated in Table 1 and is similar to the superpeer selection algorithm in [6]. However, instead of generating the minimum number of superpeers, we have adapted the algorithm for load balance purposes. Each superpeer has three satisfaction levels: unsatisfied if its number of clients is less than its pre-defined min-client threshold; satisfying if the number of clients is between the min-client and max-client thresholds; and over-satisfied if the number of clients exceeds max-client. Over-satisfied is only allowed when a superpeer can take over another superpeer and all its clients. Load balance is handled in the top-layer overlay compete strategy.

Table 1. Let p be a local peer and p-superpeer be its superpeer table; Lp is p's current load; Cp is the capacity of p; Cp-max is the max-client threshold of p; Cp-min is the min-client threshold of p; let r, q be temporary remote peers

S = {r | r ∈ p-superpeer && Cr > Cp && Lr < Cr-max}
Probe the load (used capacity) Lq of each q ∈ S
Find q such that (Cq − Lq) > Lp
If such a q is found
then q accommodates p and all its clients
Else find the q that maximizes (Cq-max − Lq)
Transfer min(Cq-max − Lq, Lp) clients from p to q
If there exists a client r ∈ q with Cr > Cp
then exchange the roles of r and p, letting p be a client of q and r be a superpeer taking over the clients of p
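The satisfaction levels and the first steps of the competition in Table 1 can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the `Peer` class and function names are hypothetical, clients are tracked by name only, probing is replaced by direct attribute access, and the final role exchange between p and a stronger client r is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    capacity: int       # C_p
    min_clients: int    # C_p-min
    max_clients: int    # C_p-max
    clients: list = field(default_factory=list)  # names of current clients

    @property
    def load(self):     # L_p
        return len(self.clients)


def satisfaction(peer):
    """Classify a superpeer by its client count against its two thresholds."""
    if peer.load < peer.min_clients:
        return "unsatisfied"
    if peer.load <= peer.max_clients:
        return "satisfying"
    return "over-satisfied"


def compete(p, superpeer_table):
    """One competition round for local superpeer p; returns the winning q or None."""
    # S: known superpeers with larger capacity that are still below max-client.
    candidates = [r for r in superpeer_table
                  if r.capacity > p.capacity and r.load < r.max_clients]
    if not candidates:
        return None
    # A q whose spare capacity exceeds L_p takes over p together with all its clients.
    for q in candidates:
        if q.capacity - q.load > p.load:
            q.clients.extend(p.clients + [p.name])
            p.clients = []
            return q
    # Otherwise move as many clients as fit to the q with the most room.
    q = max(candidates, key=lambda r: r.max_clients - r.load)
    n = min(q.max_clients - q.load, p.load)
    q.clients.extend(p.clients[:n])
    p.clients = p.clients[n:]
    return q


p = Peer("p", capacity=5, min_clients=1, max_clients=3, clients=["a", "b"])
q = Peer("q", capacity=10, min_clients=1, max_clients=4, clients=["c"])
winner = compete(p, [q])  # q has spare capacity 9 > 2, so it absorbs p and its clients
```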

After the first-stage selection of superpeers, every superpeer in the current snapshot applies to join the upcoming superpeer overlay. Here we make the same assumption as Tapestry does, that there are well-known federation and static Tapestry nodes in the overlay for bootstrap purposes. Since the goal of the top-tier overlay is load balance, it is reasonable that the resulting overlay contains large-capacity superpeers and that each superpeer has some clients under its control. To achieve this goal, we design a superpeer compete strategy, which is described as follows.

Table 2. p is the node querying for a superpeer; S0 is its access point to the top-layer superpeer overlay; LX is node X's current load; CX is the capacity of X; CMAXX is the max-client threshold of X; CMINX is the min-client threshold of X; TEMP, t, S are temporary variables

TEMP = p; S = S0;
while (S != null) {
    if (LS < CMAXS)  // S is not over-satisfied and has spare room
        { S accepts TEMP as a client; break; }
    find k in S's client table which has the max capacity;
    if (Ck > CTEMP) {
        t = k;                         // remember the displaced client k
        TEMP.superpeer = k.superpeer;  // TEMP replaces k in the client table of S
        TEMP = t;                      // k continues the search
    }
    S picks the next hop H from its routing table;
    if (H == null)
        { TEMP joins as a supernode; break; }
    else
        { S = H; }
}
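The loop in Table 2 can be exercised with a small sketch. The class and function names are hypothetical, and the `next_hop` attribute is a simplifying stand-in for picking the next hop from a Tapestry routing table, which is not modelled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    capacity: int


@dataclass
class Superpeer:
    name: str
    max_clients: int                         # CMAX_S
    clients: List[Node] = field(default_factory=list)
    next_hop: Optional["Superpeer"] = None   # stand-in for the routing-table lookup


def find_superpeer(p, s0):
    """Walk the superpeer overlay from access point s0 until a node is placed.

    Returns ("client", node_name, superpeer_name) when some superpeer accepts
    the travelling node, or ("superpeer", node_name, None) when the walk runs
    out of hops and the travelling node joins the overlay itself.
    """
    temp, s = p, s0
    while s is not None:
        if len(s.clients) < s.max_clients:   # S still has room
            s.clients.append(temp)
            return ("client", temp.name, s.name)
        if s.clients:
            # Index of the client of S with the largest capacity.
            i = max(range(len(s.clients)), key=lambda j: s.clients[j].capacity)
            if s.clients[i].capacity > temp.capacity:
                # temp takes the client's slot; the stronger node travels on.
                s.clients[i], temp = temp, s.clients[i]
        s = s.next_hop
    return ("superpeer", temp.name, None)


s1 = Superpeer("s1", max_clients=1, clients=[Node("c", 2)])
s0 = Superpeer("s0", max_clients=2, clients=[Node("a", 3), Node("b", 8)], next_hop=s1)
result = find_superpeer(Node("p", 5), s0)
```

In this run p displaces the stronger client b at s0, b finds no room at s1, and b therefore joins the overlay as a superpeer, so the node with the largest capacity ends up in the top tier.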


The superpeer finding algorithm illustrated in Table 2 uses the following notation: p is the node querying for a superpeer; S0 is its access point to the top-layer superpeer overlay; LX is node X's current load; CX is the capacity of X; CMAXX is the max-client threshold of X; CMINX is the min-client threshold of X. The algorithm proceeds as follows.

1. If S0 is not over-satisfied and has the spare capacity to accept p as a client, the search succeeds.
2. If S0 cannot accommodate p, look in S0's client table for Ck, where k is the node with the largest capacity in the client table, and compare it with Cp. If Cp is larger, forward the search to the routing successor of S0.
3. If Ck is larger, node p becomes a client of S0, and the client node k gracefully leaves S0 and uses S0 as its access point to find itself a superpeer. (k replaces p as the node querying for a superpeer.)
4. The adaptation process continues until p successfully finds itself a superpeer or the routing successor of S0 is null. In the latter case, p joins the superpeer overlay as a superpeer, and the search ends.

3.2 Load Balance

We design a milder way to exploit the capacity heterogeneity, with a focus on load balance and fairness. To achieve this, each selected superpeer initializes a (min-client, max-client) pair. Each superpeer actively keeps its number of clients between min-client and max-client. When a new superpeer joins the top-Tapestry overlay, it takes clients from its neighbors to reach this range. When the number of clients of a superpeer exceeds max-client, it hands some clients over to its neighbors. As long as the resulting number of clients stays between the thresholds, a superpeer is always willing to accept clients from (or hand clients to) other superpeers. The reason a client can be reassigned to another superpeer is as follows.

In a structured superpeer overlay, information about objects is published as (key, value) pairs. A key, say K1, is derived from the keywords associated with an object, say O1, and the value coupled with K1 identifies the machine hosting O1. All (key, value) pairs are distributed among the top-Tapestry nodes as object references. As long as a lookup for a key retrieves the reference, the "value" within the reference directs the query to the hosting machine, no matter which superpeer the hosting machine belongs to. So client floating is totally transparent to object lookup.
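The transparency argument can be checked with a toy model of the object references. Everything here is an assumption for illustration: the class and method names are hypothetical, keys are derived with SHA-1, and the responsible superpeer is chosen by a simple modulo rule instead of Tapestry routing.

```python
import hashlib


def object_key(keywords):
    """Derive a key from an object's keywords (SHA-1 is an assumption of this sketch)."""
    joined = " ".join(sorted(keywords))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


class SuperpeerOverlay:
    def __init__(self, superpeers):
        self.superpeers = sorted(superpeers)
        # Each superpeer stores a share of the (key, value) object references.
        self.references = {name: {} for name in self.superpeers}
        self.superpeer_of = {}  # hosting machine -> its current superpeer

    def _responsible(self, key):
        # Simplified placement: key modulo the number of superpeers.
        return self.superpeers[int(key, 16) % len(self.superpeers)]

    def attach(self, host, superpeer):
        """Assign (or reassign) a client machine to a superpeer."""
        self.superpeer_of[host] = superpeer

    def publish(self, keywords, host):
        key = object_key(keywords)
        self.references[self._responsible(key)][key] = host

    def lookup(self, keywords):
        key = object_key(keywords)
        return self.references[self._responsible(key)].get(key)


overlay = SuperpeerOverlay(["sp-a", "sp-b", "sp-c"])
overlay.attach("host-1", "sp-a")
overlay.publish(["grid", "paper"], "host-1")
before = overlay.lookup(["paper", "grid"])
overlay.attach("host-1", "sp-c")  # client floating: host-1 moves to another superpeer
after = overlay.lookup(["grid", "paper"])
```

Because the reference stores the hosting machine rather than its superpeer, reassigning the client changes only `superpeer_of` and leaves every stored reference valid.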

4 Analysis and Future Work

Because our current work focuses on the efficient protocol design of structured superpeer overlay construction, in the following we compare the efficiency of our protocol with that of [6] in terms of network overhead and the operations performed by each node to form such superpeer overlays. To compute the overall network overhead, we aggregate the messages sent by all nodes for neighbor information probing and neighbor transferring, which are needed for overlay construction. The aggregations are performed under different network sizes with a power-law node capacity distribution and a max node capacity of 500. The comparison of total network overhead is illustrated in Figure 1. For simplicity, we average the total messages among all nodes in the network. The average can be read as the number of operations taken by each node for probing or transferring. For our protocol, the operations also include messages for Tapestry node join and routing table maintenance during overlay construction.

Fig. 1. Network overhead comparison between the constructions of unstructured and structured superpeer overlays: the main reason that our protocol performs better is that we focus on load balance among superpeers

In this paper, we propose a topology adaptation protocol for building a structured superpeer overlay. The simulation results show that the protocol can build a structured superpeer overlay from an unstructured p2p network with efficiency and scalability. Future work includes extending the model to handle locality heterogeneity and exploring the behavior of such a hierarchical topology under churn.

References

1. Stoica, I., Morris, R., Kaashoek, M. and Balakrishnan, H., Chord: a scalable peer-to-peer lookup service for internet applications, Proceedings of ACM SIGCOMM 2001
2. Zhao, B.Y., Kubiatowicz, J. and Joseph, A.D., Tapestry: an infrastructure for fault-tolerant wide-area location and routing, Technical Report, University of California, Berkeley, 2001
3. “Gnutell.” http://www.gnutell.com
4. Kermarrec, A.M., Massoulié, L. and Ganesh, A.J., Probabilistic reliable dissemination in large-scale systems, IEEE Transactions on Parallel and Distributed Systems, 2003
5. Yang, B. and Garcia-Molina, H., Designing a super-peer network, Proceedings of the 19th International Conference on Data Engineering, 2003
6. Montresor, A., A robust protocol for building superpeer overlay topologies, Technical Report UBLCS-2004-8
7. Manku, A.T., Cheng, Y., Kumar, V., and Savage, S., Structured superpeers: leveraging heterogeneity to provide constant-time lookup, Proceedings of IEEE Workshop on Internet Applications, 2003
8. Garces-Eric, L., Biersack, E.W., Felber, P.A., Ross, K.W., and Urvoy-Keller, G., Hierarchical peer-to-peer systems, Proceedings of INFOCOMM’03, 2003
9. Ruixiong Tian, Yongqiang Xiong, Qian Zhang, Bo Li, Ben Y. Zhao, Xing Li, Hybrid overlay structure based on random walks, Proceedings of iptps05, 2005
10. Zhuge, H., Liu, J., Feng, L., Sun, X., He, C., Query routing in a peer-to-peer semantic link network, Computational Intelligence, Volume 21, 2005

A Routing Protocol Based on Trust for MANETs

Cuirong Wang1,2, Xiaozong Yang1, and Yuan Gao2

1 School of Computer Science & Technology, Harbin Institute of Technology, 150001 Harbin, China
2 Qinhuangdao School, Northeastern University, 066000 Qinhuangdao, China
[email protected]

Abstract. An ad hoc network is a peer-to-peer grid system. The combination of the Knowledge Grid and ad hoc networks could have a great effect on the future interconnection environment. Existing research on ad hoc routing protocols does not support knowledge with trust requirements. In this paper, the trust level is used as knowledge for routing. Security rather than the shortest path is the primary concern of the method. The performance evaluation via simulations shows that the method is a promising trust routing algorithm for MANETs. The effects of this trust model on the DSR route discovery mechanism are analyzed. Results show that our model can improve the performance of DSR route discovery.

1 Introduction

In an ad hoc network, nodes cooperate in dynamically establishing wireless networks and maintaining routes through the network, forwarding packets for each other to enable multi-hop communication between nodes that are not in direct transmission range. On-demand routing protocols for mobile ad hoc networks, such as Dynamic Source Routing (DSR), generate routes to unknown destinations on an as-needed basis. These protocols mostly use flooding to discover routes. The flooding approach forwards a node's queries to all its neighbors, which results in traffic problems. To be effective, a query-routing strategy should forward queries only to nodes that offer certain related knowledge.

The proposed routing solutions deal only with the number of hops. Connections with trust requirements are not supported. In this paper, we propose a trust-based routing algorithm for ad hoc networks. Security rather than optimality is the primary concern of the algorithm. For general routing algorithms, it is better to find a route quickly, so as to keep up with the speed of topology change, than to search for the optimal route, which may be meaningless because the network condition has changed and the route no longer exists. In this paper, trust parameters of nodes are used for routing decisions. To evaluate the performance of the protocol (tr-DSR), we carried out simulations for different network conditions.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 959–964, 2005. © Springer-Verlag Berlin Heidelberg 2005


The paper is organized as follows. In Section 2, we introduce the related work. The proposed routing algorithm tr-DSR is described in Section 3. The performance evaluation and analysis of the tr-DSR are discussed in Section 4. Finally, conclusions are given in Section 5.

2 Related Work

2.1 DSR and Secure Routing Protocol

The Dynamic Source Routing protocol [1] is a simple and efficient routing protocol designed specifically for use in multi-hop wireless ad hoc networks of mobile nodes. Using DSR, the network is completely self-organizing and self-configuring, requiring no existing network infrastructure or administration. In DSR, although route discovery may fail to some extent, re-broadcasting route request packets with a certain trust probability can be considered a performance improvement technique [2]. In response to a single Route Discovery, a node may learn and cache multiple routes to any destination. This support for multiple routes allows a much more rapid reaction to routing changes, since a node with multiple routes to a destination can try another cached route if the one it has been using should fail. This caching of multiple routes also avoids the overhead of performing a new Route Discovery each time a route in use breaks.

The trust of a node is a very important constraint in a wireless network. If a node or a route has a very low trust value, the route is dangerous. Moreover, this can also have a bad effect on the network data packets: there are some nodes that will be dropped. To this end, we propose a trust routing establishment mechanism and apply it to the DSR routing protocol. The main feature of our work compared to related work in this area is that it is simple and efficient, judging by the performance enhancement obtained relative to the basic DSR protocol.

2.2 Trust Quantization

An ad-hoc network of wireless nodes is a temporary network created, operated and managed by the nodes themselves. Nodes assist each other by passing data and control packets from one node to another. The execution and survival of an ad-hoc network depends solely on the trusting nature of its nodes. A number of protocols have been developed to secure ad-hoc networks using cryptographic schemes, but all rely on the presence of an omnipresent, and often omniscient, trust authority. The dependence on a central trust authority is an impractical requirement for ad-hoc networks. We present a model for trust-based communication in ad-hoc networks in which a central trust authority is superfluous. The routes discovered using our model are not cryptographically secure, but each one of them carries a confidence measure regarding its suitability in the current. According to Josang [3], trust and security represent two sides of the same thing. Both these terms are so highly interconnected that they cannot be evaluated independently.

The principal drawback of trust-based route finding is the route discovery efficiency. So, in the simulation, we simplified the computation of trust. Each node in the network stores the other nodes' trust values. The trust value of a node is computed and updated by trust agents that reside on network nodes [8]. In our simulation, the trust values of all nodes are stored in each node in advance. We represent trust on a scale from -1 to +1, a continuous range from complete distrust to absolute trust. The trust value in route R by source node S is represented as $T_S(R)$ and given by the following equation.

$$T_S(R) = W_S(N_i) * T_S(N_i), \qquad W_S(N_i) = 1, \qquad 0 \ldots$$

A Localized Algorithm for Minimum-Energy Broadcasting Problem

... to all his neighbors reachable by power level $p_s^{max}$. Based on Algorithm DNFT, the nodes in the MANET will cooperatively discover the set of transmitting nodes:

– The source node s:
1. Send a broadcasting tree construction request message $\langle btc_{s,id} \rangle$ to all its neighbors reachable by power level $p_s^{max}$;
2. Based on the ACK messages from all neighbors, compute $N_s^{max}$;
3. Based on the $N_v^{max}$ info of all neighbors, compute the 2-hop graph G;
4. Run Algorithm DNFT to compute the transmitting neighbor set;
5. Send $\langle btt_{s,id} \rangle$ to all transmitting neighbors.
– On receipt of a btc message at node v:
1. Send back an ACK message;
2. Broadcast its own btn message;
3. Based on the ACK messages from all neighbors, compute its $N_v^{max}$;
4. Send back an $N_v^{max}$ message.
– On receipt of a btn message: Send back an ACK message.
– On receipt of a btt message: Run the same procedures as a source node.

A node discards any message it has already received in this session. Afterwards every transmitting node has a list of its neighboring transmitting nodes, and every node knows which node covers it. The source node then broadcasts the data according to the range computed in the tree construction session, and only the transmitting nodes forward the data, each according to its own computed power range. All duplicate messages are discarded, and all other common nodes only need to receive data from a transmitting node. Figure 1 shows the cover map of a 60-node 1km × 1km MANET computed by our protocol.

Fig. 1. L) The topology of a given MANET; R) The broadcasting relay map

4 Performance Evaluation

We evaluate our protocol and compare its performance with the RBOP protocol proposed by J. Cartigny et al. in [7], whose performance is competitive with the centralized BIP protocol in [6]. The parameters are almost the same as in [7] for consistency. The number of nodes is n = 100. The maximum communication radius is $r_{max} = 5 \times 50$ meters. Nodes are uniformly distributed in a square area whose size is adjusted to obtain a given density (from 6 to 30 nodes per communication zone). For each measure, 5000 broadcasts have been run. The observed parameter is the energy consumption according to two commonly used energy models: $k = 1, \alpha = 2, c = 0$ and $k = 1, \alpha = 4, c = 10^8$ [5, 8]. For each broadcast session, we calculate the total energy consumption $E(T) = \sum_{u \in V} E(u)$. Usually $E(T)$ is very large, so we divide it by the total energy consumption needed by the blind flooding protocol with maximal range, $E_{flooding} = n \times (R^{\alpha} + c)$. This quotient, expressed as a percentage, is the average Expended Energy Ratio (EER): $EER = 100 \times E(T) / E_{flooding}$.

We show the comparison of DNFT with RBOP, BIP and RTCP in Figure 2, from which we can observe that DNFT is better than RBOP when the degree is high. This is because RBOP chooses all its RNG-neighbors as transmitting nodes. But the distances from the source node to its RNG-neighbors can differ widely, so some short-range nodes are unnecessarily made transmitting nodes. As a result, the final number of transmitting nodes in DNFT is smaller than in RBOP, and the reachability is always 100%.
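The EER metric defined above is easy to reproduce numerically. The sketch below assumes that a node transmitting with radius r spends k·r^α + c and that a silent node spends nothing; this per-node cost is an assumption chosen to be consistent with the E_flooding expression, and the function names and sample radii are hypothetical.

```python
def node_energy(radius, k=1.0, alpha=2.0, c=0.0):
    """Assumed per-node cost: k * r**alpha + c when transmitting, 0 when silent."""
    if radius <= 0:
        return 0.0
    return k * radius ** alpha + c


def expended_energy_ratio(radii, r_max, k=1.0, alpha=2.0, c=0.0):
    """EER = 100 * E(T) / E_flooding, with E_flooding = n * (k * r_max**alpha + c)."""
    if not radii:
        raise ValueError("at least one node is required")
    total = sum(node_energy(r, k, alpha, c) for r in radii)
    flooding = len(radii) * (k * r_max ** alpha + c)
    return 100.0 * total / flooding


# Four nodes: one transmits at full range, one at half range, two stay silent.
radii = [250.0, 125.0, 0.0, 0.0]
eer_model_1 = expended_energy_ratio(radii, r_max=250.0, alpha=2.0, c=0.0)
eer_model_2 = expended_energy_ratio(radii, r_max=250.0, alpha=4.0, c=1e8)
```

With the first model the half-range node costs a quarter of a full-range one, so this example yields an EER of 31.25; the constant c in the second model makes every additional transmitter more expensive regardless of its range.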

Fig. 2. Expended Energy Ratio (EER) versus average degree for MTCP, RTCP, BIP, RBOP and DNFT. L) EER when $\alpha = 2$, $c = 0$; R) EER when $\alpha = 4$, $c = 10^8$

It is not surprising that the best algorithm is the centralized BIP, since with global knowledge one can always make better choices. As the density rises, however, the differences among these protocols shrink and their curves converge. The trend is downward because, at high density, blind flooding uses more unnecessary transmitting nodes, so the energy spent by the other protocols falls relative to it.

5 Conclusion

In this paper we first presented the CNFT algorithm for computing an energy-efficient broadcasting tree in a small MANET. We then extended it to the distributed algorithm DNFT and proposed our protocol for constructing an energy-efficient broadcasting tree in a large MANET. We have shown by simulation that our protocol is more energy efficient than RBOP, and that it is also flexible and scalable.

References

1. MANET. IETF Mobile Ad-hoc Network Working Group. http://www.ietf.org/html.charters/manet-charter.html.
2. Mario Cagalj, Jean-Pierre Hubaux and Christian Enz. Minimum-energy broadcast in all-wireless networks: NP-completeness and distribution issues. Proceedings of ACM MobiCom 2002, Atlanta, USA, September 2002.
3. R. G. Gallager, P. A. Humblet, P. M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), v.5 n.1, p.66-77, Jan. 1983.
4. M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, NY, 1979.
5. V. Rodoplu and T. H. Meng. Minimum energy mobile wireless networks. IEEE Journal on Selected Areas in Communications, 17(8), August 1999.


6. J. Wieselthier, G. Nguyen, and A. Ephremides. On the construction of energy-efficient broadcast and multicast trees in wireless networks. Proceeding of IEEE Infocom'2000, Tel Aviv, Israel, 2000, pp. 585-594.
7. J. Cartigny, D. Simplot and I. Stojmenovic. Localized Minimum-Energy Broadcasting in Ad-hoc Networks. Proceeding of IEEE Infocom'2003, USA, 2003.
8. S. Lindsey and C. Raghavendra. Energy efficient broadcasting for situation awareness in ad hoc networks. Proceeding of ICPP'01, Valencia, Spain, 2001.

Multipath Traffic Allocation Based on Ant Optimization Algorithm with Reusing Abilities in MANET*

Hui-Yao An, Xi-Cheng Lu, and Wei Peng

School of Computer, National University of Defense Technology, 410073, Changsha, P.R. China
{hyan, xclu, wpeng}@nudt.edu.cn

Abstract. In Mobile Ad Hoc networks it is important to distribute traffic over multiple paths sensibly and at the right time; otherwise the performance of multipath routing schemes degrades drastically. Most traffic distribution schemes proposed for MANETs so far do not take limited resources and frequent topology changes into account. These schemes do not reuse earlier distribution results, and they assign the same load to every path regardless of its quality, which increases computing overhead. To circumvent these problems, we propose a novel multipath traffic distribution method based on an Ant Optimization Algorithm with Reusing Abilities. Experimental results indicate that our method improves distribution efficiency by about 20% and improves the quality of multipath routing.

1 Introduction

An Ad Hoc network [1] is a peer-to-peer mobile network consisting of a large number of mobile nodes. Multipath routing is an important scheme for Ad Hoc networks. The main problems of the existing multipath schemes [2-4] in Ad Hoc networks are: 1) instead of adapting the size of the traffic allocation to the quality of each path, they distribute the same fixed share of traffic to every path; 2) every time the information is updated, they redistribute the traffic over the multiple paths again, which leads to high redundancy; 3) in the traffic allocation phase, they do not reuse the allocation result of the previous phase, which leads to high computing overhead.

2 Model of Multi-path

Consider the simple fork topology shown in Figure 1, where a source-destination pair is connected by $n$ disjoint paths $p_1, p_2, \ldots, p_n$. Each path $p_k$ has a (bottleneck) capacity of $c_k$ units of bandwidth, which is assumed to be known to the source $s$. Suppose flows arrive at the source $s$ at an average rate $\lambda$, and the average flow holding time is $1/\mu$. Throughout this section, we assume that flow arrivals are Poisson and that flow holding times are exponentially distributed. For simplicity, we also assume that each flow consumes 1 unit of bandwidth. In other words, path $p_k$ can accommodate $c_k$ flows at any time. The question is therefore how to route flows along these $n$ paths so that the overall blocking probability is minimized and the throughput is maximized.

* This research was supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB314802, the National Natural Science Foundation of China under Grant No. 90104001 and Higher Education and Research Program of Hunan Province.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 978 – 983, 2005. © Springer-Verlag Berlin Heidelberg 2005

Fig. 1. A set of disjoint paths between a source and a destination

Definition 1. We define the delay time of path $p_k$ between the source $s$ and the destination $d$ as the length of the path, denoted by

$$l(p_k) = \frac{1}{2} \sum_{i=s}^{d} D(i, j).$$

Definition 2. We define the minimal delay time from source $s$ to destination $d$ as the minimal length, denoted by $l_{min}(s, d)$. For a link $(i, j)$, if $l_{min}(j, d) \le l_{min}(i, d)$, we define the link $(i, j)$ as an available link, and we define the minimal length along the link $(i, j)$ to the destination node as the available-link length:

$$al(i, j) = l(i, j) + l_{min}(j, d) \qquad (1)$$

Definition 3. Let $l_i$ be the average length of all available paths attached to node $i$. Then the weight value of the available link $(i, j)$ is

$$p_e(i, j) = \exp\left[\frac{-k \cdot al(i, j)}{l_i}\right] \qquad (2)$$

and the reliability value of node $i$ is the sum of the weight values of all the available links attached to node $i$:

$$p_n(i) = \sum_{j} p_e(i, j) \qquad (3)$$

Here $k$ is a system parameter related to the number of paths attached to node $i$, about 3.0-3.5.

Definition 4. Given any path $p_k$, the reliability value of the path is derived as

$$p_{p(k)} = \prod_{i=1}^{n} p_e(i) \cdot \prod_{j=s}^{d} p_n(j) \qquad (4)$$

and the traffic allocation rate of the path $p_k$ is

$$r_{p_k} = \frac{w_{p_k}}{\sum_{k=1}^{n} w_{p_k}} \qquad (5)$$
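Equations (2), (3) and (5) can be sketched in a few lines. This is only an illustration under stated assumptions: the average length $l_i$ is taken to be the mean of the available-link lengths at node $i$, the path weights $w_{p_k}$ are supplied by the caller, and all function names are hypothetical.

```python
import math


def link_weight(al_ij, l_i, k=3.0):
    """Eq. (2): weight of an available link with available-link length al_ij."""
    return math.exp(-k * al_ij / l_i)


def node_reliability(al_lengths, k=3.0):
    """Eq. (3): sum of the weights of all available links attached to a node.

    al_lengths: available-link lengths al(i, j) of the links leaving node i.
    Assumption: l_i is the arithmetic mean of these lengths.
    """
    l_i = sum(al_lengths) / len(al_lengths)
    return sum(link_weight(al, l_i, k) for al in al_lengths)


def allocation_rates(path_weights):
    """Eq. (5): normalise the path weights w_pk into traffic allocation rates."""
    total = sum(path_weights)
    return [w / total for w in path_weights]
```

A path that is twice as heavy as each of two others receives half of the traffic: `allocation_rates([1, 1, 2])` yields the rates 0.25, 0.25 and 0.5.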


3 Multipath Traffic Allocation Based on Ant Optimization Algorithm with Reusing Abilities

The Multipath Traffic Allocation Scheme Based on Ant Optimization Algorithm with Reusing Abilities operates in two stages: 1) initialize the traffic of every path with its traffic allocation at the previous time, and convert these values into the initial pheromone for the Ant System Algorithm; 2) search for the optimal traffic allocation of every path using the ant algorithm, which offers positive feedback and fast convergence.

Fig. 2. Traffic Initialization of multiple paths

3.1 Assign the Initialization of Traffic

When one or more nodes move, the network topology changes. Some paths break and some new paths are added, but most paths remain unchanged. We initialize the traffic allocation of the unchanged paths with their previous values, and we switch the traffic of the broken paths to the new paths: the proportions of flows along the new paths are initialized according to the weights of those paths, and the total load along the new paths equals the total load of the broken paths. As shown in Fig. 2, paths $p_3$, $p_4$, $p_5$ do not change, so their initial traffic at time $T_{i+1}$ is $r_3$, $r_4$, $r_5$ respectively, the same as their traffic at time $T_i$. The initial traffic of the newly added path $p_6$ is $r_1 + r_2$, the sum of the traffic of the paths $p_1$ and $p_2$ at time $T_i$.
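The reuse step described in this subsection can be sketched as follows. The dictionary-based representation and the function name are assumptions made for illustration only; the paper does not prescribe a data structure.

```python
def initialise_traffic(prev_rates, surviving, new_weights):
    """Initial traffic at time T_{i+1}, reusing the allocation from time T_i.

    prev_rates:  {path: traffic rate at time T_i}
    surviving:   set of paths that are still valid at time T_{i+1}
    new_weights: {new path: weight}, used to split the load of broken paths
    """
    # Unchanged paths keep their previous allocation.
    init = {p: r for p, r in prev_rates.items() if p in surviving}
    # The load carried by broken paths is moved to the new paths ...
    broken_load = sum(r for p, r in prev_rates.items() if p not in surviving)
    total_weight = sum(new_weights.values())
    # ... in proportion to the weights of the new paths.
    for p, w in new_weights.items():
        init[p] = broken_load * w / total_weight
    return init
```

With the paths of Fig. 2, calling the function with `surviving={"p3", "p4", "p5"}` and a single new path `p6` gives `p6` the combined load of `p1` and `p2`, while `p3`, `p4` and `p5` keep their rates.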


3.2 Convert Initialization of Traffic into Initial Pheromone

In order to avoid premature convergence of the traffic allocation search, we adopt the Max-Min Ant System introduced in [5, 6], and we limit the pheromone intensity of the Ant System Algorithm to the interval $[\tau_{min}, \tau_{max}]$. The process of converting the initialization of traffic into initial pheromone is as follows. For the proportion of flows on each path at time $T_{i+1}$, the value of the initial pheromone $\tau_{initial}$ is decided by the reuse ratio $p_{ref-reuse}$ of this path, and we let the initial pheromone $\tau_{initial}$ be uniformly distributed over $[\tau_{min}, \tau_{max}]$. Therefore, the probability density of the initial pheromone $\tau_{initial}$ on $[\tau_{min}, \tau_{max}]$ is $\frac{1}{\tau_{max} - \tau_{min}}$, and the relationship between $\tau_{initial}$ and $p_{ref-reuse}$ is:

$$p_{ref-reuse} = \int_{\tau_{min}}^{\tau_{initial}} \frac{1}{\tau_{max} - \tau_{min}} \, dx \qquad (6)$$

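Evaluating the integral on the right-hand side of this equation and simplifying, we have:

$$p_{ref-reuse} = \int_{\tau_{min}}^{\tau_{initial}} \frac{1}{\tau_{max} - \tau_{min}} \, dx = \left. \frac{x}{\tau_{max} - \tau_{min}} \right|_{\tau_{min}}^{\tau_{initial}} = \frac{\tau_{initial} - \tau_{min}}{\tau_{max} - \tau_{min}} \qquad (7)$$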

From the above derivation, we obtain the equation that converts the initial traffic assignment into the initial pheromone:

$$\tau_{initial} = \tau_{min} + p_{ref-reuse} \cdot (\tau_{max} - \tau_{min}) \qquad (8)$$

where $\tau_{initial}$ is the value of the initial pheromone, $\tau_{min}$ and $\tau_{max}$ are the minimum and maximum values of pheromone in the MMAS Ant System Algorithm, and $p_{ref-reuse}$ is the reuse confidence degree of the proportion of flows on each path at time $T_i$. If we set $p_{ref-reuse} = 0$, the initial pheromone $\tau_{initial}$ takes the minimum value $\tau_{min}$. If we set $p_{ref-reuse} = 1$, the initial pheromone $\tau_{initial}$ takes the maximum value $\tau_{max}$. As the reuse confidence degree $p_{ref-reuse}$ increases, the initial pheromone increases too, so the traffic allocation process searches toward optimal results.
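The conversion in Eq. (8) is a linear interpolation between the pheromone bounds. A minimal sketch, in which the function name and the range check are illustrative additions:

```python
def initial_pheromone(p_ref_reuse, tau_min, tau_max):
    """Eq. (8): map a reuse confidence in [0, 1] to an initial pheromone value."""
    if not 0.0 <= p_ref_reuse <= 1.0:
        raise ValueError("p_ref_reuse must lie in [0, 1]")
    return tau_min + p_ref_reuse * (tau_max - tau_min)
```

With bounds 1.0 and 5.0, a reuse confidence of 0 gives 1.0, a confidence of 1 gives 5.0, and a confidence of 0.5 gives the midpoint 3.0, matching the two limiting cases discussed above.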

4 Simulation

4.1 Performance Metrics

We study the performance of our scheme using the CBMRP algorithm. We compare CBMRP+ (CBMRP using our scheme) with CBMRP according to the following metrics:


Packet Delivery Ratio (PDR):

$$PDR = \frac{\text{Number of Data Received}}{\text{Number of Data Originated}}$$

Control overhead: The control overhead is defined as the total number of routing control packets normalized by the total number of received data packets.

Load balancing: We use a graph $G = (V, E)$ to denote the network, where $V$ is the node set and $E$ is the link set. We define a state function $f : V \to I$, where $I$ is the set of positive integers; $f(v)$ represents the number of data packets forwarded at node $v$. Let

$$CoV(f) = \frac{\text{standard deviation of } f}{\text{mean of } f}.$$

We use $CoV(f)$ as a metric to evaluate the load balancing. The smaller $CoV(f)$ is, the better the load balancing.

4.2 Simulation Results

Fig. 3 shows the throughput in terms of packet delivery ratio. Our scheme improves the throughput. For both schemes the throughput decreases as the mobile speed increases, because a higher speed makes the network topology change more frequently and raises the network overhead. The throughput of CBMRP+ is larger than that of CBMRP because CBMRP+ uses the Ant Optimization Algorithm to search for optimal results. The positive-feedback mechanism of this algorithm lets the traffic distribution procedure find good results faster, so CBMRP+ can deliver more packets to the destination node. CBMRP+ still suffers some packet loss.

Fig. 3. Packet delivery ratio

Figure 4 shows the control overhead. When the number of sessions is 20, the control overhead of CBMRP is lower than that of CBMRP+. When the number of sessions is 40, the control overhead of CBMRP is slightly higher than that of CBMRP+. The reason is that the propagation overhead decreases once our method distributes the traffic over diverse multiple paths. Therefore, the total control overhead of CBMRP+ is lower than that of CBMRP when the number of sessions is large enough: the larger the number of sessions, the lower the cost of CBMRP+ relative to CBMRP.


Fig. 4. Control overhead with varying speed

Fig. 5. CoV of the network load with varying speed

Figure 5 gives the results of load balancing. The CoV of the network load for CBMRP is higher than that for CBMRP+. This is because CBMRP+ distributes the network traffic along different paths with an appropriate load on each. In other words, our scheme assigns duties to the paths more fairly, which benefits load balancing. As the pause time decreases, the CoV of the network load also decreases for both unipath and multipath routing. This shows that an increase in mobility can result in better load balancing of the traffic among the nodes: "hot spots" are likely to be removed by mobility.

References

1. A. Ephremides, J. E. Wieselthier and D. J. Baker, "A design concept for reliable mobile radio networks with frequency hopping signaling," Proc. IEEE, vol. 75, no. 1, Jan. 1987, pp. 56-73.
2. R. Krishnan and J. A. Silvester, Choice of Allocation Granularity in Multi-path Source Routing Schemes. IEEE INFOCOM'93, vol. 1, pp. 322-29.
3. I. Cidon, R. Rom, Y. Shavitt, Analysis of Multi-path Routing, IEEE/ACM Transactions on Networking, 7(6), pp. 885-896, 1999.
4. Alvin Valera, Winston K.G. Seah and SV Rao, Cooperative Packet Caching and Shortest Multipath Routing in Mobile Ad hoc Networks, IEEE INFOCOM 2003.
5. M. Dorigo, V. Maniezzo, A. Colorni. Ant System: Optimization by a Colony of Cooperating Agents. IEEE Transactions on Systems, Man and Cybernetics, Part-B, 1996, 26(1): 29-41.
6. T. Stutzle, H. H. Hoos. MAX-MIN Ant System. Future Generation Computer System, 2000, 16(8): 889-914.

Routing Algorithm Using SkipNet and Small-World for Peer-to-Peer System*

Xiaoqin Huang, Lin Chen, Linpeng Huang, and Minglu Li

Department of Computer Science and Engineering, Shanghai Jiao Tong University, No.1954, HuaShan Road, Shanghai, 200030
{huangxq, chenlin}@sjtu.edu.cn

Abstract. In this paper, we design a new routing algorithm using SkipNet and Small-World for peer-to-peer systems. The algorithm divides the routing space into two layers, a SkipNet layer and a Small-World layer. In the SkipNet layer, routing by numeric ID is discussed. In the Small-World layer, routing based on small-world theoretical results is discussed. We also consider dynamic circumstances, namely node join and departure, and we compare our algorithm with other algorithms. Our algorithm supports content and path locality, which is very important for security. In our algorithm, a few shortcuts to distant peers are inserted with some probability, and the average path length is reduced. Preliminary simulation results show that our algorithm is efficient.

1 Introduction

Scalable overlay networks, such as Chord [1], CAN [2], Pastry [3], and Tapestry [4], have become a hotspot for study. They emerged as flexible infrastructure for building large peer-to-peer systems. These networks use a distributed hash table (DHT), which allows data to be uniformly diffused over all the participants in the peer-to-peer system [5]. Although DHTs provide nice load balancing properties, they have at least two disadvantages: data may be stored far from its users, and it may be stored outside the administrative domain to which it belongs [5]. Papers [5] [6] introduce SkipNet or Skip Graphs, a distributed generalization of Skip Lists [7], to meet the goals of peer-to-peer systems. The scheme supports content locality and path locality, which can provide a number of advantages for data retrieval, performance, manageability and security. Content locality can improve security by allowing one to control the administrative domain in which data resides [5].

A social network exhibits the small-world phenomenon [8]. Recent work has suggested that the phenomenon is pervasive in networks, especially in the structural evolution of the World Wide Web [9].

In this paper, we propose an algorithm that combines the SkipNet and Small-World schemes. Our algorithm obtains content and path locality by adopting the SkipNet scheme, and it reduces the average routing length by adopting the Small-World scheme, so it can perform a well-ordered search from a global view.

* This paper is supported by SEC E-Institute: Shanghai High Institutions Grid project, the 863 high Technology Program of China (No. 2004AA104340 and No. 2004AA104280), and the Natural Science Foundation of China (No. 60433040 and No. 60473092).

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 984 – 989, 2005. © Springer-Verlag Berlin Heidelberg 2005


The rest of this paper is organized as follows: Section 2 describes related work. Section 3 describes the routing algorithm. Section 4 presents the simulation experiments. Section 5 gives the comparisons and Section 6 concludes the paper.

2 Related Work

Skip List Data Structure. A skip list [7] is a randomized balanced tree data structure organized as a tower of increasingly sparse linked lists. Level 0 of a skip list is a linked list of all nodes in increasing order by key. In a doubly linked skip list, each node stores a predecessor pointer and a successor pointer for each list in which it appears. The lists at higher levels act as "express lanes" that allow the sequence of nodes to be traversed quickly. Searching for a node with a particular key involves searching first in the highest level, and repeatedly dropping down a level whenever it becomes clear that the node is not in the current level [6].

The SkipNet Structure. First of all, we define the concept of a skip graph. As in a skip list, each node in a skip graph is a member of multiple linked lists. The level 0 list consists of all nodes in sequence. A skip graph differs from a skip list in that there may be many lists at level i [6]. We transfer the concept of a Skip List to a distributed system setting by replacing data records with computer nodes, using the string name IDs of the nodes as the data record keys, and forming a ring instead of a list. The ring must be doubly linked to enable path locality [5]. A skip graph supports search, insert, and delete operations analogous to the corresponding operations for skip lists. For node 6 (in Fig. 1), the routing table may be as in Fig. 2.

Fig. 1. A Skip List

Fig. 2. SkipNet nodes and routing tables of node 6 (level 2: 12, 12; level 1: 9, 3; level 0: 7, 5)
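The skip list search described in this section, which starts in the highest level and drops down a level whenever the key cannot be reached in the current one, can be sketched as follows. This is a generic, singly linked textbook-style skip list written for illustration; the class names, the maximum height of 16 and the coin-flip probability of 0.5 are assumptions and not part of SkipNet.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # forward[i]: successor in the level-i list


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed=None):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel present in every level
        self.level = 1                          # number of levels currently in use
        self.rng = random.Random(seed)

    def _random_height(self):
        # Each extra level is added with probability 1/2, so lists get sparser.
        height = 1
        while height < self.MAX_LEVEL and self.rng.random() < 0.5:
            height += 1
        return height

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL   # last node before key, per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        height = self._random_height()
        self.level = max(self.level, height)
        new = Node(key, height)
        for i in range(height):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        # Use the sparse "express lanes" first, then drop down level by level.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

SkipNet replaces these lists with doubly linked rings of machines keyed by name ID, but the top-down search pattern is the same.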

Small-World Phenomenon and Model. Duncan J. Watts and Steven H. Strogatz present a widely used Small-World model [9] [11]: $n$ nodes are distributed on a ring, and initially each node has $k$ links, which connect it to its $k$ nearest nodes. Then the links of each node are adjusted in turn: the far end of a link is rewired to a random node with probability $p$, but never to the node itself. Let $D(i, j)$ be the shortest distance between nodes $i$ and $j$. The average path distance is

$$L = \frac{1}{n(n-1)/2} \sum_{1 \le i < j \le n} D(i, j).$$

When $p \approx 0$, $L \sim \frac{n}{2k}$ and the network is in the regular state. When $0.001 < p < 0.01$, $L \sim \frac{\ln n}{\ln k}$: a node is connected not only to neighbor nodes but also to remote nodes. These shortcuts shorten $L$, and the entire network has the small-world characteristic.

Kleinberg [8] modeled social networks by taking a two-dimensional grid and connecting each node $u$ to $q$ edges, where edge $(u, v)$ exists with probability proportional to $\|u - v\|^{-2}$. For simplicity, we remove the parameter $q$ and assume that each edge $(u, v)$ is connected with probability $\|u - v\|^{-2}$. For any dimension $d > 1$, the small-world graph of dimension $d$ has $n$ nodes associated with the points of a $d$-dimensional mesh, where edge $(u, v)$ is occupied with probability $\|u - v\|^{-d}$ [10].
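The ring-rewiring model and the average path distance L described above can be explored with a short sketch. It follows the usual Watts-Strogatz construction as we understand it; the function names are hypothetical, k is assumed to be even, and pairs that become disconnected by rewiring are simply left out of the average.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ring of n nodes in which each node links to its k nearest nodes (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, k, p, rng):
    """Rewire the far end of each lattice link with probability p (no self-links)."""
    n = len(adj)
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if j in adj[i] and rng.random() < p:
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


def average_path_length(adj):
    """Mean shortest distance D(i, j) over all connected pairs, via BFS."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

Comparing `average_path_length(ring_lattice(n, k))` with the value after `rewire(..., p, random.Random(seed))` for small p is one way to observe how a few shortcuts shorten L.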

3 Routing Algorithm

Key Composing. In our algorithm, we consider how to transfer a message from explorer.sjtu.edu.cn to receiver.sjtu.edu.cn while staying within *.sjtu.edu.cn and never passing through, for example, intruder.tsinghua.edu.cn. This guarantees path locality and therefore message security. We employ two separate but related address spaces: a string name ID space and a numeric ID space. Node names and content identifier strings are mapped directly into the name ID space. Each node's random choice of ring memberships can be encoded as a unique binary number, which we refer to as the node's numeric ID. We do not use the hash method as in Chord, CAN, Gnutella, Pastry, etc. For example, the node explorer.sjtu.edu.cn can be mapped to the numeric ID 011.101.1001. The first three bits of the numeric ID represent the country, the next three bits represent the school, and the last four bits represent the school's individual users.

Routing Table Structure. Each node has to maintain a routing table of length K. The routing table is composed of two parts: a SkipNet table (SNT) and a Small-World table (SWT). The SNT records the node's neighbor nodes at levels 0, 1 and 2, as in Fig. 2. The SWT records queries and destination node IDs, both represented by numeric IDs. If the length of the SkipNet table is L, then the length of the Small-World table is K − L. The routing table of our algorithm is shown in Table 1.

Routing Table Renewal Strategy. The SWT renewal strategy is similar to that of [11]. We define the shortest distance between nodes a and b as D(a, b).

$$D(a, b) = \min\{|a - b|,\ M - |a - b|\},$$

where $M$ is the total number of nodes in the ring. The SWT renewal of node $u$ is as follows:

(1) Select the nearest nodes and add them to node $u$'s SWT until the SWT is full.

(2) Calculate each record's deletion probability. The deletion probability of key $v$ is

$$P(v) = \frac{D(u, v)^{-1}}{\sum_{w} D(u, w)^{-1}},$$

where $w$ ranges over the keys in the SWT [11]. Randomly select a key as the deletion record, denoted $key_{Del}$.

(3) Recalculate the deletion probabilities of $key_{Del}$ and $key_{Ins}$ as in [11].

The deletion object is determined by the deletion probability, so the SWT can be adjusted. Because the renewal algorithm renews the SWT probabilistically, the SWT gains a few shortcuts.

Table 1. Routing table of algorithm

Node 6

SWT    Key             Node ID
       011.101.1001    011.101.1010
       ...

SNT    Level
       2       12    12
       1       9     3
       0       7     5
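The distance and deletion-probability formulas of the renewal strategy can be sketched as below. The code assumes integer positions on a ring of M nodes and the ring distance min(|a − b|, M − |a − b|); the function names are hypothetical, and the details of step (3) from [11] are not reproduced.

```python
import random


def ring_distance(a, b, m):
    """Shortest distance between positions a and b on a ring of m nodes."""
    d = abs(a - b) % m
    return min(d, m - d)


def deletion_probabilities(u, swt_keys, m):
    """P(v) = D(u, v)^-1 / sum_w D(u, w)^-1 for every key v in u's SWT.

    Assumes no key coincides with u, so every distance is non-zero.
    """
    inverse = {v: 1.0 / ring_distance(u, v, m) for v in swt_keys}
    total = sum(inverse.values())
    return {v: x / total for v, x in inverse.items()}


def pick_deletion_key(u, swt_keys, m, rng):
    """Draw the record to delete (key_Del) according to the deletion probabilities."""
    probs = deletion_probabilities(u, swt_keys, m)
    keys = list(probs)
    return rng.choices(keys, weights=[probs[v] for v in keys], k=1)[0]
```

Under this formula nearby keys are the most likely to be deleted, so distant entries tend to survive in the SWT and act as shortcuts.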

Routing by Numeric ID. Suppose we want to route a message from node A to a given numeric ID, for example 011.101.1010. We first route the message to the node in level 0 whose numeric ID matches the destination numeric ID in the first digit. At this point the routing operation jumps up to this node's level 1 ring, which also contains the destination node. The routing operation then examines nodes in this level 1 ring until a node is found whose numeric ID matches the destination numeric ID in the second digit [5]. The routing operation proceeds in this way until the first six bits are matched. These operations are conducted in the SNT. We then route the message using the SWT. In our algorithm, the query is the key. Because we adopt the probabilistic algorithm in the SWT, there are some shortcuts to remote nodes, so the key search length can be reduced efficiently.

Node Join and Departure. The node join and departure process is similar to that of [5]. To join a SkipNet, a newcomer must first find the top-level ring that corresponds to the newcomer's numeric ID. The newcomer then finds its neighbors in this top-level ring, using a search by name ID within this ring only. Starting from one of these neighbors, the newcomer searches for its name ID at the next lower level and thus finds its neighbors at this lower level. This process is repeated for each level until the newcomer reaches the root ring.

4 Simulation Experiment

To evaluate the routing algorithm, we conduct simulation experiments. We take the average path length as the main evaluation metric [12], because the average path length determines a message's waiting time. We ran simulations that compare performance in two circumstances: the NoN Small-World algorithm (using only the SkipNet scheme) and the SkipNet with Small-World algorithm. For each graph size we ran 10 executions. The routing table capacity is 100, of which the SNT takes 30 and the SWT takes 70. The starting node is selected randomly. As the result, we report the ratio of the average path length of SkipNet with NoN Small-World to that of SkipNet with Small-World. When the number of nodes is $10^3$, the ratio is 1.4. When the number of nodes is $10^4$, the ratio is 2.1. When the number of nodes is $10^5$, the ratio is 3.5. These initial results show that the effect of the Small-World phenomenon becomes more pronounced as the number of nodes increases. In future work, we will give more performance analysis, for example of the routing success rate and of its relationship with the routing table's capacity.

5 Comparisons

Our routing algorithm uses the SkipNet and Small-World schemes, so it has a fundamental philosophical difference from existing overlay networks such as Chord and Pastry, whose goal is to implement a DHT. The basic philosophy of systems like Chord and Pastry is to diffuse content randomly throughout an overlay in order to obtain uniform, load-balanced, peer-to-peer behavior. The basic philosophy of SkipNet is to enable systems to preserve useful content and path locality [5]. Path locality allows SkipNet to guarantee that messages between two nodes within a single administrative domain never leave the domain, which prevents attacks from other administrative domains. Our algorithm also uses the Small-World scheme: there are some shortcuts to remote nodes, and these shortcuts reduce the average path length.

In previous peer-to-peer overlay designs [1] [2] [3] [4], node placement in the overlay topology is determined by a randomly chosen numeric ID. Nodes within a single organization are placed uniformly throughout the address space of the overlay, so when the nodes of a single organization fail together, the failure can affect the entire network. Since SkipNet name IDs tend to encode organizational membership, and nodes with common name ID prefixes are contiguous in the overlay, failures along organization boundaries do not completely fragment the overlay, but instead result in ring segment partitions [5]. A message routed inside an organization that is disconnected from the Internet can still reach its destination. Our algorithm therefore has good fault tolerance.

In our routing algorithm, we put a routing table on each node instead of using a distributed hash table. Most unstructured peer-to-peer systems with a distributed hash table run a blind search rather than a global one, and they lack global properties. Our scheme can perform a well-ordered search from a global view.

The disadvantage of our algorithm is that each node has to reserve some cache space and some computation capacity.

6 Conclusion

In this paper, we design a peer-to-peer routing algorithm using the SkipNet and Small-World schemes. In our routing algorithm, we put a routing table on each node instead of using a distributed hash table, so the algorithm can perform a well-ordered search from a global view. Furthermore, our algorithm supports content and path locality and has good fault tolerance, which is very important for security. Shortcuts to remote peers are inserted with some probability, and the average path length is reduced. The performance of our algorithm is discussed, and preliminary simulation results show that it is efficient. In the future, we will give more performance analysis, for example of the routing success rate and of its relationship with the routing table's capacity.

References

1. Stoica, I., Morris, R., Karger, D. et al.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM, Aug. (2001).
2. Ratnasamy, S., Francis, P., Handley, M. et al.: A Scalable Content-Addressable Network. In Proceedings of ACM SIGCOMM, Aug. (2001).
3. Rowstron, A. and Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, Nov. (2001) 329-350.
4. Zhao, B. Y., Kubiatowicz, J. D. and Joseph, A. D.: Tapestry: An Infrastructure for Fault-Resilient Wide-area Location and Routing. Technical Report UCB//CSD-01-1141, UC Berkeley, April (2001).
5. Nicholas J.A. Harvey, Michael B. Jones, Stefan Saroiu et al.: SkipNet: A Scalable Overlay Network with Practical Locality Properties. Fourth USENIX Symposium on Internet Technologies and Systems (USIT'03), Seattle, WA, March (2003).
6. J. Aspnes and G. Shah.: Skip graphs. In fourteenth ACM SIAM Symposium on Discrete Algorithms (SODA), (2003) 384-393.
7. Pugh, W.: Skip Lists: A probabilistic Alternative to Balanced Trees. In Workshop on Algorithms and Data Structures, (1989).
8. Kleinberg, J.: The small-world phenomenon: an algorithmic perspective. Cornell Computer Science Technical Report, (2000) 99-1776.
9. Watts, D. and Strogatz, S.: "Collective dynamics of small-world networks". Nature 393, 440 (1998).
10. Moni Naor, Udi Wieder.: Know thy Neighbor's Neighbor: Better Routing for Skip-Graphs and Small Worlds. IPTPS 04, (2004).
11. Zhou, J., Lu H. and Li, Y. D.: Using Small-World to Devise Routing Algorithm for Unstructured Peer-to-Peer System. Journal of Software. Vol.15, No.6, (2004).
12. Yang, B., Garcia-Molina H.: Improving search in peer-to-peer networks. Proceedings of the Int'l Conference on Distributed Computing Systems. IEEE Computer Society, (2002) 5-14.

Smart Search over Desirable Topologies: Towards Scalable and Efficient P2P File Sharing

Xinli Huang, Yin Li, Wenju Zhang, and Fanyuan Ma

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, P.R. China, 200030
{huang-xl, liyin, zwj03, fyma}@sjtu.edu.cn

Abstract. Gnutella-like peer-to-peer networks exhibit strong small-world properties and power-law node degree distributions. However, the existing flooding-based query algorithms used in such overlay networks know very little about these inherent properties, so they scale poorly, with inefficient search and heavy traffic load, which remains a challenging problem. In this paper, we focus on the role of overlay topology in search performance and propose a novel solution towards scalable and efficient peer-to-peer distributed file sharing that makes better use of the emergent topological properties of these networks. We first examine what can be learned from these properties and provide several guidelines as the design rationale of our solution. We then propose a new technique for constructing Desirable Topologies and a novel Smart Search algorithm operating on them, as the two key components of our solution. To justify the performance gains of our techniques, we also conduct extensive experiments under realistic network conditions and make an all-around comparison with currently well-known systems.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 990–995, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction
As representative unstructured Peer-to-Peer (P2P) networks, Gnutella and its extensions [1] support uncoupled data placement, elaborate semantic queries, and highly dynamic scenarios. These properties make such systems extraordinarily suitable for large-scale distributed file sharing, which is still the dominant application on current P2P systems [2]. The main difficulty in designing search algorithms for these systems is that very little is currently known about the nature of the network topology on which the algorithms operate. As a result, even simple protocols may produce complex interactions that directly affect overall system performance.

The aim of this paper is therefore to develop techniques that improve the search efficiency and network utilization of Gnutella-like P2P file-sharing systems, by examining the role of overlay topologies in the performance of these systems and by drawing on their intrinsic topological properties, namely the small-world phenomenon and power-law node degree distributions.

Numerous contributions have been made towards this goal in recent years [3]. However, most of them concentrate on measurement studies [4] or on algorithmic optimizations [5] such as query, data placement, and replication, and take little account of the network topology and its impact on system performance, with very few exceptions [6, 7, 8]. The authors of [6] propose Gia, a P2P file-sharing system extended from Gnutella, focusing on strong guarantees of congruence between high-capacity nodes and high-degree nodes. However, they do not consider neighbors' proximity in the underlying network, and they assume that high-degree nodes certainly possess high capacity and are more stable than average, which does not hold in the highly dynamic and transient setting of P2P networks. In [7], the authors introduce Acquaintances to build interest-based communities in Gnutella by dynamically adapting the overlay topology based on query patterns and the results of preceding searches. Because this design has no feasible measure to limit the explosive growth of node degree, the network could quickly become divided into several disconnected subnetworks with disjoint interests. The authors of [8], by studying the role of overlay topologies in the performance of unstructured P2P networks, develop a particular topology in which every node has many close neighbors and a few random neighbors. However, this method destroys the power-law link distribution and gives no guarantee of low diameter and large clustering.

In this paper, we focus on the role of overlay topology in search performance and propose a novel solution towards scalable and efficient P2P file sharing that makes better use of the emergent topological properties of these networks. We first examine what can be learned from these properties and derive several guidelines as the design rationale of our solution. We then propose a new technique for constructing Desirable Topologies and a novel Smart Search algorithm that operates on them. To demonstrate the performance gains of our techniques, we conduct extensive experiments under realistic network conditions and compare our solution with other currently well-known systems.
The rest of the paper proceeds as follows. Section 2 details the new techniques used for constructing desirable topologies, and Section 3 proposes the Smart Search algorithm. Section 4 describes the experimental setup and presents the simulation results that evaluate our solution. We conclude the paper in the last section.

2 Building Gnutella-Like P2P Networks with Desirable Topologies
Recent studies [4, 9, 10] have shown that Gnutella-like P2P networks exhibit strong small-world properties, power-law degree distributions, and a significant mismatch between the logical overlay and its projection on the underlying network, all of which can greatly affect the performance of algorithms such as those for routing and searching [11]. The existence of these properties is therefore an important issue to consider when designing new, more scalable application-level protocols.

Inspired by these intrinsic topological properties and their key role in system performance, we advocate generating an overlay topology according to the following principles: a) self-sustaining power-law degree distributions, b) dividing neighbors into many short ones and a few long ones, and c) creating better neighborship by selecting high-availability nodes as direct neighbors. The aim is to build an overlay topology with desirable properties, adapt peers towards better neighbors, and direct queries to the right next hops with as few duplicated messages as possible. Fig. 1 illustrates how to self-sustain the power-law degree distribution of a node while adapting the overlay topology towards neighbors with higher availability.

Fig. 1. How to self-sustain the out-degree of a node during adapting towards better neighbors

By increasing the fraction of links rewired we get the required low diameter. If the fraction of links deleted and rewired is p, then for very small p the average path length L(p) drops by orders of magnitude, whereas the clustering coefficient C(p) remains large, similar to that of a regular graph [13]. This is exactly what we desire: "small-world" properties. Below we give the pseudo code showing how to build Gnutella-like P2P networks with these desirable topological properties:

Variables:
  NbrsList: ordered list of neighbors, ordered by ə
  CandList: list of candidate neighbors, ordered by ə
  short_NbrsList, long_NbrsList: lists of short/long neighbors
  ə(P): the availability of node P, measured by the number of relevant results returned successfully by P in the near past
  α: proximity factor, the fraction of links that are short
  β: aging factor, with value in (0,1)
  δ: closeness between two nodes in the underlying network

// Upon a successful query from the requester Pr answered by Pa
WHILE (min(ə(Pi), ∀Pi∈NbrsList) < max(ə(Pj), ∀Pj∈CandList)) DO
  {NbrsList ← cand_node_max; CandList ← nbr_node_min}
age all nodes in NbrsList and CandList by a factor β;
ə(Pa)++;
IF (Pa ∈ NbrsList)   // Pa is an existing neighbor
  do nothing; return;
IF (ə(Pa) > min(ə(Pi), ∀Pi∈NbrsList))   // Pa is a candidate or a new node
  {NbrsList ← Pa; CandList ← nbr_node_min; return}
ELSE IF (Pa ∉ CandList)
  {CandList ← Pa; return}

// Upon a neighbor, say Py, leaving the network
IF (CandList != Ø)
  {NbrsList ← cand_node_max; return}
ELSE
  initiate K peers in CandList randomly by means of existing neighbors;
  enforce a neighbor randomly chosen from CandList;

// Ranking nodes in NbrsList by δ incrementally, build short_NbrsList
// and long_NbrsList by α for further utilization by Smart Search
short_NbrsList ← first α·N peers of all the N nodes in NbrsList;
long_NbrsList ← the remaining peers of NbrsList;
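The pseudo code above can be turned into a small runnable sketch. The Python below is only an illustration under stated assumptions: the class and method names are hypothetical, availability is kept as a plain floating-point score per peer, a smaller δ is assumed to mean a nearer peer, and the random refill of an empty candidate list is left out.

```python
class NeighborTable:
    """Hypothetical per-node state for the neighbor adaptation rules."""

    def __init__(self, avail, alpha=0.8, beta=0.9):
        self.avail = dict(avail)   # availability score of every known peer
        self.nbrs = set(avail)     # NbrsList: current direct neighbors
        self.cands = set()         # CandList: candidate neighbors
        self.alpha = alpha         # proximity factor: fraction of short links
        self.beta = beta           # aging factor, in (0, 1)

    def _worst_nbr(self):
        return min(self.nbrs, key=self.avail.__getitem__)

    def _swap(self, incoming):
        """Promote `incoming` and demote the weakest neighbor (out-degree unchanged)."""
        worst = self._worst_nbr()
        self.nbrs.remove(worst)
        self.cands.add(worst)
        self.cands.discard(incoming)
        self.nbrs.add(incoming)

    def on_query_answered(self, pa):
        """Called when peer `pa` successfully answered one of our queries."""
        # Promote candidates that have become better than the weakest neighbor.
        while self.nbrs and self.cands:
            best = max(self.cands, key=self.avail.__getitem__)
            if self.avail[self._worst_nbr()] >= self.avail[best]:
                break
            self._swap(best)
        for p in self.avail:                          # age every known score
            self.avail[p] *= self.beta
        self.avail[pa] = self.avail.get(pa, 0.0) + 1  # reward the answering peer
        if pa in self.nbrs:
            return
        if self.nbrs and self.avail[pa] > self.avail[self._worst_nbr()]:
            self._swap(pa)
        else:
            self.cands.add(pa)

    def on_neighbor_left(self, py):
        """Replace a departed neighbor with the best candidate, if there is one."""
        self.nbrs.discard(py)
        self.avail.pop(py, None)
        if self.cands:
            best = max(self.cands, key=self.avail.__getitem__)
            self.cands.remove(best)
            self.nbrs.add(best)

    def split_short_long(self, delta):
        """Rank neighbors by delta (assumed: smaller = nearer) and split by alpha."""
        ranked = sorted(self.nbrs, key=delta.__getitem__)
        n_short = int(self.alpha * len(ranked))
        return ranked[:n_short], ranked[n_short:]
```

Because every promotion is paired with a demotion, the number of direct neighbors stays fixed, which is how this sketch keeps the node's out-degree, and hence the degree distribution, unchanged while the neighbor set improves.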


3 Smart Search: A Scalable and Efficient Search Algorithm
In this section, we propose Smart Search, a two-pronged, directed search algorithm. Rather than forwarding incoming queries to all neighbors (as Gnutella does) or to randomly chosen neighbors (as random walks do), a node forwards the query to: 1) all short neighbors, using scoped flooding with a smaller TTL, and 2) k long neighbors, using random walks with adaptive termination checking.

Short neighbors dominate the number of a node's neighbors in our design, but they are relatively local in the underlying network and are also highly available to the requester. We can therefore flood queries to all of them with a much smaller TTL value and thus reduce network traffic without significantly affecting the success rate, mainly because duplicated messages are confined to a very local range and live only for a short time. In addition, we incorporate the termination checking (used in the long-neighbor case below) into this TTL method to create a novel adaptive termination mechanism.

Long neighbors are, on average, distributed across a relatively global region. In such a random-graph case, flooding-based search, even over only a small fraction of long neighbors, can cause an explosion of duplicated messages and heavy traffic load. So in Smart Search we apply k-walker random walks to long neighbors: a requesting node sends k queries to its k best long neighbors at one time, and each query takes its own path. Here k is a preset parameter much smaller than the number of the requester's long neighbors. To obtain adaptive query termination, we introduce a termination-checking mechanism: a walker periodically checks with the original requester before walking to the next step.

An additional advantage over the approach in [5] is that, with termination checking, a walker can learn whether a result has been found not only by other walkers probing long neighbors but also from a successful response by a short neighbor. That is, as soon as a request is answered successfully through short-neighbor probing, all queries in the long-neighbor probing are terminated in time. Due to space limitations, we omit the detailed algorithmic description.
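Since the paper omits the detailed algorithm, the following Python sketch shows only one possible reading of the description above, not the authors' implementation. The function name, the dictionary layout, the `check_every` and `max_steps` parameters, and the message accounting are all assumptions; the two phases also run one after the other here, whereas in a real network they would proceed concurrently.

```python
import random
from collections import deque


def smart_search(nbrs, holders, requester, ttl=2, k=2,
                 check_every=2, max_steps=32, rng=None):
    """Sketch of a two-pronged query.

    nbrs[n] = {"short": [...], "long": [...]}; long lists are assumed to be
    ordered best-first. holders is the set of nodes storing the wanted file.
    Returns (hits, messages).
    """
    rng = rng or random.Random(0)
    hits, messages = set(), 0

    # 1) Scoped flooding over short neighbors with a small TTL.
    seen = {requester}
    frontier = deque((n, ttl) for n in nbrs[requester]["short"])
    while frontier:
        node, left = frontier.popleft()
        messages += 1
        if node in seen:
            continue                  # duplicate message, dropped on arrival
        seen.add(node)
        if node in holders:
            hits.add(node)            # answered; this branch stops here
            continue
        if left > 1:
            frontier.extend((m, left - 1) for m in nbrs[node]["short"])

    # 2) k random walkers started at the k best long neighbors.
    walkers = list(nbrs[requester]["long"][:k])
    messages += len(walkers)
    for step in range(1, max_steps + 1):
        if not walkers:
            break
        if step % check_every == 0:   # termination check with the requester
            messages += len(walkers)
            if hits:                  # a short neighbor or a walker succeeded
                break
        moved = []
        for node in walkers:
            if node in holders:
                hits.add(node)        # this walker is done
                continue
            choices = nbrs[node]["long"] or nbrs[node]["short"]
            if choices:
                moved.append(rng.choice(choices))
                messages += 1
        walkers = moved
    return hits, messages
```

The point of the sketch is the shared `hits` set: a hit found by the scoped flood ends the walkers at their next check, which is the cross-phase termination described in the text.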

4 Experimental Setup and Results
To evaluate the performance gains of our solution, we consider a P2P network of 4,096 nodes, which corresponds to an average-size Gnutella network [4]. We rely on PLOD, a power-law out-degree algorithm, to generate an overlay topology with the desired degree distribution on the P2P network simulator [12]. In the simulations, 100 unique files with varying popularity are introduced into the system. Each file has multiple copies stored at different locations chosen at random. The number of copies of a file is proportional to its popularity. The copy counts are assumed to follow a Zipf distribution, with 2,000 copies of the most popular file and 40 copies of the least popular file. The queries that search for these files are initiated at random hosts on the overlay topology. Again, the number of queries for a file is assumed to be proportional to its popularity.

We focus on the performance aspects of search efficiency and network utilization, using the following metrics: 1) the success rate of queries, Pr(success); 2) the percentage of duplicate messages, duplicate msgs (%); 3) the distance to the search result, D; and 4) the variation of mean stress [9] on the underlying network, ν. We evaluate our solution, Smart Search over Desirable Topologies (SSDT), by comparing it with two currently well-known systems: a) Flooding over Gnutella (FG) and b) Random Walks over Random Topologies (RWRT).
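As a worked illustration of the workload described in the setup, the copy counts can be reproduced with a few lines of Python. The sketch assumes the Zipf law has the form copies(r) = 2000 / r^s for popularity rank r = 1..100; the paper does not state the exponent, so s is derived here from the two reported endpoints, and rounding to whole copies is a further assumption.

```python
import math

NUM_FILES, MAX_COPIES, MIN_COPIES = 100, 2000, 40

# Assumed form: copies(r) = MAX_COPIES / r**s, with s fixed by the endpoints.
s = math.log(MAX_COPIES / MIN_COPIES) / math.log(NUM_FILES)

# Copies of the file at each popularity rank (rank 1 = most popular).
copies = [round(MAX_COPIES / r ** s) for r in range(1, NUM_FILES + 1)]

# Queries are proportional to popularity, so they follow the same shares.
total = sum(copies)
query_share = [c / total for c in copies]

print(f"s = {s:.4f}, most popular = {copies[0]}, least popular = {copies[-1]}")
```

Under this assumed form the exponent is ln 50 / ln 100, roughly 0.85.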

Fig. 2. Success rate of queries, Pr(success) (%), as a function of #hops, the average number of hops (series: FG, RWRT, SSDT)

Fig. 3. The percentage of duplicate messages, duplicate msgs (%), as a function of #hops (series: FG, SSDT, RWRT)

Fig. 4. The distance to search result (D) as a function of variable file popularities (P) (series: RWRT, FG, SSDT)

Fig. 5. The variation of mean stress (ν) as a function of increasing node population (N) (series: RWRT, FG, SSDT)

Fig. 2 plots the success rate of queries as a function of the average number of hops needed. It shows that, with our solution, TTL can be set to a much smaller value (e.g., TTL=3) than in the other systems without reducing the success rate much. Fig. 3 shows that, with our solution, the percentage of duplicate messages stays at a very low level, especially after several hops, which results from the combination of desirable topological properties and the efficient Smart Search strategy. As for network utilization, Fig. 4 and Fig. 5 both show that our solution makes better use of knowledge of the underlying network, by dynamically optimizing neighborhood quality to reduce the distance to the search result, and by mapping more logical links to local physical links. These results verify the significant performance gains of our solution.


5 Conclusions
Motivated by the emergent collective properties of P2P network topologies, and by the fact that algorithm design has largely ignored the role of topology in system performance, we propose a solution named Smart Search over Desirable Topologies, with the aim of building more scalable and efficient peer-to-peer file-sharing systems. We achieve this goal by constructing overlay topologies with desirable properties, adapting peers towards better neighbors, and directing queries to the right next hops with as few duplicated messages as possible. The experimental evaluation has shown that our techniques are more effective at improving search efficiency and reducing traffic load on the underlying network than the currently well-known systems. Further work on the search algorithm and on factors such as large file downloading [14] and resource booking and reservation is orthogonal to our techniques and could be used to further improve the performance of our solution.

References
1. Gnutella. http://gnutella.wego.com
2. A. Oram, Ed., "Peer-to-Peer: Harnessing the Power of Disruptive Technologies", O'Reilly and Associates, March 2001
3. John Risson et al., "Survey of Research towards Robust Peer-to-Peer Networks: Search Methods", Technical Report UNSW-EE-P2P-1-1, University of New South Wales, 2004
4. Mihajlo A. Jovanovic et al., "Scalability issues in large peer-to-peer networks - a case study of Gnutella", Technical Report, University of Cincinnati, 2001
5. Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, "Search and replication in unstructured peer-to-peer networks", in ACM International Conference on Supercomputing (ICS), June 2002
6. Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, "Making Gnutella-like P2P systems scalable", in ACM SIGCOMM, Aug. 2003
7. V. Cholvi, P. Felber, E.W. Biersack, "Efficient Search in Unstructured Peer-to-Peer Networks", in European Transactions on Telecommunications, Special Issue on P2P Networking and P2P Services, Volume 15, Issue 6, 2004
8. Shashidhar Merugu et al., "Adding Structure to Unstructured Peer-to-Peer Networks: The Role of Overlay Topology", NGC/ICQT, pp. 83-94, 2003
9. M. Ripeanu et al., "Mapping the Gnutella Network: Properties of Large Scale Peer-to-Peer Systems and Implications for System Design", IEEE J. on Internet Computing, 2002
10. Mihajlo A. Jovanovic, Fred S. Annexstein, Kenneth A. Berman, "Modeling Peer-to-Peer Network Topologies through Small-World Models and Power Laws", in Proc. of IX Telecommunications Forum Telfor, Belgrade, November 2001
11. J. Kleinberg, "The small-world phenomenon: An algorithmic perspective", Technical Report 99-1776, Cornell University Computer Science Dept, Oct 1999
12. Christopher R. Palmer, J. Gregory Steffan, "Generating Network Topologies That Obey Power Laws", in Proc. of Globecom'2000, San Francisco, November 2000
13. Amit R. Puniyani, Rajan M. Lukose and Bernardo A. Huberman, "Intentional Walks on Scale Free Small Worlds", LANL archive: cond-mat/0107212, 2001
14. S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, H. M. Levy, "An Analysis of Internet Content Delivery Systems", in Proc. of OSDI'2002, Boston, MA, December 2002

A Scalable Version Control Layer in P2P File System* Xin Lin, Shanping Li, Wei Shi, and Jie Teng College of Computer Science, Zhejiang University, Hangzhou, P.R. China, 310027 {alexlinxin, starsear}@hotmail.com, [email protected], [email protected]

Abstract. Constructing a peer-to-peer (P2P) file system is challenging mainly because of the difficulties of version control, and no existing P2P system solves these problems smoothly. In this paper we present our efforts towards solving them by developing a new application, SVCL (a Scalable Version Control Layer in P2P file system), in which version control servers are woven into a peer-to-peer network so that the system does not crash when a single node fails. As a result, users can carry out both file updating and reading operations. Experiments have demonstrated the high performance of the proposed system.

* This work is supported by National Natural Science Foundation of China (No. 60473052).

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 996–1001, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction
Even experienced observers may be amazed at the explosive development of peer-to-peer (P2P) systems, which are now among the most popular Internet applications and a very important source of Internet traffic. Our work focuses on file updating in P2P file systems, which are frequently used in mobile computing storage systems. Fig. 1 shows an example scenario. By storing their regular files in a remote file service, people can access the files they want while traveling, using devices with low storage capacity such as PDAs and cell phones. A group of distributed file servers collaborate to provide this file service in P2P style. These servers form a network in which files are pushed to the nearest server so that clients can fetch them more easily. In contrast with the conventional client/server model, a P2P network achieves high scalability and protects the system from crashing under single points of failure.

Current P2P systems are grouped by Lv et al. [1] into three categories: (1) Centralized (e.g. Napster [2]); (2) Decentralized but Structured (e.g. Freenet [3] and other DHT algorithms [8,9,10]); (3) Decentralized and Unstructured (e.g. Gnutella [4]). There are many previous studies on each of them. However, strictly speaking, most of them are not real file systems because file-updating operations are not supported.

We have developed a new Scalable Version Control Layer in P2P file system (SVCL), in which this problem is resolved gracefully. The key point of SVCL is weaving file version controllers into ordinary peers instead of centralizing them in a single server or a few specific ones. Each peer acts as both an access server (AS) for users and a manager server (MS) for some files (see Fig. 1). In our scheme, AS and MS can be regarded as client and server respectively. As a result, the load of the centralized version control server is balanced across peers and single points of failure are avoided. Each file is managed by an appointed MS, which takes charge of the version control and information recording of the file.

SVCL consists of two layers, the version controller layer (VC layer) and the distributed hash table layer (DHT layer). The VC layer stores detailed information about each file. Every updating operation on a file is logged. Whenever a read/write operation on a file is initiated, the VC layer of the AS resorts to the DHT layer to find the corresponding MS of the requested file and retrieves its information.

[Figure labels: PDA; Update File f; Peer i; Peer j; City A; City B; SVCL; Peer k; Other peers; Manager server of File f]

Fig. 1. An example scenario of SVCL

The rest of the paper is organized as follows. Section 2 presents the architecture of SVCL. We evaluate the performance by experiments in Section 3 and draw the conclusion in Section 4.

2 Architecture of SVCL
2.1 Overview
SVCL supports both file updating and reading operations. It consists of a collection of peers that provide file storage services. In SVCL, users can perform file operations similar to those in ordinary file systems. As shown in Fig. 2, each peer functions as both server and client in the distributed file system. As mentioned in the Introduction, the main contribution of SVCL is distributing the file version controller among ordinary peers. In other words, each peer, acting as a small version controller, is in charge of version control for some files. Each peer thus takes on two responsibilities: locating wanted files and managing these files. SVCL consists of two layers, the DHT layer and the VC layer, which realize these two functions respectively (see Fig. 2). The main task of the DHT layer is to locate the manager server (MS) of requested files. We adopt the Chord algorithm [8] in the DHT layer to achieve this goal. In Chord, each file and peer is assigned a unique identifier (ID). These IDs are computed by consistent hashing [5] from attributes of files or peers, such as file name or peer IP.


[Figure labels: Applications; VC; DHT; Request; Reply; Peer i of SVCL; Peer j of SVCL]

Fig. 2. The architecture of SVCL

Each file is assigned to the peer that has the most similar ID. In SVCL, this peer is defined as the MS of the files stored on it. The VC layer, which is built on top of the DHT layer, encapsulates the file operations and provides interfaces similar to those of normal file systems, such as read, write and close. It resorts to the DHT layer to look up the MS of a given file and takes charge of the distribution, caching and replication of managed files. The VC layer on the MS of file f stores the latest content and version information of file f in a specified data structure. All updating operations on file f are logged. The detailed implementation of SVCL is introduced in the next section.

2.2 Normal File Operations
As mentioned in the previous section, the VC layer encapsulates normal file operations, such as writing and reading. To avoid overloading the network, the VC layer of each peer caches the files it operates on. Not only the content but also the version information and operating status are cached. In this paper, we focus on the single-writer, multi-reader model; the multi-writer model will be discussed in future work. A lock-based mechanism is therefore borrowed from database techniques to achieve data consistency. The compatibility of locks in SVCL is shown in Table 1, where the entry True denotes that two locks are compatible. Clearly, only multiple read locks can be imposed on the same file at the same time.

Table 1. Compatibility table of the lock-based protocol

             Read lock   Write lock
Read lock    True        False
Write lock   False       False
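The compatibility rule of Table 1, together with the lock timeout and refresh that this section describes for avoiding permanent locks, can be sketched in a few lines of Python. This is a minimal illustration rather than SVCL's C implementation: the class name, method names, the default timeout value and the injectable clock are all assumptions.

```python
import time

# Table 1: (held lock, requested lock) -> compatible?
COMPATIBLE = {
    ("read", "read"): True,
    ("read", "write"): False,
    ("write", "read"): False,
    ("write", "write"): False,
}


class FileLocks:
    """Hypothetical lock table kept by the MS for one file."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.held = {}                      # holder -> (mode, expiry time)

    def _expire(self):
        # Drop locks whose holders stopped refreshing them.
        now = self.clock()
        self.held = {h: (m, t) for h, (m, t) in self.held.items() if t > now}

    def acquire(self, holder, mode):
        """Grant the lock only if it is compatible with every other live lock."""
        self._expire()
        others = (m for h, (m, _) in self.held.items() if h != holder)
        if all(COMPATIBLE[(m, mode)] for m in others):
            self.held[holder] = (mode, self.clock() + self.timeout)
            return True
        return False

    def refresh(self, holder):
        """Reset the timeout of a live lock; returns False if it already expired."""
        self._expire()
        if holder not in self.held:
            return False
        mode, _ = self.held[holder]
        self.held[holder] = (mode, self.clock() + self.timeout)
        return True

    def release(self, holder):
        self.held.pop(holder, None)
```

Passing the clock in as a parameter is a design choice of the sketch: it lets the expiry behaviour be exercised without real waiting.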

We now present the typical process of a file operation in SVCL. Suppose a user with a PDA wants to access file f stored in SVCL (either read or write). The nearest SVCL peer, say p, is selected as the access server (AS) to conserve energy. The first job is to impose the needed access lock on f. Access is allowed only when the requested lock is compatible with the locks already imposed on f. Then the file content is downloaded from f's MS and cached on p for user access. Note that only an incremental script, which describes the difference between the old version and the latest version, is transferred when some version of f is already cached on p. Similarly, when the user issues a SAVE command, the incremental script mechanism is used to transfer changes to the MS. Finally a CLOSE operation is performed to release the locks. Some exceptions, such as failure of the AS or power exhaustion of the user's PDA, will make these operations fail. To avoid a permanent lock on a file, a timeout is imposed on each lock, and the AS keeps sending refresh requests to the MS to reset the timeout. SVCL treats directories and files equally, similar to the Unix model. Listing a directory is essentially reading the content of the directory file. Creating or deleting a file involves two steps: adding or removing the file at the corresponding MS and updating the directory file.

2.3 Peer Joining and Departure
Frequent joining and departure of peers, namely churn [6], is a notable feature of P2P systems. After a new peer joins the Chord ring, it finds its successor and takes over from the successor the files that it should manage. In Fig. 3a, the MS of files 13, 15 and 25 is Peer 27. After Peer 20 joins, it becomes the MS of files 13 and 15, so Peer 20 has to take over files 13 and 15 (see Fig. 3b). The content and file information of files 13 and 15 are transferred to Peer 20 from the former MS, Peer 27. If some files are being modified by other peers and cannot be pulled to Peer 20 immediately, a pointer is set in Peer 20 to redirect requests to Peer 27 (e.g. File 15 in Fig. 3b). Peer departure is the reverse of joining.
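The joining example can be replayed with a short Python sketch. It relies on Chord's rule that a key is managed by its successor, the first peer whose ID is equal to or follows the key on the ring. The function names are hypothetical, and peers 8 and 44 are invented only to complete a small ring; the paper names just Peer 20 and Peer 27.

```python
from bisect import bisect_left


def manager_of(file_id, peer_ids):
    """Successor of file_id: first peer ID >= file_id, wrapping around the ring."""
    ring = sorted(peer_ids)
    return ring[bisect_left(ring, file_id) % len(ring)]


def join(new_peer, peer_ids, files, busy=()):
    """Return (moved, redirected) for a peer joining the ring.

    moved: files handed over to new_peer at once.
    redirected: file -> former MS, for files still being modified (in `busy`),
    which new_peer serves through a redirect pointer for now.
    """
    former = {f: manager_of(f, peer_ids) for f in files}
    after = set(peer_ids) | {new_peer}
    takeover = [f for f in files if manager_of(f, after) == new_peer]
    moved = [f for f in takeover if f not in busy]
    redirected = {f: former[f] for f in takeover if f in busy}
    return moved, redirected


peers = {8, 27, 44}                      # 8 and 44 are hypothetical peers
files = [13, 15, 25]
print([manager_of(f, peers) for f in files])
print(join(20, peers, files, busy={15}))
```

With these IDs all three files start at Peer 27. After Peer 20 joins, File 13 moves to it, File 15 stays at Peer 27 behind a redirect pointer because it is still being modified, and File 25 remains with Peer 27.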

[Figure labels: File; Peer; Redirect pointer; panels (a) and (b)]

Fig. 3. The joining process in SVCL

3 Experimental Result and Discussion
We have developed an evaluation version of SVCL in the C programming language on Linux. All experiments are performed on a 16-peer cluster. Each peer is configured with an Intel(R) Xeon(TM) 2.40 GHz CPU and 1 GB of main memory and runs Red Hat Linux 9.0. All peers are connected by a 100 Mbps LAN. In the DHT layer of each peer, the Chord algorithm maintains a successor list and a finger table with at most 4 entries each.


3.1 Fetching File Without Caching










As mentioned in Section 2.2, if the AS has not cached the requested file, it has to fetch the whole file from the MS. The first experiment evaluates the cost of fetching files of different sizes in systems with multiple peers and no caching. Fig. 4 shows the results of fetching files of 0.58K, 1.56K, 98.8K, 215K, 562K, 1027K and 3766K bytes from SVCL with 2, 4 and 16 peers. The results indicate that the overhead in the 16-peer system is only slightly higher than in the 2-peer system. This is easily explained: the fetching time consists of the lookup overhead in the DHT layer and the file transfer time in the VC layer. It is proved in [7] that the lookup overhead of the Chord algorithm is O(log N) RPCs, where N is the number of peers, and the file transfer speed is invariant in the same network environment. The barely noticeable rise in fetching time as the number of peers increases demonstrates the good scalability of SVCL. In this experiment, FTP is also run in the same network environment for comparison. The results show that the fetching time of SVCL is only slightly longer than that of FTP.
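The argument that lookup cost grows only logarithmically while transfer cost depends on file size alone can be made concrete with a back-of-the-envelope model. The Python sketch below is idealized and is not a reproduction of the measurements: the per-hop RPC latency is an assumed value that the paper does not report, the number of lookup hops is taken as log2 N, and protocol overhead on the 100 Mbps link is ignored.

```python
import math

LINK_BPS = 100e6   # 100 Mbps LAN, as in the experimental setup
RPC_MS = 0.5       # assumed per-hop lookup latency, not reported in the paper


def fetch_ms(size_kbytes, num_peers):
    """Idealized fetching time: O(log N) lookup hops plus raw transfer time."""
    lookup = math.log2(num_peers) * RPC_MS
    transfer = size_kbytes * 1024 * 8 / LINK_BPS * 1000
    return lookup + transfer


for n in (2, 4, 16):
    print(n, round(fetch_ms(3766, n), 2))
```

In this model, going from 2 to 16 peers adds three lookup hops regardless of file size, so for the largest file the transfer term dominates and the curves for different peer counts nearly coincide.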

 

  

  

 

Fig. 4. Fetching time without caching (axes: file size (K) vs. fetching time (ms); series: SVCL at three node counts and FTP transfer)

Fig. 5. Response time cost for getting ready to update each version of given file (axes: versions vs. response time (ms); series: Peer A, Peer B)

3.2 Transfer Optimization
In SVCL, caching and incremental scripts are adopted to optimize file transfer, as mentioned in Section 2.2. The second experiment demonstrates this optimization. The example file is initially 4 Mbytes and each updating operation produces 0.5 Kbytes of incremental script. We perform the operations on two peers, A and B, which act as AS in SVCL. Versions 1, 2, 3, 6 and 7 are produced on peer A, and versions 4, 5 and 8 are produced on peer B. Neither peer is the MS of the example file. Fig. 5 shows the results. The response times of versions 1 and 4 are much longer than those of the other versions, because they are the first operations on each peer and the whole file must be downloaded from the MS. The response times of versions 2, 3 and 7 are close to zero thanks to the latest version being cached on peer A. Because old versions are cached on peers A and B when versions 6 and 8 are produced, only the incremental scripts need to be transferred from the MS. This experiment demonstrates that the transfer optimization in SVCL reduces network traffic dramatically.
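The caching behaviour in this experiment can be traced with a small Python sketch. It only models which kind of transfer each update needs, under the assumption that a peer caches every version it produces; the function name and the three labels are hypothetical, and no timing is simulated.

```python
def transfer_plan(producers):
    """producers[i] is the peer that produces version i + 1.

    Returns, per version, what that peer must fetch from the MS beforehand:
    "full" (whole file), "incremental" (scripts only) or "none" (cache is current).
    """
    cached = {}      # peer -> newest version cached on that peer
    latest = 0       # newest version at the MS (0 = the initial file)
    plan = []
    for peer in producers:
        if peer not in cached:
            plan.append("full")          # first operation on this peer
        elif cached[peer] == latest:
            plan.append("none")          # the latest version is already cached
        else:
            plan.append("incremental")   # only the missing scripts are needed
        latest += 1
        cached[peer] = latest            # the producer caches its new version
    return plan


# Versions 1, 2, 3, 6, 7 on peer A and versions 4, 5, 8 on peer B.
print(transfer_plan(list("AAABBAAB")))
```

The trace agrees with the pattern reported above: full downloads for versions 1 and 4, nothing to fetch for versions 2, 3 and 7, and incremental scripts only for versions 6 and 8. Version 5 also needs no transfer in this model, a case the text does not discuss.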


4 Conclusions
In this paper we have presented the design and evaluation of SVCL, a new scalable version control layer in P2P file systems. It provides file services to users through an ordinary file system interface. SVCL employs a DHT algorithm to locate the MS of requested files and the VC layer to manage the version information of files. Thanks to its mechanisms for peer joining and departure, it continues to run smoothly when the underlying topology changes. Improvements are still needed, but the current results show the good performance of SVCL.

References
1. Lv, Q., Cao, P., Cohen, E., Li, K., and Shenker, S.: Search and Replication in Unstructured Peer-to-Peer Networks. In Proceedings of 16th ACM International Conference on Supercomputing (ICS'02) (2002).
2. Napster website. http://www.napster.com
3. Clarke, I., Sandberg, O., Wiley, B., and Hong, T. W.: Freenet: A distributed anonymous information storage and retrieval system. In Proceedings of the ICSI Workshop on Design Issues in Anonymity and Unobservability (2000).
4. Gnutella website. http://gnutella.wego.com
5. FIPS 180-1: Secure Hash Standard. U.S. Department of Commerce NIST, National Technical Information Service (1995).
6. Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N. and Shenker, S.: Making Gnutella-like P2P Systems Scalable. In Proceedings of ACM SIGCOMM (2003).
7. Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I.: Wide-area cooperative storage with CFS. In Proceedings of ACM SOSP'01 (2001).
8. Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H.: Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM 2001 (2001).
9. Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S.: A scalable content-addressable network. In Proc. ACM SIGCOMM (2001) 161–172.
10. Rowstron, A., and Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (2001).

A Framework for Transactional Mobile Agent Execution
Jin Yang1, Jiannong Cao1, Weigang Wu1, and Chengzhong Xu2
1 Internet and Mobile Computing Lab, Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong {csyangj, csjcao, cswgwu}@comp.polyu.edu.hk
2 Cluster and Internet Computing Lab, Dept of Electrical and Computer Engg., Wayne State University, Detroit, Michigan 48202, USA [email protected]

Abstract. This paper proposes a framework for transactional Mobile Agent (MA) execution. Mechanisms and algorithms for the transactional execution of a single MA and of multiple MAs are developed. In addition to preserving the basic transaction semantics, the proposed framework has three desirable features. First, the framework is designed as an additional layer on top of the underlying existing system, so it can be integrated with existing systems without much interference with their functionality. Second, an adaptive commitment model is proposed for transactional MA execution. Third, fault tolerance support is provided to shield MA execution from various failures, so as to increase the commitment rate. The results obtained through experiments show that our framework is effective and efficient.

1 Introduction Mobile agent has a great potential of being used to develop networking/distributed systems and applications in various areas including telecommunications, e-commerce, information searching, mobile computing, and network management. For mobile agent to be widely deployed in these fields, especially in e-commerce and network management, the support for transactional mobile agent execution is necessary. More specifically, a transactional mobile agent execution should ensure Atomicity, Consistency, Isolation, and Durability (ACID) properties [2,3]. For example, suppose we want to build a path through 10 switches in a telecommunication network, the path can only be built when there is a free in-port and a free out-port for each of the ten switches on the path. Otherwise the path can not be built. This all-or-nothing property 1corresponds exactly to the atomicity property of a transaction. Transaction has been extensively studied for over a decade, and related algorithms have been proposed and wildly deployed in existing transaction support systems, such as the centralized/distributed database systems. There are some works addressing problems related to MA transactions, such as how to maintain ACID, open/close ∗

This work is supported in part by the University Grant Council of Hong Kong under the CERG Grant PolyU 5075/02E and China National 973 Program Grant 2002CB312002.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1002 – 1008, 2005. © Springer-Verlag Berlin Heidelberg 2005


transaction [2] models. However, very few works consider how to implement MA transactional execution in a real environment. To do so, the following challenging problems need to be solved:

1) Given the mobility of an MA (it migrates among the sites of several organizations) and its long execution time, which may cause the long-lived transaction problem [2], a new type of commitment model is needed.
2) The transaction mechanisms need to support both the single-MA and the multiple-MA execution modes.
3) If an MA or the MA platform fails during the MA's execution, the transaction must be aborted, so fault tolerance is needed to mask these failures.

The contribution of this paper is that it not only identifies these problems but also provides solutions to them. An adaptive commitment model is proposed for MA transactions. An algorithm for the transactional execution of a single MA and of multiple MAs is developed. Fault-tolerance mechanisms are designed for MA transactional execution, so as to avoid aborts caused by failures of the mobile agent system and to increase the commitment rate. The rest of the paper is organized as follows. Section 2 summarizes previous work on MA transactions. Section 3 presents the proposed framework for MA transactional execution. We describe our performance evaluation study in Section 4 and conclude the paper in Section 5.

2 Related Works

Related work can be classified into two categories. The first category focuses on how to maintain the ACID properties, how to support the open or close transaction model, and so on, but does not consider these issues in a realistic execution environment and context. The second category discusses MA transactions within existing distributed transaction models and standards.

The survey paper [6] summarizes the research in the first category. It divides existing approaches into blocking and non-blocking solutions, corresponding to the close and open commitment models respectively. In [1], a model based on open nested transactions is proposed. A single agent migrates from site to site to execute the assigned tasks. If a local sub-transaction is committed, the MA migrates to the next site and starts a new sub-transaction. Otherwise, the abortion of the sub-transaction causes the whole MA transaction to be aborted. A similar model is adopted in [5], and a fault tolerance mechanism based on MA platform redundancy is introduced. The idea is that the MA is replicated and sent to a set of places at the same time, so the set of replicated MAs can proceed with the transaction despite the failure of some MA systems. A serious problem is that this may cause deadlock. In [8], the blocking model is relaxed by allowing parallel transactions to run over different parts of the itinerary. However, for realistic applications, it is hard to identify and execute parallel transactions.

For the second category, the solution in [7] is based on the conventional transaction technology used in X/Open DTP and CORBA OTS. A similar approach is presented in [10]. These models use DTP to ensure exactly-once semantics and the reliable instantiation of agents at a remote node, but do not add transaction semantics at the application level. In [9], the authors present an extension of the OMG-OTS


model with multiple mobile agents. This model is well suited for long-running activities in dynamically changing environments. MA transactions do share characteristics with distributed transactions. The difference is that the platforms supporting distributed transactions are usually deployed within one organization or company, whereas in a transactional MA execution the MA may migrate across several organizations, and each host visited by the MA may run a centralized database or a distributed database that supports the distributed transaction model. MA transactions therefore cannot rely on these standard models alone; a new framework is needed.

3 MA Transactional Execution Framework

Existing transaction support systems provide local transaction support (2PL, logging, shadow files, etc.) and distributed transaction support (distributed commitment and distributed locking) within one organization. Based on this fact, we model an existing transaction support system as two sub-layers (see Fig. 1). The lower layer is the Source Data layer. Source data can be an ordinary file, a table, a simple database (e.g., MS Access), or other structured data, such as an MIB. The Source Data layer does not provide any transaction support. The upper layer is the Source Data Server layer. A source data server provides APIs for accessing and manipulating source data, together with local transaction support: local concurrency control mechanisms, local logging/shadow files, and local commit/abort mechanisms. In the framework, the mobile agent platform acts as an added layer on top of the existing transaction support system. From the point of view of the existing transaction support system, the mobile agent platform is just an application. The advantage of this structure is that, thanks to the loose coupling, the added layer does not interfere with the existing system.

Table 1. MA Platform1's locking table

Resource's ID   Holder   Waiter
MP1_R1          T_MA1    T_MA2

Table 2. MA Platform2's locking table

Resource's ID   Holder   Waiter
MP2_R1          T_MA2    T_MA1

[Figure: layered diagram. The MA Platform (MA, Context, MA TPM, Locking Table) sits on top of the existing systems, which consist of the Source Data Server (local transaction support) over the Source Data.]

Fig. 1. MA Transaction support modules
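To make the situation recorded in Tables 1 and 2 concrete, the following minimal sketch merges the two locking tables into a single wait-for graph and looks for a cycle. It is a simple centralized check written for illustration, not the distributed edge-chasing algorithm the paper cites; the function names and the row layout (resource, holder, waiter) are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# One locking-table row: (resource_id, holder, waiter), as in Tables 1 and 2.
LockRow = Tuple[str, str, str]


def build_wait_for_graph(rows: List[LockRow]) -> Dict[str, List[str]]:
    """Map every waiting transaction to the transactions it waits for."""
    graph: Dict[str, List[str]] = {}
    for _resource, holder, waiter in rows:
        graph.setdefault(waiter, []).append(holder)
        graph.setdefault(holder, [])
    return graph


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one wait-for cycle (first node repeated at the end), or None."""
    visiting, done = set(), set()
    path: List[str] = []

    def dfs(node: str) -> Optional[List[str]]:
        visiting.add(node)
        path.append(node)
        for nxt in graph[node]:
            if nxt in visiting:  # back edge: a cycle closes here
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                found = dfs(nxt)
                if found:
                    return found
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for start in graph:
        if start not in done:
            found = dfs(start)
            if found:
                return found
    return None


platform1 = [("MP1_R1", "T_MA1", "T_MA2")]  # Table 1
platform2 = [("MP2_R1", "T_MA2", "T_MA1")]  # Table 2
cycle = find_cycle(build_wait_for_graph(platform1 + platform2))
print(cycle)
```

Neither table contains a cycle on its own; the deadlock only becomes visible once the information of both platforms is combined, which is why a distributed detection scheme is needed in the real setting.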

Fig. 1 illustrates the two-layered nested transaction model in our framework. Mobile agents execute the top-level transactions. The Transaction Processing Monitor (TPM) and related data structures, such as the locking table, manage the top-level transactions, which includes the commitment protocol and concurrency control functionalities. The commitment protocol preserves the atomicity and durability properties. The TPM and the MA's context store the information about the commitment model. Concurrency control guarantees the consistency and isolation properties. Tables 1 and 2 show a distributed deadlock between MA1 and MA2 on MA platforms 1 and 2. The traditional edge-chasing algorithm [2] can be used to detect and resolve the deadlock. Source data servers manage the local transactions (sub-transactions), which are nested in the top-level transaction. Top-level transactions control the whole transaction process: the top-level commitment protocol and concurrency control mechanisms control the processing of local sub-transactions. Following the top level's commands and configuration, sub-transactions are responsible for local concurrency control, local logging/shadow files, local commit/abort, and recovery mechanisms.

3.1 Commitment Models and the Adaptive Commitment Protocol

A suitable commitment model can avoid problems such as long-lived transactions. The guidelines for selecting a commitment model are based on the compensability of a transaction's operations and the scarcity of the resources needed by the transaction. According to these two attributes, either the open model (commit the sub-transaction directly) or the close model (commit at the end of execution) can be selected [2]. The scarcity of a resource varies between applications, and even between execution contexts of the same application. Take a ticket selling service as an example. If there are plenty of tickets, customers are allowed to book a ticket. A customer can cancel the booked ticket (abort) or pay for it (commit) at the last moment. This situation can be modeled as a close transaction. However, if tickets become scarce and there are still many buyers, the booking service is usually suspended and tickets are sold only to buyers who pay immediately. This situation can be modeled as an open transaction.

As mentioned, an MA may travel across many sites of different organizations. It is possible that the ticket is sold using the open commitment model at company A, while at company B it is sold using the close commitment model. An adaptive commitment model is therefore needed to apply the open commitment model at A and the close commitment model at B. We define the Resource Scarcity Valve Value (SVV) to support the adaptive commitment model. The SVV is configurable for each kind of resource. If a resource's quantity is greater than its SVV, the close commitment model is the preferred choice for that resource. If a sub-transaction needs more than one kind of resource on a host, an AND operation is needed to make the final decision. Suppose a sub-transaction needs N types of resources, and we use 0 to denote the close commitment model and 1 to denote the open commitment model. For each type of resource, we first calculate the Resource's preferred Commitment Model (RCM), which is either 1 or 0. We get the final result by applying AND to all RCMs: $RCM_1 \wedge \dots \wedge RCM_N$. That is to say, if the preferred model of even one resource is the close commitment model, then the sub-transaction has to select the close commitment model. The selected commitment model is stored in the MA context and the TPM. When the MA decides to commit the whole transaction, it will make the respective commitment


operation according to the commitment model stored in its context. If the transaction needs to be aborted (for example, because the MA failed), the sub-transaction will be aborted according to the commitment model stored in the TPM.

3.2 MA Transaction Execution Models

In our framework, there are two MA execution modes: SMA and MMA. Combined with the three commitment models, this gives six combinations, as shown in Table 3.

Table 3. Combinations of MA transactional execution modes

Exe mode \ Commit model   Open model       Close model       Adaptive model
Single MA (SMA)           SMA Open model   SMA Close model   SMA Adaptive model
Multiple MA (MMA)         MMA Open model   MMA Close model   MMA Adaptive model
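The selection rule of the adaptive commitment model in Section 3.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the resource names and the numbers are hypothetical, and the case where a quantity equals its SVV (which the text leaves unspecified) is treated here as open.

```python
from typing import Dict

OPEN, CLOSE = 1, 0  # encoding from Section 3.1: 1 = open model, 0 = close model


def resource_commit_model(quantity: int, svv: int) -> int:
    """RCM of one resource: close when the quantity exceeds its SVV.

    Assumption: quantity == svv is treated as scarce, i.e. open.
    """
    return CLOSE if quantity > svv else OPEN


def subtransaction_commit_model(quantities: Dict[str, int],
                                svv: Dict[str, int]) -> int:
    """AND of all RCMs: a single 'close' resource forces the close model."""
    result = OPEN
    for name, quantity in quantities.items():
        result &= resource_commit_model(quantity, svv[name])
    return result


# Hypothetical SVV configuration and stock levels at two companies.
svv = {"economy_seat": 20, "business_seat": 5}
company_a = {"economy_seat": 3, "business_seat": 2}   # both resources scarce
company_b = {"economy_seat": 50, "business_seat": 2}  # economy seats plentiful

model_a = subtransaction_commit_model(company_a, svv)
model_b = subtransaction_commit_model(company_b, svv)
print(model_a, model_b)
```

With this configuration the sub-transaction at company A commits immediately (open), while the one at company B is deferred to the end of the itinerary (close), which is the mixed behaviour the adaptive model is meant to allow.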

For the SMA execution model, there is only one MA. The MA starts execution from the first host. The MA platform on the first host generates a transaction ID to represent this transaction. This ID is carried in the MA's context and is also stored by the TPM of each visited host. The TPM logs the operations of this MA transaction. Once the MA finishes execution on a host successfully, it performs (open model) or does not perform (close model) the local commitment of this sub-transaction and migrates to the next host, until it reaches the end of its itinerary. For the open commitment model, when the MA commits the sub-transaction on the last host, it finishes its transactional execution and returns. If an abort occurs, the compensation procedure reads the log to carry out the compensations. For the close commitment model, the MA starts the commitment procedure (2PC or 3PC) only on the last host. For the adaptive commitment model, the MA keeps two groups of hosts in its context: the sub-transactions of one group follow the close model and those of the other follow the open model. The MA commits each group according to its model. For MMA, each MA first proceeds independently according to the SMA procedure, except for the operation at the last host, where one MA is elected to execute the same commitment protocol as in SMA. This MA collects and merges all MAs' contexts and commits the whole transaction according to the merged context.

3.3 Fault Tolerance

An existing transaction support system can be considered robust, since such systems are usually equipped with fault tolerance devices such as backup hosts and UPSs. In our proposed framework, the MA system is an added layer running on top of the existing transaction support system, so we need not consider host failures; we only consider failures caused by the MA systems. Failures of MA systems may lead to the abortion of the MA transaction. Although the ACID properties can be guaranteed through the abort operations, a higher commitment rate is what we want, so fault tolerance mechanisms for the MA system are needed. We have designed several replication-based MA algorithms [11] for fault-tolerant MA execution.

4 Performance Evaluations

The experiments for the performance evaluation are run on 5 server hosts. Each server is equipped with Pentium 4 processors, 256MB RAM and 100M Ethernet. A Cisco Catalyst 3500 XL switch provides the connections. The system is implemented on the Naplet MA platform [4]. For the close commitment model, 2PL and 2PC are adopted to support top-level transactions.

[Figure: plot with the number of visited servers (2 to 5) on the horizontal axis and a vertical axis ranging from 0 to 3500; the surviving legend entries are Non-T and Open-T.]

Fig. 2. Comparison of execution time for the four MA execution models

Since we only want to compare the execution time of the three commitment models, the deadlock problem is not considered. On each server host, we implement an air ticket booking system. We first perform the experiment ten times on 2 servers, dispatching four MAs. One MA does not support transactions, and the other three support the open, adaptive and close commitment models respectively. The same experiment is then performed on 3, 4 and 5 servers. The results, illustrated in Fig. 2, show that the open model is the fastest (nearly the same as non-transactional MA execution) and that its execution time increases linearly with the number of servers. The close model is the slowest, and its execution time grows faster than linearly as the number of servers increases. The execution time of the adaptive model lies in between and depends on the system's configuration (e.g., the SVV), which makes it a flexible commitment model.

5 Conclusion

In this paper, we proposed a framework for MA transactional execution that can be integrated with existing transaction support systems. With the proposed adaptive commitment protocol and fault-tolerance mechanisms, the problems identified in Section 1 can be solved. We have implemented the proposed framework on the Naplet MA platform. The experimental results not only show that the framework is workable, but also provide a performance comparison of the execution modes in the framework. The proposed framework gives users a flexible choice of execution mode based on their requirements and the application's characteristics.


References

1. Assis Silva, F. and Krause, S. 1997. A distributed transaction model based on mobile agents. In Proc. of the 1st Int. Workshop on Mobile Agents (MA'97), K. Rothermel and R. Popescu-Zeletin, Eds. LNCS 1219. Springer-Verlag, 198–209.
2. Gray, J. and Reuter, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Mateo, CA.
3. Harder, T. and Reuter, A. 1983. Principles of transaction oriented database recovery: a taxonomy. ACM Comput. Surv. 15, 4 (Dec.), 289–317.
4. A flexible and reliable mobile agent system for network-centric pervasive computing. http://www.ece.eng.wayne.edu/~czxu/software/naplet.html
5. Pleisch, S. and Schiper, A. 2003a. Execution atomicity for non-blocking transactional mobile agents. In Proc. of Int. Conference on Parallel and Distributed Computing and Systems (PDCS'03) (Marina del Rey, CA).
6. Pleisch, S. and Schiper, A. 2004. Approaches to Fault-Tolerant and Transactional Mobile Agent Execution: An Algorithmic View. ACM Computing Surveys, Vol. 36, No. 3, September 2004, pp. 219–262.
7. Rothermel, K. and Straßer, M. 1997. A Protocol for Preserving the Exactly-Once Property of Mobile Agents. Technical Report TR-1997-18, University of Stuttgart, Department of Computer Science.
8. Sher, R., Aridor, Y., and Etzion, O. 2001. Mobile transactional agents. In Proc. of 21st IEEE Int. Conference on Distributed Computing Systems (ICDCS'01) (Phoenix, AZ). 73–80.
9. Vogler, H. and Buchmann, A. 1998. Using multiple mobile agents for distributed transactions. In Cooperative Information Systems, 1998. Proceedings. 3rd IFCIS International Conference, 20-22 Aug. 1998, pp. 114–121.
10. Vogler, H., Kunkelmann, T., and Moschgath, M.-L. 1997. An Approach for Mobile Agent Security and Fault Tolerance using Distributed Transactions. In 1997 Int'l Conf. on Parallel and Distributed Systems (ICPADS'97), Seoul, Korea.
11. Yang, J., Cao, J., Wu, W., and Xu, C. 2005. Parallel Algorithms for Fault-Tolerant Mobile Agent Execution. In Proceedings, 6th International Conference on Algorithms and Architectures (ICA3PP-2005), Oct. 2005, 246-256.

Design of the Force Field Task Assignment Method and Associated Performance Evaluation for Desktop Grids

Edscott Wilson García¹ and Guillermo Morales-Luna²

¹ Instituto Mexicano del Petróleo, [email protected]
² Computer Science, CINVESTAV-IPN, [email protected]

Abstract. In the case of desktop grids, a single hardware-determined latency and a constant bandwidth between processors cannot be assumed without incurring unnecessary error. The actual network topology is determined not only by the physical hardware, but also by the instantaneous bandwidth available for parallel processes to communicate. In this paper we present a novel task assignment scheme which takes the dynamic network topology into consideration along with the traditionally evaluated variables such as processor availability and potential. The method performs increasingly better as the grid size increases.

Keywords: task assignment, load distribution, desktop grids.

1 Introduction

The complexity of contemporary scientific applications with increased demand for computing power and access to larger datasets is setting a trend towards the increased utilisation of grids of desktop personal computers [1]. In the quest for load balance, an important consideration to maximise the utilisation of a desktop grid is the question of optimal task assignment to processors in a parallel program. As pointed out by Zhuge [2], in the future interconnection environment, resources will flow from high to low energy nodes. The basic laws and principles governing this field require more investigation. The work presented in this paper is a step in that direction. A promising parallel computing model suggested by Valiant [3] is the Bulk Synchronous Parallel Computer (BSP) which emulates the von Neumann machine with respect to simplicity. This model is characterised by dividing the parallel program into a sequential run of super-steps. BSP might not be the most efficient model for each and every problem to solve, but several advantages are evident: scalability, predictability, and portability are the main promises of  

Supported by the graduate studies fund of the Instituto Mexicano del Petróleo.
Partially supported by the Consejo Nacional de Ciencia y Tecnología.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1009–1020, 2005. © Springer-Verlag Berlin Heidelberg 2005


the BSP model. A further characteristic, which is used in the force field task assignment method, is that upon barrier synchronisation the system reaches a known state. In the canonical BSP model, all processors are seen as equally costly to communicate with. For force field task assignment, the BSP cost model must be extended with the considerations made by the LogP model. The LogP model proposed by Culler et al. [4] is a generalisation of BSP that provides asynchronous communication and accounts for the latency and bandwidth costs of an implementation. By using these cost features to extend the BSP model while retaining barrier synchronisation, force field task assignment can be applied to the environment defined by a desktop grid.

The BSP model has great potential for desktop grids because it provides for scalability. One of the major features of desktop grids is precisely the ability to grow, so scalability is an important issue. In 2001, Tiskin [5] developed a divide-and-conquer method under the BSP model and suggested that load balancing can be done efficiently under it. With the bandwidth-aware BSP, not only is this possible, but an error recovery algorithm can also be implemented with very little extra computational overhead.

In order to harness the potential of networked computers, systems such as Globus [6] and Condor [7] distribute work tasks among a computer grid. These systems address in great detail issues such as job scheduling, access control and user/resource management. The mechanism provided by Condor [7] for sharing resources, by harnessing the idle cycles on desktop machines, enables high throughput using off-the-shelf components. A disadvantage arises from the organisation of the grid under the administration of a single, statically configured central manager. Details of job scheduling and migration are exclusive to the central manager. All tasks wait in a queue until the centralised server can locate an appropriate resource for execution within the grid. Failure of this node brings down the entire Condor grid. Furthermore, there is no provision for sharing the workload of grid management, so an overloaded server may degrade system performance.

With respect to sharing resources amongst multiple Condor pools, recent work [8] has provided algorithms for automating the discovery of remote Condor pools across administrative domains. In that work, a p2p scheme based on a proximity-aware routing substrate [9] is used in conjunction with the Condor flocking facility. But within the Condor flock, as with most load distributing schemes, there is no mechanism for taking network proximity into consideration during task assignment or migration. Empirical studies of desktop grids, such as the one done at the San Diego Supercomputer Center [10], provide the basis for theoretical evaluations of new load distribution schemes.

The main contributions of the work in this paper are as follows:
– We describe an algorithm that uses proximity awareness along with cycle availability considerations to optimise task assignment on a desktop grid.


– We evaluate the proposed scheme by means of a simulator based on empirical characteristics of a desktop grid.

The rest of the paper is organised as follows. Section 2.1 gives an overview of the BSP model and of the task assignment problem on a desktop grid. Section 3 presents our proposed force field scheme for task distribution on desktop grids, where the proximity-aware concept is developed. Section 4 presents an evaluation and analysis of the proposed scheme. Finally, section 5 provides concluding remarks.

2 Background

2.1 BSP Computational Model

Description. A bulk synchronous parallel (BSP) computer [11, 3] consists of a set of processor-memory pairs, a global communication network, and a mechanism for the efficient barrier synchronisation of the processors [12]. In the BSP model the parallel computation is divided into super-steps, each of which consists of a number of parallel-running threads that contain any number of operations. These threads perform only local communication until they reach a synchronisation barrier, where all global communication takes place.

Task Assignment Overview. Consider a parallel application executing on a BSP machine composed of n parallel nodes, where the application consists of m parallel tasks to be performed at super-step i. In order to assign task threads to processors at the beginning of a BSP super-step, a load balancing scheme should be used. Recent strategies [13] have been guided by the considerations of minimising communication costs [14, 15] and attaining load balance [16]. While such strategies produce good results on small grids, the force field approach presented in this article produces results that come closer to the minimum execution time as the grid size increases, as shown in section 4.

With the force field method, two considerations determine the task assignment. The first, called the computational charge, combines the memory cost of the task with the computational ability of the remote desktop computer. In the global desktop grid configuration, all nodes that lack sufficient resources (mainly memory) to process the task produce a net positive force, and are thus repulsive and eliminated as candidates. The computational charge is the product of the idle cycles per unit time available at the remote computer and the reciprocal of the cost in cycles entailed by the memory utilisation of the task. This product is represented by the size of the circle in figure 1(a) and addressed in more detail in section 3.2. Note that the canonical greedy idle-cycle scheme will assign the first task to the machine with the most available cycles per unit time, which is not necessarily the largest circle in the figure, since the greedy scheme does not take the memory cost into account in the same fashion.


The second main consideration is the cost of communication. This factor enters through an inverse-square formula that takes both bandwidth availability and latency into consideration. In figure 1(a) the cost of communication is represented by the distance from the central assigning node to each of the remote computers, numbered 1 through 10. The canonical greedy network-proximity scheme will assign the first task to the machine with the least communication cost, i.e., the closest circle. The task assignment of the force field method is not evident from figure 1(a), because both the size and the distance of the circles must be taken into account to determine the net effect. The novelty of the force field method is the way the network-proximity and cycle-availability considerations are merged into a single force field to determine which node will receive the task. Figure 1(b) shows a three-dimensional representation of the force field determined by the nodes in figure 1(a). The slope of the force field determines in which direction the task will roll until it reaches a particular node, represented by a depression in the force field surface. The positive force indicates the node where the tasks are generated and distributed from.

[Figure: (a) Participating computer nodes, numbered 1 to 10; (b) Task assignment force field.]

Fig. 1. Task assignment considerations

The force field method for synchronisation operates as follows. Upon reaching the synchronisation barrier, each node sends a completion message to all other processors participating in the super-step. Of these, one will currently be acting as a synchronisation master or sync-master (which will be the node in charge of distributing the tasks). Summing up, the sync-master will determine the force field and assign tasks to where the force field is strongest. To calculate the forces, the sync-master needs to know the available idle cycles and available memory at each node, besides the effective communication cost. The first two parameters are readily known and used for task assignment by systems such as Condor. For the third parameter evaluation, the BSP cost model [17] is extended with the full bandwidth and latency cost considerations of the LogP [4] model.
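The sync-master's decision can be illustrated with a small sketch. The exact force expression is not given in this part of the paper, so the code assumes a Coulomb-like form (task charge times node charge divided by the squared communication cost), with a positive sign for nodes that lack the memory the task needs. All class, field and function names are hypothetical, and the one-task-per-node restriction reflects the paper's assumption that tasks do not outnumber processors.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    name: str
    idle_cycles: float   # available cycles per unit time
    free_memory: float   # available memory
    distance: float      # effective communication cost to the sync-master (> 0)


@dataclass
class Task:
    name: str
    memory_needed: float
    memory_cost: float   # cycles entailed by the memory requirement (> 0)


def force(task: Task, node: Node) -> float:
    """Assumed Coulomb-like force: negative attracts, positive repels."""
    sign = 1.0 if node.free_memory < task.memory_needed else -1.0
    task_charge = sign / task.memory_cost
    return task_charge * node.idle_cycles / node.distance ** 2


def assign(tasks: List[Task], nodes: List[Node]) -> Dict[str, str]:
    """Give each task to the free node with the most negative force."""
    free = list(nodes)
    result: Dict[str, str] = {}
    for task in tasks:
        candidates = [n for n in free if force(task, n) < 0]
        if not candidates:
            continue  # every remaining node repels this task
        best = min(candidates, key=lambda n: force(task, n))
        result[task.name] = best.name
        free.remove(best)
    return result


nodes = [
    Node("A", idle_cycles=100, free_memory=4, distance=1),   # closest node
    Node("B", idle_cycles=900, free_memory=4, distance=2),
    Node("C", idle_cycles=1000, free_memory=4, distance=5),  # most idle cycles among feasible nodes
    Node("D", idle_cycles=2000, free_memory=1, distance=1),  # too little memory
]
tasks = [Task("t1", memory_needed=2, memory_cost=10),
         Task("t2", memory_needed=2, memory_cost=20)]
assignment = assign(tasks, nodes)
print(assignment)
```

In this made-up configuration a greedy proximity scheme would pick node A first and a greedy idle-cycle scheme would pick C (or the infeasible D), whereas the combined force sends the first task to B.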


At the barrier all communications are completed and the system reaches a known state. The resulting force field values are ordered, and the nodes with the most negative values (attraction) receive the tasks. Positive force field values indicate a net repulsion for the assignment of new tasks, and are generated when the available memory of the node is less than the net memory requirements of the task. The ubiquity of the desktop computer suggests that desktop grids will continue growing. Our results indicate that the force field method yields even better results as the size of the desktop grid increases.

Design Issues. When applying the BSP computational model, one important aspect must not be overlooked: all tasks must synchronise at the BSP barrier, and all communication takes place at this point. The cost of communication relative to the cost of computation is therefore important. As processor speed increases at a faster rate than that of communication hardware, the ratio between the two will also increase. Parallel applications such as data distribution [13] already have a greater communication cost and can readily profit from the force field task assignment method, whose performance improves as the communication-to-computation cost ratio increases and the desktop grid grows.

2.2 Desktop Grids

Network. While the opportunity to perform large computations at low cost is a clear motivation for using desktop resources for parallel processing, the uncertainty about which resources will be available at a given moment is the challenge in designing an effective task assignment strategy. The main points that characterise a desktop grid are the real-time network topology and the individual computational capacity of each desktop resource. The real-time network topology is determined instantaneously from the latency and available bandwidth characteristics of the desktop grid. While this topology is limited by the hardware with which the grid is built, the usage pattern of applications outside the BSP context determines the effective topology available to the BSP machine, and hence the network cost between any two points in the desktop grid.

Network proximity. The network proximity, or latency, is the number of computation cycles taken to establish communication between any two nodes. This parameter is determined by the communications hardware.

Network capacity. The network capacity, or bandwidth, is the amount of information that can be transmitted between hosts once communication has been established. Since network utilisation is neither uniform nor static, the network capacity between any two nodes is not necessarily uniform or constant through time.
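As a rough illustration of how the two network parameters combine, the sketch below estimates the cost of sending one message as a fixed latency plus the message size divided by the bandwidth. The linear cost model, the use of a mean over sampled bandwidth values, the units and the function names are all assumptions made for this example; the paper's own cost expression is not reproduced here.

```python
from typing import Sequence


def average_bandwidth(samples: Sequence[float]) -> float:
    """Mean of bandwidth samples taken over an observation window."""
    if not samples:
        raise ValueError("at least one bandwidth sample is required")
    return sum(samples) / len(samples)


def transfer_cost(latency: float, message_size: float,
                  bandwidth_samples: Sequence[float]) -> float:
    """Assumed cost in cycles: latency + size / average bandwidth.

    latency is in cycles, message_size in data units, and each bandwidth
    sample in data units per cycle.
    """
    return latency + message_size / average_bandwidth(bandwidth_samples)


# Same hardware latency, but the second link is congested by other traffic.
quiet_link = transfer_cost(100, 1000, [10, 20, 30])
busy_link = transfer_cost(100, 1000, [2, 4, 6])
print(quiet_link, busy_link)
```

Two nodes on identical hardware can therefore end up at quite different effective distances from the assigning node, which is the sense in which usage patterns shape the real-time topology.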

3 Design

An important consideration with regard to desktop grids is that we can assume that the number of parallel tasks to be completed during a super-step is less than or equal to the number of available processors in the grid. We can make this assumption because increasing the grid size with off-the-shelf equipment is relatively easy and cheap. In the design of an improved task assignment scheme under the bandwidth-aware BSP computational model, we borrow the concept of a force field to elaborate our proposed algorithm. In doing so, we start from an important fact: in recent years, computational speed has been increasing geometrically, while communication speed has experienced only a linear increase. This indicates that the trends in future parallel computing will be defined by communications. The question is not whether this will happen, but rather when the turning point will be. This is in addition to the fact that many data-intensive parallel computations are already governed by communications. Consider that the costs in computational cycles of the tasks in a BSP super-step need not be equal. If $C_i$ is the cost in cycles to solve a particular task i, and $\xi_j$ is the amount of available cycles per unit time at node j, then the time to solve task i on node j can be written as $C_i/\xi_j$. Without considering any communication cost, when dealing with a computational grid of size m, the optimum for n parallel tasks is the minimum of the combinatorial set obtained by associating tasks to processors. If the cost of communication is considered, the complexity of the problem increases dramatically. The set of all super-steps in a BSP problem then becomes similar to a DAG determination, which is an NP-hard problem.

3.1 Fundamental Laws of Motion

A successful load balancing scheme using gravitational forces can be found in Hui and Chanson [18], although in that case the gravitational forces are masked behind the theory of hydrodynamics. This effective method of load balancing has found application with the PRAM parallel machine model. A computational model which is less dependent on system architecture, such as BSP, calls for a simpler expression of the laws governing task assignment.

A drawback of the hydrodynamic load balancing's use of Newton's Law of Universal Gravitation is the absence of repulsive forces: the hydrodynamic fluid cannot flow from a lower to a higher level. Only with repulsive forces can a particular node reject tasks whose requirements surpass the node's resources. The BSP model, on the other hand, allows for direct application of Newton's laws. Nonetheless, gravitational forces are only attractive. Therefore, Coulomb's law is more appealing for this work.

How does this fit into the BSP model? At the synchronisation barrier the system reaches a well-defined state, in which the attractive and repulsive forces associated with emerging tasks can be determined. Network workstations participating in the BSP machine which are temporarily removed from the grid need only switch the sign of their computational capacity, so that all new tasks are repelled and sent to other available nodes on the desktop grid.

3.2 The Force Field Load Distribution

The Computational Distance. To find the physical equivalent of distance in the task assignment problem, we must calculate the cost of communications. Consider the case of performing task assignment on a BSP parallel computer composed of a network of workstations using shared communications resources. The available bandwidth between any two nodes will vary with time. What is important is not the instantaneous value at any particular moment but rather the average value over the time interval in which that moment lies.

The Computational Charge. The force field method refers to two computational charges. The first lies within the remote computer and the second within the task to be assigned.

Remote computer charge. Represented by $q_j$, this is the computational potential of node $j$, and is equivalent to the available computational cycles per unit time. This number is always positive or equal to zero. If node $j$ sets $q_j = 0$, then neither attraction nor repulsion will exist. Whether or not the node actually receives a task depends on the force field values determined for the other nodes on the grid.

Task charge. Represented by $q_i$, this is a computational ease associated with the memory requirements of the task. Computational ease, which is inversely proportional to cost, is an important parameter which is often overlooked: the current bottleneck in the execution of most tasks is not processor speed but rather the movement of data between memory storage and processor registers. If the available memory on node $j$ is less than that required for task $i$, then $q_i$ is assigned a positive sign, ensuring a repulsive force. Otherwise the sign will be negative. If $c_i$ is the cost in computation cycles entailed by the memory requirements of task $i$, then $q_i = 1/c_i$ is the charge associated with the task (see footnote 1).

Note that the memory requirements of every task have to be known to any assignment scheme; otherwise it would be impossible to determine whether the remote node has the resources to deal with the job. The cost in cycles that will be required for the task is not used at any time. In our simulator we use the Law of Large Numbers to obtain values for this parameter and randomly assign these values to tasks.

Footnote 1: A further distinction may be made as to whether the available virtual memory is fully in RAM or partly distributed in disk swap. This further refinement is not considered in the results presented in this paper.
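The charges and the computational distance described above can be combined into a Coulomb-style score. The following is only a minimal sketch and not the authors' simulator: the inverse-square dependence on distance, the unit proportionality constant, and the names `Node`, `task_charge`, `force` and `assign` are assumptions introduced for illustration. A negative product of charges is read as attraction, and the task goes to the most strongly attracting node.

```python
from typing import NamedTuple, Optional, Sequence


class Node(NamedTuple):
    name: str
    cycles_per_unit_time: float  # remote computer charge q_j (>= 0)
    free_memory: float           # memory currently available on the node
    distance: float              # averaged communication cost to the node (> 0)


def task_charge(memory_cost_cycles: float, required_memory: float, node: Node) -> float:
    """Task charge q_i with magnitude 1/c_i.

    The sign is positive (repulsive) when the node lacks the memory the
    task needs, and negative (attractive) otherwise.
    """
    magnitude = 1.0 / memory_cost_cycles
    return magnitude if node.free_memory < required_memory else -magnitude


def force(node: Node, q_task: float) -> float:
    """Coulomb-style scalar force (assumed form): q_j * q_i / d^2.

    Negative values mean attraction, positive values mean repulsion.
    """
    return node.cycles_per_unit_time * q_task / node.distance ** 2


def assign(nodes: Sequence[Node], memory_cost_cycles: float,
           required_memory: float) -> Optional[Node]:
    """Return the node with the strongest attraction, or None if no node attracts."""
    best, best_force = None, 0.0
    for node in nodes:
        f = force(node, task_charge(memory_cost_cycles, required_memory, node))
        if f < best_force:  # more negative means stronger attraction
            best, best_force = node, f
    return best
```

In this sketch a node with $q_j = 0$ produces a zero force and is therefore never preferred over an attracting node, which matches the description of the remote computer charge.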

4 Performance Evaluation

For each run the absolute minimum execution time, which considers all possible combinations of task/processor assignments, is also obtained for comparison with the task assignment algorithms being tested. In the tests conducted, mean values were obtained from 1000 simulations for each data point. Figures 2(a)–(c) show the results for an increasing number of tasks in the parallel set to be completed, with a grid size of 1000 desktop computers. Computational availability, communications cost, task memory requirements and task computational costs are all drawn from Gaussian distributions and randomly assigned to different identifiers.

The purpose of the evaluation is to determine which strategy is better for task assignment on a desktop grid. Every process can be characterised by the following quantities:

– The amount of memory required for data storage.
– The number of computation cycles required to complete all programmed operations.
– The time units consumed by the communication requirements.

Strictly speaking, the second point depends on the hardware, and values for different platforms can be related by linear proportionality. On the other hand, computation cycles can be dealt with in an abstract, architecture-independent manner, as Papadimitriou and Yannakakis [19] do in their presentation of the directed acyclic graph model. The same approach is used in this paper.

With these quantities the merits of any particular task assignment scheme may be evaluated. The first point, the memory requirements, is needed to determine whether the remote computer has the necessary resources to complete the job. The second point, the cycles required by the task together with the cycles available per unit time at the remote node, allows the execution wall clock time to be obtained. Nonetheless, the number of cycles each task will require is not generally known, except for the most elementary applications.
Besides, in the tabulation of a parallel job, the time associated with communication costs must also be considered. For the optimum calculation, it is necessary to know the number of cycles that each task will require. This is used to evaluate the performance of the different algorithms compared (the algorithms themselves, of course, may not use this information). One algorithm is better than another if the times it obtains are closer to the optimum values.

[Figure 2: three panels, (a) Low communication cost (communication-weight=1), (b) Normal communication cost (communication-weight=10), (c) High communication cost (communication-weight=100), each with gridsize=1000 and task-weight=1. Axes: wall time for superstep versus tasks. Curves: idle-cycle greedy (mean), force-attraction (mean), network-proximity greedy (mean), Minimum time (mean).]

Fig. 2. Mean values for different communication costs

Figure 2(a) shows the results when the cost of communication is low with respect to the cost of computation. In this graph the greedy computation cost scheme quickly becomes asymptotic to the minimum attainable value when the number of simultaneous tasks in the BSP super-step reaches 25% of the configured processors in the desktop grid. For super-steps with simultaneous tasks occupying less than 2.5% of the configured processors, the force field task assignment outperformed the greedy computation cost scheme. When the grid size starts to increase, as seen below, the situation changes dramatically in favour of the force field task assignment scheme.

[Figure 3: three panels showing the low communication cost case in detail (task-weight=1, communication-weight=1): (a) gridsize=1000, (b) gridsize=5000 (5X grid), (c) gridsize=10000 (10X grid). Axes: wall time for superstep versus tasks (5 to 50). Curves: idle-cycle greedy (mean), force-attraction (mean), Minimum time (mean).]

Fig. 3. Mean values for low communication cost and growing grid size

Figure 2(b) shows the results for a parallel program where the computation cost still exceeds the communication cost. Many parallel applications fall into this category. In this case, we can observe that the force field assignment scheme outperforms the greedy computation cost scheme up to the point where 20% of the configured desktop grid is used for the simultaneous execution of a single BSP super-step.

In the third graph of the simulation results, figure 2(c), communication costs exceed computation costs. Distributed data applications fall into this category, as do an increasing number of other parallel applications, since computing speed increases geometrically while communication speed does so linearly. In this case the force field approach far outperforms both the greedy communication cost and greedy computation cost schemes and remains quite close to the theoretical minimum that can be obtained.

In the foreseeable future the size of available desktop grids is bound to increase, and for this reason it is important to analyse how the force field task assignment will perform in relation to the greedy schemes as the grid grows in size. In figures 3(a)–(c) we can observe that, as the number of computers participating in the grid configuration increases, the force field task assignment outperforms the greedy schemes. As our results demonstrate, this holds even for tasks involving a low communications cost. This points to the force field as an important option to consider when large desktop grids are to be used for parallel computing.
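The comparison against the absolute minimum execution time used throughout this evaluation can be illustrated on a toy instance. The sketch below is an assumption-laden illustration, not the simulator behind the figures: it models the communication cost as a fixed additive time per node, takes the super-step wall time as the slowest task (because of the synchronisation barrier), and the names `superstep_time`, `optimal_assignment` and `idle_cycle_greedy` are hypothetical. Exhaustive enumeration is only feasible for very small grids.

```python
from itertools import permutations
from typing import Sequence, Tuple


def superstep_time(assignment: Sequence[int], task_cycles: Sequence[float],
                   node_speed: Sequence[float], comm_time: Sequence[float]) -> float:
    """Wall time of one super-step: the slowest task decides (barrier).

    Task i placed on node j costs C_i / xi_j plus an assumed additive
    communication time for node j.
    """
    return max(task_cycles[i] / node_speed[j] + comm_time[j]
               for i, j in enumerate(assignment))


def optimal_assignment(task_cycles: Sequence[float], node_speed: Sequence[float],
                       comm_time: Sequence[float]) -> Tuple[int, ...]:
    """Exhaustive minimum over all one-task-per-node assignments (n <= m)."""
    nodes = range(len(node_speed))
    return min(permutations(nodes, len(task_cycles)),
               key=lambda a: superstep_time(a, task_cycles, node_speed, comm_time))


def idle_cycle_greedy(task_cycles: Sequence[float],
                      node_speed: Sequence[float]) -> Tuple[int, ...]:
    """Greedy baseline (assumed form): give tasks, in order, the fastest free nodes."""
    ranked = sorted(range(len(node_speed)), key=lambda j: -node_speed[j])
    return tuple(ranked[:len(task_cycles)])


if __name__ == "__main__":
    cycles, speed, comm = [8.0, 2.0], [4.0, 2.0, 1.0], [3.0, 0.5, 0.1]
    best = optimal_assignment(cycles, speed, comm)
    greedy = idle_cycle_greedy(cycles, speed)
    print(best, superstep_time(best, cycles, speed, comm))      # (1, 0) 4.5
    print(greedy, superstep_time(greedy, cycles, speed, comm))  # (0, 1) 5.0
```

In this toy instance the fastest node is also the one with the highest communication time, so the computation-only greedy choice ends up slower than the exhaustive minimum, which is the effect the evaluation attributes to ignoring communication.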

5 Conclusions

Although the physical motivation behind the force field task assignment algorithm may not be a consequence of formal computer science, in practice the analogies made with the laws of physics converge to a good strategy for task assignment in the case of a dynamic network bandwidth topology and growing grid size configurations. The effect that communications have on the time to complete a BSP super-step is a factor which should not be disregarded when dealing with desktop grids of growing size. Communications will have an increasingly greater effect on parallel programs as growth in computation speed outpaces growth in communication speed. Task assignment strategies which take both factors into consideration, such as the force field algorithm, should be preferred when there is reason to assume that their performance will be better than that of the respective greedy strategies. In the development of this algorithm we have considered the force field as a scalar field. An analysis as a vector field could, however, produce better results, allowing for a more homogeneous load balancing of the entire desktop grid. More work remains to be done in this direction.

References

1. Foster, I., Kesselman, C. (Eds.): The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers (1999)
2. Zhuge, H.: The future interconnection environment. IEEE Computer 4 (2005) 27–33
3. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 8 (1990) 103–111
4. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, J., Santos, E., Subramonian, R., von Eicken, T.: LogP: Towards a realistic model of parallel computation. In: Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. (1993) 1–12
5. Tiskin, A.: A new way to divide and conquer. Parallel Processing Letters 4 (2001)
6. Foster, I., Kesselman, C.: Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputing Applications and High Performance Computing 2 (1997) 115–128
7. Litzkow, M., Livny, M., Mutka, M.: Condor - a hunter of idle workstations. In: Proceedings 8th International Conference on Distributed Computing Systems (ICDCS 1988). (1988) 104–111
8. Butt, A.R., Zhang, R., Hu, Y.C.: A self-organizing flock of condors. In: Proceedings Super Computing 2003, ACM (2003) 15–21
9. Castro, M., Druschel, P., Hu, Y.C., Rowstron, A.: Exploiting network proximity in peer-to-peer overlay networks. Technical report, Microsoft Research (2002) Technical Report MSR.TR-2002-82.
10. Kondo, D., Taufer, M., Brooks, C.L., Casanova, H., Chien, A.A.: Characterizing and evaluating desktop grids: An empirical study. Technical report, San Diego Supercomputer Center and University of California, San Diego (2004) Work supported by the National Science Foundation under Grant ACI-0305390.
11. Gibbons, A.M., Spirakis, P., eds. In: General Purpose Parallel Computing. Cambridge University Press (1993) 337–391
12. van Leeuwen, J., ed. In: Scalable Computing. Springer-Verlag (1995) 46–61
13. Sujithan, K.R.: Towards a scalable parallel object database - the bulk synchronous parallel approach. Technical report, Wadham College Oxford (1996) Technical Report PRG-TR-17-96.
14. Adler, M., Byers, J.W., Karp, R.M.: Scheduling parallel communication: The h-relation problem. Technical report, International Computer Science Institute, Berkeley (1995) Technical Report TR-95-032.
15. Goodrich, M.T.: Communication-efficient parallel sorting. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, ACM (1996) 247–256
16. Shi, H., Schaeffer, J.: Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing 4 (1992) 361–372
17. Baumker, A., Dittrich, W., Heide, F.M.: Truly efficient parallel algorithms: 1-optimal multisearch for an extension of the BSP model. Theoretical Computer Science (1998)
18. Hui, C.C., Chanson, S.T.: Hydrodynamic load balancing. IEEE Trans. Parallel and Distributed Systems 10 (1999)
19. Papadimitriou, C., Yannakakis, M.: Towards an architecture-independent analysis of parallel algorithms. SIAM J. Comput. 19 (1990) 322–328

Performance Investigation of Weighted Meta-scheduling Algorithm for Scientific Grid

Jie Song1, Chee-Kian Koh1, Simon See1, and Gay Kheng Leng2

1 Asia Pacific Science and Technology Center, Sun Microsystems Inc., 50 Nanyang Avenue, N3-1c-10, Singapore 639798
{Jie.Song, Chee-Kian.Koh, Simon.See}@sun.com
http://apstc.sun.com.sg
2 Information Communication Institute of Singapore, School of Electrical & Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
[email protected]

Abstract. Scientific computing requires not only more computational resources, but also a large amount of data storage. The scientific grid therefore integrates the computational grid and the data grid to provide sufficient resources for scientific applications. However, most meta-schedulers only consider system utilization, e.g. CPU load, when optimizing resource allocation. This paper proposes a weighted meta-scheduling algorithm which takes into account both system load and data grid workload. The experiments show a performance improvement for applications and a better load balance through efficient resource scheduling.

1 Introduction

Scientific applications require not only computational resources, but also a large amount of data storage. A computational grid can provide large-scale distributed computational resources [1]. A data grid is designed to support large-scale data storage [2]. The scientific grid therefore consists of a computational grid and a data grid to provide sufficient resources for scientific computing. However, most meta-schedulers are designed for the computational grid and optimize resource allocation based only on the system load. For example, a meta-scheduler such as Sun Grid Engine [3] will always allocate the server with the lowest system load to the current job request. Some meta-schedulers use economic parameters to optimize resource utilization under budget constraints [4]. But for applications that perform many data accesses, request large data transfers, or query very large databases, the workload of the data grid will significantly affect the application execution performance [5]. Therefore, the meta-scheduler should consider the data grid workload combined with the computational system load to optimize application performance.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1021 – 1030, 2005. © Springer-Verlag Berlin Heidelberg 2005

In this paper, a weighted meta-scheduling algorithm is proposed. It takes into account both system load and data grid workload when allocating resources. We investigate how performance is affected by different load metrics, and a set of experiments has been carried out to evaluate the performance improvement of the weighted meta-scheduling algorithm. The rest of the paper is organized as follows. Related work is introduced in Section 2. Section 3 presents the design of the weighted meta-scheduling algorithm in detail. In Section 4, we describe the experiment setup and analyze the performance results in comparison with the least-load scheduling algorithm. Future work is discussed in Section 5. Finally, Section 6 gives the conclusion.

2 Related Work

There are many challenges in the research of meta-scheduling algorithms. For example, different combinations of grid scheduling algorithms have been simulated to investigate the relationship between the data scheduler and the job scheduler [6]. Genetic scheduling algorithms have been implemented using agent technologies [7] or applied to data grid applications [8]. Some researchers consider parallel task scheduling problems [9] and distributed dynamic resource scheduling problems [10].

Our research is based on the solution proposed in [11] to integrate the computational grid and the data grid. The Sun Grid Engine is used as the meta-scheduler to allocate data grid resources to job requests. The load formula is the key to resource scheduling. The idea is to revise the load formula so that it reflects the effect of the data grid workload. In that solution, new parameters such as the buffer and the number of read threads are introduced to represent the data grid workload and are combined with the original system load parameter. The scheduler selects the resource with the smallest value calculated using the specified load formula. However, these new parameters may not be easy to collect, and their value ranges are dynamic and differ between system configurations. Therefore, we investigate the possible data grid performance metrics and propose the new weighted meta-scheduling algorithm. The performance of the proposed algorithm is analyzed based on the experiment results.

3 Weighted Meta-scheduling Algorithm

Figure 1 shows the procedure by which the meta-scheduler allocates grid resources for scientific applications. Once the meta-scheduler receives an application request, it looks for the grid resources that satisfy all the application requirements and generates the resource candidate list. It then calculates the load formula using the current load values collected from each resource. Finally, it selects from the candidate list the resource with the smallest or largest value, according to the policy, and allocates it to the application. The load formula is clearly the key element of the meta-scheduler algorithm. Different kinds of application may have different load formulas to optimize the performance of application execution.

Fig. 1. Procedure of weighted meta-scheduling algorithm

3.1 Load Metrics

The load metrics are parameters that represent the current usage of grid resources. They can be used to configure the load formula. Considering the features of the scientific grid, we use two types of metrics to define the load formula.

Metrics for computational resource
- Average system load (Lavg): average CPU load, reflecting the utilization of system resources.
- Available memory (Cmem): size of free memory.
- Bandwidth (Cband): maximum/minimum network bandwidth, used to estimate the data transfer speed.

Metrics for data grid
- No. of concurrent users (Nuser): average number of concurrent users accessing the data grid.
- Size of free storage (Cdisk): the available free storage.

More metrics can be collected from the grid resources, but only those which may significantly affect application performance are used in the definition of the load formula. Figure 2 shows that, as the number of concurrent Oracle users increases, the application execution time increases quickly until the number of users exceeds about 60. After that, the application execution time remains steady, even under different system loads.

[Figure 2: two panels, Graph 1 and Graph 2, with axes SQL Exe time and dy/dx versus No of Oracle Users (0 to 80), with curves for several system loads (Load = 10, 20, 50 and Load = 10, 15, 20).]

Fig. 2. Performance affect by no. of concurrent users

3.2 Performance Metrics

The following performance metrics, shown in Figure 3, are used in our study to evaluate the performance of the proposed meta-scheduling algorithm.

- Response time (Tr) is the amount of time between the submission of the job from the client host and the start of the job on the host allocated by the meta-scheduler.
- Execution time (Te) is the average amount of time taken by the SQL SELECT statement.
- Job execution time (T) is the time from when a job is submitted until it ends. The job is considered completed only when the entire output has been transferred to the client host.

Fig. 3. Performance metrics

3.3 Weighted Load Formula

The general definition of the weighted load formula is shown as Formula (1):

$\Phi = \sum_{k=1}^{N} (\pm \lambda_k P_k), \quad \sum_{k=1}^{N} \lambda_k = 1$    (1)

Here $\lambda_k$ is the weight of metric $k$ in the load formula; $k$ is the index of a metric, in the range [1..N], where N is the maximum number of metrics used to define the load formula; $P_k$ is a metric included in the formula; and $\Phi$ is the load value used to schedule grid resources for applications.
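Formula (1) can be evaluated directly once the metrics, weights and signs are chosen. The sketch below is an illustration only: the function name `weighted_load`, the explicit `signs` argument, and the idea that each metric has been normalized to a comparable range beforehand are assumptions that the paper does not specify.

```python
import math
from typing import Sequence


def weighted_load(metrics: Sequence[float], weights: Sequence[float],
                  signs: Sequence[int]) -> float:
    """Evaluate Phi = sum(sign_k * lambda_k * P_k) with sum(lambda_k) = 1.

    metrics: values P_k (assumed already normalized to comparable ranges)
    weights: lambda_k, required to sum to 1
    signs:   +1 for metrics that add load, -1 for metrics that relieve it
    """
    if not (len(metrics) == len(weights) == len(signs)):
        raise ValueError("metrics, weights and signs must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(s * w * p for s, w, p in zip(signs, weights, metrics))


if __name__ == "__main__":
    # Hypothetical example: CPU load and concurrent users add load,
    # free memory reduces it.
    phi = weighted_load(metrics=[0.6, 0.5, 0.8],
                        weights=[0.5, 0.3, 0.2],
                        signs=[+1, +1, -1])
    print(round(phi, 2))  # 0.29
```

Keeping the signs separate from the weights leaves the constraint that the weights sum to 1 easy to check, while still letting "more is better" metrics such as free memory lower the load value.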

4 Experiment Results

The experiments aim to evaluate the performance of the weighted meta-scheduling algorithm. We investigate the performance under different system loads, and also compare it with the widely used least-load scheduling algorithm.

4.1 Experiment Setup

The test bed used in our experiments consists of 3 workstations running Linux or Solaris 9. Oracle 10g is installed to set up the data grid, and Sun Grid Engine 5.3 is used to set up the computational grid. For j jobs running on node i, the load formula used in our experiments is defined as follows:

$\Phi_{i,j} = \lambda \, N_{i,j} / \mathrm{Avg}(N_{usr}) + L_{avg,j}$    (2)

where $\Phi_{i,j}$ is the total load when there are j jobs on node i, and $\lambda$ denotes the normalizing factor necessary to balance the equation. $N_{i,j}$ represents the number of connections running on the node, assuming that j jobs contribute j connections. $\mathrm{Avg}(N_{usr})$ is the average number of Oracle connections, required to provide the homogeneity factor. $L_{avg}$ is the np_load_avg of node i.

The experiments have three steps:

1) Test under different system loads.
2) Test performance by changing the data grid workload.
3) Compare the performance with the original SGE least load scheduler algorithm, whose load formula is shown below:

$\Phi_{i,j} = L_{avg,j}$    (3)
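The difference between Formulas (2) and (3) can be seen on two hosts with equal CPU load but different numbers of database connections. This is a minimal sketch under stated assumptions: the host names and numbers are hypothetical, Avg(Nusr) is taken as the mean connection count across the candidate hosts, and the functions `weighted_phi`, `least_load_phi` and `select_host` are illustrative names rather than part of Sun Grid Engine.

```python
from typing import Dict, Tuple


def weighted_phi(connections: float, avg_connections: float,
                 np_load_avg: float, lam: float = 1.0) -> float:
    """Formula (2): lambda * N / Avg(N_usr) + L_avg."""
    return lam * connections / avg_connections + np_load_avg


def least_load_phi(np_load_avg: float) -> float:
    """Formula (3): the least load scheduler only looks at L_avg."""
    return np_load_avg


def select_host(hosts: Dict[str, Tuple[float, float]], lam: float = 1.0) -> str:
    """Pick the host with the smallest weighted load.

    hosts maps a host name to (number of connections, np_load_avg).
    Avg(N_usr) is assumed here to be the mean over the candidate hosts.
    """
    avg_connections = sum(n for n, _ in hosts.values()) / len(hosts)
    if avg_connections == 0:
        # No data grid activity anywhere: fall back to Formula (3).
        return min(hosts, key=lambda h: least_load_phi(hosts[h][1]))
    return min(hosts, key=lambda h: weighted_phi(hosts[h][0], avg_connections,
                                                 hosts[h][1], lam))


if __name__ == "__main__":
    # Hypothetical hosts with the same np_load_avg but 5 vs 70 connections.
    hosts = {"host_a": (5, 10.0), "host_b": (70, 10.0)}
    print(select_host(hosts))  # host_a
```

With Formula (3) alone the two hosts tie, so the choice between them is arbitrary; the connection term in Formula (2) breaks the tie in favour of the host with the lighter data grid workload.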

In our experiments, the meta-scheduler allocates grid resources according to the value of the load formula above. The experiment results are analyzed in the following sections.

4.2 Response Time

As shown in Figure 4 and Figure 5, only about 15% of the jobs scheduled with the least load algorithm have a response time greater than 10 seconds, while the share is a much higher 62% with the weighted scheduling algorithm. The increase in response time is due to the increased complexity of the new load formula, which considers the Oracle database information. The computation of the load formula takes longer, and the collection of data grid performance data also takes more system resources. Therefore, the response time of the weighted meta-scheduling algorithm is longer.

[Figure 4: plot titled "Performance of Response Time (Unmodified SGE)", total of 154 jobs. Axes: No. of Jobs versus Response Time (sec).]

Fig. 4. Response Time for least load algorithm

[Figure 5: plot titled "Performance of Response Time (Modified SGE)", total of 71 jobs. Axes: No. of Jobs versus Response Time (sec).]

Fig. 5. Response Time for weighted meta-scheduling algorithm

4.3 Overall Performance

Figure 6 compares the overall application performance, in terms of execution time, of the two scheduling algorithms. We note that the performance line (green) of the machine selected by the proposed scheduling algorithm lies between the performance lines of the two servers under the least load scheduling algorithm in SGE. Since we keep the average system load the same on the two servers, and the least load algorithm always selects the server with the lowest load to execute a job, it has a 50% chance of choosing either the faster or the slower server. The red line therefore represents the performance in the worst case of the least load algorithm, and the blue line the performance in the best case. We can thus say that the weighted meta-scheduling algorithm achieves at least medium performance, so that it provides a quality guarantee for data grid applications. This is because the scheduler selects the server with the lower data grid workload, which makes the data grid application execute much faster.

[Figure 6: plot titled "Comparison of Performance (5c vs 70e)". Axes: Avg SQL exe time versus np_load_avg. Curves: Unmodified SGE (Callisto), Unmodified SGE (Elara), Modified SGE (Selected).]

Fig. 6. Performance Comparison

4.4 Resource Selection

Figure 7 shows that the weighted meta-scheduling algorithm always selects the resource which can execute the application faster. In our testbed, server Elara has a better hardware configuration than server Callisto. In this experiment, we set more concurrent users on Elara and keep the same system workload on the two servers. As the average system load increases, the weighted meta-scheduler always selects server Callisto to execute the job, because Callisto has fewer concurrent users and the data grid application will therefore be executed faster. The proposed algorithm can thus select resources intelligently by considering the data grid workload.

[Figure 7: plot titled "5c vs 75e (Modified SGE)". Axes: Avg SQL exe time versus np_load_avg. Curves: Elara, Callisto (Selected).]

Fig. 7. Resource Selection

5 Future Works

As discussed in Section 4.2, the computation overhead causes a performance decrease when using the weighted meta-scheduling algorithm. How to reduce this overhead is therefore one of the challenging problems. The current investigation has only been tested on the platform of SGE and Oracle 10g; we will carry out more experiments on different grid platforms to assess the performance in a scientific grid environment. More work can also be done on scheduling resources for parallel applications over the grid.

6 Conclusion

In this paper, we proposed the weighted meta-scheduling algorithm for the scientific grid. The key idea is to introduce the data grid workload into the load formula, which is used to determine the grid resource allocation. Several experiments have been carried out to investigate the performance and to compare it with the least load scheduling algorithm in Sun Grid Engine. When two hosts are balanced in CPU load but have different data grid loads, the proposed scheduling algorithm dynamically detects the data grid load imbalance between the nodes of a cluster and effectively assigns jobs to the node with the lower data grid load. The proposed scheduling algorithm has therefore been shown to be more intelligent in the selection of execution hosts for jobs. A tradeoff of this modification is a slight general drop in performance in terms of response time and average SQL execution time. It is mainly caused by the more complex computation of the load formula and by the additional system resources taken by the scheduler to keep the data grid workload information up to date. Hence, with the integration of Oracle 10g resource metadata and load sensors, and with the aid of an efficient load mechanism, the proposed weighted meta-scheduling algorithm proves to be a smarter and more efficient system.

Acknowledgement

The authors acknowledge the contribution of all the students involved in this project. We thank Mr. Tay Chai Yee and Miss Heng Bao Jin for the completion of the experiments. We also appreciate the initial research and development work done by Mr. Sajindra Jayasena, Mr. Yee Chin Peng and Mr. Wong Wai Hong.

References

1. Chris Smith, Computational Grids meets the database, Platform Computing white papers.
2. Heinz Stockinger, Flavia Donno, Erwin Laure, Shahzad Muzaffar, Peter Kunszt, Grid Data Management in Action: Experience in Running and Supporting Data Management Services in the EU DataGrid Project, CERN
3. Sun Grid Engine Load sensor architecture, http://gridengine.sunsource.net/project/gridengine/howto/loadsensor.html
4. S. Venugopal, R. Buyya, and L. Winton, A Grid Service Broker for Scheduling Distributed Data-Oriented Applications on Global Grids, 2nd International Workshop on Middleware in Grid Computing, October 2004
5. W.H. Wong, Integration of Sun Grid Engine 5.3 with Oracle Database 10g, APSTC Internship report, 2004
6. J. Song, Z.H. Yang and C.W. See, Investigating Super Scheduling Algorithms for Grid Computing: A Simulation Approach, Proceedings of the Fifth International Conference on Parallel and Distributed Computing, Applications and Technologies, LNCS 3320, December 2004, pp. 372-375
7. G. Chen, Z.H. Yang, C.W. See, J. Song and Y.Q. Jiang, Agent-mediated Genetic Superscheduling in Grid Environments, Proceedings of the Fifth International Conference on Parallel and Distributed Computing, Applications and Technologies, LNCS 3320, December 2004, pp. 367-371
8. S. Kim and J.B. Weissman, A Genetic Algorithm Based Approach for Scheduling Decomposable Data Grid Applications, Proceedings of International Conference on Parallel Processing, August 2004, pp. 406-413
9. S. Zhuk, A. Chernykh, A. Avetisyan, S. Gaissaryan, D. Grushin, N. Kuzjurin, A. Pospelov and A. Shokurov, Comparison of Scheduling Heuristics for Grid Resource Broker, Proceedings of 5th International Conference in Computer Science, September 2004, pp. 388-392
10. W.Z. Zhang, B.X. Fang, H. He, H.L. Zhang and M.Z. Hu, Multisite Resource Selection and Scheduling Algorithm on Computational Grid, Proceedings of 18th International Parallel and Distributed Processing Symposium, April 2004, pp. 105
11. S. Jayasena and C.P. Yee, Integrating Sun Grid Engine Computational Grid Services with Oracle Data Grid Service, APSTC Internship Report, 2004

Performance Analysis of Domain Decomposition Applications Using Unbalanced Strategies in Grid Environments

Beatriz Otero, José M. Cela, Rosa M. Badía, and Jesús Labarta

Dpt. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Campus Nord, C/ Jordi Girona, 1-3, Mòdul D6, 109, 08034, Barcelona, Spain
{botero, cela, rosab, jesus}@ac.upc.edu

Abstract. In this paper, we compare the performance of some mesh-based applications in a Grid environment using the domain decomposition technique and unbalanced workload strategies. We propose unbalanced distributions in order to overlap computation with remote communications. Results are presented for typical cases in car crash simulation where finite element schemes are applied to fine meshes. The expected execution time is basically the same for the two unbalanced techniques, but it is up to 34% smaller than that required by the classical balanced strategy. We also analyze the influence of the communication pattern on execution times using the Dimemas simulator.

Keywords: Domain decomposition, Grid environments.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1031-1042, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

The computational solution of very large problems has become feasible through grid computing, which extends the limited computational resources of an individual company. The use of Globus allows many computing resources distributed over a wide area to cooperate in solving and analyzing large-scale problems. Usually, such problems arise from the discretization of elliptic partial differential equations on meshes, whose solution involves matrix algebra operations such as matrix-vector and matrix-matrix multiplications. In this paper, we consider distributed applications that carry out matrix-vector product operations. To achieve good performance in these applications, the spatial domain must be divided among the available processors in an efficient manner [1], and each processor performs a numerical treatment on its assigned sub-domain. However, efficient implementations of matrix algebra operations are also required to exploit the computational facility well.

Mesh partitioning for homogeneous systems has been extensively studied [2], [3]; however, mesh partitioning for distributed systems in Grid environments is a relatively new area of research. Previous work [4] presents studies on latency, bandwidth and optimum workload to take full advantage of available resources using a balanced workload. In this work, we use an unbalanced distribution where the workload can be different for each processor. Basically, this distribution assigns less workload to the processors responsible for sending updates outside their host. These lightly loaded data domains are called special domains. In [5] we found that this strategy was more effective than the balanced technique in most of the cases considered. In that work, we proposed a new distribution pattern for the data in which the workload differs depending on the processor (many special domains per host). Nevertheless, the scalability of this unbalanced distribution is moderate. In this paper, we propose to assign all special domains to a single CPU in each host, which concurrently manages the communications from this host. We use the Dimemas tool to simulate the behavior of the distributed applications in Grid environments.

The paper is organized as follows: Section 2 presents the tool used to simulate the Grid environment. Section 3 proposes the workload assignment patterns. Section 4 shows the results obtained in the specified environments for different data distribution patterns. Conclusions from this work are presented in Section 5.

2 Grid Environment

This section describes the trace generation and the configurations of the Grid environment studied. We also present Dimemas [6], [7], the tool used for simulating Grid environments. This tool was developed by CEPBA (European Center for Parallelism of Barcelona, www.cepba.upc.edu) for simulating parallel environments. In the DAMIEN project (Distributed Applications and Middleware for Industrial use of European Networks), Dimemas was extended to work in a distributed and heterogeneous Grid environment [8], [9].

Dimemas is a performance prediction simulator for message-passing applications. It is fed with a trace file and a configuration file. To obtain the trace file, the parallel application is executed using an instrumented version of MPI. The trace file therefore records an execution of the application on a source machine, capturing the CPU bursts and the communication pattern information. The configuration file contains details of the simulated architecture, such as the number of nodes and the latency and bandwidth between nodes. Dimemas generates an output file that contains the execution times of the simulated application for the parameters specified in the configuration file.

The Dimemas simulator uses a very simple model for point-to-point communications. This model decomposes the communication time into five components: latency time, resource contention time, transfer time, WAN contention time and flight time. Latency time (TLatency) is the time required to start the communication; it is defined once per machine. Resource contention time (TResource) is a time simulated by Dimemas that depends on the availability of links and buses. Transfer time (TSend) depends on the message size and the connection bandwidth. WAN contention time (TWAN) models the effect of the traffic on the network; it is a factor that reduces the connection bandwidth. Finally, flight time (TFlight) is a time simulated by Dimemas that models the transmission of the message to the destination. It depends on the distance between hosts [10] and is specified as a square matrix of size N_host x N_host.
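The five-component model described above can be sketched as a simple sum. This is only an illustrative reading of the text, not Dimemas code: the function name, the parameter names, the additive treatment of the WAN contention term, the 8,000-byte message size, and the interpretation of Kbps as 1,000 bits per second are all assumptions made for the example.

```python
def point_to_point_time(message_bytes, bandwidth_bps, latency_s,
                        flight_s=0.0, resource_s=0.0, wan_s=0.0):
    """Hypothetical sketch: sum of the five components named in the text.

    All times are in seconds; bandwidth is in bits per second.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    # T_Send: depends on message size and connection bandwidth.
    transfer_s = 8.0 * message_bytes / bandwidth_bps
    # T_Latency + T_Resource + T_Send + T_WAN + T_Flight
    return latency_s + resource_s + transfer_s + wan_s + flight_s


# Assumed 8,000-byte boundary message, using values from Table 1.
local_s = point_to_point_time(8000, 100e6, 25e-6)               # inside a host
remote_s = point_to_point_time(8000, 64e3, 10e-3, flight_s=1e-3)  # between hosts
print(f"local: {local_s:.6f} s, remote: {remote_s:.6f} s")
```

Under these assumptions the remote transfer is roughly three orders of magnitude slower than the local one, which is the gap the unbalanced distributions try to hide behind computation.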


For this communication model, we assume an infinite number of buses for the interconnection network and as many full-duplex connections as the host has distinct remote communications with other hosts, so TResource is negligible. We consider the same number of processors per host. For the WAN contention time (TWAN), we use a linear model to estimate the traffic present in the external network [11]. Thus, our communication model depends on three parameters: latency, bandwidth and flight time. To estimate these parameters, we used the ping program. The nominal values of the WAN parameters can vary over time, depending on the network conditions. Measurements of distributed executions between machines were also performed. Moreover, we chose values in line with what is commonly found in present networks [12], [13], [14], [15].

Communications between processors inside a node are relatively fast and depend on the latency and bandwidth between those processors. Inter-node communication is relatively slow and depends on the latency, bandwidth and flight time between hosts. Table 1 shows the values for the internal and external host communications. The internal column gives the latency and bandwidth between processors inside a host. The external column gives the latency, bandwidth and flight time values between hosts.

Table 1. Communication parameter values

Parameter     Internal (processors)   External (hosts)
Latency       25 μs                   10 ms and 100 ms
Bandwidth     100 Mbps                64 Kbps, 300 Kbps and 2 Mbps
Flight time   -                       From 1 ms to 100 ms

Our Grid environment is formed by a set of connected hosts, each of which can be a network of symmetric multiprocessors (SMP) or a parallel machine. Different hosts are connected through an external WAN. Communications between different hosts are called remote communications.

3 Domain Decomposition

Domain decomposition is used for efficient parallel execution of mesh-based applications, such as finite element or finite difference modelling of phenomena in engineering and the sciences. Mesh-based applications use a meshing procedure to discretize the problem domain, which is partitioned into several sub-domains, each of which is assigned to an individual processor in the Grid environment. In these applications, one matrix-vector operation is performed in each iteration of the explicit method. To carry out the matrix-vector operation, we use the domain decomposition strategy. The initial mesh is split into sub-domains, and each sub-domain must exchange boundary values with all its neighbours [16]. The common boundary between sub-domains determines the size of the exchanged messages.

In the previous section, we defined a host as a set of processors. When processors in the same host exchange boundary values, the communication is local. A remote communication, on the other hand, occurs when a processor exchanges data with a processor in another host. Remote communications are slower than local ones. Therefore, we consider two unbalanced data distributions in order to overlap computation with remote communications.

3.1 Unbalanced Distribution Pattern That Depends on Remote Communication (U-Bdomains)

In this unbalanced partition, each domain corresponds to one parallel process assigned to one CPU, so there are as many domains as CPUs. The data are partitioned in two steps. The first step splits the mesh into host partitions by using METIS [17], and each partition is assigned to one host. This procedure guarantees that the computational load is balanced between hosts. The second step performs an unbalanced division of the data within each host. We create as many special domains as the host has remote communications with different hosts. These special domains are called Bdomains. A Bdomain contains only boundary nodes, so the computational load of the processes holding Bdomains is negligible. Finally, the remaining host partition is decomposed into (nproc - Bdomains) domains, where nproc is the number of CPUs in the host. These domains are called Cdomains.

3.2 Unbalanced Distribution Pattern That Does Not Depend on Remote Communication (U-1domains)

In this distribution pattern, the total number of processes is equal to (TCPU - Thosts) + TRemComm, where TCPU is the total number of CPUs in the Grid, Thosts is the total number of hosts, and TRemComm is the total number of remote communications with different hosts per iteration. Each remote communication corresponds to one Bdomain, and all the Bdomains in a host are assigned to a single CPU. As in the case of U-Bdomains, the decomposition is performed in two phases, and the first phase is basically the same. In the second phase, we create as many Bdomains as the host has remote communications, but all of them are assigned to a single CPU inside the host. The remaining host partition is decomposed into (nproc - 1) domains.

The following example illustrates these unbalanced distributions. A finite element mesh of 256 degrees of freedom (dofs) is considered, and a grid configuration of 4 hosts and 8 CPUs per host is assumed. First of all, we make an initial decomposition into four sub-graphs. Figure 1 shows this balanced partition with the boundary nodes (Bdomains). If the unbalanced distribution is U-Bdomains, each Bdomain of a sub-graph is assigned to one domain, and the remainder of the sub-graph is divided into five balanced Cdomains (Figure 2). If the unbalanced distribution is U-1domains, all of the Bdomains are assigned to one CPU, and the remainder of the sub-graph is split into seven balanced Cdomains (Figure 3).
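The counting rules of both patterns can be checked with a few lines of code. The sketch below is not taken from the paper: the function name and dictionary keys are hypothetical, and the value of three remote communications per host is inferred from the example (8 CPUs per host and five Cdomains under U-Bdomains).

```python
def domain_counts(cpus_per_host, remote_comms_per_host):
    """Hypothetical helper: process and domain counts for both patterns.

    cpus_per_host: CPUs in every host (the same number per host is assumed).
    remote_comms_per_host: one entry per host, the number of different hosts
        it communicates with per iteration (one Bdomain each).
    """
    hosts = len(remote_comms_per_host)
    total_cpus = hosts * cpus_per_host            # TCPU
    total_bdomains = sum(remote_comms_per_host)   # TRemComm
    if any(b >= cpus_per_host for b in remote_comms_per_host):
        raise ValueError("U-Bdomains needs at least one CPU left for Cdomains")

    # U-Bdomains: one process per CPU; each Bdomain occupies its own CPU.
    u_b = {
        "processes": total_cpus,
        "bdomains": total_bdomains,
        "cdomains": total_cpus - total_bdomains,
    }
    # U-1domains: all Bdomains of a host share one CPU; the other
    # (nproc - 1) CPUs of each host hold one Cdomain each.
    u_1 = {
        "processes": (total_cpus - hosts) + total_bdomains,
        "bdomains": total_bdomains,
        "cdomains": total_cpus - hosts,
    }
    return u_b, u_1


# Example from the text: 4 hosts, 8 CPUs per host, 3 Bdomains per host (inferred).
u_b, u_1 = domain_counts(8, [3, 3, 3, 3])
busy_u_b = u_b["cdomains"] / (4 * 8)  # fraction of CPUs doing computation
busy_u_1 = u_1["cdomains"] / (4 * 8)
print(u_b, busy_u_b)
print(u_1, busy_u_1)
```

For this configuration the sketch gives 32 processes with 20 Cdomains for U-Bdomains and 40 processes with 28 Cdomains for U-1domains, which agrees with the five and seven Cdomains per host stated in the example.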

Fig. 1. Special domains

Fig. 2. U-Bdomains distribution

Figure 4 shows the communication pattern and the relationship between processes and hosts in this example. Arrows in the diagram denote processes that exchange data. The tail of an arrow identifies the sender, and the head identifies the receiver. A short arrow represents a local communication inside a host, whereas a long arrow represents a remote communication between hosts. In Figure 4.a, each process is assigned to one processor. Notice that 62.5% of the processors inside each host are busy with computation, while the remaining processors perform remote communications. In Figure 4.b, the processes with remote communications in a host are assigned to the same processor, so 87.5% of the processors have computation to do. In both cases, all remote communications are overlapped inside a host. The case in Figure 4.b has one more local communication and less computation than the one in Figure 4.a.

Fig. 3. U-1domains distribution

Fig. 4. Diagram of communication for one computational iteration: (a) U-Bdomains (b) U-1domains

4 Results

In this section, we evaluate the performance of distributed applications in a Grid environment when the data distribution is handled by the U-Bdomains and U-1domains techniques. In these experiments, we assume a maximum of 128 processors, and the data set is a finite element mesh (FEM) of 1,000,000 dofs. This data set was obtained from simulations of car crashes [18] and sheet stamping models [19]; similar data sets are common in other industrial problems. For our purposes, FEMs are classified into two kinds. The first, a stick mesh, can be completely decomposed into strips, so each parallel process has at most two remote communications per iteration. The second, a box mesh, cannot be decomposed into strips, so the number of remote communications per process can be greater than two. The dimension of the stick mesh is 10^4 x 10 x 10 nodes, while the dimension of the box mesh is 10^2 x 10^2 x 10^2 nodes. Finally, we consider a maximum of eight hosts. Table 2 shows the average computational load of each distribution pattern for both kinds of meshes when two, four and eight hosts are used.

The execution time of a distributed application in a Grid environment is given by the execution time of the last processor to finish its own computational load. This time depends on both the local calculations performed by the processor and its communications with other processors. In our case, it depended mainly on the number of remote communications. The metric used to compare the performance of our unbalanced distributions is the execution time reduction. Figures 5 and 6 show the time reduction percentages as a function of bandwidth for each grid environment in the case of the stick mesh. These percentages compare the balanced distribution with the U-1domains and U-Bdomains distributions. We observed that the U-Bdomains distribution reduces the execution time of the balanced distribution in most cases. However, the U-Bdomains distribution creates as many special domains per host as there are external communications. Therefore, the scalability of the U-Bdomains distribution is moderate, because each special domain devotes one processor solely to communications. Our simulations show that the scalability of this distribution is between 40% and 98%, while the scalability of the U-1domains distribution is between 75% and 98%.

Table 2. Average computational load per processor. Bdomains are the processes with remote communications; "Nodes" is the number of nodes per process and "Processes" is the number of processes of that kind.

STICK MESH

               Bdomains             U-1domains Cdomains    U-Bdomains Cdomains
Hosts x CPUs   Nodes   Processes    Nodes     Processes    Nodes     Processes
2x4            101     2            166633    6            166633    6
2x8            102     2            71414     14           71414     14
2x16           101     2            33327     30           33327     30
2x32           100     2            16126     62           16126     62
2x64           100     2            7935      126          7935      126
4x4            101     6            83283     12           99940     10
4x8            102     6            35693     28           38438     26
4x16           101     6            16657     60           17231     58
4x32           100     6            8060      124          8192      122
8x8            100     14           17832     56           19972     50
8x16           102     14           8321      120          8759      114

BOX MESH

               Bdomains             U-1domains Cdomains    U-Bdomains Cdomains
Hosts x CPUs   Nodes   Processes    Nodes     Processes    Nodes     Processes
2x4            10334   2            163222    6            163222    6
2x8            10329   2            69953     14           69953     14
2x16           10325   2            32645     30           32645     30
2x32           10324   2            15796     62           15796     62
2x64           10364   2            7772      126          7772      126
4x8            3486    10           34469     28           43870     22
4x16           3664    10           16056     60           17840     54
4x32           3499    10           7782      124          8178      118
8x8            1816    38           16625     56           35807     26
8x16           1814    38           7759      120          10345     90
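The execution time and the reduction metric used in this section can be written down in a few lines. The paper does not give an explicit formula, so the sketch below assumes that the reduction is measured relative to the balanced execution time; the function names and the sample times are hypothetical and are not results from the paper.

```python
def execution_time(per_process_times):
    """Execution time of the application: the last process to finish."""
    return max(per_process_times)


def reduction_percent(t_balanced, t_unbalanced):
    """Assumed metric: percentage reduction relative to the balanced time.

    Positive values mean the unbalanced distribution is faster; negative
    values mean it is slower than the balanced distribution.
    """
    if t_balanced <= 0:
        raise ValueError("balanced time must be positive")
    return 100.0 * (t_balanced - t_unbalanced) / t_balanced


# Made-up per-process times in seconds, only to exercise the functions.
t_bal = execution_time([10.0, 12.5, 11.0])
t_unb = execution_time([9.0, 10.0, 8.5])
print(reduction_percent(t_bal, t_unb))
```

With this convention, a curve below zero in the reduction plots corresponds to a configuration where the unbalanced distribution takes longer than the balanced one.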

Fig. 5.a. Execution time reduction with external latency of 10 ms and flight time of 1 ms (stick mesh)

Fig. 5.b. Execution time reduction with external latency of 10 ms and flight time of 100 ms (stick mesh)

Fig. 6.a. Execution time reduction with external latency of 100 ms and flight time of 1 ms (stick mesh)

Fig. 6.b. Execution time reduction with external latency of 100 ms and flight time of 100 ms (stick mesh)

Fig. 7.a. Execution time reduction with external latency of 10 ms and flight time of 1 ms (box mesh)

Fig. 7.b. Execution time reduction with external latency of 10 ms and flight time of 100 ms (box mesh)

Fig. 8.a. Execution time reduction with external latency of 100 ms and flight time of 1 ms (box mesh)

Fig. 8.b. Execution time reduction with external latency of 100 ms and flight time of 100 ms (box mesh)

In Figs. 5 to 8, the horizontal axis is the bandwidth (64 Kbps, 300 Kbps and 2 Mbps) and the vertical axis is the percentage of reduction for execution time (%). Each plot has one series per distribution (U-1domains, U-Bdomains) and host x CPU configuration (4x4 for the stick mesh only, 4x8, 4x16, 4x32, 8x8 and 8x16).

As shown, the results are similar when the external latency is equal to 100 ms (Figures 7 and 8). However, in the case of box meshes, the U-Bdomains distribution reduces the execution time by up to 32% more than in the case of stick meshes. The execution time reduction achieved by the U-Bdomains distribution ranges from 1% to 38% relative to that obtained by the U-1domains distribution. Table 3 shows these percentages. Nonetheless, the U-1domains distribution is an efficient choice in the sense that fewer processors are dedicated to remote communications, and its execution time is up to 34% smaller than that of the balanced distribution.

It is also important to look at the MPI implementation, because the ability to overlap communications and computations depends on it. A multithreaded MPI implementation could overlap communication and computation, but context-switching problems between threads and interferences appear. In a single-threaded MPI implementation we can use non-blocking send/receive calls with a wait_all routine, but we have observed some problems with this approach.


Table 3. Average additional percentage of U-1domains compared with U-Bdomains

        Bdomain processes per host   Additional percentage         Additional percentage
                                     (external latency 10 ms)      (external latency 100 ms)
Hosts   Stick mesh   Box mesh        Stick mesh   Box mesh         Stick mesh   Box mesh
2       1            1               0%           -4.65E-06%       0%           -8.84E-06%
4       1 or 2       2 or 3          -6.91%       -18.68%          -35.74%      -37.74%
8       1 or 2       3, 5 or 6       -9.04%       -17.21%          -23.81%      -23.98%

These problems are associated with the internal order of execution. In our experiment, we solved them by explicitly programming the proper order of the communications. However, these problems remain in the general case. We conclude that it is very important to have non-blocking MPI primitives that really exploit the full-duplex channel capability. In future work, we will consider other MPI implementations that optimize the collective operations [20].

5 Conclusions

In this work, we studied the performance of some distributed applications in grid environments when two different unbalanced workload distributions, U-Bdomains and U-1domains, are used. With these unbalanced distributions, the execution time is 53% better than that of the traditional balanced workload distribution. Our results show that U-Bdomains is up to 38% faster than the U-1domains distribution, but it requires 25% more processors to perform remote communications. In other cases, the U-1domains distribution reduced the execution time by up to 34%. We show that these unbalanced distributions exploit the pool of available processors more efficiently to perform computation and allow greater scalability.

Acknowledgements This work was supported by the Ministry of Science and Technology of Spain under contract TIN2004-07739-C02-01, the HiPEAC European Network of Excellence and BSC (Barcelona Supercomputing Center).

References

1. Taylor V. E., Holmer B. K., Schwabe E. J. and Hribar M. R.: Balancing Load versus Decreasing Communication: Exploring the Tradeoffs. HICSS 1 (1996) 585-593
2. Berger M. and Bokhari S.: A partitioning strategy for non-uniform problems on multiprocessors. IEEE Transactions on Computers (1997) C-36:5
3. Simon H. D., Sohn A. and Biswas R.: Harp: A fast spectral partitioner. 9th ACM Symposium on Parallel Algorithms and Architectures, Newport, Rhode Island (1997)
4. Gropp W. D., Kaushik D. K., Keyes D. E. and Smith B. F.: Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD. Conference on Computational Fluid and Solid Mechanics, Cambridge, MA (2001) 839-841
5. Otero B., Cela J. M., Badia R. M. and Labarta J.: A Domain Decomposition Strategy for GRID Environments. 11th European PVM/MPI 2004, LNCS Vol. 3241, Hungary (2004) 353-361
6. Dimemas. Internet, http://www.cepba.upc.es/tools/dimemas/ (2002)
7. Girona S., Labarta J. and Badía R. M.: Validation of Dimemas communication model for MPI collective operations. LNCS, vol. 1908, EuroPVM/MPI 2000, Hungary (2000)
8. Badia R. M., Labarta J., Giménez J. and Escale F.: DIMEMAS: Predicting MPI Applications Behavior in Grid Environments. Workshop on Grid Applications and Programming Tools (2003)
9. Badia R. M., Escale F., Gabriel E., Giménez J., Keller R., Labarta J. and Müller M. S.: Performance Prediction in a Grid Environment. 1st European across Grid Conference (2003)
10. Badía R. M., Escalé F., Giménez J. and Labarta J.: DAMIEN: D5.3/CEPBA. IST-2000-25406
11. Badía R. M., Giménez J., Labarta J., Escalé F. and Keller R.: DAMIEN: D5.2/CEPBA. IST-2000-25406
12. http://clik.to/adslvelocidad
13. http://www.bandwidthplace.com/speedtest
14. Bjørndalen J. M., Anshus O. J., Vinter B. and Larsen T.: The Impact on Latency and Bandwidth for a Distributed Shared Memory System Using a Gigabit Network Supporting the Virtual Interface Architecture. NIK 2000, Norsk Informatikk Konferanse, Bodø, Norway (2000)
15. Wolman A., Voelker G. M., Sharma N., Cardwell N., Karlin A. and Levy H. M.: On the scale and performance of cooperative Web proxy caching. Proceedings of the seventeenth ACM symposium on Operating systems principles, ACM Press New York, USA (1999) 16-31
16. Keyes D. E.: Domain Decomposition Methods in the Mainstream of Computational Science. 14th International Conference on Domain Decomposition Methods, Mexico (2003) 79-93
17. Metis, Internet, http://www.cs.umn.edu/~metis
18. Frisch N., Rose D., Sommer O. and Ertl Th.: Visualization and pre-processing of independent finite element meshes for car crash simulations. The Visual Computer 18:4 (2002) 236-249
19. Sosnowski W.: Flow approach-finite element model for stamping processes versus experiment. Computer Assisted Mechanics and Engineering Sciences, vol. 1 (1994) 49-75
20. Keller R., Gabriel E., Krammer B., Müller M. S. and Resch M. M.: Towards Efficient Execution of MPI Applications on the Grid: Porting and Optimization Issues. Journal of Grid Computing, Vol. 1, Issue 2 (2003) 133-149

Cooperative Determination on Cache Replacement Candidates for Transcoding Proxy Caching

Keqiu Li1, Hong Shen1, and Di Wu2

1 Graduate School of Information Science, Japan Advanced Institute of Science and Technology, 1-1, Asahidai, Nomi, Ishikawa, 923-1292, Japan
2 Department of Computer Science and Engineering, Dalian University of Technology, No 2, Linggong Road, Ganjingzi District, Dalian, 116024, China

Abstract. In this paper, the performance evaluation of an optimal solution for coordinated cache replacement in transcoding proxies is presented. We compare the performance with respect to three factors: the impact of cache size, the impact of object access frequency, and the impact of the number of client classes. Extensive simulation results show that the coordinated cache replacement model greatly improves network performance compared to local replacement models that determine cache replacement candidates from the view of only a single node.

Keywords: Web caching, multimedia object, transcoding proxy, cache replacement, performance evaluation, Internet.

1 Introduction

As the transcoding proxy is attracting an increasing amount of attention in mobile computing, new efficient cache replacement policies are required for transcoding proxies. Many cache replacement algorithms for proxy caching have been proposed in the literature. An overview of these cache replacement algorithms can be found in [2]. However, they cannot be simply or directly applied to solve the same problem for transcoding proxy caching, due to the new factors that emerge in the transcoding proxy (e.g., the additional delay caused by transcoding, and the different sizes and reference rates of different versions of a multimedia object) and the aggregate effect of caching multiple versions of the same multimedia object. Although the authors have elaborated these issues in [7], they considered the cache replacement problem at only a single node.

Cooperative caching, in which caches cooperate in serving each other's requests and making storage decisions, is a powerful paradigm to improve cache effectiveness [8, 12]. Efficient coordinated object management algorithms are crucial to the performance of a cooperative caching system. Such algorithms can be divided into two types: placement and replacement algorithms. There has been a considerable amount of research on finding efficient solutions for cooperative object placement [10, 16]. However, little work has been done on finding efficient solutions for cooperative object replacement. In [11], we presented an original model which determines cache replacement candidates on all candidate nodes in a coordinated fashion with the objective of minimizing the total cost loss. However, no simulation experiments were conducted to validate the performance of the proposed model.

In this paper, we present performance evaluation results that compare our model with existing models. We compare the performance with respect to three factors: the impact of cache size, the impact of object access frequency, and the impact of the number of client classes. Extensive simulation results show that the coordinated cache replacement model greatly improves network performance compared to local replacement models that determine cache replacement candidates from the view of only a single node.

The rest of this paper is organized as follows: Section 2 briefly describes a cooperative cache replacement scheme for transcoding proxy caching. In Section 3, the simulation model is introduced. We present the performance evaluation in Section 4. Finally, we conclude this paper in Section 5.

This work was partially supported by the Japan Society for the Promotion of Science (JSPS) under its General Research Scheme B (Grant No. 14380139). Corresponding author: H. Shen ([email protected]).

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1043-1053, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Cooperative Cache Replacement for Transcoding Proxy Caching

In this section, we briefly introduce the algorithm proposed in [11]. The problem addressed in [11] is to determine where a new or updated version $O_{i_0}$ should be cached among nodes $\{v_1, v_2, \cdots, v_n\}$ and which version of object $j$ should be removed at that node to make room for $O_{i_0}$ such that the total cost loss is minimized. Suppose that $P \subseteq V$ is the set of nodes at each of which $X_{i,k_i} \in A_i$ should be removed to make room for $O_{i_0}$; then this problem can be formally defined as follows:

$$L(P^*) = \min_{P \subseteq V} \{L(P)\} = \min_{P \subseteq V} \sum_{v_i \in P} \left( l(X_{i,k_i}) - g_i(O_{i_0}) \right) \tag{1}$$

where $L(P)$ is the total relative cost loss, $l(X_{i,k_i})$ is the cost loss of removing $X_{i,k_i}$ from node $v_i$, and $g_i(O_{i_0})$ is the cost saving of caching $O_{i_0}$ at node $v_i$.

We now present an optimal solution for the problem defined in Equation (1). In the following, we call the problem a k-optimization problem if we determine cache replacement candidates from nodes $\{v_1, v_2, \cdots, v_k\}$. Thus, the original problem (Equation (1)) is an n-optimization problem. Theorem 1 shows an important property: the optimal solution for the whole problem must contain optimal solutions for some subproblems.


Theorem 1. [11] Suppose that $X = \{X_{i_1,k_{i_1}}, X_{i_2,k_{i_2}}, \cdots, X_{i_\alpha,k_{i_\alpha}}\}$ is an optimal solution for the $\alpha$-optimization problem and $X' = \{X_{i'_1,k'_{i'_1}}, X_{i'_2,k'_{i'_2}}, \cdots, X_{i'_\beta,k'_{i'_\beta}}\}$ is an optimal solution for the $(k_{i_\alpha} - 1)$-optimization problem. Then $X^* = \{X_{i'_1,k'_{i'_1}}, X_{i'_2,k'_{i'_2}}, \cdots, X_{i'_\beta,k'_{i'_\beta}}, X_{i_\alpha,k_{i_\alpha}}\}$ is also an optimal solution for the $\alpha$-optimization problem.

Based on Theorem 1, an optimal solution for the n-optimization problem can be obtained by checking all possible removal candidates from node $v_1$ to node $v_n$ in order. Therefore, it is easy to see that the time complexity of this solution is $O(n^2 + mn \log n)$, based on our previous result that the complexity of computing all $S(r_{i,k})$ is $O(mn \log n)$, where $n$ is the number of nodes in the network and $m$ is the number of versions of object $j$.
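To make the objective in Equation (1) concrete, the following Python sketch enumerates every subset P of nodes and evaluates L(P) directly. It is only an illustration under a strong simplifying assumption that is not taken from [11]: the per-node cost loss and cost saving are supplied as fixed numbers in the hypothetical lists `loss` and `gain`. The enumeration takes time exponential in n, so it is not the O(n^2 + mn log n) procedure referred to above.

```python
from itertools import combinations


def total_cost_loss(P, loss, gain):
    """L(P): sum over nodes in P of (cost loss of eviction - cost saving of caching)."""
    return sum(loss[i] - gain[i] for i in P)


def best_replacement_set(loss, gain):
    """Brute-force minimizer of L(P) over all subsets P of node indices.

    loss[i] and gain[i] are assumed to be fixed per-node values (an
    assumption of this sketch). The empty set, meaning the new version
    is cached nowhere, has L = 0 and is the starting candidate.
    """
    nodes = range(len(loss))
    best_P, best_L = (), 0.0
    for size in range(1, len(loss) + 1):
        for P in combinations(nodes, size):
            L = total_cost_loss(P, loss, gain)
            if L < best_L:
                best_P, best_L = P, L
    return set(best_P), best_L


# Hypothetical values for four nodes v_1..v_4 (indices 0..3).
loss = [3.0, 1.0, 4.0, 2.0]
gain = [2.0, 2.5, 1.0, 2.5]
chosen, value = best_replacement_set(loss, gain)
```

With fixed per-node values the minimizer simply collects the nodes whose cost saving exceeds the cost loss; the sketch is meant to show what is being minimized rather than how [11] computes it.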

3 Simulation Model

We outline the system configuration in Section 3.1 and introduce the existing models used for comparison in Section 3.2.

3.1 System Configuration

To the best of our knowledge, it is difficult to find real trace data in the open literature to simulate our model. Therefore, we generated the simulation model from the empirical results presented in [1, 3, 4, 5, 7, 10]. The network topology was randomly generated by the Tier program [5]. Experiments for many topologies with different parameters have been conducted, and the relative performance of our model was found to be insensitive to topology changes. Here, only the experimental results for one topology are reported due to space limitations. The characteristics of this topology and the workload model are shown in Table 1; they are chosen from the open literature and are considered to be reasonable.

Table 1. Parameters Used in Simulation

| Parameter | Value |
|---|---|
| Number of WAN Nodes | 200 |
| Number of MAN Nodes | 200 |
| Delay of WAN Links | Exponential distribution, $p(x) = \theta^{-1} e^{-x/\theta}$ ($\theta = 0.45$ Sec) |
| Delay of MAN Links | Exponential distribution, $p(x) = \theta^{-1} e^{-x/\theta}$ ($\theta = 0.06$ Sec) |
| Number of Servers | 100 |
| Number of Web Objects | 1000 objects per server |
| Web Object Size Distribution | Pareto distribution, $p(x) = a b^a / x^{a+1}$ ($a = 1.1$, $b = 8596$) |
| Web Object Access Frequency | Zipf-like distribution, $1/i^{\alpha}$ ($\alpha = 0.7$) |
| Relative Cache Size Per Node | 4% |
| Average Request Rate Per Node | U(1, 9) requests per second |
| Transcoding Rate | 20KB/Sec |

The WAN (Wide Area Network) is viewed as the backbone network to which no servers or clients are attached. Each MAN (Metropolitan Area Network) node is assumed to connect to a content server. Each MAN and WAN node is associated with an en-route cache. Similar to the studies in [4, 6, 9, 15], cache size is expressed relative to the total size of all objects available in the content server. In our experiments, the object sizes are assumed to follow a Pareto distribution and the average object size is 26KB. We also assume that each multimedia object has five versions and that the transcoding graph is as shown in Figure 1. The transcoding delay is determined as the quotient of the object size to the transcoding rate.

Fig. 1. Transcoding Graph for Simulation

In our experiments, the client at each MAN node randomly generates the requests, and the average request rate of each node follows the distribution U(1, 9), where U(x, y) represents a uniform distribution between x and y. The access frequencies of both the content servers and the objects maintained by a given server follow a Zipf-like distribution [4, 13]. Specifically, the probability of a request for object O in server S is proportional to $1/(i^{\alpha} \cdot j^{\alpha})$, where S is the ith most popular server and O is the jth most popular object in S. The delay of both MAN links and WAN links follows an exponential distribution, where the average delay for WAN links is 0.46 seconds and the average delay for MAN links is 0.06 seconds.

The cost for each link is calculated from the access delay. For simplicity, the delay caused by sending the request and the relevant response for that request is proportional to the size of the requested object. Here, we consider the average object sizes for calculating all delays, including the transmission delay and the transcoding delay. The cost function is taken to be the delay of the link, which means that the cost in our model is interpreted as the access latency in our simulation. We apply a "sliding window" technique to estimate the access frequency so as to make our model less sensitive to transient workload [15]. Specifically, for each object O, f(O, v) is calculated as $K/(t - t_K)$, where K is the number of accesses recorded, t is the current time, and $t_K$ is the Kth most recent reference time (the time of the oldest reference in the sliding window). K is set to 2 in the simulation. To reduce overhead, the access frequency is only updated when the object is referenced and at reasonably large intervals, e.g., several minutes, to reflect aging, which is also applied in [10].

3.2 Evaluation Models

– LRU: Least Recently Used (LRU) evicts the web object which was requested the least recently. The requested object is stored at each node through which the object passes. The cache purges one or more least recently requested objects to accommodate the new object if there is not enough room for it.
– LNC-R [14]: Least Normalized Cost Replacement (LNC-R) is a model that approximates the optimal cache replacement solution. It selects the least profitable documents for replacement. Similar to LRU, the requested object is cached by all nodes along the routing path.
– AE [7]: Aggregate Effect (AE) is a model that explores the aggregate effect of caching multiple versions of the same multimedia object in the cache. It formulates a generalized profit function to evaluate the aggregate profit from caching multiple versions of the same multimedia object. When the requested object passes through each node, the cache determines which version of that object should be stored at that node according to the generalized profit.
– CCR: Cooperative Cache Replacement (CCR) determines cache replacement candidates on all candidate nodes in a coordinated fashion with the objective of minimizing the total cost loss.
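As an illustration of the simplest baseline in the list above, LRU eviction with objects of different sizes might be sketched as follows. The class name `LRUCache`, the byte-based capacity, and the decision not to cache an object larger than the whole cache are assumptions made for this sketch; they are not details reported for the simulation.

```python
from collections import OrderedDict


class LRUCache:
    """Size-aware LRU cache: evicts least recently requested objects first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()  # object id -> size, least recent first

    def access(self, obj_id, size):
        """Return True on a hit. On a miss, store the object, purging
        least recently requested objects until it fits."""
        if obj_id in self.items:
            self.items.move_to_end(obj_id)  # mark as most recently used
            return True
        if size > self.capacity:
            return False  # assumption: oversized objects are not cached
        while self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size
        self.items[obj_id] = size
        self.used += size
        return False


cache = LRUCache(capacity_bytes=100)
hits = [cache.access(obj, 40) for obj in ["a", "b", "a", "c"]]
```

In this usage, the second request for "a" is a hit, and inserting "c" purges "b", which is the least recently requested object at that point.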

4 Performance Evaluation

In this section, we compare the performance of our model (described in Section 2) with that of the models introduced in Section 3.2 in terms of several performance metrics. The performance metrics used in our simulation are:

– delay-saving ratio (DSR), defined as the fraction of communication and server delays that is saved by satisfying the references from the cache instead of the server;
– average access latency (ASL);
– request response ratio (RRR), defined as the ratio of the access latency of the target object to its size;
– object hit ratio (OHR), defined as the ratio of the number of requests satisfied by the caches as a whole to the total number of requests;
– highest server load (HSL), defined as the largest number of bytes served by the server per second.

In the following figures, CCR, LRU, LNC-R, and AE denote the results for the four models introduced in Section 3.2. Table 2 lists the abbreviations used in this section.

Table 2. Abbreviations Used in Performance Analysis

| Meaning | Abbreviation | Description |
|---|---|---|
| Performance Metric | DSR | Delay-Saving Ratio (%) |
| | ASL | Average Access Latency (Sec) |
| | RRR | Request Response Ratio (Sec/MB) |
| | OHR | Object Hit Ratio (%) |
| | HSL | Highest Server Load (MB/Sec) |
| Caching Model | CCR | Coordinated Cache Replacement |
| | AE | Aggregate Effect |
| | LNC-R | Least Normalized Cost Replacement |
| | LRU | Least Recently Used |
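The metric definitions above can be turned into a small calculation. The sketch below assumes a hypothetical per-request record (`Request`) holding the object size, the observed latency, the latency the request would have had if served by the content server, and a hit flag; these field names are illustrative and not taken from the simulator. Averaging RRR per request is also an assumption of this sketch, and HSL is omitted because it needs per-second server traffic rather than per-request records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    size_mb: float         # size of the requested object (MB)
    latency: float         # observed access latency with caching (sec)
    origin_latency: float  # latency if served by the content server (sec)
    hit: bool              # True if some cache satisfied the request


def summarize(requests):
    """Compute DSR (%), ASL (sec), RRR (sec/MB), and OHR (%) from a request log."""
    n = len(requests)
    saved = sum(r.origin_latency - r.latency for r in requests)
    total_origin = sum(r.origin_latency for r in requests)
    return {
        "DSR": 100.0 * saved / total_origin,
        "ASL": sum(r.latency for r in requests) / n,
        "RRR": sum(r.latency / r.size_mb for r in requests) / n,
        "OHR": 100.0 * sum(r.hit for r in requests) / n,
    }


# Two hypothetical requests: one cache hit that halves the delay, one miss.
log = [
    Request(size_mb=0.5, latency=1.0, origin_latency=2.0, hit=True),
    Request(size_mb=0.25, latency=2.0, origin_latency=2.0, hit=False),
]
metrics = summarize(log)
```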

4.1 Impact of Cache Size

In this experiment set, we compare the performance results of different models across a wide range of cache sizes, from 0.04 percent to 15.0 percent. The first experiment investigates DSR as a function of the relative cache size per node, and Figure 2 shows the simulation results. As Figure 2 shows, our model outperforms the other models because our coordinated cache replacement model determines the replacement candidates cooperatively among all the nodes on the path from the server to the client, whereas the existing solutions, LRU, LNC-R, and AE, decide cache replacement candidates locally, i.e., only from the view of a single node. Specifically, the mean improvements of DSR over AE, LNC-R, and LRU are 21.2 percent, 18.9 percent, and 13.0 percent, respectively.

Fig. 2. Experiment on DSR (DSR (%) versus cache size per node (%); curves for CCR, AE, LNC-R, and LRU)

Figure 3 shows the simulation results of ASL as a function of the relative cache size at each node, and Figure 4 shows the results of RRR as a function of the relative cache size at each node. Clearly, the lower the ASL or the RRR, the better the performance. As we can see, all models provide steady performance improvement as the cache size increases. We can also see that CCR significantly improves both ASL and RRR compared to AE, LNC-R, and LRU, since our model determines the cache replacement candidates in an optimal and coordinated way, while the others decide the replacement candidates only by considering the situation of a single node. To achieve the same ASL as CCR, the other models need 2 to 6 times as much cache size.

Fig. 3. Experiment on ASL (ASL (Sec) versus cache size per node (%); curves for CCR, AE, LNC-R, and LRU)

Fig. 4. Experiment on RRR (RRR (Sec/MB) versus cache size per node (%); curves for CCR, AE, LNC-R, and LRU)

Figure 5 shows the results of OHR as a function of the relative cache size for different models. Because it computes the optimal replacement candidates, our model greatly outperforms the other solutions, especially for smaller cache sizes. We can also see that OHR steadily improves as the relative cache size increases, which conforms to the fact that more requests are satisfied by the caches as the cache size becomes larger. In particular, the mean improvements of DSR over AE, LNC-R, and LRU are 27.1 percent, 22.5 percent, and 13.9 percent, respectively.

Fig. 5. Experiment for OHR (OHR (%) versus cache size per node (%); curves for CCR, AE, LNC-R, and LRU)

Figure 6 shows the results of HSL as a function of the relative cache size. It can be seen that HSL for our model is lower than that of the other solutions. We can also see that HSL decreases as the relative cache size increases.

Fig. 6. Experiment for HSL (HSL (MB/Sec) versus cache size per node (%); curves for CCR, AE, LNC-R, and LRU)

4.2 Impact of Object Access Frequency

This experiment set examines the impact of the object access frequency distribution on the performance results of different models. Figures 7, 8, and 9 show the performance results of DSR, RRR, and OHR, respectively, for values of the Zipf parameter α from 0.2 to 1.0. We can see that CCR consistently provides the best performance over a wide range of object access frequency distributions. Specifically, CCR improves DSR by 17.7.4 percent, 15.0 percent, and 7.5 percent compared to LRU, LNC-R, and AE, respectively; the default cache size used here (4 percent) is fairly large in the context of en-route caching due to the large network under consideration.

Fig. 7. Experiment for DSR (versus Zipf parameter; curves for CCR, AE, LNC-R, and LRU)

Fig. 8. Experiment on RRR (versus Zipf parameter; curves for CCR, AE, LNC-R, and LRU)

Fig. 9. Experiment for OHR (versus Zipf parameter; curves for CCR, AE, LNC-R, and LRU)

4.3 Impact of the Number of Client Classes

The last experiment set examines the impact of the number of client classes on the performance results of different solutions. The number of client classes refers to the number of transcodable versions. In our experiments, the number of transcodable versions is set to 5 and the relevant vectors are (100%, 0, 0, 0, 0), (50%, 0, 50%, 0, 0), (50%, 0, 30%, 0, 20%), (40%, 0, 30%, 20%, 10%), and (20%, 15%, 20%, 15%, 30%). Figures 10 and 11 show the simulation results on DSR and RRR, respectively. We can see that DSR and RRR decrease as the number of transcodable versions increases, because the requests from the clients tend to disperse as the number of transcodable versions grows. Specifically, the mean improvements of DSR over AE, LNC-R, and LRU are 9.5 percent, 8.2 percent, and 5.1 percent, respectively.

Fig. 10. Experiment for DSR (DSR (%) versus number of client classes; curves for CCR, AE, LNC-R, and LRU)

Fig. 11. Experiment on RRR (RRR (Sec/MB) versus number of client classes; curves for CCR, AE, LNC-R, and LRU)

5 Conclusion

In this paper, we presented a performance evaluation of four models, including our coordinated cache replacement model for transcoding proxies, in which multimedia object placement and replacement policies are managed in a coordinated way. Extensive simulation experiments have been performed to compare the proposed coordinated cache replacement model with several existing models. The results show that our model effectively improves delay-saving ratio, average access latency, request response ratio, object hit ratio, and highest server load. The proposed coordinated cache replacement model considerably outperforms local cache replacement models that consider cache replacement at individual nodes only.

References

1. C. Aggarwal, J. L. Wolf, and P. S. Yu. Caching on the World Wide Web. IEEE Trans. on Knowledge and Data Engineering, Vol. 11, No. 1, pp. 94-107, 1999.
2. A. Balamash and M. Krunz. An Overview of Web Caching Replacement Algorithms. IEEE Communications Surveys & Tutorials, Vol. 6, No. 2, pp. 44-56, 2004.
3. P. Barford and M. Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. Proc. of ACM SIGMETRICS'98, pp. 151-160, 1998.
4. L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. Proc. of IEEE INFOCOM'99, pp. 126-134, 1999.
5. K. L. Calvert, M. B. Doar, and E. W. Zegura. Modelling Internet Topology. IEEE Comm. Magazine, Vol. 35, No. 6, pp. 160-163, 1997.
6. P. Cao and S. Irani. Cost-Aware WWW Proxy Caching Algorithms. Proc. of First USENIX Symposium on Internet Technologies and Systems (USITS), pp. 193-206, 1997.
7. C. Chang and M. Chen. On Exploring Aggregate Effect for Efficient Cache Replacement in Transcoding Proxies. IEEE Trans. on Parallel and Distributed Systems, Vol. 14, No. 6, pp. 611-624, June 2003.
8. M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative Caching: Using Remote Client Memory to Improve File System Performance. Proc. of First Symp. Operating Systems Design and Implementations, pp. 267-280, 1994.
9. S. Jin and A. Bestavros. GreedyDual* Web Caching Algorithm: Exploiting the Two Sources of Temporal Locality in Web Request Streams. Computer Comm., Vol. 4, No. 2, pp. 174-183, 2001.
10. K. Li and H. Shen. Coordinated En-Route Multimedia Object Caching in Transcoding Proxies for Tree Networks. ACM Trans. on Multimedia Computing, Communications and Applications (TOMCAPP), Vol. 5, No. 3, pp. 1-26, August 2005.
11. K. Li, H. Shen, and F. Chin. Cooperative Determination on Cache Replacement Candidates for Transcoding Proxy Caching. Proc. of The 2005 International Conference on Computer Networks and Mobile Computing (ICCNMC 2005), pp. 178-187, 2005.
12. M. R. Korupolu and M. Dahlin. Coordinated Placement and Replacement for Large-Scale Distributed Caches. IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 6, pp. 1317-1329, 2002.
13. V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a Busy Site: Findings and Implications. Proc. of ACM SIGCOMM'00, pp. 111-123, August 2000.
14. P. Scheuermann, J. Shim, and R. Vingralek. A Case for Delay-Conscious Caching of Web Documents. Computer Network and ISDN Systems, Vol. 29, No. 8-13, pp. 997-1005, 1997.
15. J. Shim, P. Scheuermann, and R. Vingralek. Proxy Cache Algorithms: Design, Implementation, and Performance. IEEE Trans. on Knowledge and Data Engineering, Vol. 11, No. 4, pp. 549-562, 1999.
16. J. Xu, B. Li, and D. L. Li. Placement Problems for Transparent Data Replication Proxy Services. IEEE Journal on Selected Areas in Communications, Vol. 20, No. 7, pp. 1383-1398, 2002.

Mathematics Model and Performance Evaluation of a Scalable TCP Congestion Control Protocol to LNCS/LNAI Proceedings

Li-Song Shao, He-Ying Zhang, Yan-Xin Zheng, and Wen-Hua Dou

Computer School, National University of Defence Technology, Changsha, China
[email protected]

Abstract. The AIMD (Additive Increase Multiplicative Decrease) mechanism used by the conventional TCP congestion control protocol is supported by nearly every Internet host. However, the conventional TCP was designed without a theoretical foundation, which results in problems in long distance high speed networks, such as low bandwidth efficiency and RTT (Round Trip Time) bias. Based on the flow fluid model, this paper models a scalable TCP congestion control protocol by a continuous differential equation on the RTT timescale, analyzes the condition for stability, and discusses the interrelations among stability, fairness and efficiency, which can help guide the design of end-to-end congestion control in the future.

H. Zhuge and G.C. Fox (Eds.): GCC 2005, LNCS 3795, pp. 1054–1065, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

End-to-end congestion control has long been an active research topic. If the traffic load entering the network greatly exceeds the capacity of the network, network performance degrades. Routers introduce AQM (Active Queue Management) mechanisms to prevent their output buffers from overflowing and to inform hosts, implicitly or explicitly, about congestion, while hosts adopt end-to-end congestion control to adjust their packet sending rates according to the congestion signals fed back from the routers.

The difficulties in designing an end-to-end congestion control protocol are its decentralized nature [4], route complexity, and conflicts among the target performance properties. Each host at the edge of the network cannot observe other hosts' information; it knows only its own RTT and the congestion state of the links traversed by its own flows. Each router in the core of the network cannot obtain information on the congestion state of other routers, even neighboring ones; it knows only the traffic arriving at itself and the usage of its own resources, such as buffer and bandwidth. Routers determine the next hop for each individual packet, so it is possible that packets of the same flow experience different paths and RTTs. On the other hand, if a link traversed by a flow suddenly fails, the remaining packets of the flow have to select a new route.

A congestion control protocol must primarily ensure that the global network system is stable. Additionally, the fairness and efficiency properties are important too: roughly speaking, "fairness" means that no user is penalized compared to others that share the same bottleneck links; "efficiency" means that no bandwidth is wasted, in the sense that the throughput of any active flow is limited only by the capacity of the bottleneck links on its path [25]. An ideal congestion control protocol makes a trade-off among these properties.

P. Kelly presented the Primal Algorithm based on user utility, and demonstrated global stability of a network without propagation delays using a Lyapunov argument [8]. R. Johari extended this result to give local stability in a general network where all flows have a single RTT, and conjectured that a similar result might hold for a general network where flows have heterogeneous RTTs [6]. Vinnicombe proved local stability of the Primal Algorithm interconnected with a general network in which flows can have heterogeneous RTTs, and extended this result to a broader class of congestion control protocols [3], [4]. T. Kelly put forward the STCP protocol to improve the performance of the conventional TCP in long distance high speed networks [9], such as backbone networks. S. Floyd offered a similar scheme, the HSTCP protocol [17].

The paper is organized as follows. Section 2 introduces a basic mathematical model which is the foundation of our work. Section 3 analyzes the problems of the conventional TCP in high speed networks, and points out that the window-based and AIMD mechanisms have become the bottleneck of the expanding Internet. Section 4 models the dynamic network system using a scalable TCP protocol on the RTT timescale, proves that the protocol is capable of making the network system stable around an equilibrium point, and discusses the protocol's fairness and efficiency properties. Section 5 concludes.

2 Flow Fluid Model

Given a communication network, let $J$ denote the set of links and $R$ the set of all flows between the sources and the receivers; a route $r$, corresponding to a user, is the ordered subset of $J$ that the flow traverses. Throughout this paper, route, flow and user will be used interchangeably in reference to an element $r \in R$. Let $A_{jr} = 1$ if $j \in r$ and $A_{jr} = 0$ otherwise; then the matrix $A = (A_{jr};\ j \in J,\ r \in R)$ is the routing matrix. The sending rate of route $r$ is denoted by $x_r$. It is assumed that routing is not altered on timescales comparable with the RTT. Let $C_j$ be the capacity of link $j$ for all $j \in J$. Suppose that congestion at a given link $j$ is signaled to a source with probability $\mu_j$, where $y_j$ is the instantaneous load on link $j$. It is assumed that $\mu_j$ is a static function over the timescale of interest, is determined by the packet arrival process at the link, and is independent of other links:

$$\mu_j(t) = p_j(y_j(t)) \tag{1}$$

where the congestion measure function $p_j$ is an increasing function of the arrival traffic $y_j(t)$. The probability that an acknowledgement, received at the source of route $r$ at time $t$, signals congestion is

$$q_r(t) = 1 - \prod_{j \in J} \left(1 - A_{jr}\, \mu_j(t - T_{jr})\right) \tag{2}$$

where the product is over each link forming the route $r$. In general, the congestion measure function $p_j$ is small enough to obtain high bandwidth efficiency. Then relation (2) is replaced approximately by

$$q_r(t) = \sum_{j \in J} A_{jr}\, \mu_j(t - T_{jr}) \tag{3}$$

The return delay from link $j$ for the acknowledgement via the receiver of route $r$ is denoted by $T_{jr}$, and the load at link $j$ is given by

$$y_j(t) = \sum_{r \in R} A_{jr}\, x_r(t - T_{rj}) \tag{4}$$

where $T_{rj}$ is the forward delay from the source on route $r$ to link $j$. On route $r$ the round trip time, $T_r$, is given by

$$T_r = T_{jr} + T_{rj} \quad \forall j \in r \tag{5}$$

It is assumed that queuing delay is negligible in comparison to the RTT. Another assumption implicit in these relations is that a route's load is invariant from the source to the receiver. This is not the case in a network which experiences significant packet drop rates. In such a network the rate arriving at links near the source will be closer to the sending rate $x_r$ than the rate arriving at links near the receiver. Consider the vectors $Y(t) = (y_j(t):\ j \in J)$, $U(t) = (\mu_j(t):\ j \in J)$, $X(t) = (x_r(t):\ r \in R)$, and $Q(t) = (q_r(t):\ r \in R)$; the dynamic system (3)(4)(5) is extended to the corresponding forms in the frequency domain:

$$Y(s) = R(s)\, X(s) \tag{6}$$

$$Q(s) = \mathrm{diag}(e^{-T_r s})\, A^T(-s)\, U(s) \tag{7}$$

Because the sources and links of the network in Figure 1 are distributed, the corresponding controls are diagonal matrices. The source of route $r$ implements end-to-end congestion control to adjust the sending rate $x_r$ according to the probability $q_r$. Link $j$ implements AQM to adjust the congestion probability $\mu_j$ according to the usage of its own resources.

Fig. 1. The Closed loop structure of flow fluid model
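Relations (1) to (4) can be checked numerically on a toy network. The sketch below ignores all delays (it assumes $T_{jr} = T_{rj} = 0$) and uses a hypothetical marking function $p_j(y) = \min(1, (y/C_j)^k)$; this function and all numbers are assumptions chosen for illustration and are not an AQM scheme from this paper. The sketch compares the exact route congestion probability of relation (2) with the additive approximation of relation (3).

```python
from math import prod


def link_loads(A, x):
    """Relation (4) without delays: y_j = sum_r A[j][r] * x[r]."""
    return [sum(a * xr for a, xr in zip(row, x)) for row in A]


def marking_probabilities(y, capacity, k=8):
    """Hypothetical increasing congestion measure p_j(y) = min(1, (y/C_j)^k)."""
    return [min(1.0, (yj / cj) ** k) for yj, cj in zip(y, capacity)]


def route_congestion_exact(A, mu):
    """Relation (2): q_r = 1 - prod_j (1 - A[j][r] * mu[j])."""
    routes = range(len(A[0]))
    return [1.0 - prod(1.0 - row[r] * m for row, m in zip(A, mu)) for r in routes]


def route_congestion_approx(A, mu):
    """Relation (3): q_r ~ sum_j A[j][r] * mu[j]."""
    routes = range(len(A[0]))
    return [sum(row[r] * m for row, m in zip(A, mu)) for r in routes]


# Two links, three routes: route 0 uses link 0, route 1 uses both, route 2 uses link 1.
A = [[1, 1, 0],
     [0, 1, 1]]
x = [30, 20, 40]       # sending rates (packets per second, illustrative)
capacity = [100, 100]  # link capacities (same unit)

y = link_loads(A, x)
mu = marking_probabilities(y, capacity)
q_exact = route_congestion_exact(A, mu)
q_approx = route_congestion_approx(A, mu)
```

For the route that crosses both links, the two expressions differ by the product of the two link probabilities, which is negligible when the marking probabilities are small; this is the regime in which the text replaces (2) by (3).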


3 The Conventional TCP

The slow start and congestion avoidance algorithms put forward by V. Jacobson form the basic framework of the TCP Reno protocol, which has controlled the conventional Internet successfully. Because the design of TCP Reno lacks theoretical guidance, its core mechanisms, window-based control and AIMD, have gradually become the bottleneck of the expanding Internet. When many flows share a single bottleneck link, the window-based mechanism results in an RTT bias in bandwidth allocation: the flow with the smaller RTT obtains higher bandwidth. In this way, network bandwidth allocation is very unfair to flows traversing a long distance. For example, TCP flows with an RTT of 10 ms can obtain twenty times as much bandwidth as flows with an RTT of 200 ms. On a long distance high speed backbone link, the AI of TCP Reno is too slow and the MD too fast, which results in low bandwidth efficiency. Additionally, TCP Reno can lead to dramatic oscillations in network queues that not only cause delay jitter but can even impact bandwidth efficiency [15].

Suppose that a TCP flow sharing a single bottleneck link has a round trip time of $T$ seconds and a packet size of $L$ bytes. An available bandwidth of $B$ bps allocated to the flow corresponds to a congestion window of about $W = BT/(8L)$ packets. Based on the response curve of TCP Reno [26],

$$W \approx \sqrt{\frac{1.5}{q}} \tag{8}$$

the relation between the bandwidth delay product $BT$ and the probability $q$ in the equilibrium state is given by

$$q \propto \frac{1}{(BT)^2} \tag{9}$$

This places a serious constraint on the congestion windows that can be achieved by TCP Reno in realistic environments. For example, consider a transoceanic backbone link with an RTT of about 100 ms and a bandwidth of about 1 Tbps, and a single TCP flow with an average packet length of 1.5 KB that obtains a bandwidth of 10 Gbps in the equilibrium state. The corresponding window is 83,333 packets and the corresponding probability $q$ must be smaller than $2 \times 10^{-10}$, meaning that the links must run continuously for at least 1.6 hours without any discarded or marked packet, which is an unrealistic requirement for current networks.

To resolve the above problems, many new end-to-end congestion control protocols have been proposed, of which HSTCP and STCP are two typical examples. The primary aim of HSTCP is for routes with a large $BT$ to adjust the increase and decrease parameters so as to improve the sending rate. Its basic idea is that when the congestion window exceeds a threshold value, the increase and decrease parameters of AIMD are modified according to the current congestion window [17], [18]. STCP takes two issues into account. On the one hand, for the sake of bandwidth efficiency, it increases the congestion window rapidly using MI instead of AI; on the other hand, it updates the increase and decrease parameters of MIMD periodically on a smaller timescale than the RTT in order to correct the unfairness of the conventional TCP [9], [10].
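The backbone example in this section can be reproduced with a few lines of arithmetic. The sketch below takes the Reno response curve in the form $W \approx \sqrt{1.5/q}$ and assumes a packet length of exactly 1500 bytes; the function names are illustrative. The time between marks is estimated as the expected number of packets between marks, $1/q$, divided by the packet rate $W/T$.

```python
def reno_window_packets(bandwidth_bps, rtt_s, packet_bytes):
    """Congestion window W = B*T / (8*L), in packets."""
    return bandwidth_bps * rtt_s / (8 * packet_bytes)


def reno_mark_probability(window_packets):
    """Invert the response curve W ~ sqrt(1.5/q) to get q = 1.5 / W^2."""
    return 1.5 / window_packets ** 2


def seconds_between_marks(window_packets, rtt_s):
    """Expected packets between marks (1/q) divided by packets per second (W/T)."""
    q = reno_mark_probability(window_packets)
    packets_per_second = window_packets / rtt_s
    return (1.0 / q) / packets_per_second


# 10 Gbps, 100 ms RTT, 1500-byte packets (assumed value for "1.5 KB").
W = reno_window_packets(10e9, 0.1, 1500)
q = reno_mark_probability(W)
hours = seconds_between_marks(W, 0.1) / 3600
```

Under these assumptions the window comes out at about 83,333 packets and $q$ at roughly $2.2 \times 10^{-10}$, which corresponds to about one and a half hours between marks, the same order as the figure quoted in the text.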

4 The Scalable TCP Protocol

The source of each TCP flow manages a congestion window which records the packets sent but not yet acknowledged. TCP Reno updates the value of the congestion window, $W$, according to the AIMD mechanism, which increases the window by one every RTT if there is no mark in the RTT, and halves the window otherwise. There are two versions of AIMD, on the packet timescale and on the RTT timescale, respectively [12]. The old version, called Reno-1, implements AIMD on the packet timescale: every time an acknowledgement packet is received, it increases $W$ by $1/W$ or halves $W$, depending on whether a mark is detected. The new version, called Reno-2, implements AIMD on the RTT timescale: it increases $W$ by one packet or halves $W$ only once per RTT, depending on whether there is a mark in the RTT. The two versions have slightly different static and dynamic properties, such as the fairness property and the differential equation. Many congestion control schemes have been proposed and analyzed on the packet timescale, whereas this paper proposes a scalable TCP protocol on the RTT timescale in which the congestion window increases by $aW^n$ if there is no mark in the RTT and decreases by $bW^m$ (only once) otherwise.

At time $t$, route $r$ sends $W_r(t)$ packets in an RTT. Then the probability that one or more packets are discarded or marked in the RTT is given by

$$1 - (1 - q_r(t))^{W_r(t)} \tag{10}$$

where $q_r(t)$ is so small that expression (10) is approximately $q_r(t) W_r(t)$. Then the probability that no packet is discarded or marked is approximately $1 - q_r(t) W_r(t)$. So the expected change in the congestion window $W_r$ per RTT is approximately

$$\Delta W_r(t) = a W_r^{n}(t)\left(1 - q_r(t) W_r(t)\right) - b W_r^{m}(t)\, q_r(t) W_r(t) \tag{11}$$

Because it is assumed that queuing delay is negligible in comparison to the RTT, the RTT of route $r$ is close to a constant $T_r$. Since the window is updated once per RTT, i.e., every $T_r$ seconds, and $x_r(t) = W_r(t)/T_r$, the expected change in the rate $x_r$ per unit time is approximately

$$\frac{\Delta W_r(t)}{T_r^{2}} \tag{12}$$

Motivated by (11) and (12), we model the scalable TCP protocol by the system of differential equations

$$\frac{dx_r(t)}{dt} = \frac{a\,(x_r(t) T_r)^{n} - q_r(t)\left[a\,(x_r(t) T_r)^{n+1} + b\,(x_r(t) T_r)^{m+1}\right]}{T_r^{2}} \tag{13}$$

Thus we model the network traffic by the closed loop dynamic system (1)(2)(4)(5)(13), which is the foundation of the following analysis.
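The behaviour of equation (13) for a single flow can be explored with a simple numerical integration. The sketch below is an illustration under assumptions that are not part of the paper's analysis: the congestion probability $q$ is held constant instead of being fed back through relations (1) to (5), the parameters are the Reno-like choice $a = 1$, $n = 0$, $b = 0.5$, $m = 1$, and forward Euler with a fixed step is used. The equilibrium window is found by bisection on the numerator of (13), assuming it changes sign exactly once on the search interval, and is compared with the window the simulated flow converges to.

```python
def rate_derivative(x, q, T, a=1.0, n=0.0, b=0.5, m=1.0):
    """Right-hand side of (13) for one flow with rate x, mark probability q, RTT T."""
    w = x * T
    return (a * w ** n - q * (a * w ** (n + 1) + b * w ** (m + 1))) / T ** 2


def simulate_rate(x0, q, T, dt=0.01, t_end=200.0, **params):
    """Forward-Euler integration of (13) with constant q (an assumption)."""
    x = x0
    for _ in range(int(t_end / dt)):
        x += dt * rate_derivative(x, q, T, **params)
    return x


def equilibrium_window(q, a=1.0, n=0.0, b=0.5, m=1.0, lo=1e-9, hi=1e9):
    """Bisection for W with a*W^n = q*(a*W^(n+1) + b*W^(m+1))."""
    def f(w):
        return a * w ** n - q * (a * w ** (n + 1) + b * w ** (m + 1))
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


T = 0.1    # RTT in seconds (illustrative)
q = 1e-4   # constant mark probability (illustrative)
x_final = simulate_rate(x0=10.0, q=q, T=T)
w_eq = equilibrium_window(q)
```

For these Reno-like parameters the equilibrium condition reduces to $1 = q(W + W^2/2)$, so the window is close to $\sqrt{2/q}$ when $q$ is small; the simulated window `x_final * T` settles near the same value.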


4.1 Local Stability Condition

Theorem 1. If there is a constant B such that, in the dynamic network system (1)(2)(4)(5)(13), the AQM algorithm satisfies condition (14) and the scalable TCP protocol satisfies condition (15), then the dynamic network system is stable around an equilibrium point.

y j p 'j ( y j ) ≤ Bp j ( y j ) ar ( xrTr ) n −1