Traffic Monitoring and Analysis: Second International Workshop, TMA 2010, Zurich, Switzerland, April 7, 2010. Proceedings [1 ed.] 3642123643, 9783642123641

The Second International Workshop on Traffic Monitoring and Analysis (TMA 2010) was an initiative of the COST Action IC0


English Pages 199 [206] Year 2010


Table of contents :
Front Matter....Pages -
Understanding and Preparing for DNS Evolution....Pages 1-16
Characterizing Traffic Flows Originating from Large-Scale Video Sharing Services....Pages 17-31
Mixing Biases: Structural Changes in the AS Topology Evolution....Pages 32-45
EmPath: Tool to Emulate Packet Transfer Characteristics in IP Network....Pages 46-58
A Database of Anomalous Traffic for Assessing Profile Based IDS....Pages 59-72
Collection and Exploration of Large Data Monitoring Sets Using Bitmap Databases....Pages 73-86
DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections....Pages 87-100
Uncovering Relations between Traffic Classifiers and Anomaly Detectors via Graph Theory....Pages 101-114
Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers....Pages 115-126
TCP Traffic Classification Using Markov Models....Pages 127-140
K-Dimensional Trees for Continuous Traffic Classification....Pages 141-154
Validation and Improvement of the Lossy Difference Aggregator to Measure Packet Delays....Pages 155-170
End-to-End Available Bandwidth Estimation Tools, An Experimental Comparison....Pages 171-182
On the Use of TCP Passive Measurements for Anomaly Detection: A Case Study from an Operational 3G Network....Pages 183-197
Back Matter....Pages -

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

6003

Fabio Ricciato Marco Mellia Ernst Biersack (Eds.)

Traffic Monitoring and Analysis Second International Workshop, TMA 2010 Zurich, Switzerland, April 7, 2010 Proceedings


Volume Editors
Fabio Ricciato, Università del Salento, Lecce, Italy and FTW Forschungszentrum Telekommunikation Wien, Austria. E-mail: [email protected]
Marco Mellia, Politecnico di Torino, Italy. E-mail: [email protected]
Ernst Biersack, EURECOM, Sophia Antipolis, France. E-mail: [email protected]

Library of Congress Control Number: 2010923705
CR Subject Classification (1998): C.2, D.4.4, H.3, H.4, D.2
LNCS Sublibrary: SL 5 – Computer Communication Networks and Telecommunications
ISSN 0302-9743
ISBN-10 3-642-12364-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12364-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180

Preface

The Second International Workshop on Traffic Monitoring and Analysis (TMA 2010) was an initiative of the COST Action IC0703 "Data Traffic Monitoring and Analysis: Theory, Techniques, Tools and Applications for the Future Networks" (http://www.tma-portal.eu/cost-tma-action). The COST program is an intergovernmental framework for European cooperation in science and technology, promoting the coordination of nationally funded research on a European level. Each COST Action aims at reducing the fragmentation in research and opening the European research area to cooperation worldwide.

Traffic monitoring and analysis (TMA) is nowadays an important research topic within the field of computer networks. It involves many research groups worldwide that are collectively advancing our understanding of the Internet. The importance of TMA research is motivated by the fact that modern packet networks are highly complex and ever-evolving objects. Understanding, developing and managing such environments is difficult and expensive in practice. Traffic monitoring is a key methodology for understanding telecommunication technology and improving its operation, and the recent advances in this field suggest that evolved TMA-based techniques can play a key role in the operation of real networks.

Besides its practical importance, TMA is an attractive research topic for many reasons. First, the inherent complexity of the Internet has drawn many researchers to traffic measurements since the pioneering times. Second, TMA offers a fertile ground for theoretical and cross-disciplinary research (such as the various analysis techniques being imported into TMA from other fields) while at the same time providing a clear perspective for the exploitation of the results in real network environments. In other words, TMA research has the potential to reconcile theoretical investigations with practical applications, and to realign curiosity-driven with problem-driven research.

In the spirit of the COST program, the COST-TMA Action was launched in 2008 to promote building a research community in the specific field of TMA. Today, it involves research groups from academic and industrial organizations from 24 countries in Europe. The goal of the TMA workshops is to open the COST Action research and discussions to the worldwide community of researchers working in this field. Following the success of the first edition of the TMA workshop in 2009, which gathered around 70 participants involved in lively interaction during the presentation of the papers, we decided to maintain the same format for this second edition: a single-session, full-day program. TMA 2010 was organized jointly with the 11th Passive and Active Measurement conference (PAM 2010) and was held in Zurich on April 7, 2010. We are grateful to Bernhard Plattner and Xenofontas Dimitropoulos from ETH Zurich for the perfect local organization.


The submission and revision process for the two events was done independently. For TMA 2010, 34 papers were submitted. Each paper received at least three independent reviews by TPC members or external reviewers. Finally, 14 papers were accepted for inclusion in the present proceedings. A few papers were conditionally accepted and shepherded to ensure that the authors in the final version addressed the critical points raised by the reviewers. Given the very tight schedule available for the review process, it was not possible to implement a rebuttal phase, but we recommend considering this option for future editions of the workshop. We are planning to implement a comment-posting feature for all the papers included in these proceedings on the TMA portal (http://www.tma-portal.eu/forums). The goal is to offer a ready-to-use channel to readers and authors for posting comments, expressing criticisms, requesting and providing clarifications and any other material relevant to each paper. In the true spirit of the COST Action, we hope in this way to contribute to raising the level of research interactions in this field. We wish to thank all the TPC members and the external reviewers for the great job done: accurate and qualified reviews are key for building and maintaining a high-level standard for the TMA workshop series. We are grateful to Springer for accepting to be the publisher of the TMA workshop. We hope you will enjoy the proceedings!

Fabio Ricciato Marco Mellia Ernst Biersack

Organization

Technical Program Committee
Patrice Abry, ENS Lyon, France
Valentina Alaria, Cisco Systems
Pere Barlet-Ros, UPC Barcelona, Spain
Christian Callegari, University of Pisa, Italy
Ana Paula Couto da Silva, Federal University of Juiz de Fora, Brazil
Jean-Laurent Costeaux, France Telecom, France
Udo Krieger, University of Bamberg, Germany
Youngseok Lee, CNU, Korea
Michela Meo, Politecnico di Torino, Italy
Philippe Owezarski, LAAS-CNRS, France
Aiko Pras, University of Twente, The Netherlands
Kavé Salamatian, University of Lancaster, UK
Dario Rossi, TELECOM ParisTech, France
Matthew Roughan, University of Adelaide, Australia
Luca Salgarelli, University of Brescia, Italy
Yuval Shavitt, Tel Aviv University, Israel
Ruben Torres, Purdue University, USA
Steve Uhlig, T-labs/TU Berlin, Germany
Pierre Borgnat, ENS Lyon, France

Local Organizer
Xenofontas Dimitropoulos, ETH Zurich, Switzerland
Bernhard Plattner, ETH Zurich, Switzerland

Technical Program Co-chairs
Fabio Ricciato, University of Salento, Italy
Marco Mellia, Politecnico di Torino, Italy
Ernst Biersack, EURECOM

Table of Contents

Analysis of Internet Datasets

Understanding and Preparing for DNS Evolution . . . 1
  Sebastian Castro, Min Zhang, Wolfgang John, Duane Wessels, and Kimberly Claffy
Characterizing Traffic Flows Originating from Large-Scale Video Sharing Services . . . 17
  Tatsuya Mori, Ryoichi Kawahara, Haruhisa Hasegawa, and Shinsuke Shimogawa
Mixing Biases: Structural Changes in the AS Topology Evolution . . . 32
  Hamed Haddadi, Damien Fay, Steve Uhlig, Andrew Moore, Richard Mortier, and Almerima Jamakovic

Tools for Traffic Analysis and Monitoring

EmPath: Tool to Emulate Packet Transfer Characteristics in IP Network . . . 46
  Jaroslaw Sliwinski, Andrzej Beben, and Piotr Krawiec
A Database of Anomalous Traffic for Assessing Profile Based IDS . . . 59
  Philippe Owezarski
Collection and Exploration of Large Data Monitoring Sets Using Bitmap Databases . . . 73
  Luca Deri, Valeria Lorenzetti, and Steve Mortimer
DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections . . . 87
  Antonio Barbuzzi, Gennaro Boggia, and Luigi Alfredo Grieco

Traffic Classification

Uncovering Relations between Traffic Classifiers and Anomaly Detectors via Graph Theory . . . 101
  Romain Fontugne, Pierre Borgnat, Patrice Abry, and Kensuke Fukuda
Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers . . . 115
  Alessandro Finamore, Michela Meo, Dario Rossi, and Silvio Valenti
TCP Traffic Classification Using Markov Models . . . 127
  Gerhard Münz, Hui Dai, Lothar Braun, and Georg Carle
K-Dimensional Trees for Continuous Traffic Classification . . . 141
  Valentín Carela-Español, Pere Barlet-Ros, Marc Solé-Simó, Alberto Dainotti, Walter de Donato, and Antonio Pescapé

Performance Measurements

Validation and Improvement of the Lossy Difference Aggregator to Measure Packet Delays . . . 155
  Josep Sanjuàs-Cuxart, Pere Barlet-Ros, and Josep Solé-Pareta
End-to-End Available Bandwidth Estimation Tools, an Experimental Comparison . . . 171
  Emanuele Goldoni and Marco Schivi
On the Use of TCP Passive Measurements for Anomaly Detection: A Case Study from an Operational 3G Network . . . 183
  Peter Romirer-Maierhofer, Angelo Coluccia, and Tobias Witek

Author Index . . . 199

Understanding and Preparing for DNS Evolution

Sebastian Castro (1,2), Min Zhang (1), Wolfgang John (1,3), Duane Wessels (1,4), and Kimberly Claffy (1)

(1) CAIDA, University of California, San Diego
(2) NZRS, New Zealand
(3) Chalmers University of Technology, Sweden
(4) DNS-OARC

{secastro,mia,johnwolf,kc}@caida.org [email protected]

Abstract. The Domain Name System (DNS) is a crucial component of today’s Internet. The top layer of the DNS hierarchy (the root nameservers) is facing dramatic changes: cryptographically signing the root zone with DNSSEC, deploying Internationalized Top-Level Domain (TLD) Names (IDNs), and addition of other new global Top Level Domains (TLDs). ICANN has stated plans to deploy all of these changes in the next year or two, and there is growing interest in measurement, testing, and provisioning for foreseen (or unforeseen) complications. We describe the Day-in-the-Life annual datasets available to characterize workload at the root servers, and we provide some analysis of the last several years of these datasets as a baseline for operational preparation, additional research, and informed policy. We confirm some trends from previous years, including the low fraction of clients (0.55% in 2009) still generating most misconfigured “pollution”, which constitutes the vast majority of observed queries to the root servers. We present new results on security-related attributes of the client population: an increase in the prevalence of DNS source port randomization, a short-term measure to improve DNS security; and a surprising decreasing trend in the fraction of DNSSEC-capable clients. Our insights on IPv6 data are limited to the nodes who collected IPv6 traffic, which does show growth. These statistics serve as a baseline for the impending transition to DNSSEC. We also report lessons learned from our global trace collection experiments, including improvements to future measurements that will help answer critical questions in the evolving DNS landscape.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 1–16, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

The DNS is a fundamental component of today’s Internet, mapping domain names used by people to their corresponding IP addresses. The data for this mapping is stored in a tree-structured distributed database where each nameserver is authoritative for a part of the naming tree. The root nameservers play a vital role providing authoritative referrals to nameservers for all top-level domains, which recursively determine referrals for all host names on the Internet,
among other infrastructure information. This top (root) layer of the DNS hierarchy is facing three dramatic changes: cryptographically signing the root zone with DNSSEC, deploying Internationalized Top-Level Domain (TLD) Names (IDNs), and addition of other new global Top Level Domains (TLDs). In addition, ICANN and the root zone operators must prepare for an expected increase in IPv6 glue records in the root zone due to the exhaustion of IPv4 addresses. ICANN currently plans to deploy all of these changes within a short time interval, and there is growing interest in measurement, testing, and provisioning for foreseen (or unforeseen) complications. As part of its DNS research activities, in 2002 CAIDA responded to the Root Server System Advisory Committee’s invitation to help DNS root operators study and improve the integrity of the root server system. Based on the few years of trust we had built with these operators, in 2006 we asked them to participate in a simultaneous collection of a day of traffic to (and in some cases from) the DNS root nameservers. We collaborated with the Internet Systems Consortium (ISC) and DNS Operation and Research Center (DNS-OARC) in coordinating four annual large-scale data collection events that took place in January 2006, January 2007, March 2008, and March 2009. While these measurements can be considered prototypes of a Day in the Life of the Internet [8], their original goal was to collect as complete a dataset as possible about the DNS root servers operations and evolution, particularly as they deployed new technology, such as anycast, with no rigorous way to evaluate its impacts in advance. As word of these experiments spread, the number and diversity of participants and datasets grew, as we describe in Section 2. 
In Section 3 we confirm the persistence of several phenomena observed in previous years, establishing baseline characteristics of DNS root traffic, validating previous measurements and inferences, and offering new insights into the pollution at the roots. In Section 4 we focus on the state of deployment of two major security-related aspects of clients querying the root: source port randomization and DNSSEC capability. We extract some minor insights about IPv6 traffic in Section 5 before summarizing overall lessons learned in Section 6.

2 Data Sets

On January 10–11, 2006, we coordinated concurrent measurements of three DNS root server anycast clouds (C, F, and K; see [13] for results and analysis). On January 9–10, 2007, four root servers (C, F, K, and M) participated in simultaneous capture of packet traces from almost all instances of their anycast clouds [5]. On March 18–19, 2008, operators of eight root servers (A, C, E, F, H, K, L, and M), five TLDs (.ORG, .UK, .BR, .SE, and .CL), two Regional Internet Registries (RIRs: APNIC and LACNIC), and seven operators of project AS112 joined this collaborative effort. Two Open Root Server Network (ORSN) servers, B in Vienna and M in Frankfurt, participated in our 2007 and 2008 collection experiments. On March 30–April 1, 2009, the same eight root servers participated in addition to seven TLDs (.BR, .CL, .CZ, .INFO, .NO, .SE, and .UK), three RIRs (APNIC, ARIN, and LACNIC), and several other DNS operators [9]. To the best of our knowledge, these events deliver the largest simultaneous collection of full-payload packet traces from a core component of the global Internet infrastructure ever shared with academic researchers. DNS-OARC provides limited storage and compute power for researchers to analyze the DITL data, which for privacy reasons cannot leave OARC machines (see footnote 1).

For this study we focus only on the root server DITL data and their implications for the imminent changes planned for the root zone. Each year we gathered more than 24 hours of data so that we could select the 24-hour interval with the least packet loss or other trace damage. The table in Fig. 1 presents summary statistics of the most complete 24-hour intervals of the last three years of DITL root server traces. Figure 1 (right) visually depicts our data collection gaps for UDP (the default DNS transport protocol) and TCP queries to the roots for the last three years. The darker the vertical bar, the more data we had from that instance during that year. The noticeable gaps weaken our ability to compare across years, although some (especially smaller, local) instances may not have received any IPv6 or TCP traffic during the collection interval, i.e., it may not always be a data gap. The IPv6 data gaps were much worse, but we did obtain (inconsistently) IPv6 traces from instances of four root servers (F, H, K, M), all of which showed an increase in the (albeit low) levels of IPv6 traffic over the 2-3 observation periods (see Section 5).
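Choosing the most complete 24-hour interval out of a longer capture can be pictured as a sliding-window search. The sketch below is illustrative only: the per-hour loss counts and the helper name `best_window` are hypothetical, and the paper does not describe the actual selection procedure, which also took other kinds of trace damage into account.

```python
def best_window(loss_per_hour, window=24):
    """Return (start_index, total_loss) of the contiguous window with the least loss.

    loss_per_hour is a hypothetical list of lost-packet counts, one per hour.
    Ties are resolved in favour of the earliest window.
    """
    if window <= 0 or window > len(loss_per_hour):
        raise ValueError("window must be between 1 and len(loss_per_hour)")
    current = sum(loss_per_hour[:window])
    best_start, best_loss = 0, current
    for start in range(1, len(loss_per_hour) - window + 1):
        # Slide by one hour: add the hour entering, drop the hour leaving.
        current += loss_per_hour[start + window - 1] - loss_per_hour[start - 1]
        if current < best_loss:
            best_start, best_loss = start, current
    return best_start, best_loss


# Hypothetical capture of 30 hours with loss concentrated in the first 3 hours.
hourly_loss = [50, 40, 30] + [1] * 27
print(best_window(hourly_loss))
```

With these made-up counts the search skips the three lossy hours at the start of the capture.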

3 Trends in DNS Workload Characteristics

To discover the continental distribution of the clients of each root instance measured, we mapped the client IP addresses to their geographic location (continent) using NetAcuity [2]; the location of the root server instances is available at www.root-servers.org [1]. Not surprisingly, the 3 unicast root servers observed had worldwide usage, i.e., clients from all over the globe. Fifteen (15) out of the 19 observed global anycast instances also had globally distributed client populations (exceptions were f-pao1, c-mad1, k-delhi, and m-icn; see footnote 2). Our observations confirm that anycast is effectively accomplishing its distributive goals, with 42 of the 46 local anycast instances measured serving primarily clients from the continent they are located in (exceptions were f-cdg1, k-frankfurt, k-helsinki, and f-sjc1; see footnote 3). We suspect that the few unusual client distributions result from particular BGP routing policies, as reported in Liu et al. [13] and Gibbard [10]. Figure 3 shows fairly consistent and expected growth in mean query rates observed at participating root servers. The geographic distribution of these queries spans the globe and, similar to previous years [13], suggests that anycast at the root servers is performing effectively at distributing load across the now much more globally pervasive root infrastructure.

Footnote 1: OARC hosts equipment for researchers who need additional computing resources.
Footnote 2: f-pao1 is in Palo Alto, CA; c-mad1 in Madrid, ES; and m-icn in Incheon, South Korea.
Footnote 3: f-cdg1 is in Paris, FR, and f-sjc1 in San Jose, CA.

Duration: roots, 24h for each of DITL2007, DITL2008, and DITL2009.

Dataset   Protocol  # instances*                                                         # queries  # clients
DITL2007  IPv4      C: 4/4, F: 36/40, K: 15/17, M: 6/6                                   3.83 B     2.8 M
DITL2008  IPv4      A: 1/1, C: 4/4, E: 1/1, F: 35/41, H: 1/1, K: 16/17, L: 2/2, M: 6/6   7.99 B     5.6 M
DITL2009  IPv4      A: 1/1, C: 6/6, E: 1/1, F: 35/48, H: 1/1, K: 16/17, L: 2/2, M: 6/6   8.09 B     5.8 M
DITL2007  IPv6      F: 5/40, K: 1/17                                                     0.2 M      60
DITL2008  IPv6      F: 10/41, H: 1/1, K: 1/17, M: 4/6                                    23 M       9 K
DITL2009  IPv6      F: 16/48, H: 1/1, K: 9/17, M: 5/6                                    29 M       16 K
DITL2007  TCP       C: 4/4, F: 36/40, K: 14/17, M: 5/6                                   0.7 M      256 K
DITL2008  TCP       A: 1/1, E: 1/1, F: 35/41, K: 16/17, M: 5/6                           2.07 M     213 K
DITL2009  TCP       A: 1/1, C: 6/6, E: 1/1, F: 35/48, H: 1/1, K: 16/17, M: 5/6           3.04 M     163 K

*observed/total

Fig. 1. DITL data coverage for 2007, 2008, 2009. The table summarizes participating root instances, and statistics for the most complete 24-hour collection intervals, including IPv4 UDP, IPv6 UDP, and TCP packets. The plots on the right show data collection gaps for UDP and TCP DNS traffic to the roots for the last three years.

[Figure 2: stacked bar chart, "Clients distribution by Continent for each instance (2009)". x-axis: root server instances, grouped by continent; y-axis: % of Clients (0 to 100). Legend: N. America, S. America, Europe, Africa, Asia, Oceania, Unknown. Node types: Unicast, Global Node, Local Node.]

Fig. 2. The geographic distribution of clients querying the root server instances participating in the DITL 2009 (colored according to their continental location). The root server instances are sorted by geographic longitude. Different font styles indicate unicast (green), global anycast (black, bold) and local anycast nodes (black, italic). The figure shows that anycast achieves its goal of localizing traffic, with 42 out of 46 local anycast instances indeed serving primarily clients from the same continent.

[Figure 3: bar chart, "Mean query rate at the root servers (2006, 2007, 2008, 2009)". x-axis: Root Server (A, C, E, F, H, K, L, M), one bar per participating year; y-axis: Queries Per Second (0 to 16000).]

Annual growth in mean query rate (table inset in Fig. 3):

Root   Growth 2007-2008   Growth 2008-2009
A      -                  -1.91%
C      40.5%              12.86%
E      -                  10.71%
F      13.06%             -41.15%
H      -                  25.74%
K      32.94%             1.66%
L      -                  18.90%
M      5.17%              12.32%

Fig. 3. Mean query rate over IPv4 at the root servers participating in DITL from 2006 to 2009. Bars represent average query rates on eight root servers over the four years. The table presents the annual growth rate at participating root servers since 2007. The outlying (41%) negative growth rate for F-root is due to a measurement failure at (and thus no data from) a global F-root (F-SFO) node in 2009.
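The annual growth rates mentioned in the caption are relative changes in mean query rate between consecutive collections. As a minimal sketch, assuming the usual definition (new minus old, divided by old) and using hypothetical query rates rather than values read from the figure:

```python
def growth_percent(previous_qps, current_qps):
    """Relative change between two mean query rates, in percent."""
    if previous_qps <= 0:
        raise ValueError("previous_qps must be positive")
    return (current_qps - previous_qps) / previous_qps * 100.0


# Hypothetical mean rates in queries per second (not taken from Fig. 3).
print(f"{growth_percent(10000.0, 11000.0):+.2f}%")
print(f"{growth_percent(10000.0, 5885.0):+.2f}%")
```

A missing global node, as in the F-root case described in the caption, lowers the measured current rate and can therefore produce a large negative value even if real traffic grew.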

[Figure 4: stacked bar chart, "Distribution of queries by query type (2006, 2007, 2008, 2009)". x-axis: root server (A, C, E, F, H, K, L, M), one bar per participating year; y-axis: Fraction of Queries (0 to 1). Legend: A, NS, CNAME, SOA, PTR, MX, TXT, AAAA, SRV, A6, OTHER.]

Fig. 4. DITL distribution of IPv4 UDP queries by types from 2007 to 2009. IPv6-related developments caused two notable shifts in 2008: a significant increase in AAAA queries due to the addition of IPv6 glue records to root servers, and a noticeable decrease in A6 queries due to their deprecation.


Figure 4 shows that the most common use of DNS – requesting the IPv4 address for a hostname via A-type queries – accounts for about 60% of all queries every year. More interesting is the consistent growth (at 7 out of 8 roots) in AAAA-type queries, which map hostnames to IPv6 addresses, using IPv4 packet transport. IPv6 glue records were added to six root servers in February 2008, prompting a larger jump in 2008 than we saw this year. Many client resolvers, including BIND, will proactively look for IPv6 addresses of NS records, even if they do not have IPv6 configured locally. We further discuss IPv6 in Section 5. Figure 4 also shows a surprising drop in MX queries from 2007 to 2009, even more surprising since the number of clients sending MX queries increased from 0.4M to 1.4M over the two data sets. The majority of the moderate to heavy hitter “MX” clients dramatically reduced their per-client MX load on the root system, suggesting that perhaps spammers are getting better at DNS caching.

[Figure 5: line plot, "Distribution of clients binned by query rate intervals (2007, 2008, 2009)". x-axis: Query rate interval [q/s]; left y-axis: Percent of Clients (log scale); right y-axis: Percent of Queries (log scale); one pair of lines per year.]

Values printed on the lines (DITL 2009):

Query rate interval [q/s]   % of clients   % of queries
0-0.001                     87.7           2.3
0.001-0.01                  9.7            5.8
0.01-0.1                    2.1            11.9
0.1-1                       0.4            20.5
1-10                        0.07           23.9
>10                         0.005          35.6

Fig. 5. Distribution of clients and queries as a function of mean IPv4 query rate order of magnitude for last three years of DITL data sets (y-axes log scale), showing the persistence of heavy-hitters, i.e. a few clients (in two rightmost bins) account for more than 50% of observed traffic. The numbers on the lines are the percentages of queries (upward lines) and clients represented by each bin for DITL 2009 (24-hour) data.
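The heavy-hitter claim in the caption can be checked by summing the DITL 2009 percentages printed on the lines of the figure. The sketch below rests on one assumption: the values are assigned to bins in order (clients decreasing, queries increasing with query rate), which is a reading of the figure labels rather than data taken from the traces. The helper `share` is purely illustrative.

```python
# DITL 2009 percentages as read from the labels in Fig. 5.
# Assumption: values are matched to bins in order of increasing query rate.
BINS = ["0-0.001", "0.001-0.01", "0.01-0.1", "0.1-1", "1-10", ">10"]
CLIENTS_PCT = [87.7, 9.7, 2.1, 0.4, 0.07, 0.005]
QUERIES_PCT = [2.3, 5.8, 11.9, 20.5, 23.9, 35.6]


def share(values, first, last):
    """Sum the percentages of bins first..last (inclusive), rounded to 3 places."""
    return round(sum(values[first:last + 1]), 3)


# Two lowest-rate bins (under 1 query per 100 s).
low_clients, low_queries = share(CLIENTS_PCT, 0, 1), share(QUERIES_PCT, 0, 1)
# Two rightmost bins (more than 1 query per second).
high_clients, high_queries = share(CLIENTS_PCT, 4, 5), share(QUERIES_PCT, 4, 5)

print(f"low-rate bins:  {low_clients}% of clients, {low_queries}% of queries")
print(f"high-rate bins: {high_clients}% of clients, {high_queries}% of queries")
```

Under this reading, the two rightmost bins hold well under 0.1% of clients yet carry more than half of the queries, in line with the caption.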

Several aspects of client query rates are remarkably consistent across years: the high variation in rate, and the distributions of clients and queries as a function of query rate interval. We first note that nameservers cache responses, including referrals, conserving network resources so that intermediate servers do not need to query the root nameservers for every request. For example, the name server learns that a.gtld-servers.net and others are authoritative for the .com zone, but also learns a time-to-live (TTL) for which this information is considered valid.
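The caching behaviour described above can be sketched as a small TTL-bounded lookup table. This is a toy model, not how any particular resolver is implemented: the class name `ReferralCache`, its methods, and the TTL value are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class ReferralCache:
    """Toy cache mapping a zone to its nameservers until a TTL expires."""

    def __init__(self):
        self._entries = {}  # zone -> (nameservers, expiry time in seconds)

    def put(self, zone, nameservers, ttl, now):
        """Store a referral learned at time `now`, valid for `ttl` seconds."""
        self._entries[zone] = (list(nameservers), now + ttl)

    def get(self, zone, now):
        """Return cached nameservers, or None if absent or expired."""
        entry = self._entries.get(zone)
        if entry is None:
            return None
        nameservers, expires_at = entry
        if now >= expires_at:
            del self._entries[zone]  # expired: the root must be asked again
            return None
        return nameservers


cache = ReferralCache()
# Hypothetical TTL of 3600 seconds, chosen only for illustration.
cache.put("com", ["a.gtld-servers.net"], ttl=3600, now=0)
print(cache.get("com", now=10))    # still valid, no root query needed
print(cache.get("com", now=3600))  # expired, a new root query is needed
```

In this model a well-behaved resolver contacts the root only on a miss, which is why sustained high query rates from a single client point to misbehaviour.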


Typical TTLs for top level domains are on the order of 12 days. In theory, a caching recursive nameserver only needs to query the root nameservers for an unknown top level domain or when a TTL expires. However, many previous studies have shown that the root nameservers receive many more queries than they should [23,22,13,7]. Figure 5 shows the distributions of clients and queries binned by average query rate order of magnitude, ranging from 0.001 q/s (queries per second) to >10 q/s. The decreasing lines show the distribution of clients (unique IP addresses) as a function of their mean query rate (left axis), and the increasing lines show the distribution of total query load produced by clients as a function of their mean query rate (right axis). The two bins with the lowest query rates (under 1 query per 100s) contain 97.4% of the clients, but are only responsible for 8.1% of all queries. In stark contrast, the busiest clients (more than 1 query/sec) are minuscule in number ( 10

[Figure 6: bar chart. x-axis: Query rate interval.]

Fig. 6. Query validity as a function of query rate (2009) of the reduced datasets (queries from a random 10% sample of clients)

Figure 6 reflects this sample set of queries and confirms previous years – over 98% of these queries are pollution. The three rightmost groups in Figure 6 and corresponding three bottom rows of Table 1, which include moderately and very busy clients, represent less than 0.54% of the client IPs, but send 82.3% of the queries observed, with few legitimate queries. A closer look at the pollution class of invalid TLDs (orange bars in Figure 6) reveals that the top 10 most common invalid TLDs represent 10% of the total(!) query load at the root servers, consistently over the last four years. The most common invalid TLD is always local, followed by (at various rankings within the top 10) generic TLD names such as belkin, lan, home, invalid, domain, localdomain, wpad, corp and localhost, suggesting that misconfigured home routers contribute significantly to the invalid TLD category of pollution.

Table 2. Pollution and total queries of the busiest DITL2009 clients

Clients          % of clients   #Pollution/#Total         % of queries
Top 4000         0.07%          4,958M/4,964M = 99.9%     61.39%
Top 4000-8000    0.07%          760M/762M = 99.7%         9.42%
Top 8000-32000   0.41%          1,071M/1,080M = 99.2%     13.36%
Top 32000        0.55%          6,790M/6,803M = 99.8%     84.13%
All clients      100.00%        #Total queries: 8,086M    100.00%

To explore whether we can safely infer that the 98% pollution in our sample also reflects the pollution level in the complete data set, we examine a different sample: the busiest (“heavy hitter”) clients in the trace. We found that the 32,000 (0.55%) busiest clients accounted for a lower bound of 84% of the pollution queries in the whole trace (Table 2). These busy clients sent on average more
than 1 query every 10 seconds during the 24-hour interval (the 3 rightmost groups in Figures 5 and 6). We also mapped these busy clients to their origin ASes, and found no single AS was responsible for a disproportionate number of either the busy clients or queries issued by those clients. DNS pollution is truly a pervasive global phenomenon. There is considerable speculation on whether the impending changes to the root will increase the levels and proportion of pollution, and the associated impact on performance and provisioning requirements. Again, the DITL data provide a valuable baseline against which to compare future effects.
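The percentages in Table 2 follow directly from the query counts it lists. The short check below recomputes them, with counts in millions copied from the table; the variable and function names are illustrative only.

```python
# Query counts in millions, copied from Table 2: (pollution, total) per client group.
TOTAL_QUERIES_M = 8086
GROUPS = {
    "Top 4000": (4958, 4964),
    "Top 4000-8000": (760, 762),
    "Top 8000-32000": (1071, 1080),
    "Top 32000": (6790, 6803),
}


def pollution_pct(pollution, total):
    """Share of a group's queries that are pollution, in percent."""
    return round(100.0 * pollution / total, 1)


def query_share_pct(total, all_queries=TOTAL_QUERIES_M):
    """Share of all observed queries sent by a group, in percent."""
    return round(100.0 * total / all_queries, 2)


for name, (pollution, total) in GROUPS.items():
    print(f"{name}: {pollution_pct(pollution, total)}% pollution, "
          f"{query_share_pct(total)}% of all queries")

# Pollution from the 32,000 busiest clients relative to all queries in the trace.
lower_bound = round(100.0 * GROUPS["Top 32000"][0] / TOTAL_QUERIES_M, 1)
print(f"lower bound on pollution in the whole trace: {lower_bound}%")
```

The last figure is the roughly 84% lower bound quoted in the text: even if every query from the remaining clients were legitimate, pollution would still make up that share of the trace.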

4 Security-Related Attributes of DNS Clients

We next explore two client attributes related to DNS security and integrity.

4.1 Source Port Randomness

The lack of secure authentication of either the DNS mapping or query process has been well known among researchers for decades, but a discovery last year by Dan Kaminsky [17] broadened consciousness of these vulnerabilities by demonstrating how easy it was to poison (inject false information into) a DNS cache by guessing port numbers on a given connection⁴. This discovery rattled the networking and

[Figure 7: Source Port Randomness Comparison. Percent of clients (y-axis) versus score (x-axis, 0 to 1) for 2006 to 2009, with score regions labeled Poor, Good, and Great. Scores plotted: port changes/queries ratio, number of different ports/queries ratio, and bits of randomness.]

Fig. 7. CDFs of Source Port Randomness scores across four years of DITL data. Scores are classified as Poor, Good, or Great, with scores above 0.86 classified as Great. DNS source port randomness has increased significantly in the last 4 years, with the biggest jump between 2008 and 2009, likely in response to Kaminsky’s demonstration of the effectiveness of port-guessing to poison DNS caches [17].

⁴ Source port randomness is an important security feature mitigating the risk of different types of spoofing attacks, such as TCP hijacking or TCP reset attacks [20].

operational community, who immediately published and promoted tools and techniques to test and improve the degree of randomization that DNS resolvers apply to DNS source ports. Before Kaminsky’s discovery, DITL data indicated that DNS port randomization was typically poor or non-existent [7]. We applied three scores to quantify the evolution of source port randomness from 2006 to 2009. For each client sending more than 20 queries during the observation interval, we calculated: (i) the ratio of port number changes to queries; (ii) the ratio of unique ports to queries; and (iii) bits of randomness as proposed in [21,22]. We then classified scores as Poor, Good, or Great, with scores above 0.86 classified as Great. Figure 7 shows some good news: scores improved significantly, especially in the year following Kaminsky’s (2008) announcement. In 2009, more than 60% of the clients changed their source port numbers between more than 85% of their queries, which was only the case for about 40% of the clients in 2008 and fewer than 20% in 2007.
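The per-client scores described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: function names are hypothetical, the input is the ordered list of source ports seen from one client, and the third score uses the Shannon entropy of the observed ports as a stand-in, because the exact "bits of randomness" computation of [21,22] is not given in this text.

```python
import math
from collections import Counter

MIN_QUERIES = 20  # only clients sending more than 20 queries are scored


def port_change_ratio(ports):
    """Score (i): number of port changes between consecutive queries, per query."""
    changes = sum(1 for prev, cur in zip(ports, ports[1:]) if prev != cur)
    return changes / len(ports)


def unique_port_ratio(ports):
    """Score (ii): number of distinct source ports, per query."""
    return len(set(ports)) / len(ports)


def entropy_bits(ports):
    """Stand-in for score (iii): Shannon entropy of the observed ports, in bits.

    Assumption: this is NOT necessarily the measure proposed in [21,22].
    """
    total = len(ports)
    counts = Counter(ports)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def score_client(ports):
    """Return the three scores for one client, or None if it sent too few queries."""
    if len(ports) <= MIN_QUERIES:
        return None
    return {
        "change_ratio": port_change_ratio(ports),
        "unique_ratio": unique_port_ratio(ports),
        "entropy_bits": entropy_bits(ports),
    }


fixed_port_client = score_client([53] * 30)              # never changes its port
varied_port_client = score_client(list(range(1024, 1054)))  # a new port per query
```

A client that always uses one port scores 0 on the change ratio, while a client that uses a fresh port for every query approaches 1, which matches the direction of the Poor to Great scale in Figure 7.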

4.2 DNSSEC Capability

Although source-port randomization can mitigate the DNS cache poisoning vulnerability inherent in the protocol, it cannot completely prevent hijacking. The longer-term solution proposed for this vulnerability is the IETF-developed DNS Security Extensions (DNSSEC) [3] architecture and associated protocols, in development for over a decade but only recently seeing low levels of deployment [19]. DNSSEC adds five new resource record (RR) types: Delegation Signer (DS), DNSSEC Signature (RRSIG), Next-Secure records (NSEC and NSEC3), and DNSSEC key request (DNSKEY). DNSSEC also adds two new DNS header flags: Checking Disabled (CD) and Authenticated Data (AD). The protocol extensions support signing zone files and responses to queries with cryptographic keys.

Because the architecture assumes a single anchor of trust at the root of the naming hierarchy, pervasive DNSSEC deployment is blocked on cryptographically signing the root zone. Due to the distributed and somewhat convoluted nature of control over the root zone, this development has lagged expectations, but after considerable pressure and growing recognition of the potential cost of DNS vulnerabilities to the global economy, the U.S. government, ICANN, and Verisign are collaborating to get the DNS root signed by 2010. A few countries, including Sweden and Brazil, have signed their own ccTLDs in spite of the root not being signed yet, which has put additional pressure on those responsible for signing the root.

Due to the way DNSSEC works, clients will not normally issue queries for DNSSEC record types; rather, these records are automatically included in responses to normal query types, such as A, PTR, and MX. Rather than count queries from the set of DNSSEC types, we explore two other indicators of DNSSEC capability across the client population. First, we analyse the presence of EDNS support, a DNS extension that allows longer responses and is required to implement DNSSEC. We also know that if an EDNS-capable query has its DO bit set, the sending client is DNSSEC-capable. By checking the presence and value of the OPT RR pointer, we classify queries and clients into three groups:

[Figures 8 and 9: stacked bar charts of EDNS support by queries (fraction of queries) and by clients (fraction of clients), per root server (A, C, E, F, H, K, L, M) and year (2006 to 2009). Categories: DO bit, EDNS version 0, No EDNS, and (clients only) Mixed EDNS.]

Fig. 8. Growth of EDNS support (needed for DNSSEC) measured by DNS queries, especially between 2007 and 2008. In 2009, over 90% of the EDNS-capable queries are also DO enabled, i.e., advertising DNSSEC capability.

Fig. 9. Decrease in EDNS support measured by clients. In contrast to the query evolution, the fraction of EDNS-enabled clients has dropped since 2007. Worse news for DNSSEC, in 2009 only around 60% of the observed EDNS clients were DO enabled, i.e., DNSSEC-capable.

(i) no EDNS; (ii) EDNS version 0 (EDNS0) without the DO bit set; and (iii) EDNS0 with the DO bit set. A fourth type of client is mixed, i.e., an IP address that sources some, but not all, queries with EDNS support. Figure 8 shows clear growth in EDNS support as measured by queries, particularly from 2007 to 2008. Even better news, over 90% of the observed EDNS-capable queries were DO-enabled in 2009. This high level of support for DNSSEC seemed like good news, until we looked at EDNS support in terms of client IP addresses. Figure 9 shows that the fraction of EDNS-capable clients has actually decreased over the last several years, by almost 20%! In 2009, fewer than 30% of clients supported EDNS, and of those only around 60% included DO bits indicating actual DNSSEC capability. We hypothesized that the heavy hitter (hyperbusy) clients had something to do with this disparity, so we grouped clients according to query rate as in Section 3. Figure 10 shows that EDNS support for clients sending few queries dropped significantly after 2007, while busy clients have increased EDNS support. In our 2009 data set, more than half of the EDNS queries were generated by the fewer than 0.1% of clients in the two rightmost categories, sending more than 1 query/sec (cf. Figure 5). Since we have already determined that these busiest clients generate almost no legitimate DNS queries, we conclude that most of the DNSSEC-capable queries are in pollution categories. The category of clients with mixed EDNS support represents 7% (or 396K) of the unique sources in the 2009 dataset. We identified two reasons why clients can show mixed support: (i) several hosts can hide behind the same IP address (e.g., NAT); and (ii) EDNS fallback, i.e., clients fail to receive responses to queries with EDNS support, so they fall back to “vanilla” DNS and retry once more without

[Figure 10: Client distribution by EDNS support. Percentage of clients (y-axis) per query rate interval (< 0.001, 0.001−0.01, 0.01−0.1, 0.1−1, 1−10, > 10) and year (2006 to 2009). Categories: EDNS0, No EDNS, Mixed EDNS.]

Fig. 10. Plotting EDNS support vs. query rate reveals that EDNS support is increasing for busy clients, who mainly generate pollution, but has declined substantially for low frequency (typical) clients.

EDNS support. A test on a sample of 72K (18%) of the mixed EDNS clients showed that EDNS fallback patterns account for 36% of the mixed clients. EDNS also provides a mechanism to allow clients to advertise UDP buffer sizes larger than the default maximum size of 512 bytes [14]. Traditionally, responses larger than 512 bytes had to be sent using TCP, but EDNS signaling enables transmission of larger responses using UDP, avoiding the potential cost of a query retry using TCP. Figure 11 shows the UDP buffer size value distribution found in the queries signaling EDNS support. Only four different values are observed: 512 bytes was the default maximum buffer size for DNS responses before the introduction of EDNS in RFC 2671 [18]; 1280 bytes is a value suggested for Ethernet networks to avoid fragmentation; 2048 bytes was the default value for certain versions of BIND and derived products; and 4096 bytes is the maximum value permitted by most implementations. Figure 11 reveals a healthy increase in the use of the largest buffer size of 4096 bytes (from around 50% in 2006 to over 90% in 2009), which happened at the expense of queries with a 2048-byte buffer size. The fraction of queries using a 512-byte buffer size is generally below 5%, although it sometimes varies over years, with no consistent pattern across roots. One of the deployment concerns surrounding DNSSEC is that older traffic filtering appliances, firewalls, and other middleboxes may drop DNS packets larger than 512 bytes, forcing operators to manually set the EDNS buffer size to 512 to overcome this limitation. These middleboxes are harmful to the deployment of DNSSEC, since small buffer sizes

[Figure 11: EDNS buffer size (by query). Fraction of queries (y-axis) per root server (A, C, E, F, H, K, L, M) and year (2006 to 2009). Categories: 512 bytes, 1280 bytes, 2048 bytes, 4096 bytes, other size.]

Fig. 11. Another capability provided by EDNS is signaling of UDP buffer sizes. For the queries with EDNS support, we analyze the buffer size announced. An increase from 50% to 90% in the largest size can be observed from 2006 to 2009.

combined with the signaling of DNSSEC support (by setting the DO bit on) could increase the amount of TCP traffic due to retries.
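The query classification used in this section (no EDNS, EDNS0 without DO, EDNS0 with DO, plus mixed clients) and the advertised buffer size can both be read from the OPT pseudo-record in the additional section of a DNS message. The sketch below is a simplified illustration, not the authors' analysis code: function names are hypothetical, the EDNS version field is not checked, and the rule for clients whose EDNS queries differ in their DO bit is an assumption.

```python
import struct

OPT_TYPE = 41     # RR type of the EDNS OPT pseudo-record
DO_MASK = 0x8000  # DO flag inside the 32-bit TTL field of the OPT RR


def skip_name(msg, offset):
    """Return the offset just past the domain name that starts at offset."""
    while True:
        length = msg[offset]
        if length == 0:            # zero-length root label ends the name
            return offset + 1
        if length & 0xC0 == 0xC0:  # two-byte compression pointer ends the name
            return offset + 2
        offset += 1 + length


def classify_query(msg):
    """Return (group, advertised UDP buffer size or None) for one DNS message."""
    _id, _flags, qd, an, ns, ar = struct.unpack_from("!6H", msg, 0)
    offset = 12
    for _ in range(qd):
        offset = skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    for index in range(an + ns + ar):
        offset = skip_name(msg, offset)
        rtype, rclass, ttl, rdlength = struct.unpack_from("!HHIH", msg, offset)
        offset += 10 + rdlength
        if index >= an + ns and rtype == OPT_TYPE:
            # For OPT, CLASS carries the UDP buffer size and TTL carries the flags.
            group = "EDNS0 with DO" if ttl & DO_MASK else "EDNS0 without DO"
            return group, rclass
    return "no EDNS", None


def classify_client(groups):
    """Label a client from its queries' groups; mixed if only some use EDNS."""
    kinds = set(groups)
    if not kinds:
        raise ValueError("client has no queries")
    if kinds == {"no EDNS"}:
        return "no EDNS"
    if "no EDNS" in kinds:
        return "mixed EDNS"
    # Assumption: a client counts as DO-enabled if any of its EDNS queries sets DO.
    return "EDNS0 with DO" if "EDNS0 with DO" in kinds else "EDNS0 without DO"


def build_query(name, edns=False, do=False, udp_size=4096):
    """Build a minimal A query for testing, optionally carrying an OPT RR."""
    labels = [label.encode("ascii") for label in name.split(".") if label]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    header = struct.pack("!6H", 0x1234, 0, 1, 0, 0, 1 if edns else 0)
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    opt = b""
    if edns:
        flags = DO_MASK if do else 0
        opt = b"\x00" + struct.pack("!HHIH", OPT_TYPE, udp_size, flags, 0)
    return header + question + opt
```

A query built with `build_query("example.com", edns=True, do=True, udp_size=512)` illustrates the middlebox concern raised above: it advertises DNSSEC capability while limiting UDP responses to 512 bytes.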

5 A First Look at DNS IPv6 Data

Proposed as a solution for IPv4 address exhaustion, IPv6 supports a vastly larger number of endpoint addresses than IPv4, although like DNSSEC its deployment has languished. As of November 2009, eight of the thirteen root servers have been assigned IPv6 addresses [1]. The DITL 2009 datasets are the first with significant (but still quite inconsistent) IPv6 data collection, from four root servers. Table 3 shows IPv6 statistics for the one instance of K-root (in Amsterdam) that captured IPv6 data, without huge data gaps in the collection, for the last three years. Both the IPv6 query count and unique client count are much lower than for IPv4, although growth in both IPv6 queries and clients is evident. Geolocation of DITL 2009 clients reveals that at least 57.9% of the IPv6 clients querying this global root instance are from Europe [16], which is not surprising since this instance is in Europe, where IPv6 has had significant institutional support. The proportion of legitimate IPv6 queries (vs. pollution) is 60%, far higher than for IPv4, likely related to its extremely low deployment [4,11].

Table 3. IPv4 vs. IPv6 traffic on the K-AMS-IX root instance over three DITL years

K-AMS-IX, k-root   2007 IPv4   2007 IPv6   2008 IPv4   2008 IPv6   2009 IPv4   2009 IPv6
Query Count        248 M       39 K        170 M       8.21 M      277.56 M    9.96 M
Unique Clients     392 K       48          340 K       6.17 K      711 K       9 K

6 Lessons Learned

The Domain Name System (DNS) provides critical infrastructure services necessary for proper operation of the Internet. Despite the essential nature of the DNS, long-term research and analysis in support of its performance, stability, and security is extremely sparse. Indeed, the biggest concern with the imminent changes to the DNS root zone (DNSSEC, new TLDs, and IPv6) is the lack of data with which to evaluate our preparedness, performance, or problems before and throughout the transitions. The DITL project is now four years old, with more participants and types of data each year across many strategic links around the globe. In this paper we focused on a limited slice: the most detailed characterization of traffic to as many DNS root servers as possible, seeking macroscopic insights to illuminate the impending architectural changes to the root zone. We validated previous results on the extraordinarily high levels of pollution at the root nameservers, which continues to constitute the vast majority of observed queries to the roots. We presented new results on security-related attributes of the client population: an increase in the prevalence of DNS source port randomization, and a surprising decreasing trend in the fraction of DNSSEC-capable clients, which serve as a motivating if disquieting baseline for the impending transition to DNSSEC.

From a larger perspective, we have gained insights and experience from these global trace collection experiments, which inspire recommended improvements to future measurements that will help optimize the quality and integrity of data in support of answering critical questions in the evolving Internet landscape. We group our lessons into three categories: data collection, data management, and data analysis.

Lessons in Data Collection. Data collection is hard.
Radically distributed Internet data collection across every variety of administrative domain, time zone, and legislative framework around the globe is in “pray that this works” territory. Even though this was our fourth year, we continued to fight clock skew, significant periods of data loss, incorrect command line options, dysfunctional network taps, and other technical issues. Many of these problems we cannot find until we analyze the data. We rely heavily on pcap for packet capture and have largely assumed that it does not drop a significant number of packets during collection. We do not know for certain if, or how many, packets are lost due to overfull buffers or other external reasons. Many of our contributors use network taps or SPAN ports, so it is possible that the server receives packets that our collector


does not. Next year we are considering encoding the pcap_stats() output as special “metadata” packets at the end of each file. For future experiments, we also hope to pursue additional active measurements to improve data integrity and support deeper exploration of questions, including sending timestamp probes to root server instances during the collection interval to test for clock skew, fingerprinting heavy hitter clients for additional information, and probing to assess the extent of DNSSEC support and IPv6 deployment of root server clients. We recommend community workshops to help formulate questions to guide others in conducting potentially broader “Day-in-the-Life” global trace collection experiments [6].

Lessons in Data Management. As DITL grows in number and type of participants, it also grows in its diversity of data “formatting”. Before any analysis can begin, we spend months fixing and normalizing the large data set. This curation includes: converting from one type of compression (lzop) to another (gzip), accounting for skewed clocks, filling in gaps of missing data from other capture sources⁵, ensuring packet timestamps are strictly increasing, ensuring pcap files fall on consistent boundaries and are of a manageable size, removing packets from unwanted sources⁶, separating data from two sources that are mixed together⁷, removing duplicate data⁸, stripping VLAN tags, giving the pcap files a consistent data link type, and removing bogus entries from truncated or corrupt pcap files. Next, we merge and split pcap files again to facilitate subsequent analysis. The establishment of DNS-OARC also broke new (although not yet completely arable) ground for disclosure control models for privacy-protective data sharing. These contributions have already transformed the state of DNS research and data-sharing, and if sustained and extended, they promise to dramatically improve the quality of the lens with which we view the Internet as a whole. But methodologies for curating, indexing, and promoting use of data could always use additional evaluation and improvement. Dealing with extremely large and privacy-sensitive data sets remotely is always a technical as well as policy challenge.

Lessons in Data Analysis. We need to increase the automatic processing of basic statistics (query rate and type, topological coverage, geographic characteristics) to facilitate overview of traces across years. We also need to extend our tools to further analyze IPv6, DNSSEC, and non-root server traces to promote understanding of and preparation for the evolution of the DNS.
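Three of the curation steps listed above (accounting for skewed clocks, checking that timestamps are strictly increasing, and removing duplicate data) can be illustrated on in-memory records. This is only a sketch under simplifying assumptions: real curation operates on pcap files, whereas here each packet is a hypothetical (timestamp, payload) tuple, and all function names are invented for the example.

```python
def correct_clock(records, offset_seconds):
    """Shift every timestamp by a known clock offset, in seconds."""
    return [(ts + offset_seconds, payload) for ts, payload in records]


def drop_duplicates(records):
    """Remove exact (timestamp, payload) repeats, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def non_increasing_indices(records):
    """Indices of records whose timestamp is not strictly greater than the previous one."""
    return [
        i for i in range(1, len(records))
        if records[i][0] <= records[i - 1][0]
    ]


# Hypothetical capture: one duplicated packet and one packet out of order.
capture = [(1.0, b"q1"), (2.0, b"q2"), (2.0, b"q2"), (1.5, b"q3")]

deduplicated = drop_duplicates(capture)
problems_before_sort = non_increasing_indices(deduplicated)
ordered = sorted(deduplicated, key=lambda record: record[0])
problems_after_sort = non_increasing_indices(ordered)
```

Running the check before and after each step makes it easy to see which step fixed which problem, which matters when the issues are only discovered during analysis.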

⁵ In 2009, for example, one contributor used dnscap, but their scripts stopped working. They also captured data using WDCAP and were able to fill in some gaps, but the WDCAP data files were not boundary-aligned with the missing pcap files.
⁶ Another contributor included packets from their nearby non-root nameservers.
⁷ In 2008, we received data from A-root and Old-J-Root as a single stream.
⁸ In 2007-09, at least one contributor mistakenly started two instances of the collection script.


References

1. List of root servers, http://www.root-servers.org/ (accessed 2009.11.20)
2. NetAcuity, http://www.digital-element.net (accessed 2009.11.20)
3. Arends, R., Austein, R., Larson, M., Massey, D., Rose, S.: DNS Security Introduction and Requirements. RFC 4033 (2005)
4. CAIDA: Visualizing IPv6 AS-level Internet Topology (2008), http://www.caida.org/research/topology/as_core_network/ipv6.xml (accessed 2009.11.20)
5. CAIDA and DNS-OARC: A Report on DITL data gathering (January 9-10, 2007), http://www.caida.org/projects/ditl/summary-2007-01/ (accessed 2009.11.20)
6. CAIDA/WIDE: What researchers would like to learn from the ditl project (2008), http://www.caida.org/projects/ditl/questions/ (accessed 2009.11.20)
7. Castro, S., Wessels, D., Fomenkov, M., claffy, k.: A Day at the Root of the Internet. In: ACM SIGCOMM Computer Communications Review, CCR (2008)
8. N. R. Council: Looking over the Fence: A Neighbor’s View of Networking Research. National Academies Press, Washington (2001)
9. DNS-OARC: DNS-DITL 2009 participants, https://www.dns-oarc.net/oarc/data/ditl/2009 (accessed 2009.11.20)
10. Gibbard, S.: Observations on Anycast Topology and Performance (2007), http://www.pch.net/resources/papers/anycast-performance/anycast-performance-v10.pdf (accessed 2009.11.20)
11. Karpilovsky, E., Gerber, A., Pei, D., Rexford, J., Shaikh, A.: Quantifying the Extent of IPv6 Deployment. In: Moon, S.B., Teixeira, R., Uhlig, S. (eds.) PAM 2009. LNCS, vol. 5448, pp. 13–22. Springer, Heidelberg (2009)
12. Larson, M., Barber, P.: Observed DNS Resolution Misbehavior. RFC 4697 (2006)
13. Liu, Z., Huffaker, B., Brownlee, N., claffy, k.: Two Days in the Life of the DNS Anycast Root Servers. In: Uhlig, S., Papagiannaki, K., Bonaventure, O. (eds.) PAM 2007. LNCS, vol. 4427, pp. 125–134. Springer, Heidelberg (2007)
14. Mockapetris, P.: Domain names - implementation and specification. RFC 1035, Standard (1987)
15. Rekhter, Y., Moskowitz, B., Karrenberg, D., de Groot, G.J., Lear, E.: Address Allocation for Private Internets. RFC 1918 (1996)
16. Team Cymru: IP to ASN mapping, http://www.team-cymru.org/Services/ip-to-asn.html (accessed 2009.11.20)
17. US-CERT: Vulnerability note VU#800113: Multiple DNS implementations vulnerable to cache poisoning, http://www.kb.cert.org/vuls/id/800113 (accessed 2009.11.20)
18. Vixie, P.: Extension Mechanisms for DNS (EDNS0). RFC 2671 (1999)
19. Vixie, P.: Reasons for deploying DNSSEC (2008), http://www.dnssec.net/why-deploy-dnssec (accessed 2009.11.20)
20. Watson, P.: Slipping in the Window: TCP Reset attacks (2004), http://osvdb.org/ref/04/04030-SlippingInTheWindow_v1.0.doc (accessed 2009.11.20)
21. Wessels, D.: DNS port randomness test, https://www.dns-oarc.net/oarc/services/dnsentropy (accessed 2009.11.20)
22. Wessels, D.: Is your caching resolver polluting the internet? In: ACM SIGCOMM Workshop on Network Troubleshooting, NetTs 2004 (2004)
23. Wessels, D., Fomenkov, M.: Wow, that’s a lot of packets. In: Passive and Active Measurement Workshop (PAM) 2002, Fort Collins, USA (2002)

Characterizing Traffic Flows Originating from Large-Scale Video Sharing Services

Tatsuya Mori, Ryoichi Kawahara, Haruhisa Hasegawa, and Shinsuke Shimogawa
NTT Research Laboratories, 3–9–11 Midoricho, Musashino-city, Tokyo 180–8585, Japan

Abstract. This work attempts to characterize network traffic flows originating from large-scale video sharing services such as YouTube. The key technical contributions of this paper are twofold. We first present a simple and effective methodology that identifies traffic flows originating from video hosting servers. The key idea behind our approach is to leverage the addressing/naming conventions used in large-scale server farms. Next, using the identified video flows, we investigate the characteristics of network traffic flows of video sharing services from a network service provider view. Our study reveals the intrinsic characteristics of the flow size distributions of video sharing services. The origin of the intrinsic characteristics is rooted in the differentiated service provided for free and premium membership of the video sharing services. We also investigate temporal characteristics of video traffic flows.

1 Introduction

Recent growth in large-scale video sharing services such as YouTube [19] has been tremendously significant. These services are estimated to facilitate hundreds of thousands of newly uploaded videos per day and support hundreds of millions of video views per day. The great popularity of these video sharing services has even led to a drastic shift in the Internet traffic mix. Ref. [5] reported that the share of P2P traffic dropped to 51% at the end of 2007, down from 60% the year before, and that the decline in this traffic share is due primarily to an increase in traffic from web-based video sharing services. We envision that this trend will potentially keep growing; thus, managing the high demand for video services will continue to be a challenging task for both content providers and ISPs. On the basis of these observations, this work attempts to characterize the network traffic flows originating from large-scale video sharing services as the first step toward building a new data-centric network that is suitable for delivering numerous varieties of video services. We target currently prominent video sharing services: YouTube in the US, Smiley videos in Japan [16], Megavideo in Hong Kong [12], and Dailymotion in France [6]. Our analysis is oriented from the perspective of a network service provider, i.e., we aim to characterize the traffic flows from the viewpoints of resident ISPs or other networks that are located at the edges of the global Internet. Our first contribution is identifying traffic flows that originate from several video sharing services. The advantage of our approach lies in its simplicity. It uses source


F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 17–31, 2010. © Springer-Verlag Berlin Heidelberg 2010

IP addresses as the key for identification. To compile a list of IP addresses associated with video sharing services, we analyze a huge amount of access logs, collected at several web cache servers. The key idea behind our approach is to leverage the naming/addressing conventions used by large-scale server farms. In many cases, web servers for hosting videos and those for other functions, such as managing text, images, or applications, are isolated. These servers are assigned different sets of IP prefixes, which are often associated with intrinsic hostnames, e.g., “img09.example.com” is likely used for servers that serve image files. We also leverage open recursive DNS servers to associate the extracted hostnames of video hosting servers with their globally distributed IP addresses. Our second contribution is revealing the intrinsic characteristics of video sharing services, which are not covered by conventional web traffic models. The origin of the characteristics is based on the differentiated services provided for free and premium membership of the video sharing services. We also investigate temporal characteristics of video traffic flows. The remainder of this paper is structured as follows. Section 2 describes the measurement data set we used in this study. We present our classification techniques in Section 3. We then analyze the workload of video sharing services, using the identified traffic flows originating from video hosting servers, in Section 4. Section 5 presents related work. Finally, Section 6 concludes this paper.

2 Data Description

This section describes the two data sets we used in this study. The first data set was web proxy logs, which enable us to collect the IP addresses of video hosting servers used by video sharing services. The second data set was network traffic flows, which enable us to investigate the characteristics of the workload of video sharing services.

2.1 Web Cache Server Logs

We used the IRCache data set [10], which consists of web cache server logs that are open to the research community. We used the access logs collected from 7 root cache servers located in cities throughout the US. The access logs were collected in September 2009. Since the client IP addresses were anonymized for privacy protection, and the randomization seeds differ among files, we could not count the cumulative number of unique client IP addresses that used the cache servers. We noted, however, that a typical one-day log file for a location consisted of 100–200 unique client IP addresses, which include both actual clients and peered web cache servers deployed by other institutes. Assuming there were no overlaps of client IP addresses among the web cache servers, the total number of unique client IP addresses seen on September 1, 2009 was 805, which was large enough to merit statistical analysis. The one-month web cache logs consisted of 118 M web transactions in total, accounting for 7.8 TB of traffic volume. 89 M transactions returned the HTTP status code “200 OK”, and these successfully completed transactions account for 6.2 TB of traffic flows processed on the web cache servers.


2.2 Network Flow Data

In this work, we define a flow as a unique combination of source/destination IP address, source/destination port number, and protocol. We used network flow data that were collected at an incoming 10-Gbps link of a production network. For each flow, its length in seconds and size in bytes were recorded. The measurement was conducted for 9.5 hours on a weekday in the first quarter of 2009. Each record in the network flow data set contains the created and modified times of a flow, the numbers of packets and bytes of the flow, and the randomized destination (client) IP address; the 5-tuple of source/destination IP address, source/destination port number, and protocol composes a flow. The total amount of incoming traffic carried during the measurement period was 4.4 TB, which corresponded to a mean offered traffic rate of 1.03 Gbps. The incoming traffic consisted of 108 M distinct flows that originated from 5.5 M sender IP addresses and were destined to 34 K receiver (client) IP addresses. Of these, 40.6 M were incoming web flows. The traffic volume of the web flows was 1.8 TB (a mean offered traffic rate of 0.42 Gbps).
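The flow definition and the mean offered rates quoted above can be sketched in a few lines. This is an illustrative example, not the measurement system used in the study: the packet tuple layout and function names are hypothetical, and 1 TB is assumed to mean 10^12 bytes, which is the interpretation that reproduces the quoted 1.03 Gbps and 0.42 Gbps.

```python
def aggregate_flows(packets):
    """Group packets into flows keyed by the 5-tuple.

    Each packet is a hypothetical tuple:
    (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, size_bytes).
    """
    flows = {}
    for ts, src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        flow = flows.get(key)
        if flow is None:
            flows[key] = {"created": ts, "modified": ts, "packets": 1, "bytes": size}
        else:
            flow["created"] = min(flow["created"], ts)
            flow["modified"] = max(flow["modified"], ts)
            flow["packets"] += 1
            flow["bytes"] += size
    return flows


def mean_rate_gbps(total_bytes, duration_hours):
    """Mean offered traffic rate in Gbps for a volume carried over a period."""
    return total_bytes * 8 / (duration_hours * 3600) / 1e9


TB = 1e12  # assumption: decimal terabytes

all_traffic_gbps = mean_rate_gbps(4.4 * TB, 9.5)  # all incoming traffic
web_traffic_gbps = mean_rate_gbps(1.8 * TB, 9.5)  # incoming web flows only

example_packets = [
    (0.0, "192.0.2.1", "198.51.100.7", 80, 50000, 6, 1500),
    (2.0, "192.0.2.1", "198.51.100.7", 80, 50000, 6, 500),
    (1.0, "192.0.2.9", "198.51.100.7", 80, 50001, 6, 100),
]
example_flows = aggregate_flows(example_packets)
```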

3 Extracting Sources of Video Flows

We now present the techniques for identifying video flows in the network flow data set. We use a source IP address as a key for identification, i.e., if an incoming flow originates from an IP address associated with a video sharing service, we identify the flow as a video flow. As mentioned earlier, the key idea of this approach is leveraging the naming/addressing conventions used by large-scale web farms, where servers are grouped by their roles, e.g., hosting a massive amount of large video files, hosting a massive number of thumbnails, or providing rich web interfaces. We first show that a video sharing service uses distinct hostnames for each subtype of HTTP content-type, i.e., video, text, image, and application. We then collect hostnames of video hosting servers. Finally, we compile the IP addresses that are associated with the hostnames.

3.1 Classifying Hostnames with Subtypes of HTTP Content-Type

This section studies the naming conventions used in large-scale video sharing services and shows that distinct hostnames are used for each sub-category of HTTP content-type. The web cache server logs are used for the analysis. We also study the basic properties of the objects in each category. We start by looking for a primary domain name for the video sharing service of interest, e.g., YouTube. More specifically, we compare a hostname recorded in web cache logs with that domain name to see whether the hostname in the URL matches a Perl-style regular expression for that domain. If we see a match, we regard the object as one associated with YouTube. Although we use the YouTube domain as an example in the following, other prominent video sharing services today can also be explored in a similar way. For brevity, only the results for those services will be shown later.
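The hostname matching and grouping by content-type described above might look like the sketch below. It is an illustration only: the regular expression is an assumed pattern for a primary domain (the expression actually used is not recoverable from this text), the log entries are hypothetical (hostname, Content-Type) pairs rather than real IRCache records, and the hostnames in the example are invented.

```python
import re
from collections import defaultdict

# Assumed pattern: the primary domain itself or any hostname under it.
SERVICE_RE = re.compile(r"(^|\.)youtube\.com$", re.IGNORECASE)

CATEGORIES = {"video", "image", "text", "application"}


def category(content_type):
    """Map an HTTP Content-Type value such as 'video/x-flv' to its category."""
    major = content_type.split("/", 1)[0].strip().lower()
    return major if major in CATEGORIES else "other"


def hostnames_by_category(entries):
    """Collect the hostnames of the service seen for each content category."""
    result = defaultdict(set)
    for hostname, content_type in entries:
        if SERVICE_RE.search(hostname):
            result[category(content_type)].add(hostname.lower())
    return dict(result)


# Hypothetical log entries for illustration.
log_entries = [
    ("v1.cache.youtube.com", "video/x-flv"),
    ("img.youtube.com", "image/jpeg"),
    ("www.youtube.com", "text/html; charset=utf-8"),
    ("notyoutube.com", "video/x-flv"),
    ("www.example.org", "text/html"),
]
grouped = hostnames_by_category(log_entries)
```

Anchoring the pattern at a dot or at the start of the hostname keeps unrelated domains that merely end in the same string from being counted.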

Table 1. Statistics of Content-types in web transactions associated with YouTube

Content-type   No. of transactions   Total volume   Mean size
video          160,180               681 GB         4.3 MB
image          48,756                157 MB         3.2 KB
text           458,357               4.3 GB         9.5 KB
application    109,743               359 MB         3.3 KB
other          35,021                23 MB          678 B

Assortative networks have r > 0 (disassortative networks have r < 0) and tend to have nodes that are connected to nodes with similar (respectively dissimilar) degree. See [7] and [9] for a detailed explanation of the mathematical measures and different datasets.
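As a concrete illustration of the sign of r, the sketch below computes the degree assortativity of a small undirected graph as the Pearson correlation between the degrees found at the two ends of each edge. The function name and example graphs are hypothetical, and the formula is the standard degree-correlation definition, assumed here to match the measure detailed in [7] and [9].

```python
from collections import Counter


def assortativity(edges):
    """Pearson correlation of the degrees at either end of each undirected edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count every edge in both directions so that the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs.extend([degree[u], degree[v]])
        ys.extend([degree[v], degree[u]])
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("undefined when all nodes have the same degree")
    return cov / var


# A star: one hub connected to four leaves, the extreme disassortative case.
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
# A path of four nodes: end nodes of degree 1, inner nodes of degree 2.
path = [("a", "b"), ("b", "c"), ("c", "d")]

r_star = assortativity(star)
r_path = assortativity(path)
```

In the star, every edge joins the highest-degree node to a lowest-degree node, so r reaches its minimum; hub-and-spoke structures in an AS topology push r in the same negative direction.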

[Figure 2: two panels for the Skitter topology. (a) WSD, Skitter topology (x-axis: λ); (b) k-core proportions, Skitter topology (x-axis: k-core).]

Fig. 2. Clustering and spectral features of Skitter topology

hierarchical, with increasing numbers of ASes peering at public Internet Exchange Points (IXPs) to bypass the core of the Internet [8]. To understand the unexpectedly dominant core seen in the Skitter dataset, we rely on the k-core metric. A k-core is defined as the maximum connected subgraph, H, of a graph, G, with the property that d_v ≥ k ∀v ∈ H, where d_v is the degree of node v within H. As pointed out by [1] and [3], the k-core exposes the structure of a graph by pruning nodes with successively higher degrees, k, and examining the maximum remaining subgraph; note that this is not the same as pruning all nodes with degree k or less. Figure 2(b) shows the proportion of nodes in each k-core as a function of k. There are 84 plots shown, but there is little difference between them, demonstrating that the proportion of nodes in each k-core is not fundamentally changing over time. The WSD on the Skitter data is therefore not really observing a more dominant core, but a less well-sampled edge of the AS topology. We provide explicit evidence in Section 4 that Skitter has increasing difficulty over time in sampling the non-core part of the topology. There is a practical explanation for the sampling bias of Skitter: the Skitter dataset is composed of traceroutes rooted at a limited set of locations, so the k-core is expected to be similar to peeling the layers from an onion [1]. From a topology evolution point of view, Skitter’s view of the AS evolution is inconclusive, due to its sampling bias. Skitter is not sampling the periphery of the Internet and so cannot see evolutionary changes in the whole AS topology. Based on our evidence, we cannot make claims about the relative change of the core compared to the edge, as we can with the UCLA dataset. We stress that the purpose of this paper is not to blame the Skitter dataset for its limited coverage of the AS topology, as it aims at sampling the router-level topology. Datasets like Skitter that rely on active probing do provide some topological information not visible from BGP data, as will be shown in Section 4.
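The k-core pruning described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `k_core` and the toy edge list are assumptions made for illustration, and the sketch returns all nodes that survive pruning without separating connected components.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the nodes left after repeatedly removing nodes of degree < k.

    Illustrative only: the result may span several connected components,
    whereas the definition in the text takes a connected subgraph.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    queue = [n for n in alive if len(adj[n]) < k]
    while queue:
        n = queue.pop()
        if n not in alive:              # already pruned via another path
            continue
        alive.discard(n)
        for m in adj[n]:
            if m in alive:
                adj[m].discard(n)       # degree is counted inside the remainder
                if len(adj[m]) < k:
                    queue.append(m)
    return alive

# Hypothetical toy graph: a triangle a-b-c with a chain c-d-e attached.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
two_core = k_core(toy_edges, 2)
```

In the toy graph, node d starts with degree 2 but is still pruned from the 2-core once e is removed, which is why iterative pruning differs from a single pass that deletes low-degree nodes.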

3.2 UCLA

We now examine the evolution of the AS topology using 52 snapshots, one per month, from January 2004 to April 2008. This dataset, referred to in this paper as the UCLA dataset, comes from the Internet topology collection⁹ maintained by Oliveira et al. [16]. These topologies are updated daily using data sources such as BGP routing tables and updates from RouteViews, RIPE,¹⁰ Abilene¹¹ and LookingGlass servers. Each node and link is annotated with the times it was first and last observed. Note that, because the UCLA dataset combines multiple data sources that may contain inconsistencies or outdated information, there is a risk of pollution and bias.

[Figure 3: six panels plotting UCLA metrics against month: (a) Number of nodes (ASes), (b) Number of links, (c) Average node degree, (d) Clustering coefficient, (e) Assortativity coefficient, (f) ω(G, 3).]

Fig. 3. Topological metrics for UCLA AS topology

Figure 3 presents the evolution of the same set of topological metrics as Figure 1, over 4 years of AS topologies in the UCLA dataset. The UCLA AS topologies display a completely different evolution from the Skitter dataset, one more consistent with expectations. As the three upper graphs of Figure 3 show, the number of ASes, the number of AS edges, and the average node degree are all increasing, as expected in a growing Internet. The increasing assortativity coefficient indicates that ASes increasingly peer with ASes of similar degree. The preferential attachment model seems to be less dominant over time. This trend towards a less disassortative network is consistent with more ASes bypassing the tier-1 providers through public IXPs [8],

⁹ http://irl.cs.ucla.edu/topology/
¹⁰ http://www.ripe.net/db/irr.html
¹¹ http://abilene.internet2.edu/


[Figure 4: (a) WSD plotted against λ; (b) proportion of nodes in each k-core plotted against k. Curves are labelled by snapshot month, including January 2004 and April 2008.]

(a) WSD, UCLA topology.

(b) k-core proportions, UCLA topology.

Fig. 4. Clustering and spectral features of UCLA topology

hence connecting with nodes of similar degree. Another explanation for the increasing assortativity is an improvement in the visibility of non-core edges in BGP data. We will see in Section 4 that the sampling of core and non-core edges by UCLA and Skitter biases the observed AS topology structure. Contrary to the case of Skitter, ω(G, 3) for UCLA decreases over time. As a weighted clustering metric, ω(G, 3) indicates that the transit part of the AS topology is actually becoming relatively sparser over time compared to the periphery. Increasing local peering with small ASes, in order to reduce the traffic sent to providers, weakens the hierarchy induced by strict customer-provider relationships and in turn decreases the number of 3-cycles on which ω(G, 3) is based. If we look closely at Figure 4(a), we see a spectrum with a large peak at λ = 0.3 in January 2004, suggesting a strongly hierarchical topology. As time passes, the WSD becomes flatter with a peak at λ = 0.4, consistent with a mixed topology where core and non-core are not so easily distinguished. Figure 4(b) shows the proportion of nodes in each k-core as a function of k. The 52 plots form a smooth transition between the first and last plots, which are emphasized with bold curves. The distribution of k-cores moves to the right over time, indicating that the proportion of nodes with higher connectivity is increasing over time. This adds further weight to the conclusion that the UCLA dataset shows a weakening hierarchy in the Internet, with more peering connections between nodes on average.
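To make the assortativity trend discussed above concrete, the following sketch computes the degree assortativity coefficient as the Pearson correlation between the degrees found at the two ends of each edge, in the sense of [14]. The function name and the toy graphs are assumptions for illustration; the sketch expects a simple undirected graph in which each edge is listed once.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over all edges.

    Each undirected edge contributes both (deg u, deg v) and (deg v, deg u),
    so the two marginal distributions are identical.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("undefined when every edge endpoint has the same degree")
    return cov / var

# Hypothetical toy graphs: a star (hub and three leaves) and a 4-node path.
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
path = [(1, 2), (2, 3), (3, 4)]
```

A star, where a high-degree hub connects only to degree-1 leaves, is maximally disassortative; values moving from strongly negative towards zero correspond to the weakening hub-and-spoke structure described in the text.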

4 Reconciling the Datasets

The respective evolutions of the AS topology visible in the Skitter and UCLA datasets differ, as seen from topological metrics. Skitter shows an AS topology that is becoming sparser and more hierarchical, while UCLA shows one that is becoming denser and less hierarchical. Why do these two datasets show such differences? The explanation lies in the way Skitter and UCLA sample different parts of the AS topology: Skitter sees a far smaller fraction of the complete AS topology than UCLA, and even UCLA does not see the whole AS topology [15]. A far larger number of vantage points than those currently available is likely to be necessary in order to reach almost complete visibility of the AS topology [17]. To check how similar the AS topologies of Skitter and UCLA are, we computed the intersection and the difference between the two datasets in terms of AS edges and ASes. We used a two-year period from January 2006 until December 2007. In Table 1 we show the number of AS edges and ASes that Skitter and UCLA have in common during some of these monthly periods (labeled “intersection”), as well as the number of AS edges and ASes contributed to the total by only one of the two datasets (labeled “Skit-only” or “UCLA-only”). We observe a steady increase in the number of total ASes and AS edges seen by the union of the two datasets. At the same time, the intersection between the two datasets decreases. In late 2007, Skitter had visibility of less than 25% of the ASes and less than 10% of the AS edges seen by both datasets. As Skitter aims at sampling the Internet at the router level, we should not expect it to have wide coverage of the AS topology. Such limited coverage is however surprising, given the popularity of this dataset. Note that Skitter sees a small fraction of all AS edges that are not seen by the UCLA dataset. This indicates that there is potential in active topology discovery to complement BGP data. From Table 1, one might conclude that the Skitter dataset is uninteresting. On the contrary, the relatively constant, albeit decreasing, sampling of the Internet core by Skitter gives us a clue about which part of the Internet is responsible for its structural evolution.
In Table 2 we show the number of AS edges belonging to the tier-1¹² mesh (labeled “T1 mesh”) as well as other AS edges where a tier-1 appears. More than 30% of the AS edges sampled by Skitter cross at least one tier-1 AS, against about 15% for UCLA. Both datasets see almost all AS edges from the tier-1 mesh. Note that the decrease in the number of AS edges in which a tier-1 appears in Skitter is partly related to IP-to-AS mapping issues for multi-origin ASes [8]. The evolutions of the AS topology observed by the Skitter and UCLA datasets are not inconsistent. Rather, the two datasets sample the AS topology differently, leading to different biases. A large fraction of the AS topology sampled by Skitter relates to the core, i.e., edges containing at least one tier-1 AS. With its wider coverage, UCLA observes a different evolution of the AS topology, with a non-core part that grows more than the core. The evolution seen from the UCLA dataset seems more likely to reflect the evolution of the periphery of the AS topology. The non-core part of the Internet is growing and is becoming less and less hierarchical. We wish to point out that, despite a common trend towards making a union of datasets in our networking community, such simple addition is not appropriate for the UCLA and Skitter datasets. Each dataset

¹² We rely on the currently accepted list of 12 tier-1 ASes that provide transit-only service: AS174, AS209, AS701, AS1239, AS1668, AS2914, AS3356, AS3549, AS3561, AS5511, AS6461, and AS7018.


Table 1. Statistics on AS and AS edge counts in the intersection of both Skitter and UCLA datasets, and for each dataset alone

            Autonomous Systems                            AS Edges
Time        Total    Intersect.  Skit-only  UCLA-only     Total     Intersect.  Skit-only  UCLA-only
Jan. 2006   25,301   32.6%       0%         67.4%         114,847   15.4%       5.3%       79.3%
Mar. 2006   26,007   31.6%       0%         68.4%         118,786   14.9%       4.4%       80.7%
May 2006    26,694   30.5%       0%         69.5%         124,052   13.8%       4.6%       81.5%
Jul. 2006   27,396   29.5%       0%         70.5%         128,624   13.2%       3.7%       83.1%
Sep. 2006   28,108   28.7%       0%         71.3%         133,813   12.6%       3.4%       84.0%
Nov. 2006   28,885   27.9%       0%         72.1%         139,447   12.4%       3.4%       84.2%
Jan. 2007   29,444   27.2%       0%         72.8%         144,721   11.6%       3.1%       85.3%
Mar. 2007   30,236   26.5%       0%         73.5%         151,380   11.2%       3.0%       85.8%
May 2007    30,978   25.6%       0%         74.4%         157,392   10.5%       2.7%       86.8%
Jul. 2007   31,668   25.9%       0%         86.1%         166,057   10.0%       3.8%       86.2%
Sep. 2007   32,326   24.5%       0%         75.5%         168,876   9.7%        2.5%       87.8%
Nov. 2007   33,001   23.9%       0%         76.1%         174,318   9.5%        2.2%       88.3%
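The shares reported in Table 1 follow from plain set operations on the two edge (or AS) sets. The sketch below shows one way to compute them; the function names and the toy AS numbers are hypothetical, and edges are treated as undirected, in line with the paper's treatment of the AS topology as an undirected graph.

```python
def canonical(edge):
    """Order the endpoints so that (a, b) and (b, a) count as the same edge."""
    a, b = edge
    return (a, b) if a <= b else (b, a)

def overlap(skitter_edges, ucla_edges):
    """Shares of the union that are common, Skitter-only and UCLA-only."""
    skitter = {canonical(e) for e in skitter_edges}
    ucla = {canonical(e) for e in ucla_edges}
    total = len(skitter | ucla)
    return {
        "total": total,
        "intersection": len(skitter & ucla) / total,
        "skit_only": len(skitter - ucla) / total,
        "ucla_only": len(ucla - skitter) / total,
    }

# Hypothetical toy data; the 645xx numbers are placeholder AS numbers.
toy_skitter = [(701, 1239), (1239, 3356), (3356, 64500)]
toy_ucla = [(1239, 701), (3356, 1239), (174, 64501),
            (64501, 64502), (64500, 64502)]
stats = overlap(toy_skitter, toy_ucla)
```

By construction the three shares sum to one, which is a useful sanity check when reading a row of the table.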

Table 2. Coverage of tier-1 edges by Skitter and UCLA

            Skitter                        UCLA
Time        Total    T1 mesh  Other T1     Total     T1 mesh  Other T1
Jan. 2006   23,805   66       7,498        108,720   64       19,149
Mar. 2006   22,917   66       7,289        113,555   64       19,674
May 2006    22,888   64       7,504        118,331   64       20,143
Jul. 2006   21,740   65       7,192        123,842   64       20,580
Sep. 2006   21,400   65       6,974        129,228   64       21,059
Nov. 2006   22,034   66       7,159        134,636   65       21,581
Jan. 2007   21,345   65       6,898        140,216   65       22,531
Mar. 2007   21,366   65       6,774        147,000   65       23,194
May 2007    20,738   65       6,694        153,156   65       23,769
Jul. 2007   22,972   65       6,838        159,792   65       24,310
Sep. 2007   20,570   64       6,510        164,770   65       24,888
Nov. 2007   20,466   64       6,430        170,431   65       25,480
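As a worked check of the tier-1 shares quoted in the text, the sketch below divides the tier-1 edge counts of Table 2 by the corresponding totals for the first and last rows. The variable and function names are illustrative; the numbers are copied from the table.

```python
# (total edges, tier-1 mesh edges, other edges touching a tier-1), from Table 2
table2 = {
    "Jan. 2006": {"skitter": (23805, 66, 7498), "ucla": (108720, 64, 19149)},
    "Nov. 2007": {"skitter": (20466, 64, 6430), "ucla": (170431, 65, 25480)},
}

def tier1_share(total, t1_mesh, other_t1):
    """Fraction of sampled AS edges that involve at least one tier-1 AS."""
    return (t1_mesh + other_t1) / total

shares = {
    month: {name: tier1_share(*row) for name, row in datasets.items()}
    for month, datasets in table2.items()
}

for month, row in shares.items():
    print(month, {name: f"{100 * value:.1f}%" for name, value in row.items()})
```

The Skitter share stays above 30% in both rows, while the UCLA share falls from roughly 18% to about 15%, matching the statement that UCLA's growth is concentrated outside the core.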

has its own biases and measurement artifacts. Combining them blindly will only add these biases together, potentially leading to poorer quality data. Further research is required in order to devise a correct methodology that takes advantage of different datasets obtained from different sampling processes. The above observations suggest that the Internet, once seen as a tree-like, disassortative network with strict power-law properties [6], is moving towards an assortative and highly inter-connected network. Tier-1 providers have always been well connected, but the biggest shift is seen at the Internet’s periphery, where content providers and small ISPs are aggressively adding peering links among themselves using IXPs to avoid paying transit charges to tier-1 providers. Content distribution networks are partly the reason behind such changes [13]. A different view of the Internet evolution can be obtained using the WSD, shown in Figures 2(a) and 4(a). One possible cause for this behavior is increased mixing of the core and periphery of the network, i.e., the strict tiered hierarchy is becoming less important in the network structure. This is given further weight by studies such as [15], which show that the level of peering between ASes in the Internet has greatly increased during this period, leading to a less core-dominated network. Given that a fraction of AS edges are not visible from current datasets and that visibility is biased towards customer-provider relationships, we believe that our observations actually underestimate the changes in the structure of the AS topology. Using a hierarchical and preferential attachment-based model to generate synthetic AS topologies is likely to be less justified than ever. The AS topology structure is becoming more complex than in the past.

5 Related Work

In this section we outline related work, classified into three groups: evolution of the AS topology, spectral graph analysis of the AS topology, and analysis of the clustering features of the AS topology. Dhamdhere and Dovrolis [4] rely on available estimation methods for the types of relationships between ASes in order to analyze the evolution of the Internet ecosystem in the last decade. They believe the available historic datasets from RouteViews and RIPE are not sufficient to infer the evolution of peering links, and so they restrict their focus to customer-provider links. They find that after an exponential increase phase until 2001, the Internet now grows linearly in terms of both ASes and inter-AS links. The growth is mostly due to enterprise networks and content/access providers at the periphery of the Internet. The average path length remains almost constant, mostly due to the increasing multi-homing degree of transit and content/access providers. Relying on geo-location tools, they find that the AS ecosystem is now larger and more dynamic in Europe than in North America. In our paper we have relied on two datasets, covering a more extensive set of links and nodes, in order to focus on structural growth and evolution of the Internet. We use a large set of graph-theoretic measures in order to focus on the behavior of the topology. Due to inherent issues involved with inference of node locations and types of relationships [11], we treat the AS topology as an undirected graph. Shyu et al. [18] study the evolution of a set of topological metrics computed on a set of observed AS topologies. The authors rely on monthly snapshots extracted from BGP RouteViews from 1999 to 2006. The topological metrics they study are the average degree, average path length, node degree, expansion, resilience, distortion, link value, and the Normalized Laplacian Spectrum. They find that the metrics are not stable over time, except for the Normalized Laplacian Spectrum.
We explore this metric further by using WSD. Oliveira et al. [16] look at the evolution of the AS topology as observed from BGP data. Note that they do not study the evolution of the AS topology structure, only the nodes and links. They propose a model aimed at distinguishing real changes in ASes and AS edges from BGP routing observation artifacts. We use the extended dataset made available by the authors, in addition to 7 years of AS topology data from an alternative measurement method.

6 Conclusions

In this paper we presented a study of two views of the evolving Internet AS topology, one inferred from traceroute data and the other from BGP data. We exposed discrepancies between these two inferred AS topologies and their evolution. We reconciled these discrepancies by showing that the topologies are not directly comparable as neither method sees the entire Internet topology: BGP data misses some peering links in the core which traceroute observes; traceroute misses many more peering links than BGP in the periphery. However, traceroute and BGP data do provide complementary views of the AS topology. To remedy the problems of decreasing coverage by the Skitter traceroute infrastructure and the lack of visibility of the core by UCLA BGP data, significant improvements in fidelity could be achieved with changes to the existing measurement systems. The quality of data then collected by the traceroute infrastructure would benefit from greater AS coverage, while the BGP data would benefit from data showing intra-core connectivity it misses today. We acknowledge the challenges inherent in these improvements but emphasize that, without such changes, the study of the AS topology will forever be subject to the vagaries of imperfect and flawed data. Availability of traceroute data from a larger number of vantage points, as attempted by the Dimes project, will hopefully help remedy these issues. However, even such measurements have to be done on a very large scale, and ideally performed both from the core of the network (like Skitter), as well as the edge (like Dimes). Efforts in better assessment of the biases inherent to the measurements are also necessary. In an effort to provide a better perspective on the changing structure of the AS topology, we used a wide range of topological metrics, including the newly introduced weighted spectral distribution. 
Our analysis suggests that the core of the Internet is becoming less dominant over time, and that edges at the periphery are growing more relative to the core. The practice of content providers and content distribution networks seeking connectivity to greater numbers of ISPs at the periphery, and the rise of multi-homing, both support these observations. Further, we observe a move away from a preferential attachment, tree-like disassortative network, toward a network that is flatter, highly interconnected, and assortative. These findings also indicate the need for more detailed and timely measurements of the Internet topology, in order to build upon work such as [5], focusing on the economics of the structural changes such as institutional mergers, multi-homing and increasing peering relationships.

Acknowledgements

We thank Mickael Meulle for his help with the BGP datasets.

References

1. Alvarez-Hamelin, J.I., Dall’Asta, L., Barrat, A., Vespignani, A.: k-core decomposition of Internet graphs: hierarchies, self-similarity and measurement biases. Networks and Heterogeneous Media 3, 371 (2008)
2. Bush, R., Hiebert, J., Maennel, O., Roughan, M., Uhlig, S.: Testing the reachability of (new) address space. In: Proceedings of the 2007 SIGCOMM workshop on Internet network management, INM 2007 (2007)
3. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of Internet topology using k-shell decomposition. In: PNAS (2007)
4. Dhamdhere, A., Dovrolis, C.: Ten years in the evolution of the Internet ecosystem. In: Proceedings of ACM/Usenix Internet Measurement Conference (IMC) 2008 (2008)
5. Economides, N.: The economics of the Internet backbone. NYU, Law and Research Paper No. 04-033; and NET Institute Working Paper No. 04-23 (June 2005)
6. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the Internet topology. In: Proceedings of ACM SIGCOMM 1999, Cambridge, Massachusetts, United States, pp. 251–262 (1999)
7. Fay, D., Haddadi, H., Uhlig, S., Moore, A.W., Mortier, R., Jamakovic, A.: Weighted Spectral Distribution. IEEE/ACM Transactions on Networking (TON) (to appear)
8. Gill, P., Arlitt, M., Li, Z., Mahanti, A.: The flattening Internet topology: Natural evolution, unsightly barnacles or contrived collapse? In: Claypool, M., Uhlig, S. (eds.) PAM 2008. LNCS, vol. 4979, pp. 1–10. Springer, Heidelberg (2008)
9. Haddadi, H., Fay, D., Jamakovic, A., Maennel, O., Moore, A.W., Mortier, R., Rio, M., Uhlig, S.: Beyond node degree: evaluating AS topology models. Technical Report UCAM-CL-TR-725, University of Cambridge, Computer Laboratory (July 2008)
10. Haddadi, H., Fay, D., Jamakovic, A., Maennel, O., Moore, A.W., Mortier, R., Rio, M., Uhlig, S.: On the Importance of Local Connectivity for Internet Topology Models. In: 21st International Teletraffic Congress (ITC 21) (2009)
11. Haddadi, H., Iannaccone, G., Moore, A., Mortier, R., Rio, M.: Network topologies: Inference, modelling and generation. IEEE Communications Surveys and Tutorials 10(2) (2008)
12. Huffaker, B., Andersen, D., Aben, E., Luckie, M., claffy, K., Shannon, C.: The skitter AS links dataset (2001-2007)
13. Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, F.J.J., Karir, M.: ATLAS Internet Observatory 2009 Annual Report. NANOG47 (June 2009), http://tinyurl.com/yz7xwvv
14. Newman, M.: Assortative mixing in networks. Physical Review Letters 89(20), 871–898 (2002)
15. Oliveira, R., Pei, D., Willinger, W., Zhang, B., Zhang, L.: In search of the elusive ground truth: The Internet’s AS-level connectivity structure. In: ACM SIGMETRICS, Annapolis, USA (June 2008)
16. Oliveira, R., Zhang, B., Zhang, L.: Observing the evolution of Internet AS topology. In: Proceedings of ACM SIGCOMM 2007, Kyoto, Japan (August 2007)
17. Roughan, M., Tuke, S.J., Maennel, O.: Bigfoot, sasquatch, the yeti and other missing links: what we don’t know about the AS graph. In: IMC 2008: Proceedings of the 8th ACM SIGCOMM conference on Internet measurement, pp. 325–330. ACM, New York (2008)
18. Shyu, L., Lau, S.-Y., Huang, P.: On the search of Internet AS-level topology invariants. In: Proceedings of IEEE Global Telecommunications Conference, GLOBECOM 2006, San Francisco, CA, USA, pp. 1–5 (2006)
19. Subramanian, L., Agarwal, S., Rexford, J., Katz, R.H.: Characterizing the Internet hierarchy from multiple vantage points. In: Proceedings of IEEE Infocom 2002 (June 2002)
20. Zhou, S.: Characterising and modelling the Internet topology, the rich-club phenomenon and the PFP model. BT Technology Journal 24 (2006)

EmPath: Tool to Emulate Packet Transfer Characteristics in IP Network⋆

Jaroslaw Sliwinski, Andrzej Beben, and Piotr Krawiec

Institute of Telecommunications
Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland
{jsliwins,abeben,pkrawiec}@tele.pw.edu.pl

Abstract. The paper describes the EmPath tool, which was designed to emulate packet transfer characteristics, such as delays and losses, in an IP network. The main innovation of this tool is its ability to emulate packet stream transfer while maintaining packet integrity, packet delay and loss distribution, and correlation. In this method, we decide the fate of a new packet (delay and loss) by using conditional probability distributions that depend on the transmission characteristics of the last packet. For this purpose, we build a Markov model with transition probabilities calculated on the basis of measured packet traces. The EmPath tool was implemented as a module of the Linux kernel and its capabilities were examined in a testbed environment. In the paper, we show some results illustrating the effectiveness of the EmPath tool.

Keywords: network emulator, traffic modeling, validation, testing.

1 Introduction

Network emulators are attractive tools for supporting the design, validation and testing of protocols and applications. They aim to provide a single node (or a set of nodes) with the ability to introduce packet transfer characteristics as they would be observed in a live network. Therefore, network emulators are often regarded as a “network in a box” solution [1]. The key issue in the design of a network emulator is the method used to represent the network behavior. Among the realizations of emulators we recognize two main techniques: (1) real-time simulation, where the emulator simulates the network to obtain the appropriate treatment of packets, e.g., as proposed in [2,3], or (2) model-based emulation, where the emulator enforces the delay, loss or duplication of packets using a model of the network behavior; the parameters for the model come from measurements, simulations or analysis. We focus on model-based emulation because it is regarded as scalable even for large networks and high link speeds. One of the first widely used emulators was dummynet, whose design and features were presented in [4]. Its main objective was the evaluation of TCP performance with regard to the limited bit rate, constant propagation delay and

⋆ This work was partially funded by MNiSW grant no. 296/N-COST/2008/0.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 46–58, 2010. © Springer-Verlag Berlin Heidelberg 2010


packet losses. This concept was further investigated by several authors, e.g., in [5,6], leading to the implementation of the NIST Net [1] and NetEm [7] tools for Linux kernel versions 2.4 and 2.6, respectively. Both tools model the network impairment as a set of independent processes related to packet transfer delays, packet losses, packet reordering and duplication. Due to the assumed independence, they maintain neither the integrity of transferred packet streams nor the autocorrelation within the packet transfer delay process or within the packet loss process. Moreover, they lack any cross-correlation between the delay and loss generation processes. On the other hand, studies of packet transfer characteristics in the Internet, such as those presented in [8,9,10,11], point out significant dependencies in Internet traffic that result in strong correlation between the delays and losses experienced by transferred packets, as well as long-range dependence. This effect is especially visible for packets sent in a sequence with small inter-packet gaps. Furthermore, the autoregression analysis of Internet traffic performed in [10] suggests that the transfer delay of a given packet strongly characterizes the transfer delay of the consecutive one. The constraints of the NetEm tool, which we briefly presented above, are discussed in more detail in section 2. Following those limitations we formulate the requirements for the design of a new emulation method. In our method, named EmPath, the delays and losses experienced by transferred packets are modeled as a Markov process. Although there are other solutions that use a Markovian description for this purpose, e.g., in [12] and [13], our model correlates both the delay and loss processes into one solution. Moreover, our approach uses multiple transition matrices, where each of them is conditioned on the status of the preceding packet. Contrary to previous works, we do not try to fit the transition matrices using linear programming optimization. We derive the necessary conditional probabilities from delay traces measured in a live network by sending probing packets with small inter-packet gaps. Notice that the correlation depends on the duration of the inter-packet gap. Therefore, for each incoming packet we observe its inter-packet gap and then we calculate a number of steps over the transition matrix; the number of steps depends on the inter-packet gap. Finally, we implemented this emulation method as an open source tool for the Linux kernel with version 2.6. The paper is structured as follows: in section 2, we recall the concept of network emulation and we discuss the requirements for useful emulators. Then, we analyze the NetEm tool and we show its capabilities and limitations. After that, in section 3, we present the proposed emulation algorithm, which is based on the Markov model, and we focus on implementation issues. In the next section, we show results of exemplary experiments that show the performance of our tool. Finally, section 5 summarizes the paper and gives a brief outline of further work.
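The general idea outlined above, estimating transition probabilities from a measured trace and taking more steps over the transition matrix when the inter-packet gap is longer, can be sketched as follows. This is a simplified illustration and not the EmPath implementation: it uses a single transition matrix rather than multiple conditioned ones, and the state definition (delay bins plus one loss state), the function names and the rule mapping a gap to a step count are all assumptions.

```python
import random

LOSS = "loss"  # assumed marker for a lost probe in the trace

def to_state(delay_ms, bin_ms):
    """Map a delay sample (or None for a loss) to a discrete state."""
    return LOSS if delay_ms is None else int(delay_ms // bin_ms)

def build_transitions(trace, bin_ms=1.0):
    """Estimate P[prev][cur] from consecutive samples of a probe trace."""
    states = [to_state(d, bin_ms) for d in trace]
    counts = {}
    for prev, cur in zip(states, states[1:]):
        row = counts.setdefault(prev, {})
        row[cur] = row.get(cur, 0) + 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

def step(P, dist):
    """Advance a probability distribution over states by one transition."""
    out = {}
    for s, p in dist.items():
        # States never seen as a predecessor keep their mass (self-loop).
        for t, q in P.get(s, {s: 1.0}).items():
            out[t] = out.get(t, 0.0) + p * q
    return out

def next_state(P, prev_state, gap_ms, probe_gap_ms, rng):
    """Draw the state of a new packet given the previous one and the gap."""
    steps = max(1, round(gap_ms / probe_gap_ms))  # assumed gap-to-steps rule
    dist = {prev_state: 1.0}
    for _ in range(steps):
        dist = step(P, dist)
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states])[0]

# Hypothetical probe trace: delays in ms, None marks a lost probe.
trace = [10.2, 10.4, 11.1, None, 11.3, 10.1]
P = build_transitions(trace, bin_ms=1.0)
```

Because delay and loss share one state space in this sketch, a loss can only follow the delay states that preceded losses in the trace, which is the kind of delay-loss coupling that independent processes cannot reproduce.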

2 The Problem Statement

In this section we recall the concept of network emulation and present its applications for experimentally driven research. Then, we discuss requirements for


[Figure 1: measured packet transfer characteristics (e.g., packet delay distribution, losses, correlations) are the input data of the network emulator, through which the traffic of the tested protocol passes.]

Fig. 1. The concept of network emulation

designing a network emulator. From this point of view, we study the effectiveness of the widely used NetEm tool [7]. Our experiments aim to identify NetEm's capabilities and limitations. The obtained results motivate us to design and implement the EmPath emulation tool.

2.1 The Concept of Network Emulation

The concept of network emulation assumes that instead of performing experiments in a live network, we measure the packet transfer characteristics offered in the network and, on that basis, we synthetically reproduce the packet transfer process in a single device, called the network emulator. As illustrated in Fig. 1, the network emulator should statistically provide the same values of packet transfer delay, delay variation, packet loss and correlation between consecutive packets as in a live network. Therefore, in principle, there should be no difference whether an incoming packet is served by the network or by the emulator. The network emulator is regarded as a convenient tool for supporting experimentally driven research, prototype testing and the design of new protocols [14]. It may be treated as a complementary approach for testing real code in a “semisynthetic” environment [1]. The key advantages of network emulation compared to simulation techniques and network trials are the following. First, the network emulator allows for testing of prototype implementations (real code and physical equipment) with their software or hardware limitations. Second, the network emulator allows for repeating tests under the same network conditions, which is practically impossible in the case of live network trials. Last but not least, the network emulator simplifies tests in complex network scenarios since it reproduces end-to-end characteristics without modeling the details of network elements. In addition, the lab environment does not need to have access to a live network. On the other hand, the network emulator is not a perfect solution, because it just follows the network behavior that was observed in the past. This is usually performed by gathering packet traces that should be representative for a given experiment. Therefore, the collection of packet traces is a crucial issue. These packet traces may be obtained from a live network, from a testbed environment or even from simulation experiments. As we mentioned above, the network emulator requires a method for accurate replication of the packet transfer process. The method should statistically assure the same service of packets as if they were handled in the network; nevertheless, one must define measurable features that can be used to evaluate the emulator’s effectiveness. In this paper, we consider the following conditions:

1. The emulator should provide the same probability distribution of packet transfer delay and packet loss as was observed in the network. In this way, the network emulator provides accurate values of IP performance metrics, e.g., IP packet transfer delay (IPTD), IP packet delay variation (IPDV) and IP packet loss ratio (IPLR), defined in [15].
2. The emulator should introduce similar autocorrelation of the packet transfer delay process and the packet loss process. The autocorrelation is an important factor that shows how the emulator represents the dependencies between samples in a given realization of the process. Our analysis is focused on the correlograms (autocorrelation plots).
3. The emulator should maintain the cross-correlation between emulated processes as it was experienced by packets in the network. This feature shows how the method captures dependencies between different random processes, i.e., between the packet delay process and the packet loss process. We measure the cross-correlation by the correlation coefficient [16].
4. The emulator should maintain the packet stream integrity as it occurs in the live network. This feature is important, because reordered packets may have a deep impact on protocol performance. We measure the level of packet stream integrity by the IP packet reordered ratio (IPRR) metric defined in [15].

2.2 Case Study: NetEm Tool

In this case study we focus on evaluating the capabilities and limitations of the NetEm emulator [7] available in the Linux operating system. NetEm uses four independent processes to emulate the network behavior: (1) the packet delay process, (2) the packet loss process, (3) the packet duplication process and (4) the reordering process. The packet delay process uses a delay distribution stored in the form of an inverted cumulative distribution function. NetEm offers a few predefined distributions, e.g., uniform, normal and Pareto, but it also allows custom distributions to be provided. They can be created from packet delay traces by the maketables tool available in the iproute2 package.
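The consequence of drawing each packet's delay independently from a stored distribution can be illustrated with a small experiment. The sketch below is not NetEm code: the delay samples, the 1 ms sending interval and the simplified reordering measure (a packet counts as reordered if it arrives after a packet with a higher sequence number) are assumptions made for illustration.

```python
import random

def reordered_ratio(send_times_ms, delays_ms):
    """Fraction of packets that arrive after a later-sent packet."""
    arrivals = sorted(
        (sent + delay, seq)
        for seq, (sent, delay) in enumerate(zip(send_times_ms, delays_ms))
    )
    highest_seen = -1
    reordered = 0
    for _, seq in arrivals:
        if seq < highest_seen:
            reordered += 1
        else:
            highest_seen = seq
    return reordered / len(arrivals)

rng = random.Random(42)
send_times = [float(i) for i in range(1000)]          # one packet every 1 ms
measured = [10.0 + 0.5 * i for i in range(41)]        # hypothetical delays, 10 to 30 ms

# Independent draws: same marginal distribution, no memory between packets.
independent = [rng.choice(measured) for _ in send_times]
# A constant delay keeps the stream in order by construction.
constant = [12.0 for _ in send_times]

ratio_independent = reordered_ratio(send_times, independent)
ratio_constant = reordered_ratio(send_times, constant)
```

With a delay spread much larger than the sending interval, independent draws reorder a large share of the stream even though the delay histogram is reproduced exactly, which is the integrity problem the requirements in section 2.1 are meant to expose.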

Fig. 2. Histogram of packet transfer delay for original network (x-axis: delay [ms], 0 to 50; y-axis: probability, logarithmic scale)

Notice that traces with packet transfer delay samples are usually not publicly available. Taking this into account, we used “one point” packet traces (with volume of traffic over time) and performed simulations to obtain delay and loss characteristics. We selected one of the traffic traces available in the MAWI repository [17], i.e., a trace file captured at sample point F of the WIDE network on the 1st of November 2009 (file name 200911011400.dump). This sample point records traffic going in both directions of an inter-continental link. In the network emulation, we were interested in one direction, so we filtered the trace file to include only the packets with destination Ethernet address 00:0e:39:e3:34:00. In simulations, we used only IP packets (version 4 and 6) without the overhead of Ethernet headers; there were 6 non-IP packets and they were discarded. Finally, the filtered trace file covered around 8.2 × 10^9 bytes that were sent over 15 minutes; the mean bit rate of traffic was close to 73 Mbps.

Note that the packet trace was collected at a single point in the network. In order to obtain packet transfer delay and loss characteristics, we performed a simple simulation experiment. First, we created a topology with 2 nodes connected by a 100 Mbps link with 10 ms propagation delay. The size of the output buffer for each interface was set to 300 packets. Next, we introduced two traffic streams: (1) a background stream based on the prepared packet trace, and (2) a foreground constant bit rate stream using 100 byte packets emitted every 1 ms (bit rate equal to 800 kbps). Since the link capacity in the original packet trace was equal to 150 Mbps, we artificially created a bottleneck where queueing effects appeared. Finally, we recorded the packet delay and loss traces for the probing stream. The obtained histogram of the delay distribution is presented in Fig. 2. On the basis of these delay samples, we prepared a distribution table for the NetEm tool.
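The bottleneck experiment described above can be approximated by a very small single-link model. The sketch below is an assumption-laden illustration, not the simulator used in the paper: it models one FIFO output queue with a drop-tail buffer counted in packets (including the packet in service), and the function name and argument layout are hypothetical.

```python
from collections import deque


def simulate_fifo_link(packets, rate_bps, prop_delay_s, buffer_pkts):
    """Simulate a drop-tail FIFO link.

    packets: iterable of (arrival_time_s, size_bytes), sorted by arrival time.
    Returns one entry per packet: its delay in seconds, or None if dropped.
    """
    in_system = deque()   # departure times of packets queued or in service
    link_free_at = 0.0    # time at which the transmitter becomes idle
    results = []
    for arrival, size in packets:
        # Forget packets that have already left the link.
        while in_system and in_system[0] <= arrival:
            in_system.popleft()
        if len(in_system) >= buffer_pkts:
            results.append(None)              # buffer full: packet is lost
            continue
        start = max(arrival, link_free_at)    # wait for earlier packets
        link_free_at = start + size * 8 / rate_bps
        in_system.append(link_free_at)
        results.append(link_free_at - arrival + prop_delay_s)
    return results
```

Feeding such a model with the merged background and foreground arrivals, and keeping only the results of the foreground packets, yields delay and loss traces of the kind used to build the distribution table.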
This table was used for tests in a testbed network, which consisted of 3 nodes (PCs) connected in cascade by 1 Gbps links, similarly to the scenario presented in Fig. 1. The middle node ran NetEm (with our custom delay distribution table), while the other two ran MGEN [18] to generate traffic. All nodes were synchronized with a GPS clock, with time drift below 100 µs. The traffic emitted by the generator had the same profile as the foreground traffic used in the simulation (constant bit rate, 100 byte IP packets, 1 ms inter-packet gap). Using this setup we performed measurements in two test cases:

Case 1: NetEm with default configuration,
Case 2: NetEm with enforced packet stream integrity.

Fig. 3 shows the histograms of packet transfer delay for both test cases. Moreover, Table 1 presents the values of performance metrics measured for NetEm with reference to the original network.

Fig. 3. Histograms of packet transfer delay for NetEm tool: (a) NetEm with default configuration, (b) NetEm with integrity (x-axis: delay [ms], 0 to 50; y-axis: probability, logarithmic scale)

Table 1. Results of NetEm tests

Metric                      Original network   NetEm   NetEm with integrity
mean IPTD [ms]              11.7               10.4    14.7
stddev of IPTD [ms]         3.3                0.6     3.8
IPLR [%]                    0.215              0.213   0.213
IPRR [%]                    0                  46      0
cross-correlation coeff.    0.26               0.00    0.00

In the first test case, we observe that the delay distribution is maintained up to 23 ms, but above this value the distribution is trimmed; all the probability mass above this value is assigned to the last value. This effect comes from NetEm’s delay distribution storage method, which in the default configuration does not allow values to differ from the mean by more than 4 standard deviations. This behavior limits the usage of distributions with long tails, which are typically observed in the Internet [8]. Another limitation appeared in the form of a large amount of reordered packets, i.e., more than 46% of packets changed their order in the packet stream.

In the second test case, we changed the default behavior of NetEm to maintain the packet stream integrity (we changed the queue type from tfifo to pfifo). This modification changed the shape and parameters of the delay distribution, see Fig. 3(b) and Table 1.

In Fig. 4, we present the correlograms (autocorrelation plots) of the delay and loss processes as observed in both test cases and in the original network. Notice that, for both the first and the second test case, the autocorrelation function of these processes is not maintained. In fact, the loss process is entirely uncorrelated, while the delay process shows minor autocorrelation only for the second test case, caused by enforcing integrity. We also see that not only do the individual processes lack the original autocorrelation, but there is also no cross-correlation between the packet transfer delay and packet loss processes. In the original network the cross-correlation coefficient equals about 0.26, while for both NetEm tests it is zero (no correlation).

Fig. 4. Correlogram of packet transfer delay and packet loss: (a) packet transfer delay, (b) packet loss (x-axis: lag, 0 to 50; y-axis: autocorrelation; curves: original, NetEm, NetEm - integrity)

The lack of correlation motivated us to perform an additional test with NetEm’s built-in correlation feature. Comparing the results of this test with the first test case, we observe a lack of lost packets. This effect comes from an incorrect implementation of the NetEm “correlation” model when the inverted cumulative distribution function is used. Therefore, we ignored these results.

By performing the above validation, we concluded that the NetEm emulation model has significant shortcomings. This motivated us to create a new model that mitigates NetEm’s limitations and allows for more precise replication of network characteristics.
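The correlograms and the cross-correlation coefficient used in this case study can be computed along the following lines. This is a minimal sketch under the usual textbook definitions (biased sample autocorrelation and Pearson correlation coefficient); how lost packets are paired with delay samples for the cross-correlation is not specified here and is left to the caller.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag (a correlogram)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance_sum = sum(c * c for c in centered)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance_sum
        for lag in range(max_lag + 1)
    ]


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For the loss process, the series would be the 0/1 loss indicator per packet; for the delay process, the per-packet delay values.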

3 Emulation Method in EmPath Tool

In this section we present the EmPath tool. After describing the proposed algorithm, we focus on its implementation in the Linux kernel.

3.1 The Emulation Algorithm

EmPath uses two random processes to emulate packet transfer. The first process decides whether an incoming packet is lost or transferred, while the second process determines the delay of transferred packets. Specifically, we define these processes as:

– the packet loss process, named $L(t)$, which takes the value 1 when the incoming packet is lost and 0 otherwise,
– the packet transfer delay process, named $D(t)$, which determines the delay observed by a transferred packet.

Next, we define the discrete-time series $\{L_i\}$ and $\{D_i\}$, based on the above processes, using the moments of packet arrivals $t_i$, where $i = 1, 2, 3, \ldots$ denotes the packet number, as:

– $\{L_i\}$, where $L_i = L(t_i)$,
– $\{D_i\}$, where $D_i = D(t_i)$.

We assume that the emulator’s decision about an incoming packet depends only on the status of the previous packet and the current inter-packet gap. This assumption allows us to use a discrete-time Markov process with the following generic equation:

$(L_{n-1}, D_{n-1}, t_n - t_{n-1}) \rightarrow (L_n, D_n)$.   (1)

Although we could apply more complex models with a broader range of dependencies (beyond the $n-1$ state), our choice originates from the usual behavior of queueing systems. Moreover, implementing models with a “long” memory of the process is infeasible due to state space explosion.

Notice that the generic rule (1) uses real numbers for $D_{n-1}$, $t_{n-1}$ and $t_n$, which for an efficient implementation would require us to know the exact distribution functions. In order to circumvent this limitation, our model uses a simplified representation with quantized values. First, we assume that quantized delay values can be grouped into a number of predefined states (with the relation $f(delay) \rightarrow state$). Furthermore, we introduce a special state $s_{loss}$ that is used to emulate the delay of the packet transferred after any lost packet. Next, we treat the packet inter-arrival period with a finite resolution of time $\Delta = t_n - t_{n-1}$, where all packets arriving within one time unit $\Delta$ observe the same result (loss or delay). Finally, for each state $s$ we need to know:

– the probability of packet loss $l_s$,
– the conditional probability distribution of packet delay $d_s$ under the condition that the current packet is not lost (the support set of this distribution is also quantized).

Using the above variables, the emulation algorithm for each time unit $\Delta$ can be summarized as:

1. Check packet loss against $l_s$.
2. If the packet is lost, set the current state to $s \leftarrow s_{loss}$, return the result {loss} and stop the algorithm.
3. Generate a new delay from the distribution $d_s$.
4. Update the current state according to the relation $s \leftarrow f(delay)$.
5. Return the result {delay} and stop the algorithm.

Fig. 5. Exemplary operation of the emulation algorithm (time axis: $n\Delta$ to $(n+4)\Delta$; state axis: $s_0$, $s_1$, $s_2$, $s_{loss}$; returned results: $\{delay_n\}$, $\{loss_{n+1}\}$, $\{delay_{n+2}\}$, $\{delay_{n+3}\}$)

In order to better understand the proposed emulation method, let us consider the example presented in Fig. 5, which shows a few transition steps. Initially (time moment $n\Delta$), the algorithm is in state $s_0$. As a new packet arrives, the decision is made that it will not be lost and that it will observe a delay equal to $\{delay_n\}$. Furthermore, this value of delay is related to a new state $s_1$ of the algorithm. Consequently, at time moment $(n+1)\Delta$ the algorithm uses another loss probability and delay distribution table. This time, the decision is that the packet should be lost, so the algorithm switches to state $s_{loss}$ and returns the result $\{loss_{n+1}\}$. Following this scheme, in the next steps the algorithm switches to states $s_0$ and $s_2$ with their respective loss and delay distributions, and returns the exemplary values $\{delay_{n+2}\}$ and $\{delay_{n+3}\}$.

The emulation algorithm must be performed for each time unit $\Delta$, so the number of iterations is proportional to the duration of the packet inter-arrival time, e.g., for a packet inter-arrival time equal to $k \cdot \Delta$ it requires $k$ iterations. This behavior may hamper the emulator performance, especially when there are long idle periods between arriving packets. In order to improve the emulator performance, we can calculate a new table with analogous distributions for time unit $2\Delta$ based on the distributions $l_s$ and $d_s$ for time unit $\Delta$. This is similar to a Markov chain when we want to obtain the transition matrix with “two steps” from the transition matrix with “one step”, i.e., we calculate the square of the transition matrix. This method can be applied recursively for any time unit of the form $2^n \Delta$. Consequently, using multiple tables for different time units, we reduce the complexity to the logarithmic level, e.g., for an inter-arrival time of $78\Delta$ it is sufficient to use 4 iterations, as $78\Delta = (64 + 8 + 4 + 2)\Delta$.

3.2 Implementation
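Before turning to the kernel module, it may help to see the five steps of the emulation algorithm from Section 3.1 as a short user-space sketch. This is not the EmPath source code: the class name, the dictionary-based tables and the use of floating point numbers are assumptions made for readability.

```python
import random

LOSS = "loss"  # label used here for the special state s_loss


class EmulationModel:
    """One-step emulation: per-state loss probability and delay distribution."""

    def __init__(self, loss_prob, delay_dist, delay_to_state, initial_state):
        self.loss_prob = loss_prob            # state -> l_s
        self.delay_dist = delay_dist          # state -> list of (delay, probability)
        self.delay_to_state = delay_to_state  # the relation f(delay) -> state
        self.state = initial_state

    def step(self, rng=random):
        """Run the algorithm for one time unit; return LOSS or a delay value."""
        s = self.state
        if rng.random() < self.loss_prob[s]:         # step 1: check loss against l_s
            self.state = LOSS                        # step 2: enter s_loss, report loss
            return LOSS
        delays, weights = zip(*self.delay_dist[s])
        delay = rng.choices(delays, weights=weights)[0]  # step 3: draw from d_s
        self.state = self.delay_to_state(delay)      # step 4: s <- f(delay)
        return delay                                 # step 5: report the delay
```

Calling `step()` once per elapsed time unit reproduces the walk through the states illustrated in Fig. 5.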

The core part of EmPath is implemented as a kernel module of the Linux operating system. Similarly to the NetEm tool, our tool is implemented as a packet scheduler, which can be deployed on a network interface (the kernel module is named sch_empath). The emulation algorithm is applied to each packet arriving at the interface, which is either dropped or delayed in a first-in first-out queue. While the decision about packet loss can be implemented using a single condition, the representation of the delay distribution requires more attention. As our emulation model assumes multiple distribution tables that are used depending on the current state, we decided to use a less memory-consuming tree-like structure. Moreover, the EmPath kernel module allows for recursive calculation of “two step” distribution tables from any $\Delta$ time unit into a $2\Delta$ one.

The second part of our tool is an extension of the tc tool in the iproute2 package. The tc tool allows for deployment and configuration of various packet schedulers from the user space of the Linux operating system. Our extension reads specially formatted files containing packet loss probabilities and packet transfer delay distributions.

Notice that the Linux kernel does not support calculations with floating point numbers. Consequently, both the EmPath extension of the tc tool and the EmPath kernel module use fixed point representations with unsigned 32 bit integer numbers. This choice was also motivated by the fact that the default kernel random number generator that we use, the function net_random(), provides unsigned 32 bit integer values. We released the EmPath tool as open source software available at http://code.google.com/p/empath/.
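The “two step” table calculation and the logarithmic number of iterations can be sketched as follows. The sketch assumes that the model is written as a single row-stochastic transition matrix with one state per quantized outcome (the loss state included), and it uses floating point numbers rather than the 32 bit fixed point representation of the kernel module; all function names are hypothetical.

```python
def two_step_table(table):
    """Square a row-stochastic matrix: table for Delta -> table for 2*Delta."""
    n = len(table)
    return [
        [sum(table[i][k] * table[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def build_tables(one_step, levels):
    """Return tables where tables[m] covers 2**m time units."""
    tables = [one_step]
    for _ in range(levels - 1):
        tables.append(two_step_table(tables[-1]))
    return tables


def table_indices(gap_units):
    """Indices m of the 2**m-unit tables to apply for a gap of `gap_units`."""
    return [m for m in reversed(range(gap_units.bit_length())) if (gap_units >> m) & 1]
```

For a gap of 78 time units, `table_indices(78)` selects the tables for 64, 8, 4 and 2 units, so four table lookups replace 78 single-step iterations.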

4 Evaluation of EmPath Tool

To start the evaluation of the EmPath tool, we created tables with conditional probability distributions of packet transfer delay. We used the same packet delay traces for the original network that were used in the NetEm case study (see Section 2.2). The traffic stream used for sampling the delay and loss processes emitted packets at 1 ms intervals, so the time unit was set to $\Delta$ = 1 ms. Furthermore, the resolution of the delay distributions was set to 1 ms. Taking into account that the range of the packet transfer delay was equal to 30 ms, we decided to use 31 states (including one “loss state”). Therefore, we prepared 31 delay distributions and packet loss values, one for each state.
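One way such per-state tables could be derived from a delay and loss trace is sketched below. The binning rule (flooring the delay to the resolution) and the trace format (one outcome per time unit, `None` for a lost packet) are assumptions of this illustration; the paper does not specify the exact procedure.

```python
from collections import Counter, defaultdict

LOSS = "loss"  # label used here for the special loss state


def estimate_tables(trace, resolution_ms=1.0):
    """Estimate l_s and d_s from a trace of per-packet outcomes.

    trace: list with one entry per time unit: delay in ms, or None for a loss.
    Returns (loss_prob, delay_dist) keyed by the state of the previous packet.
    """
    seen = Counter()                     # packets observed after each state
    lost = Counter()                     # of those, how many were lost
    delay_counts = defaultdict(Counter)  # state -> histogram of next delay bin
    state = None
    for outcome in trace:
        nxt = LOSS if outcome is None else int(outcome // resolution_ms)
        if state is not None:
            seen[state] += 1
            if nxt == LOSS:
                lost[state] += 1
            else:
                delay_counts[state][nxt] += 1
        state = nxt
    loss_prob = {s: lost[s] / seen[s] for s in seen}
    delay_dist = {
        s: {b: c / sum(hist.values()) for b, c in hist.items()}
        for s, hist in delay_counts.items()
    }
    return loss_prob, delay_dist
```

States that are rarely visited in the trace receive noisy estimates, which is why long traces are needed to populate all tables.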

Fig. 6. Histograms of packet transfer delay in EmPath experiment: (a) original network, (b) EmPath emulation (x-axis: delay [ms], 0 to 50; y-axis: probability, logarithmic scale)

The experiment was performed in the same network topology as used for the NetEm case study (3 nodes connected in cascade, where the middle node provided the emulation capabilities). The profile of the measurement traffic remained the same. Fig. 6 presents histograms of delay for the original network and for values measured with the EmPath tool (more than 10 million packets were emitted). The shape of the delay distribution, as well as the performance metrics (shown in Table 2), are very similar to the original characteristics. The recovered distribution is shifted to the right due to the assumed 1 ms delay resolution. Moreover, Fig. 7(a) shows that the autocorrelation function of the delay process is similar to that of the original network, which suggests that the proposed model correctly captures this characteristic. For the loss process, the mean value (IPLR) is similar, but the autocorrelation functions (shown in Fig. 7(b)) differ for lower lags. This indicates that the assumed emulation method of packet losses is insufficient on short time scales (the geometric distribution is too simple). One may extend the model by adding more states related to losses, e.g., by adding a condition for rare loss events upon other rare loss events. However, this would require gathering many more samples from the original network when preparing the distribution tables. On the other hand, the cross-correlation coefficient between the delay and loss processes is similar for the original network (0.26) and for the EmPath emulation (0.24). Furthermore, the EmPath tool guaranteed no packet reordering.

Table 2. Selected metrics measured in the EmPath experiment

Metric                      Original network   EmPath emulation
mean IPTD [ms]              11.7               12.1
stddev of IPTD [ms]         3.3                3.4
IPLR [%]                    0.215              0.231
IPRR [%]                    0                  0
cross-correlation coeff.    0.26               0.24
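The remark that the geometric distribution is too simple can be made explicit with a short derivation. It assumes packets arriving once per time unit $\Delta$ and the single loss state $s_{loss}$ of Section 3.1; the symbol $B$ for the length of a run of consecutive losses is introduced here only for illustration.

```latex
% Once a packet is lost, the model is in state s_loss, so every following
% packet is lost independently with the same probability l_{s_loss}.
P(B = k) = l_{s_{loss}}^{\,k-1}\,\bigl(1 - l_{s_{loss}}\bigr), \qquad k = 1, 2, \ldots
\qquad\Longrightarrow\qquad
E[B] = \frac{1}{1 - l_{s_{loss}}}
```

A single parameter therefore fixes both the mean burst length and the whole shape of the burst length distribution, which limits how well loss correlation at small lags can be matched.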

Fig. 7. Correlogram of packet transfer delay and packet loss: (a) packet transfer delay, (b) packet loss (x-axis: lag, 0 to 50; y-axis: autocorrelation; curves: original, EmPath)

Notice that, while the implementations of the NetEm and EmPath tools are similar in nature, EmPath provides more realistic characteristics and features of the emulated network.

5 Summary

In this paper we investigated the problem of network emulation. Our objective was to emulate the packet transfer characteristics offered by a network path. We performed experiments with the widely used NetEm tool, which is available in the kernel of the Linux operating system. The obtained results showed several limitations of the NetEm tool, related to the limited range of the packet transfer delay distribution, the lack of correlation between packet transfer delays and packet losses, and the lack of packet stream integrity. These limitations motivated us to design a more effective emulation method and to implement our own tool called EmPath. In our method, the delays and losses experienced by transferred packets are modeled as a Markov process that uses multiple transition matrices, each of them conditioned on the status of the preceding packet. Our method preserves the distribution of packet transfer delay without packet reordering and introduces correlation of the delay and loss processes (autocorrelation and cross-correlation) in a similar way as observed in a live network. The parameters required by EmPath, such as the conditional probability distributions of packet delays and losses, are directly derived from the delay traces measured in a live network. We implemented the EmPath tool in the Linux kernel and released it under the GNU General Public License. Then, we performed experiments to verify the EmPath capabilities and possible limitations. The obtained results, as illustrated in the included examples, confirmed the effectiveness of the proposed method and the EmPath tool. In further work, we plan to focus on extending the model of the loss process to include burst losses, which were identified as a slight limitation of the EmPath tool. Moreover, we plan more experiments with delay and loss traces collected in different network environments.


References

1. Carson, M., Santay, D.: NIST Net: a Linux-based network emulation tool. SIGCOMM Comput. Commun. Rev. 33(3), 111–126 (2003)
2. Fall, K.: Network emulation in the Vint/NS simulator. In: Proceedings of IEEE International Symposium on Computers and Communications 1999, pp. 244–250 (1999)
3. Vahdat, A., Yocum, K., Walsh, K., Mahadevan, P., Kostic, D., Chase, J., Becker, D.: Scalability and accuracy in a large-scale network emulator. Operating Systems Review 36, 271–284 (2002)
4. Rizzo, L.: Dummynet: a simple approach to the evaluation of network protocols. SIGCOMM Comput. Commun. Rev. 27(1), 31–41 (1997)
5. Yeom, I., Reddy, A.N.: ENDE: An end-to-end network delay emulator tool for multimedia protocol development. Multimedia Tools and Applications 14(3), 269–296 (2001)
6. Avvenuti, M., Vecchio, A.: Application-level network emulation: the EmuSocket toolkit. Journal of Network and Computer Applications 29(4), 343–360 (2006)
7. Hemminger, S.: Network Emulation with NetEm. In: Linux Conf. Au. (April 2005)
8. Papagiannaki, K., Moon, S., Fraleigh, C., Thiran, P., Diot, C.: Measurement and analysis of single-hop delay on an IP backbone network. IEEE Journal on Selected Areas in Communications 21(6), 908–921 (2003)
9. Piratla, N., Jayasumana, A., Smith, H.: Overcoming the effects of correlation in packet delay measurements using inter-packet gaps. In: Proceedings of 12th IEEE International Conference on Networks 2004 (ICON 2004), vol. 1, pp. 233–238 (November 2004)
10. Vivanco, D.A., Jayasumana, A.P.: A measurement-based modeling approach for network-induced packet delay. In: LCN 2007: Proceedings of the 32nd IEEE Conference on Local Computer Networks, pp. 175–182. IEEE Computer Society, Washington (2007)
11. Borgnat, P., Dewaele, G., Fukuda, K., Abry, P., Cho, K.: Seven years and one day: Sketching the evolution of Internet traffic. In: INFOCOM 2009, pp. 711–719. IEEE, Los Alamitos (2009)
12. Zinner, T., Tutschku, K., Nakao, A., Tran-Gia, P.: Performance evaluation of packet re-ordering on concurrent multipath transmissions for transport virtualization. In: 20th ITC Specialist Seminar, Hoi An, Viet Nam (May 2009)
13. Nebat, Y., Sidi, M.: Resequencing considerations in parallel downloads. In: Proceedings of INFOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3, pp. 1326–1335. IEEE, Los Alamitos (2002)
14. Owezarski, P., Berthou, P., Labit, Y., Gauchard, D.: LaasNetExp: a generic polymorphic platform for network emulation and experiments. In: TRIDENTCOM 2008 (2008)
15. ITU-T Recommendation Y.1540: IP packet transfer and availability performance parameters (November 2007)
16. Krickeberg, K.: Probability Theory. Addison-Wesley, Reading (1965)
17. Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the WIDE project. In: ATEC 2000: Proceedings of the annual conference on USENIX Annual Technical Conference, Berkeley, CA, USA, June 2000, pp. 51–51. USENIX Association (2000)
18. Naval Research Laboratory: Multi-Generator Toolset (MGEN) 4.0 (November 2009), http://cs.itd.nrl.navy.mil/work/mgen/

A Database of Anomalous Traffic for Assessing Profile Based IDS

Philippe Owezarski⋆

CNRS; LAAS; 7 Avenue du colonel Roche, F-31077 Toulouse, France
Université de Toulouse; UPS, INSA, INP, ISAE; LAAS; F-31077 Toulouse, France
[email protected]

Abstract. This paper proposes a methodology for evaluating the capabilities of current IDS to detect attacks targeting networks and their services. This methodology tries to be as realistic as possible and reproducible, i.e., it works with real attacks and real traffic in controlled environments. It especially relies on a database containing attack traces specifically created for that evaluation purpose. By confronting IDS with these attack traces, it is possible to obtain a statistical evaluation of IDS, and to rank them according to their capability of detecting attacks without raising false alarms. For illustration purposes, this paper shows the results obtained with 3 public IDS. It also shows how the attack traces database impacts the results obtained for the same IDS.

Keywords: Statistical evaluation of IDS, attack traces, ROC curves, KDD’99.

⋆ The author wants to thank all members of the MetroSec project. Thanks in particular to Patrice Abry, Julien Aussibal, Pierre Borgnat, Gustavo Comerlatto, Guillaume Dewaele, Silvia Farraposo, Laurent Gallon, Yann Labit, Nicolas Larrieu and Antoine Scherrer.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 59–72, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Motivation

1.1 Problematics

The Internet is becoming the universal communication network, conveying all kinds of information, ranging from the simple transfer of binary computer data to the real time transmission of voice, video, or interactive information. Simultaneously, the Internet is evolving from a single best effort service to a multiservice network, a major consequence being that it becomes highly exposed to attacks, especially to denial of service (DoS) and distributed DoS (DDoS) attacks. DoS attacks are responsible for large changes in traffic characteristics which may in turn significantly reduce the quality of service (QoS) level perceived by all users of the network. Detecting and reacting against DoS attacks is a difficult task, and current intrusion detection systems (IDS), especially those based on anomaly detection from profile, often fail to detect DDoS attacks efficiently. This can be explained via different lines of argument. First, DDoS attacks can take a large variety of forms, so that proposing a common definition is in itself a complex issue. Second, it is commonly observed that Internet traffic under normal conditions naturally presents large fluctuations and variations in its throughput at all scales [PKC96], often described in terms of long memory [ENW96], self-similarity [PW00] and multifractality [FGW98]. Such properties significantly impair anomaly detection procedures by decreasing their statistical performance. Third, Internet traffic may exhibit strong, possibly sudden, yet legitimate variations (flash crowds - FC) that may be hard to distinguish from illegitimate ones. That is why IDS relying on anomaly detection by way of a statistical profile often yield a significant number of false positives, and are not very popular.

These tools also lack efficiency when the increase of traffic due to the attack is small. This situation is frequent and extremely important because of the distributed nature of current denial of service attacks (DDoS). These attacks are launched from a large number of corrupted machines (called zombies) under the control of a hacker. Each machine generates a tiny amount of attacking traffic in order to hide it in the large amount of cross Internet traffic. On the other hand, as soon as these multiple sources of attacking traffic aggregate on links or routers on their way to their target, they represent a massive amount of traffic which significantly decreases the performance level of their victim, and of the network it is connected to. Detecting the anomaly is easy close to the victim, but detecting it so late is useless: targeted resources have been wasted and the QoS degraded; the attack is therefore successful. It is then essential for IDS to detect attacks close to their sources, when the anomaly just results from the aggregation of the attacking traffic of a few zombies hidden in the massive amount of legitimate traffic.

1.2 The KDD’99 Traditional Evaluation Method

Several approaches exist for statistically evaluating IDS. However, they all have in common the need for a documented attack database. Such a database generally represents the ground truth [RRR08]. Up to now, the most used database is KDD’99. The principle of a statistical evaluation is to have the IDS under evaluation analyze the documented attack traces, and to count the number of true positives, true negatives, false positives, and false negatives. The relations between these values can then be exhibited by means of a ROC (Receiver Operating Characteristic) curve. The ROC technique was used for the first time in 1998 for evaluating IDS in the framework of the DARPA project on the off-line analysis of intrusion detection systems, at the MIT Lincoln Laboratory [MIT]. At that time, it was the first intelligible evaluation test applied to multiple IDS and using realistic configurations. A small network was set up for this purpose, the aim being to emulate a US Air Force base connected to the Internet. The background traffic was generated by injecting attacks at well-defined points of the network, and collected with TCPDUMP. Traffic was grabbed and recorded for 7 weeks, and served for IDS calibration. Once calibration was performed, 2 weeks of traces containing attacks were used to evaluate the performance of the IDS under study. Several papers describe this experience, such as Durst et al. [DCW+ 99], Lippmann et al. [LFG+ 00] and Lee et al. [LSM99].

This DARPA work in 1998 used a wide range of intrusion attempts, tried to simulate a realistic normal activity, and produced results that could be shared between all researchers interested in the topic. After an evaluation period, researchers involved in the DARPA project, as well as many others of the same research community, provided a full set of evaluation results with the attack database, which led to some changes in the database, known nowadays under the name KDD’99. These changes mainly dealt with the use of more furtive attacks, the use of target machines running Windows NT, the definition of a security policy for the attacked network, and tests with more recent attacks.

Whereas KDD’99 aimed at serving the needs of the research community on IDS, some important questions about its usability were raised. McHugh [McH01] published a strong criticism of the procedures used when creating the KDD’99 database, especially of the lack of verification of the realism of the network compared to an actual one. It was followed in 2003 by Mahoney and Chan [MC03], who decided to review the database in detail. Mahoney and Chan showed that the traces were far from simulating realistic conditions, and therefore that even a very simple IDS can exhibit very high performance results, which it could never reach in a real environment. For example, they discovered that the trace database includes irregularities such as differences in TTL between attacking and legitimate packets.

Unfortunately, despite all the disclaimers about this KDD’99 database for IDS evaluation, it is still massively used. This is mainly due to the lack of other choices and of efforts for providing new attack databases. These limitations of the KDD’99 database motivated us to issue a new one, as we needed to compare the performance of existing profile based IDS with that of new anomaly detection tools we were designing in the framework of the MetroSec project [MET] (a project from the French ACI program on Security & Computer science, 2004-2007).

1.3 Contribution

Despite these limitations of the DARPA project contributions for IDS evaluation, the ROC technique it introduced remains a very simple and efficient solution, massively used since. It consists in combining detection results with the number of testing sessions to issue two values which summarize IDS performance: the detection ratio (number of detected intrusions divided by the number of intrusion attempts) and the false alarm rate (number of false alarms divided by the total number of network sessions). These summaries of detection results then represent one point on the ROC curve for a given IDS. The ROC space is defined by the false alarm and true positive rates on the X and Y axes respectively, which in fact represents the balance between efficiency and cost. The best possible IDS would then theoretically be represented by a single point curve of coordinates (0, 1) in the ROC space. Such a point means that all attacks were detected and no false alarm was raised. A random detection process would be represented in the ROC space by a straight line going from the bottom left corner (0, 0) to the upper right corner (1, 1) (the line with equation y = x). Points over this line mean that the detection performance is better than that of a random process. Under the line, it is worse, and then of no real meaning.

Given the efficiency and simplicity of the ROC method, we propose to use it as a strong basis for our new profile based IDS evaluation methodology; in this respect it resembles the KDD’99 one. In contrast, the KDD’99 trace database appears to us as completely unsuited to our needs for evaluating the performance of IDS and anomaly detection systems (ADS). This is what our contribution is about. Let us recall here that in the framework of the MetroSec project, we were targeting attacks which can have an impact on the quality of service and performance of networks, i.e., on the quality of packet forwarding. It is not easy to find traffic traces containing this kind of anomaly. It would require access to many traffic capture probes close to zombies, and launching traffic captures when attacks arise. Of course, hackers do not advertise when they launch attacks, and their zombies are unknown. We therefore do not have at our disposal documented traffic traces containing these kinds of anomalies, i.e., traces in which no obvious anomaly appears, but for which we would know that between two dates an attack of that kind, with a precise intensity (also called magnitude), was perpetrated. The lack of such traces is one of the big issues for researchers in anomaly detection.¹ In addition, it is not enough to validate detection methods and tools on one or two traces; this would, for instance, prevent quantifying detection mistakes.

1.4 Paper Structure

This paper presents a new method for evaluating and comparing the performance of profile based IDS and ADS, which improves the older KDD’99 method and adapts it to new requirements. The main contribution is the creation of a new database of documented anomalies, some of which are attacks. In addition to anomalies, these traces contain realistic background traffic with the variability, self-similarity, dependence, and correlation characteristics of real traffic, in which massively distributed attacks can easily hide. The creation process of this trace database, as well as the main components used, are described in section 2. Then, section 3 shows, for a given ADS called NADA [FOM07], the differences in evaluation results depending on the anomaly/attack trace database used. Section 4 then shows, using our new evaluation method based on our new trace database (called the MetroSec database), the statistical evaluation results obtained for 3 publicly available IDS or ADS. Last, section 5 concludes this paper with a discussion of the strengths and weaknesses of our new evaluation method relying on our new MetroSec anomaly database. It also gives information on the database availability.

¹ This lack is all the more significant as, for economic and strategic reasons of carriers, or for users’ privacy, such data relating to anomalies or attacks are not made public.


2 Generation of Traffic Traces with or without Anomalies, by Means of Reproducible Experiments

2.1 The MetroSec Experimental Platform

One of the contributions of MetroSec was to produce controlled and documented traffic traces, with or without anomalies, for testing and validating intrusion detection methods. For this purpose, we led measurement and experimentation campaigns on a reliable operational network (for which we are sure that it does not contain any anomalies, or at least very few), and generated ourselves attacks and other kinds of anomalies which mix and interact with the regular background traffic. It is then possible to define the kinds of attacks we want to launch, to control them (sources, targets, intensities², etc.), and to associate with the related captured traces a ticket indicating the precise characteristics of the perpetrated attacks. In such a context, anomalies are reproducible (we can regenerate the same experimental conditions as often as wanted). This reproducibility makes it possible to multiply such scenarios in order to improve the validation statistics of detection tools, or the accuracy of the comparison of our methods with others. The trace database produced in MetroSec is one of the significant achievements of the project, and guarantees the reliability of IDS evaluation.

The experimental platform which was used for creating the MetroSec trace database uses the Renater network, the French network for education and research. Renater is an operational network which is used by a significantly large community in its professional activity. Because of its design, Renater has the necessary characteristics for our experiments:

– it is largely over-provisioned with respect to the amount of traffic it transports. Its OC-48 links provide 2.4 Gbit/s of throughput, whereas a laboratory such as LAAS, which has at its disposal a link whose capacity is 100 Mbit/s, generates on average less than 10 Mbit/s of traffic [OBLG08]. As a consequence, Renater provides a service with a constant quality. Thus, even if we want to saturate the LAAS access link, the impact of Renater on this traffic and on the provided QoS would be transparent. Experimental conditions on Renater would be the same at all times, and therefore our experiments are reproducible;

– Renater integrates two levels of security to avoid attacks coming from outside, but also from inside the network. Practically speaking, we never observed any attack at the measurement and monitoring points we installed in Renater.

The laboratories involved in the generation of traces are ENS in Lyon, LIP6 in Paris, IUT of Mont-de-Marsan, ESSI in Nice, and LAAS in Toulouse. Traffic is captured at these different locations by workstations equipped with DAG cards [CDG+00] and GPS for a very accurate temporal synchronization. In addition, if we want to perform massive attacks, the target is the laasnetexp network at LAAS [OBLG08], which is a network fully dedicated to risky experiments. We can completely saturate it in order to analyze extreme attacking situations.

² Intensities (or magnitudes) of attacks are defined in this paper as the byte or packet rate of the attacking source(s).

2.2 Anomalies Generation

Anomalies studied and generated in the framework of the MetroSec project consist of more or less significant increases of traffic in terms of volume. We can distinguish two kinds of anomalies:

– anomalies due to legitimate traffic. Let us for instance mention flash crowds (FC) in this class. It is important to note here that such experiments can hardly be fully controlled;

– anomalies due to illegitimate traffic, such as flooding attacks. This traffic, over which we have full control, is generated using several classical attack tools.

Details about anomaly generation are given in what follows.

• Flash Crowd (FC). For analyzing the impact on traffic characteristics of a flooding event due to legitimate traffic variations, we triggered flash crowds on a web server. For the sake of realism, i.e. to obtain humanly random behavior, we chose not to generate them using an automatic program, but to ask our academic colleagues to browse the LAAS web server (http://www.laas.fr).

• DDoS attack. Attacks generated for validating our anomaly detection methods consist of DDoS attacks, launched by using flooding attack tools (IPERF, HPING2, TRINOO and TFN2K). We selected well known attack tools in order to generate malicious traffic that is as realistic as possible.

The IPERF tool [IPE] (under a standard Linux environment) generates UDP flows at variable rates, with variable packet rates and payloads. The HPING2 tool [HPI] generates UDP, ICMP and TCP flows with variable rates (same throughput control parameters as IPERF). Note that with this tool it is also possible to set TCP flags, and thus to generate specific signatures in TCP flows. These two tools were installed on each site of our national distributed platform. Unlike with TRINOO and TFN2K (cf. next paragraph), it is not possible to centralize the control of all IPERF and HPING2 entities running at the same time, and thus to synchronize the attacking sources. One engineer on each site is therefore in charge of launching attacks at a given predefined time, which induces at the target level a progressive increase of the global attack load.

TRINOO [Tri] and TFN2K [TFN] are two well known distributed attack tools. They allow the installation on different machines of a program called a zombie (or daemon, or bot). This program is in charge of generating the attack towards the target. It is remotely controlled by a master program which commands all the bots. It is possible to build an attacking army (or botnet) commanded by one or several masters. TFN2K bots can launch several kinds of attacks. In addition to classical flooding attacks using the UDP, ICMP and TCP protocols (sending a large number of UDP, ICMP or TCP packets to the victim), many other attacks are possible. The mixed flooding attack is a mix of UDP flooding, ICMP flooding and TCP SYN flooding. Smurf is an attack technique based on the concept of amplification: bots use the broadcast address to artificially multiply the number of attacking packets sent to the target, and thus multiply the power of the attack. TRINOO bots, for their part, can only perform UDP flooding.

Table 1. Description of attacks in the trace database
Columns: Tool, Attack type, Trace duration, Attack duration, Intensity

Tool, attack type and trace duration, by campaign:

Campaign of November-December 2004
  HPING    TCP flooding       1h23mn, 1h23mn, 3h3mn, 3h3mn
           UDP flooding       16h20mn
Campaign of June 2005
  IPERF    UDP flooding       1h30, 1h30, 1h30, 1h30, 1h30, 1h30, 1h30
Campaign of March 2006
  TRINOO   UDP flooding       2h, 1h
Campaigns from April to July 2006
  TFN2K    UDP flooding       2h, 1h
           ICMP flooding      1h30, 1h
           TCP SYN flooding   2h, 1h
           Mixed flooding     1h
           Smurf              1h

Attack duration and intensity values, in the order given:
  3h3mn 15mn 13mn 3mn 30.77% 27.76% 90.26% 30mn 7mn 8mn 5mn 70.78% 45.62% 91.63% 5mn 99.46% 1h30 30mn 30mn 30mn 1h30 41mn 30mn 30mn 1h30 30mn 30mn 30mn 30mn 1h
  17.06% 14.83% 21.51% 33.29% 39.26% 34.94% 40.39% 36.93% 56.40% 58.02%
  10mn 10mn 10mn 7.0%
  22.9% 86.8%
  30mn 11mn 10mn 10mn 92% 4.0% 20mn 10mn 13% 9.8% 10mn 10mn 12% 33% 10mn 27.3% 10mn 3.82%
  7.0%

Attacks launched with the different attack tools (IPERF, HPING2, TRINOO and TFN2K) were performed by frequently changing the attack characteristics and parameters (duration, DoS flow intensity, size and rate of packets) in order to create different profiles for the attacks to be detected afterwards. The main characteristics of the generated attacks are summarized in table 1. For each configuration, we captured the traffic before, during and after the attack, so that the DoS period is surrounded by two normal traffic periods. It is important to recall here that most of the time we tried to generate very low intensity attacks, so that they do not have a significant impact on the global traffic (and therefore do not cause a change in the average traffic). This emulates the case of a router receiving packets from a small number of zombies, and thus represents the most interesting case of our problem, i.e. detecting DDoS attacks close to their low intensity sources. Our trace database currently contains around thirty captures of such experiments.
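As an illustration of how the ticket attached to each trace could serve as ground truth, the following Python sketch labels fixed-size analysis windows of a trace as attack or normal. The paper does not specify the ticket format: the `AttackTicket` fields, the `label_windows` helper and the example values are all hypothetical, chosen only to mirror the characteristics listed above (tool, attack type, duration, intensity).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackTicket:
    """Ground truth for one attack in a captured trace (hypothetical layout)."""
    tool: str             # attack tool, e.g. "TFN2K"
    attack_type: str      # e.g. "UDP flooding"
    start_s: float        # attack start, in seconds from the start of the trace
    duration_s: float     # attack duration, in seconds
    intensity_pct: float  # attack intensity as recorded in the ticket


def label_windows(trace_duration_s, window_s, tickets):
    """Return one boolean per analysis window: True if an attack overlaps it."""
    n_windows = math.ceil(trace_duration_s / window_s)
    labels = []
    for i in range(n_windows):
        w_start = i * window_s
        w_end = min(w_start + window_s, trace_duration_s)
        labels.append(any(
            t.start_s < w_end and t.start_s + t.duration_s > w_start
            for t in tickets
        ))
    return labels


# Illustrative values only (not taken from Table 1): a one-hour trace with a
# ten-minute attack starting 25 minutes into the capture.
ticket = AttackTicket("TFN2K", "UDP flooding", start_s=1500, duration_s=600,
                      intensity_pct=10.0)
labels = label_windows(3600, 600, [ticket])
print(labels)  # [False, False, True, True, False, False]
```

Because the normal periods before and after the attack are labelled too, the same labels can be used to count false alarms as well as missed detections.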

3 Comparative Evaluation with Different Anomalies Databases

NADA (Network Anomaly Detection Algorithm) is an anomaly detection tool relying on the use of deltoids for detecting significantly anomalous variations in traffic characteristics. This tool also includes a mechanism for classifying anomalies, which aims at determining whether detected anomalies are legitimate, as well as their characteristics. NADA was developed in the framework of the MetroSec project. Thanks to the full control we have over its code, it was easier for us to run experiments with such a tool. Results were also easier to analyze with full knowledge of the detection tool. For more details, interested readers can refer to [FOM07]. In any case, as the evaluation methodology we are proposing is of the "black box" kind, it is not necessary to know how a tool is designed and developed in order to evaluate it. Let us just say that NADA uses a threshold k which determines whether a deltoid corresponds to an anomalous variation. Setting the k threshold configures the sensitivity of the detection tool, in particular with respect to the natural variability of normal traffic.

The rest of this section compares the performance evaluation results obtained for NADA with our method depending on the anomaly database used, MetroSec or KDD’99.

3.1 Evaluation with the MetroSec Anomalies Database

The statistical evaluation of NADA was performed by using the traces with documented anomalies presented in section 2.2. This means a total of 42 different traces. Each of the attack traces contains at least one DDoS attack; some contain up to four attacks of small intensity. Six traffic traces with flash crowds were also used for the NADA evaluation.

In addition, the documented traces of the MetroSec database can be grouped according to the attack tools used for generating the attacks/anomalies, and within each group differentiated according to attack intensities, durations, etc. Such a differentiation is important as it makes it possible to measure the sensitivity of the tool under evaluation, i.e. its capability of detecting small intensity anomalies (which of course cannot be done using the KDD’99 database, which is only able to provide binary results).

The intensity and duration of anomalies are two characteristics which have a significant impact on the capability of profile based IDS/ADS to detect them. Whereas most detection tools detect strong intensity anomalies well, this is in general not the case when small intensity attacks are considered. Therefore, a suitable method for evaluating the performance of anomaly detection tools must be able to count how many times a tool succeeds or fails in detecting the anomalies contained in the traces, some of which are of low intensity.

Figure 1.a shows the ROC curve obtained by evaluating NADA with the MetroSec anomaly database. It shows the detection probability (P_D) as a function of the probability of false alarms (P_F). Each point on the curve represents the average of all results obtained by NADA on all the anomalies of the MetroSec database for a given value of the k parameter, i.e. for a given sensitivity level. The analysis of the curve shows that NADA is significantly more efficient than a random tool, as all points are above the line y = x. Even when the detection probability increases, the NADA ROC curve exhibits a very low false alarm rate. For example, with P_D in [60%, 70%], the probability of false alarms is in [10%, 20%], which is a good result.

Fig. 1. Statistical performances of NADA evaluated based on (a) the MetroSec database and (b) the 10% KDD database. Detection probability (P_D) vs. probability of false alarms (P_F), P_D = f(P_F).

Table 2. Characteristics of the KDD’99 database in terms of numbers of samples

Base             DoS         Scan     U2R   R2L      Normal
10% KDD          391,458     4,107    52    1,126    97,277
Corrected KDD    229,853     4,166    70    16,347   60,593
Full KDD         3,883,370   41,102   52    1,126    972,780

3.2 KDD’99

The KDD’99 database consists of a set of traces detailed in table 2. During the International Knowledge Discovery and Data Mining Tools contest [Kd], only 10% of the KDD database was used for the learning phase [HB]. This part of the database contains 22 types of attacks and is a concise version of the full KDD database. The latter contains a greater number of attack examples than normal connections, and the types of attacks are not equally represented. Because of their nature, DoS attacks represent the huge majority of the database. On the other hand, the corrected KDD database has different statistical distributions compared to the "10% KDD" or "Full KDD" databases. In addition, it contains 14 new types of attacks.

The NADA evaluation was limited to the 10% KDD database. Several reasons motivated this choice: first, although this database is the simplest, it is also the most used. Second, our intention is to show that the KDD database is not suited for evaluating current ADS (and we will show that the reduced database is enough to demonstrate it). Last, it is a good first test for observing NADA behavior with high intensity DoS attacks.

Figure 1.b shows the NADA ROC curve obtained with the 10% KDD database. It shows the detection probability (P_D) as a function of the probability of false alarms (P_F), and its analysis shows that NADA obtained very good results. Applied to the KDD’99 database, NADA exhibits a detection probability close to 90%, and a probability of false alarms around 2%. These results are extremely good, but unfortunately unrealistic if we compare them with the results obtained with the MetroSec database! DoS attacks are detected in a very reliable way, but certainly because the database is excessively simple: around 98% of attacks are DoS attacks of the same type, which in addition have very strong intensities.

The differences in NADA performance when applied to the MetroSec and KDD’99 databases underline the importance of the anomaly database for evaluating profile based IDS and ADS. It is obvious that the MetroSec database presents more complex situations for NADA than KDD’99. Therefore, the evaluation results obtained with the MetroSec database are certainly closer to the real performance level of NADA than the ones obtained with KDD’99.
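The ROC points discussed in this section can be obtained by sweeping the detector threshold and recomputing the two ratios for each setting. The Python sketch below is a minimal illustration, not the MetroSec tooling: it assumes that a detector produces one anomaly score per session and raises an alarm when the score reaches the threshold k (how NADA actually tests its deltoids is not shown here), and the function names and toy data are hypothetical. The false alarm rate follows the definition recalled in the introduction, i.e. false alarms divided by the total number of sessions.

```python
def roc_point(scores, is_attack, k):
    """Return (P_F, P_D) for threshold k.

    scores    -- one anomaly score per session (assumed detector output)
    is_attack -- one boolean per session, taken from the documented trace
    """
    alarms = [s >= k for s in scores]
    attempts = sum(is_attack)
    detected = sum(a and y for a, y in zip(alarms, is_attack))
    false_alarms = sum(a and not y for a, y in zip(alarms, is_attack))
    p_d = detected / attempts          # detected intrusions / intrusion attempts
    p_f = false_alarms / len(scores)   # false alarms / total number of sessions
    return p_f, p_d


def roc_curve(scores, is_attack, thresholds):
    """One ROC point per threshold value, i.e. per sensitivity level."""
    return [roc_point(scores, is_attack, k) for k in thresholds]


def better_than_random(points):
    """True if every point lies strictly above the diagonal y = x."""
    return all(p_d > p_f for p_f, p_d in points)


# Toy data for illustration only: 8 sessions, 3 of which are attacks.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
is_attack = [True, True, False, True, False, False, False, False]

points = roc_curve(scores, is_attack, thresholds=[0.85, 0.5, 0.35])
print([(round(p_f, 2), round(p_d, 2)) for p_f, p_d in points])
# [(0.0, 0.33), (0.25, 0.67), (0.25, 1.0)]
print(better_than_random(points))  # True
```

Lowering k makes the detector more sensitive, which moves the point up and to the right in the ROC space; a tool is only useful where its points stay above the diagonal.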

4 Evaluation of 2 Other IDS/ADS with the MetroSec Database

To illustrate how well the statistical evaluation method proposed in this paper, based on the MetroSec database, distinguishes between the real capabilities of profile based IDS and ADS, this section presents comparative results between NADA and two other tools or approaches: the Gamma-FARIMA based approach [SLO+07], and the PHAD tool (experimental Packet Header Anomaly Detection) [MC01]. The Gamma-FARIMA approach is also one of the achievements of the MetroSec project. PHAD was selected as, in addition to being freely available, it aims at both detecting and classifying anomalies and attacks, similarly to the objectives of MetroSec. In addition, PHAD, funded by the American NSF, is said to be the ultimate intrusion and anomaly detection tool. The argumentation of its authors mainly relies on tests led with KDD’99 traces: on these traces, PHAD gives perfect results.

• PHAD. The evaluation of PHAD using the MetroSec database was performed by varying the K threshold; K also represents the probability of generating a correct alarm. Figure 2.a shows the ROC curve obtained. Its analysis shows that PHAD behaves only slightly better than a random detection process. Additional analysis aimed at explaining the excessive number of false alarms points to a problem with the underlying model and the learning process. It was also observed that, when facing low intensity attacks, PHAD detects none of them. Compared with the performance obtained with the KDD’99 database, there is a big gap. In fact, it seems that PHAD only detects the anomalies of the KDD’99 database, and no others.

Fig. 2. Statistical performances of (a) PHAD and (b) the Gamma-FARIMA approach evaluated with the MetroSec database. Detection probability (P_D) vs. probability of false alarms (P_F), P_D = f(P_F).

• Gamma-FARIMA. Figure 2.b shows the ROC curve obtained for the Gamma-FARIMA approach evaluated with the MetroSec database. More than 60% of anomalies are detected with a false alarm rate close to 0.

Looking at all the ROC curves, it seems that the Gamma-FARIMA approach is the most efficient of the 3 when confronted with the MetroSec database. NADA also exhibits good performance, not too far from that of the Gamma-FARIMA approach. These two algorithms, designed and developed in the framework of the MetroSec project, aim at detecting and classifying known and unknown anomalies. Both algorithms reach this goal, although by different approaches. While NADA uses simple mathematical functions, the Gamma-FARIMA approach relies on more complex mathematical analyses, which are the source of its advantage. However, the simplicity of NADA coupled with its high performance level can be of interest. This is particularly true if NADA has to be combined with a process for identifying malicious packets.

On the other hand, PHAD exhibits a dramatically low performance level when confronted with the MetroSec database (but is excellent when confronted with the KDD’99 one). A deeper analysis of the PHAD algorithm suggests that the use of 33 different parameters, and the assignment of a score to each of their anomalous occurrences, introduces uncertainty without improving detection accuracy. In addition, these 33 parameters only play a minor role in the detection of most of the attacks [MC01].
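Since the documented traces record the intensity of each attack, detection results can also be broken down by intensity, which is how an observation such as "no low intensity attack is detected" can be made visible. The Python sketch below shows one possible way to do so; the function name, the band boundaries and the sample results are hypothetical and are not measurements from the MetroSec experiments.

```python
from collections import defaultdict


def detection_rate_by_band(results, band_edges):
    """Group per-attack outcomes by intensity band.

    results    -- iterable of (intensity_pct, detected) pairs, one per attack
    band_edges -- ascending upper bounds of the bands, in percent
    Returns {band_label: (detected_count, total_count, detection_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [detected, total]
    for intensity, detected in results:
        lower = 0.0
        for upper in band_edges:
            if intensity <= upper:
                label = f"{lower:g}-{upper:g}%"
                break
            lower = upper
        else:
            # Intensity above the last edge: open-ended band.
            label = f">{band_edges[-1]:g}%"
        counts[label][1] += 1
        counts[label][0] += int(detected)
    return {band: (d, n, d / n) for band, (d, n) in counts.items()}


# Hypothetical outcomes for illustration only.
results = [(5, False), (8, False), (25, True), (30, False), (90, True)]
by_band = detection_rate_by_band(results, band_edges=[10, 50, 100])
print(by_band)
# {'0-10%': (0, 2, 0.0), '10-50%': (1, 2, 0.5), '50-100%': (1, 1, 1.0)}
```

A database that only contains high intensity attacks would populate the last band alone, which is why it cannot reveal this kind of weakness.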

5 Conclusion

This paper presented a statistical evaluation method for profile based IDS and ADS that relies on a new database of traffic traces containing both legitimate and illegitimate anomalies. It showed that the anomaly database has a major impact on the evaluation results. For example, it was shown that for NADA and PHAD the results obtained with KDD’99 and MetroSec traces are completely different. This paper also shows that the KDD’99 database does not give satisfactory results as regards the accuracy with which the different detection tools are evaluated. It does not confront them with conditions that are realistic enough (and thus complex enough), and in general the different evaluated tools pass the tests with good marks... marks that are not reproduced once the tools are installed in a real environment. Indeed, the KDD’99 evaluation is more or less binary: it only shows whether high intensity attacks can be detected. The MetroSec method plays with the intensities and durations of anomalies to determine the levels at which an attack can be detected.

All these observations reveal one big problem. If we assume that the MetroSec database is exhaustive for current traffic anomaly phenomena (we tried as much as possible to define generic anomalies on all dimensions of network traffic), how would it be possible to ensure that it remains so over time? Indeed, even if anomaly classes do not radically change, their shapes and intensities (especially in relation to the evolution of network capacities) will change, and the database will, in the more or less long term, lose some of its realism. Given the strategic aspect of traffic traces for operators, it is not at all certain that we could continue to obtain traffic traces in which anomalies could be injected. In addition, producing such traces containing anomalies is a very time consuming task which can hardly be supported by a single laboratory or a few laboratories. This is one of the main drawbacks, for which no solution is apparent.

The last problem raised by this paper is related to the role of the experimental data we are exploiting. In fact, we use the same data for designing and for validating/evaluating our tools. Thus PHAD, which works perfectly on the KDD’99 database (it was designed for detecting anomalies and attacks by taking its inspiration from the KDD’99 database), shows incredibly low performance when tested with the MetroSec database. What would be the results of the Gamma-FARIMA or NADA tools if they were evaluated with anomaly databases other than the MetroSec or KDD’99 ones (we took our inspiration from these databases when designing our tools)? Again, the solution would be to have a large number of anomalous trace databases in order to separate, at least experimentally, design and evaluation. But we then fall back on the previously mentioned problem of the lack of exploitable traffic traces.

• Database availability. Our intention is to open the trace database as widely as possible, while respecting privacy. These privacy issues force us to keep the traces on LAAS’ servers, and do not allow us to let anyone download them. Up to now, the database has been made available to COST-TMA members who can visit LAAS for short scientific missions. This possibility will be maintained and we hope to extend it in the near future. LAAS will then offer the storage and computing capacities for all COST-TMA researchers interested in working with the traces. An explicit agreement with the LAAS’ charter is required. Our intention is to open the trace database to the research community at large. We have already asked what the requirements are to make the traces publicly available on a web/FTP server. Once the details are defined, they will be put on the MetroSec web page [MET].

References

[CDG+00] Cleary, J., Donnelly, S., Graham, I., McGregor, A., Pearson, M.: Design principles for accurate passive measurement. In: Passive and Active Measurements, Hamilton, New Zealand (April 2000)
[DCW+99] Durst, R., Champion, T., Witten, B., Miller, E., Spagnuolo, L.: Testing and evaluating computer intrusion detection system. Communications of the ACM 42(7) (1999)
[ENW96] Erramilli, A., Narayan, O., Willinger, W.: Experimental queueing analysis with long-range dependent packet traffic. ACM/IEEE Transactions on Networking 4(2), 209–223 (1996)
[FGW98] Feldmann, A., Gilbert, A.C., Willinger, W.: Data networks as cascades: Investigating the multifractal nature of internet wan traffic. In: ACM/SIGCOMM Conference on Applications, technologies, architectures, and protocols for computer communication, pp. 42–55 (1998)
[FOM07] Farraposo, S., Owezarski, P., Monteiro, E.: NADA - Network Anomaly Detection Algorithm. In: Clemm, A., Granville, L.Z., Stadler, R. (eds.) DSOM 2007. LNCS, vol. 4785, pp. 191–194. Springer, Heidelberg (2007)
[HB] Hettich, S., Bay, S.: The UCI KDD archive, Department of Information and Computer Science. University of California, Irvine (1999), http://kdd.ics.uci.edu
[HPI] HPING2, http://sourceforge.net/projects/hping2
[IPE] IPERF. The TCP/UDP Bandwidth Measurement Tool, http://dast.nlanr.net/Projects/Iperf/
[Kd] UCI KDD Archive KDD 1999 datasets, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[LFG+00] Lippman, R., Fried, D., Graf, I., Haines, J., Kendall, K., McClung, D., Weber, D., Webster, S., Wyschogrod, D., Cunningham, R., Zissman, Y.: Evaluating intrusion detection systems: the 1998 darpa off-line intrusion detection evaluation. In: DARPA Information Survivability Conference and Exposition, pp. 12–26 (2000)
[LSM99] Lee, W., Stolfo, S., Mok, K.: Mining in a data-flow environment: Experience in network intrusion detection. In: Proceedings of the ACM International Conference on Knowledge Discovery & Data Mining KDD 1999, pp. 114–124 (1999)
[MC01] Mahoney, M., Chan, P.: Phad: Packet header anomaly detection for identifying hostile network traffic. Technical Report CS-2001-04. Department of Computer Sciences - Florida Institute of Technology (2001)
[MC03] Mahoney, M., Chan, P.: An analysis of the 1999 darpa/lincoln laboratory evaluation data for network anomaly detection. In: Vigna, G., Krügel, C., Jonsson, E. (eds.) RAID 2003. LNCS, vol. 2820, pp. 220–237. Springer, Heidelberg (2003)
[McH01] McHugh, J.: Testing intrusion detection systems: A critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory. ACM Transactions on Information and System Security 3(4), 262–294 (2001)
[MET] METROSEC, http://www.laas.fr/METROSEC
[MIT] MIT. Lincoln Laboratory (2008), http://www.ll.mit.edu/mission/communications/ist/corpora/ideval
[OBLG08] Owezarski, P., Berthou, P., Labit, Y., Gauchard, D.: Laasnetexp: a generic polymorphic platform for network emulation and experiments. In: Proceedings of the 4th International Conference on Testbeds and Research Infrastructure for the Development of Network & Communities (TRIDENTCOM 2008) (March 2008)
[PKC96] Park, K., Kim, G., Crovella, M.: On the relationship between file sizes, transport protocols, and self-similar network traffic. In: International Conference on Network Protocols, pp. 171–180. IEEE Computer Society, Washington (1996)
[PW00] Park, K., Willinger, W.: Self-similar network traffic: an overview. In: Park, K., Willinger, W. (eds.) Self-Similar Network Traffic and Performance Evaluation, pp. 1–38. Wiley (Interscience Division), Chichester (2000)
[RRR08] Ringberg, H., Roughan, M., Rexford, J.: The need for simulation in evaluating anomaly detectors. Computer Communication Review 38(1), 55–59 (2008)
[SLO+07] Scherrer, A., Larrieu, N., Owezarski, P., Borgnat, P., Abry, P.: Non-gaussian and long memory statistical characterisations for internet traffic with anomalies. IEEE Transaction on Dependable and Secure Computing 4(1) (January 2007)
[TFN] TFN2K. An analysis, http://packetstormsecurity.org/distributed/TFN2kAnalysis-1.3.txt
[Tri] Trinoo. The DoS Project’s “trinoo” distributed denial of service attack tool, http://staff.washington.edu/dittrich/misc/trinoo.analysis

Collection and Exploration of Large Data Monitoring Sets Using Bitmap Databases

Luca Deri¹,², Valeria Lorenzetti¹, and Steve Mortimer³

¹ ntop.org, Italy
² IIT/CNR, Italy
³ British Telecom, United Kingdom
{deri,lorenzetti}@ntop.org, [email protected]

Abstract. Collecting and exploring monitoring data is becoming increasingly challenging as networks become larger and faster. Solutions based on both SQL-databases and specialized binary formats do not scale well as the amount of monitoring information increases. This paper presents a novel approach to the problem by using a bitmap database that allowed the authors to implement an efficient solution for both data collection and retrieval. The validation process on production networks has demonstrated the advantage of the proposed solution over traditional approaches. This makes it suitable for efficiently handling and interactively exploring large data monitoring sets.

Keywords: NetFlow, Flow Collection, Bitmap Databases.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 73–86, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

NetFlow [1] and sFlow [2] are the current state-of-the-art standards for building traffic monitoring applications. Both are based on the concept of a traffic probe (or agent in sFlow parlance) that analyzes network traffic and produces statistics, known as flow records, which are delivered to a central data collector [3]. As the number of records can be pretty high, probes can use sampling mechanisms in order to reduce the workload on both probes and collectors. In sFlow, sampling is native to the architecture, so agents can use it to effectively reduce the number of flow records delivered to collectors. This practice has a drawback in terms of result accuracy, although the accuracy remains quantifiable. In NetFlow, sampling (of both packets and flows) reduces the load on routers but leads to inaccuracy [4] [5] [6]; hence it is often disabled in production networks. The consequence is that network operators have to face the problem of collecting and analyzing a large number of flow records. This problem is often solved using a flow collector that stores data in a relational database or on disk in raw format for maximum collection speed [7] [8]. Both approaches have pros and cons; in general, SQL-based solutions allow users to write powerful and expressive queries while sacrificing flow collection speed and query response time, whereas raw-based solutions are more efficient but provide limited query facilities.

The motivation behind this work is to overcome the limitations of existing solutions and create an efficient alternative to relational databases and raw files. We aim to create a new generation of flow collection and storage architecture that exploits state-of-the-art indexing and querying technologies [9], and a set of tools capable of interactively exploring large volumes of collected traffic data with minimal query response time. The main contributions of this paper include:

• The ability to execute multidimensional queries on arbitrarily large amounts of data with response times in the order of seconds (in many cases, milliseconds).
• An efficient yet simple flow record storage architecture in terms of disk space, query response time, and data collection duration.
• A system that operates on raw flow records without first reducing or summarizing them.
• The reduction of the time needed to explore a large dataset and the possibility of displaying query results in real time, making the exploration process truly interactive.

The following section presents a survey of relevant flow storage and retrieval architectures, describes their limitations, and lists a set of requirements that a flow management architecture should feature. Section three covers the architecture and design choices of the proposed solution. Section four validates this solution on two production networks, evaluates the implementation performance and positions this work against popular tools identified during the survey.

2 Related Work and Motivation

Flow collectors are software applications responsible for receiving flow records emitted by network elements such as routers and switches. Their main duty is to make sure that all flow records are received and successfully written to persistent storage. This solution limits flow record loss and decouples the collection phase from flow analysis, with the drawback of adding some latency, as records are often not processed immediately on arrival. Tools falling into this category include nfdump [10], flow-tools [11], FlowScan [12], Stager [13] and SiLK [14]. These tools store data in binary flat files, optionally in compressed format to reduce disk space usage and read time; they typically offer additional tools for filtering, extracting, and summarizing flow records matching specific criteria. As flat files have no indexing, searching data always requires a sequential scan of all stored records. To reduce the dataset to scan, these tools save flow records in directories that each cover a specific time span, which eases the temporal selection of records during queries. Basically, the speed advantage of dumping flow records in raw format is paid at each search operation in terms of the amount of data to read. Another limitation of this family of tools is that the query language they offer is limited when compared to SQL, as they feature flow-based filtering with minimal aggregation, join and reporting facilities.

The use of relational databases is fairly popular in commercial flow collectors such as Cisco NetFlow Collector and Fluke NetFlow Tracker, and in open-source tools such as Navarro [15] and pmacct [16]. The flexibility of the SQL language is very useful during report creation and data aggregation, although some researchers have proposed a specialized flow query language [17]. Unfortunately, relational databases are known to be slower (during both data insertion and query) and to take more space than raw flow record files [18] [19] [20]. The conclusions of the survey on popular flow management tools are:
• Tools based on raw binary files are efficient when storing flow records (e.g. nfdump can store over 250K records/sec on a dual-core PC) but provide limited flow query facilities.
• Relational databases are slower during both flow record insertion and retrieval, but thanks to SQL they offer very flexible flow query and reporting facilities.
• On large volumes of collected flow records, queries with both tool families take a significant amount of time (measured in minutes if not hours [21]) even when high-end computers are used, making them unsuitable for interactive data exploration.

Given that the performance of state-of-the-art tools is suboptimal, the authors investigated whether there was a better solution to the problem of flow collection and query than raw files and relational databases.

2.1 Towards Column-Oriented Databases with Bitmap Indexes

A database management system typically structures data records using tables with rows and columns. The system optimizes the query-answering process by implementing auxiliary data structures known as database indexes [22] to accelerate queries. Relational databases encounter performance issues with large tables, in particular because of the size of the table indexes that need to be updated at each record insertion. In the last few years, new regulations that require ISPs to maintain large archives of user activities (e.g. login/logout/radius/email/wifi access logs) [23] have stimulated the development of new database types able to efficiently handle billions of records.
Although available since the late 1970s [24], column-oriented databases [25] were niche products until vendors such as Sensage [26] and Sybase [27], and open-source implementations such as FastBit [28] [29] [30], ignited new interest in this technology. A column-oriented database stores its content by column rather than by row, an approach known as vertical organization. This way the values of each single column are stored contiguously, and column-store compression ratios are generally better than those of row-stores because consecutive entries in a column are homogeneous [31] [32]. These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems, particularly on read-intensive analytical processing workloads. In fact, column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) the attributes accessed by a query [25].

B-tree indexes are the most popular method for accelerating search operations. They were initially designed for transactional data (where any index on the data must be updated quickly as data records are modified, and query results are usually limited in number of records) and fail to meet the requirements of modern data analysis, such as interactive analysis over large volumes of collected traffic data. Such queries return thousands of records, which with B-trees would require a large number of tree-branching operations that use slow pointer chases in memory and random disk accesses, thus taking a long time. Many other popular indexing techniques, such as hash indexes, have similar shortcomings. Considering the peculiarity of network monitoring data, where flow records are read-only and several flow fields have very few unique values, as of today the best indexing method is a bitmap index [33]. These indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. For tasks that demand the fastest possible query processing speed, bitmap indexes perform extremely well because the intersection between the search results on each variable is a simple AND operation over the resulting bitmaps [22].

Given that column-oriented databases with bitmap indexes provide better performance than relational databases, the authors explored their use in the field of flow monitoring and designed a system based on this technology that is able to efficiently handle flow records. The main requirements of this development work include:
• Ability to save flow records on disk with minimal overhead, allowing no-loss on-the-fly flow-to-disk storage, as happens with tools based on raw files.
• Compact data storage that limits disk usage, hence enabling users to store months of flow records on a cheap hard disk with no need for costly storage systems.
• Stored data must be immutable (i.e. once it has been saved it cannot be modified or deleted), as this is a key feature for billing and security systems where non-repudiation is mandatory.
• Ability to perform efficiently on network storage such as NFS (Network File System).
• Simple data archive structure, so that old data can be moved to off-line storage systems without having to use complex data partitioning solutions.
• Avoid complex architectures [34] that are hard to maintain and operate, by developing a simple tool that can potentially be used by all network administrators.
• On tens of millions of records:
  • Sub-second search time when performing cardinality searches (e.g. count the number of records that satisfy a certain criterion). This is a requirement for exploring data in real time and implementing interactive drill-down data search.
  • Sub-minute search time when extracting records matching a certain criterion (e.g. top X hosts and their total traffic on TCP port Y).
• Feature-rich query language such as SQL, with the ability to sort, join, and aggregate data while performing mathematical operations on columns (e.g. sum, average, min/max, variance, median, distinct), which are necessary to compute complex statistics on flows.

The following sections cover the design and implementation of an extension to nProbe [35], an open-source probe and flow collector, that allows flow records to be stored on disk using a column-oriented database with an efficient compressed bitmap indexing technology. Finally, the performance of the nProbe implementation is evaluated and positioned against the similar tools previously listed.
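The bitmap-index idea described above can be made concrete with a small sketch. The code below is an illustration only, not FastBit's implementation: it builds an uncompressed, equality-encoded index (one bitmap per distinct value) and uses Python integers as bit arrays. The class name, helper name, and sample values are assumptions made for this example; the column names follow the NetFlow field names used later in the paper.

```python
from collections import defaultdict


class BitmapIndex:
    """Uncompressed, equality-encoded bitmap index over a single column."""

    def __init__(self, values):
        # One bitmap per distinct value; a Python int serves as the bit array,
        # with bit i set when row i holds that value.
        self.bitmaps = defaultdict(int)
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row

    def eq(self, value):
        """Bitmap of the rows where column == value (0 if it never occurs)."""
        return self.bitmaps.get(value, 0)


def matching_rows(bitmap):
    """Expand a bitmap into the list of row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Toy flow table stored column by column (illustrative values).
protocol = [6, 17, 6, 6, 17, 1]
dst_port = [80, 53, 443, 80, 53, 0]

protocol_index = BitmapIndex(protocol)
port_index = BitmapIndex(dst_port)

# WHERE PROTOCOL = 6 AND L4_DST_PORT = 80: a single bitwise AND.
tcp_http = protocol_index.eq(6) & port_index.eq(80)

# WHERE L4_DST_PORT = 80 OR L4_DST_PORT = 443: a single bitwise OR.
web = port_index.eq(80) | port_index.eq(443)

# COUNT(*) only needs the number of set bits, not the column data.
tcp_http_count = bin(tcp_http).count("1")
```

Because a count is answered from the bitmaps alone, this also hints at why cardinality searches can be much cheaper than queries that must fetch the matching records. A production index would additionally compress the bitmaps, which this sketch omits.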


3 Architecture and Implementation

nProbe is an open-source NetFlow probe that supports both NetFlow and sFlow collection, as well as flow conversion between versions (for instance, converting v5 flows to v9).

[Figure 1 diagram: nProbe receives sFlow, NetFlow, and captured packets, and provides flow export and data dump to Raw Files / MySQL / SQLite / FastBit.]

Fig. 1. nProbe Flow Record Collection and Export Architecture

It fully supports NetFlow v9/IPFIX, so it can use dynamic flow templates (i.e. it supports Flexible NetFlow) that are configured when the tool is started. nProbe features flow collection and storage, both in raw files and in relational databases such as MySQL and SQLite. Support for relational databases has always been controversial: nProbe users appreciate the ability to query flow records using SQL, but flow dump to a database is usually activated only at small sites. The reason is that enabling database support can lead to flow record loss due to the database processing overhead. This is mostly caused by network latency and multi-user database access, by the slow-down due to table index updates during data insertion, and by poor database performance when records are searched while data is being inserted. Databases offer mechanisms for mitigating some of these issues, including data insertion in batch mode instead of in real time, disabling transactions, and defining tables with no indexes to avoid the overhead of index updates.

In order to overcome the limitations of existing flow-management systems, the authors decided to explore the use of column-based databases by implementing an extension to nProbe that allows flows to be stored on disk using FastBit [29]. More precisely, FastBit is not a database but a C++ library that implements efficient bitmap indexing methods. Data is represented as tables with rows and columns. A large table may be partitioned into many data partitions, each stored in a distinct directory, with each column stored as a separate file in raw binary form. The name of the data file is the name of the column. In each data partition there is an extra file named -part.txt that contains metadata such as the name of the partition and the column names. Each column contains data stored in uncompressed form, so its size is the same as that of a raw file dump.
Columns can hold data that is 1, 2, 4, or 8 bytes long. Data longer than 8 bytes needs to be split across two or more columns.
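The column-per-file layout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FastBit's actual on-disk format: the schema, the little-endian byte order, and the helper names are hypothetical, and the metadata file is omitted. It only shows the two ideas from the text: one raw binary file per column, and splitting a value wider than 8 bytes (here a 128-bit IPv6 address) into two 8-byte halves.

```python
import ipaddress
import os
import struct
import tempfile

# Hypothetical schema: column name -> struct code for 1, 2, 4 and 8 byte
# unsigned integers. Little-endian byte order is an assumption of this sketch.
SCHEMA = {"PROTOCOL": "B", "L4_DST_PORT": "H", "IPV4_SRC_ADDR": "I", "BYTES": "Q"}


def split_128(value):
    """Split a 128-bit integer into (high, low) 64-bit halves."""
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF


def dump_partition(directory, records):
    """Append records column by column: one raw binary file per column."""
    for name, code in SCHEMA.items():
        with open(os.path.join(directory, name), "ab") as column_file:
            for record in records:
                column_file.write(struct.pack("<" + code, record[name]))


def read_column(directory, name):
    """Read back a whole column without touching the other column files."""
    fmt = "<" + SCHEMA[name]
    size = struct.calcsize(fmt)
    with open(os.path.join(directory, name), "rb") as column_file:
        data = column_file.read()
    return [struct.unpack_from(fmt, data, off)[0] for off in range(0, len(data), size)]


records = [
    {"PROTOCOL": 6, "L4_DST_PORT": 80, "IPV4_SRC_ADDR": 0x0A000001, "BYTES": 1500},
    {"PROTOCOL": 17, "L4_DST_PORT": 53, "IPV4_SRC_ADDR": 0x0A000002, "BYTES": 120},
]

with tempfile.TemporaryDirectory() as partition_dir:
    dump_partition(partition_dir, records)
    port_file_size = os.path.getsize(os.path.join(partition_dir, "L4_DST_PORT"))
    bytes_column = read_column(partition_dir, "BYTES")

# An IPv6 address needs two 8-byte columns.
ipv6_high, ipv6_low = split_128(int(ipaddress.IPv6Address("2001:db8::1")))
```

A query that touches only BYTES reads just that file, which is the I/O advantage of column stores mentioned in Section 2.1.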


Compressed bitmap indexes are stored in separate files whose name is the name of the column with the .idx suffix. This means that each column typically has two files: one contains the data and the other the index. Indexes can be created on data "as stored on disk" or on reordered data. This is a main difference with respect to conventional databases. In fact, it is possible to first reorder data, column by column, so that bitmap indexes are built on reordered data. Note that reordering does not affect query results (i.e. the data of a row is not mixed up when columns are reordered); it just improves index size and query speed. Data insertion and queries are performed by means of library calls or using a subset of SQL natively supported by the library. In FastBit the SELECT clause can only contain a list of column names and some functions, which include AVG, MIN, MAX, SUM, and DISTINCT. Each function can only take a column name as its argument. The WHERE clause is a set of range conditions joined together with logical operators such as AND, OR, XOR, and NOT. The clauses GROUP BY, ORDER BY, and LIMIT and the operators IN, BETWEEN and LIKE can also be applied to queries. FastBit does not currently support advanced SQL functionality such as nested queries, operators such as UNION and HAVING, or functions like FIRST, LAST, NOW, and FORMAT.

nProbe creates FastBit partitions depending on the flow templates being configured (probe mode) or read from incoming flows (collector mode), with each column having the same size as the NetFlow element it contains. Users can configure the partition duration (in minutes) at runtime; when a partition reaches its maximum duration, a new one is automatically created. Partition names are created in a tree fashion (e.g. /year/month/day/hour/minute). Similar to [36], the authors have developed facilities for rotating partitions, hence limiting disk space usage while preserving their structure.
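The remark above that reordering improves index size can be illustrated with a toy measurement. The sketch below assumes a run-length style of bitmap compression, where the compressed size grows with the number of runs of identical bits; it counts those runs for a column as stored and after sorting whole rows by that column. The function name and sample rows are hypothetical, and this is not how FastBit itself reorders or compresses data.

```python
def bitmap_runs(column):
    """Total number of runs of identical bits over all per-value bitmaps."""
    total = 0
    for value in set(column):
        bits = [item == value for item in column]
        # A run starts at row 0 and wherever two neighbouring bits differ.
        total += 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return total


# Rows are (destination port, source address); values are illustrative.
rows = [
    (80, "10.0.0.1"), (53, "10.0.0.2"), (80, "10.0.0.3"), (443, "10.0.0.1"),
    (53, "10.0.0.4"), (80, "10.0.0.2"), (443, "10.0.0.5"), (53, "10.0.0.1"),
]

# Reordering moves whole rows, so the fields of a row stay together.
reordered = sorted(rows, key=lambda row: row[0])

runs_as_stored = bitmap_runs([port for port, _ in rows])
runs_reordered = bitmap_runs([port for port, _ in reordered])
```

On this sample the port column needs 17 runs as stored and 7 after reordering, while the set of rows is unchanged. Fewer, longer runs are what a run-length based encoder compresses well.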
No FastBit-specific configuration is necessary, as nProbe knows the flow format and automatically creates partitions and columns. Datatypes longer than 64 bits, such as IPv6 addresses, are transparently split across two FastBit columns. Flow records are not saved individually on disk; for efficiency reasons they are dumped in blocks of 4096 records. Users can decide to build indexes on all columns or only on a few selected ones, in order to avoid wasting space on indexes for columns that will never be used in queries. If, while executing a query, FastBit does not find an index for a key column, it builds the index for that column on the fly before executing the query. For efficiency reasons, the authors have decided that indexes are not built at every data dump but when a partition is completed (e.g. the partition duration has elapsed). The reason is that building indexes on reordered data is more efficient (in terms of both disk usage and query response time) than building them on data in the same order as it was saved on disk. The drawback of this design choice is that queries can use indexes only once they have been built, hence only after the partition has been completely dumped on disk. On the other hand, flow records can be dumped at full speed with no index-build overhead. Thus, not considering the flow receive/decoding overhead, it is possible to save more than one million flow records/sec on a standard Serial ATA (SATA) disk. Column indexes are completely loaded into memory during searches, which imposes a limit on the partition size, also limited by FastBit to $2^{32}$ records. Hence it is wise to avoid creating large partitions, but the creation of too many small partitions must also be avoided, as this results in many files on disk, and the overhead of accessing them (open, close and file seek time) can dominate the data analysis time. A good compromise is to have partitions that either last a fixed amount of time (e.g. 5 minutes of flow records) or have a maximum number of records. Typically, for a machine with a few GB of memory, the FastBit developers recommend data partitions containing between 1 million and 100 million records.

Conceptually, a FastBit partition is similar to a table in a relational database; thus, when a query is spread across several partitions, it is necessary to merge the results and to collapse them when using the DISTINCT SQL clause. This task is not performed by FastBit but is delegated to utilities developed by the authors:
• fbmerge: tool for merging several FastBit partitions into a single one. This tool, now part of the FastBit distribution, is useful when small fine-grained partitions need to be aggregated into a larger one. For instance, if nProbe is configured to create 'one minute' partitions, at the end of the hour all of them can be aggregated into a 'one hour' partition. This greatly reduces the number of column files, and hence the number of disk i-nodes, which is very useful on large disks containing many days or months of collected records.
• fbquery: tool that allows queries to be performed on partitions. It supports an SQL-like syntax for querying data and implements, on top of FastBit, useful facilities such as:
  • Aggregation of similar results, data sorting, and result set limitation (same as MySQL LIMIT).
  • Recursive search on nested directories, so that a single directory containing several partitions can be searched in one shot. This is useful, for instance, when nProbe has dumped 5-minute partitions and users want to search the last hour, so that several partitions need to be read by fbquery.
  • Data dump in several formats such as CSV, XML, and plain text. The data format is based on the metadata produced by nProbe, so partition columns are printed according to their native representation (e.g. an IPV4_DST_ADDR is printed as a dot-separated IPv4 address and not as a 32-bit unsigned integer).
  • Scriptability using the Python language for combining several queries or creating HTML pages for rendering data in a web browser.

In a nutshell, the authors have used the FastBit library to create an efficient flow collection and storage system. As the library was not designed for handling network flows, the authors have implemented some missing features that are a prerequisite for creating comprehensive network traffic reports. The following section evaluates the performance of the proposed solution, compares it against relational databases, and validates it on two large networks. This is to demonstrate that nProbe with FastBit is a mature solution that can be used in a production environment.
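The need to merge and collapse results from several partitions can be sketched as follows. This is a hypothetical illustration of the kind of post-processing described above, not the actual fbquery code: each partition is assumed to return a partial GROUP BY result (source address mapped to a byte total) or a partial DISTINCT list, and the helper names are invented for the example.

```python
from collections import Counter


def merge_group_by(partials):
    """Add up per-partition SUM() results that share the same group key."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values key by key
    return totals


def top_n(totals, n):
    """ORDER BY total DESC LIMIT n, with ties broken by key for stability."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:n]


def merge_distinct(partials):
    """Collapse per-partition DISTINCT results into one duplicate-free list."""
    seen = set()
    for partial in partials:
        seen.update(partial)
    return sorted(seen)


# Partial results from two partitions (illustrative values).
partition_a = {"10.0.0.1": 100, "10.0.0.2": 50}
partition_b = {"10.0.0.2": 70, "10.0.0.3": 10}

totals = merge_group_by([partition_a, partition_b])
top_two = top_n(totals, 2)
distinct_hosts = merge_distinct([partition_a.keys(), partition_b.keys()])
```

Sums and counts merge by simple addition, which is why this step is cheap. An average would need the per-partition sum and count to be carried separately and divided only after merging.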

4 Validation and Performance Evaluation

In order to evaluate the FastBit performance, nProbe has been deployed in two different environments:
• Medium ISP: Bolig:net A/S. The average backbone traffic is around 250 Mbit/sec (about 40K pps). The traffic is mirrored onto a Linux PC (Linux Fedora Core 8 32 bit, Dual Core Pentium D 3.0 GHz, 1 GB of RAM, two SATA III disks configured with RAID 1) that runs nProbe in probe mode. nProbe computes the flows (NetFlow v9 bidirectional format with a maximum flow duration of 5 minutes) and saves flow records on disk using FastBit. Each FastBit partition stores one hour of traffic, and on average the probe produces 36 million flow records/day. Before deploying nProbe, records were collected and stored in a MySQL database.
• Large ISP: British Telecom. nProbe is used in collector mode. It receives flow records from 10 peering routers, with a peak flow export rate of 85 K flow records/sec and no flow loss. Each month the collected records exceed 4 TB of disk space. The application server has dual quad-core Intel processors with 24 GB of memory, runs Ubuntu Linux 9.10 64 bit, and is used to carry out queries on the data stored on an NFS server by the collection server. The NetFlow collection server has a single quad-core Intel processor and 8 GB of memory, runs Ubuntu Linux 9.10 64 bit, and stores the FastBit data on the NFS server. Each FastBit partition stores 60 minutes of traffic, which occupies about 5.8 GB of disk space when indexed. Before deploying nProbe, flow records were collected using nfdump.

The goal of these two setups is both to validate nProbe with FastBit in two different environments and to compare the results with the solutions previously used. The idea is to compare a regional ISP with a country-wide ISP, and to verify whether the proposed solution can be used effectively in both scenarios. As the code is open source, it is also important to verify that this work is efficient when used on standard PCs (contrary to solutions based on costly clusters or server farms, mostly used in Telco environments), as this is the most common scenario for many open-source users.

4.1 FastBit vs. Relational Databases

The goal of this test is to compare the performance of FastBit with that of MySQL (version 5.1.40 64 bit), a popular relational database. As the host running nProbe is a critical machine, in order not to interfere with the collection process, two days' worth of traffic was dumped in FastBit format and then transferred to a Core2Duo 3.06 GHz Apple iMac running MacOS 10.6.2. Moving FastBit partitions across machines running different operating systems and with different word lengths (one is 32 bit, the other 64 bit) did not require any data conversion, as FastBit transparently takes care of the differences among architectures. This is a good feature, as collector hosts can be based on different operating systems and technologies. In order to evaluate how FastBit partition size affects search speed, the hourly partitions have been merged into a single daily directory. In order to compare both approaches, five queries have been defined:
• Q1: SELECT COUNT(*), SUM(PKTS), SUM(BYTES) FROM NETFLOW
• Q2: SELECT COUNT(*) FROM NETFLOW WHERE L4_SRC_PORT=80 OR L4_DST_PORT=80
• Q3: SELECT COUNT(*) FROM NETFLOW GROUP BY IPV4_SRC_ADDR


• Q4: SELECT IPV4_SRC_ADDR, SUM(PKTS), SUM(BYTES) AS s FROM NETFLOW GROUP BY IPV4_SRC_ADDR ORDER BY s DESC LIMIT 1,5
• Q5: SELECT IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL, COUNT(*), SUM(PKTS), SUM(BYTES) FROM NETFLOW WHERE L4_SRC_PORT=80 OR L4_DST_PORT=80 GROUP BY IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL

FastBit partitions have been queried using the fbquery tool with appropriate command line parameters. All MySQL tests have been performed on the same machine, with no network communication between client and server (i.e. the MySQL client and server communicate using a Unix socket). In order to evaluate the influence of MySQL indexes on queries, the same test has been repeated with and without indexes. Tests were performed on 68 million flow records containing a subset of all NetFlow fields (IP source/destination, port source/destination, protocol, begin/end time). The following table compares the disk space used by MySQL and FastBit. In the case of FastBit, indexes have been computed on all columns.

Table 1. FastBit vs MySQL Disk Usage (results are in GB)

                              No Indexes   With Indexes
MySQL                         1.9          4.2
FastBit, Daily Partition      1.9          3.4
FastBit, Hourly Partition     1.9          3.9
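To make the benchmark queries concrete, the sketch below runs Q1, Q2, and a simplified variant of Q4 on a four-row toy table. It uses Python's built-in sqlite3 module purely for convenience, so it says nothing about MySQL or FastBit performance. The sample rows are invented, addresses are stored as text for readability (an assumption of this example), and the Q4 variant uses LIMIT 2 instead of the LIMIT 1,5 of the original query.

```python
import sqlite3

# Invented sample flows: src addr, src port, dst addr, dst port, protocol,
# packets, bytes.
flows = [
    ("10.0.0.1", 40000, "10.0.0.9", 80, 6, 10, 1000),
    ("10.0.0.2", 80, "10.0.0.1", 40001, 6, 5, 4000),
    ("10.0.0.1", 40002, "10.0.0.8", 53, 17, 1, 100),
    ("10.0.0.3", 40003, "10.0.0.9", 443, 6, 20, 9000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE NETFLOW (IPV4_SRC_ADDR TEXT, L4_SRC_PORT INTEGER, "
    "IPV4_DST_ADDR TEXT, L4_DST_PORT INTEGER, PROTOCOL INTEGER, "
    "PKTS INTEGER, BYTES INTEGER)"
)
conn.executemany("INSERT INTO NETFLOW VALUES (?, ?, ?, ?, ?, ?, ?)", flows)

# Q1: global totals.
q1 = conn.execute("SELECT COUNT(*), SUM(PKTS), SUM(BYTES) FROM NETFLOW").fetchone()

# Q2: a pure cardinality query on the port columns.
q2 = conn.execute(
    "SELECT COUNT(*) FROM NETFLOW WHERE L4_SRC_PORT=80 OR L4_DST_PORT=80"
).fetchone()[0]

# Q4 variant: top talkers by bytes (LIMIT 2 here, not the paper's LIMIT 1,5).
top_talkers = conn.execute(
    "SELECT IPV4_SRC_ADDR, SUM(PKTS), SUM(BYTES) AS s FROM NETFLOW "
    "GROUP BY IPV4_SRC_ADDR ORDER BY s DESC LIMIT 2"
).fetchall()

conn.close()
```

Q2 is the kind of query that a bitmap index can answer without reading any column data, whereas Q4 has to fetch and aggregate the packet and byte values.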

Table 2. FastBit vs MySQL Query Speed (results are in seconds)

         MySQL                      FastBit, Daily Partitions   FastBit, Hourly Partitions
Query    No Index   With Indexes    No Cache   Cached           No Cache   Cached
Q1       20.8       22.6            12.8       5.86             10         5.6
Q2       23.4       69              0.3        0.29             1.5        0.5
Q3       796        971             17.6       14.6             32.9       12.5
Q4       1033       1341            62         57.2             55.7       48.2
Q5       1754       2257            44.5       28.1             47.3       30.7

The test outcome has demonstrated that FastBit takes approximately the same disk space as MySQL in terms of raw data, whereas MySQL indexes are larger. Merging FastBit partitions does not usually improve search speed; instead, queries on merged data require more memory, as FastBit loads a larger index.


The size/duration of a partition mostly depends on the application that will access the data. Having small partitions (e.g. 1 or 5 minutes long) makes sense for interactive data exploration, where drill-down operations are common. In this case, small partitions mean small FastBit indexes, resulting in faster operations and less memory used. On the other hand, querying data over a long period using small partitions requires fbquery to read several small indexes instead of a single one, which is inefficient on standard (i.e. non solid-state) disks due to seek time. In addition, a side effect of using multiple partitions is that fbquery needs to merge the results produced on each partition without relying on FastBit. Note that large partitions also have drawbacks for searches, as indexes cannot be built on them until they have been completely dumped. For this reason, if nProbe saves flow records in a large one-day partition, queries on the current day must be performed without indexes, as the partition has not been completely dumped yet. In a nutshell, there is no single rule for defining partition duration; in general, the partition granularity should be as close as possible to the expected query granularity. The authors suggest using partitions lasting from 1 to 5 minutes in order to have quick searches even on the most recent data, and then merging partitions daily using fbmerge. This avoids exhausting disk i-nodes with index files and allows efficient searches on past data without accessing too many files.

In terms of query performance, FastBit far outperforms MySQL:
• Queries that only require access to indexes take less than a second, regardless of the query type.
• Queries that require data access are at least an order of magnitude faster than on MySQL and complete within roughly a minute.
• Index creation on MySQL takes many minutes, which prevents its use in real life when importing data in (near-)realtime, also considering that the indexes take a significant amount of disk space. Contrary to FastBit, indexes on MySQL do not speed up these queries: query time with indexes is longer than for the same query on unindexed data.
• Disk speed is an important factor for accelerating queries. In fact, running the same test twice, so that the data is already cached in memory, significantly decreases the query time. The use of RAID 0 also improved query speed.

4.2 FastBit vs. Raw Files

The goal of this test is to compare FastBit with a popular open-source collection tool named nfdump. Tests have been performed on a large network with terabytes of flow data collected per month. Although nfdump performs well when it comes to flow collection, its performance is sub-optimal at query time on large data sets. One of the main concerns of the network operators is that nfdump queries take a long time, so they often need to be run overnight before producing results. An explanation of this behavior is that nfdump does not index data, so searching over a large time span means reading all the raw data received over that period, which in this setup means GBs (if not TBs) of records. Using FastBit, the average speed improvement is in the order of 20:1. From the operator's point of view this means that queries complete in a reasonable amount of time. For instance, a query written in SQL as 'SELECT IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL FROM NETFLOW WHERE IPV4_SRC_ADDR=X OR IPV4_DST_ADDR=X' on 19 GB of data containing 14 hours of collected flow records takes about 45 seconds with FastBit, which is a major improvement over nfdump, which takes about 1500 seconds (25 minutes) to complete the same query. As nfdump does not use any index, its execution time is dominated by the time needed to sequentially read the binary data. This means that: query time = (time to sequentially read the raw data) + (record filtering time). The time needed to filter records is usually very small, as nfdump is fast enough and the complexity of its filters, whose syntax is similar to BPF [37] filters, is usually limited. This means that in nfdump the query time is basically the time needed to sequentially read the raw data. The previous query is consistent with this hypothesis: 1500 seconds to read 19 GB of data means that the average reading speed is about 12.7 MB/sec, which the authors consider typical for a SATA drive. For this reason, this section does not list the same tests as section 4.1: the query time of nfdump is mostly proportional to the amount of data to read [20], hence with some simple math it is possible to compute the expected nfdump response time. Also note that the nfdump query language is not SQL-like, therefore it is not possible to make a one-to-one comparison with FastBit and MySQL.

As flow records take a large amount of disk space, it is likely that they will be stored on a SAN (Storage Area Network). When the storage is attached to the host by means of fast communication links such as InfiniBand and FibreChannel, the system does not see any speed degradation when compared with a directly attached SATA disk. The authors decided to study how the use of network file systems such as NFS affects the query results.
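The "simple math" behind the nfdump estimate can be written down explicitly. The sketch below only restates the arithmetic from the figures quoted above (19 GB read in about 1500 seconds, about 45 seconds with FastBit). It assumes 1 GB = 1000 MB, ignores filtering time, and the function names and the 100 GB example are hypothetical.

```python
def read_rate_mb_per_s(data_gb, seconds):
    """Average sequential read rate implied by a full scan (1 GB = 1000 MB)."""
    return data_gb * 1000 / seconds


def expected_scan_seconds(data_gb, rate_mb_per_s):
    """Expected full-scan query time when filtering cost is negligible."""
    return data_gb * 1000 / rate_mb_per_s


# Figures reported for the example query.
rate = read_rate_mb_per_s(19, 1500)   # MB/sec implied by the nfdump run
speedup = 1500 / 45                   # nfdump time divided by FastBit time

# Hypothetical extrapolation: scanning 100 GB at the same rate.
scan_100_gb = expected_scan_seconds(100, rate)
```

With these inputs the implied read rate is roughly 12.7 MB/sec and the speed-up for this particular query is about 33:1, somewhat above the 20:1 average quoted in the text. At the same rate a 100 GB scan would take over two hours, which matches the operators' experience of overnight queries.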
A simple model for the time needed to read $n$ bytes is $t = \alpha + \beta \cdot n$, where $\alpha$ represents the disk access latency and $\beta$ is determined by the throughput (the transfer time per byte). NFS typically increases $\alpha$ but not $\beta$, as the network speed is typically higher than the disk read speed. In the case of nfdump the data is read sequentially, whereas with FastBit the raw data is accessed based on the indexes. Thus FastBit requires a small number of read operations, each of which has to pay $\alpha$. However, this extra cost is in the order of milliseconds, so it does not alter the overall comparison. This behavior has been tested by repeating some of the queries of section 4.1, demonstrating that the use of NFS only marginally affects the total query time.

4.3 FastBit Scalability

The tests have shown that the use of FastBit offers advantages with respect to both relational databases and raw-file-based solutions. In order to understand the scalability of nProbe when used with FastBit, it is necessary to separate flow collection from flow query. As stated in section 3, index creation happens once the partition has been dumped on disk; hence the dump speed is basically the speed of the hard drive, which in the case of SATA disks exceeds 1 million flow records/sec. As reported in section 4, a large ISP network produces fewer than 100,000 flow records/sec, which means that FastBit introduces no bottleneck in flow collection and dump. Flow query requires disk access, therefore the query process is mostly I/O bound. For every query, FastBit reads the whole index file of each column present in the WHERE clause. Then, based on the index search, it reads if necessary (e.g. COUNT(*) does not require it) the column files containing the real data, performing seeks on the files in order to move to the offsets where the index has found a match. Thus a simple model for the query response time is $\tau = \tau_{idx} + \tau_{data} + \tau_{proc}$, where $\tau_{idx}$ represents the time needed to read all the column indexes present in the WHERE clause, $\tau_{data}$ is the time to read (if there is any) the matching row data present in the SELECT clause, and $\tau_{proc}$ is the processing overhead. In general $\tau_{proc}$ is very limited with respect to $\tau_{idx}$ and $\tau_{data}$. As $\tau_{idx}$ = (index size / disk speed), it takes no more than a couple of seconds. Instead, $\tau_{data}$ can be pretty large if the data is sparse and several disk seeks are required. Note that $\tau_{proc}$ can grow significantly depending on the query (e.g. when sorting large data sets), and that $\tau_{data}$ is zero for queries that count (e.g. compute the number of records on port X produced by host Y) or that use mathematical functions such as SUM (e.g. total number of bytes on port X).
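The two cost models above (a read costs a fixed latency plus a transfer time proportional to its size; a query costs index reading plus data reading plus processing) can be turned into a small calculator. This is a sketch under stated assumptions: the parameter values are hypothetical rather than measurements from the paper, each data access is charged one full latency, and the function names are invented for the example.

```python
def read_time(nbytes, latency_s, throughput_bytes_per_s):
    """Time for one read: fixed access latency plus size over throughput."""
    return latency_s + nbytes / throughput_bytes_per_s


def query_time(index_bytes, data_reads, latency_s, throughput_bytes_per_s,
               processing_s=0.0):
    """Index read time + time of every separate data read + processing.

    data_reads holds the byte count of each separate seek-and-read on the
    column files; it is empty for queries answered from the index alone.
    """
    index_time = read_time(index_bytes, latency_s, throughput_bytes_per_s)
    data_time = sum(
        read_time(nbytes, latency_s, throughput_bytes_per_s)
        for nbytes in data_reads
    )
    return index_time + data_time + processing_s


# Hypothetical disk: 10 ms access latency, 100 MB/sec throughput.
LATENCY = 0.010
THROUGHPUT = 100e6

# Count-style query: only a 200 MB index is read, no row data.
count_query = query_time(200e6, [], LATENCY, THROUGHPUT)

# Extraction query: the same index plus 1000 scattered 4 KB reads.
extract_query = query_time(200e6, [4096] * 1000, LATENCY, THROUGHPUT)
```

With these made-up parameters the scattered reads cost about 10 seconds, almost all of it latency, against about 2 seconds for the index. That is the sense in which sparse matches can make the data-reading term dominate, and raising the latency (as a network file system would) only affects the queries that perform many separate reads.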

5 Open Issues and Future Work

Tests on various FastBit configurations have shown that the disk is an important component that has a major impact on the whole system. The authors are planning to use SSD drives in order to see how query time is affected, in particular when accessing raw data records requires several disk seek operations. One of the main limitations of FastBit is the lack of data compression, since it currently compresses only indexes. This is a feature that the authors are planning to add, as it saves disk space and hence reduces the time needed to read the records. Using compression libraries such as QuickLZO, lzop, and FastLZ, it should be possible to implement transparent compression and decompression while reducing disk space.

Another area of interest is the use of FastBit for indexing packets instead of flows. The authors have prototyped an application that parses pcap files and creates a FastBit partition based on various packet fields such as IP address, port, protocol, and flags, in addition to an extra column that contains the file id and the packet offset inside the pcap file. Using a web interface built on top of fbquery, users can search for packets matching given criteria and also retrieve the original packets from the original pcap files. Although this work is not rich in features when compared with specialized tools [36], it demonstrates that bitmap indexes are also effective for handling packets and not just flow records.

The work described in this paper is the basis for developing interactive data visualization tools based on FastBit partitions. Thanks to recent innovation in Web 2.0, there are libraries such as the Google Visualization API that separate visualization from data. Currently, the authors are extending nProbe by adding an embedded web server that can run FastBit queries on the fly and return query results in JSON format [38]. The idea is to create an interactive query system that can visualize both tabular data (e.g. flow information) and graphs (e.g. average number of flow records on port X over the last hour) by means of FastBit queries. This way the user does not have to interact with FastBit tools at all, and can focus on data exploration.

6 Final Remarks

The use of nProbe with FastBit is a major step ahead when compared to state-of-the-art tools based on both relational databases and raw data dumps. When searching data


on datasets of a few million records, the query time is limited to a few seconds in the worst case, whereas queries that just use indexes are completed within a second. The consequence of this major speed improvement is that it is now possible to query data in real time and avoid periodically updating costly counters, as their value can be computed on demand using bitmap indexes. Finally, this work paves the way to the creation of new monitoring tools on large data sets that can interactively analyze traffic data in near real time, contrary to what usually happens with most tools available today.

Availability. This work is distributed under the GNU GPL license and is available at the ntop home page http://www.ntop.org/nProbe.html.

Acknowledgments. The authors would like to thank K. John Wu for his help and support while using the FastBit library, Anders Kjærgaard Jørgensen for his support during the validation of this work, and Cristian Morariu for his suggestions during MySQL tests.

References

1. Claise, B.: NetFlow Services Export Version 9, RFC 3954 (2004)
2. Phaal, P., et al.: InMon Corporation’s sFlow: A Method for Monitoring Traffic in Switched and Routed Networks, RFC 3176 (2001)
3. Quittek, J., et al.: Requirements for IP Flow Information Export (IPFIX), RFC 3917 (2004)
4. Haddadi, H., et al.: Revisiting the Issues on Netflow Sample and Export Performance. In: ChinaCom 2008, pp. 442–446 (2008)
5. Duffield, N., et al.: Properties and Statistics from Sampled Packet Streams. In: Proc. ACM SIGCOMM IMW 2002 (2002)
6. Estan, C., et al.: Building a better NetFlow. In: Proc. of the ACM SIGCOMM Conference (2004)
7. Chakchai, S.: A Survey of Network Traffic Monitoring and Analysis Tools (2006)
8. Ning, C., Tong-Ge, X.: Study on NetFlow-based Network Traffic Data Collection and Storage. Application Research of Computers 25(2) (2008)
9. Reiss, F., et al.: Enabling Real-Time Querying of Live and Historical Stream Data. In: Proc. of 19th Intl. Conference on Scientific and Statistical Database Management (2007)
10. Haag, P.: Watch your Flows with NfSen and NfDump. In: 50th RIPE Meeting (2005)
11. Fullmer, M., Roming, S.: The OSU Flow-tools Package and Cisco NetFlow Logs. In: Proc. of 14th USENIX LISA Conference (2000)
12. Plonka, D.: FlowScan: A Network Traffic Flow Reporting and Visualization Tool. In: Proc. of 14th USENIX LISA Conference (2000)
13. Øslebø, A.: Stager: A Web Based Application for Presenting Network Statistics. In: Proc. of NOMS 2006 (2006)
14. Gates, C., et al.: More NetFlow Tools: For Performance and Security. In: Proc. 18th Systems Administration Conference, LISA (2004)
15. Navarro, J.P., et al.: Combining Cisco NetFlow Exports with Relational Database Technology for Usage Statistics, Intrusion Detection and Network Forensics. In: Proc. 14th Systems Administration Conference, LISA (2000)
16. Lucente, P.: Pmacct: a New Player in the Network Management Arena. RIPE 52 Meeting (2006)


17. Marinov, V., Schönwälder, J.: Design of an IP Flow Record Query Language. In: Hausheer, D., Schönwälder, J. (eds.) AIMS 2008. LNCS, vol. 5127, pp. 205–210. Springer, Heidelberg (2008)
18. Sperrotto, A.: Using SQL databases for flow processing. In: Joint EMANICS/IRTF-NMRG Workshop on Netflow/IPFIX Usage in Network Management (2008)
19. Hofstede, R.: Performance measurements of NfDump and MySQL and development of a SURFmap plug-in for NfSen, Bachelor assignment. University of Twente (2009)
20. Hofstede, R., et al.: Comparison Between General and Specialized Information Sources When Handling Large Amounts of Network Data, Technical Report, University of Twente (2009)
21. Siekkinen, M., et al.: InTraBase: Integrated Traffic Analysis Based on a Database Management System. In: Proc. of E2EMON 2005 (2005)
22. Wu, K., et al.: A Lightning-Fast Index Drives Massive Data Analysis. SciDAC Review (2009)
23. Oltsik, J.: The Silent Explosion of Log Management, CNET News (2008), http://news.cnet.com/8301-10784_3-9867563-7.html
24. Turner, M.J., et al.: A DBMS for large statistical databases. In: Proc. of 5th VLDB Conference (1979)
25. Abadi, D., et al.: Column-Stores vs. Row-Stores: How Different Are They Really? In: Proc. of ACM SIGMOD 2008 (2008)
26. Herrnstadt, O.: Multiple Dimensioned Database Architecture, U.S. Patent application 20090193006 (2009)
27. Loshin, D.: Gaining the Performance Edge Using a Column-Oriented Database Management System, White Paper (2009)
28. Wu, K., et al.: Compressed Bitmap Indices for Efficient Query Processing, Technical Report LBNL-47807 (2001)
29. Wu, K., et al.: FastBit: Interactively Searching Massive Data. In: Proc. of SciDAC 2009 (2009)
30. Bethel, E.W., et al.: Accelerating Network Traffic Analytics Using Query-Driven Visualization. In: IEEE Symposium on Visual Analytics Science and Technology (2006)
31. Abadi, D., et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In: Proc. of 2006 ACM SIGMOD (2006)
32. Otten, F.: Evaluating Compression as an Enabler for Centralised Monitoring in a Next Generation Network. In: Proc. of SATNAC 2007 (2007)
33. Sharma, V.: Bitmap Index vs. B-tree Index: Which and When? Oracle Technology Network (2005)
34. Chandrasekaran, J., et al.: TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In: Proc. of Conference on Innovative Data Systems Research (2003)
35. Deri, L.: nProbe: an Open Source NetFlow Probe for Gigabit Networks. In: Proc. of Terena TNC 2003 (2003)
36. Desnoyers, P., Shenoy, P.: Hyperion: High Volume Stream Archival for Retrospective Querying. In: Proc. of 2007 USENIX Annual Technical Conference (2007)
37. McCanne, S., Jacobson, V.: The BSD Packet Filter: A New Architecture for User-level Packet Capture. In: Proc. Winter 1993 USENIX Conference (1993)
38. Crockford, D.: JSON: The fat-free alternative to XML. In: Proc. of XML 2006 (2006)

DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections

Antonio Barbuzzi, Gennaro Boggia, and Luigi Alfredo Grieco

DEE - Politecnico di Bari - V. Orabona, 4 - 70125, Bari, Italy
Ph.: +39 080 5963301; Fax: +39 080 5963410
{a.barbuzzi,g.boggia,a.grieco}@poliba.it

Abstract. Spurious Retransmission Timeouts in TCP connections have been extensively studied in the scientific literature, particularly for their relevance in cellular mobile networks. At present, while several algorithms have been conceived to identify them during the lifetime of a TCP connection (e.g., Forward-RTO or Eifel), no tool is able to accomplish the task with high accuracy by processing off-line traces. The only existing off-line tool is designed to analyze a large amount of traces taken from a single point of observation. In order to achieve higher accuracy, this paper proposes a new algorithm and a tool able to identify Spurious Retransmission Timeouts in a TCP connection, using the dumps of each peer of the connection. The main strengths of the approach are its high accuracy and the absence of assumptions on the characteristics of the TCP protocol. In fact, except for rare cases that cannot be classified with absolute certainty at all, the algorithm shows neither ambiguous nor erroneous detections. Moreover, the tool is also able to deal with reordering, small windows, and other cases where competitors fail. Our aim is to provide the community with a very reliable tool to: (i) test the working behavior of cellular wireless networks, which are more prone to Spurious Retransmission Timeouts than other technologies; (ii) validate run-time Spurious Retransmission Timeout detection algorithms.

1 Introduction

TCP congestion control [1,2,3] is fundamental to ensure Internet stability. Its main rationale is to control the number of in-flight segments in a connection (i.e., segments sent, but not yet acknowledged) using a sliding window mechanism. In particular, TCP sets an upper bound on the number of in-flight segments via the congestion window (cwnd) variable. As is well known, the value of cwnd is progressively increased over time to discover new available bandwidth, until a congestion episode happens, i.e., 3 Duplicate Acknowledgements (DUPACKs) are received or a Retransmission Timeout (RTO) expires. After a congestion episode, cwnd is suddenly shrunk to avoid a network collapse. TCP congestion control has demonstrated its robustness and effectiveness over the last two decades, especially in wired networks.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 87–100, 2010. © Springer-Verlag Berlin Heidelberg 2010


Recently, the literature has questioned the effectiveness of the basic RTO mechanism originally proposed in [1] and refined in [4]. Such a mechanism measures the Smoothed Round Trip Time (SRTT) and its standard deviation (DEV) and then sets $RTO = SRTT + 4 \cdot DEV$. The rationale is that, if the RTT exhibits a stationary behavior, the RTO can be considered a safe upper bound for the RTT. Unfortunately, in cellular mobile networks this is not always true. In fact, delay spikes due to retransmissions, fading, and handover can trigger Spurious Retransmission Timeouts (SRTOs) that lead to unnecessary segment retransmissions and useless reductions of the transmission rate [5,6].

Despite the importance of SRTOs, no tool is able to properly identify them with high accuracy by processing off-line traces. The only existing off-line tool is designed to analyze a large amount of traces taken from a single point of observation [7]. To bridge this gap, this paper proposes a new algorithm and a tool able to Detect SRTOs (which will be referred to as DeSRTO) in a TCP connection, using the dumps of each peer of the connection. The main strengths of the approach are its high accuracy and the absence of assumptions on the characteristics of the TCP protocol. In fact, except for rare cases that cannot be classified as SRTO with absolute certainty at all, the algorithm shows neither ambiguous nor erroneous detections. Moreover, the tool is also able to deal with reordering, small windows, and other cases where competitors fail. Our aim is to provide the community with a very reliable tool to: (i) test the working behavior of cellular wireless networks, which are more prone to Spurious Retransmission Timeouts than other technologies (although the tool is also well suited to any other kind of network with traffic flows using TCP); (ii) validate run-time Spurious Retransmission Timeout detection algorithms.

To provide a preliminary demonstration of the capabilities of DeSRTO, some results derived by processing real traffic traces collected over a real 3G network are reported. Moreover, to test its effectiveness, a comparison with the Vacirca detection tool [7] is reported. The rest of the paper is organized as follows. In Sec. 2 a summary of related works is reported. Sec. 3 describes our algorithm, also reporting some examples of its behavior. Sec. 4 shows the experimental results. Finally, conclusions are drawn in Sec. 5.
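The rule $RTO = SRTT + 4 \cdot DEV$ recalled in this section can be illustrated with a minimal Python sketch. It is not taken from the paper: the class name `RtoEstimator` is hypothetical, the smoothing gains (1/8 and 1/4) and the first-sample initialization follow RFC 6298, and the clock granularity term and the minimum RTO bound of that RFC are deliberately left out. The sketch shows why a sudden delay spike on an otherwise stationary path exceeds the RTO and triggers a timeout even though nothing was lost.

```python
class RtoEstimator:
    """Illustrative RTO estimator: RTO = SRTT + 4 * DEV (gains assumed from RFC 6298)."""

    ALPHA = 1 / 8  # gain applied to the smoothed RTT
    BETA = 1 / 4   # gain applied to the RTT deviation

    def __init__(self):
        self.srtt = None  # smoothed round trip time, seconds
        self.dev = None   # smoothed RTT deviation, seconds

    def update(self, rtt):
        """Feed one RTT sample (seconds) and return the resulting RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, DEV = R / 2.
            self.srtt = rtt
            self.dev = rtt / 2
        else:
            # The deviation is updated first, using the previous SRTT.
            self.dev = (1 - self.BETA) * self.dev + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto

    @property
    def rto(self):
        return self.srtt + 4 * self.dev


est = RtoEstimator()
for _ in range(50):
    est.update(0.1)        # stationary path: RTT stays at 100 ms

spike = 1.0                # hypothetical delay spike, e.g. during a handover
timer_fires = spike > est.rto
print(f"RTO = {est.rto:.3f} s, timer fires before the delayed ACK: {timer_fires}")
```

With a stationary RTT the deviation term decays and the RTO approaches the RTT itself, so the safety margin against spikes shrinks; real stacks soften this with a lower bound on the RTO.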

2 Related Works

So far, research on SRTOs has produced schemes that can be classified into two families: (i) real-time detection schemes for TCP stacks; (ii) off-line processing tools for SRTO identification. The Eifel Detection [5], the F-RTO [6], and the DSACK [8] schemes belong to the first family. Instead, the tool proposed in [7] and the DeSRTO algorithm presented herein belong to the second family. The goal of the Eifel detection algorithm [5] is to avoid the go-back-N retransmits that usually follow a spurious timeout. It exploits the information provided by the TCP Timestamps option in order to distinguish whether incoming ACKs are related to a retransmitted segment or not.


The algorithm described in [8] exploits DSACK (Duplicate Selective ACK) information to identify SRTOs. DSACK is a TCP extension used to report the receipt of duplicate segments to the sender. The receipt of a duplicate segment implies that either the packet was replicated by the network, or both the original and the retransmitted packet arrived at the receiver. If all retransmitted segments are acknowledged and recognized as duplicates using DSACK information, the algorithm can conclude that all retransmissions in the previous window of data were spurious and no loss occurred.

The F-RTO algorithm [6] modifies the TCP behavior in response to RTOs: in general terms, when a RTO expires, F-RTO retransmits the first unacknowledged segment and waits for subsequent ACKs. If it receives an acknowledgment for a segment that was not retransmitted due to the timeout, the F-RTO algorithm declares a spurious timeout.

Possible responses to a spurious timeout are specified in [9], namely the Eifel Response Algorithm. Basically, three actions are specified: (i) the TCP sender sends new data instead of retransmitting further segments; (ii) the TCP stack tries to restore the congestion control state prior to the SRTO; (iii) the RTO estimator is re-initialized by taking into account the round-trip time spike that caused the SRTO (a slightly different approach has also been proposed in [10]).

To the best of our knowledge, the only SRTO detection algorithm so far aimed at the analysis of collected TCP traces is the one proposed by Vacirca et al. in [7]. This algorithm is conceived to process a large amount of traces from different TCP connections. Such traces are collected by a single monitoring interface, placed in the middle of a path traversed by many connections. The design philosophy of the algorithm targets strict constraints on execution speed and simplicity. However, since the algorithm is based on a monitoring point placed in the middle of the network, it cannot exploit fundamental information available in dumps collected at connection endpoints that could improve estimation accuracy. The rationale of the Vacirca algorithm is to analyze the ACK stream to identify SRTOs. In case of a Normal RTO (NRTO), a loss causes the TCP data receiver to transmit duplicate ACKs, which indicate the presence of a hole in the TCP data stream. On the contrary, in case of a SRTO, no duplicate ACKs are expected, since there is no loss. It is well known that this algorithm does not work properly in the following conditions: (i) packet loss before the monitoring interface; (ii) presence of packet reordering; (iii) small windows; (iv) no segment is sent between the original transmission that caused the RTO and its first retransmission; (v) loss of all the segments transmitted between the original transmission that caused the RTO and its first retransmission; (vi) loss of all ACKs. Some of these cases lead to erroneous conclusions, while others lead to ambiguity. Furthermore, the absence of packet reordering is a fundamental hypothesis for the validity of the Vacirca detection scheme. Our algorithm is instead aimed at the analysis of a single TCP flow, without making any assumption on the traffic characteristics.

3 Spurious Timeout Identification Algorithm: DeSRTO

In this section, we will closely examine the SRTO concept, reporting some examples of SRTOs. Then, we will explain our algorithm.

3.1 What Is a SRTO?

As is well known, every time a data packet is sent, TCP starts a timer and waits, until its expiration, for feedback indicating the delivery of the segment. The length of this timer is the retransmission timeout, i.e., the RTO. The expiration of the RTO is interpreted by TCP as an indication of packet loss. The computation of the RTO value is specified in [4]; it is based on the estimated RTT and its variance. The proper setting of the RTO is a tricky operation: if it is too long, TCP wastes a lot of time before it realizes that a segment is lost; if it is too short, segments are retransmitted unnecessarily, with a useless waste of bandwidth. As already stated, it has been shown that the RTO estimation can be unreliable in some cases, especially when connections cross cellular networks. Therefore, the retransmission procedure can be triggered unnecessarily, even in the absence of packet loss. RFC 3522 calls these RTOs spurious (i.e., there is a SRTO), and defines them as follows: a timeout is considered spurious if it would have been avoided had the sender waited longer for an acknowledgment to arrive.

Let us clarify the SRTO concept through an example. Fig. 1(a) reports a borderline case, representing the simplest possible case of SRTO. The segment in packet p1 was sent and received successfully, as was the related acknowledgment, i.e., packet p3. However, p3 arrived after the RTO expiration; therefore, the sender uselessly retransmitted in packet p2 the data contained in the payload of packet p1. Note that if the sender had waited longer, it would have received the acknowledgment contained in packet p3. Thus, as stated by the definition, here we are in the presence of a SRTO. The presented example considers a very simple scenario with a small TCP transmission window. A more complex case is reported in Fig. 1(b): the first ACK segment (the one with ACK number 100) is lost, but since ACKs are cumulative, data contained in packet p1 can be acknowledged by any of the subsequent ACKs. Therefore, the correct application of the SRTO definition requires checking that at least one ACK (among the ones sent between the reception of p1 and the reception of the retransmitted segment) is delivered successfully to the sender.

3.2 DeSRTO

DeSRTO is an algorithm we developed to detect Spurious RTOs. The main difference with respect to its counterparts is that it uses the dumps of both TCP peers, a feature that enhances the knowledge of the network behavior during a RTO event. DeSRTO discriminates SRTOs from NRTOs according to RFC 3522, by reconstructing the journey of the TCP segments involved in the RTO management.


Fig. 1. Spurious RTO examples: (a) a simple example of Spurious RTO; (b) a more complicated example of Spurious RTO. [Sender/Receiver time-sequence diagrams showing packets p1, p2, p3, the RTO interval, the instants $t^S_{RTO}$ and $t^R_{RTO}$, and RcvdPacketList]

For a specific RTO event, the algorithm needs to associate a packet containing a TCP segment with the packet(s) containing the related acknowledgment that was (were) sent in reply to the segment. Note that the ACK number refers to a flow of data and not to a specific received packet. Instead, an IP datagram containing a TCP ACK is triggered by a specific TCP segment and thus can be associated to it. Hence, in the sequel, we will adopt the expression “packet A that acknowledges packet B” referring to the specific IP datagram A that contains a TCP ACK related to the reception of packet B. It is important to highlight that the algorithm needs unambiguous pairs of packets on the sender and receiver dumps. Since we cannot rely only on the sequence number or on the ACK number of TCP segments, we make use of the identification field in the IP header, which, according to [11], is used to distinguish the fragments of one datagram from those of another.

Let us clarify the behavior of DeSRTO by applying it to the SRTO cases described in the previous section. We will start from the simplest scenario in Fig. 1(a). In this instance, DeSRTO would proceed according to the following steps:

1. Identify the packet that caused the RTO (p1) on the sender dump.
2. Find packet p1 in the receiver dump.
3. If p1 is lost, the RTO is declared Normal.
4. Otherwise, find the ACK “associated” to p1 on the receiver dump (namely p3). This packet should exist because it is transmitted by the receiver.
5. Find the packet p3 on the sender dump.
6. If p3 is not present in the sender dump (that is, it was lost), the RTO is declared Normal; otherwise it is declared Spurious.

Now, let us analyze the more complicated case in Fig. 1(b). Of course, the loss of only the first ACK is not enough to come to a conclusion. Therefore, the algorithm checks whether at least one ACK (sent between the reception of p1 and the instant $t^R_{RTO}$ of the reception of the retransmitted packet) was delivered correctly to the sender.

The aim of our algorithm is to try to detect all types of SRTOs. Through a deeper analysis of the methodology needed to distinguish all the possible RTO types, we realized that the definition used by RFC 3522 does not allow the practical identification of all possible RTO episodes. In fact, according to the definition, in order to understand what would have happened if we had “waited longer”, it is possible to check what the TCP data sender would have done until the RTO event, $t^S_{RTO}$, and what the TCP data receiver would have done until the reception of the first retransmission triggered by the RTO event, $t^R_{RTO}$. Everything that happens after these two instants depends also on how the RTO event has been handled. From $t^R_{RTO}$ on, the storyline diverges, and the check of the “what if” scenarios can result in significantly different outcomes, since the consequences of the RTO management cannot really be undone: they influence the following events in an unforeseeable way. We can check with certainty what would have happened if we had “waited longer” only as long as we do not need to undo the TCP stack management of the RTO event. To deal also with these uncertain cases, we define the concept of Butterfly-RTO as follows: a Butterfly-RTO is a RTO whose identification as SRTO or NRTO would require the check of packets at the receiver side after the instant $t^R_{RTO}$.
In order to highlight the complexity involved in dealing with a Butterfly-RTO, we can consider the RTO example in Fig. 2. Packet p1 is transmitted, but the RTO timer expires before the reception of any ACK related to it (the ACK related to packet p1 is lost). Packet p2 is the retransmitted packet, with the same sequence number interval as p1. Note that the reception of packet p2 marks the instant $t^R_{RTO}$, i.e., the instant at which the packet retransmitted by the sender reaches the receiver. The following packet (the one with sequence numbers 100:200) is lost whereas packet pB (the one with sequence numbers 200:300) is delivered correctly. However, due to reordering, it arrives after the instant $t^R_{RTO}$ and, consequently, p3, the packet that acknowledges pB, is sent after $t^R_{RTO}$. Apparently, the case depicted in Fig. 2 is a Spurious RTO because, if we had waited for p3 to arrive, we would not have sent the retransmission p2. Examining the situation more carefully, we should note that the story of packet pB could also have been different if packet p2 had never been transmitted. In other terms, the storylines of p2 and pB are coupled and there is no way to undo the effects of RTO management with absolute certainty. To further stress the concept, we remark that, if the TCP delayed ACK option is enabled, the time instant at which the ACK p3 is transmitted depends also on p2 and not only on pB. Finally, TCP implementations running at different hosts often implement RFC specifications in slightly different ways, thus making the problem even worse.

Fig. 2. An example of Butterfly RTO [Sender/Receiver time-sequence diagram showing packets p1, p2, pB, p3, two lost packets, the RTO interval, and the instants $t^S_{RTO}$ and $t^R_{RTO}$]

Actually, Butterfly-RTOs are negligible in normal network conditions, since reordering is rare and the probability of joint RTO and packet reordering events (needed for a Butterfly-RTO to occur) is very low if the RTO and reordering events are uncorrelated. Note, however, that possible network malfunctions during handover or channel management (due, for example, to bugs or incorrect settings in network devices) could cause this scenario to happen systematically.

To conclude this discussion, we describe how our algorithm tries to solve and/or classify the uncertain cases. If no reordering happens, DeSRTO searches for all ACKs that acknowledge p1 and have been transmitted before $t^R_{RTO}$. If at least one of these ACKs has been received by the sender, then the RTO is classified as spurious. Otherwise, the RTO is normal. If reordering happens and none of the ACKs transmitted before $t^R_{RTO}$ has been received, then the RTO is classified as Butterfly.

Fig. 3 shows a case taken from [7] where detection with the Vacirca tool would fail, as stated in [7] itself. The dashed line indicates the point where the monitoring interface used by the Vacirca algorithm is placed, and the sequence of packets passing through the line is the one seen by such a detection tool. In this case, the network experiences packet reordering; specifically, the packet with sequence numbers 100:200 arrives after the packet with sequence numbers 200:300. As reported in [7], the Vacirca tool will erroneously classify the SRTO as normal. In fact, the algorithm will see an ACK (the packet with ACK number 300) after the retransmission of p1 and p3; therefore, it deduces that packet p3 fills the hole in the data sequence. Our tool, instead, is not affected by packet reordering. In fact, by following the same steps (1 - 6) outlined before to explain the simpler SRTO example in Fig. 1(a), it is straightforward to show that DeSRTO is able to classify the RTO as spurious.

It is worth noting that the four examples reported in Figs. 1-3 have been pictured to provide an idea of the extreme complexity associated with SRTO detection. More SRTO scenarios can happen, depending on reordering, loss of data packets, and so on. The details of DeSRTO behavior in a general setting are presented in the pseudocode description of the algorithm (see Sec. 3.3).


Fig. 3. Example of SRTO with reordering: the Vacirca tool fails considering it as normal [Sender/Receiver time-sequence diagram showing packets p1, p2, p3, the RTO interval, RcvdPacketList, and the position of Vacirca's Monitoring Interface]

3.3 The Algorithm in Detail: The Pseudocode

Hereafter, we discuss the DeSRTO pseudocode reported in Algorithm 1. We continue to refer to Fig. 1(b) throughout the text. As a general consideration, we recall that, to unambiguously pair packets on the sender and receiver dumps, we use the identification field of the IP header [11]. To simplify the notation, we use the following conventional definitions: SndDump is the dump of the TCP flow at the sender side; RcvDump is the dump of the TCP flow at the receiver side; RTOList is the list of all the generated RTOs. Below, in the step-by-step description of the algorithm, the numbers at the beginning of each item refer to the line numbers in the pseudocode of Alg. 1. This description gives further insight into DeSRTO behavior.

1:3 Initially, RTOList contains a list of RTOs, with the timestamp of each RTO and the sequence number of the TCP segment that caused it. From this information, we can find in SndDump the packet p1 (see Fig. 1(b)) that caused the RTO. The second step requires finding p1 also on the receiver side, in RcvDump.

4:5 If packet p1 is lost, that is, if it is not present in RcvDump, the RTO is declared Normal. Note that TCP “fast retransmit” is included in this case.

6:13 The first retransmission of the segment encapsulated in the p1 datagram, straight after the RTO event, is found, namely p2. The packet p2 is also searched for at the receiver side. If p2 is not present in the receiver dump, i.e., it has been lost, the algorithm looks for the first packet transmitted after p2 that successfully arrives at the receiver.

14:21 In SndDump, all the packets sent between the transmission of p1 and the transmission of p2 are found. They are stored in the list SentPktsList. Then, DeSRTO stores in the list RcvdPktsList all the packets in SentPktsList that have been successfully received (this step requires an inspection of RcvDump). The first and the last packets in RcvdPktsList are called pm and pM, respectively. We will refer to $t_m$ as the reception instant of pm and to $t_M$ as the transmission time instant of the first ACK transmitted after the reception of pM. Note that pm and pM could be different from p1 and p2, respectively, in case of packet reordering.

22:27 All the ACKs that acknowledge p1 are searched for in RcvDump, in the time interval [$t_m$, $t_M$]. The ACKs found are saved in the list AckList, according to their transmission order.

28:43 For each ACK saved in AckList, the algorithm checks whether it was successfully received by the sender. The search stops as soon as the first successfully received ACK is found. If a received ACK is found, two cases are considered:
1. the ACK was sent before $t^R_{RTO}$;
2. the ACK was sent after $t^R_{RTO}$.
In the first case, the sender received an ACK for p1 after the RTO expiration, therefore the RTO is declared Spurious. In the second case, we have a Butterfly-RTO. Note that we also account for reordering of the ACKs on the reverse path; therefore, the check on the ACK packets is done in chronological order, starting from the first sent ACK packet up to the last one.

44:48 If none of the ACKs in AckList has been received by the sender, i.e., an entire window of ACKs is lost, two cases are considered:
– if the greatest timestamp of the packets in AckList is smaller than $t^R_{RTO}$ (i.e., $t_{p2}$), the RTO is declared Normal;
– otherwise it is declared Butterfly.

3.4 Implementation Details

To verify the effectiveness of our algorithm, DeSRTO has been written in the Python programming language. The realized tool implements exactly the pseudocode described above (the current version of the tool is v1.0-beta and it is freely available at svn://telematics.poliba.it/desrto/tags/desrto_v1). The DeSRTO tool takes as input a list of RTOs (RTOList in the pseudocode), with timestamps and sequence numbers, and the dumps of the TCP connection taken at the two peers. The list of RTOs is generated using a Linux kernel patch, included in the repository, that simply logs the sequence number and the timestamp of each RTO. Of course, other methods can be used to obtain a list of RTOs, such as a simple check for duplicate transmissions that are not preceded by 3 DUPACKs. We plan to implement this as an option in the near future. The dumps of each peer can be truncated in order to discard the TCP payload. Of course, DeSRTO requires that no packets are discarded by the kernel during capture. In fact, if the packets we look for are not found, the analysis would be wrong. An option to deal with flows that go through a NAT has been implemented. The aim of each option is to analyze most of the cases in the background, without the presence of an operator.
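The alternative way of building the RTO list mentioned above (a duplicate transmission that is not preceded by 3 DUPACKs) is only described as planned work, so the following Python fragment is just one possible reading of it, not the authors' implementation. The function name `rto_candidates` and the event format are hypothetical: the sender-side dump is assumed to have been reduced to chronological `(timestamp, kind, number)` tuples, where `kind` is `"data"` (number = first sequence number sent) or `"ack"` (number = cumulative ACK number received).

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str, int]  # (timestamp, "data" | "ack", seq or ack number)


def rto_candidates(events: Iterable[Event]) -> List[Tuple[float, int]]:
    """Return (timestamp, seq) of retransmissions not explained by fast retransmit."""
    sent = set()      # sequence numbers already transmitted at least once
    last_ack = None   # most recent cumulative ACK number seen by the sender
    dupacks = 0       # duplicates of last_ack received so far
    candidates = []
    for ts, kind, num in events:
        if kind == "ack":
            if num == last_ack:
                dupacks += 1
            else:
                last_ack, dupacks = num, 0
        elif kind == "data":
            if num in sent:
                # A retransmission of the segment requested by 3 or more
                # duplicate ACKs is attributed to fast retransmit, not to a RTO.
                fast_retransmit = (num == last_ack and dupacks >= 3)
                if not fast_retransmit:
                    candidates.append((ts, num))
            sent.add(num)
    return candidates


# Segment 0 is sent twice with no duplicate ACKs in between: RTO candidate.
timeout_trace = [(0.0, "data", 0), (1.0, "data", 0)]
print(rto_candidates(timeout_trace))
```

This sketch also flags the further go-back-N retransmissions that may follow a timeout, so a real implementation would probably keep only the first candidate of each burst.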


Algorithm 1. Pseudocode of DeSRTO

 1: for each rto in RTOList do
 2:   FIND the packet p1 that causes the rto in SndDump
 3:   FIND packet p1 in RcvDump
 4:   if p1 is Lost then
 5:     rto ← NRTO
 6:   else
 7:     FIND packet p2 in SndDump, the first retransmission of packet p1
 8:     FIND p2 in RcvDump
 9:     while p2 is Lost do
10:       FIND tmp, the first packet transmitted after p2 in SndDump
11:       p2 ← tmp
12:       FIND p2 in RcvDump
13:     end while
14:     SET tretr TO the timestamp of p2 on RcvDump
15:     GET all sent packets between p1 and p2 in SndDump, including p1 and not p2
16:     FIND the corresponding received packets in RcvDump
17:     STORE found packets in RcvdPktsList
18:     SET pm TO the first packet in RcvdPktsList in chronological order
19:     SET tm TO the timestamp of pm
20:     SET pM TO the last packet in RcvdPktsList in chronological order
21:     SET tM TO the timestamp of the first ACK transmitted after pM
22:     for EACH sent packet pa in RcvDump FROM tm TO tM do
23:       if pa acknowledges p1 then
24:         STORE pa IN AckList
25:       end if
26:     end for
27:     SET tmax TO the greatest timestamp of the packets in AckList
28:     SET tp2 TO the timestamp of p2 on RcvDump
29:     SET ACK_FOUND TO False
30:     for EACH ack packet pa in AckList do
31:       if ACK_FOUND = False then
32:         FIND packet pa in SndDump
33:         if pa is not Lost then
34:           SET ACK_FOUND TO True
35:           SET t TO the timestamp of pa on RcvDump
36:           if t > tp2 then
37:             rto ← ButterflyRTO
38:           else
39:             rto ← SRTO
40:           end if
41:         end if
42:       end if
43:     end for
44:     if ACK_FOUND = False and tmax > tp2 then
45:       rto ← ButterflyRTO
46:     else if ACK_FOUND = False then
47:       rto ← NRTO
48:     end if
49:   end if
50: end for
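The body of the loop in Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the code of the DeSRTO tool: the `Pkt` record, the function `classify_rto`, and the split of each dump into a data list and an ACK list are assumptions made for the example. Packets are paired across dumps through the IP identification field, and "pa acknowledges p1" is approximated here by a cumulative ACK number covering the end of p1. All time comparisons use receiver-side timestamps, so the two capture clocks need not be synchronized.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pkt:
    ip_id: int     # IP identification field, pairs a packet across the two dumps
    ts: float      # capture timestamp in the dump where the packet was seen
    seq: int = 0   # first sequence number (data packets)
    end: int = 0   # sequence number following the last payload byte
    ack: int = 0   # cumulative ACK number (ACK packets)


def classify_rto(p1_id: int, snd_data: List[Pkt], rcv_data: List[Pkt],
                 rcv_acks: List[Pkt], snd_acks: List[Pkt]) -> str:
    """Classify one RTO as 'NRTO', 'SRTO' or 'Butterfly' (lines 2-48 of Alg. 1)."""
    rcv_by_id = {p.ip_id: p for p in rcv_data}
    acks_seen_by_sender = {a.ip_id for a in snd_acks}

    i1 = next(i for i, p in enumerate(snd_data) if p.ip_id == p1_id)
    p1 = snd_data[i1]
    if p1.ip_id not in rcv_by_id:                        # lines 4-5
        return "NRTO"

    # Lines 7-13: first retransmission of p1; if it was lost, take the first
    # later packet that did reach the receiver.
    i2 = next((i for i in range(i1 + 1, len(snd_data))
               if snd_data[i].seq == p1.seq), None)
    if i2 is None:
        raise ValueError("no retransmission of p1 in the sender dump")
    while i2 < len(snd_data) and snd_data[i2].ip_id not in rcv_by_id:
        i2 += 1
    if i2 == len(snd_data):
        raise ValueError("no packet sent after the RTO reached the receiver")
    t_p2 = rcv_by_id[snd_data[i2].ip_id].ts              # lines 14 and 28

    # Lines 15-21: packets sent in [p1, p2) that were received, by arrival time.
    received = sorted((rcv_by_id[p.ip_id] for p in snd_data[i1:i2]
                       if p.ip_id in rcv_by_id), key=lambda p: p.ts)
    t_m, p_M = received[0].ts, received[-1]
    t_M = min((a.ts for a in rcv_acks if a.ts >= p_M.ts), default=p_M.ts)

    # Lines 22-27: ACKs covering p1 sent by the receiver within [t_m, t_M].
    ack_list = sorted((a for a in rcv_acks
                       if t_m <= a.ts <= t_M and a.ack >= p1.end),
                      key=lambda a: a.ts)

    # Lines 29-43: the first ACK that reached the sender decides the outcome.
    for a in ack_list:
        if a.ip_id in acks_seen_by_sender:
            return "Butterfly" if a.ts > t_p2 else "SRTO"

    # Lines 44-48: every ACK in the list was lost.
    t_max = max((a.ts for a in ack_list), default=float("-inf"))
    return "Butterfly" if t_max > t_p2 else "NRTO"


# Scenario of Fig. 1(a), with made-up timestamps and IP ids:
# p1 (id 1) and its ACK p3 (id 501) are both delivered, but p3 arrives late.
snd_data = [Pkt(1, 0.0, 0, 100), Pkt(2, 1.0, 0, 100)]
rcv_data = [Pkt(1, 10.2, 0, 100), Pkt(2, 11.2, 0, 100)]
rcv_acks = [Pkt(501, 10.3, ack=100), Pkt(502, 11.3, ack=100)]
snd_acks = [Pkt(501, 1.1, ack=100), Pkt(502, 1.5, ack=100)]
print(classify_rto(1, snd_data, rcv_data, rcv_acks, snd_acks))
```

Returning as soon as the first delivered ACK is found plays the role of the ACK_FOUND flag in the pseudocode.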

4 Experimental Results

To test the effectiveness of the tool and its performance, we have considered a series of TCP flows generated using iperf, a commonly used network testing tool that can create TCP data streams (http://dast.nlanr.net/iperf/), in a real 3.5G cellular network (a UMTS network with the HSPA protocol) carrying concurrent real traffic. The testbed is presented in Fig. 4. It consists of two machines running a Linux kernel patched with the Web100 patch (http://www.web100.org/), a tool that implements the TCP instruments defined in [12] and is able to log the internal TCP kernel status, and with a kernel patch we developed that logs all the RTO events (see Sec. 3.4). The developed patch has been tested by comparing the reported timeouts against the Web100 output.

Fig. 4. Experimental testbed (Mobile Host, 3G core network, Internet, Wired Host)

The first PC is connected to the Internet through a wired connection, while the second one is equipped with a UMTS (3.5G) card and is connected to the cellular network. We have generated a series of one-hour greedy long flows between the two machines using iperf. We have conducted several experiments, with flows originating from both machines, in order to test both directions of the connection. The average transfer rate was 791 kbit/s in download and 279 kbit/s in upload. No experiment experienced packet reordering. In the download case, where the UMTS-equipped machine receives data, the number of detected SRTOs is negligible, also because of the low number of RTOs (actually, most RTOs are due to retransmissions of SYN packets); in the upload case, SRTOs are more common, even if not prevalent. This behavior was expected, given the asymmetry between uplink and downlink in cellular networks (the downlink usually provides higher bandwidth, higher reliability, and smaller delays than the uplink [13]). Tabs. 1 and 2 show the number of NRTOs and SRTOs detected by DeSRTO and by the Vacirca tool in the upload and in the download cases, respectively. Note that there are no Butterfly RTOs, since no reordering was experienced in the performed experiments.

Table 1. Results reported by the Vacirca tool and DeSRTO for the upload case

       DeSRTO results          Vacirca tool results
N.     SRTO   NRTO   % SRTO    SRTO   NRTO   Ambiguous   % SRTO   % Ambiguous
1         3     23    13.0%       5    535           4     0.9%          0.7%
2         5     27    18.5%       5    536           7     0.9%          1.3%
3         5    231     2.2%      15    711          31     2.0%          4.1%
4         4    305     1.3%      23    784          43     2.7%          5.1%
5         7    151     4.6%      24    637          19     3.5%          2.8%
6         5     48    10.4%      10    502           9     1.9%          1.7%
7         4    343     1.2%      28    749          69     3.3%          8.2%
8         3     83     3.6%       4    636        1619     0.2%         71.7%
9         2      9    22.2%       3    441           2     0.7%          0.4%
10        4    108     3.7%       9    629          22     1.4%          3.3%
11        1      2    50.0%       1    348           0     0.3%          0.0%
TOT      43   1330     3.2%     127   6508        1825     1.5%         21.6%

To validate the algorithm, we have manually inspected all the RTOs that expired during the experiments and verified their correspondence with the ones revealed by DeSRTO. It is worth highlighting that DeSRTO produced no false positives or false negatives. This was an expected result, since the algorithm follows the procedure a human operator would use to find SRTOs. To validate its own algorithm, [7] uses a patched kernel that logs the timeout sequence numbers on the sender side and, on the receiver side, logs the hole in the sequence number space left by the reception of an out-of-order segment. In that paper, it is claimed that an out-of-order segment points out a loss, i.e., an NRTO, and that, therefore, all the remaining RTOs are spurious. Note that this technique is more accurate than the use of the Vacirca tool, but it is not free from errors. Besides the intuitive failure in case of reordering, where an out-of-order segment is not a lost packet, this validation technique does not consider RTOs due to lost ACKs: if a whole window of ACKs is lost, no hole is logged on the receiver, and an NRTO is wrongly believed to be spurious. Therefore, we consider the validation technique used by [7] unsuitable for our algorithm, which claims to work even in cases where that technique fails. Thus, the only possible validation technique is the manual inspection of all RTOs. Even though the Vacirca tool and DeSRTO are designed with different targets, a comparison between the two tools is needed, although some differences in results are expected. For this purpose, we used an implementation of the Vacirca tool available (http://ccr.sigcomm.org/online/?q=node/220) as a patch for tcptrace v.6.6.7. The algorithm was applied to the traces captured on the sender side (the Ethernet interface in case of download, the UMTS interface in case of upload).
Even if the location of the monitoring interface is unusual (it is not in the middle of the path), the placement is correct, since the only assumption made in [7] is that no loss occurs between the sender side and the monitoring interface. The obtained results are reported in Tabs. 1 and 2. The comparison shows many differences. The number of RTOs detected by the Vacirca tool is substantially different from the number reported by our kernel patch or, equivalently, by Web100. On average, the Vacirca-patched version of tcptrace reports about 6 times more RTOs than the kernel, with peaks of 100 times. The number of SRTOs, instead, is more similar between the two tools: even if the results are significantly different, in some cases the reported values are comparable. It is worth highlighting that the number of ambiguous RTOs reported by the Vacirca tool is sometimes very high, although no packet reordering was experienced on the network in any experiment. Unfortunately, we were not able to make any reliable hypothesis on the causes of the results obtained by the Vacirca tool. We found some issues in the use of this tool; details about these problems can be found in [14].

Table 2. Results reported by the Vacirca tool and DeSRTO for the download case

       DeSRTO results          Vacirca tool results
N.     SRTO   NRTO   % SRTO    SRTO   NRTO   Ambiguous   % SRTO   % Ambiguous
1         1      0   100.0%       0     11           0     0.0%          0.0%
2         2      0   100.0%       0     12           0     0.0%          0.0%
3         1      0   100.0%       0     18           0     0.0%          0.0%
4         2      0   100.0%       1     65           0     1.5%          0.0%
5         1      0   100.0%       0     10           0     0.0%          0.0%
6         1      0   100.0%       0      8           0     0.0%          0.0%
7         1      0   100.0%       0     10           0     0.0%          0.0%
8         1      0   100.0%       0      8           0     0.0%          0.0%
9         1      0   100.0%       0     13           0     0.0%          0.0%
10        1      0   100.0%       0      3           0     0.0%          0.0%
11        1      0   100.0%       0      6           0     0.0%          0.0%
TOT      13      0   100.0%       1    164           0     0.6%          0.0%

5 Conclusions

In this paper, DeSRTO, a new algorithm to find Spurious Retransmission Timeouts in TCP connections, has been developed. Several examples have been reported to illustrate its behavior in the presence of packet reordering, small windows, and other cases where competing approaches fail. Except for rare cases that cannot be classified with absolute certainty at all, the algorithm produces neither ambiguous nor erroneous detections. Moreover, the effectiveness of the proposed algorithm has been highlighted by applying it to TCP traces collected in a real 3.5G cellular network and by comparing its performance with that of another detection tool available in the literature. Future work will apply DeSRTO to data traces in order to analyze the presence of SRTOs and their impact in several network environments.


Acknowledgement

The authors thank Dr. F. Ricciato and his team at FTW (Vienna) for their suggestions and valuable support during this work, which was funded by projects PS-121 and DIPIS (Apulia Region, Italy) and supported by the TMA-COST action IC0703.

References

1. Jacobson, V.: Congestion avoidance and control. SIGCOMM Comput. Commun. Rev. 18(4), 314–329 (1988)
2. Allman, M., Paxson, V., Stevens, W.: RFC 2581: TCP congestion control (1999)
3. Floyd, S., Henderson, T., Gurtov, A.: RFC 3782: The NewReno modification to TCP’s fast recovery algorithm (2004)
4. Paxson, V., Allman, M.: Computing TCP’s retransmission timer (2000)
5. Ludwig, R., Meyer, M.: The Eifel detection algorithm for TCP. RFC 3522, Experimental (April 2003)
6. Sarolahti, P., Kojo, M.: Forward RTO-Recovery (F-RTO): An algorithm for detecting spurious retransmission timeouts with TCP and the stream control transmission protocol (SCTP). RFC 4138, Experimental (August 2005)
7. Vacirca, F., Ziegler, T., Hasenleithner, E.: An algorithm to detect TCP spurious timeouts and its application to operational UMTS/GPRS networks. Comput. Netw. 50(16), 2981–3001 (2006)
8. Blanton, E., Allman, M.: Using TCP duplicate selective acknowledgement (DSACKs) and stream control transmission protocol (SCTP) duplicate transmission sequence numbers (TSNs) to detect spurious retransmissions. RFC 3708, Experimental (February 2004)
9. Ludwig, R., Gurtov, A.: The Eifel response algorithm for TCP. RFC 4015, Proposed Standard (February 2005)
10. Blanton, E., Allman, M.: Using spurious retransmissions to adapt the retransmission timeout (July 2007)
11. Postel, J.B.: Internet protocol. Internet RFC 791 (September 1981)
12. Mathis, M., Heffner, J., Raghunarayan, R.: TCP extended statistics MIB. RFC 4898, Proposed Standard (May 2007)
13. Bannister, J., Mather, P., Coope, S.: Convergence Technologies for 3G Networks: IP, UMTS, EGPRS and ATM. John Wiley & Sons, Chichester (2004)
14. Barbuzzi, A.: Comparison measures between DeSRTO and Vacirca tool. Technical report (2009), available at http://telematics.poliba.it/DeSRTO_tech_rep.pdf

Uncovering Relations between Traffic Classifiers and Anomaly Detectors via Graph Theory

Romain Fontugne¹, Pierre Borgnat², Patrice Abry², and Kensuke Fukuda³

¹ The Graduate University for Advanced Studies, Tokyo, JP
² Physics Lab, CNRS, ENSL, Lyon, FR
³ National Institute of Informatics / PRESTO JST, Tokyo, JP

Abstract. Network traffic classification and anomaly detection have received much attention in the last few years. However, due to the lack of common ground truth, proposed methods are evaluated through diverse processes that are usually neither comparable nor reproducible. Our final goal is to provide a common dataset with associated ground truth resulting from the cross-validation of various algorithms. This paper deals with one of the substantial issues faced in achieving this ambitious goal: relating outputs from various algorithms. We propose a general methodology based on graph theory that relates outputs from diverse algorithms by taking into account all reported information. We validate our method by comparing results of two anomaly detectors which report traffic at different granularities. The proposed method successfully identified similarities between the outputs of the two anomaly detectors although they report distinct features of the traffic.

1 Introduction

Keeping network resources available and secure in the Internet is an unmet challenge. Hence, various network traffic classifiers and anomaly detectors (hereafter both called classifiers) have recently been proposed. However, the evaluation of these classifiers usually lacks rigor, leading to hasty conclusions [1]. Since synthetic data is widely criticized and no common labeled database (like the datasets from the DARPA Intrusion Detection Evaluation Program [2]) is available for backbone traffic, researchers analyze real data and validate their methods by manual inspection or by comparison with other methods. Our final goal is to provide a reference database by labeling the MAWI archive [3], which is a publicly available collection of real backbone traffic traces. Due to the difficulties faced in analyzing backbone traffic (e.g. lack of packet payload, asymmetric traffic), we plan to label the MAWI archive by cross-validating results from several methods based on different theoretical backgrounds. This systematic approach makes it possible to maintain an up-to-date database to which recent traffic traces are regularly added and in which labels are improved as new algorithms appear. This database aims at helping researchers by providing a ground truth relative to the state of the art.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 101–114, 2010. © Springer-Verlag Berlin Heidelberg 2010


However, we face several complicated issues in reaching our final goal. This article discusses the difficulties faced in relating outputs provided by distinct algorithms, and proposes a methodology to achieve it. This is an important first step for labeling traffic data traces. The main contribution is a general methodology to efficiently compare outputs exhibiting various granularities of the traffic. It uncovers the relations between several outputs by inspecting all information reported by classifiers and the original traffic. Also, the proposed method inherently groups similar events and makes it possible to label a large quantity of traffic at once.

1.1 Related Work

Usually, ground truth is built by hand, implying a lot of human work. Several applications have been proposed to assist humans and speed up this laborious task [4,5,6]. For example, GTVS [4] helps researchers by automating several tasks, and its authors claim that a 30-minute trace from a gigabyte link can be labeled within days. Since our purpose is to label a database containing 15-minute backbone traffic traces taken every day for 9 years (the MAWI archive), manual labeling is impractical. Alternatively, specialized network interface drivers have recently been proposed [1,7] to label traffic while packets are collected. These drivers trace each packet and retrieve the corresponding application. Although these approaches are really promising for computing reliable ground truth at Internet edges, they are not applicable to backbone traffic. Closer to our work, Moore et al. [8] proposed an application combining nine algorithms that analyze different properties of the traffic. This application successfully achieved an accurate classification on full-payload packet traces recording both link directions. TIE [9] is also an application designed to label traffic with several algorithms. It computes sessions (i.e. flows, bi-directional flows, or traffic related to a host) from the original traffic and provides them to encapsulated classifiers. The final label for each session is decided from the labels provided by each classifier. Although these two applications are similar to our work, they do not solve the general problem of relating outputs from distinct algorithms. Indeed, both applications restrict classifiers to labeling only flows, ignoring all classifiers that operate at other granularities (e.g. packet, host...) and their benefits. Thus, they only deal with flows and bypass the problem addressed in this paper. Our work provides a more general approach that makes it possible to combine results from any classifier.
This issue has only been touched upon in previous work; for example, Salgarelli et al. [10] also discuss the challenges faced by researchers in comparing the performance of classifiers and propose unified metrics to measure the quality of an algorithm. Although the need for common metrics in evaluating classifiers is crucial, we stress that these measures are not sufficient to compare classifier outputs. For instance, let A and B be two distinct classifiers with the same true positive score (as defined in [10]: the percentage of flows that the classifier labeled correctly), equal to 50% on a certain dataset. Let us assume that the combination of A and B achieves a true positive score of 75% on the same dataset; then it would be interesting to know what kind of traffic A could identify that B could not (and vice versa).
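The A and B example can be made concrete with a small sketch. The flow identifiers and the particular overlap between the two classifiers are invented here purely for illustration; only the 50%, 50% and 75% scores come from the text.

```python
# Hypothetical dataset of 100 flows, identified by integers.
flows = set(range(100))
correct_a = set(range(0, 50))     # flows classifier A labels correctly (50%)
correct_b = set(range(25, 75))    # flows classifier B labels correctly (50%)

# Combined true positive score: a flow counts if either classifier is right.
combined = correct_a | correct_b
combined_score = len(combined) / len(flows)

# What each classifier contributes that the other misses.
only_a = correct_a - correct_b
only_b = correct_b - correct_a
both = correct_a & correct_b
```

With equal individual scores, a combined score of 75% implies that a quarter of the flows are handled by A alone and a quarter by B alone; the scores by themselves do not say which flows those are, which is exactly the information a comparison of outputs has to recover.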

2 Problem Statement

Comparing outputs from several classifiers seems at first glance to be trivial, but in practice it is a baffling problem. The main issue is that classifiers report different features of the traffic that are difficult to compare systematically. Hereafter, we define an event as any classifier decision to categorize traffic (i.e. alarms from anomaly detectors or labels from traffic classifiers). Formally, an event $e$ is a set of items $e = \{t_{begin}, t_{end}, f_1, \ldots, f_h\}$, where $t_{begin}$ and $t_{end}$ are timestamps standing respectively for the beginning and the end of the identified traffic, and the other items $f_i$, $0 < i \le h$, correspond to one of the following five traffic features: {srcIP, dstIP, srcPort, dstPort, protocol}. At least one traffic feature ($0 < h$) is required to describe the identified traffic. For example, the event $e_1 = \{t_{begin}: 90\text{s},\ t_{end}: 150\text{s},\ \text{srcPort}: 80\}$ refers to one minute of traffic from source port 80. Also, the same traffic feature can occur several times in a single event. For example, the event $e_2 = \{t_{begin}: 30\text{s},\ t_{end}: 90\text{s},\ \text{srcPort}: 53,\ \text{protocol}: \text{udp},\ \text{protocol}: \text{tcp}\}$ refers to one minute of UDP or TCP traffic from port 53.

2.1 Granularity of Events

The traffic granularity of reported events results from the diverse traffic abstractions, dimensionality reductions and theoretical tools employed by classifiers. For example, in the case of anomaly detection:

– Hash-based (sketch) anomaly detectors [11,12] usually report only IP addresses and the corresponding time bin; no other information (e.g. port number) describes the identified anomalies.
– An anomaly detector based on image processing reports an event as a set of IP addresses, port numbers and timestamps corresponding to a group of packets identified in the analyzed pictures [13].
– Several intrusion detection systems take advantage of clustering techniques to identify anomalous traffic [14]. These methods classify flows into several groups and report clusters with abnormal properties. Thereby, events reported by these methods are sets of flows.

These different kinds of events provide distinct details of the traffic that are difficult to compare systematically. A simple way is to reduce all of them to a less restrictive form, namely by examining only the source or destination IP addresses (assuming that anomaly detectors report at least one IP address). Comparing only IP addresses makes it possible to determine that Event 1, Event 2 and Event 3 in Fig. 1 are similar. However, the port numbers provided by Event 2 and Event 3 indicate that these two events represent distinct traffic. Consequently, an accurate comparison of these two events must also take port numbers into account, which raises other issues. First, a heuristic is needed to make a decision when the port number is not reported (for example, when comparing Event 1 and Event 2). Second, fuzzy equality is required to compare Event 4 and Event 5 of Fig. 1. In short, inspecting the various traffic features reported by events makes the task harder, although the accuracy of the comparison increases.
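The event definition and the two levels of comparison discussed above can be sketched as follows. The Event class, the helper names, and the addresses and ports given to the three events are assumptions made for illustration (the actual contents of the events in Fig. 1 are not reproduced here); e2 is the example event from the problem statement.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Feature = Tuple[str, object]          # e.g. ("srcPort", 80)


@dataclass(frozen=True)
class Event:
    t_begin: float                    # seconds
    t_end: float                      # seconds
    features: FrozenSet[Feature]      # the same feature name may appear several times


def values(event, *names):
    """All values reported by the event for the given feature names."""
    return {value for name, value in event.features if name in names}


def same_host(a, b):
    """Coarse comparison: the events share a source or destination IP address."""
    return bool(values(a, "srcIP", "dstIP") & values(b, "srcIP", "dstIP"))


def ports_conflict(a, b):
    """True only when both events report ports and none coincide.

    Returning False when a port is missing is one possible heuristic."""
    ports_a = values(a, "srcPort", "dstPort")
    ports_b = values(b, "srcPort", "dstPort")
    return bool(ports_a) and bool(ports_b) and not (ports_a & ports_b)


# e2 from the text: one minute of UDP or TCP traffic from port 53.
e2 = Event(30, 90, frozenset({("srcPort", 53), ("protocol", "udp"), ("protocol", "tcp")}))

# Hypothetical events in the spirit of Fig. 1: same host, different detail.
host = ("srcIP", "192.0.2.1")
event1 = Event(0, 60, frozenset({host}))
event2 = Event(0, 60, frozenset({host, ("dstPort", 80)}))
event3 = Event(0, 60, frozenset({host, ("dstPort", 53)}))
```

Under the IP-only comparison all three hypothetical events look alike, while the port-aware check separates event2 from event3 and has to fall back on a heuristic for event1, which reports no port at all.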


Fig. 1. Event 1, Event 2 and Event 3 report different traffic from the same host. The same port scan is reported by two events: Event 4 identifies only a part of it (the beginning of the port range), whereas Event 5 identifies another part (the end of the port range).

Similar problems arise in the case of traffic classification, where different entities are labeled:

– Usually, flows are directly labeled (e.g. based on clustering techniques [15,16,17]).
– BLINC [18], in contrast, decides a label for a source (IP address, source port) based on its connection pattern. Also, a recent method [19] directly labels hosts without any traffic information, by collecting and analyzing information freely available on the web.

Thus, researchers face difficulties in comparing events standing for flows with events representing hosts. A common approach is to apply the same label to all flows initiated from the host reported by an event, so that only flows are compared [16]. Unfortunately, this way of comparing these two kinds of traffic classifiers leads to erroneous results. For example, if a host is reported by a classifier as a web client, then all its corresponding flows are cast as HTTP. A simple port-based method also classifies most of these flows as HTTP, but a few of them are labeled as DNS. In this case we can conclude neither that the port-based method misclassified the DNS flows nor that the other classifier failed in classifying this host. Clearly, the transition from an event representing a host to its corresponding flows introduces errors. More sophisticated mechanisms are required to handle these two concepts (flow and host), whereas the synergy between them might provide accurate traffic classification.

2.2 Traffic Assortment

Recent applications and network attacks tend to be distributed over the network and composed of numerous flows. Therefore, classifiers labeling flows output an excessive number of events for a single instance of distributed traffic. Regardless of the quantity of traffic involved in a single attack, or instance of an application, the whole amount of generated traffic should be handled and annotated as a single entity. In that way, traffic annotations are clarified and highlight the connection patterns of hosts. In some cases, finding these similarities between events requires retrieving the original traffic. For example, let X be an event corresponding to traffic emitted from a single host, and Y an event representing traffic received by another host. X and Y can represent exactly the same traffic but from two different points of view: one reports the source whereas the other one reports the destination of the traffic. The only way to verify whether these events are related to each other is to investigate the analyzed traffic. If all traffic reported by X is also reported by Y, then we conclude that they are strongly related. Obviously, a quantitative measure is also needed to accurately score their similarities.

3 The Proposed Method

We propose a method to relate several events of different granularities by analyzing all their details. The main idea underlying our approach is to discover the relations among events from the original traffic (oracle in Fig. 2) and to represent all events and their relations as a graph (graph generator in Fig. 2). Afterwards, coherent groups of similar events are identified in the graph with an algorithm that finds community structure (community mining in Fig. 2).

Fig. 2. Overview of the proposed method
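As a rough sketch of the first two stages of Fig. 2, the code below applies the matching rule and the edge weight that Sections 3.1 and 3.2 define: a packet matches an event when its timestamp falls in the event's time window and they share at least one feature, and the weight of an edge is the number of packets matching both events divided by the smaller of the two per-event packet counts. The dictionary layout, the function names and the sample traffic are assumptions of this sketch, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def oracle(packet, events):
    """Return R_p: the names of all events relevant to the packet."""
    return sorted(
        name for name, ev in events.items()
        if ev["t_begin"] <= packet["t"] <= ev["t_end"]
        and ev["features"] & packet["features"]
    )


def edge_weights(packets, events):
    """Build the weighted edges w(ex, ey) = c(ex, ey) / min(c(ex), c(ey))."""
    alone = Counter()        # c(e): number of lists R_p containing e
    together = Counter()     # c(ex, ey): number of lists containing both
    for p in packets:
        r_p = oracle(p, events)
        alone.update(r_p)
        together.update(combinations(r_p, 2))
    return {pair: n / min(alone[pair[0]], alone[pair[1]])
            for pair, n in together.items()}


def pkt(t, src, dst):
    """Hypothetical packet: a timestamp and five traffic features."""
    return {"t": t, "features": {("srcIP", src), ("dstIP", dst),
                                 ("srcPort", 40000), ("dstPort", 445),
                                 ("protocol", "tcp")}}


# X reports a source host, Y reports a destination host (invented addresses).
events = {
    "X": {"t_begin": 0, "t_end": 60, "features": {("srcIP", "192.0.2.1")}},
    "Y": {"t_begin": 0, "t_end": 60, "features": {("dstIP", "198.51.100.7")}},
}
packets = (
    [pkt(t, "192.0.2.1", "198.51.100.7") for t in (1, 2, 3)]
    + [pkt(4, "192.0.2.1", "203.0.113.9")]
    + [pkt(t, "203.0.113.9", "198.51.100.7") for t in (5, 6)]
)
weights = edge_weights(packets, events)
```

Here X matches four packets, Y matches five, and three packets match both, so the single edge between X and Y receives the weight 3/4. Dividing by the smaller count means that an event fully contained in another one gets weight 1 regardless of how much extra traffic the larger event covers.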

3.1 Oracle

The oracle is the interface binding the outputs of specific classifiers to our general methodology. Its role is to retrieve the relation between the original traffic and the reported events. It accepts a query in the form of a packet $p$ and returns a list of events, $R_p = \{e_{p_0}, \ldots, e_{p_n}\}$, consisting of all events from every classifier that are relevant to the query. Formally, let a packet $p$ be a set of five distinct traffic features and a timestamp $t_p$, $p = \{t_p, f_{p_1}, \ldots, f_{p_5}\}$; then $R_p$ consists of all events $e = \{t_{begin}, t_{end}, f_1, \ldots, f_h\}$ where $t_{begin} \le t_p \le t_{end}$ and $\exists f_j,\ f_j = f_{p_i}$, with $0 < i \le 5$ and $0 < j \le h$. Queries are generated for each packet of the original traffic; thereby the oracle produces the lists of events matching all packets, $R = \{R_{p_1}, \ldots, R_{p_m}\}$ ($m$ is the total number of packets).

3.2 Graph Generator

The graph generator collects all responses from the oracle and builds a graph highlighting event similarities. Nodes of the graph represent the events, and those appearing in the same list returned by the oracle are connected to each other by edges. Thus, for any edge $(e_x, e_y)$ of the graph there is at least one list provided by the oracle, $R_{p_z} \in R$, in which the two connected nodes appear: $e_x, e_y \in R_{p_z}$. Weights of edges quantify the similarities of events based on the quantity of traffic they have in common. Let $c(e_1, \ldots, e_n)$ be a function returning the number of lists, $R_{p_z} \in R$, in which all events given as parameters appear together, $e_1, \ldots, e_n \in R_{p_z}$. Then the weight of an edge $(e_x, e_y)$ is computed with the following equation:

$$w(e_x, e_y) = \frac{c(e_x, e_y)}{\min(c(e_x), c(e_y))}$$

$w$ ranges over $(0, 1]$; 1 means that events are strongly related, whereas values close to 0 represent weak relationships. A characteristic of the graphs built by the graph generator is that connected components stand for sets of events representing common traffic. Also, connected components consist of sparse and dense parts; hereafter, we define a community as a coherent group of nodes representing similar events.

3.3 Community Mining

The next step is to find community structure [20], that is, to identify coherent groups of similar events within the connected components of the graphs constructed by the graph generator. Although many kinds of community structure algorithms have been proposed, we are only interested in those based on modularity, because versions exist that perform fast on sparse graphs [21].

Modularity. Newman and Girvan proposed a metric for evaluating the strength of a given community structure [20] based on inter- and intra-community connections; this metric is called the modularity. The main idea underlying the modularity is that the fraction (or proportion) of edges connecting nodes of a single community is expected to be higher than the value of the same quantity in a graph with the same number of nodes but randomly connected. Let $e_{ij}$ be the fraction of edges connecting nodes of community $i$ to those of community $j$, such that $e_{ii}$ is the fraction of edges within community $i$. Thus, $\sum_i e_{ii}$ is the total fraction of edges connecting nodes of the same community. This value highlights the connections within communities; a large value represents a good division of the graph into communities. However, it takes its maximum value, 1, in the particularly meaningless case in which all nodes are grouped in a single community. Newman et al. enhanced this measure by subtracting from it the value it would take if edges were randomly placed. We denote by $a_i = \sum_j e_{ij}$ the fraction of all edges attached to nodes in community $i$. If the edges are placed at random, the fraction of edges that link nodes within community $i$ is $a_i^2$. The modularity is defined as $Q = \sum_i (e_{ii} - a_i^2)$. If the fractions of edges within communities are similar to those expected in a randomized graph, then the modularity score will be 0, whereas $Q = 1$ indicates graphs with strong community structure.
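The definition of Q can be checked on a small example. The sketch below computes the modularity of an unweighted, undirected graph for a given assignment of nodes to communities; the function name and the toy graph (two triangles joined by one edge) are illustrative assumptions. An edge between two communities is counted half in $e_{ij}$ and half in $e_{ji}$, so that $a_i$ is the share of edge endpoints lying in community $i$.

```python
from collections import Counter


def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    edges: list of (u, v) pairs; community: dict mapping node -> label."""
    m = len(edges)
    inside = Counter()       # number of edges with both ends in community i
    endpoints = Counter()    # number of edge endpoints in community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        endpoints[cu] += 1
        endpoints[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (endpoints[c] / (2 * m)) ** 2 for c in endpoints)


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles gives Q = 2 (3/7 - 1/4) = 5/14, whereas putting every node in one community gives Q = 0 even though all edges are then internal, which is the degenerate case the subtraction of $a_i^2$ is meant to penalize.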
Since Q represents the quality of community structure in a graph, researchers investigated this metric to efficiently partition graphs into communities.

Finding communities. Blondel et al. proposed an algorithm [21] that finds community structure by optimizing the modularity in an agglomerative manner.


Their algorithm starts by assigning a community to each node of the graph; then the following step is repeated iteratively. For each community i, the gain of modularity obtained by merging it with one of its neighbors j is evaluated. The merge of i and j is done for the maximum gain, but only if this gain is positive. Otherwise, i is not merged with other communities. Once all nodes have been examined, a new graph is built, and this process is repeated until no merge can be done. The authors claim that the computational complexity of their algorithm is linear on typical and sparse data. In their experiments, Blondel et al. successfully analyzed a graph with 118 million nodes and 1 billion edges in 152 minutes. The performance of this algorithm allows us to compare thousands of events in a very short time frame (on the order of seconds).
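The agglomerative idea can be illustrated with a deliberately naive sketch. It is not the algorithm of [21]: it merges only the single best pair of connected communities per pass, recomputes all counts from scratch, and omits the graph-rebuilding phase, so it has none of the speed of the original. The gain of merging communities i and j follows from the definition of Q and equals $L_{ij}/m - 2 a_i a_j$, where $L_{ij}$ is the number of edges between them and $m$ the total number of edges. The function name and the toy graph are assumptions.

```python
from collections import Counter


def greedy_communities(edges):
    """Repeatedly merge the pair of connected communities with the largest
    positive modularity gain; stop when no merge improves Q."""
    m = len(edges)
    label = {node: node for edge in edges for node in edge}   # one community per node
    while True:
        endpoints = Counter()    # edge endpoints per community (a_i = endpoints / 2m)
        between = Counter()      # edges between each pair of distinct communities
        for u, v in edges:
            cu, cv = label[u], label[v]
            endpoints[cu] += 1
            endpoints[cv] += 1
            if cu != cv:
                between[tuple(sorted((cu, cv)))] += 1
        best_gain, best_pair = 0.0, None
        for (ci, cj), l_ij in between.items():
            a_i = endpoints[ci] / (2 * m)
            a_j = endpoints[cj] / (2 * m)
            gain = l_ij / m - 2 * a_i * a_j
            if gain > best_gain:
                best_gain, best_pair = gain, (ci, cj)
        if best_pair is None:
            return label
        keep, drop = best_pair
        for node, c in label.items():
            if c == drop:
                label[node] = keep


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = greedy_communities(edges)
```

On this toy graph the merges stop once each triangle forms its own community, because joining the two triangles across the single bridging edge would lower the modularity.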

4 Evaluation

4.1 Data and Processing

The proposed method is preliminarily evaluated by comparing the results of two anomaly detectors based on different theoretical backgrounds. One consists of random projection techniques (sketches) and multiresolution gamma modeling [11]; hereafter we call it the gamma-based method. In a nutshell, the traffic is split into sketches and modeled using Gamma laws at several time scales. Anomalous traffic is detected by reporting too large distances from an adaptively computed reference. The sketches are computed twice: the traffic is hashed on source addresses and on destination addresses. Thus, when anomalies are detected, this method reports the corresponding source or destination address within a certain time bin. The other anomaly detector is based on an image processing technique called the Hough transform [13] (we call it the Hough-based method). Traffic is monitored in 2-D scatter plots where each point represents packets and anomalous traffic appears as “lines”. Anomalies are extracted with a line detector (the Hough transform) and the original data are retrieved from the identified points. The output of this method is an aggregated set of packets. These two anomaly detectors were tested on a pcap file of the MAWI archive containing 15 minutes of traffic taken at a trans-Pacific link between Japan and the US (Samplepoint-B, 2004/08/01), corresponding to a period of Sasser outbreak. In practice, the output of these two anomaly detectors is in admd¹ form, which is an XML schema that allows traffic to be annotated in an easy and flexible way. Hence, we implemented an oracle able to read any admd and pcap file to compare results from both methods.

¹ Meta-data format and associated tools for the analysis of pcap data: http://admd.sourceforge.net

4.2 Results

In our experiments, 332 events have been reported by the gamma-based method and 873 by the Hough-based one, of which respectively 235 and 247 events have been merged by our method. The resulting graph consists of 124 connected components (we do not consider components with a single event); we present some typical graph structures in this section. Note that we use the following legend for Figs. 3-7. Gray rectangles represent the separation into communities, green ellipses are events reported by the Hough-based method, and red rectangles are events reported by the gamma-based method. The labels of events are displayed as IPaddress direction;nbPackets, where IPaddress is the IP address reported by the gamma-based method or the prominent IP address of the traffic reported by the Hough-based method; direction is a letter, s or d, indicating whether the identified hosts are rather the sources or the destinations of the reported traffic; and nbPackets is the number of packets in the traffic trace that match the event. We emphasize that the IP addresses shown in these labels are only provided to improve the readability of the figures. They are not the only information considered in the oracle decisions (the gamma-based method also reports timestamps, and the Hough-based method can provide several IP addresses, port numbers, timestamps, and protocols). The label of an edge linking two events is the weight of the edge $w$ and the number of packets matching both events.

Fig. 3. Two simple connected components with two similar events. (a) Both methods detect the same host infected by the Sasser worm. (b) The gamma-based method reports the destination of anomalous traffic whereas the Hough-based one reports the source of it.

Simple connected components. Figure 3 consists of two examples of the simplest connected components built by our method. Figure 3(a) stands for the same Sasser activity reported by both methods. Since the two methods reported anomalous traffic from the same host, the events have obviously been grouped together. The single edge between the two events represents numerous incomplete flows that can be labeled as the same malicious activity. Figure 3(b) displays two events reporting different hosts; one event describes anomalous traffic sent from a source IP (172.92.103.79), whereas the other one exhibits abnormal traffic received by another host (210.133.66.52).
Their relationship is uncovered by the original traffic, all packets (except one) initiated from the source host have been sent to the identified host destination. This connected component illustrates the case where both anomaly detectors report the same


Fig. 4. RSync traffic identified by 5 events

traffic but from different points of view: one identified its source whereas the other emphasized its destination. In our experiments, we observed 86 connected components containing only 2 events (like those depicted in Fig. 3), where the two linked events are sometimes reported by the same anomaly detector.

Large connected components. The proposed method found 38 connected components that consist of more than 2 events. For example, Fig. 4 shows 5 events grouped in a tightly connected component. All these events report abnormally high traffic volume, and manual inspection of the packet headers revealed that they are all RSync traffic. Three hosts are involved in these events, and the structure of the component helps us understand their connection pattern. The edge weights indicate that these events are closely related. Thus, these 5 events are grouped as one community that is reported only once. Figure 5 depicts a connected component consisting of 29 events; 27 are from the gamma-based method output and 2 from the Hough-based one. All these events report abnormal DNS traffic. The event on the right-hand side of Fig. 5 and the one on the left-hand side (both labeled 200.24.119.113) represent traffic from a DNS server. This server is reported by both methods because it replies to numerous requests during the whole traffic trace. The other events shown in Fig. 5 represent the main clients soliciting this service. By grouping all these events together, our method makes it possible to report the flooded server and the blamed clients at the same time. By contrast, when analyzing individually the events raised by clients, one may miss the distributed nature of this network activity (similar to DDoS, flash crowd, or botnet activity) and misinterpret each event.
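The grouping of events into connected components can be illustrated with a minimal sketch. It assumes that an edge exists between two events as soon as at least one packet matches both of them; the event identifiers and packet counts below are hypothetical and are not taken from the traces.

```python
from collections import defaultdict

def connected_components(events, edges):
    """Group events into connected components.

    events: iterable of hashable event identifiers
    edges:  iterable of (event_a, event_b, shared_packets) tuples
    """
    adjacency = defaultdict(set)
    for a, b, shared_packets in edges:
        if shared_packets > 0:  # assumed rule: one shared packet is enough
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, components = set(), []
    for start in events:
        if start in seen:
            continue
        # Depth-first traversal collecting every event reachable from start.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

# Hypothetical events (detector:host;direction) and shared-packet counts.
events = ["gamma:A;s", "hough:A;s", "gamma:B;d", "hough:C;s"]
edges = [("gamma:A;s", "hough:A;s", 120), ("hough:A;s", "gamma:B;d", 119)]

components = connected_components(events, edges)
# Components with a single event are not considered, as in the text.
multi_event = [c for c in components if len(c) > 1]
```

Here the three related events end up in one component, while the isolated event forms a single-event component that is discarded.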


Fig. 5. DNS traffic reported by many events


Communities in connected components. In the examples presented above, the algorithm finding community structure (see Section 3.3) identified each connected component as a single community. Nevertheless, our method found 11 connected components that are split into several communities (e.g., Figs. 6 and 7); the smallest contains 5 events grouped in 2 communities, and the largest consists of 47 events clustered in 8 communities. These connected components stand for distinct network traffics that are linked by loose events (i.e., events reporting only one traffic feature). Fortunately, the algorithm finding community structure succeeded in cutting the connected components into coherent groups of events. An example of a connected component representing two communities is depicted in Fig. 6. The community on the left-hand side of Fig. 6 stands for high-volume traffic directed to port 3128 (proxy server), whereas the community on the right-hand side represents nntp traffic between two hosts. A single packet is responsible for connecting two events from the two communities. It is a TCP SYN packet sent from the main host of the left-hand side community to port 3128 of a host belonging to the other community. This is the only traffic observed on port 3128 for the latter host. The proposed method successfully dissociates the two sets of events, which have no similarities, so that they can be handled separately. Figure 7 depicts another example of a connected component split into several communities, this time involving 14 events grouped in 5 communities. All events report uncommon HTTP traffic among numerous hosts. Although all events are connected together, the edge weights emphasize several dense sets of events. By analyzing the edge weights and the node degrees, the algorithm finding community structure successfully detected these coherent groups of events.

4.3 Discussion

The proposed method enabled us to compare outputs from different kinds of classifiers (e.g., host classification and flow classification), and fulfills our requirement to combine results from many classifiers. Our method is also useful for inspecting the output of a single method. For example, the gamma-based method inherently reports either the source or the destination address of anomalous traffic, but both are sometimes reported in two distinct events. Let T be an anomalous traffic between hosts A and B raising two events e_x = {t_begin: X, t_end: Y, srcIP: A} and e_y = {t_begin: X, t_end: Y, dstIP: B}. The proposed method merges these events because the packets p ∈ T are of the form p = {t_p: Z, srcIP: A, dstIP: B, ...} with X ≤ Z ≤ Y. In our experiments, our method merged 27 events (see Fig. 5) reported by the gamma-based method, increasing the quality of the reported events and reducing the size of the output. Another benefit of the proposed method is that it helps researchers understand the different results of their algorithms. For instance, while developing an anomaly detector, researchers commonly face the problem of tuning its parameter set. They therefore usually run their application with numerous parameter settings, and the best parameter set is selected by looking at the highest


Fig. 6. Connected component standing for two distinct traffics

Fig. 7. HTTP traffic represented by a large connected component split in 5 communities

detection rate. Although this process is commonly accepted by the community, a crucial issue remains. For instance, a parameter set A may give a detection rate similar to that obtained with a parameter set B, but a deeper analysis of the reported events may show that B is more effective for a certain kind of anomaly not detectable with the parameter set A (and vice versa). Deciding whether A or B is the better parameter set is then not straightforward. This interesting case


is not solved by simply comparing detection rates. The overlap of both outputs, as exhibited by our method, would help us first to identify the conditions under which each parameter set is more effective, and second to make the methods collaborate.
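The overlap between two outputs can be sketched with the packet-matching rule stated earlier in this discussion: a packet matches an event when its timestamp lies in the event's time window and every traffic feature named by the event has the same value in the packet. The code below is only an illustration under that assumption; the field names, events, and packets are hypothetical.

```python
def matches(packet, event):
    """True if the packet lies in the event's time window and agrees with
    every traffic feature (e.g. srcIP, dstIP) named by the event."""
    if not (event["t_begin"] <= packet["t"] <= event["t_end"]):
        return False
    return all(packet.get(key) == value
               for key, value in event.items()
               if key not in ("t_begin", "t_end"))

def split_by_overlap(packets, events_a, events_b):
    """Return the indices of events in events_a that share at least one
    packet with some event of events_b, and the indices of those that do not."""
    shared, only_a = [], []
    for index, event_a in enumerate(events_a):
        related = any(matches(p, event_a) and matches(p, event_b)
                      for event_b in events_b for p in packets)
        (shared if related else only_a).append(index)
    return shared, only_a

# Hypothetical outputs obtained with two parameter sets A and B.
events_a = [{"t_begin": 10, "t_end": 20, "srcIP": "A"},
            {"t_begin": 10, "t_end": 20, "srcIP": "D"}]
events_b = [{"t_begin": 10, "t_end": 20, "dstIP": "B"}]
packets = [
    {"t": 12, "srcIP": "A", "dstIP": "B"},  # matches events_a[0] and events_b[0]
    {"t": 14, "srcIP": "D", "dstIP": "E"},  # matches events_a[1] only
    {"t": 25, "srcIP": "A", "dstIP": "B"},  # outside every time window
]

shared, only_a = split_by_overlap(packets, events_a, events_b)
```

Events listed in `only_a` are candidates for anomalies that only parameter set A detects, which is the kind of information a plain detection rate hides.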

5 Conclusion

This article first raised the difficulties of relating the outputs of different classifiers. We proposed a methodology to relate reported events even though they are expressed in different ways and represent distinct granularities of the traffic. Our approach relies on the abstraction level of graph theory: graphs are generated from the events and the original traffic to uncover the similarities between events. An algorithm finding community structure then distinguishes coherent sets of nodes in the graph, which stand for sets of similar events. A preliminary evaluation highlighted the flexibility of our method and its effectiveness in clustering events reported by different anomaly detectors. The proposed methodology is a first step in our process to build a common database of annotated backbone traffic. More analyses are needed to better understand the basic ability of the proposed method with different datasets and classifiers. In future work we will also adopt a strategy that takes into account the nature of the classifiers to decide the final label annotating the traffic represented by a set of events.

Acknowledgments

We would like to thank V.D. Blondel et al. for having provided us with the source code of their community structure finding algorithm. This work is partially supported by MIC SCOPE.

References

1. Szabó, G., Orincsay, D., Malomsoky, S., Szabó, I.: On the validation of traffic classification algorithms. In: Claypool, M., Uhlig, S. (eds.) PAM 2008. LNCS, vol. 4979, pp. 72–81. Springer, Heidelberg (2008)
2. Lippmann, R., Haines, J.W., Fried, D.J., Korba, J., Das, K.: The 1999 DARPA off-line intrusion detection evaluation. Computer Networks 34(4), 579–595 (2000)
3. Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the WIDE project. In: USENIX 2000 Annual Technical Conference: FREENIX Track, June 2000, pp. 263–270 (2000)
4. Canini, M., Li, W., Moore, A.W., Bolla, R.: GTVS: Boosting the collection of application traffic ground truth. In: Papadopouli, M., Owezarski, P., Pras, A. (eds.) TMA 2009. LNCS, vol. 5537, pp. 54–63. Springer, Heidelberg (2009)
5. Ringberg, H., Soule, A., Rexford, J.: WebClass: adding rigor to manual labeling of traffic anomalies. SIGCOMM CCR 38(1), 35–38 (2008)
6. Fontugne, R., Hirotsu, T., Fukuda, K.: A visualization tool for exploring multi-scale network traffic anomalies. In: SPECTS 2009, pp. 274–281 (2009)
7. Gringoli, F., Salgarelli, L., Cascarano, N., Risso, F., Claffy, K.C.: GT: Picking up the truth from the ground for internet traffic. SIGCOMM CCR 39(5), 13–18 (2009)
8. Moore, A.W., Papagiannaki, K.: Toward the accurate identification of network applications. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 41–54. Springer, Heidelberg (2005)
9. Dainotti, A., Donato, W., Pescapé, A.: TIE: A community-oriented traffic classification platform. In: Papadopouli, M., Owezarski, P., Pras, A. (eds.) TMA 2009. LNCS, vol. 5537, pp. 64–74. Springer, Heidelberg (2009)
10. Salgarelli, L., Gringoli, F., Karagiannis, T.: Comparing traffic classifiers. SIGCOMM CCR 37(3), 65–68 (2007)
11. Dewaele, G., Fukuda, K., Borgnat, P., Abry, P., Cho, K.: Extracting hidden anomalies using sketch and non Gaussian multiresolution statistical detection procedures. In: SIGCOMM LSAD 2007, pp. 145–152 (2007)
12. Li, X., Bian, F., Crovella, M., Diot, C., Govindan, R., Iannaccone, G., Lakhina, A.: Detection and identification of network anomalies using sketch subspaces. In: SIGCOMM 2006, pp. 147–152 (2006)
13. Fontugne, R., Himura, Y., Fukuda, K.: Evaluation of anomaly detection method based on pattern recognition. IEICE Trans. on Commun. E93-B(2) (February 2010)
14. Sadoddin, R., Ghorbani, A.A.: A comparative study of unsupervised machine learning and data mining techniques for intrusion detection. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 404–418. Springer, Heidelberg (2007)
15. Erman, J., Arlitt, M., Mahanti, A.: Traffic classification using clustering algorithms. In: SIGCOMM MineNet 2006, pp. 281–286 (2006)
16. Kim, H.-C., Claffy, K., Fomenkov, M., Barman, D., Faloutsos, M., Lee, K.: Internet traffic classification demystified: Myths, caveats, and the best practices. In: CoNEXT 2008 (2008)
17. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: CoNEXT 2006, pp. 1–12 (2006)
18. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. In: SIGCOMM 2005, vol. 35(4) (2005)
19. Trestian, I., Ranjan, S., Kuzmanovic, A., Nucci, A.: Unconstrained endpoint profiling (googling the internet). In: SIGCOMM 2008 (2008)
20. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
21. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. (2008)

Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers

Alessandro Finamore¹, Michela Meo¹, Dario Rossi², and Silvio Valenti²

¹ Politecnico di Torino, Italy
² TELECOM Paristech, France

Abstract. In the last few years the research community has proposed several techniques for network traffic classification. While the performance of these methods is promising, especially for specific classes of traffic and particular network conditions, the lack of accurate comparisons among them makes it difficult to choose the most suitable technique for given needs. Motivated also by the increase of P2P-TV traffic, this work compares Abacus, a novel behavioral classification algorithm specific to P2P-TV traffic, and Kiss, an extremely accurate statistical payload-based classifier. We first evaluate their performance on a common set of traces, and then analyze their requirements in terms of both memory occupation and CPU consumption. Our results show that the behavioral classifier can be as accurate as the payload-based one at a substantially lower computational cost, although it can deal only with a very specific type of traffic.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 115–126, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

In recent years, Internet traffic classification has attracted a lot of attention from the research community. This interest is motivated mainly by two reasons. First, accurate traffic classification allows network operators to perform many fundamental activities, e.g., network provisioning, traffic shaping, QoS and lawful interception. Second, traditional classification methods, which rely on either well-known ports or packet payload inspection, have become unable to cope with modern applications (e.g., peer-to-peer) or with the increasing speed of modern networks [1,2].

Researchers have proposed many innovative techniques to address this problem. Most of them exploit statistical properties of the traffic generated by different applications at the flow or host level. These novel methods have the advantage of requiring fewer resources while still being able to identify applications which do not use well-known ports or which exploit encrypted or closed protocols. However, the lack of accurate and detailed comparisons discourages their adoption. Since each study tests its own algorithm on a different set of traces, under different conditions and often using different metrics, it is really difficult for a network operator to identify which methods best fit its needs.

In this paper, we face this problem by comparing two traffic classifiers, one of which is specifically targeted at P2P-TV traffic. These applications, which are rapidly gaining a very large audience, are characterized by a P2P infrastructure providing a live streaming video service. As the next generation of P2P-TV services is beginning to offer HD content, the volume of traffic they generate is expected to grow even further, so that their identification is particularly interesting for network operators.

The two considered classifiers are the first ones to handle this kind of traffic, and they exploit original and quite orthogonal approaches. Kiss [3] is a statistical payload-based classifier, which bases the classification on the examination of the first bytes of the application-layer payload. It has already been compared with other classifiers in [4], proving to be the best one for this specific class of traffic. Abacus [5], instead, is a behavioral classifier, which derives a statistical representation of the traffic patterns generated by a host by simply counting the number of packets and bytes exchanged with other peers during small time windows. This simple approach can capture several distinctive properties of different applications, allowing their classification.

We test the two techniques on a common set of traces, evaluating their accuracy in terms of both true positives (i.e., correct classification of P2P-TV traffic) and true negatives (i.e., correct identification of traffic other than P2P-TV). We also provide a detailed comparison of their features, focusing mostly on the differences which stem from the two approaches. Moreover, we formally investigate the computational complexity by comparing the memory occupation and the computational costs.

Results show that Abacus achieves practically the same performance as Kiss, and both classifiers exceed 99% of correctly classified bytes for P2P-TV traffic. Abacus exhibits some problems in terms of flow accuracy for one specific application, for which it still has a high byte-wise accuracy. The two algorithms are also very effective when dealing with non P2P-TV traffic, raising a negligible number of false alarms. Finally, we found that Abacus outperforms Kiss in terms of computational complexity, while Kiss is a much more general classifier, able to work with a wider range of protocols and network conditions.

The paper is organized as follows. In Sec. 2 we present some work related to ours. In Sec. 3 we briefly present the two techniques under examination, then in Sec. 4 we test them on a common set of traces and compare their performance. We proceed with a more qualitative comparison of the classifiers in Sec. 5, as well as an evaluation of their computational cost. Finally, Sec. 6 concludes the paper.

2 Related Work

Recently, many works have focused on the problem of traffic classification. In fact, traditional techniques like port-based classification or deep packet inspection appear more and more inadequate to deal with modern networks and applications [1,2]. Therefore the research community has proposed a rather large number of innovative solutions, which consist notably of several statistical flow-based approaches [6,7,8] and of fewer host-based behavioral techniques [9,10]. The heterogeneity of these approaches, the lack of a common dataset and the lack of a widely approved methodology make a fair and comprehensive comparison of these methods a daunting task [11]. In fact, to date, most of the comparison effort has addressed the investigation of different machine learning techniques [6,7,8], using the same set of features and the same set of traces.


More recently, a few works have specifically taken into account the comparison problem [12,13,14]. The authors of [12] present a qualitative overview of several machine learning based classification algorithms. On the other hand, in [13] the authors compare three different approaches (i.e., based on signatures, flow statistics and host behavior) on the same set of traces, highlighting both advantages and limitations of the examined methods. A similar study is carried out in [14], where the authors evaluate the spatial and temporal portability of a port-based, a DPI and a flow-based classifier. The work presented in this paper follows the same direction as these comparative studies, but focuses only on P2P-TV applications. In fact, P2P-TV has been attracting many users in the last few years, and consequently also much consideration from the research community. However, works on this topic consist mainly of measurement studies of P2P-TV application performance in real networks [15,16]. The two classifiers compared are the only ones proven to correctly identify this type of traffic. Kiss was already contrasted with a DPI and a flow-based classification algorithm in [4], proving itself the most accurate for this class of traffic. Moreover, in our study we also take into account the computational cost and memory occupation of the algorithms under comparison.

3 Classification Algorithms

This section briefly introduces the two classifiers. Here we focus our attention on the aspects most relevant to a comparison, while we refer the interested reader to [5] and [3] for further details and discussion of parameter settings.

Both Kiss and Abacus employ supervised machine learning as their decision process, in particular the Support Vector Machine (SVM) [17], which has already been proved particularly suited for traffic classification [13]. In the SVM context, entities to be classified are described by an ordered set of features, which can be interpreted as coordinates of points in a multidimensional space. Kiss and Abacus differ in the choice of the features. The SVM must be trained with a set of previously labeled points, commonly referred to as the training set. During the training phase, the SVM basically defines a mapping between the original feature space and a new space, usually characterized by a higher dimensionality, where the training points can be separated by hyperplanes. In this way, the target space is subdivided into areas, each associated with a specific class. During the classification phase, a point is classified simply by looking for the region which best fits it.

Before proceeding with the description of the classifiers, it is worth analyzing their common assumptions. First of all, they both classify endpoints, i.e., pairs (IP address, transport-layer port) on which a given application is running. Second, they currently work only on UDP traffic, since this is the transport-layer protocol generally chosen by P2P-TV applications. Finally, given that they rely on a machine learning process, they follow a similar procedure to perform the classification. As a first step, the engines derive a signature vector from the analysis of the traffic relative to the endpoint they are classifying. Then, they feed the vector to the trained SVM, which in turn gives the classification result. Once an endpoint has been identified, all the flows which have that endpoint as source or destination are labeled as being generated by the identified application.
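The last step, propagating an endpoint label to its flows, can be sketched as follows. This is only an illustration: the endpoints and flows are hypothetical, and giving precedence to the source endpoint when both ends are labeled is an assumption that the text does not specify.

```python
def label_flows(flows, endpoint_labels):
    """Label each flow with the application of its source or destination endpoint.

    flows:           list of (src_endpoint, dst_endpoint) pairs, where an
                     endpoint is an (IP address, transport-layer port) tuple
    endpoint_labels: dict mapping classified endpoints to application names
    """
    labeled = []
    for src, dst in flows:
        # Assumption: the source endpoint takes precedence if both are labeled.
        label = endpoint_labels.get(src) or endpoint_labels.get(dst)
        labeled.append((src, dst, label))
    return labeled

# Hypothetical classification outcome for a single endpoint.
endpoint_labels = {("192.0.2.10", 4004): "PPLive"}
flows = [
    (("192.0.2.10", 4004), ("198.51.100.7", 51000)),  # endpoint is the source
    (("203.0.113.5", 6000), ("192.0.2.10", 4004)),    # endpoint is the destination
    (("203.0.113.5", 6000), ("198.51.100.7", 53)),    # no classified endpoint
]

result = label_flows(flows, endpoint_labels)
```

Flows that touch no classified endpoint keep the label `None` in this sketch.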


3.1 Abacus

Some preliminary knowledge of the internal mechanisms of P2P-TV applications is needed to fully understand the key idea behind the Abacus classifier. A P2P-TV application performs two different tasks: first, it exchanges video chunks (i.e., small fixed-size pieces of the video stream) with other peers and, second, it participates in the maintenance of the P2P overlay. The most important aspect is that it must keep downloading the video stream at a steady rate to provide users with a smooth video experience. Consequently, a P2P-TV application maintains a given number of connections with other peers from which it downloads pieces of the video content. Abacus signatures are thus based on the number of contacted peers and the amount of information exchanged with them.

In Tab. 4 we have reported the procedure followed by Abacus to build the signatures. The first step consists in counting the number of packets and bytes received by an endpoint from each peer during a time window of 5 sec. Let us first focus on the packet counters. We define a partition of $\mathbb{N}$ into $B$ exponential-sized bins $I_i$, i.e., $I_0 = [0,1]$, $I_i = [2^{i-1}+1, 2^i]$ and $I_B = [2^B, \infty)$. Then, we sort the observed peers into bins according to the number of packets they have sent to the given endpoint. In the pseudo-code we see that we can assign a peer to a bin by simply calculating the logarithm of the associated number of packets. We proceed in the same way for the byte counters (except that we use a different set of bins), finally obtaining two vectors of frequencies, namely p and b. The concatenation of the two vectors is the Abacus signature, which is fed to the SVM for the actual decision process. This simple method highlights the distinct behaviors of the different P2P-TV applications. Indeed, an application which implements an aggressive peer-discovery strategy will receive many single-packet probes, consequently showing large values for the low-order bins.
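The packet half of the signature described so far can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of bins (8) and the per-peer counts are hypothetical, the frequencies are assumed to be normalized so that they sum to one, and the byte bins are omitted because their boundaries are not given here.

```python
from collections import Counter

def bin_index(count, num_bins):
    """Exponential bin of a packet count: I_0 = [0, 1],
    I_i = [2^(i-1) + 1, 2^i], with the last bin open-ended."""
    if count <= 1:
        return 0
    # (count - 1).bit_length() equals ceil(log2(count)) for count >= 2.
    return min((count - 1).bit_length(), num_bins)

def packet_signature(packets_per_peer, num_bins=8):
    """Frequency of peers per bin for one time window.

    packets_per_peer: dict mapping each peer to the number of packets it
                      sent to the endpoint during the window
    """
    histogram = Counter(bin_index(n, num_bins) for n in packets_per_peer.values())
    total = sum(histogram.values())
    return [histogram[i] / total if total else 0.0 for i in range(num_bins + 1)]

# Hypothetical window: two single-packet probes and two chunk downloads.
window = {"peer1": 1, "peer2": 1, "peer3": 64, "peer4": 60}
p = packet_signature(window)
```

With these counts, half of the peers fall in bin 0 (single-packet probes) and half in bin 6, the bin that covers counts from 33 to 64 packets.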
Conversely, an application which downloads the video stream using chunks of, say, 64 packets will exhibit a large value in the 6-th bin. Abacus provides a simple mechanism to identify applications which are "unknown" to the SVM (i.e., not present in the training set), which in our case means non P2P-TV applications. Basically, for each class we define a centroid based on the training points, and we label a signature as unknown if its distance from the centroid of the associated class exceeds a given threshold. To evaluate this distance we use the Bhattacharyya distance, which is specific to probability mass functions. All details on the choice of the threshold, as well as of all other parameters, can be found in [5].

3.2 Kiss

The Kiss classifier [3] is instead based on a statistical analysis of the packet payload. In particular, it exploits a Chi-Square like test to extract statistical features from the first application-layer payload bytes. Considering a window of $C$ segments sent (or received) by an endpoint, the first $k$ bytes of each packet payload are split into $G$ groups of $b$ bits. Then, the empirical distributions $O_i^g$ of the values taken by the $G$ groups over the $C$ segments are compared to a uniform distribution $E_i = C/2^b$ by means of the Chi-Square like test:

\[ X_g = \sum_{i=1}^{2^b} \frac{(O_i^g - E_i)^2}{E_i}, \qquad g \in [1, G] \qquad (1) \]


Table 1. Datasets used for the comparison

Dataset                Duration  Flows  Bytes  Endpoints
Napa-WUT               180 min   73k    7 GB   25k
Operator 2006 (op06)   45 min    785k   4 GB   135k
Operator 2007 (op07)   30 min    319k   2 GB   114k

This makes it possible to measure the randomness of each group of bits and to discriminate among constant values, random values, counters, etc., as the Chi-Square test assumes different values for each of them. The array of the G Chi-Square values defines the application signature. In this paper, we use the first k = 12 bytes of the payload divided into groups of 4 bits (i.e., G = 24 features per vector) and C = 80 segments to compute each Chi-Square. The generated signatures are then fed to a multi-class SVM, similarly to Abacus. As previously stated, a training set is used to characterize each target class, but for Kiss an additional class must be defined to represent the remaining traffic, i.e., the unknown class. In fact, a multi-class SVM always assigns a sample to one of the known classes, namely the best fitting class found during the decision process. Therefore, in this case a trace containing only traffic other than P2P-TV is needed to characterize the unknown class. We already mentioned that in Abacus this problem is solved by means of a threshold criterion using the distance of a sample from the centroid of the class. We refer the reader to [3] for a detailed discussion of the Kiss parameter settings and of the selection of the traffic representing the unknown class in the training set.
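The computation of the signature with the parameters above (k = 12, b = 4, G = 24, C = 80) can be sketched as follows. This is an illustration of Eq. (1) only, not the authors' code: it assumes that b divides 8, that every payload carries at least k bytes, and that groups are read most significant bits first; the example payloads are synthetic.

```python
def kiss_signature(payloads, k=12, b=4):
    """Chi-Square like value of each group of b bits over a window of payloads.

    payloads: list of C bytes objects, each at least k bytes long
    Returns a list of G = k * 8 / b values, one per group.
    """
    window_size = len(payloads)            # C in the text
    groups_per_byte = 8 // b               # assumes b divides 8
    num_groups = k * groups_per_byte       # G in the text
    expected = window_size / 2 ** b        # E_i = C / 2^b (uniform distribution)
    mask = (1 << b) - 1

    # observed[g][i] counts how often group g takes the value i.
    observed = [[0] * (2 ** b) for _ in range(num_groups)]
    for payload in payloads:
        for byte_pos in range(k):
            for j in range(groups_per_byte):
                shift = 8 - b * (j + 1)    # most significant group first
                value = (payload[byte_pos] >> shift) & mask
                observed[byte_pos * groups_per_byte + j][value] += 1

    return [sum((o - expected) ** 2 / expected for o in counts)
            for counts in observed]

# Synthetic window of C = 80 payloads: byte 0 is constant, byte 1 is a counter.
payloads = [bytes([0x17, i % 256] + [0] * 10) for i in range(80)]
signature = kiss_signature(payloads)
```

In this synthetic window the groups of the constant byte reach the maximum value, 1200, while the low-order group of the counter, whose 16 values are equally frequent, yields 0, which shows how the test separates constant fields from counters.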

4 Experimental Results

4.1 Methodology and Datasets

We evaluate the two classifiers on the traffic generated by four popular P2P-TV applications, namely PPLive, TVAnts, SopCast and Joost¹. Furthermore, we use two distinct sets of traces to assess two different aspects of our classifiers.

The first set was gathered during a large-scale active experiment performed in the context of the Napa-Wine European project [18]. For each application we conducted an hour-long experiment in which several machines provided by the project partners ran the software and captured the generated traffic. The machines involved were carefully configured so that no other interfering application was running on them; the traces therefore contain P2P-TV traffic only. This set is used both to train the classifiers and to evaluate their performance in identifying the different P2P-TV applications.

The second dataset consists of two real-traffic traces collected in 2006 and 2007 on the network of a large Italian ISP. This operator provides its customers with uncontrolled Internet access (i.e., it allows them to run any kind of application, from web browsing to file-sharing), as well as telephony and streaming services over IP. Given the extremely rich set of channels available through the ISP streaming services, customers

¹ Joost became a web-based application in October 2008. At the time we conducted the experiments, it was providing VoD and live-streaming by means of P2P.

Table 2. Classification results

(a) Flows pp 13.35 0.86 0.33 0.06 0.1 0.21

Abacus tv sp jo 0.32 0.06 95.67 0.15 0.03 98.04 0.1 2.21 - 81.53 0.1 1.03 0.06 0.03 0.87 0.05

pp pp 99.33 tv 0.01 sp 0.01 jo op06 1.02 op07 3.03

Abacus tv sp jo 0.11 99.95 0.09 99.85 0.02 - 99.98 0.58 0.55 0.71 0.25

pp tv sp jo op06 op07

un 86.27 3.32 1.5 16.2 98.71 98.84

pp tv sp jo op06 op07

pp 98.8 -

tv 97.3 0.44 2.13

Kiss sp jo 0.01 98.82 - 86.37 0.08 0.55 0.09 1.21

un 0.2 0.69 0.21 3.63 92.68 84.07

nc 1 2 0.97 10 6.25 12.5

(b) Bytes un 0.56 0.04 0.03 0.02 97.85 96.01

pp tv sp jo op06 op07

Kiss pp tv sp jo un nc 99.97 0.01 0.02 - 99.96 0.03 0.01 - 99.98 0.01 0.01 - 99.98 0.01 0.01 0.07 0.08 98.45 1.4 0.08 0.74 0.05 96.26 2.87

pp=PPLive, tv=TVAnts, sp=SopCast, jo=Joost, un=Unknown, nc=not-classified.

are not inclined to use P2P-TV applications and actually no such traffic is present in the traces. We verified this by means of a classic DPI classifier as well as by manual inspection of the traces. This set has the purpose of assessing the number of false alarms raised by the classifiers when dealing with non P2P-TV traffic. We report in Tab. 1 the main characteristics of the traces.

To compare the classification results, we employ the diffinder tool [19], as already done in [4]. This simple software takes as input the logs from different classifiers, with the list of flows and the associated classification outcome. Then, it calculates as output several aggregate metrics, such as the percentage of agreement of the classifiers in terms of both flows and bytes, as well as a detailed list of the differently classified flows, thus enabling further analysis.

4.2 Classification Results

Tab. 2 reports the accuracy achieved by the two classifiers on the test traces. Each table is organized as a confusion matrix, where rows correspond to the real traffic, i.e., the expected outcome, while columns report the possible classification results. For each table, the upper part is related to the Napa-Wine traces while the lower part is dedicated to the operator traces. The values in bold on the main diagonal of the tables express the recall, a metric commonly used to evaluate classification performance, defined as the ratio of true positives over the sum of true positives and false negatives. The "unknown" column counts the percentage of traffic which was recognized as not being P2P-TV


traffic, while the column "not classified" accounts for the percentage of traffic that Kiss cannot classify, as it needs at least 80 packets for any endpoint.

At first glance, both classifiers are extremely accurate in terms of bytes. For the Napa-Wine traces the percentage of true positives exceeds 99% for all the considered applications. For the operator traces, the percentage of true negatives again exceeds 96% for all traces, with Kiss showing a slightly better overall performance. These results demonstrate that even an extremely lightweight behavioral classification mechanism, such as the one adopted in Abacus, can achieve the same accuracy as an accurate payload-based classifier.

If we consider flow accuracy, we see that for three out of four applications the performance of the two classifiers is comparable. Yet Abacus presents a very low percentage of true positives for PPLive (13.35%), with a rather large number of flows falling in the unknown class. By examining the classification logs, we found that PPLive actually uses several ports on the same host to perform different functions (e.g., one for video transfer, one for overlay maintenance). In particular, from one port it generates many single-packet flows, all directed to different peers, apparently to perform peer discovery. All these flows, which account for a negligible portion of the overall bytes, fall in the first bin of the Abacus signature, which is always classified as unknown. However, from the byte-wise results we can conclude that the video endpoint is always correctly classified. Finally, we observe that Kiss has a lower flow accuracy for the operator traces. In fact, the large percentage of flows falling in the "not classified" class means that many flows are shorter than 80 packets. Again, this is only a minor issue, since the byte accuracy of Kiss is very high anyway.
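The difference between flow-wise and byte-wise recall discussed above can be illustrated with a small sketch. The counts below are hypothetical and chosen only to mimic the PPLive case, where many tiny flows are classified as unknown while the few flows carrying the video are classified correctly.

```python
def recall(confusion_row, true_class):
    """Recall of one class: true positives over true positives plus false negatives.

    confusion_row: dict mapping each predicted label to the amount of traffic
                   (flows or bytes) of true_class that received that label
    """
    total = sum(confusion_row.values())
    return confusion_row.get(true_class, 0) / total if total else 0.0

# Hypothetical PPLive traffic: 100 video flows of 1,000,000 bytes each are
# classified correctly, 900 single-packet flows of 100 bytes each are not.
flows_row = {"pp": 100, "un": 900}
bytes_row = {"pp": 100 * 1_000_000, "un": 900 * 100}

flow_recall = recall(flows_row, "pp")
byte_recall = recall(bytes_row, "pp")
```

The same classification outcome gives a flow recall of 10% and a byte recall above 99.9%, which is why both metrics are reported.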

5 Comparison

5.1 Functional Comparison

In the previous section we have shown that the classifiers have similar performance in identifying both the target applications and the “unknown” traffic. Nevertheless, they are based on very different approaches, each with pros and cons that need to be carefully taken into account. Tab. 3 summarizes the main characteristics of the classifiers, which are reviewed in the following.

Table 3. Main characteristics of Abacus and Kiss

Characteristic        Abacus                 Kiss
Technique             Behavioral             Stochastic Payload Inspection
Entity                Endpoint               Endpoint/Flow
Input Format          Netflow-like           Packet trace
Grain                 Fine grained           Fine grained
Protocol Family       P2P-TV                 Any
Rejection Criterion   Threshold              Train-based
Train set size        Big (4000 smp.)        Small (300 smp.)
Time Responsiveness   Deterministic (5sec)   Stochastic (early 80pkts)
Network Deploy        Edge                   Edge/Backbone

The most important difference is the classification technique used. Even if both classifiers are statistical, they work at different levels and clearly belong to different families of classification algorithms. Abacus is a behavioral classifier, since it builds a statistical representation of the pattern of traffic generated by an endpoint, starting from transport-level data. Conversely, Kiss derives a statistical description of the application protocol by inspecting packet-level data, so it is a payload-based classifier.

The first consequence of this different approach lies in the type and volume of information needed for the classification. In particular, Abacus takes as input just a measurement of the traffic rate of the flows directed to an endpoint, in terms of both bytes and packets. Not only does this represent an extremely small amount of information, but it could also be gathered by a Netflow monitor, so that no packet trace has to be inspected by the classification engine itself. On the other hand, Kiss must necessarily access the packet payload to compute its features. This constitutes a more expensive operation, even if the first 12 bytes alone are sufficient to achieve a high classification accuracy.

Despite the different input data, both classifiers work at a fine-grained level, i.e., they can identify the specific application related to each flow and not just the class of applications (e.g., P2P-TV). This consideration may appear obvious for a payload-based classifier such as Kiss, but it is one of the strengths of Abacus over other behavioral classifiers, which are usually capable only of a coarse-grained classification.

Clearly, Abacus pays for the simplicity of its approach with a narrower range of possible target traffic. In fact, its classification process relies on some specific properties of P2P-TV traffic (i.e., the steady download rate required by the application to provide a smooth video playback), which are closely tied to this particular service. For this reason, Abacus currently cannot be applied to applications other than P2P-TV. Kiss, on the contrary, is more general: it makes no particular assumptions on its target traffic and can be applied to any protocol. Indeed, it successfully classifies other kinds of P2P applications, from file-sharing (e.g., eDonkey) to P2P VoIP (e.g., Skype), as well as traditional client-server applications (e.g., DNS).

Another important distinguishing element is the rejection criterion. Abacus defines a hypersphere for each target class and measures the distance of each classified point from the center of the associated hypersphere by means of the Bhattacharyya formula. Then, by employing a threshold-based rejection criterion, a point is labeled as “unknown” when its distance from the center exceeds a given value. Kiss instead exploits a multiclass SVM model where all the classes, including the unknown, are represented in the training set.
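The threshold-based rejection can be sketched as follows. This is only an illustration under stated assumptions: the text names the “Bhattacharyya formula” without giving its exact form, so the sketch uses one common variant, sqrt(1 - BC), where BC is the Bhattacharyya coefficient; the class centers, the threshold value and all function names are hypothetical.

```python
import math


def bhattacharyya_distance(p, q):
    """Distance between two discrete distributions defined on the same bins.

    Assumed form: sqrt(1 - BC), with BC = sum_i sqrt(p_i * q_i).
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def apply_rejection(signature, assigned_class, centers, threshold):
    """Keep the assigned class only if the point lies inside its hypersphere."""
    distance = bhattacharyya_distance(signature, centers[assigned_class])
    return assigned_class if distance <= threshold else "unknown"


# Hypothetical class center and threshold, for illustration only.
centers = {"pplive": [0.7, 0.2, 0.1]}
close_point = apply_rejection([0.7, 0.2, 0.1], "pplive", centers, threshold=0.5)
far_point = apply_rejection([0.0, 0.0, 1.0], "pplive", centers, threshold=0.5)
```

With this variant the distance is bounded in [0, 1], so a single threshold per class is enough to define the radius of the hypersphere.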
Although the train-based approach makes Kiss very flexible, the characterization of the classes can be critical, especially for the unknown class, since it is important that the training set contains samples from all possible protocols other than the target ones.

We also notice that there is an order of magnitude of difference in the size of the training sets used by the classifiers. In fact, we trained Abacus with 4000 samples per class (although in some tests we observed the same performance even with smaller sets), while Kiss, thanks to the combined discriminative power of the Chi-Square signatures and the SVM decision process, needs only 300 samples per class. On the other hand, Kiss needs at least 80 packets generated from (or directed to) an endpoint in order to classify it. This may seem a strong constraint, but the results reported in Sec. 4 show that the percentage of unsupported traffic is negligible, at least in terms of bytes. This is due to the adoption of the endpoint-to-flow label propagation scheme, i.e., the propagation of the label of an “elephant” flow to all the “mice” flows of the same endpoint. With the exception of particular traffic conditions, this labeling technique can effectively bypass the constraint on the number of packets.

Finally, as far as network deployment is concerned, Abacus needs all the traffic received by the endpoint to characterize its behavior. Therefore, it is only effective when placed at the edge of the network, where all traffic directed to a host transits. Conversely, in the network core Abacus would likely see only a portion of this traffic, thus gathering an incomplete representation of the endpoint behavior, which in turn could result in an inaccurate classification. Kiss, instead, is more robust with respect to the deployment position. In fact, by inspecting packet payload, it can operate even on a limited portion of the traffic generated by an endpoint, provided that the requirement on the minimum number of packets is satisfied.

5.2 Computational Cost

To complete the comparison of the classifiers, we provide an analysis of the requirements in terms of both memory occupation and computational cost. We follow a theoretical approach and calculate these metrics from the formal algorithm specification. In this way, our evaluation is independent of specific hardware platforms or code optimizations. Tab. 4 compares the costs from an analytical point of view, while Tab. 5 gives a numerical comparison based on a case study.

Table 4. Analytical comparison of the resource requirements of the classifiers
(lup = lookup, com = complex operation, sim = simple operation)

Memory allocation
  Abacus: 2F counters
  Kiss:   2^b G counters

Packet processing
  Abacus (tot. op.: 2 lup + 2 sim)
    EP_state = hash(IP_d, port_d)
    FL_state = EP_state.hash(IP_s, port_s)
    FL_state.pkts++
    FL_state.bytes += pkt_size
  Kiss (tot. op.: (2G+1) lup + G sim)
    EP_state = hash(IP_d, port_d)
    for g = 1 to G do
      P_g = payload[g]
      EP_state.O[g][P_g]++
    end for

Feature extraction
  Abacus (tot. op.: (4F+2B+1) lup + 2(F+B) com + 3F sim)
    EP_state = hash(IP_d, port_d)
    for all FL_state in EP_state.hash do
      p[log2(FL_state.pkts)] += 1
      b[log2(FL_state.bytes)] += 1
    end for
    N = count(keys(EP_state.hash))
    for all i = 0 to B do
      p[i] /= N
      b[i] /= N
    end for
  Kiss (tot. op.: 2^(b+1) G lup + G com + (3·2^b + 1) G sim)
    E = C/2^b (precomputed)
    for g = 1 to G do
      Chi[g] = 0
      for i = 0 to 2^b do
        Chi[g] += (EP_state.O[g][i] - E)^2
      end for
      Chi[g] /= E
    end for
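The pseudocode of Tab. 4 can be turned into a small runnable sketch. The version below is an illustration, not the authors' implementation: the class names are hypothetical, the clamping of large values to the last Abacus bin is an assumption, the number of Abacus bins is taken as B per counter type (so that the signature has 16 values for B = 8), and the Kiss chunk extraction only handles b = 4, where each chunk is one nibble of the first 12 payload bytes. Packet sizes are assumed to be positive and payloads at least 12 bytes long.

```python
import math
from collections import defaultdict

B = 8            # Abacus: bins per counter type (assumed: indices 0..B-1)
G = 24           # Kiss: number of payload chunks per packet
CHUNK_BITS = 4   # Kiss: b, bits per chunk (this sketch only handles b = 4)


class AbacusEndpoint:
    """State of one endpoint (IP_d, port_d): two counters per incoming flow."""

    def __init__(self):
        # (IP_s, port_s) -> [packet count, byte count]
        self.flows = defaultdict(lambda: [0, 0])

    def on_packet(self, src, pkt_size):
        counters = self.flows[src]   # flow lookup
        counters[0] += 1             # FL_state.pkts++
        counters[1] += pkt_size      # FL_state.bytes += pkt_size

    def signature(self):
        """Fraction of flows per log2 bin of packets and of bytes (2B values)."""
        p = [0] * B
        by = [0] * B
        for pkts, nbytes in self.flows.values():
            # Clamping to the last bin is an assumption of this sketch.
            p[min(B - 1, int(math.log2(pkts)))] += 1
            by[min(B - 1, int(math.log2(nbytes)))] += 1
        n = len(self.flows)
        return [x / n for x in p] + [x / n for x in by]


class KissEndpoint:
    """State of one endpoint: G tables of 2^b observed-value counters."""

    def __init__(self):
        self.observed = [[0] * (2 ** CHUNK_BITS) for _ in range(G)]
        self.packets = 0

    def on_packet(self, payload):
        # With b = 4, chunk g is one nibble of byte g // 2 (12 bytes in total).
        for g in range(G):
            byte = payload[g // 2]
            value = byte >> 4 if g % 2 == 0 else byte & 0x0F
            self.observed[g][value] += 1
        self.packets += 1

    def signature(self):
        """Chi-square statistic of each chunk against a uniform distribution."""
        expected = self.packets / 2 ** CHUNK_BITS   # E = C / 2^b
        return [sum((o - expected) ** 2 for o in row) / expected
                for row in self.observed]


# Toy traffic, purely illustrative.
abacus = AbacusEndpoint()
abacus.on_packet(("10.0.0.1", 1000), 100)
for _ in range(4):
    abacus.on_packet(("10.0.0.2", 2000), 200)
abacus_signature = abacus.signature()

kiss = KissEndpoint()
for _ in range(80):
    kiss.on_packet(bytes(12))   # 80 packets with an all-zero payload
kiss_signature = kiss.signature()
```

The constant payload of the toy Kiss example yields a large chi-square value for every chunk, which is the behavior expected for deterministic protocol fields; a uniformly random field would instead give values close to zero.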

Table 5. Numerical case study of the resource requirements of the classifiers

                     Abacus                        Kiss
Memory allocation    320 bytes                     384 bytes
Packet processing    2 lup + 2 sim                 49 lup + 24 sim
Feature selection    177 lup + 96 com + 120 sim    768 lup + 24 com + 1176 sim
Params values        B=8, F=40                     G=24, b=4

Memory footprint is mainly related to the data structures used to compute the statistics. Kiss requires a table of G · 2^b counters for each endpoint to collect the observed frequencies employed in the chi-square computation. For the default parameters, i.e., G = 24 chunks of b = 4 bits, each endpoint requires 384 counters. Abacus, instead, requires two counters for each flow related to an endpoint, so the total amount of memory is not fixed but depends on the number of flows per endpoint. As an example, Fig. 1(a) reports, for the two operator traces, the CDF of the number of flows seen by each endpoint in consecutive windows of 5 seconds, the default duration of the Abacus time window. It can be observed that the 90th percentile in the worst case is nearly 40 flows. Using this value as a worst-case estimate of the number of flows for a generic endpoint, 2 · #Flows = 80 counters are required for each endpoint. This value is very small compared to the Kiss requirements, but for a complete comparison we also need to consider the size of the counters. As Kiss uses windows of 80 packets, its counters assume values in the interval [0, 80], so single-byte counters are sufficient. With the default parameters, this means 384 bytes for each endpoint. The counters of Abacus, instead, do not have a specific interval so, assuming a worst case of 4 bytes for each counter, 320 bytes are associated with each endpoint. In conclusion, in the worst case the two classifiers require a comparable amount of memory, but on average Abacus requires less memory than Kiss.

Computational cost can be evaluated by comparing three tasks: the operations performed on each packet, the operations needed to compute the signatures, and the operations needed to classify them. Tab. 4 reports the pseudocode of the first two tasks for both classifiers, also specifying the total number of operations needed for each task.
The operations are divided into three categories and considered separately, as they have different costs: lup for memory lookup operations, com for complex operations (i.e., floating point operations), and sim for simple operations (i.e., integer operations).

Let us first focus on the packet processing part, which presents many constraints from a practical point of view, as it should operate at line speed. In this phase, Abacus needs 2 memory lookup operations, to access its internal structures, and 2 integer increments per packet. Kiss, instead, needs 2G + 1 = 49 lookup operations, half of which are accesses to packet payload. Then, Kiss must compute G integer increments. Since memory read operations are the most time consuming, from our estimation we can conclude that Abacus should be approximately 20 times faster than Kiss in the packet processing phase.

The evaluation of the signature extraction process is more complex. First of all, since the number of flows associated with an endpoint is not fixed, the Abacus cost is not deterministic but, as in the memory occupation case, we can consider 40 flows as a worst-case scenario. For the lookup operations, considering B = 8, Abacus requires a total of 177 operations, while Kiss needs 768 operations, i.e., roughly four times as many. For the arithmetic operations, Abacus needs 96 floating point and 120 integer operations, while Kiss needs 24 floating point and 1176 integer operations. Abacus produces one signature every 5 seconds, while Kiss signatures are processed every 80 packets. To estimate the frequency of the Kiss calculation, in Fig. 1(b) we show the CDF of the amount of time needed to collect 80 packets for an endpoint. It can be observed that, on average, a new signature is computed every 2 seconds. This means that Kiss performs the feature calculation more frequently, i.e., it is more reactive and possibly more accurate than Abacus, but obviously also more resource consuming.

[Figure: two CDF plots with curves op06, op07, joost, pplive, sopcast, tvants; x-axis of panel (a) “Flows @ 5sec”, x-axis of panel (b) “time @ 80pkt”, y-axis “CDF”]

Fig. 1. Cumulative distribution function of (a) number of flows per endpoint and (b) duration of an 80-packet snapshot for the operator traces

Finally, the complexity of the classification task depends on the number of features per signature, since both classifiers are based on an SVM decision process. The Kiss signature is composed, by default, of G = 24 features, while the Abacus signature contains 16 features: also from this point of view Abacus appears lighter than Kiss.
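The case-study figures of Tab. 5 follow directly from the formulas of Tab. 4. The short sketch below redoes this arithmetic; the function names are hypothetical, and the counter sizes (4 bytes for Abacus, 1 byte for Kiss) are the worst-case assumptions stated in the text.

```python
def abacus_costs(F, B, counter_bytes=4):
    """Resource requirements of Abacus for F flows per endpoint and B bins."""
    return {
        "memory_bytes": 2 * F * counter_bytes,
        "packet": {"lup": 2, "sim": 2},
        "features": {"lup": 4 * F + 2 * B + 1,
                     "com": 2 * (F + B),
                     "sim": 3 * F},
    }


def kiss_costs(G, b, counter_bytes=1):
    """Resource requirements of Kiss for G chunks of b bits each."""
    return {
        "memory_bytes": G * 2 ** b * counter_bytes,
        "packet": {"lup": 2 * G + 1, "sim": G},
        "features": {"lup": 2 ** (b + 1) * G,
                     "com": G,
                     "sim": (3 * 2 ** b + 1) * G},
    }


# Parameter values of the case study in Tab. 5.
abacus = abacus_costs(F=40, B=8)
kiss = kiss_costs(G=24, b=4)
lookup_ratio = kiss["packet"]["lup"] / abacus["packet"]["lup"]
```

For these parameters the per-packet lookup ratio is 49 / 2 = 24.5, in line with the order of magnitude of the speed-up estimated for the packet processing phase.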

6 Conclusions

In this paper we compared two approaches to the classification of P2P-TV traffic. We provided not only a quantitative evaluation of the performance of the algorithms by testing them on a common set of traces, but also a more insightful discussion of the differences deriving from the two paradigms. The algorithms proved to be comparable in terms of accuracy in classifying P2P-TV applications, at least regarding the percentage of correctly classified bytes. Differences emerged when we compared the computational cost of the classifiers. In this respect, Abacus outperforms Kiss, because of the simplicity of the features employed to characterize the traffic. Conversely, Kiss is much more general, as it can classify other types of applications as well.

Our work is a first step in cross-evaluating the novel algorithms proposed by the research community in the field of traffic classification. We showed how an innovative behavioral method can be as accurate as a payload-based one and at the same time lighter, which makes it a perfect candidate for scenarios with hard constraints in terms of computational resources. However, we also showed some limitations in its general applicability, which we would like to address in our future work.


Acknowledgements. This work was funded by the EU under the FP7 Collaborative Project “Network-Aware P2P-TV Applications over Wise-Networks” (NAPAWINE).

References

1. Moore, A.W., Papagiannaki, K.: Toward the Accurate Identification of Network Applications. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 41–54. Springer, Heidelberg (2005)
2. Karagiannis, T., Broido, A., Brownlee, N., Claffy, K., Faloutsos, M.: Is p2p dying or just hiding? In: IEEE GLOBECOM 2004, Dallas, Texas, US (2004)
3. Finamore, A., Mellia, M., Meo, M., Rossi, D.: KISS: Stochastic Packet Inspection. In: Traffic Measurement and Analysis (TMA) Workshop at IFIP Networking 2009, Aachen, Germany (May 2009)
4. Cascarano, N., Risso, F., Este, A., Gringoli, F., Salgarelli, L., Finamore, A., Mellia, M.: Comparing p2ptv traffic classifiers. Submitted to IEEE ICC 2010 (2010)
5. Valenti, S., Rossi, D., Meo, M., Mellia, M., Bermolen, P.: Accurate, Fine-Grained Classification of P2P-TV Applications by Simply Counting Packets. In: Traffic Measurement and Analysis (TMA) Workshop at IFIP Networking 2009, Aachen, Germany (May 2009)
6. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM CoNEXT 2006, Lisboa, PT (December 2006)
7. Williams, N., Zander, S., Armitage, G.: A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. ACM SIGCOMM Comp. Comm. Rev. 36(5), 7–15 (2006)
8. Erman, J., Arlitt, M., Mahanti, A.: Traffic classification using clustering algorithms. In: MineNet 2006: Mining network data (MineNet) Workshop at ACM SIGCOMM 2006, Pisa, Italy (2006)
9. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. SIGCOMM Comput. Commun. Rev. 35(4), 229–240 (2005)
10. Iliofotou, M., Kim, H., Pappu, P., Faloutsos, M., Mitzenmacher, M., Varghese, G.: Graph-based p2p traffic classification at the internet backbone. In: 12th IEEE Global Internet Symposium (GI 2009), Rio de Janeiro, Brazil (April 2009)
11. Salgarelli, L., Gringoli, F., Karagiannis, T.: Comparing traffic classifiers. ACM SIGCOMM Comp. Comm. Rev. 37(3), 65–68 (2007)
12. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials 10(4), 56–76 (2008)
13. Kim, H., Claffy, K., Fomenkov, M., Barman, D., Faloutsos, M., Lee, K.: Internet traffic classification demystified: myths, caveats, and the best practices. In: Proc. of ACM CoNEXT 2008, Madrid, Spain (2008)
14. Li, W., Canini, M., Moore, A.W., Bolla, R.: Efficient application identification and the temporal and spatial stability of classification schema. Computer Networks 53(6), 790–809 (2009)
15. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A Measurement Study of a Large-Scale P2P IPTV System. IEEE Transactions on Multimedia (December 2007)
16. Li, B., Qu, Y., Keung, Y., Xie, S., Lin, C., Liu, J., Zhang, X.: Inside the New Coolstreaming: Principles, Measurements and Performance Implications. In: IEEE INFOCOM 2008, Phoenix, AZ (April 2008)
17. Cristianini, N., Shawe-Taylor, J.: An introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, New York (1999)
18. Napa-Wine, http://www.napa-wine.eu/
19. Risso, F., Cascarano, N.: Diffinder, http://netgroup.polito.it/research-projects/l7-traffic-classification

TCP Traffic Classification Using Markov Models

Gerhard Münz, Hui Dai, Lothar Braun, and Georg Carle

Network Architectures and Services – Institute for Informatics
Technische Universität München, Germany
{muenz,braun,carle}@net.in.tum.de, [email protected]

Abstract. This paper presents a novel traffic classification approach which classifies TCP connections with help of observable Markov models. As traffic properties, payload length, direction, and position of the first packets of a TCP connection are considered. We evaluate the accuracy of the classification approach with help of packet traces captured in a real network, achieving higher accuracies than the cluster-based classification approach of Bernaille [1]. As another advantage, the complexity of the proposed Markov classifier is low for both training and classification. Furthermore, the classification approach provides a certain level of robustness against changed usage of applications.

1 Introduction

Network operators are interested in identifying the traffic of different applications in order to monitor and control the utilization of the available network resources. Since the traffic of many new applications cannot be identified by specific port numbers, deep packet inspection (DPI) is the current technology of choice. However, DPI is very costly as it requires a lot of computational resources as well as up-to-date signatures of all relevant applications. Furthermore, DPI is limited to unencrypted traffic. Therefore, traffic classification using statistical methods has become an important area of research.

In this paper, we present a novel classification approach which models transitions between data packets using Markov models. While most existing Markov-based traffic classification methods rely on hidden Markov models (HMMs), we make use of observable Markov models where each state directly reflects certain packet attributes, such as the payload length, the packet direction, and the position within the connection. Using training data, separate Markov models are estimated for those applications which we want to identify and distinguish. The classification of new connections is based on the method of maximum likelihood, which selects the application whose Markov model yields the highest a-posteriori probability for the given packet sequence.

We restrict the evaluation of our approach to the classification of TCP traffic. Based on traffic traces captured in our department network, we compare the outcome of the Markov classifier with the results of Bernaille’s cluster-based classification approach [1]. Furthermore, we show an example of changed application usage and its effect on the classification accuracy. Last but not least, we assess and discuss the complexity of the Markov classifier.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 127–140, 2010. © Springer-Verlag Berlin Heidelberg 2010


After giving an overview of existing Markov-based traffic classification approaches in Section 2, we explain our approach in Section 3. Section 4 presents the evaluation results before Section 5 concludes this paper.

2 Related Work

In the recent past, various research groups have proposed the utilization of statistical methods for traffic classification. Common to these approaches is that application specific traffic characteristics are learned from training data. Typically, the considered properties are statistics derived from entire flows or connections, or attributes of individual packets. Examples for these two kinds of properties are the average packet length and the length of the first packet of a connection, respectively. Nguyen and Armitage provide a comparison of various existing approaches in a survey paper [2]. In the following, we give an overview on existing traffic classification approaches which make use of Markov models. Wright et al. [3] and Dainotti et al. [4] estimate a separate HMM for each application considering packet lengths and inter-arrival times. In an HMM, the output of a state is not deterministic but randomly distributed according to the emission probability distribution of the state. While the state output is observable, transitions between states are hidden. Readers looking for a comprehensive introduction to HMMs are referred to Rabiner’s tutorial [5]. Wright [3] considers TCP connections and deploys left-right HMMs with a large number of states and discrete emission probability distributions. In contrast, Dainotti [4] generates ergodic HMMs with four to seven states and Gamma-distributed emission probabilities for unidirectional TCP and UDP traffic from clients to servers; packets without payload are ignored. In both cases, traffic classification assigns new connections to the application whose HMM yields the maximum likelihood. Our approach is motivated by Estevez-Tapiador et al. who use observable ergodic Markov models for detecting anomalies in TCP connections [6]. In an observable Markov model, each state emits a different symbol, which allows deducing the state transitions from a series of observations directly. 
In the case of Estevez-Tapiador et al., the Markov model generates the sequence of TCP flag combinations observed in those packets of a TCP connection which are sent from the client to the server. Hence, every state represents a specific combination of TCP flags, every transition the arrival of a new packet in the same TCP connection. The transition matrix is estimated using training data which is free of anomalies. During the detection phase, anomalies are then detected by calculating the a-posteriori probability and comparing it with a lower threshold. Estevez-Tapiador et al. use separate Markov models for different applications which are distinguished by their well-known port numbers. We adopt and extend the modeling approach of Estevez-Tapiador for classifying TCP connections. The training phase is identical: we estimate distinct Markov models for the different applications using training data. In the classification phase, however, we calculate the a-posteriori probabilities of an observed connection for all Markov models. Thereafter, the connection is assigned


to the application for which the Markov model yields the maximum a-posteriori probability. In contrast to Estevez-Tapiador, we consider both directions of the TCP connection and take payload lengths instead of TCP flag combinations into account. In prior work [7], we achieved good classification results with states reflecting the payload length and PUSH flag of each packet. However, the deployed Markov models did not consider the position of each packet within the connection although the packet position strongly influences the payload length distribution and the occurrence probability of the PUSH flag. In this paper, we present a new variant of the Markov classifier which is based on left-right Markov models instead of ergodic Markov Models. Hence, we are able to incorporate the dependency of transition probabilities on the packet’s position within the TCP connection. The next section explains this approach in more detail.

3 TCP Traffic Classification Using Markov Models

Just like other statistical traffic classification approaches, we assume that the communication behavior of an application influences the resulting traffic. Hence, by observing characteristic traffic properties, it should be possible to distinguish applications with different behaviors. One such characteristic property is the sequence of packet lengths observed within a flow or connection, which serves as input to many existing traffic classification methods [3, 1, 4]. We use observable Markov models to describe the dependencies between subsequent packets of a TCP connection. The considered packet attributes are payload lengths (equaling the TCP segment size), packet direction, and packet position within the connection. Considering the TCP payload length instead of the IP packet length has the advantage that the value is independent of any IP and TCP options. Similar to several existing approaches (e.g., [1,4]), we only take into account packets carrying payload. We call these packets “data packets” in the following. The reason for ignoring empty packets is that these are either part of the three-way handshake, which is common to all TCP connections, or they represent acknowledgments. In both cases, the packet transmission is mainly controlled by the transport layer and not by the application. The packet direction denotes whether the packet is sent from the client to the server or vice versa. As client, we always consider the host which initiates a TCP connection. In contrast to our previous work [7], we do not consider any TCP flags although the occurrence of the PUSH flag may be influenced by how the application passes data to the transport layer. However, an experimental evaluation and comparison of TCP implementations showed that the usage of the PUSH flag varies a lot between different operating systems. 
Hence, slight improvements of the classification results which can be achieved by considering the PUSH flag might not be reproducible if other operating systems are deployed. As another difference to our previous work, we take into account the packet position within the TCP connection. This leads to better models since the probability distribution of payload length and direction typically depends on the packet position,


especially in the case of the first packets of the connection. Moreover, the classification accuracy can be increased because payload length and direction at specific packet positions are often very characteristic for an application. For example, the majority of HTTP connections start with a small request packet sent from the client to the server, followed by a longer series of long packets from the server to the client. In Section 4.1, we empirically confirm these assumptions by looking at TCP connections of different applications.

In general, a Markov model consists of n distinct states $\Sigma = \{\sigma_1, \ldots, \sigma_n\}$, a vector of initial state probabilities $\Pi = (\pi_1, \ldots, \pi_n)$, and an n × n transition matrix $A = \{a_{\sigma_i,\sigma_j}\}$. In our case, each state represents a distinct combination of payload length, packet direction, and packet position within the TCP connection. The initial state reflects the properties of the first packet within the TCP connection. A transition from one state to the next state corresponds to the arrival of a new packet. The next state then describes the properties of the new packet.

To obtain a reasonably small number of states, the payload lengths are discretized into a few intervals. We evaluated different interval definitions and found that good classification results can be obtained with a rather small number of intervals. The evaluation results presented in Section 4 are based on the following four intervals: [1,99], [100,299], [300, MSS-1], [MSS]. The value of the maximum segment size (MSS) is often exchanged in a TCP option during the TCP three-way handshake. Alternatively, MSS can be deduced from the maximum observed payload length, provided that the connection contains at least one packet of maximum payload length. A fallback option is to set MSS to a reasonable default value. Another measure to keep the number of states small is to limit the Markov model to a maximum of l data packets per TCP connection.
Hence, if a connection contains more than l data packets, we only consider the first l of them. In order to find a good value for l, we evaluated different settings and show the classification results for l = 3, . . . , 7 in Section 4. The initial state and transition probabilities are estimated from training data using the following equations:

\[
\pi_{\sigma_i} = \frac{F_0(\sigma_i)}{\sum_{m=1}^{n} F_0(\sigma_m)} ; \qquad
a_{\sigma_i,\sigma_j} = \frac{F(\sigma_i,\sigma_j)}{\sum_{m=1}^{n} F(\sigma_i,\sigma_m)} \tag{1}
\]

$F_0(\sigma_i)$ is the number of initial packets matching the state $\sigma_i$. $F(\sigma_i, \sigma_j)$ is the frequency of transitions from packets described by state $\sigma_i$ to packets described by state $\sigma_j$. Since the packet position is reflected in the state definitions, we obtain a left-right Markov model with l stages corresponding to the l first data packets in the TCP connection. In our case, every stage comprises eight states representing four payload length intervals and two directions. An example of such a Markov model with l = 4 stages is given in Figure 1. L = 1, . . . , 4 denote the different payload length intervals, C ⇒ S and S ⇒ C the two directions from client to server and server to client. Only transitions from one stage to the next (left to right) may occur, which means that at most $8^2 (l - 1)$ out of $(8l)^2$ transition matrix elements are nonzero. Apart from the packet position, the states within each of the stages describe


Fig. 1. Left-right Markov model

the same set of packet properties. Therefore, we may alternatively interpret the model as a Markov model with eight states and a time-variant 8 × 8 transition matrix $A_t$, $t = 1, \ldots, (l - 1)$. This interpretation enables a much more memory-efficient storage of the transition probabilities than one large 8l × 8l matrix.

For every application k, we determine a separate Markov model $M^{(k)}$. For this purpose, the training data must be labeled, which means that every connection must be assigned to one of the applications. In order to obtain reliable estimates of the initial and transition probabilities, the training data must contain a sufficiently large number of TCP connections for each application. On the other hand, it is not necessary that all connections contain at least l data packets since the estimation does not require a constant number of observations for every transition. Instead of individual applications, we may also use a single Markov model for a whole class of applications. This approach is useful if multiple applications are expected to show a similar communication behavior, for example because they use the same protocol.

Figure 2 illustrates how the resulting Markov models are used to classify new TCP connections. Given the first l packets of a TCP connection $O = \{o_1, o_2, \ldots, o_l\}$, the log-likelihood for this observation is calculated for all Markov models $M^{(k)}$ with $\Pi^{(k)} = (\pi^{(k)}_1, \ldots, \pi^{(k)}_n)$ and $A^{(k)} = \{a^{(k)}_{\sigma_i,\sigma_j}\}$ using the following equation:

\[
\log \Pr\left(O \mid M^{(k)}\right)
= \log\left(\pi^{(k)}_{o_1} \prod_{i=1}^{l-1} a^{(k)}_{o_i,o_{i+1}}\right)
= \log \pi^{(k)}_{o_1} + \sum_{i=1}^{l-1} \log a^{(k)}_{o_i,o_{i+1}} \tag{2}
\]
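Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the default MSS of 1460 bytes is an assumed fallback value, a connection is represented as a list of (payload length, sent by client) pairs for its data packets, and the training set is assumed to be non-empty. The transition probabilities are stored as l − 1 matrices of size 8 × 8, following the time-variant interpretation of the model.

```python
import math

DEFAULT_MSS = 1460  # assumed fallback value, not taken from the paper


def payload_interval(length, mss=DEFAULT_MSS):
    """Index of the interval [1,99], [100,299], [300,MSS-1], [MSS]."""
    if length >= mss:
        return 3
    if length >= 300:
        return 2
    if length >= 100:
        return 1
    return 0


def to_states(packets, l=4, mss=DEFAULT_MSS):
    """Map the first l data packets to per-stage state indices 0..7.

    States 0..3 are client-to-server packets, states 4..7 are
    server-to-client packets; the stage is the position in the list.
    """
    return [payload_interval(length, mss) + (0 if from_client else 4)
            for length, from_client in packets[:l]]


def train(connections, l=4, mss=DEFAULT_MSS):
    """Relative-frequency estimates of pi and of the l-1 matrices A_t (Eq. 1)."""
    f0 = [0] * 8
    f = [[[0] * 8 for _ in range(8)] for _ in range(l - 1)]
    for packets in connections:
        states = to_states(packets, l, mss)
        if not states:
            continue
        f0[states[0]] += 1
        for t in range(len(states) - 1):
            f[t][states[t]][states[t + 1]] += 1
    pi = [count / sum(f0) for count in f0]
    a = [[[count / sum(row) if sum(row) else 0.0 for count in row]
          for row in stage] for stage in f]
    return pi, a


def log_likelihood(packets, model, l=4, mss=DEFAULT_MSS):
    """log Pr(O | M) as in Eq. (2); -inf if any probability is zero."""
    pi, a = model
    states = to_states(packets, l, mss)
    probs = [pi[states[0]]]
    probs += [a[t][states[t]][states[t + 1]] for t in range(len(states) - 1)]
    if min(probs) == 0.0:
        return float("-inf")
    return sum(math.log(p) for p in probs)


# Two hypothetical training connections of one application.
web_like = [(80, True), (1460, False), (1460, False), (500, False)]
variant = [(80, True), (200, False), (1460, False), (500, False)]
model = train([web_like, variant])
ll_seen = log_likelihood(web_like, model)
ll_unseen = log_likelihood([(20, False), (20, True)], model)
```

Classification then amounts to training one such model per application and selecting the application whose model gives the largest log-likelihood for the observed connection.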


Fig. 2. Traffic classification using Markov models

The maximum likelihood classifier then selects the application for which the log-likelihood is the largest. If a connection contains fewer than l data packets, the log-likelihood is calculated for the available number of transitions only.

It is possible that a TCP connection to be classified contains an initial state for which $\pi^{(k)}_{o_1} = 0$, or a transition for which $a^{(k)}_{o_i,o_{i+1}} = 0$. This means that such an initial state or transition has not been observed in the training data. Thus, the connection does not fit the corresponding Markov model. Furthermore, if an unknown initial state or transition occurs in every model, the connection cannot be assigned to any application. This approach, however, may lead to unwanted disqualifications if the training data does not cover all possible traffic, including very rare transitions.

As the completeness of the training data usually cannot be guaranteed, we tolerate a certain amount of non-conformance but punish it with a very low likelihood. For this purpose, we replace all $\pi^{(k)}_{\sigma_i} = 0$ and all $a^{(k)}_{\sigma_i,\sigma_j} = 0$ by a positive value $\epsilon$ which is much smaller than any of the estimated non-zero probabilities. Then, we reduce the remaining probabilities to ensure $\sum_i \pi^{(k)}_{\sigma_i} = \sum_j a^{(k)}_{\sigma_i,\sigma_j} = 1$. In the evaluation in Section 4, we use $\epsilon = 10^{-5} = 0.001\%$, which is very small compared to the smallest possible estimated probability of $1/300 = 0.33\%$ (300 is the number of connections per application in the training data).

Despite the uncertainty regarding the completeness of the training data, we want to limit the number of tolerated $\epsilon$-states and $\epsilon$-transitions per connection. This is achieved by setting a lower threshold of $3 \log \epsilon$ for the log-likelihood, which corresponds to three unknown transitions, or an unknown initial state plus two unknown transitions. Connections with a log-likelihood below this threshold are considered unclassifiable.
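The ε-replacement and the rejection threshold can be illustrated with the following sketch. The function names are hypothetical; rescaling the non-zero entries by a common factor is one straightforward way to restore a sum of 1, and the handling of a row that was never observed in training (all entries zero) is an assumption, since the text does not cover that case.

```python
import math

EPSILON = 1e-5  # value used in the evaluation


def smooth(probabilities, eps=EPSILON):
    """Replace zero entries by eps and rescale the others so the sum stays 1."""
    zeros = sum(1 for p in probabilities if p == 0.0)
    if zeros == len(probabilities):
        # Assumption: a row never seen in training is filled with eps only.
        return [eps] * len(probabilities)
    scale = 1.0 - zeros * eps   # probability mass left for observed entries
    return [eps if p == 0.0 else p * scale for p in probabilities]


def classify(log_likelihoods, eps=EPSILON, max_unknown=3):
    """Pick the most likely application, or None if below the threshold.

    log_likelihoods maps each application to the log-likelihood of the
    connection under the smoothed model of that application.
    """
    threshold = max_unknown * math.log(eps)   # 3 log(eps)
    best = max(log_likelihoods, key=log_likelihoods.get)
    if log_likelihoods[best] < threshold:
        return None   # unclassifiable
    return best


smoothed = smooth([0.5, 0.5, 0.0, 0.0])
accepted = classify({"http": -2.0, "ssh": -40.0})
rejected = classify({"http": -40.0, "ssh": -50.0})
```

With ε = 10^-5 the threshold 3 log ε is about −34.5, so a connection is rejected only when even its best-matching model requires roughly three or more unseen states or transitions.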

4 Evaluation

4.1 Training and Test Data

We evaluated the presented traffic classification approach using TCP traffic traces captured in our department network. The traces comprise four classical client-server applications (HTTP, IMAP, SMTP, and SSH) and three peer-to-peer (P2P) applications (eDonkey, BitTorrent, and Gnutella). An accurate assignment of each TCP connection to one of the applications is possible as the HTTP, IMAP, SMTP, and SSH traffic involved our own servers. The P2P traffic, on the other hand, originated or terminated at hosts on which we had installed the corresponding peer-to-peer software; no other network service was running.

The training data consists of 300 TCP connections of each application. The evaluation of the classification approach is based on test data containing 500 connections for each application. In order to enable a comparison with the cluster-based classification approach by Bernaille [1], we only consider connections with at least four data packets. In principle, our approach also works for connections with a smaller number of data packets, yet the classification accuracy is likely to decrease in this case.

Using boxplots, Figure 3 illustrates the payload length distribution of the first seven data packets in the TCP connections contained in the training data. The packet direction is encoded in the sign: payload lengths of packets sent by the server are given a negative sign. In addition to the seven applications used for classification, there is another boxplot for HTTP connections carrying Adobe Flash video content, which will be discussed later in Section 4.5. The lower and upper ends of the boxes correspond to the 25% and 75% quantiles, and the horizontal lines in the boxes indicate the medians. The length of the whiskers is 1.5 times the distance between the 25% and 75% quantiles. Crosses mark outliers.

As can be seen, two groups of protocols can be distinguished by looking at the first data packet. In the case of SMTP and SSH, the server sends the first data packet; in all other cases, it is the client. Protocols such as IMAP or SMTP, which specify a dialog in which client and server negotiate certain parameters, are characterized by alternating packet directions. In contrast, the majority of the HTTP connections follow a simple scheme of one short client request followed by a series of large packets returned by the server.

4.2 Evaluation Metrics

As evaluation metrics, we calculate recall and precision for every application $k$:

$$\mathrm{recall}_k = \frac{\text{number of connections correctly classified as application } k}{\text{number of connections of application } k \text{ in the test data}}$$

$$\mathrm{precision}_k = \frac{\text{number of connections correctly classified as application } k}{\text{total number of connections classified as application } k}$$
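The two definitions above translate directly into a short counting routine. The following Python sketch is only an illustration; the function name and the convention of marking unclassifiable connections with `None` are assumptions made for this example.

```python
from collections import Counter


def recall_precision(true_labels, predicted_labels):
    """Return {application: (recall, precision)} for labeled test connections.

    predicted_labels may contain None for unclassifiable connections; these
    count against recall but are not assigned to any application.
    """
    in_test = Counter(true_labels)
    assigned = Counter(p for p in predicted_labels if p is not None)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    result = {}
    for app in in_test:
        recall = correct[app] / in_test[app]
        # Precision is undefined if nothing was classified as this application.
        precision = correct[app] / assigned[app] if assigned[app] else float("nan")
        result[app] = (recall, precision)
    return result
```

For example, with true labels `["HTTP", "HTTP", "IMAP", "IMAP"]` and predictions `["HTTP", "IMAP", "IMAP", None]`, HTTP obtains a recall of 0.5 and a precision of 1.0, while IMAP obtains 0.5 for both metrics.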

These two metrics are frequently used for evaluating statistical classifiers. A perfect classifier achieves 100% recall and precision for all applications. Recall is independent of the traffic composition, which means that it does not matter how many connections of the test data belong to application $k$. In contrast, precision depends on the traffic composition in the test data since the denominator usually increases for larger numbers of connections not belonging to application $k$. Using test data which contains an equal number of connections for every application, we ensure that the calculated precision values are unbiased.

In order to compare different classifiers with a single value, we calculate the overall accuracy, which is usually defined as the number of correctly classified connections divided by the total number of connections in the test data. Since the number of connections per application is constant in our case, the overall accuracy is identical to the average recall. Note that the accuracy values mentioned in this document cannot be directly compared to accuracies mentioned in many related publications, which are usually based on the unbalanced traffic compositions observed in real networks.

Fig. 3. Payload lengths of first data packets (boxplots, one panel each for HTTP, IMAP, SMTP, SSH, eDonkey, BitTorrent, Gnutella, and Flash video over HTTP; x-axis: data packet 1 to 7; y-axis: payload length from −1500 to 1500)

Table 1. Classification results of Markov classifier

| Application | 3 stages Recall | 3 stages Prec. | 3 stages Uncl. | 4 stages Recall | 4 stages Prec. | 4 stages Uncl. | 5 stages Recall | 5 stages Prec. | 5 stages Uncl. |
|---|---|---|---|---|---|---|---|---|---|
| HTTP | 96.00% | 97.17% | 0.00% | 98.80% | 95.92% | 0.00% | 97.20% | 97.59% | 0.00% |
| IMAP | 94.60% | 75.20% | 0.00% | 94.80% | 97.33% | 0.00% | 95.00% | 97.94% | 0.20% |
| SMTP | 99.60% | 94.86% | 0.00% | 99.80% | 95.23% | 0.00% | 99.80% | 95.23% | 0.00% |
| SSH | 99.00% | 99.80% | 0.00% | 99.20% | 99.60% | 0.00% | 99.20% | 99.80% | 0.20% |
| eDonkey | 55.00% | 99.28% | 0.00% | 87.20% | 99.09% | 0.00% | 89.00% | 99.55% | 0.00% |
| BitTorrent | 98.80% | 86.67% | 0.00% | 98.80% | 89.98% | 0.00% | 99.40% | 91.03% | 0.00% |
| Gnutella | 97.20% | 95.48% | 0.00% | 95.40% | 97.95% | 0.00% | 97.20% | 97.01% | 0.00% |
| Average | 91.46% | 92.64% | 0.00% | 96.29% | 96.44% | 0.00% | 96.69% | 96.88% | 0.06% |

| Application | 6 stages Recall | 6 stages Prec. | 6 stages Uncl. | 7 stages Recall | 7 stages Prec. | 7 stages Uncl. |
|---|---|---|---|---|---|---|
| HTTP | 97.20% | 97.79% | 0.20% | 97.20% | 98.38% | 0.20% |
| IMAP | 94.80% | 99.79% | 0.40% | 94.80% | 99.79% | 0.40% |
| SMTP | 99.60% | 95.40% | 0.20% | 99.60% | 95.40% | 0.20% |
| SSH | 99.40% | 99.40% | 0.20% | 99.40% | 100% | 0.40% |
| eDonkey | 93.80% | 99.79% | 0.00% | 97.40% | 99.19% | 0.00% |
| BitTorrent | 98.40% | 93.18% | 0.20% | 98.40% | 96.85% | 0.20% |
| Gnutella | 96.60% | 96.60% | 0.40% | 96.80% | 96.61% | 1.00% |
| Average | 97.11% | 97.42% | 0.23% | 97.66% | 98.03% | 0.34% |

4.3 Classification Results

Table 1 shows the classification results for different numbers of stages $l$. In addition to recall and precision, the table indicates the percentage of unclassifiable connections for every application. These connections could not be assigned to any application because the maximum log-likelihood is smaller than the lower threshold $3 \log 10^{-5} = -15$. As explained in Section 3, we apply this threshold to sort out connections which differ very much from all Markov models.


As can be seen in the table, the recall values of most applications increase or do not change much if the Markov models contain more stages, which means that more transitions between data packets are considered. The case $l = 4$ is an exception because HTTP reaches a much higher and Gnutella a much lower recall value than for the other setups. We inspected this special case and saw that 11 to 13 Gnutella connections are usually misclassified as HTTP traffic and vice versa. If the Markov models contain four stages, however, 21 Gnutella connections are misclassified as HTTP, and only four HTTP connections are misclassified as Gnutella, which leads to unusual recall (and precision) values.

Except for Markov models with seven stages, eDonkey is the application with the largest number of misclassified connections. In fact, a large number of eDonkey connections are misclassified as BitTorrent and IMAP traffic. For example, in the case of four stages, 53 eDonkey connections are assigned to BitTorrent and another 11 eDonkey connections to IMAP. These numbers decrease with larger numbers of stages. The example of eDonkey nicely illustrates the relationship between a low recall value for one application and low precision values for other applications: low recall values of eDonkey coincide with low precision values of BitTorrent and IMAP. The recall value of IMAP does not exceed 95% because 24 IMAP connections are classified as SMTP in all setups. The precision values show little variation and increase gradually with larger numbers of stages.

Finally, the number of unclassifiable connections increases for larger numbers of stages. The reason is that more transitions are evaluated, which also increases the probability of encountering transitions which did not appear in the training data. Although we assign the probability $\epsilon$ to unknown initial states and transitions, connections with three or more of these probabilities are sorted out by the given threshold. Obviously, the number of unclassifiable connections could be reduced by tolerating a larger number of unknown transitions. Alternatively, we could increase the number of connections in the training data in order to cover a larger number of rare transitions.

The average recall, which is equal to the overall accuracy, jumps from 91.46% to 96.29% when the number of stages is increased from three to four. At the same time, the average precision increases from 92.64% to 96.44%. Thereafter, both averages increase gradually with every additional stage. Hence, at least four data packets should be considered in the Markov models to obtain good results.

4.4 Comparison with Bernaille’s Approach

Bernaille [1] proposed a traffic classification method which uses clustering algorithms to find connections with similar payload lengths and directions in the first data packets. The Matlab code of this method can be downloaded from a website [8]. Bernaille’s approach requires that all connections in the test and training data have at least as many data packets as analyzed by the classification method. Furthermore, the results of his work show that the best results are achieved with three or four data packets. As mentioned in Section 4.1, we prepared our datasets for a comparison with Bernaille by including only connections with at least four data packets.


Table 2. Classification results of Bernaille’s classifier

| Application | Run 1 (3 data packets, 27 clusters) Recall | Run 1 Prec. | Run 2 (4 data packets, 34 clusters) Recall | Run 2 Prec. | Run 3 (3 data packets, 28 clusters) Recall | Run 3 Prec. | Run 4 (3 data packets, 29 clusters) Recall | Run 4 Prec. |
|---|---|---|---|---|---|---|---|---|
| HTTP | 88.60% | 96.30% | 99.60% | 86.16% | 88.00% | 94.62% | 90.20% | 95.35% |
| IMAP | 91.00% | 96.19% | 93.20% | 99.79% | 92.60% | 83.88% | 87.80% | 90.89% |
| SMTP | 98.80% | 95.18% | 97.40% | 95.49% | 90.20% | 100% | 98.80% | 95.37% |
| SSH | 97.20% | 98.98% | 95.40% | 99.58% | 97.40% | 100% | 97.80% | 98.79% |
| eDonkey | 95.80% | 87.09% | 98.80% | 98.80% | 91.00% | 92.11% | 100% | 89.61% |
| BitTorrent | 88.80% | 100% | 93.60% | 100% | 96.80% | 95.09% | 97.20% | 100% |
| Gnutella | 96.40% | 85.61% | 92.00% | 92.37% | 96.20% | 88.75% | 95.20% | 97.74% |
| Average | 93.80% | 94.19% | 95.71% | 96.03% | 93.17% | 93.49% | 95.29% | 95.39% |

The learning phase of Bernaille’s classifier is nondeterministic and depends on random initialization of the cluster centroids. Furthermore, the number of clusters as well as the number of data packets needs to be given as input parameters to the training algorithm. The documentation of Bernaille’s Matlab code recommends 30 to 40 clusters and three to four data packets as a good starting point. At the end of the clustering, the algorithm automatically removes clusters to which fewer than three connections of the training data are assigned. A calibration method performs the training of the classifier with different numbers of clusters and data packets and returns the model which achieves the highest classification accuracy with respect to the training data.

As recommended by Bernaille, we ran the calibration method to cluster the connections in the training data with 30, 35, and 40 initial cluster centroids and three and four data packets. The best classifier was then used to classify the test data by assigning each of the connections to the nearest cluster. Further improvements, which Bernaille achieved by considering port numbers in addition to cluster assignments [1], were not considered since our approach does not evaluate port numbers either.

Table 2 shows the classification results for four different runs of the calibration method. As can be seen, the average recall and precision values do not reach the same level as the Markov classifier. A possible explanation is that Bernaille’s approach does not consider any correlation between subsequent packets. The classification results vary a lot between different runs of the calibration method. Interestingly, we obtain very different results in the third and fourth run although both classifiers use three data packets and a very similar number of clusters. The range of the recall values obtained for an individual application can be very wide. The most extreme example is HTTP with recall values ranging from 88.0% to 99.6%. In general, we observed that the classification results depend very much on the initialization values of the cluster centroids and not so much on the remaining parameters, such as the number of clusters and data packets.

Table 3. Classification of Flash over HTTP traffic

| Stages | Tolerant classifier: HTTP | Tolerant: Gnutella | Tolerant: Uncl. | Intolerant classifier: HTTP | Intolerant: Gnutella | Intolerant: Uncl. |
|---|---|---|---|---|---|---|
| 4 | 68.0% | 17.4% | 14.6% | 60.0% | 11.0% | 29.0% |
| 5 | 63.8% | 16.6% | 19.6% | 44.6% | 14.4% | 41.0% |
| 6 | 60.8% | 17.0% | 22.2% | 43.8% | 11.6% | 44.6% |
| 7 | 61.2% | 16.0% | 22.8% | 39.6% | 11.6% | 48.8% |

In contrast to Bernaille’s approach, the training of the Markov classifier always yields deterministic models which do not depend on any random initialization. Hence, we do not need to run the training method several times, which is an advantage regarding the practical deployment.

4.5 Change of Application Usage

HTTP has become a universal protocol for various kinds of data transports. Many websites now include multimedia content, such as animated pictures or videos. There are many sites delivering such content, with www.youtube.com being one of the most popular. A large proportion of this embedded multimedia content is based on Adobe Flash. Flash typically transfers data in streaming mode, which means that after a short prefetching delay the user can start watching the video without having to wait until the download is finished.

In order to assess how our classification approach behaves if the usage of an application changes, we applied the classifier to 500 HTTP connections carrying Flash video content. These connections were captured in our university network and identified by the HTTP content type “video/x-flv”. The boxplots at the bottom right of Figure 3 show the payload length distribution. Compared to the previously analyzed HTTP connections, which did not include any Flash video downloads, the variance in the first four packets is much larger. The request packets sent from the client to the server tend to contain more payload than in the case of other HTTP traffic, whereas the second and third packets are often smaller.

Traffic classification should be robust against such changes of application usage. In the optimal case, the classifier still classifies the corresponding connections correctly as HTTP traffic. Apart from that, it is also acceptable to label the connections as unclassifiable. On the other hand, the connections should not be assigned to wrong applications.

Table 3 shows how the HTTP connections containing Flash video content are classified depending on the number of stages in the Markov models. Apart from tolerant classification with $\epsilon = 10^{-5}$ and log-likelihood threshold $-15$, we tested an intolerant classifier which disqualifies all connections with an unknown initial state or transition. The tolerant classifier assigns more than 60% of the connections to HTTP and around 17% to Gnutella. Hence again, similarities between HTTP and Gnutella traffic cause a certain number of misclassified connections. The remaining connections remain unclassified because the maximum log-likelihood is smaller than $3 \log 10^{-5}$. With the intolerant classifier, about twice as many connections remain unclassified, mainly on account of connections previously assigned to HTTP. This shows that tolerance of non-conforming connections increases the robustness of the classifier against usage changes.

Although the tolerant classifier still classifies most of the connections as HTTP traffic, the classification accuracy is degraded. To solve this problem, it suffices to re-estimate the Markov model of HTTP with training data covering the new kind of HTTP traffic. Alternatively, we can add a Markov model which explicitly models Flash over HTTP.

4.6 Complexity

The estimation of initial state and transition probabilities using equation (1) requires counting the frequency of initial packet properties and transitions. If the training data contains $C$ connections of an application, estimating the parameters of the corresponding Markov model with $l$ stages requires at most $C \cdot l$ counter increments plus $7 + 56(l-1)$ additions and $8l$ divisions.

In order to classify a connection, the log-likelihood needs to be calculated for every Markov model using equation (2). This calculation requires $(N-1)$ additions, $N$ being the number of analyzed data packets in the connection. The number of stages $l$ is an upper bound for $N$. The maximum log-likelihood of all Markov models needs to be determined and checked against the given lower threshold. Hence, for $K$ different applications, we have at most $K(l-1)$ additions and $K$ comparisons.

Other statistical traffic classification approaches typically require more complex calculations. This is particularly true for HMMs, where emission probabilities have to be considered in addition to transition probabilities. Regarding Bernaille’s approach, the clustering algorithm determines the assignment probability of every connection in the training data to every cluster. After recalculating the cluster centroids, the procedure is repeated in another iteration. Just one of these iterations is more complex than estimating the Markov models. The classification of a connection requires calculating the assignment probabilities for every cluster. If Gaussian mixture models (GMMs) are used as in Bernaille’s Matlab code, the probabilities are determined under the assumption of multivariate normal distributions, which is more costly than calculating the Markov likelihoods.
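To give a feeling for the magnitudes involved, the operation counts stated above can be evaluated for the setup of the evaluation (300 training connections per application, seven applications, four stages). The following Python sketch merely evaluates the formulas from the text; the function names are hypothetical.

```python
def training_cost(connections, stages):
    """Upper bounds for estimating one Markov model with eight states per stage."""
    increments = connections * stages      # at most C * l counter increments
    additions = 7 + 56 * (stages - 1)      # 7 + 56(l - 1) additions
    divisions = 8 * stages                 # 8l divisions
    return increments, additions, divisions


def classification_cost(applications, stages):
    """Upper bounds for classifying one connection against K models."""
    additions = applications * (stages - 1)   # K(l - 1) additions
    comparisons = applications                # K comparisons
    return additions, comparisons


print(training_cost(300, 4))        # (1200, 175, 32)
print(classification_cost(7, 4))    # (21, 7)
```

Under these assumptions, classifying a connection against all seven four-stage models costs at most 21 additions and 7 comparisons, not counting the table lookups for the logarithms of the probabilities.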

5 Conclusion

We presented a novel traffic classification approach based on observable Markov models and evaluated the classification accuracy with the help of TCP traffic traces of different applications. The results show that our approach yields slightly better results than Bernaille’s cluster-based classification method [1]. Furthermore, it provides a certain level of robustness with respect to the changed usage of an application. Moreover, the complexity of our approach is low compared to other statistical traffic classification methods.

The classification accuracy depends on the number of stages per Markov model, which corresponds to the maximum number of data packets considered per TCP connection. Based on our evaluation, we recommend Markov models with at least four stages, corresponding to 32 states. Every additional stage gradually improves the accuracy. As an important property, connections whose number of data packets is smaller than the number of stages can still be classified. Hence, the only drawback of maintaining more stages is that more transition probabilities need to be estimated, saved, and evaluated per application.

In order to better assess the performance of our classification approach, we intend to apply it to other traffic traces captured in different networks. Beyond that, it will be interesting to consider additional applications since the set of applications regarded in our evaluation is very limited. Finally, we are considering extending the approach to the classification of UDP traffic, which is mainly used for real-time applications.

Acknowledgments

We gratefully acknowledge support from the German Research Foundation (DFG) funding the LUPUS project in which this research work has been conducted.

References

1. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT) 2006, Lisboa, Portugal (2006)
2. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials 10, 56–76 (2008)
3. Wright, C., Monrose, F., Masson, G.: HMM profiles for network traffic classification (extended abstract). In: Proc. of Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC), Fairfax, VA, USA, pp. 9–15 (2004)
4. Dainotti, A., de Donato, W., Pescapè, A., Rossi, P.S.: Classification of network traffic via packet-level hidden Markov models. In: Proc. of IEEE Global Telecommunications Conference, GLOBECOM 2008, New Orleans, LA, USA (2008)
5. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of IEEE, vol. 77, pp. 257–286 (1989)
6. Estevez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Stochastic protocol modeling for anomaly based network intrusion detection. In: Proc. of IEEE International Workshop on Information Assurance, IWIA (2003)
7. Dai, H., Münz, G., Braun, L., Carle, G.: TCP-Verkehrsklassifizierung mit Markov-Modellen. In: 5. GI/ITG-Workshop MMBnet 2009, Hamburg, Germany (2009)
8. Bernaille, L.: Homepage of early application identification (2009), http://www-rp.lip6.fr/~teixeira/bernaill/earlyclassif.html

K-Dimensional Trees for Continuous Traffic Classification⋆

Valentín Carela-Español¹, Pere Barlet-Ros¹, Marc Solé-Simó¹, Alberto Dainotti², Walter de Donato², and Antonio Pescapé²

¹ Department of Computer Architecture, Universitat Politècnica de Catalunya (UPC)
{vcarela,pbarlet,msole}@ac.upc.edu
² Department of Computer Engineering and Systems, Università di Napoli Federico II
{alberto,walter.dedonato,pescape}@unina.it

Abstract. The network measurement community has proposed multiple machine learning (ML) methods for traffic classification in recent years. Although several research works have reported accuracies over 90%, most network operators still use either obsolete (e.g., port-based) or extremely expensive (e.g., pattern matching) methods for traffic classification. We argue that one of the barriers to the real deployment of ML-based methods is their time-consuming training phase. In this paper, we revisit the viability of using the Nearest Neighbor technique for traffic classification. We present an efficient implementation of this well-known technique based on multiple K-dimensional trees, which is characterized by short training times and high classification speed. This allows us not only to run the classifier online but also to continuously retrain it, without requiring human intervention, as the training data become obsolete. The proposed solution achieves very promising accuracy (> 95%) while looking just at the size of the very first packets of a flow. We present an implementation of this method based on the TIE classification engine as a feasible and simple solution for network operators.

⋆ This work has been supported by the European Community’s 7th Framework Programme (FP7/2007-2013) under Grant Agreement No. 225553 (INSPIRE Project) and Grant Agreement No. 216585 (INTERSECTION Project).

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 141–154, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Gaining information about the applications that generate traffic in an operational network is much more than mere curiosity for network operators. Traffic engineering, capacity planning, traffic management or even usage-based pricing are some examples of network management tasks for which this knowledge is extremely important. Although this problem is still far from a definitive solution, the networking research community has proposed several machine learning (ML) techniques for traffic classification that can achieve very promising results in terms of accuracy. However, in practice, most network operators still use either obsolete (e.g., port-based) or impractical (e.g., pattern matching) methods for traffic identification and classification. One of the reasons that explains this slow adoption by network operators is the time-consuming training phase of most ML-based methods, which often requires human supervision and manual inspection of network traffic flows.

In this paper, we revisit the viability of using the well-known Nearest Neighbor (NN) machine learning technique for traffic classification. As we will discuss throughout the paper, this method has a large number of features that make it very appealing for traffic classification. However, it is often discarded given its poor classification speed [15, 11]. In order to address this practical problem, we present an efficient implementation of the NN search algorithm based on a K-dimensional tree structure that allows us not only to classify network traffic online with high accuracy, but also to retrain the classifier on-the-fly with minimum overhead, thus lowering the barriers that hinder the general adoption of ML-based methods by network operators.

Our K-dimensional tree implementation only requires information about the length of the very first packets of a flow. This solution provides network operators with the interesting feature of early classification [2, 3]. That is, it allows them to rapidly classify a flow without having to wait until its end, which is a requirement of most previous traffic classification methods [12, 16, 7]. In order to further increase the accuracy of the method along with its classification speed, we combine the information about the packet sizes with the relevant data still provided by the port numbers [11].

We present an actual implementation of the method based on the Traffic Identification Engine (TIE) [5]. TIE is a community-oriented tool for traffic classification that allows multiple classifiers (implemented as plugins) to run concurrently and produce a combined classification result. Given the low overhead imposed by the training phase of the method and the plugins already provided by TIE to set the ground truth (e.g., L7 plugin), the implementation has the unique feature of continuous training. This feature allows the system to automatically retrain itself as the training data becomes obsolete. We hope that the large advantages of the method (i.e., accuracy (> 95%), classification speed, early classification and continuous training) can give an incentive to network operators to progressively adopt new and more accurate ML-based methods for traffic classification.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the ML-based method based on TIE. Section 4 analyzes the performance of the method and presents preliminary results of its continuous training feature. Finally, Section 5 concludes the paper and outlines our future work.

2 Related Work

Traffic classification is a classical research area in network monitoring and several previous works have proposed different solutions to the problem. This section briefly reviews the progress in this research field, particularly focusing on those works that used the Nearest Neighbor algorithm for traffic classification.

Originally, the most common and simplest technique to identify network applications was based on the port numbers (e.g., those registered by the IANA [9]). This solution was very efficient and accurate with traditional applications. However, the arrival of new applications (e.g., P2P) that do not use a pre-defined set of ports or even use registered ones from other applications made this solution unreliable to classify current Internet traffic.

Deep packet inspection (DPI) constituted the first serious alternative to the well-known ports technique. DPI methods are based on searching for typical signatures of each application in the packet payloads. Although these techniques can potentially be very accurate, the high resource requirements of pattern matching algorithms and their limitations in the presence of encrypted traffic make their use incompatible with the continuously growing amount of data in current networks.

Machine learning (ML) techniques were later proposed as a promising solution to the well-known limitations of port- and DPI-based techniques. ML methods extract knowledge of the characteristic features of the traffic generated by each application from a training set. This knowledge is then used to build a classification model. We refer the interested reader to [13], where an extensive comparative study of existing ML methods for traffic classification is presented.

Among the different ML-based techniques existing in the literature, the NN method rapidly became one of the most popular alternatives due to its simplicity and high accuracy. In general, given an instance p, the NN algorithm finds the nearest instance (usually using the Euclidean distance) from a training set of examples. NN is usually generalized to K-NN, where K refers to the number of nearest neighbors to take into account.

The NN method for traffic classification was first proposed in [14], where a comparison of the NN technique with the Linear Discriminant Analysis method was presented. The authors showed that NN was able to classify, among 7 different classes of traffic, with an error rate below 10%. However, the most interesting conclusions about the NN algorithm are found in the works of Williams et al. [15] and Kim et al. [11]. Both works compared different ML methods and showed the pros and cons of the NN algorithm for traffic classification. In summary, NN was shown to be one of the most accurate ML methods, with the additional feature of requiring zero time to build the classification model. However, NN was the ML-based algorithm with the worst results in terms of classification speed. This is the reason why NN is often discarded for online classification.

The efficient implementation of the NN algorithm presented in this paper is based instead on the K-dimensional tree, which solves its problems in terms of classification speed, while keeping very high accuracy. Another important feature of the method is its ability to classify the network traffic early. This idea is borrowed from the work of Bernaille et al. [2, 3]. This early classification feature allows the method to classify the network traffic by just using the first packets of each flow. Bernaille et al. compared three different unsupervised ML methods (K-Means, GMM and HMM), while in this work we apply this idea to a supervised ML method (NN).

As ML-based methods for traffic classification become more popular, new techniques appear in order to evade classification. These techniques, such as protocol obfuscation, modify the value of the features commonly used by the traffic classification methods (e.g., by simulating the behavior of other applications or padding packets). Several alternative techniques have also been proposed to avoid some of these limitations. BLINC [10] is arguably the best-known exponent of this alternative branch. Most of these methods base their identification on the behavior of the end-hosts and, therefore, their accuracy is strongly dependent on the network viewpoint where the technique is deployed [11].
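The basic (K-)NN rule reviewed in this section can be illustrated with a brute-force Python sketch. It is not the K-dimensional tree implementation proposed in this paper; the function name, the majority vote among the K neighbors, and the example feature vectors (signed sizes of the first packets) are assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_classify(instance, training_set, k=1):
    """Label an instance by majority vote among its k nearest training examples.

    training_set is a list of (feature_vector, label) pairs; distances are
    Euclidean. Every query scans the whole training set.
    """
    neighbors = sorted(training_set, key=lambda ex: math.dist(instance, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training examples: sizes of the first three packets of a flow.
training = [
    ((100, -1400, -1400), "HTTP"),
    ((120, -1300, -1350), "HTTP"),
    ((-80, 20, -60), "SMTP"),
]
print(knn_classify((110, -1380, -1400), training))   # HTTP
```

No model has to be built before the first query, which matches the zero training time mentioned above. On the other hand, each classification compares the instance with every training example, which explains the poor classification speed of a naive implementation.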

3 Methodology

This section describes the ML-based classification method based on multiple K-dimensional trees, together with its continuous training system. We also introduce TIE, the traffic classification system we use to implement our technique, and the modifications made to it in order to allow the method to continuously retrain itself.

3.1 Traffic Identification Engine

TIE [5] is a modular traffic classification engine developed by the Università di Napoli Federico II. The tool is designed to allow multiple classifiers (implemented as plugins) to run concurrently and produce a combined classification result. In this work, we implement the traffic classification method as a TIE plugin. TIE is divided into independent modules that are in charge of the different classification tasks. The first module, Packet Filter, uses the Libpcap library to collect the network traffic. This module can also filter the packets according to BPF or user-level filters (e.g., skip the first n packets, check header integrity or discard packets in a time range). The second module, Session Builder, aggregates packets into flows (i.e., unidirectional flows identified by the classic 5-tuple), biflows (i.e., both directions of the traffic) or host sessions (aggregation of all the traffic of a host). The Feature Extractor module calculates the features needed by the classification plugins. There is a single module for feature extraction in order to avoid redundant calculations for different plugins. TIE provides a multi-classifier engine divided into a Decision Combiner module and a set of classification plugins. On the one hand, the Decision Combiner is in charge of calling the classification plugins when their features are available. On the other hand, it merges the results obtained from the different classification plugins into a definitive classification result. In order to allow comparisons between different methods, the Output module provides the classification results from the Decision Combiner based on a set of applications and groups of applications defined by the user. TIE supports three operating modes. The offline mode generates the classification results at the end of the TIE execution. The real-time mode outputs the classification results as soon as possible, while the cycling mode is a hybrid mode that generates the information every n minutes.
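The difference between the flow and biflow aggregation levels of the Session Builder can be illustrated with a short sketch. This is not TIE code: the `Packet` type, the field names and the key functions are hypothetical, and the only assumption taken from the text is that a flow is keyed by the 5-tuple while a biflow groups both directions of the same conversation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; not a TIE data structure.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int
    size: int


def flow_key(p):
    """Unidirectional flow: the classic 5-tuple, direction matters."""
    return (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)


def biflow_key(p):
    """Bidirectional flow: order the two endpoints so both directions match."""
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def aggregate(packets, key_fn):
    """Group packet sizes by session key, preserving arrival order."""
    sessions = defaultdict(list)
    for p in packets:
        sessions[key_fn(p)].append(p.size)
    return dict(sessions)


packets = [
    Packet("10.0.0.1", 40000, "192.0.2.7", 80, 6, 120),
    Packet("192.0.2.7", 80, "10.0.0.1", 40000, 6, 1460),
    Packet("10.0.0.1", 40000, "192.0.2.7", 80, 6, 60),
]
flows = aggregate(packets, flow_key)      # request and reply are separate flows
biflows = aggregate(packets, biflow_key)  # request and reply share one biflow
```

With the three example packets, the flow view yields two sessions (one per direction) and the biflow view a single session whose packet sizes appear in arrival order.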

3.2 KD-Tree Plugin

In order to evaluate the traffic classification method, while providing a ready-to-use tool for network operators, we implement the K-dimensional tree technique as a TIE plugin. Before describing the details of this new plugin, we introduce the K-dimensional tree technique. In particular, we focus on the major differences with the original NN search algorithm. The K-dimensional tree is a data structure for efficiently implementing the Nearest Neighbor search algorithm. It represents a set of N points in a K-dimensional space, as described by Friedman et al. [8] and Bentley [1]. In the naive NN technique, the set of points is represented as a set of vectors, where each position of a vector holds one coordinate of a point (i.e., a feature). Besides these data, the K-dimensional tree implementation also creates a binary tree that recursively takes the median point of the set of points, leaving half of the points on each side. The original NN algorithm iteratively searches for the point i, from a set of points E, that is nearest to a point p. In order to find i, it computes the distance (e.g., Euclidean or Manhattan distance) from each point in E to p. Likewise, in a K-NN search, the algorithm looks for the K points nearest to p. This search has O(N) time complexity and becomes impractical with the amount of traffic found in current networks. In contrast, a search in a K-dimensional tree finds the nearest point in O(log N) time on average, with the additional one-time cost of O(N log N) for building the binary tree. Besides this notable improvement, the structure also supports approximate searches, which can substantially improve the classification time at the cost of a very small error. The K-dimensional tree plugin that we implement in TIE combines the K-dimensional tree implementation provided by the C++ ANN library with a structure that represents the relevant information still provided by the port numbers. In particular, we create an independent K-dimensional tree for each relevant port. We refer to the ports that generate the most traffic as relevant ports. Although the list of relevant ports can be computed automatically, we also provide the user with the option of configuring this list manually. Another configuration parameter is the approximation value, which allows the method to improve its classification speed by performing an approximate NN search. In the evaluation, we set this parameter to 0, which means that the approximation feature is not used. However, higher values of this parameter could substantially improve the classification time in critical scenarios, while still obtaining a reasonable accuracy. Unlike the original NN algorithm, the proposed method requires a lightweight training phase to build the K-dimensional tree structure. Before building the data structure, a sanitation process is performed on the training data. This procedure removes the instances labeled as unknown from the training dataset, assuming that they have similar characteristics to other known flows. This assumption is similar to that of ML clustering methods, where unlabeled instances are classified according to their proximity in the feature space to those that are known. The sanitation process also removes repeated or indistinguishable instances.
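The ideas above (median-split tree construction, exact nearest-neighbor search with pruning, one tree per relevant port and a sanitation pass) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plugin's implementation, which relies on the C++ ANN library: all function names are hypothetical, samples are assumed to be `(port, packet_sizes, label)` triples with vectors of equal length, sanitation here drops only `unknown` labels and exact duplicates, and flows on non-relevant ports are assumed to share one fallback tree (stored under the key `None`), a detail the text does not specify.

```python
import random


def sanitize(samples):
    """Drop samples labeled 'unknown' and exact duplicates, keeping order."""
    seen, clean = set(), []
    for port, sizes, label in samples:
        key = (port, tuple(sizes), label)
        if label != "unknown" and key not in seen:
            seen.add(key)
            clean.append(key)
    return clean


def build(points, depth=0):
    """Build a K-d tree from (vector, label) pairs by splitting on the median."""
    if not points:
        return None
    axis = depth % len(points[0][0])          # cycle through the coordinates
    points = sorted(points, key=lambda pl: pl[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid][0], "label": points[mid][1], "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Exact NN search; returns (squared distance, label) of the closest point."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["label"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], query, best)
    if diff * diff < best[0]:                 # far side may still hold a closer point
        best = nearest(node[far], query, best)
    return best


def brute_force(points, query):
    """Naive O(N) search used as a reference."""
    return min((sq_dist(p, query), label) for p, label in points)


def train(samples, relevant_ports):
    """One tree per relevant port; other ports share a tree under key None."""
    groups = {}
    for port, sizes, label in samples:
        key = port if port in relevant_ports else None
        groups.setdefault(key, []).append((sizes, label))
    return {key: build(pts) for key, pts in groups.items()}


def classify(model, port, sizes):
    tree = model.get(port, model.get(None))
    return nearest(tree, sizes)[1] if tree else None


random.seed(1)
samples = [(random.choice([80, 443, 6881]),
            [random.randint(0, 1500) for _ in range(3)],
            random.choice(["WEB", "P2P", "unknown"])) for _ in range(300)]
model = train(sanitize(samples), relevant_ports={80, 443})
print(classify(model, 80, (100, 1460, 52)))
```

The pruning test `diff * diff < best[0]` is what lets the search skip whole subtrees; an approximate search would relax this test, trading a small error for speed.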


The traffic features used by our plugin are the destination port number and the lengths of the first n packets of a flow (without considering the TCP handshake). By using only the first n packets, the plugin can classify flows very quickly, giving the network operator the possibility of reacting promptly to the classification results. In order to accurately classify short flows, the training phase also accepts flows with fewer than n packets by filling the empty positions with null coordinates.

3.3 Continuous Training System

In this section, we show the interaction of our KD-Tree plugin with the rest of the TIE architecture, and describe the modifications made to TIE to allow our plugin to continuously retrain itself. Figure 1 shows the data flow of our continuous training system based on TIE. The first three modules are used without any modification, as found in the original version of TIE. Besides implementing the new KD-Tree plugin, we significantly modified the Decision Combiner module and the L7 plugin. Our continuous training system follows the original TIE operation mode most of the time. Every packet is aggregated into bidirectional flows while its features are calculated. When the traffic features of a flow (i.e., the first n packet sizes) are available, or upon its expiration, the flow is classified by the KD-Tree plugin. Although the method was tested with bidirectional flows, the current implementation also supports the classification of unidirectional flows. In order to automatically retrain our plugin as the training data becomes obsolete, we need a technique to set the base-truth. TIE already provides the L7 plugin, which implements a DPI technique originally used by TIE for validation purposes. We modified the implementation of this plugin to continuously produce training data (which includes flow labels, that is, the base-truth, obtained by L7) for future trainings. While every flow is sent to the KD-Tree plugin through the main path, the Decision Combiner module applies flow sampling to the traffic, which is sent through a secondary path to the L7 plugin. This secondary path is used to (i) set the base-truth for the continuous training system, (ii) continuously check the accuracy of the KD-Tree plugin by comparing its output with that of L7, and (iii) keep the required computational power low by using flow sampling (performing DPI on every single flow would significantly decrease the performance of TIE). The Decision Combiner module is also in charge of automatically triggering the training of the KD-Tree plugin according to three different events that can be configured by the user: after p packets, after s seconds, or if the accuracy of the plugin compared to the L7 output falls below a certain threshold t. The flows classified by the L7 plugin, together with their features (i.e., destination port, n packet sizes, L7 label), are placed in a queue. This queue keeps the last f classified flows or the flows classified during the last s seconds. The training module of the KD-Tree plugin is executed in a separate thread. This way, the KD-Tree plugin can continuously classify the incoming flows without interruption, while it is periodically updated. The training module builds a


Fig. 1. Diagram of the Continuous Training Traffic Classification system based on TIE

completely new multi K-dimensional tree model using the information available in the queue. As future work, we plan to study the alternative of incrementally updating the old model with the new information, instead of creating a new model from scratch. In addition, it is possible to automatically update the list of relevant ports by using the training data as a reference.
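The retraining logic described in this section (three configurable triggers, a bounded queue of DPI-labeled flows, and training in a separate thread) can be sketched as follows. This is a hypothetical illustration rather than the modified Decision Combiner: the class and function names are invented, the parameters p, s, t and f mirror the symbols used in the text, only the "last f flows" variant of the queue is shown, and `build_model` and `publish` stand for whatever model builder and model-swap callback an implementation would provide.

```python
import threading
import time
from collections import deque


class RetrainTrigger:
    """Tracks the three retraining events and the queue of labeled flows."""

    def __init__(self, p=None, s=None, t=None, f=1000, clock=time.monotonic):
        self.p, self.s, self.t = p, s, t   # packets, seconds, accuracy threshold
        self.queue = deque(maxlen=f)       # keeps only the last f labeled flows
        self.clock = clock
        self.reset()

    def reset(self):
        """Restart the counters after a retraining (the queue is kept)."""
        self.packets = 0
        self.checked = 0
        self.correct = 0
        self.started = self.clock()

    def on_packet(self):
        self.packets += 1

    def on_sampled_flow(self, dst_port, sizes, dpi_label, kd_label):
        """Called for each sampled flow labeled by DPI on the secondary path."""
        self.queue.append((dst_port, tuple(sizes), dpi_label))
        self.checked += 1
        self.correct += int(kd_label == dpi_label)

    def accuracy(self):
        return self.correct / self.checked if self.checked else 1.0

    def should_retrain(self):
        if self.p is not None and self.packets >= self.p:
            return True
        if self.s is not None and self.clock() - self.started >= self.s:
            return True
        return self.t is not None and self.checked > 0 and self.accuracy() < self.t


def retrain_async(trigger, build_model, publish):
    """Build a new model from a queue snapshot in a worker thread."""
    data = list(trigger.queue)             # snapshot, so classification can go on
    trigger.reset()
    worker = threading.Thread(target=lambda: publish(build_model(data)))
    worker.start()
    return worker
```

Because the worker builds a complete new model and only then hands it to `publish`, the classifier can keep answering queries with the old model until the swap, which matches the behavior described above.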

4 Results

This section presents a performance evaluation of the proposed technique. First, Subsection 4.1 describes the dataset used in the evaluation. Subsection 4.2 compares the original Nearest Neighbor algorithm with the K-dimensional tree implementation. Subsection 4.3 presents a performance evaluation of the plugin described in Subsection 3.2 and evaluates different aspects of the technique, such as the relevant ports or the number of packet sizes used for the classification. Finally, Subsection 4.4 presents a preliminary study of the impact of the continuous training system on the traffic classification.

4.1 Evaluation Datasets

The dataset used in our performance evaluation consists of 8 full-payload traces collected at the Gigabit access link of the Universitat Politècnica de Catalunya (UPC), which connects about 25 faculties and 40 departments (geographically distributed in 10 campuses) to the Internet through the Spanish Research and Education network (RedIRIS). This link provides Internet access to about 50000 users. The traces were collected on different days and at different hours in order to cover traffic from as many different applications as possible. Due to privacy issues, we are not able to publish our traces. However, we made our traces accessible using the CoMo-UPC model presented in [4]. Table 1 presents the details of the traces used in the evaluation. In order to evaluate the proposed method, we used the first seven traces. Among those traces, we selected a single trace (UPC-II) as the training dataset, since it is the trace that contains the highest diversity in terms of instances from different applications. We limited our training set to one trace in order to leave a meaningful

Table 1. Characteristics of the traffic traces in our dataset

Name      Date      Day  Start Time  Duration  Packets  Bytes  Valid Flows  Avg. Util
UPC-I     11-12-08  Thu  10:00       15 min    95 M     53 G   1936 K       482 Mbps
UPC-II    11-12-08  Thu  12:00       15 min    114 M    63 G   2047 K       573 Mbps
UPC-III   12-12-08  Fri  01:00       15 min    69 M     38 G   1419 K       345 Mbps
UPC-IV    12-12-08  Fri  16:00       15 min    102 M    55 G   2176 K       500 Mbps
UPC-V     14-12-08  Sun  00:00       15 min    53 M     29 G   1346 K       263 Mbps
UPC-VI    21-12-08  Sun  12:00       1 h       175 M    133 G  3793 K       302 Mbps
UPC-VII   22-12-08  Mon  12:30       1 h       345 M    256 G  6684 K       582 Mbps
UPC-VIII  10-03-09  Tue  03:00       1 h       114 M    78 G   3711 K       177 Mbps

number of traces for the evaluation that are not used to build the classification model. Therefore, the remaining traces were used as the validation dataset. The last trace, UPC-VIII, was recorded four months apart from the trace UPC-II. Given this time difference, we used this trace to perform a preliminary experiment to evaluate the gain provided by our continuous training solution.

4.2 Nearest Neighbor vs. K-Dimensional Tree

Section 3.2 already discussed the main advantages of the K-dimensional tree technique compared to the original Nearest Neighbor algorithm. In order to present numerical results showing this gain, we perform a comparison between both methods. We evaluate the method presented in this paper against the original NN search, which the ANN library implements for validation purposes. Given that the ANN library implements both methods in the same structure, we calculated the theoretical minimum memory resources necessary for the naive NN technique (i.e., # unique examples * # packet sizes * 4 bytes (C++ integer)). We tested both methods with the trace UPC-II (i.e., ≈500,000 flows after the sanitation process) using a 3 GHz machine with 4 GB of RAM. It is important to note that, since we are performing an offline evaluation, we do not approximate the NN search in either the original NN algorithm or the K-dimensional tree technique. For this reason, the accuracy of both methods is the same. Table 2 summarizes the improvements obtained by combining the K-dimensional tree technique with the information from the port numbers. Results are shown in terms of classifications per second, depending on the number of packets needed for the classification and the list of relevant ports. There are three possible lists of relevant ports. The first is the unique list, where there are no relevant ports and all the instances belong to the same K-dimensional tree or NN structure. The second is the selected list, which is composed of the set of ports that carry most of the traffic in the UPC-II trace (i.e., ports that receive more than 0.05% of the traffic (69 ports in the UPC-II trace)). Finally, we refer to all as the list where all ports found in the UPC-II trace are considered relevant. The first column corresponds to the original NN presented in previous works [11, 14, 15], where

Table 2. Speed Comparison (flows/s): Nearest Neighbor vs K-Dimensional Tree

              Naive Nearest Neighbor              K-Dimensional Tree
Packet Size   Unique  Selected Ports  All Ports   Unique  Selected Ports  All Ports
1             45578   104167          185874      423729  328947          276243
5             540     2392            4333        58617   77280           159744
7             194     1007            1450        22095   34674           122249
10            111     538             796         1928    4698            48828

Table 3. Memory Comparison: Nearest Neighbor vs K-Dimensional Tree

                                       K-Dimensional Tree
Packet Size   Naive Nearest Neighbor   Unique    Selected Ports  All Ports
1             2.15 MB                  40.65 MB  40.69 MB        40.72 MB
5             10.75 MB                 52.44 MB  52.63 MB        53.04 MB
7             15.04 MB                 56.00 MB  56.22 MB        57.39 MB
10            21.49 MB                 68.29 MB  68.56 MB        70.50 MB

all the information is maintained in a single structure. When only one packet is required, the proposed method is ten times faster than the original NN. However, the speed of the original method decreases dramatically as the number of packets required increases, becoming even a hundred times slower than the K-dimensional tree technique. In almost all situations, the introduction of the list of relevant ports substantially increases the classification speed of both methods. Tables 3 and 4 show the extremely low price that the K-dimensional tree technique pays for a notable improvement in classification speed. The results show that the memory resources required by the method, although higher than those of the naive NN technique, are modest. The memory used by the K-dimensional tree is almost independent of the relevant ports parameter and barely affected by the number of packet sizes. Regarding time, we believe that the cost of the training phase is well compensated by the ability to use the method as an online classifier. In the worst case, the method takes only about 20 seconds for the building phase. Since both methods output the same classification results, the data presented in this subsection show that the combination of the relevant ports and the K-dimensional tree technique significantly improves the original NN search, with the only drawback of a (very fast) training phase. This improvement allows us to use this method as an efficient online traffic classifier.

4.3 K-Dimensional Tree Plugin Evaluation

In this section we study the accuracy of the method depending on the different parameters of the KD-Tree plugin. Figure 2(a) presents the accuracy according to the number of packet sizes for the different traces of the dataset. In this case,

Table 4. Building Time Comparison: Nearest Neighbor vs K-Dimensional Tree

                                       K-Dimensional Tree
Packet Size   Naive Nearest Neighbor   Unique   Selected Ports  All Ports
1             0 s                      13.01 s  12.72 s         12.52 s
5             0 s                      16.45 s  16.73 s         15.62 s
7             0 s                      17.34 s  16.74 s         16.07 s
10            0 s                      19.81 s  19.59 s         18.82 s

Fig. 2. K-dimensional tree evaluation without the support of the relevant ports. (a) K-dimensional tree accuracy (by flow) without relevant ports support (x-axis: number of packet sizes, 1 to 10; y-axis: accuracy; one curve per trace, UPC-I to UPC-VII). (b) First packet size distribution in the training trace UPC-II (x-axis: packet size; y-axis: number of flows; one series per application group).

no information from the relevant ports is taken into account, producing a single K-dimensional tree. With this variation, using only the first two packets, we achieve an accuracy of almost 90%. The accuracy increases with the number of packet sizes until a stable accuracy > 95% is reached with seven packet sizes. In order to show the impact of using the list of relevant ports in the classification, Figure 2(b) shows the distribution of the first packet sizes for the training trace UPC-II. Although some portions of the distribution are dominated by a group of applications, most applications have their first packet sizes between the 0 and 300 byte ticks. This collision explains the poor accuracy obtained in the previous figure with only one packet. The second parameter of the method, the relevant ports, besides improving the classification speed, appears to alleviate that situation. Figure 3(a) presents the accuracy of the method by number of packets using the set of relevant ports that contains most of the traffic in UPC-II. With the help of the relevant ports, the method achieves an accuracy > 90% using only the first packet size, and a stable accuracy of 97% with seven packets. Figure 3(b) presents the accuracy of the method depending on the set of relevant ports with seven packet sizes. We choose seven because, as can be seen in Figures 2(a) and 3(a), increasing the number of packet sizes beyond seven does not improve the accuracy but decreases the classification speed. Using all the ports of the training trace UPC-II, the method achieves the highest accuracy with the same trace. However, with the rest of the traces the accuracy decreases substantially, although it always remains above 85%. The reason for this decrease is that using all the ports as relevant ports is very dependent on the scenario and could

Fig. 3. K-dimensional tree evaluation with the support of the relevant ports. (a) K-dimensional tree accuracy (by flow) with relevant ports support (x-axis: number of packet sizes, 1 to 10; y-axis: accuracy; one curve per trace, UPC-I to UPC-VII). (b) K-dimensional tree accuracy (by flow, n=7) by set of relevant ports (x-axis: traces UPC-I to UPC-VII; y-axis: accuracy; series: All, Single, Selected).

present classification inaccuracies with new instances belonging to ports not represented in the training data. Furthermore, the classification accuracy also decreases because it produces fragmentation in the classification model for those applications that use multiple or dynamic ports (i.e., their information is spread among different K-dimensional trees). However, the figure shows that using a set of relevant ports (in our case, the ports that receive more than 0.05% of the traffic), besides increasing the classification speed, also improves accuracy. Erman et al. pointed out in [6] a common situation found among ML techniques: the accuracy when measured by flows is much higher than when measured by bytes or packets. This usually happens because some elephant flows are not correctly classified. Figures 4(a) and 4(b) present the classification results of the method, also considering the accuracy by bytes and packets. They show that, unlike other ML solutions, the method is able to keep high accuracy values even with such metrics. This is because the method is very accurate with the P2P and WEB groups of applications, which represent most of the traffic in our traces in terms of bytes. Finally, we also study the accuracy of the method broken down by application group. In our evaluation we use the same application groups as in TIE. Figure 5 shows that the method is able to classify the most popular groups of applications with excellent accuracy. However, the accuracy for the application groups that are not very common decreases substantially. These accuracies have a very low impact on the final accuracy of the method, given that the representation of these groups in the traces used is almost negligible. A possible way to improve the accuracy for these groups of applications could be the addition of artificial instances of these groups to the training data. Another potential problem is the disguised use of ports by some applications. Although we have not evaluated this impact in detail, the results show that currently we can still achieve an additional gain in accuracy by considering the port numbers. We have also checked the accuracy by application group with a single K-dimensional tree and found that it was always below the results shown in Figure 5. We omit the figure in the interest of brevity. In conclusion, we presented a set of results showing how the K-dimensional tree technique, combined with the still useful information provided by the ports, improves almost all the aspects of previous methods based on the NN search.
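The gap between flow, packet and byte accuracy discussed in this subsection is easy to reproduce with a small sketch. The function below is a hypothetical helper, not part of the evaluation code: it assumes each classified flow is given as a `(true_label, predicted_label, n_packets, n_bytes)` tuple and that the input is non-empty, and it weights each correct classification by 1, by the flow's packet count, or by its byte count.

```python
def accuracies(flows):
    """Return accuracy measured by flows, by packets and by bytes."""
    totals = {"flow": 0, "packet": 0, "byte": 0}
    hits = dict.fromkeys(totals, 0)
    for truth, predicted, n_packets, n_bytes in flows:
        weights = {"flow": 1, "packet": n_packets, "byte": n_bytes}
        for metric, weight in weights.items():
            totals[metric] += weight
            if truth == predicted:
                hits[metric] += weight
    return {metric: hits[metric] / totals[metric] for metric in totals}


# Nine small flows classified correctly and one misclassified elephant flow
# (synthetic numbers chosen only for illustration).
example = [("WEB", "WEB", 10, 5_000)] * 9 + [("P2P", "WEB", 10_000, 10_000_000)]
result = accuracies(example)
```

In this synthetic example the flow accuracy is 0.9 while the packet and byte accuracies fall below 1%, which is the effect a single misclassified elephant flow has on the byte and packet metrics.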

Fig. 4. K-dimensional tree evaluation with the support of the relevant ports. (a) K-dimensional tree accuracy (by packet, n=7) by set of relevant ports. (b) K-dimensional tree accuracy (by byte, n=7) by set of relevant ports. (Both panels: x-axis: traces UPC-I to UPC-VII; y-axis: accuracy; series: All, Single, Selected.)

Fig. 5. Accuracy by application group (n=7 and selected list of ports as parameters). (x-axis: application groups CONFERENCING, P2P, WEB, SERVICES, ENCRYPTION, GAMES, MAIL, MULTIMEDIA, BULK, FILE_SYSTEM, TUNNEL, INTERACTIVE; y-axis: accuracy; series: UPC-I, UPC-III, UPC-IV, UPC-V, UPC-VI, UPC-VII, AVERAGE.)

With the only drawback of a short training phase, the method is able to perform online classification with very high accuracy: > 90% with only one packet or > 97% with seven packets.

4.4 Continuous Training System Evaluation

This section presents a preliminary study of the impact of our continuous training traffic classifier. Due to the lack of traces covering a very long period of time, and because of the intrinsic difficulties in processing such large traces, we simulate a scenario in which the features of the traffic evolve by concatenating the UPC-II and UPC-VIII traces. The trace UPC-VIII, besides belonging to a different time of day, was recorded four months later than UPC-II, which suggests a different traffic mix with different properties. Using seven as the fixed number of packet sizes, the results in Table 5 confirm our intuition. On the one hand, using the trace UPC-II as training data to classify the trace UPC-VIII, we obtain an accuracy of almost 85%. On the other hand, after detecting such a decrease in accuracy and retraining the system, we obtain an impressive accuracy of 98.17%.

Table 5. Evaluation of the Continuous Training system by training trace and set of relevant ports

Training Trace       UPC-II    UPC-II    First 15 min. UPC-VIII  First 15 min. UPC-VIII
Relevant Port List   UPC-II    UPC-VIII  UPC-II                  UPC-VIII
Accuracy             84.20 %   76.10 %   98.17 %                 98.33 %

This result shows the importance of the continuous training feature for maintaining a high classification accuracy. Since this preliminary study was performed with traffic traces, instead of a live traffic stream, we decided to use the first fifteen minutes of the UPC-VIII trace as the queue length parameter (s) of the retraining process. The results of a second experiment are also presented in Table 5. Instead of retraining the system with new training data, we study whether modifying the list of relevant ports is enough to recover the original accuracy. The results show that this solution does not bring any improvement when applied alone. However, the best result is obtained when both the training data and the list of relevant ports are updated and the system is then retrained.

5 Conclusions and Future Work

In this paper, we revisited the viability of using the Nearest Neighbor algorithm (NN) for online traffic classification, which has often been discarded in previous studies due to its poor classification speed. In order to address this well-known limitation, we presented an efficient implementation of the NN algorithm based on a K-dimensional tree data structure, which can be used for online traffic classification with high accuracy and low overhead. In addition, we combined this technique with the relevant information still provided by the port numbers, which further increases its classification speed and accuracy. Our results show that the method can achieve very high accuracy (> 90%) by looking only at the first packet of a flow. When the number of analyzed packets is increased to seven, the accuracy of the method increases beyond 95%. This early classification feature is very important, since it allows network operators to react quickly to the classification results. We presented an actual implementation of the traffic classification method based on the TIE classification engine. The main novelty of the implementation is its continuous training feature, which allows the system to retrain itself automatically as the training data becomes obsolete. Our preliminary evaluation of this unique feature presents very encouraging results. As future work, we plan to perform a more extensive performance evaluation of our continuous training system with long-term executions, in order to show the advantages of keeping the classification method continuously updated without requiring human supervision.


Acknowledgments

This paper was done under the framework of the COST Action IC0703 “Data Traffic Monitoring and Analysis (TMA)” and with the support of the Comissionat per a Universitats i Recerca del DIUE from the Generalitat de Catalunya. The authors thank UPCnet for the traffic traces provided for this study and the anonymous reviewers for their useful comments.

References

1. Bentley, J.L.: K-d trees for semidynamic point sets, pp. 187–197 (1990)
2. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM CoNEXT (2006)
3. Bernaille, L., et al.: Traffic classification on the fly. ACM SIGCOMM Comput. Commun. Rev. 36(2) (2006)
4. CoMo-UPC data sharing model, http://monitoring.ccaba.upc.edu/como-upc/
5. Dainotti, A., et al.: TIE: a community-oriented traffic classification platform. In: Proceedings of the First International Workshop on Traffic Monitoring and Analysis, p. 74 (2009)
6. Erman, J., Mahanti, A., Arlitt, M.: Byte me: a case for byte accuracy in traffic classification. In: Proc. of ACM SIGMETRICS MineNet (2007)
7. Erman, J., et al.: Identifying and discriminating between web and peer-to-peer traffic in the network core. In: Proc. of WWW Conf. (2007)
8. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
9. Internet Assigned Numbers Authority (IANA): as of August 12 (2008), http://www.iana.org/assignments/port-numbers
10. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. In: Proc. of ACM SIGCOMM (2005)
11. Kim, H., et al.: Internet traffic classification demystified: myths, caveats, and the best practices. In: Proc. of ACM CoNEXT (2008)
12. Moore, A., Zuev, D.: Internet traffic classification using bayesian analysis techniques. In: Proc. of ACM SIGMETRICS (2005)
13. Nguyen, T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials 10(4) (2008)
14. Roughan, M., et al.: Class-of-service mapping for qos: a statistical signature-based approach to ip traffic classification. In: Proc. of ACM SIGCOMM IMC (2004)
15. Williams, N., Zander, S., Armitage, G.: Evaluating machine learning algorithms for automated network application identification. CAIA Tech. Rep. (2006)
16. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: Proc. of IEEE LCN Conf. (2005)

Validation and Improvement of the Lossy Difference Aggregator to Measure Packet Delays

Josep Sanjuàs-Cuxart, Pere Barlet-Ros, and Josep Solé-Pareta

Department of Computer Architecture, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
{jsanjuas,pbarlet,pareta}@ac.upc.edu

Abstract. One-way packet delay is an important network performance metric. Recently, a new data structure called Lossy Difference Aggregator (LDA) has been proposed to estimate this metric more efficiently than with the classical approaches of sending individual packet timestamps or probe traffic. This work presents an independent validation of the LDA algorithm and provides an improved analysis that results in a 20% increase in the number of packet delay samples collected by the algorithm. We also extend the analysis by relating the number of collected samples to the accuracy of the LDA and provide additional insight on how to parametrize it. Finally, we extend the algorithm to overcome some of its practical limitations and validate our analysis using real network traffic.

1 Introduction

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 155–170, 2010. © Springer-Verlag Berlin Heidelberg 2010

Packet delay is one of the main indicators of network performance, together with throughput, jitter and packet loss. This metric is becoming increasingly important with the rise of applications like voice-over-IP, video conferencing or online gaming. Moreover, in certain environments, it is an extremely critical network performance metric; for example, in high-performance computing or automated trading, networks are expected to provide latencies on the order of a few microseconds [1]. Two main approaches have been used to measure packet delays. Active schemes send traffic probes between two nodes in the network and use inference techniques to determine the state of the network (e.g., [2,3,4,5]). Passive schemes are, instead, based on traffic analysis at two points of a network. They are, in principle, less intrusive to the network under study, since they do not inject probes. However, they have often been disregarded, since they require collecting, transmitting and comparing packet timestamps at both network measurement points, thus incurring large overheads in practice [6]. For example, [7] proposes delaying computations to periods of low network utilization if measurement information has to be transmitted over the network under study. The Lossy Difference Aggregator (LDA) is a data structure that has been recently proposed in [1] to enable fine-grain measurement of one-way packet delays using a passive measurement approach with low overhead. The data structure

156

J. Sanju` as-Cuxart, P. Barlet-Ros, and J. Sol´e-Pareta

is extremely lightweight in comparison with the traditional approaches, and can collect a number of samples that easily outperforms active measurement techniques, where traffic probes interfering with the normal operation of a network can be a concern. The main intuition behind this measurement technique is to sum all packet timestamps in the first and second measurement nodes, and infer the average packet delay by subtracting these values and dividing over the total number of packets. The LDA, though, maintains several separate counters and uses coordinated, hash-based traffic sampling [8] in order to protect against packet losses, which would invalidate the intuitive approach. The complete scheme is presented in Sect. 2. This work constitutes an independent validation of the work presented in [1]. Section 3 revisits the analysis of the algorithm. In particular, Sect. 3.1 investigates the effective number of samples that the algorithm can collect under certain packet loss ratios. This work improves the original analysis, and finds that doubling the sampling rates suggested in [1] maximizes the expectation of the number of samples collected by the algorithm. In Sect. 3.2, we contribute an analysis that relates the effective sample size with the accuracy that the method can obtain, while Sect. 3.3 compares the network overhead of the LDA with pre-existing techniques. For the case when packet loss ratios are unknown, the original work proposed and compared three reference configurations of the LDA in multiple memory banks to target a range of loss ratios. In Sects. 3.4 and 3.5 we extend our improved analysis to the case of unknown packet loss, and we (i) find that such reference configurations are almost equivalent in practice, and (ii) provide improved guidelines on how to dimension the multi-bank LDA. Sect. 4 validates our analysis through simulations, with similar parameters to [1], for the sake of comparability. Finally, in Sect. 
5 we deploy the LDA on a real network scenario. The deployment of the LDA in a real setting presents a series of challenges that stem from the assumptions behind the algorithm as presented in [1]. We propose a simple extension of the algorithm that overcomes some of the practical limitations of the original proposal. At the time of this writing, another analysis of the Lossy Difference Aggregator already exists in the form of a public draft [9]. The authors provide a parallel analysis of the expectation for the sample size collected by the LDA and, coherently with ours, suggest doubling the sampling rates compared to [1]. For the case where packet loss ratios are unknown beforehand, their analysis studies how to dimension the multi-bank LDA to maximize the expectation for the sample size. Optimal sampling rates are determined that maximize sample sizes for tight ranges of expected packet loss ratios. Our analysis differs in that we relate sample size with accuracy, and focus on maximizing accuracy rather than sample size. Additionally, our study includes an analytic overhead comparison with traditional techniques, presents the first real world deployment of the LDA and proposes a simple extension to overcome some of its practical limitations.

2 Background

The Lossy Difference Aggregator (LDA) [1] is a data structure that can be used to calculate the average one-way packet delay between two network points, as well as its standard deviation. We refer to these points as the sender and the receiver, but they need not be the source or the destination of the packets being transmitted, but merely two network viewpoints along their path. The LDA operates under three assumptions. First, packets are transmitted strictly in FIFO order. Second, the clocks of the sender and the receiver are synchronized. Third, the set of packets observed by the receiver is identical to the one observed by the sender, or a subset of it when there is packet loss. That is, the original packets are not diverted, and no extra traffic is introduced that reaches the receiver. A classic algorithm to calculate the average packet delays in such a scenario would proceed as follows. In both the sender and the receiver, the timestamps of the packets are recorded. After a certain measurement interval, the recorded packet delays (or, possibly, a subset of them) are transmitted from the sender to the receiver, which can then compare the timestamps and compute the average delay. Such an approach is impractical, since it involves storing and transmitting large amounts of information. The basic idea behind the LDA is to maintain a pair of accumulators that sum all packet timestamps in the sender and the receiver separately, as well as the total count of packets. When the measurement interval ends, the sender transmits the value of its accumulator to the receiver, which can then compute the average packet delay by subtracting the values and dividing over the total number of packets. The LDA requires the set of packets processed by the sender and the receiver to be identical, since the total packet counts in the sender and the receiver must agree. Thus, it is extremely sensitive to packet loss. 
In order to protect against it, the LDA partitions the traffic into b separate streams, and aggregates timestamps for each one separately in both the sender and the receiver. Additionally, for each of the sub-streams, it maintains a packet count. Thus, it can detect packet losses and invalidate the data collected in the corresponding accumulators. When the measurement interval ends, the sender transmits all of the values of the accumulators and counters to the receiver. Then, the receiver discards the accumulators where packet counts disagree, and computes an estimate for the average sampling delay using the remainder. Each of the accumulators must aggregate the timestamps from the same set of packets in the sender and the receiver, i.e., both nodes must partition the traffic using the same criteria. In order to achieve this effect, the same pre-arranged, pseudo-random hash function is used in both nodes, and the hash of a packet identifier is used to determine its associated position in the LDA. As packet losses grow high, though, the number of accumulators that are invalidated increases rapidly. As an additional measure against packet loss, the LDA samples the incoming packet stream. In the most simple setting, all of the accumulators apply an equal sampling rate p to the incoming packet stream.

Again, sender and receiver sample incoming packets coordinately using a prearranged pseudo-random hash function [8]. As an added benefit, the LDA data structure can also be mined to estimate the standard deviation of packet delays using a known mathematical trick [10]. We omit this aspect of the LDA in this work, but the improvements we propose will also increase the accuracy in the estimation of the standard deviation of packet delays.
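The scheme described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [1]: the class name `LDA`, the helper `_hash64`, the choice of BLAKE2 as the pre-arranged hash function and the two salts are all hypothetical, and a single bank with one sampling rate p is assumed.

```python
import hashlib


def _hash64(packet_id: bytes, salt: bytes) -> int:
    """Pre-arranged pseudo-random 64 bit hash (BLAKE2 is an assumption)."""
    digest = hashlib.blake2b(packet_id, key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big")


class LDA:
    """Single-bank LDA: b (timestamp sum, packet count) buckets, sampling rate p."""

    def __init__(self, b: int, p: float):
        self.b = b
        self.p = p
        self.ts_sum = [0.0] * b   # timestamp accumulators
        self.count = [0] * b      # per-bucket packet counters

    def update(self, packet_id: bytes, timestamp: float) -> None:
        # Coordinated sampling: both nodes take the same decision for a packet.
        if _hash64(packet_id, b"sample") >= self.p * 2**64:
            return
        # Both nodes map the packet to the same bucket.
        i = _hash64(packet_id, b"bucket") % self.b
        self.ts_sum[i] += timestamp
        self.count[i] += 1


def average_delay(sender: LDA, receiver: LDA):
    """Receiver-side estimate: discard buckets whose packet counts disagree."""
    total, samples = 0.0, 0
    for i in range(sender.b):
        if sender.count[i] == receiver.count[i] and sender.count[i] > 0:
            total += receiver.ts_sum[i] - sender.ts_sum[i]
            samples += sender.count[i]
    return (total / samples if samples else None), samples
```

A bucket that received a lost packet at the sender shows a count mismatch and is dropped, so only the timestamps of fully matched buckets enter the estimate.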

3 Improved Analysis

The LDA is a randomized algorithm that depends on the correct setting of the sampling rates to gather the largest possible number of packet delay samples. The sampling rate p presents a classical tradeoff. The more packets are sampled, the more data the LDA can collect, but the more it is affected by packet loss. Conversely, lower sampling rates provide more protection against loss, but limit the amount of information collected by the accumulators. This section improves the original analysis of the Lossy Difference Aggregator (LDA) introduced in [1] in several ways. First, it improves the analysis of the expected number of packet delay samples it can collect, which leads to the conclusion that sampling rates should be twice the ones proposed in [1]. Second, it relates the number of samples with the accuracy in a way that makes it clear that undersampling is preferable to sampling too many packets. Third, it compares its network overhead with pre-existing passive measurement techniques. Fourth, it provides a better understanding of multi-bank LDAs and guidelines to dimension them.

3.1 Effective Sample Size

In order to protect against packet loss, the LDA uses packet hashes to distribute timestamps across several accumulators, so that losses only invalidate the samples collected by the involved memory positions. Table 1 summarizes the notation used in this section.

Table 1. Notation

variable   name
n          #pkts
r          packet loss ratio
p          sampling rate
b          #buckets
µ          average packet delay
µ̂          estimate of the avg. delay

Given n packets, b buckets (accumulator-counter pairs) and packet loss probability r, the probability of a bucket staying useful corresponds to the probability that no lost packet hashes to the bucket in the receiver node, which can be computed as $(1 - r/b)^n \approx e^{-n r/b}$ (according to the law of rare events). Then, the expectation for the number of usable samples, which we call the effective sample size, can be approximated to $E[S] \approx (1 - r)\, n / e^{n r/b}$. In order to provide additional protection against packet losses, the LDA also samples the incoming packets; we can adapt the previous formulation to account for packet sampling as follows:

$$E[S] \approx \frac{(1 - r)\, p\, n}{e^{n r p / b}} \tag{1}$$

Reference [9] shows that this approximation is extremely accurate for large values of n. The approximation is best as n becomes larger and the probability of sampling a packet loss stays low. Note that this holds in practice; otherwise, the buckets would too often be invalidated. For example, when the absolute number of sampled packet losses is in the order of the number of buckets b, it obtains relative errors around $5 \times 10^{-4}$ for as few as n = 1000 packets. Note however that this formula only accounts for a situation where all buckets use an equal fixed sampling rate p, i.e., a single bank LDA. Section 3.5 extends this analysis to the multi-bank LDA, while Sect. 4 provides an experimental validation of this formula.

Reference [1] provides a less precise approximation for the expected effective sample size. When operating under a sampling rate $p = \alpha\, b/(L + 1)$, a lower bound $E[S] \ge \alpha (1 - \alpha) R\, b/(L + 1)$ is provided, where R corresponds to the number of received packets and L to the number of lost packets; in our notation, $R = n (1 - r)$ and $L = n r$. Trivially, this bound is maximized when α = 0.5. Therefore, it is concluded that the best sampling rate p that can be chosen is $p = 0.5\, b/(n r + 1)$.

However, our improved analysis leads to a different value for p by maximizing (1). The optimal sampling rate p that maximizes the effective sample size for any known loss ratio r can be obtained by solving $\partial E[S]/\partial p = 0$, which leads to $p = b/(n r)$ (in practice, we set $p = \min(b/(n r), 1)$). Thus, our analysis approximately doubles the sampling rate compared to [1], i.e., sets α = 1 in their notation, which yields an improvement in the effective sample size of around 20% at no cost. The conclusions of this improved analysis are coherent with the parallel analysis of [9], which also shows that the same conclusions are reached without the approximation in (1). Assuming a known loss ratio and the optimal setting of the sampling rate $p = b/(n r)$, then, the expectation of the effective sample size is (by substitution of p in (1)):

$$E[S_{opt}] = \frac{1 - r}{r\, e}\, b \tag{2}$$
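The relation between (1), (2) and the sampling rate of [1] can be checked numerically. The following is a minimal sketch; the function names are hypothetical, and the scenario (n = 5 × 10^6 packets, b = 1024 buckets) follows the one used later in the paper, with r = 5% chosen only for illustration.

```python
import math


def expected_sample_size(n, b, r, p):
    """Approximation (1): expected effective sample size of a single-bank LDA."""
    return (1 - r) * p * n / math.exp(n * r * p / b)


def optimal_sampling_rate(n, b, r):
    """Rate that maximizes (1): p = b / (n r), capped at 1."""
    return min(b / (n * r), 1.0)


def sampling_rate_ref1(n, b, r, alpha=0.5):
    """Rate suggested in [1]: p = alpha * b / (L + 1), with L = n r."""
    return min(alpha * b / (n * r + 1), 1.0)


n, b, r = 5_000_000, 1024, 0.05
s_opt = expected_sample_size(n, b, r, optimal_sampling_rate(n, b, r))
s_ref = expected_sample_size(n, b, r, sampling_rate_ref1(n, b, r))

print("E[S] at the optimal rate:", round(s_opt))
print("closed form (2):         ", round((1 - r) * b / (r * math.e)))
print("gain over the rate of [1]:", round(s_opt / s_ref, 3))
```

The gain is the ratio between e^{-1} and 0.5 e^{-1/2}, which is where the improvement of around 20% comes from.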

In other words, given a known number of incoming packets and a known loss ratio, setting p optimally maximizes the expectation of the sample size at $\frac{1 - r}{r\, e}$ samples per bucket. Figure 1 shows how the number of samples that can be collected by the LDA quickly degrades when facing increasing packet loss ratios. Therefore, in a high packet loss ratio scenario, the LDA will require large amounts of memory to preserve the sample size. As an example, in order to sustain the same sample size of a 0.1% loss scenario, the LDA must grow around 50 times larger on 5% packet loss, and by a factor of around 250 in the case of 20% packet loss.

[Figure: samples per bucket (y axis) against packet loss rate (x axis)]

Fig. 1. Expected number of samples collected per bucket under varying packet loss ratios, assuming an ideal LDA that can apply, for each packet loss ratio, the optimal sampling rate

Recall that this analysis assumes that the packet loss ratios are known beforehand, so that the sampling rate can be tuned optimally. When facing unknown loss ratios, the problem becomes harder, since it is not possible to configure $p = b/(n r)$, given that both parameters are unknown. However, this analysis does provide an upper bound on the performance of this algorithm. In any configuration of the LDA, including in multiple banks, the expectation of the effective sample size will be at most $\frac{1 - r}{r\, e}\, b$.

3.2 Accuracy

It is apparent from the previous subsection that increasing packet loss ratios have a severe impact on the effective sample size that the LDA can obtain. However, the LDA is empirically shown to obtain reasonable accuracy up to 20% packet loss in [1]. How can we accommodate these two facts? The resolution of this apparent contradiction lies in the fact that the accuracy of the LDA does not depend linearly on the sample size but, instead, the gains in terms of accuracy of the larger sample sizes are small. The LDA algorithm estimates the average delay µ from a sample of the overall population of packet delays. According to the central limit theorem, the sample mean is a random variable that converges to a normal distribution as the sample size (S in our notation) grows [11]. The rate of convergence towards normality depends on the distribution of the sampled random variable (in this case, packet delays). If the arbitrary distribution of the packet delays has mean µ and variance $\sigma^2$, assuming that the sample size S obtained by the LDA is large enough for the normal approximation to be accurate, the sample mean can be considered to be normally distributed, with mean µ and variance $\sigma^2/S$, which implies that, with 99% confidence, the estimate of the average delay $\hat{\mu}$ as the sample average will be within $\mu \pm 2.576\, \sigma/\sqrt{S}$ and, thus, the relative error will be below $2.576\, \frac{\sigma}{\mu \sqrt{S}}$.
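The bound above is easy to evaluate for a concrete delay distribution. The sketch below assumes Weibull distributed delays with scale 0.133 and shape 0.6, the parameters used elsewhere in the paper, and relies on the standard closed forms for the Weibull mean and variance; the function names are hypothetical.

```python
import math

Z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution


def weibull_mean_std(scale, shape):
    """Mean and standard deviation of a Weibull(scale, shape) distribution."""
    g1 = math.gamma(1 + 1 / shape)
    g2 = math.gamma(1 + 2 / shape)
    return scale * g1, scale * math.sqrt(g2 - g1**2)


def relative_error_bound(sample_size, mean, std, z=Z_99):
    """99% confidence bound on the relative error: z * sigma / (mu * sqrt(S))."""
    return z * std / (mean * math.sqrt(sample_size))


def samples_needed(target_error, mean, std, z=Z_99):
    """Smallest sample size whose bound does not exceed target_error."""
    return math.ceil((z * std / (mean * target_error)) ** 2)


mean, std = weibull_mean_std(scale=0.133, shape=0.6)
for s in (2_000, 8_000, 200_000):
    print(s, round(relative_error_bound(s, mean, std), 4))
```

Because the bound scales with 1/sqrt(S), multiplying the sample size by four halves it, which is why small samples already give usable accuracy.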

[Figure: relative error (y axis) against sample size on a logarithmic scale (left panel) and against loss ratio (right panel)]

Fig. 2. 99% confidence bound on the relative error of the estimation of the average delay as a function of the obtained sample size (left) and as a function of the packet loss ratio (right), assuming a 1024 bucket ideal LDA, $5 \times 10^6$ packets and Weibull (α = 0.133, β = 0.6) distributed packet delays

An observation to be made is that the relative error of the LDA is proportional to $1/\sqrt{S}$, that is, halving the relative error requires 4 times as many samples. A point is reached where the return of obtaining additional samples has a negligible practical impact on the relative error. As stated, the accuracy of the LDA depends on the distribution of the packet delays, which are shown to be accurately modeled by a Weibull distribution in [6], and this distribution is used in [1] to evaluate the LDA. Figure 2 plots, as an example, the accuracy as a function of the sample size (left) and as a function of the loss ratio (right) when packet delays are Weibull distributed with scale parameter α = 0.133 and shape β = 0.6, and $5 \times 10^6$ packets per measurement interval (these parameters have been chosen consistently with [1] for comparability). It can be observed that, in practice, small sample sizes obtain satisfactory accuracies. In this particular case, 2000 samples bound the relative error to around 10%, 8000 lower the bound to 5%, and 25 times as many, to 1%.

3.3 Overhead

Ref. [1] presents an experimental comparison of the LDA with active probing. In this section, we compare the overhead of the LDA with that of a passive measurement approach based on trajectory sampling [8] that sends a packet identifier and a timestamp for each sampled packet. As a basis for comparison, we compute the network overhead for each method per collected sample. Note that, for equal sample sizes, the accuracy of both methods is expected to match, since samples are collected randomly. Traditional techniques incur an overhead directly proportional to the collected number of samples. For example, an active probe will send a packet to obtain each sample. The overhead of a trajectory sampling based technique is also a constant α bytes/sample. For example, a 32 bit hash of a packet plus a 64 bit timestamp set α = 12.

[Figure: network overhead in B/sample (y axis) against packet loss ratio (x axis), for the LDA and the traditional approach]

Fig. 3. Communication overhead of the LDA relative to a traditional trajectory sampling approach, assuming 12 byte per bucket and per sample transmission costs

However, as discussed in the previous section, the sample size collected by the LDA depends on the packet loss ratio. Assuming a single-bank, optimally dimensioned LDA, it requires sending $b \times \beta$ bytes (where β denotes the size of a bucket) to gather $\frac{1 - r}{r\, e}\, b$ samples. Thus, the overhead of the LDA is $\frac{\beta\, r\, e}{1 - r}$ B/sample, and using 64 bit timestamp accumulators and 32 bit counters yields β = 12. The LDA is preferable as long as it has lower overhead, i.e., $\frac{\beta\, r\, e}{1 - r} < \alpha$ and, thus, $r < \frac{\alpha}{\beta e + \alpha}$. The values of α and β will vary in real deployments (e.g., timestamps can be compressed in both methods). In the example, where α = β = 12, the LDA is preferable as long as $r < \frac{1}{e + 1} \approx 0.27$. Figure 3 compares the overheads of both techniques in such a scenario, and shows the superiority of the LDA for the lowest packet loss ratios and its undesirability for the highest.

3.4 Unknown Packet Loss Ratio

It has already been established that the optimal choice of the LDA sampling rate is $p = b/(n r)$, which obtains $\frac{1 - r}{r\, e}\, b$ samples. However, in practice, both n and r are unknown a priori, since they depend on the network conditions, which are generally unpredictable. Thus, setting p beforehand implies that, inevitably, its choice will be suboptimal. What is the impact of over- and under-sampling, i.e., setting an optimistically high or a conservatively low sampling rate, on the LDA algorithm? We find that undersampling is preferable to oversampling. As explained, the relative error of the algorithm is proportional to $1/\sqrt{S}$. Thus, oversampling leads to collecting a high number of samples on low packet loss ratios, and slightly increases the accuracy in such circumstances, but leads to a high percentage of buckets being invalidated on high loss, thus incurring large errors. Conversely, undersampling preserves the sample size on high loss, thus obtaining reasonable accuracy, at the cost of missing the opportunity to collect a much larger sample when losses are low, which, however, has a comparatively lower impact on the accuracy.
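The asymmetry between over- and under-sampling can be illustrated with approximation (1). The sketch below fixes p for a target loss ratio and evaluates the expected sample size under a different actual loss ratio; the function names and the chosen loss ratios are illustrative assumptions, with n = 5 × 10^6 and b = 1024 as in the rest of the analysis.

```python
import math


def expected_sample_size(n, b, r, p):
    """Approximation (1) for a single-bank LDA."""
    return (1 - r) * p * n / math.exp(n * r * p / b)


def rate_for_target_loss(n, b, r_target):
    """Sampling rate that would be optimal if the loss ratio were r_target."""
    return min(b / (n * r_target), 1.0)


n, b = 5_000_000, 1024
actual_losses = (0.01, 0.05, 0.20, 0.50)

for target in (0.05, 0.20, 0.80):
    p = rate_for_target_loss(n, b, target)
    sizes = [expected_sample_size(n, b, r, p) for r in actual_losses]
    print(f"p tuned for {target:.0%} loss:",
          ", ".join(f"{s:,.1f}" for s in sizes))
```

In this setting, a rate tuned for 5% loss is expected to keep less than one sample when the actual loss is 50%, whereas a rate tuned for 80% loss still keeps a few hundred, at the price of a smaller sample when losses are low.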

[Figure: obtained sample size on a logarithmic scale (left panel) and rel err bound (right panel) against packet loss ratio, for sampling rates tuned to 5%, 20% and 80% loss and for the ideal single-bank LDA]

Fig. 4. Impact on the sample size (left) and expected relative error (right) of selecting a sub-optimal sampling rate

Figure 4 provides a graphical example of this analysis. In this example we consider, again analogously to [1], Weibull (α = 0.133, β = 0.6) distributed packet delays. We compare the sample sizes and accuracy bounds obtained by different configurations of the LDA using a value of p targeted at loss ratios of 5%, 20% and 80%. All LDA configurations use b = 1024 accumulators. It can be observed that the conservative setting of p for 80% loss underperforms, in terms of sample size, under the lowest packet loss ratios, but this loss does not imply an extreme degradation in terms of measurement accuracy. On the contrary, the more optimistic sampling rate settings achieve better accuracy under low loss, but incur extreme accuracy degradation as the loss ratio grows.

3.5 The Multi-bank LDA

So far, the analysis of the LDA has assumed all buckets have a common sampling rate p. However, as explained in [1], when packet loss ratios are unknown, it is interesting to divide the LDA into multiple banks. A bank is a section of the LDA for which all the buckets use the same sampling rate. Each of the banks can be tuned to a particular sampling rate, so that, intuitively, the LDA is resistant to a range of packet loss ratios. Reference [1] tests three different configurations of the multi-bank LDA, always using equal (or almost) sized banks. No systematic analysis is performed on the appropriate bank sizing nor on the appropriate sampling rate for each of the banks; each LDA configuration is somewhat arbitrary and based on intuition. We extend our analysis to the most general multi-bank LDA, where each bucket i independently samples the full packet stream at rate $p_i$ (i.e., our analysis supports all combinations of bank configurations and sampling rates). We adapt (1) accordingly:

$$E[S] \approx \sum_{i=1}^{b} \frac{(1 - r)\, p_i\, n}{e^{n r p_i}} \tag{3}$$

When every bucket uses the same sampling rate, the two equations are equivalent with $p_i = p/b$ (each bucket receives 1/b of the traffic and samples packets at rate p). As for the error bound, the analysis from Sect. 3.2 still holds. We have evaluated the three alternative multi-bank LDA configurations proposed in [1], using the same configuration parameters and distribution of packet delays. Figure 5 compares the accuracy obtained by the three configurations. The figure assumes, again, a Weibull distribution for packet delays, with shape parameter β = 0.6 and scale α = 0.133, and a number of packets $n = 5 \times 10^6$. All configurations use b = 1024 buckets. The first uses two banks, targeted to 0.005 and 0.1 loss, respectively; the second, three banks that target 0.001, 0.01 and 0.1 loss; the third, four banks that target 0.001, 0.01, 0.05 and 0.1 loss. The figure shows that, in practice, the three approaches (lda-2, lda-3 and lda-4 in the figure) proposed in [1] perform very similarly, which motivates further discussion on how to dimension multi-bank LDAs. The figure also provides, as a reference, the accuracy obtained by an ideal LDA that, for every packet loss ratio, obtains the best possible accuracy (from (2)). We argue that, consistently with the discussion of subsection 3.4, in order to support a range of packet loss ratios, the LDA should be primarily targeted towards maintaining accuracy over the worst-case target packet loss ratio. Using this conservative approach has two benefits. First, it guarantees that a target accuracy can be maintained in the worst-case packet loss scenario. Second, it is guaranteed that its accuracy over the smaller packet loss ratios is at least as good. However, this rather simplistic approach has an evident flaw in that it does not provide significantly higher performance gains in the lowest packet loss scenarios, where a small number of buckets provisioned with a high packet sampling rate would easily gather a huge number of samples.
Based on this intuition, as a rule of thumb, 90% of the LDA could be targeted to a worst-case sampling ratio, using the rest of the buckets to increase the accuracy in low packet loss scenarios. A more sophisticated approach to dimensioning a multi-bank LDA is to determine the vector of sampling rates $\langle p_1, p_2, \ldots, p_b \rangle$ that performs closest to optimal across a range of sampling rates. We have used numerical optimization to search for a vector of sampling rates such that it minimizes the maximum difference between the accuracies of the multi-bank LDA and the ideal LDA across a range of packet loss ratios. Additionally, we have restricted the search space to sampling rates that are powers of two for performance reasons [1,9]. We have obtained a multi-bank LDA that targets a range of loss rates between 0.1% and 20% for the given scenario: 5 million packets, Weibull distributed delays, and 1024 buckets. The best solution that our numerical optimizer has found is, coherently with the previous discussion, targeted primarily to the highest loss ratios. Table 2 summarizes the resulting multi-bank LDA. Most notably, a majority (70%) of the buckets use $p_i = 2^{-20}$, i.e., are targeted to a packet loss ratio of 20%, while fewer (around 20%) use $p_i = 2^{-17}$, i.e., are optimized for around 2.6% loss. All buckets combined sample around 0.47% of the packets.
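Approximation (3) and the rule of thumb above can be sketched as follows. The function names are hypothetical, and the two-bank layout (90% of the buckets tuned for 20% loss, the rest for 1% loss) is only an illustrative assumption, not the numerically optimized configuration of Table 2.

```python
import math


def multibank_expected_sample_size(n, r, rates):
    """Approximation (3): rates[i] is the sampling rate p_i of bucket i,
    expressed relative to the full packet stream."""
    return sum((1 - r) * p_i * n / math.exp(n * r * p_i) for p_i in rates)


def single_bank_expected_sample_size(n, b, r, p):
    """Approximation (1), for comparison."""
    return (1 - r) * p * n / math.exp(n * r * p / b)


def per_bucket_rate_for_loss(n, r_target):
    """Per-bucket rate that is optimal for loss ratio r_target: 1 / (n r)."""
    return min(1.0 / (n * r_target), 1.0)


n, b = 5_000_000, 1024

# Illustrative rule-of-thumb layout (an assumption, not Table 2).
worst_case = per_bucket_rate_for_loss(n, 0.20)
low_loss = per_bucket_rate_for_loss(n, 0.01)
rates = [worst_case] * 922 + [low_loss] * 102

for r in (0.001, 0.01, 0.05, 0.20):
    ideal = (1 - r) * b / (r * math.e)  # upper bound from (2)
    print(f"loss {r:.1%}: E[S] = {multibank_expected_sample_size(n, r, rates):,.0f}"
          f" (ideal {ideal:,.0f})")
```

With identical per-bucket rates p_i = p/b the sum in (3) collapses to (1), and no layout can exceed the ideal value given by (2) at any loss ratio.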

Table 2. Per-bucket sampling rates respective to the full packet stream of the numerically optimized LDA for the given scenario. Overall sampling rate is around 0.47%.

sampling rate: $2^{-14}$, $2^{-15}$, $2^{-16}$, $2^{-17}$, $2^{-18}$, $2^{-19}$, $2^{-20}$, $2^{-21}$
#buckets (in extraction order; the column alignment is not recoverable): 2, 189, 7, 27, 717, 2, 6, 74

[Figure: rel err bound (y axis) against packet loss ratio (x axis) for lda-2, lda-3, lda-4, optimized and ideal]

Fig. 5. Error bounds for several configurations of multi-bank LDA in the 0-1 packet loss ratio range (left) and in the 0-0.2 range (right)

Figure 5 shows the result of this approach (line optimized) when targeting a range of loss rates between 0.1% and 20% for 5 million packets with the mentioned Weibull distribution of delays. The solution our optimizer found has the desirable property that its relative error always stays less than 3% above the best possible, for any given loss ratio within the target range. These results suggest that there is little room for improvement in the multi-bank LDA parametrization problem. In the parallel analysis of [9], numerical optimization is also mentioned as an alternative to maximize the effective sample size when facing unknown packet loss. Optimal configurations are derived using competitive analysis for some particular cases of tight ranges of target packet loss ratios $[l_1, l_2]$. In particular, it is found that both for $l_2/l_1 \le 2$, and for $l_2/l_1 \le 5.5$ and a maximum of 2 banks, the optimal configuration is a single bank LDA with $p = \frac{\ln l_2 - \ln l_1}{l_2 - l_1}$. We believe that our approach is more practical in that it supports arbitrary packet loss ratios and it focuses on preserving the accuracy, rather than sample size.

4 Validation

In the previous section, we derived formulas for the expected effective sample size of the LDA when operating under various sampling rates, and provided bounds for the expected relative error under typical distributions of the network delays. In this section, we validate the analytical results through simulation. We have chosen the same configuration parameters as in the evaluation of [1]. Thus, this section not only validates our analysis of the LDA algorithm, but also shows consistency with the previous results of [1]. The simulation parameters are as follows: we assume 5 million packets per measurement interval, and Weibull (α = 0.133, β = 0.6) distributed packet delays. In our simulation, losses are uniformly distributed. Note however that, as stated in [1], the LDA is completely agnostic to the packet loss distribution, but only sensitive to the overall packet loss ratio. Thus, other packet loss models (e.g., in bursts [12]) are supported by the algorithm without requiring any changes.

[Figure: effective sample size on a logarithmic scale (left panel) and 99-pct of relative error (right panel) against packet loss ratio, for experiment - ideal, LDA2, LDA3 and LDA4, together with the expected values (left) and the analytic bounds (right)]

Fig. 6. Effective sample size (left) and 99 percentile of the relative error (right) obtained from simulations of the LDA algorithm using 5 million packets per measurement interval, and Weibull distributed packet delays

Figure 6 (left) compares the expected sample sizes with the actual results from the simulations. The figure includes the three multi-bank LDA configurations introduced in [1], with expected sample size calculated using (3), and the ideal LDA that achieves the best possible accuracy under each packet loss ratio, obtained from (2). This figure validates our analysis of the algorithm, since effective sample sizes are always around their expected value (while in [1], only a noticeably pessimistic lower bound is presented). On the other hand, Figure 6 (right) plots the 99 percentile of the relative error obtained after 500 simulations for each loss ratio, and compares it to the 99% bound on the error derived from the analysis of Sect. 3.2. The figures confirm the correctness of our analysis for both the effective sample size and the 99% confidence bound on the relative error.

5 Experiments

5.1 Scenario

In the previous section, a simulation based validation of our analysis of the LDA has been presented that reproduces that of [1]. In this section we evaluate the algorithm using real network traffic. To the best of our knowledge, this is the first work to evaluate the algorithm in a real scenario.

Our scenario consists of two measurement points: one of the links that connect the Catalan academic network (also known as Scientific Ring) to the rest of the Internet, and the access link of the Technical University of Catalonia (UPC). In the first measurement point, a commodity PC equipped with a 10 Gb/s Endace DAG card [13] obtains a copy of the inbound traffic via an optical splitter, and filters for incoming packets with destination address belonging to the UPC network. In the second measurement point, a commodity PC equipped with a 1 Gb/s Endace DAG card [13] analyzes a copy of the traffic that enters UPC, obtained from a port mirror from a switch.

5.2 Deployment Challenges

The deployment of the LDA in a real world scenario presents important challenges. The design of the LDA is built upon several assumptions. First, as stated in [1], the clocks in the two measurement points must be synchronized. We achieve this effect by synchronizing the internal DAG clocks prior to trace collection. Second, packets in the network follow strict FIFO ordering, and the monitors can inject control packets in the network (by running in the routers themselves) which also observe this strict FIFO ordering, and are used to signal measurement intervals. In our setting, packets are not forwarded in a strict FIFO ordering due to different queueing policies being applied to certain traffic. Moreover, injecting traffic to signal the intervals is unfeasible, since the monitors are isolated from the network under study. Third, in the original proposal, the complete set of packets observed in the second monitor (receiver) must have also been observed in the first (sender). In [1], the LDA algorithm is proposed to be applied in network hardware in a hop-by-hop fashion. However, this assumption severely limits the applicability of the proposal; for example, as is, it cannot be used in our scenario, since receiver observes packets that have been routed through a link to a commercial network that sender does not monitor (we refer to these packets as third party traffic). This limitation could be addressed by using appropriate traffic filters to discern whether each packet comes from receiver (e.g., source MAC address, or source IP address), but in the most general case, this is not possible. In particular, in our network, we lack routing information, and traffic engineering policies make it likely that the same IP networks are routed differently. The problem lies in that the LDA counters might match by chance when, in receiver, packet losses are compensated by extra packets from the third party traffic. 
The LDA would assume that the affected buckets are usable, and introduce severe error. We work around this by introducing a simple extension to the data structure: we attach to each LDA bucket an additional memory position that stores an XOR of the hashes of all the packets aggregated in the corresponding accumulator. Thus, receiver can trivially confirm that the set of packets of each position matches the set of packets aggregated in sender by checking this XOR. From a practical standpoint, this approach makes third party traffic count as losses. We use 64-bit hashes and, thus, the probability of the XORs matching by chance is negligible.¹

5.3 Experimental Results

We simultaneously collected a trace at each of the measurement points in the described scenario, and wrote two CoMo [14] modules to process the traces offline: one that implements the LDA, and another that computes the average packet delays exactly. The traces have a duration of 30 minutes. We configured 10-second measurement intervals, so that the average number of packets per measurement interval is on the order of $6 \times 10^5$. We tested 16 different single-bank configurations of the LDA with $b = 1024$ buckets and sampling rates ranging from $2^0$ to $2^{-15}$. We also used our numerical optimizer to obtain a multi-bank LDA configuration that tolerates up to 80% loss in our scenario.

Figure 7 summarizes our results. As noted in the previous discussion, third party traffic that is not seen in sender is viewed as packet losses in receiver. Therefore, our LDAs operate at an average loss rate of around 10%, which roughly corresponds to the fraction of packets arriving from a commercial network link that sender does not monitor. Hence, the highest packet sampling ratios are over-optimistic and collect too much traffic. It can be observed in Fig. 7 (right) that sampling ratios from $2^0$ to $2^{-4}$ lose an intolerable number of measurement intervals because all LDA buckets become unusable. Lower sampling rates, though, are totally resistant to the third party traffic. Figure 7 (left) plots the results in terms of accuracy, but only including measurement intervals not lost. It can be observed that $2^{-6}$ and $2^{-7}$ are the best settings. This is consistent with the analysis of Sect. 3, which suggests using $p = \frac{b}{n r} \approx 0.017 \approx 2^{-6}$. The figure also includes the performance of our numerically optimized LDA, portrayed as a horizontal line (the multi-bank LDA is a hybrid of the other sampling rates). It performs very similarly to the best static sampling rates. However, it is important to note that this configuration will consistently perform close to optimal when losses (or third party traffic) grow to 80%, obtaining errors below 50%, while the error bound for the less flexible single-bank LDA reaches 400%.

On average, for each measurement interval, the optimized LDA collected around 3478 samples, while transmitting 1024 × 20 bytes (for each bucket, 8 bytes for the timestamp accumulator, 8 for the XOR field and 4 for the counter), resulting in 5.8 B/sample of network overhead. A traditional technique based on sampling and transmitting packet timestamps would cause a higher overhead; e.g., if using 8-byte timestamps and 4-byte packet IDs, it would transmit 12 B/sample. Thus, in this scenario, the LDA reduced the communication overhead by over 50%.
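The arithmetic behind the sampling-rate choice and the overhead comparison can be checked with a few lines of Python. This is only a sketch: it assumes that the rule of thumb from Sect. 3 reads p = b/(n·r), takes n ≈ 6 × 10^5 packets per interval and r ≈ 10% loss from the text above, and uses illustrative variable names.

```python
import math

# Values quoted in the text.
b = 1024            # buckets per LDA bank
n = 6e5             # packets per measurement interval (order of magnitude)
r = 0.10            # average loss rate, including third party traffic

# Assumed reading of the Sect. 3 rule of thumb: p = b / (n * r).
p = b / (n * r)
nearest_power = round(math.log2(p))   # exponent of the closest power of two

# Network overhead per collected sample.
bytes_per_bucket = 8 + 8 + 4          # timestamp accumulator + XOR field + counter
samples = 3478                        # average samples per interval (optimized LDA)
lda_bytes_per_sample = b * bytes_per_bucket / samples
baseline_bytes_per_sample = 8 + 4     # timestamp + packet ID per sampled packet
saving = 1 - lda_bytes_per_sample / baseline_bytes_per_sample

print(f"p = {p:.4f}, closest power of two: 2^{nearest_power}")
print(f"LDA: {lda_bytes_per_sample:.2f} B/sample, "
      f"baseline: {baseline_bytes_per_sample} B/sample")
print(f"saving: {saving:.1%}")
```

With these inputs the LDA figure comes out slightly below 5.9 B/sample, in line with the roughly 5.8 B/sample reported above, and the saving is just above one half.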

¹ The XORs of the hashes have to be transmitted from sender to receiver, causing extra network overhead. Choosing the smallest hash size that still guarantees high accuracy is left for future work.
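The XOR extension of Sect. 5.2 can be illustrated with a minimal Python sketch. It is not the CoMo module used for the experiments: it assumes a single bank with no packet sampling, uses BLAKE2b truncated to 64 bits as an arbitrary hash, and all class and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def packet_hash(packet_id: bytes) -> int:
    """64-bit hash of the invariant packet content (BLAKE2b is an arbitrary choice)."""
    return int.from_bytes(hashlib.blake2b(packet_id, digest_size=8).digest(), "big")


@dataclass
class Bucket:
    ts_sum: float = 0.0   # timestamp accumulator
    count: int = 0        # packet counter
    xor: int = 0          # XOR of the hashes of all aggregated packets


class XorLda:
    """Single-bank LDA without sampling, extended with a per-bucket XOR field."""

    def __init__(self, buckets: int = 1024):
        self.buckets = [Bucket() for _ in range(buckets)]

    def add(self, packet_id: bytes, timestamp: float) -> None:
        h = packet_hash(packet_id)
        bucket = self.buckets[h % len(self.buckets)]
        bucket.ts_sum += timestamp
        bucket.count += 1
        bucket.xor ^= h


def average_delay(sender: XorLda, receiver: XorLda):
    """Average delay over buckets whose counters AND XOR fields both match."""
    total_delay, total_packets = 0.0, 0
    for s, r in zip(sender.buckets, receiver.buckets):
        if s.count > 0 and s.count == r.count and s.xor == r.xor:
            total_delay += r.ts_sum - s.ts_sum
            total_packets += s.count
    return total_delay / total_packets if total_packets else None


# Demo: one packet is lost and a third party packet lands in the same bucket,
# so the counters match by chance but the XOR fields do not.
NUM_BUCKETS = 16
sender, receiver = XorLda(NUM_BUCKETS), XorLda(NUM_BUCKETS)
packets = [f"pkt-{i}".encode() for i in range(200)]
lost = packets[0]
for i, pkt in enumerate(packets):
    sender.add(pkt, float(i))
    if pkt != lost:
        receiver.add(pkt, i + 0.005)          # 5 ms delay on every delivered packet

target = packet_hash(lost) % NUM_BUCKETS
extra = next(c for c in (f"third-{i}".encode() for i in range(10_000))
             if packet_hash(c) % NUM_BUCKETS == target)
receiver.add(extra, 42.0)                      # third party packet, unrelated timestamp

delay = average_delay(sender, receiver)
print(delay)
```

In the demo, a count-only check would accept the polluted bucket and mix the unrelated timestamp into the average; with the XOR check the bucket is discarded, so the third party packet is effectively counted as a loss, as described in the text.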

[Figure 7, two panels. Left: relative error (%) versus sampling rate, with curves for single bank lda (avg), single bank lda (99-percentile), optimized multi-bank lda (avg.) and optimized multi-bank lda (99-percentile). Right: fraction of measurements lost versus sampling rate, with curves for single bank lda and optimized lda. The sampling rate axis is logarithmic, from 1e−04 to 1e+00.]

Fig. 7. Experimental results

6 Conclusions

We have validated the Lossy Difference Aggregator (LDA) algorithm originally presented in [1]. We have improved the theoretical analysis of the algorithm by providing a formula for the expected sample size collected by the LDA, while in [1] only a pessimistic lower bound was presented. Our analysis finds that the sampling rates originally proposed must be doubled. Only three configurations of the more complex multi-bank LDA were evaluated in [1]. We have extended our analysis to multi-bank configurations, and explored how to properly parametrize them, obtaining a procedure to numerically search for multi-bank LDA configurations that maximize accuracy over an arbitrary range of packet losses. Our results show that there is little room for additional improvement in the problem of multi-bank LDA configuration. We have validated our analysis through simulation and using traffic from a monitoring system deployed over a large academic network. The deployment of the LDA on a real network presented a number of challenges related to the assumptions behind the original proposal of the LDA algorithm, which does not tolerate packet insertion/diversion and depends on strict FIFO packet forwarding. We propose a simple extension that overcomes such limitations. We have compared the network overhead of the LDA with pre-existing techniques, and observed that it is preferable under zero to moderate loss or addition/diversion of packets (up to ∼25% combined). However, the extra overhead of pre-existing techniques can be justified in some scenarios, since they can provide further information on the packet delay distribution (e.g., percentiles) than just the average and standard deviation provided by the LDA.

Acknowledgments. We thank the anonymous reviewers and Fabio Ricciato for their comments, which led to significant improvements in this work. This research has been partially funded by the Comissionat per a Universitats i Recerca del DIUE de la Generalitat de Catalunya (ref. 2009SGR-1140).


References

1. Kompella, R., Levchenko, K., Snoeren, A., Varghese, G.: Every microsecond counts: tracking fine-grain latencies with a lossy difference aggregator. In: Proc. of ACM SIGCOMM Conf. (2009)
2. Bolot, J.: Characterizing end-to-end packet delay and loss in the internet. Journal of High Speed Networks 2(3) (1993)
3. Paxson, V.: Measurements and analysis of end-to-end Internet dynamics. University of California at Berkeley, Berkeley (1998)
4. Choi, B., Moon, S., Cruz, R., Zhang, Z., Diot, C.: Practical delay monitoring for ISPs. In: Proc. of ACM Conf. on Emerging network experiment and tech. (2005)
5. Sommers, J., Barford, P., Duffield, N., Ron, A.: Accurate and efficient SLA compliance monitoring. ACM SIGCOMM Computer Communication Review 37(4) (2007)
6. Papagiannaki, K., Moon, S., Fraleigh, C., Thiran, P., Diot, C.: Measurement and analysis of single-hop delay on an IP backbone network. IEEE Journal on Selected Areas in Communications 21(6) (2003)
7. Zseby, T.: Deployment of sampling methods for SLA validation with non-intrusive measurements. In: Proc. of Passive and Active Measurement Workshop (2002)
8. Duffield, N., Grossglauser, M.: Trajectory sampling for direct traffic observation. IEEE/ACM Transactions on Networking 9(3) (2001)
9. Finucane, H., Mitzenmacher, M.: An improved analysis of the lossy difference aggregator (public draft), http://www.eecs.harvard.edu/~michaelm/postscripts/LDApre.pdf
10. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58(1) (1999)
11. Rohatgi, V.: Statistical inference. Dover Pubns (2003)
12. Sommers, J., Barford, P., Duffield, N., Ron, A.: Improving accuracy in end-to-end packet loss measurement. ACM SIGCOMM Computer Communication Review 35(4) (2005)
13. Endace: DAG network monitoring cards, http://www.endace.com
14. Barlet-Ros, P., Iannaccone, G., Sanjuàs-Cuxart, J., Amores-López, D., Solé-Pareta, J.: Load shedding in network monitoring applications. In: Proc. of USENIX Annual Technical Conf. (2007)

End-to-End Available Bandwidth Estimation Tools, An Experimental Comparison

Emanuele Goldoni¹ and Marco Schivi²

¹ University of Pavia, Dept. of Electronics, 27100-Pavia, Italy, [email protected]
² University of Pavia, Dept. of Computer Engineering and Systems Science, 27100-Pavia, Italy, [email protected]

Abstract. The available bandwidth of a network path impacts the performance of many applications, such as VoIP calls, video streaming and P2P content distribution systems. Several tools for bandwidth estimation have been proposed in recent years, but there is still uncertainty about their accuracy and efficiency under different network conditions. Although a number of experimental evaluations have been carried out in order to compare some of these methods, a comprehensive evaluation of all the existing active tools for available bandwidth estimation is still missing. This article introduces an empirical comparison of most of the active estimation tools that are currently implemented and freely available. Abing, ASSOLO, DietTopp, IGI, pathChirp, Pathload, PTR, Spruce and Yaz have been compared in a controlled environment and in the presence of different sources of cross-traffic. The performance of each tool has been investigated in terms of accuracy, time and traffic injected into the network to perform an estimation.

1 Introduction

Available bandwidth is a fundamental metric for describing the performance of a network path. This parameter is used in many applications, from routing algorithms to congestion control mechanisms and multimedia services. For example, in [1,2] the authors investigated the importance of the available bandwidth for adaptive content delivery in peer-to-peer (P2P) or video streaming systems. The easiest and most effective method for estimating the available bandwidth is active probing: a few test packets are transmitted through the path and are used to infer the network status. The problem of end-to-end estimation has received considerable attention and a number of active probing tools have emerged in recent years [3]. Nevertheless, producing reliable estimations remains challenging: the measurement process should be accurate, non-intrusive and robust at the same time. Considerable effort has also been put into comparison projects aiming to analyze the performance of existing methods in different network scenarios.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 171–182, 2010. © Springer-Verlag Berlin Heidelberg 2010


Nevertheless, many issues remain unresolved and the quest for the best available bandwidth estimation tool is still open [4]. Compared to previous works, this paper proposes the largest comparison of available bandwidth estimation methods. The performance of 9 different tools is investigated in terms of accuracy, time and intrusiveness. All the experiments have been conducted in a low-cost and flexible testbed which could be easily extended to simulate more complex and realistic network topologies. The remainder of the article is organized as follows. Section 2 briefly presents related work on measurement tools and past performance comparisons. In Section 3 we describe the testbed and the methodology adopted for the experimental comparison of the tools. Next, Section 4 includes the preliminary results obtained during the performance tests we performed in our laboratory. Finally, the conclusions drawn from this study are presented in Section 5.

2 Background and Related Work

Many software tools for network bandwidth monitoring have been developed in recent years by both independent scientists and collaborative research projects. Although designed for the same purpose, these tools are based on different principles and implement various techniques. This section briefly introduces the main methodologies proposed in the literature and describes previous work carried out to compare them.

2.1 Measurement Techniques

Several active end-to-end measurement tools have been proposed in recent years. Broadly speaking, these systems infer the available bandwidth of a network path by sending a few packets and analyzing the effects of intermediate nodes and cross-traffic on the probe frames. Examples of probing tools which have emerged in recent years are Pathload [5], IGI/PTR [6], Abing [7], Spruce [8], pathChirp [9], DietTopp [10], Yaz [11], and ASSOLO [12]. These methods differ in the size and temporal structure of probe streams, and in the way the available bandwidth is derived from the received packets.

Spruce [8] uses tens of packet pairs with an input rate chosen to be roughly equal to the capacity of the path, which is assumed to be known. Moreover, packets are spaced with exponential intervals in order to emulate a Poisson sampling process. IGI [6] uses a sequence of about 60 unevenly spaced packets to probe the network, and the gap between two consecutive packets is increased until the average output and initial gaps match. Similarly, PTR relies on unevenly spaced packets, but the background traffic is detected through a comparison of the time intervals at the source with those found on the destination side.

Abing [7] relies on the packet pair dispersion technique. Typically, 10 or 20 closely spaced probes are sent to one destination as a train. The evaluation of the observed packet pair delays and the estimation of the available bandwidth are based on a technical analysis of the problems that the frames could meet in the routers or other network devices.

Pathload [5] and DietTopp [10] use constant bit-rate streams and change the sending rate every round. Although both tools try to identify the turning point, DietTopp increases the sending rate linearly in successive streams, while Pathload varies the probing rate using a binary search scheme. Yaz [11] is a similar estimation tool derived from Pathload which should report results more quickly and with increased accuracy with respect to its predecessor.

PathChirp [9] sends a variable bit-rate stream consisting of exponentially spaced packets. The actual unused capacity is inferred from the rate responsible for increasing delays at the receiver side. ASSOLO [12] is a tool based on the same principle, but it features a different probing traffic profile and uses a filter to improve the accuracy and stability of results.

Other works like AB-Shoot [13], S-chirp [14], FEAT [15], BART [16] or MRBART [17] have also been proposed in the past. However, the source code of these tools has never been released publicly, or the methods have been implemented only in simulations. A detailed analysis of the existing estimation techniques is outside the scope of this paper; a proposed taxonomy has been developed by [8], while more information on specific tools can be found in the original papers.

2.2 Past Comparisons

Most tool proponents have compared the performance of their solution against that of other researchers. For example, in [11] Sommers et al. compared Yaz with Pathload and Spruce in a controlled environment, while Strauss and his colleagues [8] investigated the performance of Spruce against IGI and Pathload over hundreds of real Internet paths. Ribeiro et al. [9] tested pathChirp against Pathload and TOPP through emulation. In [12] the performance of pathChirp has been compared to that of ASSOLO in a laboratory network setup. Unfortunately, the works mentioned above covered only a small number of tools, and the scenarios investigated are limited too. A more comprehensive evaluation has been performed by Shriram et al. [18], who compared Abing, pathChirp, Pathload and Spruce on a high-speed testbed and on real-world GigE paths. The specific features of the network paths also allowed the researchers to investigate timing issues related to high-speed links and network interfaces. A similar work has been carried out by Labit et al. [19], who tested Abing, Spruce, Netest/Pipechar, pathChirp and IGI over a real Internet path of the French national monitoring and measurement platform Metropolis. Angrisani et al. [20] compared IGI, pathChirp and Pathload in a testbed equipped with a proper measurement station. The adoption of a performance evaluation methodology relying on electronic instrumentation for time measurements allowed the authors to focus on concurrence, repeatability and bias of the results obtained from the testbed. Furthermore, an optimal setting of each tool has been identified thanks to the experimental activity.


In [21] the authors presented a comparative study of DietTopp, Pathload and pathChirp in a mobile transport network. However, all the results presented have been generated only from simulations using ns2. The ns2 network simulator has also been used by Shriram and Kaur [22] to evaluate the performance of Pathload, pathChirp, Spruce, IGI and Cprobe under different network conditions. Two additional works in this research field are [23] and [24]. In the first paper the authors proposed a comparative analysis of Spruce, Pathload, pathChirp and IGI in a simple testbed and analyzed in depth the measurement errors and the uncertainty of the tools. In the latter article, Urvoy-Keller et al. investigated the long-term behavior and the biases of Pathload and Spruce by collecting data from real Internet paths. Finally, Guerrero and Labrador [25] presented a low-cost and flexible testbed and evaluated Pathload, IGI, and Spruce in a common environment in the presence of different cross-traffic loads. The same authors included more tools in the performance evaluation in [4], comparing Pathload, pathChirp, Spruce, IGI and Abing. The considered scenarios were extended too, examining varying packet loss rate, cross-traffic packet size, link capacity and delay. In addition, the newest article pointed out which tools might be the best choices for particular applications and environments.

Although great efforts have been made to compare the existing estimation methods, all past works considered only part of the existing measurement tools. The above-mentioned experiments have also been performed considering different scenarios and testbed configurations, thus making the various results not easily comparable. We advocate the need for a unified, flexible and low-cost platform for independent evaluations of measurement tools, and we propose in this paper a testbed solution based on free GPL-licensed software as an alternative to the one described in [4]. Our study also goes one step further than previous works, since it proposes the largest comparison of available bandwidth estimation tools: the performance of 9 software programs is examined in terms of accuracy, time and intrusiveness. All the tools have been ported to a recent operating system, and the changes required to make older software work on a newer system have been publicly released [26].

3 Testbed Setup

All the experimental results reported in this paper have been obtained using the simple testbed setup depicted in Figure 1. Our controlled network is based on general purpose PCs running only open source software. Two low-cost computers running Ubuntu GNU/Linux 8.04 are connected together through a 100 Mbps Fast Ethernet link and serve as routers. Two other machines of the testbed have been used to load the network with controlled traffic originated by the D-ITG [27] traffic generator. Finally, we installed the client and the server of each measurement tool on two additional computers running Ubuntu GNU/Linux 8.04 and we connected these two end hosts to the testbed. We also added two Fast Ethernet switches to the network in order to make the tests more realistic.

Fig. 1. Testbed setup used to compare available bandwidth estimation tools

The two intermediate routers which emulate the multi-hop network path are based on Linux 2.6. The routers also contain iproute2 tc [28], a utility used to configure traffic control mechanisms in the Linux kernel. With tc it is possible to change the capacity of each interface, limiting the output in terms of packets or bytes per second. tc also supports different packet queuing disciplines, and it can emulate the properties of wide area networks, such as variable delay, loss, duplication and re-ordering, through the netem kernel component.

The traffic generator D-ITG allowed us to produce traffic at packet level, replicating stochastic processes in which both the inter-departure time and the packet size are random variables with known properties. D-ITG is also capable of generating traffic at the network, transport, and application layers, and it can also use real traffic traces. In our experiments we loaded the network with Poisson or constant bit rate (CBR) cross-traffic with rates varying from 0 to 64 Mbps and we did not introduce any traffic shaping policy.

The final topology of the testbed and the scenarios considered are simple and admittedly unrealistic, but sufficient to perform a preliminary evaluation of the various measurement tools. A similar configuration has been used for example in [20] and [23], and the resulting system has the same features and flexibility as the testbed proposed in [25]. All the tools considered in this work must be executed on the two terminal hosts of the measured path, using a regular user account (administrator privileges are not required).
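The two cross-traffic models used to load the testbed can be contrasted with a short Python sketch. It does not reproduce D-ITG; it only generates inter-departure times for a CBR and a Poisson source of the same average rate. The 1000-byte packet size and the function names are assumptions made for illustration.

```python
import random
from statistics import mean


def cbr_gaps(rate_mbps: float, packet_bytes: int, count: int) -> list:
    """Constant bit rate: every packet leaves after the same fixed gap (seconds)."""
    gap = packet_bytes * 8 / (rate_mbps * 1e6)
    return [gap] * count


def poisson_gaps(rate_mbps: float, packet_bytes: int, count: int, seed: int = 0) -> list:
    """Poisson arrivals: exponentially distributed gaps with the same mean as CBR."""
    rng = random.Random(seed)
    packets_per_second = rate_mbps * 1e6 / (packet_bytes * 8)
    return [rng.expovariate(packets_per_second) for _ in range(count)]


# 16 Mbps of cross-traffic with (assumed) 1000-byte packets.
cbr = cbr_gaps(16, 1000, 50_000)
poisson = poisson_gaps(16, 1000, 50_000)

print(f"CBR gap:          {cbr[0] * 1e6:.1f} us (constant)")
print(f"Poisson mean gap: {mean(poisson) * 1e6:.1f} us, "
      f"min {min(poisson) * 1e6:.2f} us, max {max(poisson) * 1e6:.1f} us")
```

Both sources offer the same average load, but the Poisson source produces bursts and idle periods, which is the property that may affect tools relying on the spacing of individual probe packets.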
For each measurement tool we left the configuration parameters untouched, using the default values suggested by the authors in the original paper or directly within the software; a list of the main configuration settings for each tested program is given in [26]. Although better results might be obtained using different setups, an optimal tuning of the tools is outside the scope of this paper. We also ran one measurement tool at a time: as shown in [29], current techniques can offer good estimates when used alone, but they might not work if several estimations severely interfere with each other.
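As an illustration of the netem-based emulation available on the routers, the following Python sketch builds (but does not execute) a tc command that adds a fixed delay to an interface. The interface name `eth0` and the delay value are hypothetical, the text does not state how the routers were actually configured, and applying such a command on a real router requires administrator privileges there.

```python
import shlex


def netem_delay_command(interface: str, delay_ms: int) -> list:
    """Build the argv of a tc command adding a fixed netem delay on egress."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


# Dry run: only print the command line instead of running it.
command = netem_delay_command("eth0", 125)
print(shlex.join(command))
```

Keeping the command as an argument list makes it straightforward to hand over to `subprocess.run` once the dry-run output has been checked.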

4 Experimental Results

Using the testbed described above, we evaluated Abing, ASSOLO, DietTopp, IGI, pathChirp, Pathload, PTR, Spruce and Yaz in terms of estimation time, overhead and accuracy. For each tool, we considered 5 CBR and 5 Poisson cross-traffic scenarios of varying intensity: we loaded the network using sources of 64, 32, 16 and 8 Mbps and, finally, we turned off the traffic generator. We repeated the measurement process from scratch 10 times before calculating the averaged results for each scenario. The convergence time and the amount of probe traffic transmitted have been calculated considering only actual probing frames. For example, we did not consider the delay of the initial control connection phase which most of the tools use to synchronize the client and the server.

4.1 Accuracy

Figures 2 and 3 show the average results obtained from our experiments. Abing, Spruce and DietTopp provide good estimations in the presence of low-rate cross-traffic, but the accuracy decreases significantly when the network load increases. On the contrary, the stability and the accuracy of measurements obtained with IGI and PTR increase when the intensity of the cross-traffic is higher. PathChirp constantly overestimates the available bandwidth and its measurements are quite unstable; this is a well-known problem of this tool and similar results have been obtained in [16], [18], [23]. Pathload and Yaz are quite accurate and their results are similar, which is explained by the fact that Yaz is a modified version of Pathload. Comparable results in terms of accuracy are also provided by ASSOLO. It is worth noting that the measured values do not exhibit significant differences with respect to the kind of cross-traffic source: the tools performed in the same way regardless of the use of CBR or Poisson distributed packets.

4.2 Intrusiveness

Table 1 shows the preliminary data obtained from the testbed network in the presence of a 16 Mbps CBR cross-traffic load, that is, an available bandwidth of around 80 Mbps. We ran the measurement process for each tool in this scenario and we used a network protocol sniffer [30] to evaluate the exact time required to provide an estimation and the amount of probe traffic injected into the path. During the tests we calculated only the actual estimation time and the probe traffic, not considering for example the delay introduced by a possible initial control connection phase. Similarly, we ignored the traffic and the time required by IGI and PTR to measure the initial capacity. For each tool, we considered a single estimation, although tools like ASSOLO and pathChirp usually repeat the process a number of times and produce a final value using filtering techniques. During the first round of tests the average end-to-end delay in our testbed was around 2 milliseconds, which is a reasonable value for short Fast Ethernet links. We then repeated all the experiments enabling the netem module on the two routers in order to emulate a real Internet path with a symmetric One-Way Delay (OWD) of 125 ms.

[Figure 2, nine panels: (a) Abing, (b) ASSOLO, (c) DietTopp, (d) IGI, (e) pathChirp, (f) Pathload, (g) PTR, (h) Spruce, (i) Yaz. Each panel plots Available Bandwidth [Mbps] against Cross-Traffic [Mbps].]

Fig. 2. Experimental results (solid line) obtained in presence of Constant Bit Rate cross-traffic with varying rate from 0 to 64 Mbps (dashed line)

[Figure 3, nine panels: (a) Abing, (b) ASSOLO, (c) DietTopp, (d) IGI, (e) pathChirp, (f) Pathload, (g) PTR, (h) Spruce, (i) Yaz. Each panel plots Available Bandwidth [Mbps] against Cross-Traffic [Mbps].]

Fig. 3. Experimental results (solid line) obtained in presence of Poisson cross-traffic with varying rate from 0 to 64 Mbps (dashed line)

Table 1. Estimation time (in seconds) and amount of probe traffic (in megabytes) associated with the tools analyzed

Tool        Traffic [MB]   Time, OWD = 2 ms [s]   Time, OWD = 125 ms [s]
Abing       0.6            1.0                    1.2
ASSOLO      >0.1           0.4                    0.5
DietTopp    7.6            1.7                    1.9
IGI         9.1            0.9                    1.1
pathChirp   >0.1           0.4                    0.5
Pathload    40.6           7.6                    18.2
PTR         9.1            0.9                    1.1
Spruce      0.3            10.0                   10.1
Yaz         6.4            4.3                    8.1

Results show that DietTopp, IGI and PTR are quite fast, but they also inject a significant amount of probe traffic into the network; on the other hand, Spruce can take around ten seconds but it is more lightweight. Pathload and Yaz are quite slow and intrusive, while ASSOLO, Abing and pathChirp appear to offer a good trade-off between speed and intrusiveness. It is worth pointing out that ASSOLO, DietTopp, pathChirp, Pathload, Yaz and PTR are based on the concept of self-induced congestion: the search for the available bandwidth is performed by transmitting probe traffic at a rate higher than the unused capacity of the path. The major drawback of this approach is that one or more intermediate queues will fill up during the measurement process, so the existing network traffic will be delayed and some packets could even be discarded by the congested bottleneck. The remaining tools are based instead on the probe gap model, which infers the available bandwidth by observing the gap measured on a packet pair injected into the path. Although this method limits the interference between probe and existing traffic, it has been shown to be less accurate in some network scenarios [31]. The total estimation time of some tools also depends on the Round Trip Time of the observed network path: the results change significantly when iterative programs like Pathload or Yaz are used over links with sizable delays. On the other hand, the impact of the one-way delay on the direct tools is negligible, since they rely on a single stream to produce an estimation.
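The two families of techniques discussed above can be contrasted on a toy fluid model of the testbed (100 Mbps capacity, 16 Mbps of cross-traffic). The sketch below is illustrative only: the probe gap formula is the one published for Spruce [8] rather than something stated in this paper, the `congested` oracle stands in for the delay-trend analysis that real tools perform on each probe stream, and all function names are hypothetical.

```python
from typing import Callable, Tuple


def probe_gap_estimate(capacity_mbps: float, gap_in: float, gap_out: float) -> float:
    """Probe gap model as used by Spruce [8]: A = C * (1 - (gap_out - gap_in) / gap_in)."""
    return capacity_mbps * (1.0 - (gap_out - gap_in) / gap_in)


def binary_rate_search(congested: Callable[[float], bool], low: float, high: float,
                       resolution: float) -> Tuple[float, int]:
    """Self-induced congestion with a binary search over the probing rate."""
    streams = 0
    while high - low > resolution:
        rate = (low + high) / 2.0
        streams += 1                      # one probe stream per tested rate
        if congested(rate):
            high = rate
        else:
            low = rate
    return (low + high) / 2.0, streams


def linear_rate_search(congested: Callable[[float], bool], start: float, stop: float,
                       step: float) -> Tuple[float, int]:
    """Self-induced congestion with a linearly increasing probing rate."""
    streams, rate, last_ok = 0, start, 0.0
    while rate <= stop:
        streams += 1
        if congested(rate):
            break
        last_ok = rate
        rate += step
    return last_ok, streams


CAPACITY, CROSS = 100.0, 16.0             # Mbps, toy single-bottleneck path


def congested(rate: float) -> bool:
    """Oracle: a stream above the unused capacity shows increasing delays."""
    return rate > CAPACITY - CROSS


# Packet pair sent back-to-back at link capacity with 1500-byte probes;
# in the fluid model the cross-traffic stretches the gap by CROSS / CAPACITY.
gap_in = 1500 * 8 / (CAPACITY * 1e6)
gap_out = gap_in * (1 + CROSS / CAPACITY)

pgm = probe_gap_estimate(CAPACITY, gap_in, gap_out)
binary_est, binary_streams = binary_rate_search(congested, 0.0, CAPACITY, 1.0)
linear_est, linear_streams = linear_rate_search(congested, 1.0, CAPACITY, 1.0)

print(f"probe gap model: {pgm:.1f} Mbps from a single pair")
print(f"binary search:   {binary_est:.1f} Mbps after {binary_streams} streams")
print(f"linear search:   {linear_est:.1f} Mbps after {linear_streams} streams")
```

In this idealized setting all three approaches land close to the 84 Mbps of unused capacity; the sketch only shows why iterative tools need several probe streams (and hence several round trips) while gap-based tools need the path capacity as an input. It says nothing about accuracy on real paths.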

5 Conclusion

In this work we presented the largest experimental comparison of existing available bandwidth measurement tools on a laboratory testbed. We compared the tools' performance in terms of intrusiveness, response time and accuracy in the presence of different kinds of cross-traffic. All the tests have been carried out on a flexible and highly customizable laboratory testbed based only on open-source software, low-cost personal computers and simple network devices. Preliminary results show that ASSOLO, Pathload and Yaz are accurate and scale well with increasing traffic loads, while Abing seems to be the best choice from the speed-intrusiveness point of view. Although the programs considered in this work represent the majority of existing active estimation tools, we intend to extend the experimental comparison to more candidates as they become freely available and usable. Ongoing work is devoted to including Netest [32] in our testbed: this was a promising tool, but the source code is based on a home-made and unmaintained build system which does not compile successfully under any modern GNU/Linux environment. BART is another recent tool which is being used for experiments over the European research measurement infrastructure Etomic. However, Ericsson owns the BART intellectual property rights and the code has not yet been freely released for scientific purposes. The laboratory testbed we used is quite simple: the single-bottleneck topology and the limited number of links oversimplify reality, and the CBR or Poisson cross-traffic sources do not fully capture the complexity of actual communication flows. We also did not consider long-term oscillations or biases in the estimations, and the analysis we performed does not include highly congested scenarios. Although quite promising, the preliminary results obtained from our tests are not sufficient to draw any definitive conclusions on how the tools will behave on real networks.

As further development, we plan to complete the analysis by extending the set of considered scenarios to actual Internet paths or by using a testbed having a more complex topology and loaded with real-world traffic traces.

References

1. Chuan, W., Baochun, L., Shuqiao, Z.: Characterizing Peer-to-Peer Streaming Flows. IEEE JSAC 25(9), 1612–1626 (2007)
2. Favalli, L., Folli, M., Lombardo, A., Reforgiato, D., Schembra, G.: A Bandwidth-Aware P2P Platform for the Transmission of Multipoint Multiple Description Video Streams. In: Proceedings of the Italian Networking Workshop 2009 (2009)
3. Shamsi, J., Brockmeyer, M.: Principles of Network Measurement. In: Misra, S., Misra, S.C., Woungang, I. (eds.) Selected Topics Communication Networks and Distributed Systems, pp. 1–40. World Scientific, Singapore (2010)
4. Guerrero, C.D., Labrador, M.A.: On the applicability of available bandwidth estimation techniques and tools. Computer Communications 33(1), 11–22 (2010)


5. Jain, M., Dovrolis, C.: Pathload: A measurement tool for end-to-end available bandwidth. In: Proceedings of the 3rd International workshop on Passive and Active network Measurement, PAM 2002 (2002)
6. Hu, N., Steenkiste, P.: Evaluation and Characterization of Available Bandwidth Probing Techniques. IEEE JSAC 21(6), 879–894 (2003)
7. Navratil, J., Cottrell, R.L.: ABwE: A Practical Approach to Available Bandwidth. In: Proceedings of the 4th International workshop on Passive and Active network Measurement, PAM 2003 (2003)
8. Strauss, J., Katabi, D., Kaashoek, F.: A measurement study of available bandwidth estimation tools. In: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, IMC 2003 (2003)
9. Ribeiro, V., Riedi, R., Baraniuk, R., Navratil, J., Cottrell, L.: PathChirp: Efficient Available Bandwidth Estimation for Network Paths. In: Proceedings of the 4th International workshop on Passive and Active network Measurement, PAM 2003 (2003)
10. Johnsson, A., Melander, B., Bjorkman, M.: DietTopp: A First Implementation and Evaluation of a Simplified Bandwidth Measurement Method. In: Proceedings of the 2nd Swedish National Computer Networking Workshop (2004)
11. Sommers, J., Barford, P., Willinger, W.: A Proposed Framework for Calibration of Available Bandwidth Estimation Tools. In: Proceedings of the 11th IEEE Symposium on Computers and Communications, ISCC 2006, pp. 709–718 (2006)
12. Goldoni, E., Rossi, G., Torelli, A.: Assolo, a New Method for Available Bandwidth Estimation. In: Proceedings of the Fourth International Conference on Internet Monitoring, ICIMP 2009, pp. 130–136 (May 2009)
13. Tan, W., Zhanikeev, M., Tanaka, Y.: ABshoot: A Reliable and Efficient Scheme for End-to-End Available Bandwidth Measurement. In: Proceedings of the IEEE Region 10 Conference TENCON 2006, pp. 1–4 (2006)
14. Pasztor, A.: Accurate Active Measurement in the Internet and its Applications. University of Melbourne, Department of Electrical and Electronic Engineering, Ph.D. Thesis (2003)
15. Qiang, W., Liang, C.: FEAT: Improving Accuracy in End-to-end Available Bandwidth Measurement. In: Proceedings of IEEE Global Telecommunications Conference GLOBECOM 2006, pp. 1–4 (2006)
16. Ekelin, S., Nilsson, M., Hartikainen, E., Johnsson, A., Mangs, J.-E., Melander, B., Bjorkman, M.: Real-Time Measurement of End-to-End Available Bandwidth using Kalman Filtering. In: Proceedings of 10th IEEE/IFIP Network Operations and Management Symposium, NOMS 2006, pp. 73–84 (2006)
17. Sedighizad, M., Seyfe, B., Navaie, K.: MR-BART: multi-rate available bandwidth estimation in real-time. In: Proceedings of the 3rd ACM workshop on Performance monitoring and measurement of heterogeneous wireless and wired networks PM2HW2N 2008, pp. 1–8 (2008)
18. Shriram, A., Murray, M., Hyun, Y., Brownlee, N., Broido, A., Fomenkov, M., claffy, k.: Comparison of public end-to-end bandwidth estimation tools on high-speed links. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 306–320. Springer, Heidelberg (2005)
19. Labit, Y., Owezarski, P., Larrieu, N.: Evaluation of active measurement tools for bandwidth estimation in real environment. In: Proceedings of the IEEE/IFIP Workshop on End-to-End Monitoring Techniques and Services E2EMON 2005, pp. 71–85 (2005)


20. Angrisani, L., D’Antonio, S., Esposito, E., Vardusi, M.: Techniques for available bandwidth measurement in IP networks: a performance comparison. Elsevier Computer Networks 50(3), 332–349 (2006)
21. Castellanos, C.U., Villa, D.L., Teyeb, O.M., Elling, J., Wigard, J.: Comparison of Available Bandwidth Estimation Techniques in Packet-Switched Mobile Networks. In: Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, pp. 1–5 (2006)
22. Shriram, A., Kaur, J.: Empirical Evaluation of Techniques for Measuring Available Bandwidth. In: Proceedings of 26th IEEE International Conference on Computer Communications INFOCOM 2007, pp. 2162–2170 (2007)
23. Ait Ali, A., Michaut, F., Lepage, F.: End-to-End Available Bandwidth Measurement Tools: A Comparative Evaluation of Performances. In: Proceedings of the 4th International Workshop on Internet Performance, Simulation, Monitoring and Measurements IPS-MoMe 2006, pp. 1–14 (2006)
24. Urvoy-Keller, G., En-Najjary, T., Sorniotti, A.: Operational comparison of available bandwidth estimation tools. ACM SIGCOMM Comput. Commun. Rev. 38(1), 39–42 (2008)
25. Guerrero, C.D., Labrador, M.A.: Experimental and Analytical Evaluation of Available Bandwidth Estimation Tools. In: Proceedings of the 31st IEEE Conference on Local Computer Networks 2006, pp. 710–717 (2006)
26. University of Pavia, Networking Lab: Collection of Available Bandwidth Estimation Tools, http://netlab-mn.unipv.it/avail-bw/
27. Avallone, S., Guadagno, S., Emma, D., Pescapè, A., Ventre, G.: D-ITG Distributed Internet Traffic Generator. In: Proceedings of the First International Conference on Quantitative Evaluation of Systems QEST 2004, pp. 316–317 (2004)
28. Hemminger, S., Kuznetsov, A., et al.: iproute2 utility suite, http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2
29. Croce, D., Mellia, M., Leonardi, E.: The Quest for Bandwidth Estimation Techniques for large-scale Distributed Systems. In: Proceedings of ACM HotMetrics 2009 (2009)
30. Combs, G., et al.: The Wireshark Network Protocol Analyzer, http://www.wireshark.org
31. Lao, L., Dovrolis, C., Sanadidi, M.Y.: The probe gap model can underestimate the available bandwidth of multihop paths. ACM SIGCOMM Comput. Commun. Rev. 36(5), 29–34 (2006)
32. Jin, G., Tierney, B.: Netest: a tool to measure the maximum burst size, available bandwidth and achievable throughput. In: Proceedings of the International Conference on Information Technology: Research and Education ITRE 2003, pp. 578–582 (2003)

On the Use of TCP Passive Measurements for Anomaly Detection: A Case Study from an Operational 3G Network

Peter Romirer-Maierhofer¹, Angelo Coluccia², and Tobias Witek¹

¹ Forschungszentrum Telekommunikation Wien (FTW), Austria
² Università del Salento, Italy
[email protected]

Abstract. In this work we discuss the use of passive measurements of TCP performance indicators in support of network operation and troubleshooting, presenting a case study from a real 3G cellular network. From the analysis of TCP handshaking packets measured in the core network we infer Round-Trip Times (RTT) on both the client and server sides, separately for the UMTS/HSPA and GPRS/EDGE sections. We also keep track of the relative share of packet pairs which did not lead to a valid RTT sample, e.g. due to loss and/or retransmission events, and use this metric as an additional performance signal. In a previous work we identified the risk of measurement bias due to early retransmission of TCP SYNACK packets by some popular servers. In order to mitigate this problem we introduce here a novel algorithm for dynamic classification and filtering of early retransmitters. We present a few illustrative cases of abrupt change observed in the real network, from which we derive some lessons learned about using such data for detecting anomalies in a real network. Thanks to such measurements we were able to discover a hidden congestion bottleneck in the network under study.

1 Motivations

The evolving nature and functional complexity of a 3G network increase its vulnerability to network problems and errors. Hence, the timely detection and reporting of network anomalies is highly desirable for operators of such networks. Passive packet-level monitoring can be an important instrument for supporting the operation and troubleshooting of 3G networks. A natural approach to validating the health status of a network is the extraction of performance indicators from passive probes, e.g. Round-Trip Time (RTT) percentiles and/or frequency of retransmission. These indicators, which we hereafter refer to as “network signals”, can be analyzed in real time in order to detect anomalous deviations from the “normal” network performance observed in the past. This approach rests on two fundamental assumptions: i) that extracted network signals are stable over time under problem-free operation, and ii) that anomalous phenomena generate appreciable deviations in any of the observed signals. In an earlier work [3] we demonstrated that passively extracted TCP RTT distributions are relatively stable over time in the operational network under study. Here, we take the next step and present some cases from a real 3G network where abnormal events were reflected by a sudden change in the analyzed network signals. Our findings are promising about the possibility of leveraging passively extracted TCP performance indicators for troubleshooting of real 3G networks.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 183–197, 2010. © Springer-Verlag Berlin Heidelberg 2010

Fig. 1. Monitoring setting

The TCP performance indicators presented in this work are obtained from the passive analysis of TCP handshaking packets. The idea of measuring TCP performance by observing handshaking packets was already presented in 2004 by Benko et al. [1], who reported results from an operational GPRS network. Vacirca et al. [2] reported RTT measurements from an operational 3G network including GPRS and also UMTS. In [3] we have shown that RTT values have decreased considerably due to the introduction of EDGE and HSPA and the consequent increase of radio bandwidth. Several studies [4,5,6,7] presented passive estimation of TCP RTT in wired networks, inferred also from TCP DATA/ACK pairs. However, this approach is complicated by loss, reordering and duplication of TCP segments as well as by delayed acknowledgements [4]. Jaiswal et al. [5] measure TCP RTT by keeping track of the current congestion window of a connection by applying finite state machines (FSM). Since the computation of the congestion window differs among different flavors of TCP, the authors suggest the parallel operation of several FSMs, each tailored to a specific TCP flavor. Rewaskar et al. [6] identified Operating System (OS)-specific differences in prominent TCP implementations, which may bias the passive analysis of TCP segment traces if not handled properly. This issue is addressed by implementing four OS-specific state machines to measure TCP RTT while discarding all connections with fewer than 10 transmitted segments. Mellia et al. [7] compute the TCP RTT by applying the moving average estimator standardized in [8]. For short TCP flows, e.g. client HTTP requests, this approach may not collect any RTT samples [8]. As shown in [4], the RTT inferred from TCP handshake packets is a reasonable approximation of the minimum RTT of the whole connection. Motivated by this result, we elaborate on the use of such RTT measurements for long-term and real-time anomaly detection in an operational 3G network.

Fig. 2. Measurement schemes: (a) Computation of RTT; (b) Ambiguity by retransmission

We believe this method to be much more scalable, since it neither requires the analysis of all packets of a TCP flow nor relies on any knowledge about the involved TCP flavors and/or Operating Systems. Moreover, in contrast to [6,7], this approach does not exclude short TCP flows from RTT measurements. The detection of congestion bottlenecks by passively inferring spurious retransmission timeouts from DATA/ACK pairs was presented in [9]. In this work we show that our simpler approach of extracting RTT just from TCP handshake packets is also suitable for detecting hidden congestion bottlenecks.

2 Measurement Methodology

The measurement setting is depicted in Fig. 1. Packet-level traces are captured on the so-called “Gn interface” links between the GGSN and SGSN (for more information about the 3G network structure refer to [10]). We use the METAWIN monitoring system developed in a previous research project and deployed in the network of a mobile operator in the EU (for more details refer to [11]). By extracting and correlating information from the 3GPP layers (GTP protocol on Gn, see [12]) the METAWIN system enables discrimination of connections originating in the GPRS/EDGE and UMTS/HSPA radio sections. We briefly recap the RTT measurement methodology already introduced in [3]. We only consider TCP connections established in uplink, i.e. initiated by Mobile Stations in the Radio Access Network. By measuring the time span between the arrival of a SYN and the arrival of the associated SYNACK we infer the (semi-)RTT between the Gn link and a remote server in the Internet, denoted by “server-side RTT” in Fig. 2(a). Similarly, we estimate the (semi-)RTT in the Radio Access Network (RAN), between the Gn link and the Mobile Station, referred to as “client-side RTT”, by calculating the time span between the arrival of the SYNACK and the associated ACK. Valid RTT samples may only be estimated from unambiguous and correctly conducted 3-way handshakes. Those cases where the association between packet pairs is ambiguous (e.g. due to retransmission or duplication) have to be discarded. Within a measurement interval (e.g. 5 minutes) valid RTT samples are aggregated into equal-sized bins.


The corresponding bin width is 0.1 ms for RTT samples 700. The points of early retransmitters show a clear offset towards higher values of $N_{SYNACK}$, since a significant number of SYNACKs is ambiguously replied due to early retransmission of the SYNACK packets. In contrast, the points of the residual servers are located along the line where each SYNACK is unambiguously replied by an ACK packet (i.e. $N_{SYNACK} \approx N_U$). Interestingly, both clusters overlap for $N_U < 700$. This might be explained by the fact that a few early retransmissions per server are already sufficient to exceed our classification threshold of $r_i(k) > 0.01$ if this server sends only a small number $N_{SYNACK}$ of SYNACKs, which leads to false negatives in our classification method. Moreover, there might be intervals where early retransmitters do not retransmit any SYNACKs, because all ACKs arrive before the expiration of the (short) SYNACK retransmission timeout. In such an interval an early retransmitter would also be located along the line $N_{SYNACK} \approx N_U$ in Fig. 5(a). The same scatterplot for connections via GPRS/EDGE is depicted in Fig. 5(b). The qualitative shape is comparable to Fig. 5(a). However, the existing clusters are less clearly separated. In Fig. 5(b) we observe points where $N_{SYNACK} \gg N_U$ also for servers not classified as early retransmitters. Note that this is not necessarily due to misclassification of the corresponding servers, since a SYNACK can be replied ambiguously also due to effects other than early retransmission of the SYNACK.

Fig. 6. Client-side RTT percentiles in UMTS, different paths, 7 days, 5 min bins: (a) Path A via SGSN and RNC; (b) Path B via RNC only. Both panels plot delay [sec] against days (Fri to Fri) for Port 80 traffic on the Gn interface, with curves for the 0.05, 0.25, 0.5, 0.75 and 0.95 percentiles.
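The handshake-based measurement recapped in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the METAWIN implementation: the `Handshake` record, the function names `semi_rtts` and `invalid_ratio`, and the definition of the invalid ratio as the share of observed handshakes that yield no valid sample are hypothetical choices made for the example. Timestamps are taken to be arrival times at the monitoring point on the Gn link.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Handshake:
    """Arrival times (seconds) of handshake packets seen at the probe.

    Each list holds every observed copy, so a retransmitted or duplicated
    packet shows up as a list with more than one entry. (Hypothetical record.)
    """
    syn: List[float]
    synack: List[float]
    ack: List[float]


def semi_rtts(h: Handshake) -> Optional[Tuple[float, float]]:
    """Return (server_side_rtt, client_side_rtt), or None if ambiguous.

    A sample is treated as valid only if exactly one SYN, one SYNACK and
    one ACK were seen, in that order; anything else is discarded.
    """
    if len(h.syn) != 1 or len(h.synack) != 1 or len(h.ack) != 1:
        return None
    t_syn, t_synack, t_ack = h.syn[0], h.synack[0], h.ack[0]
    if not (t_syn < t_synack < t_ack):
        return None
    server_side = t_synack - t_syn   # probe <-> remote server
    client_side = t_ack - t_synack   # probe <-> mobile station
    return server_side, client_side


def invalid_ratio(handshakes: List[Handshake]) -> float:
    """Share of handshakes in an interval without a valid RTT sample
    (assumed definition, used here only for illustration)."""
    if not handshakes:
        return 0.0
    invalid = sum(1 for h in handshakes if semi_rtts(h) is None)
    return invalid / len(handshakes)
```

For example, a handshake with a single SYN, SYNACK and ACK observed at 0.00 s, 0.02 s and 0.12 s yields a server-side sample of about 20 ms and a client-side sample of about 100 ms, whereas a handshake with two observed SYNACKs is discarded and counted as invalid.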

3 Measurement Results

In the following we present three illustrative examples of abrupt changes in the network-wide performance signals (i.e. $IS_{SYNACK}$ and RTT percentiles) found in an operational 3G network in Austria between June and October 2009. By investigating the root causes of these sudden deviations we will discuss relevant practical issues and show the applicability of these performance signals for detecting anomalies in a real 3G network.

3.1 Client-Side RTT Per Network Area

One interesting feature of the METAWIN monitoring system [11] is the analysis of TCP RTT separately for different SGSN and RNC areas. In Fig. 6 we plot two time series of client-side RTT percentiles for connections established towards TCP port 80 via UMTS/HSPA. The RTT percentiles for a specific SGSN area are depicted in Fig. 6(a), while Fig. 6(b) shows the RTT percentiles measured via an RNC directly connected to the GGSN (see dashed line labelled “direct tunnel” in Fig. 1). In both cases the percentiles are relatively stable over time, showing statistical fluctuations during night hours when the traffic load, and consequently also the number of RTT samples per measurement bin, is low. As expected, the direct path bypassing the SGSN has lower client-side RTT.

Fig. 7. Temporary increase of $S_L(10)$, 1 day time series, 5 min bins (ratio vs. hours after Day 1 00:00:00; separate curves for GPRS and UMTS)

3.2 Temporary Increase of Packet Loss

In Fig. 7 we report a 24-hour time series of ISR separately for UMTS/HSPA and GPRS/EDGE. The estimated ISR of GPRS/EDGE shows a time-of-day variation between slightly above 0.04 in the night hours and around 0.16 in the peak hour after 18:00. However, we observe a sudden increase of the estimated loss probability $S_L(10)$ of UMTS/HSPA starting at 04:00 and lasting until around 08:00. This sudden shift from 0.02 to 0.08, with a distinct spike of around 0.15, is clearly anomalous behavior. A closer investigation of the phenomenon showed that this anomaly was caused by a temporary network problem associated with the reconfiguration of one GGSN, which led to partial packet loss at a specific site of the network. The presented anomaly is an important confirmation that, as we expected, an anomalous increase in packet loss is reflected by an abnormal deviation in the estimated invalid sample ratio $IS_{SYNACK}$.

3.3 Detection of Bottleneck Link

Within our analysis of TCP RTT, we discriminate between server IP addresses allocated to the 3G operator (used e.g. for gateway servers and internal application servers) and all other IP addresses of the public Internet. The server-side percentiles for two consecutive days, in time bins of 5 minutes and only for internal servers deployed by the mobile network operator, are depicted in Fig. 8. The percentiles show a slight time-of-day effect, i.e. the server-side RTT is higher during the peak hours in the evening. This might be explained by two phenomena. First, an increase of traffic rate may lead to higher link utilization and thus larger delays on the path from/to the internal servers. Second, a higher load at the involved servers may increase their response time and hence also the server-side RTT. Besides a slight time-of-day effect of the percentiles we observe that 75% of the RTT samples take values below ≈2 ms. However, at around 20:30 of day 1 there is an abrupt change in the RTT percentiles. For instance, the 75th percentile is suddenly shifted from 2 ms to 10 ms. We observe that this shift of RTT percentiles persists for a period of about two hours. This is clearly anomalous behavior. By taking into account also other signals obtained from the METAWIN system [11] we found that this RTT shift coincided with a significant increase of UDP/RTP traffic from the video streaming server during the live broadcast of a soccer match. In fact this traffic increase, and consequently the abrupt shift in the server-side RTT, was triggered by a significant number of users watching this live broadcast. Note the notch in the RTT percentiles at around 21:15 during the half-time break of the soccer match. Moreover, Fig. 8 shows a second abrupt change in the RTT percentiles on the second day, with a clear spike around 14:00. Similarly to the example of the soccer match, this anomaly was caused by users watching the live broadcast of a Formula One race. Note that the increase of video streaming traffic did not only increase the server-side RTT towards the video streaming server, but also towards all internal servers located in the same subnet. Our findings finally pointed at a hidden congestion bottleneck on the path towards these internal servers of the network operator. After reporting our results to the network staff, the problem was fixed by increasing the capacity of the corresponding link. It is interesting to note that congestion caused by UDP traffic was discovered by analyzing just the handshaking packets of TCP at a single monitoring point. This result confirms the value of leveraging passively extracted TCP RTT for detecting bottleneck links in an operational 3G network.

Fig. 8. Time series of server-side RTT percentiles, internal servers, 2 days, 5 min bins (delay [sec] vs. time of day; all RATs, all ports, Gn interface; percentiles 0.01, 0.05, 0.25, 0.5, 0.75, 0.9 and 0.95)

Fig. 9. Time series of client-side percentiles, 5 days, 5 min bins: (a) UMTS/HSPA; (b) GPRS/EDGE (delay [sec] vs. days, Mon to Sat; Port 80, Gn interface; percentiles 0.01, 0.05, 0.25, 0.5, 0.75, 0.9 and 0.95)

3.4 Activation of Transparent Proxy

Fig. 9 depicts two 5-day time series of client-side RTT percentiles, separately for UMTS/HSPA and GPRS/EDGE. The client-side percentiles of UMTS/HSPA are stable over time and do not show time-of-day variation. The median RTT is slightly above 100 ms. In the network under study, traffic load steadily increases during day hours (reaching its peak after 20:00) and decreases again during night hours. The fact that client-side percentiles of UMTS/HSPA are independent of the actual traffic load suggests that the network under study is well-provisioned. The client-side percentiles of GPRS/EDGE in Fig. 9(b) are also stable and independent of the variations in traffic load. However, Fig. 9(b) shows a sudden and persistent shift in client-side percentiles of GPRS/EDGE in the morning of the third day. For instance, the median RTT is shifted from below 600 ms to around 700 ms. Further analysis revealed that the shift of RTT percentiles observed in Fig. 9(b) was caused by a reconfiguration of the network, specifically the activation of a network-wide proxy mediating TCP connections to port 80 established in the GPRS/EDGE RAN. For further clarification we now elaborate on the dependence of the client-side RTT on the SYNACK retransmission timeouts of remote servers. Recall from § 2.1 that we cannot compute a valid client-side RTT whenever a SYNACK is retransmitted by a remote server, since this retransmission leads to an ambiguous relation between the two observed SYNACKs and the ACK. A remote server retransmits a SYNACK if it did not receive the client ACK within a specific retransmission timeout (RTO). In other words, we can compute a valid client-side RTT (inferred from an unambiguous SYNACK/ACK pair) if and only if a client ACK arrives at the remote server before the expiration of its SYNACK retransmission timeout, referred to as $RTO_{SYNACK}$ in Fig. 10(a). Let $T_{SYNACK}$ denote the time required for a server SYNACK to arrive at our passive probe. $T_{ACK}$

Retransmission Timeout CDF; Gn-Interface, 19:00 to 20:59

Fraction of Samples