English Pages 543 Year 2006
PERFORMANCE ANALYSIS OF COMMUNICATIONS NETWORKS AND SYSTEMS
PIET VAN MIEGHEM Delft University of Technology
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521855150
© Cambridge University Press 2006
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2006
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Waar een wil is, is een weg. (Where there is a will, there is a way.)
to my father
to my wife Saskia and my sons Vincent, Nathan and Laurens
Contents

Preface
1 Introduction

Part I Probability theory
2 Random variables
  2.1 Probability theory and set theory
  2.2 Discrete random variables
  2.3 Continuous random variables
  2.4 The conditional probability
  2.5 Several random variables and independence
  2.6 Conditional expectation
3 Basic distributions
  3.1 Discrete random variables
  3.2 Continuous random variables
  3.3 Derived distributions
  3.4 Functions of random variables
  3.5 Examples of other distributions
  3.6 Summary tables of probability distributions
  3.7 Problems
4 Correlation
  4.1 Generation of correlated Gaussian random variables
  4.2 Generation of correlated random variables
  4.3 The non-linear transformation method
  4.4 Examples of the non-linear transformation method
  4.5 Linear combination of independent auxiliary random variables
  4.6 Problem
5 Inequalities
  5.1 The minimum (maximum) and infimum (supremum)
  5.2 Continuous convex functions
  5.3 Inequalities deduced from the Mean Value Theorem
  5.4 The Markov and Chebyshev inequalities
  5.5 The Hölder, Minkowski and Young inequalities
  5.6 The Gauss inequality
  5.7 The dominant pole approximation and large deviations
6 Limit laws
  6.1 General theorems from analysis
  6.2 Law of Large Numbers
  6.3 Central Limit Theorem
  6.4 Extremal distributions

Part II Stochastic processes
7 The Poisson process
  7.1 A stochastic process
  7.2 The Poisson process
  7.3 Properties of the Poisson process
  7.4 The nonhomogeneous Poisson process
  7.5 The failure rate function
  7.6 Problems
8 Renewal theory
  8.1 Basic notions
  8.2 Limit theorems
  8.3 The residual waiting time
  8.4 The renewal reward process
  8.5 Problems
9 Discrete-time Markov chains
  9.1 Definition
  9.2 Discrete-time Markov chain
  9.3 The steady-state of a Markov chain
  9.4 Problems
10 Continuous-time Markov chains
  10.1 Definition
  10.2 Properties of continuous-time Markov processes
  10.3 Steady-state
  10.4 The embedded Markov chain
  10.5 The transitions in a continuous-time Markov chain
  10.6 Example: the two-state Markov chain in continuous-time
  10.7 Time reversibility
  10.8 Problems
11 Applications of Markov chains
  11.1 Discrete Markov chains and independent random variables
  11.2 The general random walk
  11.3 Birth and death process
  11.4 A random walk on a graph
  11.5 Slotted Aloha
  11.6 Ranking of webpages
  11.7 Problems
12 Branching processes
  12.1 The probability generating function
  12.2 The limit $W$ of the scaled random variables $W_k$
  12.3 The probability of extinction of a branching process
  12.4 Asymptotic behavior of $W$
  12.5 A geometric branching process
13 General queueing theory
  13.1 A queueing system
  13.2 The waiting process: Lindley's approach
  13.3 The Beneš approach to the unfinished work
  13.4 The counting process
  13.5 PASTA
  13.6 Little's Law
14 Queueing models
  14.1 The M/M/1 queue
  14.2 Variants of the M/M/1 queue
  14.3 The M/G/1 queue
  14.4 The GI/D/m queue
  14.5 The M/D/1/K queue
  14.6 The N*D/D/1 queue
  14.7 The AMS queue
  14.8 The cell loss ratio
  14.9 Problems

Part III Physics of networks
15 General characteristics of graphs
  15.1 Introduction
  15.2 The number of paths with $j$ hops
  15.3 The degree of a node in a graph
  15.4 Connectivity and robustness
  15.5 Graph metrics
  15.6 Random graphs
  15.7 The hopcount in a large, sparse graph with unit link weights
  15.8 Problems
16 The Shortest Path Problem
  16.1 The shortest path and the link weight structure
  16.2 The shortest path tree in $K_N$ with exponential link weights
  16.3 The hopcount $h_N$ in the URT
  16.4 The weight of the shortest path
  16.5 The flooding time $T_N$
  16.6 The degree of a node in the URT
  16.7 The minimum spanning tree
  16.8 The proof of the degree Theorem 16.6.1 of the URT
  16.9 Problems
17 The efficiency of multicast
  17.1 General results for $g_N(m)$
  17.2 The random graph $G_p(N)$
  17.3 The $k$-ary tree
  17.4 The Chuang-Sirbu law
  17.5 Stability of a multicast shortest path tree
  17.6 Proof of (17.16): $g_N(m)$ for random graphs
  17.7 Proof of Theorem 17.3.1: $g_N(m)$ for $k$-ary trees
  17.8 Problem
18 The hopcount to an anycast group
  18.1 Introduction
  18.2 General analysis
  18.3 The $k$-ary tree
  18.4 The uniform recursive tree (URT)
  18.5 Approximate analysis
  18.6 The performance measure in exponentially growing trees

Appendix A Stochastic matrices
Appendix B Algebraic graph theory
Appendix C Solutions of problems
Bibliography
Index
Preface

Performance analysis belongs to the domain of applied mathematics. The major domain of application in this book concerns telecommunications systems and networks. We will mainly use stochastic analysis and probability theory to address problems in the performance evaluation of telecommunications systems and networks. The first chapter will provide a motivation and a statement of several problems.

This book aims to present methods rigorously, hence mathematically, with minimal recourse to intuition. It is my belief that intuition is often gained after the result is known and rarely before the problem is solved, unless the problem is simple. Techniques and terminologies of axiomatic probability (such as definitions of probability spaces, filtration, measures, etc.) have been omitted and a more direct, less abstract approach has been adopted. In addition, most of the important formulas are interpreted in the sense of “What does this mathematical expression teach me?” This last step justifies the word “applied”, since most mathematical treatises do not interpret their results, as interpretation carries the risk of being imprecise and incomplete.

The field of stochastic processes is much too large to be covered in a single book and only a selected number of topics has been chosen. Most of the topics are considered classical. Perhaps the largest omission is a treatment of Brownian processes and the many related applications. A weak excuse for this omission (besides the considerable mathematical complexity) is that Brownian theory applies more to physics (analogue fields) than to system theory (discrete components). The list of omissions is rather long and only the most noteworthy are summarized: recent concepts such as martingales and the coupling theory of stochastic variables, queueing networks, scheduling rules, and the theory of long-range dependent random variables that currently governs the Internet.
The confinement to stochastic analysis also excludes the recent new framework, called Network Calculus by Le Boudec and Thiran (2001). Network calculus is based on min-plus algebra and has been applied to (Inter)network problems in a deterministic setting.

As prerequisites, familiarity with elementary probability and knowledge of the theory of functions of a complex variable are assumed. Parts in the text in small font refer to more advanced topics or to computations that can be skipped at first reading. Part I (Chapters 2–6) reviews probability theory and it is included to make the remainder self-contained. The book essentially starts with Chapter 7 (Part II) on Poisson processes. The Poisson process (independent increments and discontinuous sample paths) and Brownian motion (independent increments but continuous sample paths) are considered to be the most important basic stochastic processes. We briefly touch upon renewal theory to move to Markov processes. The theory of Markov processes is regarded as a foundation for many applications in telecommunications systems, in particular queueing theory. A large part of the book is devoted to Markov processes and their applications. The last chapters of Part II dive into queueing theory. Inspired by intriguing problems in telephony at the beginning of the twentieth century, Erlang brought queueing theory onto the scientific scene. Since his investigations, queueing theory has grown considerably. Especially during the last decade, with the advent of the Asynchronous Transfer Mode (ATM) and the worldwide Internet, many early ideas have been refined (e.g. discrete-time queueing theory, large deviation theory, scheduling control of prioritized flows of packets) and new concepts (self-similar or fractal processes) have been proposed. Part III covers current research on the physics of networks. Part III is undoubtedly the least mature and complete. In contrast to most books, I have chosen to include the solutions to the problems in an Appendix to support self-study.

I am grateful to colleagues and students whose input has greatly improved this text. Fernando Kuipers and Stijn van Langen have corrected a large number of misprints. Together with Fernando, Milena Janic and Almerima Jamakovic have supplied me with exercises. Gerard Hooghiemstra has made valuable comments and was always available for discussions about my viewpoints. Bart Steyaert eagerly gave the finer details of the generating function approach to the GI/D/m queue. Jan Van Mieghem has given overall comments and suggestions besides his input on the computation of correlations. Finally, I thank David Hemsley for his scrupulous corrections in the original manuscript.

Although this book is intended to be of practical use, in the course of writing it, I became more and more persuaded that mathematical rigor has ample virtues of its own.

Per aspera ad astra
January 2006
Piet Van Mieghem
1 Introduction
The aim of this first chapter is to motivate why stochastic processes and probability theory are useful to solve problems in the domain of telecommunications systems and networks.

In any system, or for any transmission of information, there is always a non-zero probability of failure or of error penetration. Many problems in quantifying the failure rate, the bit error rate or the redundancy needed to recover from hazards are successfully treated by probability theory. Often we deal in communications with a large variety of signals, calls, source-destination pairs, messages, the number of customers per region, and so on. And, most often, precise information at any time is not available or, if it is available, deterministic studies or simulations are simply not feasible due to the large number of different parameters involved. For such problems, a stochastic approach is often a powerful vehicle, as has been demonstrated in the field of physics.

Perhaps the first impressive result of a stochastic approach was Boltzmann's and Maxwell's statistical theory. They studied the behavior of particles in an ideal gas and described how macroscopic quantities such as pressure and temperature can be related to the microscopic motion of the huge number of individual particles. Boltzmann also introduced the stochastic notion of the thermodynamic concept of entropy $S$,

\[ S = k \log W \]

where $W$ denotes the total number of ways in which the ensembles of particles can be distributed in thermal equilibrium and where $k$ is a proportionality factor, afterwards attributed to Boltzmann as the Boltzmann constant. The pioneering work of these early physicists such as Boltzmann, Maxwell and others was the germ of a large number of breakthroughs in science. Shortly after their introduction of stochastic theory in classical physics, the theory of quantum mechanics (see e.g. Cohen-Tannoudji et al., 1977) was established. This theory proposes that the elementary building blocks of nature, the atom and electrons, can only be described in a probabilistic sense. The conceptually difficult notion of a wave function, whose squared modulus expresses the probability that a set of particles is in a certain state, and Heisenberg's uncertainty relation exclude in a dramatic way our deterministic, macroscopic view on nature at the fine atomic scale.

At about the same time as the theory of quantum mechanics was being created, Erlang applied probability theory to the field of telecommunications. Erlang succeeded in determining the number of telephone input lines $m$ of a switch in order to serve $N_S$ customers with a certain probability $p$. Perhaps his most used formula is the Erlang B formula (14.17), derived in Section 14.2.2,

\[ \Pr[N_S = m] = \frac{\rho^m / m!}{\sum_{j=0}^{m} \rho^j / j!} \]

where the load or traffic intensity $\rho$ is the ratio of the arrival rate of calls to the telephone local exchange or switch over the processing rate of the switch per line. By equating the desired blocking probability $p = \Pr[N_S = m]$, say $p = 10^{-4}$, the number of input lines $m$ can be computed for each load $\rho$. Due to its importance, books with tables relating $p$, $\rho$ and $m$ were published.

Another pioneer in the field of communications who deserves to be mentioned is Shannon. Shannon explored the concept of entropy $S$. He introduced (see e.g. Walrand, 1998) the notion of the Shannon capacity of a channel, the maximum rate at which bits can be transmitted with arbitrarily small (but non-zero) probability of errors, and the concept of the entropy rate of a source, which is the minimum average number of bits per symbol required to encode the output of a source. Many others have extended his basic ideas and so it is fair to say that Shannon founded the field of information theory.

A recent important driver in telecommunication is the concept of quality of service (QoS). Customers can use the network to transmit different types of information such as pictures, files, voice, etc. by requiring a specific level of service depending on the type of transmitted information. For example, a telephone conversation requires that the voice packets arrive at the receiver $D$ ms later, while a file transfer is mostly not time critical but requires an extremely low information loss probability. The value of the mouth-to-ear delay $D$ is clearly related to the perceived quality of the voice conversation. As long as $D < 150$ ms, the voice conversation has toll quality, which is, roughly speaking, the quality that we are used to in classical
telephony. When $D$ exceeds 150 ms, rapid degradation is experienced and when $D > 300$ ms, most of the test persons have great difficulty in understanding the conversation. However, perceived quality may change from person to person and is difficult to determine, even for telephony. For example, if the test person knows a priori that the conversation is transmitted over a mobile or wireless channel as in GSM, he or she is willing to tolerate a lower quality. Therefore, quality of service is related both to the nature of the information and to the individual desire and perception. In future Internetworking, it is believed that customers may request a certain QoS for each type of information. Depending on the level of stringency, the network may either allow or refuse the customer. Since customers will also pay an amount related to this QoS stringency, the network function that determines whether to accept or refuse a call for service will be of crucial interest to any network operator.

Let us now state the connection admission control (CAC) problem for a voice conversation to illustrate the relation to stochastic analysis: “How many customers $m$ are allowed in order to guarantee that the ensemble of all voice packets reaches the destination within $D$ ms with probability $p$?” This problem is exceptionally difficult because it depends on the voice codecs used, the specifics of the network topology, the capacity of the individual network elements, the arrival process of calls from the customers, the duration of the conversation and other details. Therefore, we will simplify the question. Let us first assume that the delay is only caused by the waiting time of a voice packet in the queue of a router (or switch). As we will see in Chapter 13, this waiting time $T$ of voice packets in a single queueing system depends on (a) the arrival process: the way voice packets arrive, and (b) the service process: how they are processed. Let us assume that the arrival process specified by the average arrival rate $\lambda$ and the service process specified by the average service rate $\mu$ are known. Clearly, the arrival rate $\lambda$ is connected to the number of customers $m$. A simplified statement of the CAC problem is, “What is the maximum $\lambda$ allowed such that $\Pr[T > D] < \epsilon$?” In essence, the CAC problem consists in computing the tail probability of a quantity that depends on parameters of interest.

We have elaborated on the CAC problem because it is a basic design problem that appears under several disguises. A related dimensioning problem is the determination of the buffer size in a router in order not to lose more than a certain number of packets with probability $p$, given the arrival and service process. The above mentioned problem of Erlang is a third example. Another example treated in Chapter 18 is the server placement problem: “How many replicated servers $m$ are needed to guarantee that any user can access the information within $k$ hops with probability $\Pr[h_N(m) > k] \le \epsilon$”, where
$\epsilon$ is a certain level of stringency and $h_N(m)$ is the number of hops towards the nearest of the $m$ servers in a network with $N$ routers.

The popularity of the Internet results in a number of new challenges. The traditional mathematical models such as the Erlang B formula assume “smooth” traffic flows (small correlation and Markovian in nature). However, TCP/IP traffic has been shown to be “bursty” (long-range dependent, self-similar and even chaotic, non-Markovian (Veres and Boda, 2000)). As a consequence, many traditional dimensioning and control problems ask for a new solution. The self-similar and long-range dependent TCP/IP traffic is mainly caused by new complex interactions between protocols and technologies (e.g. TCP/IP/ATM/SDH) and by the transport of information other than voice. It is observed that the size of content in the Internet varies considerably, causing the “Noah effect”: although immense floods are extremely rare, their occurrence significantly impacts Internet behavior on a global scale. Unfortunately, the mathematics to cope with self-similar and long-range dependent processes turns out to be fairly complex and beyond the scope of this book.

Finally, we mention the current interest in understanding and modeling complex networks such as the Internet, biological networks, social networks and utility infrastructures for water, gas, electricity and transport (cars, goods, trains). Since these networks consist of a huge number of nodes $N$ and links $L$, classical and algebraic graph theory is often not suited to produce even approximate results. The beginning of probabilistic graph theory is commonly attributed to the appearance of papers by Erdös and Rényi in the late 1940s. They investigated a particularly simple growing model for a graph: start from $N$ nodes and connect in each step an arbitrary random, not yet connected pair of nodes until all $L$ links are used. After about $N/2$ steps, as shown in Section 16.7.1, they observed the birth of a giant component that, in subsequent steps, swallows the smaller ones at a high rate. This phenomenon is called a phase transition and often occurs in nature. In physics it is studied in, for example, percolation theory.

To some extent, the Internet's graph bears some resemblance to the Erdös-Rényi random graph. The Internet is best regarded as a dynamic and growing network, whose graph is continuously changing. Yet, in order to deploy services over the Internet, an accurate graph model that captures the relevant structural properties is desirable. As shown in Part III, a probabilistic approach based on random graphs seems an efficient way to learn about the Internet's intriguing behavior. Although the Internet's topology is not a simple Erdös-Rényi random graph, results such as the hopcount of the shortest path and the size of a multicast tree deduced from the simple random graphs provide
a first order estimate for the Internet. Moreover, analytic formulas based on other classes of graphs than the simple random graph prove difficult to obtain. This observation is similar to queueing theory, where, besides the M/G/x class of queues, hardly any closed expressions exist.

We hope that this brief overview provides sufficient motivation to surmount the mathematical barriers. Skill with probability theory is deemed necessary to understand complex phenomena in telecommunications. Once mastered, the power and beauty of mathematics will be appreciated.
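The dimensioning question behind the Erlang B formula quoted in this chapter can be tried out numerically. The following is only a minimal sketch: the function names `erlang_b` and `lines_needed` are hypothetical and not taken from the book, the load symbol is written `rho`, and the direct evaluation with factorials is meant for modest numbers of lines only.

```python
from math import factorial


def erlang_b(m: int, rho: float) -> float:
    """Blocking probability Pr[N_S = m] for m lines and offered load rho.

    Direct evaluation of (rho^m / m!) / sum_{j=0}^{m} rho^j / j!.
    """
    numerator = rho ** m / factorial(m)
    denominator = sum(rho ** j / factorial(j) for j in range(m + 1))
    return numerator / denominator


def lines_needed(rho: float, p: float) -> int:
    """Smallest number of lines m whose blocking probability is at most p."""
    m = 0
    while erlang_b(m, rho) > p:
        m += 1
    return m


if __name__ == "__main__":
    # Illustrative values only: a load of 1 and a target blocking probability of 0.25.
    print(erlang_b(2, 1.0), lines_needed(1.0, 0.25))
```

The search simply increases the number of lines until the blocking probability drops below the target, which mirrors how the published tables relating blocking probability, load and number of lines would be read.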
Part I Probability theory
2 Random variables
This chapter reviews basic concepts from probability theory. A random variable (rv) is a variable that takes certain values by chance. Throughout this book, this imprecise and intuitive definition suffices. The precise definition involves axiomatic probability theory (Billingsley, 1995).

Here, a distinction between discrete and continuous random variables is made, although a unified approach including also mixed cases via the Stieltjes integral (Hardy et al., 1999, pp. 152–157), $\int g(x)\,df(x)$, is possible. In general, the distribution $F_X(x) = \Pr[X \le x]$ holds in both cases, and

\[
\begin{aligned}
\int g(x)\,dF_X(x) &= \sum_{k} g(k)\,\Pr[X = k] && \text{where $X$ is a discrete rv} \\
&= \int g(x)\,\frac{dF_X(x)}{dx}\,dx && \text{where $X$ is a continuous rv}
\end{aligned}
\]

In most practical situations, the Stieltjes integral reduces to the Riemann integral; otherwise, Lebesgue's theory of integration and measure theory (Royden, 1988) is required.
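The two cases of the Stieltjes integral above can be illustrated with a small computation. This is a sketch under assumptions that are not in the text: the function is $g(x) = x^2$, the discrete rv is a fair die, the continuous rv is uniform on $[0, 1]$, and the integral is approximated by a midpoint Riemann sum.

```python
from fractions import Fraction


def g(x):
    """Illustrative function g(x) = x^2 (an assumption, not from the text)."""
    return x * x


def expectation_discrete(pmf):
    """Discrete case: sum over k of g(k) * Pr[X = k]."""
    return sum(g(k) * p for k, p in pmf.items())


def expectation_continuous(density, lo, hi, steps=1000):
    """Continuous case: midpoint Riemann sum of g(x) * f_X(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += g(x) * density(x)
    return total * h


# Fair die: Pr[X = k] = 1/6 for k = 1..6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
discrete_value = expectation_discrete(die)

# Uniform rv on [0, 1]: density dF_X(x)/dx = 1 on that interval.
continuous_value = expectation_continuous(lambda x: 1.0, 0.0, 1.0)
```

The discrete sum is exact because it uses rational arithmetic, whereas the continuous value is only a numerical approximation of the Riemann integral.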
2.1 Probability theory and set theory

Pascal (1623–1662) is commonly regarded as one of the founders of probability theory. In his days, there was much interest in games of chance¹ and the likelihood of winning a game. In most of these games, there was a finite number $n$ of possible outcomes and each of them was equally likely. The probability of the event $A$ of interest was defined as

\[ \Pr[A] = \frac{n_A}{n} \]

where $n_A$ is the number of favorable outcomes (sample points of $A$). If the number of outcomes of an experiment is not finite, this classical definition of probability does not suffice anymore. In order to establish a coherent and precise theory, probability theory employs concepts of group or set theory.

¹ “La règle des partis”, a chapter in Pascal's mathematical work (Pascal, 1954), consists of a series of letters to Fermat that discuss the following problem (together with a more complex question that is essentially a variant of the probability of gambler's ruin treated in Section 11.2.1): Consider the game in which 2 dice are thrown $n$ times. How many times $n$ do we have to throw the 2 dice to throw double six with probability $p = \frac{1}{2}$?

The set of all possible outcomes of an experiment is called the sample space $\Omega$. A possible outcome of an experiment is called a sample point $\omega$ that is an element of the sample space $\Omega$. An event $A$ consists of a set of sample points. An event $A$ is thus a subset of the sample space $\Omega$. The complement $A^c$ of an event $A$ consists of all sample points of the sample space $\Omega$ that are not in (the set) $A$, thus $A^c = \Omega \setminus A$. Clearly, $(A^c)^c = A$ and the complement of the sample space is the empty set, $\Omega^c = \emptyset$ or, vice versa, $\emptyset^c = \Omega$. A family $\mathcal{F}$ of events is a set of events and thus a subset of the sample space that possesses particular events as elements. More precisely, a family $\mathcal{F}$ of events satisfies the three conditions that define a $\sigma$-field²: (a) $\emptyset \in \mathcal{F}$, (b) if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{j=1}^{\infty} A_j \in \mathcal{F}$ and (c) if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$. These conditions guarantee that $\mathcal{F}$ is closed under countable unions and intersections of events.

² A field $\mathcal{F}$ possesses the properties: (i) $\emptyset \in \mathcal{F}$; (ii) if $A, B \in \mathcal{F}$, then $A \cup B \in \mathcal{F}$ and $A \cap B \in \mathcal{F}$; (iii) if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$. This definition is redundant. For, we have by (ii) and (iii) that $(A \cup B)^c \in \mathcal{F}$. Further, by De Morgan's law $(A \cup B)^c = A^c \cap B^c$, which can be deduced from Figure 2.1, and again by (iii), the argument shows that the reduced statement (ii), if $A, B \in \mathcal{F}$, then $A \cup B \in \mathcal{F}$, is sufficient to also imply that $A \cap B \in \mathcal{F}$.

Events and the probability of these events are connected by a probability measure $\Pr[\cdot]$ that assigns to each event of the family $\mathcal{F}$ of events of a sample space $\Omega$ a real number in the interval $[0, 1]$. As Axiom 1, we require that $\Pr[\Omega] = 1$. If $\Pr[A] = 0$, the occurrence of the event $A$ is not possible, while $\Pr[A] = 1$ means that the event $A$ is certain to occur. If $\Pr[A] = p$ with $0 < p < 1$, the event $A$ has probability $p$ to occur.

If the events $A$ and $B$ have no sample points in common, $A \cap B = \emptyset$, the events $A$ and $B$ are called mutually exclusive events. As an example, an event and its complement are mutually exclusive because $A \cap A^c = \emptyset$. Axiom 2 of a probability measure is that for mutually exclusive events $A$ and $B$ holds that $\Pr[A \cup B] = \Pr[A] + \Pr[B]$. The definition of a probability measure and the two axioms are sufficient to build a consistent framework on which probability theory is founded. Since $\Pr[\emptyset] = 0$ (which follows from Axiom 2 because $A \cap \emptyset = \emptyset$ and $A = A \cup \emptyset$), for mutually exclusive events $A$ and $B$ holds that $\Pr[A \cap B] = 0$.

As a classical example that explains the formal definitions, let us consider the experiment of throwing a fair die. The sample space consists of all possible outcomes: $\Omega = \{1, 2, 3, 4, 5, 6\}$. A particular outcome of the experiment, say $\omega = 3$, is a sample point $\omega \in \Omega$. One may be interested in the event $A$ where the outcome is even, in which case $A = \{2, 4, 6\}$ and $A^c = \{1, 3, 5\}$.

If $A$ and $B$ are events, the union of these events $A \cup B$ can be written using set theory as

\[ A \cup B = (A \cap B) \cup (A^c \cap B) \cup (A \cap B^c) \]

where $A \cap B$, $A^c \cap B$ and $A \cap B^c$ are mutually exclusive events. The relation is immediately understood by drawing a Venn diagram as in Fig. 2.1.

[Fig. 2.1. A Venn diagram illustrating the union $A \cup B$: two overlapping sets $A$ and $B$ in the sample space $\Omega$, with regions $A \cap B^c$, $A \cap B$ and $A^c \cap B$.]

Taking the probability measure of the union yields

\[
\begin{aligned}
\Pr[A \cup B] &= \Pr[(A \cap B) \cup (A^c \cap B) \cup (A \cap B^c)] \\
&= \Pr[A \cap B] + \Pr[A^c \cap B] + \Pr[A \cap B^c]
\end{aligned}
\tag{2.1}
\]

where the last relation follows from Axiom 2. Figure 2.1 shows that $A = (A \cap B) \cup (A \cap B^c)$ and $B = (A \cap B) \cup (A^c \cap B)$. Since the events are mutually exclusive, Axiom 2 states that

\[
\begin{aligned}
\Pr[A] &= \Pr[A \cap B] + \Pr[A \cap B^c] \\
\Pr[B] &= \Pr[A \cap B] + \Pr[A^c \cap B]
\end{aligned}
\]

Substitution into (2.1) yields the important relation

\[ \Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B] \tag{2.2} \]

Although derived for the measure $\Pr[\cdot]$, relation (2.2) also holds for other measures, for example, the cardinality (the number of elements) of a set.
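The fair-die example and relation (2.2) can be checked with exact arithmetic, and the same few lines also answer the double-six question from the footnote on Pascal. This is only a sketch: the second event `B` (outcome larger than 3) and the function names are assumptions introduced for illustration, and the double-six computation assumes independent throws of two fair dice.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # sample space of a fair die


def pr(event):
    """Classical probability n_A / n for equally likely outcomes."""
    return Fraction(len(event & OMEGA), len(OMEGA))


A = frozenset({2, 4, 6})  # the outcome is even (as in the text)
B = frozenset({4, 5, 6})  # hypothetical second event: the outcome exceeds 3

lhs = pr(A | B)                  # Pr[A union B]
rhs = pr(A) + pr(B) - pr(A & B)  # right-hand side of relation (2.2)


def throws_for_double_six(target=Fraction(1, 2)):
    """Smallest n with Pr[at least one double six in n throws] >= target."""
    miss = Fraction(35, 36)      # probability of no double six in one throw
    n, none_so_far = 0, Fraction(1)
    while 1 - none_so_far < target:
        n += 1
        none_so_far *= miss
    return n
```

Because `pr` is just a normalized cardinality, the same check also illustrates the remark that (2.2) holds for the number of elements of a set.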
2.1.1 The inclusion-exclusion formula

A generalization of the relation (2.2) is the inclusion-exclusion formula,

\[
\begin{aligned}
\Pr\left[\bigcup_{k=1}^{n} A_k\right] ={}& \sum_{k_1=1}^{n} \Pr[A_{k_1}] - \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \Pr[A_{k_1} \cap A_{k_2}] \\
&+ \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \sum_{k_3=k_2+1}^{n} \Pr[A_{k_1} \cap A_{k_2} \cap A_{k_3}] \\
&+ \cdots + (-1)^{n-1} \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \cdots \sum_{k_n=k_{n-1}+1}^{n} \Pr\left[\bigcap_{j=1}^{n} A_{k_j}\right]
\end{aligned}
\tag{2.3}
\]

The formula shows that the probability of the union consists of the sum of probabilities of the individual events (first term). Since sample points can belong to more than one event $A_k$, the first term possesses double countings. The second term removes all probabilities of sample points that belong to precisely two event sets. However, by doing so (draw a Venn diagram), we also subtract the probabilities of sample points that belong to three event sets more than needed. The third term adds these again, and so on. The inclusion-exclusion formula can be written more compactly as,

\[
\Pr\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{j=1}^{n} (-1)^{j-1} \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \cdots \sum_{k_j=k_{j-1}+1}^{n} \Pr\left[\bigcap_{m=1}^{j} A_{k_m}\right]
\tag{2.4}
\]

or with

\[
S_j = \sum_{1 \le k_1 < k_2 < \cdots < k_j \le n} \Pr\left[\bigcap_{m=1}^{j} A_{k_m}\right]
\]

as

\[
\Pr\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{j=1}^{n} (-1)^{j-1} S_j
\tag{2.5}
\]
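Formula (2.5) can be verified on a small finite sample space by computing the sums $S_j$ explicitly. The sketch below makes several assumptions for illustration only: equally likely outcomes, the sample space $\{1, \ldots, 30\}$, and the three events "divisible by 2", "divisible by 3" and "divisible by 5"; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def pr(event, omega):
    """Probability of an event when all outcomes in omega are equally likely."""
    return Fraction(len(event), len(omega))


def inclusion_exclusion(events, omega):
    """Right-hand side of (2.5): sum over j of (-1)^(j-1) * S_j."""
    n = len(events)
    total = Fraction(0)
    for j in range(1, n + 1):
        # S_j sums Pr of the intersection over all index sets k_1 < ... < k_j.
        s_j = sum(
            (pr(frozenset.intersection(*subset), omega)
             for subset in combinations(events, j)),
            Fraction(0),
        )
        total += (-1) ** (j - 1) * s_j
    return total


omega = frozenset(range(1, 31))
events = [frozenset(x for x in omega if x % d == 0) for d in (2, 3, 5)]

direct = pr(frozenset.union(*events), omega)   # Pr of the union, counted directly
via_formula = inclusion_exclusion(events, omega)
```

Using `combinations` enumerates each index set $k_1 < k_2 < \cdots < k_j$ exactly once, which is what the nested sums in (2.4) express.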
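Proof of the inclusion-exclusion formula³: Let $A = \bigcup_{k=1}^{n-1} A_k$ and $B = A_n$ such that $A \cup B = \bigcup_{k=1}^{n} A_k$ and $A \cap B = A_n \cap \bigcup_{k=1}^{n-1} A_k = \bigcup_{k=1}^{n-1} A_k \cap A_n$ by the distributive law in set theory, then application of (2.2) yields the recursion in $n$

\[
\Pr\left[\bigcup_{k=1}^{n} A_k\right] = \Pr\left[\bigcup_{k=1}^{n-1} A_k\right] + \Pr[A_n] - \Pr\left[\bigcup_{k=1}^{n-1} A_k \cap A_n\right]
\tag{2.6}
\]

By direct substitution of $n \to n-1$, we have

\[
\Pr\left[\bigcup_{k=1}^{n-1} A_k\right] = \Pr\left[\bigcup_{k=1}^{n-2} A_k\right] + \Pr[A_{n-1}] - \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_{n-1}\right]
\]

while substitution in this formula of $A_k \to A_k \cap A_n$ gives

\[
\Pr\left[\bigcup_{k=1}^{n-1} A_k \cap A_n\right] = \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n\right] + \Pr[A_{n-1} \cap A_n] - \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n \cap A_{n-1}\right]
\]

Substitution of the last two expressions into (2.6) yields

\[
\begin{aligned}
\Pr\left[\bigcup_{k=1}^{n} A_k\right] ={}& \Pr[A_{n-1}] + \Pr[A_n] - \Pr[A_{n-1} \cap A_n] + \Pr\left[\bigcup_{k=1}^{n-2} A_k\right] \\
&- \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_{n-1}\right] - \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n\right] + \Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n \cap A_{n-1}\right]
\end{aligned}
\tag{2.7}
\]

Similarly, in a next iteration we use (2.6) after suitable modification in the right-hand side of (2.7) to lower the upper index in the union,

\[
\begin{aligned}
\Pr\left[\bigcup_{k=1}^{n-2} A_k\right] ={}& \Pr\left[\bigcup_{k=1}^{n-3} A_k\right] + \Pr[A_{n-2}] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-2}\right] \\
\Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_{n-1}\right] ={}& \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-1}\right] + \Pr[A_{n-2} \cap A_{n-1}] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-1} \cap A_{n-2}\right] \\
\Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n\right] ={}& \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n\right] + \Pr[A_{n-2} \cap A_n] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-2}\right] \\
\Pr\left[\bigcup_{k=1}^{n-2} A_k \cap A_n \cap A_{n-1}\right] ={}& \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-1}\right] + \Pr[A_{n-2} \cap A_n \cap A_{n-1}] \\
&- \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-1} \cap A_{n-2}\right]
\end{aligned}
\]

The result is

\[
\begin{aligned}
\Pr\left[\bigcup_{k=1}^{n} A_k\right] ={}& \Pr[A_{n-2}] + \Pr[A_{n-1}] + \Pr[A_n] - \Pr[A_{n-2} \cap A_{n-1}] - \Pr[A_{n-2} \cap A_n] \\
&- \Pr[A_{n-1} \cap A_n] + \Pr[A_{n-2} \cap A_{n-1} \cap A_n] + \Pr\left[\bigcup_{k=1}^{n-3} A_k\right] \\
&- \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-2}\right] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-1}\right] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n\right] \\
&+ \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_{n-1} \cap A_{n-2}\right] + \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-2}\right] \\
&+ \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-1}\right] - \Pr\left[\bigcup_{k=1}^{n-3} A_k \cap A_n \cap A_{n-1} \cap A_{n-2}\right]
\end{aligned}
\]

which starts revealing the structure of (2.3). Rather than continuing the iterations, we prove the validity of the inclusion-exclusion formula (2.3) via induction. In case $n = 2$, the basic expression (2.2) is found. Assume that (2.3) holds for $n$, then the case for $n + 1$ must obey (2.6) where $n \to n + 1$,

\[
\Pr\left[\bigcup_{k=1}^{n+1} A_k\right] = \Pr\left[\bigcup_{k=1}^{n} A_k\right] + \Pr[A_{n+1}] - \Pr\left[\bigcup_{k=1}^{n} A_k \cap A_{n+1}\right]
\]

Substitution of (2.3) into the above expression yields, after suitable grouping of the terms,

\[
\begin{aligned}
\Pr\left[\bigcup_{k=1}^{n+1} A_k\right] ={}& \Pr[A_{n+1}] + \sum_{k_1=1}^{n} \Pr[A_{k_1}] - \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \Pr[A_{k_1} \cap A_{k_2}] - \sum_{k_1=1}^{n} \Pr[A_{k_1} \cap A_{n+1}] \\
&+ \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \sum_{k_3=k_2+1}^{n} \Pr[A_{k_1} \cap A_{k_2} \cap A_{k_3}] + \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \Pr[A_{k_1} \cap A_{k_2} \cap A_{n+1}] \\
&+ \cdots + (-1)^{n-1} \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \cdots \sum_{k_n=k_{n-1}+1}^{n} \Pr\left[\bigcap_{j=1}^{n} A_{k_j}\right] \\
&+ (-1)^{n} \sum_{k_1=1}^{n} \sum_{k_2=k_1+1}^{n} \cdots \sum_{k_n=k_{n-1}+1}^{n} \Pr\left[\bigcap_{j=1}^{n} A_{k_j} \cap A_{n+1}\right] \\
={}& \sum_{k_1=1}^{n+1} \Pr[A_{k_1}] - \sum_{k_1=1}^{n+1} \sum_{k_2=k_1+1}^{n+1} \Pr[A_{k_1} \cap A_{k_2}] + \sum_{k_1=1}^{n+1} \sum_{k_2=k_1+1}^{n+1} \sum_{k_3=k_2+1}^{n+1} \Pr[A_{k_1} \cap A_{k_2} \cap A_{k_3}] \\
&+ \cdots + (-1)^{n} \sum_{k_1=1}^{n+1} \sum_{k_2=k_1+1}^{n+1} \cdots \sum_{k_{n+1}=k_n+1}^{n+1} \Pr\left[\bigcap_{j=1}^{n+1} A_{k_j}\right]
\end{aligned}
\]

which proves (2.3). □

³ Another proof (Grimmett and Stirzaker, 2001, p. 56) uses the indicator function defined in Section 2.2.1. Useful indicator function relations are

\[
\begin{aligned}
1_{A \cap B} &= 1_A 1_B \\
1_{A^c} &= 1 - 1_A \\
1_{A \cup B} &= 1 - 1_{(A \cup B)^c} = 1 - 1_{A^c \cap B^c} = 1 - 1_{A^c} 1_{B^c} \\
&= 1 - (1 - 1_A)(1 - 1_B) = 1_A + 1_B - 1_A 1_B = 1_A + 1_B - 1_{A \cap B}
\end{aligned}
\]

Generalizing the last relation yields

\[
1_{\bigcup_{k=1}^{n} A_k} = 1 - \prod_{k=1}^{n} (1 - 1_{A_k})
\]

Multiplying out and taking the expectations using (2.13) leads to (2.3).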
Although its form looks formidable, the inclusion-exclusion formula is useful when dealing with dependent random variables because of its general nature. In particular, if $\Pr\big[\bigcap_{m=1}^{j} A_{k_m}\big] = a_j$ and not a function of the specific indices $k_m$, the inclusion-exclusion formula (2.4) becomes more attractive,
\[
\Pr\Big[\bigcup_{k=1}^{n} A_k\Big] = \sum_{j=1}^{n} (-1)^{j-1} a_j \sum_{1 \le k_1 < k_2 < \cdots < k_j \le n} 1 = \sum_{j=1}^{n} (-1)^{j-1} \binom{n}{j} a_j
\]
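As a hedged illustration of this simplified form, consider the classical matching problem, which is an example chosen here and not taken from the text: in a uniformly random permutation of $n$ items, let $A_k$ be the event that item $k$ stays in its place. Every $j$-fold intersection then has probability $a_j = (n-j)!/n!$, independent of the specific indices. The Python sketch below (all function names are illustrative) compares the formula with a brute-force enumeration.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def union_probability(n, a):
    """Pr[A_1 u ... u A_n] when every j-fold intersection has probability a(j)."""
    return sum((-1) ** (j - 1) * comb(n, j) * a(j) for j in range(1, n + 1))


def fixed_point_probability(n):
    """Probability that a random permutation of n items has a fixed point."""
    # Assumption of this example: a_j = (n - j)! / n!
    return union_probability(n, lambda j: Fraction(factorial(n - j), factorial(n)))


def fixed_point_probability_brute_force(n):
    """Same probability, obtained by enumerating all n! permutations."""
    hits = sum(
        any(perm[i] == i for i in range(n)) for perm in permutations(range(n))
    )
    return Fraction(hits, factorial(n))
```

Exact rational arithmetic is used so that both computations can be compared with `==` rather than with a tolerance.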
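An application of the latter formula to multicast can be found in Chapter 17 and many others are in Feller (1970, Chapter IV). Sometimes it is useful to reason with the complement of the union $\big(\bigcup_{k=1}^{n} A_k\big)^c = \Omega \setminus \bigcup_{k=1}^{n} A_k = \bigcap_{k=1}^{n} A_k^c$. Applying Axiom 2 to $\big(\bigcup_{k=1}^{n} A_k\big)^c \cup \big(\bigcup_{k=1}^{n} A_k\big) = \Omega$,
\[
\Pr\Big[\Big(\bigcup_{k=1}^{n} A_k\Big)^c\Big] = \Pr[\Omega] - \Pr\Big[\bigcup_{k=1}^{n} A_k\Big]
\]
and using Axiom 1 and the inclusion-exclusion formula (2.5), we obtain
\[
\Pr\Big[\Big(\bigcup_{k=1}^{n} A_k\Big)^c\Big] = 1 - \sum_{j=1}^{n} (-1)^{j-1} S_j = \sum_{j=0}^{n} (-1)^{j} S_j
\tag{2.8}
\]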
with the convention that $S_0 = 1$. Boole's inequalities
\[
\Pr\Big[\bigcup_{k=1}^{n} A_k\Big] \le \sum_{k=1}^{n} \Pr[A_k]
\tag{2.9}
\]
\[
\Pr\Big[\bigcap_{k=1}^{n} A_k\Big] \ge 1 - \sum_{k=1}^{n} \Pr[A_k^c]
\]
are derived as consequences of the inclusion-exclusion formula (2.3). The equality sign in (2.9) holds only if all events are mutually exclusive, whilst the inequality sign follows from the fact that possible overlaps in events are, in contrast to the inclusion-exclusion formula (2.3), not subtracted. The inclusion-exclusion formula is of a more general nature and also applies to other measures on sets than $\Pr[\cdot]$, for example to the cardinality as mentioned above. For the cardinality of a set $A$, which is usually denoted by $|A|$, the inclusion-exclusion variant of (2.8) is
\[
\Big|\Big(\bigcup_{k=1}^{n} A_k\Big)^c\Big| = \sum_{j=0}^{n} (-1)^{j} |S_j|
\tag{2.10}
\]
where the total number of elements in the sample space is $|S_0| = N$ and
\[
|S_j| = \sum_{1 \le k_1 < k_2 < \cdots < k_j \le n} \Big|\bigcap_{m=1}^{j} A_{k_m}\Big|
\]
A nice illustration of the above formula (2.10) applies to the sieve of Eratosthenes (Hardy and Wright, 1968, p. 4), a procedure to construct the table of prime numbers$^{4}$ up to $N$. Consider the increasing sequence of integers
\[
\Omega = \{2, 3, 4, \ldots, N\}
\]
and remove successively all multiples of 2 (even numbers starting from 4, 6, ...), all multiples of 3 (starting from $3^2$ and not yet removed previously), all multiples of 5, all multiples of the next number larger than 5 and still in the list (which is the prime 7) and so on, up to all multiples of the largest possible prime divisor that is equal to or smaller than $[\sqrt{N}]$. Here $[x]$ is the largest integer smaller than or equal to $x$. The remaining numbers in the list are prime numbers. Let us now compute the number of primes $\pi(N)$ smaller than or equal to $N$ by using the inclusion-exclusion formula (2.10).

$^{4}$ An integer number $p$ is prime if $p > 1$ and $p$ has no other integer divisors than 1 and itself $p$. The sequence of the first primes is 2, 3, 5, 7, 11, 13, etc. If $n = ab$ with divisors $a$ and $b$, then $a$ and $b$ cannot both exceed $\sqrt{n}$. Hence, any composite number $n$ is divisible by a prime $p$ that does not exceed $\sqrt{n}$.
The number of primes smaller than or equal to a real number $x$ is $\pi(x)$ and, evidently, if $p_n$ denotes the $n$-th prime, then $\pi(p_n) = n$. Let $A_k$ denote the set of the multiples of the $k$-th prime $p_k$ that belong to $\Omega$. The number of such sets $A_k$ in the sieve of Eratosthenes is equal to the index $n$ of the largest prime number $p_n$ smaller than or equal to $[\sqrt{N}]$, hence, $n = \pi(\sqrt{N})$. If $q \in \big(\bigcup_{k=1}^{n} A_k\big)^c$, this means that $q$ is not divisible by any prime number smaller than or equal to $p_n$ and that $q$ is a prime number lying between $\sqrt{N} < q \le N$. The cardinality of the set $\big(\bigcup_{k=1}^{n} A_k\big)^c$, the number of primes between $\sqrt{N} < q \le N$, is
\[
\Big|\Big(\bigcup_{k=1}^{n} A_k\Big)^c\Big| = \pi(N) - \pi(\sqrt{N})
\]
On the other hand, if $r \in \bigcap_{m=1}^{j} A_{k_m}$ for $1 \le k_1 < k_2 < \cdots < k_j \le n$, then $r$ is a multiple of $p_{k_1} p_{k_2} \cdots p_{k_j}$ and the number of multiples of the integer $p_{k_1} p_{k_2} \cdots p_{k_j}$ in $\Omega$ is
\[
\Big|\bigcap_{m=1}^{j} A_{k_m}\Big| = \left[\frac{N}{p_{k_1} p_{k_2} \cdots p_{k_j}}\right]
\]
Applying the inclusion-exclusion formula (2.10) with $|\Omega| = |S_0| = N - 1$ and $n = \pi(\sqrt{N})$ gives
\[
\pi(N) - \pi(\sqrt{N}) = N - 1 + \sum_{j=1}^{n} (-1)^{j} \sum_{1 \le k_1 < k_2 < \cdots < k_j \le n} \left[\frac{N}{p_{k_1} p_{k_2} \cdots p_{k_j}}\right]
\]
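The counting formula above can be checked numerically. The following Python sketch is only an illustration under the stated definitions (the function names are assumptions of this example): it sieves the primes up to $[\sqrt{N}]$, evaluates the inclusion-exclusion sum over all products of distinct small primes, and adds $\pi(\sqrt{N})$ to obtain $\pi(N)$.

```python
from itertools import combinations
from math import isqrt, prod


def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, isqrt(limit) + 1):
        if is_prime[p]:
            # Multiples below p*p were already removed by smaller primes.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def prime_count(N):
    """pi(N) from the inclusion-exclusion formula over the primes <= sqrt(N)."""
    small_primes = primes_up_to(isqrt(N))
    n = len(small_primes)  # n = pi(sqrt(N))
    total = N - 1          # |S_0|, the size of {2, ..., N}
    for j in range(1, n + 1):
        for subset in combinations(small_primes, j):
            total += (-1) ** j * (N // prod(subset))
    return total + n       # add back the primes <= sqrt(N)
```

The number of terms grows as $2^n$, so this sketch is meant for checking the formula on small $N$, not as an efficient prime-counting method.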
The knowledge of the prime numbers smaller than or equal to $[\sqrt{N}]$, i.e. the first $n = \pi(\sqrt{N})$ primes, suffices to compute the number of primes $\pi(N)$ smaller than or equal to $N$ without explicitly knowing the primes $q$ lying between $\sqrt{N} < q \le N$.

2.2 Discrete random variables

Discrete random variables are real functions $X$ defined on a discrete probability space as $X : \Omega \to \mathbb{R}$ with the property that the event $\{\omega \in \Omega : X(\omega) = x\} \in \mathcal{F}$ for each $x \in \mathbb{R}$. The event $\{\omega \in \Omega : X(\omega) = x\}$ is further abbreviated as $\{X = x\}$. A discrete probability density function (pdf) $\Pr[X = x]$ has the following properties:
(i) $0 \le \Pr[X = x] \le 1$ for real $x$ that are possible outcomes of an experiment. The set of values $x$ can be finite or countably infinite and constitutes the discrete probability space.
(ii) $\sum_x \Pr[X = x] = 1$.
In the classical example of throwing a die, the discrete probability space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and, since each of the six faces of the (fair) die is equally possible as outcome, $\Pr[X = x] = \frac{1}{6}$ for each $x \in \Omega$.

2.2.1 The expectation
An important operator acting on a discrete random variable $X$ is the expectation, defined as
\[
E[X] = \sum_x x \Pr[X = x]
\tag{2.11}
\]
The expectation $E[X]$ is also called the mean or average or first moment of $X$. More generally, if $X$ is a discrete random variable and $g$ is a function, then $Y = g(X)$ is also a discrete random variable with expectation $E[Y]$ equal to
\[
E[g(X)] = \sum_x g(x) \Pr[X = x]
\tag{2.12}
\]
A special and often used function in probability theory is the indicator function $1_y$ defined as 1 if the condition $y$ is true and otherwise it is zero. For example,
\[
\begin{aligned}
E[1_{X > a}] &= \sum_x 1_{x > a} \Pr[X = x] = \sum_{x > a} \Pr[X = x] = \Pr[X > a] \\
E[1_{X = a}] &= \Pr[X = a]
\end{aligned}
\tag{2.13}
\]
The higher moments of a random variable are defined as the case where $g(x) = x^n$,
\[
E[X^n] = \sum_x x^n \Pr[X = x]
\tag{2.14}
\]
From the definition (2.11), it follows that the expectation is a linear operator,
\[
E\Big[\sum_{k=1}^{n} a_k X_k\Big] = \sum_{k=1}^{n} a_k E[X_k]
\]
The variance of $X$ is defined as
\[
\operatorname{Var}[X] = E\big[(X - E[X])^2\big]
\tag{2.15}
\]
The variance is always non-negative. Using the linearity of the expectation operator and $\mu = E[X]$, we rewrite (2.15) as
\[
\operatorname{Var}[X] = E[X^2] - \mu^2
\tag{2.16}
\]
Since $\operatorname{Var}[X] \ge 0$, relation (2.16) indicates that $E[X^2] \ge (E[X])^2$. Often the standard deviation, defined as $\sigma = \sqrt{\operatorname{Var}[X]}$, is used. An interesting variational principle of the variance follows, for the variable $u$, from
\[
E\big[(X - u)^2\big] = E\big[(X - \mu)^2\big] + (u - \mu)^2
\]
which is minimized at $u = \mu = E[X]$ with value $\operatorname{Var}[X]$. Hence, the best least square approximation of the random variable $X$ is the number $E[X]$.
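The definitions (2.11), (2.12), (2.15) and (2.16) can be tried on the fair die mentioned earlier. The sketch below is a minimal illustration in Python with exact fractions; the helper names `expectation` and `mean_square_error` are assumptions of this example.

```python
from fractions import Fraction

# Fair die: Pr[X = x] = 1/6 for x in {1, ..., 6}.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) Pr[X = x], as in (2.12)."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(die)                                  # E[X]
second_moment = expectation(die, lambda x: x ** 2)       # E[X^2]
variance = expectation(die, lambda x: (x - mean) ** 2)   # definition (2.15)


def mean_square_error(u):
    """E[(X - u)^2], minimized at u = E[X] with value Var[X]."""
    return expectation(die, lambda x: (x - u) ** 2)
```

For any trial value `u`, `mean_square_error(u)` exceeds the variance by exactly `(u - mean) ** 2`, which is the variational principle stated above.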
2.2.2 The probability generating function

The probability generating function (pgf) of a discrete random variable $X$ is defined, for complex $z$, as
\[
\varphi_X(z) = E\big[z^X\big] = \sum_x z^x \Pr[X = x]
\tag{2.17}
\]
where the last equality follows from (2.12). If $X$ is integer-valued and non-negative, then the pgf is the Taylor expansion of the complex function $\varphi_X(z)$. Commonly the latter restriction applies, otherwise the substitution $z = e^{it}$ is used such that (2.17) expresses the Fourier series of $\varphi_X(e^{it})$. The importance of the pgf mainly lies in the fact that the theory of functions can be applied. Numerous examples of the power of analysis will be illustrated. Concentrating on non-negative integer random variables $X$,
\[
\varphi_X(z) = \sum_{k=0}^{\infty} \Pr[X = k] z^k
\tag{2.18}
\]
and the Taylor coefficients obey
\[
\Pr[X = k] = \frac{1}{k!} \left. \frac{d^k \varphi_X(z)}{dz^k} \right|_{z=0}
\tag{2.19}
\]
\[
\Pr[X = k] = \frac{1}{2\pi i} \int_{C(0)} \frac{\varphi_X(z)}{z^{k+1}} \, dz
\tag{2.20}
\]
where $C(0)$ denotes a contour around $z = 0$. Both are inversion formulae$^{5}$. Since the general form $E[g(X)]$ is completely defined when $\Pr[X = x]$ is known, the knowledge of the pgf results in a complete alternative description,
\[
E[g(X)] = \sum_{k=0}^{\infty} \frac{g(k)}{k!} \left. \frac{d^k \varphi_X(z)}{dz^k} \right|_{z=0}
\tag{2.21}
\]

$^{5}$ A similar inversion formula for Fourier series exists (see e.g. Titchmarsh (1948)).
Sometimes it is more convenient to compute values of interest directly from (2.17) rather than from (2.21). For example, $n$-fold differentiation of $\varphi_X(z) = E[z^X]$ yields
\[
\frac{1}{n!} \frac{d^n \varphi_X(z)}{dz^n} = \frac{1}{n!} E\big[X(X-1) \cdots (X-n+1) z^{X-n}\big] = E\left[\binom{X}{n} z^{X-n}\right]
\]
such that
\[
\frac{1}{n!} \left. \frac{d^n \varphi_X(z)}{dz^n} \right|_{z=1} = E\left[\binom{X}{n}\right]
\tag{2.22}
\]
Similarly, let $z = e^t$, then
\[
\frac{d^n \varphi_X(e^t)}{dt^n} = E\big[X^n e^{tX}\big]
\]
from which the moments follow as
\[
E[X^n] = \left. \frac{d^n \varphi_X(e^t)}{dt^n} \right|_{t=0}
\tag{2.23}
\]
and, more generally,
\[
E[(X - a)^n] = \left. \frac{d^n \big(e^{-ta} \varphi_X(e^t)\big)}{dt^n} \right|_{t=0}
\tag{2.24}
\]
2.2.3 The logarithm of the probability generating function

The logarithm of the probability generating function is defined as
\[
L_X(z) = \log(\varphi_X(z)) = \log\big(E\big[z^X\big]\big)
\tag{2.25}
\]
from which $L_X(1) = 0$ because $\varphi_X(1) = 1$. The derivative $L'_X(z) = \frac{\varphi'_X(z)}{\varphi_X(z)}$ shows that $L'_X(1) = \varphi'_X(1)$, while from $L''_X(z) = \frac{\varphi''_X(z)}{\varphi_X(z)} - \left(\frac{\varphi'_X(z)}{\varphi_X(z)}\right)^2$, it follows that $L''_X(1) = \varphi''_X(1) - (\varphi'_X(1))^2$. These first few derivatives are interesting because they are related directly to probabilistic quantities. Indeed, from (2.23), we observe that
\[
E[X] = \varphi'_X(1) = L'_X(1)
\tag{2.26}
\]
and from $E[X^2] = \varphi''_X(1) + \varphi'_X(1)$,
\[
\operatorname{Var}[X] = \varphi''_X(1) + \varphi'_X(1) - \big(\varphi'_X(1)\big)^2 = L''_X(1) + L'_X(1)
\tag{2.27}
\]
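Relations (2.26) and (2.27) are easy to try on a pgf that is a polynomial. The Python sketch below is an illustration only: it stores the pgf of a fair die as a coefficient list (an assumption of this example, as are the helper names), differentiates it term by term and evaluates the derivatives at $z = 1$.

```python
from fractions import Fraction


def derivative(coeffs):
    """Coefficients of the derivative of sum_k coeffs[k] * z**k."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def evaluate(coeffs, z):
    """Value of the polynomial sum_k coeffs[k] * z**k."""
    return sum(c * z ** k for k, c in enumerate(coeffs))


# pgf of a fair die: coefficient k equals Pr[X = k], see (2.18).
die_pgf = [Fraction(0)] + [Fraction(1, 6)] * 6

first = derivative(die_pgf)
second = derivative(first)

mean = evaluate(first, 1)                              # (2.26)
variance = evaluate(second, 1) + mean - mean ** 2      # (2.27)
```

The same two helper functions apply to any random variable with finite support on the non-negative integers.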
2.3 Continuous random variables

Although most of the concepts defined above for discrete random variables are readily transferred to continuous random variables, the calculus is in general more difficult. Indeed, instead of reasoning on the pdf, it is more convenient to work with the probability distribution function defined for both discrete and continuous random variables as
\[
F_X(x) = \Pr[X \le x]
\tag{2.28}
\]
Clearly, we have lim … For mutually exclusive events $A \cap B = \emptyset$, Axiom 2 in Section 2.1 states that $\Pr[A \cup B] = \Pr[A] + \Pr[B]$ which proves (2.29). As a corollary of (2.29), $F_X(x)$ is continuous at the right which follows from (2.29) by denoting $a = b - \epsilon$ for any $\epsilon > 0$. Less precisely, it follows from the equality sign at the right, $X \le b$, and inequality at the left, $a < X$. Hence, $F_X(x)$ is not necessarily continuous at the left which implies that $F_X(x)$ is not necessarily continuous and that $F_X(x)$ may possess jumps. But even if $F_X(x)$ is continuous, the pdf is not necessarily continuous$^{6}$. The pdf of a continuous random variable $X$ is defined as
\[
f_X(x) = \frac{dF_X(x)}{dx}
\tag{2.30}
\]

$^{6}$ Weierstrass was the first to present a continuous non-differentiable function,
\[
f(x) = \sum_{n=0}^{\infty} b^n \cos(a^n \pi x)
\]
where $0 < b < 1$ and $a$ is an odd positive integer. Since the series is uniformly convergent for any $x$, $f(x)$ is continuous everywhere. Titchmarsh (1964, Chapter IX) demonstrates for $ab > 1 + \frac{3\pi}{2}$ that $\frac{f(x+h) - f(x)}{h}$ takes arbitrarily large values such that $f'(x)$ does not exist. Another class of continuous non-differentiable functions are the sample paths of a Brownian motion. The Cantor function which is discussed in (Berger, 1993, p. 21) and (Billingsley, 1995, p. 407) is another classical, noteworthy function with peculiar properties.
Assuming that $F_X(x)$ is differentiable at $x$, from (2.29), we have for small, positive $\Delta x$
\[
\Pr[x < X \le x + \Delta x] = F_X(x + \Delta x) - F_X(x) = \frac{dF_X(x)}{dx} \Delta x + O\big((\Delta x)^2\big)
\]
Using the definition (2.30) indicates that, if $F_X(x)$ is differentiable at $x$, $\Pr[x < X \le x + \Delta x]$ … 0) are both integrable over $A$. Although this restriction seems only of theoretical interest, in some applications (see the
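$Y = g(X)$ is also a continuous random variable with expectation $E[Y]$ equal to
\[
E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx
\tag{2.34}
\]
It is often useful to express the expectation $E[X]$ of a non-negative random variable $X$ in tail probabilities. Upon integration by parts,
\[
\begin{aligned}
E[X] &= \int_{0}^{\infty} x f_X(x) \, dx = \left. -x \int_{x}^{\infty} f_X(u) \, du \right|_{0}^{\infty} + \int_{0}^{\infty} dx \int_{x}^{\infty} f_X(u) \, du \\
&= \int_{0}^{\infty} (1 - F_X(x)) \, dx
\end{aligned}
\tag{2.35}
\]
The case for a non-positive random variable $X$ is derived analogously,
\[
\begin{aligned}
E[X] &= \int_{-\infty}^{0} x f_X(x) \, dx = \left. x \int_{-\infty}^{x} f_X(u) \, du \right|_{-\infty}^{0} - \int_{-\infty}^{0} dx \int_{-\infty}^{x} f_X(u) \, du \\
&= -\int_{-\infty}^{0} F_X(x) \, dx
\end{aligned}
\]
The general case follows by addition:
\[
E[X] = \int_{0}^{\infty} (1 - F_X(x)) \, dx - \int_{-\infty}^{0} F_X(x) \, dx
\]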
A similar expression exists for discrete random variables. In general for any discrete random variable $X$, we can write
\[
\begin{aligned}
E[X] &= \sum_{k=-\infty}^{\infty} k \Pr[X = k] = \sum_{k=-\infty}^{-1} k \Pr[X = k] + \sum_{k=0}^{\infty} k \Pr[X = k] \\
&= \sum_{k=-\infty}^{-1} k \left(\Pr[X \le k] - \Pr[X \le k-1]\right) + \sum_{k=0}^{\infty} k \left(\Pr[X \ge k] - \Pr[X \ge k+1]\right) \\
&= \sum_{k=-\infty}^{-1} k \Pr[X \le k] - \sum_{k=-\infty}^{-2} (k+1) \Pr[X \le k] + \sum_{k=1}^{\infty} k \Pr[X \ge k] - \sum_{k=1}^{\infty} (k-1) \Pr[X \ge k] \\
&= -\Pr[X \le -1] - \sum_{k=-\infty}^{-2} \Pr[X \le k] + \sum_{k=1}^{\infty} \Pr[X \ge k]
\end{aligned}
\]
Cauchy distribution defined in (3.38)) the Riemann integral may exist where the Lebesgue integral does not. For example, $\int_{0}^{\infty} \frac{\sin x}{x} \, dx$ equals, in the Riemann sense, $\frac{\pi}{2}$ (which is a standard exercise in contour integration), but this integral does not exist in the Lebesgue sense. Only for improper integrals (integration interval is infinite) may the Riemann integral exist where the Lebesgue integral does not. However, in most other cases (integration over a finite interval), Lebesgue integration is more general. For instance, if $f(x) = 1_{\{x \text{ is rational}\}}$, then $\int_{0}^{1} f(u) \, du$ does not exist in the Riemann sense (since upper and lower sums do not converge to each other). However, $\int_{0}^{1} f(u) \, du = 0$ in the Lebesgue sense (since $f$ differs from 0 only on a set of measure zero, namely all rational numbers in $[0, 1]$). In probability theory and measure theory, Lebesgue integration is assumed.
or the mean of a discrete random variable $X$ expressed in tail probabilities is$^{9}$
\[
E[X] = \sum_{k=1}^{\infty} \Pr[X \ge k] - \sum_{k=-\infty}^{-1} \Pr[X \le k]
\tag{2.36}
\]
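A small numerical check of (2.36) may help. The sketch below is a hedged illustration in Python: the random variable (a fair die shifted by 3, so that it takes the values $-2, \ldots, 3$) and the function names are assumptions of this example.

```python
from fractions import Fraction


def mean_direct(pmf):
    """E[X] = sum over k of k Pr[X = k]."""
    return sum(k * p for k, p in pmf.items())


def mean_from_tails(pmf):
    """E[X] from the tail probabilities, as in (2.36), for a finite support."""
    lowest, highest = min(pmf), max(pmf)
    upper = sum(
        sum(p for x, p in pmf.items() if x >= k) for k in range(1, highest + 1)
    )
    lower = sum(
        sum(p for x, p in pmf.items() if x <= k) for k in range(lowest, 0)
    )
    return upper - lower


# Fair die shifted by 3: values -2, -1, 0, 1, 2, 3, each with probability 1/6.
shifted_die = {k - 3: Fraction(1, 6) for k in range(1, 7)}
```

Because the support is finite, the two infinite sums in (2.36) reduce to the finite ranges used in `mean_from_tails`.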
2.3.3 The probability generating function

The probability generating function (pgf) of a continuous random variable $X$ is defined, for complex $z$, as the Laplace transform
\[
\varphi_X(z) = E\big[e^{-zX}\big] = \int_{-\infty}^{\infty} e^{-zt} f_X(t) \, dt
\tag{2.37}
\]
Again, in some cases, it may be more convenient to use $z = iu$ in which case the double sided Laplace transform reduces to a Fourier transform. The strength of these transforms is based on the numerous properties, especially the inverse transform,
\[
f_X(t) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \varphi_X(z) e^{zt} \, dz
\tag{2.38}
\]
where $c$ is the smallest real variable $\operatorname{Re}(z)$ for which the integral in (2.37) converges. Similarly as for discrete random variables, we have $e^{za} \varphi_X(z) = E\big[e^{-z(X-a)}\big]$ and
\[
E[(X - a)^n] = (-1)^n \left. \frac{d^n \big(e^{za} \varphi_X(z)\big)}{dz^n} \right|_{z=0}
\tag{2.39}
\]
The main difference with the discrete case lies in the definition $E\big[e^{-zX}\big]$ (continuous) versus $E\big[z^X\big]$ (discrete). Since the exponential is an entire
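$^{9}$ We remark that
\[
\begin{aligned}
E[X] &= \sum_{k=-\infty}^{\infty} k \Pr[X = k] = \sum_{k=-\infty}^{\infty} k \left(\Pr[X \ge k] - \Pr[X \ge k+1]\right) \\
&\ne \sum_{k=-\infty}^{\infty} k \Pr[X \ge k] - \sum_{k=-\infty}^{\infty} k \Pr[X \ge k+1] = \sum_{k=-\infty}^{\infty} \Pr[X \ge k]
\end{aligned}
\]
because the series in the second line are diverging. In fact, there exists a finite integer $k_\epsilon$ such that, for any real arbitrarily small $\epsilon > 0$ holds that $\Pr[X \ge k_\epsilon] = 1 - \epsilon$ and $\Pr[X \ge k_\epsilon] \le \Pr[X \ge k]$ for all $k < k_\epsilon$. Hence,
\[
E[X] = \sum_{k=-\infty}^{k_\epsilon} \Pr[X \ge k] + \sum_{k=k_\epsilon}^{\infty} \Pr[X \ge k] \ge (1 - \epsilon) \sum_{k=-\infty}^{k_\epsilon} 1 + c
\]

… for any $k$ and $j \ne k$. Then, with (2.45),
\[
\sum_k \Pr[A \cap B_k] = \sum_k \Pr[A|B_k] \Pr[B_k]
\]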
The event $A_k = \{A \cap B_k\}$ is a decomposition (or projection) of the event $A$ in the basis event $B_k$, analogous to the decomposition of a vector in terms of a set of orthogonal basis vectors that span the total state space. Indeed, using the associative property $A \cap \{B \cap C\} = A \cap B \cap C$ and $A \cap A = A$, the intersection $A_k \cap A_j = \{A \cap B_k\} \cap \{A \cap B_j\} = A \cap \{B_k \cap B_j\} = \emptyset$, which implies mutual exclusivity (or orthogonality). Using the distributive property $A \cap \{B_k \cup B_j\} = \{A \cap B_k\} \cup \{A \cap B_j\}$, we observe that
\[
A = A \cap \Omega = A \cap \{\cup_k B_k\} = \cup_k \{A \cap B_k\} = \cup_k A_k
\]
Finally, since all events $A_k$ are mutually exclusive, $\Pr[A] = \sum_k \Pr[A_k] = \sum_k \Pr[A \cap B_k]$. Thus, if $\Omega = \cup_k B_k$ and in addition, for any pair $j, k$ holds that $B_k \cap B_j = \emptyset$, we have proved the law of total probability or decomposability,
\[
\Pr[A] = \sum_k \Pr[A|B_k] \Pr[B_k]
\tag{2.46}
\]
Conditioning on events is a powerful tool that will be used frequently. If the conditional probability $\Pr[A|B_k]$ is known as a function $g(B_k)$, the law of total probability can also be written in terms of the expectation operator defined in (2.12) as
\[
\Pr[A] = E[g(B_k)]
\tag{2.47}
\]
Also the important memoryless property of the exponential distribution (see Section 3.2.2) is an example of the application of the conditional probability. Another classical example is Bayes' rule. Consider again the events $B_k$ defined above. Using the definition (2.44) followed by (2.45),
\[
\Pr[B_k|A] = \frac{\Pr[B_k \cap A]}{\Pr[A]} = \frac{\Pr[A \cap B_k]}{\Pr[A]} = \frac{\Pr[A|B_k] \Pr[B_k]}{\Pr[A]}
\tag{2.48}
\]
Using (2.46), we arrive at Bayes' rule
\[
\Pr[B_k|A] = \frac{\Pr[A|B_k] \Pr[B_k]}{\sum_j \Pr[A|B_j] \Pr[B_j]}
\tag{2.49}
\]
where $\Pr[B_k]$ are called the a-priori probabilities, while $\Pr[B_k|A]$ are the a-posteriori probabilities. The conditional distribution function of the random variable $Y$ given $X$ is defined by
\[
F_{Y|X}(y|x) = \Pr[Y \le y | X = x]
\tag{2.50}
\]
for any $x$ provided $\Pr[X = x] > 0$. This condition follows from the definition (2.44) of the conditional probability. The conditional probability density function of $Y$ given $X$ is defined by
\[
f_{Y|X}(y|x) = \Pr[Y = y | X = x] = \frac{\Pr[X = x, Y = y]}{\Pr[X = x]} = \frac{f_{XY}(x, y)}{f_X(x)}
\tag{2.51}
\]
for any $x$ such that $\Pr[X = x] > 0$ (and similarly for continuous random variables $f_X(x) > 0$) and where $f_{XY}(x, y)$ is the joint probability density function defined below in (2.59).
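Bayes' rule (2.49), with the law of total probability (2.46) in the denominator, can be sketched in a few lines of Python. The scenario and all numbers below are hypothetical and serve only as an illustration: a packet travels over one of three links $B_1, B_2, B_3$ with the given a-priori probabilities, and $A$ is the event that the packet is lost.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Pr[B_k | A] for every k, given Pr[B_k] and Pr[A | B_k]."""
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    total = sum(joint)  # Pr[A] by the law of total probability (2.46)
    return [term / total for term in joint]


# Hypothetical a-priori probabilities Pr[B_k] of using link k.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
# Hypothetical loss probabilities Pr[A | B_k] on link k.
likelihoods = [Fraction(1, 100), Fraction(1, 50), Fraction(1, 20)]

posteriors = posterior(priors, likelihoods)
```

Although link 3 is used least often, it carries the largest a-posteriori probability here because its loss probability is the highest.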
2.5 Several random variables and independence

2.5.1 Discrete random variables

Two events $A$ and $B$ are independent if
\[
\Pr[A \cap B] = \Pr[A] \Pr[B]
\tag{2.52}
\]
Similarly, we define two discrete random variables to be independent if
\[
\Pr[X = x, Y = y] = \Pr[X = x] \Pr[Y = y]
\tag{2.53}
\]
If $Z = f(X, Y)$, then $Z$ is a discrete random variable with
\[
\Pr[Z = z] = \sum_{f(x,y) = z} \Pr[X = x, Y = y]
\]
Applying the expectation operator (2.11) to both sides yields
\[
E[f(X, Y)] = \sum_{x, y} f(x, y) \Pr[X = x, Y = y]
\tag{2.54}
\]
If $X$ and $Y$ are independent and $f$ is separable, $f(x, y) = f_1(x) f_2(y)$, then the expectation (2.54) reduces to
\[
E[f(X, Y)] = \sum_x f_1(x) \Pr[X = x] \sum_y f_2(y) \Pr[Y = y] = E[f_1(X)] E[f_2(Y)]
\tag{2.55}
\]
The simplest example of the general function is $Z = X + Y$. In that case, the sum is over all $x$ and $y$ that satisfy $x + y = z$. Thus,
\[
\Pr[X + Y = z] = \sum_x \Pr[X = x, Y = z - x] = \sum_y \Pr[X = z - y, Y = y]
\]
If $X$ and $Y$ are independent, we obtain the convolution,
\[
\Pr[X + Y = z] = \sum_x \Pr[X = x] \Pr[Y = z - x] = \sum_y \Pr[X = z - y] \Pr[Y = y]
\]
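The discrete convolution can be illustrated with the sum of two independent fair dice. The Python sketch below is a minimal example under that assumption; `convolve` is a name introduced here for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(pmf_x, pmf_y):
    """pmf of Z = X + Y for independent X and Y given as {value: probability}."""
    pmf_z = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] += px * py
    return dict(pmf_z)


die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)
```

The double loop visits every pair $(x, y)$ once, which is the sum over all $x$ and $y$ with $x + y = z$ grouped by the value of $z$.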
2.5.2 The covariance

The covariance of $X$ and $Y$ is defined as
\[
\operatorname{Cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y
\tag{2.56}
\]
If $\operatorname{Cov}[X, Y] = 0$, then the variables $X$ and $Y$ are uncorrelated. If $X$ and $Y$ are independent, then $\operatorname{Cov}[X, Y] = 0$. Hence, independence implies uncorrelation, but the converse is not necessarily true. The classical example$^{13}$ is $Y = X^2$ where $X$ has a normal distribution $N(0, 1)$ (Section 3.2.3) because $\mu_X = 0$ and $E[XY] = E[X^3] = 0$ as follows from (3.23). Although $X$ and $Y$ are perfectly dependent, they are uncorrelated. Thus, independence is a stronger property than uncorrelation. The covariance $\operatorname{Cov}[X, Y]$ measures the degree of dependence between two (or generally more) random variables. If $X$ and $Y$ are positively (negatively) correlated, the large values of $X$ tend to be associated with large (small) values of $Y$.

$^{13}$ Another example: let $U$ be uniform on $[0, 1]$ and $X = \cos(2\pi U)$ and $Y = \sin(2\pi U)$. Using (2.34),
\[
E[XY] = \int_{0}^{1} \cos(2\pi u) \sin(2\pi u) \, du = 0
\]
as well as $E[X] = E[Y] = 0$. Thus, $\operatorname{Cov}[X, Y] = 0$, but $X$ and $Y$ are perfectly dependent because $X = \cos(\arcsin Y) = \pm\sqrt{1 - Y^2}$.

As an application of the covariance, consider the problem of computing the variance of a sum $S_n$ of random variables $X_1, X_2, \ldots, X_n$. Let $\mu_k = E[X_k]$, then $E[S_n] = \sum_{k=1}^{n} \mu_k$ and
\[
\begin{aligned}
\operatorname{Var}[S_n] &= E\big[(S_n - E[S_n])^2\big] = E\Big[\Big(\sum_{k=1}^{n} (X_k - \mu_k)\Big)^2\Big] \\
&= E\Big[\sum_{k=1}^{n} \sum_{j=1}^{n} (X_k - \mu_k)(X_j - \mu_j)\Big] \\
&= E\Big[\sum_{k=1}^{n} (X_k - \mu_k)^2 + 2 \sum_{k=1}^{n} \sum_{j=k+1}^{n} (X_k - \mu_k)(X_j - \mu_j)\Big]
\end{aligned}
\]
Using the linearity of the expectation operator and the definition of the covariance (2.56) yields
\[
\operatorname{Var}[S_n] = \sum_{k=1}^{n} \operatorname{Var}[X_k] + 2 \sum_{k=1}^{n} \sum_{j=k+1}^{n} \operatorname{Cov}[X_k, X_j]
\tag{2.57}
\]
Observe that for a set of independent random variables $\{X_k\}$ the double sum with covariances vanishes. The Cauchy-Schwarz inequality (5.17) derived in Chapter 5 indicates that
\[
\big(E[(X - \mu_X)(Y - \mu_Y)]\big)^2 \le E\big[(X - \mu_X)^2\big] E\big[(Y - \mu_Y)^2\big]
\]
such that the covariance is always bounded by
\[
|\operatorname{Cov}[X, Y]| \le \sigma_X \sigma_Y
\]

2.5.3 The linear correlation coefficient

Since the covariance is not dimensionless, the linear correlation coefficient defined as
\[
\rho(X, Y) = \frac{\operatorname{Cov}[X, Y]}{\sigma_X \sigma_Y}
\tag{2.58}
\]
is often convenient to relate two (or more) different physical quantities expressed in different units. The linear correlation coefficient remains invariant (possibly apart from the sign) under a linear transformation because
\[
\rho(aX + b, cY + d) = \operatorname{sign}(ac) \, \rho(X, Y)
\]
This transform shows that the linear correlation coefficient $\rho(X, Y)$ is independent of the value of the mean $\mu_X$ and the variance $\sigma_X^2$ provided $\sigma_X^2 > 0$. Therefore, many computations simplify if we normalize the random variable properly. Let us introduce the concept of a normalized random variable $X^* = \frac{X - \mu_X}{\sigma_X}$. The normalized random variable has a zero mean and a variance equal to one. By the invariance under a linear transform, the correlation coefficient $\rho(X, Y) = \rho(X^*, Y^*)$ and also $\rho(X, Y) = \operatorname{Cov}[X^*, Y^*]$. The variance of $X^* \pm Y^*$ follows from (2.57) as
\[
\operatorname{Var}[X^* \pm Y^*] = \operatorname{Var}[X^*] + \operatorname{Var}[Y^*] \pm 2 \operatorname{Cov}[X^*, Y^*] = 2(1 \pm \rho(X, Y))
\]
Since the variance is always non-negative, it follows that $-1 \le \rho(X, Y) \le 1$. The extremes $\rho(X, Y) = \pm 1$ imply a linear relation between $X$ and $Y$. Indeed, $\rho(X, Y) = 1$ implies that $\operatorname{Var}[X^* - Y^*] = 0$, which is only possible if $X^* = Y^* + c$, where $c$ is a constant. Hence, $X = \frac{\sigma_X}{\sigma_Y} Y + c'$. A similar argument applies for the case $\rho(X, Y) = -1$. For example, in curve fitting, the goodness of the fit is often expressed in terms of the correlation coefficient. A perfect fit has correlation coefficient equal to 1. In particular, in linear regression where $Y = aX + b$, the regression coefficients $a_R$ and $b_R$ are the minimizers of the square distance $E\big[(Y - (aX + b))^2\big]$ and given by
\[
a_R = \frac{\operatorname{Cov}[X, Y]}{\sigma_X^2}
\]
\[
b_R = E[Y] - a_R E[X]
\]
Since a correlation coefficient $\rho(X, Y) = 1$ implies $\operatorname{Cov}[X, Y] = \sigma_X \sigma_Y$, we see that $a_R = \frac{\sigma_Y}{\sigma_X}$ as derived above with normalized random variables.

Although the linear correlation coefficient is a natural measure of the dependence between random variables, it has some disadvantages. First, the variances of $X$ and $Y$ must exist, which may cause problems with heavy-tailed distributions. Second, as illustrated above, dependence can lead to uncorrelation, which is awkward. Third, linear correlation is not invariant under non-linear strictly increasing transformations $T$ such that $\rho(T(X), T(Y)) \ne \rho(X, Y)$. Common intuition expects that dependence measures should be invariant under these transforms $T$. This leads to the definition of rank correlation which satisfies that invariance property. Here, we merely mention Spearman's rank correlation coefficient, which is defined as
\[
\rho_S(X, Y) = \rho(F_X(X), F_Y(Y))
\]
where $\rho$ is the linear correlation coefficient and where the non-linear strictly increasing transform is the probability distribution. More details are found in Embrechts et al. (2001b) and in Chapter 4.
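The second disadvantage, dependence without correlation, can be made concrete with a small discrete analogue of the classical example $Y = X^2$. This is an assumption of the sketch below rather than the Gaussian case of the text: $X$ is taken uniform on $\{-1, 0, 1\}$, and the helper names are illustrative.

```python
from fractions import Fraction


def expectation(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}, see (2.54)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def covariance(joint):
    """Cov[X, Y] = E[(X - mu_X)(Y - mu_Y)], see (2.56)."""
    mean_x = expectation(joint, lambda x, y: x)
    mean_y = expectation(joint, lambda x, y: y)
    return expectation(joint, lambda x, y: (x - mean_x) * (y - mean_y))


def is_independent(joint):
    """Check Pr[X = x, Y = y] = Pr[X = x] Pr[Y = y] for all pairs, see (2.53)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in xs}
    py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in xs for y in ys)


# X uniform on {-1, 0, 1} and Y = X^2, so Y is a function of X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}
```

Here the covariance vanishes by symmetry, while the independence test fails because, for instance, $\Pr[X = 0, Y = 0] = 1/3$ differs from $\Pr[X = 0] \Pr[Y = 0] = 1/9$.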
2.5.4 Continuous random variables

We define the joint distribution function by $F_{XY}(x, y) = \Pr[X \le x, Y \le y]$ and the joint probability density function by
\[
f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x \, \partial y}
\tag{2.59}
\]
Hence,
\[
F_{XY}(x, y) = \Pr[X \le x, Y \le y] = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{XY}(u, v) \, dv \, du
\tag{2.60}
\]
The analogon of (2.54) is
\[
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y) \, dx \, dy
\tag{2.61}
\]
Most of the difficulties occur in the evaluation of the multiple integrals. The change of variables in multiple dimensions involves the Jacobian. Consider the transformed random variables $U = g_1(X, Y)$ and $V = g_2(X, Y)$ and denote the inverse transform by $x = h_1(u, v)$ and $y = h_2(u, v)$, then
\[
f_{UV}(u, v) = f_{XY}(h_1(u, v), h_2(u, v)) \, |J(u, v)|
\]
where the Jacobian $J(u, v)$ is
\[
J(u, v) = \det \begin{bmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[2mm] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{bmatrix}
\]
If $X$ and $Y$ are independent and $Z = X + Y$, we obtain the convolution,
\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x) \, dx = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y) \, dy
\tag{2.62}
\]
which is often denoted by $f_Z(z) = (f_X * f_Y)(z)$. If both $f_X(x) = 0$ and $f_Y(x) = 0$ for $x < 0$, then the definition (2.62) of the convolution reduces to
\[
(f_X * f_Y)(z) = \int_{0}^{z} f_X(x) f_Y(z - x) \, dx
\]
2.5.5 The sum of independent random variables

Let $S_N = \sum_{k=1}^{N} X_k$, where the random variables $X_k$ are all independent. We first concentrate on the case where $N = n$ is a (fixed) integer. Since $S_N = S_{N-1} + X_N$, direct application of (2.62) yields the recursion
\[
f_{S_N}(z) = \int_{-\infty}^{\infty} f_{S_{N-1}}(z - y) f_{X_N}(y) \, dy
\tag{2.63}
\]
which, when written out explicitly, leads to the $(N-1)$-fold integral
\[
f_{S_N}(z) = \int_{-\infty}^{\infty} f_{X_N}(y_N) \, dy_N \cdots \int_{-\infty}^{\infty} f_{X_2}(y_2) f_{X_1}(z - y_N - \cdots - y_2) \, dy_2
\tag{2.64}
\]
In many cases, convolutions are more efficiently computed via generating functions. The generating function of $S_n$ equals
\[
\varphi_{S_n}(z) = E\big[z^{S_n}\big] = E\Big[z^{\sum_{k=1}^{n} X_k}\Big] = E\Big[\prod_{k=1}^{n} z^{X_k}\Big]
\]
Since all $X_k$ are independent, (2.55) can be applied,
\[
\varphi_{S_n}(z) = \prod_{k=1}^{n} E\big[z^{X_k}\big]
\]
or, in terms of generating functions,
\[
\varphi_{S_n}(z) = \prod_{k=1}^{n} \varphi_{X_k}(z)
\tag{2.65}
\]
Hence, we arrive at the important result that the generating function of a sum of independent random variables equals the product of the generating functions of the individual random variables. We also note that the condition of independence is crucial in that it allows the product and expectation operator to be reversed, leading to the useful result (2.65). Often, the random variables $X_k$ all possess the same distribution. In this case of independent identically distributed (i.i.d.) random variables with generating function $\varphi_X(z)$, the relation (2.65) further simplifies to
\[
\varphi_{S_n}(z) = (\varphi_X(z))^n
\tag{2.66}
\]
In the case where the number of terms $N$ in the sum $S_N$ is a random variable with generating function $\varphi_N(z)$, independent of the $X_k$, we use the general definition of expectation (2.54) for two random variables,
\[
\begin{aligned}
\varphi_{S_N}(z) = E\big[z^{S_N}\big] &= \sum_{k=0}^{\infty} \sum_x z^x \Pr[S_N = x, N = k] \\
&= \sum_{k=0}^{\infty} \sum_x z^x \Pr[S_N = x | N = k] \Pr[N = k]
\end{aligned}
\]
where the conditional probability (2.45) is used. Since the value of $S_N$ depends on the number of terms $N$ in the sum, we have $\Pr[S_N = x | N = k] = \Pr[S_k = x]$. Further, with
\[
\sum_x z^x \Pr[S_N = x | N = k] = \varphi_{S_k}(z)
\]
we have
\[
\varphi_{S_N}(z) = \sum_{k=0}^{\infty} \varphi_{S_k}(z) \Pr[N = k]
\tag{2.67}
\]
The average $E[S_N]$ follows from (2.26) as
\[
E[S_N] = \sum_{k=0}^{\infty} \varphi'_{S_k}(1) \Pr[N = k] = \sum_{k=0}^{\infty} E[S_k] \Pr[N = k]
\tag{2.68}
\]
Since $E[S_k] = E\big[\sum_{j=1}^{k} X_j\big] = \sum_{j=1}^{k} E[X_j]$ and assuming that all random variables $X_j$ have equal mean $E[X_j] = E[X]$, we have
\[
E[S_N] = \sum_{k=0}^{\infty} k E[X] \Pr[N = k]
\]
or
\[
E[S_N] = E[X] E[N]
\tag{2.69}
\]
This relation (2.69) is commonly called Wald's identity. Wald's identity holds for any random sum of (possibly dependent) random variables $X_j$ provided the number $N$ of those random variables is independent of the $X_j$. In the case of i.i.d. random variables, we apply (2.66) in (2.67) so that
\[
\varphi_{S_N}(z) = \sum_{k=0}^{\infty} (\varphi_X(z))^k \Pr[N = k] = \varphi_N(\varphi_X(z))
\tag{2.70}
\]
This expression is a generalization of (2.66).
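The composition (2.70) and Wald's identity (2.69) can be tried on polynomial pgfs. The sketch below is a hedged Python illustration: the choice of a fair die for $X$, of $N$ uniform on $\{0, 1, 2, 3\}$, and all helper names are assumptions of this example.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_add(a, b):
    """Sum of two polynomials given as coefficient lists."""
    size = max(len(a), len(b))
    a = a + [Fraction(0)] * (size - len(a))
    b = b + [Fraction(0)] * (size - len(b))
    return [x + y for x, y in zip(a, b)]


def compose(outer, inner):
    """Coefficients of outer(inner(z)) = sum_k outer[k] * inner(z)**k."""
    result = [Fraction(0)]
    power = [Fraction(1)]  # inner(z)**0
    for coeff in outer:
        result = poly_add(result, [coeff * c for c in power])
        power = poly_mul(power, inner)
    return result


def mean(pgf):
    """E[X] = derivative of the pgf at z = 1, see (2.26)."""
    return sum(k * c for k, c in enumerate(pgf))


pgf_x = [Fraction(0)] + [Fraction(1, 6)] * 6   # fair die
pgf_n = [Fraction(1, 4)] * 4                   # N uniform on {0, 1, 2, 3}
pgf_s = compose(pgf_n, pgf_x)                  # pgf of S_N by (2.70)
```

Coefficient $k$ of `pgf_s` is $\Pr[S_N = k]$, so the full distribution of the random sum is obtained without any explicit convolution.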
2.6 Conditional expectation

The generating function (2.67) of a random sum of independent random variables can be derived using the conditional expectation $E[Y|X = x]$ of two random variables $X$ and $Y$. We will first define the conditional expectation and derive an interesting property. Suppose that we know that $X = x$, the conditional density function $f_{Y|X}(y|x)$ defined by (2.51) of the random variable $Y_c = Y|X$ can be regarded as only function of $y$. Using the definition of the expectation (2.33) for continuous random variables (the discrete case is analogous), we have
\[
E[Y|X = x] = \int_{-\infty}^{\infty} y f_{Y|X}(y|x) \, dy
\tag{2.71}
\]
Since this expression holds for any value of $x$ that the random variable $X$ can take, we see that $E[Y|X = x] = g(x)$ is a function of $x$ and, in addition since $X = x$, $E[Y|X = x] = g(X)$ can be regarded as a random variable that is a function of the random variable $X$. Having identified the conditional expectation $E[Y|X = x]$ as a random variable, let us compute its expectation or the expectation of the slightly more general random variable $h(X) g(X)$ with $g(X) = E[Y|X = x]$. From the general definition (2.34) of the expectation, it follows that
\[
E[h(X) g(X)] = \int_{-\infty}^{\infty} h(x) g(x) f_X(x) \, dx = \int_{-\infty}^{\infty} h(x) E[Y|X = x] f_X(x) \, dx
\]
Substituting (2.71) yields
\[
\begin{aligned}
E[h(X) g(X)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) \, y f_{Y|X}(y|x) f_X(x) \, dy \, dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) \, y f_{XY}(x, y) \, dy \, dx = E[h(X) Y]
\end{aligned}
\]
where we have used (2.51) and (2.61). Thus, we find the interesting relation
\[
E[h(X) E[Y|X = x]] = E[h(X) Y]
\tag{2.72}
\]
As a special case where $h(x) = 1$, the expectation of the conditional expectation follows as
\[
E[Y] = E_X[E_Y[Y|X = x]]
\]
where the index in $E_Z$ clarifies that the expectation is over the random variable $Z$. Applying this relation to $Y = z^{S_N}$ where $S_N = \sum_{k=1}^{N} X_k$ and all $X_k$ are independent yields
\[
\varphi_{S_N}(z) = E\big[z^{S_N}\big] = E_N\big[E_S\big[z^{S_N} | N = n\big]\big]
\]
Since $E_S\big[z^{S_N} | N = n\big] = \varphi_{S_n}(z)$ as specified in (2.65), we end up with
\[
\varphi_{S_N}(z) = E_N[\varphi_{S_N}(z)] = \sum_{k=0}^{\infty} \varphi_{S_k}(z) \Pr[N = k]
\]
which is (2.67).
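Relation (2.72) also holds in the discrete case, where the integrals become sums. The Python sketch below checks it on a small joint distribution; the probabilities and helper names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf Pr[X = x, Y = y], stored as {(x, y): probability}.
joint = {
    (0, 1): Fraction(1, 8),
    (0, 2): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (1, 3): Fraction(1, 4),
}


def marginal_x(joint, x):
    """Pr[X = x]."""
    return sum(p for (a, _), p in joint.items() if a == x)


def conditional_expectation(joint, x):
    """E[Y | X = x], the discrete analogue of (2.71)."""
    return sum(y * p for (a, y), p in joint.items() if a == x) / marginal_x(joint, x)


def lhs(joint, h):
    """E[h(X) E[Y | X]], averaging the conditional expectation over X."""
    xs = {x for x, _ in joint}
    return sum(
        h(x) * conditional_expectation(joint, x) * marginal_x(joint, x) for x in xs
    )


def rhs(joint, h):
    """E[h(X) Y], computed directly from the joint pmf."""
    return sum(h(x) * y * p for (x, y), p in joint.items())
```

With `h = lambda x: 1` both sides reduce to $E[Y]$, the special case stated above.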
3 Basic distributions
This chapter concentrates on the most basic probability distributions and their properties. From these basic distributions, other useful distributions are derived.
3.1 Discrete random variables

3.1.1 The Bernoulli distribution

A Bernoulli random variable $X$ can only take two values: either 1 with probability $p$ or 0 with probability $q = 1 - p$. The standard example of a Bernoulli random variable is the outcome of tossing a biased coin, and, more generally, the outcome of a trial with only two possibilities, either success or failure. The sample space is $\Omega = \{0, 1\}$ and $\Pr[X = 1] = p$, while $\Pr[X = 0] = q$. From this definition, the pgf follows from (2.17) as
\[
\varphi_X(z) = E\big[z^X\big] = z^0 \Pr[X = 0] + z^1 \Pr[X = 1]
\]
or
\[
\varphi_X(z) = q + pz
\tag{3.1}
\]
From (2.23) or (2.14), the $n$-th moment is $E[X^n] = p$ which shows that $\mu = E[X] = p$. From (2.24), we find $E[(X - a)^n] = p(1 - a)^n + q(-a)^n$ such that the moments centered around the mean are
\[
E[(X - \mu)^n] = p q^n + (-1)^n q p^n = pq \big(q^{n-1} + (-1)^n p^{n-1}\big)
\]
Explicitly, with $p + q = 1$, $\operatorname{Var}[X] = pq$ and $E\big[(X - \mu)^3\big] = pq(q - p)$.
3.1.2 The binomial distribution

A binomial random variable $X$ is the sum of $n$ independent Bernoulli random variables. The sample space is $\Omega = \{0, 1, \ldots, n\}$. For example, $X$ may represent the number of successes in $n$ independent Bernoulli trials such as the number of heads after $n$-times tossing a (biased) coin. Application of (2.66) with (3.1) gives
\[
\varphi_X(z) = (q + pz)^n
\tag{3.2}
\]
Expanding the binomial pgf in powers of $z$, which justifies the name "binomial",
\[
\varphi_X(z) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} z^k
\]
and comparing to (2.18) yields
\[
\Pr[X = k] = \binom{n}{k} p^k q^{n-k}
\tag{3.3}
\]
The alternative, probabilistic approach starts with (3.3). Indeed, the probability that $X$ has $k$ successes out of $n$ trials consists of precisely $k$ successes (an event with probability $p^k$) and $n - k$ failures (with probability equal to $q^{n-k}$). The total number of ways in which $k$ successes out of $n$ trials can be obtained is precisely $\binom{n}{k}$.

The mean follows from (2.23) or from the definition $X = \sum_{j=1}^{n} X_{\text{Bernoulli}}$ and the linearity of the expectation as $E[X] = np$. Higher order moments around the mean can be derived from (2.24) as
\[
\begin{aligned}
E[(X - \mu)^m] &= \left. \frac{d^m \big(e^{-tnp} (q + p e^t)^n\big)}{dt^m} \right|_{t=0} = \left. \frac{d^m}{dt^m} \sum_{k=0}^{n} \binom{n}{k} q^k p^{n-k} e^{t(qn - k)} \right|_{t=0} \\
&= \sum_{k=0}^{n} \binom{n}{k} q^k p^{n-k} (qn - k)^m
\end{aligned}
\]
In general, this form seems difficult to express more elegantly. It illustrates that, even for simple random variables, computations may rapidly become unattractive. For $m = 2$, the above differentiation leads to $\operatorname{Var}[X] = npq$. But, this result is more economically obtained from (2.27), since $L_X(z) = n \log(q + pz)$, $L'_X(z) = \frac{np}{q + pz}$ and $L''_X(z) = -\frac{np^2}{(q + pz)^2}$. Thus,
\[
\operatorname{Var}[X] = -np^2 + np = npq
\tag{3.4}
\]
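The expansion of the binomial pgf can be reproduced mechanically. The Python sketch below is an illustration with assumed helper names: it multiplies the Bernoulli pgf $q + pz$ with itself $n$ times, as (2.66) prescribes, and the resulting coefficients can be compared with (3.3).

```python
from fractions import Fraction
from math import comb


def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def binomial_pmf_from_pgf(n, p):
    """Coefficients of (q + p z)**n, i.e. Pr[X = k] for k = 0, ..., n."""
    q = 1 - p
    coeffs = [Fraction(1)]
    for _ in range(n):
        coeffs = poly_mul(coeffs, [q, p])  # one more Bernoulli trial
    return coeffs


def binomial_pmf_closed_form(n, p):
    """Pr[X = k] from the closed form (3.3)."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
```

For example, with `n = 5` and `p = Fraction(1, 3)` both functions return the same list of six probabilities, whose mean is $np = 5/3$ and whose variance is $npq = 10/9$.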
3.1.3 The geometric distribution

The geometric random variable $X$ returns the number of independent Bernoulli trials needed to achieve the first success. Here the sample space is the infinite set of integers. The probability density function is
$$\Pr[X = k] = p\,q^{k-1} \tag{3.5}$$
because a first success (with probability $p$) obtained in the $k$-th trial is preceded by $k-1$ failures (each having probability $q = 1-p$). Clearly, $\Pr[X = 0] = 0$. The series expansion of the probability generating function,
$$\varphi_X(z) = pz\sum_{k=0}^{\infty} q^k z^k = \frac{pz}{1-qz} \tag{3.6}$$
justifies the name "geometric". The mean $E[X] = \varphi'_X(1)$ equals $E[X] = \frac{1}{p}$. The higher-order moments can be deduced from (2.24) as
$$E[(X-\mu)^n] = p\left.\frac{d^n}{dt^n}\left(\frac{e^{-tq/p}}{1-qe^t}\right)\right|_{t=0} = p\sum_{k=0}^{\infty} q^k\left(k - q/p\right)^n$$
As for the binomial random variable, the variance most easily follows from (2.27) with $L_X(z) = \log p + \log z - \log(1-qz)$, $L'_X(z) = \frac{1}{z} + \frac{q}{1-qz}$ and $L''_X(z) = -\frac{1}{z^2} + \frac{q^2}{(1-qz)^2}$. Thus,
$$\operatorname{Var}[X] = \frac{q}{p} + \frac{q^2}{p^2} = \frac{q}{p^2} \tag{3.7}$$
The distribution function $F_X(k) = \Pr[X \le k] = \sum_{j=1}^{k}\Pr[X = j]$ is obtained as
$$\Pr[X \le k] = p\sum_{j=0}^{k-1} q^j = p\,\frac{1-q^k}{1-q} = 1 - q^k$$
The tail probability is
$$\Pr[X > k] = q^k \tag{3.8}$$
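As a quick sanity check of (3.5) and (3.8), the sketch below sums the geometric probabilities exactly and compares the tail with $q^k$. It also verifies the ratio $\Pr[X > k+m]/\Pr[X > k] = q^m$, which follows directly from (3.8). The value `p = 1/4` and the function names are illustrative assumptions.

```python
from fractions import Fraction


def geometric_pmf(k, p):
    """Pr[X = k] = p q^(k-1) for k >= 1 and zero otherwise, as in (3.5)."""
    q = 1 - p
    return p * q ** (k - 1) if k >= 1 else Fraction(0)


def geometric_tail(k, p):
    """Pr[X > k], obtained by summing the pmf rather than using (3.8)."""
    return 1 - sum(geometric_pmf(j, p) for j in range(1, k + 1))


p = Fraction(1, 4)  # illustrative success probability (assumption)
q = 1 - p

# Tail probability (3.8): Pr[X > k] = q^k.
tail_matches = all(geometric_tail(k, p) == q ** k for k in range(20))

# Ratio of tails: Pr[X > k + m] / Pr[X > k] = q^m = Pr[X > m].
k, m = 5, 3
ratio = geometric_tail(k + m, p) / geometric_tail(k, p)

print("tail formula holds:", tail_matches)
print("ratio =", ratio, " Pr[X > m] =", geometric_tail(m, p))
```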
Hence, the probability that the number of trials until the first success is larger than $k$ decreases geometrically in $k$ with rate $q$. Let us now consider an important application of the conditional probability. The probability that, given the success is not found in the first $k$ trials, success does not occur within the next $m$ trials, is with (2.44)
$$\Pr[X > k+m \mid X > k] = \frac{\Pr[\{X > k+m\} \cap \{X > k\}]}{\Pr[X > k]} = \frac{\Pr[X > k+m]}{\Pr[X > k]}$$
and with (3.8)
$$\Pr[X > k+m \mid X > k] = q^m = \Pr[X > m]$$
This conditional probability turns out to be independent of the hypothesis, the event $\{X > k\}$, and reflects the famous memoryless property. Only because $\Pr[X > k]$ obeys the functional equation $f(x+y) = f(x)f(y)$, the hypothesis or initial knowledge does not matter. It is precisely as if past failures have never occurred or are forgotten and as if, after a failure, the number of trials is reset to 0. Furthermore, the only solution to the functional equation is an exponential function. Thus, the geometric distribution is the only discrete distribution that possesses the memoryless property.

3.1.4 The Poisson distribution

Often we are interested in counting the number of occurrences of an event in a certain time interval, such as, for example, the number of IP packets during a time slot or the number of telephony calls that arrive at a telephone exchange per unit time. The Poisson random variable $X$ with probability density function
$$\Pr[X = k] = \frac{\lambda^k e^{-\lambda}}{k!} \tag{3.9}$$
turns out to model many of these counting phenomena well, as shown in Chapter 7. The corresponding generating function is
$$\varphi_X(z) = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k}{k!} z^k = e^{\lambda(z-1)} \tag{3.10}$$
and the average number of occurrences in that time interval is
$$E[X] = \lambda \tag{3.11}$$
This average determines the complete distribution. In applications it is convenient to replace the unit interval by an interval of arbitrary length $t$ such that
$$\Pr[X = k] = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$$
equals the probability that precisely $k$ events occur in the interval with duration $t$. The probability that no events occur during $t$ time units is $\Pr[X = 0] = e^{-\lambda t}$ and the probability that at least one event (i.e. one or more) occurs is $\Pr[X > 0] = 1 - e^{-\lambda t}$. The latter is equal to the exponential distribution. We will also see later in Theorem 7.3.2 that the Poisson counting process and the exponential distribution are intimately connected. The sum of $n$ independent Poisson random variables each with mean $\lambda_k$ is again a Poisson random variable with mean $\sum_{k=1}^{n}\lambda_k$, as follows from (2.65) and (3.10). The higher-order moments can be deduced from (2.24) as
$$E[(X-\lambda)^n] = e^{-\lambda}\left.\frac{d^n}{dt^n}\left(e^{-\lambda(t - e^t)}\right)\right|_{t=0}$$
from which
$$E[X] = \operatorname{Var}[X] = E\left[(X-\lambda)^3\right] = \lambda$$
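The Poisson tail distribution equals
$$\Pr[X > m] = 1 - \sum_{k=0}^{m}\frac{\lambda^k e^{-\lambda}}{k!}$$
which precisely equals the sum of $m$ exponentially distributed variables as demonstrated below in Section 3.3.1.

The Poisson density approximates the binomial density (3.3) if $n \to \infty$ but the mean $np = \lambda$. This phenomenon is often referred to as the law of rare events: in an arbitrarily large number $n$ of independent trials each with arbitrarily small success probability $p = \frac{\lambda}{n}$, the total number of successes will approximately be Poisson distributed. The classical argument is to consider the binomial density (3.3) with $p = \frac{\lambda}{n}$,
$$\Pr[X = k] = \frac{n!}{k!(n-k)!}\frac{\lambda^k}{n^k}\left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!}\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k}\prod_{j=1}^{k-1}\left(1 - \frac{j}{n}\right)$$
or
$$\log\left(\Pr[X = k]\right) = \log\frac{\lambda^k}{k!} + \sum_{j=1}^{k-1}\log\left(1 - \frac{j}{n}\right) - k\log\left(1 - \frac{\lambda}{n}\right) + n\log\left(1 - \frac{\lambda}{n}\right)$$
For large $n$, we use the Taylor expansion $\log\left(1 - \frac{x}{n}\right) = -\frac{x}{n} - \frac{x^2}{2n^2} + O\left(n^{-3}\right)$ to obtain up to order $O\left(n^{-2}\right)$
$$\log\left(\Pr[X = k]\right) = \log\frac{\lambda^k}{k!} - \frac{k(k-1)}{2n} + \frac{k\lambda}{n} - \lambda - \frac{\lambda^2}{2n} + O\left(n^{-2}\right) = \log\frac{\lambda^k}{k!} - \lambda - \frac{1}{2n}\left((k-\lambda)^2 - k\right) + O\left(n^{-2}\right)$$
With $e^x = 1 + x + O(x^2)$, we finally obtain the approximation for large $n$,
$$\Pr[X = k] = \frac{\lambda^k e^{-\lambda}}{k!}\left(1 - \frac{1}{2n}\left((k-\lambda)^2 - k\right) + O\left(n^{-2}\right)\right)$$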
The factor $(k-\lambda)^2 - k$ that multiplies $-\frac{1}{2n}$ is negative if $k \in \left[\lambda + \frac{1}{2} - \sqrt{\lambda + \frac{1}{4}},\ \lambda + \frac{1}{2} + \sqrt{\lambda + \frac{1}{4}}\right]$. In that $k$-interval, the Poisson density is a lower bound for the binomial density for large $n$ and $np = \lambda$. The reverse holds for values of $k$ outside that interval. Since for the Poisson density $\frac{\Pr[X=k]}{\Pr[X=k-1]} = \frac{\lambda}{k}$, we see that $\Pr[X = k]$ increases as long as $\lambda > k$ and decreases when $\lambda < k$. Thus, the maximum of the Poisson density lies around $k = \lambda = E[X]$. In conclusion, we can say that the Poisson density approximates the binomial density for large $n$ and $np = \lambda$ from below in the region of about the standard deviation $\sqrt{\lambda}$ around the mean $E[X] = \lambda$ and from above outside this region (in the tails of the distribution).
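The law of rare events can be illustrated numerically. The sketch below compares the binomial and Poisson densities for $\lambda = 5$ and $n = 1000$ (both values are illustrative assumptions), together with the first-order correction factor $1 - \frac{1}{2n}\left((k-\lambda)^2 - k\right)$ and the interval $\lambda + \frac{1}{2} \pm \sqrt{\lambda + \frac{1}{4}}$ in which the Poisson density lies below the binomial one.

```python
from math import comb, exp, factorial, sqrt


def binomial_pmf(k, n, p):
    """Binomial density (3.3)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_pmf(k, lam):
    """Poisson density (3.9)."""
    return lam**k * exp(-lam) / factorial(k)


lam, n = 5.0, 1000          # illustrative values (assumptions)
p = lam / n

# Interval in which (k - lam)^2 - k is negative.
lo = lam + 0.5 - sqrt(lam + 0.25)
hi = lam + 0.5 + sqrt(lam + 0.25)

rows = []
for k in range(11):
    ratio = binomial_pmf(k, n, p) / poisson_pmf(k, lam)
    first_order = 1 - ((k - lam)**2 - k) / (2 * n)
    rows.append((k, ratio, first_order, lo <= k <= hi))

for k, ratio, first_order, inside in rows:
    print(f"k={k:2d}  binomial/Poisson={ratio:.6f}  "
          f"first-order={first_order:.6f}  inside interval={inside}")
```

Inside the interval the ratio exceeds 1 (Poisson is a lower bound), outside it the ratio is below 1, in line with the conclusion above.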
A much shorter derivation anticipates results of Chapter 6 and starts from the probability generating function (3.2) of the binomial distribution after substitution of $p = \frac{\lambda}{n}$,
$$\lim_{n\to\infty}\varphi_X(z) = \lim_{n\to\infty}\left(1 + \frac{\lambda(z-1)}{n}\right)^n = e^{\lambda(z-1)}$$
$$d_{TV}(X, Y) = \sum_{k}\left|\Pr[X = k] - \Pr[Y = k]\right|$$
and satisfies
$$d_{TV}(X, Y) = 2\sup_{A \subset \mathbb{Z}}\left|\Pr[X \in A] - \Pr[Y \in A]\right|$$
3.2 Continuous random variables

3.2.1 The uniform distribution

A uniform random variable $X$ has equal probability to attain any value in the interval $[a, b]$ such that the probability density function is a constant. Since $\Pr[a \le X \le b] = \int_a^b f_X(x)\,dx = 1$, the constant value equals
$$f_X(x) = \frac{1}{b-a}\,1_{x\in[a,b]} \tag{3.12}$$
where $1_y$ is the indicator function defined in Section 2.2.1. The distribution function then follows as
$$\Pr[a \le X \le x] = \frac{x-a}{b-a}\,1_{x\in[a,b]} + 1_{x>b}$$
The Laplace transform (2.37) is²
$$\varphi_X(z) = \int_{-\infty}^{\infty} e^{-zt} f_X(t)\,dt = \frac{e^{-za} - e^{-zb}}{z(b-a)} \tag{3.13}$$
while the mean $\mu = E[X]$ most easily follows from
$$E[X] = \int_{-\infty}^{\infty}\frac{x\,dx}{b-a}\,1_{x\in[a,b]} = \frac{a+b}{2}$$
The centered moments are obtained from (2.39) as
$$E[(X-\mu)^n] = \frac{(-1)^n}{b-a}\left.\frac{d^n}{dz^n}\left(\frac{e^{\frac{z}{2}(b-a)} - e^{-\frac{z}{2}(b-a)}}{z}\right)\right|_{z=0} = \frac{2(-1)^n}{b-a}\left.\frac{d^n}{dz^n}\,\frac{\sinh\left(\frac{b-a}{2}z\right)}{z}\right|_{z=0}$$
Using the power series
$$\frac{\sinh\left(\frac{b-a}{2}z\right)}{z} = \sum_{k=0}^{\infty}\frac{\left(\frac{b-a}{2}\right)^{2k+1}}{(2k+1)!}\,z^{2k}$$
leads to
$$E\left[(X-\mu)^{2n}\right] = \frac{(b-a)^{2n}}{(2n+1)\,2^{2n}}, \qquad E\left[(X-\mu)^{2n+1}\right] = 0 \tag{3.14}$$

² Notice that $ab\,z\,\varphi_X(z)$ equals the convolution $f * g$ of two exponential densities $f$ and $g$ with rates $a$ and $b$, respectively.
Let us define $U$ as the uniform random variable in the interval $[0, 1]$. The random variable $W = 1 - U$ is also uniform on $[0, 1]$: $W$ and $U$ have the same distribution, denoted as $W \stackrel{d}{=} U$, because $\Pr[W \le x] = \Pr[1 - U \le x] = \Pr[U \ge 1 - x] = 1 - (1 - x) = x = \Pr[U \le x]$.

The probability distribution function $F_X(x) = \Pr[X \le x] = g(x)$ whose inverse exists can be written as a function of $F_U(x) = x\,1_{x\in[0,1]}$. Let $X = g^{-1}(U)$. Since the distribution function is non-decreasing, this also holds for the inverse $g^{-1}(\cdot)$. Applying (2.32) yields with $X = g^{-1}(U)$
$$F_X(x) = \Pr\left[g^{-1}(U) \le x\right] = \Pr[U \le g(x)] = F_U(g(x)) = g(x)$$
For instance, $g^{-1}(U) = -\frac{\ln(1-U)}{\alpha} \stackrel{d}{=} -\frac{\ln U}{\alpha}$ are exponential random variables (3.17) with parameter $\alpha$; $g^{-1}(U) = U^{1/\alpha}$ are polynomially distributed random variables with distribution $\Pr[X \le x] = x^{\alpha}$; $g^{-1}(U) = \cot(\pi U)$ is a Cauchy random variable defined in (3.38) below. In addition, we observe that $U = g(X) = F_X(X)$, which means that any random variable $X$ is transformed into a uniform random variable $U$ on $[0, 1]$ by its own distribution function.

The numbers $a_k$ that satisfy congruent recursions of the form $a_{k+1} = (\alpha a_k + \beta) \bmod M$, where $M$ is a large prime number (e.g. $M = 2^{31} - 1$), and $\alpha$ and $\beta$ are integers (e.g. $\alpha = 397\,204\,094$ and $\beta = 0$), are to a good approximation uniformly distributed. The scaled numbers $y_k = \frac{a_k}{M-1}$ are nearly uniformly distributed on $[0, 1]$. Since these recursions with initial value or seed $a_0 \in [0, M-1]$ are easy to generate with computers (Press et al., 1992), the above property is very useful to generate arbitrary random variables $X = g^{-1}(U)$ from the uniform random variable $U$.
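The inverse-transform recipe and the congruent recursion can be combined in a few lines. The sketch below is only an illustration under stated assumptions: the modulus and multiplier are the example values quoted above, while the seed `12345`, the rate `2.0`, the exponent `3.0` and all function names are arbitrary choices made here.

```python
import math

M = 2**31 - 1        # large prime modulus (example value from the text)
ALPHA = 397204094    # multiplier (example value from the text)
BETA = 0             # increment (example value from the text)


def lcg_uniforms(seed, count):
    """Yield `count` scaled numbers y_k = a_k / (M - 1) of the recursion."""
    a = seed
    for _ in range(count):
        a = (ALPHA * a + BETA) % M
        yield a / (M - 1)


def exponential_from_uniform(u, rate):
    """X = -ln(U) / rate is exponential with the given rate."""
    return -math.log(u) / rate


def polynomial_from_uniform(u, exponent):
    """X = U^(1/exponent) has distribution Pr[X <= x] = x^exponent on [0, 1]."""
    return u ** (1.0 / exponent)


rate, exponent = 2.0, 3.0                       # illustrative parameters
us = list(lcg_uniforms(seed=12345, count=20000))

exp_samples = [exponential_from_uniform(u, rate) for u in us]
poly_samples = [polynomial_from_uniform(u, exponent) for u in us]

exp_mean = sum(exp_samples) / len(exp_samples)
poly_mean = sum(poly_samples) / len(poly_samples)

print("exponential sample mean:", exp_mean, " theory:", 1 / rate)
print("polynomial sample mean: ", poly_mean, " theory:", exponent / (exponent + 1))
```

With $\beta = 0$ and a non-zero seed the recursion never reaches 0, so the logarithm is always defined; the form $-\ln U$ is used instead of $-\ln(1-U)$ for that reason.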
3.2.2 The exponential distribution

An exponential random variable $X$ satisfies the probability density function
$$f_X(x) = \alpha e^{-\alpha x}, \qquad x \ge 0 \tag{3.15}$$
where $\alpha$ is the rate at which events occur. The corresponding Laplace transform is
$$\varphi_X(z) = \int_0^{\infty} e^{-zt}\,\alpha e^{-\alpha t}\,dt = \frac{\alpha}{z+\alpha} \tag{3.16}$$
and the probability distribution is, for $x \ge 0$,
$$F_X(x) = 1 - e^{-\alpha x} \tag{3.17}$$
The mean or average follows from (2.33) or from $E[X] = -\varphi'_X(0)$ as $\mu = E[X] = \frac{1}{\alpha}$. The centered moments are obtained from (2.39) as
$$E\left[\left(X - \frac{1}{\alpha}\right)^n\right] = (-1)^n\left.\frac{d^n}{dz^n}\left(\frac{\alpha e^{z/\alpha}}{z+\alpha}\right)\right|_{z=0}$$
Since the Taylor expansion of $\frac{\alpha e^{z/\alpha}}{z+\alpha}$ around $z = 0$ is
$$\frac{\alpha e^{z/\alpha}}{z+\alpha} = \sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{z}{\alpha}\right)^k\sum_{k=0}^{\infty}(-1)^k\left(\frac{z}{\alpha}\right)^k = \sum_{n=0}^{\infty}\frac{(-1)^n}{\alpha^n}\left(\sum_{k=0}^{n}\frac{(-1)^k}{k!}\right)z^n$$
we find that
$$E\left[\left(X - \frac{1}{\alpha}\right)^n\right] = \frac{n!}{\alpha^n}\sum_{k=0}^{n}\frac{(-1)^k}{k!} \tag{3.18}$$
For large $n$, the centered moments are well approximated by
$$E\left[\left(X - \frac{1}{\alpha}\right)^n\right] \simeq \frac{n!}{e\,\alpha^n}$$
The exponential random variable possesses, just as its discrete counterpart, the geometric random variable, the memoryless property. Indeed, analogous to Section 3.1.3, consider
$$\Pr[X \ge t+T \mid X > t] = \frac{\Pr[\{X \ge t+T\} \cap \{X > t\}]}{\Pr[X > t]} = \frac{\Pr[X \ge t+T]}{\Pr[X > t]}$$
and since $\Pr[X > t] = e^{-\alpha t}$, the memoryless property
$$\Pr[X \ge t+T \mid X > t] = \Pr[X > T]$$
is established. Since the only non-zero solution (proved in Feller (1970, p. 459)) to the functional equation $f(x+y) = f(x)f(y)$, which implies the memoryless property, is of the form $c^x$, it shows that the exponential distribution is the only continuous distribution that has the memoryless property. As we will see later, this memoryless property is a fundamental property in Markov processes.

It is instructive to show the close relation between the geometric and exponential random variable (see Feller (1971, p. 1)). Consider the waiting time $T$ (measured in integer units of $\Delta t$) for the first success in a sequence of Bernoulli trials where only one trial occurs in a timeslot $\Delta t$. Hence, $X = \frac{T}{\Delta t}$ is a (dimensionless) geometric random variable. From (3.8), $\Pr[T > k\Delta t] = (1-p)^k$ and the average waiting time is $E[T] = \Delta t\,E[X] = \frac{\Delta t}{p}$. The
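transition from the discrete to continuous space involves the limit process $\Delta t \to 0$ subject to a fixed average waiting time $E[T]$. Let $t = k\Delta t$, then
$$\lim_{\Delta t\to 0}\Pr[T > t] = \lim_{\Delta t\to 0}\left(1 - \frac{\Delta t}{E[T]}\right)^{t/\Delta t} = e^{-t/E[T]}$$

$\sigma^2)$, then is, for $x \ge 0$,
$$f_{X^2}(x) = \frac{\exp\left(-\frac{x+\mu^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi x}}\cosh\left(\frac{\mu\sqrt{x}}{\sigma^2}\right)$$
In particular, for $N(0, 1)$ random variables where $\mu = 0$ and $\sigma = 1$, $f_{X^2}(x) = \frac{e^{-x/2}}{\sqrt{2\pi x}}$ reduces to a Gamma distribution (3.25) with $\beta = \frac{1}{2}$ and $\alpha = \frac{1}{2}$. Since the sum of $n$ independent Gamma random variables with $(\alpha, \beta)$ is again a Gamma random variable $(\alpha, n\beta)$, we arrive at the chi-square $\chi^2$ probability density function,
$$f_{\chi^2}(x) = \frac{x^{\frac{n}{2}-1}}{2^{\frac{n}{2}}\,\Gamma\left(\frac{n}{2}\right)}\,e^{-\frac{x}{2}} \tag{3.31}$$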
3.4 Functions of random variables

3.4.1 The maximum and minimum of a set of independent random variables

The minimum of $m$ i.i.d. random variables $\{X_k\}_{1\le k\le m}$ possesses the distribution⁴
$$\Pr\left[\min_{1\le k\le m} X_k \le x\right] = \Pr[\text{at least one } X_k \le x] = \Pr[\text{not all } X_k > x]$$
or
$$\Pr\left[\min_{1\le k\le m} X_k \le x\right] = 1 - \prod_{k=1}^{m}\Pr[X_k > x] \tag{3.32}$$
whereas for the maximum,
$$\Pr\left[\max_{1\le k\le m} X_k > x\right] = \Pr[\text{not all } X_k \le x] = 1 - \prod_{k=1}^{m}\Pr[X_k \le x]$$
or
$$\Pr\left[\max_{1\le k\le m} X_k \le x\right] = \prod_{k=1}^{m}\Pr[X_k \le x] \tag{3.33}$$
For example, the distribution function for the minimum of $m$ independent exponential random variables follows from (3.17) as
$$\Pr\left[\min_{1\le k\le m} X_k \le x\right] = 1 - \prod_{k=1}^{m} e^{-\alpha_k x} = 1 - \exp\left(-x\sum_{k=1}^{m}\alpha_k\right)$$
or, the minimum of $m$ independent exponential random variables each with rate $\alpha_k$ is again an exponential random variable with rate $\sum_{k=1}^{m}\alpha_k$. In addition to the memoryless property, this property of the exponential distribution will determine the fundamentals of Markov chains.

⁴ An alternative argument for independent random variables is that the event $\{\min_{1\le k\le m} X_k > x\}$ occurs if and only if $\{X_k > x\}$ for each $1 \le k \le m$. Similarly, the event $\{\max_{1\le k\le m} X_k \le x\}$ occurs if and only if $\{X_k \le x\}$ for each $1 \le k \le m$.
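The statement that the minimum of independent exponential random variables is again exponential, with rate equal to the sum of the rates, lends itself to a short simulation. This is a sketch under assumptions: the rates `1, 2, 3`, the number of trials, the seed and the evaluation point `x = 0.2` are all arbitrary illustrative choices.

```python
import math
import random


def simulate_min_exponential(rates, trials, seed=1):
    """Draw `trials` samples of min_k X_k with X_k exponential of rate rates[k]."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [min(rng.expovariate(r) for r in rates) for _ in range(trials)]


rates = [1.0, 2.0, 3.0]            # illustrative rates (assumption)
samples = simulate_min_exponential(rates, trials=50000)

total_rate = sum(rates)
sample_mean = sum(samples) / len(samples)

x = 0.2
empirical_cdf = sum(s <= x for s in samples) / len(samples)
theoretical_cdf = 1 - math.exp(-total_rate * x)   # 1 - exp(-x * sum of rates)

print("sample mean  :", sample_mean, " theory:", 1 / total_rate)
print("Pr[min <= x] :", empirical_cdf, " theory:", theoretical_cdf)
```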
3.4.2 Order statistics

The set $X_{(1)}, X_{(2)}, \ldots, X_{(m)}$ are called the order statistics of the set of random variables $\{X_k\}_{1\le k\le m}$ if $X_{(k)}$ is the $k$-th smallest value of the set $\{X_k\}_{1\le k\le m}$. Clearly, $X_{(1)} = \min_{1\le k\le m} X_k$ while $X_{(m)} = \max_{1\le k\le m} X_k$. If the set $\{X_k\}_{1\le k\le m}$ consists of i.i.d. random variables with pdf $f_X$, the joint density function of the order statistics is, for only $x_1 < x_2 < \cdots < x_m$,
$$f_{\{X_{(j)}\}}(x_1, x_2, \ldots, x_m) = \frac{\partial^m}{\partial x_1\cdots\partial x_m}\Pr\left[X_{(1)} \le x_1, \ldots, X_{(m)} \le x_m\right] = m!\prod_{j=1}^{m} f_X(x_j) \tag{3.34}$$
Indeed, confining to discrete random variables for simplicity, if $x_1 < x_2 < \cdots < x_m$, then
$$\Pr\left[X_{(1)} = x_1, \ldots, X_{(m)} = x_m\right] = m!\,\Pr[X_1 = x_1, \ldots, X_m = x_m]$$
else
$$\Pr\left[X_{(1)} = x_1, X_{(2)} = x_2, \ldots, X_{(m)} = x_m\right] = 0$$
because there are precisely $m!$ permutations of the set $\{X_k\}_{1\le k\le m}$ onto the given ordered sequence $\{x_1, x_2, \ldots, x_m\}$. If the sequence is not ordered such that $x_k > x_l$ for at least one couple of indices $k < l$, then the probability is zero because the event $\{X_{(k)} > X_{(l)}\}$ is, by definition, impossible. Finally, the product in (3.34) follows by independence.

If the set $\{X_k\}_{1\le k\le m}$ is uniformly distributed over $[0, t]$, then
$$f_{\{X_{(j)}\}}(x_1, x_2, \ldots, x_m) = \begin{cases}\dfrac{m!}{t^m} & 0 \le x_1 < x_2 < \cdots < x_m \le t\\[4pt] 0 & \text{elsewhere}\end{cases}$$
while for exponential random variables with $f_X(x) = \alpha e^{-\alpha x}$
$$f_{\{X_{(j)}\}}(x_1, x_2, \ldots, x_m) = \begin{cases} m!\,\alpha^m e^{-\alpha\sum_{j=1}^{m} x_j} & 0 \le x_1 < x_2 < \cdots < x_m\\[4pt] 0 & \text{elsewhere}\end{cases}$$
The order relation between the set $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$ is preserved after a continuous, non-decreasing transform $g$, i.e. $g\left(X_{(1)}\right) \le g\left(X_{(2)}\right) \le \cdots \le g\left(X_{(m)}\right)$. If the distribution function $F_X$ is continuous (it is always non-decreasing), the argument shows that the order statistics of a general set of i.i.d. random variables $\{X_k\}_{1\le k\le m}$ can be reduced to a study of the order statistics of the set of i.i.d. uniform random variables $\{U_k\}_{1\le k\le m}$ on $[0, 1]$ because $U = F_X(X)$.
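The event $\{X_{(k)} \le x\}$ means that at least $k$ among the $m$ random variables $\{X_j\}_{1\le j\le m}$ are smaller than $x$. Since each of the $m$ random variables is chosen independently from a same distribution $F_X$, the probability that precisely $n$ of the $m$ random variables is smaller than $x$ is binomially distributed with parameter $p = \Pr[X \le x]$. Hence,
$$\Pr\left[X_{(k)} \le x\right] = \sum_{n=k}^{m}\binom{m}{n}\left(\Pr[X \le x]\right)^n\left(1 - \Pr[X \le x]\right)^{m-n} \tag{3.35}$$
The probability density function can be obtained in the usual, though cumbersome, way by differentiation,
$$\begin{aligned}
f_{X_{(k)}}(x) &= \frac{d}{dx}\Pr\left[X_{(k)} \le x\right] = \sum_{n=k}^{m}\binom{m}{n}\frac{d}{dx}\left[\left(\Pr[X \le x]\right)^n\left(1 - \Pr[X \le x]\right)^{m-n}\right]\\
&= f_X(x)\sum_{n=k}^{m} n\binom{m}{n}\left(\Pr[X \le x]\right)^{n-1}\left(1 - \Pr[X \le x]\right)^{m-n}\\
&\quad - f_X(x)\sum_{n=k}^{m}(m-n)\binom{m}{n}\left(\Pr[X \le x]\right)^{n}\left(1 - \Pr[X \le x]\right)^{m-n-1}
\end{aligned}$$
Using $n\binom{m}{n} = m\binom{m-1}{n-1}$, $(m-n)\binom{m}{n} = m\binom{m-1}{n}$ and lowering the upper index in the last summation, we have
$$\begin{aligned}
f_{X_{(k)}}(x) &= m f_X(x)\sum_{n=k}^{m}\binom{m-1}{n-1}\left(\Pr[X \le x]\right)^{n-1}\left(1 - \Pr[X \le x]\right)^{m-n}\\
&\quad - m f_X(x)\sum_{n=k}^{m-1}\binom{m-1}{n}\left(\Pr[X \le x]\right)^{n}\left(1 - \Pr[X \le x]\right)^{m-n-1}\\
&= m f_X(x)\sum_{n=k}^{m}\binom{m-1}{n-1}\left(\Pr[X \le x]\right)^{n-1}\left(1 - \Pr[X \le x]\right)^{m-n}\\
&\quad - m f_X(x)\sum_{n=k+1}^{m}\binom{m-1}{n-1}\left(\Pr[X \le x]\right)^{n-1}\left(1 - \Pr[X \le x]\right)^{m-n}
\end{aligned}$$
or, with $F_X(x) = \Pr[X \le x]$,
$$f_{X_{(k)}}(x) = m f_X(x)\binom{m-1}{k-1}\left(F_X(x)\right)^{k-1}\left(1 - F_X(x)\right)^{m-k} \tag{3.36}$$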
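The lengthy differentiation leading from (3.35) to (3.36) can be cross-checked numerically. The sketch below does this for uniform random variables on $[0, 1]$ by comparing a central finite difference of (3.35) with the closed form (3.36). The values of `m`, `k`, `x` and the step `h`, as well as the function names, are illustrative assumptions.

```python
from math import comb


def order_stat_cdf(x, k, m, F):
    """Pr[X_(k) <= x] from (3.35): at least k of the m variables are <= x."""
    p = F(x)
    return sum(comb(m, n) * p**n * (1 - p)**(m - n) for n in range(k, m + 1))


def order_stat_pdf(x, k, m, F, f):
    """Density of the k-th order statistic from (3.36)."""
    p = F(x)
    return m * f(x) * comb(m - 1, k - 1) * p**(k - 1) * (1 - p)**(m - k)


def uniform_cdf(x):
    """Distribution function of a uniform random variable on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def uniform_pdf(x):
    """Density of a uniform random variable on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


m, k, x, h = 7, 3, 0.4, 1e-5     # illustrative values (assumptions)

# Central finite difference of the distribution function (3.35).
numeric_pdf = (order_stat_cdf(x + h, k, m, uniform_cdf)
               - order_stat_cdf(x - h, k, m, uniform_cdf)) / (2 * h)
closed_pdf = order_stat_pdf(x, k, m, uniform_cdf, uniform_pdf)

print("finite difference of (3.35):", numeric_pdf)
print("closed form (3.36):         ", closed_pdf)
```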
The more elegant and faster argument is as follows: in order for $X_{(k)}$ to be equal to $x$, exactly $k-1$ of the $m$ random variables $\{X_j\}_{1\le j\le m}$ must be less than $x$, one equal to $x$ and the other $m-k$ must all be greater than $x$. Abusing the notation $f_{X_{(k)}}(x) = \Pr\left[X_{(k)} = x\right]$ and observing that $m\binom{m-1}{k-1} = \frac{m!}{1!\,(k-1)!\,(m-k)!}$ is an instance of the multinomial coefficient $\frac{m!}{n_1!\,n_2!\cdots n_k!}$, which gives the number of ways of putting $m = n_1 + n_2 + \cdots + n_k$ different objects into $k$ different boxes with $n_j$ in the $j$-th box, leads alternatively to (3.36).

3.5 Examples of other distributions

1. The Gumbel distribution appears in the theory of extremes (see Section 6.4) and is defined by the distribution function
$$F_{\text{Gumbel}}(x) = e^{-e^{-a(x-b)}} \tag{3.37}$$
The corresponding Laplace transform is
$$\varphi_{\text{Gumbel}}(z) = \int_{-\infty}^{\infty} e^{-zt}\,e^{-e^{-a(t-b)}}\,a e^{-a(t-b)}\,dt = e^{-bz}\,\Gamma\left(1 + \frac{z}{a}\right)$$
from which the mean follows as $E[X] = -\left.\frac{d}{dz}\left(e^{-bz}\,\Gamma\left(1 + \frac{z}{a}\right)\right)\right|_{z=0} = b + \frac{\gamma}{a}$, where $\gamma = 0.57721\ldots$ is the Euler constant. The variance is best computed with (2.43), resulting in $\operatorname{Var}[X] = \frac{\pi^2}{6a^2}$.

2. The Cauchy distribution has the probability density function
$$f_{\text{Cauchy}}(x) = \frac{1}{\pi\left(1 + x^2\right)}$$
and corresponding distribution,
$$F_{\text{Cauchy}}(x) = \frac{1}{\pi}\left(\frac{\pi}{2} + \arctan x\right) \tag{3.38}$$
The Laplace transform
$$\varphi_{\text{Cauchy}}(z) = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{e^{-zx}}{1 + x^2}\,dx$$
only converges for purely imaginary $z = i\omega$, in which case it reduces to a Fourier transform,
$$\varphi_{\text{Cauchy}}(i\omega) = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{e^{-i\omega x}}{1 + x^2}\,dx$$
This integral is best evaluated by contour integration. If $ 0, we consider a contour F consisting of the real axis and the semi-circle that encloses the negative Im({)-plane, Z 3l$uh3l ¡ 3l ¢ Z " 3l${ Z 3l${ h g uh h g{ h g{ = + lim 2 2 2 u e A 0 ³ ¡ ¢´ e exp {d ¡ ¢ (3.39) iWeibull ({) = d 1 + 1e generalizes the exponential distribution (3.17) corresponding to e = 1 and d = 1 . It is related to the Gaussian distribution if e = 2. Let [ be a Weibull
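random variable. All higher moments can be computed from (2.34) as
$$E\left[X^k\right] = \frac{1}{a\,\Gamma\left(1 + \frac{1}{b}\right)}\int_0^{\infty} x^k\exp\left(-\left(\frac{x}{a}\right)^b\right)dx = \frac{a^k\,\Gamma\left(\frac{k+1}{b}\right)}{\Gamma\left(\frac{1}{b}\right)}$$
The generating function possesses the expansion
$$\varphi_X(z) = E\left[e^{-zX}\right] = \sum_{k=0}^{\infty}\frac{(-z)^k}{k!}E\left[X^k\right] = \frac{1}{\Gamma\left(\frac{1}{b}\right)}\sum_{k=0}^{\infty}\Gamma\left(\frac{k+1}{b}\right)\frac{(-za)^k}{k!}$$
which cannot be summed in explicit form for general $b$. Sometimes an alternative definition of the Weibull distribution appears,
$$f_{\text{Weibull}}(x) = ab\,x^{b-1}e^{-ax^b} \tag{3.40}$$
$$F_{\text{Weibull}}(x) = 1 - e^{-ax^b}$$
with the advantage of a simpler expression for the distribution function $F_{\text{Weibull}}(x)$. If $X$ possesses this probability density (3.40), the moments and variance are
$$E\left[X^k\right] = \int_0^{\infty} x^k f_{\text{Weibull}}(x)\,dx = \frac{\Gamma\left(\frac{k}{b} + 1\right)}{a^{k/b}}$$
$$\operatorname{Var}[X] = \int_0^{\infty}\left(x - E[X]\right)^2 f_{\text{Weibull}}(x)\,dx = \frac{\Gamma\left(1 + \frac{2}{b}\right) - \Gamma^2\left(1 + \frac{1}{b}\right)}{a^{2/b}}$$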
The interest of the Weibull distribution in the Internet stems from the self-similar and long-range dependence of observables (i.e. quantities that can be measured such as the delay, the interarrival times of packets, etc.). Especially if the shape factor e 5 (0> 1), the Weibull has a sub-exponential tail that decays more slowly than an exponential, but still faster than any power law. 4. Power law behavior is often described via the Pareto distribution with pdf for { 0 and A 0> ³ { ´331 (3.41) iPareto ({) = 1+ and with distribution function Z ³ { ´331 { ´3 {³ 1+ gw = 1 1 + (3.42) IPareto ({) = 0
Since lim{ 2 is£a Gaussian ¤ or normal random variable. From (2.32), it follows that Pr h\ { = Pr [\ log {] for { 0, and with (3.20) ¸ Z log { 1 (w )2 Ilognormal ({) = s exp gw (3.43) 22 2 3" and, for { A 0,
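$$f_{\text{lognormal}}(x) = \frac{\exp\left[-\frac{(\log x - \mu)^2}{2\sigma^2}\right]}{\sigma x\sqrt{2\pi}} \tag{3.44}$$
The moments are
$$E\left[X^k\right] = \frac{1}{\sigma\sqrt{2\pi}}\int_0^{\infty} x^{k-1}\exp\left[-\frac{(\log x - \mu)^2}{2\sigma^2}\right]dx = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{ku}\exp\left[-\frac{(u - \mu)^2}{2\sigma^2}\right]du$$
or, explicitly,
$$E\left[X^k\right] = \exp(k\mu)\exp\left(\frac{k^2\sigma^2}{2}\right) \tag{3.45}$$
and
$$\operatorname{Var}[X] = e^{2\mu}e^{\sigma^2}\left(e^{\sigma^2} - 1\right) \tag{3.46}$$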
The probability generating function is by definition (2.37) 1 *[ (}; > 2 ) = s 2
Z
0
"
h3 h3}w
(log w3)2 2 2
w
1 gw = s 2
Z
"
3"
{
h3}h h3
({3)2 2 2
g{
(3.47) only exists for Re(}) 0. The integral (3.47) indicates that *[ This means that *[ (}; > 2 ) is not analytic at any point } = lw on the imaginary axis because the circle with arbitrary small but non-zero radius around } = lw necessarily encircles points with Re(}) ? 0 where *[ (}; > 2 ) does not exist. Hence, the Taylor expansion (2.40) of the generating function around } = 0 does not exist, although all moments or derivatives at } = 0 (}; > 2 )
exist. Indeed, the series £ ¤ µ 2 2¶ " " X (1)n H [ n n X (}h )n n } = exp n! n! 2 n=0
n=0
is a divergent series (except for = 0 or } = 0). The fact that the pgf (3.44) is not available in closed form complicates the computation of the sum of i.i.d. lognormal random variables via (2.66). This sum appears in radio communications with several transmitters and receivers. In radio communications, the received signal levels decrease with the distance between the transmitter and the receiver. This phenomenon is called pathloss. Attenuation of radio signals due to pathloss has been modeled by averaging the measured signal powers over long times and over various locations with the same distances to the transmitter. The mean value of the signal power found in this way is referred to as the area mean power Pd (in Watts) and is well-modeled as Pd (u) = f·u3 where f is a constant and is the pathloss exponent5 . In reality the received power levels may vary significantly around the mean power Pd (u) due to irregularities in the surroundings of the receiving and transmitting antennas. Measurements have revealed that the logarithm of the mean power P (u) at dierent locations on a circle with radius u around the transmitter is approximately normally distributed with mean equal to the logarithm of the area mean power Pd (u). The lognormal shadowing model assumes that the logarithm of P(u) is precisely normally distributed around the logarithmic value of the area mean power: log10 (P(u)) = log10 (Pd (u))+[, where [ = Q (0> ) is a zero-mean normal distributed random variable (in dB) with standard deviation (also in dB and for severe fluctuations up to 12 dB). Hence, the random variable P(u) = Pd (u)10[ has a lognormal distribution (3.43) equal to Pr [P(u) $ {] = Pr [ $ log10
& % ] { { (log10 x 3 log10 (Pd (u)))2 gx 1 exp 3 = I Pd (u) 22 x 2 log 10 0
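3.6 Summary tables of probability distributions

3.6.1 Discrete random variables

| Name | $\Pr[X = k]$ | $E[X]$ | $\operatorname{Var}[X]$ | $\varphi_X(z) = E\left[z^X\right]$ |
|---|---|---|---|---|
| Bernoulli | $\Pr[X = 1] = p$ | $p$ | $p(1-p)$ | $1 - p + pz$ |
| Binomial | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | $\left((1-p) + pz\right)^n$ |
| Geometric | $p(1-p)^{k-1}$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{pz}{1-(1-p)z}$ |
| Poisson | $\frac{\lambda^k}{k!}e^{-\lambda}$ | $\lambda$ | $\lambda$ | $e^{\lambda(z-1)}$ |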
⁵ The constant $c$ depends on the transmitted power, the receiver and the transmitter antenna gains and the wavelength. The pathloss exponent depends on the environment and terrain structure and can vary between 2 in free space and 6 in urban areas.
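3.6.2 Continuous random variables

| Name | $f_X(x)$ | $E[X]$ | $\operatorname{Var}[X]$ | $\varphi_X(z) = E\left[e^{-zX}\right]$ |
|---|---|---|---|---|
| Uniform | $\frac{1_{a\le x\le b}}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{-za} - e^{-zb}}{z(b-a)}$ |
| Exponential | $\alpha e^{-\alpha x}$ | $\frac{1}{\alpha}$ | $\frac{1}{\alpha^2}$ | $\frac{\alpha}{z+\alpha}$ |
| Gaussian | $\frac{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]}{\sigma\sqrt{2\pi}}$ | $\mu$ | $\sigma^2$ | $\exp\left[\frac{\sigma^2 z^2}{2} - \mu z\right]$ |
| Gamma | $\frac{\alpha(\alpha x)^{\beta-1}}{\Gamma(\beta)}e^{-\alpha x}$ | $\frac{\beta}{\alpha}$ | $\frac{\beta}{\alpha^2}$ | $\left(\frac{\alpha}{z+\alpha}\right)^{\beta}$ |
| Gumbel | $e^{-x}e^{-e^{-x}}$ | $\gamma = 0.5772\ldots$ | $\frac{\pi^2}{6}$ | $\Gamma(z+1)$ |
| Cauchy | $\frac{1}{\pi(1+x^2)}$ | does not exist | does not exist | $e^{-\lvert\operatorname{Im}(z)\rvert}$ ($\operatorname{Re}(z) = 0$) |
| Weibull | $ab\,x^{b-1}e^{-ax^b}$ | $\frac{\Gamma\left(1+\frac{1}{b}\right)}{a^{1/b}}$ | $\frac{\Gamma\left(1+\frac{2}{b}\right) - \Gamma^2\left(1+\frac{1}{b}\right)}{a^{2/b}}$ | |
| Pareto | $\frac{\alpha}{\tau}\left(1 + \frac{x}{\tau}\right)^{-\alpha-1}$ | $\frac{\tau\,1_{\{\alpha>1\}}}{\alpha-1}$ | $\frac{\alpha\tau^2\,1_{\{\alpha>2\}}}{(\alpha-1)^2(\alpha-2)}$ | |
| Lognormal | $\frac{\exp\left[-\frac{(\log x-\mu)^2}{2\sigma^2}\right]}{\sigma x\sqrt{2\pi}}$ | $\exp(\mu)\exp\left(\frac{\sigma^2}{2}\right)$ | $e^{2\mu}\left(e^{2\sigma^2} - e^{\sigma^2}\right)$ | |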
3.7 Problems

(i) If $\varphi_X(z)$ is the probability generating function of a non-zero discrete random variable $X$, find an expression of $E[\log X]$ in terms of $\varphi_X(z)$.

(ii) Compute the mean value of the $k$-th order statistic in an ensemble of (a) $m$ i.i.d. exponentially distributed random variables with mean 1 and (b) $m$ i.i.d. polynomially distributed random variables on $[0, 1]$.

(iii) Discuss how a probability density function of a continuous random variable $X$ can be approximated from a set $\{x_1, x_2, \ldots, x_n\}$ of $n$ measurements or simulations.

(iv) In a circle with radius $r$ around a sending mobile node, there are $N - 1$ other mobile nodes uniformly distributed over that circle. The possible interference caused by these other mobile nodes depends on their distance to the sending node at the center. Derive for large $N$ but constant density $\nu$ of mobile nodes the pdf of the distance of the $m$-th nearest node to the center.

(v) Let $U$ and $V$ be two independent random variables. What is the probability that the one is larger than the other?
4 Correlation
In this chapter methods to compute bi-variate correlated random variables are discussed. As a measure for the correlation, the linear correlation coefficient $\rho$ defined in (2.58) is used. First, the generation of $n$ correlated Gaussian random variables is explained. The sequel is devoted to the construction of two correlated random variables with arbitrary distribution.

4.1 Generation of correlated Gaussian random variables

Due to the importance of Gaussian correlated random variables as an underlying system for generating arbitrary correlated random variables, as will be demonstrated in Section 4.3, we discuss how they can be generated in multiple dimensions. With the notation of Section 3.2.3, a Gaussian (normal) random variable with average $\mu$ and variance $\sigma^2$ is denoted by $N(\mu, \sigma^2)$. By linearly combining Gaussian random variables, we can create a new Gaussian random variable with a desired mean $\mu$ and variance $\sigma^2$.
4.1.1 Generation of two independent Gaussian random variables

The fact that a linear combination of Gaussian random variables is again a Gaussian random variable allows us to concentrate on normalized Gaussian random variables $N(0, 1)$. Let $X_1$ and $X_2$ be two independent normalized Gaussian random variables. Independent random variables are not correlated and the linear correlation coefficient $\rho = 0$. The resulting joint probability distribution is $f_{X_1X_2}(x, y; \rho) = f_{X_1}(x)f_{X_2}(y)$ and with (3.19),
$$f_{X_1X_2}(x, y; 0) = \frac{e^{-\frac{x^2+y^2}{2}}}{2\pi}$$
It is natural to consider a polar transformation and the transformed random variables $R = X_1^2 + X_2^2$ and $\Theta = \arctan\left(\frac{X_2}{X_1}\right)$. The inverse transform is $x = \sqrt{r}\cos\theta$ and $y = \sqrt{r}\sin\theta$, which differs slightly from the usual polar transformation in that we now define $r = x^2 + y^2$ instead of $r^2 = x^2 + y^2$. The reason is that the Jacobian is simpler for our purposes,
$$J(r, \theta) = \det\begin{bmatrix}\frac{\partial x}{\partial r} & \frac{\partial x}{\partial\theta}\\[2pt] \frac{\partial y}{\partial r} & \frac{\partial y}{\partial\theta}\end{bmatrix} = \det\begin{bmatrix}\frac{\cos\theta}{2\sqrt{r}} & -\sqrt{r}\sin\theta\\[2pt] \frac{\sin\theta}{2\sqrt{r}} & \sqrt{r}\cos\theta\end{bmatrix} = \frac{1}{2}$$
whereas the usual polar transformation has the Jacobian equal to the variable $r$. Using the transformation rules in Section 2.5.4,
$$f_{R\Theta}(r, \theta) = \frac{e^{-\frac{r}{2}}}{4\pi}$$
which shows that $f_{R\Theta}(r, \theta)$ does not depend on $\theta$. Hence, we can write $f_{R\Theta}(r, \theta) = f_R(r)f_\Theta(\theta)$ with $f_\Theta(\theta) = c$, where $c$ is a constant and $f_R(r) = \frac{e^{-\frac{r}{2}}}{4\pi c}$. This implies that $\Theta$ is a uniform random variable over an interval of length $1/c$. We also recognize from (3.15) that $f_R(r)$ is close to an exponential random variable with rate $\alpha = \frac{1}{2}$. Therefore, it is instructive to choose the constant $c$ such that $R$ is precisely an exponential random variable with rate $\alpha = \frac{1}{2}$. Thus, choosing $c = \frac{1}{2\pi}$, we end up with $f_R(r) = \frac{1}{2}e^{-\frac{r}{2}}$ and $f_\Theta(\theta) = \frac{1}{2\pi}$. These two independent random variables $R$ and $\Theta$ can each be generated separately from a uniform random variable $U$ on $[0, 1]$, as discussed in Section 3.2.1, leading to
$$R = -2\ln(U_1), \qquad \Theta = 2\pi U_2$$
and, finally, to the independent Gaussian random variables
$$X_1 = \sqrt{-2\ln(U_1)}\,\cos 2\pi U_2, \qquad X_2 = \sqrt{-2\ln(U_1)}\,\sin 2\pi U_2$$
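The two formulas for $X_1$ and $X_2$ translate directly into code. The following is a minimal sketch of this transformation (commonly known as the Box-Muller method, a name not used in the text). The use of Python's built-in generator as the source of $U_1$ and $U_2$, the seed `7` and the sample size are assumptions made here for illustration.

```python
import math
import random


def gaussian_pair(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) variables."""
    r = -2.0 * math.log(u1)        # R = X1^2 + X2^2, exponential with rate 1/2
    theta = 2.0 * math.pi * u2     # Theta, uniform on [0, 2*pi)
    root = math.sqrt(r)
    return root * math.cos(theta), root * math.sin(theta)


rng = random.Random(7)  # fixed seed for reproducibility (arbitrary choice)
# 1 - rng.random() lies in (0, 1], which avoids log(0).
pairs = [gaussian_pair(1.0 - rng.random(), rng.random()) for _ in range(50000)]

n = len(pairs)
mean1 = sum(x1 for x1, _ in pairs) / n
mean2 = sum(x2 for _, x2 in pairs) / n
var1 = sum((x1 - mean1) ** 2 for x1, _ in pairs) / n
var2 = sum((x2 - mean2) ** 2 for _, x2 in pairs) / n
cov12 = sum((x1 - mean1) * (x2 - mean2) for x1, x2 in pairs) / n

print("means      :", mean1, mean2)
print("variances  :", var1, var2)
print("covariance :", cov12)
```

The sample means and covariance should be close to 0 and the sample variances close to 1, consistent with two independent $N(0, 1)$ random variables.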
The procedure can be used to generate a single Gaussian random variable, but also more independent Gaussians by repeating the generation procedure.

4.1.2 The n-joint Gaussian probability distribution function

A collection of $n$ random variables $X_i$ is called a random vector $X = (X_1, X_2, \ldots, X_n)^T$, a matrix with dimension $n \times 1$. The average of a random vector is a vector with components $E[X_i]$ for $1 \le i \le n$. The variance of a random vector
$$\operatorname{Var}[X] = E\left[(X - E[X])(X - E[X])^T\right] = E\left[XX^T\right] - E[X]\left(E[X]\right)^T$$
is a matrix $\Sigma_X$ with elements $(\Sigma_X)_{i,j} = \operatorname{Cov}[X_i, X_j]$. Since $\operatorname{Cov}[X_i, X_j] = \operatorname{Cov}[X_j, X_i]$, the covariance matrix $\Sigma_X$ is real and symmetric, $\Sigma_X = \Sigma_X^T$. The importance of real, symmetric matrices is that they have real eigenvectors (see Appendix A.2). Moreover, $\Sigma_X$ is non-negative definite because, using vector norms defined in Section A.3,
$$x^T\Sigma_X x = E\left[x^T(X - E[X])(X - E[X])^T x\right] = E\left[\left((X - E[X])^T x\right)^T(X - E[X])^T x\right] = E\left[\left\|(X - E[X])^T x\right\|_2^2\right] \ge 0$$
which implies that all real eigenvalues $\lambda_i$ are non-negative. Hence, there exists an orthogonal matrix $U$ such that
$$\Sigma_X = U\operatorname{diag}(\lambda_i)U^T \tag{4.1}$$
If all random variables $X_i$ are independent, $\operatorname{Cov}[X_i, X_j] = 0$ for $i \ne j$ and $\operatorname{Cov}[X_i, X_i] = \operatorname{Var}[X_i] \ge 0$, then $\Sigma_X = \operatorname{diag}(\operatorname{Var}[X_i])$.

Gaussian random variables are completely determined by the mean and the variance, i.e. by the first two moments. We will now show that the existence of an orthogonal transformation for any probability distribution such that $U^T\Sigma_X U = \operatorname{diag}(\lambda_i)$ implies that a vector of joint Gaussian random variables can be transformed into a vector of independent Gaussian random variables. Also the reverse holds, which will be used below to generate $n$ joint correlated Gaussian random variables.

The multi-dimensional generating function of an $n$-joint Gaussian or $n$-joint normal random vector $X$ is defined for the vector $z = (z_1, z_2, \ldots, z_n)^T$ as
$$\varphi_X(z) = E\left[e^{-z^T X}\right] = \exp\left(\frac{1}{2}z^T\Sigma_X z - \left(E[X]\right)^T z\right) \tag{4.2}$$
Using (4.1), and the fact that $U$ is an orthogonal matrix such that $U^{-1} = U^T$ and $UU^T = I$,
$$\varphi_X(z) = \exp\left(\frac{1}{2}\left(U^T z\right)^T\operatorname{diag}(\lambda_i)\,U^T z - \left(U^T E[X]\right)^T U^T z\right)$$
Denote the vectors $w = U^T z$ and $m = U^T E[X]$. Then we have
$$\varphi_X(z) = \exp\left(\frac{1}{2}w^T\operatorname{diag}(\lambda_i)\,w - m^T w\right) = \exp\left(\sum_{j=1}^{n}\left(\frac{\lambda_j w_j^2}{2} - m_j w_j\right)\right) = \prod_{j=1}^{n} e^{\frac{\lambda_j w_j^2}{2} - m_j w_j}$$
and $e^{\frac{\lambda_j w_j^2}{2} - m_j w_j} = \varphi_{X_j}(w_j)$ is the Laplace transform (3.22) of a Gaussian random variable $X_j$ because all $\lambda_j$ are real and non-negative. With (2.65), this shows that a vector of joint Gaussian random variables can be transformed into a vector of independent Gaussian random variables. Reversing the order of the manipulations also justifies that (4.2) indeed defines a general $n$-joint Gaussian probability generating function. If $X_1, X_2, \ldots, X_n$ are joint normal and not correlated, then $\Sigma_X$ is a diagonal matrix, which implies that $X_1, X_2, \ldots, X_n$ are independent. As discussed in Section 2.5.2, independence implies non-correlation, but the converse is generally not true. These properties make Gaussian random variables particularly suited to deal with correlations.
Fig. 4.1. The joint probability density function (4.4) with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$ and $\rho = 0$. (Surface plot of $f_{XY}(x, y)$ versus $x$ and $y$.)
The corresponding $n$-joint Gaussian probability density function of the vector $X$ can be derived after inverse Laplace transform for the vector $x = (x_1, x_2, \ldots, x_n)^T$ as
$$f_X(x) = \frac{1}{\left(\sqrt{2\pi}\right)^n\sqrt{\det\Sigma_X}}\exp\left(-\frac{1}{2}\left(x - E[X]\right)^T\Sigma_X^{-1}\left(x - E[X]\right)\right) \tag{4.3}$$
The inverse Laplace transform for $n = 2$ is computed in Section C.2. After computing the inverse matrix and the determinant in (4.3) explicitly, the two-dimensional ($n = 2$) or bi-variate Gaussian probability density function is
$$f_{XY}(x, y; \rho) = \frac{\exp\left[-\dfrac{\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}}{2\left(1-\rho^2\right)}\right]}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \tag{4.4}$$
Figures 4.1 to 4.3 plot $f_{XY}(x, y; \rho)$ for various correlation coefficients $\rho$. If $\rho = 0$, we observe that $f_{XY}(x, y; 0) = f_X(x)f_Y(y)$, which indicates that uncorrelated Gaussian random variables are also independent.

Fig. 4.2. The joint probability density function (4.4) with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$ and $\rho = 0.8$. (Surface plot of $f_{XY}(x, y)$ versus $x$ and $y$.)

Fig. 4.3. The joint probability density function (4.4) with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$ and $\rho = 0.8$. (Surface plot of $f_{XY}(x, y)$ versus $x$ and $y$.)
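If we denote $x^* = \frac{x-\mu_X}{\sigma_X}$ and $y^* = \frac{y-\mu_Y}{\sigma_Y}$, the bi-variate normal density (4.4) reduces to
$$f_{XY}(x^*, y^*; \rho) = \frac{\exp\left[-\dfrac{(x^*)^2 - 2\rho x^* y^* + (y^*)^2}{2\left(1-\rho^2\right)}\right]}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}$$
from which we can verify the partial differential equation
$$\frac{\partial f_{XY}(x^*, y^*; \rho)}{\partial\rho} = \frac{\partial^2 f_{XY}(x^*, y^*; \rho)}{\partial x^*\,\partial y^*} \tag{4.5}$$
and the symmetry relations
$$f_{XY}(x^*, y^*; \rho) = f_{XY}(-x^*, y^*; -\rho) = f_{XY}(x^*, -y^*; -\rho) \tag{4.6}$$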
4.1.3 Generation of n correlated Gaussian random variables

Let $\{X_i\}_{1\le i\le n}$ be a set of $n$ independent normal random variables, where each $X_i$ is distributed as $N(0, 1)$. The vector $X$ is rather easily simulated. The analysis above shows that $E[X] = 0$ (the null-vector) and $\Sigma_X = \operatorname{diag}(\operatorname{Var}[X_i]) = \operatorname{diag}(1) = I$, the identity matrix. We want to generate the correlated normal vector $Y$ with a given mean vector $E[Y]$ and a given covariance matrix $\Sigma_Y$. Since linear combinations of normal random variables are normal random variables, we consider the linear transformation $Y = AX + B$ where $A$ and $B$ are constant matrices. We will now determine $A$ and $B$. First,
$$E[Y] = E[AX] + E[B] = AE[X] + E[B] = E[B]$$
Hence, the matrix $B$ is a vector with components equal to the given components $E[Y_i]$ of the mean vector $E[Y]$. Second,
$$\Sigma_Y = E\left[(Y - E[Y])(Y - E[Y])^T\right] = E\left[AX(AX)^T\right] = E\left[AXX^TA^T\right] = AE\left[XX^T\right]A^T = A\Sigma_XA^T = AA^T$$
From the eigenvalue decomposition of $\Sigma_Y = U\operatorname{diag}(\lambda_i)U^T$ with real eigenvalues $\lambda_i \ge 0$ and the fact that $\operatorname{diag}(\lambda_i) = \operatorname{diag}\left(\sqrt{\lambda_i}\right)\left(\operatorname{diag}\left(\sqrt{\lambda_i}\right)\right)^T$ such that
$$AA^T = U\operatorname{diag}\left(\sqrt{\lambda_i}\right)\left(\operatorname{diag}\left(\sqrt{\lambda_i}\right)\right)^T U^T$$
we obtain
$$A = U\operatorname{diag}\left(\sqrt{\lambda_i}\right)$$
The matrix $A$ is also called the square root matrix of $\Sigma_Y$ and can be found from the singular value decomposition of $\Sigma_Y$ or from Cholesky factorization (Press et al., 1992).

Example. Generate a normal vector $Y$ with $E[Y] = (300, 300)^T$, with standard deviations $\sigma_1 = 106.066$, $\sigma_2 = 35.355$ and correlation $\rho_Y = 0.8$.
Solution: The covariance matrix of $Y$ is obtained using the definition of the linear correlation coefficient (2.58),
$$\Sigma_Y = \begin{pmatrix} \sigma_1^2 & \rho_Y\sigma_1\sigma_2 \\ \rho_Y\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 11250 & 3000 \\ 3000 & 1250 \end{pmatrix}$$
The square root matrix $A$ of $\Sigma_Y$ is
$$A = \begin{pmatrix} 63.640 & 84.853 \\ 0 & 35.355 \end{pmatrix}$$
which is readily checked by computing $AA^T = \Sigma_Y$. It remains to generate $m$ independent draws for $X_1$ and $X_2$ from a normal distribution with zero mean and unit variance as explained in Section 4.1.1. Each pair $(X_1, X_2)$ out of the $m$ pairs is transformed as $Y = AX + E[Y]$. The result, component $Y_2$ versus $Y_1$, is shown in Fig. 4.4.
Fig. 4.4. The scatter diagram of the simulated vector $Y$ (component $Y_2$, ranging from 0 to 450, versus component $Y_1$, ranging from 0 to 700).
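The construction in the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sqrt_matrix_2x2`, `correlated_normal_pairs`, `sample_correlation`), the seed and the number of draws are chosen here and do not come from the text. The upper-triangular square root is written in closed form for the $2 \times 2$ case and reproduces the matrix $A$ of the example.

```python
import math
import random


def sqrt_matrix_2x2(s1, s2, rho):
    """Upper-triangular A with A A^T = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]."""
    return [[s1 * math.sqrt(1.0 - rho * rho), rho * s1],
            [0.0, s2]]


def correlated_normal_pairs(mean, A, m, rng):
    """Draw m pairs Y = A X + mean, with X_1 and X_2 independent N(0, 1)."""
    pairs = []
    for _ in range(m):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        y1 = A[0][0] * x1 + A[0][1] * x2 + mean[0]
        y2 = A[1][0] * x1 + A[1][1] * x2 + mean[1]
        pairs.append((y1, y2))
    return pairs


def sample_correlation(pairs):
    """Sample linear correlation coefficient of a list of (y1, y2) pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    v1 = sum((p[0] - m1) ** 2 for p in pairs)
    v2 = sum((p[1] - m2) ** 2 for p in pairs)
    return cov / math.sqrt(v1 * v2)


# Values of the example: means 300, sigma_1 = 106.066, sigma_2 = 35.355, rho_Y = 0.8
A = sqrt_matrix_2x2(106.066, 35.355, 0.8)
rng = random.Random(1)  # fixed seed (an arbitrary choice) for reproducibility
pairs = correlated_normal_pairs((300.0, 300.0), A, 20000, rng)
rho_hat = sample_correlation(pairs)
```

With a large number of draws, `rho_hat` should lie close to the prescribed value 0.8, and the entries of `A` agree with the square root matrix given in the solution.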
4.2 Generation of correlated random variables
Let us consider the problem of generating two correlated random variables $X$ and $Y$ with given distribution functions $F_X$ and $F_Y$. The correlation is expressed in terms of the linear correlation coefficient $\rho(X,Y) = \rho$ defined in (2.58). The need to generate correlated random variables often occurs in simulations. For example, as shown in Kuipers and Van Mieghem (2003), correlations in the link weight structure may significantly increase the computational complexity of multi-constrained routing, called in brief QoS routing. The importance of measures of dependence between quantities in risk and finance is discussed in Embrechts et al. (2001b).

In general, given the distribution functions $F_X$ and $F_Y$, not all linear correlations $-1 \le \rho \le 1$ are possible. Indeed, let $X$ and $Y$ be positive real random variables with infinite range, which means that $F_X(x) = 1$ only if $x \to \infty$ and that $F_X(x) = F_Y(x) = 0$ for $x < 0$. Consider $Y = aX + b$ with $a < 0$ and $b \ge 0$. For all finite $y < 0$,
$$F_Y(y) = \Pr[Y \le y] = \Pr[aX + b \le y] = \Pr\left[X \ge \frac{y-b}{a}\right] \ge \Pr\left[X > \frac{y-b}{a}\right] = 1 - F_X\left(\frac{y-b}{a}\right) > 0$$
which contradicts the fact that $F_Y(y) = 0$ for $y < 0$. Hence, positive random variables with infinite range cannot be correlated with $\rho = -1$. The requirement that the range needs to be unbounded is necessary because two uniform random variables on $[0,1]$, $U_1$ and $U_2$, are negatively correlated with $\rho = -1$ if $U_1 = 1 - U_2$. In summary, the set of all possible correlations is a closed interval $[\rho_{\min}, \rho_{\max}]$ for which $\rho_{\min} < 0 < \rho_{\max}$. The precise computation of $\rho_{\min}$ and $\rho_{\max}$ is, in general, difficult, as shown below.

4.3 The non-linear transformation method
The non-linear transformation approach starts from a given set of two random variables $X_1$ and $X_2$ that have a correlation coefficient $\rho_X \in [-1,1]$. If the joint distribution function
$$F_{X_1X_2}(x_1,x_2;\rho_X) = \Pr[X_1 \le x_1, X_2 \le x_2] = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
is known, the marginal distribution follows from (2.60) as
$$\Pr[X_1 \le x_1] = \int_{-\infty}^{x_1}\int_{-\infty}^{\infty} f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
Since for any random variable $X$ holds that $F_X(X) = U$, where $U$ is a uniform random variable on $[0,1]$, it follows that $U_1 = F_{X_1}(X_1)$ and $U_2 = F_{X_2}(X_2)$ are correlated uniform random variables with correlation coefficient $\rho_U$. As shown in Section 3.2.1, if $U$ is a uniform random variable on $[0,1]$, any other random variable $Y$ with distribution function $g(x)$ can be constructed as $g^{-1}(U)$. By combining the two transforms, we can generate $Y_1 = g_1^{-1}(F_{X_1}(X_1))$ and $Y_2 = g_2^{-1}(F_{X_2}(X_2))$ that are correlated because $X_1$ and $X_2$ are correlated. It may be possible to construct directly the correlated random variables $Y_1 = T_1(X_1)$ and $Y_2 = T_2(X_2)$ if the transforms $T_1$ and $T_2$ are known. The goal is to determine the linear correlation coefficient $\rho_Y$ defined in (2.58),
$$\rho_Y = \frac{E[Y_1Y_2] - E[Y_1]E[Y_2]}{\sqrt{\mathrm{Var}[Y_1]}\sqrt{\mathrm{Var}[Y_2]}}$$
as a function of $\rho_X$. Using (2.61),
$$E[Y_1Y_2] = E\left[g_1^{-1}(F_{X_1}(X_1))\,g_2^{-1}(F_{X_2}(X_2))\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1^{-1}(F_{X_1}(u))\,g_2^{-1}(F_{X_2}(v))\,f_{X_1X_2}(u,v;\rho_X)\,du\,dv \tag{4.7}$$
This relation shows that $\rho_Y$ is a continuous function in $\rho_X$ and that the joint distribution function of $X_1$ and $X_2$ is needed. The main difficulty lies now in the computation of the integral appearing in $E[Y_1Y_2]$. For $X_1$ and $X_2$, Gaussian correlated random variables are most often chosen because an exact analytic expression (4.4) exists for the joint density $f_{X_1X_2}(x_1,x_2;\rho_X)$.

4.3.1 Properties of $\rho_Y$ as a function of $\rho_X$
From now on, we choose Gaussian correlated random variables for $X_1$ and $X_2$.

Theorem 4.3.1 The correlation coefficient $\rho_Y$ is a differentiable and increasing function of $\rho_X$.

Proof:
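From the partial differential equation (4.5) of $f_{X_1X_2}(u,v;\rho_X)$, it follows that
$$\frac{\partial E[Y_1Y_2]}{\partial\rho_X} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1^{-1}(F_{X_1}(u))\,g_2^{-1}(F_{X_2}(v))\,\frac{\partial^2 f_{X_1X_2}(u,v;\rho_X)}{\partial u\,\partial v}\,du\,dv$$
Partial integration with respect to $u$ and $v$ yields
$$\frac{\partial E[Y_1Y_2]}{\partial\rho_X} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{d g_1^{-1}(F_{X_1}(u))}{du}\,\frac{d g_2^{-1}(F_{X_2}(v))}{dv}\,f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
Applying the chain rule for differentiation and $\frac{d g^{-1}(x)}{dx} = \frac{1}{g'(g^{-1}(x))}$ gives
$$\frac{d g^{-1}(F_X(u))}{du} = \left.\frac{d g^{-1}(x)}{dx}\right|_{x = F_X(u)} \frac{dF_X(u)}{du} = \frac{f_X(u)}{g'(g^{-1}(F_X(u)))}$$
Since $g'(x)$ and $f_X(u)$ are probability density functions and positive, $\frac{\partial E[Y_1Y_2]}{\partial\rho_X} > 0$. Because the means and variances of $Y_1$ and $Y_2$ do not depend on $\rho_X$, it follows that $\frac{\partial\rho_Y}{\partial\rho_X} > 0$. Hence, we have shown that $\rho_Y$ is a differentiable, increasing function of $\rho_X$. $\square$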
Since $\rho_X \in [-1,1]$, $\rho_Y$ increases from $\rho_{Y\min}$ at $\rho_X = -1$ to $\rho_{Y\max}$ corresponding to $\rho_X = 1$. In the sequel, we will derive expressions to compute the boundary cases $\rho_X = -1$ and $\rho_X = 1$.

Theorem 4.3.2 (of Lancaster) For any two strictly increasing real functions $T_1$ and $T_2$ that transform the correlated Gaussian random variables $X_1$ and $X_2$ to the correlated random variables $Y_1 = T_1(X_1)$ and $Y_2 = T_2(X_2)$, it holds that
$$|\rho_Y| \le |\rho_X|$$
If two correlated random variables $Y_1$ and $Y_2$ can be obtained by separate transformations from a bi-variate normal distribution with correlation coefficient $\rho_X$, the correlation coefficient $\rho_Y$ of the transformed random variables cannot in absolute value exceed $\rho_X$. The interest of the proof is that it uses powerful properties of orthogonal polynomials and that $\rho_Y$ is expanded in a power series in $\rho_X$ in (4.12).

Proof: The proof is based on the orthogonal Hermite polynomials $H_n(x)$ (see e.g. Rainville (1960) and Abramowitz and Stegun (1968, Chapter 22)) defined by the generating function
$$\exp\left(2xt - t^2\right) = \sum_{n=0}^{\infty} \frac{H_n(x)\,t^n}{n!} \tag{4.8}$$
After expanding $\exp\left(2xt - t^2\right)$ in a Taylor series and equating corresponding powers in $t$, we find that
$$H_n(x) = n! \sum_{k=0}^{[n/2]} \frac{(-1)^k (2x)^{n-2k}}{k!\,(n-2k)!} \tag{4.9}$$
with $H_0(x) = 1$. The Hermite polynomials satisfy the orthogonality relations
$$\int_{-\infty}^{\infty} e^{-x^2} H_n(x) H_m(x)\,dx = 0 \qquad m \ne n$$
$$\int_{-\infty}^{\infty} e^{-x^2} H_n^2(x)\,dx = 2^n n! \sqrt{\pi}$$
These orthogonality relations enable us to expand functions in terms of Hermite polynomials (similar to Fourier analysis). If the expansion of a function $f(x)$,
$$f(x) = \sum_{k=0}^{\infty} a_k H_k(x)$$
converges for all $x$, then it follows from the orthogonality relations that
$$a_k = \frac{1}{2^k k! \sqrt{\pi}} \int_{-\infty}^{\infty} e^{-x^2} f(x) H_k(x)\,dx$$
The joint normalized Gaussian density function can be expanded (Rainville, 1960, pp. 197-198) in terms of Hermite polynomials
$$\frac{\exp\left[-\dfrac{x^2 - 2\rho xy + y^2}{1-\rho^2}\right]}{\sqrt{1-\rho^2}} = e^{-x^2-y^2} \sum_{n=0}^{\infty} \frac{\rho^n}{2^n n!} H_n(x) H_n(y) \tag{4.10}$$
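In order for the covariance $\mathrm{Cov}[Y_1, Y_2]$ to exist, both $E[Y_1^2]$ and $E[Y_2^2]$ must be finite. Since $Y_j = T_j(X_j)$ for $j = 1, 2$, the mean is
$$E[Y_j] = \int_{-\infty}^{\infty} T_j(x) f_{X_j}(x)\,dx = \frac{1}{\sigma_X\sqrt{2\pi}} \int_{-\infty}^{\infty} T_j(x) \exp\left[-\frac{(x-\mu_X)^2}{2\sigma_X^2}\right] dx = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} T_j\left(\mu_X + \sqrt{2}\sigma_X u\right) e^{-u^2}\,du$$
Let
$$T_j\left(\mu_X + \sqrt{2}\sigma_X u\right) = \sum_{k=0}^{\infty} a_{k;j} H_k(u) \tag{4.11}$$
with
$$a_{k;j} = \frac{1}{2^k k! \sqrt{\pi}} \int_{-\infty}^{\infty} e^{-x^2}\,T_j\left(\mu_X + \sqrt{2}\sigma_X x\right) H_k(x)\,dx$$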
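then, since $H_0(x) = 1$,
$$E[Y_j] = a_{0;j}$$
The second moment $E[Y_j^2]$ follows from (2.34) as
$$E[Y_j^2] = \int_{-\infty}^{\infty} (T_j(x))^2 f_{X_j}(x)\,dx = \frac{1}{\sigma_X\sqrt{2\pi}} \int_{-\infty}^{\infty} T_j^2(x) \exp\left[-\frac{(x-\mu_X)^2}{2\sigma_X^2}\right] dx = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} T_j^2\left(\mu_X + \sqrt{2}\sigma_X u\right) e^{-u^2}\,du$$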
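Substituting (4.11) gives
$$E[Y_j^2] = \sum_{k=0}^{\infty}\sum_{m=0}^{\infty} a_{m;j}\,a_{k;j}\,\frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} H_k(u) H_m(u) e^{-u^2}\,du = \sum_{k=0}^{\infty} a_{k;j}^2\,2^k k!$$
which is convergent. Similarly, using (4.4),
$$E[Y_1Y_2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} T_1(u)\,T_2(v)\,f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} T_1(u)\,T_2(v)\,\frac{\exp\left[-\dfrac{\frac{(u-\mu_X)^2}{\sigma_X^2} - \frac{2\rho_X(u-\mu_X)(v-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(v-\mu_Y)^2}{\sigma_Y^2}}{2(1-\rho_X^2)}\right]}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_X^2}}\,du\,dv$$
$$= \frac{1}{\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} T_1\left(\mu_X + \sqrt{2}\sigma_X x\right) T_2\left(\mu_Y + \sqrt{2}\sigma_Y y\right) \frac{\exp\left[-\dfrac{x^2 - 2\rho_X xy + y^2}{1-\rho_X^2}\right]}{\sqrt{1-\rho_X^2}}\,dx\,dy$$
$$= \frac{1}{\pi\sqrt{1-\rho_X^2}} \sum_{k=0}^{\infty}\sum_{m=0}^{\infty} a_{k;1}\,a_{m;2} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H_k(x) H_m(y) \exp\left[-\frac{x^2 - 2\rho_X xy + y^2}{1-\rho_X^2}\right] dx\,dy$$
Using (4.10),
$$E[Y_1Y_2] = \frac{1}{\pi} \sum_{k=0}^{\infty}\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} a_{k;1}\,a_{m;2}\,\frac{\rho_X^n}{2^n n!} \int_{-\infty}^{\infty} e^{-x^2} H_k(x) H_n(x)\,dx \int_{-\infty}^{\infty} e^{-y^2} H_m(y) H_n(y)\,dy$$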
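Introducing the orthogonality relations for Hermite polynomials leads to
$$E[Y_1Y_2] = \sum_{n=0}^{\infty} a_{n;1}\,a_{n;2}\,2^n n!\,\rho_X^n$$
The correlation coefficient becomes
$$\rho_Y = \frac{\sum_{n=0}^{\infty} a_{n;1}a_{n;2}2^n n!\,\rho_X^n - a_{0;1}a_{0;2}}{\sqrt{\sum_{k=0}^{\infty} a_{k;1}^2 2^k k! - a_{0;1}^2}\,\sqrt{\sum_{k=0}^{\infty} a_{k;2}^2 2^k k! - a_{0;2}^2}} = \frac{\sum_{n=1}^{\infty} a_{n;1}a_{n;2}2^n n!\,\rho_X^n}{\sqrt{\sum_{k=1}^{\infty} a_{k;1}^2 2^k k!}\,\sqrt{\sum_{k=1}^{\infty} a_{k;2}^2 2^k k!}}$$
Denote $\alpha_n = a_{n;1}\sqrt{2^n n!}$ and $\beta_n = a_{n;2}\sqrt{2^n n!}$, then $\mathrm{Var}[Y_1] = \sum_{k=1}^{\infty}\alpha_k^2$ and $\mathrm{Var}[Y_2] = \sum_{k=1}^{\infty}\beta_k^2$. Since the linear correlation coefficient $\rho(X,Y)$ equals the correlation coefficient of the corresponding normalized random variables with mean zero and variance 1, as shown in Section 2.5.3, we may choose $\sum_{k=1}^{\infty}\alpha_k^2 = \sum_{k=1}^{\infty}\beta_k^2 = 1$ such that
$$\rho_Y = \sum_{n=1}^{\infty} \alpha_n\beta_n\,\rho_X^n \tag{4.12}$$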
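If $\alpha_1^2 = 1$ and $\beta_1^2 = 1$, then $|\rho_Y| = |\rho_X|$ because all other $\alpha_k$ and $\beta_k$ must then vanish. In all other cases, either $\alpha_1^2 < 1$ or $\beta_1^2 < 1$ or both, such that
$$\rho_Y = \alpha_1\beta_1\rho_X + \sum_{n=2}^{\infty} \alpha_n\beta_n\,\rho_X^n$$
and
$$\left|\sum_{n=2}^{\infty} \alpha_n\beta_n\,\rho_X^n\right| \le \sum_{n=2}^{\infty} |\alpha_n\beta_n|\,|\rho_X|^n \le \sqrt{\sum_{n=2}^{\infty} \alpha_n^2 |\rho_X|^n}\,\sqrt{\sum_{n=2}^{\infty} \beta_n^2 |\rho_X|^n}$$
where we have used the Cauchy-Schwarz inequality $\sum ab \le \sqrt{\sum a^2}\sqrt{\sum b^2}$ (see Section 5.5). By partial summation,
$$\sum_{n=2}^{\infty} \alpha_n^2 |\rho_X|^n = (1 - |\rho_X|) \sum_{n=2}^{\infty}\left(\sum_{k=2}^{n}\alpha_k^2\right)|\rho_X|^n \le (1 - |\rho_X|)\left(1 - \alpha_1^2\right)\frac{|\rho_X|^2}{1 - |\rho_X|} = \left(1 - \alpha_1^2\right)|\rho_X|^2$$
because $\sum_{k=2}^{n}\alpha_k^2 < \sum_{k=2}^{\infty}\alpha_k^2 = 1 - \alpha_1^2$. Thus
$$|\rho_Y| \le |\rho_X|\left(|\alpha_1\beta_1| + \sqrt{1-\alpha_1^2}\,\sqrt{1-\beta_1^2}\,|\rho_X|\right)$$
Finally, for $\alpha_1^2 \le 1$ and $\beta_1^2 \le 1$, the inequality $|\alpha_1\beta_1| + \sqrt{1-\alpha_1^2}\,\sqrt{1-\beta_1^2} \le 1$ holds. This proves Lancaster's theorem because $|\rho_X| \le 1$. $\square$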
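The Hermite expansion (4.10) on which the proof rests can be checked numerically. The sketch below is an illustration only: the function names, the test point $(x, y, \rho) = (0.5, -0.2, 0.3)$ and the truncation at 30 terms are assumptions made here. It evaluates $H_n(x)$ from the explicit sum (4.9) and compares both sides of (4.10).

```python
import math


def hermite(n, x):
    """Hermite polynomial H_n(x) evaluated from the explicit sum (4.9)."""
    total = 0.0
    for k in range(n // 2 + 1):
        total += ((-1) ** k * (2.0 * x) ** (n - 2 * k)
                  / (math.factorial(k) * math.factorial(n - 2 * k)))
    return math.factorial(n) * total


def gaussian_kernel(x, y, rho):
    """Left-hand side of (4.10)."""
    return (math.exp(-(x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho))
            / math.sqrt(1.0 - rho * rho))


def hermite_series(x, y, rho, terms):
    """Right-hand side of (4.10), truncated after `terms` terms."""
    s = 0.0
    for n in range(terms):
        s += rho ** n * hermite(n, x) * hermite(n, y) / (2.0 ** n * math.factorial(n))
    return math.exp(-x * x - y * y) * s


lhs = gaussian_kernel(0.5, -0.2, 0.3)
rhs = hermite_series(0.5, -0.2, 0.3, 30)
```

For moderate $|\rho|$ the series converges geometrically, so a few tens of terms suffice; for $|\rho|$ close to 1 more terms would be needed and the direct evaluation of (4.9) in floating point becomes less reliable.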
4.3.2 Boundary cases
Let us investigate some cases for special values of $\rho_X$.

1. $\rho_X = 0$. Since uncorrelated Gaussian random variables ($\rho_X = 0$) are independent, also $Y_1 = g_1^{-1}(F_{X_1}(X_1))$ and $Y_2 = g_2^{-1}(F_{X_2}(X_2))$ are independent such that $\rho_Y = 0$. Hence, uncorrelated Gaussian random variables with $\rho_X = 0$ lead to uncorrelated random variables $Y_1$ and $Y_2$ with $\rho_Y = 0$.
2. $\rho_X = 1$. Perfect positively correlated Gaussian random variables $X_1 = X_2 = X$ have joint distribution
$$f_{X_1X_2}(u,v;1) = \frac{1}{\sigma_X\sqrt{2\pi}} \exp\left[-\frac{(u-\mu_X)^2}{2\sigma_X^2}\right]\delta(u - v)$$
which follows from $\Pr[X_1 \le x_1, X_2 \le x_2] = \Pr[X \le x_1, X \le x_2] = \Pr[X \le x]$ with $x = \min(x_1, x_2)$. In that case,
$$E[Y_1Y_2] = \int_{-\infty}^{\infty} g_1^{-1}(F_X(u))\,g_2^{-1}(F_X(u))\,dF_X(u) = \int_0^1 g_1^{-1}(x)\,g_2^{-1}(x)\,dx \tag{4.13}$$
which may lead to $\rho_{Y\max} < 1$ depending on the specifics of $g_1$ and $g_2$. By transforming $x = g_1(u)$, we obtain
$$E[Y_1Y_2] = \int_{g_1^{-1}(0)}^{g_1^{-1}(1)} u\,g_2^{-1}(g_1(u))\,g_1'(u)\,du$$
which shows that, if $g_1 = g_2 = g$,
$$E[Y_1Y_2] = \int_{g^{-1}(0)}^{g^{-1}(1)} u^2 g'(u)\,du = E\left[Y^2\right]$$
Hence, if $Y_1$ and $Y_2$ have the same distribution function $g$ as $Y$, the case $\rho_X = 1$ leads to
$$\rho_Y = \frac{E\left[Y^2\right] - (E[Y])^2}{\mathrm{Var}[Y]} = 1$$

3. $\rho_X = -1$. Perfect negatively correlated Gaussian random variables $X_1 = -X_2 = X$ have joint distribution
$$f_{X_1X_2}(u,v;-1) = \frac{1}{\sigma_X\sqrt{2\pi}} \exp\left[-\frac{(u-\mu_X)^2}{2\sigma_X^2}\right]\delta(u + v)$$
which follows from the symmetry relations (4.6). In that case,
$$E[Y_1Y_2] = \int_{-\infty}^{\infty} g_1^{-1}(F_X(u))\,g_2^{-1}(F_X(-u))\,dF_X(u) = \int_{-\infty}^{\infty} g_1^{-1}(F_X(u))\,g_2^{-1}(1 - F_X(u))\,dF_X(u) = \int_0^1 g_1^{-1}(x)\,g_2^{-1}(1-x)\,dx \tag{4.14}$$
which may lead to $\rho_{Y\min} > -1$, depending on the specifics of $g_1$ and $g_2$.
4.4 Examples of the non-linear transformation method
4.4.1 Correlated uniform random variables
Let us first focus on the relation between $\rho_X$ and $\rho_U$. Since $E[U] = \frac{1}{2}$ and $\sigma_U^2 = \frac{1}{12}$, the definition of the linear correlation coefficient (2.58) gives
$$\rho_U = \frac{E[U_1U_2] - \frac{1}{4}}{\frac{1}{12}}$$
where, using (2.61),
$$E[U_1U_2] = E[F_{X_1}(X_1)F_{X_2}(X_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F_{X_1}(u)F_{X_2}(v)\,f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
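In the case of Gaussian correlated random variables specified by (3.20) and (4.4), we must evaluate the integral
$$E[U_1U_2] = \int_{-\infty}^{\infty} du \int_{-\infty}^{\infty} dv \int_{-\infty}^{u} dt\,\frac{e^{-\frac{(t-\mu_{X_1})^2}{2\sigma_{X_1}^2}}}{\sigma_{X_1}\sqrt{2\pi}} \int_{-\infty}^{v} d\tau\,\frac{e^{-\frac{(\tau-\mu_{X_2})^2}{2\sigma_{X_2}^2}}}{\sigma_{X_2}\sqrt{2\pi}} \times \frac{\exp\left[-\dfrac{\frac{(u-\mu_{X_1})^2}{\sigma_{X_1}^2} - \frac{2\rho_X(u-\mu_{X_1})(v-\mu_{X_2})}{\sigma_{X_1}\sigma_{X_2}} + \frac{(v-\mu_{X_2})^2}{\sigma_{X_2}^2}}{2(1-\rho_X^2)}\right]}{2\pi\sigma_{X_1}\sigma_{X_2}\sqrt{1-\rho_X^2}}$$
Substituting successively $u' = \frac{u-\mu_{X_1}}{\sigma_{X_1}}$, $t' = \frac{t-\mu_{X_1}}{\sigma_{X_1}}$, $v' = \frac{v-\mu_{X_2}}{\sigma_{X_2}}$ and $\tau' = \frac{\tau-\mu_{X_2}}{\sigma_{X_2}}$, we obtain
$$E[U_1U_2] = \frac{1}{(2\pi)^2} \int_{-\infty}^{\infty} du' \int_{-\infty}^{\infty} dv' \int_{-\infty}^{u'} e^{-\frac{t'^2}{2}}\,dt' \int_{-\infty}^{v'} e^{-\frac{\tau'^2}{2}}\,d\tau'\;\frac{\exp\left[-\dfrac{u'^2 - 2\rho_X u'v' + v'^2}{2(1-\rho_X^2)}\right]}{\sqrt{1-\rho_X^2}}$$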
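We now use the partial differential equation (4.5),
$$\frac{\partial}{\partial\rho_X}\left(\frac{\exp\left[-\frac{u'^2 - 2\rho_X u'v' + v'^2}{2(1-\rho_X^2)}\right]}{\sqrt{1-\rho_X^2}}\right) = \frac{\partial^2}{\partial u'\,\partial v'}\left(\frac{\exp\left[-\frac{u'^2 - 2\rho_X u'v' + v'^2}{2(1-\rho_X^2)}\right]}{\sqrt{1-\rho_X^2}}\right)$$
such that
$$\frac{\partial E[U_1U_2]}{\partial\rho_X} = \int_{-\infty}^{\infty} du' \int_{-\infty}^{\infty} dv' \int_{-\infty}^{u'} e^{-\frac{t'^2}{2}}\,dt' \int_{-\infty}^{v'} e^{-\frac{\tau'^2}{2}}\,d\tau'\;\frac{\partial^2}{\partial u'\,\partial v'}\left(\frac{\exp\left[-\frac{u'^2 - 2\rho_X u'v' + v'^2}{2(1-\rho_X^2)}\right]}{(2\pi)^2\sqrt{1-\rho_X^2}}\right)$$
Partial integration in the last integral over $v'$,
$$I_2 = \int_{-\infty}^{\infty} dv' \int_{-\infty}^{v'} e^{-\frac{\tau'^2}{2}}\,d\tau'\;\frac{\partial^2}{\partial u'\,\partial v'}\left(\exp\left[-\frac{u'^2 - 2\rho_X u'v' + v'^2}{2(1-\rho_X^2)}\right]\right) = -\int_{-\infty}^{\infty} dv\,e^{-\frac{v^2}{2}}\,\frac{\partial}{\partial u'}\left(\exp\left[-\frac{u'^2 - 2\rho_X u'v + v^2}{2(1-\rho_X^2)}\right]\right)$$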
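yields
$$\frac{\partial E[U_1U_2]}{\partial\rho_X} = -\int_{-\infty}^{\infty} dv\,e^{-\frac{v^2}{2}} \int_{-\infty}^{\infty} du' \int_{-\infty}^{u'} e^{-\frac{t'^2}{2}}\,dt'\;\frac{\partial}{\partial u'}\left(\frac{\exp\left[-\frac{u'^2 - 2\rho_X u'v + v^2}{2(1-\rho_X^2)}\right]}{(2\pi)^2\sqrt{1-\rho_X^2}}\right)$$
and similarly in the $u'$ integral,
$$\frac{\partial E[U_1U_2]}{\partial\rho_X} = \frac{1}{(2\pi)^2\sqrt{1-\rho_X^2}} \int_{-\infty}^{\infty} du \int_{-\infty}^{\infty} dv\,\exp\left[-\frac{u^2\left(2-\rho_X^2\right) - 2\rho_X uv + v^2\left(2-\rho_X^2\right)}{2(1-\rho_X^2)}\right]$$
$$= \frac{1}{(2\pi)^2\sqrt{1-\rho_X^2}} \int_{-\infty}^{\infty} dv\,\exp\left[-\frac{v^2\left(4-\rho_X^2\right)}{2\left(2-\rho_X^2\right)}\right] \int_{-\infty}^{\infty} du\,\exp\left[-\frac{2-\rho_X^2}{2(1-\rho_X^2)}\left(u - \frac{\rho_X v}{2-\rho_X^2}\right)^2\right]$$
$$= \frac{1}{(2\pi)^2\sqrt{1-\rho_X^2}}\,\sqrt{\frac{2\pi\left(2-\rho_X^2\right)}{4-\rho_X^2}}\,\sqrt{\frac{2\pi\left(1-\rho_X^2\right)}{2-\rho_X^2}} = \frac{1}{2\pi}\,\frac{1}{\sqrt{4-\rho_X^2}}$$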
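Thus, we find that
$$\frac{\partial\rho_U}{\partial\rho_X} = \frac{6}{\pi}\,\frac{1}{\sqrt{4-\rho_X^2}}$$
or that
$$\rho_U = \frac{6}{\pi}\int \frac{d\rho_X}{\sqrt{4-\rho_X^2}} + c = \frac{6}{\pi}\arcsin\left(\frac{\rho_X}{2}\right) + c$$
It remains to determine the constant $c$. We have shown in Section 4.3.2 that random variables generated from uncorrelated Gaussian random variables are also uncorrelated, implying that $\rho_U = 0$ if $\rho_X = 0$ and, hence, that the constant $c = 0$. This finally results in
$$\rho_U = \frac{6}{\pi}\arcsin\left(\frac{\rho_X}{2}\right) \tag{4.15}$$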
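Relation (4.15) lends itself to a quick simulation check. The following sketch is illustrative only: the function names, the seed, the sample size and the target value $\rho_U = 0.5$ are assumptions made here, and the correlated standard Gaussian pair is built with the square root matrix of Section 4.1.3 for unit variances.

```python
import math
import random


def std_normal_cdf(x):
    """Distribution function of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def correlated_uniform_pairs(rho_u, m, rng):
    """Generate m pairs (U1, U2) of uniform variables with target correlation rho_u."""
    rho_x = 2.0 * math.sin(math.pi * rho_u / 6.0)  # inverse of (4.15)
    s = math.sqrt(1.0 - rho_x * rho_x)
    pairs = []
    for _ in range(m):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rho_x * x1 + s * rng.gauss(0.0, 1.0)  # Corr[x1, x2] = rho_x
        pairs.append((std_normal_cdf(x1), std_normal_cdf(x2)))
    return pairs


def sample_correlation(pairs):
    """Sample linear correlation coefficient of a list of pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    v1 = sum((p[0] - m1) ** 2 for p in pairs)
    v2 = sum((p[1] - m2) ** 2 for p in pairs)
    return cov / math.sqrt(v1 * v2)


rng = random.Random(7)  # fixed seed, an arbitrary choice
uniform_pairs = correlated_uniform_pairs(0.5, 20000, rng)
rho_u_hat = sample_correlation(uniform_pairs)
```

The estimate `rho_u_hat` should be close to the requested 0.5, up to the usual Monte Carlo fluctuation.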
In summary, two uniform correlated random variables $U_1$ and $U_2$ with correlation coefficient $\rho_U$ are found by transforming two Gaussian correlated random variables $X_1$ and $X_2$ with correlation coefficient $\rho_X = 2\sin\left(\frac{\pi\rho_U}{6}\right)$. Equation (4.15) further shows that $\rho_U = \pm 1$ if $\rho_X = \pm 1$, which indicates that the whole range of the correlation coefficient $\rho_U$ is possible.

4.4.2 Correlated exponential random variables
In Section 3.2.1, we have seen that, if $U$ is a uniform random variable on $[0,1]$, $g^{-1}(U) = -\frac{1}{\alpha}\log U$ is an exponential random variable with mean $\frac{1}{\alpha}$. The correlation coefficient for two exponential random variables, $Y_1$ and $Y_2$, with mean $\frac{1}{\alpha_1}$ and $\frac{1}{\alpha_2}$ respectively, is
$$\rho_Y = \frac{E[Y_1Y_2] - \frac{1}{\alpha_1\alpha_2}}{\frac{1}{\alpha_1\alpha_2}} = E[\alpha_1\alpha_2 Y_1Y_2] - 1$$
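As above, we generate $Y_1 = -\frac{1}{\alpha_1}\log F_{X_1}(X_1)$ and $Y_2 = -\frac{1}{\alpha_2}\log F_{X_2}(X_2)$, where $X_1$ and $X_2$ are correlated Gaussian random variables with correlation coefficient $\rho_X$. Then,
$$E[\alpha_1\alpha_2 Y_1Y_2] = E[\log F_{X_1}(X_1)\,\log F_{X_2}(X_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \log F_{X_1}(u)\,\log F_{X_2}(v)\,f_{X_1X_2}(u,v;\rho_X)\,du\,dv$$
In the general case for $\rho_X \ne 0$, the previous method can be followed, which yields after substitution towards normalized variables,
$$E[\alpha_1\alpha_2 Y_1Y_2] = \int_{-\infty}^{\infty} du \int_{-\infty}^{\infty} dv\,\log\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{u} e^{-\frac{t^2}{2}}\,dt\right)\log\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{v} e^{-\frac{\tau^2}{2}}\,d\tau\right)\frac{\exp\left[-\frac{u^2 - 2\rho_X uv + v^2}{2(1-\rho_X^2)}\right]}{2\pi\sqrt{1-\rho_X^2}}$$
Unfortunately, we cannot evaluate this integral analytically.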
Let us compute the upper bound $\rho_{Y\max}$ from (4.13) with $g_1^{-1}(x) = -\frac{1}{\alpha_1}\log x$ and $g_2^{-1}(x) = -\frac{1}{\alpha_2}\log x$,
$$E[\alpha_1\alpha_2 Y_1Y_2;\rho_X = 1] = \int_0^1 \log^2 x\,dx = 2$$
and thus $\rho_{Y\max} = 1$. The lower boundary $\rho_{Y\min}$ follows from (4.14) as¹
$$E[\alpha_1\alpha_2 Y_1Y_2;\rho_X = -1] = \int_0^1 \log x\,\log(1-x)\,dx = 2 - \frac{\pi^2}{6}$$
Here, we find $\rho_{Y\min} = 1 - \frac{\pi^2}{6} = -0.644934\ldots$ In summary, exponential correlated random variables can be generated from Gaussian correlated random variables, but the correlation coefficient $\rho_Y$ is limited to the interval $\left[1 - \frac{\pi^2}{6}, 1\right]$. As explained in the introduction of Section 4.2, the exponential random variables are positive with infinite range for which not all negative correlations are possible. The analysis demonstrates that it is not possible to construct two exponential random variables with correlation coefficient smaller than $\rho_{Y\min} = 1 - \frac{\pi^2}{6} \approx -0.645$.

¹ Substituting the Taylor expansion $\log(1-x) = -\sum_{k=1}^{\infty}\frac{x^k}{k}$ gives
$$\int_0^1 \log x\,\log(1-x)\,dx = -\sum_{k=1}^{\infty}\frac{1}{k}\int_0^1 x^k \log x\,dx$$
and
$$\int_0^1 x^k \log x\,dx = -\int_0^{\infty} e^{-(k+1)u}\,u\,du = -\frac{1}{(k+1)^2}$$
Thus,
$$\int_0^1 \log x\,\log(1-x)\,dx = \sum_{k=1}^{\infty}\frac{1}{k(k+1)^2}$$
Since $\frac{1}{k(k+1)^2} = \frac{1}{k} - \frac{1}{k+1} - \frac{1}{(k+1)^2}$,
$$\int_0^1 \log x\,\log(1-x)\,dx = \sum_{k=1}^{\infty}\frac{1}{k} - \sum_{k=2}^{\infty}\frac{1}{k} - \sum_{k=2}^{\infty}\frac{1}{k^2} = 1 - \sum_{k=2}^{\infty}\frac{1}{k^2} = 2 - \sum_{k=1}^{\infty}\frac{1}{k^2} = 2 - \zeta(2) = 2 - \frac{\pi^2}{6}$$
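Both the value of the integral in the footnote and the lower bound $\rho_{Y\min}$ can be checked numerically. The sketch below is only an illustration: the function names, the seed and the sample sizes are assumptions made here. It sums the series $\sum_k \frac{1}{k(k+1)^2}$ and simulates the extreme case $\rho_X = -1$ by the pair $Y_1 = -\log U$, $Y_2 = -\log(1-U)$ of (4.14) with unit rates.

```python
import math
import random


def series_integral(terms):
    """Partial sum of sum_{k>=1} 1/(k (k+1)^2) = int_0^1 log(x) log(1-x) dx."""
    return sum(1.0 / (k * (k + 1) ** 2) for k in range(1, terms + 1))


def open_uniform(rng):
    """Uniform draw strictly inside (0, 1), so that both logarithms are finite."""
    u = rng.random()
    while u <= 0.0 or u >= 1.0:
        u = rng.random()
    return u


def antithetic_exponential_pairs(m, rng):
    """Pairs (Y1, Y2) = (-log U, -log(1 - U)): the case rho_X = -1 with unit rates."""
    pairs = []
    for _ in range(m):
        u = open_uniform(rng)
        pairs.append((-math.log(u), -math.log(1.0 - u)))
    return pairs


def sample_correlation(pairs):
    """Sample linear correlation coefficient of a list of pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    v1 = sum((p[0] - m1) ** 2 for p in pairs)
    v2 = sum((p[1] - m2) ** 2 for p in pairs)
    return cov / math.sqrt(v1 * v2)


integral_value = series_integral(100000)
rho_min_exact = 1.0 - math.pi ** 2 / 6.0
rng = random.Random(3)  # fixed seed, an arbitrary choice
rho_min_hat = sample_correlation(antithetic_exponential_pairs(20000, rng))
```

The partial sum approaches $2 - \frac{\pi^2}{6}$ with an error of order $\frac{1}{2K^2}$ after $K$ terms, and the simulated correlation should be close to $-0.645$.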
4.4.3 Correlated lognormal random variables
Two correlated lognormal random variables $Y_1$ and $Y_2$ with distribution specified in (3.43) can be constructed directly from two correlated Gaussian random variables $X_1$ and $X_2$. In particular, let $Y_1 = e^{a_1X_1}$ and $Y_2 = e^{a_2X_2}$. The explicit scaling parameters can be used to determine the desired mean. From (4.7),
$$E[Y_1Y_2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{a_1u}e^{a_2v} f_{X_1X_2}(u,v;\rho_X)\,du\,dv = \exp\left[a_1\mu_1 + a_2\mu_2 + \frac{\sigma_1^2}{2}a_1^2 + a_1a_2\rho_X\sigma_1\sigma_2 + \frac{\sigma_2^2}{2}a_2^2\right]$$
where the Laplace transform (4.2) for $n = 2$ has been used. Invoking (3.45) and (3.46) with $\mu_j \to a_j\mu_j$ and $\sigma_j^2 \to a_j^2\sigma_j^2$, the correlation coefficient $\rho_Y$ is
$$\rho_Y = \frac{e^{a_1\sigma_1a_2\sigma_2\rho_X} - 1}{\sqrt{\left(e^{a_1^2\sigma_1^2} - 1\right)\left(e^{a_2^2\sigma_2^2} - 1\right)}} \tag{4.16}$$
If at least one (but not all) of the quantities $\sigma_1$, $\sigma_2$, $a_1$ or $a_2$ grows large, $\rho_Y$ tends to zero irrespective of $\rho_X$. Thus even if $X_1$ and $X_2$ and, hence also $Y_1$ and $Y_2$, have the strongest kind of dependence possible, i.e. $\rho_X = \pm 1$, the correlation coefficient $\rho_Y$ can be made arbitrarily small. In case $a_1\sigma_1 = a_2\sigma_2 = \sigma$, (4.16) reduces to
$$\rho_Y = \frac{e^{\sigma^2\rho_X} - 1}{e^{\sigma^2} - 1}$$
We observe that $\rho_{Y\max} = 1$, while $\rho_{Y\min} = -e^{-\sigma^2} > -1$ for $\sigma > 0$; again a manifestation that for positive random variables with infinite range not all negative correlations are possible.
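The behaviour of (4.16) is easy to explore numerically. The short sketch below is only an illustration; the function name and the parameter values are assumptions made here. The arguments `s1` and `s2` stand for the products $a_1\sigma_1$ and $a_2\sigma_2$.

```python
import math


def lognormal_correlation(rho_x, s1, s2):
    """Correlation coefficient (4.16), with s1 = a1*sigma1 and s2 = a2*sigma2."""
    numerator = math.expm1(s1 * s2 * rho_x)  # exp(.) - 1, accurate for small arguments
    denominator = math.sqrt(math.expm1(s1 * s1) * math.expm1(s2 * s2))
    return numerator / denominator


rho_max_equal = lognormal_correlation(1.0, 1.0, 1.0)    # equal scales: rho_Y,max = 1
rho_min_equal = lognormal_correlation(-1.0, 2.0, 2.0)   # equal scales: -exp(-sigma^2)
rho_unequal = lognormal_correlation(1.0, 1.0, 4.0)      # unequal scales: small even for rho_X = 1
```

With unequal scales ($a_1\sigma_1 = 1$, $a_2\sigma_2 = 4$) the correlation stays below 0.02 even for $\rho_X = 1$, which illustrates how quickly $\rho_Y$ decays when one of the parameters grows.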
4.5 Linear combination of independent auxiliary random variables
In spite of the generality of the non-linear transformation method, the computational difficulty involved suggests investigating simpler methods of construction. It is instructive to consider two independent random variables $V$ and $W$ with known probability generating functions $\varphi_V(z)$ and $\varphi_W(z)$ respectively. In the discussion of the uniform random variable in Section 3.2.1, it was shown how to generate by computer an arbitrary random variable from a uniform random variable. We thus assume that $V$ and $W$ can be constructed. Let us now write $X$ and $Y$ as a linear combination of $V$ and $W$,
$$X = a_{11}V + a_{12}W + b_1$$
$$Y = a_{21}V + a_{22}W + b_2$$
which is specified by the matrix
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$
and compute the covariance defined in (2.56),
$$\mathrm{Cov}[X,Y] = E[XY] - \mu_X\mu_Y = a_{11}a_{21}\left(E\left[V^2\right] - (E[V])^2\right) + (a_{11}a_{22} + a_{12}a_{21})\,E[VW] - (a_{11}a_{22} + a_{12}a_{21})\,E[V]E[W] + a_{12}a_{22}\left(E\left[W^2\right] - (E[W])^2\right)$$
Since $V$ and $W$ are independent, $E[VW] = E[V]E[W]$, and with the definition of the variance (2.16) and denoting $\sigma_V^2 = \mathrm{Var}[V]$ and similarly for $W$, we obtain
$$\mathrm{Cov}[X,Y] = a_{11}a_{21}\sigma_V^2 + a_{12}a_{22}\sigma_W^2$$
In the same way, we find
$$\sigma_X^2 = a_{11}^2\sigma_V^2 + a_{12}^2\sigma_W^2 \qquad \sigma_Y^2 = a_{21}^2\sigma_V^2 + a_{22}^2\sigma_W^2 \tag{4.17}$$
such that the correlation coefficient, in general, becomes
$$\rho = \frac{a_{11}a_{21}\sigma_V^2 + a_{12}a_{22}\sigma_W^2}{\sqrt{a_{11}^2\sigma_V^2 + a_{12}^2\sigma_W^2}\,\sqrt{a_{21}^2\sigma_V^2 + a_{22}^2\sigma_W^2}}$$
which is independent of the constants $b_1$ and $b_2$ since for a centered moment $E\left[(X - E[X])^2\right] = E\left[(X + b - E[X + b])^2\right]$.
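In order to achieve our goal of constructing two correlated random variables $X$ and $Y$, we can choose the coefficients of the matrix $A$ to obtain an expression as simple as possible. If we choose $X = V$ or $a_{11} = 1$, $a_{12} = b_1 = 0$, the correlation coefficient reduces to
$$\rho = \frac{a_{21}\sigma_X^2}{\sigma_X\sqrt{a_{21}^2\sigma_X^2 + a_{22}^2\sigma_W^2}} = \frac{1}{\sqrt{1 + \dfrac{a_{22}^2\sigma_W^2}{a_{21}^2\sigma_X^2}}}$$
By rewriting this relation, we obtain
$$a_{21} = \pm\frac{a_{22}\sigma_W}{\sigma_X}\,\frac{\rho}{\sqrt{1-\rho^2}}$$
If we choose $a_{22} = \sqrt{1-\rho^2}$, the random variables $X$ and $Y$ are specified as
$$X = V \qquad Y = \pm\rho\,\frac{\sigma_W}{\sigma_X}\,V + \sqrt{1-\rho^2}\,W + b_2$$
and the corresponding variances (4.17) are $\sigma_X^2 = \sigma_V^2$ and $\sigma_Y^2 = \sigma_W^2$. Finally, we require that $E[W] = \mu_W = 0$, which specifies
$$b_2 = E[Y] \mp \rho\,\frac{\sigma_Y}{\sigma_X}\,E[X]$$
If $W$ is a zero mean random variable with standard deviation $\sigma_W = \sigma_Y$, the random variables $X$ and $Y$ are correlated with correlation coefficient $\rho$,
$$Y = \pm\rho\,\frac{\sigma_Y}{\sigma_X}\,X + \sqrt{1-\rho^2}\,W + \mu_Y \mp \rho\,\frac{\sigma_Y}{\sigma_X}\,\mu_X \tag{4.18}$$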
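The construction (4.18), with the positive sign, can be sketched as follows. The specific choices below are assumptions made for illustration and are not taken from the text: $X$ is exponential with rate 1 (so that $\mu_X = \sigma_X = 1$), $W$ is uniform on $[-\sqrt{3}\sigma_Y, \sqrt{3}\sigma_Y]$ (zero mean, standard deviation $\sigma_Y$), and the target values are $\mu_Y = 5$, $\sigma_Y = 2$ and $\rho = 0.6$.

```python
import math
import random


def linear_combination_pairs(rho, mu_x, sigma_x, mu_y, sigma_y, draw_x, draw_w, m):
    """Pairs (X, Y) with Y built from X and an independent W according to (4.18)."""
    slope = rho * sigma_y / sigma_x
    pairs = []
    for _ in range(m):
        x = draw_x()
        w = draw_w()  # independent of x, zero mean, standard deviation sigma_y
        y = slope * x + math.sqrt(1.0 - rho * rho) * w + mu_y - slope * mu_x
        pairs.append((x, y))
    return pairs


def sample_correlation(pairs):
    """Sample linear correlation coefficient of a list of pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    v1 = sum((p[0] - m1) ** 2 for p in pairs)
    v2 = sum((p[1] - m2) ** 2 for p in pairs)
    return cov / math.sqrt(v1 * v2)


rng = random.Random(11)  # fixed seed, an arbitrary choice
mu_y, sigma_y, rho = 5.0, 2.0, 0.6
half_width = math.sqrt(3.0) * sigma_y  # uniform on [-h, h] has variance h^2 / 3
xy_pairs = linear_combination_pairs(
    rho, 1.0, 1.0, mu_y, sigma_y,
    draw_x=lambda: rng.expovariate(1.0),
    draw_w=lambda: rng.uniform(-half_width, half_width),
    m=20000)
rho_hat = sample_correlation(xy_pairs)
mean_y = sum(p[1] for p in xy_pairs) / len(xy_pairs)
std_y = math.sqrt(sum((p[1] - mean_y) ** 2 for p in xy_pairs) / len(xy_pairs))
```

The sample mean, standard deviation and correlation of the generated pairs should approach $\mu_Y$, $\sigma_Y$ and $\rho$. The distribution of $Y$ itself, however, is whatever the mixture of $X$ and $W$ produces; only its first two moments are prescribed in this sketch.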
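In the sequel, we take the positive sign for $\rho$. Let us now investigate what happens with the distribution functions of $X$ and $Y$. Using the pgfs for continuous random variables $\varphi_X(z) = E\left[e^{-zX}\right]$ and $\varphi_Y(z) = E\left[e^{-zY}\right]$, the last relation (4.18) becomes
$$E\left[e^{-zY}\right] = e^{-z\left(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X\right)}\,E\left[e^{-z\rho\frac{\sigma_Y}{\sigma_X}X}\right] E\left[e^{-z\sqrt{1-\rho^2}\,W}\right]$$
because $V = X$ and $W$ are independent, or
$$\varphi_Y(z) = e^{-z\left(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X\right)}\,\varphi_X\left(\rho\,\frac{\sigma_Y}{\sigma_X}\,z\right)\varphi_W\left(\sqrt{1-\rho^2}\,z\right) \tag{4.19}$$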
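In order to produce two random variables $X$ and $Y$ that are correlated with correlation coefficient $\rho$, the pgf of the zero mean random variable $W$ with variance $\sigma_Y^2$ must obey
$$\varphi_W(z) = e^{\left(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X\right)\frac{z}{\sqrt{1-\rho^2}}}\;\frac{\varphi_Y\left(\dfrac{z}{\sqrt{1-\rho^2}}\right)}{\varphi_X\left(\rho\,\dfrac{\sigma_Y}{\sigma_X}\,\dfrac{z}{\sqrt{1-\rho^2}}\right)} \tag{4.20}$$
which can be written in terms of the translated random variables $Y' = Y - \mu_Y$ and $X' = X - \mu_X$,
$$\varphi_W(z) = \frac{\varphi_{Y'}\left(\dfrac{z}{\sqrt{1-\rho^2}}\right)}{\varphi_{X'}\left(\rho\,\dfrac{\sigma_Y}{\sigma_X}\,\dfrac{z}{\sqrt{1-\rho^2}}\right)}$$
This form shows that, if $X'$ and $Y'$ have the same distribution, $W$ possesses, in general, a different distribution. Only the pgf of a Gaussian (with zero mean) obeys the functional equation
$$f(z) = \frac{f\left(\dfrac{z}{\sqrt{1-\rho^2}}\right)}{f\left(\dfrac{\rho z}{\sqrt{1-\rho^2}}\right)}$$
The joint probability generating function follows from (2.61) as
$$\varphi_{XY}(z_1,z_2) = E\left[e^{-z_1X - z_2Y}\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-z_1x - z_2y} f_{XY}(x,y)\,dx\,dy$$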
and the inverse is i[\ ({> |) =
1 (2l)2
Z
f1 +l" Z f2 3l"
f1 3l"
f2 3l"
3"
h}1 {+}2 | *[\ (}1 > }2 )g}1 g}2
(4.21)
Using (4.18), we have
$$\varphi_{XY}(z_1,z_2) = e^{-z_2\left(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X\right)}\,E\left[e^{-\left(z_1 + z_2\rho\frac{\sigma_Y}{\sigma_X}\right)X}\right] E\left[e^{-z_2\sqrt{1-\rho^2}\,W}\right] = e^{-z_2\left(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X\right)}\,\varphi_X\left(z_1 + z_2\rho\,\frac{\sigma_Y}{\sigma_X}\right)\varphi_W\left(z_2\sqrt{1-\rho^2}\right) \tag{4.22}$$
Introduced into the complex double integral (4.21), the joint probability density function of the two correlated random variables can be computed. The main deficiency of the linear combination method is the implicit assumption that any joint distribution function $f_{XY}(x,y)$ can be constructed from two independent random variables $X$ and $W$. The corresponding joint pgf (4.22) possesses a product form that cannot always be made compatible with the form of an arbitrary pgf $\varphi_{XY}(z_1,z_2)$. The examples below illustrate this deficiency.
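4.5.1 Correlated Gaussian random variables
If $X$ and $Y$ are Gaussian random variables with Laplace transform $E\left[e^{-zY}\right]$ given in (3.22), the expression (4.20) for $\varphi_W(z)$ becomes
$$\varphi_W(z) = \exp\left[\frac{\sigma_Y^2 z^2}{2}\right]$$
which shows that $W$ is also a Gaussian random variable with mean $\mu_W = 0$ and standard deviation $\sigma_W = \sigma_Y$. Further, the joint pgf follows from (4.22) as
$$E\left[e^{-z_1X - z_2Y}\right] = \exp\left[-z_2\mu_Y - \mu_Xz_1 + \frac{\sigma_X^2}{2}z_1^2 + z_1z_2\rho\,\sigma_Y\sigma_X + \frac{\sigma_Y^2}{2}z_2^2\right]$$
Since
$$\frac{\sigma_X^2}{2}z_1^2 + z_1z_2\rho\,\sigma_Y\sigma_X + \frac{\sigma_Y^2}{2}z_2^2 = \frac{1}{2}\begin{pmatrix} z_1 & z_2 \end{pmatrix}\begin{pmatrix} \sigma_X^2 & \rho\,\sigma_X\sigma_Y \\ \rho\,\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$$
formula (4.2) indicates that $E\left[e^{-z_1X - z_2Y}\right]$ is the two-dimensional pgf of a joint Gaussian with pdf (4.4). The linear combination method thus provides the exact results for correlated Gaussian random variables.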
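4.5.2 Correlated exponential random variables
Let $X$ and $Y$ be two correlated, exponential random variables with rate $\alpha_x$ and $\alpha_y$. Recall that $E[X] = \sigma_X = \frac{1}{\alpha_x}$. Using the Laplace transform (3.16) in (4.20), we obtain
$$\varphi_W(z) = e^{\frac{1-\rho}{\alpha_y\sqrt{1-\rho^2}}\,z}\;\frac{1 + \dfrac{\rho z}{\alpha_y\sqrt{1-\rho^2}}}{1 + \dfrac{z}{\alpha_y\sqrt{1-\rho^2}}}$$
The corresponding probability distribution function follows from (2.38) as
$$F_W(t) = \frac{1}{2\pi i}\int_{c - i\infty}^{c + i\infty} \frac{1 + \dfrac{\rho z}{\alpha_y\sqrt{1-\rho^2}}}{1 + \dfrac{z}{\alpha_y\sqrt{1-\rho^2}}}\;\frac{e^{z\left(t + \frac{1-\rho}{\alpha_y\sqrt{1-\rho^2}}\right)}}{z}\,dz \qquad c > 0$$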
Define the normalized time $T = \frac{1}{\alpha_y\sqrt{1-\rho^2}}$, then
$$F_W(t) = \frac{1}{2\pi i}\int_{c - i\infty}^{c + i\infty} \frac{1 + \rho Tz}{1 + zT}\;\frac{e^{z(t + (1-\rho)T)}}{z}\,dz$$
Since $t + (1-\rho)T > 0$, the contour can be closed over the negative $\mathrm{Re}(z)$ plane encircling the poles at $z = -\frac{1}{T}$ and $z = 0$. By Cauchy's residue theorem,
$$F_W(t) = \lim_{z \to -\frac{1}{T}} \frac{(1 + \rho Tz)\left(z + \frac{1}{T}\right)}{(1 + zT)\,z}\,e^{z(t + (1-\rho)T)} + \lim_{z \to 0} \frac{(1 + \rho Tz)\,z}{(1 + zT)\,z}\,e^{z(t + (1-\rho)T)}$$
[...]

5 Inequalities

[...]
$$\min(x,y) \le \sqrt{xy} \le \frac{x+y}{2} \le \max(x,y) \tag{5.1}$$
which directly follows from $\left(\sqrt{x} - \sqrt{y}\right)^2 \ge 0$, they masterly extend this relation to the theorem of the arithmetic and geometric mean in several real variables $x_k$,
$$\prod_{k=1}^{n} x_k^{q_k} \le \sum_{k=1}^{n} q_k x_k \tag{5.2}$$
where $\sum_{k=1}^{n} q_k = 1$. They further move to the inequalities of Cauchy-Schwarz, of Hölder, of Minkowski and many more. Only a few inequalities are reviewed here and we recommend the classic treatise on inequalities by Hardy, Littlewood and Polya for those who search for more depth, elegance and insight.
¹ The arithmetic-geometric mean $M(x,y)$ is the limit for $n \to \infty$ of the recursion $x_n = \frac{1}{2}(x_{n-1} + y_{n-1})$, which is an arithmetic mean, and $y_n = \sqrt{x_{n-1}y_{n-1}}$, which is a geometric mean, with initial values $x_0 = x$ and $y_0 = y$. Gauss's famous discovery on intriguing properties of $M(x,y)$ (which lead e.g. to very fast converging series for computing $\pi$) is narrated in a paper by Almkvist and Berndt (1988).

5.1 The minimum (maximum) and infimum (supremum)
Since these concepts will be frequently used, we explain the difference by concentrating on the minimum and infimum (the maximum and the supremum follow analogously). Let $\Omega$ be a non-empty subset of $\mathbb{R}$. The subset $\Omega$ is said to be bounded from below by $M$ if there exists a number $M$ such that, for all $x \in \Omega$ holds that $x \ge M$. The largest lower bound (largest number $M$) is called the infimum and is denoted by $\inf(\Omega)$. Further, if there exists an element $m \in \Omega$ such that $m \le x$ for all $x \in \Omega$, then this element $m$ is called the minimum and is denoted by $\min(\Omega)$. If the minimum $\min(\Omega)$ exists, then $\min(\Omega) = \inf(\Omega)$. However, the minimum does not always exist. The classical example is the open interval $(a,b)$, where $\inf((a,b)) = a$, but the minimum does not exist because $a \notin (a,b)$. On the other hand, for the closed interval $[a,b]$, we have that $\inf([a,b]) = \min([a,b]) = a$. This example also illustrates that every finite non-empty subset of $\mathbb{R}$ has a minimum.

5.2 Continuous convex functions
A continuous function $f(x)$ that satisfies for $u$ and $v$ belonging to an interval $I$,
$$f\left(\frac{u+v}{2}\right) \le \frac{f(u) + f(v)}{2}$$
is called convex in that interval $I$. If $f$ is convex, $-f$ is concave. Hardy et al. (1999, Section 3.6) demonstrate that this condition is fundamental from which the more general condition²
$$f\left(\sum_{k=1}^{n} q_k x_k\right) \le \sum_{k=1}^{n} q_k f(x_k) \tag{5.3}$$
where $\sum_{k=1}^{n} q_k = 1$, can be deduced. Moreover, they show that a convex function is either very regular or very irregular and that a convex function that is not "entirely irregular" is necessarily continuous. Current textbooks, in particular the book by Boyd and Vandenberghe (2004), usually start with the definition of convexity from (5.3) in case $n = 2$ where $q_1 = 1 - q_2 = q$ and $0 \le q \le 1$ as
$$f(qu + (1-q)v) \le q f(u) + (1-q) f(v) \tag{5.4}$$
where $u$ and $v$ can be vectors in an $m$-dimensional space. Geometrically with $m = 1$ as illustrated in Fig. 5.1, relation (5.4) shows that each point on the chord between $(u, f(u))$ and $(v, f(v))$ lies above the curve $f$ in the interval $I$. The more general form (5.3) asserts that the centre of gravity of any number of arbitrarily weighted points of the curve lies above or on the curve.

² The convexity concept can be generalized (Hardy et al., 1999, Section 98) to several variables, in which case the condition (5.3) becomes
$$f\left(\sum_k q_k x_k, \sum_k q_k y_k\right) \le \sum_k q_k f(x_k, y_k)$$

Fig. 5.1. The function $f$ is convex between $u$ and $v$ (the plot shows the curve $f(x)$, the chord $c_1$ over $(a,b)$ and the chord $c_2$ over $(a',b')$).

Figure 5.1 illustrates that for any convex function $f$ and points $a, a', b, b' \in [u,v]$ such that $a \le a' < b'$ and $a < b \le b'$, the chord $c_1$ over $(a,b)$ has a smaller slope than the chord $c_2$ over $(a',b')$ or, equivalently,
$$\frac{f(b) - f(a)}{b - a} \le \frac{f(b') - f(a')}{b' - a'}$$
Suppose that $f(x)$ is twice differentiable in the interval $I$, then a necessary and sufficient condition for convexity is $f''(x) \ge 0$ for each $x \in I$. This theorem is proved in Hardy et al. (1999, pp. 76-77). Moreover, they prove that the equality in (5.3) can only occur if $f(x)$ is linear. Applied to probability, relation (5.3) with $q_k = \Pr[X = k]$ and $x_k = k$ is written with (2.12) as
$$f(E[X]) \le E[f(X)] \tag{5.5}$$
and is known as Jensen's inequality. Jensen's inequality (5.5) also holds for continuous random variables. Indeed, if $f$ is differentiable and convex, then $f(x) - f(y) \ge f'(y)(x - y)$. Substitute $x$ by the random variable $X$ and $y = E[X]$, then
$$f(X) - f(E[X]) \ge f'(E[X])(X - E[X])$$
After applying the expectation operator to both sides, we obtain (5.5). An important application of Jensen's inequality is obtained for $f(x) = e^{-zx}$ with real $z$ as
$$e^{-zE[X]} \le E\left[e^{-zX}\right] = \varphi_X(z)$$
Any probability generating function $\varphi_X(z)$ is, for real $z$, bounded from below by $e^{-zE[X]}$. A continuous analog of (5.3) with $f(x) = e^x$ (and similarly for $f(x) = -\log x$),
$$\exp\left[\frac{1}{v-u}\int_u^v f(x)\,dx\right] \le \frac{1}{v-u}\int_u^v e^{f(x)}\,dx$$
can be regarded as a generalization of the inequality between arithmetic and geometric mean.
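The lower bound $e^{-zE[X]} \le E\left[e^{-zX}\right]$ can be illustrated with a distribution whose transform is known in closed form. The sketch below is an assumption-laden example: it takes a binomial random variable with $n = 20$ and $p = 0.3$, for which $E\left[e^{-zX}\right] = (1 - p + pe^{-z})^n$, and evaluates both sides at a handful of real $z$; the function names and values are chosen here.

```python
import math


def binomial_transform(z, n, p):
    """E[exp(-z X)] for a binomial(n, p) random variable X."""
    return (1.0 - p + p * math.exp(-z)) ** n


def jensen_lower_bound(z, mean):
    """exp(-z E[X]), the lower bound that follows from Jensen's inequality."""
    return math.exp(-z * mean)


n, p = 20, 0.3
checks = []
for z in (-1.0, -0.1, 0.0, 0.5, 2.0):
    checks.append((z, jensen_lower_bound(z, n * p), binomial_transform(z, n, p)))
```

Each tuple in `checks` holds $z$, the bound and the transform; the bound never exceeds the transform, with equality at $z = 0$.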
5.3 Inequalities deduced from the Mean Value Theorem
The mean value theorem (Whittaker and Watson, 1996, p. 65) states that if $g(x)$ is continuous on $x \in [a,b]$, there exists a number $\xi \in [a,b]$ such that
$$\int_a^b g(u)\,du = (b - a)\,g(\xi)$$
or, alternatively, if $f(x)$ is differentiable on $[a,b]$, then
$$f(b) - f(a) = (b - a)\,f'(\xi) \tag{5.6}$$
The equivalence follows by putting $f(x) = \int_a^x g(u)\,du$. It is convenient to rewrite this relation for $0 \le \theta \le 1$ as
$$f(x + h) - f(x) = h\,f'(x + \theta h)$$
In this form, the mean value theorem is nothing else than a special case for $n = 1$ of Taylor's theorem (Whittaker and Watson, 1996, p. 96),
$$f(x + h) - f(x) = \sum_{k=1}^{n-1} \frac{f^{(k)}(x)}{k!}\,h^k + \frac{h^n}{n!}\,f^{(n)}(x + \theta h) \tag{5.7}$$
An important application of Taylor's Theorem (or of the mean value theorem) to the exponential function gives a list of inequalities. First,
$$e^x = 1 + x + \frac{x^2}{2}\,e^{\theta x}$$
and, since $e^{\theta x} > 0$ for any finite $x$, we have for any $x \ne 0$,
$$e^x > 1 + x \tag{5.8}$$
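A direct generalization follows from Taylor's Theorem (5.7),
$$e^x = \sum_{k=0}^{n-1} \frac{x^k}{k!} + \frac{x^n}{n!}\,e^{\theta x}$$
such that, for $n = 2m$ and any $x \ne 0$,
$$e^x > \sum_{k=0}^{2m-1} \frac{x^k}{k!}$$
and, for $n = 2m + 1$,
$$e^x > \sum_{k=0}^{2m} \frac{x^k}{k!} \quad (x > 0) \qquad\qquad e^x < \sum_{k=0}^{2m} \frac{x^k}{k!} \quad (x < 0)$$
Second, estimates of the product $\prod_{k=0}^{n}(1 + a_kx)$ where $a_kx \ne 0$ are obtained from (5.8) as³
$$\prod_{k=0}^{n}(1 + a_kx) < \exp\left(\sum_{k=0}^{n} a_kx\right)$$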
³ A tighter bound is obtained if all $a_k > 0$ (e.g. $a_k$ is a probability). The above relation indicates that $g(x) = \prod_{k=0}^{n}(1 + a_kx)$ is smaller than $f(x) = \exp\left(x\sum_{k=0}^{n} a_k\right)$ for any $x \ne 0$ and $g(0) = f(0) = 1$. Further, from $(1 + a_kx) < e^{a_kx}$ it can be verified that, for all Taylor coefficients $1 < k \le n$ holds that $0 < g_k \le f_k$ and $g_2 < f_2$ such that $g(x) = \sum_{k=0}^{n} g_kx^k < \sum_{k=0}^{n} f_kx^k < \sum_{k=0}^{\infty} f_kx^k$ for $x > 0$. Thus, for $x = 1$, we have $g(1) < \sum_{k=0}^{n} f_k$ or
$$\prod_{k=0}^{n}(1 + a_k) < \sum_{k=0}^{n} \frac{1}{k!}\left(\sum_{j=0}^{n} a_j\right)^k$$

5.4 The Markov and Chebyshev inequalities
Consider first a non-negative random variable $X$. The expectation reads
$$E[X] = \int_0^{\infty} x f_X(x)\,dx = \int_0^a x f_X(x)\,dx + \int_a^{\infty} x f_X(x)\,dx \ge \int_a^{\infty} x f_X(x)\,dx \ge a\int_a^{\infty} f_X(x)\,dx = a\Pr[X \ge a]$$
Hence, we obtain the Markov inequality
$$\Pr[X \ge a] \le \frac{E[X]}{a} \tag{5.9}$$
Another proof of the Markov inequality follows after taking the expectation of the inequality d1[Dd [ for [ 0. The restriction to non-negative random variables can be circumvented by considering the random variable [ = (\ H [\ ])2 and d = w2 in (5.9), h i i H (\ H [\ ])2 h Var [\ ] = Pr (\ H [\ ])2 w2 w2 w2 From this, the Chebyshev inequality follows as
Pr [|[ H [[]| w]
2 w2
(5.10)
The Chebyshev inequality quantifies the spread of [ around the mean H [[]. The smaller , the more concentrated [ is around the mean. Further extensions of the Markov inequality use the equivalence between the events {[ d} / {j([) j(d)} where j is a monotonously increasing function. Hence, (5.9) becomes Pr [[ d]
H [j([)] j(d)
H [[ n ] For example, if j({) = {n , then Pr [[ d] dn . An interesting application of this idea is based on the equivalence of the events {[ H [[]+w} / {hx[ hx(H[[]+w) } provided x 0. For x 0, i h £ ¤ Pr [[ H [[] + w] = Pr hx[ hx(H[[]+w) h3x(H[[]+w) H hx[ (5.11)
where in the last step Markov’s inequality ¤ has been used. If the gener£ (5.9) ating function or Laplace transform H hx[ is known, the sharpest bound is obtained by the minimizer xW in x of the right-hand side because (5.11) holds for any x A 0. In Section 5.7, we show that this minimizer xW obeying Re x A 0 indeed exists for probability generating functions. The resulting inequality h W i W Pr [[ H [[] + w] h3x (H[[]+w) H hx [ (5.12) is called the Cherno bound.
The Chernoff bound of the binomial distribution

Let $X$ denote a binomial random variable with probability generating function given by (3.2) such that $E[e^{uX}] = E[(e^u)^X] = (q + p e^u)^n$. Then, with $E[X] = np$,
\[ e^{-u(E[X]+t)} E\left[ e^{uX} \right] = e^{-u(np+t) + n \log(q + p e^u)} \]
Provided $\left. \frac{d^2}{du^2} e^{-u(E[X]+t)} E\left[ e^{uX} \right] \right|_{u=u^*} > 0$, the minimum $u^*$ is solution of
\[ \frac{d}{du} e^{-u(E[X]+t)} E\left[ e^{uX} \right] = 0 \]
Explicitly,
\[ \frac{d}{du} e^{-u(E[X]+t)} E\left[ e^{uX} \right] = e^{-u(np+t) + n \log(q + p e^u)} \left( -(np + t) + \frac{n p e^u}{q + p e^u} \right) \]
from which $u^*$ follows using $q = 1 - p$ as
\[ u^* = \log\left( \frac{npq + qt}{npq - pt} \right) \]
Hence,
\[ e^{-u^*(E[X]+t)} E\left[ e^{u^* X} \right] = \frac{\left( 1 - \frac{t}{nq} \right)^{t - nq}}{\left( 1 + \frac{t}{np} \right)^{t + np}} \]
For large $n$, but $p$ and $t$ fixed, we observe$^4$ that $e^{-u^*(E[X]+t)} E\left[ e^{u^* X} \right] = e^{-\frac{t^2}{2npq}} \left( 1 + O\left( \frac{1}{n} \right) \right)$. Since $\mathrm{Var}[X] = npq$ and by denoting $y^2 = \frac{t^2}{\mathrm{Var}[X]}$, we find that the asymptotic regime for large $n$,
\[ \Pr\left[ \frac{|X - E[X]|}{\sqrt{\mathrm{Var}[X]}} \ge y \right] \le e^{-y^2/2} \tag{5.13} \]
is in agreement with the Central Limit Theorem 6.3.1. The corresponding Chebyshev inequality,
\[ \Pr\left[ \frac{|X - E[X]|}{\sqrt{\mathrm{Var}[X]}} \ge y \right] \le \frac{1}{y^2} \]
is considerably less tight for the binomial distribution than the Chernoff bound (5.13). More advanced and sharper inequalities than that of Chebyshev are surveyed by Janson (2002).

$^4$ Write
\[ e^{-u^*(E[X]+t)} E\left[ e^{u^* X} \right] = \exp\left[ (t - nq) \log\left( 1 - \frac{t}{nq} \right) - (t + np) \log\left( 1 + \frac{t}{np} \right) \right] \]
and use the Taylor expansion of $\log(1 \pm x)$ around $x = 0$.
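The optimized Chernoff bound for the binomial distribution can be compared numerically with the exact upper tail and with the Chebyshev bound. The sketch below is only an illustration under assumed parameters (`n`, `p`, `t` are arbitrary); `gaussian_form` is the leading term $\exp(-t^2/(2npq))$ that results from expanding the two logarithms to second order.

```python
import math

def chernoff_binomial(n, p, t):
    """Optimized Chernoff bound on Pr[X >= np + t] for X ~ binomial(n, p), 0 < t < nq."""
    q = 1.0 - p
    log_bound = ((t - n * q) * math.log(1.0 - t / (n * q))
                 - (t + n * p) * math.log(1.0 + t / (n * p)))
    return math.exp(log_bound)

def exact_upper_tail(n, p, t):
    """Exact Pr[X >= np + t] by summing the binomial probabilities."""
    k_min = math.ceil(n * p + t)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

n, p, t = 200, 0.3, 15.0            # illustrative parameters (assumption)
exact = exact_upper_tail(n, p, t)
chernoff = chernoff_binomial(n, p, t)
chebyshev = n * p * (1 - p) / t**2
gaussian_form = math.exp(-t**2 / (2 * n * p * (1 - p)))

print(exact, chernoff, gaussian_form, chebyshev)
```

For these values the exact tail lies below the Chernoff bound, which in turn lies well below the Chebyshev bound.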
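5.5 The Hölder, Minkowski and Young inequalities

The Hölder inequality is
\[ \sum a^{\alpha} b^{\beta} \cdots z^{\lambda} \le \left( \sum a \right)^{\alpha} \left( \sum b \right)^{\beta} \cdots \left( \sum z \right)^{\lambda}, \qquad \alpha + \beta + \cdots + \lambda = 1 \]
Let $a^{\alpha} = X$ and $b^{\beta} = Y$, and further $p = \frac{1}{\alpha} > 1$ and $q = \frac{1}{\beta} > 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$, then we obtain as a frequently used application,
\[ E[XY] \le \left( E[X^p] \right)^{1/p} \left( E[Y^q] \right)^{1/q} \tag{5.14} \]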
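The Hölder inequality can be deduced from the basic convexity inequality (5.4). Since $-\log x$ is a convex function for real $x > 0$, the basic convexity inequality (5.4) is with $0 \le \theta \le 1$,
\[ \log(\theta u + (1 - \theta) v) \ge \theta \log(u) + (1 - \theta) \log(v) \]
After exponentiation, we obtain for $u, v > 0$ a more general inequality than (5.1), which corresponds to $\theta = \frac{1}{2}$,
\[ u^{\theta} v^{1-\theta} \le \theta u + (1 - \theta) v \tag{5.15} \]
Substitute $u = \frac{|x_j|^p}{\sum_{i=1}^{n} |x_i|^p}$ and $v = \frac{|y_j|^q}{\sum_{i=1}^{n} |y_i|^q}$, then
\[ \left( \frac{|x_j|^p}{\sum_{i=1}^{n} |x_i|^p} \right)^{\theta} \left( \frac{|y_j|^q}{\sum_{i=1}^{n} |y_i|^q} \right)^{1-\theta} \le \theta \frac{|x_j|^p}{\sum_{i=1}^{n} |x_i|^p} + (1 - \theta) \frac{|y_j|^q}{\sum_{i=1}^{n} |y_i|^q} \]
and summing over all $j$ yields
\[ \sum_{j=1}^{n} |x_j|^{p\theta} |y_j|^{q(1-\theta)} \le \left( \sum_{j=1}^{n} |x_j|^p \right)^{\theta} \left( \sum_{j=1}^{n} |y_j|^q \right)^{1-\theta} \]
By choosing $p = \frac{1}{\theta}$ and $q = \frac{1}{1-\theta}$, we arrive at the Hölder inequality with $p > 1$ and $\frac{1}{p} + \frac{1}{q} = 1$,
\[ \sum_{j=1}^{n} |x_j y_j| \le \left( \sum_{j=1}^{n} |x_j|^p \right)^{\frac{1}{p}} \left( \sum_{j=1}^{n} |y_j|^q \right)^{\frac{1}{q}} \tag{5.16} \]
A special important case of the Hölder inequality (5.14) for $p = q = 2$ is the Cauchy-Schwarz inequality,
\[ \left( E[XY] \right)^2 \le E\left[ X^2 \right] E\left[ Y^2 \right] \tag{5.17} \]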
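The algebraic form of the Hölder inequality, and the Cauchy-Schwarz case $p = q = 2$, can be checked on arbitrary data. The helper name `holder_sides` and the random test vectors below are assumptions made for this sketch.

```python
import random

def holder_sides(x, y, p):
    """Return (lhs, rhs) of sum |x_j y_j| <= (sum |x_j|^p)^(1/p) (sum |y_j|^q)^(1/q)."""
    q = p / (p - 1.0)               # conjugate exponent, 1/p + 1/q = 1
    lhs = sum(abs(a * b) for a, b in zip(x, y))
    rhs = (sum(abs(a)**p for a in x)**(1.0 / p)
           * sum(abs(b)**q for b in y)**(1.0 / q))
    return lhs, rhs

rng = random.Random(1)              # fixed seed, arbitrary test data
x = [rng.uniform(-1.0, 1.0) for _ in range(50)]
y = [rng.uniform(-1.0, 1.0) for _ in range(50)]

# p = 2 is the Cauchy-Schwarz case.
results = {p: holder_sides(x, y, p) for p in (1.5, 2.0, 3.0)}
for p, (lhs, rhs) in results.items():
    print(p, lhs, rhs)
```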
It is of interest to mention that the Hölder inequality is of a general type in the following sense (Hardy et al., 1999, Theorem 101 (p. 82)). Suppose that $f(x)$ is convex (such that the inverse $g(x) = f^{-1}(x)$ is also convex) and that $f(0) = 0$. If $F(x) = \int_0^x f(u)\,du$ and $G(x) = \int_0^x g(u)\,du$, and if
\[ \sum_{k=1}^{n} q_k a_k b_k \le F^{-1}\left( \sum_{k=1}^{n} q_k F(a_k) \right) G^{-1}\left( \sum_{k=1}^{n} q_k G(b_k) \right) \]
with $\sum_{k=1}^{n} q_k = 1$ holds for all positive $a_k$ and $b_k$, then $f(x) = x^r$ and the above inequality is Hölder's inequality.

The next inequalities are of a different type. For $p > 1$, the Minkowski inequality is
\[ \left( E\left[ |X + Y|^p \right] \right)^{1/p} \le \left( E\left[ |X|^p \right] \right)^{1/p} + \left( E\left[ |Y|^p \right] \right)^{1/p} \tag{5.18} \]
or, written algebraically,
\[ \left( \sum_{j=1}^{n} |x_j + y_j|^p \right)^{\frac{1}{p}} \le \left( \sum_{j=1}^{n} |x_j|^p \right)^{\frac{1}{p}} + \left( \sum_{j=1}^{n} |y_j|^p \right)^{\frac{1}{p}} \]
Suppose that $f(x)$ is continuous and strictly increasing for $x \ge 0$ and $f(0) = 0$. Then the inverse function $g(x) = f^{-1}(x)$ satisfies the same conditions. The Young inequality states that for $a \ge 0$ and $b \ge 0$ holds that
\[ ab \le \int_0^a f(u)\,du + \int_0^b g(u)\,du \tag{5.19} \]
with equality only if $b = f(a)$. The Young inequality follows by geometrical consideration. The first integral is the area under the curve $y = f(x)$ from $[0, a]$, while the second is the area under the curve $x = g(y) = f^{-1}(y)$ from $[0, b]$.

Applications of the Cauchy-Schwarz inequality

1. We will demonstrate that both the generating function $\varphi_X(z) = E\left[ e^{-zX} \right]$ and its logarithm $L_X(z) = \log(\varphi_X(z))$ are convex functions of $z$. First, the second derivative is continuous and non-negative because $\varphi''_X(z) = E\left[ X^2 e^{-zX} \right] \ge 0$. Further, since
\[ L''_X(z) = \frac{\varphi_X(z) \varphi''_X(z) - \left( \varphi'_X(z) \right)^2}{\varphi_X^2(z)} \]
it remains to show that $\varphi_X(z) \varphi''_X(z) - \left( \varphi'_X(z) \right)^2 \ge 0$. From the Cauchy-Schwarz inequality (5.17) applied to $\varphi'_X(z) = -E\left[ X e^{-zX} \right]$ with $X \to e^{-\frac{z}{2} X}$ and $Y = X e^{-\frac{z}{2} X}$, we obtain
\[ \left( \varphi'_X(z) \right)^2 = \left( E\left[ X e^{-zX} \right] \right)^2 \le E\left[ e^{-zX} \right] E\left[ X^2 e^{-zX} \right] = \varphi_X(z) \varphi''_X(z) \]
Hence, $L''_X(z) \ge 0$.
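2. Let $Y = 1_{X > 0}$ in (5.17) while $X$ is a non-negative random variable, then with (2.13),
\[ \left( E[X] \right)^2 \le E\left[ X^2 \right] E\left[ 1_{X > 0} \right] = E\left[ X^2 \right] \left( 1 - \Pr[X = 0] \right) \]
such that an upper bound for $\Pr[X = 0]$ is obtained,
\[ \Pr[X = 0] \le 1 - \frac{\left( E[X] \right)^2}{E\left[ X^2 \right]} \tag{5.20} \]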
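As an illustration of (5.20), consider a Poisson random variable with mean `lam` (an example chosen here, not taken from the text). Then $E[X^2] = \lambda + \lambda^2$, so the bound becomes $1/(1 + \lambda)$, to be compared with the exact value $e^{-\lambda}$.

```python
import math

def zero_prob_bound(mean, second_moment):
    """Upper bound (5.20) on Pr[X = 0] for a non-negative random variable."""
    return 1.0 - mean**2 / second_moment

lam = 1.5                                   # illustrative Poisson mean (assumption)
exact = math.exp(-lam)                      # Pr[X = 0] for a Poisson variable
bound = zero_prob_bound(lam, lam + lam**2)  # E[X^2] = Var[X] + (E[X])^2

print(exact, bound)
```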
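5.6 The Gauss inequality

In this section, we consider a continuous random variable $X$ with even probability density function, i.e. $f_X(-x) = f_X(x)$, which is not increasing for $x > 0$. A typical example of such random variables are measurement errors due to statistical fluctuations. In his epoch-making paper, Gauss (1821) established the method of the least squares (see e.g. Section 2.2.1 and 2.5.3). In that same paper, Gauss (1821, pp. 10-11) also stated and proved Theorem 5.6.1, which is appealing because of its generality. We define the probability $m$ as
\[ m = \Pr[-\lambda\sigma \le X \le \lambda\sigma] = \int_{-\lambda\sigma}^{\lambda\sigma} f_X(u)\,du \tag{5.21} \]
where $\sigma = \sqrt{\mathrm{Var}[X]}$ is the standard deviation.

Theorem 5.6.1 (Gauss) If $X$ is a continuous random variable with even probability density function, i.e. $f_X(-x) = f_X(x)$, which is not increasing for $x > 0$, then
\[ \begin{array}{ll} \text{if } m < \frac{2}{3} & \text{then } \lambda \le m\sqrt{3} \\ \text{if } m = \frac{2}{3} & \text{then } \lambda \le \sqrt{\frac{4}{3}} \\ \text{if } m > \frac{2}{3} & \text{then } \lambda < \frac{2}{3\sqrt{1-m}} \end{array} \]
and, conversely,
\[ \begin{array}{ll} \text{if } \lambda < \sqrt{\frac{4}{3}} & \text{then } m \ge \frac{\lambda}{\sqrt{3}} \\ \text{if } \lambda > \sqrt{\frac{4}{3}} & \text{then } m \ge 1 - \frac{4}{9\lambda^2} \end{array} \]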
Given a bound on the probability $m$, Gauss's Theorem 5.6.1 bounds the extent of the error $X$ around its mean zero in units of the standard deviation $\sigma$ or, equivalently, it provides bounds for the normalized random variable $X^* = \frac{X - E[X]}{\sigma}$. The proof of this theorem only uses real function theory and is characteristic for the genius of Gauss.

Proof: Consider the inverse function $x = g(y)$ of the integral $y = \int_{-x}^{x} f_X(u)\,du = F_X(x) - F_X(-x)$. An interesting general property of the inverse function is
\[ \int_0^1 g^2(u)\,du = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx \]
which is verified by the substitution $x = g(u)$. Since $E[X] = 0$ and $\mathrm{Var}[X] = E\left[ X^2 \right]$, we have
\[ \int_0^1 g^2(u)\,du = \sigma^2 = \mathrm{Var}[X] \tag{5.22} \]
Beside $g(0) = 0$, the derivative
\[ g'(y) = \frac{1}{\frac{d}{dx}\left[ F_X(x) - F_X(-x) \right]} = \frac{1}{f_X(x) + f_X(-x)} \]
is increasing from $y = 0$ until $y = 1$ because $f_X(x)$ attains a maximum at $x = 0$ and is not increasing for $x > 0$. Hence, $g''(y) \ge 0$. From the differential
\[ d\left( y g'(y) \right) = g'(y)\,dy + y g''(y)\,dy \]
we obtain by integration
\[ y g'(y) - g(y) = \int_0^y u g''(u)\,du \]
Since $g''(y) \ge 0$, we have that $y g'(y) - g(y) \ge 0$ and since $y g'(y) > 0$ (for $y > 0$) that
\[ h(y) = 1 - \frac{g(y)}{y g'(y)} \]
lies in the interval $[0, 1]$. From (5.21), it follows that $\lambda\sigma = g(m)$ and that
\[ h(m) = 1 - \frac{\lambda\sigma}{m g'(m)} \quad \text{or} \quad g'(m) = \frac{\lambda\sigma}{m (1 - h(m))} \]
With this preparation, consider now the following linear function
\[ G(y) = \frac{\lambda\sigma (y - m h(m))}{m (1 - h(m))} \tag{5.23} \]
Clearly, we have that $G(m) = \lambda\sigma$ and that $G'(y) = \frac{\lambda\sigma}{m(1 - h(m))} = g'(m)$ is independent of $y$. Since $g'(y)$ is non-decreasing (which is the basic assumption of the theorem), the difference $g'(y) - G'(y)$ is negative if $y < m$, but positive if $y > m$. Since $g'(y) - G'(y) = \frac{d}{dy}\left( g(y) - G(y) \right)$, the function $g(y) - G(y)$ is convex with minimum at $y = m$ for which $g(m) - G(m) = 0$. Hence, $g(y) - G(y) \ge 0$ for all $y \in [0, 1]$. Further, $G(y)$ is positive for $y \in (m h(m), 1]$. Especially in this interval, the inequality $g(y) \ge G(y)$ is sharp because $g(y)$ is positive in $(0, 1]$. Thus,
\[ \int_{m h(m)}^{1} G^2(y)\,dy \le \int_{m h(m)}^{1} g^2(y)\,dy < \int_0^1 g^2(y)\,dy \]
Using (5.22) and with (5.23), we have
\[ \frac{\lambda^2 \sigma^2 (1 - m h(m))^3}{3 m^2 (1 - h(m))^2} < \sigma^2 \]
from which we arrive at the inequality
\[ \lambda^2 < \frac{3 m^2 (1 - z)^2}{(1 - mz)^3} \tag{5.24} \]
where $z = h(m) \in [0, 1]$. The derivative of the right-hand side with respect to $z$,
\[ \frac{d}{dz} \left( \frac{3 m^2 (1 - z)^2}{(1 - mz)^3} \right) = -3 m^2 \frac{(1 - z)}{(1 - mz)^4} (2 - 3m + mz) \]
shows that $\frac{3 m^2 (1 - z)^2}{(1 - mz)^3}$ is monotonically decreasing for all $z \in [0, 1]$ if $m < \frac{2}{3}$ with maximum at $z = 0$. Thus, if $m < \frac{2}{3}$, evaluating (5.24) at $z = 0$ yields $\lambda < \sqrt{3}\, m$. On the other hand, if $m > \frac{2}{3}$, then $\frac{3 m^2 (1 - z)^2}{(1 - mz)^3}$ is maximal provided $2 - 3m + mz = 0$ or for $z = 3 - \frac{2}{m}$. With that value of $z$, the inequality (5.24) yields $\lambda < \frac{2}{3\sqrt{1 - m}}$. Both regimes $m > \frac{2}{3}$ and $m < \frac{2}{3}$ tend to a same bound $\lambda < \frac{2}{\sqrt{3}}$ if $m \to \frac{2}{3}$. The converse is similarly derived from (5.24). $\square$

If $X$ has a symmetric uniform distribution with $f_X(x) = \frac{1}{2a} 1_{x \in [-a, a]}$, then $m = \frac{\lambda\sigma}{a}$ and $\sigma = \frac{a}{\sqrt{3}}$ from which $m = \frac{\lambda}{\sqrt{3}}$. This example shows that Gauss's Theorem 5.6.1 is sharp for $m \le \frac{2}{3}$ in the sense that equality can occur in the first condition $\lambda \le \sqrt{3}\, m$.
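The converse part of Theorem 5.6.1 gives a lower bound on $m = \Pr[|X| \le \lambda\sigma]$ that can be compared with exact values for a few even, unimodal densities. The three test densities (uniform, Gaussian, Laplace) and the function names below are choices made for this sketch only.

```python
import math

def gauss_lower_bound(lam):
    """Lower bound on m = Pr[|X| <= lam*sigma] from the converse part of the theorem."""
    if lam <= math.sqrt(4.0 / 3.0):
        return lam / math.sqrt(3.0)
    return 1.0 - 4.0 / (9.0 * lam**2)

def m_uniform(lam):
    """Uniform density on [-a, a], for which sigma = a / sqrt(3)."""
    return min(1.0, lam / math.sqrt(3.0))

def m_normal(lam):
    """Zero-mean Gaussian density."""
    return math.erf(lam / math.sqrt(2.0))

def m_laplace(lam):
    """Laplace density exp(-|x|/b) / (2b), for which sigma = b * sqrt(2)."""
    return 1.0 - math.exp(-lam * math.sqrt(2.0))

for lam in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(lam, gauss_lower_bound(lam), m_uniform(lam), m_normal(lam), m_laplace(lam))
```

For the uniform density the bound is attained as long as $\lambda \le \sqrt{4/3}$, in line with the sharpness remark above.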
5.7 The dominant pole approximation and large deviations

In this section, we relate asymptotic results of generating functions to the theory of large deviations. An asymptotic expansion in discrete-time is compared to established large deviations results. The first approach using the generating function $\varphi_X(z)$ of the random variable $X$ is an immediate consequence of Lemma 5.7.1.

Lemma 5.7.1 If $\varphi_X(z)$ is meromorphic with residues $r_k$ at the (simple) poles $p_k$ ordered as $0 < |p_0| \le |p_1| \le |p_2| \le \cdots$ and if $\varphi_X(z) = o(z^{N+1})$ as $z \to \infty$, then holds
\[ \varphi_X(z) = \sum_{k=0}^{N} \Pr[X = k]\, z^k + \sum_{k=0}^{\infty} \frac{r_k z^{N+1}}{p_k^{N+1} (z - p_k)} \tag{5.25} \]
\[ = \sum_{k=0}^{N} \Pr[X = k]\, z^k + \sum_{k=0}^{\infty} r_k \left( \frac{1}{z - p_k} + \sum_{m=0}^{N} \frac{z^m}{p_k^{m+1}} \right) \tag{5.26} \]
The normalization condition $\varphi_X(1) = 1$ implies that
\[ \Pr[X > N] = 1 - \sum_{k=0}^{N} \Pr[X = k] = \sum_{k=0}^{\infty} \frac{r_k}{p_k^{N+1} (1 - p_k)} \tag{5.27} \]
The Lemma follows from Titchmarsh (1964, Section 3.21). Rewriting (5.26) gives,
\[ \varphi_X(z) = \sum_{k=0}^{N} \Pr[X = k]\, z^k - \sum_{j=N+1}^{\infty} \left( \sum_{k=0}^{\infty} \frac{r_k}{p_k^{j+1}} \right) z^j \tag{5.28} \]
and hence,
\[ \Pr[X = j] = -\sum_{k=0}^{\infty} \frac{r_k}{p_k^{j+1}} \qquad (j > N) \tag{5.29} \]
The tail probability for $K > N$ follows from (5.29) as
\[ \Pr[X > K] = \sum_{j=K+1}^{\infty} \Pr[X = j] = \sum_{k=0}^{\infty} \frac{r_k}{p_k^{K+1} (1 - p_k)} \qquad (K > N) \tag{5.30} \]
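A simple sanity check of (5.29) and (5.30) is the geometric distribution $\Pr[X = k] = p q^k$ with $q = 1 - p$, an example chosen here for illustration. Its generating function $\varphi_X(z) = \frac{p}{1 - qz}$ has a single simple pole $p_0 = 1/q$ with residue $r_0 = -p/q$, so the pole formulas should reproduce the distribution exactly.

```python
def geometric_pole_data(p):
    """Pole and residue of phi(z) = p / (1 - q z) for Pr[X = k] = p q^k."""
    q = 1.0 - p
    pole = 1.0 / q
    residue = -p / q                 # lim_{z -> pole} (z - pole) * phi(z)
    return pole, residue

def pmf_from_pole(j, pole, residue):
    """Pr[X = j] from (5.29) when only one pole is present."""
    return -residue / pole**(j + 1)

def tail_from_pole(K, pole, residue):
    """Pr[X > K] from (5.30) when only one pole is present."""
    return residue / (pole**(K + 1) * (1.0 - pole))

pole, residue = geometric_pole_data(0.25)
print(pmf_from_pole(3, pole, residue), 0.25 * 0.75**3)
print(tail_from_pole(5, pole, residue), 0.75**6)
```

With several poles, keeping only the term of the pole with smallest modulus gives an approximation rather than an identity.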
Lemma 5.7.1 means that, if the plot $\Pr[X = j]$ versus $j$ exhibits a kink at $j = N$, then $\varphi_X(z) = O\left( z^N \right)$ as $z \to \infty$. Alternatively$^5$, the asymptotic regime does not start earlier than $j \ge N$. For large $K$, only the pole with smallest modulus, $p_0$, will dominate. Hence,
\[ \Pr[X > K] \approx \frac{r_0}{p_0^{K+1} (1 - p_0)} \tag{5.31} \]
This approximation is called the dominant pole approximation with the residue at the simple pole $p_0$ equal to $r_0 = \lim_{z \to p_0} (z - p_0) \varphi_X(z)$. [...]

[...] $G_r(N, L)$ is disconnected: a comparison between the exact result (15.17) and Erdös' asymptotic formula (15.16) for $L = N$, $L = \frac{3}{2} N$, $L = 2N$ and $L = \frac{3}{2} N \log N$.
The key observation of Erdös and Rényi (1959) is that a phase transition in random graphs with $N$ nodes occurs when the number of links $L$ is around $L_c = \frac{1}{2} N \log N$. Phase transitions are well-known phenomena in physics. For example, at a certain temperature, most materials possess a solid-liquid transition and at a higher temperature a second liquid-gas transition. Below that critical temperature, most properties of the material are completely different from those above that temperature. Some materials are superconductive below a certain critical temperature $T_c$, but normally conductive (or even insulating) above $T_c$. Erdös and Rényi concentrated on the property $A_k$ that a random graph $G_r(N, L)$ with $L_x = \left[ \frac{1}{2} N \log N + xN \right]$ consists of $N - k$ connected nodes and $k$ isolated nodes for fixed $k$. If $A_k^c$ means the absence of property $A_k$, they proved that, for all fixed $k$, $\Pr[A_k^c] \to 0$ if $N \to \infty$, which means that for a large number of nodes $N$, almost all random graphs $G_r(N, L_x)$ possess property $A_k$. This result is equivalent to a result proved in Section 15.6.3 that the class of random graphs $G_p(N)$ is almost surely disconnected if the link density $p$ is below $p_c \sim \frac{\log N}{N}$ and connected for $p > p_c$. In view of the analogy with physics, it is not surprising that corresponding sharp transitions are also observed for other properties than just $A_k$. In the sequel, we will show that, for the random graph $G_r(N, L_x)$, the probability that the largest connected component, called the giant component $GC(N, L_x)$, has $N - k$ nodes is, for large $N$, Poisson distributed with mean $e^{-2x}$,
\[ \lim_{N \to \infty} \Pr\left[ \text{number of nodes in } GC(N, L_x) = N - k \right] = \frac{\left( e^{-2x} \right)^k e^{-e^{-2x}}}{k!} \tag{15.18} \]
[...] $\Pr[A_0^c] \to 0$, which demonstrates (15.18) for $k = 0$. The remaining case for $k > 0$ in (15.18) follows from the observation that the number of graphs in $G_r(N, L_x)$ [...]
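$^9$ It is convenient to take the logarithm of
\[ w_k = \binom{N}{k} \frac{\binom{\binom{N-k}{2}}{L_x}}{\binom{\binom{N}{2}}{L_x}} = \frac{1}{k!} \prod_{j=0}^{k-1} (N - j) \prod_{j=0}^{L_x - 1} \frac{\binom{N-k}{2} - j}{\binom{N}{2} - j} \]
\[ = \frac{N^k}{k!} \prod_{j=0}^{k-1} \left( 1 - \frac{j}{N} \right) \left( 1 - \frac{k}{N} \right)^{L_x} \left( 1 - \frac{k}{N-1} \right)^{L_x} \prod_{j=0}^{L_x - 1} \frac{1 - \frac{2j}{(N-k)(N-1-k)}}{1 - \frac{2j}{N(N-1)}} \]
which is
\[ \log(k!\, w_k) = k \log N + \sum_{j=0}^{k-1} \log\left( 1 - \frac{j}{N} \right) + L_x \left[ \log\left( 1 - \frac{k}{N} \right) + \log\left( 1 - \frac{k}{N-1} \right) \right] + \sum_{j=0}^{L_x - 1} \left[ \log\left( 1 - \frac{2j}{(N-k)(N-1-k)} \right) - \log\left( 1 - \frac{2j}{N(N-1)} \right) \right] \]
For large $N$ and using the expansion $\log(1 - z) = -z + O\left( z^2 \right)$, we have for fixed $k$ with
\[ \log\left( 1 - \frac{2j}{(N-k)(N-1-k)} \right) = \log\left( 1 - \frac{2j}{N(N-1)} \right) + O\left( j N^{-3} \right) \]
that
\[ \log(k!\, w_k) = k \log N + O\left( N^{-1} \right) - \frac{2k}{N} L_x + O\left( L_x^2 N^{-3} \right) \]
In order to have a finite limit $\lim_{N \to \infty}$ [...] $\to \frac{\left( e^{-2x} \right)^k}{k!} e^{-e^{-2x}}$, where the limit gives the correct result because the small difference between the total number and that without property $A_k$ tends to zero.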
15.6.3 Connectivity and degree

There is an interesting relation between the connectivity of a graph, a global property, and the degree $D$ of an arbitrary node, a local property. The implication $\{G \text{ is connected}\} \Longrightarrow \{D_{\min} \ge 1\}$ where $D_{\min} = \min_{\text{all nodes} \in G} D$ is always true. The opposite implication is not always true, however, because a network can consist of separate, disconnected clusters containing nodes each with degree at least 1. A random graph can be generated from a set of $N$ labelled nodes by randomly assigning a link with probability $p$ to each pair of nodes. During this construction process, initially separate clusters originate, but at a certain moment, one of those clusters starts dominating (and swallowing) the other clusters. This largest cluster becomes the giant component. For large $N$ and a certain $p_N$ which depends on $N$, the implication $\{D_{\min} \ge 1\} \Longrightarrow \{G_p(N) \text{ is connected}\}$ is almost surely (a.s.) correct. A rigorous mathematical proof is fairly complex and omitted. Thus, for large random graphs $G_p(N)$ holds the equivalence $\{G_p(N) \text{ is connected}\} \Longleftrightarrow \{D_{\min} \ge 1\}$ almost surely such that
\[ \Pr[G_p(N) \text{ is connected}] = \Pr[D_{\min} \ge 1] \qquad \text{a.s.} \]
From (3.32) and (15.11), we have that
\[ \Pr[D_{\min} \ge 1] = \left( \Pr[D_{rg} \ge 1] \right)^N = \left( 1 - \Pr[D_{rg} = 0] \right)^N = \left( 1 - (1 - p)^{N-1} \right)^N \]
which shows that $\Pr[D_{\min} \ge 1]$ rapidly tends to one for fixed $0 < p < 1$ and large $N$. Therefore, the asymptotic behavior of $\Pr[G_p(N) \text{ is connected}]$ requires the investigation of the influence of $p$ as a function of $N$,
\[ \Pr[G_p(N) \text{ is connected}] = \exp\left( N \log\left( 1 - (1 - p_N)^{N-1} \right) \right) = \exp\left( -N \sum_{j=1}^{\infty} \frac{(1 - p_N)^{j(N-1)}}{j} \right) = e^{-N (1 - p_N)^{N-1}} \exp\left( -N \sum_{j=2}^{\infty} \frac{(1 - p_N)^{(N-1)j}}{j} \right) \]
If we denote $c_N \triangleq N \cdot (1 - p_N)^{N-1}$, then
\[ N \sum_{j=2}^{\infty} \frac{(1 - p_N)^{(N-1)j}}{j} = \sum_{j=2}^{\infty} \frac{c_N^j}{j N^{j-1}} \]
can be made arbitrarily small for large $N$ provided we choose $c_N = O\left( N^{\beta} \right)$ with $\beta < \frac{1}{2}$. Thus, for large $N$, we have that
\[ \Pr[G_p(N) \text{ is connected}] = e^{-c_N} \left( 1 + O\left( N^{2\beta - 1} \right) \right) \]
which tends to 0 for $0 < \beta < \frac{1}{2}$ and to 1 for $\beta < 0$. Hence, the critical exponent where a sharp transition occurs is $\beta = 0$. In that case, $c_N = c$ (a real positive constant) and
\[ p_N = 1 - \exp\left( -\frac{\log \frac{N}{c}}{N - 1} \right) = \frac{\log N}{N} - \frac{\log c}{N} + O\left( \frac{\log^2 N}{N^2} \right) \]
In summary, for large $N$,
\[ \Pr[G_p(N) \text{ is connected}] \to \begin{cases} 0 & \text{if } p < \frac{\log N}{N} \\ 1 & \text{if } p > \frac{\log N}{N} \end{cases} \tag{15.19} \]
with a transition region around $p_c \sim \frac{\log N}{N}$ with a width of $O\left( \frac{1}{N} \right)$. Notice the agreement with (15.15) where $p_x = \frac{L_x}{L_{\max}} \simeq \frac{\log N}{N} + \frac{2x}{N}$: for large $x < 0$, $e^{-e^{-2x}} \to 0$, while for large $x > 0$, $e^{-e^{-2x}} \to 1$ and the width of the transition region for the link density $p$ is $O\left( \frac{1}{N} \right)$.

15.6.4 Size of the giant component

Let $S = \Pr[n \in C]$ denote the probability that a node $n$ in $G_p(N)$ belongs to the giant component $C$. If $n \notin C$, then none of the neighbors of node $n$ belongs to the giant component. The number of neighbors of a node $n$ is the degree $d_n$ of a node such that
\[ \Pr[n \notin C] = \Pr[\text{all neighbors of } n \notin C] = \sum_{k \ge 0} \Pr[\text{all } k \text{ neighbors of } n \notin C \mid d_n = k] \Pr[d_n = k] \]
Since in $G_p(N)$ all neighbors of $n$ are independent$^{10}$, the conditional probability becomes, with $1 - S = \Pr[n \notin C]$,
\[ \Pr[\text{all } k \text{ neighbors of } n \notin C \mid d_n = k] = \left( \Pr[n \notin C] \right)^k = (1 - S)^k \]
Moreover, this probability holds for any node $n \in G_p(N)$ such that, writing the random variable $D_{rg}$ instead of an instance $d_n$,
\[ 1 - S = \sum_{k=0}^{\infty} (1 - S)^k \Pr[D_{rg} = k] = \varphi_{D_{rg}}(1 - S) \]
where $\varphi_{D_{rg}}(u) = E\left[ u^{D_{rg}} \right]$ is the generating function of the degree $D_{rg}$ in $G_p(N)$. For large $N$, the degree distribution in $G_p(N)$ is Poisson distributed with mean degree $\mu_{rg} = p(N - 1)$ and $\varphi_{D_{rg}}(u) \simeq e^{\mu_{rg}(u - 1)}$. For large $N$, the fraction $S$ of nodes in the giant component in the random graph satisfies an equation similar to that in (12.13) of the extinction probability in a branching process,
\[ S = 1 - e^{-\mu_{rg} S} \tag{15.20} \]
and the average size of the giant component is $NS$. For $\mu_{rg} < 1$ the only solution is $S = 0$ whereas for $\mu_{rg} > 1$ there is a non-zero solution for the size of the giant component. The solution can be expressed as a Lagrange series using (5.34),
\[ S(\mu_{rg}) = 1 - e^{-\mu_{rg}} \sum_{n=0}^{\infty} \frac{(n + 1)^n}{(n + 1)!} \left( \mu_{rg} e^{-\mu_{rg}} \right)^n \tag{15.21} \]
By reversing (15.20), the average degree in the random graph can be expressed in terms of the fraction $S$ of nodes in the giant component,
\[ \mu_{rg}(S) = -\frac{\log(1 - S)}{S} \tag{15.22} \]

$^{10}$ This argument is not valid, for example, for a two-dimensional lattice $\mathbb{Z}_p^2$ in which each link between adjacent nodes at integer value coordinates in the plane exists with probability $p$. The critical link density for connectivity in $\mathbb{Z}_p^2$ is $p_c = \frac{1}{2}$, a famous result proved in the theory of percolation (see, for example, Grimmett (1989)).
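The relations (15.20) to (15.22) can be cross-checked numerically. The sketch below solves (15.20) by fixed-point iteration, sums a truncated version of the Lagrange series (15.21) and inverts the result with (15.22). The function names, tolerance and number of series terms are assumptions of this example.

```python
import math

def giant_fraction(mu, tol=1e-12, max_iter=100000):
    """Solve S = 1 - exp(-mu * S) by fixed-point iteration started at S = 1."""
    s = 1.0
    for _ in range(max_iter):
        s_next = 1.0 - math.exp(-mu * s)
        if abs(s_next - s) < tol:
            return s_next
        s = s_next
    return s

def giant_fraction_series(mu, terms=400):
    """Partial sum of the Lagrange series (15.21), evaluated via logarithms."""
    log_y = math.log(mu) - mu            # log(mu * exp(-mu))
    total = 0.0
    for n in range(terms):
        # log of (n+1)^n / (n+1)! * y^n, using lgamma(n + 2) = log((n+1)!)
        log_term = n * math.log(n + 1) - math.lgamma(n + 2) + n * log_y
        total += math.exp(log_term)
    return 1.0 - math.exp(-mu) * total

def mean_degree_from_fraction(s):
    """Average degree as a function of S, equation (15.22)."""
    return -math.log(1.0 - s) / s

s_iter = giant_fraction(2.0)
s_series = giant_fraction_series(2.0)
print(s_iter, s_series, mean_degree_from_fraction(s_iter))
```

Starting the iteration at $S = 1$ selects the non-zero solution when the mean degree exceeds 1 and decays to zero when it is below 1. Convergence becomes slow close to a mean degree of 1.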
15.7 The hopcount in a large, sparse graph with unit link weights

Routers in the Internet forward IP packets to the next hop router, which is found by routing protocols (such as OSPF and BGP). Intra-domain routing such as OSPF is based on the Dijkstra shortest path algorithm, while inter-domain routing with BGP is policy-based, which implies that BGP does not minimize a length criterion. Nevertheless, end-to-end paths in the Internet are shortest paths in roughly 70% of the cases. Therefore, we consider the shortest path between two arbitrary nodes because (a) the IP address does not reflect a precise geographical location and (b) uniformly distributed world wide communication, especially on the web, seems natural since the information stored in servers can be located in places unexpected and unknown to browsing users. The Internet type of communication is different from classical telephony because (a) telephone numbers have a direct binding with a physical location and (b) the intensity of average human interaction rapidly decreases with distance. We prefer to study the hopcount $H_N$ because it is simple to measure via the trace-route utility, it is an integer, dimensionless, and the quality of service (QoS) measures (such as packet delay, jitter and packet loss) depend on the hopcount, the number of traversed routers. In this section, we first investigate the hopcount in a sparse, but connected graph where all links have unit weight. Chapter 16 treats graphs with other link weight structures.
15.7.1 Bi-directional search

The basic idea of a bi-directional search to find the shortest path is to start the discovery process (e.g. using Dijkstra's algorithm) from $A$ and $B$ simultaneously. When both subsections from $A$ and from $B$ meet, the concatenation forms the shortest path from $A$ to $B$. In case all link weights are equal, $w(i \to j) = 1$ for any link $i \to j$ in a graph $G$, the shortest path from $A$ to $B$ is found when the discovery process from $A$ and that from $B$ have precisely one node of the graph in common. Denote by $C_A(l)$, respectively $C_B(l)$, the set of nodes that can be reached from $A$, respectively $B$, in $l$ or less hops. We define $C_A(0) = \{A\}$ and $C_B(0) = \{B\}$. The hopcount is larger than $2l$ if and only if $C_A(l) \cap C_B(l)$ is empty. Conditionally on $|C_A(l)| = n_A$, respectively $|C_B(l)| = n_B$, the sets $C_A(l)$ and $C_B(l)$ do not possess a common node with probability
\[ \Pr\left[ C_A(l) \cap C_B(l) = \emptyset \,\big|\, |C_A(l)| = n_A, |C_B(l)| = n_B \right] = \frac{\binom{N - 1 - n_A}{n_B}}{\binom{N - 1}{n_B}} \]
which consists of the ratio of all combinations in which the $n_B$ nodes around $B$ can be chosen out of the remaining nodes that do not belong to the set $C_A$ over all combinations in which $n_B$ nodes can be chosen in the graph with $N$ nodes except for node $A$. Furthermore,
\[ \frac{\binom{N - 1 - n_A}{n_B}}{\binom{N - 1}{n_B}} = \frac{(N - n_A - 1)(N - n_A - 2) \cdots (N - n_A - n_B)}{(N - 1)(N - 2) \cdots (N - n_B)} = \frac{\left( 1 - \frac{n_A + 1}{N} \right) \left( 1 - \frac{n_A + 2}{N} \right) \cdots \left( 1 - \frac{n_A + n_B}{N} \right)}{\left( 1 - \frac{1}{N} \right) \left( 1 - \frac{2}{N} \right) \cdots \left( 1 - \frac{n_B}{N} \right)} \]
For large $N$, we apply the Taylor series around $x = 0$ of $\log(1 - x) = -x - \sum_{j=2}^{\infty} \frac{x^j}{j}$,
\[ \log \frac{\binom{N - 1 - n_A}{n_B}}{\binom{N - 1}{n_B}} = \sum_{k=1}^{n_B} \left[ \log\left( 1 - \frac{n_A + k}{N} \right) - \log\left( 1 - \frac{k}{N} \right) \right] \]
\[ = -\sum_{k=1}^{n_B} \left( \frac{n_A + k}{N} - \frac{k}{N} \right) - \sum_{j=2}^{\infty} \frac{1}{j N^j} \left( \sum_{k=1}^{n_B} (n_A + k)^j - \sum_{k=1}^{n_B} k^j \right) \]
\[ = -\frac{n_A n_B}{N} - \left( \frac{n_A n_B}{N} \right)^2 \left( \frac{1}{2 n_A} + \frac{1}{2 n_B} + \frac{1}{2 n_A n_B} \right) - R \]
where the remainder is
\[ R = \sum_{j=3}^{\infty} \frac{1}{j N^j} \sum_{k=1}^{n_B} \sum_{m=0}^{j-1} \binom{j}{m} n_A^{j-m} k^m = O\left( \left( \frac{n_A n_B}{N} \right)^3 \right) \]
After exponentiation
\[ \Pr\left[ H_N > 2l \,\big|\, |C_A(l)| = n_A, |C_B(l)| = n_B \right] = e^{-\frac{n_A n_B}{N}} \left( 1 + O\left( \left( \frac{n_A n_B}{N} \right)^2 \right) \right) \]
By the law of total probability (2.47) and up to $O\left( \frac{E\left[ |C_A(l)|^2 |C_B(l)|^2 \right]}{N^2} \right)$ for large $N$, we obtain
\[ \Pr[H_N > 2l] \approx E\left[ \exp\left( -\frac{|C_A(l)|\, |C_B(l)|}{N} \right) \right] \tag{15.23} \]
This probability (15.23) holds for any large graph with a unit link weight structure provided $E\left[ |C_A(l)|^2 |C_B(l)|^2 \right] = o(N^2)$. Formula (15.23) becomes increasingly accurate for decreasing $|C_A(l)|$ and $|C_B(l)|$, and so for sparser large graphs.
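The quality of the exponential approximation behind (15.23) can be gauged by comparing the exact ratio of binomial coefficients with $e^{-n_A n_B / N}$. The values of `N`, `n_a` and `n_b` below are arbitrary illustrative choices.

```python
import math

def prob_disjoint(N, n_a, n_b):
    """Exact probability that the n_b nodes around B avoid the n_a nodes around A."""
    return math.comb(N - 1 - n_a, n_b) / math.comb(N - 1, n_b)

def prob_disjoint_approx(N, n_a, n_b):
    """Leading-order approximation exp(-n_a * n_b / N)."""
    return math.exp(-n_a * n_b / N)

N, n_a, n_b = 10000, 50, 50          # illustrative sizes (assumption)
exact = prob_disjoint(N, n_a, n_b)
approx = prob_disjoint_approx(N, n_a, n_b)
print(exact, approx, (approx - exact) / approx)
```

The exact value is slightly smaller than the approximation, as expected from the negative sign of the correction terms in the expansion of the logarithm.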
15.7.2 Sparse large graphs and a branching process

In order to proceed, the number of nodes in the sets $C_A(l)$ and $C_B(l)$ needs to be determined, which is difficult in general. Therefore, we concentrate here on a special class of graphs in which the discovery process from $A$ and $B$ is reasonably well modeled by a branching process (Chapter 12). A branching process evolves from a given set $C_A(l - 1)$ in the next $l$-th discovery cycle (or generation) to the set $C_A(l)$ by including only new nodes, not those previously discovered. The application of a branching process implies that the newly discovered nodes do not possess links to any previously discovered node of $C_A(l - 1)$ except for its parent node in $C_A(l - 1)$. Hence, only for large and sparse graphs or tree-like graphs, this assumption can be justified, provided that the number of links that point backwards to early discovered nodes in $C_A(l - 1)$ is negligibly small. Assuming that a branching process models the discovery process well, we will compute the number of nodes that can be reached from $A$ and similarly from $B$ in $l$ hops from a branching process with production $Y$ specified by the degree distribution of the nodes in the graph. The additional number of nodes $X_l$ discovered during the $l$-th cycle of a branching process that are included in the set $C_A(l)$ is described by the basic law (12.1). Thus, $|C_A(l)| = \sum_{k=0}^{l} X_k$ with $X_0 = 1$ (namely node $A$). In terms of the scaled random variable $W_k = \frac{X_k}{\mu^k}$ with unit mean $E[W_k] = 1$,
\[ |C_A(l)| = \sum_{k=0}^{l} W_k \mu^k \]
and where $\mu = E[Y] - 1 > 1$ denotes the average degree minus 1, i.e. the outdegree, in the graph. Only the root has $E[Y]$ equal to the mean degree. Immediately, the average size of the set of nodes reached from $A$ in $l$ hops is with $E[W_k] = 1$,
\[ E\left[ |C_A(l)| \right] = \sum_{k=0}^{l} \mu^k = \frac{\mu^{l+1} - 1}{\mu - 1} \]
which equally holds for $E\left[ |C_B(l)| \right]$. Applying Jensen's inequality (5.5) to (15.23) yields
\[ E\left[ \exp\left( -\frac{|C_A(l)|\, |C_B(l)|}{N} \right) \right] \ge \exp\left( -\frac{E\left[ |C_A(l)| \right] E\left[ |C_B(l)| \right]}{N} \right) \]
such that
\[ \Pr[H_N > k] \gtrsim \exp\left( -\frac{\mu^{k+2}}{N (\mu - 1)^2} \right) \]
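With the tail probability expression (2.36) for the average, we arrive at the lower bound for the expected hopcount in large graphs,
\[ E[H_N] = \sum_{k=0}^{\infty} \Pr[H_N > k] \gtrsim \sum_{k=0}^{\infty} \exp\left( -\frac{\mu^{k+2}}{N (\mu - 1)^2} \right) \]
The sum $S_1(t) = \sum_{k=0}^{\infty} \exp\left( -t \mu^k \right)$ can be evaluated exactly$^{11}$ as
\[ S_1(t) = \frac{1}{2} - \frac{\log t + \gamma}{\log \mu} + \frac{2}{\sqrt{\log \mu}} \sum_{k=1}^{\infty} \frac{\cos\left[ \frac{2k\pi}{\log \mu} \log t + \arg \Gamma\left( -\frac{2k\pi i}{\log \mu} \right) \right]}{\sqrt{2k \sinh\left( \frac{2k\pi^2}{\log \mu} \right)}} + \sum_{n=1}^{\infty} \left( 1 - e^{-t \mu^{-n}} \right) \]
Furthermore,
\[ \left| \sum_{k=1}^{\infty} \frac{\cos\left[ \frac{2k\pi}{\log \mu} \log t + \arg \Gamma\left( -\frac{2k\pi i}{\log \mu} \right) \right]}{\sqrt{2k \sinh\left( \frac{2k\pi^2}{\log \mu} \right)}} \right| \le \sum_{k=1}^{\infty} \frac{1}{\sqrt{2k \sinh\left( \frac{2k\pi^2}{\log \mu} \right)}} = b(\mu) \]
and the function $T(\mu) = \frac{2 b(\mu)}{\sqrt{\log \mu}}$ is increasing, but for $1 < \mu \le 5$ its maximum value $T(5)$ is smaller than 0.0035. Since $t = \frac{\mu^2}{N (\mu - 1)^2}$ is small and $\mu > 1$, we approximate
\[ S_1(t) \approx \frac{1}{2} - \frac{\log t + \gamma}{\log \mu} \tag{15.24} \]
and arrive, for large $N$, at
\[ E[H_N] \gtrsim \frac{\log N}{\log \mu} + \frac{1}{2} - \frac{\log \frac{\mu^2}{(\mu - 1)^2} + \gamma}{\log \mu} \]
This shows that in large, sparse graphs for which the discovery process is well modeled by a branching process, it holds that $E[H_N]$ scales as $\frac{\log N}{\log \mu}$ where $\mu = E[Y] - 1 > 1$ is the average degree minus 1 in the graph.

$^{11}$ For $\mathrm{Re}(s) > 0$ and $\mathrm{Re}(p) \ge 0$, we have
\[ \Gamma(s) = p^s \int_0^{\infty} t^{s-1} e^{-pt}\,dt \]
and
\[ \Gamma(s) \sum_{k=0}^{\infty} \frac{1}{\mu^{ks}} = \int_0^{\infty} t^{s-1} \sum_{k=0}^{\infty} e^{-\mu^k t}\,dt \]
or
\[ \int_0^{\infty} t^{s-1} S_1(t)\,dt = \Gamma(s) \frac{\mu^s}{\mu^s - 1} \]
By Mellin inversion, for $c > 0$,
\[ S_1(t) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \frac{\Gamma(s) \mu^s}{\mu^s - 1} t^{-s}\,ds \]
By moving the line of integration to the left, we encounter a double pole at $s = 0$ from $\Gamma(s)$ and $\frac{1}{\mu^s - 1}$ and simple poles at $s = \frac{2k\pi i}{\log \mu}$ from $\frac{1}{\mu^s - 1}$. Invoking Cauchy's residue theorem leads to the result.

We can refine the above analysis. Let us now assume that the convergence of $W_k \to W$ is sufficiently fast for large $N$ and that $W > 0$ such that,
\[ |C_A(l)| \approx W_A \sum_{k=0}^{l} \mu^k = W_A \frac{\mu^{l+1} - 1}{\mu - 1} \approx W_A \frac{\mu^{l+1}}{\mu - 1} \]
is a good approximation (and similarly for $|C_B(l)|$). The verification of this approximation is difficult in general. Theorem 12.3.2 states that $\Pr[W = 0] = \pi_0$ and equivalently $\Pr[W > 0] = 1 - \pi_0$ where the extinction probability $\pi_0$ obeys the equation (12.13). Using this approximation, we find from (15.23)
\[ \Pr[H_N > 2l] \approx E\left[ \exp\left( -\frac{W_A W_B\, \mu^{2l+2}}{N (\mu - 1)^2} \right) \,\Big|\, W_A, W_B > 0 \right] \]
where the condition on $W > 0$ is required else there are no clusters $C_A(l)$ and $C_B(l)$ nor a path. Since the same asymptotics also holds for odd values of the hopcount, we finally arrive, for $k \ge 1$ and large $N$, at
\[ \Pr[H_N > k] \approx E\left[ \exp\left( -Z \mu^k \right) \,\big|\, W_A, W_B > 0 \right] \]
where the random variable
\[ Z = \frac{\mu^2\, W_A W_B}{N (\mu - 1)^2} \]
and $W_A \stackrel{d}{=} W_B \stackrel{d}{=} W$. A more explicit computation of $\Pr[H_N > k]$ requires the knowledge of the limit random variable $W$, which strongly depends on the nodal degree $Y$. The average hopcount $E[H_N]$ is found similarly as in the analysis above by using (15.24) with $t = Z$,
\[ E[H_N] \approx E\left[ S_1(Z) \mid W_A, W_B > 0 \right] = E\left[ \frac{1}{2} - \frac{2 \log W - \log N + 2 \log \frac{\mu}{\mu - 1} + \gamma}{\log \mu} \,\Big|\, W > 0 \right] \]
\[ = \frac{1}{2} + \frac{\log N}{\log \mu} - \frac{2 \log \frac{\mu}{\mu - 1} + \gamma}{\log \mu} - \frac{2 E\left[ \log W \mid W > 0 \right]}{\log \mu} \]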
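The accuracy of the approximation (15.24) can be checked by summing $S_1(t)$ directly. In this sketch, `EULER_GAMMA` is Euler's constant $\gamma$, and the values of `t` and `mu` are illustrative assumptions; the neglected terms are the small oscillating sum and the last sum in the exact expression, which is of order $t$.

```python
import math

EULER_GAMMA = 0.5772156649015329     # Euler's constant

def s1_exact(t, mu, tol=1e-16):
    """Direct summation of S_1(t) = sum_{k>=0} exp(-t * mu^k) until terms are negligible."""
    total, k = 0.0, 0
    while True:
        term = math.exp(-t * mu**k)
        total += term
        if term < tol:
            return total
        k += 1

def s1_approx(t, mu):
    """Approximation (15.24): 1/2 - (log t + gamma) / log mu."""
    return 0.5 - (math.log(t) + EULER_GAMMA) / math.log(mu)

for mu in (2.0, 3.0, 5.0):
    t = 1e-6                          # small t, as for large N
    print(mu, s1_exact(t, mu), s1_approx(t, mu))
```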
In sparse graphs with average degree $E[Y]$ equal to $\mu$ and for a large number of nodes $N$, the average hopcount is well approximated$^{12}$ by
\[ E[H_N] = \frac{\log N}{\log \mu} + \frac{1}{2} - \frac{2 \log \frac{\mu}{\mu - 1} + \gamma}{\log \mu} - \frac{2 E\left[ \log W \mid W > 0 \right]}{\log \mu} \tag{15.25} \]
This expression (15.25) for the average hopcount, which is more refined than the commonly used estimate $E[H_N] \approx \frac{\log N}{\log \mu}$, contains the curious average $E[\log W \mid W > 0]$ where $W$ is the limit random variable of the branching process produced by the graph's degree distribution $Y$.

Application to $G_p(N)$

The above analysis holds for fixed $\mu = E[Y] = p(N - 1)$ such that, for large $N$, we require that $p = \frac{\mu}{N}$ where $\mu$ is approximately equal to the average degree. Since the binomial distribution (15.11) for the degree in $G_p(N)$ is very well approximated by the Poisson distribution $\Pr[D_{rg} = k] \approx \frac{\mu^k}{k!} e^{-\mu}$ for large $N$ and constant $\mu$, formula (15.25) requires the computation of $E[\log W \mid W > 0]$ in a Poisson branching process, which is presented in Hooghiemstra and Van Mieghem (2005) but here summarized in Fig. 15.8. The numerical evaluation of the average hopcount (15.25) in a random graph of the class $G_p(N)$ for small average degree $\mu$ and large $N$ shows that (15.25) is much more accurate than only its first term $\frac{\log N}{\log \mu}$.

[Figure 15.8: plot of $E[\log W \mid W > 0]$ (vertical axis, from -0.2 to 1.2) versus $\mu$ (horizontal axis, from 1 to 10).]

Fig. 15.8. The quantity $E[\log W \mid W > 0]$ of a Poisson branching process versus the average degree $\mu$.

$^{12}$ A more rigorous derivation that stochastically couples the graph's growth specified by a certain degree distribution to a corresponding branching process is found in van der Hofstad et al. (2005). In particular, the analysis is shown to be valid for any randomly constructed graph with a finite variance of the degree. More details on the result for the average hopcount are presented in Hooghiemstra and Van Mieghem (2005).

At the other end of the scale for a constant link density $p = c < 1$, which corresponds to an average degree $E[Y] = c(N - 1)$, the above analysis no longer applies for such large values of the average degree $E[Y]$. Fortunately, in that case, an exact asymptotic analysis is possible (see Problem (iii)):
\[ \Pr[H_N = 1] = p, \qquad \Pr[H_N = 2] = (1 - p) \left( 1 - \left( 1 - p^2 \right)^{N-2} \right) \tag{15.26} \]
Values of $H_N$ higher than 2 are extremely unlikely since $\Pr[H_N > 2] = (1 - p) \left[ 1 - p^2 \right]^{N-2}$ tends to zero rapidly for sufficiently large $N$. Hence, $E[H_N] \simeq \Pr[H_N = 1] + 2 \Pr[H_N = 2] \simeq 2 - p$ and, similarly, we find $\mathrm{Var}[H_N] \simeq p(1 - p)$. This asymptotic analysis even holds for a larger link density regime $p = c N^{-\frac{1}{2} + \epsilon}$ with $\epsilon > 0$ because
\[ \lim_{N \to \infty} \Pr[H_N > 2] = \lim_{N \to \infty} \left( 1 - c N^{-\frac{1}{2} + \epsilon} \right) \left[ 1 - c^2 N^{-1 + 2\epsilon} \right]^{N-2} = 0 \]
[...]

[...] $\alpha > 0$, (16.1)
The corresponding density is $f_w(x) = \alpha x^{\alpha - 1} 1_{x \in [0, 1]}$. The exponent
\[ \alpha = \lim_{x \downarrow 0} \frac{\log F_w(x)}{\log x} \]
is called the extreme value index of the probability distribution of $w$ and $\alpha = 1$ for regular distributions. By varying the exponent $\alpha$ over all non-negative real values, any extreme value index can be attained and a large class of corresponding SPTs, in short $\alpha$-trees, can be generated.

[Figure 16.1: sketch of $F_w(x)$ versus $x$ on $[0, 1]$ for the three $\alpha$-regimes, with the small region up to $\epsilon$ near zero and an arrow marked "larger scale".]

Fig. 16.1. A schematic drawing of the distribution of the link weights for the three different $\alpha$-regimes. The shortest path problem is mainly sensitive to the small region around zero. The scaling invariant property of the shortest path allows us to divide all link weights by the largest possible such that $F_w(1) = 1$ for all link weight distributions.
Figure 16.1 illustrates schematically the probability distribution of the link weights around zero $(0, \epsilon]$, where $\epsilon > 0$ is an arbitrarily small, positive real number. The larger link weights in the network will hardly appear in a shortest path provided the network possesses enough links. These larger link weights are drawn in Fig. 16.1 from the double dotted line to the right. The nice advantage that only small link weights dominantly influence the property of the resulting shortest path tree implies that the remainder of the link weight distribution (denoted by the arrow with larger scale in Fig. 16.1) only plays a second order role. To some extent, it also explains the success of the simple SPT model based on the complete graph $K_N$ with i.i.d. exponential link weights, which we derive in Section 16.2. A link weight structure effectively thins the complete graph $K_N$ (any other graph is a subgraph of $K_N$) to the extent that a specific shortest path tree can be constructed. Finally, we assume the independence of link weights, which we deem a reasonable assumption in large networks, such as the Internet with its many independent autonomous systems (ASs). Apart from Section 16.7, we will mainly consider the case $\alpha = 1$, which allows an exact analysis.
16.2 The shortest path tree in NQ with exponential link weights 16.2.1 The Markov discovery process Let us consider the shortest path problem in the complete graph NQ , where each node in the graph is connected to each other node. The problem of finding the shortest path between two nodes D and E in NQ with exponentially distributed link weights with mean 1 can be rephrased in terms of a Markov discovery process. The discovery process evolves as a function of time and stops at a random time W when node E is found. The process is shown in Fig. 16.2. The evolution of the discovery process can be described by a continuoustime Markov chain [(w), where [(w) denotes the number of discovered nodes at time w, because the characteristics of a Markov chain (Theorem 10.2.3) are based on the exponential distribution and the memoryless property. Of particular interest here is the property (see Section 3.4.1) that the minimum of q independent exponential variables each with parameter l is again an P exponential variable with parameter ql=1 l . The discovery process starts at time w = W0 with the source node D and for the initial distribution of the Markov chain, we have Pr[[(W0 ) = 1] = 1. The state space of the continuous Markov chain is the set VQ consisting of all positive integers (nodes) q with q Q . For the complete graph NQ , the
transition rates are given by

$$\lambda_n = n(N - n), \qquad n \in S_N \tag{16.2}$$
Indeed, initially there is only the source node $A$ with label 0 (see footnote 2), hence $n = 1$. From this first node $A$ precisely $N - 1$ new nodes can be reached in the complete graph $K_N$. Alternatively, one can say that $N - 1$ nodes are competing with each other, each with exponentially distributed strength, to be discovered, and the winner amongst them, say $C$ with label 1, is the one reached in the shortest time, which corresponds to an exponential variable with rate $N - 1$.
[Figure 16.2: left panel "Markov discovery process" (nodes 0 to 8 with discovery times $v_1, \ldots, v_8$ along the time axis), right panel "corresponding URT" (levels $h = 0$ to $h = 3$).]

Fig. 16.2. On the left, the Markov discovery process as a function of time in a graph with $N = 9$ nodes. The circles centered at the discovering node $A$ with label 0 represent equi-time lines and $v_k$ is the discovery time of the $k$-th node, while $\tau_k = v_k - v_{k-1}$ is the $k$-th interattachment time. The set of discovered nodes, redrawn per level, is shown on the right, where a level gives the number of hops $h$ from the source node $A$. The tree is a uniform recursive tree (URT).

Footnote 2: When continuous measures such as time and weight of a path are computed, the source node is most conveniently labeled by zero, whereas in counting processes, such as the number of hops of a path, the source node is labeled by one.
After having reached $C$ from $A$ at hitting time $v_1$, two nodes ($n = 2$) are found and the discovery process restarts from both $A$ and $C$. Although at time $v_1$ we had already progressed a certain distance towards each of the $N - 2$ other, not yet discovered, nodes, the memoryless property of the exponential distribution tells us that the remaining distance to these $N - 2$ nodes is again exponentially distributed with the same parameter 1. Hence, this allows us to restart the process from $A$ and $C$ by erasing the previously covered partial distance to any other not yet discovered node, as if it had never been travelled. From the discovery time $v_1$ of the first node on, the discovery process has double strength to reach precisely $N - 2$ new nodes. Hence, the next winner, say $D$ labeled by 2, is reached at $v_2$ in the minimum time out of $2(N - 2)$ traveling times. This node $D$ has equal probability to be attached to $A$ or $C$ because of symmetry. When $D$ is attached to $A$ (the argument below holds similarly for attachment to $C$), symmetry appears to be broken, because $D$ and $C$ have only one link used, whereas $A$ has already two links used. However, since we are interested in the shortest path problem and since the direct link from $A$ to $D$ is shorter than the path $A \to C \to D$, we exclude the latter in the discovery process, hereby establishing again the full symmetry in the Markov chain. This exclusion also means that the Markov chain maintains single paths from $A$ to each newly discovered node and this path is also the shortest path. Hence, there are no cycles possible. Furthermore, similar to Dijkstra's shortest path algorithm, each newly reached node is withdrawn from the next competition round, which guarantees that the Markov chain eventually terminates.
Besides terminating by extinction of all available nodes, after each transition when a new node is discovered, the Markov chain stops with probability equal to $\frac{1}{N-n}$, since each of the $n$ already discovered nodes has precisely 1 possibility out of the remaining $N - n$ to reach $B$ and only one of them is the discoverer. The stopping time $T$ is defined as the infimum for $t \ge 0$ at which the destination node $B$ is discovered. In summary, the described Markov discovery process, a pure birth process with birth rate $\lambda_n = n(N - n)$, models exactly the shortest path for all values of $N$.

16.2.2 The uniform recursive tree

A uniform recursive tree (URT) of size $N$ is a random tree rooted at $A$. At each stage a new node is attached uniformly to one of the existing nodes until the total number of nodes is equal to $N$. The hopcount $h_N$ (equivalent to the depth or distance) is the smallest number of links between the root $A$ and a destination chosen uniformly from all nodes $\{1, 2, \ldots, N\}$.
Denote by $\{X_N^{(k)}\}$ the $k$-th level set of a tree $T$, which is the set of nodes in the tree $T$ at hopcount $k$ from the root $A$ in a graph with $N$ nodes, and by $X_N^{(k)}$ the number of elements in the set $\{X_N^{(k)}\}$. Then, we have $X_N^{(0)} = 1$ because the zeroth level can only contain the root node $A$ itself. For all $k > 0$, it holds that $0 \le X_N^{(k)} \le N - 1$ and that

$$\sum_{k=0}^{N-1} X_N^{(k)} = N \tag{16.3}$$

Another consequence of the definition is that, if $X_N^{(n)} = 0$ for some level $n < N - 1$, then all $X_N^{(j)} = 0$ for levels $j > n$. In such a case, the longest shortest path in the tree has a hopcount of at most $n - 1$. The level set

$$L_N = \left\{1, X_N^{(1)}, X_N^{(2)}, \ldots, X_N^{(N-1)}\right\}$$

of a tree $T$ is defined as the set containing the number of nodes $X_N^{(k)}$ at each level $k$. An example of a URT organized per level $k$ is drawn on the right in Fig. 16.2 and in Fig. 16.3. A basic theorem for URTs, proved in van der Hofstad et al. (2002b), is the following:
Theorem 16.2.1 Let $\{Y_N^{(k)}\}_{k,N \ge 0}$ and $\{Z_N^{(k)}\}_{k,N \ge 0}$ be two independent copies of the vector of level sets of two sequences of independent URTs. Then

$$\{X_N^{(k)}\}_{k \ge 0} \stackrel{d}{=} \{Y_{N_1}^{(k-1)} + Z_{N-N_1}^{(k)}\}_{k \ge 0}, \tag{16.4}$$

where on the right-hand side the random variable $N_1$ is uniformly distributed over the set $\{1, 2, \ldots, N - 1\}$.

Theorem 16.2.1 also implies that a subtree rooted at a direct child of the root is a URT. For example, in Fig. 16.3, the tree rooted at node 5 is a URT of size 13, as is the original tree without the tree rooted at node 5. By applying Theorem 16.2.1 to the URT subtree, any subtree rooted at a member of a URT is also a URT. An arbitrary URT $U$ consisting of $N$ nodes and with the root labeled by 1 can be represented as

$$U = (n_2 \leftarrow 2)(n_3 \leftarrow 3) \cdots (n_N \leftarrow N) \tag{16.5}$$

where $(n_j \leftarrow j)$ means that the $j$-th node is attached to node $n_j \in [1, j - 1]$ and $n_2 = 1$. Hence, $n_j$ is the predecessor of $j$ and the predecessor relation is indicated by the arrow "$\leftarrow$". Moreover, $n_j$ is a discrete uniform random variable on $[1, j - 1]$ and all $n_2, n_3, \ldots, n_N$ are independent.
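The representation (16.5) translates directly into a sampling procedure. The following sketch is an illustration only: the function names `random_urt` and `level_sizes` are hypothetical helpers, not part of the text. It draws each predecessor $n_j$ uniformly from $[1, j-1]$ and then counts the nodes per level, so that (16.3) can be checked on a sample.

```python
import random
from collections import Counter


def random_urt(num_nodes, rng):
    """Return the predecessor map {j: n_j} of a URT with nodes 1..num_nodes (root = 1)."""
    # Node j attaches to a uniformly chosen earlier node n_j in [1, j-1], as in (16.5).
    return {j: rng.randint(1, j - 1) for j in range(2, num_nodes + 1)}


def level_sizes(pred, num_nodes):
    """Return [X^(0), X^(1), ...], the number of nodes at each hopcount from the root."""
    depth = {1: 0}
    for j in range(2, num_nodes + 1):  # a predecessor always has a smaller label
        depth[j] = depth[pred[j]] + 1
    counts = Counter(depth.values())
    return [counts[k] for k in range(max(counts) + 1)]


rng = random.Random(1)
pred = random_urt(26, rng)
sizes = level_sizes(pred, 26)
print(sizes, sum(sizes))  # the level sizes always sum to N, see (16.3)
```

Because a predecessor always carries a smaller label than its child, the depths can be filled in a single pass in label order.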
[Figure 16.3: a tree with root 1 and nodes labeled 1 to 26, drawn per level, with level sizes $X_N^{(0)} = 1$, $X_N^{(1)} = 5$, $X_N^{(2)} = 9$, $X_N^{(3)} = 7$ and $X_N^{(4)} = 4$.]

Fig. 16.3. An instance of a uniform recursive tree with $N = 26$ nodes organized per level $0 \le k \le 4$. The node number (inside the circle) indicates the order in which the nodes were attached to the tree.
Theorem 16.2.2 The total number of URTs with $N$ nodes is $(N - 1)!$.

Proof: (a) Let the nodes be labeled in the order of attachment to the URT and assign label 1 to the root. The URT growth law indicates that node 2 can only be attached in one way, node 3 in two ways, namely to node 1 and node 2 with equal probability. The $k$-th node can be attached to $k - 1$ possible nodes. Each of these possible constructions leads to a URT. (b) By summing over all allowable configurations in (16.5), we obtain

$$\sum_{n_2=1}^{1} \sum_{n_3=1}^{2} \cdots \sum_{n_N=1}^{N-1} 1 = (N - 1)!$$

and this proves the theorem. $\square$
In general, Cayley's Theorem (Appendix B.1 art. 3) states that there are $N^{N-2}$ labeled trees possible. The URT is a subset of the set of all possible labeled trees. Not all labeled trees are URTs, because the nodes that are further away from the root must have larger labels. The shortest path tree from the source or root $A$ to other nodes in the complete graph is the tree associated with the Markov discovery process, where the number of nodes $X(t)$ at time $t$ is constructed as follows. Just as the discovery process, the associated tree starts at the root $A$. We now investigate the embedded Markov chain (Section 10.4) of the continuous-time discovery process. After each transition in the continuous-time Markov chain, $X(t) \to X(t) + 1$, an edge of unit length is attached randomly to one of the $n$ already discovered nodes in the associated tree because a new edge is equally likely to be attached to any of the $n$ discovering nodes. Hence, the construction of the tree associated with the Markov discovery process and illustrated in Fig. 16.2 on the right demonstrates that the shortest path tree in the complete graph $K_N$ with exponential link weights is a uniform recursive tree. This property of the shortest path tree in $K_N$ with exponential link weights is an important motivation to study the URT. More generally, in van der Hofstad et al. (2001) we have proved that, for a fixed link density $p$ and sufficiently large $N$, the shortest path tree in the class RGU, the class of random graphs $G_p(N)$ with exponential or uniformly distributed link weights, is a URT. Smythe and Mahmoud (1995) have reviewed a number of results on recursive trees that have appeared in the literature from the late 1960s up to 1995.
16.3 The hopcount $h_N$ in the URT

16.3.1 Theory

The hopcount $h_N$ from the root to an arbitrarily chosen node in the URT equals the number of links or hops from the root to that node. We allow the arbitrary node to coincide with the root, in which case $h_N = 0$.

Theorem 16.3.1 The probability generating function of the hopcount in the URT with $N$ nodes is

$$\varphi_{h_N}(z) = E\left[z^{h_N}\right] = \frac{\Gamma(N + z)}{\Gamma(N + 1)\,\Gamma(z + 1)} \tag{16.6}$$

Proof: Since the number of nodes at hopcount $k$ from the root (or at level $k$) is $X_N^{(k)}$, a node uniformly chosen out of $N$ nodes in the URT has probability $\frac{E[X_N^{(k)}]}{N}$ of having hopcount $k$,

$$\Pr[h_N = k] = \frac{E\left[X_N^{(k)}\right]}{N} \tag{16.7}$$

If the size of the URT grows from $n$ to $n + 1$ nodes, each node at hopcount $k - 1$ from the root can generate a node at hopcount $k$ with probability $1/n$. Hence, for $k \ge 1$,

$$E\left[X_N^{(k)}\right] = \sum_{n=k}^{N-1} \frac{E\left[X_n^{(k-1)}\right]}{n}$$
With (16.7), a recursion for $\Pr[h_N = k]$ follows for $k \ge 1$ as

$$\Pr[h_N = k] = \frac{1}{N} \sum_{n=k}^{N-1} \Pr[h_n = k - 1]$$

The generating function of $h_N$ equals

$$\begin{aligned}
\varphi_{h_N}(z) = E\left[z^{h_N}\right] &= \Pr[h_N = 0] + \sum_{k=1}^{N-1} \Pr[h_N = k]\, z^k \\
&= \frac{1}{N} + \frac{1}{N} \sum_{k=1}^{N-1} \sum_{n=k}^{N-1} \Pr[h_n = k - 1]\, z^k \\
&= \frac{1}{N} + \frac{1}{N} \sum_{n=1}^{N-1} \sum_{k=1}^{n} \Pr[h_n = k - 1]\, z^k = \frac{1}{N} + \frac{z}{N} \sum_{n=1}^{N-1} \varphi_{h_n}(z)
\end{aligned}$$

Taking the difference between $(N + 1)\varphi_{h_{N+1}}(z)$ and $N\varphi_{h_N}(z)$ results in the recursion

$$(N + 1)\,\varphi_{h_{N+1}}(z) = (N + z)\,\varphi_{h_N}(z)$$

Iterating this recursion starting from $\varphi_{h_1}(z) = E\left[z^{h_1}\right] = E\left[z^0\right] = 1$ leads to (16.6). $\square$

Corollary 16.3.2 The probability density function of the hopcount in the URT with $N$ nodes is

$$\Pr[h_N = k] = \frac{(-1)^{N-(k+1)} S_N^{(k+1)}}{N!} \tag{16.8}$$

Proof: The probability generating function $\varphi_{h_N}(z)$ in (16.6) is also the generating function of the Stirling numbers $S_N^{(k)}$ of the first kind (Abramowitz and Stegun, 1968, 24.1.3) such that the probability that a uniformly chosen node in the URT has hopcount $k$ equals (16.8). $\square$
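The recursion $(N + 1)\varphi_{h_{N+1}}(z) = (N + z)\varphi_{h_N}(z)$ can be iterated on polynomial coefficients to obtain the exact hopcount distribution for small $N$. The sketch below is an illustration under that reading (the function name `hopcount_pmf` is hypothetical); it uses exact rational arithmetic, so the coefficients can be compared with the Stirling-number form (16.8).

```python
from fractions import Fraction


def hopcount_pmf(num_nodes):
    """Exact Pr[h_N = k] for k = 0..N-1, i.e. the coefficients of
    prod_{l=2}^{N} (l - 1 + z) / l, which is (16.6) written as a product."""
    coeffs = [Fraction(1)]  # phi_{h_1}(z) = 1
    for l in range(2, num_nodes + 1):
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * Fraction(l - 1, l)  # constant part (l-1)/l of the factor
            nxt[k + 1] += c * Fraction(1, l)  # part z/l raises the power of z by one
        coeffs = nxt
    return coeffs


pmf = hopcount_pmf(5)
print(pmf)                                     # numerators over 5! are 24, 50, 35, 10, 1
print(sum(k * p for k, p in enumerate(pmf)))   # mean hopcount
```

For $N = 5$ the numerators over $5!$ are the unsigned Stirling numbers of the first kind $|S_5^{(k+1)}|$, in agreement with (16.8).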
The explicit form of the generating function shows that the average hopcount $h_N$ in a URT of size $N$ equals

$$E[h_N] = \varphi'_{h_N}(1) = \frac{d}{dz} \log \varphi_{h_N}(z)\bigg|_{z=1} = \sum_{l=2}^{N} \frac{1}{l} = \psi(N + 1) + \gamma - 1 \tag{16.9}$$

where $\psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}$ is the digamma function (Abramowitz and Stegun, 1968, Section 6.3) and the Euler constant is $\gamma = 0.57721\ldots$. Similarly, the variance (2.27) follows from the logarithm of the generating function $L_{h_N}(z) = \log \Gamma(N + z) - \log \Gamma(N + 1) - \log \Gamma(z + 1)$ as

$$\begin{aligned}
\mathrm{Var}[h_N] &= \psi'(N + 1) - \psi'(2) + \psi(N + 1) + \gamma - 1 \\
&= \psi(N + 1) + \gamma - \frac{\pi^2}{6} + \psi'(N + 1)
\end{aligned}$$

Using the asymptotic formulae for the digamma function leads to

$$E[h_N] = \log N + \gamma - 1 + O\left(\frac{1}{N}\right) \tag{16.10}$$

$$\mathrm{Var}[h_N] = \log N + \gamma - \frac{\pi^2}{6} + O\left(\frac{1}{N}\right) \tag{16.11}$$
For large $N$, we apply an asymptotic formula of the Gamma function (Abramowitz and Stegun, 1968, Section 6.1.47) to the generating function of the hopcount (16.6),

$$\varphi_{h_N}(z) = \frac{N^{z-1}}{\Gamma(z + 1)} \left(1 + O\left(\frac{1}{N}\right)\right)$$

Introducing the Taylor series of $\frac{1}{\Gamma(z)} = \sum_{k=1}^{\infty} c_k z^k$, where the coefficients $c_k$ are listed in Abramowitz and Stegun (1968, Section 6.1.34), we obtain with $N^z = e^{z \log N}$,

$$\begin{aligned}
\varphi_{h_N}(z) &= \frac{1}{N} \left(1 + O\left(\frac{1}{N}\right)\right) \sum_{k=1}^{\infty} c_k z^{k-1} \sum_{k=0}^{\infty} \frac{\log^k N}{k!}\, z^k \\
&= \frac{1 + O\left(\frac{1}{N}\right)}{N} \sum_{k=0}^{\infty} \left(\sum_{m=0}^{k} c_{m+1} \frac{\log^{k-m} N}{(k - m)!}\right) z^k
\end{aligned}$$

With the definition (2.18) of the probability generating function, we conclude that the asymptotic form of the probability density function (16.8) of the hopcount in the URT is

$$\Pr[h_N = k] = \frac{1 + O\left(\frac{1}{N}\right)}{N} \sum_{m=0}^{k} c_{m+1} \frac{\log^{k-m} N}{(k - m)!} \tag{16.12}$$

Since the coefficients $c_k$ are rapidly decreasing, approximating the sum in (16.12) by its first term ($m = 0$) yields, to first order in $N$,

$$\Pr[h_N = k] \approx \frac{(\log N)^k}{N\, k!} \tag{16.13}$$

which is recognized as a Poisson distribution (3.9) with mean $\log N$. Hence, for large $N$ and to first order, the average and variance of the hopcount in the URT are approximately $E[h_N] \approx \mathrm{Var}[h_N] \approx \log N$. The accuracy of the Poisson approximation can be estimated by comparison with the average (16.10) and the variance (16.11) found above up to second order in $N$. For example, if the URT has $N = 10^4$ nodes, the Poisson approximation yields $E[h_N] = \mathrm{Var}[h_N] = 9.21034$, while the average (16.10) is $E[h_N] = 8.78756$, accurate up to $10^{-4}$, and the variance (16.11) is $\mathrm{Var}[h_N] = 8.14262$. The exact results are $E[h_N] = 8.78761$ and $\mathrm{Var}[h_N] = 8.14277$.
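The numbers quoted for $N = 10^4$ can be reproduced numerically. The sketch below relies on one observation that is not spelled out in the text: the generating function (16.6) factorizes as $\prod_{l=2}^{N} \frac{l - 1 + z}{l}$, so $h_N$ has the same distribution as a sum of independent Bernoulli variables with means $1/l$. The function name `hopcount_moments` is hypothetical.

```python
import math


def hopcount_moments(num_nodes):
    """Exact mean and variance of h_N, treating h_N as a sum of independent
    Bernoulli(1/l) variables for l = 2..N (the factorized form of (16.6))."""
    mean = sum(1.0 / l for l in range(2, num_nodes + 1))
    var = sum((1.0 / l) * (1.0 - 1.0 / l) for l in range(2, num_nodes + 1))
    return mean, var


N = 10_000
EULER_GAMMA = 0.5772156649015329

exact_mean, exact_var = hopcount_moments(N)
asym_mean = math.log(N) + EULER_GAMMA - 1               # (16.10) without the O(1/N) term
asym_var = math.log(N) + EULER_GAMMA - math.pi ** 2 / 6  # (16.11) without the O(1/N) term
poisson = math.log(N)                                    # Poisson approximation (16.13)

print(exact_mean, exact_var)
print(asym_mean, asym_var)
print(poisson)
```

The three printed lines correspond to the exact values, the second-order asymptotics and the Poisson approximation discussed above.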
16.3.2 Application of the URT to the hopcount in the Internet

In trace-route measurements explained in Van Mieghem (2004a), we are interested in the hopcount $H_N$, denoted with capital $H$, which equals $h_N$ in the URT excluding the event $h_N = 0$. In other words, the source and the destination are different nodes in the graph. Since from (16.8)

$$\Pr[h_N = 0] = \frac{(-1)^{N-1} S_N^{(1)}}{N!} = \frac{1}{N}$$

we obtain, for $1 \le k \le N - 1$,

$$\Pr[H_N = k] = \Pr[h_N = k \mid h_N \ne 0] = \frac{\Pr[h_N = k,\, h_N \ne 0]}{\Pr[h_N \ne 0]} = \frac{N}{N - 1} \Pr[h_N = k]$$

Using (16.8), we find

$$\Pr[H_N = k] = \frac{N}{N - 1}\, \frac{(-1)^{N-(k+1)} S_N^{(k+1)}}{N!} \tag{16.14}$$

with corresponding generating function,

$$\begin{aligned}
\varphi_{H_N}(z) &= \sum_{k=1}^{N-1} \Pr[H_N = k]\, z^k \\
&= \frac{N}{N - 1} \sum_{k=0}^{N-1} \Pr[h_N = k]\, z^k - \frac{N}{N - 1} \Pr[h_N = 0] \\
&= \frac{N}{N - 1} \left(\varphi_{h_N}(z) - \frac{1}{N}\right)
\end{aligned}$$

The average hopcount $E[H_N] = E[h_N \mid h_N \ne 0]$ is

$$E[H_N] = \frac{N}{N - 1} \sum_{l=2}^{N} \frac{1}{l} \tag{16.15}$$
Hence, for large $N$ and in practice, we find that

$$\Pr[H_N = k] = \Pr[h_N = k] + O\left(\frac{1}{N}\right)$$

which allows us to use the previously derived expressions (16.12), (16.10) and (16.11). The histogram of the number of traversed routers in the Internet measured between two arbitrary communicating parties seems reasonably well modeled by the pdf (16.12). Figure 16.4 shows both the histogram of the hopcount deduced from paths in the Internet measured via the trace-route utility and the fit with (16.12). From the fit, we find a rather high number of nodes, $e^{12.6} \approx 3 \cdot 10^5 \le N \le e^{13.5} \approx 7 \cdot 10^5$, which points to the approximate nature of modeling the Internet hopcount by that deduced from a URT. The relation between Internet measurements and the properties of the URT is further analyzed in a series of articles (Van Mieghem et al., 2000; van der Hofstad et al., 2001; Van Mieghem et al., 2001b; Janic et al., 2002; van der Hofstad et al., 2002b). At the time of writing, an accurate model of the hopcount in the Internet is not available.

[Figure 16.4: $\Pr[H = k]$ versus hop $k$ for Asia, Europe and USA, with fits for $\log(N_{\text{Asia}}) = 13.5$, $\log(N_{\text{Europe}}) = 12.6$ and $\log(N_{\text{USA}}) = 12.9$.]

Fig. 16.4. The histograms of the hopcount derived from the trace-route measurement in three continents from CAIDA in 2004 are fitted by the pdf (16.12) of the hopcount in the URT.
16.4 The weight of the shortest path

The weight (sometimes also called the length) of the shortest path is defined as the sum of the link weights that constitute the shortest path. In Section 16.2.1, the shortest path tree in the complete graph with exponential link weights was shown to be a URT. In this section, we confine ourselves to the same type of graph and require that the source node $A$ (or root) is different from the destination node $B$. By Theorem 10.2.3 of a continuous-time Markov chain, the discovery time of the $k$-th node from node $A$ equals $v_k = \sum_{n=1}^{k} \tau_n$, where $\tau_1, \tau_2, \ldots, \tau_k$ are independent, exponentially distributed random variables with parameter $\lambda_n = n(N - n)$ with $1 \le n \le k$. We call $\tau_j$ the interattachment time between the discovery or the attachment to the URT of the $(j - 1)$-th and $j$-th node in the graph. The Laplace transform of $v_k$ is

$$E\left[e^{-z v_k}\right] = \int_0^{\infty} e^{-zt}\, \frac{d}{dt} \Pr[v_k \le t]\, dt$$
For a sum of independent exponential random variables, using the probability generating function (3.16), we have

$$E\left[e^{-z v_k}\right] = E\left[\exp\left(-z \sum_{n=1}^{k} \tau_n\right)\right] = \prod_{n=1}^{k} E\left[e^{-z \tau_n}\right] = \prod_{n=1}^{k} \frac{n(N - n)}{z + n(N - n)} \tag{16.16}$$

The probability generating function (see footnote 3) $\varphi_{W_N}(z) = E\left[e^{-z W_N}\right]$ of the weight $W_N$ of the shortest path equals

$$\begin{aligned}
\varphi_{W_N}(z) &= \sum_{k=1}^{N-1} E\left[e^{-z v_k}\right] \Pr[B \text{ is } k\text{-th attached node in URT}] \\
&= \frac{1}{N - 1} \sum_{k=1}^{N-1} \prod_{n=1}^{k} \frac{n(N - n)}{z + n(N - n)}
\end{aligned} \tag{16.17}$$

because any node apart from the root $A$ but including the destination node $B$ has equal probability to be the $k$-th attached node. The average weight is

$$E[W_N] = -\frac{d\varphi_{W_N}(z)}{dz}\bigg|_{z=0} = -\frac{1}{N - 1} \sum_{k=1}^{N-1} \frac{d}{dz} \prod_{n=1}^{k} \frac{n(N - n)}{z + n(N - n)}\bigg|_{z=0}$$

Footnote 3: If the link weights have mean $\frac{1}{a}$ (instead of 1), then $W_N$ is multiplied by $a$ as explained in Sections 16.2.1 and 3.4.1. The weight of the scaled shortest path $W_{N,a}$ has pgf $\varphi_{W_{N,a}}(z) = E\left[e^{-z a W_N}\right] = \varphi_{W_N}(az)$.
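Using the logarithmic derivative of the product,

$$\frac{d}{dz} \prod_{n=1}^{k} \frac{n(N - n)}{z + n(N - n)}\Bigg|_{z=0} = \left[\prod_{n=1}^{k} \frac{n(N - n)}{z + n(N - n)}\; \frac{d}{dz} \left(\sum_{n=1}^{k} \log \frac{n(N - n)}{z + n(N - n)}\right)\right]_{z=0} = -\sum_{n=1}^{k} \frac{1}{n(N - n)}$$

gives

$$E[W_N] = \frac{1}{N - 1} \sum_{k=1}^{N-1} \sum_{n=1}^{k} \frac{1}{n(N - n)} = \frac{1}{N - 1} \sum_{n=1}^{N-1} \frac{1}{n(N - n)} \sum_{k=n}^{N-1} 1 = \frac{1}{N - 1} \sum_{n=1}^{N-1} \frac{N - n}{n(N - n)}$$

The average weight is

$$E[W_N] = \frac{1}{N - 1} \sum_{n=1}^{N-1} \frac{1}{n} = \frac{\psi(N) + \gamma}{N - 1} \tag{16.18}$$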
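For large $N$,

$$E[W_N] = \frac{\log N + \gamma}{N} + O\left(\frac{\log N}{N^2}\right)$$

Similarly, the variance is computed (see problem (ii) in Section 16.9) as

$$\mathrm{Var}[W_N] = \frac{3}{N(N - 1)} \sum_{n=1}^{N-1} \frac{1}{n^2} - \frac{\left(\sum_{n=1}^{N-1} \frac{1}{n}\right)^2}{(N - 1)^2 N} \tag{16.19}$$

and for large $N$,

$$\mathrm{Var}[W_N] = \frac{\pi^2}{2N^2} + O\left(\frac{\log^2 N}{N^3}\right)$$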
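The closed forms (16.18) and (16.19) can be cross-checked against a direct computation from the discovery process. The sketch below is an illustration with hypothetical function names: it conditions on the rank $k$ at which $B$ is attached (uniform on $1, \ldots, N - 1$) and uses the mean $\sum_n 1/\lambda_n$ and variance $\sum_n 1/\lambda_n^2$ of $v_k$ as a sum of independent exponentials with rates $\lambda_n = n(N - n)$.

```python
from fractions import Fraction


def weight_moments_direct(N):
    """Mean and variance of W_N by conditioning on the attachment rank k of B."""
    m1 = m2 = Fraction(0)
    mean_vk = var_vk = Fraction(0)
    for n in range(1, N):                      # after this step, k = n
        lam = n * (N - n)                      # lambda_n = n(N - n)
        mean_vk += Fraction(1, lam)            # E[v_k] = sum of 1/lambda_n
        var_vk += Fraction(1, lam * lam)       # Var[v_k] = sum of 1/lambda_n^2
        m1 += mean_vk
        m2 += var_vk + mean_vk ** 2            # E[v_k^2]
    m1 /= N - 1                                # B is the k-th node with probability 1/(N-1)
    m2 /= N - 1
    return m1, m2 - m1 ** 2


def weight_moments_closed(N):
    """Mean (16.18) and variance (16.19) in closed form."""
    h1 = sum(Fraction(1, n) for n in range(1, N))
    h2 = sum(Fraction(1, n * n) for n in range(1, N))
    mean = h1 / (N - 1)
    var = 3 * h2 / (N * (N - 1)) - h1 ** 2 / ((N - 1) ** 2 * N)
    return mean, var


for N in (2, 3, 4, 5, 10):
    print(N, weight_moments_direct(N), weight_moments_closed(N))
```

Exact rational arithmetic is used so that the two computations can be compared for equality rather than within a tolerance.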
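By inverse Laplace transform of (16.17), the distribution $\Pr[W_N \le t]$ can be computed. The asymptotic distribution for the weight of the shortest path is (see problem (iii) in Section 16.9)

$$\lim_{N \to \infty} \Pr[N W_N - \log N \le x] = e^{-e^{-x}}$$

[...] for $N$ large. We rewrite (16.21) as

$$\varphi_{T_N}(x) = \frac{[(N - 1)!]^2}{\prod_{n=1}^{N-1} \left[x + \frac{N^2}{4} - \left(n - \frac{N}{2}\right)^2\right]} \tag{16.24}$$

For $N = 2M$, using $\frac{\Gamma(z + m)}{\Gamma(z + 1)} = \prod_{n=1}^{m-1} (n + z)$, we deduce that

$$\varphi_{T_{2M}}(x) = \left(\frac{\Gamma(2M)\, \Gamma\left(1 + \sqrt{x + M^2} - M\right)}{\Gamma\left(M + \sqrt{x + M^2}\right)}\right)^2 \tag{16.25}$$

For large $M$, there holds $\sqrt{x + M^2} \approx M + \frac{x}{2M}$, provided $|x| < 2M$. After substitution of $x = 2My$ in (16.25), with $|y| < 1$, we obtain

$$\varphi_{T_{2M}}(2My) \approx \Gamma^2(1 + y)\, \frac{\Gamma^2(2M)}{\Gamma^2(2M + y)} \approx \Gamma^2(1 + y)\, (2M)^{-2y}$$

from which follows the asymptotic relation

$$\lim_{N \to \infty} N^{2y} \varphi_{T_N}(Ny) = \Gamma^2(1 + y)$$

[...] we find that $N^{2y} \varphi_{T_N}(Ny)$ is analytic whenever the real part of $y$ is non-negative. Evaluation along the line $\mathrm{Re}(y) = c = 0$ then gives

$$\lim_{N \to \infty} g_N(t) = \lim_{N \to \infty} \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{itu}\, N^{2iu}\, \varphi_{T_N}(iNu)\, du$$

[...] where $\Lambda(t) = e^{-e^{-t}}$ is the Gumbel distribution (3.37). Furthermore, the two-fold convolution is given by

$$\begin{aligned}
\frac{d}{dt}\left(\Lambda^{(2*)}(t)\right) &= \int_{-\infty}^{\infty} e^{-u} e^{-e^{-u}}\, e^{-(t-u)} e^{-e^{-(t-u)}}\, du \\
&= e^{-t} \int_{-\infty}^{\infty} \exp\left[-2 e^{-t/2} \cosh\left(u - \frac{t}{2}\right)\right] du \\
&= 2 e^{-t} \int_0^{\infty} \exp\left[-2 e^{-t/2} \cosh(u)\right] du = 2 e^{-t} K_0\left(2 e^{-t/2}\right)
\end{aligned}$$

where $K_\nu(x)$ denotes the modified Bessel function (Abramowitz and Stegun, 1968, Section 9.6) of order $\nu$. In summary,

$$\lim_{N \to \infty} g_N(t) = g(t) = \frac{d}{dt}\left(\Lambda^{(2*)}(t)\right) = 2 e^{-t} K_0\left(2 e^{-t/2}\right) \tag{16.29}$$

[...]

$$\begin{aligned}
\Pr\left[D_{N+1}^{(k)} = j\right] ={}& \Pr\left[D_N^{(k)} = j \,\middle|\, n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right] \Pr\left[n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right] \\
&+ \Pr\left[D_N^{(k)} = j + 1 \,\middle|\, n_{N+1} \in \left\{D_N^{(k)}\right\}\right] \Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] \\
&+ \Pr\left[D_N^{(k)} = j - 1 \,\middle|\, n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right] \Pr\left[n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right]
\end{aligned}$$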
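If the process of attaching a new node $N + 1$ does not depend on the way the $N$ previous nodes are attached but rather on their number, there holds $\Pr\left[D_N^{(k)} = j \mid n_{N+1}\right] = \Pr\left[D_N^{(k)} = j\right]$. This property holds for the URT. We obtain a three-point recursion for $k > 1$,

$$\begin{aligned}
\Pr\left[D_{N+1}^{(k)} = j\right] ={}& \Pr\left[D_N^{(k)} = j\right] \Pr\left[n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right] \\
&+ \Pr\left[D_N^{(k)} = j + 1\right] \Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] \\
&+ \Pr\left[D_N^{(k)} = j - 1\right] \Pr\left[n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right]
\end{aligned}$$

The probability generating function

$$\varphi_D(z, N; k) = E\left[z^{D_N^{(k)}}\right] = \sum_{j=0}^{\infty} \Pr\left[D_N^{(k)} = j\right] z^j$$

is obtained after multiplication by $z^j$ and summing over all $j$,

$$\begin{aligned}
\varphi_D(z, N + 1; k) ={}& \Pr\left[n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right] \varphi_D(z, N; k) \\
&+ \Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] \frac{\varphi_D(z, N; k) - \varphi_D(0, N; k)}{z} \\
&+ \Pr\left[n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right] z\, \varphi_D(z, N; k)
\end{aligned}$$

Now $\varphi_D(0, N; k) = \Pr\left[D_N^{(k)} = 0\right]$ is the probability of the event that there are no nodes with degree $k$ for $1 \le k \le k_{\max} \le N - 1$. Since the normalization of the generating function requires that $\varphi_D(1, N; k) = 1$ and since

$$\Pr\left[n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right] + \Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] + \Pr\left[n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right] = 1$$

it follows that $\Pr\left[D_N^{(k)} = 0\right] \Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] = 0$. Further,

$$\Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] \ne 0$$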
for any $k \in [1, k_{\max}]$ because the attachment of the node $n_{N+1}$ is possible to any non-empty set. This means that the absence of nodes with degree $k \in [1, k_{\max}]$ cannot occur in URTs, thus $\Pr\left[D_N^{(k)} = 0\right] = 0$. A consequence is that the probability generating function $\varphi_D(z, N; k)$ is at least $O(z)$ as $z \to 0$ (for $N > 1$). After using $\varphi_D(0, N; k) = 0$ and eliminating $\Pr\left[n_{N+1} \notin \left\{D_N^{(k)}, D_N^{(k-1)}\right\}\right]$, the recursion relation for the probability generating function becomes (see footnote 5)

$$\frac{\varphi_D(z, N + 1; k)}{\varphi_D(z, N; k)} = 1 + \left(\frac{\Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right]}{z} - \Pr\left[n_{N+1} \in \left\{D_N^{(k-1)}\right\}\right]\right)(1 - z) \tag{16.33}$$

The special case for $k = 1$ and $N > 1$ is

$$\varphi_D(z, N + 1; 1) = \left(1 + \frac{(1 - z) \Pr\left[n_{N+1} \in \left\{D_N^{(1)}\right\}\right]}{z}\right) \varphi_D(z, N; 1)$$

16.6.2 The Average Number of Degree $k$ Nodes in the URT

In the URT, a new node $n_{N+1}$ is attached uniformly to any of $N$ previously attached nodes such that

$$\Pr\left[n_{N+1} \in \left\{D_N^{(k)}\right\}\right] = \frac{E\left[D_N^{(k)}\right]}{N}$$

Also, the probability that an arbitrary node in a URT of size $N$ has degree $k$ equals $\frac{E\left[D_N^{(k)}\right]}{N}$. We obtain from (16.33)

$$\varphi_D(z, N + 1; k) = \left(1 + \left(\frac{E\left[D_N^{(k)}\right]}{Nz} - \frac{E\left[D_N^{(k-1)}\right]}{N}\right)(1 - z)\right) \varphi_D(z, N; k) \tag{16.34}$$

Footnote 5: With the initialization $\varphi_D(z, m; k) = E\left[z^{D_m^{(k)}}\right] = 1$ for all $m \le k$ because $D_m^{(k)} = 0$ if $1 < m \le k$, after iterating (16.33) we arrive at

$$\varphi_D(z, N; k) = \prod_{m=k}^{N-1} \left[1 - (1 - z)\left(\Pr\left[n_{m+1} \in \left\{D_m^{(k-1)}\right\}\right] - \frac{1}{z} \Pr\left[n_{m+1} \in \left\{D_m^{(k)}\right\}\right]\right)\right]$$
16.6 The degree of a node in the URT
By taking the derivative of both sides in (16.34) with respect to $z$ and evaluating at $z = 1$, a recursion for the average is found,

$$E\left[D_{N+1}^{(k)}\right] = \frac{N - 1}{N} E\left[D_N^{(k)}\right] + \frac{E\left[D_N^{(k-1)}\right]}{N} \tag{16.35}$$

Let $r_N^{(k)} = (N - 1) E\left[D_N^{(k)}\right]$, then the recursion valid for $1 < k \le N - 2$ becomes

$$r_{N+1}^{(k)} = r_N^{(k)} + \frac{r_N^{(k-1)}}{N - 1} \tag{16.36}$$

Theorem 16.6.1 In the URT, the average number of degree $k$ nodes is given by

$$E\left[D_N^{(k)}\right] = \frac{N}{2^k} + \frac{(-1)^{N+k-1} S_{N-1}^{(k)}}{(N - 1)!} + \frac{(-1)^N}{2^k (N - 1)!} \sum_{j=1}^{k-1} S_{N-1}^{(j)} (-2)^j \tag{16.37}$$

Proof: See Section 16.8. $\square$
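For small $N$, formula (16.37) can be compared with a brute-force average over all $(N - 1)!$ equally likely URTs of Theorem 16.2.2. The sketch below is such a check and not the proof of Section 16.8; the function names are hypothetical, and the signed Stirling numbers of the first kind are generated with the standard recurrence $S_{n+1}^{(k)} = S_n^{(k-1)} - n S_n^{(k)}$.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def stirling_first(n_max):
    """Signed Stirling numbers of the first kind S[n][k] for 0 <= k <= n <= n_max."""
    S = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1
    for n in range(n_max):
        for k in range(1, n + 2):
            S[n + 1][k] = S[n][k - 1] - n * S[n][k]
    return S


def mean_degree_formula(N, k, S):
    """Right-hand side of (16.37)."""
    f = factorial(N - 1)
    tail = sum(S[N - 1][j] * (-2) ** j for j in range(1, k))
    return (Fraction(N, 2 ** k)
            + Fraction((-1) ** (N + k - 1) * S[N - 1][k], f)
            + Fraction((-1) ** N * tail, 2 ** k * f))


def mean_degree_enumerated(N, k):
    """Average number of degree-k nodes over all predecessor sequences (16.5)."""
    total = 0
    for preds in product(*(range(1, j) for j in range(2, N + 1))):
        degree = [0] * (N + 1)
        for j, parent in enumerate(preds, start=2):
            degree[j] += 1          # link from node j to its predecessor
            degree[parent] += 1
        total += sum(1 for d in degree[1:] if d == k)
    return Fraction(total, factorial(N - 1))


S = stirling_first(6)
for k in range(1, 5):
    print(k, mean_degree_formula(5, k, S), mean_degree_enumerated(5, k))
```

The enumeration grows as $(N - 1)!$, so this comparison is only practical for very small trees.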
For large $N$ and using the asymptotics of the Stirling numbers $S_N^{(k)}$ of the first kind (Abramowitz and Stegun, 1968, Section 24.1.3.III), the asymptotic law is

$$\Pr[D_{\mathrm{URT}} = k] = \frac{E\left[D_N^{(k)}\right]}{N} = \frac{1}{2^k} + O\left(\frac{\log^{k-1} N}{N^2}\right) \tag{16.38}$$

The ratio of the average number of nodes with degree $k$ over the total number of nodes, which equals the probability that an arbitrary node in a URT of size $N$ has degree $k$, decreases exponentially fast with rate $\ln 2$. The variance $\mathrm{Var}\left[D_N^{(k)}\right]$ is most conveniently computed from the logarithm of the probability generating function with (2.27). By taking the logarithm of both sides in (16.34), differentiating twice and adding (16.35), we obtain

$$\mathrm{Var}\left[D_{N+1}^{(k)}\right] = f(N; k) + \mathrm{Var}\left[D_N^{(k)}\right]$$

where

$$f(N; k) = -\left(\frac{E\left[D_N^{(k)}\right]}{N} - \frac{E\left[D_N^{(k-1)}\right]}{N}\right)^2 + \left(\frac{E\left[D_N^{(k)}\right]}{N} + \frac{E\left[D_N^{(k-1)}\right]}{N}\right)$$

Since $\mathrm{Var}\left[D_m^{(k)}\right] = 0$ for $m \le k$, the general solution is

$$\mathrm{Var}\left[D_N^{(k)}\right] = \sum_{j=k}^{N} f(j; k)$$

For large $N$, using (16.38), we observe that

$$\frac{\mathrm{Var}\left[D_N^{(k)}\right]}{N} = \frac{3}{2^k} - \frac{1}{2^{2k}} + O\left(\frac{\log^{2k-2} N}{N^2}\right) \tag{16.39}$$

In practice, if we use the estimator $\hat{t}_N^{(k)} = \frac{D_N^{(k)}}{N}$ for the probability that the degree of a node equals $k$, then (a) the estimator is unbiased because the mean of the estimator $E\left[\hat{t}_N^{(k)}\right]$ equals the correct mean $\frac{E\left[D_N^{(k)}\right]}{N}$ and (b) the variance $\mathrm{Var}\left[\hat{t}_N^{(k)}\right] = \mathrm{Var}\left[\frac{D_N^{(k)}}{N}\right] = \frac{\mathrm{Var}\left[D_N^{(k)}\right]}{N^2} \to 0$ as $O\left(\frac{1}{N}\right)$ for large $N$.
[Figure 16.6: $\Pr[D_U = k]$ versus $k$ on a log-lin scale. RIPE data (May-June 2003), $N = 2574$, $L = 3992$, fit: $\ln(\Pr[D_U = k]) = 0.44 - 0.67k$ with $\rho = 0.99$. RIPE data (Jan.-Feb. 2004), $N = 3850$, $L = 6743$, fit: $\ln(\Pr[D_U = k]) = -0.49 - 0.41k$ with $\rho = 0.95$.]

Fig. 16.6. The histogram of the degree $D_U$ derived from the graph $G_U$ formed by the union of paths measured via trace-route in the Internet. Both measurements in 2003 and 2004 are fitted on a log-lin plot and the correlation coefficient $\rho$ quantifies the quality of the fit.
The law (16.38) is observed in Fig. 16.6, which plots the histogram of the degree $D_U$ in the graph $G_U$. The graph $G_U$ is obtained from the union of trace-routes from each RIPE measurement box to any other box positioned mainly in the European part of the Internet. For about 50 measurement boxes in 2003, the correspondence is striking because the slope of the fit on a log-lin scale equals 0.668 while the law (16.38) gives $\ln 2 = 0.693$. Ignoring in Fig. 16.6 the leaf nodes with $k = 1$ suggests that the graph $G_U$ is URT-like. For 72 measurement boxes in 2004, which obviously results in a larger graph $G_U$, deviations from the URT law (16.38) are observed. If measurements between a larger full mesh of boxes were possible and if the measurement boxes were more homogeneously spread over the Internet, a power law behavior is likely to be expected, as mentioned in Section 15.3. However, these earlier reported trace-route measurements that lead to power law degrees have been performed from a relatively small number of sources to a large number of destinations. These results question the observability of the Internet: how accurate are Internet properties such as hopcount and degree that are derived from incomplete measurements, i.e. from a selected small subset of nodes at which measurement boxes are placed?
16.6.3 The degree of the shortest path tree in the complete graph with i.i.d. exponential link weights

In the complete graph $K_N$ with i.i.d. exponential link weights, any node $n$ possesses equal properties in probability because of symmetry. If we denote by $d_n$ the degree of node $n$ in the shortest path tree rooted at that node $n$, the symmetry implies that $\Pr[d_n = k] = \Pr[d_i = k]$ for any node $n$ and $i$. In fact, we consider here the degree of a URT as an overlay tree in a complete graph. Concentrating on a node with label 1, we obtain from (16.32)

$$E\left[D_N^{(k)}\right] = N \Pr[d_1 = k] = N \Pr\left[X_N^{(1)} = k\right]$$

The latter follows from the fact that the degree of a node is equal to the number of its direct neighbors, the nodes at level 1, $X_N^{(1)}$. By definition of the URT, the second node surely belongs to the level set 1, while node 3 has equal probability to be attached to the root or to node 2. In general, when attaching a node $j$ to a URT of size $j - 1$, the probability that node $j$ is attached to the root equals $\frac{1}{j - 1}$. Thus, the number of nodes at level 1 in the URT (constructed upon the complete graph) is in distribution equal to the sum of $N - 1$ independent Bernoulli random variables, each with different mean $\frac{1}{j - 1}$,

$$X_N^{(1)} \stackrel{d}{=} \sum_{j=2}^{N} \mathrm{Bernoulli}\left(\frac{1}{j - 1}\right) = \sum_{j=1}^{N-1} \mathrm{Bernoulli}\left(\frac{1}{j}\right)$$

because each node in the complete graph is connected to $N - 1$ neighbors. The generating function is

$$E\left[z^{X_N^{(1)}}\right] = E\left[z^{\sum_{j=1}^{N-1} \mathrm{Bernoulli}\left(\frac{1}{j}\right)}\right] = \prod_{j=1}^{N-1} E\left[z^{\mathrm{Bernoulli}\left(\frac{1}{j}\right)}\right]$$

Using the probability generating function (3.1) of a Bernoulli random variable, $E\left[z^{\mathrm{Bernoulli}\left(\frac{1}{j}\right)}\right] = 1 - \frac{1}{j} + \frac{z}{j}$, yields

$$E\left[z^{X_N^{(1)}}\right] = \prod_{j=1}^{N-1} \left(\frac{z + j - 1}{j}\right) = \frac{\Gamma(z + N - 1)}{\Gamma(z)\,\Gamma(N)}$$

Compared to the generating function (16.6) of the hopcount $h_N$, we recognize that

$$E\left[z^{X_N^{(1)}}\right] = z\, \varphi_{h_{N-1}}(z) = \sum_{k=0}^{N-2} \Pr[h_{N-1} = k]\, z^{k+1} = \sum_{k=1}^{N-1} \Pr[h_{N-1} = k - 1]\, z^k$$

from which we deduce, for $1 \le k \le N - 1$,

$$\Pr\left[X_N^{(1)} = k\right] = \Pr[h_{N-1} = k - 1]$$

Using (16.7), we arrive at the curious result

$$\Pr\left[X_N^{(1)} = k\right] = \frac{E\left[X_{N-1}^{(k-1)}\right]}{N - 1} \qquad \text{for } 1 \le k \le N - 1.$$
The probability that the number of level 1 nodes in the shortest path tree in the complete graph with i.i.d. exponential link weights is $k$ equals the average number of nodes on level $k - 1$ in a URT of size $N - 1$ divided by that size $N - 1$. In other words, the "horizontal" distribution at level 1 is related to the "vertical" distribution of the size of the level sets. In summary (see footnote 6), in the complete graph with i.i.d. exponential link weights, the probability that an arbitrary node $n$ as root of a shortest path tree has degree $k$ is

$$\Pr[d_n = k] = \Pr[h_{N-1} = k - 1] = \frac{(-1)^{N-1-k} S_{N-1}^{(k)}}{(N - 1)!} \tag{16.40}$$

The degree of an arbitrary node in the union of all shortest path trees in the complete graph $K_N$ with i.i.d. exponential link weights is also given by (16.40) because in that union each node $n$ is once a root and further plays, by symmetry, the role of the $j$-th attached node in the URT rooted at any other node in $K_N$.

Footnote 6: This result is due to Remco van der Hofstad (private communication).

16.7 The minimum spanning tree

From an algorithmic point of view, the shortest path problem is closely related to the computation of the minimum spanning tree (MST). The Dijkstra shortest path algorithm is similar to Prim's minimum spanning tree algorithm (Cormen et al., 1991). In this section, we compute the average weight of the MST in a graph with a general link weight structure.
16.7.1 The Kruskal growth process of the MST

Since the link weights in the underlying complete graph are chosen independently and assigned randomly to links in the complete graph, the resulting graph is probabilistically the same if we first order the set of link weights and assign them in increasing order randomly to links in the complete graph. In the latter construction process, only the order statistics or the ranking of the link weights suffices to construct the graph because the precise link weight can be unambiguously associated to the rank of a link. This observation immediately favors the Kruskal algorithm for the MST over Prim's algorithm. Although the Prim algorithm leads to the same MST, it gives a more complicated, long-memory growth process, where the attachment of each new node depends stochastically on the whole growth history so far. Pietronero and Schneider (1990) illustrate that in our approach Prim, in contrast with Kruskal, leads to a very complicated stochastic process for the construction of the MST. The Kruskal growth process described here is closely related to a growth process of the random graph $G_r(N, L)$ with $N$ nodes and $L$ links. The construction or growth of $G_r(N, L)$ starts from $N$ individual nodes and in each step an arbitrary, not yet connected random pair of nodes is connected. The only difference with Kruskal's algorithm for the MST is that, in Kruskal, links generating loops are forbidden. Those forbidden links are the links that connect nodes within the same connected component or "cluster". As a result, the internal wiring of the clusters differs, but the cluster size statistics (counted in nodes, not links) are exactly the same as in the corresponding random graph. The metacode of the Kruskal growth process for the construction of the MST is shown in Fig. 16.7.

KruskalGrowthMST
1. start with $N$ disconnected nodes
2. repeat until all nodes are connected
3.     randomly select a node pair $(i, j)$
4.     if a path $P_{i \to j}$ does not exist
5.         then connect $i$ to $j$

Fig. 16.7. Kruskal growth process

The growth process of the random graph $G_p(N)$, which is asymptotically equal to that of $G_r(N, L)$, is quantified in Section 15.6.4 for large $N$. The fraction of nodes $S$ in the giant component of $G_p(N)$ is related to the average degree $\mu_{rg}$ or to the link density $p$ because $\mu_{rg} = p(N - 1)$ in $G_p(N)$ by (15.20). For large $N$, the size of the giant cluster in the forest is thus determined as a function of the number of added links that increase $\mu_{rg}$.
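The metacode of Fig. 16.7 can be turned into a small simulation. The sketch below is an illustration under two assumptions that are not in the text: the test "a path $P_{i \to j}$ does not exist" is implemented with a union-find (disjoint-set) structure, and node pairs are drawn with replacement, so an already tried pair may be selected again. The function name `kruskal_growth` is hypothetical.

```python
import random


def kruskal_growth(num_nodes, rng):
    """Run the growth process of Fig. 16.7; return (tree_links, attempts)."""
    parent = list(range(num_nodes))  # union-find forest: every node starts alone

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    tree_links, attempts = [], 0
    while len(tree_links) < num_nodes - 1:       # N-1 links connect all N nodes
        i, j = rng.sample(range(num_nodes), 2)   # random node pair with i != j
        attempts += 1
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                     # no path between i and j yet
            parent[root_i] = root_j              # merge the two clusters
            tree_links.append((i, j))
    return tree_links, attempts


links, attempts = kruskal_growth(1000, random.Random(42))
print(len(links), attempts)  # number of tree links and number of selected pairs
```

The ratio of `attempts` to the number of tree links gives a rough feeling for how often a selected link is forbidden during the growth.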
[Figure 16.8: a giant component and small components with the six link types labelled a to f.]

Fig. 16.8. Component structure during the Kruskal growth process.
We will now transform the mean degree $\mu_{rg}$ in the random graph $G_p(N)$ to the mean degree $\mu_{MST}$ in the corresponding stage in the Kruskal growth process of the MST. In early stages of the growth each selected link will be added with high probability such that $\mu_{MST} = \mu_{rg}$ almost surely. After some time the probability that a selected link is forbidden increases, and thus $\mu_{rg}$ exceeds $\mu_{MST}$. In the end, when connectivity of all $N$ nodes is reached, $\mu_{MST} = 2$ (since it is a tree) while $\mu_{rg} = O(\log N)$, as follows from (15.19) and the critical threshold $p_c \sim \frac{\log N}{N}$.

Consider now an intermediate stage of the growth as illustrated in Fig. 16.8. Assume there is a giant component of average size $NS$ and $n_l = N(1-S)/s_l$ small components of average size $s_l$ each. Then we can distinguish six types of links labelled a-f in Fig. 16.8. Types a and b are links that have been chosen earlier in the giant component (a) and in the small components (b) respectively. Types c and d are eligible links between the giant component and a small component (c) and between small components (d) respectively. Types e and f are forbidden links connecting nodes within the giant component (e), respectively within a small component (f). For large $N$, we can enumerate the average number of links $L_x$ of each type $x$:
$$L_a + L_b = \tfrac{1}{2}\mu_{MST} N$$
$$L_c = SN \cdot (1-S)N$$
$$L_d = \tfrac{1}{2} n_l^2 \cdot s_l^2$$
$$L_e = \tfrac{1}{2}(SN)^2 - SN$$
$$L_f = \tfrac{1}{2} n_l s_l (s_l - 1) - n_l (s_l - 1)$$
To highest order in $O(N^2)$, we have
$$L_c = N^2 S(1-S), \qquad L_d = \tfrac{1}{2} N^2 (1-S)^2, \qquad L_e = \tfrac{1}{2} N^2 S^2$$
The probability that a randomly selected link is eligible is $q = \frac{L_c + L_d}{L_c + L_d + L_e + L_f}$ or, to order $O(N^2)$,
$$q = 1 - S^2 \tag{16.41}$$
In contrast with the growth of the random graph $G_p(N)$ where at each stage a link is added with probability $p$, in the Kruskal growth of the MST we are only successful to add one link (with probability 1) per $\frac{1}{q}$ stages on average. Thus the average number of links added in the random graph corresponding to one link in the MST is $\frac{1}{q} = \frac{1}{1-S^2}$. This provides an asymptotic mapping between $\mu_{rg}$ and $\mu_{MST}$ in the form of a differential equation,
$$\frac{d\mu_{rg}}{d\mu_{MST}} = \frac{1}{1-S^2}$$
By using (15.22), we find
$$\frac{d\mu_{MST}}{dS} = \frac{d\mu_{MST}}{d\mu_{rg}} \frac{d\mu_{rg}}{dS} = \frac{(1+S)\left(S + (1-S)\log(1-S)\right)}{S^2}$$
Integration with the initial condition $\mu_{MST} = 2$ at $S = 1$ finally gives the average degree $\mu_{MST}$ in the MST as a function of the fraction $S$ of nodes in the giant component,
$$\mu_{MST}(S) = 2S - \frac{(1-S)^2}{S}\log(1-S) \tag{16.42}$$
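As a numerical sanity check, the closed form $\mu_{MST}(S) = 2S - (1-S)^2 \log(1-S)/S$ can be compared with the differential relation $d\mu_{rg}/d\mu_{MST} = 1/(1-S^2)$, using $\mu_{rg} = -\log(1-S)/S$ for the random graph. The sketch below is illustrative only: the function names and the finite-difference step `h` are assumptions introduced here, not part of the text.

```python
import math


def mu_rg(s):
    """Mean degree of the random graph as a function of S: -log(1 - S) / S."""
    return -math.log1p(-s) / s


def mu_mst(s):
    """Mean degree in the MST forest as a function of S, closed form (16.42)."""
    if s >= 1.0:
        return 2.0                       # limit value at S = 1 (spanning tree)
    return 2.0 * s - (1.0 - s) ** 2 * math.log1p(-s) / s


def slope_check(s, h=1e-6):
    """Return (finite-difference estimate of d mu_MST / d mu_rg, 1 - S^2)."""
    estimate = (mu_mst(s + h) - mu_mst(s - h)) / (mu_rg(s + h) - mu_rg(s - h))
    return estimate, 1.0 - s * s
```

For small $S$ the function `mu_mst` approaches 1, which is the transition point $\mu_{MST} = 1$ at which the giant component appears.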
As shown in Fig. 16.9, the asymptotic result (16.42) agrees well with the simulation (even for a single sample), except in a small region around the transition $\mu_{MST} = 1$ and for relatively small $N$. The key observation is that all transition probabilities in the Kruskal growth process asymptotically depend on merely one parameter $S$, the fraction of nodes in the giant component, and $S$ is called an order parameter in statistical physics. In general, the expectation of an order parameter distinguishes the qualitatively different regimes (states) below and above the phase transition. In higher dimensions, fluctuations of the order parameter around the mean can be neglected and the mean value can be computed from a self-consistent mean-field theory. In our problem, the underlying complete (or random) graph topology makes the problem effectively infinite-dimensional. The argument leading to (15.20) is essentially a mean-field argument.

Fig. 16.9. Size of the giant component (divided by $N$) as a function of the mean degree $\mu_{MST}$, for $N$ = 1000, 10000 and 25000 together with the theory (16.42). Each simulation for a different number of nodes $N$ consists of one MST sample.
16.7.2 The average weight of the minimum spanning tree
By definition, the weight of the MST is
$$W_{MST} = \sum_{j=1}^{L} w_{(j)} 1_{j \in MST} \tag{16.43}$$
where $w_{(j)}$ is the $j$-th smallest link weight. The average MST weight is
$$E[W_{MST}] = \sum_{j=1}^{L} E\left[w_{(j)} 1_{j \in MST}\right]$$
The random variables $w_{(j)}$ and $1_{j \in MST}$ are independent because the $j$-th smallest link weight $w_{(j)}$ only depends on the link weight distribution and the number of links $L$, while the appearance of the $j$-th link in the MST only depends on the graph's topology, as shown in Section 16.7.1. Hence,
$$E\left[w_{(j)} 1_{j \in MST}\right] = E\left[w_{(j)}\right] E\left[1_{j \in MST}\right] = E\left[w_{(j)}\right] \Pr[j \in MST]$$
such that the average weight of the MST is
$$E[W_{MST}] = \sum_{j=1}^{L} E\left[w_{(j)}\right] \Pr[j \in MST] \tag{16.44}$$
In general, for independent link weights with probability density function $f_w(x)$ and distribution function $F_w(x) = \Pr[w \le x]$, the probability density function of the $j$-th order statistic follows from (3.36) as
$$f_{w_{(j)}}(x) = \frac{j f_w(x)}{F_w(x)} \binom{L}{j} \left(F_w(x)\right)^j \left(1 - F_w(x)\right)^{L-j} \tag{16.45}$$
The factor $\binom{L}{j} (F_w(x))^j (1 - F_w(x))^{L-j}$ is a binomial distribution with mean $\mu = F_w(x) L$ and variance $\sigma^2 = L F_w(x)(1 - F_w(x))$ that, by the Central Limit Theorem 6.3.1, tends for large $L$ to a Gaussian $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(j-\mu)^2}{2\sigma^2}}$, which peaks at $j = \mu$. For large $N$ and fixed $\frac{j}{L}$, we have$^7$
$$x_j = E\left[w_{(j)}\right] \simeq F_w^{-1}\left(\frac{j}{L}\right)$$
We found before in (16.41) that the link ranked $j$ appears in the MST with probability
$$\Pr[j \in MST] = 1 - S_j^2$$
where $S_j$ is the fraction of nodes in the giant component during the construction process of the random graph at the stage where the number of links precisely equals $j$. Since links are added independently, that stage in fact establishes the random graph $G_r(N, L = j)$. Our graph under consideration is the complete graph $K_N$ such that we add in total $L = \binom{N}{2}$ links.

$^7$ In general, it holds that $w_{(k)} = F_w^{-1}(U_{(k)})$ and
$$E\left[w_{(k)}\right] = E\left[F_w^{-1}(U_{(k)})\right] \neq F_w^{-1}\left(E\left[U_{(k)}\right]\right)$$
but, for a large number of order statistics $L$, the Central Limit Theorem 6.3.1 leads to
$$E\left[w_{(k)}\right] \simeq F_w^{-1}\left(E\left[U_{(k)}\right]\right) \simeq F_w^{-1}\left(\frac{k}{L}\right)$$
because for a uniform random variable $U$ on [0,1] the average weight of the $k$-th smallest link is exactly $E\left[w_{(k)}\right] = \frac{k}{L+1} \simeq \frac{k}{L}$.
With (15.22) and $\mu_{rg} = \frac{2L}{N}$, it follows that
$$\frac{2j}{N} = -\frac{\log(1 - S_j)}{S_j} \tag{16.46}$$
Hence,
$$E[W_{MST}] \simeq \sum_{j=1}^{L} F_w^{-1}\left(\frac{j}{L}\right) \left(1 - S_j^2\right)$$
We now approximate the sum by an integral,
$$E[W_{MST}] \simeq \int_1^L F_w^{-1}\left(\frac{u}{L}\right) \left(1 - S_u^2\right) du$$
Substituting $x = \frac{2u}{N}$ (which is the average degree in any graph $G(N, u)$) yields for large $N$, where $L \simeq \frac{N^2}{2}$,
$$E[W_{MST}] \simeq \frac{N}{2} \int_{2/N}^{N} F_w^{-1}\left(\frac{x}{N}\right) \left(1 - S^2_{\frac{N}{2}x}\right) dx \simeq \frac{N}{2} \int_0^N F_w^{-1}\left(\frac{x}{N}\right) \left(1 - S^2(x)\right) dx$$
It is known (Janson et al., 1993) that, if the number of links in the growth process of the random graph is below $\frac{N}{2}$, with high probability (and ignoring a small onset region just below $\frac{N}{2}$), there is no giant component such that $S(x) = 0$ for $x \in [0, 1]$. Thus, we arrive at the general formula valid for large $N$,
$$E[W_{MST}] \simeq \frac{N}{2} \int_0^1 F_w^{-1}\left(\frac{x}{N}\right) dx + \frac{N}{2} \int_1^N F_w^{-1}\left(\frac{x}{N}\right) \left(1 - S^2(x)\right) dx \tag{16.47}$$
The first term is the contribution from the smallest $N/2$ links in the graph, which are included in the MST almost surely. The remaining part comes from the more expensive links in the graph, which are included with diminishing probability since $1 - S^2(x)$ decreases exponentially for large $x$, as can be deduced from (15.21). The rapid decrease of $1 - S^2(x)$ makes only relatively small values of the argument $F_w^{-1}\left(\frac{x}{N}\right)$ contribute to the second integral. At this point, the specifics of the link weight distribution need to be introduced. The Taylor expansion of $\frac{N}{2} F_w^{-1}\left(\frac{x}{N}\right)$ for large $N$ to first order is
$$\frac{N}{2} F_w^{-1}\left(\frac{x}{N}\right) = \frac{N}{2} F_w^{-1}(0) + \frac{x}{2 f_w(0)} + O\left(\frac{1}{N}\right) = \frac{x}{2 f_w(0)} + O\left(\frac{1}{N}\right)$$
since we require that link weights are positive such that $F_w^{-1}(0) = 0$. This expansion is only useful provided $f_w$ is regular, i.e. $f_w(0)$ is neither zero nor infinity. These cases occur, for example, for polynomial link weights with $f_w(x) = \alpha x^{\alpha - 1}$ and $\alpha \neq 1$. For polynomial link weights, however, it holds that $\frac{N}{2} F_w^{-1}\left(\frac{x}{N}\right) = \frac{1}{2} N^{1 - \frac{1}{\alpha}} x^{\frac{1}{\alpha}}$. Formally, this latter expression reduces to the first order Taylor approach for $\alpha = 1$, apart from the constant factor $\frac{1}{f_w(0)}$. Therefore, we will first compute $E[W_{MST}]$ for polynomial link weights and then return to the case in which the Taylor expansion is useful.

16.7.2.1 Polynomial link weights
The average weight of the MST for polynomial link weights follows$^8$ from (16.47) as
$$E[W_{MST}(\alpha)] \simeq \frac{N^{1 - \frac{1}{\alpha}}}{2} \left( \frac{1}{\frac{1}{\alpha} + 1} + \int_1^N x^{\frac{1}{\alpha}} \left(1 - S^2(x)\right) dx \right)$$
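Let $y = S(x)$ and use (15.22), then $x = S^{-1}(y) = -\frac{\log(1-y)}{y}$ and $dx = \frac{d}{dy}\left(-\frac{\log(1-y)}{y}\right) dy$, while $y = S(1) = 0$ and $y = S(N) = 1$, such that
$$I = \int_1^N x^{\frac{1}{\alpha}} \left(1 - S^2(x)\right) dx = \int_0^1 \left(-\frac{\log(1-y)}{y}\right)^{\frac{1}{\alpha}} \left(1 - y^2\right) \frac{d}{dy}\left(-\frac{\log(1-y)}{y}\right) dy$$
After partial integration, we have
$$I = -\frac{1}{\frac{1}{\alpha} + 1} - \frac{1}{\frac{1}{\alpha} + 1} \int_0^1 \left(-\frac{\log(1-y)}{y}\right)^{\frac{1}{\alpha} + 1} \frac{d\left(1 - y^2\right)}{dy} dy = -\frac{1}{\frac{1}{\alpha} + 1} + \frac{2}{\frac{1}{\alpha} + 1} \int_0^\infty \frac{x^{\frac{1}{\alpha} + 1} e^{-x}}{\left(1 - e^{-x}\right)^{\frac{1}{\alpha}}} dx$$
where the last integral follows from the substitution $x = -\log(1-y)$. Finally, we end up with
$$E[W_{MST}(\alpha)] \simeq \frac{N^{1 - \frac{1}{\alpha}}}{\frac{1}{\alpha} + 1} \int_0^\infty \frac{x^{\frac{1}{\alpha} + 1} e^{-x}}{\left(1 - e^{-x}\right)^{\frac{1}{\alpha}}} dx \tag{16.48}$$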
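The integral in $E[W_{MST}(\alpha)] \simeq \frac{N^{1-1/\alpha}}{1/\alpha+1} \int_0^\infty x^{1/\alpha+1} e^{-x} (1-e^{-x})^{-1/\alpha} dx$ is easy to evaluate numerically. The following sketch uses a plain midpoint rule; the function names, the truncation point `x_max` and the number of steps are assumptions chosen here for illustration. For $\alpha = 1$ the integrand reduces to $x^2/(e^x - 1)$, whose integral is the standard value $2\zeta(3)$, so the $N$-independent factor should come out close to $\zeta(3) \approx 1.2021$.

```python
import math


def mst_weight_prefactor(alpha, x_max=60.0, steps=60000):
    """Midpoint-rule estimate of the N-independent factor in (16.48).

    x_max and steps are illustrative numerical settings (assumptions).
    """
    beta = 1.0 / alpha
    h = x_max / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        # integrand x^(beta+1) e^(-x) / (1 - e^(-x))^beta, with -expm1(-x) = 1 - e^(-x)
        total += x ** (beta + 1.0) * math.exp(-x) / (-math.expm1(-x)) ** beta
    return total * h / (beta + 1.0)


def mst_weight(alpha, n):
    """Asymptotic average MST weight for polynomial link weights and n nodes."""
    return n ** (1.0 - 1.0 / alpha) * mst_weight_prefactor(alpha)
```

Because the factor $N^{1-1/\alpha}$ is kept separate, the same prefactor can be reused for any $N$ once $\alpha$ is fixed.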
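$^8$ Since the average of the $k$-th smallest link weight can be computed from (3.36) as
$$E\left[w_{(k)}\right] = \frac{L!\, \Gamma\left(k + \frac{1}{\alpha}\right)}{\Gamma(k)\, \Gamma\left(L + 1 + \frac{1}{\alpha}\right)}$$
the exact formula (16.44) reduces to
$$E[W_{MST}(\alpha)] = \frac{L!}{\Gamma\left(L + 1 + \frac{1}{\alpha}\right)} \sum_{j=1}^{L} \frac{\Gamma\left(j + \frac{1}{\alpha}\right)}{\Gamma(j)} \left(1 - S_j^2\right)$$
Analogously to the above manipulations, after conversion to an integral, substituting $x = \frac{2u}{N}$ and using (Abramowitz and Stegun, 1968, Section 6.1.47), for large $z$, that $\frac{\Gamma\left(z + \frac{1}{\alpha}\right)}{\Gamma(z)} = z^{\frac{1}{\alpha}} \left(1 + O\left(\frac{1}{z}\right)\right)$, we arrive at the same formula.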
If $\alpha < 1$, then $E[W_{MST}(\alpha)] \to 0$ for $N \to \infty$, while for $\alpha > 1$, $E[W_{MST}(\alpha)] \to \infty$. In particular, $\lim$ [...] and we can simply replace $f_w(0)$ by $p f_w(0)$ in the expression (16.50) to obtain the average weight of the MST in the random graph $G_p(N)$.

16.8 The proof of the degree Theorem 16.6.1 of the URT

16.8.1 The case $k = N$: $E\left[D_N^{(N-1)}\right]$
If $k = N$, the recursion (16.35) becomes, with $D_N^{(N)} = 0$,
$$E\left[D_{N+1}^{(N)}\right] = \frac{E\left[D_N^{(N-1)}\right]}{N}$$
With initial value $D_2^{(1)} = 2$, the solution
$$E\left[D_N^{(N-1)}\right] = \frac{2}{(N-1)!} \tag{16.51}$$
is readily verified. Since for any URT, it holds that $\Pr\left[D_N^{(k)} = j\right] = 0$ for $j > N - k$, we have that $E\left[D_N^{(N-1)}\right] = \Pr\left[D_N^{(N-1)} = 1\right]$. Since there exist in total $(N-1)!$ different URTs of size $N$, this result (16.51) means that there are precisely two possible URTs with a node of degree $N - 1$. Indeed, one is the root with $N - 1$ children and the other is the root with one child of degree $N - 1$ that in turn possesses $N - 2$ children. Also,
$$r_N^{(N-1)} = (N-1) E\left[D_N^{(N-1)}\right] = \frac{2}{(N-2)!} \tag{16.52}$$
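For small $N$ the claim $E[D_N^{(N-1)}] = 2/(N-1)!$ can be checked by brute force, since all $(N-1)!$ recursive trees of size $N$ can be enumerated. The sketch below assumes the usual construction in which node $i \ge 2$ attaches to one of the nodes $1, \ldots, i-1$ with node 1 as the root; the function name `urt_degree_expectations` is a hypothetical helper introduced here.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def urt_degree_expectations(n):
    """Return {k: E[D_n^(k)]} by enumerating all (n-1)! recursive trees.

    Node 1 is the root; node i (i >= 2) attaches to one of the nodes 1..i-1
    and all resulting trees are taken as equally likely.
    """
    counts = {}                                   # degree k -> number of such nodes over all trees
    choices = [range(1, i) for i in range(2, n + 1)]
    for parents in product(*choices):
        degree = [0] * (n + 1)
        for child, par in enumerate(parents, start=2):
            degree[child] += 1                    # link from the child to its parent
            degree[par] += 1
        for node in range(1, n + 1):
            counts[degree[node]] = counts.get(degree[node], 0) + 1
    trees = factorial(n - 1)
    return {k: Fraction(c, trees) for k, c in counts.items()}
```

The enumeration grows factorially, so this check is only practical for roughly $N \le 9$; exact fractions are used so that the comparison with $2/(N-1)!$ is free of rounding.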
16.8.2 The case $k = 1$: $E\left[D_N^{(1)}\right]$
If $k = 1$ and $N \ge 3$, the recursion (16.35) is slightly different because the newly attached node $n_{N+1}$ necessarily belongs to the set of degree 1 nodes in the URT of size $N + 1$ such that
$$E\left[D_{N+1}^{(1)}\right] = \frac{N-1}{N} E\left[D_N^{(1)}\right] + 1$$
with $D_1^{(0)} = 1$ and $D_2^{(1)} = 2$. Hence, $E\left[D_3^{(1)}\right] = 2$. With $r_N^{(1)} = (N-1) E\left[D_N^{(1)}\right]$, the recursion for $k = 1$ becomes
$$r_{N+1}^{(1)} = r_N^{(1)} + N$$
The particular solution is $r_{N;p}^{(1)} = aN^2 + bN + c$. Substitution of $r_N^{(1)} = aN^2 + bN + c$ into the difference equation yields
$$aN^2 + (b + 2a)N + a + b + c = aN^2 + (b + 1)N + c$$
or, by equating corresponding powers in $N$, we find the conditions $b + 2a = b + 1$ and $a + b = 0$, from which $a = \frac{1}{2}$, $b = -\frac{1}{2}$. Thus,
$$r_N^{(1)} = (N-1) E\left[D_N^{(1)}\right] = \frac{N}{2}(N-1) + c$$
and
$$E\left[D_N^{(1)}\right] = \frac{N}{2} + \frac{c}{N-1}$$
Using $E\left[D_3^{(1)}\right] = 2$ shows that $c = 1$ such that, for $N > 2$,
$$E\left[D_N^{(1)}\right] = \frac{N}{2} + \frac{1}{N-1} \tag{16.53}$$

16.8.3 The general case: $E\left[D_N^{(k)}\right]$
Let us denote
$$R(x, y) = \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k)} x^k y^N \tag{16.54}$$
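then the recursion (16.36) is transformed into
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_{N+1}^{(k)} x^k y^N = \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_N^{(k)} x^k y^N + \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k-1)} x^k y^N$$
Now, the left-hand side is
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_{N+1}^{(k)} x^k y^N = \frac{1}{y} \sum_{N=4}^{\infty} \sum_{k=2}^{N-2} (N-2) r_N^{(k)} x^k y^N = \frac{1}{y} \sum_{N=3}^{\infty} \sum_{k=2}^{N-2} (N-2) r_N^{(k)} x^k y^N$$
$$= \frac{1}{y} \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} N r_N^{(k)} x^k y^N - \frac{1}{y} \sum_{N=3}^{\infty} N r_N^{(N-1)} x^{N-1} y^N - \frac{2}{y} \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k)} x^k y^N + \frac{2}{y} \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N$$
Using (16.54) yields
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_{N+1}^{(k)} x^k y^N = \frac{\partial R(x, y)}{\partial y} - \frac{2}{y} R(x, y) - \frac{1}{y} \sum_{N=3}^{\infty} N r_N^{(N-1)} x^{N-1} y^N + \frac{2}{y} \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N$$
Invoking (16.52) yields
$$\frac{2}{y} \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N = \frac{4}{xy} \sum_{N=3}^{\infty} \frac{(xy)^N}{(N-2)!} = 4xy \sum_{N=1}^{\infty} \frac{(xy)^N}{N!} = 4xy \left(e^{xy} - 1\right)$$
$$\frac{1}{y} \sum_{N=3}^{\infty} N r_N^{(N-1)} x^{N-1} y^N = \frac{2}{xy} \sum_{N=3}^{\infty} \frac{N (xy)^N}{(N-2)!} = 2xy \sum_{N=1}^{\infty} \frac{(N+2)(xy)^N}{N!}$$
$$= 2(xy)^2 \sum_{N=0}^{\infty} \frac{(xy)^N}{N!} + 4xy \sum_{N=1}^{\infty} \frac{(xy)^N}{N!} = 2(xy)^2 e^{xy} + 4xy \left(e^{xy} - 1\right)$$
such that
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_{N+1}^{(k)} x^k y^N = \frac{\partial R(x, y)}{\partial y} - \frac{2}{y} R(x, y) - 2(xy)^2 e^{xy}$$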
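Similarly,
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k-1)} x^k y^N = x \sum_{N=3}^{\infty} \sum_{k=1}^{N-2} r_N^{(k)} x^k y^N = x \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k)} x^k y^N + x^2 \sum_{N=3}^{\infty} r_N^{(1)} y^N - x \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N$$
$$= x R(x, y) + x^2 \sum_{N=3}^{\infty} r_N^{(1)} y^N - x \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N$$
Using both (16.52) and $r_N^{(1)} = \frac{N}{2}(N-1) + 1$ leads to
$$x^2 \sum_{N=3}^{\infty} r_N^{(1)} y^N = \frac{x^2}{2} \sum_{N=3}^{\infty} N(N-1) y^N + x^2 \sum_{N=3}^{\infty} y^N = \frac{(xy)^2}{2} \frac{d^2}{dy^2}\left(\frac{y^3}{1-y}\right) + \frac{x^2 y^3}{1-y}$$
$$x \sum_{N=3}^{\infty} r_N^{(N-1)} x^{N-1} y^N = 2 \sum_{N=3}^{\infty} \frac{(xy)^N}{(N-2)!} = 2(xy)^2 \sum_{N=1}^{\infty} \frac{(xy)^N}{N!} = 2(xy)^2 \left(e^{xy} - 1\right)$$
such that
$$\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} r_N^{(k-1)} x^k y^N = x R(x, y) + \frac{(xy)^2}{2} \frac{d^2}{dy^2}\left(\frac{y^3}{1-y}\right) + \frac{x^2 y^3}{1-y} - 2(xy)^2 \left(e^{xy} - 1\right)$$
Since $\sum_{N=3}^{\infty} \sum_{k=2}^{N-1} (N-1) r_N^{(k)} x^k y^N = y \frac{\partial R(x, y)}{\partial y} - R(x, y)$, combining all transforms the recursion (16.36) to a first order linear partial differential equation
$$(1 - y) \frac{\partial R(x, y)}{\partial y} + \left(1 - x - \frac{2}{y}\right) R(x, y) = x^2 y^3 \frac{3 - 3y + y^2}{(1-y)^3} + \frac{x^2 y^3}{1-y} + 2(xy)^2 = x^2 y^2 \left(\frac{1}{1-y} + \frac{1}{(1-y)^3}\right)$$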
with boundary equations $R(x, 0) = R(0, y) = 0$. Further,
$$\int \frac{R(x, y)}{y^2} dy = \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} \frac{r_N^{(k)}}{N-1} x^k y^{N-1} = \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} E\left[D_N^{(k)}\right] x^k y^{N-1}$$
Hence, if $x = 1$,
$$\int \frac{R(1, y)}{y^2} dy = \sum_{N=3}^{\infty} \sum_{k=2}^{N-1} E\left[D_N^{(k)}\right] y^{N-1} = \sum_{N=3}^{\infty} \left(N - E\left[D_N^{(1)}\right]\right) y^{N-1}$$
$$= \sum_{N=3}^{\infty} N y^{N-1} - \sum_{N=3}^{\infty} E\left[D_N^{(1)}\right] y^{N-1} = \sum_{N=3}^{\infty} N y^{N-1} - \sum_{N=3}^{\infty} \frac{N}{2} y^{N-1} - \sum_{N=3}^{\infty} \frac{y^{N-1}}{N-1}$$
$$= \frac{1}{2} \left(\frac{1}{(1-y)^2} - (1 + 2y)\right) - \int \frac{y}{1-y} dy$$
or
$$R(1, y) = \frac{y^2}{(1-y)^3} - \frac{y^2}{1-y} \tag{16.55}$$
It is more convenient to consider the differential equation as an ordinary differential equation in $y$ and to regard the variable $x$ as a parameter. The homogeneous differential equation,
$$(1 - y) \frac{\partial R_h(x, y)}{\partial y} = \left(x + \frac{2}{y} - 1\right) R_h(x, y)$$
is solved after integration with respect to $y$,
$$\ln R_h(x, y) = \int \frac{x + \frac{2}{y} - 1}{1 - y} dy = (x - 1) \int \frac{dy}{1-y} + 2 \int \frac{dy}{y(1-y)}$$
$$= (1 - x) \ln(1 - y) + 2 \left(\ln y - \ln(1 - y)\right) = \ln\left[(1-y)^{-x-1} y^2\right]$$
or $R_h(x, y) = (1-y)^{-x-1} y^2$. The particular solution is of the form $R(x, y) = C(y) R_h(x, y)$ where $C(y)$ obeys
$$\frac{\partial C(y)}{\partial y} = x^2 (1-y)^x \left(\frac{1}{1-y} + \frac{1}{(1-y)^3}\right)$$
or
$$C(y) = x^2 \int \left[(1-y)^{x-3} + (1-y)^{x-1}\right] dy + c(x) = -\frac{x^2 (1-y)^{x-2}}{x-2} - \frac{x^2 (1-y)^x}{x} + c(x)$$
where $c(x)$ is a function of $x$, independent of $y$, to be determined later. The solution is
$$R(x, y) = -\frac{(xy)^2}{(x-2)(1-y)^3} - \frac{x y^2}{1-y} + c(x) (1-y)^{-x-1} y^2$$
The initial condition $R(0, y) = 0$ shows that $c(0) = 0$, while the boundary condition (16.55) implies that $c(1) = 0$. Expanding this solution in a power series around $x = 0$ and $y = 0$ yields
$$R_h(x, y) = (1-y)^{-x-1} y^2 = \sum_{N=0}^{\infty} \binom{-x-1}{N} (-1)^N y^{N+2}$$
From the generating function of the Stirling numbers of the first kind (Abramowitz and Stegun, 1968, Section 24.1.3),
$$\frac{\Gamma(x+1)}{\Gamma(x+1-n)} = \sum_{j=0}^{n} S_n^{(j)} x^j \tag{16.56}$$
we observe that
$$\binom{-x-1}{N} = \frac{\Gamma(-x)}{N!\, \Gamma(-x-N)} = \sum_{k=0}^{N} \frac{S_{N+1}^{(k+1)}}{N!} (-1)^k x^k$$
such that
$$R_h(x, y) = \sum_{N=0}^{\infty} \sum_{k=0}^{N} \frac{S_{N+1}^{(k+1)}}{N!} (-1)^{N+k} x^k y^{N+2} = \sum_{N=2}^{\infty} \sum_{k=0}^{N-2} \frac{S_{N-1}^{(k+1)}}{(N-2)!} (-1)^{N+k} x^k y^N$$
Hence,
$$R(x, y) = \sum_{N=2}^{\infty} \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} \binom{N}{2} x^k y^N - x \sum_{N=2}^{\infty} y^N + c(x) \sum_{N=2}^{\infty} \sum_{k=0}^{N-2} \frac{S_{N-1}^{(k+1)}}{(N-2)!} (-1)^{N+k} x^k y^N$$
It remains to determine $c(x)$ by equating the corresponding powers in $x$ and $y$ at both sides. With the definition (16.54), equating the second power ($N = 2$) in $y$ yields
$$0 = \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} x^k - x + c(x)$$
which indicates that
$$c(x) = x - \frac{x^2}{2 - x}$$
agreeing with $c(0) = c(1) = 0$. The Taylor series around $x = 0$ is $c(x) = \sum_{k=0}^{\infty} c_k x^k$ with $c_0 = 0$, $c_1 = 1$ and $c_k = -\frac{1}{2^{k-1}}$ for $k > 1$. Equating the power $N > 2$ in $y$,
$$\sum_{k=2}^{N-1} r_N^{(k)} x^k = \binom{N}{2} \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} x^k - x + c(x) \sum_{k=0}^{N-2} \frac{S_{N-1}^{(k+1)}}{(N-2)!} (-1)^{N+k} x^k$$
$$= \binom{N}{2} \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} x^k - x + \sum_{k=0}^{\infty} c_k x^k \sum_{k=0}^{N-2} \frac{S_{N-1}^{(k+1)}}{(N-2)!} (-1)^{N+k} x^k$$
$$= \binom{N}{2} \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} x^k - x + \sum_{k=1}^{\infty} \left( \sum_{j=0}^{k-1} c_{k-j} (-1)^{N+j} \frac{S_{N-1}^{(j+1)}}{(N-2)!} \right) x^k$$
$$= \binom{N}{2} \sum_{k=2}^{\infty} \frac{1}{2^{k-1}} x^k - x + c_1 \frac{(-1)^N S_{N-1}^{(1)}}{(N-2)!} x + \sum_{k=2}^{\infty} \left( \sum_{j=0}^{k-1} c_{k-j} \frac{(-1)^{N+j} S_{N-1}^{(j+1)}}{(N-2)!} \right) x^k$$
which, by using $S_{N-1}^{(1)} = (-1)^N (N-2)!$ and $c_1 = 1$, equals
$$\sum_{k=2}^{N-1} r_N^{(k)} x^k = \sum_{k=2}^{\infty} \left( \frac{1}{2^{k-1}} \binom{N}{2} + \sum_{j=0}^{k-1} c_{k-j} (-1)^{N+j} \frac{S_{N-1}^{(j+1)}}{(N-2)!} \right) x^k$$
Finally, equating the corresponding powers in $x$ leads to
$$r_N^{(k)} = \frac{1}{2^{k-1}} \binom{N}{2} + \sum_{j=0}^{k-1} c_{k-j} (-1)^{N+j} \frac{S_{N-1}^{(j+1)}}{(N-2)!}$$
$$= \frac{1}{2^{k-1}} \binom{N}{2} + \frac{(-1)^{N+k-1} S_{N-1}^{(k)}}{(N-2)!} + \frac{(-1)^N}{2^k (N-2)!} \sum_{j=1}^{k-1} (-2)^j S_{N-1}^{(j)}$$
or to (16.37). As a check, using (16.56) in the form of the generating function $(-1)^n \frac{\Gamma(n-x)}{\Gamma(-x)} = \sum_{j=0}^{n} S_n^{(j)} x^j$ reveals that
$$E\left[D_N^{(N-1)}\right] = \frac{N}{2^{N-1}} + \frac{1}{(N-1)!} + \frac{(-1)^N}{2^{N-1} (N-1)!} \sum_{j=1}^{N-2} (-2)^j S_{N-1}^{(j)}$$
$$= \frac{N}{2^{N-1}} + \frac{1}{(N-1)!} + \frac{(-1)^N}{2^{N-1} (N-1)!} \left[ (-1)^{N-1} \frac{\Gamma(N-1+2)}{\Gamma(2)} - S_{N-1}^{(N-1)} (-2)^{N-1} \right]$$
$$= \frac{N}{2^{N-1}} + \frac{1}{(N-1)!} - \frac{1}{2^{N-1}} \frac{N!}{(N-1)!} + \frac{1}{(N-1)!} = \frac{2}{(N-1)!}$$
Also $E\left[D_N^{(1)}\right] = \frac{N}{2} + \frac{1}{N-1}$ is readily verified.
16.9 Problems
(i) Comparison of simulations with exact results. Many of the theoretical results are easily verified by simulations. Consider the following standard simulation: (a) construct a graph of a certain class, e.g. an instance of the random graphs $G_p(N)$ with exponentially distributed link weights, (b) determine in that graph a desired property, e.g. the hopcount of the shortest path between two different arbitrary nodes, (c) store the hopcount in a histogram and (d) repeat the sequence (a)-(c) $n$ times with each time a different graph instance in (a). Estimate the relative error of the simulated hopcount in $G_p(N)$ with $p = 1$ for $n = 10^4$, $10^5$ and $10^6$.
(ii) Given the probability generating function (16.17) of the weight of the shortest path in a complete graph with independent exponential link weights, compute the variance of $W_N$.
(iii) Prove the asymptotic law (16.20) of the weight of the shortest path in a complete graph with i.i.d. exponential link weights.
(iv) In a communication network often two paths are computed for each important flow to guarantee sufficient reliability. Apart from the shortest path between a source $A$ and a destination $B$, a second path between $A$ and $B$ is chosen that does not travel over any intermediate router of the shortest path. We call such a path node-disjoint to the shortest path. Derive a good approximation for the distribution of the hopcount of the shortest node-disjoint path to the shortest path in the complete graph with exponential link weights with mean 1.
17 The efficiency of multicast

The efficiency or gain of multicast in terms of network resources is compared to unicast. Specifically, we concentrate on a one-to-many communication, where a source sends the same message to $m$ different, uniformly distributed destinations along the shortest path. In unicast, this message is sent $m$ times from the source to each destination. Hence, unicast uses on average $f_N(m) = m E[H_N]$ link-traversals or hops, where $E[H_N]$ is the average number of hops to a uniform location in the graph with $N$ nodes. One of the main properties of multicast is that it economizes on the number of link-traversals: the message is only copied at each branch point of the multicast tree to the $m$ destinations. Let us denote by $H_N(m)$ the number of links in the shortest path tree (SPT) to $m$ uniformly chosen nodes. If we define the multicast gain $g_N(m) = E[H_N(m)]$ as the average number of hops in the SPT rooted at a source to $m$ randomly chosen distinct destinations, then $g_N(m) \le f_N(m)$. The purpose here is to quantify the multicast gain $g_N(m)$. We present general results valid for all graphs and more explicit results valid for the random graph $G_p(N)$ and for the $k$-ary tree. The analysis presented here may be valuable to derive a business model for multicast: "How many customers $m$ are needed to make the use of multicast for a service provider profitable?"

Two modeling assumptions are made. First, the multicast process is assumed$^1$ to deliver packets along the shortest path from a source to each of the $m$ destinations. As most of the current Internet protocols forward packets based on the (reverse) shortest path, the assumption of SPT delivery is quite realistic. The second assumption is that the $m$ multicast group member nodes are uniformly chosen out of the total number of nodes $N$. This assumption has been discussed by Phillips et al. (1999). They concluded that, if $m$ and $N$ are large, deviations from the uniformity assumption are negligibly small. Also the Internet measurements of Chalmers and Almeroth (2001) seem to confirm the validity of the uniformity assumption.

$^1$ The assumption ignores shared tree multicast forwarding such as core-based tree (CBT, see RFC2201).

17.1 General results for $g_N(m)$
Theorem 17.1.1 For any connected graph with $N$ nodes,
$$m \le g_N(m) \le \frac{Nm}{m+1} \tag{17.1}$$
Proof: We need at least one edge for each different user; therefore $g_N(m) \ge m$ and the lower bound is attained in a star topology with the source at the center. We will next show that an upper bound is obtained in a line topology. It is sufficient to consider trees, because multicast only uses shortest paths without cycles. If the tree has not a line topology, then at least one node has degree 3 or the root has degree 2. Take the node closest to the root with this property and cut one of the branches at this node; we paste that branch to a node at the deepest level. Through this procedure the multicast function $g_N(m)$ stays unaltered or increases. Continuing in this fashion until we reach a line topology demonstrates the claim. For the line topology we place the source at the origin and the other nodes at the integers $1, 2, \ldots, N-1$. The links of the graph are given by $(i, i+1)$, $i = 0, 1, \ldots, N-2$. The multicast gain $g_N(m)$ equals $E[M]$, where $M$ is the maximum of a sample of size $m$, without replacement, from the integers $1, 2, \ldots, N-1$. Thus,
$$\Pr[M \le k] = \frac{\binom{k}{m}}{\binom{N-1}{m}}, \qquad m \le k \le N-1$$
from which $g_N(m) = E[M]$ is
$$g_N(m) = \sum_{k=m}^{N-1} k\, \frac{\binom{k}{m} - \binom{k-1}{m}}{\binom{N-1}{m}} = \sum_{k=m}^{N-1} k\, \frac{\binom{k-1}{m-1}}{\binom{N-1}{m}} = m \sum_{k=m}^{N-1} \frac{\binom{k}{m}}{\binom{N-1}{m}} = \frac{mN}{m+1} \sum_{k=m}^{N-1} \frac{\binom{k}{m}}{\binom{N}{m+1}} = \frac{mN}{m+1}$$
where we have used that $\sum_{k=m}^{N-1} \binom{k}{m} / \binom{N}{m+1} = 1$, because it is a sum of probabilities over all possible disjoint outcomes. □
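The line-topology computation can be reproduced exactly for small graphs. The sketch below evaluates $E[M]$ from $\Pr[M \le k] = \binom{k}{m} / \binom{N-1}{m}$ with exact fractions and compares it with $\frac{Nm}{m+1}$; the function name `line_topology_gain` is a hypothetical helper introduced for this illustration.

```python
from fractions import Fraction
from math import comb


def line_topology_gain(n, m):
    """E[M] for the maximum M of m distinct draws from {1, ..., n-1}."""
    total = comb(n - 1, m)
    # Pr[M = k] = Pr[M <= k] - Pr[M <= k-1], with Pr[M <= k] = C(k, m) / C(n-1, m);
    # comb(k - 1, m) is 0 for k = m, so the first term needs no special case
    return sum(Fraction(k * (comb(k, m) - comb(k - 1, m)), total)
               for k in range(m, n))
```

For instance, with $N = 2$ and $m = 1$ the only destination is at distance 1, and the bound $\frac{Nm}{m+1} = 1$ is attained.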
Figure 17.1 shows the allowable space for $g_N(m)$.

Fig. 17.1. The allowable region (in white) of $g_N(m)$ as a function of $m$, bounded above by $Nm/(m+1)$. For exponentially growing graphs, $E[H_N] = c \log N$, implying that the allowable region for these graphs is smaller and bounded at the left (in dotted line) by the straight line $m(c \log N)$.
Theorem 17.1.2 For any connected graph with $N$ nodes, the map $m \mapsto g_N(m)$ is concave and the map $m \mapsto \frac{g_N(m)}{f_N(m)}$ is decreasing.
Proof: Define $Y_m$ to be the random variable giving the additional number of hops necessary to reach the $m$-th user when the first $m - 1$ users are already connected. Then we have that
$$E[Y_m] = g_N(m) - g_N(m-1)$$
Moreover, let $Y'_m$ be the random number of additional hops necessary to reach the $m$-th multicast group member, when we discard all extra hops of the $(m-1)$-st group member. An example is illustrated in Fig. 17.2. The random variable $Y'_m$ has the same distribution as $Y_{m-1}$, because both the $(m-1)$-st and the $m$-th group member are chosen uniformly from the remaining $N - m - 1$ nodes. In general, $Y'_m \neq Y_{m-1}$, but, for each $k$, $\Pr[Y'_m = k] = \Pr[Y_{m-1} = k]$ and, hence,
$$E[Y'_m] = E[Y_{m-1}] \tag{17.2}$$
Furthermore, we have by construction that $Y_m \le Y'_m$ with probability 1, implying that
$$E[Y_m] \le E[Y'_m] \tag{17.3}$$
Indeed, attaching the $m$-th group member to the reduced tree takes at least as many hops as attaching that same group member to the non-reduced tree because the former is contained in the latter and the extra hops added by the $(m-1)$-st group member can only help us. Combining (17.2) and (17.3) immediately gives
$$g_N(m) - g_N(m-1) = E[Y_m] \le E[Y'_m] = g_N(m-1) - g_N(m-2) \tag{17.4}$$
This is equivalent to the concavity of the map $m \mapsto g_N(m)$.
Fig. 17.2. A multicast session with $m = 5$ group members where $Y_5 = 1$ (namely link C-5). To construct $Y'_5$ the three dotted lines must be removed and we observe that $Y'_5 = 2$ (A-C-5), which is referred to as the reduced tree. In this example, $Y'_5 = Y_4 = 2$ because A-C-4 and A-C-5 both consist of 2 hops. In general, they are equal in distribution because the roles of group members 4 and 5 are identical in the reduced tree.
In order to show that $m \mapsto \frac{g_N(m)}{f_N(m)}$ is decreasing it suffices to show that $m \mapsto \frac{g_N(m)}{m}$ is decreasing, since $f_N(m)$ is proportional to $m$. Defining $g_N(0) = 0$, we can write $g_N(m)$ as a telescoping sum
$$g_N(m) = \sum_{k=1}^{m} \{g_N(k) - g_N(k-1)\} = \sum_{k=1}^{m} x_k$$
where $x_k = g_N(k) - g_N(k-1)$, $k = 1, \ldots, m$. Then,
$$\frac{g_N(m)}{m} = \frac{1}{m} \sum_{k=1}^{m} x_k$$
is the mean of a sequence of $m$ positive numbers $x_k$. By (17.4) the sequence $x_k \le x_{k-1}$ is decreasing and, hence,
$$\frac{g_N(m)}{m} = \frac{1}{m} \sum_{k=1}^{m} x_k \le \frac{1}{m-1} \sum_{k=1}^{m-1} x_k = \frac{g_N(m-1)}{m-1}$$
This proves that $m \mapsto g_N(m)/m$ is decreasing. □
Next, we will give a representation for $g_N(m)$ that is valid for all graphs. Let $X_i$ be the number of joint hops that all $i$ uniformly chosen and different group members have in common, then the following general theorem holds,
Theorem 17.1.3 For any connected graph with $N$ nodes,
$$g_N(m) = \sum_{i=1}^{m} \binom{m}{i} (-1)^{i-1} E[X_i] \tag{17.5}$$
Note that
$$g_N(1) = f_N(1) = E[X_1] = E[H_N]$$
so that the decrease in average hops or the "gain" by using multicast over unicast is precisely
$$g_N(m) - f_N(m) = \sum_{i=2}^{m} \binom{m}{i} (-1)^{i-1} E[X_i]$$
However, computing $E[X_i]$ for general graphs is difficult.
Proof of Theorem 17.1.3: Let $A_1, A_2, \ldots, A_m$ be sets where $A_i$ consists of all links that constitute the shortest path from the source to multicast group member $i$. Denote by $|A_i|$ the number of elements in the set $A_i$. The multicast group members are chosen uniformly from the set of all nodes except for the root. Hence,
$$E[X_1] = E[|A_i|], \qquad \text{for } 1 \le i \le N$$
and
$$E[X_2] = E[|A_i \cap A_j|], \qquad \text{for } 1 \le i < j \le N$$
etc. Now, $g_N(m) = E[|A_1 \cup A_2 \cup \cdots \cup A_m|]$. Since $Q(A) = E[|A|] / \binom{N}{2}$ is a probability measure on the set of all links, we obtain from the inclusion-exclusion formula (2.3) applied to $Q$ and multiplied with $\binom{N}{2}$ afterwards,
$$E[|A_1 \cup A_2 \cup \cdots \cup A_m|] = \sum_{i=1}^{m} E[|A_i|] - \sum_{i<j} E[|A_i \cap A_j|] + \cdots + (-1)^{m-1} E[|A_1 \cap A_2 \cap \cdots \cap A_m|]$$
$$= m E[X_1] - \binom{m}{2} E[X_2] + \cdots + (-1)^{m-1} E[X_m]$$
This proves Theorem 17.1.3. □
Corollary 17.1.4 For any connected graph with $N$ nodes,
$$E[X_m] = \sum_{i=1}^{m} \binom{m}{i} (-1)^{i-1} g_N(i) \tag{17.6}$$
The corollary is a direct consequence of the inversion formula for the binomial (Riordan, 1968, Chapter 2). Alternatively, in view of the Gregory-Newton interpolation formula (Lanczos, 1988, Chapter 4, Section 2) for $g_N(m) = \sum_{i=1}^{\infty} \binom{m}{i} \Delta^i g_N(0)$, we can write $E[X_i] = (-1)^{i-1} \Delta^i g_N(0)$ where $\Delta$ is the difference operator, $\Delta f(0) = f(1) - f(0)$.

Corollary 17.1.5 For any connected graph, the multicast efficiency $g_N(m)$ is bounded by
$$\frac{f_N(m)}{g_N(m)} \le E[H_N] \tag{17.7}$$
where $E[H_N]$ is the average number of hops in unicast.
Proof: We give two demonstrations. (a) From $g_N(N-1) = N-1$ (all nodes, source plus $N-1$ destinations, of the graph are spanned by a tree consisting of $N-1$ links) and the monotonicity of $m \mapsto \frac{g_N(m)}{f_N(m)}$ (see Theorem 17.1.2) we obtain:
$$\frac{g_N(m)}{f_N(m)} \ge \frac{g_N(N-1)}{f_N(N-1)} = \frac{N-1}{(N-1) E[H_N]} = \frac{1}{E[H_N]}$$
(b) Alternatively, Theorem 17.1.1 indicates that $g_N(m) \ge m$, which, with the identity $f_N(m) = m E[H_N]$, immediately leads to (17.7). □

Corollary 17.1.5 means that for any connected graph, including the graph describing the Internet, the ratio of the unicast over multicast efficiency is bounded by the expected hopcount in unicast. In other words, the maximum savings in resources an operator can gain by using multicast (over unicast) never exceeds $E[H_N]$, which is roughly about 15 in the current Internet.

17.2 The random graph $G_p(N)$
In this section, we confine ourselves to the class RGU, the random graphs of the class $G_p(N)$ with independent identically and exponentially distributed link weights $w$ with mean $E[w] = 1$ and where $\Pr[w \le x] = 1 - e^{-x}$, $x > 0$. In Section 16.2, we have shown that the corresponding SPT is, asymptotically, a URT. The analysis below is exact for the complete graph $K_N$ while asymptotically correct for connected random graphs $G_p(N)$.
17.2.1 The hopcount of the shortest path tree
Based on properties of the URT, the complete probability density function of the number of links $H_N(m)$ in the SPT to $m$ uniformly chosen nodes can be determined. We first derive a recursion for the probability generating function $\varphi_{H_N(m)}(z) = E\left[z^{H_N(m)}\right]$ of the number of links $H_N(m)$ in the SPT to $m$ uniformly chosen nodes in the complete graph $K_N$.
Lemma 17.2.1 For $N > 1$ and all $1 \le m \le N-1$,
$$\varphi_{H_N(m)}(z) = \frac{(N-m-1)(N-1+mz)}{(N-1)^2} \varphi_{H_{N-1}(m)}(z) + \frac{m^2 z}{(N-1)^2} \varphi_{H_{N-1}(m-1)}(z) \tag{17.8}$$
Proof: To prove (17.8), we use the recursive growth of URTs: a URT of size $N$ is a URT of size $N-1$, where we add an additional link to a uniformly chosen node.
Fig. 17.3. The several possible cases (A, B, C and D) in which the $N$-th node can be attached uniformly to the URT of size $N-1$. The root is dark shaded while the $m$ multicast member nodes are lightly shaded.
In order to obtain a recursion for $H_N(m)$ we distinguish between the $m$ uniformly chosen nodes all being in the URT of size $N-1$ or not. The probability that they all belong to the tree of size $N-1$ is equal to $1 - \frac{m}{N-1}$ (case A in Fig. 17.3). If they all belong to the URT of size $N-1$, then we have that $H_N(m) = H_{N-1}(m)$. Thus, we obtain
$$\varphi_{H_N(m)}(z) = \left(1 - \frac{m}{N-1}\right) \varphi_{H_{N-1}(m)}(z) + \frac{m}{N-1} E\left[z^{1 + L_{N-1}(m)}\right] \tag{17.9}$$
where $L_{N-1}(m)$ is the number of links in the subtree of the URT of size $N-1$ spanned by $m-1$ uniform nodes and the "one" refers to the link from the added $N$-th node to its ancestor in the URT of size $N-1$. We complete the proof by investigating the generating function of $L_{N-1}(m)$. Again, there are two cases. In the first case (B in Fig. 17.3), the ancestor of the added $N$-th node is one of the $m-1$ previous nodes (which can only happen if it is unequal to the root), else we get one of the cases C and D in Fig. 17.3. The probability of the first event equals $\frac{m-1}{N-1}$, the probability of the latter equals $1 - \frac{m-1}{N-1}$. If the ancestor of the added $N$-th node is one of the $m-1$ previous nodes, then the number of links $L_{N-1}(m)$ equals $H_{N-1}(m-1)$, otherwise the generating function of the number of additional links equals
$$\left(1 - \frac{1}{N-m}\right) \varphi_{H_{N-1}(m)}(z) + \frac{1}{N-m} \varphi_{H_{N-1}(m-1)}(z)$$
The first contribution comes from the case where the ancestor of the added $N$-th node is not the root, and the second from where it is equal to the root, which has probability $\frac{1}{N-1-(m-1)} = \frac{1}{N-m}$. Therefore,
$$E\left[z^{L_{N-1}(m)}\right] = \frac{m-1}{N-1} \varphi_{H_{N-1}(m-1)}(z) + \frac{N-m}{N-1} \left( \frac{\varphi_{H_{N-1}(m-1)}(z)}{N-m} + \frac{N-m-1}{N-m} \varphi_{H_{N-1}(m)}(z) \right)$$
$$= \frac{m}{N-1} \varphi_{H_{N-1}(m-1)}(z) + \frac{N-m-1}{N-1} \varphi_{H_{N-1}(m)}(z) \tag{17.10}$$
Substitution of (17.10) into (17.9) leads to (17.8). □
Since $g_N(m) = E[H_N(m)] = \varphi'_{H_N(m)}(1)$, we obtain the recursion for $g_N(m)$,
$$g_N(m) = \left(1 - \frac{m^2}{(N-1)^2}\right) g_{N-1}(m) + \frac{m^2}{(N-1)^2} g_{N-1}(m-1) + \frac{m}{N-1} \tag{17.11}$$
Theorem 17.2.2 For all $N \ge 1$ and $1 \le m \le N-1$,
$$\varphi_{H_N(m)}(z) = E\left[z^{H_N(m)}\right] = \frac{m!\,(N-1-m)!}{((N-1)!)^2} \sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k} \frac{\Gamma(N + kz)}{\Gamma(1 + kz)} \tag{17.12}$$
Consequently,
$$\Pr[H_N(m) = j] = \frac{m!\,(-1)^{N-(j+1)} S_N^{(j+1)} \mathcal{S}_j^{(m)}}{(N-1)! \binom{N-1}{m}} \tag{17.13}$$
where $S_N^{(j+1)}$ and $\mathcal{S}_j^{(m)}$ denote the Stirling numbers of first and second kind (Abramowitz and Stegun, 1968, Section 24.1).
Proof: By iterating the recursion (17.8) for small values of $m$, the computations given in van der Hofstad et al. (2006a, Appendix) suggest the solution (17.12) for (17.8). One can verify that (17.12) satisfies (17.8). This proves (17.12) of Theorem 17.2.2. Using (Abramowitz and Stegun, 1968, Section 24.1.3.B), the Taylor expansion around $z = 0$ equals
$$\varphi_{H_N(m)}(z) = \frac{m!\,N\,(N-1-m)!}{(N-1)!} \sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k} \left( \frac{\Gamma(N + kz)}{N!\, \Gamma(1 + kz)} - \frac{1}{N} \right)$$
$$= \frac{m!\,N\,(N-1-m)!}{(N-1)!} \sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k} \sum_{j=1}^{N-1} \frac{(-1)^{N-(j+1)} S_N^{(j+1)}}{N!} k^j z^j$$
$$= \frac{m!\,N\,(N-1-m)!}{(N-1)!} \sum_{j=1}^{N-1} \frac{(-1)^{N-(j+1)} S_N^{(j+1)}}{N!} \left( \sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k} k^j \right) z^j$$
Using the definition of Stirling numbers of the second kind (Abramowitz and Stegun, 1968, 24.1.4.C),
$$\sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k} k^j = m!\, \mathcal{S}_j^{(m)}$$
for which $\mathcal{S}_j^{(m)} = 0$ if $j < m$, gives
$$\varphi_{H_N(m)}(z) = \frac{(m!)^2 (N-1-m)!}{((N-1)!)^2} \sum_{j=1}^{N-1} (-1)^{N-(j+1)} S_N^{(j+1)} \mathcal{S}_j^{(m)} z^j$$
This proves (17.13) and completes the proof of Theorem 17.2.2. □
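The distribution $\Pr[H_N(m) = j] = m!\, |S_N^{(j+1)}|\, \mathcal{S}_j^{(m)} / \left((N-1)! \binom{N-1}{m}\right)$, where $|S_N^{(j+1)}| = (-1)^{N-(j+1)} S_N^{(j+1)}$ is the unsigned Stirling number of the first kind, can be tabulated with exact arithmetic. The following sketch is illustrative: the helper names are assumptions introduced here, the first-kind numbers are built from their standard recurrence, and the second-kind numbers from the alternating-sum definition quoted in the proof.

```python
from fractions import Fraction
from math import comb, factorial


def stirling_first_unsigned(n):
    """Row n of the unsigned Stirling numbers of the first kind: row[k] = |S_n^(k)|."""
    row = [1]                                    # n = 0
    for i in range(n):
        new = [0] * (len(row) + 1)
        for k, v in enumerate(row):
            new[k + 1] += v                      # |S_{i+1}^(k+1)| gets |S_i^(k)|
            new[k] += i * v                      # |S_{i+1}^(k)| gets i * |S_i^(k)|
        row = new
    return row


def stirling_second(n, k):
    """Stirling number of the second kind from the alternating sum definition."""
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1))
    return total // factorial(k)


def hopcount_pmf(n, m):
    """Return {j: Pr[H_n(m) = j]} for j = 1, ..., n-1, following (17.13)."""
    c = stirling_first_unsigned(n)
    norm = factorial(n - 1) * comb(n - 1, m)
    return {j: Fraction(factorial(m) * c[j + 1] * stirling_second(j, m), norm)
            for j in range(1, n)}
```

As a small usage example, `hopcount_pmf(3, 1)` gives probability 3/4 for one hop and 1/4 for two hops, which matches the two recursive trees of size 3 averaged over the two possible destinations.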
Figure 17.4 plots the probability density function of $H_{50}(m)$ for different values of $m$.
Corollary 17.2.3 For all $N \ge 1$ and $1 \le m \le N-1$,
$$g_N(m) = E[H_N(m)] = \frac{mN}{N-m} \sum_{k=m+1}^{N} \frac{1}{k} \tag{17.14}$$
and
$$\mathrm{Var}[H_N(m)] = \frac{N-1+m}{N+1-m}\, g_N(m) - \frac{g_N^2(m)}{N+1-m} - \frac{m^2 N^2 \sum_{k=m+1}^{N} \frac{1}{k^2}}{(N-m)(N+1-m)} \tag{17.15}$$
The formula (17.14) is proved in two different ways. The earlier proof presented in Section 17.6 below does not rely on the recursion in Lemma 17.2.1 nor on Theorem 17.2.2. The shorter proof is presented here. Formula (17.14) can be expressed in terms of the digamma function $\psi(x)$ as
$$g_N(m) = mN \left( \frac{\psi(N) - \psi(m)}{N-m} \right) - 1 \tag{17.16}$$
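The closed form $g_N(m) = \frac{mN}{N-m} \sum_{k=m+1}^{N} \frac{1}{k}$ can be cross-checked against the recursion $g_N(m) = \left(1 - \frac{m^2}{(N-1)^2}\right) g_{N-1}(m) + \frac{m^2}{(N-1)^2} g_{N-1}(m-1) + \frac{m}{N-1}$ with $g_N(0) = 0$. The sketch below does this with exact fractions; the function names and the memoization via `lru_cache` are illustrative choices, not part of the text.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def gain_recursive(n, m):
    """g_N(m) from the recursion (17.11), with g_N(0) = 0 and 1 <= m <= n - 1."""
    if m == 0:
        return Fraction(0)
    w = Fraction(m * m, (n - 1) ** 2)
    # for m = n - 1 the weight 1 - w vanishes, so g_{n-1}(n-1) is never needed
    keep = (1 - w) * gain_recursive(n - 1, m) if m < n - 1 else Fraction(0)
    return keep + w * gain_recursive(n - 1, m - 1) + Fraction(m, n - 1)


def gain_closed_form(n, m):
    """g_N(m) from the closed form (17.14)."""
    return Fraction(m * n, n - m) * sum(Fraction(1, k) for k in range(m + 1, n + 1))
```

For example, both functions return 5/4 for $N = 3$ and $m = 1$, and $N - 1$ whenever $m = N - 1$, since the tree then spans all nodes.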
Fig. 17.4. The pdf $\Pr[H_{50}(m) = j]$ of $H_{50}(m)$ versus the number of hops $j$ for $m = 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 47$.

Proof of Corollary 17.2.3: The expectation and variance of $H_N(m)$ will not be obtained using the explicit probabilities (17.13), but by rewriting (17.12) as
\[
\begin{aligned}
\varphi_{H_N(m)}(z) &= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)} \sum_{k=0}^{m} \binom{m}{k} (-1)^{m-k}\, \partial_t^{N-1}\left[ t^{N-1+kz} \right]_{t=1} \\
&= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, (-1)^m\, \partial_t^{N-1}\left[ t^{N-1} (1 - t^z)^m \right]_{t=1}
\end{aligned} \tag{17.17}
\]
Indeed,
\[
\begin{aligned}
E[H_N(m)] &= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, (-1)^m\, \partial_z \partial_t^{N-1}\left[ t^{N-1} (1 - t^z)^m \right]_{t=z=1} \\
&= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, m (-1)^{m-1}\, \partial_t^{N-1}\left[ t^{N} \log t\, (1 - t)^{m-1} \right]_{t=1}, \\
E[H_N(m)(H_N(m) - 1)] &= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, (-1)^m\, \partial_z^2 \partial_t^{N-1}\left[ t^{N-1} (1 - t^z)^m \right]_{t=z=1} \\
&= \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, m (-1)^{m-1}\, \partial_t^{N-1}\left[ t^{N} \log^2 t\, (1 - t)^{m-2} \left[ -(m-1)t + (1-t) \right] \right]_{t=1}
\end{aligned}
\]
We will start with the former. Using
\[
\partial_t^{i} (1-t)^{j} \big|_{t=1} = j!\, (-1)^{j}\, \delta_{ij}
\]
and Leibniz' rule, we find
\[
E[H_N(m)] = \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)} \binom{N-1}{m-1}\, m!\; \partial_t^{N-m}\left[ t^{N} \log t \right]_{t=1}
\]
Since
\[
\partial_t^{k}\left[ t^{n} \log t \right]_{t=1} = \frac{n!}{(n-k)!} \sum_{j=n-k+1}^{n} \frac{1}{j}
\]
we obtain expression (17.14) for $E[H_N(m)]$. We now extend the above computation to $E[H_N(m)(H_N(m)-1)]$ that we write as
\[
E[H_N(m)(H_N(m) - 1)] = \frac{\Gamma(m+1)\Gamma(N-m)}{\Gamma^2(N)}\, (R_1 + R_2) \tag{17.18}
\]
where
\[
\begin{aligned}
R_1 &= m(m-1)(-1)^{m-2}\, \partial_t^{N-1}\left[ t^{N+1} \log^2 t\, (1-t)^{m-2} \right]_{t=1} \\
R_2 &= m(-1)^{m-1}\, \partial_t^{N-1}\left[ t^{N} \log^2 t\, (1-t)^{m-1} \right]_{t=1}
\end{aligned}
\]
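Using
\[
\partial_t^{k}\left[ t^{n} \log^2 t \right]_{t=1} = 2\, \frac{n!}{(n-k)!} \sum_{i=n-k+1}^{n}\; \sum_{j=i+1}^{n} \frac{1}{ij} = \frac{n!}{(n-k)!} \left[ \left( \sum_{i=n-k+1}^{n} \frac{1}{i} \right)^{2} - \sum_{i=n-k+1}^{n} \frac{1}{i^2} \right]
\]
we obtain,
\[
\begin{aligned}
R_1 &= \binom{N-1}{m-2}\, m(m-1)(m-2)!\; \partial_t^{N-m+1}\left[ t^{N+1} \log^2 t \right]_{t=1} \\
&= \binom{N-1}{m-2}\, (N+1)! \left[ \left( \sum_{k=m+1}^{N+1} \frac{1}{k} \right)^{2} - \sum_{k=m+1}^{N+1} \frac{1}{k^2} \right]
\end{aligned}
\]
Similarly,
\[
\begin{aligned}
R_2 &= \binom{N-1}{m-1}\, m(m-1)!\; \partial_t^{N-m}\left[ t^{N} \log^2 t \right]_{t=1} \\
&= \binom{N-1}{m-1}\, N! \left[ \left( \sum_{k=m+1}^{N} \frac{1}{k} \right)^{2} - \sum_{k=m+1}^{N} \frac{1}{k^2} \right]
\end{aligned}
\]
Substitution into (17.18) leads to
\[
E[H_N(m)(H_N(m) - 1)] = \frac{m^2 N^2}{(N+1-m)(N-m)} \left[ \left( \sum_{k=m+1}^{N} \frac{1}{k} \right)^{2} - \sum_{k=m+1}^{N} \frac{1}{k^2} \right] + \frac{2m(m-1)N}{(N+1-m)(N-m)} \sum_{k=m+1}^{N} \frac{1}{k}
\]
From $g_N(m) = E[H_N(m)]$ and $\mathrm{Var}[H_N(m)] = E[H_N(m)(H_N(m) - 1)] + g_N(m) - g_N^2(m)$, we obtain (17.15). This completes the proof of Corollary 17.2.3. □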
For $N = 1000$, Fig. 17.5 illustrates the typical behavior for large $N$ of the expectation $g_N(m)$ and the standard deviation $\sigma_N(m) = \sqrt{\mathrm{Var}[H_N(m)]}$ for all values of $m$. For any spanning tree, the number of links $H_N(N-1)$ is precisely $N-1$, so that $\mathrm{Var}[H_N(N-1)] = 0$.

Fig. 17.5. The average number of hops $g_N(m)$ (left axis) in the SPT and the corresponding standard deviation $\sigma_N(m)$ (right axis) as a function of the number $m$ of multicast group members in the complete graph with $N = 1000$.

Figure 17.5 also indicates that the standard deviation $\sigma_N(m)$ of $H_N(m)$ is much smaller than the average, even for $N = 1000$. In fact, we obtain from (17.15) that
\[
\mathrm{Var}[H_N(m)] \leq \frac{N-1+m}{N+1-m}\, g_N(m) \leq \frac{2N}{N-m}\, g_N(m) = \frac{2\, g_N^2(m)}{m \sum_{k=m+1}^{N} \frac{1}{k}} = o\left( g_N^2(m) \right)
\]
This bound implies that with probability converging to 1 for every $m = 1, \ldots, N-1$,
\[
\left| \frac{H_N(m)}{g_N(m)} - 1 \right| \leq \varepsilon
\]
In van der Hofstad et al. (2006a), the scaled random variable $\frac{H_N(m) - g_N(m)}{\sqrt{g_N(m)}}$ is proved to tend to a Gaussian random variable, i.e. $\frac{H_N(m) - g_N(m)}{\sqrt{g_N(m)}} \xrightarrow{d} N(0,1)$, for all $m = o(\sqrt{N})$. For large graphs of the size of the Internet and larger, this observation implies that the mean $g_N(m) = E[H_N(m)]$ is a good approximation for the random variable $H_N(m)$ itself because the variations of $H_N(m)$ around the mean are small. Consequently, it underlines the importance of $g_N(m)$ as a significant measure for multicast.
17.2.2 The weight of the shortest path tree

In this section, we summarize results on the weight $W_N(m)$ of the SPT and omit derivations, but refer to van der Hofstad et al. (2006a). For all $1 \leq m \leq N-1$, the average weight of the SPT is
\[
E[W_N(m)] = \sum_{j=1}^{m} \frac{1}{N-j} \sum_{k=j}^{N-1} \frac{1}{k} \tag{17.19}
\]
In particular, if the shortest path tree spans the whole graph, then for all $N \geq 2$,
\[
E[W_N(N-1)] = \sum_{n=1}^{N-1} \frac{1}{n^2} \tag{17.20}
\]
from which $E[W_N(N-1)] < \zeta(2) = \frac{\pi^2}{6}$ for any finite $N$. The variance is
\[
\mathrm{Var}[W_N(N-1)] = \frac{4}{N} \sum_{k=1}^{N-1} \frac{1}{k^3} + 4 \sum_{j=1}^{N-1} \frac{1}{j^3} \sum_{k=1}^{j} \frac{1}{k} - 5 \sum_{j=1}^{N-1} \frac{1}{j^4} \tag{17.21}
\]
or asymptotically, for large $N$,
\[
\mathrm{Var}[W_N(N-1)] = \frac{4\zeta(3)}{N} + O\left( \frac{\log N}{N^2} \right) \tag{17.22}
\]
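The sums (17.19) and (17.21) are easy to evaluate exactly for small $N$. The sketch below, with hypothetical helper names of our own choosing, uses rational arithmetic so that the identity (17.20) can be confirmed without rounding error; it is meant as an illustration, not as part of the derivation in van der Hofstad et al. (2006a).

```python
from fractions import Fraction

def mean_spt_weight(N, m):
    """E[W_N(m)] from (17.19), as an exact fraction."""
    return sum(Fraction(1, N - j) * sum(Fraction(1, k) for k in range(j, N))
               for j in range(1, m + 1))

def var_spanning_spt_weight(N):
    """Var[W_N(N-1)] from (17.21), as an exact fraction."""
    s3 = sum(Fraction(1, k**3) for k in range(1, N))
    cross = sum(Fraction(1, j**3) * sum(Fraction(1, k) for k in range(1, j + 1))
                for j in range(1, N))
    s4 = sum(Fraction(1, j**4) for j in range(1, N))
    return Fraction(4, N) * s3 + 4 * cross - 5 * s4
```

For $N = 2$ the tree consists of a single link with exponentially distributed weight of mean 1, and `var_spanning_spt_weight(2)` indeed returns 1.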
Asymptotically for large $N$, the average weight of a shortest path tree is $\zeta(2) = 1.645\ldots$, while the average weight of the minimum spanning tree, given by (16.49), is $\zeta(3) < \zeta(2)$. This result has an interesting implication. The Steiner tree is the minimum weight tree that connects a set of $m$ members out of $N$ nodes in the graph. The Steiner tree problem is NP-complete, which means that it is infeasible to compute for large $N$. If $m = 2$, the weight of the Steiner tree equals that of the shortest path, $W_{\mathrm{Steiner},N}(2) = W_N$, while for $m = N$, we have $W_{\mathrm{Steiner},N}(N) = W_{\mathrm{MST}}$. Hence, for any $m < N$ and $N$, $E[W_{\mathrm{Steiner},N}(m)] \leq \zeta(3)$ because the weight of the Steiner tree does not decrease if the number of members $m$ increases. The ratio $\frac{\zeta(2)}{\zeta(3)} = 1.368$ indicates that the use of the SPT (computationally easy) never performs on average more than 37% worse than the optimal Steiner tree (computationally infeasible). In a broader context and referring to the concept of the "Price of Anarchy", which is broadly explained in Robinson (2004), the SPT used in a communications network is related to the Nash equilibrium, while the Steiner tree gives the hardly achievable global optimum.

Simulations (even for small $N$, which allow us to cover the entire $m$-range as illustrated in Fig. 17.6) indicate that the normalized random variable $W_N^*(m) = \frac{W_N(m) - E[W_N(m)]}{\sqrt{\mathrm{Var}[W_N(m)]}}$ lies between a normalized Gaussian $N(0,1)$ and a normalized Gumbel (see Theorem 6.4.1). Fig. 17.6 may suggest that, for all $m < N$,
\[
\lim_{N \to \infty} \Pr[W_N^*(m) \leq x] = e^{-e^{-\frac{\pi x}{\sqrt{6}} - \gamma}}
\]

Fig. 17.6. The pdf $f_{W_N^*(m)}(x)$ of the normalized random variable $W_N^*(m)$ for $N = 100$ and $m = 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95$ (logarithmic vertical axis), together with the normalized Gumbel and the normalized Gaussian $N(0,1)$.
Qn (p) = Q 1 n 1 G
G31 X
n G3m
m=2
p Q 1
nm+1 31 31 n31
Y
t=0
¶
µ 1
n
G31
n µ Y 1
t=0
p Q 1t
¶
t=
n31
n
p Q313t
´
1 ? n
Ã
1
p Q 1
nm 31 n31
¶
(17.26)
which shows that jQ>n (p) is a polynomial in p of degree the terms in the m-sum rapidly decrease; their ratio equals ³ 31 Q nm+1 31 n31 1 m n 31
p Q 1t
Q31 n .
!nm
Moreover,
?? 1
17.3 The n-ary tree
403
Figure 17.8 indicates that formula (17.25), although derived subject to (17.24), also seems valid when $D = \left\lfloor \frac{\log[1 + N(k-1)]}{\log k} - 1 \right\rfloor$, where $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$. This suggests that the deepest level $D$ need not be filled completely to count $k^D$ nodes and that (17.25) may extend to "incomplete" $k$-ary trees. As further observed from Fig. 17.8, $g_{N,k}(m)$ is monotonically decreasing in $k$. Hence, it is quite likely that the map $k \mapsto g_{N,k}(m)$ is decreasing in $k \in [1, N-1]$. Intuitively, this conjecture can be understood from Fig. 17.7. Both the $k = 2$ and $k = 5$ trees have an equal number of nodes. We observe that the deeper $D$ (or the smaller $k$), the more overlap is possible, hence, the larger $g_{N,k}(m)$.

Theorem 17.1.1 can also be deduced from (17.25). The lower bound is attained in a star topology where $k = N-1$, $D = 1$ and $E[H_N] = 1$. The upper bound is attained in a line topology where $k = 1$, $D = N-1$ and $E[H_N] = \frac{N}{2}$. Furthermore, for real values of $k \in [1, N-1]$, the set of curves specified by (17.25) covers the total allowable space of $g_{N,k}(m)$, as shown in Fig. 17.1. This suggests considering (17.25) for estimating $k$ in real topologies.

Since $g_N(1) = E[H_N]$, the average hopcount in a $k$-ary tree follows from (17.25) as
\[
\begin{aligned}
E[H_N] &= N - 1 - \sum_{j=0}^{D-1} k^{D-j}\, \frac{N - 1 - \frac{k^{j+1}-1}{k-1}}{N-1} = \frac{1}{N-1} \sum_{j=0}^{D-1} k^{D-j}\, \frac{k^{j+1}-1}{k-1} \\
&= \frac{ND}{N-1} + \frac{D}{(N-1)(k-1)} - \frac{1}{k-1}
\end{aligned} \tag{17.27}
\]
For large $N$, we find with
\[
D = \left\lfloor \frac{\log[1 + N(k-1)]}{\log k} - 1 \right\rfloor \approx \log_k N + \log_k(1 - 1/k) + O(1/N)
\]
that
\[
E[H_N] = \log_k N + \log_k(1 - 1/k) - \frac{1}{k-1} + O\left( \frac{\log_k N}{N} \right) \tag{17.28}
\]
Comparing (17.28) with the average hopcount in the random graph (16.10) shows equality to first order if $k_{\mathrm{rg}} = e$. Moreover, both the second order terms $\gamma - 1 = -0.42$ and $\log(1 - 1/e) - \frac{1}{e-1} = -1.04$ are $O(1)$ and independent of $N$. This shows that the multicast gain in the random graph is well approximated by $g_{N,e}(m)$.
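The closed form (17.27) can be compared with a direct level-by-level average, since level $j$ of a complete $k$-ary tree holds $k^j$ nodes. The following sketch assumes a complete tree with integer $k \geq 2$ and depth $D$; the helper names are illustrative only.

```python
from fractions import Fraction
import math

def kary_size(k, D):
    """Number of nodes N = (k^(D+1) - 1)/(k - 1) of a complete k-ary tree of depth D."""
    return (k ** (D + 1) - 1) // (k - 1)

def mean_hopcount_direct(k, D):
    """Average depth of a uniformly chosen non-root node (level j holds k^j nodes)."""
    N = kary_size(k, D)
    return Fraction(sum(j * k**j for j in range(1, D + 1)), N - 1)

def mean_hopcount_formula(k, D):
    """Closed form (17.27)."""
    N = kary_size(k, D)
    return (Fraction(N * D, N - 1) + Fraction(D, (N - 1) * (k - 1))
            - Fraction(1, k - 1))

def mean_hopcount_asymptotic(k, D):
    """Leading terms of (17.28), dropping the O(log_k N / N) remainder."""
    N = kary_size(k, D)
    return math.log(N, k) + math.log(1 - 1 / k, k) - 1 / (k - 1)
```

Both exact functions agree for every complete tree, while the asymptotic form approaches them as $N$ grows.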
17.4 The Chuang-Sirbu law

We discuss the empirical Chuang-Sirbu scaling law, which states that $g_N(m) \approx E[H_N]\, m^{0.8}$ for the Internet. Chuang and Sirbu (1998) observed this relation in Internet measurements and, subsequently, Phillips et al. (1999) dubbed the observation the Chuang-Sirbu law. Corollary 17.1.5 implies that the empirical law of Chuang-Sirbu cannot hold true for all $m \leq N$. Indeed, if $g_N(m) = E[H_N]\, m^{0.8}$, we obtain from the inequality (17.7) and the identity $f_N(m) = m E[H_N]$, that $m^{0.2} \leq E[H_N]$. Write $m = xN$ for a fixed $0 < x < 1$ and $x$ independent of $N$. The inequality then requires $x^{0.2} N^{0.2} \leq E[H_N]$, which fails for sufficiently large $N$ whenever $E[H_N]$ grows more slowly than $N^{0.2}$. Hence, we have shown that

Corollary 17.4.1 For all graphs satisfying the condition that $\frac{E[H_N]}{N^{0.2}} \to 0$ for large $N$, the empirical Chuang-Sirbu law does not hold in the region $m = xN$ with $0 < x \leq 1$ and sufficiently large $N$.

The most realistic graph models for the Internet assume that $E[H_N] \approx c \log N$, since this implies that the number of routers that can be reached from any starting destination grows exponentially with the number of hops. For these realistic graphs, Corollary 17.4.1 states that the empirical Chuang-Sirbu law does not hold for all $m$. On the other hand, there are more regular graphs (such as a $d$-lattice, where $E[H_N] \simeq \frac{d}{3} N^{1/d}$) with $E[H_N] \sim N^{0.2+\epsilon}$ (and $\epsilon > 0$) for which the mathematical condition $m^{0.2} \leq E[H_N]$ is satisfied for all $m$ and $N$. As shown in Van Mieghem et al. (2000), however, these classes of graphs, in contrast to random graphs, do not lead to good models for SPTs in the Internet.
17.4.1 Validity range of the Chuang-Sirbu law

For the random graph $G_p(N)$, the SPT is very close to a URT for large $N$ and with (16.10), we obtain $f_N(m) \sim m(\log N + \gamma - 1)$. From the exact $g_N(m)$ formula (17.16) for the random graph $G_p(N)$, the asymptotic for large $N$ and $m$ follows as
\[
g_N(m) \sim \frac{mN}{N-m} \log\left( \frac{N}{m} \right) - \frac{1}{2} \tag{17.29}
\]
The above scaling explains the empirical Chuang-Sirbu law for $G_p(N)$: for $m$ small with respect to $N$, the graphs of $(\log N + \gamma - 1)\, m^{0.8}$ and $\frac{mN}{N-m} \log\left( \frac{N}{m} \right) - \frac{1}{2}$ look very alike in a log-log plot, as illustrated in Fig. 17.9.

Using the asymptotic properties of the digamma function $\psi$, we obtain (17.29) as an excellent approximation for large $N$ (and all $m$) or, in normalized form with $m = xN$ and $0 < x < 1$,
\[
\frac{g_N(xN) + 0.5}{N} \sim \frac{x \log x}{x - 1} \tag{17.30}
\]
The normalized Chuang-Sirbu law is $\frac{g_N(xN)}{N} = \frac{E[H_N]}{N^{0.2}}\, x^{0.8}$. It is interesting to note that the Chuang-Sirbu law is "best" if $\frac{E[H_N]}{N^{0.2}} = 1$, since then both endpoints $x = 0$ and $x = 1$ coincide with (17.30). This optimum is achieved when $N \approx 250\,000$, which is of the order of magnitude of the estimated number of routers in the current Internet. This observation may explain the fairly good correspondence on a less sensitive log-log scale with Internet measurements. At the same time, it shows that for a growing Internet, the fit of the Chuang-Sirbu law will deteriorate. For $N \geq 10^6$, the Chuang-Sirbu law underestimates $g_N(m)$ for all $m$.
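The value $N \approx 250\,000$ can be reproduced numerically. The sketch below assumes the large-$N$ approximation $E[H_N] \approx \log N + \gamma - 1$ taken from (16.10) and locates, by bisection, the size $N$ at which $E[H_N]/N^{0.2}$ crosses 1; the bracketing interval $[10^4, 10^6]$ is our own choice.

```python
import math

EULER_GAMMA = 0.5772156649015329

def mismatch(N):
    """E[H_N]/N^0.2 - 1, with the assumed approximation E[H_N] ~ log N + gamma - 1."""
    return (math.log(N) + EULER_GAMMA - 1.0) / N**0.2 - 1.0

def best_fit_size(lo=1e4, hi=1e6, tol=1.0):
    """Bisection for the N at which the ratio E[H_N]/N^0.2 equals 1."""
    assert mismatch(lo) > 0 > mismatch(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mismatch(mid) > 0:
            lo = mid  # ratio still above 1: the crossing lies at larger N
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under this approximation the ratio is decreasing on the chosen interval, so the bisection converges to a single crossing slightly below 250 000.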
Fig. 17.9. The multicast efficiency $g_N(m)$ of the random graph versus $m$ for $N = 10^j$ with $j = 3, 4, \ldots, 7$, compared with the $m^{0.8}$ law. The endpoint of each curve $g_N(N-1) = N-1$ determines $N$. The insert shows the effective power exponent $\beta(N)$ versus $N$.
17.4.2 The effective power exponent $\beta(N)$

For small to moderate values of $m$, $g_N(m)$ is very close to a straight line in a log-log plot. This "power law behavior" implies that $\log g_N(m) \approx \log E[H_N] + \beta(N) \log m$, which is a first order Taylor expansion of $\log g_N(m)$ in $\log m$. This observation suggests the computation$^4$ of the effective power exponent $\beta(N)$ as
\[
\beta(N) = \left. \frac{d \log g_N(m)}{d \log m} \right|_{m=1} \tag{17.31}
\]
Only for a straight line, the differential operator can be replaced by the difference operator such that $\beta(N) \approx \beta^*(N)$, where
\[
\beta^*(N) = \frac{\log \frac{g_N(2)}{E[H_N]}}{\log 2} \tag{17.32}
\]
In general, for small $m$, the effective power exponent (17.31) is not a constant 0.8 as in the Chuang-Sirbu law, but dependent on $N$. Since $g_N(m)$ is concave by Theorem 17.1.2, $\beta(N)$ is the maximum possible value for $\frac{d \log g_N(m)}{d \log m}$ at any $m \geq 1$. A direct consequence of Theorem 17.1.1 is that the effective power exponent $\beta(N) \in \left[ \frac{1}{2}, 1 \right]$. From recent Internet measurements, Chalmers and Almeroth (2001) found that $0.66 \leq \beta(N) \leq 0.7$. The effective power exponent $\beta(N)$ as defined in (17.31) for the random graph is
\[
\beta(N) = \frac{N \left( \psi(N) + \gamma - \frac{\pi^2}{6} + \frac{\pi^2}{6N} \right)}{(N-1) \left( \psi(N) + (\gamma - 1) + \frac{1}{N} \right)}
\]
while, according to the definition (17.32),
\[
\beta^*(N) = \frac{\log \frac{g_N(2)}{E[H_N]}}{\log 2} = 1 + \log_2 \left[ \frac{(N-1) \left( \psi(N) + \gamma - 3/2 + 1/N \right)}{(N-2) \left( \psi(N) + \gamma - 1 + 1/N \right)} \right]
\]
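Both exponents are straightforward to evaluate for integer $N$, because $\psi(N) + \gamma$ then equals the harmonic number $\sum_{k=1}^{N-1} 1/k$. The following sketch relies on that identity; the function names are ours, and `beta_star` assumes $N \geq 3$ so that the factor $N-2$ does not vanish.

```python
import math

def psi_plus_gamma(N):
    """psi(N) + gamma for integer N >= 1, i.e. the harmonic number H_{N-1}."""
    return math.fsum(1.0 / k for k in range(1, N))

def beta(N):
    """Effective power exponent (17.31) for the random graph."""
    s = psi_plus_gamma(N)
    zeta2 = math.pi**2 / 6
    return N * (s - zeta2 + zeta2 / N) / ((N - 1) * (s - 1 + 1 / N))

def beta_star(N):
    """Difference-based exponent (17.32) for the random graph, N >= 3."""
    s = psi_plus_gamma(N)
    ratio = (N - 1) * (s - 1.5 + 1 / N) / ((N - 2) * (s - 1 + 1 / N))
    return 1 + math.log2(ratio)
```

Evaluating `beta(3) - beta_star(3)` gives approximately 0.048.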
The difference $\beta(N) - \beta^*(N)$ decreases monotonically and is largest, 0.048, at $N = 3$, while it is 0.0083 at $N = 10^5$ and 0.0037 at $N = 10^{10}$. This effective power exponent $\beta(N)$ is drawn in the insert of Fig. 17.9, which shows that $\beta(N)$ is increasing and not a constant close to 0.8. More interestingly, for large $N$, we find with (16.10) and (16.11) that $\beta(N) \sim \frac{\mathrm{Var}[H_N]}{E[H_N]}$ and that
Q] limQ Q 3 p; Dm 3 p + 1; }) = }n K(Dm 3 p + 1) K(D m 3 p + 1 + n) n=0
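Hence,
\[
S_j = \frac{1}{m!}\, \frac{\Gamma(N-m)}{\Gamma(A_j - m + 1)}\, \left. \frac{d^m}{dz^m} \left[ (1-z)^m F(1, N-m; A_j - m + 1; z) \right] \right|_{z=0}
\]
Invoking the differentiation formula (Abramowitz and Stegun, 1968, Section 15.2.7),
\[
\frac{d^m}{dz^m} \left[ (1-z)^{a+m-1} F(a, b; c; z) \right] = \frac{(-1)^m\, \Gamma(a+m)\, \Gamma(c-b+m)\, \Gamma(c)}{\Gamma(a)\, \Gamma(c-b)\, \Gamma(c+m)}\, (1-z)^{a-1} F(a+m, b; c+m; z)
\]
we have, since $a = 1$ and $F(a, b; c; 0) = 1$,
\[
S_j = \frac{(-1)^m\, \Gamma(N-m)\, \Gamma(A_j + 1 - N + m)}{\Gamma(A_j + 1 - N)\, \Gamma(A_j + 1)}
\]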
Thus, jQ>n (p) =
=
G31 [ Dm ! (31)p (Q 3 p 3 1)!(Dm 3 Q + p)! (Q 3 1)! pnG 3Q 3 nG3m Q 31 Q! (Dm 3 Q)!Dm ! Dm ! m=1 G31 G31 [ (31)p31 (Q 3 p 3 1)! [ G3m (Dm 3 Q + p)! pnG nG3m + + n Q 31 (Q 3 1)! (Dm 3 Q)! m=1 m=1
from which (17.25) is immediate.
¤
17.8 Problem

(i) Compute the effective power exponent $\beta^*(N)$ for the $k$-ary tree.
18 The hopcount to an anycast group
In this chapter, the probability density function of the number of hops to the nearest member of the anycast group consisting of $m$ members (e.g. servers) is analyzed. The results are applied to compute a performance measure $\eta$ of the efficiency of anycast over unicast and to the server placement problem. The server placement problem asks for the number of (replicated) servers $m$ needed such that any user in the network is not more than $j$ hops away from a server of the anycast group with a certain prescribed probability. As in Chapter 17 on multicast, two types of shortest path trees are investigated: the regular $k$-ary tree and the irregular uniform recursive tree treated in Chapter 16. Since these two extreme cases of trees indicate that the performance measure $\eta \approx 1 - a \log m$, where the real number $a$ depends on the details of the tree, it is believed that for trees in real networks (such as the Internet) the same logarithmic law applies. An order calculus on exponentially growing trees further supplies evidence for the conjecture that $\eta \approx 1 - a \log m$ for small $m$.

18.1 Introduction

IPv6 possesses a new address type, anycast, that is not supported in IPv4. The anycast address is syntactically identical to a unicast address. However, when a set of interfaces is specified by the same unicast address, that unicast address is called an anycast address. The advantage of anycast is that a group of interfaces at different locations is treated as one single address. For example, the information on servers is often duplicated over several secondary servers at different locations for reasons of robustness and accessibility. Changes are only performed on the primary servers, which are then copied onto all secondary servers to maintain consistency. If both the primary and all secondary servers have the same anycast address, a query
from some source towards that anycast address is routed towards the closest server of the group. Hence, anycast is more efficient than routing the packet to the root server (primary server).

Suppose there are $m$ (primary plus all secondary) servers and that these $m$ servers are uniformly distributed over the Internet. The number of hops from the querying device $A$ to the closest server is the minimum number of hops, denoted by $h_N(m)$, of the set of shortest paths from $A$ to these $m$ servers in a network with $N$ nodes. In order to solve the problem, the shortest path tree rooted at node $A$, the querying device, needs to be investigated. We assume in the sequel that one of the $m$ uniformly distributed servers can possibly coincide with the same router to which the querying machine $A$ is attached. In that case, $h_N(m) = 0$. This assumption is also reflected in the notation, small $h$, according to the convention made in Section 16.3.2 that capital $H$ for the hopcount excludes the event that the hopcount can be zero.

Clearly, if $m = 1$, the problem reduces to the hopcount of the shortest path from $A$ to one uniformly chosen node in the network and we have that $h_N(1) = h_N$, where $h_N$ is the hopcount of the shortest path in a graph with $N$ nodes. The other extreme for $m = N$ leads to $h_N(N) = 0$ because all nodes in the network are servers. In between these extremes, it holds that $h_N(m) \leq h_N(m-1)$ since one additional anycast group member (server) can never increase the minimum number of hops from an arbitrary node to that larger group.

The hopcount to an anycast group is a stochastic problem. Even if the network graph is exactly known, an arbitrary node $A$ views the network along a tree. Most often it is a shortest path tree. Although the sequel emphasizes "shortest path trees", the presented theory is equally valid for any type of tree. The node $A$'s perception of the network is very likely different from the view of another node $A'$. Nevertheless, shortest path trees in the same graph possess to some extent related structural properties that allow us to treat the problem by considering certain types or classes of shortest path trees. Hence, instead of varying the arbitrary node $A$ over all possible nodes in the graph and computing the shortest path tree at each different node, we vary the structure of the shortest path tree rooted at $A$ over all possible shortest path trees of a certain type. Of course, the confinement of the analysis then lies in the type of tree that is investigated. We will only consider the regular $k$-ary tree and the irregular URT. It seems reasonable to assume that "real" shortest path trees in the Internet possess a structure somewhere in between these extremes and that scaling laws observed in both extreme cases may also apply to the Internet.

The presented analysis allows us to address at least two different issues. First, for a same class of trees, the efficiency of anycast over unicast defined in terms of a performance measure $\eta$,
\[
\eta = \frac{E[h_N(m)]}{E[h_N(1)]} \leq 1
\]
is quantified. The performance measure $\eta$ indicates how many hops (or link traversals or bandwidth consumption) can be saved, on average, by anycast. Alternatively, $\eta$ also reflects the gain in end-to-end delay or how much faster than unicast anycast finds the desired information. Second, the so-called server placement problem can be treated. More precisely, the question "How many servers $m$ are needed to guarantee that any user request can access the information within $j$ hops with probability $\Pr[h_N(m) > j] < \epsilon$, where $\epsilon$ is a certain level of stringency?" can be answered. The server placement problem is expected to gain increased interest especially for real-time services where end-to-end QoS (e.g. delay) requirements are desirable. In the most general setting of this server placement problem, all nodes are assumed to be equally important in the sense that users' requests are generated equally likely at any router in the network with $N$ nodes. As mentioned in Chapter 17, the validity of this assumption has been justified by Phillips et al. (1999). In the case of uniform user requests, the best strategy is to place servers also uniformly over the network. Computations of $\Pr[h_N(m) > j] < \epsilon$ for given stringency $\epsilon$ and hop $j$ allow the determination of the minimum number $m$ of servers. The solution of this server placement problem may be regarded as an instance of the general quality of service (QoS) portfolio of a network operator. When the number of servers for a major application offered by the service provider is properly computed, the service provider may announce levels of QoS (e.g. via $\Pr[h_N(m) > j] < \epsilon$) and accordingly price the use of the application.

18.2 General analysis

Let us consider a particular shortest path tree $T$ rooted at node $A$ with the level set $L_N = \left\{ X_N^{(k)} \right\}_{1 \leq k \leq N-1}$ as defined in Section 16.2.2. Suppose
that the result of uniformly distributing $m$ anycast group members over the graph leads to a number $m^{(k)}$ of those anycast group member nodes that are $k$ hops away from the root. These $m^{(k)}$ distinct nodes all belong to the $k$-th level set $X_N^{(k)}$. Similarly as for $X_N^{(k)}$, some relations are immediate.

First, $m^{(0)} = 0$ means that none of the $m$ anycast group members coincides with the root node $A$, while $m^{(0)} = 1$ means that one of them (and at most one) is attached to the same router $A$ as the querying device. Also, for all $k > 0$, it holds that $0 \leq m^{(k)} \leq X_N^{(k)}$ and that
\[
\sum_{k=0}^{N-1} m^{(k)} = m \tag{18.1}
\]
Given the tree $T$ specified by the level set $L_N$ and the anycast group members specified by the set $\left\{ m^{(0)}, m^{(1)}, \ldots, m^{(N-1)} \right\}$, we will derive the lowest nonempty level $m^{(j)}$, which is equivalent to $h_N(m)$. Let us denote by $e_j$ the event that all first $j+1$ levels are not occupied by an anycast group member,
\[
e_j = \left\{ m^{(0)} = 0 \right\} \cap \left\{ m^{(1)} = 0 \right\} \cap \cdots \cap \left\{ m^{(j)} = 0 \right\}
\]
The probability distribution of the minimum hopcount, $\Pr[h_N(m) = j | L_N]$, is then equal to the probability of the event $e_{j-1} \cap \left\{ m^{(j)} > 0 \right\}$. Since the event $\left\{ m^{(j)} > 0 \right\} = \left\{ m^{(j)} = 0 \right\}^c$, using the conditional probability yields
\[
\begin{aligned}
\Pr[h_N(m) = j | L_N] &= \Pr\left[ \left\{ m^{(j)} > 0 \right\} \big| e_{j-1} \right] \Pr[e_{j-1}] \\
&= \left( 1 - \Pr\left[ \left\{ m^{(j)} = 0 \right\} \big| e_{j-1} \right] \right) \Pr[e_{j-1}]
\end{aligned} \tag{18.2}
\]
Since $e_j = e_{j-1} \cap \left\{ m^{(j)} = 0 \right\}$, the probability of the event $e_j$ can be decomposed as
\[
\Pr[e_j] = \Pr\left[ \left\{ m^{(j)} = 0 \right\} \big| e_{j-1} \right] \Pr[e_{j-1}] \tag{18.3}
\]
The assumption that all $m$ anycast group members are uniformly distributed enables us to compute $\Pr\left[ \left\{ m^{(j)} = 0 \right\} \big| e_{j-1} \right]$ exactly. Indeed, by the uniform assumption, the probability equals the ratio of the favorable possibilities over the total possible. The total number of ways to distribute $m$ items over $N - \sum_{k=0}^{j-1} X_N^{(k)}$ positions (the latter constraint follows from the condition $e_{j-1}$) equals $\binom{N - \sum_{k=0}^{j-1} X_N^{(k)}}{m}$. Likewise, the favorable number of ways to distribute $m$ items over the remaining levels higher than $j$ leads to
\[
\Pr\left[ \left\{ m^{(j)} = 0 \right\} \big| e_{j-1} \right] = \frac{\binom{N - \sum_{k=0}^{j} X_N^{(k)}}{m}}{\binom{N - \sum_{k=0}^{j-1} X_N^{(k)}}{m}} \tag{18.4}
\]
The recursion (18.3) needs an initialization, given by
\[
\Pr[e_0] = \Pr\left[ m^{(0)} = 0 \right] = 1 - \frac{m}{N}
\]
which follows from $\Pr\left[ m^{(0)} = 0 \right] = \frac{\binom{N-1}{m}}{\binom{N}{m}}$ and equals $\Pr\left[ \left\{ m^{(0)} = 0 \right\} \big| e_{-1} \right]$ (although the event $e_{-1}$ is meaningless). Observe that $\Pr\left[ m^{(0)} = 1 \right] = \frac{m}{N}$ holds for any tree such that
\[
\Pr[h_N(m) = 0] = \frac{m}{N}
\]
By iteration of (18.3), we obtain
\[
\Pr[e_j] = \prod_{s=0}^{j} \frac{\binom{N - \sum_{k=0}^{s} X_N^{(k)}}{m}}{\binom{N - \sum_{k=0}^{s-1} X_N^{(k)}}{m}} = \frac{\binom{N - \sum_{k=0}^{j} X_N^{(k)}}{m}}{\binom{N}{m}} \tag{18.5}
\]
where the convention in summation is that $\sum_{k=a}^{b} f_k = 0$ if $a > b$. Finally, combining (18.2) with (18.4) and (18.5), we arrive at the general conditional expression for the minimum hopcount to the anycast group,
\[
\Pr[h_N(m) = j | L_N] = \frac{\binom{N - \sum_{k=0}^{j-1} X_N^{(k)}}{m} - \binom{N - \sum_{k=0}^{j} X_N^{(k)}}{m}}{\binom{N}{m}} \tag{18.6}
\]
Clearly, while $\Pr[h_N(0) = j | L_N] = 0$ since there is no path, we have for $m = 1$,
\[
\Pr[h_N(1) = j | L_N] = \frac{X_N^{(j)}}{N}
\]
It directly follows from (18.6) that
\[
\Pr[h_N(m) \leq n | L_N] = 1 - \frac{\binom{N - \sum_{k=0}^{n} X_N^{(k)}}{m}}{\binom{N}{m}} \tag{18.7}
\]
If $N - \sum_{k=0}^{n} X_N^{(k)} < m$ or, equivalently, $\sum_{k=n+1}^{N-1} X_N^{(k)} < m$, then equation (18.7) shows that $\Pr[h_N(m) > n | L_N] = 0$. The maximum possible hopcount of a shortest path to an anycast group strongly depends on the specifics of the shortest path tree or the level set $L_N$. A general result is worth mentioning:

Theorem 18.2.1 For any graph, it holds that
\[
\Pr[h_N(m) > N - m] = 0
\]
In words, the longest shortest path to an anycast group with $m$ members can never possess more than $N - m$ hops.

Proof: This general theorem follows from the fact that the line topology is the tree with the longest hopcount $N - 1$ and only in the case that all $m$ last positions (with respect to the source or root) are occupied by the $m$ anycast group members, is the maximum hopcount $N - m$. □

For the URT, $\Pr[h_N(m) = N - m]$ is computed exactly in (18.12).

Corollary 18.2.2 For any graph, it holds that
\[
\Pr[h_N(N-1) = 1] = \frac{1}{N}
\]
Proof: This corollary follows from Theorem 18.2.1 and the law of total probability. Alternatively, if there are $N - 1$ anycast members in a network with $N$ nodes, the shortest path can only consist of one hop if none of the anycast members coincides with the root node. This probability is precisely $\frac{1}{N}$. □

Using the tail probability formula (2.36) for the average, it follows from (18.7) that
\[
E[h_N(m) | L_N] = \frac{1}{\binom{N}{m}} \sum_{n=0}^{N-2} \binom{N - \sum_{k=0}^{n} X_N^{(k)}}{m} \tag{18.8}
\]
from which we find,
\[
E[h_N(1) | L_N] = \frac{1}{N} \sum_{k=1}^{N-1} k X_N^{(k)}
\]
Thus, given $L_N$, a performance measure $\eta$ for anycast over unicast can be quantified as
\[
\eta = \frac{E[h_N(m) | L_N]}{E[h_N(1) | L_N]} \leq 1
\]
Using the law of total probability, the distribution of the minimum hopcount to the anycast group is
\[
\Pr[h_N(m) = j] = \sum_{\text{all } L_N} \Pr[h_N(m) = j | L_N] \Pr[L_N] \tag{18.9}
\]
or explicitly,
\[
\Pr[h_N(m) = j] = \sum_{\sum_{k=1}^{N-1} x_k = N-1} \frac{\binom{\sum_{k=j}^{N-1} x_k}{m} - \binom{\sum_{k=j+1}^{N-1} x_k}{m}}{\binom{N}{m}}\, \Pr\left[ X_N^{(1)} = x_1, \ldots, X_N^{(N-1)} = x_{N-1} \right]
\]
where the integers $x_k \geq 0$ for all $k$. This expression explicitly shows the importance of the level structure $L_N$ of the shortest path tree $T$. The level set $L_N$ entirely determines the shape of the tree $T$. Unfortunately, a general form for $\Pr[L_N]$ or $\Pr[h_N(m) = j]$ is difficult to obtain. In principle, via extensive trace-route measurements from several roots, the shortest path tree and $\Pr[L_N]$ can be constructed such that a (rough) estimate of the level set $L_N$ in the Internet can be obtained.

18.3 The $k$-ary tree

For regular trees, explicit expressions are possible because the summation in (18.9) simplifies considerably. For example, for the $k$-ary tree defined in Section 17.3,
\[
X_N^{(j)} = k^j
\]
Provided the set $L_N$ only contains these values of $X_N^{(j)}$ for each $j$, we have that $\Pr[L_N] = 1$, else it is zero (because then $L_N$ is not consistent with a $k$-ary tree). Summarizing, for the $k$-ary tree with $N = \frac{k^{D+1} - 1}{k - 1}$ and $D$ levels, the distribution of the minimum hopcount to the anycast group is
\[
\Pr[h_N(m) = j] = \frac{\binom{N - \frac{k^{j} - 1}{k - 1}}{m} - \binom{N - \frac{k^{j+1} - 1}{k - 1}}{m}}{\binom{N}{m}} \tag{18.10}
\]
Extension of the integer $k$ to real numbers in the formula (18.10) is expected to be of value as suggested in Section 17.3. When a $k$-ary tree was used to fit corresponding Internet multicast measurements (Van Mieghem et al., 2001a), a remarkably accurate agreement was found for the value $k \approx 3.2$, which is about the average degree of the Internet graph. Hence, if we were to use the $k$-ary tree as model for the hopcount to an anycast group, we expect that $k \approx 3.2$ is the best value for Internet shortest path trees. However, we feel we ought to mention that the hopcount distribution of the shortest path between two arbitrary nodes is definitely not that of a $k$-ary tree, because $\Pr[h_N(1) = j]$ increases with the hopcount $j$, which is in conflict with Internet trace-route measurements (see, for example, the bell-shape curve in Fig. 16.4).

Figure 18.1 displays $\Pr[h_N(m) \leq j]$ for a $k$-ary tree with outdegree $k = 3$ and
$N = 500$. This type of plot allows us to solve the "server placement problem". For example, assuming that the $k$-ary tree is a good model and the network consists of $N = 500$ nodes, Fig. 18.1 shows that at least $m = 10$ servers are needed to assure that any user is not more than four hops separated from a server with a probability of 93%. More precisely, the equation $\Pr[h_{500}(m) > 4] < 0.07$ is obeyed if $m \geq 10$. Figure 18.2 gives an idea how the performance measure $\eta$ decreases with the size of the anycast group in $k$-ary trees (all with outdegree $k = 3$), but with different size $N$. For values of $m$ up to around 20% of $N$, we observe that $\eta$ decreases logarithmically in $m$.

Fig. 18.1. The distribution function $\Pr[h_{500}(m) \leq j]$ of $h_{500}(m)$ versus the hops $j$ for various sizes of the anycast group ($m = 1, 2, 5, 10, 50$) in a $k$-ary tree with $k = 3$ and $N = 500$.
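The server placement example can be reproduced from the tail probability implied by (18.7). The sketch below is a minimal illustration with helper names of our own; it assumes that levels 0 to 4 of the 3-ary tree are completely filled (121 nodes), which is our reading of how a tree with $N = 500$ nodes, not a complete 3-ary tree, is handled.

```python
from math import comb

def tail_prob(N, m, nodes_within):
    """Pr[h_N(m) > n] from (18.7); nodes_within counts the nodes at levels 0..n."""
    # math.comb returns 0 when m exceeds the number of remaining nodes.
    return comb(N - nodes_within, m) / comb(N, m)

def kary_nodes_within(k, n):
    """Nodes at levels 0..n of a k-ary tree: (k^(n+1) - 1)/(k - 1)."""
    return (k ** (n + 1) - 1) // (k - 1)

def min_servers(N, k, max_hops, eps):
    """Smallest m with Pr[h_N(m) > max_hops] < eps (assumes levels 0..max_hops are full)."""
    covered = kary_nodes_within(k, max_hops)
    for m in range(1, N + 1):
        if tail_prob(N, m, covered) < eps:
            return m
    return N
```

With these assumptions, `min_servers(500, 3, 4, 0.07)` returns 10, in line with the value read from Fig. 18.1.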
18.4 The uniform recursive tree (URT)

Chapter 16 motivates the interest in the URT. The URT is believed to provide a reasonable, first order estimate for the hopcount problem to an anycast group in the Internet.

18.4.1 Recursion for $\Pr[h_N(m) = j]$

A combinatorial approach such as (18.9) is seldom successful for URTs, while structural properties often lead to results. The basic Theorem 16.2.1 of the URT, applied to the anycast minimum hop problem, is illustrated in Fig. 18.3.

Fig. 18.2. The performance measure $\eta$ for several sizes of $k$-ary trees (with $k = 3$ and $N = 100, 500, 5000, 10^5, 10^6$) as a function of the ratio $m/N$ of anycast nodes over the total number of nodes.

Fig. 18.3. A uniform recursive tree consisting of two subtrees $T_1$ and $T_2$ with $k$ and $N - k$ nodes respectively. The first cluster contains $i$ anycast members while the cluster with $N - k$ nodes contains $m - i$ anycast members.
Figure 18.3 shows that any URT can be separated into two subtrees $T_1$ and $T_2$ with $k$ and $N - k$ nodes respectively. Moreover, Theorem 16.2.1 states that each subtree is independent of the other and again a URT. Consider now a specific separation of a URT $T$ into $T_1 = t_1$ and $T_2 = t_2$, where the tree $t_1$ contains $k$ nodes and $i$ of the $m$ anycast members and $t_2$ possesses $N - k$ nodes and the remaining $m - i$ anycast members. The event $\{h_T(m) = j\}$ equals the union over all possible sizes $N_1 = k$ and subgroups $m_1 = i$ of the event $\{h_{t_1}(i) = j - 1\} \cap \{h_{t_2}(m - i) \geq j\}$ and the event $\{h_{t_1}(i) > j - 1\} \cap \{h_{t_2}(m - i) = j\}$,
\[
\{h_T(m) = j\} = \cup_k \cup_i \left\{ \{h_{t_1}(i) = j - 1\} \cap \{h_{t_2}(m - i) \geq j\} \right\} \cup \left\{ \{h_{t_1}(i) > j - 1\} \cap \{h_{t_2}(m - i) = j\} \right\}
\]
Because $h_N(0)$ is meaningless, the relation must be modified for the case $i = 0$ to
\[
\{h_T(m) = j\} = \{h_{t_2}(m) = j\}
\]
and for the case $i = m$ to
\[
\{h_T(m) = j\} = \{h_{t_1}(m) = j - 1\}
\]
This decomposition holds for any URT $T_1$ and $T_2$, not only for the specific ones $t_1$ and $t_2$. The transition towards probabilities becomes
\[
\begin{aligned}
\Pr[h_T(m) = j] = \sum_{\text{all } t_1, t_2, k, i} & \left( \Pr[h_{t_1}(i) = j - 1] \Pr[h_{t_2}(m - i) \geq j] + \Pr[h_{t_1}(i) > j - 1] \Pr[h_{t_2}(m - i) = j] \right) \\
& \times \Pr[T_1 = t_1, T_2 = t_2, N_1 = k, m_1 = i]
\end{aligned}
\]
Since $T_1$ and $T_2$ and also $m_1$ are independent given $N_1$, the last probability $\ell$ simplifies to
\[
\begin{aligned}
\ell &= \Pr[T_1 = t_1, T_2 = t_2, N_1 = k, m_1 = i] \\
&= \Pr[T_1 = t_1 | N_1 = k] \Pr[T_2 = t_2 | N_1 = k] \Pr[m_1 = i | N_1 = k] \Pr[N_1 = k]
\end{aligned}
\]
Theorem 16.2.1 states that $N_1$ is uniformly distributed over the set with $N - 1$ nodes such that $\Pr[N_1 = k] = \frac{1}{N-1}$. The fact that $i$ out of the $m$ anycast members, uniformly chosen out of $N$ nodes, belong to the recursive subtree $T_1$ implies that $m - i$ remaining anycast members belong to $T_2$. Hence, analogous to a combinatorial problem outlined by Feller (1970, p. 43) that leads to the hypergeometric distribution, we have
\[
\Pr[m_1 = i | N_1 = k] = \frac{\binom{k}{i} \binom{N-k}{m-i}}{\binom{N}{m}}
\]
because all favorable combinations are those $\binom{k}{i}$ to distribute $i$ anycast members in $T_1$ with $k$ nodes multiplied by all favorable $\binom{N-k}{m-i}$ to distribute the remaining $m - i$ in $T_2$ containing $N - k$ nodes. The total number of ways to distribute $m$ anycast members over $N$ nodes is $\binom{N}{m}$. Finally, we remark that the hopcount of the shortest path to $m$ anycast members in a URT only depends on its size. This means that the sum over all $t_1$ of $\Pr[T_1 = t_1 | N_1 = k]$, which equals 1, disappears and likewise also the sum over all $t_2$. Combining the above leads to
\[
\begin{aligned}
\Pr[h_N(m) = j] = {} & \sum_{k=1}^{N-1} \sum_{i=1}^{m-1} \frac{\binom{k}{i} \binom{N-k}{m-i}}{(N-1) \binom{N}{m}} \left( \Pr[h_k(i) = j - 1] \Pr[h_{N-k}(m - i) \geq j] + \Pr[h_k(i) > j - 1] \Pr[h_{N-k}(m - i) = j] \right) \\
& + \sum_{k=1}^{N-1} \frac{\binom{N-k}{m} \Pr[h_{N-k}(m) = j] + \binom{k}{m} \Pr[h_k(m) = j - 1]}{(N-1) \binom{N}{m}}
\end{aligned}
\]
By substitution of $k' = N - k$ and $m' = m - i$, we obtain the recursion,
\[
\begin{aligned}
\Pr[h_N(m) = j] = {} & \sum_{k=1}^{N-1} \sum_{i=1}^{m-1} \frac{\binom{k}{i} \binom{N-k}{m-i}}{(N-1) \binom{N}{m}} \left( \Pr[h_k(i) = j - 1] + \Pr[h_k(i) = j] \right) \sum_{q=j}^{N-k-1} \Pr[h_{N-k}(m - i) = q] \\
& + \sum_{k=1}^{N-1} \frac{\binom{k}{m}}{(N-1) \binom{N}{m}} \left( \Pr[h_k(m) = j] + \Pr[h_k(m) = j - 1] \right)
\end{aligned} \tag{18.11}
\]
This recursion (18.11) is solved numerically for $Q = 20$. The result is shown in Fig. 18.4, which demonstrates that $\Pr[k_Q(p) > Q - p] = 0$ or that the path with the longest hopcount to an anycast group of $p$ members consists of $Q - p$ links. Since there are $(Q-1)!$ possible recursive trees (Theorem 16.2.2) and there is only one line tree with $Q - 1$ hops where each node has precisely one child node, the probability to have precisely $Q - 1$ hops from the root is $\frac{1}{(Q-1)!}$ (which also is $\Pr[k_Q = Q - 1]$ given in (16.8)). The longest possible hopcount from a root to $p$ anycast members occurs in the line tree where all $p$ anycast members occupy the last $p$ positions. Hence, the probability for the longest possible hopcount equals

$$\Pr[k_Q(p) = Q - p] = \frac{p!}{(Q-1)!\binom{Q}{p}} \qquad (18.12)$$

because there are $p!$ possible ways to distribute the $p$ anycast members at the $p$ last positions in the line tree while there are $\binom{Q}{p}$ possibilities to distribute $p$ anycast members at arbitrary places in the line tree.
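The recursion (18.11) lends itself to a direct numerical evaluation. The following is a minimal sketch in Python, not the program used for Fig. 18.4: the function name `hop_pmf` is hypothetical, exact rational arithmetic is chosen only for convenience, and the base case $\Pr[k_1(1) = 0] = 1$ (a one-node tree whose single node is the anycast member) is an assumption needed to start the recursion. Probabilities of negative hopcounts are set to zero, following the convention used in the text.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def hop_pmf(Q, p, m):
    """Pr[k_Q(p) = m] in a URT with Q nodes and p anycast members, via (18.11)."""
    if p < 1 or p > Q or m < 0:
        return Fraction(0)            # convention: impossible cases have probability 0
    if Q == 1:
        return Fraction(1 if m == 0 else 0)   # assumed base case: the root is the member
    norm = (Q - 1) * comb(Q, p)
    total = Fraction(0)
    for n in range(1, Q):
        # double-sum part: l members in the subtree of size n, p - l in the rest
        for l in range(1, p):
            weight = comb(n, l) * comb(Q - n, p - l)
            if weight == 0:
                continue
            head = hop_pmf(n, l, m - 1) + hop_pmf(n, l, m)
            tail = sum(hop_pmf(Q - n, p - l, t) for t in range(m, Q - n))
            total += Fraction(weight, norm) * head * tail
        # single-sum part: all p members on one side of the split
        weight = comb(n, p)
        if weight:
            total += Fraction(weight, norm) * (hop_pmf(n, p, m) + hop_pmf(n, p, m - 1))
    return total


# Distribution for a small tree, e.g. Q = 8 nodes and p = 3 members
pmf_8_3 = [hop_pmf(8, 3, m) for m in range(8)]
```

For small sizes the output can be compared with the checks stated in the text: the probabilities sum to 1, $\Pr[k_Q(p) = 0] = p/Q$, and for $p < Q$ the last non-zero value agrees with (18.12).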
[Figure 18.4: plot for N = 20 of Pr[h_N(m) = j] (logarithmic vertical axis, $10^{-19}$ to $10^{-1}$) versus j (horizontal axis, 0 to 20).]
Fig. 18.4. The pdf of $k_Q(p)$ in a URT with $Q = 20$ nodes for all possible $p$. Observe that $\Pr[k_Q(p) > Q - p] = 0$. This relation connects the various curves to the value for $p$.
Figure 18.4 allows us to solve the "server placement problem". For example, consider the scenario in which a network operator announces that any user request will reach a server of the anycast group in no more than $m = 4$ hops in 99.9% of the cases. Assuming his network has $Q = 20$ routers and the shortest path tree is a URT, the network operator has to compute the number of anycast servers $p$ he has to place uniformly spread over the $Q = 20$ routers by solving $\Pr[k_{20}(p) > 4] < 10^{-3}$. Figure 18.4 shows that the intersection of the line $m = 4$ and the line $\Pr[k_{20}(p) = 4] = 10^{-3}$ is the curve for $p = 7$. Since the curves for $p \ge 7$ are exponentially decreasing, $\Pr[k_{20}(p) > 4]$ is safely$^1$ approximated by $\Pr[k_{20}(p) = 4]$, which leads to the placing of $p = 7$ servers. When following the line $m = 4$, we also observe that the curves for $p = 5, 6, 7, 8$ lie near to that of $p = 7$. This means that placing a server more does not considerably change the situation. It is a manifestation of the law $1 - d \log p$, which tells us that by placing $p$ servers the gain measured in hops with respect to the single server case is slowly, more precisely logarithmically, increasing. The performance measure for the URT is drawn for several sizes $Q$ in Fig. 18.5.

$^1$ More precisely, since $\Pr[k_{20}(4) > 4] = 0.00106$ and $\Pr[k_{20}(5) > 4] = 0.00032$, only $p = 5$ servers are sufficient.
18.4.2 Analysis of the recursion relation

The product of two probabilities in the double sum in (18.11) seriously complicates a possible analytic treatment. A relation for a generating function of $\Pr[k_Q(p) = m]$ and other mathematical results are derived in Van Mieghem (2004b). Here, we summarize the main results.

(a) Let us check $\Pr[k_Q(p) = 0] = \frac{p}{Q}$. Using $\Pr[k_n(l) > -1] = 1$, the convention that $\Pr[k_n(l) = -1] = 0$ and $\Pr[k_{Q-n}(p-l) = 0] = \frac{p-l}{Q-n}$, the right-hand side of (18.11), denoted by $u$, simplifies to

$$u = \frac{1}{(Q-1)\binom{Q}{p}} \sum_{n=1}^{Q-1} \sum_{l=0}^{p} \frac{p-l}{Q-n} \binom{n}{l}\binom{Q-n}{p-l} = \frac{1}{(Q-1)\binom{Q}{p}} \sum_{n=1}^{Q-1} \sum_{l=0}^{p-1} \binom{n}{l}\binom{Q-1-n}{p-1-l} = \frac{1}{(Q-1)\binom{Q}{p}} \sum_{n=1}^{Q-1} \binom{Q-1}{p-1} = \frac{p}{Q}$$

(b) Observe that $\Pr[k_Q(Q) = m] = 0$ for $m > 0$.

(c) For $p = 1$,

$$\Pr[k_Q = m] = \frac{1}{Q-1} \sum_{n=1}^{Q-1} \frac{n}{Q} \big( \Pr[k_n = m] + \Pr[k_n = m-1] \big)$$

Multiplying both sides by $z^m$ and summing over all $m$ leads to the recursion for the generating function (16.6),

$$(Q+1)\,\varphi_{Q+1}(z) = (z + Q)\,\varphi_Q(z)$$
(d) The case p = 2 is solved in Van Mieghem (2004b, Appendix) as ¶ µ m 2(1)Q3m X 2(1)Q313m (m+1) (n+m+1) n 2n1 VQ Pr [kQ (2) = m] = VQ (1) + Q! Q !(Q 1) n n=1 ¶¸ ¶ µ m31 µ 2n + 1 2(1)Q3m X m + n + 1 n+m+1 (1)n VQ + n m Q !(Q 1) n=0
(18.13)
In van der Hofstad et al. (2002b) we have demonstrated that the covariance between the number of nodes at level u and m for u m in the URT is ¶ µ u h i (1)Q31 X (u) (m) (n+m+1) n+m 2n + m u H [Q [Q = (1) VQ (Q 1)! n n=0
For m u = 1, the last term in (18.13) is recognized as ¡2n31¢ 1 ¡2n¢ = 2 n , the first sum in (18.13) is n µ ¶ m 2(1)Q3m X (n+m+1) n 2n1 VQ (1) Q !(Q 1) n n=1
With
2(31)Q 313m (m+1) VQ Q!
l k (m31) (m) [Q H [Q
(Q2 )
. Since
³ ´ ¸ (m) 2 H [Q (m+1) 2(1)Q3m31 VQ = ¡ ¢ (Q 1) Q! 2 Q2
= 2 Pr [kQ = m], we obtain
³ ´ ¸ h i (m) 2 (m31) (m) H [Q H [Q [Q 2Q Pr [kQ (2) = m] = Pr [kQ = m] + ¡ ¢ ¡Q ¢ Q 1 2 Q2 2 ¶ m µ 2(1)Q31 X m + n n+m (1)n+m VQ + Q !(Q 1) n n=1
It would be of interest to find an interpretation for the last sum. Without proof2 , we mention the following exact results: Q µ ¶ X Q
p=1
p
Pr [kQ (p) = Q 2] =
For p Q 3, it holds that
Q1 Xµ p=1
¶ Q 1 Pr [kQ 1 (p) = Q 3] 1 + Q 1 (Q 2)! p
p! Pr [kQ (p) = Q p 1] = ¡ ¢ (Q 1)! Q p 2
"µ ¶ # p X Q p+1 + (p 1)(p@2 + 1) + 2 n
By substitution into the recursion (18.11), one may verify these relations.
n=2
18.5 Approximate analysis

Since the general solution (18.9) is in many cases difficult to compute as shown for the URT in Section 18.4, we consider a simplified version of the above problem where each node in the tree has equal probability $s = \frac{p}{Q}$ to be a server. Instead of having precisely $p$ servers, the simplified version considers on average $p$ servers and the probability that there are precisely $p$ servers is $\binom{Q}{p} s^p (1-s)^{Q-p}$. In the simplified version, the associated equations to (18.4) and (18.3) are

$$\Pr\big[\{p^{(m)} = 0\} \,\big|\, h_{m-1}\big] = \Pr\big[\{p^{(m)} = 0\}\big] = (1-s)^{X_Q^{(m)}}$$

$$\Pr[h_m] = \prod_{o=0}^{m} \Pr\big[\{p^{(o)} = 0\}\big] = (1-s)^{\sum_{o=0}^{m} X_Q^{(o)}}$$

which implies that the probability that there are no servers in the tree is $(1-s)^Q$. Since in that case the hopcount is meaningless, we consider the conditional probability (18.2) of the hopcount given that the level set contains at least one server (which is denoted by $\tilde{k}_Q(p)$),

$$\Pr\big[\tilde{k}_Q(p) = m \,\big|\, O_Q\big] = \frac{\big(1 - (1-s)^{X_Q^{(m)}}\big)(1-s)^{\sum_{o=0}^{m-1} X_Q^{(o)}}}{1 - (1-s)^Q}$$

Thus,

$$\Pr\big[\tilde{k}_Q(p) \le q \,\big|\, O_Q\big] = \frac{1 - (1-s)^{\sum_{o=0}^{q} X_Q^{(o)}}}{1 - (1-s)^Q}$$

Finally, to avoid the knowledge of the entire level set $O_Q$, we use $E[X_Q^{(o)}] = Q \Pr[k_Q(1) = o]$ from (16.7) as the best estimate for each $X_Q^{(o)}$ and obtain the approximate formula

$$\Pr\big[\tilde{k}_Q(p) = m\big] = \frac{\big(1 - (1-s)^{E[X_Q^{(m)}]}\big)(1-s)^{\sum_{o=0}^{m-1} E[X_Q^{(o)}]}}{1 - (1-s)^Q} \qquad (18.14)$$
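As an illustration of how cheaply (18.14) can be evaluated, the sketch below first computes the single-member distribution $\Pr[k_Q(1) = o]$ with the $p = 1$ recursion of Section 18.4.2 (c) and then inserts $E[X_Q^{(o)}] = Q \Pr[k_Q(1) = o]$ into (18.14). The function names are hypothetical and floating-point arithmetic is assumed to be accurate enough for moderate $Q$.

```python
def single_member_pmf(Q):
    """Pr[k_Q(1) = m], m = 0..Q-1, from the p = 1 recursion (c)."""
    pmf = {1: [1.0]}                      # one-node tree: hopcount 0
    for size in range(2, Q + 1):
        row = [0.0] * size
        for m in range(size):
            acc = 0.0
            for n in range(1, size):
                prev = pmf[n]
                same = prev[m] if m < n else 0.0
                below = prev[m - 1] if 1 <= m <= n else 0.0
                acc += n * (same + below)
            row[m] = acc / (size * (size - 1))
        pmf[size] = row
    return pmf[Q]


def approx_hop_pmf(Q, p):
    """Approximate Pr[k~_Q(p) = m], m = 0..Q-1, following (18.14)."""
    s = p / Q
    levels = [Q * pr for pr in single_member_pmf(Q)]   # E[X_Q^(o)]
    norm = 1.0 - (1.0 - s) ** Q
    result, nodes_above = [], 0.0
    for x in levels:
        result.append((1.0 - (1.0 - s) ** x) * (1.0 - s) ** nodes_above / norm)
        nodes_above += x
    return result


approx_20_5 = approx_hop_pmf(20, 5)
```

Because the expected level sizes sum to $Q$, the terms telescope and the approximate probabilities add up to 1, which is a convenient sanity check on an implementation.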
In the dotted lines in Fig. 18.5, we have added the approximate result for the URT where $E[k_Q(p)]$ is computed based on (18.14), but where $E[k_Q(1)]$ is computed exactly. For $p = 1$, the approximate analysis (18.14) is not well suited: Fig. 18.5 illustrates this deviation in the fact that the approximate performance measure for $p = 1$, the ratio $E[\tilde{k}_Q(1)] / E[k_Q(1)]$, is smaller than 1. For higher values of $p$ we observe a fairly good correspondence. We found that the probability (18.14) reasonably approximates the exact result plotted on a linear scale. Only the tail behavior (on log-scale) and the case for $p = 1$ deviate significantly. In summary for the URT, the approximation (18.14) for $\Pr[k_Q(p) = m]$ is much faster to compute than the exact recursion and it seems appropriate for the computation of the performance measure for $p > 1$. However, it is less adequate to solve the server placement problem that requires the tail values $\Pr[k_Q(p) > m]$.
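[Figure 18.5: performance measure (vertical axis, 0.0 to 1.0) versus m/N (logarithmic horizontal axis with labelled ticks at 0.1 and 1) for N = 10, 20, 30 and 50; the legend pairs these sizes with the fits 0.404 ln(m/N), 0.295 ln(m/N), 0.252 ln(m/N) and 0.210 ln(m/N).]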
Fig. 18.5. The performance measure for several sizes $Q$ of URTs as a function of the ratio $p/Q$.
18.6 The performance measure in exponentially growing trees In this section, we investigate the observed law 1 d log p for a much larger class of trees, namely the class of exponentially growing trees to which both the n-ary tree and the URT belong. Also most trees in the Internet are exponentially growing trees. A tree is said to grow exponentially in the ³ ´ (m) 1@m number of nodes Q with degree if limm n=1
which, because all eigenvalues are distinct, implies that there is a smaller set of $v - 1$ linearly dependent eigenvectors. This contradicts the initial hypothesis. This important property has a number of consequences. First, it applies to left- as well as to right-eigenvectors. Relation (A.10) then shows that the sets of left- and right-eigenvectors form a bi-orthogonal system with $y_n^T x_n \neq 0$. For, if $x_n$ were orthogonal to $y_n$ (or $y_n^T x_n = 0$), (A.10) demonstrates that $x_n$ would be orthogonal to all left-eigenvectors $y_m$. Since the set of left-eigenvectors spans the $q$-dimensional vector space, it would mean that the $q$-dimensional vector $x_n$ would be orthogonal to the whole $q$-space, which is impossible because $x_n$ is not the null vector. Second, any $q$-dimensional vector can be written in terms of either the left- or right-eigenvectors.

4. Let us denote by $X$ the matrix with in column $m$ the right-eigenvector $x_m$ and by $Y^T$ the matrix with in row $n$ the left-eigenvector $y_n^T$. If the right- and left-eigenvectors are scaled such that, for all $1 \le n \le q$, $y_n^T x_n = 1$, then

$$Y^T X = L \qquad (A.13)$$

or, the matrix $Y^T$ is the inverse of the matrix $X$. Furthermore, for any right-eigenvector, (A.1) holds, rewritten in matrix form, that

$$D X = X \, \mathrm{diag}(\lambda_n) \qquad (A.14)$$

Left-multiplying by $X^{-1} = Y^T$ yields the similarity transform of matrix $D$,

$$X^{-1} D X = Y^T D X = \mathrm{diag}(\lambda_n) \qquad (A.15)$$
Thus, when the eigenvalues of $D$ are distinct, there exists a similarity transform $K^{-1} D K$ that reduces $D$ to diagonal form. In many applications, similarity transforms are applied to simplify matrix problems. Observe that a similarity transform preserves the eigenvalues, because, if $Dx = \lambda x$, then $\lambda K^{-1} x = K^{-1} D x = (K^{-1} D K) K^{-1} x$. The eigenvectors are transformed to $K^{-1} x$. When $D$ has multiple eigenvalues, it may be impossible to reduce $D$ to a diagonal form by similarity transforms. Instead of a diagonal form, the most compact form when $D$ has $u$ distinct eigenvalues each with multiplicity $p_m$ such that $\sum_{m=1}^{u} p_m = q$ is the Jordan canonical form $F$,

$$F = \begin{bmatrix} F_{p_1 - d}(\lambda_1) & & & & \\ & F_d(\lambda_1) & & & \\ & & \ddots & & \\ & & & F_{p_{u-1}}(\lambda_{u-1}) & \\ & & & & F_{p_u}(\lambda_u) \end{bmatrix}$$

where $F_p(\lambda)$ is a $p \times p$ submatrix of the form,

$$F_p(\lambda) = \begin{bmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & \lambda & 1 \\ 0 & \cdots & 0 & 0 & \lambda \end{bmatrix}$$

The number of independent eigenvectors is equal to the number of submatrices. If an eigenvalue $\lambda$ has multiplicity $p$, there can be one large submatrix $F_p(\lambda)$, but also a number $n$ of smaller submatrices $F_{e_m}(\lambda)$ such that $\sum_{m=1}^{n} e_m = p$. This illustrates, as mentioned in art. 1, the much higher complexity of the eigenproblem in case of multiple eigenvalues. For more details we refer to Wilkinson (1965).

5. The companion matrix of the characteristic polynomial (A.3) of $D$ is defined as
9 9 9 F=9 9 7
(1)q31 fq31 (1)q31 fq32 1 0 0 1 .. .. . . 0 0
··· ··· ··· .. . ···
(1)q31 f1 (1)q31 f0 0 0 0 0 .. .. . . 1 0
6 : : : : : 8
Expanding det (F L) in cofactors of the first row yields det (F L) = f (). If D has distinct eigenvalues, D as well as F are similar to diag(l ). It has been shown that the similarity transform K for D equals K = [. The similarity transform for F is the Vandermonde matrix Y (), where 5
9 9 9 Y ({) = 9 9 9 7
{1q31 {2q31 · · · {1q32 {2q32 · · · .. .. . ··· . .. { . { 1
2
1
1
···
q31 {q31 q31 {q q32 q32 {q31 {q .. .. . .
{q31 1
{q 1
6 : : : : : : 8
The Vandermonde matrix Y () is clearly non-singular if all eigenvalues are
distinct. Furthermore,
while
5
9 9 9 Y ()diag (l ) = 9 9 9 7 5
9 9 9 FY () = 9 9 9 7
q1 q2 ··· q31 1 2q31 · · · .. .. . ··· . .. 2 2 . 1
2
1
2
···
(1)q1 f (1 ) + q1 1q1 .. .
(1)q1 f (2 ) + q2 q1 2 .. .
21 1
22 2
qq31 qq q31 q31 q31 q .. .. . . 2q31 q31 ··· ··· ··· .. . ···
2q q
6 : : : : : : 8
(1)q1 f (q ) + qq q1 q .. . 2q q
6 : : : : : : 8
Since $f(\lambda_m) = 0$, it follows that $F\,Y(\lambda) = Y(\lambda)\,\mathrm{diag}(\lambda_l)$, which demonstrates the claim. Hence, the eigenvector $x_n$ of $F$ belonging to eigenvalue $\lambda_n$ is

$$x_n^T = \begin{bmatrix} \lambda_n^{q-1} & \lambda_n^{q-2} & \cdots & \lambda_n & 1 \end{bmatrix}$$

6. When left-multiplying (A.1) by $D$, we obtain

$$D^2 x = \lambda D x = \lambda^2 x$$

or, in general for any integer $n \ge 0$,

$$D^n x = \lambda^n x \qquad (A.16)$$

Since any eigenvalue $\lambda$ satisfies its characteristic polynomial $f(\lambda) = 0$, we directly find from (A.16) that the matrix $D$ satisfies its own characteristic equation,

$$f(D) = 0 \qquad (A.17)$$

This result is the Cayley-Hamilton theorem. There exist several other proofs of the Cayley-Hamilton theorem.
p X
In n
n=0
where all In are q × q matrices and Ip 6= R. Any matrix polynomial I () can be right and left divided by another (non-zero) matrix polynomial E() in a unique way as proved in Gantmacher (1959a, Chapter IV). Hence the
left-quotient and left-remainder I () = E()TO () + O() and the rightquotient and right-remainder I () = TU ()E() + U() are unique. Let us concentrate on the right-remainder in the case where E() = L D is a linear polynomial in . Using Euclid’s division scheme for polynomials, p31
I () = Ip
p31
(L D) + (Ip D + Ip31 )
+
¤ £ = Ip p31 + (Ip D + Ip31 ) p32 (L D)
p32 X
In n
n=0
X ¢ p32 p33 ¡ 2 + Ip D + Ip31 D + Ip32 + In n n=0
and continued, we arrive at 5
I () = 7Ip p31 + · · · + n31 +
p X
p X
Im Dm3n + · · · +
p X m=1
m=n
Im Dm
6
Im Dm31 8 (L D)
m=0
In summary, I () = TU () (L D) + U() (and similarly for the leftquotient and left-remainder) with ´ ³P ³P ´ P Pp p p m3n n31 m3n I n31 T I D D () = TU () = p m m O m=n m=n P n=1 m Ppn=1 m U() = p I D = I (D) O() = D I m m=0 m m=0 (A.18) and where the right-remainder is independent of . The Generalized Bézout Theorem states that the polynomial I () is divisible by (L D) on the right (left) if and only if I (D) = R (O() = R). By the Generalized Bézout Theorem, the polynomial I () = j()L j(D) is divisible by (L D) because I (D) = j(D)L j(D) = R. If I () is an ordinary polynomial, the right- and left-quotient and remainder are equal. The Caley—Hamilton Theorem (A.17) states that f(D) = 0, which indicates that f()L = T() (L D) and also f()L = (L D) T(). The matrix T() = (L D)31 f() is called the adjoint matrix of D. Explicitly, from (A.18), 4 3 q q X X fm Dm3n D T() = n31 C n=1
m=n
Pq m3n . The main theand, with (A.6), T(0) = (D)31 det D = m=1 fm D oretical interest of the adjoint matrix stems from its definition f()L =
T() (L D) = (L D) T() in case = n is an eigenvalue of D. Then, (n L D) T(n ) = 0, which indicates by (A.1) that every non-zero column of the adjoint matrix T(n ) is an eigenvector belonging to the eigenvalue n . In addition, by dierentiation with respect to , we obtain f0 ()L = (L D) T0 () + T() This demonstrates that, if T(n ) 6= R, the eigenvalue n is a simple root of f() and, conversely, if T(n ) = R, the eigenvalue n has higher multiplicity. The adjoint matrix T() = (L D)31 f() is computed by observing is divisible without rethat, on the Generalized Bézout Theorem, f()3f() 3 mainder. By replacing in this polynomial and by L and D respectively, T() readily follows as illustrated in Section A.4.2. 8. Consider the arbitrary polynomial of degree o, j({) = j0
o Y
({ m )
m=1
Substitute { by D, then j(D) = j0
o Y
(D m L)
m=1
Since det (DE) = det D det E and det(nD) = n q det D, we have det(j(D)) = j0q
o Y
m=1
det(D m L) = j0q
o Y
f(m )
m=1
With (A.5), det(j(D)) = j0q
o Y q Y
m=1 n=1
=
q Y
(n m ) =
q Y
n=1
j0
o Y
m=1
(n m )
j (n )
n=1
If k({) = j({) , we arrive at the general result: For any polynomial j({), the eigenvalues values of j(D) are j (1 ) > = = = > j (q ) and the characteristic polynomial is q Y (j (n ) ) (A.19) det(j(D) L) = n=1
which is a polynomial in of degree at most q. Since the result holds for an
arbitrary polynomial, it should not surprise that, under appropriate conditions of convergence, it can be extended to infinite polynomials, in particular to the Taylor series of a complex function. As proved in Gantmacher (1959a, Chapter V), if the power series of a function i (}) around } = }0 i (}) =
" X m=1
im (}0 )(} }0 )m
(A.20)
P m converges for all } in the disc |}}0 | ? U, then i (D) = " m=1 im (}0 )(D}0 L) provided all eigenvalues of D lie with the region of convergence of (A.20), i.e. | }0 | ? U. For example, hD} = log D =
" X } n Dn n=0 " X n=1
n!
for all D
(1)n31 (D L)n for |n 1| ? 1, all 1 n q n
and, from (A.19), the eigenvalues of hD} are h}1 > = = = > h}1 . Hence, the knowledge of the eigenstructure of a matrix D allows us to compute any function of D (under the same convergence restrictions as complex numbers }).
A.2 Hermitian and real symmetric matrices
¡ ¢W = A Hermitian matrix D is a complex matrix that obeys DK = DW D, where dK = (dlm )W is the complex conjugate of dlm = Hermitian matrices possess a number of attractive properties. A particularly interesting subclass of Hermitian matrices are real, symmetric matrices that obey DW ¡= D.¢The W inner-product of vector | and { is defined as | K { and obeys | K { = ¡ K ¢K P | { = {K |. The inner-product {K { = qm=1 |{m |2 is real and positive for all vectors except for the null vector. 9. The eigenvalues of a Hermitian matrix are all real. Indeed, leftmultiplying (A.1) by {K yields {K D{ = {K { ¢K ¡ and, since {K D{ = {K DK { = {K D{, it follows that {K { = K {K { or = K because {K { is a positive real number. Furthermore, since D = DK , we have DK { = {
Taking the complex conjugate, yields DW {W = {W In general, the eigenvectors of a Hermitian matrix are complex, but real for a real symmetric matrix since DK = DW . Moreover, the left-eigenvector | W is the complex conjugate of the right-eigenvector {. Hence, the orthogonality relation (A.10) reduces, after normalization, to an inner-product {K n {m = nm
(A.21)
where nm is the Kronecker delta, which is zero if n 6= m and else nn = 1. Consequently, (A.13) reduces to [K [ = L which implies that the matrix [ formed by the eigenvectors is an unitary matrix ([ 31 = [ K ). For a real symmetric matrix D, the corresponding relation [ W [ = L implies that [ is an orthogonal matrix ([ 31 = [ W ). Although the arguments so far (see Section A.1) have assumed that the eigenvalues of D are distinct, the theorem applies in general (as proved in Wilkinson (1965, Section 47)): For any Hermitian matrix D, there exists a unitary matrix X such that X K DX = diag (m )
real m
and for any real symmetric matrix D, there exists an orthogonal matrix X such that X W DX = diag (m )
real m
10. To a real symmetric matrix D, a bilinear form {W D| is associated, which is a scalar defined as q X q X W W { D| = {D| = dlm {l |m l=1 m=1
We call a bilinear form a quadratic form if | = {. A necessary and su!cient condition for a quadratic form to be positive definite, i.e. {W D{ A 0 for all { 6= 0, is that all eigenvalues of D should be positive. Indeed, art. 9 shows the existence of an orthogonal matrix X that transforms D to a diagonal form. Let { = X }, then {W D{ = } W X W DX } =
q X n=1
n }n2
(A.22)
which is only positive for all }n provided n A 0 for all n. From (A.6), a positive definite quadratic form {W D{ possesses a positive determinant, det D A 0. This analysis shows that the problem of determining an orthogonal matrix X (or the eigenvectors of D) is equivalent to the geometrical problem of determining the principal axes of the hyper-ellipsoid q X q X
dlm {l |m = 1
l=1 m=1
Relation (A.22) illustrates that the eigenvalues n are the squares of the principal axis. A multiple eigenvalue refers to an indeterminacy of the principal axes. For example if q = 3, an ellipsoid with two equal principal axis means that any section along the third axis is a circle. Any two perpendicular diameters of the largest circle orthogonal to the third axis are principal axis of that ellipsoid.
A.3 Vector and matrix norms Vector and matrix norms, denoted by k{k and kDk respectively, provide a single number reflecting a “size” of the vector or matrix and may be regarded as an extension of the concept of the modulus of a complex number. A norm is a certain function of the vector components or matrix elements. All norms, vector as well as matrix norms, satisfy the three distance relations (i) k{k A 0 unless { = 0 (ii) k{k = || k{k for any complex number (iii) k{ + |k k{k + k|k In general, the Hölder t-norm of a vector { is defined as 3 41@t q X |{m |t D k{k = C t
(A.23)
m=1
For example, the well-known Euclidean norm or length of the vector { is found for t = 2 and k{k22 = {K {. In probability theory where { denotes P a discrete pdf, the law of total probability states that k{k1 = qm=1 {m = 1 and we will write k{k1 = k{k. Finally, max |{m | = limt t A 1, ¯ K ¯ ¯{ | ¯ k{k k|k s t
1 s
+ 1t = 1 and (A.24)
A special case of the Hölder inequality where s = t = 2 is the CauchySchwarz inequality ¯ K ¯ ¯{ | ¯ k{k k|k (A.25) 2 2
The t = 2 norm is invariant under an unitary (hence also orthogonal) transformation X , where X K X = L, because kX {k22 = {K X K X { = {K { = k{k2 . An p other example s of a non-homogeneous vector norm is the quadratic form k{kD = {W D{ provided D is positive definite. Relation (A.22) shows that, if not all eigenvalues m of D are the same, not all p components of the vector { are weighted similarly and, thus, in general, k{kD is a non-homogeneous norm. The quadratic form k{kL equals the homogeneous Euclidean norm k{k22 .
A.3.1 Properties of norms All norms are equivalent in the sense that there exist positive real numbers f1 and f2 such that, for all {, f1 k{ks k{kt f2 k{ks For example, k{k2 k{k1
s q k{k2
k{k" k{k1 q k{k" s k{k" k{k2 q k{k" By choosing in the Hölder inequality (5.15) s = t = 1, {m $ m {vm for real v A 0 and |m $ m A 0, we obtain with 0 ? ? 1 an inequality for the weighted t-norm à Pq
m=1 m |{m | Pq m=1 m
v
!1
v
à Pq
v m=1 m |{m | P q m=1 m
!1 v
For m = 1, the weights m disappear such that the inequality for the Hölder t-norm becomes 1 1
k{kv k{kv q v ( 31)
A.3 Vector and matrix norms
where q
1 1 ( 31) v
³P q
447
1. On the other hand, with 0 ? ? 1 and for real v A 0, ´ 1v
3 4 1v 3 41 ¶ 1 v q q µ v v X X k{kv |{ |{ | | m m C D =C D Pq = P 1 = 1 Pq v q k{kv |{ | v ) n n=1 ( n=1 |{n |v ) v ( |{ | m=1 m=1 n n=1 v m=1 |{m |
Since | =
|{ |v Sq m v n=1 |{n |
1 and
1
1
A 1, it holds that | | and
4 1v à P 41 3 3 !1 ¶ 1 v q q q µ v v X X |{m |v v |{ | | |{ m=1 m m D = Pq D C C Pq Pq =1 v v v n=1 |{n | n=1 |{n | n=1 |{n | m=1 m=1 1 1
which leads to the opposite inequality (without normalization as q v ( 31) ), k{kv k{kv In summary, if s A t A 0, then the general inequality for Hölder t-norm is 1
1
k{ks k{kt k{ks q t 3 s
(A.26)
For p × q matrices D, the most frequently used norms are the Euclidean or Frobenius norm 3 41@2 q p X X |dlm |2 D (A.27) kDkI = C l=1 m=1
and the t-norm
kDkt = sup
kD{kt
(A.28) k{kt ° ° ° { ° = °D k{k ° , which shows that
{6=0
On the second distance relation,
kD{kt k{kt
t
t
kDkt = sup kD{kt
(A.29)
k{kt =1
Furthermore, the matrix t-norm (A.28) implies that kD{kt kDkt k{kt
(A.30)
Since the vector norm is a continuous function of the vector components and since the domain k{kt = 1 is closed, there must exist a vector { for which equality kD{kt = kDkt k{kt holds. Since the n-th vector component of D{ P is (D{)l = qm=1 dlm {m , it follows from (A.23) that ¯t 41@t ¯ 3 ¯ p ¯X X ¯ ¯ q ¯ D ¯ C kD{kt = d { lm m ¯ ¯ ¯ ¯ l=1 m=1
For example, for all $x$ with $\|x\|_1 = 1$, we have that

$$\|Dx\|_1 = \sum_{l=1}^{p} \Big| \sum_{m=1}^{q} d_{lm} x_m \Big| \le \sum_{l=1}^{p} \sum_{m=1}^{q} |d_{lm}| \, |x_m| = \sum_{m=1}^{q} |x_m| \sum_{l=1}^{p} |d_{lm}| \le \Big( \sum_{m=1}^{q} |x_m| \Big) \max_m \sum_{l=1}^{p} |d_{lm}| = \max_m \sum_{l=1}^{p} |d_{lm}|$$

Clearly, there exists a vector $x$ for which equality holds, namely, if $n$ is the column in $D$ with maximum absolute sum, then $x = h_n$, the $n$-th basis vector with all components zero, except for the $n$-th one, which is 1. Similarly, for all $x$ with $\|x\|_\infty = 1$,

$$\|Dx\|_\infty = \max_l \Big| \sum_{m=1}^{q} d_{lm} x_m \Big| \le \max_l \sum_{m=1}^{q} |d_{lm}| \, |x_m| \le \max_l \sum_{m=1}^{q} |d_{lm}|$$

Again, if $u$ is the row with maximum absolute sum and $x_m = 1 \cdot \mathrm{sign}(d_{um})$ such that $\|x\|_\infty = 1$, then $(Dx)_u = \sum_{m=1}^{q} |d_{um}| = \max_l \sum_{m=1}^{q} |d_{lm}| = \|Dx\|_\infty$. Hence, we have proved that

$$\|D\|_\infty = \max_l \sum_{m=1}^{q} |d_{lm}| \qquad (A.31)$$

$$\|D\|_1 = \max_m \sum_{l=1}^{p} |d_{lm}| \qquad (A.32)$$

from which

$$\|D^H\|_1 = \|D\|_\infty$$
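The two closed forms (A.31) and (A.32) are easy to check on a small example. The sketch below uses plain Python lists; the helper names and the example matrix are illustrative choices, not taken from the text. It also evaluates the two extremal vectors used in the argument above, a basis vector for the 1-norm and a sign vector for the $\infty$-norm.

```python
def norm_1(D):
    """Matrix 1-norm: maximum absolute column sum, as in (A.32)."""
    return max(sum(abs(row[m]) for row in D) for m in range(len(D[0])))


def norm_inf(D):
    """Matrix infinity-norm: maximum absolute row sum, as in (A.31)."""
    return max(sum(abs(v) for v in row) for row in D)


def matvec(D, x):
    return [sum(d * xi for d, xi in zip(row, x)) for row in D]


def transpose(D):
    return [list(col) for col in zip(*D)]


def vec_norm_1(x):
    return sum(abs(v) for v in x)


def vec_norm_inf(x):
    return max(abs(v) for v in x)


D = [[1, -2, 3],
     [-4, 5, -6]]

# Column 3 has the largest absolute sum, so the third basis vector attains the 1-norm
attained_1 = vec_norm_1(matvec(D, [0, 0, 1]))
# Row 2 has the largest absolute sum, so its sign vector attains the infinity-norm
attained_inf = vec_norm_inf(matvec(D, [-1, 1, -1]))
```

For a real matrix the Hermitian transpose is the ordinary transpose, so `norm_1(transpose(D))` equals `norm_inf(D)` in this example.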
The t = 2 matrix norm, kD{k2 > is obtained dierently. Consider kD{k22 = (D{)K D{ = {K DK D{ Since DK D is a Hermitian matrix, art. 9 shows that all eigenvalues are real and non-negative because a norm kD{k22 0. These ordered eigenvalues are denoted as 12 22 · · · q2 0. Applying the theorem in art. 9, there exists a unitary matrix X such that { = X } yields ¡ ¢ {K DK D{ = } K X K DK DX } = } K diag m2 } 12 } K } = 12 k}k22 Since the t = 2 norm is invariant under a unitary (orthogonal) transform k{k2 = k}k2 , by the definition (A.28), kDk2 = sup {6=0
kD{k2 = 1 k{k2
(A.33)
where the supremum is achieved if { is the eigenvector of DK D belonging to 12 . Meyer (2000, p. 279) proves the corresponding result for the minimum eigenvalue provided that D is non-singular, ° 31 ° °D ° = 2
1 = q31 min kD{k2
k{k2 =1
The non-negative quantity m is called the m-th singular value and 1 is the largest singular value of D. The importance of this result lies in an extension of the eigenvalue problem to non-square matrices which is called the singular value decomposition. A detailed discussion is found in Golub and Loan (1983). If D has real eigenvalues 1 2 · · · q , the above can be simplified and we obtain {W D{ {W {
(A.34)
{W D{ {6=0 {W {
(A.35)
1 = sup {6=0
q = inf
because, for any {, it holds that q {W¡{ {W¢ D{ 1 {W {. The Frobenius norm kDk2I = trace DK D . With (A.7) and the analysis of DK D above, kDk2I =
q X
n2
(A.36)
n=1
In view of (A.33), the bounds kDk2 kDkI
s q kDk2 may be attained.
A.3.2 Applications of norms ° ° ° n ° ° n31 ° (a) Since °D ° = °DD ° kDk °Dn31 °, by induction, we have for any integer n, that ° ° ° n° n °D ° kDk and
lim Dn = 0 if kDk ? 1
n }1 A 0
which is always possible by suitable renumbering of the states and ¸ S11 S12 S = S21 S22
A.4 Stochastic matrices
The relation } = (S + L)y is written as ¸ ¸ ¸ y1 }1 S11 y1 + = 0 0 S21 y1 Since S is irreducible, S21 6= R, such that y1 A 0 implies that S21 y1 6= 0, which proves the lemma. ¤ Observe, in addition, that all components of } are never smaller than those of y. Also, transposing does not alter the result. Theorem A.4.2 (Frobenius) The modulus of all eigenvalues of an irreducible stochastic matrix S are less than or equal to 1. There is only one real eigenvalue = 1 and the corresponding eigenvector has positive components. Proof: The t = 4 norm (A.31) of a probability matrix S with q states defined by (9.7) subject to (9.8) precisely equals kS k" = 1. From (A.37), it follows that all eigenvalues are, in absolute value, smaller than or equal to 1. Since all elements Slm 5 [0> 1] and because an irreducible matrix has no zero element rows, y W S has positive components if yW has positive components. (yW S ) Thus, there always exists a scalar 0 ? y = min1$n$q (yW ) n , such that n
y y W y W S . By Lemma A.4.1, we can always transform the vector y to a vector } by right-multiplying both sides with (L + S ) such that y y W (L + S ) yW S (L + S ) y } W } W S
and, by definition of y , y } since the components of } are never smaller than those of y. Hence, for any arbitrary vector y with positive components, the transform in Lemma A.4.1 leads to an increasing set y } · · · , which is bounded by 1 because no eigenvalue can exceed 1. This shows that = 1 is the largest eigenvalue and the corresponding eigenvector | W has positive components. This eigenvector | W is unique. For, if there were another linearly independent eigenvector zW corresponding to the eigenvalue = 1, any linear combination } W = | W + zW is also an eigenvector belonging to = 1. But and can always be chosen to produce a zero component which the transform method shows to be impossible. The fact that the eigenvector | W is the only eigenvector belonging = 1, implies that the eigenvalue = 1 is a single zero of the characteristic polynomial of S . ¤ The theorem proved for stochastic matrices is a special case of the famous Frobenius theorem for non-negative matrices (see for a proof, e.g. Gant-
macher (1959b, Chapter XIII)). We note that, in the theory of Markov chains, the interest lies in the determination of the left-eigenvector | W = belonging to = 1, because the right-eigenvector { of S belonging to = 1 equals xW = [1 1 · · · 1], where is a scalar, because of the constraints (9.8). Recall (A.10) and (A.13), the proper normalization, | W x = 1, precisely corresponds to the total law of probability. Using the interpretation of Markov chains, an alternative argument is possible. If all eigenvalues were || ? 1, application (c) in Section A.3.2 indicates that the steady-state would be non-existent because S n $ 0 for n $ 4. Since this is impossible, there must be at least one eigenvalue with || = 1. Furthermore, (9.22) shows that at least one eigenvalue corresponding to the steady-state is real and precisely 1. Corollary A.4.3 An irreducible probability matrix S cannot have two linearly independent eigenvectors with positive components. Proof: Consider, apart from |W = belonging to = 1, another eigenvector zW belonging to the eigenvalue $ 6= 1. On art. 3, zW | = 0, which is ¤ only possible if not all components of zW are positive.
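The corollary is important because no other eigenvector of $S$ than the left-eigenvector $\pi$ belonging to $\lambda = 1$ can represent a (discrete) probability density. Since the null vector is never an eigenvector, the corollary implies that at least one component in the other eigenvectors must be negative. Since the characteristic polynomial of $S$ has real coefficients (because $S_{lm}$ is real), the eigenvalues occur in complex conjugate pairs. Since $\lambda = 1$ is an eigenvalue, for an even number of states $q$, there must be at least another real eigenvalue obeying $-1 \le \lambda < 1$. It has been proved that the boundary of the locations of the eigenvalues inside the unit disc consists of a finite number of points on the unit circle joined by certain curvilinear arcs. There exists an interesting property of a rank-one update $\bar{S}$ of a stochastic matrix $S$. The lemma is of a general nature and also applies to reducible Markov chains with several eigenvalues $\lambda_m = 1$ for $1 < m \le n$.

Lemma A.4.4 If $\{1, \lambda_2, \lambda_3, \ldots, \lambda_q\}$ are the eigenvalues of the stochastic matrix $S$, then the eigenvalues of $\bar{S} = \alpha S + (1-\alpha) x y^T$, where $y^T$ is any probability vector, are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_q\}$.

Proof: We start from the eigenvalue equation (A.2),

$$\det(\bar{S} - \lambda L) = \det\big(\alpha S - \lambda L + (1-\alpha) x y^T\big) = \det\Big( (\alpha S - \lambda L) \big( L + (\alpha S - \lambda L)^{-1} (1-\alpha) x y^T \big) \Big) = \det(\alpha S - \lambda L) \, \det\big( L + (1-\alpha)(\alpha S - \lambda L)^{-1} x y^T \big)$$

Applying the formula

$$\det(L + f g^T) = 1 + g^T f \qquad (A.38)$$

which follows, after taking the determinant, from the matrix identity

$$\begin{pmatrix} L & 0 \\ g^T & 1 \end{pmatrix} \begin{pmatrix} L + f g^T & f \\ 0 & 1 \end{pmatrix} \begin{pmatrix} L & 0 \\ -g^T & 1 \end{pmatrix} = \begin{pmatrix} L & f \\ 0 & 1 + g^T f \end{pmatrix}$$

gives

$$\det(\bar{S} - \lambda L) = \det(\alpha S - \lambda L) \Big( 1 + y^T (1-\alpha)(\alpha S - \lambda L)^{-1} x \Big)$$

Since the row sum of a stochastic matrix $S$ is 1, we have that $Sx = x$ and, thus, $(\alpha S - \lambda L) x = (\alpha - \lambda) x$ from which $(\alpha S - \lambda L)^{-1} x = (\alpha - \lambda)^{-1} x$. Using this result leads to

$$1 + y^T (1-\alpha)(\alpha S - \lambda L)^{-1} x = 1 + \frac{1-\alpha}{\alpha - \lambda} y^T x = 1 + \frac{1-\alpha}{\alpha - \lambda} = \frac{1-\lambda}{\alpha - \lambda}$$

because a probability vector is normalized to 1, i.e. $y^T x = 1$. Hence, we end up with

$$\det(\bar{S} - \lambda L) = \frac{1-\lambda}{\alpha - \lambda} \det(\alpha S - \lambda L)$$

Invoking (A.19) yields

$$\det(\bar{S} - \lambda L) = \frac{1-\lambda}{\alpha - \lambda} \prod_{n=1}^{q} (\alpha\lambda_n - \lambda) = (1 - \lambda) \prod_{n=2}^{q} (\alpha\lambda_n - \lambda)$$

which shows that the eigenvalues of $\bar{S}$ are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_q\}$. $\square$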
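A small exact check of Lemma A.4.4 can be carried out without any eigenvalue solver. The sketch below assumes a $3 \times 3$ upper-triangular stochastic matrix, so that its eigenvalues can be read off the diagonal, and evaluates $\det(\bar{S} - \lambda L)$ at the eigenvalues predicted by the lemma. The matrix, the scalar `alpha`, the probability vector `y` and all function names are illustrative assumptions.

```python
from fractions import Fraction as F


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def rank_one_update(S, alpha, y):
    """alpha*S + (1-alpha)*x*y^T, with x the all-one vector."""
    n = len(S)
    return [[alpha * S[r][c] + (1 - alpha) * y[c] for c in range(n)] for r in range(n)]


def char_poly_at(M, lam):
    """det(M - lam*L) for a 3x3 matrix M, with L the identity."""
    n = len(M)
    return det3([[M[r][c] - (lam if r == c else 0) for c in range(n)] for r in range(n)])


# Upper-triangular stochastic matrix: eigenvalues 1/2, 1/3 and 1 (the diagonal)
S = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(0), F(1, 3), F(2, 3)],
     [F(0), F(0), F(1)]]
alpha = F(17, 20)
y = [F(1, 5), F(3, 10), F(1, 2)]          # any probability vector

S_bar = rank_one_update(S, alpha, y)
predicted = [F(1), alpha * F(1, 2), alpha * F(1, 3)]
residuals = [char_poly_at(S_bar, lam) for lam in predicted]
```

Since the three predicted values are distinct and the characteristic polynomial has degree 3, vanishing residuals identify them as the complete spectrum of the updated matrix.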
A similar property may occur in a special case where a Markov chain is supplemented by an additional state $n+1$ which connects to every other state and to which every other state is connected (such that $\bar{P}$ is irreducible). Then,
$$\bar{P} = \begin{bmatrix} \alpha P & (1-\alpha) u \\ v^T & 0 \end{bmatrix}$$
with corresponding eigenvalues $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n, \alpha - 1\}$. This result is similarly proved as Lemma A.4.4 using (Meyer, 2000, p. 475)
$$\det\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det A \,\det\left(D - C A^{-1} B\right) \qquad \text{(A.39)}$$
provided $A^{-1}$ exists unless $C = 0$.
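A.4.2 Example: the two-state Markov chain

The two-state Markov chain is defined by
$$P = \begin{bmatrix} 1-p & p \\ q & 1-q \end{bmatrix}$$
Observe that $\det P = 1 - p - q$. The eigenvalues of $P$ satisfy the characteristic polynomial $c(\lambda) = \lambda^2 - (2 - p - q)\lambda + \det P = 0$, from which $\lambda_1 = 1$ and $\lambda_2 = 1 - p - q = \det P$. The adjoint matrix $Q(\lambda)$ is computed (art. 7) via the polynomial
$$\frac{c(\lambda) - c(\mu)}{\lambda - \mu} = \lambda + \mu - (2 - p - q)$$
and after $\lambda \to \lambda I$ and $\mu \to P$,
$$Q(\lambda) = \lambda I + P - (2 - p - q) I = \begin{bmatrix} \lambda - 1 + q & p \\ q & \lambda - 1 + p \end{bmatrix}$$
The (unscaled) right- (left-) eigenvectors of $P$ follow as the non-zero columns (rows) of $Q(\lambda)$. For $\lambda_1 = 1$, we find $x_1 = (1, 1)$ and $y_1^T = (q, p)$. For $\lambda_2 = 1 - p - q$, the eigenvector $x_2 = (p, -q)$ and $y_2^T = (1, -1)$. Normalization (art. 4) requires that $y_k^T x_k = 1$ or $x_1 = \frac{1}{p+q}(1, 1)$ and $x_2 = \frac{1}{p+q}(p, -q)$. If the eigenvalues are distinct ($p + q \neq 0$), the matrix $P$ can be written as (art. 4) $P = X \,\mathrm{diag}(\lambda_k)\, Y^T$,
$$P = \frac{1}{p+q}\begin{bmatrix} 1 & p \\ 1 & -q \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1-p-q \end{bmatrix}\begin{bmatrix} q & p \\ 1 & -1 \end{bmatrix}$$
from which any power $P^k$ is immediate as
$$P^k = \frac{1}{p+q}\begin{bmatrix} 1 & p \\ 1 & -q \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & (1-p-q)^k \end{bmatrix}\begin{bmatrix} q & p \\ 1 & -1 \end{bmatrix} = \frac{1}{p+q}\begin{bmatrix} q & p \\ q & p \end{bmatrix} + \frac{(1-p-q)^k}{p+q}\begin{bmatrix} p & -p \\ -q & q \end{bmatrix} \qquad \text{(A.40)}$$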
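The closed form (A.40) can be compared against repeated matrix multiplication. The sketch below uses exact fractions; the values of `p` and `q` are arbitrary illustrative choices and the function names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def power_direct(P, k):
    """P^k by k successive multiplications."""
    R = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for _ in range(k):
        R = matmul(R, P)
    return R

def power_closed_form(p, q, k):
    """P^k from (A.40): steady-state part plus a geometrically decaying part."""
    s = p + q
    d = (1 - p - q) ** k
    return [[(q + d * p) / s, (p - d * p) / s],
            [(q - d * q) / s, (p + d * q) / s]]

p, q = Fraction(1, 3), Fraction(1, 4)      # assumed example values
P = [[1 - p, p], [q, 1 - q]]
```

Since $|1 - p - q| < 1$ for $0 < p + q < 2$, the second term in (A.40) vanishes for large $k$ and each row of $P^k$ tends to $\left(\frac{q}{p+q}, \frac{p}{p+q}\right)$.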
The steady-state matrix S " = limn Q ] to is transformed with {m = 0 if m 5 ¢ ¡ ¢ ¡ Q } J (}) {Q } Q } 2 J0 (}) Q {Q } Q31 ( + 1 ) }J0 (}) [Q f]J (}) + J0 (}) = 0
from which the logarithmic derivative is J0 (}) Q } + f Q = 2 J (}) } + ( + 1 ) } 1 The integration of the right-hand side requires a partial faction decomposition, } 2
Q } + f Q f1 f2 = + + ( + 1 ) } 1 } u1 } u2
where u1 and u2 are the roots of the quadratic polynomial } 2 +( + 1 ) } 1 and f1 and f2 are the residues computed for n = 1> 2 as fn = lim
} (Q@2)2 , ¡ ¢ (n) = 16| 2 + 4 (1 + ) f2 (1 + ) 2fQ + Q 2 | which shows that (n) is concave in | because
g2 {(n) = 32 ? 0, for g| 2 2 Q (f(1 + ) Q )2 A 0
| = 0, (Q@2) = 0 and, for | = (Q@2)2 , (0) = and, hence, (n) 0 for n 5 [0> Q@2]. This means that, for 0 n ? Q@2, the roots 1 (n) and 2 (n) are real and distinct and, for n = Q@2 (only if Q is even) where (Q@2) = 0, 1 (Q@2) = 2 (Q@2) =
1+ E (Q@2) = 2D (Q@2) 1 2 Qf
For ? n Q@2, the roots {1 ()> 2 ()} are dierent from the roots {1 (n)> 2 (n)} because D(n) } 2 + E(n) } + F(n) ? D() } 2 + E() } + F() for all }. Indeed, D() D(n) = (Q@2 )2 (Q@2 n)2 A 0 and the discriminant (E() E (n))2 4 (D() D(n)) (F() F(n)) ? 0 shows that there are no real solutions. Thus, an extreme eigenvalue occurs for n = 0 for which F (0) = 0 such that 1 (0) = 0 and 2 (0) =
1 + Q E (0) f = D (0) 1 Qf
(A.47)
Q ? 1 and f ? Q shows that 2 (0) ? 0, The stability requirement = f(1+) and thus 2 (0) is the largest negative eigenvalue. The eigenvalues for other 0 ? n Q@2 are either larger than 0 or smaller than 2 (0). We need to consider two dierent cases (a) f ? Q@2 and (b) f A Q@2 while F (n) ? 0 for all n 5 [0> Q).
(a) If f ? Q@2 and if 0 n ? f and , then D (n) A 0. Hence, the product
1 (n)2 (n) = F(n) D(n) ? 0 which means that 1 (n) A 0 A 2 (n) and that there are precisely [f] positive eigenvalues. Similarly, D (n) ? 0 for f ? n ? Q@2, such that 1 (n)2 (n) A 0 while 1 (n) + 2 (n) = E(n) D(n) ? 0 shows that both eigenvalues are negative because E (n) ? 0. Indeed, if 1 and f ? Q@2, the above expression immediately leads to E (n) ? 0 while if ? 1 and f ? Q@2, the expression "µ ¶2 µ ¶2 # ¶ ¸ µ Q Q Q Q n f f +1 2f E(n) = 2(1 ) 2 2 2 f shows that both terms are negative. (b) If f A Q@2, we see that D (n) A 0 for 0 ? n ? Q f leading to 1 (n) A 0 A 2 (n). For Q f ? n ? Q@2, we have D (n) ? 0 and thus 1 (n)2 (n) = F(n) D(n) A 0 while their same sign follows from 1 (n) + 2 (n) = E(n) D(n) requires us to consider the sign of E (n). If 1, then E (n) A 0. If A 1, then ¶ µ Q + 2(1 ) (Q@2 n)2 E(n) = Q (1 + ) f 2 ¶ µ ¶ µ Q 2 Q + 2(1 ) f ? Q (1 + ) f 2 2 ¶ ¶µ µ (Q f) Q +1 A0 = 2f f 2 f which shows that 0 ? 2 (n) ? 1 (n). Hence, there are Q [f] + 2(Q@2 Q + [f]) = [f] positive eigenvalues. In summary, there are [f] positive eigenvalues, one 1 (0) = 0 and Q [f] negative eigenvalues. Relabel the eigenvalues as (n > Q3n ) = (1 (n)> 2 (n)) in increasing order Q3[f]31 ? · · · ? 1 ? 0 ? Q = 0 ? Q31 ? · · · ? Q3[f] .This way of writing distinguishes between underload and overload eigenvalues. In terms of the discriminant by (n) = E 2 (n) 4D (n) F (n), the non-positive eigenvalues are (a) If f ? Q@2, s
3E(n)3 {(n) 2D(n) s 3E(n)~ {(n) = 2D(n)
1 (n) = 1>2 (n)
0 n [f] [f] + 1 n
Q 2
(b) If f A Q@2, 1 (n) =
s
3E(n)3 {(n) 2D(n)
0 n Q [f] 1
The eigenvector belonging to m follows from (A.45) where u1 and u2 are given in (A.42) and n is determined from (A.43) since n = f1 . The eigenvectors for 1 (n) and 2 (n) belonging to a same quadratic n must be dierent. Especially in this case, the corresponding n = f1 values can be determined from (A.43). For example, for Q = 0, we find u1 = 1, u2 = 1 and n = 0 and the eigenvector belonging to Q is with (A.45), µ ¶ µ ¶ m Q 3m Q 3m Q3m Q u1 u2 (A.48) = {m (0) = (1) m m Q After renormalization such that k{(0)k1 = 1, i.e. by dividing each com¡Q ¢ m P (1+)Q 1 PQ ponent by Q , the steady-state vector m=0 {m (0) = Q m=0 m = Q (14.52) is obtained. Similarly, for the largest negative eigenvalue 0 in (A.47), we find with u1 = 1 Qf , u2 = Q131 and n = f1 = Q such that (f ) ¶Q3m µ ¶ µ ¶µ Q Q Q3m Q Q 3m 0 1 u1 u2 = (A.49) {m (Q ) = (1) f m m The left-eigenvectors | satisfy (A.9): | W G31 T = | W . The above approach is applicable. However, there is a more elegant method based on the observation that there exists a diagonal matrix q Z = gldj (Z0 > = = = > ZQ ) for ¡ 31 ¢W ¡Q ¢ 31 31 TZ is m which Z TZ = Z TZ , namely Zm = m . Since Z symmetric, the left- and right-eigenvectors corresponding to the same eigenvalue are the same (Section A.2, art. 9). Now |W G31 T = | W is equivalent to ¡ ¢31 ¡ 31 ¢ | W Z = |W Z Z 31 G31 Z Z 31 TZ = | W Z Z 31 GZ Z TZ W = |W Z , G 31 GZ and T 31 TZ = TW , we obtain With |Z Z = Z Z = Z Z 31 31 W W |Z GZ TZ = |Z . The transpose |Z = TZ GZ |Z is
Z 2 | = TG31 Z 2 | which shows compared to G31 T{ = { that { = Z 2 | or, the vector components are, for 0 m Q , µ ¶ Q m {m = |m (A.50) m A.5.2.3 General tri-diagonal matrices Since tri-diagonal matrices of the form (11.1) frequently occur in Markov theory, we devote this section to illustrate how far the eigen-analysis can be
pushed. For an eigenpair (the right-eigenvector $x$ belonging to eigenvalue $\lambda$), the components in $(P - \lambda I)x = 0$ satisfy
$$(r_0 - \lambda)\, x_0 + p_0 x_1 = 0$$
$$q_j x_{j-1} + (r_j - \lambda)\, x_j + p_j x_{j+1} = 0 \qquad 1 \le j < N$$
$$q_N x_{N-1} + (r_N - \lambda)\, x_N = 0$$
If $p_j = p$ and $q_j = q$, the matrix $P$ reduces to a Toeplitz form for which the eigenvalues and eigenvectors can be explicitly written, as shown in Appendix A.5.2.1. Here, we consider the general case and show how orthogonal polynomials enter the scene. Using $r_j = 1 - q_j - p_j$, $r_0 = 1 - p_0$ and $r_N = 1 - q_N$, the set becomes, with $\xi = \lambda - 1$,
$$x_1(\xi) = \frac{p_0 + \xi}{p_0}\, x_0(\xi)$$
$$x_{j+1}(\xi) = \frac{p_j + q_j + \xi}{p_j}\, x_j(\xi) - \frac{q_j}{p_j}\, x_{j-1}(\xi) \qquad 1 \le j < N \qquad \text{(A.51)}$$
$$x_N(\xi) = \frac{q_N}{q_N + \xi}\, x_{N-1}(\xi)$$
The dependence on the eigenvalue $\xi$ is made explicit. Solving (A.51) iteratively for $j < N$,
$$x_2(\xi) = \frac{x_0(\xi)}{p_0 p_1}\left(\xi^2 + (q_1 + p_1 + p_0)\,\xi + p_1 p_0\right)$$
$$x_3(\xi) = \frac{x_0(\xi)}{p_2 p_1 p_0}\left(\xi^3 + (q_1 + q_2 + p_2 + p_1 + p_0)\,\xi^2 + (q_2 q_1 + q_2 p_0 + p_2 q_1 + p_2 p_1 + p_2 p_0 + p_1 p_0)\,\xi + p_2 p_1 p_0\right)$$
reveals a polynomial of degree $j$ in the eigenvalue $\xi = \lambda - 1$. By inspection, the general form of $x_j(\xi)$ for $j < N$ is
$$x_j(\xi) = \frac{x_0(\xi)}{\prod_{m=0}^{j-1} p_m}\sum_{k=0}^{j} c_k(j)\, \xi^k \qquad \text{(A.52)}$$
with
$$c_j(j) = 1, \qquad c_{j-1}(j) = \sum_{m=0}^{j-1}(p_m + q_m), \qquad c_0(j) = \prod_{m=0}^{j-1} p_m$$
where $q_0 = p_N = 0$. By substituting (A.52) into (A.51),
$$\sum_{k=1}^{j-1} c_k(j+1)\, \xi^k = \sum_{k=1}^{j-1}\left[(q_j + p_j)\, c_k(j) - q_j p_{j-1}\, c_k(j-1) + c_{k-1}(j)\right]\xi^k$$
and equating the corresponding powers in $\xi$, a recursion relation for the coefficients $c_k(j)$ ($0 \le k < N$) is obtained with $c_j(j) = 1$,
$$c_k(j+1) = (q_j + p_j)\, c_k(j) - q_j p_{j-1}\, c_k(j-1) + c_{k-1}(j)$$
from which all coefficients can be determined. Finally, for $j = N$, the explicit form of $x_N(\xi)$ follows from (A.51) as
$$x_N(\xi) = \frac{q_N}{q_N + \xi}\, x_{N-1}(\xi) = \frac{x_0(\xi)}{\prod_{m=0}^{N-2} p_m}\,\frac{q_N}{q_N + \xi}\sum_{k=0}^{N-1} c_k(N-1)\, \xi^k$$
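The recursion (A.51) and the polynomial form (A.52) can be compared numerically. The sketch below assumes $N \ge 2$, takes $x_0 = 1$, and uses an arbitrary illustrative set of birth and death probabilities `p` and `q` (with $q_0 = p_N = 0$); the function names are hypothetical.

```python
from fractions import Fraction as F

def components_by_recursion(p, q, xi):
    """x_0..x_N from (A.51) with x_0 = 1 (assumes N >= 2)."""
    N = len(p) - 1
    x = [F(1), (p[0] + xi) / p[0]]
    for j in range(1, N - 1):                       # yields x_2 .. x_{N-1}
        x.append(((p[j] + q[j] + xi) * x[j] - q[j] * x[j - 1]) / p[j])
    x.append(q[N] / (q[N] + xi) * x[N - 1])         # last equation of (A.51)
    return x

def coefficients(p, q, jmax):
    """c[j][k] = c_k(j) for j = 0..jmax via the recursion below (A.52)."""
    c = [[F(1)], [p[0], F(1)]]
    for j in range(1, jmax):
        prev, cur = c[j - 1], c[j]
        nxt = []
        for k in range(j + 2):
            term = F(0)
            if k <= j:
                term += (q[j] + p[j]) * cur[k]
            if k <= j - 1:
                term -= q[j] * p[j - 1] * prev[k]
            if k >= 1:
                term += cur[k - 1]
            nxt.append(term)
        c.append(nxt)
    return c

def component_closed_form(p, c, j, xi):
    """x_j from (A.52) with x_0 = 1."""
    prod = F(1)
    for m in range(j):
        prod *= p[m]
    return sum(c[j][k] * xi ** k for k in range(j + 1)) / prod

# Assumed example with N = 4: p_j moves up, q_j moves down, r_j = 1 - p_j - q_j.
p = [F(1, 2), F(1, 3), F(1, 4), F(1, 5), F(0)]
q = [F(0), F(1, 3), F(1, 4), F(2, 5), F(1, 2)]
N = len(p) - 1
xi = F(-3, 7)
x = components_by_recursion(p, q, xi)
c = coefficients(p, q, N - 1)
```

For $\xi = 0$ (that is, $\lambda = 1$) the recursion returns equal components, in line with the all-one right-eigenvector of a stochastic matrix.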
We can always scale an eigenvector without affecting the corresponding eigenvalue. If we require a normalization of the eigenvector $\|x(\xi)\|_1 = 1$, then $x_0(\xi)$ is uniquely determined,
$$x_0(\xi) = \frac{1}{1 + \sum_{j=1}^{N-1}\left|\frac{\sum_{k=0}^{j} c_k(j)\,\xi^k}{\prod_{m=0}^{j-1} p_m}\right| + \frac{q_N}{|q_N + \xi|}\left|\frac{\sum_{k=0}^{N-1} c_k(N-1)\,\xi^k}{\prod_{m=0}^{N-2} p_m}\right|}$$
Another scaling consists of choosing $x_0(\xi) = 1$. Hence, apart from the eigenvalue $\xi$, all eigenvector components $x_j(\xi)$ are explicitly determined. If $\lambda = 1$ or $\xi = 0$, the solution is $x_j(0) = x_0(0)$. If $\|x(0)\|_1 = 1$, then $x_j(0) = \frac{1}{N+1}$, which is, after proper scaling by $N + 1$ (art. 4 in Section A.1), the right-eigenvector $u = [1\ 1\ \cdots\ 1]$ belonging to the left-eigenvector $\pi$ (see also Section A.4.1). If $x_0(0) = 1$, we immediately obtain $u$. Eigenvectors belonging to different eigenvalues $\xi' \neq \xi$ are linearly independent (art. 3 in Section A.1), but only orthogonal if $P = P^T$, i.e. if $p_j = q_{j+1}$. Only in the latter case (art. 9 in Section A.2), where also all eigenvalues are real, we have
$$\sum_{j=0}^{N} x_j(\xi)\, x_j(\xi') = \|x(\xi)\|_2^2\, \delta_{\xi\xi'}$$
This orthogonality requirement determines the different eigenvalues $\xi$. Since $\xi' = 0$ is an eigenvalue, each other real eigenvalue $\xi \neq 0$ must obey
$$\sum_{j=0}^{N} x_j(\xi) = 0$$
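while the normalization enforces $\|x(\xi)\|_1 = \sum_{j=0}^{N} |x_j(\xi)| = 1$. The scaling $x_0(\xi) = 1$ leads to the polynomial $\sum_{k=0}^{N} b_k \xi^k$ of degree $N$ whose $N$ zeros equal the eigenvalues $\xi \neq 0$ and whose coefficients are, with $p_j = q_{j+1}$ and for $2 \le k \le N - 2$,
$$b_0 = (N+1)\, q_N$$
$$b_1 = N + q_N \sum_{j=1}^{N-2}\frac{c_1(j)}{\prod_{m=0}^{j-1} p_m} + 2 q_N\, \frac{c_1(N-1)}{\prod_{m=0}^{N-2} p_m}$$
$$b_k = q_N\sum_{j=k}^{N-2}\frac{c_k(j)}{\prod_{m=0}^{j-1} p_m} + \frac{2 q_N\, c_k(N-1)}{\prod_{m=0}^{N-2} p_m} + \sum_{j=k-1}^{N-1}\frac{c_{k-1}(j)}{\prod_{m=0}^{j-1} p_m}$$
$$b_{N-1} = \frac{p_{N-2} + 2 q_N + c_{N-2}(N-1)}{\prod_{m=0}^{N-2} p_m}$$
$$b_N = \frac{1}{\prod_{m=0}^{N-2} p_m}$$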
The Newton identities (B.9) relate these coefficients to the sum of integer powers of the real zeros $\xi \neq 0$. Proceeding much further in the case that $P$ is not symmetric is difficult. A similarity transform is needed to transform the linearly independent set of vectors $x(\xi)$ for different $\xi$ to an orthogonal set from which the eigenvalues then follow, as in the symmetric case above. Karlin and McGregor (see Schoutens (2000, Chapter 3)) have shown the existence of a set of orthogonal polynomials (similar to our set $x_j(\xi)$) that obey an integral orthogonality condition (similar to Legendre or Chebyshev polynomials) instead of our summation orthogonality condition. Only in particular cases, however, were they able to specify this orthogonal set explicitly.
A.5.3 A triangular matrix complemented with one subdiagonal

The transition probability matrix $P$ has the structure of a triangular matrix complemented with one subdiagonal,
$$P = \begin{bmatrix}
P_{00} & P_{01} & P_{02} & \cdots & \cdots & P_{0N} \\
P_{10} & P_{11} & P_{12} & \cdots & \cdots & P_{1N} \\
0 & P_{21} & P_{22} & \cdots & \cdots & P_{2N} \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & P_{N-1,N-2} & P_{N-1,N-1} & P_{N-1,N} \\
0 & 0 & \cdots & 0 & P_{N,N-1} & P_{NN}
\end{bmatrix}$$
Besides the normalization $\|\pi\|_1 = 1$, the steady-state vector $\pi$ obeys the relation $\pi = \pi P$, or per vector component (9.23),
$$\pi_j = \sum_{k=0}^{j+1} P_{kj}\, \pi_k$$
because $P_{kj} = 0$ if $k > j + 1$. Immediately we obtain an iterative equation that expresses $\pi_{j+1}$ (for $j < N$) in terms of the $\pi_k$ for $0 \le k \le j$ as
$$\pi_{j+1} = \left(\frac{1 - P_{jj}}{P_{j+1,j}}\right)\pi_j - \sum_{k=0}^{j-1}\frac{P_{kj}}{P_{j+1,j}}\,\pi_k$$
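The iterative equation for $\pi_{j+1}$ translates directly into a forward substitution. The following sketch assumes that all subdiagonal elements $P_{j+1,j}$ are non-zero; the $4 \times 4$ matrix is an arbitrary illustrative example and `steady_state` is a hypothetical helper name.

```python
from fractions import Fraction as F

def steady_state(P):
    """Steady-state vector of a stochastic matrix with P[k][j] = 0 for k > j + 1.

    Starts from the unnormalized choice pi_0 = 1, applies the forward
    recursion for pi_{j+1}, and normalizes the result to sum 1.
    """
    N = len(P) - 1
    pi = [F(1)]
    for j in range(N):
        s = sum(P[k][j] * pi[k] for k in range(j))
        pi.append(((1 - P[j][j]) * pi[j] - s) / P[j + 1][j])
    total = sum(pi)
    return [x / total for x in pi]

# Assumed example: upper triangular part plus one subdiagonal.
P = [[F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
     [F(1, 2), F(1, 6), F(1, 6), F(1, 6)],
     [F(0),    F(1, 2), F(1, 4), F(1, 4)],
     [F(0),    F(0),    F(1, 2), F(1, 2)]]
pi = steady_state(P)
pi_times_P = [sum(pi[k] * P[k][j] for k in range(len(P))) for j in range(len(P))]
```

Only the first $N$ balance equations are used; the last one is satisfied automatically because the rows of $P$ sum to 1.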
Let us consider the eigenvalue equation (A.1) that is written for stochastic matrices as $(P - \lambda I)^T x^T = 0$. The matrix $(P - \lambda I)^T$ is an $(N+1) \times (N+1)$ matrix of rank $N$ because $\det(P - \lambda I)^T = 0$ (else all eigenvectors $x$ are zero). When writing this set of equations in terms of $x_0$, we produce the following set of $N$ equations,
$$\begin{bmatrix}
P_{10} & 0 & 0 & \cdots & 0 & 0 \\
P_{11} - \lambda & P_{21} & 0 & \cdots & 0 & 0 \\
P_{12} & P_{22} - \lambda & P_{32} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
P_{1,N-2} & P_{2,N-2} & P_{3,N-2} & \cdots & P_{N-1,N-2} & 0 \\
P_{1,N-1} & P_{2,N-1} & P_{3,N-1} & \cdots & P_{N-1,N-1} - \lambda & P_{N,N-1}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{N-1} \\ x_N \end{bmatrix}
=
\begin{bmatrix} \lambda - P_{00} \\ -P_{01} \\ -P_{02} \\ \vdots \\ -P_{0,N-2} \\ -P_{0,N-1} \end{bmatrix} x_0$$
Since the matrix on the left-hand side is a triangular matrix, its determinant equals the product of the diagonal elements or $\prod_{k=0}^{N-1} P_{k+1,k}$. By Cramer's rule, we find that $x_j$ equals $x_0$ times the determinant of this matrix with the $j$-th column replaced by the vector $(\lambda - P_{00}, -P_{01}, \ldots, -P_{0,N-1})^T$, divided by $\prod_{k=0}^{N-1} P_{k+1,k}$. The determinant in the numerator is of the form (Meyer, 2000, p. 467)
$$\det\begin{bmatrix} A_{j \times j} & O_{j \times (N-j)} \\ B_{(N-j) \times j} & C_{(N-j) \times (N-j)} \end{bmatrix} = \det A \,\det C$$
where $\det C = \prod_{k=j}^{N-1} P_{k+1,k}$. In the determinant $\det A$, we can change the $j$-th column with the $(j-1)$-th, and subsequently, the $(j-1)$-th with the $(j-2)$-th and so on until the last column is permuted to the first column, in total $j - 1$ permutations. After changing the sign of that first column, the result is that $\det A = (-1)^j \det\left(P_{j \times j} - \lambda I_{j \times j}\right)$ where $P_{j \times j}$ is the original transition probability matrix limited to $j$ states (instead of $N + 1$). Hence, for $1 \le j \le N$,
$$x_j = \frac{x_0\, (-1)^j \det\left(P_{j \times j} - \lambda I_{j \times j}\right)}{\prod_{k=0}^{j-1} P_{k+1,k}}$$
and the normalization of eigenvectors $\|x\|_1 = 1$ determines $x_0$ as
$$x_0 = \frac{1}{1 + \sum_{j=1}^{N}\frac{(-1)^j \det\left(P_{j \times j} - \lambda I_{j \times j}\right)}{\prod_{k=0}^{j-1} P_{k+1,k}}}$$
If the $N + 1$ eigenvalues are known, we observe that all eigenvectors can be expressed in terms of the original matrix $P$ in a same way.
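The determinant expression for $x_j$ can be checked with exact arithmetic. In the sketch below, the matrix `P` is an arbitrary illustrative example with non-zero subdiagonal, $x_0 = 1$ is assumed, and the helper names are hypothetical. For $\lambda = 1$ the result should be proportional to the steady-state vector; for any other $\lambda$ the first $N$ equations of $(P - \lambda I)^T x^T = 0$ should still hold.

```python
from fractions import Fraction as F

def det(m):
    """Exact determinant by Gaussian elimination on Fractions."""
    a = [row[:] for row in m]
    n = len(a)
    d = F(1)
    for c in range(n):
        piv = next((r for r in range(c, n) if a[r][c] != 0), None)
        if piv is None:
            return F(0)
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def eigenvector_components(P, lam):
    """x_j = (-1)^j det(P_jxj - lam I) / prod_{k<j} P[k+1][k], with x_0 = 1."""
    N = len(P) - 1
    x = [F(1)]
    prod = F(1)
    for j in range(1, N + 1):
        prod *= P[j][j - 1]
        sub = [[P[r][c] - (lam if r == c else 0) for c in range(j)]
               for r in range(j)]
        x.append((-1) ** j * det(sub) / prod)
    return x

# Assumed example: upper triangular part plus one subdiagonal.
P = [[F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
     [F(1, 2), F(1, 6), F(1, 6), F(1, 6)],
     [F(0),    F(1, 2), F(1, 4), F(1, 4)],
     [F(0),    F(0),    F(1, 2), F(1, 2)]]
n = len(P)

x_one = eigenvector_components(P, F(1))            # lambda = 1
lam = F(1, 3)                                      # not an eigenvalue in general
x_lam = eigenvector_components(P, lam)
col = lambda x, j: sum(x[k] * P[k][j] for k in range(n))
```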
Appendix B Algebraic graph theory
This appendix reviews the elementary basics of the matrix theory for graphs $G(N, L)$. The book by Cvetkovic et al. (1995) is the current standard work on algebraic graph theory.
B.1 The adjacency and incidence matrix

1. The adjacency matrix $A$ of a graph $G$ with $N$ nodes is an $N \times N$ matrix with elements $a_{ij} = 1$ only if $(i, j)$ is a link of $G$, otherwise $a_{ij} = 0$. Because the existence of a link implies that $a_{ij} = a_{ji}$, the adjacency matrix $A = A^T$ is a real symmetric matrix. It is assumed further that the graph $G$ does not contain self-loops ($a_{ii} = 0$) nor multiple links between two nodes. The complement $G^c$ of the graph $G$ consists of the same set of nodes but with a link between $(i, j)$ if there is no link $(i, j)$ in $G$ and vice versa. Thus, $(G^c)^c = G$ and the adjacency matrix $A^c$ of the complement $G^c$ is $A^c = J - I - A$ where $J$ is the all-one matrix ($(J)_{ij} = 1$). Information about the direction of the links is specified by the incidence matrix $B$, an $N \times L$ matrix with elements
$$b_{ij} = \begin{cases} 1 & \text{if link } e_j = i \to j \\ -1 & \text{if link } e_j = i \leftarrow j \\ 0 & \text{otherwise} \end{cases}$$

Fig. B.1. A graph with $N = 6$ and $L = 9$. The links are lexicographically ordered, $e_1 = 1 \to 2$, $e_2 = 1 \to 3$, $e_3 = 1 \leftarrow 6$ etc.
Figure B.1 exemplifies the definition of D and E: 5
9 9 9 D=9 9 7
0 1 1 0 0 1
1 0 1 0 1 1
1 1 0 1 0 0
0 0 1 0 1 0
0 1 0 1 0 1
1 1 0 0 1 0
6
5
: 9 : 9 9 : : E =9 9 : 8 7
1 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1
6 : : : : : 8
2. The relation between adjacency and incidence matrix is given by the admittance matrix or Laplacian $Q$,
$$Q = B B^T = \Delta - A$$
where $\Delta = \mathrm{diag}(d_1, d_2, \ldots, d_N)$ is the degree matrix. Indeed, if $i \neq j$ and nodes $i$ and $j$ are linked, then, noting that each column of $B$ has only two non-zero elements at a different row,
$$q_{ij} = \left(B B^T\right)_{ij} = \sum_{k=1}^{L} b_{ik} b_{jk} = -1$$
while $q_{ij} = 0$ if there is no link between $i$ and $j$. If $i = j$, then $\sum_{k=1}^{L} b_{ik}^2 = d_i$, the number of links that have node $i$ in common. Also, by the definition of $A$, the row sum $i$ of $A$ equals the degree $d_i$ of node $i$,
$$d_i = \sum_{k=1}^{N} a_{ik} \qquad \text{(B.1)}$$
Consequently, each row sum $\sum_{k=1}^{N} q_{ik} = 0$ which shows that $Q$ is singular implying that $\det Q = 0$. The Laplacian is symmetric $Q = Q^T$ because $A$ and $\Delta$ are both symmetric and the quadratic form defined in Section A.2 art. 10,
$$x^T Q x = x^T Q^T x = x^T B B^T x = \left\|B^T x\right\|_2^2 \ge 0$$
is positive semidefinite, which implies that all eigenvalues of $Q$ are non-negative and at least one is zero because $\det Q = 0$. Since $\sum_{i=1}^{N}\sum_{k=1}^{N} a_{ik} = 2L$, the basic law for the degree follows as
$$\sum_{i=1}^{N} d_i = 2L \qquad \text{(B.2)}$$
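The relations $Q = B B^T = \Delta - A$, (B.1) and (B.2) can be illustrated on the graph of Fig. B.1. In the sketch below the link list is read off from the lexicographic ordering in the figure caption; the orientation of each link (from the lower to the higher node label) is an assumption made only for this example, which is harmless because $B B^T$ does not depend on the chosen directions.

```python
# Links of the graph in Fig. B.1 (N = 6 nodes, L = 9 links), 1-based labels.
links = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 5), (2, 6), (3, 4), (4, 5), (5, 6)]
N = 6
L = len(links)

A = [[0] * N for _ in range(N)]          # adjacency matrix
B = [[0] * L for _ in range(N)]          # incidence matrix (assumed orientation)
for col, (i, j) in enumerate(links):
    A[i - 1][j - 1] = A[j - 1][i - 1] = 1
    B[i - 1][col] = 1                    # link leaves node i
    B[j - 1][col] = -1                   # link enters node j

degree = [sum(row) for row in A]         # (B.1): row sums of A

# Laplacian as the product B B^T.
Q = [[sum(B[i][k] * B[j][k] for k in range(L)) for j in range(N)]
     for i in range(N)]
```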
Notice that $P = \Delta^{-1} A$ is a stochastic matrix because all elements of $P$ lie in the interval $[0, 1]$ and each row sum is 1.

3. Let $J$ denote the all-one matrix with $(J)_{ij} = 1$ and $\xi(G)$ the total number of spanning trees in the graph $G$, also called the complexity of $G$, then
$$\mathrm{adj}\, Q = \xi(G)\, J \qquad \text{(B.3)}$$
where $Q^{-1} = \frac{\mathrm{adj}\, Q}{\det Q}$. We omit the proof, but apply the relation (B.3) to the complete graph $K_N$ where $Q = N I - J$. Equation (B.3) demonstrates that all elements of $\mathrm{adj}\, Q$ are equal to $\xi(G)$. Hence, it suffices to compute one suitable element of $\mathrm{adj}\, Q$, for example $(\mathrm{adj}\, Q)_{11}$ that is equal to the determinant of the $(N-1) \times (N-1)$ principal submatrix of $Q$ obtained by deleting the first row and column in $Q$,
$$(\mathrm{adj}\, Q)_{11} = \det\begin{bmatrix}
N-1 & -1 & \cdots & -1 \\
-1 & N-1 & \cdots & -1 \\
\vdots & \vdots & \ddots & \vdots \\
-1 & -1 & \cdots & N-1
\end{bmatrix}$$
Adding all rows to the first and subsequently adding this new first row to all other rows gives
$$(\mathrm{adj}\, Q)_{11} = \det\begin{bmatrix}
1 & 1 & \cdots & 1 \\
-1 & N-1 & \cdots & -1 \\
\vdots & \vdots & \ddots & \vdots \\
-1 & -1 & \cdots & N-1
\end{bmatrix} = \det\begin{bmatrix}
1 & 1 & \cdots & 1 \\
0 & N & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & N
\end{bmatrix} = N^{N-2}$$
Hence, the total number of spanning trees in the complete graph $K_N$, which is also the total number of possible spanning trees in any graph with $N$ nodes, equals $N^{N-2}$. This is a famous theorem of Cayley of which many proofs exist (van Lint and Wilson, 1996, Chapter 2).

4. The complexity of $G$ is also given by
$$\xi(G) = \frac{\det(J + Q)}{N^2} \qquad \text{(B.4)}$$
Indeed, observe that $J Q = (J B) B^T = 0$ since $J B = 0$. Hence,
$$(N I - J)(J + Q) = N J + N Q - J^2 - J Q = N Q$$
and
$$\mathrm{adj}\left((N I - J)(J + Q)\right) = \mathrm{adj}(J + Q)\, \mathrm{adj}(N I - J) = \mathrm{adj}(N Q)$$
Since $Q_{K_N} = N I - J$ and as shown in art. 3, $\mathrm{adj}(N I - J) = N^{N-2} J$ and since $\mathrm{adj}(N Q) = N^{N-1}\, \mathrm{adj}\, Q = N^{N-1}\, \xi(G)\, J$ where we have used (B.3),
$$\mathrm{adj}(J + Q)\, J = N\, \xi(G)\, J$$
Left-multiplication with $J + Q$, taking into account that $J Q = 0$ and $J^2 = N J$, finally gives
$$(J + Q)\, \mathrm{adj}(J + Q)\, J = \det(J + Q)\, J = N^2\, \xi(G)\, J$$
which proves (B.4).

5. A walk of length $k$ from node $i$ to node $j$ is a succession of $k$ arcs of the form $(n_0 \to n_1)(n_1 \to n_2) \cdots (n_{k-1} \to n_k)$ where $n_0 = i$ and $n_k = j$. A path is a walk in which all vertices are different, i.e. $n_l \neq n_m$ for all $0 \le l \neq m \le k$.

Lemma B.1.1 The number of walks of length $k$ from node $i$ to node $j$ is equal to the element $\left(A^k\right)_{ij}$.
Proof (by induction): For $k = 1$, the number of walks of length 1 between node $i$ and $j$ equals the number of direct links between $i$ and $j$, which is by definition the element $a_{ij}$ in the adjacency matrix $A$. Suppose the lemma holds for $k - 1$. A walk of length $k$ consists of a walk of length $k - 1$ from $i$ to some vertex $r$ which is adjacent to $j$. By the induction hypothesis, the number of walks of length $k - 1$ from $i$ to $r$ is $\left(A^{k-1}\right)_{ir}$ and the number of walks with length 1 from $r$ to $j$ equals $a_{rj}$. The total number of walks from $i$ to $j$ with length $k$ then equals $\sum_{r=1}^{N}\left(A^{k-1}\right)_{ir} a_{rj} = \left(A^k\right)_{ij}$ (by the rules of matrix multiplication). $\square$

Explicitly,
$$\left(A^k\right)_{ij} = \sum_{r_1=1}^{N}\sum_{r_2=1}^{N}\cdots\sum_{r_{k-1}=1}^{N} a_{i r_1} a_{r_1 r_2} \cdots a_{r_{k-2} r_{k-1}} a_{r_{k-1} j}$$
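Lemma B.1.1 can be illustrated by comparing the elements of $A^k$ with a hop-by-hop enumeration of walks. The sketch uses the graph of Fig. B.1; the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two square integer matrices."""
    n = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(a, k):
    """A^k by k successive multiplications (A^0 is the identity)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result

def count_walks(a, i, j, k):
    """Number of walks of length k from i to j, found by enumeration."""
    if k == 0:
        return int(i == j)
    # First hop to any neighbor r of i, then a walk of length k - 1 from r to j.
    return sum(count_walks(a, r, j, k - 1) for r in range(len(a)) if a[i][r])

# Adjacency matrix of the graph in Fig. B.1 (nodes relabeled 0..5).
links = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 5), (2, 6), (3, 4), (4, 5), (5, 6)]
N = 6
A = [[0] * N for _ in range(N)]
for i, j in links:
    A[i - 1][j - 1] = A[j - 1][i - 1] = 1
```

The enumeration grows exponentially in $k$, whereas the matrix power needs only $k$ multiplications, which is the practical value of the lemma.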
As shown in Section 15.2, the number of paths with $k$ hops between node $i$ and node $j$ is
$$X_k(i \to j; N) = \sum_{r_1 \neq \{i, j\}}\ \sum_{r_2 \neq \{i, r_1, j\}} \cdots \sum_{r_{k-1} \neq \{i, r_1, \ldots, r_{k-2}, j\}} a_{i r_1} a_{r_1 r_2} \cdots a_{r_{k-1} j}$$
The definition of a path restricts the first index $r_1$ to $N - 2$ possible values, the second $r_2$ to $N - 3$, etc., such that the total possible number of paths is
$$\prod_{l=1}^{k-1}(N - 1 - l) = \frac{(N-2)!}{(N-k-1)!}$$
whereas the total possible number of walks clearly is $N^{k-1}$.

A graph is connected if, for each pair of nodes, there exists a walk or, equivalently, if there exists some integer $k > 0$ for which $\left(A^k\right)_{ij} \neq 0$ for each $i, j$. The lowest integer $k$ for which $\left(A^k\right)_{ij} \neq 0$ for each pair of nodes $i, j$ is called the diameter of the graph $G$. Lemma B.1.1 demonstrates that the diameter equals the length of the longest shortest hop path in $G$.

B.2 The eigenvalues of the adjacency matrix

In this section, only general results of the eigenvalue spectrum of a graph $G$ are treated. For special types of graphs, there exists a wealth of additional, but specific properties of the eigenvalues.

1. Since $A$ is a real symmetric matrix, it has $N$ real eigenvalues (Section A.2), which we order as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$. Section A.1, art. 4 shows that, apart from a similarity transform, the set of eigenvalues with corresponding eigenvectors is unique. A similarity transform consists of a relabeling of the nodes in the graph that obviously does not alter the structure of the graph but merely expresses the eigenvectors in a different base. The classical Perron-Frobenius Theorem for non-negative square matrices (of which Theorem A.4.2 is a special case) states that $\lambda_1$ is a simple and positive root of the characteristic polynomial in (A.3) possessing the only eigenvector of $A$ with non-negative components. Moreover, it follows from (A.34) that
$$\lambda_1 = \sup_{x \neq 0}\frac{x^T A x}{x^T x} = \max_{x \neq 0}\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} a_{ij} x_i x_j}{\sum_{i=1}^{N} x_i^2}$$
The maximum is attained if and only if $x$ is the eigenvector of $A$ belonging to $\lambda_1$ and for any other vector $y \neq x$, $\lambda_1 \ge \frac{y^T A y}{y^T y}$ as shown in Section A.3. By choosing the vector $y = u = (1, 1, \ldots, 1)$, we have, with $\sum_{j=1}^{N} a_{ij} = d_i$ and (B.2),
$$\lambda_1 \ge \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} a_{ij} = \frac{1}{N}\sum_{i=1}^{N} d_i = \frac{2L}{N} \qquad \text{(B.5)}$$
The stochastic matrix $P = \Delta^{-1} A$ where $\Delta = \mathrm{diag}(d_1, d_2, \ldots, d_N)$ is the degree matrix has the characteristic polynomial $\det\left(\Delta^{-1} A - \lambda I\right) = \frac{\det(A - \lambda\Delta)}{\det \Delta}$ where $\det \Delta = \prod_{j=1}^{N} d_j$. Since the largest eigenvalue of a stochastic matrix equals 1 (Theorem A.4.2), for a regular graph where $d_j = r$, the largest eigenvalue of $A$ equals $\lambda_1 = r$.
2. Since $a_{ii} = 0$, we have that $\mathrm{trace}(A) = 0$. From (A.7),
$$(-1)^{N-1} c_{N-1} = \sum_{k=1}^{N} \lambda_k = 0 \qquad \text{(B.6)}$$
3. The Newton identities for polynomials. Let $p_n(z)$ denote a polynomial of order $n$ defined as
$$p_n(z) = \sum_{k=0}^{n} a_k(n)\, z^k = a_n(n)\prod_{k=1}^{n}\left(z - z_k(n)\right) \qquad \text{(B.7)}$$
where $\{z_k(n)\}$ are the $n$ zeros. It follows from (B.7) that $p_n(0) = a_0(n) = a_n(n)\prod_{k=1}^{n}\left(-z_k(n)\right)$. The logarithmic derivative of (B.7) is
$$p_n'(z) = p_n(z)\sum_{k=1}^{n}\frac{1}{z - z_k(n)}$$
For $|z| > \max_k |z_k(n)|$ (which is always possible for polynomials, but not for functions), we have that
$$p_n'(z) = p_n(z)\sum_{k=1}^{n}\sum_{j=0}^{\infty}\frac{\left(z_k(n)\right)^j}{z^{j+1}} = p_n(z)\sum_{j=0}^{\infty}\frac{Z_j(n)}{z^{j+1}}$$
where
$$Z_j(n) = \sum_{k=1}^{n}\left(z_k(n)\right)^j$$
Thus,
$$\sum_{k=1}^{n} k\, a_k(n)\, z^k = \sum_{k=0}^{n} a_k(n)\, z^k\sum_{j=0}^{\infty} Z_j(n)\, z^{-j} = \sum_{j=0}^{\infty}\sum_{k=0}^{n} a_k(n)\, Z_j(n)\, z^{k-j} \qquad \text{(B.8)}$$
Let $l = k - j$, then $-\infty < l \le n$. Also $j = k - l \ge 0$ such that $k \ge l$. Combined with $0 \le k \le n$, we have $\max(0, l) \le k \le n$. Thus,
$$\sum_{j=0}^{\infty}\sum_{k=0}^{n} a_k(n)\, Z_j(n)\, z^{k-j} = \sum_{l=-\infty}^{n}\ \sum_{k=\max(l,0)}^{n} a_k(n)\, Z_{k-l}(n)\, z^l$$
$$= \sum_{l=-\infty}^{-1}\sum_{k=0}^{n} a_k(n)\, Z_{k-l}(n)\, z^l + \sum_{l=0}^{n}\sum_{k=l}^{n} a_k(n)\, Z_{k-l}(n)\, z^l$$
Equating the corresponding powers of $z$ in (B.8) yields, with $Z_0(n) = n$,
$$a_0(n)\, Z_{-l}(n) + \sum_{k=1}^{n} a_k(n)\, Z_{k-l}(n) = 0 \qquad l \le 0$$
$$\sum_{k=l+1}^{n} a_k(n)\, Z_{k-l}(n) = (l - n)\, a_l(n)$$
The last set of equations for $0 \le l < n$,
$$a_l(n) = -\frac{1}{n - l}\sum_{k=l+1}^{n} a_k(n)\, Z_{k-l}(n) \qquad \text{(B.9)}$$
are the Newton identities that relate the coefficients of a polynomial to the sum of the positive powers of the zeros. Applied to the characteristic polynomials (A.3) and (A.5) of the adjacency matrix with $z_k(n) = \lambda_k$, $a_k(n) = (-1)^N c_k$ and $c_{N-1} = 0$ (from (B.6)) yields for the first few values,
$$c_{N-2} = -\frac{1}{2}\sum_{k=1}^{N}\lambda_k^2$$
$$c_{N-3} = -\frac{1}{3}\sum_{k=1}^{N}\lambda_k^3$$
$$c_{N-4} = \frac{1}{8}\left(\left(\sum_{k=1}^{N}\lambda_k^2\right)^2 - 2\sum_{k=1}^{N}\lambda_k^4\right)$$
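The Newton identities (B.9) give the coefficients of a polynomial from the power sums $Z_j(n)$ of its zeros by a backward recursion that starts at the leading coefficient. A minimal sketch with exact fractions follows; the zeros $1, 2, 3$ are an arbitrary illustrative choice and the function name is hypothetical.

```python
from fractions import Fraction

def coefficients_from_power_sums(Z, leading=Fraction(1)):
    """Coefficients a_0..a_n from power sums Z[1..n] using (B.9).

    Z[j] is the sum of the j-th powers of the zeros (Z[0] = n is not used);
    `leading` is the coefficient a_n of the highest power.
    """
    n = len(Z) - 1
    a = [Fraction(0)] * (n + 1)
    a[n] = Fraction(leading)
    for l in range(n - 1, -1, -1):
        a[l] = -sum(a[k] * Z[k - l] for k in range(l + 1, n + 1)) / (n - l)
    return a

zeros = [1, 2, 3]                                        # assumed example
Z = [sum(Fraction(z) ** j for z in zeros) for j in range(len(zeros) + 1)]
coeffs = coefficients_from_power_sums(Z)                 # a_0, a_1, a_2, a_3
```

For these zeros the recursion reproduces $(z-1)(z-2)(z-3) = z^3 - 6z^2 + 11z - 6$.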
4. From (A.4), the coefficient of the characteristic polynomial $c_{N-2} = \sum_{\mathrm{all}} M_2$. Each principal minor $M_2$ has a principal submatrix of the form $\begin{bmatrix} 0 & x \\ y & 0 \end{bmatrix}$ with $x, y \in [0, 1]$. A minor $M_2$ is non-zero if and only if $x = y = 1$ in which case $M_2 = -1$. For each set of adjacent nodes, there exists such non-zero minor, which implies that
$$c_{N-2} = -L$$
From art. 3, it follows that the number of links $L$ equals
$$L = \frac{1}{2}\sum_{k=1}^{N}\lambda_k^2 \qquad \text{(B.10)}$$

5. Each principal submatrix $M_{3 \times 3}$ is of the form
$$M_{3 \times 3} = \begin{bmatrix} 0 & x & z \\ x & 0 & y \\ z & y & 0 \end{bmatrix}$$
with determinant $M_3 = \det M_{3 \times 3} = 2xyz$, which is only non-zero for $x = y = z = 1$. That form of $M_{3 \times 3}$ corresponds with a subgraph of 3 nodes that are fully connected. Hence, $c_{N-3} = -2 \times$ the number of triangles in $G$. From art. 3, it follows that
$$\text{the number of triangles in } G = \frac{1}{6}\sum_{k=1}^{N}\lambda_k^3 \qquad \text{(B.11)}$$
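Since $\sum_{k} \lambda_k^m = \mathrm{trace}(A^m)$ for any square matrix, (B.10) and (B.11) can be evaluated without computing eigenvalues. The sketch below does so for the graph of Fig. B.1 and compares the triangle count with a brute-force enumeration of node triples; the helper names are hypothetical.

```python
from itertools import combinations

def matmul(a, b):
    """Product of two square integer matrices."""
    n = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Adjacency matrix of the graph in Fig. B.1 (nodes relabeled 0..5).
links = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 5), (2, 6), (3, 4), (4, 5), (5, 6)]
N = 6
A = [[0] * N for _ in range(N)]
for i, j in links:
    A[i - 1][j - 1] = A[j - 1][i - 1] = 1

A2 = matmul(A, A)
A3 = matmul(A2, A)
num_links = trace(A2) // 2          # (B.10): half the sum of squared eigenvalues
num_triangles = trace(A3) // 6      # (B.11): one sixth of the sum of cubed eigenvalues

# Independent check: count fully connected triples directly.
brute_force = sum(1 for a, b, c in combinations(range(N), 3)
                  if A[a][b] and A[b][c] and A[a][c])
```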
6. In general, from (A.4) and by identifying the structure of a minor $M_k$, any coefficient $c_{N-k}$ can be expressed in terms of graph characteristics,
$$(-1)^N c_{N-k} = \sum_{\mathcal{G} \in G_k}(-1)^{\mathrm{cycles}(\mathcal{G})} \qquad \text{(B.12)}$$
where $G_k$ is the set of all subgraphs of $G$ with exactly $k$ nodes and $\mathrm{cycles}(\mathcal{G})$ is the number of cycles in a subgraph $\mathcal{G} \in G_k$. The minor $M_k$ is a determinant of the $M_{k \times k}$ submatrix of $A$ and defined as
$$\det M_k = \sum_{p}(-1)^{\sigma(p)}\, a_{1 p_1} a_{2 p_2} \cdots a_{k p_k}$$
where the sum is over all $k!$ permutations $p = (p_1, p_2, \ldots, p_k)$ of $(1, 2, \ldots, k)$ and $\sigma(p)$ is the parity of $p$, i.e. the number of interchanges of $(1, 2, \ldots, k)$ to obtain $(p_1, p_2, \ldots, p_k)$. Only if all the links $(1, p_1), (2, p_2), \ldots, (k, p_k)$ are contained in $G$, $a_{1 p_1} a_{2 p_2} \cdots a_{k p_k}$ is non-zero. Since $a_{jj} = 0$, the sequence of contributing links $(1, p_1), (2, p_2), \ldots, (k, p_k)$ is a set of disjoint cycles and $\sigma(p)$ depends on the number of those disjoint cycles. Now, $\det M_k$ is constructed from a specific set $\mathcal{G} \in G_k$ of $k$ out of $N$ nodes and in total there are $\binom{N}{k}$ such sets in $G_k$. Combining all contributions leads to the expression (B.12).

7. Since $A$ is a symmetric 0-1 matrix, we observe that using (B.1),
$$\left(A^2\right)_{ii} = \sum_{k=1}^{N} a_{ik} a_{ki} = \sum_{k=1}^{N} a_{ik}^2 = \sum_{k=1}^{N} a_{ik} = d_i$$
Hence, with (A.16) or (B.10), (A.7) and basic law for the degree (B.2) is expressed as
$$\mathrm{trace}(A^2) = \sum_{k=1}^{N}\lambda_k^2 = \sum_{k=1}^{N} d_k = 2L \qquad \text{(B.13)}$$
Furthermore,
$$\sum_{i=1}^{N}\sum_{j=1; j \neq i}^{N}\left(A^2\right)_{ij} = \sum_{i=1}^{N}\sum_{j=1; j \neq i}^{N}\sum_{k=1}^{N} a_{ik} a_{kj} = \sum_{k=1}^{N}\sum_{i=1}^{N} a_{ki}\sum_{j=1; j \neq i}^{N} a_{kj}$$
$$= \sum_{k=1}^{N}\sum_{i=1}^{N} a_{ki}\left(d_k - a_{ki}\right) = \sum_{k=1}^{N}\left(d_k\sum_{i=1}^{N} a_{ki} - \sum_{i=1}^{N} a_{ki}\right)$$
or
$$\sum_{i=1}^{N}\sum_{j=1; j \neq i}^{N}\left(A^2\right)_{ij} = \sum_{k=1}^{N} d_k\left(d_k - 1\right) \qquad \text{(B.14)}$$
Lemma B.1.1 states that $\sum_{i=1}^{N}\sum_{j=1; j \neq i}^{N}\left(A^2\right)_{ij}$ equals twice the total number of two-hop walks with different source and destination nodes. In other words, the total number of connected triplets of nodes in $G$ equals half (B.14).

8. The total number $N_k$ of walks of length $k$ in a graph follows from Lemma B.1.1 as
$$N_k = \sum_{i=1}^{N}\sum_{j=1}^{N}\left(A^k\right)_{ij}$$
Since any real symmetric matrix (Section A.2, art. 9) can be written as $A = U\, \mathrm{diag}(\lambda_j)\, U^T$ where $U$ is an orthogonal matrix of the (normalized) eigenvectors of $A$, we have that $A^k = U\, \mathrm{diag}(\lambda_j^k)\, U^T$ and $\left(A^k\right)_{ij} = \sum_{n=1}^{N} u_{in} u_{jn} \lambda_n^k$. Hence,
$$N_k = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{n=1}^{N} u_{in} u_{jn} \lambda_n^k = \sum_{n=1}^{N}\left(\sum_{i=1}^{N} u_{in}\right)^2\lambda_n^k$$
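9. Applying the Hadamard inequality for the determinant of any matrix $C_{n \times n}$,
$$|\det C| \le \prod_{j=1}^{n}\left(\sum_{i=1}^{n}|c_{ij}|^2\right)^{\frac{1}{2}}$$
yields, with $a_{ij} = a_{ji}$ and (B.1),
$$|\det A| \le \prod_{j=1}^{N}\left(\sum_{i=1}^{N} a_{ji}^2\right)^{\frac{1}{2}} = \prod_{j=1}^{N}\left(\sum_{i=1}^{N} a_{ji}\right)^{\frac{1}{2}} = \prod_{j=1}^{N}\sqrt{d_j}$$
Hence, with (A.6),
$$(\det A)^2 = \prod_{k=1}^{N}\lambda_k^2 \le \prod_{j=1}^{N} d_j \qquad \text{(B.15)}$$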
10. Applying the Cauchy-Schwarz inequality (5.17)
$$\left(\sum_{k=1}^{n} a_k b_k\right)^2 \le \sum_{k=1}^{n} a_k^2\sum_{k=1}^{n} b_k^2$$
to the vector $(\lambda_2, \ldots, \lambda_N)$ and the all-one vector $(1, 1, \ldots, 1)$ gives
$$\left(\sum_{k=2}^{N}\lambda_k\right)^2 \le (N - 1)\sum_{k=2}^{N}\lambda_k^2$$
Introducing (B.6) and (B.13),
$$\lambda_1^2 \le (N - 1)\left(2L - \lambda_1^2\right)$$
leads to the bound for the largest (and positive) eigenvalue $\lambda_1$,
$$\lambda_1 \le \sqrt{\frac{2L(N-1)}{N}} \qquad \text{(B.16)}$$
Alternatively, in terms of the average degree $d_a = \frac{1}{N}\sum_{j=1}^{N} d_j = \frac{2L}{N}$, the largest eigenvalue $\lambda_1$ is bounded by the geometric mean of the average degree and the maximum possible degree, $\lambda_1 \le \sqrt{d_a (N-1)}$. Combining the lower bound (B.5) and upper bound (B.16) yields
$$\frac{2L}{N} \le \lambda_1 \le \sqrt{\frac{2L(N-1)}{N}} \qquad \text{(B.17)}$$

11. From the inequality (A.26) for Hölder $q$-norms, we find that, if
$$\sum_{k=1}^{N}|\lambda_k|^q < \beta^q$$
then
$$\sum_{k=1}^{N}|\lambda_k|^p < \beta^p$$
for $p > q > 0$. Since $\sum_{k=1}^{N}\lambda_k = 0$, not all $\lambda_k$ can be positive and combined with $\left|\sum_{k=1}^{N}\lambda_k^p\right| \le \sum_{k=1}^{N}|\lambda_k|^p$, we also have that $\left|\sum_{k=1}^{N}\lambda_k^p\right| < \beta^p$. Applied to the case where $q = 2$ and $p = 3$ gives the following implication: if $\sum_{k=2}^{N}\lambda_k^2 < \lambda_1^2$ then $\left|\sum_{k=2}^{N}\lambda_k^3\right| < \lambda_1^3$. In that case, the number of triangles given in (B.11) is
$$\text{the number of triangles in } G = \frac{1}{6}\lambda_1^3 + \frac{1}{6}\sum_{k=2}^{N}\lambda_k^3 \ge \frac{1}{6}\lambda_1^3 - \frac{1}{6}\left|\sum_{k=2}^{N}\lambda_k^3\right| > 0$$
Hence, if $\sum_{k=2}^{N}\lambda_k^2 < \lambda_1^2$, then the number of triangles in $G$ is at least one. Equivalently, in view of (B.10), if $\lambda_1 > \sqrt{L}$ then the graph $G$ contains at least one triangle.

12. A Theorem of Turán states that

Theorem B.2.1 A graph $G$ with $N$ nodes and more than $\left[\frac{N^2}{4}\right]$ links contains at least one triangle.

This theorem is a consequence of art. 7 and 11. For, using $L > \left[\frac{N^2}{4}\right]$, which (because $L$ is an integer) implies $L > \frac{N^2}{4}$ or, equivalently, $N < 2\sqrt{L}$, in the bound on the largest eigenvalue (B.5),
$$\lambda_1 \ge \frac{2L}{N} > \frac{2L}{2\sqrt{L}} = \sqrt{L}$$
and $\lambda_1 > \sqrt{L}$ is precisely the condition in art. 11 to have at least one triangle.
13. The eigenvalues of the complete graph $K_N$ are $\lambda_1 = N - 1$ and $\lambda_2 = \cdots = \lambda_N = -1$. This follows by computing the determinant in (A.2) in the same way as in Section B.1, art. 3. Alternatively, the adjacency matrix of the complete graph is $J - I$ and, if $u^T = [1\ 1\ \cdots\ 1]$ is the all-one vector, then $J = u u^T$. A direct computation yields
$$\det(J - I - \lambda I) = \det\left(u u^T - (\lambda + 1) I\right) = \left(-(\lambda + 1)\right)^N \det\left(I - \frac{u u^T}{\lambda + 1}\right)$$
Using (A.38) and $u^T u = N$,
$$\det(J - I - \lambda I) = \left(-(\lambda + 1)\right)^N\left(1 - \frac{N}{\lambda + 1}\right) = (-1)^N (\lambda + 1)^{N-1}(\lambda + 1 - N)$$
gives the eigenvalues of $K_N$. Since the number of links in $K_N$ is $L = \frac{N(N-1)}{2}$, we observe that the equality sign in (B.16) can occur. Since $L \le \frac{N(N-1)}{2}$ for any graph, the upper bound (B.16) shows that $\lambda_1 \le N - 1$ for any graph.
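The characteristic polynomial of $K_N$ obtained in art. 13 can be compared with a direct determinant evaluation at sample values of $\lambda$. The sketch below uses exact fractions; the helper names and the sample points are illustrative assumptions.

```python
from fractions import Fraction as F

def det(m):
    """Exact determinant by Gaussian elimination on Fractions."""
    a = [row[:] for row in m]
    n = len(a)
    d = F(1)
    for c in range(n):
        piv = next((r for r in range(c, n) if a[r][c] != 0), None)
        if piv is None:
            return F(0)
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def char_poly_direct(N, lam):
    """det(J - I - lam*I) for the complete graph K_N."""
    m = [[F(1) if i != j else -lam for j in range(N)] for i in range(N)]
    return det(m)

def char_poly_closed_form(N, lam):
    """(-1)^N (lam + 1)^(N-1) (lam + 1 - N)."""
    return (-1) ** N * (lam + 1) ** (N - 1) * (lam + 1 - N)

samples = [F(k, 3) for k in range(-9, 19)]
```

The closed form vanishes at $\lambda = N - 1$ and, with multiplicity $N - 1$, at $\lambda = -1$, in agreement with the stated spectrum.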
482
Algebraic graph theory
14. The dierence between the largest eigenvalue 1 and second largest 2 is never larger than Q , i.e. 1 2 Q
(B.18)
Since 1 A 0 as indicated by (B.17), it follows from (B.6) that 0=
Q X n=1
such that
n 1 +
Q X n=2
|n | 1 + (Q 1) |2 |
2 Hence, 1 2 1 +
1 Q 1
Q 1 1 = Q 1 Q 1
Art. 13 states that the largest possible eigenvalue is 1 = Q 1 of the complete graph which proves (B.18). Again, the equality sign in (B.18) occurs in case of the complete graph. 15. Regular graphs. Every node m in a regular graph has the same degree gm = u and relation (B.1) indicates that each row sum of D equals u. Theorem B.2.2 The maximum degree gmax = max1$m$Q gm is an eigenvalue of the adjacency matrix D of a connected graph J if and only if the corresponding graph is regular (i.e. gm = gmax for all m). Proof: If { is an eigenvector of D belonging to eigenvalue = gmax so is each vector n{ for each complex n (Section A.1, art. 1). Thus, we can scale the eigenvector { such that the maximum component, say {p = 1, and {n 1 for all n. The eigenvalue equation D{ = gmax { for that maximum component {p is gmax {p = gmax =
Q X
dpm {m
m=1
which implies that all {m = 1 whenever dpm = 1, i.e. when the node m is adjacent to node p. Hence, the degree of node p is gp = gmax . For any node m adjacent to p for which the component {m = 1, a same eigenvalue relation holds and thus gm = gmax . Proceeding this process shows that every node n 5 J has same degree gn = gmax because J is connected. Hence, { = x where xW = [1 1 · · · 1]. Conversely, if J is connected and regular, P then Q m=1 dpm = gmax for each p such that x is the eigenvector belonging
B.2 The eigenvalues of the adjacency matrix
483
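to eigenvalue $\lambda = d_{\max}$, and the only possible eigenvector (as follows from art. 1). Hence, there is only one eigenvalue $d_{\max}$. $\square$

16. The characteristic polynomial of the complement $G^c$ is
\[ \begin{aligned} \det(A^c - \lambda I) &= \det\left(J - A - (\lambda + 1) I\right) \\ &= (-1)^N \det\left( (A + (\lambda + 1) I) \left( I - (A + (\lambda + 1) I)^{-1} J \right) \right) \\ &= (-1)^N \det\left(A + (\lambda + 1) I\right) \det\left( I - (A + (\lambda + 1) I)^{-1} u u^T \right) \end{aligned} \]
where we have used that $J = u u^T$ and $u$ is the all-one vector. Similar to the proof of Lemma A.4.4, we find
\[ \det(A^c - \lambda I) = (-1)^N g(\lambda) \det\left(A + (\lambda + 1) I\right) \tag{B.19} \]
where
\[ g(\lambda) = 1 - u^T (A + (\lambda + 1) I)^{-1} u \]
In general, $g(\lambda)$ is not a simple function of $\lambda$ although a little more is known. For example, $g(\lambda) = 1 - \left\| (A + (\lambda + 1) I)^{-1/2} u \right\|_2^2$ which shows that $g(\lambda) \in (-\infty, 1]$. Unlike in the proof of Lemma A.4.4, $u$ is generally not an eigenvector of $A$ and we can write (Section A.1, art. 8)
\[ (A + (\lambda + 1) I)^{-1} = \frac{1}{\lambda + 1} \sum_{k=0}^{\infty} \frac{(-1)^k A^k}{(\lambda + 1)^k} \]
where the last sum $\sum_{k=0}^{\infty} A^k z^k$ can be interpreted as the matrix generating function of the number of walks of length $k$ (see Section B.1, art. 5 and art. 8). Since $A^k = U \operatorname{diag}(\lambda_j^k) U^T$ (Section A.2, art. 9) where the orthogonal matrix $U$ consists of eigenvectors $u_j$ of $A$, the matrix product
\[ u^T U = \begin{bmatrix} u \cdot u_1 & u \cdot u_2 & \cdots & u \cdot u_N \end{bmatrix} = \sqrt{N} \begin{bmatrix} \cos \alpha_1 & \cos \alpha_2 & \cdots & \cos \alpha_N \end{bmatrix} \]
where $\alpha_j$ is the angle between the eigenvector $u_j$ and the all-one vector $u$. Hence, $u^T A^k u = u^T U \operatorname{diag}(\lambda_j^k) U^T u = N \sum_{j=1}^{N} \lambda_j^k \cos^2 \alpha_j$ and, with $\sum_{k=0}^{\infty} \frac{(-\lambda_j)^k}{(\lambda + 1)^k} = \frac{\lambda + 1}{\lambda + 1 + \lambda_j}$, we can write
\[ g(\lambda) = 1 - N \sum_{j=1}^{N} \frac{\cos^2 \alpha_j}{\lambda + 1 + \lambda_j} \]
With (A.5), we have
\[ c(-\lambda - 1) = \det\left(A + (\lambda + 1) I\right) = \prod_{k=1}^{N} (\lambda_k + 1 + \lambda) \]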
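and, hence,
\[ \det(A^c - \lambda I) = \frac{(-1)^N}{N} \sum_{j=1}^{N} \left( \lambda + 1 + \lambda_j - N^2 \cos^2 \alpha_j \right) \prod_{k=1; k \ne j}^{N} (\lambda_k + 1 + \lambda) \tag{B.20} \]
which shows that the poles of $g(\lambda)$ are precisely compensated by the zeros of the polynomial $c(-\lambda - 1)$. Thus, the eigenvalues of $A^c$ are generally different from $\{-\lambda_j - 1\}_{1 \le j \le N}$ where $\lambda_j$ is an eigenvalue of $A$. Only if $u$ is an eigenvector of $A$ corresponding with $\lambda_k$, then $g(\lambda) = \frac{\lambda + 1 + \lambda_k - N}{\lambda + 1 + \lambda_k}$ and all eigenvalues of $A^c$ belong to the set $\{-\lambda_j - 1\}_{1 \le j \ne k \le N} \cup \{N - 1 - \lambda_k\}$. According to art. 15, $u$ is only an eigenvector when the graph is regular.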
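The role of the all-one vector in art. 15 and art. 16 can be checked directly on a small example. The sketch below is only an illustration: the helper names (`cycle`, `path`, `complement`, `times_all_one`) and the choice of the cycle on 6 nodes are assumptions, not part of the text. It multiplies the adjacency matrix and its complement by the all-one vector and shows that, for a regular graph with degree r, the results are r times and (N - 1 - r) times the all-one vector, whereas for a non-regular graph the all-one vector is not an eigenvector.

```python
def cycle(n):
    """Adjacency matrix of the cycle on n >= 3 nodes (regular with degree 2)."""
    return [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]


def path(n):
    """Adjacency matrix of the path on n nodes (not regular for n >= 3)."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


def complement(adj):
    """Adjacency matrix of the complement graph, A^c = J - I - A."""
    n = len(adj)
    return [[0 if i == j else 1 - adj[i][j] for j in range(n)] for i in range(n)]


def times_all_one(adj):
    """Product of the matrix with the all-one vector, i.e. the vector of row sums."""
    return [sum(row) for row in adj]


N, r = 6, 2
A = cycle(N)
regular_rows = times_all_one(A)                  # every entry equals r
complement_rows = times_all_one(complement(A))   # every entry equals N - 1 - r
path_rows = times_all_one(path(N))               # entries differ: not an eigenvector

print(regular_rows, complement_rows, path_rows)
```

The eigenvalue N - 1 - r of the complement is exactly the extra element of the set given after (B.20) when the eigenvalue of A belonging to the all-one vector is r.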
B.3 The stochastic matrix $P = \Delta^{-1} A$

The stochastic matrix $P = \Delta^{-1} A$, introduced in Section B.2, art. 1, characterizes a random walk on a graph. A random walk is described by a finite Markov chain that is time-reversible. Alternatively, a time-reversible Markov chain can be viewed as a random walk on an undirected graph. Random walks on graphs have many applications in different fields (see e.g. the survey by Lovász (1993)); perhaps the most important application is randomly searching or sampling.

The combination of Markov theory and algebra leads to interesting properties of $P = \Delta^{-1} A$. Section 9.3.1 and A.4.1 show that the left-eigenvector of $P$ belonging to eigenvalue $\lambda = 1$ is the steady-state vector $\pi$ (which is a $1 \times N$ row vector) and that the corresponding right-eigenvector is the all-one vector $u$, which essentially follows from (9.8) and which indicates that, at each discrete time step, precisely one transition occurs. These eigenvectors obey the eigenvalue equations $P^T \pi^T = \pi^T$ and $P u = u$ and the orthogonality relation $\pi u = 1$ (Section A.1, art. 3). If $d = (d_1, d_2, \ldots, d_N)$ is the degree vector, then the basic law for the degree (B.2) is written in vector form as $d^T u = 2L$, or, $\left( \frac{d}{2L} \right)^T u = 1$. Theorem 9.3.5 states that the steady-state eigenvector $\pi$ is unique such that the equations $\pi u = 1$ and $\left( \frac{d}{2L} \right)^T u = 1$ imply that the steady-state vector is
\[ \pi = \left( \frac{d}{2L} \right)^T \quad \text{or} \quad \pi_j = \frac{d_j}{2L} \tag{B.21} \]
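The steady-state vector (B.21) can be verified on a small graph. The following sketch is illustrative only: the four-node graph (a triangle with one pendant node) and the function names are assumptions chosen for the example. It builds the transition matrix by dividing each row of the adjacency matrix by the degree, checks that the degree vector divided by 2L is left unchanged by one step of the walk, and iterates the walk from a single start node to observe convergence.

```python
def transition_matrix(adj):
    """Row-normalize the adjacency matrix: each row is divided by the node degree."""
    return [[a / sum(row) for a in row] for row in adj]


def step(dist, P):
    """One step of the random walk: the row vector dist is multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical example graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
P = transition_matrix(A)
degrees = [sum(row) for row in A]
two_L = sum(degrees)                   # basic law of the degree: the degrees sum to 2L
pi = [d / two_L for d in degrees]      # candidate steady state from (B.21)

# The steady state must be invariant under one step of the walk.
stationary_error = max(abs(a - b) for a, b in zip(step(pi, P), pi))

# Start the walk in node 3 and iterate; the graph is connected and not bipartite.
dist = [0.0, 0.0, 0.0, 1.0]
for _ in range(200):
    dist = step(dist, P)
convergence_error = max(abs(a - b) for a, b in zip(dist, pi))

print(pi, stationary_error, convergence_error)
```

The example graph contains a triangle, so it is not bipartite and the walk converges; on a bipartite graph the iteration would oscillate instead.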
In general, the matrix $P$ is not symmetric, but, after a similarity transform $H = \Delta^{1/2}$, a symmetric matrix $R = \Delta^{1/2} P \Delta^{-1/2} = \Delta^{-1/2} A \Delta^{-1/2}$ is obtained whose eigenvalues are the same as those of $P$ (Section A.1, art. 4). The powerful property (Section A.2, art. 9) of symmetric matrices shows that all eigenvalues are real and that $R = U \operatorname{diag}(\lambda_R) U^T$, where the columns of the orthogonal matrix $U$ consist of the normalized eigenvectors $v_k$ that obey $v_j^T v_k = \delta_{jk}$. Explicitly written in terms of these eigenvectors gives
\[ R = \sum_{k=1}^{N} \lambda_k v_k v_k^T \]
where, with Frobenius Theorem A.4.2, the real eigenvalues are ordered as $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N \ge -1$. If we exclude bipartite graphs (where the set of nodes is $\mathcal{N} = \mathcal{N}_1 \cup \mathcal{N}_2$ with $\mathcal{N}_1 \cap \mathcal{N}_2 = \emptyset$ and where each link connects a node in $\mathcal{N}_1$ and in $\mathcal{N}_2$) or reducible Markov chains (Section A.4), then $|\lambda_k| < 1$, for $k > 1$. Section A.1, art. 4 shows that the similarity transform $H = \Delta^{1/2}$ maps the steady state vector $\pi$ into $v_1 = H^{-1} \pi^T$ and, with (B.21),
\[ v_1 = \frac{\Delta^{-1/2} \pi^T}{\left\| \Delta^{-1/2} \pi^T \right\|_2} \quad \text{or} \quad v_{1j} = \frac{\frac{\sqrt{d_j}}{2L}}{\sqrt{\sum_{j=1}^{N} \left( \frac{\sqrt{d_j}}{2L} \right)^2}} = \sqrt{\frac{d_j}{2L}} = \sqrt{\pi_j} \]
Finally, since $P = \Delta^{-1/2} R \Delta^{1/2}$, the spectral decomposition of the transition probability matrix of a random walk on a graph with adjacency matrix $A$ is
\[ P = \sum_{k=1}^{N} \lambda_k \Delta^{-1/2} v_k v_k^T \Delta^{1/2} = u \pi + \sum_{k=2}^{N} \lambda_k \Delta^{-1/2} v_k v_k^T \Delta^{1/2} \]
The $n$-step transition probability (9.10) is, with $\left( v_k v_k^T \right)_{ij} = v_{ki} v_{kj}$ and (B.21),
\[ P^n_{ij} = \frac{d_j}{2L} + \sqrt{\frac{d_j}{d_i}} \sum_{k=2}^{N} \lambda_k^n v_{ki} v_{kj} \]
The convergence towards the steady state $\pi_j$ can be estimated from
\[ \left| P^n_{ij} - \pi_j \right| \le \sqrt{\frac{d_j}{d_i}} \sum_{k=2}^{N} |\lambda_k^n| \, |v_{ki}| \, |v_{kj}| < \sqrt{\frac{d_j}{d_i}} \sum_{k=2}^{N} |\lambda_k|^n \]
Denoting by $\xi = \max(|\lambda_2|, |\lambda_N|)$ and by $\xi'$ the largest element of the reduced set $\{|\lambda_k|\} \setminus \{\xi\}$ with $2 \le k \le N$, we obtain
\[ \left| P^n_{ij} - \pi_j \right| < \sqrt{\frac{d_j}{d_i}} \, \xi^n + O\left( \xi'^n \right) \]

B.4 Eigenvalues and connectivity

A graph $G$ has $k$ components (or clusters) if there exists a relabeling of the nodes such that the adjacency matrix has the structure
\[ A = \begin{bmatrix} A_1 & O & \cdots & O \\ O & A_2 & & \vdots \\ \vdots & & \ddots & \\ O & \cdots & & A_k \end{bmatrix} \]
where the square submatrix $A_m$ is the adjacency matrix of the connected component $m$. Disconnectivity is a special case of reducibility of a stochastic matrix defined in Section A.4 and expresses that no communication is possible between two states in a different component or cluster. Using (A.39) indicates that
\[ \det(A - \lambda I) = \prod_{m=1}^{k} \det(A_m - \lambda I) \tag{B.22} \]
where, in each factor, $I$ has the dimension of $A_m$. If $A$ is the adjacency matrix of a regular graph with degree $r$, so is each submatrix $A_m$. Since $A_m$ is connected, Section B.2, art. 15 states that the largest eigenvalue of any $A_m$ equals $r$. Hence, by (B.22), the multiplicity of the largest eigenvalue of $A$ equals the number of components in the regular graph.

As shown in Section B.1, art. 2, the Laplacian $Q$ has non-negative eigenvalues of which at least one equals zero. In addition, the matrix $(N - 1) I - Q = (N - 1) I - \Delta + A$ is non-negative with constant row sums all equal to $N - 1$. Although the matrix $(N - 1) I - Q$ is not an adjacency matrix and does not represent a regular graph, the main argument in the proof of Theorem B.2.2 is the property of constant row sums and non-negative matrix elements. Hence, the multiplicity of the largest eigenvalue of $(N - 1) I - Q$ is equal to the number of components of $G$. But the largest eigenvalue of $(N - 1) I - Q$ is the smallest of $Q - (N - 1) I$ and also of $Q$. Hence, we have proved
Theorem B.4.1 The multiplicity of the smallest eigenvalue $\lambda = 0$ of the Laplacian $Q$ is equal to the number of components in the graph $G$.

If $Q$ has only one zero eigenvalue with corresponding eigenvector $u$ (because $\sum_{k=1}^{N} q_{ik} = 0$ for each $1 \le i \le N$ is, in vector notation, $Qu = 0$), then the graph is connected; it has only one component. Theorem B.4.1 also implies that, if the second smallest eigenvalue $\lambda_Q = \lambda^{(Q)}_{N-1}$ of $Q$ is zero, the graph $G$ is disconnected. Since the eigenvectors of a symmetric matrix are orthogonal, the eigenvector $x_Q$ of $Q$ must satisfy $x_Q^T u = 0$ since $u$ is the eigenvector belonging to $\lambda = 0$. By requiring this additional constraint and choosing the scaling of the eigenvector such that $x^T x = 1$, we obtain similar to (A.35) that
\[ \lambda_Q = \min_{\|x\|_2^2 = 1 \text{ and } x^T u = 0} x^T Q x \]
The second smallest eigenvalue $\lambda_Q$ has many interesting properties that characterize how strongly a graph $G$ is connected. It is interesting to mention the inequality (Cvetkovic et al., 1995, p. 265)
\[ \kappa(G) \ge \lambda_Q \ge 2 \eta(G) \left( 1 - \cos \frac{\pi}{N} \right) \tag{B.23} \]
where $\kappa(G)$ and $\eta(G)$ are the vertex and edge connectivity respectively.
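Theorem B.4.1 can be illustrated without an eigenvalue solver. The Laplacian is symmetric, so the multiplicity of its zero eigenvalue equals N minus its rank. The sketch below is only an illustration: the example graph and the function names are assumptions. It computes the rank with exact rational Gaussian elimination and compares the result with a component count obtained by a graph search.

```python
from fractions import Fraction


def laplacian(adj):
    """Laplacian: degrees on the diagonal minus the adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def rank(matrix):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    rk = 0
    for c in range(cols):
        pivot = next((r for r in range(rk, rows) if m[r][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, rows):
            factor = m[r][c] / m[rk][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def count_components(adj):
    """Number of connected components found by a depth-first search."""
    n, seen, components = len(adj), set(), 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            for w in range(n):
                if adj[v][w] and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components


# Hypothetical example: a triangle (0, 1, 2), a single link (3, 4) and an isolated node 5.
links = [(0, 1), (1, 2), (0, 2), (3, 4)]
N = 6
A = [[0] * N for _ in range(N)]
for i, j in links:
    A[i][j] = A[j][i] = 1

zero_multiplicity = N - rank(laplacian(A))
components = count_components(A)
print(zero_multiplicity, components)
```

Exact fractions are used so that the rank is not affected by rounding; for large graphs a numerical eigenvalue routine would be the more practical choice.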
B.5 Random matrix theory

Random matrix theory investigates the eigenvalues of an $N \times N$ matrix $A$ whose elements $a_{ij}$ are random variables with a given joint distribution. Even in case all elements $a_{ij}$ are independent, there does not exist a general expression for the distribution of the eigenvalues. However, in some particular cases (such as Gaussian elements $a_{ij}$), there exist nice results. Moreover, if the elements $a_{ij}$ are properly scaled, in various cases the spectrum in the limit $N \to \infty$ seems to converge rapidly to a deterministic limit distribution. The fascinating results of random matrix theory and applications from nuclear physics to the distributions of the non-trivial zeros of the Riemann Zeta function are discussed by Mehta (1991). Random matrix theory immediately applies to the adjacency matrix of the random graph $G_p(N)$ where each element $a_{ij}$ is 1 with probability $p$ and zero with probability $1 - p$.
B.5.1 The spectrum of the random graph $G_p(N)$

Let $\lambda$ denote an arbitrary eigenvalue of the adjacency matrix of the random graph $G_p(N)$. Clearly, $\lambda$ is a random variable with mean $E[\lambda] = \frac{1}{N} \sum_{k=1}^{N} \lambda_k = 0$ because of (B.6). In addition, the variance $\mathrm{Var}[\lambda] = E\left[\lambda^2\right] = \frac{1}{N} \sum_{k=1}^{N} \lambda_k^2$ and from (B.10)
\[ \mathrm{Var}[\lambda] = \frac{2L}{N} = p(N - 1) \]
This result implies that, for fixed $p$ and large $N$, the eigenvalues of $G_p(N)$ grow as $O\left(\sqrt{N}\right)$, with the exception¹ of the largest eigenvalue $\lambda_1$. The number of links $L$ in $G_p(N)$ is binomially distributed with mean $E[L] = p \frac{N(N-1)}{2}$. Taking the expectation of the bounds (B.17) on the largest eigenvalue gives
\[ \frac{2}{N} E[L] \le E[\lambda_1] \le \sqrt{\frac{2(N-1)}{N}} \, E\left[\sqrt{L}\right] \]
Using (2.12) yields
\[ E\left[\sqrt{L}\right] = \sum_{k=0}^{\binom{N}{2}} \sqrt{k} \, \Pr[L = k] = \sum_{k=0}^{\binom{N}{2}} \sqrt{k} \binom{\binom{N}{2}}{k} p^k (1 - p)^{\binom{N}{2} - k} \]
Unfortunately, the sum cannot be expressed in closed form, but
\[ \frac{1}{\sqrt{\binom{N}{2}}} \sum_{k=0}^{\binom{N}{2}} \sqrt{k} \binom{\binom{N}{2}}{k} p^k (1 - p)^{\binom{N}{2} - k} \le \sqrt{p} \]
with equality for $N \to \infty$. In summary, for any $N$ and $p$,
\[ p(N - 1) \le E[\lambda_1] \le \sqrt{p} \, (N - 1) \tag{B.24} \]
The degree distribution (15.11) of the random graph is a binomial distribution with mean $E[D_{rg}] = p(N - 1)$ and $\mathrm{Var}[D_{rg}] = (N - 1) p (1 - p)$. The inequality (5.13) indicates that the degree $D_{rg}$ converges exponentially fast to the mean $E[D_{rg}]$ for fixed $p$ and large $N$, which means that the random graph tends to a regular graph with high probability. Section B.2, art. 1 states that $\lambda_1 \to p(N - 1)$ with high probability. Comparison with the bounds (B.24) indicates that the upper bound is less tight than the lower bound and that the upper bound is only sharp when $p \to 1$, i.e. for the complete graph. Section B.2, art. 13 shows that only for the complete graph the upper bound is indeed exactly attained.

¹ It is known that, for large $N$, the second largest eigenvalue of $G_p(N)$ grows as $O\left(N^{\frac{1}{2} + \epsilon}\right)$.
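The bounds on the largest eigenvalue can be explored by simulation. The sketch below is a rough illustration and not a method from the text: the graph size, the seed, the number of samples and the use of power iteration on the shifted matrix A + I are all assumptions made for the example. For each sampled graph it estimates the largest eigenvalue and compares it with the per-graph bounds 2L/N and sqrt(2L(N-1)/N) whose expectations lead to (B.24).

```python
import math
import random


def random_graph(n, p, rng):
    """Adjacency matrix of a random graph: each link exists with probability p."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


def largest_eigenvalue(adj, iterations=200):
    """Estimate the largest eigenvalue by power iteration on A + I.

    The shift by the identity is an assumption of this sketch; it keeps the
    iteration from oscillating when -lambda_1 is also an eigenvalue.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x for the normalized vector x.
    return sum(x[i] * adj[i][j] * x[j] for i in range(n) for j in range(n))


N, p = 20, 0.5
rng = random.Random(2006)
samples = []
for _ in range(30):
    A = random_graph(N, p, rng)
    L = sum(map(sum, A)) // 2          # number of links
    samples.append((L, largest_eigenvalue(A)))

mean_lambda_1 = sum(lam for _, lam in samples) / len(samples)
print(p * (N - 1), mean_lambda_1, math.sqrt(p) * (N - 1))
```

The printed triple shows the lower bound of (B.24), the sample mean of the largest eigenvalue and the upper bound; with only 30 samples the sample mean is itself subject to statistical fluctuation.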
B.5.2 Wigner's Semicircle Law

Wigner's Semicircle Law is the fundamental result in the spectral theory of large random matrices.

Theorem B.5.1 (Wigner's Semicircle Law) Let $A$ be a random $N \times N$ real symmetric matrix with independent and identically distributed elements $a_{ij}$ with $\sigma^2 = \mathrm{Var}[a_{ij}]$ and denote by $\lambda(A_N)$ an eigenvalue of the set of the $N$ real eigenvalues of the scaled matrix $A_N = \frac{A}{\sqrt{N}}$. The probability density function $f_{\lambda(A_N)}(x)$ of $\lambda(A_N)$ tends for $N \to \infty$ to $\lim_{N \to \infty} f_{\lambda(A_N)}(x) =$ [...]

[...] $1]$ into $m + 1$ subintervals. The length $L$ of each subinterval has the same distribution, which more easily follows by symmetry if the line segment is replaced by a circle of unit perimeter. Since the length $L$ of each subinterval is equal in distribution, one can consider the first subinterval $[0, X_{(1)}]$ whose length $L$ exceeds a value $x \in (0, 1)$ if and only if all $m$ uniform random variables belong to $[x, 1]$. The latter event has probability equal to $(1 - x)^m$ such that $\Pr[L > x] = (1 - x)^m$ and, with (2.35), $E[L] = \frac{1}{m+1}$.

(iii) If $X$ were a discrete random variable, then $\Pr[X = k] \approx \frac{n_k}{n}$, where $n_k$ is the number of values in the set $\{x_1, x_2, \ldots, x_n\}$ that is equal to $k$. For a continuous random variable $X$, the values are generally real numbers ranging from $x_{\min} = \min_{1 \le j \le n} x_j$ until $x_{\max} = \max_{1 \le j \le n} x_j$. We first construct a histogram $H$ from the set $\{x_1, x_2, \ldots, x_n\}$ by choosing a bin size $\Delta x = \frac{x_{\max} - x_{\min}}{m}$, where $m$ is the number of bins (abscissa points). The choice of $1 < m < n$ is in general difficult to determine. However, most computer packages allow us to experiment with $m$ and the human eye proves sensitive enough to make a good choice
C.1 Probability theory (Chapter 2)
of $m$: if $m$ is too small, we lose details, while a high $m$ may lead to high irregularities due to the stochastic nature of $X$. Once $m$ is chosen, the histogram consists of the set $\{h_0, h_1, \ldots, h_{m-1}\}$ where $h_j$ equals the number of $X$ values in the set $\{x_1, x_2, \ldots, x_n\}$ that lies in the interval $[x_{\min} + j \Delta x, x_{\min} + (j + 1) \Delta x]$ for $0 \le j \le m - 1$. By construction, $\sum_{j=0}^{m-1} h_j = n$. The histogram $H$ approximates the probability density function $f_X(x)$ after dividing each value $h_j$ by $n \Delta x$ because
1=
]
{max
{min
i[ ({)g{ = lim
{{ {min + (m + 1){{]. Alternatively from (2.31) we obtain i[ (m ) = lim
{{ |; ) =
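\[ \begin{aligned} & \frac{1}{(2 \pi i)^2} \int_{c_1 - i\infty}^{c_1 + i\infty} \int_{c_2 - i\infty}^{c_2 + i\infty} e^{z_1 (x - \mu_X) + z_2 (y - \mu_Y)} \, e^{\frac{1}{2} (\sigma_X z_1 + \rho z_2 \sigma_Y)^2 + \frac{\sigma_Y^2 (1 - \rho^2)}{2} z_2^2} \, dz_1 \, dz_2 \\ & = \frac{1}{2 \pi i} \int_{c_2 - i\infty}^{c_2 + i\infty} e^{\frac{\sigma_Y^2 (1 - \rho^2)}{2} z_2^2} e^{z_2 (y - \mu_Y)} \left[ \frac{1}{2 \pi i} \int_{c_1 - i\infty}^{c_1 + i\infty} e^{z_1 (x - \mu_X)} e^{\frac{1}{2} (\sigma_X z_1 + \rho z_2 \sigma_Y)^2} \, dz_1 \right] dz_2 \end{aligned} \]
Evaluating the last integral, denoted by $L$, yields
\[ \begin{aligned} L &= \frac{1}{2 \pi i} \int_{c_1 - i\infty}^{c_1 + i\infty} e^{z_1 (x - \mu_X)} \exp\left[ \frac{1}{2} (\sigma_X z_1 + \rho z_2 \sigma_Y)^2 \right] dz_1 \\ &= \frac{1}{2 \pi} \int_{-\infty}^{\infty} e^{(c_1 + it)(x - \mu_X)} \exp\left[ \frac{1}{2} \left( \sigma_X (c_1 + it) + \rho z_2 \sigma_Y \right)^2 \right] dt \\ &= \frac{e^{c_1 (x - \mu_X)}}{2 \pi} \int_{-\infty}^{\infty} e^{it (x - \mu_X)} \exp\left[ - \frac{\sigma_X^2}{2} \left( t - i \left( c_1 + \rho z_2 \frac{\sigma_Y}{\sigma_X} \right) \right)^2 \right] dt \end{aligned} \]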
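Since the integrand is an entire function, the contour can be shifted, which allows substitution as in real analysis. Thus, let $u = t - i \left( c_1 + \rho z_2 \frac{\sigma_Y}{\sigma_X} \right)$, then
\[ \begin{aligned} L &= \frac{e^{c_1 (x - \mu_X)}}{2 \pi} \int_{-\infty}^{\infty} e^{i (x - \mu_X) \left[ u + i \left( c_1 + \rho z_2 \frac{\sigma_Y}{\sigma_X} \right) \right]} \exp\left[ - \frac{\sigma_X^2}{2} u^2 \right] du \\ &= \frac{1}{2 \pi} e^{- \rho z_2 \frac{\sigma_Y}{\sigma_X} (x - \mu_X)} \int_{-\infty}^{\infty} \exp\left[ - \frac{\sigma_X^2}{2} \left( u^2 - 2 i u \frac{x - \mu_X}{\sigma_X^2} \right) \right] du \\ &= \frac{1}{2 \pi} e^{- \rho z_2 \frac{\sigma_Y}{\sigma_X} (x - \mu_X)} \exp\left[ - \frac{(x - \mu_X)^2}{2 \sigma_X^2} \right] \int_{-\infty}^{\infty} \exp\left[ - \frac{\sigma_X^2}{2} \left( u - i \frac{x - \mu_X}{\sigma_X^2} \right)^2 \right] du \end{aligned} \]
By substituting $t = u - i \frac{x - \mu_X}{\sigma_X^2}$, the integral becomes
\[ \int_{-\infty}^{\infty} \exp\left[ - \frac{\sigma_X^2}{2} t^2 \right] dt = 2 \int_{0}^{\infty} \exp\left[ - \frac{\sigma_X^2}{2} t^2 \right] dt = \frac{\sqrt{2}}{\sigma_X} \int_{0}^{\infty} e^{-w} w^{-1/2} \, dw = \frac{\sqrt{2}}{\sigma_X} \Gamma\left( \frac{1}{2} \right) = \frac{\sqrt{2 \pi}}{\sigma_X} \]
where we have used the Gamma function (Abramowitz and Stegun, 1968, Chapter 6). Hence,
\[ L = \frac{e^{- \rho z_2 \frac{\sigma_Y}{\sigma_X} (x - \mu_X)}}{\sigma_X \sqrt{2 \pi}} \exp\left[ - \frac{(x - \mu_X)^2}{2 \sigma_X^2} \right] \]
and
\[ f_{XY}(x, y; \rho) = \frac{\exp\left[ - \frac{(x - \mu_X)^2}{2 \sigma_X^2} \right]}{\sigma_X \sqrt{2 \pi}} \, \frac{1}{2 \pi i} \int_{c_2 - i\infty}^{c_2 + i\infty} e^{\frac{\sigma_Y^2 (1 - \rho^2)}{2} z_2^2} \, e^{- z_2 \left( \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X) \right)} \, e^{z_2 y} \, dz_2 \]
The last integral is recognized with (3.22) as the inverse Laplace transform of a Gaussian with variance $\sigma^2 = \sigma_Y^2 (1 - \rho^2)$ and mean $\mu = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X)$. Thus
\[ f_{XY}(x, y; \rho) = \frac{\exp\left[ - \frac{(x - \mu_X)^2}{2 \sigma_X^2} \right]}{\sigma_X \sqrt{2 \pi}} \, \frac{\exp\left[ - \frac{\left( y - \mu_Y - \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X) \right)^2}{2 \sigma_Y^2 (1 - \rho^2)} \right]}{\sigma_Y \sqrt{1 - \rho^2} \sqrt{2 \pi}} \]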
which finally leads to the joint Gaussian density function (4.4). Hence, the linear combination method leads to exact results for Gaussian random variables.
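The factorization derived above (a Gaussian in x multiplied by a conditional Gaussian in y) suggests a simple way to draw correlated Gaussian pairs. The sketch below is an illustration under stated assumptions: the function names, the parameter values and the seed are hypothetical, and the sampler follows the conditional mean and variance found in the derivation rather than any routine given in the text. It draws x first, then y from the conditional Gaussian, and checks the sample correlation coefficient.

```python
import math
import random


def correlated_gaussian_pair(mu_x, sigma_x, mu_y, sigma_y, rho, rng):
    """Draw (x, y): x is Gaussian, y is Gaussian conditional on x."""
    x = rng.gauss(mu_x, sigma_x)
    conditional_mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    conditional_std = sigma_y * math.sqrt(1.0 - rho * rho)
    y = rng.gauss(conditional_mean, conditional_std)
    return x, y


def sample_correlation(pairs):
    """Sample linear correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
rho = 0.6
pairs = [correlated_gaussian_pair(1.0, 2.0, -1.0, 0.5, rho, rng) for _ in range(20000)]
estimate = sample_correlation(pairs)
print(rho, estimate)
```

With 20000 pairs the estimate typically agrees with the target correlation to about two decimals.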
C.3 Poisson process (Chapter 7)

(i) Let $Y$ be a binomial random variable with parameters $N$ and $p$, where $N$ is a Poisson random variable with parameter $\lambda$. The probability density function of $Y$ is obtained by applying the law of total probability (2.46),
\[ \Pr[Y = k] = \sum_{n=0}^{\infty} \Pr[Y = k | N = n] \Pr[N = n] \]
With (3.3) and (3.9), we have
\[ \begin{aligned} \Pr[Y = k] &= \sum_{n=0}^{\infty} \binom{n}{k} p^k q^{n-k} \frac{\lambda^n}{n!} e^{-\lambda} = \frac{p^k}{k!} e^{-\lambda} \sum_{n=k}^{\infty} \frac{q^{n-k} \lambda^n}{(n - k)!} \\ &= \frac{(\lambda p)^k}{k!} e^{-\lambda} \sum_{n=0}^{\infty} \frac{q^n \lambda^n}{n!} = \frac{(\lambda p)^k}{k!} e^{-\lambda + \lambda q} \end{aligned} \]
Since $q = 1 - p$, we arrive at $\Pr[Y = k] = \frac{(\lambda p)^k}{k!} e^{-\lambda p}$, which means that $Y$ is a Poisson random variable with mean $\lambda p$. If a sufficient sample of test strings defined above is sent and received, the average number of "one bits" at the receiver divided by the average number of bits at the sender gives the probability $p$ (if errors occur indeed independently).

(ii) Since the counting process of a sum of Poisson processes is again a Poisson counting process with rate equal to $\sum_{j=1}^{4} \lambda_j$, the average number of packets of the four classes in the router's buffers during interval $T$ is $\lambda = T \sum_{j=1}^{4} \lambda_j$. Hence, the probability density function for the total number $N$ of arrivals is $\Pr[N = n] = \frac{\lambda^n}{n!} e^{-\lambda}$.

(iii) Theorem 7.3.4 states that the sum $X(t) = X_1(t) + X_2(t)$ is a Poisson counting process with rate $\lambda_1 + \lambda_2$. Then,
\[ \begin{aligned} \Pr[X_1(t) = 1 | X(t) = 1] &= \frac{\Pr[\{X_1(t) = 1\} \cap \{X(t) = 1\}]}{\Pr[X(t) = 1]} = \frac{\Pr[\{X_1(t) = 1\} \cap \{X_2(t) = 0\}]}{\Pr[X(t) = 1]} \\ &= \frac{\Pr[X_1(t) = 1] \Pr[X_2(t) = 0]}{\Pr[X(t) = 1]} = \frac{\lambda_1}{\lambda_1 + \lambda_2} \end{aligned} \]
since the Poisson random variables $X_1$ and $X_2$ are independent. As an application we can consider a Poisson arrival flow of packets at a router with rate $\lambda$. If the packets are marked randomly with probability $p = \frac{\lambda_1}{\lambda}$, the resulting flow consists of two types, those marked and those not. Each of these flows is again a Poisson flow, the marked flow with rate $\lambda_1 = p \lambda$ and the non-marked flow with $\lambda_2 = (1 - p) \lambda$. Actually, this procedure leads to a decomposition of the Poisson process into two independent Poisson processes and leads to the reverse of Theorem 7.3.4.

(iv) (a) Applying the solution of the previous exercise immediately gives $\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}$.
(b) Since the three Poisson processes are independent, the total number of cars on the three lanes, denoted by $X$, is also a Poisson process (Theorem 7.3.4) with rate $\lambda = \lambda_1 + \lambda_2 + \lambda_3$. Hence, $\Pr[X = n] = \frac{\lambda^n}{n!} e^{-\lambda}$.
(c) Let us denote the Poisson process in lane $j$ by $X_j$. Then, using the independence between the $X_j$,
\[ \Pr[X_1 = n, X_2 = 0, X_3 = 0] = \Pr[X_1 = n] \Pr[X_2 = 0] \Pr[X_3 = 0] = \frac{\lambda_1^n}{n!} e^{-\lambda_1} e^{-\lambda_2} e^{-\lambda_3} = \frac{\lambda_1^n}{n!} e^{-\lambda} \]
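The result of solution (i), that a binomially thinned Poisson number is again Poisson with mean equal to the product of the Poisson mean and the marking probability, can be checked numerically. In the sketch below the parameter values and the truncation point of the infinite sum are assumptions chosen for illustration. It evaluates the law of total probability term by term and compares the sum with the closed-form Poisson probability.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a Poisson random variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def thinned_pmf(k, lam, p, n_max=100):
    """Law of total probability, truncated at n_max terms (an assumption of this sketch)."""
    return sum(binomial_pmf(k, n, p) * poisson_pmf(n, lam) for n in range(k, n_max))


lam, p = 4.0, 0.3
for k in range(6):
    print(k, thinned_pmf(k, lam, p), poisson_pmf(k, lam * p))
```

The truncation is harmless here because the Poisson probabilities decay faster than geometrically beyond the mean.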
(v) (a) The player relies on the fact that during the remaining time $(s, T)$ there is exactly one arrival. Since the game rules mention that he should identify the last signal in $(0, T)$, signals arriving during $(0, s)$ do not influence his chance to win because of the memoryless property of the Poisson process. The number of arrivals in the interval $(s, T)$ obeys a Poisson distribution with parameter $\lambda (T - s)$. The probability that precisely one signal arrives in the interval $(s, T)$ is $\Pr[N(T) - N(s) = 1] = \lambda (T - s) e^{-\lambda (T - s)}$.
(b) Maximizing this winning probability with respect to $s$ (by equating the first derivative to zero) yields
\[ \frac{d}{ds} \Pr[N(T) - N(s) = 1] = -\lambda e^{-\lambda (T - s)} + \lambda^2 (T - s) e^{-\lambda (T - s)} = 0 \]
with solution $\lambda (T - s) = 1$ or $s = T - 1/\lambda$. This maximum (which is readily verified by checking that $\frac{d^2}{ds^2} \Pr[N(T) - N(s) = 1] < 0$) lies inside the allowed interval $(0, T)$ provided $\lambda T > 1$. The maximum probability of winning is $\Pr[N(T) - N(T - 1/\lambda) = 1] = 1/e$.
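The optimum in solution (v) can be confirmed with a brute-force search. The sketch below is illustrative only: the values of the horizon, the rate and the grid resolution are assumptions. It evaluates the winning probability on a fine grid of starting times and reports the best one.

```python
import math


def win_probability(s, T, lam):
    """Probability of exactly one Poisson arrival in the interval (s, T)."""
    return lam * (T - s) * math.exp(-lam * (T - s))


T, lam = 10.0, 0.5                      # hypothetical horizon and arrival rate
grid = [T * i / 100000 for i in range(100001)]
best_s = max(grid, key=lambda s: win_probability(s, T, lam))
best_p = win_probability(best_s, T, lam)

print(best_s, T - 1.0 / lam)            # numerical and analytical optimum
print(best_p, 1.0 / math.e)             # maximum winning probability
```

The maximum value does not depend on the rate or the horizon, as long as the optimal starting time falls inside the allowed interval.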
(vi) (a) We apply the general formula (7.1) for the pdf of a Poisson process with mean $E[X(t)] = \lambda t = 1$. Then, $\Pr[X(t + s) - X(s) = 0] = e^{-\lambda t} = \frac{1}{e}$.
(b) $\Pr[X(t + s) - X(s) > 10] = 1 - \Pr[X(t + s) - X(s) \le 10] = 1 - \frac{1}{e} \sum_{k=0}^{10} \frac{1}{k!}$.
(c) Each minute is equally probable as follows from Theorem 7.3.3.

(vii) This exercise is an application of randomly marking in a Poisson flow as explained in solution (iii) above. The total flow of packets can be split up into an ACK stream, a Poisson process $N_1$ with rate $p \lambda = 3 \, \mathrm{s}^{-1}$, and a data flow, an independent Poisson process $N_2$ with rate $(1 - p) \lambda = 7 \, \mathrm{s}^{-1}$. Then,
(a) $\Pr[N_1 \ge 1] = 1 - \Pr[N_1 = 0] = 1 - e^{-3}$.
(b) The average number is $E[N_1 + N_2 | N_1 = 5] = E[N_1 | N_1 = 5] + E[N_2 | N_1 = 5] = 5 + E[N_2] = 5 + 7 = 12$ packets.
(c)
\[ \Pr[N_1 = 2 | N_1 + N_2 = 8] = \frac{\Pr[N_1 = 2, N_1 + N_2 = 8]}{\Pr[N_1 + N_2 = 8]} = \frac{\frac{3^2 e^{-3}}{2!} \frac{7^6 e^{-7}}{6!}}{\frac{10^8 e^{-10}}{8!}} \approx 29.65\% \]

(viii) (a) Since the three Poisson arrival processes are independent, the total number of requests will also be a Poisson process with the parameter $\lambda = \lambda_1 + \lambda_2 + \lambda_3 = 20$ requests/hour (Theorem 7.3.4). The expected number of requests during an 8-hour working day is $E[N] = \lambda t = 20 \times 8 = 160$ requests.
(b) If we denote arrival processes of requests with different ADSL problems each with a random variable $X_i$ for $i = 1, 2$ and $3$, then due to their mutual independence
\[ \Pr[X_1 = 0, X_2 = k, X_3 = 0] = \Pr[X_1 = 0] \Pr[X_2 = k] \Pr[X_3 = 0] = e^{-\lambda_1 t} \frac{(\lambda_2 t)^k}{k!} e^{-\lambda_2 t} e^{-\lambda_3 t} \]
from which
\[ \Pr[X_1 = 0, X_2 = 3, X_3 = 0] = e^{-\frac{8}{3}} \frac{\left( \frac{6}{3} \right)^3}{3!} e^{-\frac{6}{3}} e^{-\frac{6}{3}} = 1.7 \times 10^{-3} \]
(c) If we denote the total number of requests by $X$ then $\Pr[X = 0] = e^{-\lambda t} = e^{-\frac{20}{4}} = 6.7 \times 10^{-3}$.
(d) The precise time is irrelevant for Poisson processes, only the duration of the interval matters. Here intervals are overlapping and we need to compute the probability
\[ \begin{aligned} p &= \Pr[\{X(0.2) = 1\} \cap \{X(0.5) - X(0.1) = 2\}] \\ &= \sum_{k=0}^{1} \Pr[\{X(0.1) = k\} \cap \{X(0.2) - X(0.1) = 1 - k\} \cap \{X(0.5) - X(0.2) = 1 + k\}] \\ &= \sum_{k=0}^{1} \Pr[X(0.1) = k] \Pr[X(0.2) - X(0.1) = 1 - k] \Pr[X(0.5) - X(0.2) = 1 + k] \\ &= \sum_{k=0}^{1} e^{-2} \frac{2^k}{k!} \, e^{-2} \frac{2^{1-k}}{(1 - k)!} \, e^{-6} \frac{6^{1+k}}{(1 + k)!} = 48 e^{-10} = 2.18 \times 10^{-3} \end{aligned} \]
(e) Given that at the moment $t + s$ there are $k + m$ requests, the probability that there were $k$ requests at the moment $t$ is
\[ \begin{aligned} \Pr[X(t) = k | X(t + s) = k + m] &= \frac{\Pr[\{X(t) = k\} \cap \{X(t + s) = k + m\}]}{\Pr[X(t + s) = k + m]} = \frac{\Pr[X(t) = k] \Pr[X(t + s) - X(t) = m]}{\Pr[X(t + s) = k + m]} \\ &= \frac{\frac{(\lambda t)^k e^{-\lambda t}}{k!} \frac{(\lambda s)^m e^{-\lambda s}}{m!}}{\frac{(\lambda (t + s))^{k+m} e^{-\lambda (t + s)}}{(k + m)!}} = \binom{k + m}{k} \left( \frac{t}{t + s} \right)^k \left( \frac{s}{t + s} \right)^m \end{aligned} \]
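The numbers in solutions (viii)(d) and (viii)(e) can be reproduced with a few lines of arithmetic. The sketch below is illustrative: the helper names and the particular values chosen for k, m, t and s in part (e) are assumptions. It sums the two terms of (d) over the disjoint intervals of length 0.1, 0.1 and 0.3 hours and compares the conditional probability in (e) with the binomial expression.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a Poisson random variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


lam = 20.0  # requests per hour, as in solution (viii)(a)

# (d): split into the disjoint intervals (0, 0.1), (0.1, 0.2) and (0.2, 0.5).
p_d = sum(
    poisson_pmf(k, lam * 0.1) * poisson_pmf(1 - k, lam * 0.1) * poisson_pmf(1 + k, lam * 0.3)
    for k in (0, 1)
)


def conditional(k, m, t, s, rate):
    """Pr[X(t) = k | X(t + s) = k + m] from the independent increments."""
    return (poisson_pmf(k, rate * t) * poisson_pmf(m, rate * s)
            / poisson_pmf(k + m, rate * (t + s)))


def binomial_form(k, m, t, s):
    """Closed form of (e): binomial with success probability t / (t + s)."""
    return math.comb(k + m, k) * (t / (t + s)) ** k * (s / (t + s)) ** m


print(p_d, 48 * math.exp(-10))
print(conditional(2, 3, 1.0, 0.5, lam), binomial_form(2, 3, 1.0, 0.5))
```

The closed form in (e) does not contain the rate, which the comparison confirms for the chosen values.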
(ix) (a) The number of attacks arriving at the PC is a Poisson random variable $X(t)$ with rate $\lambda = 6$. The probability that exactly one ($k = 1$) attack arrives during one ($t = 1$) hour follows from (7.1) as $\Pr[X(1) = 1] = 6 e^{-6}$.
(b) Applying (7.2), the expected amount of time that the PC has been on is $t = \frac{E[X(t)]}{\lambda} = \frac{60}{6} = 10$ hours.
(c) The arrival time of the fifth attack is denoted by $T$. Given that there are six attacks in one hour ($t = 1$), we compute the probability $\Pr[T < t | X(1) = 6]$ that either five attacks arrive in the interval $(0, t)$ and one arrives in $(t, 1)$ or all six attacks arrive in $(0, t)$ and none arrives in the interval $(t, 1)$. Hence, for $0 \le t < 1$,
\[ \begin{aligned} F_T(t) &= \Pr[T < t | X(1) = 6] \\ &= \frac{\Pr[\{X(t) = 5\} \cap \{X(1) = 6\}] + \Pr[\{X(t) = 6\} \cap \{X(1) = 6\}]}{\Pr[X(1) = 6]} \\ &= \frac{\Pr[X(t) = 5] \Pr[X(1) - X(t) = 1] + \Pr[X(t) = 6] \Pr[X(1) - X(t) = 0]}{\Pr[X(1) = 6]} \\ &= \frac{\frac{(\lambda t)^5}{5!} e^{-\lambda t} \, \lambda (1 - t) e^{-\lambda (1 - t)} + \frac{(\lambda t)^6}{6!} e^{-\lambda t} e^{-\lambda (1 - t)}}{\frac{\lambda^6}{6!} e^{-\lambda}} = 6 t^5 - 5 t^6 \end{aligned} \]
The probability that the fifth attack will arrive between 1:30 p.m. and 2 p.m. is $F_T(1) - F_T\left( \frac{1}{2} \right) = 1 - \frac{7}{64} = \frac{57}{64}$.
(d) The expectation of $T$ given $X(1) = 6$ follows from (2.33) as $E[T | X(1)] = \int_0^1 x f_T(x) \, dx$ where $f_T(t) = \frac{dF_T(t)}{dt}$ is derived in (c). Alternatively, the expectation can be computed from (2.35), $E[T | X(1)] = \int_0^1 \left( 1 - (6 x^5 - 5 x^6) \right) dx = \frac{5}{7}$. Hence the expected arrival time of the fifth attack between 1 p.m. and 2 p.m. is about 1:43 p.m.

(x) Let $X$ and $X_j$ denote the lifetime of the system and of subsystem $j$ respectively. For a series of subsystems with independent lifetimes $X_j$, the event $\{X > t\} = \cap_{j=1}^{n} \{X_j > t\}$ and $\Pr[X > t] = \prod_{j=1}^{n} \Pr[X_j > t]$. Recall with (3.32) that $\Pr[X > t] = \Pr\left[ \min_{1 \le j \le n} X_j > t \right]$. Using the definition of the reliability function (7.5) then yields
\[ R_{\text{series}}(t) = \prod_{j=1}^{n} R_j(t) \]

(xi) The probability that the system $S$ shown in Fig. 7.6 fails is determined by the subsystem with longest lifetime or $X = \max_{1 \le j \le n} X_j$. Invoking relation (3.33) combined with the definition of the reliability function (7.5) leads to
\[ R_{\text{parallel}}(t) = 1 - \prod_{j=1}^{n} (1 - R_j(t)) \]
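The two reliability formulas of solutions (x) and (xi) translate directly into code. The sketch below is an illustration under an added assumption: the subsystem lifetimes are taken to be exponential with hypothetical failure rates, which the solutions above do not require. It computes the series and parallel reliability at one time instant and compares the series result with the closed form that holds for exponential lifetimes.

```python
import math


def series_reliability(reliabilities):
    """The system works only if every subsystem works (minimum lifetime)."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """The system fails only if every subsystem fails (maximum lifetime)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Assumption for illustration: exponential lifetimes with these failure rates.
rates = [0.5, 1.0, 2.0]
t = 0.4
R = [math.exp(-rate * t) for rate in rates]

r_series = series_reliability(R)
r_parallel = parallel_reliability(R)
print(r_series, math.exp(-sum(rates) * t), r_parallel)
```

For exponential lifetimes the series system is again exponential with the summed failure rate, which gives a quick consistency check; the parallel system is always at least as reliable as its best subsystem.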
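C.4 Renewal theory (Chapter 8)

(i) The equivalence $\{N(t) \ge n\} \Longleftrightarrow \{W_n \le t\}$ indicates
\[ \begin{aligned} \Pr\left[ W_{N(t)} \le x \right] &= \sum_{n=0}^{\infty} \Pr[\{W_n \le x\} \cap \{W_{n+1} > t\}] \\ &= \Pr[W_0 \le x, W_1 > t] + \sum_{n=1}^{\infty} \Pr[W_n \le x, W_{n+1} > t] \end{aligned} \]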
The convention $W_0 = 0$ reduces $\Pr[W_0 \le x, W_1 > t] = \Pr[W_1 > t] = \Pr[\tau_1 > t] = 1 - F(t)$. Furthermore, by the law of total probability,
\[ \begin{aligned} \Pr[W_n \le x, W_{n+1} > t] &= \int_0^{\infty} \Pr[W_n \le x, W_{n+1} > t | W_n = u] \, \frac{d \Pr[W_n \le u]}{du} \, du \\ &= \int_0^{x} \Pr[W_{n+1} > t | W_n = u] \, d \Pr[W_n \le u] \end{aligned} \]
A renewal process restarts after each renewal from scratch (due to the stationarity and the independent increments of the renewal process). This implies that $\Pr[W_{n+1} > t | W_n = u] = \Pr[\tau_{n+1} > t - u] = 1 - F(t - u)$ because the interarrival times are i.i.d. random variables. Combined,
\[ \begin{aligned} \Pr\left[ W_{N(t)} \le x \right] &= \Pr[\tau > t] + \sum_{n=1}^{\infty} \int_0^{x} \Pr[\tau > t - u] \, d \Pr[W_n \le u] \\ &= \Pr[\tau > t] + \int_0^{x} \Pr[\tau > t - u] \, d \left( \sum_{n=1}^{\infty} \Pr[W_n \le u] \right) \end{aligned} \]
With the basic equivalence (8.6) and the definition (8.7) of the renewal function $m(t)$, we arrive at
\[ \Pr\left[ W_{N(t)} \le x \right] = \Pr[\tau > t] + \int_0^{x} \Pr[\tau > t - u] \, dm(u) \]
This equation holds for all $x$. If $x = t$, we can use the renewal equation,
\[ \int_0^{t} \Pr[\tau > t - u] \, dm(u) = m(t) - \int_0^{t} \Pr[\tau \le t - u] \, dm(u) = m(t) - m(t) + F(t) \]
which indeed confirms $\Pr\left[ W_{N(t)} \le t \right] = 1$.

(ii) The generating function of the number of renewals in the interval $[0, t]$ is with (8.10)
\[ \begin{aligned} \varphi_{N(t)}(z) = E\left[ z^{N(t)} \right] &= \sum_{k=0}^{\infty} \Pr[N(t) = k] z^k \\ &= \Pr[N(t) = 0] + \int_0^{t} \left( \sum_{k=1}^{\infty} \Pr[N(t - s) = k - 1] z^k \right) f(s) \, ds \\ &= \Pr[N(t) = 0] + z \int_0^{t} \left( \sum_{k=0}^{\infty} \Pr[N(t - s) = k] z^k \right) f(s) \, ds \\ &= \Pr[N(t) = 0] + z \int_0^{t} \varphi_{N(t-s)}(z) \, dF(s) \end{aligned} \]
From (8.6), we have that $\Pr[N(t) = 0] = 1 - F(t)$ and
\[ \varphi_{N(t)}(z) = 1 - F(t) + z \int_0^{t} \varphi_{N(t-s)}(z) \, dF(s) \]
By derivation with respect to $z$, we arrive at the differential-integral equation for the derivative of the generating function,
\[ \begin{aligned} \varphi'_{N(t)}(z) &= \int_0^{t} \varphi_{N(t-s)}(z) \, dF(s) + z \int_0^{t} \varphi'_{N(t-s)}(z) \, dF(s) \\ &= \frac{\varphi_{N(t)}(z) - 1 + F(t)}{z} + z \int_0^{t} \varphi'_{N(t-s)}(z) \, dF(s) \end{aligned} \]
which reduces to the renewal equation (8.9) for $z = 1$ since $\varphi'_{N(t)}(1) = m(t)$. The second derivative
\[ \begin{aligned} \varphi''_{N(t)}(z) &= 2 \int_0^{t} \varphi'_{N(t-s)}(z) \, dF(s) + z \int_0^{t} \varphi''_{N(t-s)}(z) \, dF(s) \\ &= \frac{2}{z} \varphi'_{N(t)}(z) - \frac{2}{z} \int_0^{t} \varphi_{N(t-s)}(z) \, dF(s) + z \int_0^{t} \varphi''_{N(t-s)}(z) \, dF(s) \end{aligned} \]
evaluated at $z = 1$, is
\[ \varphi''_{N(t)}(1) = 2 m(t) - 2 F(t) + \int_0^{t} \varphi''_{N(t-s)}(1) \, dF(s) \]
The variance $\mathrm{Var}[N(t)]$ follows from (2.27) as
\[ \begin{aligned} \mathrm{Var}[N(t)] &= \varphi''_{N(t)}(1) + \varphi'_{N(t)}(1) - \left( \varphi'_{N(t)}(1) \right)^2 \\ &= 3 m(t) - m^2(t) - 2 F(t) + \int_0^{t} \varphi''_{N(t-s)}(1) \, dF(s) \end{aligned} \]

(iii) Every time an IP packet is launched by TCP, a renewal occurs and the reward is that 2000 km are travelled in each renewal, thus $R_n = 2000$ km. The speed in a trip that suffers from congestion is, on average, 40 000 km/s, while the speed without congestion is 120 000 km/s. Since congestion only occurs in 1/5 of the cases, the average length (in s) of a renewal period is
\[ E[\tau] = \frac{2000}{120000} \times \frac{4}{5} + \frac{2000}{40000} \times \frac{1}{5} = \frac{7}{300} \]
The average speed of an IP packet (in km/s) then follows from (8.20) as lim
w