245 41 1MB
English Pages 148 Year 2016
Vidyadhar S. Mandrekar Weak Convergence of Stochastic Processes De Gruyter Graduate
Also of Interest Numerical Analysis of Stochastic Processes Wolf-Jürgen Beyn, Raphael Kruse, 2017 ISBN 978-3-11-044337-0, e-ISBN (PDF) 978-3-11-044338-7, e-ISBN (EPUB) 978-3-11-043555-9
Stochastic Finance. An Introduction in Discrete Time Hans Föllmer, Alexander Schied, 2016 ISBN 978-3-11-046344-6, e-ISBN (PDF) 978-3-11-046345-3, e-ISBN (EPUB) 978-3-11-046346-0
Stochastic Calculus of Variations: For Jump Processes, 2nd Ed. Yasushi Ishikawa, 2016 ISBN 978-3-11-037776-7, e-ISBN (PDF) 978-3-11-037807-8, e-ISBN (EPUB) 978-3-11-039232-6 Probability Theory and Statistical Applications. A Profound Treatise for Self-Study Peter Zörnig, 2016 ISBN 978-3-11-036319-7, e-ISBN (PDF) 978-3-11-040271-1, e-ISBN (EPUB) 978-3-11-040283-4 Probability Theory. A First Course in Probability Theory and Statistics Werner Linde, 2016 ISBN 978-3-11-046617-1, e-ISBN (PDF) 978-3-11-046619-5, e-ISBN (EPUB) 978-3-11-046625-6
www.degruyter.com
Vidyadhar S. Mandrekar
Weak Convergence of Stochastic Processes | With Applications to Statistical Limit Theorems
Mathematics Subject Classification 2010 60B20, 60F17 Author Vidyadhar S. Mandrekar Michigan State University Department of Statistics and Probability East Lansing MI 48824 USA [email protected]
ISBN 978-3-11-047542-5 e-ISBN (PDF) 978-3-11-047631-6 e-ISBN (EPUB) 978-3-11-047545-6 Library of Congress Cataloging-in-Publication Data A CIP catalog record for this book has been applied for at the Library of Congress. Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de. © 2016 Walter de Gruyter GmbH, Berlin/Munich/Boston Typesetting: Compuscript Ltd, Ireland Printing and binding: CPI books GmbH, Leck Cover image: 123dartist/iStock/thinkstock ♾ Printed on acid-free paper Printed in Germany www.degruyter.com
Contents 1 Weak convergence of stochastic processes | 1 Introduction | 1 2 Weak convergence in metric spaces | 4 2.1 Cylindrical measures | 4 2.2 Kolmogorov consistency theorem | 6 2.3 The finite dimensional family for Brownian motion | 9 2.4 Properties of Brownian motion | 11 2.5 Kolmogorov continuity theorem | 14 2.6 Exit time for Brownian motion and Skorokhod theorem | 16 2.6.1 Skorokhod theorem | 19 2.7 Embedding of sums of i.i.d. random variable in Brownian motion | 21 2.8 Donsker’s theorem | 23 2.9 Empirical distribution function | 25 2.10 Weak convergence of probability measure on Polish space | 29 2.10.1 Prokhorov theorem | 34 2.11 Tightness and compactness in weak convergence | 36 3 3.1 3.1.1 3.2 3.3 3.4 3.4.1 3.4.2 3.5 3.6 3.7 3.8 3.8.1 3.8.2 3.8.3 3.8.4
Weak convergence on C[0, 1] and D[0, ∞) | 41 Structure of compact sets in C[0, 1] | 41 Arzela-Ascoli theorem | 41 Invariance principle of sums of i.i.d. random variables | 45 Invariance principle for sums of stationary sequences | 48 Weak convergence on the Skorokhod space | 50 The space D[0, 1] | 50 Skorokhod topology | 52 Metric of D[0, 1] to make it complete | 53 Separability of the Skorokhod space | 56 Tightness in the Skorokhod space | 58 The space D[0, ∞) | 60 Separability and completeness | 64 Compactness | 64 Tightness | 66 Aldous’s tightness criterion | 67
4
Central limit theorem for semi-martingales and applications | 70 Local characteristics of semi-martingale | 70 Lenglart inequality | 72
4.1 4.2
vi | Contents
4.3 4.4 4.5
Central limit theorem for semi-martingale | 76 Application to survival analysis | 81 ̂ and Kaplan-Meier estimate | 85 Asymptotic distribution of 𝛽(t)
5
Central limit theorems for dependent random variables | 89
6 6.1 6.2 6.3 6.4 6.4.1 6.4.2 6.5
Empirical process | 110 Spaces of bounded functions | 114 Maximal inequalities and covering numbers | 119 Sub-Gaussian inequalities | 125 Symmetrization | 126 Glivenko-Cantelli theorems | 129 Donsker theorems | 131 Lindberg-type theorem and its applications | 133
Bibliography | 142
1 Weak convergence of stochastic processes Introduction The study of limit theorems in probability has been important in inference in statistics. The classical limit theorems involved methods of characteristic functions, in other words, Fourier transform. One can see this in the recent book of Durrett [8]. To study the weak convergence of stochastic processes, one may be tempted to create concepts of Fourier’s theory in infinite dimensional cases. However, such an attempt fails, as can be seen from the work of Sazanov and Gross [14]. The problem was also solved by Donsker [5] by choosing techniques of weak convergence of measures on the space of continuous functions by interpolating the partial sum in the central limit theorem. If we denote for X1 , ..., X n i.i.d. (independent identically distributed random variable with zero expectation and finite variance) S n,m = X1 + . . . + X m/√n , where m = 1, 2, . . . , n, and S n,0 = 0. One can define S n,u = {
S n,m if u = 0, 1, . . . n linear if u𝜖[m − 1, m]
Then, the continuous process S[n,t] , t𝜖[0, 1] generates a sequence of measures on C[0, 1] if we can prove, with appropriate definition P n converges weakly to the measure P given by Brownian motion on C[0, 1], we can conclude that max |S n,t | converges to max |W(t)| 0≤t≤1
0≤t≤1
and Rn 1 = + max S − min S √n √n 1≤m≤n n,m 0≤t≤1 n,m converges to max W(t) − min W(t), 1≤t≤1
0≤t≤1
giving more information than the classical central theorem. One can also prove empirical distribution function, if properly interpolated, F ̂n (x) =
1 n ∑ 1(x i ≤ x), x𝜖IR, n 1
2 | 1 Weak convergence of stochastic processes
satisfies √n sup |F ̂n (x) − F(x)|, x
which converges to the maximum of the Brownian bridge max0≤t≤1 |W(t) − W(1)|, justifying the Kolmogorov-Smirnov statistics for F continuous distribution function of X1 . We prove these results using convergence in probability in C[0, 1] space of S[n.] to W[.] in the supremum distance in C[0, 1]. For this, we used the embedding theorem of Skorokhod [23]. We follow [8] for the proof of this theorem of Skorokhod. As we are interested in the convergence of semi-martingales to prove the statistical limit theorems for censored data that arise in clinical trials, we introduce the following Billingsley [2] convergence in a separable metric (henceforth, Polish space). Here we show the compactness of the sequences of probability measures on Polish space using the so-called tightness condition of Prokhorov [22], which connects compactness of measures with a compact set having large measures for the whole sequence of probability measures. We then consider the form of compact sets in C[0, 1] using the Arzela-Ascoli theorem. We also consider the question if one can define on the space of functions with jump discontinuity D[0, T] a metric to make it a Polish space. Here we follow Billingsley [2] to present the so-called Skorokhod topology on D[0, T] and D[0, ∞). We then characterize the compact sets in these space to study the tightness of a sequence of probability measures. Then we use a remarkable result of Aldous [1] to consider compactness in terms of stopping times. This result is then exploited to study the weak convergence of semi-martingales in D[0, ∞). Following Durrett and Resnick [9], we then generalize the Skorokhod embedding theorem for sums of dependent random variables. This allows us to extend weak convergence result, as in Chapter 2, to martingale differences [3]. Using the work of Gordin [13], one can reduce a similar theorem for stationary sequences to that for martingale differences. If one observes that empirical measures take values in D[0, 1] and we try to use the maximum norm, we get a nonseparable space. Thus, we can ask the question, “can one study weak convergence of stochastic processes taking values in non-separable metric spaces? This creates a problem with measurability as one can see from Dudley-Phillip’s [7] work presented in [18]. To handle these problems, we study the so-called empirical processes following Van der Vaart and Wellner [25]. We introduce covering numbers, symmetrization, and sub-Gaussian inequalities to find conditions for weak convergence of measures. We begin chapter 2 by constructing a measure on IRT to get a stochastic process. We then construct a Gaussian process with given covariance. We obtain sufficient conditions on the moments of a process indexed by T = IR to have a continuous version. Using this, we construct Brownian motion. Then we prove the Skorokhod embedding theorem for sums of independent random variables. It is then exploited to obtain a convergence in C[0, 1] of the functions of the sum of independent random variables as described earlier. As a consequence of the central limit theorem in C[0, 1],
1 Weak convergence of stochastic processes | 3
we prove using [20] the results of [11] and the convergence of symmetric statistics. The chapter ends by giving a weak convergence of probability measures and Prokhorov’s tightness result on a Polish space. In chapter 3, we study compact sets in C[0, 1] and use them to get alternate proof of theorems similar to the one in chapter 2 using weak convergence in Polish spaces. This is followed by studying the topology of Skorokhod on D[0, T] and D[0, ∞) and proving that these are Polish spaces. Again, we study the compact sets and prove the result of Aldous. Chapter 4 studies the weak convergence of semi-martingales, which requires Lenglart [16] inequality to prove compactness using the result of Aldous. As a consequence of this, we derive weak convergence of Nelson and Kaplan-Meier estimates by simplifying the proof of Gill [12]. We do not present convergence of the Susarla-Van Ryzin Bayes estimate [24], but it can be obtained by similar methods as shown in [4]. For convergence of Linden-Bell estimates arising in astronomy, see [21] where similar techniques are used. Chapter 5 considers limit theorems as in chapter 2 using generalization of Skorokhod theorem from [9]. Limit theorems in chapters 2 and 5 use techniques given in Durrett’s book [8]. Chapter 3 follows the presentation in the book of Billingsley [2], and chapter 4 uses the simplified version of techniques in [17] (see also [15]). We present in the last chapter the convergence of empirical processes using the techniques mentioned above taken from [25].
Acknowledgments The first version of this presentation was typed by Mr. J. Kim. Professor Jayant Deshpande encouraged me to write it as a book to make it available to wider audiences. Professor U. Naik-Nimbalkar read the final version. I thank all of them for their help in preparing the text. Finally, I thank Dr. K. Kieling for suggesting the format presented here.
2 Weak convergence in metric spaces We begin in this chapter the process of associating a probability measure on the function space IRT for any set T given a family of probability measures on IRS , S ⊆ T finite set with certain conditions. This allows us to define a family of real random variables {X t , t𝜖T} on a probability space (Ω, ℱ, P). Such a family is referred to as a stochastic process. This idea originated from the work of Wiener (see [19]) in the special case of Brownian motion. Kolmogorov generalized it for the construction of any stochastic process and gave conditions under which one can find a continuous version (cf. section 2.5), that is, a stochastic process with continuous paths. We use his approach to construct Brownian motion. We explain it in the next section.
2.1 Cylindrical measures Let {X t , t ∈ T} be family of random variable on a probability space (Ω, ℱ, P). Assume that X t takes values in (𝒳t , 𝒜t ). For any finite set S ⊂ T, 𝒳S = ∏ 𝒳t , 𝒜S = ⨂ 𝒜t , Q S = P ∘ (X t , t ∈ S)−1 , t∈S
t∈S
where QS is the induced measure. Check, if ΠS S : 𝒳S → 𝒳S for S ⊂ S , then Q S = Q S ∘ Π−1 S S
(2.1)
Suppose we are given a family {Q S , S ⊆ T finite-dimensional}, a probability measure where QS on (𝒳S , 𝒜S ). Assume they satisfy (2.1). Then, there exists Q on (𝒳T , 𝒜T ) such that Q ∘ Π−1 S = QS , where 𝒳T = ∏t∈T 𝒳t , 𝒜T = 𝜎(⋃S⊂T 𝒞S ), 𝒞S = Π−1 S (𝒜S ). Remark: For S ⊂ T, finite, C ∈ 𝒞S define Q0 (C) = Q S (A), where C = Π−1 S (A) as 𝒞S = Π−1 S (𝒜S ) We can define Q0 on ⋃S⊂T 𝒞S . Then, for C ∈ 𝒞S and 𝒞S , then by (2.1), Q0 (C) = Q S (A) = Q S (A), and hence Q0 is well-defined.
2.1 Cylindrical measures | 5
Note that as 𝒞S1 ∪ 𝒞S2 ∪ ⋅ ⋅ ⋅ ∪ 𝒞S k ⊂ 𝒞S1 ∪ ⋅⋅⋅∪ S k . Q0 is finitely additive on ⋃S⊂T 𝒞S . We have to show the countable additivity. Definition 2.1: A collection of subsets 𝒦 ⊂ 𝒳 is called a compact class if for every sequence {C k , k = 1, 2, ⋅ ⋅ ⋅ n}, n finite, ∞
n
⋂ Ck = / 0 ⇒ ⋂ C k = /0 k=1
k=1
Exercise 1: Every subcollection of compact class is compact. Exercise 2: If 𝒳 and 𝒴 are two spaces and T : 𝒳 → 𝒴 and 𝒦 is a compact class in 𝒴, then T −1 (𝒦) is a compact class in 𝒳 . Definition 2.2: A finitely additive measure 𝜇 on (𝒳 , 𝒜0 ) is called compact if there exists a compact class 𝒦 such that for every A ∈ 𝒜0 and 𝜖 > 0, there exists C𝜖 ∈ 𝒦, and A𝜖 ∈ 𝒜0 such that A𝜖 ⊂ C𝜖 ⊂ A
and 𝜇(A − A𝜖 ) < 𝜖.
We call 𝒦 is 𝜇-approximates 𝒜0 . Lemma 2.1: Every compact finitely additive measure is countably additive. Proof: Suppose (𝒳 , 𝒜0 , 𝜇) is given. There exists a compact class 𝒦 that is 𝜇-approximates 𝒜0 . Let {A n } ⊂ 𝒜0 such that A n ↘ 0. We need to show that 𝜇(A n ) ↘ 0. For given 𝜖 > 0, let B n ∈ 𝒜0 and C n ∈ 𝒦, such that Bn ⊂ Cn ⊂ An
and 𝜇(A n − B n )
𝜖. Since we know that n
n
n
𝜇(A n − ⋂ B k ) = 𝜇( ⋂ A k ) − 𝜇( ⋂ B k ) < k=1
k=1
k=1
we conclude that for all n, n
𝜇( ⋂ B k ) > k=1
𝜖 . 2
𝜖 , 2
6 | 2 Weak convergence in metric spaces
Next, for all n n
⋂ Bk = / 0, k=1
and hence, we have for all n n
⋂ Ck = / 0, k=1
which implies ∞
⋂ Ck = /0 k=1
since C k ∈ 𝒦 and 𝒦 is compact. Therefore, it follows that ∞
∞
k=1
k=1
⋂ Ak ⊃ ⋂ Ck = /0 implies lim A n = / 0,
n→∞
which is a contradiction.
2.2 Kolmogorov consistency theorem Suppose S ⊂ T is finite subset and QS is measure on (𝒳S , 𝒜S ) satisfying consistency condition (2.1). Let (𝒳{t} , 𝒜{t} , Q{t} ) be a compact probability measure space. For each t ∈ T, there exists a compact class 𝒦t ⊆ 𝒜{t} and 𝒦t , Q t approximates 𝒜{t} . Then, there exists a unique probability measure Q0 on (𝒳T , 𝒜T ) such that ΠS : 𝒳T → 𝒳S , and Q0 ∘ Π−1 S = QS . Proof: Define 𝒟 = {C : C = Π−1 t (K), K ∈ 𝒦t , t ∈ T}. Let {Π−1 t i (C t i ), i = 1, 2...} be a countable family of sets and B t = ⋃ Π−1 t i (C t i ) t i =t
2.2 Kolmogorov consistency theorem | 7
If the countable intersection of {B t , t ∈ T} is empty, then B t0 is empty for some t0 . Since 𝒦t0 is a compact class and all C t i ∈ 𝒦t0 , we get a finite set of ti ’s (t i = t0 ). Let us call it J for which ⋃ C t i = 0 ⇒ ⋃ Π−1 t i (C t i ) = 0 t i ∈J
t i ∈J
Since 𝒟 is a compact class, 𝒦 as a countable intersections of sets in 𝒟 is a compact finitely additive class. We shall show that Q0 is a compact measure, i.e., 𝒦 Q0 -approximates 𝒞0 . Take C ∈ 𝒞0 and 𝜖 > 0. For some S ⊂ T, C = Π−1 S (B). Choose a rectangle ∏(A t ) ⊂ B t∈S
so that for A t ∈ 𝒜t 𝜖 2 t∈S 𝜖 −1 −1 Q0 (ΠS (B) − ΠS (∏ A t )) < . 2 t∈S Q S (B − ∏ A t )
0. Then, there exists a version of {X t , t ∈ [0, 1]}, which has a continuous sample paths. Corollary 2.1: The Gaussian process with covariance function C(t, s) = min(t, s), t, s ∈ [0, 1]
2.5 Kolmogorov continuity theorem | 15
(has independent increment and is Markovian) has a version that is continuous. E(X t − X s )2 = EX 2t − 2EX t X s + EX 2s = t − 2s + s = |t − s| E(X t − X s ) = 3[E(X t − X s )2 ]2 4
= 3|t − s|2 We shall denote the continuous version by W t . Proof of Proposition 2.5: Take 0 < 𝛾
0 such that
(1 − 𝛿)(1 + 𝛼 − 𝛽𝛾) > 1 + 𝛿. For 0 ≤ i ≤ j ≤ 2n and |j − i| ≤ 2n𝛿 , ∑ P(|X j2−n − X i2−n | > [( j − i)2−n ]𝛾 ) ≤ C ∑[(j − i)2−n ]−𝛽𝛾+(1+𝛿) (by Chevyshev) i,j
i,j
= C2 2n[(1+𝛿)−(1−𝛿)(1+𝛼−𝛽𝛾)] < ∞, where (1 + 𝛿) − (1 − 𝛿)(1 + 𝛼 − 𝛽𝛾) = −𝜇. Then, by the Borell-Cantelli lemma, P(|X j2−n − X i2−n | > [( j − i)2−n ]𝛾 ) = 0, i.e., there exists n0 (𝜔) such that for all n ≥ n0 (𝜔) |X j2−n − X i2−n | ≤ [( j − i)2−n ]𝛾 . Let t1 < t2 be rational numbers in [0, 1] such that t2 − t1 ≤ 2−n0 (1−𝛿) t1 = i2−n − 2−p1 − ... − 2−p k (n < p1 < ... < p k ) t2 = j2−n − 2−q1 − ... − 2−q k (n < q1 < ... < q k ) t1 ≤ i2−n ≤ j2−n ≤ t2 . Let h(t) = t𝛾 for 2−(n+1)(1−𝛿) ≤ t ≤ 2−n(1−𝛿) .
16 | 2 Weak convergence in metric spaces
Then, X t − X i2−n ≤ C1 h(2−n ) 1 X t − X j2−n ≤ C2 h(2−n ) 2 X t − X t ≤ C3 h(t2 − t1 ) 2 1 and X i2−n −2−p1 −...−2−pk − X i2−n −2−p1 −...−2−pk−1 ≤ 2−p k 𝛾 . Under this condition, the process {X t , t ∈ [0, 1]} is uniformly continuous on rational numbers in [0, 1]. Let 𝜓 : (ΩQ , 𝒞ΩQ ) → (C[0, 1], 𝜎(C[0, 1])), which extends uniformly continuous functions in rational to continuous function in [0, 1]. Let P is a measure generated by the same finite dimensional on rationals. Then P̃ = P ∘ 𝜓−1 is the measure of X̃ t , version of X t . For {X t , t ∈ [0, 1]}, there exists a version of continuous sample path. In case of Brownian motion, there exists a continuous version. We call it {W t }. {W t+t0 − W t0 , t ∈ [0, ∞]} is a Weiner process.
2.6 Exit time for Brownian motion and Skorokhod theorem It is well known that if X1 , ..., X n are independent identically distributed (iid) random variables with EX1 = 0 and EX12 < ∞, then Yn =
X1 + ... + X n √n
converges in distribution to standard normal random variable Z. This is equivalent to weak convergence of Y n to Z, i.e., Ef (Y n ) → Ef (Z) for all bounded continuous functions for IR. If we denote by Y n,m =
X1 + ... + X m and Y n,0 = 0 √n
2.6 Exit time for Brownian motion and Skorokhod theorem | 17
and define interpolated continuous process { Y n,m t = m𝜖{0, 1, ...n} Y n,t = { { linear if t𝜖[m − 1, m] for m = 0, 1, 2, ...n and we shall show that Ef (Y n,[n.] → Ef (W .) for all bounded continuous functions for C[0,1] with {W(t), t𝜖[0, 1]} Brownian motion. In fact, we shall use Skorokhod embedding to prove for any 𝜖 > 0 P( sup |Y n,[nt] − W(t)| > 𝜖) → 0 0≤t≤1
as n → ∞, which implies the above weak convergences (see Theorem 2.3). Let ℱtW = 𝜎(W s , s ≤ t). 𝜏 is called “stopping time” if {𝜏 ≤ t} ∈ ℱtW . Define ℱ𝜏 = {A : A ∩ {𝜏 ≤ t} ∈ ℱtW }. Then ℱ𝜏 is a 𝜎-field. Define
Wt √t
is a standard normal variable.
T a = inf{t : W t = a}. Then, T a < ∞ a.e. and is a stopping time. Theorem 2.1: Let a < x < b. Then P x (T a < T b ) =
b−x . b−a
Remark: W t is Gaussian and has independent increment. Also, for s ≤ t E(W t − W s |ℱs ) = 0 and hence {W t } is martingale.
18 | 2 Weak convergence in metric spaces
Proof: Let T = T a ∧ T b . We know T a , T b < ∞ a.e., and W T a = a, and W T b = b. Since {W t } is MG, E x W T = E x W0 =x = aP(T a < T b ) + b(1 − P(T a < T b )). Therefore, P x (T a < T b ) =
b−x . b−a
{W t , t ∈ [0, ∞]} is a Weiner Process, starting from x with a < x < b. We know that E((W t − W s )2 |ℱsW ) = E(W t − W s )2 = (t − s). Also, E((W t − W s )2 |ℱsW ) = E(W t2 |ℱsW ) − 2E(W t W s |ℱsW ) + E(W s2 |ℱsW ) = E(W t2 |ℱsW ) − W s2 = E(W t2 − W s2 |ℱsW ) = (t − s). Therefore, E(W t2 − t|ℱsW ) = W s2 − s, and hence {(W t2 − t), t ∈ [0, ∞]} is a martingale. Suppose that x = 0 and a < 0 < b. Then T = T a ∧ T b is a finite stopping time. Therefore, T∧t is also stopping time. 2 E(W T∧t − T ∧ t) = 0
E0 (W T2 ) = E0 T EW T2 = ET = a2 P(T a < T b ) + b2 (1 − P(T a < T b )) = −ab
2.6 Exit time for Brownian motion and Skorokhod theorem | 19
Suppose X has two values a, b with a < 0 < b and EX = aP(X = a) + bP(X = b) =0 Remark: b a and P(X = b) = − b−a b−a
P(X = a) =
Let T = T a ∧ T b . Then, W T has the same distribution as X. We denote ℒ(W T ) = ℒ(X) or W T =𝒟 X
2.6.1 Skorokhod theorem Let X be random variable with EX = 0 and EX 2 < ∞. Then, there exists a ℱtW -stopping time T such that ℒ(W T ) = ℒ(X) and ET = EX 2 . Proof: Let F(x) = P(X ≤ x). ∞
0
EX = 0 ⇒ ∫
−∞
udF(u) + ∫ vdF(v) = 0 0 ∞
0
⇒ −∫
−∞
udF(u) = ∫ vdF(v) = C. 0
Let 𝜓 be a bounded function with 𝜓(0) = 0. Then, C ∫ 𝜓(x)dF(x) R
∞
0
0
−∞
= C( ∫ 𝜓(v)dF(v) + ∫ ∞
= ∫ 𝜓(v)dF(v) ∫
0
−∞
0 ∞
0
0
−∞
= ∫ dF(v) ∫
𝜓(u)dF(u))
−udF(u) + ∫
0
−∞
∞
𝜓(u)dF(u) ∫ vdF(v)
dF(u)(v𝜓(u) − u𝜓(v)).
0
20 | 2 Weak convergence in metric spaces
Therefore, ∞
0
0
−∞
∞
0
0
−∞
∫ 𝜓(x)dF(x) = C−1 ∫ dF(v) ∫ R
= C−1 ∫ dF(v) ∫
dF(u)(v𝜓(u) − u𝜓(v)) dF(u)(v − u) [
u v 𝜓(u) − 𝜓(v)] . v−u v−u
Consider (U, V) be a random vector in R2 such that P[(U, V) = (0, 0)] = F({0}) and for A ⊂ (−∞, 0) × (0, ∞) P((U, V) ∈ A) = C−1 ∫ ∫ dF(u)dF(v)(v − u). A
If 𝜓 = 1, P((U, V) ∈ (−∞, 0) × (0, ∞)) ∞
0
0
−∞
∞
0
0
−∞
= C−1∫ dF(v) ∫ = C−1∫ dF(v) ∫
dF(u)(v − u) dF(u)(v − u) [
u v 𝜓(u) − 𝜓(v)] v−u v−u
= ∫ 𝜓(x)dF(x) R
= ∫ dF(x) R
= 1, and hence, P is a probability measure. Let u < 0 < v such that 𝜇U,V ({u}) =
v u and 𝜇U,V ({v}) = − . v−u v−u
Then, by Fubini, ∫ 𝜓(x)dF(x) = E ∫ 𝜓(x)𝜇u,v (dx). On product space Ω × Ω , let W t (𝜔, 𝜔 ) = W t (𝜔) (U, V)(𝜔, 𝜔 ) = (U, V)(𝜔 ). T U,V is not a stopping time on ℱtW .
2.7 Embedding of sums of i.i.d. random variable in Brownian motion | 21
We know that if U = u and V = v ℒ(T U,V ) = ℒ(X). Then, ET U,V = E U,V E(T U,V |U, V) = −EUV = C−1 ∫
∞
0
dF(u)(−u) ∫ dF(v)v(v − u)
−∞ ∞
0 −1
∞
= − ∫ dF(v)(−u) [C ∫ vdF(v) − u] 0 2
0
= EX .
2.7 Embedding of sums of i.i.d. random variable in Brownian motion Let t0 ∈ [0, ∞). Then, {W(t + t0 ) − W(t0 ), t ≥ 0} is a Brownian motion and independent of ℱt0 . Let 𝜏 be a stopping time w.r.t. ℱtW . Then, one can see that W t∗ (𝜔) = W𝜏(𝜔) + t (𝜔) − W𝜏(𝜔) (𝜔) is a Brownian motion w.r.t. ℱ𝜏W where ℱ𝜏W = {B ∈ ℱ : B ∩ {𝜏 ≤ t} ∈ ℱtW } for all t ≤ 0}. Let V0 be countable. Then, {𝜔 : W∗t (𝜔) ∈ B} = ⋃ {𝜔 : W(t + t0 ) − W(t0 ) ∈ B, 𝜏 = t0 }. t0 ∈V0
For A ∈ ℱ𝜏W , P[{(W t∗1 , ..., W t∗k ) ∈ B k } ∩ A] = ∑ P[{(W t1 , ..., W t k ) ∈ B k } ∩ A ∩ {𝜏 = t0 }] t0 ∈V0
= ∑ P((W t∗1 , ..., W t∗k ) ∈ B k ) ⋅ P(A ∩ {𝜏 = t0 }) t0 ∈V0
(because of independence)
22 | 2 Weak convergence in metric spaces
= P((W t∗1 , ..., W t∗k ) ∈ B k ) ∑ P(A ∩ {𝜏 = t0 }) t0 ∈V0
P((W t∗1 ,
=
...,
W t∗k )
∈ B k )P(A)
Let 𝜏 be any stopping time such that 𝜏n = {
If
k 2n
≤t
0, there exists 𝛿 > 0(1/𝛿 is an integer) such that P(|W t − W s | < 𝜖, for all t, s ∈ [0, 1], |t − s| < 2𝛿) > 1 − 𝜖 𝜏mn is increasing in m. For n ≥ N𝛿, 1 n P(|𝜏nk𝛿 − k𝛿| < 𝛿, k = 1, 2, ..., ) ≥ 1 − 𝜖 𝛿 n since 𝜏[ns] → s. For s ∈ ((k − 1)𝛿, k𝛿), we have n n 𝜏[ns] − s ≥ 𝜏[n(k−1)𝛿] − k𝛿
(2.3)
24 | 2 Weak convergence in metric spaces
n n 𝜏[ns] − s ≤ 𝜏[nk𝛿] − (k − 1)𝛿.
Combining these, we have for n ≥ N𝛿 n P( sup |𝜏[ns] − s| < 2𝛿) > 1 − 𝜖. 0≤s≤1
For 𝜔 in event in (2.3) and (2.4), we get for m ≤ n, as W t nm = S n,m , W𝜏n − W m < 𝜖. m n For t =
m+𝜃 n
with 0 < 𝜃 < 1, S n,[nt] − W t ≤ (1 − 𝜃)S n,m − W m + 𝜃S n,m+1 − W m+1 n n W +(1 − 𝜃)W m − W t + 𝜃 n+1 − W t . n n
For n ≥ N𝛿 with
1 n
< 2𝛿, P(||S n,[ns] − W s ||∞ ≥ 2𝜖) < 2𝜖.
Theorem 2.3: Let f be bounded and continuous function on [0,1]. Then Ef (S n,[n⋅] ) → Ef (W(⋅)). Proof: For fixed 𝜖 > 0, define G𝛿 = {W , W ∈ C[0, 1] : ||W − W ||∞ < 𝛿 implies |f (W) − f (W )| < 𝜖}. Observe that G𝛿 ↑ C[0, 1] as 𝛿 ↓ 0. Then, Ef (S n,[n⋅] ) − Ef (W(⋅)) ≤ 𝜖 + 2M(P(G𝛿c ) + P(||S n,[n⋅] − W(⋅)|| > 𝛿)). Since P(G𝛿c ) → 0 and P(||S n,[n⋅] − W(⋅)|| > 𝛿) → 0 by Lemma 2.2. For f (x) = max |x(t)|, t
we have S [nt] → max |W(t)| in distribution max t t √n
(2.4)
2.9 Empirical distribution function | 25
and S max m → max |W(t)| in distribution. t 1≤m≤n √n Let R n = 1 + max S m − min S m 1≤m≤n
1≤m≤n
Then Rn ⇒ max W(t) − min W(t). 0≤t≤1 √n weakly 0≤t≤1 We now derive from Donsker theorem invariance principle for U-statistics [nt]
∏ (1 + i=1
[nt] k 𝜃X i ) = ∑ n− 2 √n k=1
∑ 1≤i1 ≤⋅⋅⋅≤i k ≤n
X i1 ⋅ ⋅ ⋅ X i k , X
where X i are i.i.d. and EX 2i < ∞. Next, using CLT, SLLN, and the fact P(max| √ni | > 𝜖) → 0, [nt]
log [∏ (1 + i=1
[nt] 𝜃X i 𝜃X i )] = ∑ log (1 + ) √n √n i=1 [nt]
= 𝜃∑ i=1
[nt] 2 [nt] X X 3i Xi 𝜃2 𝜃3 ∑ i + ∑ − − ⋅⋅⋅ 2 i=1 n 3 i=1 n√n √n
⇒ 𝜃W(t) −
𝜃2 t, 2
and hence [nt]
∏ (1 + i=1
𝜃2 𝜃X i ) ⇒ e𝜃W(t)− 2 t . √n
2.9 Empirical distribution function Let us define empirical distribution function 1 n F̂ n (x) = ∑ 1(X i ≤ x), x ∈ IR. n i=1 Then Glivenko-Cantelli lemma says sup |F̂ n (x) − F(x)| →a.s. 0. x
26 | 2 Weak convergence in metric spaces
Assume F is continuous. Let U i = F(X i ). For y ∈ [0, 1], define 1 n ̂ G(y) = ∑ 1(F(X i ) ≤ y). n i=1 Then by 1-1 transformation, √n sup |F̂ n (x) − F(x)| = √n sup |Ĝ n (y) − y|. x
y∈[0,1]
Let U1 , U2 , ..., U n be uniform distribution and let U(i) be order statistic such that U(1) (𝜔) ≤ ⋅ ⋅ ⋅ ≤ U(n) (𝜔). Next, f (U(1) , ..., U(n) ) = f (U𝜋(1) , ..., U𝜋(n) ) f U𝜋(1) ,...,U𝜋(n) (u1 , ..., u n ) = f U1 ,...,U n (u1 , ..., u n ) if u1 < u2 < ... < u n . ={
1, if u ∈ [0, 1]n ; 0, if u ∉ [0, 1]n .
For bounded g, Eg(U1 , ..., U n ) = ∑ ∫
𝜋∈Π u1 0, there exists K such that for all k ≥ K lim sup P n (F) = lim sup ∫ 1F dP n n→∞
n→∞
≤ lim sup ∫ f k dP n n→∞
= lim ∫ f k dP n n→∞
= ∫ f k dP ≤ P(F) + 𝛿. The last inequality follows from the fact that ∫ f n dP ↘ P(F). F
As a result, for all 𝛿 > 0, we have lim sup P n (F) ≤ P(F) + 𝛿. n→∞
(2) → (3) Let G = F c . Then, it follows directly. (2) + (3) → (4) trivial. (4) → (1) Approximate f let simple functions f n of sets A with P(A)̄ = P(A).
32 | 2 Weak convergence in metric spaces
Theorem 2.9: A necessary and sufficient condition for P n ⇒ P is that each subsequence {P n i } contain a further subsequence {P n i (m) } converging weakly to P. Proof: The necessary is easy. As for sufficiency, if Pn does not converge weakly to P, then there exists some bounded and continuous f such that ∫ fdP n does not converge to ∫ fdP. However, for some positive 𝜖 and some subsequence P n i , ∫ fdP − ∫ fdP > 𝜖 ni for all i, and no further subsequence can converge weakly to P. Suppose that h maps 𝒳 into another metric space 𝒳 , with metric 𝜌 and Borel 𝜎–field ℬ(𝒳 ). If h is measurable 𝒳 /𝒳 , then each probability P on (𝒳 , ℬ(𝒳 )) induces on (𝒳 , ℬ(𝒳 )) a probability P ∘ h−1 defined as usual by P ∘ h−1 (A) = P(h−1 (A)). We need conditions under which P n ⇒ P implies P n ∘ h−1 ⇒ P ∘ h−1 . One such condition is that h is continuous: If f is bounded and continuous on 𝒳 , then fh is bounded and continuous on 𝒳 , and by change of variable, P n ⇒ P implies ∫
𝒳
f (y)P n ∘ h−1 (dy) = ∫ f (h(x))P n (dx) → ∫ f (h(x))P(dx) = ∫ 𝒳
𝒳
𝒳
f (y)P ∘ h−1 (dy) (2.6)
Let Dh be discontinuity set of h. Theorem 2.10: Let (𝒳 , 𝜌) and (𝒳 , 𝜌 ) be two Polish space and h : 𝒳 → 𝒳 with P(D h ) = 0. Then, P n ⇒ P implies P n ∘ h−1 ⇒ P ∘ h−1 . Proof: Since h−1 (F) ⊂ h−1 (F) ⊂ D h ∪ h−1 (F), lim sup P n (h−1 (F)) ≤ lim sup P n (h−1 (F)) n→∞
n→∞
≤ P(D h ∪ h−1 (F)) ≤ P(h−1 (F)) (since D h is a set of P-measure zero). Therefore, for all closed set F, lim sup P n ∘ h−1 (F) ≤ P ∘ h−1 (F), n→∞
and hence, by Theorem 2.8, the proof is completed.
2.10 Weak convergence of probability measure on Polish space | 33
Let X n and X be random variables(𝒳 -valued). Then, we say X n →𝒟 X if P ∘ X −1 n ⇒ P ∘ X −1 . Observation: If X n →𝒟 X and 𝜌(X n , Y n ) →p 0, then Y n →𝒟 X. Remark: We use the following property of limsup and liminf. lim sup(a n + b n ) ≤ lim sup a n + lim sup b n n→∞
n→∞
n→∞
and lim inf(a n + b n ) ≥ lim inf a n + lim inf b n . n→∞
n→∞
n→∞
Proof: Consider closed set F. Let F 𝜖 = {x : 𝜌(x, F) ≤ 𝜖}. Then, F 𝜖 ↘ F as 𝜖 → 0 and {X n ∉ F 𝜖 } = {𝜔 : X n (𝜔) ∉ F 𝜖 } = {𝜔 : 𝜌(X n (𝜔), F) > 𝜖} Therefore, 𝜔 ∈ {X n ∉ F 𝜖 } ∩ {𝜌(X n , Y n ) < 𝜖} ⇒ 𝜌(X n (𝜔), F) > 𝜖 and 𝜌(X n (𝜔), Y n (𝜔)) < 𝜖 ⇒ 𝜌(Y n (𝜔), F) > 0 (draw graph.) ⇒ Y n (𝜔) ∉ F ⇒ 𝜔 ∈ {Y n ∉ F} Thus, {X n ∉ F 𝜖 } ∩ {𝜌(X n , Y n ) < 𝜖} ⊂ {Y n ∉ F}. Therefore, P(Y n ∈ F) ≤ P(𝜌(X n , Y n ) > 𝜖) + P(X n ∈ F 𝜖 ). Let P Yn = P ∘ Y n−1 and P X = P ∘ X −1 . Then, for all 𝜖 > 0, lim sup P n (F) = lim sup P(Y n ∈ F) n→∞
n→∞
𝜖 ≤ lim sup P(𝜌(X ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ n , Y n ) > 𝜖) + lim sup P(X n ∈ F ) n→∞
→p 0
n→∞
34 | 2 Weak convergence in metric spaces
= lim sup P(X n ∈ F 𝜖 ) n→∞
= P(X ∈ F 𝜖 ) (since X n ⇒𝒟 X). Therefore, for all closed set F, we have lim sup P Yn (F) ≤ P X (F), n→∞
and hence, by Theorem 2.5, P Yn ⇒ P X , which implies Y n ⇒𝒟 X. We say that a family of probability measure Π ⊂ 𝒫(𝒳 ) is tight if, given 𝜖 > 0, there exists compact K𝜖 such that P(K𝜖 ) > 1 − 𝜖 for all P ∈ Π.
2.10.1 Prokhorov theorem Definition 2.4: Π is relatively compact if, for {P n } ⊂ Π, there exists a subsequence {P n i } ⊂ Π and probability measure P (not necessarily an element of Π), such that P n i ⇒ P. Even though P n i ⇒ P makes no sense if P(𝒳 ) < 1, it is to be emphasized that we do require P(𝒳 ) = 1 and we disallow any escape of mass, as discussed below. For the most part, we are concerned with the relative compactness of sequences {P n }; this means that every subsequence {P n i } contains a further subsequence {P n i (m) }, such that P n i (m) ⇒ P for some probability measure P. Example: Suppose we know of probability measures Pn and P on (C, 𝒞) that the finite-dimensional distributions of Pn converges weakly to those of P: P n 𝜋t−11 ,...,t k ⇒ P𝜋t−11 ,...,t k for all k and all t1 , ..., t k . Notice that Pn need not converge weakly to P. Suppose, however, that we also know that {P n } is relatively compact. Then each {P n } contains some {P n i (m) } converging weakly to some Q. Since the mapping theorem then gives P n i (m) 𝜋t−11 ,...,t k ⇒ Q𝜋t−11 ,...,t k and since P n 𝜋t−11 ,...,t k ⇒ P𝜋t−11 ,...,t k by assumption, we have P𝜋t−11 ,...,t k = Q𝜋t−11 ,...,t k for all t1 , ..., t k . Thus, the finite-dimensional distributions of P and Q are identical, and since the class 𝒞f of finite-dimensional sets is a separating class, P = Q. Therefore, each subsequence contains a further subsequence converging weakly to P, not to some fortuitous limit, but specifically to P. It follows by Theorem 2.9 that the entire sequence {P n } converges weakly to P.
2.10 Weak convergence of probability measure on Polish space | 35
Therefore, {P n } is relatively compact and the finite-dimensional distributions of Pn converge weakly to those of P, then P n ⇒ P. This idea provides a powerful method for proving weak convergence in C and other function spaces. Note that if {P n } does converge weakly to P, then it is relatively compact, so that this is not too strong a condition. Theorem 2.11: Suppose (𝒳 , 𝜌) is a Polish space and Π ⊂ 𝒫(𝒳 ) is relatively compact, then it is tight. This is the converse half of Prohorov’s theorem. Proof: Consider open sets, G n ↗ 𝒳 . For each 𝜖 > 0, there exists n, such that for all P∈Π P(G n ) > 1 − 𝜖. Otherwise, for each n, we can find Pn such that P n (G n ) < 1 − 𝜖. Then by by relative compactness, there exists {P n i } ⊂ Π and probability measure Q ∈ Π such that P n i ⇒ Q. Thus, Q(G n ) ≤ lim inf P n i (G n ) i→∞
≤ lim inf P n i (G n i ) (since n i ≥ n and hence G n ⊂ G n i ) i→∞
< 1−𝜖 Since G n ↗ 𝒳 , 1 = Q(𝒳 ) = lim Q(G n ) n→∞
< 1 − 𝜖, which is contradiction. Let A k m , m = 1, 2, ... be open ball with radius 𝒳 (separability). Then, there exists nk such that for all P ∈ Π, P( ⋃ A k i ) > 1 − i≤n k
𝜖 2k
Then, let K𝜖 = ⋂ ⋃ A ki , k≥1 i≤n k
1 , km
covering
36 | 2 Weak convergence in metric spaces
where ⋂k≥1 ⋃i≤n k A k i is totally bounded set. Then, K𝜖 is compact(completeness), and P(K𝜖 ) > 1 − 𝜖. Remark: The last inequality is from the following. Let Bi be such that P(B i ) > 1 − Then, P(B i ) > 1 −
𝜖 . 2i
𝜖 𝜖 ⇒ P(B ci ) ≤ i 2i 2 c ⇒ P(∪ ∞ i=1 B i ) ≤ 𝜖
⇒ P(∩ ∞ i=1 B i ) > 1 − 𝜖.
2.11 Tightness and compactness in weak convergence Theorem 2.12: If Π is tight, then for {P n } ⊂ Π, there exists a subsequence {P n i } ⊂ {P n } and probability measure P such that P n i ⇒ P. Proof: Choose compact K1 ⊂ K2 ⊂ . . ., such that for all n P n (K u ) > 1 −
1 u
from tightness condition. Look at ⋃u K u . We know that there exists a countable family of open sets, 𝒜, such that if x ∈ ∪ u K u and G is open, then x ∈ A ⊂ Ā ⊂ G for some A ∈ 𝒜. Let ℋ = {0} ∪ {finite union of sets of the form Ā ∩ K u u ≥ 1, A ∈ 𝒜}. Then, ℋ is a countable family. Using the Cantor diagonalization method, there exists {n i } such that for all H ∈ ℋ, 𝛼(H) = lim P n i (H). i→∞
Our aim is to construct a probability measure P such that for all open set G, P(G) = sup 𝛼(H). H⊂G
(2.7)
2.11 Tightness and compactness in weak convergence | 37
Suppose we showed (2.7) above. Consider an open set G. Then, for 𝜖 > 0, there exists H𝜖 ⊂ G such that P(G) = sup 𝛼(H) H⊂G
< 𝛼(H𝜖 ) + 𝜖 = lim P n i (H𝜖 ) + 𝜖 i
= lim inf P n i (H𝜖 ) + 𝜖 i
≤ lim inf P n i (G) + 𝜖 i
and hence, for all open set G, P(G) ≤ lim inf P n i (G), i
which is equivalent to P n i ⇒ P. Observe ℋ is closed under finite union and 1. 2. 3.
𝛼(H1 ) ≤ 𝛼(H2 ) if H1 ⊂ H2 𝛼(H1 ∪ H2 ) = 𝛼(H1 ) + 𝛼(H2 ) if H1 ∩ H2 = 0 𝛼(H1 ∪ H2 ) ≤ 𝛼(H1 ) + 𝛼(H2 ).
Define for open set G 𝛽(G) = sup 𝛼(H). H⊂G
Then, 𝛼(0) = 𝛽(0) = 0 and 𝛽 is monotone. Define for M ⊂ 𝒳 𝛾(M) = inf 𝛽(G). M⊂G
Then, 𝛾(M) = inf 𝛽(G) M⊂G
= inf ( sup 𝛼(H)) M⊂G
H⊂G
𝛾(G) = inf 𝛽(G ) G⊂G
= 𝛽(G) M is 𝛾-measurable if for all L ⊂ 𝒳 𝛾(L) ≥ 𝛾(M ∩ L) + 𝛾(M c ∩ L).
(2.8)
38 | 2 Weak convergence in metric spaces
We shall prove that 𝛾 is outer measure, and hence open and closed sets are 𝛽-measurable. 𝛾-measurable sets M form a 𝜎-field, ℳ, and 𝛾 ℳ is a measure. Claim: Each closed set is in ℳ and P = 𝛾 ℬ(𝒳 ) open set G P(G) = 𝛾(G) = 𝛽(G). Note that P is a probability measure. K u has finite covering of sets in 𝒜 when K u ∈ ℋ. 1 ≥ P(𝒳 ) = 𝛽(𝒳 ) = sup 𝛼(K u ) u
= sup (1 − u
1 ) u
= 1. Step 1: If F ⊂ G (F is closed and G is open), and if F ⊂ H for some H ∈ ℋ, then there exists some H0 ∈ ℋ such that H ⊂ H0 ⊂ G. Proof: Consider x ∈ F and A x ∈ 𝒜 such that x ∈ A x ⊂ Ā x ⊂ G. Since F is closed subset of compact, F is compact. Since Ax covers F, there exists finite subcovers A x1 , A x2 , ..., A x k . Take k
H0 = ⋃ (A x̄ i ∩ K u ). i=1
Step 2: 𝛽 is finitely sub-additive on open set. Suppose H ⊂ G1 ∪ G2 and H ∈ ℋ. Let F1 = {x ∈ H : 𝜌(x, G1c ) ≥ 𝜌(x, G2c )} F2 = {x ∈ H : 𝜌(x, G2c ) ≥ 𝜌(x, G1c )}.
2.11 Tightness and compactness in weak convergence | 39
If x ∈ F1 but not in G1 , then x ∈ G2 , and hence x ∈ H. Suppose x is not in G2 . Then x ∈ G2c , and hence, 𝜌(x, G2c ) > 0. Therefore, 0 = 𝜌(x, G1c ) ( since x ∈ G1c ) < 𝜌(x, G2c ), ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ >0
which contradicts x ∈ F1 , and hence contradicts 𝜌(x, G1c ) ≥ 𝜌(x, G2c ). Similarly, if x ∈ F2 but not in G2 , then x ∈ G1 . Therefore, F1 ⊂ G1 and F2 ⊂ G2 . Since F i ’s are closed, by step 1, there exist H1 and H2 such that F1 ⊂ H1 ⊂ G1 and F2 ⊂ H2 ⊂ G2 . Therefore, 𝛼(H) ≤ 𝛼(H1 ) + 𝛼(H2 ) 𝛽(G) ≤ 𝛽(G1 ) + 𝛽(G2 ). Step 3: 𝛽 is countably sub-additive on open-set H ⊂ ⋃n G n , where Gn is an open set. Since H is compact (union of compacts), there exist a finite subcovers, i.e., there exists n0 such that H ⊂ ⋃ Gn n≤n0
and 𝛼(H) ≤ 𝛽(H) ≤ 𝛽( ⋃ G n ) n≤n0
= ∑ 𝛽(G n ) n≤n0
= ∑ 𝛽(G n ). n
Therefore, 𝛽( ⋃ G n ) = sup 𝛼(H) n
H⊂∪ n G n
≤ sup ∑ 𝛽(G n ) H⊂∪ n G n n
= ∑ 𝛽(G n ). n
40 | 2 Weak convergence in metric spaces
Step 4: 𝛾 is an outer measure. We know 𝛾 is monotonic by definition and is countably sub-additive. Given 𝜖 > 0 and subsets {M n } ⊂ 𝒳 , choose open sets Gn , M n ⊂ G n such that 𝜖 𝛽(G n ) ≤ 𝛾(M n ) + n 2 𝛾( ⋃ M n ) ≤ 𝛽( ⋃ G n ) n
n
= ∑ 𝛽(G n ) = ∑ 𝛾(M n ) + 𝜖. n
Step 5: F is closed G is open. 𝛽(G) ≥ 𝛾(F ∩ G) + 𝛾(F c ∩ G). Choose 𝜖 > 0 and H1 ∈ ℋ, H1 ⊂ F c ∩ G such that 𝛼(H1 ) > 𝛽(G ∩ F c ) − 𝜖. Chose H0 such that 𝛼(H0 ) > 𝛽(H1c ∩ G) − 𝜖. Then, H0 , H1 ⊂ G, and H0 ∩ H1 = 0, 𝛽(G) ≥ 𝛼(H0 ∪ H1 ) = 𝛼(H0 ) + 𝛼(H1 ) > 𝛽(H1c ∩ G) + 𝛽(F c ∩ G) − 2𝜖 ≥ 𝛾(F ∩ G) + 𝛾(F c ∩ G) − 2𝜖. Step 6: If F ∈ ℳ, then F are all closed. If G is open and L ⊂ G, then 𝛽(G) ≥ 𝛾(F ∩ L) + 𝛾(F c ∩ L). Then, inf 𝛽(G) ≥ inf (𝛾(F ∩ L) + 𝛾(F c ∩ L)) ⇒ 𝛾(L) ≥ 𝛾(F ∩ L) + 𝛾(F c ∩ L).
3 Weak convergence on C [0, 1] and D[0, ∞) We now use the techniques developed in sections 2.10 and 2.11 to the case of space C[0, 1], which is a complete separable metric space with sup norm. In later sections, we consider the space of functions with discontinuities of the first kind. Clearly, one has to define distance on this space denoted by D[0, 1], such that the space is complete separable metric space. This was done by Skorokhod with the introduction of the topology called Skorokhod topology. In view of the fact that convergence of finite dimensional distributions determines the limiting measure and the Prokhorov theorem from chapter 2, we need to characterize compact sets in C[0, 1] with sup norm and D[0, 1] with Skorokhod topology, which is described in section 3.4. The tightness in this case is described in section 3.7.
3.1 Structure of compact sets in C[0, 1] Let 𝒳 be a complete separable metric space. We showed that Π is tight if and only if Π is relatively compact. Consider P n is measure on C[0, 1] and let 𝜋t1 ,...,t k (x) = (x(t1 ), . . . , x(t k )) and suppose that P n ∘ 𝜋t−11 ,...,t k ⇒ P ∘ 𝜋t−11 ,...,t k does not imply P n ⇒ P on C[0, 1]. However, P n ∘ 𝜋t−11 ,...,t k ⇒ P ∘ 𝜋t−11 ,...,t k and {P n } is tight. Then, P n ⇒ P. Proof: Since tightness is implied (as we proved), there exists a subsequence {P n i } of {P n } such that the sequence converges weakly to a probability measure Q, i.e P n i ⇒ Q. Hence, by Theorem 2.9, P ni ∘ 𝜋t−11 ⋅⋅⋅t k → Q ∘ 𝜋t−11 ...t k , Giving P = Q. However, all subsequences have the same limit. Hence, P n (→)P. What is a compact set in C[0, 1]?
3.1.1 Arzela-Ascoli theorem Definition 3.1: The uniform norm (or sup norm) assigns to real- or complex-valued bounded functions f defined on a set S the non-negative number ||f ||∞ = ||f ||∞,S = sup{|f (x)| : x ∈ S}.
42 | 3 Weak convergence on C [0, 1] and D[0, ∞)
This norm is also called the supremum norm, the Chebyshev norm, or the infinity norm. The name “uniform norm” derives from the fact that a sequence of functions {f n } converges to f under the metric derived from the uniform norm if and only if f n converges to f uniformly. Theorem 3.1: The set A ⊂ C[0, 1] is relatively compact in sup topology if and only if (i) sup |x(0)| < ∞. x∈A
(ii) lim ( sup w x (𝛿)) = 0. 𝛿
x∈A
Remark (modulus of continuity): Here w x (𝛿) = sup |x(t) − x(s)|. |s−t|≤𝛿
Proof: Consider function f : C[0, 1] → R, such that f (x) = x(0). Claim: f is continuous. Proof of claim: We want to show that for 𝜖 > 0, there exists 𝛿 such that ||x − y||∞ = sup {|x(t) − y(t)|} < 𝛿 implies |f (x) − f (y)| = |x(0) − y(0)| < 𝛿. t∈[0,1]
Given 𝜖 > 0, let 𝛿 = 𝜖. Then, we are done. Since A is compact, continuous mapping x → x(0) is bounded. Therefore, sup |x(0)| < ∞. x∈A
w x ( n1 ) is continuous in x uniformly on A and hence 1 lim w x ( ) = 0. n
n→∞
Suppose (i) and (ii) hold. Choose k large enough so that 1 sup w x ( ) = sup ( sup |x(s) − x(t)|) k x∈A x∈A |s−t|≤ 1 k
3.1 Structure of compact sets in C [0, 1] |
43
is finite. Since k (i − 1)t it ), |x(t)| < |x(0)| + ∑ x( ) − x( k k i=1
we have 𝛼 = sup ( sup |x(t)|) < ∞. 0≤t≤1
x∈A
Choose 𝜖 > 0 and finite 𝜖-covering H of [−𝛼, 𝛼]. Choose k large enough so that 1 w x ( ) < 𝜖. k Take B to be finite set of polygonal functions on C[0, 1] that are linear on [ i−1 , ki ] and k takes the values in H at end points. If x ∈ A and x( k1 ) ≤ 𝛼 so that there exists a point y ∈ B such that i i x( ) − y( ) < 𝜖, k k
i = 1, 2, . . . , k
then i−1 i i y( ) − x(t) < 2𝜖 for t ∈ [ , ]. k k k ), so it is within 2𝜖 of x(t), ||x − y||∞ < 2𝜖, B is y(t) is convex combination of y( ki ), y( i−1 k finite, B is 2𝜖-covering of A. This implies A is compact. Theorem 3.2: {P n } is tight on C[0, 1] if and only if 1. For each 𝜂 > 0, there exists a and n0 such that for n ≥ n0 P n ({x : |x(0)| > a}) < 𝜂 2.
For each 𝜖, 𝜂 > 0, there exists 0 < 𝛿 < 1 and n0 such that for n ≥ n0 P n ({x : w x (𝛿) ≥ 𝜖}) < 𝜂
Proof: Since {P n } is tight, given 𝛿 > 0, choose K compact such that P n (K) > 1 − 𝜂.
44 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Note that by the Arzela-Ascoli theorem, for large a K ⊂ {x : |x(0)| ≤ a} and for small 𝛿 K ⊂ {x : w x (𝛿) ≤ 𝜖}. Now, C[0, 1] is a complete separable metric space. So for each n, P n is tight, and hence, we get the necessity condition. Given 𝜂 > 0, there exists a such that P({x : |x(0)| > a}) < 𝜂 and 𝜖, 𝜂 > 0, there exists 𝛿 > 0 such that P n ({x : w x (𝛿) ≥ 𝜖}) < 𝜂. This happens for P n , where n ≤ n0 with n0 is finite. Assume (i) and (ii) holds for all n. Given 𝜂, choose a so that B = {x : |x(0)| ≤ a}, a satisfies for all n P n (B) > 1 − 𝜂, and choose 𝛿k such that B k = {x : w x (𝛿k ) ≤
1 }, k
with P n (B k ) > 1 −
𝜂 . 2k
Let K = A where A = B ∩ ( ⋂ B k ). k
K is compact by Arzela-Ascoli Theorem, P n (K) > 1 − 2𝜂.
3.2 Invariance principle of sums of i.i.d. random variables |
45
3.2 Invariance principle of sums of i.i.d. random variables Let X i ’s be i.i.d. with EX i = 0 and EX 2i = 𝜎2 . Define W tn (𝜔) =
1 1 S[nt] (𝜔) + (nt − [nt]) 2 X[nt]+1 . 𝜎√n 𝜎 √n
Consider linear interpolation Wk = n
Sk √n
where W n ∈ C[0, 1] a.e. P n .
Let 𝜓nt = (nt − [nt]) ⋅
X[nt+1] 𝜎√n
.
Claim: For fixed t, by Chevyshev’s inequality, as n → ∞ 𝜓nt → 0 Proof of Claim: P(|𝜓nt |𝜖) = P(|X[nt+1] | > ≤
𝜎√n𝜖 ) (nt − [nt])
E|X[nt+1] |2 𝜎2 n𝜖2 (nt−[nt])2
(nt − [nt])2 n𝜖2 1 ≤ 2 → 0. n𝜖 =
By CLT, S[nt] 𝜎√[nt] Since
[nt] n
⇒𝒟 N(0, 1).
→ t, by CLT, S[nt] 𝜎√n
=
S[nt] 𝜎√[nt]
×
√[nt] ⇒𝒟 √tZ √n
by Slutsky’s equation. Therefore, W tn ⇒𝒟 √tZ.
(3.1)
46 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Then, (W sn , W tn − W sn ) =
1 (S[ns] , S[nt] − S[ns] ) + (𝜓ns , 𝜓nt − 𝜓ns ) 𝜎√n ⇒𝒟 (N1 , N2 ).
Since S[ns] and S[nt] − S[ns] are independent, N1 and N2 are independent normal with variance s and t − s. Thus, (W sn , W tn ) = (W sn , (W tn − W sn ) + W sn ) ⇒𝒟 (N1 , N1 + N2 ). The two-dimensional distributions of (W tn )t𝜖[0,1] converges to two dimensional distributions of Brownian Motion. We consider similarly k-dimensional distributions of (W tn )t𝜖[0,1] converge to those of Brownian Motion. We considered two-dimensional. We can take k–dimensional. Similar argument shows that (W1n , . . . , W kn ) ⇒𝒟 finite dimensional distribution of Brownian motion. Now, we have to show that P n is tight. Recall the Arzela-Ascoli theorem. Theorem 3.3: {P n } is tight on C[0, 1] if and only if 1. For each 𝜂 > 0, there exists a and n0 such that for n ≥ n0 P n ({x : |x(0)| > a}) > 𝜂. 2.
For each 𝜖, 𝜂 > 0, there exists 0 < 𝛿 < 1 and n0 such that for n ≥ n0 P n ({x : w x (𝛿) ≥ 𝜖}) < 𝜂.
Theorem 3.4: Suppose 0 = t0 < t1 < ⋅ ⋅ ⋅ < t𝜈 = 1 and min(t i − t i−1 ) ≥ 𝛿.
(3.2)
w x (𝛿) ≤ 3 max ( sup |x(s) − x(t i−1 )|)
(3.3)
1 3𝜖) t i−1 ≤s≤t i
1≤i≤𝜈
= P(x : max ( sup |x(s) − x(t i−1 )|) > 𝜖) 1≤i≤𝜈 𝜈
t i−1 ≤s≤t i
= ∑ P(x : sup |x(s) − x(t i−1 )| > 𝜖). i=1
t i−1 ≤s≤t i
This proves the theorem. Condition (ii) of the Arzela-Ascoli theorem holds if for each 𝜖, 𝜂, there exists 𝛿 ∈ (0, 1) and n0 such that for all n ≥ n0 1 P (x : sup x(s) − x(t) ≥ 𝜖) > 𝜂. 𝛿 n t≤s≤t+𝛿
48 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Now apply Theorem 3.4 with t i = i𝛿 for i < 𝜈 = [1/𝛿]. Then by (3.4), condition (ii) of Theorem 3.3 holds.
3.3 Invariance principle for sums of stationary sequences Definition 3.2: {X n } is stationary if for any m, (X i i , . . . , X i k ) =𝒟 (X i i +m , . . . , X i k +m ). Lemma 3.1: Suppose {X n } is stationary and W n is defined as above. If lim lim sup 𝜆2 P( max |S k | > 𝜆𝜎√n) = 0
𝜆→∞
k≤n
n→∞
then, W n is tight. Proof: Since W0n = 0, the condition (i) of Theorem 3.3 is satisfied. Let P n is induced measure of W n , i.e. consider now P n (w(𝛿) > 𝜖), we shall show that for all 𝜖 > 0 lim lim sup P n (w(𝛿) ≥ 𝜖) = 0.
𝛿→0
n→∞
If min(t t − t i−1 ) ≥ 𝛿, then by Theorem 3.4, 𝜈
P(w(W n , 𝛿) ≥ 3𝜖) ≤ ∑ P( sup W sn − W tn ≥ 𝜖). t ≤s≤t i=1
Take t i =
mi , n
i−1
i
0 = m0 < m1 < ⋅ ⋅ ⋅ < m𝜈 = n. W tn is polygonal and hence, sup W sn − W tn = ≤s≤t
t i−1
i
max
|S k − S m i−1 |
m i−1 ≤k≤m i
𝜎√n
.
Therefore, 𝜈
P(w(W n , 𝛿) ≥ 3𝜖) ≤ ∑ P( max i=1
m i−1 ≤k≤m i
|S k − S m i−1 | 𝜎√n
≥ 𝜖)
𝜈
≤ ∑ P( max |S k − S m i−1 | ≥ 𝜎√n𝜖) i=1
m i−1 ≤k≤m i
𝜈
= ∑ P( max |S k | ≥ 𝜎√n𝜖) (by stationarity). i=1
k≤m i −m i−1
3.3 Invariance principle for sums of stationary sequences |
This inequality holds if m i m i−1 − ≥ 𝛿 for 1 < i < 𝜈. n n Take m i = im for 0 ≤ i < 𝜈 and m𝜈 = n. For i < 𝜈 choose 𝛿 such that m i − m i−1 = m ≥ n𝛿. Let m = [n𝛿], 𝜈 = [ mn ]. Then, m𝜈 − m𝜈−1 ≤ m and 𝜈=[
n 1 1 1 2 ] → where < < . m 𝛿 2𝛿 𝛿 𝛿
Therefore, for large n 𝜈
P(w(W n , 𝛿) ≥ 3𝜖) ≤ ∑ P( max |S k | ≥ 𝜎√n𝜖) i=1
k≤m i −m i−1
≤ 𝜈P( max |S k | ≥ 𝜎√n𝜖) k≤m
≤ Take 𝜆 =
𝜖 . √2𝛿
2 P( max |S k | ≥ 𝜎√n𝜖). k≤m 𝛿
Then, P(w(W n , 𝛿) ≥ 3𝜖) ≤
4𝜆2 P( max |S k | ≥ 𝜆𝜎√n). k≤m 𝜖2
By the condition of the Lemma, given 𝜖, 𝜂 > 0, there exists 𝜆 > 0 such that 4𝜆2 lim sup P( max |S k | > 𝜆𝜎√n) < 𝜂. k≤n 𝜖2 n→∞ Now, for fixed 𝜆, 𝛿, let m → ∞ with n → ∞. Look at X k i.i.d. Then, lim lim sup P( max |S k | > 𝜆𝜎√n) = 0.
𝜆→∞
n→∞
k≤n
We know that P( max |S u | > 𝛼) ≤ 3 max P(|S u | > u≤m
u≤m
𝛼 ). 3
49
50 | 3 Weak convergence on C [0, 1] and D[0, ∞)
To show lim lim sup 𝜆2 P( max |S k | > 𝜆𝜎√n) = 0,
𝜆→∞
k≤n
n→∞
(A),
we assume that X i i.i.d. normal, and hence, S k /√k is asymptotically normal, N. Since we know P(|N| > 𝜆) ≤
EN 4 3𝜎4 = 4 , 𝜆4 𝜆
> 1) we have for k ≤ n, ( √n √k P(|S k | > 𝜆𝜎√n) = P(√k|N| > 𝜆𝜎√n) ≤
3 . 𝜆4 T 4
k𝜆 is large and k𝜆 ≤ k ≤ n. Then, P(|S k | > 𝜆𝜎√n) ≤ P(|S k | > 𝜆𝜎√k) ≤
3 . 𝜆4
Also, P(|S k | > 𝜆𝜎√n) ≤ ≤
E|S k |2 /𝜎2 𝜆2 n k𝜆 . 𝜆2 n
and hence, we get the above convergence (A) is true.
3.4 Weak convergence on the Skorokhod space 3.4.1 The space D[0, 1] Let x : [0, 1] → R be right-continuous with left limit such that 1. for 0 ≤ t < 1 lim x(s) = x(t+) = x(t) s↘t
3.4 Weak convergence on the Skorokhod space | 51
2.
for 0 < t ≤ 1 lim x(s) = x(t−). s↗t
We say that x(t) has discontinuity of the first kind at t if left and right limit exist. For x ∈ D and T ⊂ [0, 1], w x (T) = w(x, T) = sup |x(s) − x(t)|. s,t∈T
We define the modulus of continuity w x (𝛿) = sup w x ([t, t + 𝛿)) 0≤t≤1−𝛿
Lemma 3.2 (D1): For each x ∈ D and 𝜖 > 0, there exist points 0 = t0 < t1 < ⋅ ⋅ ⋅ < t𝜈 = 1 and w x ([t i−1 , t i )) < 𝜖. Proof: Call t i ’s above 𝛿-sparse. If mini {(t i − t i−1 )} ≥ 𝛿, define for 0 < 𝛿 < 1 wx (𝛿) = w (x, 𝛿) = inf max w x ([t i−1 , t i )). {t i } 1≤i≤𝜈
(If we prove the above lemma, we get x ∈ D, lim wx (𝛿) = 0.) 𝛿→0
If 𝛿 < 21 , we can split [0, 1) into subintervals [t i−1 , t i ) such that 𝛿 < (t i − t i−1 ) ≤ 2𝛿 and hence, wx (𝛿) ≤ w x (2𝛿). Let us define jump function j(x) = sup |x(t) − x(t−)|. 0≤t≤1
We shall prove that w x (𝛿) ≤ 2wx (𝛿) + j(x). Choose 𝛿-sparse sequence {t i } such that w x ([t i−1 , t i )) < wx (𝛿) + 𝜖.
52 | 3 Weak convergence on C [0, 1] and D[0, ∞)
We can do this from the definition wx (𝛿) = w (x, 𝛿) = inf max w x ([t i−1 , t i )). {t i } 1≤i≤𝜈
If |s − t| < 𝛿, then s, t ∈ [t i−1 , t i ) or belongs to adjoining intervals. Then, |x(s) − x(t)| {
wx (𝛿) + 𝜖, 2wx (𝛿)
if s, t belong to the same interval;
+ 𝜖 + j(x), if s, t belong to adjoining intervals.
If x is continuous, j(x) = 0 and hence, w x (𝛿) ≤ 2wx (𝛿). 3.4.2 Skorokhod topology Let Λ be the class of strictly increasing functions on [0, 1] and 𝜆(0) = 0, 𝜆(1) = 1. Define d(x, y) = inf{𝜖 : ∃𝜆 ∈ Λ such that sup |𝜆(t) − t| < 𝜖 and sup |x(𝜆(t)) − y(t)| < 𝜖}. t
t
d(x, y) = 0 implies there exists 𝜆 n ∈ Λ such that 𝜆 n (t) → t uniformly and x(𝜆 n (t)) → y(t) uniformly. Therefore, with ||𝜆 − I|| = sup |𝜆(t) − t| t∈[0,1]
||x − y ∘ 𝜆|| = sup |x(t) − y(𝜆(t))| t∈[0,1]
d(x, y) = inf (||𝜆 − I|| ∨ ||x − y ∘ 𝜆||). 𝜆
If 𝜆(t) = t, then 1. d(x, y) = sup |x(t) − y(t)| < ∞ since we showed |x(s) − x(t)| ≤ wx (𝛿) < ∞. 2. d(x, y) = d(y, x). 3. d(x, y) = 0 only if x(t) = y(t) or x(t) = y(t−). If 𝜆 1 , 𝜆 2 ∈ Λ and 𝜆 1 ∘ 𝜆 2 ∈ Λ ||𝜆 1 ∘ 𝜆 2 − I|| ≤ ||𝜆 1 − I|| + ||𝜆 2 − I||. If 𝜆 1 , 𝜆 2 ∈ Λ, then the following holds: 1. 𝜆 1 ∘ 𝜆 2 ∈ Λ. 2. ||𝜆 1 ∘ 𝜆 2 − I|| ≤ ||𝜆 1 − I|| + ||𝜆 2 − I||.
3.5 Metric of D[0, 1] to make it complete | 53
3. ||x − z ∘ (𝜆 1 ∘ 𝜆 2 )|| ≤ ||x − y ∘ 𝜆 2 || + ||y − z ∘ 𝜆 1 ||. 4. d(x, z) ≤ d(x, y) + d(y, z). Therefore, Skorokhod topology is given by d. (D[0,1],d) is not complete. To see this, choose x n = 1[0,
1 2n
] (t)
and 𝜆 n be such that
𝜆 n (1/2 ) = 1/2 and linear in [0, 1/2 ] and [1/2 , 1] then || x n+1 0𝜆 n − x n || = 0 and || 1 𝜆 n − I || = 2n+1 . Meanwhile, 𝜆 n (1/2n ) = / 1/2n+1 , then || x n − x n+1 || = 1 and d(x n , x n+1 ) = n+1 1/2 , i.e. x n is d-Cauchy and d(x n , 0) = 1. Choose 𝜆 ∈ Λ near identity. Then for t, s close, 𝜆(t)−𝜆(s) is close to 1. Therefore, t−s n
n+1
n
n+1
𝜆(t) − 𝜆(s) ||𝜆||0 = sup log ∈ (0, ∞). t − s s 0, |u − 1| ≤ e| log u| − 1, we have 𝜆(t) − 𝜆(0) − 1 sup |𝜆(t) − t| = sup t t − 0 0≤t≤1 0≤t≤1 = e||𝜆|| − 1. 0
For any 𝜈, 𝜈 ≤ e𝜈 − 1, and hence d(x, y) ≤ e d
0
(x,y)
− 1.
Thus, d0 (x n , y) → 0 implies d(x n , y) → 0. Lemma 3.3 (D2): If x, y ∈ D[0, 1] and d(x, y) < 𝛿2 , then d0 (x, y) ≤ 4𝛿 + wx (𝛿).
54 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Proof: Take 𝜖 < 𝛿 and {t i } 𝛿–sparse with w x ([t i−1 , t i )) < wx (𝛿) + 𝜖 ∀i. We can do this from definition of wx (𝛿). Choose 𝜇 ∈ Λ such that sup |x(t) − y(𝜇(t))| = sup |x(𝜇−1 (t)) − y(t)| < 𝛿2
(3.5)
sup |𝜇(t) − t| < 𝛿2 .
(3.6)
t
t
and t
This follows from d(x, y) < 𝛿2 . Take 𝜆 to agree with 𝜇 at points t i and linear between. 𝜇−1 ∘ 𝜆 fixes t i and is increasing in t. Also, (𝜇−1 ∘ 𝜆)(t) lies in the same interval [t i−1 , t i ). Thus, from (3.5) and (3.6), |x(t) − y(𝜆(t))| ≤ |x(t) − x((𝜇−1 ∘ 𝜆)(t))| + |x((𝜇−1 ∘ 𝜆)(t)) − y(𝜆(t))| = wx (𝛿) + 𝜖 + 𝛿2 . 𝛿 < 21 < 4𝛿 + wx (𝛿). 𝜆 agrees with 𝜇 at t i ’s. Then by (3.5), (3.6), and (t i − t i−1 ) > 𝛿 (𝛿–sparse), |(𝜆(t i ) − 𝜆(t i−1 )) − (t i − t i−1 )| < 2𝛿2 < 2𝛿(t i − t i−1 ) and |(𝜆(t) − 𝜆(s)) − (t − s)| ≤ 2𝛿|t − s| for t, s ∈ [t i−1 , t i ) by polygonal property. Now, we take care of adjoining interval. For u1 , u2 , u3 |(𝜆(u3 ) − 𝜆(u1 )) − (u3 − u1 )| ≤ |(𝜆(u3 ) − 𝜆(u2 )) − (u3 − u2 )| + |(𝜆(u2 ) − 𝜆(u1 )) − (u2 − u1 )|. If t and s are in adjoining intervals, we get the same bound. Since for u < | log(1 ± u)| ≤ 2u, we have log(1 − 2𝛿) ≤ log
𝜆(t) − 𝜆(s) ≤ log(1 + 2𝛿). t−s
1 2
3.5 Metric of D[0, 1] to make it complete | 55
Therefore, 𝜆(t) − 𝜆(s) ||𝜆||0 = sup log < 4𝛿 t − s s 0, find 𝛿–sparse set {t i } such that w x ([t i−1 , t i )) < wx (𝛿) + 𝜖 for all i. Let 𝜆(t i ) be defined by 1. 𝜆(t0 ) = s0 . 2. 𝜆(t i ) = s v if t i ∈ [s v−1 , s v ), where t i − t i−1 > 𝛿 ≥ s v − s v−1 . Then, 𝜆(t i ) is increasing. Now, extend it to 𝜆 ∈ Λ by linear interpolation. Claim: ̂ − x(𝜆−1 (t)) = |x(𝜁(t)) − x(𝜆−1 (t))| ||x(t) < wx (𝛿) + 𝜖. Clearly, if t = 0, or t = 1, it is true. Let us look at 0 < t < 1. First we observe that 𝜁(t), 𝜆−1 (t) lie in the same interval [t i−1 , t i ).(We will prove it.) This follows if we shows t j ≤ 𝜁(t) iff t j ≤ 𝜆−1 (t) or equivalently, t j > 𝜁(t) iff t j > 𝜆−1 (t). This is true for t j = 0. Suppose t j ∈ (s v−1 , s v ] and 𝜁(t) = s i for some i. By definition, 𝜁(t) t ≤ 𝜁(t) is equivalent to s v ≤ t. Since t j ∈ (s v−1 , s v ], 𝜆(t j ) = s v . This completes the proof.
56 | 3 Weak convergence on C [0, 1] and D[0, ∞)
3.6 Separability of the Skorokhod space d0 -convergence is stronger than d-convergence. Theorem 3.5 (D1): The space (D, d) is separable, and hence, so is (D, d0 ). , uk ] and Proof: Let B k be the set of functions taking constant rational value on [ u−1 k taking rational value at 1. Then, B = ∪ k B k is countable. Given x ∈ D, 𝜖 > 0, choose k such that k1 < 𝜖 and w x ( k1 ). Apply Lemma D3 with 𝜎 = { uk }. Note that A𝜎 x has finite many values and d(x, A𝜎 x) < 𝜖. Since A𝜎 x has finitely many real values, we can find y ∈ B such that given d(x, y) < 𝜖, d(A𝜎 x, y) < 𝜖. Now, we shall prove the completeness. Proof: We take d0 -Cauchy sequence. Then it contains a d0 -convergent subsequence. If {x k } is Cauchy, then there exists {y n } = {x k n } such that 1 . 2n
d0 (y n , y n+1 ) < There exists 𝜇n ∈ Λ such that 1. ||𝜇n ||0 < 21n . 2.
sup |y n (t) − y n+1 (𝜇n (t))| = sup |y n (𝜇n−1 (t)) − y n+1 (t)| t
t
1 < n. 2 We have to find y ∈ D and 𝜆 n ∈ Λ such that ||𝜆 n ||0 → 0 and y n (𝜆−1 n (t)) → y(t) uniformly.
3.6 Separability of the Skorokhod space | 57
−1 −1 Heuristic (not a proof): Suppose y n (𝜆−1 n (t)) → y(t). Then, by (2), y n (𝜇n (𝜆 n+1 (t))) is 1 −1 −1 within 2n of y n+1 (𝜆 n+1 (t)). Thus, y n (𝜆 n (t)) → y(t) uniformly. Find 𝜆 n such that −1 y n (𝜇n−1 (𝜆−1 n+1 (t))) = y n (𝜆 n (t)),
i.e. −1 𝜇n−1 ∘ 𝜆−1 n+1 = 𝜆 n .
Thus, 𝜆 n = 𝜆 n+1 𝜇n = 𝜆 n+2 𝜇n+1 𝜇n .. . = ⋅ ⋅ ⋅ 𝜇n+2 𝜇n+1 𝜇n . Proof: Since e u − 1 ≤ 2u, for 0 ≤ u ≤ 21 , we have sup |𝜆(t) − t| ≤ e||𝜆|| − 1. 0
t
Therefore, sup |(𝜇n+m+1 𝜇n+m ⋅ ⋅ ⋅ 𝜇n )(t) − (𝜇n+m 𝜇n+m−1 ⋅ ⋅ ⋅ 𝜇n )(t)| ≤ sup |𝜇n+m+1 (s) − s| s
t
≤ 2||𝜇n+m+1 ||0 1 = n+m . 2 For fixed n, (𝜇n+m 𝜇n+m−1 ⋅ ⋅ ⋅ 𝜇n )(t) converges uniformly in t as n goes to ∞. Let 𝜆 n (t) = lim (𝜇n+m 𝜇n+m−1 ⋅ ⋅ ⋅ 𝜇n )(t). m→∞
58 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Then, 𝜆 n is continuous and non-decreasing with 𝜆 n (0) = 0 and 𝜆 n (1) = 1. We have to prove ||𝜆 n ||0 is finite. Then, 𝜆 n is strictly increasing. (𝜇 𝜇 ⋅ ⋅ ⋅ 𝜇n )(t) − (𝜇n+m 𝜇n+m−1 ⋅ ⋅ ⋅ 𝜇n )(s) log n+m n+m−1 t−s ≤ ||𝜇n+m 𝜇n+m−1 ⋅ ⋅ ⋅ 𝜇n ||0 (since 𝜆 n ∈ Λ, ||𝜆 n ||0 < ∞) ≤ ||𝜇n+m ||0 + ⋅ ⋅ ⋅ + ||𝜇n ||0 (since ||𝜆 1 𝜆 2 ||0 ≤ ||𝜆 1 ||0 + ||𝜆 2 ||0 )
0, there exists 𝛿 > 0 such that 𝜌(x, y) < 𝛿 ⇒ f (y) < f (x) + 𝜖. Lemma 3.5: For fixed 𝛿, w (x, 𝛿) is upper-semicontinuous in x. Proof: Let x, 𝛿, and 𝜖 be given. Let {t i } be a 𝛿-spars set such that w x [t i−1 , t i ) < wx (𝛿) + 𝜖 for each i. Now choose 𝜂 small enough that 𝛿 + 2𝜂 < min(t i − t i−1 ) and 𝜂 < 𝜖. Suppose that d(x, y) < 𝜂. Then, for some 𝜆 in Λ, we have sup |y(t) − x(𝜆t)| < 𝜂, t
sup |𝜆−1 t − t| < 𝜂 t
Let s i = 𝜆−1 t i . Then s i − s i−1 > t i − t i−1 − 2𝜂 > 𝛿. Moreover, if s and t both lies in [s i−1 , s i ), then 𝜆s and 𝜆t both lie in [t i−1 , t i ), and hence |y(s) − y(t)| < |x(𝜆s) − x(𝜆t)| + 2𝜂 ≤ wx (𝛿) + 𝜖 + 2𝜂. Thus, d(x, y) < 𝜂 implies wy (𝛿) < wx (𝛿) + 3𝜖. Definition 3.4 (d-bounded): A is d-bounded if diameter is bounded, i.e. diameter(A) = sup d(x, y) < ∞. x,y∈A
Proof of necessity in Theorem 3.6: If A− is compact, then it is d-bounded, and since supt |x(t)| is the d-distance from x to the 0-function, (3.7) follows. By Lemma 3.1, w (x, 𝛿) goes to 0 with 𝛿 for each x. However, since w (⋅, 𝛿) is upper-semicontinuous the convergence is uniform on compact sets.
60 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Theorem 3.6, which characterizes compactness in D, gives the following result. Let {P n } be a sequence of probability measure on (D, 𝒟). Theorem 3.7: The sequence {P n } is tight if and only if these two conditions hold: We have lim lim sup P n ({x : ||x|| ≥ a}) = 0
(3.9)
lim lim sup P n ({x : wx (𝛿) ≥ 𝜖}) = 0.
(3.10)
a→∞
n
(ii) for each 𝜖, 𝛿→∞
n
Proof: Conditions (i) and (ii) here are exactly conditions (i) and (ii) of Azela-Ascoli theorem with ||x|| in place of |x(0)| and w in place of w. Since D is separable and complete, a single probability measure on D is tight, and so the previous proof goes through.
3.8 The space D[0, ∞) Here we extend the Skorohod theory to the space D∞ = D[0, ∞) of cadlag functions on [0, ∞), a space more natural than D = D[0, 1] for certain problems. In addition to D∞ , consider for each t > 0 the space D t = D[0, t] of cadlag functions on [0, t]. All the definitions for D1 have obvious analogues for D t : sups≤t |x(s)|, Λ t , ||𝜆||0t , d0t , d t . All the theorems carry over from D1 to D t in an obvious way. If x is an element of D∞ , or if x is an element of D u and t < u, then x can also be regarded as an element of D t by restricting its domain of definition. This new cadlag function will be denoted by the same symbol; it will always be clear what domain is intended. One might try to define the Skorohod convergence x n → x in D∞ by requiring that 0 d t (x n , x) → 0 for each finite, positive t. However, in a natural theory, x n = I[0,1−1/n] will converge to x = I[0,1] in D∞ , while d01 (x n , x) = 1. The problem here is that x is discontinuous at 1, and the definition must accommodate discontinuities. Lemma 3.6: Let x n and x be elements of D u . If d0u (x n , x) → 0 and m < u, and if x is continuous at m, then d0m (x n , x) → 0. Proof: We can work with the metrics d u and d m . By hypothesis, there are elements 𝜆 n of Λ u such that ||𝜆 n − I||u → 0
3.8 The space D[0, ∞) |
61
and ||x n − x𝜆 n ||u → 0. Given 𝜖, choose 𝛿 so that |t − m| ≤ 2𝛿 implies |x(t) − x(m)| < 𝜖/2. Now choose n0 so that, if n ≥ n0 and t ≤ u, then |𝜆 n t − t| < 𝛿 and |x n (t) − x(𝜆 n t)| < 𝜖/2. Then, if n ≥ n0 and |t − m| ≤ 𝛿, we have |𝜆 n t − m| ≤ |𝜆 n t − t| + |t − m| < 2𝛿 and hence |x n (t) − x(m)| ≤ |x n (t) − x(𝜆 n t)| + |x(𝜆 n t) − x(m)| < 𝜖. Thus sup |x(t) − x(m)| < 𝜖,
|t−m|≤𝛿
sup |x n (t) − x(m)| < 𝜖,
|t−m|≤𝛿
for n ≥ n0 .
(3.11)
If (i) 𝜆 n m < m, let p n = m − n1 ;
(ii) 𝜆 n m > m, let p n = 𝜆−1 (m − n1 ); (iii) 𝜆 n m = m, let p n = m. Then, (i) |p n − m| = n1 ; 1 −1 −1 (ii) |p n − m| =≤ |𝜆−1 n (m − n ) − (m − n )| + n ; (iii) |p n − m| = m. Therefore, p n → m. Since |𝜆 n p n − m| ≤ |𝜆 n p n − p n | + |p n − m|, we also have 𝜆 n p n → m. Define 𝜇n ∈ Λ n so that 𝜇n t = 𝜆 n t on [0, p n ] and 𝜇n m = m, and interpolate linearly on [p n , m]. Since 𝜇n m = m and 𝜇n is linear over [p n , m], we have |𝜇n t − t| ≤ |𝜆 n p n − p m | there, and therefore, 𝜇n t → t uniformly on [0, m]. Increase the n0 of (3.11) so that p n > m − 𝛿 and 𝜆 n p n > m − 𝛿 for n ≥ n0 . If t ≤ p n , then |x n (t) − x(𝜇n t)| = |x n (t) − x(𝜆 n t)| ≤ ||x n − x𝜆 n ||u . Meanwhile, if p n ≤ t ≤ m and n ≥ n0 , then m ≥ t ≥ p n > m − 𝛿 and m ≥ 𝜇n t ≥ 𝜇n p n = 𝜆 n p n > m − 𝛿, and therefore, by (3.11), |x n (t)− x(𝜇n t)| ≤ |x n (t)− x(m)|+|x(m)− x(𝜇n t)| < 2𝜖. Thus, |x n (t)− x(𝜇n t)| → 0 uniformly on [0, m]. The metric on D∞ will be defined in terms of the metrics d0m (x, y)for integral m, but before restricting x and y to [0, m], we transform them in such a way that they are continuous at m. Define if t ≤ m − 1; { 1, g n (t) = { m − t, if m − 1 ≤ t ≤ m; t ≥ m. { 0,
(3.12)
For x ∈ D∞ , let x m be the element of D∞ defined by x m (t) = g m (t)x(t),
t≥0
(3.13)
62 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Now take ∞
d0∞ (x, y) = ∑ 2−m (1 ∧ d0m (x m , y m )).
(3.14)
m=1
If d0∞ (x, y) = 0, then d0m (x, y) = 0 and x m = y m for all m, and this implies x = y. The other properties being easy to establish, d0∞ is a metric on D∞ ; it defines the Skorohod topology there. If we replace d0m by d m in (3.14), we have a metric d∞ equivalent to d0∞ . Let Λ ∞ be the set of continuous, increasing maps of [0, ∞) onto itself. Theorem 3.8: There is convergence d0∞ (x n , x) → 0 in D∞ if and only if there exist elements 𝜆 n of Λ ∞ such that sup |𝜆 n t − t| → 0
(3.15)
sup |x n (𝜆 n t) − x(t)| → 0.
(3.16)
t m and hence ||x n ||m ≤ ||x n 𝜆 n ||2m ; and ||x n 𝜆 n ||2m → ||x n ||2m by (3.16). This means that ||x n ||m is bounded(m is fixed). Given 𝜖, choose n0 so that n ≥ n0 implies that p n and 𝜇n p n both lies in (m − 𝜖, m], an interval on which g m is bounded by 𝜖. If n ≥ n0 and p n ≤ t ≤ m, then t and 𝜇n t both lie in (m − 𝜖, m], and it follows by (3.18) that |x m (t) − x m n (𝜇n t)| ≤ 𝜖(||x||m + ||x n ||m ). Since m m ||x n ||m is bounded, this implies that |x (t) − x n (𝜇n t)| → 0 holds uniformly on [p n , m] m 0 as well as on [0, p n ]. Therefore, d0m (x m n , x ) → 0 for each m and hence d ∞ (x n , x) and d∞ (x n , x) go to 0. This completes the proof. Theorem 3.9: There is convergence d0∞ (x n , x) → 0 in D∞ if and only if d0t (x n , x) → 0 for each continuity point t of x. m Proof: If d0∞ (x n , x) → 0, then d0∞ (x m n , x ) → 0 for each m. Given a continuity point t of x, fix an integer m for which t < m − 1. By Lemma 1 (with t and m in the roles of m m and u) and the fact that y and y m agree on [0, t], d0t (x n , x) = d0t (x m n , x ) → 0. To prove the reverse implication, choose continuity points t m of x in such a way that t m ↑ ∞. The argument now follows the first part of the proof of (3.8). Choose elements 𝜆m n of Λ t m in such a way that m 𝜖nm = ||𝜆m n − I||t m ∨ ||x n 𝜆 n − x||t m → 0
for each m. As before, define integers m n in such a way that m n → ∞ and 𝜖nm n < 1/m n , and this time define 𝜆 n ∈ Λ ∞ by 𝜆n t = {
n 𝜆m n t, if t ≤ t m n ; t, if t ≥ t m n .
n The |𝜆 n t − t| ≤ 1/m n for all t, and if c < t m n , then ||x n 𝜆 n − x||c = ||x n 𝜆m n − x||c ≤ 1/m n → 0. This implies that (3.15) and (3.16) hold, which in turn implies that d0∞ (x n , x) → 0. This completes the proof.
64 | 3 Weak convergence on C [0, 1] and D[0, ∞)
3.8.1 Separability and completeness For x ∈ D∞ , define 𝜓m x as x m restricted to [0, m]. Then, since d0m (𝜓m x n , 𝜓m x) = m d0m (x m n , x ), 𝜓m is a continuous map of D ∞ into D m . In the product space Π = D 1 × D2 × ⋅ ⋅ ⋅ , the metric ∞
𝜌(𝛼, 𝛽) = ∑ 2−m (1 ∧ d0m (𝛼m , 𝛽m )) m=1
defines the product topology, that of coordinatewise convergence. Now define 𝜓 : D∞ → Π by 𝜓x = (𝜓1 x, 𝜓2 x, . . .): 𝜓m : D∞ → D m ,
𝜓 : D∞ → Π.
Then d0∞ (x, y) = 𝜌(𝜓x, 𝜓y) : 𝜓 is an isometry of D∞ into Π. Lemma 3.7: The image 𝜓D∞ is closed in Π. Proof: Suppose that x n ∈ D∞ and 𝛼 ∈ Π and 𝜌(𝜓x n , 𝛼) → 0; then d0m (x m n , 𝛼m ) → 0 for each m. We must find an x in D∞ such that 𝛼 = 𝜓x-that is, 𝛼m = 𝜓m x for each m. Let T be the dense set of t such that for every m ≥ t, 𝛼m is continuous at t. Since m d0m (x m n , 𝛼m ) → 0, t ∈ T ∩ [0, m] implies x n (t) = g n (t)x n (t) → 𝛼m (t). This means that for every t in T, the limit x(t) = lim x n (t) exists (consider an m > t + 1, so that g n (t) = 1). Now g m (t)x(t) = 𝛼m (t) on T∩[0, m]. It follows that x(t) = 𝛼m (t) on T∩[0, m−1], so that x can be extended to a cadlag function on each [0, m − 1] and then to a cadlag function on [0, ∞]. Now, by right continuity, g m (t)x(t) = 𝛼m (t) on [0, m], or 𝜓m x = x m = 𝛼m . This completes the proof. Theorem 3.10: The space D∞ is separable and complete. Proof: Since Π is separable and complete, so are the closed subspace 𝜓D∞ and its isometric copy D∞ . This completes the proof.
3.8.2 Compactness Theorem 3.11: Set A is relatively compact in D∞ if and only if, for each m, 𝜓m A is relatively compact in D m . Proof: If A is relatively compact, then Ā is compact and hence the continuous image 𝜓m Ā is also compact. But then, 𝜓m A, as a subset of 𝜓m A,̄ is relatively compact. Conversely, if each 𝜓m A is relatively compact, then each 𝜓m A is compact, and therefore, B = 𝜓1 A × 𝜓2 A × ⋅ ⋅ ⋅ and E = 𝜓D∞ ∩ B are both compact in Π. But x ∈ A
3.8 The space D[0, ∞) |
65
implies 𝜓x ∈ 𝜓m A for each m, so that 𝜓x ∈ B. Hence 𝜓A ⊂ E, which implies that 𝜓A is totally bounded and so is its isometric image A. This completes the proof. For an explicit analytical characterization of relative compactness, analogous to the Arzela-Ascoli theorem, we need to adapt the w (x, 𝛿) to D∞ . For an x ∈ D m , define wm (x, 𝛿) = inf max w(x, [t i−1 , t i )),
(3.19)
1≤i≤v
where the infimum extends over all decompositions [t i−1 , t i ), 1 ≤ i ≤ v, of [0, m) such that t t − t i−1 > 𝛿 for 1 ≤ i ≤ v. Note that the definition does not require t v − t v−1 > 𝛿; Although 1 plays a special role in the theory of D1 , the integers m should play no special role in the theory of D∞ . The exact analogue of w (x, 𝛿) is (3.19), but with the infimum extending only over the decompositions satisfying t t − t i−1 > 𝛿 for i = v as well as for i < v. Call this w̄ m (x, 𝛿). By an obvious extension, a set B in D m is relatively compact if and only if ̄ 𝛿) = 0. Suppose that A ⊂ D∞ and transform the supx ||x||m < ∞ and lim𝛿 supx w(x, two conditions by giving 𝜓m A the role of B. By (Theorem 3.11), A is relatively compact if and only if, for every m sup ||x m ||m < ∞
(3.20)
lim sup w̄ m (x m , 𝛿) = 0.
(3.21)
x∈A
and 𝛿→0 x∈A
The next step is to show that (3.20) and (3.21) are together equivalent to the condition that, for every m, sup ||x||m < ∞
(3.22)
lim sup wm (x, 𝛿) = 0.
(3.23)
x∈A
and 𝛿→0 x∈A
The equivalence of (3.20) and (3.22) follows easily because ||x m ||m ≤ ||x||m ≤ ||x m+1 ||m+1 . Suppose (3.22) and (3.23) both hold, and let K m be the supremum in (3.22). If x ∈ A and 𝛿 < 1, then we have |x m (t)| ≤ K m 𝛿 for m − 𝛿 ≤ t < m. Given 𝜖, choose 𝛿 so that K m 𝛿 < 𝜖/4 and the supremum in (3.23) is less than 𝜖/2. If x ∈ A and m − 𝛿 lies in the interval [t j−1 , t j ) of the corresponding partition, replace the intervals [t i−1 , t i ) for i ≥ j by the single interval [t j−1 , m). This new partition shows that w̄ m (x, 𝛿). Hence, (3.21).
66 | 3 Weak convergence on C [0, 1] and D[0, ∞)
That (3.21) implies (3.23) is clear because wm (x, 𝛿) ≤ w̄ m (x, 𝛿): An infimum increases if its range is reduced. This gives us the following criterion. Theorem 3.12: A set A ∈ D∞ is relatively compact if and only if (3.22) and (3.23) hold for all m.
3.8.3 Tightness Theorem 3.13: The sequence {P n } is tight if and only if there two conditions hold: (i) For each m lim lim sup P n ({x : ||x||m ≥ a}) = 0.
(3.24)
lim lim sup P n ({x : wm (x, 𝛿) ≥ 𝜖}) = 0.
(3.25)
a→∞
n
(ii) For each m and 𝜖, 𝛿
n
There is the corresponding corollary. Let j m (x) = sup |x(t) − x(t−)|.
(3.26)
t≤m
Corollary 3.1: Either of the following two conditions can be substituted for (i) in Theorem 3.13: (i’) For each t in a set T that is dense in [0, ∞), lim lim sup P n ({x : |x(t)| ≥ a}) = 0.
a→∞
(3.27)
n
(ii’) The relation (3.27) holds for t = 0, and for each m, lim lim sup P n ({x : j m (x) ≥ a}) = 0.
a→∞
(3.28)
n
Proof: The proof is almost the same as that for the corollary to Theorem 3.7. Assume (ii) and (i’). Choose points t i such that 0 = t0 < t1 < ⋅ ⋅ ⋅ < t v = m, t i − t i−1 > 𝛿 for 1 ≤ i ≤ v − 1, and w x [t i−1 , t i ) < wm (x, 𝛿) + 1 for 1 ≤ i ≤ v. Choose from T points s j such that 0 = s0 < s1 < ⋅ ⋅ ⋅ < s k = m and s j − s j−1 < 𝛿 for 1 ≤ j ≤ k. Let m(x) = max0≤j≤k |x(s j )|. If t v − t v−1 > 𝛿, then ||x||m ≤ m(x) + wm (x, 𝛿) + 1, just as before. If t v − t v−1 ≤ 𝛿(and 𝛿 < 1, so that t v−1 > m − 1), then ||x||m−1 ≤ m(x) + wm (x, 𝛿) + 1. The old argument now gives (3.24), but with ||x||m replaced by ||x||m−1 , which is just as good. In the proof that (ii) and (i’) imply (i), we have (v − 1)𝛿 ≤ m instead of v𝛿
3 , 4
for i = 1, 2
(3.32)
If 𝜇(M2 ∩ I) ≤ 41 , then 𝜇(M2 ) ≤ 43 , which is a contradiction. Hence, 𝜇(M2 ∩ I) > 41 , and for d(0 ≤ d ≤ 𝛿), 𝜇((M2 +d)∩ J) ≤ 𝜇((M2 ∩ I)+d) = 𝜇((M2 ∩ I)) 41 . Thus 𝜇(M1 )+𝜇((M2 +d)∩ J) > 1, which implies 𝜇(M1 ∩ (M2 + d)) > 0. There is therefore an s such that s ∈ M1 and s − d ∈ M2 , from which follows |x(t1 ) − x(t2 )| < 2𝜖.
(3.33)
68 | 3 Weak convergence on C [0, 1] and D[0, ∞)
Thus, (3.31) and (3.32) together implies (3.33). To put it another way, if (3.31) holds but (3.33) does not, then either P(𝜃 ∈ M1c ) ≥ 41 or P(𝜃 ∈ M2c ) ≥ 41 . Therefore, 2 1 P(X𝜏n2 − X𝜏n1 ≥ 2𝜖, 𝜏2 − 𝜏1 ≤ 𝛿) ≤ ∑ P[P(X𝜏ni +𝜃 − X𝜏ni ≥ 𝜖|ℱ n ) ≥ ] 4 i=1
≤ 4 ∑ P(X𝜏ni +𝜃 − X𝜏ni ≥ 𝜖). 2
i=1
Since 0 ≤ 𝜃 ≤ 2𝛿 ≤ 2𝛿0 , and since 𝜃 and ℱ n are independent, it follows by (3.29) that the final term here is at most 8𝜂. Therefore, Condition 10 implies Condition 20 . This is Aldous’s theorem: Theorem 3.15 (Aldous): If (3.24) and Condition 1° hold, then {X n } is tight. Proof: By Theorem 3.13, it is enough to prove that lim lim sup P(wm (X n , 𝛿) ≥ 𝜖) = 0.
a→∞
(3.34)
n
Let Δ k be the set of nonnegative dyadic rationals¹ j/2k of order k. Define random variables 𝜏0n , 𝜏1n , . . . by 𝜏0n = 0 and n 𝜏in = min{t ∈ Δ k : 𝜏i−1 < t ≤ m, |X tn − X𝜏nn | ≥ 𝜖}, i−1
with 𝜏in = m if there is no such t. The 𝜏in depend on 𝜖, m, and k as well as on i and n, although the notation does not show this. It is easy to prove by induction that the 𝜏in are all stopping times. Because of Theorem 3.14, we can assume that condition 2 holds. For given 𝜖, 𝜂, m, choose 𝛿 and n0 so that n P(X𝜏ni − X𝜏ni−1 ≥ 𝜖, 𝜏in − 𝜏i−1 ≤ 𝛿 ) ≤ 𝜂 for i ≥ 1 and n ≥ n0 . Since 𝜏in < m implies that X𝜏ni − X𝜏ni−1 ≥ 𝜖, we have n P(𝜏in < m, 𝜏in − 𝜏i−1 ≤ 𝛿 ) ≤ 𝜂,
i ≥ 1, n ≥ n0 .
(3.35)
Now choose an integer q such that q𝛿 ≥ 2m. There is also a 𝛿 such that n P(𝜏in < m, 𝜏in − 𝜏i−1 ≤ 𝛿) ≤
𝜂 , q
i ≥ 1, n ≥ n0 .
(3.36)
1 Dyadic rational is a rational number whose denominator is a power of 2, i.e. a number of the form a/2b, where a is an integer and b is a natural number; for example, 1/2 or 3/8, but not 1/3. These are precisely the numbers whose binary expansion is finite.
3.8 The space D[0, ∞) |
69
However, q n P( ⋃ {𝜏in < m, 𝜏in − 𝜏i−1 ≤ 𝛿}) ≤ 𝜂,
n ≥ n0 .
(3.37)
i−1
Although 𝜏in depends on k, (3.35) and (3.37) hold for all k simultaneously. By (3.35), n n E(𝜏in − 𝜏i−1 |𝜏qn < m) ≥ 𝛿 P(𝜏in − 𝜏i−1 ≥ 𝛿 |𝜏qn < m)
≥ 𝛿 (1 −
P(𝜏qn
𝜂 ), < m)
and therefore, m ≥ E(𝜏qn |𝜏qn < m) q n = ∑ E(𝜏in − 𝜏i−1 |𝜏qn < m) i=1
≥ q𝛿 (1 −
𝜂 ). P(𝜏qn < m)
Since q𝛿 ≥ 2m by the choice of q, this leads to P(𝜏qn < m) ≥ 2𝜂. By this and (3.37), q n P({𝜏qn < m} ∪ ⋃ {𝜏in < m, 𝜏in − 𝜏i−1 ≤ 𝛿}) ≤ 3𝜂,
k ≥ 1, n ≥ n0
(3.38)
i−1
Let A n k be the complement of the set in (3.38). On this set, let v be the first index for which 𝜏vn = m. Fix an n beyond n0 . There are points t ki (𝜏in ) such that 0 = t0k < ⋅ ⋅ ⋅ < t kv = m and t ki − t ki−1 > 𝛿 for 1 ≤ i < v. |X tn − X sn | < 𝜖 if s, t lie in the same [t ki−1 , t ki ) as well as in Δ k . If A n = lim supk A n k , then P(A n ) ≥ 1 − 3𝜂, and on A n there is a sequence of values of k along which v is constant (v ≤ q), and for each i ≤ v, t ki converges to some t i . However, 0 = t0 < ⋅ ⋅ ⋅ < t v = m, t i − t i−1 ≥ 𝛿 for i < v, and by right continuity, |X tn − X sn | ≤ 𝜖 if s, t lie in the same [t i−1 , t i ). It follows that w (X n , 𝛿) ≤ 𝜖 on a set of probability at least 1 − 3𝜂 and hence (3.34). As a corollary to the theorem we get. Corollary 3.2: If for each m, the sequences {X0n } and {j m (X n )} are tight on the line, and if Condition 1° holds, then {X n } is tight. Skorokhod introduced the topology on D[0, ∞) to study the weak convergence of processes with independent increment. Semi-martingales are generalizations of these processes. In the next chapter, we shall discuss the weak convergence of semimartingales.
4 Central limit theorem for semi-martingales and applications As stated at the end of chapter 3, Skorokhod developed the theory of weak convergence of stochastic processes with values in D[0, T] to consider the limit theorems (or invariance principle) with convergence to process with independent increments. Since these are special classes of semi-martingales that have sample paths in D[0, ∞), we now study the work of Liptser and Shiryaev [17] (see [15]) on weak convergence of a sequence of semi-martingales to a limit. We begin with the definition of semimartingales and their structure, including semi-martingale characteristics [15]. Based on this, we obtain conditions for the weak convergence of semi-martingale sequence to a limit. We begin with some preliminary lemmas, which will be needed in the proofs. We end the chapter by giving applications to statistics of censored data that arises in survival analysis in clinical trials.
4.1 Local characteristics of semi-martingale In this chapter, we study the central limit theorem by Lipster and Shiryayev. We begin by giving some preliminaries. We consider (Ω, {ℱt }, ℱ, P) a filtered probability space, where F = {ℱt , t ≥ 0} is a non-decreasing family of sub 𝜎-field of ℱ, satisfying ⋂t≥s ℱt = ℱs . We say that {X, F} is a martingale if for each t, X t ∈ X = {X t } ⊂ L1 (Ω, ℱt , P) and E(X(t)|ℱs ) = X s a.e. P. WLOG, we assume {X t , t ≥ 0} is D[0, ∞) valued (or a.s. it is cadlag) as we can always find a version. A martingale X is said to be square-integrable if supt EX 2t < ∞. We say that {X t , t ≥ 0} is locally square integrable martingale if there exists an increasing sequence 𝜎n of (ℱt )-stopping times such that 0 ≤ 𝜎n < ∞ a.e. limn 𝜎n = ∞, and {X(t ∧ 𝜎n )1{𝜎n >0} } is a square integrable martingale. A process (X, F) is called a semi-martingale if it has the decomposition X t = X0 + M t + A t , where {M t } is local martingale, M0 = 0, A is right continuous process with A0 , A t , ℱt measurable and has sample paths of finite variation. We now state condition of A to make this decomposition unique (called canonical decomposition). For this we need the following. We say that a sub 𝜎-field of [0, ∞) × Ω generated by sets of the form {(s, t] × A, 0 < s ≤ t < ∞ and A ∈ ℱs } and {0} × B(B ∈ ℱ0 ) is the 𝜎-field of predictable sets from now on called predictable 𝜎-algebra 𝒫. In the above decomposition, A is 𝒫-measurable then it is canonical. We remark that if the jumps of the semi-martingale are bounded, then the decomposition is canonical. We now introduce the concept of local characteristics.
4.1 Local characteristics of semi-martingale | 71
Let (X, F) be a semi-martingale. Set with ΔX(s) as jump at s, ̃ = ∑ ΔX(s)1(|ΔX(s)| ≥ 𝜖). X(t) s≤t
̃ is a semi-martingale with unique canonical decomposition The X(t) = X(t) − X(t) X(t) = X(0) + M(t) + A(t), where (M, F) is a local martingale and A is predictable process of finite variation. Thus, ̃ X(t) = X(0) + M(t) + A(t) + X(t). Let 𝜇((0, t]; A ∩ {|x| ≥ 𝜖}) = ∑ 1(ΔX(s) ∈ A, |ΔX(s)| > 𝜖) and v((0, t]; ⋅ ∩ {|x| ≥ 𝜖}) its predictable projection. Then for each 𝜖 > 0, we can write with v as a mesaure generated on v(∗x0) 1
X(t) = X(0) + M (t) + A(t) + ∫ ∫ 0
|x|>𝜖
xv(ds, dx),
where (M , F) is a local martingale. Now the last two terms are predictable. Thus, the semi-martingale is described by M , A, and v. We thus have the following. Definition 4.1: The local characteristic of a semi-martingale X is defined by the triplet (A, C, v), where 1. 2.
A is the predictable process of finite variation appearing in the above decomposition of X(t). C is a continuous process defined by C t C t = [X, X]ct = ⟨M c ⟩t .
3.
v is the predictable random measure on R+ × R, the dual predictable projection of the measure 𝜇 associated to the jumps of X given on ({0}c ) by 𝜇(w, dt, dx) = ∑ 1(ΔX(s, w) ≠ 0)𝛿(s,ΔX(s,w)) (dt, dx) s>0
with 𝛿 being the dirac delta measure.
72 | 4 Central limit theorem for semi-martingales and applications
4.2 Lenglart inequality Lemma 4.1 Mcleish lemma: Let F n (t), n = 1, 2, . . . and F(t) be in D[0, ∞) such that F n (t) is increasing in t for each n, and F(t) is a.s. continuous and for each t > 0, there exists t n → t such that F n (t n ) → F(t). Then, sup |F n (t) − F(t)| → 0, t
where sup is taken over a compact set of [0, ∞). Proof: WLOG, choose compact set [0, 1] : For 𝜖 > 0 choose {t n i , i = 0, 1, . . . , k} for fixed k ≥ 1/𝜖 such that t n i → i𝜖 and F n (t n i ) →p F(i𝜖) as n → ∞. Then sup |F n (t) − F(t)| ≤ sup |F n (t n i+1 ) − F n (t n i )| + sup |F n (t n i ) − F(t n i )| t
i
i
+ sup |F n (t n i+1 ) − F(t n i )| + 𝜖. i
As n → ∞, choose 𝜖 such that |F((i + 1)𝜖) − F(i𝜖)| is small. We assume that A t is an increasing process for each t and is ℱt -measurable. Definition 4.2: An adapted positive right continuous process X t is said to be dominated by an increasing predictable process A if for all finite stopping times T we have EX T ≤ EA T Example: Let M 2t is square martingale. Consider X t = M 2t . Then, we know X t − < M >t is a martingale, and hence, X T − < M >T is a martingale. Thus, E(X T − < M >T ) = EX0 = 0 ⇒ EX T = E < M >T Let X ∗t = sup |X s | s≤t
Lemma 4.2: Let T be a stopping time and X be dominated by increasing process A (as above). Then, P(X ∗T ≥ c) ≤ for any positive c.
E(A T ) c
4.2 Lenglart inequality | 73
Proof: Let S = inf{s ≤ T ∧ n : X s ≥ c}, clearly S ≤ T ∧ n. Thus, EA T ≥ EA S ( since A is an increasing process) ≥ EX S ( since X is dominated by A) ≥∫
{X ∗T∧n >c}
X S dP ( since X S > 0 on {X ∗T∧n > c})
≥ c ⋅ P(X ∗T∧n > c) Therefore, we let n go to ∞, then we get EA T ≥ c ⋅ P(X ∗T > c). Theorem 4.1 (Lenglart Inequality): If X is dominated by a predictable increasing process, then for every positive c and d P(X ∗T > c) ≤
E(A T ∧ d) + P(A T > d). c
Proof: It is enough to prove for predictable stopping time T > 0, P(X ∗T− ≥ c) ≤
1 E(A T− ∧ d) + P(A T− ≥ d). c
(4.1)
We choose T = ∞. Then T is predictable, 𝜎n = n and apply to X tT = X t∧T for T finite T∗ stopping time X T− = X ∗T . To prove (4.1) P(X ∗T− ≥ c) = P(X ∗T− ≥ c, A T− < d) + P(X ∗T− ≥ c, A T− ≥ d) ≤ P(X ∗T− ≥ c) + P(A T− ≥ d)
(4.2)
Let S = inf{t : A t ≥ d}. It is easy to show that S is a stopping time. Also, S is predictable. On {𝜔 : A T− < d}, S(𝜔) ≥ T(𝜔), and hence ∗ 1(A t− < d)X ∗T− ≤ X(T∧S)− .
By (4.2), we have P(X ∗T− ≥ c) ≤ P(X ∗T− ≥ c) + P(A T− ≥ d) ≤ P(X ∗T∧S ≥ c) + P(A T− ≥ d)
74 | 4 Central limit theorem for semi-martingales and applications
Let 𝜖 > 0, 𝜖 < c and S n ↗ S ∧ T. Then, ∗ P(X(T∧S)− ≥ c) ≤ lim inf P(X ∗S n ≥ c − 𝜖) ( by Fatou’s lemma) n
1 lim EA S n ( by Lemma 4.2) ≤ c − 𝜖 n→∞ 1 = EA ( by Monotone convergence theorem) c − 𝜖 (S∧T) Since 𝜖 is arbitrary, ∗ P(X(T∧S)− ≥ c) ≤
1 1 EA ≤ E(A(S∧T)− ) c Sn c 1 ≤ E(A(T−∧d) ) c
This completes the proof. Corollary 4.1: Let M ∈ ℳ2LOC ((ℱt ), P) (class of locally square integrable martingale). Then, P(sup |M t | > a) ≤ t≤T
1 E(< M >T ∧b) + P(< M >T ≥ b) a2
Proof: Use X t = |M t |2 , c = a2 , b = d, A t = < M >t . Lemma 4.3: Consider {ℱtn , P}. Let {M n } be locally square martingale. Assume that < M n >t →p f (t), where f is continuous and deterministic function (hence, f will be an increasing function). Then, {P ∘ (M n )−1 } on D[0, ∞) is relatively compact on D[0, ∞). Proof: It suffices to show for constant T < ∞, and for any 𝜂 > 0 there exists a > 0 such that sup P( sup |M tn | > a) < 𝜂. n
(4.3)
t≤T
For each T < ∞ and 𝜂, 𝜖 > 0, there exist n0 , 𝛿 such that for any stopping time 𝜏n (w.r.t (ℱtn ), 𝜏n ≤ T, 𝜏n + 𝛿 < T). sup P( sup |M𝜏nn +t − M𝜏nn | ≥ 𝜖) < 𝜂. n≥n0
0≤t≤𝛿
(4.4)
4.2 Lenglart inequality | 75
Observe that by corollary to Lenglart inequality, P(sup |M tn | > a) ≤ t≤T
1 E(< M n >T ∧b) + P(< M n >T ≥ b) a2
Let b = f (T) + 1, then under the hypothesis, there exists n1 such that for all n ≥ n1 P(< M n >T ≥ b)
a) ≤ n
t≤T
1 𝜂 b + + ∑ P(sup |M tk | > a). 2 2 k=1 t≤T a
Choose a large to obtain (4.3). n We again note that M𝜏+t − M𝜏n is a locally square integrable martingale. Hence, by Corollary 4.1 and triangle inequality n P( sup |M𝜏+t − M tn | ≥ 𝜖) ≤ 0≤t≤𝛿
1 E(( < M n >𝜏+𝛿 − < M n >𝜏 ) ∧ b) 𝜖2 + P( < M n >𝜏+𝛿 − < M n >𝜏 ≥ b)
≤
1 E( sup ( < M n >t+𝛿 − < M n >t ) ∧ b) 𝜖2 t≤T + P( sup < M n >t+𝛿 − < M n >t ≥ b) t≤T
≤
n 1 E( sup M t+𝛿 − f (t + 𝛿) ∧ b) 𝜖2 t≤T 1 E( sup M tn − f (t + 𝛿) ∧ b) 𝜖2 t≤T 1 + 2 sup |f (t + 𝛿) − f (t)| 𝜖 t≤T
+
b + P( sup < M n >t+𝛿 − f (t + 𝛿) ≥ ≥ b) 3 t≤T b + P( sup < M n >t − f (t) ≥ ≥ b) 3 t≤T b + 1( sup |f (t + 𝛿) − f (t)| ≥ ) 3 t≤T n + P( sup < M >t+𝛿 − < M n >t ≥ b). t≤T
Using the McLeish lemma, each term goes to 0. This completes the proof.
76 | 4 Central limit theorem for semi-martingales and applications
If our conditions guarantee that for a locally square integrable martingale the associated increasing process converges, then the problem reduces to the convergence of finite-dimensional distributions for weak convergence of {M tn } to the convergence of finite-dimensional distributions.
4.3 Central limit theorem for semi-martingale Theorem 4.2: Let {X n } be a sequence of semi-martingale with characteristics (B n [X n , X n ], 𝜈n ) and M be a continuous Gaussian martingale with increasing process < M >. (i) For any t > 0 and 𝜖 ∈ (0, 1), let the following conditions be satisfied: (A) t
∫ ∫ 0
|x|>𝜖
𝜈n (ds, dx) →p 0
(B) B nc t + ∑ ∫
0≤s≤t |x|≤𝜖
x𝜈n ({s}, dx) →p 0
(C) 2
t
< X nc >t + ∫ ∫
|x|≤𝜖
0
x2 𝜈n (ds, dx) − ∑ ( ∫
|x|≤𝜖
0≤s≤t
x𝜈n ({s}, dx)) →P < M >t
Then X n ⇒ M for finite dimension. (ii) If (A) and (C) are satisfied as well as the condition sup B nc x𝜈n ({u}, dx) →P 0 s + ∑ ∫ 01) . 01) > 𝛿} ⊂ { ∑ 1(|ΔX sn |>1) > 𝛿}. 01
0
0t + ∫ ∫ 0
− ∑ (∫ 0t by condition (C). By McLeish lemma, sup | < Δn (𝜖) >t − < M >t | →p 0.
(4.7)
t≤T
We showed supt≤T |𝛾tn (𝜖)| → 0. Combining this with (4.7), we have max ( sup | < Δn (𝜖) >t − < M >t |, sup |𝛾tn (𝜖)|) → 0 t≤T
t≤T
Then, there exists {𝜖n } such that sup | < Δn (𝜖n ) >t − < M >t | → 0, t≤T
M tn = Δnt (𝜖n ),
sup |𝛾tn (𝜖n )| → 0 t≤T
Y tn = Δnt (𝜖n ) + 𝛾tn (𝜖n ).
(4.8)
4.3 Central limit theorem for semi-martingale | 79
It suffices to prove that M n ⇒ M. {M tn = Δnt (𝜖n )} is compact by Lemma 4.3 and (4.7). It suffices to prove finite-dimensional convergence. Let H(t), 0 ≤ t ≤ T be a piecewise constant left-continuous function assuming finitely many values. Let t
N tn = ∫ H(s)dM sn ,
t
N t = ∫ H(s)dM s .
0
0
Since M is Gaussian, N is also Gaussian. Remark: Cramer-Wold criterion for finite-dimensional convergence n
T
Ee iN T → Ee iN T = e− 2 ∫0 1
H 2 (s)ds
.
Let A be predictable, A ∈ 𝒜LOC (ℱ, P) e(A)t = e A t ∏ (1 + ΔA s )e−A s . 0≤s≤t
Then e A t will be a solution of dZ t = Z t− dA t by Dolean-Dade’s work. If m ∈ ℳLOC , then t 1 A t = − < m c >t + ∫ ∫ (e isx − 1 − ix)𝜈m (ds, dx). 2 0 R−{0}
Lemma 4.4: For some a > 0 and c ∈ (0, 1), let < m >∞ ≤ a, supt |Δm t | ≤ c. Then (e(A t ), ℱt ) is such that |e(A)t | ≥ exp ( −
2a ), 1 − c2
−1
and the process (Z t , ℱt ) with Z t = e imt (e(A)t ) is a uniformly integrable martingale. We will use the lemma to prove n
T
Ee iN T → Ee iN T = e− 2 ∫0 1
H 2 (s)ds
.
Case 1: Let us first assume < M n >T ≤ a and a > 1+ < M >T . Observe that 1. By (4.7), < N n >t →< N >t . 2. |Δnt | ≤ 2𝜖n . 3. |ΔN tn | ≤ 2𝜆𝜖n = d n , where 𝜆 = maxt≤T |H(s)|. We want to prove E exp (iN Tn +
1 < N >T ) → 1. 2
(4.9)
80 | 4 Central limit theorem for semi-martingales and applications
n
−1
Let A nt be increasing process associated with N tn . Let Z t = e iN t (e(A n )t ) . Choose n0 such that d n0 = 2𝜆𝜖n0 ≤ 1/2. By Lemma 4.4, Z n is a martingale with EZ Tn = 1. To prove (57) is equivalent to proving −1 n 1 1 lim (E exp (iN Tn + T )−Ee iN T (e(A n )T ) ) = 0E exp (iN Tn + T ) → 1. (4.10) 2 2
n→∞
Thus, it is sufficient to prove e(A n )T → e− 2 T . 1
Recall t 1 (e ix − 1 − ix)𝜈̃n (ds, dx). A nt = − < N >t + ∫ ∫ 2 0 |x|≤d n
Let 𝛼tn = ∫
|x|≤d n
(e ix − 1 − ix)𝜈̃n ({t}, dx).
Since (e ix − 1 − ix) ≤ x2 /2, we have 𝛼tn ≤ d2n /2. Therefore, ∑ |𝛼tn | = 0≤t≤T
1 T ∫ ∫ x2 𝜈̃n (dt, dx) 2 0 |x|≤d n
=
1 < N n >T 2
=
1 𝜆2 a . 2 2
Then, n
∏ (1 + 𝛼tn )e−𝛼t → 1. 0T − ∫ ∫ (e isx − 1 − isx)𝜈̃n (ds, dx) →p < N >T . 2 2 0 |x|≤d n
By observation (a) and the form of < N n >T , it suffices to prove T
∫ ∫ 0
|x|≤d n
(e isx − 1 − isx)𝜈̃n (ds, dx) →p 0.
4.4 Application to survival analysis |
81
3
We have, since |(e isx − 1 − isx + x2 /2)| ≤ c s x3 , T
T d x2 ((e isx − 1 − isx) + ) 𝜈̃n (ds, dx) ≤ n ∫ ∫ x2 𝜈̃n (ds, dx) 2 6 0 |x|≤d n |x|≤d n ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
∫ ∫ 0
≤ |x|6
3
dn < N n >T 6 d ≤ n 𝜆2 a 6 → 0.
≤
To dispose of assumption, define 𝜏n = min{t ≤ T :< M n >t ≥< M >T +1}. n Then 𝜏n is stopping time. We have 𝜏n = T if < M n >t T +1. Let M̃ n = M t∧𝜏 . Then n
< M̃ n >T ≤ 1+ < M >T +𝜖n2 ≤ 1+ < M >T +𝜖12 and lim P(| < M̃ n >t − < M̃ >t | > 𝜖) ≤ lim P(𝜏n > T) = 0. n
n
Next, n
n
n
n
lim Ee iN T = lim E(e iN t − e iN t∧𝜏n ) + lim Ee iN t∧𝜏n
n→∞
n→∞
n→∞
n
n
= lim E(e iN T − e iN T∧𝜏n ) + Ee iN T n→∞
= Ee iN T . The last equality follows from n n lim E(e iN T − e iN T∧𝜏n ) ≤ 2 lim P(𝜏n > T) = 0. n→∞ n→∞ This completes the proof.
4.4 Application to survival analysis Let X be a positive random variable. Let F and f be cumulative distribution function and probability density function of X. Then, survival function F̄ is defined as ̄ = P(X > t) = 1 − F(t) F(t)
82 | 4 Central limit theorem for semi-martingales and applications
Then, we have P(t < X ≤ t + △t|X > t) =
P(t < X ≤ t + △t) ̄ F(t) t+△t
=
∫t
dF(s)
̄ F(t)
.
Since we know 1 t+△t ∫ f (x)ds → f (t). △t t as △t → 0, hazard rate, is defined as h(t) =
f (t) ̄ F(t)
=−
d ̄ log F(t). dt
Therefore, survival function can be written as t
̄ = exp ( − ∫ h(s)ds). F(t) 0
If the integrated hazard rate is given, then it determines uniquely life distribution. For example, think about the following: 𝜏F = sup{s : F(s) < 1}. Consider now the following problem arising in clinical trials. Let 1. X1 , ...., X n be i.i.d F (life time distribution) 2. U1 , ...., U n be i.i.d measurable function with distribution function G with G(∞) < 1, which means U i are not random variable in some sense. Now, consider 1. indicator for “alive or not at time s”: 1(X i ≤ U i , X i ∧ U i ≤ s). 2. indicator for “alive and leave or not at time s”: U i 1(X i ∧ U i ≤ s). 3. indicator for “leave or not at time s”: 1(X i ∧ U i ≥ s). and 𝜎-field ℱtn = 𝜎({1(X i ≤ U i , X i ∧ U i ≤ s), U i 1(X i ∧ U i ≤ s), 1(X i ∧ U i ≥ s), s ≤ t, i = 1, 2, . . . , n}). ℱtn is called information contained in censored data.
4.4 Application to survival analysis |
83
Let t
𝛽(t) = ∫ h(s)ds. 0
̂ is an estimate of 𝛽(t), then we can estimate survival function. The estimator of If 𝛽(t) survival function will be ̂ ̂̄ = e−𝛽(t) F(t) ,
which will be approximately be ̂ ∏ (1 − d(𝛽(s))). s≤t
This is alternate estimate of survival function, which is called, Nelson estimate. Let n
N n (t) = ∑ 1(X i ≤ U i , X i ∧ U i ≤ t) i=1 n
Y n (t) = ∑ 1(X i ∧ U i ≥ t). i=1
̂ Then, 𝛽(t), which is called Breslow estimator, will be t+△t
t ∫ dF(s) ̂ = ∫ dN n (s) ≈ t . 𝛽(t) ̄ 0 Y n (s) F(t)
Now, we consider another estimator of survival function, which is called the KaplanMeier estimator. It will be ∏ (1 − s≤t
△N n (s) ) Y n (s)
Gill [12] showed the asymptomatic properties of the Kaplan-Meier estimator. We can show that △N n (s) ̂ 1 −𝛽(t) e = O( ) (1 ) ∏ − − Y (s) n n s≤t using following lemma. Lemma 4.5: Let {𝛼n (s), 0 ≤ s ≤ T, n ≥ 1} be real-valued function such that 1. {s ∈ (0, u] : 𝛼n (s) = / 0} is P-a.e. at most countable for each n. 2. ∑0 𝛿|𝒢0 )
(5.4)
5 Central limit theorems for dependent random variables | 97
Proof: We use induction. (i) n = 1 We want to show P(A1 𝒢0 ) ≤ 𝛿 + P(P(A1 |𝒢0 ) > 𝛿|𝒢0 ).
(5.5)
Consider Ω⊖ = {𝜔 : P(A1 |𝒢0 ) ≤ 𝛿}. Then (5.5) holds. Also, consider Ω⊕ = {𝜔 : P(A1 |𝒢0 ) > 𝛿}. Then P(P(A1 |𝒢0 ) > 𝛿|𝒢0 ) = E(1(P(A1 |𝒢0 ) > 𝛿)|𝒢0 ) = 1(P(A1 |𝒢0 ) > 𝛿) = 1, and hence (5.5) also holds. (ii) n > 1 Consider 𝜔 ∈ Ω⊕ . Then n
P( ∑ P(A m |𝒢m−1 ) > 𝛿|𝒢0 ) ≥ P(P(A1 |𝒢0 ) > 𝛿|𝒢0 ) m=1
= 1Ω⊕ (𝜔) = 1. Then, (5.4) holds. Consider 𝜔 ∈ Ω⊖ . Let B m = A m ∩ Ω⊖ . Then, for m ≥ 1, P(B m |𝒢m−1 ) = P(A m ∩ Ω⊖ |𝒢m−1 ) = P(A m |𝒢m−1 ) ⋅ P(Ω⊖ |𝒢m−1 ) = P(A m |𝒢m−1 ) ⋅ 1Ω⊖ (𝜔) = P(A m |𝒢m−1 ). Now suppose 𝛾 = 𝛿 − P(B1 |𝒢0 ) ≥ 0 and apply the last result for n − 1 sets (induction hypothesis). n
n
m=2
m=2
P( ⋃ B m 𝒢1 ) ≤ 𝛾 + P( ∑ P(B m |𝒢m−1 ) > 𝛾|𝒢1 ). Recall E(E(X|𝒢0 )𝒢1 ) = E(X|𝒢0 ) if 𝒢0 ⊂ 𝒢1 . Taking conditional expectation w.r.t. 𝒢0 and noting 𝛾 ∈ 𝒢0 , n
n
m=2
m=2
P( ⋃ B m 𝒢0 ) ≤ P(𝛾 + P( ∑ P(B m |𝒢m−1 ) > 𝛾|𝒢1 )𝒢0 )
98 | 5 Central limit theorems for dependent random variables
n
= 𝛾 + P( ∑ P(B m |𝒢m−1 ) > 𝛾|𝒢0 ) m=2 n
= 𝛾 + P( ∑ P(B m |𝒢m−1 ) > 𝛿|𝒢0 ). m=1
Since ∪ B m = (∪ A m ) ∩ Ω⊖ , on Ω⊖ we have n
n
∑ P(B m |𝒢m−1 ) = ∑ P(A m |𝒢m−1 ). m=1
m=1
Thus, on Ω⊖ , n
n
m=2
m=1
P( ⋃ A m 𝒢0 ) ≤ 𝛿 − P(A1 |𝒢0 ) + P( ∑ P(A m |𝒢m−1 ) > 𝛿|𝒢0 ). The result follows from n
n
m=1
m=2
P( ⋃ A m 𝒢0 ) ≤ P(A1 |𝒢0 ) + P( ⋃ A m 𝒢0 ) using monotonicity of conditional expectation and 1A∪ B ≤ 1A + 1B . Proof of Lemma 5.5: Let A m = {|X n,m | > 𝜖n }, 𝒢m = ℱn,m , and let 𝛿 be a positive number. Then, Proposition 5.1 implies n
P(|X n,m | > 𝜖n for some m ≤ n) ≤ 𝛿 + P( ∑ P(|X n,m | > 𝜖n |ℱn,m−1 ) > 𝛿). m=1
To estimate the right-hand side, we observe that “Chebyshev’s inequality” implies n
n
m=1
m=1
̂ 2 |ℱn,m−1 ) → 0 ∑ P(|X n,m | > 𝜖n |ℱn,m−1 ) ≤ 𝜖n−2 ∑ E(X n,m so lim sup P(|X n,m | > 𝜖n for some m ≤ n) ≤ 𝛿. Since 𝛿 is arbitrary, the proof of lemma and theorem is complete. Theorem 5.4 (Martingale cental limit theorem): Let {X n , ℱn } be a martingale difference sequence, X n,m = X m /√n, and V k = ∑kn=1 E(X 2n ℱn−1 ). Assume that 1. V k /k →P 𝜎2 , 2.
n−1 ∑m≤n E(X 2m 1(|X m | > 𝜖√n)) → 0.
Then, S(n⋅) √n
⇒ 𝜎W(⋅).
5 Central limit theorems for dependent random variables | 99
Definition 5.2: A process {X n , n ≥ 0} is called stationary if {X m , X m+1 , . . . , X m+k } =D {X0 , . . . , X k } for any m, k. Definition 5.3: Let (Ω, ℱ , P) be a probability space. A measurable map 𝜑 : Ω → Ω is said to be measure preserving if P(𝜑−1 A) = P(A) for all A ∈ ℱ. Theorem 5.5: If X0 , X1 , . . . is a stationary sequence and g : RN → R is measurable, then Y k = g(X k , X k+1 , . . .) is a stationary sequence. Proof: If x ∈ RN , let g k (x) = g(x k , x k+1 , . . .), and if B ∈ ℛN let A = {x : (g0 (x), g1 (x), . . .) ∈ B}. To check stationarity now, we observe: P({𝜔 : (Y0 , Y1 , . . .) ∈ B}) = P({𝜔 : (g(X0 , X1 , . . .), g(X1 , X2 , . . .), . . .) ∈ B}) = P({𝜔 : (g0 (X), g1 (X), . . .) ∈ B}) = P({𝜔 : (X0 , X1 , . . .) ∈ A}) = P({𝜔 : (X k , X k+1 , . . .) ∈ A}) ( since X0 , X1 , . . . is a stationary sequence) = P({𝜔 : (Y k , Y k+1 , . . .) ∈ B}). Exercise: Show the above equality. Definition 5.4: Assume that 𝜃 is measure preserving. A set A ∈ ℱ is said to be invariant if 𝜃−1 A = A. Denote by ℐ = {A in ℱ , A = 𝜃−1 A} in measure. Definition 5.5: A measure preserving transformation on (Ω, ℱ , P) is said to be ergodic if ℐ is trivial, i.e., for every A ∈ ℐ, P(A) ∈ {0, 1}. Example: Let X0 , X1 , . . . be the i.i.d. sequence. If Ω = RN and 𝜃 is the shift operator, then an invariant set A has {𝜔 : 𝜔 ∈ A} = {𝜔 : 𝜃𝜔 ∈ A} ∈ 𝜎(X1 , X2 , . . .). Iterating gives ∞
A ∈ ⋂ 𝜎(X n , X n+1 , . . .) = 𝒯 , the tail 𝜎-field n=1
100 | 5 Central limit theorems for dependent random variables
so ℐ ⊂ 𝒯 . For an i.i.d. sequence, Kolmogorov’s 0-1 law implies 𝒯 is trivial, so ℐ is trivial and the sequence is ergodic. We call 𝜃 the ergodic transformation. Theorem 5.6: Let g : RN → R be measurable. If X0 , X1 , . . . is an ergodic stationary sequence, then Y k = g(X k , X k+1 , . . .) is ergodic. Proof: Suppose X0 , X1 , . . . is defined on sequence space with X n (𝜔) = 𝜔n . If B has {𝜔 : (Y0 , Y1 , . . .) ∈ B} = {𝜔 : (Y1 , Y2 , . . .) ∈ B}, then A = {𝜔 : (Y0 , Y1 , . . .) ∈ B} is shift invariant. Theorem 5.7 (Birkhoff’s ergodic theorem): For any f ∈ L1 (P), 1 n−1 ∑ f (𝜃m 𝜔) → E(f |𝒢) a.s. and in L1 (P), n m=0 where 𝜃 is measure preserving transformation on (Ω, ℱ , P) and 𝒢 = {A ∈ ℱ : 𝜃−1 A = A}. For the proof, see [8]. Theorem 5.8: Suppose {X n , n ∈ Z} is an ergodic stationary sequence of martingale differences, i.e., 𝜎2 = EX 2n < ∞ and E(X n |ℱn−1 ) = 0 with respect to ℱn = 𝜎(X m , m ≤ n). Let S n = X1 + ⋅ ⋅ ⋅ + X n . Then, S(n⋅) √n
⇒ 𝜎W(⋅).
Proof: Let u n = E(X 2n |ℱn−1 ). Then u n can be written as 𝜃(X n−1 , X n−2 , . . .), and hence, by Theorem 5.6 u n is stationary and ergodic. By Birkhoff’s ergodic theorem (𝒢 = {0, Ω}), 1 n ∑ u → Eu0 = EX02 = 𝜎2 a.s. n m=1 m The last conclusion shows that (i) of Theorem 5.4 holds. To show (ii), we observe 1 n 1 n ∑ E(X 2m 1(|X m | > 𝜖√n)) = ∑ E(X02 1(|X0 | > 𝜖√n)) n m=1 n m=1 (because of stationarity) = E(X02 1(|X0 | > 𝜖√n)) → 0 by the dominated convergence theorem. This completes the proof. Let’s consider stationary process, (with 𝜉i i.i.d., E𝜉i = 0 and E𝜉i2 < ∞) ∞
X n = ∑ c k 𝜉n−k k=0
5 Central limit theorems for dependent random variables |
101
with ∞
∑ c2k < ∞. k=0
If 𝜉i are i.i.d., X n is definitely stationary, but it is not martingale difference process. This is called moving average process. What we will do is we start with stationary ergodic process, and then we will show that limit of this process is the limit of martingale difference sequence. Then this satisfies the conditions of martingale central limit theorem. Note that in our example, EX 2n < ∞ for all n. If X n is stationary second-order, then r(n) = EX n X o is positive definite. We can separate phenomenon into two parts: new information(nondeterministic) and non-new information (deterministic). 2𝜋
EX n X0 = ∫ e in𝜆 dF(𝜆), 0
where F is the spectral measure. In case X n = ∑∞ k=0 c k 𝜉n−k , then F ≪ Lebesgue measure. In fact, we have the following theorem: Theorem 5.9: There exist c̄ k and 𝜙 such that 2 – f (e i𝜆 ) = 𝜙(e i𝜆 ) ̄ ik𝜆 – 𝜙(e i𝜆 ) = ∑∞ k=0 c k e if and only if 2𝜋
∫ log f (𝜆)dF(𝜆) > −∞. 0
Now we start with {X n : n ∈ Z} ergodic stationary sequence such that – EX n = 0, EX 2n < ∞. – ∑∞ n=1 ||E(X 0 |ℱ−n )||2 < ∞. The idea is if we go back, then X n will be independent of X0 . Let H n = {Y ∈ ℱn with EY 2 < ∞} = L2 (Ω, ℱn , P) K n = {Y ∈ H n with EYZ = 0 for all Z ∈ H n−1 } = H n ⊖ H n−1 . Geometrically, H0 ⊃ H−1 ⊃ H−2 . . . is a sequence of subspaces of L2 and K n is the orthogonal complement of H n−1 . If Y is a random variable, let (𝜃n Y)(𝜔) = Y(𝜃n 𝜔),
102 | 5 Central limit theorems for dependent random variables
i.e., 𝜃 is isometry(measure-preserving) on L2 . Generalizing from the example Y = f (X−j , . . . , X k ), which has 𝜃n Y = f (X n−j , . . . , X n+k ), it is easy to see that if Y ∈ H k , then 𝜃n Y ∈ H k+n , and hence Y ∈ K j then 𝜃n Y ∈ K n+j . Lemma 5.6: Let P be a projection such that X j ∈ H−j implies P H−j X j = X j . Then, 𝜃P H−1 X j = P H0 X j+1 = P H0 𝜃X j . Proof: For j ≤ −1, 𝜃 ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ P H−j X j = 𝜃X j = X j+1 . Xj
We will use this property. For Y ∈ H−1 , X j − P H−1 X j ⊥Y ⇒ (X j − P H−1 X j , Y)2 = 0 ⇒ (𝜃(X j − P H−1 X j ), 𝜃Y) = 0 ( since 𝜃 is isometry on L2 ) 2
Since Y ∈ H−1 , 𝜃Y generates H0 . Therefore, for all Z ∈ H0 , we have ((𝜃X j − 𝜃P H−1 X j ), Z) = 0 2
⇒ (𝜃X j − 𝜃P H−1 X j )⊥Z ⇒ 𝜃P H−1 X j = P H0 𝜃X j = P H0 X j+1 . We come now to the central limit theorem for stationary sequences. Theorem 5.10: Suppose {X n , n ∈ Z}is an ergodic stationary sequence with EX n = 0 and EX 2n < ∞. Assume ∞
∑ ||E(X0 |ℱ−n )||2 < ∞ n=1
Let S n = X1 + . . . + X n . Then S(n⋅) √n where we do not know what 𝜎 is.
⇒ 𝜎W(⋅)
5 Central limit theorems for dependent random variables |
103
Proof: If X0 happened to be in K0 since X n = 𝜃n X0 ∈ K n for all n, and taking Z = 1A ∈ H n−1 , we would have E(X n 1A ) = 0 for all A ∈ ℱn−1 and hence E(X n |ℱn−1 ) = 0. The next best thing to having X n ∈ K0 is to have X0 = Y0 + Z0 − 𝜃Z0
(∗)
with Y0 ∈ K0 and Z0 ∈ L2 . Let ∞
Z0 = ∑ E(X j |ℱ−1 ) j=0 ∞
𝜃Z0 = ∑ E(X j+1 |ℱ0 ) j=0 ∞
Y0 = ∑ (E(X j |ℱ0 ) − E(X j |ℱ−1 )). j=0
Then we can solve (∗) formally Y0 + Z0 − 𝜃Z0 = E(X0 |ℱ0 ) = X0 .
(5.6)
We let n
n
m=1
m=1
S n = ∑ X m = ∑ 𝜃m X 0
n
and T n = ∑ 𝜃m Y0 . m=1
We want to show that T n is martingale difference sequence. We have S n = T n + 𝜃Z0 − 𝜃n+1 Z0 . The 𝜃m Y0 are a stationary ergodic martingale difference sequence (ergodicity follows from Theorem 5.6), so Theorem 5.8 implies T(n⋅) √n
⇒ 𝜎W(⋅), where 𝜎2 = EY02 .
To get rid of the other term, we observe 𝜃Z0 → 0 a.s. √n and n−1
P( max 𝜃m+1 Z0 > 𝜖√n) ≤ ∑ P(𝜃m+1 Z0 > 𝜖√n) 0≤m≤n−1 m=0
= nP(Z0 > 𝜖√n) ≤ 𝜖−2 E(Z02 1{|Z0 |>𝜖√n} ) → 0.
104 | 5 Central limit theorems for dependent random variables
The last inequality follows from the stronger form of Chevyshev, E(Z02 1{|Z0 |>𝜖√n} ) ≥ 𝜖2 nP(Z0 > 𝜖√n). Therefore, in view of the above comments, Sn T 𝜃Z 𝜃n+1 Z0 = n + 0 − √n √n √n √n ⇒
T p Sn − n →0 √n √n
⇒ lim
n→∞
S(n⋅) √n
= lim
T(n⋅)
n→∞
√n
= 𝜎W(⋅).
Theorem 5.11: Suppose {X n , n ∈ Z} is an ergodic stationary sequence with EX n = 0 and EX 2n < ∞. Assume ∞
∑ ||E(X0 |ℱ−n )||2 < ∞. n=1
Let S n = X1 + . . . + X n . Then S(n⋅) √n
⇒ 𝜎W(⋅),
where ∞
𝜎2 = EX02 + 2 ∑ EX0 X n . n=1
∞ If ∑∞ n=1 EX 0 X n diverges, theorem will not be true. We will show that ∑n=1 EX 0 X n < ∞. 2 This theorem is different from previous theorem since we now specify 𝜎 . Proof: First, EX0 X m = E(E(X0 X m |ℱ0 )) ≤ EX0 E(X m |ℱ0 ) ≤ ||X0 ||2 ⋅ E(X m |ℱ0 )2 ( by Cauchy Schwarz inequality ) = ||X0 ||2 ⋅ E(X0 |ℱ−m )2 ( by shift invariance ). Therefore, by assumption, ∞
∞
n=1
n=1
∑ EX0 X n ≤ ||X0 ||2 ∑ E(X0 |ℱ−m )2 < ∞.
5 Central limit theorems for dependent random variables |
105
Next, n
n
ES2n = ∑ ∑ EX j X k j=1 k=1 n−1
= nEX02 + 2 ∑ (n − m)EX0 X m . m=1
From this, it follows easily that ∞ ES2n → EX02 + 2 ∑ EX0 X m . n m=1
To finish the proof, let T n = ∑nm=1 𝜃m Y0 , observe 𝜎2 = EY02 , and n−1 E(S n − T n )2 = n−1 E(𝜃Z0 − 𝜃n+1 Z0 )2 ≤
3EZ02 →0 n
since (a − b)2 ≤ (2a)2 + (2b)2 . We proved central limit theorem of ergodic stationary process. We will discuss examples: M-dependence and moving average. Example 1. M-dependent sequences: Let X n , n ∈ Z be a stationary sequence with EX n = 0, EX 2n < ∞. Assume that 𝜎({X j , j ≤ 0}) and 𝜎({X j , j ≥ M}) are independent. In this case, E(X0 |ℱ−n ) = 0 for n > M, and ∑∞ n=0 ||E(X 0 |ℱ−n )||2 < ∞. Let ℱ−∞ = ∩ m 𝜎({X j , j ≥ M}) and ℱk = 𝜎({X j , j ≤ k}). If m − k > M, then ℱ−∞ ⊥ℱk . Recall Kolmogorov 0-1 law. If A ∈ ℱk and B ∈ ℱ−∞ , then P(A ∩ B) = P(A) ⋅ P(B). For all A ∈ ∪ k ℱk , A ∈ 𝜎(∪ k ℱk ). Also, A ∈ ℱ−∞ , where ℱ−∞ ⊂ ∪ k ℱ−k . Therefore, by Kolmogorov 0-1 law, P(A ∩ A) = P(A) ⋅ P(A), and hence, {X n } is stationary and ergodic. Thus, Theorem 5.8 implies S n,(n⋅) √n
⇒ 𝜎W(⋅),
where M
𝜎2 = E20 + 2 ∑ EX0 X m . m=1
Example 2. Moving average: Suppose X m = ∑ c k 𝜉m−k , where ∑ c2k < ∞, k≥0
k≥0
106 | 5 Central limit theorems for dependent random variables
and 𝜉i , i ∈ Z are i.i.d. with E𝜉i = 0 and E𝜉i2 = 1. Clearly, {X n } is a stationary sequence since series converges. Check whether {X n } is ergodic. We have ⋂ 𝜎({X m , m ≤ n}) ⊂ ⋂ 𝜎({𝜉k , k ≤ n}); n
n
therefore, by Kolmogorov 0-1 law, {X n } is ergodic. Next, if ℱ−n = 𝜎({𝜉m , m ≤ −n}), then ||E(X0 |ℱ−n )||2 = || ∑ c k 𝜉k ||2 k≥n 1/2
= ( ∑ c2k ) . k≥n
If, for example, c k = (1 + k)−p , ||E(X0 |ℱ−n )||2 ∼ n(1/2−p) , and Theorem 5.11 applies if p > 3/2. Let 𝒢, ℋ ⊂ ℱ, and 𝛼(𝒢, ℋ) =
sup A∈𝒢,B∈ℋ
{P(A ∩ B) − P(A)P(B)}.
If 𝛼 = 0, 𝒢, and ℋ are independent, 𝛼 measures the dependence of two 𝜎-algebras. Lemma 5.7: Let p, q, r ∈ (1, ∞] with 1/p +1/q +1/r = 1, and suppose X ∈ 𝒢, Y ∈ ℋ have E|X|p , E|Y|q < ∞. Then 1/r
|EXY − EXEY| ≤ 8||X||p ||Y||q (𝛼(𝒢, ℋ)) . Here, we interpret x0 = 1 for x > 0 and 00 = 0. Proof: If 𝛼 = 0, X and Y are independent and the result is true, so we can suppose 𝛼 > 0. We build up to the result in three steps, starting with the case r = ∞. (a). r = ∞ |EXY − EXEY| ≤ 2||X||p ||Y||q . Proof of (a): Hölder’s inequality implies |EXY| ≤ ||X||p ||Y||q , and Jensen’s inequality implies ||X||p ||Y||q ≥ E|X|E|Y| ≥ |EXEY|; thus, the result follows from the triangle inequality.
5 Central limit theorems for dependent random variables |
(b) X, Y ∈ L∞ |EXY − EXEY| ≤ 4||X||∞ ||Y||∞ 𝛼(𝒢, ℋ). Proof of (b): Let 𝜂 = sgn(E(Y|𝒢) − EY) ∈ 𝒢. EXY = E(XE(Y|𝒢)), so |EXY − EXEY| = |E(X(E(Y|𝒢) − EY))| ≤ ||X||∞ E|E(Y|𝒢) − EY| = ||X||∞ E(𝜂E(Y|𝒢) − EY) = ||X||∞ (E(𝜂Y) − E𝜂EY). Applying the last result with X = Y and Y = 𝜂 gives |E(Y𝜂) − EYE𝜂| ≤ ||Y||∞ |E(𝜁𝜂) − E𝜁E𝜂|, where 𝜁 = sgn(E(𝜂|ℋ) − E𝜂). Now 𝜂 = 1A − 1B and 𝜁 = 1C − 1D , so |E(𝜁𝜂) − E𝜁E𝜂| = |P(A ∩ C) − P(B ∩ C) − P(A ∩ D) + P(B ∩ D) −P(A)P(C) + P(B)P(C) + P(A)P(D) − P(B)P(D)| ≤ 4𝛼(𝒢, ℋ). Combining the last three displays gives the desired result. (c) q = ∞, 1/p + 1/r = 1 1−1/p
|EXY − EXEY| ≤ 6||X||p ||Y||∞ (𝛼(𝒢, ℋ))
.
Proof of (c): Let C = 𝛼−1/p ||X||p , X1 = X1(|X|≤C) , and X2 = X − X1 . |EXY − EXEY| ≤ |EX1 Y − EX1 EY| + |EX2 Y − EX2 EY| ≤ 4𝛼C||Y||∞ + 2||Y||∞ E|X2 | by (a) and (b). Now E|X2 | ≤ C−(p−1) E(|X|p 1(|X|≤C) ) ≤ C−p+1 E|X|p . Combining the last two inequalities and using the definition of C gives |EXY − EXEY| ≤ 4𝛼1−1/p ||X||p ||Y||∞ + 2||Y||∞ 𝛼1−1/p ||X||−p+1+p , p which is the desired result.
107
108 | 5 Central limit theorems for dependent random variables
Finally, to prove Lemma 5.7, let C = 𝛼−1/q ||Y||q , Y1 = Y1(|Y|≤C) , and Y2 = Y − Y1 . |EXY − EXEY| ≤ |EXY1 − EXEY1 | + |EXY2 − EXEY2 | ≤ 6C||X||p 𝛼1−1/p + 2||X||p ||Y2 ||𝜃 , where 𝜃 = (1 − 1/p)−1 by (c) and (a). Now E|Y|𝜃 ≤ C−q+𝜃 E(|Y|q 1(|Y|≤C) ) ≤ C−1+𝜃 E|Y|q . Taking the 1/𝜃 root of each side and recalling the definition of C (q−𝜃)/q𝜃 ||Y2 ||𝜃 ≤ C−(q−𝜃) ||Y||q/𝜃 ||Y||q , q ≤ 𝛼
so we have |EXY − EXEY| ≤ 6𝛼−1/q ||Y||q ||X||p 𝛼1−1/p + 2||X||p 𝛼1/𝜃−1/q ||Y||1/𝜃+1/q , q proving Lemma 5.7. Combining Theorem 5.11 and Lemma 5.7 gives: Theorem 5.12: Suppose X n , n ∈ Z is an ergodic stationary sequence with EX n = 0, E|X0 |2+𝛿 < ∞. Let 𝛼(n) = 𝛼(ℱ−n , 𝜎(X0 )), where ℱ−n = 𝜎({X m , m ≤ −n}), and suppose ∞
∑ 𝛼(n)𝛿/2(2+𝛿) < ∞. n=1
If S n = X1 + ⋅ ⋅ ⋅ X n , then S(n⋅) √n
⇒ 𝜎W(⋅),
where ∞
𝜎2 = EX02 + 2 ∑ EX0 X n . n=1
Proof: To use Lemma 5.7 to estimate the quantity in Theorem 5.11 we begin with ||E(X|ℱ)||2 = sup{E(XY) : Y ∈ ℱ , ||Y||2 = 1}
(∗).
Proof of (*): If Y ∈ ℱ with ||Y||2 = 1, then using a by now familiar property of conditional expectation and the Cauchy-Schwarz inequality EXY = E(E(XY|ℱ)) = E(YE(X|ℱ)) ≤ ||E(X|ℱ)||2 ||Y||2 Equality holds when Y = E(X|ℱ)/||E(X|ℱ)||2 .
5 Central limit theorems for dependent random variables |
Letting p = 2 + 𝛿 and q = 2 in Lemma 5.7, noticing 1 1 𝛿 1 =1− − = r p q 2(2 + 𝛿) and recalling EX0 = 0, showing that if Y ∈ ℱ−n |EX0 Y| ≤ 8||X0 ||2+𝛿 ||Y||2 𝛼(n)𝛿/2(2+𝛿) . Combining this with (*) gives ||E(X0 |ℱ−n )||2 ≤ 8||X0 ||2+𝛿 𝛼(n)𝛿/2(2+𝛿) , and it follows that the hypotheses of Theorem 5.11 are satisfied.
109
6 Empirical process We have so far considered the weak convergence of stochastic processes with values in complete separable metric spaces. The main assumption needed to study such convergence in terms of bounded continuous functions is that the measures be defined on Borel subsets of the space. This influences theorems like the Prokhorov theorem. In other words, the random variables with values in separable metric spaces. If we consider the space D[0, 1] with sup norm, we get non-separability. We note that if we consider the case of empirical processes X n (t) = √n(F̂ n (t) − t), where F̂ n (t) is the empirical distribution function of i.i.d. uniform random variable, then the above X n is not a Borel measureable function in D[0, 1] with sup norm if we consider the domain of X n (., w) with w𝜖[0, 1]? Thus, the weak convergence definition above cannot be used in this case. Dudley [6] developed an alternative weak convergence theory. Even this cannot handle the general empirical processes, hence some statistical applications. One therefore has to introduce outer expectations of X n of possibly non-measureable maps as far as limit of X n is Borel measureable. We shall also consider convergence in probability and almost sure convergence for such functions. We therefore follow this idea as in [25]. A similar idea was exploited in the basic paper on invariance principle in non-separable Banach spaces by Dudley and Phillips ([7], see also [18]). ¯ an arbitrary map. Let (Ω, 𝒜, P) be an arbitrary probability space and T : Ω → R The outer integral of T with respect to P is defined as ¯ measurable and EU exists}. E∗ T = inf{EU : U ≥ T, U : Ω → R Here, as usual, EU is understood to exist if at least one of EU + or EU − is finite. The outer probability of an arbitrary subset B of Ω is P∗ (B) = inf{P(A) : A ⊃ B, A ∈ 𝒜}. Note that the functions U in the definition of outer integral are allowed to take the value ∞, so that the infimum exists. The inner integral and inner probability can be defined in a similar fashion. Equivalently, they can be defined by E∗ T = −E∗ (−T) and P∗ (B) = 1 − P∗ (B), respectively. In this section, D is metric space with a metric d. The set of all continuous, bounded functions f : D → R is denoted by C b (D).
6 Empirical process | 111
Definition 6.1: Let (Ω𝛼 , 𝒜𝛼 , P𝛼 ), 𝛼 ∈ I be a net of probability spaces, X𝛼 : Ω𝛼 → D, and P𝛼 = P ∘ X𝛼−1 . Then we say that the net X𝛼 converges weakly to a Borel measure L, i.e., P𝛼 ⇒ L if E∗ f (X𝛼 ) → ∫ fdL,
for every f ∈ C b (D).
Theorem 6.1 (Portmanteau): The following statements are equivalent: (i) P𝛼 ⇒ L; (ii) lim inf P∗ (X𝛼 ∈ G) ≥ L(G) for every open G; (iii) lim sup P∗ (X𝛼 ∈ F) ≤ L(F) for every closed set F; (iv) lim inf E∗ f (X𝛼 ) ≥ ∫ fdL for every lower semicontinuous f that is bounded below; (v) lim sup E∗ f (X𝛼 ) ≤ ∫ fdL for every upper semicontinuous f that is bounded above; (vi) lim P∗ (X𝛼 ∈ B) = lim P∗ (X𝛼 ∈ B) = L(B) for every Borel set B with L(𝜕B) = 0. (vii) lim inf E∗ f (X𝛼 ) ≥ ∫ fdL for every bounded, Lipschitz continuous, nonnegative f . Proof: (ii) and (iii) are equivalent by taking complements. Similarly, (ii) and (v) are equivalent by using f by −f . The (i) ⇒ (vii) is trivial. (vii) ⇒ (ii) Suppose (vii) holds. For every open G, consider a sequence of Lipchitz continuous functions f m ≥ 0 and f m ↑ 1G (f m (x) = md(x, D − G) > 1, then lim inf P∗ (X𝛼 ∈ G) ≥ lim inf E∗ (f (X𝛼 )) ≥ ∫ f m dL. Letting m → ∞, we get the result. (ii) ⇒ (iv) Let f be lower semicontinuous with f ≥ 0. Define m2
f m = ∑(1/m)1G i , G i = {x : f (x) > i/m}. i=1
Then f m ≤ f and |f m (x)− f (x)| ≤ 1/m when x ≤ m. Fix m, use the fact G i is open for all i, m2
lim inf E∗ f (X𝛼 ) ≥ lim inf E∗ f m (X𝛼 ) ≥ ∑ i=1
1 P(X ∈ G i ) = ∫ f m dL. m
Let n → ∞, we get (iv) for non-negative lower semi continuous f . The conclusion follows. Since a continuous function is both upper and lower continuous (iv) and (v) imply (i). We now prove (vi) is equivalent to others. Consider (ii) ⇒ (vi). If (ii) and (iii) hold, then L(int B) ≤ lim inf P∗ (X𝛼 ∈ int B) ≤ lim sup P∗ (X𝛼 ∈ B) ≤ L(B). If L(𝛿B) = 0, we get equalities giving (vi).
112 | 6 Empirical process
(vi) ⇒ (iii) Suppose (vi) holds and let F be closed. Write F 𝜖 = {x : d(X, F) < 𝜖}. The sets 𝜕F 𝜖 are disjoint for different 𝜖 > 0, giving at most countably many have non-zero L-measure. Choose 𝜖m ↓ 0 with L(𝜕F 𝜖m ) = 0. For fixed m, 𝜖m
lim sup P∗ (X𝛼 ∈ F) ≤ lim sup P∗ (X𝛼 ∈ F ) = L(F 𝜖m ). Let m → ∞ to complete the proof. Definition 6.2: The net of maps X𝛼 is asymptotically measurable if and only if E∗ f (X𝛼 ) − E∗ f (X𝛼 ) → 0,
for every f ∈ C b (D).
The net X𝛼 is asymptotically tight if for every 𝜖 > 0 there exists a compact set K, such that lim inf P∗ (X𝛼 ∈ K 𝛿 ) ≥ 1 − 𝜖,
for every 𝛿 > 0.
Here K 𝛿 = {y ∈ D : d(y, K) < 𝛿} is the “𝛿-enlargement” around K. A collection of Borel measurable maps X𝛼 is called uniformly tight if, for every 𝜖 > 0, there is a compact K with P(X𝛼 ∈ K) ≥ 1 − 𝜖 for every 𝛼. The 𝛿 in the definition of tightness may seem a bit overdone. Its non-asymptotic tightness as defined is essentially weaker than the same condition but with K instead of K 𝛿 . This is caused by a second difference with the classical concept of uniform tightness: the enlarged compacts need to contain mass 1 − 𝜖 only in the limit. Meanwhile, nothing is gained in simple cases: for Borel measurable maps in a Polish space, asymptotic tightness and uniform tightness are the same. It may also be noted that, although K 𝛿 is dependent on the metric, the property of asymptotic tightness depends on the topology only. One nice consequence of the present tightness concept is that weak convergence usually implies asymptotic measurability and tightness. Lemma 6.1: (i) X𝛼 ⇒ X, then X𝛼 is asymptotically measurable. (ii) If X𝛼 ⇒ X, then X𝛼 is asymptotically tight if and only if X is tight. Proof: (i) This follows upon applying the definition of weak convergence to both f and −f . (ii) Fix 𝜖 > 0. If X is tight, then there is a compact K with P(X ∈ K) > 1 − 𝜖. By the Portmanteau theorem, lim inf P∗ (X𝛼 ∈ K 𝛿 ) ≥ P(X ∈ K 𝛿 ), which is larger than 1 − 𝜖 for every 𝛿 > 0. Conversely, if X𝛼 is tight, then there is a compact K with lim inf P∗ (X𝛼 ∈ K 𝛿 ) ≥ 1 − 𝜖. By the Portmanteau theorem, P(X ∈ K 𝛿 ) ≥ 1 − 𝜖. Let 𝛿 ↓ 0.
6 Empirical process | 113
The next version of Prohorov’s theorem may be considered a converse of the previous lemma. It comes in two parts, one for nets and one for sequences, neither one of which implies the other. The sequence case is the deepest of the two. Theorem 6.2 (Prohorov’s theorem): (i) If the net X𝛼 is asymptotically tight and asymptotically measurable, then it has a subnet X𝛼(𝛽) that converges in law to a tight Borel law. (ii) If the sequence X n is asymptotically tight and asymptotically measurable, then it has a subsequence X n j that converges weakly to a tight Borel law. Proof: (i) Consider (E∗ f (X𝛼 ))f ∈C b (D) as a net in product space ∏ [−‖f ‖∞ , ‖f ‖∞ ],
f ∈C b (D)
which is compact in product topology. Hence, the net has convergent subnet. This implies there exist constants L(f ) ∈ [−‖f ‖∞ , ‖f ‖∞ ] such that E∗ f (X𝛼(𝛽) ) → L(f ) for f ∈ C b (D). In view of the asymptotic measurability, the numbers are the limits of corresponding E∗ f (X𝛼 ). Now, L(f1 + f2 ) ≤ lim(E∗ f1 (X𝛼(𝛽) ) + E∗ f2 (X𝛼(𝛽) )) = L(f1 ) + L(f2 ) = lim(E∗ f1 (X𝛼(𝛽) ) + E∗ f2 (X𝛼(𝛽) )) ≤ L(f1 + f2 ). Thus, L : C b (D) → ℝ is a additive, and similarly L(𝜆f ) = 𝜆L(f ) for 𝛼 ∈ ℝ and L is positive. L(f ) ≥ 0 for f ≥ 0. If f m ↓ 0, L(f m ) ↓ 0. To see this, fix 𝜖 > 0. There is a compact set K such that lim inf P∗ (X𝛼 ∈ K 𝛿 ) > 1 − 𝜖 for all 𝛿 > 0. Using the Dini theorem, for sufficiently large m, |f m (x)| ≤ 𝜖 for all x ∈ K. Using compactness of K, there exists 𝛿 > 0, such that |f m (x)| ≤ 2𝜖 for every x ∈ K 𝛿 . One gets A𝛼 ⊆ {X𝛼 ∈ K 𝛿 } measurable (1{X𝛼 ∈K 𝛿 } )∗ = 1A𝛼 . Hence, L(f m ) = limEf m (X𝛼 )∗ ((1{X𝛼 ∈K 𝛿 } )∗
+(1{X𝛼 ∈K̸ 𝛿 } )∗ ≤ 2𝜖 + ‖f1 ‖∞ 𝜖.
Thus, L is a measure. 𝛿 (ii) For m ∈ ℕ, K m is a compact set with lim inf P∗ (X n ∈ K m ) ≥ 1 − m1 for 𝛿 > 0. Since K m is compact, the space C b (K m ) and {f : f ∈ C b (K m ), ‖f (x)‖ ≤ 1, for x ∈ K m } are
114 | 6 Empirical process
separable. Using Tietze’s theorem, every f in the unit ball of C b (K m ) can be extended to an f in the unit ball of C b (D). Hence, ball of C b (D) the restrictions of which to K m are dense in the unit ball of C b (K m ). Pick such a countable set for every m, and let ℑ be countably many functions obtained. For a fixed, bounded f , there is a subsequence X n j such that E∗ f (X n j ) converges to some number. Using diagonalization, one obtains a subsequence such that ∗
E f (X n j ) → L(f ) for f ∈ ℑ. with L(f ) ∈ [−1, 1]. Let f ∈ C b (D) taking values in [−1, 1]. Fix 𝜖 > 0, and m. There exists f m ∈ ℑ with |f (x) − f m (x)| < 𝜖 for x ∈ K m . Then as before there exists 𝛿 > 0 such that |f (x) − 𝛿 f m (x)| ≤ 𝜖 for every x ∈ K m . Then, |E∗ f (X n ) − E∗ f m (X n )| ≤ E|f (X n ) − f m (X n )|∗ (1{X n ∈K 𝛿m } )∗ 2 𝛿 +2P∗ (X n ∈ ̸ K m ) ≤ 2𝜖 + m for n sufficiently large. Then E∗ f (X n j ) has the property that, for 𝜂 > 0, there is a converging subsequence of numbers that is eventually with distance 𝜂. This gives convergence E∗ f (X n j ) to a limit following as in the proof of (i) we can get the result. Remark: Let D0 ⊆ D and X and X𝛼 take values in D0 . Then X𝛼 → X as maps in D0 iff X𝛼 → X as maps in D if D and D0 are equipped with the same metric. This is easy from Portmaneteau theorem (ii) as each G0 ⊆ D0 open is of the form G ∩ D0 with G open in D.
6.1 Spaces of bounded functions A vector lattice ℱ ⊂ C b (D) is a vector space that is closed under taking positive parts: if f ∈ ℱ, then f = f ∨ 0 ∈ ℱ. Then automatically f ∨ g ∈ ℱ and f ∧ g ∈ ℱ for every f , g ∈ ℱ. A set of functions on D separates points of D if, for every pair x = / y ∈ D, there is f ∈ ℱ with f (x) = / f (y). Lemma 6.2: Let L1 and L2 be finite Borel measures on D. (i) If ∫ fdL1 = ∫ fdL2 for every f ∈ C b (D), then L1 = L2 . Let L1 and L2 be tight Borel probability measures on D. (ii) If ∫ fdL1 = ∫ fdL2 for every f in a vector lattice ℱ ⊂ C b (D) that contains the constant functions and separates points of D, then L1 = L2 . Proof: (i) For every open set G, there exists a sequence of continuous functions with 0 ≤ f m ↑ 1G . Using the monotone convergence theorem, L1 (G) = L2 (G) for every open set G. Since L1 (D) = L2 (D), the class of sets {A : L1 (A) = L2 (A)} is a 𝜎-field ⊇ open sets.
6.1 Spaces of bounded functions | 115
(ii) Fix 𝜖 >, take K compact so that L1 (K)L2 (K) ≥ 1 − 𝜖. Note that by the Stone Weirstvass theorem, ℑ (containing constant and separating points of K) lattice ⊆ C b (K) is uniformly dense in C b (k). Given g ∈ C b (D) with 0 ≤ g ≤ 1, take f ∈ ℑ, with |g(x) − f (x)| ≤ 𝜖 for x ∈ K. Then | ∫ gdL1 − ∫ gdL2 | ≤ | ∫(f ∧ 1)+ dL1 − inf(f ∧ 1)+ dL2 + 𝜖. This equals 4𝜖 as (f ∧ 1)+ ∈ ℑ. Hence, one can equals 4𝜖 as (f ∧ 1)+ ∈ ℑ. Hence, one can prove ∫ gdL1 = ∫ gdL2 for all g ∈ C b (D). Lemma 6.3: Let the net X𝛼 be asymptotically tight, and suppose E∗ f (X𝛼 )−E∗ f (X𝛼 ) → 0 for every f in a subalgebra ℱ of C b (D) that separates points of D. Then the net X𝛼 is asymptotically measurable. Proof: Fix 𝜖 > 0 and K compact such that lim sup P∗ (X𝛼 ∈ ̸ K 𝛿 ) ≤ 𝜖 for all 𝛿 > 0. Assume without loss of generality that ℑ contains constant functions. Then restrictions of the function ℑ into K are uniformly dense in C b (K) as before. Hence, given f ∈ C b (D), there exists g ∈ ℑ with |f (x) − g(x)| < 4𝜖 for x ∈ K. As K is compact one can show there exists a 𝛿 > 0 such that |f (x) − g(x)| < 3𝜖 for x ∈ K 𝛿 . Let {X𝛼 ∈ k𝛿 }∗ be a measurable set. Then P(Ω𝛼 \ {X𝛼 ∈ K 𝛿 }∗ ) = P∗ {X𝛼 ∈ ̸ K 𝛿 ) and for 𝛼 large P(|f (X𝛼 )∗ − f (X𝛼 ))∗ | > 𝜖}. Let T be an arbitrary set. The space l∞ (T) is defined as the set of all uniformly bounded, real functions on T: all functions z : T → R such that ||z||T := sup |z(t)| < ∞ t∈T
It is a metric space with respect to the uniform distance d(z1 , z2 ) = ||z1 − z2 ||T . The space l∞ (T), or a suitable subspace of it, is a natural space for stochastic processes with bounded sample paths. A stochastic process is simply an indexed collection {X(t) : t ∈ T} of random variables defined on the same probability space: every X(t) : Ω → R is a measurable map. If every sample path t → X(t, 𝜔) is bounded, then a stochastic process yields a map X : Ω → l∞ (T). Sometimes, the sample paths have additional properties, such as measurability or continuity, and it may be fruitful to consider X as a map into a subspace of l∞ (T). If in either case the uniform metric is used, this does not make a difference for weak convergence of a net, but for measurability it can. In most cases, a map X : Ω → l∞ (T) is a stochastic process. The small amount of measurability this gives may already be enough for asymptotic measurability. The special role played by the marginals (X(t1 ), ..., X(t k )), which are considered as maps
116 | 6 Empirical process
into Rk , is underlined by the following three results. Weak convergence in l∞ (T) can be characterized as asymptotic tightness plus convergence of marginals. Lemma 6.4: Let X𝛼 : Ω𝛼 → l∞ (T) be asymptotically tight. Then it is asymptotically measurable if and only if X𝛼 (t) is asymptotically measurable for every t ∈ T. Lemma 6.5: Let X and Y be tight Borel measurable maps into l∞ (T). Then X and Y are equal in Borel law if and only if all corresponding marginals of X and Y are equal in law. Theorem 6.3: Let X𝛼 : Ω𝛼 → l∞ (T) be arbitrary. Then X𝛼 converges weakly to a tight limit if and only if X𝛼 is asymptotically tight and the marginals (X(t1 ), ..., X(t k )) converge weakly to a limit for every finite subset t1 , ..., t k of T. If X𝛼 is asymptotically tight and its marginals converge weakly to the marginals (X(t1 ), ..., X(t k )) of a stochastic process X, then there is a version of X with uniformly bounded sample paths and X𝛼 ⇒ X. Proof: For the proof of both lemmas, consider the collection ℱ of all functions f : l∞ (T) → R of the form f (z) = g(z(t1 ), ..., z(t k )),
g ∈ C b (Rk ), t1 , ..., t k ∈ T, k ∈ N.
This forms an algebra and a vector lattice, contains the constant functions, and separates points of l∞ (T). Therefore, the lemmas are corollaries of Lemmas 6.2 and 6.3, respectively. If X𝛼 is asymptotically tight and marginals converge, then X𝛼 is asymptotically measurable by the first lemma. By Prohorov’s theorem, X𝛼 is relatively compact ||z||T = sup |ℤ(t)| < ∞. To prove weak convergence, it suffices to show that t𝜖T
all limit points are the same. This follows from marginal convergence and the second lemma. Marginal convergence can be established by any of the well-known methods for proving weak convergence on Euclidean space. Tightness can be given a more concrete form, either through finite approximation or with the help of the Arzelà–Ascoli theorem. Finite approximation leads to the simpler of the two characterizations, but the second approach is perhaps of more interest, because it connects tightness to continuity of the sample paths t ↦ X_𝛼(t). The idea of finite approximation is that for any 𝜖 > 0, the index set T can be partitioned into finitely many subsets T_i such that the variation of the sample paths t ↦ X_𝛼(t) is less than 𝜖 on every one of the sets T_i. More precisely, it is assumed that for every 𝜖, 𝜂 > 0, there exists a partition T = ∪_{i=1}^k T_i such that

lim sup_𝛼 P^∗(sup_i sup_{s,t∈T_i} |X_𝛼(s) − X_𝛼(t)| > 𝜖) < 𝜂.   (6.1)
Clearly, under this condition, the asymptotic behavior of the process can be described within error margin 𝜖, 𝜂 by the behavior of the marginals (X_𝛼(t_1), ..., X_𝛼(t_k)) for arbitrary fixed points t_i ∈ T_i. If the process can thus be reduced to a finite set of coordinates for any 𝜖, 𝜂 > 0 and the nets of marginal distributions are tight, then the net X_𝛼 is asymptotically tight.

Theorem 6.4: A net X_𝛼 : Ω_𝛼 → l^∞(T) is asymptotically tight if and only if X_𝛼(t) is asymptotically tight in R for every t and, for all 𝜖, 𝜂 > 0, there exists a finite partition T = ∪_{i=1}^k T_i such that (6.1) holds.

Proof: The necessity of the conditions follows easily from the next theorem. For instance, take the partition equal to disjointified balls of radius 𝛿 for a semi-metric on T as in the next theorem. We prove sufficiency. For any partition, as in the condition of the theorem, the norm ||X_𝛼||_T is bounded by max_i |X_𝛼(t_i)| + 𝜖, with inner probability at least 1 − 𝜂, if t_i ∈ T_i for each i. Since a maximum of finitely many tight nets of real variables is tight, it follows that the net ||X_𝛼||_T is asymptotically tight in R. Fix 𝜁 > 0 and a sequence 𝜖_m ↓ 0. Take a constant M such that lim sup P^∗(||X_𝛼||_T > M) < 𝜁, and for each 𝜖 = 𝜖_m and 𝜂 = 2^{−m}𝜁, take a partition T = ∪_{i=1}^k T_i as in (6.1). For the moment, m is fixed and we do not let it appear in the notation. Let z_1, ..., z_p be the set of all functions in l^∞(T) that are constant on each T_i and take on only the values 0, ±𝜖_m, ..., ±[M/𝜖_m]𝜖_m. Let K_m be the union of the p closed balls of radius 𝜖_m around the z_i. Then, by construction, the two conditions

||X_𝛼||_T ≤ M   and   sup_i sup_{s,t∈T_i} |X_𝛼(s) − X_𝛼(t)| ≤ 𝜖_m
imply that X_𝛼 ∈ K_m. This is true for each fixed m. Let K = ∩_{m=1}^∞ K_m. Then K is closed and totally bounded (by construction of the K_m and because 𝜖_m ↓ 0) and hence compact. Furthermore, for every 𝛿 > 0, there is an m with K^𝛿 ⊃ ∩_{i=1}^m K_i. If not, then there would be a sequence z_m not in K^𝛿, but with z_m ∈ ∩_{i=1}^m K_i for every m. This would have a subsequence contained in one of the balls making up K_1, a further subsequence eventually contained in one of the balls making up K_2, and so on. The “diagonal” sequence, formed by taking the first of the first subsequence, the second of the second subsequence, and so on, would eventually be contained in a ball of radius 𝜖_m for every m, hence Cauchy. Its limit would be in K, contradicting the fact that d(z_m, K) ≥ 𝛿 for every m. Conclude that if X_𝛼 is not in K^𝛿, then it is not in ∩_{i=1}^m K_i for some fixed m. Then

lim sup P^∗(X_𝛼 ∉ K^𝛿) ≤ lim sup P^∗(X_𝛼 ∉ ∩_{i=1}^m K_i) ≤ 𝜁 + ∑_{i=1}^m 𝜁2^{−i} < 2𝜁.
This concludes the proof of the theorem.
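The finite-approximation idea can be made concrete numerically: discretize the index set, partition it, and replace a bounded sample path by a function that is constant on each partition set; the uniform norm of the difference is exactly the oscillation controlled by (6.1). The following minimal Python sketch illustrates this (the grid, the particular random path, and all names are our own illustrative assumptions, not part of the theory above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized index set T = [0, 1] and one bounded random sample path.
T = np.linspace(0.0, 1.0, 1000)
xi = rng.standard_normal(3)
path = xi[0] * np.sin(2 * np.pi * T) + xi[1] * T + xi[2]

def sup_norm(z):
    """Uniform norm ||z||_T = sup_t |z(t)| over the grid."""
    return np.max(np.abs(z))

# Partition T into k intervals T_1, ..., T_k and approximate the path by a
# function that is constant on each T_i (its value at the left endpoint).
for k in [5, 20, 80]:
    labels = np.minimum((T * k).astype(int), k - 1)  # which T_i contains t
    approx = np.array([path[labels == i][0] for i in range(k)])[labels]
    # The sup-norm error is the within-partition oscillation of the path.
    print(k, sup_norm(path - approx))
```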
The second type of characterization of asymptotic tightness is deeper and relates the concept to asymptotic continuity of the sample paths. Suppose 𝜌 is a semi-metric on T. A net X_𝛼 : Ω_𝛼 → l^∞(T) is asymptotically uniformly 𝜌-equicontinuous in probability if for every 𝜖, 𝜂 > 0 there exists a 𝛿 > 0 such that

lim sup_𝛼 P^∗(sup_{𝜌(s,t)<𝛿} |X_𝛼(s) − X_𝛼(t)| > 𝜖) < 𝜂.

Theorem 6.5: A net X_𝛼 : Ω_𝛼 → l^∞(T) is asymptotically tight if and only if X_𝛼(t) is asymptotically tight in R for every t and there exists a semi-metric 𝜌 on T such that (T, 𝜌) is totally bounded and X_𝛼 is asymptotically uniformly 𝜌-equicontinuous in probability.

Proof: (⇐). Fix 𝜖, 𝜂 > 0 and take 𝛿 > 0 sufficiently small so that the last displayed inequality is valid. Since T is totally bounded, it can be covered with finitely many balls of radius 𝛿. Construct a partition of T by disjointifying these balls. (⇒). If X_𝛼 is asymptotically tight, then g(X_𝛼) is asymptotically tight for every continuous map g; in particular, for each coordinate projection. Let K_1 ⊂ K_2 ⊂ ... be compacts with lim inf P^∗(X_𝛼 ∈ K_m^𝜖) ≥ 1 − 1/m for every 𝜖 > 0. For every fixed m, define a semi-metric 𝜌_m on T by

𝜌_m(s, t) = sup_{z∈K_m} |z(s) − z(t)|,   s, t ∈ T.
Then (T, 𝜌_m) is totally bounded. Indeed, cover K_m by finitely many balls of radius 𝜂, centered at z_1, ..., z_k. Partition R^k into cubes of edge 𝜂, and for every cube pick at most one t ∈ T such that (z_1(t), ..., z_k(t)) is in the cube. Since z_1, ..., z_k are uniformly bounded, this gives finitely many points t_1, ..., t_p. Now the balls {t : 𝜌_m(t, t_i) < 3𝜂} cover T: t is in the ball around the t_i for which (z_1(t), ..., z_k(t)) and (z_1(t_i), ..., z_k(t_i)) fall in the same cube. This follows because 𝜌_m(t, t_i) can be bounded by 2 sup_{z∈K_m} inf_i ||z − z_i||_T + sup_j |z_j(t_i) − z_j(t)|. Next set

𝜌(s, t) = ∑_{m=1}^∞ 2^{−m}(𝜌_m(s, t) ∧ 1).

Fix 𝜂 > 0. Take a natural number m with 2^{−m} < 𝜂. Cover T with finitely many 𝜌_m-balls of radius 𝜂. Let t_1, ..., t_p be their centers. Since 𝜌_1 ≤ 𝜌_2 ≤ ..., there is for every t a t_i with

𝜌(t, t_i) ≤ ∑_{k=1}^m 2^{−k}𝜌_k(t, t_i) + 2^{−m} < 2𝜂.

Thus, (T, 𝜌) is totally bounded, too. It is clear from the definitions that |z(s) − z(t)| ≤ 𝜌_m(s, t) for every z ∈ K_m and
that 𝜌_m(s, t) ∧ 1 ≤ 2^m 𝜌(s, t). Also, if ||z_0 − z||_T < 𝜖 for z ∈ K_m, then |z_0(s) − z_0(t)| < 2𝜖 + |z(s) − z(t)| for any pair s, t. Deduce that

K_m^𝜖 ⊂ {z : sup_{𝜌(s,t)<2^{−m}𝜖} |z(s) − z(t)| ≤ 3𝜖}.

Given 𝜖, 𝜂 > 0, choose m > 1/𝜂 and 𝛿 < 2^{−m}𝜖; then lim sup P^∗(X_𝛼 ∉ K_m^𝜖) ≤ 1/m < 𝜂, and asymptotic uniform 𝜌-equicontinuity in probability follows. This completes the proof.

6.2 Maximal inequalities and covering numbers

Let 𝜓 be a nondecreasing, convex function with 𝜓(0) = 0, and let X be a random variable. The Orlicz norm of X is

||X||_𝜓 = inf{C > 0 : E𝜓(|X|/C) ≤ 1}.

By Markov's inequality applied to 𝜓(|X|/||X||_𝜓),

P(|X| > x) ≤ P(𝜓(|X|/||X||_𝜓) ≥ 𝜓(x/||X||_𝜓)) ≤ 1/𝜓(x/||X||_𝜓).

For 𝜓_p(x) = e^{x^p} − 1, this leads to tail estimates exp(−Cx^p) for any random variable with a finite 𝜓_p-norm. Conversely, an exponential tail bound of this type shows that ||X||_{𝜓_p} is finite.

Lemma 6.6: Let X be a random variable with P(|X| > x) ≤ Ke^{−Cx^p} for every x, for constants K and C, and for p ≥ 1. Then its Orlicz norm satisfies ||X||_{𝜓_p} ≤ ((1 + K)/C)^{1/p}.
Proof: By Fubini's theorem,

E(e^{D|X|^p} − 1) = E ∫_0^{|X|^p} De^{Ds} ds = ∫_0^∞ P(|X| > s^{1/p}) De^{Ds} ds.
Now insert the inequality on the tails of |X| and obtain the explicit upper bound KD/(C − D). This is less than or equal to 1 for D^{−1/p} greater than or equal to ((1 + K)/C)^{1/p}. This completes the proof.
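Since C ↦ E𝜓(|X|/C) is decreasing in C, the Orlicz norm can be approximated by bisection, with the expectation replaced by an empirical mean. A rough Monte Carlo sketch (the sample size, seed, and function names are illustrative assumptions, not part of the text):

```python
import numpy as np

def psi_p(x, p):
    """Orlicz function psi_p(x) = exp(x**p) - 1."""
    return np.expm1(x ** p)

def empirical_orlicz_norm(sample, p, tol=1e-6):
    """Smallest C with mean(psi_p(|X|/C)) <= 1, found by bisection
    (the map C -> mean(psi_p(|X|/C)) is decreasing in C)."""
    x = np.abs(sample)
    lo, hi = 1e-8, 10.0 * x.max() + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_p(x / mid, p).mean() > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(1)
g = rng.standard_normal(100_000)
# A standard normal has sub-Gaussian tails, hence a finite psi_2 norm
# (the exact value is sqrt(8/3), approximately 1.633).
print(empirical_orlicz_norm(g, p=2))
```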
Next consider the 𝜓-norm of a maximum of finitely many random variables. Using the fact that max |X_i|^p ≤ ∑ |X_i|^p, one easily obtains for the L_p-norms

||max_{1≤i≤m} X_i||_p = (E max_{1≤i≤m} |X_i|^p)^{1/p} ≤ m^{1/p} max_{1≤i≤m} ||X_i||_p.
A similar inequality is valid for many Orlicz norms, in particular the exponential ones. Here, in the general case, the factor m^{1/p} becomes 𝜓^{−1}(m).
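For 𝜓_2 this predicts that the expected maximum of m standard normal variables grows like 𝜓_2^{−1}(m) = √(log(1 + m)). A quick simulation of this growth rate (sizes and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# E max_{i<=m} |X_i| for standard normals versus psi_2^{-1}(m) = sqrt(log(1+m));
# the ratio should remain roughly constant as m grows.
for m in [10, 100, 1000, 10000]:
    sample = np.abs(rng.standard_normal((500, m)))
    emax = sample.max(axis=1).mean()
    print(m, round(emax, 3), round(emax / np.sqrt(np.log(1 + m)), 3))
```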
Lemma 6.7: Let 𝜓 be a convex, nondecreasing, nonzero function with 𝜓(0) = 0 and lim sup_{x,y→∞} 𝜓(x)𝜓(y)/𝜓(cxy) < ∞ for some constant c. Then, for any random variables X_1, ..., X_m,

||max_{1≤i≤m} X_i||_𝜓 ≤ K𝜓^{−1}(m) max_{1≤i≤m} ||X_i||_𝜓

for a constant K depending only on 𝜓.

Proof: For simplicity of notation, assume first that 𝜓(x)𝜓(y) ≤ 𝜓(cxy) for all x, y ≥ 1. In that case, 𝜓(x/y) ≤ 𝜓(cx)/𝜓(y) for all x ≥ y ≥ 1. Thus, for y ≥ 1 and any C,

max_i 𝜓(|X_i|/(Cy)) ≤ max_i [𝜓(c|X_i|/C)/𝜓(y) + 𝜓(|X_i|/(Cy)) 1{|X_i|/(Cy) < 1}] ≤ ∑_i 𝜓(c|X_i|/C)/𝜓(y) + 𝜓(1).

Set C = c max_i ||X_i||_𝜓, and take expectations to get

E𝜓(max_i |X_i|/(Cy)) ≤ m/𝜓(y) + 𝜓(1).

When 𝜓(1) ≤ 1/2, this is less than or equal to 1 for y = 𝜓^{−1}(2m), which is greater than 1 under the same condition. Thus,

||max_{1≤i≤m} X_i||_𝜓 ≤ 𝜓^{−1}(2m) c max_i ||X_i||_𝜓.

By the convexity of 𝜓 and the fact that 𝜓(0) = 0, it follows that 𝜓^{−1}(2m) ≤ 2𝜓^{−1}(m). The proof is complete for every special 𝜓 that meets the conditions made previously. For a general 𝜓, there are constants 𝜎 ≤ 1 and 𝜏 > 0 such that 𝜙(x) = 𝜎𝜓(𝜏x) satisfies the conditions of the previous paragraph. Apply the inequality to 𝜙, and observe that ||X||_𝜓 ≤ ||X||_𝜙/(𝜎𝜏) ≤ ||X||_𝜓/𝜎.

For the present purposes, the value of the constant in the previous lemma is irrelevant. The important conclusion is that the inverse of the 𝜓-function determines the size of the 𝜓-norm of a maximum in comparison to the 𝜓-norms of the individual terms. The 𝜓-norm grows slowest for rapidly increasing 𝜓. For 𝜓_p(x) = e^{x^p} − 1, the growth is at most logarithmic, because 𝜓_p^{−1}(m) = (log(1 + m))^{1/p}. The previous lemma is useless in the case of a maximum over infinitely many variables. However, such a case can be handled by repeated application of the lemma via
a method known as chaining. Every random variable in the supremum is written as a sum of “little links,” and the bound depends on the number and size of the little links needed. For a stochastic process {X_t : t ∈ T}, the number of links depends on the entropy of the index set for the semi-metric d(s, t) = ||X_s − X_t||_𝜓. The general definition of “metric entropy” is as follows.

Definition 6.3 (Covering numbers): Let (T, d) be an arbitrary semi-metric space. Then the covering number N(𝜖, d) is the minimal number of balls of radius 𝜖 needed to cover T. Call a collection of points 𝜖-separated if the distance between each pair of points is strictly larger than 𝜖. The packing number D(𝜖, d) is the maximum number of 𝜖-separated points in T. The corresponding entropy numbers are the logarithms of the covering and packing numbers, respectively.

For the present purposes, both covering and packing numbers can be used. In all arguments one can be replaced by the other through the inequalities

N(𝜖, d) ≤ D(𝜖, d) ≤ N(𝜖/2, d).

Clearly, covering and packing numbers become bigger as 𝜖 ↓ 0. By definition, the semi-metric space T is totally bounded if and only if the covering and packing numbers are finite for every 𝜖 > 0. The upper bound in the following maximal inequality depends on the rate at which D(𝜖, d) grows as 𝜖 ↓ 0, as measured through an integral criterion.

Theorem 6.6: Let 𝜓 be a convex, nondecreasing, nonzero function with 𝜓(0) = 0 and lim sup_{x,y→∞} 𝜓(x)𝜓(y)/𝜓(cxy) < ∞, for some constant c. Let {X_t : t ∈ T} be a separable stochastic process with

||X_s − X_t||_𝜓 ≤ Cd(s, t)   for every s, t,
for some semi-metric d on T and a constant C. Then, for any 𝜂, 𝛿 > 0,

||sup_{d(s,t)≤𝛿} |X_s − X_t| ||_𝜓 ≤ K[∫_0^𝜂 𝜓^{−1}(D(𝜖, d)) d𝜖 + 𝛿𝜓^{−1}(D^2(𝜂, d))],
for a constant K depending on 𝜓 and C only.
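Packing numbers can be estimated by the greedy procedure implicit in the definition: scan the points and keep each one that is farther than 𝜖 from every point kept so far. The sketch below (the finite grid and the semi-metric are assumed choices) returns the size of a maximal 𝜖-separated set, which lies between N(𝜖, d) and D(𝜖, d) and therefore tracks the growth rate of both:

```python
import numpy as np

def separated_subset_size(points, eps, dist):
    """Greedily build a maximal eps-separated subset of `points`;
    its size lies between N(eps, d) and D(eps, d)."""
    chosen = []
    for p in points:
        if all(dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return len(chosen)

# Example: T = [0, 1] with d(s, t) = sqrt(|s - t|), the standard deviation
# semi-metric of Brownian motion, for which D(eps, d) grows like eps**-2.
T = np.linspace(0.0, 1.0, 2001)
d = lambda s, t: np.sqrt(abs(s - t))
for eps in [0.5, 0.25, 0.125]:
    print(eps, separated_subset_size(T, eps, d))
```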
Corollary 6.1: The constant K can be chosen such that

||sup_{s,t} |X_s − X_t| ||_𝜓 ≤ K ∫_0^{diam T} 𝜓^{−1}(D(𝜖, d)) d𝜖,
where diam T is the diameter of T.

Proof: Assume without loss of generality that the packing numbers and the associated “covering integral” are finite. Construct nested sets T_0 ⊂ T_1 ⊂ ⋅⋅⋅ ⊂ T such that every T_j is a maximal set of points such that d(s, t) > 𝜂2^{−j} for every s, t ∈ T_j, where “maximal” means that no point can be added without destroying the validity of the inequality. By the definition of packing numbers, the number of points in T_j is less than or equal to D(𝜂2^{−j}, d). “Link” every point t_{j+1} ∈ T_{j+1} to a unique t_j ∈ T_j such that d(t_j, t_{j+1}) ≤ 𝜂2^{−j}. Thus, obtain for every t_{k+1} a chain t_{k+1}, t_k, ..., t_0 that connects it to a point in T_0. For arbitrary points s_{k+1}, t_{k+1} in T_{k+1}, the difference in increments along their chains can be bounded by

|(X_{s_{k+1}} − X_{s_0}) − (X_{t_{k+1}} − X_{t_0})| = |∑_{j=0}^k (X_{s_{j+1}} − X_{s_j}) − ∑_{j=0}^k (X_{t_{j+1}} − X_{t_j})| ≤ 2 ∑_{j=0}^k max |X_u − X_v|,
where for fixed j the maximum is taken over all links (u, v) from T_{j+1} to T_j. Thus, the jth maximum is taken over at most #T_{j+1} links, with each link having a 𝜓-norm ||X_u − X_v||_𝜓 bounded by Cd(u, v) ≤ C𝜂2^{−j}. It follows with the help of Lemma 6.7 that, for a constant depending only on 𝜓 and C,

||max_{s,t∈T_{k+1}} |(X_s − X_{s_0}) − (X_t − X_{t_0})| ||_𝜓 ≤ K ∑_{j=0}^k 𝜓^{−1}(D(𝜂2^{−j−1}, d)) 𝜂2^{−j} ≤ 4K ∫_0^𝜂 𝜓^{−1}(D(𝜖, d)) d𝜖.   (6.2)
In this bound, s0 and t0 are the endpoints of the chains starting at s and t, respectively. The maximum of the increments |X s k+1 − X t k+1 | can be bounded by the maximum on the left side of (6.2) plus the maximum of the discrepancies |X s0 − X t0 | at the end of the chains. The maximum of the latter discrepancies will be analyzed by a seemingly circular argument. For every pair of endpoints s0 , t0 of chains starting at two points in
T_{k+1} within distance 𝛿 of each other, choose exactly one pair s_{k+1}, t_{k+1} in T_{k+1}, with d(s_{k+1}, t_{k+1}) < 𝛿, whose chains end at s_0, t_0. By definition of T_0, this gives at most D^2(𝜂, d) pairs. By the triangle inequality,

|X_{s_0} − X_{t_0}| ≤ |(X_{s_0} − X_{s_{k+1}}) − (X_{t_0} − X_{t_{k+1}})| + |X_{s_{k+1}} − X_{t_{k+1}}|.

Take the maximum over all pairs of endpoints s_0, t_0 as above. Then the corresponding maximum over the first term on the right in the last display is bounded by the maximum on the left side of (6.2), so its 𝜓-norm can be bounded by the right side of this equation. The second term is a maximum over at most D^2(𝜂, d) variables, each with 𝜓-norm bounded by C𝛿, so by Lemma 6.7 its 𝜓-norm is bounded by a multiple of 𝛿𝜓^{−1}(D^2(𝜂, d)). Combine this with (6.2) to find that

||max_{s,t∈T_{k+1}, d(s,t)<𝛿} |X_s − X_t| ||_𝜓

is bounded by a multiple of the right side of the maximal inequality, uniformly in k. By separability, letting k → ∞ gives the theorem for every 𝛿 > 0. It is a small step to deduce the continuity of almost all sample paths from this inequality, but this is not needed at this point.
6.3 Sub-Gaussian inequalities

A standard normal variable has tails of the order x^{−1} exp(−x^2/2) and satisfies P(|X| > x) ≤ 2 exp(−x^2/2) for every x. In this section, we study random variables satisfying similar tail bounds. Hoeffding's inequality asserts a “sub-Gaussian” tail bound for random variables of the form X = ∑ X_i with X_1, ..., X_n i.i.d. with zero means and bounded range. The following special case of Hoeffding's inequality will be needed.

Theorem 6.7 (Hoeffding's inequality): Let a_1, ..., a_n be constants and 𝜖_1, ..., 𝜖_n be independent Rademacher random variables, i.e., with P(𝜖_i = 1) = P(𝜖_i = −1) = 1/2. Then

P(|∑ 𝜖_i a_i| > x) ≤ 2e^{−x^2/(2||a||^2)},

for the Euclidean norm ||a||. Consequently, ||∑ 𝜖_i a_i||_{𝜓_2} ≤ √6 ||a||.

Proof: For any 𝜆 and a Rademacher variable 𝜖, one has Ee^{𝜆𝜖} = (e^𝜆 + e^{−𝜆})/2 ≤ e^{𝜆^2/2}, where the last inequality follows after writing out the power series. Thus by Markov's inequality, for any 𝜆 > 0,
P(∑_{i=1}^n 𝜖_i a_i > x) ≤ e^{−𝜆x} Ee^{𝜆∑_{i=1}^n a_i 𝜖_i} ≤ e^{(𝜆^2/2)||a||^2 − 𝜆x}.
The best upper bound is obtained for 𝜆 = x/||a||^2 and is the exponential in the probability bound of the lemma. Combination with a similar bound for the lower tail yields the probability bound. The bound on the 𝜓_2-norm is a consequence of the probability bound in view of Lemma 6.6.

A stochastic process is called sub-Gaussian with respect to the semi-metric d on its index set if

P(|X_s − X_t| > x) ≤ 2e^{−x^2/(2d^2(s,t))}   for every s, t ∈ T, x > 0.
Any Gaussian process is sub-Gaussian for the standard deviation semi-metric d(s, t) = 𝜎(X_s − X_t). Another example is the Rademacher process

X_a = ∑_{i=1}^n a_i 𝜖_i,   a ∈ R^n,
for Rademacher variables 𝜖1 , ..., 𝜖n . By Hoeffding’s inequality, this is sub-Gaussian for the Euclidean distance d(a, b) = ||a − b||.
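Hoeffding's inequality is easy to verify by simulation: draw many Rademacher vectors, form ∑ 𝜖_i a_i, and compare the empirical tail with 2 exp(−x^2/(2||a||^2)). A minimal sketch with assumed coefficients and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(0.5, 1.5, size=50)       # fixed coefficients a_1, ..., a_n
norm_sq = np.sum(a ** 2)                 # ||a||^2

signs = rng.choice([-1.0, 1.0], size=(100_000, a.size))  # Rademacher signs
s = signs @ a                            # realizations of sum_i eps_i a_i

for x in [2.0, 5.0, 8.0, 11.0]:
    empirical = np.mean(np.abs(s) > x)
    bound = 2.0 * np.exp(-x ** 2 / (2.0 * norm_sq))
    print(x, empirical, bound)           # the empirical tail sits below the bound
```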
Sub-Gaussian processes satisfy the increment bound ||X_s − X_t||_{𝜓_2} ≤ √6 d(s, t). Since the inverse of the 𝜓_2-function is essentially the square root of the logarithm, the general maximal inequality leads for sub-Gaussian processes to a bound in terms of an entropy integral. Furthermore, because of the special properties of the logarithm, the statement can be slightly simplified.

Corollary 6.2: Let {X_t : t ∈ T} be a separable sub-Gaussian process. Then for every 𝛿 > 0,

E sup_{d(s,t)≤𝛿} |X_s − X_t| ≤ K ∫_0^𝛿 √(log D(𝜖, d)) d𝜖,

for a universal constant K. In particular, for any t_0,

E sup_t |X_t| ≤ E|X_{t_0}| + K ∫_0^∞ √(log D(𝜖, d)) d𝜖.
Proof: Apply the general maximal inequality with 𝜓_2(x) = e^{x^2} − 1 and 𝜂 = 𝛿. Since 𝜓_2^{−1}(m) = √(log(1 + m)), we have 𝜓_2^{−1}(D^2(𝛿, d)) ≤ √2 𝜓_2^{−1}(D(𝛿, d)). Thus, the second term in the maximal inequality can first be replaced by √2 𝛿𝜓_2^{−1}(D(𝜂, d)) and next be incorporated in the first at the cost of increasing the constant. We obtain

||sup_{d(s,t)≤𝛿} |X_s − X_t| ||_{𝜓_2} ≤ K ∫_0^𝛿 √(log(1 + D(𝜖, d))) d𝜖.
Here D(𝜖, d) ≥ 2 for every 𝜖 that is strictly less than the diameter of T. Since log(1+m) ≤ 2 log m for m ≥ 2, the 1 inside the logarithm can be removed at the cost of increasing K.
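For Brownian motion on [0, 1], d(s, t) = √|s − t| is the standard deviation semi-metric and D(𝜖, d) is of order 𝜖^{−2}, so the entropy integral is finite. The following rough numerical sketch (discretizations and seed are our own assumptions) compares the integral with a simulated E sup_t |B_t|:

```python
import numpy as np

rng = np.random.default_rng(4)

# Entropy integral for d(s, t) = sqrt(|s - t|) on [0, 1]: D(eps, d) ~ eps**-2,
# so the integrand is sqrt(log(1/eps**2)); estimate the integral over (0, 1].
eps = np.linspace(1e-4, 1.0, 100_000)
entropy_integral = np.mean(np.sqrt(np.log(1.0 / eps ** 2))) * (1.0 - 1e-4)

# Monte Carlo estimate of E sup_t |B_t| for Brownian motion on [0, 1].
n_steps, n_paths = 2000, 2000
paths = np.cumsum(rng.standard_normal((n_paths, n_steps)), axis=1) / np.sqrt(n_steps)
e_sup = np.abs(paths).max(axis=1).mean()

# Corollary 6.2 asserts e_sup <= K * entropy_integral for a universal K.
print(entropy_integral, e_sup)
```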
6.4 Symmetrization

Let 𝜖_1, ..., 𝜖_n be i.i.d. Rademacher random variables. Instead of the empirical process

f ↦ (P_n − P)f = (1/n) ∑_{i=1}^n (f(X_i) − Pf),

consider the symmetrized process

f ↦ P_n^0 f = (1/n) ∑_{i=1}^n 𝜖_i f(X_i),
where 𝜖1 , ..., 𝜖n are independent of (X1 , ..., X n ). Both processes have mean function zero. It turns out that the law of large numbers or the central limit theorem for one of these processes holds if and only if the corresponding result is true for the other
process. One main approach to proving empirical limit theorems is to pass from P_n − P to P_n^0 and next apply arguments conditionally on the original X's. The idea is that, for fixed X_1, ..., X_n, the symmetrized empirical measure is a Rademacher process, hence a sub-Gaussian process, to which Corollary 6.2 can be applied. Thus, we need to bound maxima and moduli of the process P_n − P by those of the symmetrized process. To formulate such bounds, we must be careful about the possible nonmeasurability of suprema of the type ||P_n − P||_ℱ. The result will be formulated in terms of outer expectation, but it does not hold for every choice of an underlying probability space on which X_1, ..., X_n are defined. Throughout this part, if outer expectations are involved, it is assumed that X_1, ..., X_n are the coordinate projections on the product space (𝒳^n, 𝒜^n, P^n), and the outer expectations of functions (X_1, ..., X_n) ↦ h(X_1, ..., X_n) are computed for P^n. Thus, “independent” is understood in terms of a product probability space. If auxiliary variables, independent of the X's, are involved, as in the next lemma, we use a similar convention. In that case, the underlying probability space is assumed to be of the form (𝒳^n, 𝒜^n, P^n) × (𝒵, 𝒞, Q) with X_1, ..., X_n equal to the coordinate projections on the first n coordinates and the additional variables depending only on the (n + 1)st coordinate. The following lemma will be used mostly with the choice Φ(x) = x.

Lemma 6.8 (symmetrization): For every nondecreasing, convex Φ : R → R and class of measurable functions ℱ,

E^∗Φ(||P_n − P||_ℱ) ≤ E^∗Φ(2||P_n^0||_ℱ),

where the outer expectations are computed as indicated in the preceding paragraph.

Proof: Let Y_1, ..., Y_n be independent copies of X_1, ..., X_n, defined formally as the coordinate projections on the last n coordinates in the product space (𝒳^n, 𝒜^n, P^n) × (𝒵, 𝒞, Q) × (𝒳^n, 𝒜^n, P^n). The outer expectations in the statement of the lemma are unaffected by this enlargement of the underlying probability space, because coordinate projections are perfect maps. For fixed values X_1, ..., X_n,

||P_n − P||_ℱ = sup_{f∈ℱ} |(1/n) ∑_{i=1}^n (f(X_i) − Ef(Y_i))| ≤ E_Y^∗ sup_{f∈ℱ} |(1/n) ∑_{i=1}^n (f(X_i) − f(Y_i))|,

where E_Y^∗ is the outer expectation with respect to Y_1, ..., Y_n computed for P^n for given, fixed values of X_1, ..., X_n. Combination with Jensen's inequality yields

Φ(||P_n − P||_ℱ) ≤ E_Y^∗ Φ(||(1/n) ∑_{i=1}^n (f(X_i) − f(Y_i))||_ℱ^{∗Y}),
where ∗Y denotes the minimal measurable majorant of the supremum with respect to Y_1, ..., Y_n, still with X_1, ..., X_n fixed. Because Φ is nondecreasing and continuous, the ∗Y inside Φ can be moved to E_Y^∗. Next take the expectation with respect to X_1, ..., X_n to get

E^∗Φ(||P_n − P||_ℱ) ≤ E_X^∗ E_Y^∗ Φ(||(1/n) ∑_{i=1}^n (f(X_i) − f(Y_i))||_ℱ).

Here the repeated outer expectation can be bounded above by the joint outer expectation E^∗ by Fubini's theorem. Adding a minus sign in front of a term (f(X_i) − f(Y_i)) has the effect of exchanging X_i and Y_i. By construction of the underlying probability space as a product space, the outer expectation of any function f(X_1, ..., X_n, Y_1, ..., Y_n) remains unchanged under permutations of its 2n arguments; hence, the expression

E^∗Φ(||(1/n) ∑_{i=1}^n e_i(f(X_i) − f(Y_i))||_ℱ)

is the same for any n-tuple (e_1, ..., e_n) ∈ {−1, 1}^n. Deduce that

E^∗Φ(||P_n − P||_ℱ) ≤ E_𝜖 E_{X,Y}^∗ Φ(||(1/n) ∑_{i=1}^n 𝜖_i(f(X_i) − f(Y_i))||_ℱ).

Use the triangle inequality to separate the contributions of the X's and the Y's and next use the convexity of Φ to bound the previous expression by

(1/2) E_𝜖 E_{X,Y}^∗ Φ(2||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_ℱ) + (1/2) E_𝜖 E_{X,Y}^∗ Φ(2||(1/n) ∑_{i=1}^n 𝜖_i f(Y_i)||_ℱ).

By perfectness of coordinate projections, the expectation E_{X,Y}^∗ is the same as E_X^∗ and E_Y^∗ in the two terms, respectively. Finally, replace the repeated outer expectations by a joint outer expectation. This completes the proof.
Here the repeated outer expectation can be bounded above by the joint outer expectation E∗ by Fubini’s theorem. Adding a minus sign in front of a term (f (X i ) − f (Y i )) has the effect of exchanging X i and Y i . By construction of the underlying probability space as a product space, the outer expectation of any function f (X1 , ..., X n , Y1 , ..., Y n ) remains unchanged under permutations of its 2n arguments, hence the expression 1 n E∗ Φ( ∑ e i (f (X i ) − f (Y i )) ) ℱ n i=1 is the same for any n-tuple (e1 , ..., e n ) ∈ {−1, 1}n . Deduce that 1 n E∗ Φ(||P n − P||ℱ ) ≤ E𝜖 E∗X,Y Φ( ∑ 𝜖i (f (X i ) − f (Y i )) ). ℱ n i=1 Use the triangle inequality to separate the contributions of the X’s and the Y’s and next use the convexity of Φ and triangle inequality to bound the previous expression by 1 n 1 n 1 1 E𝜖 E∗X,Y Φ(2 ∑ 𝜖i f (X i ) ) + E𝜖 E∗X,Y Φ(2 ∑ 𝜖i f (Y i ) ). ℱ ℱ 2 n i=1 2 n i=1 By perfectness of coordinate projections, the expectation E∗X,Y is the same as E∗X and E∗Y in the two terms, respectively. Finally, replace the repeated outer expectations by a joint outer expectation. This completes the proof. The symmetrization lemma is valid for any class ℱ. In the proofs of GlivenkoCantelli and Donsker theorems, it will be applied not only to the original set of functions of interest, but also to several classes constructed from such a set ℱ. The next step in these proofs is to apply a maximal inequality to the right side of the lemma, conditionally on X1 , ..., X n . At that point, we need to write the joint outer expectation as the repeated expectation E∗X E𝜖 , where the indices X and 𝜖 mean expectation over X and 𝜖 conditionally on remaining variables. Unfortunately, Fubini’s theorem is not valid for outer expectations. To overcome this problem, it is assumed that the integrand in the right side of the lemma is jointly measurable in (X1 , ..., X n , 𝜖1 , ..., 𝜖n ).
Since the Rademacher variables are discrete, this is the case if and only if the maps

(X_1, ..., X_n) ↦ ||∑_{i=1}^n e_i f(X_i)||_ℱ   (6.3)
are measurable for every n-tuple (e1 , ..., e n ) ∈ {−1, 1}n . For the intended application of Fubini’s theorem, it suffices that this is the case for the completion of (𝒳 n , 𝒜n , P n ). Definition 6.4 (measurable class): A class ℱ of measurable functions f : 𝒳 → R on a probability space (𝒳 , 𝒜, P) is called a P-measurable class if the function (6.3) is measurable on the completion of (𝒳 n , 𝒜n , P n ) for every n and every vector (e1 , ..., e n ) ∈ R n .
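The symmetrization lemma with Φ(x) = x can be observed numerically for the class of half-line indicators ℱ = {1{(−∞, t]} : t}, for which ||P_n − P||_ℱ is the Kolmogorov–Smirnov statistic. A minimal sketch (uniform observations and a finite grid of t's are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2000
t_grid = np.linspace(0.0, 1.0, 401)      # finite subclass of F = {1{x <= t}}

lhs, rhs = [], []
for _ in range(reps):
    x = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)
    ind = x[:, None] <= t_grid[None, :]  # f(X_i) for every f in the grid
    emp = np.abs(ind.mean(axis=0) - t_grid).max()          # ||P_n - P||_F, Pf = t
    sym = np.abs((eps[:, None] * ind).mean(axis=0)).max()  # ||P_n^0||_F
    lhs.append(emp)
    rhs.append(2.0 * sym)

# Symmetrization with Phi(x) = x: E||P_n - P||_F <= 2 E||P_n^0||_F.
print(np.mean(lhs), np.mean(rhs))
```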
6.4.1 Glivenko-Cantelli theorems In this section, we prove two types of Glivenko-Cantelli theorems. The first theorem is the simplest and is based on entropy with bracketing. Its proof relies on finite approximation and the law of large numbers for real variables. The second theorem uses random L1 -entropy numbers and is proved through symmetrization followed by a maximal inequality. Definition 6.5 (Covering numbers): The covering number N(𝜖, ℱ , || ⋅ ||) is the minimal number of balls {g : ||g−f || < 𝜖} of radius 𝜖 needed to cover the set ℱ. The centers of the balls need not belong to ℱ, but they should have finite norms. The entropy (without bracketing) is the logarithm of the covering number. Definition 6.6 (bracketing numbers): Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. An 𝜖-bracket is a bracket [l, u] with ||u − l|| < 𝜖. The bracketing number N[] (𝜖, ℱ , || ⋅ ||) is the minimum number of 𝜖-brackets needed to cover ℱ. The entropy with bracketing is the logarithm of the bracketing number. In the definition of the bracketing number, the upper and lower bounds u and l of the brackets need not belong to ℱ themselves but are assumed to have finite norms. Theorem 6.8: Let ℱ be a class of measurable functions such that N[] (𝜖, ℱ , L1 (P)) < ∞ for every 𝜖 > 0. Then ℱ is Glivenko-Cantelli. Proof: Fix 𝜖 > 0. Choose finitely many 𝜖-brackets [l i , u i ] whose union contains ℱ and such that P(u i − l i ) < 𝜖 for every i. Then for every f ∈ ℱ, there is a bracket such that (P n − P)f ≤ (P n − P)u i + P(u i − f ) ≤ (P n − P)u i + 𝜖
Consequently,

sup_{f∈ℱ} (P_n − P)f ≤ max_i (P_n − P)u_i + 𝜖.

The right side converges almost surely to 𝜖 by the strong law of large numbers for real variables. Combination with a similar argument for inf_{f∈ℱ}(P_n − P)f yields that lim sup ||P_n − P||_ℱ^∗ ≤ 𝜖 almost surely, for every 𝜖 > 0. Take a sequence 𝜖_m ↓ 0 to see that the limsup must actually be zero almost surely. This completes the proof.

An envelope function of a class ℱ is any function x ↦ F(x) such that |f(x)| ≤ F(x), for every x and f. The minimal envelope function is x ↦ sup_f |f(x)|. It will usually be assumed that this function is finite for every x.

Theorem 6.9: Let ℱ be a P-measurable class of measurable functions with envelope F such that P^∗F < ∞. Let ℱ_M be the class of functions f1{F ≤ M} when f ranges over ℱ. If log N(𝜖, ℱ_M, L_1(P_n)) = o_P^∗(n) for every 𝜖 and M > 0, then ||P_n − P||_ℱ^∗ → 0 both almost surely and in mean. In particular, ℱ is Glivenko-Cantelli.

Proof: By the symmetrization lemma, measurability of the class ℱ, and Fubini's theorem,

E^∗||P_n − P||_ℱ ≤ 2E_X E_𝜖 ||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_ℱ ≤ 2E_X E_𝜖 ||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_{ℱ_M} + 2P^∗F{F > M}

by the triangle inequality, for every M > 0. For sufficiently large M, the last term is arbitrarily small. To prove convergence in mean, it suffices to show that the first term converges to zero for fixed M. Fix X_1, ..., X_n. If 𝒢 is an 𝜖-net in L_1(P_n) over ℱ_M, then

E_𝜖 ||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_{ℱ_M} ≤ E_𝜖 ||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_𝒢 + 𝜖.
The cardinality of 𝒢 can be chosen equal to N(𝜖, ℱ_M, L_1(P_n)). Bound the L_1-norm on the right by the Orlicz norm for 𝜓_2(x) = exp(x^2) − 1, and use the maximal inequality Lemma 6.7 to find that the last expression does not exceed a multiple of

√(1 + log N(𝜖, ℱ_M, L_1(P_n))) sup_{f∈𝒢} ||(1/n) ∑_{i=1}^n 𝜖_i f(X_i)||_{𝜓_2|X} + 𝜖,   (6.4)
where the Orlicz norms || ⋅ ||_{𝜓_2|X} are taken over 𝜖_1, ..., 𝜖_n with X_1, ..., X_n fixed. By Hoeffding's inequality, they can be bounded by √(6/n) (P_n f^2)^{1/2}, which is less than √(6/n) M. Thus, the last displayed expression is bounded by

√(1 + log N(𝜖, ℱ_M, L_1(P_n))) √(6/n) M + 𝜖 →_{P^∗} 𝜖.

It has been shown that the left side of (6.4) converges to zero in probability. Since it is bounded by M, its expectation with respect to X_1, ..., X_n converges to zero by the dominated convergence theorem. This concludes the proof that ||P_n − P||_ℱ^∗ → 0 in mean. That it also converges almost surely follows from the fact that the sequence ||P_n − P||_ℱ^∗ is a reverse martingale with respect to a suitable filtration.

Fact: Let ℱ be a class of measurable functions with envelope F such that P^∗F < ∞. Define 𝛴_n to be the 𝜎-field generated by measurable functions h : 𝒳^∞ → R that are permutation-symmetric in their first n arguments. Then

E(||P_n − P||_ℱ^∗ | 𝛴_{n+1}) ≥ ||P_{n+1} − P||_ℱ^∗   a.s.
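For half-line indicators, brackets [1{x ≤ t_{i−1}}, 1{x ≤ t_i}] placed at quantiles of P give finite bracketing numbers for every 𝜖, so Theorem 6.8 recovers the classical Glivenko–Cantelli theorem. A quick numerical illustration (standard normal P is an assumed choice):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(6)
t_grid = np.linspace(-4.0, 4.0, 801)
F = np.array([0.5 * (1.0 + erf(t / np.sqrt(2.0))) for t in t_grid])  # normal cdf

# sup_t |F_n(t) - F(t)| -> 0 almost surely (Glivenko-Cantelli).
for n in [100, 1000, 10_000, 100_000]:
    x = np.sort(rng.standard_normal(n))
    F_n = np.searchsorted(x, t_grid, side="right") / n  # empirical cdf on grid
    print(n, np.abs(F_n - F).max())
```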
6.4.2 Donsker theorems

Uniform entropy: In this section, the weak convergence of the empirical process will be established under the condition that the envelope function F be square integrable, combined with the uniform entropy bound

∫_0^∞ sup_Q √(log N(𝜖||F||_{Q,2}, ℱ, L_2(Q))) d𝜖 < ∞.   (6.5)
Here the supremum is taken over all finitely discrete probability measures Q on (𝒳, 𝒜) with ||F||_{Q,2}^2 = ∫ F^2 dQ > 0. These conditions are by no means necessary, but they suffice for many examples. The finiteness of the previous integral will be referred to as the uniform entropy condition.

Theorem 6.10: Let ℱ be a class of measurable functions that satisfies the uniform entropy bound (6.5). Let the classes ℱ_𝛿 = {f − g : f, g ∈ ℱ, ||f − g||_{P,2} < 𝛿} and ℱ_∞^2 be P-measurable for every 𝛿 > 0. If P^∗F^2 < ∞, then ℱ is P-Donsker.

Proof: Let 𝛿_n ↓ 0 be arbitrary. By Markov's inequality and the symmetrization lemma,

P^∗(||G_n||_{ℱ_{𝛿_n}} > x) ≤ (2/x) E^∗||(1/√n) ∑_{i=1}^n 𝜖_i f(X_i)||_{ℱ_{𝛿_n}}.
Since the supremum in the right-hand side is measurable by assumption, Fubini's theorem applies and the outer expectation can be calculated as E_X E_𝜖. Fix X_1, ..., X_n. By Hoeffding's inequality, the stochastic process f ↦ {n^{−1/2} ∑_{i=1}^n 𝜖_i f(X_i)} is sub-Gaussian for the L_2(P_n)-seminorm

||f||_n = √((1/n) ∑_{i=1}^n f^2(X_i)).

Use the second part of the maximal inequality Corollary 6.2 to find that

E_𝜖 ||(1/√n) ∑_{i=1}^n 𝜖_i f(X_i)||_{ℱ_{𝛿_n}} ≤ ∫_0^∞ √(log N(𝜖, ℱ_{𝛿_n}, L_2(P_n))) d𝜖.

For large values of 𝜖, the set ℱ_{𝛿_n} fits in a single ball of radius 𝜖 around the origin, in which case the integrand is zero. This is certainly the case for values of 𝜖 larger than 𝜃_n, where

𝜃_n^2 = sup_{f∈ℱ_{𝛿_n}} ||f||_n^2 = ||(1/n) ∑_{i=1}^n f^2(X_i)||_{ℱ_{𝛿_n}}.
Furthermore, covering numbers of the class ℱ_{𝛿_n} are bounded by covering numbers of ℱ_∞ = {f − g : f, g ∈ ℱ}. The latter satisfy N(𝜖, ℱ_∞, L_2(Q)) ≤ N^2(𝜖/2, ℱ, L_2(Q)) for every measure Q. Limit the integral in (6.5) to the interval (0, 𝜃_n), make a change of variables, and bound the integrand to obtain the bound

∫_0^{𝜃_n/||F||_n} sup_Q √(log N(𝜖||F||_{Q,2}, ℱ, L_2(Q))) d𝜖 ⋅ ||F||_n.
Here the supremum is taken over all discrete probability measures. The integrand is integrable by assumption. Furthermore, ||F||_n is bounded below by ||F_∗||_n, which converges almost surely to its expectation, which may be assumed positive. Use the Cauchy–Schwarz inequality and the dominated convergence theorem to see that the expectation of this integral converges to zero provided 𝜃_n →_{P^∗} 0. This would conclude the proof of asymptotic equicontinuity. Since sup{Pf^2 : f ∈ ℱ_{𝛿_n}} → 0 and ℱ_{𝛿_n} ⊂ ℱ_∞, it is certainly enough to prove that ||P_n f^2 − Pf^2||_{ℱ_∞} →_{P^∗} 0. This is a uniform law of large numbers for the class ℱ_∞^2. This class has integrable envelope (2F)^2 and is measurable by assumption. For any pair f, g of functions in ℱ_∞,
P_n|f^2 − g^2| ≤ P_n|f − g| 4F ≤ ||f − g||_n ||4F||_n.
It follows that the covering number N(𝜖||2F||_n^2, ℱ_∞^2, L_1(P_n)) is bounded by the covering number N(𝜖||F||_n, ℱ_∞, L_2(P_n)). By assumption, the latter number is bounded by a fixed number, so its logarithm is certainly o_P^∗(n), as required for the uniform law of large numbers, Theorem 6.9. This concludes the proof of asymptotic equicontinuity. Finally, we show that ℱ is totally bounded in L_2(P). By the result of the last paragraph, there exists a sequence of discrete measures P_n with ||(P_n − P)f^2||_{ℱ_∞} converging to zero. Take n sufficiently large so that the supremum is bounded by 𝜖^2. By assumption, N(𝜖, ℱ, L_2(P_n)) is finite. Any 𝜖-net for ℱ in L_2(P_n) is a √2𝜖-net in L_2(P). This completes the proof.
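Half-line indicators also satisfy the uniform entropy bound (they form a VC class), so Theorem 6.10 yields the classical Donsker theorem: G_n = √n(P_n − P) converges in l^∞ to a Brownian bridge, and ||G_n||_ℱ converges to the Kolmogorov distribution. A simulation sketch under assumed choices (uniform P, finite grid, fixed seed):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 1000, 5000
t_grid = np.linspace(0.0, 1.0, 201)

ks = []
for _ in range(reps):
    x = rng.uniform(size=n)
    F_n = (x[:, None] <= t_grid[None, :]).mean(axis=0)
    ks.append(np.sqrt(n) * np.abs(F_n - t_grid).max())  # ||G_n||_F
ks = np.array(ks)

def kolmogorov_cdf(z, terms=100):
    """P(sup_t |B(t)| <= z) = 1 - 2 sum_{k>=1} (-1)**(k-1) exp(-2 k**2 z**2)."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k ** 2 * z ** 2))

for z in [0.8, 1.0, 1.36]:
    print(z, np.mean(ks <= z), kolmogorov_cdf(z))  # the two should be close
```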
6.5 Lindeberg-type theorem and its applications

In this section, we want to show how the methods developed earlier of symmetrization and entropy bounds can be used to prove a limit theorem for sums of independent stochastic processes Z_{ni}(f), f ∈ ℱ, with bounded sample paths indexed by an arbitrary set ℱ. We need some preliminary results for this purpose.

Lemma 6.5.1: Let Z_1, Z_2, ..., Z_n be independent stochastic processes with mean 0. With || ⋅ ||_ℱ denoting the sup norm on ℱ, we get

E^∗Φ((1/2)||∑_{i=1}^n 𝜖_i Z_i||_ℱ) ≤ E^∗Φ(||∑_{i=1}^n Z_i||_ℱ) ≤ E^∗Φ(2||∑_{i=1}^n 𝜖_i(Z_i − 𝜇_i)||_ℱ)

for every nondecreasing, convex Φ : R → R and arbitrary functions 𝜇_i : ℱ → R.

Proof: The inequality on the right can be proved using techniques as in the proof of the symmetrization lemma and is left to the reader. For the inequality on the left, let Y_1, Y_2, ..., Y_n be independent copies of Z_1, ..., Z_n defined on

(∏_{i=1}^n (𝒳_i, 𝒜_i, P_i)) × (𝒵, 𝒢, 𝒬) × (∏_{i=1}^n (𝒳_i, 𝒜_i, P_i))

(the 𝜖_i's are defined on (𝒵, 𝒢, 𝒬)) and depending on the last n coordinates exactly as Z_1, ..., Z_n depend on the first n coordinates. Since EY_i(f) = 0, the LHS of the above expression is the average of

E^∗Φ((1/2)||∑_{i=1}^n e_i[Z_i(f) − EY_i(f)]||_ℱ),

where (e_1, ..., e_n) ranges over {−1, 1}^n. By Jensen's inequality,

E_{Z,Y}^∗Φ((1/2)||∑_{i=1}^n e_i[Z_i(f) − EY_i(f)]||_ℱ) ≤ E_{Z,Y}^∗Φ((1/2)||∑_{i=1}^n e_i[Z_i(f) − Y_i(f)]||_ℱ),

which by symmetry equals E_{Z,Y}^∗Φ((1/2)||∑_{i=1}^n [Z_i(f) − Y_i(f)]||_ℱ). Apply the triangle inequality and convexity of Φ to complete the proof.
Lemma 6.5.2: For arbitrary stochastic processes Z_1, ..., Z_n, arbitrary functions 𝜇_1, ..., 𝜇_n : ℱ → R, and x > 0,

𝛽_n(x) P^∗(||∑_{i=1}^n Z_i||_ℱ > x) ≤ 2P^∗(4||∑_{i=1}^n 𝜖_i(Z_i − 𝜇_i)||_ℱ > x),

where 𝛽_n(x) ≤ inf_f P(|∑_{i=1}^n Z_i(f)| < x/2). In particular, this is true for i.i.d. mean-0 processes and

𝛽_n(x) = 1 − (4n/x^2) sup_f var(Z_1(f)).
Proof: Let Y_1, ..., Y_n be independent copies of Z_1, Z_2, ..., Z_n as defined above. If ||∑_{i=1}^n Z_i||_ℱ > x, then there is some f for which

|∑_{i=1}^n Z_i(f)| > x.

Fix a realization of Z_1, ..., Z_n and a corresponding f. For this realization,

𝛽 ≤ P_Y^∗(|∑_{i=1}^n Y_i(f)| < x/2) ≤ P_Y^∗(|∑_{i=1}^n Y_i(f) − ∑_{i=1}^n Z_i(f)| > x/2) ≤ P_Y^∗(||∑_{i=1}^n (Y_i − Z_i)||_ℱ > x/2).

The far left and far right sides do not depend on the particular f, and the inequality between them is valid on the set

{||∑_{i=1}^n Z_i||_ℱ > x}.

Integrate both sides with respect to Z_1, ..., Z_n over this set to obtain

𝛽 P^∗(||∑_{i=1}^n Z_i||_ℱ > x) ≤ P_Z^∗ P_Y^∗(||∑_{i=1}^n (Y_i − Z_i)||_ℱ > x/2).

By symmetry, the RHS equals

E_𝜖 P_Z^∗ P_Y^∗(||∑_{i=1}^n 𝜖_i(Y_i − Z_i)||_ℱ > x/2).
In view of the triangle inequality, this is less than or equal to

2P^∗(||∑_{i=1}^n 𝜖_i(Z_i − 𝜇_i)||_ℱ > x/4).
Assuming i.i.d. Z_1, ..., Z_n, one can derive the last expression for 𝛽_n(x) from Chebyshev's inequality.

Proposition 6.5.3: Let 0 < p < ∞ and X_1, ..., X_n be independent stochastic processes indexed by T. Then there exist constants C_p and 0 < 𝜇_p < 1 such that

E^∗ max_{k≤n} ||S_k||^p ≤ C_p (E^∗ max_{k≤n} ||X_k||^p + F^{−1}(𝜇_p)^p),

where F^{−1} is the quantile function of max_{k≤n} ||S_k||^∗. If X_1, ..., X_n are symmetric, then there exist K_p and 0 < 𝜐_p < 1 such that

E^∗||S_n||^p ≤ K_p (E^∗ max_{k≤n} ||X_k||^p + G^{−1}(𝜐_p)^p),

where G^{−1} is the quantile function of ||S_n||^∗. For each p ≥ 1, the last inequality has a version for mean-zero processes.

Proof: This is based on the following fact due to Hoffmann-Jørgensen.

Fact: Let X_1, ..., X_n be independent stochastic processes indexed by an arbitrary set. Then for 𝜆, 𝜂 > 0,

(1) P^∗(max_{k≤n} ||S_k|| > 3𝜆 + 𝜂) ≤ P^∗(max_{k≤n} ||S_k|| > 𝜆)^2 + P^∗(max_{k≤n} ||X_k|| > 𝜂).
If X_1, ..., X_n are independent symmetric, then

(2) P^∗(||S_n|| > 2𝜆 + 𝜂) ≤ 4P^∗(||S_n|| > 𝜆)^2 + P^∗(max_{k≤n} ||X_k|| > 𝜂).
k≤n
∞
∞
≤ (4t)p + 4p ∫ P∗ (||S k || > t)2 d(t p ) + 4p ∫ P∗ (max ||X k ||∗ > td(t p ) k≤n
t
t
≤ (4t)p + 4p P∗ (max ||S k || > t)E∗ max ||S n ||p + 4p E∗ max ||X k ||p . k≤n
k≤n
k≤n
(6.6)
136 | 6 Empirical process
Choose t such that 4p P∗ (maxk≤n ||S k || > t) < 1/2 we get the inequality. Similar arguments using inequality (2) gives the second inequality. The inequality for mean-zero processes follows using symmetrization using the inequality above. Then one can using desymmetrization as by Jensen inequality E∗ ||S n ||p is bounded by E∗ (||S n −T n ||p ) if T n is sum of n independent copies of X1 , ...X n . For each n, let Z1 , ...Z n , m n be independent stochastic processes indexed by a common semi-metric space (ℱ , 𝒫). These processes are defined on a product probability space of dimension m n . Define random semi-metric mn
d2n (f , g) = ∑(Z ni (f ) − Z ni (g))2 . i=1
The condition below in Theorem 6.5.1 is random entropy condition. Theorem 6.5.1 (Lindberg): For each n, let Z1 , ...Z n , m n be independent stochastic processes indexed by (ℱ , 𝒫) (a totally bounded semi-metric space). Assume mn
(a) ∑ E∗ (||Z ni ||2ℱ {||Z ni ||ℱ > 𝜂) → 0 for each 𝜂 > 0 1
(b) For each (f , g)𝜖, ℱ ⊗ ℱ mn
∑ E(Z ni (f ), Z ni (g)) → C(f , g) finite 1 n
sup ∑ E(Z ni (f ) − Z ni (g))2 → for 𝛿n ↓ 0
(c)
p(f ,g) t/2) < 1/2 i=1
for f , g with P(f , g) < 𝛿n . By Lemma 6.8, for sufficiently large n, mn
P^∗(sup_{𝜌(f,g)<𝛿_n} |∑_{i=1}^{m_n}(Z_{ni}^0(f) − Z_{ni}^0(g))| > t) ≤ 4P^∗(4||∑_{i=1}^{m_n} 𝜖_i(Z_{ni}(f) − Z_{ni}(g))||_{A_n} > t),

where A_n denotes the set of sequences a = (a_i) with a_i = Z_{ni}(f) − Z_{ni}(g) and 𝜌(f, g) < 𝛿_n. Conditionally on Z_{n1}, ..., Z_{nm_n}, the symmetrized process is sub-Gaussian for the random semi-metric d_n, so by Corollary 6.2 and the entropy condition (d) it suffices to show that Θ_n = ||∑_{i=1}^{m_n} a_i^2||_{A_n} → 0 in outer probability. Take 𝜂_n ↓ 0 sufficiently slowly; then condition (a) gives ||∑_{i=1}^{m_n} a_i^2 1{||Z_{ni}||_ℱ > 𝜂_n}||_{A_n} → 0 in P^∗.
Thus, without loss of generality, we can assume ||Z_{ni}||_ℱ ≤ 𝜂_n to show Θ_n → 0. Fix Z_{n1}, ..., Z_{nm_n} and take an 𝜖-net B_n for A_n in the Euclidean norm. For every a ∈ A_n, there is a b ∈ B_n with

|∑_i 𝜖_i a_i^2| = |∑_i 𝜖_i(a_i − b_i)^2 + 2∑_i (a_i − b_i)𝜖_i b_i + ∑_i 𝜖_i b_i^2| ≤ 𝜖^2 + 2𝜖||b|| + |∑_i 𝜖_i b_i^2|.

By Hoeffding's inequality, ∑_i 𝜖_i b_i^2 has Orlicz norm for 𝜓_2 bounded by a multiple of (∑_i b_i^4)^{1/2} ≤ 𝜂_n(∑_i b_i^2)^{1/2}. Apply Lemma 6.7 to the third term on the right and substitute the sup over A_n for the sup over B_n:

E_𝜖 ||∑_i 𝜖_i a_i^2||_{A_n} ≤ 𝜖^2 + 2𝜖||∑_i a_i^2||_{A_n}^{1/2} + √(1 + log |B_n|) 𝜂_n ||∑_i a_i^2||_{A_n}^{1/2}.
The size of the 𝜖-net can be chosen |B_n| ≤ N^2(𝜖/2, ℱ, d_n); by entropy condition (d), this variable is bounded in probability. Conclude that, for some constant K,

P(||∑_i 𝜖_i a_i^2||_{A_n} > t) ≤ P^∗(|B_n| > M) + (K/t)[𝜖^2 + (𝜖 + 𝜂_n √(log M)) E||∑_i a_i^2||_{A_n}^{1/2}].

For M, t sufficiently large, the RHS is smaller than 1 − 𝜐_1, for the constant 𝜐_1 in Proposition 6.5.3. More precisely, this can be achieved for M such that P^∗(|B_n| > M) ≤ (1 − 𝜐_1)/2 and t equal to (1 − 𝜐_1) times the numerator of the second term. Then t is bigger than the 𝜐_1-quantile of ||∑_i 𝜖_i a_i^2||_{A_n}. Thus, Proposition 6.5.3 gives

E||∑_i 𝜖_i a_i^2||_{A_n} ≲ E max_i ||a_i^2||_{A_n} + t ≲ 𝜂_n^2 + 𝜖^2 + (𝜖 + 𝜂_n √(log M))(E||∑_i a_i^2||_{A_n})^{1/2}.

Now (c) gives ||∑_i Ea_i^2||_{A_n} → 0. Combining this with Lemma 6.5.1, we get

E||∑_i a_i^2||_{A_n} ≤ E||∑_i 𝜖_i a_i^2||_{A_n} + o(1) ≤ 𝛿 + 𝛿(E||∑_i a_i^2||_{A_n})^{1/2}

for 𝛿 > max(𝜖^2, 𝜖) and sufficiently large n. Now, for c ≥ 0, c ≤ 𝛿 + 𝛿√c implies c ≤ (𝛿 + √(𝛿^2 + 4𝛿))^2. Applying this to c = E||∑_i a_i^2||_{A_n}, conclude that E||∑_i a_i^2||_{A_n} → 0 as n → ∞.

Example 1: One can take X_1, ..., X_n i.i.d. random variables and Z_{ni}(f) = n^{−1/2} f(X_i), with f measurable functions on the sample space. From the above theorem, we get that ℱ is Donsker if it is measurable and totally bounded in L_2(P), possesses a square integrable envelope, and satisfies, with P_n equal to the empirical distribution of X_1, ..., X_n,

∫_0^{𝛿_n} √(log N(𝜖, ℱ, L_2(P_n))) d𝜖 →_P 0.
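The random entropy condition in this example can be probed numerically: compute N(𝜖, ℱ, L_2(P_n)) with a greedy net over a finite subclass and form a crude Riemann estimate of the integral. A sketch (half-line indicators; the grid, sample size, and 𝛿_n are assumed choices, and the finite grid caps the entropy at the log of the grid size):

```python
import numpy as np

rng = np.random.default_rng(8)

def covering_number(fvals, eps):
    """Greedy net over the rows of `fvals` in the empirical L2 norm;
    the size of the net upper-bounds N(eps, F, L2(P_n))."""
    centers = []
    for row in fvals:
        if not any(np.sqrt(np.mean((row - c) ** 2)) <= eps for c in centers):
            centers.append(row)
    return max(len(centers), 1)

n = 2000
x = rng.uniform(size=n)
t_grid = np.linspace(0.0, 1.0, 201)
fvals = (x[None, :] <= t_grid[:, None]).astype(float)  # row t: f_t(X_1..X_n)

delta = 0.25                                  # plays the role of delta_n
eps_grid = np.linspace(0.02, delta, 24)
integrand = [np.sqrt(np.log(covering_number(fvals, e))) for e in eps_grid]
print(np.mean(integrand) * delta)             # crude estimate of the integral
```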
If ℱ satisfies the uniform entropy condition (6.5), then it satisfies this random entropy condition. Thus, Theorem 6.10 is a consequence of the above theorem.

Example 2: The Lindeberg condition on norms is not necessary for the CLT. In combination with truncation, the preceding theorem applies to more general processes. Choose stochastic processes Z_{n1}, ..., Z_{nm_n} with

∑_{i=1}^{m_n} P(||Z_{ni}||_ℱ > 𝜂) → 0 for 𝜂 > 0.

Then the truncated processes Z_{ni,𝜂}(f) = Z_{ni}(f) 1{||Z_{ni}||_ℱ ≤ 𝜂} satisfy
∑_{i=1}^{m_n} Z_{ni} − ∑_{i=1}^{m_n} Z_{ni,𝜂} →_P 0 in l^∞(ℱ).

Since this is true for every 𝜂 > 0, it is also true for 𝜂_n ↓ 0 sufficiently slowly. The processes Z_{ni,𝜂_n} satisfy the Lindeberg condition. If they or their centered versions satisfy the other conditions of Theorem 6.5.1, then the sequence

∑_{i=1}^{m_n} (Z_{ni} − EZ_{ni,𝜂_n})
converges weakly in l^∞(ℱ). The random semi-metrics d_n decrease by truncation. Thus, one can get the result for truncated processes under weaker conditions.

We now define measure-like processes. In the previous theorem, consider the index set ℱ as a set of measurable functions f : 𝒳 → R on a measurable space (𝒳, 𝒜), and take the distance in L_2(Q), Q a finite measure. Assume

(1) ∫_0^∞ sup_Q √(log N(𝜖||F||_{Q,2}, ℱ, L_2(Q))) d𝜖 < ∞.

Then the preceding theorem yields a central limit theorem for processes with increments that are bounded by a random L_2-metric on ℱ. We call Z_{ni} measure-like with respect to random measures 𝜇_{ni} if

(Z_{ni}(f) − Z_{ni}(g))^2 ≤ ∫(f − g)^2 d𝜇_{ni},   f, g ∈ ℱ.

For measure-like processes, d_n is bounded by the L_2(∑ 𝜇_{ni})-semi-metric, and the entropy condition there can be related to the uniform entropy condition.

Lemma 6.5.3: Let ℱ be a class of measurable functions with envelope F. Let Z_{n1}, ..., Z_{nm_n} be measure-like processes indexed by ℱ. If ℱ satisfies the uniform entropy condition (1) for a set Q that
contains the measures 𝜇_{ni}, and if ∑_{i=1}^{m_n} ∫ F^2 d𝜇_{ni} = O_P^∗(1), then entropy condition (d) of Theorem 6.5.1 is satisfied.

Proof: Let 𝜇_n = ∑_{i=1}^{m_n} 𝜇_{ni}. Since d_n is bounded by the L_2(𝜇_n)-semi-metric, we get
(∗) ∫_0^{𝛿_n} √(log N(𝜖, ℱ, d_n)) d𝜖 ≤ ∫_0^{𝛿_n/||F||_{𝜇_n}} √(log N(𝜖||F||_{𝜇_n}, ℱ, L_2(𝜇_n))) d𝜖 ⋅ ||F||_{𝜇_n}
on the set where ||F||_{𝜇_n}^2 < ∞. Denote by

J(𝛿) = ∫_0^𝛿 sup_Q √(log N(𝜖||F||_{Q,2}, ℱ, L_2(Q))) d𝜖.
On the set where ||F||_{𝜇_n} > 𝜂, the RHS of equation (∗) is bounded by J(𝛿_n/𝜂) O_P(1), which converges to zero for each 𝜂 > 0. On the set where ||F||_{𝜇_n} ≤ 𝜂, we have the bound J(∞)𝜂; as 𝜂 is arbitrary, we get the result.

Example 1: Suppose the independent processes Z_{n1}, ..., Z_{nm_n} are measure-like. Let ℱ satisfy the uniform entropy condition. Assume, for some probability measure P with ∫ F^2 dP^∗ < ∞,

E^∗ ∑_{i=1}^{m_n} ∫ F^2 d𝜇_{ni} 1{∫ F^2 d𝜇_{ni} > 𝜂} → 0 as n → ∞

and
sup ||f −g||p,2