Universitext

Series Editors: Nathanaël Berestycki (Universität Wien, Vienna, Austria), Carles Casacuberta (Universitat de Barcelona, Barcelona, Spain), John Greenlees (University of Warwick, Coventry, UK), Angus MacIntyre (Queen Mary University of London, London, UK), Claude Sabbah (École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau, France), Endre Süli (University of Oxford, Oxford, UK)
Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal, or even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, into very polished texts. Thus as research topics trickle down into graduate-level teaching, first textbooks written for new, cutting-edge courses may find their way into Universitext.
Paolo Baldi
Probability An Introduction Through Theory and Exercises
Paolo Baldi Dipartimento di Matematica Università di Roma Tor Vergata Roma, Italy
ISSN 0172-5939 ISSN 2191-6675 (electronic) Universitext ISBN 978-3-031-38491-2 ISBN 978-3-031-38492-9 (eBook) https://doi.org/10.1007/978-3-031-38492-9 Mathematics Subject Classification: 60-XX © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable
Preface
This book is based on a one-semester basic course on probability with measure theory for students in mathematics at the University of Roma "Tor Vergata". The main objective is to provide the necessary notions required for more advanced courses in this area (stochastic processes and statistics, mainly) that students might attend later. This explains some choices:
• Random elements in spaces more general than the finite dimensional Euclidean spaces are considered: in the future the student might be led to consider r.v.'s with values in Banach spaces, the sphere, a group of rotations. . .
• Some classical finer topics (e.g. around the Law of Large Numbers and the Central Limit Theorem) are omitted. This has made it possible to devote some time to other topics more essential to the objective indicated above (e.g. martingales).
It is assumed that students
• Are already familiar with elementary notions of probability and in particular know the classical distributions and their use
• Are acquainted with the manipulations of basic calculus and linear algebra and the main definitions of topology
• Already know measure theory or are following simultaneously a course on measure theory
The book consists of six chapters and an additional chapter of solutions to the exercises. The first is a recollection of the main topics of measure theory. Here "recollection" means that only the more significant proofs are given, skipping the more technical points, the important thing being to become comfortable with the tools and the typical ways of reasoning of this theory. The second chapter develops the main core of probability theory: independence, laws and the computations thereof, characteristic functions and the complex Laplace transform, multidimensional Gaussian distributions. The third chapter concerns convergence, the fourth is about conditional expectations and distributions and the fifth is about martingales.
Chapters 1 to 5 can be covered in a 64-hour course with some time included for exercises. The sixth chapter develops two subjects that regretfully did not fit into the time schedule above: simulation and tightness (the last one without proofs). Most of the material is, of course, classical and appears in many of the very good textbooks already available. However, the present book also includes some topics that, in my experience, are important in view of future study and which are seldom developed elsewhere: the behavior of Gaussian laws and r.v.’s concerning convergence (Sect. 3.7) and conditioning (Sect. 4.4), quadratic functionals of Gaussian r.v.’s (Sects. 2.9 and 3.9) and the complex Laplace transform (Sect. 2.7), which is of constant use in stochastic calculus and the gateway to changes of probability. Particular attention is devoted to the exercises: detailed solutions are provided for all of them in the final chapter, possibly making these notes useful for self study. In the preparation of this book, I am indebted to B. Pacchiarotti and L. Caramellino, of my University, whose lists of exercises have been an important source, and P. Priouret, who helped clarify a few notions that were a bit misty in my head. Roma, Italy April 2023
Paolo Baldi
Contents
1 Elements of Measure Theory  1
  1.1 Measurable Spaces, Measurable Functions  1
  1.2 Real Measurable Functions  4
  1.3 Measures  7
  1.4 Integration  12
  1.5 Important Examples  23
  1.6 Lp Spaces  26
  1.7 Product Spaces, Product Measures  28
  Exercises  35

2 Probability  41
  2.1 Random Variables, Laws, Expectation  41
  2.2 Independence  44
  2.3 Computation of Laws  53
  2.4 A Convexity Inequality: Jensen  60
  2.5 Moments, Variance, Covariance  63
  2.6 Characteristic Functions  69
  2.7 The Laplace Transform  82
  2.8 Multivariate Gaussian Laws  86
  2.9 Quadratic Functionals of Gaussian r.v.'s, a Bit of Statistics  90
  Exercises  98

3 Convergence  115
  3.1 Convergence of r.v.'s  115
  3.2 Almost Sure Convergence and the Borel-Cantelli Lemma  117
  3.3 Strong Laws of Large Numbers  124
  3.4 Weak Convergence of Measures  129
  3.5 Convergence in Law  140
  3.6 Uniform Integrability  144
  3.7 Convergence in a Gaussian World  148
  3.8 The Central Limit Theorem  152
  3.9 Application: Pearson's Theorem, the χ² Test  154
  3.10 Some Useful Complements  162
  Exercises  167

4 Conditioning  177
  4.1 Introduction  177
  4.2 Conditional Expectation  178
  4.3 Conditional Laws  188
  4.4 The Conditional Laws of Gaussian Vectors  196
  Exercises  198

5 Martingales  205
  5.1 Stochastic Processes  205
  5.2 Martingales: Definitions and General Facts  206
  5.3 Doob's Decomposition  208
  5.4 Stopping Times  209
  5.5 The Stopping Theorem  213
  5.6 Almost Sure Convergence  216
  5.7 Doob's Inequality and Lp Convergence, p > 1  222
  5.8 L1 Convergence, Regularity  224
  Exercises  231

6 Complements  241
  6.1 Random Number Generation, Simulation  241
  6.2 Tightness and the Topology of Weak Convergence  252
  6.3 Applications  255
  Exercises  258

7 Solutions  261

References  385
Index  387
Notation
Real, Complex Numbers, R^m
x ∨ y = max(x, y), the largest of the real numbers x and y
x ∧ y = min(x, y), the smallest of the real numbers x and y
⟨x, y⟩ the scalar product of x, y ∈ R^m or x, y ∈ C^m
x⁺, x⁻ the positive and negative parts of x ∈ R: x⁺ = max(x, 0), x⁻ = max(−x, 0)
|x| according to the context, the absolute value of the real number x, the modulus of the complex number x or the norm of the vector x
ℜz, ℑz the real and imaginary parts of z ∈ C
B_R(x) = {y ∈ R^m, |y − x| < R}, the open ball centered at x with radius R
A*, tr A, det A the transpose, trace, determinant of the matrix A

Functional Spaces
M_b(E) the real bounded measurable functions on the topological space E
‖f‖_∞ the sup norm = sup_{x∈E} |f(x)| if f ∈ M_b(E)
C_b(E) the Banach space of real bounded continuous functions on the topological space E endowed with the norm ‖ ‖_∞
C_0(E) the subspace of C_b(E) of the functions f vanishing at infinity, i.e. such that for every ε > 0 there exists a compact set K_ε such that |f| ≤ ε outside K_ε
C_K(E) the subspace of C_b(E) of the continuous functions with compact support. It is dense in C_0(E)
To be Precise

Throughout this book, "positive" means ≥ 0, "strictly positive" means > 0. Similarly "increasing" means ≥, "strictly increasing" >.
Chapter 1
Elements of Measure Theory
The building block of probability is the triple .(Ω, F, P), where . F is a .σ -algebra of subsets of a set .Ω and .P a probability. This is the typical setting of measure theory. In this first chapter we shall peruse the main points of this theory. We shall skip the more technical proofs and focus instead on the results, their use and the typical ways of reasoning. In the next chapters we shall see how measure theory allows us to deal with many, often difficult, problems in probability. For more information concerning measure theory in view of probability and of further study see in the references the books [3], [5], [11], [12], [17], [19], [24], [20].
1.1 Measurable Spaces, Measurable Functions

Let E be a set and E a family of subsets of E.

E is a σ-algebra (resp. an algebra) if
• E ∈ E,
• E is stable with respect to set complementation,
• E is stable with respect to countable (resp. finite) unions.
This means that if A ∈ E then also A^c ∈ E and that if (A_n)_n ⊂ E then also ⋃_n A_n ∈ E. Of course ∅ = E^c ∈ E. Actually a σ-algebra is also stable with respect to countable intersections: if (A_n)_n ⊂ E then we can write

  ⋂_{n=1}^∞ A_n = ( ⋃_{n=1}^∞ A_n^c )^c ,
so that also ⋂_{n=1}^∞ A_n ∈ E.
A pair (E, E), where E is a σ-algebra on E, is a measurable space.
Of course the family P(E) of all subsets of E is a σ-algebra and it is immediate that the intersection of any family of σ-algebras is a σ-algebra. Hence, given a class of sets C ⊂ P(E), we can consider the smallest σ-algebra containing C: it is the intersection of all σ-algebras containing C (such a family is non-empty as certainly P(E) belongs to it). It is the σ-algebra generated by C, denoted σ(C).
Definition 1.1 A monotone class is a family M of subsets of E such that
• E ∈ M,
• M is stable with respect to relative complementation, i.e. if A, B ∈ M and A ⊂ B, then B \ A ∈ M,
• M is stable with respect to increasing limits: if (A_n)_n ⊂ M is an increasing sequence of sets, then A = ⋃_n A_n ∈ M.
Note that a σ-algebra is a monotone class. Actually, if E is a σ-algebra, then
• if A, B ∈ E and A ⊂ B, then A^c ∈ E hence also B \ A = B ∩ A^c ∈ E;
• if (A_n)_n ⊂ E then A = ⋃_n A_n ∈ E, whether the sequence is increasing or not.
On the other hand, to prove that a family of sets is a monotone class may turn out to be easier than to prove that it is a σ-algebra. For this reason the next result will be useful in the sequel (for a proof, see e.g. [24], p. 39).
Theorem 1.2 (The Monotone Class Theorem) Let . C ⊂ P(E) be a family of sets that is stable with respect to finite intersections and let .M be a monotone class containing . C. Then .M also contains .σ ( C).
Note that in the literature the definition of “monotone class” may be different and the statement of Theorem 1.2 modified accordingly (see e.g. [2], p. 43). The next definition introduces an important class of .σ -algebras.
Definition 1.3 Let E be a topological space and .O the class of all open sets of E. The .σ -algebra .σ (O) (i.e. the smallest one containing all open sets) is the Borel .σ -algebra of E, denoted .B(E).
Of course .B(E) is also the smallest .σ -algebra containing all closed sets. Actually the latter also contains all open sets, which are the complements of closed sets,
hence also contains .B(E), that is the smallest .σ -algebra containing the open sets. By the same argument (closed sets are the complements of open sets) the .σ -algebra generated by all closed sets is contained in .B(E) hence the two .σ -algebras coincide. If E is a separable metric space, then .B(E) is also generated by smaller families of sets. Example 1.4 Assume that E is a separable metric space and let .D ⊂ E be a dense subset. Then the Borel .σ -algebra .B(E) is also generated by the family . D of the balls centered at D with rational radius. Actually every open set is a countable union of these balls. Hence .B(E) ⊂ σ ( D) and, as the opposite inclusion is obvious, .B(E) = σ ( D).
Let .(E, E) and .(G, G) be measurable spaces. A map .f : E → G is said to be measurable if, for every .A ∈ G, .f −1 (A) ∈ E.
It is immediate that if g is measurable from (E, E) to (G, G) and h is measurable from (G, G) to (H, H) then h ◦ g is measurable from (E, E) to (H, H).
Remark 1.5 (A very useful criterion) In order for f to be measurable it suffices to have f^{-1}(A) ∈ E for every A ∈ C, where C ⊂ G is such that σ(C) = G. Indeed the class of the sets A ⊂ G such that f^{-1}(A) ∈ E is a σ-algebra, thanks to the easy relations

  ⋃_{n=1}^∞ f^{-1}(A_n) = f^{-1}( ⋃_{n=1}^∞ A_n ) ,   f^{-1}(A)^c = f^{-1}(A^c) .   (1.1)

As this class contains C, it also contains the whole σ-algebra G that is generated by C. Therefore f^{-1}(A) ∈ E also for every A ∈ G.
The criterion of Remark 1.5 is very useful as often one knows explicitly the sets of a class . C generating . G, but not those of . G. For instance, if . G is the Borel .σ -algebra of a topological space G, in order to establish the measurability of f it is sufficient, for instance, to check that .f −1 (A) ∈ E for every open set A.
In particular, if E, G are topological spaces, a continuous map .f : E → G is measurable with respect to the respective Borel .σ -algebras.
1.2 Real Measurable Functions

If the target space is R, R̄, R̄⁺, R^d, C, we shall always understand that it is endowed with the respective Borel σ-algebra. Here R̄ = R ∪ {+∞, −∞} and R̄⁺ = R⁺ ∪ {+∞}.
Let (E, E) be a measurable space. In order for a numerical (i.e. R̄-valued) map f to be measurable it is sufficient to have, for every a ∈ R, {f > a} = {x, f(x) > a} = f^{-1}(]a, +∞]) ∈ E, as the sets of the form ]a, +∞] generate the Borel σ-algebra (Exercise 1.2) and we can apply the criterion of Remark 1.5. Generating families of sets are also those of the form {f < a}, {f ≤ a}, {f ≥ a} (see Exercise 1.2).
Many natural operations are possible on numerical measurable functions. Are linear combinations, products, limits . . . of measurable functions still measurable? These properties are easily proved: for instance if (f_n)_n is a sequence of measurable numerical functions and h = sup_n f_n, then, for every a ∈ R, the sets {f_n ≤ a} = f_n^{-1}([−∞, a]) are measurable and
  {h ≤ a} = ⋂_{n=1}^∞ {f_n ≤ a} ,
hence {h ≤ a} is measurable, being the countable intersection of measurable sets. Similarly, if g = inf_n f_n, then

  {g ≥ a} = ⋂_{n=1}^∞ {f_n ≥ a} ,
hence {g ≥ a} is also measurable.
Recall that

  limsup_{n→∞} f_n(x) = lim_{n→∞} ↓ sup_{k≥n} f_k(x) ,   liminf_{n→∞} f_n(x) = lim_{n→∞} ↑ inf_{k≥n} f_k(x) ,   (1.2)
where these quantities are R̄-valued. If the f_n are measurable, then also limsup_{n→∞} f_n, liminf_{n→∞} f_n, lim_{n→∞} f_n (if it exists) are measurable: actually, for the limsup for instance, the functions g_n = sup_{k≥n} f_k are measurable, being the supremum of measurable functions, and then also limsup_{n→∞} f_n, being the infimum of the g_n.
As a consequence, if .(fn )n is a sequence of measurable real functions and fn →n→∞ f pointwise, then f is measurable. This is true also for sequences of measurable functions with values in a separable metric space, see Exercise 1.6. The same argument gives that if .f, g : E → R are measurable then also .f ∨ g and .f ∧ g are measurable. In particular
  f⁺ = f ∨ 0   and   f⁻ = (−f) ∨ 0

are measurable functions. f⁺ and f⁻ are the positive and negative parts of f and we have

  f = f⁺ − f⁻ ,   |f| = f⁺ + f⁻ .

Note that both f⁺ and f⁻ are positive functions.
Let f_1, f_2 be real measurable maps defined on the measurable space (E, E). Then the map f = (f_1, f_2) is measurable with values in (R², B(R²)). Indeed, if A_1, A_2 ∈ B(R), then f^{-1}(A_1 × A_2) = f_1^{-1}(A_1) ∩ f_2^{-1}(A_2) ∈ E. Moreover, it is easy to prove, with the argument of Example 1.4, that every open set of R² is a countable union of rectangles A_1 × A_2, so that they generate B(R²) and we can apply the criterion of Remark 1.5.
As (x, y) → x + y is a continuous map R² → R, it is also measurable. It follows that f_1 + f_2 is also measurable, being the composition of the measurable maps f = (f_1, f_2) and (x, y) → x + y. In the same way one can prove the measurability of the maps f_1 f_2 and f_1/f_2 (if defined). Similar results hold for numerical functions f_1 and f_2, provided that we ensure that indeterminate forms such as +∞ − ∞, 0/0, ∞/∞ . . . are not possible.
As these examples suggest, in order to prove the measurability of a real function f , one will seldom try to use the definition, but rather apply the criterion of Remark 1.5, investigating .f −1 (A) for A in a class of sets generating the .σ -algebra of the target space, or, for real or numerical functions, writing f as the sum, product, limit, . . . of measurable functions.
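For a quick illustration in this spirit (ours, not the book's): if f, g : E → R are measurable, then the set {f = g} = (f − g)^{-1}({0}) is measurable, since f − g is measurable and {0} is a Borel set; no verification directly on the definition of measurability is needed.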
If A ⊂ E, the indicator function of A, denoted 1_A, is the function that takes the value 1 on A and 0 on A^c. We have the obvious relations

  1_{A^c} = 1 − 1_A ,   1_{⋂_n A_n} = ∏_n 1_{A_n} = inf_n 1_{A_n} ,   1_{⋃_n A_n} = sup_n 1_{A_n} .

It is immediate that, if A ∈ E, then 1_A is measurable. A function f : (E, E) → R is elementary if it is of the form f = ∑_{k=1}^n a_k 1_{A_k} with A_k ∈ E and a_k ∈ R.
The following result is fundamental, as it allows us to approximate every positive measurable function with elementary functions. It will be of constant use.
Proposition 1.6 Every positive measurable function f is the limit of an increasing sequence of elementary positive functions.
Proof Just consider

  f_n(x) = ∑_{k=0}^{n2^n − 1} (k/2^n) 1_{{x; k/2^n ≤ f(x) < (k+1)/2^n}}(x) + n 1_{{f(x) ≥ n}} ,   (1.3)

i.e.

  f_n(x) = k/2^n   if f(x) < n and k/2^n ≤ f(x) < (k+1)/2^n ,
  f_n(x) = n       if f(x) ≥ n .

Clearly the sequence (f_n)_n is increasing. Moreover, as f(x) − 1/2^n ≤ f_n(x) ≤ f(x) if f(x) < n, f_n(x) →_{n→∞} f(x).
.
It is easy to check that the family . Ef = {f −1 (A), A ∈ G} is a .σ -algebra of subsets of E (use the relations (1.1)). Hence it coincides with .σ (f ). Actually .σ (f ) must contain every set of the form .f −1 (A), so that .σ (f ) ⊃ Ef . Conversely, f is obviously measurable with respect to . Ef , so that . Ef must contain .σ (f ), which, by definition, is the smallest .σ -algebra enjoying this property.
More generally, if .(fi , i ∈ I ) is a family of maps on E with values respectively in the measurable spaces .(Gi , Gi ), we denote by .σ (fi , i ∈ I ) the smallest .σ -algebra on E with respect to which all the .fi are measurable. We shall call .σ (fi , i ∈ I ) the .σ -algebra generated by the .fi .
Proposition 1.7 (Doob’s Criterion) Let f be a map from E to some measurable space .(G, G) and let .h : E → R. Then h is .σ (f )-measurable if and only if there exists a . G-measurable function .g : G → R such that .h = g ◦ f (see Fig. 1.1).
1.3 Measures
7 f E
................................................
G
..... ... ..... .. ..... ..... ... ..... .. ..... ..... ..... ... ..... ..... ......... ......... ... .....
h
g
Fig. 1.1 Proposition 1.7 states the existence of a g such that .h = g ◦ f
Proof Of course if .h = g ◦ f with g measurable, then h is .σ (f )-measurable, being the composition of measurable maps. Conversely, let us assume first that
hn is .σ (f )-measurable, positive and elementary. Then h is of the form .h = k=1 ak 1Bk with .Bk ∈ σ (f ) and therefore −1 (A ) for some .A ∈ G. As .1 .Bk = f k k Bk = 1f −1 (Ak ) = 1Ak ◦ f , we can write
n .h = g ◦ f with .g = a 1 . k=1 k Ak Let us drop the assumption that h is elementary and assume h positive and .σ (f )-measurable. Then .h = limn→∞ ↑ hn for an increasing sequence .(hn )n of elementary positive functions (Proposition 1.6). Thanks to the first part of the proof, .hn is of the form .hn = gn ◦ f with .gn positive and . G-measurable. We deduce that .h = g ◦ f with .g = limn→∞ gn , which is a positive . G-measurable function. Let now h be .σ (f )-measurable (not necessarily positive). It can be decomposed into the difference of its positive and negative parts, .h = h+ −h− , and we know that we can write .h+ = g + ◦ f , .h− = g − ◦ f for some positive . G-measurable functions + − −1 (A ) .g , g . The function h being .σ (f )-measurable, we have .{h ≥ 0} = f 1 −1 and .{h < 0} = f (A2 ) for some .A1 , A2 ∈ G. Therefore .h = g ◦ f with .g = g + 1A1 − g − 1A2 . There is no danger of encountering a form .+∞ − ∞ as the sets .A1 , A2 are disjoint.
1.3 Measures Let .(E, E) be a measurable space. +
Definition 1.8 A measure on .(E, E) is a map .μ : E → R (it can also take the value .+∞) such that (a) .μ(∅) = 0, (b) for every sequence .(An )n ⊂ E of pairwise disjoint sets ∞ μ An = μ(An ) .
.
n≥1
n=1
The triple .(E, E, μ) is a measure space.
8
1 Elements of Measure Theory
Some terminology. • If .E = n En for .En ∈ E with .μ(En ) < +∞, .μ is said to be .σ -finite. • If .μ(E) < +∞, .μ is said to be finite. • If .μ(E) = 1, .μ is a probability, or also a probability measure.
As we shall see, the assumption that .μ is .σ -finite will be necessary in most statements.
Remark 1.9 Property (b) of Definition 1.8 is called .σ -additivity. If in Definition 1.8 we assume that . E is only an algebra and we add to (b) the condition that . n An ∈ E, then we have the notion of a measure on an algebra.
Remark 1.10 (A Few Properties of a Measure as a Consequence of the Definition) (a) If .A, B ∈ E and .A ⊂ B, then .μ(A) ≤ μ(B). Indeed A and c .B ∩ A are disjoint measurable sets and their union is equal to B. Therefore c .μ(B) = μ(A)+ .μ(B ∩ A ) ≥ μ(A). (b) If .(An )n ⊂ E is a sequence of measurable sets increasing to A, i.e. such that .An ⊂ An+1 and .A = n An , then .μ(An ) ↑ μ(A) as .n → ∞. Indeed let .B1 = A1 and recursively define .Bn = An \ An−1 . The .Bn are pairwise disjoint (.Bn−1 ⊂ An−1 whereas .Bn ⊂ Acn−1 ) and, clearly, .B1 ∪ · · · ∪ Bn = An . Hence A=
∞
.
n=1
An =
∞
Bn
n=1
and, as the .Bn are pairwise disjoint, ∞ ∞ n μ(A) = μ Bk = μ(Bk ) = lim μ(Bk ) = lim μ(An ) .
.
k=1
k=1
n→∞
k=1
n→∞
(c) If .(An )n ⊂ E isa sequence of measurable sets decreasing to A (i.e. such that .An+1 ⊂ An and . ∞ n=1 An = A) and if, for some .n0 , .μ(An0 ) < +∞, then .μ(An ) ↓ μ(A) as .n → ∞.
1.3 Measures
9
Indeed we have .An0 \ An ↑ An0 \ A as .n → ∞. Hence, using the result of (b) on the increasing sequence .(An0 \ An )n , ↓
μ(A) = μ(An0 ) − μ(An0 \ A) = μ(An0 ) − lim μ(An0 \ An ) n→∞ = lim μ(An0 ) − μ(An0 \ An ) = lim μ(An )
.
n→∞
n→∞
(.↓ denotes the equality where the assumption .μ(An0 ) < +∞ is necessary). In general, a measure does not necessarily pass to the limit along decreasing sequences of events (we shall see examples). Note, however, that the condition .μ(An0 ) < +∞ for some .n0 is always satisfied if .μ is finite. The next, very important, statement says that if two measures coincide on a class of sets that is large enough, then they coincide on the whole generated .σ -algebra.
Proposition 1.11 (Carathéodory’s Criterion) Let .μ, .ν be measures on the measurable space .(E, E) and let . C ⊂ E be a class of sets which is stable with respect to finite intersections and such that .σ ( C) = E. Assume that • for every .A ∈ C, .μ(A) = ν(A); • there exists an increasing sequence of sets .(En )n ⊂ C such that .E = n En and .μ(En ) < +∞ (hence also .ν(En ) < +∞) for every n. Then .μ(A) = ν(A) for every .A ∈ E.
Proof Let us assume first that .μ and .ν are finite. Let .M = {A ∈ E, μ(A) = ν(A)} (the family of sets of . E on which the two measures coincide) and let us check that . M is a monotone class. We have • .μ(E) = limn→∞ μ(En ) = limn→∞ ν(En ) = ν(E), so that .E ∈ M. • If .A, B ∈ M and .A ⊂ B then, as A and .B \ A are disjoint sets and their union is equal to B, ↓
μ(B \ A) = μ(B) − μ(A) = ν(B) − ν(A) = ν(B \ A)
.
and therefore .M is stable with respect to relative complementation (.↓ : here we use the assumption that .μ and .ν are finite). • If .(An )n ⊂ M is an increasing sequence of sets and .A = n An , then (Remark 1.10 (b)) μ(A) = lim μ(An ) = lim ν(An ) = ν(A) ,
.
n→∞
n→∞
10
1 Elements of Measure Theory
so that also .A ∈ M. By Theorem 1.2, the Monotone Class Theorem, . E = σ ( C) ⊂ M, hence .μ and .ν coincide on . E. In order to deal with the general case (i.e. .μ and .ν not necessarily finite), let, for A ∈ E,
.
μn (A) = μ(A ∩ En ) ,
.
νn (A) = ν(A ∩ En ) .
It is easy to check that .μn , .νn are measures on . E and as .μn (E) = μ(En ) < +∞ and .νn (E) = ν(En ) < +∞ they are finite. They obviously coincide on . C (which is stable with respect to finite intersections) and, thanks to the first part of the proof, also on . E. Now, if .A ∈ E, as .A ∩ En ↑ A, we have μ(A) = lim μ(A ∩ En ) = lim ν(A ∩ En ) = ν(A) .
.
n→∞
n→∞
Remark 1.12 If .μ and .ν are finite measures, the statement of Proposition 1.11 can be simplified: if .μ and .ν coincide on a class . C which is stable with respect to finite intersections, containing E and generating . E, then they coincide on . E.
An interesting, and natural, problem is the construction of measures satisfying particular properties. For instance, such that they take given values on some classes of sets. The key tool in this direction is the following theorem. We shall skip its proof.
Theorem 1.13 (Carathéodory’s Extension Theorem) Let .μ be a measure on an algebra .A (see Remark 1.9). Then .μ can be extended to a measure on .σ ( A). Moreover, if .μ is .σ -finite this extension is unique.
Let us now introduce a particular class of measures.
A Borel measure on a topological space E is a measure on .(E, B(E)) such that .μ(K) < +∞ for every compact set .K ⊂ E.
Let us have a closer look at the Borel measures on .R. Note first that the class C = { ]a, b], −∞ < a < b < +∞} (the half-open intervals) is stable with respect to finite intersections and that .σ ( C) = B(R) (Exercise 1.2). Thanks
.
1.3 Measures
11
to Proposition 1.11 (Carathéodory’s criterion), a Borel measure .μ on .B(R) is determined by the values .μ(]a, b]), .a, b ∈ R, .a < b, which are finite, as .μ is finite on compact sets. Given such a measure let us define a function F by setting .F (0) = 0 and F (x) =
.
μ(]0, x])
if x > 0
−μ(]x, 0])
if x < 0 .
(1.4)
Then F is right-continuous, as a consequence of Remark 1.10 (c): if .x > 0 and xn ↓ x, then .]0, x] = n ]0, xn ] and, as the sequence .(]0, xn ])n is decreasing and .(μ(]0, xn ]))n is bounded by .μ(]0, x1 ]), we have .F (xn ) = μ(]0, xn ]) ↓ μ(]0, x]) = F (x). If .x < 0 or .x = 0 the argument is the same. F is obviously increasing and we have .
μ(]a, b]) = F (b) − F (a) .
.
(1.5)
A right-continuous increasing function F satisfying (1.5) is a distribution function (d.f.) of .μ. Of course the d.f. of a Borel measure on .R is not unique: .F + c is again a d.f. for every .c ∈ R. Conversely, let .F : R → R be an increasing right-continuous function, does a measure .μ on .B(R) exist such that .μ(]a, b]) = F (b) − F (a)? Such a .μ would be a Borel measure, of course. Let us try to apply Theorem 1.13, Carathéodory’s existence theorem. Let . C be the family of sets formed by the half-open intervals .]a, b]. It is immediate that the algebra .A generated by . C is the family of finite disjoint unions of these intervals, i.e. .
n A= A = ]ak , bk ], −∞ ≤ a1 < b1 < a2 < · · · < bn−1 < an < bn ≤ +∞ k=1
with the understanding .]an , bn ] =]an , +∞[
if .bn = +∞. Let us define .μ on .A by setting .μ(A) = nk=1 (F (bk ) − F (ak )), with .F (+∞) = limx→+∞ F (x), .F (−∞) = limx→−∞ F (x). It is easy to prove that .μ is additive on . A; a bit more delicate is to prove that .μ is .σ -additive on . A, and we shall skip the proof of this fact. As .σ (A) = B(R), we have therefore, thanks to Theorem 1.13, the following result that characterizes the Borel measures on .R.
Theorem 1.14 Let .F : R → R be a right-continuous increasing function. Then there exists a unique Borel measure .μ on .B(R) such that, for every .a < b, .μ(]a, b]) = F (b) − F (a).
12
1 Elements of Measure Theory
Uniqueness, of course, is a consequence of Proposition 1.11, Carathéodory’s criterion, as the class . C of the half-open intervals is stable with respect to finite intersections and generates .B(R) (Exercise 1.2). Borel measures on .R are, of course, .σ -finite: the sets .]−n, n] have finite measure equal to .F (n) − F (−n), and their union is equal to .R. The property of .σ -finiteness of Borel measures holds in more general topological spaces: actually it is sufficient for the space to be .σ -compact (i.e. a countable union of compact sets), which is the case, for instance, if it is locally compact and separable (see Lemma 1.26 below). If we choose .F (x) = x, we obtain existence and uniqueness of a measure .λ on .B(R) such that .λ(I ) = |I | = b − a for every interval .I =]a, b]. This is the Lebesgue measure of .R. Let .(E, E, μ) be a measure space. A subset .A ∈ E is said to be negligible if .μ(A) = 0. We say that a property is true almost everywhere (a.e.) if it is true outside a negligible set. For instance, .f = g a.e. means that the set .{x ∈ E, f (x) = g(x)} is negligible. If .μ is a probability, we say almost surely (a.s.) instead of a.e. Beware that in the literature sometimes a slightly different definition of negligible set can be found. Note that if .(An )n is a sequence of negligible sets, then their union is also negligible (Exercise 1.7).
Remark 1.15 If .(fn )n is a sequence of real measurable functions such that fn →n→∞ f , then we know that f is also measurable. But what if the convergence only takes place a.e.? Let N be the negligible set outside which the convergence takes place and let .fn = fn 1N . Then the .fn are also measurable and converge, everywhere, to := f 1N , which is therefore measurable. .f In conclusion, we can state that there exists at least one measurable function which is the a.e. limit of .(fn )n . Using Exercise 1.6, this remark also holds for sequences .(fn )n of functions with values in a separable metric space. .
1.4 Integration Let .(E, E, μ) be a measure space. In this section we define the integral with respect to .μ. As above we shall be more interested in ideas and tools and shall skip the more technical proofs.
1.4 Integration
13
Let us first define the integral with respect to .μ of a measurable function .f :
+ E → R . If f is positive elementary then .f = nk=1 ak 1Ak , with .Ak ∈ E, and .ak ≥ 0 and we can define f dμ :=
.
E
n
ak μ(Ak ) .
k=1
Some simple remarks show that this number (which can turn out to be .= +∞) does not depend on the representation of f (different numbers .ak and sets .Ak can define the same function). If .f, g are positive and elementary, we have easily (a) if .a, b > 0 then . (af + bg) dμ = a f dμ + b g dμ, (b) if .f ≤ g, then . f dμ ≤ g dμ. The following technical result is the key to the construction.
Lemma 1.16 If .(fn )n , .(gn )n are increasing sequences of positive elementary functions such that .limn→∞ ↑ fn = limn→∞ ↑ gn , then also .
lim ↑
n→∞
fn dμ = lim ↑ n→∞
E
gn dμ . E
+
Let now .f : E → R be a positive . E-measurable function. Thanks to Proposition 1.6 there exists a sequence .(fn )n of elementary positive functions such that .fn ↑ f as .n → ∞; then the sequence .( fn dμ)n of their integrals is increasing thanks to (b) above; let us define . f dμ := lim ↑ fn dμ . (1.6) E
n→∞
E
By Lemma 1.16, this limit does not depend on the particular approximating sequence .(fn )n , hence (1.6) is a good definition. Taking the limit, we obtain immediately that, if .f, g are positive measurable, then • for every .a, b > 0, . (af + bg) dμ = a f dμ + b g dμ; • if .f ≤ g, . f dμ ≤ g dμ. In order to define the integral of a numerical . E-measurable function, let us write the decomposition .f = f + − f − of f into positive and negative parts. The simple idea is to define + . f dμ := f dμ − f − dμ E
E
E
14
1 Elements of Measure Theory
provided that at least one of the quantities . f + dμ and . f − dμ is finite. • f is said to be lower semi-integrable (l.s.i.) if . f − dμ < +∞. In this case the integral of f is well defined (but can take the value .+∞). • f is said to be upper semi-integrable (u.s.i.) if . f + dμ < +∞. In this case the integral of f is well defined (but can take the value .−∞). • f is said to be integrable if both .f + and .f − have finite integral. Clearly a function is l.s.i. if and only if it is bounded below by an integrable function. A positive function is always l.s.i. and a negative one is always u.s.i. Moreover, as .|f | = f + + f − , f is integrable if and only if . |f | dμ < +∞. If f is semi-integrable (upper or lower) we have the inequality f − dμ f dμ = f + dμ − E
.
E
≤
+
E
f dμ + E
−
f dμ = E
(1.7) |f | dμ .
E
Note the difference of the integral just defined (the Lebesgue integral) with respect to the Riemann integral: in both of them the integral is first defined for a class of elementary functions. But for the Riemann integral the elementary functions are piecewise constant and defined by splitting the domain of the function. Here the elementary functions (have a look at the proof of Proposition 1.6) are obtained by splitting its co-domain.
The integral is easily defined also for complex-valued functions. If .f : E → C, and .f = f1 + if2 , then it is immediate that if . |f | dμ < +∞ (here .| | denotes the complex modulus), then both .f1 and .f2 are integrable, as both .|f1 | and .|f2 | are majorized by .|f |. Thus we can define
f dμ =
.
E
f1 dμ + i
E
f2 dμ . E
Also (a bit less obvious) (1.7) still holds, with .| | meaning the complex modulus.
1.4 Integration
15
It is easy to deduce from the properties of the integral of positive functions that • (linearity) if .a,b ∈ C and f and g are both integrable, then also .af + bg is integrable and . (af + bg) dμ = a f dμ + b g dμ; • (monotonicity) if f and g are real and semi-integrable and .f ≤ g, then . f dμ ≤ g dμ. The following properties are often very useful (see Exercise 1.9). (a) If f is positive measurable and if . f dμ < +∞, then .f < +∞ a.e. (recall that we consider numerical functions that can take the value .+∞) (b) If f is positive measurable and . f dμ = 0 then .f = 0 a.e. The reader is encouraged to write down the proofs: it is important to become acquainted with the simple arguments they use. If f is positive measurable (resp. integrable) and .A ∈ E, then .f 1A is itself positive measurable (resp. integrable). We define then
f dμ :=
.
A
f 1A dμ . E
The following are the three classical convergence results.
Theorem 1.17 (Beppo Levi’s Theorem or the Monotone Convergence Theorem) Let .(fn )n be an increasing sequence of measurable functions bounded from below by an integrable function and .f = limn→∞ ↑ fn . Then .
lim ↑
n→∞
fn dμ =
E
f dμ . E
We already know (Remark 1.10 (b)) that if .fn = 1An where .(An )n is an increasing sequence of measurable sets, then .fn ↑ f = 1A where .A = n An and
fn dμ = μ(An ) ↑ μ(A) =
.
E
f dμ . E
Hence Beppo Levi’s Theorem is an extension of the property of passing to the limit of a measure on increasing sequences of sets.
16
1 Elements of Measure Theory
Proposition 1.18 (Fatou’s Lemma) Let .(fn )n be a sequence of measurable functions bounded from below (resp. from above) by an integrable function, then . lim fn dμ ≥ lim fn dμ n→∞ E
E n→∞
n→∞ E
E n→∞
resp. lim fn dμ ≤
lim fn dμ .
Fatou’s Lemma and Beppo Levi’s Theorem are most frequently applied to sequences of positive functions. Fatou’s Lemma implies
Theorem 1.19 (Lebesgue’s Theorem) Let .(fn )n be a sequence of integrable functions such that .fn →n→∞ f a.e. and such that, for every n, .|fn | ≤ g for some integrable function g. Then
.
lim
n→∞ E
fn dμ =
f dμ . E
Lebesgue’s Theorem has a useful “continuous” version.
Corollary 1.20 Let .(ft )t∈U be a family of integrable functions, where .U ⊂ Rd is an open set. Assume that .limt→t0 ft = f a.e. and that, for every .t ∈ U , ft dμ = f dμ. .|ft | ≤ g for some integrable function g. Then .limt→t0 Proof Just note that .limt→t0 ft dμ = f dμ if and only if, for every sequence ftn dμ = f dμ, which holds thanks to .(tn )n ⊂ U converging to .t0 , .limn→∞ Theorem 1.19. This corollary has an important application.
1.4 Integration
17
Proposition 1.21 (Derivation Under the Integral Sign) Let .(E, E, μ) be a measure space, .I ⊂ R an open interval and .(f (t, x), t ∈ I ) a family of integrable functions .E → C. Let, for every .t ∈ I , φ(t) =
f (t, x) dμ(x) .
.
E
Let us assume that there exists a negligible set .N ∈ E such that • for every .x ∈ N c , .t → f (t, x) is differentiable on I ; • there exists an integrable function g such that, for every .t ∈ I , .x ∈ N c , ∂f .| ∂t (t, x)| ≤ g(x). Then .φ is differentiable in the interior of I and
φ (t) =
.
E
∂f (t, x) dμ(x) . ∂t
(1.8)
Proof Let .t ∈ I . The idea is to write, for .h > 0, 1 φ(t + h) − φ(t) = . h
1 f (t + h, x) − f (t, x) dμ(x) h
(1.9)
and then to take the limit as .h → 0. We have for every .x ∈ N c .
1 ∂f f (t + h, x) − f (t, x) → (t, x) h→0 ∂t h
and by the mean value theorem, for .x ∈ N c , 1 ∂f f (t + h, x) − f (t, x) = (τ, x) ≤ g(x) h ∂t
.
for some .τ , .t ≤ τ ≤ t + h (.τ possibly depending on x). Hence by Lebesgue’s Theorem in the version of Corollary 1.20 .
E
1 f (t + h, x) − f (t, x) dμ(x) → h→0 h
E
∂f (t, x) dμ(x) . ∂t
Going back to (1.9), this proves that .φ is differentiable and that (1.8) holds.
Another useful consequence of the “three convergence theorems” is the following result of integration by series.
18
1 Elements of Measure Theory
Corollary 1.22 Let .(E, E, μ) be a measure space. (a) Let .(fn )n be a sequence of positive measurable functions. Then ∞
∞
fk dμ =
.
k=1 E
(1.10)
fk dμ .
E k=1
(b) Let .(fn )n be a sequence of real measurable functions such that ∞ .
|fk | dμ < +∞ .
(1.11)
k=1 E
Then (1.10) holds.
Proof (a) As the partial sums increase to the sum of the series, (1.10) follows as ∞ .
k=1 E
n
fk dμ = lim
n→∞
fk dμ = lim
n
n→∞ E k=1
k=1 E
↓
fk dμ =
∞
fk dμ ,
E k=1
where the equality indicated by .↓ is justified by Beppo Levi’s Theorem. (b) Thanks to (a) we have ∞ .
|fk | dμ =
k=1 E
∞
|fk | dμ
E k=1
so that by (1.11) the sum . ∞ k=1 |fk | is integrable. Then, as above, ∞ .
k=1 E
fk dμ = lim
n→∞
n
fk dμ = lim
n
n→∞ E k=1
k=1 E
↓
fk dμ =
∞
fk dμ ,
E k=1
where now .↓ follows by Lebesgue’s Theorem, as n ∞ ≤ f |fk | k
.
k=1
for every n ,
k=1
so that the partial sums are bounded in modulus by an integrable function.
1.4 Integration
19
Example 1.23 Let us compute
+∞
.
−∞
x dx . sinh x
1 Recall the power series expansion . 1−x = .x > 0,
∞
k=0 x
k
(for .|x| < 1), so that, for
∞
.
1 = e−2kx . −2x 1−e k=0
As .x → .
x sinh x
is an even function we have
+∞ +∞ x x xe−x dx = 4 dx = 4 dx ex − e−x 1 − e−2x −∞ sinh x 0 0 +∞ ∞ ∞ +∞ xe−(2k+1)x dx = 4 xe−(2k+1)x dx =4 +∞
0
k=0 0
k=0
=4
∞ k=0
1 π2 = · 2 (2k + 1)2
Integration by series is authorized here, everything being positive. Let us denote by . E+ the family of positive measurable functions. If we write, for + simplicity, .I (f ) = E f dμ, the integral is a functional .I : E+ → R . We know that I enjoys the properties: +
(a) (positive linearity) if .f, g ∈ E+ and .a, b ∈ R , .I (af + bg) = aI (f ) + bI (g) (with the understanding .0 · +∞ = 0); (b) if .(fn )n ⊂ E+ and .fn ↑ f , then .I (fn ) ↑ I (f ) (this is Beppo Levi’s Theorem).
We have seen how, given a measure, we can define the integral I with respect to it + and that I is a functional . E+ → R satisfying (a) and (b) above. Let us see now how it is possible to reverse the argument, i.e. how, starting from a given functional + + .I : E → R enjoying the properties (a) and (b) above, a measure .μ on .(E, E) can be defined such that I is the integral with respect to .μ.
20
1 Elements of Measure Theory
+
Proposition 1.24 Let .(E, E) be a measurable space and .I : E+ → R a functional enjoying the properties (a) and (b) above. Then .μ(A) := I (1A ), + .A ∈ E, defines a measure on . E and, for every .f ∈ E , .I (f ) = f dμ.
Proof Let us prove that .μ is a measure. Let .f0 ≡ 0, then .μ(∅) = I (f0 ) = I (0 · f0 ) = 0 · I (f0 ) = 0. As for .σ -additivity: let .(An )n
⊂ E be a sequence of pairwise
n disjoint sets whose union is equal to A; then .1A = ∞ 1 = lim ↑ A n→∞ k k=1 k=1 1Ak and, thanks to the properties (a) and (b) above, μ(A) = I (1A ) = I
.
n→∞
= lim ↑ n→∞
lim ↑ n
n
n 1Ak = lim ↑ I (1Ak ) n→∞
k=1
μ(Ak ) =
k=1
∞
k=1
μ(Ak ) .
k=1
Hence .μ is a measure on
.(E, E). Moreover, by (a) above, for every positive elementary function .f = m k=1 ak 1Ak , f dμ =
.
E
m
ak μ(Ak ) = I (f ) ,
k=1
hence the integral with respect to .μ coincides with the functional I on positive elementary functions. Proposition 1.6 and (b) above give that . f dμ = I (f ) for every .f ∈ E+ . Thanks to Proposition 1.24, measure theory and integration can be approached in two different ways. • The first approach is to investigate and construct measures, i.e. set functions .μ satisfying Definition 1.8, and then construct the integral of measurable functions with respect to measures (thus obtaining functionals on positive functions). • The second approach is to directly construct functionals on . E+ satisfying properties (a) and (b) above and then obtain measures by applying these functionals to functions that are indicators of sets, as in Proposition 1.24. These two points of view are equivalent but, according to the situation, one of them may turn out to be significantly simpler. So far we have followed the first one but we shall see situations where the second one turns out to be much easier. A question that we shall often encounter in the sequel is the following: assume that we know that the integrals with respect to two measures .μ and .ν coincide for
1.4 Integration
21
every function in some class . D, for example continuous functions in a topological space setting. Can we deduce that .μ = ν? With this goal in mind it is useful to have results concerning the approximation of indicator functions by means of “regular” functions. If E is a metric space and .G ⊂ E is an open set, let us consider the sequence of continuous functions .(fn )n defined as fn (x) = n d(x, Gc ) ∧ 1 .
.
(1.12)
A quick look shows immediately that .fn vanishes on .Gc whereas .fn (x) increases to 1 if .x ∈ G. Therefore .fn ↑ 1G as .n → ∞.
Proposition 1.25 Let .(E, d) be a metric space. (a) Let .μ, ν be finite measures on .B(E) such that
f dμ =
.
E
f dν
(1.13)
E
for every bounded continuous function .f : E → R. Then .μ = ν. (b) Assume that E is also separable and locally compact and that .μ and .ν are Borel measures on E (not necessarily finite). Then if (1.13) holds for every function f which is continuous and compactly supported we have .μ = ν.
Proof (a) Let .G ⊂ E be an open set and let .fn be as in (1.12). As .fn ↑ 1G as n → ∞, by Beppo Levi’s Theorem
.
μ(G) = lim
.
n→∞ E
fn dμ = lim
n→∞ E
fn dν = ν(G) ,
(1.14)
hence .μ and .ν coincide on open sets. Taking .f ≡ 1 we have also .μ(E) = ν(E): just take .f ≡ 1 in (1.13). As the class of open sets is stable with respect to finite intersections the result follows thanks to Carathéodory’s criterion, Proposition 1.11. (b) Let .G ⊂ E be a relatively compact open set and .fn as in (1.12). .fn is continuous and compactly supported (its support is contained in .G) and, by (1.14), .μ(G) = ν(G).
22
1 Elements of Measure Theory
Hence .μ and .ν coincide on the class . C of relatively compact open subsets of E. Thanks to Lemma 1.26 below, there exists a sequence .(Wn )n of relatively compact open sets increasing to E. Hence we can apply Carathéodory’s criterion, Proposition 1.11, and .μ and .ν also coincide on .σ ( C). Moreover, every open set G belongs to .σ ( C) as G=
∞
G ∩ Wn
.
n=1
and .G ∩ Wn is a relatively compact open set. Hence .σ ( C) contains all open sets and also the Borel .σ -algebra .B(E), completing the proof.
Lemma 1.26 Let E be a locally compact separable metric space. (a) E is the countable union of an increasing sequence of relatively compact open sets. In particular, E is .σ -compact, i.e. the union of a countable family of compact sets. (b) There exists an increasing sequence of compactly supported continuous functions .(hn )n such that .hn ↑ 1 as .n → ∞.
Proof (a) Let . D be the family of open balls with rational radius centered at the points of a given countable dense subset D. . D is countable and every open set of E is the union (countable of course) of elements of . D. Every .x ∈ E has a relatively compact neighborhood .Ux , E being assumed to be locally compact. Then .V ⊂ Ux for some .V ∈ D. Such balls V are relatively compact, as .V ⊂ Ux . The balls V that are contained in some of the .Ux ’s as above are countably many as . D is itself countable, and form a countable covering of E n )n then that is comprised of relatively compact open sets. If we denote them by .(V the sets Wn =
n
.
k V
(1.15)
k=1
form an increasing sequence of relatively compact open sets such that .Wn ↑ E as n → ∞. (b) Let
.
hn (x) := n d(x, Wnc ) ∧ 1
.
with .Wn as in (1.15). The sequence .(hn )n is obviously increasing and, as the support of .hn is contained in .Wn , each .hn is also compactly supported. As .Wn ↑ E, for every .x ∈ E we have .hn (x) = 1 for n large enough.
1.5 Important Examples
23
Note that if E is not locally compact the relation . f dμ = f dν for every compactly supported continuous function f does not necessarily imply that .μ = ν on .B(E). This should be kept in mind, as it can occur when considering measures on, e.g., infinite-dimensional Banach spaces, which are not locally compact. In some sense, if the space is not locally compact, the class of compactly supported continuous functions is not “large enough”.
1.5 Important Examples Let us present some examples of measures and some ways to construct new measures starting from given ones. • (Dirac masses) If .x ∈ E let us consider the measure on .P(E) (all subsets of E) that is defined as μ(A) = 1A (x) .
.
(1.16)
This is the measure that gives to a set A the value 0 or 1 according as .x ∈ A or not. It is immediate that this is a measure; it is denoted .δx and is called the Dirac mass at x. We have the formula . f dδx = f (x) , E
which can be easily proved by the same argument as in the forthcoming Propositions 1.27 or 1.28. • (Countable sets) If E is a countable set, a measure on .(E, P(E)) can be constructed in a simple (and natural) way: let
us associate to every .x ∈ E a number + and let, for .A ⊂ E, .μ(A) = .px ∈ R x∈A px . The summability properties of positive series imply that .μ is a measure: actually, if .A1 , A2 , . . . are pairwise disjoint subsets of E, and .A = n An , then the .σ -additivity relationship μ(A) =
∞
.
μ(An )
n=1
is equivalent to ∞ .
n=1
x∈An
px = px , x∈A
which holds because the sum of a series whose terms are positive does not depend on the order of summation.
24
1 Elements of Measure Theory
A natural example is the choice .px = 1 for every x. In this case the measure of a set A coincides with its cardinality. This is the counting measure of E. • (Image measures) Let .(E, E) and .(G, G) be measurable spaces, .Φ : E → G a measurable map and .μ a measure on .(E, E); we can define a measure .ν on .(G, G) via ν(A) := μ Φ −1 (A)
.
A ∈ G.
(1.17)
Also here it is immediate to check that .ν is a measure (thanks to the relations (1.1)). ν is the image measure of .μ under .Φ and is denoted .Φ(μ) or .μ ◦ Φ −1 .
.
Proposition 1.27 (Integration with Respect to an Image Measure) Let + .g : G → R be a positive measurable function. Then
g dν =
.
g ◦ Φ dμ .
G
(1.18)
E
A measurable function .g : G → R is integrable with respect to .ν if and only if .g ◦ Φ is integrable with respect to .μ and also in this case (1.18) holds.
+
Proof Let, for every positive measurable function .g : G → R , I (g) =
g ◦ Φ dμ .
.
E
It is immediate that the functional I satisfies the conditions (a) and (b) of Proposition 1.24. Therefore, thanks to Proposition 1.24, A → I (1A ) =
1A ◦ Φ dμ =
.
E
E
1Φ −1 (A) dμ = μ(Φ −1 (A))
is a measure on .(G, G) and (1.18) holds for every positive function g. The proof is completed taking the decomposition of g into positive and negative parts. • (Measures defined by a density) Let .μ be a .σ -finite measure on .(E, E).
A positive measurablefunction f is a density if there exists a sequence (An )n ⊂ E such that . n An = E, .μ(An ) < +∞ and .f 1An is integrable for every n.
.
1.5 Important Examples
25
In particular a positive integrable function is a density.
Theorem 1.28 Let .(E, E, μ) be a .σ -finite measure space and f a density with respect to .μ. Let for .A ∈ E
ν(A) :=
1A f dμ =
.
E
f dμ .
(1.19)
A
Then .ν is a .σ -finite measure on .(E, E) which is called the measure of density f with respect to .μ, denoted .dν = f dμ. Moreover, for every positive measurable function .g : E → R we have
g dν =
.
E
g f dμ .
(1.20)
E
A measurable function .g : E → R is integrable with respect to .ν if and only if gf is integrable with respect to .μ and also in this case (1.20) holds.
Proof The functional .
+
E g →
gf dμ E
is positively linear and passes to the limit on increasing sequences of positive functions by Beppo Levi’s Theorem (recall . E+ = the positive measurable functions). Hence, by Proposition 1.24, ν(A) := I (1A ) =
f dμ
.
A
is a measure on .(E, E) such that (1.20) holds for every positive function g. .ν is σ -finite because if .(An )n is a sequence of sets of . E such that . n An = E and with .f 1An integrable, then .
ν(An ) =
.
E
f 1An dμ < +∞ .
Finally (1.20) is proved to hold for every .ν-integrable function by decomposing g into positive and negative parts. Let .μ, ν be .σ -finite measures on the measurable space .(E, E). We say that .ν is absolutely continuous with respect to .μ, denoted .ν μ, if and only if every .μ-negligible set .A ∈ E (i.e. such that .μ(A) = 0) is also .ν-negligible.
26
1 Elements of Measure Theory
If .ν has density f with respect to .μ then clearly .ν μ: if A is .μ-negligible then the function .f 1A is .= 0 only on A, hence .ν(A) = E f 1A dμ = 0 (Exercise 1.10). A remarkable and non-obvious result is that the converse is also true.
Theorem 1.29 (Radon-Nikodym) If μ, ν are σ-finite and ν ≪ μ then ν has a density with respect to μ.
A proof of this theorem can be found in almost all the books listed in the references. A proof in the case of probabilities will be given in Example 5.27.
It is often important to establish whether a Borel measure ν on (ℝ, B(ℝ)) has a density with respect to the Lebesgue measure λ, i.e. is such that ν ≪ λ, and to be able to compute it. First, in order for ν to be absolutely continuous with respect to λ it is necessary that ν({x}) = 0 for every x, as λ({x}) = 0 and the negligible sets for λ must also be negligible for ν. The distribution function F of ν therefore must be continuous, as

0 = ν({x}) = lim_{n→∞} ν(]x − 1/n, x]) = F(x) − lim_{n→∞} F(x − 1/n) .

Assume, moreover, that F is absolutely continuous, hence a.e. differentiable and such that, if F′(x) = f(x), for every −∞ < a ≤ b < +∞,

∫_a^b f(x) dx = F(b) − F(a) .   (1.21)
In (1.21) the term on the right-hand side is nothing else than .ν(]a, b]), whereas the left-hand term is the value on .]a, b] of the measure .f dλ. The two measures .ν and .f dλ therefore coincide on the half-open intervals and by Theorem 1.14 (Carathéodory’s criterion) they coincide on the whole .σ -algebra .B(R). Note that, to be precise, it is not correct to speak of “the” density of .ν with respect to .μ: if f is a density, then so is every function g that is .μ-equivalent to f (i.e. such that .f = g .μ-a.e.).
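A small numerical sketch (not part of the text; the exponential distribution function is an arbitrary choice of example) illustrating (1.21): the measure f dλ and the increments of F agree on half-open intervals.

```python
# Check of (1.21) for an absolutely continuous distribution function F with
# a.e. derivative f.  Here F is the exponential d.f., chosen only as an example.
import math

F = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0   # distribution function
f = lambda x: math.exp(-x) if x > 0 else 0.0         # its density F'

def integral(g, a, b, n=100_000):
    """Crude midpoint approximation of the integral of g over ]a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

a, b = 0.3, 2.1
lhs = integral(f, a, b)          # value of the measure f dλ on ]a, b]
rhs = F(b) - F(a)                # ν(]a, b])
print(lhs, rhs)                  # the two agree up to discretization error
assert abs(lhs - rhs) < 1e-6
```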
1.6 Lp Spaces

Let (E, E, μ) be a measure space, V a normed vector space and f : E → V a measurable function. Let, for 1 ≤ p < +∞,

‖f‖_p = ( ∫_E |f|^p dμ )^{1/p}
and, for p = +∞,

‖f‖_∞ = inf{M; μ(|f| > M) = 0} .

In particular the set {|f| > ‖f‖_∞} is negligible. ‖ ‖_p and ‖ ‖_∞ can of course be +∞. Let, for 1 ≤ p ≤ +∞,

𝓛^p = {f ; ‖f‖_p < +∞} .
Let us state two fundamental inequalities: if f, g : E → V are measurable functions then

‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p ,   1 ≤ p ≤ +∞ ,   (1.22)

which is Minkowski's inequality, and

‖ |f| |g| ‖_1 ≤ ‖f‖_p ‖g‖_q ,   1 ≤ p ≤ +∞ ,  1/p + 1/q = 1 ,   (1.23)
which is Hölder's inequality.
Thanks to Minkowski's inequality, 𝓛^p is a vector space and ‖ ‖_p a seminorm. It is not a norm, as it is possible for a function f ≠ 0 to have ‖f‖_p = 0 (this happens if and only if f = 0 a.e.). Let us define an equivalence relation on 𝓛^p by setting f ∼ g if f = g a.e. and then let L^p = 𝓛^p/∼, the quotient space with respect to this equivalence. Then L^p is a normed space. Actually, f = g a.e. implies ∫ |f|^p dμ = ∫ |g|^p dμ, and we can define, for f ∈ L^p, ‖f‖_p without ambiguity. Note however that L^p is not a space of functions, but of equivalence classes of functions; this distinction is seldom important and in the sequel we shall often identify a function f and its equivalence class. But sometimes it will be necessary to pay attention.
If the norm of V is associated to a scalar product ⟨·, ·⟩, then, for p = q = 2, Hölder's inequality (1.23) gives the Cauchy-Schwarz inequality

( ∫_E ⟨f, g⟩ dμ )² ≤ ∫_E |f|² dμ · ∫_E |g|² dμ .   (1.24)
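The following Python fragment is only a numerical sanity check (vectors and exponents are arbitrary choices) of the three inequalities above for the counting measure on a finite set, i.e. for vectors of ℝⁿ:

```python
# Numerical check of Minkowski (1.22), Hölder (1.23) and Cauchy-Schwarz (1.24)
# for the counting measure on {1, ..., 50}.
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.normal(size=50), rng.normal(size=50)
p, q = 3.0, 1.5                        # conjugate exponents: 1/p + 1/q = 1

norm = lambda h, r: (np.abs(h) ** r).sum() ** (1.0 / r)

assert norm(f + g, p) <= norm(f, p) + norm(g, p) + 1e-12       # Minkowski
assert norm(f * g, 1) <= norm(f, p) * norm(g, q) + 1e-12       # Hölder
assert np.dot(f, g) ** 2 <= (f ** 2).sum() * (g ** 2).sum() + 1e-12  # Cauchy-Schwarz
print("all three inequalities hold on this sample")
```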
It can be proved that if the target space V is complete, i.e. a Banach space, then the normed space L^p is itself a Banach space and therefore also complete. In this case L² is a Hilbert space with respect to the scalar product

⟨f, g⟩₂ = ∫_E ⟨f, g⟩ dμ .
Note that, if V = ℝ, then

⟨f, g⟩₂ = ∫_E f g dμ

and, if V = ℂ,

⟨f, g⟩₂ = ∫_E f ḡ dμ .
A sequence of functions (f_n)_n ⊂ L^p is said to converge to f in L^p if ‖f_n − f‖_p → 0 as n → ∞.

Remark 1.30 Let f, g ∈ L^p, p ≥ 1. Then by Minkowski's inequality we have

‖f‖_p ≤ ‖f − g‖_p + ‖g‖_p ,
‖g‖_p ≤ ‖f − g‖_p + ‖f‖_p ,

from which we obtain both inequalities

‖g‖_p − ‖f‖_p ≤ ‖f − g‖_p   and   ‖f‖_p − ‖g‖_p ≤ ‖f − g‖_p ,

hence

| ‖f‖_p − ‖g‖_p | ≤ ‖f − g‖_p ,

so that f ↦ ‖f‖_p is a continuous map L^p → ℝ⁺ and L^p-convergence implies convergence of the L^p norms.
1.7 Product Spaces, Product Measures

Let (E_1, E_1), . . . , (E_m, E_m) be measurable spaces. On the product set E := E_1 × · · · × E_m let us define the product σ-algebra E by setting

E := E_1 ⊗ · · · ⊗ E_m := σ(A_1 × · · · × A_m ; A_1 ∈ E_1, . . . , A_m ∈ E_m) .   (1.25)

E is the smallest σ-algebra that contains the "rectangles" A_1 × · · · × A_m with A_1 ∈ E_1, . . . , A_m ∈ E_m.
Proposition 1.31 Let p_i : E → E_i, i = 1, . . . , m, be the canonical projections

p_i(x_1, . . . , x_m) = x_i .

Then p_i is measurable (E, E) → (E_i, E_i) and the product σ-algebra E is the smallest σ-algebra on the product space E that makes the projections p_i measurable.
Proof If A_i ∈ E_i, then, for 1 ≤ i ≤ m,

p_i⁻¹(A_i) = E_1 × · · · × E_{i−1} × A_i × E_{i+1} × · · · × E_m .   (1.26)

This set belongs to E (it is a "rectangle"), hence p_i is measurable (E, E) → (E_i, E_i). Conversely, let E′ denote a σ-algebra of subsets of E = E_1 × · · · × E_m with respect to which the canonical projections p_i are measurable. E′ must contain the sets p_i⁻¹(A_i), A_i ∈ E_i, i = 1, . . . , m. Therefore E′ also contains the rectangles, as, recalling (1.26), we can write A_1 × · · · × A_m = p_1⁻¹(A_1) ∩ · · · ∩ p_m⁻¹(A_m). Therefore E′ also contains the product σ-algebra E, which is the smallest σ-algebra containing the rectangles.

Let now (G, G) be a measurable space and f = (f_1, . . . , f_m) a map from (G, G) to the product space (E, E). As an immediate consequence of Proposition 1.31, f is measurable if and only if all its components f_i = p_i ∘ f : (G, G) → (E_i, E_i) are measurable. Indeed, if f : G → E is measurable (G, G) → (E, E), then the components f_i = p_i ∘ f are measurable, being compositions of measurable maps. Conversely, if the components f_1, . . . , f_m are measurable, then for every rectangle A = A_1 × · · · × A_m ∈ E we have

f⁻¹(A) = f_1⁻¹(A_1) ∩ · · · ∩ f_m⁻¹(A_m) ∈ G .

Hence the pullback of every rectangle is a measurable set and the claim follows thanks to Remark 1.5, as the rectangles generate the product σ-algebra E.

Given two topological spaces, on their product we can consider

• the product of the respective Borel σ-algebras
• the Borel σ-algebra of the product topology.

Do they coincide? In general they do not, but the next proposition states that they do coincide under assumptions that are almost always satisfied. Recall that a topological space is said to have a countable basis of open sets if there exists a countable family (O_n)_n of
open sets such that every open set is the union of some of the .On . In particular, every separable metric space has such a basis.
Proposition 1.32 Let E_1, . . . , E_m be topological spaces. Then
(a) B(E_1 × · · · × E_m) ⊃ B(E_1) ⊗ · · · ⊗ B(E_m).
(b) If E_1, . . . , E_m have a countable basis of open sets, then B(E_1 × · · · × E_m) = B(E_1) ⊗ · · · ⊗ B(E_m).
Proof In order to keep the notation simple, let us assume m = 2.
(a) The projections

p_1 : E_1 × E_2 → E_1 ,   p_2 : E_1 × E_2 → E_2

are continuous when we consider on E_1 × E_2 the product topology (which, by definition, is the smallest topology on the product space with respect to which the projections are continuous). They are therefore also measurable with respect to B(E_1 × E_2). Hence B(E_1 × E_2) contains B(E_1) ⊗ B(E_2), which is the smallest σ-algebra making the projections measurable (Proposition 1.31).
(b) If (U_{1,n})_n, (U_{2,n})_n are countable bases of the topologies of E_1 and E_2 respectively, then the sets V_{n,m} = U_{1,n} × U_{2,m} form a countable basis of the product topology of E_1 × E_2. As U_{1,n} ∈ B(E_1) and U_{2,n} ∈ B(E_2), we have V_{n,m} ∈ B(E_1) ⊗ B(E_2) (V_{n,m} is a rectangle). As all open sets of E_1 × E_2 are countable unions of the open sets V_{n,m}, all open sets of the product topology belong to the σ-algebra B(E_1) ⊗ B(E_2), which therefore contains B(E_1 × E_2).

Let μ, ν be finite measures on the product space. Carathéodory's criterion, Proposition 1.11, ensures that if they coincide on rectangles then they are equal. Indeed the class of rectangles A_1 × · · · × A_m is stable with respect to finite intersections. In order to prove that μ = ν it is also sufficient to check that

∫_E f_1(x_1) · · · f_m(x_m) dμ(x) = ∫_E f_1(x_1) · · · f_m(x_m) dν(x)

for every choice of bounded measurable functions f_i : E_i → ℝ. If the spaces (E_i, E_i) are metric spaces, a repetition of the arguments of Proposition 1.25 proves the following criterion.
Proposition 1.33 Assume that (E_i, E_i), i = 1, . . . , m, are metric spaces endowed with their Borel σ-algebras. Let μ, ν be finite measures on the product space.
(a) Assume that

∫_E f_1(x_1) · · · f_m(x_m) dμ(x) = ∫_E f_1(x_1) · · · f_m(x_m) dν(x)   (1.27)

for every choice of bounded continuous functions f_i : E_i → ℝ, i = 1, . . . , m. Then μ = ν.
(b) If, moreover, the spaces E_i, i = 1, . . . , m, are also separable and locally compact and if (1.27) holds for every choice of continuous and compactly supported functions f_i, then μ = ν.
Let μ_1, . . . , μ_m be σ-finite measures on (E_1, E_1), . . . , (E_m, E_m) respectively. For every rectangle A = A_1 × · · · × A_m let

μ(A) = μ_1(A_1) · · · μ_m(A_m) .   (1.28)

Is it possible to extend μ to a measure on the product σ-algebra E = E_1 ⊗ · · · ⊗ E_m? In order to prove the existence of this extension it is possible to take advantage of Theorem 1.13, Carathéodory's extension theorem, whose use here however requires some work in order to check that the set function μ defined in (1.28) is σ-additive on the algebra of finite unions of rectangles (recall Remark 1.9). It is easier to proceed following the idea of Proposition 1.24, i.e. constructing a positively linear functional on the positive functions on (E, E) that passes to the limit on increasing sequences. More precisely the idea is the following. Let us assume for simplicity m = 2 and let f : E_1 × E_2 → ℝ⁺ be a positive E_1 ⊗ E_2-measurable function.

(1) First prove that, for every given x_1 ∈ E_1, x_2 ∈ E_2, the functions f(x_1, ·) and f(·, x_2) are respectively E_2- and E_1-measurable.
(2) Then prove that the "partially integrated" functions

x_1 ↦ ∫_{E_2} f(x_1, x_2) dμ_2(x_2) ,   x_2 ↦ ∫_{E_1} f(x_1, x_2) dμ_1(x_1)

are respectively E_1- and E_2-measurable.
(3) Now let

I(f) = ∫_{E_2} dμ_2(x_2) ∫_{E_1} f(x_1, x_2) dμ_1(x_1)   (1.29)
(i.e. we integrate first with respect to μ_1 the measurable function x_1 ↦ f(x_1, x_2), the result is a measurable function of x_2 that is then integrated with respect to μ_2). It is immediate that the functional I satisfies assumptions (a) and (b) of Proposition 1.24 (use Beppo Levi's Theorem twice). It follows (Proposition 1.24) that μ(A) := I(1_A) defines a measure on E_1 ⊗ E_2. Such a μ satisfies (1.28), as, by (1.29),

μ(A_1 × A_2) = I(1_{A_1×A_2}) = ∫_{E_1} 1_{A_1}(x_1) dμ_1(x_1) ∫_{E_2} 1_{A_2}(x_2) dμ_2(x_2) = μ_1(A_1) μ_2(A_2) .

This is the extension we were looking for. The measure μ is the product measure of μ_1 and μ_2, denoted μ = μ_1 ⊗ μ_2.
Uniqueness of the product measure follows from Carathéodory's criterion, Proposition 1.11, as two measures satisfying (1.28) coincide on the rectangles having finite measure, which form a class that is stable with respect to finite intersections and, as the measures μ_i are assumed to be σ-finite, generates the product σ-algebra. In order to properly apply Carathéodory's criterion however we also need to prove that there exists a sequence of rectangles of finite measure increasing to the whole product space. Let, for every i = 1, . . . , m, C_{i,n} ∈ E_i be an increasing sequence of sets such that μ_i(C_{i,n}) < +∞ and ⋃_n C_{i,n} = E_i. Such a sequence exists as the measures μ_1, . . . , μ_m are assumed to be σ-finite. Then the sets C_n = C_{1,n} × · · · × C_{m,n} are increasing, such that μ(C_n) < +∞ and ⋃_n C_n = E.
The proofs of (1) and (2) above are without surprise: these properties are obvious if f is the indicator function of a rectangle. Let us prove next that they hold if f is the indicator function of a set of . E = E1 ⊗ E2 : let .M be the class of the sets .A ∈ E whose indicator functions satisfy 1), i.e. such that .1A (x1 , ·) and .1A (·, x2 ) are respectively . E2 - and . E1 -measurable. It is immediate that they form a monotone class. As .M contains the rectangles, a family which is stable with respect to finite intersections, by Theorem 1.2, the Monotone Class Theorem, .M also contains . E, which is the .σ -algebra generated by the rectangles. By linearity (1) is also satisfied by the elementary functions on . E and finally by all . E1 ⊗ E2 -positive measurable functions thanks to Proposition 1.6 (approximation with elementary functions). The argument to prove (2) is similar but requires more care, considering first the case of finite measures and then taking advantage of the assumption of .σ -finiteness. In practice, in order to integrate with respect to a product measure one takes advantage of the following, very important, theorem. We state it with respect to the product of two measures, the statement for the product of m measures being left to the imagination of the reader.
Theorem 1.34 (Fubini-Tonelli) Let f : E_1 × E_2 → ℝ be an E_1 ⊗ E_2-measurable function and let μ_1, μ_2 be σ-finite measures on (E_1, E_1) and (E_2, E_2) respectively. Let μ = μ_1 ⊗ μ_2 be their product.
(a) If f is positive, then the functions

x_1 ↦ ∫_{E_2} f(x_1, x_2) dμ_2(x_2) ,   x_2 ↦ ∫_{E_1} f(x_1, x_2) dμ_1(x_1)   (1.30)

are respectively E_1- and E_2-measurable. Moreover, we have

∫_{E_1×E_2} f dμ = ∫_{E_2} dμ_2(x_2) ∫_{E_1} f(x_1, x_2) dμ_1(x_1) = ∫_{E_1} dμ_1(x_1) ∫_{E_2} f(x_1, x_2) dμ_2(x_2) .   (1.31)

(b) If f is real, numerical or complex-valued and integrable with respect to the product measure μ_1 ⊗ μ_2, then the functions in (1.30) are respectively E_1- and E_2-measurable and integrable with respect to μ_1 and μ_2 respectively and (1.31) holds.
For simplicity we shall refer to this theorem as Fubini's Theorem. The main ideas in the application of Fubini's Theorem for the integration of a function with respect to a product measure are:

• if f is positive everything is allowed (i.e. you can integrate with respect to the variables one after the other in any order) and the result is equal to the integral with respect to the product measure, which can be a real number or possibly +∞ (this is part (a) of Fubini's Theorem);
• if f is real and takes both positive and negative values or is complex-valued, in order for (1.31) to hold f must be integrable with respect to the product measure. In practice one first checks integrability of |f| using part (a) of the theorem and then applies part (b) in order to compute the integral;
• in addition to the two integration results for positive and integrable functions, the measurability and integrability of the "partially integrated" functions (1.30) are also useful.

Therefore Fubini's Theorem 1.34 contains in fact three different results, all of them very useful.
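As a purely numerical illustration of part (a) — a sketch with an arbitrarily chosen positive function and the Lebesgue measure on [0, 1] for both factors — the two iterated integrals in (1.31) can be compared on a grid:

```python
# The two iterated integrals of a positive function agree (Fubini-Tonelli).
import numpy as np

n = 400
x = (np.arange(n) + 0.5) / n                 # midpoints of a grid on [0, 1]
X1, X2 = np.meshgrid(x, x, indexing="ij")
f = np.exp(X1 * X2) / (1.0 + X1 + X2)        # an arbitrary positive function
dx = 1.0 / n

first = ((f.sum(axis=1) * dx) * dx).sum()    # integrate in x2 first, then x1
second = ((f.sum(axis=0) * dx) * dx).sum()   # integrate in x1 first, then x2
print(first, second)                         # equal up to rounding
assert abs(first - second) < 1e-9
```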
Remark 1.35 Corollary 1.22 (integration by series) can be seen as a consequence of Fubini's Theorem. Indeed, let (E, E, μ) be a measure space. Given a sequence (f_n)_n of measurable functions E → ℝ we can consider the function Φ_f : ℕ × E → ℝ defined as (n, x) ↦ f_n(x). Hence the relation

∑_{n=1}^∞ ∫_E f_n dμ = ∫_E ∑_{n=1}^∞ f_n dμ

is just Fubini's theorem for the function Φ_f integrated with respect to the product measure ν_c ⊗ μ, ν_c denoting the counting measure of ℕ. Measurability of Φ_f above is immediate.

Let us consider (ℝ, B(ℝ), λ) (λ = the Lebesgue measure). By Proposition 1.32, B(ℝ) ⊗ · · · ⊗ B(ℝ) = B(ℝ^d). Let λ_d = λ ⊗ · · · ⊗ λ (d times). We can apply Carathéodory's criterion, Proposition 1.11, to the class of sets

C = { A ; A = ∏_{i=1}^d ]a_i, b_i[ , −∞ < a_i < b_i < +∞ }

and obtain that λ_d is the unique measure on B(ℝ^d) such that, for every −∞ < a_i < b_i < +∞,

λ_d( ∏_{i=1}^d ]a_i, b_i[ ) = ∏_{i=1}^d (b_i − a_i) .

λ_d is the Lebesgue measure of ℝ^d.

In the sequel we shall also need to consider the product of countably many measure spaces. The theory is very similar to the finite case, at least for probabilities. Let (E_i, E_i, μ_i), i = 1, 2, . . . , be measure spaces. Then the product σ-algebra E = ⊗_{i=1}^∞ E_i is defined as the smallest σ-algebra of subsets of the product E = ∏_{i=1}^∞ E_i containing the rectangles ∏_{i=1}^∞ A_i, A_i ∈ E_i. The following statement says that on the product space (E, E) there exists a probability that is the product of the μ_i.
Theorem 1.36 Let (E_i, E_i, μ_i), i = 1, 2, . . . , be a countable family of measure spaces such that μ_i is a probability for every i. Then there exists a unique probability μ on (E, E) such that for every rectangle A = ∏_{i=1}^∞ A_i

μ(A) = ∏_{i=1}^∞ μ_i(A_i) .
For a proof and other details, see Halmos’s book [16].
Exercises 1.1 (p. 261) A .σ -algebra . F is said to be countably generated if there exists a countable subfamily . C ⊂ F such that .σ ( C) = F. Prove that if E is a separable metric space, then its Borel .σ -algebra, .B(E), is countably generated. In particular, so is the Borel .σ -algebra of .Rd or, more generally, of any separable Banach space. 1.2 (p. 261) The Borel .σ -algebra of .R is generated by each of the following families of sets. (a) (b) (c) (d)
The open intervals .]a, b[, .a < b. The half-open intervals .]a, b], .a < b. The open half-lines .]a, ∞[, .a ∈ R. The closed half-lines .[a, ∞[, .a ∈ R.
1.3 (p. 261) Let E be a topological space and let us denote by .B0 (E) the smallest σ -algebra of subsets of E with respect to which all real continuous functions are measurable. .B0 (E) is the Baire .σ -algebra.
.
(a) Prove that .B0 (E) ⊂ B(E). (b) Prove that if E is metric separable then .B0 (E) and .B(E) coincide. 1.4 (p. 262) Let .(E, E) be a measurable space and .S ⊂ E (not necessarily .S ∈ E). Prove that .
ES = {A ∩ S; A ∈ E}
is a .σ -algebra of subsets of S (the trace .σ -algebra of . E on S). 1.5 (p. 262) Let .(E, E) be a measurable space.
(a) Let .(fn )n be a sequence of real measurable functions. Prove that the set L = {x; lim fn (x) exists}
.
n→∞
is measurable. (b) Assume that the .fn take their values in a metric space G. Using unions, intersections, complementation. . . describe the set of points x such that the Cauchy property for the sequence .(fn (x))n is satisfied and prove that, if E is complete, L is measurable also in this case. 1.6 (p. 262) Let .(E, E) be a measurable space, .(fn )n a sequence of measurable functions taking values in the metric space .(G, d) and assume that .limn→∞ fn = f pointwise. We have seen (p. 4) that if .G = R then f is also measurable. In this exercise we address this question in more generality. (a) Prove that for every continuous function .Φ : G → R the function .Φ ◦ f is measurable. (b) Prove that if the metric space .(G, d) is separable, then f is measurable .E → G. Recall that, for .z ∈ G, the function .x → d(x, z) is continuous. 1.7 (p. 263) Let .(E, E, μ) be a measure space. (a) Prove that if .(An )n ⊂ E then ∞ ∞ μ An ≤ μ(An ) .
.
n=1
(1.32)
n=1
(b) Let .(An )n be a sequence of negligible events. Prove that . n An is also negligible. (c) Let .A = {A; A ∈ E, μ(A) = 0 or μ(Ac ) = 0}. Prove that .A is a .σ -algebra. 1.8 (p. 264) (The support of a measure) Let .μ be a Borel measure on a separable metric space E. Let us denote by .Bx (r) the open ball with radius r centered at x and let F = {x ∈ E; μ(Bx (r)) > 0 for every r > 0}
.
(i.e. F is formed by all .x ∈ E such that all their neighborhoods have strictly positive measure). (a) Prove that F is a closed set. (b1) Prove that .μ(F c ) = 0. (b2) Prove that F is the smallest closed subset of E such that .μ(F c ) = 0. • F is the support of the measure .μ. Note that the support of a measure is always a closed set.
Exercises
37
1.9 (p. 264) Let .μ be a measure on .(E, E) and .f : E → R a measurable function. (a) Prove that if f is integrable, then .|f | < +∞ a.e. (b) Prove that if f is positive and f dμ = 0
.
E
then .f = 0 .μ-a.e. (c) Prove that if f is semi-integrable and if . A f dμ ≥ 0 for every .A ∈ E, then .f ≥ 0 a.e. 1.10 (p. 265) Let .(E, E, μ) be a measure space and .f : E → R a measurable function vanishing outside a negligible set N . Prove that f is integrable and that its integral vanishes. 1.11 (p. 265) (a) Let .(wn )n be a bounded sequence of positive numbers and let, for .t > 0, φ(t) =
∞
.
wn e−tn .
(1.33)
n=1
Is it true that .φ is differentiable by series, i.e. that, if .t > 0, .φ is differentiable and φ (t) = −
∞
.
nwn e−tn ?
(1.34)
n=1
(b) Consider the same question where the sequence .(wn )n is bounded but not necessarily positive. √ (c1) And if .wn = √n? (c2) And if .wn = e n ? 1.12 (p. 267) (Counterexamples) (a) Find an example of a measure space .(E, E, μ) and of a decreasing sequence .(An )n ⊂ E such that .μ(An ) does not converge to .μ(A) where .A = n An . (b) Beppo Levi’s Theorem requires the existence of an integrable function f such that .f ≤ fn for every n. Give an example where this condition is not satisfied and the statement of Beppo Levi’s Theorem is not true. 1.13 (p. 267) Let .ν, .μ be measures on the measurable space .(E, E) such that .ν μ. .ν, . μ Let .φ be a measurable map from E into the measurable space .(G, G) and let be the respective images of .ν and .μ. Prove that .ν μ.
38
1 Elements of Measure Theory
1.14 (p. 267) Let .λ be the Lebesgue measure on .[0, 1] and .μ the set function on B([0, 1]) defined as
.
μ(A) =
.
0
if λ(A) = 0
+∞ if λ(A) > 0 .
(a) Prove that .μ is a measure on .B([0, 1]). (b) Note that .λ μ but the Radon-Nikodym Theorem does not hold here and explain why. 1.15 (p. 267) Let .(E, E, μ) be a measure space and .(fn )n a sequence of real functions bounded in .Lp , .0 < p ≤ +∞, and assume that .fn →n→∞ f .μ-a.e. (a1) Prove that .f ∈ Lp . (a2) Does the convergence necessarily also take place in .Lp ? (b) Let .g ∈ Lp , .0 < p ≤ +∞, and let .gn = g ∧ n ∨ (−n). Prove that .gn → g in p .L as .n → +∞. 1.16 (p. 268) (Do the .Lp spaces become larger or smaller as p increases?) Let .μ be a finite measure on the measurable space .(E, E). (a1) Prove that, if .0 ≤ p ≤ q, then .|x|p ≤ 1 + |x|q for every .x ∈ R and that q p p .L ⊂ L , i.e. the spaces .L become smaller as p increases (recall that .μ is finite). (a2) Prove that, if .f ∈ Lq , then .
lim f p = f q .
p→q−
(1.35)
(a3) Prove that, if .f ∈ Lq , then .
lim
p→q− E
|f |p dμ = +∞ .
(1.36)
(a4) Prove that .
lim f p ≥ f q
(1.37)
p→q+
but that, if .f ∈ Lq0 for some .q0 > q, then .
lim f p = f q .
p→q+
(1.38)
(a5) Give an example of a function that belongs to .Lq for a given value of q, but that does not belong to .Lp for any .p > q, so that, in general, .limp→q+ f p = f q does not hold. (b1) Let .f : E → R be a measurable function. Prove that .
lim f p ≤ f ∞ .
p→+∞
(b2) Let .M ≥ 0. Prove that, for every .p ≥ 0, |f |p dμ ≥ M p μ(|f | ≥ M)
.
E
and deduce the value of .limp→+∞ f p . 1.17 (p. 269) (Again, do the .Lp spaces become larger or smaller as p increases?) Let us consider the set .N endowed with the counting measure: .μ({k}) = 1 for every p q .k ∈ N (hence not a finite measure). Prove that if .p ≤ q, then .L ⊂ L . • The .Lp spaces with respect to the counting measure of .N are usually denoted . p . 1.18 (p. 269) The computation of the integral
+∞
.
0
1 −tx e sin x dx x
(1.39)
for .t > 0 does not look nice. But as 1 sin x = . x
1
cos(xy) dy 0
and Fubinizing. . . Compute the integral in (1.39) and its limit as .t → 0+. 1.19 (p. 270) Let .f, g : Rd → R be integrable functions. Prove that x →
.
Rd
f (y)g(x − y) dy
defines a function in .L1 . This is the convolution of f and g, denoted .f ∗g. Determine a relation between the .L1 norms of f , g and .f ∗ g. • Note the following apparently surprising fact: the two functions .y → f (y) and 1 1 .y → g(x − y) are in .L but, in general, the product of functions of .L is not integrable.
Chapter 2
Probability
2.1 Random Variables, Laws, Expectation A probability space is a triple .(Ω, F, P) where .(Ω, F) is a measurable space and .P a probability on .(Ω, F). Other objects of measure theory appear in probability but sometimes they take a new name that takes into account the role they play in relation to random phenomena. For instance the sets of the .σ -algebra . F are the events. A random variable (r.v.) is a measurable map defined on .(Ω, F, P) with values in some measurable space .(E, E). In most situations .(E, E) will be one among m m .(R, B(R)), .(R, B(R)) (i.e. the values .+∞ or .−∞ are also possible), .(R , B(R )) or .(C, B(C)) and we shall speak, respectively, of real, numerical, m-dimensional or complex r.v.’s. It is not unusual, however, to be led to consider more complicated spaces such as, for instance, matrix groups, the sphere .S2 , or even function spaces, endowed with their Borel .σ -algebras. R.v.’s are traditionally denoted by capital letters (.X, Y, Z, . . . ). They of course enjoy all the properties of measurable maps as seen at the beginning of §1.2. In particular, sums, products, limits,. . . of real r.v.’s are also r.v.’s. The law or distribution of the r.v. .X : (Ω, F) → (E, E) is the image of .P under X, i.e. the probability .μ on .(E, E) defined as μ(A) = P(X−1 (A)) = P({ω; X(ω) ∈ A})
.
A ∈ E.
We shall write .P(X ∈ A) as a shorthand for .P({ω; X(ω) ∈ A}) and we shall write X ∼ Y or .X ∼ μ to indicate that X and Y have the same distribution or that X has law .μ respectively. If X is real, its distribution function F is the distribution function of .μ (see (1.4)). In this case (i.e. dealing with probabilities) we can take F as the increasing and right continuous function
.
F (x) = μ(] − ∞, x]) = P(X ≤ x) .
.
If the real or numerical r.v. X is semi-integrable (upper or lower) with respect to P, its mathematical expectation (or mean), denoted E(X), is the integral ∫ X dP. If X = (X_1, . . . , X_m) is an m-dimensional r.v. we define

E(X) := (E[X_1], . . . , E[X_m]) .

X is said to be centered if E(X) = 0.
If X is (E, E)-valued, μ its law and f : E → ℝ is a measurable function, by Proposition 1.27 (integration with respect to an image measure), f(X) is integrable if and only if

∫_E |f(x)| dμ(x) < +∞

and in this case

E[f(X)] = ∫_E f(x) dμ(x) .   (2.1)
Of course (2.1) holds also if the r.v. .f (X) is only semi-integrable (which is always true if f is positive, for instance). In particular, if X is real-valued and semiintegrable we have .E(X) = x dμ(x) . (2.2) R
This is the relation that is used in practice in order to compute the mathematical expectation of an r.v. The equality (2.2) is also important from a theoretical point of view as it shows that the mathematical expectation depends only on the law: different r.v.'s (possibly defined on different probability spaces) which have the same law also have the same mathematical expectation.
Moreover, (2.1) characterizes the law of X: if the probability μ on (E, E) is such that (2.1) holds for every real bounded measurable function f (or for every measurable positive function f), then necessarily μ is the law of X. This is a useful method to determine the law of X, as better explained in §2.3 below.
The following remark provides an elementary formula for the computation of expectations of positive r.v.'s that we shall use very often.

Remark 2.1 (a) Let X be a positive r.v. having law μ and f : ℝ⁺ → ℝ an absolutely continuous function such that f(X) is integrable. Then

E[f(X)] = f(0) + ∫_0^{+∞} f′(y) P(X ≥ y) dy .   (2.3)
This is a clever application of Fubini's Theorem: actually such an f is a.e. differentiable and

f(x) = f(0) + ∫_0^x f′(y) dy ,

so that

E[f(X)] = ∫_0^{+∞} f(x) dμ(x)
        = f(0) + ∫_0^{+∞} dμ(x) ∫_0^x f′(y) dy
     (!)= f(0) + ∫_0^{+∞} f′(y) ( ∫_y^{+∞} dμ(x) ) dy
        = f(0) + ∫_0^{+∞} f′(y) P(X ≥ y) dy ,   (2.4)
where ! indicates where we apply Fubini's Theorem, concerning the integral of (x, y) ↦ f′(y) on the set {(x, y); 0 ≤ y ≤ x} ⊂ ℝ² with respect to the product measure μ ⊗ λ (λ = Lebesgue's measure). Note however that in order to apply Fubini's theorem the function (x, y) ↦ |f′(y)| 1_{{0 ≤ y ≤ x}} must be integrable with respect to μ ⊗ λ. For instance (2.4) does not hold for f(x) = sin(eˣ), whose derivative exhibits high frequency oscillations at infinity.
Note also that in (2.3) P(X ≥ y) can be replaced by P(X > y): the two functions y ↦ P(X ≥ y) and y ↦ P(X > y) are monotone and coincide except at their points of discontinuity, which are countably many at most, hence of Lebesgue measure 0.
Relation (2.4), replacing f with the identity function x ↦ x and X with f(X), becomes, still for f ≥ 0,

E[f(X)] = ∫_0^{+∞} P(f(X) ≥ t) dt = ∫_0^{+∞} μ(f ≥ t) dt .   (2.5)
(b) If X is positive and integer-valued and f(x) = x, (2.5) takes an interesting form: as P(X ≥ t) = P(X ≥ k + 1) for t ∈ ]k, k + 1], we have

E(X) = ∫_0^{+∞} P(X ≥ t) dt = ∑_{k=0}^∞ ∫_k^{k+1} P(X ≥ t) dt = ∑_{k=0}^∞ P(X ≥ k + 1) = ∑_{k=1}^∞ P(X ≥ k) .
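As an illustration (a sketch, not part of the text), this identity can be checked numerically for a geometric random variable, an arbitrary choice of integer-valued distribution:

```python
# Check of E(X) = sum_{k>=1} P(X >= k) for X geometric on {1, 2, ...}:
# P(X = k) = (1 - q) * q**(k - 1), whose mean is 1 / (1 - q).
q = 0.7
N = 2000                                        # truncation; the tail is negligible
pmf = [(1 - q) * q ** (k - 1) for k in range(1, N)]

mean_direct = sum(k * p for k, p in enumerate(pmf, start=1))
tail_sum = sum(sum(pmf[k - 1:]) for k in range(1, N))   # sum over k of P(X >= k)
print(mean_direct, tail_sum, 1 / (1 - q))       # all three agree
```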
In the sequel we shall often make a slight abuse: we shall consider some r.v.’s without stating on which probability space they are defined. The justification for this is that, in order to make the computations, often it is only necessary to know the law of the r.v.’s concerned and, anyway, the explicit construction of a probability space on which the r.v.’s are defined is always possible (see Remark 2.13 below). The model of a random phenomenon will be a probability space .(Ω, F, P), of an unknown nature, on which some r.v.’s .X1 , . . . , Xn with given laws are defined.
2.2 Independence In this section .(Ω, F, P) is a probability space and all the .σ -algebras we shall consider are sub-.σ -algebras of . F.
Definition 2.2 The σ-algebras B_i, i = 1, . . . , n, are said to be independent if

P( ⋂_{i=1}^n A_i ) = ∏_{i=1}^n P(A_i)   (2.6)
for every choice of .Ai ∈ Bi , .i = 1, . . . , n. The .σ -algebras of a, possibly infinite, family (.Bi , i ∈ I ) are said to be independent if the .σ -algebras of every finite sub-family are independent.
The next remark is obvious but important. Remark 2.3 If the .σ -algebras .(Bi , i ∈ I ) are independent and if, for every i ∈ I , .Bi ⊂ Bi is a sub-.σ -algebra, then the .σ -algebras .(Bi , i ∈ I ) are also independent.
.
The next proposition says that in order to prove the independence of .σ -algebras it is sufficient to check (2.6) for smaller classes of events. This is obviously a very useful simplification.
Proposition 2.4 Let . Ci ⊂ Bi , .i = 1, . . . , n, be families of events that are stable with respect to finite intersections, containing .Ω and such that .Bi = σ ( Ci ). Assume that (2.6) holds for every .Ai ∈ Ci , then the .σ -algebras .Bi , i = 1, . . . , n, are independent.
Proof We must prove that (2.6), which by hypothesis holds for every .Ai ∈ Ci , actually holds for every .Ai ∈ Bi . Let us fix .A2 ∈ C2 , . . . , An ∈ Cn and on .B1 consider the two finite measures defined as n A → P A ∩ Ak
.
k=2
and
n A → P(A)P Ak . k=2
By assumption they coincide on . C1 . Thanks to Carathéodory’s criterion, Proposition 1.11, they coincide also on .B1 . Hence the independence relation (2.6) holds for every .A1 ∈ B1 and .A2 ∈ C2 , . . . , An ∈ Cn . Let us argue by induction: let us consider, for .k = 1, . . . , n, the property n n P Ai = P(Ai ), for Ai ∈ Bi , i = 1, . . . , k, and Ai ∈ Ci , i > k .
.
i=1
(2.7)
i=1
This property is true for .k = 1; note also that the condition to be proved is simply that this property holds for .k = n. If (2.7) holds for .k = r − 1, let .Ai ∈ Bi , .i = 1, . . . , r − 1 and .Ai ∈ Ci , .i = r + 1, . . . , n and let us consider on . Br the two measures .
Br B → P(A1 ∩ . . . ∩ Ar−1 ∩ B ∩ Ar+1 ∩ . . . ∩ An ) Br B → P(A1 ) · · · P(Ar−1 )P(B)P(Ar+1 ) · · · P(An ) .
By the induction assumption they coincide on the events of . Cr . Thanks to Proposition 1.11 (Carathéodory’s criterion again) they coincide also on .Br , and therefore (2.7) holds for .k = r. By induction then (2.7) also holds for .k = n which completes the proof. Next let us consider the property of “independence by packets”: if .B1 , .B2 and .B3 are independent .σ -algebras, are the .σ -algebras .σ (B1 , B2 ) and .B3 also independent? The following proposition gives an answer in a more general setting.
Proposition 2.5 Let .(Bi , i ∈ I) be independent .σ -algebras and .(Ij , j ∈ J ) a partition of . I. Then the .σ -algebras .(σ (Bi , i ∈ Ij ), j ∈ J ) are independent.
Proof As independence of a family of .σ -algebras is by definition the independence of each finite subfamily, it is sufficient to consider the case of a finite J , .J = {1, . . . , n} so that the set of indices . I is partitioned into .I1 , . . . , In . Let, for .j ∈ J , . Cj be the family of all finite intersections of events of the .σ -algebras . Bi for .i ∈ Ij , i.e. .
Cj = {C; C = Aj,i1 ∩ Aj,i2 ∩ . . . ∩ Aj,i , Aj,i1 ∈ Bi1 , . . . , Aj,ik ∈ Bik , i1 , . . . , i ∈ Ij , = 1, 2, . . . } .
The families of events . Cj are stable with respect to finite intersections, generate respectively the .σ -algebras .σ (Bi , i ∈ Ij ) and contain .Ω. As the .Bi , .i ∈ I, are independent, we have, for every choice of .Cj ∈ Cj , .j ∈ J , j n n n P Cj = P (Aj,i1 ∩ . . . ∩ Aj,ij ) = P(Aj,ik )
.
j =1
=
j =1
n j =1
P(Aj,i1 ∩ Aj,i2 ∩ . . . ∩ Aj,ij ) =
j =1 k=1
n
P(Cj ) ,
j =1
and thanks to Proposition 2.4 the .σ -algebras .σ ( Cj ) = σ (Bi , i ∈ Ij ) are independent. From the definition of independence of .σ -algebras we derive the corresponding definitions for r.v.’s and events.
Definition 2.6 The r.v.’s .(Xi )i∈I with values in the measurable spaces (Ei , Ei ) respectively are said to be independent if the generated .σ -algebras .(σ (Xi ))i∈ I are independent. The events .(Ai )i∈I are said to be independent if the .σ -algebras .(σ (Ai ))i∈ I are independent. .
Besides these formal definitions, let us recall the intuition beyond these notions of independence: independent events should be such that the knowledge that some of them have taken place does not give information about whether the other ones will take place or not. In a similar way independent .σ -algebras are such that the knowledge of whether the events of some of them have occurred or not does not provide useful information concerning whether the events of the others have occurred or not. In this sense a .σ algebra can be seen as a “quantity of information”. This intuition is important when we must construct a model (i.e. a probability space) intended to describe a given phenomenon. A typical situation arises, for instance, when considering events related to subsequent coin or die throws, or to the choice of individuals in a sample. However let us not forget that when concerned with proofs or mathematical manipulations, only the formal properties introduced by the definitions must be taken into account. Note that independent r.v.’s may take values in different measurable spaces but, of course, must be defined on the same probability space. Note also that if the events A and B are independent then also A and .B c are independent, as the .σ -algebra generated by an event coincides with the one generated by its complement: .σ (A) = {Ω, A, Ac , ∅} = σ (Ac ). More generally, if .A1 , . . . , An are independent events, then also .B1 , . . . , Bn are independent, where c .Bi = Ai or .Bi = A . i This is in agreement with intuition, as A and .Ac carry the same information. Recall (p. 6) that the .σ -algebra generated by an r.v. X taking its values in a measurable space .(E, E) is formed by the events .X−1 (A) = {X ∈ A}, .A ∈ E. Hence to say that the .(Xi )i∈I are independent means that P(Xi1 ∈ Ai1 , . . . , Xim ∈ Aim ) = P(Xi1 ∈ Ai1 ) · · · P(Xim ∈ Aim )
.
(2.8)
for every finite subset .{i1 , . . . , im } ⊂ I and for every choice of .Ai1 ∈ Ei1 , . . . , Aim ∈ Eim . Thanks to Proposition 2.4, in order to prove the independence of .(Xi )i∈I, it is sufficient to verify (2.8) for .Ai1 ∈ Ci1 , . . . , .Ain ∈ Cin , where, for every i, . Ci is a class of events generating . Ei . If these r.v.’s are real-valued, for instance, it is sufficient for (2.8) to hold for every choice of intervals .Aik . The following statement is immediate.
Lemma 2.7 If the .σ -algebras .(Bi )i∈I are independent and if, for every .i ∈ I, .Xi is .Bi -measurable, then the r.v.’s .(Xi )i∈I are independent.
Actually .σ (Xi ) ⊂ Bi , hence also the .σ -algebras .(σ (Xi ))i∈I are independent (Remark 2.3).
If the r.v.’s .(Xi )i∈I are independent with values respectively in the measurable spaces .(Ei , Ei ) and .fi : Ei → Gi are measurable functions with values respectively in the measurable spaces .(Gi , Gi ), then the r.v.’s .(fi (Xi ))i∈I are also independent as obviously .σ (fi (Xi )) ⊂ σ (Xi ).
In other words, functions of independent r.v.’s are themselves independent, which agrees with the intuitive meaning described previously: if the knowledge of the values taken by some of the .Xi does not give information concerning the values taken by other .Xj ’s, there is no reason why the values taken by some of the .fi (Xi ) should give information about the values taken by other .fj (Xj )’s. The next, fundamental, theorem establishes a relation between independence of r.v’s. and their joint law.
Theorem 2.8 Let .Xi , .i = 1, . . . , n, be r.v.’s with values in the measurable spaces .(Ei , Ei ) respectively. Let us denote by .μ the law of .(X1 , . . . , Xn ), which is an r.v. with values in the product space of the .(Ei , Ei ), and by .μi the law of .Xi , .i = 1, . . . , n. Then .X1 , . . . , Xn are independent if and only if .μ = μ1 ⊗ · · · ⊗ μn .
Proof Let us assume .X1 , . . . , Xn are independent: we have, for every choice of Ai ∈ Ei , .i = 1, . . . , n,
.
μ(A1 × · · · × An ) = P(X1 ∈ A1 , . . . , Xn ∈ An ) .
= P(X1 ∈ A1 ) · · · P(Xn ∈ An ) = μ1 (A1 ) · · · μn (An ) .
(2.9)
Hence .μ coincides with the product measure .μ1 ⊗ · · · ⊗ μn on the rectangles .A1 × · · · × An . Therefore .μ = μ1 ⊗ · · · ⊗ μn . The converse follows at once by writing (2.9) the other way round: if .μ = μ1 ⊗ · · · ⊗ μn P(X1 ∈ A1 , . . . , Xn ∈ An ) = μ(A1 × · · · × An ) = μ1 (A1 ) · · · μn (An ) =
.
= P(X1 ∈ A1 ) · · · P(Xn ∈ An ) . so that .X1 , . . . , Xn are independent
Thanks to Theorem 2.8 the independence of r.v.’s depends only on their joint law: if .X1 , . . . , Xn are independent and .(X1 , . . . , Xn ) has the same law as .(Y1 , . . . , Yn )
(possibly defined on a different probability space), then also .Y1 , . . . , Yn are independent. The following proposition specializes Theorem 2.8 when the r.v.’s .Xi take their values in a metric space.
Proposition 2.9 Let .X1 , . . . , Xm be r.v.’s taking values in the metric spaces E1 ,. . . , .Em . Then .X1 , . . . , Xm are independent if and only if for every choice of bounded continuous functions .fi : Ei → R, .i = 1, . . . , m,
.
E[f1 (X1 ) · · · fm (Xm )] = E[f1 (X1 )] · · · E[fm (Xm )] .
.
(2.10)
If in addition the spaces .Ei are also separable and locally compact, then it is sufficient to check (2.10) for compactly supported continuous functions .fi .
Proof In (2.10) we have the integral of .(x1 , . . . , xm ) → f1 (x1 ) · · · fm (xm ) with respect to the joint law, .μ, of .(X1 , . . . , Xm ) on the left-hand side, whereas on the right-hand side appears the integral of the same function with respect to the product of their laws. The statement then follows immediately from Proposition 1.33.
Corollary 2.10 Let .X1 , . . . , Xn be real integrable independent r.v.’s. Then their product .X1 · · · Xn is integrable and E(X1 · · · Xn ) = E(X1 ) · · · E(Xn ) .
.
Proof This result is obviously related to Proposition 2.9, but for the fact that the function .x → x is not bounded. But Fubini’s Theorem easily handles this difficulty. As the joint law of .(X1 , . . . , Xn ) is the product .μ1 ⊗ · · · ⊗ μn , Fubini’s Theorem gives E(|X1 · · · Xn |) =
|x1 | dμ1 (x1 ) · · ·
|xn | dμn (xn )
.
(2.11)
= E(|X1 |) · · · E(|Xn |) < +∞ . Hence the product .X1 · · · Xn is integrable and, repeating the argument of (2.11) without absolute values, Fubini’s Theorem again gives E(X1 · · · Xn ) =
.
x1 dμ1 (x1 ) · · ·
xn dμn (xn ) = E(X1 ) · · · E(Xn ) .
Remark 2.11 Let X_1, . . . , X_n be r.v.'s taking their values in the measurable spaces E_1, . . . , E_n, countable and endowed with the σ-algebra of all subsets respectively. Then they are independent if and only if for every x_i ∈ E_i we have

P(X_1 = x_1, . . . , X_n = x_n) = P(X_1 = x_1) · · · P(X_n = x_n) .   (2.12)
Actually from this relation it is easy to see that the joint law of .(X1 , . . . , Xn ) coincides with the product law on the rectangles.
Remark 2.12 Given a family (X_i)_{i∈I} of r.v.'s, it is possible to have X_i independent of X_j for every i, j ∈ I, i ≠ j, without the family being formed of independent r.v.'s, as shown in the following example. In other words, pairwise independence is a (much) weaker property than independence.
Let X and Y be independent r.v.'s such that P(X = ±1) = P(Y = ±1) = 1/2 and let Z = XY. We have easily that also P(Z = ±1) = 1/2. X and Z are independent: indeed P(X = 1, Z = 1) = P(X = 1, Y = 1) = 1/4 = P(X = 1)P(Z = 1) and in the same way we see that P(X = i, Z = j) = P(X = i)P(Z = j) for every i, j = ±1, so that the criterion of Remark 2.11 is satisfied. By symmetry Y and Z are also independent.
The three r.v.'s X, Y, Z however are not independent: as X = Z/Y, X is σ(Y, Z)-measurable and σ(X) ⊂ σ(Y, Z). If they were independent σ(X) would be independent of σ(Y, Z) and the events of σ(X) would be independent of themselves. But if A is independent of itself then P(A) = P(A ∩ A) = P(A)², so that it can only have probability equal to 0 or to 1, whereas here the events {X = 1} and {X = −1} belong to σ(X) and have probability 1/2.
Note in this example that σ(X) is independent of σ(Y) and is independent of σ(Z), but is not independent of σ(Y, Z).
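The example lends itself to a quick simulation (an illustration only; the sample size is arbitrary): pairwise the empirical joint frequencies factor, while the triple (X, Y, Z) clearly does not.

```python
# X, Y uniform on {-1, 1}, Z = X * Y: pairwise independent, not jointly independent.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.choice([-1, 1], size=n)
Y = rng.choice([-1, 1], size=n)
Z = X * Y

def joint(a, b, i, j):
    return np.mean((a == i) & (b == j))

# pairwise: empirical joint frequencies are close to 1/4 = product of marginals
for a, b in [(X, Z), (Y, Z), (X, Y)]:
    for i, j in itertools.product([-1, 1], repeat=2):
        assert abs(joint(a, b, i, j) - 0.25) < 0.01

# but not jointly independent: P(X=1, Y=1, Z=1) = 1/4, not 1/8
print(np.mean((X == 1) & (Y == 1) & (Z == 1)))   # approximately 0.25
```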
Remark 2.13 (a) Given a probability .μ on a measurable space .(E, E), it is always possible to construct a probability space .(Ω, F, P) on which an r.v. X is defined with values in .(E, E) and having law .μ. It is sufficient, for instance, to set .Ω = E, . F = E, .P = μ and .X(x) = x. (b) Very often we shall consider sequences .(Xn )n of independent r.v.’s, defined on a probability space .(Ω, F, P) having given laws, .Xi ∼ μi say.
Note that such an object always exists. Actually if .Xi is .(Ei , Ei )-valued and Xi ∼ μi , let
.
Ω = the infinite product set E1 × E2 × · · ·
.
F = the product σ -algebra E1 ⊗ E2 ⊗ · · · P = the infinite product probability μ1 ⊗ μ2 ⊗ · · · , see Theorem 1.36. As the elements of the product set .Ω are of the form .ω = (x1 , x2 , . . . ) with xi ∈ Ei , we can define .Xi (ω) = xi . Such a map is measurable .E → Ei (it is a projector, recall Proposition 1.31) and the sequence .(Xn )n defined in this way satisfies the requested conditions. Independence is guaranteed by the fact that their joint law is the product law.
.
Remark 2.14 Let .Xi , .i = 1, . . . , m, be real independent r.v.’s with values respectively in the measurable spaces .(Ei , Ei ). Let, for every .i = 1, . . . , m, .ρi be a .σ -finite measure on .(Ei , Ei ) such that the law .μi of .Xi has density .fi with respect to .ρi . Then the product measure .μ := μ1 ⊗ · · · ⊗ μm , which is the law of .(X1 , . . . , Xm ), has density f (x) = f1 (x1 ) · · · fm (xm )
.
with respect to the product measure .ρ := ρ1 ⊗ · · · ⊗ ρm . Actually it is immediate that the two measures .f dρ and .μ coincide on rectangles. In this case we shall say that the joint density f is the tensor product of the marginal densities .fi .
Theorem 2.15 (Kolmogorov’s 0-1 Law) Let .(Xn )n be a sequence of independent r.v.’s. Let .Bn = σ (Xk , k ≥ n) and .B∞ = n Bn (the tail ∞ is .P-trivial, i.e. for every .A ∈ B∞ , we have .P(A) = 0 .σ -algebra). Then . B or .P(A) = 1. Moreover, if X is an m-dimensional .B∞ -measurable r.v., then X is constant a.s.
Proof Let . Fn = σ (Xk , k ≤ n), . F∞ = σ (Xk , k ≥ 0). Thanks to Proposition 2.5 (independence by packets) . Fn is independent of .Bn+1 , which is generated by the ∞ ⊂ Bn+1 . .Xi with .i > n. Hence it is also independent of . B
52
2 Probability
Let us prove that .B∞ is independent of . F∞ . The family . C
= n Fn is stable with respect to finite intersections and generates . F∞ . If .A ∈ n Fn , then .A ∈ Fn for some n, hence is independent of .B∞ . Therefore A is independent of .B∞ and by Proposition 2.4 .B∞ and . F∞ are independent. But .B∞ ⊂ F∞ , so that .B∞ is independent of itself. If .A ∈ B∞ , as in Remark 2.12, we have .P(A) = P(A ∩ A) = P(A)P(A), i.e. .P(A) = 0 or .P(A) = 1. If X is a real .B∞ -measurable r.v., then for every .a ∈ R the event .{X ≤ a} belongs to .B∞ and its probability can be equal to 0 or to 1 only. Let .c = sup{a; P(X ≤ a) = 0}, then necessarily .c < +∞, as .1 = P(X < +∞) = lima→∞ P(X ≤ a), so that .P(X ≤ a) > 0 for some k. For every .n > 0, 1 1 1 ∞ .P(X ≤ c + n ) > 0, hence .P(X ≤ c + n ) = 1 as .{X ≤ c + n } ∈ B , whereas 1 .P(X ≤ c − ) = 0. From this we deduce that X takes a.s. only the value c as n ∞ P(X = c) = P {c −
.
n=1
1 n
≤ X ≤ c + n1 } = lim P c − n→∞
1 n
≤X≤c+
1 n
=1.
If .X = (X1 , . . . , Xm ) is m-dimensional, by the previous argument each of the marginals .Xi is a.s. constant and the result follows. If all events of .σ (X) have probability 0 or 1 only then X is a.s. constant also if X takes values in a more general space, see Exercise 2.2. Some consequences of Kolmogorov’s 0-1 law are surprising, at least at first sight. Let .(Xn )n be a sequence of real independent r.v.’s and let .X n = n1 (X1 +· · ·+Xn ) (the empirical means). Then .X = limn→∞ Xn is a tail r.v. Actually we can write, for every integer k, Xn =
.
1 1 (X1 + · · · + Xk ) + (Xk+1 + · · · + Xn ) n n
and as the first term on the right-hand side tends to 0 as .n → ∞, .X does not depend on .X1 , . . . , Xk for every k and is therefore .Bk+1 -measurable. We deduce that .X is measurable with respect to the tail .σ -algebra and is a.s. constant. As the same argument holds for .limn→∞ X n we also have {the sequence (Xn )n is convergent} = { lim X n = lim Xn } ,
.
n→∞
n→∞
which is a tail event and has probability equal to 0 or to 1. Therefore either the sequence .(Xn )n converges a.s. with probability 1 (and in this case the limit is a.s. constant) or it does not converge with probability 1. A similar argument can be developed when investigating the convergence of a series . ∞ n=1 Xn of independent r.v.’s. Also in this case the event .{the series converges} belongs to the tail .σ -algebra, as the convergence of a series does not depend on its first terms. Hence either the series does not converge with probability 1 or is a.s. convergent.
In this
case, however, the sum of the series depends also on its first terms. Hence the r.v. . ∞ n=1 Xn does not necessarily belong to the tail .σ -algebra and need not be constant.
2.3 Computation of Laws Many problems in probability boil down to the computation of the law of an r.v., which is the topic of this section. Recall that if X is an r.v. with values in a measurable space .(E, E), its law is a probability .μ on .(E, E) such that (Proposition 1.27, integration with respect to an image measure) E[φ(X)] =
φ(x) dμ(x)
.
(2.13)
E
for every bounded measurable function .φ : E → R. More precisely, if .μ is a probability on .(E, E) such that (2.13) holds for every bounded measurable function .φ : E → R, then .μ is necessarily the law of X. Let now X be an r.v. with values in .(E, E) having law .μ and let .Φ : E → G be a measurable map from E to some other measurable space .(G, G). How to determine the law, .ν say, of .Φ(X)? We have, by the integration rule with respect to an image probability (Proposition 1.27), E φ(Φ(X)) =
φ(Φ(x)) dμ(x) ,
.
E
but also E φ(Φ(X)) =
φ(y) dν(y) ,
.
G
which takes us to the relation . φ(Φ(x)) dμ(x) = φ(y) dν(y) E
(2.14)
G
and a probability .ν satisfying (2.14) is necessarily the law of .Φ(X). Hence a possible way to compute the law of .Φ(X) is to solve “equation” (2.14) for every bounded measurable function .φ, with .ν as the unknown. This is the method of the “dumb function”. A closer look at (2.14) allows us to foresee that the question boils down naturally to a change of variable. Let us now see some examples of application of this method. Other tools toward the goal of computing the law of an r.v. will be introduced in §2.6 (characteristic functions), §2.7 (Laplace transforms) and §4.3 (conditional laws).
Example 2.16 Let X, Y be ℝ^d- and ℝ^m-valued respectively r.v.'s, having joint density f : ℝ^{d+m} → ℝ with respect to the Lebesgue measure of ℝ^{d+m}. Do X and Y also have a law with a density with respect to the Lebesgue measure (of ℝ^d and ℝ^m respectively)? What are these densities? In other words, how can we compute the marginal densities from the joint density?
We have, for every real bounded measurable function φ,

E[φ(X)] = ∫_{ℝ^d×ℝ^m} φ(x) f(x, y) dx dy = ∫_{ℝ^d} φ(x) dx ∫_{ℝ^m} f(x, y) dy ,

from which we conclude that the law of X is

dμ(x) = f_X(x) dx ,   where   f_X(x) = ∫_{ℝ^m} f(x, y) dy .
Note that the measurability of .fX follows from Fubini’s Theorem.
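A hedged numerical illustration (not from the text; the joint density is chosen to be the standard Gaussian on ℝ², so the answer is known): the marginal density is obtained by integrating out the second variable.

```python
# Marginal density by integrating out y from a joint density f(x, y).
import numpy as np

def joint_density(x, y):
    return np.exp(-(x ** 2 + y ** 2) / 2.0) / (2.0 * np.pi)

n_pts = 4000
dy = 16.0 / n_pts
y_grid = -8.0 + (np.arange(n_pts) + 0.5) * dy      # midpoint grid on [-8, 8]

def marginal(x):
    return (joint_density(x, y_grid) * dy).sum()

for x in (0.0, 0.5, 1.7):
    expected = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # N(0, 1) density
    assert abs(marginal(x) - expected) < 1e-6
print("the marginal of the joint Gaussian density is N(0, 1), as expected")
```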
Example 2.17 Let X, Y be d-dimensional r.v.'s having joint density f : ℝ^d × ℝ^d → ℝ. Does their sum X + Y also have a density with respect to the Lebesgue measure? We have

E[φ(X + Y)] = ∫_{ℝ^d} dy ∫_{ℝ^d} φ(x + y) f(x, y) dx .

With the change of variable z = x + y in the inner integral and changing the order of integration we find

E[φ(X + Y)] = ∫_{ℝ^d} dy ∫_{ℝ^d} φ(z) f(z − y, y) dz = ∫_{ℝ^d} φ(z) g(z) dz ,   where g(z) := ∫_{ℝ^d} f(z − y, y) dy .

Comparing with (2.14), X + Y has density

h(z) = ∫_{ℝ^d} f(z − y, y) dy
with respect to the Lebesgue measure. A change of variable gives that also

h(z) = ∫_{ℝ^d} f(x, z − x) dx .
Given two probabilities .μ, ν on .Rd , their convolution is the image of the product measure .μ ⊗ ν under the “sum” map .Rd × Rd → Rd , .(x, y) → x + y. The convolution is denoted .μ ∗ ν (see also Exercise 1.19). Equivalently, if .X, Y are independent r.v.’s having laws .μ and .ν respectively, then .μ ∗ ν is the law of .X + Y .
Proposition 2.18 If μ, ν are probabilities on ℝ^d with densities f, g with respect to the Lebesgue measure respectively, then their convolution μ ∗ ν has density, still with respect to the Lebesgue measure,

h(z) = ∫_{ℝ^d} f(z − y) g(y) dy = ∫_{ℝ^d} g(z − y) f(y) dy .
Proof Immediate consequence of Remark 2.17 with f(x, y) replaced by f(x)g(y).

Example 2.19 Let W, T be independent r.v.'s having densities respectively exponential of parameter 1/2 and uniform on [0, 2π]. Let R = √W. What is the joint law of (X, Y) where X = R cos T, Y = R sin T? Are X and Y independent?
Going back to (2.14) we must find a density g such that, for every bounded measurable φ : ℝ² → ℝ,

E[φ(X, Y)] = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} φ(x, y) g(x, y) dx dy .   (2.15)
Let us compute first the law of R. For .r > 0 we have, recalling the expression of the d.f. of an exponential law, √ 2 FR (r) = P( W ≤ r) = P(W ≤ r 2 ) = 1 − e−r /2 ,
.
r≥0
and, taking the derivative, the law of .R = Lebesgue measure given by fR (r) = r e−r
.
2 /2
√ W has a density with respect to the
for r > 0
and .fR (r) = 0 for .r ≤ 0. The law of T has a density with respect to the 1 Lebesgue measure that is equal to . 2π on the interval .[0, 2π ] and vanishes elsewhere. Hence .(R, T ) has joint density f (r, t) =
.
1 2 r e−r /2 , 2π
for r > 0, 0 ≤ t ≤ 2π,
and .f (r, t) = 0 otherwise. By the integration formula with respect to an image law, Proposition 1.27, E[φ(X, Y )] = E[φ(R cos T , R sin T )] 2π +∞ 1 2 dt φ(r cos t, r sin t) r e−r /2 dr = 2π 0 0
.
and in cartesian coordinates +∞ +∞ 1 1 2 2 .··· = φ(x, y) e− 2 (x +y ) dx dy . 2π −∞ −∞ Comparing with (2.15) we conclude that the joint density g of .(X, Y ) is g(x, y) =
.
1 − 1 (x 2 +y 2 ) e 2 . 2π
As 1 1 2 2 g(x, y) = √ e−x /2 × √ e−y /2 , 2π 2π
.
g is the density of the product of two .N(0, 1) laws. Hence both X and Y are N(0, 1)-distributed and, as the their joint law is the product of the marginals, they are independent. Note that this is a bit unexpected, as both X and Y depend on R and T .
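A short simulation sketch (illustration only; sample size arbitrary) of this construction, checking empirically that X and Y behave like standard Gaussian and uncorrelated variables:

```python
# Simulation of Example 2.19 (the Box-Muller construction): W ~ Exp(1/2),
# T ~ Uniform[0, 2*pi], R = sqrt(W), X = R cos T, Y = R sin T.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
W = rng.exponential(scale=2.0, size=n)        # Exp(lambda = 1/2) has mean 2
T = rng.uniform(0.0, 2.0 * np.pi, size=n)
R = np.sqrt(W)
X, Y = R * np.cos(T), R * np.sin(T)

print(X.mean(), X.std(), Y.mean(), Y.std())   # approximately 0, 1, 0, 1
print(np.corrcoef(X, Y)[0, 1])                # approximately 0
```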
Example 2.20 Let X be an m-dimensional r.v. having density .f with respect to the Lebesgue measure. Let A be an .m × m invertible matrix and .b ∈ Rm .
Does the r.v. .Y = AX + b also have density with respect to the Lebesgue measure? For every bounded measurable function .φ we have E[φ(Y )] = E[φ(AX + b)] =
.
Rm
φ(Ax + b)f (x) dx .
With the change of variable .y = Ax + b, .x = A−1 (y − b), we have E[φ(Y )] =
φ(y) f (A−1 (y − b))| det A−1 | dy ,
.
Rm
so that Y has density, with respect to the Lebesgue measure, fY (y) =
.
1 f A−1 (y − b) . | det A|
If .b = 0 and .A = −I (I =identical matrix) then we have f−Y (y) = fY (−y) .
.
An r.v. Y such that .Y ∼ −Y is said to be symmetric. Of course such an r.v., if integrable, is centered, as then .E(Y ) = −E(Y ). One might wonder what happens if A is not invertible. See Exercise 2.27.
The next examples show instances of the application of the change of variable formula for multiple integrals in order to solve the dumb function “equation” (2.14).
Example 2.21 Let X, Y be r.v.’s defined on a same probability space, i.i.d. and with density, with respect to the Lebesgue measure, f (x) =
.
1 , x2
x≥1
and .f (x) = 0 otherwise. What is the joint law of .U = XY and .V = X Y? Let us surmise that this joint law has a density g: we should have then, for every bounded Borel function .φ : R2 → R2 , .E φ(U, V ) =
R2
φ(u, v)g(u, v) du dv .
But E φ(U, V ) = E φ(XY, X Y) =
+∞ +∞
.
1
1
φ(xy, xy )
1 dx dy . x2y2
Let us make the change of variable .(u, v) = Ψ (x, y) = (xy, xy ), whose inverse is √ u −1 . .Ψ (u, v) = uv, v Its differential is ⎛ DΨ −1 (u, v) =
.
1 2
v u
⎞
u ⎝ v ⎠ √1 − u3 uv v
and therefore | det DΨ −1 (u, v)| =
.
1 1 1 1 · − − = 4 v v 2v
Moreover the condition .x > 1, y > 1 becomes .u > 1, u1 ≤ v ≤ u. Hence E φ(U, V ) =
+∞
u
du
.
1
φ(u, v) 1/u
1 du dv 2u2 v
and the density of .(U, V ) is g(u, v) =
.
1 1{u>1} 1{ 1 ≤v≤u} . u 2u2 v
g is strictly positive in the shaded region of Fig. 2.1.
Sometimes, even in a multidimensional setting, it is not necessary to use the change of variables formula for multiple integrals, which requires some effort as in Example 2.21: the simpler formula for the one-dimensional integrals may be sufficient, as in the following example.
Example 2.22 Let X and Y be independent and exponential r.v.’s with paramX eter .λ = 1. What is the joint law of X and .Z = X Y ? And the law of . Y ?
Fig. 2.1 The joint density g is positive in the shaded region
v
0
1
u
The joint law of X and Y has density f (x, y) = fX (x)fY (y) = e−x e−y = e−(x+y) ,
.
x > 0, y > 0 .
Let .φ : R2 → R be bounded and measurable, then E φ(X, X Y) =
+∞
+∞
dx
.
0
0
φ(x, xy ) e−x e−y dy .
With the change of variable . xy = z, .dy = − zx2 dz, in the inner integral we have X .E φ(X, Y) =
+∞
+∞
dx 0
φ(x, z) 0
x −x −x/z e e dz . z2
Hence the required joint law has density with respect to the Lebesgue measure g(x, z) =
.
The density of .Z =
X Y
gZ (z) =
.
x −x(1 + 1 ) z , e z2
x > 0, z > 0 .
is the second marginal of g: 1 g(x, z) dx = 2 z
+∞
−x(1 + 1z ) xe dx .
0
This integral can be computed easily by parts, keeping in mind that the integration variable is x and that here z is just a constant. More cleverly, just recognize in the integrand, but for the constant, a Gamma.(2, 1 + 1z ) density.
1 (1+ 1z )2
Hence the integral is equal to . gZ (z) =
.
1 z2 (1 + 1z )2
and =
1 , (1 + z)2
z>0.
See Exercise 2.19 for another approach to the computation of the law of . X Y.
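As a hedged check (an illustration only), a Monte Carlo simulation can be compared with the distribution function z ↦ z/(1+z) corresponding to the density just computed:

```python
# Monte Carlo check of Example 2.22: for X, Y independent Exp(1), Z = X / Y
# has density 1/(1+z)^2, i.e. P(Z <= z) = z / (1 + z).
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.exponential(size=n)
Y = rng.exponential(size=n)
Z = X / Y

for z in (0.5, 1.0, 3.0):
    empirical = np.mean(Z <= z)
    theoretical = z / (1.0 + z)        # integral of 1/(1+t)^2 over [0, z]
    assert abs(empirical - theoretical) < 5e-3
    print(z, empirical, theoretical)
```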
2.4 A Convexity Inequality: Jensen

The integral of a measurable function with respect to a probability (i.e. mathematical expectation) enjoys a convexity inequality. This property is typical of probabilities (see Exercise 2.23) and is related to the fact that, for probabilities, the integral takes the meaning of a mean or, for $\mathbb{R}^m$-valued r.v.'s, of a barycenter.
Recall that a function $\phi:\mathbb{R}^m\to\mathbb{R}\cup\{+\infty\}$ (the value $+\infty$ is also possible) is convex if and only if for every $0\le\lambda\le1$ and $x,y\in\mathbb{R}^m$ we have
$$\phi(\lambda x+(1-\lambda)y)\le\lambda\phi(x)+(1-\lambda)\phi(y)\,.\tag{2.16}$$
It is concave if $-\phi$ is convex, i.e. if in (2.16) $\le$ is replaced by $\ge$. $\phi$ is strictly convex if (2.16) holds with $<$ instead of $\le$ whenever $x\ne y$ and $0<\lambda<1$.
Note that an affine-linear function $f(x)=\langle\alpha,x\rangle+b$, $\alpha\in\mathbb{R}^m$, $b\in\mathbb{R}$, is continuous and convex: (2.16) actually becomes an equality, so that such an f is also concave. In the sequel we shall take advantage of the fact that if $\phi$ is convex and lower semi-continuous (l.s.c.) then
$$\phi(x)=\sup_f f(x)\,,\tag{2.17}$$
the supremum being taken among all affine-linear functions f such that $f\le\phi$. A similar result holds of course for concave and u.s.c. functions (with $\inf$).
Recall (p. 14) that a function f is lower semi-integrable (l.s.i.) with respect to a measure $\mu$ if it is bounded from below by a $\mu$-integrable function and that in this case the integral $\int f\,d\mu$ is defined (possibly $=+\infty$).
Theorem 2.23 (Jensen's Inequality) Let X be an m-dimensional integrable r.v. and $\phi:\mathbb{R}^m\to\mathbb{R}\cup\{+\infty\}$ a convex l.s.c. function (resp. concave and u.s.c.). Then $\phi(X)$ is l.s.i. (resp. u.s.i.) and
$$\mathrm{E}\,\phi(X)\ge\phi\big(\mathrm{E}[X]\big)\qquad(\text{resp. }\mathrm{E}\,\phi(X)\le\phi\big(\mathrm{E}[X]\big))\,.$$
Moreover, if $\phi$ is strictly convex and X is not a.s. constant, in the previous relation the inequality is strict.
Proof Let us assume first $\phi(\mathrm{E}(X))<+\infty$. A hyperplane crossing the graph of $\phi$ at $x=\mathrm{E}(X)$ is an affine-linear function of the form
$$f(x)=\langle\alpha,x-\mathrm{E}(X)\rangle+\phi(\mathrm{E}(X))$$
for some $\alpha\in\mathbb{R}^m$. Note that f and $\phi$ take the same value at $x=\mathrm{E}(X)$. As $\phi$ is convex, there exists such a hyperplane minorizing $\phi$, i.e. such that
$$\phi(x)\ge\langle\alpha,x-\mathrm{E}(X)\rangle+\phi(\mathrm{E}(X))\qquad\text{for all }x\tag{2.18}$$
and therefore
$$\phi(X)\ge\langle\alpha,X-\mathrm{E}(X)\rangle+\phi(\mathrm{E}(X))\,.\tag{2.19}$$
As the r.v. on the right-hand side is integrable, $\phi(X)$ is l.s.i. Taking the mathematical expectation in (2.19) we find
$$\mathrm{E}\,\phi(X)\ge\langle\alpha,\mathrm{E}(X)-\mathrm{E}(X)\rangle+\phi\big(\mathrm{E}(X)\big)=\phi\big(\mathrm{E}(X)\big)\,.\tag{2.20}$$
If $\phi$ is strictly convex, then in (2.18) the inequality is strict for $x\ne\mathrm{E}(X)$. If X is not a.s. equal to its mean $\mathrm{E}(X)$, then the inequality (2.19) is strict on an event of strictly positive probability and therefore in (2.20) a strict inequality holds.
If $\phi(\mathrm{E}(X))=+\infty$ instead, let f be an affine function minorizing $\phi$; then $f(X)$ is integrable and $\phi(X)\ge f(X)$, so that $\phi(X)$ is l.s.i. Moreover,
$$\mathrm{E}\,\phi(X)\ge\mathrm{E}\,f(X)=f\big(\mathrm{E}(X)\big)\,.$$
Taking the supremum over all affine functions f minorizing $\phi$, thanks to (2.17) we find
$$\mathrm{E}\,\phi(X)\ge\phi\big(\mathrm{E}(X)\big)\,,$$
concluding the proof.
By taking particular choices of .φ, from Jensen’s inequality we can derive the classical inequalities that we have already seen in Chap. 1 (see p. 27).
Hölder's Inequality: If p, q are positive numbers such that $\frac1p+\frac1q=1$ then
$$\mathrm{E}\,|XY|\le\mathrm{E}\big(|X|^p\big)^{1/p}\,\mathrm{E}\big(|Y|^q\big)^{1/q}\,.\tag{2.21}$$
If one among $|X|^p$ or $|Y|^q$ is not integrable there is nothing to prove. Otherwise note that the function
$$\phi(x,y)=\begin{cases}x^{1/p}y^{1/q}&x,y\ge0\\-\infty&\text{otherwise}\end{cases}$$
is concave and u.s.c., so that
$$\mathrm{E}\,|XY|=\mathrm{E}\,\phi\big(|X|^p,|Y|^q\big)\le\phi\big(\mathrm{E}[|X|^p],\mathrm{E}[|Y|^q]\big)=\mathrm{E}\big(|X|^p\big)^{1/p}\,\mathrm{E}\big(|Y|^q\big)^{1/q}\,.$$
Note that the condition $\frac1p+\frac1q=1$ requires that both p and q are $\ge1$. Equivalently, if $0\le\alpha,\beta\le1$ with $\alpha+\beta=1$, (2.21) becomes
$$\mathrm{E}(X^\alpha Y^\beta)\le\mathrm{E}(X)^\alpha\,\mathrm{E}(Y)^\beta\tag{2.22}$$
for every pair of positive r.v.'s X, Y. In the particular case $p=q=2$, (2.21) becomes the Cauchy–Schwarz inequality
$$\mathrm{E}\,|XY|\le\mathrm{E}\big(|X|^2\big)^{1/2}\,\mathrm{E}\big(|Y|^2\big)^{1/2}\,.\tag{2.23}$$
Minkowski's Inequality: For every $p\ge1$
$$\mathrm{E}(|X+Y|^p)^{1/p}\le\mathrm{E}(|X|^p)^{1/p}+\mathrm{E}(|Y|^p)^{1/p}\,.\tag{2.24}$$
Again there is nothing to prove unless both X and Y belong to $L^p$. Otherwise (2.24) follows from Jensen's inequality applied to the concave u.s.c. function
$$\phi(x,y)=\begin{cases}\big(x^{1/p}+y^{1/p}\big)^p&x,y\ge0\\-\infty&\text{otherwise}\end{cases}$$
and to the r.v.'s $|X|^p,|Y|^p$: with this notation $\phi(|X|^p,|Y|^p)=(|X|+|Y|)^p$ and we have
$$\mathrm{E}\,|X+Y|^p\le\mathrm{E}\,(|X|+|Y|)^p=\mathrm{E}\,\phi(|X|^p,|Y|^p)\le\phi\big(\mathrm{E}[|X|^p],\mathrm{E}[|Y|^p]\big)=\big(\mathrm{E}[|X|^p]^{1/p}+\mathrm{E}[|Y|^p]^{1/p}\big)^p$$
and now just take the $\frac1p$-th power on both sides.
As we have seen in Chap. 1, the Hölder, Cauchy–Schwarz and Minkowski inequalities hold for every $\sigma$-finite measure. In the case of probabilities however they are particular instances of Jensen's inequality.
From Jensen's inequality we can deduce an inequality between $L^p$ norms: if $p>q$, as $\phi(x)=|x|^{p/q}$ is a continuous convex function, we have
$$\|X\|_p^p=\mathrm{E}\,|X|^p=\mathrm{E}\,\phi(|X|^q)\ge\phi\big(\mathrm{E}[|X|^q]\big)=\big(\mathrm{E}\,|X|^q\big)^{p/q}$$
and, taking the p-th root,
$$\|X\|_p\ge\|X\|_q\,,\tag{2.25}$$
i.e. the $L^p$ norm is an increasing function of p. In particular, if $p\ge q$, $L^p\subset L^q$. This inclusion holds for all finite measures, as seen in Exercise 1.16 a), but inequality (2.25) only holds for $L^p$ spaces with respect to probabilities.
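A quick numerical illustration of (2.25) (a Python sketch, not part of the text; the exponential law is an arbitrary choice) shows the $L^p$ norms growing with p:

```python
# For a probability measure, p -> E(|X|^p)^(1/p) is non-decreasing.
# Here X is exponential of parameter 1, so E(X^p) = Gamma(p+1) = p!.
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(size=10**6)
for p in [1, 2, 3, 4]:
    print(p, np.mean(np.abs(X)**p)**(1 / p))   # increases with p
```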
2.5 Moments, Variance, Covariance

Given an m-dimensional r.v. X and $\alpha>0$, its absolute moment of order $\alpha$ is the quantity $\mathrm{E}\,|X|^\alpha=\|X\|_\alpha^\alpha$. Its absolute centered moment of order $\alpha$ is the quantity $\mathrm{E}(|X-\mathrm{E}(X)|^\alpha)$.
The variance of a real r.v. X is its second order centered moment, i.e.
$$\mathrm{Var}(X)=\mathrm{E}\big[(X-\mathrm{E}(X))^2\big]\,.\tag{2.26}$$
Note that X has finite variance if and only if $X\in L^2$: if X has finite variance, then, as $X=(X-\mathrm{E}(X))+\mathrm{E}(X)$, X is in $L^2$, being the sum of square integrable r.v.'s. And if $X\in L^2$, also $X-\mathrm{E}(X)\in L^2$ for the same reason.
Recalling that $\mathrm{E}(X)$ is a constant we have
$$\mathrm{E}\big[(X-\mathrm{E}(X))^2\big]=\mathrm{E}\big[X^2-2X\mathrm{E}(X)+\mathrm{E}(X)^2\big]=\mathrm{E}(X^2)-2\,\mathrm{E}\big[X\mathrm{E}(X)\big]+\mathrm{E}(X)^2=\mathrm{E}(X^2)-\mathrm{E}(X)^2\,,$$
which provides an alternative expression for the variance:
$$\mathrm{Var}(X)=\mathrm{E}(X^2)-\mathrm{E}(X)^2\,.\tag{2.27}$$
This is the formula that is used in practice for the computation of the variance. As the variance is always positive, this relation also shows that we always have $\mathrm{E}(X^2)\ge\mathrm{E}(X)^2$, which we already know from Jensen's inequality.
The following properties are immediate from the definition of the variance:
$$\mathrm{Var}(X+a)=\mathrm{Var}(X)\,,\quad a\in\mathbb{R}\,,\qquad\mathrm{Var}(\lambda X)=\lambda^2\,\mathrm{Var}(X)\,,\quad\lambda\in\mathbb{R}\,.$$
Rm
|x|α μ(dx) ,
Rm
|x − E(X)|α μ(dx) .
The moments of X give information about the probability for X to take large values. The centered moments, similarly, give information about the probability for X to take values far from the mean. This aspect is made precise by the following two (very) important inequalities.
Markov’s Inequality: For every .t > 0, .α > 0, E |X|α .P |X| > t ≤ tα
which is immediate as E |X|α ≥ E |X|α 1{|X|>t} ≥ t α P |X| > t ,
.
where we use the obvious fact that .|X|α ≥ t α on the event .{|X| > t}.
(2.28)
Applied to the r.v. $X-\mathrm{E}(X)$ with $\alpha=2$, Markov's inequality (2.28) becomes

Chebyshev's Inequality: For every $t>0$
$$\mathrm{P}\big(|X-\mathrm{E}(X)|\ge t\big)\le\frac{\mathrm{Var}(X)}{t^2}\,\cdot\tag{2.29}$$
Let us investigate now the variance of the sum of two r.v.'s:
$$\mathrm{Var}(X+Y)=\mathrm{E}\big[\big(X+Y-\mathrm{E}(X)-\mathrm{E}(Y)\big)^2\big]=\mathrm{E}\big[\big(X-\mathrm{E}(X)\big)^2\big]+\mathrm{E}\big[\big(Y-\mathrm{E}(Y)\big)^2\big]+2\,\mathrm{E}\big[\big(X-\mathrm{E}(X)\big)\big(Y-\mathrm{E}(Y)\big)\big]$$
and, if we set
$$\mathrm{Cov}(X,Y):=\mathrm{E}\big[(X-\mathrm{E}(X))(Y-\mathrm{E}(Y))\big]=\mathrm{E}(XY)-\mathrm{E}(X)\mathrm{E}(Y)\,,$$
then
$$\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y)\,.$$
$\mathrm{Cov}(X,Y)$ is the covariance of X and Y. Note that $\mathrm{Cov}(X,Y)$ is nothing else than the scalar product in $L^2$ of $X-\mathrm{E}(X)$ and $Y-\mathrm{E}(Y)$. Hence it is well defined and finite if X and Y have finite variance and, by the Cauchy–Schwarz inequality,
$$\mathrm{Cov}(X,Y)\le\mathrm{E}\big[|X-\mathrm{E}(X)|\cdot|Y-\mathrm{E}(Y)|\big]\le\mathrm{E}\big[(X-\mathrm{E}[X])^2\big]^{1/2}\,\mathrm{E}\big[(Y-\mathrm{E}[Y])^2\big]^{1/2}=\mathrm{Var}(X)^{1/2}\,\mathrm{Var}(Y)^{1/2}\,.\tag{2.30}$$
If X and Y are independent, by Corollary 2.10,
$$\mathrm{Cov}(X,Y)=\mathrm{E}\big[(X-\mathrm{E}(X))(Y-\mathrm{E}(Y))\big]=\mathrm{E}\big[X-\mathrm{E}(X)\big]\,\mathrm{E}\big[Y-\mathrm{E}(Y)\big]=0\,,$$
hence if X and Y are independent
$$\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)\,.$$
The converse is not true: there are examples of r.v.’s having vanishing covariance, without being independent. This is hardly surprising: as remarked above .Cov(X, Y ) = 0 means that .E(XY ) = E(X)E(Y ), whereas independence requires .E[g(X)h(Y )] = E[g(X)]E[h(Y )] for every pair of bounded Borel functions .g, h. Lack of correlation appears to be a much weaker condition.
If .Cov(X, Y ) = 0, X and Y are said to be uncorrelated. For an intuitive interpretation of the covariance, see Example 2.24 below.
If $X=(X_1,\dots,X_m)$ is an m-dimensional r.v., its covariance matrix is the $m\times m$ matrix C whose elements are
$$c_{ij}=\mathrm{Cov}(X_i,X_j)=\mathrm{E}\big[\big(X_i-\mathrm{E}(X_i)\big)\big(X_j-\mathrm{E}(X_j)\big)\big]\,.$$
C is a symmetric matrix having on the diagonal the variances of the components of X and outside the diagonal their covariances. Therefore if $X_1,\dots,X_m$ are independent their covariance matrix is diagonal. The converse of course is not true.
An elegant way of manipulating the covariance matrix is to write
$$C=\mathrm{E}\big[\big(X-\mathrm{E}(X)\big)\big(X-\mathrm{E}(X)\big)^*\big]\,,\tag{2.31}$$
where $X-\mathrm{E}(X)$ is a column vector. Indeed $(X-\mathrm{E}(X))(X-\mathrm{E}(X))^*$ is a matrix whose entries are the r.v.'s $(X_i-\mathrm{E}(X_i))(X_j-\mathrm{E}(X_j))$, whose expectation is $c_{ij}$.
From (2.31) we easily see how a covariance matrix transforms under a linear map: if A is a $d\times m$ matrix, then the covariance matrix of the d-dimensional r.v. AX is
$$C_{AX}=\mathrm{E}\big[\big(AX-\mathrm{E}(AX)\big)\big(AX-\mathrm{E}(AX)\big)^*\big]=A\,\mathrm{E}\big[\big(X-\mathrm{E}(X)\big)\big(X-\mathrm{E}(X)\big)^*\big]A^*=ACA^*\,.\tag{2.32}$$
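A short numerical check of (2.32) (illustrative Python sketch, not part of the text; the matrices C and A are arbitrary examples):

```python
# The empirical covariance matrix of AX should be close to A C A*.
import numpy as np

rng = np.random.default_rng(3)
C = np.array([[2.0, 0.5], [0.5, 1.0]])                 # covariance matrix of X
A = np.array([[1.0, -1.0], [0.0, 2.0], [3.0, 1.0]])    # a 3 x 2 matrix
X = rng.multivariate_normal(mean=[0, 0], cov=C, size=10**6)
AX = X @ A.T
print(np.cov(AX, rowvar=False))   # empirical covariance of AX
print(A @ C @ A.T)                # A C A*, should be close
```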
An important remark: the covariance matrix is always positive definite, i.e. for every $\xi\in\mathbb{R}^m$
$$\langle C\xi,\xi\rangle=\sum_{i,j=1}^m c_{ij}\xi_i\xi_j\ge0\,.$$
Actually
$$\langle C\xi,\xi\rangle=\sum_{i,j=1}^m c_{ij}\xi_i\xi_j=\sum_{i,j=1}^m\mathrm{E}\big[\xi_i\big(X_i-\mathrm{E}(X_i)\big)\,\xi_j\big(X_j-\mathrm{E}(X_j)\big)\big]=\mathrm{E}\Big[\Big(\sum_{i=1}^m\xi_i\big(X_i-\mathrm{E}(X_i)\big)\Big)^2\Big]=\mathrm{E}\big[\langle\xi,X-\mathrm{E}(X)\rangle^2\big]\ge0\,.\tag{2.33}$$
Recall that a matrix is positive definite if and only if (it is symmetric and) all its eigenvalues are $\ge0$.

Example 2.24 (The Regression "Line") Let us consider a real r.v. Y and an m-dimensional r.v. X, defined on the same probability space $(\Omega,\mathcal{F},\mathrm{P})$, both of them square integrable. What is the affine-linear function of X that best approximates Y? That is, we need to find a number $b\in\mathbb{R}$ and a vector $a\in\mathbb{R}^m$ such that the difference $\langle a,X\rangle+b-Y$ is "smallest". The simplest way (not the only one) to measure this discrepancy is to use the $L^2$ norm of this difference, which leads us to search for the values a and b that minimize the quantity $\mathrm{E}[(\langle a,X\rangle+b-Y)^2]$. We shall assume that the covariance matrix C of X is strictly positive definite, hence invertible.
Let us add and subtract the expectations so that
$$\mathrm{E}\big[(\langle a,X\rangle+b-Y)^2\big]=\mathrm{E}\big[\big(\langle a,X-\mathrm{E}(X)\rangle+\tilde b-(Y-\mathrm{E}(Y))\big)^2\big]\,,$$
where $\tilde b=b+\langle a,\mathrm{E}(X)\rangle-\mathrm{E}(Y)$. Let us find the minimizer of
$$a\mapsto S(a):=\mathrm{E}\big[\big(\langle a,X-\mathrm{E}(X)\rangle+\tilde b-(Y-\mathrm{E}(Y))\big)^2\big]\,.$$
Expanding the square we have
$$S(a)=\mathrm{E}\big[\langle a,X-\mathrm{E}(X)\rangle^2\big]+\tilde b^2+\mathrm{E}\big[(Y-\mathrm{E}(Y))^2\big]-2\,\mathrm{E}\big[\langle a,X-\mathrm{E}(X)\rangle(Y-\mathrm{E}(Y))\big]\tag{2.34}$$
(the contribution of the other two double products vanishes because $X-\mathrm{E}(X)$ and $Y-\mathrm{E}(Y)$ have expectation equal to 0 and $\tilde b$ is a number). Thanks to (2.33) (read from right to left) $\mathrm{E}[\langle a,X-\mathrm{E}(X)\rangle^2]=\langle Ca,a\rangle$ and also
$$\mathrm{E}\big[\langle a,X-\mathrm{E}(X)\rangle(Y-\mathrm{E}(Y))\big]=\sum_{i=1}^m a_i\,\mathrm{E}\big[(X_i-\mathrm{E}(X_i))(Y-\mathrm{E}(Y))\big]=\sum_{i=1}^m a_i\,\mathrm{Cov}(X_i,Y)=\langle a,R\rangle\,,$$
where by R we denote the vector of the covariances, $R_i=\mathrm{Cov}(X_i,Y)$. Hence (2.34) can be written
$$S(a)=\langle Ca,a\rangle+\tilde b^2+\mathrm{Var}(Y)-2\langle a,R\rangle\,.$$
Let us look for the critical points of S: its differential is
$$DS(a)=2Ca-2R$$
and in order for $DS(a)$ to vanish we must have
$$a=C^{-1}R\,.$$
We see easily that this critical value is also a minimizer (S is a polynomial of the second degree that tends to infinity as $|a|\to\infty$). Now just choose for b the value such that $\tilde b=0$, i.e.
$$b=\mathrm{E}(Y)-\langle a,\mathrm{E}(X)\rangle=\mathrm{E}(Y)-\langle C^{-1}R,\mathrm{E}(X)\rangle\,.$$
If $m=1$ (i.e. X is a real r.v.) C is the scalar $\mathrm{Var}(X)$ and $R=\mathrm{Cov}(X,Y)$, so that
$$a=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\,,\qquad b=\mathrm{E}(Y)-a\,\mathrm{E}(X)=\mathrm{E}(Y)-\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\,\mathrm{E}(X)\,.\tag{2.35}$$
.
Its differential is $b\mapsto-2\mathrm{E}(Y)+2b$ and vanishes at $b=\mathrm{E}(Y)$, which is the minimizer we were looking for. Hence its mathematical expectation is also the constant r.v. that best approximates Y (still in the sense of $L^2$).
Even if, as noted above, there are many examples of pairs X, Y of non-independent r.v.'s whose covariance vanishes, the covariance is often used to measure "how independent X and Y are". However, when used for this purpose, the covariance has a drawback, as it is sensitive to changes of scale: if $\alpha,\beta>0$, $\mathrm{Cov}(\alpha X,\beta Y)=\alpha\beta\,\mathrm{Cov}(X,Y)$ whereas, intuitively, the dependence between X and Y should be the same as the dependence between $\alpha X$ and $\beta Y$ (think of a change of unit of measure, for instance). More useful in this sense is the correlation coefficient $\rho_{X,Y}$ of X and Y, which is defined as
$$\rho_{X,Y}:=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
and is invariant under scale changes. Thanks to (2.30) we have $-1\le\rho_{X,Y}\le1$. In some sense, values of $\rho_{X,Y}$ close to 0 indicate "almost independence" whereas values close to 1 or $-1$ indicate a "strong dependence", in unison or in countertrend respectively.
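A small sketch of the regression line of Example 2.24 (illustrative Python code with hypothetical data, not taken from the text): the coefficients $a=\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ and $b=\mathrm{E}(Y)-a\,\mathrm{E}(X)$ agree with an ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=10**5)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=10**5)   # Y linear in X plus noise (made-up model)

a = np.cov(X, Y)[0, 1] / np.var(X)
b = np.mean(Y) - a * np.mean(X)
print(a, b)                        # close to 2 and 1
print(np.polyfit(X, Y, deg=1))     # least-squares slope and intercept: same line
```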
2.6 Characteristic Functions

Let X be an m-dimensional r.v. Its characteristic function is the function $\phi:\mathbb{R}^m\to\mathbb{C}$ defined as
$$\phi(\theta)=\mathrm{E}(e^{i\langle\theta,X\rangle})=\mathrm{E}(\cos\langle\theta,X\rangle)+i\,\mathrm{E}(\sin\langle\theta,X\rangle)\,.\tag{2.36}$$
The characteristic function is defined for every m-dimensional r.v. X because, for every $\theta\in\mathbb{R}^m$, $|e^{i\langle\theta,X\rangle}|=1$, so that the complex r.v. $e^{i\langle\theta,X\rangle}$ is always integrable. Moreover, thanks to (1.7) (the integral of the modulus is larger than the modulus of the integral), $|\mathrm{E}(e^{i\langle\theta,X\rangle})|\le\mathrm{E}(|e^{i\langle\theta,X\rangle}|)=1$, so that
$$|\phi(\theta)|\le1\qquad\text{for every }\theta\in\mathbb{R}^m$$
and obviously $\phi(0)=1$. Proposition 1.27 (integration with respect to an image law) gives
$$\phi(\theta)=\int_{\mathbb{R}^m}e^{i\langle\theta,x\rangle}\,d\mu(x)\,,\tag{2.37}$$
where $\mu$ denotes the law of X. The characteristic function therefore depends only on the law of X and we can speak equally of the characteristic function of an r.v. or of a probability law.
Whenever there is a danger of ambiguity we shall write $\phi_X$ or $\phi_\mu$ in order to specify the characteristic function of the r.v. X or of its law $\mu$. Sometimes we shall write $\widehat\mu(\theta)$ instead of $\phi_\mu(\theta)$, which stresses the close ties between characteristic functions and Fourier transforms.
Characteristic functions enjoy many properties that make them a very useful computation tool. If $\mu$ and $\nu$ are probabilities on $\mathbb{R}^m$ we have
$$\phi_{\mu*\nu}(\theta)=\phi_\mu(\theta)\,\phi_\nu(\theta)\,.\tag{2.38}$$
Indeed, if X and Y are independent with laws $\mu$ and $\nu$ respectively, then
$$\phi_{\mu*\nu}(\theta)=\phi_{X+Y}(\theta)=\mathrm{E}(e^{i\langle\theta,X+Y\rangle})=\mathrm{E}(e^{i\langle\theta,X\rangle}e^{i\langle\theta,Y\rangle})=\mathrm{E}(e^{i\langle\theta,X\rangle})\,\mathrm{E}(e^{i\langle\theta,Y\rangle})=\phi_\mu(\theta)\,\phi_\nu(\theta)\,.$$
Moreover
$$\phi_{-X}(\theta)=\mathrm{E}(e^{-i\langle\theta,X\rangle})=\overline{\mathrm{E}(e^{i\langle\theta,X\rangle})}=\overline{\phi_X(\theta)}\,.\tag{2.39}$$
Therefore if X is symmetric (i.e. such that $X\sim-X$) then $\phi_X$ is real-valued. What about the converse? If $\phi_X$ is real-valued is it true that X is symmetric? See below.
It is easy to see how characteristic functions transform under affine-linear maps: if $Y=AX+b$, with A a $d\times m$ matrix and $b\in\mathbb{R}^d$, Y is $\mathbb{R}^d$-valued and for $\theta\in\mathbb{R}^d$
$$\phi_Y(\theta)=\mathrm{E}(e^{i\langle\theta,AX+b\rangle})=e^{i\langle\theta,b\rangle}\,\mathrm{E}(e^{i\langle A^*\theta,X\rangle})=\phi_X(A^*\theta)\,e^{i\langle\theta,b\rangle}\,.\tag{2.40}$$
Example 2.25 In the following examples $m=1$ and therefore $\theta\in\mathbb{R}$. For the computations we shall always take advantage of (2.37).
(a) Binomial $B(n,p)$: thanks to the binomial rule
$$\phi(\theta)=\sum_{k=0}^n\binom nk p^k(1-p)^{n-k}e^{i\theta k}=\sum_{k=0}^n\binom nk(pe^{i\theta})^k(1-p)^{n-k}=(1-p+pe^{i\theta})^n\,.$$
(b) Geometric:
$$\phi(\theta)=\sum_{k=0}^\infty p(1-p)^ke^{i\theta k}=p\sum_{k=0}^\infty\big((1-p)e^{i\theta}\big)^k=\frac{p}{1-(1-p)e^{i\theta}}\,\cdot$$
(c) Poisson:
$$\phi(\theta)=e^{-\lambda}\sum_{k=0}^\infty\frac{\lambda^k}{k!}\,e^{i\theta k}=e^{-\lambda}\sum_{k=0}^\infty\frac{(\lambda e^{i\theta})^k}{k!}=e^{-\lambda}e^{\lambda e^{i\theta}}=e^{\lambda(e^{i\theta}-1)}\,.$$
(d) Exponential:
$$\phi(\theta)=\lambda\int_0^{+\infty}e^{-\lambda x}e^{i\theta x}\,dx=\lambda\int_0^{+\infty}e^{x(i\theta-\lambda)}\,dx=\frac{\lambda}{i\theta-\lambda}\,e^{x(i\theta-\lambda)}\Big|_{x=0}^{x=+\infty}=\frac{\lambda}{i\theta-\lambda}\Big(\lim_{x\to+\infty}e^{x(i\theta-\lambda)}-1\Big)\,.$$
As the complex number $e^{x(i\theta-\lambda)}$ has modulus $|e^{x(i\theta-\lambda)}|=e^{-\lambda x}$, vanishing as $x\to+\infty$, we have $\lim_{x\to+\infty}e^{x(i\theta-\lambda)}=0$ and
$$\phi(\theta)=\frac{\lambda}{\lambda-i\theta}\,\cdot$$
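A quick Monte Carlo check of (d) (illustrative Python sketch, not part of the text; the value of $\lambda$ and the sample size are arbitrary): the empirical characteristic function of an exponential sample is close to $\lambda/(\lambda-i\theta)$.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0
X = rng.exponential(scale=1 / lam, size=10**6)
for theta in [0.5, 1.0, 3.0]:
    empirical = np.mean(np.exp(1j * theta * X))
    exact = lam / (lam - 1j * theta)
    print(theta, empirical, exact)   # empirical and exact values should agree closely
```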
Let us now investigate regularity of characteristic functions. Looking at (2.37), $\widehat\mu$ appears to be an integral depending on a parameter. Let us begin with continuity. We have
$$|\widehat\mu(\theta)-\widehat\mu(\theta_0)|=\big|\mathrm{E}(e^{i\langle\theta,X\rangle})-\mathrm{E}(e^{i\langle\theta_0,X\rangle})\big|\le\mathrm{E}\big[|e^{i\langle\theta,X\rangle}-e^{i\langle\theta_0,X\rangle}|\big]\,.$$
If $\theta\to\theta_0$, then $|e^{i\langle\theta,X\rangle}-e^{i\langle\theta_0,X\rangle}|\to0$. As also $|e^{i\langle\theta,X\rangle}-e^{i\langle\theta_0,X\rangle}|\le2$, by Lebesgue's Theorem
$$\lim_{\theta\to\theta_0}|\widehat\mu(\theta)-\widehat\mu(\theta_0)|=0\,,$$
so that $\widehat\mu$ is continuous. $\widehat\mu$ is actually always uniformly continuous (Exercise 2.41).
In order to investigate differentiability, let us assume first $m=1$ (i.e. $\mu$ is a probability on $\mathbb{R}$). Proposition 1.21 (differentiability of integrals depending on a parameter) states that in order for
$$\theta\mapsto\mathrm{E}\big[f(\theta,X)\big]=\int f(\theta,x)\,d\mu(x)$$
to be differentiable it is sufficient that the derivative $\frac{\partial f}{\partial\theta}(\theta,x)$ exists for $\mu$-a.s. every x and that the bound
$$\sup_{\theta\in\mathbb{R}}\Big|\frac{\partial}{\partial\theta}f(\theta,x)\Big|\le g(x)$$
holds for some function g such that $g(X)$ is integrable. In our case
$$\Big|\frac{\partial}{\partial\theta}\,e^{i\theta x}\Big|=|ix\,e^{i\theta x}|=|x|\,.$$
Hence
$$\sup_{\theta\in\mathbb{R}}\Big|\frac{\partial}{\partial\theta}\,e^{i\theta X}\Big|\le|X|$$
and if X is integrable $\widehat\mu$ is differentiable and we can take the derivative under the integral sign, i.e.
$$\widehat\mu\,'(\theta)=\int_{-\infty}^{+\infty}ix\,e^{i\theta x}\,\mu(dx)=\mathrm{E}(iXe^{i\theta X})\,.\tag{2.41}$$
A repetition of the same argument for the integrand $f(\theta,x)=ixe^{i\theta x}$ gives
$$\Big|\frac{\partial}{\partial\theta}\,ixe^{i\theta x}\Big|=|-x^2e^{i\theta x}|=|x|^2\,,$$
hence, if X has a finite second order moment, $\widehat\mu$ is twice differentiable and
$$\widehat\mu\,''(\theta)=-\int_{-\infty}^{+\infty}x^2e^{i\theta x}\,\mu(dx)\,.\tag{2.42}$$
Repeating the argument above we see, by induction, that if $\mu$ has a finite absolute moment of order k, then $\widehat\mu$ is k times differentiable and
$$\widehat\mu^{(k)}(\theta)=\int_{-\infty}^{+\infty}(ix)^ke^{i\theta x}\,\mu(dx)\,.\tag{2.43}$$
We have the following much more precise result.
Proposition 2.26 If $\mu$ has a finite moment of order k then $\widehat\mu$ is k times differentiable and (2.43) holds. Conversely, if $\widehat\mu$ is k times differentiable and k is even, then $\mu$ has a finite moment of order k and (therefore) (2.43) holds.
Proof The first part of the statement has already been proved. Assume, first, that $k=2$. As $\widehat\mu$ is twice differentiable we know that
$$\lim_{\theta\to0}\frac{\widehat\mu(\theta)+\widehat\mu(-\theta)-2\widehat\mu(0)}{\theta^2}=\widehat\mu\,''(0)$$
(just replace $\widehat\mu$ by its order two Taylor polynomial). But
$$\frac{2\widehat\mu(0)-\widehat\mu(\theta)-\widehat\mu(-\theta)}{\theta^2}=\int_{-\infty}^{+\infty}\frac{2-e^{i\theta x}-e^{-i\theta x}}{\theta^2}\,\mu(dx)=\int_{-\infty}^{+\infty}2\,\frac{1-\cos(\theta x)}{x^2\theta^2}\,x^2\,\mu(dx)\,.$$
The last integrand is positive and converges to $x^2$ as $\theta\to0$. Hence taking the limit as $\theta\to0$, by Fatou's Lemma,
$$-\widehat\mu\,''(0)\ge\int_{-\infty}^{+\infty}x^2\,\mu(dx)\,,$$
which proves that $\mu$ has a finite moment of the second order and, thanks to the first part of the statement, for every $\theta\in\mathbb{R}$,
$$\widehat\mu\,''(\theta)=-\int_{-\infty}^{+\infty}x^2e^{i\theta x}\,\mu(dx)\,.$$
The proof is completed by induction: let us assume that it has already been proved that if $\widehat\mu$ is k times differentiable (k even) then $\mu$ has a finite moment of order k and
$$\widehat\mu^{(k)}(\theta)=\int_{-\infty}^{+\infty}(ix)^ke^{i\theta x}\,\mu(dx)\,.\tag{2.44}$$
If $\widehat\mu$ is $k+2$ times differentiable then
$$\lim_{\theta\to0}\frac{\widehat\mu^{(k)}(\theta)+\widehat\mu^{(k)}(-\theta)-2\widehat\mu^{(k)}(0)}{\theta^2}=\widehat\mu^{(k+2)}(0)$$
and, multiplying (2.44) by $i^k$ and noting that $i^{2k}=1$ (recall that k is even),
$$i^k\,\frac{2\widehat\mu^{(k)}(0)-\widehat\mu^{(k)}(\theta)-\widehat\mu^{(k)}(-\theta)}{\theta^2}=i^k\int_{-\infty}^{+\infty}(ix)^k\,\frac{2-e^{i\theta x}-e^{-i\theta x}}{\theta^2}\,\mu(dx)=\int_{-\infty}^{+\infty}x^k\,\frac{2-e^{i\theta x}-e^{-i\theta x}}{\theta^2}\,\mu(dx)=\int 2\,\frac{1-\cos(\theta x)}{x^2\theta^2}\,x^{k+2}\,\mu(dx)\,,\tag{2.45}$$
so that the left-hand side above is real and positive and, as $\theta\to0$, by Fatou's Lemma as above,
$$-i^k\,\widehat\mu^{(k+2)}(0)\ge\int_{-\infty}^{+\infty}x^{k+2}\,\mu(dx)\,,$$
hence $\mu$ has a finite $(k+2)$-th order moment (note that (2.45) ensures that the quantity $i^k\widehat\mu^{(k+2)}(0)$ is real).

Remark 2.27 A closer look at the previous proof allows us to say something more: if k is even it is sufficient for $\widehat\mu$ to be differentiable k times at the origin in order to ensure that the moment of order k of $\mu$ is finite: if $\widehat\mu$ is differentiable k times at 0 and k is even, then $\widehat\mu$ is differentiable k times everywhere.
For $\theta=0$ (2.43) becomes
$$\widehat\mu^{(k)}(0)=i^k\int_{-\infty}^{+\infty}x^k\,\mu(dx)\,,\tag{2.46}$$
which allows us to compute the moments of $\mu$ simply by taking the derivatives of $\widehat\mu$ at 0. Beware however: examples are known where $\widehat\mu$ is differentiable but $\mu$ does not have a finite mathematical expectation. If, instead, $\widehat\mu$ is twice differentiable, thanks to Proposition 2.26 (2 is even) X has a finite moment of order 2 (and therefore also a finite mathematical expectation). In order to find a necessary and sufficient condition for the characteristic function to be differentiable the curious reader can look at the book of Brancovan and Jeulin [3], Proposition 8.6, p. 154.
Similar arguments (only more complicated to express) give analogous results for probabilities on $\mathbb{R}^m$. More precisely, let $\alpha=(\alpha_1,\dots,\alpha_m)$ be a multiindex and let us denote as usual
$$|\alpha|=\alpha_1+\dots+\alpha_m\,,\qquad x^\alpha=x_1^{\alpha_1}\cdots x_m^{\alpha_m}\,,\qquad\frac{\partial^\alpha}{\partial\theta^\alpha}=\frac{\partial^{\alpha_1}}{\partial\theta_1^{\alpha_1}}\cdots\frac{\partial^{\alpha_m}}{\partial\theta_m^{\alpha_m}}\,\cdot$$
Then if
$$\int_{\mathbb{R}^m}|x|^{|\alpha|}\,\mu(dx)<+\infty\,,$$
$\widehat\mu$ is $|\alpha|$ times differentiable and
$$\frac{\partial^\alpha}{\partial\theta^\alpha}\,\widehat\mu(\theta)=\int_{\mathbb{R}^m}(ix)^\alpha e^{i\langle\theta,x\rangle}\,\mu(dx)\,.$$
In particular,
$$\frac{\partial\widehat\mu}{\partial\theta_k}(0)=i\int_{\mathbb{R}^m}x_k\,\mu(dx)\,,\qquad\frac{\partial^2\widehat\mu}{\partial\theta_k\partial\theta_h}(0)=-\int_{\mathbb{R}^m}x_hx_k\,\mu(dx)\,,$$
i.e. the gradient of $\widehat\mu$ at the origin is equal to i times the expectation and, if $\mu$ is centered, the Hessian of $\widehat\mu$ at the origin is equal to minus the covariance matrix.

Example 2.28 (Characteristic Function of Gaussian Laws, First Method) If $\mu=N(0,1)$ then
$$\widehat\mu(\theta)=\frac1{\sqrt{2\pi}}\int_{-\infty}^{+\infty}e^{i\theta x}e^{-x^2/2}\,dx\,.\tag{2.47}$$
This integral can be computed by the following argument (which is also valid for other characteristic functions). As $\mu$ has finite mean, by (2.41) and integrating by parts,
$$\widehat\mu\,'(\theta)=\frac1{\sqrt{2\pi}}\int_{-\infty}^{+\infty}ix\,e^{i\theta x}e^{-x^2/2}\,dx=-\frac1{\sqrt{2\pi}}\,ie^{i\theta x}e^{-x^2/2}\Big|_{-\infty}^{+\infty}+\frac1{\sqrt{2\pi}}\,i\cdot i\theta\int_{-\infty}^{+\infty}e^{i\theta x}e^{-x^2/2}\,dx=-\theta\,\widehat\mu(\theta)\,,$$
i.e. $\widehat\mu$ solves the linear differential equation
$$u'(\theta)=-\theta\,u(\theta)$$
with the initial condition $u(0)=1$. Its solution is
$$\widehat\mu(\theta)=e^{-\theta^2/2}\,.$$
If $Y\sim N(b,\sigma^2)$, as $Y=\sigma X+b$ with $X\sim N(0,1)$ and thanks to (2.40),
$$\widehat\mu_Y(\theta)=e^{-\frac12\sigma^2\theta^2}e^{i\theta b}\,.$$
We shall soon see another method of computation (Example 2.37 b)) of the characteristic function of Gaussian laws.
The computation of the characteristic function of the $N(0,1)$ law of the previous example allows us to derive a relation that is important in view of the next statement.
Let $X_1,\dots,X_m$ be i.i.d. $N(0,\sigma^2)$-distributed r.v.'s. Then X has a density with respect to the Lebesgue measure of $\mathbb{R}^m$ given by
$$f_\sigma(x)=\frac1{\sqrt{2\pi}\,\sigma}\,e^{-\frac1{2\sigma^2}x_1^2}\cdots\frac1{\sqrt{2\pi}\,\sigma}\,e^{-\frac1{2\sigma^2}x_m^2}=\frac1{(2\pi)^{m/2}\sigma^m}\,e^{-\frac1{2\sigma^2}|x|^2}\,.\tag{2.48}$$
Its characteristic function, for $\theta=(\theta_1,\dots,\theta_m)$, is
$$\phi_\sigma(\theta)=\mathrm{E}(e^{i\langle\theta,X\rangle})=\mathrm{E}(e^{i\theta_1X_1+\dots+i\theta_mX_m})=\mathrm{E}(e^{i\theta_1X_1})\cdots\mathrm{E}(e^{i\theta_mX_m})=e^{-\frac12\sigma^2\theta_1^2}\cdots e^{-\frac12\sigma^2\theta_m^2}=e^{-\frac12\sigma^2|\theta|^2}\,.\tag{2.49}$$
We have therefore
$$e^{-\frac12\sigma^2|\theta|^2}=\frac1{(2\pi)^{m/2}\sigma^m}\int_{\mathbb{R}^m}e^{-\frac1{2\sigma^2}|x|^2}e^{i\langle\theta,x\rangle}\,dx$$
and, exchanging the roles of x and $\theta$ and replacing $\sigma$ by $\frac1\sigma$, we obtain the relation
$$e^{-\frac1{2\sigma^2}|x|^2}=\frac{\sigma^m}{(2\pi)^{m/2}}\int_{\mathbb{R}^m}e^{-\frac12\sigma^2|\theta|^2}e^{i\langle\theta,x\rangle}\,d\theta\,,$$
which finally gives
$$f_\sigma(x)=\frac1{(2\pi)^m}\int_{\mathbb{R}^m}e^{-\frac12\sigma^2|\theta|^2}e^{i\langle\theta,x\rangle}\,d\theta\,.\tag{2.50}$$
Given a function $\psi\in C_0(\mathbb{R}^m)$, let
$$\psi_\sigma(x)=\int_{\mathbb{R}^m}f_\sigma(x-y)\,\psi(y)\,dy\,.$$

Lemma 2.29 For every $\psi\in C_0(\mathbb{R}^m)$ we have
$$\psi_\sigma\ \mathop{\longrightarrow}_{\sigma\to0+}\ \psi\qquad\text{uniformly.}\tag{2.51}$$
Proof We have, for every $\delta>0$,
$$|\psi(x)-\psi_\sigma(x)|=\Big|\int_{\mathbb{R}^m}f_\sigma(x-y)\big(\psi(x)-\psi(y)\big)\,dy\Big|\le\int_{\mathbb{R}^m}f_\sigma(x-y)\,|\psi(x)-\psi(y)|\,dy$$
$$=\int_{\{|y-x|\le\delta\}}f_\sigma(x-y)\,|\psi(x)-\psi(y)|\,dy+\int_{\{|y-x|>\delta\}}f_\sigma(x-y)\,|\psi(x)-\psi(y)|\,dy:=I_1+I_2\,.$$
First, let $\delta>0$ be such that $|\psi(x)-\psi(y)|\le\varepsilon$ whenever $|x-y|\le\delta$ ($\psi$ is uniformly continuous), so that $I_1\le\varepsilon$. Moreover,
$$I_2\le2\|\psi\|_\infty\int_{\{|y-x|>\delta\}}f_\sigma(x-y)\,dy$$
and, if $X=(X_1,\dots,X_m)$ denotes an r.v. with density $f_\sigma$, by Markov's inequality,
$$\int_{\{|y-x|>\delta\}}f_\sigma(x-y)\,dy=\int_{\{|z|>\delta\}}f_\sigma(z)\,dz=\mathrm{P}(|X|\ge\delta)\le\frac1{\delta^2}\,\mathrm{E}(|X|^2)=\frac1{\delta^2}\,\mathrm{E}(|X_1|^2+\dots+|X_m|^2)=\frac{m\sigma^2}{\delta^2}\,\cdot$$
Then just choose $\sigma$ small enough so that $2\|\psi\|_\infty\frac{m\sigma^2}{\delta^2}\le\varepsilon$, which gives
$$|\psi(x)-\psi_\sigma(x)|\le2\varepsilon\qquad\text{for every }x\in\mathbb{R}^m\,.$$
Theorem 2.30 Let .μ, ν be probabilities on .Rm such that μ(θ ) = ν(θ )
.
for every θ ∈ Rm .
Then .μ = ν.
Proof Note that the relation . μ(θ ) = ν(θ ) for every .θ ∈ Rm means that
.
Rm
f dμ =
Rm
f dν
(2.52)
for every function of the form $f(x)=e^{i\langle\theta,x\rangle}$. Theorem 2.30 will follow as soon as we prove that (2.52) holds for every function $\psi\in C_K(\mathbb{R}^m)$ (Lemma 1.25). Let $\psi\in C_K(\mathbb{R}^m)$ and $\psi_\sigma$ as in (2.51). We have
$$\int_{\mathbb{R}^m}\psi_\sigma(x)\,d\mu(x)=\int_{\mathbb{R}^m}d\mu(x)\int_{\mathbb{R}^m}\psi(y)f_\sigma(x-y)\,dy$$
and thanks to (2.50) and then to Fubini's Theorem
$$\cdots=\frac1{(2\pi)^m}\int_{\mathbb{R}^m}d\mu(x)\int_{\mathbb{R}^m}\psi(y)\,dy\int_{\mathbb{R}^m}e^{-\frac12\sigma^2|\theta|^2}e^{i\langle\theta,x-y\rangle}\,d\theta$$
$$=\frac1{(2\pi)^m}\int_{\mathbb{R}^m}\psi(y)\,dy\int_{\mathbb{R}^m}e^{-\frac12\sigma^2|\theta|^2}e^{-i\langle\theta,y\rangle}\,d\theta\int_{\mathbb{R}^m}e^{i\langle\theta,x\rangle}\,d\mu(x)=\frac1{(2\pi)^m}\int_{\mathbb{R}^m}\psi(y)\,dy\int_{\mathbb{R}^m}e^{-\frac12\sigma^2|\theta|^2}e^{-i\langle\theta,y\rangle}\,\widehat\mu(\theta)\,d\theta\,.\tag{2.53}$$
Of course we have previously checked that, as $\psi\in C_K$, the function
$$(y,x,\theta)\mapsto\big|\psi(y)\,e^{-\frac12\sigma^2|\theta|^2}e^{i\langle\theta,x-y\rangle}\big|=|\psi(y)|\,e^{-\frac12\sigma^2|\theta|^2}$$
is integrable with respect to $\lambda_m(dy)\otimes\lambda_m(d\theta)\otimes\mu(dx)$ ($\lambda_m=$ the Lebesgue measure of $\mathbb{R}^m$), which authorizes the application of Fubini's Theorem. As the integral only depends on $\widehat\mu$ and $\widehat\mu=\widehat\nu$, we obtain
$$\int_{\mathbb{R}^m}\psi_\sigma(x)\,d\mu(x)=\int_{\mathbb{R}^m}\psi_\sigma(x)\,d\nu(x)$$
and now, thanks to Lemma 2.29,
$$\int_{\mathbb{R}^m}\psi(x)\,d\mu(x)=\lim_{\sigma\to0+}\int_{\mathbb{R}^m}\psi_\sigma(x)\,d\mu(x)=\lim_{\sigma\to0+}\int_{\mathbb{R}^m}\psi_\sigma(x)\,d\nu(x)=\int_{\mathbb{R}^m}\psi(x)\,d\nu(x)\,.$$
Example 2.31 Let $\mu\sim N(a,\sigma^2)$ and $\nu\sim N(b,\tau^2)$. What is the law $\mu*\nu$? Note that
$$\phi_{\mu*\nu}(\theta)=\widehat\mu(\theta)\,\widehat\nu(\theta)=e^{ia\theta}e^{-\frac12\sigma^2\theta^2}\,e^{ib\theta}e^{-\frac12\tau^2\theta^2}=e^{i(a+b)\theta}e^{-\frac12(\sigma^2+\tau^2)\theta^2}\,.$$
Therefore .μ ∗ ν has the same characteristic function as an .N(a + b, σ 2 + τ 2 ) law, hence .μ ∗ ν = N(a + b, σ 2 + τ 2 ). The same result can also be obtained by computing the convolution integral of Proposition 2.18, but the computation, although elementary, is neither short nor amusing.
Example 2.32 Let X be an r.v. whose characteristic function is real-valued. Then X is symmetric. Indeed $\phi_{-X}(\theta)=\phi_X(-\theta)=\overline{\phi_X(\theta)}=\phi_X(\theta)$: X and $-X$ have the same characteristic function, hence the same law.
Theorem 2.30 is of great importance from a theoretical point of view but unfortunately it is not constructive, i.e. it does not give any indication about how, knowing the characteristic function . μ, it is possible to obtain, for instance, the distribution function of .μ or its density, with respect to the Lebesgue measure or the counting measure of .Z, if it exists. This question has a certain importance also because, as in Example 2.31, characteristic functions provide a simple method of computation of the law of the sum of independent r.v.’s: just compute their characteristic functions, then the characteristic function of their sum (easy, it is the product). At this point, what can we do in order to derive from this characteristic function some information on the law? The following theorem gives an element of an answer in this sense. Example 2.34 and Exercises 2.40 and 2.32 are also concerned with this question of “inverting” the characteristic function.
Theorem 2.33 (Inversion) Let $\mu$ be a probability on $\mathbb{R}^m$. If $\widehat\mu$ is integrable then $\mu$ is absolutely continuous and has a density with respect to the Lebesgue measure given by
$$f(x)=\frac1{(2\pi)^m}\int_{\mathbb{R}^m}e^{-i\langle\theta,x\rangle}\,\widehat\mu(\theta)\,d\theta\,.\tag{2.54}$$
A proof and more general inversion results (giving answers also when .μ does not have a density) can be found in almost all books listed in the references section.
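As a numerical illustration of (2.54) in dimension 1 (a Python sketch, not part of the text): integrating $e^{-i\theta x}\widehat\mu(\theta)$ for the $N(0,1)$ characteristic function $\widehat\mu(\theta)=e^{-\theta^2/2}$ recovers the Gaussian density.

```python
import numpy as np
from scipy import integrate

def density_from_cf(x):
    # real part only: the imaginary part integrates to zero by symmetry
    integrand = lambda theta: np.real(np.exp(-1j * theta * x) * np.exp(-theta**2 / 2))
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val / (2 * np.pi)

for x in [0.0, 1.0, 2.0]:
    print(x, density_from_cf(x), np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
```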
Fig. 2.2 Graph of the characteristic function .φ of Example 2.34
Example 2.34 Let $\phi$ be the function $\phi(\theta)=1-|\theta|$ for $-1\le\theta\le1$, extended periodically to the whole of $\mathbb{R}$ as in Fig. 2.2. Let us prove that $\phi$ is a characteristic function and determine the corresponding law.
As $\phi$ is periodic, we can consider its Fourier series
$$\phi(\theta)=\frac12\,a_0+\sum_{k=1}^\infty a_k\cos(k\pi\theta)=\sum_{k=-\infty}^\infty b_k\,e^{ik\pi\theta}\,,\tag{2.55}$$
where $b_k=\frac12a_{|k|}$ for $k\ne0$ and $b_0=\frac12a_0$. The series converges uniformly, $\phi$ being continuous. In the series only the cosines appear, as $\phi$ is even.
A closer look at (2.55) indicates that $\phi$ is the characteristic function of an r.v. X taking the values $k\pi$, $k\in\mathbb{Z}$, with probability $b_k$, provided we can prove that the numbers $a_k$ are positive. Note that we know already that the sum of the $b_k$'s is equal to 1, as $1=\phi(0)=\sum_{k=-\infty}^\infty b_k$. Let us compute these Fourier coefficients: we have
$$a_0=\int_{-1}^1\phi(\theta)\,d\theta=\int_{-1}^1(1-|\theta|)\,d\theta=2\int_0^1(1-\theta)\,d\theta=1$$
and, for $k>0$,
$$a_k=\int_{-1}^1(1-|\theta|)\cos(k\pi\theta)\,d\theta=-\int_{-1}^1|\theta|\cos(k\pi\theta)\,d\theta=-2\int_0^1\theta\cos(k\pi\theta)\,d\theta$$
$$=-\frac2{k\pi}\,\theta\sin(k\pi\theta)\Big|_0^1+2\int_0^1\frac1{k\pi}\sin(k\pi\theta)\,d\theta=-\frac2{(k\pi)^2}\cos(k\pi\theta)\Big|_0^1=\frac2{(k\pi)^2}\,\big(1-\cos(k\pi)\big)\,,$$
i.e. $\frac12a_0=\frac12$ and
$$a_k=\begin{cases}\dfrac4{(k\pi)^2}&k\text{ odd}\\ 0&k\text{ even}\,.\end{cases}$$
Therefore $\phi$ is the characteristic function of a $\mathbb{Z}$-valued r.v. X such that, for $m\ge0$,
$$\mathrm{P}\big(X=\pm(2m+1)\pi\big)=\frac12\,a_{|2m+1|}=\frac2{\pi^2(2m+1)^2}$$
and $\mathrm{P}(X=0)=\frac12$. Note that X does not have a finite mathematical expectation, but this we already knew, as $\phi$ is not differentiable.
This example shows, on one hand, the link between characteristic functions and Fourier series, in the case of $\mathbb{Z}$-valued r.v.'s. On the other hand, together with Exercise 2.32, it provides an example of a pair of characteristic functions that coincide in a neighborhood of the origin but that correspond to very different laws (the one of Exercise 2.32 is absolutely continuous with respect to the Lebesgue measure, whereas $\phi$ is the characteristic function of a discrete law).

Let $X_1,\dots,X_m$ be r.v.'s with values in $\mathbb{R}^{n_1},\dots,\mathbb{R}^{n_m}$ respectively and let us consider, for $n=n_1+\dots+n_m$, the $\mathbb{R}^n$-valued r.v. $X=(X_1,\dots,X_m)$. Let us denote by $\phi$ its characteristic function. Then it is easy to obtain the characteristic function $\phi_{X_k}$ of the k-th marginal of X. Indeed, recalling that $\phi$ is defined on $\mathbb{R}^n$ whereas $\phi_{X_k}$ is defined on $\mathbb{R}^{n_k}$,
$$\phi_{X_k}(\theta)=\mathrm{E}(e^{i\langle\theta,X_k\rangle})=\mathrm{E}(e^{i\langle\tilde\theta,X\rangle})=\phi(\tilde\theta)\,,\qquad\theta\in\mathbb{R}^{n_k}\,,$$
where $\tilde\theta=(0,\dots,0,\theta,0,\dots,0)$ is the vector of $\mathbb{R}^n$ all of whose components vanish except for those in the $(n_1+\dots+n_{k-1}+1)$-th to the $(n_1+\dots+n_k)$-th position.
Assume the r.v.'s $X_1,\dots,X_m$ to be independent; if $\theta_1\in\mathbb{R}^{n_1},\dots,\theta_m\in\mathbb{R}^{n_m}$ and $\theta=(\theta_1,\dots,\theta_m)\in\mathbb{R}^n$ then
$$\phi_X(\theta)=\mathrm{E}(e^{i\langle\theta,X\rangle})=\mathrm{E}(e^{i\langle\theta_1,X_1\rangle}\cdots e^{i\langle\theta_m,X_m\rangle})=\phi_{X_1}(\theta_1)\cdots\phi_{X_m}(\theta_m)\,.\tag{2.56}$$
(2.56) can also be expressed in terms of laws: if $\mu_1,\dots,\mu_m$ are probabilities respectively on $\mathbb{R}^{n_1},\dots,\mathbb{R}^{n_m}$ and $\mu=\mu_1\otimes\dots\otimes\mu_m$ then
$$\widehat\mu(\theta)=\widehat\mu_1(\theta_1)\cdots\widehat\mu_m(\theta_m)\,.\tag{2.57}$$
Actually we have the following result, which provides a characterization of independence in terms of characteristic functions.

Proposition 2.35 Let $X_1,\dots,X_m$ be r.v.'s with values in $\mathbb{R}^{n_1},\dots,\mathbb{R}^{n_m}$ respectively and $X=(X_1,\dots,X_m)$. Then $X_1,\dots,X_m$ are independent if and only if, for every $\theta_1\in\mathbb{R}^{n_1},\dots,\theta_m\in\mathbb{R}^{n_m}$ and $\theta=(\theta_1,\dots,\theta_m)$, we have
$$\phi_X(\theta)=\phi_{X_1}(\theta_1)\cdots\phi_{X_m}(\theta_m)\,.\tag{2.58}$$
Proof If the .Xi ’s are independent we have already seen that (2.58) holds. Conversely, if (2.58) holds, then X has the same characteristic function as the product of the laws of the .Xi ’s. Therefore by Theorem 2.30 the law of X is the product law and the .Xi ’s are independent.
2.7 The Laplace Transform

Let X be an m-dimensional r.v., $\mu$ its law and $z\in\mathbb{C}^m$. The complex Laplace transform (CLT) of X (or of $\mu$) is the function
$$L(z)=\mathrm{E}(e^{\langle z,X\rangle})=\int_{\mathbb{R}^m}e^{\langle z,x\rangle}\,d\mu(x)\,,\tag{2.59}$$
defined for those values $z\in\mathbb{C}^m$ such that $e^{\langle z,X\rangle}$ is integrable. Obviously L is always defined on the imaginary axes, as on them $|e^{\langle z,X\rangle}|=1$, and actually between the CLT L and the characteristic function $\phi$ we have the relation
$$L(i\theta)=\phi(\theta)\qquad\text{for every }\theta\in\mathbb{R}^m\,.$$
Hence the knowledge of the CLT L implies the knowledge of the characteristic function $\phi$, which is the restriction of L to the imaginary axes. The domain of the CLT is the set of complex vectors $z\in\mathbb{C}^m$ such that $e^{\langle z,X\rangle}$ is integrable. Recalling
that $e^{\langle z,x\rangle}=e^{\langle\Re z,x\rangle}\big(\cos\langle\Im z,x\rangle+i\sin\langle\Im z,x\rangle\big)$, the domain of L is the set of the $z\in\mathbb{C}^m$ such that
$$\int_{\mathbb{R}^m}|e^{\langle z,x\rangle}|\,d\mu(x)=\int_{\mathbb{R}^m}e^{\langle\Re z,x\rangle}\,d\mu(x)<+\infty\,.$$
The domain of the CLT of $\mu$ will be denoted $D_\mu$. We shall restrict ourselves to the case $m=1$ from now on. We have
$$\int_{-\infty}^{+\infty}e^{\Re z\,x}\,d\mu(x)=\int_{-\infty}^0e^{\Re z\,x}\,d\mu(x)+\int_0^{+\infty}e^{\Re z\,x}\,d\mu(x):=I_1+I_2\,.$$
Clearly if $\Re z\le0$ then $I_2<+\infty$, as the integrand is then smaller than 1. Moreover the function $t\mapsto\int_0^{+\infty}e^{tx}\,d\mu(x)$ is increasing. Therefore if
$$x_2:=\sup\Big\{t;\ \int_0^{+\infty}e^{tx}\,d\mu(x)<+\infty\Big\}$$
Theorem 2.36 Let .μ be a probability on .R. Then there exist .x1 , x2 ∈ R (the convergence abscissas) with .x1 ≤ 0 ≤ x2 (possibly .x1 = 0 = x2 ) such that the Laplace transform, L, of .μ is defined in the strip .S = {z; x1 < ℜz < x2 }, whereas it is not defined for .ℜz > x2 or .ℜz < x1 . Moreover L is holomorphic in S.
Proof We need only prove that the CLT is holomorphic in S and this will follow as soon as we check that in S the Cauchy-Riemann equations are satisfied, i.e., if .z = x + iy and .L = L1 + iL2 , .
∂L2 ∂L1 , = ∂x ∂y
∂L1 ∂L2 =− · ∂y ∂x
The idea is simple: if .t ∈ R, then .z → ezt is holomorphic, hence satisfies the Cauchy-Riemann equations and L(z) =
+∞
.
−∞
ezt dμ(t) ,
(2.60)
so that we must just verify that in (2.60) we can take the derivatives under the integral sign. Let us check that the conditions of Proposition 1.21 (derivation under the integral sign) are satisfied. We have L1 (x, y) =
+∞
.
−∞
ext cos(yt) dμ(t),
L2 (x, y) =
+∞
−∞
ext sin(yt) dμ(t) .
As we assume .x + iy ∈ S, there exists an .ε > 0 such that .x1 + ε < x < x2 − ε (.x1 , x2 are the convergence abscissas). For .L1 , the derivative of the integrand with respect to x is .t → text cos(yt). Now the map .t → text e−(x2 −ε)t is bounded on .R+ (a global maximum is attained at .t = (x2 − ε − x)−1 ). Hence for some constant .c2 we have |t|ext ≤ c2 e(x2 −ε)t
.
for t ≥ 0 .
Similarly there exists a constant .c1 such that |t|ext ≤ c1 e(x1 +ε)t
.
for t ≤ 0 .
Hence the condition of Proposition 1.21 (derivation under the integral sign) is satisfied with .g(t) = c2 e(x2 −ε)t + c1 e(x1 +ε)t , which is integrable with respect to .μ, as .x2 − ε and .x1 + ε both belong to the convergence strip S. The same argument allows us to prove that also for .L2 we can take the derivative under the integral sign, and the first Cauchy-Riemann equation is satisfied: .
∂L1 (x, y) = ∂x
=
+∞ −∞
∂ xt e cos(yt) dμ(t) = ∂x
+∞ −∞
+∞
−∞
text cos(yt) dμ(t)
∂ xt ∂L2 e sin(yt) dμ(t) = (x, y) . ∂y ∂y
We can argue in the same way for the second Cauchy-Riemann equation.
Recall that a holomorphic function is identified as soon as its value is known on a set having at least one cluster point (uniqueness of analytic continuation). Typically, therefore, the knowledge of the Laplace transform on the real axis (or on a nonvoid open interval) determines its value on the whole of the convergence strip (which, recall, is an open set). This also provides a method of computation for characteristic functions, as shown in the next example.
Example 2.37 (a) Let X be a Cauchy-distributed r.v., i.e. with density with respect to the Lebesgue measure 1 1 · π 1 + x2
f (x) =
.
Then
1 .L(t) = π
+∞
−∞
etx dx 1 + x2
and therefore .L(t) = +∞ for every .t = 0. In this case the domain is the imaginary axis .ℜz = 0 only and the convergence strip is empty. (b) Assume .X ∼ N(0, 1). Then, for .t ∈ R, 1 L(t) = √ 2π
+∞
.
−∞
etx e−x
2 /2
2 /2
et dx = √
2π
+∞ −∞
1
e− 2 (x−t) dx = et 2
2 /2
and the convergence strip is the whole of .C. Moreover, by analytic continuation, 2 the Laplace transform of X is .L(z) = ez /2 for all .z ∈ C. In particular, for −t 2 /2 which gives, in a different .z = it, on the imaginary axis we have .L(it) = e way, the characteristic function of an .N(0, 1) law. (c) If .X ∼ Gamma.(α, λ) then, for .t ∈ R, L(t) =
.
λα Γ (α)
+∞
x α−1 etx e−λx dx .
0
This integral converges if and only if .t < λ, hence the convergence strip is S = {ℜz < λ} and does not depend on .α. If .t < λ, recalling the integrals of the Gamma distributions,
.
L(t) =
.
λα · (λ − t)α
Thanks to the uniqueness of the analytic continuation we have, for .ℜz < λ, L(z) =
.
λ α λ−z
from which we obtain the characteristic function φ(t) = L(it) =
.
λ α . λ − it
(2.61)
(d) If X is Poisson distributed with parameter .λ, then, again for .z ∈ R, L(z) = e−λ
∞ λk
.
k=0
k!
ezk = e−λ
∞ (ez λ)k k=0
k!
= e−λ ee
zλ
= eλ(e
z −1)
(2.62)
and the convergence abscissas are infinite.
The Laplace transform of the sum of independent r.v.’s is easy to compute, in a similar way to the case of characteristic functions: if X and Y are independent, then LX+Y (z) = E(ez,X+Y ) = E(ez,X ez,Y ) = E(ez,X )E(ez,Y ) = LX (z)LY (z) .
.
Note however that as, in general, the Laplace transform is not everywhere defined, the domain of .LX+Y is the intersection of the domains of .LX and .LY . If the abscissas of convergence are both different from 0, then the CLT is analytic at 0, thanks to Theorem 2.36. Hence the characteristic function .φX (t) = LX (it) is infinitely many times differentiable and (Theorem 2.26) the moments of all orders are finite. Moreover, as iLX (0) = φX (0) = i E(X)
.
we have .L (0) = E(X). Also the higher order moments of X can be obtained by taking the derivatives of the CLT: it is easy to see that k L(k) X (0) = E(X ) .
.
(2.63)
More information on the law of X can be gathered from the Laplace transform, see e.g. Exercises 2.44 and 2.47.
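As a numerical illustration of (2.63) for the Gamma$(\alpha,\lambda)$ law of Example 2.37 (c) (a Python sketch, not part of the text; the parameter values are arbitrary): finite differences of $L(t)=(\lambda/(\lambda-t))^\alpha$ at 0 reproduce the first two moments.

```python
alpha, lam, h = 3.0, 2.0, 1e-5
L = lambda t: (lam / (lam - t))**alpha

first  = (L(h) - L(-h)) / (2 * h)             # ~ E(X)   = alpha/lambda
second = (L(h) - 2 * L(0) + L(-h)) / h**2     # ~ E(X^2) = alpha(alpha+1)/lambda^2
print(first, alpha / lam)
print(second, alpha * (alpha + 1) / lam**2)
```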
2.8 Multivariate Gaussian Laws Let .X1 , . . . , Xm be i.i.d. .N(0, 1)-distributed r.v.’s; we have seen in (2.48) and (2.49) that the vector .X = (X1 , . . . , Xm ) has density f (x) =
.
1 1 2 e− 2 |x| m/2 (2π )
with respect to the Lebesgue measure and characteristic function 1
φ(θ ) = e− 2 |θ| .
.
2
(2.64)
This law is the prototype of a particularly important family of multidimensional laws. If .Y = AX + b for an .m × m matrix A and .b ∈ Rm , then, by (2.40), ∗ θ|2
1
φY (θ ) = eiθ,b φX (A∗ θ ) = eiθ,b e− 2 |A .
∗ θ,θ
1
= eiθ,b e− 2 AA
1
∗ θ,A∗ θ
= eiθ,b e− 2 A
(2.65)
.
Recall that throughout this book “positive” means .≥ 0.
Theorem 2.38 Given a vector .b ∈ Rm and an .m × m positive definite matrix C, there exists a probability .μ on .Rm such that 1
μ(θ ) = eiθ,b e− 2 Cθ,θ .
.
We shall say that such a .μ is an .N(b, C) law (normal, or Gaussian, with mean b and covariance matrix C).
Proof Taking into account (2.65), it suffices to prove that a matrix A exists such that .AA∗ = C. It is a classical result of linear algebra that such a matrix always exists, provided C is positive definite, and even that A can be chosen symmetric (and therefore such that .A2 = C); in this case we say that A is the square root of C. Actually if C is diagonal ⎛ ⎜ C=⎝
0
λ1 ..
.
0
.
⎞ ⎟ ⎠
λm
as all the eigenvalues .λi are .≥ 0 (C is positive definite) we can just choose ⎛√ ⎜ A=⎝
0
λ1 ..
.
0
.
√ λm
⎞ ⎟ ⎠ .
Otherwise (i.e. if C is not diagonal) there exists an orthogonal matrix O such that OCO −1 is diagonal. It is immediate that .OCO −1 is also positive definite so that there exists a matrix B such that .B 2 = OCO −1 . Then if .A := O −1 BO, A is symmetric (as .O −1 = O ∗ ) and is the matrix we were looking for as
.
A2 = O −1 BO · O −1 BO = O −1 B 2 O = C .
.
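A minimal sketch (in Python, not part of the text) of the construction in the proof of Theorem 2.38: the symmetric square root A of a positive definite matrix C obtained by diagonalization, and the sampling of $Y=AX+b$ with $X\sim N(0,I)$, whose empirical covariance is then close to C. The particular C and b below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(6)
C = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

eigval, O = np.linalg.eigh(C)            # C = O diag(eigval) O*
A = O @ np.diag(np.sqrt(eigval)) @ O.T   # symmetric square root: A A* = A^2 = C
print(A @ A)                             # recovers C

X = rng.standard_normal((10**6, 2))
Y = X @ A.T + b                          # samples of N(b, C)
print(Y.mean(axis=0))                    # close to b
print(np.cov(Y, rowvar=False))           # close to C
```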
The r.v. X introduced at the beginning of this section is therefore .N(0, I )-distributed (.I = the identity matrix). In the remainder of this chapter we draw attention to the many important properties of this class of distributions. Note that, according to the definition, an r.v. having characteristic function .θ → eiθ,b is Gaussian. Hence Dirac masses are Gaussian and a Gaussian r.v. need not have a density with respect to the Lebesgue measure. See also below. • A remark that simplifies the manipulation of the .N(b, C) laws consists in recalling (2.65), i.e. that it is the law of an r.v. of the form .AX + b with .X ∼ N(0, I ) and A a square root of C. Hence an r.v. .Y ∼ N(b, C) can always be written .Y = AX + b, with .X ∼ N(0, I ). • If .Y ∼ N (b, C), then b is indeed the mean of Y and C its covariance matrix, as anticipated in the statement of Theorem 2.38. This is obvious if .b = 0 and .C = I , recalling the way we defined the .N(0, I ) laws. In general, as .Y = AX + b, where A is the square root of C and .X ∼ N(0, I ), we have immediately .E(Y ) = E(AX + b) = AE(X) + b = b. Moreover, the covariance matrix of Y is .AI A∗ = AA∗ = C, thanks to the transformation rule of covariance matrices under linear maps (2.32). • If C is invertible then the .N(b, C) law has a density with respect to the Lebesgue measure. Indeed in this case the square root A of C is also invertible (the eigenvalues of A are the square roots of those of C, which are all .> 0). If .Y ∼ N (b, C), hence of the form .Y = AX + b with .X ∼ N(0, I ), then Y has density (see the computation of a density under a linear-affine transformation, Example 2.20) fY (y) =
.
1 1 1 −1 −1 f A−1 (y − b) = e− 2 A (y−b),A (y−b) | det A| (2π )m/2 | det A| =
1
1
(2π )m/2 (det C)1/2
e− 2 C
−1 (y−b),y−b
.
If C is not invertible, then the .N(b, C) law cannot have a density with respect to the Lebesgue measure. In this case the image of the linear map associated to A is a proper hyperplane of .Rm , hence Y also takes its values in a proper hyperplane with probability 1 and cannot have a density, as such a hyperplane has Lebesgue measure 0. This is actually a general fact: any r.v. having a covariance matrix that is not invertible cannot have a density with respect to the Lebesgue measure (Exercise 2.27). • If .X ∼ N(b, C) is m-dimensional and R is a .d × m matrix and .& b ∈ Rd , then the b has characteristic function (see (2.40) again) d-dimensional r.v. .Y = RX + & &
.
&
φY (θ ) = eiθ,b φX (R ∗ θ ) = eiθ,b eiR &
1
= eiθ,b+Rb e− 2 RCR
∗ θ,b
∗ θ,θ
1
e− 2 CR
∗ θ,R ∗ θ
(2.66)
and therefore .Y ∼ N(& b + Rb, RCR ∗ ). Therefore
affine-linear maps transform Gaussian laws into Gaussian laws.
This is one of the most important properties of Gaussian laws and we shall use it throughout. In particular, for instance, if .X = (X1 , . . . , Xm ) ∼ N(b, C), then also its components .X1 , . . . , Xm are necessarily Gaussian (real of course), as the component .Xi is a linear function of X. Hence the marginals of a multivariate Gaussian law are also Gaussian. Moreover, taking into account that .Xi has mean .bi and covariance .cii , .Xi is .N(bi , cii )distributed. • If X is .N (0, I ) and O is an orthogonal matrix then the “rotated” r.v. OX is itself Gaussian, being a linear function of a Gaussian r.v. It is moreover obviously centered and, recalling how covariance matrices transform under linear transformations (see (2.32)), it has covariance matrix .C = OI O ∗ = OO ∗ = I . Hence .OX ∼ N(0, I ). • Let .X ∼ N(b, C) and assume C to be diagonal. Then we have iθ,b − 12 Cθ,θ
φX (θ ) = e
.
e
=e
iθ,b
m 1 exp − chh θh2 2 h=1
iθ1 b1 − 12 c11 θ12
=e
e
2 iθm bm − 12 cmm θm
···e
e
= φX1 (θ1 ) · · · φXm (θm ) .
Thanks to Proposition 2.35 therefore the r.v.’s .X1 , . . . , Xm are independent. Recalling that C is the covariance matrix of X, we have that uncorrelated r.v.’s are also independent if their joint distribution is Gaussian. Note that, in order for this property to hold, the r.v.’s .X1 , . . . , Xm must be jointly Gaussian. It is possible for them each to have a Gaussian law without having a Gaussian joint law (see Exercise 2.56). Individually but non-jointly Gaussian r.v.’s are however a rare occurrence. More generally, let .X, Y be r.v.’s with values in .Rm , Rd respectively and jointly Gaussian, i.e such that the pair .(X, Y ) (with values in .Rn , .n = m + d) has Gaussian distribution. Then if Cov(Xi , Yj ) = 0
.
for every 1 ≤ i ≤ m, 1 ≤ j ≤ d ,
(2.67)
i.e. the components of X are uncorrelated with the components of Y , then X and Y are independent.
Actually (2.67) is equivalent to the assumption that the covariance matrix C of (X, Y ) is block diagonal
.
⎛ ⎜ ⎜ C ⎜ X ⎜ ⎜ .C = ⎜ ⎜0 . . . ⎜ ⎜ .. . . ⎝. . 0 ...
0 .. .
⎞ 0 ... 0 .. . . .. ⎟ . . .⎟ ⎟ ⎟ 0 . . . 0⎟ ⎟ ⎟ ⎟ ⎟ CY ⎠
0
so that, if .θ1 ∈ Rm , .θ2 ∈ Rd and .θ := (θ1 , θ2 ) ∈ Rn , and denoting by .b1 , b2 respectively the expectations of X and Y and .b = (b1 , b2 ), 1
1
1
eiθ,b e− 2 Cθ,θ = eiθ1 ,b1 e− 2 CX θ1 ,θ1 eiθ2 ,b2 e− 2 CY θ2 ,θ2 ,
.
i.e. φ(X,Y ) (θ ) = φX (θ1 )φY (θ2 ) ,
.
and again X and Y are independent thanks to the criterion of Proposition 2.35. The argument above of course also works in the case of m r.v.’s: if .X1 , . . . , Xm are jointly Gaussian with values in .Rn1 , . . . , Rnm respectively and the covariances of the components of .Xk and of .Xj , .k = j , are uncorrelated, then again the covariance matrix of the vector .X = (X1 , . . . , Xm ) is block diagonal and by Proposition 2.35 .X1 , . . . , Xm are independent.
2.9 Quadratic Functionals of Gaussian r.v.’s, a Bit of Statistics 2 ∼ Recall that if .X ∼ N(0, I ) is an m-dimensional r.v. then .|X|2 = X12 + · · · + Xm 2 χ (m). In this section we go further into the investigation of quadratic functionals of Gaussian r.v.’s. Exercises 2.7, 2.51, 2.52 and 2.53 also go in this direction. The key tool is Cochran’s theorem below. Let us however first recall some notions concerning orthogonal projections. If V is a subspace of a Hilbert space H let us denote by .V ⊥ its orthogonal, i.e. the set of vectors .x ∈ H such that .x, z = 0 for every .z ∈ V . The orthogonal .V ⊥ is always a closed subspace. The following statements introduce the notion of projector on a subspace.
Lemma 2.39 Let H be a Hilbert space, .F ⊂ H a closed convex set and x ∈ F c a point not belonging to F . Then there exists a unique .y0 ∈ F such that
.
|x − y0 | = min |x − y| .
.
y∈F
Proof By subtraction, possibly replacing F by .F − x, we can assume .x = 0 and 0 ∈ F . Let .η = miny∈F |y|. It is immediate that, for every .z, y ∈ H ,
.
2 1 2 1 1 1 (z − y) + (z + y) = |z|2 + |y|2 2 2 2 2
.
(2.68)
and therefore 2 2 1 1 1 1 (z − y) = |z|2 + |y|2 − (z + y) . 2 2 2 2
.
If .z, y ∈ F , as also . 12 (z + y) ∈ F (F is convex), we obtain 1 2 1 1 (z − y) ≤ |z|2 + |y|2 − η2 . 2 2 2
.
(2.69)
Let now .(yn )n ⊂ F be a minimizing sequence, i.e. such that .|yn | →n→∞ η. Then (2.69) gives |yn − ym |2 ≤ 2|yn |2 + 2|ym |2 − 4η2 .
.
As .|yn |2 →n→∞ η2 this relation proves that .(yn )n is a Cauchy sequence, hence converges to some .y0 ∈ F that is the required minimizer. The fact that every minimizing sequence is a Cauchy sequence implies uniqueness. Let .V ⊂ H be a closed subspace, hence also a closed convex set. Lemma 2.39 allows us to define, for .x ∈ H , P x := argmin |x − v|
.
(2.70)
v∈V
i.e. P x is the (unique) element of V that is closest to x. Let us investigate the properties of the operator P . It is immediate that .P x = x if and only if already .x ∈ V and that .P (P x) = P x.
Proposition 2.40 Let P be as in (2.70) and .Qx = x − P x. Then .Qx ∈ V ⊥ , so that P x and Qx are orthogonal. Moreover P and Q are linear operators.
Proof Let us prove that, for every .v ∈ V , Qx, v = x − P x, v = 0 .
.
(2.71)
By the definition of P , as .P x + tv ∈ V for every .t ∈ R, for all .v ∈ V the function t → |x − (P x + tv)|2
.
is minimum at .t = 0. But |x − (P x + tv)|2 = |x − P x|2 − 2tx − P x, v + t 2 |v|2 .
.
The derivative with respect to t at .t = 0 must therefore vanish, which gives (2.71). For every .x, y ∈ H , .α, β ∈ R we have, thanks to the relation .x = P x + Qx, αx + βy = P (αx + βy) + Q(αx + βy)
.
(2.72)
but also .αx = α(P x + Qx), .βy = β(P y + Qy) and by (2.72) α(P x + Qx) + β(P y + Qy) = P (αx + βy) + Q(αx + βy)
.
i.e. αP x + βP y − P (αx + βy) = Q(αx + βy) − αQx − βQy .
.
As in the previous relation the left-hand side is a vector of V whereas the right-hand side belongs to .V ⊥ , both are necessarily equal to 0, which proves linearity.
P is the orthogonal projector on V .
We shall need Proposition 2.40 in this generality later. In this section we shall be confronted with orthogonal projectors only in the simpler case .H = Rm .
Example 2.41 Let .V ⊂ Rm be the subspace of the vectors of the form v = (v1 , . . . , vk , 0, . . . , 0)
.
v1 , . . . , vk ∈ R .
In this case, if .x = (x1 , . . . , xm ), P x = argmin |x − v|2 = argmin
k
.
v1 ,...,vk ∈R i=1
v∈V
(xi − vi )2 +
m
xi2
i=k+1
i.e. .P x = (x1 , . . . , xk , 0, . . . , 0). Here, of course, .V ⊥ is formed by the vectors of the form v = (0, . . . , 0, vk+1 , . . . , vm ) .
.
Theorem 2.42 (Cochran) Let X be an m-dimensional .N(0, I )-distributed r.v. and .V1 , . . . , Vk pairwise orthogonal vector subspaces of .Rm . For .i = 1, . . . , k let .ni denote the dimension of .Vi and .Pi the orthogonal projector onto .Vi . Then the r.v.’s .Pi X, i = 1, . . . , k, are independent and .|Pi X|2 is 2 .χ (ni )-distributed.
Proof Assume for simplicity .k = 2. Except for a rotation we can assume that .V1 is the subspace of the first .n1 coordinates and .V2 the subspace of the subsequent .n2 as in Example 2.41 (recall that the .N(0, I ) laws are invariant with respect to orthogonal transformations). Hence P1 X = (X1 , . . . , Xn1 , 0, . . . , 0) ,
.
P2 X = (0, . . . , 0, Xn1 +1 , . . . , Xn1 +n2 , 0, . . . , 0) . P1 X and .P2 X are jointly Gaussian (the vector .(P1 X, P2 X) is a linear function of X) and it is clear that (2.67) (orthogonality of the components of .P1 X and .P2 X) holds; therefore .P1 X and .P2 X are independent. Moreover
.
|P1 X|2 = (X12 + · · · + Xn21 ) ∼ χ 2 (n1 ) ,
.
|P2 X|2 = (Xn21 +1 + · · · + Xn21 +n2 ) ∼ χ 2 (n2 ) .
A first important application of Cochran’s Theorem is the following. Let .V0 ⊂ Rm be the subspace generated by the vector .e = (1, 1, . . . , 1) (i.e. the subspace of the vectors whose components are equal); let us show that the orthogonal projector on .V0 is .PV0 x = (x, ¯ . . . , x), ¯ where 1 (x1 + · · · + xm ) . m
x¯ =
.
In order to determine .PV0 x we must find the number .λ0 ∈ R such that the function λ → |x − λe| is minimum at .λ = λ0 . That is we must find the minimizer of
.
λ →
.
m (xi − λ)2 . i=1
m Taking
mthe derivative we find for the critical value the relation .2 i=1 (xi − λ) = 0, i.e. . i=1 xi = mλ. Hence .λ0 = x. ¯ If .X ∼ N(0, I ) and .X = m1 (X1 +· · ·+Xm ), then .Xe is the orthogonal projection of X on .V0 and therefore .X −Xe is the orthogonal projection of X on the orthogonal subspace .V0⊥ . By Cochran’s Theorem .Xe and .X − Xe are independent (which is not completely obvious as both these r.v.’s depend on .X). Moreover, as .V0⊥ has dimension .m − 1, Cochran’s Theorem again gives m . (Xi − X)2 = |X − Xe|2 ∼ χ 2 (m − 1) .
(2.73)
i=1
Let us introduce a new probability law: the Student t with n degrees of freedom is the law of an r.v. of the form X √ Z=√ n, Y
.
(2.74)
where X and Y are independent and .N(0, 1)- and .χ 2 (n)-distributed respectively. This law is usually denoted .t (n).
Student laws are symmetric, i.e. Z and .−Z have the same law. This follows immediately from their definition: the r.v.’s .X, Y and .−X, Y in (2.74) have the same joint law, as their have the same distribution and are independent. √ components X √ Hence the laws of . X n n and .− are the images of the same joint law under the Y Y same map and therefore coincide. It is not difficult to compute the density of a .t (n) law (see Example 4.17 p. 192) but we shall skip this computation for now. Actually it will be apparent that the
important thing about Student laws are the distribution functions and quantiles, which are provided by appropriate software (tables in ancient times. . . ).
Example 2.43 (Quantiles) Let F be the d.f. of some r.v. X. The quantile of order .α, 0 < α < 1, of F is the infimum, .qα say, of the numbers x such that .F (x) = P(X ≤ x) ≥ α, i.e. qα = inf{x; F (x) ≥ α}
.
(actually this is a minimum as F is right continuous). If F is continuous then, by the intermediate value theorem, the equation F (x) = α
.
(2.75)
has (at least) one solution for every .0 < α < 1. If moreover F is strictly increasing (which is the case for instance if X has a strictly positive density) then the solution of equation (2.75) is unique. In this case .qα is therefore the unique real number x such that F (x) = P(X ≤ x) = α .
.
If X is symmetric (i.e. X and .−X have the same law), as is the case for .N(0, 1) and Student laws, we have the relations 1 − α = P(X ≥ qα ) = P(−X ≥ qα ) = P(X ≤ −qα ) ,
.
from which we obtain that .q1−α = −qα . Moreover, we have the relation (see Fig. 2.3)
.
P(|X| ≤ q1−α/2 ) = P(−q1−α/2 ≤ X ≤ q1−α/2 ) α α = P(X ≤ q1−α/2 ) − P(X ≤ −q1−α/2 ) = 1 − − = 1 − α . 2 2
(2.76)
Going back to the case .X ∼ N(0, I ), we have seen that, as a consequence of 2 Cochran’s theorem, the r.v.’s .X and . m i=1 (Xi − X) are independent and that
m 1 . (X − X)2 ∼ χ 2 (m − 1). As .X = m (X1 + · · · + Xm ) is .N(0, m1 )-distributed, √ i=1 i . m X ∼ N(0, 1) and √
T :=
.
mX ∼ t (m − 1) .
m 1 2 i=1 (Xi − X) m−1
(2.77)
−q1−a/2
0
q1−a/2
Fig. 2.3 Each of the two shaded regions has an area equal to . α2 . Hence the probability of a value between .−q1−α/2 and .q1−α/2 is equal to .1 − α
Corollary 2.44 Let .Z1 , . . . , Zm be i.i.d. .N(b, σ 2 )-distributed r.v.’s. Let 1 (Z1 + · · · + Zm ) , m m 1 S2 = (Zi − Z)2 . m−1 Z=
.
i=1
Then .Z and .S 2 are independent. Moreover, .
m−1 2 S ∼ χ 2 (m − 1) , . σ2
√ m (Z − b) ∼ t (m − 1) . S
(2.78) (2.79)
Proof Let us trace back to the case of .N(0, I )-distributed r.v.’s that we have already seen. If .Xi = σ1 (Zi − b), then .X = (X1 , . . . , Xm ) ∼ N(0, I ) and we know already
that .X and . i (Xi − X)2 are independent. Moreover, Z = σX + b , .
m m m−1 2 1 2 S = (Z − Z) = (Xi − X)2 i σ2 σ2 i=1
(2.80)
i=1
so that .Z and .S 2 are also independent, being functions of independent r.v.’s. Finally m−1 2 . S ∼ χ 2 (m − 1) thanks to (2.73) and the second of the formulas (2.80), and as σ2 √ .
m (Z − b) = S
(2.79) follows by (2.77).
√
1 m−1
mX ,
m 2 (X − X) i i=1
Example 2.45 (A Bit of Statistics. . . ) Let .X1 , . . . , Xn be i.i.d. .N(b, σ 2 )distributed r.v.’s, where both b and .σ 2 are unknown. How can we, from the observed values .X1 , . . . , Xn , estimate the two unknown parameters b and .σ 2 ? If 1 (X1 + · · · + Xn ) , n n 1 S2 = (Xi − X)2 n−1 X=
.
i=1
then by Corollary 2.44 .
n−1 2 S ∼ χ 2 (n − 1) σ2
and √ T :=
.
n (X − b) ∼ t (n − 1) . S
If we denote by .tα (n − 1) the quantile of order .α of a .t (n − 1) law, then P |T | > t1−α/2 (n − 1) = α
.
(this is (2.76), as Student laws are symmetric). On the other hand, ' .
( ' S ( |T | > t1−α/2 (n − 1) = |X − b| > t1−α/2 (n − 1) √ . n
Therefore the probability for the empirical mean .X to differ from the expectation b by more than .t1−α/2 (n − 1) √Sn is .≤ α. Or, in other words, the unknown mean b lies in the interval S S I = X − t1−α/2 (n − 1) √ , X + t1−α/2 (n − 1) √ n n
.
(2.81)
with probability .1−α. We say that I is a confidence interval for b of level .1−α. The same idea allows us to estimate the variance .σ 2 , but with some changes as the .χ 2 laws are not symmetric. If we denote by .χα2 (n − 1) the quantile of order .α of a .χ 2 (n − 1) law, we have n − 1 α 2 2 P S < χ (n − 1) = , α/2 2 σ2
.
n − 1 α 2 2 P S > χ (n − 1) = 1−α/2 2 σ2
and therefore n−1 2 2 2 1 − α = P χα/2 (n − 1) ≤ S ≤ χ (n − 1) 1−α/2 σ2 n−1 n−1 S2 ≤ σ 2 ≤ 2 S2 . =P 2 χ1−α/2 (n − 1) χα/2 (n − 1)
.
In other words .
n−1 n−1 2 2 S S , 2 2 (n − 1) χ1−α/2 (n − 1) χα/2
is a confidence interval for .σ 2 of level .1 − α.
Example 2.46 In 1879 the physicist A. A. Michelson made .n = 100 measurements of the speed of the light, obtaining the value X = 299 852.4
.
with .S = 79.0. If we assume that these values are equal to the true value of the speed of light with the addition of a Gaussian measurement error, (2.81) gives, for the confidence interval (2.81), intending .299,000 plus the indicated value, [836.72, 868.08] .
.
The latest measurements of the speed of the light give the value .792.4574 with a confidence interval ensuring precision up to the third decimal place. It appears that the 1879 measurements were biased. Michelson obtained much more precise results later on.
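The interval quoted in Example 2.46 can be recomputed directly from (2.81) (a Python check, not part of the text; a 95% confidence level is assumed, as the stated endpoints suggest):

```python
from scipy import stats

n, xbar, S, alpha = 100, 852.4, 79.0, 0.05       # Michelson data, values above 299 000
t = stats.t.ppf(1 - alpha / 2, df=n - 1)         # quantile t_{1-alpha/2}(n-1)
half = t * S / n**0.5
print(xbar - half, xbar + half)                  # approximately 836.7 and 868.1
```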
Exercises 2.1 (p. 270) Let .(Ω, F, P) be a probability space and .(A n )n a sequence of events, each having probability 1. Prove that their intersection . n An also has probability 1. 2.2 (p. 271) Let .(Ω, F, P) be a probability space and . G ⊂ F a .P-trivial .σ -algebra, i.e. such that, for every .A ∈ G, either .P(A) = 0 or .P(A) = 1. In this exercise
Exercises
99
we prove that a . G-measurable r.v. X with values in a separable metric space E is a.s. constant. This fact has already been established in Theorem 2.15 in the case m .E = R . Let X be an E-valued . G-measurable r.v. (a) Prove that for every .n ∈ N there exists a ball .Bxn ( n1 ) centered at some .xn ∈ E and with radius . n1 such that .P(X ∈ Bxn ( n1 )) = 1. (b) Prove that there exists a decreasing sequence .(An )n of Borel sets of E such that 2 .P(X ∈ An ) = 1 for every n and such that the diameter of .An is .≤ n. (c) Prove that there exists an .x0 ∈ E such that .P(X = x0 ) = 1. 2.3 (p. 271) (a) Let .(Xn )n be a sequence of real independent r.v.’s and let Z = sup Xn .
.
n≥1
Assume that, for some .a ∈ R, .P(Z ≤ a) > 0. Prove that .Z < +∞ a.s. (b) Let .(Xn )n be a sequence of real independent r.v.’s with .Xn exponential of parameter .λn . (b1) Assume that .λn = log n. Prove that Z := sup Xn < +∞
.
a.s.
n≥1
(b2) Assume that .λn ≡ c > 0. Prove that .Z = +∞ a.s. 2.4 (p. 272) Let X and Y be real independent r.v.’s such that .X + Y has finite mathematical expectation. Prove that both X and Y have finite mathematical expectation. 2.5 (p. 272) Let X, Y be d-dimensional independent r.v.’s .μ- and .ν-distributed respectively. Assume that .μ has density f with respect to the Lebesgue measure of .Rd (no assumption is made concerning the law of Y ). (a) Prove that .X + Y also has density, g say, with respect to the Lebesgue measure and compute it. (b) Prove that if f is k times differentiable with bounded derivatives up to the order k, then g is also k times differentiable (again whatever the law of Y ). 2.6 (p. 273) Let .μ be a probability on .Rd . (a) Prove that, for every .ε > 0, there exists an .M1 > 0 such that .μ(|x| ≥ M1 ) < ε. (b) Let .f ∈ C0 (Rd ), i.e. continuous and such that for every .ε > 0 there exists an .M2 > 0 such .|f (x)| ≤ ε for .|x| > M2 . Prove that if g(x) = μ ∗ f (x) :=
.
Rd
f (x − y) μ(dy)
(2.82)
100
2 Probability
then also .g ∈ C0 (Rd ). In particular, as obviously .μ ∗ f ∞ ≤ |f ∞ , the map d .f → μ ∗ f is continuous from .C0 (R ) to itself. 2
2.7 (p. 273) Let .X ∼ N(0, σ 2 ). Compute .E(etX ) for .t ∈ R. 2.8 (p. 274) Let X be an .N(0, 1)-distributed r.v., .σ, b real numbers and .x, K > 0. Show that 1 2 E (xeb+σ X − K)+ = xeb+ 2 σ Φ(−ζ + |σ |) − KΦ(−ζ ) ,
.
(2.83)
where .ζ = |σ1 | (log Kx − b) and .Φ denotes the distribution function of an .N(0, 1) law. This quantity appears naturally in mathematical finance. 2.9 (p. 274) (Weibull Laws) Let, for .α > 0, λ > 0, f (t) =
λαt α−1 e−λt
α
.
for t > 0 for t ≤ 0 .
0
(a) Prove that f is a probability density with respect to the Lebesgue measure and compute its d.f. (b1) Let X be an exponential r.v. with parameter .λ and let .β > 0. Compute .E(Xβ ). What is the law of .Xβ ? (b2) Compute the expectation and the variance of an r.v. that is Weibull-distributed with parameters .α, λ. (b3) Deduce that for the Gamma function we have .Γ (1 + 2t) ≥ Γ (1 + t)2 holds for every .t ≥ 0. 2.10 (p. 275) A pair of r.v.’s .X, Y has joint density f (x, y) = (θ + 1)
.
eθx eθy 1
(eθx + eθy − 1)2+ θ
,
x > 0, y > 0
and .f (x, y) = 0 otherwise, where .θ > 0. Compute the densities of X and of Y . 2.11 (p. 276) Let X, Y , Z be independent r.v.’s uniform on .[0, 1]. (a1) Compute the laws of .− log X and of .− log Y . (a2) Compute the law of .− log X − log Y and then of XY (b) Prove that .P(XY < Z 2 ) = 59 . 2.12 (p. 277) Let Z be an exponential r.v. with parameter .λ and let .Z1 = Z, Z2 = Z − Z, respectively the integer and fractional parts of Z.
.
(a) Compute the laws of .Z1 and of .Z2 . (b1) Compute, for .0 ≤ a < b ≤ 1 and .k ∈ N, the probability .P(Z1 = k, Z2 ∈ [a, b]).
Exercises
101
(b2) Prove that .Z1 and .Z2 are independent. 2.13 (p. 277) (Recall first Remark 2.1) Let F be the d.f. of a positive r.v. X having finite mean .b > 0 and let .F (t) = 1 − F (t). Let g(t) =
.
1 F (t) . b
(a) Prove that g is a probability density. (b) Determine g when X is (b1) exponential with parameter .λ; (b2) uniform on .[0, 1]; (b3) Pareto with parameters .α > 1 and .θ > 0, i.e. with density ⎧ ⎨
αθ α (θ + t)α+1 .f (t) = ⎩ 0
if t > 0 otherwise .
(c) Let .X ∼ Gamma.(n, λ), with n an integer .≥ 1. Prove that g is a linear combination of Gamma.(k, λ) densities for .1 ≤ k ≤ n. (d) Assume that X has finite variance .σ 2 . Compute the mean of the law having density g with respect to the Lebesgue measure. 2.14 (p. 279) In this exercise we determine the image law of the uniform distribution on the sphere under the projection on the north-south diameter (or, indeed, on any diameter). Recall that in polar coordinates the parametrization of the sphere .S2 of 3 .R is z = cos θ ,
.
y = sin θ cos φ , x = sin θ sin φ where .(θ, φ) ∈ [0, π ] × [0, 2π ]. .θ is the colatitude (i.e. the latitude but with values in .[0, π ] instead of .[− π2 , π2 ]) and .φ the longitude. The Lebesgue measure of the sphere, normalized so that the total measure is equal to 1, is .f (θ, φ) dθ dφ, where f (θ, φ) =
.
1 sin θ 4π
(θ, φ) ∈ [0, π ] × [0, 2π ] .
Let us consider the map .S2 → [−1, 1] defined as (x, y, z) → z ,
.
i.e. the projection of .S2 on the north-south diameter.
(2.84)
102
2 Probability
What is the image of the normalized Lebesgue measure of the sphere under this map? Are the points at the center of the interval .[−1, 1] (corresponding to the equator) the most likely? Or those near the endpoints (the poles)? 2.15 (p. 279) Let Z be an r.v. uniform on .[0, π ]. Determine the law of .W = cos Z. 2.16 (p. 280) Let X, Y be r.v.’s whose joint law has density, with respect to the Lebesgue measure of .R2 , of the form f (x, y) = g(x 2 + y 2 ) ,
.
(2.85)
where .g : R+ → R+ is a Borel function. (a) Prove that necessarily
+∞
.
g(t) dt =
0
1 · π
(b1) Prove that X and Y have the same law. (b2) Assume that X and (hence) Y are integrable. Compute .E(X) and .E(Y ). (b3) Assume that X and (hence) Y are square integrable. Prove that X and Y are uncorrelated. Give an example with X and Y independent and an example with X and Y non-independent. (c1) Prove that .Z := X Y has a Cauchy law, i.e. with density with respect to the Lebesgue measure z →
.
1 · π(1 + z2 )
In particular, the law of . X Y does not depend on g. (c2) Let X, Y be independent .N(0, 1)-distributed r.v.’s. What is the law of . X Y? 1 (c3) Let Z be a Cauchy-distributed r.v. Prove that . Z also has a Cauchy law. 2.17 (p. 281) Let .(Ω, F, P) be a probability space and X a positive r.v. such that E(X) = 1. Let us define a new measure .Q on .(Ω, F) by
.
.
dQ =X, dP
i.e. .Q(A) = E(X1A ) for all .A ∈ F. (a) Prove that .Q is a probability and that .Q P. (b) We now address the question of whether also .P Q. (b1) Prove that the event .{X = 0} has probability 0 with respect to .Q.
Exercises
103
(b2) Let & .P be the measure on .(Ω, F) defined as .
d& P 1 = dQ X
(which is well defined as .X > 0 .Q-a.s.). Prove that & .P = P if and only if {X = 0} has probability 0 also with respect to .P and that in this case .P Q. Let .μ be the law of X with respect to .P. What is the law of X with respect to .Q? If .X ∼ Gamma.(λ, λ) under .P, what is its law under .Q? Let Z be an r.v. independent of X (under .P). Prove that if Z is integrable under .P then it is also integrable with respect to .Q and that .EQ (Z) = E(Z). Prove that Z has the same law with respect to .Q as with respect to .P. Prove that Z is also independent of X under .Q. .
(c) (d) (d1) (d2) (d3)
2.18 (p. 282) Let .(Ω, F, P) be a probability space, and X and Z independent exponential r.v.’s of parameter .λ. Let us define on .(Ω, F) the new measure .
i.e. .Q(A) =
λ 2
λ dQ = (X + Z) dP 2
E[(X + Z)1A ].
(a) Prove that .Q is a probability and that .Q P. (b) Compute .EQ (XZ). (c1) Compute the joint law of X and Z with respect to .Q. Are X and Z also independent with respect to .Q? (c2) What are the laws of X and of Z under .Q? 2.19 (p. 283) (a) Let X, Y be real r.v.’s having joint density f with respect to the Lebesgue measure. Prove that both XY and . X Y have a density with respect to the Lebesgue measure and compute it. (b) Let X, Y be independent r.v.’s Gamma.(α, λ)- and Gamma.(β, λ)-distributed respectively. (b1) Compute the law of .W = X Y. (b2) This law turns out not to depend on .λ. Was this to be expected? (b3) For which values of p does W have a finite moment of order p? Compute these moments. (c1) Let .X, Y, Z be .N(0, 1)-distributed independent r.v.’s. Compute the laws of W1 =
.
X2 Z2 + Y 2
104
2 Probability
and of |X| W2 = √ · Z2 + Y 2
.
(c2) Compute the law of . X Y. 2.20 (p. 286) Let X and Y be independent r.v.’s, .Γ (α, 1)- and .Γ (β, 1)-distributed respectively with .α, β > 0. (a) Prove that .U = X + Y and .V = X1 (X + Y ) are independent. (b) Determine the laws of V and of . V1 . 2.21 (p. 287) Let T be a positive r.v. having density f with respect to the Lebesgue measure and X an r.v. uniform on .[0, 1], independent of T . Let .Z = XT , .W = (1 − X)T . (a) Determine the joint law of Z and W . (b) Explicitly compute this joint law when f is Gamma.(2, λ). Prove that in this case Z and W are independent. 2.22 (p. 288) (a) Let .X, Y be real r.v.’s, having joint density f with respect to the Lebesgue measure and such that .X ≤ Y a.s. Let, for every .x, y, .x ≤ y, G(x, y) := P(x ≤ X ≤ Y ≤ y) .
.
Deduce from G the density f . (b1) Let .Z, W be i.i.d. real r.v.’s having density h with respect to the Lebesgue measure. Determine the joint density of .X = min(Z, W ) and .Y = max(Z, W ). (b2) Explicitly compute this joint density when .Z, W are uniform on .[0, 1] and deduce the value of .E[|Z − W |]. 2.23 (p. 289) Let .(E, E, μ) be a .σ -finite measure space. Assume that, for every integrable function .f : E → R and for every convex function .φ, φ(f (x)) dμ(x) ≥ φ
.
E
f (x) dμ(x)
(2.86)
E
(note that, as in the proof of Jensen’s inequality, .φ ◦ f is lower semi-integrable, so that the l.h.s above is always well defined). (a) Prove that for every .A ∈ E such that .μ(A) < +∞ necessarily .μ(A) ≤ 1. Deduce that .μ is finite. (b) Prove that .μ is a probability. In other words, Jensen’s inequality only holds for probabilities.
Exercises
105 ........................ ........ ...... ..... ..... .... ..... . . . ....... ... . ....... . . ...... . . . ...... . . ...... . . ...... . . . ...... . . ....... . ....... .. . ....... .. . ........ .. ......... . . .......... . . ............ . . .............. .. . ................... . . ..................................... . . . .................................. ......
Fig. 2.4 A typical example of a density with a positive skewness
2.24 (p. 289) Given two probabilities .μ, .ν on a measurable space .(E, E), the relative entropy (or Kullback-Leibler divergence) of .ν with respect to .μ is defined as H (ν; μ) :=
log
.
E
dν dμ
ν(dx) = E
dν dμ
dν log dμ μ(dx)
(2.87)
if .ν μ and .H (ν; μ) = +∞ otherwise. (a1) Prove that .H (ν; μ) ≥ 0 and that .H (ν; μ) > 0 unless .ν = μ. Moreover, H is a convex function of .ν. 1 (a2) Let .A ∈ E be a set such that .0 < μ(A) < 1 and .dν = μ(A) 1A dμ. Compute .H (ν; μ) and .H (μ; ν) and note that .H (ν; μ) = H (μ; ν). (b1) Let .μ = B(n, p) and .ν = B(n, q) with .0 < p, q < 1. Compute .H (ν; μ). (b2) Compute .H (ν; μ) when .ν and .μ are exponential of parameters .ρ and .λ respectively. (c) Let .νi , μi , .i = 1, . . . , n, be probabilities on the measurable spaces .(Ei , Ei ). Prove that, if .ν = ν1 ⊗ · · · ⊗ νn , .μ = μ1 ⊗ · · · ⊗ μn , then H (ν; μ) =
n
.
H (νi ; μi ) .
(2.88)
i=1
2.25 (p. 291) The skewness (or asymmetry) index of an r.v. X is the quantity γ =
.
E[(X − b)3 ] , σ3
(2.89)
where .b = E(X) and .σ 2 = Var(X) (provided X has a finite moment of order 3). The index .γ , intuitively, measures the asymmetry of the law of X: values of .γ that are positive indicate the presence of a “longish tail” on the right (as in Fig. 2.4), whereas negative values indicate the same thing on the left. (a) What is the skewness of an .N(b, σ 2 ) law? (b) And of an exponential law? Of a Gamma.(α, λ)? How does the skewness depend on .α and .λ? Recall the binomial expansion of third degree: .(a + b)3 = a 3 + 3a 2 b + 3ab2 + b3 .
106
2 Probability
2.26 (p. 292) (The problem of moments) Let .μ, ν be probabilities on .R having equal moments of all orders. Can we infer that .μ = ν? Prove that if their support is contained in a bounded interval .[−M, M], then .μ = ν (this is not the weakest assumption, see e.g. Exercise 2.45). 2.27 (p. 293) (Some information that is carried by the covariance matrix) Let X be an m-dimensional r.v. Prove that its covariance matrix C is invertible if and only if the support of the law of X is not contained in a proper hyperplane of .Rd . Deduce that if C is not invertible, then the law of X cannot have a density with respect to the Lebesgue measure. Recall Eq. (2.33). Proper hyperplanes have Lebesgue measure 0. . . 2.28 (p. 293) Let .X, Y be real square integrable r.v.’s and .x → ax + b the regression line of Y on X. (a) Prove that .Y − (aX + b) is centered and that the r.v.’s .Y − (aX + b) and .aX + b are orthogonal in .L2 . (b) Prove that the squared discrepancy .E[(Y − (aX + b))2 ] is equal to .E(Y 2 ) − E[(aX + b)2 ]. 2.29 (p. 294) (a) Let Y , W be independent r.v.’s .N(0, 1)- and .N(0, σ 2 )-distributed respectively and let .X = Y + W . What is the regression line .x → ax + b of Y with respect to X? What is the value of the quadratic error E[(Y − aX − b)2 ] ?
.
(b) Assume, instead, the availability of two measurements of the same quantity Y , .X1 = Y +W1 and .X2 = Y +W2 , where the r.v.’s Y , .W1 and .W2 are independent and .W1 , W2 ∼ N(0, σ 2 ). What is now the best estimate of Y by an affine-linear function of the two observations .X1 and .X2 ? What is the value of the quadratic error now? 2.30 (p. 295) Let .Y, W be exponential r.v.’s with parameters respectively .λ and .ρ. Determine the regression line of Y with respect to .X = Y + W . 2.31 (p. 295) Let .φ be a characteristic function. Show that .φ, .φ 2 , .|φ|2 are also characteristic functions. 2.32 (p. 296) (a) Let .X1 , X2 be independent r.v.’s uniform on .[− 12 , 12 ]. (a) Compute the characteristic function of .X1 + X2 . (b) Compute the characteristic function, .φ say, of the probability with density, with respect to the Lebesgue measure, .f (x) = 1 − |x|, .|x| ≤ 1 and .f (x) = 0 for .|x| > 1 and deduce the law of .X1 + X2 .
Exercises
107
.......... ..... .......... . . . . ..... .. ..... ..... ..... ..... . . ..... . . . . . ..... . . . . . ..... . . . . ..... . . . . ..... . . . . ........................................................................................ ........................................................................................ −3
−2
−1
0
1
2
3
Fig. 2.5 The graph of f of Exercise 2.32 (and of .ψ as well)
(c) Prove that the function (Fig. 2.5) κ(θ ) =
.
1 − |θ | if − 1 ≤ θ ≤ 1 0
otherwise
is a characteristic function and determine the corresponding law. Recall the trigonometric relation .1 − cos x = 2 sin2 x2 . 2.33 (p. 296) (Characteristic functions are positive definite) A function .f : Rd → C is said to be positive definite if, for every choice of .n ∈ N and .x1 , . . . , xn ∈ Rd , the complex matrix .(f (xh − xk ))h,k is positive definite, i.e. Hermitian and such that n .
f (xh − xk )ξh ξk ≥ 0
for every ξ1 , . . . , ξn ∈ C .
h,k=1
Prove that characteristic functions are positive definite. 2.34 (p. 297) (a) Let .ν be a Laplace law with parameter .λ = 1, i.e. having density .h(x) = 1 −|x| with respect to the Lebesgue measure. Prove that 2e .ν(θ ) =
1 · 1 + θ2
(2.90)
(b1) Let .μ be a Cauchy law, i.e. the probability having density f (x) =
.
1 π(1 + x 2 )
with respect to the Lebesgue measure. Prove that . μ(θ ) = e−|θ| . (b2) Let .X, Y be independent Cauchy r.v.’s. Prove that . 12 (X + Y ) is also Cauchy distributed.
108
2 Probability
2.35 (p. 298) A probability .μ on .R is said to be infinitely divisible if, for every n, there exist n i.i.d. r.v.’s .X1 , . . . , Xn such that .X1 + · · · + Xn ∼ μ. Or, equivalently, if for every n, there exists a probability .μn such that .μn ∗ · · · ∗ μn = μ (n times). Establish which of the following laws are infinitely divisible. (a) (b) (c) (d)
N(m, σ 2 ). Poisson of parameter .λ. Exponential of parameter .λ. Cauchy.
.
2.36 (p. 299) Let .μ, .ν be probabilities on .Rd such that μ(Hθ,a ) = ν(Hθ,a )
.
(2.91)
for every half-space .Hθ,a = {x; θ, x ≤ a}, .θ ∈ Rd , .a ∈ R. (a) Let .μθ , .νθ denote the images of .μ and .ν respectively through the map .ψθ : Rd → R defined by .ψθ (x) = θ, x. Prove that .μθ = νθ . (b) Deduce that .μ = ν. 2.37 (p. 299) Let .(Ω, F, P) be a probability space and X a positive integrable r.v. on it, such that .E(X) = 1. Let us denote by .μ and .φ respectively the law and the characteristic function of X. Let .Q be the probability on .(Ω, F) having density X with respect to .P. (a1) Compute the characteristic function of X under .Q and deduce that .−iφ also is a characteristic function. (a2) Compute the law of X under .Q and determine the law having characteristic function .−iφ . (a3) Determine the probability corresponding to .−iφ when .X ∼ Gamma.(λ, λ) and when X is geometric of parameter .p = 1. (b) Prove that if X is a positive integrable r.v. but .E(X) = 1, then .−iφ cannot be a characteristic function. 2.38 (p. 299) A professor says: “let us consider a real r.v. X with characteristic 4 function .φ(θ ) = e−θ . . . ”. What can we say about the values of mean and variance of such an X? Comments? 2.39 (p. 300) (Stein’s characterization of the Gaussian law) (a) Let .Z ∼ N(0, 1). Prove that E[Zf (Z)] = E[f (Z)]
.
for every f ∈ Cb1
(2.92)
where .Cb1 denotes the vector space of bounded continuous functions .R → C with bounded derivative. (b) Let Z be a real r.v. satisfying (2.92).
Exercises
109
(b1) Prove that Z is integrable. (b2) What is its characteristic function? Prove that necessarily .Z ∼ N(0, 1). 2.40 (p. 301) Let X be a .Z-valued r.v. and .φ its characteristic function. (a) Prove that P(X = 0) =
.
1 2π
2π
φ(θ ) dθ .
(2.93)
0
(b) Are you able to find a similar formula in order to obtain from .φ the probabilities .P(X = m), .m ∈ Z? (c) What about the integrability of .φ on the whole of .R? 2.41 (p. 302) (Characteristic functions are uniformly continuous) Let .μ be a probability on .Rd . (a) Prove that for every .η > 0 there exist .R = Rη > 0 such that .μ(BRc ) ≤ η, where .BR denotes the ball centered at 0 and with radius R. (b) Prove that, for every .θ1 , θ2 ∈ Rd , iθ ,x e 1 − eiθ2 ,x ≤ |x||θ1 − θ2 | .
.
(2.94)
In particular the functions .θ → eiθ,x are uniformly continuous as x ranges over a bounded set. (c) Prove that . μ is uniformly continuous. 2.42 (p. 303) Let X be an r.v. and let us denote by L its Laplace transform. (a) Prove that, for every .λ, .0 ≤ λ ≤ 1, and .s, t ∈ R, L λs + (1 − λ)t ≤ L(s)λ L(t)1−λ .
.
(b) Prove that L restricted to the real axis and its logarithm are both convex functions. 2.43 (p. 303) Let X be an r.v. with a Laplace law of parameter .λ, i.e. of density f (x) =
.
λ −λ|x| e 2
with respect to the Lebesgue measure. (a) Compute the Laplace transform and the characteristic function of X. (b) Let Y and W be independent r.v.’s, both exponential of parameter .λ. Compute the Laplace transform of .Y − W . What is the law of .Y − W ?
110
2 Probability
(c1) Prove that the Laplace law is infinitely divisible (see Exercise 2.35 for the definition). (c2) Prove that φ(θ ) =
.
1 (1 + θ 2 )1/n
(2.95)
is a characteristic function. 2.44 (p. 304) (Some information about the tail of a distribution that is carried by its Laplace transform) Let X be an r.v. and .x2 the right convergence abscissa of its Laplace transform L. (a) Prove that if .x2 > 0 then for every .λ < x2 we have for some constant .c > 0 P(X ≥ t) ≤ c e−λt .
.
(b) Prove that if there exists a .t0 > 0 such that .P(X ≥ t) ≤ c e−λt for .t > t0 , then .x2 ≥ λ. 2.45 (p. 304) Let .μ, .ν be probabilities on .R such that all their moments coincide:
+∞
.
−∞
x k dμ(x) =
+∞ −∞
x k dν(x)
k = 1, 2, . . .
and assume, in addition, that their Laplace transform is finite in a neighborhood of 0. Then .μ = ν. 2.46 (p. 305) (Exponential families) Let .μ be a probability on .R whose Laplace transform L is finite in an interval .]a, b[, .a < 0 < b (hence containing the origin in its interior). Let, for .t ∈ R, ψ(t) = log L(t) .
.
(2.96)
As mentioned in Sect. 2.7, L, hence also its logarithm .ψ, are infinitely many times differentiable in .]a, b[. (a) Express the mean and variance of .μ using the derivatives of .ψ. (b) Let, for .γ ∈]a, b[, dμγ (x) =
.
eγ x dμ(x) . L(γ )
Exercises
111
(b1) Prove that .μγ is a probability and that its Laplace transform is Lγ (t) :=
.
L(t + γ ) · L(γ )
(b2) Express the mean and variance of .μγ using the derivatives of .ψ. (b3) Prove that .ψ is a convex function and deduce that the mean of .μγ is an increasing function of .γ . (c) Determine .μγ when (c1) .μ ∼ N (0, σ 2 ); (c2) .μ ∼ Γ (α, λ); (c3) .μ has a Laplace law of parameter .θ , i.e. having density .f (x) = λ2 e−λ|x| with respect to the Lebesgue measure; (c4) .μ ∼ B(n, p); (c5) .μ is geometric of parameter p. 2.47 (p. 308) Let .μ, ν be probabilities on .R and denote by .Lμ and .Lν respectively their Laplace transforms. Assume that .Lμ = Lν on an open interval .]a, b[, .a < b. (a) Assume .a < 0 < b. Prove that .μ = ν. (b1) Let .a < γ < b and dμγ (x) =
.
eγ x dμ(x), Lμ (γ )
dνγ (x) =
eγ x dν(x) . Lν (γ )
Compute the Laplace transforms .Lμγ and .Lνγ and prove that .μγ = νγ . (b2) Prove that .μ = ν also if .0 ∈]a, b[. 2.48 (p. 308) Let .X1 , . . . , Xn be independent r.v.’s having an exponential law of parameter .λ and let Zn = max(X1 , . . . , Xn ) .
.
The aim of this exercise is to compute the expectation of .Zn . (a) Prove that .Zn has a law having a density with respect to the Lebesgue measure and compute it. What is the value of the mean of .Z2 ? And of .Z3 ? (b) Prove that the Laplace transform of .Zn is Ln (z) = nΓ (n)
.
and determine its domain.
Γ (1 − λz ) Γ (n + 1 − λz )
(2.97)
112
2 Probability
(c) Prove that for the derivative of .log Γ we have the relation .
1 Γ (α) Γ (α + 1) = + Γ (α + 1) α Γ (α)
and deduce that .E(Zn ) = λ1 (1 + · · · + n1 ). 1 Recall the Beta integral . 0 t α−1 (1 − t)β−1 dt =
(2.98)
Γ (α)Γ (β) Γ (α+β) .
2.49 (p. 310) Let X be a d-dimensional r.v. (a) Prove that if X is Gaussian then, for every .ξ ∈ Rd , the real r.v. .ξ, X is Gaussian. (b) Assume that, for every .ξ ∈ Rd , the real r.v. .ξ, X is Gaussian. (b1) Prove that X is square integrable. (b2) Prove that X is Gaussian. • This is a useful criterion. 2.50 (p. 311) Let .X, Y be independent .N(0, 1)-distributed r.v.’s. (a) Prove that U=√
.
X X2
+ Y2
and
V = X2 + Y 2
are independent and deduce the laws of U and of V . (b) Prove that, for .θ ∈ R, the r.v.’s U =
.
X cos θ + Y sin θ √ X2 + Y 2
and
V = X2 + Y 2
are independent and deduce the law of U . 2.51 (p. 312) (Quadratic functions of Gaussian r.v.’s) Let X be an m-dimensional N (0, I )-distributed r.v. and A an .m × m symmetric matrix.
.
(a) Compute E(eAX,X )
.
under the assumption that all eigenvalues of A are .< 12 . (b) Prove that if A has an eigenvalue which is .≥ 12 then .E(eAX,X ) = +∞. (c) Compute the expectation in (2.99) if A is not symmetric. Compare with Exercises 2.7 and 2.53.
(2.99)
Exercises
113
2.52 (p. 313) (Non-central chi-square distributions) (a) Let .X ∼ N(ρ, 1). Compute the Laplace transform L of .X2 (with specification of the domain). (b) Let .X1 , . . . , Xm be independent r.v.’s with .Xi ∼ N(bi , 1), let .X = (X1 , . . . , Xm ) and .W = |X|2 . 2 . Compute .E(W ). (b1) Prove that the law of W depends only on .λ = b12 + · · · + bm (b2) Prove that the Laplace transform of W is, for .ℜz < 12 , L(z) =
.
zλ 1 . exp 1 − 2z (1 − 2z)m/2
• The law of W is the non-central chi-square with m degrees of freedom; .λ is the parameter of non-centrality. 2.53 (p. 314) (a) Let X be an m-dimensional Gaussian .N(0, I )-distributed r.v. What is the law of 2 .|X| ? (b) Let C be an .m × m positive definite matrix and X an m-dimensional Gaussian 2 .N(0, C)-distributed r.v. Prove that .|X| has the same law as an r.v. of the form m .
λk Zk
(2.100)
k=1
where .Z1 , . . . , Zm are independent .χ 2 (1)-distributed r.v.’s and .λ1 , . . . , λm are the eigenvalues of C. Prove that .E(|X|2 ) = tr C. 2.54 (p. 315) Let .X = (X1 , . . . , Xn ) be an .N(0, I )-distributed Gaussian vector. Let, for .k = 1, . . . , n, .Yk = X1 + · · · + Xk − kXk+1 (with the understanding .Xn+1 = 0). Are .Y1 , . . . , Yn independent? 2.55 (p. 315) (a) Let A and B be .d × d real positive definite matrices. Let G be the matrix whose elements are obtained by multiplying A and B entrywise, i.e. .gij = aij bij . Prove that G is itself positive definite (where is probability here?). (b) A function .f : Rd → R is said to be positive definite if .f (x) = f (−x) and if for every choice of .n ∈ N, of .x1 , . . . , xn ∈ Rd and of .ξ1 , . . . , ξn ∈ R, we have n .
f (xh − xk )ξh ξk ≥ 0 .
h,k=1
Prove that the product of two positive definite functions is also positive definite.
114
2 Probability
Let .X, Y be d-dimensional independent r.v.’s having covariance matrices A and B respectively. . . 2.56 (p. 315) Let X and Y be independent r.v.’s, where .X ∼ N(0, 1) and Y is such that .P(Y = ±1) = 12 . Let .Z = XY . (a) What is the law of Z? (b) Are Z and X correlated? Independent? (c) Compute the characteristic function of .X+Z. Prove that X and Z are not jointly Gaussian. 2.57 (p. 316) Let .X1 , . . . , Xn be independent .N(0, 1)-distributed r.v.’s and let 1 Xk . n n
X=
.
k=1
(a) Prove that, for every .i = 1, . . . , n, .X and .Xi − X are independent. (b) Prove that .X is independent of Y = max Xi − min Xi .
.
i=1,...,n
i=1,...,n
2.58 (p. 316) Let .X = (X1 , . . . , Xm ) be an .N(0, I )-distributed r.v. and .a ∈ Rm a vector of modulus 1. (a) Prove that the real r.v. .a, X is independent of the m-dimensional r.v. .X − a, Xa. (b) What is the law of .|X − a, Xa|2 ?
Chapter 3
Convergence
Convergence is an important aspect of the computation of probabilities. It can be defined in many ways, each type of convergence having its own interest and its specific field of application. Note that the notions of convergence and approximation are very close. As usual we shall assume an underlying probability space .(Ω, F, P).
3.1 Convergence of r.v.’s
Definition 3.1 Let X, Xn , n ≥ 1, be r.v.’s on the same probability space (Ω, F, P). (a) If X, Xn , n ≥ 1, take their values in a metric space (E, d), we say that the P
sequence (Xn )n converges to X in probability (written limn→∞ Xn = X) if for every δ > 0 .
lim P d(Xn , X) > δ = 0 .
n→∞
(b) If X, Xn , n ≥ 1, take their values in a topological space E, we say that (Xn )n converges to X almost surely (a.s.) if there exists a negligible event N ∈ F such that for every ω ∈ N c .
lim Xn (ω) = X(ω) .
n→∞
(continued)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Baldi, Probability, Universitext, https://doi.org/10.1007/978-3-031-38492-9_3
115
116
3 Convergence
Definition 3.1 (continued) (c) If X, Xn , n ≥ 1, are Rm -valued, we say that (Xn )n converges to X in Lp if Xn ∈ Lp for every n and .
1/p lim E |Xn − X|p = lim Xn − Xp = 0 .
n→∞
n→∞
Remark 3.2 (a) Recalling that for probabilities the Lp norm is an increasing function of p (see p. 63), Lp convergence implies Lq convergence for every q ≤ p. (b) Indeed Lp convergence can be defined for r.v.’s with values in a normed space. We shall restrict ourselves to the Euclidean case, but all the properties that we shall see also hold for r.v.’s with values in a general complete normed space. In this case Lp is a Banach space. (c) Recall (see Remark 1.30) the inequality Xp − Y p ≤ X − Y p .
.
Therefore Lp convergence entails convergence of the Lp norms.
Let us compare these different types of convergence. Assume the r.v.’s (Xn )n to be Rm -valued: by Markov’s inequality we have, for every p > 0, 1 P |Xn − X| > δ ≤ p E |Xn − X|p , δ
.
hence
Lp convergence, p > 0, implies convergence in probability.
If the sequence (Xn )n , with values in a metric space (E, d), converges a.s. to an r.v. X, then d(Xn , X) →n→∞ 0 a.s., i.e., for every δ > 0, 1{d(Xn ,X)>δ} →n→∞ 0 a.s. and by Lebesgue’s Theorem .
lim P d(Xn , X) > δ = lim E(1{d(Xn ,X)>δ} ) = 0 ,
n→∞
n→∞
3.2 Almost Sure Convergence and the Borel-Cantelli Lemma
117
i.e.
a.s. convergence implies convergence in probability.
The converse is not true, as shown in Example 3.5 below. Note that convergence in probability only depends on the joint laws of X and each of the Xn , whereas a.s. convergence depends in a deeper way on the joint distributions of the Xn ’s and X. It is easy to construct examples of sequences converging a.s. but not in Lp : these two modes of convergence are not comparable, even if a.s. convergence is usually considered to be stronger. The investigation of a.s. convergence requires an important tool that is introduced in the next section.
3.2 Almost Sure Convergence and the Borel-Cantelli Lemma
Let (An )n ⊂ F be a sequence of events and let ∞
A = lim An :=
.
n→∞
Ak .
n=1 k≥n
A is the superior limit of the events (An )n . A closer look at this definition shows that .ω ∈ A if and only if ω∈
.
Ak
for every n ,
k≥n
that is if and only if .ω ∈ Ak for infinitely many indices k, i.e. .
lim An = {ω; ω ∈ Ak for infinitely many indices k} .
n→∞
The name “superior limit” comes from the fact that 1A = lim 1An .
.
n→∞
(3.1)
118
3 Convergence
Clearly the superior limit of a sequence .(An )n does not depend on the “first” events A1 , . . . , Ak . Hence it belongs to the tail .σ -algebra
.
.
B∞ =
∞
σ (1Ai , 1Ai+1 , . . . )
i=1
and, if the events .A1 , A2 , . . . are independent, by Kolmogorov’s Theorem 2.15 their superior limit can only have probability 0 or 1. The following result provides a simple and powerful tool to establish which one of these contingencies holds.
Theorem 3.3 (The Borel-Cantelli Lemma) Let .(An )n ⊂ F be a sequence of events. (a) If . ∞ n=1 P(An ) < +∞ then .P(limn→∞ An ) = 0. (b) If . ∞ n=1 P(An ) = +∞ and the events .An are independent then .P(limn→∞ An ) = 1.
Proof (a) We have ∞ .
∞
P(An ) = E 1An
n=1
n=1
but .limn→∞ An is exactly the event .{ ∞ n=1 1An = +∞}: if .ω ∈ limn→∞ An then .ω ∈ An for infinitely many indices and therefore in the series on the right-hand side there are infinitely many terms that are equal to 1. Hence if . ∞ n=1 P(An ) < +∞, 1 is integrable and the event . lim A is negligible (the set of .ω’s then . ∞ n→∞ n n=1 An on which an integrable function takes the value .+∞ is negligible, Exercise 1.9). (b) By definition the sequence of events .
Ak
k≥n
n
decreases to .limn→∞ An . Hence
P lim An = lim P Ak .
.
n→∞
n→∞
k≥n
(3.2)
3.2 Almost Sure Convergence and the Borel-Cantelli Lemma
119
Let us prove that, for every n, .P k≥n Ak = 1 or, what is the same, that c P Ak = 0 .
.
k≥n
We have, the .An being independent, N N
P Ack = lim P Ack = lim P(Ack )
.
k≥n
= lim
N →∞
N →∞
N →∞
k=n
k=n
∞ N 1 − P(Ak ) = 1 − P(Ak ) . k=n
k=n
As we assume . ∞ n=1 P(An ) = +∞, the infinite product above vanishes by a wellknown convergence result for infinite products (recalled in the next proposition). Therefore .P k≥n Ak = 1 for every n and the limit in (3.2) is equal to 1. Proposition 3.4 Let .(uk )k be a sequence of numbers with .0 ≤ uk ≤ 1 and let a :=
∞
.
(1 − uk ) .
k=1
Then
(a) If . ∞ = 0. k=1 uk = +∞ then .a (b) If .uk < 1 for every k and . ∞ k=1 uk < +∞ then .a > 0.
Proof (a) The inequality .1 − x ≤ e−x gives a = lim
.
n→∞
n
(1 − uk ) ≤ lim
k=1
n→∞
n k=1
n
e−uk = lim exp − uk = 0 . n→∞
k=1
120
3 Convergence 1
..... ......... ........ .. ...... .. ....... .. ........ .. ......... .. ......... ... . .. .. .............. ..... .. .. ..... ... .. ..... ... .. ..... .. ..... ..... .. ... ..... .. ..... ... .. ... ..... .. ... ..... .. .... . . . .. ..... .... .. . . ..... .... ... . .... . ..... ... ... . . ... ..... ... ..... ... ...... ... .... ....... •............. ..... ...... ..... .... ......... ..... ..... .. ..... ..... .. ..... ..
0
1
d
Fig. 3.1 The graphs of .x → 1 − x together with .x → e−x (dots, the upper one) and .x → e−2x −2x for .0 ≤ x ≤ δ for some .δ > 0 (see Fig. 3.1). As (b) We ∞have .1 − x ≥ e . k=1 uk < +∞, we have .uk →k→∞ 0, so that .uk ≤ δ for .k ≥ n0 . Hence n .
k=1
(1 − uk ) =
n0
(1 − uk ) ≥
k=n0 +1
k=1
=
n
(1 − uk ) n0
n0
(1 − uk )
k=1
n
e−2uk
k=n0 +1
n
(1 − uk ) × exp − 2 uk k=n0 +1
k=1
0 and as .n → ∞ this converges to . nk=1 (1 − uk ) × exp − 2 ∞ k=n0 +1 uk > 0.
Example 3.5 Let .(Xn )n be a sequence of i.i.d. r.v.’s having an exponential law of parameter .λ and let .c > 0. What is the probability of the event .
lim {Xn ≥ c log n} ?
n→∞
(3.3)
Note that the events .{Xn ≥ c log n} have a probability that decreases to 0, as the .Xn have the same law. But, at least if the constant c is small enough, might it be true that .Xn ≥ c log n for infinitely many indices n a.s.? The Borel-Cantelli lemma allows us to face this question in a simple way: as these events are independent, it suffices to determine the nature of the series
.
∞ P Xn ≥ c log n . n=1
3.2 Almost Sure Convergence and the Borel-Cantelli Lemma
121
Recalling the d.f. of the exponential laws, 1 P Xn ≥ c log n = e−λc log n = λc , n
.
which is the general term of a convergent series if and only if .c > λ1 . Hence the superior limit (3.3) has probability 0 if .c > λ1 and probability 1 if .c ≤ λ1 . The computation above provides an example of a sequence converging in probability but not a.s.: the sequence .( log1 n Xn )n tends to zero in .Lp , and therefore also in probability as, for every .p > 0, .
X p 1 n p = lim lim E E(X1 ) = 0 n→∞ n→∞ (log n)p log n
(an exponential r.v. has finite moments of all orders). A.s. convergence however does not take place: as seen above, with probability 1 .
Xn ≥ε log n
infinitely many times as soon as .ε ≤ place.
1 λ
so that a.s. convergence cannot take
We can now give practical conditions ensuring a.s. convergence.
Proposition 3.6 Let .(Xn )n be a sequence of r.v.’s with values in a metric space .(E, d). Then .limn→∞ Xn = X a.s. if and only if
P lim {d(Xn , X) > δ} = 0 for every δ > 0 .
.
n→∞
(3.4)
Proof If .limn→∞ Xn = X a.s. then, with probability 1, .d(Xn , X) can be larger than δ > 0 for a finite number of indices at most, hence (3.4). Conversely if (3.4) holds, then with probability 1 .d(Xn , X) > δ only for a finite number of indices, so that .limn→∞ d(Xn , X) ≤ δ and the result follows thanks to the arbitrariness of .δ. .
Together with Proposition 3.6, the Borel-Cantelli Lemma provides a criterion for a.s. convergence:
122
3 Convergence
Remark 3.7 If for every .δ > 0 the series . ∞ n=1 P(d(Xn , X) > δ) converges (no assumptions of independence), then (3.4) holds and .Xn →a.s. n→∞ X. Note that, in comparison, only .limn→∞ P(d(Xn , X) > δ) = 0 for every .δ > 0 is required in order to have convergence in probability. In the sequel we shall use often the following very useful elementary fact.
Criterion 3.8 (The Sub-Sub-Sequence Criterion) Let .(xn )n be a sequence in the metric space .(E, d). Then .limn→∞ xn = x if and only if from every subsequence .(xnk )k a further subsequence converging to x can be extracted.
Proposition 3.9 (a) If .(Xn )n converges to X in probability, then there exists a subsequence .(Xnk )k such that .Xnk →k→∞ X a.s. (b) .(Xn )n converges to X in probability if and only if every subsequence .(Xnk )k admits a further subsequence converging to X a.s.
Proof (a) By the definition of convergence in probability we have, for every positive integer k, .
lim P d(Xn , X) > 2−k = 0 .
n→∞
Let, for every k, .nk be an integer such that .P(d(Xn , X) > 2−k ) ≤ 2−k for every .n ≥ nk . We can assume the sequence .(nk )k to be increasing. For .δ > 0 let .k0 be an integer such that .2−k ≤ δ for .k > k0 . Then, for .k > k0 , P d(Xnk , X) > δ ≤ P d(Xnk , X) > 2−k ≤ 2−k
.
−k and the series . ∞ k=1 P(d(Xnk , X) > δ) is summable as .P(d(Xnk , X) > δ) < 2 eventually. By the Borel-Cantelli lemma .P(limk→∞ {d(Xnk , X) > δ}) = 0 and .Xnk →k→∞ X a.s. by Proposition 3.6.
3.2 Almost Sure Convergence and the Borel-Cantelli Lemma
123
(b) The only if part follows from (a). Conversely, let us take advantage of Criterion 3.8: let us prove that from every subsequence of .(P(d(Xn , X) ≥ δ))n we can extract a further subsequence converging to 0. But by assumption from every subsequence .(Xnk )k we can extract a further subsequence .(Xnkh )h such that a.s. .Xnk → h→∞ X, hence also .limh→0 P(d(Xnkh , X) ≥ δ) = 0 as a.s. convergence h implies convergence in probability. Proposition 3.9, together with Criterion 3.8, allows us to obtain some valuable insights about convergence in probability. • For convergence in probability many properties hold that are obvious for a.s. convergence. In particular, if .Xn →Pn→∞ X and .Φ : E → G is a continuous function, G denoting another metric space, then also .Φ(Xn ) →Pn→∞ Φ(X). Actually from every subsequence of .(Xn )n a further subsequence, .(Xnk )k say, can be extracted converging to X a.s. and of course .Φ(Xnk ) →a.s. n→∞ Φ(X). Hence for every subsequence of .(Φ(Xn ))n a further subsequence can be extracted converging a.s. to .Φ(X) and the statement follows from Proposition 3.9. In quite a similar way other useful properties of convergence in probability can be obtained. For instance, if .Xn →Pn→∞ X and .Yn →Pn→∞ Y , then also .Xn + Yn →Pn→∞ X + Y . • The a.s. limit is obviously unique: if Y and Z are two a.s. limits of the same sequence .(Xn )n , then .Y = Z a.s. Let us prove uniqueness also for the limit in probability, which is less immediate. Let us assume that .Xn →Pn→∞ Y and .Xn →Pn→∞ Z. By Proposition 3.9(a) we can find a subsequence of .(Xn )n converging a.s. to Y . This subsequence obviously still converges to Z in probability and from it we can extract a further subsequence converging a.s. to Z. This sub-sub-sequence converges a.s. to both Y and Z and therefore .Y = Z a.s. P • The limits a.s. and in probability coincide: if .Xn →a.s. n→∞ Y and .Xn →n→∞ Z then .Y = Z a.s. • .Lp convergence implies a.s. convergence for a subsequence.
Proposition 3.10 (Cauchy Sequences in Probability) Let .(Xn )n be a sequence of r.v.’s with values in the complete metric space E and such that for every .δ, ε > 0 there exists an .n0 such that P d(Xn , Xm ) > δ ≤ ε
.
for every n, m ≥ n0 .
Then .(Xn )n converges in probability to some E-valued r.v. X.
(3.5)
124
3 Convergence
Proof For every .k > 0 let .nk be an index such that, for every .m ≥ nk , P d(Xnk , Xm ) ≥ 2−k ≤ 2−k .
.
The sequence .(nk )k of course can be chosen to be increasing, therefore
.
∞ P d(Xnk , Xnk+1 ) ≥ 2−k < +∞ , k=1
and, by the Borel-Cantelli Lemma, the event .N := limk→∞ {d(Xnk , Xnk+1 ) ≥ 2−k } has probability 0. Outside N we have .d(Xnk , Xnk+1 ) < 2−k for every k larger than some .k0 and, for .ω ∈ N c , .k ≥ k0 and .m > k, d(Xnk , Xnm ) ≤
m
.
2−i ≤ 2 · 2−k .
i=k
Therefore, for .ω ∈ N c , .(Xnk (ω))k is a Cauchy sequence in E and converges to some limit .X(ω) ∈ E. Hence the sequence .(Xnk )k converges a.s. to some r.v. X. Let us deduce that .Xn →Pn→∞ X: choose first an index .nk as above and large enough so that ε P d(Xn , Xnk ) ≥ 2δ ≤ 2 ε δ P d(X, Xnk ) ≥ 2 ≤ · 2
.
for every n ≥ nk.
(3.6) (3.7)
An index .nk with these properties exists thanks to (3.5) and as .Xnk →Pn→∞ X. Thus, for every .n ≥ nk , P d(Xn , X) ≥ δ ≤ P d(Xn , Xnk ) ≥ 2δ + P d(X, Xnk ) ≥ 2δ ≤ ε .
.
In the previous proof we have been a bit careless: the limit X is only defined on .N c and we should prove that it can be defined on the whole of .Ω in a measurable way. This recurring question is treated in Remark 1.15.
3.3 Strong Laws of Large Numbers In this section we see that, under rather weak assumptions, if .(Xn )n is a sequence of independent r.v.’s (or at least uncorrelated) and having finite mathematical
3.3 Strong Laws of Large Numbers
125
expectation b, then their empirical means Xn :=
.
1 (X1 + · · · + Xn ) n
converge a.s. to b. This type of result is a strong law of Large Numbers, as opposed to the weak laws, which are concerned with .Lp convergence or in probability. Note that we can assume .b = 0: otherwise if .Yn = Xn − b the r.v.’s .Yn have mean a.s. 0 and, as .Y n = Xn − b, to prove that .X n →a.s. n→∞ b or that .Y n →n→∞ 0 is the same thing.
Theorem 3.11 (Rajchman’s Strong Law) Let .(Xn )n be a sequence of pairwise uncorrelated r.v.’s having a common mean b and finite variance and assume that .
sup Var(Xn ) := M < +∞ .
(3.8)
n≥1
Then .X n →n→∞ b a.s.
Proof Let .Sn := X1 + · · · + Xn and assume .b = 0. For every .δ > 0 by Chebyshev’s inequality 2
n 1 1 M 1 .P |X n2 | > δ ≤ Var(X Var(Xk ) ≤ 2 2 · 2) = n 2 2 4 δ δ n δ n k=1
1 As the series . ∞ n=1 n2 is summable, by Remark 3.7 the subsequence .(X n2 )n converges to 0 a.s. Now we need to investigate the behavior of .Xn between two consecutive integers of the form .n2 . With this goal let Dn :=
.
sup n2 ≤k δ is summable and in order to do this, thinking of Markov’s inequality, we shall look for estimates of the second order moment of .Dn .
126
3 Convergence
We have .Dn2 ≤
n2 ≤k δ ≤ 2 4 E(Dn2 ) ≤ 2 2 , n δ n δ n
.
which is summable, completing the proof.
Note that, under the assumptions of Rajchman’s Theorem 3.11, by Chebyshev’s inequality, 1 P |Xn − b| ≥ δ ≤ 2 Var(X n ) δ 1 M = 2 2 Var(X1 ) + · · · + Var(Xn ) ≤ 2 δ n δ n .
→
n→∞
0,
so that the weak law, .X n →Pn→∞ b, is immediate and much easier to prove than the strong law. We state finally, without proof, the most celebrated Law of Large Numbers. It requires the r.v.’s to be independent and identically distributed, but the assumptions of existence of moments are weaker (the variances might be infinite) and the statement is much more precise. See [3, Theorem 10.42, p. 231], for a proof.
Theorem 3.12 (Kolmogorov’s Strong Law) Let .(Xn )n be a sequence of real i.i.d. r.v.’s. Then (a) if the .Xn are integrable, then .X n →n→∞ b = E(X) a.s.; (continued)
3.3 Strong Laws of Large Numbers
127
Theorem 3.12 (continued) (b) if .E(|Xn |) = +∞, then at least one of the two terminal r.v.’s .
lim Xn
n→∞
and
lim X n
n→∞
is a.s. infinite (i.e. one of them at least takes the values .+∞ or .−∞ a.s.).
Example 3.13 Let .(Xn )n be a sequence of i.i.d. Cauchy-distributed r.v.’s. If Xn = n1 (X1 + · · · + Xn ) then of course the Law of Large Numbers does not hold, as .Xn does not have a finite mathematical expectation. Kolmogorov’s law however gives a more precise information about the behavior of the sequence .(X n )n : as the two sequences .(Xn )n and .(−Xn )n have the same joint laws, we have .
.
lim X n ∼ lim −X n = − lim Xn
n→∞
n→∞
n→∞
a.s.
As by the Kolmogorov strong law at least one among .limn→∞ Xn and limn→∞ Xn must be infinite, we derive that
.
.
lim X n = −∞,
n→∞
lim Xn = +∞
n→∞
a.s.
Hence the sequence of the empirical means takes infinitely many times very large and infinitely many times very small (i.e. negative and large in absolute value) values with larger and larger oscillations.
The law of Large Numbers is the theoretical justification for many algorithms of estimation and numerical approximation. The following example provides an instance of such an application. More insight about applications of the Law of Large Numbers is given in Sect. 6.1.
Example 3.14 (Histograms) Let .(Xn )n be a sequence of real i.i.d. r.v.’s whose law has a density f with respect to the Lebesgue measure. For a given bounded interval .[a, b], let us split it into subintervals .I1 , . . . , Ik , and let, for every .j = 1, . . . , k, 1 1Ij (Xi ) . n n
Zj(n) =
.
i=1
128
3 Convergence
n
.
i=1 1Ij (Xi ) is (n)
the number of r.v.’s (observations) .Xi falling in the interval
Ij , hence .Zj is the proportion of the first n observations .X1 , . . . , Xn whose values belong to the interval .Ij . It is usual to visualize the r.v.’s .Z1(n) , . . . , Zk(n) by drawing above each interval .Ij a rectangle of area proportional to .Zj(n) ; if the intervals .Ij are equally spaced this means, of course, that the heights of the rectangles are (n) proportional to .Zj . The resulting figure is called a histogram; this is a very popular method for visually presenting information concerning the common density of the observations .X1 , . . . , Xn . The Law of Large Numbers states that
.
(n)
Zj
.
a.s.
→
E[1Ij (Xi )] = P(Xi ∈ Ij ) =
n→∞
f (x) dx . Ij
If the intervals .Ij are small enough, so that the variation of f on .Ij is small, then the rectangles of the histogram will roughly have heights proportional to the corresponding values of f . Therefore for large n the histogram provides information about the density f . Figure 3.2 gives an example of a histogram for .n = 200 independent observations of a .Γ (3, 1) law, compared with the true density. This is a very rough and very initial instance of an important chapter of statistics: the estimation of a density.
........................ ........ ....... ..... ...... ...... ..... . . . . ...... .. . ...... . . . ..... . . ...... . . . ...... . . ..... . . ...... . .. ...... . . . ...... . . ...... . . . ...... . . ...... . . ....... . . . ....... . .. ....... . ........ .. . ........ . . ......... . . .......... . ........... .. . ............. . . ................ . . ......................... . .. ........................................... . . . .......................... ......
0
1
2
3
4
5
6
7
8
9
Fig. 3.2 Histogram of 200 independent .Γ (3, 1)-distributed observations, compared with their density
3.4 Weak Convergence of Measures
129
3.4 Weak Convergence of Measures We introduce now a notion of convergence of probability laws. Let .(E, E) be a measurable space and .μ, μn , n ≥ 1, measures on .(E, E). A typical way (not the only one) of defining a convergence .μn →n→∞ μ is the following: first fix a class . D of measurable functions .f : E → R and then define that .μn →n→∞ μ if and only if
.
lim
n→∞ E
f dμn =
for every f ∈ D .
f dμ E
Of course according to the choice of the class . D we obtain different types of convergence (possibly mutually incomparable). In the sequel, in order to simplify the notation we shall sometimes write .μ(f ) instead of . f dμ (which reminds us that a measure can also be seen as a functional on functions, Proposition 1.24).
Definition 3.15 Let E be a topological space and .μ, μn , .n ≥ 1, finite measures on .B(E). We say that .(μn )n converges to .μ weakly if and only if for every function .f ∈ Cb (E) (bounded continuous functions on E) we have . lim f dμn = f dμ . (3.10) n→∞ E
E
A first important property of weak convergence is the following. Remark 3.16 Let .μ, μn , .n ≥ 1, be probabilities on the topological space E, let .Φ : E → G be a continuous map to another topological space G and let us denote by .νn , ν the images of .μn and .μ under .Φ respectively. If .μn →n→∞ μ weakly, then also .νn →n→∞ ν weakly. Indeed if .f : G → R is bounded continuous, then .f ◦ Φ is also bounded continuous .E → R. Hence, thanks to Proposition 1.27 (integration with respect to an image measure), νn (f ) = μn (f ◦ Φ) → μ(f ◦ Φ) = ν(f ) .
.
n→∞
130
3 Convergence
Assume E to be a metric space. Then the weak limit is unique. Actually if simultaneously .μn →n→∞ μ and .μn →n→∞ ν weakly, then necessarily
f dμ =
.
E
f dν
(3.11)
E
for every .f ∈ Cb (E), and therefore .μ and .ν coincide (Proposition 1.25).
Proposition 3.17 Let . D be a vector space of bounded measurable functions on the measurable space .(E, E) and let .μ, μn , .n ≥ 1, be probabilities on .(E, E). Then in order for the relation μn (g)
.
→
μ(g)
n→∞
(3.12)
to hold for every .g ∈ D it is sufficient for (3.12) to hold for every function g belonging to a set H that is total in . D.
Proof By definition H is total in . D if and only if the vector space . H of the linear combinations of functions of H is dense in . D in the uniform norm. If (3.12) holds for every .g ∈ H , by linearity it also holds for every .g ∈ H. Let .f ∈ D and let .g ∈ H be such that .f − g∞ ≤ ε; therefore for every n
|f − g| dμn ≤ ε,
.
E
|f − g| dμ ≤ ε . E
Let now .n0 be such that .|μn (g) − μ(g)| ≤ ε for .n ≥ n0 ; then for .n ≥ n0 |μn (f ) − μ(f )| ≤ |μn (f ) − μn (g)| + |μn (g) − μ(g)| + |μ(g) − μ(f )| ≤ 3ε
.
and by the arbitrariness of .ε the result follows.
If moreover E is also separable and locally compact then we have the following criterion.
Proposition 3.18 Let .μ, μn , .n ≥ 1, be finite measures on the locally compact separable metric space E, then .μn →n→∞ μ weakly if and only if (a) .μn (f ) →n→∞ μ(f ) for every compactly supported continuous function, (b) .μn (1) →n→∞ μ(1).
3.4 Weak Convergence of Measures
131
Proof Let us assume that (a) and (b) hold and let us prove that .(μn )n converges to .μ weakly, the converse being obvious. Recall (Lemma 1.26) that there exists an increasing sequence .(hn )n of continuous compactly supported functions such that .hn ↑ 1 as .n → ∞. Let .f ∈ Cb (E), then .f hk →k→∞ f and the functions .f hk are continuous and compactly supported. We have, for every k,
.
|μn (f ) − μ(f )| = μn (1 − hk + hk )f − μ (1 − hk + hk )f ≤ μn (1 − hk )f + μ (1 − hk )f | + μn (f hk ) − μ(f hk )
(3.13)
≤ f ∞ μn (1 − hk ) + f ∞ μ(1 − hk ) + |μn (f hk ) − μ(f hk )| . We have, adding and subtracting wisely, μn (1 − hk ) + μ(1 − hk ) = μn (1) + μ(1) − μn (hk ) − μ(hk )
.
= μn (1) − μ(1) + 2μ(1) − 2μ(hk ) + μ(hk ) − μn (hk ) = 2μ(1 − hk ) + (μn (1) − μ(1)) + (μ(hk ) − μn (hk )) so that, going back to (3.13), .|μn (f ) − μ(f )| ≤ |μn (f hk ) − μ(f hk )|+f ∞ |μn (hk ) − μ(hk )|+|μn (1)−μ(1)| + 2μ(1−hk ) .
Recalling that the functions .hk and .f hk are compactly supported, if we choose k large enough so that .μ(1 − hk ) ≤ ε, we have .
lim |μn (f ) − μ(f )| ≤ 2εf ∞
n→∞
from which the result follows owing to the arbitrariness of .ε.
Remark 3.19 Putting together Propositions 3.17 and 3.18, if E is a locally compact separable metric space, in order to prove weak convergence we just need to check (3.10) for every .f ∈ CK (E) or for every .f ∈ C0 (E) (functions vanishing at infinity) or indeed any family of functions that is total in .C0 (E). If .E = Rd , a total family that we shall use in the sequel is that of the functions .ψσ as in (2.51) for .ψ ∈ CK (Rd ) and .σ > 0, which is dense in d .C0 (R ) thanks to Lemma 2.29.
132
3 Convergence
Let .μ, μn , .n ≥ 1, be probabilities on .Rd and let us assume that .μn →n→∞ μ weakly. Then clearly . μn (θ ) →n→∞ μ(θ ): just note that for every .θ ∈ Rd μ(θ ) =
.
Rd
ei x,θ dμ(x) ,
i.e. . μ(θ ) is the integral with respect to .μ of the bounded continuous function .x → ei x,θ . Therefore weak convergence, for probabilities on .Rd , implies pointwise convergence of the characteristic functions. The following result states that the converse also holds.
Theorem 3.20 (P. Lévy) Let .μ, μn , .n ≥ 1, be probabilities on .Rd . Then .(μn )n converges weakly to .μ if and only if . μn (θ ) →n→∞ μ(θ ) for every d .θ ∈ R .
Proof Thanks to Remark 3.19 it suffices to prove that .μn (ψσ ) →n→∞ μ(ψσ ) where .ψσ is as in (2.51) with .ψ ∈ CK (Rd ). Thanks to (2.53) .
Rd
1 2 1 2 ψσ (x) dμn (x) = ψ(y) dy e− 2 σ |θ| e−i θ,y μn (θ ) dθ d (2π )d Rd R μn (θ )H (θ ) dθ , =
(3.14)
Rd
where 1 2 1 2 .H (θ ) = e− 2 σ |θ| d (2π )
Rd
ψ(y) e−i θ,y dy .
The integrand of the integral on the right-hand side of (3.14) converges pointwise to 1 2 2 . μH and is majorized in modulus by .θ → (2π )−d e− 2 σ |θ| Rd |ψ(y)| dy. We can therefore apply Lebesgue’s Theorem, giving
.
lim
n→∞ Rd
ψσ (x) dμn (x) =
which completes the proof.
Rd
μ(θ )H (θ ) dθ =
Rd
ψσ (x) dμ(x) ,
Actually P. Lévy proved a much deeper result: if .( μn )n converges pointwise to a function .κ and if .κ is continuous at 0, then .κ is the characteristic function of a probability .μ and .(μn )n converges weakly to .μ. We will prove this sharper result in Theorem 6.21.
3.4 Weak Convergence of Measures
133
If .μn →n→∞ μ weakly, what can be said of the behavior of .μn (f ) when f is not bounded continuous? And in particular when f is the indicator function of an event?
Theorem 3.21 (The “Portmanteau” Theorem) Let .μ, μn , .n ≥ 1, be probabilities on the metric space E. Then .μn →n→∞ μ weakly if and only if one of the following properties hold. (a) For every lower semi-continuous (l.s.c.) function .f : E → R bounded from below
.
lim n→∞ E
f dμn ≥
(3.15)
f dμ . E
(b) For every upper semi-continuous (u.s.c.) function .f : E → R bounded from above
.
lim
n→∞ E
f dμn ≤
(3.16)
f dμ . E
(c) For every bounded function f such that the set of its points of discontinuity is negligible with respect to .μ
.
lim
n→∞ E
f dμn =
(3.17)
f dμ . E
Proof Clearly (a) and (b) are equivalent (if f is as in (a)), then .−f is as in (b) and together they imply weak convergence, as, if .f ∈ Cb (E), then to f we can apply simultaneously (3.15) and (3.16), obtaining (3.10). Conversely, let us assume that .μn →n→∞ μ weakly and that f is l.s.c. and bounded from below. Then (property of l.s.c. functions) there exists an increasing sequence of bounded continuous functions .(fk )k such that .supk fk = f . As .fk ≤ f , for every k we have
fk dμ = lim
.
E
n→∞ E
fk dμn ≤ lim
n→∞ E
f dμn
and, taking the .sup in k in this relation, by Beppo Levi’s Theorem the term on the left-hand side increases to . E f dμ and we have (3.15).
134
3 Convergence
Let us prove now that if .μn →n→∞ μ weakly, then c) holds (the converse is obvious). Let .f ∗ and .f∗ be the two functions defined as f ∗ (x) = lim f (y) .
f∗ (x) = lim f (y)
.
(3.18)
y→x
y→x
In the next Lemma 3.22 we prove that .f∗ is l.s.c. whereas .f ∗ is u.s.c. Clearly .f∗ ≤ f ≤ f ∗ . Moreover these three functions coincide on the set C of continuity points of f ; as we assume .μ(C c ) = 0 they are therefore bounded .μ-a.s. and
f∗ dμ =
.
E
f ∗ dμ .
f dμ = E
E
Now (3.15) and (3.16) give
f dμ =
.
E
E
f dμ = E
f∗ dμ ≤ lim
n→∞ E
∗
f dμ ≥ lim
n→∞ E
E
f∗ dμn ≤ lim
n→∞ E
f dμn ,
∗
f dμn ≥ lim
n→∞ E
f dμn
which gives
.
lim n→∞ E
f dμn ≥
f dμ ≥ lim E
n→∞ E
f dμn ,
completing the proof.
Lemma 3.22 The functions .f∗ and .f ∗ in (3.18) are l.s.c. and u.s.c. respectively.
Proof Let .x ∈ E. We must prove that, for every .δ > 0, there exists a neighborhood Uδ of x such that .f∗ (z) ≥ f∗ (x) − δ for every .z ∈ Uδ . By the definition of .lim, there exists a neighborhood .Vδ of x such that .f (y) ≥ f∗ (x) − δ for every .y ∈ Vδ . If .z ∈ Vδ , there exists a neighborhood V of z such that .V ⊂ Vδ , so that .f (y) ≥ f∗ (x) − δ for every .y ∈ V . This implies that .f∗ (z) = limy→z f (y) ≥ f∗ (x) − δ. We can therefore choose .Uδ = Vδ and we have proved that .f∗ is l.s.c. Of course the argument for .f ∗ is the same.
.
Assume that .μn →n→∞ μ weakly and .A ∈ B(E). Can we say that .μn (A) →n→∞ μ(A)? The portmanteau Theorem 3.21 gives some answers.
3.4 Weak Convergence of Measures
135
If .G ⊂ E is an open set, then its indicator function .1G is l.s.c. and by (3.15) .
lim μn (G) = lim
n→∞
n→∞ E
1G dμn ≥
1G dμ = μ(G) .
(3.19)
E
In order to give some intuition, think of a sequence of points .(xn )n ⊂ G and converging to some point .x ∈ ∂G. It is easy to check that .δxn →n→∞ δx weakly (see also Example 3.24 below) and we would have .δxn (G) = 1 for every n but .δx (G) = 0, as the limit point x does not belong to G. Similarly if F is closed then .1F is u.s.c. and
.
lim μn (F ) = lim
n→∞
n→∞ E
1F dμn ≤
1F dμ = μ(F ) .
(3.20)
E
Of course we have .μn (A) →n→∞ μ(A), whether A is an open set or a closed one, if its boundary .∂A is .μ-negligible: actually .∂A is the set of discontinuity points of .1A . Conversely if (3.19) holds for every open set G (resp. if (3.20) holds for every closed set F ) it can be proved that .μn →n→∞ μ (Exercise 3.17). If .E = R we have the following criterion.
Proposition 3.23 Let .μ, μn , .n ≥ 1, be probabilities on .R and let us denote by .Fn , F the respective distribution functions. Then .μn →n→∞ μ weakly if and only if .
lim Fn (x) = F (x)
n→∞
for every continuity point x of F .
(3.21)
Proof Assume that .μn →n→∞ μ weakly. We know that if x is a continuity point of F then .μ({x}) = 0. As .{x} is the boundary of .] − ∞, x], by the portmanteau Theorem 3.21 c), Fn (x) = μn (] − ∞, x])
.
→
n→∞
μ(] − ∞, x]) = F (x) .
Conversely let us assume that (3.21) holds. If a and b are continuity points of F then μn (]a, b]) = Fn (b) − Fn (a)
.
→
n→∞
F (b) − F (a) = μ(]a, b]) .
(3.22)
As the points of discontinuity of the increasing function F are at most countably many, (3.21) holds for x in a set D that is dense in .R. Thanks to Proposition 3.19 we just need to prove that .μn (f ) →n→∞ μ(f ) for every .f ∈ CK (R); this will
136
3 Convergence
follow from an adaptation of the argument of approximation of the integral with its Riemann sums. As f is uniformly continuous, for fixed .ε > 0 let .δ > 0 be such that .|f (x) − f (y)| < ε whenever .|x − y| < δ. Let .z0 < z1 < · · · < zN be a grid in an interval containing the support of f such that .zk ∈ D and .|zk − zk−1 | ≤ δ. This is possible, D being dense in .R. If
$$S_n = \sum_{k=1}^N f(z_k)\bigl(F_n(z_k) - F_n(z_{k-1})\bigr), \qquad S = \sum_{k=1}^N f(z_k)\bigl(F(z_k) - F(z_{k-1})\bigr)$$
then, as the .zk are continuity points of F , .limn→∞ Sn = S. We have
$$\Bigl|\int_{-\infty}^{+\infty} f\,d\mu_n - \int_{-\infty}^{+\infty} f\,d\mu\Bigr| \le \Bigl|\int_{-\infty}^{+\infty} f\,d\mu_n - S_n\Bigr| + |S_n - S| + \Bigl|\int_{-\infty}^{+\infty} f\,d\mu - S\Bigr| \tag{3.23}$$
and
$$\Bigl|\int_{-\infty}^{+\infty} f\,d\mu_n - S_n\Bigr| = \Bigl|\sum_{k=1}^N \int_{z_{k-1}}^{z_k}\bigl(f(x) - f(z_k)\bigr)\,d\mu_n(x)\Bigr| \le \sum_{k=1}^N \int_{z_{k-1}}^{z_k}\bigl|f(x) - f(z_k)\bigr|\,d\mu_n(x) \le \varepsilon\sum_{k=1}^N \mu_n(\,]z_{k-1},z_k]\,) = \varepsilon\bigl(F_n(z_N) - F_n(z_0)\bigr) \le \varepsilon .$$
Similarly
$$\Bigl|\int_{-\infty}^{+\infty} f\,d\mu - S\Bigr| \le \varepsilon$$
and from (3.23) we obtain
$$\limsup_{n\to\infty}\Bigl|\int_{-\infty}^{+\infty} f\,d\mu_n - \int_{-\infty}^{+\infty} f\,d\mu\Bigr| \le 2\varepsilon$$
and the result follows thanks to the arbitrariness of .ε.
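As a quick numerical sanity check of Proposition 3.23 — this snippet is not part of the text and the exponential laws are chosen only for illustration — one can compare the d.f.'s of exponential laws of parameter 1 + 1/n with the d.f. of the limiting exponential law of parameter 1 at a few continuity points:

```python
import math

def exp_df(lam, x):
    """Distribution function of an exponential law of parameter lam."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

for x in (0.5, 1.0, 2.0):            # continuity points of the limit d.f.
    for n in (10, 100, 1000):
        Fn = exp_df(1 + 1 / n, x)    # d.f. of mu_n
        F = exp_df(1.0, x)           # d.f. of the limit mu
        print(f"x={x}  n={n:5d}  F_n(x)={Fn:.6f}  F(x)={F:.6f}")
```

The printed values of F_n(x) approach F(x) as n grows, in agreement with (3.21).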
Example 3.24 (a) .μn = δ1/n (Dirac mass at 1/n). Then .μn → δ0 weakly. Actually if .f ∈ Cb (R)
$$\int_{\mathbb R} f\,d\mu_n = f\bigl(\tfrac1n\bigr) \ \underset{n\to\infty}{\to}\ f(0) = \int_{\mathbb R} f\,d\delta_0 .$$
Note that if .G =]0, 1[, then .μn (G) = 1 for every n and therefore limn→∞ μn (G) = 1 whereas .δ0 (G) = 0. Hence in this case .μn (G) does not converge to .δ0 (G); note that .∂G = {0, 1} and .δ0 (∂G) > 0. More generally, by the argument above, if .(xn )n is a sequence in the metric space E and .xn →n→∞ x, then .δxn →n→∞ δx weakly.
(b) $\mu_n = \frac1n\sum_{k=0}^{n-1}\delta_{k/n}$. That is, .μn is a sum of Dirac masses, each of weight 1/n, placed at the locations $0, \frac1n, \dots, \frac{n-1}{n}$. Intuitively the total mass is crumbled into an increasing number of smaller and smaller, evenly spaced, Dirac masses. This suggests a limit that is uniform on the interval .[0, 1]. Formally, if .f ∈ Cb (R) then
$$\int_{\mathbb R} f\,d\mu_n = \frac1n\sum_{k=0}^{n-1} f\bigl(\tfrac kn\bigr) .$$
On the right-hand side we recognize, with some imagination, the Riemann sum of f on the interval .[0, 1] with respect to the partition $0, \frac1n, \dots, \frac{n-1}{n}$. As f is continuous the Riemann sums converge to the integral and therefore
$$\lim_{n\to\infty}\int_{\mathbb R} f\,d\mu_n = \int_0^1 f(x)\,dx ,$$
which proves that .(μn )n converges weakly to the uniform distribution on [0, 1]. The same result can also be obtained by computing the limit of the characteristic functions or of the d.f.'s.
(c) .μn ∼ B(n, λ/n). Let us prove that .(μn )n converges to a Poisson law of parameter .λ; i.e. the approximation of a binomial .B(n, p) law with a large parameter n and small p with a Poisson distribution is actually a weak convergence result. This can be seen in many ways. At this point we know of three methods to prove weak convergence:
• the definition;
• the convergence of the distribution functions, Proposition 3.23 (for probabilities on .R only);
• the convergence of the characteristic functions (for probabilities on .Rd ).
In this case, for instance, the d.f. F of the limit is continuous everywhere except at the non-negative integers. If .x > 0, then
$$F_n(x) = \sum_{k=0}^{\lfloor x\rfloor} \binom nk \Bigl(\frac\lambda n\Bigr)^k\Bigl(1-\frac\lambda n\Bigr)^{n-k} \ \underset{n\to\infty}{\to}\ \sum_{k=0}^{\lfloor x\rfloor} e^{-\lambda}\,\frac{\lambda^k}{k!} = F(x)$$
as in the sum only a finite number of terms appear (.⌊·⌋ denotes as usual the “integer part” function). If .x < 0 there is nothing to prove as .Fn (x) = 0 = F (x). Note that in this case .Fn (x) →n→∞ F (x) for every x, and not just for the x’s that are continuity points. We might also compute the characteristic functions and their limit: recalling Example 2.25
$$\widehat\mu_n(\theta) = \Bigl(1 - \frac\lambda n + \frac\lambda n\,e^{i\theta}\Bigr)^n = \Bigl(1 + \frac\lambda n\,(e^{i\theta}-1)\Bigr)^n \ \underset{n\to\infty}{\to}\ e^{\lambda(e^{i\theta}-1)} ,$$
which is the characteristic function of a Poisson law of parameter .λ, and P. Lévy’s Theorem 3.20 gives .μn →n→∞ Poiss(λ).
(d) .μn ∼ N(b, 1/n). Recall that the laws .μn have a density given by bell shaped curves centered at b that become higher and narrower with n. This suggests that the .μn tend to concentrate around b. Also in this case in order to investigate the convergence we can compute either the limit of the d.f.’s or of the characteristic functions. The last method is the simplest one here:
$$\widehat\mu_n(\theta) = e^{ib\theta}\,e^{-\frac{1}{2n}\theta^2} \ \underset{n\to\infty}{\to}\ e^{ib\theta}$$
which is the characteristic function of a Dirac mass .δb , in agreement with intuition.
(e) .μn ∼ N(0, n). The density of .μn is
$$g_n(x) = \frac{1}{\sqrt{2\pi n}}\,e^{-\frac{x^2}{2n}} .$$
As $g_n(x) \le \frac{1}{\sqrt{2\pi n}}$ for every x, we have for every .f ∈ CK (R)
$$\lim_{n\to\infty}\int_{-\infty}^{+\infty} f(x)\,d\mu_n(x) = \lim_{n\to\infty}\int_{-\infty}^{+\infty} f(x)\,g_n(x)\,dx = 0 .$$
Hence .(μn )n cannot converge to a probability. This can also be proved via characteristic functions: indeed
$$\widehat\mu_n(\theta) = e^{-\frac12 n\theta^2} \ \underset{n\to\infty}{\to}\ \kappa(\theta) = \begin{cases} 1 & \text{if } \theta = 0\\ 0 & \text{if } \theta \ne 0 .\end{cases}$$
The limit .κ is not continuous at 0 and cannot be a characteristic function.
Let .μn , .μ be probabilities on a .σ -finite measure space .(E, E, ρ) having densities fn , f respectively with respect to .ρ and assume that .fn → f pointwise as .n → ∞. What can be said about the weak convergence of .(μn )n ? Corollary 3.26 below gives an answer. It is a particular case of a more general statement that will also be useful in other situations.
Theorem 3.25 (Scheffé’s Theorem) Let .(E, E, ρ) be a .σ -finite measure space and .(fn )n a sequence of positive measurable functions such that
(a) .fn →n→∞ f .ρ-a.e. for some measurable function f.
(b) $\lim_{n\to\infty}\int_E f_n\,d\rho = \int_E f\,d\rho < +\infty$.
Then .fn →n→∞ f in .L1 (ρ).
Proof We have
$$\|f - f_n\|_1 = \int_E |f - f_n|\,d\rho = \int_E (f-f_n)^+\,d\rho + \int_E (f-f_n)^-\,d\rho . \tag{3.24}$$
Let us prove that the two integrals on the right-hand side tend to 0 as .n → ∞. As f and .fn are positive we have
• If .f ≥ fn then .(f − fn )+ = f − fn ≤ f .
• If .f ≤ fn then .(f − fn )+ = 0.
In any case .(f − fn )+ ≤ f . As .(f − fn )+ →n→∞ 0 a.e. and f is integrable, by Lebesgue’s Theorem,
$$\lim_{n\to\infty}\int_E (f-f_n)^+\,d\rho = 0 .$$
As .f − fn = (f − fn )+ − (f − fn )− , we have also
$$\lim_{n\to\infty}\int_E (f-f_n)^-\,d\rho = \lim_{n\to\infty}\int_E (f-f_n)^+\,d\rho - \lim_{n\to\infty}\int_E (f-f_n)\,d\rho = 0$$
and, going back to (3.24), the result follows.
Corollary 3.26 Let .μ, μn , .n ≥ 1 be probabilities on a topological space E and let us assume that there exists a .σ -finite measure .ρ on E such that .μ and .μn have densities f and .fn respectively with respect to .ρ. Assume that
$$\lim_{n\to\infty} f_n(x) = f(x) \qquad \rho\text{-a.e.}$$
Then .μn →n→∞ μ weakly and also .limn→∞ μn (A) = μ(A) for every .A ∈ B(E).
Proof As $\int f_n\,d\rho = \int f\,d\rho = 1$, conditions (a) and (b) of Theorem 3.25 are satisfied so that .fn →n→∞ f in .L1 . If .φ : E → R is bounded measurable then
$$\Bigl|\int_E \phi\,d\mu_n - \int_E \phi\,d\mu\Bigr| = \Bigl|\int_E \phi\,(f - f_n)\,d\rho\Bigr| \le \|\phi\|_\infty \int_E |f - f_n|\,d\rho .$$
Hence
$$\lim_{n\to\infty}\int_E \phi\,d\mu_n = \int_E \phi\,d\mu$$
which proves weak convergence and, for .φ = 1A , also the last statement.
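The following sketch is not part of the text; the Gaussian densities below are an arbitrary choice made only for illustration. It approximates the L¹ distance $\int |f_n - f|\,d\rho$ for $f_n$ the N(0, 1 + 1/n) density and f the N(0, 1) density, so that the convergence predicted by Scheffé's Theorem and Corollary 3.26 can be observed numerically:

```python
import math

def normal_density(x, var):
    """Density of an N(0, var) law at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def l1_distance(var_n, step=0.001, bound=12.0):
    """Riemann-sum approximation of the L1 distance between N(0, var_n) and N(0, 1)."""
    total, x = 0.0, -bound
    while x < bound:
        total += abs(normal_density(x, var_n) - normal_density(x, 1.0)) * step
        x += step
    return total

for n in (1, 10, 100, 1000):
    print(n, round(l1_distance(1.0 + 1.0 / n), 5))   # decreases to 0 as n grows
```

Since $\mu_n(A) - \mu(A)$ is bounded in absolute value by this L¹ distance, the same computation also bounds the discrepancy $\sup_A |\mu_n(A) - \mu(A)|$.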
3.5 Convergence in Law

Let .X, Xn , .n ≥ 1, be r.v.’s with values in the same topological space E and let .μ, μn , .n ≥ 1, denote their respective laws. The convergence of laws allows us to define a form of convergence of r.v.’s.
Definition 3.27 A sequence .(Xn )n of r.v.’s with values in the topological space E is said to converge to X in law (and we write $X_n \xrightarrow{\ \mathcal L\ } X$) if and only if .μn →n→∞ μ weakly.
Remark 3.28 As
$$\mathrm E\bigl[f(X_n)\bigr] = \int_E f(x)\,d\mu_n(x), \qquad \mathrm E\bigl[f(X)\bigr] = \int_E f(x)\,d\mu(x), \tag{3.25}$$
$X_n \xrightarrow{\ \mathcal L\ } X$ if and only if
$$\lim_{n\to\infty}\mathrm E\bigl[f(X_n)\bigr] = \mathrm E\bigl[f(X)\bigr]$$
for every bounded continuous function .f : E → R. If E is a locally compact separable metric space, it is sufficient to check (3.25) for every .f ∈ CK (E) only (Proposition 3.19).
Let us compare convergence in law with the other forms of convergence.
Proposition 3.29 Let .(Xn )n be a sequence of r.v.’s with values in the metric space E. Then
(a) .Xn → X in probability implies .Xn → X in law.
(b) If .Xn → X in law and X is a constant r.v., i.e. such that .P(X = x0 ) = 1 for some .x0 ∈ E, then .Xn → X in probability.
Proof (a) Keeping in mind Remark 3.28 let us prove that
$$\lim_{n\to\infty}\mathrm E\bigl[f(X_n)\bigr] = \mathrm E\bigl[f(X)\bigr] \tag{3.26}$$
for every bounded continuous function .f : E → R. Let us use Criterion 3.8 (the sub-sub-sequence criterion): (3.26) follows if it can be shown that from every subsequence of .(E[f (Xn )])n we can extract a further subsequence along which (3.26) holds.
By Proposition 3.9(b), from every subsequence of .(Xn )n a further subsequence .(Xnk )k converging to X a.s. can be extracted. Therefore .limk→∞ f (Xnk ) = f (X) a.s. and, by Lebesgue’s Theorem,
$$\lim_{k\to\infty}\mathrm E\bigl[f(X_{n_k})\bigr] = \mathrm E\bigl[f(X)\bigr] .$$
(b) Let us denote by .Bδ the open ball centered at .x0 with radius .δ; then we can write .P(d(Xn , x0 ) ≥ δ) = P(Xn ∈ Bδc ). .Bδc is a closed set having probability 0 for the law of X, which is the Dirac mass .δx0 . Hence by (3.20)
$$\limsup_{n\to\infty} P\bigl(d(X_n,x_0)\ge\delta\bigr) \le P\bigl(d(X,x_0)\ge\delta\bigr) = 0 .$$
Convergence in law is therefore the weakest of all the convergences seen so far: a.s., in probability and in .Lp . In addition note that, in order for it to take place, it is not even necessary for the r.v.’s to be defined on the same probability space.
Example 3.30 (Asymptotics of Student Laws) Let .(Xn )n be a sequence of r.v.’s such that .Xn ∼ t (n) (see p. 94). Let us prove that $X_n \xrightarrow{\mathcal L} X$ where .X ∼ N(0, 1). Let .Z, Yn , .n = 1, 2, . . . , be independent r.v.’s with .Z ∼ N(0, 1) and .Yn ∼ χ 2 (1) for every n. Then .Sn = Y1 + · · · + Yn ∼ χ 2 (n) and .Sn is independent of Z. Hence the r.v.
$$T_n := \frac{Z}{\sqrt{S_n}}\,\sqrt n = \frac{Z}{\sqrt{S_n/n}}$$
has a Student law .t (n). By the Law of Large Numbers $\frac1n S_n \to_{n\to\infty} \mathrm E(Y_1) = 1$ a.s. and therefore .Tn → Z a.s. As a.s. convergence implies convergence in law we have $T_n \xrightarrow{\mathcal L} Z$ and, as .Xn ∼ Tn for every n, we have also $X_n \xrightarrow{\mathcal L} Z$.
This example introduces a sly method to determine the convergence in law of a sequence .(Xn )n : just construct another sequence .(Wn )n such that
• .Xn ∼ Wn for every n;
• .Wn →n→∞ W a.s. (or in probability).
Then .(Xn )n converges in law to W, as shown numerically in the sketch below.
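A hedged numerical illustration of this "sly method" (not part of the text): we simulate $T_n = Z/\sqrt{S_n/n}$ exactly as in Example 3.30, building the χ²(n) variable as a sum of squared standard normals, and check that the empirical standard deviation of the sample approaches 1, the value of the limiting N(0, 1) law.

```python
import random
import statistics

def sample_T(n):
    """One draw of T_n = Z / sqrt(S_n / n) with S_n a chi-square(n) r.v."""
    Z = random.gauss(0.0, 1.0)
    S = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    return Z / (S / n) ** 0.5

random.seed(0)
for n in (3, 10, 50):
    sample = [sample_T(n) for _ in range(20000)]
    print(n, round(statistics.mean(sample), 3), round(statistics.pstdev(sample), 3))
```

For small n the standard deviation is visibly larger than 1 (the t(n) law has heavier tails), while for n = 50 it is already close to 1.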
Example 3.31 The notion of convergence is closely related to that of approximation. As an application let us see a proof of the fact that the polynomials are dense in the space .C([0, 1]) of real continuous functions on the interval .[0, 1] with respect to the uniform norm.
Let, for .x ∈ [0, 1], .(Xnx )n be a sequence of i.i.d. r.v.’s with a Bernoulli .B(1, x) law and let $S_n^x := X_1^x + \cdots + X_n^x$, so that $S_n^x \sim B(n, x)$. Let .f ∈ C([0, 1]). Then
$$\mathrm E\bigl[f\bigl(\tfrac1n S_n^x\bigr)\bigr] = \sum_{k=0}^{n} f\bigl(\tfrac kn\bigr)\binom nk x^k(1-x)^{n-k} .$$
The right-hand side of the previous relation is a polynomial function of the variable x (the Bernstein polynomial of f of order n). Let us denote it by $P_n^f(x)$. By the Law of Large Numbers $\frac1n S_n^x \to_{n\to\infty} x$ a.s., hence also in law, and
$$f(x) = \lim_{n\to\infty}\mathrm E\bigl[f\bigl(\tfrac1n S_n^x\bigr)\bigr] = \lim_{n\to\infty} P_n^f(x) .$$
Therefore the sequence of polynomials $(P_n^f)_n$ converges pointwise to f. Let us demonstrate that the convergence is actually uniform. As f is uniformly continuous, for .ε > 0 let .δ > 0 be such that .|f (x) − f (y)| ≤ ε whenever .|y − x| ≤ δ. Hence, for every .x ∈ [0, 1],
$$|P_n^f(x) - f(x)| \le \mathrm E\bigl[|f(\tfrac1n S_n^x) - f(x)|\bigr] = \mathrm E\bigl[|f(\tfrac1n S_n^x) - f(x)|\,1_{\{|\frac1n S_n^x - x|\le\delta\}}\bigr] + \mathrm E\bigl[|f(\tfrac1n S_n^x) - f(x)|\,1_{\{|\frac1n S_n^x - x|>\delta\}}\bigr] \le \varepsilon + 2\|f\|_\infty\, P\bigl(|\tfrac1n S_n^x - x| > \delta\bigr),$$
the first term being .≤ ε because on that event $|\frac1n S_n^x - x| \le \delta$.
By Chebyshev’s inequality and noting that $x(1-x) \le \frac14$ for .x ∈ [0, 1],
$$P\bigl(|\tfrac1n S_n^x - x| > \delta\bigr) \le \frac{1}{\delta^2}\,\mathrm{Var}\bigl(\tfrac1n S_n^x\bigr) = \frac{1}{n\delta^2}\,x(1-x) \le \frac{1}{4n\delta^2}$$
and therefore for n large
$$\|P_n^f - f\|_\infty \le 2\varepsilon .$$
See Fig. 3.3 for an example.
Fig. 3.3 Graph of some function f (solid) and of the approximating Bernstein polynomials of order .n = 10 (dots) and .n = 40 (dashes)
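A minimal sketch of the Bernstein approximation (not part of the text; the function f below is an arbitrary continuous function on [0, 1], not the one of Fig. 3.3). It evaluates the uniform error on a fine grid and shows it decreasing with n, as the proof above predicts.

```python
from math import comb

def bernstein(f, n, x):
    """Bernstein polynomial of order n of f, evaluated at x in [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.4) ** 0.5        # some continuous (non-smooth) function
for n in (10, 40, 160):
    err = max(abs(bernstein(f, n, i / 200) - f(i / 200)) for i in range(201))
    print(n, round(err, 4))              # uniform error decreases as n grows
```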
3.6 Uniform Integrability
Definition 3.32 A family $\mathcal H$ of m-dimensional r.v.’s is uniformly integrable if
$$\lim_{R\to+\infty}\ \sup_{Y\in\mathcal H}\int_{\{|Y|>R\}} |Y|\,dP = 0 .$$
The set formed of a single integrable r.v. Y is the simplest example of a uniformly integrable family: actually .limR→+∞ |Y |1{|Y |>R} = 0 a.s. and, as .|Y |1{|Y |>R} ≤ |Y |, by Lebesgue’s Theorem,
$$\lim_{R\to+\infty}\int_{\{|Y|>R\}} |Y|\,dP = 0 .$$
By a similar argument, if there exists a real integrable r.v. Z such that .Z ≥ |Y | a.s. for every $Y\in\mathcal H$ then $\mathcal H$ is uniformly integrable, as in this case .{|Y | > R} ⊂ {Z > R} a.s. and
$$\int_{\{|Y|>R\}} |Y|\,dP \le \int_{\{Z>R\}} Z\,dP \qquad \text{for every } Y\in\mathcal H .$$
Note however that in order for a family of r.v.’s $\mathcal H$ to be uniformly integrable it is not necessary for them to be defined on the same probability space: actually
$$\int_{\{|Y|>R\}} |Y|\,dP = \int_{\{|y|>R\}} |y|\,d\mu_Y(y)$$
so that uniform integrability is a condition concerning only the laws of the r.v.’s of $\mathcal H$. Note that a uniformly integrable family $\mathcal H$ is necessarily bounded in .L1 : if .R > 0 is such that
$$\sup_{Y\in\mathcal H}\int_{\{|Y|>R\}} |Y|\,dP \le 1 ,$$
then, for every $Y\in\mathcal H$,
$$\mathrm E\bigl[|Y|\bigr] = \int_{\{|Y|\le R\}} |Y|\,dP + \int_{\{|Y|>R\}} |Y|\,dP \le R + 1 .$$
The next proposition gives a useful characterization of uniform integrability.
Proposition 3.33 A family $\mathcal H$ of r.v.’s is uniformly integrable if and only if
(i) $\mathcal H$ is bounded in .L1 and
(ii) for every .ε > 0 there exists a .δ > 0 such that for every $Y\in\mathcal H$
$$\int_A |Y|\,dP \le \varepsilon \qquad \text{whenever } P(A)\le\delta . \tag{3.27}$$
Proof If $\mathcal H$ is uniformly integrable we already know that it is bounded in .L1 . Also, for every event A we have
$$\int_A |Y|\,dP = \int_{A\cap\{|Y|\le R\}} |Y|\,dP + \int_{A\cap\{|Y|> R\}} |Y|\,dP \le R\,P(A) + \sup_{Z\in\mathcal H}\int_{\{|Z|>R\}} |Z|\,dP ;$$
given .ε > 0, choosing first R so that the last term is .≤ ε/2 and then .δ = ε/(2R), (3.27) follows.
A sequence .(Yn )n converging to Y in .L1 is uniformly integrable: given .ε > 0, let .n0 be such that .‖Yn − Y ‖1 ≤ ε for .n > n0 and let R be large enough that .E(|Y |1{|Yn |≥R} ) ≤ ε for every n; then
$$\mathrm E\bigl[|Y_n|1_{\{|Y_n|\ge R\}}\bigr] \le \|Y_n - Y\|_1 + \mathrm E\bigl[|Y|1_{\{|Y_n|\ge R\}}\bigr] \le 2\varepsilon \tag{3.30}$$
for .n > n0 . As each of the r.v.’s .Yk is, individually, uniformly integrable, there exist .R1 , . . . , Rn0 such that .E(|Yi |1{|Yi |≥Ri } ) ≤ ε for .i = 1, . . . , n0 and, possibly replacing R with the largest among .R1 , . . . , Rn0 , R, we have .E(|Yn |1{|Yn |≥R} ) ≤ 2ε for every n.
The following theorem is an extension of Lebesgue’s Theorem. Note that it gives a necessary and sufficient condition.
Theorem 3.34 Let .(Yn )n be a sequence of r.v.’s on a probability space .(Ω, F, P) converging a.s. to Y. Then the convergence takes place in .L1 if and only if .(Yn )n is uniformly integrable.
Proof The only if part is already proved. Conversely, let us assume .(Yn )n is uniformly integrable. Then by Fatou’s Lemma
$$\mathrm E\bigl[|Y|\bigr] \le \liminf_{n\to\infty}\mathrm E\bigl[|Y_n|\bigr] \le M ,$$
where M is an upper bound of the .L1 norms of the .Yn . Moreover, for every .ε > 0,
$$\mathrm E\bigl[|Y - Y_n|\bigr] = \mathrm E\bigl[|Y-Y_n|1_{\{|Y-Y_n|\le\varepsilon\}}\bigr] + \mathrm E\bigl[|Y-Y_n|1_{\{|Y-Y_n|>\varepsilon\}}\bigr] \le \varepsilon + \mathrm E\bigl[|Y_n|1_{\{|Y_n-Y|>\varepsilon\}}\bigr] + \mathrm E\bigl[|Y|1_{\{|Y_n-Y|>\varepsilon\}}\bigr] .$$
As a.s. convergence implies convergence in probability, we have, for large n, .P(|Yn − Y | > ε) ≤ δ (.δ as in the statement of Proposition 3.33) so that
$$\mathrm E\bigl[|Y_n|1_{\{|Y_n-Y|>\varepsilon\}}\bigr] \le \varepsilon, \qquad \mathrm E\bigl[|Y|1_{\{|Y_n-Y|>\varepsilon\}}\bigr] \le \varepsilon$$
and for large n we have .E(|Y − Yn |) ≤ 3ε.
The following is a useful criterion for uniform integrability.
Proposition 3.35 Let $\mathcal H$ be a family of r.v.’s and assume that there exists a measurable map .Φ : R+ → R, bounded below, such that $\lim_{t\to+\infty}\frac1t\Phi(t) = +\infty$ and
$$\sup_{Y\in\mathcal H}\mathrm E\bigl[\Phi(|Y|)\bigr] < +\infty .$$
Then $\mathcal H$ is uniformly integrable.
Proof Let .Φ be as in the statement. We can assume that .Φ is positive, otherwise if .Φ ≥ −r just replace .Φ with .Φ + r. Let .K > 0 be such that .E[Φ(|Y |)] ≤ K for every $Y\in\mathcal H$ and let .ε > 0 be fixed. Let .R0 be such that $\frac1R\Phi(R) \ge \frac K\varepsilon$ for .R > R0 , i.e. $|Y| \le \frac\varepsilon K\,\Phi(|Y|)$ on .{|Y | ≥ R0 }, for every $Y\in\mathcal H$. Then, for every $Y\in\mathcal H$,
$$\int_{\{|Y|>R_0\}} |Y|\,dP \le \frac\varepsilon K\int_{\{|Y|>R_0\}} \Phi(|Y|)\,dP \le \frac\varepsilon K\int \Phi(|Y|)\,dP \le \varepsilon .$$
In particular, taking .Φ(t) = t^p , bounded subsets of .L^p , .p > 1, are uniformly integrable.
Actually there is a converse to Proposition 3.35: if . H is uniformly integrable then there exists a function .Φ as in Proposition 3.35 (and convex in addition to that). See [9], Theorem 22, p. 24, for a proof of this converse. Therefore the criterion of Proposition 3.35 is actually a characterization of uniform integrability.
3.7 Convergence in a Gaussian World

In this section we see that, concerning convergence, Gaussian r.v.’s enjoy some special properties. The first result is stability of Gaussianity under convergence in law.
Proposition 3.36 Let .(Xn )n be a sequence of d-dimensional Gaussian r.v.’s converging in law to an r.v. X. Then X is Gaussian and the means and covariance matrices of the .Xn converge to the mean and covariance matrix of X. In particular, .(Xn )n is bounded in .L2 .
Proof Let us first assume the .Xn ’s are real-valued. Their characteristic functions are of the form
$$\phi_n(\theta) = e^{ib_n\theta}\,e^{-\frac12\sigma_n^2\theta^2} \tag{3.31}$$
and, by assumption, .φn (θ ) →n→∞ φ(θ ) for every .θ, where by .φ we denote the characteristic function of the limit X. Let us prove that .φ is the characteristic function of a Gaussian r.v. The heart of the proof is that pointwise convergence of .(φn )n implies convergence of the sequences .(bn )n and .(σn2 )n . Taking the complex modulus in (3.31) we obtain
$$|\phi_n(\theta)| = e^{-\frac12\sigma_n^2\theta^2} \ \underset{n\to\infty}{\to}\ |\phi(\theta)| .$$
This implies that the sequence .(σn2 )n is bounded: otherwise there would exist a subsequence .(σn2k )k converging to .+∞ and we would have .|φ(θ )| = 0 for .θ ≠ 0 and .|φ(θ )| = 1 for .θ = 0, impossible because .φ is necessarily continuous. Let us show that the sequence .(bn )n of the means is also bounded. As the .Xn ’s are Gaussian, if .σn2 > 0 then .P(Xn ≥ bn ) = 1/2. If instead .σn2 = 0, then the law of .Xn is the Dirac mass at .bn . In any case .P(Xn ≥ bn ) ≥ 1/2. If the means .bn were not bounded there would exist a subsequence .(bnk )k converging, say, to .+∞ (if .bnk → −∞ the argument would be the same). Then, for every .M ∈ R we would have .bnk ≥ M for k large and therefore (the first inequality follows from
Theorem 3.21, the portmanteau theorem, as .[M, +∞[ is a closed set)
$$P(X \ge M) \ge \limsup_{k\to\infty} P(X_{n_k} \ge M) \ge \limsup_{k\to\infty} P(X_{n_k} \ge b_{n_k}) \ge \frac12 ,$$
which is not possible as .limM→∞ P(X ≥ M) = 0. Hence both .(bn )n and .(σn2 )n are bounded and for a subsequence we have .bnk → b and .σn2k → σ 2 as .k → ∞ for some numbers b and .σ 2 . Therefore
$$\phi(\theta) = \lim_{k\to\infty} e^{ib_{n_k}\theta}\,e^{-\frac12\sigma_{n_k}^2\theta^2} = e^{ib\theta}\,e^{-\frac12\sigma^2\theta^2} ,$$
which is the characteristic function of a Gaussian law. A closer look at the argument above indicates that we have proved that from every subsequence of .(bn )n and of .(σn2 )n a further subsequence can be extracted converging to b and .σ 2 respectively. Hence by the sub-sub-sequence criterion (Criterion 3.8), the means and the variances of the .Xn converge to the mean and the variance of the limit and .(Xn )n is bounded in .L2 .
If the .Xn ’s are d-dimensional, note that, for every .ξ ∈ Rd , the r.v.’s .Zn = ⟨ξ, Xn ⟩ are Gaussian, being linear functions of Gaussian r.v.’s, and real-valued. Obviously $Z_n \xrightarrow{\mathcal L} \langle\xi, X\rangle$, which turns out to be Gaussian by the first part of the proof. As this holds for every .ξ ∈ Rd , this implies that X is Gaussian itself (see Exercise 2.49). Let us prove convergence of means and covariance matrices in the multidimensional case. Let us denote by .Cn , C the covariance matrices of .Xn and X respectively. Thanks again to the first part of the proof the means and the variances of the r.v.’s .Zn = ⟨ξ, Xn ⟩ converge to the mean and the variance of .⟨ξ, X⟩. Note that the mean of .⟨ξ, Xn ⟩ is .⟨ξ, bn ⟩, whereas the variance is .⟨Cn ξ, ξ⟩. As this occurs for every vector .ξ ∈ Rd , we deduce that .bn →n→∞ b and .Cn →n→∞ C.
As .L2 convergence implies convergence in law, the Gaussian r.v.’s on a probability space form a closed subset of .L2 . But not a vector subspace . . . (see Exercise 2.56).
An important feature of Gaussian r.v.’s is that the moment of order 2 controls all the moments of higher order. If .X ∼ N(0, σ 2 ), then .X = σ Z for some .N(0, 1)-distributed r.v. Z. Hence, as .σ 2 = E(|X|2 ),
$$\mathrm E\bigl[|X|^p\bigr] = \sigma^p\,\mathrm E\bigl[|Z|^p\bigr] = c_p\,\mathrm E\bigl[|X|^2\bigr]^{p/2}, \qquad c_p := \mathrm E\bigl[|Z|^p\bigr] .$$
If X is not centered the .Lp norm of X can still be controlled by the .L2 norm, but this requires more care. Of course we can assume .p ≥ 2 as for .p ≤ 2 the .L2 norm
is always larger than the .Lp norm, thanks to Jensen’s inequality. The key tools are, for positive numbers .x1 , . . . , xn , the inequalities
$$x_1^p + \cdots + x_n^p \le (x_1 + \cdots + x_n)^p \le n^{p-1}\,(x_1^p + \cdots + x_n^p) \tag{3.32}$$
that hold for every .n ≥ 2 and .p ≥ 1. If .X ∼ N(b, σ 2 ) then .X ∼ b + σ Z with .Z ∼ N(0, 1) and
$$|X|^p = |b + \sigma Z|^p \le (|b| + \sigma|Z|)^p \le 2^{p-1}\bigl(|b|^p + \sigma^p|Z|^p\bigr), \tag{3.33}$$
hence, if now $c_p = 2^{p-1}\bigl(1 + \mathrm E(|Z|^p)\bigr)$,
$$\mathrm E\bigl[|X|^p\bigr] \le 2^{p-1}\bigl(|b|^p + \sigma^p\,\mathrm E(|Z|^p)\bigr) \le c_p\bigl(|b|^p + \sigma^p\bigr) .$$
Again by (3.32) (the inequality on the left-hand side for .p/2) we have
$$|b|^p + \sigma^p = \bigl(|b|^2\bigr)^{p/2} + \bigl(\sigma^2\bigr)^{p/2} \le \bigl(|b|^2 + \sigma^2\bigr)^{p/2}$$
and, in conclusion,
$$\mathrm E\bigl[|X|^p\bigr] \le c_p\bigl(|b|^2 + \sigma^2\bigr)^{p/2} = c_p\,\mathrm E\bigl[|X|^2\bigr]^{p/2} . \tag{3.34}$$
A similar inequality also holds if X is d-dimensional Gaussian: as
$$|X|^p = (X_1^2 + \cdots + X_d^2)^{p/2} \le d^{\,p/2-1}\bigl(|X_1|^p + \cdots + |X_d|^p\bigr)$$
we have, using repeatedly (3.32) and (3.34),
$$\mathrm E\bigl[|X|^p\bigr] \le d^{\,p/2-1}\sum_{k=1}^d \mathrm E\bigl[|X_k|^p\bigr] \le c_p\,d^{\,p/2-1}\sum_{k=1}^d \mathrm E\bigl[|X_k|^2\bigr]^{p/2} \le c_p\,d^{\,p/2-1}\Bigl(\sum_{k=1}^d \mathrm E(|X_k|^2)\Bigr)^{p/2} = c_p\,d^{\,p/2-1}\,\mathrm E\bigl(|X|^2\bigr)^{p/2} .$$
This inequality together with Proposition 3.36 gives
Corollary 3.37 A sequence of Gaussian r.v.’s converging in law is bounded in .Lp for every .p ≥ 1.
This is the key point of another important feature of the Gaussian world: a.s. convergence implies convergence in .Lp for every .p > 0.
Theorem 3.38 Let .(Xn )n be a sequence of Gaussian d-dimensional r.v.’s on a probability space .(Ω, F, P) converging a.s. to an r.v. X. Then the convergence takes place in .Lp for every .p > 0.
Proof Let us first assume that .d = 1. Thanks to Corollary 3.37, as a.s. convergence implies convergence in law, the sequence is bounded in .Lp for every p. This implies also that .X ∈ Lp for every p: if by .Mp we denote an upper bound of the .Lp norms of the .Xn then, by Fatou’s Lemma,
$$\mathrm E\bigl[|X|^p\bigr] \le \liminf_{n\to\infty}\mathrm E\bigl[|X_n|^p\bigr] \le M_p^p$$
(this is the same as in Exercise 1.15 a1)). We have for every .q > p
$$\bigl(|X_n - X|^p\bigr)^{q/p} = |X_n - X|^q \le 2^{q-1}\bigl(|X_n|^q + |X|^q\bigr) .$$
The sequence .(|Xn − X|p )n converges to 0 a.s. and is bounded in .Lq/p . As .q/p > 1, it is uniformly integrable by Proposition 3.35 and Theorem 3.34 gives
$$\lim_{n\to\infty}\|X_n - X\|_p = \lim_{n\to\infty}\mathrm E\bigl[|X_n - X|^p\bigr]^{1/p} = 0 .$$
In general, if .d ≥ 1, we have obviously .Lp convergence of the components of .Xn to the components of X. The result then follows thanks to the inequalities (3.32). As a consequence we have the following result stating that for Gaussian r.v.’s all .Lp convergences are equivalent.
Corollary 3.39 Let .(Xn )n be a sequence of Gaussian d-dimensional r.v.’s converging to an r.v. X in .Lp for some .p > 0. Then the convergence takes place in .Lp for every .p > 0.
Proof As .(Xn )n converges also in probability, by Theorem 3.9 there exists a subsequence .(Xnk )k converging a.s. to X, hence also in .Lp for every p by the previous Theorem 3.38. The result then follows by the precious sub-sub-sequence Criterion 3.8.
3.8 The Central Limit Theorem

We now present the most classical result of convergence in law.
Theorem 3.40 (The Central Limit Theorem) Let .(Xn )n be a sequence of d-dimensional i.i.d. r.v.’s, with mean b and covariance matrix C and let
$$S_n^* := \frac{X_1 + \cdots + X_n - nb}{\sqrt n}\ \cdot$$
Then .Sn∗ converges in law to a Gaussian multivariate .N(0, C) distribution.
Proof The proof boils down to the computation of the limit of the characteristic functions of the r.v.’s .Sn∗ , and then applying P. Lévy’s Theorem 3.20. If .Yk = Xk − b, then the .Yk ’s are centered, have the same covariance matrix C and $S_n^* = \frac{1}{\sqrt n}(Y_1+\cdots+Y_n)$. Let us denote by .φ the common characteristic function of the .Yk ’s. Then, recalling the formulas of the characteristic function of a sum of independent r.v.’s, (2.38), and of their transformation under linear maps, (2.39),
$$\phi_{S_n^*}(\theta) = \phi\Bigl(\frac{\theta}{\sqrt n}\Bigr)^{\!n} = \Bigl(1 + \phi\bigl(\tfrac{\theta}{\sqrt n}\bigr) - 1\Bigr)^{\!n} .$$
This is a classical $1^\infty$ form. Let us compute the Taylor expansion to the second order of .φ at .θ = 0: recalling that
$$\phi'(0) = i\,\mathrm E(Y_1) = 0, \qquad \operatorname{Hess}\phi(0) = -C ,$$
we have
$$\phi(\theta) = 1 - \frac12\langle C\theta,\theta\rangle + o(|\theta|^2) .$$
Therefore, as .n → +∞,
$$\phi\Bigl(\frac{\theta}{\sqrt n}\Bigr) - 1 = -\frac{1}{2n}\langle C\theta,\theta\rangle + o\bigl(\tfrac1n\bigr)$$
and, as .log(1 + z) ∼ z for .z → 0,
$$\lim_{n\to\infty}\phi_{S_n^*}(\theta) = \lim_{n\to\infty}\exp\Bigl(n\log\bigl(1 + \phi(\tfrac{\theta}{\sqrt n}) - 1\bigr)\Bigr) = \lim_{n\to\infty}\exp\Bigl(n\bigl(\phi(\tfrac{\theta}{\sqrt n}) - 1\bigr)\Bigr) = \lim_{n\to\infty}\exp\Bigl(-\frac12\langle C\theta,\theta\rangle + n\,o(\tfrac1n)\Bigr) = e^{-\frac12\langle C\theta,\theta\rangle} ,$$
which is the characteristic function of an .N(0, C) distribution.
Corollary 3.41 Let .(Xn )n be a sequence of real i.i.d. r.v.’s with mean b and variance .σ 2 . Then if
$$S_n^* = \frac{X_1 + \cdots + X_n - nb}{\sigma\sqrt n}$$
we have $S_n^* \xrightarrow{\ \mathcal L\ } N(0, 1)$.
The Central Limit Theorem has a long history, made of a streak of increasingly sharper and sophisticated results. The first of these is the De Moivre-Laplace Theorem (1738), which concerns the case where the .Xn are Bernoulli r.v.’s, so that the sums .Sn are binomial-distributed, and it is elementary (but not especially fun) to directly estimate their d.f. by using Stirling’s formula for the factorials,
$$n! = \sqrt{2\pi n}\,\Bigl(\frac ne\Bigr)^{\!n} + o(n!) .$$
The Central Limit Theorem states that, for large n, the law of .Sn∗ can be approximated by an .N(0, 1) law. How large must n be for this to be a reasonable approximation? In spite of the fact that .n = 30 (or sometimes .n = 50) is often claimed to be acceptable, in fact there is no all-purpose rule for n. Actually, whatever the value of n, if .Xk ∼ Gamma(1/n, 1), then .Sn would be exponential and .Sn∗ would be far from being Gaussian. An accepted empirical rule is that we have a good approximation, also for small values of n, if the law of the .Xi ’s is symmetric with respect to its mean: see Exercise 3.27 for an instance of a very good approximation for .n = 12. In the case of asymmetric distributions it is better to be cautious and require larger values of n. Figures 3.4 and 3.5 give some visual evidence (see Exercise 2.25 for a possible way of “measuring” the symmetry of an r.v.).
Fig. 3.4 Graph of the density of .Sn∗ for sums of Gamma(1/2, 1)-distributed r.v.’s (solid) to be compared with the .N(0, 1) density (dots). Here .n = 50: despite this relatively large value, the two graphs are rather distant
Fig. 3.5 This is the graph of the .Sn∗ density for sums of Gamma(7, 1)-distributed r.v.’s (solid) compared with the .N(0, 1) density (dots). Here .n = 30. Despite a smaller value of n, we have a much better approximation. The Gamma(7, 1) law is much more symmetric than the Gamma(1/2, 1): in Exercise 2.25 b) we found that the skewness of these distributions are respectively $2^{3/2}\,7^{-1/2} = 1.07$ and $2^{3/2} = 2.83$. Note however that the Central Limit Theorem, Theorem 3.40, guarantees weak convergence of the laws, not pointwise convergence of the densities. In this sense there are more refined results (see e.g. [13], Theorem XV.5.2)
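The comparison of Figs. 3.4 and 3.5 can be reproduced numerically: the sum of n i.i.d. Gamma(α, 1) r.v.'s is Gamma(nα, 1), so the density of S_n* is available in closed form. The sketch below is not part of the text; it simply evaluates that density at a few points against the N(0, 1) density.

```python
import math

def gamma_density(x, alpha):
    """Density of a Gamma(alpha, 1) law at x."""
    if x <= 0:
        return 0.0
    return math.exp((alpha - 1) * math.log(x) - x - math.lgamma(alpha))

def density_Sn_star(x, alpha, n):
    """Density of S_n^* for i.i.d. Gamma(alpha, 1) summands; the sum is Gamma(n*alpha, 1)."""
    m, s = n * alpha, math.sqrt(n * alpha)      # mean and standard deviation of the sum
    return s * gamma_density(m + s * x, n * alpha)

def normal_density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x,
          round(density_Sn_star(x, 0.5, 50), 4),   # Gamma(1/2,1) summands, n = 50 (Fig. 3.4)
          round(density_Sn_star(x, 7.0, 30), 4),   # Gamma(7,1) summands, n = 30 (Fig. 3.5)
          round(normal_density(x), 4))
```

The values for the Gamma(7, 1) case track the normal density much more closely than those for Gamma(1/2, 1), in agreement with the figures.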
3.9 Application: Pearson’s Theorem, the χ² Test

We now present a classical application of the Central Limit Theorem. Let .(Xn )n be a sequence of i.i.d. r.v.’s with values in a finite set with cardinality m, which we shall assume to be .{1, . . . , m}, and let .pi = P(X1 = i), .i = 1, . . . , m. Assume that .pi > 0 for every .i = 1, . . . , m, and let, for .n > 0, .i = 1, . . . , m,
$$N_i^{(n)} = 1_{\{X_1=i\}} + \cdots + 1_{\{X_n=i\}}, \qquad \bar p_i^{(n)} = \frac1n\,N_i^{(n)} .$$
$\bar p_i^{(n)}$ is therefore the proportion, up to time n, of observations .X1 , . . . , Xn that have taken the value i. Of course $\sum_{i=1}^m N_i^{(n)} = n$ and $\sum_{i=1}^m \bar p_i^{(n)} = 1$. Note that by the strong Law of Large Numbers
$$\bar p_i^{(n)} = \frac1n\sum_{k=1}^n 1_{\{X_k=i\}}\ \xrightarrow[n\to\infty]{\mathrm{a.s.}}\ \mathrm E\bigl[1_{\{X_k=i\}}\bigr] = p_i . \tag{3.35}$$
Let, for every n,
$$T_n = \sum_{i=1}^m \frac{\bigl(N_i^{(n)} - np_i\bigr)^2}{np_i} = n\sum_{i=1}^m \frac{\bigl(\bar p_i^{(n)} - p_i\bigr)^2}{p_i}$$
(Pearson’s statistics). The quantity .Tn is a measure of the disagreement between the probabilities p and $\bar p^{(n)}$. Let us keep in mind that, whereas .p ∈ Rm is a deterministic quantity, the $\bar p_i^{(n)}$ form a random vector (they are functions of the observations .X1 , . . . , Xn ). In the sequel, for simplicity, we shall omit the index (n) and write $N_i$, $\bar p_i$.
Theorem 3.42 (Pearson)
$$T_n \ \xrightarrow[n\to\infty]{\mathcal L}\ \chi^2(m-1) . \tag{3.36}$$
Proof Let .Yn be the m-dimensional random vector with components 1 Yn,i = √ 1{Xn =i} , pi
(3.37)
p 1 Y1,i + · · · + Yn,i = √ Ni = n √ i · pi pi
(3.38)
.
so that .
√ Let us denote by N, p and . p the vectors of .Rm having components .Ni , .pi and √ √ . pi , .i = 1, . . . , m, respectively; therefore the vector . p has modulus .= 1. Clearly the random vectors .Yn are independent, being functions of independent r.v.’s, and √ √ .E(Yn ) = p (recall that . p is a vector).
156
3 Convergence
The covariance matrix .C = (cij )ij of .Yn is computed easily from (3.37): keeping in mind that .P(Xn = i, Xn = j ) = 0 if .i = j and .P(Xn = i, Xn = j ) = pi if .i = j ,
1 E 1{Xn =i} 1{Xn =j } − E(1{Xn =i} )E(1{Xn =j } ) cij = √ pi pj
1 P(Xn = i, Xn = j ) − P(Xn = i)P(Xn = j ) =√ pi pj √ = δij − pi pj ,
.
so that, for .x ∈ Rd , .(Cx)i = xi −
√ m √ pi j =1 pj xj , i.e.
√ √ Cx = x − p, x p .
.
(3.39)
By the Central Limit Theorem the sequence 1 1 √ Wn := √ Yk − E(Yk ) = √ Yk − p n n n
n
k=1
k=1
.
converges in law, as .n → ∞, to an .N(0, C)-distributed r.v., V say. Note however that (recall (3.38)) .Wn is a random vector whose i-th component is .
1 p √ √ p − pi , √ n √ i − n pi = n i√ pi pi n
so that .|Wn |2 = Tn and Tn
.
L
→
n→∞
|V |2 .
Therefore we must just compute the law of .|V |2 , i.e. the law of the square of the modulus of an .N(0, C)-distributed r.v. As .|V |2 = |OV |2 for every rotation O and the covariance matrix of OV is .O ∗ CO, we can assume the covariance matrix C to be diagonal, so that |V |2 =
m
.
λk Vk2 ,
(3.40)
k=1
where .λ1 , . . . , λm are the eigenvalues of the covariance matrix C and the r.v.’s .Vk2 are .χ 2 (1)-distributed (see Exercise 2.53 for a complete argument). √ Let us determine the eigenvalues .λk . Going back to (3.39) we note that .C p = √ 0, whereas .Cx = x for every x that is orthogonal to . p. Therefore one of the .λi is
3.9 Application: Pearson’s Theorem, the χ 2 Test
157
equal to 0 and the .m − 1 other eigenvalues are equal to 1 (C is the projector on the subspace orthogonal to $\sqrt p$, which has dimension .m − 1). Hence the law of the r.v. in (3.40) is the sum of .m − 1 independent .χ 2 (1)-distributed r.v.’s and has a .χ 2 (m − 1) distribution.
Let us look at some applications of Pearson’s Theorem. Imagine we have n independent observations .X1 , . . . , Xn of some random quantity taking the possible values .{1, . . . , m}. Is it possible to check whether their law is given by some vector p, i.e. .P(Xn = i) = pi ? For instance imagine that a die has been thrown 2000 times with the following outcomes

outcome     1     2     3     4     5     6
count     388   322   314   316   344   316          (3.41)
Can we decide whether the die is a fair one, meaning that the outcome of a throw is uniform on .{1, 2, 3, 4, 5, 6}? Pearson’s Theorem provides a way of checking this hypothesis. Actually under the hypothesis that .P(Xn = i) = pi the r.v. .Tn is approximately .χ 2 (m − 1)-distributed, whereas if the law of the .Xn was given by another vector .q = (q1 , . . . , qm ), .q ≠ p, we would have, as .n → ∞, $\bar p_i \to_{n\to\infty} q_i$ by the Law of Large Numbers so that, as $\sum_{i=1}^m \frac{(q_i - p_i)^2}{p_i} > 0$,
$$\lim_{n\to\infty} T_n = \lim_{n\to\infty} n\sum_{i=1}^m \frac{(q_i - p_i)^2}{p_i} = +\infty .$$
In other words under the assumption that the observations follow the law given by the vector p, the statistic .Tn is asymptotically .χ 2 (m − 1)-distributed, otherwise .Tn will tend to take large values.
Example 3.43 Let us go back to the data (3.41). There are some elements of suspicion: indeed the outcome 1 has appeared more often than the others: the frequencies are

p̄1       p̄2       p̄3       p̄4       p̄5       p̄6
0.196   0.161   0.157   0.158   0.172   0.158          (3.42)
How do we establish whether this discrepancy is significant (and the die is loaded)? Or are these normal fluctuations and the die is fair?
Under the hypothesis that the die is a fair one, thanks to Pearson’s Theorem the (random) quantity
$$T_n = 2000 \times \sum_{i=1}^{6}\bigl(\bar p_i - \tfrac16\bigr)^2 \times 6 = 12.6$$
is approximately .χ 2 (5)-distributed, whereas if the hypothesis was not true .Tn would have a tendency to take large values. The question hence boils down to the following: can the observed value of .Tn be considered a typical value for a .χ 2 (5)-distributed r.v.? Or is it too large?
We can argue in the following way: let us fix a threshold .α (.α = 0.05, for instance). If we denote by $\chi^2_{1-\alpha}(5)$ the quantile of order .1 − α of the .χ 2 (5) law, then, for a .χ 2 (5)-distributed r.v. X, $P\bigl(X > \chi^2_{1-\alpha}(5)\bigr) = \alpha$. We shall decide to reject the hypothesis that the die is a fair one if the observed value of .Tn is larger than $\chi^2_{1-\alpha}(5)$ as, if the die was a fair one, the probability of observing a value exceeding $\chi^2_{1-\alpha}(5)$ would be too small.
Any suitable software can provide the quantiles of the .χ 2 distribution and it turns out that $\chi^2_{0.95}(5) = 11.07$. We conclude that the die cannot be considered a fair one. In the language of Mathematical Statistics, Pearson’s Theorem allows us to reject the hypothesis that the die is a fair one at the level .5%. The value .12.6 corresponds to the quantile of order .97.26% of the .χ 2 (5) law. Hence if the die was a fair one, a value of .Tn larger than .12.6 would appear with probability .2.7%. The data of this example were simulated with probabilities .q1 = 0.2, .q2 = . . . = q6 = 0.16.
Pearson’s Theorem is therefore the theoretical foundation of important applications in hypothesis testing in Statistics, when it is required to check whether some data are in agreement with a given theoretical distribution. However we need to inquire how large n should be in order to assume that .Tn has a law close to a .χ 2 (m − 1). A practical rule, that we shall not discuss here, requires that .npi ≥ 5 for every .i = 1, . . . , m. In the case of Example 3.43 this requirement is clearly satisfied, as in this case $np_i = \frac16\times 2000 = 333.33$.
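A minimal computation of Pearson's statistics for the data (3.41); this snippet is not part of the text, and the quantile 11.07 is simply the value quoted above, hard-coded to avoid external libraries.

```python
counts = [388, 322, 314, 316, 344, 316]      # the data (3.41)
n = sum(counts)                               # 2000 throws
p = 1 / 6                                     # probability of each face for a fair die
T = n * sum((c / n - p) ** 2 / p for c in counts)
print(round(T, 2))            # ~12.62, the value of Pearson's statistics
print(T > 11.07)              # True: reject fairness at level 5% (0.95 quantile of chi^2(5))
```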
Table 3.1 The Geissler data

 k     Nk       pk          p̄k
 0       3     0.000244    0.000491
 1      24     0.002930    0.003925
 2     104     0.016113    0.017007
 3     286     0.053711    0.046770
 4     670     0.120850    0.109567
 5    1033     0.193359    0.168929
 6    1343     0.225586    0.219624
 7    1112     0.193359    0.181848
 8     829     0.120850    0.135568
 9     478     0.053711    0.078168
10     181     0.016113    0.029599
11      45     0.002930    0.007359
12       7     0.000244    0.001145
Example 3.44 At the end of the nineteenth century the German doctor and statistician A. Geissler investigated the problem of modeling the outcome (male or female) of the subsequent births in a family. Geissler collected data on the composition of large families. The data of Table 3.1 concern 6115 families of 12 children. For every k, .k = 0, 1, . . . , 12, it displays the number, .Nk , of families having k sons and the corresponding empirical probabilities $\bar p_k$.
A natural hypothesis is to assume that every birth gives rise to a son or a daughter with probability 1/2, and moreover that the outcomes of different births are independent. Can we say that this hypothesis is not rejected by the data? Under this hypothesis, the r.v. .X = “number of sons” is distributed according to a binomial .B(12, 1/2) law, i.e. the probability of observing a family with k sons would be
$$p_k = \binom{12}{k}\Bigl(\frac12\Bigr)^{\!k}\Bigl(1-\frac12\Bigr)^{\!12-k} = \binom{12}{k}\Bigl(\frac12\Bigr)^{\!12} .$$
Do the observed values $\bar p_k$ agree with the .pk ? Or are the discrepancies appearing in Table 3.1 significant? This is a typical application of Pearson’s Theorem. However, the condition of applicability of Pearson’s Theorem is not satisfied, as for .i = 0 or .i = 12 we have .pi = 2−12 and
$$np_0 = np_{12} = 6115\cdot 2^{-12} = 1.49 ,$$
which is smaller than 5 and therefore not large enough to apply Pearson’s approximation. This difficulty can be overcome with the trick of merging classes: let us consider a new r.v. Y defined as
$$Y = \begin{cases} 1 & \text{if } X = 0 \text{ or } 1\\ k & \text{if } X = k \text{ for } k = 2,\dots,10\\ 11 & \text{if } X = 11 \text{ or } 12 .\end{cases}$$
In other words Y coincides with X if .X = 1, . . . , 11 and takes the value 1 also on .{X = 0} and 11 also on .{X = 12}. Clearly the law of Y is
$$P(Y = k) = q_k := \begin{cases} p_0 + p_1 & \text{if } k = 1\\ p_k & \text{if } k = 2,\dots,10\\ p_{11} + p_{12} & \text{if } k = 11 .\end{cases}$$
It is clear now that if we group together the observations of the classes 0 and 1 and of the classes 11 and 12, under the hypothesis (i.e. that the number of sons in a family follows a binomial law) the new empirical distributions thus obtained should follow the same distribution as Y . In other words, we shall compare, using Pearson’s Theorem, the distributions
 k     qk          q̄k
 1     0.003174    0.004415
 2     0.016113    0.017007
 3     0.053711    0.046770
 4     0.120850    0.109567
 5     0.193359    0.168929
 6     0.225586    0.219624
 7     0.193359    0.181848
 8     0.120850    0.135568
 9     0.053711    0.078168
10     0.016113    0.029599
11     0.003174    0.008504
where the $\bar q_k$ are obtained by grouping the empirical distributions: $\bar q_1 = \bar p_0 + \bar p_1$, $\bar q_k = \bar p_k$ for .k = 2, . . . , 10, $\bar q_{11} = \bar p_{11} + \bar p_{12}$. Now the products .nq1 and .nq11 are equal to .6115 · 0.003174 = 19.41 > 5 and Pearson’s approximation is applicable. The numerical computation now gives
$$T = 6115\cdot\sum_{i=1}^{11}\frac{(\bar q_i - q_i)^2}{q_i} = 242.05 ,$$
which is much larger than the usual quantiles of the .χ 2 (10) distribution, as $\chi^2_{0.95}(10) = 18.3$. The hypothesis that the data follow a .B(12, 1/2) distribution is therefore rejected with strong evidence; a minimal computation reproducing this value is sketched below. By the way, some suspicion in this direction should already have been raised by the histogram comparing expected and empirical values, provided in Fig. 3.6. Indeed, rather than large discrepancies between expected and empirical values, the suspicious feature is that the empirical values exceed the expected ones for extreme values (.0, 1, 2 and .8, 9, 10, 11, 12) but are smaller for central values. If the differences were ascribable to random fluctuations (as opposed to inadequacy of the model) a greater irregularity in the differences would be expected. The model suggested so far, with the assumption of
• independence of the outcomes of different births and
• equiprobability of daughter/son,
must therefore be rejected. This confronts us with the problem of finding a more adequate model. What can we do? A first, simple, idea is to change the assumption of equiprobability of daughter/son at birth. But this is not likely to improve the adequacy of the model. Actually, for values of p larger than 1/2 we can expect an increase of the values .qk for k close to 11, but also, at the other extreme, a decrease for those that are close to 1. And the other way round if we choose .p < 1/2. By the way, there is some literature concerning the construction of a reasonable model for Geissler’s data. We shall come back to these data later in Example 4.18 where we shall try to put together a more successful model.
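The merged-class computation can be reproduced as follows. This sketch is not part of the text; it recomputes the binomial probabilities exactly rather than using the rounded values of the tables, so the result may differ from 242.05 in the last decimals.

```python
from math import comb

N = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]   # Table 3.1
n = sum(N)                                                            # 6115 families
p = [comb(12, k) * 0.5 ** 12 for k in range(13)]                      # B(12, 1/2) probabilities

# merge the classes {0,1} and {11,12} as in the definition of Y
q = [p[0] + p[1]] + p[2:11] + [p[11] + p[12]]
q_bar = [(N[0] + N[1]) / n] + [N[k] / n for k in range(2, 11)] + [(N[11] + N[12]) / n]

T = n * sum((qb - qi) ** 2 / qi for qb, qi in zip(q_bar, q))
print(round(T, 2))     # ~242, far beyond the 0.95 quantile 18.3 of chi^2(10)
```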
Fig. 3.6 The white bars are for the empirical values .p k , the black ones for the expected values .pk
3.10 Some Useful Complements

Let us consider some transformations that preserve the convergence in law. A first result of this type has already appeared in Remark 3.16.
Lemma 3.45 (Slutsky’s Lemma) Let .Zn , Un , .n ≥ 1, be respectively .Rd and .Rm -valued r.v.’s on some probability space .(Ω, F, P) and let us assume L L that .Zn →n→∞ Z, .Un →n→∞ U where U is a constant r.v. taking the value m .u0 ∈ R with probability 1. Then L
(a) .(Zn , Un ) → (Z, u0 ). n→∞
(b) If .Φ
:
Rm × Rd
→
Rl is a continuous map then
L
Φ(Zn , Un ) → Φ(Z, u0 ). In particular
.
n→∞
L
(b1) if .d = m then .Zn + Un → Z + u0 ; n→∞
L
(b2) if .m = 1 (i.e. the sequence .(Un )n is real-valued) then .Zn Un → Zu0 . n→∞
Proof (a) If .ξ ∈ Rd , .θ ∈ Rm , then the characteristic function of .(Zn , Un ) computed at .(ξ, θ ) ∈ Rd+m is E ei ξ, Zn ei θ, Un = E ei ξ, Zn e θ, u0 + E ei ξ, Zn (ei θ, Un − ei θ, u0 ) .
.
3.10 Some Useful Complements
163
The first term on the right-hand side converges to .E(ei ξ,Z ei θ,u0 ); it will therefore be sufficient to prove that the other term tends to 0. Indeed i ξ, Z i θ, U n (e n − ei θ, u0 )] ≤ E |ei ξ, Zn (ei θ, Un − ei θ, u0 )| E[e = E |ei θ, Un − ei θ, u0 | = E[f (Un )] ,
.
where .f (x) = |ei θ, x − ei θ, u0 |; we have .E[f (Un )] →n→∞ E[f (U )] = f (u0 ) = 0, as f is a bounded continuous function. (b) Follows from (a) and Remark 3.28. Note that in Slutsky’s Lemma no assumption of independence between the .Zn ’s and the .Un ’s is made. This makes it a very useful tool, as highlighted in the next example.
Example 3.46 Let us go back to the situation of Pearson’s Theorem 3.42 and recall the definition of relative entropy (or Kullback-Leibler divergence) between the common distribution of the r.v.’s and their empirical distribution .p n (Exercise 2.24) H (pn ; p) =
.
m pn (i) i=1
pi
log
pn (i)
pi . pi
Recall that relative entropy is also a measure of the discrepancy between probabilities and note first that, as by the Law of Large Numbers (see relation (3.35)) .pn → p, we have .pn (i)/pi →n→∞ 1 for every i and therefore .H (p n ; p) →n→∞ 0, since the function .x → x log x vanishes at 1. L What can be said of the limit .n H (p n ; p) →n→∞ ? It turns out that Pearson’s statistics .Tn is closely related to relative entropy. The Taylor expansion of .x → x log x at .x0 = 1 gives x log x = (x − 1) +
.
1 1 (x − 1)2 − 2 (x − 1)3 , 2 6ξ
where .ξ is a number between x and 1. Therefore n H (p n ; p) =n
.
m pn (i) i=1
pi
m m
1 pn (i) 3 1 pn (i) 2 −1 pi + n −1 pi −n −1 pi 2 2 pi pi 6ξi,n i=1
= I1 + I2 + I3 .
i=1
164
3 Convergence
Of course .I1 = 0 for every n as .
m pn (i) i=1
pi
m m
− 1 pi = pn (i) − pi = 1 − 1 = 0 . i=1
i=1
By Pearson’s Theorem, 2I2 = n
.
m (pn (i) − pi )2
pi
i=1
= Tn
L
→
n→∞
χ 2 (m − 1) .
Finally |I3 | ≤ n
.
m pn (i)
pi
i=1
1 p (i)
2 n − 1 pi × max − 1 i=1,...,m 6ξ 2 p i i,n
1 p (i) n − 1 . i=1,...,m 6ξ 2 p i i,n
= Tn × max
p (i)
As mentioned above, by the Law of Large Numbers . pn i →n→∞ 1 a.s. for 2 → every .i = 1, . . . , m hence also .ξi,n n→∞ 1 a.s. (.ξi,n is a number between pn (i) pi
and 1), so that .|I3 | turns out to be the product of a term converging in law to a .χ 2 (m − 1) distribution and a term converging to 0. By Slutsky’s Lemma L therefore .I3 →n→∞ 0 and, by Slutsky again,
.
n × 2H (pn ; p)
.
L
→
n→∞
χ 2 (m − 1) .
In some sense Pearson’s statistics .Tn is the first order term in the expansion of the relative entropy H around p multiplied by 2 (see Fig. 3.7).
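As a numerical illustration of this expansion (not part of the text), one can compare Pearson's statistics with twice the scaled relative entropy on the die data (3.41) of Example 3.43; the two values are indeed close.

```python
import math

counts = [388, 322, 314, 316, 344, 316]
n = sum(counts)
p = 1 / 6
p_bar = [c / n for c in counts]               # empirical distribution

T = n * sum((pb - p) ** 2 / p for pb in p_bar)                     # Pearson's statistics
H = sum(pb * math.log(pb / p) for pb in p_bar)                     # relative entropy H(p_bar; p)
print(round(T, 2), round(2 * n * H, 2))       # ~12.6 versus ~12.2: close, as the expansion predicts
```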
Another useful application of Slutsky’s Lemma is the following.
1 3
0
1
Fig. 3.7 Comparison between the graphs, as a function of q, of the relative entropy of a Bernoulli distribution with respect to a .B(p, 1) with .p = 13 multiplied by 2 and of the corresponding Pearson’s statistics (dots)
.B(q, 1)
Theorem 3.47 (The Delta Method) Let .(Zn )n be a sequence of .Rd -valued r.v.’s, such that √ n (Zn − z)
.
L
→
n→∞
Z ∼ N(0, C) .
Let .Φ : Rd → Rm be a differentiable map with a continuous derivative at z. Then √ n Φ(Zn ) − Φ(z)
.
L
→
n→∞
N 0, Φ (z) C Φ (z)∗ .
Proof Thanks to Slutski’s Lemma 3.45(b), we have √ 1 Zn − z = √ × n (Zn − z) n
.
L
→
n→∞
0·Z =0.
Hence, by Proposition 3.29(b), .Zn →Pn→∞ z. Let us first prove the statement for .m = 1, so that .Φ is real-valued. By the theorem of the mean, we can write √ √ !n )(Zn − z) , n Φ(Zn ) − Φ(z) = n Φ (Z
.
(3.43)
!n − z| ≤ !n is a (random) vector in the segment between z and .Zn so that .|Z where .Z !n − z| →n→∞ 0 in probability and in law. Since .Φ is |Zn − z|. It follows that .|Z
166
3 Convergence L
!n ) → Φ (z) by Remark 3.16. Therefore (3.43) gives continuous at z, .Φ (Z n→∞
√ n Φ(Zn ) − Φ(z)
L
→
.
n→∞
Φ (z) Z
and the statement follows by Slutsky’s Lemma, recalling how Gaussian laws transform under linear maps (as explained p. 88). In dimension .m > 1 the theorem of the mean in the form above is not available, but the idea is quite similar. We can write √ √ . n Φ(Zn ) − Φ(z) = n =
√ n
=
n Φ (z)(Zn − z) +
1
0
1
√
d Φ z + s(Zn − z) ds ds
Φ z + s(Zn − z) (Zn − z) ds
0
√
1
n 0
Φ (z + s(Zn − z)) − Φ (z) (Zn − z) ds . :=In
We have √ n Φ (z)(Zn − z)
.
L
→
n→∞
N 0, Φ (z) C Φ (z)∗ ,
so that, by Slutsky’s lemma, the proof is complete if we prove that .In →n→∞ 0 in probability. We have √ |In | ≤ | n(Zn − z)| × sup Φ (z + s(Zn − z)) − Φ (z) .
.
0≤s≤1
√ Now .| n(Zn − z)| → |Z| in law and the result will follow from Slutsky’s lemma again if we can show that sup Φ (z + s(Zn − z)) − Φ (z)
.
0≤s≤1
L
→
n→∞
0.
Let .ε > 0. As .Φ is assumed to be continuous at z, let .δ > 0 be such that . Φ (z + x) − Φ (z) ≤ ε whenever .|x| ≤ δ. Then we have
P sup Φ (z + s(Zn − z)) − Φ (z) > ε
.
0≤s≤1
= P sup Φ (z + s(Zn − z)) − Φ (z) > ε, |Zn − z| ≤ δ 0≤s≤1
=∅
+P sup Φ (z + s(Zn − z)) − Φ (z) > ε, |Zn − z| > δ 0≤s≤1
≤ P |Zn − z| > δ
so that .
lim P sup Φ (z + s(Zn − z)) − Φ (z) > ε ≤ lim P |Zn − z| > δ = 0 .
n→∞
n→∞
0≤s≤1
Exercises

3.1 (p. 317) Let .(Xn )n be a sequence of real r.v.’s converging to X in .Lp , .p ≥ 1. (a) Prove that .
lim E(Xn ) = E(X) .
n→∞
(b) Prove that if two sequences .(Xn )n , .(Yn )n , defined on the same probability space, converge in .L2 to X and Y respectively, then the product sequence 1 .(Xn Yn )n converges to XY in .L . (c1) Prove that if .Xn →n→∞ X in .L2 then also .
lim Var(Xn ) = Var(X) .
n→∞
(c2) Prove that if .(Xn )n and X are .Rd -valued and .Xn →n→∞ X in .L2 , then the covariance matrices converge. 3.2 (p. 317) Let .(Xn )n be a sequence of real r.v.’s on .(Ω, F, P) and .δ a real number. Which of the following is true? (a) .
% " # $ lim Xn ≥ δ = lim Xn ≥ δ .
n→∞
n→∞
(b) .
% " # $ lim Xn < δ ⊂ lim Xn ≤ δ .
n→∞
n→∞
168
3 Convergence
3.3 (p. 317) (a) Let X be an r.v. uniform on .[0, 1] and let An = {X ≤ n1 } .
.
(a1) (a2) (b) (b1)
Compute . ∞ n=1 P(An ). Compute .P(limn→∞ An ). Let .(Xn )n be a sequence of independent r.v.’s uniform on .[0, 1]. Let Bn = {Xn ≤ n1 } .
.
Compute .P(limn→∞ Bn ). (b2) And if Bn = {Xn ≤
.
1 } n2
?
3.4 (p. 318) Let .(Xn )n be a sequence of independent r.v.’s having exponential law respectively of parameter .an = (log(n + 1))α , .α > 0. Note that the sequence .(an )n is increasing so that the r.v.’s .Xn “become smaller” as n increases. (a) (b1) (b2) (c)
Determine .P(limn→∞ {Xn ≥ 1}) according to the value of .α. Compute .limn→∞ Xn according to the value of .α. Compute .limn→∞ Xn according to the value of .α. For which values of .α (recall that .α > 0) does the sequence .(Xn )n converge a.s.?
3.5 (p. 319) (Recall Remark 2.1) Let .(Zn )n be a sequence of i.i.d. positive r.v.’s. (a) Prove the inequalities ∞ .
P(Z1 ≥ n) ≤ E(Z1 ) ≤
n=1
(b) (b1) (b2) (c)
∞
P(Z1 ≥ n) .
n=0
Prove that if .E(Z1 ) < +∞ then .P(Zn ≥ n infinitely many times) = 0; if .E(Z1 ) = +∞ then .Zn ≥ n infinitely many times with probability 1. Let .(Xn )n be a sequence of i.i.d. real r.v.’s and let x2 = sup{θ ; E(eθXn ) < +∞}
.
be the right convergence abscissa of the Laplace transform of the .Xn .
Exercises
169
(c1) Prove that if .x2 < +∞ then .
lim
n→∞
1 Xn = log n x2
with the understanding . x12 = +∞ if .x2 = 0. (c2) Assume that .Xn ∼ N(0, 1). Compute .
|Xn | · lim √ log n
n→∞
3.6 (p. 320) Let .(Xn )n be a sequence of i.i.d. r.v.’s such that .0 < E(|X1 |) < +∞. For every .ω ∈ Ω let us consider the power series ∞ .
Xn (ω)x n
n=1
−1 and let .R(ω) = limn→∞ |Xn (ω)|1/n be its radius of convergence. (a) Prove that R is an a.s. constant r.v. (b) Prove that there exists an .a > 0 such that P |Xn | ≥ a for infinitely many indices n = 1
.
and deduce that .R ≤ 1 a.s. n (c) Let .b > 1. Prove that . ∞ n=1 P(|Xn | ≥ b ) < +∞ and deduce the value of R a.s. 3.7 (p. 321) Let .(Xn )n be a sequence of r.v.’s with values in the metric space E. Prove that .limn→∞ Xn = X in probability if and only if .
d(X , X) n =0. lim E n→∞ 1 + d(Xn , X)
(3.44)
Beware, sub-sub-sequences. . . 3.8 (p. 322) Let .(Xn )n be a sequence of r.v.’s on the probability space .(Ω, F, P) such that ∞ .
k=1
E(|Xk |) < +∞ .
(3.45)
170
3 Convergence
(a) Prove that the series ∞ .
(3.46)
Xk
k=1
converges in .L1 . + (b1) Prove that the series . ∞ k=1 Xk converges a.s. (b2) Prove that in (3.46) convergence also takes place a.s. 3.9 (p. 322) (Lebesgue’s Theorem for convergence in probability) If .(Xn )n is a sequence of r.v.’s that is bounded in absolute value by an integrable r.v. Z and such that .Xn →Pn→∞ X, then .
lim E(Xn ) = E(X) .
n→∞
Sub-sub-sequences. . . 3.10 (p. 323) Let .(Xn )n be a sequence of i.i.d. Gamma.(1, 1)-distributed (i.e. exponential of parameter 1) r.v.’s and Un = min(X1 , . . . , Xn ) .
.
(a1) (a2) (b) (c)
What is the law of .Un ? Prove that .(Un )n converges in law and determine the limit law. Does the convergence also take place a.s.? Let, for .α > 1, .Vn = Unα . Let .1 < β < α. Compute .P(Vn ≥ that the series ∞ .
1 ) nβ
and prove
Vn
n=1
converges a.s. 3.11 (p. 323) Let .(Xn )n be a sequence of i.i.d. square integrable centered r.v.’s with common variance .σ 2 . (a1) Does the r.v. .X1 X2 have finite mathematical expectation? Finite variance? In the affirmative, what are their values? (a2) If .Yn := Xn Xn+1 , what is the value of .Cov(Yk , Ym ) for .k = m? (b) Does the sequence .
1 X1 X2 + X2 X3 + · · · + Xn Xn+1 n
converge a.s.? If yes, to which limit?
Exercises
171
3.12 (p. 324) Let .(Xn )n be a sequence of i.i.d. r.v.’s having a Laplace law of parameter .λ. Discuss the a.s. convergence of the sequences
.
1 4 X + X24 + · · · + Xn4 , n 1
X12 + X22 + · · · + Xn2 X14 + X24 + · · · + Xn4
·
3.13 (p. 324) (Estimation of the variance) Let .(Xn )n be a sequence of square integrable real i.i.d. r.v.’s with variance .σ 2 and let 1 (Xk − Xn )2 , n n
Sn2 =
.
k=1
where .X n =
1 n
n
k=1 Xk
are the empirical means.
(a) Prove that .(Sn2 )n converges a.s. to a limit to be determined. (b) Compute .E(Sn2 ). 3.14 (p. 325) Let .(μn )n , .(νn )n be sequences of probabilities on .Rd and .Rm respectively converging weakly to the probabilities .μ and .ν respectively. (a) Prove that, weakly, .
lim μn ⊗ νn = μ ⊗ ν .
n→∞
(3.47)
(b1) Prove that if .d = m then, weakly, .
lim μn ∗ νn = μ ∗ ν .
n→∞
(b2) If .νn denotes an .N(0, n1 I ) probability, prove that .μ ∗ νn →n→∞ μ weakly. 3.15 (p. 326) (First have a look at Exercise 2.5) (a) Let .f : Rd → R be a differentiable function with bounded derivatives and .μ a probability on .Rd . Prove that the function μ ∗ f (x) :=
.
Rd
f (x − y) dμ(y)
is differentiable. (b1) Let .gn be the density of a d-dimensional .N(0, n1 I ) law. Prove that its deriva-
tives of order .α are of the form .Pα (x) e−n|x| /2 , where .Pα is a polynomial, and that they are therefore bounded. (b2) Prove that there exists a sequence .(fn )n of .C ∞ probability densities on .Rd such that, if .dμn := fn dx then .μn →n→∞ μ weakly. 2
172
3 Convergence
3.16 (p. 327) Let .(E, B(E)) be a topological space and .ρ a .σ -finite measure on B(E). Let .fn , .n ≥ 1, be densities with respect to .ρ and let .dμn = fn dρ be the probability on .(E, B(E)) having density .fn with respect to .ρ.
.
(a) Assume that .fn →n→∞ f in .L1 (ρ). (a1) Prove that f is itself a density. (a2) Prove that, if .dμ = f dρ, then .μn →n→∞ μ weakly and moreover that, for every .A ∈ B(E), .μn (A) →n→∞ μ(A). (b) On .(R, B(R)) let fn (x) =
.
1 + cos(2nπ x)
if 0 ≤ x ≤ 1
0
otherwise.
(b1) Prove that the .fn ’s are probability densities with respect to the Lebesgue measure of .R. (b2) Prove that the probabilities .dμn (x) = fn (x) dx converge weakly to a probability .μ having a density f to be determined. (b3) Prove that the sequence .(fn )n does not converge to f in .L1 (with respect to the Lebesgue measure). 3.17 (p. 328) Let .(E, B(E)) be a topological space and .μn , .μ probabilities on it. We know (this is (3.19)) that if .μn →n→∞ μ weakly then .
lim μn (G) ≥ μ(G)
n→∞
for every open set G ⊂ E .
(3.48)
Prove the converse, i.e. that, if (3.48) holds, then .μn →n→∞ μ weakly. Recall Remark 2.1. Of course a similar criterion holds with closed sets. 3.18 (p. 329) Let .(Xn )n be a sequence of r.v.’s (no assumption of independence) with .Xn ∼ χ 2 (n), .n ≥ 1. What is the behavior of the sequence .( n1 Xn )n ? Does it converge in law? In probability? 3.19 (p. 330) Let .(Xn )n be a sequence of r.v.’s having respectively a geometric law of parameter .pn = λn . Show that the sequence .( n1 Xn )n converges in law and determine its limit. 3.20 (p. 331) Let .(Xn )n be a sequence of real independent r.v.’s having respectively density, with respect to the Lebesgue measure, .fn (x) = 0 for .x < 0 and fn (x) =
.
n (1 + nx)2
for x > 0 .
(a) Investigate the convergence in law and in probability of .(Xn )n . (b) Prove that .(Xn )n does not converge a.s. and compute .lim and .lim of .(Xn )n .
Exercises
173
3.21 (p. 331) Let .(Xn )n be a sequence of i.i.d. r.v.’s uniform on .[0, 1] and let Zn = min(X1 , . . . , Xn ) .
.
(a) Does the sequence .(Zn )n converge in law as .n → ∞? In probability? A.s.? (b) Prove that the sequence .(n Zn )n converges in law as .n → ∞ and determine the limit law. Give an approximation of the probability P min(X1 , . . . , Xn ) ≤ n2
.
for n large. 3.22 (p. 332) Let, for every .n ≥ 1, .U1(n) , . . . , Un(n) be i.i.d. r.v.’s uniform on .{0, 1, . . . , n} respectively and (n)
Mn = min Uk
.
k≤n
.
Prove that .(Mn )n converges in law and determine the limit law. 3.23 (p. 332) (a) Let .μn be the probability on .R μn = (1 − an )δ0 + an δn
.
where .0 ≤ an ≤ 1. Prove that if .limn→∞ an = 0 then .(μn )n converges weakly and compute its limit. (b) Construct an example of a sequence .(μn )n converging weakly but such that the means or the variances of the .μn do not converge to the mean and the variance of the limit (see however Exercise 3.30 below). (c) Prove that, in general, if .Xn →n→∞ X in law then .limn→∞ E(|Xn |) ≥ E(|X|) and .limn→∞ E(Xn2 ) ≥ E(X2 ). 3.24 (p. 333) Let .(Xn )n be a sequence of r.v.’s with .Xn ∼ Gamma.(1, λn ) with λn →n→∞ 0.
.
(a) Prove that .(Xn )n does not converge in law. (b) Let .Yn = Xn − Xn . Prove that .(Yn )n converges in law and determine its limit (. = the integer part function). L 3.25 (p. 334) Let .(Xn )n be a sequence of .Rd -valued r.v.’s. Prove that .Xn →n→∞ X d L if and only if, for every .θ ∈ R , . θ, Xn →n→∞ θ, X.
174
3 Convergence
3.26 (p. 334) Let .(Xn )n be a sequence of i.i.d. r.v.’s with mean 0 and variance .σ 2 . Prove that the sequence Zn =
.
(X1 + · · · + Xn )2 n
converges in law and determine the limit law. 3.27 (p. 334) In the FORTRAN libraries in use in the 1970s (but also nowadays. . . ), in order to generate an .N(0, 1)-distributed random number the following procedure was implemented. If .X1 , . . . , X12 are independent r.v.’s uniform on .[0, 1], then the number W = X1 + · · · + X12 − 6
.
(3.49)
is (approximately) .N(0, 1)-distributed. (a) Can you give a justification of this procedure? (b) Let .Z ∼ N(0, 1). What is the value of .E(Z 4 )? And of .E(W 4 )? What do you think of this procedure? 3.28 (p. 336) Let .(Ω, F, P) be a probability space. (a) Let .(An )n ⊂ F be a sequence of events and assume that, for some .α > 0, .P(An ) ≥ α for infinitely many indices n. Prove that
P lim An ≥ α .
.
n→∞
(b) Let .Q be another probability on .(Ω, F) such that .Q P. Prove that, for every .ε > 0 there exists a .δ > 0 such that, for every .A ∈ F, if .P(A) ≤ δ then .Q(A) ≤ ε. 3.29 (p. 337) Let .(Xn )n be a sequence of m-dimensional r.v.’s converging a.s. to an r.v. X. Assume that .(Xn )n is bounded in .Lr for some .r > 1 and let M be an upper bound for the .Lr norms of the .Xn . (a) Prove that .X ∈ Lr . (b) Prove that, for every .p < r, .Xn →n→∞ X in .Lp . What if we assumed .Xn →n→∞ X in probability instead of a.s.? 3.30 (p. 337) Let .(Xn )n be a sequence of real r.v.’s converging in law to an r.v. X. In general convergence in law does not imply convergence of the means, as the function .x → x is not bounded and Exercise 3.23 provides some examples. But if we add the assumption of uniform integrability. . . (a) Let .ψR (x) := x d(x, [−(R + 1), R + 1]c ) ∧ 1 ; .ψR is a continuous function that coincides with .x → x on .[−R, R] and vanishes outside the interval
Exercises
−R
175
1
0
−R
R
R+1
Fig. 3.8 The graph of .ψR
[−(R + 1), R + 1] (see Fig. 3.8). Prove that, .E[ψR (Xn )] →n→∞ E[ψR (Xn )], for every .R > 0. (b) Prove that if, in addition, the sequence .(Xn )n is uniformly integrable then X is integrable and .E(Xn ) →n→∞ E(X). .
• In particular, if .(Xn )n is bounded in .Lp .p > 1, then .E(Xn ) →n→∞ E(X). 3.31 (p. 338) In this exercise we see two approximations of the d.f. of a .χ 2 (n) distribution for large n using the Central Limit Theorem, the first one naive, the other more sophisticated. (a) Prove that if .Xn ∼ χ 2 (n) then .
Xn − n √ 2n
L
→
n→∞
N(0, 1) .
(b1) Prove that √
.
lim √
n→∞
2n
1 √ = 2 2Xn + 2n
a.s.
(b2) (Fisher’s approximation) Prove that & √ 2Xn − 2n − 1
.
L
→
n→∞
N(0, 1) .
(3.50)
(c) Derive first from (a) and then from (b) an approximation of the d.f. of the 2 .χ (n) laws for n large. Use them in order to obtain approximate values of
176
3 Convergence
the quantile of order 0.95 of a χ²(100) law and compare with the exact value 124.34. Which one of the two approximations appears to be more accurate?

3.32 (p. 340) Let (X_n)_n be a sequence of r.v.’s with X_n ∼ Gamma(n, 1).
(a) Compute the limit, in law,

lim_{n→∞} (1/n) X_n .

(b) Compute the limit, in law,

lim_{n→∞} (1/√n) (X_n − n) .

(c) Compute the limit, in law,

lim_{n→∞} (1/√X_n) (X_n − n) .
3.33 (p. 341) Let (X_n)_n be a sequence of i.i.d. r.v.’s with P(X_n = ±1) = 1/2 and let X̄_n = (1/n)(X_1 + · · · + X_n). Compute the limits in law of the sequences
(a) (√n sin X̄_n)_n.
(b) (√n (1 − cos X̄_n))_n.
(c) (n (1 − cos X̄_n))_n.
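The following short sketch (not part of the text; it assumes SciPy is available) compares, as suggested in Exercise 3.31 (c), the two approximations of the quantile of order 0.95 of a χ²(100) law with the exact value 124.34.

```python
# A quick numerical check (not from the text) of the two chi-square
# approximations discussed in Exercise 3.31.
from scipy.stats import chi2, norm

n, q = 100, 0.95
z = norm.ppf(q)                                 # 0.95-quantile of N(0, 1)

exact  = chi2.ppf(q, df=n)                      # about 124.34
naive  = n + z * (2 * n) ** 0.5                 # from (a): X_n ~ n + sqrt(2n) Z
fisher = 0.5 * (z + (2 * n - 1) ** 0.5) ** 2    # from (b2): sqrt(2 X_n) ~ Z + sqrt(2n-1)

print(f"exact  : {exact:.2f}")                  # 124.34
print(f"naive  : {naive:.2f}")                  # about 123.3
print(f"Fisher : {fisher:.2f}")                 # about 124.1, closer to the exact value
```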
Chapter 4
Conditioning
4.1 Introduction Let .(Ω, F, P) be a probability space. The following definition is well-known.
Let B ∈ F be a non-negligible event. The conditional probability of P given B is the probability P_B on (Ω, F) defined as

P_B(A) = P(A ∩ B) / P(B)     for every A ∈ F .     (4.1)
The fact that P_B is a probability on (Ω, F) is immediate. From a modeling point of view: at the beginning we know that every event A ∈ F can occur with probability P(A). If, afterwards, we acquire the information that the event B has taken place, we shall replace the probability P with P_B, in order to take into account the new information.
Similarly, let X be a real r.v. and Z an r.v. taking values in a countable set E such that P(Z = z) > 0 for every z ∈ E. For every Borel set A ⊂ R and every z ∈ E let

n(z, A) = P(X ∈ A | Z = z) = P(X ∈ A, Z = z) / P(Z = z) .
The set function .A → n(z, A) is, for every .z ∈ E, a probability on .R: it is the conditional law of X given .Z = z. This probability has an intuitive meaning not dissimilar to the one above: .A → n(z, A) is the law that is reasonable to appoint to X if we acquire the information that the event .{Z = z} has occurred.
The conditional expectation of X given Z = z is defined as the mean, if it exists, of this law:

E(X | Z = z) = ∫_R x n(z, dx) = 1/P(Z = z) ∫_{Z=z} X dP = E(X 1_{Z=z}) / P(Z = z) .

These are very important notions, as we shall see throughout. It is therefore important to extend them to the case of a general r.v. Z (i.e. without the assumption that Z takes at most countably many values). This is the goal of this chapter, where we shall also see some applications.
The idea is to characterize the quantity h(z) = E(X | Z = z) in a way that also makes sense if Z is not discretely valued. For every B ⊂ E we have

∫_{Z∈B} h(Z) dP = Σ_{z∈B} E(X | Z = z) P(Z = z) = Σ_{z∈B} E(X 1_{Z=z}) = E(X 1_{Z∈B}) = ∫_{Z∈B} X dP ,
i.e. the integrals of .h(Z), which is .σ (Z)-measurable, and of X on the events of .σ (Z) coincide. We shall see that this property characterizes the conditional expectation. In the sequel we shall proceed contrariwise with respect to this section: we will first define conditional expectations and then return to the conditional laws at the end.
4.2 Conditional Expectation Recall (see p. 14) that for a real r.v. X, if .X = X+ − X− is its decomposition into positive and negative parts, X is lower semi-integrable (l.s.i.) if .X− is integrable and that in this case we can define the mathematical expectation .E(X) = E(X+ ) − E(X− ) (possibly .E(X) = +∞). In the sequel we shall need the following result.
Lemma 4.1 Let (Ω, F, P) be a probability space, D ⊂ F a sub-σ-algebra, X and Y real l.s.i. D-measurable r.v.’s.
(a) If

E(X 1_D) ≥ E(Y 1_D)     for every D ∈ D     (4.2)

then X ≥ Y a.s.
(b) If

E(X 1_D) = E(Y 1_D)     for every D ∈ D

then X = Y a.s.

Proof (a) Let D_{r,q} = {X ≤ r < q ≤ Y} ∈ D. Note that {X < Y} = ∪_{r,q∈Q} D_{r,q}, which is a countable union, so that it is enough to show that if (4.2) holds then P(D_{r,q}) = 0 for every r < q. But if P(D_{r,q}) > 0 for some r, q, r < q, then we would have

∫_{D_{r,q}} X dP ≤ r P(D_{r,q}) < q P(D_{r,q}) ≤ ∫_{D_{r,q}} Y dP ,

contradicting the assumption.
(b) Follows from (a) exchanging the roles of X and Y.
Note that for integrable r.v.’s the lemma is a consequence of Exercise 1.9 (c).
Definition and Theorem 4.2 Let X be an l.s.i. r.v. and D ⊂ F a sub-σ-algebra. Then there exists an l.s.i. r.v. Y which is
(a) D-measurable,
(b) such that

∫_D Y dP = ∫_D X dP     for every D ∈ D .     (4.3)

We shall denote E(X | D) such an Y. E(X | D) is the conditional expectation of X given D.
Proof Let us assume first that X is square integrable. Let .K = L2 (Ω, D, P) denote the subspace of .L2 of the square integrable r.v.’s that are . D-measurable. Or, to be precise, recalling that the elements of .L2 are equivalent classes of functions, K is the space of these classes that contain a function that is . D-measurable. As 2 .L convergence implies a.s. convergence for a subsequence and a.s. convergence preserves measurability (recall Remark 1.15), K is a closed subspace of .L2 .
Going back to Proposition 2.40, let .Y = P X the orthogonal projection of X on .L2 (Ω, D, P). We can write .X = Y + QX where .QX = X − P X. As QX is orthogonal to .L2 (Ω, D, P) (Proposition 2.40 again), we have, for every .D ∈ D,
∫_D X dP = ∫_D Y dP + ∫_D QX dP = ∫_D Y dP + ∫ 1_D QX dP = ∫_D Y dP ,

and P X satisfies (a) and (b) in the statement.
We now drop the assumption that X is square integrable. Let us assume X to be positive and let X_n = X ∧ n. Then, for every n, X_n is square integrable and X_n ↑ X a.s. If Y_n := P X_n then, for every D ∈ D,

∫_D Y_n dP = ∫_D X_n dP ≤ ∫_D X_{n+1} dP = ∫_D Y_{n+1} dP

and therefore, thanks to Lemma 4.1, also (Y_n)_n is an a.s. increasing sequence. By Beppo Levi’s Theorem, twice, we obtain

∫_D X dP = lim_{n→∞} ∫_D X_n dP = lim_{n→∞} E(X_n 1_D) = lim_{n→∞} E(Y_n 1_D) = ∫_D Y dP .
Taking D = Ω in the previous relation, if X is integrable then also E(X | D) := Y is integrable, hence a.s. finite. If X = X⁺ − X⁻ is l.s.i., then we can just define

E(X | D) = E(X⁺ | D) − E(X⁻ | D)

with no danger of encountering a +∞ − ∞ situation, as E(X⁻ | D) is integrable, hence a.s. finite. Uniqueness follows immediately from Lemma 4.1.
We shall deal mostly with the conditional expectation of integrable r.v.’s. It is however useful to have this notion defined in the more general l.s.i. setting. In particular, E(X | D) is always defined if X ≥ 0. See also Proposition 4.6 (d).
By linearity (4.3) is equivalent to

E(Y W) = E(X W)     (4.4)

for every r.v. W that is the linear combination with positive coefficients of indicator functions of events of D, hence (Proposition 1.6) for every D-measurable bounded positive r.v. W.
We shall often have to prove statements of the type “a certain r.v. Z is equal to E(X | D)”. On the basis of Theorem 4.2 this requires us to prove two things, namely that
(a) Z is D-measurable,
(b) E(Z 1_D) = E(X 1_D) for every D ∈ D.
Actually requirement (b) can be weakened considerably (but not surprisingly), as explained in the following remark, which we only state in the case when X is integrable. Remark 4.3 If X is integrable then .Z = E(X| D) if and only if (a) Z is integrable and . D-measurable (b) .E(Z1D ) = E(X1D ) as D ranges over a class . C ⊂ D generating . D, stable with respect to finite intersections and containing .Ω. Actually let us prove that the family .M ⊂ D of the events D such that E(Z1D ) = E(X1D ) is a monotone class. • If .A, B ∈ M with .A ⊂ B then
E(Z 1_{B\A}) = E(Z 1_B) − E(Z 1_A) = E(X 1_B) − E(X 1_A) = E(X 1_{B\A})

and therefore also B \ A ∈ M. Note that the previous relation requires X to be integrable, so that both E(X 1_B) and E(X 1_A) are finite.
• Let (D_n)_n ⊂ M be an increasing sequence of events and D = ∪_n D_n. Then X 1_{D_n} →_{n→∞} X 1_D, Z 1_{D_n} →_{n→∞} Z 1_D and, as |X| 1_{D_n} ≤ |X| and |Z| 1_{D_n} ≤ |Z|, we can apply Lebesgue’s Theorem (twice) and obtain that

E(Z 1_D) = lim_{n→∞} E(Z 1_{D_n}) = lim_{n→∞} E(X 1_{D_n}) = E(X 1_D) .
M is therefore a monotone class containing . C that is stable with respect to finite intersections. By the Monotone Class Theorem 1.2, .M contains also .σ ( C) = D.
Remark 4.4 The conditional expectation operator is monotone, i.e. if X and Y are l.s.i. and X ≥ Y a.s., then E(X | D) ≥ E(Y | D) a.s. Indeed for every D ∈ D

E[E(X | D) 1_D] = E(X 1_D) ≥ E(Y 1_D) = E[E(Y | D) 1_D]
and the property follows by Lemma 4.1.
The following two statements provide further elementary, but important, properties of the conditional expectation operator.
Proposition 4.5 Let X, Y be integrable r.v.’s and α, β ∈ R. Then
(a) E(αX + βY | D) = α E(X | D) + β E(Y | D) a.s.
(b) E[E(X | D)] = E(X).
(c) If D′ ⊂ D, then E[E(X | D) | D′] = E(X | D′) a.s. (i.e. to condition first with respect to D and then to the smaller σ-algebra D′ is the same as conditioning directly with respect to D′).
(d) If Z is bounded and D-measurable then E(ZX | D) = Z E(X | D) a.s. (i.e. bounded D-measurable r.v.’s can go in and out of the conditional expectation, as if they were constants).
(e) If X is independent of D then E(X | D) = E(X) a.s.
Proof These are immediate applications of the definition and boil down to the validation of the two conditions (a) and (b) p. 180; let us give the proofs of the last three points.
(c) The r.v. E[E(X | D) | D′] is D′-measurable; moreover if W is bounded D′-measurable then

E[W E[E(X | D) | D′]] = E[W E(X | D)] = E(W X) ,

where the first equality comes from the definition of conditional expectation with respect to D′ and the last one from the fact that W is also D-measurable.
(d) We must prove that the r.v. Z E(X | D) is D-measurable (which is immediate) and that, for every bounded D-measurable r.v. W,

E(W Z X) = E[W Z E(X | D)] .     (4.5)

But this is immediate as W is bounded D-measurable and therefore so is ZW.
(e) The r.v. ω → E(X) is constant and therefore D-measurable. If W is D-measurable then it is independent of X and

E(W X) = E(W) E(X) = E[W E(X)]
It is easy to extend Proposition 4.5 to the case of r.v.’s that are only l.s.i. Note however that (a) holds only if .α, β ≥ 0 (otherwise .αX + βY might not be l.s.i. anymore) and that (d) holds only if Z is bounded positive (again ZX might turn out not to be l.s.i.). The next statement concerns the behavior of the conditional expectation with respect to convergence.
Proposition 4.6 Let X, X_n, n = 1, 2, . . . , be real l.s.i. r.v.’s. Then
(a) (Beppo Levi) if X_n ↑ X as n → ∞ a.s. then E(X_n | D) ↑ E(X | D) as n → ∞ a.s.
(b) (Fatou) If lim_{n→∞} X_n = X a.s. and the r.v.’s X_n are bounded from below by the same integrable r.v. then

lim inf_{n→∞} E(X_n | D) ≥ E(X | D)   a.s.

(c) (Lebesgue) If |X_n| ≤ Z for some integrable r.v. Z for every n and X_n →_{n→∞} X a.s. then

lim_{n→∞} E(X_n | D) = E(X | D)   a.s.

(d) (Jensen’s inequality) If Φ : R^d → R ∪ {+∞} is a lower semi-continuous convex function and X = (X_1, . . . , X_d) is a d-dimensional integrable r.v. then Φ(X) is l.s.i. and

E[Φ(X) | D] ≥ Φ(E(X | D))   a.s. ,

denoting by E(X | D) the d-dimensional r.v. with components E(X_k | D), k = 1, . . . , d.
Proof (a) As the sequence (X_n)_n is a.s. increasing, (E(X_n | D))_n is also a.s. increasing thanks to Remark 4.4; the r.v. Z := lim_{n→∞} E(X_n | D) is D-measurable and E(X_n | D) ↑ Z as n → ∞ a.s. If D ∈ D, by Beppo Levi’s Theorem applied twice,

∫_D Z dP = lim_{n→∞} ∫_D E(X_n | D) dP = lim_{n→∞} ∫_D X_n dP = ∫_D X dP

and therefore Z = E(X | D) a.s.
(b) If Y_n = inf_{k≥n} X_k then

lim_{n→∞} ↑ Y_n = lim_{n→∞} X_n = X .

As (Y_n)_n is increasing and Y_n ≤ X_n, (a) gives

E(X | D) = lim_{n→∞} E(Y_n | D) ≤ lim inf_{n→∞} E(X_n | D) .
(c) Immediate consequence of (b), applied both to the r.v.’s X_n and −X_n.
(d) Same as the proof of Jensen’s inequality: recall, see (2.17), that a convex l.s.c. function Φ is equal to the supremum of all affine-linear functions minorizing Φ. If f(x) = ⟨a, x⟩ + b is an affine function minorizing Φ, then ⟨a, X⟩ + b is an integrable r.v. minorizing Φ(X), so that the latter is l.s.i. and

E[Φ(X) | D] ≥ E[f(X) | D] = E(⟨a, X⟩ + b | D) = ⟨a, E(X | D)⟩ + b = f(E(X | D)) .

Now just take the supremum in f among all affine-linear functions minorizing Φ.

Example 4.7 If D = {Ω, ∅} is the trivial σ-algebra, then

E(X | D) = E(X) .
Actually the only . D-measurable r.v.’s are constant and, if .c = E(X| D), then the constant c is determined by the relation .c = E[E(X| D)] = E(X). Mathematical expectation appears therefore to be a particular case of conditional expectation.
Example 4.8 Let B ∈ F be an event having strictly positive probability and let D = {B, B^c, Ω, ∅} be the σ-algebra generated by B. Then E(X | D), which is D-measurable, is a real r.v. that is constant on B and on B^c. If we denote by c_B the value of E(X | D) on B, from the relation

c_B P(B) = E[1_B E(X | D)] = ∫_B X dP

and by the similar one for B^c we obtain

E(X | D) = 1/P(B) ∫_B X dP   on B ,     E(X | D) = 1/P(B^c) ∫_{B^c} X dP   on B^c .

In particular, E(X | D) is equal to ∫ X dP_B on B, where P_B is as in (4.1), and equal to ∫ X dP_{B^c} on B^c.
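A quick numerical illustration of Example 4.8 (not part of the text; plain Python, standard library only): for a simulated X and B = {X > 0}, E(X | σ(B)) is the piecewise-constant r.v. taking the two conditional averages, and its mean equals E(X) (Proposition 4.5 (b)).

```python
# Illustration (a sketch, not from the text) of Example 4.8.
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # samples of X ~ N(0,1)
in_B = [x > 0.0 for x in xs]                             # B = {X > 0}

# conditional averages: (1/P(B)) * integral_B X dP, and the same on B^c
avg_B  = sum(x for x, b in zip(xs, in_B) if b)     / sum(in_B)
avg_Bc = sum(x for x, b in zip(xs, in_B) if not b) / (len(xs) - sum(in_B))
print(avg_B, avg_Bc)          # about +0.798 and -0.798 (= +-sqrt(2/pi) for an N(0,1))

# E(X | D) evaluated pointwise: constant on B and on B^c
cond_exp = [avg_B if b else avg_Bc for b in in_B]
print(sum(cond_exp) / len(cond_exp), sum(xs) / len(xs))  # both close to E(X) = 0
```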
Remark 4.9 As is apparent in the proof of Theorem 4.2, if X is square integrable then E(X | D) is the best approximation in L² of X with a D-measurable r.v. and moreover the r.v.’s

E(X | D)   and   X − E(X | D)

are orthogonal. As a consequence, as X = X − E(X | D) + E(X | D), we have (Pythagoras’s theorem)

E(X²) = E[(X − E(X | D))²] + E[E(X | D)²]

and the useful relation

E[|X − E(X | D)|²] = E(X²) − E[E(X | D)²] .     (4.6)
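A small sketch (not from the text; plain Python) checking relation (4.6) by Monte Carlo: take X = Y + Z with Y, Z independent and D = σ(Y), so that E(X | D) = Y + E(Z) is known explicitly.

```python
# Sketch (not from the text): numerical check of relation (4.6).
import random

random.seed(1)
n = 200_000
ys = [random.gauss(0.0, 2.0) for _ in range(n)]        # Y ~ N(0, 4)
zs = [random.expovariate(1.0) for _ in range(n)]       # Z ~ Exp(1), E(Z) = 1
xs = [y + z for y, z in zip(ys, zs)]                   # X = Y + Z
ce = [y + 1.0 for y in ys]                             # E(X | D) = Y + E(Z)

mean = lambda v: sum(v) / len(v)
lhs = mean([(x - c) ** 2 for x, c in zip(xs, ce)])     # E|X - E(X|D)|^2
rhs = mean([x * x for x in xs]) - mean([c * c for c in ce])
print(lhs, rhs)                                        # both close to Var(Z) = 1
```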
Remark 4.10 (Conditional Expectations and L^p Spaces) It is immediate that, if X = X′ a.s., then E(X | D) = E(X′ | D) a.s.: for every D ∈ D we have

E[E(X | D) 1_D] = E(X 1_D) = E(X′ 1_D) = E[E(X′ | D) 1_D]

and the property follows by Lemma 4.1 (b). Conditional expectation is therefore defined on equivalence classes of r.v.’s. In particular, it is defined on L^p spaces, whose elements are equivalence classes. Proposition 4.6 (d) (Jensen), applied to the convex function x → |x|^p with p ≥ 1, gives

E[|E(X | D)|^p] ≤ E[E(|X|^p | D)] = E(|X|^p) .     (4.7)

Hence conditional expectation is a continuous linear map L^p → L^p, p ≥ 1; its norm is actually ≤ 1, i.e. it is a contraction. The image of L^p under the operator X → E(X | D) is the subspace of L^p, that we shall denote L^p(D), formed by the equivalence classes of r.v.’s that contain at least a D-measurable representative.
In particular, if p ≥ 1, X_n → X in L^p implies E(X_n | D) → E(X | D) in L^p.
If Y is an r.v. taking values in some measurable space (E, E), sometimes we shall write E(X|Y) instead of E[X|σ(Y)]. We know that all real σ(Y)-measurable r.v.’s are of the form g(Y), where g : E → R is a measurable function (this is Doob’s
criterion, Lemma 1.7). Hence there exists a measurable function g : E → R such that E(X|Y) = g(Y) a.s. Sometimes we shall denote, in a suggestive way, such a function g by

g(y) = E(X | Y = y) .

As every real σ(Y)-measurable r.v. is of the form ψ(Y) for some measurable function ψ : E → R, g must satisfy the relation

E[X ψ(Y)] = E[g(Y) ψ(Y)]     (4.8)

for every bounded measurable function ψ : E → R. If X is square integrable, by Remark 4.9 g(Y) is “the best approximation of X by a function of Y” (in the sense of L²). Compare with Example 2.24, the regression line.
The computation of a conditional expectation is an operation that we are led to perform quite often and that, sometimes, is even our goal. The next lemma can be very helpful.
Let G ⊂ F be a σ-algebra and X a G-measurable r.v. If Z is an r.v. independent of G, we know that, if X and Z are integrable, then also their product XZ is integrable and

E(XZ | G) = X E(Z | G) = X E(Z) .     (4.9)
This formula is a particular case of the following lemma.
Lemma 4.11 (The “Freezing Lemma”) Given a probability space (Ω, F, P), let
• (E, E) be a measurable space,
• G, H independent sub-σ-algebras of F,
• X a G-measurable (E, E)-valued r.v.,
• Ψ : E × Ω → R an E ⊗ H-measurable function such that ω → Ψ(X(ω), ω) is integrable.
Then

E[Ψ(X, ·) | G] = Φ(X) ,     (4.10)

where Φ(x) = E[Ψ(x, ·)].
Proof The proof uses the usual arguments of measure theory. Let us denote by V⁺ the family of E ⊗ H-measurable positive functions Ψ : E × Ω → R satisfying (4.10). It is immediate that V⁺ is stable with respect to increasing limits: if (Ψ_n)_n ⊂ V⁺ and Ψ_n ↑ Ψ as n → ∞ then

E[Ψ_n(X, ·) | G] ↑ E[Ψ(X, ·) | G] ,     E[Ψ_n(x, ·)] ↑ E[Ψ(x, ·)] ,     (4.11)

so that Ψ ∈ V⁺. Next let us denote by M the class of sets Λ ∈ E ⊗ H such that Ψ(x, ω) = 1_Λ(x, ω) belongs to V⁺. It is immediate that it is stable with respect to increasing limits, thanks to (4.11), and to relative complementation, hence it is a monotone class (Definition 1.1). M contains the rectangle sets Λ = A × Λ_1 with A ∈ E, Λ_1 ∈ H as

E[1_Λ(X, ·) | G] = E[1_A(X) 1_{Λ_1} | G] = 1_A(X) P(Λ_1)

and Φ(x) := E[1_A(x) 1_{Λ_1}] = 1_A(x) P(Λ_1). By the Monotone Class Theorem, Theorem 1.2, M contains the whole σ-algebra generated by the rectangles, i.e. all Λ ∈ E ⊗ H. By linearity, V⁺ contains all elementary E ⊗ H-measurable functions and, by Proposition 1.6, every positive E ⊗ H-measurable function. Then we just have to decompose Ψ as in the statement of the lemma into positive and negative parts.

Let us now present some applications of the freezing Lemma 4.11.

Example 4.12 Let (X_n)_n be a sequence of i.i.d. r.v.’s with P(X_n = ±1) = 1/2 and let S_n = X_1 + · · · + X_n for n ≥ 1, S_0 = 0. Let T be a geometric r.v. of parameter p, independent of (X_n)_n. How can we compute the mean, variance and characteristic function of Z = S_T?
Intuitively S_n models the evolution of a random motion (a stochastic process, as we shall see more precisely in the next chapter) where at every iteration a step to the left or to the right is made with probability 1/2; we want to find information concerning its position when it is stopped at a random time independent of the motion and geometrically distributed.
Let us first compute the mean. Let Ψ : N × Ω → Z be defined as Ψ(n, ω) = S_n(ω). We have then S_T(ω) = Ψ(T, ω) and we are in the situation of Lemma 4.11 with G = σ(T) and H = σ(X_1, X_2, . . . ). By the freezing Lemma 4.11

E(S_T) = E[Ψ(T, ·)] = E[E(Ψ(T, ·) | σ(T))] = E[Φ(T)] ,
where Φ(n) = E[Ψ(n, ·)] = E(S_n) = 0, so that E(S_T) = 0.
For the second order moment the argument is the same: let Ψ(n, ω) = S_n²(ω) so that

E(S_T²) = E[E(Ψ(T, ·) | σ(T))] = E[Φ(T)] ,

where now Φ(n) = E[Ψ(n, ·)] = E(S_n²) = n Var(X_1) = n. Hence

E(S_T²) = E(T) = 1/p .

In the same way, with Ψ(n, ω) = e^{iθ S_n(ω)},

E(e^{iθ S_T}) = E[E(e^{iθ S_T} | σ(T))] = E[Φ(T)] ,

where now

Φ(n) = E(e^{iθ S_n}) = E(e^{iθ X_1})^n = (1/2 (e^{iθ} + e^{−iθ}))^n = cos^n θ

and therefore

E(e^{iθ S_T}) = E(cos^T θ) = p Σ_{n=0}^{∞} (1 − p)^n cos^n θ = p / (1 − (1 − p) cos θ) .
This example clarifies how to use the freezing lemma, but also the method of computing a mathematical expectation by “inserting” in the computation a conditional expectation and taking advantage of the fact that the expectation of a conditional expectation is the same as taking the expectation directly (Proposition 4.5 (b)).
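The following simulation sketch (not part of the text; plain Python) checks the formulas of Example 4.12 for one value of p. It takes the geometric variable T with values 0, 1, 2, . . . and P(T = n) = p(1 − p)^n, which is the convention matching the series used above; with this convention E(T) = (1 − p)/p.

```python
# Monte Carlo check (a sketch, not from the text) of Example 4.12:
# E(S_T) = 0, E(S_T^2) = E(T), E(exp(i theta S_T)) = p / (1 - (1-p) cos theta),
# with P(T = n) = p (1-p)^n, n = 0, 1, 2, ...
import cmath
import math
import random

random.seed(2)
p, theta, n_sim = 0.3, 0.7, 200_000

samples = []
for _ in range(n_sim):
    t = 0
    while random.random() >= p:                  # failure with probability 1 - p
        t += 1
    s = sum(random.choice((-1, 1)) for _ in range(t))   # random walk run for T steps
    samples.append((t, s))

mean_S  = sum(s for _, s in samples) / n_sim
mean_S2 = sum(s * s for _, s in samples) / n_sim
mean_T  = sum(t for t, _ in samples) / n_sim
cf      = sum(cmath.exp(1j * theta * s) for _, s in samples) / n_sim

print(mean_S)                                     # close to 0
print(mean_S2, mean_T)                            # both close to (1-p)/p here
print(cf, p / (1 - (1 - p) * math.cos(theta)))    # close to each other
```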
4.3 Conditional Laws In this section we investigate conditional distributions, extending to a general space the definition that we have seen in Sect. 4.1.
Definition 4.13 Let X, Y be r.v.’s taking values in the measurable spaces (E, E) and (G, G) respectively and let us denote by μ the law of X. A family of probabilities (n(x, dy))_{x∈E} on (G, G) is a conditional law of Y given X = x if:
(a) For every A ∈ G, the map x → n(x, A) is E-measurable.
(b) For every A ∈ G and B ∈ E,

P(Y ∈ A, X ∈ B) = ∫_B n(x, A) μ(dx) .     (4.12)
Intuitively n(x, ·) is “the distribution of Y taking into account the information that X = x”. Relation (4.12) can be written

E[1_A(Y) 1_B(X)] = ∫_E 1_B(x) n(x, A) μ(dx) = ∫_E 1_B(x) μ(dx) ∫_G 1_A(y) n(x, dy) .

The usual application of Proposition 1.6, approximating f and g with linear combinations of indicator functions, implies that, if f : E → R⁺ and g : G → R⁺ are positive measurable functions, then

E[f(X) g(Y)] = ∫_E f(x) μ(dx) ∫_G g(y) n(x, dy) .     (4.13)
With the usual decomposition into the difference of positive and negative parts we obtain that (4.13) holds if f : E → R and g : G → R are measurable and bounded, or at least such that f(X)g(Y) is integrable. Note that (4.13) can also be written as

E[f(X) g(Y)] = E[f(X) h(X)] ,

where

h(x) := ∫_G g(y) n(x, dy) .

Comparing with (4.8), this means precisely that

E[g(Y) | X = x] = h(x) .     (4.14)
190
4 Conditioning
Hence if Y is a real integrable r.v. E(Y |X = x) =
y n(x, dy)
.
(4.15)
G
and we recover the relation from which we started in Sect. 4.1: the conditional expectation is the mean of the conditional law.
Remark 4.14 Let X, Y be as in Definition 4.13. Assume that the conditional law (n(x, dy))_{x∈E} of Y given X = x does not depend on x, i.e. there exists a probability ν on (G, G) such that n(x, ·) = ν for every x ∈ E. Then from (4.13), first taking f ≡ 1, we find that ν is the law of Y, and then that the joint law of X and Y is the product μ ⊗ ν, so that X and Y are independent. Note that this is consistent with intuition.
Let us now present some results which will allow us to actually compute conditional distributions. The following statement is very useful in this direction: its intuitive content is almost immediate, but a formal proof is required.
Lemma 4.15 (The Second Freezing Lemma) Let .(E, E), .(H, H) and (G, G) be measurable spaces, X, Z independent r.v.’s with values in E and H respectively and .Ψ : E × H → G. Let .Y = Ψ (X, Z). Then the conditional law of Y given .X = x is the law, .ν x , of the r.v. .Ψ (x, Z). .
Proof This is just a rewriting of the freezing Lemma 4.11. Let us denote by μ the law of X. We must prove that, for every pair of bounded measurable functions f : E → R and g : G → R,

E[f(X) g(Y)] = ∫_E f(x) dμ(x) ∫_G g(y) dν_x(y) .     (4.16)
We have

E[f(X) g(Y)] = E[f(X) g(Ψ(X, Z))] = E[E[f(X) g(Ψ(X, Z)) | X]] .

As Z is independent of X, by the freezing lemma

E[f(X) g(Ψ(X, Z)) | X] = Φ(X) ,     (4.17)
where

Φ(x) = E[f(x) g(Ψ(x, Z))] = f(x) ∫_G g(y) dν_x(y)

and, going back to (4.17), we have

E[f(X) g(Y)] = E[Φ(X)] = ∫_E f(x) dμ(x) ∫_G g(y) dν_x(y) ,
i.e. (4.16).
As mentioned above, this lemma is rather intuitive: the information X = x tells us that we can replace X by x in the relation Y = Ψ(X, Z), whereas it does not give any information on the value of Z, which is independent of X.
The next example recalls a general situation where the computation of the conditional law is easy.

Example 4.16 Let X, Y be r.v.’s with values in the measurable spaces (E, E) and (G, G) respectively. Let ρ, γ be σ-finite measures on (E, E) and (G, G) respectively and assume that the pair (X, Y) has a density h with respect to the product measure ρ ⊗ γ on (E × G, E ⊗ G). Let

h_X(x) = ∫_G h(x, y) γ(dy)

be the density of the law of X with respect to ρ and let Q = {x; h_X(x) = 0} ∈ E. Clearly the event {X ∈ Q} is negligible as P(X ∈ Q) = ∫_Q h_X(x) ρ(dx) = 0. Let

h(y; x) := h(x, y)/h_X(x)   if x ∉ Q ,     any density   if x ∈ Q ,     (4.18)

and n(x, dy) = h(y; x) dγ(y). Let us prove that n is a conditional law of Y given X = x. Indeed, for any pair f, g of real bounded measurable functions on (E, E) and (G, G) respectively,

E[f(X) g(Y)] = ∫_E f(x) dρ(x) ∫_G g(y) h(x, y) dγ(y) = ∫_E f(x) h_X(x) dρ(x) ∫_G g(y) h(y; x) dγ(y) ,
4 Conditioning
which means precisely that the conditional law of Y given .X = x is .n(x, dy) = h(y; x) dγ (y). In particular, for every bounded measurable function g, E(g(Y )|X = x) =
g(y)h(y; x) dγ (y) .
.
G
Conversely, note that, if the conditional density .h(·; x) of Y with respect to X and the density of X are known, then the joint law of .(X, Y ) has density .(x, y) → hX (x)h(y; x) with respect to .ρ ⊗ γ and the density, .hY , of Y with respect to .γ is
hY (y) =
h(x, y) dρ(x) =
.
h(y; x)hX (x) dρ(x) .
E
(4.19)
E
Example 4.17 Let us take advantage of the second freezing lemma, Lemma 4.15, in order to compute the density of a Student law .t (n). Recall √ that this is the law of an r.v. of the form .T := √X n, where X and Y are Y
independent and .N(0, 1)- and .χ 2 (n)-distributed respectively. Thanks to the√ second freezing lemma, the conditional law of T given .Y = y is the law of . √Xy n, i.e. an .N(0, yn ), so that √ y − 1 yt 2 .h(t; y) = √ e 2n . 2π n By (4.19), the density of T is hT (t) =
h(t; y)hY (y) dy
.
= =
1 2n/2 Γ ( n2 ) 1
√
2π n
√ 2n/2 Γ ( n2 ) 2π n
+∞
√
1
y y n/2−1 e−y/2 e− 2n yt dy 2
0
+∞
1
y
1 2
y 2 (n+1)−1 e− 2 (1+ n t ) dy .
0
We recognize in the last integral, but for the constant, a Gamma.(α, λ) density with .α = 12 (n + 1) and .λ = 12 (1 + n1 t 2 ), so that hT (t) =
.
Γ ( n2 + 12 ) Γ ( n2 + 12 ) 1 = · √ √ n+1 Γ ( n2 ) π n (1 + t 2 ) n+1 2n/2 Γ ( n2 ) 2π n ( 1 + 1 t 2 ) 2 2 2 2n n 1
4.3 Conditional Laws
193
The .t (n) densities have a shape similar to the Gaussian (see Figs. 4.1 and 4.2 below) but they go to 0 at infinity only polynomially fast. Also .t (1) is the Cauchy law.
Let us now tackle the question of the existence of a conditional expectation. So far we know the existence in the following situations. • When X and Y are independent: just choose .n(x, dy) = ν(dy) for every x, .ν denoting the law of Y . • When .Y = Ψ (X, Z) with Z independent of X as in the second freezing lemma, Lemma 4.15, • When the joint law of X and Y has a density with respect to a .σ -finite product measure, as in Remark 4.16.
............... .... ... ... ... ... ... . . .. .. . . . . . . . . . . . . . . . . .. . . . . . . . . ....... .. .. ..... . . . . . . . . . ...... .. .... .. . . . . . .. . . ...... ... . .. .. . . . . . . ..... .. . .. . . . . . . . . ..... .. .. ... . . . . .. . . . . ..... ... . .. .. . . . . . . . . ...... .. .... .. . . . . . .. . . ...... ... . .. .. . . . . . . . ....... . .... . .. . . . . . . . . ....... .... . .... . . . . . . . . . ........ ... . . .......... . ......... .. . . . ............. .................. . . .............. ................. ............ ...................... .... ......................... . . . . . . . . . . . . . . . . . . . . . . ......... ... ..... ................................... .......... ........ ............................ . . . . ............. . . . . . . . .....
−3
−2
−1
0
1
2
3
Fig. 4.1 Comparison between an .N (0, 1) density (dots) and a .t (1) (i.e. Cauchy) density .......... ........................................... .......... ............. ........ ............... . ........ . . . . ........ .. ..... ........ ............ . ....... ......... ...... . ....... .......... . ....... ...... . ....... ......... ....... . ....... .......... ....... . ..... ....... . . . . ....... ...... . . ........ . . . ..... ...... . . . . . .... ....... . . . . . ...... ... . . . ....... . . ... . . ....... . . . . ......... ..... . . . . . . . . .................. ...... . . . . . . . . . . . ................... ........... . . . . . . . . . .......... . . . ................... . . ................. . . . . . . . . . ....... . . ....... . . ............ . . . . . . . . ............. . . . . . ..... .......................
−3
−2
−1
0
1
2
3
Fig. 4.2 Comparison between an .N (0, 1) density (dots) and a .t (9) density. Recall that (Example 3.30), as .n → ∞, .t (n) converges weakly to .N (0, 1)
In general it can be proved that if the spaces E and G of Definition 4.13 are Polish, i.e. metric, complete and separable, conditional laws do exist and a uniqueness result holds. A proof can be found in most of the references. In particular, see Theorem 1.1.6, p. 13, in [23].
Conditional laws appear in a natural way also in the modeling of random phenomena, as often the data of the problem provide the law of some r.v. X and the conditional law of some other r.v. Y given X = x.

Example 4.18 A coin is chosen at random from a heap of possible coins and tossed n times. Let us denote by Y the number of tails obtained. Assume that it is not known whether the chosen coin is a fair one. Let us actually make the assumption that the coin gives tail with a probability p that is itself random and Beta(α, β)-distributed. What is the value of P(Y = k)? What is the law of Y? How many tails appear in n throws on average?
If we denote by X the Beta(α, β)-distributed r.v. that models the choice of the coin, the data of the problem indicate that the conditional law of Y given X = x, ν_x say, is binomial B(n, x) (the total number of throws n is fixed). That is

ν_x(k) = C(n, k) x^k (1 − x)^{n−k} ,   k = 0, . . . , n .

Denoting by μ the Beta distribution of X, (4.12) here becomes, again for k = 0, 1, . . . , n, and B = [0, 1],

P(Y = k) = ∫ ν_x(k) μ(dx)
= Γ(α+β)/(Γ(α)Γ(β)) · C(n, k) ∫_0^1 x^{α−1} (1 − x)^{β−1} x^k (1 − x)^{n−k} dx
= Γ(α+β)/(Γ(α)Γ(β)) · C(n, k) ∫_0^1 x^{α+k−1} (1 − x)^{β+n−k−1} dx     (4.20)
= C(n, k) Γ(α+β) Γ(α+k) Γ(n+β−k) / (Γ(α) Γ(β) Γ(α+β+n)) .

This discrete probability law is known as Skellam’s binomial. We shall see an application of it in the forthcoming Example 4.19.
In order to compute the mean of this law, instead of using the definition E(Y) = Σ_{k=0}^n k P(Y = k), which leads to unwieldy computations, recall that the mean is also the expectation of the conditional expectation (Proposition 4.5 (b)), i.e.

E(Y) = E[E(Y | X)] .
Now E(Y | X = x) = nx, as the conditional law of Y given X = x is B(n, x) and, recalling that the mean of a Beta(α, β)-distributed r.v. is α/(α + β),

E(Y) = E(nX) = nα / (α + β) .
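A short numerical sketch (not part of the text; plain Python) evaluating formula (4.20) and checking that it defines a probability law with mean nα/(α + β). The number of tosses n = 12 is an assumption made here for illustration; α and β are the values used for Geissler’s data in the next example.

```python
# Sketch (not from the text): evaluate Skellam's binomial (4.20).
from math import comb, exp, lgamma

def skellam_binomial(n, alpha, beta):
    def log_B(a, b):                     # log of the Beta function B(a, b)
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    return [comb(n, k) * exp(log_B(alpha + k, beta + n - k) - log_B(alpha, beta))
            for k in range(n + 1)]

n, alpha, beta = 12, 34.13, 31.61        # n = 12 assumed; alpha, beta as in Example 4.19
p = skellam_binomial(n, alpha, beta)

print(sum(p))                            # 1.0 up to rounding
print(sum(k * pk for k, pk in enumerate(p)), n * alpha / (alpha + beta))  # equal means
```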
Example 4.19 Let us go back to Geissler’s data. In Example 3.44 we have seen that a binomial model is not able to explain them. Might Skellam’s model above be a more fitting alternative? This would mean, intuitively, that every family has its own “propensity” to a male offspring which follows a Beta distribution.
Let us try to fit the data with a Skellam binomial. Now we play with two parameters, i.e. α and β. For instance, with the choice of α = 34.13 and β = 31.61 let us compare the observed values q̄_k with those, r_k, of the Skellam binomial with the parameters above (q_k are the values produced by the “old” binomial model):

  k     q_k        r_k        q̄_k
  1     0.003174   0.004074   0.004415
  2     0.016113   0.017137   0.017007
  3     0.053711   0.050832   0.046770
  4     0.120850   0.107230   0.109567
  5     0.193359   0.169463   0.168929
  6     0.225586   0.205732   0.219624
  7     0.193359   0.193329   0.181848
  8     0.120850   0.139584   0.135568
  9     0.053711   0.075529   0.078168
 10     0.016113   0.029081   0.029599
 11     0.003174   0.008008   0.008504
The value of Pearson’s T statistics now is .T = 13.9 so that the Skellam model gives a much better approximation. However Pearson’s Theorem cannot be applied here, at least in the form of Theorem 3.42, as the parameters .α and .β above were estimated from the data. How the values .α and .β above were estimated from the data and how the statement of Pearson’s theorem should be modified in this situation is left to a more advanced course in statistics.
4.4 The Conditional Laws of Gaussian Vectors In this section we investigate conditional laws (and therefore also conditional expectations) when the r.v. Y (whose conditional law we want to compute) and X (the conditioning r.v.) are jointly Gaussian. It is possible to take advantage of the method of Example 4.16, taking the quotient between the joint density and the other marginal, but now we shall see a much quicker and efficient method. Moreover, let us not forget that for a Gaussian vector existence of the joint density is not guaranteed. Let .Y, X be Gaussian vectors .Rm - and .Rd -valued respectively. Assume that their joint law on the product space .(Rm+d , B(Rm+d )) is Gaussian of mean and covariance matrix respectively
( b_Y )          ( C_Y    C_YX )
( b_X )   and    ( C_XY   C_X  ) ,
where C_Y and C_X are the covariance matrices of Y and X respectively and C_YX = E[(Y − E(Y))(X − E(X))*] = (C_XY)* is the m × d matrix of the covariances of the components of Y and those of X; let us assume moreover that C_X is strictly positive definite (and therefore invertible).
Let us first look for an m × d matrix A such that the r.v.’s Y − AX and X are independent. Let Z = Y − AX. The pair (Y, X) is Gaussian as well as (Z, X), which is a linear function of the former. Hence, as seen in Sect. 2.8, p. 90, independence of Z and X follows as soon as Cov(Z_i, X_j) = 0 for every i = 1, . . . , m, j = 1, . . . , d.
First, to simplify the notation, let us assume that the means b_Y and b_X vanish. The condition of absence of correlation between the components of Z and those of X can then be written

0 = E(Z X*) = E[(Y − AX) X*] = E(Y X*) − A E(X X*) = C_YX − A C_X .

Hence A = C_YX C_X⁻¹. Without the assumption that the means vanish, just make the same computation with Y and X replaced by Y − b_Y and X − b_X. We can write now

Y = AX + (Y − AX) ,

where the r.v.’s Y − AX and X are independent. Hence by Lemma 4.15 (the second freezing lemma) the conditional law of Y given X = x is the law of Ax + Y − AX. As Y − AX is Gaussian, the law of Ax + Y − AX is determined by its mean

Ax + b_Y − A b_X = b_Y − C_YX C_X⁻¹ (b_X − x)     (4.21)
197
and its covariance matrix CY −AX = E (Y − bY − A(X − bX ))(Y − bY − A(X − bX ))∗ E (Y − bY )(Y − bY )∗ − E (Y − bY )(X − bX )∗ A∗ −E A(X − bX )(Y − bY )∗ + E A(X − bX )(X − bX )∗ A∗ .
= CY − CY X A∗ − ACXY + ACX A∗ −1 ∗ −1 ∗ −1 −1 ∗ = CY − CY X CX CY X − CY X CX C Y X + CY X CX CX CX CY X −1 ∗ = CY − CY X CX CY X ,
(4.22) where we have taken advantage of the fact that .CX is symmetric and of the relation CXY = CY∗ X . In particular, from (4.21) we obtain the conditional expectation
.
−1 E(Y |X = x) = bY − CY X CX (bX − x) .
.
(4.23)
When both Y and X are real r.v.’s, (4.23) and (4.22) give for the values of the mean and the variance of the conditional distribution, respectively

b_Y − Cov(Y, X)/Var(X) · (b_X − x) ,     (4.24)

which is equal to E(Y | X = x), and

Var(Y) − Cov(Y, X)² / Var(X) .     (4.25)
Note that the variance of the conditional law is always smaller than the variance of Y , which is a general fact already noted in Remark 4.9. Let us point out some important features.
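The following sketch (not part of the text; it assumes NumPy is available) applies formulas (4.21) and (4.22) to a concrete three-dimensional Gaussian vector, with Y one-dimensional and X two-dimensional, and checks the conditional mean against a crude Monte Carlo estimate.

```python
# Sketch (not from the text): conditional law of Y given X = x for a jointly
# Gaussian pair, using A = C_YX C_X^{-1} as in (4.21)-(4.22).
import numpy as np

rng = np.random.default_rng(3)

b_Y, b_X = np.array([1.0]), np.array([0.0, 2.0])
C = np.array([[2.0, 0.8, 0.5],          # covariance of the full vector (Y, X1, X2)
              [0.8, 1.0, 0.3],
              [0.5, 0.3, 1.5]])
C_Y, C_YX, C_X = C[:1, :1], C[:1, 1:], C[1:, 1:]

A = C_YX @ np.linalg.inv(C_X)
cond_cov = C_Y - A @ C_YX.T             # (4.22): does not depend on x

x = np.array([0.5, 1.0])
cond_mean = b_Y - A @ (b_X - x)         # (4.21) / (4.23)

# Monte Carlo check: average Y over the samples whose X falls near x
sample = rng.multivariate_normal(np.concatenate([b_Y, b_X]), C, size=2_000_000)
near = np.all(np.abs(sample[:, 1:] - x) < 0.05, axis=1)
print(cond_mean, sample[near, 0].mean())   # close to each other
print(cond_cov)                            # conditional variance of Y
```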
Remark 4.20 (a) The conditional laws of a Gaussian vector are also Gaussian. (b) If Y and X are jointly Gaussian, the conditional expectation of Y given X is an affine-linear function of X and (therefore) coincides with the regression line. Recall (Remark 2.24) that the conditional expectation is the best approximation in .L2 of Y by a function of X whereas the regression line provides the best approximation of Y by an affine-linear function of X. (c) Only the mean of the conditional law depends on the value of the conditioning variable X. The covariance matrix of the conditional law does not depend on the value of X.
Exercises

4.1 (p. 342) Let X, Y be i.i.d. r.v.’s with a B(1, p) law, i.e. Bernoulli with parameter p, and let Z = 1_{X+Y=0}, G = σ(Z).
(a) What are the events of the σ-algebra G?
(b) Compute E(X | G) and E(Y | G) and determine their law. Are these r.v.’s also independent?

4.2 (p. 342) Let (Ω, F, P) be a probability space and G ⊂ F a sub-σ-algebra.
(a) Let A ∈ F and B = {E(1_A | G) = 0}. Show that B ⊂ A^c a.s.
(b) Let X be a positive r.v. Prove that

{E(X | G) = 0} ⊂ {X = 0}   a.s. ,
i.e. the zeros of a positive r.v. shrink under conditioning. 4.3 (p. 343) Let X be a real integrable r.v. on a probability space .(Ω, F, P) and G ⊂ F a sub-.σ -algebra. Let . D ⊂ F be another .σ -algebra independent of X and independent of . G.
(a) Is it true that

E(X | G ∨ D) = E(X | G) ?     (4.26)
(b) Prove that if D is independent of σ(X) ∨ G, then (4.26) holds.
• Recall Remarks 2.12 and 4.3.

4.4 (p. 343) Let (Ω, F, P) be a probability space and G ⊂ F a sub-σ-algebra. A non-empty event A ∈ G is an atom of G if there is no event of G which is strictly contained in A save ∅. Let E be a Hausdorff topological space and X an E-valued r.v.
(a) Prove that {X = x} is an atom of σ(X).
(b) Prove that if X is G-measurable then it is constant on the atoms of G.
(c) Let Z be a real r.v. Prove that if P(X = x) > 0 then E(Z | X) is constant on {X = x} and on this event takes the value

1/P(X = x) ∫_{X=x} Z dP .     (4.27)
• Recall that in a Hausdorff topological space the sets formed by a single point are closed, hence Borel sets.
4.5 (p. 344) (a) Let X, Y be r.v.’s with values in a measurable space (E, E) and Z another r.v. taking values in some other measurable space. Assume that the pairs (X, Z) and (Y, Z) have the same law (in particular X and Y have the same law). Prove that, if h : E → R is a measurable function such that h(X) is integrable, then

E[h(X) | Z] = E[h(Y) | Z]   a.s.
(b) Let T_1, . . . , T_n be real i.i.d. integrable r.v.’s and T = T_1 + · · · + T_n.
(b1) Prove that the pairs (T_1, T), (T_2, T), . . . , (T_n, T) have the same law.
(b2) Prove that

E(T_1 | T) = T/n .
4.6 (p. 344) Let X, Y be independent r.v.’s both with a Laplace distribution of parameter 1.
(a) Prove that X and XY have the same joint distribution as −X and XY.
(b1) Compute E(X | XY = z).
(b2) What if X and Y were both N(0, 1)-distributed instead?
(b3) And with a Cauchy distribution?
4.7 (p. 345) Let X be an m-dimensional r.v. having density f with respect to the Lebesgue measure of .Rm of the form .f (x) = g(|x|), where .g : R+ → R+ . (a) Prove that the real r.v. .|X| has a density with respect to the Lebesgue measure and compute it. (b) Let .ψ : Rm → R be a bounded measurable function. Compute .E ψ(X) |X| . 4.8 (p. 346) (Conditional expectations under a change of probability) Let Z be a positive r.v. defined on the probability space .(Ω, F, P) and . G ⊂ F a sub-.σ -algebra. Recall (Exercise 4.2) that .{Z = 0} ⊃ {E(Z | G) = 0} a.s. (a) Note that .Z1{E(Z | G)>0} = Z a.s. and deduce that, for every r.v. Y such that Y Z is integrable, we have E(ZY | G) = E(ZY | G)1{E(Z | G)>0}
.
a.s.
(4.28)
(b1) Assume moreover that .E(Z) = 1. Let .Q be the probability on .(Ω, F) having density Z with respect to .P and let us denote by .EQ the mathematical expectation with respect to .Q. Prove that .E(Z | G) > 0 .Q-a.s. (.E still denotes the expectation with respect to .P). (b2) Prove that if Y is integrable with respect to .Q, then EQ (Y | G) =
.
E(Y Z | G) E(Z | G)
Q-a.s.
(4.29)
200
4 Conditioning
• Note that if the density Z is itself . G-measurable, then .EQ (Y | G) = E(Y | G) .Q-a.s. 4.9 (p. 347) Let T be an r.v. having density, with respect to the Lebesgue measure, given by f (t) = 2t,
0≤t ≤1
.
and .f (t) = 0 for .t ∈ [0, 1]. Let Z be an .N(0, 1)-distributed r.v. independent of T . (a) Compute the Laplace transform and characteristic function of .X = ZT . What are the convergence abscissas? (b) Compute the mean and variance of X. (c) Prove that for every .R > 0 there exists a constant .cR such that P(|X| ≥ x) ≤ cR e−Rx .
.
4.10 (p. 348) (A useful independence criterion) Let X be an m-dimensional r.v. on the probability space .(Ω, F, P) and . G ⊂ F a sub-.σ -algebra. (a) Prove that if X is independent of . G, then E(eiθ,X | G) = E(eiθ,X )
for every θ ∈ Rm .
.
(4.30)
(b) Assume that (4.30) holds. (b1) Let Y be a real . G-measurable r.v. and compute the characteristic function of .(X, Y ). (b2) Prove that if (4.30) holds, then X is independent of . G. 4.11 (p. 348) Let X, Y be independent r.v.’s Gamma.(1, λ)- and .N(0, 1)-distributed respectively. √ (a) Compute the characteristic function of .Z = X Y . (b) Compute the characteristic function of an r.v. W having a Laplace law of parameter .α, i.e. having density with respect to the Lebesgue measure f (x) =
.
α −α|x| e . 2
(c) Prove that Z has a density with respect to the Lebesgue measure and compute it. 4.12 (p. 349) Let X, Y be independent .N(0, 1)-distributed r.v.’s and let, for .λ ∈ R, 1
Z = e− 2 λ
.
(a) Prove that .E(Z) = 1.
2 Y 2 +λXY
.
Exercises
201
(b) Let .Q be the probability on .(Ω, F) having density Z with respect to .P. What is the law of X with respect to .Q? 4.13 (p. 349) Let X and Y be independent .N(0, 1)-distributed r.v.’s. (a) Compute, for .t ∈ R, the Laplace transform L(t) := E(etXY ) .
.
√ (b) Let .|t| < 1 and let .Q be the new probability .dQ = 1 − t 2 etXY dP. Determine the joint law of X and Y under .Q. Compute .VarQ (X) and .CovQ (X, Y ). 4.14 (p. 350) Let .(Xn )n be a sequence of independent .Rd -valued r.v.’s, defined on the same probability space. Let .S0 = 0, .Sn = X1 +· · ·+Xn and . Fn = σ (Sk , k ≤ n). Show that, for every bounded Borel function .f : Rd → R, E f (Sn+1 )| Fn = E f (Sn+1 )|Sn
.
(4.31)
and express this quantity in terms of the law .μn of .Xn . • This exercise proves rigorously a rather intuitive feature: as .Sn+1 = Xn+1 + Sn and .Xn+1 is independent of .X1 , . . . , Xn hence also of .S1 , . . . , Sn , in order to foresee the value of .Sn+1 , once the value of .Sn is known, the additional knowledge of .S1 , . . . , Sn does not provide any additional information. In the world of stochastic processes (4.31) means that .(Sn )n enjoys the Markov property. 4.15 (p. 350) Compute the mean and variance of a Student .t (n) law. 4.16 (p. 351) Let .X, Y, Z be independent r.v.’s with .X, Y ∼ N(0, 1), .Z ∼ Beta.(α, β). √ (a) Let .W = ZX + 1 − Z 2 Y . What is the conditional law of W given .Z = z? What is the law of W ? (b) Are W and Z independent? 4.17 (p. 351) (Multivariate Student t’s) A multivariate (centered) .t (n, d, C) distribution is the law of the r.v. .
X √ n, √ Y
where X and Y are independent, .Y ∼ χ 2 (n) and X is d-dimensional .N(0, C)distributed with a covariance matrix C that is assumed to be invertible. Prove that a .t (n, d, C) law has a density with respect to the Lebesgue measure and compute it. Try to reproduce the argument of Example 4.17.
202
4 Conditioning
4.18 (p. 352) Let X, Y be .N(0, 1)-distributed r.v.’s. and W another real r.v. Let us assume that .X, Y, Z are independent and let X + YW Z=√ · 1 + W2
.
(a) What is the conditional law of Z given .W = w? (b) What is the law of Z? 4.19 (p. 352) A family .{X1 , . . . , Xn } of r.v.’s, defined on the same probability space .(Ω, F, P) and taking values in the measurable space .(E, E), is said to be exchangeable if and only if the law of .X = (X1 , . . . , Xn ) is the same as the law of .Xσ = (Xσ1 , . . . , Xσn ), where .σ = (σ1 , . . . , σn ) is any permutation of .(1, . . . , n). (a) Prove that if .{X1 , . . . , Xn } is exchangeable then the r.v.’s .X1 , . . . , Xn have the same law; and also that the law of .(Xi , Xj ) does not depend on .i, j , .i = j . (b) Prove that if .X1 , . . . , Xn are i.i.d. then they are exchangeable. (c) Assume that .X1 , . . . , Xn are real-valued and that their joint distribution has a density with respect to the Lebesgue measure of .Rn of the form f (x) = g(|x|)
.
(4.32)
for some measurable function .g : R+ → R+ . Then .{X1 , . . . , Xn } is exchangeable. (d1) Assume that there exists an r.v. Y defined on .(Ω, F, P) and taking values in some measurable space .(G, G) such that the r.v.’s .X1 , . . . , Xn are conditionally independent and identically distributed given .Y = y, i.e. such that the conditional law of .(X1 , . . . , Xn ) given .Y = y is a product .μy ⊗ · · · ⊗ μy . Prove that the family .{X1 , . . . , Xn } is exchangeable. (d2) Let .X = (X1 , . . . , Xd ) be a multidimensional Student .t (n, d, I )-distributed r.v. (see Exercise 4.17), with .I = the identity matrix. Prove that .{X1 , . . . , Xd } is exchangeable. 4.20 (p. 353) Let T , W be exponential r.v.’s of parameters respectively .λ and .μ. Let S = T + W.
.
(a) What is the law of S? What is the joint law of T and S? (b) Compute .E(T |S). Recalling the meaning of the conditional expectation as the best approximation in .L2 of T given S, compare with the result of Exercise 2.30 where we computed the regression line, i.e. the best approximation of T by an affine-linear function of S.
Exercises
203
4.21 (p. 354) Let .X, Y be r.v.’s having joint density with respect to the Lebesgue measure f (x, y) = λ2 xe−λx(y+1)
.
x > 0, y > 0
and .f (x, y) = 0 otherwise. (a) Compute the densities of X and of Y . (b) Are the r.v.’s .U = X and .V = XY independent? What is the density of XY ? (c) Compute the of X given .Y = y and the squared .L2 conditional expectation 2 distance .E (X − E(X|Y )) . Recall (4.6). 4.22 (p. 356) Let X, Y be independent r.v.’s Gamma.(α, λ)- and Gamma.(β, λ)-distributed respectively. (a) (b) (c) (d)
What is the density of .X + Y ? What is the joint density of X and .X + Y ? What is the conditional density, .g(·; z), of X given .X + Y = z? Compute .E(X|X + Y = z) and the regression line of X with respect to .X + Y .
4.23 (p. 357) Let X, Y be real r.v.’s with joint density f (x, y) =
.
1 1 2 2 (x exp − − 2rxy + y ) √ 2(1 − r 2 ) 2π 1 − r 2
where .−1 < r < 1. (a) Determine the marginal densities of X and Y . (b) Compute .E(X|Y ) and .E(X|X + Y ). 4.24 (p. 358) Let X be an .N(0, 1)-distributed r.v. and Y another real r.v. In which of the following situations is the pair .(X, Y ) Gaussian? (a) The conditional law of Y given .X = x is an .N( 12 x, 1) distribution. (b) The conditional law of Y given .X = x is an .N( 12 x 2 , 1) distribution. (c) The conditional law of Y given .X = x is an .N(0, 14 x 2 ) distribution.
Chapter 5
Martingales
5.1 Stochastic Processes A stochastic process is a mathematical object that is intended to model a quantity that performs a random motion. It will therefore be something like .(Xn )n , where the r.v.’s .Xn are defined on some probability space .(Ω, F, P) and take their values on the same measurable space .(E, E). Here n is to be seen as a time. It is also possible to consider families .(Xt )t with .t ∈ R+ , i.e. in continuous time, but we shall only deal with discrete time models.
A filtration is an increasing family .( Fn )n of sub-.σ -algebras of . F. A process (Xn )n is said to be adapted to the filtration .( Fn )n if, for every n, .Xn is . Fn measurable.
.
Given a process .(Xn )n , we can always consider the filtration .( Gn )n defined as . Gn = σ (X1 , . . . , Xn ). This is the natural filtration of the process and, of course, is the smallest filtration with respect to which the process is adapted. Just a moment for intuition: .X1 , . . . , Xn are the positions of the process before (.≤) time n and therefore are quantities that are known to an observer at time n. The .σ -algebra . Fn represents the family of events for which, at time n, it is known whether they have taken place or not.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Baldi, Probability, Universitext, https://doi.org/10.1007/978-3-031-38492-9_5
205
206
5 Martingales
5.2 Martingales: Definitions and General Facts Let .(Ω, F, P) be a probability space and .( Fn )n a filtration on it.
Definition 5.1 A martingale (resp. a supermartingale, a submartingale) of the filtration .( Fn )n is a process .(Mn )n adapted to .( Fn )n , such that .Mn is integrable for every n and, for every .n ≥ m, E(Mn | Fm ) = Mm
.
(resp. ≤ Mm , ≥ Mm ) .
(5.1)
Martingales are an important tool in probability, appearing in many contexts of the theory. For more information on this subject, in addition to almost all references mentioned at the beginning of Chap. 1, see also [1], [21]. Of course (5.1) is equivalent to requiring that, for every n, E(Mn | Fn−1 ) = Mn−1 ,
.
(5.2)
as this relation entails, for .m < n, E(Mn | Fm ) = E E(Mn | Fn−1 )| Fm = E(Mn−1 | Fm ) = · · · = E(Mm+1 | Fm )=Mm .
.
It is sometimes important to specify with respect to which filtration .(Mn )n is a martingale. Note for now that if .(Mn )n is a martingale with respect to .( Fn )n , then it is a martingale with respect to every smaller filtration (provided it contains the natural filtration). Indeed if .( Fn )n is another filtration to which the martingale is adapted and with . Fn ⊂ Fn , then E(Mn | Fm ) = E E(Mn | Fm )| Fm = E(Mm | Fm ) = Mm .
.
Of course a similar property holds for super- and submartingales. When the filtration is not specified we shall understand it to be the natural filtration. The following example presents three typical situations giving rise to martingales.
Example 5.2 (a) Let .(Zk )k be a sequence of real centered independent r.v.’s and let .Xn = Z1 + · · · + Zn . Then .(Xn )n is a martingale.
5.2 Martingales: Definitions and General Facts
207
Indeed we have .Xn = Xn−1 + Zn and, as .Zn is independent of X1 , . . . , Xn−1 and therefore of . Fn−1 = σ (X1 , . . . , Xn−1 ),
.
E(Xn | Fn−1 ) = E(Xn−1 | Fn−1 ) + E(Zn | Fn−1 ) = Xn−1 + E(Zn ) = Xn−1 .
.
(b) Let .(Uk )k be a sequence of real independent r.v.’s such that .E(Uk ) = 1 for every k and let .Yn = U1 · · · Un . Then .(Yn )n is a martingale: with an idea similar to (a) E(Yn | Fn−1 ) = E(Un Yn−1 | Fn−1 ) = Yn−1 E(Un ) = Yn−1 .
.
A particularly important instance of these martingales appears when the .Un are positive. (c) Let X be an integrable r.v. and .( Fn )n a filtration, then .Xn = E(X| Fn ) is a martingale. Indeed, if .n > m, (Proposition 4.5 (c)) E(Xn | Fm ) = E E(X| Fn )| Fm = E(X| Fm ) = Xm .
.
We shall see that these martingales may have very different behaviors.
It is clear that linear combinations of martingales are also martingales and linear combinations with positive coefficients of supermartingales (resp. submartingales) are again supermartingales (resp. submartingales). If .(Mn )n is a supermartingale, .(−Mn )n is a submartingale and conversely. Moreover, if M is a martingale (resp. a submartingale) and .Φ : R → R is a l.s.c. convex function (resp. convex and increasing) such that .Φ(Mn ) is also integrable, then .(Φ(Mn ))n is a submartingale with respect to the same filtration. This is a consequence of Jensen’s inequality, Proposition 4.6 (d): if M is a martingale we have for .n ≥ m E Φ(Mn )| Fm ≥ Φ E[Mn | Fm ] = Φ(Mm ) .
.
In particular, if .(Mn )n is a martingale then .(|Mn |)n is a submartingale. We say that .(Mn )n is a martingale (resp. supermartingale, submartingale) of .Lp , p .p ≥ 1, if .Mn ∈ L for every n and we shall speak of square integrable martingales (resp. supermartingales, submartingales) for .p = 2. If .(Mn )n is a martingale of .Lp , p .p ≥ 1, then .(|Mn | )n is a submartingale. Beware of a possible mistake: it is not granted that if M is a submartingale the same is true of .(|Mn |)n or .(Mn2 )n (even if M is square integrable): the functions 2 .x → |x| and .x → x are indeed convex but not increasing. This statement becomes true if we add the assumption that M is positive: the functions .x → |x| and .x → x 2 are increasing when restricted to .R+ .
208
5 Martingales
5.3 Doob’s Decomposition
A process (An )n is said to be a predictable increasing process for the filtration ( Fn )n if • A0 = 0, • for every n, An ≤ An+1 a.s., • for every n, An+1 is Fn -measurable.
Let .(Xn )n be an .( Fn )n -submartingale and recursively define A0 = 0,
.
An+1 = An + E(Xn+1 | Fn ) − Xn .
(5.3)
By construction .(An )n is a predictable increasing process. Actually by induction An+1 is . Fn -measurable and, as X is a submartingale, .An+1 − An = E(Xn+1 | Fn ) − Xn ≥ 0. If .Mn = Xn − An then
.
E(Mn+1 | Fn ) = E(Xn+1 | Fn ) − An+1 = Xn − An = Mn
.
(we use the fact that .An+1 is . Fn -measurable). Hence .(Mn )n is a martingale. Such a decomposition is unique: if .Xn = Mn + An is another decomposition of .(Xn )n into the sum of a martingale .M and of a predictable increasing process .A , then .A0 = A0 = 0 and An+1 − An = Xn+1 − Xn − (Mn+1 − Mn ).
.
Conditioning with respect to . Fn , we obtain .An+1 − An = E(Xn+1 | Fn ) − Xn = An+1 − An ; hence .An = An and .Mn = Mn . We have thus obtained that
every submartingale .(Xn )n can be decomposed uniquely into the sum of a martingale .(Mn )n and of a predictable increasing process .(An )n .
This is Doob’s decomposition. The process A is the compensator of .(Xn )n . If .(Mn )n is a square integrable martingale, then .(Mn2 )n is a submartingale. Its compensator is the associated increasing process to the martingale .(Mn )n .
5.4 Stopping Times
209
5.4 Stopping Times When dealing with stochastic processes an important technique consists in the investigation of its value when stopped at some random time. This section introduces the right notion in this direction. Let .(Ω, F, P) be a probability space and .( Fn )n a filtration on it. Let . F∞ = σ ( n≥0 Fn ).
Definition 5.3 (a) A stopping time of the filtration .( Fn )n is a map .τ : Ω → N ∪ {+∞} (the value .+∞ is allowed) such that, for every .n ≥ 0, {τ ≤ n} ∈ Fn .
.
(b) Let .
.
Fτ = { A ∈ F∞ ; for every n ≥ 0, A ∩ {τ ≤ n} ∈ Fn } .
Fτ is the .σ -algebra of events prior to time .τ .
Remark 5.4 In (a) and (b), the conditions .{τ ≤ n} ∈ Fn and .A ∩ {τ ≤ n} ∈ Fn are equivalent to requiring that .{τ = n} ∈ Fn and .A ∩ {τ = n} ∈ Fn , respectively, as {τ ≤ n} =
n
.
{τ = k},
{τ = n} = {τ ≤ n} \ {τ ≤ n − 1}
k=0
so that if, for instance, .{τ = n} ∈ Fn for every n then also .{τ ≤ n} ∈ Fn and conversely.
Remark 5.5 Note that a deterministic time .τ ≡ m is a stopping time. Indeed {τ ≤ n} =
.
∅
if n < m
Ω
if n ≥ m ,
210
5 Martingales
and in any case .{τ ≤ n} ∈ Fn . Not unexpectedly in this case . Fτ = Fm : if A ∈ F∞ then
.
A ∩ {τ ≤ n} =
.
∅
if n < m
A
if n ≥ m .
Therefore .A ∩ {τ ≤ n} ∈ Fn for every n if and only if .A ∈ Fm .
A stopping time is a random time at which a process adapted to the filtration ( Fn )n is observed or modified. Recalling that intuitively . Fn is the .σ -algebra of events that are known at time n, the condition .{τ ≤ n} ∈ Fn imposes the condition that at time n it is known whether .τ has already happened or not. A typical example is the first time at which the process takes some values, as in the following example.
.
Example 5.6 (Passage Times) Let X be a stochastic process with values in (E, E) adapted to the filtration .( Fn )n . Let, for .A ∈ E,
.
τA (ω) = inf{n; Xn (ω) ∈ A} ,
(5.4)
.
with the understanding .inf ∅ = +∞. Then .τA is a stopping time as {τA = n} = {X0 ∈ / A, X1 ∈ / A, . . . , Xn−1 ∈ / A, Xn ∈ A} ∈ Fn .
.
τA is the passage time at A, i.e. the first time at which X visits the set A. Conversely, let
.
ρA (ω) = sup{n; Xn (ω) ∈ A}
.
i.e. the last time at which the process visits the set A. This is not in general a stopping time: in order to know whether .ρA ≤ n you need to know the positions of the process at times after time n.
The following proposition states some important properties of stopping times. They are immediate consequences of the definitions: we advise the reader to try to work out the proofs (without looking at them beforehand. . . ) as an exercise.
Proposition 5.7 Let .( Fn )n be a filtration and .τ1 , τ2 stopping times of this filtration. Then the following properties hold. (continued)
5.4 Stopping Times
211
Proposition 5.7 (continued) (a) .τ1 + τ2 , .τ1 ∨ τ2 , .τ1 ∧ τ2 are stopping times with respect to the same filtration. (b) If .τ1 ≤ τ2 , then . Fτ1 ⊂ Fτ2 . (c) . Fτ1 ∧τ2 = Fτ1 ∩ Fτ2 . (d) Both events .{τ1 < τ2 } and .{τ1 = τ2 } belong to . Fτ1 ∩ Fτ2 .
Proof (a) The statement follows from the relations {τ1 + τ2 ≤ n} =
n
.
{τ1 = k, τ2 ≤ n − k} ∈ Fn ,
k=0
{τ1 ∧ τ2 ≤ n} = {τ1 ≤ n} ∪ {τ2 ≤ n} ∈ Fn , {τ1 ∨ τ2 ≤ n} = {τ1 ≤ n} ∩ {τ2 ≤ n} ∈ Fn . (b) Let .A ∈ Fτ1 , i.e. such that .A ∩ {τ1 ≤ n} ∈ Fn for every n; we must prove that also .A ∩ {τ2 ≤ n} ∈ Fn for every n. As .τ1 ≤ τ2 , we have .{τ2 ≤ n} ⊂ {τ1 ≤ n} and therefore A ∩ {τ2 ≤ n} = A ∩ {τ1 ≤ n} ∩{τ2 ≤ n} ∈ Fn .
.
∈ Fn
(c) Thanks to (b) . Fτ1 ∧τ2 ⊂ Fτ1 and . Fτ1 ∧τ2 ⊂ Fτ2 , hence . Fτ1 ∧τ2 ⊂ Fτ1 ∩ Fτ2 . Conversely, let .A ∈ Fτ1 ∩ Fτ2 . Then, for every n, we have .A ∩ {τ1 ≤ n} ∈ Fn and .A ∩ {τ2 ≤ n} ∈ Fn . Taking the union we find that A∩{τ1 ≤ n} ∪ A∩{τ2 ≤ n} = A∩ {τ1 ≤ n}∪{τ2 ≤ n} = A∩{τ1 ∧τ2 ≤ n}
.
so that .A∩{τ1 ∧τ2 ≤ n} ∈ Fn , hence the opposite inclusion . Fτ1 ∧τ2 ⊃ Fτ1 ∩ Fτ2 . (d) Let us prove that .{τ1 < τ2 } ∈ Fτ1 : we must show that .{τ1 < τ2 }∩{τ1 ≤ n} ∈ Fn . We have {τ1 < τ2 } ∩ {τ1 ≤ n} =
n
.
k=0
{τ1 = k} ∩ {τ2 > k} .
212
5 Martingales
This event belongs to . Fn , as .{τ2 > k} = {τ2 ≤ k}c ∈ Fk ⊂ Fn . Therefore .{τ1 < τ2 } ∈ Fτ1 . Similarly {τ1 < τ2 } ∩ {τ2 ≤ n} =
.
n {τ2 = k} ∩ {τ1 < k} k=0
and again we find that .{τ1 < τ2 } ∩ {τ2 ≤ n} ∈ Fn . Therefore .{τ1 < τ2 } ∈ Fτ1 ∩ Fτ2 . Finally note that {τ1 = τ2 } = {τ1 < τ2 }c ∩ {τ2 < τ1 }c ∈ Fτ1 ∩ Fτ2 .
.
For a given filtration .( Fn )n let X be an adapted process and .τ a finite stopping time. Then we can define its position at time .τ : Xτ = Xn
.
on {τ = n}, n ∈ N .
Note that .Xτ is . Fτ -measurable as {Xτ ∈ A} ∩ {τ = n} = {Xn ∈ A} ∩ {τ = n} ∈ Fn .
.
A typical operation that is applied to a process is stopping: if (Xn )n is adapted to the filtration ( Fn )n and τ is a stopping time for this filtration, the process stopped at time τ is (Xτ∧n )n , i.e. a process that moves as (Xn )n up to time τ and then stays fixed at the position Xτ (at least if τ < +∞). Also the stopped process is adapted to the filtration ( Fn )n , as τ ∧ n is a stopping time which is ≤ n so that Xτ∧n is Fτ∧n -measurable and by Proposition 5.7 (b) Fτ∧n ⊂ Fn . The following remark states that a stopped martingale is also a martingale.

Remark 5.8 If (Xn )n is an ( Fn )n -martingale (resp. supermartingale, submartingale), the same is true for the stopped process Xnτ = Xn∧τ , where τ is a stopping time of the filtration ( Fn )n . Actually X(n+1)∧τ = Xn∧τ on {τ ≤ n} and therefore

E( X(n+1)∧τ − Xn∧τ | Fn ) = E[ (Xn+1 − Xn )1{τ ≥n+1} | Fn ] .

By the definition of stopping time {τ ≥ n + 1} = {τ ≤ n}^c ∈ Fn and therefore

E( X(n+1)∧τ − Xn∧τ | Fn ) = 1{τ ≥n+1} E(Xn+1 − Xn | Fn ) = 0 .    (5.5)

Note that the stopped process is a martingale with respect to the same filtration, ( Fn )n , which may be larger than the natural filtration of the stopped process.
Remark 5.9 Let M = (Mn )n be a square integrable martingale with respect to the filtration ( Fn )n , τ a stopping time for this filtration and M^τ the stopped martingale, which is of course also square integrable as |Mn∧τ | ≤ Σ_{k=0}^{n} |Mk |. Let A be the associated increasing process of M. Is the associated increasing process of M^τ simply (An∧τ )n ?
Note first that (An∧τ )n is an increasing predictable process. Actually it is obviously increasing and, as

A(n+1)∧τ = Σ_{k=1}^{n} Ak 1{τ =k} + An+1 1{τ >n} ,

A(n+1)∧τ is the sum of Fn -measurable r.v.’s, hence Fn -measurable itself. Finally, by definition (Mn² − An )n is a martingale, hence so is (M²n∧τ − An∧τ )n (stopping a martingale always gives rise to a martingale). Therefore the associated increasing process of M^τ is indeed (An∧τ )n , as the associated increasing process is unique.
5.5 The Stopping Theorem

The following result is the key tool in the proof of many properties of martingales appearing in the sequel.
Theorem 5.10 (The Stopping Theorem) Let X = (Ω, F, ( Fn )n , (Xn )n , P) be a supermartingale and τ1 , τ2 stopping times of the filtration ( Fn )n , a.s. bounded and such that τ1 ≤ τ2 a.s. Then the r.v.’s Xτ1 and Xτ2 are integrable and

E(Xτ2 | Fτ1 ) ≤ Xτ1 .    (5.6)

In particular, E(Xτ2 ) ≤ E(Xτ1 ).
Proof The integrability of Xτ1 and Xτ2 is immediate as, for i = 1, 2 and denoting by k a number larger than τ2 , |Xτi | ≤ Σ_{j=0}^{k} |Xj |. In order to prove (5.6) let us first assume τ2 ≡ k ∈ N and let A ∈ Fτ1 . As A ∩ {τ1 = j } ∈ Fj , we have, for j ≤ k,

E( Xτ1 1A∩{τ1 =j } ) = E( Xj 1A∩{τ1 =j } ) ≥ E( Xk 1A∩{τ1 =j } ) ,

where the inequality holds because (Xn )n is a supermartingale and A ∩ {τ1 = j } ∈ Fj . Taking the sum with respect to j , 0 ≤ j ≤ k,

E( Xτ1 1A ) = Σ_{j=0}^{k} E( Xj 1A∩{τ1 =j } ) ≥ Σ_{j=0}^{k} E( Xk 1A∩{τ1 =j } ) = E( Xτ2 1A ) ,

which proves the theorem if τ2 is a constant stopping time. Let us now assume more generally τ2 ≤ k. If we apply the result of the first part of the proof to the stopped martingale (Xnτ2 )n (recall that Xnτ2 = Xn∧τ2 ) and to the stopping times τ1 and k, we have

E( Xτ1 1A ) = E( (X^τ2 )τ1 1A ) ≥ E( (X^τ2 )k 1A ) = E( Xτ2 1A ) ,
which concludes the proof.
Theorem 5.10 applied to X and −X gives

Corollary 5.11 Under the assumptions of Theorem 5.10, if moreover X is a martingale, then

E(Xτ2 | Fτ1 ) = Xτ1 .

In some sense the stopping theorem states that the martingale (resp. supermartingale, submartingale) relation (5.1) still holds if the times m, n are replaced by bounded stopping times. If X is a martingale, applying Corollary 5.11 to the stopping times τ1 = 0 and τ2 = τ we find that the mean E(Xτ ) is constant as τ ranges among bounded stopping times.
Beware: these stopping times must be bounded, i.e. a number k must exist such that τ2 (ω) ≤ k for every ω a.s. A finite stopping time is not necessarily bounded. Very often however we shall need to apply the relation (5.6) to unbounded stopping times: as we shall see, this can often be done in a simple way by approximating the unbounded stopping times with bounded ones. The following is a first application of the stopping theorem.
Theorem 5.12 (Maximal Inequalities) Let X be a supermartingale and λ > 0. Then

λ P( sup_{0≤n≤k} Xn ≥ λ ) ≤ E(X0 ) + E(Xk− ) ,    (5.7)

λ P( inf_{0≤n≤k} Xn ≤ −λ ) ≤ −E( Xk 1{inf_{0≤n≤k} Xn ≤−λ} ) .    (5.8)
Proof Let

τ (ω) = inf{n; n ≤ k, Xn (ω) ≥ λ} ,    τ (ω) = k if { } = ∅ .

τ is a bounded stopping time and, by (5.6) applied to the stopping times τ2 = τ , τ1 = 0,

E(X0 ) ≥ E(Xτ ) = E( Xτ 1{sup_{0≤n≤k} Xn ≥λ} ) + E( Xk 1{sup_{0≤n≤k} Xn <λ} )
        ≥ λ P( sup_{0≤n≤k} Xn ≥ λ ) − E(Xk− ) ,

which is (5.7). As for (5.8), let σ (ω) = inf{n; n ≤ k, Xn (ω) ≤ −λ}, with σ (ω) = k if { } = ∅. Again by the stopping theorem, applied now to the stopping times τ1 = σ and τ2 = k,

E(Xk ) ≤ E(Xσ ) = E( Xσ 1{inf_{0≤n≤k} Xn ≤−λ} ) + E( Xk 1{inf_{0≤n≤k} Xn >−λ} )
        ≤ −λ P( inf_{0≤n≤k} Xn ≤ −λ ) + E( Xk 1{inf_{0≤n≤k} Xn >−λ} ) ,

hence

λ P( inf_{0≤n≤k} Xn ≤ −λ ) ≤ E( Xk 1{inf_{0≤n≤k} Xn >−λ} ) − E(Xk ) = −E( Xk 1{inf_{0≤n≤k} Xn ≤−λ} ) ,

i.e. (5.8).
Note that (5.7) implies that if a supermartingale is such that sup_{k≥0} E(Xk− ) < +∞ (in particular if it is a positive supermartingale) then the r.v. sup_{n≥0} Xn is finite a.s. Indeed, by (5.7),

λ P( sup_{n≥0} Xn ≥ λ ) = lim_{k→∞} λ P( sup_{0≤n≤k} Xn ≥ λ ) ≤ E(X0 ) + sup_{k≥0} E(Xk− ) < +∞ ,

from which

lim_{λ→+∞} P( sup_{n≥0} Xn ≥ λ ) = 0 ,

i.e. the r.v. sup_{n≥0} Xn is a.s. finite. This will become clearer in the next section.
5.6 Almost Sure Convergence

One of the reasons for the importance of martingales is the result of this section: it guarantees that, under assumptions that are quite weak and easy to check, a martingale converges a.s.
Let [a, b] ⊂ R, a < b, be an interval and

γa,b^k (ω) = how many times the path (Xn (ω))n≤k crosses ascending [a, b] .

We say that (Xn (ω))n crosses ascending the interval [a, b] once in the time interval [i, j ] if

Xi (ω) < a ,    Xm (ω) ≤ b for m = i + 1, . . . , j − 1 ,    Xj (ω) > b .

When this happens we say that the process (Xn )n has performed one upcrossing on the interval [a, b] (see Fig. 5.1). γa,b^k (ω) is therefore the number of upcrossings on the interval [a, b] of the path (Xn (ω))n up to time k.
Fig. 5.1 Here γa,b^k = 3

The proof of the convergence theorem has some technical points, but the baseline is quite simple: in order to prove that a sequence converges we must prove first of all that it does not oscillate too much. Hence the following proposition, which states that a supermartingale cannot make too many upcrossings, is the key tool.
Proposition 5.13 If X is a supermartingale, then

(b − a) E(γa,b^k ) ≤ E[ (Xk − a)− ] .

Proof Let us consider the following sequence of stopping times:

τ1 (ω) = inf{n; n ≤ k, Xn (ω) < a} ,    τ1 (ω) = k if { } = ∅ ,
τ2 (ω) = inf{n; τ1 (ω) < n ≤ k, Xn (ω) > b} ,    τ2 (ω) = k if { } = ∅ ,
. . .
τ2m−1 (ω) = inf{n; τ2m−2 (ω) < n ≤ k, Xn (ω) < a} ,    τ2m−1 (ω) = k if { } = ∅ ,
τ2m (ω) = inf{n; τ2m−1 (ω) < n ≤ k, Xn (ω) > b} ,    τ2m (ω) = k if { } = ∅ ,    (5.9)
i.e. at time τ2i , if Xτ2i > b, the i-th upcrossing is completed and at time τ2i−1 , if Xτ2i−1 < a, the i-th upcrossing is initialized. Let

A2m = {τ2m ≤ k, Xτ2m > b} = {γa,b^k ≥ m} ,
A2m−1 = {γa,b^k ≥ m − 1, Xτ2m−1 < a} .

The idea of the proof is to find an upper bound for P(γa,b^k ≥ m) = P(A2m ). It is immediate that Ai ∈ Fτi , as τi and Xτi are Fτi -measurable r.v.’s. By the stopping theorem, Theorem 5.10, with the stopping times τ2m−1 and τ2m we have

E[ (Xτ2m − a)1A2m−1 | Fτ2m−1 ] = 1A2m−1 E( Xτ2m − a | Fτ2m−1 ) ≤ 1A2m−1 (Xτ2m−1 − a) .

As Xτ2m−1 < a on A2m−1 , taking the expectation we have

0 ≥ E[ (Xτ2m−1 − a)1A2m−1 ] ≥ E[ (Xτ2m − a)1A2m−1 ] .    (5.10)
Obviously A2m−1 = A2m ∪ (A2m−1 \ A2m ) and

Xτ2m ≥ b    on A2m ,
Xτ2m = Xk    on A2m−1 \ A2m ,

so that (5.10) gives

0 ≥ E[ (Xτ2m − a)1A2m ] + E[ (Xτ2m − a)1A2m−1 \A2m ]
  ≥ (b − a) P(A2m ) + ∫_{A2m−1 \A2m} (Xk − a) dP ,

from which we deduce

(b − a) P(γa,b^k ≥ m) ≤ − ∫_{A2m−1 \A2m} (Xk − a) dP ≤ ∫_{A2m−1 \A2m} (Xk − a)− dP .    (5.11)
The events A2m−1 \ A2m are pairwise disjoint as m ranges over N so that, taking the sum in m in (5.11),

(b − a) Σ_{m=1}^{∞} P(γa,b^k ≥ m) ≤ E[ (Xk − a)− ]

and the result follows recalling that Σ_{m=1}^{∞} P(γa,b^k ≥ m) = E(γa,b^k ) (Remark 2.1).
Theorem 5.14 Let X be a supermartingale such that

sup_{n≥0} E(Xn− ) < +∞ .    (5.12)

Then it converges a.s. to a finite limit.
Proof For fixed a < b let γa,b (ω) denote the number of upcrossings on the interval [a, b] of the whole path (Xn (ω))n . As (Xn − a)− ≤ a+ + Xn− , by Proposition 5.13,

E(γa,b ) = lim_{k→∞} E(γa,b^k ) ≤ (1/(b − a)) sup_{n≥0} E[ (Xn − a)− ]
         ≤ (1/(b − a)) ( a+ + sup_{n≥0} E(Xn− ) ) < +∞ .    (5.13)
In particular γa,b < +∞ a.s., i.e. there exists a negligible event Na,b such that γa,b (ω) < +∞ for ω ∉ Na,b ; taking the union, N, of the sets Na,b as a, b range in Q with a < b, we can assume that, outside the negligible event N, we have γa,b < +∞ for every a, b ∈ R.
Let us show that for ω ∉ N the sequence (Xn (ω))n converges: otherwise, if a = lim inf_{n→∞} Xn (ω) < lim sup_{n→∞} Xn (ω) = b, the sequence (Xn (ω))n would take values close to a infinitely many times and also values close to b infinitely many times. Hence, for every α, β ∈ R with a < α < β < b, the path (Xn (ω))n would cross the interval [α, β] infinitely many times and we would have γα,β (ω) = +∞.
The limit is moreover finite: thanks to (5.13)

lim_{b→+∞} E(γa,b ) = 0 ,

but γa,b (ω) is decreasing in b and therefore

lim_{b→+∞} γa,b (ω) = 0    a.s.

As γa,b can only take integer values, γa,b (ω) = 0 for b large enough and (Xn (ω))n is therefore bounded from above a.s. In the same way we see that it is bounded from below.
The assumptions of Theorem 5.14 are in particular satisfied by all positive supermartingales.
Remark 5.15 Note that, if X is a martingale, condition (5.12) of Theorem 5.14 is equivalent to boundedness in L1 . Indeed, of course, if (Xn )n is bounded in L1 then (5.12) is satisfied. Conversely, taking into account the decomposition Xn = Xn+ − Xn− , as n ↦ E(Xn ) =: c is constant, we have E(Xn+ ) = c + E(Xn− ), hence

E(|Xn |) = E(Xn+ ) + E(Xn− ) = c + 2 E(Xn− ) .
Example 5.16 As a first application of the a.s. convergence Theorem 5.14, let us consider the process Sn = X1 + · · · + Xn , where the r.v.’s Xi are such that P(Xi = ±1) = 1/2. (Sn )n is a martingale (it is an instance of Example 5.2 (a)). (Sn )n is a model of a random motion that starts at 0 and, at each iteration, makes a step to the left or to the right with probability 1/2. Let k ∈ Z. What is the probability of visiting k or, to be precise, if τ = inf{n; Sn = k} is the passage time at k, what is the value of P(τ < +∞)?
Assume, for simplicity, that k < 0 and consider the stopped martingale (Sn∧τ )n . This martingale is bounded from below as Sn∧τ ≥ k, hence Sn∧τ− ≤ −k and condition (5.12) is verified. Hence (Sn∧τ )n converges a.s. But on {τ = +∞} convergence cannot take place as |Sn+1 − Sn | = 1, so that (Sn∧τ )n cannot be a Cauchy sequence on {τ = +∞}. Hence P(τ = +∞) = 0 and (Sn )n visits every integer k ∈ Z with probability 1.
A process (Sn )n of the form Sn = X1 + · · · + Xn where the Xn are i.i.d. integer-valued is a random walk on Z. The instance of this example is a simple (because Xn takes the values ±1 only) random walk. It is a model of random motion where at every step a displacement of one unit is made to the right or to the left. Martingales are an important tool in the investigation of random walks, as will be revealed in many of the examples and exercises below. Actually martingales are a critical tool in the investigation of any kind of stochastic process.
Example 5.17 Let .(Xn )n and .(Sn )n be a random walk as in the previous example. Let .a, b be positive integers and let .τ = inf{n; Xn ≥ b or Xn ≤ −a} be the exit time of S from the interval .] − a, b[. We know, thanks to Example 5.16, that .τ < +∞ with probability 1. Therefore we can define the r.v. .Sτ , which is the position of .(Sn )n at the exit from the interval .] − a, b[. Of course, .Sτ can only take the values .−a or b. What is the value of .P(Sτ = b)?
Let us assume for a moment that we can apply Theorem 5.10, the stopping theorem, to the stopping times τ2 = τ and τ1 = 0 (we are not allowed to do so because τ is finite but we do not know whether it is bounded, and actually it is not); then we would have

0 = E(S0 ) = E(Sτ ) .    (5.14)
From this relation, as P(Sτ = −a) = 1 − P(Sτ = b), we deduce that

0 = E(Sτ ) = b P(Sτ = b) − a P(Sτ = −a) = b P(Sτ = b) − a (1 − P(Sτ = b)) ,

i.e.

P(Sτ = b) = a/(a + b) .
The problem is therefore solved if (5.14) holds. Actually this is easy to prove: for every n the stopping time .τ ∧ n is bounded and the stopping theorem gives 0 = E(Sτ ∧n ) .
.
Now observe that .limn→∞ Sτ ∧n = Sτ and that, as .−a ≤ Sτ ∧n ≤ b, the sequence .(Sτ ∧n )n is bounded, so that we can apply Lebesgue’s Theorem and obtain (5.14). This example shows a typical application of the stopping theorem in order to obtain the distribution of a process stopped at some stopping time. In this case the process under consideration is itself a martingale. For a more general process .(Sn )n one can look for a function f such that .Mn = f (Sn ) is a martingale to which the stopping theorem can be applied. This is the case in Exercise 5.12, for example. Other properties of exit times can also be investigated via the stopping of suitable martingales, as will become clear in the exercises. This kind of problem (investigation of properties of exit times), is very often reduced to the question of “finding the right martingale”. This example also shows how to apply the stopping theorem to stopping times that are not bounded: just apply the stopping theorem to the stopping times .τ ∧ n, which are bounded, and then hope to be able to pass to the limit using Lebesgue’s Theorem as above or some other statement, such as Beppo Levi’s Theorem.
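As a side remark, the identity P(Sτ = b) = a/(a + b) is easy to check numerically. The following is a minimal sketch in Python (numpy is assumed to be available; the values a = 3, b = 5 and the number of simulated paths are arbitrary choices made only for the illustration): it runs many paths of the simple symmetric random walk until they exit ]−a, b[ and compares the exit frequency at b with a/(a + b).

# Sketch (assumption: Python with numpy); estimates P(S_tau = b) for the
# simple symmetric random walk stopped at the exit from ]-a, b[.
import numpy as np

rng = np.random.default_rng(0)
a, b = 3, 5                # hypothetical barriers
n_paths = 100_000

hits_b = 0
for _ in range(n_paths):
    s = 0
    while -a < s < b:      # run the walk until it exits ]-a, b[
        s += 1 if rng.random() < 0.5 else -1
    hits_b += (s == b)

print("estimated P(S_tau = b):", hits_b / n_paths)
print("theoretical a/(a+b):   ", a / (a + b))

With the values above the empirical frequency should be close to 3/8 = 0.375.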
5.7 Doob’s Inequality and Lp Convergence, p > 1

A martingale M is said to be bounded in Lp if

sup_{n≥1} E(|Mn |^p ) < +∞ .

Note that, for p ≥ 1, (|Mn |^p )n is a submartingale so that n ↦ E(|Mn |^p ) is increasing.

Theorem 5.18 (Doob’s Maximal Inequality) Let M be a martingale bounded in Lp for p > 1. Then, if M∗ = supn |Mn | (the maximal r.v.), M∗ belongs to Lp and

‖M∗ ‖p ≤ q sup_{n≥1} ‖Mn ‖p ,    (5.15)

where q = p/(p − 1) is the conjugate exponent of p.
Theorem 5.18 is a consequence of the following.
Lemma 5.19 If X is a positive submartingale, then for every p > 1 and n ∈ N,

E[ ( max_{0≤k≤n} Xk )^p ] ≤ ( p/(p − 1) )^p E(Xn^p ) .

Proof Note that if Xn ∉ Lp the term on the right-hand side is equal to +∞ and there is nothing to prove. If instead Xn ∈ Lp , then Xk ∈ Lp also for k ≤ n as (Xk^p )k≤n is itself a submartingale (see the remarks at the end of Sect. 5.2) and k ↦ E(Xk^p ) is increasing. Hence also Y := max_{1≤k≤n} Xk belongs to Lp . Let, for λ > 0,

τλ (ω) = inf{k; 0 ≤ k ≤ n, Xk (ω) > λ} ,    τλ (ω) = n + 1 if { } = ∅ .
We have Σ_{k=1}^{n} 1{τλ =k} = 1{Y >λ} , so that, as Xk ≥ λ on {τλ = k},

λ 1{Y >λ} ≤ Σ_{k=1}^{n} Xk 1{τλ =k}
and, for every p > 1,

Y^p = p ∫_0^Y λ^{p−1} dλ = p ∫_0^{+∞} λ^{p−1} 1{Y >λ} dλ ≤ p ∫_0^{+∞} λ^{p−2} Σ_{k=1}^{n} Xk 1{τλ =k} dλ .    (5.16)
As 1{τλ =k} is Fk -measurable, E(Xk 1{τλ =k} ) ≤ E(Xn 1{τλ =k} ) and taking the expectation in (5.16) we have

(1/p) E(Y^p ) ≤ ∫_0^{+∞} λ^{p−2} Σ_{k=1}^{n} E( Xk 1{τλ =k} ) dλ ≤ ∫_0^{+∞} Σ_{k=1}^{n} E( λ^{p−2} Xn 1{τλ =k} ) dλ
             = (1/(p − 1)) E( Xn × (p − 1) ∫_0^{+∞} λ^{p−2} Σ_{k=1}^{n} 1{τλ =k} dλ ) = (1/(p − 1)) E( Y^{p−1} Xn ) ,

since (p − 1) ∫_0^{+∞} λ^{p−2} Σ_{k=1}^{n} 1{τλ =k} dλ = (p − 1) ∫_0^{+∞} λ^{p−2} 1{Y >λ} dλ = Y^{p−1} .
Hölder’s inequality now gives

E(Y^p ) ≤ (p/(p − 1)) E[ (Y^{p−1})^{p/(p−1)} ]^{(p−1)/p} E(Xn^p )^{1/p} = (p/(p − 1)) E(Y^p )^{(p−1)/p} E(Xn^p )^{1/p} .

As we know already that E(Y^p ) < +∞, we can divide both sides of the equation by E(Y^p )^{(p−1)/p}, which gives

E( max_{0≤k≤n} Xk^p )^{1/p} = E(Y^p )^{1/p} ≤ (p/(p − 1)) E(Xn^p )^{1/p} .
Proof of Theorem 5.18 Lemma 5.19 applied to the positive submartingale (|Mk |)k gives, for every n,

E( max_{0≤k≤n} |Mk |^p ) ≤ ( p/(p − 1) )^p E(|Mn |^p ) ,

and now we can just note that, as n → ∞,

max_{0≤k≤n} |Mk |^p ↑ (M∗ )^p ,    E(|Mn |^p ) ↑ sup_n E(|Mn |^p ) .
Doob’s inequality (5.15) provides simple conditions for the Lp convergence of a martingale if p > 1. Assume that M is bounded in Lp with p > 1. Then sup_{n≥0} Mn− ≤ M∗ . As by Doob’s inequality M∗ is integrable, condition (5.12) of Theorem 5.14 is satisfied and M converges a.s. to an r.v. M∞ and, of course, |M∞ | ≤ M∗ . As |Mn − M∞ |^p ≤ 2^{p−1} ( |Mn |^p + |M∞ |^p ) ≤ 2^p (M∗ )^p , Lebesgue’s Theorem gives

lim_{n→∞} E( |Mn − M∞ |^p ) = 0 .

Conversely, if (Mn )n converges in Lp , then it is also bounded in Lp and by the same argument as above it also converges a.s. Therefore for p > 1 the behavior of a martingale bounded in Lp is very simple:
Theorem 5.20 If .p > 1 a martingale is bounded in .Lp if and only if it converges a.s. and in .Lp .
In the next section we shall see what happens concerning .L1 convergence of a martingale. Things are not so simple (and somehow more interesting).
5.8 L1 Convergence, Regularity

The key tool for the investigation of the L1 convergence of martingales is uniform integrability, which was introduced in Sect. 3.6.
Proposition 5.21 Let .Y ∈ L1 . Then the family . H := {E(Y | G)} G, as . G ranges among all sub-.σ -algebras of . F, is uniformly integrable.
Proof We shall prove that the family H satisfies the criterion of Proposition 3.33. First note that H is bounded in L1 as

E( |E(Y | G)| ) ≤ E( E(|Y | | G) ) = E(|Y |)

and therefore, by Markov’s inequality,

P( |E(Y | G)| ≥ R ) ≤ (1/R) E(|Y |) .    (5.17)
Let us fix ε > 0 and let δ > 0 be such that

∫_A |Y | dP < ε

for every A ∈ F such that P(A) ≤ δ, as guaranteed by Proposition 3.33, as {Y } is a uniformly integrable family. Let now R > 0 be such that

P( |E(Y | G)| > R ) ≤ (1/R) E(|Y |) < δ .

We have then

∫_{ |E(Y | G)|>R } |E(Y | G)| dP ≤ ∫_{ |E(Y | G)|>R } E( |Y | | G ) dP = ∫_{ |E(Y | G)|>R } |Y | dP < ε ,
where the last equality holds because the event .{|E(Y | G)| > R} is . G-measurable. In particular, recalling Example 5.2 (c), if .( Fn )n is a filtration on .(Ω, F, P) and Y ∈ L1 , then .(E(Y | Fn ))n is a uniformly integrable martingale. A martingale of this form is called a regular martingale. Conversely, every uniformly integrable martingale .(Mn )n is regular: indeed, as 1 .(Mn )n is bounded in .L , condition (5.12) holds and .(Mn )n converges a.s. to some r.v. Y . By Theorem 3.34, .Y ∈ L1 and the convergence takes place in .L1 . Hence .
Mm = E(Mn | Fm )  →  E(Y | Fm )    in L1 as n → ∞

(recall that the conditional expectation is a continuous operator in L1 , Remark 4.10). We have therefore proved the following characterization of regular martingales.
Theorem 5.22 A martingale .(Mn )n is uniformly integrable if and only if it is regular, i.e. of the form .Mn = E(Y | Fn ) for some .Y ∈ L1 , and if and only if it converges a.s. and in .L1 .
The following statement specifies the limit of a regular martingale.
Proposition 5.23 Let Y ∈ L1 (Ω, F, P), ( Fn )n a filtration on (Ω, F) and F∞ = σ( ⋃_{n=1}^{∞} Fn ), the σ-algebra generated by the Fn ’s. Then

lim_{n→∞} E(Y | Fn ) = E(Y | F∞ )    a.s. and in L1 .
Proof If .Z = limn→∞ E(Y | Fn ) a.s. then Z is . F∞ -measurable, being the limit of F∞ -measurable r.v.’s (recall Remark 1.15 if you are worried about the a.s.). In order to prove that .Z = E(Y | F∞ ) a.s. we must check that
.
E(Z1A ) = E(Y 1A )
.
for every A ∈ F∞ .
(5.18)
The class C = ⋃_n Fn is stable with respect to finite intersections, generates F∞ and contains Ω. If A ∈ Fm for some m then as soon as n ≥ m we have E[ E(Y | Fn )1A ] = E[ E(1A Y | Fn ) ] = E(Y 1A ), as also A ∈ Fn . Therefore

E(Z 1A ) = lim_{n→∞} E[ E(Y | Fn )1A ] = E(Y 1A ) .
Hence (5.18) holds for every .A ∈ C, and, by Remark 4.3, also for every .A ∈ F∞ . Remark 5.24 (Regularity of Positive Martingales) In the case of a positive martingale .(Mn )n the following ideas may be useful in order to check regularity (or non-regularity). Sometimes it is important to establish this feature (see Exercise 5.24, for example). (a) Regularity is easily established when the a.s. limit .M∞ = limn→∞ Mn is known. We have .E(M∞ ) ≤ limn→∞ E(Mn ) by Fatou’s Lemma. If this inequality is strict, then the martingale cannot be regular, as .L1 convergence entails convergence of the expectations. Conversely, if .E(M∞ ) = limn→∞ E(Mn ) then the martingale is regular, as for positive r.v.’s a.s. convergence in addition to convergence of the expectations entails .L1 convergence (Scheffé’s theorem, Theorem 3.25). (b) (Kakutani’s trick) If the limit .M∞ is not known, a possible approach in order to investigate the regularity of .(Mn )n is to compute .
lim_{n→∞} E( √Mn ) .    (5.19)
If this limit is equal to 0 then necessarily M∞ = 0. Actually by Fatou’s Lemma

E( √M∞ ) ≤ lim_{n→∞} E( √Mn ) = 0 ,

so that the positive r.v. √M∞ , having expectation equal to 0, is = 0 a.s. In this case regularity is not possible (barring trivial situations).
(c) A particular case is martingales of the form

Mn = U1 · · · Un    (5.20)

where the r.v.’s Uk are independent, positive and such that E(Uk ) = 1, see Example 5.2 (b). In this case we have

lim_{n→∞} E( √Mn ) = ∏_{k=1}^{∞} E( √Uk )    (5.21)
so that if the infinite product above is equal to 0, then (Mn )n is not regular. Note that in order to determine the behavior of the infinite product Proposition 3.4 may be useful. By Jensen’s inequality E( √Uk ) ≤ √E(Uk ) ≤ 1 and the inequality is strict unless Uk ≡ 1, as the square root is strictly concave, so that E( √Uk ) < 1. In particular, if the Uk are also identically distributed, we have E( √Mn ) = E( √U1 )^n →n→∞ 0 and (Mn )n is not regular. Hence a martingale of the form (5.20), if in addition the Un are i.i.d., cannot be regular (besides the trivial case Mn ≡ 1). The next result states what happens when the infinite product in (5.21) does not vanish.
Proposition 5.25 Let (Un )n be a sequence of independent positive r.v.’s with E(Un ) = 1 for every n and let Mn = U1 · · · Un . Then if

∏_{n=1}^{∞} E( √Un ) > 0 ,

(Mn )n is regular.
Proof Let us prove first that (√Mn )n is a Cauchy sequence in L2 . We have, for n ≥ m,

E[ ( √Mn − √Mm )² ] = E( Mn + Mm − 2 √(Mn Mm ) ) = 2 ( 1 − E( √(Mn Mm ) ) ) .    (5.22)

Now

E( √(Mn Mm ) ) = E(U1 · · · Um ) ∏_{k=m+1}^{n} E( √Uk ) = ∏_{k=m+1}^{n} E( √Uk ) .    (5.23)

As E( √Uk ) ≤ 1, it follows that

∏_{k=m+1}^{n} E( √Uk ) ≥ ∏_{k=m+1}^{∞} E( √Uk ) = ( ∏_{k=1}^{∞} E( √Uk ) ) / ( ∏_{k=1}^{m} E( √Uk ) )

and, as by hypothesis ∏_{k=1}^{∞} E( √Uk ) > 0, we obtain

lim_{m→∞} ∏_{k=m+1}^{∞} E( √Uk ) = lim_{m→∞} ( ∏_{k=1}^{∞} E( √Uk ) ) / ( ∏_{k=1}^{m} E( √Uk ) ) = 1 .

Therefore going back to (5.23), for every ε > 0, for n0 large enough and n, m ≥ n0 ,

E( √(Mn Mm ) ) ≥ 1 − ε
√ and by (5.22) .( Mn )n is a Cauchy sequence in .L2 and converges in .L2 . This implies that .(Mn )n converges in .L1 (see Exercise 3.1 (b)) and is regular. Remark 5.26 (Backward Martingales) Let .(Bn )n be a decreasing sequence of .σ -algebras. A family .(Zn )n of integrable r.v.’s is a backward (or reverse) martingale if E(Zn | Bn+1 ) = Zn+1 .
.
Backward supermartingales and submartingales are defined similarly. The behavior of a backward martingale is easily traced back to the behavior of martingales by setting, for every N and n ≤ N,

Yn = ZN−n ,    Fn = BN−n .
As

E(Yn+1 | Fn ) = E(ZN−n−1 | BN−n ) = ZN−n = Yn ,

(Yn )n≤N is a martingale with respect to the filtration ( Fn )n≤N .
Let us note first that a backward martingale is automatically uniformly integrable thanks to the criterion of Proposition 5.21:

Zn = E(Zn−1 | Bn ) = E( E(Zn−2 | Bn−1 ) | Bn ) = E(Zn−2 | Bn ) = · · · = E(Z1 | Bn ) .

In particular, (Zn )n is bounded in L1 . Also, by Proposition 5.13 (the upcrossings) applied to the reversed backward martingale (Yn )n≤N , a bound similar to (5.9) is proved to hold for (Zn )n and this allows us to reproduce the argument of Theorem 5.14, proving a.s. convergence. In conclusion, the behavior of a backward martingale is very simple: it converges a.s. and in L1 . For more details and complements, see [21] p. 115, [12] p. 264 or [6] p. 203.
Example 5.27 Let .(Ω, F, P) be a probability space and let us assume that . F is countably generated. This is an assumption that is very often satisfied (recall Exercise 1.1). In this example we give a proof of the Radon-Nikodym theorem (Theorem 1.29 p. 26) using martingales. The appearance of martingales in this context should not come as a surprise: martingales appear in a natural way in connection with changes of probability (see Exercises 5.23–5.26). Let .Q be a probability on .(Ω, F) such that .Q P. Let .(Fn )n ⊂ F be a sequence of events such that . F = σ (Fn , n = 1, 2, . . . ) and let .
Fn = σ (F1 , . . . , Fn ) .
For every n let us consider all possible intersections of the .Fk , .k = 1, . . . , n. Let Gn,k , k = 1, . . . , Nn be the atoms, i.e. the elements among these intersections that do not contain other intersections. Then every event in . Fn is the finite disjoint union of the .Gn,k ’s.
.
Let, for every n,

Xn = Σ_{k=1}^{Nn} ( Q(Gn,k ) / P(Gn,k ) ) 1Gn,k .    (5.24)
As .Q P if .P(Gn,k ) = 0 then also .Q(Gn,k ) = 0 and in (5.24) we shall consider the sum as extended only to the indices k such that .P(Gn,k ) > 0. Let us check that .(Xn )n is an .( Fn )n -martingale. If .A ∈ Fn , then A is the finite (disjoint) union of the .Gn,k for k ranging in some set of indices . I. We have therefore
E(Xn 1A ) = E( Xn Σ_{k∈I} 1Gn,k ) = Σ_{k∈I} E(Xn 1Gn,k ) = Σ_{k∈I} ( Q(Gn,k ) / P(Gn,k ) ) P(Gn,k ) = Σ_{k∈I} Q(Gn,k ) = Q(A) .
If .A ∈ Fn , then obviously also .A ∈ Fn+1 so that E(Xn+1 1A ) = Q(A) = E(Xn 1A ) ,
.
(5.25)
hence .E(Xn+1 | Fn ) = Xn . Moreover, the previous relations for .A = Ω give E(Xn ) = 1. Being a positive martingale, .(Xn )n converges a.s. to some positive r.v. X. Let us prove that .(Xn )n is also uniformly integrable. Thanks to (5.25) with .A = {Xn ≥ R}, we have, for every n, .
E( Xn 1{Xn ≥R} ) = Q(Xn ≥ R)    (5.26)
and also, by Markov’s inequality, .P(Xn ≥ R) ≤ R −1 E(Xn ) = R −1 . By Exercise 3.28, for every .ε > 0 there exists a .δ > 0 such that if .P(A) ≤ δ then .Q(A) ≤ ε. If R is such that .R −1 ≤ δ then .P(Xn ≥ R) ≤ δ and (5.26) gives E(Xn 1{Xn ≥R} ) = Q(Xn ≥ R) ≤ ε
.
for every n
and the sequence .(Xn )n is uniformly integrable and converges to X also in .L1 . It is now immediate that X is a density of .Q with respect to .P: this is actually Exercise 5.24 below. Of course this proof can immediately be adapted to the case of finite measures instead of probabilities. Note however that the Radon-Nikodym Theorem holds even without assuming that . F is countably generated.
Exercises
Exercises 5.1 (p. 358) Let .(Xn )n be a supermartingale such that, moreover, .E(Xn ) = const. Then .(Xn )n is a martingale. 5.2 (p. 359) Let M be a positive martingale. Prove that, for .m < n, .{Mm = 0} ⊂ {Mn = 0} a.s. (i.e. the set of zeros of a positive martingale increases). 5.3 (p. 359) (Product of independent martingales) Let .(Mn )n , .(Nn )n be martingales on the same probability space .(Ω, F, P), with respect to the filtrations .( Fn )n and .( Gn )n , respectively. Assume moreover that .( Fn )n and .( Gn )n are independent (in particular the martingales are themselves independent). Then the product .(Mn Nn )n is a martingale for the filtration .( Hn )n with . Hn = Fn ∨ Gn . 5.4 (p. 359) Let .(Xn )n be a sequence of independent r.v.’s with mean 0 and variance σ 2 and let . Fn = σ (Xk , k ≤ n). Let .Mn = X1 + · · · + Xn and let .(Zn )n be a square integrable process predictable with respect to .( Fn )n .
.
(a) Prove that Yn =
n
.
Zk Xk
k=1
is a square integrable martingale. (b) Prove that .E(Yn ) = 0 and that E(Yn2 ) = σ 2
n
.
E(Zk2 ) .
k=1
(c) What is the associated increasing process of .(Mn )n ? And of .(Yn )n ? 5.5 (p. 360) (Martingales with independent increments) Let .M = (Ω, F, ( Fn )n , (Mn )n , P) be a square integrable martingale.
.
2 ). (a) Prove that .E[(Mn − Mm )2 ] = E(Mn2 − Mm (b) M is said to be with independent increments if, for every .n ≥ m, .Mn − Mm is independent of . Fm . Prove that, in this case, the associated increasing process is 2 2 2 .An = E(Mn ) − E(M ) = E[(Mn − M0 ) ] and is therefore deterministic. 0 (c) Let .(Mn )n be a Gaussian martingale (i.e. such that, for every n, the vector .(M0 , . . . , Mn ) is Gaussian). Show that .(Mn )n has independent increments with respect to its natural filtration .( Gn )n .
5.6 (p. 361) Let .(Yn )n≥0 be a sequence of i.i.d. r.v.’s such that .P(Yk = ±1) = 12 . Let . F0 = {∅, Ω}, . Fn = σ (Yk , k ≤ n) and .S0 = 0, .Sn = Y1 + · · · + Yn , .n ≥ 1. Let .M0 = 0 and Mn =
n
.
sign(Sk−1 )Yk , n = 1, 2, . . .
k=1
where ⎧ ⎪ ⎪ ⎨1 . sign(x) = 0 ⎪ ⎪ ⎩−1
if x > 0 if x = 0 if x < 0 .
(a) What is the associated increasing process of the martingale .(Sn )n ? (b) Show that .(Mn )n≥0 is a square integrable martingale with respect to .( Fn )n and compute its associated increasing process. (c1) Prove that E[(|Sn+1 | − |Sn |)1{Sn >0} | Fn ] = 0
.
E[(|Sn+1 | − |Sn |)1{Sn p. Let .Sn = Y1 + · · · + Yn .
.
(a) Prove that .limn→∞ Sn = −∞ a.s. (b) Prove that Zn =
.
q Sn p
is a martingale. (c) Let .a, b be strictly positive integers and let .τ = τ−a,b = inf{n, Sn = b or Sn = −a} be the exit time from .] − a, b[. What is the value of .E(Zn∧τ )? Of .E(Zτ )? (d1) Compute .P(Sτ = b) (i.e. the probability for the random walk .(Sn )n to exit from the interval .] − a, b[ at b). How does this quantity behave as .a → +∞? (d2) Let, for .b > 0, .τb = inf{n; Sn = b} be the passage time of .(Sn )n at b. Note that .{τb < n} ⊂ {Sτ−n,b = b} and deduce that .P(τb < +∞) < 1, i.e. with strictly positive probability the process .(Sn )n never visits b. This was to be expected, as .q > p and the process has a preference to make displacements to the left.
234
5 Martingales
(d3) Compute .P(τ−a < +∞). 5.13 (p. 368) (Wald’s identity) Let .(Xn )n be a sequence of i.i.d. integrable real r.v.’s with .E(X1 ) = x. Let . F0 = {Ω, ∅}, . Fn = σ (Xk , k ≤ n), .S0 = 0 and, for .n ≥ 1, .Sn = X1 + · · · + Xn . Let .τ be an integrable stopping time of .( Fn )n . (a) Let .Zn = Sn − nx. Show that .(Zn )n is an .( Fn )n -martingale. (b1) Show that, for every n, .E(Sn∧τ ) = x E(n ∧ τ ). (b2) Show that .Sτ is integrable and that .E(Sτ ) = x E(τ ), first assuming .X1 ≥ 0 a.s. and then in the general case. (c) Assume that, for every n, .P(Xn = ±1) = 12 and .τ = τb = inf{n; Sn ≥ b}, where .b > 0 is an integer. Show that .τb is not integrable. (Recall that we already know, Example 5.16, that .τb < +∞ a.s.) • Note that no requirement concerning independence of .τ and the .Xn is made. 5.14 (p. 369) Let .(Xn )n be a sequence of i.i.d. r.v.’s such that .P(Xn = ±1) = 12 and let . F0 = {∅, Ω}, . Fn = σ (Xk , k ≤ n), and .S0 = 0, Sn = X1 + · · · + Xn , .n ≥ 1. (a) Show that .Wn = Sn2 − n is an .( Fn )n -martingale. (b) Let .a, b be strictly positive integers and let .τa,b be the exit time of .(Sn )n from .] − a, b[. (b1) Compute .E(τa,b ). (b2) Let .τb = inf{n; Xn ≥ b} be the exit time of .(Xn )n from the half-line .]−∞, b[. We already know (Example 5.16) that .τb < +∞ a.s. Prove that .E(τb ) = +∞ (already proved in a different way in Exercise 5.13 (c)). Recall that we already know (Example 5.17 (a)) that .P(Sτa,b = −a) = a .P(Sτa,b = b) = a+b .
b a+b ,
5.15 (p. 369) Let .(Xn )n be a sequence of i.i.d. r.v.’s with .P(X = ±1) = 12 and let 3 .Sn = X1 + · · · + Xn and .Zn = Sn − 3nSn . Let .τ be the exit time of .(Sn )n from the interval .] − a, b[, a, b > 0. Recall that we already know that .τ is integrable and that b a .P(Sτ = −a) = a+b , .P(Sτ = b) = a+b . (a) Prove that Z is a martingale. (b1) Compute .Cov(Sτ , τ ) and deduce that if .a = b then .τ and .Sτ are not independent. (b2) Assume that .b = a. Prove that the r.v.’s .(Sn , τ ) and .(−Sn , τ ) have the same joint distributions and deduce that .Sτ and .τ are independent. 5.16 (p. 371) Let .(Xn )n be a sequence of i.i.d. r.v.’s such that .P(Xi = ±1) = 12 and let .S0 = 0, .Sn = X1 + · · · + Xn , . F0 = {Ω, ∅} and . Fn = σ (Xk , k ≤ n). Let a be a strictly positive integer and .τ = inf{n ≥ 0; Sn = a} be the first passage time of .(Sn )n at a. In this exercise and in the next one we continue to gather information about the passage times of the simple symmetric random walk.
(a) Show that, for every .θ ∈ R, Znθ =
.
eθSn (cosh θ )n
θ ) is bounded. is an .( Fn )n -martingale and that if .θ ≥ 0 then .(Zn∧τ n θ (b1) Show that, for every .θ ≥ 0, .(Zn∧τ )n converges a.s. and in .L2 to the r.v.
Wθ =
.
eθa 1{τ 1 be a positive integer and let .τ = inf{n ≥ 0; |Sn | = a} be the exit time of .(Sn )n from .] − a, a[. In this exercise we investigate the Laplace transform and the existence of moments of .τ . π π Let .λ ∈ R be such that .0 < λ < 2a . Note that, as .a > 1, .0 < cos 2a < cos λ < 1 (see Fig. 5.2) .
.
(a) Show that .Zn = (cos λ)−n cos(λSn ) is an .( Fn )n -martingale. (b) Show that 1 = E(Zn∧τ ) ≥ cos(λa)E[(cos λ)−n∧τ ] .
.
−2
− a
0
Fig. 5.2 The graph of the cosine function between ≥ cos(λa)
.cos(λSn∧τ )
a π .− 2
and
π . . 2
2
As .−λa ≤ λSn∧τ ≤ λa,
236
(c) (d1) (d2) (e)
5 Martingales
Deduce that .E[(cos λ)−τ ] ≤ (cos(λa))−1 and then that .τ is a.s. finite. Prove that .E(Zn∧τ ) →n→∞ E(Zτ ). Deduce that the martingale .(Zn∧τ )n is regular. Compute .E[(cos λ)−τ ]. What are the convergence abscissas of the Laplace transform of .τ ? For which values of p does .τ ∈ Lp ?
5.18 (p. 374) Let .(Un )n be a positive supermartingale such that .limn→∞ E(Un ) = 0. Prove that .limn→∞ Un = 0 a.s. 5.19 (p. 374) Let .(Yn )n≥1 be a sequence of .Z-valued integrable r.v.’s, i.i.d. and with common law .μ. Assume that • .E(Yi ) = b < 0, • .P(Yi = 1) > 0 but .P(Yi ≥ 2) = 0. Let .S0 = 0, .Sn = Y1 + · · · + Yn and W = sup Sn .
.
n≥0
The goal of this problem is to determine the law of W . Intuitively, by the Law of Large Numbers, .Sn →n→∞ −∞ a.s., being sums of independent r.v.’s with a strictly negative expectation. But, before sinking down, .(Sn )n may take an excursion on the positive side. How large? (a) Prove that .W < +∞ a.s. (b) Recall (Exercise 2.42) that for a real r.v. X, both its Laplace transform and its logarithm are convex functions. Let .L(λ) = E(eλY1 ) and .ψ(λ) = log L(λ). Prove that .ψ(λ) < +∞ for every .λ ≥ 0. What is the value of .ψ (0+)? Prove that .ψ(λ) → +∞ as .λ → +∞ and that there exists a unique .λ0 > 0 such that .ψ(λ0 ) = 0. (c) Let .λ0 be as in b). Prove that .Zn = eλ0 Sn is a martingale and that .limn→∞ Zn = 0 a.s. (d) Let .K ∈ N, .K ≥ 1 and let .τK = inf{n; Sn ≥ K} be the passage time of .(Sn )n at K. Prove that lim Zn∧τK = eλ0 K 1{τK 0 for every .x ∈ E. Let .(Xn )n≥1 be a sequence of i.i.d. E-valued r.v.’s having law q. Show that
.
Yn =
.
n p(Xk ) q(Xk )
k=1
is a positive martingale converging to 0 a.s. Is it regular? 5.22 (p. 376) Let .(Un )n be a sequence of real i.i.d. r.v.’s with common density with respect to the Lebesgue measure f (t) = 2(1 − t)
.
for 0 ≤ t ≤ 1
and .f (t) = 0 otherwise (it is a Beta.(1, 2) law). Let . F0 = {∅, Ω} and, for .n ≥ 1, Fn = σ (Uk , k ≤ n). For .q ∈]0, 1[ let
.
.
(a) (b) (c) (d)
X0 = q , Xn+1 = 12 Xn2 +
1 2
1[0,Xn ] (Un+1 )
(5.31)
n≥0.
Prove that .Xn ∈ [0, 1] for every .n ≥ 0. Prove that .(Xn )n is an .( Fn )n -martingale. Prove that .(Xn )n converges a.s. and in .L2 to an r.v. .X∞ and compute .E(X∞ ). Note that .Xn+1 − 12 Xn2 can only take the values 0 or . 12 and deduce that .X∞ can only take the values 0 or 1 a.s. and has a Bernoulli distribution of parameter q.
5.23 (p. 377) Let .P, .Q be probabilities on the measurable space .(Ω, F) and let ( Fn )n ⊂ F be a filtration. Assume that, for every .n > 0, the restriction .Q| F of n .Q to . Fn is absolutely continuous with respect to the restriction, .P| , of .P to . Fn . Let F .
n
Zn =
.
dQ| F
n
dP| F
·
n
(a) Prove that .(Zn )n is a martingale. (b) Prove that .Zn > 0 .Q-a.s. and that .(Zn−1 )n is a .Q-supermartingale. (c) Prove that if also .Q| F P| F , then .(Zn−1 )n is a .Q-martingale. n
n
5.24 (p. 378) Let .( Fn )n ⊂ F be a filtration on the probability space .(Ω, F, P). Let (Mn )n be a positive .( Fn )n -martingale such that .E(Mn ) = 1. Let, for every n,
.
dQn = Mn dP
.
be the probability on .(Ω, F) having density .Mn with respect to .P. (a) Assume that .(Mn )n is regular. Prove that there exists a probability .Q on .(Ω, F) such that .Q P and such that .Q|Fn = Qn . (b) Conversely, assume that such a probability .Q exists. Prove that .(Mn )n is regular. 5.25 (p. 378) Let .(Xn )n be a sequence of .N(0, 1)-distributed i.i.d. r.v.’s. Let .Sn = X1 + · · · + Xn and . Fn = σ (X1 , . . . , Xn ). Let, for .θ ∈ R, 1
Mn = eθSn − 2 nθ . 2
.
(a) Prove that .(Mn )n is an .( Fn )n -martingale and that .E(Mn ) = 1. (b) For .m > 0 let .Qm be the probability on .(Ω, F) having density .Mm with respect to .P. (b1) What is the law of .Xn with respect to .Qm for .n > m? (b2) What is the law of .Xn with respect to .Qm for .n ≤ m? 5.26 (p. 378) Let .(Xn )n be a sequence of independent r.v.’s on .(Ω, F, P) with .Xn ∼ N (0, an ). Let . Fn = σ (Xk , k ≤ n), .Sn = X1 + · · · + Xn , .An = a1 + · · · + an and 1
Zn = eSn − 2 An .
.
(a) Prove that .(Zn )n is an .( Fn )n -martingale. (b) Assume that .limn→∞ An = +∞. Compute .limn→∞ Zn a.s. Is .(Zn )n regular? (c1) Assume .limn→∞ An = A∞ < +∞. Prove that .(Zn )n is a regular martingale and determine the law of its limit. (c2) Let .Z∞ := limn→∞ Zn and let .Q be the probability on .(Ω, F, P) having density .Z∞ with respect to .P. What is the law of .Xk under .Q? Are the r.v.’s .Xn independent also with respect to .Q? 5.27 (p. 380) Let .(Xn )n be a sequence of i.i.d. .N(0, 1)-distributed r.v.’s and . Fn = σ (Xk , k ≤ n). (a) Determine for which values of .λ ∈ R the r.v. .eλXn+1 Xn is integrable and compute its expectation. (b) Let, for .|λ| < 1, Zn = λ
n
.
k=1
Xk−1 Xk .
Compute .
log E(eZn+1 | Fn )
and deduce an increasing predictable process .(An )n such that Mn = eZn −An
.
is a martingale. (c) Determine .limn→∞ Mn . Is .(Mn )n regular? 5.28 (p. 381) In this exercise we give a proof of the first part of Kolmogorov’s strong law, Theorem 3.12 using backward martingales. Let .(Xn )n be a sequence of i.i.d. integrable r.v.’s with .E(Xk ) = b. Let .Sn = X1 + · · · + Xn , .X n = n1 Sn and .
Bn = σ (Sn , Sn+1 , . . . ) = σ (Sn , Xn+1 , . . . ) .
(a1) Prove that for .1 ≤ k ≤ n we have E(Xk | Bn ) = E(Xk |Sn ) .
.
(a2) Deduce that, for .k ≤ n, E(Xk | Bn ) =
.
1 Sn . n
(b1) Prove that .(Xn )n is a .(Bn )n -backward martingale. (b2) Deduce that Xn
.
Recall Exercise 4.3.
a.s.
→ b.
n→∞
Chapter 6
Complements
In this chapter we introduce some important notions that might not find their place in a course for lack of time. Section 6.1 will introduce the problem of simulation and the related applications of the Law of Large Numbers. Sections 6.2 and 6.3 will give some hints about deeper properties of the weak convergence of probabilities.
6.1 Random Number Generation, Simulation

In some situations the computation of a probability or of an expectation is not possible analytically and the Law of Large Numbers provides numerical methods of approximation.
Example 6.1 It is sometimes natural to model the subsequent times between events (e.g. failure times) with i.i.d. r.v.’s, .Zi say, having a Weibull law. Recall (see also Exercise 2.9) that the Weibull law of parameters .α and .λ has density with respect to the Lebesgue measure given by λαt α−1 e−λt ,
.
t >0.
This means that the first failure occurs at time .Z1 , the second one at time .Z1 + Z2 and so on. What is the probability of monitoring more than N failures in the time interval .[0, T ]? This requires the computation of the probability P(Z1 + · · · + ZN ≤ T ) .
.
(6.1)
As no simple formulas concerning the d.f. of the sum of i.i.d. Weibull r.v.’s is available, a numerical approach is the following: we ask a computer to simulate n times batches of N i.i.d. Weibull r.v.’s and to keep account of how many times the event .Z1 + · · · + ZN ≤ T occurs. If we define Xi =
1
if Z1 + · · · + ZN ≤ T for the i-th simulation
0
otherwise
.
then the .Xi are Bernoulli r.v.’s of parameter .p = P(Z1 + · · · + ZN ≤ T ). Hence by the Law of Large Numbers, a.s., .
lim
n→∞
1 (X1 + · · · + Xn ) = E(X) = P(Z1 + · · · + ZN ≤ T ) . n
In other words, an estimate of the probability (6.1) is provided by the proportion of simulations that have given the result .Z1 +· · ·+ZN ≤ T (for n large enough, of course).
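A minimal sketch of this procedure in Python follows (numpy is assumed; the parameters α, λ, N and T are arbitrary illustration values of ours, and the Weibull r.v.'s are produced by inverting the d.f. F(t) = 1 − e^{−λ t^α} of the density above, in the spirit of the inversion method discussed below).

# Sketch (assumptions: Python/numpy; parameters chosen arbitrarily).
# Estimates P(Z1 + ... + ZN <= T) for i.i.d. Weibull failure times.
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 1.5, 2.0      # hypothetical Weibull parameters
N, T = 10, 4.0             # hypothetical number of failures and time horizon
n_sim = 200_000

U = rng.random((n_sim, N))
# inversion of F(t) = 1 - exp(-lam * t**alpha)
Z = ((-np.log(1.0 - U)) / lam) ** (1.0 / alpha)
p_hat = np.mean(Z.sum(axis=1) <= T)
print("estimated probability:", p_hat)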
In order to effectively take advantage of this technique we must be able to instruct a computer to simulate sequences of r.v.’s with a prescribed distribution. High level software is available (scilab, matlab, mathematica, python,. . . ), which provides routines that give sequences of “independent” random numbers with the most common distributions, e.g. Weibull. These software packages are usually interpreted, which is not a suitable feature when dealing with a large number of iterations, for which a compiled program (such as FORTRAN or C, for example) is necessary, being much faster. These compilers usually only provide routines which produce sequences of numbers that can be considered independent and uniformly distributed on .[0, 1]. In order to produce sequences of random numbers with an exponential law, for instance, it will be necessary to devise an appropriate procedure starting from uniformly distributed ones. Entire books have been committed to this kind of problem (see e.g. [10, 14, 15]). Useful information has also been gathered in [22]. In this section we shall review some ideas in this direction, mostly in the form of examples. The first method to produce random numbers with a given distribution is to construct a map .Φ such that, if X is uniform on .[0, 1], then .Φ(X) has the target distribution. To be precise the problem is: given an r.v. X uniform on .[0, 1] and a discrete or continuous probability .μ, find a map .Φ such that .Φ(X) has law .μ. If .μ is a probability on .R and has a d.f. F which is continuous and strictly increasing, and therefore invertible, then the choice .Φ = F −1 solves the problem.
Actually as the d.f. of X is
x →
.
⎧ ⎪ ⎪ ⎨0 x
⎪ ⎪ ⎩1
if x < 0 if 0 ≤ x ≤ 1 if x > 1
and as .0 ≤ F (t) ≤ 1, then, for .0 < t < 1, P(F −1 (X) ≤ t) = P(X ≤ F (t)) = F (t)
.
so that the d.f. of the r.v. .F −1 (X) is indeed F . Example 6.2 Uniform law on an interval .[a, b]: its d.f. is ⎧ ⎪ 0 ⎪ ⎪ ⎪ ⎨x − a .F (x) = ⎪ b−a ⎪ ⎪ ⎪ ⎩1
if x < a if a ≤ x ≤ b if x > b
and therefore .F −1 (y) = a + (b − a)y, for .0 ≤ y ≤ 1. Hence if X is uniform on .[0, 1], .a + (b − a)X is uniform on .[a, b].
Example 6.3 Exponential law of parameter .λ. Its d.f. is F (t) =
.
0 1 − e−λt
if t < 0 if t ≥ 0 .
F is therefore invertible .R+ → [0, 1[ and .F −1 (x) = − λ1 log(1 − x). Hence if X is uniform on .[0, 1] then .− λ1 log(1 − X) is exponential of parameter .λ. The method of inverting the d.f. is however useless when the inverse .F −1 does not have an explicit expression, as is the case with Gaussian laws, or for probabilities on .Rd . The following examples provide other approaches to the problem.
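Before turning to these other approaches, here is a sketch of the inversion method for the two examples above in Python (numpy assumed; the function names are ours, not a library API).

# Sketch of the inversion method (assumption: Python/numpy).
import numpy as np

rng = np.random.default_rng(2)

def uniform_ab(a, b, size):
    # Example 6.2: a + (b - a) X with X uniform on [0, 1]
    return a + (b - a) * rng.random(size)

def exponential(lam, size):
    # Example 6.3: F^{-1}(x) = -(1/lam) log(1 - x)
    return -np.log(1.0 - rng.random(size)) / lam

print(uniform_ab(2.0, 5.0, 3))
print(exponential(0.5, 3))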
Example 6.4 (Gaussian Laws) As always let us begin with an .N(0, 1); its d.f. F is only numerically computable and for .F −1 there is no explicit expression. A simple algorithm to produce an .N(0, 1) law starting from a uniform .[0, 1] is provided in Example 2.19: if W and T are independent r.v.’s respectively exponential of parameter . 12 and uniform on .[0, 2π ], then the r.v. .X = √ W cos T is .N(0, 1)-distributed. As W and T can be simulated as explained in the previous examples, the .N(0, 1) distribution is easily simulated. This is the Box-Müller algorithm. Other methods for producing Gaussian r.v.’s can be found in the book of Knuth [18]. For simple tasks the fast algorithm of Exercise 3.27 can also be considered. Starting from the simulation of an .N(0, 1) law, every Gaussian law, real or d-dimensional, can easily be obtained using affine-linear transformations.
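A sketch of the Box-Müller recipe just described, in Python (numpy assumed), together with the affine transformation producing a generic Gaussian law; the parameters of the final transformation are arbitrary illustration values.

# Sketch of the Box-Muller algorithm (assumption: Python/numpy).
import numpy as np

rng = np.random.default_rng(3)

def standard_normal(size):
    W = -2.0 * np.log(1.0 - rng.random(size))   # exponential of parameter 1/2
    T = 2.0 * np.pi * rng.random(size)          # uniform on [0, 2*pi]
    return np.sqrt(W) * np.cos(T)               # N(0,1)-distributed

X = standard_normal(5)
print(X)
print(3.0 + 2.0 * X)    # affine transformation: N(3, 4)-distributed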
Example 6.5 How can we simulate an r.v. taking the values .x1 , . . . , xm with probabilities .p1 , . . . , pm respectively? Let .q0 = 0, .q1 = p1 , .q2 = p1 + p2 ,. . . , .qm−1 = p1 + · · · + pm−1 and .qm = 1. The numbers .q1 , . . . , qm split the interval .[0, 1] into sub-intervals having amplitude .p1 , . . . , pm respectively. If X is uniform on .[0, 1], let Y = xi
.
if
qi−1 ≤ X < qi .
(6.2)
Obviously Y takes the values .x1 , . . . , xm and P(Y = xi ) = P(qi−1 ≤ X < qi ) = qi − qi−1 = pi ,
.
i = 1, . . . , m ,
so that Y has the required law.
This method, theoretically, can be used in order to simulate any discrete distribution. Theoretically. . . see, however, Example 6.8 below. The following examples suggest simple algorithms for the simulation of some discrete distributions.
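A sketch of the procedure of Example 6.5 in Python (numpy assumed; the values xi and the probabilities pi below are arbitrary).

# Sketch of Example 6.5 (assumption: Python/numpy): sampling a finite discrete law.
import numpy as np

rng = np.random.default_rng(4)
x = np.array([-1.0, 0.0, 2.5])     # arbitrary values x1, ..., xm
p = np.array([0.2, 0.5, 0.3])      # their probabilities p1, ..., pm
q = np.concatenate(([0.0], np.cumsum(p)))   # q0 = 0, q1, ..., qm = 1

def sample(size):
    X = rng.random(size)
    # Y = x_i when q_{i-1} <= X < q_i
    idx = np.searchsorted(q, X, side="right") - 1
    return x[idx]

print(sample(10))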
Example 6.6 How can we simulate a uniform distribution on the finite set {1, 2, . . . , m}? The idea of the previous Example 6.5 is easily put to work noting that, if X is uniform on .[0, 1], then mX is uniform on .[0, m], so that
.
Y = mX + 1
.
(6.3)
is uniform on .{1, . . . , m}.
Example 6.7 (Binomial Laws) Let .X1 , . . . , Xn be independent numbers uniform on .[0, 1] and let, for .0 < p < 1, .Zi = 1{Xi ≤p} . Obviously .Z1 , . . . , Zn are independent and .P(Zi = 1) = P(Xi ≤ p) = p, .P(Zi = 0) = P(p < Xi ≤ 1) = 1 − p. Therefore .Zi ∼ B(1, p) and .Y = Z1 + · · · + Zn ∼ B(n, p).
Example 6.8 (Simulation of a Permutation) How can we simulate a random deck of 52 cards? To be precise, we want to simulate a random element in the set E of all permutations of .{1, . . . , 52} in such a way that all permutations are equally likely. This is a discrete r.v., but, given the huge cardinality of E (.52! 8·1067 ), the method of Example 6.5 is not feasible. What to do? In general the following algorithm is suitable in order to simulate a permutation on n elements. (1) Let us denote by .x0 the vector .(1, 2, . . . , n). Let us choose at random a number between 1 and n, with the methods of Example 6.5 or, better, of Example 6.6. If .r0 is this number let us switch in the vector .x0 the coordinates with index .r0 and n. Let us denote by .x1 the resulting vector: .x1 has the number .r0 as its n-th coordinate and n as its .r0 -th coordinate. (2) Let us choose at random a number between 1 and .n − 1, .r1 say, and let us switch, in the vector .x1 , the coordinates with indices .r1 and .n − 1. Let us denote this new vector by .x2 . (3) Iterating this procedure, starting from a vector .xk , let us choose at random a number .rk in .{1, . . . , n−k} and let us switch the coordinates .rk and .n−k. Let .xk+1 denote the new vector.
(4) Let us stop when .k = n−1. The coordinates of the vector .xn−1 are now the numbers .{1, . . . , n} in a different order, i.e. a permutation of .(1, . . . , n). It is rather immediate that the permutation .xn−1 can be any permutation of .(1, . . . , n) with a uniform probability.
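In Python the algorithm above (often referred to as the Fisher-Yates shuffle) might be sketched as follows; the indices here run from 0 to n − 1 rather than from 1 to n.

# Sketch of the permutation algorithm of Example 6.8 (assumption: Python).
import random

def random_permutation(n):
    x = list(range(1, n + 1))
    for k in range(n - 1, 0, -1):
        r = random.randint(0, k)   # choose a position among x[0], ..., x[k]
        x[r], x[k] = x[k], x[r]    # swap it with x[k]
    return x

random.seed(5)
print(random_permutation(10))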
Example 6.9 (Poisson Laws) The method of Example 6.5 cannot be applied to Poisson r.v.’s, which can take infinitely many possible values. A possible way of simulating these laws is the following. Let .(Zn )n be a sequence of i.i.d. exponential r.v.’s of parameter .λ and let .X = k if k is the largest positive integer such that .Z1 + · · · + Zk ≤ 1, i.e. Z1 + · · · + Zk ≤ 1 < Z1 + · · · + Zk+1 .
.
The d.f. of the r.v. X obtained in this way is P(X ≤ k − 1) = P(Z1 + · · · + Zk > 1) = 1 − Fk (1) .
.
The d.f., .Fk , of .Z1 + · · · + Zk , which is Gamma.(k, λ)-distributed, is Fk (x) = 1 − e−λ
k−1 (λx)i
.
i=1
i!
,
hence P(X ≤ k − 1) = e−λ
k−1 i λ
.
i=1
i!
,
which is the d.f. of a Poisson law of parameter .λ. This algorithm works for Poisson law, as we know how to sample exponential laws. However, this method has the drawback that one cannot foresee in advance how many exponential r.v.’s will be needed.
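A sketch of this algorithm in Python (numpy assumed; the value of λ is arbitrary).

# Sketch of the Poisson algorithm of Example 6.9 (assumption: Python/numpy).
import numpy as np

rng = np.random.default_rng(6)

def poisson(lam):
    # X = largest k such that Z1 + ... + Zk <= 1, the Zi exponential of parameter lam
    k, total = 0, 0.0
    while True:
        total += -np.log(1.0 - rng.random()) / lam
        if total > 1.0:
            return k
        k += 1

sample = [poisson(3.0) for _ in range(10_000)]
print("empirical mean:", np.mean(sample), "(should be close to 3)")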
We still do not know how to simulate a Weibull law, which is necessary in order to tackle Example 6.1. This question is addressed in Exercise 6.1 a).
The following proposition introduces a new idea for producing r.v.’s with a uniform distribution on a subset of .Rd .
Proposition 6.10 (The Rejection Method) Let .R ∈ B(Rd ) and .(Zn )n a sequence of i.i.d. r.v.’s with values in R and .D ⊂ R a Borel set such that .P(Zi ∈ D) = p > 0. Let .τ be the first index i such that .Zi ∈ D, i.e. .τ = inf{i; Zi ∈ D}, and let X=
.
Zk
if τ = k
any x0 ∈ D
if τ = +∞ .
Then, if .A ⊂ D, P(X ∈ A) =
.
P(Z1 ∈ A) · P(Z1 ∈ D)
In particular, if .Zi is uniform on R then X is uniform on D.
Proof First note that .τ has a geometric law of parameter p so that .τ < +∞ a.s. If A ⊂ D then, noting that .X = Zk if .τ = k,
.
P(X ∈ A) =
∞
.
P(X ∈ A, τ = k)
k=1
=
∞
P(Zk ∈ A, τ = k) =
k=1
=
∞
∞
P(Zk ∈ A, Z1 ∈ D, . . . , Zk−1 ∈ D)
k=1
P(Zk ∈ A)P(Z1 ∈ D) . . . P(Zk−1 ∈ D) =
k=1
= P(Z1 ∈ A)
∞
P(Zk ∈ A)(1 − p)k−1
k=1 ∞
(1 − p)k−1 = P(Z1 ∈ A) ×
k=1
P(Z1 ∈ A) 1 = · p P(Z1 ∈ D)
Example 6.11 How can we simulate an r.v. that is uniform on a domain .D ⊂ R2 ? This construction is easily adapted to uniform r.v.’s on domains of .Rd . Note beforehand that this is easy if D is a rectangle .[a, b]×[c, d]. Indeed we know how to simulate independent r.v.’s X and Y that are uniform respectively on .[a, b] and .[c, d] (as explained in Example 6.2). It is clear therefore that the pair .(X, Y ) is uniform on .[a, b] × [c, d]: indeed the densities of X and Y with respect to the Lebesgue measure are respectively
fX (x) =
.
⎧ ⎪ ⎨
1 b−a
⎪ ⎩0
if a ≤ x ≤ b otherwise
fY (y) =
⎧ ⎪ ⎨
1 d −c
⎪ ⎩0
if c ≤ y ≤ d otherwise
so that the density of .(X, Y ) is
f (x, y) =
.
⎧ ⎪ ⎨
1 (b − a)(d − c)
⎪ ⎩0
if (x, y) ∈ [a, b] × [c, d] otherwise,
which is the density of an r.v. which is uniform on the rectangle .[a, b] × [c, d]. If, in general, .D ⊂ R2 is a bounded domain, Proposition 6.10 allows us to solve the problem with the following algorithm: if R is a rectangle containing D, (1) simulate first an r.v. .(X, Y ) uniform on R as above; (2) let us check whether .(X, Y ) ∈ D. If .(X, Y ) ∈ D go back to (1); if .(X, Y ) ∈ D then the r.v. .(X, Y ) is uniform on D. For instance, in order to simulate a uniform distribution on the unit ball of 2 .R , the steps to perform are the following: (1) first simulate r.v.’s .X1 , X2 uniform on .[0, 1] and independent; then let .Y1 := 2X1 −1, .Y2 := 2X2 −1, so that .Y1 and .Y2 are uniform on .[−1, 1] and independent; .(Y1 , Y2 ) is therefore uniform on the square .[−1, 1] × [−1, 1]; (2) check whether .(Y1 , Y2 ) belongs to the unit ball .{x 2 + y 2 ≤ 1}. In order to do this just compute .W = Y12 + Y22 ; if .W > 1 we go back to (1) for two new values .X1 , X2 ; if .W ≤ 1 instead, then .(Y1 , Y2 ) is uniform on the unit ball.
Example 6.12 (Monte Carlo Methods) Let .f : [0, 1] → R be a bounded Borel function and .(Xn )n a sequence of i.i.d. r.v.’s uniform on .[0, 1]. Then .(f (Xn ))n is also a sequence of independent r.v.’s, each of them having mean .E[f (X1 )]; by the Law of Large Numbers therefore 1 f (Xk ) n n
.
k=1
a.s.
→
n→∞
E[f (X1 )] .
(6.4)
But we have also
1
E[f (X1 )] =
f (x) dx .
.
0
These remarks suggest a method of numerical computation of the integral of f : just simulate n random numbers .X1 , X2 , . . . uniformly distributed on .[0, 1] and then compute 1 f (Xk ) . n n
.
k=1
This quantity for n large is an approximation of 1 . f (x) dx . 0
More generally, if f is a bounded Borel function on a bounded Borel set D ⊂ Rd , then its integral can be approximated numerically in a similar way: if .X1 , X2 , . . . are i.i.d. r.v.’s uniform on D, then .
1 . f (Xk ) n n
k=1
a.s.
→
n→∞
1 |D|
f (x) dx . D
This algorithm of computation of integrals is a typical example of a Monte Carlo method. These methods are in general much slower than the classical algorithms of numerical integration, but they are much simpler to implement and are particularly useful in dimension larger than 1, where numerical methods become very complicated or downright unfeasible. Let us be more precise: let 1 f (Xk ) , n k=1 1 f (x) dx , I := |D| D n
In : =
.
σ 2 : = Var(f (Xn )) < +∞ , then the Central Limit Theorem states that, weakly, .
1 √ (I n − I ) σ n
→
n→∞
N(0, 1) .
If we denote by .φβ the quantile of order .β of a .N(0, 1) distribution, from this
relation it is easy to derive that, for large n, . I n − √σn φ1−α/2 , I n + √σn φ1−α/2 is a confidence interval for I of level .α. This gives the appreciation that the error of .I n as an estimate of the integral I is of order . √1n . This is rather slow, but independent of the dimension d.
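A minimal sketch of such a Monte Carlo computation in Python (numpy assumed; the integrand below is an arbitrary example and 1.96 is the quantile φ0.975 of the N(0, 1) distribution, giving an approximate 95% confidence interval).

# Sketch of a Monte Carlo estimate of an integral over [0, 1]
# (assumption: Python/numpy; the integrand f is an arbitrary example).
import numpy as np

rng = np.random.default_rng(8)

def f(x):
    return np.exp(-x * x)       # any bounded Borel function would do

n = 1_000_000
values = f(rng.random(n))
I_n = values.mean()                                    # (1/n) sum of f(X_k)
half_width = 1.96 * values.std(ddof=1) / np.sqrt(n)    # approximate 95% confidence half-width
print("estimate:", I_n, "+/-", half_width)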
Example 6.13 (On the Rejection Method) Assume that we are interested in the simulation of a law on .R having density f with respect to the Lebesgue measure. Let us now present a method that does not require a tractable d.f. We shall restrict ourselves to the case of a bounded function f (.f ≤ M say) having its support contained in a bounded interval .[a, b]. The region below the graph of f is contained in the rectangle .[a, b]×[0, M]. By the method of Example 6.11 let us produce a 2-dimensional r.v. .W = (X, Y ) uniform in the subgraph .A = {(x, y); a ≤ x ≤ b, 0 ≤ y ≤ f (x)}: then X has density f . Indeed
t
P(X ≤ t) = P((X, Y ) ∈ At ) = λ(At ) =
f (s) ds ,
.
a
where .At is the intersection (shaded in Fig. 6.1) of the subgraph A of f and of the half plane .{x ≤ t}.
So far we have been mostly concerned with real-valued r.v.’s. The next example considers a more complicated target space. See also Exercise 6.2.
Fig. 6.1 The area of the shaded region is equal to the d.f. of X computed at t
Example 6.14 (Sampling of an Orthogonal Matrix) Sometimes applications require elements in a compact group to be chosen randomly “uniformly”. How can we rigorously define this notion? Given a locally compact topological group G there always exists on .(G, B(G)) a Borel measure .μ that is invariant under translations, i.e. such that, for every .A ∈ B(G) and .g ∈ G, μ(gA) = μ(A) ,
.
where .gA = { ∈ G; = gh for some h ∈ A} is “the set A translated by the action of g”. This is a Haar measure of G (see e.g. [16]). If G is compact it is possible to choose .μ so that it is a probability and, with this constraint, such a .μ is unique. To sample an element of G with this distribution is a way of choosing an element with a “uniform distribution” on G. Let us investigate closely how to simulate the random choice of an element of the group of rotations in d dimensions, .O(d), with the Haar distribution. The starting point of the forthcoming algorithm is the QR decomposition: every .d × d matrix M can be decomposed in the form .M = QR, where Q is orthogonal and R is an upper triangular matrix. This decomposition is unique under the constraint that the entries on the diagonal of R are positive. The algorithm is very simple: generate a .d × d matrix M with i.i.d. .N(0, 1)distributed entries and let .M = QR be its QR decomposition. Then Q has the Haar distribution of .O(d). This follows from the fact that if .g ∈ O(d), then the two matrix-valued r.v.’s M and gM have the same distribution, owing to the invariance of the Gaussian laws under orthogonal transformations: if QR is the QR decomposition of M, then .gQ R is the QR decomposition of gM. Therefore .gQR ∼ QR and .gQ ∼ Q by the uniqueness of the QR decomposition. This provides an easily
implementable algorithm: the QR decomposition is already present in the high level computation packages mentioned above and in the available C libraries.
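A sketch of this algorithm in Python (numpy assumed). One caveat: numpy's QR routine does not by itself enforce the positivity of the diagonal of R, so the sketch multiplies the columns of Q by the signs of the corresponding diagonal entries of R in order to respect the uniqueness constraint mentioned above.

# Sketch of the sampling of a Haar-distributed orthogonal matrix (assumption: Python/numpy).
import numpy as np

rng = np.random.default_rng(9)

def haar_orthogonal(d):
    M = rng.standard_normal((d, d))      # i.i.d. N(0,1) entries
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))          # make the diagonal of R positive
    signs[signs == 0.0] = 1.0
    return Q * signs                     # multiplies the columns of Q by the signs

Q = haar_orthogonal(4)
print(np.round(Q @ Q.T, 10))             # should be (numerically) the identity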
Note that the algorithms described so far are not the only ones available for the respective tasks. In order to sample a random rotation there are other possibilities, for instance simulating separately the Euler angles that characterize each rotation. But this requires some additional knowledge on the structure of rotations.
6.2 Tightness and the Topology of Weak Convergence

In the investigation that follows we consider probabilities on a Polish space, i.e. a metric space that is complete and separable, or, to be precise, a metric separable space whose topology is defined by a complete metric. This means that Polish-ness is a topological property.
Definition 6.15 A family . T of probabilities on .(E, B(E)) is tight if for every ε > 0 there exists a compact set K such that .μ(K) ≥ 1 − ε for every .μ ∈ T.
.
A family of probabilities .K on .(E, B(E)) is said to be relatively compact if for every sequence .(μn )n ⊂ K there exists a subsequence converging weakly to some probability .μ on .(E, B(E)).
Theorem 6.16 Suppose that E is separable and complete (i.e. Polish). If a family .K of probabilities on .(E, B(E)) is relatively compact, then it is tight.
Proof Recall that in a complete metric space relative compactness is equivalent to total boundedness: a set is totally bounded if and only if, for every ε > 0, it can be covered by a finite number of open sets having diameter smaller than ε.
Let (G_n)_n be a sequence of open sets increasing to E and let us prove first that for every ε > 0 there exists an n such that μ(G_n) > 1 − ε for all μ ∈ K. Otherwise, for each n we would have μ_n(G_n) ≤ 1 − ε for some μ_n ∈ K. By the assumed relative compactness of K, there would be a subsequence (μ_{n_k})_k ⊂ K such that μ_{n_k} → μ as k → ∞, for some probability μ on (E, B(E)). But this is not possible: by the portmanteau theorem, see (3.19), we would have, for every n,

μ(G_n) ≤ lim inf_{k→∞} μ_{n_k}(G_n) ≤ lim inf_{k→∞} μ_{n_k}(G_{n_k}) ≤ 1 − ε ,

from which

μ(E) = lim_{n→∞} μ(G_n) ≤ 1 − ε ,

so that μ cannot be a probability.
As E is separable there is, for each k, a sequence U_{k,1}, U_{k,2}, ... of open balls of radius 1/k covering E. Let n_k be large enough so that the open set G_{n_k} = ∪_{j=1}^{n_k} U_{k,j} is such that

μ(G_{n_k}) ≥ 1 − ε 2^{−k}   for every μ ∈ K .

The set

A = ∩_{k=1}^∞ G_{n_k} = ∩_{k=1}^∞ ∪_{j=1}^{n_k} U_{k,j}

is totally bounded, hence relatively compact. As, for every μ ∈ K,

μ(A^c) = μ(∪_{k=1}^∞ G_{n_k}^c) ≤ Σ_{k=1}^∞ μ(G_{n_k}^c) ≤ ε Σ_{k=1}^∞ 2^{−k} = ε ,

we have μ(A) ≥ 1 − ε. The closure of A is a compact set K satisfying the requirement.
Note that a Polish space need not be locally compact. The following (almost) converse to Theorem 6.16 is especially important.
Theorem 6.17 (Prohorov's Theorem) If E is a separable metric space and T is a tight family of probabilities on (E, B(E)), then it is also relatively compact.

We shall skip the proof of Prohorov's theorem (see [2], Theorem 5.1, p. 59). Note that it holds under weaker assumptions than those made in Theorem 6.16 (no completeness assumption).
Let us denote by P the family of probabilities on the Polish space (E, B(E)). Let us define, for μ, ν ∈ P,

ρ(μ, ν) = inf{ε > 0; μ(A) ≤ ν(A^ε) + ε and ν(A) ≤ μ(A^ε) + ε for every A ∈ B(E)} ,

where A^ε = {x ∈ E; d(x, A) ≤ ε} is the neighborhood of radius ε of A.
Theorem 6.18 Let (E, B(E)) be a Polish space. Then
• ρ is a distance on P, the Prohorov distance,
• P endowed with this distance is also a Polish space,
• μ_n → μ weakly as n → ∞ if and only if ρ(μ_n, μ) → 0 as n → ∞.

See again [2], Theorem 6.8, p. 83 for a proof. Therefore we can speak of "the topology of weak convergence", which makes P a metric space, and Prohorov's Theorem 6.17 gives a characterization of the relatively compact sets for this topology.
Example 6.19 (Convergence of the Empirical Distributions) Let (X_n)_n be an i.i.d. sequence of r.v.'s with values in the Polish space (E, B(E)) having common law μ. Then the maps ω → δ_{X_n(ω)} are r.v.'s Ω → P, being the composition of the measurable maps X_n : Ω → E and x → δ_x, which is continuous E → P (see Example 3.24 a)). Let

μ_n = (1/n) Σ_{k=1}^n δ_{X_k} ,

which is a sequence of r.v.'s with values in P. For every bounded measurable function f : E → R we have

∫_E f dμ_n = (1/n) Σ_{k=1}^n f(X_k)

and by the Law of Large Numbers

lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = E[f(X_1)] = ∫_E f dμ   a.s.

Assuming, in addition, f continuous, this gives that the sequence of random probabilities (μ_n)_n converges a.s., as n → ∞, to the constant r.v. μ in the topology of weak convergence.
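A quick numerical illustration of the previous example (our own; we take for μ the exponential law of parameter 1 and the bounded continuous function f = cos, for which ∫ f dμ = 1/2):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10**2, 10**4, 10**6):
    x = rng.exponential(1.0, size=n)   # an i.i.d. sample with law μ = Exp(1)
    print(n, np.mean(np.cos(x)))       # ∫ cos dμ_n, approaches 1/2 as n grows
```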
Example 6.20 Let μ be a probability on the Polish space (E, B(E)) and let C = {ν ∈ P; H(ν; μ) ≤ M}, H denoting the relative entropy (or Kullback-Leibler divergence) defined in Exercise 2.24, p. 105. In this example we see that C is a tight family.
Recall that H(ν; μ) = +∞ if ν is not absolutely continuous with respect to μ and, writing Φ(t) = t log t for t ≥ 0,

H(ν; μ) = ∫_E Φ(dν/dμ) dμ

if ν ≪ μ. As lim_{t→+∞} Φ(t)/t = +∞, the family of densities H = {dν/dμ; ν ∈ C} is uniformly integrable in (E, B(E), μ) by Proposition 3.35. Hence (Proposition 3.33) for every ε > 0 there exists a δ > 0 such that if μ(A) ≤ δ then ν(A) = ∫_A (dν/dμ) dμ ≤ ε. As the family {μ} is tight, for every ε > 0 there exists a compact set K ⊂ E such that μ(K^c) ≤ δ. Then we have, for every probability ν ∈ C,

ν(K^c) = ∫_{K^c} (dν/dμ) dμ ≤ ε ,

therefore proving that the level sets C of the relative entropy are tight.
6.3 Applications

In this section we see two typical applications of Prohorov's theorem. The first one is the following enhanced version of P. Lévy's Theorem, as announced in Chap. 3, p. 132.
Theorem 6.21 (P. Lévy Revisited) Let (μ_n)_n be a sequence of probabilities on R^d. If (μ̂_n)_n converges pointwise to a function κ and if κ is continuous at 0, then κ is the characteristic function of a probability μ and (μ_n)_n converges weakly to μ.
Proof The idea of the proof is simple: in Proposition 6.23 below we prove that the condition "(μ̂_n)_n converges pointwise to a function κ that is continuous at 0" implies that the sequence (μ_n)_n is tight. By Prohorov's Theorem every subsequence of (μ_n)_n has a subsequence that converges weakly to a probability μ. Necessarily μ̂ = κ, which proves simultaneously that κ is a characteristic function and that μ_n → μ as n → ∞.
First, we shall need the following lemma, which states that the regularity of the characteristic function at the origin gives information concerning the behavior of the probability at infinity.
Lemma 6.22 Let μ be a probability on R. Then, for every t > 0,

μ(|x| > 1/t) ≤ (C/t) ∫_0^t (1 − ℜ μ̂(θ)) dθ

for some constant C > 0 independent of μ.
Proof We have

(1/t) ∫_0^t (1 − ℜ μ̂(θ)) dθ = (1/t) ∫_0^t dθ ∫_{−∞}^{+∞} (1 − cos θx) dμ(x)
  = ∫_{−∞}^{+∞} dμ(x) (1/t) ∫_0^t (1 − cos θx) dθ = ∫_{−∞}^{+∞} (1 − sin(tx)/(tx)) dμ(x) .

Note that the use of Fubini's Theorem is justified, all integrands being positive. As 1 − (sin y)/y ≥ 0, we have

· · · ≥ ∫_{|x| ≥ 1/t} (1 − sin(tx)/(tx)) dμ(x) ≥ μ(|x| > 1/t) × inf_{|y|≥1} (1 − (sin y)/y)

and the proof is completed with C = (inf_{|y|≥1} (1 − (sin y)/y))^{−1}.
Proposition 6.23 Let (μ_n)_n be a sequence of probabilities on R^d. If (μ̂_n)_n converges pointwise to a function κ and if κ is continuous at 0, then the family (μ_n)_n is tight.
Proof Let us assume first d = 1. Lemma 6.22 gives

lim sup_{n→∞} μ_n(|x| > 1/t) ≤ lim sup_{n→∞} (C/t) ∫_0^t (1 − ℜ μ̂_n(θ)) dθ

and by Lebesgue's Theorem

lim sup_{n→∞} μ_n(|x| > 1/t) ≤ (C/t) ∫_0^t (1 − ℜ κ(θ)) dθ .

Let us fix ε > 0 and let t_0 > 0 be such that 1 − ℜ κ(θ) ≤ ε/C for 0 ≤ θ ≤ t_0, which is possible as κ is assumed to be continuous at 0. Setting R_0 = 1/t_0 we obtain

lim sup_{n→∞} μ_n(|x| > R_0) ≤ ε ,

i.e. μ_n(|x| ≥ R_0) ≤ 2ε for every n larger than some n_0. As the family formed by a single probability μ_k is tight, for every k = 1, ..., n_0 there are positive numbers R_1, ..., R_{n_0} such that

μ_k(|x| ≥ R_k) ≤ 2ε

and taking R = max(R_0, ..., R_{n_0}) we have μ_n(|x| ≥ R) ≤ 2ε for every n.
Let d > 1: we have proved that for every j, 1 ≤ j ≤ d, there exists a compact set K_j such that μ_{n,j}(K_j^c) ≤ ε for every n, where we denote by μ_{n,j} the j-th marginal of μ_n. Now just note that K := K_1 × · · · × K_d is a compact set and

μ_n(K^c) ≤ μ_{n,1}(K_1^c) + · · · + μ_{n,d}(K_d^c) ≤ dε .
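The continuity of κ at 0 cannot be dispensed with. A small numerical illustration (our own): for μ_n = N(0, n) the characteristic functions e^{−nθ²/2} converge pointwise to the indicator of {0}, which is not continuous at 0, and indeed the sequence is not tight: the mass of any fixed interval [−R, R] goes to 0.

```python
import numpy as np
from math import erf

R = 10.0
for n in (1, 10, 100, 1000):
    mass = erf(R / np.sqrt(2 * n))   # μ_n([-R, R]) for μ_n = N(0, n)
    print(n, mass)                   # decreases to 0: no tightness
```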
Example 6.24 Let E, G be Polish spaces. Let (μ_n)_n be a sequence of probabilities on (E, B(E)) converging weakly to some probability μ and let (ν_n)_n be a sequence of probabilities on (G, B(G)) converging weakly to some probability ν. Is it true that

μ_n ⊗ ν_n → μ ⊗ ν   as n → ∞ ?
We have already met this question when E and G are Euclidean spaces (Exercise 3.14), where characteristic functions allowed us to conclude the result easily. In this setting we can argue using Prohorov's Theorem (both implications). As the sequence (μ_n)_n converges weakly, it is tight and, for every ε > 0, there exists a compact set K_1 ⊂ E such that μ_n(K_1) ≥ 1 − ε for every n. Similarly there exists a compact set K_2 ⊂ G such that ν_n(K_2) ≥ 1 − ε for every n. Therefore

μ_n ⊗ ν_n(K_1 × K_2) = μ_n(K_1) ν_n(K_2) ≥ (1 − ε)² ≥ 1 − 2ε .

As K_1 × K_2 ⊂ E × G is a compact set, the sequence (μ_n ⊗ ν_n)_n is tight and every subsequence has a further subsequence (μ_{n_k} ⊗ ν_{n_k})_k converging to some probability γ on (E × G, B(E × G)). Let us prove that necessarily γ = μ ⊗ ν. For every pair of bounded continuous functions f_1 : E → R, f_2 : G → R we have
∫_{E×G} f_1(x) f_2(y) dγ(x, y) = lim_{k→∞} ∫_{E×G} f_1(x) f_2(y) dμ_{n_k}(x) dν_{n_k}(y)
  = lim_{k→∞} ∫_E f_1(x) dμ_{n_k}(x) ∫_G f_2(y) dν_{n_k}(y) = ∫_E f_1(x) dμ(x) ∫_G f_2(y) dν(y)
  = ∫_{E×G} f_1(x) f_2(y) dμ ⊗ ν(x, y) .
By Proposition 1.33 necessarily .γ = μ ⊗ ν and the result follows thanks to the sub-sub-sequences Criterion 3.8 applied to the sequence .(μn ⊗ νn )n in the Polish space .P(E × G) endowed with the Prohorov metric.
The previous example and the enhanced P. Lévy theorem are typical applications of tightness and of Prohorov's Theorem: in order to prove the weak convergence of a sequence of probabilities, first prove tightness and then devise some argument in order to identify the limit. This approach is especially useful for the convergence of stochastic processes, which the reader may encounter in more advanced courses.
Exercises

6.1 (p. 382) Devise a procedure for the simulation of the following probability distributions on R.

(a) A Weibull distribution with parameters α, λ.
(b) A Gamma(α, λ) distribution with α semi-integer, i.e. α = k/2 for some k ∈ N.
(c) A Beta(α, β) distribution with α, β both half-integers.
(d) A Student t(n).
(e) A Laplace distribution of parameter λ.
(f) A geometric law with parameter p.
(a) see Exercise 2.9, (c) see Exercise 2.20(b), (e) see Exercise 2.43, (f) see Exercise 2.12(a).

6.2 (p. 382) (A uniform r.v. on the sphere) Recall (or take it as granted) that the normalized Lebesgue measure of the sphere S^{d−1} of R^d is characterized as being the unique probability on S^{d−1} that is invariant with respect to rotations. Let X be an N(0, I)-distributed d-dimensional r.v. Prove that the law of the r.v.

Z = X / |X|

is the normalized Lebesgue measure of the sphere.
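(The statement translates directly into a sampler; a minimal NumPy sketch, our own illustration with a function name of our choosing:)

```python
import numpy as np

def uniform_on_sphere(d, n, rng=np.random.default_rng()):
    """n points uniform on the unit sphere S^{d-1}: normalize N(0, I) vectors."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

pts = uniform_on_sphere(3, 5)
print(np.linalg.norm(pts, axis=1))   # all equal to 1
```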
6.3 (p. 382) For every α > 0 let us consider the probability density with respect to the Lebesgue measure

f(t) = α / (1 + t)^{α+1} ,   t > 0 .   (6.5)
(a) Determine a function Φ : ]0, 1[ → R such that if X is an r.v. uniform on ]0, 1[ then Φ(X) has density f.
(b) Let Y be a Gamma(α, 1)-distributed r.v. and X an r.v. having a conditional law given Y = y that is exponential with parameter y. Determine the law of X and devise another method in order to simulate an r.v. having a law with density (6.5) with respect to the Lebesgue measure.
Chapter 7
Solutions
1.1 Let D ⊂ E be a dense countable subset and 𝒟 the family of open balls with center in D and rational radius. 𝒟 is a countable family of open sets. Let A ⊂ E be an open set. For every x ∈ A ∩ D, let B_x be an open ball centered at x and with a rational radius small enough so that B_x ⊂ A. A is then the union (countable, obviously) of these open balls. Hence the σ-algebra generated by 𝒟 contains all open sets and therefore also the Borel σ-algebra, which is the smallest one enjoying the property of containing the open sets.

1.2 (a) Every open set of R is a countable union of open intervals (this is also a particular case of Exercise 1.1). Thus the σ-algebra generated by the open intervals, B_1 say, contains all open sets of R hence also the Borel σ-algebra B(R). This concludes the proof, as the opposite inclusion is obvious.
(b) We have, for every a < b,

]a, b[ = ∪_{n=1}^∞ ]a, b − 1/n] .

Thus the σ-algebra generated by the half-open intervals, B_2 say, contains all open intervals, hence also B(R) thanks to (a). Conversely,

]a, b] = ∩_{n=1}^∞ ]a, b + 1/n[ .

Hence B(R) contains all half-open intervals and also B_2.
(c) The σ-algebra generated by the open half-lines ]a, ∞[, B_3 say, contains, by complementation, the half-lines of the form ]−∞, b] and, by intersection, the half-open intervals ]a, b]. Thanks to (b), B_3 ⊃ B(R). The opposite inclusion is obvious.
(d) Just a repetition of the arguments above.
1.3 (a) We know (see p. 4) that every real continuous map is measurable with respect to the Borel σ-algebra B(E). Therefore B₀(E), which is the smallest σ-algebra enjoying this property, is contained in B(E).
(b) In a metric space the function "distance from a point" is continuous. Hence, for every x ∈ E and r > 0, the open ball with radius r and centered at x belongs to B₀(E), being the pullback of the interval ]−∞, r[ by the map y → d(x, y). As every open set of E is a countable union of these balls (see Exercise 1.1), B₀(E) also contains all open sets and therefore also the Borel σ-algebra B(E).

1.4 Let us check the three properties of σ-algebras. (i) S ∈ E_S as S = E ∩ S. (ii) If B ∈ E_S then B is of the form B = A ∩ S for some A ∈ E and therefore its complement in S is

S \ B = A^c ∩ S .

As A^c ∈ E, the complement set S \ B belongs to E_S. (iii) Finally, if (B_n)_n ⊂ E_S, then each B_n is of the form B_n = A_n ∩ S for some A_n ∈ E. Hence

∪_{n=1}^∞ B_n = ∪_{n=1}^∞ (A_n ∩ S) = (∪_{n=1}^∞ A_n) ∩ S

and, as ∪_n A_n ∈ E, also ∪_n B_n ∈ E_S.
1.5 (a) We have seen already (p. 4) that the functions

lim sup_{n→∞} f_n   and   lim inf_{n→∞} f_n

are measurable. L is the set where these two functions coincide and is therefore measurable.
(b) If the sequence (f_n)_n takes values in a metric space G, the set of the points x for which the Cauchy condition is satisfied can be written

H := ∩_{ℓ=1}^∞ ∪_{n=1}^∞ ∩_{m,k≥n} {x ∈ E; d(f_m(x), f_k(x)) ≤ 1/ℓ} .

The distance function d : G × G → R is continuous, so that all sets appearing in the definition of H are measurable. If G is also complete, then H = L = {x; lim_{n→∞} f_n(x) exists} is measurable.

1.6 (a) Immediate, as Φ ◦ f = lim_{n→∞} Φ ◦ f_n and the functions Φ ◦ f_n are real-valued.
(b) Let D ⊂ G be a countable dense subset and let us denote by B_z(r) the open ball centered at z ∈ D and with radius r. Then if Φ(x) = d(x, z) we have f^{−1}(B_z(r)) = (Φ ◦ f)^{−1}([0, r[). Hence f^{−1}(B_z(r)) ∈ E. Every open set of (G, d) is the (countable) union of balls B_z(r) with z ∈ D and radius r ∈ Q. Hence f^{−1}(A) ∈ E for every open set A ⊂ G and the proof is complete thanks to Remark 1.5.

1.7 (a) This is a rather intuitive inequality as, if the events were disjoint, we would have an equality. A first way of proving this rigorously is to trace back to a sequence of disjoint sets to which σ-additivity can be applied, following the same idea as in Remark 1.10(b). To be precise, recursively define

B_1 = A_1 ,   B_2 = A_2 \ A_1 ,   ... ,   B_n = A_n \ ∪_{k=1}^{n−1} A_k ,   ...

The B_n are pairwise disjoint and B_1 ∪ · · · ∪ B_n = A_1 ∪ · · · ∪ A_n, therefore

∪_{n=1}^∞ A_n = ∪_{n=1}^∞ B_n .

Moreover B_n ⊂ A_n, so that

μ(∪_{n=1}^∞ A_n) = μ(∪_{n=1}^∞ B_n) = Σ_{n=1}^∞ μ(B_n) ≤ Σ_{n=1}^∞ μ(A_n) .

There is a second method, which is simpler but uses the integral and Beppo Levi's Theorem. If A = ∪_{n=1}^∞ A_n, then clearly

1_A ≤ Σ_{k=1}^∞ 1_{A_k} ,

as the sum on the right-hand side certainly takes a value which is ≥ 1 on A. Now we have, thanks to Corollary 1.22(a),

μ(A) = ∫_E 1_A dμ ≤ ∫_E Σ_{k=1}^∞ 1_{A_k} dμ = Σ_{k=1}^∞ ∫_E 1_{A_k} dμ = Σ_{k=1}^∞ μ(A_k) .

(b) Immediate as, thanks to (a),

μ(A) ≤ Σ_{n=1}^∞ μ(A_n) = 0 .
c (c) If .A ∈ A then obviously ⊂ A and .μ(An ) = 0 for every .A ∈ A. If .(An )n n then, thanks to (b), also .μ( n An ) = 0, hence . n An ∈ A. Otherwise, if there exists an .n0 such that .μ(Acn0 ) = 0, then ∞ c ≤ μ(Acn0 ) = 0 μ An
.
n=1
and again . n An ∈ A. 1.8 (a) Let .(xn )n ⊂ F be a sequence converging to some .x ∈ E and let us prove that .x ∈ F . If .r > 0 then the ball .Bx (r) contains at least one of the .xn (actually infinitely many of them). Hence it also contains a ball .Bxn (r ), for some .r > 0. Hence .μ(Bx (r)) > μ(Bxn (r )) > 0, as .xn ∈ F . Hence also .x ∈ F . (b1) Let .D ⊂ E be a dense subset. For every .x ∈ D ∩ F c there exists a neighborhood .Vx of x such that .μ(Vx ) = 0 and that we can assume to be disjoint from F , which is closed. .F c is then the (countable) union of such .Vx ’s for .x ∈ D and is a negligible set, being the countable union of negligible sets (Exercise 1.7(b)). (b2) If .F1 is a closed set strictly contained in F such that .μ(F1c ) = 0, then there exist .x ∈ F \ F1 and .r > 0 such that .Bx (r) ⊂ F1c . But then we would have .μ(Bx (r)) = 0, in contradiction with the fact that .x ∈ F . 1.9 (a) We have, for every .n ∈ N, |f | ≥ n1{|f |=+∞}
.
and therefore
|f | dμ ≥ nμ(|f | = +∞) .
.
E
As this relation holds for every n, if .μ(f = +∞) > 0 we would have . |f | dμ = +∞, in contradiction with the integrability of .|f |. (b) Let, for every positive integer n, .An = {f ≥ n1 }. Obviously .f ≥ n1 1An and therefore
1 1 1An dμ = μ(An ) . . f dμ ≥ n n E E Hence .μ(An ) = 0 for every n. Now {f > 0} =
∞
.
n=1
{f ≥ n1 } =
∞
An ,
n=1
hence .{f > 0} is negligible, being the countable union of negligible sets (Exercise 1.7(b)).
(c) Let .An = {f ≤ − n1 }. Then
1 f dμ ≤ − μ(An ) . n An
.
Therefore as we assume that . A f dμ ≥ 0 for every .A ∈ E, necessarily .μ(An ) = 0 for every n. But {f < 0} =
∞
.
An
n=1
hence again .{f < 0} is negligible, being the countable union of negligible sets. 1.10 By Beppo Levi’s Theorem we have
|f | dμ = lim ↑
.
n→∞
E
|f | ∧ n dμ . E
But, for every n, .|f | ∧ n ≤ n 1N , so that
|f | ∧ n dμ ≤ n μ(N) = 0 .
.
E
Taking .n → ∞, Beppo Levi’s Theorem gives . E |f | dμ = 0, hence also . E f dμ = 0. • In particular the integral of a function taking the value .+∞ on a set of measure 0 and vanishing elsewhere is equal to 0. 1.11 (a) Let .μ be the measure on .N defined as μ(n) = wn .
.
With this definition we can write
φ(t) =
.
N
e−tx dμ(x) .
Let us check the conditions of Theorem 1.21 (derivation under the integral sign) for the function .f (t, x) = e−tx . Let .a > 0 be such that .I =]a, +∞[ is a half-line containing t. Then ∂f (t, x) = |x|e−tx ≤ |x|e−ax := g(x) . ∂t
.
(7.1)
g is integrable with respect to .μ as
.
N
g(x) dμ(x) =
∞
nwn e−an
n=1
and the series on the right-hand side is summable. Thanks to Theorem 1.21, for every .a > 0, .φ is differentiable in .]a, +∞[ and φ (t) =
∞
∂f (t, x) dμ(x) = − nwn e−tn . ∂t
.
N
n=1
(b) If .wn+ = wn ∨ 0, .wn− = −wn ∧ 0, then the two sequences .(wn+ )n , .(wn− )n are positive and φ(t) =
∞
.
wn+ e−tn −
n=1
∞
wn− e−tn := φ + (t) − φ − (t)
n=1
and now both .φ + and .φ − are differentiable thanks to (a) above and (1.34) follows. (c1) Just consider the measure on .N μ(n) =
√
.
n.
In order to repeat the argument of (a) we just have to check that the function g of (7.1) is integrable with respect to the new measure .μ, i.e. that ∞ .
n3/2 e−an < +∞ ,
n=1
which is immediate. (c2) Again the answer is positive provided that ∞ .
√
n −an
e
< +∞ .
n e− 12 an
· e− 2 an . As
ne
(7.2)
n=1 √
Now just write .n e
n e−an
√
= ne
√
.
lim n e
n→∞
1
n − 12 an
e
=0 1
the general term of the series in (7.2) is bounded above, for n large, by .e− 2 an , which is the general term of a convergent series.
1.12 There are many possible solutions of this exercise, of course. (a) Let us choose .E = R, . E = B(R) and .μ =Lebesgue’s measure. If .An = [n, +∞[ then .A = n An = ∅, so that .μ(A) = 0 whereas .μ(An ) = +∞ for every n. (b) Let .(E, E, μ) be as in (a). Let .fn = −1[n,+∞] . We have .fn ↑ 0 as .n → ∞, but the integral of the .fn is equal to .−∞ for every n. 1.13 Let .A ∈ G be such that . μ(A) = 0. Then .μ(Φ −1 (A)) = μ(A) = 0 hence also −1 −1 (A)) = 0. .ν(Φ (A)) = 0, so that .ν(A) = ν(Φ 1.14 (a) If .(An )n ⊂ B([0, 1]) is a sequence of disjoint sets, then • if .λ(An ) = 0 for every n then also .λ( n An ) = 0, therefore ∞ .μ An = 0 and n=1
∞
μ(An ) = 0 .
n=1
• If, instead, .λ(An ) > 0 for some n, then also .λ( ∞ μ An = +∞
.
and
n=1
∞
n An )
> 0 and
μ(An ) = +∞ ,
n=1
so that in any case the .σ -additivity of .μ is satisfied. (b) Of course if .μ(A) = 0 then also .λ(A) = 0 so that .λ μ. If a density f of .λ with respect to .μ existed we would have, for every .A ∈ B([0, 1]),
λ(A) =
f dμ .
.
A
But this is not possible because the integral on the right-hand side can only take the values 0 (if .1A f = 0 .μ-a.e.) or .+∞ (otherwise). The hypotheses of the Radon-Nikodym theorem are not satisfied here (.μ is not .σ -finite). 1.15 (a1) Assume, to begin with, .p < +∞. Denoting by M an upper bound of the Lp norms of the .fn (the sequence is bounded in .Lp ), Fatou’s Lemma gives
.
|f |p dμ ≤ lim
.
E
n→∞ E
|fn |p dμ ≤ M p
hence .f ∈ Lp . The case .p = +∞ is rather obvious but, to be precise, let M be again an upper bound of the norms .fn ∞ . This means that if .An = {|fn | > M} then .μ(An ) = 0 for every n. We obtain immediately that outside .A = n An , which is also negligible, .|f | ≤ M .μ-a.e.
(a2) Counterexample: .μ = the Lebesgue measure of .R, .fn = 1[n,n+1] . Every fn has, for every p, .Lp norm equal to 1 and .(fn )n converges to 0 a.e. but certainly not in .Lp , as .fn p ≡ 1 and .Lp convergence entails convergence of the .Lp -norms (Remark 1.30). (b) We have .gn → g a.e. as .n → ∞. As .|gn | ≤ |g| and by the obvious bound .|g − gn | ≤ |g| + |gn | ≤ 2|g|, we have by Lebesgue’s Theorem .
|gn − g|p dμ
.
E
→
n→∞
0.
1.16 (a1) Let .p < q. If .|x| ≤ 1, then .|x|p ≤ 1; if conversely .|x| ≥ 1, then p ≤ |x|q . Hence, in any case, .|x|p ≤ 1 + |x|q . If .p ≤ q and .f ∈ Lq , then .|x| p q .|f | ≤ 1 + |f | and we have
p q .f p = |f |p dμ ≤ (1 + |f |q ) dμ ≤ μ(E) + f q , E
E
hence .f ∈ Lp . (a2) If .p → q−, then .|f |p → |f |q a.e. Moreover, thanks to a1), .|f |p ≤ 1 + |f |q . As .|f |q and the constant function 1 are integrable (.μ is finite), by Lebesgue’s Theorem
. lim |f |p dμ = |f |q dμ . p→q− E
E
(a3) Again we have .|f |p → |f |q a.e. as .p → q−, and by Fatou’s Lemma
.
|f | dμ ≥
|f |q dμ = +∞ .
p
lim p→q− E
E
(a4) (1.37) follows by Fatou’s Lemma again. Moreover, if .f ∈ Lq0 for some q0 > q, then for .q ≤ p ≤ q0 we have .|f |p ≤ 1 + |f |q0 and (1.38) follows by Lebesgue’s Theorem. (a5) Let .μ be the Lebesgue measure. The function
.
f (x) =
.
1 x log2 x
1[0, 1 ] (x) 2
is integrable (a primitive of .x → (x log2 x)−1 is .x → (− log x)−1 ). But .|f |p = (x p log2p x)−1 is not integrable at 0 for any .p > 1. Therefore, .f 1 < +∞, whereas .limp→1+ f p = +∞. (b1) As .|f | ≤ f ∞ a.e.
p p .f p = |f |p dμ ≤ f ∞ μ(E) E
which gives lim f p ≤ f ∞ lim μ(E)1/p = f ∞ .
.
p→+∞
p→+∞
(b2) We have .|f |p ≥ |f |p 1{|f |≥M} ≥ M p 1{|f |≥M} . Hence
|f |p dμ ≥
.
E
E
M p 1{|f |≥M} dμ = M p μ(|f | ≥ M) .
(7.3)
If .M < f ∞ , then .μ(|f | ≥ M) > 0 and by (7.3) .
lim f p ≥ lim M μ(|f | ≥ M)1/p = M .
p→+∞
p→+∞
By the arbitrariness of .M < f ∞ and (b1) .
lim f p = f ∞ .
p→+∞
1.17 An element of .p is a sequence .(an )n such that ∞ .
|an |p < +∞ .
(7.4)
n=1
If .(an )n ∈ p then necessarily .|an | →n→∞ 0, hence .|an | ≤ 1 for n larger than some n0 . If .q ≥ p then .|an |q ≤ |an |p for .n ≥ n0 and the series with general term .|an |q is bounded above eventually by the series with general term .|an |p .
.
1.18 We have
.
0
+∞
+∞
1 1 −tx e sin x dx = e−tx dx cos(xy) dy x 0 0
1 +∞ = dy cos(xy) e−tx dx . 0
0
Integrating by parts we find
+∞
.
0
x=+∞ y +∞ 1 cos(xy) e−tx dx = − e−tx cos(xy) − sin(xy) e−tx dx x=0 t t 0 x=+∞ y 2 +∞ y −tx 1 = + 2e sin(xy) − 2 cos(xy) e−tx dx , x=0 t t t 0
from which
y 2 +∞ 1 1+ 2 cos(xy) e−tx dx = t t 0
.
and
+∞
.
cos(xy) e−tx dx =
0
t · + y2
t2
Therefore, with the change of variable .z = yt ,
+∞
.
0
1 sin x e−tx dx = x
1
t dy = 2 t + y2
0
1/t 0
1 1 dz = arctan · 2 t 1+z
Of course we can apply Fubini’s Theorem as .(x, y) → cos(xy) e−tx is integrable on .R+ × [0, 1]. As .t → 0+ the integral converges to . π2 . 1.19 We must prove that the integral . Rd |f (y)| |g(x − y)| dy is finite for almost every x. Note first that this integral is well defined, the integrand being positive. By Fubini’s Theorem 1.34
. dx |f (y)| |g(x − y)| dy = |f (y)| dy |g(x − y)| dx Rd
Rd
=
Rd
Rd
|f (y)| dy
Rd
Rd
|g(x)| dx = f 1 g1 .
Hence .(x, y) → f (y)g(x − y) is integrable and, again by Fubini’s Theorem (this is (1.30), to be precise)
x →
.
Rd
f (y)g(x − y) dy
is an a.e. finite measurable function of .L1 . Moreover
.f ∗ g1 = |f ∗ g(x)| dx = dx
≤
Rd
Rd
dx
Rd
Rd
Rd
f (y)g(x − y) dy
f (y)g(x − y) dy = f 1 g1 .
2.1 We have ∞ ∞ ∞ c =1−P P An = 1 − P An Acn = 1
.
n=1
n=1
n=1
as the events .Acn are negligible and a countable union of negligible events is also negligible (Exercise 1.7). 2.2 Let us denote by D a dense subset of E. (a) Let us consider the countable set of the balls .Bx ( n1 ) centered at .x ∈ D and with radius . n1 . As the events .{X ∈ Bx ( n1 )} belong to . G, their probability can be equal to 0 or to 1 only. As their union is equal to E, for every n there exists at least an .xn ∈ D such that .P(X ∈ Bxn ( n1 )) = 1. (b) Let .An = Bx1 (1) ∩ · · · ∩ Bxn ( n1 ). .(An )n is clearly a decreasing sequence of measurable subsets of E, .An has diameter .≤ n2 , as .An ⊂ Bxn ( n1 ), and the event 1 .{X ∈ An } has probability 1, being the intersection of the events .{X ∈ Bxk ( )}, k .k = 1, . . . , n, all of them having probability 1. (c) The set A=
∞
.
An
n=1
has diameter 0 and therefore is formed by a single .x0 ∈ E or is .= ∅. But, as the sequence .(An )n is decreasing, P(X ∈ A) = lim P(X ∈ An ) = 1 .
.
n→∞
Hence A is non-void and is formed by a single .x0 . We conclude that .X = x0 with probability 1. 2.3 (a) We have, for every .k > 0,
{Z = +∞} = sup Xn = +∞ = sup Xn = +∞ ,
.
n≥k
n≥1
hence the event .{Z = +∞} belongs to the tail .σ -algebra of the sequence .(Xn )n and by Kolmogorov’s 0-1 law, Theorem 2.15, can only have probability 0 or 1. If .P(Z ≤ a) > 0, necessarily .P(Z = +∞) < 1 hence .P(Z = +∞) = 0. (b1) Let .a > 0. As the events .{supk≤n Xk ≤ a} decrease to .{Z ≤ a} as .n → ∞, we have ∞ n .P(Z ≤ a) = lim P sup Xk ≤ a = lim P(Xk ≤ a) = (1 − e−λk a ) . n→∞
k≤n
n→∞
k=1
k=1
The product converges to a strictly positive number if and only if the series ∞infinite −λk a is convergent (see Proposition 3.4 p. 119, in case this fact was not e k=1 already known). In this case
.
272
7 Solutions ∞ .
e−λk a =
k=1
∞ 1 · ka k=1
If .a > 1 the series is convergent, hence .P(Z ≤ a) > 0 and, thanks to (a), .Z < +∞ a.s. (b2) Let .K > 0. As .{supk≤n Xk ≥ K} ⊂ {Z ≥ K}, we have, for every .n ≥ 1, P(Z > K) ≥ P sup Xk > K = 1 − P sup Xk ≤ K
.
k≤n
k≤n
= 1 − P X1 ≤ K, . . . , Xn ≤ K = 1 − P(X1 ≤ K)n = 1 − (1 − e−cK )n . As this holds for every n, .P(Z > K) = 1 for every .K > 0 hence .Z = +∞ a.s. 2.4 By assumption
E[|X + Y |] =
∞ ∞
.
∞
∞
|x + y| dμX (x) dμY (y) < +∞ .
By Fubini’s Theorem for .μY -almost every y we have
∞
.
∞
|x + y| dμX (x) < +∞ ,
hence .E(|y + X|) < +∞ for at least one .y ∈ R and X is integrable, being the sum of the integrable r.v.’s .y + X and .−y. By symmetry Y is also integrable. 2.5 (a) For every bounded measurable function .φ : Rd → R, we have
E[φ(X + Y )] =
.
=
=
Rd
dν(y)
Rd
Rd
Rd
dν(y)
Rd
Rd
φ(x + y) dμ(x) dν(y)
φ(x + y)f (x) dx
φ(z)f (z − y) dz =
Rd
φ(z) dz
Rd
f (z − y) dν(y) , :=g(z)
which means that .X + Y has density g with respect to the Lebesgue measure dz. (b) Let us try to apply the derivation theorem of an integral depending on a parameter, Proposition 1.21. By assumption ∂f (z − y) < M ∂zi
.
for some constant M, as we assume boundedness of the partial derivatives of f . The constants being integrable with respect to .ν, the condition of Proposition 1.21 is satisfied and we deduce that g is also differentiable and .
∂g (z) = ∂zi
∂f (z − y) dν(y) . ∂zi
Rd
(7.5)
This proves (b) for .k = 1. Derivation under the integral sign applied to (7.5) proves (b) for .k = 2 and iterating this argument the result follows by induction. • Recalling that the law of .X + Y is the convolution .μ ∗ ν, this exercise shows that “convolution regularizes”. 2.6 (a) If .An := {|x| > n} then . ∞ n=1 An = ∅, so that .limn→∞ μ(An ) = 0 and .μ(An ) < ε for n large. (b) Let .ε > 0. We must prove that there exists an .M > 0 such that .|g(x)| < ε for .|x| > M. Let us choose .M = M1 + M2 , with .M1 and .M2 as in the statement of the exercise. We have then .|g(x)| = f (x − y) μ(dy) Rd
≤
{|y|≤M1 }
|f (x − y)| μ(dy) +
{|y|>M1 }
|f (x − y)| μ(dy) := I1 + I2 .
We have .I2 ≤ f ∞ μ({|y| > M1 }) ≤ εf ∞ . Moreover, if .|x| ≥ M = M1 + M2 and .|y| ≤ M1 then .|x − y| ≥ M2 so that .|f (x − y)| ≤ ε. Putting things together we have, for .|x| > M, |g(x)| ≤ ε(1 + f ∞ ) ,
.
from which the result follows thanks to the arbitrariness of .ε. 2.7 If .X ∼ N(0, 1), 1 2 E(etX ) = √ 2π
+∞
.
−∞
etx e−x 2
2 /2
1 dx = √ 2π
The integral clearly diverges if .t ≥ 12 . If .t <
+∞
.
−∞
e−x
2 ( 1 −t) 2
dx =
+∞ −∞
1 2
+∞
−∞
e−x
2 ( 1 −t) 2
dx .
instead just write
exp −
x2 dx . 2(1 − 2t)−1
We recognize in the integrand, but for the constant, the density of a Gaussian law with mean 0 and variance .(1 − 2t)−1 . Hence for .t < 12 the integral is equal to √ −1/2 and .E(etX2 ) = (1 − 2t)−1/2 . . 2π (1 − 2t)
274
7 Solutions 2
Recalling that if .X ∼ N(0, 1) then .Z = σ X ∼ N(0, σ 2 ), we have .E(etZ ) = 2 2 E(etσ X ) and in conclusion 2
E(etZ ) =
.
⎧ ⎨+∞ ⎩√
if t ≥ 1
if t
0. We have, thanks to the integration rule with respect to an image measure, Proposition 1.27, b+σ X 1 .E (xe − K)+ = √ 2π
+∞ −∞
xeb+σ z − K
+
1 2
e− 2 z dz .
The integrand vanishes if .xeb+σ z − K < 0, i.e. if z ≤ ζ :=
.
K 1 log − b , σ x
hence, with a few standard changes of variable,
+∞ b+σ z 1 2 1 xe E (xeb+σ X − K)+ = √ − K e− 2 z dz 2π ζ
+∞
+∞ 1 2 1 2 x K =√ eb+σ z− 2 z dz − √ e− 2 z dz 2π ζ 2π ζ 1 2 +∞ 1 xeb+ 2 σ 2 = √ e− 2 (z−σ ) dz − K 1 − Φ(ζ ) 2π ζ 1 2 +∞ 1 2 xeb+ 2 σ = √ e− 2 z dz − KΦ(−ζ ) 2π ζ −σ
.
1
2
= xeb+ 2 σ Φ(−ζ + σ ) − KΦ(−ζ ) . Finally note that as .σ X ∼ −σ X, .E[(xeb+σ X − K)+ ] = E[(xeb+|σ |X − K)+ ]. 2.9 (a) Let us first compute the d.f. With the change of variable .s α = u, .αs α−1 ds = du we find for the d.f. F of f , for .t > 0,
F (t) =
.
0
t
λαs α−1 e−λs ds = α
tα 0
λe−λu du = 1 − e−λt . α
(7.6)
As
+∞
.
−∞
f (s) ds = lim
t
t→+∞ −∞
f (s) ds = lim F (t) = 1 , t→+∞
f is a probability density with respect to the Lebesgue measure. (b1) If X is exponential with parameter .λ we have, recalling the values of the constants for the Gamma laws,
+∞
E(X ) = λ
.
β
t β e−λt dt =
0
λΓ (β + 1) Γ (β + 1) = · λβ λβ+1
(7.7)
The d.f., G say, of .Xβ is, for .t > 0, G(t) = P(Xβ ≤ t) = P(X ≤ t 1/β ) = 1 − e−λt
1/β
.
,
so that, comparing with (7.6), .Xβ is Weibull with parameters .λ and .α = β1 . (b2) Thanks to (b1) a Weibull r.v. Y with parameters .α, .λ is of the form .X1/α , where X is exponential with parameter .λ; thanks to (7.7), for .β = α1 and .β = α2 , we have E(Y ) = E(X1/α ) =
.
E(Y 2 ) = E(X2/α ) =
Γ (1 + α1 ) , λ1/α Γ (1 + α2 ) λ2/α
and for the variance Var(Y ) = E(Y 2 ) − E(Y )2 =
.
Γ (1 + α2 ) − Γ (1 + α1 )2 · λ2/α
(c) Just note that .Γ (1 + 2t) − Γ (1 + t)2 is the variance of a Weibull r.v. with parameters .λ = 1 and .α = 1t . Hence it is a positive quantity. 2.10 The density of X is obtained from the joint density as explained in Example 2.16:
fX (x) =
+∞
.
−∞
f (x, y) dy = (θ + 1) eθx
+∞
1 θ (1 + θ1 ) (eθx
= eθx
1 1
(eθx )1+ θ
1
(eθx + eθy − 1)2+ θ y=+∞ 1 1+ θ1 y=0 θy + e − 1) 0
= −(θ + 1) eθx
eθy
= e−x .
dy
Hence X is exponential of parameter 1. By symmetry Y has the same density. Note that the marginals do not depend on .θ . 2.11 (a1) We have, for .t ≥ 0, P(− log X ≤ t) = P(X ≥ e−t ) = 1 − e−t ,
.
hence .− log X is an exponential Gamma.(1, 1)-distributed r.v. (a2) .W = − log X − log Y is therefore Gamma.(2, 1)-distributed and its d.f. is, again for .t ≥ 0, FW (t) = 1 − e−t − te−t .
.
Hence the d.f. of .XY = e−W is, for .0 < s ≤ 1, F (s) = P(e−W ≤ s) = P(W ≥ − log s) = 1 − FW (− log s) = s − s log s
.
and, taking the derivative, the density of XY is f (s) = − log s
.
for 0 < s ≤ 1 .
(b) The r.v.’s XY and Z are independent and their joint law has a density with respect to the Lebesgue measure that is the tensor product of their densities. We have, for .z ∈ [0, 1], P(Z 2 ≤ z) = P(Z ≤
.
√ √ z) = z
and, taking the derivative, the density of .Z 2 is 1 fZ 2 (z) = √ 2 z
.
0 0, 1 λk+1 k −λt t e . n k! k=0 n−1
g(t) =
.
∼ Gamma(k+1,λ)
(d) We have
+∞
.
0
1 tg(t) dt = b
+∞ 0
1 t F (t) dt = b
+∞
t P(X > t) dt 0
and, recalling again Remark 2.1, 1 σ 2 + b2 σ2 b E(X2 ) = = + · 2b 2b 2b 2 2.14 We must compute the image, .ν say, of the probability .
··· =
dμ(θ, φ) =
.
1 sin θ dθ dφ, 4π
(θ, φ) ∈ [0, π ] × [0, 2π ]
under the map .(θ, φ) → cos θ . Let us use the method of the dumb function: let ψ : [−1, 1] → R be a bounded measurable function, by the integration formula with respect to an image measure, Proposition 1.27, we have
.
ψ(t) dν(t) =
.
=
ψ(cos θ ) dμ(θ, φ) =
1 2
π
1 4π
ψ(cos θ ) sin θ dθ =
0
2π
π
ψ(cos θ ) sin θ dθ
dφ 0
1 2
0 1
−1
ψ(u) du ,
i.e. .ν is the uniform distribution on .[−1, 1]. In some sense all points of the interval [−1, 1] are “equally likely”.
.
• One might wonder what the answer to this question would be for the spheres of d .R for other values of d. Exercise 2.15 gives an answer for .d = 2 (i.e. the circle). 2.15 First approach: let us compute the d.f. of W : for .−1 ≤ t ≤ 1 FW (t) = P(W ≤ t) = P(cos Z ≤ t) = P(Z ≥ arccos t) =
.
1 (π − arccos t) π
(recall that .arccos is decreasing). Hence fW (t) =
.
1 , √ π 1 − t2
−1 ≤ t ≤ 1 .
(7.8)
Second approach: the method of the dumb function: let .φ : R → R be a bounded Borel function, then
1 π .E[φ(cos Z)] = φ(cos θ ) dθ . π 0 Let .t = cos θ , so that .θ = arccos t and .dθ = −(1 − t 2 )−1/2 dt. Recall that .arccos is the inverse of the .cos function restricted to the interval .[0, π ] and therefore taking values in the interval .[−1, 1]. This gives
E[φ(cos Z)] =
.
1
1 φ(t) √ dt π 1 − t2 −1
i.e. (7.8). 2.16 (a) The integral of f on .R2 must be equal to 1. In polar coordinates and with the change of variable .r 2 = u, we have
1=
+∞ +∞
.
−∞
−∞
f (x, y) dx dy = 2π
+∞
+∞
g(r )r dr = π 2
0
g(u) du . 0
(b1) We know (Example 2.16) that X has density, with respect to the Lebesgue measure,
fX (x) =
+∞
.
−∞
g(x 2 + y 2 ) dy
(7.9)
and obviously this quantity is equal to the corresponding one for .fY . (b2) Thanks to (7.9) the density .fX is an even function, therefore X is symmetric and .E(X) = 0. Obviously also .E(Y ) = 0. (b3) We just need to compute .E(XY ), as we already know that X and Y are centered. We have, again in polar coordinates and recalling that .x = r cos θ , .y = r sin θ ,
E(XY ) =
+∞ +∞
.
−∞
=
0
−∞
xy g(x 2 + y 2 ) dx dy
2π
sin θ cos θ dθ
+∞
g(r 2 )r 3 dr .
0
=0
+∞ 1 Note that the integral . 0 g(r 2 )r 3 dr is finite, as it is equal to . 2π E(X2 + Y 2 ). 1
1
1 −2 r 1 − 2 (x +y ) If .g(r) = 2π e , then .f (x, y) = 2π e can be split into the tensor product of a function of x times a function of y, hence X and Y are independent (and are each .N(0, 1)-distributed). If .f = π1 1C , where C is the ball of radius 1, X and Y are not independent: as can be seen by looking at Fig. 7.1, the marginal densities are both strictly positive on the interval .[−1, 1] so that their product gives strictly positive probability to the areas near the corners, which are of probability 0 for the joint distribution. 2
2
• It is a classical result of Bernstein that a probability on .Rd which is invariant under rotations and whose components are independent is necessarily Gaussian (see e.g. [7], p. 82). (c1) For every bounded Borel function .φ : R → R we have
+∞ +∞ X .E[φ( )] = dy φ( xy )g(x 2 + y 2 ) dx . Y −∞
−∞
Fig. 7.1 The rounded triangles near the corners have probability 0 for the joint density but strictly positive probability for the product of the marginals
With the change of variable .z = xy , .|y| dz = dx in the inner integral we have
.
··· =
−∞
=
=
+∞
dy
+∞ −∞
+∞
−∞
φ(z) dz
+∞ −∞
φ(z)g y 2 (1 + z2 ) |y| dz
+∞ −∞ +∞
φ(z) dz
g y 2 (1 + z2 ) |y| dy
2g y 2 (1 + z2 ) y dy .
0
√
Replacing .y 1 + z2 = u, .dy = (1 + z2 )−1/2 du, we have
.
··· =
−∞
=
=
+∞
−∞
φ(z)
+∞
0
+∞ −∞
+∞
φ(z) dz φ(z)
1 dz 1 + z2
1 1 + z2 +∞
0
du u 2g(u2 ) √ √ 1 + z2 1 + z2
+∞ dz 2g(u2 )u du 0
g(u) du =
+∞ −∞
φ(z)
1 dz π(1 + z2 )
and the result follows. (c2) Just note that the pair .(X, Y ) has a density of the type (2.85), so that this is a situation as in (c1) and . X Y has a Cauchy law. Y (c3) Just note that in (c2) both . X Y and . X have a Cauchy distribution. 2.17 (a) .Q is a measure (Theorem 1.28) as X is a density, being positive and integrable. Moreover, .Q(Ω) = E(X) = 1 so that .Q is a probability. (b1) As obviously .X1{X=0} = 0, we have .Q(X = 0) = E(X1{X=0} ) = 0.
(b2) As the event .{X > 0} has probability 1 under .Q, we have, for every .A ∈ F, 1 Q 1 .P(A) = E 1A = EQ 1A∩{X>0} = E[1A∩{X>0} ] = P(A ∩ {X > 0}) X X and therefore .P is a probability if and only if .P(X > 0) = 1. In this case .P = P and dP 1 = and .P Q. Conversely, if .P Q, then, as .Q(X = 0), then also .P(X = 0). dQ X (c) For every bounded Borel function .φ : R → R we have
+∞ Q .E [φ(X)] = E[Xφ(X)] = φ(x)x dμ(x) .
.
−∞
Hence, under .Q, X has law .dν(x) = x dμ(x). Note that such a .ν is also a probability because
+∞
+∞ . dν(x) = x dμ(x) = E(X) = 1 . −∞
−∞
If .X ∼ Gamma.(λ, λ) then its density f with respect to the Lebesgue measure is f (x) =
.
λλ λ−1 −λx x e Γ (λ)
and its density with respect to .Q is x →
.
λλ λ −λx λλ+1 x e x λ e−λx , = Γ (λ) Γ (λ + 1)
which is a Gamma.(λ + 1, λ). (d1) Thanks to Theorem 1.28, .EQ (Z) = E(XZ) = E(X)E(Z) = E(Z). (d2) As X and Z are independent under .P, for every bounded Borel function .ψ we have EQ [ψ(Z)] = E[Xψ(Z)] = E(X)E[ψ(Z)] = E[ψ(Z)] ,
.
(7.10)
hence the laws of Z with respect to .P and to .Q coincide. (d3) For every choice of bounded Borel functions .φ, ψ : R → R we have, thanks to (7.10), EQ [φ(X)ψ(Z)] = E[Xφ(X)ψ(Z)] = E[Xφ(X)]E[ψ(Z)]
.
= EQ [φ(X)]EQ [ψ(Z)] , i.e. X and Z are independent also with respect to .Q.
2.18 (a) We must only check that . λ2 (X + Z) is a density, i.e. that it is a positive r.v. whose integral is equal to 1, which is immediate. (b) As X and Z are independent under .P and recalling the expressions of the moments of the exponential laws, .E(X) = λ1 , .E(X2 ) = λ22 , we have λ λ EQ (XZ) = E XZ(X + Z) = E(X2 Z) + E(XZ 2 ) 2 2 . λ 2 2 λ = E(X2 )E(Z) + E(X)E(Z 2 ) = × 2 3 = 2 · 2 2 λ λ
(7.11)
(c1) The method of the dumb function: if .φ : R2 → R is a bounded Borel function, λ EQ φ(X, Z) = E (X + Z)φ(X, Z) 2
λ +∞ +∞ φ(x, z)(x + z)λ2 e−λ(x+z) dx dz . = 2 −∞ −∞
.
Hence, under .Q, X and Z have a joint law with density, with respect to the Lebesgue measure, g(x, z) =
.
λ3 (x + z) e−λ(x+z) 2
x, z > 0 .
As g does not split into the tensor product of functions of x and z, X and Z are not independent under .Q. They are even correlated: we have EQ (X) =
.
λ 2 3 λ 1 λ = + E[X(X + Z)] = E(X2 ) + E(XZ) = 2 2 2 λ2 2λ λ2
and, recalling (7.11) , 9 1 CovQ (X, Z) = EQ (XZ) − EQ (X)EQ (Z) = 2 − 0 ,
so that (7.12) gives
+∞ λα+β (yz)α−1 y β−1 e−λ(zy+y) y dy Γ (α)Γ (β) 0
+∞ λα+β zα−1 = y α+β−1 e−λ(1+z)y dy Γ (α)Γ (β) 0
g(z) =
.
= =
λα+β zα−1 Γ (α+β) Γ (α)Γ (β) (λ(1+z))α+β
(7.13)
zα−1 Γ (α + β) · Γ (α)Γ (β) (1 + z)α+β
(b2) If .U ∼ Gamma.(α, 1), then . Uλ ∼ Gamma.(α, λ) (exercise). Let now U , V be two independent r.v.’s Gamma.(α, 1)- and Gamma.(β, 1)-distributed respectively, then the r.v.’s . Uλ , . Vλ have the same joint law as X, Y , therefore their quotient has the X U U same law as . X Y . Hence . Y = V and the law of . V does not depend on .λ. (b3) The moment of order p of W is
+∞
E(W p ) =
.
zp g(z) dz =
0
Γ (α + β) Γ (α)Γ (β)
+∞ 0
zα+p−1 dz . (z + 1)α+β
(7.14)
The integrand tends to 0 at infinity as .zp−β−1 , hence the integral converges if and only if .p < β. If this condition is satisfied, the integral is easily computed recalling that (7.13) is a density: just write
.
0
+∞
zα+p−1 dz = (z + 1)α+β
+∞ 0
zα+p−1 dz (z + 1)α+p+β−p
and therefore, thanks to (7.13) with .α replaced by .α + p and .β by .β − p, E(W p ) =
.
Γ (α + p)Γ (β − p) Γ (α + p)Γ (β − p) Γ (α + β) × = · Γ (α)Γ (β) Γ (α + β) Γ (α)Γ (β)
(7.15)
(c1) The r.v.’s .X2 and .Y 2 + Z 2 are Gamma.( 12 , 12 )- and Gamma.(1, 12 )-distributed respectively and independent. Therefore (7.13) with .α = 12 and .β = 1 gives for the density of .W1 1
1
z− 2 1 z− 2 = · .f1 (z) = 2 (z + 1)3/2 Γ ( 12 )Γ (1) (z + 1)3/2 Γ ( 32 )
As .W2 =
√
W1 , P(W2 ≤ t) = P(W1 ≤ t 2 ) = FW1 (t 2 )
.
and, taking the derivative, the requested density of .W2 is f2 (t) = 2tf1 (t 2 ) =
.
(t 2
1 + 1)3/2
t >0.
(c2) The joint law of X and Y has density, with respect to the Lebesgue measure, f (x, y) =
.
1 − 1 (x 2 +y 2 ) . e 2 2π
It is straightforward to deduce from (7.12) that . X Y has a Cauchy density g(z) =
.
1 π(1 + z2 )
but we have already proved this in Exercise 2.16, as a general fact concerning all joint densities that are rotation invariant. 2.20 (a) We can write .(U, V ) = Ψ (X, Y ), with .Ψ (x, y) = (x + y, x+y x ). Let us make the change of variable .(u, v) = Ψ (x, y). Let us first compute .Ψ −1 : we must solve ⎧ ⎨u = x + y . x+y ⎩v = · x We find .x =
u v
and then .y = u − uv , i.e. .Ψ −1 (u, v) = (uv, u − uv ). Its differential is DΨ
.
so that .| det D Ψ −1 (u, v)| = f (x, y) =
.
−1
u . v2
(u, v) =
1 v
1−
1 v
− vu2
u v2
Denoting by f the joint density of .(X, Y ), i.e.
1 x α−1 y β−1 e−(x+y) , Γ (α)Γ (β)
the joint density of .(U, V ) is g(u, v) = f ( uv , u − uv )
.
u · v2
x, y > 0 ,
The density f vanishes unless both its arguments are positive, hence .g > 0 for u > 0, v > 1. If .u > 0, .v > 1 we have
.
u α−1 u β−1 − u − (u − u ) u 1 v u− e v Γ (α)Γ (β) v v v2 β−1 1 (v − 1) = · uα+β−1 e−u × Γ (α)Γ (β) v α+β
g(u, v) = .
(7.16)
As the joint density of .(U, V ) can be split into the product of a function of u and of a function of v, U and V are independent. (b) We must compute
gV (v) :=
+∞
.
−∞
g(u, v) du .
By (7.16) we have .gV (v) = 0 for .v ≤ 1 and gV (v) =
.
Γ (α + β) (v − 1)β−1 Γ (α)Γ (β) v α+β
for .v > 1, as we recognized the integral of a Gamma.(α + β, 1) density. Y Y Note that .V = 1 + X and that the density of the quotient . X has already been computed in Exercise 2.19(b), from which the density .gV could also be derived. As for the law of . V1 , note first that this r.v. takes its values in the interval .[0, 1]. For .0 ≤ t ≤ 1 we have P V1 ≤ t = P V ≥ 1t = 1 − GV ( 1t ) ,
.
with .GV denoting the d.f. of V . Taking the derivative, . V1 has density, with respect to the Lebesgue measure, t→
.
β−1 1 Γ (α + β) 1 1 Γ (α+β) α−1 1 − 1 t g ( ) = t α+β = (1 − t)β−1 , V t Γ (α)Γ (β) t 2 t Γ (α)Γ (β) t2
i.e. a Beta.(α, β) density. 2.21 (a) For every bounded Borel function .φ : R2 → R
+∞
E[φ(Z, W )] =
1
f (t) dt
.
0
φ xt, (1 − x)t dx .
0
With the change of variable .z = xt, .dz = t dx, in the inner integral we obtain, after Fubinization,
.
+∞
··· =
f (t) dt 0
0
t
1 φ(z, t − z) dz = t
+∞
+∞
dz 0
z
φ(z, t − z )
1 f (t) dt . t
With the further change of variable .w = t − z and noting that .w = 0 when .t = z, we land on
+∞ +∞ 1 f (z + w) dz , .··· = dz φ(z, w) z+w 0 0 so that the requested joint density is 1 f (z + w), z+w
g(z, w) :=
.
z > 0, w > 0 .
Note that, g being symmetric, Z and W have the same distribution, a fact which was to be expected. (b) If f (t) = λ2 t e−λt ,
t >0
.
then g(z, w) = λ2 e−λ(z+w) = λe−λz × λe−λw .
.
Z and W are i.i.d. with an exponential distribution of parameter .λ. 2.22 We have
G(x, y) = P(x ≤ X ≤ Y ≤ y) =
f (u, v) du dv ,
.
Qx,y
where .Qx,y is the square .[x, y]×[x, y]. Keeping in mind that .X ≤ Y a.s., .f (u, v) = 0 for .u > v so that
y
y .G(x, y) = du f (u, v) dv . x
u
Taking the derivative first with respect to x and then with respect to y we find f (x, y) = −
.
∂ 2G (x, y) . ∂x∂y
(b1) Denoting by H the common d.f. of Z and W , we have G(x, y) = P(x ≤ X ≤ Y ≤ y) = P(x ≤ Z ≤ y, x ≤ W ≤ y) .
= (H (y) − H (x))2 ,
(7.17)
hence the joint density of .X, Y is, for .x ≤ y, f (x, y) = −
.
∂ 2G (x, y) = 2h(x)h(y) ∂x∂y
and .f (x, y) = 0 for .x > y. (b2) If Z and W are uniform on .[0, 1] then .h = 1[0,1] and .f (x, y) = 2 1{0≤x≤y≤1} . Therefore E[|Z − W |] = E max(Z, W ) − min(Z, W ) = E(Y − X)
1
1 y 1 dy (y − x) dx = y 2 dy = · =2 3 0 0 0
.
2.23 (a) Let .f = 1A with .A ∈ E and .μ(A) < +∞ and .φ(x) = x 2 . Then .φ(1A ) = 1A and (2.86) becomes μ(A) ≥ μ(A)2
.
hence .μ(A) ≤ 1. Let now .(An )n ⊂ E be an increasing sequence of sets of finite μ-measure and such that .E = n An . As .μ(An ) ≤ 1 and .μ passes to the limit on increasing sequences, we have also .μ(E) ≤ 1. (b) Note that (2.86) implies a similar, reverse, inequality for integrable concave functions hence equality for affine-linear ones. Now for .φ ≡ 1, recalling that necessarily .μ is finite thanks to (a),
.
μ(E) =
.
φ(1E ) dμ = φ
1E dμ = 1 .
2.24 (a1) Let .φ(x) = x log x if .x > 0, .φ(0) = 0, .φ(x) = +∞ if .x < 0. For x > 0 we have .φ (x) = 1 + log x, .φ
(x) = x1 , therefore .φ is convex and, as .limx→0 φ(x) = 0, also lower semi-continuous. It vanishes at 1 and at 0. By Jensen’s inequality .
H (ν; μ) =
φ
.
dν
E
dμ
dμ ≥ φ
dν dμ = φ ν(E) = 0 . E dμ
(7.18)
The convexity relation H λν1 + (1 − λ)ν2 ; μ ≤ λH (ν1 ; μ) + (1 − λ)H (ν2 ; μ)
.
(7.19)
is immediate if both .ν1 and .ν2 are . μ thanks to the convexity of .φ. If one at least among .ν1 , ν2 is not absolutely continuous with respect to .μ, then also .λν1 + .(1 − λ)ν2 μ and in (7.19) both members are .= +∞. Moreover .φ is strictly convex as .φ
> 0 for .x > 0. Therefore the inequaldν dν ity (7.18) is strict, unless . dμ is constant. As . dμ is a density, this constant can only be equal to 1 so that the inequality is strict unless .ν = μ.
(a2) As .log 1A = 0 on A whereas .1A = 0 on .Ac , .1A log 1A ≡ 0 and H (ν; μ) =
.
1 μ(A)
E
1A 1A log μ(A) dμ =
1 μ(A)
− log μ(A) dμ A
= − log μ(A) . As .ν(Ac ) = 0 whereas .μ(Ac ) = 1 − μ(A) > 0, .μ is not absolutely continuous with respect to .ν and .H (μ; ν) = +∞. (b1) We have, for .k = 0, 1, . . . , n, .
q k (1 − q)n−k dν (k) = k , dμ p (1 − p)n−k
i.e. .
log
q 1−q dν (k) = k log + (n − k) log , dμ p 1−p
so that H (ν; μ) =
n
.
ν(k) log
k=0
=
dν (k) dμ
n 1−q n k q q (1 − q)n−k k log + (n − k) log p 1−p k k=0
1−q q . = n q log + (1 − q) log p 1−p
(b2) We have, for .t > 0, .
log
ρ dν (t) = e−(ρ−λ)t , dμ λ λ dν (t) = − log − (ρ − λ) t dμ ρ
and
+∞
H (ν; μ) =
log
.
0
λ dν (t) dν(t) = − log − (ρ − λ)ρ dμ ρ
= − log
λ λ λ ρ−λ − = − 1 − log , ρ ρ ρ ρ
which, of course, is a positive function (Fig. 7.2).
0
+∞
te−ρt dt
Fig. 7.2 The graph of ρ → λ/ρ − 1 − log(λ/ρ), for λ = 1.2
(c) If for one index i, at least, .νi μi , then there exists a set .Ai ∈ Ei such that νi (Ai ) > 0 and .μi (Ai ) = 0. Then,
.
ν(E1 × · · · × Ai × · · · × En ) = νi (Ai ) > 0 ,
.
μ(E1 × · · · × Ai × · · · × En ) = μi (Ai ) = 0 , so that also .ν μ and in (2.88) both members are .= +∞. dνi , then If, instead, .νi μi for every i and .fi := dμ i .
dν (x1 , . . . , xn ) = f1 (x1 ) . . . f (xn ) dμ
and, as . Ei dνi (xi ) = 1 for every .i = 1, . . . , n,
H (ν; μ) =
.
=
= =
n i=1
E1 ×···×En
E1 ×···×En
E1 ×···×En
E1 ×···×En
log
dν dν dμ
log f1 (x1 ) . . . fn (xn ) dν1 (x1 ) . . . dνn (xn )
log f1 (x1 ) + · · · + log fn (xn ) dν1 (x1 ) . . . dνn (xn )
log fi (xi ) dν1 (x1 ) . . . dνn (xn ) =
n i=1
=
n i=1
H (νi ; μi ) .
log fi (xi ) dνi (xi ) Ei
• The courageous reader can compute the relative entropy of .ν = N(b, σ 2 ) with respect to .μ = N(b0 , σ02 ) and find that H (ν; μ) =
.
σ2 1 1 σ2 − log 2 − 1 + (b − b0 )2 . 2 2 σ0 σ0 2σ02
2.25 (a) We know that if .X ∼ N(b, σ 2 ) then .Z = X − b ∼ N(0, σ 2 ), and also that the odd order moments of centered Gaussian laws vanish. Therefore 3 = E(Z 3 ) = 0 , .E (X − b) hence .γ = 0. Actually in this computation we have used only the fact that the Gaussian r.v.’s have a law that is symmetric with respect to their mean, i.e. such that .X −b and .−(X −b) have the same law. For all r.v.’s with a finite third order moment and having this property we have E[(X − b)3 ] = E[(−(X − b))3 ] = −E[(X − b)3 ] ,
.
so that .E[(X − b)3 ] = 0 and .γ = 0. (b) Recall that if .X ∼ Gamma.(α, λ) its k-th order moment is E(Xk ) =
.
Γ (α + k) (α + k − 1)(α + k − 2) · · · α , = k λ Γ (α) λk
hence for the first three moments: E(X) =
.
α , λ
E(X2 ) =
α(α + 1) , λ2
E(X3 ) =
α(α + 1)(α + 2) · λ3
With the binomial expansion of the third degree (here .b = αλ ) E[(X − b)3 ] = E(X3 ) − 3E(X2 )b + 3E(X)b2 − b3
.
1 α(α + 1)(α + 2) − 3α 2 (α + 1) + 3α 3 − α 3 3 λ 2α α = 3 α 2 + 3α + 2 − 3α 2 − 3α + 2α 2 = 3 · λ λ
=
On the other hand the variance is equal to .σ 2 = γ =
.
2α λ3 α 3/2 λ3
α , λ2
so that
= 2α −1/2 .
In particular, the skewness does not depend on .λ and for an exponential law is always equal to 2. This fact is not surprising keeping in mind that, as already noted somewhere above, if .X ∼ Gamma.(α, 1) then . λ1 X ∼ Gamma.(α, λ). Hence
the moments of order k of a Gamma.(α, λ)-distributed r.v. are equal to the same moments of a Gamma.(α, 1)-distributed r.v. multiplied by .λ−k and the .λ’s in the numerator and in the denominator in (2.89) simplify. Note also that the skewness of a Gamma law is always positive, which is in agreement with intuition (the graph of the density is always as in Fig. 2.4, at least for .α > 1). 2.26 By hypothesis, for every .n ≥ 1,
+∞
.
−∞
x dμ(x) = n
+∞
−∞
x n dν(x)
and therefore, by linearity, also
+∞
.
−∞
P (x) dμ(x) =
+∞
−∞
P (x) dν(x)
(7.20)
for every polynomial P . By Proposition 1.25, the statement follows if we are able to prove that (7.20) holds for every continuous bounded function f (and not just for every polynomial). But if f is a real continuous function then (Weierstrass’s Theorem) f is the uniform limit of polynomials on .[−M, M]. Hence, if P is a polynomial such that .sup−M≤x≤M |f (x) − P (x)| ≤ ε, then
+∞
.
=
≤
−∞
M
−M
M −M
f (x) dμ(x) −
−∞
f (x) − P (x) dμ(x) −
f (x) − P (x) dμ(x) +
+∞
M
−M
M
−M
f (x) dν(x) f (x) − P (x) dν(x)
f (x) − P (x) dν(x) ≤ 2ε
and by the arbitrariness of .ε
+∞
.
−∞
f (x) dμ(x) =
+∞
−∞
f (x) dν(x)
for every bounded continuous function f . 2.27 The covariance matrix C is positive definite and therefore is invertible if and only if it is strictly positive definite. Recall (2.33), i.e. for every .ξ ∈ Rm Cξ, ξ = E X − E(X), ξ 2 .
.
(7.21)
Let us assume that X takes its values in a proper hyperplane of .Rm . Such a hyperplane is of the form .{x; ξ, x = t} for some .ξ ∈ Rm , ξ = 0 and .t ∈ R. Hence ξ, X = t
a.s.
.
Taking the expectation we have .ξ, E(X) = t, so that .ξ, X − E(X) = 0 a.s. and by (7.21) .Cξ, ξ = 0, so that C cannot be invertible. Conversely, if C is not invertible there exists a vector .ξ ∈ Rm , .ξ = 0, such that 2 .Cξ, ξ = 0 and by (7.21) .X − E(X), ξ = 0 a.s. (the mathematical expectation of a positive r.v. vanishes if and only if the r.v. is a.s. equal to 0, Exercise 1.9). Hence .X ∈ H a.s. where .H = {x; ξ, x = ξ, E(X)}. Let .μ denote the law of X. As H has Lebesgue measure equal to 0 whereas .μ(H ) = 1, .μ is not absolutely continuous with respect to the Lebesgue measure. 2.28 Recall the expression of the coefficients a and b, i.e. a=
.
Cov(X, Y ) , Var(X)
b = E(Y ) − aE(X) .
(a) As .aX + b = a(X − E(X)) + E(Y ), we have Y − (aX + b) = Y − E(Y ) − a(X − E(X)) ,
.
which gives .E(Y − (aX + b)) = 0. Moreover, E (Y − (aX + b))(aX + b)
.
= E (Y − E(Y ) − a(X − E(X)))(a(X − E(X)) + E(Y )) = aE (Y − E(Y ))(X − E(X)) − a 2 E (X − E(X))2 = aCov(Y, X) − a 2 Var(X) =
Cov(X, Y )2 Cov(X, Y )2 − =0. Var(X) Var(X)
(b) As .Y − (aX + b) and .aX + b are orthogonal in .L2 , we have (Pythagoras’s theorem) E(Y 2 ) = E (Y − (aX + b))2 + E (aX + b)2 .
.
2.29 (a) We have .Cov(X, Y ) = Cov(Y + W, Y ) = Var(Y ) = 1, whereas .Var(X) = Var(Y ) + Var(W ) = 1 + σ 2 . As the means vanish, the regression line .x → ax + b of Y with respect to X is given by a=
.
1 , 1 + σ2
b=0.
1 The best approximation of Y by a linear function of X is therefore . 1+σ 2 X (intuitively one takes the observation X and moves it a bit toward 0, which is the mean of Y ). The quadratic error is
E Y−
.
2 1 1 2 = Var(Y ) + X Var(X) − Cov(X, Y ) 1 + σ2 (1 + σ 2 )2 1 + σ2 =1+
1 2 σ2 − = · 2 2 1+σ 1+σ 1 + σ2
(b) If .X = (X1 , X2 ), the covariance matrix of X is 1 1 + σ2 .CX = 1 1 + σ2
!
whereas the vector of the covariances of Y and the .Xi ’s is CX,Y
.
1 = 1
!
and the means vanish. We have, with a bit of patience, −1 CX =
.
1 1 + σ 2 −1 2 2 −1 1 + σ 2 (1 + σ ) − 1
! ,
hence the regression “line” is $ 1 + σ 2 −1 ! 1! X ! % 1 1 , X2 −1 1 + σ 2 1 2σ 2 + σ 4 $ σ 2! X ! % X + X 1 1 2 1 , = = · X2 2σ 2 + σ 4 σ 2 2 + σ2
" −1 # CX CX,Y , X =
.
The quadratic error can be computed as in (a) or in a simpler way, using Exercise 2.28(b), E Y−
.
1 2 1 = Var(Y ) − Var (X1 + X2 ) (X1 + X2 ) . 2 2 2+σ 2+σ
Now .Var(X1 + X2 ) = Var(2Y + W1 + W2 ) = 4 + 2σ 2 , therefore .
··· = 1 −
4 + 2σ 2 2 σ2 =1− = · 2 2 2 (2 + σ ) 2+σ 2 + σ2
The availability of two independent observations has allowed some reduction of the quadratic error. 2.30 As .Cov(X, Y ) = Cov(Y, Y ) + Cov(Y, W ) = Cov(Y, Y ) = Var(Y ), the regression line of Y with respect to X is .y = ax + b, with the values a=
.
Var(Y ) Cov(X, Y ) = = Var(X) Var(Y ) + Var(W )
b = E(Y ) − aE(X) =
1 λ2
+
1 λ2
1 ρ2
=
ρ2 , λ2 + ρ 2
ρ2 1 1 λ−ρ 1 − 2 + = 2 · λ λ + ρ2 ρ λ λ + ρ2
2.31 Let X be an r.v. having characteristic function .φ. Then φ(t) = E(eitX ) = E( eitX ) = E(e−itX ) = φ−X (t) ,
.
hence .φ is the characteristic function of .−X. Let .Y, Z be independent r.v.’s with characteristic function .φ. Then the characteristic function of .Y + Z is .φ 2 . Similarly the characteristic function of .Y − Z is .φ · φ = |φ|2 . 2.32 (a) The characteristic function of .X1 is
φX1 (θ ) =
1/2
.
−1/2
eiθx dx =
1 iθx x=1/2 2 e = sin θ2 x=−1/2 iθ θ
and therefore the characteristic function of .X1 + X2 is φX1 +X2 (θ ) =
.
4 sin2 θ2
θ 2
·
(b) As f is an even function whereas .x → sin(θ x) is odd,
φ(θ ) =
1
.
1
=2 0
−1
(1 − |x|) e
iθx
dx =
1
−1
(1 − |x|) cos(θ x) dx
x=1 2 1 2 (1 − x) cos(θ x) dx = (1 − x) sin(θ x) + sin(θ x) dx x=0 θ θ 0 =
2 4 (1 − cos θ ) = 2 sin2 2 θ θ
θ 2
.
Fig. 7.3 The graph of the density (7.22). Note a typical feature: densities decreasing fast at infinity have very regular characteristic functions and conversely regular densities have characteristic functions decreasing fast at infinity. In this case the density is compactly supported and the characteristic function is very regular. The characteristic function tends to 0 a bit slowly at infinity and the density is not regular
As the probability .f (x) dx and the law of .X1 + X2 have the same characteristic function, they coincide. (c) As .φ is integrable, by the inversion Theorem 2.33, 1 .f (x) = 2π
∞ −∞
4 sin2 θ2 e−iθx dθ . θ2
Exchanging the roles of x and .θ we can write κ(θ ) = f (θ ) =
.
1 2π
∞ −∞
4 sin2 x2 e−iθx dx . x2
As .κ(0) = 1, the positive function g(x) :=
.
2 sin2 π x2
x 2
is a density, having characteristic function .κ. See its graph in Fig. 7.3. 2.33 Let .μ be a probability on .Rd . We have, for .θ ∈ Rd , & μ(−θ ) = & μ(θ ) ,
.
(7.22)
so that the matrix .(& μ(θh − θk ))h,k is Hermitian. Moreover, n .
n
& μ(θh − θk )ξh ξk =
h,k=1
ξh ξk
h,k=1
=
Rd
ξh eiθh ,x ξk eiθk ,x dμ(x)
n h,k=1
=
eiθh ,x e−iθk ,x dμ(x)
Rd
n 2 ξh eiθh ,x dμ(x) ≥ 0
Rd
h=1
(the integrand is positive) and therefore .& μ is positive definite. Bochner’s Theorem states that the converse is also true: a positive definite function .Rd → C taking the value 1 at 0 is the characteristic function of a probability (see e.g. [3], p. 262). 2.34 (a) As .x → sin(θ x) is an odd function whereas .x → cos(θ x) is even, 1 & .ν(θ ) = 2
=
+∞ −∞
+∞
1 dx = 2
−|x| iθx
e
e
+∞
−∞
e−|x| cos(θ x) dx
e−x cos(θ x) dx
0
and, twice by parts,
+∞
.
0
x=+∞ e−x cos(θ x) dx = −e−x cos(θ x) −θ x=0
x=+∞ = 1 + θ e−x sin(θ x) − θ2 x=0
=1−θ
+∞
2
+∞
e−x sin(θ x) dx
0
+∞
e−x cos(θ x) dx
0
e−x cos(θ x) dx ,
0
from which
+∞
(1 + θ )
.
2
e−x cos(θ x) dx = 1 ,
0
i.e. (2.90). (b1) .θ →
1 1+θ 2
is integrable and by the inversion theorem, Theorem 2.33, h(x) =
.
1 −|x| 1 e = 2 2π
+∞ −∞
e−ixθ dθ . 1 + θ2
Exchanging the roles of $x$ and $\theta$ we find
$$\frac{1}{\pi}\int_{-\infty}^{+\infty}\frac{e^{-ix\theta}}{1+x^2}\,dx=e^{-|\theta|}\,,$$
hence $\widehat{\mu}(\theta)=e^{-|\theta|}$.

(b2) The characteristic function of $Z=\frac{1}{2}(X+Y)$ is
$$\varphi_Z(\theta)=\varphi_X\big(\tfrac{\theta}{2}\big)\,\varphi_Y\big(\tfrac{\theta}{2}\big)=e^{-\frac{1}{2}|\theta|}\,e^{-\frac{1}{2}|\theta|}=e^{-|\theta|}\,.$$
Therefore $\frac{1}{2}(X+Y)$ is also Cauchy-distributed.

• Note that $\widehat{\mu}$ is not differentiable at 0; this is hardly surprising as a Cauchy r.v. does not have finite moment of order 1.

2.35 (a) Yes, $\mu_n=N(\frac{b}{n},\frac{\sigma^2}{n})$. (b) Yes, $\mu_n=$ Poisson$(\frac{\lambda}{n})$. (c) Yes, $\mu_n=$ Gamma$(\frac{1}{n},\lambda)$. (d) We have seen in Exercise 2.34 that a Cauchy law $\mu$ has characteristic function $\widehat{\mu}(\theta)=e^{-|\theta|}$. Hence if $X_1,\dots,X_n$ are independent Cauchy r.v.'s, then the characteristic function of $\frac{X_1}{n}+\dots+\frac{X_n}{n}$ is equal to
$$\widehat{\mu}\big(\tfrac{\theta}{n}\big)^n=\big(e^{-\frac{|\theta|}{n}}\big)^n=e^{-|\theta|}=\widehat{\mu}(\theta)\,,$$
hence we can choose $\mu_n$ as the law of $\frac{X_1}{n}$, which, by the way, has density $x\mapsto\frac{n}{\pi}\,(1+n^2x^2)^{-1}$ with respect to the Lebesgue measure.

2.36 (a) By (2.91), for every $a\in\mathbb{R}$,
$$\mu_\theta(]-\infty,a])=\mu(H_{\theta,a})=\nu(H_{\theta,a})=\nu_\theta(]-\infty,a])\,.$$
Hence $\mu_\theta$ and $\nu_\theta$ have the same d.f. and coincide.

(b) We have
$$\widehat{\mu}(\theta)=\int_{\mathbb{R}^d}e^{i\langle\theta,x\rangle}\,d\mu(x)=\widehat{\mu_\theta}(1)=\widehat{\nu_\theta}(1)=\int_{\mathbb{R}^d}e^{i\langle\theta,x\rangle}\,d\nu(x)=\widehat{\nu}(\theta)\,,$$
so that $\mu$ and $\nu$ have the same characteristic function and coincide.

2.37 (a1) Recall that, $X$ being integrable, $\varphi'(\theta)=i\,\mathrm{E}[Xe^{i\theta X}]$. Hence
$$\mathrm{E}_Q(e^{i\theta X})=\mathrm{E}(Xe^{i\theta X})=-i\varphi'(\theta)\qquad\qquad(7.23)$$
and $-i\varphi'$ is therefore the characteristic function of $X$ under $Q$.

(a2) Going back to (7.23) we have
$$-i\varphi'(\theta)=\mathrm{E}(Xe^{i\theta X})=\int_{\mathbb{R}}x\,e^{i\theta x}\,d\mu(x)\,,$$
i.e. $-i\varphi'$ is the characteristic function of the law $d\nu(x)=x\,d\mu(x)$, which is a probability because $\int x\,d\mu(x)=\mathrm{E}(X)=1$.

(a3) If $X\sim$ Gamma$(\lambda,\lambda)$, then $-i\varphi'$ is the characteristic function of the probability having density with respect to the Lebesgue measure given by
$$x\,\frac{\lambda^\lambda}{\Gamma(\lambda)}\,x^{\lambda-1}e^{-\lambda x}=\frac{\lambda^{\lambda+1}}{\Gamma(\lambda+1)}\,x^{\lambda}e^{-\lambda x}\,,\qquad x>0\,,$$
which is a Gamma$(\lambda+1,\lambda)$. If $X$ is geometric of parameter $p=\frac{1}{2}$ (so that $\mathrm{E}(X)=1$), then $-i\varphi'$ is the characteristic function of the probability having density with respect to the counting measure of $\mathbb{N}$ given by
$$q_k=k\,p(1-p)^k\,,\qquad k=0,1,\dots$$
i.e. a negative binomial distribution.

(b) Just note that every characteristic function takes the value 1 at 0 and
$-i\varphi'(0)=\mathrm{E}(X)$.

2.38 The problem of establishing whether a given function is a characteristic function is not always a simple one. In this case an r.v. $X$ with characteristic function $\varphi$ would have finite moments of all orders, $\varphi$ being infinitely many times differentiable. Moreover, we have
$$\varphi'(\theta)=-4\theta^3\,e^{-\theta^4}\,,\qquad \varphi''(\theta)=(16\theta^6-12\theta^2)\,e^{-\theta^4}$$
and therefore it would be
$$\mathrm{E}(X)=-i\varphi'(0)=0\,,\qquad \mathrm{Var}(X)=\mathrm{E}(X^2)-\mathrm{E}(X)^2=-\varphi''(0)=0\,.$$
An r.v. having variance equal to 0 is necessarily a.s. equal to its mean. Therefore such a hypothetical $X$ would be equal to 0 a.s. But then it would have characteristic function equal to the characteristic function of this law, i.e. $\varphi\equiv 1$. Hence $\theta\mapsto e^{-\theta^4}$ cannot be a characteristic function. As further (not needed) evidence, Fig. 7.4 shows the graph, numerically computed using the inversion Theorem 2.33, of what would be the density of an r.v. having this "characteristic function". It is apparent that it is not positive.
Fig. 7.4 The graph of what the density corresponding to the "characteristic function" $\theta\mapsto e^{-\theta^4}$ would look like. If it was really a characteristic function, this function would have been $\ge 0$.
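The figure can be reproduced along the following lines; this is only an illustrative sketch, not the book's own code, and the truncation of the inversion integral and the grids are arbitrary choices. The numerically inverted "density" indeed takes negative values.

```python
import numpy as np

# numerical inversion of the candidate "characteristic function" exp(-theta^4):
# f(x) = (1/(2*pi)) * integral of exp(-i*theta*x) * exp(-theta^4) d theta
theta = np.linspace(-10.0, 10.0, 20_001)   # the integrand is negligible beyond |theta| ~ 3
dtheta = theta[1] - theta[0]
phi = np.exp(-theta ** 4)

x = np.linspace(-6.0, 6.0, 121)
f = np.array([((np.exp(-1j * theta * xi) * phi).sum() * dtheta / (2 * np.pi)).real
              for xi in x])

print(f.min())   # clearly negative: exp(-theta^4) cannot be a characteristic function
```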
2.39 (a) We have, integrating by parts,
$$\mathrm{E}\big[Zf(Z)\big]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}xf(x)\,e^{-x^2/2}\,dx
=-\frac{1}{\sqrt{2\pi}}\,f(x)\,e^{-x^2/2}\Big|_{-\infty}^{+\infty}+\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}f'(x)\,e^{-x^2/2}\,dx
=\mathrm{E}\big[f'(Z)\big]\,.$$

(b1) Let us choose
$$f(x)=\begin{cases}-1 & \text{for } x\le -1\\ 1 & \text{for } x\ge 1\\ \text{connected as in Fig. 7.5} & \text{for } -1\le x\le 1\,.\end{cases}$$
This function belongs to $C_b^1$. Moreover, $zf(z)\ge 0$ and $zf(z)=|z|$ if $|z|\ge 1$, so that $|Z|\mathbf{1}_{\{|Z|\ge 1\}}\le Zf(Z)$. Hence, as $f'$ is bounded,
$$\mathrm{E}(|Z|)\le 1+\mathrm{E}(|Z|\mathbf{1}_{\{|Z|\ge 1\}})\le 1+\mathrm{E}[Zf(Z)]=1+\mathrm{E}[f'(Z)]<+\infty\,,$$
so that $|Z|$ is integrable.

(b2) For $f(x)=e^{i\theta x}$ (2.92) gives
$$\mathrm{E}(Ze^{i\theta Z})=i\theta\,\mathrm{E}(e^{i\theta Z})\,.\qquad\qquad(7.24)$$
As we know that $Z$ is integrable, its characteristic function $\varphi$ is differentiable and
$$\varphi'(\theta)=i\,\mathrm{E}(Ze^{i\theta Z})=-\theta\,\mathrm{E}(e^{i\theta Z})=-\theta\varphi(\theta)$$
Fig. 7.5 Between $-1$ and $1$ the function is $x\mapsto\frac{1}{2}(3x-x^3)$. Of course other choices of connection are possible in order to obtain a function in $C_b^1$.
and solving this differential equation we obtain $\varphi(\theta)=e^{-\theta^2/2}$, hence $Z\sim N(0,1)$.
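For readers who like to see (2.92) at work numerically, here is a small Monte Carlo check; it is not part of the original solution, the sample size and seed are arbitrary choices, and it uses $\frac{1}{2}(3x-x^3)$ of Fig. 7.5 as the connecting piece of the function $f$ of (b1).

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

def f(x):
    # the C_b^1 function of (b1): -1, then (3x - x^3)/2, then 1
    return np.where(x <= -1, -1.0, np.where(x >= 1, 1.0, 0.5 * (3 * x - x ** 3)))

def fprime(x):
    # its derivative: 3(1 - x^2)/2 on (-1, 1), 0 outside
    return np.where(np.abs(x) < 1, 1.5 * (1 - x ** 2), 0.0)

print(np.mean(z * f(z)), np.mean(fprime(z)))   # the two averages are close, as (2.92) predicts
```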
2.40 (a) We have
$$\varphi(\theta)=\sum_{k=-\infty}^{\infty}\mathrm{P}(X=k)\,e^{i\theta k}\,.\qquad\qquad(7.25)$$
As the series converges absolutely, we can (Corollary 1.22) integrate by series and obtain
$$\int_0^{2\pi}\varphi(\theta)\,d\theta=\sum_{k=-\infty}^{\infty}\mathrm{P}(X=k)\int_0^{2\pi}e^{i\theta k}\,d\theta\,.$$
All the integrals on the right-hand side above vanish for $k\ne 0$, whereas the one for $k=0$ is equal to $2\pi$: (2.93) follows.

(b) We have
$$\int_0^{2\pi}e^{-i\theta m}\varphi(\theta)\,d\theta=\sum_{k=-\infty}^{\infty}\mathrm{P}(X=k)\int_0^{2\pi}e^{-i\theta m}e^{i\theta k}\,d\theta\,,$$
and now all the integrals with $k\ne m$ vanish, whereas for $k=m$ the integral is again equal to $2\pi$, i.e.
$$\mathrm{P}(X=m)=\frac{1}{2\pi}\int_0^{2\pi}e^{-i\theta m}\varphi(\theta)\,d\theta\,.$$
(c) Of course .φ cannot be integrable on .R, as it is periodic: .φ(θ + 2π ) = φ(θ ). From another point of view, if it was integrable then X would have a density with respect to the Lebesgue measure, thanks to the inversion Theorem 2.33.
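As an illustration of the formula of (b), one can recover the probabilities of an integer-valued r.v. from its characteristic function by numerical integration over $[0,2\pi]$. The sketch below is not part of the original solution and uses, as an example, a Poisson law of parameter 2, whose characteristic function $e^{\lambda(e^{i\theta}-1)}$ is standard.

```python
import numpy as np
from math import exp, factorial

lam = 2.0
theta = np.linspace(0.0, 2 * np.pi, 20_000, endpoint=False)
dtheta = theta[1] - theta[0]
phi = np.exp(lam * (np.exp(1j * theta) - 1.0))   # characteristic function of a Poisson(2) law

for m in range(5):
    recovered = ((np.exp(-1j * m * theta) * phi).sum() * dtheta / (2 * np.pi)).real
    exact = exp(-lam) * lam ** m / factorial(m)
    print(m, round(recovered, 6), round(exact, 6))
```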
2.41 (a) The sets $B_n^c$ are decreasing and their intersection is empty. As probabilities pass to the limit on decreasing sequences of sets,
$$\lim_{n\to\infty}\mu(B_n^c)=0$$
and therefore $\mu(B_n^c)\le\eta$ for $n$ large enough.

(b) We have
1
|eiθ1 ,x − eiθ2 ,x | ≤
.
0
1
=
d iθ1 +t (θ2 −θ1 ),x e dt dt
|θ2 − θ1 , x| dt ≤ |x| |θ2 − θ1 | .
0
(c) We have
|& μ(θ1 ) − & μ(θ2 )| ≤
.
=
|e
iθ1 ,x
BRc η
−e
iθ2 ,x
Rd
|eiθ1 ,x − eiθ2 ,x | dμ(x)
| dμ(x) +
≤ 2μ(BRc η ) + |θ1 − θ2 |
|eiθ1 ,x − eiθ2 ,x | dμ(x) BRη
BRη
|x| dμ(x) ≤ 2μ(BRc η ) + Rη |θ1 − θ2 | .
Let .ε > 0. Choose first .η > 0 so that .2μ(BRc η ) ≤ 2ε and then .δ such that .δRη < 2ε . Then if .|θ1 − θ2 | ≤ δ we have .|& μ(θ1 ) − & μ(θ2 )| ≤ ε. 2.42 (a) If .0 < λ < 1, by Hölder’s inequality with .p = λ1 , .q = the integrands being positive,
1 1−λ ,
we have, all
L λs + (1 − λ)t = E (es,X )λ (et,X )1−λ ≤ E(es,X )λ E(et,X )1−λ .
= L(s)λ L(t)1−λ .
(7.26)
(b) Taking logarithms in (7.26) we obtain the convexity of .log L. The convexity of L now follows as the exponential function is convex and increasing. 2.43 (a) For the Laplace transform we have L(z) =
.
λ 2
+∞ −∞
ezt e−λ|t| dt .
The integral does not converge if .ℜz ≥ λ or .ℜz ≤ −λ: in the first case the integrand does not vanish at .+∞, in the second case it does not vanish at .−∞. For real values .−λ < t < λ we have, L(t) = E(etX ) =
.
=
λ 2
+∞
λ 2
+∞
0
e−(λ−t)x dx +
0
λ 2
=
etx e−λx dx + 0
−∞
λ 2
e(λ+t)x dx =
0
−∞
etx eλx dx
1 λ 1 + 2 λ−t λ+t
λ2 · λ2 − t 2
As L is holomorphic for .−λ < ℜz < λ, by the argument of analytic continuation in the strip we have L(z) =
.
λ2 , − z2
λ2
−λ < ℜz < λ .
The characteristic function is of course (compare with Exercise 2.34, where .λ = 1) φ(θ ) = L(iθ ) =
.
λ2 · λ2 + θ 2
(b) The Laplace transform, .L2 say, of Y and W is computed in Example 2.37(c). Its domain is . D = {z < λ} and, for .z ∈ D, L2 (z) =
.
λ · λ−z
Then their characteristic function is φ2 (t) = L2 (it) =
.
λ λ − it
and the characteristic function of .Y − W is φ3 (t) = φ2 (t)φ2 (t) =
.
λ λ2 λ = 2 , λ − it λ + it λ + t2
i.e. the same as the characteristic function of a Laplace law of parameter .λ. Hence Y − W has a Laplace law of parameter .λ.
.
(c1) If .X1 , . . . , Xn and .Y1 , . . . , Yn are independent and Gamma.( n1 , λ)distributed, then (X1 − Y1 ) + · · · + (Xn − Yn ) = (X1 + · · · + Xn ) − (Y1 + · · · + Yn ) .
.
∼ Gamma(1,λ)
∼ Gamma(1,λ)
We have found n i.i.d. r.v.’s whose sum has a Laplace distribution, which is therefore infinitely divisible. (c2) Recalling the characteristic function of the Gamma.( n1 , λ) that is computed in Example 2.37(c), if .λ = 1 the r.v.’s .Xk − Yk of (c1) have characteristic function θ →
.
1 1/n 1 1/n 1 = , 1 − iθ 1 + iθ (1 + θ 2 )1/n
so that .φ of (2.95) is a characteristic function. Note that the r.v.’s .Xk − Yk have density with respect to the Lebesgue measure as this is true for both .Xk and .Yk , but, for .n ≥ 2, their characteristic function is not integrable so that in this case the inversion theorem does not apply. 2.44 (a) For every .0 < λ < x2 , Markov’s inequality gives P(X ≥ t) = P(eλX ≥ eλt ) ≤ E(eλX ) · e−λt .
.
(b) Let us prove that .E(eλ X ) < +∞ for every .λ < λ: Remark 2.1 gives
+∞
E(eλ X ) =
.
0
P eλ X ≥ s ds=
≤ t0 +
t0
P eλ X ≥ s ds+
0 +∞
P X≥
t0
= t0 +
log s ds ≤ t0 +
1 λ
+∞
P eλ X ≥ s ds
t0 +∞
λ
e− λ log s ds
t0 +∞
t0
1
ds < +∞ . s λ/λ
Therefore .x2 ≥ λ. 2.45 As we assume that 0 belongs to the convergence strip, the two Laplace transforms, .Lμ and .Lν , are holomorphic at 0 (Theorem 2.36), i.e., for z in a neighborhood of 0, Lμ (z) =
.
∞ 1 (k) L (0)zk , k! μ k=1
Lν (z) =
∞ 1 (k) L (0)zk . k! ν k=1
By (2.63) we find
L(k) μ (0) =
.
x k dμ(x) =
x k dν(x) = L(k) ν (0) ,
so that the two Laplace transforms coincide in a neighborhood of the origin and, by the uniqueness of the analytic continuation, in the whole convergence strip, hence on the imaginary axis, so that .μ and .ν have the same characteristic function. 2.46 (a) Let us compute the derivatives of .ψ: ψ (t) =
.
L (t) , L(t)
ψ
(t) =
L(t)L
(t) − L (t)2 · L(t)2
Recalling that .L(0) = 1, denoting by X any r.v. having Laplace transform L, (2.63) gives ψ (0) = L (0) = E(X),
.
ψ
(0) = L
(0) − L (0)2 = E(X2 ) − E(X)2 = Var(X) . (b1) The integral of .x → eγ x L(γ )−1 with respect to .μ is equal to 1 so that γ x L(γ )−1 is a density with respect to .μ and .μ is a probability. As for its .x → e γ Laplace transform:
Lγ (t) =
+∞
etx dμγ (x) dx
+∞ L(γ + t) 1 · e(γ +t)x dμ(x) = = L(γ ) −∞ L(γ )
.
−∞
(7.27)
(b2) Let us compute the mean and variance of .μγ via the derivatives of .log Lγ as seen in (a). As .log Lγ (t) = log L(γ + t) − log L(γ ) we have .
d L (γ + t) log Lγ (t) = , dt L(γ + t)
d2 L(γ + t)L
(γ + t) − L (γ + t)2 log L (t) = γ dt 2 L(γ + t)2
and denoting by Y an r.v. having law .μγ , for .t = 0, E(Y ) = .
L (γ ) = ψ (γ ) , L(γ )
(7.28)
L(γ )L
(γ ) − L (γ )2 Var(Y ) = = ψ
(γ ) . L(γ )2
(b3) One of the criteria to establish the convexity of a function is to check that its second order derivative is positive. From the second Eq. (7.28) we have .ψ
(γ ) = Var(Y ) ≥ 0. Hence .ψ is convex. We find again, in a different way, the result of Exercise 2.42. Actually we obtain something more: if X is not a.s. constant then
.ψ (γ ) = Var(Y ) > 0, so that .ψ is strictly convex. The mean of .μγ is equal to .ψ (γ ). As .ψ is increasing (.ψ
is positive), the mean of .μγ is an increasing function of .γ . (c1) If .μ ∼ N (0, σ 2 ) then 1
L(t) = e 2 σ
.
2t 2
ψ(t) =
,
1
1 2
σ 2t 2 .
2 2
Hence .μγ has density .x → eγ x− 2 σ γ with respect to the .N(0, σ 2 ) law and therefore its density with respect to the Lebesgue measure is 1
x → eγ x− 2 σ
.
1 2 1 1 1 − x − (x−σ 2 γ )2 e 2σ 2 = √ e 2σ 2 √ 2π σ 2π σ
2γ 2
and we recognize an .N(σ 2 γ , σ 2 ) law. Alternatively, and possibly in a simpler way, we can compute the Laplace transform of .μγ using (7.27): 1
Lγ (t) =
e2σ
.
e
2 (t+γ )2
1 2
σ 2γ 2
1
= e2 σ
2 (t 2 +2γ
t)
1
= e2 σ
2 t 2 +σ 2 γ
t
,
which is the Laplace transform of an .N(σ 2 γ , σ 2 ) law. (c2) Also in this case it is not difficult to compute the density, but the Laplace transform provides the simplest argument: the Laplace transform of a .Γ (α, λ) law is for .t < λ (Example 2.37(c)) L(t) =
.
λ α . λ−t
As L is defined only on .] − ∞, λ[, we can consider only .γ < λ. The Laplace transform of .μγ is now Lγ (t) =
.
α λ − γ α λ − γ α λ = , λ − (t + γ ) λ λ−γ −t
Fig. 7.6 Comparison, for $\lambda=3$ and $\gamma=1.5$, of the graphs of the Laplace density $f$ of parameter $\lambda$ (dots) and of the twisted density $f_\gamma$.
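The tilting mechanism of this exercise is easy to check numerically. The sketch below is illustrative only (the values of $\alpha$, $\lambda$, $\gamma$ and the integration grid are arbitrary choices, not taken from the book): for the Gamma case (c2) it verifies that $x\mapsto e^{\gamma x}L(\gamma)^{-1}$ times the Gamma$(\alpha,\lambda)$ density is again a probability density and coincides with the Gamma$(\alpha,\lambda-\gamma)$ density.

```python
import numpy as np
from math import gamma as Gamma

alpha, lam, g = 2.0, 3.0, 1.5            # any alpha > 0 and gamma < lambda
x = np.linspace(1e-6, 60.0, 400_000)
dx = x[1] - x[0]

f = lam ** alpha / Gamma(alpha) * x ** (alpha - 1) * np.exp(-lam * x)   # Gamma(alpha, lambda) density
L_g = (lam / (lam - g)) ** alpha                                        # its Laplace transform at gamma
f_tilted = np.exp(g * x) / L_g * f                                      # density of mu_gamma

f_target = (lam - g) ** alpha / Gamma(alpha) * x ** (alpha - 1) * np.exp(-(lam - g) * x)
print((f_tilted * dx).sum())                 # ~ 1: mu_gamma is a probability
print(np.max(np.abs(f_tilted - f_target)))   # ~ 0: mu_gamma is Gamma(alpha, lambda - gamma)
```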
which is the Laplace transform of a .Γ (α, λ − γ ) law. (c3) The Laplace transform of a Laplace law of parameter .λ is, for .−λ < t < λ, (Exercise 2.43) L(t) =
.
λ2 · λ2 − t 2
Hence .μγ has density x →
.
(λ2 − γ 2 ) eγ x λ2
with respect to .μ and density fγ (x) :=
.
λ2 − γ 2 −λ|x|+γ x e λ2
with respect to the Lebesgue measure (see the graph in Fig. 7.6). Its Laplace transform is Lγ (t) =
.
λ2 − γ 2 λ2 λ2 − γ 2 = · λ2 − (t + γ )2 λ2 λ2 − (t + γ )2
(c4) The Laplace transform of a Binomial .B(n, p) law is L(t) = (1 − p + p et )n .
.
Hence .μγ has density, with respect to the counting measure of .N, fγ (k) =
.
n eγ k pk (1 − p)n−k , (1 − p + p eγ )n k
k = 0, . . . , n ,
which, with some imagination, can be written fγ (k) =
.
n k
k n−k p eγ p eγ 1− , γ γ 1−p+pe 1−p+pe
k = 0, . . . , n ,
i.e. a binomial law .B(pγ , n) with .pγ = p eγ (1 − p + p eγ )−1 . (c5) The Laplace transform of a geometric law of parameter p is L(t) = p
∞
.
(1 − p)k etk ,
k=0
which is finite for .(1 − p) et < 1, i.e. for .t < − log(1 − p) and for these values L(t) =
.
p · 1 − (1 − p) et
Hence .μγ has density, with respect to the counting measure of .N, fγ (k) = eγ k (1 − (1 − p) eγ )(1 − p)k = (1 − (1 − p) eγ )((1 − p) eγ )k ,
.
i.e. a geometric law of parameter .pγ = 1 − (1 − p) eγ . 2.47 (a) As the Laplace transforms are holomorphic (Theorem 2.36), by the uniqueness of the analytic continuation they coincide on the whole strip .a < ℜz < b of the complex plane, which, under the assumption of (a), contains the imaginary axis. Hence .Lμ and .Lν coincide on the imaginary axis, i.e. .μ and .ν have the same characteristic function and coincide. (b1) It is immediate that .μγ , .νγ are probabilities (see also Exercise 2.46) and that Lμγ (z) =
.
Lμ (z + γ ) , Lμ (γ )
Lνγ (z) =
Lν (z + γ ) Lν (γ )
and now .Lμγ and .Lνγ coincide on the interval .]a − γ , b − γ [ which, as .b − γ > 0 and .a − γ < 0, contains the origin. Thanks to (a), .μγ = νγ . (b2) Obviously dμ(x) = Lμ (γ )e−γ x dμγ (x) = Lν (γ )e−γ x dνγ (x) = dν(x) .
.
2.48 (a) The method of the distribution function gives, for .x ≥ 0, Fn (x) = P(Zn ≤ x) = P(X1 ≤ x, . . . , Xn ≤ x) = P(X1 ≤ x)n = (1 − e−λx )n .
.
Taking the derivative we find the density fn (x) = nλe−λx (1 − e−λx )n−1
.
+∞ for .x ≥ 0 and .fn (x) = 0 for .x < 0. Noting that . 0 xe−λx dx = λ−2 , we have
+∞
E(Z2 ) = 2λ
.
xe−λx (1 − e−λx ) dx = 2λ
0
+∞
(xe−λx − xe−2λx ) dx
0
1 1 3 1 = 2λ 2 − 2 = · 2 λ λ 4λ
And also
+∞
E(Z3 ) = 3λ
.
xe−λx (1 − e−λx )2 dx
0
+∞
= 3λ
(xe−λx − 2xe−2λx + xe−3λx ) dx = 3λ
0
=
1 2 1 − + λ2 4λ2 9λ2
11 1 · 6 λ
(b) We have, for .t ∈ R,
+∞
Ln (t) = nλ
.
etx e−λx (1 − e−λx )n−1 dx .
0
This integral clearly diverges for .t ≥ λ, hence the domain of the Laplace transform is .ℜz < λ for every n. If .t < λ let .e−λx = u, .x = − λ1 log u, i.e. .−λe−λx dx = du, tx = u−t/λ . We obtain .e
1
Ln (t) = n
.
u−t/λ (1 − u)n−1 dt
0
and, recalling from the expression of the Beta laws the relation
1
.
uα−1 (1 − u)β−1 du =
0
Γ (α)Γ (β) , Γ (α + β)
we have for .α = 1 − λt , .β = n, Ln (t) = n
.
Γ (1 − λt )Γ (n) Γ (n + 1 − λt )
·
(c) From the basic relation of the .Γ function, Γ (α + 1) = αΓ (α) ,
.
(7.29)
and taking the derivative we find .Γ (α + 1) = Γ (α) + αΓ (α) and, dividing both sides by .Γ (α + 1), (2.98) follows. We can now compute the mean of .Zn by taking the derivative of its Laplace transform at the origin. We have L n (t) = n Γ (n)
.
=
− λ1 Γ (n + 1 − λt )Γ (1 − λt ) + λ1 Γ (n + 1 − λt )Γ (1 − λt ) Γ (n + 1 − λt )2
Γ (n + 1 − λt )Γ (1 − λt ) nΓ (n)
t − Γ (1 − ) + λ λΓ (n + 1 − λt ) Γ (n + 1 − λt )
and for .t = 0, as .Γ (n + 1) = nΓ (n), L n (0) =
.
1 Γ (n + 1) − Γ (1) + . λ Γ (n + 1)
(7.30)
Thanks to (2.98), .
1 Γ (n) 1 1 Γ (n − 1) Γ (n + 1) = + = + + Γ (n + 1) n Γ (n) n n−1 Γ (n − 1) = ··· =
1 1 + + · · · + 1 + Γ (1) n n−1
and replacing in (7.30), E(Zn ) =
.
1 1 1 1 + + ··· + . λ 2 n
In particular, .E(Zn ) ∼ const · log n. 2.49 (a) Immediate as .ξ, X is a linear function of X. (b1) Taking as .ξ the vector having all its components equal to 0 except for the ith, which is equal to 1, we have that each of the components .X1 , . . . , Xd is Gaussian, hence square integrable. (b2) We have 1
E(eiθξ,X ) = eiθbξ e− 2 σξ θ , 2 2
.
(7.31)
where by .bξ , .σξ2 we denote respectively mean and variance of .ξ, X. Let b denote the mean of X and C its covariance matrix (we know already that X is square integrable). We have (recalling (2.33)) bξ = E(ξ, X) = ξ, b ,
.
σξ2 = E(ξ, X − b2 ) =
d i=1
cij ξi ξj = Cξ, ξ .
Now, by (7.31) with .θ = 1, the characteristic function of X is 1
ξ → E(eiξ,X ) = eiξ,b e− 2 Cξ,ξ ,
.
which is the characteristic function of a Gaussian law. 2.50 (a) Let us compute the joint density of U and V : we expect to find a function of .(u, v) that is the product of a function of u and of a function of v. We have .(U, V ) = Ψ (X, Y ), where x Ψ (x, y) = ( , x2 + y2 . x2 + y2
.
Let us note beforehand that U will be taking values in the interval .[−1, 1] whereas V will be positive. In order to determine the inverse .Ψ −1 , let us solve the system ⎧ x ⎨u = ( 2 x + y2 . ⎩ 2 v = x + y2 . √ √ √ Replacing v in the first equation we find .x = u v and then .y = v 1 − u2 , so that √ √ ( −1 .Ψ (u, v) = u v, v 1 − u2 . Hence
⎛
√ v
u √ 2 v
⎞
⎟ ⎜ D Ψ −1 (u, v) = ⎝ u√v √1−u2 ⎠ − √ 2 2√v
.
1−u
and ( 2 1 det D Ψ −1 (u, v) = 1 1 − u2 + √ u = √ · 2 2 1−u 2 1 − u2
.
Therefore the joint density of .(U, V ) is, for .u ∈ [−1, 1], .v ≥ 0, 1 − 1 (u2 v+v(1−u2 )) 1 e 2 f (Ψ −1 (u, v)) det D Ψ −1 (u, v) = × √ 2π 2 1 − u2 1 1 1 1 = e− 2 v × √ · 2 π 1 − u2
.
Hence U and V are independent. We even recognize the product of an exponential Gamma.(1, 12 ) (the law of V , but we knew that beforehand) and of a distribution of density fU (u) =
.
1 1 √ π 1 − u2
−1 0, .ω ∈ An only for the values of n such that 1 .x ≤ n , i.e. only for a finite number of them. Hence .limn→∞ An = {X = 0} and
P(limn→∞ An ) = 0. Clearly the second half of the Borel-Cantelli Lemma does not apply here as the events .(An )n are not independent. (b1) The events .(Bn )n are now independent and
.
∞ .
P(Bn ) =
n=1
∞ 1 = +∞ n n=1
and by the Borel-Cantelli Lemma, second half, .P(limn→∞ Bn ) = 1. (b2) Now instead ∞ .
P(Bn ) =
n=1
∞ 1 < +∞ n2 n=1
and the Borel-Cantelli Lemma gives .P(limn→∞ Bn ) = 0. 3.4 (a) We have ∞ .
P(Xn ≥ 1) =
n=1
∞
−(log(n+1))α
e
=
n=1
∞
1
n=1
α−1 (n + 1)log(n+1)
·
The series converges if .α > 1 and diverges if .α ≤ 1, hence by the Borel-Cantelli lemma 1 0 if α > 1 .P lim {Xn ≥ 1} = n→∞ 1 if α ≤ 1 . (b1) Let .c > 0. By a repetition of the computation of (a) we have ∞ .
n=1
P(Xn ≥ c) =
∞
e−c(log(n+1)) = α
n=1
∞
1
n=1
α−1 (n + 1)c log(n+1)
·
(7.34)
• Assume .α > 1. The series on the right-hand side on (7.34) is convergent for every .c > 0 so that .P(limn→∞ {Xn ≥ c}) = 0 and .Xn (ω) ≥ c for finitely many indices n only a.s. Therefore there exists a.s. an .n0 such that .Xn < c for every .n ≥ n0 , which implies that .limn→∞ Xn < c and, thanks to the arbitrariness of c, .
lim Xn = 0
n→∞
a.s.
• Assume .α < 1 instead. Now, for every .c > 0, the series on the right-hand side in (7.34) diverges, so that .P(limn→∞ {Xn ≥ c}) = 1 and .Xn ≥ c for infinitely many indices n a.s. Hence .limn→∞ Xn ≥ c and, thanks to the arbitrariness of c, .
lim Xn = +∞
a.s.
n→∞
• We are left with the case .α = 1. We have ∞ .
P(Xn ≥ c) =
n=1
∞ n=1
1 · (n + 1)c
The series on the right-hand side now converges for .c > 1 and diverges for .c ≤ 1. Hence if .c ≤ 1, .Xn ≥ c for infinitely many indices n whereas if .c > 1 there exists an .n0 such that .Xn ≤ c for every .n ≥ n0 a.s. Hence if .α = 1 .
lim Xn = 1
n→∞
a.s.
(b2) For the inferior limit we have, whatever the value of .α > 0, ∞ .
P(Xn ≤ c) =
n=1
∞ α 1 − e−(c log(n+1)) .
(7.35)
n=1
The series on the right-hand side diverges (its general term tends to 1 as .n → ∞), therefore, for every .c > 0, .Xn ≤ c for infinitely many indices n and .limn→∞ Xn = 0 a.s. (c) As seen above, .limn→∞ Xn = 0 whatever the value of .α. Taking into account the possible values of .limn→∞ Xn computed in (b) above, the sequence converges only for .α > 1 and in this case .Xn →n→∞ 0 a.s. 3.5 (a) By Remark 2.1
E(Z1 ) =
.
+∞
P(Z1 ≥ s) ds =
0
∞
n+1
P(Z1 ≥ s) ds
n=0 n
and now just note that
P(Z1 ≥ n + 1) ≤
.
n+1
P(Z1 ≥ s) ds ≤ P(Z1 ≥ n) .
n
(b1) Thanks to (a) the series . ∞ n=1 P(Zn ≥ n) is convergent and by the BorelCantelli Lemma the event .limn→∞ {Zn ≥ n} has probability 0 (even if the .Zn were not independent).
(b2) Now the series . ∞ n=1 P(Zn ≥ n) diverges, hence .limn→∞ {Zn ≥ n} has probability 1. n (c1) Assume that .0 < x2 < +∞ and let .0 < θ < x2 . Then .E(eθX ) < +∞ and thanks to (b1) applied to the r.v.’s .Zn = eθXn , .P limn→∞ eθXn ≥ n = 0 hence θXn < n eventually, i.e. .X < 1 log n for n larger than some .n , so that .e n 0 θ .
1 Xn ≤ log n θ
a.s.
(7.36)
1 Xn ≤ log n x2
a.s.
(7.37)
lim
n→∞
and, by the arbitrariness of .θ < x2 , .
lim
n→∞
Conversely, if .θ > x2 then .eθXn is not integrable and by (b2) .P limn→∞ eθXn ≥ n = 1, hence .Xn > θ1 log n infinitely many times and .
lim
n→∞
1 Xn ≥ log n θ
a.s.
(7.38)
a.s.
(7.39)
and, again by the arbitrariness of .θ > x2 , .
lim
n→∞
1 Xn ≥ log n x2
which together with (7.37) completes the proof. If .x2 = 0 then (7.38) gives .
lim
n→∞
Xn = +∞ . log n
(c2) We have |Xn | = lim . lim √ n→∞ log n n→∞
2
Xn2 log n
and, as .Xn2 ∼ χ 2 (1) and for such a distribution .x2 = √ in (7.40) is equal to . 2 a.s.
1 2
(7.40) (Example 2.37(c)), the .lim
• Note that Example 3.5 is a particular case of (c1). 3.6 (a) The r.v. .limn→∞ |Xn (ω)|1/n , hence also R, is measurable with respect to the tail .σ -algebra .B∞ of the sequence .(Xn )n . R is therefore a.s. constant by Kolmogorov’s 0-1 law, Theorem 2.15.
(b) As .E(|X1 |) > 0, there exists an .a > 0 such that .P(|X1 | > a) > 0. Then ∞ the series . ∞ n=1 P(|Xn | > a) = n=1 P(|X1 | > a) is divergent and by the BorelCantelli Lemma P lim {|Xn | > a} = 1 .
.
n→∞
Therefore .|Xn |1/n > a 1/n infinitely many times and .
lim |Xn |1/n ≥ lim a 1/n = 1
n→∞
n→∞
a.s.
i.e. .R ≤ 1 a.s. (c) By Markov’s inequality, for every .b > 1, P(|Xn | ≥ bn ) ≤
.
E(|Xn |) E(|X1 |) = n b bn
n hence the series . ∞ n=1 P(|Xn | ≥ b ) is bounded above by aconvergent geometric series. By the Borel-Cantelli Lemma .P limn→∞ {|Xn | ≥ bn } = 0, i.e. P |Xn |1/n < b eventually = 1
.
and, as this is true for every .b > 1, .
lim |Xn |1/n ≤ 1
a.s.
n→∞
i.e. .R ≥ 1 a.s. Hence .R = 1 a.s. 3.7 Assume that .Xn →Pn→∞ X. Then for every subsequence of .(Xn )n there exists a further subsequence .(Xnk )k such that .Xnk →k→∞ X a.s., hence also .
d(Xnk , X) 1 + d(Xnk , X)
a.s.
→
n→∞
0.
As the r.v.’s appearing on the left-hand side above are bounded, by Lebesgue’s Theorem .
d(X , X) nk =0. lim E k→∞ 1 + d(Xnk , X)
We have proved that from every subsequence of the quantity on the left-hand side of (3.44) we can extract a further subsequence converging to 0, therefore (3.44) follows by Criterion 3.8.
Conversely, let us assume that (3.44) holds. We have d(X , X) n P ≥ δ = P d(Xn , X) ≥ δ(1 + d(Xn , X)) 1 + d(Xn , X) δ . = P d(Xn , X) ≥ 1−δ
.
Let us fix .ε > 0 and let .δ =
ε 1+ε
so that .ε =
δ 1−δ .
By Markov’s inequality
d(X , X) 1 d(X , X) n n , P d(Xn , X) ≥ ε = P ≥δ ≤ E 1 + d(Xn , X) δ 1 + d(Xn , X)
.
so that .limn→∞ P d(Xn , X) ≥ ε = 0. 3.8 (a) Let Sn =
n
.
Xk .
k=1
Then we have, for .m < n, n n E(|Sn − Sm |) = E Xk ≤ E(|Xk |) ,
.
k=m+1
k=m+1
from which it follows easily that .(Sn )n is a Cauchy sequence in .L1 , which implies 1 .L convergence. (1) (b1) As .E(Xk+ ) ≤ E(|Xk |), the argument of (a) gives that, if .Sn = nk=1 Xk+ , (1) (1) the sequence .(Sn )n converges in .L1 to some integrable r.v. .Z1 . As .(Sn )n is increasing, it also converges a.s. to the same r.v. .Z1 , as the a.s. and the .L1 limits necessarily coincide. (b2) By the same argument as in (b1), the sequence .Sn(2) = nk=1 Xk− converges a.s. to some integrable r.v. .Z2 . We have then .
lim Sn = lim Sn(1) − lim Sn(2) = Z1 − Z2
n→∞
n→∞
n→∞
a.s.
and there is no danger of encountering a .+∞ − ∞ form as both .Z1 and .Z2 are finite a.s. 3.9 For every subsequence of .(Xn )n there exists a further subsequence .(Xnk )k such that .Xnk →k→∞ X a.s. By Lebesgue’s Theorem .
lim E(Xnk ) = E(X) ,
k→∞
(7.41)
hence for every subsequence of .(E[Xn ])n there exists a further subsequence that converges to .E(X), and, by the sub-sub-sequences criterion, .limn→∞ E(Xn ) = E(X). 3.10 (a1) We have, for .t > 0, P(Un > t) = P(X1 > t, . . . , Xn > t) = P(X1 > t) . . . P(Xn > t) = e−nt .
.
Hence the d.f. of .Un is .Fn (t) = 1 − e−nt , .t > 0, and .Un is exponential of parameter n. (a2) We have 1 .
lim Fn (t) =
n→∞
0 if x ≤ 0 1 if x > 0 .
The limit coincides with the d.f. F of an r.v. that takes only the value 0 with probability 1 except for its value at 0, which however is not a continuity point of F . Hence (Proposition 3.23) .(Un )n converges in law (and in probability) to the Dirac mass .δ0 . (b) For every .δ > 0 we have ∞ .
P(Un > ε) =
n=1
∞
e−nε < +∞ ,
n=1
hence by Remark 3.7, as .Un > 0 for every n, .Un →n→∞ 0 a.s. In a much simpler way, just note that .limn→∞ Un exists certainly, the sequence .Un (ω) being decreasing for every .ω. Therefore .(Un )n converges a.s. and, by (a), it converges in probability to 0. The result then follows, as the a.s. limit and the limit in probability coincide. No need for Borel-Cantelli. . . (c) We have 1 1 1−β/α P Vn > β = P Un > β/α = e−n . n n
.
As .1 −
β α
> 0,
.
∞ 1 P Vn > β < +∞ n n=1
and by the Borel-Cantelli Lemma .Vn > n1β for a finite number of indices n only a.s. Hence for n large .Vn ≤ n1β , which is the general term of a convergent series. 3.11 (a1) As .X1 and .X2 are independent and integrable, their product .X1 X2 is also integrable and .E(X1 X2 ) = E(X1 )E(X2 ) = 0 (Corollary 2.10).
Similarly, .X12 and .X22 are integrable (.X1 and .X2 have finite variance) independent r.v.’s, hence .X12 X22 is integrable, and .E(X12 X22 ) = E(X12 )E(X22 ) = Var(X1 )Var(X2 ) = σ 4 . As .X1 X2 is centered, .Var(X1 X2 ) = E(X12 X22 ) = σ 4 . (a2) We have .Yk Ym = Xk Xk+1 Xm Xm+1 . Let us assume, to fix the ideas, .m > k: then the r.v.’s .Xk , .Xk+1 Xm , .Xm+1 are independent and integrable. Hence .Yk Ym is also integrable and E(Yk Ym ) = E(Xk )E(Xk+1 Xm )E(Xm+1 ) = 0
.
(note that, possibly, .k + 1 = m). (b) The r.v.’s .Yn are uncorrelated and have common variance .σ 4 . By Rajchman’s strong law .
1 1 X1 X2 + X2 X3 + · · · + Xn Xn+1 = Y1 + · · · + Yn n n
a.s.
→
n→∞
E(Y1 ) = 0 .
3.12 .(Xn4 )n is a sequence of i.i.d. r.v.’s having a common finite variance, as the Laplace laws have finite moments of all orders. Hence by Rajchman’s strong law .
1 4 X1 + X24 + · · · + Xn4 n
a.s.
→
E(X14 ) .
n→∞
Let us compute .E(X14 ): tracing back to the integrals of the Gamma laws, E(X14 ) =
.
λ 2
+∞
x 4 e−λ|x| dx = λ
−∞
+∞
x 4 e−λx dx =
0
Γ (5) 24 = 4 · 4 λ λ
Again thanks to Rajchman’s strong law 1 2 Xk n n
.
k=1
a.s.
→
n→∞
E(X12 ) =
2 , λ2
hence .
lim
X12 + X22 + · · · + Xn2
n→∞
X14 + X24 + · · · + Xn4
=
1 lim n1 n→∞ n
n
2 k=1 Xk 4 k=1 Xk
n
=
E(X12 ) E(X14 )
=
λ2 12
a.s.
3.13 (a) We have 1 2 2 = (Xk − 2Xk Xn + X n ) n k=1 . n n n 1 2 1 1 2 2 2 = Xk − 2X n Xk + X n = Xk − X n . n n n n
Sn2
k=1
k=1
k=1
(7.42)
By Kolmogorov’s strong law, Theorem 3.12, applied to the sequence .(Xn2 )n , 1 2 Xk n n
.
k=1
a.s.
→
n→∞
E(X12 )
and again by Kolmogorov’s (or Rajchman’s) strong law for the sequence .(Xn )n a.s.
2
Xn
.
→
n→∞
E(X1 )2 .
In conclusion Sn2
.
a.s.
→
E(X12 ) − E(X1 )2 = σ 2 .
n→∞
(b) Thinking of (7.42) we have n 1 E Xk2 = E(X12 ) n
.
k=1
whereas 2
E(X n ) = Var(X n ) + E(X n )2 =
.
1 2 σ + E(X1 )2 n
and putting things together 1 n−1 2 σ . E(Sn2 ) = E(X12 ) − E(X1 )2 ) − σ 2 = n n
.
=σ 2
Therefore .Sn2 →n→∞ σ 2 but, in the average, .Sn2 is always a bit smaller than .σ 2 . 3.14 (a) For every .θ1 ∈ Rd , .θ2 ∈ Rm the weak convergence of the two sequences implies, for their characteristic functions, that & μn (θ1 )
.
→
n→∞
& μ(θ1 ),
& νn (θ2 )
→
n→∞
& ν(θ2 ) .
Hence, denoting by .φn , .φ the characteristic functions of .μn ⊗ νn and of .μ ⊗ ν respectively, by Proposition 2.35 we have .
lim φn (θ1 , θ2 ) = lim & μn (θ1 )& νn (θ2 ) = & μ(θ1 )& ν(θ2 ) = φ(θ1 , θ2 )
n→∞
n→∞
and P. Lévy’s theorem, Theorem 3.20, completes the proof. (b1) Immediate: .μ∗ν and .μn ∗νn are the images of .μ⊗ν and .μn ⊗νn respectively under the map .(x, y) → x + y, which is continuous .Rd × Rd → Rd (Remark 3.16).
(b2) We know (Example 3.9) that .νn →n→∞ δ0 (the Dirac mass at 0). Hence thanks to (b1) μ ∗ νn
.
→
n→∞
μ ∗ δ0 = μ .
3.15 (a) As we assume that the partial derivatives of f are bounded, we can take the derivative under the integral sign (Proposition 1.21) and obtain, for .i = 1, . . . , d, .
∂ ∂ μ ∗ f (x) = ∂xi ∂xi
Rd
f (x − y) dμ(y) =
Rd
∂ f (x − y) dμ(y) . ∂xi
(b1) The .N(0, n1 I ) density is gn (x) =
.
nd/2 − 1 n|x|2 e 2 . (2π )d/2
The proof that the k-th partial derivatives of .gn are of the form 1
Pα (x)e− 2 n|x| , 2
α = (k1 , . . . , kd )
.
(7.43)
for some polynomial .Pα is easily done by induction. Indeed the first derivatives are obviously of this form. Assuming that (7.43) holds for all derivatives up to the order .|α| = k1 + · · · + kd , just note that for every i, .i = 1, . . . , d, we have .
∂ 1 1 ∂ 2 2 Pα (x)e− 2 n|x| = Pα (x) − nxi Pα (x) e− 2 n|x| , ∂xi ∂xi
which is again of the form (7.43). In particular, all derivatives of .gn are bounded. (b2) If .νn = N(0, n1 I ), then (Exercise 2.5) the probability .μn = νn ∗ μ has density
fn (x) =
.
Rd
gn (x − y) dμ(y)
with respect to the Lebesgue measure. The sequence .(μn )n converges in law to .μ as 1
& μn (θ ) = & νn (θ )& μ(θ ) = e− 2n |θ| & μ(θ )
.
2
→
n→∞
& μ(θ ) .
Let us prove that the densities .fn are .C ∞ ; let us assume .d = 1 (the argument also holds in general, but it is a bit more complicated to write down). By induction: .fn is certainly differentiable, thanks to (a), as .gn has bounded derivatives. Let us assume next that Theorem 1.21 (derivation under the integral sign) can be applied m times and therefore that the relation
dm dm . f (x) = g (x − y) dμ(y) n m n dx m R dx
holds. As the integrand again has bounded derivatives, we can again take the derivative under the integral sign, which gives that .fn is .m + 1 times differentiable. therefore, by recurrence, .fn is infinitely many times differentiable. • This exercise, as well as Exercise 2.5, highlights the regularization properties of convolution. 3.16 (a1) We must prove that .f ≥ 0 .ρ-a.e. and that . E f (x) dρ(x) = 1. As 1 .fn →n→∞ f in .L (ρ), for every bounded measurable function .φ : E → R we have
. lim φ(x) dμn (x) − φ(x) dμ(x) n→∞
E
E
= lim φ(x) fn (x) − f (x) dρ(x) n→∞
E
≤ lim
n→∞ E
≤ φ∞ lim
|φ(x)||fn (x) − f (x)| dρ(x)
n→∞ E
|fn (x) − f (x)| dρ(x) = 0 ,
i.e.
.
lim
n→∞ E
φ(x) dμn (x) =
(7.44)
φ(x) dμ(x) . E
By choosing first .φ = 1 we find . E f (x) dρ(x) = 1. Next for .φ = 1{f t} is an open set for every t, so that .limn→∞ μn (f > t) ≥ μ(f > t). By Fatou’s Lemma
.
f (x) dμn (x) = lim
lim n→∞ E
≥
+∞
n→∞ 0
+∞
μn (f > t) dt
μ(f > t) dt =
0
f (x) dμ(x) . E
2
1
0
1 2
1
Fig. 7.7 The graph of .fn of Exercise 3.16 for .n = 13. The rate of oscillation of .fn increases with n. It is difficult to imagine that it might converge in .L1
Fig. 7.8 The graph of the distribution function .Fn of Exercise 3.16 for .n = 13. The effect of the oscillations on the d.f. becomes smaller as n increases
1
0
1
As this relation holds for every l.s.c. function f bounded from below, by Theorem 3.21(a) (portmanteau), .μn →n→∞ μ weakly. 3.18 Recall that a .χ 2 (n)-distributed r.v. has mean n and variance 2n. Hence 1 E Xn = 1, n
Var
.
1
2 Xn = · n n
By Chebyshev’s inequality, therefore, X 2 n − 1 ≥ δ ≤ 2 P n δ n
.
→
n→∞
0.
Hence the sequence .( n1 Xn )n converges to 1 in probability and therefore also in law. Second method (possibly less elegant). Recalling the expression of the characteristic function of the Gamma laws (.Xn is .∼ Gamma.( n2 , 12 )) the characteristic function of . n1 Xn is (Example 2.37(c)) φn (θ ) =
.
n/2
1 2 1 2
−i
=
θ n
n/2
1 1 − i 2θ n
−n/2 1 = 1 − 2θ i n
→
n→∞
eiθ
and we recognize the characteristic function of a Dirac mass at 1. Hence .( n1 Xn )n converges to 1 in law and therefore also in probability, as the limit takes only one value with probability 1 (Proposition 3.29(b)). Third method: let .(Zn )n be a sequence of i.i.d. .χ 2 (1)-distributed r.v.’s and let .Sn = Z1 + · · · + Zn . Therefore for every n the two r.v.’s .
1 Xn n
and
1 Sn n
have the same distribution. By Rajchman’s strong law . n1 Sn →n→∞ 1 a.s., hence also in probability, so that . n1 Xn →Pn→∞ 1.
• Actually, with a more cogent inequality than Chebyshev’s, it is possible to prove that, for every .δ > 0,
.
∞ Xn P − 1 ≥ δ < +∞ , n k=1
so that convergence also takes place a.s. 3.19 First method: distribution functions. Let .Fn denote the d.f. of .Yn = have .Fn (t) = 0 for .t < 0, whereas for .t ≥ 0 Fn (t) = P(Xn ≤ nt) = P(Xn ≤ nt) =
nt λ
.
k=0
=
n
1−
1 n
Xn : we
λ k n
λ nt+1 λ 1 − (1 − λn )nt+1 = 1 − 1 − . n n 1 − (1 − λn )
Hence for every .t ≥ 0 .
lim Fn (t) = 1 − e−λt .
n→∞
We recognize on the right-hand side the d.f. of an exponential law of parameter .λ. Hence .( n1 Xn )n converges in law to this distribution. Second method: characteristic functions. Recalling the expression of the characteristic function of a geometric law, Example 2.25(b), we have φXn (θ ) =
.
λ n
1 − (1 −
λ iθ n )e
=
λ , n(1 − eiθ ) + λeiθ
hence φYn (θ ) = φXn
.
θ n
=
λ n(1 − eiθ/n ) + λeiθ/n
·
Noting that .
lim n(1 − eiθ/n ) = θ lim
n→∞
n→∞
1 − eiθ/n θ n
= −θ
we have .
lim φYn (θ ) =
n→∞
λ , λ − iθ
d iθ e |θ=0 = −iθ , dθ
which is the characteristic function of an exponential law of parameter .λ. 3.20 (a) The d.f. of .Xn is, for .y ≥ 0,
y
Fn (y) =
.
0
n 1 · dx = 1 − 1 + ny (1 + nx)2
As, of course, .Fn (y) = 0 for .y ≤ 0, 1 .
lim Fn (y) =
n→∞
1 y>0 0 y ≤0.
The limit is the d.f. of an r.v. X with .P(X = 0) = 1. .(Xn )n converges in law to X and, as the limit is an r.v. that takes only one value, the convergence takes place also in probability (Proposition 3.29(b)). (b) The a.s. limit, if it existed, would also be 0, but for every .δ > 0 we have P(Xn > δ) = 1 − P(Xn ≤ δ) =
.
1 · 1 + nδ
(7.45)
The series . ∞ n=1 P(|Xn | > δ) diverges and by the Borel-Cantelli Lemma (second half) .P(limn→∞ {Xn > δ}) = 1 and the sequence does not converge to zero a.s. We have even that .Xn > δ infinitely many times and, as .δ is arbitrary, .limn→∞ Xn = +∞. For the inferior limit note that for every .ε > 0 we have ∞ .
P(Xn < ε) =
n=1
∞ 1− n=1
1 = +∞ , 1 + nε
hence .P(limn→∞ {Xn < ε}) = 1. Therefore .Xn < ε infinitely many times with probability 1 and .limn→∞ Xn = 0. 3.21 Given the form of the r.v.’s .Zn of this exercise, it appears that their d.f.’s should be easier to deal with than their characteristic functions. (a) We have, for .0 ≤ t ≤ 1, P(Zn > t) = P(X1 > t, . . . , Xn > t) = (1 − t)n ,
.
hence the d.f. .Fn of .Zn is ⎧ ⎪ ⎪ ⎨0 .Fn (t) = 1 − (1 − t)n ⎪ ⎪ ⎩1
for t < 0 for 0 ≤ t ≤ 1 for t > 1 .
Hence .
lim Fn (t) =
1 0
for t ≤ 0
1
for t > 0
n→∞
and we recognize the d.f. of a Dirac mass at 0, except for the value at 0, which however is not a continuity point of the d.f. of this distribution. We conclude that .Zn converges in law to an r.v. having this distribution and, as the limit is a constant, the convergence takes place also in probability. As the sequence .(Zn )n is decreasing it converges a.s. (b) The d.f., .Gn , of .n Zn is, for .0 ≤ t ≤ n, n Gn (t) = P(nZn ≤ t) = P Zn ≤ nt = Fn nt = 1 − 1 − nt .
.
As 1 .
lim Gn (t) = G(t) :=
n→∞
0
for t ≤ 0
1 − e−t
for t > 0
the sequence .(n Zn )n converges in law to an exponential distribution with parameter λ = 1. Therefore, for n large,
.
P min(X1 , . . . , Xn ) ≤
.
2 n
≈ 1 − e−2 = 0.86 .
3.22 Let us compute the d.f. of .Mn : for .k = 0, 1, . . . we have (n) P(Mn ≤ k) = 1 − P(Mn ≥ k + 1) = 1 − P U1 ≥ k + 1, . . . , Un(n) ≥ k + 1 n − k n (n) n = 1 − P U1 ≥ k + 1 = 1 − . n+1
.
Now .
lim
n→∞
n − k n n+1
k + 1 n 1− = e−(k+1) . n→∞ n+1
= lim
Hence .
lim P(Mn ≤ k) = 1 − e−(k+1) ,
n→∞
which is the d.f. of a geometric law of parameter .e−1 . 3.23 (a) The characteristic function of .μn is & μn (θ ) = (1 − an ) eiθ·0 + an eiθn = 1 − an + an eiθn
.
and if .an →n→∞ 0 & μn (θ )
.
→
n→∞
1
for every θ ,
which is the characteristic function of a Dirac mass .δ0 . It is possible to come to the same result also by computing the d.f.’s (b) Let .Xn , X be r.v.’s with .Xn ∼ μn and .X ∼ δ0 . Then E(Xn ) = (1 − an ) · 0 + an · n = nan ,
.
E(Xn2 ) = (1 − an ) · 02 + an · n2 = n2 an , Var(Xn ) = E(Xn2 ) − E(Xn )2 = n2 an (1 − an ) . 1 If, for instance, .an = √1n then .E(Xn ) →n→∞ +∞, whereas .E(X) = 0. If .an = n3/2 then the expectations converge to the expectation of the limit but .Var(Xn ) →n→∞ +∞, whereas .Var(X) = 0. (c) By Theorem 3.21 (portmanteau), as .x → x 2 is continuous and bounded below, we have, with .Xn ∼ μn , .X ∼ μ,
2 . lim E(Xn ) = lim x 2 dμn ≥ x 2 dμ = E(X2 ) . n→∞
n→∞
The same argument applies for .limn→∞ E(|Xn |). 3.24 (a) The d.f. of .Xn is, for .t ≥ 0, .Fn (t) = 1 − e−λn t . As .Fn (t) →n→∞ 0 for every .t ∈ R, the d.f.’s of the .Xn do not converge to any distribution function. (b) Note in the first place that the r.v.’s .Yn take their values in the interval .[0, 1]. We have, for every .t < 1, {Yn ≤ t} =
∞
.
{k ≤ Xn ≤ k + t}
k=0
so that the d.f. of .Yn is, for .0 ≤ t < 1, Gn (t) := P(Yn ≤ t) =
∞
.
∞ P(k ≤ Xn < k + t) = (e−λn k − e−λn (k+t) )
k=0
= (1 − e−λn t )
∞
k=0
e−λn k =
k=0
1 − e−λn t 1 − e−λn
Therefore .
lim Gn (t) = t
n→∞
=
λn t + o(λn t) · λn + o(λn )
and .(Yn )n converges in law to a uniform distribution on .[0, 1]. 3.25 The only if part is immediate, as .x → θ, x is a continuous map .Rd → R. L If .θ, Xn →n→∞ θ, X for every .θ ∈ Rd , as both the real and the imaginary parts ix of .x → e are bounded and continuous, we have .
lim E(eiθ,Xn ) = E(eiθ,X )
n→∞
and the result follows thanks to P. Lévy’s Theorem 3.20. 3.26 By the Central Limit Theorem .
X1 + · · · + Xn √ n
L
→
n→∞
N(0, σ 2 )
and the sequence .(Zn )n converges in law to the square of a .N(0, σ 2 )-distributed r.v. (Remark 3.16), i.e. to a Gamma.( 12 , 2σ1 2 )-distributed r.v. 3.27 (a) By the Central Limit Theorem the sequence Sn∗ =
.
X1 + · · · + Xn − nb √ nσ
converges in law to an .N(0, 1)-distributed r.v., where b and .σ 2 are respectively the mean and the variance of .X1 . Here .b = E(Xi ) = 12 , whereas
E(X12 ) =
.
1
x 2 dx =
0
1 3
1 ∗ . and therefore .σ 2 = 13 − 14 = 12 . The r.v. W in (3.49) is nothing else than .S12 ∗ It is still to be seen whether .n = 12 is large enough for .Sn to be approximatively .N (0, 1). Figure 7.9 and (b) below give some elements of appreciation. (b) We have, integrating by parts,
+∞ 1 2 E(X4 ) = √ x 4 e−x /2 dx 2π −∞
+∞ 1 2 +∞ 2 =√ − x 3 e−x /2 +3 x 2 e−x /2 dx −∞ 2π −∞
+∞ 1 2 x 2 e−x /2 dx = 3 . =3√ 2π −∞ .
=Var(X)=1
The computation of the moment of order 4 of W is a bit more involved. If .Zi = Xi − 12 , then the .Zi ’s are independent and uniform on .[− 12 , 12 ] and E(W 4 ) = E[(Z1 + · · · + Z12 )4 ] .
(7.46)
.
Let us expand the fourth power .(Z1 + · · · + Z12 )4 into a sum of monomials. As 3 .E(Zi ) = E(Z ) = 0 (the .Zi ’s are symmetric), the expectation of many terms i appearing in this expansion will vanish. For instance, as the .Zi are independent, E(Z13 Z2 ) = E(Z13 )E(Z2 ) = 0 .
.
A moment of reflection shows that a non-zero contribution is given only by the terms, in the development of (7.46), of the form .E(Zi2 Zj2 ) = E(Zi2 )E(Zj2 ) with 4 4 .i = j and those of the form .E(Z ). The term .Z clearly has a coefficient .= 1 in i i the expansion of the right-hand term in (7.46). In order to determine which is the coefficient of .Zi2 Zj2 , i = j , we remark that in the power series expansion around 0 of φ(x1 , . . . , x12 ) = (x1 + · · · + x12 )4
.
the monomial .xi2 xj2 , for .i = j , has coefficient .
1 ∂ 4φ 1 (0) = × 24 = 6 . 2 2 2!2! ∂xi ∂xj 4
We have
E(Zi2 ) =
1/2
.
−1/2
x 2 dx =
1 , 12
E(Zi4 ) =
1/2
−1/2
x 4 dx =
1 · 80
As all the terms of the form .E(Zi2 Zj2 ), i = j , are equal and there are .11 + 10 + · · · + 1 = 12 × 12 × 11 = 66 of them, their contribution is 6 × 66 ×
.
11 1 = · 144 4
The contribution of the terms of the form .E(Zi4 ) (there are 12 of them), is . 12 80 . In conclusion E(W 4 ) =
.
11 12 + = 2.9 . 4 80
Fig. 7.9 Comparison between the densities of W (solid) and of a true .N (0, 1) density (dots). The two graphs are almost indistinguishable
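A small simulation makes the comparison of the caption concrete; this sketch is not part of the original solution and the sample size is an arbitrary choice. It reproduces the features discussed here: mean close to 0, variance close to 1, fourth moment close to 2.9 instead of 3, and no values outside $[-6,6]$.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random((200_000, 12)).sum(axis=1) - 6.0   # W = U_1 + ... + U_12 - 6, U_i uniform on [0, 1]

print(round(w.mean(), 3), round(w.var(), 3))      # ~ 0 and ~ 1
print(round(np.mean(w ** 4), 3))                  # ~ 2.9, against 3 for a true N(0, 1)
print(round(np.abs(w).max(), 3))                  # never exceeds 6, unlike a true Gaussian sample
```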
The r.v. W turns out to have a density which is quite close to an .N(0, 1) (see Fig. 7.9). This was to be expected, the uniform distribution on .[0, 1] being symmetric around its mean, even if the value .n = 12 seems a bit small. However as an approximation of an .N(0, 1) r.v. W has some drawbacks: for instance it cannot take values outside the interval .[−6, 6] whereas for an .N(0, 1) r.v. this is possible, even if with a very small probability. In practice, in order to simulate an .N (0, 1) r.v., W can be used as a fast substitute of the Box-Müller algorithm (Example 2.19) for tasks that require a moderate number of random numbers, but one must be very careful in simulations requiring a large number of them because, then, the occurrence of a very large value is not so unlikely any more. 3.28 (a) Let .A := limn→∞ An . Recalling that .1A = limn→∞ 1An , by Fatou’s Lemma .P lim An = E lim 1An ≥ lim E(1An ) = lim P(An ) ≥ α . n→∞
n→∞
n→∞
n→∞
(b) Let us assume ad absurdum that for some .ε > 0 it is possible to find events An such that .P(An ) ≤ 2−n and .Q(An ) ≥ ε. If again .A = limn→∞ An we have, by the Borel-Cantelli Lemma,
.
P(A) = 0
.
whereas, thanks to (a) with .P replaced by .Q, Q(A) ≥ ε ,
.
contradicting the assumption that .Q P. • Note that (b) of this exercise is immediate if we admit the Radon-Nikodym Theorem: we would have .dQ = X dP for some density X and the result follows immediately thanks to Proposition 3.33, as .{X} is a uniformly integrable family.
3.29 (a) By Fatou’s Lemma M r ≥ lim E(|Xn |r ) ≥ E(|X|r )
.
n→∞
(this is as in Exercise 1.15(a1)). (b) We have .|Xn − X|p →n→∞ 0 a.s. Moreover, E (|Xn − X|p )r/p = E[|Xn − X|r ] ≤ 2r−1 E(|Xn |r ) + E(|X|r ) ≤ 2r M r .
.
Therefore the sequence .(|Xn − X|p )n tends to 0 as .n → ∞ and is bounded in .Lr/p . As . pr > 1, the sequence .(|Xn − X|p )n is uniformly integrable by Proposition 3.35. The result follows thanks to Theorem 3.34. If .Xn →n→∞ X in probability only, just note that from every subsequence of .(Xn )n we can extract a further subsequence .(Xnk )nk converging to X a.s. For this subsequence, by the result just proved we have .
lim E(|Xnk − X|p ) = 0
k→∞
and the result follows thanks to the sub-sub-sequence Criterion 3.8. 3.30 (a) Just note that, for every R, .ψR is a bounded continuous function. (b) Recall that a uniformly integrable sequence is bounded in .L1 ; let .M > 0 be such that .E(|Xn |) ≤ M for every n. By the portmanteau Theorem 3.21(a) (.x → |x| is continuous and positive) we have M ≥ lim E(|Xn |) ≥ E(|X|) ,
.
n→∞
so that the limit X is integrable. Moreover, as .ψR (x) = x for .|x| ≤ R, we have |Xn − ψR (Xn )| ≤ |Xn |1{|Xn |>R} and .|X − ψR (X)| ≤ |X|1{|X|>R} . Let .ε > 0 and R be such that
.
E(|X|1{|X|>R} ) ≤ ε
.
and
E(|Xn |1{|Xn |>R} ) ≤ ε
for every n. Then E(Xn ) − E(X)
.
≤ E[Xn − ψR (Xn )] + E[ψR (Xn )] − E[ψR (X)] + E[ψR (X) − X] ≤ E(|Xn |1{|Xn |>R} ) + E[ψR (Xn )] − E[ψR (X)] + E(|X|1{|X|>R} ) ≤ 2ε + E[ψR (Xn )] − E[ψR (X)] .
Item (a) above then gives .
lim E(Xn ) − E(X) ≤ 2ε
n→∞
and the result follows thanks to the arbitrariness of .ε. • Note that in the argument of this exercise we took special care not to write quantities like .E(Xn − X), which might not make sense, as the r.v.’s .Xn , X might not be defined on the same probability space. 3.31 (a) If .(Zn )n is a sequence of independent .χ 2 (1)-distributed r.v.’s and .Sn = Z1 + · · · + Zn , then, for every n, .
Xn − n Sn − n ∼ √ √ 2n 2n
and, recalling that .E(Zi ) = 1, .Var(Zi ) = 2, the term on the right-hand side converges in law to an .N(0, 1) law by the Central Limit Theorem. Therefore this is true also for the left-hand side. (b1) We have √ 2n 1 . lim √ = lim 3 √ 3 n→∞ 2Xn + 2n − 1 n→∞ Xn 2n−1 + n 2n and as, by the strong Law of Large Numbers, .
lim
n→∞
Xn =1 n
a.s.
we obtain √ 2n 1 . lim √ = √ n→∞ 2Xn + 2n − 1 2
a.s.
(b2) We have ( .
2Xn −
√ 2Xn − 2n + 1 2n − 1 = √ √ 2Xn + 2n − 1 =√
1 2Xn − 2n · +√ √ √ 2Xn + 2n − 1 2Xn + 2n − 1
(7.47)
The last term on the right-hand side is bounded above by .(2n−1)−1/2 and converges to 0 a.s., whereas √ 2Xn − 2n 2n Xn − n .√ =2 √ · ×√ √ √ 2Xn + 2n − 1 2Xn + 2n − 1 2n We have seen in (a) that .
Xn − n √ 2n
L
→
n→∞
N(0, 1)
and recalling (7.47), (3.50) follows by (repeated applications of) Slutsky’s Lemma 3.45. (c) From (a) we derive, denoting by .Φ the d.f. of an .N(0, 1) law, the approximation x − n X − n x − n n ∼Φ √ Fn (x) = P(Xn ≤ x) = P √ ≤ √ 2n 2n 2n
.
(7.48)
whereas from (b) √ ( Fn (x) = P(Xn ≤ x) = P 2Xn − 2n − 1 . √ √ √ √ ≤ 2x − 2n − 1 ∼ Φ 2x − 2n − 1 .
(7.49)
In order to deduce from (7.48) an approximation of the quantile .χα2 (n), we must solve the equation, with respect to the unknown x, x − n . α=Φ √ 2n
.
Denoting by .φα the quantile of order .α of an .N(0, 1) law, x must satisfy the relation .
x−n = φα , √ 2n
i.e. x=
.
√ 2n φα + n .
Similarly, (7.48) gives the approximation x=
.
√ 2 1 φα + 2n − 1 . 2
Fig. 7.10 The true d.f. of a .χ 2 (100) law in the interval .[120, 130], together with the CLT approximation (7.48) (dashes) and Fisher’s approximation (7.49) (dots)
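The quantile comparison carried out in this solution is easy to recompute. The sketch below is illustrative only; it assumes SciPy's chi2.ppf and norm.ppf for the exact quantiles and uses $\varphi_\alpha$ computed exactly rather than the rounded value 1.65.

```python
import numpy as np
from scipy.stats import chi2, norm

alpha = 0.95
phi_a = norm.ppf(alpha)
for n in [100, 200, 300, 400, 500]:
    exact = chi2.ppf(alpha, n)
    fisher = 0.5 * (phi_a + np.sqrt(2 * n - 1)) ** 2   # Fisher's approximation (7.49)
    clt = np.sqrt(2 * n) * phi_a + n                   # CLT approximation (7.48)
    print(n, round(exact, 2), round(fisher, 2), round(clt, 2))
```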
For .α = 0.95, i.e. .φα = 1.65, and .n = 100 we obtain respectively x = 1.65 ·
.
√ 200 + 100 = 123.334
and √ 1 (1.65 + 199 )2 = 124.137 , 2
x=
.
which is a much better approximation of the true value .124.34. Fisher’s approximation, proved in (b), remains very good also for larger values of n. Here are the values of the quantiles of order .α = 0.95 for some values of n and their approximations. n .
1 2
χα2 (n) √
2 (φ√ α + 2n − 1 ) 2n φα + n
200 233.99 233.71 232.90
300 341.40 341.11 340.29
400 447.63 447.35 446.52
500 553.13 552.84 552.01
see also Fig. 7.10. 3.32 (a) Recalling the value of the mean and variance of the Gamma distributions, E( n1 Xn ) = 1 and .Var( n1 Xn ) = n1 . Hence by Chebyshev’s inequality
.
X 1 n − 1 ≥ δ ≤ 2 , P n δ n
.
so that . Xnn →n→∞ 1 in probability and in law.
(b) Let .(Zn )n be a sequence of i.i.d. Gamma.(1, 1)-distributed r.v.’s and let .Sn = Z1 + · · · + Zn . Then the r.v.’s .
1 √ (Xn − n) n
1 √ (Sn − n) n
and
have the same distribution for every n. Now just note that, by the Central Limit Theorem, the latter converges in law to an .N(0, 1) distribution. (c) We can write √ 1 n 1 .√ (Xn − n) = √ √ (Xn − n) . Xn Xn n Thanks to (a) and (b), √ n √ Xn
n→∞
1 √ (Xn − n) n
n→∞
1 (Xn − n) √ Xn
n→∞
.
L
→
1,
L
→
N(0, 1)
and by Slutsky’s Lemma .
L
→
N(0, 1) .
3.33 As the r.v.’s .Xk are centered and have variance equal to 1, by the Central Limit Theorem √ .
n Xn =
X1 + · · · + Xn √ n
L
→
n→∞
N(0, 1) .
(a) As the derivative of the sine function at 0 is equal to 1, the Delta method gives √ .
n sin X n
L
→
n→∞
N(0, 1) .
(b) As the derivative of the cosine function at 0 is equal to 0, again the Delta method gives √ .
n (1 − cos X n )
L
→
n→∞
N(0, 0) ,
i.e. the sequence converges in law to the Dirac mass at 0. (c) We can write 2 √ 3 .n(1 − cos X n ) = n 1 − cos X n .
Let us apply the Delta method to the function .f (x) =
√ 1 − cos x. We have
√ 1 − cos x 1 .f (0) = lim =√ · x→0 x 2
The Delta method gives 3 √ . n 1 − cos X n
L
→
n→∞
Z ∼ N(0, 12 ) ,
so that n(1 − cos X n )
.
L
→
n→∞
Z 2 ∼ Γ ( 12 , 1) .
4.1 (a) The .σ -algebra . G is generated by the two-elements partition .A0 = {X +Y = 0} and .A1 = Ac0 = {X + Y ≥ 1}, i.e. . G = {Ω, A0 , A1 , ∅}. (b) We are as in Example 4.8: .E(X| G) takes on .Ai , i = 0, 1, the value αi =
.
E(X1Ai ) · P(Ai )
As .X = 0 on .A0 , .E(X1A0 ) = 0 and .α0 = 0. On the other hand .X1A1 = 1{X=1} 1{X+Y ≥1} = 1{X=1} and therefore .E(X1A1 ) = P(X = 1) = p and α1 =
.
p p = · P(A1 ) 1 − (1 − p)2
Hence E(X| G) =
.
p 1{X+Y ≥1} . 1 − (1 − p)2
(7.50)
The r.v. .E(X| G) takes the values .p(1 − (1 − p)2 )−1 with probability .1 − (1 − p)2 and 0 with probability .P(A0 ) = (1 − p)2 . Note that .E[E(X| G)] = p = E(X). By symmetry (the right-hand side of (7.50) being symmetric in X and Y ) .E(X| G) = E(Y | G). As a non-constant r.v. cannot be independent of itself, .E(X| G) and .E(Y | G) are not independent. 4.2 (a) The r.v. .E(1A | G) is . G-measurable so that .B = {E(1A | G) = 0} ∈ G and, by the definition of conditional expectation, E(1A 1B ) = E E(1A | G)1B .
.
(7.51)
As .E(1A | G) = 0 on B, (7.51) implies .E(1A 1B ) = 0. As .0 = E(1A 1B ) = P(A ∩ B) we have .B ⊂ Ac a.s. (b) If .B = {E(X| G) = 0} we have E(X1B ) = E E(X| G)1B = 0 .
.
The r.v. .X1B is positive and its expectation is equal to 0, hence .X1B = 0 a.s., which is equivalent to saying that X vanishes a.s. on B. 4.3 Statement (a) looks intuitive: adding the information . D, which is independent of X and of . G, should not provide any additional information useful to the prediction of X. But given how the exercise is formulated, the reader should have become suspicious that things are not quite as they seem. Let us therefore prove (b) as a start; we shall then look for a counterexample in order to give a negative answer to (a). (b) The events of the form .G ∩ D, .G ∈ G, D ∈ D, form a class that is stable with respect to finite intersections, generating . G ∨ D and containing .Ω. Thanks to Remark 4.3 we need only prove that E E(X| G)1G∩D = E(X1G∩D )
.
for every .G ∈ G, .D ∈ D. As . D is independent of .σ (X) ∨ G (and therefore also of G),
.
E E(X| G)1G∩D = E E(X1G | G)1D
.
↓
= E(X1G )E(1D ) = E(X1G 1D ) = E(X1G∩D ) , where .↓ denotes the equality where we use the independence of . D and .σ (X) ∨ G. (a) The counterexample is based on the fact that it is possible to construct r.v.’s .X, Y, Z that are pairwise independent but not independent globally and even such that X is .σ (Y ) ∨ σ (Z)-measurable. This was seen in Remark 2.12. Hence if . G = σ (Y ), . D = σ (Z), then E(X| G] = E(X)
.
whereas E(X| G ∨ D) = X .
.
4.4 (a) Every event .A ∈ σ (X) is of the form .A = {X ∈ A } with .A ∈ B(E). Note that .{X = x} ∈ σ (X), as .{x} is a Borel set. In order for A to be strictly contained in
.{X = x}, .A must be strictly contained in .{x}, which is not possible, unless .A = ∅. (b) If A is an atom of . G and X was not constant on A, then X would take on A at least two distinct values .y, z. But then the two events .{X = y} ∩ A and .{X = z} ∩ A
would be . G-measurable, nonempty and strictly contained in A, thus contradicting the assumption that A is an atom. (c) .W = E(Z |X) is .σ (X)-measurable and therefore constant on .{X = x}, as a consequence of (a) and (b) above. The value c of this constant is determined by the relation
.cP(X = x) = E(W 1{X=x} ) = E(Z1{X=x} ) = Z dP , {X=x}
i.e. (4.27). 4.5 (a) We have .E[h(X)|Z] = g(Z), where g is such that, for every bounded measurable function .ψ, E h(X)ψ(Z) = E g(Z)ψ(Z) .
.
But .E[h(X)ψ(Z)] = E[h(Y )ψ(Z)], as .(X, Z) ∼ (Y, Z), and therefore also E[h(Y )|Z] = g(Z) a.s. (b1) The r.v.’s .(T1 , T ) and .(T2 , T ) have the same joint law. Actually .(T1 , T ) can be obtained from the r.v. .(T1 , T2 + · · · + Tn ) through the map .(s, t) → (s, s + t). .(T2 , T ) is obtained through the same map from the r.v. .(T2 , T1 + T3 · · · + Tn ). As the two r.v.’s .(T1 , T2 + · · · + Tn ) and .(T2 , T1 + T3 · · · + Tn ) have the same law (they have the same marginals and independent components), .(T1 , T ) and .(T2 , T ) have the same law. The same argument gives that .(T1 , T ), . . . , (Tn , T ) have the same law. (b2) Thanks to (a) and (b1) .E(T1 |T ) = E(T2 |T ) = · · · = E(Tn |T ) a.s., hence a.s. .
.
nE(T1 |T ) = E(T1 |T ) + · · · + E(Tn |T ) = E(T1 + · · · + Tn |T ) = E(T |T ) = T .
4.6 (a) .(X, XY ) is the image of .(X, Y ) under the map .ψ(x, y) := (x, xy). (−X, XY ) is the image of .(−X, −Y ) under the same map .ψ. As the Laplace distribution is symmetric, .(X, Y ) and .(−X, −Y ) have the same distribution (independent components and same marginals), also their images under the same function have the same distribution. (b1) We must determine a measurable function g such that, for every bounded Borel function .φ
.
E[X φ(XY )] = E[g(XY ) φ(XY )] .
.
Thanks to (a) .E(X φ(XY )) = −E(X φ(XY )) hence .E(X φ(XY )) = 0. Therefore g ≡ 0 is good and .E(X|XY = z) = 0. (b2) Of course the argument leading to .E(X|XY = z) = 0 holds for every pair of independent integrable symmetric r.v.’s, hence also for .N(0, 1)-distributed ones.
.
(b3) A Cauchy r.v. is symmetric but not integrable, nor l.s.i. as
−
E(X ) =
.
0
−∞
−x dx = +∞ . π(1 + x 2 )
Conditional expectation for such an r.v. is not defined. 4.7 (a) Let .φ : R+ → R be a bounded Borel function. We have, in polar coordinates, .E φ(|X|) =
Rm
φ(|x|)g(|x|) dx =
+∞
= ωm−1
Sm−1
+∞
dθ
φ(r)g(r)r m−1 dr
0
φ(r)g(r)r m−1 dr ,
0
where .Sm−1 is the unit sphere of .Rm and .ωm−1 denotes the .(m − 1)-dimensional measure of .Sm−1 . We deduce that .|X| has density g1 (t) = ωm−1 g(t)t m−1 .
.
(b) Recall that every .σ (|X|)-measurable r.v. W is of the form .W = h(|X|) (this is Doob’s criterion, Proposition 1.7). Hence, for every bounded Borel function .ψ we : R+ → R such that, for every bounded Borel function must determine a function .ψ (|X|)h(|X|)]. We have, again in polar coordinates, h, .E[ψ(X)h(|X|)] = E[ψ E ψ(X)h(|X|) =
.
= =
Sm−1
1
+∞
dθ
0
Rm
ψ(x)h(|x|)g(|x|) dx
ψ(r, θ )h(r)g(r)r m−1 dr
+∞
dθ
ψ(t, θ )h(t)g1 (t) dt ωm−1 Sm−1 0
+∞ 1 (|X|)h(|X|) h(t)g1 (t) ψ(t, θ ) dθ dt = E ψ = ωn−1 Sm−1 0 with (t) := .ψ
1 ωm−1
Sm−1
ψ(t, θ ) dθ .
Hence (|X|) E ψ(X) |X| = ψ
.
a.s.
(t) is the average of .ψ on the sphere of radius t. Note that .ψ 4.8 (a) As .{Z > 0} ⊂ {E(Z | G) > 0} we have Z ≥ Z1{E(Z | G)>0} ≥ Z1{Z>0} = Z
.
and obviously E(ZY | G) = E(Z1{E(Z | G)>0} Y | G) = E(ZY | G)1{E(Z | G)>0}
.
a.s.
(b1) As the events of probability 0 for .P are also negligible for .Q, .{Z = 0} ⊃ {E(Z | G) = 0} also .Q-a.s. Recalling that .Q(Z = 0) = E(Z1{Z=0} ) = 0 we obtain .Q(E(Z | G) = 0) ≤ Q(Z = 0) = 0. (b2) First note that the r.v. .
E(Y Z | G) E(Z | G)
of (4.29) is . G-measurable and well defined, as .E(Z | G) > 0 .Q-a.s. Next, for every bounded . G-measurable r.v. W we have EQ
.
E(Y Z | G) W =E Z W . E(Z | G) E(Z | G)
E(Y Z | G)
As in the mathematical expectation on the right-hand side Z is the only r.v. that is not . G-measurable, .
E(Y Z | G) E(Y Z | G) W G = E E(Z | G) W ··· = E E Z E(Z | G) E(Z | G) = E E(Y Z | G)W = E(Y ZW ) = EQ (Y W )
and the result follows. • In solving Exercise 4.8 we have been a little on the sly on a delicate point that deserves more attention. Always recall that a conditional expectation (with respect to a probability .P) is not an r.v., but a family of r.v.’s that differ among them only on .P-negligible events. Therefore the quantity .E(Z | G) must be considered with caution when we argue with respect to a probability .Q different from .P, as a .P-negligible event might not also be .Q-negligible. In this case there are no such difficulties as .P Q.
4.9 (a) By the freezing lemma, Lemma 4.11, the Laplace transform of X is L(z) = E(e
zZT
1 2 2 ) = E E(ezZT |T ) = E(e 2 z T ) =
1
1 2 2 t
2t e 2 z
dt
0
∞ 1 2 n 1 2 1 2 2 1 2 2 t=1 = 2 (e 2 z − 1) = z ) . = 2 e2 z t t=0 (n + 1)! 2 z z
.
(7.52)
n=0
L is defined on the whole of the complex plane so that the convergence abscissas are .x1 = −∞, .x2 = +∞. The characteristic function is of course φ(θ ) = L(iθ ) =
.
1 2 2 (1 − e− 2 θ ) . 2 θ
See in Fig. 7.11 the appearance of the density having such a characteristic function. (b) As its Laplace transform is finite in a neighborhood of the origin, X has finite moments of all orders. Of course .E(X) = 0 as .φ is real-valued, hence X is symmetric. Moreover the power series expansion of (7.52) gives E(X2 ) = L
(0) =
.
1 · 2
Alternatively, directly,
1
Var(ZT ) = E(Z 2 T 2 ) = E(Z 2 )E(T 2 ) =
.
t 2 · 2t dt =
0
−3
−2
−1
0
1
t 4 1 1 = · 2 0 2
2
Fig. 7.11 The density of the r.v. X of Exercise 4.9, computed numerically with the formula (2.54) of the inversion Theorem 2.33. It looks like the graph of the Laplace density, but it tends to 0 faster at infinity
(c) Immediate with the same argument as in Exercise 2.44: by Markov’s inequality, for every .R > 0, .x > 0, P(X ≥ x) = P(eRX ≥ eRx ) ≤ e−Rx E(eRX ) = L(R) e−Rx
.
and in the same way .P(X ≤ −x) ≤ L(−R) e−Rx ). Therefore P(|X| ≥ x) ≤ L(R) + L(−R) e−Rx .
.
Of course property (c) holds for every r.v. X having both convergence abscissas infinite. 4.10 (a) Immediate, as X is assumed to be independent of . G (Proposition 4.5(c)). (b1) We have, for .θ ∈ Rm , .t ∈ R, φ(X,Y ) (θ, t) = E(eiθ,X eitY ) = E E(eiθ,X eitY | G) . = E eitY E(eiθ,X | G) = E eitY E(eiθ,X ) =
E(eiθ,X )E(eitY )
(7.53)
= φX (θ )φY (t) .
(b2) According to the definition, X and . G are independent if and only if the events of .σ (X) are independent of those belonging to . G, i.e. if and only if, for every m .A ∈ B(R ) and .G ∈ G, the events .{X ∈ A} and G are independent. But this is immediate, thanks to (7.53): choosing .Y = 1G , the r.v.’s X and .1G are independent thanks to the criterion of Proposition 2.35. 4.11 (a) We have (freezing lemma again), E(eiθ
.
√
XY
√ 1 2 ) = E E(eiθ X Y )|X = E(e− 2 θ X )
and we land on the Laplace transform of the Gamma distributions. By Example 2.37 (c) or directly − 12 θ 2 X
E(e
.
)=λ
+∞
−λt − 12 θ 2 t
e
e
+∞
dt = λ
0
1
e− 2 (θ
0
2 +2λ)t
dt =
2λ · 2λ + θ 2
(b) The characteristic function of a Laplace distribution is computed in Exercise 2.43(a): E(eiθW ) =
.
α2
α2 · + θ2
(c) Comparing the results of (a) and (b) we see that Z has a Laplace law of √ parameter . 2λ.
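A quick simulation check of this conclusion (illustrative only; the sample size, seed and value of $\lambda$ are arbitrary choices): it compares the empirical characteristic function of $Z=X\sqrt{Y}$ with that of a Laplace law of parameter $\sqrt{2\lambda}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
n = 1_000_000
x = rng.standard_normal(n)
y = rng.exponential(1.0 / lam, n)          # Exp(lambda) has mean 1/lambda
z = x * np.sqrt(y)

a = np.sqrt(2 * lam)                       # claimed Laplace parameter
for theta in [0.5, 1.0, 2.0, 5.0]:
    empirical = np.mean(np.cos(theta * z))             # real part of the empirical characteristic function
    laplace_cf = a ** 2 / (a ** 2 + theta ** 2)        # characteristic function of Laplace(a), Exercise 2.43
    print(theta, round(empirical, 4), round(laplace_cf, 4))
```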
4.12 (a) We have 1 2 2 E(Z) = E E(Z |Y ) = E E(e− 2 λ Y +λY X |Y ) .
.
1
By the freezing lemma .E(e− 2 λ
2 Y 2 +λY X
1
Φ(y) = E(e− 2 λ
.
|Y ) = E[Φ(Y )], where
2 y 2 +λyX
1
2y2+ 1 2
) = e− 2 λ
λ2 y 2
=1.
Hence .E(Z) = 1. (b) Let us compute the Laplace transform of X under .Q: for .t ∈ R 1
1
EQ (etX ) = E(e− 2 λ Y +λY X etX ) = E(e− 2 λ Y +(λY +t)X ) 1 2 2 = E E(e− 2 λ Y +(λY +t)X |Y ) = E[Φ(Y )] , 2 2
2 2
.
where now 1
Φ(y) = E(e− 2 λ
.
2 y 2 +(λy+t)X
1
2y2
1 2
1
) = e− 2 λ
1
1 2 +λty
2
e 2 (λy+t) = e 2 t
,
so that 1 2
2t 2
EQ (etX ) = e 2 t E(eλtY ) = e 2 t e 2 λ
.
1
2 )t 2
= e 2 (1+λ
.
Therefore .X ∼ N (0, 1 + λ2 ) under .Q. Note that this law depends on .|λ| only and that the variance of X becomes larger under .Q for every value of .λ. 4.13 (a) The freezing lemma, Lemma 4.11, gives 1 2 2 E(etXY ) = E E(etXY |Y ) = E[e 2 t Y ] .
.
Hence, as .Y 2 ∼ Γ ( 12 , 12 ) (Remark 2.37 or Exercise 2.7), .E(etXY ) = +∞ if .|t| ≥ 1 and 1 E(etXY ) = √ 1 − t2
.
if |t| < 1 .
(b) Thanks to (a) .Q is a probability. Let .φ : R2 → R be a bounded Borel function. We have ( Q φ(X, Y ) = 1 − t 2 E φ(X, Y )etXY .E √
1 1 − t 2 +∞ +∞ 2 2 φ(x, y) etxy e− 2 (x +y ) dx dy , = 2π −∞ −∞
Exercise 4.15
351
from which we derive that, under .Q, the joint density with respect to the Lebesgue measure of .(X, Y ) is √ 1 − t 2 − 1 (x 2 +y 2 −2txy) e 2 . . 2π We recognize a Gaussian law, centered and with covariance matrix C such that C
.
−1
1 −t = −t 1
! ,
i.e. 1 .C = 1 − t2
1t t 1
! ,
from which 1 t , CovQ (X, Y ) = · 2 1−t 1 − t2 4.14 Note that .Sn+1 = Xn+1 + Sn and that .Sn is . Fn -measurable whereas .Xn+1 is independent of . Fn . We are therefore in the situation of the freezing lemma, Lemma 4.11, which gives that VarQ (X) = VarQ (Y ) =
.
E f (Xn+1 + Sn )| Fn = Φ(Sn ) ,
.
(7.54)
where, (recall that .Xn ∼ μn ) Φ(x) = E f (Xn+1 + x) =
.
f (y + x) dμn+1 (y) .
(7.55)
The right-hand side in (7.54) is .σ (Sn )-measurable (being a function of .Sn ) and this implies (4.31): indeed, as .σ (Sn ) ⊂ Fn , E f (Sn+1 )|Sn = E E(f (Sn+1 )| Fn )|Sn = E(Φ(Sn )|Sn ) = Φ(Sn ) = E f (Sn+1 )| Fn .
.
Moreover, by (7.55), E f (Sn+1 )| Fn = Φ(Sn ) =
.
f (y + Sn ) dμn+1 (y) .
4.15 Recall that .t (1) is the Cauchy law, which does not have a finite mean. For n ≥ 2 a look at the density that is computed in Example 4.17 shows that the mean exists, is finite, and is equal to 0 of course, as Student laws are symmetric.
.
352
7 Solutions
As for the second order moment, let us use the freezing lemma, which is a better strategy than direct √ computation with the density that was computed in Example 4.17. Let .T = √X n be a .t (n)-distributed r.v., i.e. with .X, Y independent Y
and .X ∼ N (0, 1), .Y ∼ χ 2 (n). We have X2 X2 n =E E n Y = E[Φ(Y )] , E(T 2 ) = E Y Y
.
where X2 n Φ(y) = E n = , y y
.
so that
+∞ n n 1 n/2−1 −y/2 = n/2 n y .E(T ) = E e dy Y y 2 Γ (2) 0
+∞ n y n/2−2 e−y/2 dy . = n/2 n 2 Γ (2) 0 2
The integral diverges at 0 if .n ≤ 2. For .n ≥ 3 we can trace back the integral to a Gamma density and we have Var(T ) = E(T 2 ) =
.
n2n/2−1 Γ ( n2 − 1) n n = n = · n n/2 2( 2 − 1) n−2 2 Γ (2)
4.16 Thanks to the second freezing lemma, Lemma 4.15, the conditional law of √ W given .Z = z is the law of .zX + 1 − z2 Y , which is Gaussian .N(0, 1) and does not depend on z. This implies (Remark 4.14) that .W ∼ N(0, 1) and that W is independent of Z. 4.17 By the second freezing √ lemma, Lemma 4.15, the conditional law of X given Y = y is the law of . √Xy n, i.e. .∼ N(0, yn C), hence with density with respect to the Lebesgue measure
.
h(x; y) =
.
y y d/2 −1 e− 2n C x,x . √ d/2 (2π n) det C
Thanks to (4.19) the density of X is
hX (x) =
h(x; y)hY (y) dy
.
=
1
√ 2n/2 Γ ( n2 )(2π n)d/2 det C
+∞ 0
y
y d/2 y n/2−1 e− 2n C
−1 x,x
e−y/2 dy
Exercise 4.19
=
353
1
√ 2n/2 Γ ( n2 )(2π n)d/2 det C
+∞
y
1
1
y 2 (d+n)−1 e− 2 (1+ n C
−1 x,x)
dy .
0
We recognize in the last integrand a Gamma.( 12 (d +n), 12 (1+ n1 C −1 x, x)) density, except for the constant, so that hX (x) =
.
=
Γ ( n2 + d2 )2
1
√ 2n/2 Γ ( n2 )(2π n)d/2 det C (1 +
Γ ( n2 + d2 ) √ Γ ( n2 )(π n)d/2 det C (1 +
1 n
C −1 x, x)
1 1 n
n+d 2
C −1 x, x)
n+d 2
n+d 2
·
4.18 (a) Thanks to the second freezing lemma, Lemma 4.15, the conditional law of Z given .W = w is the law of the r.v. .
X + Yw , √ 1 + w2
which is .N (0, 1) whatever the value of w, as .X + Y w ∼ N(0, 1 + w2 ). (b) .Z ∼ N(0, 1) thanks to Remark 4.14, which entails also that Z and W are independent. 4.19 (a) Let i be an index, .1 ≤ i ≤ n. Let .σ be a permutation such that .σ1 = i. The identity in law .X ∼ Xσ of the vectors implies the identity in law of the marginals, hence .X1 ∼ Xσ1 = Xi . Hence, .Xi ∼ X1 for every .1 ≤ i ≤ n. If .1 ≤ i, j ≤ n, .i = j , then, just repeat the previous argument by choosing a permutation .σ such that .σ1 = i, .σ2 = j and obtain that .(Xi , Xj ) ∼ (X1 , X2 ) for every .1 ≤ i, j ≤ n, .i = j . (b) Immediate, as X and .Xσ have independent components and the same marginals. (c) The random vector .Xσ := (Xσ1 , . . . , Xσn ) is the image of .X = (X1 , . . . , Xn ) under the linear map .A : (x1 , . . . , xn ) → (xσ1 , . . . , xσn ). Hence (see (2.20)) .Xσ has density fσ (x) =
.
1 f (A−1 x) . | det A|
Now just note that .f (A−1 x) = g(|A−1 x|) = g(|x|) = f (x) and also that .| det A| = 1, as the matrix A is all zeros except for exactly one 1 in every row and every column. (d1) For every bounded measurable function .φ : (E ×· · ·×E, E⊗· · ·⊗ E) → R we have E[φ(X1 , . . . , Xn )] = E E[φ(X1 , . . . , Xn )|Y ] .
.
(7.56)
354
7 Solutions
As the conditional law of .(X1 , . . . , Xn ) given .Y = y is the product μy ⊗ · · · ⊗ μy , hence exchangeable, we have .E[φ(X1 , . . . , Xn )|Y = y] = E[φ(Xσ1 , . . . , Xσn )|Y = y] a.s. for every permutation .σ . Hence
.
E[φ(X1 , . . . , Xn )] = E E[φ(X1 , . . . , Xn )|Y ] = E E[φ(Xσ1 , . . . , Xσn )|Y ]
.
= E[φ(Xσ1 , . . . , Xσn )] . (d2) If .X ∼ t (n, d, I ), then .X ∼
√ n √ (Z1 , . . . , Zd ), where .Z1 , . . . , Zd Y ∼ χ 2 (n). Therefore, given .Y = y,
are
independent .N(0, 1)-distributed and .Y the components of X are independent and .N(0, yn ) distributed, hence exchangeable thanks to (d1). One can also argue that a .t (n, d, I ) distribution is exchangeable because its density is of the form (4.32), as seen in Exercise 4.17. 4.20 (a) The law of .S = T + W is the law of the sum of two independent exponential r.v.’s of parameters .λ and .μ respectively. This can be done in many ways: by computing the convolution of their densities as in Proposition 2.18, or also by obtaining the density .fS of S as a marginal of the joint density of .(T , S), which we are asked to compute anyway. Let us follow the last path, taking advantage of the second freezing lemma, Lemma 4.15: we have .S = Φ(T , W ), where .Φ(t, w) = t +w, hence the conditional law of S given .T = t is the law of .t + W , which has a density with respect to the Lebesgue measure given by .f¯(s; t) = fW (s − t). Hence the joint density of T and S is f (t, s) = fT (t)f¯(s; t) = λμe−λt e−μ(s−t) ,
.
t > 0, s > t
and the density of S is, for .s > 0,
fS (s) =
.
f (t, s) dt = λμe−μs
s
e−(λ−μ)t dt
0
=
λμ λμ −μs e (e−μs − e−λs ) . 1 − e−(λ−μ)s = λ−μ λ−μ
(b) The conditional density of T given .S = s is f (t, s) f¯(t; s) = fS (s)
.
and, replacing the expressions for f and .fS as computed in (a), ⎧ −μs ⎨ (λ − μ) e e−(λ−μ)t −μs −λs .f¯(t; s) = e −e ⎩ 0
if 0 ≤ t ≤ s otherwise .
Exercise 4.21
355
1 .8 .6 .4 .2
0
2
4
Fig. 7.12 The graph of the conditional expectation (solid) of Exercise 4.20 with the regression line (dots). Note that the regression line here is not satisfactory as, for values of s near 0, it lies above the diagonal, i.e. it gives an expected value of T that is larger than s, whereas we know that .T ≤ S
The conditional expectation of T given .S = s is the mean of this conditional density, i.e.
(λ − μ) e−μs s −(λ−μ)t .E(T |S = s) = te dt . e−μs − e−λs 0 Integrating by parts and with some simplifications E(T |S = s) =
.
s (λ − μ) e−μs 1 −(λ−μ)s −(λ−μ)s e − + (1 − e ) e−μs − e−λs λ−μ (λ − μ)2 =
1 s · + λ−μ 1 − e−(μ−λ)s
In Exercise 2.30 we computed the regression line of T with respect to s, which was s →
.
μ2 λ−μ s+ 2 · λ2 + μ2 λ + μ2
Figure 7.12 compares the graphs of these two estimates. 4.21 (a) For .x > 0 we have
fX (x) =
+∞
.
−∞
f (x, y) dy = 0
+∞
λ2 xe−λx(y+1) dy
356
7 Solutions
−λx
+∞
= λe
y=+∞ λxe−λxy dy = −λe−λx e−λxy = λe−λx , y=0
0
hence X is exponential of parameter .λ. As for Y , instead, recalling the integral of the Gamma densities, we have for .y > 0
fY (y) =
.
f (x, y) dx = λ
2
+∞
xe−λx(y+1) dx =
0
λ2 Γ (2) 1 = · (λ(y + 1))2 (y + 1)2
Note that the density of Y does not depend on .λ. (b) Let .φ : R2 → R be a bounded Borel function. We have
+∞
E[φ(U, V )] = E[φ(X, XY )] =
dx
.
0
+∞
φ(x, xy)λ2 x e−λx(y+1) dy .
0
With the change of variable .z = xy in the inner integral, i.e. .x dy = dz, we have
.
··· =
+∞
dx 0
+∞
φ(x, z)λ2 e−λ(z+x) dz .
0
Hence the joint density of .(U, V ) is, for .u > 0, v > 0, g(u, v) = λ2 e−λ(u+v) = λe−λu · λe−λv ,
.
so that U and V are independent and both exponential with parameter .λ. (c) The conditional density of X given .Y = y is, for .x > 0, f (x, y) = λ2 x(y + 1)2 e−λx(y+1) , f¯(x; y) = fY (y)
.
which is a Gamma.(2, λ(y + 1)) density (as a function of x, of course). The conditional expectation .E(X|Y = y) is therefore the mean of this density, i.e. E(X|Y = y) =
.
Hence .E(X|Y ) =
2 λ(Y +1)
2 · λ(y + 1)
and the requested squared .L2 distance is 2 E X − E(X|Y ) .
.
By (4.6) this is equal to .E(X2 ) − E[E(X|Y )2 ]. Now, recalling the expression of the moments of the exponential distributions, we have E(X2 ) = E(X)2 + Var(X) =
.
2 λ2
Exercise 4.22
357
and .E[E(X|Y ) ] = E 2
4 4 = 2 2 2 λ (Y + 1) λ
0
+∞
1 4 dy = 2 , 4 (y + 1) 3λ
from which the requested squared .L2 distance is equal to . 3λ2 2 . • Note that (d) above states that, in the sense of .L2 , the best approximation of X by a function of Y is . λ(Y2+1) . We might think of comparing this approximation with the regression line of X with respect to Y , which is the best approximation by an affine-linear function of Y . However we have
+∞
E(Y 2 ) =
.
0
y2 dy = +∞ . (1 + y)2
Hence Y is not square integrable (not even integrable), so that the best approximation in .L2 of X by an affine-linear function of Y can only be a constant and this constant must be .E(X) = λ1 , see the remark following Example 2.24 p.68. 4.22 (a) We know that .Z = X + Y ∼ Gamma.(α + β, λ). (b) As X and Y are independent, their joint density is f (x, y) =
.
λα+β x α−1 y β−1 e−λ(x+y) Γ (α)Γ (β)
if .x, y > 0 and .f (x, y) = 0 otherwise. For every bounded Borel function .φ : R2 → R we have E φ(X, X + Y )
.
=
λα+β Γ (α)Γ (β)
+∞
+∞
dx 0
φ(x, x + y) x α−1 y β−1 e−λ(x+y) dy .
0
With the change of variable .z = x + y, .dz = dy in the inner integral we obtain λα+β .··· = Γ (α)Γ (β)
+∞
dx 0
+∞
φ(x, z) x α−1 (z − x)β−1 e−λz dz ,
x
so that the density of .(X, X + Y ) is ⎧ ⎪ ⎨
λα+β x α−1 (z − x)β−1 e−λz .g(x, z) = Γ (α)Γ (β) ⎪ ⎩0
if 0 < x < z otherwise .
358
7 Solutions
(c) Denoting by .gX+Y the density of .X + Y , which we know to be Gamma.(α + β, λ), the requested conditional density is g(x; z) =
.
g(x, z) · gX+Y (z)
It vanishes unless .0 ≤ x ≤ z. For x in this range we have g(x; z) =
.
λα+β α−1 (z − x)β−1 e−λz Γ (α)Γ (β) x λα+β α+β−1 e−λz Γ (α+β) z
=
Γ (α + β) 1 x α−1 ( ) (1 − xz )β−1 . Γ (α)Γ (β) z z
(d) The conditional expectation .E(X|X + Y = z) is the mean of this density. With the change of variable .t = xz , .dx = z dt,
.
Γ (α + β) z x α ( ) (1 − xz )β−1 dx Γ (α)Γ (β) 0 z
1 Γ (α + β) = t α (1 − t)β−1 dt . z Γ (α)Γ (β) 0
x g(x; z) dx =
(β) Recalling the expression of the Beta laws, the last integral is equal to . ΓΓ(α+1)Γ (α+β+1) , hence, with the simplification formula of the Gamma function, the requested conditional expectation is
.
α Γ (α + β) Γ (α + 1)Γ (β) z= z. Γ (α)Γ (β) Γ (α + β + 1) α+β
We know that the conditional expectation given .X+Y = z is the best approximation (in the sense of the .L2 distance) of X as a function of .X + Y . The regression line is instead the best approximation of X as an affine-linear function of .X + Y = z. As the conditional expectation in this case is itself an affine-linear function of z, the two functions necessarily coincide. • Note that the results of (c) and (d) do not depend on the value of .λ. 4.23 (a) We recognize that the argument of the exponential is, but for the factor . 12 , the quadratic form associated to the matrix ! 1 1 −r .M = . 1 − r 2 −r 1
Exercise 4.24
359
M is strictly positive definite (both its trace and determinant are .> 0, hence both eigenvalues are positive), hence f is a Gaussian density, centered and with covariance matrix ! 1r −1 .C = M = . r1 Therefore X and Y are both .N(0, 1)-distributed and .Cov(X, Y ) = r. (b) As X and Y are centered and .Cov(X, Y ) = r, by (4.24), E(X|Y = y) =
.
Cov(X, Y ) y = ry . Var(Y )
Also the pair .X, X + Y is jointly Gaussian and again formula (4.24) gives E(X|X + Y = z) =
.
1+r 1 Cov(X, X + Y ) z= z= z. Var(X + Y ) 2(1 + r) 2
Note that .E(X|X + Y = z) does not depend on r. 4.24 (a) The pair .(X, Y ) has a density with respect to the Lebesgue measure given by 1 2 1 1 1 1 2 f (x, y) = fX (x)f (y; x) = √ e− 2 x √ e− 2 (y− 2 x) 2π 2π 1 − 1 (x 2 +y 2 −xy+ 1 x 2 ) 1 − 1 ( 5 x 2 +y 2 −xy) 4 e 2 e 2 4 = = . 2π 2π
.
At the exponential we note the quadratic form associated to the matrix C
.
−1
− 12 = − 12 1 5 4
! .
We deduce that the pair .(X, Y ) has an .N(0, C) distribution with C=
.
1 2 1 5 2 4
1
! .
(b) The answer is no and there is no need for computations: if the pair .(X, Y ) was Gaussian the mean of the conditional law would be as in (4.24) and necessarily an affine-linear function of the conditioning r.v. (c) Again the answer is no: as noted in Remark 4.20(c), the variance of the conditional distributions of jointly Gaussian r.v.’s cannot depend on the value of the conditioning r.v.
360
7 Solutions
5.1 As .(Xn )n is a supermartingale, .U := Xm − E(Xn | Fm ) ≥ 0 a.s., for .n > m. But .E(U ) = E(Xm ) − E[E(Xn | Fm )] = E(Xm ) − E(Xn ) = 0. The positive r.v. U having expectation equal to 0 is .= 0 a.s., so that .Xm = E(Xn | Fm ) a.s. and .(Xn )n is a martingale. 5.2 If .m < n, as .{Mm = 0} is . Fm -measurable and .Mm 1{Mm =0} = 0, we have E(Mn 1{Mm =0} ) = E(Mm 1{Mm =0} ) = 0 .
.
As .Mn ≥ 0, necessarily .Mn = 0 a.s. on .{Mm = 0}, i.e. .{Mm = 0} ⊂ {Mn = 0} a.s. • Note that this is just Exercise 4.2 from another point of view. 5.3 We must prove that, if .m ≤ n, E(Mn Nn 1A ) = E(Mm Nm 1A )
(7.57)
.
for every .A ∈ Hm or at least for every A in a subclass . Cm ⊂ Hm that generates Hm , contains .Ω and is stable with respect to finite intersections (Remark 4.3). Let . Cm be the class of the events of the form .A1 ∩ A2 with .A1 ∈ Fm , .A2 ∈ Gm . . Cm is stable with respect to finite intersections and contains both . Fm (choosing .A2 = Ω) and . Gm (with .A1 = Ω). As the r.v.’s .Mn 1A1 and .Nn 1A2 are independent (the first one is . Fn -measurable whereas the second one is . Gn -measurable) we have .
E(Mn Nn 1A1 ∩A2 ) = E(Mn 1A1 Nn 1A2 ) = E(Mn 1A1 )E(Nn 1A2 )
.
= E(Mm 1A1 )E(Nm 1A2 ) = E(Mm 1A1 Nm 1A2 ) = E(Mm Nm 1A1 ∩A2 ) , hence (7.57) is satisfied for every .A ∈ Cm and therefore for every .A ∈ Hm . 5.4 (a) .Zn is . Fn−1 -measurable whereas .Xn is independent of . Fn−1 , hence .Xn and .Zn are independent and .Zn2 Xn2 is integrable, being the product of integrable independent r.v.’s. Hence .Yn is square integrable for every n. Moreover, E(Yn+1 | Fn ) = E(Yn + Zn+1 Xn+1 | Fn ) = Yn + Zn+1 E(Xn+1 | Fn )
.
= Yn + Zn+1 E(Xn+1 ) = Yn , where we have taken advantage of the fact that .Yn and .Zn+1 are . Fn -measurable, whereas .Xn+1 is independent of . Fn . (b) As .Zk and .Xk are independent, .E(Zk Xk ) = E(Zk )E(Xk ) = 0 hence .E(Yn ) = 0. Moreover, n n n 2 =E E(Yn2 ) = E Zk Xk Zk Xk Zh Xh = E(Zk Xk Zh Xh ) .
.
k=1
k,h=1
k,h=1
Exercise 5.5
361
In the previous sum all terms with .h = k vanish: actually, let us assume .k > h, then the r.v. .Zk Xh Zh is . Fk−1 -measurable, whereas .Xk is independent of . Fk−1 . Hence E(Zk Xk Zh Xh ) = E(Xk )E(Zk Zh Xh ) = 0 .
.
Therefore, as .E(Zk2 Xk2 ) = E(Zk2 )E(Xk2 ) = σ 2 E(Zk2 ), E(Yn2 ) =
n
.
E(Zk2 Xk2 ) = σ 2
k=1
n
E(Zk2 ) .
(7.58)
k=1
(c) The compensator .(An )n of .(Mn2 )n is given by the condition .A0 = 0 and the 2 − Mn2 | Fn ). Now relations .An+1 = An + E(Mn+1 2 E(Mn+1 − Mn2 | Fn ) = E[(Mn + Xn+1 )2 − Mn2 | Fn ]
.
2 2 E(Mn2 + 2Mn Xn+1 + Xn+1 − Mn2 | Fn ) = E(2Mn Xn+1 + Xn+1 | Fn ) 2 = 2Mn E(Xn+1 | Fn ) + E(Xn+1 | Fn ) .
As .Xn+1 is independent of . Fn , .
E(Xn+1 | Fn ) = E(Xn+1 ) = 0
a.s.
2 2 E(Xn+1 | Fn ) = E(Xn+1 ) = σ2
a.s. ,
hence .An+1 = An +σ 2 and, with the condition .A0 = 0, we have .An = nσ 2 . In order n )n say, just repeat the same argument: to compute the compensator of .(Yn2 )n , .(A 2 E(Yn+1 − Yn2 | Fn ) = E[(Yn + Zn+1 Xn+1 )2 − Yn2 | Fn ] .
2 2 = E(2Yn Zn+1 Xn+1 + Zn+1 Xn+1 | Fn ) 2 2 2 = 2Yn Zn+1 E(Xn+1 | Fn ) + Zn+1 E(Xn+1 | Fn ) = σ 2 Zn+1 .
Therefore n = σ 2 A
n
.
Zk2 .
k=1
5.5 (a) Let .m ≤ n: we have .E(Mn Mm ) = E[E(Mn Mm | Fm )] = E[Mm E(Mn | Fm )] = 2 ) so that E(Mm 2 2 E[(Mn − Mm )2 ] = E(Mn2 ) + E(Mm ) − 2E(Mn Mm ) = E(Mn2 ) − E(Mm ).
.
(b) Let us assume .M0 = 0 for simplicity: actually the martingales .(Mn )n and (Mn − M0 )n have the same associated increasing process. Note that the suggested
.
362
7 Solutions
associated increasing process vanishes at 0 and is obviously predictable, so that it is sufficient to prove that .Zn = Mn2 − E(Mn2 ) is a martingale (by the uniqueness of the associated increasing process). We have, for .m ≤ n,
.
E(Mn2 | Fm ) = E[(Mn − Mm + Mm )2 | Fm ]
(7.59)
2 |F ] . = E[(Mn − Mm )2 + 2(Mn − Mm )Mm + Mm m
We have .E[(Mn − Mm )Mm | Fm ] = Mm E(Mn − Mm | Fm ) = 0 and, as M has independent increments, 2 E[(Mn − Mm )2 | Fm ] = E[(Mn − Mm )2 ] = E(Mn2 − Mm ).
.
2 + E(M 2 − M 2 ) and Therefore, going back to (7.59), .E(Mn2 | Fm ) = Mm n m 2 2 2 2 E(Zn | Fm ) = Mm + E(Mn2 − Mm ) − E(Mn2 ) = Mm − E(Mm ) = Zm .
.
(c) Let .m ≤ n. As M is a Gaussian family, .Mn − Mm is independent of (M0 , . . . , Mm ) if and only if .Mn − Mm is uncorrelated with respect to .Mk for every .k = 0, 1, . . . , m. But, by the martingale property, .
E[(Mn − Mm )Mk ] = E E[(Mn − Mm )Mk | Gm ] = E Mk E(Mn − Mm | Gm ) ,
.
=0
so that .Cov(Mn − Mm , Mk ) = E[(Mn − Mm )Mk ] = 0. 5.6 (a) Let us denote by .(Vn )n≥0 the associated increasing process of .(Sn )n . As Yn+1 is independent of . Fn , .E(Yn+1 | Fn ) = E(Yn+1 ) = 0 and, recalling the definition of compensator in (5.3),
.
2 Vn+1 − Vn = E(Sn+1 − Sn2 | Fn ) 2 = E (Sn + Yn+1 )2 − Sn2 | Fn = E(Yn+1 + 2Yn+1 Sn | Fn ) .
2 2 = E(Yn+1 | Fn ) + 2Sn E(Yn+1 | Fn ) = E(Yn+1 )=1.
Therefore .V0 = 0 and .Vn = n. Note that this is a particular case of Exercise 5.5(b), as .(Sn )n is a martingale with independent increments. (b) We have E(Mn+1 − Mn | Fn ) = E[sign(Sn )Yn+1 | Fn ]=sign(Sn )E(Yn+1 | Fn )=0
.
a.s.
Exercise 5.6
363
therefore .(Mn )n is a martingale. It is obviously square integrable and its associated increasing process, .(An )n say, is obtained as above: 2 An+1 − An = E(Mn+1 − Mn2 | Fn )
.
2 + 2M sign(S )Y = E sign(Sn )2 Yn+1 n n n+1 | Fn 2 = sign(Sn )2 E(Yn+1 | Fn ) +2Mn sign(Sn ) E(Yn+1 | Fn ) = sign(Sn )2 = 1{Sn =0} =0
=1
from which An =
n−1
.
1{Sk =0} .
k=1
Note that this is a particular case of Exercise 5.4(b). (c1) On .{Sn > 0} we have .Sn+1 ≥ 0, as .Sn+1 ≥ Sn − 1 ≥ 0; therefore .|Sn+1 | − |Sn | = Sn+1 − Sn = Yn+1 and E[(|Sn+1 | − |Sn |)1{Sn >0} | Fn ] = 1{Sn >0} E(Yn+1 | Fn ) = 0 .
.
The other relation is proved in the same way. Therefore .E |Sn+1 | − |Sn | Fn = E[(|Sn+1 | − |Sn |)1{Sn =0} | Fn ] = 1{Sn =0} E |Yn+1 | Fn = 1{Sn =0} and n + 1{Sn =0} . n + E |Sn+1 | − |Sn | Fn = A n+1 = A A
.
Hence n = A
n−1
.
1{Sk =0} .
k=0
(c2) We have n+1 − A n ) = |Sn+1 | − |Sn | − 1{Sn =0} . Nn+1 − Nn = |Sn+1 | − |Sn | − (A
.
As (|Sn+1 | − |Sn |)1{Sn >0} = Yn+1 1{Sn >0} ,
.
(|Sn+1 | − |Sn |)1{Sn Zn } | Fn = Φ(Zn ) ,
.
where Φ(z) = E z1{ξn+1 ≤z} + ξn+1 1{ξn+1 >z} ,
.
i.e.
−λz
Φ(z) = z(1 − e
.
)+λ
+∞
ye−λy dy
z
−λz
= z(1 − e
+∞ +
−λy
) + − ye
z
+∞
e−λy dy
z
1 = z(1 − e−λz ) + ze−λz + e−λz λ 1 −λz =z+ e , λ hence E(Zn+1 | Fn ) − Zn =
.
1 −λZn e λ
Exercise 5.8
365
and (7.60) becomes An+1 = An +
.
1 −λZn e , λ
so that 1 −λZk .An = e . λ n−1 k=0
• Note that this gives the relation .E(Zn ) = E(An ) = λ1 of .E(e−λZk ) was computed in (2.97), where we found E(e−λZk ) = Lk (−λ) = k!
.
n−1 k=0
E(e−λZk ). The value
1 Γ (2) = , Γ (k + 2) k+1
so that we find again the value of the expectation .E(Zn ) as in Exercise 2.48. 5.8 (a) The exponential function being convex we have, by Jensen’s inequality, E(eMn | Fn−1 ) ≥ eE(Mn | Fn−1 ) = eMn−1 ,
.
which implies (5.27). (b) Recalling how Doob’s decomposition was derived in Sect. 5.3, let us recursively define .A0 = 0 and An = An−1 + log E(eMn | Fn−1 ) − Mn−1 .
.
(7.61)
This defines an increasing predictable process and taking the exponentials we find eAn = eAn−1 E(eMn | Fn−1 )e−Mn−1 ,
.
i.e., .An being . Fn−1 -measurable, E(eMn −An | Fn−1 ) = eMn−1 −An−1 ,
.
thus proving (b). (c1) We have
.
log E(eMn | Fn−1 ) = log E(eW1 +···+Wn | Fn−1 ) = log eW1 +···+Wn−1 E(eWn | Fn−1 ) = W1 + . . . + Wn−1 + log E(eWn | Fn−1 ) = Mn−1 + log L(1) ,
366
7 Solutions
where we denote by L the Laplace transform of the .Wk ’s (which is finite at 1 by hypothesis) and thanks to (7.61), An = n log L(1) .
.
(c2) Now we have n log E(eMn | Fn−1 ) = log E exp Zk Wk Fn−1
.
k=1
=
n−1
Zk Wk + log E(eZn Wn | Fn−1 ) = Mn−1 + log E(eZn Wn | Fn−1 ) .
k=1
As .Wn is independent of . Fn−1 and .Zn is . Fn−1 -measurable, by the freezing lemma, E(eZn Wn | Fn−1 ) = Φ(Zn ) ,
.
where .Φ(z) = E(ezWn ) = L(z) and (7.61) gives .An = An−1 + log L(Zn ), i.e. An =
n
.
log L(Zk ) .
k=1
In particular .n → exp
n k=1 Zk Wk
−
n
k=1 log L(Zk )
is an .( Fn )n -martingale.
5.9 We already know that .Xτ is . Fτ -measurable (see the end of Sect. 5.4) hence we must just prove that, for every .A ∈ Fτ , .E(X1A ) = E(Xτ 1A ). As .A ∩ {τ = n} ∈ Fn we have .E(X1A∩{τ =n} ) = E(Xn 1A∩{τ =n} ) and, as .τ is finite, E(X1A ) =
∞
.
E(X1A∩{τ =n} ) =
n=0
=
∞
∞
E(Xn 1A∩{τ =n} )
n=0
E(Xτ 1A∩{τ =n} ) = E(Xτ 1A ) .
n=0
5.10 If X is a martingale the claimed property is a consequence of the stopping theorem (Corollary 5.11) applied to the stopping times .τ1 = 0 and .τ2 = τ . Conversely, in order to prove the martingale property we must prove that, if .n > m, E(Xn 1A ) = E(Xm 1A )
.
for every A ∈ Fm .
(7.62)
Exercise 5.12
367
The idea is to find two bounded stopping times .τ1 , τ2 such that the relation .E(Xτ1 ) = E(Xτ2 ) implies (7.62). Let us choose, for .A ∈ Fm , 1 τ1 (ω) =
.
m if ω ∈ A n
if ω ∈ Ac
and .τ2 ≡ n; .τ1 is a stopping time: indeed
{τ1 ≤ k} =
.
⎧ ⎪ ⎪ ⎨∅ A ⎪ ⎪ ⎩Ω
if k < m if m ≤ k < n if k ≥ n ,
so that, in any case, .{τ1 ≤ k} ∈ Fk . Now .Xτ1 = Xm 1A + Xn 1Ac and the relation E(Xτ1 ) = E(Xn ) gives
.
E(Xm 1A ) + E(Xn 1Ac ) = E(Xτ1 ) = E(Xn ) = E(Xn 1A ) + E(Xn 1Ac ) ,
.
and by subtraction we obtain (7.62). 5.11 (a) We must prove that, for .m ≤ n, E(Mn 1A˜ ) = E(Mm 1A˜ ) ,
.
(7.63)
for every .A˜ ∈ Fm or, at least for every .A˜ in a class . C ⊂ Fm of events that is stable with respect to finite intersections and generating . Fm . Very much like Exercise 5.3, a suitable class . C is that of the events of the form A˜ = A ∩ B,
.
A ∈ Fm , B ∈ G .
We have, . Fn and . G being independent, E(Mn 1A∩B ) = E(Mn 1A 1B ) = E(Mn 1A )E(1B ) .
= E(Mm 1A )E(1B ) = E(Mm 1A∩B ) .
which proves (7.63) for every .A˜ ∈ C. (b) Let . Fn = σ ( Fn , σ (τ )). Thanks to (a) .(Mn )n is also a martingale with respect Fn )n . Moreover we have .{τ ≤ n} ∈ σ (τ ) ⊂ Fn for every n, so that .τ is a to .( .( Fn )n -stopping time. Hence the stopped process .(Mn∧τ )n is an .( Fn )n -martingale. 5.12 (a) By the Law of Large Numbers we have a.s. .
1 1 Sn = (Y1 + · · · + Yn ) n n
→
n→∞
E(Y1 ) = p − q < 0 .
368
7 Solutions
Hence, for every .δ such that .p − q < δ < 0, there exists, a.s., an .n0 such that 1 n Sn < δ for .n ≥ n0 . It follows that .Sn →n→∞ −∞ a.s. (b) Note that .Zn = ( pq )Y1 . . . ( pq )Yn , that the r.v.’s .( pq )Yk are independent and that
.
q −1 q E ( pq )Yk = P(Yk = 1) + P(Yk = −1) = q + p = 1 , p p
.
so that .(Zn )n are the cumulative products of independent r.v.’s having expectation = 1 and the martingale property follows from Example 5.2(b). (c) As .n ∧ τ is a bounded stopping time, .E(Zn∧τ ) = E(Z1 ) = 1 by the stopping theorem, Theorem 5.10. Thanks to (a) .τ < +∞ a.s., hence .limn→∞ Zn∧τ = Zτ a.s. As .−a ≤ Zn∧τ ≤ b, we can apply Lebesgue’s Theorem, which gives .E(Zτ ) = limn→∞ E(Zn∧τ ) = 1. (d1) As .Zτ can take only the values .−a or b, we have
.
1 = E(Zτ ) = E[( pq )Sτ ] =
.
q b p
P(Sτ = b) +
q −a p
P(Sτ = −a) .
As .P(Sτ = −a) = 1 − P(Sτ = b), the previous relation gives 1−
.
q −a p
= P(Sτ = b)
q b p
−
q −a p
,
i.e. P(Sτ = b) =
.
1 − ( pq )−a ( pq )b − ( pq )−a
,
and, as . pq > 1 lim P(Sτ−a,b = b) = lim
.
1 − ( pq )−a
a→+∞ ( q )b p
a→+∞
− ( pq )−a
=
p b q
(7.64)
.
(d2) If .τb (ω) < n, as the numerical sequence .(Sn (ω))n cannot reach .−n in less than n steps, necessarily .Sτ−n,b = b, hence .{τb < n} ⊂ {Sτ−n,b = b}. Therefore by (7.64) P(τb < +∞) = lim P(τb < n) ≤ lim P(Sτ−n,b = b) =
.
n→∞
n→∞
p b q
.
On the other hand, thanks to the obvious inclusion .{τb < +∞} ⊃ {Sτ−a,b = b} for every a, from (7.64) we have that the .= sign holds.
Exercise 5.13
369
(d3) Obviously we have, for every n, P(τ−a < +∞) ≥ P(Sτ−a,n = a)
.
and therefore P(τ−a < +∞) ≥ lim P(Sτ−a,n = −a) = lim 1 − P(Sτ−a,n = n)
.
n→∞
= lim 1 − n→∞
1 − ( pq )−a ( pq )n − ( pq )−a
n→∞
= lim
n→∞
( pq )n − 1 ( pq )n − ( pq )−a
=1.
• This exercise gives some information concerning the random walk .(Sn )n : it visits a.s. every negative integer but visits the strictly positive integers with a probability that is strictly smaller than 1. This is of course hardly surprising, given its asymmetry. In particular, for .b = 1 (7.64) gives .P(τb < +∞) = pq , i.e. with probability .1 − pq the random walk .(Sn )n never visits the strictly positive integers. 5.13 (a) As .Xn+1 is independent of . Fn , .E(Xn+1 | Fn ) = E(Xn+1 ) = x a.s. We have .Zn = (X1 − x) + · · · + (Xn − x), so that .(Zn )n are the cumulative sums of independent centered r.v.’s, hence a martingale (Example 5.2(a)). (b1) Also the stopped process .(Zn∧τ )n is a martingale, therefore .E(Zn∧τ ) = E(Z0 ) = 0, i.e. E(Sn∧τ ) = x E(n ∧ τ ) .
(7.65)
.
(b2) By Beppo Levi’s Theorem .E(n ∧ τ ) ↑ E(τ ) as .n → ∞. If we assume Xk ≥ 0 a.s., the sequence .(Sn∧τ )n≥0 is also increasing, hence also .E(Sn∧τ ) ↑ E(Sτ ) as .n → ∞ and from (7.65) we obtain
.
E(Sτ ) = xE(τ ) < +∞ .
(7.66)
.
As for the general case, if .x1 = E(Xn+ ), .x2 = E(Xn− ) (so that .x = x1 − x2 ), let (1) (2) .Sn = X1+ + · · · + Xn+ , .Sn = X1− + · · · + Xn− and Zn(1) = Sn(1) − nx1 ,
.
Zn(2) = Sn(2) − nx2 . + − As .Xn+1 (resp. .Xn+1 ) is independent of . Fn , .(Zn )n (resp. .(Zn )n ) is a martingale with respect to .( Fn )n . By (7.66) we have (1)
E(Sτ(1) ) = x1 E(τ ),
.
(2)
E(Sτ(2) ) = x2 E(τ ) ,
370
7 Solutions
and by subtraction, all quantities appearing in the expression being finite (recall that τ is assumed to be integrable),
.
E(Sτ ) = E(Sτ(1) ) − E(Sτ(2) ) = (x1 − x2 )E(τ ) = xE(τ ) .
.
(c) The process .(Sn )n can make, on .Z, only one step to the right or to the left. Therefore, recalling that we know that .τb < +∞ a.s., .Sτb = b a.s., hence .E(Sτb ) = b. If .τb were integrable, (c) would give instead E(Sτb ) = E(X1 )E(τb ) = 0 .
.
5.14 (a) With the usual trick of splitting into the value at time n and the increment we have 2 .E(Wn+1 | Fn ) = E (Sn + Xn+1 ) − (n + 1)| Fn 2 = E(Sn2 + 2Sn Xn+1 + Xn+1 | Fn ) − n − 1 .
Now .Sn2 is already . Fn -measurable, whereas E(Sn Xn+1 | Fn ) = Sn E(Xn+1 ) = 0 ,
.
2 2 E(Xn+1 | Fn ) = E(Xn+1 )=1,
hence E(Wn+1 | Fn ) = Sn2 + 1 − n − 1 = Sn2 − n = Wn .
.
(b1) The stopping time .τa,b is not bounded but, by the stopping theorem applied to .τa,b ∧ n, 0 = E(W0 ) = E(Wτa,b ∧n ) = E(Sτ2a,b ∧n ) − E(τa,b ∧ n) ,
.
hence E(Sτ2a,b ∧n ) = E(τa,b ∧ n) .
.
Now .Sτ2a,b ∧n →n→∞ Sτa,b a.s. and .E(Sτ2a,b ∧n ) →n→∞ E(Sτ2a,b ) by Lebesgue’s Theorem as the r.v.’s .Sτa,b ∧n are bounded (.−a ≤ Sτa,b ∧n ≤ b) whereas .E(τa,b ∧ n) ↑n→∞ E(τa,b ) by Beppo Levi’s Theorem. Hence .τa,b is integrable and E(τa,b ) = E(Sτ2a,b ) = a 2 P(Sτa,b = −a) + b2 P(Sτa,b = b)
.
= a2
b a a 2 b + b2 a + b2 = a+b a+b a+b = ab .
Exercise 5.15
371
(b2) We have, for every .a > 0, .τa,b < τb . Therefore .E(τb ) > E(τa,b ) = ab for every .a > 0 so that .E(τb ) must be .= +∞. 3 ) = 0, we have 5.15 (a) As .E(Xn+1 ) = E(Xn+1
E(Zn+1 | Fn ) = E (Sn + Xn+1 )3 − 3(n + 1)(Sn + Xn+1 )| Fn 2 3 = E Sn3 + 3Sn2 Xn+1 + 3Sn Xn+1 + Xn+1 | Fn − 3(n + 1)Sn
.
= Sn3 + 3Sn − 3(n + 1)Sn = Sn3 − 3nSn = Zn . (b1) By the stopping theorem, for every .n ≥ 0 we have .0 = E(Zn∧τ ), hence 3 E(Sn∧τ ) = 3E[(n ∧ τ )Sn∧τ ] .
.
(7.67)
Note that .−a ≤ Sn∧τ ≤ b so that .Sn∧τ is bounded and that .τ is integrable (Exercise 5.14). Then by Lebesgue’s Theorem we can take the limit as .n → ∞ in (7.67) and obtain E(τ Sτ ) =
.
=
b a 1 1 E(Sτ3 ) = − a3 + b3 3 3 a+b a+b 1 1 −a 3 b + b3 a = ab(b − a) . 3 a+b 3
As we know already that .E(Sτ ) = 0, we obtain Cov(Sτ , τ ) = E(τ Sτ ) =
.
1 ab(b − a) . 3
If .b = a then .Sτ and .τ are correlated and cannot be independent, which is somehow intuitive: if b is smaller than a, i.e. the rightmost end of the interval is closer to the origin, then the fact that .Sτ = b suggests that .τ should be smallish. (b2) Let us note first that, as .Xi ∼ −Xi , the joint distributions of .(Sn )n and of .(−Sn )n coincide. Moreover, we have P(Sτ = a, τ = n) = P(|S0 | < a, . . . , |Sn−1 | < a, Sn = a)
.
and as the joint distributions of .(Sn )n and of .(−Sn )n coincide P(Sτ = a, τ = n) = P(|S0 | < a, . . . , |Sn−1 | < a, Sn = −a) .
= P(Sτ = −a, τ = n) .
(7.68)
372
7 Solutions
As .P(Sτ = a, τ = n) + P(Sτ = −a, τ = n) = P(τ = n) and .P(Sτ = a) = from (7.68) we deduce P(Sτ = a, τ = n) =
.
1 2,
1 P(τ = n) = P(Sτ = a)P(τ = n) , 2
which proves that .Sτ and .τ are independent. 5.16 (a) Note that E(eθXk ) =
.
1 θ 1 −θ e + e = cosh θ 2 2
and that we can write Znθ =
.
n eθXk cosh θ
k=1
so that the .Znθ are the cumulative products of independent positive r.v.’s having expectation equal to 1, hence a martingale as seen in Example 5.2(b). θ ) is a Thanks to Remark 5.8 (a stopped martingale is again a martingale) .(Zn∧τ n martingale. If .θ > 0, it is also bounded: as .Sn cannot cross level a without taking the value a, .Sn∧τ ≤ a (this being true even on .{τ = +∞}). Therefore, .cosh θ being always .≥ 1, θ 0 ≤ Zn∧τ ≤ eθa .
.
θ ) is a bounded martingale, hence bounded in .L2 , and (b1) Let .θ > 0. .(Zn∧τ n 2 it converges in .L (and thus in .L1 ) and a.s. to an r.v. .W θ . On .{τ < ∞} we have θ = lim θ θ θa −τ θ .W n→∞ Zn∧τ = Zτ = e (cosh θ ) ; on the other hand .W = 0 on .{τ = ∞}, since in this case .Sn ≤ a for every n whereas the denominator tends to .+∞. Therefore (5.28) is proved. (b2) We have .W θ →θ→0+ 1{τ 0 (the square root is not analytic at 0), i.e. the right convergence abscissa of the Laplace transform of .τ is .x2 = 0. This is however immediate even without the computation above: .τ being a positive r.v., its Laplace transform is finite on .ℜz ≤ 0. If the right convergence abscissa were .> 0, .τ would have finite moments of all orders, whereas we know (Exercises 5.13(d) and 5.14) that .τ is not integrable. 5.17 (a) We have .E(eiλXk ) = 12 (eiλ + e−iλ ) = cos λ and, as .Xn+1 is independent of . Fn , E cos(λSn+1 )| Fn = E ℜeiλ(Sn +Xn+1 ) | Fn = ℜE eiλ(Sn +Xn+1 ) | Fn = ℜ eiλSn E[eiλXn+1 ] = ℜ eiλSn cos λ = cos(λSn ) cos λ ,
.
so that E(Zn+1 | Fn ) = (cos λ)−(n+1) E cos(λSn+1 )| Fn = (cos λ)−n cos(λSn )=Zn .
.
The conditional expectation .E[cos(λ(Sn + Xn+1 ))| Fn ] can also be computed using the addition formula for the cosine (.cos(α + β) = cos α cos β − sin α sin β)), which leads to just a bit more complicated manipulations.
374
7 Solutions
(b) As .n ∧ τ is a bounded stopping time, .E(Zn∧τ ) = E(Z0 ) = 1. Moreover, as .
−
π π < −λa ≤ λSn∧τ ≤ λa < , 2 2
we have .cos(λSn∧τ ) ≥ cos(λa) and 1 = E(Zn∧τ ) = E (cos λ)−n∧τ cos(λSn∧τ ) ≥ E[(cos λ)−n∧τ ] cos(λa) .
.
(c) The previous relation gives 1 · cos(λa)
E[(cos λ)−n∧τ ] ≤
.
(7.72)
As .0 < cos λ < 1, we have .(cos λ)−n∧τ ↑ (cos λ)−τ as .n → ∞, and taking the limit in (7.72), by Beppo Levi’s Theorem we obtain E[(cos λ)−τ ] ≤
.
1 · cos(λa)
(7.73)
Again as .0 < cos λ < 1, .(cos λ)−τ = +∞ on .{τ = +∞}, and (7.73) entails P(τ = +∞) = 0. Therefore .τ is a.s. finite. (d1) We have .|Sn∧τ | →n→∞ |Sτ | = a a.s. and therefore
.
Zn∧τ = (cos λ)−n∧τ cos(λSn∧τ )
.
a.s.
→ (cos λ)−τ cos(λa) = Zτ .
n→∞
(7.74)
Moreover, |Zn∧τ | = |(cos λ)−n∧τ cos(λSn∧τ )| ≤ (cos λ)−τ
.
and .(cos λ)−τ is integrable thanks to (7.73). Therefore by Lebesgue’s Theorem .E(Zn∧τ ) →n→∞ E(Zτ ). (d2) By Scheffé’s Theorem .Zn∧τ →n→∞ Zτ in .L1 and the martingale is regular. (e) Thanks to (c) .1 = E(Zτ ) = cos(λa)E[(cos λ)−τ ], so that E[(cos λ)−τ ] =
.
1 , cos λa
which can be written E[eτ (− log cos λ) ] =
.
1 · cos λa
(7.75)
Exercise 5.19
375
π Hence the Laplace transform .L(θ ) = E(eθτ ) is finite for .θ < − log cos 2a (which is a strictly positive number). (7.75) gives
.
lim θ→− log cos
π 2a −
L(θ ) = limπ
λ→ 2a
1 = +∞ cos(λa)
π and we conclude that .x2 := − log cos 2a is the right convergence abscissa, the left one being .x1 = −∞ of course. As the convergence strip of the Laplace transform contains the origin, .τ has finite moments of every order (see (2.63) and the argument p.86 at the end of Sect. 2.7). • In Exercises 5.16 and 5.17 it has been proved that, for the simple symmetric random walk, for .a > 0 the two stopping times .
τ1 = inf{n ≥ 0, Sn = a} , τ2 = inf{n ≥ 0, |Sn | = a}
are both a.s. finite. But the first one is not integrable (Exercise 5.13(d)) whereas the second one has a Laplace transform which is finite for some strictly positive values and has finite moments of all orders. The intuition behind this fact is that before reaching the level a the random walk .(Sn )n can make very long excursions on the negative side, therefore taking a lot of time before reaching a. 5.18 We know that the limit .limn→∞ Un = U∞ ≥ 0 exists a.s., .(Un )n being a positive supermartingale. By Fatou’s Lemma E(U∞ ) ≤ lim E(Un ) = 0 .
.
n→∞
The positive r.v. .U∞ has mean 0 and is therefore .= 0 a.s. 5.19 (a) By the strong Law of Large Numbers . n1 Sn →n→∞ b < 0, so that .Sn →n→∞ −∞ a.s. Thus .(Sn )n is bounded from above a.s. (b) As .Y1 ≤ 1 a.s. we have .eλY1 ≤ eλ for .λ ≥ 0 and .L(λ) < +∞ on .R+ . Moreover .L(λ) ≥ eλ P(Y1 = 1), which gives .limλ→+∞ ψ(λ) = +∞. As .L (0+) = E(Yi ) = b, ψ (0+) =
.
L (0+) =b 0 such that .L(λ0 ) = 1 reduces to the equation of the second degree pe2λ − eλ + q = 0 .
.
Its roots are .eλ = 1 (obviously, as .L(0) = 1) and .eλ = pq . Thus .λ0 = log pq and in this case W has a geometric law with parameter .1 − e−λ0 = 1 − pq . 5.20 (a) The .Sn are the cumulative sums of independent centered r.v.’s, hence they form a martingale (Example 5.2(a)). (b) The r.v.’s .Xk are bounded, therefore .Sn ∈ L2 . The associated increasing process, i.e. the compensator of the submartingale .(Sn2 )n , is defined by .A0 = 0 and
Exercise 5.22
377
2 2 An+1 = An + E(Sn+1 | Fn ) − Sn2 = An + E(2Sn Xn+1 + Xn+1 | Fn )
.
2 = An + E(Xn+1 ) = An + 2−n
hence, by induction, An =
n−1
.
2−k = 2(1 − 2−n ) .
k=0
(Note that the increasing process .(An )n is deterministic, as always with a martingale with independent increments, Exercise 5.5(b).) (c) As the associated increasing process .(An )n is bounded and An = E(Sn2 ) ,
.
we deduce that .(Sn )n is bounded in .L2 , so that it converges a.s. and in .L2 and is regular. 5.21 We have p(X ) p(x) k = E p(x) = 1 . q(x) = q(Xk ) q(x)
.
x∈E
(7.76)
x∈E
Yn is therefore the product of positive independent r.v.’s having expectation equal to 1 and is therefore a martingale (Example 5.2(b)). Being a positive martingale it converges a.s. Recalling Remark 5.24(c), the limit is 0 a.s. and .(Yn )n cannot be regular.
.
5.22 (a) Let us argue by induction. Of course .X0 = q ∈ [0, 1]. Assume that Xn2 ∈ [0, 1]. Then obviously .Xn+1 ≥ 0 and also
.
Xn+1 =
.
1 2 1 1 1 X + 1[0,Xn ] (Un+1 ) ≤ + = 1 . 2 n 2 2 2
(b) The fact that .(Xn )n is adapted to .( Fn )n is also immediate by induction. Let us check the martingale property. We have 1 1 E(Xn+1 | Fn ) = E Xn2 + 1[0,Xn ] (Un+1 ) Fn 2 2 1 1 = Xn2 + E 1[0,Xn ] (Un+1 ) Fn . 2 2 By the freezing lemma .E 1[0,Xn ] (Un+1 ) Fn = Φ(Xn ) where, for .0 ≤ x ≤ 1, .
Φ(x) = E[1[0,x] (Un+1 )] = P(Un+1 ≤ x) .
.
378
7 Solutions
An elementary computation gives, for the d.f. of .Un , .P(Un ≤ x) = 2x − x 2 so that E(Xn+1 | Fn ) =
.
1 2 1 X + Xn − Xn2 = Xn . 2 n 2
(c) .(Xn )n is a bounded martingale, hence is regular and converges a.s. and in .Lp for every .p ≥ 1 to some r.v. .X∞ and .E(X∞ ) = limn→∞ E(Xn ) = E(X0 ) = q. (d) (5.31) gives 1 1 2 Xn+1 − Xn = 1[0,Xn ] (Un+1 ) , 2 2
.
2 hence .Xn+1 − 12 Xn can only take the values 0 or . 12 and, taking the limit, also .X∞ − 1 2 1 2 X∞ can only take the values 0 or . 2 a.s. Now the equations .x − 12 x 2 = 0 and .x − 12 x 2 = 12 together have the roots .0, 1, 2. As .0 ≤ X∞ ≤ 1, .X∞ can only take the values 0 or 1, hence it has a Bernoulli distribution. As .E(X∞ ) = q, .X∞ ∼ B(1, q).
5.23 Let us denote by .E, .EQ the expectations with respect to .P and .Q, respectively. (a) Recall that, by definition, for .A ∈ Fm , .Q(A) = E(Zm 1A ). Let .m ≤ n. We must prove that, for every .A ∈ Fm , .E(Zn 1A ) = E(Zm 1A ). But as .A ∈ Fm ⊂ Fn , both these quantities are equal to .Q(A). (b) We have .Q(Zn = 0) = E(Zn 1{Zn =0} ) = 0 and therefore .Zn > 0 .Q-a.s. Moreover, as .{Zn > 0} ⊂ {Zm > 0} a.s. if .m ≤ n (Exercise 5.2: the zeros of a positive martingale increase), for every .A ∈ Fm , EQ (1A Zn−1 ) = EQ (1A∩{Zn >0} Zn−1 ) = P(A ∩ {Zn > 0}) .
−1 ≤ P(A ∩ {Zm > 0}) = EQ (1A Zm )
(7.77)
and therefore .(Zn−1 )n is a .Q-supermartingale. (c) Let us assume .P Q: this means that .P(A) = 0 whenever .Q(A) = 0. Therefore also .P(Zn = 0) = 0 and EQ (Zn−1 ) = E(Zn Zn−1 ) = 1 .
.
The .Q-supermartingale .(Zn−1 )n therefore has constant expectation and is a .Qmartingale by the criterion of Exercise 5.1. Alternatively, just repeat the argument of (7.77) obtaining an equality.
Exercise 5.26
379
5.24 (a) If .(Mn )n is regular, then .Mn →n→∞ M∞ a.s. and in .L1 and .Mn = E(M∞ | Fn ). Such an r.v. .M∞ is positive and .E(M∞ ) = 1. Let .Q be the probability on . F having density .M∞ with respect to .P. Then, if .A ∈ Fn , we have Q(A) = E(1A M∞ ) = E[1A E(M∞ | Fn )] = E(1A Mn ) = Qn (A) ,
.
so that .Q and .Qn coincide on . Fn . (b) Conversely, let Z be the density of .Q with respect to .P. Then, for every n, we have for .A ∈ Fn E(Z1A ) = Q(A) = Qn (A) = E(Mn 1A ) ,
.
which implies that E(Z | Fn ) = Mn ,
.
so that .(Mn )n is regular. 1
5.25 (a) Immediate as .Mn is the product of the r.v.’s .eθXk − 2 θ , which are independent and have expectation equal to 1 (Example 5.2(b)). (b1) Let .n > m. As .Xn is independent of .Sm , hence of .Mm , for .A ∈ B(R) we have 2
Qm (Xn ∈ A) = E(1{Xn ∈A} Mm ) = E(1{Xn ∈A} )E(Mm ) = P(Xn ∈ A) .
.
Xn has the same law under .Qm as under .P. (b2) If .n ≤ m instead, .Xn is . Fm -measurable so that
.
Qm (Xn ∈ A) = E(1{Xn ∈A} Mm ) = E E(1{Xn ∈A} Mm | Fn )
.
1 2 = E 1{Xn ∈A} E(Mm | Fn ) = E(1{Xn ∈A} Mn ) = E 1{Xn ∈A} eθXn − 2 θ Mn−1 .
As .Xn is independent of . Fn−1 whereas .Mn−1 is . Fn−1 -measurable, .
1 2 1 2 1 2 · · · = E 1{Xn ∈A} eθXn − 2 θ E(Mn−1 ) = √ eθx− 2 θ e−x /2 dx 2π A
1 2 1 e− 2 (x−θ) dx . =√ 2π A
If .n ≤ m then .Xn ∼ N(θ, 1) under .Qm . 5.26 (a) Follows from Remark 5.2(b), as the .Zn are the cumulative products of the 1 r.v.’s .eXk − 2 ak , which are independent and have expectation equal to 1 (recall the Laplace transform of Gaussian r.v.’s).
380
7 Solutions
(b) The limit .limn→∞ Zn exists a.s., .(Zn )n being a positive martingale. In order to compute this limit, let us try Kakutani’s trick (Remark 5.24(b)): we have .
( 1 1 1 lim E( Zn ) = lim E(e 2 Sn ) e− 4 An = lim e− 8 An = 0 .
n→∞
n→∞
n→∞
(7.78)
Therefore .limn→∞ Zn = 0 and .(Zn )n is not regular. (c1) By (7.78) now .
( 1 lim E( Zn ) = e− 8 A∞ > 0 .
n→∞
Hence (Proposition 5.25) the martingale is regular. Another argument leading directly to the regularity of .(Zn )n can also be obtained by noting that .(Sn )n is itself a martingale (sum of independent centered r.v.’s) which is bounded in .L2 , as .E(Sn2 ) = An . Hence .(Sn )n converges a.s. and in .L2 to some limit .S∞ , which is also Gaussian and centered (Proposition 3.36) as .L2 convergence 1 entails convergence in law. Now if .Z∞ := eS∞ − 2 A∞ we have ∞ ∞ 1 1 E(Z∞ | Fn ) = E exp Sn − An + Xk − ak Fn 2 2
.
k=n+1
1 = eSn − 2 An E exp
∞
Xk −
k=n+1
1 2
∞
k=n+1
ak
1
= eSn − 2 An = Zn ,
k=n+1
again giving the regularity of .(Zn )n . As a consequence of this argument the limit 1 S −1 A .Z∞ = e ∞ 2 ∞ has a lognormal law with parameters .− A∞ and .A∞ (it is the 2 1 exponential of an .N(− 2 A∞ , A∞ )-distributed r.v.). (c2) Let .f : Rn → R be a bounded Borel function. Note that the joint density of .X1 , . . . , Xn (with respect to .P) is 1 .
(2π )n/2
√
− 2a1 x12
Rn
e
1
1
· · · e− 2an xn
2
where .Rn = a1 a2 . . . an . Then we have EQ [f (X1 , . . . , Xn )] = E[f (X1 , . . . , Xn )Z∞ ] = E E[f (X1 , . . . , Xn )Z∞ | Fn ] = E[f (X1 , . . . , Xn )Zn ] .
1
=
1 (2π )n/2
√
= E[f (X1 , . . . , Xn ) eSn − 2 An ]
1 − 1 x2 f (x1 , . . . , xn ) ex1 +···+xn − 2 An e 2a1 1 · · ·
Rn
Rn
1
· · · e− 2an xn dx1 . . . dxn 2
Exercise 5.27
381
1
=
√ (2π )n/2 Rn
− 2a1 (x12 −2a1 x1 )
1
Rn
f (x1 , . . . , xn ) e− 2 (a1 +···+an ) e 1
=
√ (2π )n/2 Rn
Rn
=
(2π )n/2
− 2a1 (x12 −2a1 x1 +a12 )
Rn
1
···
· · · e− 2an (xn −2an xn +an ) dx1 . . . dxn
√
2
f (x1 , . . . , xn ) e 1
1
···
· · · e− 2an (xn −2an xn ) dx1 . . . dxn
1
1
2
2
− 2a1 (x1 −a1 )2
Rn
f (x1 , . . . , xn ) e
1
1
· · · e− 2an (xn −an ) dx1 . . . dxn , 2
so that under .Q the joint density of .X1 , . . . , Xn with respect to the Lebesgue measure is 1 2 1 1 − 1 (x −a )2 g(x1 , . . . , xn ) = √ e 2a1 1 1 · · · √ e− 2an (xn −an ) , 2π an 2π a1
.
which proves simultaneously that .Xk ∼ N(ak , ak ) and that the r.v.’s .Xn are independent. The same result can be obtained by computing the Laplace transform or the characteristic function of .(X1 , . . . , Xn ) under .Q. 5.27 (a) By the freezing lemma, Lemma 4.11, 1 2 2 E eλXn Xn+1 = E E(eλXn Xn+1 | Fn ) = E(e 2 λ Xn )
.
(7.79)
and, recalling Exercise 2.7 (or the Laplace transform of the Gamma distributions), ⎧ ⎨√ 1 λXn Xn+1 .E(e )= 1 − λ2 ⎩ +∞
if |λ| < 1 if |λ| ≥ 1 .
(b) We have E(eZn+1 | Fn ) = eZn E(eλXn+1 Xn | Fn )
.
and by the freezing lemma again .
log E(eZn+1 | Fn ) = Zn +
1 2 2 λ Xn . 2
(7.80)
Let .A0 = 0 and An+1 = An + log E(eZn+1 | Fn ) − Zn = An +
.
1 2 2 λ Xn , 2
(7.81)
382
7 Solutions
i.e. 1 2 2 λ Xk . 2 n
An+1 =
.
k=1
(An )n is obviously predictable and increasing. Moreover, (7.81) gives
.
.
log E(eZn+1 | Fn ) − An+1 = Zn − An
and, taking the exponential and recalling that .An+1 is . Fn -measurable, we obtain E(eZn+1 −An+1 | Fn ) = eZn −An ,
.
so that .Mn = eZn −An is the required martingale. (c) Of course .(Mn )n converges a.s., being a positive martingale. In order to investigate regularity, let us try Kakutani’s trick: we have ( .
Mn = exp
n λ
2
1 2 2 λ Xk . 4 n−1
Xk−1 Xk −
k=1
k=1
One possibility in order to investigate the limit of this quantity is to write ( .
Mn = exp
n λ
2
k=1
Xk−1 Xk −
1 1 2 2 λ Xk exp − λ2 Xk2 := Nn · Wn . 8 8 n−1
n−1
k=1
k=1
Now .(Nn )n is a positive martingale (same as .(Mn )n with . λ2 instead of .λ) and converges a.s. to a finite limit, whereas .Wn →n→∞ 0 a.s., as .E(Xk2 ) = 1 and, √ 2 by the law of large numbers, . n−1 k=1 Xk →n→∞ +∞ a.s. Hence . Mn →n→∞ 0 a.s. and the martingale is not regular. The courageous reader can also attempt to use Hölder’s inequality in order to √ prove that .E( Mn ) →n→∞ 0. 5.28 (a1) Just note that .Bn = σ (Sn )∨σ (Xj , j ≥ n+1) and that .σ (Xj , j ≥ n+1) is independent of .σ (Sn ) ∨ σ (Xk ). The result follows thanks to Exercise 4.3(b). (a2) Follows from the fact that the joint distributions of .Xk , Sn and .Xj , Sn are the same (see also Exercise 4.5). (b1) Thanks to (a), as .X n = n1 (Sn+1 − Xn+1 ) , E(X n | Bn+1 ) = E(X n |Sn+1 ) =
.
=
1 1 Sn+1 − Sn+1 n n(n + 1)
1 1 Sn+1 − E(Xn+1 |Sn+1 ) n n 1 = Sn+1 = X n+1 . n+1
Exercise 6.2
383
(b2) By Remark 5.26, the backward martingale .(Xn )n converges a.s. to an r.v., Z say. As Z is measurable with respect to the tail .σ algebra of the sequence .(Xn )n , as noted in the remarks following Kolmogorov’s 0–1 law, p. 52, Z must be constant a.s. As the convergence also takes place in .L1 , this constant must be .b = E(X1 ). 6.1 (a) Thanks to Exercise 2.9 (b) a Weibull r.v. with parameters .α, λ is of the form X1/α , where X is exponential with parameter .λ. Therefore, recalling Example 6.3, if X is a uniform r.v. on .[0, 1], then .(− λ1 log(1 − X))1/α is a Weibull r.v. with parameters .α, λ. (b) Recall that if .X ∼ N(0, 1) then .X2 ∼ Gamma.( 12 , 12 ). Therefore if .X1 , . . . , Xk are i.i.d. .N(0, 1) distributed r.v.’s (obtained as in Example 6.4) then k 1 1 k 2 2 2 2 .X + · · · + X ∼ Gamma.( , ) and . k 1 2 2 2λ (X1 + · · · + Xk ) ∼ Gamma.( 2 , λ). (c) Thanks to Exercise 2.20 (b) and (b) above if the r.v’s .X1 , . . . , Xk , Y1 , . . . , Ym are i.i.d. and .N (0, 1)-distributed then .
X12 + · · · + Xk2
Z=
.
X12 + · · · + Xk2 + Y12 + · · · + Ym2
has a Beta.( k2 , m2 ) distribution. (d) If the r.v’s .X, Y1 , . . . , Yn are i.i.d. and .N(0, 1)-distributed then .
√ X n ∼ t (n) . 2 + · · · + Yn
Y12
(e) In Exercise 2.43 it is proved that the difference of independent exponential r.v.’s of parameter .λ has a Laplace law of parameter .λ. Hence if .X1, X2 are independent and uniform on .[0, 1], then .− λ1 log(1 − X1 ) − log(1 − X2 ) has the requested distribution. (f) Thanks to Exercise 2.12(a), if X is exponential with parameter .− log(1 − p), then .X is geometric with parameter p. • Note that, for every choice of .α, β ≥ 1, a Beta.(α, β) r.v. can be obtained with the rejection method, Example 6.13. 6.2 For every orthogonal matrix .O ∈ O(d) we have OZ =
.
OX OX = · |X| |OX|
As .OX ∼ X we have .OZ ∼ Z so that the law of Z is the normalized Lebesgue measure of the sphere. • Note that also in this case there are many possible ways of simulating the random choice of a point of the sphere with the normalized Lebesgue measure. Sarting from Exercise 2.14, for example, in the case of the sphere .S2 of .R3 .
384
7 Solutions
6.3 (a) We must compute the d.f., F say, associated to f and its inverse. We have, for .t ≥ 0,
F (t) =
t
.
0
t α 1 1 ds = − · =1− α α+1 (1 + s) 0 (1 + t)α (1 + s)
The equation 1−
.
1 =x (1 + t)α
is easily solved, giving, for .0 < x < 1, Φ(x) =
.
1 −1. (1 − x)1/α
(b) The joint law of X and Y is, for .x, y > 0, h(x, y) = fY (y)f (x; y) =
.
1 1 y α−1 e−y × y e−yx = y α e−y(x+1) Γ (α) Γ (α)
and the law of X has density with respect to the Lebesgue measure given by
fX (x) =
+∞
.
=
−∞
1 h(x, y) dy = Γ (α)
+∞
y α e−y(x+1) dy
0
Γ (α + 1) = f (x) . Γ (α)(1 + x)α+1
Therefore the following procedure produces a random number according with to law defined by f : • first sample a number y with a Gamma.(α, 1) distribution, • then sample a number x with an exponential distribution with parameter y. This provides another algorithm for generating a random number with density f , at least for the values of .α for which we know how to simulate a Gamma.(α, 1) r.v., see Exercise 6.1(b).
References
1. P. Baldi, L. Mazliak, P. Priouret, Solved exercises and elements of theory. Martingales and Markov Chains (Chapman & Hall/CRC, Boca Raton, 2002). 2. P. Billingsley, Probability and Measure. Wiley Series in Probability and Mathematical Statistics, 3rd edn. (John Wiley & Sons, New York, 1995) 3. M. Brancovan, T. Jeulin, Probabilités Niveau M1 (Ellipses, Paris, 2006) 4. L. Breiman, Probability (Addison-Wesley, Reading, 1992) 5. P. Brémaud, Probability Theory and Stochastic Processes Universitext (Springer, Cham, 2020) 6. E. Çınlar, Probability and Stochastics. Graduate Texts in Mathematics, vol. 261 (Springer, New York, 2011) 7. L. Chaumont, M. Yor, A guided tour from measure theory to random processes, via conditioning. Exercises in Probability. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 35, 2nd edn. (Cambridge University Press, Cambridge, 2012). 8. D. Dacunha-Castelle, M. Duflo, Probability and Statistics, vol. I (Springer-Verlag, New York, 1986) 9. C. Dellacherie, P.-A. Meyer, Probabilités et Potentiel, chap. I à IV (Hermann, Paris, 1975) 10. L. Devroye, Nonuniform Random Variate Generation (Springer-Verlag, New York, 1986) 11. R.M. Dudley, Real Analysis and Probability. Cambridge Studies in Advanced Mathematics, vol. 74 (Cambridge University Press, Cambridge, 2002). Revised reprint of the 1989 original 12. R. Durrett, Probability–Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 49, 5th edn. (Cambridge University Press, Cambridge, 2019) 13. W. Feller, An Introduction to Probability Theory and Its Applications, vol. II (John Wiley and Sons, New York, 1966) 14. G.S. Fishman, Concepts, algorithms, and applications. Monte Carlo. Springer Series in Operations Research (Springer-Verlag, New York, 1996) 15. J.E. Gentle, Random Number Generation and Monte Carlo Methods. Statistics and Computing, 2nd edn. (Springer, New York, 2003) 16. P.R. Halmos, Measure Theory (D. Van Nostrand Co., New York, 1950) 17. O. Kallenberg, Foundations of Modern Probability. Probability Theory and Stochastic Modelling, vol. 99, 3rd edn. (Springer, Cham, 2021) 18. D.E. Knuth, Seminumerical algorithms. The Art of Computer Programming, vol. 2, 3rd edn. (Addison-Wesley, Reading, 1998) 19. J.-F. Le Gall, Measure Theory, Probability, and Stochastic Processes. Graduate Texts in Mathematics, vol. 295 (Springer, Cham, 2022) 20. J. Neveu, Mathematical Foundations of the Calculus of Probability (Holden-Day, San Francisco/California/London/Amsterdam, 1965)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Baldi, Probability, Universitext, https://doi.org/10.1007/978-3-031-38492-9
385
386
References
21. J. Neveu, Discrete-Parameter Martingales. North-Holland Mathematical Library, vol. 10, revised edn. (North-Holland Publishing/American Elsevier Publishing, Amsterdam/Oxford/New York, 1975) 22. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, The art of scientific computing. Numerical Recipes in C, 2nd edn. (Cambridge University Press, Cambridge, 1992). 23. D.W. Stroock, S.R.S. Varadhan, Multidimensional Diffusion Processes. Grundlehren der Mathematisches Wissenschaften, vol. 233 (Springer, Berlin/Heidelberg/New York, 1979) 24. D. Williams, Foundations. Diffusions, Markov Processes, and Martingales, vol. 1. Probability and Mathematical Statistics (John Wiley & Sons, Chichester, 1979) 25. D. Williams, Probability with Martingales. Cambridge Mathematical Textbooks (Cambridge University Press, Cambridge, 1991)
Index
Symbols Lp spaces, 27, 38 p spaces, 39 σ -additivity, 8 σ -algebras, 1 Baire, 35 Borel, 2 generated, 2, 6 independent, 44 P-trivial, 51 product, 28 tail, 51
A Absolute continuity, 25 Adapted, process, 205 Algebras, 1 Associated increasing process, 208 Atoms, 198
B Beppo Levi, theorem, 15, 183 Bernstein polynomials, 143 Borel-Cantelli, lemma, 118 Box-Müller, algorithm, 55, 244
C Carathéodory criterion, 9 extension theorem, 10 Cauchy-Schwarz, inequality, 27, 62 Cauchy’s law, 85, 102, 107, 108, 193 Change of probability, 102, 103, 108, 199–201, 237, 238
Characteristic functions, 69 Chebyshev, inequality, 65 Cochran, theorem, 93 Compensator of a submartingale, 208 Conditional expectation, 178, 179 law, 177, 189 Confidence intervals, 97 Convergence almost sure, 115 in law, 141 in Lp , 116 in probability, 115 weak of finite measures, 129 Convolution, 39, 55 Correlation coefficient, 69 Covariance, 65 matrix, 66
D Delta method, 165 Density, 24 Dirac masses, 23 Distribution functions (d.f.), 11, 41 Doob decomposition, 208 maximal inequality, 222 measurability criterion, 6
E Elementary functions, 5 Empirical means, 52 Events, 41 Exponential families, 110
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Baldi, Probability, Universitext, https://doi.org/10.1007/978-3-031-38492-9
387
388 F Fatou, lemma, 16, 183 Filtrations, 205 natural, 205 Fisher, approximation, 175 Fubini-Tonelli, theorem, 33 Functions integrable, 14 semi-integrable, 14
H Haar measure, 251 Histograms, 127 Hölder, inequality, 27, 62
I Independence of events, 46 of r.v.’s, 46 of σ -algebras, 44 Inequalities Cauchy-Schwarz, 27, 62 Chebyshev, 65 Doob, 222 Hölder, 27, 62 Jensen, 61, 183 Markov, 64 Minkowski, 27, 62 Infinitely divisible laws, 108
J Jensen, inequality, 61, 183
K Kolmogorov 0-1 law, 51 Law of Large Numbers, 126 Kullback-Leibler, divergence, 105, 163, 255
L Laplace transform, 82 convergence abscissas, 83 domain, 82 Laws Cauchy, 107 Gaussian multivariate, 87, 148, 196 non-central chi-square, 113 Skellam binomial, 194 Student, 94, 142, 192, 201
Index Student multivariate, 201 Weibull, 100, 258 Laws of Large Numbers histograms, 128 Kolmogorov’s, 126 Monte Carlo methods, 249 Rajchman’s, 125 Lebesgue measure, 12, 34 theorem, 16, 183 Lemma Borel-Cantelli, 118 Fatou, 16, 183 Slutsky, 162
M Markov inequality, 64 property, 201 Martingales, 206 backward, 228 Doob’s maximal inequality, 222 with independent increments, 231 maximal inequalities, 215 regular, 225 upcrossings, 216 Mathematical expectation, 42 Maximal inequalities, 215 Measurable functions, 3 space, 2 Measures, 7 on an algebra, 8 Borel, 10 counting, 24 defined by a density, 24 Dirac, 23 finite, σ -finite, probability, 8 image, 24 Lebesgue, 12, 34 product, 32 Measure spaces, 7 Minkowski, inequality, 27, 62 Moments, 63, 106 Monotone classes, 2 theorem, 2 Monte Carlo methods, 249
N Negligible set, 12
Index O Orthogonal projector, 92
P Passage times, 210 Pearson (chi-square), theorem, 155 Pearson’s statistics, 155 P. Lévy, theorem, 132, 255 Positive definite, function, 107, 113 Predictable increasing processes, 208 Prohorov distance, 254 theorem, 253
Q Quantiles, 95
R Radon-Nikodym, theorem, 26, 229 Random variables, 41 centered, 42 correlated,uncorrelated, 66 independent, 46 laws, 41 Random walk, 220, 233–235 simple, 220 Regression line, 67 Rejection method, 250 Relative entropy, 105, 163, 255
S Scheffé, theorem, 139 Skewness, of an r.v., 105 Slutsky, lemma, 162
389 Stein’s characterization of the Gaussian, 108 Stopping times, 209 Student laws, 94, 142, 192, 201 Supermartingales, submartingales, 206 Support of a measure, 36
T Tensor product, 51 Theorems Beppo Levi, 15, 183 Carathéodory extension, 10 Central limit, 152 Cochran, 93 derivation under the integral sign, 17 Fubini-Tonelli, 33 inversion of characteristic functions, 79 Lebesgue, 16, 183 Pearson (chi-square), 155 P. Lévy, 132, 255 Prohorov, 253 Radon-Nikodym, 26, 229 Scheffé, 139 Tight, probabilities, 252
U Uniform integrability, 144 Upcrossing, 216
V Variance, 63
W Wald identities, 234 Weibull, laws, 100, 258