

SOLUTIONS MANUAL FOR Essentials of Probability Theory for Statisticians

by

Michael A. Proschan and Pamela A. Shaw

Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business


CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2016 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed on acid-free paper Version Date: 20160531 International Standard Book Number-13: 978-1-4987-0423-6 (Ancillary) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com


This manual repeats the original problem in boldface, and then gives a solution in normal font. Our solutions are not the only ones, or even necessarily the shortest ones. As we worked out the solutions, we noticed typos, unclear wordings, or omissions in the following problems.

1. Section 4.2.1, problem 4b.
2. Section 5.2.1, problem 2.
3. Section 7.1, problem 4.
4. Section 8.2, problem 8.
5. Section 9.5, problem 7.
6. Section 10.6.4, problem 4.
7. Section 11.11.3, problem 2.

We have corrected the wording using red font.


Section 2.1

1. Which has greater cardinality, the set of integers or the set of even integers? Justify your answer.

They both have the same cardinality, namely countable infinity, because we can put each of them in 1-1 correspondence with the positive integers:

Integers:        0 ↔ 1,   1 ↔ 2,   −1 ↔ 3,   2 ↔ 4,   −2 ↔ 5,  . . .
Even integers:   0 ↔ 1,   2 ↔ 2,   −2 ↔ 3,   4 ↔ 4,   −4 ↔ 5,  . . .

Formally, the correspondence from integers to positive integers is

f(k) = 2k          if k is a positive integer
f(k) = −2k + 1     if k is a negative integer or 0,

whose inverse function from positive integers to integers is

g(k) = m           if k = 2m, m = 1, 2, . . .
g(k) = −m          if k = 2m + 1, m = 0, 1, 2, . . .

Similarly, the 1-1 correspondence between the even integers {2k, k an integer} and the positive integers is represented by

f(2k) = 2k         if k is positive
f(2k) = −2k + 1    if k is 0 or negative,

with inverse function

g(k) = −(k − 1)    if k is odd
g(k) = k           if k is even.

Another way to show that the integers and even integers have the same cardinality is to put them in 1-1 correspondence with each other as follows:

Even integers:   0,   2,   −2,   4,   −4,  . . .
Integers:        0,   1,   −1,   2,   −2,  . . .
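As an illustrative aside (not part of the original solution), the maps f and g above can be checked mechanically. The following Python sketch, with function names of our own choosing, verifies numerically that f and g are mutual inverses between the integers and the positive integers on a range of values.

```python
# Hypothetical numerical check of the bijection between integers and positive integers.
def f(k):
    # integers -> positive integers
    return 2 * k if k > 0 else -2 * k + 1

def g(k):
    # positive integers -> integers (inverse of f)
    return k // 2 if k % 2 == 0 else -(k - 1) // 2

assert all(g(f(k)) == k for k in range(-1000, 1001))
assert all(f(g(n)) == n for n in range(1, 2001))
print("f and g are inverse bijections on the tested range")
```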

2. Prove that the set of irrational numbers in [0, 1] is uncountable. We use a contrapositive argument. We know by Proposition 2.5 that the set of rationals is countable, so if the set of irrationals were also countable, then [0, 1] = {rationals} ∪ {irrationals} would be the union of two countable sets, and therefore countable by Proposition 2.6. But [0, 1] is uncountable (Proposition 2.2), so the irrationals must be uncountable.


3 3. What is the cardinality of the following sets? (a) The set of functions f : {0, 1} −→ {integers}. Countable by Proposition 2.3 because the set of such functions corresponds to the direct product {integers} × {integers} of two countable sets. (b) The set of functions f {0, 1, . . . , n} −→ {integers}. Countable by repeated application of Proposition 2.3 because it is the direct product {integers} × {integers} × . . . × {integers} of n + 1 countable sets. (c) The set of functions f : {integers} −→ {0, 1}. This has the same cardinality as [0, 1] because each function corresponds to a countably infinite string of zeros and ones, and the set of all such strings has the cardinality of [0, 1] by part 2 of Proposition 2.10. 4. Which of the following arguments is (are) correct? Justify your answer. (a) Any collection C of nonempty, disjoint intervals (a, b) is countable because for each C ∈ C, we can pick a rational number x ∈ C. Therefore, we can associate the Cs with a subset of rational numbers. Correct argument. (b) C must be uncountable because for each C ∈ C, we can pick an irrational number x ∈ C. Therefore, we can associate the Cs with irrational numbers. Incorrect because this associates the Cs with a subset of irrational numbers, not all rational numbers. A subset of irrational numbers can be either countable or uncountable. (c) Any collection C of nonempty intervals (not just disjoint) must be countable because for each C ∈ C, we can pick a rational number. Therefore, we can associate C with a subset of rational numbers. Incorrect because there is no guarantee that for different Cs we will pick different rational numbers. We could pick the same rational number an uncountably infinite number of times, which does not establish a 1-1 mapping. A counterexample is the set of intervals of the form [x, 1], x ∈ [0, 1], which is in 1 − 1 correspondence with its set of left endpoints, x. Therefore the cardinality of the set


4 of intervals of the form [x, 1], x ∈ [0, 1] is that of [0, 1], which is uncountable. 5. Prove that the set of numbers in [0, 1] with a repeating decimal representation of the form x = .a1 a2 . . . ana1 a2 . . . an . . . (e.g., .111 . . ., .976976976 . . ., etc.) is countable. Hint: if x = .a1 a2 . . . ana1 a2 . . . an . . ., what is 10nx − x, and therefore what is x? If x = .a1 a2 . . . an a1 a2 . . . an . . ., then 10n x − x = a1 a2 . . . an (that is, an × 100 + an−1 × 101 + . . . + a1 × 10n−1 ), so x = a1 a2 . . . an /(10n − 1) is rational. Thus, the set of repeating decimals corresponds to a subset of rational numbers, and is therefore countable by Propositions 2.4 and 2.5. 6. (a) Prove that any closed interval [a, b], −∞ < a < b < ∞, has the same cardinality as [0, 1]. Make the following 1 − 1 correspondence: f (ω) = (b − a)ω + a between [0, 1] and [a, b]. (b) Prove that any open interval (a, b), −∞ < a < b < ∞, has the same cardinality as [0, 1]. The set [a, b] = (a, b) ∪ {a, b}, so [a, b] has the same cardinality as (a, b) because augmenting any infinite set with a countable set does not change its cardinality (part 2 of Proposition 2.9). By part a, [a, b] has the same cardinality as [0, 1], so (a, b) also has the same cardinality as [0, 1]. (c) Prove that (−∞, ∞) has the same cardinality as [0, 1]. The function f (ω) = ln{ω/(1 − ω)} exhibits a 1 − 1 correspondence between (0, 1) and (−∞, ∞), proving that (0, 1) and (−∞, ∞) have the same cardinality. Moreover, [0, 1] = (0, 1) ∪ {0, 1} has the same cardinality as (0, 1) by part 2 of Proposition 2.9, so [0, 1] has the same cardinality as (−∞, ∞). 7. Prove Proposition 2.4. Clearly, if A has only finitely many elements, then A is countable, so assume that A is infinite. Each element of A corresponds to a unique element bπi of B, so we have the following correspondence between A and the positive integers.


A:                   bπ1    bπ2    bπ3    . . .
Positive integers:   1      2      3      . . .

Therefore, A is countably infinite.

8. Prove that if A is any set, the set of all subsets of A has larger cardinality than A. Hint: suppose there were a 1-1 correspondence a ↔ Ba between the elements of A and the subsets Ba of A. Construct a subset C of A as follows: if a ∈ Ba, then exclude a from C, whereas if a ∉ Ba, include a in C.

Define the subset C as in the hint: if a ∈ Ba, then exclude a from C, whereas if a ∉ Ba, include a in C. Then C is not any of the Ba because C either excludes an element of Ba or includes an element outside of Ba. Therefore, the collection Ba cannot include all subsets of A. We have shown that the collection of subsets of A is larger than the set of elements of A.
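The diagonal construction in the hint can be illustrated concretely for a finite list of subsets. The Python sketch below is our own illustration with a hypothetical correspondence a ↔ Ba; it builds the set C and confirms that C differs from every listed Ba.

```python
# Hypothetical illustration of the diagonal argument: for any assignment a -> B_a,
# the set C = {a : a not in B_a} differs from every B_a (it disagrees at a itself).
A = set(range(6))
correspondence = {0: {1, 2}, 1: {1, 3, 5}, 2: set(), 3: set(A), 4: {4}, 5: {0, 5}}

C = {a for a in A if a not in correspondence[a]}

assert all(C != B_a for B_a in correspondence.values())
print("C =", C, "is not among the listed subsets")
```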

9. The Cantor set is defined as the set of numbers in [0, 1] whose base 3 representation .a1a2 . . . = a1/3^1 + a2/3^2 + . . . has ai ≠ 1 for each i. Show that the Cantor set has the same cardinality as [0, 1].

The Cantor set corresponds to the set of infinite strings (a1, a2, . . .), where each ai is 0 or 2, which clearly is in 1-1 correspondence with the set of infinite strings (b1, b2, . . .) of 0s and 1s. This latter set was shown to have the same cardinality as [0, 1] (part 2 of Proposition 2.10). Therefore, the Cantor set has the same cardinality as [0, 1].

10. Show that the set A constructed in Example 2.14 is empty.

Consider step t. Let r be a rational number such that 0 < r < t and let s = t − r. Then t is in As because t − s is the rational number r. Because t is in As already and s < t, the procedure does not add t to A. But t was an arbitrary nonzero number. Therefore, no elements ever get added to A, so A is empty.

11. Imagine a 1 × 1 square containing lights at each pair (r, s), where r and s are both rational numbers in the interval [0, 1].

(a) Prove that the set of lights is countable.

Let Z be the set of rational numbers. The set of lights is in 1-1 correspondence with the direct product Z × Z. Because Z is


countable by Proposition 2.5, Z × Z is countable by Proposition 2.3.

(b) We can turn each light either on or off, and thereby create light artwork. Prove that the set of all possible pieces of light artwork is uncountable, and has the same cardinality as [0, 1].

The set of all possible pieces of light artwork is the set of infinite strings of 0s and 1s. Therefore, it has the same cardinality as [0, 1] by part 2 of Proposition 2.10.

(c) We create a line segment joining the point (r1, s1) to the point (r2, s2) by turning on the lights at all positions (R, S) = (1 − λ)(r1, s1) + λ(r2, s2) as λ ranges over the rational numbers in [0, 1]. Show that the set of all such line segments is countable. Hint: think about the set of endpoints of the segments.

The set of possible line segments is in 1-1 correspondence with its set of possible endpoints. This latter set is in 1-1 correspondence with a subset of A × B, where A = B = {(r, s) : r and s are rational}. Each of A and B is countable by Proposition 2.3 because it is the direct product Z × Z, where Z is the set of rational numbers (which is countable by Proposition 2.5). Because A and B are both countable, A × B is countable by Proposition 2.3. The result now follows from Proposition 2.4.
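The countability argument in part (a) can be made explicit with a diagonal (Cantor-style) enumeration of pairs. The Python sketch below is our own illustration, not part of the text; it indexes pairs of nonnegative integers, which is the same device that indexes pairs drawn from any countable set such as the rationals labeling the lights.

```python
# Hypothetical sketch: a diagonal (Cantor) enumeration of N x N shows that the direct
# product of two countable sets is countable; the lights correspond to such pairs.
def pair_index(i, j):
    """Position of the pair (i, j), i, j = 0, 1, 2, ..., in the diagonal enumeration."""
    d = i + j
    return d * (d + 1) // 2 + j

# Every pair gets a distinct index, and the indices 0, 1, 2, ... are all used.
pairs = [(i, j) for i in range(50) for j in range(50) if i + j < 50]
indices = sorted(pair_index(i, j) for (i, j) in pairs)
assert indices == list(range(len(pairs)))
print("first few pairs in order:",
      sorted(pairs, key=lambda p: pair_index(*p))[:6])
```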


Section 3.2.3

1. Which of the following are sigma-fields of subsets of Ω = {1, 2, 3, 4, 5, 6}?

(a) {{1, 2}, {1, 2, 3}, {4, 5, 6}, ∅, Ω}. No, because {1, 2}^C = {3, 4, 5, 6} is not in the collection.

(b) {{1, 2}, {3, 4, 5, 6}, ∅, Ω}. Yes.

(c) {{1}, {3, 4, 5, 6}, {1, 3, 4, 5, 6}, Ω}. No, because ∅ is not in the collection. 2. Let Ω = {1, 2, 3, . . .}. (a) Is {1, 3, 5, . . .}, {2, 4, 6, . . .}, ∅, Ω} a sigma-field of subsets of Ω? Yes. (b) Let Ri be the set of elements of Ω with remainder i when divided by 3. Is {R0 , R1 , R2 , R0 ∪R1 , R0 ∪R2 , R1 ∪R2 , ∅, Ω} a sigma-field of subsets of Ω? Yes. (c) Let Si be the set of elements of Ω with remainder i when divided by 4. What is σ(S0 , S1 , S2 , S3 ) (i.e., the smallest sigma-field containing each of these sets)? {S0 , S1 , S2 , S3 , S0 ∪ S1 , S0 ∪ S2 , S0 ∪ S3 , S1 ∪ S2 , S1 ∪ S3 , S2 ∪ S3 , S0 ∪ S1 ∪ S2 , S0 ∪ S1 ∪ S3 , S0 ∪ S2 ∪ S3 , S1 ∪ S2 ∪ S3 , ∅, Ω}. 3. Let Ω = [0, 1], A1 = [0, 1/2), and A2 = [1/4, 3/4). Enumerate the sets in σ(A1 , A2 ), the smallest sigma-field containing A1 and A2 . The sigma-field consists of all possible unions of the sets: [0, 1/4), [1/4, 1/2), [1/2, 3/4), and [3/4, 1]. Therefore, σ(A1 , A2 ) is {[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1], [0, 1/2), [0, 1/4)∪[1/2, 3/4), [0, 1/4)∪[3/4, 1], [1/4, 3/4), [1/4, 1/2) ∪ [3/4, 1], [1/2, 1], [0, 3/4), [0, 1/4) ∪ [1/2, 1], [0, 1/2) ∪ [3/4, 1], [1/4, 1], [0, 1], ∅}. 4. Let Ω = {1, 2, . . .} and A = {2, 4, 6, . . .} Enumerate the sets in σ(A), the smallest sigma-field containing A. What is σ(A1 , A2 , . . .), where Ai = {2i}, i = 1, 2, . . .? σ(A) = {{1, 3, 5, . . .}, {2, 4, 6, . . .}, ∅, Ω}.


8 Next consider σ(A1 , A2 , . . .}. Let Bt , as t ranges over some index set T , be the collection of all subsets of even integers. Let O denote the set of odd integers {1, 3, 5, . . .}. Then σ(A1 , A2 , . . .} = {Bt , O, Bt ∪ O, t ∈ T }. 5. Let B(0, r) be the two dimensional open ball centered at 0 with radius r; i.e., B(0, r) = {(ω1 , ω2 ) : ω12 + ω22 < r 2 } . If Ω = B(0, 1), what is σ(B(0, 1/2), B(0, 3/4))? Let D(r1 , r2 ) be the set of points (x, y) such that r12 ≤ x2 + y 2 < r22 . Then σ(B(0, 1/2), B(0, 3/4)) consists of all possible unions of the sets B(0, 1/2), D(1/2, 3/4), D(3/4, 1), namely {B(0, 1/2), D(1/2, 3/4), D(3/4, 1), B(0, 3/4), B(0, 1/2) ∪ D(3/4, 1), D(1/2, 1), B(0, 1), ∅}. 6. Show that if Ω = R2 , the set of all open and closed sets of Ω is not a sigma-field. Let A be the open set {(x, y) : x2 + y 2 < 1, x < 0} and B be the closed set {(x, y) : x2 + y 2 ≤ 1, x ≥ 0}. Then A ∪ B is neither open nor closed. It is not open because, for example, there is no open ball around (1, 0) that is contained within A ∪ B. It is not closed because it does not contain the boundary point (−1, 0), for example (and closed sets contain their boundary points). 7. Give an example to show that the union of two sigma-fields need not be a sigma-field. Let Ω = {1, 2, 3}, F = {{1}, {2, 3}, ∅, Ω} and G = {{3}, {1, 2}, ∅, Ω}. Then F and G are both sigma-fields and F ∪G = {{1}, {2, 3}, {3}, {1, 2}, ∅, Ω} is not a sigma-field because the union of the two sets {1} and {3} in F ∪ G is not in F ∪ G. 8. Let Ω be a countably infinite set like {0, 1, 2, . . .}, and let F be the set of finite and co-finite subsets (A is co-finite if AC is finite). Show that F is a field, but not a sigma-field. Let A be the set of finite or co-finite sets, and let A1 ∈ F and A2 ∈ F. C Clearly, AC i ∈ F because if Ai is finite, then Ai is co-finite, while if Ai is co-finite, then AC i is finite. Therefore, A is closed under complements. Now consider A1 ∪A2 . If A1 and A2 are both finite, then A1 ∪A2 is finite, hence A1 ∪ A2 ∈ F. If either A1 or A2 is co-finite, then (A1 ∪ A2 )C = C AC 1 ∩ A2 is finite. This means that A1 ∪ A2 is co-finite, hence in A. Repeating this argument shows that A is closed under finite unions. We have shown that A is closed under complements and finite unions, and is therefore a field.


9 A is not a sigma-field because if Ω = {a1 , a2 , . . .}, then each {an } ∈ A because it is a finite set, yet ∪n odd {ai } = {a1 , a3 , a5 . . .} is not in A because both it and its complement contain infinitely many elements. That is, A is not closed under countable unions, hence not a sigma-field. 9. * Prove that if Ω is uncountable, then the set of countable and co-countable sets (recall that A is co-countable if AC is countable) is a sigma-field. Note that A is nonempty because Ω is co-countable (its complement contains no elements). Let Ai ∈ A be countable or co-countable. Clearly, C AC i ∈ A because if Ai is countable, then Ai is co-countable, while if Ai C is co-countable, Ai is countable. Therefore, we have shown that A is closed under complements. Now consider ∪∞ i=1 Ai . If any of the Ai are C C co-countable, then (∪Ai ) = ∩Ai is countable, so ∪Ai is co-countable. On the other hand, if all of the Ai are countable, then ∪Ai is countable because the countable union of countable sets is countable. Therefore, ∪∞ i=1 Ai ∈ A. We have shown that A is a nonempty collection that is closed under complements and countable unions. Therefore, A is a sigma-field. 10. * Show that the one-dimensional Borel sigma-field B is generated by sets of the form (−∞, x], x ∈ R. That is, the smallest sigma-field containing the sets (−∞, x], x ∈ R, is the Borel sigma-field. The same is true if (−∞, x] is replaced by (−∞, x). Because the Borel sigma-algebra is generated by intervals, it suffices to show that all intervals are included in A = σ{(−∞, x], x ∈ R}. Consider each interval of the form (a, b], −∞ < a < b < ∞. Then (a, b] = (−∞, b] ∩ (−∞, a]C . Also, Ab = (−∞, b] ∈ A by definition, as C is Aa = (−∞, a]. Because A is a sigma-field, AC a ∈ A, so Ab ∩ Aa ∈ A. Therefore, all intervals of the form (a, b], −∞ < a < b < ∞, are in A. Intervals of the form (a, b) can be written as the countable union ∩n (a, b − 1/n] of sets in A, and are therefore in A. Intervals of the form [a, b) can be written as ∩n (a − 1/n, b) ∈ A. Intervals of the form [a, b] can be written as ∩[a, b + 1/n) ∈ A. Finally, intervals of the form (a, ∞) can be written as (−∞, a]C ∈ A. We have shown that A contains all intervals, completing the proof. 11. * Prove that Example 3.8 is a field, but not a sigma-field. To show that A is a field, we must show that A is closed under complements and finite unions. By DeMorgan’s laws, it suffices to show that A is closed under complements and finite intersections. Let A and B be


two sets in A. Then A = ∪_{i=1}^m Ai and B = ∪_{j=1}^n Bj, where the Ai are disjoint intervals of the form (−∞, a], (a, b], or (b, ∞), as are the Bj. We first show that A ∩ B is a finite union of disjoint intervals of the three types. Note that

A ∩ B = (∪_{i=1}^m Ai) ∩ (∪_{j=1}^n Bj) = ∪_{i,j} (Ai ∩ Bj).

Note that Ai1 ∩ Bj1 and Ai2 ∩ Bj2 are disjoint unless i1 = i2 and j1 = j2 because the Ai are disjoint and the Bj are disjoint. Therefore, if we demonstrate that each Ai ∩ Bj is the union of a finite number of disjoint intervals of the three types, that will show that A ∩ B is also the union of a finite number of disjoint intervals of the three types. We consider each of the cases. (a) Suppose that Ai = (−∞, ai ] and Bj = (−∞, bj ]. Then Ai ∩ Bj = (−∞, min(ai , bj )], which is an interval of the form (−∞, s]. (b) Suppose that Ai = (−∞, ai ] and Bj = (bj , cj ]. Then Ai ∩ Bj = (bj , min(ai , cj )] is either the empty set or an interval of the form (s, t]. (c) Suppose that Ai = (−∞, ai ] and Bj = (bj , ∞). Then Ai ∩ Bj = (bj , ai ] is either the empty set or an interval of the form (s, t]. (d) Suppose that Ai = (ai , bi ] and Bj = (cj , dj ]. Then Ai ∩ Bj = (max(ai , cj ), min(bi , dj )], which is either the empty set or an interval of the type (s, t]. (e) Suppose that Ai = (ai , bi ] and Bj = (cj , ∞). Then Ai ∩ Bj = (max(ai , cj ), bi ], which is either the empty set or an interval of the type (s, t]. (f) Finally, suppose that Ai = (ai , ∞) and Bj = (bj , ∞). Then Ai ∩ Bj = (max(ai , bj ), ∞) is either the emptyset or an interval of the form (s, ∞). This shows that Ai ∩ Bj is the union of a finite number of disjoint sets of the three types, and hence the same is true of A ∩ B. In other words, A is closed under intersections of 2 members. Repeated application of this result shows that it is closed under finite intersections. We need only show that A is also closed under complements. If A = C m C C ∪m i=1 Ai , then A = ∩i=1 Ai . If Ai is of the form (−∞, ai ], then Ai = (ai , ∞); if Ai = (ai , ∞), then AC i = (−∞, ai ]; if Ai = (ai , bi ], then


11 C AC i = (−∞, ai ] ∪ (bi , ∞). Thus, Ai is also the union of a finite number of disjoint intervals. It follows that AC is as well. This shows that A is closed under complements.

We have shown that A is a nonempty collection that is closed under finite intersections and complements, hence a field. To see that it is not a sigma-field, note that it does not contain any singletons. For example, the set {1} cannot be written as the union of a finite number of disjoint intervals of the 3 types. If A were a sigma-field, then it would include all singletons {a} because {a} = ∩n (a − 1/n, a]. 12. Let B denote the Borel sets in R, and let B[0,1] be the Borel subsets of [0, 1], defined as {B ∈ B, B ⊂ [0, 1]}. Prove that B[0,1] = {B ∩ [0, 1], B ∈ B}. Let E be of the form E ∈ B, E ⊂ [0, 1]. Then E = E ∩ [0, 1], so E is of the form B ∩ [0, 1] for some Borel set B on the line, namely B = E. Now suppose that E is of the form B ∩ [0, 1] for some Borel set B of R. Then E is a subset of [0, 1] and a Borel set because both B and [0, 1] are Borel sets, and Borel sets are closed under finite intersections. Therefore, E is of the form B ⊂ [0, 1], B ∈ B. We have shown that each set of the form {B ∈ B, B ⊂ [0, 1]} is of the form {B ∩ [0, 1], B ∈ B}, and vice-versa. Therefore, these formulations are equivalent. 13. * The countable collection {Ai}∞ i=1 is said to be a partition of Ω if ∞ the Ai are disjoint and ∪i=1 Ai = Ω. Prove that the sigma-field generated by {Ai}∞ i=1 consists of all unions of the Ai. What is its cardinality? Hint: think about Proposition 2.10. Let A be the collection of all unions of the Ai . We claim that σ(A1 , A2 , . . .) must contain each set E in A. This follows from the fact that each set E ∈ A is a countable union of the Ai (because the entire collection {Ai } is countable). Therefore, if A is a sigma-field, it must be the smallest sigma-field containing A1 , A2 , . . . We will therefore show that A is a sigma-field. If E = ∪i∈I Ai , where I is some subset of N = {1, 2, . . .}, then E C = ∪i∈N \I Ai is a countable union of Ai sets, so is in A. That is, A is closed under complements. Moreover, if Ei are in A for i = 1, 2, . . ., then each Ei is a countable union, so ∪i Ei is also a countable union. Therefore, ∪i Ei ∈ A. That is, A is closed under countable unions. We have shown that the nonempty collection A is closed under complements


and countable unions, and is therefore a sigma-field. This completes the proof that A = σ(A1, A2, . . .).

Note that A has the same cardinality as [0, 1] by Proposition 2.10 because each member of A, namely a union of a specific collection of partition sets Ai , may be denoted by an infinite string of 0s and 1s signifying which Ai are included in the union.
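For a finite partition the correspondence between members of A and strings of 0s and 1s can be made explicit. The following Python sketch is our own illustration with a hypothetical partition of {0, . . . , 9}; it enumerates the generated sigma-field as all unions of atoms and confirms it has 2^k members, one per 0/1 string.

```python
from itertools import combinations, chain

# Hypothetical finite analogue of problem 13: the sigma-field generated by a partition
# consists of all unions of its atoms, one union per subset (0/1 string) of the atoms.
omega = frozenset(range(10))
atoms = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5}), frozenset(range(6, 10))]

def unions_of_atoms(atoms):
    sets = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            sets.add(frozenset(chain.from_iterable(combo)))
    return sets

sigma = unions_of_atoms(atoms)
assert len(sigma) == 2 ** len(atoms)              # one set per 0/1 string over the atoms
assert omega in sigma and frozenset() in sigma    # contains Omega and the empty set
assert all(omega - A in sigma for A in sigma)     # closed under complements
print(len(sigma), "sets in the generated sigma-field")
```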

14. ↑ Suppose that {Ai}_{i=1}^∞ and {Bi}_{i=1}^∞ are partitions of Ω and {Bi} is finer than {Ai}, meaning that each Ai is a union of members of {Bi}. Then σ({Ai}_{i=1}^∞) ⊂ σ({Bi}_{i=1}^∞).

By the preceding problem, each set in σ(A1 , A2 , . . .) is a countable union of the Ai , and each Ai is a countable union of the Bi . Therefore, each set in σ(A1 , A2 , . . .) is a countable union of the Bi , and is therefore in ∞ σ(B1 , B2 , . . .). This shows that σ({Ai }∞ i=1 ) ⊂ σ({Bi }i=1 ). 15. Complete the following steps to show that a sigma-field F cannot be countably infinite. Suppose that F contains the nonempty sets A1 , A2 , . . . (a) Show that each set of the form ∩∞ i=1 Bi, where each Bi is C either Ai or Ai , is in F . Each AC i is in F because F is closed under complements. Because each Bi is Ai or AC i , each Bi is in F. The countable intersection of such sets is in F because F is a sigma-field.

(b) What is wrong with the following argument: the set C = C {C = ∩∞ i=1 Bi, each Bi is either Ai or Ai } is in 1:1 correspondence with the set of all infinitely long strings of 0s and 1s because we can imagine Ai as a 1 and AC i as a 0. Therefore, by Proposition 2.10, F must have the same cardinality as [0, 1]. Hint: to see that this argument is wrong, suppose that the Ai are disjoint. There is no guarantee that sets of the form ∩Bi are different. For example, if the Ai are disjoint, then the only nonempty intersections of Bi are when at most one of the Bi is Ai . The number of intersections of the form ∩i Bi when at most one of the Bi is AC i is countable. (c) Show that any two non-identical sets in C are disjoint. What is the cardinality of the set of all countable unions of C sets? What can you conclude about the cardinality of F ?


Suppose that C1 = ∩_{i=1}^∞ B1i and C2 = ∩_{i=1}^∞ B2i, where each B1i or B2i is either Ai or Ai^C. If one of B1i and B2i is Ai and the other is Ai^C, then B1i and B2i are disjoint. To not be disjoint, B1i and B2i must both be Ai or both be Ai^C, and this must hold for each i. Therefore, if B1i and B2i are not disjoint, they are identical.

We claim that an infinite number of sets in C must be nonempty. If not, then the set of all unions of sets in C would be a finite collection. But each Bi can be written as a union of sets in C, and there are infinitely many distinct Bi because there are at least as many Bi as Ai . This shows that an infinite number of sets in C are nonempty.

Let C1 , C2 , . . . be a countably infinite collection of nonempty sets in C, and let D be the collection of sets Dt , where Dt is a union of a subcollection of C1 , C2 , . . .. Each Dt ∈ F because each Dt is a countable union of C sets, and each C ∈ F. Moreover, different Dt are distinct. Record a 1 if Ci is included in the union and 0 if it is not. Then D can be associated with the set of infinite strings of 0s and 1s. The cardinality of D is the same as [0, 1] by Proposition 2.10.

We have shown that if F contains infinitely many sets, then it must contain uncountably many sets as well. We conclude that a sigma-field must be either finite or uncountably infinite.

16. Prove Proposition 3.14.

Define the sigma-fields F1, F2, and F3 as follows.

1. F1 is the extended Borel sigma-field (smallest sigma-field of subsets of R̄ that contains all extended intervals).
2. F2 is the smallest sigma-field that contains the two sets {−∞} and {+∞} and all Borel subsets of R.
3. F3 is the set of all Borel subsets B of R, together with all augmentations B̄ of B by −∞, +∞, or both.

We will prove that F1 ⊃ F2 ⊃ F3 ⊃ F1.

Note that F1 must contain (−∞, +∞], and therefore must contain R̄ \ (−∞, +∞] = {−∞}. Likewise, it must contain R̄ \ [−∞, +∞) = {+∞}. To see that it must also contain all Borel sets of R, note that it contains R̄ \ {−∞, +∞} = R and all intervals of R, and it is a sigma-field. Therefore, it must contain the smallest sigma-field containing all intervals of R, namely


14 the Borel sets of R. We have shown that F1 contains {{−∞}, {+∞}, B ∈ B}. Because F1 is a sigma-field, it must contain the smallest sigma-field containing {{−∞}, {+∞}, B ∈ B}, namely F2 . This shows that F1 ⊃ F2 . The next step is to show that F2 ⊃ F3 . Each set in F3 is either a Borel set of R or a Borel subset of R augmented by −∞ and/or +∞. We have seen that F2 contains each Borel set B of R and {−∞} and {+∞}, so it contains each set of the form B ∪ {−∞}, B ∪ {+∞}, and B ∪ {−∞, +∞}. Therefore, F2 ⊃ F3 . The final step is to show that F3 ⊃ F1 . Each extended interval (including ¯ is in F3 because any such interval is either a Borel set of R or a Borel set of R) R augmented by {−∞}, {+∞}, or both. Also, it is not difficult to show that ¯ so it must contain the smallest sigma-field F3 is a sigma-field of subsets of R, ¯ sets containing all extended intervals, namely F1 . That is, F3 ⊃ F1 . of R We have demonstrated that F1 ⊃ F2 ⊃ F3 ⊃ F1 . This implies that Fi ⊂ Fj and Fj ⊂ Fi for each i, j = 1, 2, 3, so Fi = Fj for each i, j = 1, 2, 3.


Section 3.4

1. * Prove that if P(B) = 1, then P(A ∩ B) = P(A).

P (A ∩ B) = P (A) − P (A ∩ B C ), and P (A ∩ B C ) ≤ P (B C ) = 1 − P (B) = 1 − 1 = 0. Therefore, P (A ∩ B) = P (A) − 0 = P (A).

∞ 2. Prove that if {An}∞ n=1 and {Bn}n=1 are sequences of events such that P (Bn) → 1 as n → ∞, then limn→∞ P (An ∩ Bn) exists if and only if limn→∞ P (An) exists, in which case these two limits are equal.

P(An ∩ Bn) = P(An) − P(An ∩ Bn^C), and P(An ∩ Bn^C) ≤ P(Bn^C) = 1 − P(Bn) → 0. It follows that a1 = lim inf_{n→∞} P(An ∩ Bn) = lim inf_{n→∞} P(An) and a2 = lim sup_{n→∞} P(An ∩ Bn) = lim sup_{n→∞} P(An). Therefore, lim_{n→∞} P(An ∩ Bn) exists if and only if a1 = a2 = a, in which case the limit is a. The same is true for lim_{n→∞} P(An). Thus, lim_{n→∞} P(An ∩ Bn) exists if and only if lim_{n→∞} P(An) exists, in which case both limits are equal.

For property 4, note that 1 = P (Ω) = P (E ∪ E C ) = P (E) + P (E C ) by countable additivity. Therefore, P (E C ) = 1 − P (E).

4. Let (Ω, F , P ) be a probability space and A ∈ F be a set with P (A) > 0. Let FA denote F ∩ A = {F ∩ A, F ∈ F } and define PA(F ) = P (F ∩ A)/P (A) for each F ∈ FA; PA is called the conditional probability measure given event A. Prove that (A, FA, PA) is indeed a probability space, i.e., that FA is a sigma-field of subsets of A and PA is a probability measure on FA. It is clear that FA is nonempty because it contains A: A = Ω ∩ A, and Ω ∈ F. If B ∈ FA , then B = F ∩ A for some F ∈ F. It follows that B C = A \ (F ∩ A) = A ∩ (F ∩ A)C


= A ∩ (F^C ∪ A^C) = (A ∩ F^C) ∪ (A ∩ A^C) = F^C ∩ A.

Also, F^C ∈ F because F is closed under complements. Thus, B^C ∈ FA. This shows that FA is closed under complements. Also, if B1, B2, . . . are in FA, then Bi = Fi ∩ A, where Fi ∈ F. Then ∩Bi = ∩_i (Fi ∩ A) = (∩Fi) ∩ A, and ∩_i Fi ∈ F because F is closed under countable intersections. Therefore, ∩_i Bi ∈ FA. We have shown that FA is closed under complements and countable intersections, and therefore under countable unions by DeMorgan's laws. Therefore, FA is a sigma-field of subsets of A.

We show next that PA is a probability measure on FA. Clearly PA is nonnegative. Also, if B1, B2, . . . are disjoint sets in FA, then Bi = Fi ∩ A, where the Fi are disjoint sets in F. It follows that

PA(∪Bi) = P{∪_i (Fi ∩ A)}/P(A) = Σ_i P(Fi ∩ A)/P(A) = Σ_i PA(Bi).

Also, PA (A) = P (A ∩ A)/P (A) = P (A)/P (A) = 1. Therefore, PA is a probability measure. 5. Flip a fair coin countably infinitely many times. The outcome is an infinite string such as 0, 1, 1, 0, . . ., where 0 denotes tails and 1 denotes heads on a given flip. Let Γ denote the set of all possible infinite strings. It can be shown that each γ ∈ Γ has probability (1/2)(1/2) . . . (1/2) . . . = 0 because the outcomes for different flips are independent (the reader is assumed to have some familiarity with independence from elementary probability) and each has probability 1/2. What is wrong with the following argument? Because Γ is the collection of all possible outcomes, 1 = P (Γ) = P (∪Γ γ) =



Σ_{γ∈Γ} P(γ) = Σ_{γ∈Γ} 0 = 0.

The problem is that ∪_{γ∈Γ} {γ} is an uncountable union by Proposition 2.10. Therefore, it is not necessarily the case that P(∪_{γ∈Γ} {γ}) = Σ_{γ∈Γ} P(γ).

6. ↑ In the setup of problem 5, let A be a collection of infinite strings of 0s and 1s. Is

P(A) = P(∪_{γ∈A} {γ}) = Σ_{γ∈A} P(γ) = Σ_{γ∈A} 0 = 0

if A is

(a) The set of infinite strings with exactly one 1? Yes, because the set of infinite strings with exactly one 1 has the same cardinality as the set of positive integers because there is a 1-1 correspondence between this set and the set of indices of the lone 1.

(b) The set of infinite strings with finitely many 1s? Yes, because the set of infinite strings with finitely many 1s is in 1-1 correspondence with a subset of rational numbers, and is therefore countable.

(c) The set of infinite strings with infinitely many 1s? No, because the set of infinite strings with infinitely many 1s is uncountable.

7. If (Ω, F, P) is a probability space, prove that the collection of sets A such that P(A) = 0 or 1 is a sigma-field.

Let A be the collection of sets such that P(A) = 0 or P(A) = 1. If A ∈ A, then P(A^C) = 1 − P(A), so if P(A) is 0 or 1, then so is P(A^C). That is, A is closed under complements. Now suppose that A1, A2, . . . have probability 0 or 1. If any of the Ai have probability 1, then ∪_i Ai has probability 1 because ∪_i Ai contains each Ai. On the other hand, if all Ai have probability 0, then by countable additivity, P(∪_i Ai) ≤ Σ_i P(Ai) = Σ_i 0 = 0. We have shown that P(∪_{i=1}^∞ Ai) is 0 or 1. That is, A is closed under countable unions. Having shown that A is closed under complements and countable unions, we conclude that A is a sigma-field.

8. Let A1, A2, . . . be a countable sequence of events such that P(Ai ∩ Aj) = 0 for each i ≠ j. Show that P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

Consider the finite union ∪_{i=1}^n Ai:

P{A1 ∪ (∪_{i=2}^n Ai)} = P(A1) + P(∪_{i=2}^n Ai) − P{A1 ∩ (∪_{i=2}^n Ai)}
= P(A1) + P(∪_{i=2}^n Ai) − P{∪_{i=2}^n (A1 ∩ Ai)}
= P(A1) + P(∪_{i=2}^n Ai)

because P{∪_{i=2}^n (A1 ∩ Ai)} ≤ Σ_{i=2}^n P(A1 ∩ Ai) = Σ_{i=2}^n 0 = 0. Continuing in this fashion, we obtain P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai). Now consider the countable union ∪_{i=1}^∞ Ai:

P(∪_{i=1}^∞ Ai) = lim_{n→∞} P(∪_{i=1}^n Ai) = lim_{n→∞} Σ_{i=1}^n P(Ai) = Σ_{i=1}^∞ P(Ai).
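A concrete instance of problem 8 (our own illustration, not part of the original solution) is a sequence of adjacent intervals under Lebesgue measure on [0, 1]: adjacent intervals overlap only at a single point, which has probability 0, and the measure of the union equals the sum of the measures. The sketch below checks this numerically for finitely many such intervals.

```python
# Hypothetical example: A_i = [1 - 2**-(i-1), 1 - 2**-i] under Lebesgue measure on [0, 1].
# Adjacent intervals share only an endpoint, so P(A_i ∩ A_j) = 0 for i != j,
# and P of the union equals the sum of the lengths.
n = 20
intervals = [(1 - 2.0 ** -(i - 1), 1 - 2.0 ** -i) for i in range(1, n + 1)]
lengths = [b - a for (a, b) in intervals]

union_length = intervals[-1][1] - intervals[0][0]   # the intervals tile [0, 1 - 2**-n]
assert abs(union_length - sum(lengths)) < 1e-12
print("P(union) =", union_length, "= sum of P(A_i) =", sum(lengths))
```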

9. Consider the following stochastic version of the balls in the box paradox in Example 1.5. Again there are infinitely many balls numbered 1, 2, . . . Step 1 at 1 minute to midnight is to put balls numbered 1, 2 into the box, and randomly remove one of the 2 balls. Step 2 at 1/2 minute to midnight is to add balls numbered 3, 4, and randomly remove one of the 3 balls from the box. Step 3 at 1/4 minute to midnight is to add balls numbered 5, 6 and randomly remove one of the 4 balls from the box. Step 4 at 1/8 minute to midnight is to add balls numbered 7, 8 and randomly remove one of the 5 balls from the box, etc.

(a) Show that the probability of the event An that ball number 1 is not removed at any of steps 1, 2, . . . , n is 1/(n + 1).

The probability in question is

P(An) = P(not removed at step 1) P(not removed at step 2 | not removed at step 1) . . . P(not removed at step n | not removed by step n − 1)
= (1/2)(2/3)(3/4) . . . ((n − 1)/n)(n/(n + 1))
= 1/(n + 1).
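A quick Monte Carlo check of the answer to part (a) (our own illustrative sketch, not from the text): simulate the removal scheme, in which step k removes one of the k + 1 balls currently in the box uniformly at random, and compare the empirical survival frequency of ball 1 with 1/(n + 1).

```python
import random

# Hypothetical simulation of part (a): estimate P(ball 1 survives steps 1, ..., n).
def survives(n):
    box = [1, 2]                              # step 1 starts with balls 1 and 2
    for k in range(1, n + 1):
        removed = random.choice(box)          # step k removes one of the k + 1 balls
        box.remove(removed)
        if removed == 1:
            return False
        box.extend([2 * k + 1, 2 * k + 2])    # add the next two balls for step k + 1
    return True

n, reps = 9, 200_000
estimate = sum(survives(n) for _ in range(reps)) / reps
print(estimate, "vs", 1 / (n + 1))            # both should be close to 0.1
```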

(b) What is the probability of the event A∞ that ball number 1 is never removed from the box? Justify your answer.

Note that A∞ = ∩_{n=1}^∞ An, so P(A∞) = lim_{n→∞} P(An) = lim_{n→∞} (n + 1)^{−1} = 0.

(c) Show that with probability 1 the box is empty at midnight.

Let Bi be the event that ball i is never removed. A similar argument to the above shows that P(∪_{i=1}^∞ Bi) ≤ Σ_{i=1}^∞ P(Bi) = Σ_i 0 = 0. This shows that the probability that at least one ball is never removed is 0. In other words, with probability 1, all balls are removed by midnight.

10. Use induction to prove the inclusion-exclusion formula: for an = P(∪_{i=1}^n Ei),

an = Σ_{i1=1}^n P(Ei1) − Σ_{1≤i1<i2≤n} P(Ei1 ∩ Ei2) + Σ_{1≤i1<i2<i3≤n} P(Ei1 ∩ Ei2 ∩ Ei3) − . . . + (−1)^{n+1} P(E1 ∩ E2 ∩ . . . ∩ En).

xsup. Also, there is a sequence xn in A such that xn ↓ xinf. By the right continuity of distribution functions, F(xinf) = lim_{n→∞} F(xn) = c. Therefore, xinf ∈ A. The right endpoint, xsup, may or may not belong to A because F could have a jump at x = xsup. Therefore, A could be either [xinf, xsup] or [xinf, xsup). This proves the result when xinf and xsup are finite. Moreover, they must be finite. For example, if xinf = −∞, then lim_{x→−∞} F(x) = c instead of 0, contradicting one of the conditions of a d.f. Similarly, if xsup = +∞, then lim_{x→∞} F(x) = c instead of 1, contradicting one of the conditions of a d.f.

6. In the proof of Proposition 4.23, fill in some missing details: (a) How did we know that F is uniformly continuous on [−A, A]? Note that [−A, A] is a compact set because it is closed and bounded. Therefore, by proposition A.62, F is uniformly continuous on [−A, A]. (b) Show that |F (y) − F (x)| <  for x and y both exceeding A. |F (x) − F (y)| = |1 − F (y) − {1 − F (x)}| ≤ |1 − F (x)| + |1 − F (y)| < /2 + /2 = . (c) Show that |F (y) − F (x)| <  for −A ≤ x ≤ A and y > A. |F (y)−F (x)| ≤ |F (y)−F (A)+F (A)−F (x)|, and |F (A)−F (x)| < δ and A − x < δ. Therefore, |F (y) − F (x)| < |1 − F (A)| + F (A) − F (x) < /2 + /2 = . 7. Suppose that θ ∈ Θ is a parameter and X is a random variable whose distribution function Fθ (x) is the same for all θ ∈ Θ.


Use the monotone class theorem (Theorem 3.32) to prove that, for each Borel set B, Pθ(X ∈ B) is the same for all θ ∈ Θ.

Let A be the collection of Borel sets such that Pθ (X ∈ A) is the same for all θ ∈ Θ. Then A contains all intervals of the form (∞, x]. A also contains intervals of the form (x, ∞) because Pθ (X > x) = 1 − Fθ (x), and Fθ (x) is the same for all θ ∈ Θ. Also, P (X ∈ (a, b]) = Fθ (b) − Fθ (a) is the same for all θ ∈ Θ. By countable additivity, A contains the field F0 of finite unions of disjoint sets of the form (−∞, a], (a, b], or (b, ∞).

We next show that A is a monotone class. If An ∈ A and An ↑ A, then by the continuity property of probability, Pθ (A) = limn→∞ Pθ (An ). Because Pθ (An ) is the same for each θ, Pθ (A) is the same for each θ. Therefore, A ∈ A. The same argument shows that if An ↓ A and An ∈ A, then A ∈ A. Therefore, A is a monotone class. By the monotone class theorem, A contains the smallest sigma-field containing F0 , namely the Borel sets. 8. Let r1 , r2 , . . . be an enumeration of the rational numbers. Let  i F (x) = ∞ i=1 (1/2 )I(x ≥ ri). Prove that F (x) is a distribution function that has a jump discontinuity at every rational number. Can you concoct a distribution function F that has a jump discontinuity at every irrational number? It is clear that F (x) an increasing function of x. Moreover, F (x) is right-continuous because F (x + ∆) =

∞  i=1

(1/2i )I(x + ∆ ≥ ri )

lim F (x + ∆) = lim ∆↓0

=

∞ 

∆↓0 i=1 ∞  i=1

(1/2i )I(x ≥ ri − ∆)

(1/2i )I(x ≥ ri ) + lim ∆↓0

∞  i=1

(1/2i )I(ri − ∆ ≤ x < ri ).

The limit on the right is 0 by the following argument. For  > 0, choose  i N large enough that ∞ i=N +1 (1/2) < . This will be possible because ∞ i i=1 (1/2) = 1 < ∞. Now choose ∆ small enough so that none of  i r1 , . . . , rN are in the interval (x, x + ∆]. Then ∞ i=1 (1/2 )I(ri − ∆ ≤ ∞ x < ri ) ≤ i=N +1 (1/2i ) < . This shows that lim∆↓0 F (x + ∆) = ∞ i i=1 (1/2 )I(x ≥ ri ) = F (x), demonstrating that F is right continuous. A similar argument shows that lim F (x) =

x→−∞

K24704_SM_Cover.indd 39

lim

x→−∞

∞  i=1

(1/2i )I(x ≥ ri ) = 0.

01/06/16 10:38 am

36 lim F (x) =

x→∞

lim

x→∞

∞  i=1

(1/2i )I(x ≥ ri ) = 1.

Therefore, F is a distribution function. It is impossible to concoct a d.f. that has a jump discontinuity at every irrational number because the set of jumps of a d.f. is countable.
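This d.f. can be explored numerically with a truncated version of the series. The Python sketch below is our own illustration, using the first N terms of one particular (hypothetical) enumeration of the rationals by increasing denominator; it checks that the truncated F is nondecreasing and exhibits a jump of size 1/2^i at r_i = 1/2.

```python
from fractions import Fraction

# Hypothetical truncation of F(x) = sum_i 2**-i * I(x >= r_i) using the first N terms
# of an enumeration r_1, r_2, ... of the rationals in [-2, 2] by increasing denominator.
def enumerate_rationals(N):
    out, seen, q = [], set(), 1
    while len(out) < N:
        for p in range(-2 * q, 2 * q + 1):
            x = Fraction(p, q)
            if x not in seen:
                seen.add(x)
                out.append(x)
                if len(out) == N:
                    break
        q += 1
    return out

N = 200
r = enumerate_rationals(N)

def F(x):
    return sum(2.0 ** -(i + 1) for i, ri in enumerate(r) if x >= ri)

xs = [k / 100 for k in range(-300, 301)]
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))     # nondecreasing
i = r.index(Fraction(1, 2))
print("jump at 1/2:", F(0.5) - F(0.4999), "expected:", 2.0 ** -(i + 1))
```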


Section 4.4

1. Let (Ω, F, P) be ([0, 1], B[0,1], µL), X1(ω) = I(ω ≥ 1/2), and X2(ω) = ω. What is the joint distribution function for (X1, X2)?

P(X1 ≤ x1, X2 ≤ x2)
  = 0                        if min(x1, x2) < 0,
  = P(ω < 1/2, ω ≤ x2)       if 0 ≤ x1 < 1,
  = P(ω ≤ x2)                if x1 ≥ 1;

that is,

P(X1 ≤ x1, X2 ≤ x2)
  = 0                  if min(x1, x2) < 0,
  = min(x2, 1/2)       if 0 ≤ x1 < 1, x2 ≥ 0,
  = min(x2, 1)         if x1 ≥ 1, x2 ≥ 0.

2. Let (Ω, F, P) be ([0, 1], B[0,1], µL), X1(ω) = |ω − 1/4|, and X2(ω) = ω − 1/4. What is the joint distribution function for (X1, X2)?

P(X1 ≤ x1, X2 ≤ x2)
  = 0                                       if x1 < 0 or x2 < −1/4,
  = P(|ω − 1/4| ≤ x1, ω − 1/4 ≤ x2)         if 0 ≤ x1 ≤ 3/4 and −1/4 ≤ x2 ≤ 3/4,
  = P(ω − 1/4 ≤ x2)                         if x1 > 3/4 and −1/4 ≤ x2 ≤ 3/4;

that is,

P(X1 ≤ x1, X2 ≤ x2)
  = 0                               if x1 < 0 or x2 < −1/4,
  = min(x1 + 1/4, x2 + 1/4)         if 1/4 ≤ x1 ≤ 3/4 and −1/4 ≤ x2 ≤ 3/4,
  = min(x2, 3/4)                    if x1 > 3/4.
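A simulation check of the answer to problem 1 (our own illustrative sketch, not part of the text): draw ω uniformly on [0, 1] and compare the empirical joint d.f. of (X1, X2) with the closed form above at a few points.

```python
import random

# Hypothetical Monte Carlo check of problem 1: omega ~ Uniform[0, 1],
# X1 = I(omega >= 1/2), X2 = omega.
def joint_cdf(x1, x2):
    if min(x1, x2) < 0:
        return 0.0
    if x1 < 1:
        return min(x2, 0.5)       # P(omega < 1/2, omega <= x2)
    return min(x2, 1.0)           # P(omega <= x2)

reps = 200_000
omegas = [random.random() for _ in range(reps)]
for (x1, x2) in [(0.0, 0.3), (0.5, 0.7), (1.0, 0.7), (2.0, 0.25)]:
    emp = sum((o >= 0.5) <= x1 and o <= x2 for o in omegas) / reps
    print((x1, x2), "empirical:", round(emp, 3), "formula:", joint_cdf(x1, x2))
```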

3. If F (x, y) is the distribution function for (X, Y ), find expressions for: (a) P (X ≤ x, Y > y). G(x) − F (x, y), where G(x) = limy→∞ F (x, y) is the marginal distribution function for X. (b) P (X ≤ x, Y ≥ y). G(x) − F (x, y − ), where G(x) is the marginal distribution function of X and F (x, y − ) = limy∗ ↑y F (x, y ∗ ). (c) P (X ≤ x, Y = y). F (x, y) − F (x, y − ) = F (x, y) − limy∗ ↑y F (x, y ∗ ).

(d) P (X > x, Y > y). P (Y > y) − P (X ≤ x, Y > y) = 1 − H(y) − {G(x) − F (x, y)} = 1 − H(y) − G(x) + F (x, y), where G(x) and H(y) are the marginal distribution functions of X and Y , respectively.


38 4. Use the monotone class theorem (Theorem 3.32) to prove Proposition 4.26. Let P  and Q be two probability measures on Rk such that P  {(−∞, x1 ], . . . , (−∞, xk ]} = Q {(−∞, x1 ], . . . , (−∞, xk ]}. We will show that P  and Q agree on all k-dimensional Borel sets. Let A be the collection of sets A such that P  {A × (−∞, x2 ] × . . . × (−∞, xk ]} = Q {A × (−∞, x2 ] × . . . × (−∞, xk ]}. Then A contains sets of the form (−∞, x1 ] by assumption. It also clearly contains sets of the form (a, b] and (b, ∞) by subtraction. Moreover, because P and Q are probability measures, they are countably additive. It follows that A contains the field of Example 3.8. We next show that A is a monotone class. Suppose that An ∈ A and An ↑ A Then An × (−∞, x2 ] × . . . × (−∞, xk ] increases to A × (−∞, x2 ] × . . . × (−∞, xk ], so by the continuity property of probability (Proposition 4.30), P  {An ×(−∞, x2 ]×. . .×(−∞, xk ]} → P  {A×(−∞, x2 ]×. . .×(−∞, xk ]}, and similarly for Q . Because P  {An × (−∞, x2 ] × . . . × (−∞, xk ]} = Q {An ×(−∞, x2 ]×. . .×(−∞, xk ]}, the same is true when An is replaced by A. Thus, A is closed under increasing unions. A similar argument shows that A is closed under decreasing intersections. Therefore, A is a nonempty collection that is closed under increasing unions and decreasing intersections. That is, A is a monotone class. By the monotone class theorem (Theorem 3.32), A contains the smallest sigma-field containing the field of Example 3.8. That is, A contains the one-dimensional Borel sets. For a given Borel set B1 , let A be the collection of sets A such that P  {B1 × A × (−∞, x3 ] × . . . × (−∞, xk ]} = Q {B1 × A × (−∞, x3 ] × . . . × (−∞, xk ]}. By the same argument as above, A contains the one dimensional Borel sets. Continue in the above fashion to show that P  (B1 × B2 × . . . × Bk ) = Q (B1 × B2 × . . . × Bk ) for all product sets B1 × B2 × . . . × Bk of onedimensional Borel sets. The same is true for finite unions of disjoint product sets of this form by finite additivity of probability measures. Therefore, P  and Q agree on the field of Proposition 3.20 generating the k-dimensional Borel sets. By the Caratheodory extension theorem (Theorem 3,34), P  and Q must agree for all k-dimensional Borel sets. 5. Let F (x, y) be a bivariate distribution function. Prove that


lim_{A↑∞} F(x, A) = P(X ≤ x) and lim_{A↑∞} F(A, y) = P(Y ≤ y).

The sets {ω : X(ω) ≤ x, Y (ω) ≤ A} increase to the set {ω : X(ω) ≤ x, Y (ω) < ∞}, so by the continuity property of probability, P (X ≤ x, Y ≤ A) → P (X ≤ x, Y < ∞) = P (X ≤ x), and similarly for P (X ≤ A, Y ≤ y). 6. ↑ In the previous problem, can we take a sequence that tends to ∞, but is not monotone (such as An = n1/2 if n is even and An = ln(n) if n is odd)? In other words, is limn→∞ F (x, An) = P (X ≤ x) for An not monotone, but still tending to ∞? If so, prove it. If not, find a counterexample. It is still true. If it were not true, then there would exist a number  > 0 and a subsequence n1 , n2 , n3 , . . . such that |F (x, Ank ) − G(x)| ≥  for all k = 1, 2, 3, . . . .

(3)

Let N = {n1 , n2 , . . .}. Let m1 be n1 , m2 be the smallest number among N \ {m1 } such that Am2 > Am1 , m3 be the smallest index among N \ {m1 , m2 } such that Am3 > Am2 , etc. Then Am1 , Am2 , . . . are increasing to ∞, so limi→∞ P (X ≤ x, Y ≤ Ami ) → P (X ≤ x) = G(x) by the continuity property of probability. This contradicts (3), so this contrapositive argument proves that F (x, An ) → G(x) for An → ∞ even if the An are not an increasing sequence. 7. To obtain the marginal distribution function for X1 from the joint distribution function for (X1 , . . . , Xk), can we use different Ais increasing to ∞? In other words, is limA2 ↑ ∞,...,Ak ↑ ∞ F (x1 , A2 , . . . , Ak) = P (X1 ≤ x1 )? Justify your answer. Yes. If not, then there would exist a number  > 0 and a sequence (n) (n) (n) A2 , . . . , Ak such that Ai → ∞ as n → ∞, i = 2, . . . , k and (n)

|F(x1, A2^{(n)}, . . . , Ak^{(n)}) − P(X1 ≤ x1)| ≥ ε for all n.      (4)

Just as in the preceding problem, we can find a further subsequence (n ) n1 , n2 , . . . such that Ai j is increasing in j for each fixed i = 2, . . . , k. Now use the continuity property of probability to conclude that F (x1 , (n ) (n ) A2 j , . . . , Ak j ) → P (X1 ≤ x1 ) as j → ∞. But this contradicts (4). This contrapositive argument proves that there is no  such that (4) holds. Therefore, limA2 ↑∞,...,Ak ↑∞ F (x1 , A2 , . . . , Ak ) = P (X1 ≤ x1 ).


Section 4.5.2

1. Suppose that X and Y are independent random variables. Can X and Y^2 be dependent? Explain.

No, because if X and Y are independent, then so are f(X) = X and g(Y) = Y^2 by Proposition 4.39.

2. Let X have a standard normal distribution and

Y = −1   if |X| ≤ 1
Y = +1   if |X| > 1.

Show that X and Y^2 are independent, but X and Y are not independent. What is wrong with the following argument: since X and Y^2 are independent, X and Y = √(Y^2) are independent by Proposition 4.39.

Y^2 = 1 with probability 1, so Y^2 is independent of any random variable because it is a constant. However, X and Y are not independent because, for example, P(Y = +1, |X| > 1) = P(|X| > 1) ≠ P(Y = +1)P(|X| > 1) = P(|X| > 1)^2. The reason this does not contradict Proposition 4.39 is that √(Y^2) is not Y; it is |Y|. Therefore, the conclusion from Proposition 4.39 is that X and |Y| are independent. This is correct because |Y|, like Y^2, is 1 with probability 1, and is therefore independent of any random variable.

3. Prove that if A1, . . . , An are independent events and each Bi is either Ai or Ai^C, then B1, . . . , Bn are also independent.

By definition of independence of events, we must prove that the probability of the intersection of any subset of the Bi is the product of individual probabilities. Without loss of generality, we can take the subset to be B1, . . . , Bm for some m ≤ n. We use induction on the number mC of the Bi that are Ai^C. The result holds by definition of independence of events if mC is 0. Now suppose the result holds for mC = 0, 1, . . . , r. Let mC = r + 1, and let Bj be one of the sets that is Aj^C. Then

P(B1 ∩ . . . ∩ Bm) = P(∩_{i≠j, i≤m} Bi) − P{Aj ∩ (∩_{i≠j, i≤m} Bi)}
= ∏_{i≠j, i≤m} P(Bi) − P(Aj) ∏_{i≠j, i≤m} P(Bi)
= {1 − P(Aj)} ∏_{i≠j, i≤m} P(Bi)
= P(Bj) ∏_{i≠j, i≤m} P(Bi)
= ∏_{i=1}^m P(Bi).


The second line follows by the induction assumption because the number of Bi that are Ai^C among either the collection {Bi, i ≠ j} or the collection {Aj, Bi, i ≠ j} is r. We have shown that the result holds when mC = r + 1. By induction, the result holds for all mC = 0, 1, 2, . . . , m. This completes the proof.

4. Prove that if X1 and X2 are random variables each taking values 0 or 1, then X1 and X2 are independent if and only if P(X1 = 1, X2 = 1) = P(X1 = 1)P(X2 = 1). That is, two binary random variables are independent if and only if they are uncorrelated.

If X1 and X2 are independent, then P(X = 1, Y = 1) = P(X = 1)P(Y = 1) = P(X ∈ {1}, Y ∈ {1}) by definition because {1} is a Borel set. Now suppose that P(X = 1, Y = 1) = P(X = 1)P(Y = 1). Then

P(X = 1, Y = 0) = P(X = 1) − P(X = 1, Y = 1)
= P(X = 1) − P(X = 1)P(Y = 1)
= P(X = 1){1 − P(Y = 1)}
= P(X = 1)P(Y = 0).

Similarly, P(X = 0, Y = 1) = P(X = 0)P(Y = 1). Finally,

P(X = 0, Y = 0) = P(X = 0) − P(X = 0, Y = 1)
= P(X = 0) − P(X = 0)P(Y = 1)
= P(X = 0){1 − P(Y = 1)}
= P(X = 0)P(Y = 0).

5. If A1, A2, . . . is a countably infinite sequence of independent events, then P(∪_{i=1}^∞ Ai) = 1 − ∏_{i=1}^∞ {1 − P(Ai)}.

P(∪_{i=1}^∞ Ai) = 1 − P{(∪_{i=1}^∞ Ai)^C} = 1 − P{∩_{i=1}^∞ Ai^C}
= 1 − ∏_{i=1}^∞ P(Ai^C)   (Proposition 4.35 and Problem 3)
= 1 − ∏_{i=1}^∞ {1 − P(Ai)}.


42 the second component of ω is +1. find a set of probabilities p1 , p2 , p3 , p4 for ω1 , ω2 , ω3 , ω4 such that X and Y are independent. Find another set of probabilities such that X and Y are not independent. X1 and X2 are binary random variables, so by problem 4, X1 and X2 are independent if and only if P (X = 1, Y = 1) = P (X = 1)P (Y = 1). That is, they are independent if and only if p4 = (p3 + p4 )(p2 + p4 ) p4 = p2 p3 + p2 p4 + p3 p4 + p24 p24 + (p2 + p3 − 1)p4 + p2 p3 = 0 p4 =

−(p2 + p3 − 1) ±



(p2 + p3 − 1)2 − 4p2 p3

. 2 One set of ps that satisfies this equation is p1 = p2 = p3 = p4 = 1/4. A set of ps that does not satisfy the equation is p1 = 1/6, p2 = 1/2, p3 = 1/6, p6 = 1/6. 7. Let X be a Bernoulli random variable with parameter p, 0 < p < 1. What are necessary and sufficient conditions for a Borel function f (X) to be independent of X? Let f (0) = a0 and f (1) = a1 , with a0 = a1 . Then P (X = 1, f (X) = a1 ) = P (X = 1) = p. If X and f (X) were independent, this probability would be P (X = 1)P {f (X) = a1 } = p2 . Therefore, independence requires p = p2 , which implies that p = 0 or 1. But the problem states that 0 < p < 1, so X and f (X) cannot be independent if a0 = a1 . If a0 = a1 (i.e., f (0) = f (1)), then f (X) is a constant, and therefore independent of X. Therefore, if 0 < p < 1, then X and f (X) are independent if and only if f (0) = f (1). 8. Suppose that X is a random variable taking only 10 possible values, all distinct. The F sets on which X takes those values are F1 , . . . , F10 . You must determine whether X is independent of another random variable, Y . Does the determination of whether they are independent depend on the set of possible values {x1 , . . . , x10 } of X? Explain. No, because X and Y are independent if and only if σ(X) and σ(Y ) are independent, and σ(X) does not change depending on the values {x1 , . . . , x10 }.

Another way to look at it is that changing the values x1 , . . . , x10 to 10 other distinct values corresponds to using a 1-1 function X ∗ = f (X),


43 where f : {x1 , x2 , . . . , x10 } −→ {a1 , . . . , a10 }. If X and Y are independent, so are X ∗ = f (X) and Y by Proposition 4.39. Likewise, X = g(X ∗ ) for a 1 − 1 function g : {a1 , . . . , a10 } −→ {x1 , . . . , x10 }, so if X ∗ and Y are independent, then X = g(X ∗ ) and Y are independent. 9. Flip a fair coin 3 times, and let Xi be the indicator that flip i is heads, i = 1, 2, 3, and X4 be the indicator that the number of heads is even. Prove that each pair of random variables is independent, as is each trio, but X1 , X2 , X3 , X4 are not independent. It is immediate from the fact that the coin is fair that X1 and X2 are independent, X1 and X3 are independent, and X2 and X3 are independent. Also, P (X1 = 1, X4 = 1) = P (1st is heads, the number of heads is even) = P (1st is heads, exactly one of the next two is heads) = (1/2){P (2nd is heads, third is tails)+P (2nd is tails, 3rd is heads)} = (1/2)(1/4 + 1/4) = 1/4, while P (X1 = 1)P (X4 = 1) = (1/2)P (0 or 2 heads) = (1/2){1/8 + 3(1/8)} = 1/4. Therefore, X1 and X4 are independent. The same argument shows that X2 and X4 are independent and X3 and X4 are independent. It is also clear that X1 , X2 , and X3 are independent, so consider X1 ,X2 , and X4 . Because these are binary variables, it suffices to prove that P (X1 = 1, X2 = 1, X4 = 1) = P (X1 = 1)P (X2 = 1)P (X4 = 1). Note that P (X1 = 1, X2 = 1, X4 = 1) = P (1st is heads, 2nd is heads, 3rd is tails) = (1/2)(1/2)(1/2) = 1/8. Also, P (X1 = 0)P (X2 = 0)P (X4 = 0) = (1/2)(1/2)P (0 or 2 heads) = (1/2)(1/2)(1/2) = 1/8. Therefore, (X1 , X2 , X4 ) are independent. The same argument shows that (X1 , X3 , X4 ) are independent and (X2 , X3 , X4 ) are independent. Therefore, all trios are independent. To see that (X1 , X2 , X3 , X4 ) are not independent, note that P (X1 = 1, X2 = 1, X3 = 1, X4 = 1) = 0 = (1/2)(1/2)(1/2){1/8+3(1/8)}. 10. Prove that a random variable X is independent of itself if and only if P (X = c) = 1 for some constant c. If X is independent of itself, then P (X ≤ x, X ≤ x) = {P (X ≤ x)}2 . But P (X ≤ x, X ≤ x) = P (X ≤ x) = F (x), so independence requires that F (x) = {F (x)}2 for each x. That is, F (x) has to be either 0 or 1 for each x. If c = inf{x : F (x) = 1}, then F (x) = 0 for x < c and 1 for x ≥ c. That is, P (X = c) = 1. For the other direction, suppose that P (X = c) = 1. Then P (X ≤ x1 , X ≤ x2 ) is 1 if x1 ≥ c and x2 ≥ c, and 0 if x1 < c or x2 < c. Likewise,


44 P (X ≤ x1 )P (X ≤ x2 ) is 1 if x1 ≥ c and x2 ≥ c, and 0 if x1 < c or x2 < c. Therefore, P (X ≤ x1 , X ≤ x2 ) = P (X ≤ x1 )P (X ≤ x2 ). By Proposition 4.37, X is independent of itself. 11. Prove that a sigma-field is independent of itself if and only if each of its sets has probability 0 or 1. If the sigma-field F is independent of itself, then by definition, E1 ∈ F and E2 ∈ F ⇒ P (E1 ∩ E2 ) = P (E1 )P (E2 ). Take E2 = E1 to conclude that P (E1 ∩ E1 ) = {P (E1 )}2 . But P (E1 ∩ E1 ) = P (E1 ), so P (E1 ) = {P (E1 )}2 . That is, P (E1 ) = 0 or 1.

To prove the other direction, suppose that P (E) = 0 or 1 for each E ∈ F. Then E1 , E2 ∈ F ⇒ P (E1 ∩ E2 ) = 0 if either P (E1 ) = 0 or P (E2 ) = 0, and 1 if P (E1 ) and P (E2 ) are both 1. Likewise, P (E1 )P (E2 ) = 0 if either P (E1 ) = 0 or P (E2 ) = 0, and 1 if P (E1 ) and P (E2 ) are both 1. Therefore, P (E1 ∩ E2 ) = P (E1 )P (E2 ). That is, E1 and E2 are independent. This holds for each choice of E1 ∈ F, E2 ∈ F (including E1 = E2 ), so F is independent of itself. 12. Prove that if X is a random variable and f is a Borel function such that X and f (X) are independent, then there is some constant c such that P (g(X) = c) = 1. For t ∈ R, let At = g −1 (−∞, t], which is a Borel set because g is a Borel function. Then P (g(X) ≤ t) = P (X ∈ At , g(X) ≤ t) = P (X ∈ At )P {g(X) ≤ t} = {P (g(X) ≤ t)}2 . Therefore, P (g(X) ≤ t){1 − P (g(X) ≤ t)} = 0, so P (g(X) ≤ t) = 0 or 1 for each t. If c = inf{t : P (g(X) ≤ t) = 1}, then the distribution function for g(X) is 0 for g(X) < c and 1 for g(X) ≥ c. That is, g(X) = c with probability 1. 13. Let (Ω, F , P ) = ([0, 1], B[0,1] , µL), and for t ∈ [0, 1], let Xt = I(ω = t). Are {Xt, t ∈ [0, 1]} independent?

Yes. Any finite set Xt1 , . . . , Xtk are each 0 with probability 1. This means that P (Xt1 ≤ x1 , . . . , Xtk ≤ xk ) is 1 if all xi ≥ 0 and 0 if any xi < 0, as is P (Xt1 ≤ x1 )P (Xt2 ≤ x2 ) . . . P (Xtk ≤ xk ). Therefore, P (Xt1 ≤ x1 , . . . , Xtk ≤ xk ) = P (Xt1 ≤ x1 )P (Xt2 ≤ x2 ) . . . P (Xtk ≤ xk ). By Proposition 4.37, (Xt1 , Xt2 , . . . , Xtk ) are independent.


45 14. Prove that the collection A1 in step 1 of the proof of Proposition 4.37 contains the field F0 of Example 3.8.

A1 contains all sets of the form A1 = (−∞, x1] by definition. Also, it holds for A1 = (−∞, ∞) because

P{X1 ∈ (−∞, ∞) ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn}
  = lim_{x1→∞} P(X1 ≤ x1 ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = lim_{x1→∞} P(X1 ≤ x1)P(X2 ≤ x2) ... P(Xn ≤ xn)
  = P{X1 ∈ (−∞, ∞)}P(X2 ≤ x2) ... P(Xn ≤ xn).

Now suppose that A1 = (x1, ∞). Then

P(X1 ∈ A1 ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = P(X1 < ∞ ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn) − P(X1 ≤ x1 ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = P(X1 < ∞)P(X2 ≤ x2) ... P(Xn ≤ xn) − P(X1 ≤ x1)P(X2 ≤ x2) ... P(Xn ≤ xn)
  = {1 − P(X1 ≤ x1)}P(X2 ≤ x2) ... P(Xn ≤ xn)
  = P{X1 ∈ (x1, ∞)}P(X2 ≤ x2) ... P(Xn ≤ xn),

so the result holds for A1 of the form (x1, ∞). It also holds for A1 of the form (a, b] because

P(X1 ∈ A1 ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = P(X1 ≤ b ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn) − P(X1 ≤ a ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = P(X1 ≤ b)P(X2 ≤ x2) ... P(Xn ≤ xn) − P(X1 ≤ a)P(X2 ≤ x2) ... P(Xn ≤ xn)
  = {P(X1 ≤ b) − P(X1 ≤ a)}P(X2 ≤ x2) ... P(Xn ≤ xn)
  = P{X1 ∈ (a, b]}P(X2 ≤ x2) ... P(Xn ≤ xn).

Now suppose that A = ∪_{i=1}^n Ai, where the Ai are disjoint sets of the form (−∞, x], (x, ∞), or (a, b]. Then

P(X1 ∈ A ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = Σ_{i=1}^n P(X1 ∈ Ai ∩ X2 ≤ x2 ∩ ... ∩ Xn ≤ xn)
  = Σ_{i=1}^n P(X1 ∈ Ai)P(X2 ≤ x2) ... P(Xn ≤ xn)
  = P(X1 ∈ A)P(X2 ≤ x2) ... P(Xn ≤ xn).

Therefore, the collection A1 contains the field F0 of Example 3.8.


15. It can be shown that if (Y1, ..., Yn) have a multivariate normal distribution with E(Yi²) < ∞, i = 1, ..., n, then any two subcollections of the random variables are independent if and only if each correlation of a member of the first subcollection and a member of the second subcollection is 0. Use this fact to prove that if Y1, ..., Yn are iid normals, then (Y1 − Ȳ, ..., Yn − Ȳ) is independent of Ȳ. What can you conclude from this about the sample mean and sample variance of iid normals?

By the stated result, it suffices to prove that the covariance between Ȳ and Yi − Ȳ is 0 for each i. By elementary properties of covariances,

cov(Ȳ, Yi − Ȳ) = cov(Ȳ, Yi) − cov(Ȳ, Ȳ)
  = cov((1/n) Σ_{j=1}^n Yj, Yi) − var(Ȳ)
  = (1/n){cov(Yi, Yi) + Σ_{j≠i} cov(Yj, Yi)} − σ²/n
  = (1/n)(σ² + 0) − σ²/n = 0.

By the stated result, this shows that Y¯ is independent of the vector (Y1 −  Y¯ , . . . , Yn − Y¯ ). Because the sample variance s2 = (n − 1)−1 ni=1 (Yi − Y¯ )2 is a Borel function of the latter vector, Y¯ is independent of s2 by Proposition 4.43. 16. Show that if Yi are iid from any non-degenerate distribution F (i.e., Yi is not a constant), then the residuals R1 = Y1 − Y¯ , . . . , Rn = Yn − Y¯ cannot be independent.

If R1 , R2 , . . . , Rn were independent, then Rn would be independent of n−1  − n−1 i=1 Ri . But Rn = − j=1 Rj with probability 1 because residuals sum to 0. We have shown that if the Ri were independent, then Rn would have to be independent of itself, which, by problem 10, implies that Rn is constant with probability 1. But Rn = Yn − Y¯ is clearly not a constant if the Yi are non-degenerate. Therefore, the residuals R1 , . . . , Rn cannot be independent.
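
The negative dependence that drives this result is easy to see numerically. Below is a minimal Python sketch (an illustration only; the sample size n, the number of replications, and the seed are arbitrary choices, not part of the original solution). It estimates the correlation between two residuals from iid normal data, which theory puts at −1/(n − 1) rather than 0.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 5, 200_000
    Y = rng.normal(size=(reps, n))            # each row: an iid N(0,1) sample of size n
    R = Y - Y.mean(axis=1, keepdims=True)     # residuals R_i = Y_i - Ybar
    # Correlation between R_1 and R_n across replications; theory gives -1/(n-1) = -0.25.
    print(np.corrcoef(R[:, 0], R[:, -1])[0, 1])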

17. Prove that the random variables X1 , X2 , . . . are independent by Definition 4.36 if and only if the sigma fields F1 = σ(X1 ), F2 = σ(X2 ), . . . are independent by Definition 4.48. Suppose that X1 , X2 , . . . , Xn are independent by Definition 4.36, and let Fi = {Xi−1 (B), B ∈ B} be the sigma-field generated by Xi . If Fi ∈ Fi ,


i = 1, 2, ..., n, then Fi = Xi^{−1}(Bi) for some Borel set Bi. Therefore,

P(∩_{i=1}^n Fi) = P{∩_{i=1}^n (Xi ∈ Bi)} = Π_{i=1}^n P(Xi ∈ Bi) = Π_{i=1}^n P(Fi).

Likewise, if the sigma-fields Fi = {Xi−1 (B), B ∈ B} are independent by Definition 4.48, then they are independent by Definition 4.36 because P (Xi ∈ Bi ) is P {Xi−1 (Bi )}, and Xi−1 (Bi ) ∈ Fi .

This shows that the two definitions are equivalent for a finite number of random variables X1 , . . . , Xn . But independence of an infinite collection is defined in terms of independence of each finite subcollection. Therefore, the two definitions are equivalent for an infinite collection X1 , X2 , . . . as well.


Section 4.6

1. Use the inverse probability transformation to construct a random variable that has a uniform distribution on [0, a]: F(x) = x/a for 0 ≤ x ≤ a.

Set y = x/a and solve for x: x = ay. Therefore, F^{−1}(y) = ay. This means that if ω has a uniform [0, 1] distribution, aω has a uniform distribution on [0, a].

2. Use the inverse probability transformation to construct a random variable that has an exponential distribution with parameter λ : F (x) = 1 − exp(−λx).

Set y = 1 − exp(−λx) and solve for x: x = (−1/λ) ln(1 − y). This means that if ω has a uniform [0, 1] distribution, −(1/λ) ln(1 − ω) has an exponential distribution with parameter λ.

3. Use the inverse probability transformation to construct a random variable that has a Weibull distribution: F (x) = 1 − exp{−(x/η)β }, x ≥ 0. Set y = 1 − exp{−(x/η)β } and solve for x: x = η{− ln(1 − y)}(1/β) . This means that if ω has a uniform [0, 1] distribution, η{− ln(1 − ω)}(1/β) has a Weibull distribution with parameters η and β.
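
These inverse-transform constructions can be spot-checked by simulation. The Python sketch below is an illustration only; the parameter values λ = 2, η = 1.5, β = 0.8 and the seed are arbitrary choices. It feeds uniform [0, 1] draws through the formulas above and compares empirical means with the known means 1/λ and ηΓ(1 + 1/β).

    import numpy as np
    from math import gamma

    rng = np.random.default_rng(1)
    u = rng.uniform(size=1_000_000)

    lam, eta, beta = 2.0, 1.5, 0.8
    expo = -(1.0 / lam) * np.log(1.0 - u)              # exponential(lambda) via inverse CDF
    weib = eta * (-np.log(1.0 - u)) ** (1.0 / beta)    # Weibull(eta, beta) via inverse CDF

    print(expo.mean(), 1.0 / lam)                      # both close to 0.5
    print(weib.mean(), eta * gamma(1.0 + 1.0 / beta))  # both close to eta*Gamma(1 + 1/beta)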

4. Use the inverse probability transformation to construct a random variable X on ((0, 1), B(0,1), µL) with the following distribution function:

F(x) =
    0    if x < 0
    x    if 0 ≤ x < …
    .75  if … ≤ x < 1
    1    if x ≥ 1.

Now suppose that µ{ω : f(ω) > 0} > 0. This implies that µ{ω : f(ω) > 1/2^m} > 0 for some m, because if Am is the set {ω : f(ω) > 1/2^m} and A is the set {ω : f(ω) > 0}, then A = ∪_{m=1}^∞ Am, so if each Am had µ-measure 0, then by countable subadditivity, µ(A) ≤ Σ_{m=1}^∞ µ(Am) = Σ_{m=1}^∞ 0 = 0. Therefore, we have established that µ(Am) > 0 for some m. This means that µ(Am) ≥ ε for some m and ε > 0. Then Sm ≥ ε(1/2^m). Because Sn is an increasing function of n, Sn ≥ ε(1/2^m) for each n ≥ m. It follows that lim_{n→∞} Sn ≥ ε(1/2^m) > 0. We have shown that ∫ f(ω)dµ(ω) cannot be 0 if µ{ω : f(ω) > 0} > 0.

Putting these two facts together, we see that ∫ f(ω)dµ(ω) = 0 if and only if f = 0 a.e.


Section 5.3

1. Use elementary properties of integration to prove that E{(X − µX)(Y − µY)} = E(XY) − µX µY, assuming the expectations are finite.

(X − µX)(Y − µY) = XY − µY X − µX Y + µX µY, so by the linearity property of integration, E{(X − µX)(Y − µY)} = E(XY) − µY E(X) − µX E(Y) + µX µY = E(XY) − µY µX − µX µY + µX µY = E(XY) − µX µY.

2. Explain why the notation ∫_a^b f(ω)dµ(ω) is ambiguous unless µ(a) = µ(b) = 0. How should we write the integral if we mean to include the left, but not the right, endpoint of the interval?

It is ambiguous because it is not clear whether the region of integration is [a, b], (a, b], [a, b), or (a, b). It makes a difference if one or both endpoints a and b have positive measure. To include the left but not the right endpoint, write the region of integration explicitly: ∫_{[a,b)} f(ω)dµ(ω).

3. Let f(ω) be an F-measurable function such that ∫ |f(ω)|dµ(ω) < ∞. Prove that if A1, A2, ... are disjoint sets in F, then ∫_{∪i Ai} f(ω)dµ(ω) = Σ_{i=1}^∞ ∫_{Ai} f(ω)dµ(ω).

By definition,

∫_{∪i Ai} f(ω)dµ(ω) = ∫_Ω f(ω)I(∪i Ai)dµ(ω)
  = ∫_Ω f(ω) Σ_i I(Ai) dµ(ω)
  = Σ_i ∫_Ω f(ω)I(Ai)dµ(ω)
  = Σ_i ∫_{Ai} f(ω)dµ(ω).

The third line follows from Elementary Property 5b because

Σ_i ∫ |f(ω)I(Ai)|dµ(ω) = Σ_i ∫ |f(ω)|I(Ai)dµ(ω)
  = ∫_Ω |f(ω)| Σ_i I(Ai) dµ(ω)   (Property 5a)
  = ∫_{∪i Ai} |f(ω)|dµ(ω)
  ≤ ∫_Ω |f(ω)|dµ(ω) < ∞.

4. Show that the dominated convergence theorem (DCT) implies the bounded convergence theorem (Theorem 5.13).

If |Xn| ≤ c, then Xn is dominated by the random variable Y(ω) ≡ c, and Y is integrable because ∫ c dP(ω) = c.

5. Use the monotone convergence theorem (MCT) to prove part of Elementary Property 5, namely that if Xn are nonnegative random variables, then E(Σ_n Xn) = Σ_n E(Xn).

Let Yn = Σ_{i=1}^n Xi. Then Yn ↑ Y = Σ_{i=1}^∞ Xi, and

Σ_{i=1}^∞ E(Xi) = lim_{n→∞} Σ_{i=1}^n E(Xi) = lim_{n→∞} E(Yn) = E(Y) = E(Σ_{i=1}^∞ Xi).

The second to last equality is by the MCT.

6. Prove that if E(X) = µ, where µ is finite, then E{XI(|X| ≤ n)} → µ as n → ∞.

This follows from the dominated convergence theorem because Yn = XI(|X| ≤ n) → X a.s. and |Yn| ≤ |X|, with E(|X|) < ∞.

7. Prove that lim_{n→∞} ∫_{[0,1]} {cos(nx)/n}dx = 0.

|∫_{[0,1]} {cos(nx)/n}dx| ≤ ∫_{[0,1]} |cos(nx)/n|dx ≤ ∫_{[0,1]} (1/n)dx = 1/n → 0.

The first step follows from the modulus inequality (Elementary Property 3).

8. Find lim_{n→∞} Σ_{k=1}^∞ 1/{k(1 + k/n)} and justify your answer.

We can write the sum as ∫_Ω fn(ω)dµ(ω), where fn(ω) = 1/{ω(1 + ω/n)} and µ is counting measure on Ω = {1, 2, ...}. Note that fn increases with n, so the MCT implies that

lim_{n→∞} Σ_{k=1}^∞ 1/{k(1 + k/n)} = lim_{n→∞} ∫_Ω fn(ω)dµ(ω) = ∫_Ω lim_{n→∞} fn(ω)dµ(ω)
  = ∫_Ω (1/ω)dµ(ω) = Σ_{k=1}^∞ 1/k = ∞.

9. Find lim_{n→∞} Σ_{k=1}^∞ (1 − 1/n)^{kn}/{k! ln(1 + k)} and justify your answer.

Write the sum as an integral with respect to counting measure: Σ_{k=1}^∞ (1 − 1/n)^{kn}/{k! ln(1 + k)} = ∫ fn(ω)dµ(ω), where fn(ω) = (1 − 1/n)^{ωn}/{ω! ln(1 + ω)} and µ is counting measure on the positive integers. Note that fn ↑ exp(−ω)/{ω! ln(1 + ω)}, so by the MCT,

lim_{n→∞} Σ_{k=1}^∞ (1 − 1/n)^{kn}/{k! ln(1 + k)} = lim_{n→∞} ∫ fn(ω)dµ(ω)
  = ∫ lim_{n→∞} fn(ω)dµ(ω) = ∫ exp(−ω)/{ω! ln(1 + ω)}dµ(ω)
  = Σ_{k=1}^∞ exp(−k)/{k! ln(1 + k)}.
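
A quick numerical illustration of the two limits above is possible by truncating the sums at a large K. The Python sketch below is an illustration only; the truncation points, the values of n, and the seed-free settings are arbitrary choices. The first (truncated) sum creeps upward toward the divergent harmonic series as n grows, while the second stabilizes near Σ exp(−k)/{k! ln(1 + k)}.

    import math

    K = 2000  # truncation point for the first sum (arbitrary)

    def s8(n):
        return sum(1.0 / (k * (1.0 + k / n)) for k in range(1, K + 1))

    def s9(n):
        return sum((1.0 - 1.0 / n) ** (k * n) / (math.factorial(k) * math.log(1.0 + k))
                   for k in range(1, 60))

    limit9 = sum(math.exp(-k) / (math.factorial(k) * math.log(1.0 + k)) for k in range(1, 60))
    for n in (10, 100, 1000):
        print(n, round(s8(n), 3), round(s9(n), 6))
    print("limiting value of the Problem 9 sum:", round(limit9, 6))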

10. The dominated convergence theorem (DCT) has the condition that “P {ω : Xn(ω) → X(ω) and |Xn(ω)| ≤ Y (ω) for all n} = 1.” Show that this is equivalent to “P {ω : Xn(ω) → X(ω)} = 1 and P {ω : |Xn(ω)| ≤ Y (ω)} = 1 for each n. Let A = {ω : Xn (ω) → X(ω)} and Bn = {ω : |Xn (ω)| ≤ Y (ω)}. It is clear that A ∩ (∩n Bn ) ⊂ A and A ∩ (∩n Bn ) ⊂ Bn for each n. Therefore, if P {A ∩ (∩n Bn )} = 1 then P (A) = 1 and P (Bn ) = 1 for each n. For the reverse direction,

P[{A ∩ (∩n Bn)}^C] = P{A^C ∪ (∪n Bn^C)} ≤ P(A^C) + Σ_{n=1}^∞ P(Bn^C)    (6)

by countable subadditivity. Therefore, if P (A) = 1 and P (Bn ) = 1 for each n, then each term in Expression(6) is 0, so P [{A ∩ (∩n Bn )}C ] = 0. Accordingly, P {A ∩ (∩n Bn )} = 1. 11. * The DCT and MCT apply to limits involving t → t0 as well (see Definition A.54 of the Appendix). For example, show the following. Let ft, f , and g be measurable functions and A = {ω : |ft(ω)| ≤ g(ω) for all t and ft(ω) → f (ω) as  t → t0 }, where g(ω)dµ < ∞. If A ∈ F and µ(AC ) = 0,   then limt→t0 ft(ω)dµ(ω) = f (ω)dµ(ω). 



By definition, limt→t0 ft (ω)dµ(ω) exists if and only if limn→∞ ftn (ω)dµ(ω) exists and has the same value for each sequence tn → t0 . Moreover, |f (ω)| ≤ g(ω) for all n, so by the usual DCT, limn→∞ ftn (ω)dµ(ω) =  tn ft (ω)dµ(ω). Because this holds for every sequence tn converging to t,   limt→t0 ft (ω)dµ(ω) = f (ω)dµ(ω).


56 12. ↑ Is the result of the preceding problem correct if it is stated as follows? Let At = {ω : |ft(ω)| ≤ g(ω)} and B = {ω : ft(ω) →  f (ω) as t → t0 }, where g(ω)dµ < ∞. If µ(AC t ) = 0 and   µ(B C ) = 0, then limt→t0 ft(ω)dµ(ω) = f (ω)dµ(ω).

No, because µ{(∩t At )C } = µ(∪t AC t ), but this is not necessarily less than  C or equal to t µ(At ) because measures are not necessarily uncountably subadditive.

13. Suppose that X is a random variable with density function f(x), and consider E(X) = ∫_{−∞}^∞ xf(x)dx. Assume further that f is symmetric about 0 (i.e., f(−x) = f(x) for all x ∈ R). Is the following argument correct?

∫_{−∞}^∞ xf(x)dx = lim_{A→∞} ∫_{−A}^A xf(x)dx = lim_{A→∞} 0 = 0

because g(x) = xf(x) satisfies g(−x) = −g(x). Hint: consider f(x) = {π(1 + x²)}^{−1}; are E(X−) and E(X+) finite?

The argument is incorrect. Let f (x) = {π(1 + x2 )}−1 . Then E(X−)  and E(X + ) both equal 0∞ x/{π(1 + x2 )}dx = (2π)−1 ln(1 + x2 )|∞ 0 = ∞. Therefore, E(X) does not exist because it is of the form +∞ − ∞.
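
The Cauchy counterexample is also visible in simulation. The Python sketch below is an illustration only; the sample size, the checkpoints, and the seed are arbitrary choices. Running means of standard Cauchy draws keep jumping around instead of settling down, which is consistent with E(X) not existing.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_cauchy(size=1_000_000)
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    # The running mean does not converge; it jumps between widely separated checkpoints.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, running_mean[n - 1])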

14. Show that if fn are nonnegative measurable functions such that fn(ω) ↓ f (ω), then it is not necessarily the case that   limn→∞ fn(ω)dµ(ω) = limn→∞ fn(ω)dµ(ω) for an arbitrary measure µ. Hint: let µ be counting measure and fn(ω) be the indicator that ω ≥ n.

Let µ be counting measure on Ω = {0, 1, 2, . . .} and fn be the indicator  that ω ≥ n. Then fn (ω)dµ(ω) = µ{n, n + 1, . . .} = ∞. On the other   hand, limn→∞ fn (ω) = 0 for each ω, so limn→∞ fn (ω) = 0 dµ(ω) = 0.

15. Use the preservation of ordering property to give another proof of the fact that if f is a nonnegative, measurable function, then ∫ f(ω)dµ(ω) = 0 if and only if µ(A) = 0, where A = {ω : f(ω) > 0}. Hint: if µ(A) > 0, then there must be a positive integer n such that µ{ω : f(ω) > 1/n} > 0.

Let An = {ω : f(ω) > 1/n}. If µ(A) > 0, then µ(An) > 0 for some n because µ(A) = µ(∪_{n=1}^∞ An) ≤ Σ_{n=1}^∞ µ(An); if all µ(An) = 0, then µ(A) would be 0. Thus, there is an ε > 0 and an index N such that µ(AN) ≥ ε. By the preservation of ordering property of integration,

∫_A f(ω)dµ(ω) = ∫_Ω f(ω)I(A)dµ(ω) ≥ ∫ f(ω)I(AN)dµ(ω)
  ≥ ∫ (1/N)I(AN)dµ(ω) = (1/N)µ(AN) ≥ ε(1/N).

16. Use elementary integration properties and Fatou's lemma to prove the monotone convergence theorem.

Suppose that the nonnegative, measurable functions fn increase to f. By Fatou's lemma, liminf_{n→∞} ∫ fn(ω)dµ(ω) ≥ ∫ f(ω)dµ(ω). Also, because fn(ω) ↑ f(ω), fn(ω) ≤ f(ω) for each n. Therefore, ∫ fn(ω)dµ(ω) ≤ ∫ f(ω)dµ(ω) for each n, whence limsup_{n→∞} ∫ fn(ω)dµ(ω) ≤ ∫ f(ω)dµ(ω). Because

liminf_{n→∞} ∫ fn(ω)dµ(ω) ≥ ∫ f(ω)dµ(ω) and limsup_{n→∞} ∫ fn(ω)dµ(ω) ≤ ∫ f(ω)dµ(ω),

we conclude that lim_{n→∞} ∫ fn(ω)dµ(ω) = ∫ f(ω)dµ(ω).


Section 5.4

1. State and prove a result analogous to Jensen's inequality, but for concave functions.

Suppose that f(x) is a concave function and that X is a random variable with finite mean µ. If E(|f(X)|) < ∞, then E{f(X)} ≤ f{E(X)}.

Apply Jensen’s inequality to the convex function −f (x) to see that E{−f (X)} ≥ −f {E(X)}, so E{f (X)} ≤ f {E(X)}.

2. Prove that if X has mean 0, variance σ 2 , and finite fourth moment µ4 = E(X 4 ), then µ4 ≥ σ 4 .

Let Y = X 2 . By Jensen’s inequality, µ4 = E(Y 2 ) ≥ {E(Y )}2 = (σ 2 )2 = σ4.

3. Prove the Schwarz inequality. This follows immediately by applying the Holder’s inequality with p = q = 2. 4. Prove Corollary 5.20. E(|X|p ) ≥ {E(|X|)}p by Jensen’s inequality because f (x) = xp is a convex function for p ≥ 1. Now take the pth root of both sides to see that E(|X|) ≤ {E(|X|p )}1/p . 5. Prove that Markov’s inequality is strict if P (|X| > c) > 0 or E{|X|I(|X| < c} > 0. Does this imply that Markov’s inequality is strict unless |X| = c with probability 1? (Hint: consider X taking values c and 0 with probabilities p and 1 − p). In Markov’s inequality, E(|X|) = E{|X|I(|X| < c)} + E{|X|I(|X| = c)} + E{|X|I(|X| > c)} = E{|X|I(|X| < c)} + cP (|X| = c) + E{|X|I(|X| > c)}. If P (|X| > c) is nonzero, then the last term is strictly greater than cP (|X| > c) from which we deduce that E(|X|) > 0 + cP (|X| = c) + cP (|X| > c) = cP (|X| ≥ c). In other words, Markov’s inequality is strict if P (|X| > c) is nonzero. Similarly, if E{|X|I(|X| < c)} > 0, then E(|X|) > 0 + cP (|X| = c) + E{|X|I(|X| > c)} ≥ cP (|X| = c) + cP (|X| > c) = cP (|X| ≥ c),


59 and again the inequality is strict. It is not necessarily the case that |X| = c with probability 1. For example if |X| takes values c and 0 with probabilities p and 1 − p, then P (|X| ≥ c) = p = {E(|X|)}/c, so the inequality is an equality. 6. Prove that the inequality in Jensen’s inequality is strict unless f (X) = f (µ) + b(X − µ) with probability 1 for some constant b. Because f (x) is convex, there exists a line y = f (µ) + b(x − µ) passing through (µ, f (µ)) such that f (x) ≥ f (µ) + b(x − µ) for all x. It follows that f (X) ≥ f (µ) + b(X − µ) with probability 1. That is, the random variable U = f (X) − {f (µ) + b(X − µ)} is nonnegative. Proposition 5.3 implies that E(U ) > 0 unless U = 0 with probability 1. In other words, Jensen’s inequality is strict unless f (X) = f (µ) + b(X − µ) with probability 1. 7. Prove that if 0 < σX < ∞ and 0 < σY < ∞, the correlation coefficient ρ = E(XY )/σX σY between X and Y is between −1 and +1. Note that |cov(X, Y )| = |E{(X − µX )(Y − µY )}| ≤ E(|X − µX | |Y − µY |) (modulus inequality) ≤





√[E{(X − µX)²}] √[E{(Y − µY)²}] = σX σY (Schwarz inequality).

It follows that |ρ| = |cov(X, Y )/(σX σY )| ≤ 1. 8. Suppose that xi are positive numbers. The sample geomet 1/n ric mean is defined by G = ( n . Note that ln(G) = i=1 xi) n (1/n) i=1 ln(xi). Using this representation, prove that the arithmetic mean is always at least as large as the geometric mean. ln(G) = E{ln(X)}, where X takes the value xi with probability 1/n, i = 1, . . . , n. By Jensen’s inequality, E{− ln(X)} ≥ − ln{E(X)} = − ln(¯ x). It follows that ln(G) ≤ ln(¯ x). Exponentiating both sides of this inequality, we get G ≤ exp{ln(¯ x)} = x¯. 9. The sample harmonic mean of numbers x1 , . . . , xn is defined by  −1 . Show that the following ordering holds {(1/n) n i=1 (1/xi)}


for positive numbers: harmonic mean ≤ geometric mean ≤ arithmetic mean. Does this inequality hold without the restriction that xi > 0, i = 1, ..., n?

The preceding exercise showed that the geometric mean is less than or equal to the arithmetic mean, so it suffices to prove that the harmonic mean is less than or equal to the geometric mean. Again let X be a random variable taking the value xi with probability 1/n. Then

(1/n) Σ_{i=1}^n {− ln(xi)} = (1/n) Σ_{i=1}^n ln(1/xi) = E{ln(1/X)} ≤ ln{E(1/X)} = ln{(1/n) Σ_{i=1}^n (1/xi)}

because f(x) = ln(x) is a concave function. Now exponentiate both sides to get

{Π_{i=1}^n (1/xi)}^{1/n} ≤ (1/n) Σ_{i=1}^n (1/xi), so that

G = (Π_{i=1}^n xi)^{1/n} ≥ 1/{(1/n) Σ_{i=1}^n (1/xi)} = H.

We have shown that the harmonic mean (H), geometric mean (G), and arithmetic mean A satisfy H ≤ G ≤ A. The inequality does not necessarily hold if some of the numbers are negative. For example, the geometric mean of −2, +2, namely 2i, is imaginary. 
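
The chain H ≤ G ≤ A can be spot-checked numerically. The Python sketch below is an illustration only; the particular positive data and the seed are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0.1, 10.0, size=1000)     # arbitrary positive numbers

    arithmetic = x.mean()
    geometric = np.exp(np.log(x).mean())
    harmonic = 1.0 / np.mean(1.0 / x)
    print(harmonic <= geometric <= arithmetic, harmonic, geometric, arithmetic)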

10. Let f0(x) and f1(x) be density functions with ∫ |ln{f1(x)/f0(x)}| f0(x)dx < ∞. Then ∫ ln{f1(x)/f0(x)}f0(x)dx ≤ 0.

We can write ∫ ln{f1(x)/f0(x)}f0(x)dx as E[ln{f1(X)/f0(X)}] ≤ ln[E{f1(X)/f0(X)}], where X has density f0(x); the inequality is the concave version of Jensen's inequality applied to ln. Also,

E{f1(X)/f0(X)} = ∫ {f1(x)/f0(x)}f0(x)dx = ∫ f1(x)dx = 1

because f1(x) is a density function. Therefore, ∫ ln{f1(x)/f0(x)}f0(x)dx ≤ ln(1) = 0.
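
This is the statement that a Kullback-Leibler-type integral is never positive, and it can be estimated by Monte Carlo. The Python sketch below is an illustration only; the two normal densities, the sample size, and the seed are arbitrary choices. It averages ln{f1(X)/f0(X)} over draws X from f0 and the result should be negative.

    import numpy as np

    def norm_logpdf(x, mu, sigma):
        # log density of N(mu, sigma^2)
        return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

    rng = np.random.default_rng(4)
    x = rng.normal(0.0, 1.0, size=1_000_000)                          # draws from f0 = N(0, 1)
    log_ratio = norm_logpdf(x, 1.0, 2.0) - norm_logpdf(x, 0.0, 1.0)   # ln{f1(X)/f0(X)}
    print(log_ratio.mean())   # negative: minus the Kullback-Leibler divergence of f1 from f0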


61 11. Suppose that X and Y are independent nonnegative, nonconstant random variables with mean 1 and both U = X/Y and V = Y /X have finite mean. Prove that U and V cannot both have mean 1. E(U ) = E(X/Y ) = E(X)E(1/Y ) > µX (1/µY ) = µX /µY . If E(U ) = 1, then 1 > µX /µY . Likewise, E(V ) = E(Y /X) = E(Y )E(1/X) > µY (1/µX ) = µY /µX . If E(V ) = 1, then 1 > µY /µX . Clearly, it is not possible for both 1 > µX /µY and 1 > µY /µX .


Section 5.5

1. Suppose that cov{Y, f(X)} = 0 for every Borel function f such that cov{Y, f(X)} exists. Show that this does not necessarily imply that X and Y are independent. Hint: let Z be N(0, 1), and set Y = Z and X = Z².

Let Z be N(0, 1), and set Y = Z and X = Z². Then cov{Y, f(X)} = ∫_{−∞}^∞ z f(z²)φ(z)dz = 0, where φ(z) is the standard normal density function. If this integral exists, it must be 0 because the function h(z) = z f(z²)φ(z) satisfies h(−z) = −h(z). Therefore, Z is uncorrelated with every function of Z² such that the covariance exists, yet Z and Z² are clearly not independent.

2. Let X be a nonnegative random variable with distribution function F(x). Prove that ∫_0^∞ {1 − F(x)}dx = ∫_0^∞ x dF(x) = E(X).

∫_0^∞ {1 − F(x)}dx = ∫_0^∞ {∫_{[0,∞)} I(t > x)dF(t)} dx
  = ∫_{[0,∞)} {∫_0^∞ I(t > x)dx} dF(t)
  = ∫_{[0,∞)} t dF(t) = E(X).

The reversal of order of the integration is by Tonelli’s theorem because the integrand is nonnegative. 
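
The identity E(X) = ∫_0^∞ {1 − F(x)}dx is easy to verify numerically for a specific nonnegative distribution. The Python sketch below is an illustration only; the exponential rate, the grid, and the cutoff are arbitrary choices. It compares a crude numerical integral of the survival function with the known mean 1/λ.

    import numpy as np

    lam = 2.0                        # X ~ exponential with rate lam, so E(X) = 1/lam
    x = np.linspace(0.0, 50.0, 200_001)
    survival = np.exp(-lam * x)      # 1 - F(x)
    dx = x[1] - x[0]
    print(np.sum(survival) * dx, 1.0 / lam)   # both approximately 0.5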

3. Prove that ∫_{−∞}^∞ {F(x + a) − F(x)}dx = a, for any distribution function F and constant a ≥ 0.

∫_{−∞}^∞ {F(x + a) − F(x)}dx = ∫_{−∞}^∞ {∫_{−∞}^∞ I(x < t ≤ x + a)dF(t)} dx
  = ∫_{−∞}^∞ {∫_{−∞}^∞ I(t − a ≤ x < t)dx} dF(t)
  = ∫_{−∞}^∞ {t − (t − a)}dF(t) = ∫_{−∞}^∞ a dF(t) = a.

The reversal of order of the integration is by Tonelli's theorem because the integrand is nonnegative.


Section 6.1.1

For Problems 1–4, let (Ω, F, P) be ([0, 1], B[0,1], µL), where µL is Lebesgue measure.

1. Let Xn(ω) ≡ 1 and Yn(ω) = I(ω > 1/n), where I denotes an indicator function. Does Xn/Yn converge for every ω ∈ [0, 1]? Does it converge almost surely to a random variable? If so, what random variable?

Xn/Yn does not converge for ω = 0 because the denominator, Yn, is 0 for all n. Nonetheless, it converges almost surely to Z(ω) ≡ 1. It also converges almost surely to Z′, where Z′ is any random variable such that Z′(ω) = 1 with probability 1.

2. Let



Yn = (−1)^n if ω is rational, and Yn = ω if ω is irrational. Does Yn converge almost surely to a random variable? If so, specify a random variable on (Ω, F, P) that Yn converges almost surely to.

Yn converges almost surely to T (ω) = ω. The set of ω on which it fails to converge, namely the rational numbers in [0, 1], has Lebesgue measure 0. It also converges almost surely to T  , where T  is any random variable that equals T with probability 1. For example, we could arbitrarily change the value of T to 4 for rational ω. 3. In the preceding problem, reverse the words “rational” and “irrational.” Does Yn converge almost surely to a random variable? If so, specify a random variable on (Ω, F , P ) that Yn converges almost surely to. In this case Yn does not converge almost surely to a random variable: the set of ω on which it converges has Lebesgue measure 0. 4. For each n, divide [0, 1] into [0, 1/n), [1/n, 2/n), . . . , [(n−1)/n, 1], and let Xn be the left endpoint of the interval containing ω. Prove that Xn converges almost surely to a random variable, and identify the random variable. Xn converges almost surely to X(ω) = ω. To see this, note that for each ω ∈ [0, 1], |Xn (ω) − ω| ≤ 1/n because the length of the interval containing ω is 1/n. Therefore, |Xn (ω) − X(ω)| ≤ 1/n → 0 for each ω, a.s. so Xn (ω) → X(ω).


5. Prove that if Xn converges almost surely to 0, then Yn = |Xn|/(1 + |Xn|) converges almost surely to 0. Does ln(Yn) converge almost surely to a finite random variable?

|Yn| ≤ |Xn| → 0 a.s. Therefore, Yn → 0 a.s. However, ln(Yn) does not converge almost surely to a finite random variable, because Yn → 0 implies ln(Yn) → −∞.

6. Suppose that Xn converges almost surely to X. Suppose further that the distribution and density functions of X are F(x) and f(x), respectively. Let

G(x, y) = {F(y) − F(x)}/(y − x) if x ≠ y, and G(x, y) = f(x) if x = y.

Prove that G(X, Xn) converges almost surely to f(X).

G(X, Xn) = [F{Xn(ω)} − F{X(ω)}]/{Xn(ω) − X(ω)} if X(ω) ≠ Xn(ω), and G(X, Xn) = f{X(ω)} if X(ω) = Xn(ω).

Each ω for which Xn (ω) → X(ω) is an ω for which G(X, Xn ) converges a.s. a.s. to f {X(ω)} as n → ∞. Because Xn → X, G(X, Xn ) → f (X). 7. Prove parts 1 and 2 of Proposition 6.5. Part 1 follows from the uniqueness of a limit of a sequence of numbers. For fixed ω outside a null set N , xn = Xn (ω) → x = X(ω), and for ω outside a null set N2 , xn = Xn (ω) → x , so outside the null set N ∪ N  , x must equal x by the uniqueness of the limit of a sequence of numbers. That is, X = X  with probability 1. Consider part 2. Let A be the set of ω for which Xn (ω) → X(ω), and C = DC be the set of ω for which x = X(ω) is a continuity point of f . Each ω in A ∩ C is an ω for which f {Xn (ω)} → f {X(ω)}, Furthermore, P {A ∩ C)C } = P (AC ∪ D) ≤ P (AC ) + P (D) = 0 + 0 = 0. Therefore, the probability that f {Xn (ω)} → f {X(ω)} is 1. 8. Prove parts 4 and 5 of Proposition 6.5. Parts 4 and 5 follow immediately from the corresponding properties for sequences of numbers. For example, if xn = Xn (ω) and yn = Yn (ω) are sequences of numbers converging to x = X(ω) and y = Y (ω), then xn yn → xy. Therefore, except on the null set of ω for which either Xn (ω) fails to converge to X(ω) or Yn (ω) fails to converge to Y (ω), a.s. Xn (ω)Yn (ω) → X(ω)Y (ω). Therefore, Xn (ω)Yn (ω) → X(ω)Y (ω). The a.s. same argument shows that Xn (ω)/Yn (ω) → X(ω)/Y (ω).


65 9. Extend Example 6.7 to show that if X1 , X2 , . . . are iid random variables with any non-degenerate distribution, then they cannot converge almost surely. Because the Xi are iid with a nondegenerate distribution, there must exist numbers x1 < x2 such that P (X1 ≤ x1 ) = p1 > 0 and P (X1 ≥  ∞ ∞ x2 ) = p2 > 0. Then ∞ n=1 P (Xn ≤ x1 ) = n=1 p1 = ∞ = n=1 p2 = ∞ n=1 P (Xn ≥ x2 ). By part 2 of the Borel-Cantelli lemma, P (A) = 1 = P (B), where A is the event that Xn ≤ x1 for infinitely many n, and B is the event that Xn ≥ x2 for infinitely many n. Therefore, P (A ∩ B) = 1 because P {(A ∩ B)C } ≤ P (AC ) + P (B C ) = 0 + 0 = 0. But each ω for which Xn (ω) ≤ x1 for infinitely many n and Xn (ω) ≥ x2 for infinitely many n is an ω for which Xn (ω) cannot converge. Therefore, Xn converges with probability 0. 10. Using the same reasoning as in Example 6.7, one can show the following. For each fixed subsequence n1 , n2 , . . . , P (Xn1 = 1, Xn2 = 1, . . . , Xnk = 1, . . .) = 0. Does this imply that, with probability 1, there is no subsequence m1 , m2 , . . . such that Xm1 = 1, Xm2 = 1, . . . , Xmk = 1, . . .? Explain. No. If there were only countably many subsequences s1 , s2 , . . ., the statement would be true by countable subadditivity because P (∪s {Xn1 =   1, . . . , Xnk = 1, . . .}) ≤ s P {Xn1 = 1, . . . , Xnk = 1, . . .} = s 0 = 0. However, the number of sequences is uncountable, so the step P (∪s {Xn1 =  1, . . . , Xnk = 1, . . .}) ≤ s P {Xn1 = 1, . . . , Xnk = 1, . . .} is not necessarily true. In fact, with probability 1, there will be a subsequence of all ones. Otherwise, X1 , X2 , . . . would contain only finitely many 1s, which has probability 0 by the second part of the Borel-Cantelli lemma because ∞ n=1 P (Xn = 1) = ∞.


Section 6.1.2

1. Let (Ω, F, P) be ([0, 1], B[0,1], µL), where µL is Lebesgue measure. Let Xn(ω) = ω I(ω ≤ 1 − 1/n). Prove that Xn converges in probability to a random variable, and identify that random variable.

Xn → X(ω) = ω. |Xn − ω| = ω{I(ω ≤ 1 − 1/n) − 1} is nonzero only when ω > 1 − 1/n. Therefore, P (|Xn − ω| ≥ ) ≤ 1/n → 0. 2. Sometimes we transform an estimator using a continuous function f (θˆn). What can we conclude about the transformed estip mator if θˆn → θ? p p If f is a continuous function and θˆn → θ, then f (θˆn ) → f (θ) by part 2 of Proposition 6.12.

3. Explain the relevance of part 3 of Proposition 6.12 in terms of one- and two-sample estimators. If a one sample estimator of a parameter such as a mean, proportion, etc. converges in probability to that parameter, then a two-sample estimate formed as the difference of two one-sample estimators converges in probability to the difference in means, proportions, etc. 4. Give an example to show that convergence of Xn to X in probability is not sufficient to conclude that f (Xn) converges in probability to f (X) for an arbitrary function f . Let f (x) = I(x = 0), and let Xn =



−1/n w.p. 1/2 +1/n w.p. 1/2.

p

Then Xn → 0 and f (0) = 0, but f (Xn ) = 1 with probability 1. 5. Prove that the following are equivalent. p

(a) Xn → X.

(b) For each  > 0, P (|Xn − X| > ) → 0 (that is, the ≥ symbol in Definition 6.8 can be replaced by >). (c) For each  > 0, P (Xn − X > ) → 0 and P (Xn − X < −) → 0 as n → ∞.



Suppose Xn → X. For given  > 0, P (|Xn − X| > ) ≤ P (|Xn − X| ≥ ) → 0. Thus, item 1 implies item 2.

Now suppose P (|Xn − X| > ) → 0 for each . Then certainly P (Xn − X > ) → 0 and P (Xn − X < −) → 0 because Xn − X >  ⇒ |Xn − X| > , and Xn − X < − ⇒ |Xn − X| > . Therefore, item 2 implies item 3. Now suppose that P (Xn − X > ) → 0 and P (Xn − X < −) → 0 for each  > 0. Then |Xn − X| ≥  implies that either Xn − X > /2 or Xn − X < −/2. Therefore, P (|Xn − X| ≥ ) ≤ P (Xn − X > /2) + P (Xn − X < −/2) → 0 + 0 = 0. Thus, item 3 implies item 1. p

6. Prove that if Xn → X, then there exists an N such that P (|Xn − X| ≥ ) ≤  for n ≥ N .

Because P (|Xn − X| ≥ ) → 0, for each τ we can determine an N such that P (|Xn − X| ≥ ) < τ for n ≥ N . Take τ =  to conclude that there is an N such that P (|Xn − X| ≥ ) < , which clearly implies that P (|Xn − X| ≥ ) ≤  for n ≥ N .

7. Prove parts 1 and 2 of Proposition 6.12. p

p

Consider part 1. Suppose that Xn → X and Xn → X  . Then

P (|X − X  | ≥ )

= P (|X − Xn + Xn − X  | ≥ ) ≤ P (|X − Xn | + |Xn − X  | ≥ ) ≤ P (|X − Xn | > /2 ∪ |Xn − X  | > /2) ≤ P (|X − Xn | ≥ /2) + P (|Xn − X  | ≥ /2). Now take the limit of both sides as n → ∞: lim P (|X − X  | ≥ ) =

n→∞

lim P (|X − Xn | ≥ /2) + n→∞ lim P (|Xn − X  | ≥ /2)

n→∞

= 0 + 0 = 0. But the left side is just P (|X − X  | ≥ ), so P (|X − X  | ≥ ) = 0. But  is arbitrary, so P (|X − X  | > 0) = 0, which means X = X  a.s. Now consider part 2. Assume that f is continuous on R. For given  and A, we must show that there is an N such that P (|f (Xn )−f (X)| ≥ A) <  for n ≥ N .

Choose B such that P (|X| > B) < /2. This is possible because the random variable X must be finite with probability 1. Because FB =


68 [−2B, 2B] is a compact set and f is continuous on FB , f is uniformly continuous on FB by Proposition A.62. That means there exists a τ such that if x ∈ FB and y ∈ FB and |x − y| ≤ τ , then |f (x) − f (y)| < A. Notice that |X| ≤ B and |Xn − X| ≤ B imply that Xn and X are both in FB . Therefore, for |X| ≤ B and |Xn − X| ≤ λ = min(τ, B), Xn and X are both in FB and |f (Xn ) − f (X)| < A. Therefore, the event that |f (Xn ) − f (X)| ≥ A implies that either |X| > B or |Xn − X| > λ. Choose N such that P (|Xn − X| > λ) < /2. We can do this because p Xn → X. Then for n ≥ N , P {|f (Xn )−f (X)| ≥ A} ≤ P (|X| > B)+P (|Xn −X| > λ) < /2+/2 = . This completes the proof. 8. Prove parts 4 and 5 of Proposition 6.12. For part 4, note that |Xn Yn | ≥  ⇒ |Xn | ≥ 1/2 or |Yn | ≥ 1/2 , so P (|Xn Yn | ≥ ) ≤ P (|Xn | ≥ 1/2 ) + P (|Yn | ≥ 1/2 ) → 0 + 0 = 0 as n → ∞. For part 5, one proof is as follows. The function f (y) = 1/y is continuous p except at y = 0, and P (Y = 0) = 0. Therefore, by part 2, 1/Yn → 1/Y . p p Because Xn → X and 1/Yn → 1/Y , then part 4 implies that Xn (1/Yn ) converges in probability to X/Y . The above proof of part 5 is somewhat unsatisfying because we used part 2 in its full generality, but we only proved part 2 when f is continuous on the entire line. We therefore supply an alternate proof of part 5. Consider first the case when Xn converges in probability to 0. We will show that for given A and , we can find an N such that P (|Xn /Yn | ≥ A) < . Because P (Y = 0) = 0, there exists a τ > 0 such that P (|Y | ≤ τ ) < /3. p Because Yn → Y , there is an N1 such that P (|Yn − Y | > τ /2) < /3 for n ≥ N1 . For |Y | > τ and |Yn − Y | ≤ τ /2, |Yn | ≥ τ /2. Find an N2 such that P {|Xn |/(τ /2) > A} < /3 for n ≥ N2 . The event that |Xn /Yn | ≥ A implies that either |Y | > τ or |Yn − Y | > τ /2 or |Xn |/(τ /2) > A. Therefore, for n ≥ max(N1 , N2 ), P (|Xn /Yn | ≥ A)

≤ P (|Y | > τ ) + P (|Yn − Y | > τ /2) + P {|Xn |/(τ /2) > A} < /3 + /3 + /3 = .



This proves part 5 when Xn → 0.

p

Now consider the general case in which Xn → X in probability. Note that, when Yn and Y are nonzero,

Xn/Yn − X/Y = (Xn Y − X Yn)/(Yn Y) = (Xn Y − XY + XY − X Yn)/(Yn Y) = {(Xn − X)Y + X(Y − Yn)}/(Yn Y).    (7)

Part 4 implies that the numerator tends to 0 in probability and the denominator tends to Y 2 in probability. We can invoke the result just proven to conclude that Expression (7) converges in probability to 0. p This completes the proof in the general case that Xn → X. 9. Prove that P (θˆn ≥ θ + ) → 0 in Example 6.10. θˆn ≥ θ +  ⇒ pˆ ≥ 1/2, where pˆ is the proportion of observations at least as large as θ + . Let p = P (X ≥ θ + ), which is less than 1/2 because the distribution function F is strictly increasing. Then pn ≥ 1/2) P (θˆn ≥ θ + ) ≤ P (ˆ = P (ˆ pn − p ≥ 1/2 − p) ≤ P (|ˆ pn − p| ≥ 1/2 − p) var(ˆ pn ) (Chebychev’s inequality) (1/2 − p)2 p(1 − p) → 0. = n(1/2 − p)2



10. If X1 , X2 , . . . are iid random variables with a non-degenerate distribution function, can Xn converge in probability to a constant c? If so, give an example. If not, prove that it cannot happen. It cannot happen because P (|Xn − c| ≥ ) = P (|X1 − c| ≥ ).

(8)

Suppose that P (|Xn − c| ≥ ) → 0. Taking the limit of Equation (8) as n → ∞, we get 0 = n→∞ lim P (|X1 − c| ≥ ) = P (|X1 − c| ≥ ).


This shows that P(|X1 − c| ≥ ε) = 0 for each ε > 0. This clearly implies that X1 = c with probability 1, contradicting the fact that the Xi have a non-degenerate distribution. Therefore, Xn cannot converge in probability to a constant.

11. If X1, ..., Xn are identically distributed (not necessarily independent) with E(|Xi|) < ∞, then Yn = max(|X1|, ..., |Xn|)/n → 0 in probability. Hint: P(Yn ≥ ε) = P(∪_{i=1}^n Ai), where Ai = {|Xi(ω)| ≥ nε}.

P(Yn ≥ ε) = P(∪_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai)
  = Σ_{i=1}^n P(|Xi| ≥ nε) = Σ_{i=1}^n P(|X1| ≥ nε)
  = nP(|X1| ≥ nε) ≤ (1/ε)E{|X1|I(|X1| ≥ nε)} → 0

by the DCT because E(|X1|) < ∞.


Section 6.1.3

1. Let Xn = √n with probability 1/n, and Xn = 0 with probability 1 − 1/n.

Show that Xn → 0 in L1 , but not in L2 . Are there any examples in which Xn → 0 in L2 , but not in L1 ? √ √ √ E(|Xn − 0|) = nP (Xn = n) + 0P (Xn = 0) = n(1/n) = 1/ n → 0 E(|Xn − 0|2 ) = nP (Xn = n) + 02 P (Xn = 0) = n(1/n) = 1. Therefore, E(|Xn − 0|2 ) does not converge to 0. That is, Xn does not converge to 0 in L2 . By Proposition 6.16, there are no examples in which Xn → 0 in L2 but not in L1 . 2. What does Proposition 6.19 imply about the MSE of twosample estimators such as the difference in means or proportions when their one-sample MSEs converge to 0? By part 2 of Proposition 6.19, if the one-sample estimators θˆ1 and θˆ2 in the two groups converge in L2 to θ1 and θ2 (i.e., their MSEs tend to 0), then the two-sample estimator θˆ1 − θˆ2 converges in L2 to θ1 − θ2 (i.e., the MSE of θˆ1 − θˆ2 tends to 0). 3. Prove part 1 of Proposition 6.19. E(|X − X  |p ) = ≤ ≤ =

E(|X − Xn + Xn − X  |p ) E{(|X − Xn | + |Xn − X  |)p } E {2p (|X − Xn |p + |Xn − X  |p )} (Proposition 5.23) 2p {E(|X − Xn |p ) + E(|Xn − X  |p )} .

Now take the limit as n → ∞ to conclude that E(|X −X  |p ) = 0. Clearly, if |X − X  | had nonzero probability of exceeding , then it could not be the case that E(|X − X  |p ) = 0. Therefore, P (|X − X  | > ) = 0 for every  > 0. This implies that X = X  a.s. 4. Show by a counterexample that Xn converging to X in Lp and Yn converging to Y in Lp does not necessarily imply that XnYn converges to XY in Lp. Let Xn be the random variable in Exercise 1, and let Yn = Xn . Then Xn and Yn both converge in L1 to 0, but Xn Yn = Xn2 does not converge to 0 in L1 because E(|Xn2 − 0|) = n(1/n) + 0 = 1, which does not tend to 0.


72 5. Show by counterexample that Xn converging to X in Lp does not necessarily imply that f (Xn) converges to f (X) in Lp for a continuous function f . Let Xn be as defined in Exercise 1, and take f (x) = x2 . Then Xn → 0 in L1 but f (Xn ) = Xn2 does not converge to f (0) = 0 in L1 because E(|Xn2 − 0|) = n(1/n) + 0 = 1, which does not tend to 0. p

6. Prove that if Xn → 0, then E{|Xn|/(1 + |Xn|)} → 0.  

 

p

Note that Yn = |Xn |/(1 + |Xn |) ≤ 1 for all n and Yn → 0 by part 5 of Proposition 6.12. For any  > 0, E(Yn ) = E{Yn I(Yn ≤ )} + E{Yn I(Yn > )} ≤  + E{1I(Yn > )} =  + P (Yn > )

(9)

Take the limsup of both sides to conclude that lim n→∞ E(Yn ) ≤ +0 = . Because  is arbitrary, lim n→∞ E(Yn ) = 0. This clearly implies that E(Yn ) → 0 as n → ∞.


Section 6.1.4

1. Suppose that Xn ∼ N (µn, 1). Prove that if Xn → N (µ, 1), then µn → µ as n → ∞. D

P (Xn ≤ x) = Φ(x − µn ) → Φ(x − µ) because Xn → N(µ, 1). But Φ−1 is a continuous function, so x − µn → x − µ as n → ∞. This implies that µn → µ. D

2 2. Suppose that Xn ∼ N (0, σn ). Prove that if Xn → N (0, σ 2 ), then σn → σ as n → ∞. D

Suppose first that σ = 0, so that Xn → N(0, 0). If any subsequence σnk converged to a number τ = 0, then by Example 6.21, Xn would converge to a N(0, τ 2 ), which is not N(0, 0). Therefore, every convergent subsequence σnk must converge to 0. Also, no subsequence σnk can converge to ∞ because then Xnk would not converge in distribution to a proper random variable. We conclude that σn converges to 0. Now suppose that σ = 0. Again if there is a subsequence σnk converging to a nonzero number different from σ, Example 6.21 leads to a contradiction. Also, if any subsequence σnk converges to 0, then clearly P (Xnk ≤ x) = Φ(x/σnk ) cannot converge to a number in (0, 1) (because |x/σnk | → ∞ as k → ∞), which contradicts the fact that P (Xn ≤ x) → Φ(x/σ). Therefore, again every convergent subsequence σnk must converge to σ. Also, as argued in the preceding paragraph, no subsequence σnk can converge to ∞. We conclude that σn → σ. 3. Prove that if Xn has a discrete uniform distribution on {1/n, D 2/n, . . . , n/n = 1}, Xn → X, where X is uniform (0, 1). Let x be an arbitrary point in (0, 1). There is an interval [(i − 1)/n, i/n] containing x. Also, (i − 1)/n ≤ Fn (x) ≤ i/n. It follows that both x and Fn (x) are in the interval [(i − 1)/n, i/n], so |Fn (x) − x| cannot be greater than the width of the interval, namely 1/n. It follows that |Fn (x) − x| ≤ 1/n → 0 as n → ∞. This proves that Xn converges in distribution to the uniform distribution on (0, 1). 4. Prove Proposition 6.23 using the same technique as in Example 6.22. D

Suppose that Xn → X. Then Fn (x) → F (x) for all continuity points x of F . The set of discontinuity points of F is countable, so the set of continuity points includes all but a countable set. Therefore, the set of


74 continuity points is a dense set of R. Therefore, Fn (x) → F (x) on a dense set of reals. Now suppose that Fn (d) → F (d) for all d ∈ D, a dense set of reals. Let x be a continuity point of F . Let d1 ∈ D and d2 ∈ D be such that d1 < x < d2 . Then Fn (d1 ) ≤ Fn (x) ≤ Fn (d2 ). It follows that F (d1 ) = lim Fn (d1 ) ≤ lim Fn (x). Similarly, lim Fn (x) ≤ lim Fn (d2 ) = F (d2 ). Because d1 < x and d2 > x are arbitrary members of D, and D is dense, we can find d1 and d2 arbitrarily close to x (either less than or greater than). Because F (d1 ) → F (x) as d1 ↑ x, d1 ∈ D, and F (d2 ) → F (x) as d2 ↓ x, d2 ∈ D (because x is a continuity point of F ), lim n→∞ Fn (x) ≥ F (x) and lim n→∞ Fn (x) ≤ F (x), proving that lim Fn (x) exists and equals F (x) at each continuity point x of F . 5. Prove Proposition 6.30. D

Suppose that Xn → X, and let f be a function whose set of discontinuities D is a Borel set and P (X ∈ D) = 0. By the Skorokhod representation theorem, there exist, on some probability space (Ω , F  , P  ), random variables Xn with the same distribution as Xn , X  with the a.s. same distribution as X  , such that Xn → X  . Then f (Xn ) converges almost surely to f (X  ) because each ω  for which X  (ω  ) ∈ DC and Xn (ω  ) → X  (ω  ) is an ω  for which f {X  (ω  )} → f {X  (ω  )}. Now let g be a bounded continuous function. Then g{f (Xn )} is bounded and converges almost surely to g{f (X  )}. By the bounded convergence theorem, E[g{f (Xn )}] → E[g{f (X  )}]. Because this holds for every bounded continuous function g, f (Xn ) converges in distribution to f (X  ) by Proposition 6.29. But f (Xn ) has the same distribution as f (Xn ), and f (X) has the same distribution as f (X  ). Therefore, f (Xn ) converges in distribution to f (X). 6. Let U1 , U2 , . . . be iid uniform [0, 1] random variables, and λ be a fixed number in (0, 1). Let Xn be the indicator that Ui ∈ [λ/n, 1], and Yn be the indicator that X1 = 1, X2 = 1, . . . , Xn = 1. Does Yn converge in distribution? If so, what is its limiting distribution?



P(Yn = 1) = Π_{i=1}^n P(Ui ≥ λ/n) = (1 − λ/n)^n → exp(−λ). Therefore, Yn converges in distribution to a Bernoulli random variable with parameter p = exp(−λ).

7. Let X be a random variable with distribution function F(x) and strictly positive density f(x) that is continuous at x = 0. Let Gn be the conditional distribution of nX given that nX ∈ [a, b]. Show that Gn converges in distribution to a uniform on [a, b].

The conditional distribution function is

P{nX ≤ x | nX ∈ [a, b]} = P(a/n ≤ X ≤ x/n)/P(a/n ≤ X ≤ b/n) = {F(x/n) − F(a/n)}/{F(b/n) − F(a/n)}
  = [(x − a){F(x/n) − F(a/n)}/{(x − a)/n}] / [(b − a){F(b/n) − F(a/n)}/{(b − a)/n}]
  = {(x − a)f(ηn)}/{(b − a)f(τn)}

by the mean value theorem, where a/n ≤ ηn ≤ x/n and a/n ≤ τn ≤ b/n. But ηn and τn tend to 0 as n → ∞, and f is continuous and positive at 0. Therefore, the conditional distribution tends to (x − a)f(0)/{(b − a)f(0)} = (x − a)/(b − a). That is, it tends to a uniform distribution on [a, b].


Section 6.2.3

1. Let (Ω, F, P) = ([0, 1], B[0,1], µL) and define X1 = I(ω ∈ [0, 1/2]), X2 = I(ω ∈ (1/2, 3/4]), X3 = I(ω ∈ (3/4, 7/8]), etc. (the intervals have widths 1/2, 1/4, 1/8, 1/16, ...). Prove that Xn → 0 a.s.



Note that Σ_{n=1}^∞ P(Xn = 1) = Σ_{n=1}^∞ (1/2)^n < ∞, so by the Borel-Cantelli lemma, P(Xn = 1 for infinitely many n) = 0. But each ω for which Xn = 1 for only finitely many n is an ω for which Xn(ω) → 0. Therefore, Xn → 0 a.s.

2. Let X1 , X2 , . . . be iid with a continuous distribution function, and let Y1 be the indicator that X1 is the larger of (X1 , X2 ), Y2 be the indicator that X3 is the largest of (X3 , X4 , X5 ), Y3 be the indicator that X6 is the largest of (X6 , X7 , X8 , X9 ), etc. p Prove that Yn → 0 but not almost surely. The Yn are independent Bernoulli random variables with respective probability parameters pn = 1/(n + 1). Therefore, P (|Yn | > ) = P (Yn = 1) = 1/(n + 1) → 0, so Yn converges to 0 in probability. Because ∞ n=1 1/(n + 1) = ∞, The Borel-Cantelli lemma implies that, with probability 1, Yn = 1 for infinitely many n. Clearly, P (Yn = 0 i.o.) is also 1. But each  for which Yn = 0 for infinitely many n and Yn = 0 for infinitely many n is an  for which Yn fails to converge. Therefore, with probability 1, Yn fails to converge. 3. In the proof of Proposition 6.35, we asserted that it is possible to find a sequence m → 0 such that x − m and x + m are both continuity points of the distribution of F . Prove this fact. Suppose the statement is not true. Then we claim there exists a number t such that for each  ≤ t, either x −  or x +  is not a continuity point of F . Otherwise, we could find an 0 < 1 such that x − 0 and x + 0 are both continuity points of F , then find an 1 < 1/2 such that x − 1 and x + 1 are continuity points of F , then find an 2 < 1/22 such that x − 2 and x + 2 are continuity points of F , etc. This would create a sequence m → 0 with x − m and x + m both being continuity points of F . This completes the argument that there exists a number t such that for each  ≤ t, either x −  or x +  is not a continuity point of F . Let y be x− if x− is a discontinuity point of F . If x− is a continuity point, then x +  must be a discontinuity point, so let y = x +  in that case. We have established a 1 − 1 correspondence between a subset of the discontinuity points of F and the uncountable set {y ,  ≤ t}. That
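
A small simulation of this construction shows the two behaviors side by side: the probability that Yn = 1 shrinks like 1/(n + 1), yet occasional 1's keep appearing no matter how far out we look, which is what blocks almost sure convergence. The Python sketch below is an illustration only; the number of blocks and the seed are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 2000                                     # number of Y_n's to simulate
    hits = np.empty(N, dtype=int)
    for n in range(1, N + 1):
        block = rng.uniform(size=n + 1)          # the block of iid X's that defines Y_n
        hits[n - 1] = 1 if block[0] == block.max() else 0

    print(hits[-500:].mean())                    # late Y_n's are almost always 0 (P = 1/(n+1))
    print(np.flatnonzero(hits)[-5:])             # yet 1's still occur at large indices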

K24704_SM_Cover.indd 80

01/06/16 10:38 am

77 means that the set of discontinuities must be uncountable. But we know that the set of discontinuities of F is countable, so this proves the result by contradiction. 4. It can be shown that the bounded convergence theorem applies to convergence in probability as well as almost sure convergence. Use this fact and Proposition 6.29 to supply a much simpler proof of Proposition 6.35. p

Suppose that Xn → X, and let f be a bounded continuous function. p Then f (Xn ) → f (X) by part 2 of Proposition 6.12. By the bounded convergence theorem, E{f (Xn )} → E{f (X)}. We have shown that for every bounded continuous function f , E{f (Xn )} → E{f (X)}. By TheD orem 6.29, Xn → X. 5. Prove the converse of Exercise 6 in Section 6.1.3, namely that p if E{|Xn|/(1 + |Xn|)} → 0, then Xn → 0. Because convergence in L1 implies convergence in probability (Proposip tion 6.32), Yn = |Xn |/(1 + |Xn |) → 0. But then |Xn | = Yn /(1 − Yn ) converges in probability to 0/(1 − 0) = 0 (part 5 of Proposition 6.12). 6. Using the same technique as in the proof of the first part of the Borel-Cantelli lemma (namely, using a sum of indicator random variables) compute the following expected numbers: (a) One method of testing whether basketball players have “hot” and “cold” streaks is as follows. Let Xi be 1 if shot i is made and −1 if it is missed. The number of sign changes in consecutive shots measures how “streaky” the player is: a very small number of sign changes means the player had long streaks of made or missed shots. For example, −1, −1, −1, +1, +1, +1, +1, +1, +1, +1 contains only one sign change and has a streak of 3 missed shots followed by 7 made shots. Under the null hypothesis that the Xi are iid Bernoulli p (i.e., streaks occur randomly), what is the expected number of sign changes in n shots? Hint: let Yi = I(Xi = Xi−1 ). 

Let Yi = I(Xi = Xi−1 ). The number of sign changes is ni=2 Yi , so   the expected number of sign changes is ni=2 E(Yi ) = ni=2 {P (Xi−1 = n −1, Xi = +1) + P (Xi−1 = +1, Xi = −1)} = i=2 {(1 − p)p + p(1 − p) = 2(n − 1)p(1 − p).
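
The formula 2(n − 1)p(1 − p) is easy to confirm by simulation. The Python sketch below is an illustration only; n, p, the number of replications, and the seed are arbitrary choices. It counts sign changes in simulated shot sequences under the iid null hypothesis.

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, reps = 20, 0.4, 100_000
    shots = rng.random((reps, n)) < p                  # True = made shot, with probability p
    changes = (shots[:, 1:] != shots[:, :-1]).sum(axis=1)
    print(changes.mean(), 2 * (n - 1) * p * (1 - p))   # both approximately 9.12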


78 (b) What is the expected number of different values appearing in a bootstrap sample (a sample drawn with replacement) of size n from {x1 , . . . , xn}? Show that the expected proportion of values not appearing in the bootstrap sample is approximately exp(−1) if n is large. Let Yi be the indicator that xi appears in the bootstrap sample. Let Ai,j be the event that the jth number drawn with replacement from  n {x1 , . . . , xn } is xi . Then P (Yi = 0) = nj=1 P (AC i,j ) = (1 − 1/n) and P (Y = 1) = 1 − (1 − 1/n)n .  The number of values appearing in the bootstrap sample is ni=1 Yi ,   and its expectation is ni=1 E(Yi ) = ni=1 P (Yi = 1) = n{1 − (1 − 1/n)n }. The expected proportion of values appearing is (1/n) times the expected number of values appearing, namely 1 − (1 − 1/n)n → 1 − exp(−1). Therefore, the proportion not appearing tends to 1 − {1 − exp(−1)} = exp(−1). 7. Regression analysis assumes that errors from different observations are independent. One way to test this assumption is to count the numbers n+ and n− of positive and negative residuals, and the number nR of runs of the same sign (see pages 198-200 of Chatterjee and Hadi, 2006). For example, if the sequence of signs of the residuals is + + − − − + − − −−, then n+ = 3, n− = 7, and nR = 4. Assume that the residuals are exchangeable (each permutation has the same joint distribution). Using indicator functions, prove that the expected number of runs, given n+ and n−, is 2n+ n−/(n+ + n−) + 1. Let n = n+ + n− . Let Ii (+−) be the indicator that residual i − 1 is positive and residual i is negative. Similarly, let Ii (−+) be the indicator that residual i − 1 is negative and residual i is positive. The number of  runs is 1 + ni=2 {Ii (+−) + Ii (−+)}, and its expectation is n 











1 + Σ_{i=2}^n {(n+/n)(n−/(n − 1)) + (n−/n)(n+/(n − 1))} = 1 + (n − 1)·2n+n−/{n(n − 1)} = 1 + 2n+n−/n.
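
The same indicator argument can be checked by permuting a fixed set of signs. The Python sketch below is an illustration only; n+, n−, the number of replications, and the seed are arbitrary choices. It averages the number of runs over random permutations and compares the result with 1 + 2n+n−/n.

    import numpy as np

    rng = np.random.default_rng(7)
    n_plus, n_minus, reps = 6, 10, 100_000
    signs = np.array([1] * n_plus + [-1] * n_minus)
    runs = []
    for _ in range(reps):
        s = rng.permutation(signs)
        runs.append(1 + int(np.sum(s[1:] != s[:-1])))
    n = n_plus + n_minus
    print(np.mean(runs), 1 + 2 * n_plus * n_minus / n)   # both approximately 8.5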



8. Suppose that Y1 , Y2 , . . . are independent. Show that P [supn{Yn}  < ∞] = 1 if and only if ∞ n=1 P (Yn > B) < ∞ for some constant B. 

Suppose that ∞ n=1 P (Yn > B) < ∞ for some constant B. Then P (Yn > B i.o.) = 0. But each ω for which Yn (ω) > B only finitely often is an ω


79 for which supn Yn is finite. Therefore, P (supn Yn < ∞) = 1. 

Now suppose that n P (Yn > B) = ∞ for each B. Then P (Yn > B i.o.) = 1 for each B. Let Ak be the event that Yn > k i.o. in n. Then P (supn Yn = ∞) = P (∩k Ak ), and P {(∩k Ak )C } = P (∪k AC k) ≤    C n P (Yn > B) = ∞ for each B, k P (Ak ) = k 0 = 0. Therefore, if then supn Yn = ∞ with probability 1. This is a proof by contrapositive  that P (supn Yn < ∞) = 1 ⇒ ∞ n=1 P (Yn > B) < ∞.

9. Let A1 , A2 , . . . be a countable sequence of independent events   with P (Ai) < 1 for each i. Then P ∪∞ A = 1 if and only if i=1 i P (Ai i.o.) = 1. Suppose that P (B) = 1, where B = ∪∞ i=1 Ai . Then 

0 = P(B^C) = P(∩_{i=1}^∞ Ai^C) = P{(∩_{i=1}^n Ai^C) ∩ (∩_{i=n+1}^∞ Ai^C)} = P(Cn ∩ Dn) = P(Cn)P(Dn),    (10)

where Cn = ∩_{i=1}^n Ai^C and Dn = ∩_{i=n+1}^∞ Ai^C are independent. We have shown that P(Cn)P(Dn) = 0, and P(Cn) = Π_{i=1}^n P(Ai^C) is not 0 because each P(Ai) < 1. Therefore, P(Dn) = 0. This means that P(∪_{i=n+1}^∞ Ai) = P(Dn^C) = 1. Therefore,

1 = lim_{n→∞} P(∪_{i=n+1}^∞ Ai) = P(∩_{n=1}^∞ ∪_{i=n+1}^∞ Ai) = P(Ai i.o.).





Now suppose that P(Ai i.o.) = 1. By independence, P(∪_{i=1}^∞ Ai) = 1 − Π_{i=1}^∞ P(Ai^C). Also, because ln(1 − x) ≤ −x, ln{Π_{i=1}^∞ P(Ai^C)} = Σ_{i=1}^∞ ln{1 − P(Ai)} ≤ −Σ_{i=1}^∞ P(Ai) = −∞ by the Borel-Cantelli lemma. This implies that Π_{i=1}^∞ P(Ai^C) = exp(−∞) = 0 and 1 − Π_{i=1}^∞ P(Ai^C) = 1. That is, P(∪_{i=1}^∞ Ai) = 1.

10. Prove that if A1, ..., An are events such that Σ_{i=1}^n P(Ai) > n − 1, then P(∩_{i=1}^n Ai) > 0.

Let Yi = I(Ai) and Z = Σ_{i=1}^n Yi, the number of the Ai that occur. Then

E(Z) = E{Σ_{i=1}^n I(Ai)} = Σ_{i=1}^n P(Ai).

We now use a contrapositive argument. If P (∩ni=1 Ai ) = 0, then P (Z = n) = 0, in which case Z ≤ n − 1 with probability 1. Then E(Z) ≤ n − 1.  This completes the proof by contrapositive that ni=1 P (Ai ) > n − 1 ⇒ P (∩ni=1 Ai ) > 0.


80 11. Give an example to show that independence is required in the second part of the Borel-Cantelli lemma. Flip a fair coin and let A1 be the event that it is heads. Set A2 =  ∞ A1 , A3 = A1 , etc. Then ∞ n=1 P (An ) = n=1 (1/2) = ∞, but P (An i.o.) = P (A1 ) = 1/2. 12. * Show that almost sure convergence and convergence in Lp do not imply each other without further conditions. Specifically, use the Borel-Cantelli lemma to construct an example in which Xn takes only two possible values, one of which is 0, and: (a) Xn converges to 0 in Lp for every p > 0, but not almost surely. Let Xn be independent random variables with Xn =



0 w.p. 1 − 1/n 1 w.p. 1/n.

Then E(|Xn − 0|p ) = P (Xn = 1) = 1/n → 0. However, P (Xn = 1  i.o.) = 1 by the Borel-Cantelli Lemma because n P (Xn = 1) = ∞ n=1 1/n = ∞. Also, it is clear that P (Xn = 0 i.o.) = 1 as well. Therefore, with probability 1, Xn = 0 i.o. and Xn = 1 i.o. Each ω for which this happens is an ω for which Xn (ω) fails to converge. Therefore Xn does not converge a.s. to 0. (b) Xn converges almost surely to 0, but not in Lp for any p > 0 (hint: modify Example 6.34). Let Xn be independent random variables with Xn =



0 w.p. 1 − 1/n², and 2^n w.p. 1/n².

The Borel-Cantelli lemma implies that P (Xn = 2n i.o.) = 0 because 2 n n=1 1/n < ∞. Each ω for which Xn = 2 only finitely often is a.s. an ω for which Xn (ω) → 0, therefore, Xn → 0. However, E{|Xn − 0|p ) = 2np /n2 → ∞. ∞

p

13. Construct an example such that Xn are independent and Xn → 0, yet Sn/n does not converge almost surely to 0. Hint, let Xn take values 0 or 2n with certain probabilities. Let Xn be independent with Xn =

0 with probability 1 − 1/n, and 2^n with probability 1/n.

81 Then Sn /n ≥ Xn /n, and with probability 1, Xn /n = 2n /n for infinitely  n many n by the Borel-Cantelli lemma because ∞ n=1 P (Xn /n = 2 /n) = ∞ n n=1 1/n = ∞. Each ω for which Xn /n = 2 /n i.o. is an ω for which Xn (ω)/n does not converge to 0. Therefore, P (Sn /n → 0) = 0. 14. Let θˆn be an estimator with finite mean. p (a) If θˆn is consistent for θ (i.e., θˆn → θ), is θˆn asymptotically unbiased (i.e., E(θˆn) → θ)? No. For example, suppose that θ = 0 and

θˆn =



0 w.p. 1 − 1/n n w.p. 1/n.

p Then θˆn → 0 but E(θˆn) = 1. (b) If θˆn is asymptotically unbiased, is θˆn consistent? No. Let Xi be iid N(θ, 1) and θˆn = X1 . Then θˆn is unbiased but clearly θˆn does not converge in probability to θ. (c) If θˆn is asymptotically unbiased and var(θˆn) exists and tends to 0, is θˆn consistent? Yes. In that case

P(|θ̂n − θ| ≥ ε) = P{(θ̂n − θ)² ≥ ε²} ≤ E(θ̂n − θ)²/ε² = {var(θ̂n) + (bias)²}/ε² → 0 + 0 = 0.

15. Let Xi be iid random variables, and let Yn = Xn/n. Prove that Yn → 0 a.s. if and only if E(|X|) < ∞. Hint: use Proposition 5.9 in conjunction with the Borel-Cantelli lemma.

Suppose that Yn → 0. Then P (|Xn /n| ≥  i.o.) = 0. By the Borel ∞ Cantelli lemma, ∞ n=0 P (|Xn /n| ≥ ) < ∞. Equivalently, n=1 P (|Xn |/ ∞ ≥ n) < ∞. But the Xi are iid, so n=1 P (|X|/ ≥ n) < ∞, where X is a generic Xi . By Proposition 5.9, E(|X|/) < ∞. This implies that E(|X|) < ∞.

Now suppose that E(|X|) < ∞. Then E(|X|/) < ∞. By Proposition  ∞ 5.9, ∞ n=1 P (|X|/ ≥ n) < ∞, so n=1 P (|Xn |/ ≥ n) < ∞. By the Borel-Cantelli lemma, P (|Xn |/ ≥ n i.o.) = 0. That is, P (|Xn /n| ≥  i.o.) = 0. Now let Ak be the event that |Xn /n| > 1/k for infinitely many


82 n. Then P (Ak ) = 0 for each k, so P (∪k Ak ) = 0. But clearly, any ω for which |Xn /n| > 1/k for only finitely many n for every k is an ω for a.s. which |Xn (ω)/n| → 0. Therefore, Yn → 0.


83 Section 6.2.4 1. Prove the reverse direction of part 1 of Proposition 6.42. Suppose that every subsequence (M ) ⊂ (N ) contains a further suba.s. sequence (K) ⊂ (M ) such that Xk → X. If Xn does not converge in probability to X, then there must be an  > 0 and τ > 0 and a subsequence (M ) ⊂ (N ) such that P (|Xm − X| > ) ≥ τ for all m ∈ M . By hypothesis, there is a further subsequence (K) ⊂ (M ) a.s. such that Xk → X along k ∈ K. Because almost sure convergence p implies convergence in probability, Xk → X along k ∈ K. Also, because P (|Xm − X| > ) ≥ τ for all m ∈ M , and (K) ⊂ (M ), P (|Xk − X| > ) ≥ τ for all k ∈ K, contradicting the fact that p Xk → X along k ∈ K. Therefore, the original assumption that Xn does not converge in probability to X is wrong. 2. State whether the following is true: Xn converges in probability if and only if every subsequence contains a further subsequence that converges almost surely. If true, prove it. If not, give a counterexample. False. Let Xn = 1 if n is odd and 0 if n is even. Then Xn is a sequence of numbers such that every subsequence (M ) contains either infinitely many odd m or infinitely many even m. If there are infinitely many odd m, take (K) to be the odd indices of (M ). If there are only finitely many odd m, take (K) to be the even indices of (M ). Then clearly the sequence of numbers Xk are all the same, so this sequence converges. Therefore, Xk converges a.s. along k ∈ K. However, the original sequence clearly does not converge in probability to anything. 3. Determine and demonstrate whether the distribution functions associated with the following sequences of random variables are tight. (a) Xn =



0 with probability 1/2, and n^{1/2} with probability 1/2.

Not tight because P (|Xn | > M ) = 1/2 for all n > M 2 .

(b) Xn = ln(Un), where Un is uniform [1/n, 1]. Tight because, for n ≥ 2, P {| ln(Un )| ≥ M } = P {ln(Un ) ≤ −M } = P {Un ≤ exp(−M )}, and P {Un ≤ exp(−M )} =

0 if n ≤ exp(M), and {exp(−M) − 1/n}/(1 − 1/n) if n > exp(M).

Also,

{exp(−M) − 1/n}/(1 − 1/n) ≤ exp(−M)/(1 − 1/2) = 2 exp(−M)

for n ≥ 2. To make this probability less than or equal to , take 2 exp(−M ) ≤  (i.e., M ≥ − ln(/2)).

(c) Xn = cos(nU ), where U is uniform [0, 1]. Tight because cos(nU ) is bounded between −1 and 1, so Fn(−2) = 0 and 1 − Fn(2) = 0 for all n.

(d) Xn ∼ N(n, 1). Not tight because 1 − Fn (M ) ≥ 1/2 for all n ≥ M .

(e) Xn ∼ N(0, 1 + 1/n). Tight. For given , choose M large enough that P {N(0, 2) > M } ≤ . Then P {N(0, 2) ≤ −M } ≤  as well. Furthermore, P {N(0, 1 + 1/n) > M } is decreasing in n for n ≥ 1 and M > 0, so P {N(0, 1 + 1/n) > M } ≤  and P {N(0, 1+1/n) ≤ −M } ≤  for all n = 1, 2, . . .

4. Use subsequence arguments to prove items 1 and 2 of Proposition 6.12. p

p

For part 1, suppose Xn → X and Xn → X  . Then there is a subsequence p a.s. (M ) such that Xm → X along m ∈ M . Also, Xm → X  along m ∈ M , so a.s. there exists a further subsequence (K) ⊂ (M ) such that Xk → X  . Also, a.s. a.s. because Xm → X along m ∈ M , and (K) ⊂ (M ), Xk → X along k ∈ K. We have shown that Xk converges almost surely to both X and X  . But for given ω, the sequence of numbers xk = Xk (ω) cannot converge to 2 different limits, so X(ω) = X  (ω) for all ω such that Xk (ω) → X(ω) and Xk (ω) → X  (ω). Therefore, X = X  with probability 1. p

For part 2, Suppose that Xn → X and f is continuous except on a set D such that X −1 (D) ∈ F and P (X ∈ D) = 0. Let (M ) ⊂ (N ). There is a.s. a further subsequence (K) ⊂ (M ) such that Xk → X along k ∈ K. But each ω such that Xk (ω) → X(ω) along (K) and X(ω) ∈ DC is an ω for a.s. which f (Xk ) → f (X). Therefore, we have shown that every subsequence a.s. (M ) contains a further subsequence (K) such that f (Xk ) → f (X). It follows that f (Xn ) converges in probability to f (X).

5. Use a subsequence argument to prove items 4 and 5 of Proposition 6.12. Because Xn converges in probability to X, every subsequence (M ) cona.s. tains a further subsequence (K) such that Xk → X. Along (K), Yk converges in probability to Y . Therefore, there is a further subsequence


85 a.s.

a.s.

(J) ⊂ (K) such that Yj → Y . Therefore, along (J), Xj → X and a.s. a.s. Yj → Y . Therefore, Xj Yj → XY along (J). Therefore, every subsea.s. quence (M ) contains a further subsequence (J) such that Xj Yj → XY . p It follows that Xn Yn → XY . The same argument works for part 5.

6. Prove Proposition 6.44 using Proposition 6.29 and a subsequence argument. One direction is clear, so suppose that every subsequence contains a further subsequence that converges to X in distribution. Then every subsequence (M ) contains a further subsequence (K) such that E{f (Xk )} converges to E{f (X)} for each bounded continuous function f . But then E{f (Xn )} converges to E{f (X)} because of the subsequence argument for a sequence of numbers (Proposition 6.41). Therefore, by Proposition D 6.29, Xn → X. 7. Provide an alternative proof of Proposition 6.44 by contradiction using the definition of convergence in distribution and a subsequence argument. Again one direction is clear, so suppose that every subsequence has a further subsequence that converges in distribution to X. If Xn does not converge to X in distribution, then there exists a continuity point x of F and an  > 0 and a subsequence (M ) ⊂ (N ) such that |Fm (x)−F (x)| ≥  for all m ∈ M . By assumption, there is a further subsequence (K) ⊂ (M ) such that |Fk (x) − F (x)| → 0 along k ∈ K. But this contradicts the fact that |Fm (x) − F (x)| ≥  for all m ∈ M and the fact that (K) ⊂ (M ). 8. Prove that Xn → X in Lp if and only if each subsequence (M ) contains a further subsequence (K) ⊂ (M ) such that Xk → X in Lp along (K). This follows immediately from the fact that an = E(|Xn − X|p ) is just a sequence of numbers, so an → 0 if and only if each subsequence (M ) contains a further subsequence (K) ⊂ (M ) such that ak → 0 by Proposition 6.41.


Section 6.2.5

1. Give an example to show that the following proposition is false. If Xn →p X and Yn →D Y, then Xn + Yn →D X + Y.

If Xn and Yn are on different probability spaces, then Xn + Yn is not even defined.

2. In the first step of the proof of Slutsky's theorem, we asserted that Xn →D X implies that AXn + B →D AX + B for constants A and B. Why does this follow?

Because f(x) = Ax + B is a continuous function of x (Proposition 6.30).

3. Show that convergence in distribution to a constant is equivalent to convergence in probability to that constant.

If Xn →D c, then by definition of convergence in distribution, Fn(x) → F(x) for each continuity point x of F. The only discontinuity point of F is c. Therefore,

P(|Xn − c| ≥ ε) = P(Xn ≤ c − ε) + P(Xn ≥ c + ε) = Fn(c − ε) + P(Xn ≥ c + ε) ≤ Fn(c − ε) + 1 − Fn(c + ε/2) → 0 + 0 = 0.

Thus, Xn →p c. Now suppose that Xn →p c. Then P(Xn ≤ c − ε) → 0, so Fn(c − ε) → F(c − ε) = 0. Also P(Xn ≥ c + ε) → 0, so 1 − Fn(c + ε) ≤ P(Xn ≥ c + ε) → 0. Of course this implies that Fn(c + ε) → 1. Therefore, Fn(x) → 0 for x < c and Fn(x) → 1 for x > c. That is, Fn(x) converges at all continuity points to the distribution function for a point mass at c.

4. Let Xn and Yn be random variables on the same probability space with Xn →p 0 and Yn →D Y. Prove that XnYn →p 0.

Let ε > 0 and B > 0 be given. We will prove that there is an N such that P(|XnYn| ≥ B) < ε for all n ≥ N. Because Yn converges in distribution, the distribution function Fn of Yn is tight. Therefore, for the given ε, there exists an A such that P(|Yn| ≥ A) < ε/2 for all n. Because Xn →p 0, we can find an N such that P(|Xn| ≥ B/A) < ε/2 for n ≥ N. The event |XnYn| ≥ B requires either |Yn| ≥ A or |Xn| ≥ B/A. Therefore, for n ≥ N, P(|XnYn| ≥ B) ≤ P(|Yn| ≥ A) + P(|Xn| ≥ B/A) < ε/2 + ε/2 = ε. This completes the proof that P(|XnYn| ≥ B) → 0 as n → ∞.


87 5. ↑ In the preceding problem, why is there a problem if you try to use the Skorokhod representation theorem to prove the result? The Skorokhod representation theorem says that there exist random variables Yn and Y  on some probability space such that Yn has the same distribution as Yn , Y  has the same distribution as Y , and Yn converges almost surely to Y  . The problem is that Xn was on the original probability space. Therefore, it does not make sense to talk about Xn + Yn . 6. What is the problem with the following attempted proof of Slutsky’s theorem? As noted in the first paragraph of the proof, p D D it suffices to prove that if Yn → Y and Cn → 0, then Yn +Cn → Y. (a) By the Skorokhod representation theorem, there exist random variables Yn and Y  such that Yn has the same distribution as Yn, Y  has the same distribution as Y , and a.s. Yn → Y . Correct. (b) Because almost sure convergence implies convergence in p probability (Proposition 6.37), Yn → Y . Correct. p

p

(c) Since Cn → 0 by assumption, Yn + Cn → Y  + 0 = Y  by part 3 of Proposition 6.12. False. The problem is that Cn and Yn are not necessarily on the same probability space. The Skorokhod representation theorem says that there exists a probability space on which we can define a Yn with the stated properties, but Cn is on the original probability space, not the new space. (d) Also, convergence in probability implies convergence in D distribution (Proposition 6.35). Therefore, Yn + Cn → Y . Correct. (e) But Yn has the same distribution as Yn and Y  has the same D distribution as Y , so Yn + Cn → Y as well, completing the proof. Correct. 7. In Chapter 8, we will prove the following. If Xi are iid Bernoulli random variables with parameter p ∈ (0, 1), and pˆn is the proportion of Xi = 1 among X1 , X2 , . . . , Xn, then Zn = n1/2 (pˆn −


p)/{p(1 − p)}^{1/2} →D N(0, 1). Taking this as a fact, prove that Z̃n = n^{1/2}(p̂n − p)/{p̂n(1 − p̂n)}^{1/2} →D N(0, 1).

Write

Z̃n = n^{1/2}(p̂n − p)/{p̂n(1 − p̂n)}^{1/2} = [n^{1/2}(p̂n − p)/{p(1 − p)}^{1/2}] · [{p(1 − p)}/{p̂n(1 − p̂n)}]^{1/2}.

The first term converges in distribution to N(0, 1), whereas [{p(1 − p)}/{p̂n(1 − p̂n)}]^{1/2} converges in probability to 1. By Slutsky's theorem, Z̃n →D N(0, 1).

8. If Xn is asymptotically normal with mean nµ and variance nσ², does Slutsky's theorem imply that {n/(n + 1)}^{1/2}Xn is also asymptotically normal with mean nµ and variance nσ²? Explain.

No. Slutsky's theorem would apply if Xn converged in distribution, which it does not. {n/(n + 1)}^{1/2}Xn is asymptotically normal with mean {n/(n + 1)}^{1/2}nµ and variance {n²/(n + 1)}σ².
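As an informal numerical companion to problem 7 (our addition, not part of the original solution), the following Python sketch simulates the studentized proportion and compares its quantiles with standard normal quantiles; the values of p, n, and the number of replications are arbitrary choices, and numpy/scipy are assumed to be available.

```python
# Monte Carlo sketch: Z~n = sqrt(n)*(p_hat - p)/sqrt(p_hat*(1 - p_hat)) should be
# approximately N(0,1) for large n, as problem 7 asserts.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, n, reps = 0.3, 2000, 20000
p_hat = rng.binomial(n, p, size=reps) / n
z_tilde = np.sqrt(n) * (p_hat - p) / np.sqrt(p_hat * (1 - p_hat))

# empirical quantiles of Z~n versus standard normal quantiles
for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(q, round(float(np.quantile(z_tilde, q)), 3), round(float(norm.ppf(q)), 3))
```

At this sample size the two columns should agree to roughly two decimal places.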


Section 6.3

1. Show that Proposition 6.55 is not necessarily true for convergence in distribution. For example, let (Xn, Yn) be bivariate normal with means (0, 0), variances (1, 1) and correlation ρn = −0.5 if n is odd and ρn = +0.5 if n is even. Then the marginal distributions of Xn and Yn converge weakly to standard normals, but the joint distribution does not converge weakly.

If (Xn, Yn) are bivariate normal with means (0, 0), variances (1, 1) and correlation ρn = −0.5 if n is odd and +0.5 if n is even, then (Xn, Yn) converges in distribution to a bivariate normal with correlation 0.5 along the subsequence of even integers, and to a bivariate normal with correlation −0.5 along the subsequence of odd integers. Because these are two different distributions, (Xn, Yn) does not converge in distribution.

2. Let Xn1, . . . , Xnk be independent with respective distribution functions Fni converging weakly to Fi, i = 1, . . . , k. Prove that

(a) (Xn1, . . . , Xnk) →D (X1, . . . , Xk) as n → ∞, where the Xi are independent and Xi ∼ Fi.

The joint distribution function Fn(x1, . . . , xk) of (Xn1, . . . , Xnk) is ∏_{i=1}^k Fni(xi). If xi is a continuity point of Fi for each i, then ∏_{i=1}^k Fni(xi) → ∏_{i=1}^k Fi(xi), which is the joint distribution function of independent random variables X1, . . . , Xk with marginals F1, . . . , Fk. Also, every continuity point (x1, . . . , xk) of ∏_{i=1}^k Fi(xi) is such that xi is a continuity point of Fi. Therefore, (Xn1, . . . , Xnk) →D (X1, . . . , Xk).

(b) For any constants a1, . . . , ak, Σ_{i=1}^k ai Xni →D Σ_{i=1}^k ai Xi as n → ∞.

This follows from Theorem 6.59 and the fact that f(x) = Σ_{i=1}^k ai xi is a continuous function of x.

3. Let (Xn, Yn) be bivariate normal with means (0, 0), variances (1/n, 1), and correlation ρn = 1/2. What is the limiting distribution of Xn + Yn? What would the limiting distribution be if the variances of (Xn, Yn) were (1 + 1/n, 1)? Justify your answers.


If the variances are (1/n, 1), then Xn converges in probability to 0, while Yn converges in distribution to N(0, 1). By Slutsky's theorem, Xn + Yn converges in distribution to N(0, 1). If the variances were (1 + 1/n, 1), then the distribution of (Xn, Yn) converges to a bivariate normal with mean vector (0, 0), variances (1, 1), and correlation 0.5 because the bivariate normal distribution function is a continuous function of (µX, µY, σ_X², σ_Y², ρ). Therefore, Xn + Yn converges to a normal with mean 0 and variance σ_X² + σ_Y² + 2ρσXσY = 1 + 1 + 2(0.5)(1)(1) = 3 because of the well-known result that a linear combination of bivariate normals is normal.

4. Suppose that (Xn, Yn) →D (X, Y). Prove that Xn →D X.

The function f(x, y) = x is a continuous function of (x, y). Therefore, by the Mann-Wald theorem (Theorem 6.59), f(Xn, Yn) = Xn converges in distribution to f(X, Y) = X.

5. Prove Proposition 6.55.

The almost sure convergence result follows from the corresponding property for sequences of vectors of numbers because, for fixed ω, xn1 = Xn1(ω), . . . , xnk = Xnk(ω) is just a sequence of vectors of numbers.

For the convergence in probability result, suppose that Xni →p Xi for i = 1, . . . , k. Then the event {Σ_{i=1}^k (Xni − Xi)²}^{1/2} ≥ ε implies that at least one of |Xni − Xi|, i = 1, . . . , k, is at least as large as ε/k^{1/2}. Therefore,

P[{Σ_{i=1}^k (Xni − Xi)²}^{1/2} ≥ ε] ≤ Σ_{i=1}^k P(|Xni − Xi| ≥ ε/k^{1/2}) → 0.

This proves that if each Xni converges in probability to Xi, then Xn converges in probability to X. The reverse direction is immediate: if Xn converges in probability to X, then |Xni − Xi| ≤ {Σ_{j=1}^k (Xnj − Xj)²}^{1/2}, so if P[{Σ_{j=1}^k (Xnj − Xj)²}^{1/2} ≥ ε] → 0, then P(|Xni − Xi| ≥ ε) → 0 as well.


91 Section 7.1 1. Prove that if X1 , . . . , Xn are iid Bernoulli (1/2) random variables, then a.s.

(a) Sn/nr → 0 if r > 1. a.s. a.s. By the SLLN, Sn /n → 1/2, so Sn /nr = (Sn /n)n1−r → (1/2)·0 = 0.

(b) With probability 1, infinitely many Xi must be 0. Each ω for which only finitely many Xi are 0 is an ω for which Sn (ω)/n → 1. This set of ω must have probability 0 because with probability 1, Sn (ω)/n → 1/2 by the SLLN.

(c) P (A) = 0, where A = {fewer than 40 percent of X1 , . . . , Xn are 1 for infinitely many n}. With probability 1, Sn /n → 1/2. Therefore, for all ω outside a null set, there is an N such that Sn /n ≥ .45 for all n ≥ N . This clearly precludes event A.

2. The following statements about iid random variables X1 , X2 . . . with mean µ are all consequences of the WLLN or SLLN. For each, say whether the result follows from the WLLN, or it requires the SLLN. ¯ n > µ − .001) ≥ .999 for all (a) There is an N such that P (X n ≥ N. WLLN. (b) With probability at least .999, there is an N = N (ω) such ¯ n > µ − .001 for all n ≥ N . that X SLLN. ¯ n) > µ − .001} = 1. (c) P {lim (X SLLN.

¯ n > µ − .001)} = 1. (d) lim {P (X WLLN. 3. Let X1 , X2 , . . . be iid with distribution function F having a unique median θ. Prove that the sample median θˆ converges almost surely to θ. For each positive integer k, the event Ak that θˆn < θ − 1/k infinitely often in n has probability 0 (note that for clarity, we denote θˆ by θˆn ). To see this, note first that F (θ − 1/k) < 1/2 (otherwise F would not have


a unique median). By the SLLN, (1/n)Σ_{i=1}^n I(Xi < θ − 1/k) converges almost surely to a number less than 1/2, yet each ω ∈ Ak is an ω for which (1/n)Σ_{i=1}^n I(Xi < θ − 1/k) is at least 1/2 for infinitely many n. Accordingly, P(Ak) = 0 for each k. It follows that P(θ̂n < θ − 1/k infinitely often in n for at least one positive integer k) ≤ Σ_{k=1}^∞ P(Ak) = 0. Similarly, P(θ̂n > θ + 1/k infinitely often in n for at least one positive integer k) = 0. But each ω such that, for all k, θ − 1/k ≤ θ̂n ≤ θ + 1/k for all but finitely many n is an ω for which θ̂n → θ. Therefore, θ̂n →a.s. θ.

4. In analysis of variance with k groups and ni observations in group i, the pooled variance is σ̂² = Σ_{i=1}^k Σ_{j=1}^{ni} (Yij − Ȳi)² / Σ_{i=1}^k (ni − 1). Assume that the Yij are independent and identically distributed, for j = 1, . . . , ni, with mean µi and variance σ². Prove that σ̂² converges almost surely to the common within-group variance σ² as k remains fixed and each ni → ∞.

By Example 7.6, the sample variance s_i² = (ni − 1)^{-1} Σ_{j=1}^{ni} (Yij − Ȳi)² within group i converges almost surely to σ² as ni → ∞, i = 1, . . . , k. Let ε > 0 be given. We will show that we can find an N such that σ² − ε ≤ σ̂² ≤ σ² + ε whenever each ni ≥ N.

For each ω outside a null set, there is an Ni such that σ² − ε ≤ s_i²(ω) ≤ σ² + ε for all ni ≥ Ni. Take N = max_i(Ni). Then σ² − ε ≤ s_i²(ω) ≤ σ² + ε whenever ni ≥ N, for each i = 1, . . . , k. Let λ̂_{i,ni} = (ni − 1)/Σ_{j=1}^k (nj − 1), so that Σ_{i=1}^k λ̂_{i,ni} = 1. When each ni ≥ N, the pooled variance, Σ_{i=1}^k λ̂_{i,ni} s_i², satisfies

σ² − ε = Σ_{i=1}^k λ̂_{i,ni}(σ² − ε) ≤ Σ_{i=1}^k λ̂_{i,ni} s_i² ≤ Σ_{i=1}^k λ̂_{i,ni}(σ² + ε) = σ² + ε.

But Σ_{i=1}^k λ̂_{i,ni} s_i² is σ̂². We have shown that, except on a null set, for each ε > 0, there is an N such that σ² − ε ≤ σ̂² ≤ σ² + ε whenever each ni ≥ N. It follows that σ̂² →a.s. σ².

5. Pick countably infinitely many letters randomly and independently from the 26-letter alphabet. Let X1 be the indicator that letters 1-10 spell “ridiculous,” X2 be the indicator that letters 2-11 spell “ridiculous,” X3 be the indicator that letters 3-12 spell “ridiculous,” etc.

(a) Can you apply the SLLN to X1, X2, . . .?

No, because X1, X2, . . . contain overlapping letters, and are therefore not independent.


(b) What if X1 had been the indicator that letters 1-10 spell “ridiculous,” X2 had been the indicator that letters 11-20 spell “ridiculous,” X3 had been the indicator that letters 21-30 spell “ridiculous,” etc.

Yes. Now the Xi are iid.

(c) Prove that the probability that some set of 10 consecutive letters spells “ridiculous” is 1.

Define Xi as in part b, and Sn = Σ_{i=1}^n Xi. Then the Xi are iid Bernoulli with probability p = (1/26)^{10}. By the SLLN, Sn/n →a.s. p. But each ω for which Sn(ω)/n → p is an ω for which at least one of X1, X2, . . . is 1. Therefore, the probability that at least one set of 10 consecutive letters among letters 1, 2, . . . spells “ridiculous” is at least as large as P(at least one of X1, X2, . . . is 1), which is 1.

6. Let (X1, Y1), (X2, Y2), . . . be independent, though (Xi, Yi) may be correlated. Assume that the Xi are identically distributed with finite mean E(X1) = µX and the Yi are identically distributed with finite mean E(Y) = µY. Which of the following are necessarily true?

(a) (1/n)Σ_{i=1}^n (Xi + Yi) →a.s. µX + µY.

True, because (1/n)Σ_{i=1}^n Xi →a.s. µX and (1/n)Σ_{i=1}^n Yi →a.s. µY. Therefore, (1/n)Σ_{i=1}^n Xi + (1/n)Σ_{i=1}^n Yi →a.s. µX + µY.

(b) You flip a single coin. If it is heads, set Zn = (1/n)Σ_{i=1}^n Xi, n = 1, 2, . . . If it is tails, set Zn = (1/n)Σ_{i=1}^n Yi, n = 1, 2, . . . Then Zn →a.s. (µX + µY)/2.

Only if µX = µY. If µX ≠ µY, then Zn converges almost surely to a random variable Z taking values µX with probability 1/2 and µY with probability 1/2.

(c) You flip a coin for each i. Set Ui = Xi if the ith flip is heads, and Ui = Yi if the ith flip is tails. Then (1/n)Σ_{i=1}^n Ui →a.s. (µX + µY)/2.

True. The Ui are iid with mean (1/2)E(Xi) + (1/2)E(Yi) = (1/2)(µX + µY). Therefore, the result holds by the SLLN.

(d) You generate an infinite sequence U1, U2, . . . as described in part c, and let Z1(ω) = lim_{n→∞}(1/n)Σ_{i=1}^n Ui(ω) if this limit exists and is finite, and Z1(ω) = 0 if the limit either does not exist or is infinite. Repeat the entire experiment countably infinitely many times, and let Z1, Z2, . . . be the results. Then P{ω : Zi(ω) = (µX + µY)/2 for all i} = 1.


True. Let Ci be the event that Zi ≠ (µX + µY)/2. Then P(Ci) = 0 for each i, so P(∪_{i=1}^∞ Ci) ≤ Σ_{i=1}^∞ P(Ci) = 0. Therefore, with probability 1, Z1, Z2, . . . all equal (µX + µY)/2.

(e) Form all possible combinations obtained by selecting either Xi or Yi, i = 1, 2, . . . Let s index the different combinations; for the sth combination, let Zs denote the corresponding limit in part d. Then P{ω : Zs(ω) = (µX + µY)/2 for all s} = 1.

Not necessarily. It is true that for each s, P{Zs = (µX + µY)/2} = 1. However, there are uncountably many s, so we cannot conclude that P[∪s {Zs ≠ (µX + µY)/2}] ≤ Σs P{Zs ≠ (µX + µY)/2} = 0.

7. A gambler begins with an amount A0 of money. Each time he bets, he bets half of his money, and each time he has a 60% chance of winning.

(a) Show that the amount of money An that the gambler has after n bets may be written as A0 ∏_{i=1}^n Xi for iid random variables Xi. Show that E(An) → ∞ as n → ∞.

After the first bet, he has A1 = A0X1, where X1 = 1/2 with probability .4 and X1 = 3/2 with probability .6. After the second bet, he has A0X1X2, where X2 is independent of, and has the same distribution as, X1. Similarly, after n bets, the gambler has An = A0 ∏_{i=1}^n Xi, where the Xi are iid taking values (1/2) or (3/2) with probabilities 0.4 and 0.6, respectively. Because the Xi are independent, E(An) = A0{E(X1)}^n = A0{(.4)(1/2) + (.6)(3/2)}^n = A0(1.1)^n → ∞.

(b) Show that An → 0 a.s.

Take the log: ln(A0 ∏_{i=1}^n Xi) = ln(A0) + Σ_{i=1}^n ln(Xi). By the SLLN, (1/n)Σ_{i=1}^n ln(Xi) →a.s. E{ln(Xi)}, and E{ln(Xi)} = (.4)ln(1/2) + (.6)ln(3/2) < 0. Also, every ω for which (1/n)Σ_{i=1}^n ln(Xi) → E{ln(Xi)} < 0 is an ω for which Σ_{i=1}^n ln(Xi) → −∞. Therefore, with probability 1, A0 ∏_{i=1}^n Xi → exp(−∞) = 0. Therefore, the gambler’s money will eventually be depleted.
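A small simulation (our addition; the constants are illustrative and numpy is assumed available) makes the contrast in problem 7 concrete: the mean fortune grows like A0(1.1)^n, yet essentially every simulated path is wiped out.

```python
# Each path is A_n = A_0 * prod X_i with X_i = 1/2 (prob .4) or 3/2 (prob .6).
import numpy as np

rng = np.random.default_rng(1)
A0, n_bets, n_paths = 1.0, 5000, 1000
x = rng.choice([0.5, 1.5], size=(n_paths, n_bets), p=[0.4, 0.6])
log_final = np.log(A0) + np.log(x).sum(axis=1)      # log of each final fortune

print("log10 of E(A_n) = log10(A0 * 1.1**n):", n_bets * np.log10(1.1))   # about 207
print("median simulated fortune:", np.median(np.exp(log_final)))          # essentially 0
print("fraction of paths below 1e-6:", float((log_final < np.log(1e-6)).mean()))
```

The enormous mean is driven by paths so rare that none of them typically appear in the simulation, which is exactly the point of the problem.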

8. Prove that if (X1 , Y1 ), . . . , (Xn, Yn) are iid pairs with 0 < var(X) < ∞ and 0 < var(Y ) < ∞, then the sample correlation coefficient Rn converges almost surely to ρ = E{(X − µX )(Y − µY )}/(σX σY ). Prove that Rn converges to ρ in L1 as well.


Begin with

Zn = (n − 1)^{-1} Σ_{i=1}^n (Xi − µX)(Yi − µY) / [{(n − 1)^{-1} Σ_{i=1}^n (Xi − X̄)²}^{1/2} {(n − 1)^{-1} Σ_{i=1}^n (Yi − Ȳ)²}^{1/2}],   (11)

which is the same as Rn except that, in the numerator, X̄ and Ȳ are replaced by µX and µY. To see that Zn converges almost surely to ρ, note first that the denominator converges to (σ_X² σ_Y²)^{1/2} = σXσY by Example 7.6. For the numerator, note first that E{|(X − µX)(Y − µY)|} ≤ {E(X − µX)² E(Y − µY)²}^{1/2} < ∞ by Holder’s inequality. By the SLLN, (1/n)Σ_{i=1}^n (Xi − µX)(Yi − µY) converges almost surely to E{(X − µX)(Y − µY)}, so the numerator, {n/(n − 1)}(1/n)Σ_{i=1}^n (Xi − µX)(Yi − µY), converges almost surely to (1)E{(X − µX)(Y − µY)}. Therefore, (11) converges almost surely to (σXσY)^{-1} E{(X − µX)(Y − µY)} = ρ.

Now consider Rn − Zn. The numerators of Rn and Zn differ by

(n − 1)^{-1} Σ_{i=1}^n {(Xi − X̄)(Yi − Ȳ) − (Xi − µX)(Yi − µY)} = −{n/(n − 1)}(X̄ − µX)(Ȳ − µY),

so

Rn − Zn = −{n/(n − 1)}(X̄ − µX)(Ȳ − µY) / [{(n − 1)^{-1} Σ_{i=1}^n (Xi − X̄)²}^{1/2} {(n − 1)^{-1} Σ_{i=1}^n (Yi − Ȳ)²}^{1/2}].

As noted before, the denominator converges almost surely to σXσY, while the numerator converges almost surely to 0 because X̄ →a.s. µX and Ȳ →a.s. µY (SLLN).

We have established that Rn = Zn + (Rn − Zn) →a.s. ρ + 0 = ρ, which proves the almost sure convergence. For the L1 convergence, note that |Rn| ≤ 1 by the Cauchy-Schwarz inequality and |ρ| ≤ 1, so |Rn − ρ| ≤ 2. Because Rn − ρ → 0 almost surely and is bounded, the bounded convergence theorem gives E(|Rn − ρ|) → 0; that is, Rn → ρ in L1.

9. Consider the analysis of covariance (ANCOVA) model Yi = β0 + β1Xi + β2zi + εi, where Xi and Yi are the baseline and end of study values of a continuous variable like cholesterol, zi is a treatment indicator (1 for treatment, 0 for control), and the εi are iid from a distribution with mean 0 and variance σ_ε². Though we sometimes consider the xi as fixed constants, regard the Xi now as iid random variables with finite, nonzero variance σ_X², and assume that the vector of Xs is independent of the vector of εs. The estimated treatment effect in this model is β̂2 = ȲT − ȲC − β̂1(X̄T − X̄C), where

β̂1 = [Σ_T (Xi − X̄T)(Yi − ȲT) + Σ_C (Xi − X̄C)(Yi − ȲC)] / [Σ_T (Xi − X̄T)² + Σ_C (Xi − X̄C)²],

with Σ_T and Σ_C denoting sums over the treatment and control groups.

Let n = nT + nC, and suppose that nT/nC → λ as n → ∞. Prove that β̂1 converges almost surely to β1 and β̂2 converges almost surely to β2.

Within either group,

Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) = Σ_{i=1}^n XiYi − Ȳ Σ_{i=1}^n Xi − X̄ Σ_{i=1}^n Yi + nX̄Ȳ = Σ_{i=1}^n XiYi − nX̄Ȳ − nX̄Ȳ + nX̄Ȳ = Σ_{i=1}^n XiYi − nX̄Ȳ,

and similarly, Σ_{i=1}^n (Xi − X̄)² = Σ_{i=1}^n Xi² − nX̄². Therefore,

β̂1 = [Σ_T XiYi − nT X̄T ȲT + Σ_C XiYi − nC X̄C ȲC] / [Σ_T Xi² − nT X̄T² + Σ_C Xi² − nC X̄C²]
   = [(nT/nC){(1/nT)Σ_T XiYi − X̄T ȲT} + (1/nC)Σ_C XiYi − X̄C ȲC] / [(nT/nC){(1/nT)Σ_T Xi² − X̄T²} + (1/nC)Σ_C Xi² − X̄C²],   (12)

dividing numerator and denominator by nC.

Within each treatment group, the Xi are iid with finite mean, as are the Xi², the Yi, the Yi², and the XiYi (Holder’s inequality). By the SLLN, X̄T and X̄C converge almost surely to µX, (1/nT)Σ_T Xi² and (1/nC)Σ_C Xi² converge almost surely to E(X²), and similarly for the Ys. Furthermore, (1/nT)Σ_T XiYi and (1/nC)Σ_C XiYi converge almost surely to E(XY)_T and E(XY)_C, respectively. Also, nT/nC → λ by assumption. Therefore, by Equation (12),

β̂1 →a.s. [λ{E(XY)_T − µ_{XT}µ_{YT}} + E(XY)_C − µ_{XC}µ_{YC}] / [λ{E(X_T²) − µ_X²} + E(X_C²) − µ_X²] = cov(X, Y)(λ + 1) / {σ_X²(λ + 1)} = cov(X, Y)/σ_X²   (13)

because, under the model Yi = β0 + β1Xi + β2zi + εi, the covariance between X and Y is the same in the two treatment groups. It is

cov(Xi, Yi) = cov(Xi, β0 + β1Xi + β2zi + εi) = β1 cov(Xi, Xi) = β1σ_X².

Therefore, by Equation (13), β̂1 →a.s. β1σ_X²/σ_X² = β1. Finally, by the SLLN, ȲT →a.s. β0 + β1µX + β2, ȲC →a.s. β0 + β1µX, and X̄T − X̄C →a.s. 0, so β̂2 = ȲT − ȲC − β̂1(X̄T − X̄C) →a.s. β2.


10. Use the weak law of large numbers to prove that if Yn is Poisson with parameter n, n = 1, 2, . . ., then P(|Yn/n − 1| > ε) → 0 as n → ∞ for every ε > 0 (hint: what is the distribution of the sum of independent Poisson(1)s?). Can the strong law of large numbers be used to conclude that Yn/n →a.s. 1?

Yn has the same distribution as Sn = Σ_{i=1}^n Xi, where the Xi are iid Poisson with parameter 1. By the WLLN, Sn/n converges in probability to E(X) = 1. Therefore, P(|Yn/n − 1| > ε) = P(|Sn/n − 1| > ε) → 0.

The SLLN cannot be used to conclude that Yn/n →a.s. 1; Yn has the same marginal distribution as Sn = Σ_{i=1}^n Xi, but that does not mean that joint probabilities involving Y1/1, Y2/2, Y3/3, . . . are the same as those involving S1/1, S2/2, S3/3, . . . From the information given, we do not even know the joint distribution of (Yn/n, Yn+1/(n + 1)). Therefore, we cannot use the SLLN to conclude that Yn/n →a.s. 1.

Even though we cannot use the SLLN, the following argument does show that Yn/n →a.s. 1. If Y is Poisson with parameter λ, E(Y − λ)^4 = λ + 3λ². Therefore,

P(|Yn/n − 1| > ε) = P{(Yn/n − 1)^4 > ε^4} ≤ E(Yn/n − 1)^4/ε^4 (Markov’s inequality) = E(Yn − n)^4/(ε^4 n^4) = (n + 3n²)/(ε^4 n^4),

which is of order 1/n². Because Σ_{n=1}^∞ (1/n²) < ∞, the Borel-Cantelli lemma implies that P(A_ε) = 0, where A_ε = {ω : |Yn(ω)/n − 1| > ε i.o. in n}. This holds for every ε > 0, so P(∪_{k=1}^∞ A_{1/k}) = 0. This says that for ω outside a set of probability 0, |Yn/n − 1| > 1/k for only finitely many n for each k = 1, 2, . . . This clearly implies that Yn/n →a.s. 1.

11. ↑ Prove that if f is a bounded continuous function on x ≥ 0, then f(x) = lim_{n→∞} Σ_{k=0}^∞ f(k/n) exp(−nx)(nx)^k/k! (see the proof of Theorem 7.8).

Let Xi be iid Poisson with parameter x, and let Sn = Σ_{i=1}^n Xi. By the SLLN, Sn/n →a.s. E(X1) = x. Because f is continuous, f(Sn/n) →a.s. f(x). By the bounded convergence theorem, E{f(Sn/n)} → f(x). But Sn is Poisson with parameter nx, so E{f(Sn/n)} = Σ_{k=0}^∞ f(k/n)P(Sn = k) = Σ_{k=0}^∞ f(k/n) exp(−nx)(nx)^k/k!. This completes the proof.
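A quick numerical sketch of problem 11 (our addition): the Poisson smoothing sum can be evaluated directly for a particular bounded continuous f. The test function, the evaluation points, and the truncation point kmax are arbitrary choices, and scipy is assumed available.

```python
# sum_k f(k/n) * P(Poisson(n*x) = k) should approach f(x) as n grows.
import numpy as np
from scipy.stats import poisson

def poisson_smooth(f, x, n, kmax=5000):
    k = np.arange(kmax)                      # kmax chosen to capture essentially all the mass
    return float(np.sum(f(k / n) * poisson.pmf(k, n * x)))

f = lambda t: np.cos(t) / (1.0 + t)          # bounded and continuous on [0, infinity)
for x in (0.5, 2.0):
    for n in (5, 50, 500):
        print(f"x={x}, n={n}: {poisson_smooth(f, x, n):.5f}   target f(x) = {f(x):.5f}")
```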


98 Section 7.2 1. Let an be a sequence of numbers such that an → a as n → ∞.  Prove that (1/n) n i=1 ai → a. 

Let  > 0 be given. We will show that lim n→∞ |(1/n) ni=1 ai − a| < . There exists an M such that |an − a| <  for n ≥ M . Also,  n   (1/n) (ai  i=1

  − a)

≤ (1/n) ≤ (1/n)

M 

i=1 M  i=1

|ai − a| + (1/n)

n 

i=M +1

|ai − a|

|ai − a| + {(n − M )/n}. 

Take limsup as n → ∞ of both sides to see that lim n→∞ |(1/n) ni=1 ai −  a| ≤ . Because  is arbitrary, lim n→∞ |(1/n) ni=1 ai − a| = 0. This proves the result. 2. Prove the SLLN for the special case that Xi are iid Bernoulli with parameter p. Hint: use the Borel-Cantelli lemma (Lemma 6.38) together with the fact that the binomial random variable  Sn = n X satisfies E(Sn − np)4 = 3n2 p2 (1 − p)2 + np(1 − i i=1 p){1 − 6p(1 − p)}. P (|Sn /n − p| > 1/k) = P {(Sn /n − p)4 > 1/k 4 } ≤ k 4 E{(Sn /n − p)4 } Markov s inequality = (k/n)4 [3n2 p2 (1 − p)2 + np(1 − p){1 − 6p(1 − p)}], 

2 which is of order 1/n2 . Also, ∞ n=1 1/n < ∞. Therefore, the BorelCantelli lemma implies that P (Ak ) = 0, where Ak is the event that ∞ |Sn /n−p| > 1/k for infinitely many n. Also, P (∪∞ k=1 Ak ) ≤ k=1 P (Ak ) = C 0. But each ω ∈ (∪k Ak ) is an ω for which Sn (ω)/n → p. Therefore, a.s. Sn /n → p.

3. What is wrong with the following reasoning? If Xn are iid with ¯ n a.s. mean 0, then X → 0 by the SLLN. That means there exists ¯ an N such that |Xn − 0| ≤ 1 for all n ≥ N . The DCT then ¯ n|) → 0. implies that E(|X ¯ n − 0| ≤ The problem is that the stated N depends on ω. Therefore, |X 1 for all n ≥ N (ω), but there is no guarantee that E{N (ω)} < ∞. Therefore, the DCT does not apply.


99 Section 7.3 1. Let X1 , X2 , . . . be iid random variables independent of the positive integer-valued random variable N with E(N ) < ∞. If E(|X1 |) < ∞, then E(SN ) = E(X1 )E(N ).

E(SN) = E{Σ_{k=1}^∞ SN I(N = k)} = E{Σ_{k=1}^∞ Sk I(N = k)} = Σ_{k=1}^∞ E{Sk I(N = k)} = Σ_{k=1}^∞ E(Sk)E{I(N = k)} = Σ_{k=1}^∞ kE(X1)P(N = k) = E(X1)E(N).

The third step is justified because Σ_{k=1}^∞ E(|Sk I(N = k)|) = Σ_{k=1}^∞ E(|Sk|)E{I(N = k)} ≤ Σ_{k=1}^∞ kE(|X1|)P(N = k) = E(|X1|)E(N) < ∞.

2. Let Sn be a non-symmetric random walk (i.e., p ≠ 1/2) and k be any integer. Prove that with probability 1, Sn reaches level k only finitely many times. Prove also that P(Sn reaches level k infinitely many times for some integer k) is 0.

By the SLLN, Sn /n → p with probability 1. But each ω for which Sn (ω)/n → p is an ω for which Sn (ω) → ∞. If Sn (ω) → ∞ almost surely, then P (Ak ) = 0, where Ak is the event that |Sn | ≤ |k| for infinitely many n. Therefore, P (Sn reaches level k infinitely many times) = 0. Also, ∞ ∞ P (∪∞ k=1 Ak ) ≤ k=1 P (Ak ) = k=1 0 = 0. Therefore, P (Sn reaches level k infinitely many times for some integer k) is 0.
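As an informal illustration of problem 2 (our addition, with arbitrary choices of the step probability, horizon, number of paths, and level; numpy assumed available), a simulated biased ±1 walk drifts away and stops returning to any fixed level.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n_steps, n_paths, level = 0.6, 20000, 200, 5
steps = rng.choice([1, -1], size=(n_paths, n_steps), p=[p, 1 - p])
walks = np.cumsum(steps, axis=1)

# index of the last visit to `level` on each path (-1 if the level is never hit)
last_visits = []
for w in walks:
    idx = np.where(w == level)[0]
    last_visits.append(int(idx[-1]) if idx.size else -1)

print("latest visit to level", level, "across all paths:", max(last_visits))
print("smallest final position (far above 0):", int(walks[:, -1].min()))
```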


Section 8.1

1. We have two different asymptotic approximations to a binomial random variable Xn with parameters n and pn: the law of small numbers when npn → λ and the CLT for fixed p. In each of the following scenarios, compute the exact binomial probability and the two approximations. Which approximation is closer?

(a) n = 100 and p = 1/100; compute P(Xn ≤ 1).

The exact binomial probability is (1 − 1/100)^{100} + 100(1/100)(1 − 1/100)^{99} = .7358. For the normal approximation, E(Xn) = np = 100(1/100) = 1 and var(Xn) = np(1 − p) = 100(1/100)(99/100) = .99. Therefore, P(Xn ≤ 1) ≈ Φ{(1 − 1)/(.99)^{1/2}} = 1/2. This is not even close to the exact answer. For the Poisson approximation, λ = 100(1/100) = 1, and P(Xn ≤ 1) ≈ exp(−1)(1)^0/0! + exp(−1)(1)^1/1! = 2 exp(−1) = .7358. This matches the exact answer to 4 decimal places. The Poisson approximation is much better in this case.

(b) n = 100 and p = 40/100; compute P(Xn ≤ 38).

The exact probability is Σ_{k=0}^{38} (100 choose k)(40/100)^k (1 − 40/100)^{100−k} = .3822. The normal approximation is Φ{(38 − 40)/(24)^{1/2}} = .3415, which is fairly close to the exact answer. The Poisson approximation is Σ_{k=0}^{38} exp(−40)(40)^k/k! = .4160. The Poisson approximation is slightly better in this case.
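The three numbers in each part can be reproduced with a few lines of Python; this is our own check, assuming scipy is available.

```python
# Exact binomial probability versus the normal and Poisson approximations.
from scipy.stats import binom, norm, poisson

def compare(n, p, x):
    exact = binom.cdf(x, n, p)
    normal = norm.cdf((x - n * p) / (n * p * (1 - p)) ** 0.5)
    pois = poisson.cdf(x, n * p)
    print(f"n={n}, p={p}, P(X<={x}): exact={exact:.4f}  normal={normal:.4f}  poisson={pois:.4f}")

compare(100, 1 / 100, 1)     # part (a)
compare(100, 40 / 100, 38)   # part (b)
```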

2. Prove that if Xn has a chi-squared distribution with n degrees of freedom, then (Xn − n)/(2n)^{1/2} converges in distribution to a standard normal deviate. Hint: what is the distribution of the sum of n iid chi-squared random variables with parameter 1?

Note that Xn has the same distribution as Sn = Σ_{i=1}^n Yi, where the Yi are iid chi-squared random variables with 1 degree of freedom. A chi-squared random variable with 1 degree of freedom has mean 1 and variance 2. By the CLT, (Sn − n)/(2n)^{1/2} →D N(0, 1). Because (Xn − n)/(2n)^{1/2} has the same distribution as (Sn − n)/(2n)^{1/2}, (Xn − n)/(2n)^{1/2} also converges in distribution to N(0, 1).

3. Prove that if Xn has a Poisson distribution with parameter n, then (Xn − n)/n^{1/2} →D X, where X ∼ N(0, 1). Hint: what is the distribution of the sum of n iid Poisson random variables with parameter 1?


Note that Xn has the same distribution as Sn = Σ_{i=1}^n Yi, where the Yi are iid Poisson with parameter 1. The Poisson distribution with parameter 1 has mean 1 and variance 1. Therefore, by the CLT, (Sn − n)/n^{1/2} →D N(0, 1). Because (Xn − n)/n^{1/2} has the same distribution as (Sn − n)/n^{1/2}, (Xn − n)/n^{1/2} also converges in distribution to N(0, 1).

4. ↑ In the preceding problem, let us approximate P(Xn = n) by P(n − 1/2 ≤ Xn ≤ n + 1/2) using the CLT.

(a) Show that the CLT approximation is asymptotic to φ(0)/n^{1/2} as n → ∞, where φ(x) is the standard normal density function.

The CLT approximation is P(Xn = n) ≈ Φ{(n + 1/2 − n)/n^{1/2}} − Φ{(n − 1/2 − n)/n^{1/2}} = Φ(.5/n^{1/2}) − Φ(−.5/n^{1/2}) = φ(ηn)(1/n^{1/2}), where ηn is between −.5/n^{1/2} and .5/n^{1/2} (mean value theorem). As n → ∞, φ(ηn) → φ(0). Therefore, the approximation to P(Xn = n) is asymptotic to φ(0)/n^{1/2}.

(b) Equate P(Xn = n) to φ(0)/n^{1/2} and solve for n! to obtain Stirling’s formula.

exp(−n)n^n/n! = P(Xn = n) ≈ φ(0)/√n. Therefore, n! ≈ exp(−n)n^n √n/φ(0) = √(2πn) exp(−n)n^n. This is Stirling’s formula.

(c) Is the above a formal proof of Stirling’s formula? Why or why not?

No. Replacing P(Xn = n) by Φ{(n + 1/2 − n)/n^{1/2}} − Φ{(n − 1/2 − n)/n^{1/2}} is problematic because there is some error ε_n associated with this approximation. That is, P(Xn = n) = Φ{(n + 1/2 − n)/n^{1/2}} − Φ{(n − 1/2 − n)/n^{1/2}} + ε_n. Therefore, all we can really conclude is that P(Xn = n) = φ(ηn)/n^{1/2} + ε_n, where ε_n → 0. That is,

exp(−n)n^n/n! = 1/√(2πn) + ε_n.

Note that Xn has the same distribution as ni=1 Yi , where Yi are iid Poisson with parameter 1. The Poisson distribution with parameter 1 D has mean 1 and variance 1. Therefore, by the CLT, (Sn − n)/n1/2 → N(0, 1). Because (Xn −n)/n1/2 has the same distribution as (Sn −n)/n1/2 , (Xn − n)/n1/2 also converges in distribution to N(0, 1). 4. ↑ In the preceding problem, let us approximate P (Xn = n) by P (n − 1/2 ≤ Xn ≤ n + 1/2) using the CLT. (a) Show that the CLT approximation is asymptotic to φ(0)/n1/2 as n → 0, where φ(x) is the standard normal density function. The CLT approximation is P (Xn = n) = Φ{(n + 1/2 − n)/n1/2 } − Φ{(n − 1/2 − n)/n1/2 } = Φ(.5/n1/2 ) − Φ(−.5/n1/2 ) = φ(ηn )(1/n1/2 ), where ηn is between −.5/n1/2 and .5/n1/2 (mean value theorem). As n → ∞, φ(ηn ) → φ(0). Therefore, P (Xn = n) is asymptotic to φ(0)/n1/2 . (b) Equate P (Xn = n) to φ(0)/n1/2 and solve for n! to obtain Stirling’s formula. φ(0) exp(−n)nn = P (Xn = n) ≈ √ . n! n Therefore, √ n exp(−n)nn √ = 2πn exp(−n)nn . n! ≈ φ(0) This is Stirling’s formula. (c) Is the above a formal proof of Stirling’s formula? Why or why not? No. Replacing P (Xn = n) by Φ{(n + 1/2 − n)/n1/2 } − Φ{(n − 1/2 − n)/n1/2 } is problematic because there is some error n associated with this approximation. That is, P (Xn = n) = Φ{(n + 1/2 − n)/n1/2 } − Φ{(n − 1/2 − n)/n1/2 } + n . Therefore, all we can really conclude is that P (Xn = n) = φ(ηn )/n1/2 + n , where n → 0. That is, 1 exp(−n)nn + n . =√ n! 2πn

K24704_SM_Cover.indd 105

01/06/16 10:38 am

Solving for n! yields

n! = exp(−n)n^n / {1/√(2πn) + ε_n} = √(2πn) exp(−n)n^n / {1 + √(2πn) ε_n}.

We would need to show that n^{1/2}ε_n → 0 as n → ∞ to show that this approach yields Stirling’s formula.
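As a numerical aside on problem 4 (our addition), Stirling’s approximation can be compared with n! on the log scale; the remark that the relative error behaves like 1/(12n) comes from the standard Stirling series rather than from the argument above.

```python
# sqrt(2*pi*n) * exp(-n) * n**n versus n!, compared via logarithms to avoid overflow.
import math

for n in (5, 20, 100, 1000):
    log_stirling = 0.5 * math.log(2 * math.pi * n) - n + n * math.log(n)
    log_factorial = math.lgamma(n + 1)              # log(n!)
    rel_error = 1.0 - math.exp(log_stirling - log_factorial)
    print(n, f"relative error = {rel_error:.2e}", f"1/(12n) = {1/(12*n):.2e}")
```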

5. Suppose that Xn ∼ Fn and X ∼ F, where Fn and F are continuous and F is strictly increasing. Let x_{n,α} be a (1 − α)th quantile of Fn, and suppose that Xn →D X. Prove that x_{n,α} → x_α, the (1 − α)th quantile of X.

We prove that liminf_{n→∞} x_{n,α} ≥ x_α and limsup_{n→∞} x_{n,α} ≤ x_α. Let x_* = liminf_{n→∞} x_{n,α}. The first step is to prove that x_* ≥ x_α. Suppose, on the contrary, that x_* < x_α. Then x_* < x_α − ε for some ε > 0. There is a subsequence n1, n2, . . . such that x_{nk,α} → x_*. Therefore, for all but finitely many k,

F_{nk}(x_α − ε) ≥ F_{nk}(x_{nk,α}) = 1 − α

(14)

(the equality is by continuity of Fnk ). It follows that F (xα − ) = lim Fnk (xα − ) ≥ 1 − α k→∞

by Equation (14). But F (xα ) = 1 − α and F is strictly increasing, so this is a contradiction. This proves that lim n→∞ xn,α ≥ xα .

Similarly, let x∗ = lim n→∞ xn,α . The next step is to prove that x∗ ≤ xα . Suppose, on the contrary, that x∗ > xα . Then x∗ > xα +  for some  > 0. There is a subsequence n1 , n2 , . . . such that xnk → x∗ . Then for all but finitely many k, P (Xnk ≥ xα + ) ≥ P (Xnk ≥ xn,α ) = α

(15)

(the equality is by continuity of Fnk ). Therefore, P (X ≥ xα + ) = lim P (Xnk ≥ xα + ) ≥ α k→∞

by Equation (15). But F is strictly increasing and P (X ≥ xα ) = α, so this is a contradiction. Therefore, lim n→∞ xn,α ≤ xα . We have shown that lim n→∞ xn,α ≥ xα and lim n→∞ xn,α ≤ xα . It follows that lim n→∞ xn,α = lim n→∞ xn,α = limn→∞ xn,α = xα .


103 6. In Example 8.6, use Slutsky’s theorem to justify the replace n ¯ ¯ ment of n i=1 (Xi − X)(Yi − Y ) by i=1 XiYi when showing that n

i=1 (Xi

n

i=1 (Xi

= =

¯ ¯ − X)(Y i − Y) − n cov(X, Y) 

n var(X1 Y1 )

¯ i − Y¯ ) − n cov(X, Y ) − X)(Y 

n var(X1 Y1 ) n n ¯ ¯ i=1 (Xi − X)(Yi − Y ) − i=1 Xi Yi −Y¯



n var(X1 Y1 ) ¯ n Yi + nX ¯ Y¯ i=1 Xi − X i=1

n



n var(X1 Y1 ) ¯ Y¯ ¯ Y¯ −nX −n1/2 X =  = . n var(X1 Y1 ) var(X1 Y1 )

=



n

D

→ N(0, 1).

i=1

Xi Yi − n cov(X, Y ) 

n var(X1 Y1 )

¯ Y¯ − nX ¯ Y¯ + nX ¯ Y¯ −nX 

n var(X1 Y1 )

(16)

¯ converges in distribution to N (0, var(X)). Also, Y¯ By the CLT, n1/2 X converges almost surely (and therefore in probability) to 0 by the strong law of large numbers. Therefore, by Exercise 4 of Section 6.2.5, Equation (16) converges in probability to 0. We have shown that the difference in line 1 tends to 0 in probability. By Slutsky’s theorem, the asymptotic distributions of the two expressions in line 1 are the same. 7. In Example 8.7, use Slutsky’s theorem to justify rigorously the replacement of βˆ1 by β1 when showing that ¯T − X ¯ C ) − β2 Y¯T − Y¯C − βˆ1 (X 

σ2 (1/nT

+ 1/nC )

¯T − X ¯ C ) − β2 Y¯T − Y¯C − βˆ1 (X 

σ2 (1/nT + 1/nC ) ¯T − X ¯C ) (β1 − βˆ1 )(X =  . σ2 (1/nT + 1/nC )



D

→ N(0, 1).

¯T − X ¯ C ) − β2 Y¯T − Y¯C − β1 (X 

σ2 (1/nT + 1/nC )

We know that βˆ1 is consistent, so β1 − βˆ1 tends to 0 in probability. It ¯T − X ¯ C )/(1/nT +1/nC ) converges in distribution. suffices to prove that (X


104 ¯ T and X ¯ C are the same (because X is a baseline variable), The means of X so without loss of generality, we may take the common mean to be 0.



¯C ¯T − X X

σ2 (1/nT + 1/nC )

= =



¯ T − √n T X ¯C nT X





σ2 (1 + nT /nC )  ¯ T − nT √nC X ¯C nT X nC 

σ2 (1 + nT /nC )  √ nT √ ¯ ¯C nC X nT XT nC =  − . σ2 (1 + nT /nC ) σ2 (1 + nT /nC )

1/2

¯ T converges in distribution to U ∼ N (0, σT2 ) and By the CLT, nT X 1/2 ¯ 2 nC X C converges in distribution to V ∼ N (0, σC ), where U and V are independent. Furthermore, nT /nC converges to a constant λ by assump2 ¯ ¯T − X tion. Therefore, (X C )} converges in distribution √ C )/{σ (1/nT +1/n 1/2 1/2 to U/{σ (1 + λ) } + λ V /{σ (1 + λ) }.

We have shown that the difference in the two original expressions, one with β1 and the other with βˆ1 , is the product of an expression converging to 0 in probability and an expression converging in distribution. We conclude that the difference converges to 0 in probability. By Slutsky’s theorem, the asymptotic distributions of the expressions involving β1 and βˆ1 are the same.


105 Section 8.2 1. Imagine infinitely many quincunxes, one with a single row, another with two rows, another with three rows, etc. Roll one ball on each quincunx. What is the probability that the ball is in the rightmost bin of infinitely many of the quincunxes? The probability of being in the rightmost bin of a quincunx with n rows  n is 1/2n . Because ∞ n=1 1/2 < ∞, The probability that the ball is in the rightmost bin for infinitely many quincunxes is 0 by the Borel-Cantelli lemma. 2. Let X1 have distribution F with mean 0 and variance 1, and X2 , X3 , . . . be iid with point mass distributions at 0. That is, Xi ≡ 0 with probability 1 for i ≥ 2. What is the asymptotic  distribution of Sn = n i=1 Xi? Does the CLT hold? The asymptotic distribution of Sn is F . The CLT does not hold.

3. In Example 8.16, prove directly that the Lindeberg condition does not hold. Un1 , . . . , Unn are iid Bernoulli pn such that npn → λ. Note that 



n     1 2 2 E Uni I Uni ≥  var(Sn ) . var(Sn ) i=1 n    1  2 2 ≥ E Uni I Uni ≥  npn (1 − pn ) . npn i=1

(17)

Because npn → λ, pn → 0 as n → ∞. Therefore, there is an N such that (17) is at least as large as n    1  2 2 E Uni I Uni ≥  npn (1 − 1/2) npn i=1 n    1  2 2 E Uni I Uni ≥ (/2)(λ + 1) . ≥ npn i=1

(18)

For  sufficiently small, (/2)(λ + 1) < 1, in which case n  i=1







2 2 E Uni I Uni ≥ (/2)(λ + 1)

= =

n 

E{I(Uni = 1)}

i=1 npn .

Therefore, (18) is at least npn /(npn ) = 1 for n ≥ N . Therefore, it cannot converge to 0 as n → ∞. The Lindeberg condition is not satisfied.
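The failure of the Lindeberg condition can also be seen numerically. In this Bernoulli setting the Lindeberg sum reduces to the closed form used below (our own rearrangement of the display above); the choices λ = 1 and ε = 0.5 are arbitrary.

```python
# For U_{ni} iid Bernoulli(p_n) with n*p_n = lam, the Lindeberg ratio is
# I(1 >= eps^2 * n*p_n*(1 - p_n)) / (1 - p_n), which tends to 1, not 0.
lam, eps = 1.0, 0.5
for n in (10, 100, 1000, 10000, 100000):
    p_n = lam / n
    var_sn = n * p_n * (1 - p_n)
    term = p_n if 1 >= eps ** 2 * var_sn else 0.0   # E[U^2 I(U^2 >= eps^2 var(Sn))]
    print(n, round(n * term / var_sn, 6))
```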


106 4. Let τni be independent Bernoullis with probability pn, 1 ≤ i ≤ n, and let Xni = τni − pn. Prove that if  < pn < 1 −  for all n, where  > 0, then Xni satisfies Lyapounov’s condition with r = 3. We must show that

n

i=1



E(|Xni |3 )

var(Sn )

3

→ 0.





The numerator is no greater than ni=1 E(|τni − pn |3 ) ≤ ni=1 E(13 ) = n. The denominator is {npn (1 − pn )}3/2 ≥ {n(1 − )}3/2 because the function x(1 − x) increases as |x − 1/2| decreases for x ∈ [0, 1]. It follows that the ratio of numerator to denominator is no greater than n−1/2 /{(1 − )}3/2 , which tends to 0 as n → ∞. Therefore, Lyapounov’s condition is satisfied. 5. Let Xni, 1 ≤ i ≤ n, be independent and uniformly distributed on (−an, an), an > 0. Prove that Lyapounov’s condition holds for r = 3. 

The numerator for the Lyapounov condition is 2n 0an x3 dx/(2an ) = a3n /4. For the denominator, note that var(Xni ) = a2n /3. The denominator is therefore {na2n /3}3/2 = n3/2 a3n /33/2 . The ratio of numerator to denominator is 33/2 /(4n1/2 ) → 0 as n → ∞. Therefore, the Lyapounov condition is satisfied. 6. Let Yi be iid random variables taking values ±1 with probability 1/2 each. Prove that the random variables Xni = (i/n)Yi satisfy Lyapounov’s condition with r = 3. For the numerator, note that E(|Xni |3 ) = (i3 /n3 )E(13 ) = i3 /n3 . There  fore, the numerator is (1/n3 ) ni=1 i3 , which is no greater than (1/n3 ) 1n+1 x3 dx ≤ (1/n3 )n4 /4 = n/4. For the denominator note that var(Xni ) = (i/n)2 E(Yi2 ) = i2 /n2 . There  fore, var(Sn ) = (1/n2 ) ni=1 i2 ≥ (1/n2 ) 0n x2 dx = n/3. Therefore, the denominator is at least as large as (n/3)3/2 . Accordingly, the ratio of numerator to denominator is no greater than (n/4)/(n/3)3/2 = 33/2 /(4n1/2 ) → 0 as n → ∞. Therefore, the Lyapounov condition is satisfied. 7. Prove that the Lindeberg CLT (Theorem 8.14) implies the standard CLT (Theorem 8.1).


Without loss of generality, assume that the Xni are iid with mean 0. The expression in the Lindeberg condition simplifies as follows:

{1/var(Sn)} Σ_{i=1}^n E{Xni² I(Xni² ≥ ε² var(Sn))} = nE{X1² I(X1² ≥ ε² nσ²)}/(nσ²) = E{X1² I(X1² ≥ ε² nσ²)}/σ².

n n−1

n

i=1

 

n i=1

Di2

Di n

−(

i=1

Di

.

)2 /n

p

With Zn defined by (8.16), prove that, Tn − Zn → 0 under the null hypothesis that E(Di) = 0. What does this say about how the t-test and permutation test compare under the null hypothesis if n is large?

T n − Zn =

=



  Di    n  i=1 n

 n  n

(



D) √ i   n 

i=1

2 i=1 Di

n−1



1 −

(

n

i=1

Di )2

n





1 1 n−1



n

2 i=1 Di



(

n

i=1

n

Di )2





     n 2 i=1 Di



1



  .   n 2  D i=1 i

1

n



Example 7.6 showed that the sample variance s2n = (n − 1)−1 { ni=1 Di2 −   ( ni=1 Di )2 /n} converges almost surely to σ 2 . By the SLLN, (1/n) ni=1 Di2


108 a.s.

→ E(Di2 ) = σ 2 . Therefore, Tn − Zn = n−1/2

 n 



Di Rn ,

i=1



where Rn tends almost surely to 1/σ − 1/σ = 0. Also, n−1/2 ni−1 Di converges in distribution to N(0, σ 2 ) by the CLT. Therefore, by Exercise 4 of Section 6.2.5, Tn − Zn converges in probability to 0.

To see the implications of this result on how the permutation and t-tests compare, note the following. We showed previously that the permutation distribution of Zn tends to N (0, 1). Therefore, the permutation test is nearly equivalent to rejecting the null hypothesis if Zn > zα . We have also shown previously that the t-distribution is asymptotically N(0, 1), so the t-test is nearly equivalent to rejecting the null hypothesis if Tn > zα . p We just showed that Tn − Zn → 0, so Tn and Zn have high probability of differing by at most a small amount if n is large. We conclude that the permutation and t-tests are both asymptotically equivalent to rejecting the null hypothesis if Zn > Zα . That is, they are asymptotically equivalent tests. 10. ↑ Consider the context of the preceding problem. To simulate what might happen to the t-test if there is an outlier, replace the nth observation by n1/2 . √ D (a) Prove that Tn → N(1/ 2, 1/2). Multiply the numerator and denominator by (n − 1)−1/2 to obtain Tn =

n−1

Di √i=1 n−1

    n  n−1 D2 i i=1  n−1

n−1

+



n n−1

+

n n−1



n−1

√ 2 Di + n) i=1 n(n−1)

(

,



2 Consider the denominator. The term n−1 i=1 Di /(n−1) tends almost surely to E(Di2 ) = σ 2 by the SLLN. Also,



n−1 i=1

Di +

√ 2 n

n(n − 1)





n−1 i=1

Di +

√ 2 n

(n − 1)2

√ 2 n = Di /(n − 1) + n−1 i=1 a.s. 2 → {E(Di ) + 0} = (0 + 0)2 = 0. n−1 

Also, the numerator tends in distribution to Y + 1, where Y is normal with mean 0 and variance σ 2 . By Slutsky’s theorem, Tn


109 converges in distribution to (Y + 1)/21/2 , whose distribution is N(1/21/2 , 1/2). (b) If n is very large, what is the approximate type I error rate for a one-tailed test rejecting for Tn exceeding the (1−α)th quantile of a t-distribution with n − 1 degrees of freedom? As n → ∞, the (1 − α)th quantile of a t-distribution with n − 1 degrees of freedom tends to zα , the (1 − α)th quantile of a standard normal distribution. Therefore, the approximate type 1 error rate is 





P N 1/21/2 , 1/2 > zα



= 1−Φ = 1−Φ





z − 1/21/2   α √



1/2



2zα − 1 .

For α = 0.025, the actual one-tailed type 1 error rate is close to 1 − Φ(21/2 · 1.96 − 1) = 1 − Φ(1.772) = 0.038 instead of 0.025.
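A simulation sketch of problem 10(b) (our addition; the sample size and number of replications are arbitrary, and numpy/scipy are assumed available): with the nth observation replaced by √n, the nominal 0.025-level one-sided t-test rejects at a rate near 0.038.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps, alpha = 400, 20000, 0.025
crit = stats.t.ppf(1 - alpha, df=n - 1)

rejections = 0
for _ in range(reps):
    d = rng.normal(0.0, 1.0, n)
    d[-1] = np.sqrt(n)                                   # the planted outlier
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    rejections += t_stat > crit
print("empirical type I error:", rejections / reps)       # roughly 0.038, not 0.025
```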


Section 8.4

1. Use ch.f.s to prove that if X1 and X2 are independent Poissons with respective parameters λ1 and λ2, then X1 + X2 is Poisson with parameter λ1 + λ2.

The ch.f. of Xi is exp[λi{exp(it) − 1}], so the ch.f. of X1 + X2 is the product of the ch.f.s, namely exp[(λ1 + λ2){exp(it) − 1}]. This is the ch.f. of a Poisson random variable with parameter λ1 + λ2. Therefore, X1 + X2 is Poisson with parameter λ1 + λ2.

2. Use ch.f.s to prove that if X1 and X2 are independent exponentials with parameter θ, then X1 + X2 is gamma with parameters 2 and θ.

The ch.f. of Xi is (1 − it/θ)^{−1}, so the ch.f. of X1 + X2 is the product of the two ch.f.s, namely (1 − it/θ)^{−2}. This is the ch.f. of a gamma with parameters 2 and θ.

3. Use ch.f.s to prove that if X1 and X2 are iid random variables, then the distribution function for X1 − X2 is symmetric about 0.

The ch.f. of X1 − X2 is E[exp{it(X1 − X2)}] = E{exp(itX1)}E{exp(−itX2)} = ψ(t)ψ(−t) = |ψ(t)|². This shows that the ch.f. of X1 − X2 is real. By part 5 of Proposition 8.29, X1 − X2 is symmetric about 0.

4. Use the representation cos(t) = {exp(it) + exp(−it)}/2 to read out the probability mass function that corresponds to the characteristic function cos(t). Describe the distribution corresponding to the ch.f. {cos(t)}^n.

{exp(it)+exp(−it)}/2 is the ch.f. corresponding to a symmetric Bernoulli random variable X taking values +1 or −1 with probability 1/2 each. Therefore, cos(t) is the ch.f. of this symmetric Bernoulli random variable, and {cos(t)}n is the ch.f. for the sum of n independent symmetric Bernoulli random variables.

5. ↑ Use the CLT in conjunction with the preceding problem to deduce that {cos(t/n1/2 )}n converges to exp(−t2 /2) as n → ∞. Then verify this fact directly. Hint: write the log of


111 {cos(t/n1/2 )}n as ln{cos(t/n1/2 )}/(1/n) (this is not problematic because cos(t/n1/2 ) is nonnegative for n sufficiently large) and use L’Hospital’s rule as many times as needed. Let Xi be iid symmetric Bernoulli random variables. The ch.f. of the ¯ = Sn /n1/2 is {ψ(t/n1/2 )}n = {cos(t/n1/2 )}n . By the sample mean X CLT, Sn /n1/2 converges in distribution to N(0, 1). Therefore, the ch.f. of Sn /n1/2 converges to the ch.f. of the standard normal distribution, namely exp(−t2 /2). This proves the result. To verify this fact directly, let ψn (t) = {cos(t/n1/2 )}n . For any given t, cos(t/n1/2 ) > 0 for n large enough. Therefore, we can take the logarithm of ψn (t): ln{ψn (t)} = n ln{cos(t/n1/2 )} =

ln{cos(t/n1/2 )} . 1/n

Let x = 1/n1/2 and evaluate the limit as x → 0 of ln{cos(tx)}/x2 . The numerator and denominator both tend to 0, so we apply L’Hospital’s rule. The derivatives of the numerator and denominator with respect to x are −t sin(tx)/ cos(tx) and 2x, respectively. Their ratio is (−t2 /2) sin(tx) −t sin(tx) = → −t2 /2 as x → 0 2x cos(tx) cos(tx) tx because sin(u)/u → 1 as u → 0 (L’Hospital’s rule). We have shown that the logarithm of ψn (t) tends to −t2 /2, so ψn (t) tends to exp(−t2 /2). 6. Let Y be a mixture of two normal random variables: Y = X1 or Y = X2 with probability λ and 1 − λ, respectively, where Xi ∼ N(µi, σi2 ). Show that Y has ch.f. ψY (t) = λ exp(iµ1 t − σ12 t2 /2) + (1 − λ) exp(iµ2 t − σ22 t2 /2).

The density function for Y is λf1 + (1 − λ)f2 , where fi is the density corresponding to N(µi , σi2 ), i = 1, 2. The ch.f. of Y is ψY (t) =

 ∞

exp(ity){λf1 (y) + (1 − λ)f2 (y)}dy

−∞  ∞

= λ

−∞

exp(ity)f1 (y)dy + (1 − λ)

 ∞

−∞

exp(ity)f2 (y)dy

= λ exp(itµ1 − σ12 t2 /2) + (1 − λ) exp(itµ2 − σ22 t2 /2). 7. Use ch.f.s to prove that the distribution of the sample mean of n iid observations from the Cauchy distribution with parameters θ and λ is Cauchy with parameters θ and λ.


112 ¯ is The ch.f. for a single Cauchy is exp(itθ − |t|λ), so the ch.f. of X E exp{i(t/n)(X1 + . . . + Xn )} =

n 

j=1

exp{i(t/n)θ − |t|λ}

= exp[{i(t/n)θ − |t/n|λ}n] = exp{itθ − |t|λ}. This is the ch.f. of a Cauchy with parameters θ and λ. 8. Let Y1 , Y2 be iid Cauchy with parameters θ and λ, and let τ ∈ (0, 1). Use ch.f.s to deduce the distribution of τ Y1 + (1 − τ )Y2 . The ch.f. of τ Y1 + (1 − τ )Y2 is

ψ(τ t)ψ{(1 − τ )t} = exp{iτ tθ − |τ t|λ + i(1 − τ )tθ − |(1 − τ )t|λ} = exp{i(τ + 1 − τ )tθ − |t|λ(τ + 1 − τ )} = exp{itθ − |t|λ}. This is the ch.f. of a Cauchy(θ, λ). Therefore, τ Y1 + (1 − τ )Y2 has a Cauchy distribution with parameters θ and λ. 9. The geometric distribution is the distribution of the number of failures before the first success in iid Bernoulli trials with success probability p. Given its ch.f. in Table 8.2, determine the ch.f. of the number of failures before the sth success. The number of failures before the sth success is the sum of iid observations from a geometric distribution because one must first observe the number of failures to the first success, then the additional number of failure to the next success, etc. Therefore, the ch.f. for the number of failures until the sth success is the product of ch.f.s: s 

p = j=1 1 − (1 − p) exp(it)



p 1 − (1 − p) exp(it)

s

.

10. Suppose that Y1 and Y2 are independent, Y1 has a chi-squared distribution with k degrees of freedom, and Y1 + Y2 has a chisquared distribution with n degrees of freedom. Prove that Y2 has a chi-squared distribution with n − k degrees of freedom.

This follows from Proposition 8.32 because the ch.f. of a chi-squared random variable with 1 degree of freedom is, from Table 8.2, 1/(1 − 2it)1/2 , which has no real (or complex) roots.

11. Suppose that Z1 and Z2 are independent, Z1 ∼ N(0, 1), and Z1 + Z2 ∼ N (0, 2). Prove that Z2 ∼ N(0, 1).


113 This follows from Proposition 8.32 because the ch.f. of a N(0, 1) random variable is, from Table 8.2, exp(−t2 /2), which has no real (or complex) roots. 12. Show that the following are NOT ch.f.s. (a) ψ(t) = cos(t) + sin(t). ψ(π/4) = 21/2 > 1, and the magnitude of a characteristic function cannot exceed 1. (b) ψ(t) = (1/2) cos(t). ψ(0) = 1/2, and characteristic functions must equal 1 at t = 0. (c) ψ(t)  = sin(1/t) 1

for t = 0 for t = 0.

This function is not continuous at t = 0, and characteristic functions are continuous (uniformly continuous, in fact) at every t. 13. Use induction to finish the proof of Proposition 8.33. 

∞ Assume that the formula ψ (k) (t) = −∞ (ix)k exp(itx)dF (x) holds. We will show that this formula holds for k + 1 as well. ψ(k+1) (t)

= = = = =

lim

∆→0

lim

∆→0

lim

∆→0

lim

∆→0

lim

∆→0





1 ∆

 



−∞ ∞ −∞ ∞ −∞ ∞ −∞



−∞

(ix)k exp{i(t + ∆)x}dF (x) −

k

(ix) exp(itx) (ix)k exp(itx) (ix)k exp(itx)

 



exp(i∆x) − 1 ∆







(ix)k exp(itx)dF (x)

−∞



dF (x)

cos(∆x) − 1 i sin(∆x) + ∆ ∆



dF (x)

− sin(ηx )∆x i cos(λx )∆x + ∆ ∆



dF (x)

(ix)k exp(itx) {−x sin(ηx ) + ix cos(λx )} dF (x),

where ηx and λx are both between 0 and ∆x. As ∆ → 0, ηx → 0 and λx → 0, so −x sin(ηx) → 0 and x cos(λx) → x. Moreover, |−x sin(ηx)| ≤ |x| and |ix cos(λx)| ≤ |x|, so the magnitude of the integrand is no greater than |x|^k(1)(2|x|) = 2|x|^{k+1}. By assumption, E(|X|^{k+1}) < ∞. Therefore, both the real and imaginary parts of the integrand are dominated by an integrable function. By the DCT, we can take the limit inside the integral:

ψ^{(k+1)}(t) = ∫_{−∞}^{∞} (ix)^k exp(itx)(ix)dF(x) = ∫_{−∞}^{∞} (ix)^{k+1} exp(itx)dF(x).

This completes the proof by induction.


115 Section 8.5.2 D

1. Use ch.f.s to give another proof of the fact that if Xn → X, D D Yn → Y , and Xn and Yn are independent, then Xn + Yn → X + Y , where X and Y are independent (see also Problem 2 of Section 6.3). The ch.f. for Xn + Yn is ψXn +Yn (t) = ψXn (t)ψYn (t), which converges to ψX (t)ψY (t) because ψXn (t) → ψX (t) and ψYn (t) → ψY (t). But ψX (t)ψY (t) is the ch.f. of X + Y , the sum of independent random variables. By the continuity property of characteristic functions, Xn + Yn converges in distribution to X + Y . 2. Use ch.f.s to give another proof of the fact that, in Example 8.13, the asymptotic distribution of Sn/{var(Sn)}1/2 is a mixture of two normals (see Problem 6 of Section 8.4), where  Sn = n j=1 Xnj . var(Sn ) = n + n − 1 = 2n − 1. The ch.f. of

n

j=1

Xnj is

ψXn1 (t)ψXn2 +...+Xnn (t) = cos(n1/2 t){cos(t)}n−1 . √  Therefore, the ch.f. of nj=1 Xnj / 2n − 1 is 



n1/2 t √ = cos 2n − 1

 



t cos √ 2n − 1

n−1

,

which is the product of 2 terms. The first term converges to cos(t/21/2 ). Write the second term as 

cos



t 1/2

un

un (n−1)/un

,

un → where un = 2n − 1. As n → ∞, un → ∞ and {cos(t/u1/2 n )} 2 exp(−t /2) (see Exercise 5 of Section 8.4). Also, (n − 1)/un → 1/2 as n → ∞, so the second term converges to exp(−t2 /4). Therefore, the √ n ch.f. of j=1 Xnj / 2n − 1 converges to

cos(t/21/2 ) exp(−t2 /4).

This is the ch.f. of the sum of two independent random variables U + V , where U takes values ±1/21/2 with probability 1/2 each, while V is N (0, 1/2). The resulting distribution is P (U + V ≤ z) = P (U = 1/21/2 , V ≤ z − 1/21/2 ) + P (U = −1/21/2 , V ≤ z + 1/21/2 )


116 





z − 1/21/2 z + 1/21/2 + (1/2)Φ = (1/2)Φ 1/21/2  1/21/2    = (1/2)Φ 21/2 z − 1 + (1/2)Φ 21/2 z + 1 .



3. Modify Example 8.13 so that the first random variable is −n1/2 , 0, or +n1/2 with probability 1/3. Show that the asymptotic distribution of Sn/{var(Sn)}1/2 is a mixture of three normals. Let Xn1 take the values −n1/2 , 0, or n1/2 with probability 1/3 each. Let Xn2 , . . . , Xnn be iid symmetric Bernoulli random variables (±1 with  probability 1/2 each). The variance of Sn = ni=1 Xni is var(Xn1 ) + (n − 2 1) = E(Xn1 ) + n − 1 = 2n/3 + n − 1 = (5n − 3)/3. Then 

Xn2 + . . . + Xnn Xn1  + = 

Sn

5n−3 3

var(Sn )

5n−3 3

Xn2 + . . . + Xnn √ + n−1

Xn1

= 

5n−3 3



3(n − 1) . 5n − 3

The first term converges in distribution to a random variable X taking the values −(3/5)1/2 , 0, or (3/5)1/2 with probability 1/3 each. For the second term, note that (Xn2 +. . .+Xnn )/(n−1)1/2 converges to N(0, 1) by the CLT, while {3(n − 1)/(5n − 3)}1.2 → (3/5)1/2 . By Slutsky’s theorem, the product converges in distribution to Y ∼ N (0, 3/5). Also, X and Y are independent, so Sn /{var(Sn )}1/2 converges in distribution to X + Y . Therefore, at each continuity point u of the distribution of X + Y P



√ Sn

var(Sn )

→ =

=

≤u





P (X + Y ≤ u) (1/3)P

(1/3)Φ

u+

Y





3/5

5 3



u+1







3/5

3/5



+ (1/3)Φ

+ (1/3)P (Y ≤ u) + (1/3)P



5 3

u







5 3

u−1





.

Y



3/5



u−





3/5

3/5



4. Use characteristic functions to prove the law of small numbers (Proposition 6.24).

The ch.f. of a binomial random variable with parameters n and pn is, from Table 8.2,

{1 − pn + pn exp(it)}^n = [1 + npn{exp(it) − 1}/n]^n.

117 By Lemma 8.38, and the fact that npn → λ by assumption, the ch.f. of a binomial (n, pn ) converges to exp[λ{exp(it) − 1}], which is the ch.f. of a Poisson random variable with parameter λ (Table 8.2). By the continuity property of ch.f.s, the binomial with parameters (n, pn ), npn → λ, converges in distribution to a Poisson (λ).


118 Section 8.6 1. Use ch.f.s to prove Corollary 8.43, the Cramer-Wold device. D

Suppose that t Xn → t X for all t ∈ Rk . The ch.f. of Xn , namely E{exp(it Xn ), converges to E{exp(it X) because exp(it x) = cos(t x) + i sin(t x), and cos(t x) and sin(t x) are bounded continuous functions of D x. By the continuity property of multivariate ch.f.s, Xn → X. D

D

The other direction is immediate because if Xn → X, then t Xn → t X by the Mann-Wald theorem because t x is a continuous function of x.

2. Prove that two k-dimensional random vectors X1 and X2 with respective ch.f.s ψ1 (t) and ψ2 (t) are independent if and only if the joint ch.f. E{i(t1 X1 + t2 X2 )} of (X1 , X2 ) is ψ1 (t1 )ψ2 (t2 ) for all t1 ∈ Rk, t2 ∈ Rk.

The multivariate ch.f. of independent random vectors Y1 and Y2 with the same distributions as X1 and X2 is E{exp(it1 X1 +t2 X2 )} = ψ1 (t1 )× ψ2 (t2 ). If the ch.f. ψX1 ,X2 (t) of (X1 , X2 ) equals that of (Y1 , Y2 ), then because a multivariate ch.f. uniquely defines a distribution function, the joint distribution of (X1 , X2 ) must be the same as that of (Y1 , Y2 ). This implies that X1 and X2 are independent.

3. Let Y have a trivariate normal distribution with zero means, unit variances, and pairwise correlations ρ12 , ρ13 , and ρ23 . Show that Y has the same distribution as AZ, where Z are iid standard normals and 

A=

1

    ρ12    

ρ13

0 

0

1 − ρ212

ρ23 −ρ12 ρ13



1−ρ2 12



    0 .    2 )−(ρ −ρ ρ )2  (1−ρ2 )(1−ρ 23 12 13 12 13 1−ρ2 12

Let Y = AZ. We will show that the covariance matrix Σ of Y has 1s on off the main diagonal. Σ11 = (1, 0, 0)(1, 0, 0) = the main diagonal and ρij   1 + 0 + 0 = 1; Σ22 = (ρ12 , 1 − ρ212 , 0)(ρ12 , 1 − ρ212 , 0) = ρ212 + 1 − ρ212 + 0 = 1;  ρ

 Σ33 = ρ213 +  1 − ρ212


   (1 − ρ212 )(1 − ρ213 ) − (ρ23  +   1 − ρ212 2

 23 − ρ12 ρ13

2

− ρ12 ρ13 )2  


119 = = = =

(ρ23 − ρ12 ρ13 )2 (1 − ρ212 )(1 − ρ213 ) − (ρ23 − ρ12 ρ13 )2 + + 1 − ρ212 1 − ρ212 2 2 2 2 ρ13 (1 − ρ12 ) + (1 − ρ12 )(1 − ρ13 ) 1 − ρ212 2 2 (1 − ρ12 ) {ρ13 + 1 − ρ213 } 1 − ρ212 1 − ρ212 = 1. 1 − ρ212

ρ213

It is also clear that Σ12 = ρ12 and Σ13 = ρ13 . Lastly,

Σ23 = ρ12ρ13 + √(1 − ρ12²) · (ρ23 − ρ12ρ13)/√(1 − ρ12²) + 0
    = ρ12ρ13 + ρ23 − ρ12ρ13
    = ρ23,

completing the proof.
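As an added sanity check (not part of the original solution), the following sketch builds A for one arbitrary choice of valid correlations and verifies numerically that AA′ reproduces the stated correlation matrix.

    # Verify numerically that A A' equals the target correlation matrix.
    import numpy as np

    r12, r13, r23 = 0.3, 0.5, 0.4   # arbitrary correlations giving a positive definite matrix
    A = np.array([
        [1.0, 0.0, 0.0],
        [r12, np.sqrt(1 - r12**2), 0.0],
        [r13,
         (r23 - r12 * r13) / np.sqrt(1 - r12**2),
         np.sqrt((1 - r12**2) * (1 - r13**2) - (r23 - r12 * r13)**2) / np.sqrt(1 - r12**2)],
    ])
    Sigma = np.array([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])
    print(np.allclose(A @ A.T, Sigma))   # True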

4. Prove the multivariate version of Slutsky's theorem (Theorem 8.50).

Suppose that Xn converges in distribution to X and Yn converges in probability to 0. If t ∈ R^k, then t′(Xn + Yn) = t′Xn + t′Yn. By the Mann-Wald theorem, t′Xn converges in distribution to t′X because f(x) = t′x is a continuous function. Also, t′Yn converges in probability to 0. Therefore, by Slutsky's theorem in one dimension, t′Xn + t′Yn converges in distribution to t′X. By the Cramer-Wold device, Xn + Yn converges in distribution to X.

5. Use bivariate ch.f.s to prove that if (X, Y) are nonsingular bivariate normal with correlation ρ, then X and Y are independent if and only if ρ = 0.

Without loss of generality, we can assume that (X, Y) have mean (0, 0) because subtracting means clearly does not change the dependence status of random variables. The bivariate ch.f. of (X, Y) is

exp{−(1/2)(t1²σX² + 2t1t2ρσXσY + t2²σY²)} = ψX(t1)ψY(t2) exp(−t1t2ρσXσY).

Because (X, Y) has a nonsingular distribution, var(X) > 0 and var(Y) > 0. Therefore, the bivariate ch.f. factors into ψX(t1)ψY(t2) if and only if ρ = 0.

6. ↑ Let X have a standard normal distribution. Flip a fair coin and define Y by:

Y = −X if tails, +X if heads.

Show that X and Y each have a standard normal distribution and cov(X, Y) = 0 but X and Y are not independent. Why does this not contradict the preceding problem?

The distribution function for Y is the mixture (1/2)Φ(y) + (1/2)Φ(y) = Φ(y) because both X and −X have a standard normal distribution function. Therefore, Y has a standard normal distribution function. Also, Y = ZX, where Z is 1 for heads and −1 for tails. It follows that cov(X, Y) = cov(X, ZX) = E{X(ZX)} = E(Z)E(X²) = 0 because E(Z) = 0. Therefore, X and Y are uncorrelated. They are not independent because |Y| = |X|; if X and Y were independent, then any Borel function of X (including f(X) = |X|) would be independent of any Borel function of Y (including f(Y) = |Y|).

This does not contradict the preceding problem because that problem requires (X, Y ) to have a bivariate normal distribution, which they do not have.
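A short simulation (added for illustration; the sample size and seed are arbitrary) makes the same point numerically: the sample covariance of X and Y = ZX is essentially 0, yet |Y| always equals |X|.

    # X standard normal, Z = +/-1 from a fair coin, Y = Z * X.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.standard_normal(n)
    z = rng.choice([-1.0, 1.0], size=n)
    y = z * x
    print("sample cov(X, Y) =", np.cov(x, y)[0, 1])                  # near 0
    print("|Y| equals |X| always:", np.allclose(np.abs(y), np.abs(x)))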

7. Suppose that X1 and X2 are iid N(µ, σ 2 ) random variables. Use bivariate ch.f.s to prove that X1 − X2 and X1 + X2 are independent. The bivariate ch.f. of (Y1 , Y2 ) = (X1 − X2 , X1 + X2 ) is

ψ(t1, t2) = E[exp{it1(X1 − X2) + it2(X1 + X2)}]
= E[exp{i(t1 + t2)X1}] E[exp{i(t2 − t1)X2}]
= ψX1(t1 + t2) ψX2(t2 − t1)
= exp{iµ(t1 + t2) − σ²(t1 + t2)²/2} exp{iµ(t2 − t1) − σ²(t2 − t1)²/2}
= exp{2iµt2 − σ²(2t1² + 2t2²)/2}
= exp{i·0·t1 − (√2 σ)²t1²/2} exp{i(2µ)t2 − (√2 σ)²t2²/2}.

This is the product of the ch.f. of a N(0, 2σ²) random variable and the ch.f. of a N(2µ, 2σ²) random variable. It follows that the joint distribution of (Y1, Y2) is that of independent N(0, 2σ²) and N(2µ, 2σ²) random variables.

8. Let (X, Y, Z) be independent with respective (finite) means µX, µY, µZ and respective (finite) variances σX², σY², σZ². Let (Xi, Yi, Zi), i = 1, . . . , n, be independent replications of (X, Y, Z). Show that the asymptotic distribution of Σ_{i=1}^{n} (Xi + Yi, Yi + Zi) as n → ∞ is bivariate normal, and find its asymptotic mean vector and covariance matrix.


Let Ui = Xi + Yi and Vi = Yi + Zi. The pairs (U1, V1), (U2, V2), . . . are iid with means µU = E(Xi + Yi) = µX + µY and µV = µY + µZ, variances σU² = σX² + σY² and σV² = σY² + σZ² (the cross terms vanish because X, Y, and Z are independent), and covariance cov(X + Y, Y + Z) = cov(X, Y) + cov(X, Z) + σY² + cov(Y, Z) = σY². Let Σ denote the covariance matrix of (U, V). By the multivariate CLT, Σ_{i=1}^{n} (Ui, Vi) is asymptotically normal with mean n(µU, µV) and covariance matrix nΣ.

9. Let Y be multivariate normal with mean vector 0 and strictly positive definite covariance matrix Σ. Let Σ^{1/2} be a symmetric square root of Σ; i.e., (Σ^{1/2})′ = Σ^{1/2} and Σ^{1/2}Σ^{1/2} = Σ. Define Z = (Σ^{1/2})^{−1} Y. What is the distribution of Z?

The mean vector of Z is (Σ^{1/2})^{−1} 0 = 0. The covariance matrix of Z is

cov(Z) = (Σ^{1/2})^{−1} Σ {(Σ^{1/2})^{−1}}′
       = (Σ^{1/2})^{−1} Σ (Σ^{1/2})^{−1}   (see explanation below)
       = (Σ^{1/2})^{−1} Σ^{1/2} Σ^{1/2} (Σ^{1/2})^{−1}
       = I I = I.

Note that we made use of the fact that A = Σ^{1/2} is symmetric, so (A^{−1})′ = A^{−1}. We can verify this statement by showing that (A^{−1})′A = I. But this follows from the fact that (A^{−1})′A = (A′A^{−1})′ = (AA^{−1})′ = I′ = I. Because Z is a linear transformation of a multivariate normal vector, Z is multivariate normal; therefore Z ∼ N(0, I).

10. Let Γ be an orthogonal matrix (i.e., Γ is k × k and ΓΓ′ = Γ′Γ = Ik, where Ik is the k-dimensional identity matrix).

(a) Prove that ||Γy|| = ||y|| for all k-dimensional vectors y. That is, orthogonal transformations preserve length.

||Γy||² = (Γy)′Γy = y′Γ′Γy = y′Iy = ||y||².

(b) Prove that if Y1, . . . , Yk are iid normals, then the components of Z = ΓY are also independent. That is, orthogonal transformations of iid normal random variables preserve independence.

The covariance matrix for Z is cov(Z) = Γσ²IΓ′ = σ²ΓΓ′ = σ²I. Because Z is multivariate normal, the zero off-diagonal covariances imply that its components are independent.
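The following added sketch (dimension, variance, and seed chosen arbitrarily) generates a random orthogonal Γ from a QR decomposition and checks both parts numerically: lengths are preserved, and the sample covariance matrix of ΓY stays close to σ²I.

    # Orthogonal transformations preserve length and keep cov(Gamma Y) = sigma^2 I.
    import numpy as np

    rng = np.random.default_rng(1)
    k, sigma = 5, 2.0
    Gamma, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random orthogonal matrix
    y = rng.standard_normal(k)
    print(np.isclose(np.linalg.norm(Gamma @ y), np.linalg.norm(y)))   # True

    Y = sigma * rng.standard_normal((100_000, k))   # rows are iid N(0, sigma^2 I)
    Z = Y @ Gamma.T
    print(np.round(np.cov(Z, rowvar=False), 2))     # approximately sigma^2 * I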


11. Helmert transformation. The Helmert transformation for iid N(µ, σ²) random variables Y1, . . . , Yn is Z = HY, where

H =
[ −1/√2           1/√2            0              0     · · ·    0 ]
[ 1/√6            1/√6            −√(2/3)        0     · · ·    0 ]
[ · · ·                                                            ]
[ 1/√{i(i+1)}     · · ·           1/√{i(i+1)}    −√{i/(i+1)}    0   · · ·   0 ]
[ · · ·                                                            ]
[ 1/√n            1/√n            · · ·                         1/√n ]

In row i, the number of {i(i + 1)}^{−1/2} terms is i.

(a) Show that H is orthogonal.

The rows of H have length 1 and are clearly orthogonal. Therefore, H is an orthogonal matrix.

(b) Using the facts that Zn = n^{1/2}Ȳ and orthogonal transformations preserve length, prove that Σ_{i=1}^{n−1} Zi² = Σ_{i=1}^{n} Yi² − nȲ² = Σ_{i=1}^{n} (Yi − Ȳ)².

The squared length ||Z||² must equal ||Y||² because orthogonal transformations preserve length. Also, Zn² = (Σ_{i=1}^{n} Yi)²/n. Therefore,

Σ_{i=1}^{n−1} Zi² = Σ_{i=1}^{n} Zi² − Zn² = Σ_{i=1}^{n} Yi² − (Σ_{i=1}^{n} Yi)²/n = Σ_{i=1}^{n} (Yi − Ȳ)².

(c) Using the representation in part b, prove that the sample mean and variance of a sample of size n from a N(µ, σ²) distribution are independent.

Because Z is multivariate normal with covariance matrix σ²I, its components are independent. Also, the sample variance, (n − 1)^{−1} Σ_{i=1}^{n−1} Zi², is a Borel function of only the first n − 1 components, while the sample mean, n^{−1/2} Zn, involves only the last component. Therefore, the sample mean and variance are independent.

(d) Where does this argument break down if the Yi are iid from a non-normal distribution?


If the Yi are iid from a non-normal distribution, then the fact that the covariance matrix for Z is σ²I implies only that the components of Z are uncorrelated, not that they are independent.
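As an added check (n = 6 and the seed are arbitrary), the sketch below builds the Helmert matrix as written above, confirms that it is orthogonal, and verifies the identity from part (b) on simulated data.

    # Helmert matrix: orthogonality and the sum-of-squares identity.
    import numpy as np

    def helmert(n):
        H = np.zeros((n, n))
        H[0, 0], H[0, 1] = -1 / np.sqrt(2), 1 / np.sqrt(2)
        for i in range(2, n):                      # rows 2, ..., n-1 (1-based)
            c = 1 / np.sqrt(i * (i + 1))
            H[i - 1, :i] = c                       # i entries equal to {i(i+1)}^(-1/2)
            H[i - 1, i] = -np.sqrt(i / (i + 1))
        H[n - 1, :] = 1 / np.sqrt(n)               # last row gives sqrt(n) * Ybar
        return H

    n = 6
    H = helmert(n)
    print(np.allclose(H @ H.T, np.eye(n)))         # True: H is orthogonal

    rng = np.random.default_rng(2)
    y = rng.normal(10.0, 3.0, size=n)
    z = H @ y
    print(np.isclose(np.sum(z[:-1] ** 2), np.sum((y - y.mean()) ** 2)))   # True
    print(np.isclose(z[-1], np.sqrt(n) * y.mean()))                       # True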


Section 9.1

1. Let Xn have density function fn(x) = 1 + cos(2πx)/n for x ∈ [0, 1]. Prove that P(Xn ∈ B) converges to the Lebesgue measure of B for every Borel set B.

fn(x) → f(x) = 1 for x ∈ [0, 1]. By Scheffé's theorem, sup_{B∈B} |P(Xn ∈ B) − µL(B)| → 0.

2. Consider the t-density

fν(x) = Γ{(ν + 1)/2} (1 + x²/ν)^{−(ν+1)/2} / {Γ(ν/2)√(πν)}.

Stirling's formula says that Γ(x)/{exp(−x)x^{x−1/2}(2π)^{1/2}} → 1 as x → ∞. Use this fact to prove that the t-density converges to the standard normal density as ν → ∞. If Z and Tν denote standard normal and t random variables with ν degrees of freedom, for what sets A does P(Tν ∈ A) converge to P(Z ∈ A) as ν → ∞? Is the convergence uniform?

By Stirling's formula, Γ{(ν + 1)/2} = exp{−(ν + 1)/2}{(ν + 1)/2}^{ν/2}(2π)^{1/2} rν and Γ(ν/2) = exp(−ν/2)(ν/2)^{(ν−1)/2}(2π)^{1/2} sν, where rν → 1 and sν → 1 as ν → ∞. Therefore,

fν(x) = [exp{−(ν + 1)/2}{(ν + 1)/2}^{ν/2}(2π)^{1/2} rν] / [exp(−ν/2)(ν/2)^{(ν−1)/2}(2π)^{1/2} sν √(πν)(1 + x²/ν)^{(ν+1)/2}]
      = exp(−1/2)(1 + 1/ν)^{ν/2}(ν/2)^{1/2} rν / {√(πν) sν (1 + x²/ν)^{(ν+1)/2}}
      = exp(−1/2)(1 + 1/ν)^{ν/2} rν / {√(2π) sν (1 + x²/ν)^{(ν+1)/2}}
      → exp(−1/2) exp(1/2) exp(−x²/2)/√(2π) = exp(−x²/2)/√(2π).

This is the standard normal density. Therefore, the t-density fν(x) with ν degrees of freedom converges to the standard normal density φ(x) as ν → ∞ for every x. By Scheffé's theorem, P(Tν ∈ A) converges to P(Z ∈ A) uniformly in Borel sets A as ν → ∞.
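A quick numerical look (added; the grid and degrees of freedom are arbitrary) at how fast the t-density approaches φ(x):

    # Maximum difference between the t density and the standard normal density.
    import numpy as np
    from scipy import stats

    x = np.linspace(-4, 4, 401)
    for nu in (5, 30, 200):
        gap = np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x)))
        print("nu =", nu, " max density difference =", round(gap, 5))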


3. Prove part of Theorem 9.5 in R², namely that if (Xn, Yn) ∼ Fn(x, y) converges in distribution to (X, Y) ∼ F(x, y) and F is continuous, then Fn converges uniformly to F. Do this in 3 steps: (1) for given ε > 0, prove there is a bound B such that P({|X| > B} ∪ {|Y| > B}) < ε/2 and P({|Xn| > B} ∪ {|Yn| > B}) < ε/2 for all n; (2) use the fact that F is continuous on the compact set C = [−B, B] × [−B, B] to divide C into squares such that |Fn(x, y) − F(x, y)| is arbitrarily small for (x, y) corresponding to the corners of the squares; (3) use the fact that, within each square, |Fn(x2, y2) − Fn(x1, y1)| and |F(x2, y2) − F(x1, y1)| are maximized when (x1, y1) and (x2, y2) are at the "southwest" and "northeast" corners.

For given ε > 0, there is a bound B such that P({|X| > B} ∪ {|Y| > B}) < ε/2 and P({|Xn| > B} ∪ {|Yn| > B}) < ε/2 for all n. This follows from the fact that the distribution functions Fn(x, y) are tight (Corollary 6.50). Because F is continuous, it is uniformly continuous on the square C = {(x, y) : |x| ≤ B, |y| ≤ B} because C is a compact set. It follows that, for given ε > 0, we can find δ such that if |x| ≤ B, |y| ≤ B, |x′| ≤ B, |y′| ≤ B, and ||(x, y) − (x′, y′)|| < δ, then |F(x, y) − F(x′, y′)| < ε/3. Partition the square C into mini squares such that the Euclidean distance between any two points (x, y) and (x′, y′) within each mini square is less than δ. Then |F(x, y) − F(x′, y′)| < ε/3 for each pair of points within each mini square. Choose N such that |Fn(x, y) − F(x, y)| < ε/3 for all corners (x, y) of all mini squares and n ≥ N. This will be possible because there are only finitely many corners and Fn(x, y) → F(x, y) for each corner. If (x, y) is any point in C, denote by sw and ne the southwest (lower left) and northeast (upper right) corners of the mini square containing (x, y). Then

|Fn(x, y) − F(x, y)| = |Fn(x, y) − Fn(xsw, ysw) + Fn(xsw, ysw) − F(xsw, ysw) + F(xsw, ysw) − F(x, y)|
≤ {Fn(xne, yne) − Fn(xsw, ysw)} + |Fn(xsw, ysw) − F(xsw, ysw)| + {F(xne, yne) − F(xsw, ysw)}
≤ max1 + max2 + max3,

where max1, max2, and max3 are the maxima over all corners of all mini squares of the terms Fn(xne, yne) − Fn(xsw, ysw), |Fn(xsw, ysw) − F(xsw, ysw)|, and F(xne, yne) − F(xsw, ysw), respectively. Each of the three maxima is less than ε/3 for n ≥ N, so |Fn(x, y) − F(x, y)| < ε for all (x, y) ∈ C and n ≥ N.


For each (x, y) outside of C, we have |Fn(x, y) − F(x, y)| = |1 − F(x, y) − {1 − Fn(x, y)}| ≤ |1 − F(x, y)| + |1 − Fn(x, y)| < ε/2 + ε/2 = ε. We have shown that, for each ε > 0, there is an N such that n ≥ N ⇒ |Fn(x, y) − F(x, y)| < ε for all (x, y). This proves that Fn converges uniformly to F.

4. Recall that in Exercise 11 of Section 3.4, there are n people, each with a different hat. The hats are shuffled and passed back in random order. Let Yn be the number of people who get their own hat back. You used the inclusion-exclusion formula to see that P(Yn ≥ 1) → 1 − exp(−1). Extend this result by proving that P(Yn = k) → P(Y = k) as n → ∞, where Y has a Poisson distribution with parameter 1. Conclude that supA |P(Yn ∈ A) − P(Y ∈ A)| → 0, where the supremum is over all subsets of Ω = {0, 1, 2, . . .}. Hint: P(Yn = k) = (n choose k) P(the first k people get their own hat back and none of the remaining n − k people get their own hat back).

P(Yn = k) = (n choose k) P(the first k people get their own hat back and none of the remaining n − k people get their own hat back). The probability that the first k people get their own hats back is n^{−1}(n − 1)^{−1} · · · (n − k + 1)^{−1}. Note that (n choose k) n^{−1}(n − 1)^{−1} · · · (n − k + 1)^{−1} = 1/k!. Given that the first k people get their own hats back, it is as if we have a fresh batch of n − k people with different hats, and we want to know the probability that none of them get their own hat back. From Exercise 11 of Section 3.4, this probability converges to 1/e. Therefore, P(Yn = k) = (n choose k) P(the first k people get their own hat back and none of the remaining n − k people get their own hat back) converges to (1/k!)/e = exp(−1)(1)^k/k!, the Poisson probability mass function with λ = 1. By Scheffé's theorem, sup |P(Yn ∈ A) − P(Y ∈ A)| → 0, where the sup is over all subsets of Ω.
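An added simulation (n = 100 people and 100,000 shuffles, both arbitrary) of the hat-matching experiment shows the agreement with the Poisson(1) mass function.

    # Number of people who get their own hat back vs. the Poisson(1) pmf.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, reps = 100, 100_000
    fixed = np.array([np.sum(rng.permutation(n) == np.arange(n)) for _ in range(reps)])
    for k in range(5):
        print(k, round(np.mean(fixed == k), 4), round(stats.poisson.pmf(k, 1.0), 4))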


Section 9.5

1. Suppose that we reject a null hypothesis for small values of a test statistic Yn that is uniformly distributed on An ∪ Bn under the null hypothesis, where An = [−1/n, 1/n] and Bn = [1 − 1/n, 1 + 1/n]. Show that the "approximate" p-value using the asymptotic null distribution of Yn is not necessarily close to the exact p-value.

If Yn = y < 0, the asymptotic p-value is P(Y < y) = 0, even though P(Yn < y) could be nearly 1/4 if y is very close to 0.

2. Prove Corollary 9.4 for intervals of the form

(a) (−∞, x). P(−∞ < Xn < x) = Fn(x) − P(Xn = x), and we showed in the partial proof of Corollary 9.4 that Fn(x) → F(x) uniformly in x and P(Xn = x) → P(X = x) uniformly in x. Therefore, P(−∞ < Xn < x) → P(−∞ < X < x) uniformly in x.

(b) (x, ∞).

P (x < Xn < ∞) = 1 − Fn (x), and Fn (x) → F (x) uniformly in x. Therefore, P (x < Xn < ∞) → P (x < X < ∞) uniformly in x.

(c) [x, ∞).

P (x ≤ Xn < ∞) = 1 − Fn (x) + P (Xn = x), and Fn (x) → F (x) uniformly in x and P (Xn = x) → P (X = x) uniformly in x. Therefore, P (x ≤ Xn < ∞) → P (x ≤ X < ∞) uniformly in x.

(d) [a, b). P(a ≤ Xn < b) = Fn(b) − Fn(a) + P(Xn = a) − P(Xn = b), and Fn(x) → F(x) uniformly in x and P(Xn = x) → P(X = x) uniformly in x. Therefore, sup_{a,b} |P(a ≤ Xn < b) − P(a ≤ X < b)| → 0.

3. Let Xn and X have probability mass functions fn and f on the integers k = 0, −1, +1, −2, +2, . . ., and suppose that fn(k) → f(k) for each k. Without using Scheffé's theorem, prove that Xn converges in distribution to X. Then prove a stronger result using Scheffé's theorem.

First we do not use Scheffé's theorem. It is clear that P(Xn ∈ B) → P(X ∈ B) for every B with a finite number of integers. Therefore,


P(|Xn| ≤ x) → P(|X| ≤ x). By Corollary 6.50, the sequence of distribution functions for |Xn| is tight. This clearly implies that the sequence Fn of distribution functions for Xn is tight. Now let ε > 0 be given. We must show that there is an N such that |Fn(x) − F(x)| < ε for n ≥ N. First find s < 0 such that |Fn(s) − F(s)| < ε/2 for all n. This is possible because we can find s1 such that Fn(s1) ≤ ε/4 for all n (tightness), and s2 such that F(s2) ≤ ε/4, so take s = min(s1, s2). Now find N such that |P(s < Xn ≤ x) − P(s < X ≤ x)| < ε/2 for all n ≥ N. This is possible because there are only finitely many integers between s and x. This shows that for n ≥ N,

|Fn(x) − F(x)| = |Fn(x) − Fn(s) + Fn(s) − {F(x) − F(s) + F(s)}| ≤ |Fn(x) − Fn(s) − {F(x) − F(s)}| + |Fn(s) − F(s)| < ε/2 + ε/2 = ε,

completing the proof. Now let us use Scheffé's theorem. By Scheffé's theorem, P(Xn ∈ B) → P(X ∈ B) uniformly across all Borel sets as n → ∞.

4. Find a sequence of density functions converging to a non-density function.

Let fn(x) be the normal density function with mean 0 and variance 1/n. Then fn(x) converges to ∞ if x = 0 and to 0 if x ≠ 0, which is not a density function.

5. If X1, . . . , Xn are iid with mean 0 and variance σ² < ∞, prove that nX̄²/σ² converges in distribution to a central chi-squared random variable with 1 degree of freedom, which is not normal. Why does this not contradict the delta method?

By the CLT, n^{1/2}X̄/σ converges in distribution to Z ∼ N(0, 1). Because f(x) = x² is a continuous function, the Mann-Wald theorem implies that {n^{1/2}X̄/σ}² converges in distribution to Z², which is a chi-square with 1 degree of freedom. This does not contradict the delta method because f′(0) = 0.
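An added simulation (exponential data recentered to mean 0, with arbitrary n and replication count) illustrates the chi-squared limit of problem 5 by comparing simulated quantiles of nX̄²/σ² with chi-squared(1) quantiles.

    # Quantiles of n * Xbar^2 / sigma^2 versus the chi-squared(1) distribution.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    reps, n = 10_000, 500
    x = rng.exponential(1.0, size=(reps, n)) - 1.0    # iid, mean 0, variance 1
    stat = n * x.mean(axis=1) ** 2                    # sigma = 1 here
    qs = [0.5, 0.9, 0.95, 0.99]
    print(np.round(np.quantile(stat, qs), 3))
    print(np.round(stats.chi2.ppf(qs, df=1), 3))      # should be close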

6. Use the asymptotic distribution of θ̂ = ln(p̂1/p̂2) derived in Example 9.12 in conjunction with the one-dimensional delta method to prove that the asymptotic distribution of the relative risk is given by Equation (9.8).

We found in Example 9.12 that θ̂ = ln(p̂1/p̂2) was asymptotically normal with mean ln(p1/p2) and variance (1 − p1)/(np1) + (1 − p2)/(np2). Also, f(x) = exp(x) has f′(θ) = exp(θ). Therefore, by the delta method, f(θ̂) is


asymptotically normal with mean exp(θ) = p1/p2 and variance

var{f(θ̂)} = {exp(θ)}² {(1 − p1)/(np1) + (1 − p2)/(np2)}
          = (p1/p2)² {(1 − p1)/(np1) + (1 − p2)/(np2)}
          = p1(1 − p1)/(np2²) + p1²(1 − p2)/(np2³).

This agrees with the variance in Equation (9.8).

7. Give an example to show that Theorem 9.25 is false if we remove the condition that F and G be non-degenerate. Let Xn be normal with mean 0 and variance 1. With an = 0 and D bn = 1, (Xn − an )/bn → N(0, 1). But now let αn = 0 and βn = 1/n. Then (Xn − αn )/βn converges in distribution to a point mass at 0, but bn /βn = 1/(1/n) = n does not converge to a constant as n → ∞. 8. Let Sn be the sum of iid Cauchy random variables with parameters θ and λ (see Table 8.2 and Exercise 7 of Section 8.4). Do there exist normalizing constants an and bn such that D (Sn − an)/bn → N(0, 1)? If so, find them. If not, explain why not. D

No, because (Sn − 0)/1 → Cauchy(θ, λ). Therefore, by Theorem 9.25, if (Sn − an )/bn converges in distribution, it must be to F {(x − a)/b} for some constants a and b, where F is the Cauchy distribution with parameters θ and λ. 9. Suppose you have two “trick” coins having probabilities 0.20 and 0.80 of heads. Randomly choose a coin, and then flip it ad infinitum. Let Xi be the indicator of heads for flip i, and  ˆn converge to a constant (either pˆn = (1/n) n i=1 Xi. Does p almost surely or in probability)? If so, what is the constant? Does pˆn converge in distribution? If so, to what distribution? Is (pˆn − an)/bn asymptotically normal for some an and bn?

p̂n converges in distribution to a random variable taking the value 0.20 with probability 1/2 and 0.80 with probability 1/2. Therefore, p̂n cannot converge to a constant almost surely or in probability because if it did, then p̂n would converge in distribution to that constant. Also, (p̂n − an)/bn cannot converge in distribution to a normal. That would violate Theorem 9.25 because we have already shown that (p̂n − 0)/1 converges in distribution to a two point distribution.
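An added simulation of the trick-coin experiment (10,000 replications of n = 5,000 flips, both arbitrary) shows p̂n piling up near 0.20 and 0.80 rather than settling at one constant.

    # One coin (p = 0.2 or 0.8) is chosen per experiment and flipped n times.
    import numpy as np

    rng = np.random.default_rng(5)
    reps, n = 10_000, 5_000
    p = rng.choice([0.2, 0.8], size=reps)
    phat = rng.binomial(n, p) / n
    near_limit = (np.abs(phat - 0.2) < 0.01) | (np.abs(phat - 0.8) < 0.01)
    print("fraction of phat within 0.01 of 0.2 or 0.8:", np.mean(near_limit))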


130 10. Prove Proposition 9.20. Suppose that E(|Xn |1+δ ) ≤ B for all n, where δ > 0. Let 1/(1 + δ) + 1/p = 1. By Holder’s inequality,

E{|Xn |I(|Xn | ≥ A)} ≤ {E(|Xn |1+δ )}1/(1+δ) [E{I(|Xn | ≥ A)}p ]1/p ≤ B 1/(1+δ) {P (|Xn | ≥ A)}1/p = B 1/(1+δ) {P (|Xn |1+δ ≥ A1+δ )}1/p ≤ B 1/(1+δ) {E(|Xn |1+δ )/A1+δ }1/p (Markov’s ineq.) ≤ B 1/(1+δ) {B/A1+δ }1/p . Because this expression is free of n and tends to 0 as A → ∞, Xn is UI. 11. Let Xi be iid with E(|X1 |) < ∞. Prove that Sn/n is UI. E{|Sn /n|I(|Sn /n| ≥ A)} = E{|Sn /n|I(|Sn | ≥ nA)} 

≤ E (1/n)

n  i=1

|Xi |I(|Sn | ≥ nA)



= (1/n)nE{|X1 |I(|Sn | ≥ nA)} = E{|X1 |I(|Sn | ≥ nA)} = E{|X1 |I(|X1 | ≤ B)I(|Sn | ≥ nA)} + E{|X1 |I(|X1 | > B)I(|Sn | ≥ nA)} ≤ BP (|Sn | ≥ nA) + E(|X1 |I(|X1 | > B)} BnE(|X1 |) BE(|Sn |) + E(|X1 |I(|X1 | > B)} ≤ + E(|X1 |I(|X1 | > B)} nA nA BE(|X1 |) + E(|X1 |I(|X1 | > B)} = A for any B > 0. For given  > 0, choose B large enough to make the righmost term less than /2. Then choose A large enough that the first term is less than /2. Then E{|Sn /n|I(|Sn /n| ≥ A)} <  for n. This completes the proof that |Sn /n| is uniformly integrable. ≤

D

12. If Xn → X and E(|Xnr |) → E(|X r |) < ∞, r > 0, then Xnr is UI. D

D

Because Xn → X, the Mann-Wald theorem implies that Xnr → X r because f (x) = xr is a continuous function. By Theorem 9.19, Xnr is UI.

K24704_SM_Cover.indd 134

01/06/16 10:39 am

131 13. Let Xn be binomial (n, pn). (a) Prove that if pn = p for all n, then 1/Xn is asymptotically normal, and determine its asymptotic mean and variance (i.e., the mean and variance of the asymptotic distribution of 1/Xn). How do these compare with the exact mean and variance of 1/Xn? Note that 1/Xn is infinite if Xn = 0. D

By the CLT, (Xn /n − p)/{p(1 − p)/n}1/2 → Z, a standard normal deviate. Let f (x) = 1/x. By the delta method, f (Xn /n) is asymptotically normal with mean 1/p and variance {f  (p)}2 p(1 − p)/n = (1/p4 )p(1 − p)/n = (1 − p)/(np3 ). That is, (1/Xn ) is asymptotically normal with mean (np)−1 and variance (1 − p)/(np)3 . On the other hand, the exact mean of 1/Xn is ∞ because Xn = 0 with nonzero probability. The variance of Xn is undefined because Xn has infinite mean. (b) If pn = λ/n for some constant λ, prove that 1/Xn does not converge in distribution to a finite-valued random variable. By the law of small numbers, Xn converges in distribution to a Poisson (λ). Therefore, for y > 0, P (1/Xn ≤ y) = P (Xn ≥ 1/y) =



P (Xn = j).

j≥1/y



As n → ∞, this probability converges to j≥1/y exp(−λ)λj /j! ≤ 1 − exp(−λ)λ0 /0! = 1 − exp(−λ) < 1. We have shown that the distribution function Fn (y) for Yn = 1/Xn converges to a function F (y) that is not a distribution function because limy→∞ F (y) ≤ 1 − exp(−λ) < 1. In other words, there is nonzero probability that Y = ∞. 14. Let µn be the binomial probability measure with parameters n and pn, where npn → λ. If ν is the Poisson probability measure with parameter λ, prove the following improvement of the law of small numbers (Proposition 6.24): supB∈B |µn(B) − ν(B)| → 0. We showed in the proof of the Law of Small Numbers that the binomial probability mass function converges to the Poisson probability mass functions. Probability mass functions are densities with respect to counting measure, so Scheffe’s theorem implies that supB∈B |µn (B) − ν(B)| → 0. 15. Consider a permutation test in a paired data setting, as in Examples 8.10 and 8.20. Let pn = pn(Zn) be the exact, one-tailed

K24704_SM_Cover.indd 135

01/06/16 10:39 am

132 



Di2 , permutation p-value corresponding to Zn = n i=1 Di/ and let pn be the approximate p-value 1 − Φ(Zn). Using what a.s. was shown in Example 8.20, prove that pn(Zn) − pn(Zn) → 0.

The exact one-tailed p-value is 1 − Fn (Zn ), where Fn is the permutation distribution obtained by fixing d1 , d2 , . . . and treating ±di as being equally likely. In Example 8.20, we showed that, with probability 1, the permutation distribution Fn (z) converges in distribution to Φ(z), a continuous distribution. By Polya’s theorem, the convergence is uniform. It follows that, with probability 1, supz |1 − Fn (z) − {1 − Φ(z)}| → 0 as n → ∞. Therefore, with probability 1, |1 − Fn (Zn ) − {1 − Φ(Zn )}| → 0. That is, the difference between the exact and approximate p-values tends to 0 with probability 1. 16. Prove Proposition 9.23. Suppose that |Xn | ≤ Y a.s., where E(Y ) < ∞. Then E{|Xn |I(|Xn | ≥ A)} ≤ E{Y I(Y ≥ A)} → 0 as A → ∞ by the DCT because Y I(Y ≥ a.s. A) → 0 and is dominated by the integrable random variable Y . Therefore, if we choose A large enough that E{Y I(Y ≥ A)} ≤ , then E{|Xn |I(|Xn | ≥ A)} ≤  for all n. This shows that Xn is UI.

K24704_SM_Cover.indd 136

01/06/16 10:39 am

133 Section 10.1 1. Let X and Y be independent Bernoulli (p) random variables, and let S = X + Y . What is the conditional probability mass function of Y given S = s for each of s = 0, 1, 2? What is the conditional expected value of Y given the random variable S? For s = 0, P (Y = 0 | S = 0) = 1.

For s = 1,

P (Y = 0 | S = 1) = P (Y = 0, X = 1 | S = 1) (1 − p)p = P (S = 1) p(1 − p) = P (X = 0, Y = 1) + P (X = 1, Y = 0) p(1 − p) = 1/2 = 2p(1 − p) = P (Y = 1 | S = 1). For s = 2, P (Y = 1 | S = 2) = 1. Therefore,  0

if S = 0 Z = E(Y |S) =  1/2 if S = 1 1 if S = 2.

2. Verify directly that in the previous problem, Z = E(Y | S) satisfies Equation (10.8) with X in this expression replaced by S. Let Z = E(Y | S) as defined above. Suppose that B contains none of 0, 1, or 2. Then E{ZI(S ∈ B)} = 0 = E{Y I(S ∈ B)}.

Suppose the only point among 0, 1, or 2 that B contains is 0. Then E{ZI(S ∈ B)} = E{ZI(S = 0)} = 0 = E{Y I(S ∈ B)}.

If the only point among 0, 1, or 2 that B contains is 1, then

E{ZI(S ∈ B)} = E{ZI(S = 1)} = E{(1/2)I(S = 1)} = (1/2)P (S = 1) = (1/2)2p/(1 − p) = p(1 − p). Also, E{Y I(S ∈ B)} = E[Y {I(X = 0, Y = 1) + I(X = 1, Y = 0)}] = E{Y I(X = 0, Y = 1)} + E{Y I(X = 1, Y = 0)}

K24704_SM_Cover.indd 137

01/06/16 10:39 am

134 = E{1I(X = 0, Y = 1)} + 0 = P (X = 0, Y = 1) = p(1 − p). If the only point among 0, 1, or 2 that B contains is 2, then E{ZI(S ∈ B)} = E{1 · I(S = 2)} = P (S = 2) = p2 . Likewise, E{Y I(S = 2)} = E{1 · I(S = 2)} = P (S = 2) = p2 .

If B contains more than one point among 0, 1, or 2, then E{ZI(S ∈   B)} = j∈B E{ZI(S = j)} = j∈B E{Y I(S = j)} = E{Y I(S ∈ B)}.

3. If X and Y are independent with respective densities f (x) and g(y) and E(|Y |) < ∞, what is E(Y | X = x)? What about Z = E(Y | X)? Verify directly that Z satisfies Equation (10.8). 

E(Y | X = x) = µY = yg(y)dy. To see that µY satisfies Equation (10.8), note that E{µY I(X ∈ B)} = µY P (X ∈ B) and E{Y I(X ∈ B)} = E(Y )E{I(X ∈ B) by Proposition 5.29 = µY P (X ∈ B) = E{µY I(X ∈ B)}.

This verifies that Equation (10.8) holds. 4. Let U1 and U2 be independent observations from a uniform distribution on [0, 1], and let X = min(U1 , U2 ) and Y = max(U1 , U2 ). What is the joint density function for (X, Y )? Using this density, find Z = E(Y | X). Verify directly that Z satisfies Equation (10.8). The joint density for (X, Y ) is f (x, y) = 2I(x < y). E(Y | X = x) = 1 1 2 x y(2dy)/ x (2dy) = {(1 − x )/2}/(1 − x) = (1 + x)/2. Therefore, Z = E(Y | X) = (1 + X)/2.

To see that this expression satisfies Equation (10.8), note that E{ZI(Z ∈ B)} = E







1+X I(X ∈ B) , 2

whereas E{Y I(X ∈ B)} = = =

K24704_SM_Cover.indd 138

  1 B

x

  1 B



B

x



yf (x, y)dy dx 

2ydy dx =



B

(1 − x2 )

(1 + x) 2(1 − x)dx 2

01/06/16 10:39 am

135

= E





1+X I(X ∈ B) 2



because 2(1−x) is the marginal density of x. This proves that E{ZI(X ∈ B)} = E{Y I(X ∈ B)} 5. Let Y have a discrete uniform distribution on {1, −1, 2, −2, . . . , n, −n}. I.e., P (Y = y) = 1/(2n) for y = ±i, i = 1, . . . , n. Define X = |Y |. What is E(Y | X = x)? What about Z = E(Y | X)? Verify directly that Z satisfies Equation (10.8).

E(Y | X = x) = 0 and Z = E(Y | X) = 0 a.s. To see that Z satisfies Equation (10.8), note that E{ZI(X ∈ B)} = 0 and E{Y I(X ∈ B)} =   j∈B E{Y I(X = j)} = j∈B 0 = 0. Therefore, Z satisfies Equation (10.8).

6. Notice that Expression (10.2) assumes that (X, Y ) has a density with respect to two-dimensional Lebesgue measure or counting measure. Generalize Expression (10.2) to allow (X, Y ) to have a density with respect to an arbitrary product measure µX ×µY .

Let f (x, y) be a density with respect a product measure µ×ν. If E(|Y |) = ydν(y) < ∞, then the conditional expected value of Y given X = x is





E(Y | X = x) = 

yf (x, y)dν(y) . f (x, y)dν(y)

7. ↑ A mixed Bernoulli distribution results from first observing the value p from a random variable P with density f (p), and then observing a random variable Y from a Bernoulli (p) distribution. (a) Determine the density function g(p, y) of the pair (P, Y ) with respect to the product measure µL ×µC , where µL and µC are Lebesgue measure on [0, 1] and counting measure on {0, 1}, respectively. g(p, y) = pI(y = 1) + (1 − p)I(y = 0).

(b) Use your result from the preceding problem to prove that E(Y | P ) = P a.s. The conditional expectation of Y given P = p is, from the preceding problem, E(Y | P = p) =

K24704_SM_Cover.indd 139



y{pI(y = 1) + (1 − p)I(y = 0)}dµC (y) {pI(y = 1) + (1 − p)I(y = 0)}dµC (y)



01/06/16 10:39 am

136

K24704_SM_Cover.indd 140

=

1

=

p = p. 1−p+p

y{pI(y = 1) + (1 − p)I(y = 0)} y=0 {pI(y = 1) + (1 − p)I(y = 0)}

y=0

1

01/06/16 10:39 am

137 Section 10.2 1. Let Y have a discrete uniform distribution on {±1, ±2, . . . , ±n}, and let X = Y 2 . Find E(Y | X = x). Does E(Y | X = x) match what you got for E(Y | X = x) for X = |Y | in Problem 5 in the preceding section? Now compute Z = E(Y | X) and compare it with your answer for E(Y | X) in Problem 5 in the preceding section. E(Y | X = x) = 0. This is the same answer as in Problem 5 of the previous section because |Y | and Y 2 are 1-1 functions of each other. Likewise, E(Y | X) = 0 a.s., just as in Problem 5 of the preceding section. 2. Let Y be as defined in the preceding problem, but let X = Y 3 instead of Y 2 . Find E(Y | X = x) and Z = E(Y | X). Does Z match your answer in the preceding problem? Now E(Y |X = x) = E(Y | Y 3 = x) = E(Y | Y = x1/3 ) = x1/3 . Also, E(Y | X) = X 1/3 . This does not match the answer of the preceding problem. 3. Tell whether the following is true or false. If it is true, prove it. If it is false, give a counterexample. If E(Y | X1 ) = E(Y | X2 ), then X1 = X2 almost surely. This is false. We saw that in the discrete uniform setting of Problem 1 of this section and Problem 5 of the preceding section, E(Y | Y 2 ) was the same as E(Y | |Y |), yet it is not the case that Y 2 = |Y | a.s. 4. Let Y be a random variable defined on (Ω, F , P ) with E(|Y |) < ∞. Verify the following using Definition 10.7. (a) If A = {Ω, ∅}, then E(Y | A) = E(Y ) a.s. Let µY = E(Y ). To prove that µY is a version of E(Y | A), note that E{µY I(∅)} = 0 = E{Y I(∅)} because P (∅) = 0. Similarly, E{µY I(Ω)} = µY P (Ω) = µY and E{Y I(Ω)} = E(Y ) = µY . Therefore, µY is a version of E(Y A).

(b) If A = σ(Y ), then E(Y | A) = Y a.s.

Because Z = Y , it is trivially true that E{ZI(A)} = E{Y I(A)}. Therefore Y is a version of E(Y | A). ¯ = 5. Let X1 , . . . , Xn be iid with E(|Xi|) < ∞. Prove that E(X1 | X) n ¯ ¯ ¯ X a.s. Hint: it is clear that E{(1/n) i=1 Xi | X} = X a.s.

K24704_SM_Cover.indd 141

01/06/16 10:39 am

138 Using the hint, we have that ¯ = E(X ¯ | X) ¯ = (1/n) X

n  i=1

¯ E(Xi | X)

¯ = E(X1 | X). ¯ = (1/n)nE(X1 | X) 6. If Y is a random variable with E(|Y |) < ∞, and g is a Borel function, then E{Y | X, g(X)} = E(Y | X) a.s. The sigma-field generated by (X, g(X)) is σ(X) because each event of the form {ω : g(X(ω)) ∈ B} is of the form {ω : X(ω) ∈ A for the Borel set A = g −1 (B).

7. Suppose that X is a random variable with mean 0 and variance σ 2 < ∞, and assume that E(Y | X) = X a.s. Find E(XY ). E(XY ) = E{E(XY | X)} = E{XE(Y | X)} = E(XX) = E(X 2 ) = σ 2 . 8. Prove Proposition 10.3 when (X, Y ) has a joint probability mass function. Suppose that (X, Y ) has joint probability mass function f (x, y), and let g(x) be the marginal probability mass function of X. Let Z = E(Y | X) =

  y yf (X,y) 

g(X)

0

if g(X) = 0 if g(X) = 0.

To see that Z satisfies the definition of conditional expectation, note that if P (X ∈ B) = 0, then g(x) = 0 for each x ∈ B, in which case E{ZI(X ∈ B)} = 0 = E{Y I(X ∈ B)}. On the other hand, if P (X ∈ B) = 0, then E{ZI(X ∈ B)} =





x∈B,g(x)>0 y



=



yf (x, y) g(x) g(x) yf (x, y) =

x∈B,g(x)>0 y



yf (x, y)

x∈B y

=    I(x ∈ B) yf (x, y) = yI(x ∈ B)f (x, y) = x

y

x

y

= E{Y I(X ∈ B)}. The interchange of order of summation is by Fubini’s theorem because sums are integrals with respect to counting measure.

K24704_SM_Cover.indd 142

01/06/16 10:39 am

139 9. Prove part 2 of Proposition 10.13. Suppose that P (Y1 ≤ Y2 ) = 1. Let Zi = E(Yi | A), i = 1, 2. Then Z2 −Z1 is a version of E(Y2 − Y1 | A) The set A = {ω : Z2 (ω) − Z1 (ω) < 0} is in A. By definition of conditional expectation, E{(Z2 − Z1 )I(A)} = E{(Y2 − Y1 )I(A)}.

(19)

If the probability of A were strictly positive, then E{(Z2 − Z1 )I(A)} would be strictly negative, which is a contradiction because the right side of (19) is positive because P (Y1 ≤ Y2 ) = 1. Therefore, P (A) must be 0, which means that Z1 ≤ Z2 with probability 1.

K24704_SM_Cover.indd 143

01/06/16 10:39 am

140 Section 10.3 1. Let (X, Y ) take the values (0, 0), (0, 1), (1, 0), and (1, 1) with probabilities p00 , p01 , p10 , and p11 , respectively, where p00 + p01 + p10 + p11 = 1 and p00 + p01 > 0, p10 + p11 > 0. (a) What is the conditional distribution of Y given X = 0? p00 P (X = 0, Y = 0) = P (X = 0) p00 + p01 p01 P (X = 0, Y = 1) = P (Y = 1 | X = 0) = . P (X = 0) p00 + p01 P (Y = 0 | X = 0) =

(b) Show that E(Y | X) is linear in X, and determine the slope and intercept.

p10 P (X = 1, Y = 0) = P (X = 1) p10 + p11 p11 P (X = 1, Y = 1) = P (Y = 1 | X = 1) = . P (X = 1) p10 + p11

P (Y = 0 | X = 1) =

Set E(Y | X) = β0 + β1 X and solve for β0 and β1 : β0 = E(Y | X = 0) = p01 /(p00 + p01 ) and β0 + β1 = E(Y | X = 1) = p11 /(p10 + p11 ). Therefore, β1 = p11 /(p10 + p11 ) − p01 /(p00 + p01 ). 2. Roll a die and let X denote the number of dots showing. Then independently generate Y ∼ U(0, 1), and set Z = X + Y . (a) Find a conditional distribution function of Z given X = x and a conditional distribution function of Z given Y = y. Given X = x, Z is uniformly distributed on [x, x + 1]. Given Y = y, Z has a discrete uniform distribution on {i + y, i = 1, . . . , 6}. (b) Find a conditional distribution function of X given Z = z and a conditional distribution function of Y given Z = z. Given Z = z, X must have been z, the greatest integer less than or equal to x. Therefore, a conditional distribution of X given Z = z is a point mass at z. A conditional distribution of Y given Z = z is a point mass at z − z.

K24704_SM_Cover.indd 144

01/06/16 10:39 am

141 3. Dunnett’s one-tailed test for the comparison of k treatment means µ1 , . . . , µk to a control mean µ0 with common known variance σ 2 and common sample size n rejects the null hypothesis if maxi Zi0 > c, where Y¯i − Y¯0 Zi0 =  2σ 2 /n

and Y¯i is the sample mean in arm i. Under the null hypothesis, µi = µ0 , i = 1, . . . , k, and without loss of generality, assume that µi = 0, i = 0, 1, . . . , k. Therefore, assume that Y¯i ∼ N(0, σ 2 /n). (a) Find the conditional distribution of maxi Zi0 given Y¯0 = y0 . Given Y0 = y0 , the Zi0 are iid normals with mean (0−y0 )/(2σ 2 /n)1/2 = −y0 /(2σ 2 /n)1/2 and variance (σ 2 /n)/(2σ 2 /n) = 1/2. Therefore, given Y = y0 , the probability that maxi Zi0 ≤ z is the conditional probability that all Zi0 ≤ z, namely   



 Φ  

k   2σ /n      1/2

z − √−y20



= Φ





k √ n y0 2z + . σ

(b) Find the conditional distribution of maxi Zi0 given Y¯0 = z0 σ/n1/2 . Simply replace y0 by z0 σ/n1/2 in the answer to part (a) to get

Φ



2 z + z0

k

.

(c) Find the unconditional distribution of maxi Zi0 . By part (b), 

P max Zi0 ≤ i

¯ Yi z 

σ/n



√ = z0  = {Φ( 2 z + z0 )}k .

Furthermore, the distribution of Z0 = Y¯i /(σ/n)1/2 is N(0, 1). Accordingly, the unconditional distribution of maxi Zi0 is P (max Zi0 ≤ z) = EE{I(max Zi0 ≤ z | Z0 )} i i    ∞ 2 √ k exp(−z0 /2) √ = {Φ( 2 z + z0 )} dz0 . −∞ 2π

K24704_SM_Cover.indd 145

01/06/16 10:39 am

142 4. Let X, Y be iid N(0, σ 2 ). Find the conditional distribution of X 2 − Y 2 given that X + Y = s. X 2 − Y 2 = (X − Y )(X + Y ). Given that X + Y = s, X 2 − Y 2 has the distribution of s(X − Y ), namely normal with mean 0 and variance 2σ 2 s2 .

5. Fisher’s least significant difference (LSD) procedure for testing whether means µ1 , . . . , µk are equal declares µ1 < µ2 if both the t-statistic comparing µ1 and µ2 and F-statistic comparing all means are both significant at level α. When the common variance σ 2 is known, this is equivalent to rejecting the null 2 hypothesis if Z12 > c1,α and R2 > ck−1,α, where k  n Y¯1 − Y¯2 Z12 =  (Y¯i − Y¯ )2 , R2 = 2 2 (k − 1)σ i=1 2σ /n

and ci,α is the upper α point of a chi-squared distribution with i degrees of freedom. Use the result of Problem 11 of Section 2 2 . + R2 given Z12 8.6 to find the conditional distribution of Z12 2 2 Use this to find an expression for P (Z12 > c1 ∩ R > ck−1,α).

Without loss of generality, assume that µ = 0 and σ 2 /n = 1. Apply the Helmert transformation to the independent normal random variables Y¯1 , . . . , Y¯k : U = H(Y¯1 , . . . Y¯k ) . Then U1 , . . . , Uk are iid normal random variables with mean 0 and variance 1. Also, Z12 = U1 = (Y¯1 − Y¯2 )/21/2   2 2 and R2 = (k − 1)−1 ki=1 (Y¯i − Y¯ )2 = (k − 1)−1 k−1 i=1 Ui . Let V1 = U1 k−1 2 and V2 = i=2 Ui . Then 2 P (Z12 > c1,α ∩ R2 > ck−1,α )

= P  {V1 > c1,α ∩ V2 > (k − 1)ck−1,α − V1 } =

= =

=



 

P {V1 > c1,α ∩ V2 > (k − 1)ck−1,α − V1 | V1 = v1 } f1 (v1 )dv1 P {V1 > c1,α ∩ V2 > (k − 1)ck−1,α − v1 | V1 = v1 } f1 (v1 )dv1

I(v1 > c1,α )P {V2 > (k − 1)ck−1,α − v1 } f1 (v1 )dv1 I(v1 > c1,α )[1 − Fk−2 {(k − 1)ck−1,α − v1 }]f1 (v1 )dv1 ,

where Fk−2 is the chi-squared distribution function with k − 2 degrees of freedom and f1 is the chi-squared density function with 1 degree of freedom. 6. Let Y be a random variable with finite mean, and suppose that E{exp(Y )} < ∞. What is the probability that E{exp(Y ) | A} < exp{E(Y | A)}?

K24704_SM_Cover.indd 146

01/06/16 10:39 am

143 exp(Y ) is a convex function, so Jensen’s inequality for conditional distributions implies that E{exp(Y ) | A} ≥ exp{E(Y | A)} a.s. Therefore, the conditional probability that E{exp(Y ) | A} < exp{E(Y | A)} is 0. 7. Prove Markov’s inequality for conditional expectation (part 2 of Proposition 10.21). Let Gω (y) be a conditional distribution function of Y given A. For fixed  ω, C = C(ω) is a constant. By Proposition 10.18, R I(|y| ≥ C)dGω (y)  is a version of P (|Y | ≥ C | A). For fixed ω, R I(|y| ≥ C)dGω (y) is an integral involving an ordinary distribution function of y, namely Gω (y). By the ordinary Markov inequality, this integral is no greater   than R |y|dGω (y)/C. By Proposition 10.18, R |y|dGω (y) is a version of E(|Y | | A). Therefore, P (|Y | ≥ C | A) ≤ E(|Y | | A)/C a.s. This proves Markov’s inequality for conditional expectation. 8. Prove Chebychev’s inequality for conditional expectation (part 3 of Proposition 10.21). Simply apply Markov’s inequality for conditional expectation to {Y − E(Y | A)}2 : P {|Y − E(Y | A)| ≥ C | A} = P [{Y − E(Y | A)}2 ≥ C 2 | A] E[{Y − E(Y | A)}2 | A] a.s. (Markov) C2 var(Y | A) a.s. = C2 This proves Chebychev’s inequality for conditional expectation. ≤

9. Prove H¨ older’s inequality for conditional expectation (part 4 of Proposition 10.21). Let Hω (x, y) be a conditional distribution function of (X, Y ) given A. By  Proposition 10.25, |xy|dHω (x, y) is a version of E(|XY | | A). For fixed  ω, |xy|dHω (x, y) is an ordinary expectation, so by H¨older’s inequality for ordinary expectation, 

R2

|xy|dHω (x, y) ≤ 



x∈R

|x| dFω (x) p

1/p 

y∈R

1/q

|y| dGω (y) q



a.s., (20) 

where Fω (x) = y∈R dHω (x, y) and Gω (y) = x∈R dHω (x, y). But x∈R |x|p  dFω (x) and y∈R |y|q dGω (y) are versions of E(|X|p | A) and E(|Y |q | A), respectively. Therefore, the right side of Expression (20) is {E(|X|p | A)}1/p {E(|Y |q | A)}1/q a.s.

K24704_SM_Cover.indd 147

01/06/16 10:39 am

144 Therefore, E(|XY | | A) ≤ {E(|X|p | A)}1/p {E(|Y |q | A)}1/q a.s. This proves H¨older’s inequality for conditional expectation. 10. Prove Schwarz’s inequality for conditional expectation (part 5 of Proposition 10.21). Apply H¨older’s inequality for conditional expectation with p = q = 2 to deduce that  E(|XY |) ≤ E(X 2 | A)E(Y 2 | A) a.s. This proves Schwarz’s inequality for conditional expectation.

11. Prove Minkowski’s inequality for conditional expectation (part 6 of Proposition 10.21). Let Hω (x, y) be a conditional distribution function of (X, Y ) given A,   and let Fω (x) = y∈R dHω (x, y) and Gω (y) = x∈R dHω (x, y). Then 

R2



|x + y|p dHω (x, y)

x∈R y∈R

|x|p dFω (x) |y|p dGω (y)

are versions of E(|X + Y |p | A), E(|X|p | A), and E(|Y |p | A), respec tively. By Minkowski’s inequality for ordinary expectation, { R2 |x +   y|p dHω (x, y)}1/p ≤ { x∈R |x|p dFω (x)}1/p + { y∈R |y|p dGω (y)}1/p , completing the proof that {E(|X + Y |p | A)}1/p ≤ {E(|X|p | A)}1/p + {E(|Y |p | A)}1/p a.s. This proves Minkowski’s inequality for conditional expectation. 12. Prove parts 1 and 2 of Proposition 10.23. For part 1, var(Y | A) = = = = =

E[{Y − E(Y | A)}2 | A] E(Y 2 | A) + E[{E(Y | A)}2 | A] − 2E{Y E(Y | A) | A} a.s. E(Y 2 | A) + {E(Y | A)}2 − 2E(Y | A)E(Y | A) a.s. E(Y 2 | A) + {E(Y | A)}2 − 2{E(Y | A)}2 E(Y 2 | A) − {E(Y | A)}2 .

The third line is by Proposition 10.14 because E(Y | A) is A-measurable, so we can factor it out of conditional expectations given A.

K24704_SM_Cover.indd 148

01/06/16 10:39 am

145 For part 2, cov(Y1 , Y2 | A) = = = = =

E[{Y1 − E(Y1 | A)}{Y2 − E(Y2 | A)} | A] a.s. E(Y1 Y2 | A) − E{Y1 E(Y2 | A) | A} − E{Y2 E(Y1 | A) | A} + E{E(Y1 | A)E(Y2 | A) | A} a.s E(Y1 Y2 | A) − E(Y2 | A)E(Y1 | A) − E(Y1 | A)E(Y2 | A) + E(Y1 | A)E(Y2 | A) a.s. E(Y1 Y2 | A) − 2E(Y1 | A)E(Y2 | A) + E(Y1 | A)E(Y2 | A) E(Y1 Y2 | A) − E(Y1 | A)E(Y2 | A).

13. Prove parts 3 and 4 of Proposition 10.23. Part 3: var(Y + C | A) = = = =

E[{Y + C − E(Y + C | A)}2 | A] a.s. E[{Y + C − E(Y | A) − C}2 | A] E[{Y − E(Y | A)}2 | A] var(Y | A).

Part 4: cov(Y1 + C1 , Y2 + C2 | A) = = = =

E[{Y1 + C1 − E(Y1 + C1 | A)}{Y2 + C2 − E(Y2 + C2 | A)} | A] E[{Y1 + C1 − E(Y1 | A) − C1 }{Y2 + C2 − E(Y2 | A) − C2 } | A] E[{Y1 − E(Y1 | A)}{Y2 − E(Y2 | A)} | A] cov(Y1 , Y2 | A).

14. Complete the proof of Proposition 10.17 by showing that the  set B of Borel sets B such that B dFω (y) is A-measurable is a monotone class containing the field in Proposition 3.8. We know that B contains all sets of the form (−∞, a] by definition of a regular conditional distribution function. We claim that it also con  tains all sets of the form (a, b] because (a,b] dFω (y) = (−∞,b] dFω (y) −  and there(−∞,a] dFω (y) is the difference of two A-measurable functions,  fore A-measurable. Similarly, (b,∞) dFω (y) = 1 − (−∞,b] dFω (y) is A measurable because (−∞,b] dFω (y) is A-measurable. It is also clear that if B = ∪ki=1 Bi is the union of disjoint sets Bi of the form (−∞, ai ] or    (ai , bi ] or (bi , ∞), then ∪Bi = i Bi dFω (y) is A-measurable because  each Bi dFω (y) is A-measurable. Therefore, B  contains the field in Proposition 3.8. 

To see that B  is a monotone class, suppose that Bi ↑ B. Then B dFω (y) =   limi→∞ I(Bi )dFω (y) = limi→∞ Bi dFω (y) is a limit of A-measurable functions, and therefore A-measurable.

K24704_SM_Cover.indd 149

01/06/16 10:39 am

146 15. Consider the Z-statistic comparing two means with known finite variance σ 2 , Z=



¯ Y¯ − X

σ 1/nX + 1/nY

.

Suppose that nX remains fixed, and let F be the distribution ¯ Assume that Y1 , Y2 , . . . are iid with mean µY . function for X. Show that the asymptotic (as nY → ∞ and nX remains fixed) ¯ = x is normal conditional distribution function for Z given X and determine its asymptotic mean and variance. What is the asymptotic unconditional distribution of Z as nY → ∞ and nX remains fixed? ¯ fixed at x, Z becomes (Y¯ − x)/{σ(1/nX + 1/nY )1/2 }. By the With X CLT, Y¯ is asymptotically normal with mean µY and variance σ 2 /nY . ¯ = x, Z is asymptotically normal with mean (µY − Therefore, given X x)/{σ(1/nX + 1/nY )1/2 } and variance

(σ 2 /nY )/{σ 2 (1/nX + 1/nY )} = (1/nY )/(1/nX + 1/nY ). To make this rigorous, note that σ



Y¯ −x 1/nX +1/nY



− √ (µY −x) σ

1/nX +1/nY

1/nY 1/nX +1/nY

=



nY (Y¯ − µY )/σ,

which tends to N(0, 1) as nY → ∞ by the CLT.

For the unconditional distribution of Z, note that as nY tends to ∞ and 1/2 ¯ nX remains fixed, Z tends a.s. to (µY − X)/(σ/n X ). Therefore, 



  ¯ µY − X ¯ − µY ≥ −zσ/√nX P (Z ≤ z) → P ≤z =P X √ σ/ nX √ = 1 − F (µY − zσ/ nX ) .

16. Prove Proposition 10.18, first when g is simple, then when g is nonnegative, then when g is an arbitrary Borel function. Assume first that g(y) is a simple function taking the value bi on y ∈ Bi ,  where Bi is a Borel set, i = 1, . . . , k. Then Z = g(y)dF (y, ω) =   k I(y ∈ Bi )dF (y, ω) = µ(Bi , ω) is a i=1 bi I(y ∈ Bi )dF (y, ω), and version of E{I(Y ∈ Bi ) | A} by Theorem 10.17. By definition, if A ∈ A, then E{µ(Bi , ω)I(ω ∈ A)} = E{I(Y ∈ Bi )I(ω ∈ A)}.

K24704_SM_Cover.indd 150

01/06/16 10:39 am

147 Therefore, E{Z(ω)I(ω ∈ A)} = =

k 

i=1 k  i=1

bi E{µ(Bi , ω)I(ω ∈ A)} bi E{I(Y ∈ Bi )I(ω ∈ A)}

= E[g{Y (ω)}I(ω ∈ A)]. Therefore, Z satisfies the definition of a conditional expectation of g(Y ) given A.

Now suppose that g is an arbitrary nonnegative, Borel function. Then there are nonnegative simple functions gn (y) increasing to g(y). It follows from Proposition 10.13 that 

g(y, ω)dF (y, ω) = =



lim gn (y)dF (y, ω)

n→∞ 

lim

n→∞

gn (y)dF (y, ω)

= lim E{gn (Y ) | A} a.s. n→∞ = E{g(Y ) | A} a.s. by part 3 of Proposition 10.13. This proves the result for nonnegative measurable functions. An arbitrary measurable function g(Y ) with E{|g(Y )|} < ∞ may be written as the difference, g + (Y ) − g − (Y ), of nonnegative Borel functions with E{g + (Y )} < ∞ and E{|g − (Y )|} < ∞. Also, by the result  just proven for nonnegative measurable functions, g + (y, ω)dF (y, ω) =  E{g + (Y ) | A} a.s. and g − (y, ω)dF (y, ω) = E{g − (Y ) | A} a.s. Therefore, 

K24704_SM_Cover.indd 151

g(y)dF (y, ω) =



g + (y)dF (y, ω) −



g − (y)dF (y, ω)

= E{g + (Y ) | A} − {g − (Y ) | A} a.s. = E{g + (Y ) − g − (Y ) | A} a.s. = E{g(Y ) | A} a.s.

01/06/16 10:39 am

148 Section 10.4 1. Show that Proposition 10.10 is a special case of Proposition 10.30. Let A be the trivial sigma-field {∅, Ω}. If C is any sub-sigma-field of F, then A ⊂ C. By Proposition 10.30, E[E{Y | C} | A] = E(Y | A).

(21)

But for any random variable X with finite mean, E(X | A) = E(X). Therefore, the left and right sides of Equation (21) are EE(Y | C) a.s. and E(Y ) a.s. 2. Let Y be a random variable with E(|Y |) < ∞, and suppose that Z is a random variable such that Y − Z is orthogonal to X (i.e., E{(Y − Z)X} = 0) for each A-measurable random variable X . Prove that Z = E(Y | A) a.s. Let X = I(A), where A ∈ A. Then X is an A-measurable random variable, so E{(Y − Z)X} = 0. But this implies that E(Y X) = E(ZX). I.e., E{Y I(A)} = E{ZI(A)} for each A ∈ A. By definition of conditional expectation, Z = E(Y | A) a.s.

3. Suppose that E(Y 2 ) < ∞, and let Z = E(Y | A). Prove the identity E(Y 2 ) = E(Z 2 ) + E(Y − Z)2 . E(Y 2 ) = E(Z + Y − Z)2 = E(Z 2 ) + E(Y − Z)2 + 2E{Z(Y − Z)}. The last term is E[E{Z(Y − Z) | A}] = E{ZE(Y − Z | A)} = E[Z{E(Y | A) − E(Z | A)}] = E{Z(Z − Z)} = 0. Therefore, E(Y 2 ) = E(Z 2 ) + E{(Y − Z)2 }.

K24704_SM_Cover.indd 152

01/06/16 10:39 am

149 Section 10.5 Prove that if there is a regular conditional distribution function F (y | x) of Y given X = x that does not depend on x, then X and Y are independent. Let F (y | x) be a regular conditional distribution function of Y given X = x that does not depend on x. That is, F (y | x) = F (y). Then F (y) is also the unconditional distribution of Y because P (Y ≤ y) = E{P (Y ≤ y | X)} = E{F (y)} = F (y). Therefore, P (X ≤ a, Y ≤ b) = E{I(X ≤ a)I(Y ≤ b)} = E[E{I(X ≤ a)I(Y ≤ b) | X}] = E[I(X ≤ a)E{I(Y ≤ b) | X}]  = E[I(X ≤ a)

= E[I(X ≤ a)



I(y ≤ b)dF (y | X) I(y ≤ b)dF (y)

= E[I(X ≤ a)P (Y ≤ b)] = P (X ≤ a)P (Y ≤ b). This proves that X and Y are independent.

K24704_SM_Cover.indd 153

01/06/16 10:39 am

150 Section 10.6.4 1. A common test statistic for the presence of an outlier among iid data from N (µ, σ 2 ) is the maximum normed residual ¯ |Xi − X| . U = max  n 1≤i≤n 2 ¯ (X − X) i i=1

¯ s2 ) is a Using the fact that the sample mean and variance (X, complete, sufficient statistic, prove that U is independent of s2 . Note that replacing Xi by (Xi − µ)/σ does not change the value of U , and (Xi − µ)/σ has a standard normal distribution. Therefore, the distribution of U for arbitrary µ and σ is the same as the distribution of U when Xi has a standard normal distribution. It follows that U is an ancillary statistic (its distribution does not depend on (µ, σ). By Basu’s ¯ s2 ), theorem, U is independent of the complete sufficient statistic (X, 2 and therefore independent of s . 2. Let Y ∼ N(µ, 1), and suppose that A is a set such that P (Y ∈ A) is the same for all µ. Use Basu’s theorem to prove that P (Y ∈ A) = 0 for all µ or P (Y ∈ A) = 1 for all µ. Consider the random variable I(Y ∈ A). Then I(Y ∈ A) is ancillary because P {I(Y ∈ A) = 1} = P (Y ∈ A) does not depend on (µ, σ 2 ). Also, Y is a complete sufficient statistic. By Basu’s theorem, I(Y ∈ A) is independent of Y . This means that

P {I(Y ∈ A) = 1, Y ∈ A)} = P {I(Y ∈ A) = 1}P (Y ∈ A) = {P (Y ∈ A)}2 . (22) But the left side is just P (Y ∈ A), so we conclude that p = p2 , where p = P (Y ∈ A). Therefore, p(1 − p) = 0, proving that p = 0 or 1. 3. In the ECMO example, consider the set of possible treatment assignment vectors that are consistent with the marginals of Table 10.2. Show that the probability of each of the 10 assignment vectors other than (E,S,E,E,E,E,E,E,E,E,E,E) and (S,E,E,E,E,E, E,E,E,E,E,E) is 1/429. Each vector consistent with the given marginals other than the 2 vectors given must have the lone “S” in the third slot or later. Suppose that the lone S is in position j. The product of probabilities associated with the first 3 assignments given the outcome vector 0, 1, 0, . . . , 0 is (1/2)(2/3)(2/4) = 1/6. All remaining babies survive. If the lone S was

K24704_SM_Cover.indd 154

01/06/16 10:39 am

151 baby number 3, then the product of probabilities for babies 4, 5, . . . , 12 is 10!4! (2/5)(3/6) . . . (10/13) = . 13! If the lone S is baby number 4, the product of probabilities for babies 4, 5, . . . , 12 is 10!4! . (2/5)(3/6)(4/7) . . . (10/13) = 13! If the lone S is baby number 5, the product of probabilities for babies 4, 5, . . . , 12 is 10!4! . (3/5)(2/6)(4/7) . . . (10/13) = 13! Likewise, whenever the lone S is among babies 3, . . . , 12, the numerators of the probabilities for babies 4, . . . , 12 consist of the numbers 2, . . . 10 in some order, while the denominators consist of 5, 6, . . . , 13. Therefore, the product of probabilities is 4·3·2 2 10!4! = = . 13! 13 · 12 · 11 143 Multiplying by the probability of the first 3 assignments, namely 1/6, we get 1/429. 4. Let X be exponential (λ). It is known that X is a complete and sufficient statistic for λ. Use this fact to deduce the following result on the uniqueness of the Laplace transform  ψ(t) = 0∞ f (x) exp(tx)dx of a Borel function f (x) with domain (0, ∞). If f (x) and g(x) are two Borel functions whose Laplace transforms agree for all t < 0, then f (x) = g(x) except on a set of Lebesgue measure 0. If t < 0, the Laplace transforms for the Borel functions f (x) and g(x) are E{f (X)} and E{g(X)}, where X is exponential with parameter λ = −t > 0. Because the distribution of X is complete, f (X) = g(X) with probability 1 by Proposition 10.41. That is, f (x) = g(x) except on a set of Lebesgue measure 0.

K24704_SM_Cover.indd 155

01/06/16 10:39 am

152 Section 10.7.2 1. In Example 10.47, let Y (ω) = ω and consider two different sigma-fields. The first, A1 , is the sigma-field generated by I(ω is rational). The second, A2 , is the sigma-field generated by Y . What are A1 and A2 ? Give regular conditional distribution functions for Y given A1 and Y given A2 . The sigma-field A1 generated by I(ω is rational) is {Q, QC , ∅, Ω}, where Q is the set of rational numbers in [0, 1]. The sigma field A2 generated by Y (ω) = ω is the Borel sigma-field of subsets of [0, 1]. Let F (y) = y be the uniform distribution function on [0, 1]. Then F is a conditional distribution function of Y given A1 . Let G(y) =



0 y B) → 1 as n → ∞ for each positive number B. What are the implications of this result for a test rejecting the null hyˆ θˆn)}1/2 > zα, where θˆn is a consistent pothesis when θˆn/{var( estimator of a parameter θ > 0 and var( ˆ θˆn) is an estimator of p its variance such that var( ˆ θˆn) → 0? If Xn > a/2 and |Yn | < a/(2B), then Xn /|Yn | > (a/2)/{a/(2B)} = B. Let Dn = {Xn > a/2}, En = {|Yn | < a/(2B)}, and Fn = Dn ∩ En . Then P (Fn ) → 1 as n → ∞ because P (FnC ) ≤ P (DnC ) + P (EnC ) → 0 + 0 as n → ∞. This proves that P (Xn /|Yn | > B) → 1 as n → ∞. ˆ θˆn )}1/2 > Suppose that a test rejects the null hypothesis when θˆn /{var( zα , where θˆn is a consistent estimator of a parameter θ > 0 and var( ˆ θˆn ) is p ˆ an estimator of its variance such that var( ˆ θn ) → 0. Then the numerator ˆ θn converges in probability to θ > 0, while the denominator converges to 0 in probability. By the above result, the probability of rejecting the ˆ θˆn )}1/2 > zα ], tends to 1 as n → ∞. In other null hypothesis, P [θˆn /{var( words, the power of the test tends to 1. 2. State and prove a result about the asymptotic distribution of the one-sample t-statistic n1/2 Y¯n/sn under a local alternative. Suppose that Yi,n i = 1, . . . , n are iid from a distribution F {(y − µn )/σ}, where F has finite variance and n1/2 µn /σ → ∆. Then the t-statistic n1/2 Y¯n /sn for testing µn = 0 converges in distribution to N(∆, 1). The proof is as follows. Let Yi be iid from F (y/σ). Then the Yi have the same distribution as Yi + µn , i = 1, . . . , n. Write the t-statistic as   √ √ n(Y¯n − µn ) nµn σ + Tn = σ σ sn √   √ nY¯n nµn σ + = σ σ sn D → {N(0, 1) + ∆}(1) = N(∆, 1) by Slutsky’s theorem.

K24704_SM_Cover.indd 159

01/06/16 10:39 am

Section 11.3

1. Prove that the limit of (11.8) as λ → 0 is (t/84)^β.

G(t) = F(t)/F(84) = [1 − exp{−(λt)^β}]/[1 − exp{−(84λ)^β}], t ≤ 84.

Use L'Hospital's rule. The derivatives of the numerator and denominator with respect to λ are − exp{−(λt)^β}(−t^β βλ^{β−1}) = βt^β λ^{β−1} exp{−(λt)^β} and − exp{−(84λ)^β}(−84^β βλ^{β−1}) = β84^β λ^{β−1} exp{−(84λ)^β}, respectively. Therefore, the limit of (11.8) is

lim_{λ→0} βt^β λ^{β−1} exp{−(λt)^β} / [β84^β λ^{β−1} exp{−(84λ)^β}] = (t/84)^β lim_{λ→0} exp{−(λt)^β}/exp{−(84λ)^β} = (t/84)^β.

2. Simulate 10 discrete uniforms on {1, 2 . . . , 1000} and compute the test statistic (11.9). Repeat this thousands of times and calculate the proportion of times the statistic exceeds the upper 0.05 quantile of a gamma (10, 1) distribution. Is it close to 0.05? It is approximately 0.046. The test is only slightly conservative. 3. Repeat the preceding exercise, but with 500 discrete uniforms on {1, 2 . . . , 100}. What proportion of simulated experiments resulted in test statistic (11.9) exceeding the upper 0.05 quantile of a gamma (500, 1) distribution? It is approximately 0.006. The test is extremely conservative.
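The sketch below (added for illustration) reproduces both simulations under the assumption, made only because statistic (11.9) is not restated here, that the statistic is Σᵢ −log(Ui/N) for discrete uniforms Ui on {1, . . . , N}; under the null this is approximately gamma (n, 1).

    # Simulated size of the test, ASSUMING statistic (11.9) is the sum of -log(U_i / N).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)

    def rejection_rate(n_unif, N, reps, alpha=0.05):
        u = rng.integers(1, N + 1, size=(reps, n_unif))
        stat = -np.log(u / N).sum(axis=1)
        cutoff = stats.gamma.ppf(1 - alpha, a=n_unif)   # gamma(n, 1) upper alpha point
        return np.mean(stat > cutoff)

    print(rejection_rate(10, 1000, reps=100_000))   # compare with the 0.046 reported above
    print(rejection_rate(500, 100, reps=10_000))    # compare with the 0.006 reported above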

K24704_SM_Cover.indd 160

01/06/16 10:39 am

157 Section 11.5.6 1. We asserted that P (X ≤ x | X +Y = s) is a decreasing function of s. Prove this using the representation that a hypergeometric random variable U is the number of red balls in a sample of size s drawn without replacement from a population of m red balls and n blue balls. Let X be the number of red balls in a sample of size s drawn without replacement from m red and n blue balls. Now draw another ball and let X  be the number of red balls among the sample of size s + 1. Then X  ≥ X with probability 1 because X  is either X or X + 1. Because X  ≤ x ⇒ X ≤ x, P (X  ≤ x) ≤ P (X ≤ x). But X has a hypergeometric (m, n, s) distribution and X  has a hypergeometric (m, n, s + 1) distribution. Therefore, P (X ≤ x | X + Y = s) is a decreasing function of s. 2. Modify the argument leading to Expression (11.11) to prove that lim Fn(zn1 | zn2 ) ≤ Φ(z1 ). Let zn2 be a support point of the distribution of Zn2 such that zn2 → z2 as n → ∞. Fn (z1 | zn2 ) ≤ P {Zn1 ≤ z1 | Zn2 ∈ (zn2 − , zn2 ]} =

P {(Zn1 ≤ z1 ) ∩ (Zn2 ∈ (zn2 − , zn2 ])} . P {Zn2 ∈ (zn2 − , zn2 ]}

(23)

Polya’s theorem implies that P (Zn1 ≤ z1 , Zn2 ∈ (zn2 − , zn2 ]) = Φ(z1 ) × {Φ(zn2 ) − Φ(zn2 − )} + an , where an → 0 as n → ∞. Similarly, Polya’s theorem in R1 implies that P (Zn2 ∈ (zn2 − , zn2 ]) = Φ(zn2 ) − Φ(zn2 − ) + bn , where bn → 0 as n → ∞. Therefore, Expression (23) becomes: Φ(z1 ){Φ(zn2 ) − Φ(zn2 − )} + an Φ(zn2 ) − Φ(zn2 − ) + bn Φ(z1 ){Φ(z2 ) − Φ(z2 − )} = Φ(z1 ). lim n→∞ Fn (z1 | zn2 ) ≤ Φ(z2 ) − Φ(z2 − ) Fn (z1 | zn2 ) ≤

3. Review the argument leading to Proposition 11.1 in the special case mN = nN = n. Modify it to allow mN = nN using the fact that, under the null hypothesis, XN /mN − YN /nN is uncorrelated with, and asymptotically independent of, XN + YN .

K24704_SM_Cover.indd 161

01/06/16 10:39 am

158 Let p ∈ (0, 1) be the common Bernoulli parameter under the null hypothesis. Let UN 1 = (XN − mN p)/{mN p(1 − p)}1/2 and UN 2 = (YnN − nN p)/{nN p(1 − p)}1/2 . Then UN 1 and UN 2 are independent and each converges in distribution to a standard normal by the CLT. Accordingly, D (UN 1 , UN 2 ) → (U1 , U2 ), where (U1 , U2 ) are iid standard normals. Let UN 1 √ √  √ − √UnNN2 nN UN 1 − mN UN 2  mN  √ = = 1 − λN UN 1 − λN UN 2 , ZN 1 = 1 m N + nN + 1 mN

nN

where λN = mN /(mN + nN ). Similarly, let √ √  mN UN 1 + nN UN 2  √ ZN 2 = = λN UN 1 + 1 − λN UN 2 . m N + nN

Assume that λN → λ as N → ∞, where λ ∈ (0, 1) is a constant. If we replace λN by its limit λ in ZN 1 and ZN 2 , then, by the Mann-Wald D theorem, (ZN 1 , ZN 2 ) → (Z1 , Z2 ), where Z1 = (1 − λ)1/2 U1 − λ1/2 U2 and Z2 = λ1/2 U1 + (1 − λ)1/2 U2 . Therefore, (Z1 , Z2 ) are iid standard normals because they are bivariate standard normal with correlation 0. All of this was predicated on replacing λN by λ, but the same is true without this step by the multivariate Slutsky theorem (Theorem 8.50). To see this, note that   √  √ 1 − λN UN 1 − λN UN 2 − 1 − λUN 1 − λUN 2  √    √ 1 − λN − 1 − λ UN 1 + λ − λN UN 2 = converges to 0 in probability because UN 1 and UN 2 converge in distribu1/2 tion and (1 − λN )1/2 − (1 − λ)1/2 → 0 and λ1/2 − λN → 0. A similar 1/2 argument shows that the difference between λN UN 1 + (1 − λN )1/2 UN 2 and λ1/2 UN 1 + (1 − λ)1/2 UN 2 converges to 0 in probability.

We have shown that (ZN 1 , ZN 2 ) converges in distribution to two iid standard normals, just as we showed that this held in the special case when mN = nN . The rest of the proof is identical to the proof we gave of Proposition 11.1 in the special case mN = nN .

4. Suppose that XN is hypergeometric (mN , nN , s), where s is a fixed integer. Assume that, as N → ∞, mN and nN tend to ∞ in such a way that mN /(mN + nN ) → λ. Is XN still asymptotically normal? No. XN can take only values 0, 1, . . . , s, where s is fixed. Therefore, XN cannot be asymptotically normal. In fact, it can be seen that XN is asymptotically binomial (s, λ).
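The binomial (s, λ) limit in problem 4 is easy to illustrate numerically; the values of s, λ, and the population sizes below are illustrative choices of ours.

import numpy as np
from scipy.stats import hypergeom, binom

s, lam = 5, 0.3
ks = np.arange(s + 1)
for total in (20, 200, 2000, 20000):
    m = int(round(lam * total))
    n = total - m
    hg = hypergeom(m + n, m, s).pmf(ks)      # hypergeometric (m, n, s) pmf
    bi = binom(s, m / (m + n)).pmf(ks)       # binomial (s, m/(m+n)) pmf
    print(total, float(np.abs(hg - bi).max()))   # maximum pmf discrepancy shrinks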

Section 11.6

1. Many experiments involving non-human primates are very small because the animals are quite expensive. Consider an experiment with 3 animals per arm, and suppose that one of two binary outcomes is being considered. You look at outcome data blinded to treatment assignment, and you find that 5 out of 6 animals experienced outcome 1, whereas 3 animals experienced outcome 2. You argue as follows. With outcome 1, it is impossible to obtain a statistically significant result at 1-tailed α = 0.05 using Fisher's exact test. With outcome 2, if all 3 events are in the control arm, the 1-tailed p-value using Fisher's exact test will be 0.05. Therefore, you select outcome 2 and use Fisher's exact test (which is a permutation test on binary data).

(a) Is this adaptive test valid? Explain.

It is a valid test of the strong null hypothesis that treatment has no effect on the joint distribution of the two outcomes. The conditional distribution of the number of outcome 2 events in the treatment group, given the totals (over treatment and control) for outcome 1 and outcome 2, is hypergeometric—the same distribution used to compute Fisher's exact p-value.

(b) Suppose your permutation test did not condition on the numbers of animals per arm. For example, your permutation test treats as equally likely all 2^6 = 64 treatment assignments corresponding to flipping a fair coin for each animal. Your test statistic is the difference in proportions, which is not defined if all animals are assigned to the same treatment. Therefore, your permutation distribution excludes these two extreme assignments. Is the resulting test a valid permutation test?

No, because in the experiment, we forced the number of animals per arm to be 3. The distribution of the test statistic changes for different numbers per arm.
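As a check on the 0.05 figure quoted in the problem, the 1-tailed Fisher's exact p-value for the most extreme outcome 2 table (all 3 events in the control arm) can be computed directly from the hypergeometric distribution on which Fisher's exact test is based; this short Python sketch is ours, not part of the original solution.

from scipy.stats import hypergeom

# conditional probability that all 3 events fall in the control arm, given 3 events
# total and 3 animals per arm: C(3,3)C(3,0)/C(6,3) = 1/20
p = hypergeom(6, 3, 3).pmf(3)
print(p)   # 0.05, the 1-tailed Fisher's exact p-value for the most extreme table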

Section 11.7.3

1. Make rigorous the above informal argument that if ∑_{i=1}^∞ y_i = k, then ∑_{i=1}^{2n} Z_i y_i is asymptotically binomial with parameters (k, 1/2).

Let i_1, . . . , i_k be the indices of the y_i that equal 1. For 2n ≥ max(i_1, . . . , i_k), all of the y_i = 1 are among the 2n observations. For any sequence z_{i_1}, . . . , z_{i_k} of zeros and ones, P(Z_{i_1} = z_{i_1}, . . . , Z_{i_k} = z_{i_k}) → (1/2)^k as n → ∞. To see this, note that P(Z_{i_1} = 1) = n/(2n) = 1/2, P(Z_{i_2} = 1 | Z_{i_1} = 1) = (n − 1)/(2n − 1) → 1/2, P(Z_{i_2} = 1 | Z_{i_1} = 0) = n/(2n − 1) → 1/2, etc. Similarly, the conditional distribution of Z_{i_j} given the outcomes z_{i_1}, . . . , z_{i_{j−1}} converges to Bernoulli (1/2) as n → ∞. It follows that the joint distribution of Z_{i_1}, . . . , Z_{i_k} converges to that of iid Bernoulli (1/2) random variables. By the Mann-Wald theorem, ∑_{j=1}^k Z_{i_j} converges in distribution to the sum of k iid Bernoulli (1/2) random variables, namely binomial (k, 1/2). (A simulation sketch of this limit appears after problem 3 below.)

2. Suppose the clustering from the sexual sharing results in n pairs of patients. Within each pair, the two Y's are correlated, but if we select a single member from each pair, they are iid. Prove that the condition in Proposition 11.2 holds. Extend the result to triplets, quadruplets, etc.

Arbitrarily denote one member of the ith pair by Y_i and the other by Y_i′. By the SLLN, (1/n)∑_{i=1}^n Y_i → p a.s. and (1/n)∑_{i=1}^n Y_i′ → p a.s. It follows that (2n)^{−1}∑_{i=1}^n (Y_i + Y_i′) converges almost surely to (1/2)p + (1/2)p = p. Assuming that 0 < p < 1, the condition in Proposition 11.2 holds. With triplets, denote one member of the ith triplet by Y_i, another by Y_i′, and another by Y_i′′. The sample mean of the Y_i converges a.s. to p, as does the sample mean of the Y_i′, as does the sample mean of the Y_i′′. It follows that the sample mean of all 3n observations converges almost surely to (1/3)p + (1/3)p + (1/3)p = p. The same technique clearly applies to quadruplets, quintuplets, etc.

3. Now suppose the 2n patients represent one giant cluster and were generated as follows. We flip a coin once. We generate Y_1, Y_2, . . . as iid Bernoulli(p), where p = 0.2 if the coin flip is heads and 0.8 if the coin flip is tails. Does the condition in Proposition 11.2 hold?

Yes, because the mean of the 2n observations converges almost surely to a random variable taking the value 0.2 on the set {ω : coin flip is heads} and 0.8 on the set {ω : coin flip is tails}.
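The simulation sketch referred to in problem 1 follows; the choices of k, n, the seed, and the number of replications are ours. It draws exactly n of 2n treatment labels at random without replacement and tabulates the sum of the labels over k fixed indices.

import math
import numpy as np

rng = np.random.default_rng(1)
k, reps = 4, 20_000
target = np.array([math.comb(k, j) for j in range(k + 1)]) / 2 ** k  # binomial (k, 1/2) pmf

for n in (5, 50, 500):
    counts = np.zeros(k + 1)
    for _ in range(reps):
        treated = rng.choice(2 * n, size=n, replace=False)  # n of the 2n units get Z = 1
        s = int(np.sum(treated < k))     # sum of Z over the k fixed indices (here the first k units)
        counts[s] += 1
    print(n, np.round(counts / reps, 3), np.round(target, 3))  # empirical vs. binomial (k, 1/2) pmf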

Section 11.8.3

1. From Expression (11.17) prove that β̂ → (2/π)^{1/2} a.s.

Write β̂ as
\[
\hat\beta_n = \frac{2}{n}\sum_{i=1}^n Y_i I(Y_i \ge 0) + \frac{2}{n}\sum_{i=1}^n Y_i I(Y_i \ge \hat\theta) - \frac{2}{n}\sum_{i=1}^n Y_i I(Y_i \ge 0) - \bar Y
= \frac{2}{n}\sum_{i=1}^n Y_i I(Y_i \ge 0) + \frac{2}{n}\sum_{i=1}^n Y_i\{I(Y_i \ge \hat\theta) - I(Y_i \ge 0)\} - \bar Y.
\]
By the SLLN, the first term converges almost surely to 2E{Y_1 I(Y_1 ≥ 0)} = 2(2π)^{−1/2} ∫_0^∞ y exp(−y²/2) dy = (2/π)^{1/2}, while the last term converges almost surely to 0. Therefore, we need only show that the middle term tends almost surely to 0. The absolute magnitude of the middle term is less than or equal to
\[
T_n = \frac{2}{n}\sum_{i=1}^n |Y_i|\,\bigl|I(Y_i \ge \hat\theta) - I(Y_i \ge 0)\bigr| = \frac{2}{n}\sum_{i=1}^n |Y_i|\{I(\hat\theta \le Y_i < 0) + I(0 \le Y_i < \hat\theta)\},
\]
so it suffices to show that T_n → 0 a.s.

We have shown previously that the sample median converges almost surely to the true median, in this case 0. Thus, for given ε > 0 and almost all ω, there exists an N such that −ε ≤ θ̂(ω) ≤ ε for n ≥ N. For each such ω, for n ≥ N,
\[
T_n(\omega) \le \frac{2}{n}\sum_{i=1}^n |Y_i(\omega)|\{I(-\epsilon \le Y_i(\omega) < 0) + I(0 \le Y_i(\omega) < \epsilon)\}.
\]
By the SLLN, the right side converges almost surely to A_ε = 2E[|Y_1|{I(−ε ≤ Y_1 < 0) + I(0 ≤ Y_1 < ε)}]. As long as ω belongs to the set of probability 1 for which this convergence holds, lim sup_n T_n(ω) ≤ A_ε. But ε is arbitrary and A_ε → 0 as ε → 0 by the DCT because E(|Y_1|) < ∞. Therefore, lim T_n(ω) = 0. This shows that T_n → 0 a.s., concluding the proof that β̂_n → (2/π)^{1/2} a.s.
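A simulation check of problem 1 is easy to run. The sketch below takes Expression (11.17) to be β̂ = (2/n)∑_{i=1}^n Y_i I(Y_i ≥ θ̂) − Ȳ with θ̂ the sample median, which is how the decomposition above telescopes; if (11.17) is written differently in the text, adjust accordingly. The sample sizes and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
for n in (100, 10_000, 1_000_000):
    y = rng.standard_normal(n)
    theta_hat = np.median(y)
    beta_hat = 2 * np.mean(y * (y >= theta_hat)) - y.mean()
    # beta_hat should approach sqrt(2/pi) = 0.7979 as n grows
    print(n, round(float(beta_hat), 4), round(float(np.sqrt(2 / np.pi)), 4))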

2. Prove that if two columns of A are as shown in (11.18), then the only possible treatment assignments are (C,T,T,C) or (T,C,C,T). Why does this prove that Z cannot be independent of A?

From the first column, we know that if the covariate is balanced, the first two assignments must be the opposite of each other, as must be the last two. Therefore, the four possibilities are (C,T,C,T), (C,T,T,C), (T,C,C,T), or (T,C,T,C). But if the second covariate is also balanced, then the first and third assignments must be different, as must be the second and fourth assignments. That eliminates (C,T,C,T) and (T,C,T,C). Therefore, the only possibilities are (C,T,T,C) and (T,C,C,T).

If Z were independent of A, then the conditional distribution of Z given A would have to be the unconditional distribution of Z, namely uniform on the 6 possible vectors with two of the z_i equal to +1 and two of the z_i equal to −1. But we saw that for the observed A, four of the possible randomization vectors have probability 0, so Z cannot be independent of A.

Section 11.9

1. Explain why Figure 11.7 is a graphical counterexample to the following proposition: (X ⊥⊥ Y | U) and (X ⊥⊥ Y | W) ⇒ {X ⊥⊥ Y | (U, W)}. Then show that the following is a specific counterexample. Let U, W, X be iid Bernoulli (1/2). If (U, W) = (0, 0) or (1, 1), take Y = X. If (U, W) = (0, 1) or (1, 0), take Y = 1 − X.

Figure 11.7 shows that conditioning on U opens the path from X to U to V . However, we cannot get from V to Y because the path remains blocked at W . Likewise, conditioning on W opens the path from V to Y , but there is no way to get from X to V because the path remains blocked at U . Thus, X and Y are conditionally independent given U and conditionally independent given W . On the other hand, if we condition on both U and W , then the blockages at both U and W are cleared, so there is now a path from X to U to V to W to Y . That is, X and Y are not necessarily conditionally independent given (U, W ). Now consider the specific example in which U, W, X are iid Bernoulli (1/2). If (U, W ) = (0, 0) or (1, 1), take Y = X. If (U, W ) = (0, 1) or (1, 0), take Y = 1 − X.

\[
P(X = 1, Y = 1 \mid U = 0) = \frac{P(X = 1, Y = 1, U = 0)}{P(U = 0)}
= \frac{P(X = 1, U = 0, W = 0, Y = 1) + P(X = 1, U = 0, W = 1, Y = 1)}{1/2}
= \frac{(1/2)(1/2)(1/2)(1) + (1/2)(1/2)(1/2)(0)}{1/2} = \frac{1/8}{1/2} = 1/4.
\]
Of course, P(X = 1 | U = 0) = 1/2 because X and U are iid Bernoulli (1/2). Moreover,
\[
P(Y = 1 \mid U = 0) = \frac{P(U = 0, W = 0, Y = 1) + P(U = 0, W = 1, Y = 1)}{P(U = 0)}
= \frac{(1/2)(1/2)(1/2) + (1/2)(1/2)(1/2)}{1/2} = 1/2.
\]

Therefore, P(X = 1, Y = 1 | U = 0) = 1/4 = P(X = 1 | U = 0)P(Y = 1 | U = 0).

Similarly,
\[
P(X = 1, Y = 1 \mid U = 1) = \frac{P(X = 1, Y = 1, U = 1)}{P(U = 1)}
= \frac{P(X = 1, U = 1, W = 0, Y = 1) + P(X = 1, U = 1, W = 1, Y = 1)}{1/2}
= \frac{(1/2)(1/2)(1/2)(0) + (1/2)(1/2)(1/2)(1)}{1/2} = \frac{1/8}{1/2} = 1/4,
\]
while P(X = 1 | U = 1) = 1/2 and
\[
P(Y = 1 \mid U = 1) = \frac{P(U = 1, W = 0, Y = 1) + P(U = 1, W = 1, Y = 1)}{P(U = 1)}
= \frac{(1/2)(1/2)(1/2) + (1/2)(1/2)(1/2)}{1/2} = 1/2.
\]
Thus, P(X = 1, Y = 1 | U = 1) = 1/4 = P(X = 1 | U = 1)P(Y = 1 | U = 1). Hence, X and Y are conditionally independent given U. A similar argument shows that X and Y are conditionally independent given W. On the other hand, if we condition on both U and W, then X and Y are not independent because, for example, Y = X given that U = 0, W = 0. (A brute-force enumeration of this example appears at the end of this section.)

2. Decide which of the following are true and prove the true ones.

(a) {X ⊥⊥ (Y, W) | Z} ⇒ (X ⊥⊥ Y | Z).

True. If {X ⊥⊥ (Y, W) | Z}, then
\[
P(X \le x, Y \le y \mid Z) = P(X \le x, Y \le y, W < \infty \mid Z)
= P(X \le x \mid Z)\,P(Y \le y \mid Z)\,P(W < \infty \mid Z)
= P(X \le x \mid Z)\,P(Y \le y \mid Z) \quad \text{a.s.}
\]
This proves that X and Y are conditionally independent given Z.

(b) {X ⊥⊥ (Y, W) | Z} ⇒ {X ⊥⊥ Y | (Z, W)}.

True. By Proposition 10.33, there is a conditional distribution F(x | y, w, z) of X given (Y, W, Z) that does not depend on (y, w). But of course that implies that F(x | y, w, z) does not depend on y. By Proposition 10.33, {X ⊥⊥ Y | (Z, W)}.

(c) (X ⊥⊥ Y | Z) and (Y ⊥⊥ W) ⇒ (X ⊥⊥ W | Z).

False. For example, let (X, Y, W) be iid standard normals and set Z = X + W. Then clearly Y is independent of W. Also, X and Y are conditionally independent given Z because Y is independent of (X, W). On the other hand, X and W are not independent given Z because X + W = Z.

(d) (X ⊥⊥ Y | Z) and {X ⊥⊥ W | (Z, Y)} ⇒ {X ⊥⊥ (Y, W) | Z}.

True.
\begin{align*}
&P(X \le x, Y \le y, W \le w \mid Z)\\
&= E[E\{I(X \le x, W \le w)I(Y \le y) \mid Z, Y\} \mid Z]\\
&= E\{I(Y \le y)P(X \le x, W \le w \mid Z, Y) \mid Z\}\\
&= E\{I(Y \le y)P(X \le x \mid Z, Y)P(W \le w \mid Z, Y) \mid Z\}\\
&= E\{I(Y \le y)P(X \le x \mid Z)P(W \le w \mid Z, Y) \mid Z\}\\
&= P(X \le x \mid Z)E\{I(Y \le y)P(W \le w \mid Z, Y) \mid Z\} \quad \text{a.s.}\\
&= P(X \le x \mid Z)E[E\{I(Y \le y)I(W \le w) \mid Z, Y\} \mid Z] \quad \text{a.s.}\\
&= P(X \le x \mid Z)E\{I(Y \le y)I(W \le w) \mid Z\} \quad \text{a.s. (Proposition 10.30)}\\
&= P(X \le x \mid Z)P(Y \le y, W \le w \mid Z) \quad \text{a.s.,}
\end{align*}
proving that {X ⊥⊥ (Y, W) | Z}. The fourth line follows from the conditional independence of X and W given (Z, Y), and the fifth line is from the fact that X and Y are conditionally independent given Z.

3. In the path diagram of Figure 11.8, determine which of the following are true, and factor the joint density function for (X, U, V, W, Y, Z).

The joint density function factors as follows: f_1(x) f_2(y) f_3(u | x, y) f_4(v | x) f_5(w | v, y) f_6(z | u, v, y).

(a) U and V are independent.

False. We can traverse the path from U against the arrow to X and then to V.

(b) X and Y are independent.

True. There is no way to reach Y from X because we cannot pass through any inverted forks without conditioning on the random variable at the vertex of the inverted fork.

(c) X and Y are conditionally independent given Z.

False. Once we condition on Z, the following path from X to Y is available: X to V to Z to Y, because we can now pass through the inverted fork at Z.

(d) X and Y are conditionally independent given (U, V).

False; conditioning on U now opens the path from X to U to Y.
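The specific counterexample in problem 1 of this section can be verified by brute-force enumeration of the eight equally likely outcomes of (U, W, X); the following Python sketch does that (the helper function and variable names are ours).

from itertools import product

# joint pmf of (U, W, X, Y) for the construction in problem 1 of Section 11.9
pmf = {}
for u, w, x in product((0, 1), repeat=3):
    y = x if u == w else 1 - x
    pmf[(u, w, x, y)] = 1 / 8

def prob(event):
    return sum(p for key, p in pmf.items() if event(*key))

# X and Y are conditionally independent given U (the same holds given W by symmetry)
for u in (0, 1):
    pu = prob(lambda U, W, X, Y, u=u: U == u)
    pxy = prob(lambda U, W, X, Y, u=u: U == u and X == 1 and Y == 1) / pu
    px = prob(lambda U, W, X, Y, u=u: U == u and X == 1) / pu
    py = prob(lambda U, W, X, Y, u=u: U == u and Y == 1) / pu
    print(u, pxy, px * py)    # both equal 0.25 = 0.5 * 0.5

# but not given (U, W): conditioning on U = 0, W = 0 forces Y = X
p00 = prob(lambda U, W, X, Y: U == 0 and W == 0)
print(prob(lambda U, W, X, Y: U == 0 and W == 0 and X == 1 and Y == 1) / p00)  # 0.5, not 0.25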

Section 11.10.4

1. Use the delta method to verify Expression (11.23).

Let g(Ȳ, V) = Ȳ/V^{1/2}, where V = s². Then ∂g/∂Ȳ evaluated at (μ, σ²) is 1/V^{1/2}|_{(μ,σ²)} = 1/σ, and ∂g/∂V evaluated at (μ, σ²) is Ȳ(−V^{−3/2}/2)|_{(μ,σ²)} = −μ/(2σ³). Also, Ȳ and s² are independent, so they have covariance 0. By the bivariate delta method, g(Ȳ, V) is asymptotically normal with mean μ/σ and variance
\begin{align*}
\mathrm{var}\{g(\bar Y, V)\}
&= (1/\sigma)^2\,\mathrm{var}(\bar Y) + \{-\mu/(2\sigma^3)\}^2\,\mathrm{var}(s^2) + 2(1/\sigma)\{-\mu/(2\sigma^3)\}\,\mathrm{cov}(\bar Y, s^2)\\
&= (1/\sigma^2)(\sigma^2/n) + \{\mu^2/(4\sigma^6)\}\,\mathrm{var}\!\left\{\frac{\sigma^2}{n-1}\cdot\frac{(n-1)s^2}{\sigma^2}\right\} + 0\\
&= 1/n + \{\mu^2/(4\sigma^6)\}\,\frac{\sigma^4}{(n-1)^2}\,\mathrm{var}(\chi^2_{n-1})\\
&= 1/n + \frac{\mu^2}{4\sigma^2(n-1)^2}\,\{2(n-1)\} = 1/n + \frac{E^2}{2(n-1)},
\end{align*}
where E = μ/σ. That is,
\[
\frac{\bar Y/s - \mu/\sigma}{\sqrt{1/n + E^2/\{2(n-1)\}}} \stackrel{D}{\to} N(0, 1). \tag{24}
\]
Also, because the ratio of 1/n + E²/{2(n − 1)} to 1/n + E²/(2n) tends to 1 as n → ∞, Slutsky's theorem ensures that Ȳ/s is asymptotically normal with mean μ/σ and variance (1 + E²/2)/n. This matches Expression (11.23).
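A small simulation check of the asymptotic variance just derived (not part of the original solution; the values of μ, σ, n, the seed, and the number of replications are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 2.0, 1.5, 200, 20_000
Y = rng.normal(mu, sigma, size=(reps, n))
ratios = Y.mean(axis=1) / Y.std(axis=1, ddof=1)      # replicate values of Ybar/s
E = mu / sigma
print(float(ratios.var()), (1 + E ** 2 / 2) / n)     # simulated vs. asymptotic variance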

2. In the effect size setting, the variance of the asymptotic distribution depended on the parameter we were trying to estimate, so we made a transformation. Find the appropriate transformations in the following settings to eliminate the dependence of the asymptotic distribution on the parameter we are trying to estimate.

(a) Estimation of the probability of an adverse event. We count only the first adverse event, so the observations are iid Bernoulli random variables with parameter p, and we use the sample proportion with events, p̂ = X/n, where X is binomial (n, p).

The delta method variance of g(p̂) is {g′(p)}² var(p̂) = {g′(p)}² p(1 − p)/n. To make this expression not depend on p, we need {g′(p)}² p(1 − p) = c. That is, we need g′(p) = c{p(1 − p)}^{−1/2}, so g(p) = ∫ c{p(1 − p)}^{−1/2} dp = 2c arcsin(p^{1/2}). We can clearly ignore constants and use g(p) = arcsin(p^{1/2}).

(b) Estimation of the mean number of adverse events per person. We count multiple events per person. Assume that the total number of events across all people follows a Poisson distribution with parameter λ, where λ is very large.

The mean number of events, λ̂ = (Y_1 + . . . + Y_n)/n, is (1/n) times a Poisson (nλ) random variable. Accordingly, λ̂ is asymptotically normal with mean λ and variance (1/n)²(nλ) = λ/n. The variance of g(λ̂) is {g′(λ)}²(λ/n). We need {g′(λ)}²(λ/n) = c. That is, g′(λ) is proportional to 1/λ^{1/2}. Therefore, we can take g(λ) = λ^{1/2}.

3. Suppose that the asymptotic distribution of an estimator δ̂ is N(δ, f(δ)/n) for some function f. How can we transform the estimator such that the variance of the asymptotic distribution does not depend on the parameter we are trying to estimate?

The transformed estimator g(δ̂) has variance {g′(δ)}²{f(δ)/n}, so we set {g′(δ)}² f(δ)/n = c. That is, we take {g′(δ)}² proportional to 1/f(δ). This results in g(δ) = ∫ [1/{f(δ)}^{1/2}] dδ.
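The variance-stabilizing property of the arcsine transformation in part (a) can be checked numerically; n, the grid of p values, the seed, and the number of replications below are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 50_000
for p in (0.1, 0.3, 0.5, 0.8):
    phat = rng.binomial(n, p, size=reps) / n
    g = np.arcsin(np.sqrt(phat))
    print(p, round(float(g.var() * n), 4))   # roughly 1/4 for every p, as the theory predicts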

Section 11.11.3

1. Let x_1, . . . , x_n be a sample of data. Fix n and x_2, . . . , x_n, and send x_1 to ∞. Show that the one-sample t-statistic n^{1/2} x̄/s converges to 1 as x_1 → ∞. What does this tell you about the performance of a one-sample t-test in the presence of an outlier?

Note that x̄/x_1 = 1/n + (∑_{i=2}^n x_i)/(n x_1) → 1/n as x_1 → ∞. Likewise,
\begin{align*}
\frac{(n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2}{x_1^2}
&= \frac{\sum_{i=1}^n (x_i/x_1 - \bar x/x_1)^2}{n-1}
= \frac{(1 - \bar x/x_1)^2 + \sum_{i=2}^n (x_i/x_1 - \bar x/x_1)^2}{n-1}\\
&\to \frac{(1 - 1/n)^2 + (n-1)(0 - 1/n)^2}{n-1} = \frac{n}{n^2} = \frac{1}{n}.
\end{align*}
Therefore,
\[
\frac{\sqrt{n}\,\bar x}{s} = \frac{\sqrt{n}\,\bar x/x_1}{\sqrt{(n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2/x_1^2}} \to \frac{\sqrt{n}\,(1/n)}{\sqrt{1/n}} = 1
\]
as x_1 → ∞. That is, the t-statistic tends to 1. This means that, in the presence of a sufficiently extreme outlier, the t-statistic is virtually guaranteed not to be statistically significant when a conventional alpha level (e.g., 0.01, 0.05, 0.10) is used, because the corresponding critical values all exceed 1.
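A numerical illustration of problem 1 follows; the fixed values x_2, . . . , x_n, the seed, and the grid of x_1 values are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(5)
x_rest = rng.normal(0.5, 1.0, size=19)        # fixed x_2, ..., x_n
for x1 in (10.0, 1e3, 1e6, 1e9):
    x = np.concatenate(([x1], x_rest))
    n = len(x)
    t = np.sqrt(n) * x.mean() / x.std(ddof=1)
    print(x1, round(float(t), 4))             # tends to 1 as x_1 grows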

2. Let Y_1, . . . , Y_n be iid N(μ, σ²), and Y_max and Ȳ be their sample maximum and mean, respectively. It can be shown that there are sequences of numbers a_n and b_n converging to ∞ such that a_n/n^{1/2} → 0 and a_n{(Y_max − μ)/σ − b_n} converges in distribution to U for some nondegenerate random variable U. Using this fact, prove that a_n{(Y_max − Ȳ)/σ − b_n} converges in distribution to U as well.

Note that
\[
D_n = a_n\left(\frac{Y_{\max} - \bar Y}{\sigma} - b_n\right) - a_n\left(\frac{Y_{\max} - \mu}{\sigma} - b_n\right)
= \frac{a_n(\mu - \bar Y)}{\sigma} = -\frac{a_n}{\sqrt{n}}\cdot\frac{\sqrt{n}(\bar Y - \mu)}{\sigma}.
\]
By the CLT, n^{1/2}(Ȳ − μ)/σ converges in distribution to a standard normal deviate, while a_n/n^{1/2} → 0. By Exercise 4 of Section 6.2.5, D_n → 0 in probability. The result now follows from Slutsky's theorem.

3. Show that if Y_1, . . . , Y_n are iid standard normals and Y_min is the smallest order statistic, then nΦ(Y_min) converges in distribution to an exponential with parameter 1.

U_min = Φ(Y_min) has the distribution of the minimum of n iid uniforms U_1, . . . , U_n on [0, 1] because U_i = Φ(Y_i), i = 1, . . . , n, are iid uniform [0, 1]. Therefore, P{nΦ(Y_min) > u} = P(all U_i > u/n) = (1 − u/n)^n → exp(−u) as n → ∞. That is, the distribution function of nΦ(Y_min) converges to 1 − exp(−u), an exponential distribution function with parameter 1.

4. Prove that if Y_1, . . . , Y_n are iid standard normals and Y_min and Y_max are the smallest and largest order statistics,
\[
P\bigl[n\Phi(Y_{\min}) > \alpha_1 \,\cap\, n\{1 - \Phi(Y_{\max})\} > \alpha_2\bigr] = (1 - \alpha_1/n - \alpha_2/n)^n
\]
for (α_1 + α_2)/n < 1. What does this tell you about the asymptotic joint distribution of [nΦ(Y_min), n{1 − Φ(Y_max)}]?

Let U_(1) = Φ(Y_min) and U_(n) = Φ(Y_max). Then U_(1) and U_(n) are the smallest and largest order statistics from n iid uniform deviates on [0, 1]. Therefore,
\[
P\bigl[nU_{(1)} > \alpha_1,\; n\{1 - U_{(n)}\} > \alpha_2\bigr]
= P\{\text{all } U_i \text{ are between } \alpha_1/n \text{ and } 1 - \alpha_2/n\}
= (1 - \alpha_2/n - \alpha_1/n)^n \to \exp\{-(\alpha_1 + \alpha_2)\},
\]

which is the joint survival function for two iid exponentials with parameter 1. This says that, asymptotically, the minimum and maximum order statistics are independent, which makes sense because the smallest observation tells virtually nothing about the largest observation if n is very large. 5. ↑ Let Y1 , . . . , Yn be iid N(µ, σ 2 ), with µ and σ 2 known. Declare the smallest order statistic to be an outlier if nΦ{(Y(1) − µ)/σ} ≤ a, and the largest order statistic to be an outlier if n[1 − Φ{(Y(n) − µ)/σ}] ≤ a. Determine a such that the probability of erroneously declaring an outlier is approximately 0.05 when n is large.

Note that U_(1) = Φ{(Y_(1) − μ)/σ} and U_(n) = Φ{(Y_(n) − μ)/σ} are the minimum and maximum order statistics from n iid uniform deviates U_1 = Φ{(Y_1 − μ)/σ}, . . . , U_n = Φ{(Y_n − μ)/σ}. The null probability of declaring no outlier is
\[
P\bigl[nU_{(1)} > a \,\cap\, n\{1 - U_{(n)}\} > a\bigr].
\]
By the preceding problem, this probability converges to exp{−(a + a)} = exp(−2a). Equate exp(−2a) to 0.95 to get a = −ln(0.95)/2 = 0.0256.
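A simulation sketch of the outlier rule in problem 5 (with μ = 0 and σ = 1; n, the seed, and the number of replications are arbitrary choices of ours):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
a = -np.log(0.95) / 2                 # threshold from problem 5, about 0.0256
n, reps = 1_000, 10_000

Y = rng.standard_normal((reps, n))
u_min = norm.cdf(Y.min(axis=1))       # U_(1) = Phi(Y_(1))
u_max = norm.cdf(Y.max(axis=1))       # U_(n) = Phi(Y_(n))
false_alarm = np.mean((n * u_min <= a) | (n * (1 - u_max) <= a))
print(float(a), float(false_alarm))   # false-alarm rate should be close to 0.05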
