Lebesgue’s remarkable theory of measure and integration with probability
Paul Loya
(Please do not distribute this book to anyone else. Please email [email protected] to report errors or give criticisms)
Contents

Prologue: The paper that changed integration theory . . . forever . . . 1
  Sur une généralisation de l'intégrale définie . . . 1
  Some remarks on Lebesgue's paper . . . 3

Part 1. Finite additivity . . . 7

Chapter 1. Measure & probability: finite additivity . . . 9
  1.1. Introduction: Measure and integration . . . 9
  1.2. Probability, events, and sample spaces . . . 16
  1.3. Semirings, rings and σ-algebras . . . 26
  1.4. The Borel sets and the Principle of Appropriate Sets . . . 36
  1.5. Additive set functions in classical probability . . . 44
  1.6. More examples of additive set functions . . . 59
  Notes and references on Chapter 1 . . . 67

Chapter 2. Finitely additive integration . . . 69
  2.1. Integration on semirings . . . 69
  2.2. Random variables and (mathematical) expectations . . . 75
  2.3. Properties of additive set functions on semirings . . . 86
  2.4. Bernoulli's Theorem (The WLLNs) and expectations . . . 97
  2.5. De Moivre, Laplace and Stirling star in The Normal Curve . . . 108
  Notes and references on Chapter 2 . . . 127

Part 2. Countable additivity . . . 131

Chapter 3. Measure and probability: countable additivity . . . 133
  3.1. Introduction: What is a measurable set? . . . 133
  3.2. Countably additive set functions on semirings . . . 137
  3.3. Infinite product spaces and properties of countable additivity . . . 144
  3.4. Outer measures, measures, and Carathéodory's idea . . . 158
  3.5. The Extension Theorem and regularity properties of measures . . . 168
  Notes and references on Chapter 3 . . . 178

Chapter 4. Reactions to the Extension & Regularity Theorems . . . 181
  4.1. Probability, Bernoulli sequences and Borel-Cantelli . . . 181
  4.2. Borel's Strong Law of Large Numbers . . . 190
  4.3. Measurability and Littlewood's First Principle(s) . . . 198
  4.4. Geometry, Vitali's nonmeasurable set, and paradoxes . . . 210
  4.5. The Cantor set . . . 227
  Notes and references on Chapter 4 . . . 239

Part 3. Integration . . . 245

Chapter 5. Basics of Integration Theory . . . 247
  5.1. Introduction: Interchanging limits and integrals . . . 247
  5.2. Measurable functions and Littlewood's second principle . . . 252
  5.3. Sequences of functions and Littlewood's third principle . . . 264
  5.4. Lebesgue's definition of the integral and the MCT . . . 275
  5.5. Features of the integral and the Principle of Appropriate Functions . . . 285
  5.6. The DCT and Osgood's Principle . . . 297
  Notes and references on Chapter 5 . . . 312

Chapter 6. Some applications of integration . . . 315
  6.1. Practice with the DCT and its corollaries . . . 315
  6.2. Lebesgue, Riemann and Stieltjes integration . . . 332
  6.3. Probability distributions, mass functions and pdfs . . . 344
  6.4. Independence and identically distributed random variables . . . 354
  6.5. Approximations and the Stone-Weierstrass theorem . . . 365
  6.6. The Law of Large Numbers and Normal Numbers . . . 373
  Notes and references on Chapter 6 . . . 386

Part 4. Further results in measure and integration . . . 389

Chapter 7. Fubini's theorem and Change of Variables . . . 391
  7.1. Introduction: Iterated integration . . . 391
  7.2. Product measures, volumes by slices, and volumes of balls . . . 396
  7.3. The Fubini-Tonelli Theorems on iterated integrals . . . 409
  7.4. Change of variables in multiple integrals . . . 422
  7.5. Some applications of change of variables . . . 437
  7.6. Polar coordinates and integration over spheres . . . 448
  Notes and references on Chapter 7 . . . 456

Bibliography . . . 461
Prologue: The paper that changed integration theory . . . forever

Sur une généralisation de l'intégrale définie
On a generalization of the definite integral¹

Note by Mr. H. Lebesgue. Presented by Mr. Picard.

In the case of continuous functions, the notions of the integral and antiderivatives are identical. Riemann defined the integral of certain discontinuous functions, but not all derivatives are integrable in the sense of Riemann. Research into the problem of antiderivatives is thus not solved by integration, and one can desire a definition of the integral including as a particular case that of Riemann and allowing one to solve the problem of antiderivatives.(1)

To define the integral of an increasing continuous function y(x) (a ≤ x ≤ b) we divide the interval (a, b) into subintervals and sum the quantities obtained by multiplying the length of each subinterval by one of the values of y when x is in the subinterval. If x is in the interval (a_i, a_{i+1}), y varies between certain limits m_i, m_{i+1}, and conversely if y is between m_i and m_{i+1}, x is between a_i and a_{i+1}. So that instead of giving ourselves the division of the variation of x, that is to say, the numbers a_i, we could have given ourselves the division of the variation of y, that is to say, the numbers m_i. From here there are two manners of generalizing the concept of the integral. We know that the first (to be given the numbers a_i) leads to the definition given by Riemann and the definitions of the integral by upper and lower sums given by Mr. Darboux. Let us see the second. Let the function y range between m and M. Consider the situation

$m = m_0 < m_1 < m_2 < \cdots < m_{p-1} < M = m_p;$

y = m when x belongs to the set E_0; m_{i-1} < y ≤ m_i when x belongs to the set E_i.² We will define the measures λ_0, λ_i of these sets. Let us consider one or the other of the two sums

$m_0\lambda_0 + \sum m_i\lambda_i$ ;   $m_0\lambda_0 + \sum m_{i-1}\lambda_i$ ;

¹This is a translation of Lebesgue's paper where he first reveals his integration theory. This paper appeared in Comptes Rendus de l'Académie des Sciences (1901), pp. 1025–1028, and is translated by Paul Loya and Emanuele Delucchi.
²Translator's footnote: That is, Lebesgue defines $E_0 = y^{-1}(m) = \{x \in [a,b]\;;\; y(x) = m\}$ and $E_i = y^{-1}((m_{i-1}, m_i]) = \{x \in [a,b]\;;\; m_{i-1} < y(x) \le m_i\}$.
if, when the maximum difference between two consecutive m_i tends to zero, these sums tend to the same limit independent of the chosen m_i, this limit will be, by definition, the integral of y, which will be called integrable.

Let us consider a set of points of (a, b); one can enclose these points, in infinitely many ways, in a countably infinite number of intervals; the infimum of the sum of the lengths of the intervals is the measure of the set.³ A set E is said to be measurable if⁴ its measure together with that of the set of points not forming E gives the measure of (a, b).(2) Here are two properties of these sets: Given an infinite number of measurable sets E_i, the set of points which belong to at least one of them is measurable; if the E_i are such that no two have a common point, the measure of the set thus obtained is the sum of the measures of the E_i. The set of points in common to all the E_i is measurable.⁵

It is natural to consider first of all those functions for which the sets appearing in the definition of the integral are measurable. One finds that: if a function bounded in absolute value is such that for any A and B, the set of values of x for which A < y ≤ B is measurable, then it is integrable by the process indicated. Such a function will be called summable. The integral of a summable function lies between the lower integral and the upper integral.⁶ It follows that if a function integrable in the sense of Riemann is summable, the integral is the same with the two definitions. Now, any function integrable in the sense of Riemann is summable, because the set of all its points of discontinuity has measure zero, and one can show that if, omitting a set of values of x of measure zero, what remains is a set at each point of which the function is continuous, then this function is summable. This property makes it immediately possible to form functions that are not integrable in the sense of Riemann but are nevertheless summable.
Let f(x) and ϕ(x) be two continuous functions, ϕ(x) not always zero; a function which differs from f(x) only at the points of an everywhere dense set of measure zero, and which at these points is equal to f(x) + ϕ(x), is summable without being integrable in the sense of Riemann. Example: the function equal to 0 if x is irrational, equal to 1 if x is rational. The above process of construction shows that the set of all summable functions has cardinality greater than the continuum. Here are two properties of functions in this set.

(1) If f and ϕ are summable, f + ϕ is as well, and the integral of f + ϕ is the sum of the integrals of f and of ϕ.
(2) If a sequence of summable functions has a limit, it is a summable function.

³Translator's footnote: Denoting by $m^*(E)$ the measure of a set $E \subseteq (a,b)$, Lebesgue is defining $m^*(E)$ to be the infimum of the set of all sums of the form $\sum_i \ell(I_i)$ such that $E \subseteq \bigcup_i I_i$, where $I_i = (a_i, b_i]$ and $\ell(I_i) = b_i - a_i$. It's true that Lebesgue doesn't specify the types of intervals, but it doesn't matter what types of intervals you choose to cover E with (I chose left-half open ones because of my upbringing).
⁴Translator's footnote: Lebesgue is defining E to be measurable if $m^*(E) + m^*((a,b) \cap E^c) = b - a$.
⁵Translator's footnote: Lebesgue is saying that if the $E_i$ are measurable, then $\bigcup_i E_i$ is measurable; if the $E_i$ are pairwise disjoint, then $m^*(\bigcup_i E_i) = \sum_i m^*(E_i)$; and finally, that $\bigcap_i E_i$ is measurable. The complement of a measurable set is, almost by definition, measurable; moreover, it's not difficult to see that the empty set is measurable. Thus, the collection of measurable sets contains the empty set and is closed under complements and countable unions; later, when we define σ-algebras, think about Lebesgue.
⁶Translator's footnote: Lower and upper integrals in the sense of Darboux.
The collection of summable functions obviously contains y = k and y = x; therefore, according to (1), it contains all the polynomials and, according to (2), it contains all their limits; therefore it contains all the continuous functions, all the limits of continuous functions, that is to say, the functions of first class (see Baire, Annali di Matematica, 1899); it contains all those of second class, etc. In particular, any derivative bounded in absolute value, being of first class, is summable, and one can show that its integral, considered as a function of its upper limit, is an antiderivative. Here is a geometrical application: if |f′|, |ϕ′|, |ψ′| are bounded, the curve x = f(t), y = ϕ(t), z = ψ(t) has a length given by the integral of $\sqrt{f'^2 + \varphi'^2 + \psi'^2}$. If ϕ = ψ = 0, one obtains the total variation of the function f of bounded variation. If f′, ϕ′, ψ′ do not exist, one can obtain an almost identical theorem by replacing the derivatives by the Dini derivatives.
Footnotes:
(1) These two conditions imposed a priori on any generalization of the integral are obviously compatible, because any integrable derivative, in the sense of Riemann, has as an integral one of its antiderivatives.
(2) If one adds to this collection suitably selected sets of measure zero, one obtains the measurable sets in the sense of Mr. Borel (Leçons sur la théorie des fonctions).
Some remarks on Lebesgue's paper

In Section 1.1 of Chapter 1 we shall take a closer look at Lebesgue's theory of integration as he explained it in his paper. Right now we shall discuss some aspects he brings up in his paper involving certain defects in the Riemann theory of the integral and how his theory fixes these defects.

The antiderivative problem. One of the fundamental theorems of calculus (FTC) learned in elementary calculus says that for a bounded⁷ function f : [a, b] → R, we have

(0.1)   $\int_a^b f(x)\,dx = F(b) - F(a),$

where F is an antiderivative of f, which means F′(x) = f(x) for all x ∈ [a, b]. It may be hard to accept at first, because it's not stated in a first course in calculus, but the FTC may fail if the integral in (0.1) is the Riemann integral! In fact, there are bounded functions f that are not Riemann integrable but have antiderivatives; for such functions the left-hand side of (0.1) does not make sense. In Section ? we shall define such a function, due to Vito Volterra (1860–1940), which he published in 1881. With this background, we can understand Lebesgue's inaugural words of his paper:

In the case of continuous functions, the notions of the integral and antiderivatives are identical. Riemann defined the integral of certain discontinuous functions, but not all derivatives are integrable in the sense of Riemann. Research into the problem

⁷The Riemann integral is only defined for bounded functions, which is why we make this assumption. We could deal with unbounded functions, but then we would have to discuss improper integrals, which we don't want to get into.
of antiderivatives is thus not solved by integration, and one can desire a definition of the integral including as a particular case that of Riemann and allowing one to solve the problem of antiderivatives.

In Lebesgue's theory of the integral, we shall see that the Fundamental Theorem of Calculus always holds for any bounded function with an antiderivative. In this sense, Lebesgue's theory of the integral solves the "problem of antiderivatives".

The limit problem. Suppose that for each n = 1, 2, 3, . . . we are given a function $f_n : [a,b] \to \mathbb{R}$, all bounded by some fixed constant.⁸ Also suppose that for each x ∈ [a, b], $\lim_{n\to\infty} f_n(x)$ exists; since this limit depends on x, the values of the limits define a function f : [a, b] → R such that for each x ∈ [a, b],

$f(x) = \lim_{n\to\infty} f_n(x).$

The function f is bounded since we assumed all the $f_n$'s were bounded by some fixed constant. A question that you've probably seen before in elementary real analysis is the following: Given that the $f_n$'s are Riemann integrable, is it true that

(0.2)   $\int_a^b f(x)\,dx = \lim_{n\to\infty} \int_a^b f_n(x)\,dx\,?$

We shall call this question the "limit problem", which, by using the definition of f(x), we can rephrase as follows: Is it true that

$\int_a^b \lim_{n\to\infty} f_n(x)\,dx = \lim_{n\to\infty} \int_a^b f_n(x)\,dx,$

which is to say, can we switch limits with integrals? In the Riemann integration world, the answer to this question is "No" for the following reason: Even though each $f_n$ is Riemann integrable, it's not necessarily the case that the limit function f is Riemann integrable. Thus, even though the numbers $\int_a^b f_n(x)\,dx$ on the right-hand side of (0.2) may be perfectly well-defined, the symbol $\int_a^b f(x)\,dx$ on the left-hand side of (0.2) may not be defined! For an example of such a case, we go back to the example Lebesgue brought up in the second-to-last paragraph of his paper, where he wrote

Example: The function equal to 0 if x is irrational, equal to 1 if x is rational.

Denoting this function by f : R → R, we have

$f(x) = \begin{cases} 1 & \text{if } x \text{ is rational}, \\ 0 & \text{if } x \text{ is irrational}. \end{cases}$

This function is called Dirichlet's function after Johann Peter Gustav Lejeune Dirichlet (1805–1859), who introduced it in 1829.

[Figure: a rough picture of Dirichlet's function, with value 1 on the rationals and 0 on the irrationals.]

⁸That is, there is a constant C such that $|f_n(x)| \le C$ for all x ∈ [a, b] and for all n.
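To see concretely why this function resists Riemann integration, here is a small numerical sketch (our illustration, not part of Lebesgue's or Dirichlet's work): Riemann sums over equal subintervals of [0, 1] give different answers depending on whether the sample points are rational or irrational. Since floating-point numbers cannot be tested for rationality, the sketch carries rationality in the type, passing exact `Fraction`s as the rational tags.

```python
from fractions import Fraction

def dirichlet(x):
    # 1 on rationals, 0 on irrationals.  A float cannot be tested for
    # rationality, so in this sketch we pass exact Fractions for rational
    # tags and plain floats for (known) irrational tags.
    return 1 if isinstance(x, Fraction) else 0

def riemann_sum(f, tags, a=0, b=1):
    # Riemann sum over equal subintervals of [a, b], one tag per subinterval.
    n = len(tags)
    return sum(f(t) * Fraction(b - a, n) for t in tags)

n = 1000
rational_tags = [Fraction(k, n) for k in range(n)]               # left endpoints k/n
irrational_tags = [k / n + 2**0.5 / (10 * n) for k in range(n)]  # irrational points in each subinterval

print(riemann_sum(dirichlet, rational_tags))    # 1
print(riemann_sum(dirichlet, irrational_tags))  # 0
```

No matter how fine the partition, rational tags give 1 and irrational tags give 0, so the Riemann sums have no common limit.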
It's easy to show that f : R → R is not Riemann integrable on any interval [a, b] with a < b (see Exercise 1). Now, in 1898, René-Louis Baire (1874–1932) introduced the following sequence of functions $f_n : \mathbb{R} \to \mathbb{R}$, n = 1, 2, 3, . . ., defined by

$f_n(x) = \begin{cases} 1 & \text{if } x = p/q \text{ is rational in lowest terms with } q \le n, \\ 0 & \text{otherwise}. \end{cases}$

[Figure: the graph of $f_3$ on [0, 1], with spikes of height 1 at 0, 1/3, 1/2, 2/3 and 1.]

Notice that $f_3(x) = 1$ when x = 0, 1/3, 1/2, 2/3, 1, the rationals with denominators not greater than 3 when written in lowest terms; otherwise $f_3(x) = 0$. More generally, $f_n$ is equal to the zero function except at finitely many points, namely at 0/1, 1/1, 1/2, 1/3, 2/3, . . . , (n − 1)/n. In particular, $f_n$ is Riemann integrable and for any a < b,

$\int_a^b f_n(x)\,dx = 0;$

here we recall that the Riemann integral is immune to changes in functions at finitely many points, so as the $f_n$'s differ from the zero function at only finitely many points, $\int_a^b f_n(x)\,dx = \int_a^b 0\,dx = 0$. Also notice that

$\lim_{n\to\infty} f_n$ = the Dirichlet function,

which as we mentioned earlier is not Riemann integrable. Hence, for this simple example, the limit equality (0.2) is nonsense because the left-hand side of the equality is not defined. In Lebesgue's theory of integration, we shall see that the limit function f will always be Lebesgue integrable (which Lebesgue mentions in point (2) at the end of the second-to-last paragraph of his paper) and, moreover, the equality (0.2) always holds when the sequence $f_n$ is bounded. In this sense, Lebesgue's theory of the integral gives a positive answer to the "limit problem". Finally, let's discuss

The arc length problem. In the last paragraph of Lebesgue's paper he mentions the following geometric application:

Here is a geometrical application: if |f′|, |ϕ′|, |ψ′| are bounded, the curve x = f(t), y = ϕ(t), z = ψ(t) has a length given by the integral of $\sqrt{f'^2 + \varphi'^2 + \psi'^2}$.
To elaborate more on this, suppose we are given a curve C in 3-space defined by parametric equations

C : x = f(t),  y = ϕ(t),  z = ψ(t),   a ≤ t ≤ b,

such as shown on the left-hand picture here:

[Figure: a curve in 3-space (left) and a piecewise linear approximation of it (right).]
To define L, the length of C, we approximate the curve by a piecewise linear curve, an example of which is shown on the right, and find the length of the approximating curve. Taking closer and closer approximations to the curve by piecewise linear curves, we define the length of the curve L by

(0.3)   L := the limit of the lengths of the piecewise linear approximations,

provided that the lengths of the piecewise linear approximations approach a specific value. In elementary calculus we learned another formula for the length of the curve:

(0.4)   $L = \int_a^b \sqrt{(f'(t))^2 + (\varphi'(t))^2 + (\psi'(t))^2}\,dt,$

assuming that the derivatives are bounded. A natural question is: Are the two notions of length, defined by (0.3) and (0.4), equivalent? The answer is "No" if the Riemann integral is used in (0.4)! More precisely, there are curves which have length in the sense of (0.3) but for which $\sqrt{(f'(t))^2 + (\varphi'(t))^2 + (\psi'(t))^2}$ is not Riemann integrable; thus, (0.4) is nonsense if the integral is understood in the Riemann sense. In Lebesgue's theory of the integral, we shall see that the two notions of arc length are equivalent. Thus, Lebesgue's theory of the integral solves the "arc length problem".

There are many other defects in Riemann's integral that Lebesgue's integral fixes, and we'll review and discuss new defects as we progress through the book (for example, see the discussion on multi-dimensional integrals in Chapter ?).

Summary. If we insist on using the Riemann integral, we have to worry about important formulas that are true only some of the time; however, using the Lebesgue integral, these "defective formulas" become, for all intents and purposes, correct all of the time. Thus, we can say that Lebesgue's integral simplifies life!

◮ Exercises 0.1.

1. Using your favorite definition of the Riemann integral you learned in an elementary course on real analysis (for instance, via Riemann sums or Darboux sums), prove that Dirichlet's function is not Riemann integrable on any interval [a, b] where a < b.
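As a sanity check on the arc length formula (0.4) in the smooth case, the following sketch (our own example; the helix and all helper names are ours, not from the text) compares the polygonal lengths of (0.3) against the exact value of the integral for the helix x = cos t, y = sin t, z = t, whose speed $\sqrt{f'^2 + \varphi'^2 + \psi'^2} = \sqrt{2}$ is constant, so (0.4) gives L = 2π√2.

```python
import math

def point(t):
    # A helix: f(t) = cos t, phi(t) = sin t, psi(t) = t.
    return (math.cos(t), math.sin(t), t)

def polygonal_length(n, a=0.0, b=2 * math.pi):
    # Length of the piecewise linear curve through n + 1 equally spaced
    # sample points, as in definition (0.3).
    ts = [a + (b - a) * k / n for k in range(n + 1)]
    return sum(math.dist(point(ts[k]), point(ts[k + 1])) for k in range(n))

exact = 2 * math.pi * math.sqrt(2)   # the value of the integral (0.4)
for n in (10, 100, 1000):
    print(n, polygonal_length(n), exact)
# the polygonal lengths increase toward 2*pi*sqrt(2) ≈ 8.886
```

For smooth curves the two definitions agree; the point of the arc length problem is that for merely differentiable curves the integrand in (0.4) can fail to be Riemann integrable even though the polygonal limit (0.3) exists.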
Part 1
Finite additivity
CHAPTER 1
Measure & probability: finite additivity

In this work I try to give definitions that are as general and precise as possible to some numbers that we consider in Analysis: the definite integral, the length of a curve, the area of a surface.

Opening words to Henri Léon Lebesgue's (1875–1941) thesis [171].
1.1. Introduction: Measure and integration

This section is meant to be a motivational speech where we outline the basic ideas behind the theory of measure,¹ or assigning "size" to sets, and how to use this notion of measure to define integrals.

1.1.1. Lebesgue sums. Recall that the Riemann integral of a function with domain an interval [a, b] is defined via Riemann sums, which approximate the (signed) area of a function by partitioning the domain. The basic idea of Henri Léon Lebesgue (1875–1941) is that he forms "Lebesgue sums" by partitioning the range of the function. Let us recall Lebesgue's words in his inaugural 1901 paper Sur une généralisation de l'intégrale définie (On a generalization of the definite integral):

To define the integral of an increasing continuous function y(x) (a ≤ x ≤ b) one divides the interval (a, b) into subintervals and forms the sum of the quantities obtained by multiplying the length of each subinterval by one of the values of y when x is in the subinterval. If x is in the interval (a_i, a_{i+1}), y varies between certain limits m_i, m_{i+1}, and conversely if y is between m_i and m_{i+1}, x is between a_i and a_{i+1}. Of course, instead of giving the division of the variation of x, that is to say, the numbers a_i, one could have given the division of the variation of y, that is to say, the numbers m_i. From here there are two manners of generalizing the concept of the integral. One sees that the first (to be given the numbers a_i) leads to the definition given by Riemann and the definitions of the integral by upper and lower sums given by Mr. Darboux. Let us see the second. Let the function y range between m and M. Given

$m = m_0 < m_1 < m_2 < \cdots < m_{p-1} < M = m_p,$

y = m when x belongs to the set E_0; m_{i-1} < y ≤ m_i when x belongs to the set E_i.² We will define the measures λ_0, λ_i of these sets. Let
¹Numero pondere et mensura Deus omnia condidit (God created everything by number, weight and measure). Sir Isaac Newton (1643–1727).
²Translator's footnote: That is, Lebesgue defines $E_0 = y^{-1}(m) = \{x \in [a,b]\;;\; y(x) = m\}$ and $E_i = y^{-1}((m_{i-1}, m_i]) = \{x \in [a,b]\;;\; m_{i-1} < y(x) \le m_i\}$.
us consider either of the two sums

$m_0\lambda_0 + \sum m_i\lambda_i$ ;   $m_0\lambda_0 + \sum m_{i-1}\lambda_i$ ;
if, when the maximum difference between two consecutive mi tends to zero, these sums tend to the same limit independent of the mi chosen, this limit will be, by definition, the integral of y, which will be known as integrable.
To visualize what Lebesgue is saying concerning the second method of generalizing the concept of the integral, consider a bounded nonnegative function f, such as shown here:

[Figure: the graph of f(x) over (a, b), with the horizontal levels m_{i−1} and m_i cutting out the set E_i along the x-axis.]
Figure 1.1. The set $E_i = \{x \;;\; m_{i-1} < f(x) \le m_i\}$ for this example is a union of two intervals; the left interval in $E_i$ is a "left-half open" (or "right-half closed") interval and the right interval in $E_i$ is a "right-half open" (or "left-half closed") interval.

Let's say the range of f lies between m and M. Our goal is to determine the area below the graph. Just as Lebesgue says, let us take a partition along the y-axis:

$m = m_0 < m_1 < m_2 < \cdots < m_{p-1} < M = m_p.$

Let $E_0 = \{x \;;\; f(x) = m_0\}$ and for each i = 1, 2, . . . , p, let

(1.1)   $E_i := f^{-1}((m_{i-1}, m_i]) = \{x \;;\; m_{i-1} < f(x) \le m_i\}.$
Figure 1.1 shows an $E_i$, which in this case is just a union of two intervals, and in the following figure we take p = 6.

[Figure: the graph of f(x) with the range partition $m_0 < m_1 < \cdots < m_6$ and the sets $E_1, \ldots, E_6$ marked along the x-axis.]
Figure 1.2. The sets E1 , . . . , E6 . In Figure 1.1 we wrote Ei in terms of right and left-hand open intervals. In this figure, for simplicity we omit the details of the endpoints. Omitted from the picture is E0 = {x ; f (x) = m0 }, which is the set {a, b} for this example.
Following Lebesgue, put λi = m(Ei ), where m(Ei ) = the “measure” or “length” of the set Ei .
For the function in Figure 1.1, it is clear that $E_i$ has a length, because $E_0 = \{a, b\}$ (just two points, so $m(E_0)$ should be zero) and for i > 0, $E_i$ is just a union of intervals (and lengths for intervals have obvious meanings); but for general functions the sets $E_i$ can be very complicated, so it is not clear that a "length" can always be assigned to $E_i$. In any case, a set that has a well-defined notion of "measure" or "length" is called measurable. (By the way, there are some sets that don't have a well-defined notion of length; see Section 1.1.3 for a discussion of this fact.) Overlooking this potential difficulty, a careful study of Figure 1.3 shows that

$m_0\lambda_0 + \sum m_{i-1}\lambda_i = m_0\,m(E_0) + m_0\,m(E_1) + m_1\,m(E_2) + m_2\,m(E_3) + \cdots$

is the "lower" area of the rectangles shown in the left-hand picture in Figure 1.3, while

$m_0\lambda_0 + \sum m_i\lambda_i = m_0\,m(E_0) + m_1\,m(E_1) + m_2\,m(E_2) + m_3\,m(E_3) + \cdots$

is the "upper" area of the rectangles shown in the right-hand picture in Figure 1.3.
[Figure: two copies of the graph of f(x) with the range partition $m_0 < \cdots < m_6$; the left panel shows the inner (lower) rectangles over the sets $E_1, \ldots, E_6$ and the right panel the outer (upper) rectangles.]
Figure 1.3. Approximating the area under the graph of f from the inside and from the outside. We now go back to Lebesgue’s 1901 paper where he says: if, when the maximum difference between two consecutive mi tends to zero, these sums tend to the same limit independent of the mi chosen, this limit will be, by definition, the integral of y, which will be known as integrable.
We can make this precise as follows. Let $\mathcal{P}$ denote the partition $\{m_0, m_1, \ldots, m_p\}$, let $\|\mathcal{P}\|$ be the maximum difference between two consecutive $m_i$ in the partition, and let

$L_{\mathcal{P}} = m_0\,m(E_0) + \sum m_{i-1}\,m(E_i)$   and   $U_{\mathcal{P}} = m_0\,m(E_0) + \sum m_i\,m(E_i),$

which we call the lower and upper sums defined by the partition $\mathcal{P}$. For a real number I we write $\lim_{\|\mathcal{P}\|\to 0} L_{\mathcal{P}} = I$ if given any ε > 0, there is a δ > 0 such that for any partition $\mathcal{P}$ with $\|\mathcal{P}\| < \delta$, we have $|I - L_{\mathcal{P}}| < \varepsilon$. There is a similar definition of what $\lim_{\|\mathcal{P}\|\to 0} U_{\mathcal{P}} = I$ means. Then Lebesgue is saying that if there exists a real number I such that

(1.2)   $\lim_{\|\mathcal{P}\|\to 0} L_{\mathcal{P}} = I = \lim_{\|\mathcal{P}\|\to 0} U_{\mathcal{P}},$
then we say that f is integrable, and the common limit I is by definition the integral of f, which we shall denote by

(1.3)   $\int f := I = \lim_{\|\mathcal{P}\|\to 0} L_{\mathcal{P}} = \lim_{\|\mathcal{P}\|\to 0} U_{\mathcal{P}}.$
Assuming that each set Ei is measurable, this definition actually works!3 A function for which each set Ei is measurable is called a measurable function. In particular, in Section 6.2 we’ll see that a Riemann integrable function is measurable, and the limit (1.3) is equal to the Riemann integral of the function. However, many more functions have integrals that can be defined via (1.3). Such a function is called Lebesgue integrable and the corresponding integral is called the Lebesgue integral of the function. As we shall see in the sequel, this integral has some powerful features as hinted in the prologue. Moreover, just as the notions of open sets in Euclidean space generalize to abstract topological spaces, which is indispensable for modern mathematics, the Lebesgue integral for Euclidean space has a valuable generalization to what are called abstract measure spaces. For this reason, we shall develop Lebesgue’s theory through abstract measure theory, a time consuming but worthy task. We shall see the usefulness of abstract measure theory in action as we study probability in this book. Before we go on, here is a nice description of the difference between Lebesgue and Riemann integration from Lebesgue himself [173, pp. 181–82]: One could say that, according to Riemann’s procedure, one tried to add the indivisibles by taking them in the order in which they were furnished by the variation in x, like an unsystematic merchant who counts coins and bills at random in the order in which they come to hand, while we operate like a methodical merchant who says: I have m(E1 ) pennies which are worth 1 · m(E1 ),
I have m(E2 ) nickels worth 5 · m(E2 ),
I have m(E3 ) dimes worth 10 · m(E3 ), etc.
Altogether I have
S = 1 · m(E1 ) + 5 · m(E2 ) + 10 · m(E3 ) + · · · .
The two procedures will certainly lead the merchant to the same result because no matter how much money he has there is only a finite number of coins or bills to count. But for us who must add an infinite number of indivisibles the difference between the two methods is of capital importance.
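To make Lebesgue's range-partition recipe concrete, here is a numerical sketch (our own worked example, not from the text) for the increasing function f(x) = x² on [0, 1]. Because f is increasing, each set $E_i = f^{-1}((m_{i-1}, m_i]) = (\sqrt{m_{i-1}}, \sqrt{m_i}\,]$ is an interval, so its measure is just its length, and the lower and upper sums trap the familiar value $\int_0^1 x^2\,dx = 1/3$.

```python
import math

def lebesgue_sums(p):
    # Lower and upper Lebesgue sums for f(x) = x^2 on [0, 1], using the
    # uniform range partition 0 = m_0 < m_1 < ... < m_p = 1.  Here
    # E_i = (sqrt(m_{i-1}), sqrt(m_i)], so m(E_i) is its length, and
    # E_0 = {0} has measure zero, so the m_0 * m(E_0) term vanishes.
    ms = [i / p for i in range(p + 1)]
    lower = upper = 0.0
    for i in range(1, p + 1):
        length = math.sqrt(ms[i]) - math.sqrt(ms[i - 1])   # m(E_i)
        lower += ms[i - 1] * length
        upper += ms[i] * length
    return lower, upper

for p in (4, 40, 400):
    print(p, lebesgue_sums(p))
# both sums squeeze toward 1/3 as the mesh of the range partition shrinks
```

The gap between the two sums is at most the mesh 1/p times the total length 1, exactly the mechanism behind the limit (1.2).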
1.1.2. Measurable sets, σ-algebras and integrals. Now what properties should measurable sets have? Certainly the empty set ∅, since it has nothing in it, should be measurable with measure (or "size") zero. Also, any bounded interval (a, b), (a, b], [a, b), [a, b] should be measurable (with measure b − a). You might recall that any open subset of R can be written as a countable union of open intervals.⁴ Since open sets are so fundamental to mathematics, we would surely want

³If you are interested, see Chapter 5 for the details; the integral of a function taking on both positive and negative values is defined by breaking up the function into a difference of two nonnegative functions. The integral of the function is the difference of the integrals of the two nonnegative functions, provided these integrals exist.
⁴If you don't remember this, see Section 1.4.
open sets to be measurable. Thus, we would also like measurable sets to be closed under countable unions (that is, a countable union of measurable sets should be measurable). Since closed sets are also fundamental, and closed sets are just complements of open sets, we would like measurable sets to also be closed under taking complements. To summarize: The empty set should be measurable, measurable sets should be closed under countable unions, and measurable sets should be closed under complements. Finally, intervals should be measurable. Omitting the last property, which deals specifically with the real line, and generalizing these considerations, we are led to the following definition. A collection of subsets S of a set X is called a σ-algebra of subsets of X if⁵

(1) ∅ ∈ S;
(2) $A_n \in S$, n = 1, 2, . . . implies $\bigcup_{n=1}^{\infty} A_n \in S$ (that is, S is closed under countable unions);
(3) A ∈ S implies $A^c = X \setminus A \in S$ (that is, S is closed under complements).
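On a finite set X every σ-algebra is finite and countable unions reduce to finite ones, so the smallest σ-algebra containing a given collection can be computed by brute-force closure under complements and unions. The following sketch (our illustration; the function name is ours) does exactly that.

```python
from itertools import combinations

def generated_sigma_algebra(X, generators):
    # Smallest collection containing the generators (plus the empty set
    # and X) that is closed under complements and unions.  On a finite
    # set, repeatedly closing under complement and pairwise union
    # reaches a fixed point, which is the generated sigma-algebra.
    X = frozenset(X)
    S = {frozenset(), X} | {frozenset(g) for g in generators}
    while True:
        new = {X - A for A in S} | {A | B for A, B in combinations(S, 2)}
        if new <= S:        # nothing new: S is closed, so we are done
            return S
        S |= new

X = {1, 2, 3, 4}
S = generated_sigma_algebra(X, [{1}])
print(sorted(sorted(A) for A in S))
# [[], [1], [1, 2, 3, 4], [2, 3, 4]]  -- the sigma-algebra generated by {1}
```

Note that closure under complement and union also yields intersections via De Morgan's laws, which is why condition (2) could equally be stated with countable intersections (footnote 5 below).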
For example, the Borel sets B, to be discussed more thoroughly in Section 1.4, form (roughly speaking) the σ-algebra of subsets of R obtained by taking countable unions and complements of intervals, doing these operations either finitely or infinitely many times.⁶ We say that B is the σ-algebra generated by the intervals, or that B is the smallest σ-algebra containing the intervals. Thus, every Borel set should have a measure. Another important σ-algebra is the Lebesgue measurable sets M. In a sense that can be made precise (see Theorem 3.16), the collection M makes up the largest σ-algebra containing the intervals on which the notion of measure has "nice" properties, where we now describe "nice".

Now what "nice" properties should our measure m have? As stated already, the measure of any interval should be the length of the interval, e.g. for a left-half open interval (a, b], we have m(a, b] = b − a. Also, the empty set should have zero measure: m(∅) = 0. Now observe that

$(0,1] = \bigcup_{n=1}^{\infty} \Big(\frac{1}{n+1}, \frac{1}{n}\Big] = \Big(\frac{1}{2}, 1\Big] \cup \Big(\frac{1}{3}, \frac{1}{2}\Big] \cup \Big(\frac{1}{4}, \frac{1}{3}\Big] \cup \cdots$

is a countable union of pairwise disjoint intervals.

[Figure: the interval (0, 1] chopped into the disjoint intervals (1/2, 1], (1/3, 1/2], (1/4, 1/3], . . . accumulating at 0.]

Moreover, observe that

(1.4)   $m(0,1] = \sum_{n=1}^{\infty} m\Big(\frac{1}{n+1}, \frac{1}{n}\Big],$

since m(0, 1] = 1 and the right-hand side is also 1, because it's a telescoping sum:

$\sum_{n=1}^{\infty} \Big(\frac{1}{n} - \frac{1}{n+1}\Big) = 1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \frac{1}{4} + \cdots = 1.$
5Since the complement of a union of sets is the intersection of the complements of the sets, we can replace (2) with the condition that S be closed under countable intersections. 6Actually, to make this precise, we need the notion of transfinite induction. You can read about transfinite induction and the Borel sets on page 101 of [253].
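The telescoping computation behind (1.4) can be checked exactly with rational arithmetic; the following sketch (our illustration) computes the partial sums of the series of lengths of the disjoint intervals (1/(n+1), 1/n].

```python
from fractions import Fraction

def partial_sum(N):
    # sum of the lengths m(1/(n+1), 1/n] = 1/n - 1/(n+1) for n = 1, ..., N;
    # the sum telescopes to 1 - 1/(N+1) = N/(N+1).
    return sum(Fraction(1, n) - Fraction(1, n + 1) for n in range(1, N + 1))

for N in (1, 2, 10, 1000):
    print(N, partial_sum(N))
```

The partial sums N/(N+1) increase to 1 = m(0, 1], exactly the countable additivity asserted in (1.4).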
The property (1.4) is called countable additivity. Of course, this example was concocted, but no matter what geometric example you can think of, this “countable additivity” of measure (length, area, volume, . . .) always holds; for example, consider the following square of side length 1 with smaller squares drawn within it:
Figure 1.4. Each shaded square is measurable, so (by the σ-algebra property) the union of the squares is also measurable. Moreover, we leave it as a fun exercise to show that the area of the union is 1/3, which also equals the sum of the areas of the shaded squares.

Thus, countable additivity is an inherent property of measure. We generalize these considerations as follows. A measure on a σ-algebra S (of subsets of some set X) is a map⁷

µ : S → [0, ∞]
such that µ(∅) = 0 and such that µ is countably additive in the sense that for any set A ∈ S written as a union of pairwise disjoint sets, A = A1 ∪ A2 ∪ A3 ∪ · · · with An ∈ S for all n, we have µ(A) = µ(A1 ) + µ(A2 ) + µ(A3 ) + · · · ;
this countable additivity is a generalization of (1.4). The space X (really, the triple (X, S, µ)) is called a measure space. Maurice René Fréchet (1878–1973), in his 1915 paper [99], building on the 1913 paper [232] of Johann Radon (1887–1956) concerning Rn, seems to be the first to define measures on abstract spaces, that is, spaces that are not Euclidean. Figure 1.5 shows some examples of measures.

Figure 1.5. Angle, area, volume, cardinality and probability (a coin, heads or tails up) are measures. In each case a number is assigned that measures "how much" of each quantity there is.
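The coin example of Figure 1.5 is small enough to write out in full; here is a sketch (an illustrative aside, not part of the text) realizing the fair-coin probability measure on the σ-algebra of all subsets of X = {H, T}:

```python
# The fair coin of Figure 1.5 in code: on the finite sample space X = {H, T}
# we take S to be the sigma-algebra of ALL subsets, and the probability
# measure gives each outcome mass 1/2.  On a finite space, countable
# additivity reduces to finite additivity, checked here on the disjoint
# decomposition X = {H} ∪ {T}.
from itertools import combinations

X = ("H", "T")
subsets = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

def prob(A):
    """Fair-coin probability measure: mu(A) = |A| / |X|."""
    return len(A) / len(X)

assert prob(frozenset()) == 0                                   # mu(empty set) = 0
assert prob(frozenset(X)) == prob(frozenset({"H"})) + prob(frozenset({"T"}))
assert all(0 <= prob(A) <= 1 for A in subsets)                  # a map into [0, 1]
```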
With the definition of a measure space, we can present the definition of the integral in complete generality! Let S be a σ-algebra of subsets of a set X and let µ be a measure on S. The sets in S are called measurable sets. We call a bounded nonnegative function f : X → [0, ∞) measurable if each set Ei of the form (1.1) is measurable, that is, Ei ∈ S; in particular, the sum (1.2) is defined for each n. We then define the integral of the function f by the formula (1.3). Note how general this definition is! (The domain of the function f is a set X, which need not be Euclidean space.) As mentioned earlier, the ability to define the integral of functions on abstract spaces is one of the most far-reaching properties of Lebesgue's theory. We remark that nowadays we usually don't define Lebesgue integrability via (1.3); we will in fact define it in an equivalent but slightly different and easier-to-use way that you'll learn in Chapter 5.

7Here, we fix an object that's not a real number and denote it by ∞; we adjoin this object to the set of nonnegative real numbers and call the resulting set [0, ∞]. We'll talk more about infinities in Section 1.5. We allow the symbol ∞ in the codomain because we should allow sets to have infinite measure, such as the real line R which has infinite length.

1.1.3. Assigning measures. The problem: how do we assign a measure, or length, to an arbitrary subset of R? We could do this rather quickly (and many books do this; it is how I learned this material!), but we're going to take the baby-step approach and go slowly, as follows. Now we certainly know how to measure lengths of intervals, e.g. m(a, b] = b − a. Because of the various assortments of intervals available, for concreteness we shall choose one kind to work with. We define I^1 as the collection of all bounded left-half open intervals (a, b]. Thus, the infant step is the observation that we have a completely natural measure

    m : I^1 → [0, ∞),
defined by m(a, b] = b − a where a ≤ b. The question is how to assign lengths to more general subsets of R. Second, the baby step is to define m on more complicated sets such as finite unions of intervals. We denote by E^1 the collection of all finite unions of elements of I^1, the so-called "elementary figures" of R. Thus, we shall try to extend the function m : I^1 → [0, ∞) to a function

    m : E^1 → [0, ∞).
This step is relatively easy. Third, the step to adulthood is to extend m so that it's defined on a σ-algebra containing E^1, such as the Borel sets. The trick is to define m on all subsets of R; that is, we assign a length to every subset of R. Unfortunately, this notion of "length" is not additive! For example, there exist disjoint subsets A, B ⊆ R such that the length of A ∪ B is strictly less than the sum of the lengths of A and B (see Section 4.4.2). Strange indeed! However, we shall prove that there is a σ-algebra M of subsets of R, called the Lebesgue measurable sets, which we mentioned earlier and which will turn out to contain the Borel sets, such that

    m : M → [0, ∞]
is a measure. Here is a summary of our measure theory program.

Our Measure Theory Program:
(1) The collection I^1 has the structure of what's called a "semiring." Hence, the first thing we need to do is understand the properties of semirings. This is done in Section 1.3, where we study other structures such as rings and σ-algebras. (The collection E^1 turns out to be a "ring.")
(2) In Section 1.4 we study the Borel sets B.
(3) In Section 1.6 we study Lebesgue measure, and a slight generalization called the Lebesgue-Stieltjes measure, on I^1.
(4) We then extend m to a function on E^1. This is done in Section 2.3.
(5) Next, we define the length of any subset of R. The idea of how to do this goes back more than 2000 years to Archimedes of Syracuse (287 BC–212 BC). In Proposition 1 of Archimedes' book On the measurement of the circle [122], he found8 the area of a circle of radius r to be πr^2. He did this by approximating the area of a circle by the areas of circumscribed and inscribed regular polygons, whose areas are easy to find; see Figure 1.6. The area of a region found by circumscribing it by simple geometric shapes is called the "outer measure" of the region, and, at least for measure-theoretic purposes, finding areas of regions by circumscribing the region has been favored over inscribing the region. We can also find the outer measure of a subset of Rn, for any n, by similar means. Outer measures will be studied in Section 3.4.
(6) Finally, in Section 3.5 we show that this whole process of extending m from I^1 to E^1, and then finally to M and B, works.
Lastly, we shall meet many interesting friends and sights along our extension journey, such as Pascal, Fermat, and some probability.

Figure 1.6. Circumscribing and inscribing a circle with regular polygons.

◮ Exercises 1.1.
1. Let f(x) = x on an interval [0, b]. Compute the Lebesgue integral ∫_0^b f using the definition in (1.3). Of course, you should get b^2/2.
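Archimedes' squeeze from item (5) of the program is easy to replay numerically. The sketch below is an aside; the polygon-area formulas it uses, n·tan(π/n) for the circumscribed regular n-gon and (n/2)·sin(2π/n) for the inscribed one, are standard facts not derived in the text:

```python
# Archimedes' squeeze: a regular n-gon inscribed in the unit circle has area
# (n/2)*sin(2*pi/n), and a circumscribed one has area n*tan(pi/n).  Both pinch
# the circle's area pi as n grows.
import math

def inscribed_area(n):
    return n / 2 * math.sin(2 * math.pi / n)

def circumscribed_area(n):
    return n * math.tan(math.pi / n)

for n in (6, 96, 10_000):   # n = 96 is the polygon Archimedes pushed to
    assert inscribed_area(n) < math.pi < circumscribed_area(n)
assert circumscribed_area(10_000) - inscribed_area(10_000) < 1e-6
```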
1.2. Probability, events, and sample spaces

After reading the last section, you might be wondering why we emphasized the abstract notion of a σ-algebra and integration on abstract spaces. Why can't we just focus on concrete sets like Euclidean space? In this section we answer this question, and a lot more, through probability. In fact, we show that abstract spaces and notions such as countable unions and intersections are not just "abstract nonsense" but are required in order to answer very concrete questions arising from simple probability examples.9

1.2.1. The beginnings of probability theory. Probability started to flourish in the year 1654, although there had been isolated writings on probability before that year. For example, Girolamo Cardano (1501–1576) wrote Liber de Ludo Aleae on games of chance around 1565, which can be considered the first textbook on the

8Proposition 1 actually says that the area of a circle is equal to that of a right-angled triangle where the sides including the right angle are respectively equal to the radius and circumference of the circle.
9If abstract spaces were of no use except for the sake of being abstract, then we should only talk about Lebesgue measure and integration on Euclidean space! This reminds me of a quote by George Pólya (1887–1985): "A mathematician who can only generalise is like a monkey who can only climb up a tree, and a mathematician who can only specialise is like a monkey who can only climb down a tree. In fact neither the up monkey nor the down monkey is a viable creature. A real monkey must find food and escape his enemies and so must be able to incessantly climb up and down. A real mathematician must be able to generalise and specialise." Quoted in [191].
1.2. PROBABILITY, EVENTS, AND SAMPLE SPACES
17
probability calculus10 (posthumously published in 1663), and Galileo Galilei (1564–1642) wrote Sopra le Scoperte dei Dadi on dice games around 1620. In the year 1654, the French writer Antoine Gombaud, Chevalier de Méré (1607–1684), asked Blaise Pascal (1623–1662) a couple of questions related to gambling. One problem can be called the dice problem and the other can be called the problem of points. The dice problem has to do with throwing two dice until you get a double six; more specifically,

    How many times must you throw the dice in order to have a better than 50–50 "chance" of getting two sixes?
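Although the dice problem is solved properly in Section 1.5, it reduces to a one-line computation once "chance" is given its eventual meaning. The following sketch is an aside; it assumes, as Section 1.5 will justify, that throws are independent and all 36 outcomes of one throw are equally likely, so the chance of no double six in n throws is (35/36)^n:

```python
# Preview of the dice problem: find the smallest n with 1 - (35/36)**n > 1/2,
# i.e. the fewest throws giving a better than 50-50 chance of a double six.

def chance_of_double_six(n):
    """Probability of at least one double six in n throws of two dice."""
    return 1 - (35 / 36) ** n

n = 1
while chance_of_double_six(n) <= 1 / 2:
    n += 1
print(n)  # prints 25
```

With these assumptions the loop stops at n = 25: twenty-five throws are the fewest that give a better than 50–50 chance.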
The problem of points (also called the division problem) deals with how to divide the stakes for an unfinished game; more specifically,

    How should one divide the prize money for a "fair" game (that is, each player has an equal "probability" of winning a match) that is started but was ended before a player won the money?
Pascal solved these problems in correspondence with Pierre de Fermat (1601–1665); you can read their letters in [262]. These problems had been around for many years before Pascal's time; for example, Girolamo Cardano (1501–1576) discussed the dice problem for the case of one die in his 1565 book on games of chance, Liber de Ludo Aleae. To my knowledge, the first published version of the problem of points was by the founder of accounting, Fra Luca Pacioli (1445–1517), in Summa de Arithmetica [218] in 1494. However, it is reasonable to say that probability theory as a mathematical discipline was developed through the discussions between Pascal and Fermat. We study the dice problem and the problem of points in Section 1.5.
The words "chance," "fair," and "probability" in the above descriptions of the dice problem and the problem of points are in quotes because they need to be defined; in other words, do we really know exactly what these questions are asking? In some sense these words should represent numbers from which you can make certain conclusions (e.g. that a game is "fair"). Naturally, whenever we speak of numbers we think of functions. More precisely, these functions are called probability measures, which we'll talk about in due time, and which assign numbers to the outcomes (or events) of a random phenomenon such as a game, for example. In this section we study events, and in Sections 1.5 and 4.1 we study probability measures in great depth.

1.2.2. Sample spaces and events. Probability theory is the study of the mathematical models of certain random phenomena such as, for instance, what numbers land up when you throw two dice or what side of a coin is right-side up when you flip a coin. Whenever you conduct an experiment involving a random phenomenon, the most fundamental fact you need to know is all the possible outcomes of the random phenomenon.
A set containing all the possible outcomes of the experiment is called a sample space for the given experiment. 10“Calculus” conjures up images of limits, differentiation, and integration of functions. However, “calculus” has a much broader meaning (from Webster’s 1913 dictionary): A method of computation; any process of reasoning by the use of symbols; any branch of mathematics that may involve calculation. In this broader sense, there are many calculi in mathematics such as the probability calculus, variational calculus (calculus of variations); some of my research involves what are called pseudodifferential calculi.
Example 1.1. If you toss a coin once,11
a sample space is X = {H, T },
where H represents that the coin lands with heads up and T with tails up. We could also use X = {0, 1} where (say) 1 represents heads and 0 tails.

Example 1.2. A sample space when you throw two dice is

    S = { (1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
          (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
          (3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
          (4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
          (5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
          (6,1), (6,2), (6,3), (6,4), (6,5), (6,6) },

which we usually write as S = {(1,1), (1,2), (1,3), ..., (6,6)}, the set of all pairs (m, n), where m, n ∈ {1, 2, ..., 6}. The numbers m and n represent the numbers that die 1 and die 2, respectively, show.

Example 1.3. What if you want to study the phenomenon of throwing the dice twice in a row; that is, throwing them, picking them up and then throwing them again? If S = {(1,1), ..., (6,6)} is the set of all possible outcomes for a single throw of the dice, a sample space for two throws is X = {(x1, x2) ; x1, x2 ∈ S} = S × S; the first entry x1 ∈ S represents the roll of the dice on the first throw and the second entry x2 ∈ S what happens on the second throw. Similarly, a sample space for throwing the dice n times in a row is S^n = S × S × ··· × S (n factors of S). (Note that S × S can also represent the sample space for throwing four dice at once.)
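For readers who like to experiment, the sample spaces of Examples 1.2 and 1.3 can be generated mechanically; here is a sketch (an aside, not part of the text):

```python
# The sample spaces of Examples 1.2 and 1.3: S for one throw of two dice,
# S x S for two throws, and S^n realized as the set of n-tuples over S.
from itertools import product

S = list(product(range(1, 7), repeat=2))   # all pairs (m, n): one throw of two dice

def n_throws(n):
    """The sample space S^n for n successive throws, generated lazily."""
    return product(S, repeat=n)

two_throws = list(n_throws(2))             # pairs (x1, x2) with each xi in S

assert len(S) == 36
assert len(two_throws) == 36 ** 2          # 1296 outcomes for two throws
assert sum(1 for _ in n_throws(3)) == 36 ** 3
```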
Example 1.4. Now if we want to answer Antoine Gombaud Chevalier de Méré's dice problem, we are not told how many times one should throw the dice. In this situation one can use an idealized phenomenon of throwing the dice infinitely many times. In this case, a sample space is the set of all sequences of elements of S = {(1,1), (1,2), (1,3), ..., (6,6)}, which we can denote in various ways such as

    X = {(x1, x2, x3, ...) ; xi ∈ S for all i} = S × S × S × S × ··· = ∏_{i=1}^{∞} S = S^∞,

the infinite product of S with itself. Here, x1 represents the outcome on the first roll, x2 the outcome on the second roll, and so on.
An event is simply a set of possible outcomes of an experiment; in other words, an event E represents the outcome that one of the elements of E occurs as a result of the experiment.

11Courtesy of my then 2-year-old daughter.
Symbol        Set theory jargon    Probability theory jargon
X             (universal) set      sample space
X             (universal) set      certain event
∅             empty set            impossible event
A             subset of X          event that an outcome in A occurs
Ac = X \ A    complement           event that no outcome in A occurs
A ∪ B         union                event that an outcome in A or B occurs
A ∩ B         intersection         event that an outcome in A and B occurs
A \ B         difference           event that an outcome in A and not in B occurs
Table 1. A set theory/probability theory dictionary.
Example 1.5. Let X = S×S, where S = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}, which represents the sample space for the phenomenon of throwing two dice twice. Here are some examples of events. (1) The trivial events are the two extremes: anything or nothing occurring. The event that anything can happen on the first and second throws is X = S × S, which is called the certain event, and the event that nothing happens on both throws is ∅, which is called the impossible event. (2) The event that we throw a double six on at least one of the two throws is given by the subset A = {(x1 , x2 ) ∈ S 2 ; xi = (6, 6) for some i = 1, 2}.
(3) The event that we throw a pair of odd numbers on at least one of the throws is B = {(x1 , x2 ) ∈ S 2 ; x1 ∈ O or x2 ∈ O},
where O = {(m, n) ; m, n ∈ {1, 3, 5}}.
(4) We now consider events formed by performing set operations with A and/or B. First, the event that we do not throw a double six on either throw is Ac = S^2 \ A = {(x1, x2) ∈ S^2 ; xi ≠ (6, 6) for i = 1, 2}.
(5) The event that we throw either a double six or a pair of odd numbers on at least one throw is C = A ∪ B.
(6) The event that we throw a double six on one of the throws and a pair of odd numbers on the other throw is D = A ∩ B. (7) Finally, the event that we throw at least one double six and we don’t throw any odd pair is E = A \ B.
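Since the sample space of Example 1.5 is finite, the events above can be written out as literal sets, with the jargon of Table 1 becoming set operations in code; here is a sketch (an aside, not part of the text):

```python
# The events of Example 1.5 built literally: S is the 36-outcome space for one
# throw of two dice, and the sample space is S x S for two throws.
from itertools import product

S = set(product(range(1, 7), repeat=2))              # outcomes of one throw
X = set(product(S, repeat=2))                        # two throws: 36**2 outcomes

O = {(m, n) for m in (1, 3, 5) for n in (1, 3, 5)}   # pairs of odd numbers
A = {(x1, x2) for (x1, x2) in X if x1 == (6, 6) or x2 == (6, 6)}
B = {(x1, x2) for (x1, x2) in X if x1 in O or x2 in O}

C = A | B          # double six or odd pair on at least one throw
D = A & B          # double six on some throw and odd pair on some throw
E = A - B          # at least one double six, no odd pair
Ac = X - A         # no double six on either throw
assert Ac == {(x1, x2) for (x1, x2) in X if x1 != (6, 6) and x2 != (6, 6)}
```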
See Table 1 for a dictionary of set theory/probability theory jargon. This simple example shows that the usual set operations (unions, differences, etc.) are an essential part of probability. As a side remark, in the next section we'll study the notion of a ring of sets, which is a collection of sets that is closed under unions, intersections, and differences. The above dictionary shows that the concept of a ring is a very natural object of study in probability.

Example 1.6. Now let X = S × S × ··· where S = {(1,1), (1,2), (1,3), ..., (6,6)}, which represents the phenomenon of throwing two dice infinitely many times. Given a
natural number n, what is the event that we throw a double six on the n-th throw? It is An = {(x1 , x2 , . . .) ; xn = (6, 6)}
= S × S × · · · × S × {(6, 6)} × S × · · · ,
where the set {(6, 6)} occurs in the n-th position. An obvious question is: What is the event that we throw a double six at some point? It is A = {(x1 , x2 , . . .) ; xi = (6, 6) for some i ∈ N}.
Notice that

    A = ⋃_{n=1}^{∞} An    (Event that we throw a double six at some point).

Hence, A is a countable union of sets.
Thus, the notion of countable union is an essential part of probability in the sense that countable unions result from very simple probabilistic questions. In (4) of Example 1.5 we saw that forming complements is also an essential part of probability. (E.g. the event that we never throw a double six is the complement of the event that we do throw a double six at some point.) Conclusion: We see that the idea of studying σ-algebras is not a figment of the imagination but is required to study probability! This conclusion will be even more evident after we look at the concept of . . . 1.2.3. Infinitely often. Consider again the experiment of throwing two dice infinitely many times. What is the event that we throw a double six not just once, twice or even a finite number of times during any given infinite sequence of throws, but an infinite number of times? To answer this question, let’s consider the following abstract problem: Let (1.5)
A1 , A2 , A3 , A4 , . . . , An , . . . , Ak . . .
be events in a sample space X; we are interested in those x ∈ X belonging to infinitely many An ’s, which we denote with the special notation {An ; i.o.} := {x ∈ X ; x belongs to infinitely many An ’s},
the set of x ∈ X that belong to the An ’s “infinitely often.” Contemplating the sequence (1.5) we see that if x ∈ X happens to belong to infinitely many of the sets A1 , A2 , . . . then given any n ∈ N, however large, we can find a k ≥ n such that x ∈ Ak ; otherwise x would be confined to A1 , A2 , . . . , An−1 contrary to the fact that x belonged to infinitely many of A1 , A2 , . . .. To reiterate, we showed that For all n = 1, 2, . . ., there exists a k ≥ n such that x ∈ Ak .
Transforming this into set theory language, the "for all" is really intersection and the "there exists" is really union; that is, x ∈ ⋂_{n=1}^{∞} ⋃_{k≥n} Ak. Reversing this argument shows that if x ∈ ⋂_{n=1}^{∞} ⋃_{k≥n} Ak, then x belongs to infinitely many of A1, A2, .... Thus, we have shown Part (1) of the following proposition. We shall leave Part (2) for you to prove in Problem 1.
Proposition 1.1. Let A1, A2, ... be subsets of a set X.
(1) We have {An ; i.o.} = lim sup An, where

    lim sup An := ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} Ak.
(2) Now define

    {An ; a.a.} = {x ∈ X ; x belongs to An for all but finitely many n's},

the set of x ∈ X that "almost always" belong to an An. Then

    {An ; a.a.} = lim inf An, where lim inf An := ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} Ak.
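The quantifier gymnastics behind Proposition 1.1 can be made concrete on a computer, at least up to a finite horizon; the following sketch is an aside (the horizon-based tests are heuristics, reliable only when the membership pattern settles well before the horizon N):

```python
# Pointwise sketch of Proposition 1.1.  Encode a point x by a predicate
# in_A(n) meaning "x is in A_n".  Then x is in lim sup A_n iff for every n
# there is some k >= n with in_A(k), and x is in lim inf A_n iff in_A(k)
# holds for all k beyond some n.  We truncate both quantifiers at horizon N.

def in_limsup(in_A, N):
    """Truncated 'for all n, there exists k >= n with x in A_k'."""
    return all(any(in_A(k) for k in range(n, N + 1)) for n in range(1, N // 2 + 1))

def in_liminf(in_A, N):
    """Truncated 'there exists n with x in A_k for all k >= n'."""
    return any(all(in_A(k) for k in range(n, N + 1)) for n in range(1, N // 2 + 1))

even = lambda n: n % 2 == 0     # x lies in every even-indexed set: i.o. but not a.a.
tail = lambda n: n >= 5         # x lies in all A_n from n = 5 on: i.o. and a.a.
finite = lambda n: n <= 3       # x lies in only finitely many A_n: neither

assert in_limsup(even, 100) and not in_liminf(even, 100)
assert in_limsup(tail, 100) and in_liminf(tail, 100)
assert not in_limsup(finite, 100) and not in_liminf(finite, 100)
```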
Example 1.7. Consider again the sample space X = S × S × · · · representing the phenomenon of throwing two dice infinitely many times. Given a natural number n, let An be the event that we throw a double six on the n-th throw: An = S × S × · · · × S × {(6, 6)} × S × · · · ,
where the set {(6, 6)} occurs in the n-th position. Then the event that we throw infinitely many double sixes is just {An ; i.o.}. Hence, according to our proposition,

    The event that we throw infinitely many double sixes = ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} Ak.

On the other hand,

    The event that all but a finite number of throws were double sixes = ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} Ak,

which is the event that we throw only a finite number of non-double sixes.
Notice that the right-hand sides of the above events are countable intersections and countable unions; again, the notion of σ-algebra pops out at us! To summarize: After looking at all the above examples, if you didn't know about σ-algebras already, you would be forced to invent them!

1.2.4. Bernoulli sequences, Monkeys, and Shakespeare. A Bernoulli trial is a random phenomenon that has exactly two outcomes, "success" and "failure" (or "yes" and "no," "pass" and "fail," etc.). For example, declaring heads to be a "success" and tails to be "failure," flipping a coin is a Bernoulli trial. Say that we have two dice and we're interested in obtaining a double six. Then throwing two dice becomes a Bernoulli trial if we regard a double six as "success" and not obtaining a double six as "failure." A sample space of a Bernoulli trial can be

    S = {0, 1},
where 1 = “success” and 0 = “failure.”
We are mostly interested in an infinite sequence of trials such as, for example, flipping a coin infinitely many times.12 Such a sequence of trials can be realized as an infinite sequence of 0's and 1's:

    H T H T T T H ········· ⇄ (1, 0, 1, 0, 0, 0, 1, ...)
Sequences of Bernoulli trials are called Bernoulli (or Bernoullian) sequences after Jacob (Jacques) Bernoulli (1654–1705), who studied them in his famous treatise on probability, Ars conjectandi, published posthumously in 1713 by Jacob's nephew Nicolaus Bernoulli (1687–1759). Here is an interesting example of a Bernoulli sequence. We are given a typewriter that has separate keys for lower case and upper case letters and all the other different symbols (punctuation marks, numbers, etc., including a "space"). Let's take a sonnet of Shakespeare, say

12Technically speaking, in an infinite (or even finite) sequence of trials we require the probability of "success" to remain the same at each trial. However, we haven't technically defined probability yet, so we won't worry about this technicality.
“Shall I compare thee to a summer’s day?” Here it is: Shall I compare thee to a summer’s day? Thou art more lovely and more temperate: Rough winds do shake the darling buds of May, And summer’s lease hath all too short a date: Sometime too hot the eye of heaven shines, And often is his gold complexion dimm’d; And every fair from fair sometime declines, By chance, or nature’s changing course, untrimm’d; But thy eternal summer shall not fade, Nor lose possession of that fair thou owest; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou growest; So long as men can breathe, or eyes can see, So long lives this, and this gives life to thee.
My word processor tells me there are a total of 632 symbols here (including spaces). Now let’s do an experiment. We put a monkey in front of the typewriter,
have him hit the keyboard 632 times, remove the paper, put in a new paper, have him hit the keyboard 632 more times, remove the paper, etc., repeating this process infinitely many times. We are interested in whether or not the monkey will ever type Shakespeare's sonnet 18. Here, a "success" is that he types it and a "failure" is that he doesn't type it. Thus, the sample space for this experiment is the space of Bernoulli sequences X = S^∞ = {(x1, x2, x3, ...) ; xi ∈ S := {0, 1}}, where on the i-th page, xi = 1 if the monkey successfully types sonnet 18 and xi = 0 if the monkey fails; e.g.

    fails, fails, fails, fails, success, fails, ········· ⇄ (0, 0, 0, 0, 1, 0, ...)
One might ask: given n ∈ N, what's the event that the monkey types Shakespeare's sonnet 18 on the n-th try? The answer is An = S × S × ··· × S × {1} × S × S × ··· where {1} occurs in the n-th factor. Here are some more questions one might ask, whose answers can be written in terms of An:
(1) Question: Given n ∈ N, what's the event that the monkey types Shakespeare's sonnet 18 at least once in the first n pages? Answer:

    ⋃_{k=1}^{n} Ak.
(2) Question: Given n ∈ N, what's the event that the monkey does not type Shakespeare's sonnet 18 in any of the first n pages? Answer:

    (⋃_{k=1}^{n} Ak)^c = {0} × {0} × ··· × {0} × S × S × S × ···,

where the {0}'s occur in the first n factors.
(3) Question: What's the event that the monkey eventually types Shakespeare's sonnet 18? Answer: It is

    ⋃_{n=1}^{∞} An.

(4) Question: Finally, what is the event that the monkey types Shakespeare's sonnet 18 infinitely many times? Answer: It is

    {An ; i.o.} = ⋂_{k=1}^{∞} ⋃_{n=k}^{∞} An.
Once we learn more about probability measures, we will be able to compute the probability of each of these events occurring. In particular, it is interesting to note that the probability that Shakespeare's sonnet 18 is typed infinitely many times is one (see Section 4.1)! On the other hand, for all practical purposes it is impossible that the monkey will type Shakespeare's sonnet 18 in any realistic amount of time (see Section 2.4)!

1.2.5. The measure problem for probability. Back in Section 1.1.3 we studied the measure problem for length. Before jumping into the abstract material in the next section we briefly study the corresponding problem for probability. In Section 1.5 we will give a thorough discussion of probability, so here we shall proceed intuitively and not worry about being precise. Below, when we use the term probability, it means what you think it means: It's a number between 0 and 1 measuring the likelihood that an event will occur. Consider the sample space for an experiment consisting of an infinite sequence of coin tosses:

    S^∞ = S × S × S × S × ···,

where S = {0, 1} and 1 represents heads and 0 tails. The event that we throw a head on the first toss (not caring what happens on the other tosses) is

    E = {1} × S × S × S × ···.
What is the probability of the event E occurring? That is, what is the probability that we toss a head on the first toss? Of course, the answer is 1/2. The event that we throw a head on the first toss and a tail on the third toss is F = {1} × S × {0} × S × · · · .
What is the probability of the event F occurring? After some thought, the answer should be 1/2 · 1/2 = 1/4. Now let k ∈ N, let S1, S2, ..., Sk be nonempty subsets of S, and consider the event

(1.6)    S1 × S2 × S3 × ··· × Sk × S × S × S × S × ···,

the event that an outcome in S1 occurs on the first toss, an outcome in S2 occurs on the second toss, ..., and an outcome in Sk occurs on the k-th toss. What is the
probability of this event occurring? After some thought, the answer should be

(1.7)    (1/2)^ℓ,

where ℓ is the number of sets amongst S1, ..., Sk that equal {0} or {1}. A set of the form (1.6) is called a cylinder set. Why "cylinder set"? If we look at R^3 = R × R × R and consider sets A1, A2 ⊆ R, then A1 × A2 × R is a cylinder in R^3 extending above and below the set A1 × A2 in the plane. If we put S = R in (1.6) and let the Si's be subsets of R, then (1.6) would be an "infinite-dimensional" cylinder. If we denote the collection of cylinder sets by C, then we have a map

    µ : C → [0, 1]

defined by assigning a nonempty cylinder set the number (1.7). Note that the empty set is also a cylinder set (just take S1 = ∅); we then set µ(∅) = 0. By the way, the collection C has the properties of a "semiring," to be discussed in the next section.

Now, we know how to define the probability of any event represented by a cylinder set. The question is: Can we define the probability of an event that is not a cylinder set? For instance, what is the probability of tossing infinitely many heads? Given n ∈ N, the event of throwing a head on the n-th toss is

    An = S × S × ··· × S × {1} × S × S × ···,

where {1} occurs in the n-th factor. Thus, the event of throwing infinitely many heads is

    {An ; i.o.} = ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} Ak.
This event is really quite complicated, so it's not entirely obvious what the probability is! (You'll be able to prove that the probability is one after reading Section 4.1.) We are now in a position similar to what we talked about in Section 1.1.3 for length: We have to extend the function µ from C, where it's perfectly defined, to a σ-algebra containing C. The first step is to define µ on more complicated sets such as finite unions of cylinder sets. The collection of finite unions of cylinder sets has the structure of a "ring." The next step is to extend µ further so that it's defined on a σ-algebra containing C. This extension process is very similar to the extension process for length! The beauty of abstract measure theory is that it unites seemingly unrelated topics such as length on the one hand and probability on the other. Our next order of business is to understand semirings, rings, and σ-algebras, which we have already mentioned several times.

◮ Exercises 1.2.
1. Given subsets A1, A2, ... of a set X prove that {An ; a.a.} = {x ∈ X ; x belongs to An for all but finitely many n's} equals

    lim inf An := ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} Ak.
2. (Gambler's ruin) Let S = {0, 1} and let X = S × S × ··· be the sample space of tossing a coin infinitely many times, where 1 represents heads and 0 tails. Suppose a gambler starts with an initial capital of $i where i ∈ N. (He is playing against a person with an infinite amount of money.) On each flip, the gambler bets $1 that a head is thrown; if it's a head he wins $1 and if it's a tail he loses $1. He plays until he goes broke. For each k = 1, 2, ..., define Rk : X → R as follows: If x = (x1, x2, ...) ∈ X,

(1.8)    Rk(x) := 1 if xk = 1,  and  Rk(x) := −1 if xk = 0,

which represents what is gained or lost on the k-th toss.
(i) For each n ∈ N, define Sn : X → R as follows: Given a sequence x = (x1, x2, ...) of coin tosses, Sn(x) := R1(x) + R2(x) + ··· + Rn(x), where Rk is defined in (1.8). Let

    An = {i + Sn = 0} ∩ ⋂_{k=1}^{n−1} {i + Sk > 0},

where {i + Sn = 0} = {x ∈ X ; i + Sn(x) = 0} and {i + Sk > 0} = {x ∈ X ; i + Sk(x) > 0}. Show that An is the event that the gambler goes broke on exactly the n-th play.
(ii) Show that

    ⋃_{n=1}^{∞} An
is the event that the gambler eventually goes broke.
(iii) The gambler aspires to gain $a where a > i is an integer. As soon as he reaches $a he quits (if he hasn't gone broke before reaching $a). What is the event that he actually does reach $a? (If you're interested, see Section 4.1.1 for an analysis of the folly of gambling.)

3. (Winning streaks) Let X = S × S × ···, where S = {0, 1}, be the sample space for the experiment of tossing a coin infinitely many times. Let n ∈ N. In an infinite sequence of coin tosses, what is the event that at some point a head is tossed n times in a row? What is the event that no run of more than n heads occurs in a sequence of coin tosses? Considering the gambler from the previous problem who bets on coin flips, what is the event that the gambler has a streak of n wins in a row?

4. (Random walks) Suppose that you put a particle at the origin. Each second thereafter, the particle moves one unit (either to the right or to the left). The path the particle follows is called a random walk.
(i) Let 0 denote a move one unit to the left and 1 a move one unit to the right. Let S = {0, 1} and explain why X = S × S × S × ··· represents a sample space for the random phenomenon of a particle undergoing a random walk.
(ii) A more traditional sample space is (see e.g. Feller's classic [94, Ch. 3]) the set X2 ⊆ {0, 1, 2, ...} × Z, where X2 = {(0, x0), (1, x1), (2, x2), ... ; x0 = 0, x_{i+1} − x_i ∈ {−1, 1} for i = 0, 1, 2, ...}. Explain why X2 can also be considered a sample space for random walks. Find a bijection between X and X2.
(iii) For n = 0, 1, 2, ..., define Sn : X → R by S0(x) := 0 and Sn(x) := R1(x) + R2(x) + ··· + Rn(x) for n ≥ 1, where Rk is defined in (1.8) in the Gambler's ruin problem. Let a ∈ Z and n = 0, 1, 2, .... In terms of Sn, what is the event that the particle is at the point a at time n? What is the event that the particle visits
the point a for the first time at time n, where if a = 0, we mean that the particle visits the origin for the first time since starting the particle’s journey? (iv) Given a ∈ Z, what is the event that the particle visits the point a infinitely many times?
1.3. Semirings, rings and σ-algebras

The set I 1 of left-half open intervals in R has the simple structure of a “semiring”, while the Borel sets have the more robust structure of a σ-algebra. A fundamental problem in measure theory is that of extending an additive set function on a basic class of subsets (e.g. m on I 1 ) to a σ-algebra containing the basic class (e.g. B, the Borel sets). For this reason, the purpose of this section is to understand these two classes, and the classes that lie in between.
1.3.1. Semirings. Let us start by noting some basic properties of I 1 = {(a, b] ; a, b ∈ R},
where (a, b] = ∅ if b ≤ a. First, ∅ ∈ I 1 as just noted. Also, if I, J ∈ I 1 , then I ∩ J ∈ I 1 , for (a1 , b1 ] ∩ (a2 , b2 ] = (a, b],
where a = max{a1 , a2 } and b = min{b1 , b2 },
as can be verified. Finally, observe that if I, J ∈ I 1 , then I \ J is a union of pairwise disjoint sets in I 1 , for if I = (a1 , b1 ] and J = (a2 , b2 ], then I \ J = (a1 , b] ∪ (a, b1 ],
where b = min{b1 , a2 } and a = max{a1 , b2 },
which can be verified by drawing pictures like the one in Figure 1.7 (of course, one can prove this rigorously too).

Figure 1.7. In this situation, we see that (a1 , b1 ] ∩ (a2 , b2 ] = (a2 , b1 ] and (a1 , b1 ] \ (a2 , b2 ] = (a1 , a2 ].
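The formulas above for I ∩ J and I \ J can be sanity-checked numerically. The following Python sketch (ours, purely illustrative) encodes a left-half open interval (a, b] as a pair and compares the formulas against direct membership tests on a grid of sample points:

```python
def members(I, pts):
    """Sample points of (a, b] among pts (direct membership test)."""
    a, b = I
    return {x for x in pts if a < x <= b}

def intersect(I, J):
    # (a1, b1] ∩ (a2, b2] = (max{a1, a2}, min{b1, b2}]
    return (max(I[0], J[0]), min(I[1], J[1]))

def difference(I, J):
    # I \ J = (a1, min{b1, a2}] ∪ (max{a1, b2}, b1]; pieces may be empty
    (a1, b1), (a2, b2) = I, J
    return [(a1, min(b1, a2)), (max(a1, b2), b1)]

pts = [k / 4 for k in range(-40, 41)]   # a grid of sample points
I, J = (0, 3), (1, 5)
assert members(intersect(I, J), pts) == members(I, pts) & members(J, pts)
pieces = difference(I, J)
assert set.union(*[members(P, pts) for P in pieces]) == members(I, pts) - members(J, pts)
```

Note that a pair (a, b) with b ≤ a simply represents the empty interval, matching the convention (a, b] = ∅ if b ≤ a.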
To summarize: I 1 contains ∅, is closed under intersections, and the difference of two sets in I 1 can be written as a union of pairwise disjoint sets in I 1 . Generalizing these properties, we arrive at the following definition, due to John von Neumann [302, p. 85]: A collection of subsets I of a set X is called a semiring if
(1) ∅ ∈ I ;
(2) If A, B ∈ I , then A ∩ B ∈ I ;
(3) If A, B ∈ I , then there are finitely many pairwise disjoint sets A1 , . . . , AN in I (for some N ∈ N) such that A \ B = ⋃_{n=1}^N An .
We can replace the last statement with the following equivalent one: If A, B ∈ I and B ⊆ A, then B is part of a partition of A in the sense that there are pairwise disjoint sets A1 , . . . , AN in I such that A1 = B and A = ⋃_{n=1}^N An ; see Problem 1. Finally, we can generalize Property (3) above to the following (also see Problem 1): If A, I1 , . . . , In ∈ I , then there are finitely many pairwise disjoint sets J1 , . . . , JN in I such that

(1.9)   A \ ⋃_{k=1}^n Ik = ⋃_{k=1}^N Jk .
Note that Property (3) is exactly this statement with n = 1. One can prove (1.9) using an induction argument on n. We shall use (1.9) in Lemma 1.3. Example 1.8. Here are two simple examples of semirings. Let X = {a, b, c} be a set consisting of three elements and let I = {∅, {a}, {b, c}, X}. It is easy to check that I is a semiring. Another example of a semiring is I = {∅, {a}, {b}, {c}, X}. Example 1.9. Here’s a nonexample: The set of all open intervals is not a semiring. Here, Condition (3) of a semiring is not satisfied because e.g. (0, 2) \ (0, 1) = [1, 2) cannot be written as a finite union of pairwise disjoint open intervals. Similarly, the set of all closed intervals is not a semiring for the same reason. Example 1.10. However, the set of all bounded intervals (open, closed, and half-open ones) is a semiring. (Exercise!) This semiring is a little too large for our tastes, so we like to focus on just the left-half open intervals I 1 .
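For a finite collection, the three semiring axioms can be checked mechanically by brute force. Here is an illustrative Python sketch (ours, not the book's) that verifies the first collection of Example 1.8 is a semiring; `disjoint_cover` implements Condition (3) by trying all subcollections:

```python
from itertools import combinations

def disjoint_cover(target, collection):
    """Can `target` be written as a union of pairwise disjoint sets
    from `collection`?  (Brute force over all subcollections.)"""
    candidates = [s for s in collection if s <= target]
    for r in range(len(candidates) + 1):
        for combo in combinations(candidates, r):
            union = frozenset().union(*combo) if combo else frozenset()
            disjoint = all(a.isdisjoint(b) for a, b in combinations(combo, 2))
            if disjoint and union == target:
                return True
    return False

def is_semiring(I):
    if frozenset() not in I:
        return False                       # (1) ∅ ∈ I
    for A in I:
        for B in I:
            if A & B not in I:
                return False               # (2) closed under intersections
            if not disjoint_cover(A - B, I):
                return False               # (3) A \ B is a disjoint union
    return True

X = frozenset("abc")
I = {frozenset(), frozenset("a"), frozenset({"b", "c"}), X}
print(is_semiring(I))   # → True (the first collection of Example 1.8)
```

The brute force is exponential, so this is only for small examples; the infinite nonexamples (open intervals, closed intervals) cannot be checked this way.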
Define I n as the collection of all left-hand open boxes13 in Rn :

(a1 , b1 ] × · · · × (an , bn ],

where ai , bi ∈ R. Here's a picture of such a box (a1 , b1 ] × (a2 , b2 ] when n = 2 (in which case the 2-dimensional box is, of course, called a rectangle).

Figure 1.8. Here we drew dotted lines to emphasize lines not part of the rectangle; in the future we shall be careless and usually draw the boxes with solid lines. (E.g. see Figure 1.9 in a few pages.)
We've shown that I 1 is a semiring; is it true that I n is a semiring for any n? The answer is “yes,” which follows from the following result.

Products of semirings

Proposition 1.2. If I1 , . . . , IN are semirings, then so is the product

I1 × · · · × IN := {A1 × · · · × AN ; A1 ∈ I1 , . . . , AN ∈ IN }.
In particular, I n = I 1 × · · · × I 1 (n factors of I 1 ) is a semiring.
Proof : For notational simplicity, we prove this result for only two semirings. Thus, we show that if I and J are semirings, then I × J is also a semiring. Since ∅ = ∅ × ∅, ∅ ∈ I × J . So, we are left to verify Conditions (2) and (3) of a semiring. 13Geometrically, an element of I n is not a “left-hand open box” as seen in the picture; it’s not just open on the left, it’s open at the left and at the bottom! Elements of I n are really just products of left-hand open intervals, but the name “left-hand open boxes” seems to have stuck.
Let A, B ∈ I × J . Then

A = C × D for some C ∈ I , D ∈ J , and B = E × F for some E ∈ I , F ∈ J .

By definition of set intersection, it's straightforward to show that A ∩ B = (C ∩ E) × (D ∩ F ). Since I and J are semirings, C ∩ E ∈ I and D ∩ F ∈ J . Thus, A ∩ B ∈ I × J . Suppose now that B ⊆ A; we need to show that B is part of a partition of A. Since B ⊆ A, we have E ⊆ C and F ⊆ D. As I and J are semirings, we can write

C = ⋃_n Cn ,   D = ⋃_m Dm ,

where the unions are finite, the Cn ∈ I are pairwise disjoint for different n's, the Dm ∈ J are pairwise disjoint for different m's, and where C1 = E, D1 = F . By properties of the Cartesian product, we have

A = C × D = ⋃_{n,m} Cn × Dm .
Since the Cn ’s are pairwise disjoint and the Dm ’s are pairwise disjoint, it follows that the sets Cn × Dm are disjoint for different (n, m). Moreover, C1 × D1 = E × F = B. Hence, A is a union of pairwise disjoint sets in I × J which contains B as one of the sets. Our proof is complete.
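The key computation in the proof, (C × D) ∩ (E × F ) = (C ∩ E) × (D ∩ F ), can be illustrated concretely. In this Python sketch (ours; the two toy semirings are made up for the example), a product set A × B is encoded as its set of ordered pairs:

```python
from itertools import product as cartesian

# Two toy semirings (the first is Example 1.8's; the second is similar).
I = [frozenset(), frozenset("a"), frozenset({"b", "c"}), frozenset("abc")]
J = [frozenset(), frozenset("x"), frozenset("y"), frozenset("xy")]

# Encode A × B as its set of ordered pairs; form the product I × J.
prod = {frozenset(cartesian(A, B)) for A in I for B in J}

# Closure under intersections, the first step of Proposition 1.2's proof:
# (C × D) ∩ (E × F) = (C ∩ E) × (D ∩ F).
for P in prod:
    for Q in prod:
        assert P & Q in prod
```

Encoding A × B as a set of pairs makes ∩ of product sets literally the set intersection, which is why the componentwise identity can be checked by a plain `&`.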
Let A1 = (a1 , b1 ] and A2 = (a2 , b2 ] be overlapping intervals in the semiring I 1 , with a1 < a2 < b1 < b2 .
Observe that although A1 and A2 are not disjoint, we can write the union A1 ∪ A2 as a union of disjoint sets in I 1 : A1 ∪ A2 = B1 ∪ B2 ∪ B3 , where B1 = (a1 , a2 ], B2 = (a2 , b1 ], B3 = (b1 , b2 ]. In other words, the union A1 ∪ A2 of elements of I 1 can be replaced by pairwise disjoint elements that have the same union. The following lemma says that this property holds (even for countable unions) for any semiring and is one of the fundamental properties of semirings.

Fundamental Lemma of Semirings

Lemma 1.3. If {An } are countably many sets in a semiring I , then there are pairwise disjoint sets {Bnm } in I such that for each n, Bn1 , Bn2 , Bn3 , . . . are finite in number and are subsets of An , and

⋃_n An = ⋃_{n,m} Bnm .
Proof : Given such a countable collection {An }, the first step is to replace this collection with a collection of pairwise disjoint sets having the same union. Here’s the (standard) way to do so: Define a sequence of sets {Bn } by B1 = A1 , B2 = A2 \ A1 , B3 = A3 \ (A1 ∪ A2 ) ,
and in general,

Bn = An \ ⋃_{k=1}^{n−1} Ak .
Picture the first few steps with the Ai 's as overlapping rectangles: B1 = A1 , B2 = A2 \ A1 , B3 = A3 \ (A1 ∪ A2 ).
Note that Bn need not be in the semiring since semirings need not be closed under unions and set differences. We claim that
(1) Bn ⊆ An for each n.
(2) The Bn 's are pairwise disjoint.
(3) A = ⋃_n Bn , where A = ⋃_n An .
Indeed, Bn is a subset of An by definition of Bn . To prove (2), note that if m < n, then Bn by definition does not contain any points of Am , while we already know from (1) that Bm ⊆ Am ; hence Bn ∩ Bm = ∅. To prove (3), note that since Bn ⊆ An for all n, we have ⋃_n Bn ⊆ A. To prove that A ⊆ ⋃_n Bn , let x ∈ A = ⋃_n An . Then x ∈ Ak for some k. By the well-ordering principle of N, we may assume that k is the smallest natural number such that x ∈ Ak . Then x ∉ A1 , . . . , Ak−1 and hence x ∈ Bk by definition of Bk . Thus, x ∈ ⋃_n Bn , which proves (3).
Now, by (1.9) we can write Bn as a finite union Bn = ⋃_m Bnm where the Bnm ∈ I are pairwise disjoint. Now unioning over n we get A = ⋃_n Bn = ⋃_{n,m} Bnm , exactly as we wanted.
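The construction B1 = A1, Bn = An \ (A1 ∪ · · · ∪ An−1) is a standard trick worth internalizing. Here is a direct Python sketch (ours) that performs the disjointification and checks claims (1)–(3) on a sample family:

```python
def disjointify(A):
    """Given sets A_1, A_2, ..., return B_1, B_2, ... where
    B_n = A_n \\ (A_1 ∪ ... ∪ A_{n-1}); the B_n are pairwise
    disjoint, B_n ⊆ A_n, and they have the same union as the A_n."""
    B, seen = [], set()
    for An in A:
        B.append(set(An) - seen)
        seen |= set(An)
    return B

A = [set(range(0, 5)), set(range(3, 8)), set(range(6, 10))]
B = disjointify(A)
assert all(Bn <= An for An, Bn in zip(A, B))                                # claim (1)
assert all(B[i].isdisjoint(B[j]) for i in range(len(B)) for j in range(i))  # claim (2)
assert set().union(*B) == set().union(*A)                                   # claim (3)
print(B)   # → [{0, 1, 2, 3, 4}, {5, 6, 7}, {8, 9}]
```

The running set `seen` plays the role of A1 ∪ · · · ∪ An−1, so each An contributes only its new points.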
1.3.2. Rings and σ-algebras. Next in the hierarchy of classes of subsets are rings. A nonempty collection of subsets R of a set X is called a ring of subsets of X if for all A, B ∈ R,
(1) A ∪ B ∈ R (that is, R is closed under unions);
(2) A \ B ∈ R (that is, R is closed under differences).
Since ∅ = A \ A, it follows that ∅ ∈ R. Moreover, we claim that a ring is closed under intersections. Indeed, given two sets A, B, we have the formula

A ∩ B = A \ (A \ B),
as can be verified. Therefore if R is closed under set differences, then automatically it is closed under intersections. Hence, a ring is a nonempty collection of subsets that is closed under unions, intersections, and set differences. It follows that a ring is a semiring, but not vice versa in general. We remark that by induction, one can show that rings are closed under finite intersections and unions; that is, if A1 , . . . , AN ∈ R, then ⋂_{n=1}^N An ∈ R and ⋃_{n=1}^N An ∈ R. Finally, if you are wondering how our “rings” relate to the “rings” you've seen in abstract algebra, see Problem 11.
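The identity A ∩ B = A \ (A \ B), together with ∅ = A \ A, can be spot-checked on random sets; a tiny Python sketch (ours):

```python
import random

random.seed(1)
U = range(20)
for _ in range(100):
    A = {x for x in U if random.random() < 0.5}
    B = {x for x in U if random.random() < 0.5}
    # closure under \ gives closure under ∩ for free:
    assert A & B == A - (A - B)
    # and ∅ is automatic, since ∅ = A \ A:
    assert A - A == set()
```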
Example 1.11. Simple examples of rings are {∅, X} and P(X), where P(X) is the power set of X. One of the most important examples is the ring E n of elementary figures (or sets) in Rn . Such a set is by definition a finite union of left-half open boxes; see Figure 1.9. Theorem 1.5 below shows that E n is a ring. In fact, when you think of a ring, you should think of finite unions of sets.
We now prove that any collection of subsets is contained in a “smallest” ring.
Figure 1.9. Elementary figures are just unions of left-half open boxes.
Theorem 1.4. Let A be any collection of subsets of a given set. Then there exists a unique “smallest” ring containing A , where “smallest” means, by definition, that the ring contains A and the ring is contained in any ring that contains A . This smallest ring is called the ring generated by A and is denoted by R(A ).

Proof : The uniqueness is immediate (why?) so we just have to prove existence. Given a collection A of subsets of a set X, we need to find a collection R(A ) of subsets of X such that (1) R(A ) is a subset of any ring that contains A , and (2) R(A ) is a ring. Observe that if R(A ) is supposed to be contained in any ring that contains A , then it must be contained in the intersection of all rings that contain A . Because of this observation, let's make the following definition:

R(A ) := ⋂_{A ⊆ R} R,

where the intersection is over all rings R such that A ⊆ R. Since the power set P(X) is a ring that contains A , the right-hand intersection is well-defined. By construction, R(A ) is contained in every ring containing A . It remains to prove that R(A ) is a ring. To this end, let A, B ∈ R(A ), which means that A, B ∈ R for all rings R that contain A . Since a ring is closed under unions and differences, it follows that A ∪ B ∈ R and A \ B ∈ R for all rings R that contain A . This implies that A ∪ B ∈ R(A ) and A \ B ∈ R(A ), which shows that R(A ) is a ring.
Recall that E n denotes the collection of elementary figures in Rn , that is, the collection of all finite unions of left-half open boxes. The following result implies that the collection of elementary figures is the ring generated by I n .

Theorem 1.5. If I is a semiring, then A ∈ R(I ), the ring generated by I , if and only if A is a finite union of sets in I , if and only if A is a finite union of pairwise disjoint sets in I .

Proof : By the Fundamental Lemma of Semirings (Lemma 1.3), the last two statements are equivalent so we just have to prove the first if and only if. Let U be the collection of all finite unions of sets in I ; we need to prove that R(I ) = U . Since a ring is, by definition, closed under taking finite unions and I ⊆ R(I ), it follows that R(I ) contains all finite unions of sets in I ; that is, U ⊆ R(I ). Since I ⊆ U , if we show that U is a ring, then we must have U = R(I ) since R(I ) is the smallest ring containing I . The set U is closed under unions by definition. To see that U is closed under differences, let A, B ∈ U . Then A = ⋃_n An and B = ⋃_m Bm are finite unions of sets in I , and so by properties of
sets,

A \ B = (⋃_n An) \ (⋃_m Bm) = ⋃_n (An \ ⋃_m Bm).

Since I is a semiring and An , Bm ∈ I , by (1.9) we can write

An \ ⋃_m Bm = ⋃_k Cnk ,

a finite (pairwise disjoint) union of sets Cnk ∈ I . Hence,

A \ B = ⋃_n (An \ ⋃_m Bm) = ⋃_n ⋃_k Cnk = ⋃_{n,k} Cnk

is a finite union of sets in I , and so, U is closed under differences.
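For a finite semiring, Theorem 1.5 can be observed computationally: closing the semiring under ∪ and \ produces exactly the finite unions of semiring sets. A brute-force Python sketch (ours; the closure routine is illustrative, not from the text), using the second collection of Example 1.8:

```python
from itertools import combinations

def ring_generated_by(I):
    """R(I) for a finite collection: iterate closure under ∪ and \\
    until nothing new appears (brute force, finite universes only)."""
    R = set(I)
    while True:
        new = {A | B for A in R for B in R} | {A - B for A in R for B in R}
        if new <= R:
            return R
        R |= new

# The second collection of Example 1.8: ∅, the singletons of {a, b, c}, and X.
I = {frozenset(), frozenset("a"), frozenset("b"), frozenset("c"), frozenset("abc")}
R = ring_generated_by(I)

# Theorem 1.5: R(I) is exactly the finite unions of sets in I.
finite_unions = set()
for r in range(len(I) + 1):
    for combo in combinations(I, r):
        finite_unions.add(frozenset().union(*combo) if combo else frozenset())
assert R == finite_unions
```

For this particular I the generated ring is the full power set of {a, b, c}, since every subset is a union of singletons.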
We now discuss the all-important σ-algebras: A σ-algebra is a ring that is closed under countable unions and which contains the whole space X. Thus, a σ-algebra S satisfies14
(I) If An ∈ S , n = 1, 2, . . ., then ⋃_{n=1}^∞ An ∈ S ;
(II) If A, B ∈ S , then A \ B ∈ S ;
(III) X ∈ S .
Observe that if A ∈ S , then Ac = X \ A ∈ S , since S (being a ring) is closed under set differences. Thus, a σ-algebra S satisfies, in particular, the following properties:
(1) ∅ ∈ S ;
(2) If An ∈ S , n = 1, 2, . . ., then ⋃_{n=1}^∞ An ∈ S ;
(3) If A ∈ S , then Ac = X \ A ∈ S .
In (2) and (3) we say that S is closed under countable unions and closed under complements. Moreover, we claim that any collection S of subsets of a set X satisfying these three properties is a σ-algebra. Indeed, let S denote such a collection. Then by (1) and (3), we see that X = ∅c ∈ S . Also, S is by definition closed under countable unions, so we just have to prove that S is closed under set differences. However, if A, B ∈ S , then by De Morgan laws, we have A \ B = A ∩ B c = (Ac ∪ B)c .
The right-hand side is in S since S is closed under unions and complements; therefore A \ B ∈ S . We shall use the conditions (1), (2), (3) from now on to describe σ-algebras. Recall that a ring is closed under finite intersections; for a σ-algebra, countable intersections are allowed.

Lemma 1.6. A σ-algebra is closed under countable intersections.

Proof : Given An , n = 1, 2, . . ., in a σ-algebra S , we need to show that ⋂_{n=1}^∞ An ∈ S . Put A = ⋃_{n=1}^∞ Acn . Then A ∈ S since S is closed under complements (so each Acn belongs to S ) and countable unions. Moreover, by the De Morgan laws we have

Ac = (⋃_{n=1}^∞ Acn)c = ⋂_{n=1}^∞ An .

Since A ∈ S and S is closed under complements, we get ⋂_{n=1}^∞ An ∈ S .

14Note that the countably infinite union ⋃_{n=1}^∞ An in (1) also covers finite unions, because a finite union A1 ∪ · · · ∪ Ak can be written as the countably infinite union ⋃_{n=1}^∞ An where we put An = ∅ for n = k + 1, k + 2, . . ..
Almost the exact same proof used in Theorem 1.4 establishes the following more general result. The last statement of the theorem shall be left as an exercise.

Theorem 1.7. Given any collection of subsets A of a set, there exists a unique smallest σ-algebra containing A , called the σ-algebra generated by A and denoted by S (A ). Moreover, we have S (A ) = S (R(A )); that is, the σ-algebra generated by A equals the σ-algebra generated by the ring generated by A .

Example 1.12. Trivial examples of σ-algebras are {∅, X} and P(X) for any set X. For our purposes, the most important examples of σ-algebras are the Borel sets in Rn , which form the σ-algebra S (I n ) generated by I n and which we discuss in Section 1.4, and the σ-algebra generated by the cylinder sets of a sequence of random experiments, which is the topic of the next section.
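When the underlying set X is finite, countable unions reduce to finite ones, and S (A ) can actually be computed by closing under complements and unions until nothing new appears. A Python sketch (ours, and only valid for finite X):

```python
def sigma_algebra_generated_by(A, X):
    """S(A) for a finite universe X: close under complements and
    unions (countable = finite here) until stable."""
    S = set(A) | {frozenset()}
    while True:
        new = {X - s for s in S} | {s | t for s in S for t in S}
        if new <= S:
            return S
        S |= new

X = frozenset(range(4))
S = sigma_algebra_generated_by({frozenset({0})}, X)
print(sorted(sorted(s) for s in S))   # → [[], [0], [0, 1, 2, 3], [1, 2, 3]]
```

As the output shows, a single generator A produces S ({A}) = {∅, A, Aᶜ, X}, which is one answer to Problem 7(a) below in the finite setting.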
Semirings are (usually) very simple objects, elements of the ring generated by the semiring are slightly more complicated being unions of elements of the semiring, while elements of σ-algebras can take on imaginative shapes since they may involve countable unions, intersections and complements of elements of the semiring. For example, let X be a nonempty set and let I denote the collection of all singletons of X together with the empty set; so, A ∈ I means A = ∅ or A = {x} where x ∈ X. It’s easy to check that I is a semiring (see Problem 7). Now consider the following picture:
In I  |  In R(I )  |  In S (I )
The left represents a singleton set {x} ∈ I , consisting of just a single point of X. In the middle is an element of the ring R(I ), a finite union of singletons, which is just a finite subset of X, and finally, on the right is a countable number of points in X (densely packed together to make them look continuous), which is an element of the σ-algebra S (I ). In Figure 1.10 we give a summary of the relationships between the various collections of sets that we have so far introduced. Since Theorem 1.5 gives a precise description of elements of R(I ), the ring generated by a semiring I , you may be wondering if there is a similar descriptive theorem for S (I ). Certainly if a set can be obtained from I by taking at most countably many combinations of unions, intersections, and/or complements, then the set is in S (I ) because S (I ) is closed under such operations. The converse statement (any set in S (I ) can be obtained from I by taking at most countably many combinations of unions, intersections, and/or complements) is “false”, but
can be made “true” using transfinite induction, a subject that is not a prerequisite for reading this book! However, it's useful to think of S (I ) as exactly those sets obtained in this way, just to have a mental image of S (I ).

Figure 1.10. Every σ-algebra is a ring, and every ring is a semiring.

1.3.3. Sequence space. When we discussed the measure problem for probability in Section 1.2.5 we stated that the cylinder sets form a semiring. We now prove this in a slightly more general situation than just for Bernoulli sequences. Let X1 , X2 , . . . be sample spaces and let X be the set of all infinite sequences:

X = {(x1 , x2 , x3 , . . .) ; xi ∈ Xi for all i},

which can be denoted by X1 × X2 × X3 × X4 × · · · or ∏_{i=1}^∞ Xi .
X is called a sequence space; sequence spaces model infinitely many experiments performed in sequence (where X1 is the sample space of the first experiment, X2 of the second, etc.). If X1 = X2 = X3 = · · · all equal the same sample space, say S, then X = S ∞ , the infinite product of S with itself, and S ∞ represents a model for an experiment repeated an infinite number of times as we discussed in Section 1.2. For example, if S = {(j, k) ; j, k = 1, . . . , 6}, then S ∞ is a sample space for throwing two dice an infinite number of times; if S = {0, 1}, then S ∞ is the space of Bernoulli sequences; and if S = (0, 1], then S ∞ is a sample space for picking an infinite sequence of points in (0, 1] “at random”. For each i ∈ N, let Ii be a semiring of subsets of Xi and suppose that Xi ∈ Ii for each i. A cylinder set generated by I1 , I2 , . . . is a subset C ⊆ X of the form

C = A1 × A2 × · · · × An × Xn+1 × Xn+2 × Xn+3 × · · ·
for some n ∈ N and events A1 ∈ I1 , A2 ∈ I2 , . . . , An ∈ In . Thus, C represents the event that A1 occurs on the first trial, A2 occurs on the second trial, . . ., and An occurs on the n-th trial, and anything can happen on all the trials after the n-th one. We can also represent a cylinder set by

C = A × Xn+1 × Xn+2 × Xn+3 × · · · , for some n,

where A ∈ I1 × · · · × In , with I1 × · · · × In defined in Proposition 1.2. We denote the collection of cylinder sets generated by I1 , I2 , . . . by C .

Cylinder sets

Proposition 1.8. C forms a semiring, and the ring generated by C consists of all subsets of ∏_{i=1}^∞ Xi of the form

A × Xn+1 × Xn+2 × · · ·  for some n ∈ N,

where A ∈ R(I1 × · · · × In ).
Proof : Because this proof would be rather long, we shall only prove the semiring statement and leave the ring statement as Problem 2. It's easy to see that C contains the empty set. To prove the other two conditions of a semiring are satisfied, the idea is to apply Proposition 1.2 for the product of finitely many semirings. (The intersection condition is easy, and you should be able to prove it without using Proposition 1.2.) We reduce to the finite case as follows. Let A, B ∈ C and write, for some n, m,

A = A′ × Xn+1 × Xn+2 × Xn+3 × · · ·  and  B = B′ × Xm+1 × Xm+2 × Xm+3 × · · · ,

where A′ ∈ I1 × · · · × In and B′ ∈ I1 × · · · × Im . Note that we can assume that m = n, because if, e.g., n > m, then we can define

B′′ = B′ × Xm+1 × Xm+2 × · · · × Xn ,

and then we have

B = B′ × Xm+1 × Xm+2 × · · · × Xn × Xn+1 × Xn+2 × · · · = B′′ × Xn+1 × Xn+2 × · · · .

Thus, we may assume that
A = A′ × Xn+1 × Xn+2 × · · ·
and
B = B ′ × Xn+1 × Xn+2 × · · · ,
where A′ , B ′ ∈ I1 × · · · × In . Now, since I1 , . . . , In are semirings, so is the product I1 × · · · × In by Proposition 1.2, a fact we’ll use below. Observe that A ∩ B = (A′ ∩ B ′ ) × Xn+1 × Xn+2 × · · · .
By Property (2) of a semiring, we know that A′ ∩B ′ ∈ I1 ×· · ·×In , so A∩B ∈ C . Also observe that A \ B = (A′ \ B ′ ) × Xn+1 × Xn+2 × · · · .
By Property (3) of a semiring we know that

A′ \ B′ = ⋃_{k=1}^N A′k ,
for some pairwise disjoint sets A′1 , . . . , A′N ∈ I1 × · · · × In . It follows that

A \ B = ⋃_{k=1}^N Ak ,
where Ak = A′k × Xn+1 × Xn+2 × · · · , and where A1 , . . . , AN ∈ C are pairwise disjoint. This completes our proof.
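The padding step in the proof (reducing to m = n by appending factors of Xm+1, . . . , Xn) is exactly how one would implement intersection of cylinder sets. A Python sketch (ours; a cylinder is represented by its finite prefix of coordinate events):

```python
def pad(prefix, n, spaces):
    """Extend a cylinder's prefix (A_1, ..., A_m) to length n by
    appending the whole spaces X_{m+1}, ..., X_n (the m = n trick)."""
    return prefix + tuple(spaces[len(prefix):n])

def intersect_cylinders(A, B, spaces):
    """Intersect two cylinder sets coordinatewise after padding both
    prefixes to a common length."""
    n = max(len(A), len(B))
    return tuple(a & b for a, b in zip(pad(A, n, spaces), pad(B, n, spaces)))

# Coin tossing: every X_i = {0, 1}.
spaces = [frozenset({0, 1})] * 10
A = (frozenset({1}),)                      # "heads on toss 1"
B = (frozenset({0, 1}), frozenset({1}))    # "heads on toss 2"
print(intersect_cylinders(A, B, spaces))   # "heads on tosses 1 and 2"
```

Only the finite prefix is stored; the implicit tail Xn+1 × Xn+2 × · · · never changes under intersection, mirroring the computation A ∩ B = (A′ ∩ B′) × Xn+1 × Xn+2 × · · · in the proof.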
We remark that the abstract notion of a semiring captures, at the same time, the essential properties of intervals or boxes (dealing with Euclidean space) on the one hand, and cylinder sets (dealing with probability) on the other, two seemingly distinct concepts! This shows that the power to abstract is, well, powerful. 1.3.4. Inverse images. Our final result of this section shows that under preimages, functions preserve all the structures studied in this section. Proposition 1.9. Let A be either a ring or σ-algebra of subsets of a set X. Then given any function f : X → Y , where Y is any set, the collection Af = {A ⊆ Y ; f −1 (A) ∈ A }
is a class of subsets of Y of the same type as A .

Proof : For concreteness, assume that A is a σ-algebra; the ring case is similar. We shall prove that Af is a σ-algebra. Since f −1 (∅) = ∅ ∈ A , we have ∅ ∈ Af . Let {An } be any sequence of sets in Af , so that f −1 (An ) ∈ A for each n. Since, by set theory,

f −1 (⋃_{n=1}^∞ An) = ⋃_{n=1}^∞ f −1 (An ),
and A is closed under countable unions, it follows that f −1 (⋃_{n=1}^∞ An) ∈ A , which implies ⋃_{n=1}^∞ An ∈ Af by definition of Af . If A ∈ Af , then f −1 (A) ∈ A by definition of Af , and by basic set theory, we have f −1 (Y \ A) = X \ f −1 (A). Since A is closed under complements, X \ f −1 (A) ∈ A , so f −1 (Y \ A) ∈ A , which implies Y \ A ∈ Af . Thus, Af is a σ-algebra.

◮ Exercises 1.3.

1. (a) Prove that Condition (3) of a semiring is equivalent to the following statement: If A, B ∈ I and B ⊆ A, then there are pairwise disjoint sets A1 , . . . , AN in I such that A1 = B and A = ⋃_{n=1}^N An .
(b) Generalize (a) as follows. Let A1 , . . . , An , A be members of a semiring I where A1 , . . . , An are pairwise disjoint and are all subsets of A. Prove that there are sets An+1 , . . . , AN ∈ I such that A1 , . . . , An , An+1 , . . . , AN are pairwise disjoint, and A = ⋃_{k=1}^N Ak . Suggestion: If all else fails, remember your old friend induction.
(c) If A, I1 , . . . , In ∈ I , prove that there are finitely many pairwise disjoint sets J1 , . . . , JN in I such that A \ ⋃_{k=1}^n Ik = ⋃_{k=1}^N Jk .
2. Prove Theorem 1.7 and complete the proof of Proposition 1.8.
3. Let A be either a semiring, ring or σ-algebra of subsets of a set X and let Y ⊆ X. Prove that the “restriction” of A to Y , {A ∩ Y ; A ∈ A }, is a class of the same type as A . (Just choose one class to work with; the proofs have the same flavor.)
4. Let A ⊆ R be a nonempty set and let I be the collection of all left-half open intervals with both end points in A. Prove that I is a semiring.
5. Let A be a nonempty collection of subsets of some set and let I denote the collection of finite intersections A1 ∩ A2 ∩ · · · where for each n, either An ∈ A or Acn ∈ A , such that An ∈ A for at least one n. Prove that I is a semiring.
6. Give an example of two semirings whose intersection is not a semiring. Suggestion: If you've already spent hours trying to find an example, try a simple one such as found in Example 1.8.
We remark that the facts that the intersection of rings remains a ring and that the intersection of σ-algebras remains a σ-algebra were the main ingredients in the proofs of Theorems 1.4 and 1.7, respectively. Thus, in general, given a collection A there may not be a “smallest” semiring containing A .
7. Here are more examples of various classes. Let X be a nonempty set.
(a) Given A ⊆ X, let A = {A} be a single-element collection of subsets of X. What are R(A ) and S (A )? Answer the same question when A = {A, B} where A, B ⊆ X.
(b) Assume X is infinite and let R be the collection of subsets A ⊆ X such that either A is finite or Ac is finite. Show that R is a ring but not a σ-algebra.
(c) Assume X is uncountable and let S be the collection of subsets A ⊆ X such that either A is countable or Ac is countable. Show that S is a σ-algebra. Is it
ever true that S = P(X)? (To answer this question, use the fact that X has an uncountable subset whose complement is uncountable.15)
(d) Let I be the collection consisting of the empty set and all singleton sets of X (sets of the form {x} with x ∈ X). Prove that I is a semiring but not a ring if X has more than one element and prove that R(I ) consists of all finite sets. If X is countable, show that S (I ) = P(X) and if X is uncountable, show that S (I ) is the σ-algebra described in Problem 7c immediately above.
8. Given a collection of subsets A of a set, prove that the σ-algebra generated by A equals the union of the σ-algebras generated by countable subsets of A ; that is, if D is the collection of countable subsets of A , prove that

S (A ) = ⋃_{B∈D} S (B).

Suggestion: Prove that ⋃_{B∈D} S (B) is a σ-algebra containing A .
9. Here's an “abstract” example of a semiring that we'll meet again in Section ??. Let F be any nonempty collection of functions on a nonempty set such that if f, g ∈ F , then max{f, g} and min{f, g} are also in F . Show that the system of “left-half open function intervals” IF , consisting of sets of the form

(f, g] := {(x, t) ∈ X × R ; f (x) < t ≤ g(x)},
where f, g ∈ F with f ≤ g, is a semiring.
10. In this problem we show that preimages behave very nicely (cf. Proposition 1.9). Let A be either a ring or σ-algebra of subsets of a set Y .
(a) Given a function f : X → Y , prove that f −1 (A ) = {f −1 (A) ; A ∈ A } is a class of subsets of X of the same type as A .
(b) It can turn out that f −1 (A ) has a more robust structure than A . For instance, find a function f and a ring R such that f −1 (R) is a σ-algebra.
(c) Find a function f and a semiring A such that f −1 (A ) is not a semiring.
(d) Unfortunately, images don't behave as nicely. Find a function f and sets A and B such that f (A ∩ B) ≠ f (A) ∩ f (B). Find a function f and a ring R such that f (R) is not a ring.
11. (Cf. Problem 5 in Exercises 2.1.) Let R be a ring of sets. For sets A, B ∈ R, define “addition” and “multiplication” of the two sets by, respectively,
A · B := A ∩ B.
(The right-hand side of A + B is called the symmetric difference of A and B and is usually denoted by A∆B.) With these operations, prove that R is a commutative ring in the algebraic sense of the word (thus, you need to state the additive and multiplicative identities). Also prove that R has the following properties: A · A = A,
A+A =0
for all A ∈ R.
Any ring with these properties is called a Boolean ring.
1.4. The Borel sets and the Principle of Appropriate Sets

In this section we study one of the most important σ-algebras, the Borel sets in Rn , named after Félix Édouard Justin Émile Borel (1871–1956). We also introduce the “Principle of Appropriate Sets,” a principle that is used everywhere in measure/integration theory. At the end of this section we show how to define Borel sets in an arbitrary topological space.

15 Assuming certain facts about cardinality (see e.g. [9, p. 78]) one proof goes as follows: There is a bijection f : X × {0, 1} → X (in fact, between X × Y and X for any set Y whose cardinality is not greater than that of X). The set f (X × {0}) ⊆ X is uncountable and so is its complement.
1.4.1. The Borel subsets of Rn . In Borel’s 1898 book [37] he discussed properties of “measure” and “measurable sets” in the interval [0, 1] (taken from [121, p. 103]): When a set is formed of all the points comprised in a denumerable infinity of intervals which do not overlap and have total length s, we shall say that the set has measure s. When two sets do not have common points and their measures are s and s′ , the set obtained by uniting them, that is to say their sum, has measure s + s′ . More generally, if one has a denumerable infinity of sets which pairwise have no common point and having measures s1 , s2 , . . ., sn , . . ., their sum . . . has measure
Émile Borel (1871–1956).
s1 + s2 + · · · + sn + · · · . All that is a consequence of the definition of measure. Now here are some new definitions: If a set E has measure s and contains all the points of a set E ′ of which the measure is s′ , the set E − E ′ , formed of all the points of E which do not belong to E ′ , will be said to have measure s − s′ . . . The sets for which measure can be defined by virtue of the preceding definitions will be termed measurable sets by us ...
At the beginning of the quote, Borel is discussing sets that are countable unions of intervals; then in the “new definition” later on he talks about differences of sets. Thus, the sets he is working with are in the σ-algebra generated by the intervals, which in honor of Borel is nowadays called the “Borel sets”: The collection B n of Borel subsets of Rn is, by definition, the σ-algebra generated by I n , where recall that I n is the collection of left-half open boxes. For n = 1, we denote B 1 by B. Since E n is the ring generated by I n , by Theorem 1.7 of the last section we can also define B n as the σ-algebra generated by E n . Here's a picture to consider:
In I n  |  In R(I n )  |  In S (I n )
On the left is an element of I n , a single left-half open box. In the middle is an element of the ring R(I n ), a finite union of left-half open boxes, and finally, on the right is a “blob”; to be concrete, say a (strange looking) open subset of Rn . Since we'll prove in Section 1.4.3 that open subsets are Borel sets, the “blob” is in B n . See Section 1.4.3 for some neat pictures of Borel sets. We used I n to define the Borel sets, but there is nothing special about left-half open boxes. For example (dealing with n = 1 for simplicity), we claim that B is also the σ-algebra generated by the bounded open intervals. To prove this, let IO be the collection of all bounded open intervals in R; we must show that S (IO ) = B, which means S (IO ) ⊆ B and B ⊆ S (IO ). Take the first inclusion: S (IO ) ⊆ B.
To prove this we use the so-called Principle of Appropriate Sets, explained as follows. Let C be a collection of sets (e.g. C = IO ) and let A be a σ-algebra (e.g. A = B); when can we say that S (C ) ⊆ A ? (In our problem, we need to know that S (IO ) ⊆ B.) The easiest way is via the16

Principle of Appropriate Sets: If C is a collection of sets and all these “appropriate sets” are contained in a σ-algebra A , then S (C ) is also contained in A .

The proof of this principle is trivial: If A is a σ-algebra and C ⊆ A , then A is a σ-algebra containing C and hence S (C ) ⊆ A since S (C ) is the smallest σ-algebra containing C . By the Principle of Appropriate Sets, to prove that S (IO ) ⊆ B, we just have to show that IO ⊆ B. Thus, given (a, b) ∈ IO , we need to prove that (a, b) ∈ B. To see this, observe that

(a, b) = ⋃_{k=1}^∞ (a, b − 1/k].
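The identity (a, b) = ⋃_{k=1}^∞ (a, b − 1/k] can be checked concretely: a point x ∈ (a, b) lies in the k-th piece as soon as k ≥ 1/(b − x). A quick Python sanity check (ours), using exact rational arithmetic:

```python
from fractions import Fraction as F
import math

a, b = F(0), F(1)

def k_needed(x):
    """Smallest k ∈ N with x ∈ (a, b - 1/k]: need b - 1/k >= x,
    i.e. k >= 1/(b - x)."""
    return math.ceil(1 / (b - x))

for x in [F(1, 2), F(9, 10), F(99, 100), F(999, 1000)]:
    k = k_needed(x)
    assert a < x <= b - F(1, k)         # x is in the k-th piece
    assert not (x <= b - F(1, k - 1))   # and not in the (k-1)-st piece
```

The point of the exercise: each x ∈ (a, b) needs a finite k, but no single k works for all x at once, which is why the countable union is essential.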
Since σ-algebras are closed under countable unions, it follows that (a, b) ∈ B. Thus, I_O ⊆ B, so by the Principle of Appropriate Sets we know that S(I_O) ⊆ B. The proof that B := S(I^1) ⊆ S(I_O) has an analogous flavor: We just have to prove I^1 ⊆ S(I_O). To do so, let (a, b] ∈ I^1 and observe that

(a, b] = ⋂_{k=1}^∞ (a, b + 1/k).

Since σ-algebras are closed under countable intersections, it follows that (a, b] ∈ S(I_O). Thus, I^1 ⊆ S(I_O), so by the Principle of Appropriate Sets we know that B = S(I^1) ⊆ S(I_O). Thus, we have shown that S(I_O) = B. We generalize this result in Proposition 1.10 below. First some notation. We shall denote a box (a1, b1] × · · · × (an, bn] in R^n by the notation (a, b] where a and b are the n-tuples of numbers a = (a1, . . . , an), b = (b1, . . . , bn). Other types of boxes are denoted similarly; e.g. [a, b] = [a1, b1] × · · · × [an, bn]. An infinite box is denoted analogously; e.g. (a, ∞) = (a1, ∞) × (a2, ∞) × · · · × (an, ∞), with similar notations for other infinite boxes. The proof of the following result uses the Principle of Appropriate Sets just as in the proof that B^1 equals the σ-algebra generated by the bounded open intervals; we leave it to you as practice in using this very useful principle.
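On a finite universe a σ-algebra is just a finite family of sets closed under complements and (finite) unions, so the "smallest σ-algebra containing C" can be computed by brute-force closure. Here is a small Python sketch of that idea (our illustration, not from the text; the function name is ours):

```python
def generated_sigma_algebra(universe, collection):
    """Smallest family of subsets of `universe` containing `collection`
    and closed under complements and pairwise unions.  On a finite
    universe this is exactly the sigma-algebra S(collection)."""
    X = frozenset(universe)
    family = {frozenset(), X} | {frozenset(C) for C in collection}
    while True:
        new = {X - A for A in family} | {A | B for A in family for B in family}
        if new <= family:          # nothing new: the family is closed
            return family
        family |= new

# The sigma-algebra on {1,2,3,4} generated by {1} and {1,2}: its atoms
# are {1}, {2} = {1,2} \ {1}, and {3,4} = complement of {1,2}, so it
# has 2^3 = 8 members.
sets = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
```

Closure under complement and union automatically yields closure under intersection (De Morgan), which is why those two operations suffice.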
16This principle seems to be popular in Russian books such as [256, 241]. I thank Anton Schick for showing me [256].
1.4. THE BOREL SETS AND THE PRINCIPLE OF APPROPRIATE SETS
39
Proposition 1.10. The Borel sets B^n form the σ-algebra of subsets of R^n generated by any one of the following collections of subsets of the form (here, a and b represent n-tuples of real numbers):

(1) (a, b];   (2) (a, b);   (3) [a, b);   (4) [a, b];
(5) (−∞, a];  (6) (−∞, a);  (7) [a, ∞);  (8) (a, ∞).
We have defined the Borel sets in terms of intervals, but we can instead define them in terms of the topology of R^n; that is, the Borel sets form the σ-algebra generated by the open sets. Although a bit of overkill, we prove this result via the so-called Dyadic Cube Theorem, which is interesting in its own right.

1.4.2. Dyadic Cube Theorem. A dyadic interval is an interval of real numbers of the form

(1/2^j)(k, k + 1] = (k/2^j, (k + 1)/2^j],

where k ∈ Z and j ∈ N. The length of such an interval is 1/2^j. The dyadic intervals of a fixed length form a partition of R, as seen in Figure 1.11.
Figure 1.11. The dyadic intervals (k/2^j, (k + 1)/2^j] where k ∈ Z.

A dyadic cube in R^n of length 1/2^j is a product of n dyadic intervals of length 1/2^j. Thus, a dyadic cube is a cube of the form

(1/2^j)(k, k + 1] := (k1/2^j, (k1 + 1)/2^j] × · · · × (kn/2^j, (kn + 1)/2^j],

where k = (k1, . . . , kn) is an n-tuple of integers and k + 1 = (k1 + 1, . . . , kn + 1). We shall call the number 1/2^j the length of the above dyadic cube, although we should probably say "length of a side of the cube" to be more precise. The set of all dyadic cubes forms a countable set, since the set of all dyadic cubes is in one-to-one correspondence with Z^n × N, where the correspondence is

(k1, . . . , kn, j) ←→ (1/2^j)(k, k + 1].

In the following lemma, we collect some properties of dyadic cubes.

Lemma 1.11. The following properties hold:
(1) A point in R^n is contained in a unique dyadic cube of a given length. In particular, dyadic cubes of a fixed length are pairwise disjoint.
(2) If C and C′ are dyadic cubes of different lengths, then C and C′ intersect if and only if C ⊆ C′ or C′ ⊆ C.

Proof: Since a dyadic cube in R^n is just a product of n dyadic intervals, these properties hold if they hold for intervals. Thus, we shall prove the lemma for intervals. For dyadic intervals, Property (1) is clear from the picture in Figure
1.11. Before proving (2) we note that if I = (a, b] and J = (j, j + 1] where a, b, j ∈ Z, then

(1.10)   I ∩ J ≠ ∅  =⇒  J ⊆ I.
We shall let the reader verify this fact. We now prove (2). Let C and C′ be dyadic intervals that intersect and have different lengths. Consider the case that the length of C is greater than the length of C′; we shall prove that C′ ⊆ C. We can write C = I/2^k where I is of the form I = (i, i + 1] with i ∈ Z and k ∈ N, and C′ = J/2^{k+ℓ} where J is of the form J = (j, j + 1] with j ∈ Z and ℓ ∈ N. Since C and C′ intersect, multiplying by 2^{k+ℓ} we obtain

C ∩ C′ ≠ ∅  =⇒  (I/2^k) ∩ (J/2^{k+ℓ}) ≠ ∅  =⇒  (2^ℓ I) ∩ J ≠ ∅.

According to (1.10), we have J ⊆ 2^ℓ I, which gives C′ ⊆ C after division by 2^{k+ℓ}.
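Property (1) of the lemma can be made concrete: since dyadic intervals are left-open and right-closed, the unique level-j dyadic interval (k/2^j, (k+1)/2^j] containing x is the one with k = ⌈x·2^j⌉ − 1. A small Python sketch of this lookup (ours, not the book's; the function name is hypothetical):

```python
import math
from fractions import Fraction

def dyadic_interval(x, j):
    """Return (k, j) so that x lies in the unique dyadic interval
    (k/2^j, (k+1)/2^j] of length 1/2^j -- Property (1) of Lemma 1.11.
    Left-open/right-closed intervals give k = ceil(x * 2^j) - 1."""
    k = math.ceil(Fraction(x) * 2**j) - 1
    return k, j

# 3/10 lies in (1/4, 2/4], i.e. k = 1 at level j = 2:
assert dyadic_interval(Fraction(3, 10), 2) == (1, 2)
# a right endpoint belongs to the interval ending there: 1/4 in (0, 1/4]
assert dyadic_interval(Fraction(1, 4), 2) == (0, 2)
```

Exact rational arithmetic (`Fraction`) avoids the floating-point pitfalls that would otherwise appear near the dyadic endpoints themselves.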
Dyadic Cube Theorem

Theorem 1.12. Each open set in R^n is a pairwise disjoint union of (countably many) dyadic cubes.

Proof: Note that we don't need to add "countably many" since we already noted above that the set of all dyadic cubes is countable. Before attacking the proof, consider the interval (−1, 1) in R. If you were to decompose (−1, 1) as a pairwise disjoint union of dyadic intervals, you would probably do it as follows:

(−1, 0], (0, 1/2], (1/2, 3/4], (3/4, 7/8], . . .

(The endpoints of the dyadic intervals are dyadic numbers of the form (2^n − 1)/2^n where n = −1, 0, 1, 2, . . ..) Each dyadic interval I in this picture is a subset of (−1, 1) and it has the following property: If I′ is another dyadic interval and I ⊆ I′, then I′ is not contained in (−1, 1). It's this property that we shall exploit in our proof for any open set. For example, consider the interval (1/2, 3/4]. The only dyadic intervals containing (1/2, 3/4] are (0, 1] and (1/2, 1], both of which are not contained in (−1, 1). We carry this property to the general case. Now let U ≠ ∅ be an open set in R^n. Let V be the union of all dyadic cubes C ⊆ U having the property that if C ⊆ C′ where C′ is a dyadic cube of length strictly greater than the length of C, then C′ ⊄ U. By Property (2) in Lemma 1.11, one can check that V is a union of pairwise disjoint dyadic cubes. We show that V = U. By construction, V ⊆ U. To see that U ⊆ V, let x ∈ U. Since U is open, there are dyadic cubes with sufficiently small lengths contained in U that contain x. Among all such cubes, let C be the one with the largest length (which can be at most 1/2). By definition, C is one of the cubes that make up V, so x ∈ C ⊆ V. Thus, U = V and our proof is complete.
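The construction in the proof can be mirrored computationally. The following Python sketch (our illustration, with hypothetical helper names) greedily collects, level by level, the dyadic intervals contained in an open interval (a, b) that are not already covered by a coarser chosen interval; run on (−1, 1) it reproduces the decomposition (−1, 0], (0, 1/2], (1/2, 3/4], . . . up to a chosen resolution.

```python
import math
from fractions import Fraction

def dyadic_decomposition(a, b, max_level):
    """Greedily decompose the open interval (a, b) into pairwise
    disjoint dyadic intervals (k/2^j, (k+1)/2^j], coarse levels first.
    Only levels j = 0, ..., max_level are examined, so the union of
    the returned intervals approximates (a, b) from the inside."""
    a, b = Fraction(a), Fraction(b)
    chosen = []                        # list of (left, right) Fractions
    for j in range(max_level + 1):
        scale = Fraction(1, 2**j)
        k = math.ceil(a / scale)       # smallest k with k/2^j >= a
        while (k + 1) * scale < b:     # (c, d] in (a, b) needs d < b
            left, right = k * scale, (k + 1) * scale
            # skip candidates already inside a coarser chosen interval
            if not any(c <= left and right <= d for c, d in chosen):
                chosen.append((left, right))
            k += 1
    return chosen
```

For example, `dyadic_decomposition(-1, 1, 3)` returns the endpoints of (−1, 0], (0, 1/2], (1/2, 3/4], (3/4, 7/8]. By Lemma 1.11, a finer candidate either sits inside a coarser chosen interval (and is skipped) or is disjoint from it, so the output is automatically pairwise disjoint.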
1.4.3. Borel sets and topology. We can now characterize the Borel sets in terms of the topology of R^n.

Theorem 1.13. The Borel subsets of R^n form the σ-algebra generated by the open sets of R^n.
Proof: By the Principle of Appropriate Sets! Let S be the σ-algebra generated by the open sets; we need to prove that S = B^n. First of all, given a nonempty open subset U ⊆ R^n, by the Dyadic Cube Theorem we can write U as a countable union of left-half open boxes (dyadic cubes), and hence any open set is in B^n. Therefore, S ⊆ B^n since S is the smallest σ-algebra containing the open sets. On the other hand, every open box is an open set, so any open box is in S. It follows that B^n ⊆ S since, by Proposition 1.10, B^n is the smallest σ-algebra containing the open boxes.
Note that to prove this result we didn't need the full power of the Dyadic Cube Theorem. We just needed the fact that any open set is a countable union of (not necessarily pairwise disjoint) boxes, a statement much easier to prove than the Dyadic Cube Theorem; see Problems 2 and 3. As a consequence of Theorem 1.13, if a subset of R^n can be obtained from open or closed sets by repeatedly taking countable unions, intersections and/or complements, then the set is a Borel set; such sets include any set you can physically picture! Here are some famous fractal Borel subsets of R and R^2:17
Figure 1.12. On the left is a construction of the Cantor set, which is what’s left over after repeatedly erasing the open middle thirds starting from the unit interval [0, 1]. The Cantor set is a closed subset of R, hence is a Borel subset of R. In the middle is a Julia set and on the right is the Mandelbrot set, both of which are closed and hence are Borel subsets of R2 .
Figure 1.13 shows some Borel subsets of R^3.18 We'll take a close look at the Cantor

Figure 1.13. The nautilus shell (a cut away of it on left), the human head, and the Alexander horned sphere, each being a closed subset of R^3, are Borel sets.

17 Pictures taken from the wikipedia commons. For information on fractals see e.g. [193].
18 The pictures of the nautilus shell and Mona Lisa are taken from the wikipedia commons. The picture of the Alexander horned sphere is from [133, p. 176]. The Alexander horned sphere, named after James Waddell Alexander II (1888–1971), is homeomorphic to a 3-dimensional ball.
set in Section 4.5. Motivated by Theorem 1.13, for an arbitrary topological space X, we define the Borel sets B(X) as follows: B(X) is the σ-algebra of subsets of X generated by the open sets.

Proposition 1.14. If f : X → Y is a continuous map between topological spaces, then the inverse image of any Borel set in Y under f is a Borel set in X.

Proof: By the Principle of Appropriate Sets! By Proposition 1.9,

S = {A ⊆ Y ; f^{−1}(A) ∈ B(X)}

is a σ-algebra of subsets of Y. Let O be the collection of open subsets of Y. Since f is continuous, it follows that O ⊆ S. Thus, B(Y) ⊆ S since B(Y) is the smallest σ-algebra containing O. Now the statement B(Y) ⊆ S means that for every A ∈ B(Y), we have f^{−1}(A) ∈ B(X), which is what we wanted to show.
Since Borel sets are defined via topology, the following proposition shouldn't be a surprise.

Proposition 1.15. Borel sets of topological spaces are preserved under homeomorphisms.

Proof: Given a homeomorphism f : X → Y of topological spaces, we must show that f takes Borel sets of X to Borel sets of Y. To prove this, let g = f^{−1}. Then g : Y → X is continuous, so by Proposition 1.14, we know that for any Borel set A ⊆ X, the set g^{−1}(A) = f(A) is a Borel set in Y. This is exactly what we wanted to prove.
This proposition is not true for the so-called “Lebesgue measurable sets” that we’ll discuss later, see Problem 7 in Exercises 4.4. ◮ Exercises 1.4. 1. Let Ak be the σ-algebra generated by sets of the form (k) given in Proposition 1.10, where k = 1, 2, . . . , 8; thus, A1 = B n by definition of B n , A2 is the σ-algebra generated by the bounded open boxes, and so forth. Prove the following sequence of inclusions: A1 ⊇ A2 , A2 ⊇ A3 , · · · , A7 ⊇ A8 , A8 ⊇ A1 .
Using this fact, conclude that A1 = A2 = · · · = A8, which proves Proposition 1.10. You may assume that n = 1 throughout your proof, which will make a couple of the inclusions notationally simpler to prove.
2. In this problem we prove that any open set can be written as a countable union of (not necessarily pairwise disjoint) left-half open boxes. Here are some steps.
(i) First prove that the set of all nonempty left-half open boxes with rational vertices is countable. That is, prove that the set of all boxes of the form (a1, b1] × · · · × (an, bn], where a1, . . . , an, b1, . . . , bn ∈ Q with ai < bi for each i, is countable.
(ii) Let U ⊆ R^n be open and nonempty and denote by A the set of all nonempty left-half open boxes I with rational vertices such that I ⊆ U. Show that

U = ⋃_{I ∈ A} I.
3. Using a proof similar to the one in the previous problem, prove that any open set is a countable union of (not necessarily pairwise disjoint) open boxes. Prove the same for closed boxes.
4. Prove that the Borel subsets of R^n form the σ-algebra generated by the collection of all (i) dyadic cubes; (ii) left-half open boxes with rational end points; (iii) left-half open boxes with dyadic end points; (iv) closed sets; (v) compact sets.
5. Prove that any open set in R is a countable union of pairwise disjoint open intervals. Is this last statement true in R^n for n > 1 if we replace "open intervals" by "open boxes"? Suggestion: For each x in an open set U ⊆ R, prove that there is a largest open interval containing x, say Ix. Show that these intervals are pairwise disjoint, countable, and that U is the union of all such intervals.
6. The technique used to prove the Dyadic Cube Theorem is useful in establishing analogous statements concerning open sets. For instance, imitating the proof of the Dyadic Cube Theorem, show that each open set in R^n is a countable union of closed cubes with pairwise disjoint interiors.
7. Another common way to prove the Dyadic Cube Theorem is as follows. Let U1 ⊆ U be the union of all dyadic cubes of length 1/2 that are contained in U. Let U2 ⊆ U be the union of all dyadic cubes of length 1/2^2 contained in the set U \ U1. Proceeding by induction, assuming that Uj has been defined, let Uj+1 ⊆ U be the union of all dyadic cubes of length 1/2^{j+1} contained in the set U \ (U1 ∪ U2 ∪ · · · ∪ Uj). Let V be the union of all the Uj's. Prove that U = V.
8. Given x ∈ R^n, r ∈ R with r ≠ 0, and A ⊆ R^n, we denote the translation of A by x by A + x or x + A, and the multiple of A by r by rA; that is, we define x + A = A + x := {a + x ; a ∈ A},
rA := {ra ; a ∈ A}.
Prove that B^n is translation and scalar invariant. That is, prove that

B^n + x := {A + x ; A ∈ B^n}   and   rB^n := {rA ; A ∈ B^n}

both equal B^n.
9. (Cylinder sets in R^∞) In this problem we study σ-algebras in R^∞. Let R^∞ = ∏_{i=1}^∞ R, which consists of all infinite sequences (x1, x2, . . .) with xi ∈ R for all i.
(i) Let C1 denote the collection of cylinder sets of R^∞ consisting of subsets of R^∞ of the form A1 × A2 × · · · × An × R^∞ for some n ∈ N and Ai ∈ I^1 for each i. Let C2 denote the cylinder sets formed by requiring Ai ∈ B for each i. Prove that S(C1) = S(C2). We denote this common σ-algebra by B^∞. Suggestion: To prove that S(C2) ⊆ S(C1), you need to prove that any element of C2 belongs to S(C1). To prove this, first prove that for any i ∈ N and A ∈ B, we have R^{i−1} × A × R^∞ ∈ S(C1). To prove this, let A = {A ⊆ R ; R^{i−1} × A × R^∞ ∈ S(C1)}. Prove that B = S(I^1) ⊆ A.
(ii) Prove that each of the following sets belongs to B^∞:
(a) {x ∈ R^∞ ; sup{x1, x2, . . .} ≤ 1}.
(b) {x ∈ R^∞ ; xn ≥ 0 for each n and ∑_{n=1}^∞ xn ≤ 1}.
(c) {x ∈ R^∞ ; lim_{n→∞} xn exists}.
Suggestion: For (c) use the Cauchy criterion.
10. (Cf. [24]) For R^n, the σ-algebras generated by the open sets and by the compact sets are the same, namely the Borel sets (see Problem 4). For a general topological space, they need not be the same. A topological space is said to be σ-compact (also called σ-bounded) if it is equal to a countable union of compact subsets.
(a) Prove that for any σ-compact Hausdorff space, the Borel sets coincide with the σ-algebra generated by the compact sets. Suggestion: Recall that for a Hausdorff space, compact sets are closed.
(b) Suppose that the topology of X consists of just the sets ∅ and X. What are the Borel sets? What is the σ-algebra generated by the compact sets?
1.5. Additive set functions in classical probability

Statements such as "the probability that you get a head when you flip a coin is 1/2" or "the probability that a die shows 1 when you throw it is 1/6" are "facts" that we are all familiar with. This section is devoted to understanding the basics of probability theory and additive set functions. However, we begin by discussing

1.5.1. Arithmetic on R. In measure theory, ±∞ show up one way or another, since for instance, R^n and other unbounded intervals (should) have infinite length. Let ∞ (called infinity and also denoted by +∞) and −∞ (called minus infinity) be distinct objects that are not real numbers.19 We define the extended real numbers R as the set R ∪ {±∞}. We order R by taking the usual ordering on R and defining −∞ < ∞ and −∞ < a < ∞ for any real a. We use the standard notations for intervals; for example, we shall see that measures take values in the interval [0, ∞] = {x ∈ R ; 0 ≤ x ≤ ∞}, which equals [0, ∞) ∪ {∞}, the set of nonnegative real numbers and ∞. We define arithmetic operations on R as follows. For real numbers a, b ∈ R considered as extended real numbers, a ± b, a · b, and a/b (provided b ≠ 0) have their usual meanings. When infinities are involved, addition and subtraction are defined by

∞ ± a = ±a + ∞ := ∞   for −∞ < a ≤ ∞,
−∞ ± a = ±a − ∞ := −∞   for −∞ ≤ a < ∞.

Multiplication is defined by

(±∞) · a = a · (±∞) := ±∞   for 0 < a ≤ ∞,
(±∞) · a = a · (±∞) := ∓∞   for −∞ ≤ a < 0,

and

(±∞) · 0 = 0 · (±∞) := 0.

Finally, division is defined by

a/(±∞) := 0   for all a ∈ R.

Note that ∞ − ∞, −∞ + ∞, and division of infinities are not defined.
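The convention 0 · (±∞) = 0 is the opposite of IEEE floating-point arithmetic, where 0 * inf is NaN; any code that computes with extended-real measures has to special-case it. A minimal Python sketch (ours, not from the text; the function name is hypothetical):

```python
import math

def ext_mul(a, b):
    """Multiplication on the extended reals with the measure-theory
    convention 0 * (+-inf) = 0 (e.g. integrating the zero function
    over a set of infinite measure should give 0)."""
    if a == 0 or b == 0:
        return 0.0          # the convention; IEEE would give nan here
    return a * b

assert ext_mul(0, math.inf) == 0.0
assert ext_mul(2, math.inf) == math.inf
assert ext_mul(-3, math.inf) == -math.inf
assert math.isnan(0 * math.inf)   # plain Python/IEEE behavior, for contrast
```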
The definitions 0 · (±∞) = 0 and (±∞) · 0 = 0 might seem a bit strange; for instance, in elementary calculus we were taught that these were "indeterminate forms." However, these definitions are incredibly useful in measure theory.

19 Here's one natural way to define infinity. Let us call a sequence {an} of real numbers right unbounded if given any M > 0 we have an > M for all n sufficiently large. Define ∞ as the collection of all right unbounded sequences. With this definition, given any sequence {an} we write an → ∞ if {an} ∈ ∞. We can similarly define −∞ as the collection of all "left unbounded" sequences.
For example, consider the sample space S = {(1, 1), . . . , (6, 6)} of throwing two dice, which consists of a set of 36 possible outcomes. Assume that the dice are "fair," which intuitively means when we throw them, each pair (m, n) of numbers between 1 and 6 has an equal chance of landing up. Then we would all agree that the probability that you throw a double six is 1/36. We would all agree that the probability that you throw a (1, 1) or a (6, 6) is 2/36. More generally, given an event A ⊆ S consisting of n possible outcomes, we would all agree that the probability of at least one outcome in A occurring should be n/36. With this example in mind, given any finite sample space X representing the outcomes of some type of random phenomenon and given any event A ⊆ X, the classic definition of the probability of an outcome in A occurring is

(1.11)   probability of A := #A/#X = (number of elements of A)/(total number of possible outcomes).
If probabilities are assigned in this way, the apparatus (coin, dice, . . .) used in the experiment is said to be fair and each outcome of the experiment is said to occur equally likely or with equal probability. One way to think of the formula (1.11) is as "probability" = "volume", where the "volume" of an event A is the proportion of all possible outcomes that lie in the event A. Since time immemorial, (1.11) has been used as the definition of mathematical probability. For example, Girolamo Cardano (1501–1576) uses the formula (1.11) to compute various probabilities in Liber de Ludo Aleae, the first text on the calculus of probabilities; for instance, in Chapter 14 of his book he makes a (correct) table of #A for various events where the sample space is S = {(1, 1), . . . , (6, 6)}, the sample space for throwing two dice. In Abraham de Moivre's (1667–1754) classic The Doctrine of Chances, first published in 1718, he says [71, p. 1]:

1. The Probability of an Event is greater or less, according to the number of Chances by which it may happen, compared with the whole number of Chances by which it may either happen or fail.
2. Wherefore, if we constitute a Fraction whereof the Numerator be the number of Chances whereby an Event may happen, and the Denominator the number of all the Chances whereby it may either happen or fail, that Fraction will be a proper designation of the Probability of happening. Thus if an Event has 3 Chances to happen, and 2 to fail, the Fraction 3/5 will fitly represent the Probability of its happening, and may be taken to be the measure of it.
In any case, for our dice example, we denote the probability of an outcome in A occurring by µ(A) and we define it by

µ(A) := (number of elements of A)/(total number of possibilities) = #A/36,
where #A = number of elements of A. Thus, µ assigns to every subset A ⊆ S a number µ(A) ∈ [0, 1], in other words, µ is a function µ : P(S) → [0, 1],
where P(S) is the power set of S. This function has several easily proved properties such as µ(∅) = 0 (which is obvious since #∅ = 0) and

µ(S) = #S/36 = 36/36 = 1,

and also if A, B ⊆ S are disjoint, then #(A ∪ B) = #A + #B, so

µ(A ∪ B) = #(A ∪ B)/36 = (#A + #B)/36 = #A/36 + #B/36 = µ(A) + µ(B).

An induction argument shows that given any finite number of sets A1, A2, . . . , AN ∈ P(S) that are pairwise disjoint, µ(⋃_{n=1}^N An) = ∑_{n=1}^N µ(An). This discussion shows that the following definition is worthy of study: Given a semiring I, a function

µ : I → [0, ∞]

is called a set function, and µ is said to be additive, or finitely additive, if it has the following two properties:
(1) µ(∅) = 0.
(2) If A ∈ I and A = ⋃_{n=1}^N An where A1, . . . , AN ∈ I are pairwise disjoint, then

(1.12)   µ(A) = ∑_{n=1}^N µ(An).
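The two-dice set function is easy to experiment with. Here is a short Python sketch (our illustration, not from the text) of µ(A) = #A/36 on the sample space S, together with a check of the additivity property (2) on a pair of disjoint events:

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # the 36 outcomes (m, n)

def mu(A):
    """Classical probability: mu(A) = #A / #S, as an exact Fraction."""
    return Fraction(len(A), len(S))

doubles = {(m, n) for (m, n) in S if m == n}
assert mu(doubles) == Fraction(6, 36)

# additivity on the disjoint pieces {(6,6)} and the remaining doubles:
A, B = {(6, 6)}, doubles - {(6, 6)}
assert mu(A | B) == mu(A) + mu(B)
```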
Of course, since a ring or σ-algebra is also a semiring, this definition works in the case that I is a ring or σ-algebra. The set function µ is called a (finitely additive) probability set function20 if µ has range in [0, 1] (so that µ : I → [0, 1]) and, in addition to properties (1) and (2), if X denotes the universal set, we have
(3) X ∈ I and µ(X) = 1.
This "abstract" description of a probability set function goes back a long time:

1.5.3. Hilbert's sixth problem and Kolmogorov's axioms. At the 1900 International Congress of Mathematicians held in Paris, David Hilbert (1862–1943) gave his famous list of 23 open problems in mathematics, now called "Hilbert problems", which have greatly influenced the direction of mathematical research since that time. Here's Hilbert's sixth problem [128]:
6. Mathematical treatment of the axioms of physics The investigations on the foundations of geometry suggest the problem: To treat in the same manner, by means of axioms, those physical sciences in which mathematics plays an important part; in the first rank are the theory of probabilities and mechanics.
In 1933, Andrey Nikolaevich Kolmogorov (1903–1987) published his classic book Grundbegriffe der Wahrscheinlichkeitsrechnung [151], an English translation of which is at the website http://www.kolmogorov.com/.

20 Usually called a finitely additive probability measure, but we shall use the word measure strictly for countably additive set functions.
In this book, he completely solves the axiomatization-of-probability component of Hilbert's sixth problem. Let X be a set and let R be a finite collection of subsets of X referred to as a collection of observable or plausible events; we shall consider the case of infinitely many observable events in Section ?. Here are Kolmogorov's axioms [151, p. 2] in the finite case:
I. R is a ring21 of subsets of a set X.
II. R contains X.
III. To each set A in R is assigned a nonnegative real number µ(A). This number µ(A) is called the probability of the event A.
IV. µ(X) = 1.
V. If A and B have no element in common, then µ(A ∪ B) = µ(A) + µ(B).
The triple (X, R, µ) is called a field of probability. Of course, these axioms just mean that µ : R → [0, 1] is a "probability set function," which we defined a few paragraphs above. We remark that if R is a ring of subsets of a set X and X ∈ R, then the ring R is automatically closed under complements.22 Indeed, if we happen to know that A ∈ R, then since X ∈ R and R is closed under differences, we conclude that A^c = X \ A ∈ R. Moreover, if µ : R → [0, 1] is a probability set function, then X = A ∪ A^c implies that 1 = µ(A) + µ(A^c) since µ(X) = 1. Thus, when dealing with a probability set function, we have the useful result:
A ∈ R  =⇒  A^c ∈ R  and  µ(A^c) = 1 − µ(A).
Most of the probability set functions you have seen in lower-level probability courses are of the following type: Proposition 1.16. Given any finite sample space X describing a fair experiment and using the classic definition (1.11) to define a set function µ : P(X) → [0, 1],
we obtain a finitely additive probability set function. The proof of this result proceeds exactly as with the dice example above. Now consider the following proposition describing a not-necessarily fair experiment whose sample space is countable.
21In Kolmogorov’s book, he uses the term field for what we call a ring. 22A ring closed under complements is called an algebra or sometimes a field.
Proposition 1.17. Let X be a nonempty countable (finite or infinite) set. Then a function µ : P(X) → [0, ∞] is a finitely additive set function if and only if there is a function m : X → [0, ∞] such that for all A ⊆ X, we have

µ(A) = ∑_{x∈A} m(x),

where the sum is only over those x's such that x ∈ A. If the sum ∑_{x∈A} m(x) diverges, it equals, by definition, ∞. The set function µ is a finitely additive probability set function if and only if ∑_{x∈X} m(x) = 1, in which case the function m is called a probability mass function.
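Proposition 1.17's construction is easy to mirror in code. The following Python sketch (ours, not the book's; the loaded-die masses are hypothetical) builds µ from a mass function m and checks finite additivity:

```python
from fractions import Fraction

def set_function_from_masses(m):
    """Given a mass function m : X -> [0, infinity] (here a dict),
    return the induced set function mu(A) = sum of m(x) over x in A,
    as in Proposition 1.17."""
    def mu(A):
        return sum((m[x] for x in A), Fraction(0))
    return mu

# a loaded die: 6 comes up with probability 1/2 (hypothetical masses)
m = {x: Fraction(1, 10) for x in range(1, 6)}
m[6] = Fraction(1, 2)
mu = set_function_from_masses(m)

assert sum(m.values()) == 1                    # a probability mass function
assert mu({2, 4, 6}) == Fraction(7, 10)        # 1/10 + 1/10 + 1/2
assert mu({1, 2}) + mu({3}) == mu({1, 2, 3})   # finite additivity
```

Exact `Fraction` arithmetic keeps the additivity checks free of floating-point round-off.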
We remark that Proposition 1.17 holds verbatim if we replace "finitely additive" with "countably additive" everywhere in the proposition (see Problem 2 in Exercises 3.2); we shall learn more about countably additive measures in the next chapter. Now to prove the "only if", let µ : P(X) → [0, ∞] be a finitely additive set function and then define m(x) := µ({x}) for all x ∈ X. We often drop the parentheses, so we can write this as m(x) = µ{x}. Now given A ⊆ X we can write A = ⋃_{x∈A} {x}, so by additivity,

µ(A) = ∑_{x∈A} µ{x} = ∑_{x∈A} m(x).
This proves the "only if"; we shall leave the "if" statement to you (see Problem 2). If X is finite and we assign the same "masses" to each point of X, that is,

m(x) = 1/#X   for all x ∈ X,
then the µ defined as in Proposition 1.17 is exactly the classical probability set function (1.11) (why?). However, there are situations where m is not constant, one case of which we'll see as we now return to the problem of points.

1.5.4. The problem of points. Here's an example of the problem of points related to America's pastime:
Commissioner’s Trophy
Problem of points: Baseball teams A and B (of equal ability) are playing in the world series. The first one to win four games wins the series and wins D dollars. Let's say that after four rounds, team A has won three games and team B has won one game, but because of salary disputes (what else?) the players went on strike and for the first time in history the world series was canceled. It was decided to sell the Commissioner's Trophy23 (given to the winner of the series) and give the money to the teams. How should the money be fairly divided?
23The picture of the Commissioner’s Trophy was taken from the Wikipedia commons.
Event   Round 5   Round 6   Round 7
1       A         A         A
2       A         A         B
3       A         B         A
4       A         B         B
5       B         A         A
6       B         A         B
7       B         B         A
8       B         B         B
Table 2. All possible outcomes of three hypothetical future matches between teams A and B. Here, “A” represents that team A wins and “B” that team B wins.
Pascal and Fermat’s ingenious solution to this problem is called the method of combinations. Their observation was that we can consider the teams playing a “best of seven” game where the teams play seven total games and at the end of the seven rounds count who has won the most rounds. In other words, the “first one to win four” game is completely equivalent to winning the “best of seven” game. Since these two situations are completely equivalent, we can divide up the prize money assuming the teams were playing the “best of seven” game instead of the “first one to win four” game; in the words of Pascal [262, p. 557], Therefore, since these two conditions are equal and indifferent, the division should be alike for each.
Now, thinking in terms of the best of seven, we have already played four rounds, so we just have three more to go. Table 2 shows the sample space of all the possible outcomes in the hypothetical rounds 5–7 together with the probability of each scenario. Thus, we define X = {(A, A, A), (A, A, B), . . . , (B, B, B)}, a set consisting of eight elements, and we define µ : P(X) → [0, 1] using the classical definition (1.11); recall that we assumed that the teams were of equal ability which is why we assume that each hypothetical outcome is equally likely. Let A be the event that Team A wins the series and let B be the event that Team B wins. Then recalling that team A already has three wins, they would win everything if any one of the top seven outcomes in Table 2 occurred in the hypothetical 5–7 rounds; the only way that team B would win is if the last outcome in Table 2 occurred. Thus, A = {(A, A, A), (A, A, B), . . . , (B, B, A)}
and B = {(B, B, B)}.
Hence,

µ(A) = the probability team A wins = #A/#X = 7/8,

and

µ(B) = the probability team B wins = #B/#X = 1/8.
For this reason, it would seem like the fairest division of the prize money would be:

Team A gets 7D/8 dollars and Team B gets D/8 dollars.

Now that we have seen the method of combinations, you should be able to read Pascal's letter of Monday, August 24, 1654 to Fermat, where he illustrates the above logic [262, p. 555]:
This is the method of procedure when there are two players: If two players, playing in several throws, find themselves in such a state that the first lacks two points and the second three of gaining the stake, you say it is necessary to see in how many points the game will be absolutely decided. It is convenient to suppose that this will be in four points, from which you conclude that it is necessary to see how many ways the four points may be distributed between the two players and to see how many combinations there are to make the first win and how many to make the second win, and to divide the stake according to that proportion. I could scarcely understand this reasoning if I had not known it myself before; but you also have written it in your discussion. Then to see how many ways four points may be distributed between two players, it is necessary to imagine that they play with dice with two faces (since there are but two players), as heads and tails, and that they throw four of these dice (because they play in four throws). Now it is necessary to see how many ways these dice may fall. That is easy to calculate. There can be sixteen, which is the second power of four; that is to say, the square. Now imagine that one of the faces is marked a, favorable to the first player. And suppose the other is marked b, favorable to the second. Then these four dice can fall according to one of these sixteen arrangements:

a a a a a a a a b b b b b b b b
a a a a b b b b a a a a b b b b
a a b b a a b b a a b b a a b b
a b a b a b a b a b a b a b a b
1 1 1 1 1 1 1 2 1 1 1 2 1 2 2 2

and, because the first player lacks two points, all the arrangements that have two a's make him win. There are therefore 11 of these for him. And because the second lacks three points, all the arrangements that have three b's make him win. There are 5 of these. Therefore it is necessary that they divide the wager as 11 is to 5.
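Pascal's count is easy to verify by brute force. The following Python sketch (ours, not from the text) enumerates the 16 equally likely arrangements of four two-faced throws and counts how many give each player his needed points:

```python
from itertools import product

# all 4-throw arrangements of the two faces: a (first player), b (second)
arrangements = list(product("ab", repeat=4))

first_wins = [s for s in arrangements if s.count("a") >= 2]   # lacks two points
second_wins = [s for s in arrangements if s.count("b") >= 3]  # lacks three points

assert len(arrangements) == 16
assert (len(first_wins), len(second_wins)) == (11, 5)  # divide the wager 11 : 5
```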
Please notice Pascal’s words: “I could scarcely understand this reasoning if I had not known it myself before.” This shows that mathematics is not easy, even to one of the greatest mathematicians who ever lived! The method of combinations has also confused many great mathematicians such as Gilles Personne de Roberval (1602–1675), one of the leading mathematicians in Paris at the time of Pascal and Fermat. He objected to the method of combinations because the true game (described in Pascal’s letter above) would have ended as soon as the winner was declared and not all the extra four games needed to be played. Roberval’s comments to Pascal were [262, p. 556]: That it is wrong to base the method of division on the supposition that they are playing in four throws seeing that when one lacks two points and the other three, there is no necessity that they play four
throws since it may happen that they play but two or three, or in truth perhaps four.
Roberval might have been happy if Pascal approached the problem as follows. Recall that the first player in Pascal's letter lacks two points and the second three points, so the real ending scenarios could only have been the following ten:

aa  aba  abba  abbb  baa  baba  babb  bbaa  bbab  bbb
1   1    1     2     1    1     2     1     2     2
where the bottom row shows who wins. Let us take our sample space to be these ten scenarios: X = {aa, aba, abba, abbb, . . . , bbb}. For this sample space it is intuitively clear that the individual outcomes are not equally likely; e.g. aa and aba are not equally likely to occur. Indeed, since the players are equally likely to win a round, in two rounds the probability that aa occurs should be 1/4. In three rounds the probability that aba occurs should be 1/8. Similarly, in four rounds the probability that abba occurs should be 1/16. Continuing in this manner, we see that we should assign (that is, define) probabilities as follows:

µ{aa} = 1/4,   µ{aba} = µ{baa} = µ{bbb} = 1/8,

µ{abba} = µ{abbb} = µ{baba} = µ{babb} = µ{bbaa} = µ{bbab} = 1/16.

Since

1/4 + 1/8 + 1/8 + 1/8 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 = 1,

it follows from Proposition 1.17 that µ induces a finitely additive probability set function µ : P(X) → [0, 1]. Now the event that a wins is {aa, aba, abba, baa, baba, bbaa} and the event that b wins is {abbb, babb, bbab, bbb}; hence the probability that a wins is

µ{aa, aba, abba, baa, baba, bbaa} = 1/4 + 1/8 + 1/16 + 1/8 + 1/16 + 1/16 = 11/16

and the probability that b wins is

µ{abbb, babb, bbab, bbb} = 1/16 + 1/16 + 1/16 + 1/8 = 5/16.
These are exactly the same division rules (11/16 to player a and 5/16 to player b) as given by the method of combinations! We can do a similar argument with the world series example, where the real ending scenarios are only the following four:
Event   Round 5   Round 6   Round 7   Probability
  1        A                             1/2
  2        B         A                   1/4
  3        B         B         A         1/8
  4        B         B         B         1/8
In terms of our mathematical framework, we put X = {A, BA, BBA, BBB} and define

µ{A} = 1/2,  µ{BA} = 1/4,  µ{BBA} = 1/8,  µ{BBB} = 1/8;

then via Proposition 1.17, we get a probability set function µ : P(X) → [0, 1]. Now, A wins the series if any of the first three outcomes in the table occur and B wins if the last one occurs, so the probability team A wins is

µ{A, BA, BBA} = 1/2 + 1/4 + 1/8 = 7/8,

and the probability team B wins is

µ{BBB} = 1/8.

These are exactly the same numbers we had gotten before using the method of combinations! Now, why did we get the same answers using the method of combinations and the method that might have made Roberval happy? Is it a coincidence? No, of course, and we leave you to think about why we got the same answer! By the way, if you're interested, there is yet another way to solve the problem of points through the concept of "expected winnings," a notion due once again to our friend Pascal, who explained it in his July 29, 1654 letter to Fermat [262, p. 547]. You can read about "expected winnings" in one of the first (now freely available) published books (1657) on probability, Libellus de Ratiociniis in Ludo Aleae [137] by Christiaan Huygens (1629–1695), who learned "expectations" from Pascal.
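The scenario-by-scenario method used above is easy to automate. Here is a sketch (our own illustration; the function `ending_scenarios` is ours, not the book's) that enumerates the real ending scenarios recursively and recovers both divisions:

```python
from fractions import Fraction

def ending_scenarios(need_a, need_b):
    """Enumerate the real ending scenarios of a fair game in which player a
    still needs need_a points and player b needs need_b points, returning a
    dict scenario -> probability (each extra round halves the probability)."""
    if need_a == 0 or need_b == 0:
        return {"": Fraction(1)}
    out = {}
    for who, rest in (("a", (need_a - 1, need_b)), ("b", (need_a, need_b - 1))):
        for tail, p in ending_scenarios(*rest).items():
            out[who + tail] = p / 2
    return out

# Pascal's game: a lacks two points, b lacks three; ten scenarios in all.
mu = ending_scenarios(2, 3)
p_a = sum(p for s, p in mu.items() if s.count("a") == 2)  # a reached 2 points
print(p_a, 1 - p_a)  # 11/16 5/16

# World series example ('a' playing the role of team A): A needs 1 win, B needs 3.
p_A = sum(p for s, p in ending_scenarios(1, 3).items() if s.count("a") == 1)
print(p_A, 1 - p_A)  # 7/8 1/8
```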
1.5.5. The dice problem. Recall that the dice problem is the following: How many times must you throw two dice in order to have a better than 50–50 chance of getting two sixes? Let n ∈ N, let S = {(i, j) ; i, j = 1, 2, . . . , 6} (the sample space of throwing two dice once), and let

X = S × S × · · · × S = S^n  (n factors);
then X is a sample space for throwing two dice n times. We know (via the counting rules we review at the end of this section) that #X = 36^n, since S has 36 elements. We define the probability set function µ : P(X) → [0, 1] by assuming equally likely outcomes. In order to answer the dice problem, let A ⊆ X be the event that we get two sixes in at least one of the n throws; let us find µ(A). Since X = A ∪ A^c and µ is finitely additive, we have

1 = µ(X) = µ(A) + µ(A^c)   =⇒   µ(A) = 1 − µ(A^c),

so we just have to find µ(A^c), which turns out to be quite easy. Notice that A^c is the event that we never throw two sixes. Hence,

A^c = T × T × · · · × T  (n factors),
where T = S \ {(6, 6)}. Since #T = 35, it follows that #A^c = 35^n and hence

µ(A^c) = #A^c / #X = (35/36)^n.

Thus,

(1.13)   (Throwing the dice n times, the probability of throwing two sixes at least once) = 1 − (35/36)^n.

We can now solve the dice problem. We want to know what n must be in order that if we throw the dice n times we have a better than 1/2 probability of throwing two sixes; that is, we want

1 − (35/36)^n ≥ 1/2,   which holds   ⇐⇒   n ≥ log 2 / log(36/35) = 24.605 . . . .

This was exactly Pascal's solution to Méré's question: We need at least 25 throws to have a better than 50–50 chance of getting two sixes. Just for fun, let us give another method of calculating (1.13). Let A_k ⊆ X be the event that we throw two sixes on the k-th throw but not on any throw before the k-th. Thus,

A_k = T × · · · × T (k − 1 factors) × {(6, 6)} × S × · · · × S (n − k factors).

Then

µ(A_k) = #A_k / #X = (35^{k−1} · 1 · 36^{n−k}) / 36^n = (1/36)(35/36)^{k−1}.

One can check that A = A_1 ∪ · · · ∪ A_n and that A_1, . . . , A_n are pairwise disjoint, so by additivity we have

µ(A) = Σ_{k=1}^{n} µ(A_k) = (1/36) Σ_{k=1}^{n} (35/36)^{k−1} = (1/36) · (1 − (35/36)^n) / (1 − 35/36) = 1 − (35/36)^n,
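Both computations of (1.13) are easy to check numerically; here is a quick sketch of our own, for illustration:

```python
from math import ceil, log

# Probability of at least one double six in n throws, computed two ways:
# via the complement as in (1.13), and via the geometric sum over the A_k.
def p_direct(n):
    return 1 - (35 / 36) ** n

def p_sum(n):
    return sum((1 / 36) * (35 / 36) ** (k - 1) for k in range(1, n + 1))

assert abs(p_direct(24) - p_sum(24)) < 1e-12
print(p_direct(24))  # ~0.4914, still under 1/2
print(p_direct(25))  # ~0.5055, now over 1/2

# Pascal's threshold:
n_min = ceil(log(2) / log(36 / 35))
print(n_min)  # 25
```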
which agrees with (1.13). We remark that Méré believed the answer was 24 and not 25 throws to have a better than 50–50 chance of getting two sixes. The reason was the following gambler's rule common in those days. Let T be the total number of outcomes of an experiment (each outcome equally likely) and let t be the number of trials of the experiment needed to have a better than 50–50 chance of getting a specific outcome. For example, in the experiment of throwing a single die, T = 6 (six sides on a die) and for the specific outcome of throwing a six, it was well known, even from the times of Cardano in the 1500s, that t = 4. (Indeed, throwing the die n times, the probability of throwing a six at least once is 1 − (5/6)^n; when n = 3, this number is < 1/2 while for n = 4, the number is > 1/2.) The gambler's rule was that the ratio t/T is the same for all experiments, as long as T is sufficiently large. Thus, as we saw in the case of a single die, we have T = 6 and t = 4, so t/T = 4/6 = 2/3. Hence, we should have t/T = 2/3 = .66666 . . . for any experiment as long as T is large enough. In the case of two dice, T = 36 (six sides on each die) so we conclude that t/36 = 2/3, or t = 36 · 2/3 = 24, which was Méré's (wrong) solution. In fact, it was later shown by Abraham de Moivre (1667–1754) in his book The Doctrine of
Chances [71, p. 36] that t/T ≈ log 2 = 0.6931 . . . for T large. Had Méré known this, he would have gotten the correct answer: t ≈ 36 · 0.6931 = 24.95 . . ., that is, t = 25. See Problem 3 for a proof of De Moivre's theorem.

Appendix/Review on the basics of counting. In order to use the classical definition of probability one needs to be able to count the number of elements of a set. Thus, before going any further, it might be helpful to quickly review some elementary ideas on counting. Let X1, . . . , Xm be finite sets and consider the product

X = X1 × X2 × · · · × Xm = {(x1, x2, . . . , xm) ; xi ∈ Xi for each i}.

If #Xi = ni, then we have

#X = n1 · n2 · · · nm.
To see this, observe that for an m-tuple (x1, x2, . . . , xm) ∈ X, there are n1 choices for the first component x1. For each of the n1 choices for x1, there are n2 choices for x2. Thus, there are n1 · n2 ways of filling the first two components x1, x2. Continuing in this manner with x3, . . . , xm, we get our formula for #X. In particular, let S be a set having n elements and put

X = S × S × · · · × S = S^m,
where there are m factors of S; then #X = n^m. In probability jargon, the set S^m is called the set of all m-samples taken from S with replacement. This is because if S represents the set of objects in an urn,24 then you can think of an m-tuple (x1, x2, . . . , xm) ∈ S^m as a sequence of m objects that you take from the urn, one at a time, making sure to return each object to the urn before taking the next one. Now consider the following subset of S^m:

P = {(x1, . . . , xm) ; xi ∈ S , xi ≠ xj for i ≠ j}.

How many elements does P have? Observe that there are n choices for the first component x1 of an element (x1, x2, . . . , xm) ∈ P. For each of the n choices for x1, there are n − 1 choices for x2 since x2 is not allowed to equal x1. Thus, there are n(n − 1) ways of filling the first two components x1, x2. Continuing in this manner with x3, . . . , xm, the number of elements of P equals

#P = n(n − 1)(n − 2) · · · (n − m + 1) = n! / (n − m)!.
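These counting formulas are easy to confirm by brute force for small sets; in the quick sketch below (our own), the values n = 5, m = 3 are just an arbitrary small example:

```python
from itertools import product, permutations
from math import factorial

n, m = 5, 3  # arbitrary small example
S = range(n)

# m-samples taken with replacement: n^m of them.
assert len(list(product(S, repeat=m))) == n ** m
# m-permutations (samples taken without replacement): n!/(n-m)! of them.
assert len(list(permutations(S, m))) == factorial(n) // factorial(n - m)

print(n ** m, factorial(n) // factorial(n - m))  # 125 60
```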
In probability jargon, the set P is called the set of all m-permutations of S or the set of all m-samples taken from S without replacement. This is because if S represents the set of objects in an urn, then you can think of an m-tuple (x1 , x2 , . . . , xm ) ∈ P as the sequence of m objects that you take from the urn, one at a time, without replacing the object you took before taking the next one. Finally, let 1 ≤ m ≤ n and consider the set C = {A ; A ⊆ S , #A = m};
24 From the 1828 Webster dictionary: An urn is a vessel of various forms, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, for ornamental uses, for preserving the ashes of the dead after cremation, and anciently for holding lots to be drawn. Image taken from Kantner's 1887 "Book of Objects," page 12.
thus, C consists of all subsets of S having m elements. The number of elements of C is called the number of combinations of n things taken m at a time. An element of C is a subset A = {x1, x2, . . . , xm} ⊆ S consisting of m distinct elements of S (so that xi ≠ xj for i ≠ j). Notice that there are m! different ways to form an m-tuple from the elements of the set {x1, x2, . . . , xm} since there are m choices for the first component of the m-tuple, m − 1 choices for the second component of the m-tuple, etc. Each such m-tuple will give an element of the set P defined above, and moreover, each element of the set P can be obtained in this way. Thus, to each element of C there correspond m! elements of P and hence, after some thought, we conclude that

#C · m! = #P.

In view of the formula for #P that we found above, we see that

#C = n! / (m! (n − m)!) =: \binom{n}{m},

which is just the familiar binomial coefficient. If S represents the set of objects in an urn, then you can think of an element {x1, x2, . . . , xm} ∈ C as m objects that you scoop from the urn all at once. You can also think of {x1, x2, . . . , xm} ∈ C as the set of m objects that you take from the urn, one at a time, without replacing the object you took before taking the next one. Since the elements of a set are not ordered, you don't care about the order in which the objects were taken from the urn, you only care about the objects you got.

◮ Exercises 1.5.
1. Let I be a semiring of subsets of a set X and let µ : I → [0, ∞] be a finitely additive set function. Prove that if A, B ∈ I with A ⊆ B, then µ(A) ≤ µ(B).
2. Prove the "only if" part of Proposition 1.17 using characteristic functions as follows.
(i) Recall that the characteristic function of a set A is defined by χA(x) = 1 if x ∈ A and χA(x) = 0 if x ∉ A. With X a countable set and µ defined as in the statement of Proposition 1.17 for some mass function m, show that given any set A ⊆ X, we have

µ(A) = Σ_{x∈X} m(x) χA(x).
(ii) Show that if A = ⋃_{n=1}^{N} An where the An's are pairwise disjoint, then χA = Σ_{n=1}^{N} χ_{An}.
(iii) Using (i) and (ii) prove Proposition 1.17.
3. (De Moivre's Theorem) Let T be the total number of outcomes in an experiment (each outcome equally likely) and let t be the number of trials of the experiment needed to have a better than 50–50 chance of getting a specific outcome. Show that

t = ⌈− log 2 / log(1 − 1/T)⌉,

where for any r ∈ R, ⌈r⌉ is the smallest integer ≥ r. Prove that lim_{T→∞} t/T = ln 2.
4. (The birthday problem) Here is a question similar to the dice problem: How many students must be in your class in order to have a better than 50–50 chance of finding two people with the same birthday? Assume that every year has 365 days. Also assume that the people are "randomly chosen," which means that all days of the year are equally likely to be the birthday of a person.
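If you want to check the birthday problem numerically before proving anything, here is a quick sketch (our own; it simply evaluates the product formula that part (iii) below asks you to derive):

```python
from math import prod

# P(n): probability that at least two of n randomly chosen people share a
# birthday, assuming 365-day years and equally likely birthdays.
def P(n):
    return 1 - prod((365 - i) / 365 for i in range(n))

print(P(22))  # ~0.4757, under 1/2
print(P(23))  # ~0.5073, over 1/2, so 23 students suffice
```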
(i) Let n ∈ N and give an explicit description of a sample space X representing the birthdays of n randomly chosen people, and define the probability set function µ : P(X) → [0, 1].
(ii) Explicitly define the subset A ⊆ X representing the event that n randomly chosen people all have different birthdays.
(iii) Determine µ(A), the probability that n randomly chosen people have different birthdays. From this, find a formula for the probability Pn that at least two people in n randomly chosen people have the same birthday.
(iv) Show that Pn < 1/2 for n < 23 and Pn > 1/2 for n ≥ 23 and conclude that we need a classroom with at least 23 students.
5. (Another birthday problem) We now do the last problem with "classroom" replaced by "any group of people in history"! We shall assume the Gregorian calendar (the calendar established October 15, 1582 and currently in use in most countries) has always been used. In this calendar, a leap birthday (on Feb. 29) occurs 97 times every 400 years. Thus, in 400 years, there are a total of 400 · 365 + 97 = 146097 days. A regular (non-leap) birthday occurs 400 times in these 146097 days while a leap birthday occurs only 97 times. Thus, we define the chance of a randomly chosen person's birthday to be a = 400/146097 or b = 97/146097 depending on the day being a regular day or the leap day. Let n ∈ N, let S = {1, 2, . . . , 366} (366 being the leap day), and let X = S^n, which represents the sample space of the possible birthdays for n people.
(i) Assign a probability to a singleton {(x1, . . . , xn)} consisting of n birthdays and define a probability set function µ : P(X) → [0, 1].
(ii) Explicitly define the subset A ⊆ X representing the event that n randomly chosen people all have different birthdays and find µ(A). Suggestion: Write A as A = A0 ∪ A1 ∪ · · · ∪ An where A0 ⊆ A is the subset of A where none of the people have birthday 366 while for Ai ⊆ A, i > 0, the i-th person has birthday 366.
(iii) Determine µ(A), the probability that n randomly chosen people have different birthdays. Then find a formula for the probability Pn that at least two people in n randomly chosen people have the same birthday. Using a computer, check that we need a group of at least 23 people to have a better than 50–50 chance of finding two people with the same birthday.
6. (Yet Another birthday problem) In a group of n people, let k of them be women. What is the probability that at least one pair have the same birthday, with at least one member of such a pair a woman? Assume 365-day years and that all days of the year are equally likely to be the birthday of a person.
(i) Give an explicit description of a sample space X and the subset A ⊆ X representing the event that in a group of n people, at least one pair have the same birthday with at least one member of such a pair a woman.
(ii) Determine the probability in question.
7. The first of two players (of equal ability) to win three matches wins $D dollars. Player A wins the first match, but was injured and so cannot continue. How should we fairly divide the prize money?
8. The problem of points has been around for a long time. Perhaps the first published version is by Fra Luca Pacioli in Summa de Arithmetica [218] in 1494 (translation found in [211]): "A team plays ball such that a total of 60 points is required to win the game, and each inning counts 10 points. The stakes are 10 ducats. By some incident they cannot finish the game and one side has 50 points and the other 20. One wants to know what share of the prize money belongs to each side. In this case I have found that opinions differ from one to another, but all seem to me insufficient in their arguments, but I shall state the truth and give the correct way." Solve this problem assuming each team is equally likely to win an inning.
9. This problem is quoted from [285, p. 10]: Suppose each player to have staked a sum of money denoted by A; let the number of points in the game be n + 1, and suppose the
Bet              Probability                        Odds of winning     Payoff
Extrait simple   \binom{89}{4} / \binom{90}{5}      1 in 18             15
Ambe simple      \binom{88}{3} / \binom{90}{5}      1 in 400.5          270
Terne            \binom{87}{2} / \binom{90}{5}      1 in 11,748         5,500
Quaterne         \binom{86}{1} / \binom{90}{5}      1 in 511,038        75,000
Quine            \binom{85}{0} / \binom{90}{5}      1 in 43,949,268     1,000,000

Table 3. Odds of winning "Casanova's Lottery" (La Loterie de France).
first player to have gained n points and the second player none. If the players agree to separate without playing any more, prove that the first player is entitled to 2A − A/2^n, assuming each player is equally likely to gain a point.
10. (Casanova's Lottery) In 1756 a meeting was held to discuss strategies on how to raise funds to help finance the French military school École Militaire. Giacomo Casanova (1725–1798) was invited to the meeting and proposed a lottery following a similar lottery introduced in the 1600s in the city of Genoa located in northern Italy (you can read about Casanova's lottery in [274]). Here's how the lottery is played. Tickets with the numbers 1, 2, . . . , 90 were sold to the people. A person could choose a single number, or two numbers, . . ., up to five numbers. At a public drawing, five numbers were drawn from the ninety. If the person chose a single number (called an "Extrait simple") and it matched any one of the five numbers, he won 15 times the cost of the ticket; if he chose two numbers ("Ambe simple") and they matched any two of the five numbers, he won 270 times the cost of the ticket; similarly, if he won on choosing three ("Terne"), four ("Quaterne"), or five ("Quine") numbers he won respectively 5,500, 75,000, and 1,000,000 times the cost of the ticket (see Table 3). In this problem we study Genoese type lotteries. Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn, one after the other, from a rotating cage. (E.g. n = 90 and m = 5 in Casanova's lottery.) Assume that each token is equally likely to be drawn.
(i) Explain how

X1 = {(x1, . . . , xm) ; xi ∈ {1, . . . , n} , xi ≠ xj for i ≠ j}

and

X2 = {x ; x ⊆ {1, . . . , n} , #x = m}

represent two different sample spaces for the phenomenon of randomly choosing m tokens from n tokens.
(ii) Let ℓ ≤ m and let a1, a2, . . . , aℓ ∈ {1, 2, . . . , n} be distinct numbers on a ticket a sucker has chosen.
What is the event that these ℓ numbers match any ℓ of the numbers on m randomly drawn tokens from the cage? Describe the event as subsets A1 ⊆ X1 and A2 ⊆ X2.
(iii) Find #X1, #X2, #A1, and #A2 and show that

#A1 / #X1 = #A2 / #X2,  both of which equal  \binom{n−ℓ}{m−ℓ} / \binom{n}{m}.
Conclude that the probability that the ℓ numbers a1, a2, . . . , aℓ match any ℓ of the numbers on m randomly drawn tokens from the cage equals \binom{n−ℓ}{m−ℓ} / \binom{n}{m}. Verify that Table 3 gives the odds for Casanova's lottery. The odds shown in Table 3 are certainly against the people (e.g. in Quine you can only win one million times what you paid, although the odds are that you have to spend 43 million times the cost of a ticket to win Quine), so we can see why in 1819 Pierre Simon Laplace (1749–1827) said in an address to a governmental council [274, p. 4]: The poor, excited by the desire for a better life and seduced by hopes whose unlikelihood it is beyond their capacity to appreciate, take to this game as if it were a necessity. They are attracted to the combinations that permit the greatest benefit, the same that we see are the least favorable. The lottery was discontinued in 1836.
(iv) In Casanova's lottery, there are two other types of bets. One is called the "Extrait déterminé," where a person specifies a single number and the place where the number occurs (e.g. "12" as the second token drawn amongst the five). The payoff is 70 times the wager. The other is the "Ambe déterminé," where a person specifies two numbers and the places where the numbers occur (e.g. "12" as the second token and "33" as the fourth token drawn amongst the five). The payoff is 5,100 times the wager. Which sample space, X1 or X2, would you use to compute probabilities for winning Extrait déterminé and Ambe déterminé? Compute the corresponding probabilities for general n and m ≤ n. (For Casanova's lottery, n = 90 and m = 5; the answers are, respectively, 1 in 90 and 1 in 8,010.)
11. (The Canadian 6/49 Lottery) On a Canadian 6/49 lottery ticket,25 you choose six numbers from 1, 2, . . . , 49. For the drawing, 49 balls are labeled 1, 2, . . . , 49 and then six balls are drawn at random from the 49 balls.
After this, a seventh ball, called the "bonus ball," is drawn at random from the remaining 43 balls. You win under the following conditions: (1) Three, four, or six of your numbers match respectively three, four, or six of the numbers of the first six balls drawn. (The bonus ball is irrelevant here.) (2) Two or five of your numbers match respectively two or five of the numbers of the first six balls drawn and one of your other numbers matches the bonus ball. (3) Five of your numbers match five of the numbers of the first six balls drawn and your sixth number does not match the bonus ball. See Table 4 for the odds of winning Lotto 6/49.26 In this problem we study 6/49 type lotteries. Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn. (E.g. n = 49 and m = 6 in Lotto 6/49.) Assume that each token is equally likely to be drawn. (We'll consider a "bonus ball" in a moment.) Let a1, a2, . . . , am ∈ {1, 2, . . . , n} be distinct numbers on a ticket a sucker has chosen, and let ℓ ∈ N with ℓ ≤ m.
(i) Let X1 = {x ; x ⊆ {1, . . . , n} , #x = m}. What is the event that exactly ℓ numbers amongst a1, . . . , am match ℓ numbers on m randomly drawn tokens? Describe the event as a subset A1 ⊆ X1. Prove
25 The New York Lotto, as well as other lotteries, are very similar to the Canadian 6/49.
26 To put these odds in perspective, the chance of a person in the USA being killed (in his lifetime) by lightning is about 1 out of every 35,000 (see http://www.lightningsafety.noaa.gov/resources/LtgSafety-Facts.pdf). If you buy one lotto ticket an hour, how many years will it take for you to have a better than 50–50 chance of winning the jackpot (six matches)? Answer: You will be dead for many many years before it happens!
Matches      Probability                                               Approximate odds of winning
2 + bonus    \binom{6}{2}\binom{43}{4}/\binom{49}{6} · (4/43)          1 in 81
3            \binom{6}{3}\binom{43}{3}/\binom{49}{6}                   1 in 57
4            \binom{6}{4}\binom{43}{2}/\binom{49}{6}                   1 in 1032
5 no bonus   \binom{6}{5}\binom{42}{1}/\binom{49}{6}                   1 in 55,491
5 + bonus    \binom{6}{5}\binom{43}{1}/\binom{49}{6} · (1/43)          1 in 2,330,636
6            \binom{6}{6}\binom{43}{0}/\binom{49}{6}                   1 in 13,983,816

Table 4. Odds of winning the Canadian Lotto 6/49.
that

Probability that A1 occurs = \binom{m}{ℓ} \binom{n−m}{m−ℓ} / \binom{n}{m}.
Using the formula, verify the second, third, and sixth rows in Table 4. Suggestion: Group the numbers {1, 2, . . . , n} into two groups, {a1, . . . , am} and {1, 2, . . . , n} \ {a1, . . . , am}. To match ℓ numbers, you need exactly ℓ numbers from the first group and exactly m − ℓ numbers from the second group.
(ii) Henceforth assume that after the m balls are drawn, another ball, the "bonus ball," is drawn at random amongst the remaining n − m balls. Write down a sample space, call it X2, to represent this situation. What is the event that exactly ℓ numbers amongst a1, . . . , am match ℓ numbers on the first m randomly drawn tokens and one of your remaining m − ℓ numbers matches the bonus ball? Describe the event as a subset A2 ⊆ X2. Prove that

Probability that A2 occurs = \binom{m}{ℓ} \binom{n−m}{m−ℓ} / \binom{n}{m} · (m−ℓ)/(n−m).
Using the formula, verify the first and fifth rows in Table 4.
(iii) What is the event that exactly ℓ numbers amongst a1, . . . , am match ℓ numbers on the first m randomly drawn tokens and none of your remaining m − ℓ numbers matches the bonus ball? Describe the event as a subset A3 ⊆ X2. Prove that

Probability that A3 occurs = \binom{m}{ℓ} \binom{n−m−1}{m−ℓ} / \binom{n}{m}.
Using the formula, verify the fourth row in Table 4.
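The formulas in Problems 10 and 11 reproduce Tables 3 and 4; here is a quick numerical check (our own sketch, using Python's `math.comb`):

```python
from math import comb

# Casanova's lottery (Table 3): n = 90 tokens, m = 5 drawn, l chosen numbers.
n, m = 90, 5
for l in range(1, 6):
    p = comb(n - l, m - l) / comb(n, m)
    print("match %d: 1 in %.1f" % (l, 1 / p))  # 18.0, 400.5, 11748.0, ...

# Lotto 6/49 (Table 4): n = 49, m = 6, plus a bonus ball from the remaining 43.
N = comb(49, 6)
odds = {
    "2 + bonus":  comb(6, 2) * comb(43, 4) / N * 4 / 43,
    "3":          comb(6, 3) * comb(43, 3) / N,
    "4":          comb(6, 4) * comb(43, 2) / N,
    "5 no bonus": comb(6, 5) * comb(42, 1) / N,
    "5 + bonus":  comb(6, 5) * comb(43, 1) / N / 43,
    "6":          1 / N,
}
for bet, p in odds.items():
    print("%-10s 1 in %.0f" % (bet, 1 / p))
```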
1.6. More examples of additive set functions

In the last section we introduced additive set functions and gave examples of them occurring in probability. In this section we study Lebesgue and Lebesgue-Stieltjes additive set functions, which are the additive set functions occurring in geometry.
1.6.1. Lebesgue measure on I^n. Recall that m : I^1 → [0, ∞) is defined by m(∅) := 0 and for each nonempty (a, b] ∈ I^1, we define

m(a, b] := b − a.

The function m is called Lebesgue measure on R^1. We can also define Lebesgue measure on R^n. Recall that we denote a left-half open box (a1, b1] × · · · × (an, bn] in R^n by the notation (a, b], where a and b are the n-tuples of numbers a = (a1, . . . , an), b = (b1, . . . , bn) with ak ≤ bk for each k. Also recall that I^n is the set of such boxes. Given a box (a, b], we define its Lebesgue measure in the obvious way:

m(a, b] := (b1 − a1) · (b2 − a2) · · · (bn − an),
which is the product of the lengths of the sides of the box. From the picture

Figure 1.14. A rectangle B decomposed as a union of non-overlapping rectangles B1, . . . , B6. If B denotes the big rectangle, it's "obvious" that m(B) = Σ_{k=1}^{N} m(Bk).
it is "obvious" that if a box is partitioned into smaller boxes, then the measure of the box is the sum of the measures of the smaller boxes. This is "obviously" true, but its proof is not at all trivial; we shall prove it in Proposition 1.18 below. In other words, it's obvious that m is finitely additive, where, recalling the definition from the last section, a function µ : I → [0, ∞] on a semiring I is called a set function, and µ is said to be additive, or finitely additive, if
(1) µ(∅) = 0.
(2) If A ∈ I and A = ⋃_{n=1}^{N} An where A1, . . . , AN ∈ I are pairwise disjoint, then

µ(A) = Σ_{n=1}^{N} µ(An).
Proposition 1.18. For any n, Lebesgue measure m : I^n → [0, ∞) is additive.

Proof: For notational simplicity, we prove this result for n = 2; the general case is only notationally more cumbersome. Let I × J ∈ I^2 = I^1 × I^1, and suppose that

I × J = ⋃_{k=1}^{N} Ik × Jk

is a union of pairwise disjoint left-half open rectangles where Ik, Jk ∈ I^1. We need to prove that

m(I × J) = Σ_{k=1}^{N} m(Ik × Jk);  that is,  m(I) m(J) = Σ_{k=1}^{N} m(Ik) m(Jk).
1.6. MORE EXAMPLES OF ADDITIVE SET FUNCTIONS
The idea, which we'll see again and again (e.g. in the law of large numbers), is to turn this statement about measures into one about functions, and then use function techniques to prove the result. For an arbitrary set X and subset A ⊆ X, recall that the characteristic function of the set A is defined by

χA(x) := 1 if x ∈ A,  χA(x) := 0 if x ∉ A.

Observe that for subsets C, D ⊆ R, we have

(1.14)  χ_{C×D}(x, y) = χC(x) χD(y),

and if E and F are disjoint subsets of R^2, then

(1.15)  χ_{E∪F} = χE + χF.
For example, to prove the first equality, note that

χ_{C×D}(x, y) = 1  ⇐⇒  (x, y) ∈ C × D  ⇐⇒  x ∈ C, y ∈ D  ⇐⇒  χC(x) = 1, χD(y) = 1  ⇐⇒  χC(x) χD(y) = 1.

The equality (1.15) is proved similarly and holds, by induction, for any finite union of pairwise disjoint sets. In any case, since I × J = ⋃_{k=1}^{N} Ik × Jk is a union of pairwise disjoint sets, (1.15) shows that

χ_{I×J}(x, y) = Σ_{k=1}^{N} χ_{Ik×Jk}(x, y).
Since χ_{I×J}(x, y) = χI(x) χJ(y) and χ_{Ik×Jk}(x, y) = χ_{Ik}(x) χ_{Jk}(y) by (1.14), we conclude that

(1.16)  χI(x) χJ(y) = Σ_{k=1}^{N} χ_{Ik}(x) χ_{Jk}(y).

Now for any finite interval (c, d], the function χ_{(c,d]}(y) is Riemann integrable and (see Figure 1.15)

(1.17)  ∫ χ_{(c,d]}(y) dy = ∫_c^d dy = d − c = m(c, d].

Figure 1.15. ∫ χ_{(c,d]}(y) dy = m(c, d].
Let us fix x ∈ R in the equality (1.16), and regard both sides of the equality as functions only of the variable y. Then integrating both sides of (1.16) with respect to y and using (1.17), we obtain

∫ χI(x) χJ(y) dy = Σ_{k=1}^{N} ∫ χ_{Ik}(x) χ_{Jk}(y) dy   =⇒   m(J) χI(x) = Σ_{k=1}^{N} m(Jk) χ_{Ik}(x).

We now integrate both sides of m(J) χI(x) = Σ_{k=1}^{N} m(Jk) χ_{Ik}(x) with respect to x, again using (1.17), obtaining

m(I) m(J) = Σ_{k=1}^{N} m(Ik) m(Jk).

This proves our result.
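Proposition 1.18 can be sanity-checked on a concrete decomposition in the spirit of Figure 1.14; the particular decomposition below is our own example, not one from the text:

```python
from fractions import Fraction

def m(box):
    """Lebesgue measure of a left-half open box (a1,b1] x ... x (an,bn],
    given as a list of (a, b) pairs: the product of the side lengths."""
    meas = Fraction(1)
    for a, b in box:
        meas *= Fraction(b) - Fraction(a)
    return meas

# The square (0,2] x (0,2] split into three pairwise disjoint
# left-half open rectangles.
big = [(0, 2), (0, 2)]
pieces = [
    [(0, 1), (0, 2)],  # left half
    [(1, 2), (0, 1)],  # lower right quarter
    [(1, 2), (1, 2)],  # upper right quarter
]
assert m(big) == sum(m(p) for p in pieces)
print(m(big))  # 4
```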
1.6.2. Lebesgue-Stieltjes additive set functions. We end this section with another example of a set function on I^1 that is of importance in many fields such as functional analysis and probability theory. Before introducing this set function, we review some definitions. A function f : R → R is said to be nondecreasing if for any x ≤ y, we have f(x) ≤ f(y). Although we are mostly interested in nondecreasing functions, we remark that the function f is called nonincreasing if for any x ≤ y, we have f(x) ≥ f(y). A monotone function is a function that is either nondecreasing or nonincreasing. The following lemma contains some of the main properties of nondecreasing functions.

Lemma 1.19. Let f be a nondecreasing function on R. Then the left and right-hand limits, f(x−) and f(x+), exist at every point. Moreover, the following relations hold:

f(x−) ≤ f(x) ≤ f(x+),  and if x < y, then f(x+) ≤ f(y−).
Figure 1.16. This picture of a nondecreasing function f suggests that f(x−) = sup{f(y) ; y < x} and f(x+) = inf{f(y) ; x < y}.

Proof: See Figure 1.16 for an illustration of this lemma. Fix x ∈ R. Since f is nondecreasing, for all y < x we have f(y) ≤ f(x), so the set A := {f(y) ; y < x} is bounded above by f(x). Hence, the supremum of A exists; call it a. We shall prove that a = lim_{y→x−} f(y). To this end, let ε > 0 be given. Then the number a − ε cannot be an upper bound for A and hence there is a z < x such that a − ε < f(z), or, rearranging the inequality, a − f(z) < ε. Now given any y with z < y < x, by monotonicity we have f(z) ≤ f(y), therefore
z < y < x  =⇒  −f(y) ≤ −f(z)  =⇒  a − f(y) ≤ a − f(z)  =⇒  a − f(y) < ε.

On the other hand, since a is an upper bound of A, it follows that for any y < x, we have f(y) ≤ a, which implies that for y < x, |a − f(y)| = a − f(y). To summarize: For all y ∈ R with z < y < x, we have |a − f(y)| < ε. This means, by definition, that

lim_{y→x−} f(y) = a = sup{f(y) ; y < x}.
Thus, f (x−) = a. Moreover, since f (x) is an upper bound for A, it follows that a ≤ f (x). Hence, f (x−) ≤ f (x). By considering the set {f (y) ; x < y} one can similarly prove that f (x+) = inf{f (y) ; x < y} ≥ f (x).
Let x < y. Then we can choose w with x < w < y, so by definition of infimum and supremum, f(x+) = inf{f(z) ; x < z} ≤ f(w) ≤ sup{f(z) ; z < y} = f(y−). This completes our proof.
There is an analogous statement for nonincreasing functions, although we won't need it. Given a nondecreasing function f, we define the set function µf : I^1 → [0, ∞) by

µf(a, b] := f(b) − f(a).
This set function is called the Lebesgue-Stieltjes set function of f; here, the "Stieltjes" refers to Thomas Jan Stieltjes (1856–1894), who shortly before his death introduced what are now called Riemann-Stieltjes integrals, where Lebesgue-Stieltjes set functions came from, in a famous 1894 paper on continued fractions [270]. One way to interpret µf, as Stieltjes originally did (see Section 6.2), is to consider a rod lying along the interval [0, ∞) and let f(x) = mass of the rod on the interval [0, x].
Then µ_f(a, b] = f(b) − f(a) is exactly the mass of the rod between a and b, so µ_f measures not necessarily uniform mass distributions. Another interpretation of µ_f is that it measures how much f distorts lengths: indeed, µ_f(a, b] is just the length of the interval (f(a), f(b)], the image of the interval (a, b] under f. Here are some examples, each pictured over an interval (a, b]: f(x) = x, the Heaviside function f(x) = H(x), and f(x) = e^x.
In particular, for f(x) = x, we get the usual Lebesgue measure. The middle picture shows the Heaviside function H(x), which equals 0 for x ≤ 0 and 1 for x > 0, and for the particular interval in the picture we have µ_H(a, b] = 0. In the third picture, f distorts lengths exponentially. Problem 3 looks at various examples of Lebesgue-Stieltjes set functions, including the Heaviside function. Our next task is to prove that general Lebesgue-Stieltjes set functions are additive, and to do so we need the following lemma.

Lemma 1.20. Let (a, b] = ⋃_{n=1}^N (a_n, b_n] be a union of pairwise disjoint nonempty left-half open intervals. Then we can relabel the sets (a_1, b_1], (a_2, b_2], ..., (a_N, b_N] so that

b_1 = b,  a_N = a,  and  a_n = b_{n+1} for n = 1, 2, ..., N − 1.
Figure 1.17. The left figure shows a left-half open interval written as a pairwise disjoint union (a2 , b2 ] ∪ (a1 , b1 ] ∪ (a3 , b3 ]. By relabelling the subscripts, we can write this same union as (a3 , b3 ] ∪ (a2 , b2 ] ∪ (a1 , b1 ] where b1 is the right end point, a3 is the left end point, and an = bn+1 for n = 1, 2.
Figure 1.17 shows an illustration of this lemma when N = 3. Since the statement of this lemma seems so intuitively "obvious," we shall leave the details to you. (Warning: although obvious, an honest, completely rigorous proof is tedious and, written out in detail, should take you about a page!) We can now prove that Lebesgue-Stieltjes set functions are additive.

Proposition 1.21. For any nondecreasing function f : R → R, the Lebesgue-Stieltjes set function µ_f : I^1 → [0, ∞) is additive.

Proof: Given a pairwise disjoint union (a, b] = ⋃_{n=1}^N I_n, we need to show that µ_f(a, b] = Σ_{n=1}^N µ_f(I_n). Since µ_f(∅) = 0, we may assume that the I_n's are nonempty. Then, according to the previous lemma, we can relabel the I_n's so that

(a, b] = ⋃_{n=1}^N (a_n, b_n],

where b_1 = b, a_N = a, and a_n = b_{n+1} for n = 1, 2, ..., N − 1. Now observe that the following sum telescopes:

Σ_{n=1}^N µ_f(a_n, b_n] = Σ_{n=1}^N (f(b_n) − f(a_n))
  = (f(b_1) − f(a_1)) + (f(b_2) − f(a_2)) + ··· + (f(b_N) − f(a_N))
  = f(b_1) − f(a_N),

which is f(b) − f(a). This is just µ_f(a, b], exactly as we set out to prove.
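The relabelling of Lemma 1.20 and the telescoping sum above are easy to rehearse numerically. Here is a sketch in Python; the helper names (`mu_f`, `relabel`) and the choice f(x) = e^x are ours, purely for illustration:

```python
import math

def mu_f(f, a, b):
    """Lebesgue-Stieltjes set function of a nondecreasing f: mu_f(a, b] = f(b) - f(a)."""
    return f(b) - f(a)

def relabel(intervals):
    """Relabel pairwise disjoint intervals (a_n, b_n] tiling some (a, b] as in
    Lemma 1.20: sort by right endpoint, largest first, and check a_n = b_{n+1}."""
    ordered = sorted(intervals, key=lambda I: I[1], reverse=True)
    for (a1, _), (_, b2) in zip(ordered, ordered[1:]):
        assert a1 == b2, "the intervals do not tile a single interval"
    return ordered

# (0, 3] tiled by disjoint left-half open intervals, listed out of order
pieces = [(1.0, 2.5), (0.0, 1.0), (2.5, 3.0)]
ordered = relabel(pieces)

f = math.exp
total = sum(mu_f(f, a, b) for a, b in pieces)     # the telescoping sum
assert abs(total - mu_f(f, 0.0, 3.0)) < 1e-12     # equals mu_f(0, 3]
```

Because the sum telescopes, the order of the pieces does not matter; additivity is exactly the statement that the total equals µ_f(0, 3].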
We make a couple of remarks. First, any finite-valued (not allowed to take the value ∞) additive set function on I^1 must be a Lebesgue-Stieltjes set function; that is, if µ : I^1 → [0, ∞) is additive, then µ = µ_f for some nondecreasing function f : R → R. Thus, Lebesgue-Stieltjes set functions characterize all finite-valued additive set functions on I^1. See Problem 4 for the proof. Second, Lebesgue-Stieltjes set functions can be defined for nondecreasing functions defined on any interval, by extending the function to be constant outside the interval so that it remains a nondecreasing map on R. For instance, if f : [a, b] → R is a nondecreasing function on a closed interval, we define f(x) = f(a) for x < a and f(x) = f(b) for x > b. Then the extended map f : R → R is nondecreasing, so it defines a Lebesgue-Stieltjes set function.

◮ Exercises 1.6.
1. Let µ : I → [0, ∞] be a map on a semiring I satisfying µ(A) = Σ_{n=1}^N µ(A_n) for any set A ∈ I written as A = ⋃_{n=1}^N A_n, where A_1, ..., A_N ∈ I are pairwise disjoint. If
µ(A) < ∞ for some A, show that µ(∅) = 0. Thus, the requirement that µ(∅) = 0 in the definition of an additive set function is redundant if µ is not identically ∞.
2. The notion of σ-finite will occur quite often in future chapters. An additive set function µ on a semiring I of subsets of a set X is said to be σ-finite if X = ⋃_{n=1}^∞ X_n, where {X_n} is a sequence of pairwise disjoint sets in I with µ(X_n) < ∞ for each n. Most measures of practical interest are σ-finite.
(a) Show that Lebesgue measure on I^n and any Lebesgue-Stieltjes set function on I^1 are σ-finite.
(b) Prove that µ is σ-finite if X = ⋃_{n=1}^∞ X_n, where {X_n} is a sequence of not necessarily pairwise disjoint sets in I with µ(X_n) < ∞ for each n. Suggestion: Use the Fundamental Lemma of Semirings (Lemma 1.3).
3. In this problem we look at examples of Lebesgue-Stieltjes set functions.
(a) Let I be the semiring of left-half open intervals in (0, 1] and define µ : I → [0, ∞] by µ(a, b] = b − a if a ≠ 0 and µ(a, b] = ∞ if a = 0. Show that µ is finitely additive.
(b) Given a function g : R → [0, ∞) that is Riemann integrable on any finite interval, we define m_g : I^1 → [0, ∞) by taking the Riemann integral of g:

m_g(a, b] := ∫_a^b g(x) dx.
In particular, if g = 1, this is just the usual Lebesgue measure m. Let f : R → R be a nondecreasing continuously differentiable function. Show that µ_f = m_{f′}, where µ_f is the Lebesgue-Stieltjes measure corresponding to f. Remark: In the subject of Distribution Theory it's common to identify the measure m_g with the function g; that is, to consider the measure m_g and the function g defining the measure as the "same." Thus, it is OK to write m_g = g, properly understood. Hence, the equality µ_f = m_{f′} can be written µ_f = f′ if you wish.
(c) Given α ∈ R, define H_α : R → R by

H_α(x) := 0 if x < α,  and  H_α(x) := 1 if x ≥ α.
This function is called the Heaviside function, in honor of Oliver Heaviside (1850–1925), who applied it to simulate current in an electric circuit. Give a formula for µ_{H_α}(I) for any I ∈ I^1. Remark: Let's define a Dirac delta "function" δ_α, named after the great mathematical physicist Paul Adrien Maurice Dirac (1902–1984), by the formal properties

∫_I δ_α(x) dx := 1 if α ∈ I,  and  ∫_I δ_α(x) dx := 0 if α ∉ I,

where I ∈ I^1. Of course, there is no function with these properties, hence the quotes on "function"; however, see Problem 2c in Exercises 3.2. In view of Part (a), do you see why we formally write H_α′ = δ_α?
4. In this exercise we prove that the finite-valued additive set functions on I^1 are exactly the Lebesgue-Stieltjes set functions. (If you're interested in the corresponding statement for I^n, see [30, p. 176].)
(a) Let µ : I^1 → [0, ∞) be an additive set function and define f : R → R by

f(x) := −µ(x, 0] if x < 0,  and  f(x) := µ(0, x] if x ≥ 0.

Show that f is nondecreasing and µ = µ_f.
(b) Given a nondecreasing function g : R → R, show that if µ_f = µ_g, then f and g differ by a constant. So, the function corresponding to a Lebesgue-Stieltjes set function is unique up to a constant.
5. In this exercise, we study the translation invariance of measures on I 1 . Related properties for Rn are studied in Section 4.4. Given x ∈ R and A ⊆ R, the translation of A by x is denoted by A + x or x + A: x + A = A + x = {a + x ; a ∈ A} = {y ∈ R ; y − x ∈ A}.
(a) Prove that I^1 is translation invariant in the sense that if I ∈ I^1, then x + I ∈ I^1 for any x ∈ R.
(b) A set function µ : I^1 → [0, ∞] is translation invariant if µ(x + I) = µ(I) for all x ∈ R and I ∈ I^1. A function f : R → R is affine if f(x) = ax + b for some a, b ∈ R. Prove that if µ is the Lebesgue-Stieltjes set function defined by an affine function, then µ is translation invariant. In Problem 8 we'll prove the converse.
6. (Cauchy's functional equation I) (Cf. [310]) In this and the next problem we study Cauchy's functional equation, studied by Augustin Louis Cauchy (1789–1857) in 1821. (This problem doesn't involve measures, but it's useful for Problem 8.) A function f : R → R is said to be additive if it satisfies Cauchy's functional equation: for every x, y ∈ R, we have

f(x + y) = f(x) + f(y).

Suppose that f : R → R is additive.
(i) Prove that f(0) = 0 and, for all x ∈ R, f(−x) = −f(x).
(ii) Prove that f(rx) = r f(x) for all r ∈ Q and x ∈ R. In particular, setting x = 1, we see that f(r) = f(1) r for all r ∈ Q. (We can do even better and say that f(x) = f(1) x for all x ∈ R if we add one assumption, explained next.)
(iii) Suppose that in addition to being additive, f is bounded on some interval (−a, a) with a > 0; thus, there is a constant C > 0 such that |f(x)| ≤ C for all |x| < a. Prove that f(x) = f(1) x for all x ∈ R. Suggestion: Show that for all n ∈ N, we have |f(x)| ≤ C/n if |x| < a/n. Next, fix x ∈ R, let n ∈ N, and choose r ∈ Q such that |x − r| < a/n. Verify the identity f(x) − f(1) x = f(x − r) − f(1)(x − r) and try to estimate the absolute value of the right-hand side.
7. (Cauchy's functional equation II: Hamel's theorem) (Cf. [127]) Let f : R → R be additive but not linear; that is, f is not of the form f(x) = f(1) x for all x ∈ R. How bad can f be? After all, we know from Part (ii) of the previous problem that f(r) = f(1) r for all r ∈ Q. In fact, f can be extremely bad!
Hamel's theorem, named after Georg Karl Wilhelm Hamel (1877–1954), states that the graph of f, G_f := {(x, f(x)) ; x ∈ R}, is dense in R^2. In other words, for each p ∈ R^2 and ε > 0 there is a point z ∈ G_f such that |p − z| < ε. To prove this, you may proceed as follows.
(i) Prove that if the graph of f(x)/f(1) is dense in R^2, then so is the graph of f. Conclude that we may henceforth assume that f(r) = r for all r ∈ Q.
(ii) Since f is not linear there is a point x_0 ∈ R such that f(x_0) ≠ x_0; thus, f(x_0) = x_0 + δ for some δ ≠ 0. Let p ∈ R^2 and ε > 0. Choose rational numbers r, s such that |p − (r, s)| < ε/2. Next, choose a rational number a ≠ 0 such that

|a − (s − r)/δ| < ε/(8|δ|)  =⇒  |δa + r − s| < ε/8.

Finally, choose a rational number b such that |x_0 − b| < ε/(8|a|). If x = r + a(x_0 − b), show that f(x) − s = r + aδ − s + a(x_0 − b).
(iii) Show that z = (x, f(x)), where x = r + a(x_0 − b), satisfies |p − z| < ε.
(iv) Use Hamel's theorem to prove Part (iii) of the previous problem; that is, prove that if f : R → R is additive and bounded on some interval (−a, a) with a > 0, then f must be linear.
8. Let f : R → R be a nondecreasing function. In this problem we prove that the Lebesgue-Stieltjes set function defined by f is translation invariant if and only if f is affine. Since every finite-valued additive set function on I^1 is a Lebesgue-Stieltjes set function (by Problem 4), after completing this problem you will have proven the following theorem.
Theorem. Lebesgue-Stieltjes set functions defined by affine functions are the only finite-valued translation invariant additive set functions on I^1.
By Part (b) of Problem 5 we just have to prove necessity.
(i) Assume that µ_f is translation invariant. Let g(x) = f(x) − f(0). Show that g is additive, that is, g(x + y) = g(x) + g(y) for all x, y ∈ R.
(ii) Using Part (iii) of Problem 6, prove that g(x) = g(1) x for all x ∈ R. From this, deduce that f is affine.
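The identity µ_f = m_{f′} of Problem 3(b) can also be checked numerically. A Python sketch, under the concrete (and arbitrary) choices f(x) = e^x and a midpoint-rule approximation of the Riemann integral:

```python
import math

def mu_f(f, a, b):                      # Lebesgue-Stieltjes set function of f
    return f(b) - f(a)

def riemann(g, a, b, n=100_000):        # midpoint-rule Riemann integral of g over (a, b]
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

f = fprime = math.exp                   # f is nondecreasing and C^1, with f' = f
a, b = 0.0, 2.0
assert abs(mu_f(f, a, b) - riemann(fprime, a, b)) < 1e-6
```

With a fine enough grid, the Riemann integral of f′ over (a, b] agrees with f(b) − f(a) to within the quadrature error, exactly as the problem predicts.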
Notes and references on Chapter 1

§1.1: There are many expositions of Lebesgue's theory that you can find; see for example Ulam's nice article [291].
§1.2: See [212] for a translation of Girolamo Cardano's (1501–1576) book Liber de Ludo Aleae, and see [211] for some history on the Pascal-Fermat-Méré triangle. For more on the dice problem, see [227], and for the problem of points, see [80]. For the general history of probability, see the classic (free) book [285]; for relations to measure theory, see e.g. the articles [40], [72], [111], and [31].
§1.5: Pierre-Simon Laplace (1749–1827) greatly influenced the mathematical theory of probability through his groundbreaking treatise Théorie analytique des probabilités [157], published in 1812. In Essai philosophique sur les probabilités, the introduction to the second edition of the Théorie analytique, he calls the principle

probability of A := #A/#X = (number of elements of A)/(total number of possible outcomes)

the First Principle of the Calculus of Probabilities (see [158, pp. 5–6]):

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible.

In Section 1.5 we discussed (and defined) the notion of "fairness," and we used this notion in several examples. However, we remark that fairness in theory may not actually be fairness in practice.
Perhaps the most famous illustration of this is “Weldon’s dice data.” In 1894, Walter Frank Raphael Weldon (1860–1906) wrote to Francis Galton (1822–1911) concerning a dice experiment consisting of 23,806 tosses of twelve dice — you can read Weldon’s letter in [220]. If the dice were fair, the probability of throwing a 5 or 6 for any given toss of one die is 1/3. However, the probability obtained experimentally from Weldon’s 23,806 tosses turns out to be approximately 0.3377, a little larger than 1/3; see [94, p. 138]. One explanation (see [76, p. 273]) for the discrepancy between theory and practice could be due to the fact that the hollowed-out pips on each face used to denote the numbers on a die make the die slightly unbalanced; e.g. the 6 face should be lighter than the 1 face. Since the 5 and 6 faces are the lightest faces one might conjecture that they will land upwards most often. This is indeed the case at least from Weldon’s data. If you are interested in birthday-type problems, check out [210, 204, 27, 198].
CHAPTER 2
Finitely additive integration

The basic theme of this chapter (and a recurring theme throughout this book) is that we can use integration of functions to help us better understand the measure of sets.

2.1. Integration on semirings

In our proof that Lebesgue measure is additive on I^n we saw how useful integration theory can be for deriving properties of set functions. In this section, we develop a simple integration theory on semirings, and in Section 2.3, we use this integration theory to develop measure theory on semirings. Moreover, many of the properties of integrals we develop here will come in handy in Chapter 5 when we study integration on σ-algebras.

2.1.1. Integrals on semirings. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. Our goal is to understand the properties of integrals defined via µ. However, our integration theory shall be very primitive in that we only integrate "simple functions," which are described as follows. Recall that for any subset A ⊆ X, the characteristic function of A is the function χ_A : X → R defined by

χ_A(x) := 1 if x ∈ A,  and  χ_A(x) := 0 if x ∉ A.

A function f : X → R is called an I-simple function (also called an I-step function or, in probability, a simple random variable) if f is of the form

(2.1)  f = Σ_{n=1}^N a_n χ_{A_n},

where A_1, ..., A_N ∈ I are pairwise disjoint and a_1, ..., a_N ∈ R. See Figure 2.1 for a picture of such a simple function. Given an I-simple function f = Σ_{n=1}^N a_n χ_{A_n}, we define the integral of f, denoted by ∫f, as the extended real number

(2.2)  ∫f := Σ_{n=1}^N a_n µ(A_n),

provided that the right-hand side is a well-defined extended real number (thus, +∞ and −∞ are allowable integrals). By "well-defined," we mean that the right-hand side cannot contain a term equal to +∞ and another term equal to −∞ (because ∞ − ∞ is not defined).

Figure 2.1. In this picture, f = Σ_{n=1}^5 a_n χ_{A_n}, so ∫f := Σ_{n=1}^5 a_n µ(A_n). Geometrically, ∫f represents the area under f.

Recall that the convention is: if a_n = 0 and µ(A_n) = ∞, then a_n µ(A_n) := 0. Note that if all the a_n are nonnegative, then the right-hand side of (2.2) is always well-defined (it may equal +∞, but this is OK) and it geometrically represents the "area under the graph of f" as seen in Figure 2.1. Going back to the definition of simple functions, we remark that the presentation of f as the sum (2.1) is not unique; for example, if I = I^1, then the simple function f(x) = χ_{(0,3]}(x) can be written in many different ways:

χ_{(0,3]} = χ_{(0,1]} + χ_{(1,3]} = χ_{(0,1]} + χ_{(1,2]} + χ_{(2,3]} = ···.
The basic reason for non-uniqueness is the fact that we can write unions of elements of I^1 in many different ways, as illustrated in Figure 2.2.

Figure 2.2. A_1 ∪ A_2 can be written in many different ways; e.g. A = A_1 ∪ A_2 = B_1 ∪ B_2 ∪ B_3 = C_1 ∪ ··· ∪ C_5.
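That formula (2.2) does not depend on the presentation can be checked concretely for Lebesgue measure on left-half open intervals. In the Python sketch below (the helper name and the list-of-terms layout are ours), the two presentations of χ_{(0,3]} from above give the same integral:

```python
def integral(terms):
    """Integral of an I^1-simple function sum a_n * chi_{(l_n, r_n]} with
    respect to Lebesgue measure: sum a_n * (r_n - l_n)."""
    return sum(c * (r - l) for c, (l, r) in terms)

# two presentations of the same simple function chi_{(0, 3]}
p1 = [(1, (0, 3))]
p2 = [(1, (0, 1)), (1, (1, 2)), (1, (2, 3))]
assert integral(p1) == integral(p2) == 3
```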
Since a simple function can be written in many different ways, it's not obvious that the formula (2.2) gives the same value for all presentations of f. To prove that this is indeed the case, suppose that f = Σ_{m=1}^M b_m χ_{B_m} is another presentation of f, where B_1, ..., B_M ∈ I are pairwise disjoint and b_1, ..., b_M ∈ R. Then

(2.3)  Σ_{n=1}^N a_n χ_{A_n}(x) = Σ_{m=1}^M b_m χ_{B_m}(x)  for all x ∈ X.

We assume that all the a_n's and b_m's are nonzero; otherwise we can just drop them from the sums. First note that

(2.4)  a_n = b_m  if A_n ∩ B_m ≠ ∅,

because at a point x ∈ A_n ∩ B_m, the left-hand side of (2.3) equals a_n and the right-hand side equals b_m. Next, we claim that

A_n = ⋃_m (A_n ∩ B_m)  and  B_m = ⋃_n (A_n ∩ B_m).

For example, to prove the left-hand equality, let x ∈ A_n. Then the left-hand side of (2.3) equals a_n and hence is not zero, therefore the right-hand side is also not zero; in particular, x ∈ B_m for some m. This shows that A_n ⊆ ⋃_m (A_n ∩ B_m); the opposite inclusion ⋃_m (A_n ∩ B_m) ⊆ A_n is automatic, so A_n = ⋃_m (A_n ∩ B_m). The right-hand equality is proved similarly. Finally, observe that

Σ_n a_n µ(A_n) = Σ_n Σ_m a_n µ(A_n ∩ B_m)  (by the additivity of µ)
  = Σ_m Σ_n b_m µ(A_n ∩ B_m)  (by (2.4))
  = Σ_m b_m µ(B_m)  (by the additivity of µ).
Thus, the integral of f is well-defined, independent of the presentation of f.

We remark that since the notation ∫f doesn't explicitly mention µ, in some cases it may not be clear what measure we are integrating with respect to; in such cases, to emphasize the measure we use the notation

∫f dµ  for  ∫f.

For the semiring I^n with Lebesgue measure m, we can denote the integral (2.2) by several notations:

∫f dm  or  ∫f dx  or  ∫f(x) dx  for  ∫f,

or d of any other letter by which we are denoting the coordinate functions on R^n. We also remark that if µ : I → [0, 1] is a finitely additive probability set function, then an I-simple function is called a simple random variable. The integral ∫f dµ is called the expected value, or mean value, of f, and is usually denoted by E(f). This number represents the value of f that we expect to observe, at least on average, if we repeat the experiment a very large number of times. See Section 2.2 for more details on expectations.

2.1.2. Properties of the integral. We now prove that the integral has all the properties that we expect an integral to have. But first, a definition: a collection A of functions is called an algebra of functions if it is closed under taking linear combinations and products; that is,

(1) f, g ∈ A =⇒ af + bg ∈ A for any a, b ∈ R;
(2) f, g ∈ A =⇒ fg ∈ A.

Lemma 2.1. The set of I-simple functions forms an algebra.

Proof: We need to show that any linear combination and product of simple functions is again a simple function. That the product of simple functions is simple will be left for you (just use that χ_A · χ_B = χ_{A∩B} for any sets A and B); we shall only prove the linear combination statement. Let f and g be I-simple functions and let a, b ∈ R; we need to show that af + bg is an I-simple function. Actually, since it's easy to see that af and bg are I-simple functions, we just have to show that f + g is an I-simple function. Furthermore, we may assume that g has just one term (exercise: deduce the general case by induction on the number of terms in a presentation of g). Thus, let

f = Σ_{n=1}^N a_n χ_{A_n},  g = c χ_B

be I-simple functions. Before diving into the (complicated) proof that f + g is an I-simple function, consider the case when N = 1, that is, f = a χ_A consists of one term, where A ∈ I. Decompose A ∪ B as C_1 ∪ C_2 ∪ C_3, where

C_1 = A \ B,  C_2 = A ∩ B,  C_3 = B \ A.

Then χ_A = χ_{C_1} + χ_{C_2} and χ_B = χ_{C_2} + χ_{C_3}, so

f + g = a χ_A + c χ_B = a(χ_{C_1} + χ_{C_2}) + c(χ_{C_2} + χ_{C_3}) = a χ_{C_1} + (a + c) χ_{C_2} + c χ_{C_3},

which shows that f + g is an I-simple function. The proof in the general case is not much different: we just take the differences and intersections of the sets making up f and g, write them as disjoint unions of elements of I, then add! To implement this idea, recall from Property (1.9) of a semiring that there are finitely many pairwise disjoint sets {B_k} in I such that

B \ ⋃_{n=1}^N A_n = ⋃_k B_k.

Therefore, using the easy-to-prove formula S = (S \ T) ∪ (T ∩ S) for any sets S, T ⊆ X, we have

(2.5)  B = ⋃_k B_k ∪ ⋃_n (A_n ∩ B).

Again using the difference property of semirings, for each n there are finitely many pairwise disjoint sets {A_{nm}} in I such that

A_n \ B = ⋃_m A_{nm}.

Therefore, by the easy-to-prove formula,

(2.6)  A_n = ⋃_m A_{nm} ∪ (A_n ∩ B).

Now observe that if E and F are any disjoint sets, then χ_{E∪F} = χ_E + χ_F. Indeed, if x ∈ E ∪ F, then both sides equal 1, and if x ∉ E ∪ F, then both sides equal 0. Using induction, this formula holds for any finite union of pairwise disjoint sets. Hence, by (2.5) and (2.6), we have

χ_B = Σ_k χ_{B_k} + Σ_n χ_{A_n ∩ B}  and  χ_{A_n} = Σ_m χ_{A_{nm}} + χ_{A_n ∩ B}.

Therefore, the formulas f = Σ_{n=1}^N a_n χ_{A_n} and g = c χ_B take the form

f = Σ_{n,m} a_n χ_{A_{nm}} + Σ_n a_n χ_{A_n ∩ B},  g = Σ_k c χ_{B_k} + Σ_n c χ_{A_n ∩ B},

and so,

(2.7)  f + g = Σ_{n,m} a_n χ_{A_{nm}} + Σ_n (a_n + c) χ_{A_n ∩ B} + Σ_k c χ_{B_k}.

By construction, the sets {A_{nm}, A_n ∩ B, B_k} are all pairwise disjoint, so after all this work we see that f + g is by definition an I-simple function.
Proof : The proof of (2) is the longest, so we shall prove (1) and (3), then prove (2) at the end. Step 1: To prove (1) is quick: If f is a nonnegative I -simple function, then in the presentation (2.1), each an must be nonnegative, which implies that R f ≥ 0. Step 2: We shall prove (3) assuming we’ve already proved (2), which we’ll R R prove in later. Let f ≤ g be I -simple functions; weRshall prove that f ≤ g R R provided these R integrals exist. If f = −∞, then f ≤ g is automatic, so assume that f 6= −∞. Observe that g = f + (g − f ), and by Lemma 2.1, g − f is an I -simple function. Moreover, since f ≤ g, the R function g − f is nonnegative, so R by nonnegativity Rof theR integral, we have (g − f ) ≥ 0. Therefore, since f 6= −∞, the sum f + (g − f ) is well-defined. Hence by (2), Z Z Z Z Z g = f + (g − f ) =⇒ g = f + (g − f ) =⇒ f ≤ g,
R where in the second implication, we used that (g − f ) ≥ 0. Step 3: To prove (2),R given I -simple f and g and real numbers R functions R a, b, we need to show that (af + b g) = a f + b g. Actually, since Rit’s easy to R R R R show that af = a f and b g = b g, we just have to show that (f + g) = R R f + g. Moreover, as in the proof of Lemma 2.1, we may assume that g has just one term. Thus, let f=
N X
an χAn ,
g = c χB
n=1
be I -simple functions. By definition of the integral, we have Z Z X f+ g= an µ(An ) + c µ(B), n
74
2. FINITELY ADDITIVE INTEGRATION
where we assume that the sum on the right is well defined. R The object of the game now is to express the right-hand side so it becomes (f + g). To this end, note that by (2.5) in the previous lemma and additivity of µ, we have X X µ(B) = µ(An ∩ B) + µ(Bk ), n
k
and by (2.6) in the previous lemma, for each n we have X µ(An ) = µ(Anm ) + µ(An ∩ B). m
See the proof of the previous lemma for the various notations used here. Therefore, Z Z X f+ g= an µ(An ) + c µ(B) n
=
X n
=
X
an
X m
X X µ(Anm ) + µ(An ∩ B) + c µ(An ∩ B) + µ(Bk )
an µ(Anm ) +
n,m
=
X
n,m
X n
an µ(Anm ) +
X n
an µ(An ∩ B) +
X n
n
c µ(An ∩ B) +
(an + c) µ(An ∩ B) +
X
X
k
c µ(Bk )
k
b µ(Bk ).
k
This expression, in R view of the formula (2.7) in the previous lemma, is by definition the integral (f + g), just as we wanted to show.
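Property (2) can be seen in a toy case with Lebesgue measure on I^1: take f = 2χ_{(0,3]} and g = χ_{(1,2]}. Refining into disjoint pieces, as in the proof of Lemma 2.1, gives a pairwise disjoint presentation of f + g, and the integrals add. A Python sketch (the helper name and data layout are ours):

```python
def integral(terms):
    """sum a_n * m(A_n) over a presentation [(a_n, (l_n, r_n)), ...] of a simple
    function, with m the Lebesgue measure of the interval (l_n, r_n]."""
    return sum(c * (r - l) for c, (l, r) in terms)

f = [(2, (0, 3))]                                      # f = 2*chi_(0,3]
g = [(1, (1, 2))]                                      # g = chi_(1,2]
f_plus_g = [(2, (0, 1)), (3, (1, 2)), (2, (2, 3))]     # disjoint presentation of f + g
assert integral(f_plus_g) == integral(f) + integral(g) == 7
```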
Corollary 2.3. An I-simple function is any function of the form

f = Σ_{n=1}^N a_n χ_{A_n},  a_n ∈ R, A_n ∈ I,

where A_1, ..., A_N ∈ I are not necessarily disjoint, in which case

∫f = Σ_{n=1}^N a_n µ(A_n),

provided that the right-hand side is defined.
Proof: Since each χ_{A_n} is an I-simple function and the I-simple functions form a linear space, it follows that f is an I-simple function. The formula for the integral of f follows from linearity of the integral.

◮ Exercises 2.1.
1. Prove the following useful identities for characteristic functions:

χ_{A∩B} = χ_A · χ_B,  χ_{A^c} = 1 − χ_A,  χ_{A∪B} = χ_A + χ_B − χ_A · χ_B.

2. In this problem, we give an example connecting integration with sums. Let P(N) be the power set of the natural numbers. Consider the counting function # : P(N) → [0, ∞] defined by #(A) := number of elements in A if A is a finite set, and #(A) := ∞ if A has infinitely many elements.
(a) Show that # is finitely additive on P(N).
(b) Given any nonnegative simple function f : N → R, show that

∫f d# = Σ_{n=1}^∞ f(n).

3. This exercise deals with Lebesgue-Stieltjes additive set functions.
(a) Let g be a nondecreasing function on R and let µ_g : I^1 → [0, ∞) be its corresponding Lebesgue-Stieltjes set function defined by µ_g(a, b] = g(b) − g(a). Given any I^1-simple function f = Σ_{k=1}^N a_k χ_{A_k}, where A_k = (x_{k−1}, x_k], show that

∫f dµ_g = Σ_{k=1}^N a_k {g(x_k) − g(x_{k−1})}.
Readers familiar with the Riemann-Stieltjes integral will recognize the right-hand side as a "Riemann-Stieltjes sum"; we'll review Riemann-Stieltjes sums in Section 6.2.
(b) Let g be a continuously differentiable nondecreasing function on R. Prove that for any I^1-simple function f, we have

∫f dµ_g = ∫f g′ dx,

where the right-hand side denotes the Riemann integral of f g′.
4. Let µ : I → [0, ∞] be an additive set function on a semiring I and let g be a nonnegative I-simple function. Define m_g : I → [0, ∞] by

m_g(A) := ∫χ_A g dµ,  for all A ∈ I.

Note that χ_A g is a nonnegative I-simple function (this follows from the fact that simple functions form an algebra), so this integral is defined.
(i) Prove that m_g : I → [0, ∞] is additive.
(ii) Prove that for any I-simple function f, we have (provided the integrals exist)

∫f dm_g = ∫f g dµ.

5. Following [309], we give Bourbaki's¹ proof of Problem 11 in Exercises 1.3. Let R be a ring of subsets of a set X, and let Z_2^X be the ring of Z_2-valued functions on X. (Recall that Z_2 = {0, 1} with addition and multiplication modulo 2.)
(i) Show that, as elements of Z_2^X, we have

χ_{A∆B} = χ_A + χ_B,

where A∆B = (A \ B) ∪ (B \ A) is the symmetric difference of A and B.
(ii) Show that R, with multiplication and addition given by intersection and symmetric difference, respectively, is isomorphic to a subring of Z_2^X.
(iii) Show that R is isomorphic to Z_2^X if and only if R is the power set of X.
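Problem 2(b) is easy to sanity-check on a small example: against the counting measure #, the integral of a nonnegative simple function on N is just the sum of its values. A Python sketch (the particular simple function is our own choice):

```python
def integral_counting(terms):
    """Integral with respect to the counting measure # on P(N):
    sum a_k * #(A_k) over a presentation [(a_k, A_k), ...] with A_k finite."""
    return sum(a * len(A) for a, A in terms)

terms = [(3, {1, 2}), (5, {4})]                  # f = 3*chi_{1,2} + 5*chi_{4}
f = lambda n: sum(a for a, A in terms if n in A)
assert integral_counting(terms) == sum(f(n) for n in range(1, 10)) == 11
```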
2.2. Random variables and (mathematical) expectations

The theory of expectations can be traced back to a letter from Pascal to Fermat on Wednesday, July 29, 1654, on the problem of points. In this section we study expectations (really integrals) from the probabilistic viewpoint.

¹ By the way, "Bourbaki" was the brainchild of a group of French mathematicians started by Henri Paul Cartan (1904–) and André Weil (1906–1998), and is not a real person. Bourbaki was just a pen name used by the group as the "author" of their math books.
2.2.1. Expectation as an expected average. In any experiment we perform, we always try to (1) observe the outcomes of the experiment and (2) assign numerical values (that is, record data) to the outcomes observed. For example, if we roll a die, we can observe and then record the number of dots on the top face of the die. In usual mathematical jargon, we would call (1) and (2) a numerical function on the sample space (because to each element of the sample space, we assign a number). However, in probability jargon, we use the term random variable; thus, a random variable assigns numerical values to the outcomes of an experiment. In this section we consider simple random variables. Let (X, R, µ) be a field of probability, meaning that X is a sample space, R is a ring of observable events containing X, and µ : R → [0, 1] is an additive set function with µ(X) = 1. Let f : X → R be a simple random variable, which is probability jargon for an R-simple function. Thus, f is of the form

f = Σ_{k=1}^N a_k χ_{A_k},

where A_1, ..., A_N ∈ R are pairwise disjoint. Note that for any outcome x ∈ X, f(x) is one of the values a_1, ..., a_N. These values can vary widely and, depending on how big N is, can be quite extensive, so we shall look for a single value that represents the "average value of f." Here are a couple of methods to find this average.

Method 1: Consider briefly the following. Suppose we have a total of n students in a classroom, and suppose there are n_1 students with height h_1, n_2 students with height h_2, ..., and n_N students with height h_N. What is the average height? This is simple; we have

Ave. height = (sum of the heights of every student)/(number of students)
  = (h_1 · n_1 + h_2 · n_2 + ··· + h_N · n_N)/n
  = h_1 · (n_1/n) + h_2 · (n_2/n) + ··· + h_N · (n_N/n)
  = Σ_{k=1}^N h_k (n_k/n)
  = Σ_{k=1}^N h_k · (# of students with height h_k)/(total # of students).

Note that "(# of students with height h_k)/(total # of students)" = "the probability that a randomly selected student has height h_k." Thus,

Ave. height = Σ_{k=1}^N h_k · (probability a student has height h_k).
(Of course, this example involved heights, but it can work for computing averages of most anything you can think of!) Now back to our function f . Note that since f (x) = ak if and only if x ∈ Ak , we can say that f takes the values ak with probability pk = µ(Ak ) where k = 1, 2, . . . , N . Thus, by analogy with our heights example, it seems like the Z N N X X Ave. value of f = ak · p k = ak µ(Ak ) = f, k=1
k=1
R where we used the definition of the integral. Thus, f seems like a good candidate for the “average value of f ”. Method 2: Here’s another, in fact better, method to describe the average value of f . Suppose that we repeat the experiment a large number of times, say n times where n is large, and we note the values of f on each experiment. Recall that on each experiment f takes the values ak with probability pk = µ(Ak ) where k = 1, 2, . . . , N . Thus, intuitively speaking, after doing the experiment n times one would expect that f would take the value ak approximately n · pk number of times. Hence, one would expect that the average value of f over n experiments add up every value of f obtained over n experiments number of experiments n a1 · (np1 ) + a2 · (np2 ) + · · · + aN · (npN ) ≈ . n =
Thus,
$$\text{the average value of } f \text{ over } n \text{ experiments} \approx a_1 p_1 + a_2 p_2 + \cdots + a_N p_N = \sum_{k=1}^{N} a_k \, \mu(A_k).$$
In other words, by definition of the integral, for n large,
$$\text{The expected average value of } f \text{ over } n \text{ experiments} \approx \int f.$$
Of course, as n gets larger and larger, this should get more and more precise! Both Method 1 and Method 2 show us that the number ∫ f summarizes the “average value of f ”. Thus, these thought experiments compel us to define, for any simple random variable f : X → R, the expectation of f by
$$E(f) := \int f.$$
In view of Method 2, we shall interpret E(f ) as the expected average value of f over a large number of experiments. When we study the “Law of Averages” (the Weak Law of Large Numbers) in Section 2.4, see especially Subsection 2.4.3, we show that this interpretation is correct.
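The two methods can be checked against each other numerically. Here is a minimal sketch (with made-up values ak and probabilities pk ) that computes E(f ) as the weighted sum of Method 1 and compares it with the empirical average of Method 2 over many simulated experiments:

```python
import random
from fractions import Fraction

# A hypothetical simple random variable: f takes the value a_k on the
# event A_k, which occurs with probability p_k = mu(A_k).
values = [0, 1, 4]                                         # a_1, a_2, a_3 (made up)
probs = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # p_k, summing to 1

# Method 1: E(f) = a_1*p_1 + ... + a_N*p_N, the integral of the simple function f
expectation = sum(a * p for a, p in zip(values, probs))
print(expectation)                                         # 11/10

# Method 2: average the observed values of f over many repeated experiments
random.seed(0)
n = 200_000
samples = random.choices(values, weights=[float(p) for p in probs], k=n)
print(sum(samples) / n)                                    # close to 1.1
```

The larger n is taken, the closer the simulated average tends to be to E(f ), which is exactly the content of the Weak Law of Large Numbers of Section 2.4.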
2. FINITELY ADDITIVE INTEGRATION
2.2.2. Expectation as an expected gain. As we described above, expected value represents an expected average value over many experiments. However, the idea of expected value was originally used by Pascal in a different sense, namely in the sense of an appropriate amount a gambler should be entitled to if he is not able to continue the game he is playing. In the days of Pascal, the currency in France was the louis d’or, seen on the side.2 This gold coin was struck in 1640 by Louis XIII and it was also called the pistole after a Spanish gold coin used in France since the 1500’s. Here is Pascal’s letter to Fermat on Wednesday, July 29, 1654 [262, p. 547]: This is the way I go about it to know the value of each of the shares when two gamblers play, for example, in three throws, and when each has put 32 pistoles at stake: Let us suppose that the first of them has two (points) and the other one. They now play one throw of which the chances are such that if the first wins, he will win the entire wager that is at stake, that is to say 64 pistoles. If the other wins, they will be two to two and in consequence, if they wish to separate, it follows that each will take back his wager that is to say 32 pistoles. Consider then, Monsieur, that if the first wins, 64 will belong to him. If he loses, 32 will belong to him. Then if they do not wish to play this point, and separate without doing it, the first should say “I am sure of 32 pistoles, for even a loss gives them to me. As for the 32 others, perhaps I will have them and perhaps you will have them, the risk is equal. Therefore let us divide the 32 pistoles in half, and give me the 32 of which I am certain besides.” He will then have 48 pistoles and the other will have 16.
Let’s see mathematically what Pascal is saying. In the second sentence, Pascal argues that we are really in the situation of one throw. Thus, consider the sample space X = {0, 1}, where 0 represents the situation where the first gambler loses the throw and 1 the situation where he wins the throw, and let µ be the probability measure such that µ{0} = µ{1} = 1/2; here we assume the gamblers are of equal ability. Let f : X → R be the random variable representing the first gambler’s gain. Now, according to Pascal’s words,
Consider then, Monsieur, that if the first wins, 64 will belong to him. If he loses, 32 will belong to him,
we have f (0) = 32 and f (1) = 64; that is, f = 32 χ{0} + 64 χ{1} in our usual notation. Then the first gambler says “I am sure of 32 pistoles, for even a loss gives them to me. As for the 32 others, perhaps I will have them and perhaps you will have them, the risk is equal. Therefore let us divide the 32 pistoles in half, and give me the 32 of which I am certain besides.” 2Picture from the wikipedia commons.
In other words, the first gambler claims that his rightful gain is
$$32 + \frac{32}{2} = 48 \text{ pistoles}.$$
Notice that we can write this as
$$\frac{64}{2} + \frac{32}{2} = f(1) \cdot \frac{1}{2} + f(0) \cdot \frac{1}{2} = f(1) \cdot \mu\{1\} + f(0) \cdot \mu\{0\} = \int f.$$
Thus, the gambler’s expected gain is exactly the expected value as we defined it! We can generalize Pascal’s pistole example as follows. Suppose that the first gambler gains a pistoles if he loses and b pistoles if he wins; in this case
$$f(0) = a \quad \text{and} \quad f(1) = b, \qquad \text{or} \qquad f = a\, \chi_{\{0\}} + b\, \chi_{\{1\}}.$$
Then according to Pascal, the first gambler is sure of getting a pistoles, and of what’s left over, namely b − a pistoles, the risk is equal that he will win them or lose them, so Pascal would argue that the gambler’s rightful gain is
$$a + \frac{b - a}{2} = \frac{a + b}{2}.$$
We can write this as
$$\frac{a}{2} + \frac{b}{2} = f(0) \cdot \frac{1}{2} + f(1) \cdot \frac{1}{2} = \int f,$$
again the expected value as an integral. Actually, this generalized pistole example is basically Proposition I of Christiaan Huygens’ (1629–1695) book Libellus de Ratiociniis in Ludo Aleae [137], which is the first book to systematically study expectations. Do you remember Gilles Personne de Roberval (1602–1675), who objected to Pascal’s method of combinations we studied back in Section 1.5.4? (Speaking of Section 1.5, we invite you to solve the problem of points we studied back in that section using expectations, that is, using integrals.) He might object to Pascal’s pistole argument because in reality the gamblers can play more than just one round. In fact, the true sample space is
$$X = \{1,\ 01,\ 00\},$$
representing that the first gambler wins the first toss (1), loses the first toss but wins the second (01), or loses both tosses (00). In this case, the probabilities are
$$\mu\{1\} = \frac{1}{2}, \qquad \mu\{01\} = \mu\{00\} = \frac{1}{4},$$
and the random variable f , representing the first gambler’s pistole winnings, is
$$f(1) = 64, \qquad f(01) = 64, \qquad f(00) = 0.$$
Hence, Roberval would probably accept that the expected gain is
$$E(f) = \int f = \frac{1}{2} \cdot 64 + \frac{1}{4} \cdot 64 + \frac{1}{4} \cdot 0 = 48,$$
which is exactly as before! Consider now the general case of a probability field (X, R, µ) and a simple random variable
$$f = \sum_{k=1}^{N} a_k \, \chi_{A_k},$$
where A1 , . . . , AN ∈ R are pairwise disjoint. Suppose that f represents the gain of a gambler; that is, a1 is the gain if the event A1 occurs, a2 is the gain if the event A2 occurs, and so forth. If we put
$$p_k = \mu(A_k), \qquad k = 1, 2, \ldots, N,$$
then
$$E(f) = \int f = a_1 p_1 + a_2 p_2 + \cdots + a_N p_N = \sum_{k=1}^{N} (\text{gain when } A_k \text{ occurs}) \times (\text{probability } A_k \text{ occurs}). \tag{2.8}$$
There are different ways to understand how E(f ) is the expected gain of the gambler in the sense that E(f ) represents the appropriate amount the gambler should be entitled to if somehow he wouldn’t be able to continue the game. One way is to fall back on our old use of the word expectation, namely that E(f ) represents the average gain of the gambler if he actually does play the game a large number of times. From this viewpoint, it’s reasonable to say that the expected gain of the gambler should be E(f ). Another way to see that E(f ) represents the expected gain is from the viewpoint of weighted averages. Recall that given N numbers x1 , . . . , xN and N “weights” w1 , . . . , wN , nonnegative numbers that sum to one, we define the weighted average (or weighted mean) as the number x1 w1 + · · · + xN wN .
The xi ’s with larger weights contribute more to the sum than the xi ’s with smaller weights. Such weighted averages appear on class syllabi, at least from those classes that give grades. A typical class might assign grades as follows: Homework is 20%, Midterm is 30%, and the Final is 50% of your grade. Thus, if you scored an 80 on Homework, an 89 on the Midterm, and a 95 on the Final, your semester score is 80 · .2 + 89 · .3 + 95 · .5 = 90.2.
Your final exam score helped to boost your semester score above ninety even though your other scores were below ninety. Since the final was weighted more heavily than the other grades, this professor believed that the final exam best measured the understanding of the course. Weighted averages applied to grades combine different grades throughout the semester to give a “rightful” grade. More generally, weighted averages give a “rightful” common value to the xk ’s, taking into account that some xk ’s are judged more important than others. In any case, back to expected gain: observe from (2.8) that E(f ) is a weighted average, where the gain ak is weighted according to its probability pk of occurring. Thus, knowing that weighted averages correspond to a “rightful” value, we see that E(f ) can be interpreted as the gambler’s “rightful” gain.

2.2.3. Examples. We now compute some expectations.

Example 2.1. In our first example, we shall see the difference between the everyday use of expectation and the mathematical use of expectation. Let’s try to win the jackpot of the Canadian Lotto 6/49. We either win or lose, so X = {0, 1} where 0 = lose and 1 = win. We win the jackpot with probability p = 1/13,983,816 (see Problem 11 in
Exercises 1.5). Let’s say the jackpot is $10,000,000 and let f equal 10,000,000 if we win the jackpot and 0 otherwise. Then the mathematical expectation of the amount we win is
$$E(f) = \int f = 10{,}000{,}000 \cdot \frac{1}{13{,}983{,}816} + 0 \cdot \left(1 - \frac{1}{13{,}983{,}816}\right) = 0.715\ldots.$$
Thus, our mathematical expectation is about 72 cents. However, the real amount we expect to win is zero!
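As a quick sanity check, the computation in Example 2.1 can be reproduced in a few lines (the probability 1/13,983,816 comes from choosing 6 numbers out of 49):

```python
import math

# p = 1/13,983,816: the chance of matching all 6 of the 49 numbers
assert math.comb(49, 6) == 13_983_816
p = 1 / 13_983_816

jackpot = 10_000_000
E_f = jackpot * p + 0 * (1 - p)   # expectation of the winnings
print(round(E_f, 3))              # 0.715 -- about 72 cents
```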
Example 2.2. Suppose that we flip a fair coin n times; what is the expected number of heads that we’ll throw? Let X = S^n where S = {0, 1}. Observe that if
$$A_i = S \times \cdots \times S \times \{1\} \times S \times \cdots \times S$$
(there are n − 1 factors of S here), where the {1} is in the i-th factor, then fi = χAi equals 1 if we toss a head on the i-th toss and 0 if we toss a tail on the i-th toss. Note that µ(Ai ) = 1/2 for each i. The function
$$S_n := f_1 + f_2 + \cdots + f_n$$
is the random variable giving the number of heads in n tosses. Therefore, the expected number of heads in n tosses of a coin is
$$\int S_n = \sum_{i=1}^{n} \int f_i = \sum_{i=1}^{n} \mu(A_i) = \sum_{i=1}^{n} \frac{1}{2} = \frac{n}{2},$$
which is exactly as intuition tells us! Observe that
$$\frac{S_n}{n} = \frac{f_1 + f_2 + \cdots + f_n}{n}$$
is the random variable giving the average number of heads in n tosses; its expectation is ∫ (Sn /n) = (1/n) ∫ Sn = (1/n)(n/2) = 1/2, again just as intuition tells us!
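Linearity of expectation is doing the real work in Example 2.2; here is a small sketch that verifies it by brute force over the whole sample space X = S^n (for a modest n, since X has 2^n points):

```python
from itertools import product

# Sample space X = S^n with S = {0, 1}; each outcome has probability 1/2^n.
n = 6
outcomes = list(product([0, 1], repeat=n))
prob = 1 / len(outcomes)                       # uniform: 1 / 2^n

# E(S_n): sum over all outcomes of (number of heads) * (probability)
E_Sn = sum(sum(x) * prob for x in outcomes)
print(E_Sn)                                    # 3.0, i.e. n/2

# Linearity: E(S_n) = E(f_1) + ... + E(f_n) = n * mu(A_i) = n * (1/2)
E_fi = [sum(prob for x in outcomes if x[i] == 1) for i in range(n)]
print(sum(E_fi))                               # 3.0
```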
We shall return to the following example in Section 6.6 when we study the Law of Large Numbers. Example 2.3. (Genoese type lotteries; cf. the Casanova’s lottery problem in Problem 10 in Exercises 1.5) The ideas behind modern-day lotteries come from a lottery held in Genoa, a historic city in northern Italy, which dates from the early 1600’s (see [22, 274, 273] for more on the Genoese lottery). The basics of the Genoese lottery were as follows. 90 tokens labeled with the numbers 1, 2, . . . , 90 were drawn sequentially from a rotating cage, the “wheel of fortune,” in a public place by a blindfolded boy in a blue suit (a common uniform in orphanages).
Figure 2.3. There are two “wheels of fortune” in this lottery in Guildhall, London, 1751. Photo taken from [10, p. 68].

Beforehand, players would choose one, two, three, four, or five particular numbers, and they would win if the numbers they chose matched any of the five numbers drawn.
Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn, one after the other, from a rotating cage (we assume each token is drawn with equal probability); e.g. n = 90 and m = 5 in the Genoese lottery. Observe that
$$X = \{(x_1, \ldots, x_m)\ ;\ x_i \in \{1, \ldots, n\},\ x_i \ne x_j \text{ for } i \ne j\}$$
represents a sample space for the drawing of m tokens from a lot of n tokens, where xi represents the i-th token drawn. X has n(n − 1)(n − 2) · · · (n − m + 1) elements (since for a typical element (x1 , . . . , xm ) ∈ X there are n choices for x1 , n − 1 choices for x2 , etc.). Thus, the probability measure is
$$\mu : P(X) \to [0, 1], \qquad \mu(A) = \frac{\#A}{n(n-1)\cdots(n-m+1)}.$$
If you’re interested, see Problem 10 in Exercises 1.5 for the probabilities of the various ways to win the Genoese lottery. For each i = 1, 2, . . . , m, consider the random variable given by the value of the i-th token drawn:
$$f_i : X \to \mathbb{R} \quad \text{defined by} \quad f_i(x) := x_i,$$
where x = (x1 , . . . , xm ). Then
$$f : X \to \mathbb{R}, \quad \text{where} \quad f := f_1 + f_2 + \cdots + f_m,$$
represents the sum of the values of the randomly drawn tokens. What is the expectation of f ? Since the expectation (integral) is linear, we just have to compute the expectation of each fi . Observe that
$$f_i = \chi_{A_{i1}} + 2\chi_{A_{i2}} + 3\chi_{A_{i3}} + \cdots + n\chi_{A_{in}},$$
where Aik = {(x1 , . . . , xm ) ∈ X ; xi = k}, the event that the number k appears on the i-th draw. Thus,
$$E(f_i) = \int f_i = \mu(A_{i1}) + 2\mu(A_{i2}) + 3\mu(A_{i3}) + \cdots + n\mu(A_{in}).$$
For each k, the set Aik has (n − 1)(n − 2) · · · (n − m + 1) elements (do you see why?), therefore
$$\mu(A_{ik}) = \frac{(n-1)(n-2)\cdots(n-m+1)}{n(n-1)(n-2)\cdots(n-m+1)} = \frac{1}{n}.$$
Another way to see this is to observe that, intuitively, the probability that the number k appears on the i-th draw should be 1/n, since there are n total numbers, each one equally likely, that could appear on the i-th draw. Thus,
$$E(f_i) = \frac{1 + 2 + 3 + \cdots + n}{n} = \frac{n+1}{2},$$
noting that 1 + · · · + n = n(n + 1)/2. Since E(f ) = E(f1 ) + · · · + E(fm ), we conclude that
$$E(f) = \frac{m(n+1)}{2}.$$
For instance, in the Genoese lottery we have n = 90 and m = 5, so
The expected sum of the numbers drawn in the Genoese lottery = 227.5.
◮ Exercises 2.2.
1. (Huygens’ propositions) The following propositions are found in Christiaan Huygens’ (1629–1695) book Libellus de Ratiociniis in Ludo Aleae [137]:
Proposition I: If I expect a or b, and have an equal chance of gaining either of them, my Expectation is worth (a + b)/2.
Proposition II: If I expect a, b, or c, and each of them be equally likely to fall to my Share, my Expectation is worth (a + b + c)/3.
Proposition III: If the number of Chances I have to gain a, be p, and the number of Chances I have to gain b, be q. Supposing the Chances equal; my Expectation will then be worth (ap + bq)/(p + q).
Prove each of these propositions using the mathematical definition of expectations.
2. (Cardano’s game) In Girolamo Cardano’s (1501–1576) book Liber de Ludo Aleae, he writes [212, p. 240]:
Thus, in the case of six dice, one of which has only an ace on one face, and another a deuce, and so on up to six, the total number is 21, which divided by 6, the number of faces, gives 3½ for one throw.
In other words, consider six dice, the first one having a single dot on one side and blanks on the other five sides, the second one having two dots on one side and blanks on the other five sides, and so forth. He says that if you roll all six dice, the expected number of dots rolled is 3½. Can you prove this?
3. (Roulette) An American roulette wheel has the numbers 00, 0, 1, 2, 3, . . . , 36 on its perimeter. The 00 and 0 are in green, and the other numbers have the colors red and black.
A ball is spun on the wheel and it lands on a number.
(i) (Singles) Suppose that you bet on a single number 00, 0, 1, . . . , 36. If the ball lands on your number, you are paid 35 to 1, namely you win 35 times the amount you bet; otherwise you lose the amount you bet. Suppose you bet $1 on a number; if the ball lands on your number you get $35, otherwise you lose your $1. What is the expected amount you will win?
(ii) (Doubles) Suppose that you bet on two numbers. If the ball lands on either number, you are paid 17 to 1. Suppose you bet $1 on doubles. What is the expected amount you will win?
(iii) (Triples) Suppose that you bet on three numbers. If the ball lands on one of your numbers, you are paid 11 to 1. Suppose you bet $1 on triples. What is the expected amount you will win?
(iv) (Reds) Suppose that you bet on reds (or on blacks, or on evens, or on odds, or on high numbers (19–36) or low numbers (1–18)). If the ball lands on reds (or on blacks, or on evens, or on odds, or on high or low numbers), you are paid 1 to 1. Note that 00 and 0 are considered odd if you bet on evens, and even if you bet on odds! Suppose you bet $1 on red. What is the expected amount you will win? (You get the same expected winnings if you bet on blacks or evens or odds or on highs or lows.)
4. (Pascal’s Wager) In this problem we look at “Pascal’s wager,” the primordial example of the modern subject of decision theory. Blaise Pascal (1623–1662) argued that as long as there is a positive probability that God exists, a person should believe in Him. Here are Pascal’s thoughts as quoted in article 233 of Pascal’s Pensées:⁴
Let us then examine this point, and say, “God is, or He is not.” But to which side shall we incline? Reason can decide nothing here. There is an infinite chaos which separated us. A game is being played at the extremity of this infinite distance where heads or tails will turn up. What will you wager? According to reason, you can do neither the one thing nor the other; according to reason, you can defend neither of the propositions. Do not, then, reprove for error those who have made a choice; for you know nothing about it. “No, but I blame them for having made, not this choice, but a choice; for again both he who chooses heads and he who chooses tails are equally at fault, they are both in the wrong. The true course is not to wager at all.” Yes; but you must wager. It is not optional. You are embarked. Which will you choose then? Let us see. Since you must choose, let us see which interests you least. You have two things to lose, the true and the good; and two things to stake, your reason and your will, your knowledge and your happiness; and your nature has two things to shun, error and misery. Your reason is no more shocked in choosing one rather than the other, since you must of necessity choose. This is one point settled. But your happiness? Let us weigh the gain and the loss in wagering that God is. Let us estimate these two chances. If you gain, you gain all; if you lose, you lose nothing. Wager, then, without hesitation that He is. “That is very fine. Yes, I must wager; but I may perhaps wager too much.” Let us see.
Since there is an equal risk of gain and of loss, if you had only to gain two lives, instead of one, you might still wager. But if there were three lives to gain, you would have to play (since you are under the necessity of playing), and you would be imprudent, when you are forced to play, not to chance your life to gain three at a game where there is an equal risk of loss and gain. But there is an eternity of life and happiness. And this being so, if there were an infinity of chances, of which one only would be for you, you would still be right in wagering one to win two, and you would act stupidly, being obliged to play, by refusing to stake one life against three at a game in which out of an infinity of chances there is one for you, if there were an infinity of an infinitely happy life to gain. But there is here an infinity of an infinitely happy life to gain, a chance of gain against a finite number of chances of loss, and what you stake is finite. It is all divided; where-ever the infinite is and there is not an infinity of chances of loss against that of gain, there is no time to hesitate, you must give all. And thus, when one is forced to play, he must renounce reason to preserve his life, rather than risk it for infinite gain, as likely to happen as the loss of nothingness. For it is no use to say it is uncertain if we will gain, and it is certain that we risk, and that the infinite distance between the certainty of what is staked and the uncertainty of what will be gained, equals the finite good which is certainly staked against the uncertain infinite. It is not so, as every player stakes a certainty to gain an uncertainty, and yet he stakes a finite certainty to gain a finite uncertainty, without transgressing against reason. There is not an infinite distance between the certainty staked and the uncertainty of the gain; that is untrue. In truth, there is an infinity between the certainty of
⁴See e.g. http://www.gutenberg.org/ebooks/18269 for the entire text of Pensées.
gain and the certainty of loss. But the uncertainty of the gain is proportioned to the certainty of the stake according to the proportion of the chances of gain and loss. Hence it comes that, if there are as many risks on one side as on the other, the course is to play even; and then the certainty of the stake is equal to the uncertainty of the gain, so far is it from fact that there is an infinite distance between them. And so our proposition is of infinite force, when there is the finite to stake in a game where there are equal risks of gain and of loss, and the infinite to gain. This is demonstrable; and if men are capable of any truths, this is one. Here’s a simplified version of Pascal’s argument; see [57] for a more thorough analysis. We work under the following assumptions: (a) God exists with probability p and doesn’t exist with probability 1 − p. (b) (If He exists,) God rewards those who believe in Him with joy in an eternal afterlife measured by a number J. God “rewards” those who don’t believe in Him with “joy” in an eternal afterlife measured by −A, where A is a positive number representing eternal anguish. (c) Let B be a number representing the amount of joy experienced in life, living as if you believed God exists. (d) Let D be a number representing the amount of joy experienced in life, living as if you didn’t believe God exists. Let Y denote the random variable representing the total amount of joy you will experience, both in this life and the afterlife, if yes, you believe God exists, and let N denote the random variable representing the total amount of joy you will experience, both in this life and the afterlife, if no, you do not believe God exists. Find E(Y ) and E(N ). Pascal argues that it’s reasonable to base our belief in God on which number E(Y ) or E(N ) is larger. Show that E(N ) > E(Y )
⇐⇒ D > p(J + A) + B.
Thus, if p = 0, then your total joy is based strictly on earthly joys and one might as well forget belief in God. However, if p > 0 and J and A are sufficiently large (in fact, Pascal considers J to be infinite), then believing in God is the reasonable option.
5. (cf. [11]) (The birthday problem) What is the expected number of people in a room of n people who share a birthday with at least one other person in the room? We assume that a year has exactly 365 days (forget leap years).
(i) Write down a sample space X and the probability set function µ.
(ii) Let f : X → R be the random variable representing the number of people who share a birthday with at least one other person in the room. Show that $f = \sum_{k=1}^{n} f_k$ where fk = 1 if the k-th person shares a birthday with at least one other person and fk = 0 otherwise. fk is the characteristic function of a set; write down the set explicitly.
(iii) Find E(f ).
(iv) What is the smallest number of people in a room needed so that at least two people (are expected to) share a birthday?
6. (The hat check problem) n people enter a restaurant and their hats are checked in. After dinner the hats are randomly re-distributed back to their owners. What is the expected number of customers who receive their own hats?
(i) Write down a sample space X and the probability set function µ.
(ii) Let f : X → R be the random variable representing the number of people who receive their own hats. Show that $f = \sum_{k=1}^{n} f_k$ where fk = 1 if the k-th customer receives his own hat and fk = 0 otherwise. fk is the characteristic function of a set; write down the set explicitly.
(iii) Find E(f ).
Figure 2.4. If A = (a, b], then $\int \chi_A(x)\,dx = \int_a^b dx = b - a = m(A)$.
2.3. Properties of additive set functions on semirings

In this section we use the properties of the integrals of functions to derive properties of additive set functions. In particular, we show that Lebesgue measure extends from I n to define an additive set function on E n , the ring of elementary figures on Rn , and we study probability models on sequence space. At the end of this section, we give a thorough analysis of the Monkey-Shakespeare experiment.

2.3.1. Application of integration I: Properties. For our first application of integration, we derive some properties of finitely additive set functions on semirings. Given an additive set function µ : I → [0, ∞] on a semiring I and a set A ∈ I , by definition of the integral of the I -simple function f = χA , we have (see Figure 2.4)
$$\mu(A) = \int \chi_A.$$
We can conclude properties of the set function µ by exploiting this formula and the properties of the integral in Theorem 2.2. We remark that since rings and σ-algebras are also semirings, all the properties and definitions that we state for semirings also hold for set functions on rings and σ-algebras.

Properties of additive set functions

Theorem 2.4. Let µ : I → [0, ∞] be additive on a semiring I . Then,
(1) µ is monotone in the sense that if A, B ∈ I and A ⊆ B, then µ(A) ≤ µ(B).
(2) µ is countably superadditive in the sense that if A ∈ I and $\bigcup_{n=1}^{\infty} A_n \subseteq A$ where A1 , A2 , . . . ∈ I are pairwise disjoint, then
$$\sum_{n=1}^{\infty} \mu(A_n) \le \mu(A).$$
(3) µ is finitely subadditive in the sense that if A ∈ I and $A \subseteq \bigcup_{n=1}^{N} A_n$ where A1 , . . . , AN ∈ I , then
$$\mu(A) \le \sum_{n=1}^{N} \mu(A_n).$$
(4) µ is subtractive in the sense that if A, B ∈ I with A ⊆ B, B \ A ∈ I , and µ(A) < ∞, then µ(B \ A) = µ(B) − µ(A).

Proof : Here’s a picture of this theorem, in Figure 2.5. If A ⊆ B are sets in I , then χA ≤ χB . Therefore, by monotonicity of the
integral,
$$\mu(A) = \int \chi_A \le \int \chi_B = \mu(B).$$

Figure 2.5. Left: Properties (1) and (4); Middle: (2); Right: (3).

Let A ∈ I and assume that $\bigcup_{n=1}^{\infty} A_n \subseteq A$, where A1 , A2 , . . . ∈ I are pairwise disjoint. Observe that for any N ∈ N we have $\bigcup_{n=1}^{N} A_n \subseteq A$, so
$$\sum_{n=1}^{N} \chi_{A_n} \le \chi_A,$$
as you can verify. Therefore, by linearity and monotonicity of the integral, we see that
$$\sum_{n=1}^{N} \int \chi_{A_n} = \int \sum_{n=1}^{N} \chi_{A_n} \le \int \chi_A;$$
that is,
$$\sum_{n=1}^{N} \mu(A_n) \le \mu(A).$$
Letting N → ∞ proves the superadditivity property. Let A ∈ I and assume that $A \subseteq \bigcup_{n=1}^{N} A_n$ where A1 , . . . , AN ∈ I . Then observe that
$$\chi_A \le \sum_{n=1}^{N} \chi_{A_n}.$$
Hence, by monotonicity and linearity of the integral, we have
$$\int \chi_A \le \int \sum_{n=1}^{N} \chi_{A_n} = \sum_{n=1}^{N} \int \chi_{A_n} \implies \mu(A) \le \sum_{n=1}^{N} \mu(A_n).$$
Finally, to prove (4), note that χB\A = χB − χA . Integrating both sides we get µ(B \ A) = µ(B) − µ(A), where we used that µ(A) ≠ ∞ so that µ(B) − µ(A) is well-defined.
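For a concrete feel for Theorem 2.4, here is a small sketch checking the four properties for counting measure µ(A) = #A on the power set of a finite set (a σ-algebra, hence in particular a semiring):

```python
# Counting measure on subsets of a finite set: mu(A) = #A.
def mu(A):
    return len(A)

A = {1, 2}
B = {1, 2, 3, 4}

# (1) monotone: A a subset of B implies mu(A) <= mu(B)
assert A <= B and mu(A) <= mu(B)

# (2) superadditive: disjoint pieces inside B have total measure <= mu(B)
pieces = [{1}, {3, 4}]
assert sum(mu(P) for P in pieces) <= mu(B)

# (3) finitely subadditive: a cover of A has total measure >= mu(A)
cover = [{1, 5}, {2, 6}]
assert A <= set.union(*cover) and mu(A) <= sum(mu(C) for C in cover)

# (4) subtractive: mu(B \ A) = mu(B) - mu(A)
assert mu(B - A) == mu(B) - mu(A)
print("Theorem 2.4 holds in this toy example")
```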
We remark that in general we cannot replace N by ∞ in Property (3) as certain examples show; if you can’t wait, see Section 3.2. We also remark that one could prove Theorem 2.4 using only the properties of additive set functions and semirings, and without using any integration theory, however, the proof isn’t so elegant. 2.3.2. Application of integration II: Products. Our second application of integration deals with products of additive set functions. Let µ1 , . . . , µN be additive set functions on semirings I1 , . . . , IN . From Proposition 1.2 we know that the product I1 × · · · × IN is a semiring. We define ω : I1 × · · · × IN → [0, ∞]
by
$$\omega(A_1 \times \cdots \times A_N) := \mu_1(A_1) \cdots \mu_N(A_N)$$
for all “boxes” A1 × · · · × AN ∈ I1 × · · · × IN . Here’s a picture when N = 2: a box A1 × A2 in the plane X1 × X2 , with ω(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 ).
In the product µ1 (A1 ) · · · µN (AN ), we use the conventions that 0 · ∞ := 0 and ∞ · 0 := 0 in case there is a 0 and an ∞ in the product. The set function ω is the product of µ1 , . . . , µN . The main example to keep in mind is Lebesgue measure m on I n = I 1 × · · · × I 1 , which is just the n-fold product of Lebesgue measure on I 1 . We use our integration theory to give a simple proof that ω is additive.

Theorem 2.5. The set function ω : I1 × · · · × IN → [0, ∞] is additive; in words, the product of additive set functions is additive.

Proof : This proof is almost word-for-word the same as the proof of Proposition 1.18! For notational simplicity, we prove this result for only two additive set functions, say µ : I → [0, ∞] and ν : J → [0, ∞], where I and J are semirings on sets X and Y , respectively. We need to prove that ω is additive. Let A × B ∈ I × J , and suppose that
$$A \times B = \bigcup_{k=1}^{N} A_k \times B_k$$
is a disjoint union where Ak × Bk ∈ I × J . Observe that for subsets C ⊆ X and D ⊆ Y , we have
$$\chi_{C \times D}(x, y) = \chi_C(x)\, \chi_D(y), \tag{2.9}$$
and if E and F are disjoint subsets of X × Y ,
$$\chi_{E \cup F} = \chi_E + \chi_F. \tag{2.10}$$
Then by (2.9) we have χA×B (x, y) = χA (x) χB (y) and, since $A \times B = \bigcup_{k=1}^{N} A_k \times B_k$ is a union of pairwise disjoint sets, (2.10) shows that
$$\chi_{A \times B}(x, y) = \sum_{k=1}^{N} \chi_{A_k \times B_k}(x, y).$$
By (2.9), χAk ×Bk (x, y) = χAk (x) χBk (y), so we conclude that
$$\chi_A(x)\, \chi_B(y) = \sum_{k=1}^{N} \chi_{A_k}(x)\, \chi_{B_k}(y).$$
Let us fix x ∈ X, and put a = χA (x) and ak = χAk (x) (thus, each a and ak is either 0 or 1). Then the above equality is just
$$a\, \chi_B(y) = \sum_{k=1}^{N} a_k\, \chi_{B_k}(y).$$
Both sides are J -simple functions, so integrating both sides of this equality, we obtain
$$a\, \nu(B) = \sum_{k=1}^{N} a_k\, \nu(B_k),$$
or, after substituting a = χA (x) and ak = χAk (x), we get
$$\nu(B)\, \chi_A(x) = \sum_{k=1}^{N} \nu(B_k)\, \chi_{A_k}(x).$$
If ν(B) and each ν(Bk ) are finite, then both sides of this equality are I -simple functions, so we can integrate both sides of the equality, obtaining
$$\mu(A)\, \nu(B) = \sum_{k=1}^{N} \mu(A_k)\, \nu(B_k).$$
On the other hand, if ν(B) or any ν(Bk ) is infinite, then it is straightforward to check that this equality still holds. Thus, $\omega(A \times B) = \sum_{k=1}^{N} \omega(A_k \times B_k)$, which proves our result.
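Theorem 2.5 is easy to see numerically: take two “length” set functions on half-open intervals and check the additivity of ω on a concrete disjoint decomposition of a box. (This sketch just verifies one instance; it is not a proof.)

```python
from fractions import Fraction

# mu and nu are both "length" on half-open intervals (a, b].
def length(interval):
    a, b = interval
    return Fraction(b) - Fraction(a)

# omega(A x B) = mu(A) * nu(B) on boxes
def omega(box):
    A, B = box
    return length(A) * length(B)

# Cut A x B = (0,2] x (0,3] into four disjoint boxes by the lines x = 1, y = 1.
big = ((0, 2), (0, 3))
pieces = [((0, 1), (0, 1)), ((1, 2), (0, 1)),
          ((0, 1), (1, 3)), ((1, 2), (1, 3))]

assert sum(omega(p) for p in pieces) == omega(big)   # 1 + 1 + 2 + 2 = 6
print(omega(big))                                    # 6
```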
2.3.3. Probability set functions on sequence space. We begin by reviewing sequence space from Section 1.3.3. Given a countable number of sample spaces X1 , X2 , . . ., sequence space is the set of all infinite sequences:
$$X = \{(x_1, x_2, x_3, \ldots)\ ;\ x_i \in X_i \text{ for all } i\},$$
which can be denoted by
$$X = X_1 \times X_2 \times X_3 \times X_4 \times \cdots = \prod_{i=1}^{\infty} X_i.$$
X represents the sample space for a countable number of experiments performed in sequence, where X1 is the sample space of the first experiment, X2 the sample space for the second experiment, and so on.

Example 2.4. If
$$X_1 = X_2 = \cdots = S = \{(j, k)\ ;\ j, k = 1, \ldots, 6\},$$
then X = S^∞ is a sample space for an infinite sequence of rolls of two dice; e.g. the sequence of rolls (6, 3), (4, 5), (1, 5), . . .. If Xi = (0, 1] for each i, then X = (0, 1]^∞ is a sample space for picking an infinite sequence of points from the interval (0, 1].
Assume that we are given probability set functions: µ1 : I1 → [0, 1] , µ2 : I2 → [0, 1] , µ3 : I3 → [0, 1] , . . . ,
where Ii is a semiring on Xi (thus, Xi ∈ Ii and µi (Xi ) = 1). For instance, for the dice case in Example 2.4, we can put µi = µ0 : P(S) → [0, 1] for all i, where
$$\mu_0(A) = \frac{\#A}{36} \quad \text{for all } A \subseteq S.$$
Thus, we are working with fair dice. In the case Xi = (0, 1] in Example 2.4 we can put µi = Lebesgue measure on the left-half open intervals in (0, 1]. A natural question is: Is there an “obvious” probability set function on X defined using µ1 , µ2 , µ3 , . . .? Well, it is not so obvious how to assign a probability to an arbitrary subset of X, but for some subsets it’s clear; these subsets are the
cylinder sets, which we introduced back in Section 1.3.3.⁵ Recall that a cylinder set is a subset A ⊆ X that, for some n ∈ N, can be written as
$$A = A_1 \times A_2 \times \cdots \times A_n \times X_{n+1} \times X_{n+2} \times X_{n+3} \times \cdots \subseteq X_1 \times X_2 \times \cdots \times X_n \times X_{n+1} \times X_{n+2} \times X_{n+3} \times \cdots \tag{2.11}$$
for some events Ai ∈ Ii . The set A represents the event that A1 occurs on the first trial, A2 occurs on the second trial, . . ., An occurs on the n-th trial, not caring what happens after the n-th one. What is the probability of the event A occurring? To answer this, consider the case when n = 2; then we are asking: What is the probability of the event A1 occurring on the first trial and A2 on the second trial (not caring about what happens afterwards)? Thinking intuitively, without going too much into the details: if we think of µ1 (A1 ) as the fraction of times the event A1 occurs and µ2 (A2 ) as the fraction of times the event A2 occurs, then it makes sense that the product⁶ µ1 (A1 ) · µ2 (A2 ) is the fraction of times the event A1 followed by A2 will occur. More generally, with A as above, it makes sense that the
$$\text{Probability of } A = \mu_1(A_1) \cdot \mu_2(A_2) \cdots \mu_n(A_n).$$
Since 0 ≤ µi (Ai ) ≤ 1 for each i, the product on the right is also in [0, 1]. This discussion motivates the following. Let C denote the collection of all cylinder sets in $\prod_{i=1}^{\infty} X_i$ and define µ : C → [0, 1] by
(2.12)
µ(A) := µ1 (A1 ) · µ2 (A2 ) · · · µn (An )
for A as in (2.11).
The set function µ is called the infinite product of µ1, µ2, . . . . This definition is very similar to the product of finitely many additive set functions considered in Theorem 2.5. Note that putting A1 = X1, A2 = X2, . . . , An = Xn in (2.11), we see that X ∈ C, and by definition of µ, we have

(2.13)  µ(X) = µ1(X1) · µ2(X2) · · · µn(Xn) = 1 · 1 · · · 1 = 1.
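To make (2.12) concrete, here is a small computational sketch (not part of the text): using the fair-dice factor µ0(A) = #A/36 from above on every coordinate, µ of a cylinder set is just the product of the factor probabilities. The specific events below are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Concrete factor space: outcomes of a roll of two fair dice,
# with mu0(A) = #A / 36, as in the dice case of Example 2.4.
S = list(product(range(1, 7), range(1, 7)))

def mu0(A):
    """Probability of an event A in a single trial."""
    return Fraction(len(A), len(S))

def mu_cylinder(constraints):
    """Formula (2.12): mu(A1 x ... x An x S x S x ...) = mu0(A1) * ... * mu0(An)."""
    result = Fraction(1)
    for A in constraints:
        result *= mu0(A)
    return result

# Cylinder event: doubles on trial 1, sum 7 on trial 2, anything afterwards.
doubles = [s for s in S if s[0] == s[1]]
sum7 = [s for s in S if s[0] + s[1] == 7]
print(mu_cylinder([doubles, sum7]))  # 1/36, i.e. (6/36)*(6/36)
print(mu_cylinder([S]))              # 1, matching mu(X) = 1 in (2.13)
```

Since each factor probability lies in [0, 1], so does the product, matching the discussion above.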
The infinite product measure

Proposition 2.6. The infinite product set function µ : C → [0, 1] is a finitely additive probability set function.
⁵ It turns out that the "obvious" probability set function defined as in (2.12) below cannot be extended to a (finitely additive) probability set function on P(X); this has to do with the existence of nonmeasurable sets, which we briefly mentioned in Section 1.1.3.
⁶ Here, we are actually imposing the condition that the events A1, A2 be "independent," which roughly speaking means that knowledge of the occurrence of the first event A1 does not affect the probability of the second event A2 occurring. We shall return to independence in Section ?.
2.3. PROPERTIES OF ADDITIVE SET FUNCTIONS ON SEMIRINGS
Figure 2.6. Given an additive set function µ : I → [0, ∞], does µ have an extension ν : R(I) → [0, ∞]? Since I ⊆ R(I), it's a natural question to ask if we can extend µ from the semiring I to the generally much larger collection R(I). Theorem 2.7 says "yes."
Proof : Recall that in Proposition 1.8 we proved C forms a semiring by reducing the proof to the product of finitely many semirings (Proposition 1.2). In a very similar way, we can prove that µ : C → [0, 1] is finitely additive by reducing it to Theorem 2.5 where we proved that the product of finitely many additive set functions is additive. We shall leave the details to you if you’re interested.
2.3.4. Application of integration III: Extensions. In Theorem 2.7 below we apply integration theory to show that any additive set function on a semiring can be extended to the generated ring; see Figure 2.6. The primary examples are extending Lebesgue measure from Iⁿ, the left-half open boxes, to the elementary figures Eⁿ in Rⁿ, and the infinite product measure from the cylinder sets to the ring generated by the cylinder sets. If µ : I → [0, ∞] is an additive set function on a semiring I, and if R is a ring containing I, then an additive set function ν : R → [0, ∞] is called an extension of µ if ν(A) = µ(A) for all A ∈ I.

The Semiring Extension Theorem

Theorem 2.7. If µ : I → [0, ∞] is an additive set function on a semiring I, then there is a unique additive set function on the ring R(I) generated by I that extends µ. We denote this (unique) set function by µ again and call it the extension of µ to R(I).

Proof: We already know that

(2.14)  µ(A) = ∫ χ_A for all A ∈ I.

The idea is to simply define the extension by this same formula for any A ∈ R(I)! Of course, we have to prove that (2.14) (i) is defined for each A ∈ R(I), (ii) is additive, and (iii) is the unique additive set function on R(I) that extends µ.

Step 1: Given A ∈ R(I), let's show that the function χ_A is an I-simple function; this implies that ∫ χ_A is defined. In fact, we know by Theorem 1.5 that A = ⋃_n An, a finite union where A1, A2, . . . ∈ I are pairwise disjoint. Thus, by the equality (2.10) we used in Theorem 2.5, we see that

χ_A = Σ_n χ_{An}.

Therefore, χ_A is an I-simple function. In particular, we can define

µ(A) := ∫ χ_A for all A ∈ R(I).
Note that this formula is consistent with the formula (2.14) when A ∈ I.

Step 2: Let A ∈ R(I) and assume that A = ⋃_n An, a finite union of pairwise disjoint sets A1, A2, . . . ∈ R(I). Then by the equality (2.10) we used in Theorem 2.5, we see that χ_A = Σ_n χ_{An}. Therefore, by linearity of the integral,

(2.15)  µ(A) = ∫ χ_A = ∫ Σ_n χ_{An} = Σ_n ∫ χ_{An} = Σ_n µ(An).

Thus, µ is indeed finitely additive on R(I).

Step 3: Let ν : R(I) → [0, ∞] be finitely additive and assume that ν = µ on I; we shall prove that ν = µ on R(I). Indeed, by Theorem 1.5 we can write A = ⋃_n An, a finite union where A1, A2, . . . ∈ I are pairwise disjoint. Then by finite additivity, we have

ν(A) = Σ_n ν(An) = Σ_n µ(An),

where we used that ν = µ on I. On the other hand, the sum Σ_n µ(An) is exactly µ(A) by (2.15). Thus, ν = µ and our proof is complete.
As an easy corollary, we obtain

Extensions of familiar set functions

Corollary 2.8.
(1) For each n, Lebesgue measure m on Iⁿ extends uniquely to an additive set function on the ring of elementary figures Eⁿ.
(2) The Lebesgue-Stieltjes measure µ_f on I¹ of any right-continuous nondecreasing function f : R → R extends uniquely to an additive set function on the ring of elementary figures E¹.
(3) The infinite product µ : C → [0, 1] on cylinder sets of probability set functions on the factors of a sequence space extends uniquely to a finitely additive probability set function µ : R(C) → [0, 1].

Technically speaking, the Semiring Extension Theorem only guarantees that the infinite product measure µ : C → [0, 1] has a unique extension µ : R(C) → [0, ∞]. However, since µ(X) = 1 (we showed this in (2.13)), by monotonicity we have

µ(A) ≤ µ(X) = 1 for all A ∈ R(C).

Thus, µ : R(C) → [0, 1].
2.3.5. On Monkeys and Shakespeare. Back in Section 1.2.4 we described the Monkey-Shakespeare experiment, which we briefly review. Choose your favorite Shakespeare passage (or any other passage for that matter) and let N be the number of symbols the passage consists of. For example, N = 632 for the sonnet "Shall I compare thee to a summer's day?" Put a monkey in front of a typewriter and let him hit the keyboard N times, remove the paper, put in a new paper, have him hit the
keyboard N more times, remove the paper, etc., repeating this process infinitely many times. If we count a "success" (= 1) when the monkey types the passage and a "failure" (= 0) when he doesn't, then the sample space for this experiment is the space of Bernoulli sequences S^∞ where S = {0, 1}; e.g.,
page 1 fails, page 2 fails, page 3 fails, page 4 fails, page 5 success, page 6 fails, · · ·  ⇄  (0, 0, 0, 0, 1, 0, . . .).
Assume that on any given trial, the probability of a success is a constant p ∈ (0, 1) so that the probability of a failure is 1 − p. Thus, µ0 : P(S) → [0, 1] is defined by µ0 {1} = p , µ0 {0} = 1 − p.
(Of course, µ0(∅) = 0 and µ0(S) = µ0{0, 1} = 1.) For example, assuming that the keyboard can make 100 different symbols (for a nice round number), that there are a total of N symbols in your favorite Shakespeare passage, and that each symbol is equally likely to be typed, we have

p = (1/100)^N = 1/10^{2N}.
Thus, p = 1/100^{632} = 1/10^{1264} for Shakespeare's sonnet 18. Let µ : C → [0, 1] be the infinite product of µ0 with itself, where C is the semiring of cylinder subsets of S^∞. Then by the Semiring Extension Theorem, we know that µ extends to a probability set function µ : R(C) → [0, 1]. Here is a

Question: For each n ∈ N, what is the probability that the monkey will type your passage within the first n pages?

Let An ⊆ S^∞ be the event that the monkey types your passage within the first n pages. Then Anᶜ is the event that the monkey does not type your passage in the first n pages, and hence

Anᶜ = {0} × {0} × · · · × {0} (n times) × S × S × S × · · · .
Thus, Anᶜ ∈ C ⊆ R(C) and so, as R(C) is a ring and X ∈ C, we have An = X \ Anᶜ ∈ R(C). In particular, the probability µ(An) is defined. Now by definition of µ, we have

µ(Anᶜ) = (µ0{0})^n = (1 − p)^n.

Therefore, by subtractivity,

µ(An) = µ(X \ Anᶜ) = µ(X) − µ(Anᶜ) = 1 − (1 − p)^n,

or in words, the

Probability the passage will be typed within the first n pages = 1 − (1 − p)^n.

For example, consider the situation p = 1/10^{1264} as in Shakespeare's sonnet 18. The probability that the monkey will type Sonnet 18 within the first 1 googol pages (where a googol is by definition 10^{100}) is

1 − (1 − 1/10^{1264})^{10^{100}}.
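For moderate parameters, the probability 1 − (1 − p)^n can be evaluated numerically; here is a sketch (not part of the text, with made-up values). The sonnet's own p = 10^{−1264} underflows ordinary floating point entirely, so for it one compares exponents instead.

```python
from math import expm1, log1p

def prob_typed(p, n):
    """1 - (1-p)^n, computed stably as -expm1(n*log1p(-p)) to avoid
    catastrophic cancellation when the probability is tiny."""
    return -expm1(n * log1p(-p))

# Made-up moderate parameters: success probability 1e-12 per page, 1e9 pages.
print(prob_typed(1e-12, 10**9))  # ≈ 1e-3, since 1-(1-p)^n ≈ n*p when n*p << 1

# The sonnet's numbers (p = 10^-1264, n = 10^100) underflow floats entirely,
# so there one compares exponents: log10(n*p) = 100 - 1264 = -1164.
print(100 - 1264)  # -1164, the exponent in the estimate 10^-1164 below
```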
Some estimates show that⁸ this number is approximately 10^{−1164}; that is, 0.000 . . . 0001 . . . with 1164 zeros after the decimal point. To summarize: It is essentially impossible that the monkey will type Sonnet 18 within the first 1 googol pages. Let's ask another

Question: How many pages must the monkey type in order to have at least a 1% chance of typing Shakespeare's sonnet 18?

We want to find n so that

1 − (1 − p)^n ≥ 1/100.
Rearranging and taking logarithms we obtain

(1 − p)^n ≤ 99/100  ⟹  (1 − p)^{−n} ≥ 100/99  ⟹  n log(1 − p)^{−1} ≥ log(100/99)  ⟹  n ≥ log(100/99)/log(1 − p)^{−1}.

Thus, the answer is that we need at least

log(100/99)/log(1 − p)^{−1} pages.

We can get an accurate estimate of the right-hand side as follows. We first use calculus to show that if 0 ≤ x ≤ 1/2, then⁹

(1 − x)^{−1} ≤ e^{2x}.

Taking logarithms, we obtain

log(1 − x)^{−1} ≤ 2x  ⟹  1/log(1 − x)^{−1} ≥ 1/(2x).
Hence, noting that log(100/99)/2 = 0.005025 . . . ≥ 0.005, it follows that we need more than

log(100/99)/(2p) ≥ (5 × 10^{−3})/p pages.
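As a numerical sanity check of this bound (a sketch, not part of the text; the per-page probability p below is a made-up value, since the sonnet's actual p underflows floating point):

```python
from math import ceil, log, log1p

def pages_for_chance(p, chance=0.01):
    """Smallest n with 1-(1-p)^n >= chance, i.e.
    n >= log(1/(1-chance)) / log((1-p)^-1)."""
    return ceil(log(1.0 / (1.0 - chance)) / -log1p(-p))

p = 1e-6  # made-up per-page success probability
n = pages_for_chance(p)
print(n)         # 10051: the exact smallest page count for a 1% chance
print(5e-3 / p)  # 5000.0: the cruder lower bound 5*10^-3 / p derived above
```

The exact count is about twice the crude bound, consistent with log(100/99)/2 ≥ 0.005 being a factor-of-two underestimate of log(100/99).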
For example, if p = 1/10^{1264}, just to have a 1% chance of typing the entire sonnet 18 of Shakespeare, the monkey must type more than

(5 × 10^{−3})/10^{−1264} = 5 × 10^{1261} pages,

which is quite a lot of pages. For perspective, the number of atoms in the observable universe is¹⁰ approximately 10^{80}, so the number of pages is around 10^{1181} times the number of atoms in the known universe. Here's another

Question: How many years must the monkey type in order to have at least a 1% chance of typing Shakespeare's sonnet 18?

⁸ Exercise: Try to find out how I got this!
⁹ Exercise: Try to prove this!
¹⁰ See http://en.wikipedia.org/wiki/Observable_universe
To answer this question we have to make some assumptions about how fast the monkey can type. Let's be overly generous and assume that he can type 10 pages per minute (this is quite fast: 6320 symbols per minute if he types Shakespeare's sonnet 18, consisting of 632 symbols!). Then in one year he can type (let's forget leap years for simplicity)

10 pages/minute × 60 minutes/hour × 24 hours/day × 365 days/year = 5.256 × 10^6 pages/year.

Thus, to have just a 1% chance of producing the desired text (which he has a probability p of producing on a single page) it will take the monkey approximately

(5 × 10^{−3})/p · 1/(5.256 × 10^6) ≈ 10^{−9}/p years  (The Monkey Equation),

where, again, we assumed that he can type 10 pages of that text in a minute. So, for example, if p = 1/10^{1264}, it will take approximately 10^{1255} years to have a measly 1% chance of producing sonnet 18. Some say the age of the universe is estimated to be 13.7 × 10^9 years¹¹, so it will take the monkey on the order of magnitude of 10^{1245} universe ages to have just a 1% chance of typing Shakespeare's sonnet 18! So, basically, within the current estimates of the age of the universe, the monkey doesn't have a chance of typing sonnet 18! Even if every particle in the universe were a monkey, say we had 10^{80} monkeys to help type, it would still take 10^{1255}/10^{80} = 10^{1175} years! We will return to the Monkey-Shakespeare problem in Section ?

◮ Exercises 2.3.

1. Let µ : R → [0, ∞] be an additive function on a ring R. Given sets A and B in R, prove that

µ(A ∪ B) + µ(A ∩ B) = µ(A) + µ(B).

If you draw a Venn diagram of A ∪ B, do you see why this formula is "obvious"? Warning: Keep in mind that µ may take the value ∞, and in this case you should never subtract two quantities, because you could end up with a nonsense statement like ∞ − ∞. Suggestion: A slick way to prove the formula is to first prove that χ_{A∪B} + χ_{A∩B} = χ_A + χ_B, then integrate this formula.

2. Let µ : R → [0, ∞] be an additive function on a ring R. Given sets A, B, C in R, prove that

µ(A ∪ B ∪ C) + µ(A ∩ B) + µ(B ∩ C) + µ(A ∩ C) = µ(A) + µ(B) + µ(C) + µ(A ∩ B ∩ C).

Suggestion: A slick way to prove this formula is to write D = A ∪ B ∪ C and observe that

1 − χ_D = χ_{Dᶜ} = χ_{Aᶜ∩Bᶜ∩Cᶜ} = χ_{Aᶜ} χ_{Bᶜ} χ_{Cᶜ} = (1 − χ_A)(1 − χ_B)(1 − χ_C).

Multiply out this formula, then use integration.

¹¹ See http://en.wikipedia.org/wiki/Age_of_the_universe
3. (Inclusion-Exclusion formula) Generalize the previous exercises as follows. Let µ : R → [0, ∞) be an additive function on a ring R. Given sets A1, . . . , AN in R, prove that

µ(A1 ∪ · · · ∪ AN) = Σ_{n=1}^N µ(An) − Σ_{1≤i<j≤N} µ(Ai ∩ Aj) + Σ_{1≤i<j<k≤N} µ(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{N+1} µ(A1 ∩ · · · ∩ AN).

2.4. Bernoulli's Theorem (The WLLNs) and expectations

Bernoulli's Theorem: For each ε > 0,

lim_{n→∞} µ{x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| < ε} = 1.

This result is also called the Weak Law of Large Numbers. The "weak" in this title distinguishes the result from the "Strong Law of Large Numbers" that we'll prove in Section 6.6; the strong law is "stronger" because it implies the weak law but not vice versa. Note that since

{x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| < ε} = X \ {x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| ≥ ε},

Bernoulli's Theorem is equivalent to the statement that

lim_{n→∞} µ{x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| ≥ ε} = 0.
2.4.2. Proof of the weak law of large numbers. My favorite way to prove this theorem is to transform it into a problem involving integrals of functions on X instead of measures of subsets of X, and then follow Pafnuty Lvovich Chebyshev's (1821–1894) 1867 proof of the law of large numbers
Pafnuty Chebyshev (1821–1894).
[56] (see [262] for a translation). For each i, consider the random variable that observes a head on the i-th toss:

fi : X → R defined by fi(x) := xi = { 1 if xi = 1 ; 0 if xi = 0 },

where x = (x1, x2, x3, . . .). The function fi is really just a C-simple function; for, let Ai ⊆ X be the event that on the i-th toss we flip a head:

(2.16)  Ai = S × S × S × · · · × S × {1} × S × · · · ∈ C,

where the {1} occurs in the i-th slot. Then

fi = χ_{Ai},

the characteristic function of Ai. We let

Sn = f1 + f2 + · · · + fn,

which is the simple random variable that observes the total number of heads in n tosses. Note that

{x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| ≥ ε} = {x ∈ X ; |Sn(x)/n − p| ≥ ε}.

Following most probabilists, we always simplify set notation as in

Probabilist Set Notation: {x ∈ X ; Property(x)} = {Property} (i.e., drop "x").

Then with this notation in mind, we write

{x ∈ X ; |(x1 + x2 + · · · + xn)/n − p| ≥ ε} = {|Sn/n − p| ≥ ε}.

Thus, Bernoulli's theorem can be written as . . .

Bernoulli's theorem: Function Version

Theorem 2.10. For each ε > 0,

lim_{n→∞} µ{|Sn/n − p| ≥ ε} = 0.

This is Bernoulli's theorem transformed into a statement involving functions on X. Observe that

{|Sn/n − p| ≥ ε} = {|Sn − np| ≥ nε} = {(Sn − np)² ≥ n²ε²}.

Hence, we are left to prove that

lim_{n→∞} µ{(Sn − np)² ≥ n²ε²} = 0.
This limit turns out to be an easy consequence of Chebyshev's inequality; Chebyshev stated and then used (a similar) inequality in his 1867 proof of the law of large numbers. We remark that an earlier version (1853) of the inequality is due to Irénée-Jules Bienaymé (1796–1878), so the inequality is sometimes called the Bienaymé–Chebyshev inequality. Yet another name for the inequality is Markov's
inequality, named after Andrei Andreyevich Markov (1856–1922), who was a student in Chebyshev's classes. We shall see several reincarnations of Chebyshev's inequality in the sequel.

Chebyshev's inequality, Version I

Lemma 2.11. If ν : R → [0, ∞] is a finitely additive set function on a ring of subsets of a set X with X ∈ R, and f is a nonnegative R-simple function, then for any constant α > 0,

ν{f ≥ α} ≤ (1/α) ∫ f dν.

Proof: Here's a picture showing why Chebyshev's inequality is "obvious": draw the graph of f with a horizontal line at height α over the set {f ≥ α}; it's "obvious" that the area of that rectangle is at most the area under f, that is, α · ν{f ≥ α} ≤ ∫ f.

First of all, from Exercise 1 we know that A := {x ∈ X ; f(x) ≥ α} ∈ R, so that ν(A) is defined. Now, by definition of A, α ≤ f on the set A, and since f is nonnegative we have χ_A f ≤ f. Thus,

α χ_A ≤ χ_A f ≤ f,

so by monotonicity of the integral,

∫ α χ_A dν ≤ ∫ χ_A f dν ≤ ∫ f dν.

By definition of the integral, ∫ α χ_A dν = α ν(A), and hence α ν(A) ≤ ∫ f dν, which is equivalent to Chebyshev's inequality.
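A quick numeric check of Lemma 2.11 (a sketch, not part of the text): model a nonnegative simple function by a finite list of values carried on sets of given ν-measures (all made-up numbers) and compare both sides of the inequality.

```python
# f takes value values[i] on a set of measure masses[i]; check that
# nu{f >= alpha} <= (1/alpha) * integral of f, for several alpha.
def integral(values, masses):
    return sum(v * m for v, m in zip(values, masses))

def superlevel_measure(values, masses, alpha):
    return sum(m for v, m in zip(values, masses) if v >= alpha)

values = [0.0, 1.0, 3.0, 7.0]   # values of the simple function f
masses = [0.4, 0.3, 0.2, 0.1]   # nu-measures of the sets where f takes them

for alpha in [0.5, 1.0, 2.0, 5.0]:
    lhs = superlevel_measure(values, masses, alpha)
    rhs = integral(values, masses) / alpha
    assert lhs <= rhs, (alpha, lhs, rhs)
print("Chebyshev's inequality holds for every alpha tested")
```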
Since Sn − np = f1 + · · · + fn − np χ_X is a sum of simple functions, and simple functions form an algebra (Lemma 2.1), it follows that (Sn − np)² is a simple function. Thus, by Chebyshev's inequality, we have

µ{(Sn − np)² ≥ n²ε²} ≤ (1/(n²ε²)) ∫ (Sn − np)².

Although we don't have to, we shall evaluate the right-hand integral using the functions

(2.17)  Ri := fi − p,  i = 1, 2, 3, . . . ,

which are related to the Rademacher functions introduced in 1922 by Hans Rademacher (1892–1969);¹² see the exercises for various properties of these functions. Observe that

Sn − np = f1 + f2 + · · · + fn − np = (f1 − p) + (f2 − p) + · · · + (fn − p) = R1 + R2 + · · · + Rn,

¹² If p = 1/2, then 2Ri, i = 1, 2, . . . , are the original Rademacher functions.
so

µ{(Sn − np)² ≥ n²ε²} ≤ (1/(n²ε²)) ∫ (R1 + R2 + · · · + Rn)².

To evaluate the right-hand side, observe that

(R1 + R2 + · · · + Rn)² = (R1 + · · · + Rn)(R1 + · · · + Rn) = Σ_{i,j=1}^n Ri Rj.

Hence,

(1/(n²ε²)) ∫ (R1 + R2 + · · · + Rn)² = (1/(n²ε²)) Σ_{i,j=1}^n ∫ Ri Rj.
Since Ri = fi − p and ∫ fi = ∫ χ_{Ai} = µ(Ai) = p (see the definition of Ai in (2.16)), and ∫ 1 = ∫ χ_X = µ(X) = 1, we see that

∫ Ri Rj = ∫ (fi − p)(fj − p) = ∫ (fi fj − p fi − p fj + p²)
        = ∫ fi fj − p ∫ fi − p ∫ fj + p² ∫ 1
        = ∫ fi fj − p² − p² + p²
        = ∫ fi fj − p².

By definition of fi, we have fi fj = χ_{Ai} · χ_{Aj} = χ_{Ai ∩ Aj}, so

∫ fi fj = µ(Ai ∩ Aj) = { p if i = j ; p² if i ≠ j },

as you can check using the expression for Ai in (2.16). Thus,

∫ Ri Rj = { p − p² if i = j ; 0 if i ≠ j }.

Finally, we conclude that
(2.18)  µ{(Sn − np)² ≥ n²ε²} ≤ (1/(n²ε²)) Σ_{i,j=1}^n ∫ Ri Rj ≤ (1/(n²ε²)) Σ_{i=1}^n (p − p²) = (1/(n²ε²)) · n · (p − p²) = (p − p²)/(nε²).

This shows that

lim_{n→∞} µ{(Sn − np)² ≥ n²ε²} = 0,

which completes the proof of Bernoulli's Theorem.
2.4.3. Expectations revisited. Let µ0 : I → [0, 1] be a probability set function on a semiring I of subsets of a sample space S. Given a simple random variable f : S → R, recall that the expectation of f, E(f) := ∫ f, was interpreted as the "expected average value of f over a large number of experiments." We can now make this precise! To do so, let X := S^∞, the sample space for repeating the experiment modeled by S an infinite number of times, and let

µ : R(C) → [0, 1]

be the infinite product of µ0 with itself, where C is the collection of cylinder subsets of X generated by the semiring I on each factor S of X. Given an infinite sequence of outcomes x = (x1, x2, x3, . . .) ∈ X, note that f(xk) is the value of f on the k-th outcome of the infinite sequence of experiments, so

(f(x1) + f(x2) + f(x3) + · · · + f(xn))/n

is exactly the average value of f on the first n experiments for a given x ∈ X. Our intuitive notion of expectation suggests that

(f(x1) + f(x2) + f(x3) + · · · + f(xn))/n ≈ E(f),

where the larger n is, the better this approximation should be. This is exactly correct, as Theorem 2.12 below shows! To set up this theorem, for each i define

fi : X → R by fi(x1, x2, . . .) := f(xi),

so fi represents the observation of the random variable f on the i-th experiment.

The Expectation Theorem

Theorem 2.12. For each ε > 0,

lim_{n→∞} µ{|(f1 + f2 + f3 + · · · + fn)/n − E(f)| < ε} = 1.
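Before turning to the proof, here is a quick simulation sketch (not part of the text) illustrating Theorem 2.12 with a fair-die random variable, whose expectation is (1 + 2 + · · · + 6)/6 = 3.5:

```python
import random

rng = random.Random(0)  # seeded so the run is reproducible
n = 100_000
rolls = [rng.randint(1, 6) for _ in range(n)]  # f observed on n experiments
print(sum(rolls) / n)  # ≈ 3.5 = E(f) for a fair die
```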
The proof of this theorem is very similar to the proof of Bernoulli's theorem; so similar, in fact, that we leave it as an excellent exercise to test whether you understood the proof of Bernoulli's theorem; see Problem 3.

2.4.4. Experimental verification of Bernoulli's Theorem. On pages 109–113 of his classic probability book [293], Uspensky lists eight examples showing the experimental verification of Bernoulli's Theorem. My favorite example is Buffon's needle problem,¹³ studied by Georges Louis Leclerc, Comte de Buffon (1707–1788), who considered the following needle experiment in 1777, quoted from page 112 of [293]:

One of the most striking experimental tests of Bernoulli's theorem was made in connection with a problem considered for the first time by Buffon. A board is ruled with a series of equidistant parallel lines, and a very fine needle, which is shorter than the distance between lines, is thrown at random on the board. Denoting by ℓ the length of the needle and by h the distance between lines, the probability that the needle will intersect one of the lines (the other

¹³ Note: "Buffon" is not the same as "Buffoon," which is a clownish-type person.
Georges Buffon (1707–1788).
possibility is that the needle will be completely contained within the strip between two lines) is found to be

p = 2ℓ/(πh).

The remarkable thing about this expression is that it contains the number π = 3.14159 . . . expressing the ratio of the circumference of a circle to its diameter.
See Problem 5 for a proof that the probability the needle will cross a line is indeed 2ℓ/(πh), as stated.
Therefore, by Bernoulli’s Theorem if we throw a needle a large number of times, then the ratio between the number of times the needle crosses a line and the total number of throws should be close to 2ℓ/hπ. One such experiment was conducted by Johann Rudolf Wolf (1816–1893) between 1849 and 1853. In his experiment, ℓ = 36 and h = 45 (in millimeters), so the theoretical probability that a needle crosses a line is 2ℓ 72 = = 0.5093 . . . . hπ 45π He threw the needle 5000 times and it crossed a line 2532 times giving a ratio 2532 = 0.5064, 5000 not far off from the true probability! By the way, this approximation gives an probabilistic method to determine π! Indeed, P =
2ℓ hπ
=⇒
π=
2ℓ . PL
Hence, Wolf’s experiment show that π≈
72 2ℓ = = 3.15955766 . . . , 0.5064h 0.5064 · 45
not a bad approximation. In fact, Wolf’s original motivation to do his experiment was to find π via Bernoulli’s theorem; here is what he said (quoted from [235]): In the well-known work “One Million Facts” (Lalanne 1843) I found the following note that attracted my highest attention: “On a plane surface draw a sequence of parallel, equally spaced straight lines; take an absolutely cylindrical needle of length a, less than the constant interval d that separates the parallels, and drop it randomly a great number of times on the surface covered by the lines. If one counts the total number q of times the needle has been dropped and notes the number p of times the needle crosses with any one of the parallels, the quantity 2aq : pd will express the ratio π of circumference and diameter all the more precisely the more trials that have been made.”
By the way, the Buffon needle experiment is perhaps the first Monte Carlo Method, which is a general term describing any method that uses random experiments to approximate solutions to mathematical problems.
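The needle experiment itself is easy to simulate; here is a Monte Carlo sketch (not part of the text), using the crossing criterion y < ℓ sin θ from Problem 5 and Wolf's dimensions ℓ = 36, h = 45:

```python
import math
import random

def buffon_estimate_pi(throws, length=36.0, spacing=45.0, seed=0):
    """Simulate Buffon's needle (length < spacing) and recover pi from the
    observed crossing frequency P via pi = 2*length/(P*spacing)."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(throws):
        y = rng.uniform(0.0, spacing)      # lowest end's distance to the line above
        theta = rng.uniform(0.0, math.pi)  # angle of the needle
        if y < length * math.sin(theta):   # crossing criterion of Problem 5
            crossings += 1
    P = crossings / throws
    return 2 * length / (P * spacing)

print(buffon_estimate_pi(10**6))  # ≈ 3.14, up to Monte Carlo error
```

With only 5000 throws, as in Wolf's experiment, the estimate typically lands within a few hundredths of π, matching the accuracy he reported.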
Another experimental verification deals with the Genoese lottery presented back in Example 2.3. In that experiment, we perform a lottery by randomly drawing five tokens from ninety, and then we sum the five numbers drawn. Thus,

S = {(s1, . . . , s5) ; si ∈ {1, . . . , 90}, si ≠ sj for i ≠ j},

and we define

f : S → R, where f(s1, s2, s3, s4, s5) := s1 + s2 + s3 + s4 + s5,

which represents the sum of the values of the randomly drawn tokens. We computed E(f) = 227.5. Now perform an infinite sequence of lotteries, where in each lottery we randomly draw five tokens from ninety and sum the five numbers drawn. From the Expectation Theorem, if we observe a large number n of lotteries, for a given sequence of outcomes (x1, x2, . . .) we would expect that the arithmetic mean

(f(x1) + f(x2) + · · · + f(xn))/n

should not differ much from the expected value 227.5. In the book Wahrscheinlichkeitstheorie, Fehlerausgleichung, Kollektivmaßlehre [62], Emanuel Czuber (1851–1925) carefully gathered data on 2,854 Genoese-type lotteries that operated in Prague between 1754 and 1886. If you look in [293, p. 187] you'll find a very large table listing his results. From this table you can compute that the arithmetic mean of the sum of the five tokens drawn in the 2,854 lotteries is 227.67, not far from the theoretical value of 227.5!

◮ Exercises 2.4.

1. Let R be a ring of subsets of a set X such that X ∈ R and let f : X → R be an R-simple function. Prove that for any α ∈ R,

{f ≥ α} ∈ R and {f < α} ∈ R.
2. (Poisson's Theorem) Let p1, p2, p3, . . . be real numbers with 0 < pn < 1 for all n. Consider an infinite sequence of, say, coin tosses, and assume that the probability of obtaining a head on the n-th toss is pn. In Bernoulli's Theorem, pn = p for all n, but now we are allowing the probability to change depending on the toss. By imitating the proof of Bernoulli's theorem, prove Siméon-Denis Poisson's (1781–1840) Theorem: For each ε > 0,

lim_{n→∞} µ{x ∈ X ; |(x1 + x2 + · · · + xn)/n − (p1 + p2 + · · · + pn)/n| < ε} = 1,

where µ is the infinite product of µ1, µ2, µ3, . . . , with µi assigning the probability pi to heads on the i-th toss.

3. (The Expectation theorem) In this problem we prove Theorem 2.12. We shall use the notation explained in Subsection 2.4.3.
(i) We first rewrite the statement of the expectation theorem. For each i define fi : X → R by

fi(x1, x2, x3, . . .) := f(xi) for all (x1, x2, x3, . . .) ∈ X.

Show that fi : X → R is a C-simple random variable. Suggestion: Write f = Σ_k a_k χ_{A_k}, where the A_k's (belonging to I) are pairwise disjoint, and let

A_{ik} = S × S × S × · · · × S × A_k × S × · · · ∈ C,

where the A_k occurs in the i-th slot. Show that fi = Σ_k a_k χ_{A_{ik}}.
(ii) Define Sn := f1 + · · · + fn. Show that the expectation theorem is equivalent to the following statement: For each ε > 0,

lim_{n→∞} µ{|Sn/n − p| ≥ ε} = 0,

where p := E(f).
(iii) Now follow the proof of Bernoulli's theorem to prove the Expectation Theorem.
(iv) Let S = (0, 1] and let µ0 : I → [0, 1] be Lebesgue measure on I = left-half open intervals in (0, 1]. Define f : S → R as follows: If x ∈ (0, 1], write x = 0.x1x2x3 . . . in decimal notation; if x has two decimal expansions, take the one that does not terminate. Define f(x) = x1 = tenth place digit of x; thus, f(0.123 . . .) = 1, f(0.987 . . .) = 9, f(0.9) = f(0.8999 . . .) = 8, etc. Show that f : S → R is an I-simple random variable. Now if we sample numbers in (0, 1] "at random" and average their tenth digits, what should these averages approach as the number of samples increases? Use the Expectation Theorem to make your answer rigorous.

4. Here is a different proof of Bernoulli's Theorem. Below we use the notation p, µ, etc., as in the proof of Bernoulli's Theorem in the main text, and we put Sn = f1 + f2 + · · · + fn, where the fi's are as before.
(i) Prove that if 0 ≤ k ≤ n, then µ{Sn = k} = C(n, k) p^k q^{n−k}, where q = 1 − p.
(ii) Given ε > 0, prove that

µ{Sn/n ≥ p + ε} = Σ_{k=m+1}^n C(n, k) p^k q^{n−k},

where m = ⌊n(p + ε)⌋, the largest integer ≤ n(p + ε).
(iii) Given λ > 0, by expanding (p e^{λq} + q e^{−λp})^n via the binomial theorem, prove that

µ{Sn/n ≥ p + ε} ≤ e^{−λnε} (p e^{λq} + q e^{−λp})^n.

(iv) Prove that for any x ∈ R, e^x ≤ x + e^{x²}.
(v) Prove that

(p e^{λq} + q e^{−λp})^n ≤ (p e^{λ²q²} + q e^{λ²p²})^n ≤ e^{λ²n}.

Taking λ = ε/2, show that µ{Sn/n ≥ p + ε} ≤ e^{−nε²/4}.
(vi) Conclude that µ{Sn/n ≥ p + ε} → 0 as n → ∞. Assuming that µ{Sn/n ≤ p − ε} → 0 as n → ∞, whose proof is analogous to the preceding one, prove Bernoulli's Theorem.

5. (Buffon's needle problem) A floor is ruled with horizontal parallel lines at distances h apart from each other. A needle of length ℓ < h, so thin that it can be considered to be a line segment, is thrown on the floor so that it is equally likely to land on any part of the floor. In this problem we consider the question: What is the probability that the needle will intersect one of the lines? To answer this question, proceed as follows.
Figure 2.7. Buffon's needle problem: two examples of a thrown needle, showing the line spacing h, the needle length ℓ, the lowest end point p, its distance y to the line above, and the needle's angle θ.
(i) For a needle thrown on the floor, let p denote the lowest lying end point of the needle; see Figure 2.7 for a couple of examples. If the needle lands parallel to the horizontal lines, let p denote the left end point of the needle. Let y, where 0 ≤ y < h, be the distance between p and the horizontal line immediately above it. Let θ be the angle of the needle from the horizontal line passing through p, as in Figure 2.7. Show that the needle crosses a parallel line if and only if y < ℓ sin θ. Because of this relation, we can think of the rectangle

X = {(θ, y) ; 0 ≤ θ < π, 0 ≤ y < h}

as the sample space for the needle experiment, and we can think of the event that the needle crosses a line as the subset A ⊆ X given by

A = {(θ, y) ∈ X ; y < ℓ sin θ}.

(ii) If X were a finite set, then we would interpret the statement that when the needle is thrown on the floor "it is equally likely to land on any part of the floor" to mean that

Probability the needle crosses a line = #A/#X.

Unfortunately, A and X are infinite sets, so the right-hand side is not defined. However, if we interpret # as "area" we get a perfectly well-defined right-hand side. Thus, we shall define

Probability the needle crosses a line := (Area of A)/(Area of X).

Determine the areas of X and A (use the area interpretation of the Riemann integral to find the area of A; see Figure 2.8). Finally, prove that the probability the needle crosses a line is 2ℓ/(πh).
Figure 2.8. The region A is the area under the curve y = ℓ sin θ, 0 ≤ θ < π; the region X is the h × π rectangle containing A.

6. (Another Buffon needle problem) A floor is ruled with horizontal parallel lines at distances h apart from each other. A needle of length ℓ > h, so thin that it can be considered to be a line segment, is thrown on the floor so that it is equally likely to land on any part of the floor. Show that the probability the needle will intersect a line is

(2ℓ/(hπ))(1 − cos θ0) + (π − 2θ0)/π,

where θ0 = arcsin(h/ℓ).

7. (Equivalence of S^∞ and [0, 1]) In this problem we relate the infinite product measure for Bernoulli sequences to Lebesgue measure on the real line; this is due to Hugo Steinhaus (1887–1972) [269]. Fix a natural number b ≥ 2. Let S = {0, 1, . . . , b − 1} and let µ0 : P(S) → [0, 1] assign "fair" probabilities µ0(A) = #A/b. Let S^∞ denote the space of Bernoulli sequences and let µ : R(C) → [0, 1] denote the infinite product of µ0 with itself, where C is the collection of cylinder subsets of S^∞.
(i) x = (x1 , x2 , . . .) ∈ S ∞ is terminating if there is an N ∈ N such that xk = 0 for all k ≥ N . Prove that the set of all terminating elements of S ∞ is countable. (ii) Define f : S ∞ → [0, 1] by associating to each element (x1 , x2 , x3 , . . .) ∈ S ∞ the real number whose b-adic expansion is 0.x1 x2 x3 . . .; that is, x1 x2 x3 (2.19) f (x1 , x2 , x3 , . . .) := + 2 + 3 + ··· . b b b Show that f is onto but is not bijective because each rational number in (0, 1] can be expressed as the image of a terminating and non-terminating element of S ∞ . (Use well-known facts concerning b-adic representations of real numbers.) Show if T ⊆ S ∞ denotes the subset of all terminating elements of S ∞ , then restricting f off the countable set T gives a bijection f : S ∞ \ T → (0, 1].
This shows in particular that S^∞ is uncountable, although a direct proof of this fact isn't hard.
(iii) Let x₁, . . . , x_N ∈ S and let
\[
(2.20) \qquad A = \{x_1\} \times \cdots \times \{x_N\} \times S \times S \times \cdots \in C.
\]
Prove that
\[
f(A) = \frac{1}{b^N}\,(k, k+1] = \Big( \frac{k}{b^N}, \frac{k+1}{b^N} \Big],
\]
where k = b^{N−1}x₁ + b^{N−2}x₂ + · · · + x_N, and that
\[
\mu(A) = m(f(A)),
\]
where m denotes Lebesgue measure. Thus, f maps events in S^∞ of the form (2.20) into b-adic intervals and the measure of the event equals the Lebesgue measure of the corresponding b-adic interval.
(iv) Prove that A ∈ S(C) if and only if f(A) ∈ B, in which case µ(A) = m(f(A)).
(v) It might be interesting to note that f can be expressed in terms of Rademacher functions. Let R₁, R₂, . . . be the Rademacher functions defined in (2.17) with p = 1/2. Show that
\[
f(x) - \frac{1}{2} = \sum_{i=1}^{\infty} \frac{R_i(x)}{2^i}
\]
for all x ∈ S^∞.
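Part (iii) is easy to illustrate numerically. The sketch below (the helper names are ours, not the book's) computes, for b = 2, the b-adic interval attached to a cylinder set and checks that its length equals the cylinder's measure 1/b^N and that its left endpoint is the value of f on the terminating sequence x₁ . . . x_N 000 . . .:

```python
from fractions import Fraction

def f_partial(digits, b=2):
    """Value of f on the terminating sequence whose first digits are `digits`
    (all later digits zero), computed exactly with rational arithmetic."""
    return sum(Fraction(x, b**(i + 1)) for i, x in enumerate(digits))

def cylinder_interval(digits, b=2):
    """The b-adic interval (k/b^N, (k+1)/b^N] that f maps the cylinder
    A = {x1} x ... x {xN} x S x S x ... onto, per part (iii)."""
    N = len(digits)
    k = sum(x * b**(N - 1 - i) for i, x in enumerate(digits))
    return Fraction(k, b**N), Fraction(k + 1, b**N)

digits = [1, 0, 1]                  # the cylinder {1} x {0} x {1} x S x ...
lo, hi = cylinder_interval(digits)
print(lo, hi, hi - lo)              # interval length = mu(A) = 1/2^3
print(f_partial(digits))            # left endpoint 0.101 (base 2) = 5/8
```

The interval length hi − lo is exactly 1/b^N, matching µ(A) = m(f(A)).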
8. Let S^∞, where S = {0, 1}, be the sample space for an infinite sequence of coin tosses. Let R₁, R₂, . . . be the Rademacher functions defined in (2.17) with p = 1/2 and let
\[
s_n(x) := R_1(x) + \cdots + R_n(x).
\]
Prove that e^{t s_n} : S^∞ → R is a simple function and that
\[
\int e^{t s_n}\,d\mu = \left( \frac{e^{t/2} + e^{-t/2}}{2} \right)^n = \Big( \cosh\frac{t}{2} \Big)^n.
\]
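Problem 8 can be sanity-checked by brute force for small n. The definition (2.17) is not reproduced in this excerpt; the stated identity forces each R_i to take the values ±1/2 with probability 1/2, so the sketch below assumes R_i(x) = x_i − 1/2 (all names are ours):

```python
import math
from itertools import product

def expectation_exp_tsn(n, t):
    """E[e^{t s_n}] over all fair-coin sequences of length n, where
    s_n = R_1 + ... + R_n and we take R_i(x) = x_i - 1/2 (values +-1/2).
    Each length-n outcome has probability 2^-n."""
    return sum(math.exp(t * sum(xi - 0.5 for xi in x)) / 2**n
               for x in product((0, 1), repeat=n))

n, t = 6, 1.3
print(expectation_exp_tsn(n, t), math.cosh(t / 2)**n)   # the two agree
```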
2.5. De Moivre, Laplace and Stirling star in The Normal Curve The normal curve, the protagonist of this section, is ubiquitous in mathematics and nature. Its use in probability was first discovered by Abraham de Moivre (1667–1754) and has since then been one of the cornerstones of probability theory; in fact, without it one may argue that the field of statistics would not exist. It’s said to show up in areas as diverse as daily maximum and minimum air temperature, real estate prices, IQ scores, body temperature, heights of adult people, adult body weight, shoe size, stock market analysis, heart rates, kinetic theory, population
Figure 2.9. In 1914, Prof. Albert Blakeslee (1874–1954) arranged 175 military cadets in a histogram according to their heights for this photo. This photo is an example of a “living histogram” and it has been used in many genetics textbooks.
dynamics, and on and on; in this section we shall discuss one of the most famous mathematical reasons for its ubiquity, the De Moivre-Laplace Theorem.
2.5.1. What me, normal? Consider the following experiment: Ask a non-mathematically inclined friend to think about the heights of all the students in the university. Without a doubt, he would imagine the heights of most students clustered around some average value, with fewer and fewer students at heights further from this average. In essence your friend is assuming a "bell curve" for heights. See Figure 2.9 for a real-life bell curve. More generally, the bell curve is likely to show up in situations where most data points tend to cluster around some average value and there are fewer data points at the extremes. The technical name for the bell curve is the normal density function, which is the function
\[
\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},
\]
where µ is referred to as the mean and σ the standard deviation; here's a picture when µ = 5 and for various σ:
Thus, φ is the famous "bell curve"; µ is its center and σ measures how much φ spreads from the mean: the smaller σ is, the more concentrated φ is near µ, while the larger σ is, the more spread out φ is from µ. See Section 6.4 for more on standard deviations. The normal density was first discovered by Abraham de Moivre (1667–1754) as early as 1721 [304], although its discovery is sometimes attributed to Carl F. Gauss (1777–1855) (and hence the normal density is sometimes called the Gaussian density), who wrote a paper on error analysis involving the normal density in 1809, almost 90 years after De Moivre; see [221, 194, 195].
One of the many interesting aspects surrounding the normal density is its explicit dependence on π, the ratio of the circumference of a circle to its diameter. Now what does π (dealing with circles) have to do with probability? Beats me, but it does! This mysterious relationship was noted by the 1963 Nobel laureate Eugene Paul Wigner (1902–1995) in a famous paper “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” [308], who wrote the following concerning the appearance of π in the normal density: There is a story about two friends, who were classmates in high school, talking about their jobs. One of them became a statistician and was working on population trends. He showed a reprint to his former classmate. The reprint started, as usual, with the Gaussian distribution and the statistician explained to his former classmate the meaning of the symbols for the actual population, for the average population, and so on. His classmate was a bit incredulous and was not quite sure whether the statistician was pulling his leg. “How can you know that?” was his query. “And what is this symbol here?” “Oh,” said the statistician, “this is pi.” “What is that?” “The ratio of the circumference of the circle to its diameter.” “Well, now you are pushing your joke too far,” said the classmate, “surely the population has nothing to do with the circumference of the circle.” Naturally, we are inclined to smile about the simplicity of the classmate’s approach. Nevertheless, when I heard this story, I had to admit to an eerie feeling because, surely, the reaction of the classmate betrayed only plain common sense.
Of course, there are many functions that can give a "bell-like curve," so why the normal density in particular? In fact, one can give mathematical arguments demonstrating why! One argument is the De Moivre-Laplace theorem to be discussed later; another is more of a heuristic argument involving error analysis that we'll present now, and it is essentially contained in Robert Adrain's (1775–1843) 1808 paper [2]. It seems that Adrain, one of the few great American mathematicians in the early 19th century, was the first to publish an error analysis "derivation" of the normal density in his work on the method of least squares. Carl F. Gauss (1777–1855) produced a similar result one year later in 1809 [194, 195]. We remark that you will often find Adrain's derivation called the Herschel-Maxwell derivation, after Sir John Herschel (1792–1871) and James Clerk Maxwell (1831–1879), who gave similar derivations in 1850 and 1860, respectively [140, Ch. 7]. Here's (our interpretation of) Adrain's argument. Say that you want to measure the position of an object, like a star, and place the star at the origin in the plane. We can also think of this as a dart game: The bull's-eye is the star and a measurement of the star's location is like throwing a dart at the dart board; where the dart hits is where we measure the star. We shall speak in this dart language henceforth. We shall also make assumptions as we proceed, but the basic idea is that the darts should cluster near the bull's-eye with fewer and fewer hits far away from the bull's-eye; this, of course, is exactly the situation that should produce a bell curve. Consider the probability that the dart hits a small region of the dart board dA = dx · dy shown here:
[Figure: an infinitesimally small rectangle dA on the dart board, with horizontal side dx and vertical side dy.]
Assume that for some function φ : R → [0, ∞), given a point (x, y) ∈ R², the probability that the x-coordinate of the dart lies in an (infinitesimally small) interval of length dx around x is φ(x) dx and the probability that the y-coordinate of the dart lies in an (infinitesimally small) interval of length dy around y is φ(y) dy; this gives the probability that the dart lies in dA as¹⁴
\[
\phi(x)\,dx \cdot \phi(y)\,dy = \phi(x)\,\phi(y) \cdot dx\,dy.
\]
In other words, we can consider φ(x) φ(y) as the probability per unit area that the dart hits the point (x, y); thus, φ(x) is the probability per unit length that the x-coordinate of the thrown dart is x (with a similar interpretation for φ(y)). Thus, for example, considering φ(x), given any interval I ⊆ R, the probability that the x-coordinate of the dart lands in I is
\[
(2.21) \qquad \int_I \phi(x)\,dx.
\]
It's reasonable to assume that the probability depends only on the distance from the origin and not on how the axes are oriented; so, for example, the probability we hit an area immediately around the point (1, 0) is the same as the probability that we hit areas immediately around (0, 1), (−1, 0) and (0, −1). Thus, if we introduce polar coordinates (r, θ), where x = r cos θ and y = r sin θ, then φ(x) φ(y) is a function depending only on r. Taking the partial derivative of φ(x) φ(y) with respect to θ, we conclude that¹⁵
\[
\phi'(x)\,\frac{\partial x}{\partial \theta}\,\phi(y) + \phi(x)\,\phi'(y)\,\frac{\partial y}{\partial \theta} = 0.
\]
Recalling that x = r cos θ and y = r sin θ, we see that ∂x/∂θ = −y and ∂y/∂θ = x. Hence,
\[
-\phi'(x)\, y\, \phi(y) + \phi(x)\,\phi'(y)\, x = 0.
\]
This implies that
\[
(2.22) \qquad \frac{\phi'(x)}{x\,\phi(x)} = \frac{\phi'(y)}{y\,\phi(y)}.
\]
The left-hand side of (2.22) is a function of x only while the right-hand side is a function of y only; thus¹⁶
\[
\frac{\phi'(x)}{x\,\phi(x)} = \frac{\phi'(y)}{y\,\phi(y)} = C,
\]
¹⁴ Here we are assuming what is called "independence" of the x and y-coordinates.
¹⁵ You can also take the partial derivative with respect to x, then with respect to y, noting that the partial of r with respect to x is x/r and with respect to y is y/r . . . try it!
¹⁶ Can you prove that if f(x) = g(y) for all x, y ∈ R, then f(x) = C and g(y) = C for some constant C?
for some constant C ∈ R. Hence, φ(x) satisfies the ordinary differential equation
\[
\phi'(x) = C\,x\,\phi(x),
\]
whose solution is φ(x) = A e^{Cx²/2} for another constant A. Assuming that the probability the dart hits the board a very far distance from the origin should be close to zero, we conclude that C < 0 (for if C = 0, then φ(x) = A, a constant, and if C > 0, then φ(x) = A e^{Cx²/2} would grow exponentially as x gets larger). Hence, we can write C = −1/σ² for some σ > 0. Thus,
\[
\phi(x) = A\, e^{-\frac{x^2}{2\sigma^2}}.
\]
In Problems 4 and 6 you will prove that
\[
(2.23) \qquad \int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2} \qquad\text{or}\qquad \int_{\mathbb{R}} e^{-x^2}\,dx = \sqrt{\pi},
\]
which is called the probability integral (also the Gaussian integral or Laplace's integral), and which is where π enters the picture. The proofs of (2.23) in the exercises use nothing of measure theory, just basic Riemann integration, and in Section 6.1 we'll return to the proofs using Lebesgue integration. Now, the probability that the x-coordinate of the thrown dart is some real number is 1. Thus, recalling the formula (2.21), it follows that
\[
1 = \int_{\mathbb{R}} \phi(x)\,dx = A \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2}}\,dx.
\]
Replacing x with xσ√2, it follows that
\[
1 = A\sigma\sqrt{2} \int_{-\infty}^{\infty} e^{-x^2}\,dx = A\,\sigma\sqrt{2\pi}.
\]
Thus, A = 1/(σ√(2π)) and ta-da, we get the normal density with mean µ = 0:
\[
\phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}.
\]
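The probability integral (2.23) is easy to sanity-check numerically before the rigorous proofs in the exercises. Here is a quick midpoint-rule sketch in Python (the function name, step size, and cutoff are our choices, not the book's):

```python
import math

def gauss_integral(h=1e-3, cutoff=10.0):
    """Midpoint-rule approximation of int_0^infinity e^{-x^2} dx;
    the tail beyond `cutoff` contributes less than e^{-cutoff^2}."""
    n = int(cutoff / h)
    return h * sum(math.exp(-((i + 0.5) * h)**2) for i in range(n))

print(gauss_integral(), math.sqrt(math.pi) / 2)   # both about 0.886227
```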
We can interpret this discovery as "Random errors distribute themselves normally"; this is one way to state the so-called Normal law of errors. We remark that Pierre-Simon Laplace (1749–1827) was probably the first to try and rigorously evaluate the probability integral; this was done in the historic 1774 paper Mémoire sur la probabilité des causes par les évènemens [159] (Memoir on the probability of the causes of events).¹⁷
¹⁷ Although Laplace might have been the first to evaluate the probability integral, as we said earlier, Abraham de Moivre was the first to discover the normal density in probability; see the articles [221, 4, 69] for the exciting detective-type story! Also, through my own personal researches I think that the first to indirectly state the value of the probability integral was James Stirling (1692–1770), who in his famous 1730 book Methodus Differentialis [289, p. 127] explicitly said that Γ(1/2) = √π, where Γ is the Gamma function. See Theorem 6.3 for the relation of Γ(1/2) to the probability integral.
2.5.2. De Moivre, Stirling, Wallis, and the Binomial distribution. For the rest of this section we work in the set-up we had for the weak law of large numbers. Thus, let S = {0, 1} be the sample space of an experiment with 1 a success, occurring with probability p, and 0 a failure, occurring with probability q := 1 − p, and let X := S^∞ be the sample space of an infinite sequence of trials of the experiment. Let µ : R(C) → [0, 1] be the infinite product measure on the ring R(C) generated by the cylinder subsets C of X. The De Moivre-Laplace theorem begins with the seemingly innocent

The Binomial distribution

Theorem 2.13. In a sequence of n Bernoulli trials, with a probability p of success on each trial and probability q = 1 − p of failure, the probability of obtaining exactly k successes is given by
\[
b(k; n, p) := \binom{n}{k} p^k q^{n-k}, \qquad 0 \le k \le n,
\]
and b(k; n, p) = 0 otherwise.
Proof: That b(k; n, p) = 0 "otherwise" is clear, so assume that 0 ≤ k ≤ n. Let I ⊆ {1, 2, . . . , n} consist of exactly k elements and let A_I ⊆ X be the set
\[
A_I = \{a_1\} \times \{a_2\} \times \cdots \times \{a_n\} \times S \times S \times S \times \cdots,
\]
where a_i = 1 if i ∈ I and a_i = 0 if i ∉ I. Then, as exactly k of the a_i's equal 1 (and n − k of them equal 0), we have µ(A_I) = p^k (1 − p)^{n−k}. If A is the event that we get k successes in the first n trials (without regard to the trials on which they occur), then A = ⋃_I A_I, where the union is over all subsets I ⊆ {1, 2, . . . , n} consisting of k elements. Since the A_I's are pairwise disjoint, µ(A_I) = p^k (1 − p)^{n−k} for each I, and the number of subsets of {1, 2, . . . , n} consisting of exactly k elements is \binom{n}{k}, it follows that
\[
\mu(A) = \sum_I \mu(A_I) = \underbrace{p^k (1-p)^{n-k} + p^k (1-p)^{n-k} + \cdots + p^k (1-p)^{n-k}}_{\binom{n}{k} \text{ terms}} = \binom{n}{k} p^k (1-p)^{n-k}.
\]
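The counting argument in the proof can be checked by brute force for small n: sum µ(A_I) = p^k q^(n−k) over every outcome sequence with exactly k ones and compare with the closed form. A sketch (names ours):

```python
import math
from itertools import product

def prob_k_successes(n, k, p):
    """Sum p^k q^(n-k) over all length-n 0/1 sequences with exactly k ones,
    mirroring the disjoint-union counting argument in the proof."""
    q = 1 - p
    return sum(p**sum(x) * q**(n - sum(x))
               for x in product((0, 1), repeat=n) if sum(x) == k)

n, k, p = 8, 3, 0.3
print(prob_k_successes(n, k, p), math.comb(n, k) * p**k * (1 - p)**(n - k))
```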
The function b(n, p) : Z → R, defined by k ↦ b(k; n, p), is called the binomial mass function.

Example 2.5. Suppose that we flip a fair coin n times. What is the probability that we obtain exactly k heads, where 0 ≤ k ≤ n? By our theorem, the answer is
\[
b(k; n, 0.5) = \binom{n}{k} \Big(\frac{1}{2}\Big)^k \Big(\frac{1}{2}\Big)^{n-k} = \binom{n}{k} \frac{1}{2^n}.
\]
In particular, if n = 2k, the probability of half the throws resulting in heads is
\[
b(k; 2k, 0.5) = \binom{2k}{k} \frac{1}{2^{2k}} = \frac{(2k)!}{(k!)^2}\,\frac{1}{2^{2k}}.
\]
For instance, if n = 100 and k = 50, the answer is
\[
\frac{100!}{(50!)^2}\,\frac{1}{2^{100}}.
\]
If anyone can guess what this number equals to (say) three decimal places, you're the next Thomas Fuller¹⁸ (1710–1790). (The answer is 0.079589, accurate up to 6 decimal places.)
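With exact integer arithmetic the number quoted above is a one-liner to verify; a quick Python check:

```python
from fractions import Fraction
import math

# b(50; 100, 0.5) = C(100, 50) / 2^100, computed exactly, then rounded
b50 = Fraction(math.comb(100, 50), 2**100)
print(float(b50))   # 0.0795892..., i.e. 0.079589 to six decimal places
```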
As this example shows, although we have a nice formula for these probabilities, computationally it's basically useless! Thus, for any non-trivial application it's important to be able to approximate binomial coefficients. De Moivre was the first to give a useful approximation for b(k; 2k, 0.5) using Stirling's formula; we'll get to this later. For the moment, we want to mention a simpler formula that does the job, called Wallis' formula, named after John Wallis (1616–1703) who proved it in 1656, and given by
\[
\frac{\pi}{2} = \prod_{n=1}^{\infty} \frac{2n}{2n-1}\cdot\frac{2n}{2n+1} = \frac{2}{1}\cdot\frac{2}{3}\cdot\frac{4}{3}\cdot\frac{4}{5}\cdot\frac{6}{5}\cdot\frac{6}{7}\cdot\frac{8}{7}\cdot\frac{8}{9}\cdot\frac{10}{9}\cdot\frac{10}{11}\cdots.
\]
Here, the infinite product on the right-hand side of π/2 is, by definition,
\[
\prod_{n=1}^{\infty} \frac{2n}{2n-1}\cdot\frac{2n}{2n+1} := \lim_{n\to\infty} \prod_{k=1}^{n} \frac{2k}{2k-1}\cdot\frac{2k}{2k+1} = \lim_{n\to\infty} \frac{2}{1}\cdot\frac{2}{3}\cdot\frac{4}{3}\cdot\frac{4}{5}\cdot\frac{6}{5}\cdot\frac{6}{7}\cdots\frac{2n}{2n-1}\cdot\frac{2n}{2n+1}.
\]
The proof of Wallis' formula is very elementary, using just basic Riemann integration techniques, so we will leave its proof as a must-do exercise (see Problem 3); if you are willing to wait, we'll also prove Wallis' formula in Section 6.1 using Lebesgue's theory. We can write Wallis' formula as
\[
(2.24) \qquad \sqrt{\pi} = \lim_{n\to\infty} \frac{1}{\sqrt{n}} \prod_{k=1}^{n} \frac{2k}{2k-1} = \lim_{n\to\infty} \frac{1}{\sqrt{n}}\, \frac{2\cdot 4\cdots(2n)}{1\cdot 3\cdots(2n-1)}.
\]
Indeed, observe that Wallis' first formula can be written as
\[
\frac{\pi}{2} = \lim_{n\to\infty} \Big( \frac{2}{1}\cdot\frac{4}{3}\cdots\frac{2n}{2n-1} \Big)^2 \frac{1}{2n+1},
\]
so that
\[
\sqrt{\pi} = \lim_{n\to\infty} \sqrt{\frac{2}{2n+1}}\, \prod_{k=1}^{n} \frac{2k}{2k-1} = \lim_{n\to\infty} \frac{1}{\sqrt{n}}\,\frac{1}{\sqrt{1 + 1/2n}}\, \prod_{k=1}^{n} \frac{2k}{2k-1}.
\]
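Wallis' product converges quite slowly, which is easy to see numerically; a quick sketch (function name ours):

```python
import math

def wallis_partial(n):
    """Partial Wallis product prod_{k=1}^n (2k/(2k-1)) * (2k/(2k+1))."""
    w = 1.0
    for k in range(1, n + 1):
        w *= 4 * k * k / ((2 * k - 1) * (2 * k + 1))
    return w

for n in (10, 1000, 100000):
    print(n, wallis_partial(n))   # creeps up toward pi/2 = 1.5707963...
```

Even a hundred thousand factors give only about five correct digits of π/2.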
18 Thomas Fuller, the “Virginia calculator”, was African and in 1724 at the age of 14, he was shipped to America and sold as a slave. He never learned to read or write but he was a mathematician of the finest caliber. Here’s a part of Fuller’s obituary, from the Columbian Centinel (Boston), Vol. 14, Dec. 29, 1790: “He could multiply seven into itself, that product by seven, and the products, so produced, by seven, for seven times. He could give the number of months, days, weeks, hours, minutes, and seconds in any period of time that any person chose to mention, allowing in his calculation for all leap years that happened in the time; he would give the number of poles, yards, feet, inches, and barley-corns in any distance, say the diameter of the earth’s orbit; and in every calculation he would produce the true answer in less time than ninety-nine men out of a hundred would produce with their pens.”
Now using that √(1 + 1/2n) → 1 as n → ∞ implies Wallis' second formula (2.24). Wallis' formula (2.24) answers the question: What is the ratio of the even numbers 2, 4, . . . , 2n and the odd numbers 1, 3, . . . , 2n − 1? The answer is: Approximately √π · √n. Now what does π (dealing with circles) have to do with ratios of even and odd numbers? Beats me, but it does! Now back to our problem; recall that we'd like to estimate
\[
b_k := b(k; 2k, 0.5) = \binom{2k}{k}\frac{1}{2^{2k}} = \frac{(2k)!}{(k!)^2}\,\frac{1}{2^{2k}}.
\]
Using the definition of the factorial, we can write
\[
b_k = \frac{1\cdot 2\cdots(2k-1)(2k)}{(1\cdot 2\cdots k)\cdot(1\cdot 2\cdots k)} \cdot \frac{1}{(2\cdot 2\cdots 2)\cdot(2\cdot 2\cdots 2)},
\]
where there are k factors of 2 in (2 · 2 · · · 2). Multiplying the denominators, we obtain
\[
b_k = \frac{1\cdot 2\cdots(2k-1)(2k)}{(2\cdot 4\cdots(2k))\cdot(2\cdot 4\cdots(2k))} = \frac{1\cdot 3\cdot 5\cdots(2k-1)}{2\cdot 4\cdots(2k)},
\]
where we canceled the even factors from the numerator with one of the products of the even numbers in the denominator. Finally, recalling Wallis' formula (2.24), we see that
\[
\sqrt{\pi} = \lim_{k\to\infty} \frac{1}{\sqrt{k}\, b_k}, \qquad\text{which implies that}\qquad \lim_{k\to\infty} \frac{b_k}{1/\sqrt{\pi k}} = 1.
\]
Because of this limit, we write
\[
b_k \sim \frac{1}{\sqrt{k\pi}}.
\]
Here, given any two sequences {c_k}, {d_k}, if we have
\[
\lim_{k\to\infty} \frac{c_k}{d_k} = 1,
\]
we write c_k ∼ d_k and we say that the sequences are asymptotically equal or asymptotically equivalent; basically this means that c_k and d_k are roughly the same size as k → ∞. Thus, b_k is asymptotically equal to 1/√(kπ). In particular, taking k = 50 we obtain
\[
b_{50} \approx \frac{1}{\sqrt{50\pi}} = 0.079788\ldots.
\]
The true answer is b_{50} = 0.079589 . . .; not bad!

2.5.3. The De Moivre and Stirling formulas. In the last subsection we found a nice formula for b(k; 2k, 0.5). Now what if we need to approximate b(k; n, 0.5) when n ≠ 2k, or what if we have an unfair coin so that we need to approximate b(k; n, p) where p ≠ 1/2? In such cases, Wallis' formula doesn't help. However, De Moivre found an amazing relationship between b(k; n, p) and the normal density that gives us exactly the approximations we are looking for. Figure 2.10 shows a graph of b(k; n, 0.5) for various n while Figure 2.11 shows a graph of b(k; n, 0.75) for various n. Closely scrutinizing the graphs in both cases, it's clear that the graph of b(k; n, p) (where p = 0.5 or 0.75) looks just like the normal density with mean
Figure 2.10. p = 0.5, n = 40 (left), n = 100 (middle), n = 200 (right).
Figure 2.11. p = 0.75, n = 40 (left), n = 100 (middle), n = 200 (right).

at np. Therefore, we conjecture that for general p and large n, k, we have
\[
(2.25) \qquad \binom{n}{k} p^k q^{n-k} \sim \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(k-np)^2}{2\sigma^2}},
\]
for some constant σ to be determined. The key to verifying this formula is obviously estimating the binomial coefficient \binom{n}{k} = \frac{n!}{k!\,(n-k)!}, which means that we need to estimate the factorial. Such a formula was first discovered by Abraham de Moivre (1667–1754) around 1721, although the formula itself is called Stirling's formula, after James Stirling¹⁹ (1692–1770), who published it in his most famous work Methodus Differentialis [289] in 1730. This formula is the following asymptotic formula for the factorial function:
\[
(2.26) \qquad n! \sim \sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^n = \sqrt{2\pi n}\; n^n e^{-n}.
\]
Recall that "∼" means that the ratio of n! and √(2πn) nⁿe⁻ⁿ approaches unity as n → ∞:
\[
n! \sim \sqrt{2\pi n}\; n^n e^{-n} \qquad\text{means that}\qquad \lim_{n\to\infty} \frac{n!}{\sqrt{2\pi n}\; n^n e^{-n}} = 1.
\]
Now what do π and e have to do with multiplying all the integers from 1 to n? Beats me, but they do! You will provide an elementary proof of Stirling's formula in Problem 5. As already mentioned, it was Abraham de Moivre who first derived Stirling's formula. He did this in a supplement to his 1730 paper Miscellanea Analytica called Approximatio ad Summam Terminorum Binomii (a+b)ⁿ in Seriem expansi, dated Nov. 12, 1733 (reproduced in English in The Doctrine of Chances [71]). In this paper he derived "Stirling's formula" with the constant √(2π) replaced by a non-explicit constant:
\[
n! \sim B\,\sqrt{n}\,\Big(\frac{n}{e}\Big)^n,
\]
where
\[
B \approx e^{1 - \frac{1}{12} + \frac{1}{360} - \frac{1}{1260} + \frac{1}{1680}} = 2.507399\ldots;
\]
¹⁹ No known portraits of Stirling seem to exist.
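Both Stirling's formula (2.26) and De Moivre's constant B are easy to probe numerically. Since n! and (n/e)^n overflow floating point quickly, the sketch below (names ours) compares logarithms via math.lgamma:

```python
import math

def log_stirling_ratio(n):
    """log of n! / (sqrt(2 pi n) (n/e)^n), computed via lgamma
    to avoid overflow; it is roughly 1/(12 n) and shrinks to 0."""
    return math.lgamma(n + 1) - (0.5 * math.log(2 * math.pi * n)
                                 + n * math.log(n) - n)

for n in (10, 100, 1000):
    print(n, log_stirling_ratio(n))

# De Moivre's non-explicit constant vs Stirling's sqrt(2 pi)
B = math.exp(1 - 1/12 + 1/360 - 1/1260 + 1/1680)
print(B, math.sqrt(2 * math.pi))
```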
for comparison, √(2π) = 2.506628 . . .. Unfortunately, de Moivre wasn't able to determine B explicitly, which is where Stirling enters the picture [71, p. 243-44]²⁰: It is now a dozen years or more since I found what follow . . . When I first began that inquiry, I contented myself to determine at large the Value of B, which was done by the addition of some Terms of the above-written Series; but as I perceived that it converged but slowly, and seeing at the same time that what I had done answered my purpose tolerably well, I desisted from proceeding farther till my worthy and learned Friend Mr. James Stirling, who had applied himself after me to that inquiry, found that the Quantity B did denote the Squareroot of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c, the Ratio of the middle Term to the Sum of all the Terms will be expressed by 2/√(nc). But altho' it be not necessary to know what relation the number B may have to the Circumference of the Circle, provided its value be attained, either by pursuing the Logarithmic Series before mentioned, or any other way; yet I own with pleasure that this discovery, besides that it has saved trouble, has spread a singular Elegancy on the Solution.
Actually, in the preface to Stirling’s book [289, p. 18], he admits that De Moivre found the formula first: The problem of finding the middle coefficient in a very large power of the binomial had been solved by De Moivre some years before I considered it.
Regardless of who found the approximation, we shall use it to prove the conjecture (2.25), which we shall call De Moivre's theorem, whose proof is basically an exercise in dexterity and patience using Stirling's formula!

2.5.4. Proof of De Moivre's Theorem. We do it in three steps.
Step 1: We apply Stirling's formula. The starting point is
\[
(2.27) \qquad \binom{n}{k} p^k q^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}.
\]
By Stirling's formula we have
\[
n! \sim \sqrt{2\pi n}\; n^n e^{-n}, \qquad k! \sim \sqrt{2\pi k}\; k^k e^{-k}, \qquad (n-k)! \sim \sqrt{2\pi(n-k)}\; (n-k)^{n-k} e^{-(n-k)},
\]
valid for n → ∞, k → ∞ and n − k → ∞, respectively. Let us assume these three conditions; thus, our first assumption is
\[
n \to \infty, \qquad k \to \infty, \qquad n-k \to \infty.
\]
Plugging the asymptotic equations for n!, k! and (n − k)! into (2.27) and simplifying, we obtain
\[
\binom{n}{k} p^k q^{n-k} \sim \sqrt{\frac{n}{2\pi k(n-k)}}\; \frac{n^n}{k^k (n-k)^{n-k}}\, p^k q^{n-k} = \sqrt{\frac{n}{2\pi k(n-k)}}\; \Big(\frac{np}{k}\Big)^{k} \Big(\frac{nq}{n-k}\Big)^{n-k}.
\]
²⁰ Page 143 of the ebook found at http://www.ibiblio.org/chance/.
We want the right-hand side to have an exponential, so the logical thing to do is write
\[
\Big(\frac{np}{k}\Big)^{k} \Big(\frac{nq}{n-k}\Big)^{n-k} = e^{-A},
\]
where, by elementary properties of logarithms,
\[
A = -\log\!\left( \Big(\frac{np}{k}\Big)^{k} \Big(\frac{nq}{n-k}\Big)^{n-k} \right) = k \log\frac{k}{np} + (n-k)\log\frac{n-k}{nq}.
\]
Note that A depends on both k and n, but to simplify notation we omit its explicit dependence (so, actually, we should write A_{nk}, but we won't). Now, considering our conjecture (2.25), we should have
\[
(2.28) \qquad A \approx \frac{(k-np)^2}{2\sigma^2} \qquad\text{(a conjecture at this point)},
\]
for some σ (that we'll worry about later). With (2.28) as our guide, let us put a = k − np (we omit the explicit dependence of a on k and n). Then after some simplification, using that a = k − np (or k = a + np), we obtain
\[
A = k \log\frac{k}{np} + (n-k)\log\frac{n-k}{nq} = k \log\Big(1 + \frac{a}{np}\Big) + (n-k)\log\Big(1 - \frac{a}{nq}\Big).
\]
End of Step 1: We have shown that as n, k, n − k → ∞, we have
\[
(2.29) \qquad \binom{n}{k} p^k q^{n-k} \sim \sqrt{\frac{n}{2\pi k(n-k)}}\; e^{-A},
\]
where
\[
A = k \log\Big(1 + \frac{a}{np}\Big) + (n-k)\log\Big(1 - \frac{a}{nq}\Big).
\]
Step 2: Simplify A. We now try to simplify A so it (hopefully) looks like (2.28). To do so, we use the second-order Taylor expansion with remainder for log(1 + x), which from elementary calculus is
\[
(2.30) \qquad \log(1+x) = x - \frac{x^2}{2} + R(x),
\]
where R(x) is a function (which, of course, equals log(1 + x) − x + x²/2) that satisfies the estimate, for some constant C,
\[
(2.31) \qquad |R(x)| \le C|x|^3 \qquad\text{for all } x \text{ with } |x| \le 1/2.
\]
Putting x = a/np and then x = −a/nq in (2.30) we get
\[
\log\Big(1 + \frac{a}{np}\Big) = \frac{a}{np} - \frac{a^2}{2n^2p^2} + R\Big(\frac{a}{np}\Big)
\]
and
\[
\log\Big(1 - \frac{a}{nq}\Big) = -\frac{a}{nq} - \frac{a^2}{2n^2q^2} + R\Big(-\frac{a}{nq}\Big).
\]
Multiplying the first formula by k, the second by n − k, then adding and spending a couple minutes double checking the algebra (using k = np + a and n − k = nq − a), we see that
\[
A = k \log\Big(1 + \frac{a}{np}\Big) + (n-k)\log\Big(1 - \frac{a}{nq}\Big) = \frac{a^2}{2npq} + \frac{p-q}{2p^2q^2}\,\frac{a^3}{n^2} + kR\Big(\frac{a}{np}\Big) + (n-k)R\Big(-\frac{a}{nq}\Big).
\]
The first term a²/(2npq) is exactly our conjecture (2.28) (with σ = √(npq)), so let's force the other terms to vanish. The best way to do so is to make the second term, involving a³/n², vanish; thus, we make the following
\[
(2.32) \qquad \text{Assumption:}\quad \frac{a^3}{n^2} = \frac{(k-np)^3}{n^2} \to 0 \quad\text{as } n, k \to \infty.
\]
This assumption will be henceforth in effect; in particular, we may henceforth assume that
\[
A = \frac{a^2}{2npq} + kR\Big(\frac{a}{np}\Big) + (n-k)R\Big(-\frac{a}{nq}\Big).
\]
Now observe that, taking cube roots, the assumption (2.32) is equivalent to
\[
\frac{a}{n^{2/3}} \to 0 \quad\text{as } n, k \to \infty.
\]
Here are some immediate consequences of this assumption (for the second bullet, recall that a = k − np, or k = a + np):
(2.33)
• \( \dfrac{a}{n} = \dfrac{a}{n^{2/3}} \cdot \dfrac{1}{n^{1/3}} \to 0 \cdot 0 = 0 \) as n, k → ∞
• \( \dfrac{k}{n} = \dfrac{a + np}{n} = p + \dfrac{a}{n} \to p + 0 = p \) as n, k → ∞
• \( n - k = n\Big(1 - \dfrac{k}{n}\Big) \to \infty \cdot (1 - p) = \infty \) as n, k → ∞.
In particular, the last bullet says that under the assumption (2.32) we get n − k → ∞ for free. Thus, the asymptotic equation (2.29) holds under the assumption (2.32).
End of Step 2: In conclusion, we have so far shown that as n, k → ∞ under the assumption (2.32), we have
\[
\binom{n}{k} p^k q^{n-k} \sim \sqrt{\frac{n}{2\pi k(n-k)}}\; e^{-A},
\]
where
\[
A = \frac{a^2}{2npq} + kR\Big(\frac{a}{np}\Big) + (n-k)R\Big(-\frac{a}{nq}\Big).
\]
Step 3: Finally finish the argument. From the inequality (2.31), the second bullet in (2.33), and our assumption (2.32), we see that
\[
\Big| kR\Big(\frac{a}{np}\Big) \Big| \le C\,k\,\frac{|a|^3}{n^3p^3} = \frac{C}{p^3}\cdot\frac{k}{n}\cdot\frac{|a|^3}{n^2} \to \frac{C}{p^3}\cdot p \cdot 0 = 0 \quad\text{as } n, k \to \infty.
\]
A similar argument shows that
\[
(n-k)\,R\Big(-\frac{a}{nq}\Big) \to 0 \quad\text{as } n, k \to \infty.
\]
Thus, under the assumption (2.32), we have
\[
A - \frac{a^2}{2npq} \to 0 \quad\text{as } n, k \to \infty.
\]
Since k/n → p (the second bullet in (2.33)) we see that
\[
\frac{n}{k(n-k)} = \frac{1}{n}\,\frac{1}{(k/n)(1 - k/n)} \sim \frac{1}{n}\,\frac{1}{p(1-p)} = \frac{1}{npq}.
\]
Thus,
\[
\sqrt{\frac{n}{2\pi k(n-k)}} \sim \sqrt{\frac{1}{2\pi npq}}.
\]
In conclusion, as n, k → ∞ under the assumption (2.32), we have
\[
\binom{n}{k} p^k q^{n-k} \sim \sqrt{\frac{n}{2\pi k(n-k)}}\; e^{-A} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{a^2}{2npq}}.
\]
To summarize, we have shown that as n, k → ∞ in such a way that (k − np)/n^{2/3} → 0, we have
\[
\binom{n}{k} p^k q^{n-k} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{(k-np)^2}{2npq}} \qquad\text{(De Moivre's theorem)}.
\]
It will be helpful later on, when we state the De Moivre-Laplace theorem, to rewrite this expression in terms of random variables. Given i ∈ N, we let f_i : X → R be the simple random variable observing whether or not we are successful on the i-th toss; thus,
\[
f_i(x) := x_i = \begin{cases} 1 & \text{if } x_i = 1 \\ 0 & \text{if } x_i = 0 \end{cases}
\]
for all x = (x₁, x₂, x₃, . . .) ∈ X. Then given n ∈ N,
\[
S_n := f_1 + f_2 + \cdots + f_n
\]
is the simple random variable giving the number of successes in the first n trials. In particular, µ(S_n = k) is exactly the probability of k successes in the first n trials, which implies that
\[
\mu\big(S_n = k\big) = \binom{n}{k} p^k q^{n-k}.
\]
Thus, we can state De Moivre's theorem as follows:

De Moivre's Theorem

Theorem 2.14. If n, k → ∞ in such a way that (k − np)/n^{2/3} → 0, then
\[
\mu\big(S_n = k\big) \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{(k-np)^2}{2npq}}.
\]
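Theorem 2.14 is easy to probe numerically. The sketch below (function names ours) evaluates the exact probability in log space via math.lgamma, since C(n,k) p^k q^(n−k) under/overflows double precision for large n, and compares it with the normal-density approximation:

```python
import math

def binom_pmf(n, k, p):
    """mu(S_n = k) = C(n,k) p^k q^(n-k), computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def de_moivre_approx(n, k, p):
    """Right-hand side of Theorem 2.14."""
    q = 1 - p
    return math.exp(-(k - n * p)**2 / (2 * n * p * q)) / math.sqrt(2 * math.pi * n * p * q)

n, p = 10_000, 0.3
for k in (2950, 3000, 3050):      # k near np = 3000
    print(k, binom_pmf(n, k, p), de_moivre_approx(n, k, p))
```

For k within a few standard deviations of np the two values agree to about three digits already at n = 10,000.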
2.5.5. De Moivre-Laplace Theorem. The De Moivre-Laplace theorem is just an integral version of De Moivre's theorem. To simplify the exponent in De Moivre's theorem, let
\[
x_k = \frac{k - np}{\sqrt{npq}}.
\]
Then the right-hand side of De Moivre's theorem is
\[
\frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{(k-np)^2}{2npq}} = \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{x_k^2}{2}}.
\]
We can make this a tad bit simpler by defining ∆x_k := x_{k+1} − x_k, which equals
\[
\Delta x_k = \frac{k+1-np}{\sqrt{npq}} - \frac{k-np}{\sqrt{npq}} = \frac{1}{\sqrt{npq}}.
\]
Thus, the right-hand side in De Moivre's theorem is
\[
\frac{1}{\sqrt{2\pi}}\, e^{-\frac{x_k^2}{2}}\, \Delta x_k,
\]
which looks awfully like something belonging to a Riemann integral! We'll get back to this later. Solving x_k = \frac{k-np}{\sqrt{npq}} for k we get k = np + x_k√(npq), so the left-hand side in De Moivre's theorem is
\[
\mu\big( S_n = np + x_k\sqrt{npq}\, \big) = \mu\Big( \frac{S_n - np}{\sqrt{npq}} = x_k \Big).
\]
Thus, as n, k → ∞ such that (k − np)/n^{2/3} → 0, we have
\[
(2.34) \qquad \mu\Big( \frac{S_n - np}{\sqrt{npq}} = x_k \Big) \sim \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x_k^2}{2}}\, \Delta x_k.
\]
Since we simply can't help ourselves from trying to make a Riemann integral out of this (asymptotic) equation, let us fix two real numbers a, b ∈ R with a < b and consider all x_k such that a ≤ x_k ≤ b. Here's a picture of this situation:
Figure 2.12. The length between adjacent x_k's is ∆x_k = 1/√(npq). The integer k₀ is the first k such that a ≤ x_k ≤ b and k₁ is the last integer such that a ≤ x_k ≤ b. Thus, the x_k's partition [a, b] into subintervals of length 1/√(npq), except for the first and last subintervals, which have length ≤ 1/√(npq).
Taking a closer look at the assumption a ≤ x_k ≤ b, observe that this is the same as a ≤ \frac{k-np}{\sqrt{npq}} ≤ b, or more explicitly,
\[
np + a\sqrt{npq} \le k \le np + b\sqrt{npq}.
\]
In particular, as n → ∞ we must have k → ∞. The assumption a ≤ \frac{k-np}{\sqrt{npq}} ≤ b also implies
\[
a\sqrt{pq} \le \frac{k-np}{n^{1/2}} \le b\sqrt{pq} \quad\Longrightarrow\quad \frac{a\sqrt{pq}}{n^{1/6}} \le \frac{k-np}{n^{2/3}} \le \frac{b\sqrt{pq}}{n^{1/6}}.
\]
In particular, as n → ∞, (k − np)/n^{2/3} → 0. To reiterate, we've shown that the assumption a ≤ x_k ≤ b implies that as n → ∞, we automatically have both k → ∞ and (k − np)/n^{2/3} → 0. Thus, the asymptotic formula (2.34) is valid as n → ∞ where a ≤ x_k ≤ b. We now sum both sides of (2.34) over all k ∈ N such that a ≤ x_k ≤ b. The left-hand side of (2.34) gives
\[
\sum_k \mu\Big( \frac{S_n - np}{\sqrt{npq}} = x_k \Big) = \mu\Big( a \le \frac{S_n - np}{\sqrt{npq}} \le b \Big),
\]
because the set \( \big\{ a \le \frac{S_n - np}{\sqrt{npq}} \le b \big\} \) is exactly the (pairwise disjoint) union of all the sets \( \big\{ \frac{S_n - np}{\sqrt{npq}} = x_k \big\} \) where x_k = \frac{k-np}{\sqrt{npq}} and a ≤ x_k ≤ b.²¹ Thus,
\[
(2.35) \qquad \mu\Big( a \le \frac{S_n - np}{\sqrt{npq}} \le b \Big) \approx \frac{1}{\sqrt{2\pi}} \sum_k e^{-\frac{x_k^2}{2}}\, \Delta x_k,
\]
where the right-hand side is only summed over those k with a ≤ x_k ≤ b, and where the approximation becomes exact as n → ∞. Of course, the right-hand side of (2.35) is exactly a Riemann sum for the function e^{−x²/2} over the interval [a, b], as seen here:
Figure 2.13. The length between adjacent x_k's is ∆x_k = 1/√(npq).

Hence,
\[
\lim_{n\to\infty} \frac{1}{\sqrt{2\pi}} \sum_k e^{-\frac{x_k^2}{2}}\, \Delta x_k = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}}\, dx.
\]
We have thus obtained the following theorem for finite a, b; in Problem 10 you will establish it even when a or b is infinite.

De Moivre-Laplace Theorem

Theorem 2.15. For any a, b ∈ [−∞, ∞] with a < b, we have
\[
\lim_{n\to\infty} \mu\Big( a \le \frac{S_n - np}{\sqrt{npq}} \le b \Big) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}}\, dx.
\]
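The De Moivre-Laplace theorem can be checked numerically: sum the exact binomial probabilities over the window a ≤ x_k ≤ b and compare with the normal integral, which Python's math.erf evaluates directly. A sketch (names and parameter choices ours):

```python
import math

def binom_window_prob(n, p, a, b):
    """mu( a <= (S_n - np)/sqrt(npq) <= b ), by summing the exact binomial
    pmf (in log space, via lgamma) over the integers k with a <= x_k <= b."""
    q = 1 - p
    s = math.sqrt(n * p * q)
    total = 0.0
    for k in range(math.ceil(n * p + a * s), math.floor(n * p + b * s) + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(q))
        total += math.exp(log_pmf)
    return total

def normal_window_prob(a, b):
    """(1/sqrt(2 pi)) int_a^b e^{-x^2/2} dx, via the error function."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

a, b = -1.0, 2.0
print(binom_window_prob(100_000, 0.3, a, b), normal_window_prob(a, b))
```

With n = 100,000 the two probabilities already agree to roughly three decimal places.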
The same result holds when any of the "≤'s" on the left is replaced with "<".

(a) Show that for α > 1/2, we have
\[
\int_0^\infty (1+x^2)^{-\alpha}\,dx = \int_0^1 (1-x^2)^{\alpha - \frac{3}{2}}\,dx = \int_0^{\frac{\pi}{2}} \cos^{2\alpha-2}\theta\,d\theta = \int_0^{\frac{\pi}{2}} \sin^{2\alpha-2}\theta\,d\theta.
\]
(b) In the text we've been focusing on the first integral, so for fun let's use the sine integral. Henceforth put S_α = \int_0^{\pi/2} \sin^\alpha x\,dx. Prove that
\[
(2.36) \qquad S_{\alpha+1} = \frac{\alpha}{\alpha+1}\, S_{\alpha-1}, \qquad S_0 = \frac{\pi}{2}, \qquad S_1 = 1.
\]
(c) Prove that for any n ∈ N,
\[
S_{2n} = \frac{1\cdot 3\cdot 5\cdots(2n-1)}{2\cdot 4\cdot 6\cdots(2n)}\cdot\frac{\pi}{2}, \qquad S_{2n+1} = \frac{2\cdot 4\cdots(2n)}{1\cdot 3\cdot 5\cdots(2n+1)}.
\]
22 Technically speaking we should be employing the continuity correction (if you know what this is), but we won’t worry about this.
(d) Show that
\[
\int_0^\infty (1+x^2)^{-n}\,dx = \frac{\pi}{2} \prod_{k=1}^{n-1} \frac{2k-1}{2k}.
\]
3. (Elementary proof of Wallis' formula) In this problem we give Euler's proof (made a little more rigorous) of Wallis' formula, found in Chapter 9, De evolutione integralium per producta infinita, of Euler's calculus textbook Institutionum calculi integralis volumen primum (Foundations of Integral Calculus, volume 1) [89] (cf. [247, p. 153]). We also show how Wallis' formula implies the probability integral.
(i) In Euler's book, he uses the integrals $\int_0^1 \frac{t^\alpha}{\sqrt{1-t^2}}\,dt$ to derive Wallis' formula. Taking t = sin x transforms Euler's integrals into $S_\alpha := \int_0^{\pi/2}\sin^\alpha x\,dx$.
(ii) Show that $S_{2n+2} \le S_{2n+1} \le S_{2n}$. Now replace the $S_\alpha$'s here with their expressions in Part (c) of Problem 2, and show that
$$\frac{2n+1}{2n+2} \le \frac{2W_n}{\pi} \le 1, \quad\text{where}\quad W_n = \frac{2\cdot2\cdot4\cdot4\cdot6\cdot6\cdots(2n)(2n)}{1\cdot3\cdot3\cdot5\cdot5\cdots(2n-1)(2n+1)}.$$
Conclude that $\lim_{n\to\infty} W_n = \pi/2$, which is exactly Wallis' product.
4. (Elementary proof of Wallis =⇒ probability integral) In this problem we show how Wallis' formula implies the probability integral. (This is basically an exercise in Bourbaki's book [41, p. 127].)
(i) Show that for all x ∈ R,
$$1 - x^2 \le e^{-x^2} \le \frac{1}{1+x^2}.$$
(ii) Conclude that
$$\int_0^1 (1-x^2)^n\,dx \le \int_0^\infty e^{-nx^2}\,dx \le \int_0^\infty (1+x^2)^{-n}\,dx,$$
and from these inequalities, Part (a) of Problem 2, and Wallis' formula, derive the probability integral.
5. (Elementary proof of Wallis =⇒ Stirling's formula) Many "standard" proofs of Stirling's formula first prove de Moivre's result that $n! \sim B\,n^{n+\frac12}e^{-n}$, then find B by Wallis' formula (or the probability integral). Here's one such proof.
(i) For any x ∈ [0, 1), prove that
$$0 \le \log(1+x) - \Big(x - \frac{x^2}{2}\Big) \le \frac{x^3}{3}.$$
Suggestion: Remember any facts about alternating series?
(ii) Define $a_n = \log(n!) + n - \big(n + \tfrac12\big)\log n$. Prove that
$$a_n - a_{n+1} = \Big(n + \frac12\Big)\log\Big(1 + \frac{1}{n}\Big) - 1,$$
then using (i), prove that for n ≥ 2,
$$|a_n - a_{n+1}| \le \frac{\text{constant}}{n^2}.$$
(iii) Show that $\lim_{n\to\infty}\sum_{k=1}^{n-1}(a_k - a_{k+1})$ exists and use this to prove that $\lim_{n\to\infty} a_n$ exists. Use this fact to prove that $n! \sim B\,n^{n+\frac12}e^{-n}$ for some constant B.
(iv) Show that Wallis' formula can be written as
$$\lim_{n\to\infty}\frac{2^{2n}(n!)^2}{\sqrt{2n}\,(2n)!} = \sqrt{\frac{\pi}{2}},$$
then use this to show that $B = \sqrt{2\pi}$.
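The partial products W_n from Problem 3 can be watched converging numerically; the following sketch (illustrative, not part of the exercises) prints a few of them.

```python
# Partial Wallis products (a numerical sketch; the values of n are arbitrary):
# W_n increases to pi/2.
import math

def wallis(n):
    w = 1.0
    for k in range(1, n + 1):
        w *= (2.0 * k) ** 2 / ((2.0 * k - 1.0) * (2.0 * k + 1.0))
    return w

for n in (10, 100, 10000):
    print(n, wallis(n))  # creeps up toward pi/2 = 1.5707963...
```

The convergence is slow (the error behaves like a constant over n), which is consistent with the two-sided bound in Part (ii).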
6. (Stieltjes' method, cf. [311, p. 272]) In this problem we give Thomas Jan Stieltjes' (1856–1894) computation of the probability integral in the two-page (but ingenious) 1890 paper Note sur l'intégrale $\int_0^\infty e^{-u^2}\,du$ [271]. Define $T_n = \int_0^\infty x^n e^{-x^2}\,dx$; what we want is $T_0$.
(i) Prove that $T_n = \frac{n-1}{2}\,T_{n-2}$ for n ≥ 2 and then prove that for n ≥ 0,
$$T_{2n} = \frac{1\cdot3\cdot5\cdots(2n-1)}{2\cdot4\cdots(2n)}\,n!\;T_0 \quad\text{and}\quad T_{2n+1} = \frac{n!}{2}.$$
(ii) Stieltjes' brilliant idea is the following identity: For all n ∈ N, we have
$$T_n^2 < T_{n-1}\,T_{n+1}.$$
Prove this using Stieltjes' (ingenious) trick: Consider the polynomial
$$p(t) = at^2 + bt + c, \quad\text{where}\quad a = T_{n-1},\ b = 2T_n,\ c = T_{n+1},$$
and show that p(t) > 0 for all t ∈ R (in particular, p(t) does not have any real roots). Subhint: Show that
$$p(t) = \int_0^\infty x^{n-1}(x+t)^2 e^{-x^2}\,dx.$$
(iii) Using (i) and (ii) show that
$$T_{2n+1}^2 < \frac{2n+1}{2}\,T_{2n}^2 < \frac{2n+1}{2}\,T_{2n-1}\,T_{2n+1}.$$
(iv) Using (i) and (iii), show that
$$1 < 2(2n+1)\Big(\frac{2n-1}{2n}\Big)^2\cdots\Big(\frac{3}{4}\Big)^2\Big(\frac{1}{2}\Big)^2\,T_0^2 < \frac{2n+1}{2n}.$$
(v) Finally, use Wallis' formula on (iv) to determine the probability integral.
7. (De Moivre-Laplace =⇒ Bernoulli) In this problem we show that the De Moivre-Laplace theorem implies Bernoulli's theorem. To do so, let ε > 0 and show that for any fixed, but arbitrary, r > 0,
$$\Big\{\Big|\frac{S_n - np}{\sqrt{npq}}\Big| < r\Big\} \subseteq \Big\{\Big|\frac{S_n}{n} - p\Big| < \varepsilon\Big\}$$
for all n sufficiently large.
8. (ii) Given r > 0, prove that
$$\lim_{n\to\infty}\mu\big\{|D_n| \le r\big\} = 0.$$
Interpret this probabilistically.
(iii) How many tosses are needed to ensure that |D_n| ≥ 2 holds with probability 99%?
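Stieltjes' quantities T_n are easy to approximate numerically, which gives a sanity check of the recursion in Part (i) and the inequality in Part (ii). The midpoint-rule integrator below is an illustrative helper, not from the text.

```python
# Numerical companion to Problem 6 (a sketch; the quadrature parameters are
# illustrative): approximate T_n = integral_0^inf x^n e^{-x^2} dx.
import math

def T(n, steps=100000, upper=10.0):
    # composite midpoint rule on [0, upper]; the tail beyond upper is negligible
    h = upper / steps
    return h * sum(((i + 0.5) * h) ** n * math.exp(-((i + 0.5) * h) ** 2)
                   for i in range(steps))

t0, t1, t2, t3, t4 = (T(n) for n in range(5))
print(t0, math.sqrt(math.pi) / 2)   # T_0 is the probability integral
print(t4, 1.5 * t2)                 # recursion: T_4 = (3/2) T_2
print(t3 ** 2 < t2 * t4)            # Stieltjes' inequality for n = 3
```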
9. Equation (2.34) means that
$$\mu\Big\{\frac{S_n - np}{\sqrt{npq}} = x_k\Big\} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x_k^2}{2}}\,\Delta x_k\,\big(1 + s_{nk}\big),$$
where $s_{nk} \to 0$ as n, k → ∞ such that $(k - np)/n^{2/3} \to 0$; more precisely, given ε > 0 there are N and δ such that if n, k > N and $|(k - np)/n^{2/3}| < \delta$, we have $|s_{nk}| < \varepsilon$. Thus, (2.35) should be written
$$\mu\Big\{a \le \frac{S_n - np}{\sqrt{npq}} \le b\Big\} = \frac{1}{\sqrt{2\pi}}\sum_k e^{-\frac{x_k^2}{2}}\,\Delta x_k\,\big(1 + s_{nk}\big).$$
Prove that
$$\lim_{n\to\infty}\sum_k e^{-\frac{x_k^2}{2}}\,\Delta x_k\,s_{nk} = 0,$$
where for each n ∈ N, the sum is taken over all k such that a ≤ x_k ≤ b.
10. Given any r ∈ (0, ∞), using Chebyshev's inequality (or simply look at the inequalities in (2.18) in Subsection 2.4.2) prove that
$$\mu\big\{|S_n - np| \ge \sqrt{npq}\,r\big\} = \mu\big\{(S_n - np)^2 \ge npq\,r^2\big\} \le \frac{1}{r^2}.$$
Conclude that for any r ∈ (0, ∞),
$$\mu\Big\{\frac{S_n - np}{\sqrt{npq}} \le -r\Big\},\ \ \mu\Big\{\frac{S_n - np}{\sqrt{npq}} \ge r\Big\} \le \frac{1}{r^2}.$$
In particular, these probabilities can be made arbitrarily small by taking r > 0 sufficiently large. Using this fact, prove the De Moivre-Laplace theorem in the case when a = −∞ or b = +∞.
Notes and references on Chapter 2

§2.4: For a history of the law of large numbers, see [255].

§2.5: The book The Life and Times of the Central Limit Theorem [1] is a great book for history on the De Moivre-Laplace theorem and its generalization, the central limit theorem. If you're interested in the normal law of errors, see e.g. [68, 268]. Many books go over the Herschel-Maxwell (really Adrain) derivation of the normal law; a few of the books that do so are [140, Ch. 7], [114, p. 209] and [261, p. 66]. We mentioned that it was Laplace who gave the first "proof" of the probability integral. Quoting from Stigler's translation [272, p. 367], here's the passage from the 1774 paper Mémoire sur la probabilité des causes par les évènemens [159] where Laplace derives the probability integral:

Let −[(p + q)³/2pq] zz = ln µ, and we will have²³
$$2\int dz\,\exp\Big(-\frac{(p+q)^3}{2pq}\,zz\Big) = -\frac{\sqrt{2qp}}{(p+q)^{3/2}}\int\frac{d\mu}{\sqrt{-\ln\mu}}.$$
The number µ can here have any value between 0 and 1, and, supposing the integral begins at µ = 1, we need its value at µ = 0. This may be determined using the following theorem (see M. Euler's Calcul intégral). Supposing the integral goes from µ = 0 to µ = 1, we have
$$\int\frac{\mu^n\,d\mu}{\sqrt{1-\mu^{2i}}}\cdot\int\frac{\mu^{n+i}\,d\mu}{\sqrt{1-\mu^{2i}}} = \frac{1}{i(n+1)}\,\frac{\pi}{2},$$
whatever be n and i. Supposing n = 0 and i is infinitely small, we will have (1 − µ^{2i})/(2i) = −ln µ, because the numerator and the denominator of this quantity become zero when i = 0, and if we
23 The limits on the left-hand integral are from 0 to ∞.
differentiate them both, regarding i alone as variable, we will have (1 − µ^{2i})/(2i) = −ln µ, therefore 1 − µ^{2i} = −2i ln µ. Under these conditions we will thus have
$$\int\frac{\mu^n\,d\mu}{\sqrt{1-\mu^{2i}}}\cdot\int\frac{\mu^{n+i}\,d\mu}{\sqrt{1-\mu^{2i}}} = \int\frac{d\mu}{\sqrt{2i}\,\sqrt{-\ln\mu}}\cdot\int\frac{d\mu}{\sqrt{2i}\,\sqrt{-\ln\mu}} = \frac{1}{i}\,\frac{\pi}{2};$$
Therefore
$$\int\frac{d\mu}{\sqrt{2i}\,\sqrt{-\ln\mu}} = \frac{\sqrt{\pi}}{\sqrt{2i}},$$
supposing the integral is from µ = 0 to µ = 1. In our case, however, the integral is from µ = 1 to µ = 0, and we will have
$$\int\frac{d\mu}{\sqrt{2i}\,\sqrt{-\ln\mu}} = -\frac{\sqrt{\pi}}{\sqrt{2i}}.$$
Thus,
$$2\int dz\,\exp\Big(-\frac{(p+q)^3}{2pq}\,zz\Big) = \frac{\sqrt{pq}\,\sqrt{2\pi}}{(p+q)^{3/2}}.$$

Note that if we set $\frac{(p+q)^3}{pq} = 2$, this last equation can be written, in modern notation,
$$2\int_0^\infty e^{-z^2}\,dz = \frac{1}{\sqrt{2}}\cdot\sqrt{2\pi} \implies \int_0^\infty e^{-z^2}\,dz = \frac{\sqrt{\pi}}{2}.$$
Of course, Laplace's argument is not quite rigorous. Later on, in his 1781 paper Mémoire sur les probabilités, published in Mémoires de l'Académie royale des Sciences de Paris, Laplace gives a completely rigorous derivation of the probability integral using double integrals; see Section 7.3 if you're interested.

The normal density can also be seen as emerging from the binomial coefficients via the Galton board, named after Francis Galton (1822–1911). Imagine a vertically placed board with many regularly spaced pins nailed into the board. We then drop a large number of tiny balls from the top, which hit the pins and bounce left and right, eventually landing in bins at the bottom of the board; see Figure 2.14.²⁴

Figure 2.14. Three Galton boards.

Here's Galton himself describing how the outline of the balls in the bins at the bottom forms the normal density (see also [220]): [102, p. 64]

The shot passes through the funnel and, issuing from its narrow end, scampers deviously down through the pins in a curious and interesting way; each of them darting a step to the right or left, as the case may be, every time it strikes a pin. The pins are disposed in a quincunx fashion, so that every descending shot strikes against a pin in each successive row. The cascade issuing from the funnel broadens as it descends, and, at length, every shot finds itself caught in a compartment immediately after freeing itself from the last row of pins. The outline of the columns of shot that accumulate in the successive compartments approximates to the Curve of Frequency (Fig. 3, p. 38), and is closely of the same shape however often the experiment is repeated. The outline of the columns would become more nearly identical with the Normal Curve of Frequency, if the rows of pins were much more numerous, the shot smaller, and the compartments narrower; also if a larger quantity of shot was used.

To see mathematically why the normal density should be obtained, imagine each row as an experiment recording whether a ball bounces to the left (say a 0) or the right (say a 1), each with probability 1/2. Thus, if there are n rows, the path of the ball can be described by an n-tuple of 0's and 1's, which is just a sequence of Bernoulli trials!

24 Figures 7, 8 and 9 from [102, p. 63]. A demonstration of the Galton board can be found at http://www.jcu.edu/math/isep/Quincunx/Quincunx.html
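The quincunx described above can be simulated in a few lines; the sketch below (the row and ball counts are arbitrary choices, not from the text) drops balls through n rows of pins and prints a crude histogram of the bins, which traces out the normal curve.

```python
# Galton-board simulation (a sketch): each ball takes n independent left/right
# bounces with probability 1/2, so its bin index is a Binomial(n, 1/2) variable.
import random

random.seed(0)
n_rows, n_balls = 20, 100000
bins = [0] * (n_rows + 1)
for _ in range(n_balls):
    # the bin a ball lands in = number of rightward bounces among n_rows rows
    bins[sum(random.randint(0, 1) for _ in range(n_rows))] += 1

for k, count in enumerate(bins):
    print(f"{k:2d} {'#' * (count // 500)}")  # a crude sideways bell curve
```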
Part 2
Countable additivity
CHAPTER 3
Measure and probability: countable additivity

We intend to attach to each bounded set a positive number or zero called its measure and satisfying the following conditions: 1) There are sets whose measure is not zero. 2) Two congruent sets have the same measure. 3) The measure of the union of a finite number or an infinite countable number of sets that are pairwise disjoint is the sum of the measures of the sets. We solve this problem of measure for those sets that we call measurable.

In the introduction to Chapter I (The Measure of a Set) of Henri Léon Lebesgue's (1875–1941) 1902 thesis [171].
3.1. Introduction: What is a measurable set?

It turns out that there are many answers to this very simple question!

3.1.1. Answer #1 by Lebesgue. Building on the researches of, for instance, Félix Édouard Justin Émile Borel (1871–1956) and Marie Ennemond Camille Jordan (1838–1922), Lebesgue's answer to the question "What is a measurable set?" is given in his first paper on integration theory [165]:

Let us consider a set of points of (a, b); one can enclose in an infinite number of ways these points in an infinite number of intervals; the infimum of the sum of the lengths of the intervals is the measure of the set. A set E is said to be measurable if its measure together with that of the set of points not forming E gives the measure of (a, b).
This answer is somewhat terse, so to grasp what Lebesgue is saying, let's give the exact same definition of measurability, but explained in more detail from page 182 of his book [173]. Let A be a subset of an interval (a, b). Here's how he defines the measure m(A) of A: Enclose A in a finite or denumerably infinite number of intervals, and let l₁, l₂, . . . be the lengths of these intervals. We obviously wish to have
$$m(A) \le l_1 + l_2 + \cdots.$$
If we look for the greatest lower bound of the second member for all possible systems of intervals that cover A, this bound will be an upper bound of m(A). For this reason we represent it by m*(A), and we have
$$(3)\qquad m(A) \le m^*(A).$$
If (a, b) \ A is the set of points of the interval (a, b) that do not belong to A, we have similarly m((a, b) \ A) ≤ m*((a, b) \ A).
Now we certainly wish to have m(A) + m((a, b) \ A) = m(a, b) = b − a; and hence we must have
$$(4)\qquad m(A) \ge b - a - m^*((a, b)\setminus A).$$
The inequalities (3) and (4) give us upper and lower bounds for m(A). One can easily see that these two inequalities are never contradictory. When the lower and upper bounds for A are equal, m(A) is defined, and we say that A is measurable.
Let's take a closer look at what Lebesgue is saying. Let A ⊆ R, not worrying for the moment about whether A ⊆ (a, b); for example, A could be unbounded. Consider his definition of m*(A). Let's cover A by countably many intervals {I_n}, so that
$$A \subseteq \bigcup_{n=1}^\infty I_n.$$
Because of the various assortments of intervals available, for concreteness we assume all the intervals I_n are in I¹. Given any such I ∈ I¹, recall that m(I) is the usual length of I. Now, since A ⊆ ∪_{n=1}^∞ I_n, the sum Σ_{n=1}^∞ m(I_n) should be greater than the true "measure" of A. Intuitively speaking, the "worse" ∪_{n=1}^∞ I_n approximates A, the bigger the sum Σ_{n=1}^∞ m(I_n), and the "better" ∪_{n=1}^∞ I_n approximates A, the smaller the sum Σ_{n=1}^∞ m(I_n). Here's a one-dimensional illustration:

Figure 3.1. The top shows a set A ⊆ R (which could be quite complicated). The bottom shows left-hand open intervals covering A. The sum of the lengths of the intervals is larger than the true measure of A. It therefore makes sense to define the (outer) measure of A to be the "smallest," more precisely, the infimum, of the set of all such sums of lengths of intervals that cover A.
and here’s a two-dimensional illustration:
Figure 3.2. The left-hand picture shows a cover of A = a disk in R2 by five rectangles that gives a “bad” (too large of an) approximation to A. The right-hand picture shows a cover of A by eight rectangles that gives a “better” (closer) approximation to A; in this case, the sum of the areas of the rectangles is closer to the true measure of A. The infimum of all such sums of areas of rectangular approximations should be the true measure of A.
We define m*(A) as the "smallest" possible sum Σ_{n=1}^∞ m(I_n). More precisely, we define "smallest" in terms of infimums:
$$(3.1)\qquad m^*(A) := \inf\Big\{\sum_{n=1}^\infty m(I_n)\ ;\ A \subseteq \bigcup_{n=1}^\infty I_n,\ I_n \in I^1\Big\}.$$
On the right-hand side, the infimum is taken over all countable covers {I_n} of A, where I_n ∈ I¹ for all n. This procedure defines the "measure" m*(A) ∈ [0, ∞] for any set A ⊆ R. Thus, we have a map
$$m^* : P(\mathbb{R}) \to [0, \infty],$$
where P(R) is the power set of R, the set of all subsets of R. The function m∗ is called Lebesgue outer measure and it assigns a “length” to every subset of R. It might seem like m∗ is exactly what we need to solve Lebesgue’s measure problem: 1) There are sets whose measure is not zero. 2) Two congruent sets have the same measure. 3) The measure of the union of a finite number or an infinite countable number of sets that are pairwise disjoint is the sum of the measures of the sets.
Certainly m* satisfies 1) (e.g. it's easy to check that m*(R) = ∞) and in Section 4.4 we'll prove it satisfies 2), where congruent means there is a translation taking one set to the other. However, it fails to have Property 3) in Lebesgue's quote; in fact (see Section 4.4), one can always break up an interval (a, b), where a < b, as a union (a, b) = A ∪ ((a, b) \ A) for some subset A ⊆ (a, b) such that
$$b - a < m^*(A) + m^*((a, b)\setminus A).$$
Thus, the sum of the measures of the parts (A and (a, b) \ A) is greater than the measure of the whole (the interval (a, b))! This fact follows from the work of Giuseppe Vitali (1875–1932), who in his famous 1905 paper [298] proved the existence of a non-measurable set, and said This suffices us to conclude: The problem of measure of the set of points of a straight line is impossible.
Intuitively speaking, non-measurable sets have "blurry" or "cloudy" boundaries, so the definition (3.1) assigns a larger measure to them than they should have. Now although m* is not a measure on P(R), there is a proper subset of P(R) on which m* does satisfy 3); these sets are what Lebesgue calls measurable. Now assume that A is a subset of some bounded interval (a, b). Then when Lebesgue says "When the lower and upper bounds for A are equal, m(A) is defined, and we say that A is measurable", what he is saying is that the subset A is measurable if m*(A) = b − a − m*((a, b) \ A); see Equations (3) and (4) in Lebesgue's quote. In other words, he defines A ⊆ (a, b) to be measurable if
$$(3.2)\qquad b - a = m^*(A) + m^*((a, b)\setminus A),$$
and then m(A) = measure of A := m*(A). Intuitively speaking, measurable sets have "distinct" boundaries, so the definition (3.1) assigns to them exactly the measure they should have.
3.1.2. Answer #2 by Carathéodory. Unfortunately the definition (3.2) of measurability only works for sets that belong to some open interval. Of course, it would also be nice to define measurability for unbounded sets too. Notice that (3.2) can be written as
$$(3.3)\qquad m^*(a, b) = m^*((a, b) \cap A) + m^*((a, b)\setminus A),$$
since (a, b) ∩ A = A as A ⊆ (a, b), and we are using the fact that m*(a, b) = b − a, which you will prove in Problem 1. For an arbitrary subset A of R, there might not exist an interval (a, b) that contains A. However, notice that even though there might not exist such an interval containing A, both sides of the equality (3.3) are still defined for any interval (a, b) (because m* is defined for any subset of R). Thus, why not simply declare a subset A ⊆ R to be measurable if (3.3) holds for any interval (a, b)? This is exactly Constantin Carathéodory's (1873–1950) brilliant idea, which he published in 1914. In fact, it turns out that for theoretical purposes it's convenient to replace (a, b) by any subset of R. Thus, here's Carathéodory's definition of measurability: A subset A ⊆ R is measurable if for any subset E ⊆ R, we have
$$m^*(E) = m^*(E \cap A) + m^*(E\setminus A).$$
If A does lie in an interval, this definition is equivalent to Lebesgue's definition, but the advantage of Carathéodory's definition is that it works for unbounded sets too.

3.1.3. Answer #3 by Littlewood. Here are yet some more ways to define measurability, which are intuitively nice. On page 26 of John Littlewood's (1885–1977) book "Lectures on the theory of functions" [183] he tries to emphasize to his readers that the theory of Lebesgue integration is not so difficult as it may seem. In particular, concerning measurable sets he says
Every [finite Lebesgue] (measurable) set is nearly a finite sum of intervals.
We can in fact take Littlewood's statement here and make it into a definition. Let A ⊆ R and assume that m*(A) < ∞. Then we can define A to be measurable if A is "nearly" a finite union of intervals, that is, an elementary figure: A is measurable if A is "nearly" an elementary figure. We make this precise as follows: Given any ε > 0 there is an element I ∈ E¹ (= the elementary figures in R¹ = finite unions of elements of I¹) such that
$$m^*(A\setminus I) < \varepsilon \quad\text{and}\quad m^*(I\setminus A) < \varepsilon.$$
Thus, A is "nearly" equal to I in the sense that the points in A but not in I and the points in I but not in A have small outer measure. Now what if A does not necessarily have finite outer measure? Then we can modify Littlewood's definition of measurability as follows: Given any subset A ⊆ R, A is measurable if A is "nearly" an open set. We make this precise as follows. To say that A is "nearly" an open set we mean: given ε > 0 there is an open set U ⊆ R such that
$$A \subseteq U \quad\text{and}\quad m^*(U\setminus A) < \varepsilon.$$
Thus, the points in U but not in A have small outer measure. Littlewood's approach to measurability is intuitively appealing because we all should have an intuitive feel for unions of intervals and open sets; measurable sets are not much different. We end this present section with Littlewood's words of wisdom [183, p. 26–27]:

Most of the results of this present section are fairly intuitive applications of these ideas, and the student armed with them should be equal to most occasions when real variable theory is called for. If one of the principles would be the obvious means to settle a problem if it were "quite" true, it is natural to ask if the "nearly" is near enough, and for a problem that is actually soluble it generally is.

◮ Exercises 3.1.
1. In this problem we prove that for any interval (a, b) (with a < b), we have m*(a, b) = b − a. The arguments you use here will be repeated quite often in the sequel! We shall prove that m*(a, b) ≤ b − a and b − a ≤ m*(a, b).
(i) Using that (a, b) ⊆ (a, b] = (a, b] ∪ ∅ ∪ ∅ ∪ ···, from the definition (3.1) prove that m*(a, b) ≤ b − a.
(ii) Assume that (a, b) ⊆ ∪_{n=1}^∞ (a_n, b_n]. Show that b − a ≤ m*(a, b) if we can show that b − a ≤ Σ_{n=1}^∞ m(a_n, b_n].
(iii) Let ε > 0 with ε < (b − a)/2 and prove that [a + ε, b − ε] ⊆ ∪_{n=1}^∞ (a_n, b_n + ε/2ⁿ). Using compactness, show there is an N such that [a + ε, b − ε] ⊆ ∪_{n=1}^N (a_n, b_n + ε/2ⁿ), and hence (a + ε, b − ε] ⊆ ∪_{n=1}^N (a_n, b_n + ε/2ⁿ].
(iv) Using finite subadditivity, conclude that b − a − 2ε ≤ Σ_{n=1}^N (m(a_n, b_n] + ε/2ⁿ) ≤ ε + Σ_{n=1}^∞ m(a_n, b_n]. Taking ε → 0, get the desired result.
3.2. Countably additive set functions on semirings

In this section we study measures on semirings. In particular, we show that Lebesgue measure lives up to its name: it's indeed a measure.

3.2.1. Countable additivity and Lebesgue(-Stieltjes) measures. An additive set function µ : I → [0, ∞] on a semiring I (in particular, a ring or σ-algebra) is said to be countably additive if for any countable collection of pairwise disjoint sets A₁, A₂, A₃, . . . in I such that ∪_{n=1}^∞ A_n ∈ I, we have
$$\mu\Big(\bigcup_{n=1}^\infty A_n\Big) = \sum_{n=1}^\infty \mu(A_n).$$
A countably additive set function is called a measure. If X ∈ I and µ(X) = 1, then µ is called a probability measure. Here are a couple of remarks about this definition. First, note that if I is a finite collection of sets, µ is automatically countably additive (can you prove this?). In particular, all probability set functions where the sample space is finite are measures. So, countable additivity is a new idea only when I is not finite. Second, note that if I is a σ-algebra, then the assumption ∪_{n=1}^∞ A_n ∈ I is automatically satisfied; otherwise we have to make this assumption. So, although we work with semirings and rings in this section, the natural domain for a measure is really a σ-algebra. Later we shall prove that the additive set functions of geometry (Lebesgue measure) and those we have looked at in our probability examples (e.g. infinite product
measures) are all countably additive. However, lest you think that all finitely additive set functions are countably additive, consider the following example. (There are other examples in the exercises.)

Example 3.1. Let f : R → R be the nondecreasing function defined by
$$f(x) = \begin{cases} 0 & \text{if } x \le 0,\\ 1 & \text{if } x > 0.\end{cases}$$
Recall that the corresponding Lebesgue-Stieltjes set function is defined by µ_f(a, b] = f(b) − f(a). We know (by Proposition 1.21) that µ_f : I¹ → [0, ∞) is finitely additive. Here's a picture of f:

[Figure: the graph of f; f = 0 on (−∞, 0] and f = 1 on (0, ∞), so 0 = f(0) while 1 = f(a) = f(b) = f(1).]

Observation: Notice that µ_f(a, b] = 0 for any a, b > 0, while µ_f(0, 1] = 1. Using this fact it's easy to show that µ_f is not countably additive. Indeed, consider the decomposition
$$(0, 1] = \bigcup_{n=1}^\infty \Big(\frac{1}{n+1}, \frac{1}{n}\Big], \quad\text{or}\quad A = \bigcup_{n=1}^\infty A_n,$$
where A = (0, 1] and A_n = (1/(n+1), 1/n]; then A₁, A₂, . . . are pairwise disjoint. By our observation above, for any n ∈ N, we have µ_f(A_n) = 0, so Σ_{n=1}^∞ µ_f(A_n) = 0. However, µ_f(A) = f(1) − f(0) = 1, so, as 1 ≠ 0,
$$\mu_f(A) \ne \sum_{n=1}^\infty \mu_f(A_n).$$
Thus µ_f : I¹ → [0, ∞) is not countably additive.
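Example 3.1 can be mimicked numerically; the following sketch (an illustration, not part of the text) evaluates µ_f on the pieces A_n and on their union.

```python
# Numerical view of Example 3.1 (a sketch): each piece A_n = (1/(n+1), 1/n]
# gets mu_f-measure f(1/n) - f(1/(n+1)) = 0, yet their union (0, 1] has
# mu_f-measure 1, so countable additivity fails.
def f(x):
    return 0.0 if x <= 0 else 1.0

def mu_f(a, b):  # mu_f(a, b] = f(b) - f(a)
    return f(b) - f(a)

pieces = sum(mu_f(1.0 / (n + 1), 1.0 / n) for n in range(1, 1000))
print(pieces, mu_f(0.0, 1.0))  # 0.0 versus 1.0
```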
Now how do we prove that an additive set function µ : I → [0, ∞] is countably additive? That is, given A ∈ I with A = ∪_{n=1}^∞ A_n with the A_n ∈ I pairwise disjoint, we need to show that
$$\mu(A) = \sum_{n=1}^\infty \mu(A_n).$$
We can break this up into two separate inequalities:
$$(3.4)\qquad \sum_{n=1}^\infty \mu(A_n) \le \mu(A) \quad\text{and}\quad \mu(A) \le \sum_{n=1}^\infty \mu(A_n).$$
The first inequality holds because any finitely additive set function is countably superadditive (see Theorem 2.4 back in Section 2.3). The second inequality holds if ∞ is replaced by a finite N; this is just finite subadditivity, which holds for any finitely additive set function (again from Theorem 2.4 back in Section 2.3). However, for infinite sums it may fail. For instance, in the above example we have A = ∪_{n=1}^∞ A_n, where A = (0, 1] and A_n = (1/(n+1), 1/n], but as 1 ≰ 0, we have
$$\mu_f(A) \not\le \sum_{n=1}^\infty \mu_f(A_n).$$
Conclusion: the second inequality in (3.4) is the deciding factor in determining countable additivity. This explains why the following definition is important. An additive set function µ : I → [0, ∞] on a semiring I is said to be countably subadditive if A ⊆ ∪_{n=1}^∞ A_n, where A, A₁, A₂, . . . ∈ I, implies¹
$$\mu(A) \le \sum_{n=1}^\infty \mu(A_n).$$
In probability theory, this inequality is called Boole's inequality after George Boole (1815–1864). We can reword our conclusion as: If a finitely additive set function is countably subadditive, then it's countably additive. We shall use this fact to prove that Lebesgue measure is indeed a measure.

Before presenting the proof, let's review some handy notation. Recall back in Section 1.4, right before Proposition 1.10, that we denote a box (p₁, q₁] × ··· × (p_n, q_n] in Rⁿ by the notation (p, q], where p = (p₁, . . . , p_n) and q = (q₁, . . . , q_n) are elements of Rⁿ. Of course, this box could be the empty set if any p_k ≥ q_k. Given r, s ∈ R, we let (p + r, q + s] be the box determined by the n-tuples (p₁ + r, . . . , p_n + r) and (q₁ + s, . . . , q_n + s). There are analogous notations for other types of boxes such as closed, open, etc.

Lebesgue measure is a measure

Theorem 3.1. For any n, Lebesgue measure m : Iⁿ → [0, ∞) is a measure; that is, m is countably additive.

Proof: To prove that m is a measure on Iⁿ, we shall prove that m is countably subadditive. The idea is to use a compactness argument and an "ε/2ᵏ trick" to reduce countable subadditivity to finite subadditivity (and we know that m is finitely subadditive since it is finitely additive). Let (a, b] ⊆ ∪_{k=1}^∞ (a_k, b_k] where (a, b] ∈ Iⁿ and each (a_k, b_k] ∈ Iⁿ; we need to prove that m(a, b] ≤ Σ_{k=1}^∞ m(a_k, b_k]. To do so, let ε > 0 and for each k ∈ N, take a positive number ε_k > 0 (later on in the proof we'll choose these ε_k's in a nice way). Observe that
$$[a + \varepsilon, b] \subseteq (a, b] \subseteq \bigcup_{k=1}^\infty (a_k, b_k] \subseteq \bigcup_{k=1}^\infty (a_k, b_k + \varepsilon_k).$$
We now use compactness: Since the closed box [a + ε, b] is compact, it is covered by a finite union of the open sets on the far right, say the first N of them,
$$[a + \varepsilon, b] \subseteq \bigcup_{k=1}^N (a_k, b_k + \varepsilon_k).$$
Since (a + ε, b] ⊆ [a + ε, b] and (a_k, b_k + ε_k) ⊆ (a_k, b_k + ε_k], we conclude that
$$(a + \varepsilon, b] \subseteq \bigcup_{k=1}^N (a_k, b_k + \varepsilon_k].$$
Now, m is finitely subadditive, so
$$m(a + \varepsilon, b] \le \sum_{k=1}^N m(a_k, b_k + \varepsilon_k].$$
¹Note that there is no assumption on the pairwise disjointness of the A_n's.
The right-hand sum is ≤ the same sum with N replaced by ∞ and hence,
$$m(a + \varepsilon, b] \le \sum_{k=1}^\infty m(a_k, b_k + \varepsilon_k].$$
We now use the "ε/2ᵏ trick": Using the fact that for any p, q ∈ Rⁿ, m(p, q + r] − m(p, q] is a continuous function of r ∈ [0, ∞) and it vanishes at r = 0, we can take ε_k > 0 small enough so that
$$m(a_k, b_k + \varepsilon_k] - m(a_k, b_k] \le \frac{\varepsilon}{2^k}.$$
Hence, m(a_k, b_k + ε_k] ≤ ε/2ᵏ + m(a_k, b_k], so
$$\sum_{k=1}^\infty m(a_k, b_k + \varepsilon_k] \le \sum_{k=1}^\infty \frac{\varepsilon}{2^k} + \sum_{k=1}^\infty m(a_k, b_k].$$
Since Σ_{k=1}^∞ 1/2ᵏ = 1, we see that
$$m(a + \varepsilon, b] \le \varepsilon + \sum_{k=1}^\infty m(a_k, b_k].$$
Finally, letting ε → 0 and using the fact that m(a + ε, b] is a continuous function of ε, we obtain m(a, b] ≤ Σ_{k=1}^∞ m(a_k, b_k], as desired. □
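Two small numerical companions to this proof (illustrative sketches, with arbitrary truncation levels): the slack budget Σ_k ε/2ᵏ used in the ε/2ᵏ trick really totals ε, and, in contrast with µ_f from Example 3.1, honest length IS countably additive on the decomposition (0, 1] = ∪ (1/(n+1), 1/n].

```python
# First: enlarging the k-th cover element by a length-(eps/2^k) margin adds at
# most sum_k eps/2^k = eps of total length, however many sets are enlarged.
eps = 0.01
slack = sum(eps / 2 ** k for k in range(1, 60))

# Second: the ordinary lengths m(1/(n+1), 1/n] = 1/n - 1/(n+1) telescope to 1,
# the length of (0, 1] -- countable additivity holds for genuine length.
lengths = sum(1.0 / n - 1.0 / (n + 1) for n in range(1, 100000))

print(slack, lengths)  # ~0.01 and ~1.0
```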
We now consider Lebesgue-Stieltjes set functions. In Proposition 1.21 we proved that any such set function is finitely additive. However, we saw in Example 3.1 that µ_f may not be countably additive, which was the case for the function
$$f(x) = \begin{cases} 0 & \text{if } x \le 0,\\ 1 & \text{if } x > 0.\end{cases}$$
Note that in this example, f is not right-continuous at 0. The following theorem says that right-continuity (at every point) is both a necessary and sufficient condition for µ_f to be a measure.

Lebesgue-Stieltjes measures

Theorem 3.2. The Lebesgue-Stieltjes set function µ_f of a nondecreasing function f is a measure on I¹ if and only if f is right-continuous (at every point). In particular, the Lebesgue-Stieltjes set function of a nondecreasing continuous function is a measure on I¹.

Proof: If f : R → R is nondecreasing and right-continuous, to prove that µ_f is countably additive, we just have to prove that µ_f is countably subadditive. One proof of this fact is similar to the proof of Theorem 3.1 (basically, everywhere you see an "m" in the proof of Theorem 3.1, replace it with "µ_f"); you can go through the details if you wish. We now prove necessity, so suppose that µ_f is countably additive and fix x ∈ R. To prove that f is right-continuous at x we just have to prove that f(x) = lim_{n→∞} f(x_n) for any strictly decreasing sequence x₁ > x₂ > x₃ > ··· with lim_{n→∞} x_n = x (why?). Consider the union
$$(x, x_1] = \bigcup_{n=1}^\infty (x_{n+1}, x_n]$$
and note that the sets (x_{n+1}, x_n] are pairwise disjoint for different n. Since µ_f is assumed to be a measure,
$$f(x_1) - f(x) = \mu_f(x, x_1] = \sum_{n=1}^\infty \mu_f(x_{n+1}, x_n] = \lim_{N\to\infty}\sum_{n=1}^{N-1}\big(f(x_n) - f(x_{n+1})\big).$$
The last sum telescopes to $\sum_{n=1}^{N-1}\big(f(x_n) - f(x_{n+1})\big) = f(x_1) - f(x_N)$, and we find that
$$f(x_1) - f(x) = f(x_1) - \lim_{N\to\infty} f(x_N).$$
Cancelling off f(x₁) shows that f(x) = lim_{N→∞} f(x_N). □
3.2.2. Equivalence of countable additivity and subadditivity. Just before the proof of Theorem 3.1 we saw that countable subadditivity implies countable additivity. We now prove the converse. Thus, countable subadditivity and countable additivity are equivalent. Before proving this, we need the following result on double summations that we'll use time and time again, often without mentioning it.

Double summation lemma

Lemma 3.3. For each pair (m, n) ∈ N × N, let a_{mn} ∈ [0, ∞] and let f : N → N × N be a bijective function; therefore f(1), f(2), f(3), . . . is a list of all elements of N × N. Then
$$\sum_{m=1}^\infty\sum_{n=1}^\infty a_{mn} = \sum_{n=1}^\infty\sum_{m=1}^\infty a_{mn} = \sum_{n=1}^\infty a_{f(n)}.$$
Either of these sums is denoted by Σ_{m,n} a_{mn} (since all sums are the same).
Proof: We remark that the double sum Σ_{m=1}^∞ Σ_{n=1}^∞ a_{mn} means for each m ∈ N, to sum the inner summation Σ_{n=1}^∞ a_{mn}, which gives a nonnegative extended real number for each m ∈ N. We then sum all these numbers from m = 1 to ∞. (The double sum Σ_{n=1}^∞ Σ_{m=1}^∞ a_{mn} has a similar meaning.) An alternative way to look at double sums is as follows. If $s_{mn} := \sum_{i=1}^m\sum_{j=1}^n a_{ij}$, then
$$\sum_{m=1}^\infty\sum_{n=1}^\infty a_{mn} = \lim_{m\to\infty}\lim_{n\to\infty} s_{mn} \quad\text{and}\quad \sum_{n=1}^\infty\sum_{m=1}^\infty a_{mn} = \lim_{n\to\infty}\lim_{m\to\infty} s_{mn},$$
where an iterated limit, e.g. lim_{m→∞} lim_{n→∞} s_{mn}, means for each m ∈ N, take the inner limit first, lim_{n→∞} s_{mn}, then take the outer limit m → ∞ next. Let L₁ = Σ_{m=1}^∞ Σ_{n=1}^∞ a_{mn}, L₂ = Σ_{n=1}^∞ Σ_{m=1}^∞ a_{mn}, L₃ = Σ_{n=1}^∞ a_{f(n)}, and put
$$L := \sup\{s_{mn}\ ;\ m, n \in \mathbb{N}\};$$
we shall prove that L₁ = L₂ = L₃ = L. We first show that L₁, L₂, L₃ ≤ L, then we prove the opposite inequality. For all m, n we have, by definition of L, s_{mn} ≤ L, so using the fact that limits preserve inequalities, we have lim_{n→∞} s_{mn} ≤ L.
Now taking m → ∞, we get
$$\lim_{m\to\infty}\lim_{n\to\infty} s_{mn} \le L \implies L_1 \le L.$$
Similarly, L₂ ≤ L. Given n ∈ N, we can choose m so that f(1), . . . , f(n) ∈ N × N are amongst the pairs (i, j) with 1 ≤ i, j ≤ m. Then,
$$\sum_{i=1}^n a_{f(i)} \le \sum_{i=1}^m\sum_{j=1}^m a_{ij} \le L,$$
by definition of L. Taking n → ∞ we get L₃ ≤ L. Thus, L₁, L₂, L₃ ≤ L. On the other hand, by definition of L₁ and L₂, for any m, n we have
$$s_{mn} \le L_1 \quad\text{and}\quad s_{mn} \le L_2.$$
Thus, taking the supremum over all m, n, we see that L ≤ L₁ and L ≤ L₂. Also, given m, n ∈ N, using that f(1), f(2), f(3), . . . is a list of all elements of N × N, we can choose N ∈ N so that f(1), f(2), . . . , f(N) contain all the pairs (i, j) with 1 ≤ i ≤ m and 1 ≤ j ≤ n. Hence,
$$s_{mn} \le \sum_{i=1}^N a_{f(i)} \le L_3.$$
Taking the supremum over all m, n, we get L ≤ L₃. □
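Lemma 3.3 can be illustrated numerically with a_{mn} = 2^{−(m+n)}, whose full sum is (Σ 2^{−m})(Σ 2^{−n}) = 1. The truncation level and the particular diagonal enumeration of N × N below are illustrative choices.

```python
# Numerical sketch of Lemma 3.3: both iterated orders of summation and a
# diagonal (bijective) enumeration of N x N give the same value.
M = 60  # truncation level; the tail beyond M is far below double precision

row_first = sum(sum(2.0 ** -(m + n) for n in range(1, M)) for m in range(1, M))
col_first = sum(sum(2.0 ** -(m + n) for m in range(1, M)) for n in range(1, M))

diagonal = 0.0
for d in range(2, M + 1):      # walk the diagonals m + n = d, a bijective listing
    for m in range(1, d):
        n = d - m
        diagonal += 2.0 ** -(m + n)

print(row_first, col_first, diagonal)  # all close to 1
```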
We shall use this double summation lemma as follows. Let µ : I → [0, ∞] be a measure, let A ∈ I, and assume that² A = ∪_{m,n} A_{mn} where A_{mn} ∈ I for each (m, n) ∈ N × N and the A_{mn} are pairwise disjoint, meaning that when (m, n) ≠ (k, ℓ), A_{mn} ∩ A_{kℓ} = ∅. We claim that
$$(3.5)\qquad \mu(A) = \sum_m\sum_n \mu(A_{mn}) = \sum_n\sum_m \mu(A_{mn}).$$
Indeed, pick any bijection f : N → N × N. Then we have
$$A = \bigcup_{n=1}^\infty A_{f(n)}.$$
Thus, by countable additivity,
$$\mu(A) = \sum_{n=1}^\infty \mu(A_{f(n)}).$$
On the other hand, by our lemma with a_{mn} = µ(A_{mn}) we have
$$\sum_{n=1}^\infty \mu(A_{f(n)}) = \sum_{m=1}^\infty\sum_{n=1}^\infty \mu(A_{mn}) = \sum_{n=1}^\infty\sum_{m=1}^\infty \mu(A_{mn}).$$
This proves (3.5); we sometimes denote either sum in (3.5) by Σ_{m,n} µ(A_{mn}). Now with this intermission complete, we prove that countable additivity and countable subadditivity are equivalent.

²This includes finite unions because we could take A_{mn} = ∅ except for finitely many (m, n).
Equivalence of countable additivity and subadditivity

Theorem 3.4. If µ : I → [0, ∞] is additive on a semiring, then

    µ is countably additive  ⟺  µ is countably subadditive;

that is, µ is a measure if and only if µ is countably subadditive.

Proof : We already know that countable subadditivity implies countable additivity, so suppose that µ is countably additive; we shall prove that µ is countably subadditive. To do so, let A ∈ I with A ⊆ ⋃_{n=1}^{∞} A_n, where A_n ∈ I for each n; we need to show that µ(A) ≤ ∑_{n=1}^{∞} µ(A_n). To prove this, we first intersect both sides of A ⊆ ⋃_{n=1}^{∞} A_n with A to obtain A = ⋃_{n=1}^{∞} (A ∩ A_n). By the Fundamental Lemma of Semirings, there exist pairwise disjoint sets {B_{nm}} ⊆ I such that for each n the sets B_{nm} ⊆ A ∩ A_n are finite in number, and

    A = ⋃_n (A ∩ A_n) = ⋃_{n,m} B_{nm}.

By superadditivity, for each n we have

(3.6)    ∑_m µ(B_{nm}) ≤ µ(A ∩ A_n) ≤ µ(A_n),

where we used that µ(A ∩ A_n) ≤ µ(A_n) by monotonicity. Finally, using countable additivity,

    A = ⋃_{n,m} B_{nm}  ⟹  µ(A) = ∑_n ∑_m µ(B_{nm})    (by our discussion on (3.5))
                              ≤ ∑_n µ(A_n)    (by (3.6)).
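Countable subadditivity for the length of half-open intervals can be seen concretely in a numerical sketch (not from the text; the particular covers of (0, 1] are arbitrary choices), comparing a disjoint cover with an overlapping one:

```python
from fractions import Fraction

def length(a, b):
    # Length of the half-open interval (a, b].
    return b - a

one = Fraction(1)
N = 1000  # truncation level for the infinite covers

# Disjoint cover: A_n = (1/(n+1), 1/n] partitions (0, 1], so the partial
# sums of the lengths increase toward length((0, 1]) = 1.
disjoint_total = sum(length(one / (n + 1), one / n) for n in range(1, N + 1))

# Overlapping cover: B_n = (1/(n+2), 1/n] also covers (0, 1], but the
# lengths overcount the overlaps; subadditivity still gives
# length((0, 1]) <= sum of the lengths, just no longer with equality.
overlap_total = sum(length(one / (n + 2), one / n) for n in range(1, N + 1))

print(float(disjoint_total), float(overlap_total))
```

The disjoint total telescopes to 1 − 1/(N+1), approaching 1 from below, while the overlapping total exceeds 1, exactly the two sides of the inequality µ(A) ≤ ∑ µ(A_n).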
◮ Exercises 3.2.
1. Let X be an infinite set. Define µ : P(X) → [0, ∞] by µ(A) = 0 if A is finite and µ(A) = ∞ if A is infinite. Prove that µ is finitely, but not countably, additive.
2. (Various examples of measures)
(a) For any nonnegative function g : R → [0, ∞) that is Riemann integrable on any finite interval, prove that m_g : I^1 → [0, ∞), defined by m_g(a, b] = ∫_a^b g(x) dx, is a measure on I^1 and hence extends uniquely to a measure on E^1.
(b) Prove that Proposition 1.17 holds verbatim where we replace “finitely additive” with “countably additive” everywhere in that proposition.
(c) (Dirac and discrete measures)
(a) Given a semiring I of subsets of a set X and given α ∈ X, define δ_α : I → [0, ∞) by δ_α(A) := 1 if α ∈ A and δ_α(A) := 0 if α ∉ A. Prove that δ_α is a measure; it is called the Dirac measure supported on α.
(b) Let R be a ring of subsets of X. Assume that R contains the total set X and all countable subsets of X. We say that a measure µ : R → [0, ∞] is discrete if there exists a countable set C ⊆ X such that µ(X \ C) = 0. Prove that a measure µ is discrete if and only if there are countably many points α_1, α_2, ... ∈ X and extended real numbers a_1, a_2, ... ∈ (0, ∞] such that µ = ∑_n a_n δ_{α_n}, in the sense that µ(A) = ∑_n a_n δ_{α_n}(A) for all A ∈ R.
144
3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY
3. (“Rationals are only finitely additive”) Suppose that we wanted to study “lengths” of intervals of rational numbers. Then instead of working with the semiring I^1, we would work with the semiring

    I := {(a, b] ∩ Q ; a, b ∈ Q, a ≤ b}.

Consider the set function µ : I → [0, ∞) defined by µ(I) := b − a for I = (a, b] ∩ Q ∈ I.
(i) Prove that µ is finitely additive.
(ii) Prove that µ is not countably additive.
This shows that the natural “length” set function on I is not a measure! Suggestion: Let {r_1, r_2, r_3, ...} be a list of all rational numbers in I = (0, 1] ∩ Q. Try to define elements I_1, I_2, I_3, ... ∈ I such that for each n ∈ N, r_n ∈ I_n and µ(I_n) ≤ 1/2^{n+1}, then consider ⋃_{n=1}^{∞} I_n.
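The covering idea in the suggestion can be explored numerically. The following sketch (not from the text; the enumeration of the rationals by increasing denominator is an arbitrary choice) puts an interval of length 1/2^{n+1} around the n-th rational, so every listed rational in (0, 1] gets covered while the total length stays below 1/2, which is exactly the tension with µ((0, 1] ∩ Q) = 1:

```python
from fractions import Fraction

def rationals_in_unit_interval(count):
    # List the first `count` rationals in (0, 1], ordered by denominator.
    seen, out, q = set(), [], 1
    while len(out) < count:
        for p in range(1, q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.add(r)
                out.append(r)
                if len(out) == count:
                    break
        q += 1
    return out

K = 200
rs = rationals_in_unit_interval(K)

# Around the n-th rational r_n put I_n = (r_n - 1/2^(n+1), r_n] ∩ Q, an
# element of the semiring (rational endpoints) with µ(I_n) = 1/2^(n+1).
intervals = [(r - Fraction(1, 2 ** (n + 1)), r) for n, r in enumerate(rs, start=1)]

total_length = sum(b - a for a, b in intervals)
all_covered = all(a < r <= b for r, (a, b) in zip(rs, intervals))
print(float(total_length), all_covered)
```

Since the total length of the I_n is at most ∑ 1/2^{n+1} = 1/2, countable additivity would force µ((0, 1] ∩ Q) ≤ 1/2, contradicting µ((0, 1] ∩ Q) = 1.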
3.3. Infinite product spaces and properties of countable additivity
The main problem in this section is the following. Assume we are given countably many probability measures

    µ_1 : I_1 → [0, 1],  µ_2 : I_2 → [0, 1],  µ_3 : I_3 → [0, 1],  ...,

where I_i is a semiring on a sample space X_i. Let C ⊆ ∏_{i=1}^{∞} X_i denote the collection of all cylinder sets, sets of the form

    A = A_1 × A_2 × ⋯ × A_n × X_{n+1} × X_{n+2} × X_{n+3} × ⋯,

where A_i ∈ I_i for each i, and define the infinite product of µ_1, µ_2, ...,

    µ : C → [0, 1],

on a cylinder set A as written above, by µ(A) := µ_1(A_1) · µ_2(A_2) ⋯ µ_n(A_n). By Proposition 2.6 we know that µ is finitely additive. Since the µ_i's were, by assumption, measures, it's natural to conjecture that µ is in fact countably additive. Proving this is the main goal of this section. On the way to this goal, we give alternative characterizations of countable additivity in terms of the notion of continuity.

3.3.1. Infinite product probabilities I: Finite sample spaces. For pedagogical reasons, we first treat the case when the X_i's are finite; the proof in this case is simpler than in the more general case, which we'll handle in Subsection 3.3.3. For example, in the case X_i = {0, 1} for each i, with {1} occurring with probability p and {0} with probability 1 − p, we shall prove in particular that the infinite product measure on S^∞ really is a measure. More generally, assume the X_i's are finite nonempty sample spaces. Then given probability set functions

    µ_i : P(X_i) → [0, 1],  i = 1, 2, ...,

we shall prove that the corresponding infinite product measure µ : C → [0, 1] really is a measure, that is, countably additive. In fact, we shall prove even more: In Theorem 3.5 below we shall prove that an arbitrary additive set function on C is automatically countably additive! The proof of this fact uses the following interesting property of cylinder sets (assuming the X_i's are finite!):

Claim: If A_1, A_2, A_3, ... ∈ C are pairwise disjoint cylinder sets whose union is a cylinder set, then there is an N such that A_k = ∅ for all k > N.

Assuming this claim for the moment, let's prove the following theorem:

Measures on the cylinder sets

Theorem 3.5. If the X_i's are finite nonempty sets, then any additive set function on C is countably additive; explicitly, if µ : C → [0, ∞] is finitely
additive, then in fact it's countably additive. In particular, the infinite product of countably many probability set functions is a probability measure on C.

Proof : Let A ∈ C and assume that A = ⋃_{n=1}^{∞} A_n, where the A_n's are pairwise disjoint elements of C. By our claim there is an N such that A_k = ∅ for all k > N, and hence A = ⋃_{n=1}^{N} A_n. By finite additivity, we have

    µ(A) = ∑_{n=1}^{N} µ(A_n) = ∑_{n=1}^{∞} µ(A_n),

where we used the fact that µ(A_n) = µ(∅) = 0 for all n > N. This completes our proof. How easy was that!
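For a concrete instance of the finite additivity at work here (a sketch, not from the text; the bias p = 1/3 and the refinement depth d = 5 are arbitrary choices), take the coin-toss product measure on {0, 1}^∞ and refine a cylinder set into deeper disjoint cylinders:

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 3)  # each coordinate: {1} with probability p, {0} with 1 - p

def mu(cyl):
    # Product measure of the cylinder set fixing the first len(cyl) coordinates.
    out = Fraction(1)
    for bit in cyl:
        out *= p if bit == 1 else 1 - p
    return out

# The cylinder A = {ω : ω_1 = 1} is the disjoint union of the 2^d cylinders
# that additionally fix coordinates 2, ..., d + 1; additivity says the
# measures of the pieces must add up to µ(A) = p.
A = (1,)
d = 5
pieces = [A + tail for tail in product((0, 1), repeat=d)]
refined_total = sum(mu(c) for c in pieces)
print(mu(A), refined_total)
```

The exact equality refined_total = p·(p + (1 − p))^d = p holds for every depth d, which is the finite-additivity computation Proposition 2.6 packages in general.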
The proof of our claim is based on the following lemma.

Lemma 3.6. If A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯ is a nonincreasing sequence of nonempty sets in R(C), the ring generated by the cylinder sets, then the intersection ⋂_{n=1}^{∞} A_n is not empty.
Proof : Note that if F is a finite set and {A_n} is a nonincreasing sequence of nonempty subsets of F, then it's easy to prove that the intersection of all the A_n's is nonempty, as seen here:
Figure 3.3. A_1 is the large ellipse containing all the (finitely many) points of F, A_2 is the second largest ellipse containing all points of F but one, and so on. Since F is finite, eventually all the A_n's are the same (the small circle containing a single point in this example). Thus, the intersection of all the A_n's is nonempty.

For essentially the same reasons, our lemma is true, although, of course, the proof is a little harder! We prove this lemma in two steps.

Step 1: We first produce a point that should be in the intersection ⋂_{n=1}^{∞} A_n. To this end, since each A_n is nonempty we can choose a point in each set; let us denote such points by

    (a_{11}, a_{12}, a_{13}, a_{14}, ...) ∈ A_1
    (a_{21}, a_{22}, a_{23}, a_{24}, ...) ∈ A_2
    (a_{31}, a_{32}, a_{33}, a_{34}, ...) ∈ A_3
    (a_{41}, a_{42}, a_{43}, a_{44}, ...) ∈ A_4
    ⋮

Now look at the first column, which represents a sequence of points in the finite set X_1. Since X_1 is a finite set, at least one point in X_1, call such a point a_1, must be repeated infinitely many times in the first column. Thus, there is an infinite set B_1 ⊆ N such that a_{n1} = a_1 for all n ∈ B_1. Now consider the elements a_{n2} in the second column where n ∈ B_1. Since B_1 is infinite, there are infinitely many such a_{n2}'s. Since X_2 is a finite set, at least one point in X_2, call such a point a_2, is repeated
infinitely many times amongst the a_{n2}'s where n ∈ B_1. Thus, there is an infinite set B_2 ⊆ B_1 such that a_{n2} = a_2 for all n ∈ B_2. Note that since B_2 ⊆ B_1, for all n ∈ B_2 we still have a_{n1} = a_1. In conclusion, we have an infinite set B_2 ⊆ B_1 such that

    a_{n1} = a_1,  a_{n2} = a_2  for all n ∈ B_2.

Continuing by induction, we find infinite subsets B_1, B_2, B_3, ... ⊆ N and points a_i ∈ X_i such that for each m ∈ N,

(3.7)    a_{n1} = a_1,  a_{n2} = a_2,  ...,  a_{nm} = a_m  for all n ∈ B_m.

We now put

    a := (a_1, a_2, a_3, ...) ∈ ∏ X_i.
Step 2: We now prove that a ∈ A_1 ∩ A_2 ∩ ⋯. To do so, fix k ∈ N; we have to prove that a ∈ A_k. By Proposition 2.6 we know that A_k has the form A_k = A × X_{m+1} × X_{m+2} × X_{m+3} × ⋯ for some m and for some set A ⊆ X_1 × ⋯ × X_m, so we just have to prove that (a_1, ..., a_m) ∈ A. To prove this, observe that by (3.7), for any n ∈ B_m we have

    (a_1, a_2, ..., a_m, a_{n,m+1}, a_{n,m+2}, ...) = (a_{n1}, a_{n2}, ..., a_{nm}, a_{n,m+1}, a_{n,m+2}, ...),

which, as we recall from Step 1, is a point belonging to the set A_n. Now, since B_m is an infinite set, we can take n ≥ k. In this case, we know that A_n ⊆ A_k because A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯, so

    (a_1, a_2, ..., a_m, a_{n,m+1}, a_{n,m+2}, ...) ∈ A_k.
This implies that (a_1, a_2, ..., a_m) ∈ A, and we're done.

Proof of Claim : Let A_1, A_2, A_3, ... ∈ C be pairwise disjoint cylinder sets and suppose that A := ⋃_{n=1}^{∞} A_n ∈ C; we must show there is an N such that A_k = ∅ for all k > N. To see this, for each n define

    B_n = A_1 ∪ A_2 ∪ A_3 ∪ ⋯ ∪ A_n ∈ R(C);

we must show that A = B_n for some n. To this end, observe that B_1 ⊆ B_2 ⊆ B_3 ⊆ B_4 ⊆ ⋯, so

    A \ B_1 ⊇ A \ B_2 ⊇ A \ B_3 ⊇ A \ B_4 ⊇ ⋯.

Moreover, since A = ⋃_{n=1}^{∞} A_n = ⋃_{n=1}^{∞} B_n, we have (by De Morgan's laws)

    ∅ = A \ A = ⋂_{n=1}^{∞} (A \ B_n).

By our lemma, if all the sets A \ B_n (which belong to R(C)) were nonempty, then the intersection ⋂_{n=1}^{∞} (A \ B_n) would also be nonempty; of course, we know the intersection is empty, so we must have A \ B_n = ∅ for some n. This implies that A = B_n and our proof is complete.
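The finite fact behind Figure 3.3, that a nonincreasing chain of nonempty subsets of a finite set stabilizes and so has nonempty intersection, is easy to check by brute force; a minimal sketch (not from the text; the random chain and the seed are arbitrary choices):

```python
import random

random.seed(0)       # arbitrary seed, for reproducibility
F = set(range(10))   # a finite "sample space"

# Build a random nonincreasing chain A_1 ⊇ A_2 ⊇ ... of nonempty subsets of F.
chain = [set(F)]
for _ in range(50):
    prev = chain[-1]
    nxt = {x for x in prev if random.random() < 0.8}
    if not nxt:      # keep every set in the chain nonempty
        nxt = {min(prev)}
    chain.append(nxt)

# Since F is finite, the chain must stabilize, so the intersection of the
# whole chain equals its last (nonempty) set.
intersection = set.intersection(*chain)
print(len(intersection) > 0)
```

The infinite version for cylinder sets is exactly what the pigeonhole argument of Step 1 supplies, since each coordinate ranges over a finite set even though the chain lives in an infinite product.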
Before tackling the general case, we need to study an alternative characterization of measures in terms of the notion of continuity instead of countable additivity.
3.3.2. Continuity of measures. A sequence of sets {A_n}, n = 1, 2, ..., is nondecreasing if A_n ⊆ A_{n+1} for each n, in which case the limit set is by definition

    lim A_n := ⋃_{n=1}^{∞} A_n.

The sequence is nonincreasing if A_n ⊇ A_{n+1} for each n, and in this case,

    lim A_n := ⋂_{n=1}^{∞} A_n.
An additive set function µ : R → [0, ∞] on a ring R is said to be

(1) continuous from below if for any nondecreasing sequence of sets {A_n} in R with limit set A = lim A_n ∈ R, we have µ(lim A_n) = lim µ(A_n);
(2) continuous from above if for any nonincreasing sequence of sets {A_n} in R with limit set A = lim A_n ∈ R and µ(A_1) ≠ ∞, we have µ(lim A_n) = lim µ(A_n).

We call a set function µ continuous if it is continuous from below and continuous from above. Recall that a function f : R → R is continuous at a point a if and only if given any sequence {a_n} with a = lim a_n, we have

    f(a) = lim f(a_n),  that is,  f(lim a_n) = lim f(a_n).

This is why we use the term “continuous” in the above definitions for a set function.

Equivalence of measure and continuity

Theorem 3.7. If µ : R → [0, ∞] is additive on a ring R, then
(1) µ is a measure if and only if µ is continuous from below.
(2) If µ is a measure, then µ is continuous from above.
(3) If µ is finite-valued, that is, µ(A) < ∞ for each A ∈ R, then µ is a measure if and only if µ is continuous from above.
(4) If µ is finite-valued, then µ is a measure if and only if µ is continuous from above at ∅; that is, given any nonincreasing sequence of sets {A_n} in R,

    if ⋂_{n=1}^{∞} A_n = ∅, then lim µ(A_n) = 0.
Proof : To avoid giving a very long proof, we only prove (2) (which implies the “only if” parts of (3) and (4)) and the “only if” part of (1), leaving the rest for your enjoyment. To prove the “only if” part of (1), assume that µ is a measure. To prove continuity from below, let A_1 ⊆ A_2 ⊆ A_3 ⊆ ⋯ be a nondecreasing sequence of sets in R with limit set A. Observe that (see the left-hand picture in Figure 3.4)

    A = A_1 ∪ (A_2 \ A_1) ∪ (A_3 \ A_2) ∪ (A_4 \ A_3) ∪ ⋯.
Figure 3.4. On the left, A_1 ⊆ A_2 ⊆ A_3 ⊆ ⋯ are nondecreasing concentric disks whose union is the large disk A. On the right, A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯ are nonincreasing concentric disks whose intersection is the small disk A.

The sets on the right-hand side of this decomposition are pairwise disjoint, so by countable additivity and the definition of an infinite series as the limit of its partial sums,

    µ(A) = µ(A_1) + ∑_{k=1}^{∞} µ(A_{k+1} \ A_k) = lim_{n→∞} ( µ(A_1) + ∑_{k=1}^{n−1} µ(A_{k+1} \ A_k) ).

Note that A_n = A_1 ∪ ⋃_{k=1}^{n−1} (A_{k+1} \ A_k) is a union of pairwise disjoint sets. Thus, by (finite) additivity,

    µ(A_n) = µ(A_1) + ∑_{k=1}^{n−1} µ(A_{k+1} \ A_k),

which implies that

    µ(A) = lim_{n→∞} µ(A_n).
To prove (2), assume that µ is a measure and let A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯ be a nonincreasing sequence of sets in R with limit set A such that µ(A_1) ≠ ∞. In particular, since A_n ⊆ A_1 for each n, by monotonicity we have µ(A_n) < ∞ for all n. Now observe that (see the right-hand picture in Figure 3.4)

    A_1 = A ∪ (A_1 \ A_2) ∪ (A_2 \ A_3) ∪ (A_3 \ A_4) ∪ ⋯.

The sets on the right are pairwise disjoint, so by countable additivity and subtractivity of additive set functions, we have

    µ(A_1) = µ(A) + ∑_{k=1}^{∞} µ(A_k \ A_{k+1}) = µ(A) + ∑_{k=1}^{∞} (µ(A_k) − µ(A_{k+1}))
           = µ(A) + lim_{n→∞} ∑_{k=1}^{n−1} (µ(A_k) − µ(A_{k+1})).

The right-hand sum telescopes:

    ∑_{k=1}^{n−1} (µ(A_k) − µ(A_{k+1})) = (µ(A_1) − µ(A_2)) + (µ(A_2) − µ(A_3)) + (µ(A_3) − µ(A_4)) + ⋯ + (µ(A_{n−1}) − µ(A_n)) = µ(A_1) − µ(A_n).

Thus,

    µ(A_1) = µ(A) + lim_{n→∞} (µ(A_1) − µ(A_n)).
Subtracting µ(A_1) (which is a finite number by assumption) from both sides gives µ(A) = lim µ(A_n), which proves our result.

Example 3.2. You may be wondering about the hypothesis “µ(A_1) ≠ ∞” in the definition of “continuous from above.” This hypothesis is needed, for otherwise the result is false. Here's a trivial counterexample. Consider the set function µ : P(R) → [0, ∞] defined by µ(∅) := 0 and µ(A) := ∞ for A ≠ ∅. One can check that µ is a measure. Observe that

    A = ⋂_{n=1}^{∞} A_n,  where A = ∅ and A_n = (0, 1/n),

and A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯. Thus, µ(A) = 0 and, since µ(A_n) = ∞ for every n, lim µ(A_n) = ∞. Hence, µ(A) ≠ lim µ(A_n).
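Both flavors of continuity, and the role of the hypothesis µ(A_1) ≠ ∞, can be seen numerically for interval lengths (a sketch, not from the text; the particular sequences of intervals are arbitrary choices):

```python
from fractions import Fraction

def length(a, b):
    # Length of the interval with endpoints a <= b.
    return float(b - a)

n_max = 10_000

# Continuity from below: A_n = (0, 1 - 1/n] increases to (0, 1) and
# length(A_n) = 1 - 1/n increases to length((0, 1)) = 1.
from_below = [length(0, 1 - Fraction(1, n)) for n in range(1, n_max + 1)]

# Continuity from above: A_n = (0, 1/n] decreases to the empty set and,
# since length(A_1) = 1 < infinity, length(A_n) -> 0 as Theorem 3.7(2) predicts.
from_above = [length(0, Fraction(1, n)) for n in range(1, n_max + 1)]

# Example 3.2's measure: µ(∅) = 0 and µ(A) = ∞ for A ≠ ∅. The sets
# (0, 1/n) shrink to ∅, yet µ((0, 1/n)) = ∞ for every n, so continuity
# from above fails once the µ(A_1) ≠ ∞ hypothesis is dropped.
def mu(A_is_empty):
    return 0.0 if A_is_empty else float("inf")

print(from_below[-1], from_above[-1], mu(False))
```

The first two sequences converge to the measure of the limit set; the third does not, which is exactly the counterexample of Example 3.2.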
3.3.3. Infinite product probabilities II: General sample spaces. We now drop the assumption that the sample spaces X_i are finite. For example, in the case X_i = (0, 1] for each i with Lebesgue measure, the infinite product ∏_{i=1}^{∞} X_i = (0, 1]^∞ represents the sample space of picking an infinite sequence of real numbers “at random” from the interval (0, 1]. Theorem 3.8 says that Lebesgue measure on (0, 1]^∞ is a measure. More generally, assume we are given probability measures

    µ_1 : I_1 → [0, 1],  µ_2 : I_2 → [0, 1],  µ_3 : I_3 → [0, 1],  ...,

where I_i is a semiring on a sample space X_i (no longer assumed to be finite). Let C ⊆ ∏_{i=1}^{∞} X_i denote the collection of cylinder sets generated by the semirings I_1, I_2, ... and consider the infinite product of µ_1, µ_2, ...:

    µ : C → [0, 1].

The following theorem says that µ is a measure.

Probability measures on the cylinder sets

Theorem 3.8. The infinite product of countably many probability measures is a probability measure on C.

Proof : Our proof begins with ...

Step 1: Introduction to sections. By a section we mean the following. Let A ⊆ ∏_{i=1}^{∞} X_i. Given (x_1, x_2, ..., x_n) ∈ X_1 × ⋯ × X_n, we define

    A(x_1, x_2, ..., x_n) := { y ∈ ∏_{i=n+1}^{∞} X_i ; (x_1, x_2, ..., x_n, y) ∈ A }.

This set is called the section of A at (x_1, x_2, ..., x_n). Figure 3.5 shows a couple of examples of sections in the finite product cases X_1 × X_2 and X_1 × X_2 × X_3. One reason sections are important is the following

Claim: Let (a_1, a_2, ...) ∈ ∏_{i=1}^{∞} X_i and let A ∈ R(C). Then (a_1, a_2, ...) ∈ A if and only if A(a_1, a_2, ..., a_k) ≠ ∅ for all k.

In other words, (a_1, a_2, ...) ∈ A if and only if for every k ∈ N, the (a_1, ..., a_k)-section of A is not empty. The proof of this claim is not difficult (and it's somewhat “obvious” after some thought) so we shall leave it to the interested reader (see Problem 2). We now relate ...
Figure 3.5. On the left, A is a (filled-in) oval and on the right, A is a solid ball. In both cases, the sections are line segments. In the first case, A(x_1) = {x_2 ∈ X_2 ; (x_1, x_2) ∈ A} and in the second case, A(x_1, x_2) = {x_3 ∈ X_3 ; (x_1, x_2, x_3) ∈ A}.

Step 2: Integration and measures of sections. We first introduce some notation. For each k ∈ N we let

    C^(k) = cylinder subsets of X_k × X_{k+1} × X_{k+2} × ⋯,

and we let

    µ^(k) = infinite product measure of µ_k, µ_{k+1}, µ_{k+2}, ... .

The set function µ^(k) : C^(k) → [0, 1] is finitely additive and it extends uniquely to a finitely additive set function µ^(k) : R(C^(k)) → [0, 1]. Observe that C^(1) = C and µ^(1) = µ, while for k > 1 we think of µ^(k) : R(C^(k)) → [0, 1] as, roughly speaking, the restriction of µ : R(C) → [0, 1] from ∏_{i=1}^{∞} X_i to ∏_{i=k}^{∞} X_i. Let A ∈ R(C). Then we claim that

(1) For any x_1 ∈ X_1, we have A(x_1) ∈ R(C^(2)).
(2) The function f : X_1 → R defined by f(x_1) := µ^(2)(A(x_1)) for all x_1 ∈ X_1 is an I_1-simple function.
(3) We have µ(A) = ∫ f dµ_1.

Here's a picture showing why (3) is “obvious” in the simple case of X_1 × X_2:

Figure 3.6. µ^(2)(A(x_1)) is the length of the line segment A(x_1). Integrating (“summing”) the lengths of the line segments over all the points x_1 ∈ X_1 gives the area of A; that is, area of A = ∫ f dµ_1 = µ(A).
Since an element of R(C) is a union of pairwise disjoint elements of C, we just have to check (1)–(3) for an element of C. Thus, assume that A ∈ C. Then we can write

    A = A_1 × A_2 × ⋯ × A_N × ∏_{i=N+1}^{∞} X_i

for some A_i ∈ I_i. It follows that, given x_1 ∈ X_1,

    A(x_1) = ∅ if x_1 ∉ A_1,  and  A(x_1) = A_2 × ⋯ × A_N × ∏_{i=N+1}^{∞} X_i if x_1 ∈ A_1.

Hence, A(x_1) ∈ C^(2) for all x_1 ∈ X_1. Moreover, we have

    µ^(2)(A(x_1)) = 0 if x_1 ∉ A_1,  and  µ^(2)(A(x_1)) = µ_2(A_2) ⋯ µ_N(A_N) if x_1 ∈ A_1;

it follows that f(x_1) = a χ_{A_1}(x_1), where a = µ_2(A_2) ⋯ µ_N(A_N). Thus, f = a χ_{A_1} is an I_1-simple function and by definition of the integral,

    ∫ f dµ_1 = a µ_1(A_1) = µ_1(A_1) µ_2(A_2) ⋯ µ_N(A_N) = µ(A),

as required. This completes the proof of (1)–(3). Identical arguments prove the following results, which we'll need below: If A ∈ R(C^(k)), then

(1) For any x_k ∈ X_k, we have A(x_k) ∈ R(C^(k+1)).
(2) The function f : X_k → R defined by f(x_k) := µ^(k+1)(A(x_k)) for all x_k ∈ X_k is an I_k-simple function.
(3) We have

(3.8)    µ^(k)(A) = ∫ f dµ_k.
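Claim (3), µ(A) = ∫ f dµ_1 with f(x_1) = µ^(2)(A(x_1)), can be checked directly when there are only two finite factors (a sketch, not from the text; the probability weights and the set A are arbitrary choices):

```python
from fractions import Fraction

# Two finite factors with arbitrary probability weights summing to 1.
mu1 = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
mu2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}

# An arbitrary subset A of X_1 × X_2 (a disjoint union of rectangles).
A = {("a", 0), ("a", 1), ("b", 1)}

# Product measure of A, computed point by point.
mu_A = sum(mu1[x1] * mu2[x2] for (x1, x2) in A)

# The section A(x_1) = {x_2 : (x_1, x_2) ∈ A} has µ_2-measure f(x_1) ...
def f(x1):
    return sum(mu2[x2] for x2 in mu2 if (x1, x2) in A)

# ... and integrating the simple function f against µ_1 recovers µ(A).
integral = sum(f(x1) * mu1[x1] for x1 in mu1)
print(mu_A, integral)
```

This is the discrete analogue of Figure 3.6: the "area" of A is recovered by summing the section measures f(x_1) weighted by µ_1({x_1}).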
With all this preliminary material, we are now ready for ...

Step 3: Idea of proof. Now how do sections help to prove that the infinite product measure µ is a measure? By the Semiring Extension Theorem we know that µ extends uniquely to be additive on R(C). We shall prove that this extension is countably additive; then, restricting back to C, we see that µ is countably additive on C as well. We shall use (the contrapositive of the displayed statement in) Part (4) of Theorem 3.7 to prove that µ : R(C) → [0, 1] is a measure: Let A_1, A_2, ... ∈ R(C) with A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯ and with lim µ(A_n) ≠ 0; we need to show that ⋂_{n=1}^{∞} A_n ≠ ∅. To do so, we shall find a point (a_1, a_2, ...) ∈ ∏_{i=1}^{∞} X_i such that the sections satisfy

(3.9)    Goal: A_n(a_1, a_2, ..., a_k) ≠ ∅  for all n, k ∈ N.

Since this holds for all k ∈ N, it follows that (a_1, a_2, ...) ∈ A_n for all n, and since this holds for all n we get (a_1, a_2, ...) ∈ ⋂_{n=1}^{∞} A_n. This proves that ⋂_{n=1}^{∞} A_n ≠ ∅ and shows that µ is a measure. Now that we know what we are after, our next step is ...

Step 4: Proof of Theorem. Instead of proving the statement (3.9) directly, we turn this statement about sets into a statement about measures of sets. To this end, note that µ(A_1) ≥ µ(A_2) ≥ ⋯ so, as lim µ(A_n) ≠ 0, there is an ε > 0 such that

(3.10)    µ(A_n) ≥ ε  for all n ∈ N.

We can consider this inequality as the k = 0 statement of the following claim: There is a point (a_1, a_2, ...) ∈ ∏_{i=1}^{∞} X_i such that

Claim: µ^(k+1)(A_n(a_1, a_2, ..., a_k)) ≥ ε/2^k  for all n, k ∈ N.

Our claim certainly implies our goal: since in particular A_n(a_1, a_2, ..., a_k) has positive measure, it must be nonempty. We shall prove our claim by induction on k. Regarding (3.10) as the k = 0 case, and assuming the (k − 1)-st case

(3.11)    µ^(k)(A_n(a_1, a_2, ..., a_{k−1})) ≥ ε/2^{k−1}  for all n ∈ N,

we shall prove there is a point a_k ∈ X_k giving our claim. We now turn our claim, which is a statement about the measure of a set, into a statement involving a function. Indeed, given n ∈ N, define

    f_n : X_k → [0, 1]  by  f_n(x_k) := µ^(k+1)(A_n(a_1, a_2, ..., a_{k−1}, x_k))  for all x_k ∈ X_k.
Then to prove our claim we need to show there is a point a_k ∈ X_k such that f_n(a_k) ≥ ε/2^k for all n ∈ N. In other words, if we define for each n ∈ N

(3.12)    B_n = { f_n ≥ ε/2^k } ⊆ X_k,

then we need to show that the intersection ⋂_{n=1}^{∞} B_n is not empty. Observe that since f_n is a simple function on X_k we know that B_n ∈ R(I_k) (Problem 1 in Exercises 2.4). Also, since A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯, all the sections of the A_i's are also nonincreasing, so f_1 ≥ f_2 ≥ f_3 ≥ ⋯. It follows that

    B_1 ⊇ B_2 ⊇ B_3 ⊇ ⋯.

Moreover, combining (3.8) and (3.11), we have

    ∫ f_n dµ_k ≥ ε/2^{k−1}  for all n ∈ N.

Thus,

    ε/2^{k−1} ≤ ∫ f_n dµ_k = ∫ χ_{B_n} f_n dµ_k + ∫ χ_{B_n^c} f_n dµ_k
              ≤ ∫ χ_{B_n} · 1 dµ_k + ∫ 1 · (ε/2^k) dµ_k
              = µ_k(B_n) + ε/2^k,

where for the second line we used that f_n ≤ 1 and f_n < ε/2^k on the set B_n^c. Thus, µ_k(B_n) ≥ ε/2^{k−1} − ε/2^k = ε/2^k for all n ∈ N; in particular, lim_{n→∞} µ_k(B_n) ≠ 0. Now we are given that µ_k : I_k → [0, 1] is a measure, so by Problem 3 it follows that µ_k : R(I_k) → [0, 1] is also a measure. Hence by Part (4) of Theorem 3.7 we know that ⋂_{n=1}^{∞} B_n ≠ ∅. This completes our proof.
3.3.4. Kolmogorov's countable additivity model. Recall Andrey Nikolaevich Kolmogorov's (1903–1987) axioms for probability in the case where there are only finitely many events [151, p. 2]:

I. R is a ring of subsets of a set X.
II. R contains X.
III. To each set A in R is assigned a nonnegative real number µ(A). This number µ(A) is called the probability of the event A.
IV. µ(X) = 1.
V. If A and B have no element in common, then µ(A ∪ B) = µ(A) + µ(B).

When the number of events can be infinite, on page 14 of [151], Kolmogorov adds another axiom and says

In all future investigations, we shall assume that besides Axioms I–V, still another holds true:
VI. For a decreasing sequence of events A_1 ⊇ A_2 ⊇ A_3 ⊇ ⋯ of R, for which ⋂_{n=1}^{∞} A_n = ∅, the following equation holds: lim µ(A_n) = 0.

By Part (4) of Theorem 3.7, we know that Axiom VI is equivalent to saying that µ : R → [0, 1] is a measure. Thus, Kolmogorov is essentially axiomatizing probability so that probability becomes a part of measure theory! On the next
page, Kolmogorov makes the following interesting statement [151, p. 15] (I bolded a sentence near the bottom). Since the new axiom is essential for infinite fields of probability only, it is almost impossible to elucidate its empirical meaning, as has been done, for example, in the case of Axioms I – V . . .. For, in describing any observable random process we can obtain only finite fields of probability. Infinite fields of probability occur only as idealized models of real random processes. We limit ourselves, arbitrarily, to only those models which satisfy axiom VI. This limitation has been found expedient in researches of the most diverse sort.
So, it seems that Kolmogorov “arbitrarily” studies countably additive probability models because doing so has been found “expedient” in researches. That this axiom is expedient in researches is very true: countable additivity gives an incredibly useful theory of the integral (which, for instance, fixes all the deficiencies of the Riemann integral discussed in the prelude, such as the interchange of limits and integrals). Nonetheless, axiomatizing probability so that it becomes a part of measure theory has not come without controversy. This is because probability is supposed to model the likelihood of “real-life” phenomena, and although everyone can accept finite additivity as part of “real-life” probability, how can we say definitively that all “real-life” probabilistic phenomena behave countably additively? One of the great opponents of the countable additivity axiom of probability was the famous probabilist Bruno de Finetti (1906–1985), who said [70, p. 229]

From the viewpoint of the pure mathematician — who is not concerned with the question of how a given definition relates to the exigencies of the application, or to anything outside the mathematics — the choice is merely one of mathematical convenience and elegance. Now there is no doubt at all that the availability of limiting operations under the minimum number of restrictions is the mathematician's ideal. Amongst their other exploits, the great mathematicians of the nineteenth century made wise use of such operations in finding exact results involving sums of divergent series: first-year students often inadvertently assume the legitimacy of such operations and fail the examination when they imitate these exploits.
At the beginning of this century it was discovered that there was a large area in which the legitimacy of these limiting operations could be assumed without fear of contradictions, or of failing examinations: it is not surprising therefore that the tide of euphoria is now at its height.
That the “tide of euphoria” is now at its height is true; take for instance one of the world's experts in probability theory, Richard Mansfield Dudley (1938–), who said “The definition of probability as a (countably additive, nonnegative) measure of mass 1 on a σ-algebra in a general space is adopted by the overwhelming majority of researchers in probability” [76, p. 273]. We shall follow the crowd and mostly limit ourselves to measures instead of finitely additive set functions, because doing so allows us to develop a powerful theory of the integral where limits and integrals can be interchanged “without fear of contradictions, or of failing examinations” and because the “availability of limiting operations under the minimum number of restrictions is the mathematician's ideal.” As Kolmogorov mentioned, such a theory of the integral “has been found expedient in researches of the most diverse sort.” However, with this said, I'd like to say that studying finite additivity is important
both philosophically and mathematically: philosophically because in “real-life” we really only encounter finitely additive phenomena (countable additivity is really just an idealization), and mathematically because there are completely natural set functions which are finitely, but not countably, additive, such as the asymptotic density that you'll study in Problems 7 and 8.

◮ Exercises 3.3.
1. In Problem 7 in Exercises 2.4 we related the sequence space {0, 1}^∞, modeling an infinite sequence of fair coin tosses, and the interval [0, 1] with Lebesgue measure. Looking at this problem, one might think that Theorem 3.5 holds for the interval [0, 1]; that is, if I consists of all left-hand open intervals in [0, 1] and µ : I → [0, ∞) is finitely additive, then it's automatically countably additive. This is false: Find a finitely, but not countably, additive set function on I. Therefore, although {0, 1}^∞ and [0, 1] are in some respects similar, they are measure theoretically very different.
2. In this problem we study sections of sets.
(i) Let I_1, I_2, ... be semirings on nonempty sets X_1, X_2, ..., let A ∈ R(C), and let (a_1, a_2, ...) ∈ ∏_{i=1}^{∞} X_i. Prove that (a_1, a_2, ...) ∈ A if and only if A(a_1, a_2, ..., a_k) ≠ ∅ for all k.
(ii) Find a counterexample to the “if” part of (i) if we drop the assumption A ∈ R(C); that is, find sets X_1, X_2, ..., a subset A ⊆ ∏_{i=1}^{∞} X_i, and a point (a_1, a_2, ...) ∈ ∏_{i=1}^{∞} X_i such that A(a_1, a_2, ..., a_k) ≠ ∅ for all k, yet (a_1, a_2, ...) ∉ A.
3. (The Semiring Extension Theorem for measures) Let µ : I → [0, ∞] be a (finitely) additive set function on a semiring I; then by the Semiring Extension Theorem 2.7 we know that its ring extension µ : R(I) → [0, ∞] is (finitely) additive. Prove the following result: µ is countably additive on I if and only if µ is countably additive on R(I). The “if” portion is automatic (why?); it's the “only if” that requires proof.
4. (General product measures) In this problem we extend Theorem 3.8 to arbitrary products. Let I be an index set (not necessarily countable) and for each i ∈ I, let µ_i : I_i → [0, 1] be a probability measure, where I_i is a semiring on a sample space X_i. We denote by ∏_{i∈I} X_i the set of all functions x : I → ⋃_{i∈I} X_i such that x(i) ∈ X_i for each i ∈ I. For example, in the case I = N we identify the function x with the infinite tuple (x_1, x_2, ...) where x_i := x(i). Let C ⊆ ∏_{i∈I} X_i denote the collection of all cylinder sets, where A ∈ C means that for some finite set F ⊆ I and sets A_i ∈ I_i (i ∈ F), we can write

    A = ∏_{i∈F} A_i × ∏_{i∉F} X_i,

by which we mean x ∈ A if and only if x(i) ∈ A_i for i ∈ F and there are no conditions on x(i) for i ∉ F. We define the infinite product of the µ_i's,

    µ : C → [0, 1],

on a cylinder set A as written above, by µ(A) := ∏_{i∈F} µ_i(A_i).
(i) Prove that µ is finitely additive. In particular, µ extends uniquely to a finitely additive probability set function on R(C).
(ii) Prove that µ is a measure by reducing to an application of Theorem 3.8 in the case I is countable. Suggestion: Let A_1, A_2, ... ∈ C be pairwise disjoint cylinder sets. Show that there is a countable set C ⊆ I such that for each k,

    A_k = B_k × ∏_{i∉C} X_i,

where B_k = ∏_{i∈C} B_{ki} for some sets B_{ki} ∈ I_i. Writing C = {c_1, c_2, c_3, ...} as a list, put Y_j := X_{c_j} for each j ∈ N and Y := ∏_{j=1}^{∞} Y_j. Prove that B_1, B_2, ... are pairwise disjoint elements of the cylinder subsets of Y.
5. (Coherent probability and finitely additive probability) In this problem we discuss Bruno de Finetti's (1906–1985) notion of coherent probability. First, the notion of bettor's gain. Let 0 ≤ p ≤ 1, let a ∈ R, and let A ⊆ X, where X is the sample space of some experiment. You walk into a casino and you pay $ap betting that the event A occurs in the experiment (if a < 0, then the casino actually pays you $|a|p). If the event A occurs, you get $a, and if A doesn't occur, you get nothing. We can summarize your net gain with the function

    g = a (χ_A − p);

note that if x ∈ A, then g(x) = a − pa, and if x ∉ A, then g(x) = −pa, which are exactly your net gains when the event A does or doesn't occur. If we are given N weights p_1, ..., p_N ∈ [0, 1], N amounts a_1, ..., a_N ∈ R, and N events A_1, ..., A_N ⊆ X, then the function

    G = ∑_{n=1}^{N} a_n (χ_{A_n} − p_n),

called the bettor's gain, represents your net gain if you bet $a_n p_n on the event A_n, winning $a_n if the event A_n occurs. Note that if G(x) > 0 for all x ∈ X, then you win regardless of the outcome of the experiment; if this is the case, the game is called unfair,³ otherwise the game is fair. We're now ready to explain de Finetti's theory of probability, which basically says that probabilities should only give rise to fair games. Let A be a collection of subsets of X and let µ : A → [0, 1]. The function µ is called a coherent probability if for any finite number of events A_1, ..., A_N ∈ A and real numbers a_1, ..., a_N ∈ R, the bettor's gain

    G = ∑_{n=1}^{N} a_n (χ_{A_n} − µ(A_n))

is always fair; that is, it is not the case that G(x) > 0 for all x ∈ X. Note that A is not assumed to be a ring and µ is not assumed to be finitely additive. This is quite different from Kolmogorov's axioms for a probability! However, if A is a ring containing the whole space X, then de Finetti's theory and Kolmogorov's (finitely additive) theory are the same, which shows that de Finetti's theory is more general than Kolmogorov's.

(de Finetti's theorem) Let µ : A → [0, 1] be a set function where A is a ring containing X. Then µ is a coherent probability if and only if µ is a finitely additive probability set function. Prove this as follows (taken from Beam's expository paper [18], which is very good suggested reading).
(i) Assume that µ is a coherent probability. By considering the bettor's gain function χ_X − µ(X), prove that µ(X) = 1. Given disjoint sets A, B ∈ A, by considering the bettor's gain function

    (χ_A − µ(A)) + (χ_B − µ(B)) − (χ_{A∪B} − µ(A ∪ B)),

prove that µ(A ∪ B) ≤ µ(A) + µ(B). Similarly, prove the opposite inequality. Conclude that µ is a finitely additive probability set function.
(ii) Assume now that µ is a finitely additive probability set function. By way of contradiction, assume that µ is not a coherent probability, meaning there are sets A_1, ..., A_N ∈ A and real numbers a_1, ..., a_N ∈ R such that

    ∑_{n=1}^{N} a_n (χ_{A_n} − µ(A_n)) > 0

³Of course, unfair to the casino, but you might consider it fair to you!
156
3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY
at all points of X. Show that there are pairwise disjoint nonempty sets B_1, ..., B_M ∈ A with X = B_1 ∪ ··· ∪ B_M and constants b_1, ..., b_M ∈ R such that

Σ_{n=1}^M b_n (χ_{B_n} − µ(B_n)) > 0

at all points of X. Prove that for each m, we have b_m > Σ_{n=1}^M b_n µ(B_n).

6. (A countably additive paradox) In this problem we show how countable additivity can lead to paradoxical results in probability. Suppose that µ : R → [0, 1] is a countably additive probability set function on a ring R. One might think that µ only gives rise to fair games even when we are allowed to bet on countably many events. By fair we mean there do not exist countably many events A_1, A_2, ... ∈ R and amounts a_1, a_2, ... ∈ R such that the bettor's gain

Σ_{n=1}^∞ a_n (χ_{A_n} − µ(A_n))

converges to a positive real number at all points of the sample space. However, it turns out that we can have unfair games! Let X = (0, 1], let R be the ring generated by those elements of I¹ that are subsets of (0, 1], and consider Lebesgue measure m : R → [0, 1], which is a countably additive probability set function by Problem 3. Here's Beam's construction [19]:

Step 1: Since the alternating harmonic series Σ_{n=1}^∞ (−1)ⁿ/n converges conditionally, by the Riemann rearrangement theorem of elementary real analysis we can rearrange its terms so that the series converges to any given value (or diverges). Fix a positive real number c > 0 and rearrange the natural numbers into i_1, i_2, i_3, ... so that

Σ_{n=1}^∞ (−1)^{i_n}/i_n = c.

Step 2: Let A_n = (0, 1/i_n] ∈ R and a_n = (−1)^{i_n + 1}. Prove that

Σ_{n=1}^∞ a_n (χ_{A_n} − m(A_n))

converges to a positive real number ≥ c at all points x ∈ (0, 1]. This shows that it's possible to construct a game where the bettor can win an arbitrary predetermined amount of money no matter what the outcome of the game is!

7. (Asymptotic density) What is the probability of the event that "a randomly chosen natural number is even"? One way to interpret this is that the sample space is N and the event is the subset 2N (consisting of the even natural numbers), where for any a ∈ N and A ⊆ N, we define aA := {ax ; x ∈ A}. How would we assign a probability to the event 2N? In general, given a subset A ⊆ N, what is the probability that a randomly chosen natural number lies in A? Here is the common way to assign such a probability. Given a subset A ⊆ N, define

(3.13)  D(A) := lim_{N→∞} #_N(A)/N,

where #_N(A) := #{x ∈ A ; x ≤ N}, provided this limit exists. This limit is called the (asymptotic) density of A. Let D denote the collection of all A ⊆ N such that the limit (3.13) exists. Then we have a map D : D → [0, 1]. One may think that D is a probability measure, but it is not countably additive! Here are some properties of D.
(a) Prove that D(N) = 1 and D(A) = 0 for any finite set A. For any a ∈ N, prove that D(aN) = 1/a.
One can interpret this as saying that one out of every a consecutive natural numbers is divisible by a.
(b) For any a ∈ N and r = 0, 1, 2, ..., a − 1, put A(a, r) := aN + r = {an + r ; n ∈ N}. Prove that D(A(a, r)) = 1/a.
(c) Let A = ∪_{k=0}^∞ {10^{2k}, ..., 10^{2k+1} − 1}. Show that the limit (3.13) does not exist. In particular, D ≠ P(N).
(d) Show that D is closed under finite pairwise disjoint unions and under complements, but is not closed under countable unions. Prove that D : D → [0, 1] is not countably additive, where countable additivity means that if A_1, A_2, ... ∈ D are pairwise disjoint and A = ∪_{k=1}^∞ A_k ∈ D, then D(A) = Σ_{k=1}^∞ D(A_k).
(e) Since D is closed under finite pairwise disjoint unions, you might think D is a ring. However, D is not even a semiring: find sets A, B ∈ D (A, B ≠ N) such that A ∩ B ∉ D. Suggestion: If you're having trouble, consider the following interesting example [46, p. 571]: Fix any A_0 ∉ D. Put A = 2N and

B = (2A_0) ∪ (2A_0^c + 1) = {2n ; n ∈ A_0} ∪ {2n + 1 ; n ∉ A_0}.
Show that A, B ∈ D (both have density 1/2) but A ∩ B = 2A_0, and show that 2A_0 ∉ D.

8. (More on asymptotic density) As we saw in the previous problem, the set D on which asymptotic density is defined is not so well behaved. In this problem we study a subset of D that is a semiring.
(i) Let a, b ∈ N. Prove that given any c ∈ Z, the equation ax − by = c has a solution (x, y) ∈ Z × Z if and only if d divides c, where d is the greatest common divisor of a and b. Moreover, in the case that ax − by = c has a solution (x, y) ∈ Z × Z, it has infinitely many solutions, and all solutions are given as follows: if (x₀, y₀) is any one solution of the equation with c = d, then for general c ∈ Z, all solutions are of the form

x = x₀ c/d + t b/d,  y = y₀ c/d + t a/d,  for t ∈ Z.

(ii) Let I be the collection of all subsets of N of the form A(a, r) where a ∈ N, r = 0, 1, ..., a − 1, and A(a, r) = aN + r = {an + r ; n ∈ N}. In the previous problem you showed that D(A(a, r)) = 1/a. Using (i), prove that I is closed under intersections.
(iii) Prove that

A(a, r)^c = ∪_{0 ≤ k ≤ a−1, k ≠ r} A(a, k),

where the union is over all integers k between 0 and a − 1 except k = r.
(iv) Prove that the difference of any two elements of I is a pairwise disjoint union of elements of I. Conclude that I is a semiring. In particular, asymptotic density is a finitely additive probability set function on the semiring I. (This probability set function is not countably additive; for a proof using the Extension Theorem, see Problem 13 in Exercises 3.5.)

9. (A finitely additive paradox) (cf. [76, p. 94]) In this problem we show how finite additivity can lead to paradoxical results in probability. Given any set A ⊆ N for which the asymptotic density exists, suppose that the probability that a natural number chosen at random lies in A is D(A). Two people, Jack and Jill, pick natural numbers at random, and the one who picks the larger number wins. You call out either Jack or Jill's name at random, and the person you call on tells you his or her number; at this point you don't know what number the other person chose. However, show that the person you didn't call on wins the game with probability one.
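The behavior of asymptotic density in Problem 7 is easy to explore numerically. The following short Python sketch (an illustration only; the cutoffs N and the sets are our choices, not part of the text) computes the finite ratios #_N(A)/N for the even numbers, for an arithmetic progression A(3, 1), and for the set ∪_k {10^{2k}, ..., 10^{2k+1} − 1} from part (c), whose ratios oscillate so that the limit (3.13) fails to exist.

```python
# Numerical check of asymptotic density D(A) = lim #_N(A)/N (eq. (3.13)).
# Illustrative sketch only; sets and cutoffs chosen for demonstration.

def ratio(pred, N):
    """Compute #_N(A)/N where A = {n in {1, ..., N} : pred(n)}."""
    return sum(1 for n in range(1, N + 1) if pred(n)) / N

# Even numbers 2N: the ratios approach 1/2.
evens = lambda n: n % 2 == 0
print([round(ratio(evens, N), 4) for N in (10, 100, 10000)])

# The progression A(3, 1) = 3N + 1: the ratios approach 1/3.
prog = lambda n: n % 3 == 1
print(round(ratio(prog, 10000), 4))

# Part (c): A = union of blocks {10^(2k), ..., 10^(2k+1) - 1}.
def in_blocks(n):
    k = 0
    while 10 ** (2 * k) <= n:
        if 10 ** (2 * k) <= n < 10 ** (2 * k + 1):
            return True
        k += 1
    return False

# The ratio is small just after a block ends and large just before the
# next block ends, so #_N(A)/N oscillates and has no limit.
print(round(ratio(in_blocks, 10 ** 2), 4), round(ratio(in_blocks, 10 ** 3), 4))
```

Running the last two lines shows ratios near 0.1 at N = 100 but above 0.9 at N = 1000, the oscillation that part (c) asks you to prove persists along N = 10^{2k} versus N = 10^{2k+1}.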
3.4. Outer measures, measures, and Carathéodory's idea

Given an additive set function µ : I → [0, ∞] where I is a semiring of subsets of some set X, it's natural to ask the following

Question: Can we extend µ to be a measure on S(I)? That is, can we extend µ to a measure µ : S(I) → [0, ∞]?

The answer is "yes" if and only if the original set function µ : I → [0, ∞] is a measure.⁴ The idea to construct the extension is to first define what is called an "outer measure" from µ,

µ∗ : P(X) → [0, ∞],

which is defined on all subsets of X. This map is generally not a measure. In this section we study outer measures and define a σ-algebra M_{µ∗} ⊆ P(X) such that µ∗ : M_{µ∗} → [0, ∞] is a measure. In Section 3.5 we show that S(I) ⊆ M_{µ∗}, so by restriction, µ∗ : S(I) → [0, ∞] is a measure, and we show that if µ : I → [0, ∞] is a measure, then µ∗ extends µ (that is, µ∗(I) = µ(I) for all I ∈ I).

3.4.1. Outer measures. We begin this section by studying outer measures from the abstract point of view. In Carathéodory's 1918 book [52], [79, p. 48] he calls a map⁵

Φ : P(X) → [0, ∞],

where P(X) is the power set of a set X, an outer measure on X if
(1) Φ(∅) = 0;
(2) Φ is countably subadditive in the sense that

A ⊆ ∪_{n=1}^∞ A_n  ⟹  Φ(A) ≤ Σ_{n=1}^∞ Φ(A_n).

(There is no condition on the disjointness of {A_n}.)

Constantin Carathéodory (1873–1950).

Note that Φ is monotone in the sense that if A ⊆ B, then Φ(A) ≤ Φ(B). Indeed, A ⊆ B ∪ ∅ ∪ ∅ ∪ ···, and so by (1) and (2) we have Φ(A) ≤ Φ(B) + Φ(∅) + Φ(∅) + ··· = Φ(B).

Here's an example showing that an outer measure may not be a measure. By the way, the purpose of some examples is not that they are "useful" in any practical sense but that they help us to understand statements: what they imply and what they don't imply.⁶

Example 3.3. Consider the set X = {a, b, c} consisting of three distinct elements and let Φ : P(X) → [0, ∞] be defined by Φ(∅) = 0, Φ(X) = 2, and Φ(A) = 1 otherwise. Note that if A = {a} and B = {b}, then A and B are disjoint and, by definition of Φ, we have Φ(A ∪ B) = 1, Φ(A) = 1 and Φ(B) = 1. Hence,

Φ(A ∪ B) < Φ(A) + Φ(B).
⁴The "only if" statement is obvious: If µ : S(I) → [0, ∞] is a measure extending the given µ on I, then by restricting to I ⊆ S(I), it follows that µ : I → [0, ∞] is also a measure.
⁵Actually, Carathéodory worked with X = Rⁿ and not a general set X.
⁶If you have to prove a theorem, do not rush. First of all, understand fully what the theorem says, try to see clearly what it means. Then check the theorem; it could be false. Examine the consequences, verify as many particular instances as are needed to convince yourself of the truth. When you have satisfied yourself that the theorem is true, you can start proving it. George Pólya (1887–1985) [226].
Therefore, Φ is not additive. However, Φ is countably subadditive. To see this, let A ⊆ ∪_n A_n where A, A_1, A_2, ... ⊆ X. We must verify that

(3.14)  Φ(A) ≤ Σ_{n=1}^∞ Φ(A_n).

We consider three cases.
Case 1: A = ∅. Then Φ(A) = 0, so (3.14) is trivially true.
Case 2: A = X. Then Φ(A) = 2. If A_n = X for some n, then Φ(A_n) = 2, so (3.14) holds. If A_n ≠ X for all n, then as X ⊆ A_1 ∪ A_2 ∪ ···, there must be at least two different sets A_i and A_j amongst A_1, A_2, ... that are not empty. Hence, Φ(A_i) + Φ(A_j) = 2, and (3.14) holds in this case too.
Case 3: A ≠ ∅, X. Then Φ(A) = 1. Since A ⊆ ∪_{n=1}^∞ A_n and A ≠ ∅, at least one set A_n cannot be empty; for this set, we have Φ(A_n) = 1 or 2, so (3.14) holds.
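For a set as small as X = {a, b, c} the claims of Example 3.3 can also be verified by brute force. Here is a minimal Python sketch (an illustration, not part of the text): since Φ ≥ 0, adding repeated or extra sets to a cover only increases the right-hand side of (3.14), so on a finite X it suffices to check covers made of distinct subsets.

```python
from itertools import chain, combinations

X = frozenset('abc')

def phi(A):
    """The outer measure of Example 3.3 on X = {a, b, c}."""
    if not A:
        return 0
    return 2 if A == X else 1

# All 8 subsets of X.
subsets = [frozenset(s) for s in chain.from_iterable(
    combinations('abc', r) for r in range(4))]

# Phi is NOT additive: {a} and {b} are disjoint yet
# phi({a, b}) = 1 < phi({a}) + phi({b}) = 2.
assert phi(frozenset('ab')) < phi(frozenset('a')) + phi(frozenset('b'))

# Phi IS subadditive: check every cover of every subset by distinct sets.
for A in subsets:
    for r in range(1, len(subsets) + 1):
        for cover in combinations(subsets, r):
            if A <= frozenset().union(*cover):
                assert phi(A) <= sum(phi(B) for B in cover)
print("Example 3.3 verified: subadditive but not additive")
```

The exhaustive loop plays the role of the three-case argument above; of course the case analysis is what scales, not the brute force.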
The most common way to construct an outer measure is the Lebesgue outer measure construction we briefly discussed at the beginning of this chapter. Recall that the basic idea is to assign a "measure" to an arbitrary subset of Euclidean space by circumscribing the set with, in general, infinitely many boxes. Thus, given a set A ⊆ Rⁿ, the idea is to cover A by countably many left-half open boxes: A ⊆ ∪_{k=1}^∞ I_k, where I_k ∈ Iⁿ for each k. We then observe that the sum Σ_{k=1}^∞ m(I_k) is intuitively bigger than the size of A. For example, here's a picture with A = a disk, where we show three covers of A with rectangles: [picture omitted]

It's clear that the sum of the areas of the rectangles in the cover on the far right (assuming they don't overlap too much) gives the best measurement of the true size of A. This shows why we should use covers of A by countably many boxes; the more boxes, the better the approximation. Now, intuitively speaking, the "smallest" possible sum of areas of (countably many) rectangles covering A should equal the exact measure of A. Thus, it makes sense to define⁷

(3.15)  m∗(A) := inf { Σ_{k=1}^∞ m(I_k) ; A ⊆ ∪_{k=1}^∞ I_k, I_k ∈ Iⁿ }.

This defines an "outer measure" (because it measures A from the outside) m∗(A) ∈ [0, ∞] for each subset A ⊆ Rⁿ. Hence, we have defined a set function

m∗ : P(Rⁿ) → [0, ∞],

which is called Lebesgue outer measure. The fact that Lebesgue outer measure is an outer measure is a consequence of Theorem 3.9 below, where we show how to generate outer measures from arbitrary set functions.

We remark that it is important in the definition (3.15) for m∗(A) to take countable covers of A rather than finite covers. For finite covers, the definition (3.15) gives what is called outer content (see Problem 11), used by Camille Jordan (1838–1922) as the foundation of Riemann integration theory. Taking countable covers as a foundation

⁷The sums Σ_{k=1}^∞ m(I_k) include finite sums Σ_{k=1}^N m(I_k) by taking I_k = ∅ for k > N.
for integration theory gives rise to Lebesgue integration theory.

Before presenting the next theorem, it might be helpful to remind you what (3.15) means. Let

S := { Σ_{k=1}^∞ m(I_k) ; A ⊆ ∪_{k=1}^∞ I_k, I_k ∈ Iⁿ }.

Then by definition of infimum, (3.15) means
(Inf 1) m∗(A) is a lower bound for S; that is,

m∗(A) ≤ Σ_{k=1}^∞ m(I_k)

for any cover {I_k} of A by elements of Iⁿ.
(Inf 2) m∗(A) is the greatest lower bound for S; that is, if m∗(A) < α, then α cannot be a lower bound for S, which means there must be an element of S less than α. Explicitly, there are sets {I_k} in Iⁿ such that

A ⊆ ∪_{k=1}^∞ I_k  with  Σ_{k=1}^∞ m(I_k) < α.
With this review of infimums fresh in memory, we can prove the following theorem.⁸

Construction of outer measures

Theorem 3.9. Let A be any collection of subsets of a set X such that ∅ ∈ A and let µ : A → [0, ∞] be any map such that µ(∅) = 0. The collection A and the map µ are not assumed to have any other properties. For any A ⊆ X, copying the definition (3.15) we define

(3.16)  µ∗(A) := inf { Σ_{n=1}^∞ µ(I_n) ; A ⊆ ∪_{n=1}^∞ I_n, I_n ∈ A }.

This defines a map

µ∗ : P(X) → [0, ∞],

which is an outer measure.

Proof: Since ∅ ∈ A and ∅ ⊆ ∅, by definition of infimum for µ∗ (see (Inf 1) above) it follows that 0 ≤ µ∗(∅) ≤ µ(∅). Since µ(∅) = 0, we obtain µ∗(∅) = 0.

We now show that µ∗ is countably subadditive, which as we'll see involves the "ε/2ⁿ-trick". Let A ⊆ ∪_{n=1}^∞ A_n; we need to show that µ∗(A) ≤ Σ_{n=1}^∞ µ∗(A_n). If µ∗(A_m) = +∞ for some m, then Σ_{n=1}^∞ µ∗(A_n) = +∞ and thus µ∗(A) ≤ Σ_{n=1}^∞ µ∗(A_n) is trivially true. Hence, we may assume that for each n ∈ N,

µ∗(A_n) := inf { Σ_{m=1}^∞ µ(I_m) ; A_n ⊆ ∪_{m=1}^∞ I_m, I_m ∈ A } < ∞.

Let ε > 0. Then since for each n ∈ N we have µ∗(A_n) < µ∗(A_n) + ε/2ⁿ, there are sets {I_{nm}} in A such that (see (Inf 2) above)

(3.17)  A_n ⊆ ∪_{m=1}^∞ I_{nm}  with  Σ_{m=1}^∞ µ(I_{nm}) < µ∗(A_n) + ε/2ⁿ.

⁸We define inf ∅ = +∞ so that µ∗(A) = +∞ in (3.16) if there is no cover of A by elements of A, and we define inf{+∞} = +∞ so that µ∗(A) = +∞ if all the sums in (3.16) equal +∞.
Notice that

A ⊆ ∪_{n=1}^∞ A_n ⊆ ∪_{n,m=1}^∞ I_{nm}.

Now order the countably many sets {I_{nm}}_{n,m∈N} in any way you wish; in other words, pick a bijection f : N → N × N and consider {I_{f(n)}}_{n∈N}. Then A ⊆ ∪_{n=1}^∞ I_{f(n)} and so, by definition of infimum in (3.16), we have

µ∗(A) ≤ Σ_{n=1}^∞ µ(I_{f(n)}) = Σ_{n=1}^∞ Σ_{m=1}^∞ µ(I_{nm}),

where we used Lemma 3.3. By (3.17), we have Σ_{m=1}^∞ µ(I_{nm}) < µ∗(A_n) + ε/2ⁿ, so

µ∗(A) ≤ Σ_{n=1}^∞ ( µ∗(A_n) + ε/2ⁿ ) = Σ_{n=1}^∞ µ∗(A_n) + ε,

since Σ_{n=1}^∞ 1/2ⁿ = 1. Taking ε ↓ 0, we get µ∗(A) ≤ Σ_{n=1}^∞ µ∗(A_n).
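When X is finite, the infimum in (3.16) runs over only finitely many essentially distinct covers (repeating a set in a cover can only increase its total), so the construction of Theorem 3.9 can be carried out exhaustively. The following Python sketch (an illustration only; the collection and the values of µ are our arbitrary choices) builds µ∗ from a set function µ with µ(∅) = 0 and checks that the result is monotone and subadditive, as the theorem guarantees.

```python
from itertools import chain, combinations

# An arbitrary collection containing the empty set, with an arbitrary
# nonnegative set function mu, mu(empty) = 0; values chosen for illustration.
X = frozenset({0, 1, 2})
collection = {frozenset(): 0.0,
              frozenset({0}): 1.0,
              frozenset({1, 2}): 1.5,
              frozenset({0, 1, 2}): 2.0}

def mu_star(A):
    """mu*(A) = inf over covers of A by sets from the collection, as in (3.16).
    On a finite X it suffices to minimize over subfamilies of distinct sets;
    an uncoverable A gets inf(empty) = +infinity."""
    best = float('inf')
    sets = list(collection)
    for r in range(1, len(sets) + 1):
        for fam in combinations(sets, r):
            if A <= frozenset().union(*fam):
                best = min(best, sum(collection[S] for S in fam))
    return best

powerset = [frozenset(s) for s in chain.from_iterable(
    combinations(X, r) for r in range(4))]

# mu* is monotone and subadditive (the finite analogue of Theorem 3.9):
for A in powerset:
    for B in powerset:
        if A <= B:
            assert mu_star(A) <= mu_star(B)
        assert mu_star(A | B) <= mu_star(A) + mu_star(B)
print({tuple(sorted(A)): mu_star(A) for A in powerset})
```

Note, as in Problem 4 of the exercises, that µ∗ can disagree with µ on the collection itself when µ is badly behaved; the brute-force search makes such examples easy to hunt for.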
To get practice using the definitions (3.15) or (3.16), let us show that Lebesgue outer measure gives volumes consistent with our usual notion of volume.

Example 3.4. Let a = (a₁, ..., aₙ), b = (b₁, ..., bₙ) ∈ Rⁿ with a_k < b_k for each k and consider the open box (a, b). Is it true that m∗(a, b) = (b₁ − a₁)···(bₙ − aₙ)? Since (a, b) ⊆ (a, b], by definition of m∗(a, b), we have

m∗(a, b) ≤ m(a, b] = (b₁ − a₁)···(bₙ − aₙ).

To prove the opposite inequality, let (a, b) ⊆ ∪_{k=1}^∞ I_k where I_k ∈ Iⁿ for each k. For every ε > 0 note that (a, b − ε] ⊆ (a, b) ⊆ ∪_{k=1}^∞ I_k, where we take ε > 0 small enough so that (a, b − ε] is not empty. We know that m is a measure on Iⁿ, so it is countably subadditive, and hence

(b₁ − ε − a₁)···(bₙ − ε − aₙ) = m(a, b − ε] ≤ Σ_{k=1}^∞ m(I_k).

Since ε > 0 can be arbitrarily small, this inequality implies that

(b₁ − a₁)···(bₙ − aₙ) ≤ Σ_{k=1}^∞ m(I_k).

Thus, by definition of infimum, (b₁ − a₁)···(bₙ − aₙ) ≤ m∗(a, b), which implies that m∗(a, b) = (b₁ − a₁)···(bₙ − aₙ).

Similar proofs show that the outer measure of any interval is its usual measure; for instance, m∗[0, 1] = 1, and the outer measure of a single point is zero (here, if a ∈ Rⁿ, the singleton {a} can be written as [a, a]).

Example 3.5. If A ⊆ Rⁿ is countable, A = {a₁, a₂, ...} = ∪_k {a_k}, then as m∗{a_k} = 0 for each k, by countable subadditivity it follows that A has outer measure zero:

m∗(A) ≤ Σ_{k=1}^∞ m∗{a_k} = Σ_{k=1}^∞ 0 = 0.
This example proves the following:

Theorem 3.10. Any countable subset of Rⁿ has measure zero. In particular, any subset of the rational numbers in R has measure zero, and any subset of Rⁿ with positive outer measure must be uncountable.
Thus, as m∗[0, 1] = 1, the interval [0, 1] is uncountable. In particular, since the rational numbers are countable, we have a measure-theoretic proof that irrational (= nonrational) numbers in [0, 1] exist and that they form an uncountable subset of [0, 1]. Of course, this is one of the most difficult proofs that irrational numbers exist! It might surprise you that there are uncountable sets with measure zero. One example was defined by Cantor and will be studied in Section 4.5.

3.4.2. Measures and measurable sets. Recall that given a σ-algebra S of subsets of a set X, a set function µ : S → [0, ∞] is a measure if
• µ(∅) = 0;
• µ is countably additive in the sense that

A = ∪_{n=1}^∞ A_n  ⟹  µ(A) = Σ_{n=1}^∞ µ(A_n)

for any sequence of pairwise disjoint sets {A_n} ⊆ S.

From Theorem 3.4 we know that measures are outer measures, but outer measures may not be measures (as we've seen by examples). The triple (X, S, µ) (or (X, µ) if S is understood, or (S, µ) if X is understood, or just X if both S and µ are understood) is called a measure space, and sets in S are said to be measurable (or µ-measurable to be more precise). µ is called a probability measure if µ(X) = 1, in which case (X, S, µ) is called a probability space.

The celebrated Carathéodory's Theorem, see Theorem 3.11 below, shows how to construct "measurable sets" from outer measures. The basic idea to do this for subsets of R was given in the introduction to this chapter, Section 3.1, where we showed that it is natural to consider a subset A ⊆ R as being measurable if for any subset E ⊆ R, we have

m∗(E ∩ A) + m∗(E \ A) = m∗(E).
Of course, this idea can be applied to any outer measure! Thus, we shall declare a subset A ⊆ X to be measurable, or Φ-measurable to emphasize Φ, if for any subset E ⊆ X, we have

(3.18)  Φ(E) = Φ(E ∩ A) + Φ(E \ A).

We can think of this as saying that A "cleanly" cuts any set in the sense that if any set E is sliced into parts in A and outside of A, then Φ is additive on this decomposition. [Picture: a set E cut by A into E ∩ A and E \ A, with Φ(E) = Φ(E ∩ A) + Φ(E \ A).]

Sets A that don't always cut "cleanly" are not measurable. The set of all measurable sets is denoted by M_Φ:

M_Φ = { A ∈ P(X) ; for all E ⊆ X, Φ(E) = Φ(E ∩ A) + Φ(E \ A) }.

One can easily check that ∅ and X are measurable; for instance, X is measurable because for any E ⊆ X, we have E ∩ X = E and E \ X = ∅, so (3.18) is just the tautology Φ(E) = Φ(E). Thus, M_Φ is not empty. We can rewrite (3.18) as
follows. Since E \ A = E ∩ Ac, where Ac = X \ A is the complement of A, (3.18) becomes

Φ(E) = Φ(E ∩ A) + Φ(E ∩ Ac).

Since outer measures are subadditive, given a set A ⊆ X, we always have

Φ(E) ≤ Φ(E ∩ A) + Φ(E ∩ Ac),  for all E ⊆ X.

Thus,

A ∈ M_Φ  ⟺  for all E ⊆ X, Φ(E ∩ A) + Φ(E ∩ Ac) ≤ Φ(E).
For those of you who might not be convinced by the definition (3.18) of measurability, here is a quote by the famous mathematicians John Frank Charles Kingman (1939– ) and Samuel James Taylor (1929– ) that might help [148, p. 75]: The reader may find the above explanation of condition (3.18) still inadequate to provide the definition of measurability with much intuitive content. This is a case where the definition is justified by the result — it turns out that, for suitable outer measures, a wide class of sets is measurable and the class of measurable sets has got the right kind of structural properties. The definition is therefore justified ultimately by the elegance and usefulness of the theory which results from it. Example 3.6. Let X = {a, b, c} consist of three distinct elements and let Φ be defined by Φ(∅) = 0, Φ(X) = 2, and Φ(A) = 1 otherwise. In Example 3.3 we proved that Φ is an outer measure. Let us determine MΦ . We already know that ∅ and X are measurable, so let A ⊆ X with A 6= ∅, X. Since A 6= X, Ac 6= ∅, therefore both A and Ac are nonempty. Let d, e ∈ {a, b, c} with d ∈ A and e ∈ Ac . If E = {d, e}, then E ∩ A = {d} and E ∩ Ac = {e}. Thus, by definition of Φ, we have Φ(E) = 1, Φ(E ∩ A) = 1 and Φ(E ∩ Ac ) = 1. Therefore the left-hand side of (3.18) is unity and the right-hand side is two, which implies that A is not measurable. It follows that MΦ = {∅, X}.
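The computation in Example 3.6 is easy to automate: for a finite X one can simply test the Carathéodory condition (3.18) for every pair (A, E). The short Python sketch below (an illustration, not part of the text) does this for the outer measure of Examples 3.3 and 3.6 and recovers M_Φ = {∅, X}.

```python
from itertools import chain, combinations

X = frozenset('abc')
subsets = [frozenset(s) for s in chain.from_iterable(
    combinations('abc', r) for r in range(4))]

def phi(A):
    """The outer measure of Examples 3.3 and 3.6."""
    return 0 if not A else (2 if A == X else 1)

def measurable(A):
    """Caratheodory's condition (3.18): A must cut every test set E cleanly."""
    return all(phi(E) == phi(E & A) + phi(E - A) for E in subsets)

M = [A for A in subsets if measurable(A)]
print([set(A) for A in M])  # only the empty set and X pass the test
```

For A = {a} the scan finds the failing test set E = {d, e} = {a, b} from the example: Φ(E) = 1 while Φ(E ∩ A) + Φ(E \ A) = 2.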
For this example, note that M_Φ is trivially a σ-algebra and Φ defines a measure on M_Φ. The Carathéodory Theorem 3.11 below states that for a general outer measure Φ, M_Φ forms a σ-algebra on which Φ defines a complete measure. Here, a measure µ : S → [0, ∞] on a σ-algebra S is said to be complete if

A ⊆ B and B ∈ S with µ(B) = 0  ⟹  A ∈ S.

If this holds, then µ(A) = 0 also, since µ is monotone. In words, µ is complete means that any subset of a measurable set of measure zero is measurable (that is, must belong to the σ-algebra).

Example 3.7. It's easy to find trivial examples of complete measures. For instance, given a σ-algebra S we can define µ : S → [0, ∞] by µ(∅) = 0 and µ(A) = ∞ otherwise. It's easy to check that µ is a complete measure. Nontrivial examples of complete measures are Lebesgue measure and, more generally, any outer measure restricted to its measurable sets, as we'll see in Carathéodory's Theorem below.
3.4.3. Carathéodory's theorem: Outer measures to measures. Here is the celebrated theorem due to Carathéodory.

Carathéodory's Theorem

Theorem 3.11. Let Φ : P(X) → [0, ∞] be an outer measure and let M_Φ be the collection of Φ-measurable sets. Then M_Φ is a σ-algebra and the restriction of Φ to M_Φ,

Φ : M_Φ → [0, ∞],

defines a measure. Moreover,

A ⊆ X and Φ(A) = 0  ⟹  A ∈ M_Φ;

in particular, Φ : M_Φ → [0, ∞] is a complete measure.
Proof: We break up the proof into two steps.

Step 1: We show M_Φ is closed under unions and complements (such a system of sets is called an algebra of sets). To begin, we show that if A ∈ M_Φ, then Ac = X \ A ∈ M_Φ. Indeed, if E ⊆ X, then since A ∈ M_Φ,

Φ(E) = Φ(E ∩ A) + Φ(E ∩ Ac) = Φ(E ∩ Ac) + Φ(E ∩ (Ac)c),

since A = (Ac)c. Thus, Ac ∈ M_Φ.

We next show that M_Φ is closed under unions. Let A, B ∈ M_Φ. Then for any set E ⊆ X, we need to show that

Φ(E) = Φ(E ∩ (A ∪ B)) + Φ(E ∩ (A ∪ B)c).

To see this, we apply the definition of the measurability of B to obtain

(3.19)  Φ(E ∩ (A ∪ B)) = Φ(E ∩ (A ∪ B) ∩ B) + Φ(E ∩ (A ∪ B) ∩ Bc) = Φ(E ∩ B) + Φ(E ∩ A ∩ Bc),

where we used that (A ∪ B) ∩ B = B and B ∩ Bc = ∅. Now using the fact that E ∩ (A ∪ B)c = E ∩ Ac ∩ Bc, we obtain

Φ(E ∩ (A ∪ B)) + Φ(E ∩ (A ∪ B)c) = Φ(E ∩ B) + Φ(E ∩ A ∩ Bc) + Φ(E ∩ Ac ∩ Bc).

Since A is measurable, the sum of the last two terms is

Φ(E ∩ Bc ∩ A) + Φ(E ∩ Bc ∩ Ac) = Φ(E ∩ Bc),

so

Φ(E ∩ (A ∪ B)) + Φ(E ∩ (A ∪ B)c) = Φ(E ∩ B) + Φ(E ∩ Bc) = Φ(E),

since B is measurable. This shows that A ∪ B is measurable. Thus, M_Φ is an algebra of sets. In particular, since A ∩ B = (Ac ∪ Bc)c, M_Φ is closed under intersections, and since A \ B = A ∩ Bc, it is also closed under differences.

Step 2: Next, we show that M_Φ is a σ-algebra and Φ is a measure on M_Φ. We already know that ∅ ∈ M_Φ and M_Φ is closed under complements, so we just have to prove that M_Φ is closed under countable unions. To this end, let A = ∪_{n=1}^∞ A_n where the A_n's are measurable sets. We need to show that A is measurable. Replacing A_n with A_n \ (A_1 ∪ ··· ∪ A_{n−1}), which is also measurable since M_Φ is closed under unions and differences, we may assume that A_1, A_2, A_3, ... are pairwise disjoint. Now given E ⊆ X, we need to show that

Φ(E) ≥ Φ(E ∩ A) + Φ(E ∩ Ac).
Our technique to prove this is to use the measurability of A_1, then A_2, then A_3, etc., to try and express Φ(E) in terms of A = ∪_{n=1}^∞ A_n. To start, observe that since A_1 ∈ M_Φ, we have

Φ(E) = Φ(E ∩ A_1) + Φ(E ∩ A_1^c).

Since A_2 ∈ M_Φ we can write the second term as

Φ(E ∩ A_1^c) = Φ(E ∩ A_1^c ∩ A_2) + Φ(E ∩ A_1^c ∩ A_2^c) = Φ(E ∩ A_2) + Φ(E ∩ (A_1 ∪ A_2)^c),

where we used that A_1^c ∩ A_2 = A_2 \ A_1 = A_2 (recalling that A_1 and A_2 are disjoint) and De Morgan's law A_1^c ∩ A_2^c = (A_1 ∪ A_2)^c. Thus,

Φ(E) = Φ(E ∩ A_1) + Φ(E ∩ A_2) + Φ(E ∩ (A_1 ∪ A_2)^c).

We now see the pattern: by using induction (left to you!), for any N ∈ N we have

Φ(E) = Σ_{n=1}^N Φ(E ∩ A_n) + Φ(E ∩ (A_1 ∪ ··· ∪ A_N)^c).

Since A_1 ∪ ··· ∪ A_N ⊆ A, we have Ac ⊆ (A_1 ∪ ··· ∪ A_N)^c. Thus, since Φ is monotone, we have Φ(E ∩ Ac) ≤ Φ(E ∩ (A_1 ∪ ··· ∪ A_N)^c), and so

Φ(E) ≥ Σ_{n=1}^N Φ(E ∩ A_n) + Φ(E ∩ Ac).

This formula holds for any N, so taking N → ∞ and using that limits preserve inequalities, we have

(3.20)  Φ(E) ≥ Σ_{n=1}^∞ Φ(E ∩ A_n) + Φ(E ∩ Ac).

Now E ∩ A = ∪_{n=1}^∞ (E ∩ A_n), so recalling that Φ is countably subadditive, we have

(3.21)  Φ(E ∩ A) ≤ Σ_{n=1}^∞ Φ(E ∩ A_n).

Combining this with (3.20), we see that

Φ(E) ≥ Φ(E ∩ A) + Φ(E ∩ Ac),

which shows that A ∈ M_Φ. Moreover, putting E = A in (3.20) (noting that A ∩ A_n = A_n for each n and A ∩ Ac = ∅) and in (3.21), we see that

Φ(A) ≥ Σ_{n=1}^∞ Φ(A_n)  and  Φ(A) ≤ Σ_{n=1}^∞ Φ(A_n).

This implies that Φ(A) = Σ_{n=1}^∞ Φ(A_n), which shows that Φ is a measure on M_Φ. This completes our proof.
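As a contrast with Example 3.6, where only ∅ and X were measurable, for the counting outer measure of Exercise 1(a) below every subset satisfies the Carathéodory condition, and the restriction of Φ to M_Φ = P(X) is additive, exactly as Theorem 3.11 predicts. A quick Python check on a small finite X (an illustration only; the choice X = {0, 1, 2, 3} is ours):

```python
from itertools import chain, combinations

X = frozenset(range(4))
subsets = [frozenset(s) for s in chain.from_iterable(
    combinations(X, r) for r in range(len(X) + 1))]

def phi(A):
    """Counting outer measure: the number of points of A (Exercise 1(a))."""
    return len(A)

# Every subset is measurable: |E| = |E & A| + |E \ A| for all test sets E.
for A in subsets:
    assert all(phi(E) == phi(E & A) + phi(E - A) for E in subsets)

# And phi is additive on disjoint sets, as the restriction of an outer
# measure to its measurable sets must be by Theorem 3.11.
for A in subsets:
    for B in subsets:
        if not (A & B):
            assert phi(A | B) == phi(A) + phi(B)
print("counting outer measure: every subset is measurable")
```

The two extremes, M_Φ = {∅, X} in Example 3.6 versus M_Φ = P(X) here, show how much the size of the σ-algebra depends on the outer measure one starts with.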
In particular, given any set function µ : A → [0, ∞] on a collection A of subsets of a set X with ∅ ∈ A and µ(∅) = 0, since

µ∗ : P(X) → [0, ∞]

is an outer measure, Carathéodory's theorem says that M_{µ∗}, the collection of µ∗-measurable sets, is a σ-algebra and µ∗ : M_{µ∗} → [0, ∞] is a measure. Applying this result to Lebesgue outer measure, we see that the collection of sets that are measurable with respect to Lebesgue outer measure m∗ on P(Rⁿ) is a σ-algebra, which we denote by Mⁿ (= M_{m∗}):

Mⁿ := Sets that are measurable with respect to m∗.

The collection Mⁿ is called the Lebesgue measurable sets. The completeness part of Carathéodory's Theorem says

A ⊆ Rⁿ and m∗(A) = 0  ⟹  A ∈ Mⁿ,

and the measure part of Carathéodory's theorem says

m : Mⁿ → [0, ∞]
is a measure, where we dropped the "∗" from m∗. We shall study more properties of Mⁿ in Sections 4.3 and 4.4.

◮ Exercises 3.4.

1. In this exercise we compute various outer measures and their corresponding measurable sets. Let X be any nonempty set. For each case, show that Φ is an outer measure and determine the measurable sets.
(a) For any set A ⊆ X, define Φ(A) as the number of points of A if A is finite and Φ(A) = ∞ if A is infinite.
(b) For any set A ⊆ X, define Φ(A) = 0 if A = ∅ and Φ(A) = 1 otherwise.
(c) Define Φ(∅) = 0, Φ(X) = 2, and Φ(A) = 1 otherwise. Consider the cases when X has one, two, and at least three elements.
(d) Now assume that X is uncountable and for any set A ⊆ X, define Φ(A) = 0 if A is countable and Φ(A) = 1 if A is uncountable.
2. Let Φ : P(X) → [0, ∞] be an outer measure.
(a) Prove that a set A ⊆ X is Φ-measurable if and only if Φ(B ∪ C) = Φ(B) + Φ(C) for all sets B ⊆ A and C ⊆ Ac.
(b) If Φ is finitely additive, prove that Φ is countably additive, that is, a measure.
3. In this exercise we compute various outer measures generated by set functions on collections of subsets of a nonempty set X. For each case, given µ : A → [0, ∞], determine µ∗(A) for all A ⊆ X and determine the corresponding µ∗-measurable sets.
(a) Let A consist of ∅, X, and all singletons (sets consisting of one element). Assume that X has at least two elements and define µ : A → [0, ∞] by µ(∅) = 0, µ(X) = ∞, and µ(A) = 1 for all singleton sets A.
(b) Assume that X is uncountable and let A be as in (a). Define µ(X) = 1 and µ(A) = 0 for all singleton sets A.
(c) Define f : R → R by

f(x) = 0 if x ≤ 0,  f(x) = 1 if x > 0,

and let µ = µ_f and A = I¹, where µ_f : I¹ → [0, ∞) is the Lebesgue-Stieltjes additive set function corresponding to f. Suggestion: Show that µ∗ = 0; that is, µ∗(A) = 0 for all A ⊆ R. Note that µ∗(I) ≠ µ(I) for any I = (a, b] with a ≤ 0 < b.
(d) Define f : R → R by

f(x) = 0 if x < 0,  f(x) = 1 if x ≥ 0,

and let µ = µ_f and A = I¹, where µ_f : I¹ → [0, ∞) is the Lebesgue-Stieltjes additive set function corresponding to f. Remark: In (d), you found that µ∗(I) = µ(I) for all I ∈ I¹, which is very different from what happens in (c). The difference between (c) and (d) is that in (d), µ : I¹ → [0, ∞) is a measure, while in (c), µ is only additive but is not a measure (see Theorem 3.2). The Extension Theorem found in the next section is the underlying reason for this difference.
4. Let µ : A → [0, ∞] be as in Theorem 3.9. Prove that µ∗(A) ≤ µ(A) for all A ∈ A. Find an example of a set function µ and a set A ∈ A such that µ∗(A) ≠ µ(A).
5. In this problem we see that funny things can happen involving infinities.
(a) Let µ : S → [0, ∞] be a complete measure on a σ-algebra S and let A, B ∈ S with µ(A) = µ(B) < ∞. Prove that given any set C such that A ⊆ C ⊆ B, we have C ∈ S. If µ(A) = µ(B) = ∞, can we still conclude that C ∈ S? Prove it or provide a counterexample. Suggestion: If you're spending hours on finding
a counterexample, consider a very simple example such as found in, for instance, Example 3.7.
(b) Let µ : S → [0, ∞] be a measure and let A ∈ S have finite measure. Let {A_n} be a sequence of pairwise disjoint subsets A_n ⊆ A with A_n ∈ S for each n, and assume that µ(A) = Σ_{n=1}^∞ µ(A_n). Prove that µ(A \ ∪_{n=1}^∞ A_n) = 0. What if we drop the assumption that µ(A) < ∞; is the result still true?
6. In this problem we look at properties of Lebesgue outer measure.
(a) Let a = (a₁, ..., aₙ), b = (b₁, ..., bₙ) ∈ Rⁿ with a_k ≤ b_k for each k. Using the definition (3.15) of Lebesgue outer measure, prove that

m∗(a, b] = m∗[a, b] = m∗[a, b) = (b₁ − a₁)···(bₙ − aₙ),

the usual notion of volume. Of course, this formula holds for all types of bounded boxes given as products of the various sorts of intervals in R.
(b) Using the definition of Lebesgue outer measure, prove that any intersection of the coordinate planes in Rⁿ has Lebesgue measure zero; e.g., given an integer 1 ≤ k ≤ n, show that {0}^k × R^{n−k} has measure zero as a subset of Rⁿ.
7. Let A denote the set of open intervals in R and let µ : A → [0, ∞] assign to each such interval its standard length. Show that µ∗(A) = m∗(A) for all subsets A ⊆ R, where m∗ is Lebesgue outer measure defined in (3.15), which uses only left-half open intervals. (Remark: Of course this problem applies equally well if you replace A by your favorite type(s) of bounded intervals in R: open, closed, right-half open, etc. (even mixed) and let A be the collection of all your favorite types of intervals. This problem also applies to Rⁿ, where A is a collection of your favorite bounded boxes.)
8. Let I_c^n denote the set of all left-half open cubes, where a cube is a box whose sides have the same length. Let µ_c : I_c^n → [0, ∞] assign to each such box its standard volume. Show that µ_c∗(A) = m∗(A) for all subsets A ⊆ Rⁿ.
9. Let f : [a, b] → R be a continuous, hence uniformly continuous, function.
Let A = {(x, f(x)) ; x ∈ [a, b]} ⊆ R² be the graph of f. In this problem we prove that A has Lebesgue measure zero.
(i) Let ε > 0 be arbitrary and choose δ > 0 so that |f(x) − f(y)| < ε for |x − y| < δ. Using this fact, show that we can write [a, b] = ∪_{k=1}^N I_k where the I_k's are intervals with pairwise disjoint interiors such that for some points a_k ∈ I_k, we have

A ⊆ ∪_{k=1}^N I_k × [f(a_k) − ε, f(a_k) + ε].

(ii) Prove that m∗(A) ≤ 2ε(b − a). Conclude that m∗(A) = 0.
10. We generalize the previous problem to graphs in Rⁿ.
(a) Let K ⊆ R^{n−1} be a compact set and let f : K → R be a continuous, hence uniformly continuous, function. Let A = {(x, f(x)) ; x ∈ K} ⊆ Rⁿ be the graph of f. Prove that m∗(A) = 0.
(b) Show that the sphere S^{n−1} := {x ∈ Rⁿ ; x₁² + ··· + xₙ² = 1} has measure zero as a subset of Rⁿ.
11. (Outer content) For any subset A ⊆ Rⁿ, define

c(A) := inf { Σ_{k=1}^N m(I_k) ; N ∈ N, A ⊆ ∪_{k=1}^N I_k, I_k ∈ Iⁿ }.

We define inf ∅ := ∞. The number c(A) is called the outer (Jordan) content of A and was introduced by Marie Ennemond Camille Jordan (1838–1922) in 1892.
(i) Show that c : P(Rⁿ) → [0, ∞] is finitely subadditive.
(ii) Show that c{a} = 0 for any point a ∈ Rⁿ.
(iii) Let A be a dense subset of [0, 1]ⁿ (e.g. A is the set of all points in [0, 1]ⁿ with rational coordinates). Show that c(A) = 1. Suggestion: If A ⊆ ∪_{k=1}^N (a_k, b_k], then
168
3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY
taking closures of both sides we obtain⁹ [0, 1]ⁿ ⊆ ⋃_{k=1}^N [a_k, b_k]. Can you use this fact to show that c(A) = 1?
(iv) Show that c : P(Rⁿ) → [0, ∞] is not countably subadditive.
(v) Finally, show that c : P(Rⁿ) → [0, ∞] is not finitely additive. Suggestion: Let A = Qⁿ ∩ [0, 1]ⁿ and B = [0, 1]ⁿ \ A. Find c(A ∪ B), c(A) and c(B).
(vi) Show that m∗(A) ≤ c(A) for all A ⊆ Rⁿ. In particular, a set with zero content has zero measure. The converse is false; however, prove the following:
(vii) If A ⊆ Rⁿ, then c(A) = 0 if and only if the closure A̅ is compact and m∗(A̅) = 0.
12. (Nonmeasurable sets; cf. [175].) Let Φ : P(X) → [0, ∞] be an outer measure. Observe that if A is measurable, then there is a measurable set B such that A ⊆ B and Φ(B \ A) = 0; just take B = A. For a nonmeasurable set, this property fails.
(a) Let A be a nonmeasurable set; that is, let A ⊆ X with A ∉ MΦ. Show there is an ε > 0 such that for any B ∈ MΦ with A ⊆ B, we have Φ(B \ A) ≥ ε. Suggestion: If not, then for each n there is a measurable set B_n ⊇ A with Φ(B_n \ A) < 1/n. Let B = ⋂_{n=1}^∞ B_n and use B to show that A is measurable.
(b) Given a nonmeasurable set A, show that there exists an ε > 0 such that for any measurable sets B, C ∈ MΦ with B ⊇ A and C ⊇ A^c, we have Φ(B ∩ C) ≥ ε.
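Problems 9 and 10 above can be explored numerically: covering the graph of a uniformly continuous function by finitely many thin rectangles yields total area 2ε(b − a), which shrinks to 0 with ε. A minimal sketch in Python (the choice f(x) = x² and the uniform partition are illustrative assumptions, not from the text):

```python
import math

def cover_area(f, a, b, eps, delta):
    # Partition [a, b] into n intervals of length <= delta; over each
    # interval I_k the graph piece lies in I_k x [f(a_k)-eps, f(a_k)+eps]
    # once delta is a modulus-of-continuity gap for eps.
    n = max(1, math.ceil((b - a) / delta))
    width = (b - a) / n
    return n * width * (2 * eps)   # total area of the covering rectangles

# f(x) = x^2 on [0, 1]: |f(x) - f(y)| <= 2|x - y|, so delta = eps/2 works.
for eps in [0.1, 0.01, 0.001]:
    area = cover_area(lambda x: x * x, 0.0, 1.0, eps, eps / 2)
    print(eps, area)   # total area is 2*eps*(b - a), shrinking to 0
```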
3.5. The Extension Theorem and regularity properties of measures

The big theorem in this section is the Extension Theorem, which states in particular that a measure on a semiring can always be extended to a measure on the σ-algebra generated by the semiring. For instance, we can extend Lebesgue measure from I^n to the Borel sets, and we can extend any probability set function on the cylinder sets of a sequence space to the σ-algebra generated by the cylinder sets. However, we begin this section by answering some important uniqueness questions involving Carathéodory's definition of measurability.

3.5.1. Regular outer measures and the uniqueness of measurable sets. Given an outer measure Φ, you may ask the following:

Question: What right does MΦ have to be called the Φ-measurable sets?

We shall give two precise reformulations of this question with their answers. Here's our first formulation:

Question 1: Can we find a strictly larger class of sets on which Φ defines a measure? In other words, does there exist a σ-algebra S such that MΦ ⊆ S with MΦ ≠ S and Φ : S → [0, ∞] is a measure?

The answer is no for regular outer measures: An outer measure Φ is said to be regular if given any A ⊆ X there is a B ∈ MΦ such that A ⊆ B and Φ(A) = Φ(B). Roughly speaking, if we think of an arbitrary set A as being possibly quite "ugly" and we think of the set B ∈ MΦ as being "nice", then regularity basically says that we can determine the outer measures of "ugly" sets by considering only "nice" elements of MΦ; here's a picture of an "ugly" set A on the left and a "nicer" (not so jagged) set B ∈ MΦ on the right containing A:
[Figure: an "ugly" jagged set A contained in a "nice" set B ∈ MΦ, with A ⊆ B and Φ(A) = Φ(B).]
9 From topology, if B ⊆ C1 ∪ · · · ∪ CN , a finite union, then B ⊆ C 1 ∪ · · · ∪ C N , where the bar above the set represents closure of the set. (This fact is not true for countable unions.)
Every outer measure we encounter in practice (e.g. Lebesgue measure) is regular (see Theorem 3.13 below). Some properties of regular outer measures are explored in Problem 10. The following theorem answers Question 1.

Proposition 3.12. For a regular outer measure Φ, there is no σ-algebra strictly larger than the Φ-measurable sets on which Φ defines a measure.

This proposition justifies the term "the Φ-measurable sets." We shall leave the proof of this proposition to Problem 5. The following example shows that the regularity assumption in the proposition cannot be dropped.

Example 3.8. Consider a previous example: X = {a, b, c} consisting of three distinct elements with Φ(∅) = 0, Φ(X) = 2, and Φ(E) = 1 otherwise. We showed that MΦ = {∅, X}. If A = {a}, then the only measurable set containing A is X, and Φ(A) = 1 ≠ 2 = Φ(X), so Φ is not regular. Consider the σ-algebra S = {∅, A, A^c, X}. Note that MΦ ⊆ S and Φ(X) = 2 = 1 + 1 = Φ(A) + Φ(A^c). Using this fact one can check that Φ is a measure on S. Thus, S is a strictly larger σ-algebra than the Φ-measurable sets on which Φ defines a measure.
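The arithmetic behind Example 3.8 is small enough to verify by brute force. A hypothetical sketch (the encoding of sets as Python frozensets is my own): it checks that only ∅ and X satisfy Carathéodory's condition, yet Φ is additive on the strictly larger σ-algebra {∅, A, A^c, X}:

```python
from itertools import chain, combinations

# The three-point example from Example 3.8: X = {a, b, c},
# Phi(empty) = 0, Phi(X) = 2, Phi(E) = 1 for every other subset E.
X = frozenset({'a', 'b', 'c'})

def phi(E):
    E = frozenset(E)
    if not E:
        return 0
    return 2 if E == X else 1

def measurable(A):
    # Caratheodory's condition: A splits every test set E additively.
    A = frozenset(A)
    subsets = [frozenset(s) for s in chain.from_iterable(
        combinations(X, r) for r in range(4))]
    return all(phi(E & A) + phi(E - A) == phi(E) for E in subsets)

# Only the trivial sets are Phi-measurable ...
assert measurable(frozenset()) and measurable(X)
assert not measurable({'a'})

# ... yet Phi is additive on the strictly larger sigma-algebra
# S = {empty, A, A^c, X} with A = {a}:
A = frozenset({'a'})
S = [frozenset(), A, X - A, X]
for E in S:
    for F in S:
        if not (E & F) and (E | F) in S:
            assert phi(E | F) == phi(E) + phi(F)
print("checks pass")
```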
We remark that Proposition 3.12 does not say that MΦ is the "largest" σ-algebra on which Φ defines a measure. Here, to say that MΦ is the largest means that if S is a σ-algebra and Φ : S → [0, ∞] is a measure, then S ⊆ MΦ. In other words, MΦ is "largest" means that if Φ is a measure on a σ-algebra S, then MΦ must contain S. Thus, we ask:

Question 2: Is MΦ the largest σ-algebra on which Φ is a measure?

In Problem 12, you will show that the answer is in general no, even if Φ is regular! However, if the outer measure is generated from an additive set function on a semiring, it is true that the collection of measurable sets is the largest σ-algebra containing the semiring on which the outer measure is a measure; this is the content of . . .

The Regularity Theorem
Theorem 3.13. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X and let µ∗ : P(X) → [0, ∞]
be the outer measure defined in Equation (3.16) of Theorem 3.9. Then (1) µ∗ is a regular outer measure. In fact, the following stronger regularity property holds: Given any A ⊆ X, there is a B ∈ S (I ) ⊆ Mµ∗ such that A⊆B
and µ∗ (A) = µ∗ (B).
(2) Mµ∗ , the µ∗ -measurable sets, is the largest σ-algebra containing I on which µ∗ defines a measure. Thus, Carath´eodory’s definition of measurable sets produces the largest σalgebra that contains I for which µ∗ defines a measure; in this sense, Carath´eodory’s definition is the “optimal” definition one can possibly hope for. We shall leave the proof of this theorem for Problems 6 and 7.
3.5.2. The Extension Theorem. The following lemma implies that given an additive set function on a semiring µ : I → [0, ∞] we always have S (I ) ⊆ Mµ∗ and we can extend µ to a measure on S (I ) using µ∗ if and only if µ is a measure. The Extension Theorem, Theorem 3.15 which follows this lemma, discusses the uniqueness of the extension.
Lemma 3.14. Let µ : I → [0, ∞] be a finitely additive set function on a semiring I of subsets of a set X, let µ∗ : P(X) → [0, ∞]
be the generated outer measure defined in Equation (3.16) of Theorem 3.9, and let Mµ∗ denote the µ∗ -measurable sets. Then (1) S (I ) ⊆ Mµ∗ , where S (I ) is the σ-algebra generated by I . In particular, restricting to S (I ), µ∗ : S (I ) → [0, ∞]
is a measure. (2) µ∗ extends µ (i.e. µ∗ (I) = µ(I) for all I ∈ I ) if and only if µ : I → [0, ∞] is a measure. Proof : Throughout this proof we denote S (I ) by S . Proof of (1): Since Mµ∗ is a σ-algebra and S is the smallest σ-algebra containing I , to prove that S ⊆ Mµ∗ we just need to show that I ⊆ Mµ∗ , that is, given A ∈ I , we need to prove that for all subsets E ⊆ X, µ∗ (E ∩ A) + µ∗ (E \ A) ≤ µ∗ (E).
If µ∗(E) = +∞, then this inequality is satisfied, so let E ⊆ X and assume that µ∗(E) < +∞. Recall that

(3.22)  µ∗(E) := inf{ Σ_{n=1}^∞ µ(I_n) ; E ⊆ ⋃_{n=1}^∞ I_n, I_n ∈ I }.

Let ε > 0. Then µ∗(E) < µ∗(E) + ε, so µ∗(E) + ε is not a lower bound for the set on the right-hand side of (3.22); therefore there are sets I₁, I₂, ... ∈ I such that

(3.23)  E ⊆ ⋃_{n=1}^∞ I_n  with  Σ_{n=1}^∞ µ(I_n) < µ∗(E) + ε.
Note that E ∩ A ⊆ ⋃_{n=1}^∞ (I_n ∩ A) with I_n ∩ A ∈ I, and that E \ A ⊆ ⋃_{n=1}^∞ (I_n \ A). Since I_n, A ∈ I, by the definition of a semiring, for each n we have I_n \ A = ⋃_m I_{nm} for finitely many pairwise disjoint elements I_{nm} ∈ I. Thus, E \ A ⊆ ⋃_{n,m} I_{nm}. Hence, by definition of µ∗ in (3.22), we have

µ∗(E ∩ A) ≤ Σ_{n=1}^∞ µ(I_n ∩ A)  and  µ∗(E \ A) ≤ Σ_{n,m} µ(I_{nm}).

Therefore,

µ∗(E ∩ A) + µ∗(E \ A) ≤ Σ_{n=1}^∞ µ(I_n ∩ A) + Σ_{n,m} µ(I_{nm}) = Σ_{n=1}^∞ µ(I_n) < µ∗(E) + ε,

where we used that I_n = (I_n ∩ A) ∪ (I_n \ A) = (I_n ∩ A) ∪ ⋃_m I_{nm} is a union of pairwise disjoint elements, so µ(I_n) = µ(I_n ∩ A) + Σ_m µ(I_{nm}) since µ is additive. Hence, µ∗(E ∩ A) + µ∗(E \ A) ≤ µ∗(E) + ε. Taking ε → 0, it follows that A ∈ Mµ∗. Thus, I ⊆ Mµ∗.
Proof of (2): Assume that µ∗ extends µ; we show that µ : I → [0, ∞] is a measure. Indeed, by (1), µ∗ : S(I) → [0, ∞] is a measure, so by restricting to I ⊆ S(I) it follows that µ∗ : I → [0, ∞] is also a measure. Since µ∗ = µ on I, we conclude that µ is a measure. Conversely, assume that µ : I → [0, ∞] is a measure; given I ∈ I, we need to show that µ∗(I) = µ(I). Since I ⊆ ⋃_{n=1}^∞ I_n where I₁ = I and I_n = ∅ for all n > 1, by the definition of µ∗(I) as an infimum we have µ∗(I) ≤ Σ_{n=1}^∞ µ(I_n) = µ(I). On the other hand, since µ is a measure it is countably subadditive, so for any I₁, I₂, ... ∈ I such that I ⊆ ⋃_{n=1}^∞ I_n, we have µ(I) ≤ Σ_{n=1}^∞ µ(I_n). Hence, µ(I) is a lower bound for the set on the right-hand side of (3.22) for E = I. Since µ∗(I) is the greatest such lower bound, it follows that µ(I) ≤ µ∗(I). Therefore, µ∗(I) = µ(I).
We now come to the main result of this section, but first a definition: An additive set function µ : I → [0, ∞] on a semiring I is said to be σ-finite if we can write X = I1 ∪ I2 ∪ I3 ∪ I4 ∪ · · · for some pairwise disjoint sets I1 , I2 , . . . ∈ I with µ(In ) < ∞ for each n. Example 3.9. Lebesgue measure on I n is σ-finite because Rn is covered by, for instance, unit volume cubes with integer vertices.
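Example 3.9 can be illustrated numerically: every point of R² lies in exactly one left-half open unit square with integer corners, so these squares form a pairwise disjoint countable cover by sets of finite measure. A sketch (the sampling scheme is my own choice):

```python
import math
import random

# Sigma-finiteness of Lebesgue measure (Example 3.9, sketched in R^2):
# the left-half open unit squares (i, i+1] x (j, j+1], i, j in Z, are
# pairwise disjoint, each has measure 1, and every point of R^2 lies
# in exactly one of them.
def cube_containing(x, y):
    # (i, i+1] contains x iff i = ceil(x) - 1
    return (math.ceil(x) - 1, math.ceil(y) - 1)

random.seed(0)
for _ in range(1000):
    x = random.uniform(-50, 50)
    y = random.uniform(-50, 50)
    i, j = cube_containing(x, y)
    assert i < x <= i + 1 and j < y <= j + 1   # the point lies in that cube,
    # and in no other: the cube index is uniquely determined by (x, y).
print("every sampled point lies in exactly one unit square")
```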
Example 3.10. Let µ be any probability set function on the cylinder sets C of a sequence space X = ∏_{i=1}^∞ X_i. In this case, X = I₁ where I₁ = X ∈ C (noting that the whole space X is a cylinder set). Since µ(I₁) = 1 < ∞, it follows that µ is σ-finite.
We now state the Extension Theorem, a schematic of which is shown in Figure 3.7: nested disks representing I ⊆ S(I) ⊆ T ⊆ Mµ∗, where µ is defined on the innermost set I, T is a σ-algebra containing I, and µ∗ is a measure on all of Mµ∗.
Figure 3.7. The disks represent the nondecreasing sequence of sets I ⊆ S (I ) ⊆ T ⊆ Mµ∗ . Here, T ⊆ Mµ∗ is a σ-algebra containing I and therefore S (I ) ⊆ T (since S (I ) is the smallest σ-algebra containing I ). The Extension Theorem says that restricting µ∗ to T is a measure extending µ and this extension is unique if µ is σ-finite.
The Extension Theorem Theorem 3.15. Let µ : I → [0, ∞] be a measure on a semiring I and let T ⊆ Mµ∗ be a σ-algebra containing I (e.g. T = S (I )). Then (1) (Existence:) The restriction of µ∗ to T , µ∗ : T → [0, ∞],
is a measure extending µ; that is, µ∗ (I) = µ(I) for all I ∈ I .
(2) (Uniqueness:) If µ is σ-finite, then µ∗ : T → [0, ∞]
is the only extension of µ to T. In the σ-finite case we always drop the ∗ superscript and write µ : T → [0, ∞] for the extension µ∗.
Proof : That µ∗ : T → [0, ∞] is a measure follows from the fact that µ∗ is a measure on Mµ∗ and T ⊆ Mµ∗, and this measure extends µ thanks to Part (2) of the previous lemma. We now prove uniqueness assuming σ-finiteness; see Problem 1 for examples showing that uniqueness may fail if the σ-finiteness assumption is dropped. Assume T = S(I); we leave the general case T ⊆ Mµ∗ to Problem 8. Let ν : T → [0, ∞] be a measure extending µ where T = S(I); we need to show that ν = µ∗. To prove this, we first make the following
Claim: If B ∈ I and µ(B) < ∞, then ν(A) = µ∗(A) for all A ∈ T with A ⊆ B.
Assuming this claim, let's prove our result. Indeed, by the assumption of σ-finiteness we can write X = ⋃_{n=1}^∞ X_n where {X_n} is a sequence of pairwise disjoint sets in I with µ(X_n) < ∞ for each n. Then given any A ∈ T, we can write A = ⋃_{n=1}^∞ (A ∩ X_n). Since A ∩ X_n ⊆ X_n and µ(X_n) < ∞, by our claim ν(A ∩ X_n) = µ∗(A ∩ X_n) for each n. Hence, by countable additivity,

ν(A) = Σ_{n=1}^∞ ν(A ∩ X_n) = Σ_{n=1}^∞ µ∗(A ∩ X_n) = µ∗(A).
We now prove our claim. We start by showing that for an arbitrary set E ∈ T, we have ν(E) ≤ µ∗(E). To see this, observe that if E ⊆ ⋃_{n=1}^∞ I_n where I_n ∈ I for each n, then by countable subadditivity and the fact that ν = µ on I, we have

ν(E) ≤ Σ_{n=1}^∞ ν(I_n) = Σ_{n=1}^∞ µ(I_n).
Therefore, ν(E) is a lower bound for all sums appearing on the right; since µ∗ (E) is the greatest lower bound it follows that ν(E) ≤ µ∗ (E). Now let B ∈ I with µ(B) < ∞; in particular, ν(B), µ∗ (B) < ∞ (since they both equal µ(B)), so we can use the subtractivity property of measures (see Property (4) in Theorem 2.4). Let A ∈ T with A ⊆ B; we must show that ν(A) = µ∗ (A). To prove this, set E = A and E = B \ A in what we proved in the previous paragraph, obtaining (3.24)
ν(A) ≤ µ∗(A)  and  ν(B \ A) ≤ µ∗(B \ A).
Since B = A ∪ (B \ A) is a union of disjoint elements of T, it follows that

µ∗(A) = µ∗(B) − µ∗(B \ A)                 (subtractivity)
      = µ(B) − µ∗(B \ A)
      = ν(B) − µ∗(B \ A)
      = ν(A) + ν(B \ A) − µ∗(B \ A)       (ν is additive)
      ≤ ν(A)                              (by the right-hand inequality in (3.24)).

Thus, µ∗(A) ≤ ν(A) and combining this with the left-hand inequality in (3.24), we get ν(A) = µ∗(A) and our proof is complete.
Warning: I've seen the Extension Theorem called the Carathéodory extension theorem, the Carathéodory-Fréchet extension theorem, the Carathéodory-Hopf extension theorem, the Hopf extension theorem, the Hahn-Kolmogorov extension theorem, and many others that I can't remember! However, the theorem is originally due to Maurice René Fréchet (1878–1973), who proved it in 1924 [100].
The Extension Theorem is a general theorem dealing with extensions of measures on abstract semirings. Now, what is an abstract theorem good for? Indeed,

The apex and culmination of modern mathematics is a theorem so perfectly general that no particular application of it is feasible. George Pólya (1887–1985).
In this case, the general Extension Theorem does have applications and is usually applied to the σ-algebra T = S(I). For instance, we know from Theorems 3.1, 3.2 and 3.5, all in Section 3.2, that Lebesgue measure m on I^n, any Lebesgue-Stieltjes measure on I¹ corresponding to a right-continuous function, and the infinite product probability measure on the cylinder sets of a sequence space, are all countably additive. Hence, the Extension Theorem gives

Lebesgue, Lebesgue-Stieltjes, and infinite product measures
Theorem 3.16.
(1) There exists a unique measure on the Borel sets B^n extending Lebesgue measure m on I^n. This extension is usually called Borel measure and sometimes Lebesgue measure, and is denoted by m.
(2) The Lebesgue-Stieltjes measure µ_f on I¹ of any right-continuous nondecreasing function f : R → R extends uniquely to a measure on the Borel sets B.
(3) Given probability measures on countably many sample spaces X_i, i = 1, 2, ..., let µ : C → [0, 1] be the induced product measure. Then there exists a unique measure on S(C) that extends µ. This extension is called the (infinite) product measure on S(C) and is denoted by µ.

A more elaborate statement of Part (3) is: Given probability measures {µ_i} on countably many semirings on sample spaces {X_i}, there exists a unique measure (the infinite product measure) µ : S(C) → [0, 1], where C is the set of cylinder subsets of ∏_{i=1}^∞ X_i, that gives the "natural" measure on cylinder sets. By natural we mean that on a cylinder set A₁ × A₂ × ··· × A_n × X_{n+1} × X_{n+2} × ···, we have

µ(A₁ × A₂ × ··· × A_n × X_{n+1} × X_{n+2} × ···) = µ₁(A₁) · µ₂(A₂) ··· µ_n(A_n).
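The "natural" product formula can be sketched for finite sample spaces, where each µ_i(A_i) is a finite sum of point masses. The specific factor measures below are illustrative assumptions, not from the text:

```python
import math

# The "natural" value of the infinite product measure on a cylinder set
# A1 x A2 x ... x An x X_{n+1} x ... is mu1(A1) * mu2(A2) * ... * mun(An).
def cylinder_measure(factor_measures, factor_sets):
    # factor_measures[i] is a dict giving mu_i on the points of X_i;
    # mu_i(A_i) is then obtained by (finite) additivity over points.
    prod = 1.0
    for mu_i, A_i in zip(factor_measures, factor_sets):
        prod *= sum(mu_i[x] for x in A_i)
    return prod

# Three biased coins X_i = {0, 1} with different head-probabilities:
mus = [{1: 0.5, 0: 0.5}, {1: 0.3, 0: 0.7}, {1: 0.9, 0: 0.1}]
# The cylinder set {1} x {0, 1} x {1} x X_4 x X_5 x ...
val = cylinder_measure(mus, [{1}, {0, 1}, {1}])
assert math.isclose(val, 0.5 * 1.0 * 0.9)
print(val)
```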
Theorem 3.16 follows from the Extension Theorem and from the fact that m and µ are σ-finite (Examples 3.9 and 3.10).¹⁰ Part (3) of Theorem 3.16 is a special case of the Daniell-Kolmogorov Theorem studied in Problem 11. Here are some examples applying the Extension Theorem.
Example 3.11. (Using Existence:) By the existence part of Theorem 3.16 we know there's a measure on, say, the Borel sets, namely Lebesgue measure m, that agrees with the usual notion of volume on left-half open boxes. Since any sort of box is a Borel

10 Technically speaking, the Extension Theorem deals with measures with ranges in [0, ∞] rather than [0, 1], so you should think about why Part (3) of Theorem 3.16 holds.
set, we can therefore determine the Lebesgue measure of any sort of box using the properties of measures. Of course, from Example 3.4 in Section 3.4 and Problem 6 back in Exercises 3.4, we already know that the Lebesgue measure of any sort of box is its usual volume. However, we can easily verify this now using properties of measures. For example, given an open box (a, b) with a = (a₁, ..., a_n), b = (b₁, ..., b_n) ∈ Rⁿ and a_i < b_i for each i, we can write

(a, b) = ⋃_{k=1}^∞ (a, b − 1/k],

where (a, b − 1/k] := (a₁, b₁ − 1/k] × ··· × (a_n, b_n − 1/k] and where the union is nondecreasing. Since measures are continuous (see Theorem 3.7) we conclude that

m(a, b) = lim_{k→∞} m(a, b − 1/k] = lim_{k→∞} (b₁ − 1/k − a₁) ··· (b_n − 1/k − a_n) = (b₁ − a₁) ··· (b_n − a_n),
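The limit computed in Example 3.11 can be checked numerically: the volumes of the boxes (a, b − 1/k] increase to the volume of (a, b). A sketch with an arbitrarily chosen box in R³ (the specific a and b are my own):

```python
import math

def vol(a, b):
    # Volume of the box (a, b] (or (a, b)) in R^n
    return math.prod(bi - ai for ai, bi in zip(a, b))

a, b = (0.0, -1.0, 2.0), (1.0, 1.0, 2.5)   # an open box in R^3
# m(a, b - 1/k] increases to m(a, b) as k -> infinity:
vals = [vol(a, tuple(bi - 1.0 / k for bi in b)) for k in (10, 100, 1000, 10**6)]
assert all(v1 <= v2 for v1, v2 in zip(vals, vals[1:]))   # nondecreasing
assert math.isclose(vals[-1], vol(a, b), rel_tol=1e-4)   # converges to m(a, b)
print(vals, vol(a, b))
```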
just what we expected.
Example 3.12. (Using Uniqueness:) Let µ : B^n → [0, ∞] be a measure on the Borel sets. Suppose that on left-half open boxes, µ dilates their measures by a fixed constant; that is, there is a constant α > 0 such that µ(I) = α m(I) for all I ∈ I^n. Then we claim that µ = α m on all Borel sets! Indeed, (1/α)µ is a measure on B^n and it agrees with m on I^n. Therefore by the uniqueness part of Theorem 3.16, we must have (1/α)µ = m on all of B^n.
◮ Exercises 3.5.
1. In the following examples of measures µ : I → [0, ∞], determine the outer measure µ∗, find Mµ∗, and show that µ does not have a unique extension to S(I).
(a) Let X be a set consisting of more than one element and let B ⊆ X be a proper nonempty subset of X. Let I = {∅, B} and define µ : I → [0, ∞] by µ(∅) = 0 and µ(B) = ∞.
(b) Let I = I¹ and define µ : I → [0, ∞] by µ(I) = ∞ if I ≠ ∅ and µ(∅) = 0.
2. Let f : R → R be nondecreasing and right continuous, so that µ_f : I¹ → [0, ∞) is a measure and hence µ_f : M_f → [0, ∞] is a measure, where M_f is the set of µ_f-measurable sets. Here, we drop the superscript ∗ from µ∗_f. Recalling the properties of measures found in e.g. Theorem 3.7, consider the following.
(a) For a ∈ R, show that µ_f{a} = f(a) − f(a−), where f(a−) is the left-hand limit of f at a. In particular, f is continuous at a if and only if µ_f{a} = 0.
(b) For a ∈ R, show that µ_f(a, ∞) = f(∞) − f(a), where f(∞) := lim_{x→∞} f(x).
(c) Using (a) and (b), for any a, b ∈ R, derive formulas for µ_f[a, b], µ_f[a, b), µ_f(a, b), µ_f[a, ∞), µ_f(−∞, a), µ_f(−∞, a], and µ_f(R). In particular, show that for f(x) = x, which gives the standard Lebesgue measure, the measures of these sets are what they intuitively should be.
(d) Assume now that f is strictly increasing, that is, f(x) < f(y) if x < y. Prove that µ∗_f(A) > 0 for any subset A ⊆ R that has an interior point.
3. Let µ : B → [0, ∞] be a Borel measure, which means µ is a measure on the Borel sets B of R.
Assume that µ(K) < ∞ for all compact sets K ⊆ R.
(i) Using Problem 4 in Exercises 1.6, prove that there exists a nondecreasing right-continuous function f : R → R such that µ equals the restriction of the Lebesgue-Stieltjes outer measure µ∗_f to the Borel sets B. Thus, Borel measures that are finite on compacta are just Lebesgue-Stieltjes measures.
(ii) Suppose that µ is a finite Borel measure, which means that µ(R) < ∞. Define f : R → R
by f (x) := µ(−∞, x] for all x ∈ R.
Show that f is nondecreasing and right-continuous and the measure µ equals the restriction of µ∗f to B. 4. (Carath´ eodory’s separation theorem) Given sets A, B ⊆ Rn , the distance between them is defined by d(A, B) := inf{|x − y| ; x ∈ A & y ∈ B}, where |x − y| denotes the standard Euclidean distance between x and y. We shall prove Carath´eodory’s 1914 result: If A, B ⊆ Rn and d(A, B) > 0, then m∗ (A ∪ B) = m∗ (A) + m∗ (B).
Thus, Lebesgue outer measure is additive on sets that are separated by a positive distance. (Thus, if A, B ⊆ Rⁿ are disjoint, in order for m∗(A ∪ B) ≠ m∗(A) + m∗(B), the sets A and B have to satisfy d(A, B) = 0.)
(i) Let E ⊆ Rⁿ and define f : Rⁿ → R by f(x) := d({x}, E), the distance between the point x and the set E. Prove that f is a continuous function.
(ii) Let A, B ⊆ Rⁿ with d(A, B) =: r > 0 and let C = {x ∈ Rⁿ ; d({x}, B) ≥ r/2}. Prove that C is a closed set such that A ⊆ C and C ∩ B = ∅.
(iii) Prove that m∗(A ∪ B) = m∗(A) + m∗(B).
5. Prove Proposition 3.12. Precisely, let S be a σ-algebra of subsets of X such that MΦ ⊆ S and Φ defines a measure on S. To prove that S ⊆ MΦ, let A ∈ S. Show that A ∈ MΦ, which means given E ⊆ X, we have
Suggestion: Apply the regularity assumption to E; that is, there is a set B ∈ MΦ such that E ⊆ B and Φ(E) = Φ(B). 6. In this problem we prove the regularity part of Theorem 3.13. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X and let A ⊆ X. We shall prove that there is a B ∈ S (I ) such that A ⊆ B and µ∗ (A) = µ∗ (B). This proves that µ∗ is regular since S (I ) ⊆ Mµ∗ by Lemma 3.14. (i) If µ∗ (A) = ∞, prove that such a B exists. (ii) Assume now that µ∗ (A) < ∞. Given ε > 0, show that there is a set B ∈ S (I ) such that A ⊆ B and (3.25)
µ∗ (A) ≤ µ∗ (B) ≤ µ∗ (A) + ε.
Suggestion: Use the argument to get (3.23) (with A instead of E). (iii) In particular, for each k = 1, 2, . . ., putting ε = 1/k in (3.25) we can find a Bk ∈ S (I ) such that A ⊆ Bk , and
µ∗(B_k) ≤ µ∗(A) + 1/k.

Let B = ⋂_{k=1}^∞ B_k. Show that B ∈ S(I), A ⊆ B, and µ∗(A) = µ∗(B).
7. We now complete the proof of Theorem 3.13. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. Prove that if S is a σ-algebra with I ⊆ S, and if µ∗ : S → [0, ∞] is a measure, then S ⊆ Mµ∗. Suggestion: Given A ∈ S, you must show that A ∈ Mµ∗, which means for all E ⊆ X,
µ∗ (E ∩ A) + µ∗ (E ∩ Ac ) ≤ µ∗ (E).
Given E, use regularity: There is a B ∈ S (I ) such that E ⊆ B and µ∗ (E) = µ∗ (B). 8. (Conclusion of the Extension Theorem) Let µ : I → [0, ∞] be a σ-finite measure on a semiring I and let T be a σ-algebra such that I ⊆ T ⊆ Mµ∗ . Prove that µ∗ is the only measure on T that extends µ on I . Suggestion: You just have to prove the claim given in the proof of Theorem 3.15 but when T ⊆ Mµ∗ is a general σ-algebra. To prove the claim, use the fact that µ∗ is the unique measure on S (I ) that extends µ and use Part (1) of Theorem 3.13.
9. Let ν : B → [0, ∞] be a measure on the Borel sets of R. Suppose that ν(I) < ∞ for all I ∈ I 1 and ν is translation invariant on I 1 , which means ν(x + I) = ν(I) for all x ∈ R and I ∈ I 1 (where recall that x + I := {x + y ; y ∈ I}). Using some facts stated in Problem 8 of Exercises 1.6, prove that ν(A) = α m(A) for all A ∈ B, where α = ν(0, 1] and m denotes Lebesgue measure. (See Problem 7 in Exercises 4.4 for a higher dimensional generalization of this problem.) 10. Let Φ be a regular outer measure on P(X) and assume that Φ(X) < ∞. (a) Prove that a set A ⊆ X is measurable if and only if Φ(X) = Φ(A) + Φ(Ac ).
Suggestion: To prove the "if" part, let B be measurable such that A ⊆ B and Φ(A) = Φ(B). Prove that Φ(B \ A) = 0 and conclude that A is measurable.
(b) The inner measure of a set A ⊆ X is by definition Φ∗(A) = Φ(X) − Φ(A^c). Show that A is measurable if and only if Φ(A) = Φ∗(A).
(c) Show that the "if" statement is false in Part (a) when Φ is not regular; that is, give an example of a non-regular outer measure Φ and a set A ⊆ X such that Φ(X) = Φ(A) + Φ(A^c), but A is not measurable (that is, A ∉ MΦ).
(d) Consider the set function µ : C → [0, ∞], where µ = Φ and C = MΦ. If Φ is regular, prove that the outer measure generated by µ is exactly Φ. (To prove this you do not need the assumption that Φ(X) < ∞.) Roughly speaking, the outer measure generated by a regular outer measure is the same outer measure we started with. If Φ is not regular, this is not true:
(e) Give an example of a non-regular outer measure Φ such that µ∗ ≠ Φ.
11. (Kolmogorov's "Fundamental Theorem") In this problem we prove a baby version of what's now called the (Daniell-)Kolmogorov extension theorem, which Kolmogorov named the "Fundamental Theorem" in his 1933 book [151], another version of which was published by Percy John Daniell (1889–1946) in 1919 [64].
(i) Given probability measures µ_i : I_i → [0, 1], i ∈ N, where I_i is a semiring on a sample space X_i, let µ : S(C) → [0, 1] be the infinite product measure on the σ-algebra generated by the cylinder subsets of ∏_{i=1}^∞ X_i. For n ∈ N and A ∈ I₁ × ··· × I_n, define ν_n : I₁ × ··· × I_n → [0, 1] by

(DK)  ν_n(A) := µ(A × X_{n+1} × X_{n+2} × ···),

where A × X_{n+1} × X_{n+2} × ··· = {(x₁, x₂, ...) ∈ ∏_{i=1}^∞ X_i ; (x₁, ..., x_n) ∈ A}. Prove that ν_n is a measure and for any A ∈ I₁ × ··· × I_n, prove the following consistency condition: ν_n(A) = ν_{n+1}(A × X_{n+1}).
(ii) We now prove the converse: Suppose that for each n ∈ N we are given a probability measure ν_n : I₁ × ··· × I_n → [0, 1] satisfying the above consistency condition. Prove there exists a unique probability measure µ : S(C) → [0, 1] satisfying Equation (DK) above for all n ∈ N and A ∈ I₁ × ··· × I_n.
12. (Nonmeasurable sets) Let Φ : P(X) → [0, ∞] be an outer measure (regular or not) and suppose that MΦ ≠ P(X). Then there is a nonmeasurable set A ⊆ X; that is, there is a set A ⊆ X such that A ∉ MΦ. Show that if Φ(X) = ∞, then S := {∅, A, A^c, X} is a σ-algebra on which Φ defines a measure, but S ⊈ MΦ.
13. (More on asymptotic density) Let I be the collection of all subsets of N of the form A(a, r) = aN + r = {an + r ; n ∈ N}, where a ∈ N, r = 0, 1, ..., a − 1. Define D : I → [0, 1] by D(A(a, r)) = 1/a; from Problem 8 in Exercises 3.2 we know that D is finitely additive. Prove that D is not countably additive. Suggestion: If it were, then it would have an extension to a measure D : S(I) → [0, 1]. Show that singletons belong to S(I) and D{n} = 0 for all n ∈ N.
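The suggestion for Problem 13 can be made concrete: empirically, #(A(a, r) ∩ [1, N])/N tends to 1/a, while every singleton has density 0, so countable additivity would force D(N) = 0 rather than 1. A sketch (the cutoffs and progressions below are my own choices):

```python
# Asymptotic density of the arithmetic progression A(a, r) = {an + r : n in N}:
# empirically #(A(a, r) ∩ [1, N]) / N approaches 1/a, while each singleton
# has density 0 -- which is why D cannot be countably additive, since N is a
# countable union of singletons but D(N) = 1.
def density_upto(a, r, N):
    count = sum(1 for m in range(1, N + 1) if m % a == r % a)
    return count / N

for a, r in [(2, 0), (3, 1), (10, 7)]:
    d = density_upto(a, r, 10**5)
    assert abs(d - 1.0 / a) < 1e-3, (a, r, d)
print("empirical densities match 1/a")
```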
14. (Liouville numbers) A real number ξ is called a Liouville number, named after Joseph Liouville (1809–1882), if ξ is irrational¹¹ and it has the property that for each k ∈ N there is a rational number p/q ≠ ξ such that

|ξ − p/q| < 1/q^k.

Liouville numbers are important in number theory; for instance, Liouville numbers are transcendental, which means they are not roots of any polynomial with integer coefficients. Prove that the set of all Liouville numbers has measure zero. Suggestion: Let L be the set of Liouville numbers, let r₁, r₂, ... be a list of all rationals, where we write r_n = p_n/q_n in lowest terms, and notice that

L = Q^c ∩ ⋂_{k=1}^∞ ⋃_{n=1}^∞ I_{kn},  where I_{kn} = (r_n − 1/q_n^k, r_n + 1/q_n^k).

Now let ℓ ∈ N, and show that for any k ∈ N,

m(L ∩ [−ℓ, ℓ]) ≤ Σ_{n=1}^∞ m(I_{kn} ∩ [−ℓ, ℓ]) ≤ Σ_{q=1}^∞ Σ_{p : −ℓ ≤ p/q ≤ ℓ} m(p/q − 1/q^k, p/q + 1/q^k),

where the sum Σ_{p : −ℓ ≤ p/q ≤ ℓ} means to sum only over those p with −ℓ ≤ p/q ≤ ℓ. Show
that the right-hand series → 0 as k → ∞ and conclude that m(L ∩ [−ℓ, ℓ]) = 0. Finally, show that m(L) = 0.
15. (Measures via discrete approximations) Let A = ⋃_{n=1}^∞ I_n where I₁, I₂, ... are pairwise disjoint intervals in [0, 1] and suppose that [0, 1] \ A is also a countable union of pairwise disjoint intervals. Let P_n = {1/n, 2/n, ..., (n−1)/n, 1}. In this problem we prove that

m(A) = lim_{n→∞} #(A ∩ P_n)/n.

Note that #(A ∩ P_n)/n = [number of elements of A ∩ P_n]/n can be thought of as some type of "density" of points of A of the form 1/n, 2/n, .... This limit shows that these densities approach the measure of A.
(i) If I ⊆ [0, 1] is an interval, prove that nm(I) − 1 ≤ #(I ∩ P_n) ≤ nm(I) + 1.
(ii) Let ε > 0 and let [0, 1] \ A = ⋃_{n=1}^∞ J_n where the J_n's are pairwise disjoint intervals (so that [0, 1] = ⋃_{n=1}^∞ I_n ∪ ⋃_{n=1}^∞ J_n). Prove that there is an N such that m(A \ I_N) ≤ ε and m(([0, 1] \ A) \ J_N) ≤ ε, where I_N = ⋃_{n=1}^N I_n and J_N = ⋃_{n=1}^N J_n.
(iii) Prove that
(1) nm(K_N) − N ≤ #(K_N ∩ P_n) ≤ nm(K_N) + N,
(2) #(I_N ∩ P_n) ≤ #(A ∩ P_n) ≤ n − #(J_N ∩ P_n),
where in the first line, K = I or J.
(iv) From the inequalities in (iii), prove the desired result.
(v) The assumption that [0, 1] \ A can be written as a union of intervals is important. Find pairwise disjoint intervals I₁, I₂, ... such that m(A) ≠ lim_{n→∞} #(A ∩ P_n)/n.
Suggestion: If you’re having trouble finding such intervals, consider intervals such as found in Problem 3 of Exercises 3.2.
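The limit in Problem 15 is easy to observe numerically for a concrete A. A sketch, where the particular union of intervals is my own choice (using exact rational arithmetic to avoid endpoint issues):

```python
from fractions import Fraction

# A is a finite union of left-half open intervals in [0, 1] (my own choice).
intervals = [(Fraction(0), Fraction(1, 2)), (Fraction(3, 4), Fraction(7, 8))]
measure = float(sum(hi - lo for lo, hi in intervals))   # m(A) = 1/2 + 1/8

def in_A(x):
    return any(lo < x <= hi for lo, hi in intervals)

def density(n):
    # #(A ∩ P_n)/n, where P_n = {1/n, 2/n, ..., (n-1)/n, 1}
    return sum(1 for k in range(1, n + 1) if in_A(Fraction(k, n))) / n

for n in (10, 100, 10000):
    print(n, density(n))   # approaches m(A) = 0.625
assert abs(density(10000) - measure) < 1e-3
```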
11 We don’t have to make the assumption that ξ is irrational since one can prove that ξ must be irrational in order to satisfy the inequality property.
Notes and references on Chapter 3 §3.1 : A popular way to define a Lebesgue measurable set is using both outer and inner measures. If A ⊆ (a, b), Lebesgue defined the inner measure of A as m∗ (A) = b − a − m∗ ((a, b) \ A).
In other words, we take the complement of A in (a, b) and look at its outer measure, then subtract this from the b − a, the measure of (a, b). One then defines a set A ⊆ (a, b) to be measurable if m∗ (A) = m∗ (A). The downside with this definition is that it requires A to be a subset of set of finite measure (in this case, (a, b)), while Carath´eodory’s definition does not require finiteness. In fact, here’s what Carath´eodory had to say about his definition [79, p. 72]: The 1. 2. 3.
new definition has great advantages: It can be used for the linear measure. It holds for the Lebesgue case, even if m∗ A = ∞. The proofs of the principal theorems of the theory are incomparably simpler and shorter than before. 4. The main advantage, however, is that the new definition is independent of the concept of inner measure.
§3.2, 3.3 : The book [29] is devoted to the theory of finitely additive set functions, which are called charges. In this book, Salomon Bochner (1899–1982) is quoted as having remarked that finitely additive measures are more interesting, more difficult to handle, and perhaps more important than countably additive ones. §3.4 : Hermann Hankel (1839–1873), who we discussed at the end of Section 4.5, was the first person to grasp the idea of outer content, the precursor to outer measure where finite covers (instead of countably infinite covers) are used to measure the size of sets. In 1870 Hankel proved [115] that for a bounded function f on a closed interval, (1) Riemann integrability of f is equivalent to (2) for every σ > 0, the set of points Sσ where f has jumps > σ has zero content. Hankel then went on to equate measure-theoretic smallness (zero content) with topological smallest (nowhere dense) by “proving” that (2) is equivalent to the set Sσ being nowhere dense for every σ > 0. This statement is false; a counterexample is the characteristic function of a nowhere dense set of positive measure such as a fat Cantor set studied in Section 4.5. Although Hankel confused measure-theoretic smallness and topological smallest (many other did so as well), it can be said that Hankel initiated the measure theoretic approach to integration [108, p. 167]. §3.5 : We know from the Regularity Theorem that if µ : I → [0, ∞] is a finitely, but not countably additive, set function on a semiring I , then µ∗ : S (I ) → [0, ∞] does not extend µ. A natural question is: Is there a finitely additive set function ν : S (I ) → [0, ∞] that does extend µ. The answer is “yes” there always exists an extension [29, p. 78]. Unfortunately, the extension is not generally unique and the proofs that such extensions exist are nonconstructive. 
Here's a quote from [32]: One serious loss is the constant interplay of finitely additive measures with the axiom of choice; this renders the whole subject unreal. Here is an example. Take the basic space as the positive integers: Ω = {1, 2, 3, 4, . . .}, take B to be the set of integers that have first digit 1 : B = {1, 10, 11, . . . , 19, 100, 101, . . .} and suppose we begin an approach to "picking an integer at random" by assigning the number-theoretic natural density. Thus even numbers are assigned probability 1/2, square-free numbers are assigned probability 6/π², the primes are assigned probability 0, and so on. A standard result says that there are "Banach limits": finitely additive probabilities P which are invariant, extend density, and are defined for all
subsets of integers. The existence of such P is very roughly equivalent to the axiom of choice. The question now is, what is P (B)? There can be no answer; P (B) can be assigned any value in [1/9, 5/9] and then extended. Thus, the existence of Banach limits is no real help. It gives the illusion of a concrete useful construction with little content.
CHAPTER 4
Reactions to the Extension & Regularity Theorems

Have we got a treat for you in this chapter! Now that we know Lebesgue measure and infinite product probability measures extend from their respective semirings to appropriate σ-algebras, we can study a lot of things we couldn't before. This chapter is devoted to such studies.

4.1. Probability, Bernoulli sequences and Borel-Cantelli

This section is devoted to answering probability questions involving Bernoulli sequences. Let S = {0, 1} and let µ0 : P(S) → [0, 1] be a probability measure; let us say for some 0 < p < 1,

µ0 {1} = p, µ0 {0} = 1 − p, µ0 {0, 1} = 1, and µ0 (∅) = 0.
By the Extension Theorem (see Theorem 3.16) there is a unique probability measure µ : S (C ) → [0, 1]
such that on a cylinder set A1 × A2 × · · · × An × S × S × · · · of S ∞ , we have µ(A1 × A2 × · · · × An × S × S × · · · ) = µ0 (A1 ) · µ0 (A2 ) · · · µ0 (An ).
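The product formula for cylinder sets is easy to check numerically. Here is a minimal sketch (the function names are ours, not the book's), with µ0 determined by p as above:

```python
from functools import reduce

def mu0(A, p):
    """One-toss measure on S = {0, 1}: mu0({1}) = p, mu0({0}) = 1 - p."""
    return sum(p if s == 1 else 1 - p for s in A)

def cylinder_measure(constrained_slots, p):
    """mu(A1 x A2 x ... x An x S x S x ...) = mu0(A1) mu0(A2) ... mu0(An)."""
    return reduce(lambda acc, A: acc * mu0(A, p), constrained_slots, 1.0)

# The cylinder {1} x {0} x {0,1} x S x S x ... has measure p(1 - p) * 1,
# here 0.3 * 0.7 = 0.21 (up to floating point):
print(cylinder_measure([{1}, {0}, {0, 1}], 0.3))
```

Only the finitely many constrained slots matter; the trailing S factors each contribute µ0(S) = 1.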
In the following subsections we give various applications of this set-up to gambling and to monkeys and Shakespeare.

4.1.1. Gambler's ruin and the foolishness of gambling.
[Figure: The Cardsharps, a painting by Gerard van Honthorst (1590–1656), Museum Wiesbaden; from Wikimedia Commons.]

Consider a gambler with an initial capital of $i who walks into a casino. He sits down at a table and is determined that he will play the game over and over again until he either wins everything (all the money of the house) or loses everything. His probability of winning a game is p and of losing is q = 1 − p. If he wins he gets $1 and if he loses he gives the house $1. Let t be the total amount of money involved; the gambler's initial $i plus the casino's money. Our question is: What is the probability of the gambler's ruin? We can model this situation using the sample space S ∞ , where S = {0, 1} with 1 representing a win and 0 a loss. For each n = 1, 2, 3, . . ., define Wn : S ∞ → R as the
net amount the gambler has won after n rounds of play. Thus, if x = (x1 , x2 , . . .) ∈ S ∞ , then

Wn (x) = (#1's amongst x1 , . . . , xn ) − (#0's amongst x1 , . . . , xn ).
Notice that i + Wn represents the total amount of money the gambler has after n plays. It follows that

⋂_{k=1}^{n−1} {0 < i + Wk < t}

is the event that the gambler neither goes broke nor wins everything during the first n − 1 plays, and that {i + Wn = 0} is the event that he has no money on the n-th play. Thus,

Ai,n = {i + Wn = 0} ∩ ⋂_{k=1}^{n−1} {0 < i + Wk < t}
is the event that the gambler, with an initial capital of $i, goes broke on exactly the n-th play. We leave it for you to check that each Ai,n ∈ R(C ). Hence,

⋃_{n=1}^∞ Ai,n

belongs to the σ-algebra S (C ) and is the event that the gambler goes broke on some play; that is, the event that the gambler is eventually ruined. Since the sets Ai,n are pairwise disjoint in n, it follows that

Pi := µ(⋃_{n=1}^∞ Ai,n ) = Σ_{n=1}^∞ µ(Ai,n )
is the probability that the gambler is ruined, where recall that i represents his initial capital. We shall put P0 = 1 because if he starts with no capital, he's already ruined so his chance of ruin is 1, while we put Pt = 0 because if he starts with all the money he will not play, so he has no chance of being ruined. We now derive an equation for Pi for any i with 0 ≤ i ≤ t, in terms of Pi+1 and Pi−1 . The intuitive idea is that either the gambler wins the first round (in which case he now has a capital of $(i + 1)) or he loses the first round (in which case he now has a capital of $(i − 1)). Since these are mutually exclusive events, the probability of ruin for the gambler should be the probability that he wins the first round and then is ruined plus the probability that he loses the first round and then is ruined. Since he has a probability p of winning a round, the probability that he wins the first round and then is ruined should be p · Pi+1 ,
where Pi+1 denotes the probability of ruin starting with a capital of $(i + 1). Similarly, the probability of him losing the first round and then being ruined should be q · Pi−1 where q = 1 − p. Hence, the following equation should hold:

(4.1)    Pi = p Pi+1 + q Pi−1 ,    P0 = 1 , Pt = 0.
In Problem 8 you will provide a precise proof of this intuitive statement. Now, Equation (4.1) is an example of a difference equation, and there is an extensive literature on how to solve such equations; see e.g. [106] or [83]. Thus, we just have to solve (4.1) and we’re done! Of course, Pascal didn’t have a developed theory of
difference equations to turn to, so he had to solve (4.1) from scratch. According to Edwards [81], he probably went about it this way:

Pi = p Pi+1 + q Pi−1  =⇒  (p + q)Pi = p Pi+1 + q Pi−1    (since p + q = 1)
                      =⇒  p(Pi+1 − Pi ) = q(Pi − Pi−1 )
                      =⇒  Pi+1 − Pi = ρ(Pi − Pi−1 ),
where ρ = q/p. Setting i = 1 in the last equation above, we obtain P2 − P1 = ρ(P1 − P0 ) = ρ(P1 − 1). Next, replacing i with 2, we obtain P3 − P2 = ρ(P2 − P1 ) = ρ(ρ(P1 − 1)) = ρ^2 (P1 − 1).
Continuing, we see the pattern

P2 − P1 = ρ(P1 − 1)
P3 − P2 = ρ^2 (P1 − 1)
P4 − P3 = ρ^3 (P1 − 1)
    ⋮
Pi − Pi−1 = ρ^{i−1} (P1 − 1).

Summing the left and right-hand columns, noticing that we get a telescoping sum for the left-hand column, we obtain

Pi − P1 = (ρ + ρ^2 + ρ^3 + · · · + ρ^{i−1})(P1 − 1).

If ρ = 1, we see that
Pi − P1 = (i − 1)(P1 − 1),
and for ρ ≠ 1, we have ρ + ρ^2 + ρ^3 + · · · + ρ^{i−1} = (ρ − ρ^i )/(1 − ρ) (using the sum of a geometric progression), so in case ρ ≠ 1,

Pi − P1 = ((ρ − ρ^i )/(1 − ρ))(P1 − 1).

Setting i = t and using that Pt = 0, we can find P1 from these equations (which is a good review of basic algebra), and we finally arrive at the answer:

Gambler's ruin theorem

Theorem 4.1. The probability of ruin for a gambler starting with an initial capital of $i is

Pi = ((q/p)^i − (q/p)^t ) / (1 − (q/p)^t )   if p ≠ q,
Pi = 1 − i/t                                 if p = q,

where t is the total money involved (gambler + his foe).

Example 4.1. (Roulette) The American roulette wheel consists of 38 slots, two of which are green and numbered 0, 00; eighteen are red and eighteen are black, numbered 1 through 36 (in a mixed-up order):
A ball is placed in the wheel and the wheel is spun; the object is to predict where the ball will land when the wheel stops. There are many bets you can make; e.g. that the ball will end up on a certain number, or on a certain combination of numbers, or whether the number will be red or black, even or odd. Let's say that you save up your measly graduate student monthly stipend, $1000 (which was about my stipend!), and you go to the local casino to play a game of roulette. Your favorite color is red, so you always bet that the ball will land on a red slot. In particular, your odds of winning are 18/38 = 9/19. Let's say that this casino is small and only has $9000. What is the probability of your ruin? In this case,

p = 9/19  =⇒  q/p = (1 − p)/p = 10/9.

Thus, the probability of your ruin is

P1000 = ((10/9)^1000 − (10/9)^10000 ) / (1 − (10/9)^10000 ).
Using Maple™ (say), we find that P1000 = 0.9999999999999 . . . 999999999984744, where there are a total of 410 digits of 9 after the decimal point! In other words, essentially with 100% certainty, you will lose absolutely EVERYTHING!

Example 4.2. With the same situation as above, let's say that you played a fair game (of course, there is no such thing as a fair game at a casino). Now what is the probability of your ruin? In this case, for any initial capital $i that you have, and capital $j your foe has, your chance of ruin is

Pi = 1 − i/(i + j) = j/(i + j).

For example, if i = 1000 and j = 9000, we get

P1000 = 9000/10000 = 0.90;
in other words, with 90% certainty, you will lose EVERYTHING!
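To make Theorem 4.1 concrete, here is a short sketch (our own code, not from the text) that checks the closed form against an exact rational solution of the difference equation (4.1), carrying the unknown P1 symbolically and fixing it with the boundary condition Pt = 0:

```python
from fractions import Fraction

def ruin_closed_form(i, t, p):
    """Theorem 4.1: probability of ruin with initial capital i, total money t."""
    p = Fraction(p)
    q = 1 - p
    if p == q:
        return 1 - Fraction(i, t)
    r = q / p
    return (r**i - r**t) / (1 - r**t)

def ruin_by_recursion(i, t, p):
    """Solve P_k = p P_{k+1} + q P_{k-1} exactly, i.e. iterate
    P_{k+1} = (P_k - q P_{k-1}) / p, representing each P_k as a + b*P_1."""
    p = Fraction(p)
    q = 1 - p
    P = [(Fraction(1), Fraction(0)), (Fraction(0), Fraction(1))]  # P_0, P_1
    for _ in range(1, t):
        (a0, b0), (a1, b1) = P[-2], P[-1]
        P.append(((a1 - q * a0) / p, (b1 - q * b0) / p))
    a, b = P[t]              # boundary condition: P_t = a + b*P_1 = 0
    P1 = -a / b
    ak, bk = P[i]
    return ak + bk * P1

# The two agree exactly, e.g. for p = 9/19 and t = 10:
for i in range(11):
    assert ruin_closed_form(i, 10, Fraction(9, 19)) == ruin_by_recursion(i, 10, Fraction(9, 19))
```

Exact `Fraction` arithmetic sidesteps the overflow one would hit evaluating (10/9)^10000 in floating point for the roulette example.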
Hopefully these examples have served the same purpose as stated in the preface of Richard Anthony Proctor’s (1837–1888) classic book Chance and luck, where he says [230]: If a few shall be taught, by what I have explained here, to see that in the long run even fair wagering and gambling must lead to loss, while gambling and wagering scarcely ever are fair, in the sense of being on even terms, this book will have served a useful purpose.
4.1.2. The Borel-Cantelli Lemmas. Named after Félix Édouard Justin Émile Borel (1871–1956) and Francesco Paolo Cantelli (1875–1966), the Borel-Cantelli Lemmas give simple conditions under which the probability of an event occurring
"infinitely often" is either 0 or 1. Recall (see Proposition 1.1 in Section 1.2) that given sets A1 , A2 , . . . of a set X,

{An ; i.o.} := The event that an outcome occurs in infinitely many different An 's
            = {x ∈ X ; x belongs to infinitely many An 's}
            = ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak .
Here's the first Borel-Cantelli Lemma, which was Borel's "Problem III" in his famous 1909 paper [36].

The First Borel-Cantelli Lemma

Theorem 4.2. Let S be a σ-algebra, let µ : S → [0, ∞] be a measure on S , let A1 , A2 , . . . ∈ S , and put A = {An ; i.o.}. Then

Σ_{n=1}^∞ µ(An ) < ∞  =⇒  µ(A) = 0.
Proof : We know that

A = ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak = ⋂_{n=1}^∞ Bn ,

where Bn = ⋃_{k=n}^∞ Ak . In particular, since A = B1 ∩ B2 ∩ B3 ∩ · · · , we have A ⊆ Bn for any n, so

(4.2)    0 ≤ µ(A) ≤ µ(Bn )  for any n.
Now by definition of Bn and using countable subadditivity, we know that

µ(Bn ) ≤ Σ_{k=n}^∞ µ(Ak ).
By assumption, Σ_{k=1}^∞ µ(Ak ) < ∞, so it follows that µ(Bn ) can be made arbitrarily small by taking n large. In view of (4.2), we must have µ(A) = 0.

Example 4.3. (Run lengths) Consider the sample space S ∞ , where S = {0, 1}, for an infinite sequence of fair coin tosses — that is, p = 1/2 — and let µ denote the infinite product measure on S ∞ . For each n ∈ N, define the "run length function" ℓn : S ∞ → [0, ∞] as follows: Given a sequence of coin tosses x = (x1 , x2 , x3 , . . .) ∈ S ∞ , put

ℓn (x) := the number of consecutive tosses of heads starting from the n-th toss,

which equals the length of the run of heads from the n-th toss. Given a sequence k1 , k2 , k3 , . . . of natural numbers, what's the probability that you toss a coin in such a way that for infinitely many n's the run of heads starting at the n-th toss is at least kn ? I don't know of a general answer to this question, but using the first Borel-Cantelli Lemma we can give an answer when the probability is zero. Define An = {x ∈ S ∞ ; ℓn (x) ≥ kn }, which is the event that the run of heads from the n-th toss is at least kn . By definition of ℓn (x), we see that

An = {x ∈ S ∞ ; xn = 1, xn+1 = 1, . . . , xn+kn −1 = 1},
which is a cylinder set, and moreover, µ(An ) = 1/2^{kn} . Therefore, by the First Borel-Cantelli Lemma,

Σ_{n=1}^∞ 1/2^{kn} < ∞  =⇒  µ{An ; i.o.} = 0.

For example, since Σ_{n=1}^∞ 1/2^n < ∞, with zero probability you can toss a coin such that for infinitely many n's the run of heads starting at the n-th toss is at least n.
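The bound used in the proof of Theorem 4.2 is quantitative: µ{An ; i.o.} ≤ µ(Bn ) ≤ Σ_{k=n}^∞ µ(Ak ), and for the run-length example with kn = n the tail is the geometric sum 2^{1−n}. A tiny numerical sketch (the function name is ours):

```python
def tail_bound(n, terms=200):
    """Partial sum of sum_{k >= n} 2^{-k}, an upper bound on mu(B_n) for the
    run-length example with k_n = n; the exact geometric tail is 2^(1-n)."""
    return sum(2.0 ** -k for k in range(n, n + terms))

# mu({A_n ; i.o.}) <= mu(B_n) for every n, and the bound tends to 0:
for n in (1, 5, 10, 20):
    print(n, tail_bound(n))
```

Since the bound holds for every n and tends to 0, the measure of the "infinitely often" event must be 0 — which is exactly the argument in the proof.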
The Second Borel–Cantelli Lemma deals with independent events. Intuitively speaking, two events are independent if the occurrence of one event doesn’t influence the occurrence of the other event. For example, the event that a randomly chosen student in your class is male (call this event A) and the event that it rains today (call it B) are independent. If P (A) is the probability that a randomly chosen student in your class is male and P (B) is the probability that it rains today, then what is the probability of A ∩ B? Intuitively speaking, it should be (4.3)
P (A ∩ B) = P (A) · P (B),
because if we think of P (A) as the fraction of times a male student is chosen and P (B) as the fraction of times it rains, then it would make sense that the product P (A) · P (B) is the fraction of times a male student is chosen and it rains. However, the event that a randomly chosen student in your class is male and the event that a randomly chosen student is wearing a dress are not independent since gender influences clothing style. Mathematically speaking, we define independence via (4.3) and we generalize it as follows: Given a probability measure µ : S → [0, 1], events A1 , A2 , A3 , . . . ∈ S (a finite or countably infinite collection of events) are called pairwise independent if for any i ≠ j, the events Ai and Aj are independent in the sense defined in (4.3). We say that A1 , A2 , A3 , . . . are independent if the following stronger condition is satisfied: For any finite subcollection Ai , Aj , . . . , Ak of A1 , A2 , A3 , . . ., we have

µ(Ai ∩ Aj ∩ · · · ∩ Ak ) = µ(Ai ) µ(Aj ) · · · µ(Ak ).

Note that independence implies pairwise independence. In Problem 2 you will show that pairwise independence may not imply independence. We now come to the Second Borel-Cantelli Lemma, which is Cantelli's contribution proved in 1917 [47]. The proof is not difficult and so nice that we shall leave it as an exercise (see Problem 6)! The usual statement of the Second Borel-Cantelli Lemma requires the events A1 , A2 , . . . ∈ S to be independent. The relaxed condition of just being pairwise independent was discovered by Paul Erdős (1913–1996) and Alfréd Rényi (1921–1970) in 1959 [84].

The Second Borel-Cantelli Lemma

Theorem 4.3. Let µ : S → [0, 1] be a probability measure on a σ-algebra S , let A1 , A2 , . . . ∈ S be pairwise independent, and put A = {An ; i.o.}. Then

Σ_{n=1}^∞ µ(An ) = ∞  =⇒  µ(A) = 1.
Notice that combining the first and second Borel–Cantelli Lemmas, we see that if A1 , A2 , . . . ∈ S are independent and A = {An ; i.o.}, then either µ(A) = 0 or µ(A) = 1. Indeed, either Σ_{n=1}^∞ µ(An ) converges or diverges. If the sum converges, then µ(A) = 0 by the First Borel–Cantelli Lemma and if it diverges, then µ(A) = 1 by the Second Borel–Cantelli Lemma. That µ(A) = 0 or 1 is an example of a Zero-One Law, of which there are many in probability theory; see Problem 4 for Borel's zero-one law.

Example 4.4. (Monkeys and Shakespeare) We continue our Monkey–Shakespeare drama. Put a monkey in front of a typewriter and see if he can type Shakespeare's sonnet 18 (or any other passage), and give him an infinite number of opportunities to do so. Consider the sample space S ∞ , where S = {0, 1}, and where 1 represents a successful typing of the passage and 0 not typing the passage. Assume that on each try he has the probability p of typing the passage and let µ denote the infinite product measure on S ∞ . What is the probability the monkey will type the passage an infinite number of times? As we saw in Section 2.3.5, with certain assumptions on the keyboard and the monkey's typing speed, the probability is essentially zero that the monkey will type sonnet 18 in any reasonable amount of time. However, it turns out that with probability one the monkey will type sonnet 18 an infinite number of times! To see this, for each n ∈ N, let An be the event that the monkey types sonnet 18 on the n-th page:

An = S × S × · · · × S × {1} × S × S × · · · ,
where the 1 is in the n-th slot. Observe that if i1 < i2 < · · · < ik , then
It follows by the definition of the infinite product measure that
µ(Ai1 ∩ Ai2 ∩ · · · ∩ Aik ) = pk = p · p · · · p = µ(Ai1 ) · µ(Ai2 ) · · · µ(Aik ).
Therefore, A1 , A2 , . . . are independent. Moreover, ∞ ∞ X X µ(An ) = p = ∞, n=1
n=1
so by the Second Borel–Cantelli Lemma, it follows that with probability one, the monkey types sonnet 18 an infinite number of times!
This example really shows the immense gulf between the finite and the infinite: In any reasonable finite amount of time (eg. the hypothetical age of the universe) the monkey will almost certainly not type sonnet 18, but given an infinite amount of time, he will type sonnet 18 an infinite number of times! ◮ Exercises 4.1. 1. Consider the sample space S ∞ where S = {0, 1}. Write down the event A “throwing a head at some point” as a subset of S ∞ . Show that A ∈ S (C ) and A ∈ / R(C ). Finally, answer the question: What is the probability that you throw a head at some point, assuming a constant probability p > 0 of throwing a head on a single toss? 2. We throw a six-sided die infinitely many times. Let Aij be the event that on the i-th and j-th throws we obtain the same number. Show that {Aij } are pairwise independent (that is Aij and Akℓ are independent if Aij 6= Akℓ ) but not independent. 3. Prove that if A1 , A2 , A3 , . . . are independent events, then ! ∞ ∞ \ Y µ An = µ(An ). n=1
n=1
4. (Borel's Zero-One Law) Let p1 , p2 , p3 , . . . ∈ (0, 1), let S = {0, 1} and for each n, let µn : P(S) → [0, 1] be the probability measure assigning pn to {1} and 1 − pn to {0}. Consider the sample space S ∞ with measure µ, the infinite product of the µn 's. In Borel's 1909 paper [36] he showed that if A is the event that an infinite number of successes occurs in an infinite sequence of experiments, then

(1) Σ_n pn < ∞ =⇒ µ(A) = 0   and   (2) Σ_n pn = ∞ =⇒ µ(A) = 1.
Prove this.
5. A sequence of events A1 , A2 , . . . of a probability space (X, S , µ) is called a coin tossing sequence if the events are independent and there is a p ∈ (0, 1) such that µ(Ai ) = p for each i. Show that there doesn't exist a coin tossing sequence if X is countable and {x} ∈ S for all points x ∈ X. Suggestion: Let q = max{p, 1 − p} and let x ∈ X. Show that µ{x} ≤ q^n for each n. Prove the n = 1 case by noting that either x ∈ A1 or x ∈ A1^c . Prove the n = 2 case by noting that x ∈ A1 ∩ A2 , or A1^c ∩ A2 , or A1 ∩ A2^c , or A1^c ∩ A2^c . Continue.
6. (Proof of the Second Borel-Cantelli Lemma) Let µ : S → [0, 1] be a probability measure on a σ-algebra S , let A1 , A2 , . . . ∈ S be pairwise independent, and assume that Σ_{n=1}^∞ µ(An ) = ∞. We'll show that µ{An ; i.o.} = 1. This proof uses a clever inequality due to Kai Lai Chung (1917– ) and Paul Erdős (1913–1996) [60].
(i) Prove that given any integrable simple function f : X → R, we have

(∫ f )^2 ≤ ∫ f^2 .

Suggestion: Define E = ∫ f and consider the integral ∫ (f − E)^2 , which is ≥ 0.
(ii) Here's the clever inequality: Prove that for any B1 , B2 , . . . , Bn ∈ S , we have

(Σ_{k=1}^n µ(Bk ))^2 ≤ µ(⋃_{k=1}^n Bk ) · Σ_{i,j=1}^n µ(Bi ∩ Bj ).

Suggestion: Put B = ⋃_{k=1}^n Bk , β = µ(B), and define µ1 : S → [0, 1] by µ1 (A) = µ(A ∩ B)/β. Show that µ1 : S → [0, 1] is a probability measure, then apply (i) to the simple function f = Σ_{k=1}^n χBk on the measure space (X, S , µ1 ).
(iii) Show that

µ{An ; i.o.} = lim_{n→∞} lim_{m→∞} µ(⋃_{k=n}^m Ak ).
(iv) Using (ii) show that m [ µ Ak ≥ P m k=n
P
k=n
m k=n
µ(Ak )
µ(Ak )
2
+
2
Pm
k=n
= µ(Ak )
1+
P
1 m k=n
µ(Ak )
−1 .
Now prove that µ{An ; i.o.} = 1.
7. (Another proof of the Second Borel-Cantelli Lemma) Let µ : S → [0, 1] be a probability measure on a σ-algebra S , let A1 , A2 , . . . ∈ S be pairwise independent, and assume that Σ_{n=1}^∞ µ(An ) = ∞. We will show that µ{An ; i.o.} = 1.
(i) Put fn = Σ_{k=1}^n χAk . Show that lim_{n→∞} E(fn ) = ∞ and {An ; i.o.} = {lim fn = ∞}. Thus, we just have to show that µ{lim fn = ∞} = 1.
(ii) Define σn := √(E[(fn − E(fn ))^2 ]), the square root of the expectation of (fn − E(fn ))^2 . (σn is called the standard deviation of fn and σn^2 is called the variance of fn .) Let α > 0 and using Chebyshev's inequality prove that

µ{|fn − E(fn )|/σn < α} ≥ 1 − 1/α^2 .
(iii) Prove that given any α > 0,

µ{E(fn ) − σn α < fn } ≥ 1 − 1/α^2 .
(iv) Show that σn^2 = E(fn^2 ) − E(fn )^2 , then show that σn ≤ √(E(fn )). Suggestion: To prove that σn^2 ≤ E(fn ), just find E(fn^2 ) and E(fn )^2 , then subtract them.
(v) Recalling that lim_{n→∞} E(fn ) = ∞ from (i), given α > 0, choose N ∈ N such that for all n ≥ N , we have E(fn ) ≥ 4α^2 . Now using (iii) and (iv), prove that

for all n ≥ N ,  µ{(1/2) E(fn ) < fn } ≥ 1 − 1/α^2 .

(vi) Finally, prove that µ{lim fn = ∞} = 1, which proves the Second Borel-Cantelli Lemma. Suggestion: If you're having trouble showing this, here's one way, not the most elegant of the many ways, to go about it. First show that

{lim fn = ∞} = ⋂_{α∈N} Bα
where Bα = ⋃_{n=1}^∞ {2α^2 < fn }. Show that µ(Bα ) ≥ 1 − 1/α^2 and B1 ⊇ B2 ⊇ B3 ⊇ · · · , then conclude that

µ{lim fn = ∞} = lim_{α→∞} µ(Bα ) ≥ lim_{α→∞} (1 − 1/α^2 ) = 1.

8. In this problem we prove the formula Pi = p Pi+1 + q Pi−1 found in (4.1).
(i) Let Wn : S ∞ → R be the net amount the gambler has won after n rounds of play. Show that Wn = R1 + · · · + Rn , where Ri : S ∞ → R is defined by Ri (x) = 1 if xi = 1 and Ri (x) = −1 if xi = 0.
(ii) For any i, n ∈ N, let

Ei,n = {y ∈ Rn ; i + y1 + · · · + yn = 0} ∩ ⋂_{k=1}^{n−1} {y ∈ Rn ; 0 < i + y1 + · · · + yk < t}.
Show that µ {(R1 , . . . , Rn ) ∈ Ei,n } = µ {(R2 , . . . , Rn+1 ) ∈ Ei,n } . (In fact, µ {(R1 , . . . , Rn ) ∈ Ei,n } equals µ {(Rj1 , . . . , Rjn ) ∈ Ei,n } for any
choice of natural numbers j1 < j2 < · · · < jn .)
(iii) Show that Ai,n = {(R1 , . . . , Rn ) ∈ Ei,n }, where Ai,n is the event that the gambler goes broke on exactly the n-th play, and that

Ai,n = ({R1 = 1} ∩ {(R2 , . . . , Rn ) ∈ Ei+1,n−1 }) ∪ ({R1 = −1} ∩ {(R2 , . . . , Rn ) ∈ Ei−1,n−1 }).

Using this equality, prove that
µ(Ai,n ) = p µ(Ai+1,n−1 ) + q µ(Ai−1,n−1 ).
(iv) Finally, prove that Pi = p Pi+1 + q Pi−1 .
9. Assume that the casino has an unlimited amount of money. Prove that no matter how large the gambler's initial capital is, his probability of ruin is 1.
10. Let S be a finite set and let s1 , s2 , . . . , sk be any pattern where s1 , . . . , sk ∈ S. In this problem we prove that the probability that the pattern s1 , s2 , . . . , sk occurs infinitely often in a randomly chosen sequence in S ∞ is one. Let An ⊆ S ∞ be the event that the pattern s1 , s2 , . . . , sk occurs at position n; that is, An consists of all elements x ∈ S ∞ such that xn = s1 , . . . , xn+k−1 = sk . We shall prove that µ{An ; i.o.} = 1.
(i) Prove that the sets A1 , Ak+1 , A2k+1 , A3k+1 , . . . are independent.
(ii) Apply the Second Borel-Cantelli Lemma to show that µ{Ank+1 ; i.o.} = 1.
(iii) Finally, prove that µ{An ; i.o.} = 1.
11. (Dirichlet's approximation theorem) This problem doesn't use any measure theory, but is given here because it helps to better appreciate the next problem! In this problem we prove that for every irrational number ξ ∈ R there are infinitely many rational numbers p/q in lowest terms such that

|ξ − p/q| < 1/q^2 .

This is Dirichlet's approximation theorem, named after Johann Peter Gustav Lejeune Dirichlet (1805–1859), who made the Pigeonhole principle famous. (Please review the Pigeonhole principle before proceeding.)
(i) For any real number x denote by {x} = x − ⌊x⌋ the fractional part of x, where ⌊x⌋ is the greatest integer ≤ x. Note that {x} ∈ [0, 1). Let n ∈ N and partition [0, 1) into n subintervals

[0, 1/n), [1/n, 2/n), [2/n, 3/n), . . . , [(n−1)/n, 1).
Given an irrational number ξ ∈ R, by considering the n + 1 numbers 0, {ξ}, {2ξ}, . . . , {nξ}, use the Pigeonhole principle to prove that there are two different integers a, b ∈ {0, 1, . . . , n} such that |{aξ} − {bξ}| < 1/n.
(ii) Prove that there are integers p, q with 1 ≤ q ≤ n such that |qξ − p| < 1/n. Conclude that

|ξ − p/q| < 1/q^2 .
(iii) Now prove Dirichlet's approximation theorem.
12. Let α ∈ R with α ≥ 2 and let Aα denote the set of all real numbers x ∈ R such that there are infinitely many rational numbers p/q in lowest terms with

|x − p/q| < 1/q^α .

(a) Prove that Aα is Lebesgue measurable.
(b) Prove that

m(A2 ) = ∞   and   m(Aα ) = 0 for α > 2.
Suggestion for the second equality: Let α > 2 and let r1 , r2 , . . . be a list of all rationals. Define Bn = {x ∈ R ; |x − rn | < 1/qn^α } = (rn − 1/qn^α , rn + 1/qn^α ), where rn = pn /qn in lowest terms. You want to show that m{Bn ; i.o.} = 0. To do so, let ℓ ∈ N, I = [−ℓ, ℓ], and use the First Borel–Cantelli Lemma to show that m{Bn ∩ I ; i.o.} = 0; then show that m{Bn ; i.o.} = 0.
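The conclusion of Problem 11 is easy to probe numerically. This sketch (our own, with function names of our choosing) scans denominators q and keeps the nearest rational p/q whenever it beats the 1/q² threshold; for ξ = √2 the continued-fraction convergents 1/1, 3/2, 7/5, 17/12, 41/29, 99/70, . . . all show up:

```python
from math import gcd, sqrt

def dirichlet_approximations(xi, qmax):
    """Rationals p/q in lowest terms with q <= qmax and |xi - p/q| < 1/q^2."""
    hits = []
    for q in range(1, qmax + 1):
        p = round(xi * q)                      # nearest numerator for this q
        if gcd(p, q) == 1 and abs(xi - p / q) < 1.0 / q ** 2:
            hits.append((p, q))
    return hits

hits = dirichlet_approximations(sqrt(2), 100)
print(hits)  # includes (1, 1), (3, 2), (7, 5), (17, 12), (41, 29), (99, 70)
```

Raising the exponent from 2 to any α > 2 makes such hits dramatically rarer, which is the phenomenon Problem 12 quantifies: m(Aα ) = 0 for α > 2.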
4.2. Borel's Strong Law of Large Numbers

This section is devoted to Émile Borel's (1871–1956) Strong Law of Large Numbers, the first version of which was published in the 1909 paper Les probabilités dénombrables et leurs applications arithmétiques [36].
4.2.1. The Strong Law of Large Numbers (Borel's Theorem). Before we state the Strong Law of Large Numbers, let us recall the weak law. Let S ∞ , where S = {0, 1}, be the sample space for an infinite sequence of (say) coin tosses, and let p be the probability of a head on any given flip. Then given an infinite sequence of coin tosses, x = (x1 , x2 , x3 , . . .) ∈ S ∞ , the ratio

(x1 + x2 + x3 + · · · + xn )/n

is the proportion of heads in n tosses, which for a typical sequence of coin tosses should intuitively be close to p for n large. Bernoulli's theorem, or the Weak Law of Large Numbers, is one interpretation of this intuitive idea: For each ε > 0,

lim_{n→∞} µ{x ∈ S ∞ ; |(x1 + x2 + · · · + xn )/n − p| < ε} = 1.
Another, slightly different, interpretation is that if you consider the event

A := {x ∈ S ∞ ; lim_{n→∞} (x1 + x2 + · · · + xn )/n = p},

then this event should occur with probability one; that is, in a sequence of coin tosses, with probability one the proportion of heads in n tosses approaches p as n → ∞. This is the Strong Law.

Borel's Strong Law of Large Numbers

Theorem 4.4. The set A belongs to S (C ) and µ(A) = 1; in other words, the event

{lim_{n→∞} (x1 + x2 + · · · + xn )/n = p}

belongs to S (C ) and it occurs with probability one.

As we did with the Weak Law, in order to prove the Strong Law we first transform its statement into a statement involving functions. For each i ∈ N, let

fi := χAi : S ∞ → R,
where Ai ⊆ S ∞ is the event that on the i-th toss we flip a head, and let

Sn = f1 + · · · + fn ,
which represents the simple random variable giving the total number of heads in n tosses. Then

A = {lim_{n→∞} Sn /n = p}.

Thus, the Strong Law of Large Numbers is really a statement about the points where the limit of a certain sequence of functions equals p. Thus, before proceeding, we better learn some general results on limits of sequences of functions.

Lemma 4.5. Let g, g1 , g2 , g3 , . . . be real-valued functions on a probability space (X, S , µ). Suppose that for each ε > 0 and n ∈ N, {|gn − g| ≥ ε} ∈ S . Then
(1) L := {lim gn = g} ∈ S , or equivalently, Lc = {lim gn ≠ g} ∈ S .
(2) µ(L) = 1 if and only if for each ε > 0, µ{|gn − g| ≥ ε ; i.o.} = 0.
(3) If for each ε > 0,

Σ_{n=1}^∞ µ{|gn − g| ≥ ε} < ∞,

then µ(L) = 1.
(4) If µ(L) = 1, then for each ε > 0, lim_{n→∞} µ{|gn − g| ≥ ε} = 0.
Proof : By definition of limit, we have

x ∈ L  ⇐⇒  ∀ε > 0 , ∃N ∈ N , ∀n ≥ N , |gn (x) − g(x)| < ε.

You can check that this condition is equivalent to the following:

x ∈ L  ⇐⇒  ∀m ∈ N , ∃N ∈ N , ∀n ≥ N , |gn (x) − g(x)| < 1/m.

In the language of sets, ∀ is intersection and ∃ is union, so this statement is simply that

L = ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A^c_{m,n} ,  where Am,n := {|gn − g| ≥ 1/m} ∈ S ,

which proves Part (1). For Part (2), taking complements shows that Lc = ⋃_{m=1}^∞ {Am,n ; i.o. in n}, so µ(L) = 1 if and only if for each ε > 0, µ{|gn − g| ≥ ε ; for infinitely many n's} = 0; that is, if and only if for each ε > 0, µ{|gn − g| ≥ ε ; i.o.} = 0. Part (3) follows from Part (2) and the First Borel-Cantelli Lemma. Finally, to prove Part (4), assume that µ(L) = 1 and let ε > 0. Note that

{|gn − g| ≥ ε ; i.o.} = ⋂_{n=1}^∞ ⋃_{k=n}^∞ {|gk − g| ≥ ε} = ⋂_{n=1}^∞ Bn ,

where Bn = ⋃_{k=n}^∞ {|gk − g| ≥ ε}. Observe that B1 ⊇ B2 ⊇ B3 ⊇ · · · , so by the "continuity from above" property of measures, we have

µ{|gn − g| ≥ ε ; i.o.} = lim_{n→∞} µ(Bn ).
By Part (2), the left-hand side of this equality is 0, so lim_{n→∞} µ(Bn ) = 0. On the other hand, noting that {|gn − g| ≥ ε} ⊆ Bn , it follows that

lim_{n→∞} µ{|gn − g| ≥ ε} ≤ lim_{n→∞} µ(Bn ) = 0,

which implies that lim_{n→∞} µ{|gn − g| ≥ ε} = 0.

Proof of the Strong Law : First of all, by Problem 1 in Exercises 2.4, we know that for each ε > 0,

{|Sn /n − p| ≥ ε} ∈ S (C ).
(In fact, this set belongs to R(C ).) Therefore, by Part (1) of the lemma,

A := {lim_{n→∞} Sn /n = p} ∈ S (C ).
To prove that µ(A) = 1, by Part (3) of our lemma, for fixed ε > 0 we just have to show that

Σ_{n=1}^∞ µ{|Sn /n − p| ≥ ε} < ∞.
We shall prove this using Chebyshev's inequality together with plain hard work! To begin, observe that

{|Sn /n − p| ≥ ε} = {|Sn − np| ≥ nε} = {(Sn − np)^4 ≥ n^4 ε^4 }.
As we did in the proof of the Weak Law, it's a good time to introduce the Rademacher functions

Ri := fi − p ,  i = 1, 2, 3, . . . ;

then

Sn − np = f1 + f2 + · · · + fn − np = (f1 − p) + (f2 − p) + · · · + (fn − p) = R1 + R2 + · · · + Rn ,

so

{|Sn /n − p| ≥ ε} = {(Σ_{k=1}^n Rk )^4 ≥ n^4 ε^4 }.
By Chebyshev's inequality (Lemma 2.11), we have

µ{(Σ_{k=1}^n Rk )^4 ≥ n^4 ε^4 } ≤ (1/(n^4 ε^4 )) ∫ (Σ_{k=1}^n Rk )^4 .

Observe that if we multiply out (R1 + · · · + Rn )^4 we obtain

(Σ_{k=1}^n Rk )^4 = Σ_{i,j,k,ℓ} Ri Rj Rk Rℓ ,

where the sum contains terms of the following form:
(1) Rm^4 (when all of Ri , Rj , Rk , Rℓ are the same).
(2) Rp^2 Rq^2 , p ≠ q (when two distinct pairs of Ri , Rj , Rk , Rℓ are the same).
(3) Ri Rj Rk Rℓ in which at least one factor is not repeated.
Note that there are n terms of the form (1) (this should be clear) and there are 3n(n − 1) terms of the form (2) (this will take a little thinking). We now consider integrals of each of these types of functions. Note that |Ri | ≤ 1 for any i, so

∫ Rm^4 ≤ ∫ 1 = µ(S ∞ ) = 1   and   ∫ Rp^2 Rq^2 ≤ ∫ 1 = µ(S ∞ ) = 1.
We now compute the integrals of the third type of functions. Assume that i is distinct from j, k, ℓ. Observe that when we multiply out

Rj Rk Rℓ = (χAj − p)(χAk − p)(χAℓ − p),

we get a linear combination of characteristic functions χB where the set B equals an intersection of one, two, or three sets amongst Aj , Ak , Aℓ or B equals S ∞ (in this case, χB ≡ 1, which occurs when we multiply the three −p's together). With B as just described, it follows that Ri Rj Rk Rℓ is a linear combination of terms of the form

(χAi − p)χB = χAi ∩B − pχB ,
so the integral ∫ Ri Rj Rk Rℓ is a linear combination of terms of the form

∫ (χAi ∩B − pχB ) = µ(Ai ∩ B) − pµ(B).

Recalling the form of the set B, we leave it as a short exercise for you to check that µ(Ai ∩ B) = µ(Ai ) · µ(B) = p · µ(B). Hence, ∫ Ri Rj Rk Rℓ = 0. Now,

(Σ_{k=1}^n Rk )^4 = Σ_{m=1}^n Rm^4 + 3 Σ_{p≠q} Rp^2 Rq^2 + ∗,

where ∗ consists of the type (3) terms, and therefore by our computations above, we have

(4.4)    ∫ (Σ_{k=1}^n Rk )^4 = Σ_{m=1}^n ∫ Rm^4 + 3 Σ_{p≠q} ∫ Rp^2 Rq^2 + ∫ ∗
                             ≤ Σ_{m=1}^n 1 + 3 Σ_{p≠q} 1 + 0 = n + 3n(n − 1) ≤ 3n^2 .

We conclude that

µ{(Σ_{k=1}^n Rk )^4 ≥ n^4 ε^4 } ≤ (1/(n^4 ε^4 )) ∫ (Σ_{k=1}^n Rk )^4 ≤ 3n^2 /(n^4 ε^4 ) = 3/(n^2 ε^4 ).

Hence,

Σ_{n=1}^∞ µ{|Sn /n − p| ≥ ε} ≤ Σ_{n=1}^∞ 3/(n^2 ε^4 ) < ∞,
and the proof of the Strong Law of Large Numbers is complete.
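The key estimate (4.4) can be checked by brute force for small n: enumerate all 2^n coin sequences and compute ∫ (Σ Rk )^4 = E[(Sn − np)^4 ] exactly (a sketch, ours; for p = 1/2 and n = 10 the exact value is 17.5, comfortably below 3n² = 300):

```python
from itertools import product

def fourth_moment(n, p):
    """E[(S_n - n p)^4], computed exactly by enumerating all 2^n sequences."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1 else 1 - p
        total += prob * (sum(x) - n * p) ** 4
    return total

# The bound (4.4): E[(S_n - np)^4] <= 3 n^2 for every n and p.
for n in (2, 4, 8, 10):
    print(n, fourth_moment(n, 0.5), "<=", 3 * n * n)
```

The quadratic growth in n is exactly what makes Σ 3/(n²ε⁴) converge, which is where the fourth power (rather than the second, as in the Weak Law) earns its keep.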
We remark that because of Property (4) in Lemma 4.5, the Strong Law of Large Numbers automatically implies the Weak Law, and in this sense the Strong Law is "stronger" than the Weak Law. Indeed, since µ(A) = 1 where A = {lim_{n→∞} Sn /n = p}, by Property (4) in Lemma 4.5,

for each ε > 0,  lim_{n→∞} µ{|Sn /n − p| ≥ ε} = 0.

This is exactly the statement of the Weak Law. However, the Weak Law doesn't automatically imply the Strong Law in the sense that the converse of Property (4) in Lemma 4.5 is in general false; see Problem 3. We also remark that there is a corresponding strong version of the Expectation Theorem 2.12. Let µ0 : I → [0, 1] be a probability measure on a semiring of a sample space S, let X := S ∞ , the sample space for repeating the experiment modeled by S an infinite number of times, let µ : S (C ) → [0, 1] be the infinite product of µ0 with itself, and finally, let f : S → R be a simple random variable. For each i, define

(4.5)    fi : X → R  by  fi (x1 , x2 , . . .) := f (xi ),
which represents the random variable f observed on the i-th iterate of the experiment. The following theorem is proved in exactly the same way as the Strong Law, with only slight modifications, so we leave its proof for your enjoyment.
The Strong Expectation Theorem

Theorem 4.6. The event

{ lim_{n→∞} (f_1 + f_2 + ··· + f_n)/n = E(f) }

belongs to S(C) and it occurs with probability one.
Example 4.5. Let S = (0, 1] and let µ_0 : I → [0, 1] be Lebesgue measure on I = left-hand open intervals in (0, 1]. Define f : (0, 1] → R as the "tenth-place digit" function. Thus, if x ∈ (0, 1] and we write x = 0.x_1 x_2 x_3 . . . in base-ten notation (taking the non-terminating expansion if x has two expansions), then f(x) := x_1. Then from Problem 3 in Exercises 2.4 we know that f : S → R is an I-simple random variable. Moreover, it's easy to check that E(f) = 1/10. Hence, if f_i is defined as in (4.5) and C denotes the cylinder sets of (0, 1]^∞ generated by I, by the Strong Expectation Theorem we know that the event

{ lim_{n→∞} (f_1 + f_2 + ··· + f_n)/n = 1/10 }

belongs to S(C) and it occurs with probability one. In other words, if we sample numbers in (0, 1] "at random" and average their tenth digits, with probability 1 these averages approach 1/10 as the number of samples increases.
4.2.2. A couple remarks. Our first remark deals with the limitations of Theorem 4.6. Consider Example 4.5 but now let f : (0, 1] → R be the function f (x) = x; in other words, f represents the actual number (not its tenth digit) picked from the interval (0, 1]. The function f is not simple so its expected value is not yet defined! However, if it were defined it should be 1/2 and the Expectation Theorem in this case should therefore read: If we sample numbers in (0, 1] “at random” and average them, with probability 1 these averages approach 1/2 as the number of samples increases. However, to prove this rigorously we need to learn expected values (integrals) of functions more general than simple functions. We shall study integration in the next chapter and prove a very general SLLN in Section 6.6. For our second remark, we note that the SLLN is really a feature of countable additivity in the sense that it fails to hold for finitely additive probabilities. Here’s a simple example. As usual, let X = S ∞ where S = {0, 1} and let µ be the infinite product measure assigning the probability p for obtaining a head on any given flip. Let us suppose that we live in a world where coins eventually flip to an infinite run of tails. The sample space in such a world is Y = {x ∈ X ; there is an N with xi = 0 for all i ≥ N }.
Define I := C ∩ Y = {A ∩ Y ; A ∈ C } and define
ν : I → [0, 1] by ν(A ∩ Y ) := µ(A).
In Problem 1 we ask you to check that I is a semiring and ν is finitely additive, but not countably additive. Nonetheless, being finitely additive, ν extends uniquely to a finitely additive set function ν : R(I ) → [0, 1].
For each i ∈ N, define fi : Y → [0, 1] as before: fi (y) = 1 if yi = 1 and fi (y) = 0 for yi = 0. Is it true that the SLLN holds for the fi ’s? To answer this question, let
y ∈ Y. Then there is an N such that y_i = 0 for all i ≥ N. Therefore f_i(y) = 0 for all i ≥ N, so

(f_1(y) + ··· + f_n(y))/n = (f_1(y) + ··· + f_N(y))/n for all n ≥ N.

Taking n → ∞ we see that lim_{n→∞} (f_1(y) + ··· + f_n(y))/n = 0. Thus, assuming p > 0,

{ lim (f_1 + ··· + f_n)/n = p } = ∅.

Hence,

ν{ lim (f_1 + ··· + f_n)/n = p } = 0 =⇒ The SLLN fails!

Although the SLLN fails, in Problem 1 you will prove that the WLLN holds! There are at least two ways to react when confronted with such an example. One way is to view finitely additive probabilities as "pathological" and dismiss them. The other way is to dismiss countably additive probabilities, because this example is very simple and simple examples should only illustrate the validity of theories, not give counterexamples to them. I'll let you decide what viewpoint to take! See Problem 2 for a probability paradox involving the above example.

◮ Exercises 4.2.
1. In this problem and the next one we study the set function ν : R(I) → [0, 1] defined in Section 4.2.2.
(i) Prove that Y is countable.
(ii) Prove that if A ∩ Y = B ∩ Y where A, B ∈ C, then A = B. This proves that ν is well-defined.
(iii) Prove that ν is finitely additive.
(iv) Given a point y ∈ Y and ε > 0, prove that there is a set I ∈ I such that y ∈ I and ν(I) < ε.
(v) Since Y is countable, we can write Y = {y_1, y_2, . . .}. Prove there are sets I_1, I_2, . . . ∈ I such that y_i ∈ I_i and ν(I_i) < 1/2^{i+1}. Use this to show that ν is not countably subadditive and hence not countably additive.
(vi) Prove that the WLLN holds.
2. (A finitely additive probability paradox; cf. Problem 9 in Exercises 3.2) Jack and Jill, on top of a hill, each flip a coin infinitely many times. Suppose that they live in a world where coins eventually flip to an infinite run of tails. They record the number of flips it takes to throw the last head (until an infinite run of tails occurs), and the one with the smallest number wins. You call out either Jack or Jill's name at random, and the person who you call on tells you his or her number; at this point you don't know what number the other person chose. Then they reveal their numbers. Who wins? In this problem we describe a model of this situation, then answer the question.
(i) Let F ⊆ P(Y) denote the collection of all finite subsets of Y, and consider the collection

A := {A ∪ F ; A ∈ R(I), F ∈ F}

and the set function P : A → [0, 1]
defined by P (A ∪ F ) := ν(A)
for all A ∈ R(I) and F ∈ F. Prove that A is a ring and P is a finitely additive probability set function on A.
(ii) You call out a person's name at random, say Jill. Suppose that Jill told you she threw the last head on flip n. Let A be the event that Jack wins or they tie. Show that A ∈ A (in fact, A ∈ F) and P(A) = 0. What's your conclusion?
3. We show that the converse to Property (4) in Lemma 4.5 doesn't hold. Let X = [0, 1] with Lebesgue measure. Given n ∈ N, we can write n = 2^k + i where k ∈ {0, 1, 2, . . .} and 0 ≤ i < 2^k; let f_n be the characteristic function of the interval (i/2^k, (i+1)/2^k]. (To get an idea of what these functions look like, it may be helpful to draw pictures of f_1, f_2, . . . , f_7.) Show that for each ε > 0, lim_{n→∞} m{|f_n| ≥ ε} = 0. Also show that m{lim f_n = 0} = 0. In fact, show that lim_{n→∞} f_n(x) does not exist at any x ∈ [0, 1], so
{lim f_n = 0} = ∅. Thus, the converse to Property (4) in Lemma 4.5 can fail badly.
4. (Borel's Simple Normal Number Theorem for binary) Let b ∈ N with b ≥ 2. Given a number x ∈ [0, 1], write it in base b, that is, its b-adic expansion:

(4.6)    x = x_1/b + x_2/b^2 + x_3/b^3 + ··· ,

where the x_i's are in the set of digits S := {0, 1, . . . , b − 1}. If x has two such expansions (occurring if and only if x ≠ 1 and x b^n ∈ N for some n ∈ N), one that is terminating and the other non-terminating, in order to have unique expansions we agree to write such numbers in their non-terminating b-adic expansions. Henceforth assume b = 2; we shall deal with the general case in the next problem. The number x is said to be simply normal in base 2 if it asymptotically has the same number of 0's and 1's in its binary expansion, in the sense that

lim_{n→∞} (x_1 + x_2 + ··· + x_n)/n = 1/2.

Using the Strong Law of Large Numbers and Problem 7 in Exercises 2.4, prove that all numbers in [0, 1], except for a set of measure zero, are simply normal; that is, if

A := { x ∈ [0, 1] ; lim_{n→∞} (x_1 + x_2 + ··· + x_n)/n = 1/2 },
then prove A is Lebesgue measurable (in fact, A is a Borel set) and m(A) = 1. We remark that although every number in [0, 1] is simply normal in base 2 except for a set of measure zero, it's not easy to determine whether any given number is simply normal. For instance, it's not known whether (the decimal parts of) e, π, log 2, or √2 are simply normal in base 2! (Of course, one can concoct simply normal numbers such as 0.101010101010101010 . . . .)
5. (Borel's Simply Normal Number Theorem) We generalize the previous problem and prove Borel's celebrated simply normal number theorem published in 1909 [36]. Let b ∈ N with b ≥ 2. Fix a digit d ∈ S = {0, 1, . . . , b − 1} and define f : S → R by f(x) = 1 if x = d, and f(x) = 0 otherwise. For each i, define f_i : S^∞ → R by f_i(x_1, x_2, . . .) = f(x_i). Thus, f_i observes whether the i-th digit is d. Intuitively speaking, since there are a total of b digits, for a "randomly picked number" x ∈ [0, 1] it seems reasonable that in its b-adic expansion (4.6) the digit d should appear with frequency 1/b; that is, it should be that

(4.7)    lim_{n→∞} (f_1(x) + f_2(x) + ··· + f_n(x))/n = 1/b,

where "x" here represents the sequence (x_1, x_2, . . .) of the digits of x obtained from (4.6). To prove this, let µ_0 : P(S) → [0, 1] assign "fair" probabilities, µ_0(A) = #A/b, and let µ : S(C) → [0, 1] denote the infinite product of µ_0 with itself.
(i) Using the Strong Expectation Theorem, prove that the set of all (x_1, x_2, . . .) ∈ S^∞ such that (4.7) holds is in S(C) and it occurs with probability 1.
(ii) Assuming the results in Problem 7 in Exercises 2.4, prove that the set of all x ∈ [0, 1] such that (4.7) holds is Borel and has Lebesgue measure 1.
(iii) A number x ∈ [0, 1] is said to be simply normal in base b if given any digit d ∈ {0, 1, . . . , b − 1}, the limit (4.7) holds. In (ii) you proved that all x ∈ [0, 1], except for a set of measure zero, are simply normal in any fixed base b. A number
x ∈ [0, 1] is said to be simply normal if it's simply normal in every base b ≥ 2. Prove that all x ∈ [0, 1], except for a set of measure zero, are simply normal. Note: no one knows how to describe, in a simple way, even one simply normal number!
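The digit-frequency statement (4.7) can be checked by hand for rational numbers, whose b-adic digits can be computed exactly. Here is a small sketch (ours; the helper names are made up) that extracts base-b digits with exact rational arithmetic: 2/3 = 0.101010...₂ is simply normal in base 2, while 1/7 = 0.142857142857... is not simply normal in base 10, since the digit 3 never appears.

```python
from fractions import Fraction

def base_b_digits(x, b, n):
    """First n base-b digits x_1, x_2, ... of x in [0, 1), exactly."""
    x = Fraction(x)
    digits = []
    for _ in range(n):
        x *= b
        d = int(x)      # floor, since x >= 0
        digits.append(d)
        x -= d
    return digits

def digit_freq(digits, d):
    """Fraction of entries equal to the digit d."""
    return digits.count(d) / len(digits)

bits = base_b_digits(Fraction(2, 3), 2, 1000)
print(digit_freq(bits, 1))   # exactly 0.5: 2/3 is simply normal in base 2
```

Exact Fraction arithmetic matters here: repeated multiplication of a float by b would destroy the digits after about 50 steps in base 2.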
4.3. Measurability and Littlewood's First Principle(s)

Recall that a set A ⊆ R^n being Lebesgue measurable means that

m∗(E) = m∗(E ∩ A) + m∗(E ∩ A^c) for every set E ⊆ R^n.
Although we have built our theory around this definition of measurability, it doesn't give a geometric "feeling" for what measurability really means. The purpose of this section is to understand geometrically what a measurable set is, in terms of elementary figures and the topology of R^n, according to Littlewood's First Principles, named after John Edensor Littlewood (1885–1977). Littlewood's second and third principles are in Section 5.3. However, we start this section by stating some results in the general case, which are good enough for most applications; then we'll specialize to R^n.

4.3.1. Measurability and nonmeasurability. If µ : I → [0, ∞] is an additive set function on a semiring I of subsets of a set X, the Regularity Theorem 3.13 says that given any A ⊆ X, there is a B ∈ S(I) ⊆ M_µ∗ such that

A ⊆ B and µ∗(A) = µ∗(B).

In other words, we can cover an arbitrary set A ⊆ X by an element of S(I) with the same measure as A. Note that the equality µ∗(A) = µ∗(B) does not immediately imply that µ∗(B \ A) = 0. Indeed, if this were true, then because µ∗ is a complete measure, B \ A would be measurable and hence

A = B \ (B \ A),

being the difference of two measurable sets, would also be measurable. In other words, if A ⊆ X and there is a set B ∈ S(I) such that

A ⊆ B with µ∗(A) = µ∗(B) and µ∗(B \ A) = 0,

then A is measurable. If µ is σ-finite, the converse holds by the following theorem.

Regularity and Measurable sets

Theorem 4.7. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. If there is a B ∈ S(I) such that

A ⊆ B with µ∗(A) = µ∗(B) and µ∗(B \ A) = 0,

then A is µ∗-measurable, with the converse holding if µ is σ-finite.
Proof: We just have to prove the converse, so assume that µ is σ-finite, let A ∈ M_µ∗, and let X = ⋃_{n=1}^∞ X_n, where {X_n} ⊆ I is a sequence of pairwise disjoint sets with finite µ measure. Then A = ⋃_{n=1}^∞ A_n where A_n = A ∩ X_n ∈ M_µ∗; the A_n are disjoint for different n's and µ∗(A_n) < ∞ for all n. By the Regularity Theorem we know that there is a set B_n ∈ S(I) with A_n ⊆ B_n and µ∗(A_n) = µ∗(B_n). Since µ∗(B_n) = µ∗(A_n) < ∞, subtractivity implies that µ∗(B_n \ A_n) = 0. If B = ⋃_{n=1}^∞ B_n, then B ∈ S(I) and

B = ⋃_{n=1}^∞ B_n = ⋃_{n=1}^∞ ( A_n ∪ (B_n \ A_n) ) = A ∪ N,

where we used that A = ⋃_{n=1}^∞ A_n and we put N = ⋃_{n=1}^∞ (B_n \ A_n). Since µ∗(B_n \ A_n) = 0 for each n, by subadditivity we have µ∗(N) = 0. Since B = A ∪ N, we leave you to check that µ∗(A) = µ∗(B) and µ∗(B \ A) = 0, which completes our proof.
Returning to our discussion before this theorem, we see that if A ⊆ X is not measurable, then for any B ∈ S(I) with A ⊆ B and µ∗(A) = µ∗(B), we must have µ∗(B \ A) > 0. Here's a picture: [picture: a set A with a blurry boundary inside an oval region B, with A ⊆ B, (1) µ∗(A) = µ∗(B), and (2) µ∗(B \ A) > 0]

The set B is supposed to be the region on and inside of the oval. Let's consider the statements

(1) µ∗(A) = µ∗(B) and (2) µ∗(B \ A) > 0.

(1) can be interpreted as saying that there is no volume between A and B, while (2) says that there is volume between A and B! In view of this dichotomy, we visualize A as having a "blurry" or "cloudy" boundary because, in a sense, the substance in B and not in A is empty (this is (1)) and on the other hand it takes up space (this is (2))! See Problem 4 for an example of a nonmeasurable set, and see Section 4.4, where we present the most famous nonmeasurable set of them all, Vitali's set.

The following corollary is our first example of a Littlewood principle; it says that µ∗-measurable sets are just elements of S(I) up to sets of measure zero.

Characterization of M_µ∗ for general sets

Corollary 4.8. If µ : I → [0, ∞] is a σ-finite additive set function on a semiring I, then

A ∈ M_µ∗ ⇐⇒ A = F ∪ N,

where F ∈ S(I) and N has measure zero; in fact, N is a subset of an element of S(I) of measure zero. Thus, µ∗-measurable sets are, up to sets of measure zero, elements of S(I).

Proof: The direction "⇐=" is automatic (why?), so we just prove "=⇒". Let A ∈ M_µ∗. Then A^c ∈ M_µ∗, so by Theorem 4.7 (recalling that µ is σ-finite, so we can apply the converse) we know there is a B ∈ S(I) such that A^c ⊆ B and µ∗(B \ A^c) = 0. Let F = B^c ∈ S(I). Then taking complements of A^c ⊆ B we see that F ⊆ A, and since B \ A^c = B ∩ A = A ∩ F^c = A \ F, we have µ∗(A \ F) = 0. Thus,

A = F ∪ N,

where N = A \ F has µ∗-measure zero. By the Regularity Theorem, N is a subset of an element of S(I) of the same measure as N, namely zero. This proves our result.
Here’s a schematic of the situation: N F
A= F ∪N
200
4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS
In this schematic, A ∈ M_µ∗ is a blob, F ∈ S(I) is the interior of the blob, which makes up most of A, and N is represented by the boundary of A and is supposed to be a small, measure-zero part of A. Now compare the statement

(1) A ∈ M_µ∗ ⇐⇒ A = F ∪ N, where F ∈ S(I) and N has measure zero,

with the Carathéodory definition of M_µ∗:

(2) A ∈ M_µ∗ ⇐⇒ µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ A^c) for all E ⊆ X.

The formulation (1) of measurability is, in my opinion, a conceptually easier way to understand measurability than (2). Here's an immediate corollary (of the corollary) for additive set functions on the left-half open boxes I^n in R^n.

Characterization of M_µ∗ for R^n

Corollary 4.9. If µ : I^n → [0, ∞) is additive, then

A ∈ M_µ∗ ⇐⇒ A = F ∪ N,

where F is a Borel set and N has µ∗-measure zero; in fact, N is a subset of a Borel set of measure zero. Thus, µ∗-measurable sets are, up to sets of measure zero, Borel sets.

We stated this theorem for general additive set functions µ : I^n → [0, ∞), but the main examples to keep in mind are Lebesgue measure on R^n and Lebesgue-Stieltjes measures on I^1. In particular, Lebesgue measurable sets are just Borel sets up to sets of Lebesgue measure zero. We will state precisely what kind of Borel set F is in Part (6) of Theorem 4.10.

4.3.2. Littlewood's First Principle(s) for R^n. Littlewood's First Principle [183, p. 26] for subsets of R states that

Every [finite Lebesgue] (measurable) set is nearly a finite sum of intervals.
In R^n we interpret this principle as follows: A set with finite Lebesgue outer measure is Lebesgue measurable if and only if it is "nearly" an elementary figure. We can make the word "nearly" precise: for any ε > 0, the set, call it A, differs from an elementary figure by a set of measure less than ε, in the sense that there exists an elementary figure I ∈ E^n with

m∗(A \ I) < ε and m∗(I \ A) < ε.
See the left-hand picture in Figure 4.1.

Figure 4.1. On the left, A (a disk) is "nearly" equal to an elementary figure I in the sense that the differences A \ I and I \ A have small measure. On the right, A (a blob with a jagged edge) is covered by an open set U (represented by an ellipse) such that the difference U \ A has small measure.

Thus, we interpret the term "nearly" in Littlewood's First Principle as "up to ε-sets" (that is, sets of measure less than ε).
In Theorem 4.10 below we extend this principle to encompass more general measures on R^n (not just Lebesgue measure) and, taking advantage of the topological structure of R^n, we give an alternative formulation of Littlewood's Principles in terms of open and closed sets.

In the following theorem we use the notion of a Gδ set (pronounced "gee-delta"), which is a countable intersection of open sets, and an Fσ set (pronounced "eff-sigma"), which is a countable union of closed sets. Note that Gδ and Fσ sets are Borel sets, and a set is a Gδ set if and only if its complement is an Fσ set. These sets show up often; see Problem 1. As with Corollary 4.9, we state the following theorem for general measures µ : I^n → [0, ∞), but the main examples to keep in mind are µ = m, Lebesgue measure, and Lebesgue-Stieltjes measures on I^1. In the general case, Parts (2),(3) of the following theorem say (see the right-hand picture in Figure 4.1) that µ∗-measurable sets are "nearly" open sets (or closed sets). Parts (4),(5) and (6) of the following theorem can be interpreted as saying that Gδ and Fσ Borel sets "essentially" make up all elements of M_µ∗; that is, µ∗-measurable sets are "essentially" Gδ sets (or Fσ sets).

Littlewood's First Principle(s) for R^n

Theorem 4.10. Let µ : I^n → [0, ∞) be additive and let A ⊆ R^n.
(1) If µ∗(A) < ∞, then A is µ∗-measurable if and only if given ε > 0 there is an I ∈ E^n with

µ∗(A \ I) < ε and µ∗(I \ A) < ε.

Without the assumption µ∗(A) < ∞, the set A is µ∗-measurable if and only if any one of the Properties (2)–(5) holds.
(2) Given ε > 0 there is an open set U ⊆ R^n such that

A ⊆ U and µ∗(U \ A) < ε.

(3) Given ε > 0 there is a closed set C ⊆ R^n such that

C ⊆ A and µ∗(A \ C) < ε.

(4) There is a Gδ set G ⊆ R^n such that

A ⊆ G with µ∗(G \ A) = 0 and µ∗(A) = µ∗(G).

(5) There is an Fσ set F ⊆ R^n such that

F ⊆ A with µ∗(A \ F) = 0 and µ∗(F) = µ∗(A).

(6) We have

A ∈ M_µ∗ ⇐⇒ A = F ∪ N,

where F is an Fσ and N is a subset of a Gδ set of µ∗-measure zero.

Proof: We shall prove (1), (2), and (4), leaving the equivalence of measurability to (3) and (5) for Problem 3.

Step 1: We begin by establishing a useful fact. Let A ⊆ R^n be arbitrary and let ε > 0. We shall prove there exists an open set U ⊆ R^n such that

A ⊆ U and µ∗(U) ≤ µ∗(A) + ε.
If µ∗(A) = ∞, then we can take U = R^n and we're done, so assume that µ∗(A) < ∞. We now do another "ε/2^k proof"! By definition of infimum in the equality

µ∗(A) = inf { ∑_{k=1}^∞ µ(I_k) ; A ⊆ ⋃_{k=1}^∞ I_k, I_k ∈ I^n },

there are sets I_k ∈ I^n such that

(4.8)    A ⊆ ⋃_{k=1}^∞ I_k with ∑_{k=1}^∞ µ(I_k) ≤ µ∗(A) + ε/2.

Now the idea to get the desired open set U is to replace the I_k's, which are of the form I_k = (a_k, b_k], by slightly larger open boxes. To this end, observe that

I_k = ⋂_{j=1}^∞ (a_k, b_k + (1/j)],

so by the continuity of measures,

lim_{j→∞} µ∗( (a_k, b_k + (1/j)] ) = µ∗(I_k).

Hence, for each k we can choose ε_k > 0 so that

µ∗( (a_k, b_k + ε_k] ) < µ∗(I_k) + ε/2^{k+1}.

By monotonicity, we have µ∗( (a_k, b_k + ε_k) ) ≤ µ∗( (a_k, b_k + ε_k] ), and by definition of µ∗ we have µ∗(I_k) ≤ µ(I_k). Thus,

µ∗( (a_k, b_k + ε_k) ) < µ(I_k) + ε/2^{k+1}.

Let J_k = (a_k, b_k + ε_k) and set U = ⋃_{k=1}^∞ J_k. Then U is open, A ⊆ U, and

µ∗(U) ≤ ∑_{k=1}^∞ µ∗(J_k) ≤ ∑_{k=1}^∞ ( µ(I_k) + ε/2^{k+1} ) = ∑_{k=1}^∞ µ(I_k) + ε/2 ≤ µ∗(A) + ε/2 + ε/2 = µ∗(A) + ε.
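The ε/2^k device used in Step 1 is the same trick that shows any countable set has Lebesgue measure zero: cover the k-th point by an open interval of length ε/2^k, so the total length stays below ε no matter how many points there are. A small exact-arithmetic sketch of the bookkeeping (ours; the helper name is made up):

```python
from fractions import Fraction

def epsilon_cover(points, eps):
    """Cover the k-th point q_k by an open interval of length eps/2^k
    centered at q_k; return the intervals and their total length,
    which is eps * (1 - 2^(-len(points))) < eps."""
    eps = Fraction(eps)
    intervals, total = [], Fraction(0)
    for k, q in enumerate(points, start=1):
        half = eps / 2**(k + 1)        # half-width: (eps / 2^k) / 2
        intervals.append((q - half, q + half))
        total += 2 * half
    return intervals, total

# Cover 1000 points (stand-ins for an enumeration q_1, q_2, ... of the
# rationals in [0, 1]) with total length under eps = 1/10:
_, total = epsilon_cover([Fraction(1, n) for n in range(1, 1001)], Fraction(1, 10))
print(float(total))
```

The geometric series ∑ ε/2^k = ε is what lets one "pay" an arbitrarily small total price for countably many enlargements, exactly as in the choice of the ε_k above.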
Step 2: Since µ, being finite on boxes, is σ-finite, we prove

A is µ∗-measurable =⇒ (2) =⇒ (4) =⇒ A is µ∗-measurable,

which shows the equivalence of measurability to (2) and (4). To prove A is µ∗-measurable =⇒ (2), let ε > 0 and assume A is µ∗-measurable. Writing R^n = ⋃_{k=1}^∞ X_k, where {X_k} is a sequence of pairwise disjoint boxes with finite µ measure, we have A = ⋃_{k=1}^∞ A_k where A_k = A ∩ X_k with µ∗(A_k) < ∞. It follows by Step 1 that there is an open set B_k with A_k ⊆ B_k and µ∗(B_k) − µ∗(A_k) < ε/2^k. This implies that µ∗(B_k \ A_k) < ε/2^k by subtractivity. If U = ⋃_{k=1}^∞ B_k, then U is open and U \ A ⊆ ⋃_{k=1}^∞ (B_k \ A_k), so by countable subadditivity,

µ∗(U \ A) ≤ ∑_{k=1}^∞ µ∗(B_k \ A_k) < ∑_{k=1}^∞ ε/2^k = ε.

This proves (2). To prove (2) =⇒ (4), assume that given ε > 0 there is an open set U such that A ⊆ U and µ∗(U \ A) < ε. Then, in particular, for each k = 1, 2, . . . , there is an open set B_k such that A ⊆ B_k and µ∗(B_k \ A) < 1/k. Thus, if B = ⋂_{k=1}^∞ B_k, then B is a Gδ, A ⊆ B, and since B ⊆ B_k for each k, we have

µ∗(B \ A) ≤ µ∗(B_k \ A) < 1/k for each k,

so µ∗(B \ A) = 0. By subadditivity, µ∗(B) ≤ µ∗(A) + µ∗(B \ A) = µ∗(A), and monotonicity gives the reverse inequality, so µ∗(A) = µ∗(B). This proves (4). Finally, to prove (4) =⇒ A is µ∗-measurable: if A ⊆ G with µ∗(G \ A) = 0, then since µ∗ is a complete measure, G \ A is µ∗-measurable; the Gδ set G is Borel, hence µ∗-measurable, and therefore A = G \ (G \ A), being the difference of two measurable sets, is µ∗-measurable.

Step 3: We prove the "only if" part of (1), so assume A is µ∗-measurable with µ∗(A) < ∞ and let ε > 0. Let A ⊆ ⋃_{k=1}^∞ I_k as in (4.8), so that

∑_{k=1}^∞ µ(I_k) ≤ µ∗(A) + ε/2 =⇒ ∑_{k=1}^∞ µ(I_k) < µ∗(A) + ε.

Since the sum ∑_{k=1}^∞ µ(I_k) is finite, there exists an N such that ∑_{k=N+1}^∞ µ(I_k) < ε. Let I = ⋃_{k=1}^N I_k. Then I ∈ E^n, and we shall prove that this set has the required properties. Let B = ⋃_{k=1}^∞ I_k. Then as A ⊆ B, we have µ∗(A \ I) ≤ µ∗(B \ I), and since B \ I ⊆ ⋃_{k=N+1}^∞ I_k, by definition of µ∗(B \ I) we have

µ∗(B \ I) ≤ ∑_{k=N+1}^∞ µ(I_k) < ε.

Thus, µ∗(A \ I) < ε. To prove that µ∗(I \ A) < ε, observe that since I ⊆ B, we have I \ A ⊆ B \ A, so

µ∗(I \ A) ≤ µ∗(B \ A).

Also, B = ⋃_{k=1}^∞ I_k, so by definition of µ∗(B),

µ∗(B) ≤ ∑_{k=1}^∞ µ(I_k) < µ∗(A) + ε =⇒ µ∗(B \ A) < ε,

where we used that µ∗(B \ A) = µ∗(B) − µ∗(A) by subtractivity. This implies that µ∗(I \ A) < ε and completes the "only if" part of (1).

Step 4: Lastly, we prove the "if" part of (1). So, assume that A, which has finite µ∗-outer measure, has the properties in (1); we shall prove that A is µ∗-measurable. Let ε > 0. Then by (2) we just have to show there is an open set U such that A ⊆ U and µ∗(U \ A) < ε. To this end, let U be given by Step 1 with ε in Step 1 replaced by ε/3. This implies, in particular, that

A ⊆ U and µ∗(U) < µ∗(A) + ε/3.

By assumption there is an I ∈ E^n such that µ∗(A \ I) < ε/3 and µ∗(I \ A) < ε/3. Since I is µ∗-measurable and A ⊆ U,

µ∗(U \ I) = µ∗(U) − µ∗(U ∩ I) ≤ µ∗(U) − µ∗(A ∩ I) ≤ µ∗(U) − ( µ∗(A) − µ∗(A \ I) ) < ε/3 + ε/3 = 2ε/3,

and therefore, as U \ A ⊆ (U \ I) ∪ (I \ A), subadditivity gives

µ∗(U \ A) ≤ µ∗(U \ I) + µ∗(I \ A) < 2ε/3 + ε/3 = ε.

Hence A is µ∗-measurable, which completes the proof.
By definition of limit and the fact that α < µ∗ (C), it follows that there is some K = Kj such that α < µ∗ (K). Since K = Kj ⊆ C ⊆ A, this shows that α is not an upper bound of S.
In general, given a topological space X such that every compact set is a Borel set, a measure µ on the σ-algebra of Borel subsets of X is said to be a regular Borel measure if every compact subset of X has finite µ-measure and for any Borel set A ⊆ X, Properties (1) and (2) of Theorem 4.11 hold with µ∗ replaced by µ. Explicitly, a measure µ : B(X) → [0, ∞] on the Borel sets of X is a regular Borel measure if for all compact sets K ⊆ X, we have K ∈ B(X) and µ(K) < ∞, and also (1) For every Borel set A ⊆ X, we have µ(A) = inf{µ(U) ; A ⊆ U, U open}. (2) For every Borel set A ⊆ X, we have µ(A) = sup{µ(K) ; K ⊆ A, K compact}. Thus, Theorem 4.11 implies that given an additive set function µ : I n → [0, ∞), the restriction of µ∗ , the outer measure induced by µ, to the Borel sets is a regular Borel measure. In particular, when restricted to the Borel sets, Lebesgue measure and Lebesgue-Stieltjes measures are examples of regular Borel measures. 4.3.4. Translations, Dilations, and the Cube Principle. It should be “obvious” that Lebesgue measure is translation invariant in the sense that the measure of a set doesn’t change if the set is moved: ✒ x
Here we translated the rectangle by a vector x. Another “obvious” property is that measure should scale with the dimension. For example, if a line segment is doubled in length, then the measure of the new segment is two times the original length. If the sides of a rectangle are each doubled, then the measure of the new rectangle is 22 = 4 times the original measure as seen here:
206
4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS
More generally, if the sides of a box in Rn are each doubled, then the measure of the new box is 2n times the original measure. In Proposition 4.12 we prove that this dilation property of outer measure holds for all subsets of Rn . To make these statements concerning translations and dilations rigorous, we make some definitions. Given x ∈ Rn and A ⊆ Rn , the translation of A by x is denoted by A + x or x + A and is defined by A + x = x + A := {a + x ; a ∈ A} = {y ∈ Rn ; y − x ∈ A}.
Given r > 0 , the dilation of A by r is denoted by rA:
rA := {ra ; a ∈ A} = {y ∈ Rn ; r−1 y ∈ A}.
Let x = (x1 , . . . , xn ) and let I = (a1 , b1 ] × · · · × (an , bn ] ∈ I n . Then observe that I + x = (a1 + x1 , b1 + x1 ] × · · · × (an + xn , bn + xn ]
and
rI = (ra1 , rb1 ] × · · · × (ran , rbn ]. Thus, I is invariant under translations and dilations, and m(I + x) = m(I) and m(rI) = rn m(I). In Proposition 4.12 we show that on any subset of Rn , not just boxes, Lebesgue outer measure is invariant under translations and scales correctly under dilations. However, before doing this, we shall discuss a little . . . Philosophy: If a certain property of Lebesgue measure holds for open sets, then it holds for the Lebesgue outer measure of any set. To see why this philosophy should be true, recall from Theorem 4.11 that given any set A ⊆ Rn , n
m∗ (A) = inf{m(U) ; A ⊆ U, U open}.
Since m∗ (A) can be expressed in terms of this infimum involving open sets only, if a certain property holds for the Lebesgue measure of open sets it should “pass through” the infimum to hold for arbitrary sets. In fact, we can do even better. By the DyadicSCube Theorem there are pairwise disjoint cubes I1 , I2 , . . . ∈ I n such that U = ∞ k=1 Ik . Thus, if a certain property holds for the volume of cubes it should hold for open sets and hence for all sets. This leads us to the The Cube Principle: If a certain property of Lebesgue measure holds for cubes (elements of I n whose sides have the same length), then it holds for the Lebesgue outer measure of any set. Of course, there is a corresponding “Box Principle” but cubes are sometimes easier to work with; also, this principle does not hold for any property but it does hold for many cases (each case should be checked). Translation and dilation properties Proposition 4.12. If A ⊆ Rn is arbitrary, then for any x ∈ Rn and r > 0, m∗ (A) = m∗ (A + x)
and
m∗ (rA) = rn m∗ (A).
Moreover, A is Lebesgue measurable if and only if the translation A + x (or the dilation rA) is Lebesgue measurable. Proof : We prove this proposition in two steps. Step 1: It can by easily checked that the identities m∗ (A) = m∗ (A + x) and m∗ (rA) = r n m∗ (A) hold for cubes, so by the “cube principle” they must hold for all A ⊆ Rn . . . done. Well, just to convince ourselves that we’re not cheating
here we shall work through the proof that m∗(A) = m∗(A + x). Consider the case when A = U ⊆ R^n is open. Then by the Dyadic Cube Theorem there are pairwise disjoint cubes I_1, I_2, . . . ∈ I^n such that U = ⋃_{k=1}^∞ I_k. Therefore,

U + x = ⋃_{k=1}^∞ (I_k + x),

which is easily checked and is still a disjoint union, so by countable additivity and the fact that m(I_k + x) = m(I_k) for each k, we have

m(U + x) = ∑_{k=1}^∞ m(I_k + x) = ∑_{k=1}^∞ m(I_k) = m(U).

Now given any subset A ⊆ R^n, by Theorem 4.11 we have

m∗(A + x) = inf{m(U) ; A + x ⊆ U, U open}
          = inf{m(U) ; A ⊆ U − x, U open}
          = inf{m(V + x) ; A ⊆ V, V open}   (put V = U − x)
          = inf{m(V) ; A ⊆ V, V open}
          = m∗(A).
Step 2: We now consider Lebesgue measurability. Suppose that A ∈ M^n. Then from Corollary 4.9 we know that

A = F ∪ N,

where F ∈ B^n and N has measure zero, and hence,

A + x = (F + x) ∪ (N + x).

Since translations are homeomorphisms, by Proposition 1.15 they preserve Borel sets; hence F + x is a Borel set. Also, by Step 1 we have m∗(N + x) = m∗(N) = 0, so N + x has measure zero. Thus, by Corollary 4.9, A + x is Lebesgue measurable. Conversely, if A + x is Lebesgue measurable, then the argument above shows that (A + x) + y is Lebesgue measurable for any y ∈ R^n. Taking y = −x shows that (A + x) + (−x) = A is Lebesgue measurable. The proof that A is Lebesgue measurable if and only if rA is Lebesgue measurable has a similar flavor and is left to the reader.
The Cube Principle is quite handy; for example, we see this principle again in Theorem 4.14 in Section 4.4.

4.3.5. Completions of general measures. We now discuss the important subject of completions. To begin, recall that if µ : I → [0, ∞] is a σ-finite measure on a semiring I, then from Carathéodory's Theorem (Theorem 3.11) we know that µ∗ : M_µ∗ → [0, ∞] is a complete measure, and from Corollary 4.8 above we know that A ∈ M_µ∗ if and only if A = F ∪ N, where F ∈ S(I) and N is a subset of an element of S(I) of measure zero. Moreover, since N has measure zero, it follows that µ∗(A) = µ∗(F). These properties serve as a guide to make an arbitrary measure complete. Indeed, let us consider an arbitrary measure µ : S → [0, ∞] on a σ-algebra S. We denote by S̄, called the completion of S with respect to µ, the collection of all sets of the form F ∪ N, where F ∈ S and N is a subset of an element of S of µ-measure zero. In Problem 9 — a must-do exercise! — you will prove that S̄ is a σ-algebra. We define the completion of µ, µ̄ : S̄ → [0, ∞], as follows: If A = F ∪ N ∈ S̄, then

µ̄(A) := µ(F).

In Problem 9 you will show that µ̄ is a complete measure on S̄, and you'll prove that the completion of Borel measure (R^n, B^n, m) is Lebesgue measure (R^n, M^n, m). We summarize these results in the following theorem.

Completions of measures

Theorem 4.13. If µ : S → [0, ∞] is a measure on a σ-algebra S, then S̄ is a σ-algebra and µ̄ : S̄ → [0, ∞] is a complete measure on S̄. Moreover, the completion of Borel measure (R^n, B^n, m) is Lebesgue measure (R^n, M^n, m).

◮ Exercises 4.3.

1. In this problem we look at various examples of Gδ and Fσ sets.
(a) Show that a countable union of Fσ sets is an Fσ set.
(b) Show that a countable intersection of Gδ sets is a Gδ set.
(c) Show that every countable subset of R^n is an Fσ.
(d) For a < b, show that the intervals (a, b), [a, b], and (a, b] are both Fσ and Gδ sets.
(e) In R^1, show that the rational numbers form an Fσ and the irrational numbers form a Gδ.
(f) Show that every open set and every closed set in R^n is both a Gδ and an Fσ.
(g) Let f : R → R. (i) Show that the set of points where f is continuous (call it C_f) is a Gδ. Suggestion: Show that C_f = ⋂_{n=1}^∞ G_n with

G_n := { c ∈ R ; there is a δ > 0 such that x, x′ ∈ (c − δ, c + δ) =⇒ |f(x) − f(x′)| < 1/n },
and show that G_n is open. (ii) Show that D_f = {c ∈ R ; f is not continuous at c}, the set of discontinuity points of f, is an Fσ set.
2. (Littlewood's First Principle(s) for general additive set functions) In this problem we generalize Littlewood's First Principles for R^n to the general case. Let µ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. Prove the following:
(1) Let A ⊆ X with µ∗(A) < ∞. Then A is µ∗-measurable if and only if given ε > 0 there is an I ∈ R(I) with

µ∗(A \ I) < ε and µ∗(I \ A) < ε.

(2) Let A ⊆ X with µ∗(A) < ∞. Then A is µ∗-measurable if and only if there is a B ∈ S(I) such that

A ⊆ B with µ∗(B \ A) = 0 and µ∗(A) = µ∗(B).
4.3. MEASURABILITY AND LITTLEWOOD’S FIRST PRINCIPLE(S)
If we drop the assumption µ∗ (A) < ∞, give a counterexample to the "only if" statement.
(3) Assume now that µ is σ-finite and let A ⊆ X (without assuming µ∗ (A) < ∞). Prove that A is µ∗ -measurable if and only if we have the equalities:

inf{µ(B) ; A ⊆ B, B ∈ S(I)} = µ∗ (A) = sup{µ(C) ; C ⊆ A, C ∈ S(I)}.
3. Let µ : I n → [0, ∞) be an additive set function on I n . (a) Prove the equivalence of measurability, Property (3), and Property (5) in Littlewood’s Theorem 4.10. (b) Prove that a set A ⊆ Rn is µ∗ -measurable if and only if we have the equalities: inf{µ∗ (U) ; A ⊆ U, U open} = µ∗ (A) = sup{µ∗ (K) ; K ⊆ A, K compact}.
4. (Nonmeasurable sets) Let X = R2 and let I = {I × R ; I ∈ I 1 }. (i) Prove that I is a semiring of subsets of X, and let µ : I → [0, ∞) be defined by µ(I × R) := m(I), where m(I) is the usual Lebesgue measure of I. Prove that µ is a σ-finite measure. (ii) Prove that A ∈ Mµ∗ if and only if A = B × R where B ∈ M 1 , that is, B is a Lebesgue measurable subset of R. Conclude that given any nonempty subsets B, C ⊆ R with C ≠ R, the set B × C ⊆ X is nonmeasurable. For example, [0, 1] × [0, 1] is nonmeasurable.

5. (Steinhaus' theorem) In this problem we prove a fascinating result due to Hugo Steinhaus (1887–1972), proved in 1920. His theorem states that if A ⊆ Rn is Lebesgue measurable and m(A) > 0, then the difference set A − A := {x − y ; x, y ∈ A} contains an open ball centered at the origin. To prove this, proceed as follows. (i) Let A ∈ M n with m(A) > 0. Prove that there is a compact set K and an open set U with K ⊆ A ⊆ U such that 0 < m(U) < 2m(K). Suggestion: Use Theorem 4.11 to find a U and K satisfying these properties. (ii) For any r > 0, let Br ⊆ Rn denote the open ball of radius r centered at the origin. Prove that there is a δ > 0 such that for all x ∈ Bδ , we have x + K := {x + y ; y ∈ K} ⊆ U. Suggestion: Argue by contradiction. (iii) Prove that for all x ∈ Bδ , we have (x + K) ∩ K ≠ ∅. Conclude that for all x ∈ Bδ , we have (x + A) ∩ A ≠ ∅. (iv) Finally, prove that Bδ ⊆ A − A. (v) Basically redoing your proof, show that if µ : B n → [0, ∞) is a translation invariant regular Borel measure and A ⊆ Rn is a Borel set with µ(A) > 0, then A − A contains an open ball centered at the origin.

6. (Cauchy's functional equation III) Please review Problem 6 in Exercises 1.6. Using Steinhaus' theorem, prove that if f : R → R is additive and bounded on a measurable set of positive measure, then f (x) = f (1) x for all x ∈ R. Suggestion: Show there is an open interval containing the origin on which f is bounded.

7.
Prove that any measure on the Borel subsets of Rn that is finite on the compact sets is automatically a regular Borel measure. Explicitly, given a measure µ : B n → [0, ∞] on the Borel sets of Rn such that µ(K) < ∞ for any compact set K ⊆ Rn , prove (1) For every Borel set A ⊆ Rn , we have µ(A) = inf{µ(U) ; A ⊆ U, U open}.
(2) For every Borel set A ⊆ Rn , we have µ(A) = sup{µ(K) ; K ⊆ A, K compact}.
4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS
Suggestion: Define µ0 on I n by µ0 (I) = µ(I) for each I ∈ I n . Show that Theorem 4.11 applies to µ0 and show that µ0∗ = µ on B n .

8. (Littlewood's First Principle(s) for regular Borel measures) Prove that properties (2)–(6) of Littlewood's First Principle(s) for Rn can be generalized, verbatim, to a σ-finite regular Borel measure µ on a topological space X. For example, (2) in this general case is the following: A subset A ⊆ X is µ∗ -measurable if and only if given ε > 0 there is an open set U ⊆ X such that

A ⊆ U  and  µ∗ (U \ A) < ε.
Similarly, prove the analogous statements for properties (3)–(5).

9. (Completion of a measure) Let µ be a measure on a σ-algebra S . We denote by S̄ , called the completion of S with respect to µ, the class of all sets of the form F ∪ N , where F ∈ S and N is a subset of an element of S of measure zero.
(i) Prove that S̄ is a σ-algebra.
(ii) Define µ̄ : S̄ → [0, ∞] by µ̄(A) := µ(F ), where A = F ∪ N with F ∈ S and N a subset of an element of S of measure zero. Show that µ̄ is well-defined; in other words, if A = F ′ ∪ N ′ is another presentation of A, prove that µ(F ) = µ(F ′ ).
(iii) Show that µ̄ is a complete measure on S̄ . We call µ̄ the completion of µ.
(iv) Prove that if B ⊆ A ⊆ C with B, C ∈ S and µ(C \ B) = 0, then A ∈ S̄ and µ(B) = µ̄(A) = µ(C).
(v) Let µ∗ denote the outer measure generated by the measure µ : S → [0, ∞]. If µ is σ-finite, prove that S̄ = Mµ∗ and µ̄ = µ∗ . Thus, for σ-finite measures the completion of the measure is just the outer measure µ∗ on Mµ∗ ; in particular, the completion of Borel measure (Rn , B n , m) is Lebesgue measure (Rn , M n , m).
(vi) We show that the σ-finite assumption is needed in (v). Let X be an uncountable set, S the σ-algebra of all subsets of X that are countable or have countable complements, and let µ be the counting measure on S ; thus, for each A ∈ S , µ(A) is the number of points in A. Show that S̄ = S , but Mµ∗ = P(X).
4.4. Geometry, Vitali’s nonmeasurable set, and paradoxes In this section we show that Lebesgue measure has all the geometric properties that our intuitive notions of length, area, and volume would lead us to expect. We also construct the famous Vitali set, a set that is not Lebesgue measurable, and we shall study A paradox, a paradox, A most ingenious paradox! 2 4.4.1. The geometry of Lebesgue measure. Recall from Proposition 1.15 that Borel sets of topological spaces are preserved under homeomorphisms. This is false for Lebesgue measurability as you’ll prove in Problem 7 in Exercises 4.5. However, Lebesgue measurability is preserved under all affine transformations of Euclidean space, which are linear transformations followed by translations; in other words, an affine transformation is a map on Rn of the form Rn ∋ x
7−→
b + T x ∈ Rn ,
for some fixed b ∈ Rn and linear transformation T : Rn → Rn . (If you need to review linear transformations, see Section ?? of the Appendix.)

2Taken from The Pirates of Penzance by Gilbert and Sullivan.

When T is an orthogonal transformation (a composition of rotations and reflections), the affine
transformation is called a rigid motion3 of Euclidean space, and in this case not only is measurability preserved, measure is also preserved. It's, of course, "obvious" that measure does not depend on rigid motions: a box has the same volume when it is sitting flat on a table as when it is tipped on its side, as in Figure 4.2. The
Figure 4.2. Measure should not depend on whether we look at an object straight on or with our head turned.
invariance of measure under rigid motions follows from Proposition 4.12 of the last section and Theorem 4.14 below. The following theorem establishes the "fact" we all learned in linear algebra when we were first introduced to the determinant: | det T | = the factor by which a linear transformation T changes volume. In particular, since | det O| = 1 for any orthogonal matrix O, it follows that volume is invariant under orthogonal transformations, as it should be.

Linear transformations and Lebesgue measure

Theorem 4.14. For any linear transformation T : Rn → Rn and for any set A ⊆ Rn , we have

(4.9)    m∗ (T (A)) = | det T | m∗ (A).
Moreover, if A is Lebesgue measurable, then T (A) is Lebesgue measurable; the converse holds if T is invertible.

Proof : The proof of this theorem is a little long, so it might be a good idea to skim it on a first reading of this section. The proof of (4.9) consists of two parts. In the first part, which is the easy part, we prove that for any invertible linear transformation T : Rn → Rn and subset A ⊆ Rn , we have

(4.10)    m∗ (T (A)) = D(T ) m∗ (A),
where D(T ) := m∗ (T ((0, 1]n )). In the second part, which is more difficult, we show that D(T ) = | det T |. We break up our proof into several steps. In Steps 1–3 we work only with invertible linear transformations, and in Steps 4 and 5 we consider noninvertible transformations.

Step 1: We prove (4.10). The "Cube Principle" applies to this situation (you should check this), so we just have to prove (4.10) for cubes. Let I = (a, b] ∈ I n be a cube, where a = (a1 , . . . , an ) and b = (b1 , . . . , bn ). Since I is a cube, which we may assume is nonempty, we have bk = ak + c for all k, for some c > 0. Hence,

I = (a, b] = a + (0, c]n = a + c(0, 1]n .

3Some authors require that det T > 0, which is to say, T is a rotation.
Therefore, by linearity, T (I) = T (a) + c T ((0, 1]n ). By the translation and dilation properties of measure, we conclude that

m∗ (T (I)) = m∗ (c T ((0, 1]n )) = cn D(T ) = D(T ) m(I),

since m(I) = cn . This proves (4.10). To finish proving our theorem, we just have to prove that D(T ) = | det T |. To do so, the idea is to show that D(T ) behaves like | det T | with respect to products and inverses.

Step 2: We claim that

(4.11)    D(I) = 1 ,    D(T S) = D(T ) D(S) ,    D(T −1 ) = D(T )−1 ,

where I is the identity transformation and T and S are invertible linear transformations on Rn . Since I((0, 1]n ) = (0, 1]n , it follows that D(I) = 1, and by (4.10) we have

D(T S) = m∗ (T S((0, 1]n )) = D(T ) m∗ (S((0, 1]n )) = D(T ) D(S).

In particular,

I = T T −1  =⇒  1 = D(I) = D(T T −1 ) = D(T ) D(T −1 ),
which implies that D(T −1 ) = D(T )−1 , as required.

Step 3: We now prove that D(T ) = | det T |. By Theorem ?? we know that any invertible matrix can be written as a product of elementary matrices, so T can be written in the form T = E1 E2 · · · EN , where E1 , . . . , EN are elementary matrices; for a review of elementary matrices, please see Section ?? of the Appendix. Therefore, by the multiplicative property of D found in the second equality in (4.11), we have

D(T ) = D(E1 )D(E2 ) · · · D(EN ).

Since

det T = (det E1 )(det E2 ) · · · (det EN ),

to prove that D(T ) = | det T | all we have to do is prove that D(E) = | det E| for any elementary matrix E. Now there are two types of elementary matrices: "type I" matrices of the form Ei (a) and "type II" matrices of the form Eij (a). Here, for a ∈ R, a ≠ 0, Ei (a) is the elementary matrix given by the operation "multiply the i-th row by a", and for a ∈ R and i ≠ j, Eij (a) is the elementary matrix given by the operation "add a times the j-th row to the i-th row".

Consider a type I matrix Ei (a). Then

Ei (a)((0, 1]n ) = (0, 1] × · · · × Ii × · · · × (0, 1],

where all the factors equal (0, 1] except the i-th one, which is Ii = (0, a] if a > 0 or Ii = [a, 0) if a < 0. From this formula, we see that

(4.12)    D(Ei (a)) = m((0, 1] × · · · × Ii × · · · × (0, 1]) = |a|.

On the other hand, Ei (a) is obtained from the identity matrix by replacing the 1 in the i-th diagonal spot by a. Thus, det Ei (a) = a, so D(E) = | det E| for E of type I.

Now consider a type II matrix Eij (a), where a ∈ R. Given such a matrix, we leave you to verify the identity Eij (a)−1 = Ei (−1)Eij (a)Ei (−1). (Hint: First prove that Eij (a)−1 = Eij (−a).) Therefore, by the multiplicative property of D, the second formula in (4.11), we see that

D(Eij (a)−1 ) = D(Ei (−1)) D(Eij (a)) D(Ei (−1)).
By the third formula in (4.11), we have D(Eij (a)−1 ) = D(Eij (a))−1 , and by (4.12) we have D(Ei (−1)) = D(Ej (−1)) = 1. Therefore, D(Eij (a))−1 = D(Eij (a)). Hence, D(Eij (a))² = 1 and so D(Eij (a)) = 1. Since det Eij (a) = 1 as well, as you can easily verify, this shows that D(E) = | det E| for E of type II. This completes the proof of (4.9) for invertible linear transformations T .

Step 4: We now prove (4.9) assuming that T is not invertible. For T noninvertible, | det T | = 0, so the equality (4.9) will hold if we can prove that m∗ (T (A)) = 0 for any A ⊆ Rn . To prove this, note that by Theorem ?? in the Appendix, we can write T in the form T = SR, where S is an invertible matrix (a product of elementary matrices) and R is a matrix with at least one zero row, say the k-th row, where 1 ≤ k ≤ n. By (4.9) for invertible transformations we know that for any subset A ⊆ Rn ,

m∗ (T (A)) = m∗ (S(R(A))) = | det S| m∗ (R(A)).
Now, by the fact that the k-th row of R is zero, we have R(A) ⊆ Rk−1 × {0} × Rn−k , and since m∗ (Rk−1 × {0} × Rn−k ) = 0 (can you prove this?), it follows that m∗ (R(A)) = 0. Thus, m∗ (T (A)) = 0.

Step 5: We now prove the last statement of our theorem: if A is Lebesgue measurable, then T (A) is Lebesgue measurable, with the converse holding if T is invertible. Let A ∈ M n and assume first that T is not invertible. Then by Step 4 we know that T (A) has measure zero and hence is measurable. Assume now that T is invertible. Then from Corollary 4.9 we know that A = F ∪ N , where F ∈ B n and N has measure zero. Observe that

T (A) = T (F ) ∪ T (N ).
Since T defines a homeomorphism on Rn (being invertible), by Proposition 1.15, T (F ) is a Borel set. Also, by (4.9) we have m∗ (T (N )) = | det T | m∗ (N ) = 0, so T (N ) has measure zero. Thus, by Corollary 4.9, T (A) is Lebesgue measurable. Hence, we have shown that if A is measurable, then T (A) is measurable. Conversely, if T (A) is Lebesgue measurable, then (applying the previous statement to T −1 ) it follows that T −1 (T (A)) = A is Lebesgue measurable.
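The elementary-matrix identities used in Step 3, namely Eij (a)−1 = Eij (−a) and Eij (a)−1 = Ei (−1)Eij (a)Ei (−1), can be checked mechanically for concrete values. The sketch below is only an illustration; the choices n = 3, i = 0, j = 2, a = 5, and all function names are ours, not the text's.

```python
# Numerical check of the elementary-matrix identities from Step 3 (n = 3):
# E_ij(a)^{-1} = E_ij(-a)  and  E_ij(a)^{-1} = E_i(-1) E_ij(a) E_i(-1).

def identity(n=3):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def E_row_scale(i, a, n=3):          # "type I": multiply row i by a
    M = identity(n); M[i][i] = a; return M

def E_row_add(i, j, a, n=3):         # "type II": add a * (row j) to row i
    M = identity(n); M[i][j] = a; return M

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

i, j, a = 0, 2, 5.0
lhs = matmul(E_row_add(i, j, a), E_row_add(i, j, -a))           # E_ij(a) E_ij(-a)
rhs = matmul(E_row_scale(i, -1), matmul(E_row_add(i, j, a), E_row_scale(i, -1)))
assert lhs == identity()                 # so E_ij(a)^{-1} = E_ij(-a), as in the Hint
assert rhs == E_row_add(i, j, -a)        # the identity E_i(-1) E_ij(a) E_i(-1) = E_ij(-a)
print("elementary-matrix identities verified for i=0, j=2, a=5")
```

A check at one set of values is not a proof, of course, but it makes the algebra in Step 3 concrete.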
In undergraduate vector calculus courses, determinants are usually related to volumes of parallelepipeds in R3 . We can also do the same in Rn . Let V = {v1 , . . . , vn } be a basis of Rn , so that the vectors v1 , . . . , vn are linearly independent. We define the parallelepiped P(V ) spanned by V as the set of points

P(V ) := {x1 v1 + x2 v2 + · · · + xn vn ; 0 ≤ xi ≤ 1 , i = 1, . . . , n}.

When n = 2, P(V ) is shown in Figure 4.3, and it's usually called a parallelogram instead of a parallelepiped. In Problem 2, you will prove that if A = [v1 · · · vn ] is the matrix with columns v1 , . . . , vn , then m(P(V )) = | det A|.
Figure 4.3. The absolute value of the determinant of the matrix with columns v1 , v2 is equal to the area of the parallelogram spanned by v1 , v2 .
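The formula m(P(V )) = | det A| from Problem 2 (and, more generally, (4.9)) can be checked numerically. The following sketch is illustrative only; the vectors v1 = (2, 1), v2 = (1, 3), the bounding box, and the sample count are arbitrary choices of ours. It estimates the area of the parallelogram spanned by v1 , v2 by Monte Carlo sampling and compares it with | det A| = 5.

```python
import random

# Parallelogram spanned by v1, v2; exact area should be |det [v1 v2]| = |2*3 - 1*1| = 5.
v1, v2 = (2.0, 1.0), (1.0, 3.0)
det = v1[0]*v2[1] - v1[1]*v2[0]          # det of the matrix with columns v1, v2

def in_parallelogram(x, y):
    # Solve (x, y) = a*v1 + b*v2 by Cramer's rule; inside iff 0 <= a, b <= 1.
    a = (x*v2[1] - y*v2[0]) / det
    b = (v1[0]*y - v1[1]*x) / det
    return 0.0 <= a <= 1.0 and 0.0 <= b <= 1.0

random.seed(0)
trials = 200_000
# Bounding box [0,3] x [0,4] contains the parallelogram (vertices 0, v1, v2, v1+v2).
hits = sum(in_parallelogram(random.uniform(0, 3), random.uniform(0, 4))
           for _ in range(trials))
estimate = 12.0 * hits / trials          # box area is 3*4 = 12
print(f"Monte Carlo area ≈ {estimate:.2f}, |det| = {abs(det)}")
```

With 200,000 samples the estimate typically agrees with | det A| to within a few hundredths; this is evidence for Problem 2, not a proof.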
4.4.2. Vitali’s remarkable set. The first person to exhibit a nonmeasurable set was Giuseppe Vitali (1875–1932), who did so in his 1905 paper “Sul problema della misura dei gruppi di punti di una retta” [298] (On the problem of measure of the set of points of a line). In 1908, Edward Burr Van Vleck (1863–1943) [296] without knowledge of Vitali’s work also constructed such a set. Vitali’s set was a subset of (0, 1/2), but in fact, given any set A ⊆ Rn of positive outer measure, Vitali’s proof produces a Giuseppe nonmeasurable subset V of A; we present this proof in Theorem 4.15 beVitali (1875– low. (Note that since any set with zero outer measure is measurable, there 1932). never exists a non-measurable subset of a set of measure zero; this is why we assume positive outer measure.) Before going to Vitali’s theorem, we describe intuitively how to visualize Vitali’s set V . Assume that A is measurable with finite measure. Then in Problem 8 you will prove that Vitali’s subset V ⊆ A has the following interesting properties: (4.13)
(1) m∗ (V ) > 0
and (2) m∗ (A) = m∗ (A \ V ).
Of course, (1) is obvious: if m∗ (V ) = 0, then V is measurable, which we are claiming is false. It's (2) that is interesting, because it can be interpreted as saying that V has no volume (it says that even after you subtract V from A, the resulting set still has the measure of A). To summarize, (1) says V has volume while (2) says V has no volume! Because of this, one can visualize V as a "foggy" set, which in one sense takes up space but in another sense is void of substance; see Figure 4.4. (Comparison with a cloud is a poor analogy, so don't make too much of it!) More generally, from Section 4.3.1 we know that any nonmeasurable set is a set with a "blurry" or "cloudy" boundary.
Figure 4.4. Here’s a real cloud, courtesy of Fir0002/Flagstaffotos.
Vitali’s theorem Theorem 4.15. Any subset of Euclidean space with positive Lebesgue outer measure has a subset that is not Lebesgue (and hence, not Borel) measurable.
Proof : We first reduce to the case when A is bounded; then we give Vitali's proof.

Step 1: Let A be any subset of Rn with nonzero Lebesgue outer measure. Then A ⊆ Rn = ∪_{k=1}^∞ [−k, k]n , so intersecting with A we get

A = ∪_{k=1}^∞ (A ∩ [−k, k]n ).

Since m∗ is countably subadditive, we conclude that

0 < m∗ (A) ≤ Σ_{k=1}^∞ m∗ (A ∩ [−k, k]n ).

Thus, m∗ (A ∩ [−k, k]n ) is nonzero for some k. It suffices to prove the theorem for the set A ∩ [−k, k]n , so we assume from now on that A ⊆ [−a, a]n for some real a > 0.

Step 2: We now partition Rn in a special way. Given any two n-tuples x, y ∈ Rn , we write x ∼ y if x − y is rational, that is, an element of Qn .
It is easy to check that this relation is an equivalence relation; that is, for all x, y, z ∈ Rn ,

x ∼ x ,    x ∼ y =⇒ y ∼ x ,    x ∼ y & y ∼ z =⇒ x ∼ z.

Then from the elementary theory of equivalence relations, ∼ partitions Rn into equivalence classes, that is, nonempty pairwise disjoint subsets whose union is Rn such that two points x, y ∈ Rn belong to the same set in the partition if and only if x ∼ y. Figure 4.5 shows a picture.

Figure 4.5. On the left is an abstract picture of Rn as a rectangle and in the middle is a schematic picture of Rn partitioned into equivalence classes pictured as horizontal strips. The picture shows only finitely many equivalence classes, although in reality there are uncountably many and the equivalence classes are quite complicated (impossible to draw!), unlike this very simplified drawing!

For example, in the case n = 1, here are some examples of equivalence classes:

√2, √2 + 1/2, √2 − 1, √2 + 1, √2 + 1/3, . . .
e, e + 1/2, e − 1, e + 1, e + 1/3, . . .
π, π + 1/2, π − 1, π + 1, π + 1/3, . . .

Note that each equivalence class is countable. Indeed, if we fix an element v in an equivalence class, then given any other element x of the same equivalence class, we have x − v = r ∈ Qn . Thus, x = v + r, so all other elements of the same class are obtained from the fixed element v by adding a suitable rational n-tuple. Since the set of all rational n-tuples is countable, it follows that each equivalence class is countable. Consequently, there are uncountably many equivalence classes;
indeed, otherwise Rn would be a countable union of countable sets and hence countable, which, of course, we know is false.

Step 3: We now construct Vitali's nonmeasurable set. The partition of Rn also partitions A into pairwise disjoint sets, as shown in the right-hand picture in Figure 4.5 and the left-hand picture in Figure 4.6. Now choose a point from each partition set of A and let V be the set of all such points. Here's a picture of V , where in the right-hand picture we reiterate what we discussed at the end of Step 2:
Figure 4.6. On the left, A is partitioned by the equivalence classes of Rn and in the middle we form V by choosing a point from each partition set of A. (Each equivalence class is countable and there are uncountably many equivalence classes.) Given v ∈ V and r ∈ Qn , the point v + r is in the same equivalence class as v, and all points in the equivalence class of v are of the form v + r for some r ∈ Qn .

Notice that we can write A in terms of the Vitali set V as follows: Given v ∈ V , if we let Av := {r ∈ Qn ; v + r ∈ A}, then we can write

(4.14)    A = {v + r ; v ∈ V and r ∈ Av }.

Since V ⊆ A ⊆ [−a, a]n , we leave you to show that given v ∈ V , we have Av ⊆ Qn ∩ [−2a, 2a]n (just show that if x ∈ [−a, a]n and x + y ∈ [−a, a]n , then y ∈ [−2a, 2a]n ). In particular, if we put Q = Qn ∩ [−2a, 2a]n , then since for any v ∈ V we have Av ⊆ Q, it follows that A ⊆ W , where

W := {v + r ; v ∈ V and r ∈ Q} = ∪_{r∈Q} (V + r).
Here’s a picture of what’s going on:
Figure 4.7. Three panels: given v ∈ V , the shaded area on the left is the set {v + r ; r ∈ Av }; the shaded rectangle in the middle is the set {v + r ; r ∈ Qn ∩ [−2a, 2a]n }; and the union of the shaded rectangles on the right is W . Since Av ⊆ Q = Qn ∩ [−2a, 2a]n , the shaded area on the left is a subset of the shaded rectangle in the middle.

Since V ⊆ [−a, a]n and Q ⊆ [−2a, 2a]n , we have W ⊆ [−3a, 3a]n . Thus,

(4.15)    A ⊆ W ⊆ [−3a, 3a]n .
Step 4: So far we have just been trying to understand Vitali’s interesting set. We now show that it’s not measurable. To do so, let’s assume that V
is measurable and derive a contradiction. Since V is assumed measurable, by translation invariance of Lebesgue measure, any translate of V is measurable with measure equal to the measure of V . Hence, W is measurable. Noting that Q is countable (it's a subset of Qn , which is countable) and the (V + r)'s are disjoint for different r's, by countable additivity of Lebesgue measure, we have

m(W ) = Σ_{r∈Q} m(V ).

This is an infinite series of the constant number m(V ). Thus, either m(W ) = ∞ (if m(V ) > 0) or m(W ) = 0 (if m(V ) = 0). However, according to (4.15), we have

m∗ (A) ≤ m(W ) ≤ m([−3a, 3a]n ) = (6a)n .

Recalling that m∗ (A) > 0, it follows that m(W ) is some positive finite number and hence can equal neither 0 nor ∞. This contradiction completes our proof.
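The combinatorial skeleton of Steps 3 and 4 (partition into classes, a set V of one representative per class, and disjoint translates V + r covering everything) can be played out harmlessly in a finite toy model, with the cyclic group Z12 in place of Rn and the subgroup {0, 4, 8} in place of Qn . The sketch below is our illustration, not part of the proof; in the finite setting counting measure is defined on every subset, so no contradiction arises.

```python
# Toy model of Vitali's construction in the finite group Z_12.
# Equivalence: x ~ y iff x - y lies in the subgroup Q = {0, 4, 8} (mod 12).
n = 12
Q = {0, 4, 8}

# Partition Z_12 into equivalence classes (cosets of Q).
classes = {}
for x in range(n):
    rep = min((x - q) % n for q in Q)      # canonical label for x's class
    classes.setdefault(rep, set()).add(x)

# "Choice set" V: one representative from each class.
V = {min(c) for c in classes.values()}

# The translates V + q (q in Q) are pairwise disjoint and tile Z_12,
# just like the sets V + r in Step 4 -- but here counting measure gives
# |Z_12| = |Q| * |V| with no contradiction.
translates = [{(v + q) % n for v in V} for q in Q]
union = set().union(*translates)
assert union == set(range(n))
assert sum(len(t) for t in translates) == n   # disjointness
print(f"|V| = {len(V)}, |Q| = {len(Q)}, |Q|*|V| = {len(Q)*len(V)} = |Z_12|")
```

The paradox in Step 4 arises precisely because Lebesgue measure, unlike counting measure here, must assign every translate V + r the same measure while Q is infinite.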
If A is measurable with positive finite measure, then recall from (4.13) that m∗ (V ) > 0 and m∗ (A) = m∗ (A \ V ). In particular, although A = V ∪ (A \ V ), we have m∗ (A) < m∗ (V ) + m∗ (A \ V ); that is, the sum of the volumes of the parts is greater than the volume of the whole! This seems paradoxical because it violates “conservation of mass”. However, conservation of mass is technically only valid for objects that have well-defined masses, so to solve this paradox we just have to accept that nonmeasurable sets don’t have well-defined masses and hence the conservation of mass does not apply to nonmeasurable sets. Here’s a related result, which you’ll prove in Problem 9: Paradoxical decompositions Corollary 4.16. If A ⊆ Rn is measurable with positive finite measure, then given any nonmeasurable set B ⊆ A, we have m∗ (A) < m∗ (B) + m∗ (A \ B).
From this corollary, one can imagine taking a pea and dissecting it into not just two but many, many pieces so that the sum of the volumes of the parts is larger than the volume of the sun. This, in fact, can be done, and you'll prove it in Problem 10. This is the secret to the Banach-Tarski paradox we'll look at in Section 4.4.4. Here's another interesting corollary.

Corollary 4.17. There is no translation invariant measure on P(Rn ) that extends Lebesgue measure m : I n → [0, ∞). In fact, there is no translation invariant measure on P(Rn ) that assigns nonzero finite values to bounded nonempty sets.

Proof : Assume there is a translation invariant measure µ : P(Rn ) → [0, ∞] that assigns nonzero finite measures to bounded nonempty sets. Define the set V as in Step 3 of Theorem 4.15. Then, leaving the details to you, if you repeat Step 4 of Theorem 4.15 with µ instead of m, you'll show that µ(W ) = ∞ or µ(W ) = 0, both of which are impossible.
4.4.3. Vitali’s secret. In a footnote at the end of the Lebesgue’s 1907 paper [167, p. 212], Lebesgue remarked that Vitali had constructed a nonmeasurable set: I would add that the existence, in the idealistic sense, of nonmeasurable sets has been shown by Mr. Vitali.
Now why did Lebesgue say “in the idealistic sense”? To find out, let’s review how we defined V . We were given a partition of A; recall that there were uncountably many partition sets, where each partition set was countable. We then chose a point from each partition set. Here’s a picture to contemplate:
Figure 4.8. V was obtained by "choosing" a point from each partition set of A. Take a partition set of A and look at its points; say we denote them by a, b, c, d, e, . . .. Which of these points is in V ? The answer is: who knows! We know (because of how V is defined) that V contains one of these points, but we don't know which one!
In other words, we didn’t give a “rule” how to choose each point from each partition, we simply said to “choose” one and let V be the set of points chosen. How do we know we can simultaneously choose a point from each partition set (remember there are uncountably many such partition sets in A) and gather them all together in a set V ? Well, the true answer is that we have to take it by faith that we can do so; in mathematical terms we have to take it as an axiom that we can do so! This axiom, as you should already know, is the Axiom of Choice, Ernst Zermelo introduced by Ernst Zermelo (1871–1953) in 1904 [295, p. 139-141]. This (1871–1953). axiom states that given any collection C of nonempty sets, we can form a new set, called a choice set, by choosing4 an element from each set in C . Here’s a picture: α
Figure 4.9. On the left we have a collection of nonempty sets, C = {A, B, C, . . .}, and on the right is a choice set {α, β, γ, δ, ε, . . .}, obtained by taking an element from each of the nonempty sets.
Although the Axiom of Choice probably seems perfectly logical, even self-evident, note that the choice set is inherently nonconstructive: the Axiom of Choice only says that a choice set exists; it doesn't tell you how its elements are obtained or even what its elements are! Now in the days of Lebesgue, there were two camps: "Empiricists," who only accepted objects that could be explicitly defined by some rule (Lebesgue was an Empiricist), and "Idealists," who also accepted objects obtained by nonconstructive methods even though it's impossible to explicitly state the rule defining them (Zermelo was an Idealist). This explains why Lebesgue said "in the idealistic sense."

4More precisely, there is a function, called a choice function, with domain C such that f (A) ∈ A for all A ∈ C .

Finally, we remark that sometimes we don't need the Axiom of Choice, such as when C consists of only finitely many sets, or when there is a constructive way to choose the elements. For example, suppose that C is a collection of nonempty subsets of N. Then in Figure 4.9 we can define α to be the least element of A, β to be the least element of B, etc. In this way, we can explicitly construct a choice set. The Axiom of Choice is only needed when one needs a choice set without knowing how to explicitly choose the elements. This is exactly the situation in Vitali's proof: there are uncountably many partition sets of A and there is no way to give a "rule" to pick a point in any given partition set; thus we are forced to rely on the Axiom of Choice to produce V for us. Since Vitali uses the Axiom of Choice to construct his nonmeasurable set, which he denoted by G0 , he stated [298, p. 235]:5

Something could be objectionable about considering the set G0 . This can be fully justified if it is accepted that the continuum can be well ordered. For those who do not want to accept our result, it follows that: the possibility of the problem of the measure of sets of points of a straight line and the well ordering of the continuum cannot coexist.
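The "least element" rule just described really is an explicit choice function, and for subsets of N it can be written in one line. The sketch below (our illustration, with made-up sets) builds a choice set for a finite collection of nonempty subsets of N with no appeal whatsoever to the Axiom of Choice.

```python
# An explicit choice function for nonempty subsets of N: "pick the least element."
# Because this is a definable rule, no Axiom of Choice is needed here.
def choice(collection):
    # Map each set to its chosen element; min exists since N is well-ordered.
    return {frozenset(s): min(s) for s in collection}

C = [{3, 7, 9}, {2}, {10, 100}]
picks = choice(C)
choice_set = set(picks.values())
assert choice_set == {2, 3, 10}
print("choice set:", sorted(choice_set))
```

No such rule is available for the uncountably many partition sets in Vitali's proof, and that is exactly where the Axiom of Choice enters.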
Now even though we defined a nonmeasurable set using the Axiom of Choice, is it possible to define one without using the Axiom of Choice? Van Vleck thought so in his construction [296, p. 241]:

Thus it seems to me possible, and perhaps not difficult, to remove the arbitrary element of choice in my example by confining one's attention to a proper subset of the continuum, though as yet I have not succeeded in proving that this is possible.
In fact, it is impossible to "explicitly" produce a nonmeasurable set, where we'll explain what we mean by "explicit" below. Here's the explanation why, which requires us to review a bit of set theory. First of all, standard set theory is based on the Zermelo-Fraenkel axioms, named after Ernst Zermelo (1871–1953) and Abraham Fraenkel (1891–1965); the resulting axiomatic system is known as ZF set theory, which is sufficient for most of "elementary" mathematics. Because of the special character of the Axiom of Choice, this axiom is not part of ZF. However, since the Axiom of Choice seems self-evident (at least to me if I don't analyze it too much!), one might believe that it can be proved from ZF. But from the work of Kurt Gödel (1906–1978) and Paul Cohen (1934–2007), it's known that the Axiom of Choice is logically independent of ZF, which means one cannot prove or disprove the Axiom of Choice using only the axioms of ZF. In other words, in the ZF world, we are free to accept or decline the Axiom of Choice. If we accept it into ZF, we are using ZFC set theory, which is what mainstream mathematicians use. Now consider the question of whether or not one can "explicitly" produce a nonmeasurable set, which we shall take to mean producing one using only the ZF axioms. Robert Solovay (1938–) answered this question in his 1970 paper [265]; here's what he says:6

5Note that Vitali mentions the "well ordering of the continuum". The reason is that the Axiom of Choice is equivalent to the well-ordering principle (that any set can be well-ordered), a fact proved by Zermelo in 1904, a year before Vitali's paper appeared.
6More precisely, Solovay proved that the statement "every subset of R is Lebesgue measurable" is consistent under the axioms of ZF plus the axiom "I = there exists an inaccessible cardinal." See [265] for the precise statement of Solovay's result and see [303, Ch. 13] for more on the rôle of the Axiom of Choice in producing nonmeasurable sets.
We show that the existence of a non-Lebesgue measurable set cannot be proved in Zermelo-Frankel set theory (ZF) if use of the axiom of choice is disallowed.
In this sense, one cannot "explicitly" produce a nonmeasurable set.

4.4.4. A paradox? A paradox, A most ingenious paradox! Now a nonmeasurable set is not entirely paradoxical: think of a set with a very, very "blurry" boundary and it's not unthinkable that it can't be measured. However, the Axiom of Choice can actually produce entirely paradoxical results, as we'll see. The 1828 Webster's dictionary says that a paradox is a tenet or proposition contrary to received opinion, or seemingly absurd, yet true in fact. Here's a paradox (really a theorem, because it's proved), due to Felix Hausdorff (1868–1942), who published it in 1914 [118, 119, 63]:

(Hausdorff Paradox) There is a countable subset H of the sphere S2 such that S2 \ H can be divided into pairwise disjoint sets A, B and C such that A, B, C and B ∪ C are pairwise congruent.
By “congruent” we mean that any one of the sets A, B, C, C ∪ D be be obtained from any other one by a suitable rotation. Since H is countable let’s consider it as “negligible” and forget it. Then here’s a na¨ıve picture of the situation: A C
A B
C
B
Since A, B and C are pairwise congruent, we think of the sphere as divided as in the left-hand picture, so we can think of A as a third of the sphere. On the other hand, since A and B ∪ C are congruent, we think of the sphere as decomposed as in the right-hand picture, so we can think of A as half the sphere. Here lies the paradox, for according to the Hausdorff Paradox, we would then have 1/3 = 1/2! Of course, A, B and C are much more complicated than these simple pictures reveal; they are formed using the Axiom of Choice in a similar, but more complicated, manner as Vitali's set was formed. Because of this paradoxical decomposition of the sphere produced by the Axiom of Choice, Borel said [38, p. 210]:
Pretty strong words against the Axiom of Choice. Now if you think the Hausdorff Paradox was paradoxical, consider the Banach-Tarski Paradox (really a theorem, and it uses the Axiom of Choice), which will blow your mind.

A paradox? A paradox, A most ingenious paradox! We've quips and quibbles heard in flocks, But none to beat this paradox!7
One version of the Banach-Tarski paradox states that it is possible to cut up a solid ball into finitely many pieces,8 then re-assemble the pieces using only rigid motions,

7 Taken from The Pirates of Penzance by Gilbert and Sullivan.
8 Raphael M. Robinson (1911–1995) proved that five pieces (and no less) suffice [237].
4.4. GEOMETRY, VITALI’S NONMEASURABLE SET, AND PARADOXES
221
and end up with two solid balls again . . . the punchline is that each of the two solid balls has the same size as the original one:

Figure 4.10. Magic! Producing two balls identical to the original.

This theorem was proved by Stefan Banach (1892–1945) and Alfred Tarski (1902–1983) in 1924 [17]. To make this re-assembling language precise, given two subsets A, B ⊆ Rn, we shall call them congruent by dissection if they can be decomposed as finite unions of pairwise disjoint sets, $A = \bigcup_{k=1}^N A_k$ and $B = \bigcup_{k=1}^N B_k$, such that for each k, the set Ak is congruent to Bk, which means that there is a rigid motion Tk : Rn → Rn such that Tk(Ak) = Bk. For instance, any triangle is congruent by dissection to a rectangle, as seen in Figure 4.11.

Figure 4.11. Any triangle is congruent by dissection to a rectangle.

The magic trick of producing two balls from one is just the statement that any (solid) ball is congruent by dissection with two disjoint balls, each of which is identical in size to the original ball. In fact, it's possible to do even better:

(Banach-Tarski Paradox) Any two bounded objects of Rn with n ≥ 3 having nonempty interiors are congruent by dissection!
For example, one can take a very small solid ball, say the size of a pea, and cut it into finitely many pieces, then re-assemble the pieces using only rigid motions to produce a solid ball the size of the sun:

[Figure: a solid ball the size of a pea, re-assembled into a solid ball the size of the sun.]
Note that the pieces produced when cutting up the pea cannot all be Lebesgue measurable, because Lebesgue measure would preserve the measure of the pea; as with the Hausdorff Paradox, the Banach-Tarski Paradox uses the Axiom of Choice. You'll prove some baby versions of the Banach-Tarski paradox in Problem 11.

Example 4.6. This example is far from the Banach-Tarski paradox (because it uses countable instead of finite dissections), but it nonetheless gives a taste of it. We call two sets A and B congruent by infinite dissection if they can be decomposed as infinite unions of pairwise disjoint sets, $A = \bigcup_{k=1}^\infty A_k$ and $B = \bigcup_{k=1}^\infty B_k$, such that for each k, the set Ak is congruent
222
4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS
to Bk. Consider the set W from Step 3 in the proof of Vitali's theorem, which is a union of translates of the set V: $W = \bigcup_{k=1}^\infty (V + r_k)$,
where {r1, r2, r3, . . .} is an enumeration of the countable set Qn ∩ [−2a, 2a]n. We know that V + rk and V + rℓ are disjoint for k ≠ ℓ. We claim that W can be written as a disjoint union of two subsets, each of which is congruent by infinite dissection to W; that is, we claim that we can write W = A ∪ B,
where A and B are disjoint subsets of W, both of which are congruent by infinite dissection to W. Indeed, define $A = \bigcup_k A_k$ and $B = \bigcup_k B_k$, where $A_k = V + r_{2k}$ and $B_k = V + r_{2k-1}$. Then A and B are disjoint, W = A ∪ B, and we claim that both A and B are congruent by infinite dissection to W! To prove this for A, just observe that translating Ak by rk − r2k gives $A_k + (r_k - r_{2k}) = V + r_k$, so it follows that $A = \bigcup_k A_k$ is congruent by infinite dissection to $\bigcup_k (V + r_k) = W$. The proof for B is similar; just translate Bk by rk − r2k−1. (See Problems 11, 12, and 13 for related results.)
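Since the pieces are all translates V + rk, the proof above reduces to bookkeeping on the offsets rk, which can be checked mechanically. Here is a minimal sketch for n = 1 and a = 1; the particular enumeration r and the cutoff N are illustrative assumptions, and V itself cannot be computed, so each piece is represented purely by its offset.

```python
from fractions import Fraction

# Illustrative enumeration r_1, r_2, ... of (some of) the rationals in [-2, 2].
# Only finitely many terms are generated; the choice of enumeration is arbitrary.
r = [Fraction(p, q) for q in range(1, 5) for p in range(-2 * q, 2 * q + 1)]
r = list(dict.fromkeys(r))          # drop duplicates, keep enumeration order

# Each piece V + r_k is modeled purely by its offset r_k (V is not computable).
N = 6
A_offsets = [r[2 * k] for k in range(1, N)]        # A_k = V + r_{2k}
B_offsets = [r[2 * k - 1] for k in range(1, N)]    # B_k = V + r_{2k-1}
W_offsets = [r[k] for k in range(1, N)]            # target pieces V + r_k of W

# A and B are disjoint since their offsets are distinct rationals.
assert set(A_offsets).isdisjoint(B_offsets)

# Translating A_k by r_k - r_{2k} turns V + r_{2k} into V + r_k, piece by piece:
translated = [A_offsets[k - 1] + (r[k] - r[2 * k]) for k in range(1, N)]
assert translated == W_offsets
```

Of course this only verifies the combinatorics of the translations; the measure-theoretic content lives entirely in the nonconstructive set V.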
Now comes the obvious question: If the Axiom of Choice produces such out-of-this-world paradoxical results, why do we use it? Can't we just get along with ZF and forget ZFC? Well, it turns out that to have a "useful" theory of mathematics we have to have some axiom which allows us to choose elements from an infinite number of sets. Here are a few results from Real Analysis that we should all be familiar with:
(1) A set of real numbers is closed if and only if it contains all its limit points.
(2) A function f : R → R is continuous at a point a ∈ R if and only if it's sequentially continuous at a.
(3) R is not the countable union of countable sets.
All these results hold in ZFC and we would all agree they are useful for mathematics. For instance, what if (3) were false? Then the Lebesgue measure of R would be zero and there would be no measure theory! In fact, without the Axiom of Choice each of these three results could be false [142, Ch. 10], [125, Ch. 4]! In the end, I think it would be better to live in a mathematical world with the Axiom of Choice than without it,9 even though we have to live with strange paradoxes; indeed, instead of blaming the Axiom of Choice for these paradoxes, we could instead shift the blame onto the very complicated nonmeasurable sets for which the usual notion of volume does not apply! In the end, I agree with Solovay [265, p. 3]:

9 Here are theorems you might be familiar with that are equivalent to the Axiom of Choice: 1) The Cartesian product of nonempty sets is nonempty; 2) Every surjective function has a right inverse; 3) Every vector space has a basis; 4) Tychonoff's theorem (the product of compact topological spaces is compact).
Here is a small sampling of theorems we've proved in this book that use the Axiom of Choice: the Fundamental Lemma of Semirings (Lemma 1.3); that an intersection of a nonempty nonincreasing sequence of nonempty cylinder sets is nonempty (Lemma 3.6); the construction of outer measures from set functions (Theorem 3.9). It might be interesting to look back at other courses to see where the Axiom of Choice is used; e.g. it's used in the elementary fact that every infinite set has a countably infinite subset!
Of course, the axiom of choice is true, and so there are non-measurable sets.

◮ Exercises 4.4.

1. Let T be a noninvertible linear transformation on Rn. In the proof of Theorem 4.14, we showed that m∗(T(A)) = 0 for any A ⊆ Rn using elementary matrices. Here's another way to prove this property of T. Note that it suffices to prove that m∗(T(Rn)) = 0. (a) Since T is singular, show that some unit vector v ∈ Rn is orthogonal to the column space of T. If O is an orthogonal matrix with v as its first row, show that OT is a matrix with first row zero, and thus OT(Rn) ⊆ {0} × Rn−1. Using this fact, prove that m∗(OT(Rn)) = 0, and from this, deduce that m∗(T(Rn)) = 0. (b) If T is not invertible, is the statement "A ⊆ Rn is Lebesgue measurable if and only if T(A) is Lebesgue measurable" true?
2. Prove that if V = {v1, . . . , vn} is a basis of Rn and A = [v1 · · · vn] is the matrix with columns v1, . . . , vn, then m(P(V)) = | det A|, where P(V) is the parallelepiped spanned by the basis vectors in V.
3. In Step 3 of the proof of Theorem 4.14, we used that every matrix is a product of elementary matrices to derive the formula D(T) = | det T |. In this problem we give another method (amongst many others) to prove that D(T) = | det T |, which uses more linear algebra (and hence is a good review of some linear algebra facts). Assume Step 1 and Step 2 of Theorem 4.14. Assume that T is invertible. (i) Prove that T = M N where M is an orthogonal matrix and N is a symmetric matrix with | det T | = det N. Suggestion: Since T^t T is positive definite symmetric, where T^t is the transpose of T, Rn has an orthonormal basis of eigenvectors {vk} with corresponding positive eigenvalues {λk}. Define the positive definite symmetric matrix N by the condition $N v_k = \sqrt{\lambda_k}\, v_k$ for each k and set M := T N−1. Now prove that M is orthogonal. (ii) Since N is a symmetric matrix, prove (or recall) that N = O B O−1 where O is an orthogonal matrix and B is diagonal.
Conclude that D(T) = D(M) D(B). (iii) Prove that D(B) = | det B| and D(M) = 1. Finally, prove that D(T) = | det T |. Suggestion: Prove that M, since it's orthogonal, maps the unit ball in Rn onto itself. From this, show that D(M) = 1.
4. (Luzin's condition (N)) A function f : Rn → R is said to fulfill Luzin's condition (N) [244, p. 244] if f maps null sets (sets of measure zero) to null sets; that is, if for every set A ⊆ Rn of measure zero, its image f(A) also has measure zero. This condition is named after Nikolai Luzin (also spelled Lusin) (1883–1950). Prove the following.
Theorem. A continuous function f : Rn → R maps Lebesgue measurable sets into Lebesgue measurable sets if and only if it satisfies condition (N).
Suggestion: To prove the "if" direction, use Part (6) of Theorem 4.10. Try to prove that any closed set in Rn can be written as a countable union of compact sets. To prove the "only if" direction, suppose that A ⊆ Rn has measure zero but f(A) has positive outer measure and use Vitali's Theorem 4.15 on the set f(A).
5. The concepts developed in this problem will be helpful for future problems. (a) Given a point x ∈ Rn, we define its "box norm" by ‖x‖_b := max{|x1|, |x2|, . . . , |xn|}. This norm is equivalent to the standard norm on Rn; that is, an open set with respect to this norm is open with respect to the standard norm on Rn and vice versa. Also, note that for any ℓ > 0, [−ℓ, ℓ]n = {x ∈ Rn ; ‖x‖_b ≤ ℓ}; thus the box [−ℓ, ℓ]n is the "ball" of "radius" ℓ in the box norm. This norm is very convenient for measure theory. Given an n × n matrix A, define $|A| := \max\bigl\{\sum_{j=1}^n |a_{ij}|\ ;\ i = 1, \ldots, n\bigr\}$. Show that for any x ∈ Rn, we have $\|A x\|_b \le |A|\, \|x\|_b$.
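As a quick numerical sanity check of part (a) (illustrative only, not part of the exercise), the following script tests the inequality $\|Ax\|_b \le |A|\,\|x\|_b$ on random matrices, with |A| the maximum absolute row sum.

```python
import random

def box_norm(x):
    # ||x||_b = max_i |x_i|
    return max(abs(t) for t in x)

def mat_bound(A):
    # |A| = max_i sum_j |a_ij|, the maximum absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def mat_vec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 5)
    A = [[random.uniform(-3, 3) for _ in range(n)] for _ in range(n)]
    x = [random.uniform(-3, 3) for _ in range(n)]
    # each entry of Ax is bounded by (row sum of |a_ij|) * ||x||_b
    assert box_norm(mat_vec(A, x)) <= mat_bound(A) * box_norm(x) + 1e-12
```

The proof is exactly the inequality in the comment: $|(Ax)_i| \le \sum_j |a_{ij}|\,|x_j| \le |A|\,\|x\|_b$.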
(b) A function f : Rn → Rn is said to be locally Lipschitz if given any point a ∈ Rn, there are constants La and ra such that $\|f(x) - f(y)\|_b \le L_a\, \|x - y\|_b$ for all x, y in the closed box {z ∈ Rn ; ‖z − a‖_b ≤ ra}, the closed box centered at a of radius ra. The constant La is called a Lipschitz constant.10 Show that any locally Lipschitz function is continuous and show that any differentiable function is locally Lipschitz. Here, a function f : Rn → Rn is said to be differentiable at a point p ∈ Rn if there is an n × n matrix-valued function ϕ : Rn → Rn×n that is continuous at p such that f(x) − f(p) = ϕ(x)(x − p) for all x ∈ Rn. (The derivative of f at p is then by definition ϕ(p).) f is said to be differentiable if it's differentiable at each point p ∈ Rn.
6. Prove that a locally Lipschitz function f : Rn → Rn satisfies Luzin's condition (N); in particular, locally Lipschitz functions take Lebesgue measurable sets to Lebesgue measurable sets. You may proceed as follows.
(i) Let I be a closed cube in Rn, that is, a closed box in Rn where each side has the same length. Show that I = a + [−ℓ, ℓ]n for some a = (a1, . . . , an) and ℓ > 0, and show that if ℓ is sufficiently small, then $f(I) \subseteq f(a) + L_a\, [-\ell, \ell]^n$, and that $m^*(f(I)) \le L_a^n\, m(I)$.
(ii) Prove that if A ⊆ Rn has measure zero, then f(A) also has measure zero. Suggestion: If Ak = {a ∈ A ; La ≤ k, ra ≥ 1/k}, show that $A = \bigcup_{k=1}^\infty A_k$. To show that each f(Ak) has measure zero, use Littlewood's Principle that any measurable set can be approximated by an open set and use the Dyadic Cube Theorem.
7. We prove that Lebesgue measure is the unique translation invariant measure on B n that assigns the "correct" volume to the unit cube (0, 1]n. That is, we shall prove
Theorem. If µ : B n → [0, ∞] is a measure such that µ(I) < ∞ for all I ∈ I n and µ(A + x) = µ(A) for all A ∈ B n and x ∈ Rn, then µ = α m, where α = µ((0, 1]n). In particular, if µ((0, 1]n) = 1, then µ = m.
You may proceed as follows: (i) For each k, m ∈ N, put $I_{k,m} = (k/2^m)(0,1]^n = (0, k/2^m] \times (0, k/2^m] \times \cdots \times (0, k/2^m]$.
Show that Ik,m can be written as a union of pairwise disjoint translates: $I_{k,m} = \bigcup_\ell \Bigl( \frac{\ell - 1}{2^m} + I_{1,m} \Bigr)$, where the union is over all ℓ = (ℓ1, . . . , ℓn) ∈ Nn with 1 ≤ ℓj ≤ k for j = 1, . . . , n, and where $(\ell - 1)/2^m := \bigl((\ell_1 - 1)/2^m, \ldots, (\ell_n - 1)/2^m\bigr)$.
(ii) Prove that for each k, m ∈ N, we have $\mu(I_{k,m}) = \alpha\, (k/2^m)^n$ where α = µ((0, 1]n).
(iii) Prove that for any real number r > 0, we have $\mu\bigl(r(0,1]^n\bigr) = \alpha\, r^n$.
(iv) Prove that for all left-hand open cubes I, we have µ(I) = α m(I), and from this conclude that for all open sets U ⊆ Rn, we have µ(U) = α m(U). (v) Now prove the theorem.
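The counting behind steps (i)–(ii) can be checked numerically. The sketch below (illustrative only; the parameters n = 2, k = 3, m = 2 are arbitrary choices) verifies on a grid of sample points that (0, k/2^m]^n is partitioned by the k^n translates, which is what forces $\mu(I_{k,m}) = k^n\,\mu(I_{1,m})$ for a translation invariant µ.

```python
import math
from fractions import Fraction
from itertools import product

# Illustrative parameters (any n, k, m would do).
n, k, m = 2, 3, 2
side = Fraction(1, 2 ** m)               # side length of I_{1,m} = (0, 1/2^m]^n

translates = set(product(range(1, k + 1), repeat=n))    # the index vectors l

# Sample (0, k/2^m]^n on a fine grid; each sample point lies in exactly one
# translate (l-1)/2^m + (0, 1/2^m]^n, namely the one with l_j = ceil(x_j / side).
hit = set()
step = side / 5
grid = [step * i for i in range(1, 5 * k + 1)]          # points of (0, k/2^m]
for x in product(grid, repeat=n):
    l = tuple(math.ceil(xi / side) for xi in x)
    assert l in translates               # the point lies in one of the k^n pieces
    hit.add(l)
assert hit == translates                 # every translate is actually used
assert len(translates) == k ** n
# Translation invariance therefore forces mu(I_{k,m}) = k^n * mu(I_{1,m}).
```

Taking k = 2^m in the same count gives µ((0, 1]^n) = 2^{mn} µ(I_{1,m}), which combines with the above to yield step (ii).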
10 ‘Lipschitz’ is named after Rudolf Otto Sigismund Lipschitz (1832–1903), and Lipschitz conditions are ubiquitous in the study of differential equations where such conditions are utilized to prove that certain equations have solutions.
(vi) What if we omit the statement "µ(I) < ∞ for all I ∈ I n"; is our theorem still true? Prove it or give a counterexample.
8. (Amazing properties of Vitali's set) We shall have fun with Vitali's set. (a) Here's another way to prove that the set V constructed in Vitali's theorem is not measurable. Show that (V − V) ∩ Qn = {0}. Conclude, by Steinhaus' theorem in Problem 5 in Exercises 4.3, that V cannot be measurable. (b) Show that any subset of V that has positive outer measure is not measurable. Suggestion: If B ⊆ V has positive outer measure, show that B is not measurable by imitating the proof that V was not measurable. (c) Assume that the set A (which we assume to be bounded) in the proof of Vitali's theorem is measurable, and let V ⊆ A be a Vitali set. Prove that m∗(A \ V) = m∗(A). This in particular implies that m∗(A) < m∗(V) + m∗(A \ V).
We generalize this inequality in Problem 9 below. Suggestion: If m∗ (A \ V ) < m∗ (A), prove there is a measurable set E such that A \ V ⊆ E ⊆ A and m∗ (E) = m∗ (A \ V ). Try to use Part (b) above to get a contradiction. (d) Prove or give a counterexample: If A is nonmeasurable, then m∗ (A) < m∗ (V ) + m∗ (A \ V ). Suggestion: Think about what a Vitali set of a Vitali set is. (e) Let V be a Vitali set (of some bounded set of positive measure) and choose (by regularity) a measurable set B ⊆ Rn with V ⊆ B and m∗ (V ) = m∗ (B). Prove that for any measurable set E ⊆ B with positive measure, E ∩ V is nonmeasurable. Also, prove that B = B1 ∪ B2 where m∗ (B) = m∗ (B1 ) = m∗ (B2 ). (f) Assume again that the set A, which we assume to be bounded, in the proof of Vitali’s theorem is measurable and let V ⊆ A be a Vitali set. Let G = (A \ V ) ∪ Ac . Prove that G is nonmeasurable, m∗ (Gc ) > 0, and m∗ (E ∩ G) = m∗ (E) for all measurable sets E ⊆ Rn .
This equality is interpreted as saying that G “fills” the entire space Rn uniformly and has no “gaps”. This is surprising because m∗ (Gc ) > 0, so G certainly does not fill Rn ! This phenomenon cannot happen for measurable G as we prove next. (g) Prove that if G ⊆ Rn is measurable and m∗ (E ∩ G) = m∗ (E) for all measurable sets E, then m∗ (Gc ) = 0. 9. (Non-additivity of Lebesgue outer measure) Let A ⊆ Rn be measurable with positive, finite measure, and let B ⊆ A. Prove that B is measurable if and only if m∗ (A) = m∗ (B) + m∗ (A \ B).
This implies Corollary 4.16 (why?). (Actually, this result follows from Problem 10 in Exercises 3.5 if you did that problem.) Suggestion: Find a measurable set C such that B ⊆ C ⊆ A and m∗ (B) = m∗ (C). Try to prove that m∗ (C \ B) = 0 and use this to show that B is measurable. 10. (Baby Banach-Tarski) Let A ⊆ Rn have nonempty interior. (i) Prove that given any α > 0, there is an N ∈ N and pairwise disjoint sets A1 , A2 , . . . , AN such that A = A1 ∪ A2 ∪ · · · ∪ AN and m∗ (A1 ) + m∗ (A2 ) + · · · + m∗ (AN ) > α.
For example, you can take a pea and dissect it into finitely many pieces such that the sum of the volumes of the pieces is greater than the volume of the sun! Suggestion: By a suitable translation and since A has nonempty interior, we may assume A contains a neighborhood of the origin, and hence a cube [−3a, 3a]n for some a > 0. Does this remind you of something from Vitali’s theorem? (ii) Prove that there are countably many pairwise disjoint sets B1 , B2 , . . . such that A = B1 ∪ B2 ∪ B3 ∪ · · · and m∗ (B1 ) + m∗ (B2 ) + m∗ (B3 ) + · · · = ∞.
11. (More Baby Banach-Tarski) Here are some Banach-Tarski type results one can get from the set W from Step 4 in the proof of Vitali's theorem. (i) Given N ∈ N, show that W can be written as a disjoint union of N subsets, each of which is congruent by infinite dissection to W. (ii) Assume that the bounded set A used to construct W has the property that A + Qn = Rn (e.g. boxes or balls with nonempty interiors have this property). Prove that W and Rn are congruent by infinite dissection. Suggestion: Let s1, s2, s3, . . . be an enumeration of Qn and show that $\mathbb{R}^n = \bigcup_{k=1}^\infty (V + s_k)$, a countable union of pairwise disjoint sets. (iii) Let A, B ⊆ Rn be bounded sets where A has a nonempty interior. Prove there is a subset A0 of A and a bounded set B0 containing B that are congruent by infinite dissection. For example, let A be a pea and B the sun. Then there is a subset of the pea and a bounded set containing the sun that are congruent by infinite dissection! Suggestion: By translating we may assume that A contains a neighborhood of the origin. Choose a > 0 such that [−3a, 3a]n ⊆ A and let A0 = W be the set constructed by applying Vitali's proof to [−a, a]n.
12. (Even more Baby Banach-Tarski) In this problem we prove that the unit circle S1 can be written as a disjoint union of two sets, each of which is congruent by infinite dissection with S1. In fact, the proof simply copies Vitali's proof! (i) Given any two elements x, y ∈ S1, define x ∼ y if x = y e2πiθ for some θ ∈ Q ∩ [0, 1). Check that this relation is an equivalence relation on S1. (ii) Choose a point from each equivalence class and let V be the set of all such points. Let θ1, θ2, . . . be a list of all rational numbers in [0, 1) and let Vn = e2πiθn V = {e2πiθn v ; v ∈ V}. Show that $S^1 = \bigcup_{n=1}^\infty V_n$, where the Vn's are pairwise disjoint, then complete the proof of the statement.
13.
(Sierpiński-Mazurkiewicz Paradox) One of the beginning results that led up to the Banach–Tarski paradox was the following interesting result published by Stefan Mazurkiewicz (1888–1945) and Wacław Sierpiński (1882–1969) in 1914: There is a nonempty subset X of R2 such that X = A ∪ B where A and B are disjoint and each is congruent to X. Prove this theorem as follows. (i) Show that there is a real number θ ∈ R such that eiθ is transcendental; that is, eiθ is not the root of any polynomial with integer coefficients. Use the elementary fact that the set of all algebraic numbers is countable. (ii) Identify R2 with C, and define X as the set of all points in R2 of the form a0 + a1 eiθ + a2 e2iθ + · · · + an eniθ, for some n, a0, a1, a2, . . . , an ∈ {0, 1, 2, 3, . . .}. Let A ⊆ X be those points with a0 = 0 and let B ⊆ X be those points with a0 ≠ 0. Prove that A ∩ B = ∅, X = A ∪ B, and both A and B are congruent to X. (iii) Here are related paradoxes for S1: We'll show that S1 (the unit circle) and S1 minus a point are congruent by dissection. To prove this, identify R2 with C as before, and identify S1 with {eiθ ; θ ∈ R}. Given p ∈ S1 we shall prove that S1 and S1 \ {p} are congruent by dissection. Indeed, let A = {ein p ; n ∈ N} and let B = (S1 \ {p}) \ A. Then S1 \ {p} = A ∪ B. Write S1 as A′ ∪ B where A′ and B are disjoint and A′ is congruent to A. (iv) Given a countable subset C ⊆ S1, show that S1 and S1 \ C are congruent by dissection. Suggestion: Show that there is an angle θ ∈ R such that the sets C, eiθ C, e2iθ C, . . . are pairwise disjoint. Let $A = \bigcup_{n=1}^\infty e^{in\theta} C$ and let B = (S1 \ C) \ A, then continue as in (iii).
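The decomposition in part (ii) can be explored with finite truncations. In the sketch below (illustrative only; the truncation bounds N and M are arbitrary choices), points of X are stored as coefficient tuples: because e^{iθ} is transcendental, distinct tuples represent distinct points, so tuple arithmetic is a faithful model. The rotation by −θ becomes a coefficient shift, and the translation by −1 lowers a0.

```python
from itertools import product

# A point a0 + a1 e^{i t} + ... + an e^{i n t} is stored as its coefficient
# tuple (a0, ..., an); for t with e^{i t} transcendental, distinct tuples
# give distinct points, so tuple arithmetic is a faithful model.
N, M = 3, 2          # truncation bounds (illustrative only)

def norm(c):
    # canonical form: drop trailing zero coefficients
    c = list(c)
    while len(c) > 1 and c[-1] == 0:
        c.pop()
    return tuple(c)

X = {norm(c) for c in product(range(M + 1), repeat=N + 1)}
A = {c for c in X if c[0] == 0}              # points with a0 = 0
B = X - A                                    # points with a0 != 0
assert A | B == X and not (A & B)

# Rotation by -t divides by e^{i t}: shift the coefficients down one slot.
rotA = {norm(c[1:]) if len(c) > 1 else (0,) for c in A}
# Translation by -1 lowers a0 by one.
transB = {norm((c[0] - 1,) + c[1:]) for c in B}

# Within the truncation both images land back inside X; on the full infinite
# set X they fill it out exactly, which is the content of part (ii).
assert rotA <= X and transB <= X
```

On the full set X, the rotation maps A bijectively onto X and the translation maps B bijectively onto X; the truncation only witnesses the containments.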
4.5. THE CANTOR SET
227
4.5. The Cantor set

In this section we describe a compact (and hence Borel) uncountable set of real numbers with measure zero. This set is called the Cantor set after Georg Ferdinand Ludwig Philipp Cantor (1845–1918), who constructed the set in 1883 [49, 48]. Perhaps the first person to construct Cantor-type sets was Henry John Stephen Smith (1826–1883) in 1875 [264]; in fact, his set is a half-scaled version of Cantor's set (see the Notes and References section to this chapter). Other early constructions of Cantor-type sets are due to Paul David Gustav du Bois-Reymond (1831–1889) in 1882 [74], [75, p. 188], Vito Volterra (1860–1940) in 1881 [301], and others. It's not only fascinating to know what the Cantor set is, but also why it came about, so we start by briefly reviewing its history (see [67, 95] for more details).

4.5.1. The Cantor middle-third set. Cantor introduced his set in the fifth paper of the six-paper series Über unendliche, lineare Punktmannigfaltigkeiten (On infinite, linear point sets) [49, 48, 51], which established the fundamentals of Cantor's new transfinite set theory. One of his ultimate goals was to prove the continuum hypothesis CH11 and he stated that he needed to give "a definition as precise and as general as possible" of when a set can be called continuous (a continuum), and after saying this he expressed hope in proving CH [49, p. 574]:

Therefore the question about the cardinality of Rn reduces to the analogous question about the open interval (0, 1) and I hope to be able to answer it with a rigorous proof that this cardinality is no other than the one of our second number class. It will then follow that every infinite set of points has either the cardinality of the first number class or the cardinality of the second number class.
Cantor then goes on to describe the "precise and as general as possible" properties of a continuum. The first property he mentions is that of being "perfect". Here, a set A ⊆ Rn is said to be perfect if A equals its set of limit points. Recall that a point p is said to be a limit point of A if given any open set U containing p, there is a point a ∈ A different from p such that a ∈ U. The set of limit points of A is denoted by A′, so A is perfect means that A = A′. Examples of perfect sets are continuums such as R or finite unions of closed intervals. However, although not obvious at first glance, Cantor pointed out that there are perfect sets that are not continuums, revealing for the very first time his now famous set in the following footnote at the end of his paper:

As an example of a perfect set which is not everywhere dense in any even so small interval, I name the set of all real numbers which are contained in the formula $z = \frac{c_1}{3} + \frac{c_2}{3^2} + \cdots + \frac{c_\nu}{3^\nu} + \cdots$

11 One form of the continuum hypothesis (CH) states that an infinite subset of R either has the cardinality of N or of R. If ℵ0 denotes the cardinality of N, from set theory [136] we know there is a next larger cardinality, which we denote by ℵ1. Cantor called ℵ0 the "first number class" and ℵ1 the "second number class". Another form of CH is that R has the cardinality of the second number class.
where the coefficients cν can assume the values 0 or 2 at leisure and the series can consist of a finite or infinite quantity of members.
In other words, Cantor defines his set as the set of real numbers (in [0, 1]) whose ternary (base 3) expansions can be written using only the digits 0 and 2. Cantor did not explain where his set came from, he didn't prove anything about his set, and he never mentions his set in the main body of the paper! I am amazed how the Cantor set could generate so much mathematical fruit in the ensuing years from its humble beginnings as a simple footnote! We'll get back to his set after we finish Cantor's story; in particular, we'll explain Cantor's comment that his set is "not everywhere dense in any even so small interval". Because of Cantor's example of a perfect set that is not a continuum, being perfect is not enough to characterize a continuous set of points. Thus, Cantor added another condition that he called "connectedness" (which is different from how we use the term today). Armed with a precise topological (as we would now describe it) characterization of continuums as perfect-connected sets, he hoped to prove CH in his subsequent work. Unfortunately, Cantor never realized his hope, and so famous was CH that it was the first problem in Hilbert's list of 23 open problems given at the 1900 International Congress of Mathematicians in Paris. As you probably know, Cantor was doomed to fail, because by the later work of Kurt Gödel and Paul Cohen it was discovered that CH is undecidable (it cannot be proved or disproved, i.e. it's independent) in the standard axioms of modern mathematics (ZFC set theory). Now that we know why the Cantor set came about (to precisely characterize continuums), let's fill in the details Cantor left out! Instead of defining the Cantor set as Cantor originally did, we shall define it geometrically the way Henry Smith defined his sets; later on we shall relate the geometric definition to Cantor's original definition. The construction of the Cantor set is illustrated in Figure 4.12. We start with the closed interval [0, 1].
From this interval, we remove the open

[Figure 4.12 shows the stages C1, . . . , C5 of the construction, with subintervals labeled C0, C2, C00, C02, C20, C22, . . . and division points 1/3, 2/3, 1/9, 2/9, 7/9, 8/9.]

Figure 4.12. The Cantor set C is the limit set $\lim C_n := \bigcap_{n=1}^\infty C_n$, which to our eyes looks very tiny. However, it turns out that C has uncountably many points, as we'll see later.
middle third (1/3, 2/3), forming the two disjoint sets C0 and C2, whose union we denote by C1: $C_1 = C_0 \cup C_2 = \bigl[0, \tfrac{1}{3}\bigr] \cup \bigl[\tfrac{2}{3}, 1\bigr]$. Note that C0 and C2 each have length 1/3. We now remove each of the open middle thirds from C0 and C2 and denote the remaining set by C2. Thus, from C0
we remove (1/3², 2/3²), forming the two disjoint sets C00 and C02, and from C2 we remove (7/3², 8/3²), forming the two disjoint sets C20 and C22: $C_2 = C_{00} \cup C_{02} \cup C_{20} \cup C_{22} = \bigl[0, \tfrac{1}{3^2}\bigr] \cup \bigl[\tfrac{2}{3^2}, \tfrac{1}{3}\bigr] \cup \bigl[\tfrac{2}{3}, \tfrac{7}{3^2}\bigr] \cup \bigl[\tfrac{8}{3^2}, 1\bigr]$. Note that C00, C02, C20, C22 each have length 1/3². We now continue this "removing open middle thirds" process indefinitely, and what's left over after discarding all the open middle thirds we shall call the Cantor set.

If you want more details on how the Cantor set is defined, here they are. We shall proceed by induction and follow the convention, as we have already been doing, that whenever we divide a set into thirds, we tack on a "0" to denote the first set and a "2" to denote the third set. Suppose by way of induction that C1 ⊇ · · · ⊇ Cn have already been defined, such that the n-th set is a union of 2^n sets: $C_n = \bigcup C_{\alpha_1 \ldots \alpha_n}$, where the $C_{\alpha_1 \ldots \alpha_n}$'s are pairwise disjoint closed intervals of length 1/3^n and the union is over all n-tuples (α1, . . . , αn) of 0's and 2's. For each interval $C_{\alpha_1 \ldots \alpha_n}$, we remove its middle third, forming two disjoint closed intervals $C_{\alpha_1 \ldots \alpha_n 0}$ and $C_{\alpha_1 \ldots \alpha_n 2}$. Since the length of $C_{\alpha_1 \ldots \alpha_n}$ is 1/3^n, the lengths of $C_{\alpha_1 \ldots \alpha_n 0}$ and $C_{\alpha_1 \ldots \alpha_n 2}$ are 1/3^{n+1}. We now put $C_{n+1} := \bigcup C_{\alpha_1 \ldots \alpha_n \alpha_{n+1}} \subseteq C_n$, where the union is over all (n + 1)-tuples (α1, . . . , αn, αn+1) of 0's and 2's. This completes our induction step. The Cantor set is the limit set lim Cn, that is, the intersection $C := \bigcap_{n=1}^\infty C_n$.
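The construction, and the measure computation m(Cn) = (2/3)^n coming up in Theorem 4.18, are easy to check with exact rational arithmetic. The following sketch (illustrative, not part of the text) generates the intervals of Cn and also tests membership in C through the ternary-digit criterion from Cantor's original definition; the digit cutoff is an assumption that suffices for the small denominators tested here.

```python
from fractions import Fraction

def cantor_stage(n):
    """Return the 2^n closed intervals of C_n as (left, right) pairs."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        nxt = []
        for a, b in intervals:
            third = (b - a) / 3
            nxt += [(a, a + third), (b - third, b)]   # remove open middle third
        intervals = nxt
    return intervals

C4 = cantor_stage(4)
assert len(C4) == 2 ** 4
assert sum(b - a for a, b in C4) == Fraction(2, 3) ** 4   # m(C_n) = (2/3)^n

def in_cantor(x, digits=60):
    """Test (up to `digits` ternary places) whether x in [0, 1] has a ternary
    expansion using only the digits 0 and 2."""
    x = Fraction(x)
    if x == 1:
        return True                  # 1 = 0.222..._3
    for _ in range(digits):
        if x == 0:
            return True
        d, x = divmod(3 * x, 1)      # next ternary digit and remainder
        if d == 1:
            return x == 0            # a final digit 1 rewrites as 0222..._3
    return True                      # no forbidden digit up to the cutoff

# Every endpoint produced by the construction survives to the Cantor set:
assert all(in_cantor(a) and in_cantor(b) for a, b in C4)
# 1/4 = 0.020202..._3 lies in C although it is not an endpoint; 1/2 does not.
assert in_cantor(Fraction(1, 4)) and not in_cantor(Fraction(1, 2))
```

That 1/4 belongs to C without being an endpoint already hints at why C is uncountable.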
If $B = \bigcup_{k=1}^\infty I_k$, where the sets Ik are all the open middle thirds removed to form C, then we also have C = [0, 1] \ B. Some properties of the Cantor set are immediate from its definition. For example, since each Cn is a finite union of closed intervals, each Cn is closed, and since a countable intersection of closed sets is closed, the Cantor set is closed. Since the Cantor set is closed and bounded (it's contained in [0, 1]) it's compact. Now what are some points in the Cantor set? Judging from Figure 4.12 it doesn't look like there's much in the Cantor set. However, it's certainly not empty, since the Cantor set contains all the endpoints of each interval $C_{\alpha_1 \ldots \alpha_n}$, recalling that only their open middle thirds were thrown away. So, C contains the points 0, 1, 1/3, 2/3, 1/3², 2/3², 7/3², 8/3², 1/3³, 2/3³, 7/3³, 8/3³, . . . . We'll describe more points in the Cantor set once we relate our geometric definition with Cantor's original definition. We now summarize the main properties of C in a theorem. Let A ⊆ R. We say that A is totally disconnected if for every two points x, y ∈ A, x < y, there is a real number z ∈ R with x < z < y such that A ⊆ (−∞, z) ∪ (z, ∞). We say that A is nowhere dense if the closure $\overline{A}$ contains no open intervals. In other
words, any open interval in R must contain a point not in $\overline{A}$. Thus, nowhere dense is equivalent to the complement of $\overline{A}$ being dense.12 Nowhere dense is the precise meaning of Cantor's comment "not everywhere dense in any even so small interval".

Properties of Cantor's set

Theorem 4.18. The Cantor set is perfect, uncountable, compact, totally disconnected, nowhere dense, and has measure zero.

Proof: To show that C is perfect, let x be any point in the Cantor set. We need to show that x is a limit point of C. To this end, let I ⊆ R be any open interval containing the point x. For each n ∈ N, let In denote the closed interval (one of the $C_{\alpha_1 \cdots \alpha_n}$'s) of Cn that contains x. By construction of the Cantor set, we know that the length of In is 1/3^n, therefore we can choose n large enough so that In is completely contained in the open interval I. Let y ≠ x be one of the endpoints of In. Then y ∈ C, y ∈ I, and y ≠ x. Thus, x is a limit point of C.

We already know that the Cantor set is compact. To prove that C is uncountable we proceed by contradiction.13 Suppose that C is countable, so we can write the Cantor set as a list C = {c1, c2, . . .}. Since C0 and C2 are disjoint, c1 can be contained in only one of them. Let $C_{\alpha_1}$ be the set not containing c1. Since $C_{\alpha_1 0}$ and $C_{\alpha_1 2}$ are disjoint, c2 can be contained in at most one of them. Choose one of the two that does not contain c2 and call it $C_{\alpha_1 \alpha_2} \subseteq C_{\alpha_1}$. Continuing by induction, we construct a sequence of closed intervals $I_n = C_{\alpha_1 \ldots \alpha_n}$ such that In does not contain cn and In+1 ⊆ In for each n. Since I1 ⊇ I2 ⊇ I3 ⊇ · · · is a nested sequence of closed intervals, by the Nested Intervals Theorem (Theorem ??) the intersection $\bigcap_{n=1}^\infty I_n$ is not empty; let c be a point in the intersection. Since $I_n = C_{\alpha_1 \ldots \alpha_n} \subseteq C_n$, we see that $c \in \bigcap_{n=1}^\infty C_n = C$. To summarize, we have found a point c ∈ C such that c ∈ In for each n. However, by construction, for any n, cn is not in the closed interval In, so c, being in all the intervals In, cannot be any of the numbers c1, c2, . . .. This contradicts the assumption that {c1, c2, . . .} was a list of all the elements of C.

That the Cantor set is totally disconnected and nowhere dense, we leave for Problem 1. Finally, to prove that C has measure zero, recall that Cn is the union of 2^n disjoint intervals of length 1/3^n, so we have $m(C_n) = 2^n \cdot \frac{1}{3^n} = \left(\frac{2}{3}\right)^n$. Now for each n, C ⊆ Cn and so m(C) ≤ m(Cn) = (2/3)^n. Since (2/3)^n → 0 as n → ∞, we must have m(C) = 0.
4.5.2. Cantor's original definition. We now relate Smith's geometric definition of the Cantor set with Cantor's original definition. To do so, we first recall geometrically how to define the base 10 expansion of a real number (which you should have seen in an elementary analysis course). Let x ∈ [0, 1]; for example, consider x = π − 3 = 0.14159 . . ., the decimal part of π, represented by a small dot near the left end of the interval [0, 1].

12 Recall that D ⊆ Rn is dense means $\overline{D} = \mathbb{R}^n$; i.e., any open set in Rn intersects D.
13 Actually, any nonempty perfect subset of Rn is uncountable, a fact you may try to prove.
4.5. THE CANTOR SET
We can expand x as a decimal,

x = 0.a_1 a_2 a_3 · · · = a_1/10 + a_2/10^2 + a_3/10^3 + · · · ,

and we shall explain how to find the coefficients a_1, a_2, . . . in the decimal expansion. The first step is to divide the interval [0, 1] into 10 equal intervals of length 1/10, with division points 0/10, 1/10, 2/10, . . . , 10/10.
Next, find which fraction is to the immediate left of x; call that fraction a_1/10, where a_1 ∈ {0, 1, . . . , 9}; then the first digit of x in base 10 is a_1. For example, the first digit of π − 3 is 1. We can also label each of the 10 intervals from 0 to 9 (labeling left to right) and determine which interval contains x. Now divide the interval of length 1/10 containing x into 10 equal parts, each part of which has length 1/10^2. For x = π − 3 the divisions are quite small, so we magnify the interval [1/10, 2/10]; its division points lie at distances 0/10^2, 1/10^2, . . . , 10/10^2 from the left end point a_1/10 (= 1/10 for x = π − 3).
Find the fraction that is to the immediate left of x, say the fraction a_2/10^2 (where a_2 ∈ {0, 1, . . . , 9}); then the second digit of x is a_2. Alternatively, we can label the intervals from 0 to 9 and find which interval contains x. For example, the dot representing x = π − 3 lies just to the right of 4/10^2, so we have a_2 = 4. If we keep on repeating the "division into 10" procedure we eventually get all the a_i's:

x = 0.a_1 a_2 a_3 · · · = a_1/10 + a_2/10^2 + a_3/10^3 + · · · .

Figure 4.13 shows step-by-step how to get the base 10 expansion of π − 3.

Figure 4.13. To review, in the first picture we divide [0, 1] into 10 intervals, each of length 1/10. We label the intervals 0 to 9 and we see that x = π − 3 = 0.14159265 . . . is in interval 1, so a_1 = 1. In the second picture we magnify the interval [1/10, 2/10] and divide it into 10 equal intervals of length 1/10^2. Labeling the intervals from 0 to 9, we see that x lies in interval 4, so a_2 = 4. Continuing this process, we get the successive approximations 0.1 . . . , 0.14 . . . , 0.141 . . . , 0.1415 . . . , 0.14159 . . . , that is, the base 10 expansion of π − 3.
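The successive-division procedure just described is exactly the usual digit-extraction algorithm: multiply by the base and take the integer part. A short sketch (the helper name `digits` is ours; we use the rational approximation 355/113 of π so the arithmetic is exact):

```python
from fractions import Fraction

def digits(x, base, n):
    """First n digits of x in [0, 1) in the given base, found by successive
    division: multiply by the base; the integer part tells which of the
    `base` equal subintervals contains x."""
    out = []
    for _ in range(n):
        x *= base
        d = int(x)
        out.append(d)
        x -= d
    return out

# The decimal digits of pi - 3 = 0.14159..., using 355/113 as a stand-in for pi:
print(digits(Fraction(355, 113) - 3, 10, 5))  # [1, 4, 1, 5, 9]
```

The same routine works in any base; for instance, `digits(Fraction(1, 4), 3, 6)` returns the ternary digits `[0, 2, 0, 2, 0, 2]`, which will be relevant below.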
Now recall that given any b ∈ N with b > 1, called a base, and given any x ∈ [0, 1], we can write x as a b-adic or b-ary expansion (if x happens to lie exactly on one of the division fractions, e.g. one of the fractions p/10 with p ∈ {1, . . . , 9} in the base 10 procedure, then we could put a_1 = p or a_1 = p − 1; in this case x can be written in two different ways):

x = a_1/b + a_2/b^2 + a_3/b^3 + · · ·
4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS
where a_i ∈ {0, 1, . . . , b − 1} for each i; the a_i are called the digits of x. We can find the digits of x using the same successive division trick outlined above for base 10, except that for a general base we divide into b subintervals at each stage. For example, if we focus on b = 3, then given any x ∈ [0, 1] we can write x in a ternary expansion

x = a_1/3 + a_2/3^2 + a_3/3^3 + · · · ,

where a_i ∈ {0, 1, 2} for each i. Moreover, we can determine these digits by successive divisions by 3. Thus, we divide [0, 1] into thirds and label the intervals 0, 1 and 2; then a_1 is the label of the interval in which x lies. We divide the a_1 interval into thirds and label the newly formed intervals 0, 1 and 2; then a_2 is the label of the interval in which x lies, and so on. Bing! A light bulb should have gone on in your head! In the geometric construction of the Cantor set we are doing exactly this successive division by thirds, except that we are omitting all numbers in the intervals labeled with "1". Thus, we see, at least intuitively, that x ∈ C if and only if

x = α_1/3 + α_2/3^2 + α_3/3^3 + α_4/3^4 + · · · , where α_j ∈ {0, 2}.

You will prove this in Problem 2. In fact, the α_j's here correspond exactly to the α_j's in the construction of the sets C_{α_1...α_n} in the Cantor set. (This is, of course, the reason why we denoted the C_{α_1...α_n}'s the way we did!) Using this description of the Cantor set, one can easily find points in the Cantor set such as

0/3 + 2/3^2 + 2/3^3 + 2/3^4 + 2/3^5 + · · · = (2/3^2)(1 + 1/3 + 1/3^2 + · · ·) = (2/3^2) · 1/(1 − 1/3) = 1/3,

which we already knew was in the Cantor set, but we can find other points in C besides end points of deleted intervals, such as

0/3 + 2/3^2 + 0/3^3 + 2/3^4 + 0/3^5 + · · · = (2/3^2)(1 + 1/3^2 + 1/3^4 + · · ·) = (2/3^2) · 1/(1 − 1/3^2) = 1/4,

as well as

2/3 + 0/3^2 + 2/3^3 + 0/3^4 + 2/3^5 + · · · = (2/3)(1 + 1/3^2 + 1/3^4 + · · ·) = (2/3) · 1/(1 − 1/3^2) = 3/4.
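The ternary-digit description of C can be checked mechanically for rational numbers: generate ternary digits greedily, allow a digit 1 only when nothing nonzero follows it (an end point such as 1/3 = 0.1 = 0.0222 . . . in base 3), and stop once a remainder repeats, since rationals have eventually periodic expansions. A sketch (the function name `in_cantor` is our own):

```python
from fractions import Fraction

def in_cantor(x):
    """Test whether a rational x in [0, 1] lies in the Cantor set, by
    checking that its ternary digits can all be chosen in {0, 2}."""
    if x == 1:
        return True
    seen = set()
    while x not in seen:
        seen.add(x)
        y = 3 * x
        d = int(y)            # next ternary digit (greedy choice)
        r = y - d
        if d == 1:
            # a digit 1 is tolerable only if nothing follows it, since
            # ...1000... = ...0222... is an endpoint of a removed interval
            return r == 0
        x = r
    return True               # the periodic part used only digits 0 and 2

for x, expected in [(Fraction(1, 4), True), (Fraction(3, 4), True),
                    (Fraction(1, 3), True), (Fraction(1, 2), False)]:
    assert in_cantor(x) == expected
```

For example, 1/4 cycles through the remainders 1/4, 3/4 with ternary digits 0, 2 repeating, confirming the computation above.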
In fact, there are many other points in the Cantor set besides the end points, such as 1/10, 3/13, and 11/12 (to name a few), and many more; see Problems 3 and 4. Here's a quote from Ralph P. Boas, Jr. (1912–1992) [34, p. 97]:

When I was a freshman, a graduate student showed me the Cantor set, and remarked that although there were supposed to be points in the set other than the endpoints, he had never been able to find any. I regret to say that it was several years before I found any for myself.
4.5.3. The Cantor function. We now define the Cantor function (also called Cantor's singular function) ψ : [0, 1] → [0, 1]. This function has the interesting property that it increases from 0 to 1 essentially without changing! We construct ψ exactly as Cantor did in 1883 [50]. Step 1: The first step is to define ψ on C; see Figure 4.14. Given a point x ∈ C
Figure 4.14. Constructing the Cantor function; Step 1.

we can write it in a ternary expansion

x = 2a_1/3 + 2a_2/3^2 + 2a_3/3^3 + 2a_4/3^4 + · · · ,

where a_j ∈ {0, 1} for each j; we define ψ(x) via the binary expansion

ψ(x) := a_1/2 + a_2/2^2 + a_3/2^3 + a_4/2^4 + · · · .

In other words, we omit the factor of 2 from the numerators in x and change from base 3 to base 2:

ψ(0.(2a_1)(2a_2)(2a_3)(2a_4) · · · in base 3) = 0.a_1 a_2 a_3 a_4 · · · in base 2.
For example, ψ(0) = 0, and, since

1 = 2/3 + 2/3^2 + 2/3^3 + 2/3^4 + · · · ,

we have

ψ(1) = 1/2 + 1/2^2 + 1/2^3 + · · · = 1.

An interesting property of ψ is that ψ(x) = ψ(y) if x and y are end points of the same deleted interval removed during the construction of the Cantor set; see Problem 5. For instance, 1/3 and 2/3 are end points of the deleted open interval (1/3, 2/3) removed in the first stage of the Cantor set construction, and, since

1/3 = 2/3^2 + 2/3^3 + 2/3^4 + · · · ,

we have

ψ(1/3) = 1/2^2 + 1/2^3 + 1/2^4 + · · · = 1/2,

and also, since 2/3 = 2/3 + 0/3^2 + 0/3^3 + · · · , we have

ψ(2/3) = 1/2 + 0/2^2 + 0/2^3 + · · · = 1/2.

Thus, ψ(1/3) = ψ(2/3). We claim that ψ : C → [0, 1] is onto. To see this, let y ∈ [0, 1] and write y in binary:

y = a_1/2 + a_2/2^2 + a_3/2^3 + a_4/2^4 + · · · , where a_j ∈ {0, 1}.

Then by definition of the Cantor function, we have ψ(x) = y where

x = 2a_1/3 + 2a_2/3^2 + 2a_3/3^3 + 2a_4/3^4 + · · · .
Thus, ψ maps the visually very tiny Cantor set onto the whole interval [0, 1]! By the way, this last statement gives another proof that C is uncountable, for if it were countable, then ψ(C) would be countable, which it is not. Step 2: We now extend the domain of ψ from C to all of [0, 1]; the basic gist is shown in Figure 4.15, where ψ takes the constant values 1/8, 1/4, 3/8, 1/2, 5/8, 3/4, 7/8 on the first few deleted middle thirds.

Figure 4.15. Constructing the Cantor function; Step 2.

To do so, we need to define ψ on the open middle thirds removed from [0, 1] to form the Cantor set. Let {I_k} be all the open middle thirds removed to form the Cantor set. Writing I_k = (a_k, b_k), the end points a_k and b_k belong to the Cantor set and we already know that ψ(a_k) = ψ(b_k). We now define ψ(x) to equal this common value for x ∈ I_k; that is, for x in an open middle third, we define ψ(x) to equal its values at the end points (which, from Step 1, are already known and are equal). This defines our function ψ : [0, 1] → [0, 1].
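Steps 1 and 2 together give an effective recipe for evaluating ψ at any point of [0, 1]: scan the ternary digits of x, turn each 0 or 2 into the binary digit 0 or 1, and stop at the first digit 1, since there x lies inside a removed middle third, where ψ is constant with value given by the truncated binary expansion ending in a final 1. A floating-point sketch (the name `cantor_psi` and the 50-digit truncation are our own choices):

```python
def cantor_psi(x, n_digits=50):
    """Approximate the Cantor function at x in [0, 1]: ternary digits 0/2
    become binary digits 0/1; the first ternary digit 1 contributes a
    final binary digit 1, the constant value on that removed interval."""
    value, scale = 0.0, 0.5
    for _ in range(n_digits):
        x *= 3
        d = int(x)
        x -= d
        if d == 1:
            return value + scale      # x is in a deleted middle third
        value += scale * (d // 2)
        scale /= 2
    return value

assert cantor_psi(0.0) == 0.0
assert cantor_psi(0.5) == 0.5                 # constant on (1/3, 2/3)
assert abs(cantor_psi(1/3) - 0.5) < 1e-12
assert abs(cantor_psi(2/3) - 0.5) < 1e-12     # equal values at both endpoints
assert abs(cantor_psi(1/4) - 1/3) < 1e-9      # 0.010101..._2 = 1/3
```

This is a numerical illustration only; the truncation means values are accurate to roughly 2^(-50) for points of C whose expansions never hit a digit 1.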
In Problem 5 you will prove that ψ : [0, 1] → [0, 1] is continuous and nondecreasing. We now look at some of ψ's interesting properties. First, we know that on any discarded open middle third, ψ is constant (even on the closure of each open middle third). In general, a function f on a topological space X is said to be locally constant if for each c ∈ X there is an open set U containing c such that f is constant on U. Thus, ψ is locally constant on [0, 1] \ C, which is just the union of all the deleted open middle thirds. Hence, although ψ is locally constant on [0, 1] \ C, which has length 1, somehow ψ goes from 0 to 1 doing all its increasing (without jumping, because ψ is continuous) on the visually very tiny Cantor set, which has length zero! There is one more property that we want to share. Let µ_ψ denote the Lebesgue-Stieltjes measure of the Cantor function ψ. Let B = ⋃_{k=1}^∞ I_k, where the sets I_k are all the open middle thirds removed to form C. Since ψ is constant on each I_k, we have µ_ψ(I_k) = 0, which implies that

µ_ψ(B) = ∑_{k=1}^∞ µ_ψ(I_k) = 0.

Hence,

1 = ψ(1) − ψ(0) = µ_ψ([0, 1]) = µ_ψ(B ∪ C) = µ_ψ(C),

so the Cantor set has µ_ψ-measure one! We summarize our findings in the following theorem.
Cantor's function

Theorem 4.19. The Cantor function ψ : [0, 1] → [0, 1] has the following properties:
(1) ψ is a nondecreasing continuous function mapping [0, 1] onto [0, 1];
(2) ψ is differentiable except on a set of measure zero (namely C) with ψ′ = 0;
(3) ψ(C) = [0, 1];
(4) µ_ψ(C) = 1.

One question you might be wondering about is why in the world would Cantor think of such a function? The reason was to supply a counterexample to a statement of Carl Gustav Axel Harnack (1851–1888) [79, p. 17]. We all know that if f : [a, b] → R is differentiable at all points in [a, b] and f′(x) = 0 at all points, then f is constant. A natural question is whether the conclusion "f is constant" is still true if "at all points" is replaced by some weaker condition. In 1882 Harnack [116], [121, p. 60], "proved" that if a continuous function satisfies f′(x) = 0 "in general," then f must be constant. Here, "in general" means that given ε > 0, there is a set A of content zero such that if x is not in A, then

|(f(x + h) − f(x))/h| < ε for all h sufficiently small.

See Problem 11 in Exercises 3.4 for the definition of (outer) content. Thus, Harnack is saying that if the difference quotient of a function can be made arbitrarily small outside a negligible set (a set of zero content), then the function must be constant. Intuitively this seems plausible, but unfortunately it's false! In particular, it's easy to check that Cantor's function is continuous and satisfies ψ′(x) = 0 "in general," yet ψ isn't constant! Cantor's example is typical of Cantor-type sets: they serve as testing grounds for the validity of theories.
4.5.4. Cantor-like sets with positive measure. The Cantor set, as we've already seen, has measure zero. We can also define sets that have the same properties as the Cantor set except that they have positive measure. Here is one example; since we have treated the Cantor set example so thoroughly, we shall treat the following example more cavalierly. Going back to the construction of the Cantor set, we see that the Cantor set was obtained by removing an open interval of length 1/3 at the first step, then two intervals of length 1/3^2 at the second step, then 2^2 intervals of length 1/3^3 at the third step, then 2^3 intervals of length 1/3^4 at the fourth step, etc. Instead of removing intervals whose lengths are powers of 1/3 we can use other numbers too. Let k ∈ N with k ≥ 3; we shall remove intervals whose lengths are powers of 1/k. We start with the closed interval [0, 1]. From this interval, we remove the open middle interval of length 1/k, which leaves two disjoint closed intervals. From each of these two intervals, we remove the open middle interval of length 1/k^2, ending up with 2^2 intervals remaining, from which we remove their open middle intervals of length 1/k^3, ending up with 2^3 intervals, and so on; see Figure 4.16. At the end of the n-th stage, we have 2^n closed intervals, and from these intervals we remove each of their open middle intervals of length 1/k^{n+1}. Continuing this process indefinitely, what's left over after discarding all the open middle intervals is the "fat" Cantor set, which we denote by C(k). The main difference between the cases k > 3 and the case k = 3 (the original Cantor set) is that C(k) has positive
Figure 4.16. The first few stages C_1(k), C_2(k), C_3(k) in constructing C(k).

measure for k > 3. Indeed, observe that if we look back at the construction of C(k), the total length of the open intervals we removed from [0, 1] is

1/k + 2/k^2 + 2^2/k^3 + 2^3/k^4 + · · · = (1/2) ∑_{n=1}^∞ (2/k)^n = (1/2) · (2/k)/(1 − 2/k) = 1/(k − 2).

Hence,

m(C(k)) = 1 − 1/(k − 2) = (k − 3)/(k − 2).

For instance, m(C(4)) = 1/2. Moreover, as k → ∞, we see that m(C(k)) → 1. Hence, we can get "Cantor-like" subsets of [0, 1] whose measure is as close to 1 as desired. In Problem 10 you will prove Theorem 4.20 below, and you will construct the corresponding Cantor function.
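The removal scheme is simple enough to simulate, which makes the value m(C(k)) = (k − 3)/(k − 2) easy to check for small k. A sketch (the helper name `fat_cantor_stage` and the choice of 12 stages are ours; exact rational arithmetic is used so the stage measures come out exactly):

```python
from fractions import Fraction

def fat_cantor_stage(k, n):
    """First n stages of C(k): at stage j, remove from each remaining
    closed interval its open middle piece of length 1/k**j."""
    intervals = [(Fraction(0), Fraction(1))]
    for j in range(1, n + 1):
        gap = Fraction(1, k**j)
        new = []
        for a, b in intervals:
            mid = (a + b) / 2
            new.append((a, mid - gap / 2))
            new.append((mid + gap / 2, b))
        intervals = new
    return intervals

k = 4
stages = fat_cantor_stage(k, 12)
length = sum(b - a for a, b in stages)
assert len(stages) == 2**12
assert all(b > a for a, b in stages)          # every closed interval survives
# total length after 12 stages: 1/2 + 2^{-13}, approaching m(C(4)) = 1/2
assert length == Fraction(1, 2) + Fraction(1, 2**13)
```

The stage lengths decrease to (k − 3)/(k − 2) = 1/2 geometrically, in agreement with the series computed above.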
Thick Cantor sets C(k) for k ≥ 3

Theorem 4.20. The Cantor set C(k) is perfect, uncountable, compact, totally disconnected, nowhere dense, and has positive measure for k > 3.

As with the standard Cantor set, fat Cantor sets are also used as testing grounds for theories. For example, in 1870 Hermann Hankel (1839–1873) "proved" [115, pp. 89-92] that a bounded function is Riemann integrable on an interval [a, b] if and only if its points of continuity form a dense set in [a, b]. We remark that a function f : [a, b] → R whose points of continuity form a dense set in [a, b] is said to be pointwise discontinuous, so Hankel claimed that Riemann integrability and pointwise discontinuity are equivalent. In 1875 Henry Smith proved Hankel wrong. Indeed, let A ⊆ [0, 1] be any closed nowhere dense set of positive measure (like one of the fat Cantor sets) and let f = χ_A : [0, 1] → R, that is, f(x) = 1 if x ∈ A and f(x) = 0 if x ∉ A. Here's a graph of f:

Figure 4.17. The set A is the "dust" particles on the horizontal axis.

It's easy to see that f is continuous at each point of the open set [0, 1] \ A (where f = 0) and f is discontinuous at each point of A. In particular, since A is nowhere dense, the points of continuity form a dense set in [0, 1]. However, in Problem 11 you
will prove that f is not Riemann integrable, and hence we have a counterexample to Hankel's theorem.

◮ Exercises 4.5.

1. Prove that the Cantor set is totally disconnected and nowhere dense.

2. In this exercise, we prove that the Cantor set is exactly those numbers in [0, 1] that have a ternary decimal expansion containing only the digits 0 and 2.
(a) Prove, for instance by induction on n, that

C_{α_1...α_n} = { x ∈ [0, 1] ; α_1/3 + α_2/3^2 + · · · + α_n/3^n ≤ x ≤ α_1/3 + α_2/3^2 + · · · + α_n/3^n + 1/3^n }.

At the same time, show that [0, 1] \ C_n = ⋃_{k=0}^{n−1} B_k, where B_k = ⋃ B_{α_1...α_k} with the union over all k-tuples (α_1, . . . , α_k) of 0's and 2's, where

B_{α_1...α_k} = { x ∈ [0, 1] ; α_1/3 + · · · + α_k/3^k + 1/3^{k+1} < x < α_1/3 + · · · + α_k/3^k + 2/3^{k+1} }.

When k = 0, we interpret the k-tuple as empty and put B_0 as the interval (1/3, 2/3). By definition of the Cantor set, note that [0, 1] \ C = ⋃_{k=0}^∞ B_k. We remark that one can also use 0's and 1's to index the sets C_{α_1...α_n} instead of 0's and 2's; for instance, we can write B_k = ⋃ B_{a_1...a_k} with the union over all k-tuples of 0's and 1's, where

(4.16) B_{a_1...a_k} = { x ∈ [0, 1] ; 2a_1/3 + · · · + 2a_k/3^k + 1/3^{k+1} < x < 2a_1/3 + · · · + 2a_k/3^k + 2/3^{k+1} }.

When k = 0, we regard B_{a_1...a_k} as the interval (1/3, 2/3).
(b) Given x ∈ C, prove that we can write

(4.17) x = α_1/3 + α_2/3^2 + · · · + α_n/3^n + · · · , α_j ∈ {0, 2}.

(c) Finally, prove that any number of the form (4.17) is an element of C.

3. (Cf. [200]) In this problem we give some tricks for producing points in the Cantor set.
(a) Prove that the Cantor set is invariant under reflection about 1/2; that is, given x ∈ [0, 1], prove that x ∈ C if and only if 1 − x ∈ C.
(b) Prove that the Cantor set is invariant under division by 3; that is, given x ∈ [0, 1], prove that x ∈ C if and only if x/3 ∈ C.
(c) Starting from the number 1/4 ∈ C, by reflecting about 1/2 and dividing by 3, verify that the following numbers also belong to the Cantor set: 1/4, 3/4, 1/12, 11/12, 1/36, 11/36, 25/36, 35/36.
4. In this problem we search for more points in the Cantor set.
(a) Prove that if k ∈ N, then 2/(3^k − 1) ∈ C.
(b) Prove that if k ∈ N, then 1/(3^k + 1) ∈ C. Suggestion: Expanding in a geometric series, write

1/(3^k + 1) = (1/3^k) · 1/(1 + 1/3^k) = 1/3^k − 1/3^{2k} + 1/3^{3k} − 1/3^{4k} + 1/3^{5k} − 1/3^{6k} + · · · ,

and observe that

1/3^{jk} − 1/3^{(j+1)k} = 2/3^{jk+1} + · · · + 2/3^{(j+1)k}.

Using these facts, prove that 1/(3^k + 1) ∈ C.
(c) For k ∈ N and 0 ≤ ℓ ≤ k − 1, show that 3^ℓ/(3^k + 1) ∈ C and 2 · 3^ℓ/(3^k − 1) ∈ C.

5. (Properties of the Cantor function) We analyze the Cantor function more closely.
(a) Prove that if x, y ∈ C are end points of an open deleted middle third interval, then ψ(x) = ψ(y). Suggestion: Use (4.16) in Problem 2.
(b) Prove that ψ : C → [0, 1] is nondecreasing. Suggestion: Let x, y ∈ C with x < y and write x = ∑_{j=1}^∞ 2a_j/3^j and y = ∑_{j=1}^∞ 2b_j/3^j, where a_j, b_j ∈ {0, 1}. Let k ∈ N be the smallest natural number such that a_k ≠ b_k. Thus, a_1 = b_1, . . . , a_{k−1} = b_{k−1} but a_k ≠ b_k. Show that a_k < b_k. Using this fact, prove that ψ(x) ≤ ψ(y).
(c) Show that ψ : [0, 1] → [0, 1] is nondecreasing.
(d) Show that ψ : [0, 1] → [0, 1] is continuous.
(e) Let n ∈ N. Show that the range of ψ on [0, 1] \ C_n equals {ℓ/2^n ; ℓ = 1, 2, . . . , 2^n − 1}. As a consequence, show that [0, 1] \ C_n = ⋃_{ℓ=1}^{2^n − 1} D_ℓ for some open intervals D_ℓ, where (say) the right end points of the intervals D_ℓ form an increasing sequence such that ψ = ℓ/2^n on the ℓ-th interval D_ℓ. (We remark that some authors define the Cantor function on [0, 1] \ C via this property. That is, given x ∈ [0, 1] \ C, we have x ∈ [0, 1] \ C_n for some n, and then one defines ψ(x) = ℓ/2^n if x is in the ℓ-th interval D_ℓ. Defining ψ in this way, you have to check that ψ(x) is defined independently of the n chosen. You then have to extend ψ from [0, 1] \ C to the Cantor set C. See Problem 10 for an explanation of this method to define ψ.)

6. (cf. [54]) (Characterization of the Cantor function)
(i) Prove that the Cantor function ψ : [0, 1] → [0, 1] satisfies, for all x ∈ [0, 1]: (1) ψ is nondecreasing; (2) ψ(x/3) = ψ(x)/2; (3) ψ(1 − x) = 1 − ψ(x).
(ii) Prove that if f : [0, 1] → [0, 1] is continuous and satisfies (1), (2) and (3), then f is the Cantor function. Suggestion: Prove by induction on k that

f(2a_1/3 + · · · + 2a_k/3^k + 2/3^{k+1}) = a_1/2 + · · · + a_k/2^k + 1/2^{k+1}

for all a_1, . . . , a_k ∈ {0, 1}. The formula 1 = 1/3^k + ∑_{i=1}^k 2/3^i, which holds for all k ∈ N, might come in handy at one point in your proof.

7. (cf. [112, 53]) In this problem we show there exists a Lebesgue measurable set that is not a Borel set, that Lebesgue measure on the Borel sets is not complete, and that Lebesgue measurability is not preserved under homeomorphisms.
(i) Let ψ : [0, 1] → [0, 1] be the Cantor function, and let ϕ(x) = (x + ψ(x))/2. Prove that ϕ : [0, 1] → [0, 1] is strictly increasing, that is, if x < y, then ϕ(x) < ϕ(y).
In particular, ϕ is a continuous bijection of [0, 1] onto [0, 1].
(ii) Show that ϕ(C) is Lebesgue measurable with measure 1/2. Suggestion: Note that [0, 1] = ϕ(C) ∪ ϕ([0, 1] \ C). Thus, it suffices to show that ϕ([0, 1] \ C) is measurable with measure 1/2. Recall that B = [0, 1] \ C is the union of all the open middle thirds that were removed from C and that on each of these open middle thirds, ψ is constant. Let this (pairwise disjoint) union be written as B = ⋃_{n=1}^∞ I_n. Then,

ϕ(B) = ⋃_{n=1}^∞ ϕ(I_n) = ⋃_{n=1}^∞ ( a_n/2 + (1/2) I_n ),

where a_n is the value of ψ on the open interval I_n.
(iii) Show that there exists a Lebesgue measurable set M ⊆ C such that ϕ(M) is not Lebesgue measurable. In particular, Lebesgue measurability is not preserved under homeomorphisms. Suggestion: Use Vitali's Theorem 4.15 on the set ϕ(C).
(iv) Show that M is not a Borel set. Since M ⊆ C and C is Borel measurable (why?) with measure zero, it follows that Borel measure, that is, Lebesgue measure on the Borel sets, is not complete.

8. (Steinhaus' Cantor set theorem, cf. [192], [234], [294]) Prove Hugo Steinhaus' (1887–1972) theorem for the Cantor set, which states that

C + C := {x + y ; x, y ∈ C} = [0, 2].

(See Problem 5 in Exercises 4.4 for another Steinhaus theorem.) Thus, even though the Cantor set seems to be very tiny, when you add all its points together you fill up the interval [0, 2]. Suggestion: Given a ∈ [0, 2], consider the ternary expansion of a/2.

9. If you liked the previous problem, here are some related ones.
(a) Using Steinhaus' Cantor set theorem, show that C − C = [−1, 1].
(b) If n ∈ N, put S_n = C + C + · · · + C (n copies of C) = {x_1 + · · · + x_n ; x_i ∈ C}. Prove that S_n = [0, n].
(c) Show that for n ∈ N, we have S_n − S_n = [−n, n].
(d) Show that there is a set A of measure zero such that A + A = R. Suggestion: Let A = ⋃_{n=1}^∞ nC, where nC = {nx ; x ∈ C}.

10. (Thick Cantor sets and functions) Fix k ∈ N with k > 2. In this problem we give a detailed construction of the Cantor set C(k) and its Cantor function ψ_k : [0, 1] → R. Given bounded intervals I and J, we write I < J if the right end point of I is ≤ the left end point of J. It would be helpful to draw many pictures during this exercise!
(i) Stage 1: Remove the open middle interval from [0, 1] of length 1/k. Denote the remaining closed left-hand interval by C_{1,1}, the remaining closed right-hand interval by C_{1,2}, and the open removed interval by B_{1,1}. Note that C_{1,1} < B_{1,1} < C_{1,2}. Let C_1 = C_{1,1} ∪ C_{1,2} and B_1 = B_{1,1}, and observe that C_1 is a union of 2^1 closed intervals, B_1 is a union of 2^1 − 1 open intervals, C_1 ∪ B_1 = [0, 1], and m(C_1) = 1 − 1/k. Define f_1 : B_1 → R by f_1(x) = 1/2 for all x ∈ B_1. Induction step: Suppose n ∈ N and assume that disjoint sets C_n and B_n have been defined, where C_n = ⋃_{j=1}^{2^n} C_{n,j} with the C_{n,j}'s pairwise disjoint closed intervals of equal length, and B_n = ⋃_{j=1}^{2^n − 1} B_{n,j} with the B_{n,j}'s pairwise disjoint open intervals (of possibly different lengths). Assume that m(C_n) = 1 − 1/k − 2/k^2 − · · · − 2^{n−1}/k^n (so each C_{n,j} has measure exactly m(C_n)/2^n), C_n ∪ B_n = [0, 1], and

C_{n,1} < B_{n,1} < C_{n,2} < B_{n,2} < · · · < C_{n,2^n − 1} < B_{n,2^n − 1} < C_{n,2^n}.

Finally, assume that f_n : B_n → R has been defined where
f_n(x) = j/2^n for x ∈ B_{n,j}, j = 1, 2, . . . , 2^n − 1.

It would be helpful to draw a picture of f_n, which looks like a "staircase". Prove that m(C_{n,j}) > 1/k^{n+1} for all j. Then remove from each C_{n,j} the open middle interval of length 1/k^{n+1} (which leaves closed intervals of positive length) and define sets C_{n+1} and B_{n+1} having the same properties as C_n and B_n with n replaced by n + 1, and define f_{n+1} : B_{n+1} → R in a similar manner as f_n.
(ii) Prove that C_{n+1} ⊆ C_n and B_n ⊆ B_{n+1} for all n, and, defining C(k) = ⋂_{n=1}^∞ C_n and B = ⋃_{n=1}^∞ B_n, prove that C(k) and B are disjoint, C(k) ∪ B = [0, 1], and m(C(k)) = (k − 3)/(k − 2). Prove Theorem 4.20.
(iii) Prove that for each n ∈ N, f_{n+1}(x) = f_n(x) for all x ∈ B_n. (Recall that B_n ⊆ B_{n+1}.) Define f : B → R as follows: If x ∈ B = ⋃_n B_n, then x ∈ B_n for some n and we put f(x) := f_n(x). Prove that f(x) is well-defined, independent of the n chosen, and prove that f : B → R is locally constant and nondecreasing.
(iv) Define ψ_k : [0, 1] → R as follows: ψ_k(0) = 0, and given any x ∈ (0, 1], define ψ_k(x) := sup {f(y) ; y ∈ B and y < x}. Prove that ψ_k is nondecreasing, continuous, and ψ_k = f on B.

11. (Counterexample to Hankel) Using Riemann sums (or upper and lower Darboux sums if you wish), prove that if A ⊆ [0, 1] is a closed nowhere dense set with positive measure, then χ_A : [0, 1] → R is not Riemann integrable.
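The suggestion in Problem 8 can be turned into an explicit algorithm: take the ternary digits of a/2 and split each digit c as c = (α + β)/2 with α, β ∈ {0, 2}, sending 0 → (0, 0), 1 → (0, 2), and 2 → (2, 2); the two resulting digit strings give points x, y ∈ C with x + y = a. A floating-point sketch (the function name `cantor_summands` and the 40-digit truncation are our choices; this is a numerical illustration, not a proof):

```python
def cantor_summands(a, n_digits=40):
    """Steinhaus: write a in [0, 2] as x + y with x, y in the Cantor set,
    by splitting each ternary digit c of a/2 as c = (alpha + beta)/2
    with alpha, beta in {0, 2}."""
    t = a / 2
    x = y = 0.0
    p = 1.0
    for _ in range(n_digits):
        t *= 3
        c = min(int(t), 2)      # guard against float round-up past digit 2
        t -= c
        p /= 3
        alpha, beta = {0: (0, 0), 1: (0, 2), 2: (2, 2)}[c]
        x += alpha * p
        y += beta * p
    return x, y

for a in (0.0, 0.3, 1.0, 1.7, 2.0):
    x, y = cantor_summands(a)
    assert abs((x + y) - a) < 1e-12
    assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
```

Both outputs have ternary expansions using only the digits 0 and 2, so (up to the truncation) they are points of C summing to a.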
Notes and references on Chapter 4

§4.1: The Gambler's ruin problem was first posed and solved by Blaise Pascal (1623–1662) (see [81] for some history on Gambler's ruin, especially Pascal's rôle in the problem).

§4.2: It's interesting to note that Borel's celebrated 1909 paper [36] was not rigorous, at least by today's standards. Here are Borel's words leading up to his description of the Law of Large Numbers [36, p. 258]:
We propose to study the probabilities for a decimal fraction belonging to a given set assuming 1. The decimal digits are independent; 2. Each of them takes each of the values 0, 1, 2, 3, . . . , q − 1 with probability 1/q. There is no need to insist on the somewhat arbitrary character of these two hypotheses: the first, in particular, is necessarily inexact, if we consider, as one is always forced to do in practice, a decimal number defined by a law, whatever the nature of this law. It may nevertheless be interesting to study the consequences of this assumption, precisely with the goal of realizing the extent to which things happen as if this hypothesis holds.

Borel's paper was not rigorous, but it was made so by Hugo Steinhaus (1887–1972) in 1922 [269]; in this paper he calls Borel's Normal Number Theorem "Borel's Paradox":

Mr. E. Borel has been the first to show the interest of the study of enumerable probability,15 and he gave some applications to arithmetics that he discovered on this path. Among those applications, the following theorem, known as "Borel's paradox," drew the attention of analysts: "The probability that the frequency of the digit 0 in the dyadic expansion of a random number be 1/2, equals one," where we call frequency of the digit 0 the limit value of the quotient by n of the number of times that this digit appears in the first n digits of the expansion. We can find, in different authors, the following statement of the same theorem: "Almost all numbers α have the property that the frequency of the digit 0 in their dyadic expansion equals 1/2," where almost all means that the Lebesgue measure of the set of the α that do not enjoy this property is zero. To prove this statement, it is sufficient to change the wording of Mr. Borel's original proof, without changing the core idea. The goal of this note is to establish a system of postulates for enumerable probability that will allow once and for all to switch from one interpretation to the other in this kind of research.

It's argued that Van Vleck produced the first Zero-One law [208]. See [233] for another example showing that finite additivity does not imply the SLLN.

§4.3: I learned Littlewood's First Principle from Royden's text [242, p. 72].

§4.4: The story of the Axiom of Choice is fascinating. Let us start in 1900, at the International Congress of Mathematicians, where David Hilbert (1862–1943) gave a list of 23 open problems in mathematics, the first problem of which was Cantor's problem of the cardinal number of the continuum. As part of this problem, he asked to well order the real numbers [128]:

The question now arises whether the totality of all numbers may not be arranged in another manner so that every partial assemblage may have a first element, i.e., whether the continuum cannot be considered as a well ordered assemblage — a question which Cantor thinks must be answered in the affirmative.

15 "Enumerable probability" is countably additive probability, in contrast to finitely additive probabilities.
NOTES AND REFERENCES ON CHAPTER ??
241
In 1904, Zermelo wrote a letter to Hilbert [295, pp. 139–141] proving the Well Ordering Theorem, the fact that an arbitrary set can be well ordered.16 To prove this theorem, Zermelo stated and used the Axiom of Choice and called it a "logical principle" because "it is applied without hesitation everywhere in mathematical deductions" [295, p. 141]. Now although "applied without hesitation everywhere," Zermelo received much criticism for his proof; here, for example, are some harsh words by Borel against the Axiom of Choice [39, pp. 1251–1252]:

One cannot, in fact, hold as valid the following reasoning, to which Mr. Zermelo refers: "it is possible, in a particular set M′, to choose any specific element m′; since this choice can be made for any of the sets M′, it can be made for the set of those sets." Such reasoning seems to me not to be better founded than the following: "To well order a set M, it is enough to choose an arbitrary element in it, to which we assign the rank 1, then another, to which we will attribute the rank 2, and so on transfinitely, in other words until we exhaust all elements of M by the sequence of transfinite numbers." Now, no mathematician will regard such reasoning as valid. It seems to me that the objections that one can raise against it are valid objections against every reasoning involving an arbitrary choice repeated an unenumerable infinity of times; such reasonings are out of the scope of mathematics.

It's remarkable that for many years, both Borel and Lebesgue held similar views concerning the Axiom of Choice, although a great deal of their own work in measure theory relied on it. Before 1908, set theory did not have an axiomatic foundation; rather, it was "naïve set theory" or, loosely speaking, "logical principles" applied to sets. The controversy surrounding Zermelo's Axiom was the impetus that led him to axiomatize set theory in 1908 [295, pp. 183–215]; by doing so, with his Axiom of Choice as one of the cornerstones, he could secure his Axiom and his proof of the Well Ordering Theorem on a sound foundation.17 Unfortunately, Zermelo's axiomatization didn't win people over to his Axiom of Choice; indeed, it opened up new attacks on his axiomatic system! However, the tide in favor of the Axiom began to turn in 1916, when the Polish mathematician Wacław Sierpiński (1882–1969) started publishing papers on the subject of set theory and analysis and their dependence on Zermelo's Axiom, which culminated in a 55-page article in 1918 on this subject [258]. In fact, the Warsaw school of mathematics, which Sierpiński was a part of, played a large role in the dissemination of the Axiom's place in mathematics. As the years passed, this role expanded to diverse mathematical fields. Moreover, due to the work of Gödel and Cohen, which shows that the Axiom of Choice is logically independent of ZF, any fears of Zermelo's Axiom producing a mathematical contradiction were eliminated. Thus, nowadays Zermelo's Axiom is accepted without qualms. Now we really can't do justice to the fascinating story of the Axiom, and it would fill an entire book to discuss in depth its origins, development and influence; thankfully such a book is available, see [202, 203]. For more information about the Banach-Tarski Paradox and related paradoxes, see e.g. [33, 278, 202, 303, 63, 305]. In particular, the title A Paradox, a Paradox, a Most Ingenious Paradox of Subsection 4.4.4 was inspired by the paper [33]. The measure problem: In Corollary 4.17 we saw that there does not exist a translation invariant measure (= countably additive set function) on P(R^n) that gives

16 That is, any set X has a relation "<" under which every nonempty subset of X has a least element.

. . . ϕ_{α,β} > 0 on (α, β), and ϕ_{α,β} = 1 on a subinterval of (α, β). We can simply draw the graph of such a function with a pencil, which will look something like that shown here:

Figure 5.2. The function ϕ_{α,β}(t): it vanishes outside (α, β), rises on (α, α + ε), equals 1 on [α + ε, β − ε], and falls back to 0 on (β − ε, β).

Of course, although it is "obvious" that such a function exists, it still requires proof! We shall in fact prove a higher dimensional result that we'll need later in this book.

Lemma 5.2. Given ε > 0, there is a smooth function ϕ : R → R that is nondecreasing, ϕ(t) = 0 for t ≤ 0, ϕ(t) > 0 for t > 0, and ϕ(t) = 1 for t ≥ ε.

Proof: A picture of ϕ is given on the left-hand side of Figure 5.3. To explicitly
Figure 5.3. ϕ(t) and its reflection ϕ(β − t) about the point t = β.

construct this function, let

f(t) := e^{−1/t} for t > 0, and f(t) := 0 for t ≤ 0.

(The graph of f vanishes for t ≤ 0 and increases smoothly toward 1 as t → ∞.)
In Problem 1 we ask you to check that f(t) is a smooth function on R vanishing for t ≤ 0. Now define ϕ : R → R by the formula

ϕ(t) = f(t) / (f(t) + f(ε − t)).
5. BASICS OF INTEGRATION THEORY
Note that for any t ∈ R, either t or ε − t is positive, so the denominator f(t) + f(ε − t) is positive for all t ∈ R. By the Quotient Rule it follows that ϕ(t) is infinitely differentiable. We shall prove that ϕ satisfies our requirements. First, since f(t) > 0 for t > 0 we see that ϕ(t) > 0 for t > 0. Second, from the fact that f(t) = 0 for t ≤ 0, we see that ϕ(t) = 0 for t ≤ 0 and ϕ(t) = 1 for t ≥ ε (this last condition holds since f(ε − t) = 0 for t ≥ ε so ϕ(t) = f(t)/f(t) = 1 for t ≥ ε). Finally, to see that ϕ is nondecreasing, we compute its derivative explicitly:

ϕ′(t) = ( f′(t) f(ε − t) + f(t) f′(ε − t) ) / ( f(t) + f(ε − t) )²,
which is nonnegative since both f and f ′ are nonnegative. This proves the properties of ϕ.
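As a quick numerical sanity check of the construction (our own sketch, not part of the text; ε is set to 1 and the function names are ours), one can evaluate f and ϕ and confirm the claimed properties:

```python
import math

def f(t):
    # f(t) = e^{-1/t} for t > 0 and f(t) = 0 for t <= 0
    return math.exp(-1.0 / t) if t > 0 else 0.0

EPS = 1.0

def phi(t):
    # phi(t) = f(t) / (f(t) + f(EPS - t)); the denominator is never 0,
    # since t > 0 or EPS - t > 0 for every t
    return f(t) / (f(t) + f(EPS - t))

assert phi(-3.0) == 0.0          # phi = 0 for t <= 0
assert phi(2.0) == 1.0           # phi = 1 for t >= EPS
assert phi(0.5) > 0.0            # phi > 0 for t > 0
# phi is nondecreasing
vals = [phi(t / 20) for t in range(-10, 31)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
```

Of course the assertions only sample finitely many points; the proof above is what establishes the properties for all t.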
Proposition 5.3. Let (a, b) be a nonempty open box in Rn and let ε > 0 so that the box (a + ε, b − ε) is not empty. Then there exists a smooth function ϕ : Rn → R such that 0 ≤ ϕ ≤ 1, ϕ > 0 on the box (a, b), ϕ = 1 on the box [a + ε, b − ε], and ϕ = 0 outside of the box (a, b). (See Figure 5.2.) Proof : Given ε > 0, by our lemma there is a smooth function ϕ : R → R that is nondecreasing, zero for t ≤ 0, positive for t > 0, and one for t ≥ ε. Given real numbers α < β, define ϕα,β(t) = ϕ(t − α) ϕ(β − t). Using the properties of ϕ, one can check that ϕα,β has the properties illustrated in Figure 5.2. If (a, b) = (a1, b1) × · · · × (an, bn) and we define ϕ(x) = ϕa1,b1(x1) · · · ϕan,bn(xn), where x = (x1, . . . , xn), then ϕ satisfies the conditions of the proposition.
5.1.3. Proof of Theorem 5.1. Let U = ∪_{k=1}^∞ Ik be a dense open set written as a union of open intervals, whose complement A := [0, 1] \ U has positive measure (e.g. a thick Cantor set or a set constructed as we did in (5.1)). For each interval Ik = (ak, bk), choose εk > 0 such that (ak + εk, bk − εk) is not empty. Let ϕk : R → R be any function as described in Proposition 5.3 (and shown in Figure 5.2) having the property that 0 ≤ ϕk ≤ 1, ϕk > 0 on Ik, ϕk = 1 on [ak + εk, bk − εk], and ϕk = 0 outside of Ik. Define ψk : R → R by ψk = 1 − ϕk; see Figure 5.4 for a picture of ψk. The only properties we need of ψk are that ψk
[Figure 5.4. The function ψk(x): it equals 1 outside (ak, bk), equals 0 on [ak + εk, bk − εk], and satisfies 0 ≤ ψk ≤ 1.]
5.1. INTRODUCTION: INTERCHANGING LIMITS AND INTEGRALS
is smooth, and 0 ≤ ψk(x) ≤ 1 for all x.

◮ Exercises 5.1.
1. Prove that the function f(t) of Lemma 5.2 is smooth and that for t > 0, its n-th derivative f^(n) can be written as

f^(n)(t) = (pn−1(t)/t^{2n}) e^{−1/t} if t > 0, and f^(n)(t) = 0 if t ≤ 0,

where pn−1(t) is the (n − 1)-degree polynomial defined recursively by p0(t) = 1 and

pn(t) = t² p′n−1(t) − (2nt − 1) pn−1(t),  n = 1, 2, . . . .

To prove that f^(n)(0) = 0, L'Hospital's Rule will be helpful.
2. In this problem and the next we prove continuous versions of Theorem 5.1. (i) Let (a, b) ⊆ R be a nonempty open interval and let ε > 0 so that the closed interval [a + ε, b − ε] is not empty. Explicitly define, for example using a piecewise linear function, a continuous function ϕ : R → R such that 0 ≤ ϕ ≤ 1, ϕ > 0 on (a, b), ϕ = 1 on [a + ε, b − ε], and ϕ = 0 outside of (a, b).
(ii) Now follow the proof of Theorem 5.1 to find a nonincreasing sequence of continuous functions fn : [0, 1] → R such that the limit function f : [0, 1] → R is not Riemann integrable. In the following problems we give a different method to find such a sequence {fn}.
3. This problem is used in Problem 4. Given any nonempty closed set C ⊆ R define

d(x, C) := inf{ |x − z| ; z ∈ C }.
The number d(x, C) is the distance from x to C. (a) Show that we can replace "inf" with "min"; in other words, there exists a point z0 ∈ C such that d(x, C) = |x − z0|. (b) Define f : R → R by f(x) := d(x, C). Prove that f is continuous. In fact, prove that for any points x, y ∈ R, we have |d(x, C) − d(y, C)| ≤ |x − y|.
4. Let A ⊆ [0, 1] be any Cantor-type set in [0, 1] with positive measure that we constructed in Section 4.5 and write A = ∩_{k=1}^∞ Ak, where A1 ⊇ A2 ⊇ · · · and Ak is the union of the 2^k pairwise disjoint closed intervals left over at the k-th stage of the construction of A. For each n ∈ N, define fn : [0, 1] → R by

fn(x) := 1 − n min{ 1/n, d(x, An) };

or more explicitly,
fn(x) = 1 − n d(x, An) if d(x, An) ≤ 1/n, and fn(x) = 0 if d(x, An) > 1/n.

Notice that fn(x) = 1 if x ∈ An and fn(x) = 0 if d(x, An) > 1/n. (i) Prove that fn is continuous. (ii) Prove that {fn} is a nonincreasing sequence of functions. (iii) Prove that for x ∈ [0, 1], lim_{n→∞} fn(x) = χA(x).
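These constructions are concrete enough to check numerically. The sketch below (our own illustration, not part of the exercises; the shrinking Cantor-type sets An are frozen to a fixed finite union of closed intervals) computes d(x, C), spot-checks the Lipschitz bound of Problem 3(b), and verifies the behavior of the functions fn:

```python
def dist(x, C):
    # d(x, C) for C a finite union of closed intervals,
    # given as a list of (left, right) pairs
    return min(0.0 if a <= x <= b else min(abs(x - a), abs(x - b))
               for a, b in C)

C = [(0.1, 0.3), (0.6, 0.9)]   # stands in for a set A_n

# Problem 3(b): |d(x,C) - d(y,C)| <= |x - y|
pts = [i / 23 for i in range(-5, 30)]
for x in pts:
    for y in pts:
        assert abs(dist(x, C) - dist(y, C)) <= abs(x - y) + 1e-12

def f_n(x, n, A=C):
    # f_n(x) = 1 - n * min(1/n, d(x, A_n)), with A_n frozen to C here
    return 1.0 - n * min(1.0 / n, dist(x, A))

assert f_n(0.2, 5) == 1.0               # x in the set, so f_n = 1
assert abs(f_n(0.45, 10)) < 1e-12       # d(x, C) = 0.15 > 1/10
# {f_n} is nonincreasing in n at each point
for x in pts:
    vals = [f_n(x, n) for n in range(1, 40)]
    assert all(u >= v - 1e-12 for u, v in zip(vals, vals[1:]))
```

With a fixed closed set the pointwise limit is its characteristic function, matching part (iii) of Problem 4.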
5.2. Measurable functions and Littlewood's second principle

In this section we study the concept of measurability. We shall see that measurable functions are basically very robust (or strong or durable) continuous-like functions. We make "continuous-like" precise in Luzin's Theorem (Theorem 5.9), which is also known as Littlewood's second principle. We also study the concept of almost everywhere.

5.2.1. Measurable functions. A measurable space is a pair (X, S ) where X is a set and S is a σ-algebra of subsets of X. The elements of S are called measurable sets. Recall that a measure space is a triple (X, S , µ) where µ is a measure on S ; if we leave out the measure we have a measurable space. Our goal now is to define what a measurable function is, but before doing so we briefly recall Lebesgue's integral from Section 1.1 when we went over Lebesgue's seminal paper Sur une généralisation de l'intégrale définie [165]. Given a bounded function f : [α, β] → [0, ∞), the idea to find the area under f is to partition the range of the function rather than the domain. Here's a picture:

[Figure: the graph of f over [α, β], drawn twice; the range is partitioned by values m0 < m1 < m2 < m3, the sets E1, E2, E3 ⊆ [α, β] are the corresponding preimages, and on the right shaded rectangles over the Ei approximate the area under f from below.]
In this specific example we partition the range into the point {m0} and the three subintervals (m0, m1], (m1, m2] and (m2, m3], and the shaded rectangles on the right approximate the area under f "from below" (there is a similar approximation to the area under f "from above"). The area under the shaded rectangles shown on the right is

(5.2)  Area under f ≈ m0 m(E0) + m0 m(E1) + m1 m(E2) + m2 m(E3),
where E0 = {x ; f(x) = m0} (which consists of just the single point α for this specific graph) and E1 = {x ; m0 < f(x) ≤ m1} = f−1(m0, m1], with similar expressions for E2, E3. If we put m−1 = −1 (or any other negative number) we could also write E0 as E0 = {x ; m−1 < f(x) ≤ m0} = f−1(m−1, m0] so that all the Ei's have the same form. Now the sum (5.2) only makes sense if each set Ei is Lebesgue measurable (so that m(Ei) is defined without having to worry about pathologies). Thus, it makes sense to consider only those functions for which f−1(a, b] is Lebesgue measurable for all a, b ∈ R. Now this requirement makes sense for any measurable space where we replace "Lebesgue measurable" with whatever measurable sets we are working with! In general, we deal with extended real-valued functions, in which case we allow b to equal ∞. We shall return to Lebesgue's definition of the integral in Section 5.4. This discussion suggests the following definition: Given a measurable space (X, S ) and function f : X → R, if (1) f−1(a, b] ∈ S and (2) f−1(a, ∞] ∈ S for all a ∈ R and b ∈ R, we say that f is measurable.2 It turns out that we can omit Condition (1) because it follows from (2). Indeed, since f−1(a, b] = f−1(a, ∞] \ f−1(b, ∞], and S is a σ-algebra, if both right-hand sets are in S , then so is the left-hand set. Hence, measurability just requires f−1(a, ∞] ∈ S for each a ∈ R. We are thus led to the following definition: A function f : X → R is measurable if f−1(a, ∞] ∈ S for each a ∈ R. We emphasize that the definition of measurability is not "artificial" but is required by Lebesgue's definition of the integral! If X is the sample space of some experiment, a measurable function is called a random variable; thus,

In probability, random variable = measurable function.

We note that intervals of the sort (a, ∞] are not special, and sometimes it is convenient to use other types of intervals.
2As a reminder, for any A ⊆ R, f −1 (A) := {x ∈ X ; f (x) ∈ A}, so for instance f −1 (a, ∞] = {x ∈ X ; f (x) ∈ (a, ∞]} = {x ∈ X ; f (x) > a}, or leaving out the variable x, f −1 (a, ∞] = {f > a}.
Proposition 5.4. For a function f : X → R, the following are equivalent:
(1) f is measurable.
(2) f−1[−∞, a] ∈ S for each a ∈ R.
(3) f−1[a, ∞] ∈ S for each a ∈ R.
(4) f−1[−∞, a) ∈ S for each a ∈ R.
Proof : Since preimages preserve complements, we have

(f−1(a, ∞])^c = f−1((a, ∞]^c) = f−1[−∞, a].
Since σ-algebras are closed under complements, we have (1) ⇐⇒ (2). Similarly, the sets in (3) and (4) are complements, so we have (3) ⇐⇒ (4). Thus, we just need to prove (1) ⇐⇒ (3). Assuming (1) and writing

[a, ∞] = ∩_{n=1}^∞ (a − 1/n, ∞]  =⇒  f−1[a, ∞] = ∩_{n=1}^∞ f−1(a − 1/n, ∞],

we have f−1[a, ∞] ∈ S since each f−1(a − 1/n, ∞] ∈ S and S is closed under countable intersections. Thus, (1) =⇒ (3). Similarly,

(a, ∞] = ∪_{n=1}^∞ [a + 1/n, ∞]  =⇒  f−1(a, ∞] = ∪_{n=1}^∞ f−1[a + 1/n, ∞],

shows that (3) =⇒ (1).
As a consequence of this proposition, we can prove that measurable functions are closed under scalar multiplication. Indeed, let f : X → R be measurable and let α ∈ R; we'll show that αf is also measurable. Assume that α ≠ 0 (the α = 0 case is easy) and observe that for any a ∈ R,

(αf)−1(a, ∞] = {x ; αf(x) > a} = {x ; f(x) > a/α} if α > 0, and {x ; f(x) < a/α} if α < 0;

that is, (αf)−1(a, ∞] = f−1(a/α, ∞] if α > 0, and (αf)−1(a, ∞] = f−1[−∞, a/α) if α < 0.

By Proposition 5.4 it follows that (αf)−1(a, ∞] is measurable. We'll analyze more algebraic properties of measurable functions in Section 5.3. We now give examples of measurable functions. We first show that all "nice" functions are measurable. Example 5.1. Let X = Rn with Lebesgue measure. Then any continuous function f : Rn → R is measurable because for any a ∈ R, by continuity (the inverse image of any open set is open),

f−1(a, ∞] = f−1(a, ∞)
(where we used that f does not take the value ∞) is an open subset of Rn . Since open sets are measurable, it follows that f is measurable.
Thus, for Lebesgue measure, continuity implies measurability. However, the converse is far from true because there are many more functions that are measurable
than continuous. For instance, Dirichlet's function D : R → R,

D(x) = 1 if x ∈ Q, and D(x) = 0 if x ∉ Q,

is Lebesgue measurable. Note that D is nowhere continuous. That D is measurable follows from Example 5.2 below and the fact that D = χQ and Q is measurable.

Example 5.2. For a general measurable space X and set A ⊆ X, we claim that the characteristic function χA : X → R is measurable if and only if the set A is measurable. Indeed, looking at the picture in Figure 5.5,

[Figure 5.5. Graph of χA and three different a's: one with a ≥ 1, one with 0 ≤ a < 1, and one with a < 0.]

we see that

χA−1(a, ∞] = {x ∈ X ; χA(x) > a} = X if a < 0, A if 0 ≤ a < 1, and ∅ if a ≥ 1.

Hence, χA−1(a, ∞] ∈ S for all a ∈ R if and only if A ∈ S , which proves the claim. In particular, there exist non-Lebesgue measurable functions. In fact, given any nonmeasurable set A ⊆ Rn, the characteristic function χA : Rn → R is not measurable.
Of course, since A is non-constructive, so is χA. You will probably never find a nonmeasurable function in practice. The following example shows the importance of studying extended real-valued functions, instead of just real-valued functions. Example 5.3. Let X = S∞, where S = {0, 1}, be the sample space for a Monkey-Shakespeare experiment (or any other experiment involving a sequence of Bernoulli trials). Let f : X → [0, ∞] be the number of times the Monkey types sonnet 18:

f(x1, x2, x3, . . .) = the number of 1's in (x1, x2, x3, . . .) = x1 + x2 + x3 + · · · .

Notice that f = ∞ when the Monkey types sonnet 18 an infinite number of times (in fact, as we saw in Example 4.4 of Section 4.1, f = ∞ on a set of measure 1). To show that f is measurable, given a ∈ R, by Proposition 5.4 we just have to show that

f−1[−∞, a] = {x = (x1, x2, . . .) ; ∑_{n=1}^∞ xn ≤ a} ∈ S .
To prove this, observe that

x ∈ f−1[−∞, a] ⇐⇒ ∑_{n=1}^∞ xn ≤ a ⇐⇒ x1 + x2 + · · · + xn ≤ a for all n ∈ N.

Thus,

f−1[−∞, a] = ∩_{n=1}^∞ {x ; x1 + x2 + · · · + xn ≤ a}.

The set {x ; x1 + x2 + · · · + xn ≤ a} only depends on the first n coordinates of an infinite sequence (x1, x2, x3, . . .), so this set is of the form An × S × S × S × · · · , where An ⊆ S^n
is the subset of S^n consisting of those points with at most a entries equal to 1. In particular, {x ; x1 + x2 + · · · + xn ≤ a} ∈ R(C ) and hence, it belongs to S (C ). Therefore, {f ≤ a} also belongs to S (C ), so f is measurable.
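The equivalence used above — that for nonnegative terms the full sum is ≤ a exactly when every partial sum is ≤ a — is easy to sanity-check on finite prefixes (an informal sketch of our own, not from the text):

```python
from itertools import accumulate, product

a = 3
# for 0/1-valued terms, sum(x) <= a  <=>  every partial sum <= a,
# since the partial sums are nondecreasing and the last one is sum(x)
for x in product([0, 1], repeat=8):
    assert (sum(x) <= a) == all(s <= a for s in accumulate(x))
```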
Back in Section 2.1 we defined simple functions. For a quick review in the current context of our σ-algebra S , recall that a simple function (or S -simple function to emphasize the σ-algebra S ) is any function of the form

s = ∑_{n=1}^N an χAn,
where a1, . . . , aN ∈ R and A1, . . . , AN ∈ S are pairwise disjoint. By Corollary 2.3 we know that we don't have to take the An's to be pairwise disjoint, but for proofs it's often advantageous to do so.

Theorem 5.5. Any simple function is measurable.

Proof : Let s = ∑_{n=1}^N an χAn be a simple function where a1, . . . , aN ∈ R and A1, . . . , AN ∈ S are pairwise disjoint. If we put AN+1 = X \ (A1 ∪ · · · ∪ AN) and aN+1 = 0, then X = A1 ∪ A2 ∪ · · · ∪ AN ∪ AN+1, a union of pairwise disjoint sets, and s = an on An for each n = 1, 2, . . . , N + 1. As in the picture,

[Figure: the graph of a simple function taking values a1, a2, a3, a4 on the sets A1, A2, A3, A4, with a horizontal line at height a; the set {x ∈ X ; s(x) > a} is the union of the An with an > a.]

it follows that

s−1(a, ∞] = {x ∈ X ; s(x) > a} = ∪_{an > a} An,
where the union is over all n such that an > a. Hence, s−1 (a, ∞] is just a union of elements of S so s is measurable.
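For a concrete simple function the preimage formula can be checked directly. The sketch below is our own (the finite set X and the values are invented; Python sets stand in for measurable sets):

```python
# a simple function on X = {0,...,9}: s = 2 on A1, 5 on A2, 0 elsewhere
X = set(range(10))
A = {2.0: {0, 1, 2}, 5.0: {3, 4}}   # value a_n -> pairwise disjoint A_n
A[0.0] = X - {0, 1, 2, 3, 4}        # the leftover set A_{N+1}

def s(x):
    # s(x) = a_n for the unique A_n containing x
    return next(v for v, S in A.items() if x in S)

def preimage_gt(a):
    # s^{-1}(a, infinity] = {x in X ; s(x) > a}
    return {x for x in X if s(x) > a}

def union_formula(a):
    # union of the A_n with a_n > a
    return set().union(*(S for v, S in A.items() if v > a))

for a in [-1.0, 0.0, 1.5, 2.0, 4.9, 5.0, 7.0]:
    assert preimage_gt(a) == union_formula(a)
```

Since every such preimage is a finite union of the An, it lies in any σ-algebra containing them, mirroring the proof.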
5.2.2. Measurability, continuity and topology. In Example 5.1 we saw that continuity implies measurability, essentially by definition of continuity via open sets. It turns out we can express measurability directly in terms of open sets.

Measurability criteria

Theorem 5.6. For a function f : X → R, the following are equivalent:
(1) f is measurable.
(2) f−1({∞}) ∈ S and f−1(U) ∈ S for all open subsets U ⊆ R.
(3) f−1({∞}) ∈ S and f−1(B) ∈ S for all Borel sets B ⊆ R.

Proof : To prove that (1) =⇒ (2), observe that

{∞} = ∩_{n=1}^∞ (n, ∞]  =⇒  f−1({∞}) = ∩_{n=1}^∞ f−1(n, ∞].
Assuming f is measurable, we have f−1(n, ∞] ∈ S for each n and since S is a σ-algebra, it follows that f−1({∞}) ∈ S . If U ⊆ R is open, then by the Dyadic Cube Theorem we can write U = ∪_{n=1}^∞ In where In ∈ I^1 for each n. Hence,

f−1(U) = ∪_{n=1}^∞ f−1(In).
By measurability, f−1(In) ∈ S for each n, so f−1(U) ∈ S . To prove that (2) =⇒ (3), we just have to prove that f−1(B) ∈ S for all Borel sets B ⊆ R. To prove this, recall from Proposition 1.9 that

Sf = {A ⊆ R ; f−1(A) ∈ S }
is a σ-algebra. Assuming (2) we know that all open sets belong to Sf. Since Sf is a σ-algebra of subsets of R and B is the smallest σ-algebra containing the open sets, it follows that B ⊆ Sf. Finally, we prove that (3) =⇒ (1). Let a ∈ R and note that

(a, ∞] = (a, ∞) ∪ {∞}  =⇒  f−1(a, ∞] = f−1(a, ∞) ∪ f−1({∞}).
Assuming (3), we have f −1 ({∞}) ∈ S and since (a, ∞) ⊆ R is open, and hence is Borel, we also have f −1 (a, ∞) ∈ S . Thus, f −1 (a, ∞] ∈ S , so f is measurable.
We remark that the choice of using +∞ over −∞ in the "f−1({∞}) ∈ S " parts of (2) and (3) was arbitrary and we could have used −∞ instead of ∞. Consider Part (2) of Theorem 5.6, but only for real-valued functions: Measurability: A function f : X → R is measurable if and only if f−1(U) ∈ S for each open set U ⊆ R.
One cannot avoid noticing the striking resemblance to the definition of continuity. Recall that for a topological space (T, T ), where T is the topology on a set T , Continuity: A function f : T → R is continuous if and only if f −1 (U) ∈ T for each open set U ⊆ R.
Because of this similarity, one can think about measurability as a type of generalization of continuity. However, speaking philosophically, there are two big differences between measurable functions and continuous functions as we can see by considering X = Rn with Lebesgue measure and its usual topology: (i) There are more measurable functions than continuous ones. (ii) Measurable functions are closed under more operations than continuous functions.
The basic reason for these facts is that there are a lot more measurable sets than there are open sets. E.g. not only are open sets measurable but so are points, Gδ sets (countable intersections of open sets), Fσ sets (countable unions of closed sets), etc. To illustrate Point (ii), in Section 5.3 we shall see that measurable functions are closed under all limiting operations. For example, a limit of measurable functions is always measurable, which is false for continuous functions. Indeed, we saw in Section 5.1 that the characteristic function of a Cantor set can be expressed as a limit of continuous (in fact, differentiable) functions. To summarize this discussion, Measurable functions are similar to continuous functions, but there are more of them and they are more robust.
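A classical illustration of this limit phenomenon (our own standard example, not from the text): fn(x) = x^n is continuous on [0, 1], yet the pointwise limit is 0 for x < 1 and 1 at x = 1 — discontinuous, but still measurable:

```python
def f_n(x, n):
    # each f_n is continuous on [0, 1]
    return x ** n

def limit(x):
    # pointwise limit of x^n on [0, 1]: 0 for x < 1 and 1 at x = 1,
    # a discontinuous (but measurable) function
    return 1.0 if x == 1.0 else 0.0

for x in [0.0, 0.3, 0.9, 1.0]:
    assert abs(f_n(x, 2000) - limit(x)) < 1e-6
```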
With this said, in Section 5.2.3 we'll see that measurable functions are "nearly" continuous functions, just like measurable sets are "nearly" open sets. Since we are on the subject of topology, recall that the collection of Borel subsets of a topological space is the σ-algebra generated by the open sets. For a measurable
space (X, S ) where X is a topological space with S its Borel subsets, a measurable function f : X → R is called Borel measurable to emphasize that the σ-algebra S is the one generated by the topology and not just any σ-algebra on X. Thus,

f : X → R is Borel measurable ⇐⇒ f−1(a, ∞] ∈ B(X) for all a ∈ R.

In the case X = Rn with its usual topology,

f : Rn → R is Borel measurable ⇐⇒ f−1(a, ∞] ∈ B^n for all a ∈ R.
Proposition 5.7. Any continuous real-valued function on a topological space is Borel measurable.

The proof of this proposition follows word-for-word the Rn case in Example 5.1, so we omit it. A nice thing about Borel measurability is that it behaves well under composition. (The following proposition is false in general if f is assumed to be Lebesgue measurable; see Problem 9.)

Proposition 5.8. If f : R → R is Borel measurable and g : X → R is measurable, where X is an arbitrary measurable space, then the composition f ◦ g : X → R is measurable.
Proof : Given a ∈ R, we need to show that
(f ◦ g)−1 (a, ∞] = g −1 (f −1 (a, ∞]) ∈ S .
The function f : R → R is, by assumption, Borel measurable, so f−1(a, ∞] ∈ B^1. The function g : X → R is measurable, so by Part (3) of Theorem 5.6, g−1(f−1(a, ∞]) ∈ S . Thus, f ◦ g is measurable.

Example 5.4. If g : X → R is measurable, and f : R → R is the characteristic function of the rationals, which is Borel measurable, then Proposition 5.8 shows that the rather complicated function

(f ◦ g)(x) = 1 if g(x) ∈ Q, and (f ◦ g)(x) = 0 if g(x) ∉ Q,

is measurable. Other, more normal looking, functions of g that are measurable include e^{g(x)}, cos g(x), and g(x)² + g(x) + 1.
5.2.3. Littlewood’s second principle. We now continue our discussion of Littlewood’s Principles [183, p. 26] from Section 4.3 where we stated the first principle; here are all of them:
[Photo: Nikolai Luzin (1883–1950).]
There are three principles, roughly expressible in the following terms: Every [finite Lebesgue] (measurable) set is nearly a finite sum of intervals; every function (of class Lλ ) is nearly continuous; every convergent sequence of [measurable] functions is nearly uniformly convergent.
The first principle was illustrated in Theorem 4.10 and the third principle is contained in Egorov’s theorem (Theorem 5.15), which we’ll get to in the
next section. One common, although historically inaccurate,3 interpretation of the second principle comes from Luzin’s Theorem, named after Nikolai Nikolaevich Luzin (1883–1950) who proved it in 1912 [190], and this theorem makes precise Littlewood’s comment that any Lebesgue measurable function is “nearly continuous”. We remark that Vitali in 1905 was the first to state and prove Luzin’s theorem in the paper [299], although Vitali remarked it was known to Borel [35] and Lebesgue [166] in 1903. Luzin’s Theorem Theorem 5.9. Let X ⊆ Rn be Lebesgue measurable and let f : X → R be a Lebesgue measurable function. Then given ε > 0, there is a closed set C ⊆ Rn such that C ⊆ X, m(X \ C) < ε, and f is continuous on C. Proof : Before presenting the proof, let’s make sure we understand what it says. Take for example, Dirichlet’s function D : R → R, which is the characteristic function of the rationals:
D is discontinuous everywhere! Luzin's theorem says that given ε > 0 there is a closed subset C ⊆ R such that m(R \ C) < ε and f|C is continuous on C. In fact, it's easy to find such a closed set C consisting only of irrational numbers. Indeed, just let {r1, r2, . . .} be a list of all rational numbers in R and let In = (rn − ε/2^{n+2}, rn + ε/2^{n+2}). Then U := ∪_{n=1}^∞ In is open, so C := R \ U is closed, and

m(R \ C) = m(U) ≤ ∑_{n=1}^∞ m(In) = ∑_{n=1}^∞ ε/2^{n+1} = ε/2.
Thus, m(R \ C) < ε and since C is a subset of the irrational numbers, D|C ≡ 0.
The zero function is continuous, so this example verifies Luzin's theorem for Dirichlet's function. Now to the proof of Luzin's theorem. Step 1: We first prove the theorem only requiring that C be measurable (this proof is yet another example of the "ε/2^k-trick"). Let {Vk} be a countable basis of open sets in R; this means that every open set in R is a union of countably many Vk's. (For example, take the Vk's as open intervals with rational end points.) We want to find a measurable set C such that f|C : C → R is continuous and m(X \ C) < ε. The continuity of f|C : C → R is equivalent to (since {Vk} is a basis and by definition of the subspace topology on C)

C ∩ f−1(Vk) = C ∩ Uk for all k,
3 The Lλ in Littlewood's quote is what we would denote by Lp nowadays, where Lp consists of those measurable functions f such that |f|^p is integrable. We'll talk about this function space in Section ? In particular, Littlewood's original illustration of his second principle was not Luzin's theorem! See Theorem 6 on [183, p. 27] for the precise illustration of Littlewood's second principle, which has to do with approximations in Lp by continuous functions.
for some open set Uk ⊆ Rn. If f−1(Vk) happens to be open, then obviously we can take Uk = f−1(Vk) but f is not assumed to be continuous, so f−1(Vk) is not generally open. However, using Littlewood's First Principle we can approximate it with an open set! Indeed, given k ∈ N, since f−1(Vk) is measurable, by Littlewood's First Principle there is an open set Uk such that

f−1(Vk) ⊆ Uk and m(Uk \ f−1(Vk)) < ε/2^k.

Thus, we can write Uk as a union of disjoint sets

Uk = f−1(Vk) ∪ Wk where m(Wk) < ε/2^k.

Now put

C := X \ ∪_{k=1}^∞ Wk.

Then C is measurable and

m(X \ C) ≤ ∑_{k=1}^∞ m(Wk) < ε.

Moreover, for any k ∈ N, since C is disjoint from Wk and Uk = f−1(Vk) ∪ Wk, we have C ∩ f−1(Vk) = C ∩ Uk, so f|C : C → R is continuous. Step 2: Given ε > 0, by Step 1 we can choose a measurable set B ⊆ X such that m(X \ B) < ε/2 and f is continuous on B. By Littlewood's First Principle we can choose a closed set C ⊆ Rn such that C ⊆ B and m(B \ C) < ε/2. Since X \ C = (X \ B) ∪ (B \ C), we have

m(X \ C) ≤ m(X \ B) + m(B \ C) < ε.
Also, since C ⊆ B and f is continuous on B, the function f is automatically continuous on the smaller set C. This completes the proof of our theorem.
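The ε/2^{n+2}-covering used above for Dirichlet's function is easy to check numerically. The sketch below is our own (the centers rn do not affect the lengths, so they are set to 0); it verifies that the intervals In have total length under ε/2:

```python
eps = 0.1

def interval(n, r):
    # I_n = (r - eps/2^{n+2}, r + eps/2^{n+2}), so m(I_n) = eps/2^{n+1}
    half = eps / 2 ** (n + 2)
    return (r - half, r + half)

total = 0.0
for n in range(1, 41):
    lo, hi = interval(n, 0.0)   # center irrelevant to the length
    assert hi - lo == eps / 2 ** (n + 1)
    total += hi - lo

# the total length is strictly less than eps/2, so the closed
# complement C = R \ U satisfies m(R \ C) < eps
assert total < eps / 2
assert abs(total - eps / 2) < 1e-12
```

Only finitely many intervals are summed here, but the geometric tail only pushes the total closer to ε/2 from below.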
In Problem 7 we shall see that Luzin's theorem holds not just for Rn but for topological spaces as well. Our last topic in this section is the idea of "almost everywhere," a subject we'll see quite often in the sequel. 5.2.4. The concept of almost everywhere. Let (X, S , µ) be a measure space. We say that a property holds almost everywhere (written a.e.) if the set of points where the property fails to hold is a measurable set with measure zero. For example, we say that a sequence of functions {fn} on X converges a.e. to a function f on X, written fn → f a.e., if f(x) = lim_{n→∞} fn(x) for each x ∈ X except on a measurable set with measure zero. Explicitly,

fn → f a.e. ⇐⇒ A := {x ; f(x) ≠ lim_{n→∞} fn(x)} ∈ S and µ(A) = 0.
See Figure 5.6. In Figure 5.6, for x ∈ A (which consists of a single point in this picture), the limit lim_{n→∞} fn(x) exists but does not equal f(x). In the general case, the limit lim_{n→∞} fn(x) need not exist at any point in A.
[Figure 5.6. Here, fn → f a.e. where f is the zero function.]

For another example, given two functions f and g on X, we say that f = g a.e. if the set of points where f ≠ g is measurable with measure zero:

f = g a.e. ⇐⇒ A := {x ; f(x) ≠ g(x)} ∈ S and µ(A) = 0.
Example 5.5. Consider the Baire sequence fn : [0, 1] → R, n = 1, 2, 3, . . ., defined by

fn(x) = 1 if x = p/q is rational in lowest terms with q ≤ n, and fn(x) = 0 otherwise.

Let f : [0, 1] → R be the zero function, f(x) = 0 for all x. Then fn → f a.e. because the set of points where this is false is the set of rational numbers in [0, 1], which has measure zero. Observe that if g is the Dirichlet function on [0, 1] (the characteristic function of the rationals in [0, 1]), then f = g a.e. for the same reason.
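The Baire sequence can be explored with exact rational arithmetic. The sketch below is our own illustration (non-Fraction inputs stand in for irrational points):

```python
from fractions import Fraction

def f_n(x, n):
    # f_n(x) = 1 if x = p/q in lowest terms with q <= n, else 0;
    # Fraction always stores p/q reduced to lowest terms
    if isinstance(x, Fraction) and x.denominator <= n:
        return 1
    return 0

assert f_n(Fraction(1, 2), 2) == 1
assert f_n(Fraction(1, 3), 2) == 0       # q = 3 > 2
assert f_n(2 ** 0.5, 100) == 0           # stands in for sqrt(2)

# at each rational x the values jump to 1 once n reaches q, so f_n
# converges pointwise to Dirichlet's function, which is 0 a.e.
x = Fraction(3, 7)
assert [f_n(x, n) for n in range(1, 10)] == [0, 0, 0, 0, 0, 0, 1, 1, 1]
```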
If g is measurable and f = g a.e., then one might think that f must also be measurable. However, as you'll see in the following proof, to always make this conclusion we need to assume the measure is complete.

Proposition 5.10. Assume that µ is complete and let f, g : X → R. If g is measurable and f = g a.e., then f is also measurable.

Proof : Assume that g is measurable and f = g a.e., so that the set A = {x ; f(x) ≠ g(x)} is measurable with measure zero. Observe that for any a ∈ R,

(5.3)  f−1(a, ∞] = {x ∈ X ; f(x) > a}
              = {x ∈ A ; f(x) > a} ∪ {x ∈ Ac ; f(x) > a}
              = {x ∈ A ; f(x) > a} ∪ {x ∈ Ac ; g(x) > a}
              = {x ∈ A ; f(x) > a} ∪ (Ac ∩ g−1(a, ∞]).

The first set in (5.3) is a subset of A, which is measurable and has measure zero, hence the first set is measurable. Since g is measurable, the second set in (5.3) is measurable too, hence f is measurable.
For instance, this proposition holds for Lebesgue measure since Lebesgue measure is complete.

◮ Exercises 5.2.
1. Here are some equivalent definitions of measurability. Let f : X → R be a function on a measurable space X. (a) If {an} is any countable dense subset of R, prove that f is measurable if and only if f−1({∞}) and all sets of the form f−1(am, an], where m, n ∈ N, are measurable. (For example, f is measurable if and only if f−1({∞}) and all sets of the form f−1(k/2^n, (k + 1)/2^n], where k ∈ Z and n ∈ N, are measurable.) (b) If f ≥ 0, prove that f is measurable if and only if for all k ∈ Z and n ∈ N with 0 ≤ k ≤ 2^{2n} − 1, the sets f−1(k/2^n, (k + 1)/2^n] and f−1(2^n, ∞] are measurable.
2. Here are some problems dealing with nonmeasurable functions. (a) Find a non-Lebesgue measurable function f : R → R such that |f| is measurable.
(b) Find a non-Lebesgue measurable function f : R → R such that f² is measurable. (c) Find two non-Lebesgue measurable functions f, g : R → R such that both f + g and f · g are measurable.
3. Here are some problems dealing with measurable functions. (a) Prove that any monotone function f : R → R is Lebesgue measurable. (b) A function f : R → R is said to be lower-semicontinuous at a point c ∈ R if for any ε > 0 there is a δ > 0 such that

|x − c| < δ  =⇒  f(c) − ε < f(x).
Intuitively, f is lower-semicontinuous at c if for x near c, f(x) is either near f(c) or greater than f(c). The function f is lower-semicontinuous if it's lower-semicontinuous at all points of R. (To get a feeling for lower-semicontinuity, show that the functions χ(0,∞), χ(−∞,0), and χ(−∞,0)∪(0,∞) are lower-semicontinuous at 0.) Prove that any lower-semicontinuous function is Lebesgue measurable. (c) A function f : R → R is said to be upper-semicontinuous at a point c ∈ R if for any ε > 0 there is a δ > 0 such that

|x − c| < δ  =⇒  f(x) < f(c) + ε.
Intuitively, f is upper-semicontinuous at c if for x near c, f(x) is either near f(c) or less than f(c). The function f is upper-semicontinuous if it's upper-semicontinuous at all points of R. Prove that any upper-semicontinuous function is Lebesgue measurable.
4. Since we work with extended real-valued functions, we can use an "extended" Borel σ-algebra to define measurability. In this problem we define and study this σ-algebra. (a) The extended Borel σ-algebra, B̄, is the σ-algebra of subsets of R̄ generated by I^1 and subsets of {±∞}. Prove that A ∈ B̄ if and only if for some B ∈ B, we have A = B or A = B ∪ {∞} or A = B ∪ {−∞}, or A = B ∪ {∞, −∞}. (b) Prove that B̄ is the σ-algebra generated by the open sets in R̄, which by definition consist of open sets in R together with sets of the form U ∪ I ∪ J where U is open in R and I and J are intervals of the form [−∞, a) and (b, ∞] where a, b ∈ R. (c) Prove that B̄ is the σ-algebra of subsets of R̄ generated by any one of the following collections of subsets (here, a and b represent real numbers): (1) (a, b], {∞}; (2) (a, b), {∞}; (3) [a, b), {∞}; (4) [a, b], {∞}; (5) [−∞, a]; (6) [−∞, a); (7) [a, ∞]; (8) (a, ∞]. To avoid boredom, just prove a couple. (d) Prove that f : X → R̄ is measurable if and only if f−1(A) ∈ S for each A ∈ B̄, if and only if f−1(A) ∈ S for each set A of the form given in any one of the eight collections of sets stated in Part (c).
5. In this problem we prove that measurable functions are closed under the usual arithmetic operations. Let f, g : X → R be measurable. (i) Prove that f², and more generally |f|^p where p > 0, and, assuming that f never vanishes, 1/f are each measurable. (ii) Assume that f(x) + g(x) is defined for all x ∈ X; that is, assume that f(x) and g(x) are not opposite infinities for any x ∈ X. For a ∈ R and x ∈ X, prove that

f(x) + g(x) < a if and only if there exist rational numbers r, s such that f(x) < r, g(x) < s, and r + s < a.
Use this fact to prove that f + g is measurable. (iii) We now prove that the product f g is measurable. Let A = {x ; (f(x), g(x)) ∈ R × R}. Prove that A is measurable. Show that f g : A → R and f g : Ac → R are measurable. Conclude that f g : X → R is measurable. Suggestion: To prove measurability of f g on A, observe that f g = ¼((f + g)² − (f − g)²).
6. We can improve Luzin's Theorem as follows. First prove the following: (i) (Tietze Extension Theorem for R) Named after Heinrich Tietze (1880–1964), who proved a general result for metric spaces in 1915 [284]. Let A ⊆ R be a nonempty closed set and let f0 : A → R be a continuous function. Prove that
there is a continuous function f1 : R → R such that f1|A = f0, and if f0 is bounded in absolute value by a constant M, then we may take f1 to have the same bound. Suggestion: Show that R \ A is a countable union of pairwise disjoint open intervals. Extend f0 linearly over each of the open intervals to define f1. (ii) Using Luzin's Theorem (Theorem 5.9) for n = 1, given a measurable function f : X → R where X ⊆ R is measurable, prove that there is a closed set C ⊆ R such that C ⊆ X, m(X \ C) < ε, and a continuous function g : R → R such that f = g on C. Moreover, if f is bounded in absolute value by a constant M, then we may take g to have the same bound as f.
7. Here are some generalizations of Luzin's Theorem. (i) Let µ be a σ-finite regular Borel measure on a topological space X, let f : X → R be measurable, and let ε > 0. Using Problem 8 in Exercises 4.3 on "Littlewood's First Principle(s) for regular Borel measures," prove that there exists a closed set C ⊆ X such that µ(X \ C) < ε and f is continuous on C. (ii) We now assume you're familiar with topology at the level of [205]. With the hypotheses as in (i) except now X is a normal topological space, prove that if C denotes the closed set in (i), there is a continuous function g : X → R such that f = g on C and moreover, if f is bounded in absolute value by a constant, then we may take g to be bounded by the same constant. You will need the Tietze Extension Theorem for normal spaces, a theorem actually proved by Pavel Samuilovich Urysohn (1898–1924) and published posthumously in 1925 [292].
8. (Tonelli's Integral; [73]) Here we present Leonida Tonelli's (1885–1946) integral published in 1924 [288]. Let f : [a, b] → R be a bounded function, say |f| ≤ M for some constant M. We say that f is quasi-continuous (q.c.) if there is a sequence of closed sets C1, C2, C3, . . . ⊆ [a, b] with lim_{n→∞} m(Cn) = b − a and a sequence of continuous
functions f1, f2, f3, . . . , where each fn : [a, b] → R satisfies f = fn on Cn and |fn| ≤ M.
(i) Let f : [a, b] → R be bounded. Prove that f is q.c. if and only if f is measurable. To prove the "if" statement, you may assume Problem 6.
(ii) Let f : [a, b] → R be q.c. and let {fn} be a sequence of continuous functions in the definition of q.c. for f. Let R(fn) denote the Riemann integral of fn (which exists since fn is continuous) and prove that the limit lim_{n→∞} R(fn) exists and its
value is independent of the choice of sequence {fn} in the definition of q.c. for f. Tonelli defines the integral of f as
∫_a^b f := lim_{n→∞} R(fn).
It turns out that Tonelli's integral is exactly the same as Lebesgue's integral.
9. (cf. [112]) We show that the composition of two Lebesgue measurable functions may not be Lebesgue measurable. Indeed, let ϕ and M be the homeomorphism and Lebesgue measurable set, respectively, of Problem 7 in Exercises 4.4. Let g = χM. Show that g ∘ ϕ⁻¹ is not Lebesgue measurable. Note that both ϕ⁻¹ and g are Lebesgue measurable.
10. (Cauchy's functional equation IV: The Banach-Sierpinski Theorem) Use any information in Problem 6 in Exercises 1.6 and Problem 6 in Exercises 4.3. Prove the Banach-Sierpinski Theorem, proved in 1920 by Stefan Banach (1892–1945) [15] and Waclaw Sierpiński (1882–1969) [259], which states that if f : R → R is additive and Lebesgue measurable, then f(x) = f(1)x for all x ∈ R. As a corollary, we get: Every discontinuous additive function is not Lebesgue measurable. Suggestions: One can prove that for some n ∈ N, the set {x ∈ R ; |f(x)| ≤ n} has positive measure, or one can use Luzin's Theorem to find a set of positive measure on which f is bounded.
11. (Cauchy's functional equation V: Hamel basis) Using the Axiom of Choice (really its equivalent form, Zorn's Lemma), there is a set B ⊆ R (which is uncountable) such
that every nonzero x ∈ R can be written uniquely as
(5.4) x = r1 b1 + r2 b2 + · · · + rn bn,
for some nonzero r1, . . . , rn ∈ Q and distinct elements b1, . . . , bn ∈ B. The set B is called a Hamel basis, after Georg Hamel (1877–1954), who wrote a paper on Cauchy's equation and first described such a basis in 1905 [113]. Show that a function f : R → R is additive if and only if for each x ∈ R, we have
f(x) = r1 f(b1) + · · · + rn f(bn),
where x is written as in (5.4). This allows us to easily "find" nonmeasurable functions; indeed, let us define f(b) = 1 for each b ∈ B, so that f(x) = r1 + · · · + rn where x is written as in (5.4). Show that f is not linear and conclude that f is not measurable.
5.3. Sequences of functions and Littlewood's third principle
In this section we continue our study of measurability. We show that measurable functions are very robust in the sense that they are closed under any kind of arithmetic or limiting process involving at most countably many operations: addition, multiplication, division, etc.; this isn't surprising since measurable sets are closed under countable operations. We also discuss Littlewood's third principle on limits of measurable functions.
5.3.1. Limits of sequences. Before discussing limits of sequences of functions we need to start with limits of sequences of extended real numbers. It's a fact of life that limits of sequences of extended real numbers in general do not exist; for example, a sequence {an} can oscillate like this:
Figure 5.7. The sequence a1, a2, a3, a4, . . . bounces up and down.
However, for the sequence shown in Figure 5.7, assuming that the sequence continues the way it looks like it does, it is clear that although lim an does not exist, the sequence does have an "upper" limiting value, given by the limit of the odd-indexed an's, and a "lower" limiting value, given by the limit of the even-indexed an's. Now how do we find the "upper" (also called "supremum") and "lower" (also called "infimum") limits of {an}? It turns out there is a very simple way to do so, as we now explain. Given an arbitrary sequence {an} of extended real numbers, put
s1 = sup_{k≥1} ak = sup{a1, a2, a3, . . .},
s2 = sup_{k≥2} ak = sup{a2, a3, a4, . . .},
s3 = sup_{k≥3} ak = sup{a3, a4, a5, . . .},
(Here, "sup" is in the sense of extended real numbers, so the extended real number sn could equal ∞ if the set {an, an+1, an+2, . . .} is not bounded above by any real number.)
and in general,
sn = sup_{k≥n} ak = sup{an, an+1, an+2, . . .}.
Note that s1 ≥ s2 ≥ s3 ≥ · · · ≥ sn ≥ sn+1 ≥ · · · is a nonincreasing sequence, since each successive sn is obtained by taking the supremum of a smaller set of elements (and when sets get smaller, their supremums cannot increase). Since {sn} is a nonincreasing sequence of extended real numbers, the limit lim sn exists as an extended real number; in fact,
lim sn = inf_n sn = inf{s1, s2, s3, . . .},
as can easily be checked. We define the lim sup of the sequence {an} as
lim sup an := inf_n sn = lim_{n→∞} sn = lim_{n→∞} sup{an, an+1, an+2, . . .}.
Note that the terminology “lim sup” of {an } fits well because lim sup an is exactly the limit of a sequence of supremums. Example 5.6. For the sequence an shown in Figure 5.7, you can check that s1 = a1 , s2 = a3 , s3 = a3 , s4 = a5 , s5 = a5 , . . . . Thus, lim sup an is exactly the limit of the odd-indexed an ’s.
We now define the "lower"/"infimum" limit of an arbitrary sequence {an}. Put
ι1 = inf_{k≥1} ak = inf{a1, a2, a3, . . .},
ι2 = inf_{k≥2} ak = inf{a2, a3, a4, . . .},
ι3 = inf_{k≥3} ak = inf{a3, a4, a5, . . .},
and in general,
ιn = inf_{k≥n} ak = inf{an, an+1, an+2, . . .}.
Note that ι1 ≤ ι2 ≤ ι3 ≤ · · · ≤ ιn ≤ ιn+1 ≤ · · · is a nondecreasing sequence since each successive ιn is obtained by taking the infimum of a smaller set of elements. Since {ιn} is a nondecreasing sequence, the limit lim ιn exists, and equals sup_n ιn. We define the lim inf of the sequence {an} as
lim inf an := sup_n ιn = lim_{n→∞} ιn = lim_{n→∞} inf{an, an+1, an+2, . . .}.
Note that the terminology "lim inf" of {an} fits well because lim inf an is the limit of a sequence of infimums.
Example 5.7. For the sequence an shown in Figure 5.7, you can check that ι1 = a2, ι2 = a2, ι3 = a4, ι4 = a4, ι5 = a6, . . . . Thus, lim inf an is exactly the limit of the even-indexed an's.
(As with supremum, "inf" is in the sense of extended real numbers, so the extended real number ιn could equal −∞ if the set {an, an+1, an+2, . . .} is not bounded below by any real number.)
The following lemma contains some useful properties of limsups and liminfs. Since its proof really belongs in a lower-level analysis course, we shall leave it as an exercise for the interested reader.
Lemma 5.11. For a nonempty set A ⊆ R and a sequence of extended real numbers {an}, the following properties hold:
(1) sup A = − inf(−A) and inf A = − sup(−A), where −A = {−a ; a ∈ A}.
(2) lim sup an = − lim inf(−an) and lim inf an = − lim sup(−an).
(3) lim an exists as an extended real number if and only if lim sup an = lim inf an, in which case lim an = lim sup an = lim inf an.
(4) If {bn} is another sequence of extended real numbers and an ≤ bn for all n sufficiently large, then
lim inf an ≤ lim inf bn and lim sup an ≤ lim sup bn.
5.3.2. Operations on measurable functions. Let {fn} be a sequence of extended-real valued functions on a measure space (X, S, µ). We define the functions sup fn, inf fn, lim sup fn, and lim inf fn by applying these operations pointwise to the sequence of extended real numbers {fn(x)} at each point x ∈ X. For example, sup fn : X → R is the function defined by
(sup fn)(x) := sup{f1(x), f2(x), f3(x), . . .} at each x ∈ X,
and lim sup fn : X → R is the function defined by
(lim sup fn)(x) := lim sup(fn(x)) at each x ∈ X.
We define the limit function lim fn by
(lim fn)(x) := lim_{n→∞}(fn(x))
at those points x ∈ X where the right-hand limit exists. We now show that limiting operations don't change measurability.
Limits preserve measurability
Theorem 5.12. If {fn} is a sequence of measurable functions, then the functions sup fn, inf fn, lim sup fn, and lim inf fn are all measurable. If the limit lim_{n→∞} fn(x) exists at each x ∈ X, then the limit function lim fn is measurable. For instance, if the sequence {fn} is monotone, that is, either nondecreasing or nonincreasing, then lim fn is everywhere defined and measurable.
Proof : To prove that sup fn is measurable, by Proposition 5.4 we just have to show that (sup fn)⁻¹[−∞, a] ∈ S for each a ∈ R. However, this is easy because by definition of supremum, for any a ∈ R,
sup{f1(x), f2(x), f3(x), . . .} ≤ a ⟺ fn(x) ≤ a for all n,
therefore
(sup fn)⁻¹[−∞, a] = {x ; sup fn(x) ≤ a} = ∩_{n=1}^∞ {x ; fn(x) ≤ a} = ∩_{n=1}^∞ fn⁻¹[−∞, a].
Since each fn is measurable, it follows that (sup fn)⁻¹[−∞, a] ∈ S. Using an analogous argument one can show that inf fn is measurable. To prove that lim sup fn is measurable, note that by definition of lim sup, lim sup fn := inf{s1, s2, s3, . . .}, where sn = sup_{k≥n} fk. We already proved that the supremum of a sequence of measurable functions is measurable, so each sn is measurable, and since the infimum of a sequence of measurable functions is also measurable, it follows that lim sup fn = inf{s1, s2, . . .} is measurable. An analogous argument can be used to show that lim inf fn is measurable. If the limit function lim fn is well-defined, then by Part (3) of Lemma 5.11 we know that lim fn = lim sup fn (= lim inf fn). Thus, lim fn is measurable.
Example 5.8. Let X = S∞, where S = {0, 1}, a sample space for the Monkey-Shakespeare experiment (or any other sequence of Bernoulli trials), and let f : X → [0, ∞] be the random variable given by the number of times the Monkey types sonnet 18. Then
f(x) = Σ_{n=1}^∞ xn.
That is,
f = Σ_{n=1}^∞ χ_{An} = lim_{n→∞} Σ_{k=1}^n χ_{Ak},
where An = S × S × · · · × S × {1} × S × S × · · · , with {1} in the n-th slot. Therefore, f is a limit of simple functions, so f is measurable.
Given f : X → R, we define its nonnegative part f+ : X → [0, ∞] and its nonpositive part f− : X → [0, ∞] by f+ := max{f, 0} = sup{f, 0} and f− := − min{f, 0} = − inf{f, 0}.
Here is an example of f± for a parabolic graph:
One can check that
f = f+ − f− and |f| = f+ + f−.
Assuming f is measurable, by the sup part of Theorem 5.12, we know that f+ (which equals sup{f, 0}) and also f− (which equals −inf{f, 0} = sup{−f, 0}) are
measurable. In particular, the equality f = f+ − f− shows that any measurable function can be expressed as the difference of two nonnegative measurable functions.
In the following theorem we prove that a function is measurable if and only if it is a limit of simple functions, using Lebesgue's idea of partitioning the y-axis in his paper Sur une généralisation de l'intégrale définie [165]. See Problem 2 for the situation actually treated in his paper.
Characterization of measurability
Theorem 5.13. A function is measurable if and only if it is the limit of simple functions. Moreover, if the function is nonnegative, the simple functions can be taken to be a nondecreasing sequence of nonnegative simple functions.
Proof : Consider first the nonnegative case. Let f : X → [0, ∞] be measurable. Following Lebesgue, we shall partition the y-axis with finer and finer partitions and let the partition width go to zero. To give a concrete example of such a partition, for each n ∈ N, consider the simple function
sn(x) =
  0              if 0 ≤ f(x) ≤ 1/2ⁿ,
  1/2ⁿ           if 1/2ⁿ < f(x) ≤ 2/2ⁿ,
  2/2ⁿ           if 2/2ⁿ < f(x) ≤ 3/2ⁿ,
  ...
  (2²ⁿ − 1)/2ⁿ   if (2²ⁿ − 1)/2ⁿ < f(x) ≤ 2²ⁿ/2ⁿ = 2ⁿ,
  2ⁿ             if f(x) > 2ⁿ.
See Figure 5.8 for an example of a function f and pictures of the corresponding simple functions.
Figure 5.8. Here, f looks like a "V" and is bounded above by 1. The top figures show partitions of the range of f into halves, quarters, then eighths, and the bottom figures show the corresponding simple functions s1, s2, and s3. It is clear that s1 ≤ s2 ≤ s3.
Note that sn is a simple function because we can write
sn = Σ_{k=0}^{2²ⁿ−1} (k/2ⁿ) χ_{A_{nk}} + 2ⁿ χ_{Bn},
where A_{nk} = f⁻¹((k/2ⁿ, (k+1)/2ⁿ]) and Bn = f⁻¹((2ⁿ, ∞]).
At least if we look at Figure 5.8, it is not hard to believe that in general, the sequence {sn } is always nondecreasing: 0 ≤ s1 ≤ s2 ≤ s3 ≤ s4 ≤ · · ·
and lim_{n→∞} sn(x) = f(x) at every point x ∈ X. Because this is so believable looking
at Figure 5.8, we leave you the pleasure of verifying these facts! Now let f : X → R be any measurable function; we need to show that f is the limit of simple functions. To prove this, write f = f+ − f− as the difference of its nonnegative and nonpositive parts. Since f± are nonnegative measurable functions, we know that f+ and f− can be written as limits of simple functions, say {sn } and {tn }, respectively. It follows that f = f+ − f− = lim(sn − tn )
is also a limit of simple functions.
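The dyadic approximation in the proof is worth experimenting with numerically. The sketch below uses the common floor-function variant sn = min(⌊2ⁿf⌋/2ⁿ, 2ⁿ), which partitions the range with intervals closed on the left rather than the right — so it is not literally the sn of the text, but it is likewise nondecreasing in n and converges pointwise to f.

```python
import math

def s(n, y):
    """Dyadic simple-function approximation to a value y = f(x) >= 0.

    Floor-function variant: s_n = min(floor(2^n y)/2^n, 2^n).  Like the
    s_n in the text, it is nondecreasing in n and converges to y."""
    if y == math.inf:
        return float(2 ** n)
    return min(math.floor(2 ** n * y) / 2 ** n, float(2 ** n))

# Check monotonicity in n and pointwise convergence at sample values.
for y in (0.0, 0.3, 1.0, math.pi, 100.0, math.inf):
    vals = [s(n, y) for n in range(1, 25)]
    assert all(vals[k] <= vals[k + 1] for k in range(len(vals) - 1))  # s_1 <= s_2 <= ...
    if y != math.inf:
        assert abs(vals[-1] - y) <= 2 ** -24                          # s_n -> y
```

The cap at 2ⁿ is what lets the approximation handle unbounded (even ∞-valued) f, mirroring the role of the set Bn in the proof.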
Using Theorem 5.13 on limits of simple functions, it is easy to prove that measurable functions are closed under all the usual arithmetic operations. Of course, these facts aren't particularly difficult to prove directly (except perhaps the sum f + g — see Problem 5 in Exercises 5.2).
Theorem 5.14. If f and g are measurable, then f + g, f · g, 1/f, and |f|^p where p > 0, are also measurable, whenever each expression is defined.
Proof : We need the last statement ("whenever each expression is defined") for f + g and 1/f : for 1/f we need f to never vanish, and for f + g we don't want f(x) + g(x) to give a nonsense expression such as ∞ − ∞ or −∞ + ∞ at any point x ∈ X. The proofs that f + g, f g, 1/f, and |f|^p are measurable are all the same: we just show that each combination can be written as a limit of simple functions. By Theorem 5.13 we can write f = lim sn and g = lim tn for simple functions sn, tn, n = 1, 2, 3, . . .. Therefore,
f + g = lim(sn + tn) and f g = lim(sn tn).
Since the sum and product of simple functions are simple, it follows that f + g and f g are limits of simple functions, so are measurable. To see that 1/f and |f|^p are measurable, write the simple function sn as a finite sum
sn = Σ_k a_{nk} χ_{A_{nk}},
where A_{n1}, A_{n2}, . . . ∈ S are finite in number and pairwise disjoint, and a_{n1}, a_{n2}, . . . ∈ R, which we may assume are all nonzero. If we define
un = Σ_k (a_{nk})⁻¹ χ_{A_{nk}} and vn = Σ_k |a_{nk}|^p χ_{A_{nk}},
which are simple functions, then a short exercise shows that
f⁻¹ = lim un and |f|^p = lim vn,
where in the first equality we assume that f is nonvanishing. This shows that f −1 and |f |p are measurable.
In particular, since products and reciprocals of measurable functions are measurable, whenever the reciprocal is well-defined, it follows that quotients of measurable functions are measurable, whenever the denominator is nonvanishing.
5.3.3. Littlewood’s third principle. We now come to Egorov’s theorem, named after named after Dimitri Fedorovich Egorov (1869–1931) who proved it in 1911 [82]. This theorem makes precise the third of Littlewood’s principles, which is Every convergent sequence of [real-valued] measurable functions is nearly uniformly convergent.
Dimitri Egorov More precisely, in the words of Lebesgue who in 1903 stated (without proof!) this principle as [166, p. 1229]6: (1869–1931). Every convergent series of measurable functions is uniformly convergent when certain sets of measure ε are neglected, where ε can be as small as desired.
Lebesgue here is introducing the idea which is nowadays called “convergence almost uniformly.” A sequence {fn } of measurable functions is said to converge almost uniformly (or “a.u.” for short) to a measurable function f , denoted by fn → f a.u.,
if for each η > 0, there exists a measurable set A such that µ(Ac ) < η and fn → f uniformly on A. As a quick review, recall that fn → f uniformly on A means that given any ε > 0, |fn (x) − f (x)| < ε,
for all x ∈ A and n sufficiently large.
Note that fn (x) and f (x) are necessarily real-valued (cannot take on ±∞) on A. Therefore, Lebesgue is saying that Every convergent sequence of real-valued measurable functions is almost uniformly convergent.
This is nowadays called Egorov's theorem. In the following example we look at the difference between uniform and almost uniform convergence.
Example 5.9. Consider the following two sequences {fn} of functions on [0, 1]:
Figure 5.9. Left: The fn's are lines rotating toward the line f. Right: The fn's are humps (of the same height) that move to the left, with each fn equal to zero off its hump.
The left-hand picture illustrates uniform convergence. Given ε > 0 we see that for all n sufficiently large, we have |fn(x) − f(x)| < ε for all x ∈ [0, 1], or equivalently (using the definition of absolute value),
f(x) − ε < fn(x) < f(x) + ε
for all x ∈ [0, 1].
Geometrically, uniform convergence simply means that for all n sufficiently large, the graph of fn is "trapped" between the graphs of f − ε and f + ε. Now consider the right-hand picture. With 0 denoting the zero function, we see that fn → 0 pointwise, meaning that for each x ∈ [0, 1], fn(x) → 0. However, fn ↛ 0
uniformly. For example, for any positive ε less than the height of the humps, it's not the case that the graph of fn is trapped between the horizontal lines −ε and ε, because the hump of fn will always stick out above the horizontal line at height ε. However, fn does converge to 0 a.u. with respect to Lebesgue measure. Indeed, let η > 0 and let A = [η/2, 1]. Then it's clear that m([0, 1] \ A) < η and fn → 0 uniformly on A, as seen here:
Figure 5.10. Given any ε > 0, for all n sufficiently large, we see that |fn (x)| < ε for all x ∈ A.
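To make the right-hand picture of Figure 5.9 concrete, here is a minimal numerical sketch. The formula for the humps is an assumption of ours (the figure comes with no formula): we take fn to be the indicator of the interval (1/(2n), 1/n), a height-1 bump sliding left toward 0.

```python
def f(n, x):
    # Hypothetical "moving hump" on [0, 1]: an indicator bump of height 1
    # supported on (1/(2n), 1/n), sliding left toward 0 as n grows.
    return 1.0 if 1 / (2 * n) < x < 1 / n else 0.0

# Pointwise convergence to 0: at any fixed x, f(n, x) = 0 eventually.
for x in (0.0, 0.01, 0.5, 1.0):
    assert all(f(n, x) == 0.0 for n in range(200, 300))

# Not uniform: the sup over [0, 1] stays 1 (sample inside the hump).
for n in range(1, 100):
    assert f(n, 0.75 / n) == 1.0        # 0.75/n lies in (1/(2n), 1/n)

# Almost uniform: given eta, on A = [eta/2, 1] we have f_n = 0 once the
# hump has slid past eta/2, i.e. for all n > 2/eta.
eta = 0.1
A_left = eta / 2
n0 = int(2 / eta) + 1
xs = [A_left + k * (1 - A_left) / 50 for k in range(51)]
assert all(f(n, x) == 0.0 for n in range(n0, n0 + 100) for x in xs)
```

The three checks mirror the discussion: pointwise convergence to 0, failure of uniform convergence, and uniform convergence on A = [η/2, 1] once the hump has slid past η/2.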
Now that we understand a.u. convergence, we state Egorov's theorem.
Egorov's Theorem
Theorem 5.15. On a finite measure space, a.e. convergence implies a.u. convergence for real-valued measurable functions. That is, any sequence of real-valued measurable functions that converges a.e. to a real-valued measurable function converges a.u. to that function.
Proof : Let f, f1, f2, f3, . . . be real-valued measurable functions on a measure space X with µ(X) < ∞, and assume that f = lim fn a.e., which means there is a measurable set E ⊆ X with µ(X \ E) = 0 and f(x) = lim_{n→∞} fn(x) for all x ∈ E.
We need to show that fn → f a.u. (this proof is yet another "ε/2^k-trick", or in this case, an "η/2^k-trick"). Given η > 0, we need to find a measurable set A such that µ(Ac) < η and, for any ε > 0,
|fn(x) − f(x)| < ε, for all x ∈ A and n sufficiently large.
The idea to find A is to find, for each k ∈ N, a set Ak where ε is replaced by 1/k, then intersect all the Ak's.
Step 1: Given η > 0 and k ∈ N we shall prove that there is a measurable set B ⊆ X and an N ∈ N such that
(5.5) µ(B^c) < η and, for x ∈ B, |f(x) − fn(x)| < 1/k for all n > N.
Indeed, motivated by the second condition in (5.5), let's define for each m ∈ N,
Bm := {x ∈ X ; |f(x) − fn(x)| < 1/k for all n ≥ m} = ∩_{n≥m} {x ∈ X ; |f(x) − fn(x)| < 1/k}.
Notice that each Bm is measurable and B1 ⊆ B2 ⊆ B3 ⊆ · · ·. Also, since fn → f on E, it follows that if x ∈ E, then |f(x) − fn(x)| < 1/k for all n sufficiently large. Hence, there is an m such that x ∈ Bm, and so
E ⊆ ∪_{m=1}^∞ Bm.
Thus, as µ(X) = µ(E) (since µ(X \ E) = 0), by continuity of measures,
µ(X) ≤ lim_{m→∞} µ(Bm).
On the other hand, since µ(Bm) ≤ µ(X) for all m (because Bm ⊆ X), it follows that
µ(X) = lim_{m→∞} µ(Bm).
Thus, we can choose N such that µ(X) − µ(BN) < η. Then with B = BN it's easy to check that (5.5) holds. This concludes Step 1.
Step 2: We now finish the proof. Let η > 0. Then by Step 1, for each k ∈ N we can find a measurable set Ak ⊆ X and a corresponding natural number Nk ∈ N such that
µ(Ak^c) < η/2^{k+1} and, for x ∈ Ak, |f(x) − fn(x)| < 1/k for all n > Nk.
Now put A = ∩_{k=1}^∞ Ak. Then A^c = ∪_{k=1}^∞ Ak^c, so
µ(A^c) ≤ Σ_{k=1}^∞ µ(Ak^c) ≤ Σ_{k=1}^∞ η/2^{k+1} = η/2 < η,
and we claim that fn → f uniformly on A. Indeed, let ε > 0 and choose ℓ ∈ N such that 1/ℓ < ε. Then
x ∈ A ⟹ x ∈ Aℓ ⟹ |f(x) − fn(x)| < 1/ℓ for all n > Nℓ ⟹ |f(x) − fn(x)| < ε for all n > Nℓ.
M.) (c) A gambler with an initial capital of $i walks into a casino, which has an infinite amount of money; he sits down at a table and starts gambling, and he doesn't stop until he goes broke. Let's say that 1 represents a win and 0 a loss; if he wins he gets $1 and if he loses he gives the house $1. Define B : S∞ → R by B(x) = the number of games he plays until he goes broke (B(x) = ∞ if he never goes broke in the sequence x of games).
2. (Lebesgue's 1901 idea) Let f : [α, β] → [0, ∞) be a bounded measurable function, with range lying in a bounded interval [m, M]. Given a partition P = {m0, m1, . . . , mp} of [m, M], put m−1 = −1 and let ℓP and uP be the simple functions
ℓP = m0 χ_{E0} + Σ_{i=1}^p m_{i−1} χ_{Ei}, uP = m0 χ_{E0} + Σ_{i=1}^p m_i χ_{Ei}, where Ei = f⁻¹((m_{i−1}, m_i]).
(i) If P and Q are partitions of [m, M] and Q ⊆ P, prove that ℓQ ≤ ℓP ≤ f ≤ uP ≤ uQ. (ii) Let D denote the set of nondecreasing sequences of partitions of [m, M] whose mesh lengths approach zero. Prove that {ℓ_{Pn}} is a nondecreasing sequence of simple functions converging uniformly to f, and prove that {u_{Pn}} is a nonincreasing sequence of simple functions converging uniformly to f.
3. Let A1, A2, . . . be measurable sets and put
lim sup An := ∩_{n=1}^∞ ∪_{k=n}^∞ Ak and lim inf An := ∪_{n=1}^∞ ∩_{k=n}^∞ Ak.
Let f̄ and f̲ be the characteristic functions of lim sup An and lim inf An, respectively, and for each n, let fn be the characteristic function of An. Prove that
f̄ = lim sup fn and f̲ = lim inf fn.
4. Here are some a.u. convergence problems. (a) Let X = R with Lebesgue measure and let fn be the characteristic function of [n, ∞). Show that fn → 0 everywhere (that is, pointwise), but fn ↛ 0 a.u. (This shows that the finiteness assumption on X in Egorov's theorem cannot be dropped.) (b) Let X = [0, 1) with Lebesgue measure and let fn be the characteristic function of the interval [1 − 1/n, 1). Show that fn → 0 a.u., but fn ↛ 0 uniformly.
5. In the following series of problems, we study various convergence properties of measurable functions. We shall work with a fixed measure space (X, S, µ). Let fn, n = 1, 2, . . ., and f be real-valued measurable functions. Prove that fn → f a.e. if and only if for each ε > 0,
µ( ∩_{n=1}^∞ ∪_{m≥n} {x ; |fm(x) − f(x)| ≥ ε} ) = 0.
6. (Cf. [189]) In this problem we give a characterization of almost uniform convergence. Let fn, n ∈ N, and f be real-valued measurable functions. Prove that fn → f a.u. if and only if for each ε > 0,
lim_{n→∞} µ( ∪_{m≥n} {x ; |fm(x) − f(x)| ≥ ε} ) = 0.
7. Problems 5 and 6 are quite useful. (a) (a.u. =⇒ a.e.) Using Problems 5 and 6, prove that if a sequence {fn} of real-valued measurable functions converges a.u. to a real-valued measurable function f, then the sequence {fn} also converges to f a.e. Note that Egorov's Theorem gives a converse to this statement when X has finite measure. (b) Using Problems 5 and 6, give another proof of Egorov's theorem.
8. In this problem we prove Luzin's theorem using Egorov's theorem. Let f be a real-valued Lebesgue measurable function on a measurable set X ⊆ Rn of finite measure. Given any ε > 0, we shall prove that there exists a closed set C ⊆ Rn such that C ⊆ X, m(X \ C) < ε, and f is continuous on C. Proceed as follows.
(i) First prove the theorem for simple functions. Suggestion: Let f be a simple function and write f = Σ_{k=1}^N ak χ_{Ak}, where X = ∪_{k=1}^N Ak, the ak's are real numbers, and the Ak's are pairwise disjoint measurable sets. Given ε > 0, there is a closed set Ck ⊆ Rn with m(Ak \ Ck) < ε/N (why?). Let C = ∪_{k=1}^N Ck.
(ii) We now prove Luzin's theorem for nonnegative f. For nonnegative f we know that f = lim_k fk where each fk, k ∈ N, is a simple function. By (i), given ε > 0 there is a closed set Ck such that m(X \ Ck) < ε/2^k and fk is continuous on Ck. Let K1 = ∩_{k=1}^∞ Ck. Show that m(X \ K1) < ε. Use Egorov's theorem to show that there exists a set K2 ⊆ K1 with m(K1 \ K2) < ε and fk → f uniformly on K2. Conclude that f is continuous on K2.
(iii) Now find a closed set C ⊆ K2 such that m(K2 \ C) < ε. Show that m(X \ C) < 3ε and the restriction of f to C is a continuous function.
(iv) Finally, prove Luzin's theorem dropping the assumption that f is nonnegative.
9. A sequence {fn} of real-valued measurable functions is convergent in measure (when (X, µ) is a probability space, this is called convergent in probability) if there is an extended-real valued measurable function f such that for each ε > 0,
lim_{n→∞} µ({x ; |fn(x) − f(x)| ≥ ε}) = 0.
(Does this remind you of the weak law of large numbers?) Prove that if {fn} converges in measure to a measurable function f, then f is a.e. real-valued, which means {x ; f(x) = ±∞} is measurable with measure zero. If {fn} converges to two functions f and g in measure, prove that f = g a.e. Suggestion: To see that f = g a.e., prove and then use the "set-theoretic triangle inequality": For any real-valued measurable functions f, g, h, we have
{x ; |f(x) − g(x)| ≥ ε} ⊆ {x ; |f(x) − h(x)| ≥ ε/2} ∪ {x ; |h(x) − g(x)| ≥ ε/2}.
10. Here are some relationships between convergence a.e., a.u., and in measure. (a) (a.u. =⇒ in measure) Prove that if fn → f a.u., then fn → f in measure. (b) (a.e. =⇒ in measure) From Egorov's theorem prove that if X has finite measure, then any sequence {fn} of real-valued measurable functions that converges a.e. to a real-valued measurable function f also converges to f in measure. (c) (In measure ⇏ a.u. nor a.e.) Let X = [0, 1] with Lebesgue measure. Given n ∈ N, write n = 2^k + i where k = 0, 1, 2, . . . and 0 ≤ i < 2^k, and let fn be the characteristic function of the interval [i/2^k, (i+1)/2^k). Draw pictures of f1, f2, f3, . . . , f7. Show that fn → 0 in measure, but lim_{n→∞} fn(x) does not exist for any x ∈ [0, 1]. Conclude that {fn} converges to 0 neither a.u. nor a.e.
11. A sequence {fn} of real-valued, measurable functions is said to be Cauchy in measure if for any ε > 0,
µ({x ; |fn(x) − fm(x)| ≥ ε}) → 0, as n, m → ∞.
Prove that if fn → f in measure, then {fn} is Cauchy in measure.
12. In this problem we prove that if a sequence {fn} of real-valued measurable functions is Cauchy in measure, then there is a subsequence {f_{nk}} and a real-valued measurable function f such that f_{nk} → f a.u. Proceed as follows.
(a) Show that there is an increasing sequence n1 < n2 < · · · such that
µ({x ; |fn(x) − fm(x)| ≥ 1/2^k}) < 1/2^k, for all n, m ≥ nk.
(b) Let
Am = ∪_{k=m}^∞ {x ; |f_{nk}(x) − f_{nk+1}(x)| ≥ 1/2^k}.
Show that {f_{nk}} is a Cauchy sequence of bounded functions on the set Am^c. Deduce that there is a real-valued measurable function f on A := ∪_{m=1}^∞ Am^c such that {f_{nk}} converges uniformly to f on each Am^c.
(c) Define f to be zero on Ac. Show that f_{nk} → f a.u.
13. Assuming Problem 12, Part (a) of Problem 10, and the set-theoretic triangle inequality from Problem 9, prove the following theorem. (Completeness for convergence in measure) If {fn} is a sequence of real-valued measurable functions that is Cauchy in measure, then there exists a real-valued measurable function f such that fn → f in measure.
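The "typewriter" sequence of Problem 10(c) is also easy to experiment with numerically. The sketch below follows the indexing n = 2^k + i given in the problem; the finite checks are only illustrations of the two claims (convergence in measure, pointwise divergence), not proofs.

```python
def interval(n):
    # Write n = 2^k + i with 0 <= i < 2^k (as in the problem) and return
    # the support of f_n: the dyadic interval [i/2^k, (i+1)/2^k).
    k = n.bit_length() - 1        # largest k with 2^k <= n
    i = n - 2 ** k
    return i / 2 ** k, (i + 1) / 2 ** k

def f(n, x):
    a, b = interval(n)
    return 1.0 if a <= x < b else 0.0

# Convergence in measure: m{f_n >= eps} equals the support's length
# 1/2^k <= 2/n, which tends to 0.
for n in range(1, 1024):
    a, b = interval(n)
    assert b - a <= 2 / n

# Pointwise divergence at, say, x = 1/3: each block n = 2^k, ..., 2^{k+1}-1
# contains exactly one n with f_n(x) = 1, so f_n(x) equals 1 infinitely
# often and 0 infinitely often -- lim f_n(x) does not exist.
x = 1 / 3
hits = [n for n in range(1, 1024) if f(n, x) == 1.0]
assert len(hits) == 10            # one hit per block, k = 0, 1, ..., 9
assert any(f(n, x) == 0.0 for n in range(512, 1024))
```

The picture to keep in mind: the supports sweep across [0, 1] like a typewriter carriage, getting shorter on each pass, so the functions are eventually small "in measure" while every point is revisited forever.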
5.4. Lebesgue’s definition of the integral and the MCT In this section we (finally) define the integral of a nonnegative measurable function! We also establish the Monotone Convergence Theorem, one of the most useful theorems in all of integration theory because the MCT basically says “without fear of contradictions, or of failing examinations” [70, p. 229] we can always interchange limits and integrals for nondecreasing sequences of nonnegative functions. 5.4.1. Lebesgue’s original definition. It’s helpful to once again review Lebesgue’s original definition of the integral in Sur une g´en´eralisation de l’int´egrale d´efinie [165]. Given a bounded a function f : [α, β] → [0, ∞), we approximate the area under f from below and above as seen here by partitioning the range of the function, which Lebesgue supposes has lower bound m and upper bound M : f (x)
Figure 5.11. Approximating the area under f from below and above.
Let P be a partition of [m, M]:
m = m0 < m1 < m2 < · · · < mp−1 < mp = M,
let E0 = {f = m0} and Ei = f⁻¹((mi−1, mi]), i = 1, 2, . . . , p, and let
LP = m0 m(E0) + Σ_i m_{i−1} m(Ei) and UP = m0 m(E0) + Σ_i m_i m(Ei),
which we shall call, respectively, the lower and upper sums of f defined by the partition P. In Figure 5.11 we put p = 3 and the shaded rectangles approximate the area under f from below and above. Since f is assumed measurable, each of the sets E0, . . . , Ep is measurable, so their measures m(Ei) have all the properties length should have. The lower and upper sums of f are analogous to the lower and upper Darboux sums studied in Riemann integration (the difference being that in Riemann integration the domain of f is partitioned rather than the range). Note that if we consider the M¹-simple functions (where M¹ denotes the collection of Lebesgue measurable sets)
ℓP = m0 χ_{E0} + Σ_i m_{i−1} χ_{Ei} and uP = m0 χ_{E0} + Σ_i m_i χ_{Ei},
which we shall call, respectively, the lower and upper simple functions of f defined by the partition P, then by definition of integration of simple functions,
LP = ∫ ℓP and UP = ∫ uP.
Lebesgue says that f is integrable if there exists a real number I such that
(5.6) lim_{∥P∥→0} LP = I = lim_{∥P∥→0} UP,
and the common limit I is by definition the integral of f, which we shall denote by
∫ f := I = lim_{∥P∥→0} LP = lim_{∥P∥→0} UP.
Here, ∥P∥ is the maximum of the lengths mi − mi−1, where i = 1, 2, . . . , p, and we say that lim_{∥P∥→0} LP = I if given any ε > 0, there is a δ > 0 such that for any partition P with ∥P∥ < δ, we have
|I − LP| < ε.
There is a similar definition of what lim_{∥P∥→0} UP = I means. Now it turns out (see Problem 4) that the limits in (5.6) always exist and are equal! Thus, an arbitrary nonnegative bounded measurable function is integrable. Since one can never "explicitly" construct a nonmeasurable function, basically all nonnegative bounded functions are Lebesgue integrable! Here we see the major difference between the Riemann and Lebesgue theories: It's easy to construct functions (e.g. the Dirichlet function) that are not Riemann integrable. In Problem 4 you will prove that if we define
(5.7) I := sup{LP ; LP is a lower sum} = sup{ ∫ ℓP ; P is a partition of [m, M] },
then (5.6) holds with this I. Furthermore, it turns out that in (5.7) we can take P to be a partition of [0, ∞]; this frees us from the bounds of the particular function f and allows us to easily generalize Lebesgue's definition of the integral to extended real-valued functions.
5.4.2. The definition of the integral. Let (X, S, µ) be a measure space and let f : X → [0, ∞] be a measurable function. Given a partition P of [0, ∞],
0 = m0 < m1 < m2 < · · · < mp−1 < mp = ∞,
let Ei = f⁻¹((mi−1, mi]), i = 1, 2, . . . , p, and let
ℓP = Σ_{i=1}^p m_{i−1} χ_{Ei} = m1 χ_{{m1 < f ≤ m2}} + · · ·

Let ε > 0 and let An = {x ; f(x) ≤ fn(x) + ε1}, where ε1 = εµ(X)/(µ(X) + M). Show that lim_{n→∞} µ(An) = µ(X) and hence lim_{n→∞} µ(An^c) = 0. Take N so that n ≥ N implies µ(An^c) < ε/(µ(X) + M), where fn ≤ M for all n (so f ≤ M too). Using the definition of the integral,
prove that
∫f ≤ ∫χ_{A_n} f + ∫χ_{A_n^c} f.
Show that for n ≥ N the first integral on the right is ≤ ∫f_n + ε_1 µ(X) and the second integral on the right is ≤ Mµ(A_n^c). Conclude that ∫f_n ≤ ∫f ≤ ε + ∫f_n, and hence that ∫f = lim ∫f_n.
(iii) Lemma 3: Let a_{nk} be a double sequence of nonnegative extended real-valued numbers such that a_{nk} is nondecreasing in n (for fixed k) and nondecreasing in k (for fixed n). OPTIONAL: Prove that
lim_{n→∞} lim_{k→∞} a_{nk} = lim_{k→∞} lim_{n→∞} a_{nk}.
The proof is similar to the proof of Lemma 3.3 in Section 3.2, which is why Lemma 3 is optional; skip the proof of Lemma 3 if you see its relation to Lemma 3.3.
(iv) We now complete Levi's proof. Let {f_n} be a nondecreasing sequence of nonnegative functions and let f = lim f_n. For any k ∈ ℕ, let f_{nk} = min{f_n, k} and use Part (iii) on the sequence a_{nk} = ∫f_{nk}.
8. Here are some problems dealing with Fatou's Lemma.
(a) Find a sequence of nonnegative functions on [0, 1] (with Lebesgue measure) for which Fatou's Lemma gives a strict inequality.
(b) Prove that Fatou's Lemma implies the Monotone Convergence Theorem.
(c) Let {f_n} be a sequence of nonnegative measurable functions on a measure space converging to a function f with f_n ≤ f for each n. Prove that ∫f = lim ∫f_n. (In particular, you need to show that lim ∫f_n exists.) Show by counterexample that the conclusion is false if we replace "f_n ≤ f" by "f_n ≥ f" for each n.
5.5. Features of the integral and the Principle of Appropriate Functions

We've defined the integral for nonnegative measurable functions, and in this section we define it for extended real-valued functions. We also introduce the function version of the Principle of Appropriate Sets. We begin by discussing some properties of the integral for nonnegative functions.

5.5.1. Properties of the integral. We start with Chebyshev's Inequality, a very useful inequality that we've already seen back in Lemma 2.11 for simple functions. Recall that we always write {f has a property} for
{x ∈ X ; f(x) has a property}.
For example, the notation {f > α} is shorthand for {x ∈ X ; f(x) > α}.

Chebyshev's Inequality, Version II
Theorem 5.19. For any nonnegative measurable function f and for any α > 0, we have
µ{f > α} ≤ (1/α) ∫f.
(The same inequality holds for µ{f ≥ α}.)

Proof: Let A = {f > α}. Since αχ_A ≤ fχ_A and fχ_A ≤ f, we have αχ_A ≤ f. Hence by monotonicity of the integral (Lemma 5.16),
αµ(A) = ∫αχ_A ≤ ∫f.
This concludes our proof.
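Chebyshev's Inequality is easy to watch in action numerically. The following sketch (our own illustration, not from the text) approximates Lebesgue measure on [0, 1] by a fine uniform grid and checks µ{f > α} ≤ (1/α)∫f for f(x) = x²:

```python
# Numerical check of Chebyshev's Inequality, mu{f > alpha} <= (1/alpha) * integral(f),
# for f(x) = x^2 on [0,1], with Lebesgue measure approximated by a midpoint grid.
N = 100_000
dx = 1.0 / N
xs = [(i + 0.5) * dx for i in range(N)]  # midpoints of the grid cells

def f(x):
    return x * x

integral_f = sum(f(x) for x in xs) * dx  # approximates 1/3
for alpha in (0.1, 0.25, 0.5, 0.9):
    # approximates mu{f > alpha} = 1 - sqrt(alpha)
    measure = sum(dx for x in xs if f(x) > alpha)
    assert measure <= integral_f / alpha + 1e-9
```

Note how crude the bound is for small α: µ{f > 0.1} ≈ 0.68 while the bound (1/0.1)∫f ≈ 3.3; Chebyshev trades sharpness for complete generality.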
Thinking of ∫f as the area of a region with height profile f, the following proposition says that the integral has all the properties that we believe area should have:
(1) Area is additive.
(2) If the base of the region has length zero, the region has area zero.
(3) If two regions have the same height profile, they have the same area.
(4) If the area of a region is zero, the region must have zero height.
(5) If a region has positive height and area zero, then its base has length zero.
(6) If a region has finite area, it has finite height profile.
[Figure: Integral = Area.]

Some properties of area under curves
Proposition 5.20. Let A be a measurable set, let f and g be nonnegative measurable functions, and let a and b be nonnegative real numbers.
(1) ∫(af + bg) = a∫f + b∫g.
(2) If µ(A) = 0, then ∫_A f = 0.
(3) If f = g a.e., then ∫f = ∫g.
(4) If ∫f = 0, then f = 0 a.e.
(5) If f > 0 on A and ∫_A f = 0, then µ(A) = 0.
(6) If ∫f < ∞, then f is a.e. real-valued, that is, f(x) ∈ ℝ for a.e. x. (That is, the set {f = ∞} has measure zero.)
Proof: To prove (1), let f_n and g_n be nondecreasing sequences of nonnegative simple functions approaching f and g, respectively; e.g. such simple functions are provided by Theorem 5.13. Applying the Monotone Convergence Theorem to the three nondecreasing sequences {f_n}, {g_n}, and {af_n + bg_n}, converging to f, g, and af + bg, respectively, we obtain
a∫f + b∫g = lim a∫f_n + lim b∫g_n   (MCT)
          = lim (a∫f_n + b∫g_n)    (algebra of limits)
          = lim ∫(af_n + bg_n)     (integral is linear on simple functions)
          = ∫(af + bg)             (MCT).
We shall leave (2) and (3) for your enjoyment. (Hints: to prove (2), prove that ∫_A ℓ = ∫χ_A ℓ for any lower simple function ℓ of f, and prove (3) using (2).)
Suppose that ∫f = 0 and let A = {x ; f(x) > 0}. Then (4) is the statement that µ(A) = 0. To see this, for each n = 1, 2, …, let A_n = {x ; f(x) > 1/n}. Then A_n is a nondecreasing sequence of measurable sets with limit set A, so µ(A) = lim µ(A_n) since measures are continuous. Now by Chebyshev's Inequality,
µ(A_n) ≤ n ∫f = 0.
Thus, µ(A_n) = 0 for each n and so µ(A) = 0 as well, which proves (4).
Assume now that f > 0 on a measurable set A and ∫_A f = 0; we shall prove that µ(A) = 0. Indeed, ∫_A f = ∫χ_A f = 0, so by (4) we have χ_A f = 0 a.e. Since f > 0 on A, it follows that χ_A = 0 a.e., which implies that µ(A) = 0.
Finally, to prove (6), suppose that ∫f < ∞ and let A = {x ; f(x) = ∞}. Then for any n ∈ ℕ, we have {f = ∞} ⊆ {f > n}, so by monotonicity of measures and Chebyshev's Inequality, we have
µ{f = ∞} ≤ µ{f > n} ≤ (1/n) ∫f.
Taking n → ∞ we get our result.
The following theorem says that we can always interchange integrals and infinite series of nonnegative measurable functions.

The series MCT
Theorem 5.21. If {f_n} is a sequence of nonnegative measurable functions, then
∫ Σ_{n=1}^∞ f_n = Σ_{n=1}^∞ ∫f_n.
Moreover, if the sum Σ_{n=1}^∞ ∫f_n is finite, then the series Σ_{n=1}^∞ f_n is finite a.e.; that is, the series Σ_{n=1}^∞ f_n converges to a real number a.e.

Proof: Let f := Σ_{n=1}^∞ f_n. Then by definition, f = lim_{k→∞} g_k where g_k = Σ_{n=1}^k f_n. Since the f_n's are nonnegative, we have g_1 ≤ g_2 ≤ g_3 ≤ ⋯, so by the Monotone Convergence Theorem, we have
∫f = lim_{k→∞} ∫g_k = lim_{k→∞} ∫ Σ_{n=1}^k f_n = lim_{k→∞} Σ_{n=1}^k ∫f_n =: Σ_{n=1}^∞ ∫f_n,
where in the third equality we used linearity of the integral from the previous proposition.
Now assume that Σ_{n=1}^∞ ∫f_n < ∞, which is equivalent to saying that
∫f < ∞ where f = Σ_{n=1}^∞ f_n.
Then by Property (6) of the previous proposition, we see that f is finite a.e., which means that Σ_{n=1}^∞ f_n converges to a real number a.e.

See Problem 4 for applications of the series MCT to proving the First Borel–Cantelli Lemma and to the SLLN.
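For the counting measure on ℕ, the series MCT says exactly that a nonnegative double series may be summed in either order. Here is a small numerical sketch (the particular double array is our own choice):

```python
# Series MCT over the counting measure on N: integrating f_n term by term
# amounts to interchanging the order of summation of a nonnegative double series.
# Here f_n(k) = 2^(-n) * 3^(-k); both iterated sums equal (sum 2^-n)(sum 3^-k) = 1 * 1/2.
N = K = 40  # truncation level; the tails are geometrically small

def f(n, k):
    return 2.0 ** (-n) * 3.0 ** (-k)

sum_then_integrate = sum(sum(f(n, k) for n in range(1, N + 1)) for k in range(1, K + 1))
integrate_then_sum = sum(sum(f(n, k) for k in range(1, K + 1)) for n in range(1, N + 1))
assert abs(sum_then_integrate - integrate_then_sum) < 1e-12
assert abs(sum_then_integrate - 0.5) < 1e-10
```

Of course nonnegativity is what makes the interchange free of charge; for signed terms one needs absolute convergence (or the DCT of Section 5.6).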
Example 5.13. Let X = S^∞, where S = {0, 1}, the sample space for (say) the Monkey Shakespeare experiment, in which the monkey types sonnet 18 with probability p > 0 on any given page. Let f : X → [0, ∞] be the random variable given by the number of times the monkey types sonnet 18. What is the expected value of f; in plain English, how many times would you expect the monkey to type sonnet 18? To answer this question, observe that
f(x) = Σ_{n=1}^∞ x_n.
That is,
f = Σ_{n=1}^∞ χ_{A_n},
where A_n = S × S × ⋯ × S × {1} × S × S × ⋯ with {1} in the n-th slot. It follows that
E(f) = Σ_{n=1}^∞ ∫χ_{A_n} = Σ_{n=1}^∞ µ(A_n) = Σ_{n=1}^∞ p = ∞.
Thus, we would expect the monkey to type sonnet 18 an infinite number of times; however, as the analysis in Section 2.3.5 shows, we wouldn't expect the monkey to type sonnet 18 even once in any reasonable finite amount of time.
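A quick Monte Carlo sketch makes the divergence plausible: over N pages the expected number of successes is Np, which grows without bound as N → ∞ (the values of p and N below are our own illustrative choices):

```python
import random

# Expected number of "sonnet 18" pages among N independent pages, each a
# success with probability p: the mean count is N * p, unbounded in N.
random.seed(0)
p, N, trials = 0.01, 10_000, 200
counts = [sum(1 for _ in range(N) if random.random() < p) for _ in range(trials)]
average = sum(counts) / trials  # should be close to N * p = 100
assert abs(average - N * p) < 10
```

Doubling N doubles the expected count, so the expectation tends to infinity even though the wait for the *first* success is astronomically long when p is tiny.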
The following theorem shows that the integral is countably additive on countable disjoint unions, just as measures are.

Countable additivity of the integral
Theorem 5.22. If f is a nonnegative measurable function and A = ⋃_{n=1}^∞ A_n, where the sets A_n are disjoint measurable sets, then
∫_A f = Σ_{n=1}^∞ ∫_{A_n} f.

Proof: Apply the Monotone Convergence Theorem to the monotone sequence
f_n := Σ_{k=1}^n χ_{A_k} f,
which converges pointwise to χ_A f, to get this result, as you can readily check.
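For a concrete check of Theorem 5.22, take f(x) = x on A = (0, 1] with Lebesgue measure and the disjoint dyadic pieces A_n = (2^{−n}, 2^{−n+1}]; each piece's integral is computable in closed form (the choice of f and of the A_n is our own):

```python
# Countable additivity of the integral: the integral of f(x) = x over A = (0,1]
# equals the sum of its integrals over the disjoint pieces A_n = (2^-n, 2^-(n-1)].
def piece_integral(n):
    a, b = 2.0 ** (-n), 2.0 ** (-(n - 1))
    return (b * b - a * a) / 2.0  # integral of x over (a, b]

total = sum(piece_integral(n) for n in range(1, 60))
assert abs(total - 0.5) < 1e-12  # integral of x over (0,1] is 1/2
```

The tail beyond n = 60 is of size about 4^{−60}, far below the tolerance, which is why a finite truncation suffices here.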
5.5.2. General definition of the integral. We now define the integral for functions that can take negative as well as positive values. Recall that the nonnegative and nonpositive parts f_+ and f_− of a measurable function f : X → ℝ are defined by
f_+(x) = max{f(x), 0},  f_−(x) = −min{f(x), 0};
here is an example of a function f with its corresponding f_+ and f_−:
[Figure: graphs of f, f_+, and f_−.]
Thus, f_+ represents the part of f above the X-axis and −f_− represents the part of f below the X-axis, and also observe that
f = f_+ − f_−  and  |f| = f_+ + f_−.
We define the integral of f geometrically as the area of f above the X-axis (that is, ∫f_+) minus the area of f below the X-axis (that is, ∫f_−):
(5.18)  ∫f := ∫f_+ − ∫f_−,
provided that the right-hand side is not of the form ∞ − ∞. Here's an illustration of this definition:
[Figure 5.16. The integral ∫f represents the net "signed" area between the graph of f and the X-axis: ∫f := ∫f_+ − ∫f_− = "area of f above the X-axis" − "area of f below the X-axis".]
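The decomposition behind definition (5.18) is easy to mirror numerically. A sketch (f(x) = sin x on [0, 2π] is our own choice): the areas above and below the axis are both 2, so the signed integral vanishes.

```python
import math

# Signed integral via f = f_plus - f_minus for f(x) = sin x on [0, 2*pi],
# with Lebesgue measure approximated by a midpoint grid.
N = 200_000
a, b = 0.0, 2.0 * math.pi
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]
int_f_plus = sum(max(math.sin(x), 0.0) for x in xs) * dx   # ~ 2 (area above axis)
int_f_minus = sum(-min(math.sin(x), 0.0) for x in xs) * dx  # ~ 2 (area below axis)
signed = int_f_plus - int_f_minus                           # ~ 0
assert abs(int_f_plus - 2.0) < 1e-3 and abs(int_f_minus - 2.0) < 1e-3
assert abs(signed) < 1e-6
```

Note that |f| = f_+ + f_− has integral ≈ 4 here even though ∫f ≈ 0, which is exactly why integrability is defined by ∫|f| < ∞ rather than by ∫f being finite-looking.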
We say that f is integrable if
∫f_+ < ∞ and ∫f_− < ∞,
in which case ∫f ∈ ℝ. The identity |f| = f_+ + f_− implies that ∫f_+ < ∞ and ∫f_− < ∞ if and only if ∫|f| < ∞. Thus,
f is integrable ⟺ ∫|f| < ∞.
We sometimes say that f is µ-integrable to emphasize the measure µ; e.g. if X = ℝⁿ and µ is Lebesgue measure, we'd say that f is Lebesgue integrable. More generally, given any measurable set A ⊆ X, we define
∫_A f := ∫χ_A f,
where again we assume that the right-hand side makes sense. The function f is said to be integrable on A if
∫_A f_+ < ∞ and ∫_A f_− < ∞, that is, if ∫_A |f| < ∞.
Since the integral of a general extended real-valued function is defined as a difference of integrals of nonnegative functions, the properties of the integral in Proposition 5.20 for nonnegative functions translate directly to properties of the integral for extended real-valued functions. For example, (3) of Proposition 5.20 implies that if f : X → ℝ, g : X → ℝ are measurable and are either both nonnegative or both integrable, then⁹
f = g a.e. ⟹ ∫f = ∫g.
We can paraphrase this as saying: integrals only see a.e.; they are blind to sets of measure zero. Property (6) of Proposition 5.20 implies that if f : X → ℝ is integrable, then f is also a.e. real-valued, that is, f(x) ∈ ℝ a.e.
⁹Actually, if f is only assumed integrable and f = g a.e., then g must also be integrable, and ∫f = ∫g.
5.5.3. Linearity of the integral. Given integrable functions f, g : X → ℝ, we want to prove that
(5.19)  ∫(f + g) = ∫f + ∫g.
It turns out that the left-hand side may not be defined as it stands. Indeed, since f and g are integrable, we know that the functions f and g are finite a.e.; in particular, on a set of measure zero they may take the values ±∞. Thus, the sum f(x) + g(x) is generally only defined a.e. (when it's not of the form ∞ − ∞ or −∞ + ∞). There is a general convention for dealing with situations like this, which we'll now explain. Let f be a measurable function defined a.e.; that is, f is defined on a measurable set A with µ(A^c) = 0 such that f^{−1}(a, ∞] is a measurable subset of X for each a ∈ ℝ. Define
f̃ := f on A,  and  f̃ := 0 on A^c.
Since A is measurable, one can check that f̃ : X → ℝ is measurable. We say that f is integrable if f̃ is integrable, in which case we define
(5.20)  ∫f := ∫f̃.
Of course, since integrals only see a.e., we could have defined f̃ to equal any measurable function on A^c without changing the value of ∫f̃. We only chose 0 on A^c to make it simple. We shall apply the convention (5.20) many times in the sequel, rarely mentioning it. In particular, this convention is how we understand the left-hand side of (5.19). Now, to prove linearity we begin with the following.

Lemma 5.23. Let f : X → ℝ be measurable and suppose that f = g − h a.e. where g and h are nonnegative integrable functions. Then f is integrable, and
∫f = ∫g − ∫h.

Proof: Note that since g and h are each integrable, they are a.e. real-valued, so the difference g − h is defined a.e. Also note that |f| ≤ g + h a.e., which implies (by monotonicity of the integral on nonnegative functions) that the integral of |f| is finite, so f is integrable. Since g, h, f_+, and f_− are each integrable, they are a.e. real-valued, and so in particular, rearranging the equalities
f = f_+ − f_−  and  f = g − h  (a.e.),
it follows that f_+ + h = g + f_− a.e. By linearity of the integral for nonnegative functions, we have
∫f_+ + ∫h = ∫g + ∫f_−.
Each integral is finite, so rearranging we get
∫f_+ − ∫f_− = ∫g − ∫h.
Thus, ∫f = ∫g − ∫h, as desired.
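The slogan "integrals only see a.e." can be verified mechanically on a finite measure space with explicit weights; here is a sketch with a five-point space in which one atom has measure zero (the space, weights, and functions are our own illustrative choices):

```python
# A finite measure space X = {0,1,2,3,4} given by point weights; the atom 2
# has measure zero. Two functions agreeing except on that atom have equal integrals.
weights = {0: 0.5, 1: 1.0, 2: 0.0, 3: 0.25, 4: 2.0}

def integral(func):
    # integral of a function against the weighted counting measure
    return sum(func(x) * w for x, w in weights.items())

f = lambda x: x * x
g = lambda x: x * x if x != 2 else 10 ** 9  # differs from f only where mu = 0
assert integral(f) == integral(g)
```

This is also the content of the convention (5.20): however f̃ is extended across the null set A^c, the weight there is zero, so ∫f̃ is unchanged.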
Linearity of the integral
Theorem 5.24. Given any integrable functions f and g and real numbers a and b, the integral is linear:
∫(af + bg) = a∫f + b∫g.

Proof: To prove linearity it suffices to show that
∫af = a∫f for any a ∈ ℝ,  and  ∫(f + g) = ∫f + ∫g.
Consider the first equality in the case a < 0; the case a ≥ 0 is easy. Write a = −α where α > 0, so that af = −α(f_+ − f_−) = αf_− − αf_+. By Lemma 5.23 and linearity of the integral on nonnegative functions, we see that
∫af = ∫αf_− − ∫αf_+ = α∫f_− − α∫f_+ = −α(∫f_+ − ∫f_−) = a∫f.
To prove that ∫(f + g) = ∫f + ∫g, write
f + g = (f_+ − f_−) + (g_+ − g_−) = (f_+ + g_+) − (f_− + g_−).
Applying Lemma 5.23, using linearity of the integral on nonnegative functions, and using the definition of the integral, we obtain
∫(f + g) = ∫(f_+ + g_+) − ∫(f_− + g_−) = ∫f_+ + ∫g_+ − ∫f_− − ∫g_− = ∫f + ∫g.
Integral inequalities
Theorem 5.25. Given any integrable functions f and g:
(1) If f ≤ g a.e., then ∫f ≤ ∫g. (Monotonicity)
(2) |∫f| ≤ ∫|f|.

Proof: If f ≤ g a.e., then using that g = f + (g − f), which is defined a.e., by linearity we have
∫g = ∫f + ∫(g − f).
Since g − f ≥ 0 a.e., it follows that ∫(g − f) ≥ 0, which proves monotonicity. To prove (2), recall that |f| = f_+ + f_− and observe that
|∫f| = |∫f_+ − ∫f_−| ≤ ∫f_+ + ∫f_− = ∫|f|.
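Both inequalities of Theorem 5.25 are cheap to sanity-check numerically; the following sketch uses f(x) = sin 3x and g(x) = 1 + sin 3x on [0, 1] (our own choices) with a midpoint-grid approximation of Lebesgue measure:

```python
import math

# Check |integral(f)| <= integral(|f|) and monotonicity integral(f) <= integral(g)
# for f(x) = sin(3x) and g(x) = 1 + sin(3x) >= f(x) on [0, 1].
N = 100_000
dx = 1.0 / N
xs = [(i + 0.5) * dx for i in range(N)]
int_f = sum(math.sin(3 * x) for x in xs) * dx
int_abs_f = sum(abs(math.sin(3 * x)) for x in xs) * dx
int_g = sum(1 + math.sin(3 * x) for x in xs) * dx
assert abs(int_f) <= int_abs_f + 1e-12  # triangle inequality for integrals
assert int_f <= int_g                    # monotonicity, since f <= g pointwise
```

Here ∫f = (1 − cos 3)/3 ≈ 0.663; since sin 3x ≥ 0 on [0, 1], the triangle inequality happens to hold with equality, while any f that changes sign would make it strict.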
5.5.4. The Principle of Appropriate Functions. Recall that the Principle of Appropriate Sets says that if a collection of sets contains the "appropriate sets," then it must contain all sets in the σ-algebra generated by the appropriate sets. The Principle of Appropriate Functions says: if an integration property holds for the "appropriate functions," then the property holds for all integrable functions. To illustrate this new principle, consider an affine transformation (a linear transformation followed by a translation) F : ℝⁿ → ℝⁿ, where F(x) = Tx + b for some invertible n × n matrix T and some b ∈ ℝⁿ. We shall prove the following sister theorem to Theorem 4.14:

Affine transformations and Lebesgue integration
Theorem 5.26. For any measurable function f : ℝⁿ → ℝ, the composite function f ∘ F is measurable, and
(5.21)  ∫(f ∘ F) |det T| = ∫f,
provided f is nonnegative or integrable.

(Using invariance of Lebesgue measure under affine transformations, one can check that f ∘ F is measurable for any measurable function f on ℝⁿ.) To prove the equality (5.21), we use the following "principle," which works not only to prove (5.21) but also for just about any integration property. Let C denote the collection of integrable functions having a certain property.¹⁰

Principle of Appropriate Functions: Suppose
(1) C contains the characteristic functions of measurable sets;
(2) C is a linear space;
(3) C is closed under limits of nondecreasing sequences of nonnegative functions.
Then C contains all integrable functions.

There is a similar principle if we are just interested in nonnegative measurable functions: just replace (2) by "C is closed under linear combinations with nonnegative constants;" the conclusion is that C contains all nonnegative measurable functions. This principle says that if you can prove an integration property for the "appropriate functions" — namely, characteristic functions of measurable sets — then under some additional conditions the property holds for all integrable functions. Let's understand why. If (1) is fulfilled, then by (2), C contains all simple functions. By (3) and the MCT, C contains all nonnegative integrable functions (since by Theorem 5.13 any nonnegative measurable function is the limit of a nondecreasing sequence of nonnegative simple functions). Finally, since any integrable function can be written as a difference of two integrable nonnegative functions, by (2) it follows that C contains any integrable function.

Let's apply this principle to outline the proof of (5.21) for integrable functions.
(1) First, we need to check that (5.21) holds for characteristic functions of measurable sets. We'll assume you've done this — see Problem 7.
(2) Both sides of the equality (5.21) are clearly linear in f.

¹⁰I thank Anton Schick for telling me about the catchy name "The Principle of Appropriate Functions."
(3) If 0 ≤ f_1 ≤ f_2 ≤ ⋯ where each f_n satisfies (5.21), then f ∘ F = lim f_n ∘ F is the limit of a nondecreasing sequence of nonnegative measurable functions, so by the MCT,
∫(f ∘ F) |det T| = lim ∫(f_n ∘ F) |det T|  and  ∫f = lim ∫f_n.
Since ∫(f_n ∘ F) |det T| = ∫f_n by assumption, we conclude that
∫(f ∘ F) |det T| = ∫f.
This proves Theorem 5.26! This principle will be used quite often in the sequel.

◮ Exercises 5.5.
1. Let f be integrable. Prove the following properties of the integral.
(a) If f ≥ 0 a.e. and A ⊆ B are measurable sets, then ∫_A f ≤ ∫_B f.
(b) If f ≥ a > 0 a.e. on a measurable set A, then µ(A) < ∞.
(c) If a ≤ f ≤ b a.e. on a measurable set A, then aµ(A) ≤ ∫_A f ≤ bµ(A).
(d) If A ⊆ B are measurable and µ(B \ A) = 0, then ∫_A f = ∫_B f.
2. A measurable function g is essentially bounded if there is an M > 0 such that |g| ≤ M a.e. If f is integrable and g is essentially bounded, prove that fg is integrable. If X has finite measure, prove that any essentially bounded function is integrable.
3. In this problem we compute some integrals.
(a) (Run lengths) Let S^∞, with S = {0, 1}, be the sample space for an infinite sequence of coin tosses where the probability of throwing a head on any given toss is p. For each n ∈ ℕ, define ℓ_n : S^∞ → [0, ∞] by
ℓ_n(x) := the number of consecutive tosses of heads starting from the n-th toss.
Find E(ℓ_n), the expected run length of heads. (By the way, if you're interested, you can show that the sequence ℓ_n gives an example of Fatou's strict inequality: ∫ lim inf ℓ_n < lim inf ∫ℓ_n.)
(b) (Random series) With the same measure space as in (a), let f : S^∞ → [0, ∞] be the "randomized geometric series" defined as follows: given x ∈ S^∞,
f(x) := 1 ± 1/2 ± 1/2² ± 1/2³ ± ⋯ = 1 + Σ_{n=1}^∞ a_n/2ⁿ,
where a_n = 1 if x_n = 1 and a_n = −1 if x_n = 0. Compute E(f).
(c) (Area under Cantor's function) If ψ : [0, 1] → ℝ is Cantor's function from Section 4.5, show that ∫ψ = 1/2. Suggestion: Show that for all points not in the Cantor set, ψ can be written as
ψ = Σ_{k=0}^∞ Σ_{(a_1,…,a_k)} (a_1/2 + ⋯ + a_k/2^k + 1/2^{k+1}) χ_{B_{a_1…a_k}},
where [0, 1] \ C is the union over all sets B_{a_1…a_k} found in Problem 2 in Exercises 4.5. The a_1, …, a_k's are 0's or 1's and they determine a unique natural number ℓ = a_k + 2a_{k−1} + ⋯ + 2^{k−1}a_1 with 0 ≤ ℓ ≤ 2^k − 1.
(d) (An integrable everywhere unbounded function) Fix a list {r_1, r_2, …} of ℚ, let g(x) = x^{−1/2} χ_{(0,1)}(x), and define f : ℝ → [0, ∞] by f(x) = Σ_{n=1}^∞ 2^{−n} g(x − r_n). Using ∫g = 2, compute ∫f. Conclude that the series Σ_{n=1}^∞ 2^{−n} g(x − r_n) converges for a.e. x ∈ ℝ. Prove that f is unbounded in every open interval; that is, given any a < b and N ∈ ℕ there is an x with a < x < b such that f(x) > N.
4. In this problem we apply the series MCT to problems in probability.
(a) (The First Borel–Cantelli Lemma) Let A_1, A_2, … be measurable and put A = {A_n ; i.o.}. Applying the series MCT to the series f(x) = Σ_{n=1}^∞ χ_{A_n}(x) and observing
that x ∈ A if and only if f(x) = ∞, prove that
Σ_{n=1}^∞ µ(A_n) < ∞ ⟹ µ(A) = 0.
(b) (The Strong Law of Large Numbers) Show that the SLLN (we refer you to Section 4.2.1 if you need a review) is equivalent to lim S_n/n = p a.e.; then show that lim S_n/n = p a.e. is equivalent to the statement that
lim (R_1 + ⋯ + R_n)⁴/n⁴ = 0 a.e.,
where the R_i's are the Rademacher functions. To prove this last statement, apply the series MCT to the series Σ_{n=1}^∞ (R_1 + ⋯ + R_n)⁴/n⁴. (Equation (4.4) might help.)
5. (The (Other) Monotone Convergence Theorem) Let f_1 ≥ f_2 ≥ f_3 ≥ ⋯ ≥ 0 be a nonincreasing sequence of nonnegative measurable functions.
(i) Assuming that ∫f_1 < ∞, prove that
lim ∫f_n = ∫ lim f_n.
(ii) If ∫f_1 = ∞, show by example that lim ∫f_n = ∫ lim f_n need not hold.
6. (Lebesgue's Analytic Definition) We shall call a partition of ℝ a set P = {m_i}, where i ∈ ℤ, with ⋯ < m_{−2} < m_{−1} < m_0 < m_1 < m_2 < ⋯ and with ‖P‖ := sup{m_{i+1} − m_i ; i ∈ ℤ} < ∞. Let f : X → ℝ be measurable with µ(X) < ∞.
(a) If for some partition P of ℝ, the series
(5.22)  σ_P := Σ_{i=−∞}^∞ m_i µ(A_i),  where A_i = {x ; m_i ≤ f < m_{i+1}},
is absolutely convergent, prove that f is integrable. Moreover, prove that (5.22) converges absolutely for any partition of ℝ. Finally, given any sequence P_1, P_2, … of partitions with ‖P_n‖ → 0, prove the formula ∫f = lim σ_{P_n}; this formula for ∫f is called Lebesgue's Analytic Definition of the integral [171, Sec. 24]. Suggestion: If g_P := Σ_i m_i χ_{A_i}, the formulas |f| ≤ |g_P| + ‖P‖ and |f − g_P| ≤ ‖P‖ might be useful.
(b) Prove that if f is integrable, then the series (5.22) converges for any partition P.
(c) The condition µ(X) < ∞ is needed in this problem: give counterexamples to (b) and to the first and second statements in (a), in the case X = ℝ.
7. (Affine transformations and Lebesgue integration) Prove the equality (5.21) for characteristic functions of Lebesgue measurable sets (you'll need Theorem 4.14).
8. (Counting measures)
(a) Let # : P(ℕ) → [0, ∞] be the counting measure. Prove that any extended real-valued function f on ℕ is measurable, and is integrable if and only if the series Σ_{n=1}^∞ |f(n)| is convergent, in which case
∫f d# = Σ_{n=1}^∞ f(n).
9. Let g : R → R be a continuously differentiable nondecreasing function and let µg : B 1 → [0, ∞) be its corresponding Lebesgue-Stieltjes set function. Given a Borel measurable function f , prove that f is µg -integrable if and only if f g ′ is Borel integrable,
in which case
∫f dµ_g = ∫f g′ dx.
The Principle of Appropriate Functions might help.
10. Let (X, S, µ) be a measure space, let g be a nonnegative measurable function, and define m_g : S → [0, ∞] by
m_g(A) = ∫_A g dµ, for all A ∈ S.
(a) Prove that m_g : S → [0, ∞] is a measure.
(b) Given a measurable function f : X → ℝ, prove that f is m_g-integrable if and only if fg is µ-integrable, in which case
∫f dm_g = ∫fg dµ.
Suggestion: The Principle of Appropriate Functions.
11. Fix a Lebesgue integrable function g : ℝⁿ → [0, ∞] that vanishes only on a set of zero Lebesgue measure (that is, g is positive a.e.). Let I = Mⁿ, the σ-algebra of Lebesgue measurable subsets of ℝⁿ, and define µ : I → [0, ∞] by
µ(A) := ∫_A g dm, for all A ∈ I,
where m denotes Lebesgue measure.
(i) Show that µ : I → [0, ∞] is a measure.
(ii) Let A ∈ I. Prove that µ(A) = 0 if and only if m(A) = 0.
(iii) Let µ* : P(ℝⁿ) → [0, ∞] denote the outer measure generated by µ. Let A ⊆ ℝⁿ and suppose that µ*(A) = 0. Show that A ∈ I. Suggestion: Use regularity to find an element B ∈ S(I) = I such that A ⊆ B and µ*(A) = µ*(B) = µ(B). Why does µ*(B) = µ(B)?
(iv) Show that M_{µ*} = I, where M_{µ*} denotes the set of µ*-measurable sets.
(v) Now suppose that g vanishes on a set of positive Lebesgue measure (instead of on a set of zero Lebesgue measure). Show that M_{µ*} ≠ I by finding an element of M_{µ*} that is not in I.
12. Let µ_1 and µ_2 be measures on (X, S) and let µ = µ_1 + µ_2.
(i) Show that µ is a measure on S.
(ii) Prove that a measurable function f is µ_i-integrable for i = 1, 2 if and only if f is µ-integrable, in which case
∫f dµ = ∫f dµ_1 + ∫f dµ_2.
13. (Catalan's Constant) Here are a couple of integrals involving the famous Catalan's constant, G = Σ_{n=0}^∞ (−1)ⁿ/(2n+1)² = 0.915965594…, named after Eugène Charles Catalan (1814–1894). By the way, it's not known whether or not Catalan's constant is rational! (For many more formulas, check out [43].) In this problem you may evaluate integrals using the Fundamental Theorem of Calculus. Using the MCT, prove that
G = ∫_0^1 (tan^{−1} x)/x dx = ∫_0^∞ x/(2 cosh x) dx,
by using the Maclaurin expansion for tan^{−1}(x) for the first equality and by writing x/(2 cosh x) = x e^{−x}/(1 + e^{−2x}) = Σ_{n=0}^∞ (−1)ⁿ x e^{−(2n+1)x}. Suggestion: The series you end up trying to integrate are alternating series, but to apply the MCT you need nonnegative terms; try to group adjacent terms together to get a series of nonnegative terms.
14. (Basel Problem) In this problem we give Leonhard Euler's (1707–1783) first rigorous proof that Σ_{n=1}^∞ 1/n² = π²/6, which was originally announced by Euler in 1735 (see Section 6.1 for more on Euler's sum). The following is Euler's argument from his 1743 paper Demonstration de la somme de cette suite 1 + 1/4 + 1/9 + 1/16 … (Demonstration of the sum of the following series: 1 + 1/4 + 1/9 + 1/16 …) [88] (cf. [246, 247]). In this problem, you may evaluate integrals using the Fundamental Theorem of Calculus.
(a) Prove that if we can show that Σ_{n=1}^∞ 1/(2n − 1)² = π²/8, then it follows that Σ_{n=1}^∞ 1/n² = π²/6; thus, we just have to prove the first equality. Suggestion: Break up Σ_{n=1}^∞ 1/n² into sums over even and odd numbers.
(b) Using the binomial expansion of (1 − x²)^{−1/2} near x = 0, prove that
(5.23)  arcsin x = x + Σ_{n=1}^∞ [1·3·5⋯(2n−1)]/[2·4·6⋯(2n)] · x^{2n+1}/(2n+1),
where this series is valid for all x ∈ [0, 1].¹¹
(c) Dividing both sides of the above series by √(1 − x²) gives
(5.24)  (arcsin x)/√(1 − x²) = x/√(1 − x²) + Σ_{n=1}^∞ [1·3·5⋯(2n−1)]/[2·4·6⋯(2n)(2n+1)] · x^{2n+1}/√(1 − x²).
Integrate this formula over [0, 1], stating why term-by-term integration is permissible, to prove that π²/8 = Σ_{n=0}^∞ 1/(2n + 1)². In Problem 2 of Exercises 2.5 there is a formula for the integral ∫_0^1 x^{2n+1}(1 − x²)^{−1/2} dx = ∫_0^{π/2} sin^{2n+1} t dt, where we substituted x = sin t.
(d) Appendix: We can modify Euler's proof slightly as follows. First, in (5.23) substitute x = sin t to get
(5.25)  t = sin t + Σ_{n=1}^∞ 1/(2n+1) · [1·3·5⋯(2n−1)]/[2·4·6⋯(2n)] · sin^{2n+1} t,
valid for −π/2 < t < π/2. Now integrate this equality from 0 to π/2 and use the well-known integral to get the formula for π²/8. This method was published in [58]. Do you see how this new method is essentially the same as multiplying (5.24) by dx and then putting x = sin t?
∞ X π n! = , 2 1 · 3 · 5 · · · (2n + 1) n=0
(2)
∞
π2 1 X 2 · 4 · · · (2n) = + . 72 8 n=1 [1 · 3 · 5 · · · (2n + 1)](2n + 2)22n+2
In this problem, you may evaluate integrals using the Fundamental Theorem of Calculus. You may proceed as follows: (i) Justify the following equalities: √ √ Z 1/ 2 arcsin x 1 x dt √ = √ arctan √ = 2 . (1 − x2 ) + 2x2 t2 x 1 − x2 x 1 − x2 1 − x2 0 (ii) Expanding (5.26)
1 (1−x2 )+2x2 t2
as a geometric series, prove that
Z 1/ ∞ √ X arcsin x √ = 2 x2n x 1 − x2 0 n=0
√
2
(1 − 2t2 )n dt.
11 Using “Raabe’s test”, one can in fact show that this series converges uniformly for x in [−1, 1], but we won’t need this fact.
(iii) Evaluating the integral ∫_0^{1/√2} (1 − 2t²)ⁿ dt (see Problem 2 in Exercises 2.5) and taking a particular value of x in (5.26), derive the first Pennisi formula. Then, integrating (5.26) from 0 to a and taking a particular value of a, prove the second Pennisi formula.
5.6. The DCT and Osgood's Principle

In this section we prove probably the most important and powerful limit theorem you'll ever need, the Dominated Convergence Theorem (DCT). With minimal assumptions, it allows you to interchange limits and integrals "without fear of contradictions, or of failing examinations" [70, p. 229]. We also discuss Vitali's Convergence Theorem and many corollaries of the DCT.

5.6.1. The Dominated Convergence Theorem. From our discussion concerning Fatou's Lemma, we know that if {f_n} is a sequence of nonnegative functions on a measure space, then concerning the interchange of limits and integration, without making further assumptions the best we can say is
∫ lim f_n ≤ lim ∫f_n,
provided, of course, that the limits on both sides exist. Thus, we seek sufficient conditions under which we can say = rather than ≤. The earliest Lebesgue integration convergence theorem, the Bounded Convergence Theorem, was proved in Lebesgue's 1902 thesis [171]; in modern abstract measure theory it reads:

Bounded Convergence Theorem
Theorem 5.27. If X is a space of finite measure and {f_n} is a sequence of measurable functions such that lim_{n→∞} f_n exists a.e. and there is a constant M > 0 such that for each n ∈ ℕ, |f_n| ≤ M a.e., then the limit function lim f_n and each f_n are integrable, and
∫ lim f_n = lim ∫f_n.

This theorem follows from Lebesgue's DCT, which we'll present below. The Bounded Convergence Theorem (BCT) is a vast generalization of the Arzelà [8, 6, 7] and Osgood [214] Bounded Convergence Theorem for the Riemann integral. Although state of the art at the time, Lebesgue's BCT has two big drawbacks:
(1) It fails in the case X has infinite measure.¹²
(2) It does not apply to unbounded functions.
(¹²Recall the "moving pulse" sequence {f_n} from Example 5.12 in Section 5.4. This sequence is bounded and converges, yet ∫ lim f_n < lim ∫f_n.)
That the BCT does not apply to spaces of infinite measure and to unbounded functions excludes a large chunk of spaces and functions! In 1908, while applying his newly discovered theory of integration in the paper Sur la méthode de M. Goursat pour la résolution de l'équation de Fredholm [168], Lebesgue realized just how restrictive the boundedness assumption really was. He used the BCT to solve a Fredholm integral equation (equations used in diverse areas of mathematics and physics, named after Ivar Fredholm (1866–1927)). Because he used the BCT in his solution, he was forced to put restrictive boundedness hypotheses on the
functions involved, which made his solution impractical. In order to eliminate the restrictive hypotheses and make his solution useful, he then states and proves the DCT; here's what he said [168, p. 11–12]:

In the preceding statement I have pointed out three restrictive hypotheses … The theorem on integration of sequences [the BCT], that has been previously used, will be replaced by the following: A convergent sequence of integrable functions f_i is integrable term by term if there exists an integrable function F such that |f_i| ≤ |F| for all i and for every value of the variable.
In modern abstract measure theory, Lebesgue's theorem reads:

Dominated Convergence Theorem
Theorem 5.28. Let {f_n} be a sequence of measurable functions such that
(1) lim_{n→∞} f_n exists a.e.;
(2) there exists an integrable function g such that for each n ∈ ℕ, |f_n| ≤ g a.e.
Then the limit function lim f_n and each f_n are integrable, and
∫ lim f_n = lim ∫f_n.

Proof: Here's a picture of the situation:

[Figure 5.17. f_n → f (everywhere in this example, where f(0) = 0) and |f_n| ≤ g for all n, where we assume g is integrable. The DCT says that the net area under the f_n's converges to the net area under f.]

Now to the proof. Let f(x) = lim_{n→∞} f_n(x) when this limit exists (and zero when it doesn't exist — recall the convention around (5.20)!). The a.e. inequality |f_n| ≤ g implies that each f_n is integrable, and taking n → ∞ in |f_n| ≤ g a.e., it follows that |f| ≤ g a.e. Thus, f is also integrable. It remains to prove that lim ∫f_n exists and equals ∫f. Because lim ∫f_n exists if and only if the liminf and limsup of ∫f_n are equal, in which case lim ∫f_n is the common value (see Lemma 5.11), all we need to prove is
lim inf ∫f_n = lim sup ∫f_n = ∫f.
Before proving these equalities we need two facts. First, we note that
|f_n| ≤ g  ⟹  −g ≤ f_n ≤ g  ⟹  f_n + g ≥ 0,
where all the inequalities hold a.e. In particular, {f_n + g} is an (a.e.) nonnegative sequence. The second fact concerns lim inf: for any constant b ∈ ℝ and extended real-valued sequence {a_n}, we have
(5.27)
lim inf(an + b) = (lim inf an ) + b
and
lim inf(−an ) = − lim sup an ;
5.6. THE DCT AND OSGOOD’S PRINCIPLE
299
the first equality is an exercise and the second property follows from Property (2) in Lemma 5.11.
Now to our proof. We first work on $\liminf \int f_n$. The idea is to apply Fatou's Lemma; unfortunately, $f_n$ may not be nonnegative, violating the hypotheses of Fatou's Lemma. However, $f_n + g$ is nonnegative, as we saw above. Hence,
$$\int (f + g) = \int \lim (f_n + g) = \int \liminf (f_n + g) \qquad (\liminf = \lim \text{ when } \lim \text{ exists})$$
$$\le \liminf \int (f_n + g) \qquad \text{(Fatou's Lemma)}$$
$$= \liminf \int f_n + \int g \qquad \text{(by (5.27))}.$$
Subtracting $\int g$ we get
$$(5.28) \qquad \int f \le \liminf \int f_n.$$
Applying this argument to the sequence $\{-f_n\}$ gives
$$\int (-f) \le \liminf \int (-f_n).$$
Multiplying by $-1$ and using the second fact in (5.27) gives $\limsup \int f_n \le \int f$. Combining this inequality with (5.28), we see that
$$\limsup \int f_n \le \int f \le \liminf \int f_n.$$
However, the liminf of a sequence is always less than or equal to its limsup, so all these inequalities must in fact be equalities. This proves the result. $\square$
In Section 6.1 we'll show you a plethora of tricks this theorem can perform. Observe that the BCT (Theorem 5.27) is a special case of the DCT with g equal to the constant function M, which is integrable since $\int M = M\mu(X) < \infty$, assuming that $\mu(X) < \infty$. Here's an example for which the DCT applies, but not the BCT.

Example 5.14. This interesting example was given by William Osgood (1864–1943) [213, 214] in 1896: for each $n \in \mathbb{N}$ define $f_n : [0,1] \to \mathbb{R}$ by
$$f_n(x) = \frac{n^2 x}{1 + n^3 x^2}.$$
If $f = 0$, it's easy to check that $f_n \to f$ pointwise. Here are some graphs:

[Figure 5.18. $f_n$ has maximum value $\sqrt{n}/2$, occurring when $x = \sqrt{1/n^3}$. Thus, the $f_n$'s are not uniformly bounded, yet $f_n \to 0$ pointwise.]

For this sequence we can (as Osgood did) show by computation that
$$\lim \int f_n = \int \lim f_n.$$
However, can we prove this equality using a convergence theorem? Using calculus we find that $f_n$ has the maximum value $\sqrt{n}/2$ (obtained when $x = \sqrt{1/n^3}$); in particular, the sequence $\{f_n\}$ is not uniformly bounded. Hence, we cannot use the BCT to answer our question. However, we can use the DCT! It's not obvious at first glance what dominating function will work, but we can find one with the help of calculus. Fix $x > 0$ and define $F : [0, \infty) \to [0, \infty)$ by
$$F(t) = \frac{t^2 x}{1 + t^3 x^2}.$$
Using elementary calculus we find that the maximum value of F is $2^{2/3} x^{-1/3}/3$ (obtained when $t = (2/x^2)^{1/3}$). Thus, $F(t) \le C x^{-1/3}$ for all $t \ge 0$, where $C = 2^{2/3}/3$. In particular, taking $t = 1, 2, 3, \ldots$, we see that for all $n \in \mathbb{N}$,
$$0 \le f_n \le g, \quad\text{where } g(x) = C x^{-1/3}.$$
Thus, we can apply the DCT if we can show that g is integrable. To prove this we shall assume that any Riemann integrable function is Lebesgue integrable with the same integral, which we'll prove in Section 6.2. Now let $g_n = \chi_{[1/n,1]}\, g$. Then $0 \le g_1 \le g_2 \le \cdots$ with $g_n \to g$, so by the MCT we have
$$\int g = \lim_{n\to\infty} \int g_n = \lim_{n\to\infty} \int_{1/n}^{1} C x^{-1/3}\, dx = \lim_{n\to\infty} \frac{3C}{2}\left(1 - \frac{1}{n^{2/3}}\right) = \frac{3C}{2},$$
where in the second-to-last step we used the Fundamental Theorem of Calculus. It follows that g is integrable, so the DCT applies.
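Readers who enjoy experimenting can check Example 5.14 numerically. The following Python sketch (an illustration only, not part of the theory) verifies the pointwise domination $0 \le f_n \le g$ and compares a trapezoid-rule value of $\int_0^1 f_n$ with the closed form $\ln(1 + n^3)/(2n)$, which one gets from the substitution $u = 1 + n^3 x^2$ and which tends to 0, just as the DCT predicts:

```python
import math
import numpy as np

# Constant from the calculus bound in Example 5.14: f_n(x) <= C * x**(-1/3).
C = 2 ** (2 / 3) / 3

def f(n, x):
    return n ** 2 * x / (1 + n ** 3 * x ** 2)

# A geometric grid resolves the narrow peak of f_n near x = n**(-3/2).
x = np.geomspace(1e-8, 1.0, 100_000)
g = C * x ** (-1 / 3)

# Pointwise domination 0 <= f_n <= g (checked for a few n).
for n in (1, 5, 50, 500):
    assert np.all(f(n, x) <= g + 1e-9)

# Trapezoid rule vs. the closed form log(1 + n^3)/(2n), which tends to 0.
for n in (1, 10, 100, 1000):
    y = f(n, x)
    numeric = float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))
    exact = math.log(1 + n ** 3) / (2 * n)
    assert abs(numeric - exact) < 1e-3
```

The geometric grid is chosen because the bump of $f_n$ sits at ever smaller x as n grows; a uniform grid would miss it.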
Now, you may be asking: Lebesgue's DCT gives a sufficient condition for the interchange of limits and integrals; is there also a necessary condition? The answer is yes, and it was provided in a truly amazing paper [300] by Giuseppe Vitali (1875–1932) in 1907, a year before the DCT was stated by Lebesgue.¹³

Vitali's Convergence Theorem

Theorem 5.29. Let $\{f_n\}$ be a sequence of integrable functions such that $\lim_{n\to\infty} f_n$ exists a.e. Then the limit function $\lim f_n$ is integrable on X and for any measurable set $A \subseteq X$ we have
$$\int_A \lim f_n = \lim \int_A f_n$$
if and only if the following conditions hold:
(1) For each $\varepsilon > 0$ there is $\delta > 0$ such that for all measurable sets A with $\mu(A) < \delta$, we have $|\int_A f_n| < \varepsilon$ for all $n \in \mathbb{N}$.
(2) For each $\varepsilon > 0$ there is a measurable set A of finite measure such that for all measurable sets $B \subseteq X \setminus A$, we have $|\int_B f_n| < \varepsilon$ for all $n \in \mathbb{N}$.
Condition (1) is called uniform absolute continuity; it says that the integrals of the $f_n$'s can be made uniformly small on sets of small measure. Condition (2), which we shall call uniform Vitali smallness, basically says that the integrals of the $f_n$'s can be made uniformly small outside sets of finite measure. See Problems 8 and 9 for more on these conditions. Giving both sufficient and necessary conditions for the interchange of limits and integrals, all the convergence theorems we have discussed should be corollaries of the VCT. Indeed, in Section 4 of Lebesgue's 1909 paper "Sur les intégrales singulières" [169], he states many convergence theorems, and in a footnote on page 50 he says

All these properties are particular cases of a very general theorem of Mr. Vitali.

¹³ If you want to read Vitali's original proof, along with some history of the VCT and further developments due to de la Vallée Poussin, Hahn, and Saks, see Choksi's paper [59].
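To see concretely that domination is sufficient but not necessary, consider the standard example (it reappears as Problem 4 in the exercises): $f_n = n\,\chi_{(1/(n+1),\,1/n]}$ on $\mathbb{R}$ with Lebesgue measure. A short Python sketch (illustrative only) checks that $\int f_n = 1/(n+1) \to 0 = \int \lim f_n$, yet the smallest candidate dominating function, $g = \sup_n f_n$, fails to be integrable:

```python
import math

# f_n = n on the interval (1/(n+1), 1/n] and 0 elsewhere, so each integral
# is exactly n * (1/n - 1/(n+1)) = 1/(n+1), which tends to 0.
integrals = [n * (1 / n - 1 / (n + 1)) for n in range(1, 10_001)]
assert integrals[0] == 0.5
assert abs(integrals[-1] - 1 / 10_001) < 1e-12

# The envelope g = sup_n f_n equals n on (1/(n+1), 1/n], so its integral is
# the divergent series sum 1/(n+1); the partial sums grow like log N.
partial = sum(1 / (n + 1) for n in range(1, 10_001))
assert partial > math.log(10_000) - 1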
Indeed, in Problem 9 you will show that the DCT follows directly from the VCT, and in Problem 4 you will find an example of a sequence for which the VCT applies, yet the DCT fails. The proof of the VCT is outlined in Problem 15.

5.6.2. Osgood's Principles. In William Osgood's (1864–1943) 1897 paper Non-Uniform Convergence and the Integration of Series Term by Term [214, p. 155], he begins his paper with¹⁴

The subject of this paper is the study . . . of the conditions under which
$$(*) \qquad \lim_{n\to\infty} \int_{x_0}^{x} s_n(x)\,dx = \int_{x_0}^{x} \lim_{n\to\infty} s_n(x)\,dx$$

Shortly after, he mentions what we shall call Osgood's principle, which basically says that the solution to problem (∗) applies to the interchange of the integral and other processes involving limits (such as series, differentiation and integration); in his own words,
By Osgood’s principle, Problems 1)–4) should be answered by the DCT, and indeed they are! In fact, using the DCT, an answer to Problem 1) is in Theorem 5.30 below; Problem 2) is solved in Problem 5 in the Exercises; Problem 3) is illustrated in the Fichtenholz-Lichenstein Theorem in Problem 7 of Exercises 6.2 (cf. Section 7.3 on Fubini’s Theorem), and finally, Problem 4) is answered in Theorem 5.34. Here’s Osgood’s Problem 1), a theorem that complements Theorem 5.21. Integration of a series term by term Theorem 5.30. If {fn } is a sequence of integrable functions such that ∞ Z X |fn | < ∞,
then the series
P∞
n=1
n=1 fn converges a.e. to an integrable function, and Z X ∞ ∞ Z X fn = fn . n=1
n=1
14The . . . in the quote deals with “Condition (A)”, which is the assumption that {s } is n uniformly bounded on [x0 , x]. Also, the (∗) marking the displayed equation was not in his paper.
302
5. BASICS OF INTEGRATION THEORY
P∞ R P∞ Proof : Since |fn | < ∞, by Theorem 5.21 g := n=1 |fn | converges a.e. R R Pn=1 ∞ and g = |fn | < ∞. In particular, g P is integrable and since absolute n=1 convergence implies convergence it follows that ∞ n=1 fn converges a.e. Now if gn :=
n X
fk ,
k=1
P then |gn | ≤ g, and hence by the DCT, lim gn = ∞ k=1 fk is integrable, and Z X Z Z Z X ∞ n fk = lim gn = lim gn = lim fk n→∞
k=1
n→∞
n→∞
= lim
n→∞
k=1
n Z X k=1
fk =:
∞ Z X
fk .
k=1
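Here is Theorem 5.30 in action on a concrete series (a numerical sketch in Python, with the illustrative choice $f_n(x) = x^n/n!$ on $[0,1]$): since $\int_0^1 x^n/n!\,dx = 1/(n+1)!$, the hypothesis $\sum_n \int |f_n| < \infty$ holds, the sum of the series is $e^x - 1$, and both sides of the conclusion equal $e - 2$:

```python
import math

# Sum of the integrals: int_0^1 x^n/n! dx = 1/(n+1)!, a convergent series.
sum_of_integrals = sum(1 / math.factorial(n + 1) for n in range(1, 60))

# Integral of the sum: sum_{n>=1} x^n/n! = e^x - 1, and int_0^1 (e^x - 1) dx = e - 2.
integral_of_sum = math.e - 2

# Theorem 5.30 says the two agree.
assert abs(sum_of_integrals - integral_of_sum) < 1e-12
```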
The following theorem complements Theorem 5.22.

Theorem 5.31. If f is integrable and $A = \bigcup_{n=1}^{\infty} A_n$ where the sets $A_n$ are disjoint measurable sets, then
$$\int_A f = \sum_{n=1}^{\infty} \int_{A_n} f.$$

Proof: This result follows from the Dominated Convergence Theorem applied to $f_n = \sum_{k=1}^{n} \chi_{A_k} f$, which converges pointwise to $\chi_A f$ and satisfies $|f_n| \le |f|$. $\square$
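Theorem 5.31 can also be seen numerically; a minimal Python sketch (with the illustrative choices $f = 1$, $A = (0,1]$ and $A_n = (1/(n+1), 1/n]$, so the series telescopes) checks that the partial sums of $\sum_n \int_{A_n} f$ climb to $\int_A f = m(A) = 1$:

```python
# Partial sums of sum_n int_{A_n} f with f = 1 and A_n = (1/(n+1), 1/n]:
# each term is m(A_n) = 1/n - 1/(n+1), and the series telescopes.
partial_sums = []
s = 0.0
for n in range(1, 100_001):
    s += 1 / n - 1 / (n + 1)
    partial_sums.append(s)

# The N-th partial sum is 1 - 1/(N+1), which climbs to m((0,1]) = 1.
assert partial_sums[0] == 0.5
assert abs(partial_sums[-1] - (1 - 1 / 100_001)) < 1e-9
```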
5.6.3. Continuity and differentiation. Integrals depending on a continuous parameter occur often in applications. For such applications, the following continuous version of the Dominated Convergence Theorem is helpful. A point $a \in \overline{\mathbb{R}}$ is called a limit point of a set $I \subseteq \mathbb{R}$ if there exists a sequence $\{a_n\}$ of points in $I \setminus \{a\}$ such that $a_n \to a$. The points $a = \pm\infty$ are allowable.

Continuous Dominated Convergence Theorem

Theorem 5.32. Let $I \subseteq \mathbb{R}$ be a set and let $a \in \overline{\mathbb{R}}$ be a limit point of I. For each $t \in I$, let $f_t$ be a measurable function on X, and suppose that
(1) for a.e. $x \in X$, the limit $\lim_{t\to a} f_t(x)$ exists;
(2) there exists an integrable function g such that for a.e. $x \in X$, $|f_t(x)| \le g(x)$ for all $t \in I$.
Then $\lim_{t\to a} f_t$ and each $f_t$ are integrable, and
$$\int \lim_{t\to a} f_t = \lim_{t\to a} \int f_t.$$

Proof: The inequality $|f_t| \le g$ a.e. implies that each $f_t$ is integrable, and taking $t \to a$ shows that $|f| \le g$ a.e., where $f := \lim_{t\to a} f_t$; so f is integrable as well. To prove that $\int f = \lim_{t\to a} \int f_t$, by the sequence formulation of limits we just have to prove that for any sequence $\{t_n\}$ in $I \setminus \{a\}$ of real numbers converging to a, we have
$$(5.29) \qquad \int f = \lim_{n\to\infty} \int f_{t_n}.$$
However, given such a sequence, for each $n \in \mathbb{N}$ we have $|f_{t_n}| \le g$ a.e. and $f = \lim_{n\to\infty} f_{t_n}$ a.e., so (5.29) follows from the usual Dominated Convergence Theorem. $\square$
This theorem can be used to establish conditions assuring the continuity or differentiability of integrals depending on a parameter.

Continuity of integrals

Theorem 5.33. Let $f : I \times X \to \mathbb{R}$, where I is an interval in $\mathbb{R}$, be such that for each $t \in I$, $f(t, x)$ is a measurable function of x. Suppose
(1) for a.e. $x \in X$, $f(t, x)$ is a continuous function of $t \in I$;
(2) there exists an integrable function $g : X \to [0, \infty]$ such that for a.e. $x \in X$, $|f(t, x)| \le g(x)$ for all $t \in I$.
Then
$$F(t) = \int_X f(t, x)\, d\mu$$
is a continuous function of $t \in I$.

Proof: Define $f_t(x) := f(t, x)$ and fix $a \in I$. Then for a.e. $x \in X$ we have $|f_t(x)| \le g(x)$ for all $t \in I$, where g is integrable, and for a.e. $x \in X$ we have $f(a, x) = \lim_{t\to a} f_t(x)$. Thus, by the Continuous Dominated Convergence Theorem, $f(a, x)$ and each $f_t$ are integrable, and
$$\int f(a, x)\, d\mu = \lim_{t\to a} \int f_t\, d\mu,$$
which is another way of writing $F(a) = \lim_{t\to a} F(t)$. This equality implies that F is continuous at $t = a$, and hence on I. $\square$
Theorem 5.34 below gives very general conditions under which one can differentiate under an integral sign, a trick that, as we'll see in the next section, is incredibly useful. In fact, the 1965 Nobel laureate Richard Phillips Feynman (1918–1988) was an expert at this trick. Here's an excerpt of a letter from Feynman to his high school teacher Abram Bader [96, p. 176–177] (cf. [97, Ch. 12]):

Another thing I remember as being very important to me was the time when you called me down after class and said "You make too much noise in class." Then you went on to say that you understood the reason, that it was that the class was entirely too boring. Then you pulled out a book from behind you and said "Here, you read this, take it up to the back of the room, sit all alone, and study this; when you know everything that is in it, you can talk again." And so, in my physics class I paid no attention to what was going on, but only studied Woods' "Advanced Calculus" up in the back of the room. It was there that I learned about gamma functions, elliptic functions, and differentiating under an integral sign. A trick at which I became an expert . . . Thank you very much.¹⁵

Hopefully one day you'll write to me thanking me for exposing you to this trick!

Differentiability under the integral sign

Theorem 5.34. Let $f : I \times X \to \mathbb{R}$, where I is an interval in $\mathbb{R}$, be such that for each $t \in I$, $f(t, x)$ is integrable in x. Suppose that
(1) for a.e. $x \in X$, the partial derivative $\partial_t f(t, x)$ exists for all $t \in I$;
(2) there exists an integrable function $g : X \to [0, \infty]$ such that for a.e. $x \in X$, $|\partial_t f(t, x)| \le g(x)$ for all $t \in I$.
Then
$$F(t) = \int_X f(t, x)\, d\mu$$
is differentiable at each $t \in I$, and
$$F'(t) = \int_X \frac{\partial}{\partial t} f(t, x)\, d\mu.$$

Proof: Fixing $a \in I$, observe that, provided the limit exists, we have
$$F'(a) := \lim_{t\to a} \frac{F(t) - F(a)}{t - a} = \lim_{t\to a} \frac{1}{t - a}\left(\int f(t, x)\, d\mu - \int f(a, x)\, d\mu\right) = \lim_{t\to a} \int f_t(x)\, d\mu,$$
where for each $t \in I$ not equal to a, we have
$$f_t(x) := \frac{f(t, x) - f(a, x)}{t - a}.$$
Thus, we just have to show that
$$\lim_{t\to a} \int f_t(x)\, d\mu \quad\text{exists and equals}\quad \int \frac{\partial}{\partial t} f(a, x)\, d\mu.$$
To see this, note that by Assumption (1), for a.e. $x \in X$,
$$\partial_t f(a, x) = \lim_{t\to a} f_t(x),$$
and from the Mean Value Theorem of elementary calculus, for any $t \in I$,
$$|f_t(x)| = \left|\frac{f(t, x) - f(a, x)}{t - a}\right| = |\partial_t f(a_t, x)|,$$
for some number $a_t$ between a and t. By Assumption (2) it follows that for a.e. x, we have $|f_t(x)| \le g(x)$ for all $t \in I$. Thus, by the Continuous Dominated Convergence Theorem, $\partial_t f(a, x)$ and each $f_t$ are integrable, and
$$\lim_{t\to a} \int f_t(x)\, d\mu = \int \lim_{t\to a} f_t(x)\, d\mu = \int \frac{\partial}{\partial t} f(a, x)\, d\mu. \qquad \square$$
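Before you try the trick yourself, here is Theorem 5.34 checked numerically on the illustrative choice $f(t, x) = x^t$ with $X = [0, 1]$ and $I = [0, \infty)$ (a Python sketch, not part of the proof): $F(t) = \int_0^1 x^t\, dx = 1/(t+1)$, the partial derivative $\partial_t f = x^t \ln x$ is dominated for $t \ge 0$ by the integrable function $g(x) = |\ln x|$, and the theorem predicts $F'(t) = \int_0^1 x^t \ln x\, dx = -1/(t+1)^2$:

```python
import math

def dF_numeric(t, n=200_000):
    # Midpoint rule for int_0^1 x^t * ln(x) dx, the right side in Theorem 5.34.
    h = 1.0 / n
    return h * sum(((k + 0.5) * h) ** t * math.log((k + 0.5) * h) for k in range(n))

# F(t) = int_0^1 x^t dx = 1/(t+1), so the theorem predicts F'(t) = -1/(t+1)^2.
for t in (1.0, 2.0, 5.0):
    assert abs(dF_numeric(t) - (-1 / (t + 1) ** 2)) < 1e-4
```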
See Section 6.1 for applications of this theorem. So far we have focused on real-valued functions, but everything we have talked about works for . . .

¹⁵ The full reference of the book is Advanced Calculus: A Course Arranged with Special Reference to the Needs of Students of Applied Mathematics, Boston, MA: Ginn, 1926, by Frederick Shenstone Woods (1864–1950). Also, the dots . . . are Feynman recalling a lecture at Cornell concerning the trick.
5.6.4. Complex-valued functions. Given a function $f : X \to \mathbb{C}$, we can write it in the form
$$f = f_1 + i f_2, \qquad i = \sqrt{-1},$$
where $f_1$ and $f_2$ are real-valued functions, called the real and imaginary parts, respectively, of f. We say that f is measurable if both $f_1$ and $f_2$ are measurable. For example, the function $f : \mathbb{R} \to \mathbb{C}$ defined by $f(x) = e^{ix}$ is Lebesgue measurable, since $e^{ix} = \cos x + i \sin x$ and cosine and sine are both measurable. Because measurability of complex-valued functions is defined in terms of the measurability of their real and imaginary parts, it's no surprise that many of the properties of measurable (extended) real-valued functions studied in Sections 5.2 and 5.3 also hold for complex-valued functions. For instance, here's a partial list of properties of measurable complex-valued functions:
(1) Complex-valued constant functions are measurable.
(2) If f and g are measurable complex-valued functions, then $fg$, $f + g$, and $|f|^p$ where $p > 0$, are also measurable.
(3) If on the zero set $\{x \; ; \; f(x) = 0\}$, $1/f$ is redefined as a measurable complex-valued function, then $1/f$ is a measurable function on X.
(4) If $\{f_n\}$ is a sequence of measurable complex-valued functions, and if the limit $\lim_{n\to\infty} f_n(x)$ exists for a.e. x, then it defines a measurable function by redefining the limit to be zero (or any other measurable complex-valued function) at the points where it does not converge.
These results follow directly from Theorems 5.12 and 5.14 by applying those theorems to the real and imaginary parts of the complex-valued measurable functions. We won't bore you with the proofs. Suffice it to say that everything you know to be true for real-valued measurable functions is also true for complex-valued ones . . . as long as we stay away from properties that specifically depend on the ordering of the real numbers, because the complex numbers have no ordering that respects the usual algebraic operations.¹⁶ For example, we don't define the limit infimum of a sequence of complex-valued functions.

We now turn to integration. We say that a measurable function $f : X \to \mathbb{C}$ is integrable if its real and imaginary parts are integrable. If $f = f_1 + i f_2$ is broken up into its real and imaginary parts, then we define
$$\int f := \int f_1 + i \int f_2.$$
One can check that
$$|f_1|, |f_2| \le |f| = \sqrt{f_1^2 + f_2^2} \le |f_1| + |f_2|.$$
It follows that f is integrable (meaning both $\int |f_1|$ and $\int |f_2|$ are finite real numbers) if and only if $|f|$ is an integrable real-valued function. Thus,
$$f : X \to \mathbb{C} \text{ is integrable} \iff \int |f| < \infty.$$

¹⁶ For example, supposing there were an ordering ">" on $\mathbb{C}$ and supposing $i > 0$, multiplying both sides by i we get $-1 > 0$, an absurdity. Similarly, if $i < 0$, then multiplying both sides by $-i$ gives $1 < 0$, another absurdity.

Here are some other properties of the integral for complex-valued functions.

Theorem 5.35. Given any complex-valued integrable functions f and g and complex numbers a and b, the integral is linear:
$$\int (af + bg) = a \int f + b \int g.$$
Moreover,
$$\left| \int f \right| \le \int |f|.$$

Proof: Since the integral is linear for real-valued functions, one can easily show that
$$\int (f + g) = \int f + \int g.$$
Thus, all we have to show is that $\int af = a \int f$ for $a \in \mathbb{C}$. Writing $a = \alpha + i\beta$ where $\alpha, \beta \in \mathbb{R}$, and writing $f = f_1 + i f_2$ where $f_1$ and $f_2$ are real-valued, we have $af = \alpha f_1 - \beta f_2 + i(\alpha f_2 + \beta f_1)$. Thus,
$$\int af = \int (\alpha f_1 - \beta f_2) + i \int (\alpha f_2 + \beta f_1) \qquad \text{(def. of integral)}$$
$$= \alpha \int f_1 - \beta \int f_2 + i\left(\alpha \int f_2 + \beta \int f_1\right) \qquad \text{(linearity)}$$
$$= (\alpha + i\beta)\left(\int f_1 + i \int f_2\right) \qquad \text{(algebra)}$$
$$= a \int f.$$
We now prove the absolute value inequality. If $\int f = 0$, then $|\int f| \le \int |f|$ is satisfied, so assume that $\int f \ne 0$. Let $a = |\int f| / \int f$. Then
$$\left| \int f \right| = a \int f = \int af.$$
If we write $af = g_1 + i g_2$ where $g_1$ and $g_2$ are real-valued, then as $\int af$ is a real number (equalling $|\int f|$), it follows that $\int af = \int g_1$. Now $g_1 \le \sqrt{g_1^2 + g_2^2} = |af| = |f|$, since $|a| = 1$. Hence,
$$\left| \int f \right| = \int af = \int g_1 \le \int |f|. \qquad \square$$

We remark that all the properties of integrals for real-valued functions hold for complex-valued functions as well, as long as the properties don't require the complex numbers to be ordered. For example, the MCT doesn't make sense as stated for complex-valued functions, but the Dominated Convergence Theorem, its continuous version, and the continuity and differentiability theorems all hold as stated for complex-valued functions. You can verify these statements if you wish.

◮ Exercises 5.6.

1. (Cantor's Lebesgue-Stieltjes set function cf. [131]) Let $\mu_\psi$ denote the Lebesgue-Stieltjes set function of the Cantor function $\psi$ from Section 4.5.
(i) Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function. Prove that $\mu_\psi$-a.e. we have
$$f = \lim f_n, \quad\text{where}\quad f_n = \sum_{\alpha_1 \dots \alpha_n} f\!\left(\frac{\alpha_1}{3} + \cdots + \frac{\alpha_n}{3^n}\right) \chi_{C_{\alpha_1 \dots \alpha_n}},$$
where the sum is over all n-tuples $(\alpha_1, \dots, \alpha_n)$ of 0's and 2's. (Recall that $C = \bigcap_{n=1}^{\infty} C_n$ with $C_n = \bigcup C_{\alpha_1 \dots \alpha_n}$, this union being over all n-tuples $(\alpha_1, \dots, \alpha_n)$ of 0's and 2's; see Problem 2 in Exercises 4.5.)
(ii) Prove that
$$\int f\, d\mu_\psi = \lim \int f_n\, d\mu_\psi.$$
(iii) Fix $c \in \mathbb{C}$, let $f(x) = e^{cx}$ and put $F_n = \int f_n\, d\mu_\psi$. Prove that for $n \ge 2$, $F_n = F_{n-1}\, e^{c/3^n} \cosh(c/3^n)$. Conclude that $F_n = e^{c(1/3 + \cdots + 1/3^n)} \prod_{k=1}^{n} \cosh\frac{c}{3^k}$.
(iv) Prove that
$$\int e^{cx}\, d\mu_\psi = e^{c/2} \prod_{k=1}^{\infty} \cosh\frac{c}{3^k}.$$
2. Fix $a \in (1/2, 1)$ and for each $n \in \mathbb{N}$, define $f_n : [0,1] \to \mathbb{R}$ by $f_n(x) = n^a x e^{-nx}$. Show that $\lim f_n$ exists and that the DCT, but not the BCT, implies that we can interchange limits with integrals for this sequence.

3. (Original Proofs) Suppose $\mu(X) < \infty$, let $\{f_n\}$ be a sequence of measurable real-valued functions, and let $f := \lim_{n\to\infty} f_n$, assumed to exist at all $x \in X$ as a real number.
(a) Lebesgue's proof of the BCT from his thesis [171, p. 259]: Assume there is a constant $M > 0$ such that for each $n \in \mathbb{N}$, $|f_n| \le M$. Let $\varepsilon > 0$ and let $A_n = \bigcup_{k \ge n} \{|f - f_k| \ge \varepsilon\}$. Prove that $\lim_{n\to\infty} \mu(A_n) = 0$. Next, prove that
$$\left| \int f - \int f_n \right| \le \int_{A_n} |f - f_n| + \int_{A_n^c} |f - f_n| \le 2M\mu(A_n) + \varepsilon\, \mu(X).$$
Conclude that $\int f = \lim \int f_n$.
(b) Lebesgue's proof of the DCT from his 1910 paper [170, p. 375–376]: In this problem, the only convergence theorem for measurable functions you are allowed to use in your proof is the BCT. Assume there is an integrable function g such that for each $n \in \mathbb{N}$, $|f_n| \le g$. Let $\varepsilon > 0$ and show there is an $M > 0$ such that $\int_A g < \varepsilon$, where $A = \{g > M\}$. Next, prove that
$$\left| \int f - \int f_n \right| \le \int_A |f - f_n| + \int_{A^c} |f - f_n| \le 2\int_A g + \int_{A^c} |f - f_n|,$$
and use the BCT for the integral over $A^c$ to show that $\int f = \lim \int f_n$.
(c) Fatou's proof of his lemma from his 1906 paper [93, p. 375–76]: In this problem, the only convergence theorem for measurable functions you are allowed to use in your proof is the BCT. Assume now that the $f_n$'s are nonnegative; we need to prove that $\int f \le \liminf \int f_n$.¹⁷ (If $\liminf \int f_n = \infty$, there is nothing to prove, so assume that $\liminf \int f_n < \infty$.) Now, fixing $k \in \mathbb{N}$ for the moment, define $E_k = \{x \in X \; ; \; f(x) \le k\}$, and define $g_n : X \to [0, \infty)$ by
$$g_n(x) = \begin{cases} f_n(x) & \text{if } f_n(x) \le k \\ f(x) & \text{if } f_n(x) > k. \end{cases}$$

¹⁷ Fatou's Lemma in our textbook reads $\int \liminf f_n \le \liminf \int f_n$; however, Fatou assumes that $f := \lim f_n$ exists, so Fatou proves that $\int f \le \liminf \int f_n$.
Using the BCT, prove $\lim_{n\to\infty} \int_{E_k} g_n = \int_{E_k} f$. Second, prove $\lim_{n\to\infty} \int_{E_k} g_n \le \liminf \int f_n$. So far, you've shown that for any k,
$$\int_{E_k} f \le \liminf \int f_n.$$
Third, prove that for any nonnegative simple function s, $\int s = \lim_{k\to\infty} \int_{E_k} s$. Using this result and the definition of $\int f$ as the supremum of its lower sums, prove that given any simple function s with $0 \le s \le f$, we have $\int s \le \liminf \int f_n$. Finally, conclude that $\int f \le \liminf \int f_n$.

4. Here are some interesting (counter)examples.
(a) (DCT is sufficient but not necessary) We show that the dominating condition in the DCT is not necessary for the interchange of limits and integrals. Let $X = \mathbb{R}$ with Lebesgue measure and for $n = 1, 2, 3, \ldots$, let $f_n = n\chi_{(1/(n+1),\,1/n]}$. Show that¹⁸
(i) $\lim f_n = 0$ at all points of X;
(ii) $\int_A \lim f_n = \lim \int_A f_n$ for all measurable sets $A \subseteq X$;
(iii) there is no integrable function g on X such that for each $n \in \mathbb{N}$, $|f_n| \le g$ a.e.;
(iv) the conditions of the VCT are satisfied, as they should be.
(b) (Counterexample to differentiation theorem) Let $f : [0, \infty) \times [0, \infty) \to \mathbb{R}$ be the function $f(t, x) = t^2 e^{-tx}$ and define $F(t) = \int_0^\infty f(t, x)\, dx$. Prove that $F'(0) = 1$ and $\int_0^\infty \partial_t f(0, x)\, dx = 0$ (since $t \in [0, \infty)$, the derivatives at $t = 0$ are meant in the sense of right-hand derivatives). What hypothesis of Theorem 5.34 is violated?

5. (Differentiation of series) Let $\{f_n\}$ be a sequence of differentiable functions on an interval $I \subseteq \mathbb{R}$ such that the series $f(t) = \sum_{n=1}^{\infty} f_n(t)$ converges for each $t \in I$. Suppose that for each $n \in \mathbb{N}$ there is a constant $M_n \ge 0$ such that $|f_n'| \le M_n$ a.e. and $\sum_{n=1}^{\infty} M_n$ converges. Then $f : I \to \mathbb{R}$ is differentiable, and
$$f'(t) = \sum_{n=1}^{\infty} f_n'(t); \quad\text{that is,}\quad \frac{d}{dt} \sum_{n=1}^{\infty} f_n(t) = \sum_{n=1}^{\infty} \frac{d}{dt} f_n(t).$$

6. Prove that a complex-valued function $f : X \to \mathbb{C}$ is measurable if and only if $f^{-1}(A)$ is measurable for every open set $A \subseteq \mathbb{C}$.

7. (Averaging Theorem) Let X have finite measure. Let $f : X \to \mathbb{C}$ be integrable and suppose that there is a closed set $C \subseteq \mathbb{C}$ such that for every measurable set $A \subseteq X$ with $\mu(A) > 0$, the averages of f over A are in C, that is,
$$(5.30) \qquad \frac{1}{\mu(A)} \int_A f \in C.$$
Prove that $f(x) \in C$ for a.e. $x \in X$. Suggestion: Since C is closed, $C^c$ is open, and hence is a countable union of open balls. Thus, it suffices to prove that $\mu(f^{-1}(B)) = 0$ where B is an open ball in $C^c$. Given such a ball, suppose that $A = f^{-1}(B)$ has positive measure. Using (5.30), derive a contradiction.

8. (Absolute continuity) Let $f : X \to \mathbb{R}$ be an integrable function on a measure space $(X, \mu)$. The set function $\nu_f$ defined by $\nu_f(A) := \int_A f$ for all measurable sets A is said to be absolutely continuous if given any $\varepsilon > 0$, there is a $\delta > 0$ such that for all measurable sets A with $\mu(A) < \delta$, we have $|\int_A f| < \varepsilon$.
(i) Prove that $\nu_f$ is absolutely continuous if and only if $\nu_{|f|}$ is absolutely continuous.
(ii) Prove that $\nu_{|f|}$, and hence $\nu_f$, is absolutely continuous. Suggestion: First prove that the set function $\nu_s$ for any nonnegative integrable simple function s is absolutely continuous, then approximate $|f|$ by simple functions.

¹⁸ In fact, let $\{a_n\}$ be a sequence with $0 < \cdots < a_3 < a_2 < a_1 = 1$ and $a_n \to 0$. Let $I_n = (a_{n+1}, a_n]$ and $b_n = [n(a_n - a_{n+1})]^{-1}$. Then $f_n = b_n \chi_{I_n}$ can be substituted for $n\chi_{(1/(n+1),\,1/n]}$.
9. (Vitali smallness) Let (X, µ) be a measure space. An integrable function f : X → R is said to be Vitali small if for each ε > 0 there is a measurable set A of finite measure R such that for all measurable sets B ⊆ Ac , we have | B f | < ε.19 (i) Prove that an integrable nonnegative function f is Vitali small if and R only if for each ε > 0 there is a measurable set A of finite measure such that Ac f < ε. (ii) Prove that for any integrable function f , if |f | is Vitali small, then f is Vitali small. (iii) Prove that any integrable function f is Vitali small. Suggestion: You just have to prove that |f | is Vitali small. To do so, first try to prove that any nonnegative integrable simple function is Vitali small. (iv) Using this problem and the previous one, prove that the VCT implies the DCT. 10. (Fundamental Theorem of Calculus, to be generalized in Chapter ??.) Given any differentiable function f : R → R with a bounded derivative, we shall prove that the derivative f ′ is Lebesgue integrable on any compact interval [a, b],20 and Z b (5.31) f ′ dm = f (b) − f (a). a
This result improves on Riemann integration, because for the Riemann integral we have to assume not only that f ′ is bounded but also Riemann integrable. Here, boundedness of f ′ automatically ensures it’s Lebesgue integrable. Proceed as follows. (i) Let f : R → R be differentiable almost everywhere, that is, for a.e. x ∈ R, the limit of the difference quotients, f ′ (x) := lim fh (x)
exists, where fh (x) =
h→0
f (x + h) − f (x) . h
Show that f ′ is a Lebesgue measurable function. (ii) Henceforth assume that f ′ exists at all points of R and is bounded. Show that Z b Z b f ′ = lim fh (x) dx. a
h→0
a
(iii) Using the translation invariance of the integral as discussed at the end of Section 5.4, show that Z b Z Z 1 b+h 1 a+h fh (x) dx = f (x) dx − f (x) dx. h b h a a Z 1 c+h f (x) dx = f (c). Using this (iv) Show that for any point c ∈ R, we have lim h→0 h c fact, taking h → 0 in (iii), prove (5.31). (v) The boundedness condition is needed for the proof; consider the following example. Let f (x) = x2 sin(1/x2 ) for x 6= 0 and define f (0) = 0. Prove that f ′ exists for all x ∈ R but f ′ is not Lebesgue integrable on [0, 1], that is, show R1 that 0 |f ′ | = ∞. You may assume that the Lebesgue integral is the same as the Riemann integral on Riemann integrableR functions and use any theorems on 1 Riemann integration to help you show that 0 |f ′ | = ∞. 11. (Another proof of Lebesgue’s Dominated Convergence Theorem) Assuming that µ(X) < ∞, prove Lebesgue’s DCT using Egorov’s theorem. 12. (Yet another proof of Lebesgue’s Dominated Convergence Theorem) Let {fn } be a sequence of measurable functions such that f := lim fn exists a.e. and |fn | ≤ g 19
Thus, the integral of f can be made arbitrarily small outside of sets of finite measure. In order for (5.31) to hold, we just need f to be differentiable on the interval [a, b] with a bounded derivative, and not on all of R as stated, but assuming f is differentiable on R allows us to not have to think about right and left-hand limits to define f ′ (a) and f ′ (b). 20
310
5. BASICS OF INTEGRATION THEORY
a.e. for some integrable function g. We shall prove that Z Z Z Z f = lim fn , that is, lim f − fn = 0.
R R R (i) Show that f − fn ≤ gn , where gn = |f − fn |, which is defined a.e. Thus, R we have to show that lim gn = 0. (ii) Show that {2g − gn } is an a.e. nonnegative sequence of measurable functions. R (iii) Use Fatou’s lemma on the sequence {2g − gn } to prove that lim sup gn = 0. R (iv) Prove that lim gn = 0, which completes the proof. 13. (Young’s Convergence Theorem) Here’s a convergence theorem due to William Henry Young (1863–1942) who discovered an alternative formulation of Lebesgue’s theory of integration, which he published in the paper “On the general theory of integration” [312] in 1905. See [223, Ch. 5] for more on Young’s integral. Prove the following theorem of Young, proved in 1910 [314, 228]: Theorem. Let {fn }, {gn }, {hn } be measurable functions and suppose that (1) the limit functions lim fn , lim gn , lim hn exist a.e.; (2) For each n ∈ N, gn ≤ fn ≤ hn a.e.;R R R (3) lim Rgn and lim hn are integrable and lim gn = lim gn and lim hn = lim hn . Then lim fn and each fn are integrable, and Z Z lim fn = lim fn . 14. Using Young’s Convergence Theorem, prove the following two results. (a) Let {fn } be a sequence of measurable functions such that lim fn exists a.e. and n→∞
for each n ∈ N, |fRn | ≤ gn a.e. forRsome integrable function gn . Suppose that lim gn is integrable and lim gn = lim gn . Then lim fn and each fn are integrable, and Z Z lim fn = lim fn .
(b) (Cf. [149]) LetP {fn }, {gn }, {h n } be measurable functions and suppose that P ∞ g and (1) the series ∞ n n=1 hn converge a.e. to finite values; n=1 (2) For each n ∈ N, g n ≤ fn ≤ hn a.e.; P P∞ hn areRintegrable and we can interchange sums and inteand ∞ (3) n=1 P n=1 g RP RnP P∞ R ∞ ∞ ∞ h . g and h = grals: g = n n n n n=1 n=1 n=1 n=1 P Then ∞ n=1 fn converges a.e. to an integrable function, and Z X ∞ Z ∞ X fn = fn . n=1
n=1
15. (Vitali’s Convergence Theorem) In this problem we prove the celebrated VCT. Let {fn } be a sequence of integrable functions such that f := lim fn exists a.e. n→∞
We need R to show R that f is integrable on X and for any measurable set A ⊆ X we have A f = lim A fn if and only if {fn } is both uniformly absolutely continuous and uniformly Vitali small. Assume the results stated in Problems 8 and 9 concerning absolute continuity and Vitali smallness. (i) Prove the “only if” implication. Suggestion: Observe that for any measurable set M ⊆ X, Z Z Z , + ≤ f (f − f ) f n n M M M R R R and M (fn − f ) → 0 as n → ∞ since M f = lim M fn . The “if” portion is more difficult, so we shall attack it in pieces.
5.6. THE DCT AND OSGOOD’S PRINCIPLE
311
(ii) For this step and the next step, we assume that µ(X) < ∞ and that {fn } is uniformly absolutely continuous. (In the finite measure case we can drop the Vitali smallness condition.) Let ε > 0 and choose δ > 0 such that for all measurable sets R B ⊆ X such that µ(B) < δ, we have B |fn | < ε for all n. Using Fatou’s Lemma, prove that Rf is integrable on any measurable set B ⊆ X such that µ(B) < δ, and moreover, B |f | ≤ ε. (iii) With ε and δ as in the previous step, use Egorov’s Theorem to show there is a measurable set A ⊆ X such thatR µ(Ac ) < Rδ and fn → f uniformly on A. Prove that f is integrable on A and A f = lim A fn . Conclude by (ii) that f is integrable on X. Also, show that Z Z Z Z Z Z f − fn ≤ f − f + f + f n c c n A A A A R R and R use this R to show that | f − fn | < 3ε for n sufficiently large. Conclude that f = lim fn . (iv) We now consider the general case. Assume that {fn } is both uniformly absolutely continuous and uniformly Vitali small (but not necessarily that µ(X) < ∞). Let ε > 0. Using the Vitali small condition and Fatou’s RLemma, prove that there is R a measurable set B with finite measure such that Bc |fn | < ε for all n and |f | ≤ ε. By Steps (ii) and (iii), f is integrable on B and hence f is integrable Bc on X. Now let A ⊆ X be measurable and show that Z Z Z Z Z Z + + ≤ f f f f − f f − n n n A
A∩B
A
R
A∩B
A\B
A\B
R
and use this to show that |∫_A f − ∫_A f_n| < 3ε for n sufficiently large. Conclude that ∫_A f = lim ∫_A f_n. Congratulations! You have just proven one of the best theorems in integration theory!
16. (Jensen's Inequality) In this problem we prove Johan Jensen's (1859–1925) inequality for convex functions.
(i) Let I = (a, b) where −∞ ≤ a < b ≤ ∞. A function ϕ : I → R is said to be convex if

ϕ((1 − t)x + ty) ≤ (1 − t)ϕ(x) + tϕ(y)

for all x, y ∈ I and 0 ≤ t ≤ 1. Show that if ϕ is graphed and z is any point between x and y, then the point (z, ϕ(z)) lies on or below the line joining the points (x, ϕ(x)) and (y, ϕ(y)).
(ii) Show that ϕ is convex if and only if for all a < x < y < z < b, we have

(5.32)    (ϕ(y) − ϕ(x))/(y − x) ≤ (ϕ(z) − ϕ(y))/(z − y).

(iii) Suppose that ϕ is differentiable on I. Using (5.32), prove that ϕ is convex if and only if ϕ′ is a nondecreasing function. In particular, any exponential function a^x where a > 1 is a convex function on R.
(iv) Suppose that ϕ is convex on I. Using (5.32), prove that ϕ is Lipschitz on any closed interval of I; that is, given any closed interval J ⊆ I there is a constant M such that |ϕ(x) − ϕ(y)| ≤ M|x − y| for all x, y ∈ J. In particular, any convex function is continuous.
(v) Let ϕ be a convex function on I. Let A be a measurable set with measure 1 and let f be a measurable function on A such that f(x) ∈ I for all x ∈ A. Then

ϕ(∫_A f) ≤ ∫_A ϕ ◦ f    (Jensen's Inequality).
You may proceed as follows: Let y = ∫_A f. Show that y ∈ I. Let α be the supremum over x ∈ (a, y) of the left-hand side of (5.32). Show that

ϕ(z) ≥ ϕ(y) + α(z − y)

for all z ∈ I. In particular, this inequality holds for z = f(x) with x ∈ A. Now set z = f(x) and integrate both sides of the inequality over A. Note that ϕ ◦ f is measurable by Proposition 5.8.
(vi) Let X = {1, 2, . . . , n} and given α₁, . . . , α_n ≥ 0, define µ_α : P(X) → [0, ∞] by µ_α({j₁, . . . , j_k}) = α_{j₁} + · · · + α_{j_k}. Prove that µ_α is a measure. Suppose now that µ_α(X) = 1, that is, Σ α_k = 1. Using Jensen's Inequality with ϕ(x) = a^x where a > 1, prove that for any positive numbers a₁, . . . , a_n,

a₁^{α₁} a₂^{α₂} · · · a_n^{α_n} ≤ α₁a₁ + α₂a₂ + · · · + α_n a_n.

Suggestion: Let f(k) = log_a(a_k). In particular, if α_k = 1/n for each k, deduce that the geometric mean of n nonnegative real numbers never exceeds their arithmetic mean; that is, for any nonnegative numbers a₁, . . . , a_n,

(a₁a₂ · · · a_n)^{1/n} ≤ (1/n)(a₁ + a₂ + · · · + a_n).
Notes and references on Chapter 5

§5.1: We introduced René Baire's sequence {D_n} (where D_n : [0, 1] → [0, 1] is defined by D_n(x) = 1 if x = p/q ∈ Q in lowest terms with 1 ≤ q ≤ n and D_n(x) = 0 otherwise), in order to give an example of a sequence of Riemann integrable functions whose limit, the Dirichlet function, is not Riemann integrable. Historically, this sequence was introduced by Baire in 1898 [13] not to show that Riemann integrability fails under taking limits, but rather as an example of a function of Class 2. Here, a Class 0 function is a continuous function. A Class 1 function is a function that is not of Class 0 (that is, not continuous) but is a limit of a sequence of Class 0 functions. More generally, a Class n function is a function that is not in any of the preceding classes, but is a limit of a sequence of Class n − 1 functions. To see that D(x) is of Class 2, we first prove that each D_n(x) is of Class 1 by proving it's a limit of "spiky" continuous functions; here's a proof for D₁ (the remaining cases are left to you):

[Figure: continuous "spiky" functions f_k, with spikes of width about 1/k at the points where D₁ = 1, converging pointwise to D₁ as k → ∞.]
Since each D_n is of Class 1 and Dirichlet's function D is the limit of the D_n's, it follows that D is of Class 2, provided that we can show D is not of Class 1; that is, provided we can show that D is not a limit of continuous functions. This result follows from Baire's theorem on functions of first class, one version of which reads [13], [14, Ch. 2], [217, p. 33]: If a function f is of Class 1, then the set of points where f is continuous must be dense in the interval. Since D has no continuity points, it cannot be of Class 1. In particular, there does not exist a sequence of smooth functions converging to D, which is why for Theorem 5.1 we had to introduce thick Cantor sets and the like to produce non-Riemann integrable functions that are limits of smooth functions.
§5.2: Here's Vitali's statement from his 1905 paper [299, p. 601] of "Luzin's theorem": If f(x) is a finite and measurable function in an interval (a, b) of length l, there exists for each positive number ε, as small as desired, in (a, b) a closed set with measure greater than l − ε such that the values of f(x) at points of it form a continuous function.
At the bottom of the same page, Vitali mentions: This theorem seems to be the same as the one to which it is referred in the notes of Borel and Lebesgue and which is reproduced in the cited book of Lebesgue at the foot of page 125 in the note. The author, who is Lebesgue, has not yet given an explicit proof. Thus, "Luzin's theorem," from 1912, seems to be the same as the results of Borel [35] (1904) and Lebesgue [166] (1903), but Borel and Lebesgue's results were not proven explicitly (cf. Bourbaki's book [42, p. 223]).
§5.3: The Italian mathematician Carlo Severini (1872–1951) in 1910 proved a different version of Egorov's theorem involving almost uniform convergence for the partial sums of a series of orthogonal functions [251, 252]. He gives a footnote [251, p. 3] containing a (wrong) statement of Egorov's theorem. This is why we don't call Egorov's theorem the Egorov-Severini (or Severini-Egorov) theorem, as it's sometimes called.
§5.4, 5.5: The papers [129, 130] and the book [223] give very informative discussions of some of the early definitions of the integral. For example, in 1905, William Henry Young (1863–1942) published a Riemann-Darboux formulation of Lebesgue's theory of integration [312]. See Problem 5 in Exercises 5.4. He partitions the domain of a function just like for Riemann-Darboux integration, except that he (1) considers infinite sums (instead of finite sums) and (2) partitions the domain into measurable sets (instead of intervals). Here's what Young said [312]: What would be the effect on the Riemann and Darboux definitions, if in those definitions the word "finite" were replaced by "countably infinite," and the word "interval" by "set of points"? A further question suggests itself: Are we at liberty to replace the segment (a, b) itself by a closed set of points, and so define integration with respect to any closed set of points? Going one step further, recognizing that the theory of the content of open sets quite recently developed by M.
Lebesgue has enabled us to deal with all known open sets in much the same way as with closed sets as regards the very properties which here come into consideration, we may attempt to replace both the segment and the intervals of the segment by any kinds of measurable sets. Another formulation of Lebesgue's theory of integration, due to Young [313, 316] and Frigyes Riesz (1880–1956) [236], uses monotone sequences and doesn't require measure theory. James P. Pierpont (1866–1938) [224] basically replaces measure by outer measure so one can integrate functions over non-measurable sets.
§5.6: To summarize the main convergence theorems, we have proven, in order:

MCT =⇒ Fatou's Lemma =⇒ DCT =⇒ BCT.
Historically, however, the BCT was proved first, by Lebesgue in his Ph.D. thesis of 1902 [171]; then in 1906 Beppo Levi proved the MCT [177] and, independently, Fatou proved his Lemma [93]; finally, in 1908 Lebesgue stated the DCT in [168]. Some of the original proofs of these results were given in Problem 3 of Exercises 5.5.
CHAPTER 6
Some applications of integration

This chapter is devoted to a few applications of Lebesgue's integral. We start with . . .

6.1. Practice with the DCT and its corollaries

In this section we apply Lebesgue's DCT to a very small sample of problems; all the applications of the DCT could very well fill thousands of pages. In this section, we use the fact that any Riemann integrable function is Lebesgue integrable and the two integrals agree, which we'll prove in Section 6.2. In particular, we shall freely use standard facts concerning the Riemann integral, for instance, the Fundamental Theorem of Calculus and change of variables.

6.1.1. The probability integral and some of its relatives. As a first application of the DCT machinery (or rather, its corollaries on continuity and differentiation of integrals, Theorems 5.33 and 5.34), we prove the probability integral formula

∫_0^∞ e^{−x²} dx = √π/2,  or equivalently  ∫_R e^{−x²} dx = √π,

which we already studied back in Section 2.5. Let us put I = R, X = [0, ∞), and

f(t, x) = e^{−x²} ∫_0^{tx} e^{−s²} ds

in Theorems 5.33 and 5.34. Let M = ∫_0^∞ e^{−s²} ds, which is some number that we'll compute in a moment. Note that

|f(t, x)| ≤ M e^{−x²}  and  |∂_t f(t, x)| = |x e^{−(1+t²)x²}| ≤ |x e^{−x²}|,

where we used the Fundamental Theorem of Calculus to compute ∂_t f(t, x), and e^{−x²} and x e^{−x²} are both integrable functions on X. Therefore, by the Continuity and Differentiation theorems, the function

F(t) := ∫_0^∞ f(t, x) dx

is a continuously differentiable function of t ∈ R. Moreover, we have

F′(t) = ∫_0^∞ ∂_t f(t, x) dx = ∫_0^∞ x e^{−(1+t²)x²} dx = [−(1/(2(1+t²))) e^{−(1+t²)x²}]_{x=0}^{x=∞} = 1/(2(1+t²)).
Hence, F(t) = (1/2) tan⁻¹(t) + c for some constant c. Since f(0, x) = 0 and tan⁻¹(0) = 0, we must have F(t) = (1/2) tan⁻¹(t). Now using the formula for f(t, x), observe that for each x ∈ (0, ∞), and hence a.e. on [0, ∞), we have lim_{t→∞} f(t, x) = M e^{−x²}. Thus, by the Continuous Dominated Convergence Theorem, we have

π/4 = (1/2) tan⁻¹(∞) = lim_{t→∞} F(t) = ∫_0^∞ M e^{−x²} dx = M².

This implies that M = √π/2, which is to say,

∫_0^∞ e^{−x²} dx = √π/2,

as we set out to prove. What might be amazing is that e^{−x²} cannot be integrated in "finite terms," meaning that e^{−x²} does not have an antiderivative expressible as a finite sum of familiar functions (such as rational functions, logarithms, exponential functions, trig functions, etc.), so there is no way to evaluate ∫_0^∞ e^{−x²} dx by finding an antiderivative and using the Fundamental Theorem of Calculus. If you're interested in proving that e^{−x²} cannot be integrated in finite terms, the articles [240, 144, 196] will show you how.¹

Let's consider another example (called a cosine transform):

G(t) = ∫_0^∞ e^{−x²} cos tx dx,  t ∈ R.

If we put f(t, x) = e^{−x²} cos tx, then ∂_t f(t, x) = −x e^{−x²} sin tx, so

|f(t, x)| ≤ e^{−x²}  and  |∂_t f(t, x)| ≤ |x| e^{−x²},

so by the Continuity and Differentiation Theorems it follows that G(t) is a continuously differentiable function of t ∈ R. Moreover, integrating by parts, we have

G′(t) = −∫_0^∞ x e^{−x²} sin tx dx = [(1/2) e^{−x²} sin tx]_{x=0}^{x=∞} − (t/2) ∫_0^∞ e^{−x²} cos tx dx = −(t/2) G(t).

The solution to the differential equation G′(t) = −(t/2) G(t) is G(t) = C e^{−t²/4} for some constant C. To evaluate the constant, we set t = 0 and use that ∫_0^∞ e^{−x²} dx = √π/2 to find C = √π/2. Thus, G(t) = (√π/2) e^{−t²/4}, or

∫_0^∞ e^{−x²} cos tx dx = (√π/2) e^{−t²/4}.

Replacing x with ax and t with t/a where a > 0, we obtain the interesting result:

(6.1)    ∫_0^∞ e^{−ax²} cos tx dx = (1/2) √(π/a) e^{−t²/4a},  a > 0, t ∈ R.
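For readers who like to experiment, both formulas are easy to sanity-check numerically. The following Python sketch (not part of the text's proof; the midpoint rule, the truncation point 10, and the sample values a = 2, t = 3 are arbitrary choices) approximates the probability integral and (6.1):

```python
import math

def midpoint(g, a, b, n=200_000):
    # simple midpoint-rule approximation of the integral of g over [a, b]
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

# probability integral: the integral of e^{-x^2} over [0, inf) equals sqrt(pi)/2;
# the tail beyond 10 is negligible
prob = midpoint(lambda x: math.exp(-x * x), 0.0, 10.0)
assert abs(prob - math.sqrt(math.pi) / 2) < 1e-6

# formula (6.1) with the sample values a = 2, t = 3
a, t = 2.0, 3.0
lhs = midpoint(lambda x: math.exp(-a * x * x) * math.cos(t * x), 0.0, 10.0)
rhs = 0.5 * math.sqrt(math.pi / a) * math.exp(-t * t / (4 * a))
assert abs(lhs - rhs) < 1e-6
```

Both assertions pass comfortably: the integrands decay like a Gaussian, so the truncation error is far below the tolerance.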
¹If you can't wait, the basic reason follows from a theorem of Joseph Liouville (1809–1882), who in 1835 proved that if f(x) and g(x) are rational functions, where g(x) is not constant, then f(x)e^{g(x)} can be integrated in finite terms if and only if there exists a rational function R(x) such that f(x) = R′(x) + R(x)g′(x). Using Liouville's theorem, it's a good exercise to try and show that e^{−x²} cannot be integrated in finite terms. If you get stuck, see [196, p. 300].
6.1.2. The probability integral and the Stirling and Wallis formulas. Recall from Section 2.5 that Wallis' formula, named after John Wallis (1616–1703) who proved it in 1656, is given by

π/2 = ∏_{n=1}^∞ (2n/(2n−1)) · (2n/(2n+1)) = (2/1)·(2/3)·(4/3)·(4/5)·(6/5)·(6/7)·(8/7)·(8/9)·(10/9)·(10/11) · · · ,

which can also be written in the equivalent form

(6.2)    √π = lim_{n→∞} (1/√n) ∏_{k=1}^n (2k/(2k−1)).

Stirling's formula, although it really should be called De Moivre's formula, named after James Stirling (1692–1770) who published it in 1730, is the asymptotic formula

n! ∼ √(2πn) (n/e)ⁿ = √(2πn) nⁿ e^{−n},

where "∼" means that the ratio of n! and √(2πn) nⁿ e^{−n} approaches unity as n → ∞. Using the DCT several times we shall prove the following interesting theorem.

Equivalence of Probability Integral, Stirling and Wallis

Theorem 6.1. The probability integral formula, Stirling's formula, and Wallis' formula are equivalent in the sense that each one implies the others. In particular, Stirling's and Wallis' formulas hold because we proved the probability integral formula.

We shall break up this proof into two lemmas: in the first lemma we show the equivalence of the probability integral and Wallis' formula, then of the probability integral and Stirling's formula.

Lemma 6.2. The probability integral holds if and only if Wallis' formula holds.

Proof: The idea is to start with the integral

(6.3)    ∫_0^∞ (1 + x²)^{−n} dx = (π/2) · ((2n−3)(2n−5) · · · 3 · 1)/((2n−2)(2n−4) · · · 4 · 2) = (π/2) ∏_{k=1}^{n−1} (2k−1)/(2k).
There are many ways to prove this formula; perhaps one of the quickest is to use the Differentiation theorem, as you'll do in Problem 5 (see Problem 2 in Exercises 2.5 for another proof).

We now take n → ∞ in (6.3). However, before doing this, let's replace x by x/√n in the integral to get

∫_0^∞ (1 + x²)^{−n} dx = (1/√n) ∫_0^∞ (1 + x²/n)^{−n} dx,

therefore,

∫_0^∞ (1 + x²/n)^{−n} dx = (π/2) √n ∏_{k=1}^{n−1} (2k−1)/(2k).

To finish the proof of this lemma, we shall apply the DCT to

∫_0^∞ f_n(x) dx,  where f_n(x) = (1 + x²/n)^{−n}.
By the well-known limit lim_{n→∞} (1 + t/n)^{−n} = e^{−t} for any real number t, we see that

lim_{n→∞} f_n(x) = lim_{n→∞} (1 + x²/n)^{−n} = e^{−x²}.

Observe that if we multiply out (1 + x²/n)ⁿ (e.g. using the binomial theorem), we obtain

(6.4)    (1 + x²/n)ⁿ = 1 + x² + junk ≥ 1 + x²,

where "junk" is some unimportant nonnegative expression. Hence,

0 ≤ f_n(x) = (1 + x²/n)^{−n} ≤ 1/(1 + x²) =: g(x).

Since g(x) is integrable over [0, ∞), the DCT applies, and we see that

(6.5)    lim_{n→∞} ∫_0^∞ (1 + x²/n)^{−n} dx = ∫_0^∞ e^{−x²} dx.

Thus,

lim_{n→∞} (π/2) √n ∏_{k=1}^{n−1} (2k−1)/(2k) = ∫_0^∞ e^{−x²} dx.
From this formula and (6.2) it follows that the probability integral and Wallis’ formula are equivalent.
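As an aside, a quick numerical look at the last displayed limit (an illustration only, not part of the proof; the truncation n = 10⁵ is arbitrary) confirms that the left-hand side approaches ∫_0^∞ e^{−x²} dx = √π/2 ≈ 0.8862:

```python
import math

def wallis_side(n):
    # (pi/2) * sqrt(n) * product over k = 1..n-1 of (2k-1)/(2k)
    prod = 1.0
    for k in range(1, n):
        prod *= (2 * k - 1) / (2 * k)
    return (math.pi / 2) * math.sqrt(n) * prod

# should be close to sqrt(pi)/2; the error at n = 10^5 is of order 1/n
assert abs(wallis_side(100_000) - math.sqrt(math.pi) / 2) < 1e-3
```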
Here we see the beauty of the DCT: the limit equality (6.5) followed almost without any effort from the DCT. If we didn't know the DCT, but just had knowledge about the Riemann integral, the proof of (6.5) would take a lot longer, but it can be done. See Problem 1 for a very elementary (but long) proof of (6.5). If you look at that problem you'll see that Lebesgue's integral really does "simplify life." The DCT will also come into play in the proof of Lemma 6.4 below. But first, let's take a very short detour to study the Gamma function. This function was introduced in 1729 by Leonhard Euler (1707–1783) and is defined, for each x > 0, by

Γ(x) := ∫_0^∞ t^{x−1} e^{−t} dt.
The Gamma function generalizes the factorial function as we now show.

Theorem 6.3. The Gamma function has the following properties:
(1) Γ(1) = 1,
(2) Γ(x + 1) = x Γ(x) for any x > 0,
(3) Γ(n + 1) = n! for any n = 0, 1, 2, . . .,
(4) Γ(1/2) = 2 ∫_0^∞ e^{−x²} dx = √π.

Proof: We have

Γ(1) = ∫_0^∞ e^{−t} dt = [−e^{−t}]_{t=0}^{t=∞} = 1,

which proves (1). To prove (2), we integrate by parts:

Γ(x + 1) = ∫_0^∞ t^x e^{−t} dt = [−t^x e^{−t}]_{t=0}^{t=∞} + x ∫_0^∞ t^{x−1} e^{−t} dt.

Since x > 0, the first term on the right vanishes, and since the integral on the far right is just Γ(x), Property (2) is proved. If n = 0, then Γ(0 + 1) = Γ(1) = 1 by (1) and 0! is (by definition) 1, so (3) holds for n = 0. If n is a positive integer, then using (2) repeatedly, we obtain

Γ(n + 1) = n Γ(n) = n (n − 1) Γ(n − 1) = · · · = n (n − 1) · · · 2 · 1 · Γ(1) = n!.

Finally, making the change of variables t = x², we see that

Γ(1/2) = ∫_0^∞ t^{−1/2} e^{−t} dt = ∫_0^∞ x^{−1} e^{−x²} 2x dx = 2 ∫_0^∞ e^{−x²} dx = √π.

Property (3) allows us to use the Gamma function to define the factorial for any x > −1 via x! := Γ(x + 1); when x = 0, 1, 2, . . ., this reduces to the usual definition by (3) above. In particular, by Properties (2) and (4),

(1/2)! = Γ(1/2 + 1) = (1/2) Γ(1/2) = √π/2.
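Incidentally, Python's standard library implements Γ as `math.gamma`, so Properties (3) and (4) and the half-integer factorial can be checked directly (a sanity check, of course, not a proof):

```python
import math

# Property (3): Gamma(n + 1) = n!
for n in range(6):
    assert abs(math.gamma(n + 1) - math.factorial(n)) < 1e-9

# Property (4): Gamma(1/2) = sqrt(pi), and (1/2)! = Gamma(3/2) = sqrt(pi)/2
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12
assert abs(math.gamma(1.5) - math.sqrt(math.pi) / 2) < 1e-12
```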
Using Property (2) one can show that the factorial of any half-integer is a rational multiple of √π. (This begs the question: What does π (dealing with circles) have to do with factorials of half-integers? Beats me, but it does!)

Lemma 6.4. The probability integral holds if and only if Stirling's formula holds.

Proof: Here, we follow Patin's argument [219]. Making the change of variables t = u², we have

n! = Γ(n + 1) = ∫_0^∞ tⁿ e^{−t} dt = 2 ∫_0^∞ u^{2n+1} e^{−u²} du.

Thus,

n! eⁿ / n^{n+1/2} = 2 (eⁿ / n^{n+1/2}) ∫_0^∞ u^{2n+1} e^{−u²} du = 2 ∫_0^∞ (u/√n)^{2n+1} e^{n−u²} du.

If we make the substitution u = √n + x, we get

(u/√n)^{2n+1} = (1 + x/√n)^{2n+1}  and  e^{n−u²} = e^{n−(√n+x)²} = e^{−2√n x − x²}.

Hence,

n! eⁿ / n^{n+1/2} = 2 ∫_{−√n}^∞ (1 + x/√n)^{2n+1} e^{−2x√n − x²} dx = 2 ∫_R f_n(x) dx,

where f_n(x) = (1 + x/√n)^{2n+1} e^{−2x√n − x²} for −√n ≤ x < ∞, and f_n(x) = 0 otherwise.

We shall apply the DCT to 2 ∫_R f_n(x) dx. We first find lim_{n→∞} f_n(x). To this end, we write

(1 + x/√n)^{2n+1} e^{−2x√n − x²} = (1 + x/√n) · [(1 + x/√n)ⁿ e^{−x√n}]² e^{−x²}.

Observe that

(1 + x/√n)ⁿ e^{−x√n} = e^{n log(1 + x/√n) − x√n},

and

lim_{n→∞} [n log(1 + x/√n) − x√n] = lim_{n→∞} [log(1 + x/√n) − x/√n] / (1/n),

which is of the form 0/0. Now a simple calculus exercise using L'Hospital's Rule shows that

lim_{n→∞} [log(1 + x/√n) − x/√n] / (1/n) = −x²/2.

Therefore,

lim_{n→∞} (1 + x/√n)ⁿ e^{−x√n} = e^{−x²/2}.

From this it follows that

lim_{n→∞} f_n(x) = lim_{n→∞} (1 + x/√n) · [(1 + x/√n)ⁿ e^{−x√n}]² e^{−x²} = 1 · e^{−x²} · e^{−x²} = e^{−2x²}.

We now check that |f_n(x)| = f_n(x) is bounded by an integrable function. To this end, observe that since 1 + t ≤ e^t for all t ∈ R (can you prove this?), it follows that

0 ≤ f_n(x) ≤ (e^{x/√n})^{2n+1} e^{−2x√n − x²} = e^{2√n x + x/√n − 2x√n − x²} = e^{x/√n − x²} ≤ e^{|x| − x²},

since x/√n ≤ |x| for all x ∈ R and n ∈ N. We leave you to check that g(x) := e^{|x|−x²} is integrable over R. To conclude, we have shown that the hypotheses of the DCT are satisfied and hence

lim_{n→∞} n! eⁿ / n^{n+1/2} = 2 lim_{n→∞} ∫_R f_n(x) dx = 2 ∫_R e^{−2x²} dx = √2 ∫_R e^{−x²} dx,

where to get the last equality we replaced x with x/√2. From this formula, it follows that the probability integral holds if and only if Stirling's formula holds.
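The limit just proved, lim n! eⁿ / n^{n+1/2} = √(2π), is also easy to watch numerically (an illustration only; n = 10⁴ is an arbitrary choice, and logarithms are used to avoid overflowing n!):

```python
import math

def stirling_ratio(n):
    # n! e^n / n^(n + 1/2), computed via lgamma to avoid overflowing n!
    return math.exp(math.lgamma(n + 1) + n - (n + 0.5) * math.log(n))

# the ratio tends to sqrt(2*pi) ~ 2.5066; the deviation at n = 10^4 is about 1/(12n)
assert abs(stirling_ratio(10_000) - math.sqrt(2 * math.pi)) < 1e-4
```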
6.1.3. The Fundamental Theorem of Algebra. The Fundamental Theorem of Algebra (FTA) answers the following question: Does every nonconstant polynomial have a root? Explicitly, let p(z) be a polynomial of the complex variable z of positive degree n:

p(z) = a_n zⁿ + a_{n−1} z^{n−1} + · · · + a₁ z + a₀,

where n ≥ 1, a_n ≠ 0, and the a_k's are complex coefficients. Then the question is whether there exists a z₀ ∈ C such that p(z₀) = 0. The FTA says the answer is yes:

The Fundamental Theorem of Algebra

Theorem 6.5. Any nonconstant polynomial with complex coefficients has a root in the complex plane.
We shall present several proofs of the FTA using integration [185, 186]. To do so, we need the following differentiation fact. Let f(z) be a rational function of z ∈ C. Writing z in polar coordinate form, z = r e^{iθ}, consider the function f(re^{iθ}), a function of r and θ. In Problem 13 you will prove that

(6.6)    ∂/∂θ f(re^{iθ}) = ir ∂/∂r f(re^{iθ}).

Now let p(z) = zⁿ + a_{n−1} z^{n−1} + · · · + a₀, a polynomial with complex coefficients and n ≥ 1, and assume, by way of contradiction, that p(z) has no roots. (By dividing p(z) by a_n if necessary we may assume that a_n = 1.) We shall derive a contradiction in three ways, each using some variant of the following "Three Step Recipe":

Proving the FTA in three easy steps.
Step 1: Define a function F(r) using the polynomial p(z).
Step 2: Show that F′(r) = 0, so F(r) = C where C is a constant.
Step 3: Show that C = 0 and C ≠ 0. Contradiction.

Thus, our original assumption must have been in error and the polynomial has a root.

Proof 1: Consider the rational function

f(z) = zⁿ/p(z) = zⁿ/(zⁿ + a_{n−1} z^{n−1} + · · · + a₀).

This function is well-defined for all complex z because p(z) is by assumption never zero. Consider the function

F(r) = ∫_0^{2π} f(re^{iθ}) dθ.

Using the Differentiation Theorem (whose assumptions we leave you to check), it follows that F(r) is a continuously differentiable function of r ∈ R, and by (6.6) we see that for r ≠ 0,

F′(r) = ∫_0^{2π} ∂/∂r f(re^{iθ}) dθ = (1/(ir)) ∫_0^{2π} ∂/∂θ f(re^{iθ}) dθ = (1/(ir)) [f(re^{iθ})]_{θ=0}^{θ=2π} = (1/(ir)) [f(re^{i2π}) − f(re^{i0})] = (1/(ir)) [f(r) − f(r)] = 0,

where we used that e^{2πi} = e^{i0} = 1. Hence, F(r) = C = a constant. Setting r = 0 and using that f(0) = 0, we obtain

F(0) = ∫_0^{2π} f(0) dθ = 0,

so C = 0. On the other hand, by definition of f(z), we have

f(z) = zⁿ/(zⁿ + a_{n−1} z^{n−1} + · · · + a₀) = 1/(1 + a_{n−1}/z + · · · + a₀/zⁿ),

so f(re^{iθ}) → 1 as r → ∞. From this and the Continuous Dominated Convergence Theorem (whose assumptions we leave you to check), we see that

F(r) = ∫_0^{2π} f(re^{iθ}) dθ → ∫_0^{2π} 1 dθ = 2π as r → ∞.
Therefore, C = 2π, implying that 0 = 2π, an absurdity. Thus, our original assumption that p has no roots must have been false.

The next two proofs² are appropriate for those who want to avoid complex numbers as much as possible. To do so, we need the following "real" counterpart to Equation (6.6). Given a rational function f(z), write f(re^{iθ}) in terms of its real and imaginary parts:

(6.7)    f(re^{iθ}) = u(r, θ) + i v(r, θ),

where u and v are real-valued functions. Then, see Problem 13, (6.6) is equivalent to

(6.8)    ∂v/∂θ = r ∂u/∂r  and  ∂u/∂θ = −r ∂v/∂r.

The details of the following Proofs 2 & 3 are left to Problem 13.

Proof 2: As before, assume that p(z) is never zero and consider the rational function f(z) = 1/p(z). In polar coordinates, as in (6.7) we can write

1/p(re^{iθ}) = u(r, θ) + i v(r, θ),

where u and v are real-valued functions. Here are the three steps to get a contradiction:
(i) Define G(r) = ∫_0^{2π} u(r, θ) dθ. Using (6.8), show that G′(r) = 0. This shows that G(r) = C = a constant.
(ii) Take r = 0 in G(r) to show that C ≠ 0.
(iii) Now take the limit of G(r) as r → ∞ to show that C = 0. Contradiction.

For our last proof we assume that p(z) has real coefficients; this loses no generality because we can replace p(z) by another polynomial P(z) with real coefficients that has a root if and only if p(z) has a root (see Part (c) of Problem 13).

Proof 3: As usual, assume that p(z) is never zero and consider the rational function f(z) = 1/[p(z)]² and write, as in (6.7),

1/[p(re^{iθ})]² = u(r, θ) + i v(r, θ),

where u and v are real-valued functions. We shall deviate slightly from our "Three Step Recipe for the FTA" and give another Three Step Recipe!
(i) Let H(θ) = ∫_0^∞ u(r, θ) dr. Using (6.8), show that H′(0) = 0 and H′′(θ) = −H(θ). Solving this ordinary differential equation, show that H(θ) = a cos θ for some constant a.
(ii) Take θ = 0 to show that a > 0.
(iii) Finally, take θ = π to show that a < 0. Contradiction.

We've saved for last my favorite application of the DCT and its affiliates:

²These proofs were presented at a Mathematical Association of America meeting [186].
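The polar-coordinate identity (6.6) that drives all three proofs is easy to test numerically with central finite differences. In this sketch the polynomial p(z) = z³ + 2z + 5 is an arbitrary example; all of its roots have modulus greater than the radius r = 0.5 used below, so f = z³/p is smooth on that circle:

```python
import cmath

def f(z):
    # sample rational function z^3 / p(z) with p(z) = z^3 + 2z + 5
    # (every root of this p has modulus > 0.5)
    return z**3 / (z**3 + 2*z + 5)

def at(r, th):
    return f(r * cmath.exp(1j * th))

r, th, h = 0.5, 0.7, 1e-6
# central finite differences for the theta- and r-derivatives of f(r e^{i theta})
d_theta = (at(r, th + h) - at(r, th - h)) / (2 * h)
d_r = (at(r + h, th) - at(r - h, th)) / (2 * h)
# identity (6.6): the theta-derivative equals i*r times the r-derivative
assert abs(d_theta - 1j * r * d_r) < 1e-6
```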
6.1.4. Tannery's theorem and Euler's sum. Suppose you were asked to evaluate the interesting limit

lim_{n→∞} [(n/n)ⁿ + ((n−1)/n)ⁿ + ((n−2)/n)ⁿ + · · · + (1/n)ⁿ].

To do so, we write this limit in the simpler form

lim_{n→∞} [(n/n)ⁿ + ((n−1)/n)ⁿ + · · · + (1/n)ⁿ] = lim_{n→∞} Σ_{k=0}^n a_{nk},

where

a_{nk} := ((n−k)/n)ⁿ = (1 − k/n)ⁿ.

We are now tempted to exchange the limit with the summation:

(6.9)    lim_{n→∞} Σ_{k=0}^n a_{nk} = Σ_{k=0}^∞ lim_{n→∞} a_{nk}.

Since for each k ∈ N,

lim_{n→∞} a_{nk} = lim_{n→∞} (1 − k/n)ⁿ = e^{−k},

if (6.9) were in fact valid, we could conclude that

lim_{n→∞} [(n/n)ⁿ + ((n−1)/n)ⁿ + · · · + (1/n)ⁿ] = lim_{n→∞} Σ_{k=0}^n a_{nk} = Σ_{k=0}^∞ lim_{n→∞} a_{nk} = Σ_{k=0}^∞ e^{−k} = 1/(1 − 1/e) = e/(e − 1),

where we used the geometric series Σ_{k=0}^∞ r^k = 1/(1 − r), valid for |r| < 1, with r = e^{−1}. Jules Tannery's (1848–1910) Theorem [280, p. 292] will imply that the interchange of limit and summation in (6.9) is indeed valid.

Tannery's series theorem

Theorem 6.6. For each natural number n, let Σ_{k=1}^{m_n} a_{nk} be a finite sum of real numbers where m_n → ∞ as n → ∞. If for each k, lim_{n→∞} a_{nk} exists and there is a convergent series Σ_{k=1}^∞ M_k such that |a_{nk}| ≤ M_k for all k, n, then

lim_{n→∞} Σ_{k=1}^{m_n} a_{nk} = Σ_{k=1}^∞ lim_{n→∞} a_{nk}.

Proof: Consider the measure space (N, P(N), #) where # is the counting measure. Since all subsets of N are measurable, it follows that all functions f : N → R are measurable. Moreover, one can check that (see e.g. Problem 8 in Exercises 5.5)

∫ f d# = Σ_{n=1}^∞ f(n)

for any integrable function f : N → R. For each n, let f_n : N → R be the function

f_n(k) = a_{nk} if 1 ≤ k ≤ m_n, and f_n(k) = 0 otherwise.
Also, let g : N → R be the function defined by g(k) = M_k for each k ∈ N; then because the series Σ_{k=1}^∞ M_k converges, it follows that g : N → R is integrable. Now by assumption, the limit lim_{n→∞} f_n(k) exists for each k ∈ N, and |f_n(k)| ≤ g(k) for all n, k ∈ N. Therefore, by the DCT we have

lim_{n→∞} ∫ f_n d# = ∫ lim_{n→∞} f_n d#.

The left side is just lim_{n→∞} Σ_{k=1}^{m_n} a_{nk} while the right side is Σ_{k=1}^∞ lim_{n→∞} a_{nk}. This completes the proof.
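As a quick sanity check of Tannery's theorem on the motivating example (an illustration only; the truncation n = 10⁴ is an arbitrary choice), the finite sums Σ_{k=0}^n (1 − k/n)ⁿ should approach e/(e − 1) ≈ 1.5820:

```python
import math

def tannery_example(n):
    # finite sum over k = 0..n of (1 - k/n)^n from (6.9); the k = n term is 0
    return sum((1 - k / n) ** n for k in range(n + 1))

# Tannery's theorem predicts convergence to e/(e-1)
assert abs(tannery_example(10_000) - math.e / (math.e - 1)) < 1e-3
```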
We remark that Tannery's theorem can be proved without the DCT; the point here is to show that Tannery's theorem is just a special case of the DCT when the measure space is the natural numbers. Tannery's theorem has many applications (see Problem 9 for many examples), such as to our original question involving (6.9): We leave it as an exercise for you to check that

|a_{nk}| = (1 − k/n)ⁿ ≤ e^{−k}.

Since Σ_{k=0}^∞ e^{−k} converges, the hypotheses of Tannery's theorem are fulfilled, so (6.9) is indeed valid.

Another application deals with the Basel problem. The Basel problem is one of my all-time favorite math problems. We begin our story with the Italian Pietro Mengoli (1625–1686) who in his 1650 book Novae quadraturae arithmeticae, seu de additione fractionum discussed the sum of the reciprocals of the squares, but admitted defeat in finding the sum:

Σ_{n=1}^∞ 1/n² = 1 + 1/2² + 1/3² + 1/4² + · · · = ?
Here's the original Latin:³

[Image: scan of the original Latin passage from Mengoli's book.]

and here's an English translation:

Having concluded with satisfaction my consideration of those arrangements of fractions, I shall move on to those other arrangements that have the unit as numerator, and square numbers as denominators. The work devoted to this consideration has borne some fruit (the question itself still awaiting solution), but it [the work] requires the support of a richer mind, in order to lead to the evaluation of the precise sum of the arrangement [of fractions] that I have set myself as a task.

³It took over four years to find this quite elusive original source, which is why I display it so proudly. I thank Emanuele Delucchi for his assistance in tracking down Mengoli's book in the library at ETH Zurich and for translating the Latin.
Later, we see a plea by the famous Jacob (Jacques) Bernoulli (1654–1705) on page 254 of his 1689 book Tractatus de Seriebus Infinitis.⁴ In this book Bernoulli evaluates the sums of many series but he was unable to evaluate the sum of the reciprocals of the squares, and here's what he says:

And thus with that proposition we can evaluate the sums of the series whose denominators are differences of triangular numbers, or differences of squares. With XV we can do this also when the denominators are pure triangular numbers (as in 1/1 + 1/3 + 1/6 + 1/10 + 1/15 + · · · ), but it is worth pointing out that if the denominators are pure squares (as in 1/1 + 1/4 + 1/9 + 1/16 + 1/25 + · · · ) this computation is more difficult than one would expect, even if we can easily see that it converges because it is manifestly smaller than the previous sum. If someone will find out and communicate to us what has escaped our considerations, he will have our deep gratitude.
The problem to find the sum of the reciprocals of the squares became known as the Basel problem after Bernoulli's town Basel, Switzerland. About 46 years after Bernoulli's plea, Leonhard Euler (1707–1783) solved the Basel problem. Euler's first published attempt on the Basel problem is De summatione innumerabilium progressionum (The summation of an innumerable progression) [86], presented to the St. Petersburg Academy on March 5, 1731, where he estimates the sum of the reciprocals of the squares:

Σ_{n=1}^∞ 1/n² = 1 + 1/2² + 1/3² + 1/4² + · · · ≈ 1.644934,

to six decimal places. This equals π²/6 = 1.644934066848 . . . to six decimal places, which Euler no doubt realized (Euler was a phenomenal human calculator), so Euler now knew what the sum should equal.⁵ He found this approximation by ingeniously rewriting the sum Σ_{n=1}^∞ 1/n² as

Σ_{n=1}^∞ 1/n² = (log 2)² + Σ_{n=1}^∞ 1/(n² 2^{n−1}).
The advantage is that Euler knew log 2 to many decimal places and the sum on the right converges much faster than the original sum. For example, taking only 17 terms of the new series gives the approximation 1.64493402 . . . for the right-hand side, while (using the integral test remainder estimate) one needs around 2 million terms of the original series to get six places of accuracy to π 2 /6! Now π has to do with circles and circles with trig functions. Four years later, Euler wrote a new paper, De summis serierum reciprocarum (On the sums of series of reciprocals) [87], which was read in the St. Petersburg Academy on December 5, 1735, where he gave three proofs of the solution of the Basel problem using trigonometry. Here’s the beginning of Euler’s paper [21]: So much work has been done on the series of the reciprocals of powers of the natural numbers, that it seems hardly likely to be able to discover anything new about them. For nearly all who have thought about the sums of series have also inquired into the sums of this kind of series, and yet have not been able by any means to express them 4
Bernoulli’s book is available at http://www.kubkou.se/pdf/mh/jacobB.pdf.
⁵Actually, we found that Stirling computed the value of the series accurate to 17 decimal places in Example 1 after Proposition 11 in his 1730 book Methodus Differentialis [289].
in a convenient form. I too, in spite of repeated efforts, in which I attempted various methods for summing these series, could achieve nothing more than approximate values for their sums or reduce them to the quadrature of highly transcendental curves; the former of these is described in the next article, and the latter fact I have set out in preceding ones. I speak here about the series of fractions whose numerators are 1, and indeed whose denominators are the squares, or the cubes, or other ranks, of the natural numbers; of this type are 1 + 1/4 + 1/9 + 1/16 + 1/25 + etc., likewise 1 + 1/8 + 1/27 + 1/64 + etc. and similarly for higher powers, whose general terms are contained in the form 1/xⁿ. I have recently found, quite unexpectedly, an elegant expression for the sum of this series 1 + 1/4 + 1/9 + 1/16 + etc., which depends on the quadrature of the circle, so that if the true sum of this series is obtained, from it at once the quadrature of the circle follows. Namely, I have found that six times the sum of this series is equal to the square of the perimeter of a circle whose diameter is 1; or by putting the sum of this series = s, then √(6s) will hold to 1 the ratio of the perimeter to the diameter.
In other words, Euler proved that

Σ_{n=1}^∞ 1/n² = 1 + 1/2² + 1/3² + 1/4² + · · · = π²/6.
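Before turning to the proof, both the value π²/6 and the rapid convergence of Euler's accelerated series from the 1731 paper can be checked numerically (a sketch only; the 17-term truncation follows the text, and the 10⁶-term truncation of the raw series is arbitrary):

```python
import math

target = math.pi ** 2 / 6

# Euler's accelerated series: (log 2)^2 + sum over n >= 1 of 1/(n^2 2^{n-1}),
# truncated at 17 terms as in the text
accel = math.log(2) ** 2 + sum(1 / (n * n * 2 ** (n - 1)) for n in range(1, 18))
assert abs(accel - target) < 1e-6

# the raw series converges far more slowly: even 10^6 terms leave an error ~ 1e-6
raw = sum(1 / (n * n) for n in range(1, 1_000_001))
assert abs(raw - target) < 1e-5
```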
To explain Euler's third proof of this formula, recall as a consequence of the Fundamental Theorem of Algebra (see Part (a) of Problem 13), we can always factor an n-th degree polynomial p(x) into n linear factors: p(x) = a(r₁ − x)(r₂ − x) · · · (r_n − x) where a is a constant and where r₁, . . . , r_n are the roots of p. Assuming the roots are not zero and that p(0) = 1, we can rewrite the factorization as

p(x) = (1 − x/r₁)(1 − x/r₂) · · · (1 − x/r_n).

Now the function p(x) = sin x / x has Taylor series sin x / x = 1 − x²/3! + x⁴/5! − · · ·,⁶ so in some sense p(x) can be thought of as an infinite degree polynomial. It has the roots π, −π, 2π, −2π, 3π, −3π, . . . , and it satisfies p(0) = 1. Hence, by analogy we should be able to write

sin x / x = (1 − x/π)(1 + x/π)(1 − x/2π)(1 + x/2π)(1 − x/3π)(1 + x/3π) · · ·
         = (1 − x²/π²)(1 − x²/2²π²)(1 − x²/3²π²) · · · ,

that is,

(6.10)    sin x / x = (1 − x²/π²)(1 − x²/2²π²)(1 − x²/3²π²) · · · .
Now if you multiply all the terms on the right together, you will get

1 − (x²/π²)(1/1² + 1/2² + 1/3² + · · ·) + · · ·

⁶Recall that sin x = x − x³/3! + x⁵/5! − · · · .
where "· · ·" involves x^4, x^6, and so forth. Since we already know that sin x/x = 1 − x^2/3! + x^4/5! − · · · also, comparing coefficients of x^2 we see that

−(1/π^2)(1/1^2 + 1/2^2 + 1/3^2 + · · ·) = −1/3!.

Simplification shows that ∑_{n=1}^∞ 1/n^2 = π^2/6, just as Euler said. Now this proof is pretty but not rigorous (even to Euler); the main problem is Euler's sine expansion (6.10), which was derived by applying a fact for polynomials to the non-polynomial function sin x/x. So, in 1743 he gave a perfectly rigorous argument to close the Basel problem [88]. See Problem 14 in Exercises 5.5 for this solution. In Problems 11 and 12 you'll give a perfectly sound derivation of Euler's sine product, basically taken from Euler's famous book Introductio in analysin infinitorum, volume 1 (Introduction to analysis of the infinite) [91, p. 124–128]. The proof uses Tannery's theorem.
Here's an interesting proof of Euler's formula [197, 184] that uses the continuity and differentiation theorems. We only outline the argument, leaving the details to Problem 7. Consider the function

(6.11)    F(t) = ∫_0^∞ tan^{−1}(tx)/(1 + x^2) dx.

(i) Show that F is a continuously differentiable function of t ∈ R. Moreover,

F′(t) = log t/(t^2 − 1)

when t ≠ 1, and when t = 1 we have F′(1) = 1/2.
(ii) Show that F(0) = 0 and F(1) = π^2/8; hence, by the Fundamental Theorem of Calculus,

π^2/8 = F(1) − F(0) = ∫_0^1 log t/(t^2 − 1) dt.

(iii) Finally, expanding 1/(1 − t^2) = ∑_{k=0}^∞ t^{2k} into a geometric series, prove that

∫_0^1 log t/(t^2 − 1) dt = −∑_{k=0}^∞ ∫_0^1 t^{2k} log t dt = ∑_{k=0}^∞ 1/(2k + 1)^2.

(iv) Now breaking up the sum E := ∑_{n=1}^∞ 1/n^2 into sums when n is even (when n = 2k) and odd (when n = 2k + 1), that is, writing

E = ∑_{k=1}^∞ 1/(2k)^2 + ∑_{k=0}^∞ 1/(2k + 1)^2 = (1/4) ∑_{k=1}^∞ 1/k^2 + ∑_{k=0}^∞ 1/(2k + 1)^2 = E/4 + ∑_{k=0}^∞ 1/(2k + 1)^2,

use π^2/8 = ∑_{k=0}^∞ 1/(2k + 1)^2 to prove that E = π^2/6.
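The bookkeeping in steps (iii) and (iv) is easy to sanity-check numerically. The following sketch (an editorial illustration, not part of Euler's argument) truncates the sums at a large N:

```python
import math

N = 200_000
odd_sum = sum(1.0 / (2 * k + 1) ** 2 for k in range(N))  # sum over odd denominators
E = sum(1.0 / n ** 2 for n in range(1, N + 1))           # E = sum 1/n^2, truncated

# Step (iv): splitting over even and odd n gives E = E/4 + odd_sum (up to truncation).
print(abs(E - (E / 4 + odd_sum)))
# Steps (ii)-(iii): odd_sum approaches pi^2/8, hence E approaches pi^2/6.
print(abs(odd_sum - math.pi ** 2 / 8), abs(E - math.pi ** 2 / 6))
```

All three printed discrepancies shrink like the tails of the truncated series.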
◮ Exercises 6.1.
1. In this problem we give a very elementary proof of (6.5), where "elementary" means that it uses material from a first-year college calculus course.
(i) Prove that for any x ∈ [0, ∞), we have x − x^2 ≤ log(1 + x) ≤ x.
(ii) Given T > 0, prove that

x − T^2/n ≤ n log(1 + x/n) ≤ x,

where the first ≤ holds for 0 ≤ x ≤ T and the second one holds for all x ≥ 0. In particular, replacing x by x^2, taking exponentials of these inequalities, then rearranging, show that

e^{−x^2} ≤ (1 + x^2/n)^{−n} ≤ e^{T^2/n} e^{−x^2},

where the first ≤ holds for all x ≥ 0 and the second ≤ holds for 0 ≤ x ≤ √T.
(iii) Given T > 0, write

∫_0^∞ (1 + x^2/n)^{−n} dx = ∫_0^{√T} (1 + x^2/n)^{−n} dx + ∫_{√T}^∞ (1 + x^2/n)^{−n} dx.

Using that (1 + x^2/n)^n ≥ 1 + x^2 ≥ x^2 from (6.4), show that

∫_{√T}^∞ (1 + x^2/n)^{−n} dx ≤ 1/√T.

(iv) Given T > 0, using (ii) and (iii), show that

∫_0^∞ e^{−x^2} dx ≤ ∫_0^∞ (1 + x^2/n)^{−n} dx ≤ e^{T^2/n} ∫_0^∞ e^{−x^2} dx + 1/√T.

Setting T = n^{1/4} and then taking n → ∞, prove (6.5).

2. Many formulas can be written more succinctly using the Gamma function. For instance, let S_α = ∫_0^{π/2} sin^α x dx. Prove that

S_{α+1} = (α/(α + 1)) S_{α−1},    S_0 = π/2,    S_1 = 1.

Next, prove that for any n ∈ N, we have

S_n = (Γ((n + 1)/2)/Γ(n/2 + 1)) (√π/2);  that is,  ∫_0^{π/2} sin^n x dx = (Γ((n + 1)/2)/Γ(n/2 + 1)) (√π/2).

The same formula holds for ∫_0^{π/2} cos^n x dx (just make the change of variables x ↦ π/2 − x). Without the Gamma function, the formula for S_n must be broken up into even and odd n; see Problem 2 of Exercises 2.5.

3. (Gamma function version of Stieltjes' method) In this problem we give a Gamma function version of Stieltjes' method explained in Problem 6 of Exercises 2.5.
(i) Prove the following identity: For all x > 0, we have

Γ(x + 1/2)^2 < Γ(x) Γ(x + 1).

Suggestion: Fix x > 0 and consider the polynomial

p(t) = at^2 + bt + c,  where a = Γ(x), b = 2Γ(x + 1/2), c = Γ(x + 1),

and show that p(t) > 0 for all t ∈ R. Subhint: Notice that

p(t) = ∫_0^∞ r^{x−1} (t + r^{1/2})^2 e^{−r} dr.

(ii) Using the inequality in (i), show that for all x > 0, we have

Γ(x + 1)^2 < (x + 1/2) Γ(x + 1/2)^2 < (x + 1/2) Γ(x) Γ(x + 1).

(iii) Let x = n ∈ N in the inequalities in (ii) to obtain

1 < ((2n + 1)/2) ((2n − 1)/(2n))^2 · · · (3/4)^2 (1/2)^2 Γ(1/2)^2 < (2n + 1)/(2n).

Don't forget that Γ(n + 1/2) = ((2n − 1)/2)((2n − 3)/2) · · · (1/2) Γ(1/2).
(iv) Finally, use Wallis' formula on (iii) to determine the probability integral.

4. (The probability integral) Here are some more proofs of the probability integral.
(a) (Cf. [317]) Show that

F(t) = ∫_0^∞ e^{−t(1 + x^2)}/(1 + x^2) dx

is continuous for t ≥ 0 with F(0) = π/2 and lim_{t→∞} F(t) = 0, and continuously differentiable for t > 0, and use F(t) to derive the probability integral.
(b) (Cf. [311, p. 273]) Here's a neat one. Show that

F(t) = (∫_0^t e^{−x^2} dx)^2 + ∫_0^1 e^{−t^2(1 + x^2)}/(1 + x^2) dx

is continuously differentiable for t ∈ R with F′(t) ≡ 0. Use this to derive the probability integral.

5. In this problem we evaluate some integrals that occur in applications.
(a) Prove that for any t > 0, we have

∫_0^∞ (t^2 + x^2)^{−1} dx = (π/2)(1/t).

Show that differentiating under the integral sign is allowable and prove that for any n ∈ N,

∫_0^∞ (t^2 + x^2)^{−n} dx = (π/2) ((1 · 3 · · · (2n − 3))/(2 · 4 · · · (2n − 2))) (1/t^{2n−1}).

Conclude that

∫_0^∞ (1 + x^2)^{−n} dx = (π/2) ∏_{k=1}^{n−1} (2k − 1)/(2k);

knowing this formula is one of the main ingredients in the proof of Wallis' formula.
(b) Using the formula ∫_0^∞ e^{−ax^2} cos tx dx = (1/2)√(π/a) e^{−t^2/4a}, where a > 0 and t ∈ R (derived in (6.1)), show that

∫_0^∞ x e^{−ax^2} sin tx dx = (t/4a) √(π/a) e^{−t^2/4a},  a > 0, t ∈ R.

(c) Show that ∫_0^∞ e^{−tx} dx = 1/t. Using this fact, prove that

∫_0^∞ x^n e^{−tx} dx = n!/t^{n+1},  n = 0, 1, 2, . . . .

(d) Show that ∫_0^∞ e^{−tx^2} dx = (1/2)√(π/t) for t > 0. Using this fact, prove that

∫_0^∞ x^{2n} e^{−x^2} dx = ((1 · 3 · · · (2n − 1))/2^{n+1}) √π,  n = 0, 1, 2, . . . .

(e) Evaluate ∫_0^∞ x e^{−tx^2} dx. Using this fact, prove that

∫_0^∞ x^{2n+1} e^{−x^2} dx = n!/2,  n = 0, 1, 2, . . . .

(f) Evaluate ∫_0^1 x^t dx where t > −1. Using this fact, prove that

∫_0^1 x^t (log x)^n dx = ((−1)^n n!)/(t + 1)^{n+1},  n = 0, 1, 2, . . . .
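Formulas like the one in part (d) can be cross-checked against the Gamma function: substituting u = x^2 gives ∫_0^∞ x^{2n} e^{−x^2} dx = Γ(n + 1/2)/2. A quick numerical comparison (an editorial sketch, not part of the exercise):

```python
import math

def part_d_formula(n):
    """(1 * 3 * ... * (2n-1)) * sqrt(pi) / 2^(n+1), as in Problem 5(d)."""
    prod = 1.0
    for k in range(1, n + 1):
        prod *= 2 * k - 1
    return prod * math.sqrt(math.pi) / 2 ** (n + 1)

for n in range(8):
    # Substituting u = x^2, the integral equals Gamma(n + 1/2) / 2.
    assert abs(part_d_formula(n) - math.gamma(n + 0.5) / 2) < 1e-8
print("Problem 5(d) agrees with Gamma(n + 1/2)/2 for n = 0,...,7")
```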
6. Using a formula in Problem 5, deduce the following interesting formulas:

∫_0^1 x^{−x} dx = ∑_{n=1}^∞ 1/n^n  and  ∫_0^1 x^x dx = ∑_{n=1}^∞ (−1)^{n−1}/n^n.
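These are the "sophomore's dream" formulas; the first is easy to check with a crude midpoint rule (an editorial sketch, not part of the exercise), since x^{−x} = e^{−x log x} extends continuously to x = 0 with value 1:

```python
import math

M = 100_000  # midpoint rule with M cells on [0, 1]
integral = sum(math.exp(-x * math.log(x))
               for x in ((i + 0.5) / M for i in range(M))) / M
series = sum(1.0 / n ** n for n in range(1, 20))  # converges very fast

print(integral, series)  # both approximately 1.29129
```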
7. Using the continuity and differentiation theorems, we evaluate some integrals.
(a) Starting with (6.11), prove Euler's formula. Suggestion: To show that F′(t) = log t/(t^2 − 1), use partial fractions.
(b) ([150], [25]) In this problem we solve one of the 1985 Putnam problems. Let a > 0 and show that

F(t) = ∫_0^∞ x^{−1/2} e^{−ax − t x^{−1}} dx

is continuous for t ≥ 0 with F(0) = √(π/a) and continuously differentiable for t > 0. Compute F′(t) and use it to prove that F(t) = √(π/a) e^{−2√(at)}. Suggestion: After differentiating F(t), make the change of variables x = t/(au) in the resulting integral.
(c) Show that

F(t) = ∫_0^∞ e^{−x^2 − t^2/x^2} dx

is continuous for t ≥ 0 and continuously differentiable for t > 0. Compute F′(t) and use it to prove that F(t) = (√π/2) e^{−2t} for t ≥ 0. In fact, can you see a shorter way to do this problem using the Putnam problem above?
(d) Let a > 0 and show that

F(t) = ∫_0^∞ e^{−tx} (sin ax)/x dx

is continuously differentiable for t > 0. Compute F′(t) and use it to prove that F(t) = π/2 − tan^{−1}(t/a). Conclude that

∫_0^∞ e^{−x} (sin x)/x dx = π/4.

(e) Show that

F(t) = ∫_0^∞ (1 − e^{−tx^2})/x^2 dx

is continuous for t ≥ 0 and continuously differentiable for t > 0. Compute F′(t) and use it to prove that F(t) = √(πt).
(f) Show that

F(t) = ∫_0^∞ ((1 − cos tx)/x) e^{−x} dx

is a continuously differentiable function of t ∈ R. Compute F′(t) and use it to prove that F(t) = (1/2) log(1 + t^2).

8. The Laplace transform of a measurable function f on [0, ∞) is

L(f)(s) = ∫_0^∞ e^{−sx} f(x) dx,

defined for those s ∈ R such that e^{−sx} f(x) is integrable in x. For instance, according to Problem 5c, we have L(x^n)(s) = n!/s^{n+1}, which is valid for s > 0.
(a) If f is Lebesgue integrable on [0, ∞), prove that L(f)(s) exists for all s ≥ 0 and is a continuous function of s ∈ [0, ∞) such that lim_{s→∞} L(f)(s) = 0.
(b) Let f(x) = ∑_{n=0}^∞ a_n x^n be a power series such that ∑_{n=0}^∞ |a_n| n!/s^{n+1} converges for s > c for some c ≥ 0. Prove that the Laplace transform of f exists for all s > c and L(f)(s) = ∑_{n=0}^∞ a_n n!/s^{n+1} for all s > c.
(c) Conversely, suppose that ∑_{n=0}^∞ |a_n| n!/s^{n+1} converges for s > c for some c ≥ 0. Prove that f(x) = ∑_{n=0}^∞ a_n x^n converges for a.e. x ∈ [0, ∞) and L(f)(s) = ∑_{n=0}^∞ a_n n!/s^{n+1} for all s > c.
(d) Using (b), find the Laplace transform of sin x. (Answer: 1/(s^2 + 1).)
(e) Let F(s) = 1/(s^2 − 1) where s > 1. Using (b), find a function f with L(f)(s) = F(s) for s > 1. (Answer: sinh(x).)

9. (Tannery's theorem) Here are some problems dealing with Tannery's theorem.
(a) Find  lim_{n→∞} [ 2n/(3n + 4) + (2n)^2/((3n)^2 + 4^2) + · · · + (2n)^n/((3n)^n + 4^n) ].

(b) Find  lim_{n→∞} [ 1/(2 n^2 sin(1/n^2)) + 1/(2^2 n^2 sin(2/n^2)) + · · · + 1/(2^n n^2 sin(n/n^2)) ].

(c) Probably the most popular use of Tannery's theorem is to derive the Taylor series for e^x from the limit e^x = lim_{n→∞}(1 + x/n)^n. Proceed as follows. First, using the binomial theorem, show that

(1 + x/n)^n = ∑_{k=0}^n (n choose k) x^k/n^k = 1 + x + ∑_{k=2}^n a_{nk},  where

a_{nk} = (n choose k) x^k/n^k = (1 − 1/n)(1 − 2/n) · · · (1 − (k − 1)/n) x^k/k!.

Now use Tannery's theorem to show that lim_{n→∞}(1 + x/n)^n = ∑_{k=0}^∞ x^k/k!.

10. (Tannery's product theorem) You'll need this result for the next problem. Prove the following result [280, p. 296]: For each natural number n, let ∏_{k=1}^{m_n}(1 + a_{nk}) be a finite product where m_n → ∞ as n → ∞. If for each k, lim_{n→∞} a_{nk} exists, and there is a convergent series ∑_{k=1}^∞ M_k of nonnegative real numbers such that |a_{nk}| ≤ M_k for all k, n, then

(6.12)    lim_{n→∞} ∏_{k=1}^{m_n} (1 + a_{nk}) = ∏_{k=1}^∞ lim_{n→∞} (1 + a_{nk}).
Suggestion: Choose N so that |a_{nk}| ≤ 1/2 for all k, n where k ≥ N (why can we choose such an N?) and write ∏_{k=1}^{m_n}(1 + a_{nk}) = ∏_{k=1}^{N−1}(1 + a_{nk}) · ∏_{k=N}^{m_n}(1 + a_{nk}). Prove that it's enough to prove the limit (6.12) with k starting from N rather than starting at 1. Then, take the log of ∏_{k=N}^{m_n}(1 + a_{nk}) to get a sum ∑_{k=N}^{m_n} b_k(n) where b_k(n) = log(1 + a_{nk}). Show that Tannery's series theorem applies to this sum, then derive Tannery's product theorem. It might be helpful to prove that |log(1 + x)| ≤ 2|x| if |x| ≤ 1/2.

11. (The Basel Problem) Here's Euler's derivation of the sine expansion from Introductio in analysin infinitorum, volume 1. (I first saw this proof reading [110, Sec. 1.5], whose argument we follow.)
(i) Finding the n-th roots of unity, prove that for n odd, z^n − 1 can be factored as

z^n − 1 = (z − 1) ∏_{k=1}^{(n−1)/2} (z − e^{2πik/n})(z − e^{−2πik/n}).

(Here we have to assume you have taken a complex variables course.) Using the identity cos θ = (e^{iθ} + e^{−iθ})/2, rewrite this identity as

z^n − 1 = (z − 1) ∏_{k=1}^{(n−1)/2} (z^2 − 2z cos(2πk/n) + 1).

(ii) Given a ∈ C, replacing z by z/a in the above formula and multiplying the result by a^n, show that

z^n − a^n = (z − a) ∏_{k=1}^{(n−1)/2} (z^2 − 2az cos(2πk/n) + a^2).
(iii) Now putting z = (1 + ix/n) and a = (1 − ix/n), and using trig identities, prove that

(1/2i) [ (1 + ix/n)^n − (1 − ix/n)^n ] = x ∏_{k=1}^{(n−1)/2} (1 − x^2/(n^2 tan^2(kπ/n))).

Taking n → ∞ and recalling that sin x = (1/2i)(e^{ix} − e^{−ix}), we obtain

sin x = x lim_{n→∞} ∏_{k=1}^{(n−1)/2} (1 − x^2/(n^2 tan^2(kπ/n))).

Assume Tannery's product theorem from the previous problem (or prove it if you wish) to show that the right side equals x ∏_{k=1}^∞ (1 − x^2/(π^2 k^2)).
12. Here's a proof of Euler's sum by Hofbauer [134].
(i) Show that

1/sin^2(πx) = (1/4) [ 1/sin^2(πx/2) + 1/sin^2(π(1 − x)/2) ].

(ii) Using (i) and induction, show that for any n ∈ N,

1 = (2/4^n) ∑_{k=1, k odd}^{2^n − 1} 1/sin^2(kπ/2^{n+1}),
where we only sum over k odd (that is, when k is even, we define the k-th term to be zero). Note that the n = 1 case is just the formula in (i) with x = 1/2.
(iii) Write the formula in (ii) as 1 = ∑_{k=0}^{2^n − 1} a_{nk} for an appropriate a_{nk}. Apply Tannery's series theorem to derive the formula π^2/8 = ∑_{k=0}^∞ 1/(2k + 1)^2, and from this deduce Euler's formula.

13. (The FTA) Here are some problems related to the FTA.
(a) If p(z) is a polynomial of positive degree n, prove that p has exactly n complex roots c_1, . . . , c_n counting multiplicities, and show that for some constant a ∈ C,

p(z) = a (z − c_1)(z − c_2) · · · (z − c_n).

(b) Fill in the details of Proofs 2 & 3, including proving (6.6) and (6.8). Suggestion: First prove (6.6) holds for f(z) = z^k for any k ∈ N, then use the Quotient Rule to show it holds for any rational function.
(c) Here's a complex version of Proof 3. Let p(z) = z^n + a_{n−1} z^{n−1} + · · · + a_0 be a polynomial with complex coefficients, n ≥ 1, and suppose, as usual, that p has no roots.
(i) Let q(z) = z^n + ā_{n−1} z^{n−1} + · · · + ā_1 z + ā_0, the polynomial whose coefficients are the complex conjugates of the coefficients of p. Define P(z) = p(z) q(z̄) as appropriate so that P(z) has real coefficients, and prove that P(z) has a root if and only if p(z) has a root.
(ii) Define

F(θ) = ∫_0^∞ 1/P(re^{iθ}) dr

and prove that F(θ) is a continuously differentiable function of θ ∈ R such that F′(θ) = −iF(θ). This is an ordinary differential equation whose solution is F(θ) = C e^{−iθ} for some constant C.
(iii) Taking θ = 0 and θ = π, show that both C > 0 and C < 0, a contradiction. (Note that P(t) = |p(t)|^2 > 0 for t ∈ R.)
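The convergence asserted in Problem 11(iii) is easy to watch numerically. The sketch below (an editorial illustration, with ad hoc truncation values) compares the finite product in 11(iii), for a large odd n, with sin x/x and with the truncated Euler product:

```python
import math

def finite_product(x, n):
    """Product over k = 1,...,(n-1)/2 of (1 - x^2/(n^2 tan^2(k pi/n))), n odd."""
    prod = 1.0
    for k in range(1, (n - 1) // 2 + 1):
        prod *= 1 - x ** 2 / (n ** 2 * math.tan(k * math.pi / n) ** 2)
    return prod

x = 1.3
target = math.sin(x) / x
print(abs(finite_product(x, 10001) - target))  # small already for n = 10001

euler = 1.0  # truncated Euler product: prod over k of (1 - x^2/(pi^2 k^2))
for k in range(1, 2000):
    euler *= 1 - x ** 2 / (math.pi ** 2 * k ** 2)
print(abs(euler - target))  # also small
```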
6.2. Lebesgue, Riemann and Stieltjes integration In this section we compare Riemann integration to Lebesgue integration.7 In particular, we characterize Riemann integrable functions as bounded functions that are continuous almost everywhere. 7 Does anyone believe that the difference between the Lebesgue and Riemann integrals can have physical significance, and that whether say, an airplane would or would not fly could depend on this difference? If such were claimed, I should not care to fly in that plane. Richard Wesley Hamming (1915–1998). Quoted in [239].
6.2.1. The Riemann-Stieltjes integral. In elementary calculus you studied the Riemann integral, which was introduced by Georg Friedrich Bernhard Riemann (1826–1866) in his 1854 habilitationsschrift "Ueber die Darstellbarkeit einer Function durch eine trigonometrische Reihe" (On the representation of a function by trigonometric series).^8 In later analysis or probability courses you might have seen the Riemann-Stieltjes integral, which is a useful generalization of the Riemann integral introduced by Thomas Jan Stieltjes (1856–1894) in 1894 [270]. Before blinding you with many technical definitions, it might be helpful to review Stieltjes' motivation behind his integrals. Consider point masses m_1, . . . , m_N located at points x_1, . . . , x_N, as in Figure 6.1.

Figure 6.1. Masses m_1, . . . , m_N centered at x_1, . . . , x_N.

Given p ∈ N, the p-th moment of the masses is the sum

x_1^p m_1 + x_2^p m_2 + · · · + x_N^p m_N.
If p = 1, this is just the "center of mass" familiar from physics. Stieltjes was studying moments of continuous mass distributions instead of point masses. Consider the interval [0, 1] as a solid rod and let ϕ : [0, 1] → R be a nondecreasing nonnegative function such that for each x ∈ [0, 1], ϕ(x) is the mass of the rod segment [0, x]. How would one go about defining the p-th moment of the solid rod [0, 1]? Here's how: Partition the interval [0, 1] into a bunch of subintervals [x_0, x_1], [x_1, x_2], . . . , [x_{N−1}, x_N] where 0 = x_0 < x_1 < · · · < x_{N−1} < x_N = 1, as seen in Figure 6.2. Then the mass of the k-th segment is ϕ(x_k) − ϕ(x_{k−1}).

Figure 6.2. ϕ(x_k) is the mass of [0, x_k] and ϕ(x_{k−1}) is the mass of [0, x_{k−1}], so ϕ(x_k) − ϕ(x_{k−1}) is the mass of [x_{k−1}, x_k].

So, choosing a point x_k^* in the k-th interval [x_{k−1}, x_k], the sum

(x_1^*)^p {ϕ(x_1) − ϕ(x_0)} + (x_2^*)^p {ϕ(x_2) − ϕ(x_1)} + · · · + (x_N^*)^p {ϕ(x_N) − ϕ(x_{N−1})}

should be a close approximation to what the true p-th moment of the rod should equal. If we put f(x) = x^p, then this sum can be written as

∑_{k=1}^N f(x_k^*) {ϕ(x_k) − ϕ(x_{k−1})},
which is nowadays called a Riemann-Stieltjes sum of f. We now see how to define the p-th moment of the rod: Just take finer and finer partitions of the rod and if these Riemann-Stieltjes sums approach a number, which we call the Riemann-Stieltjes integral of f, then this number would be (by definition) the p-th moment of the rod. With this background, we now blind you with definitions!

^8 In Germany, one needs a Habilitation to lecture at a German university, one requirement of which is to write a second "Ph.D. thesis" called the habilitationsschrift. This requirement shows that one can do research after the Ph.D.
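The point-mass picture is easy to put in code (an editorial sketch with made-up masses, not from the text): if ϕ is the cumulative mass function of finitely many point masses, then for a fine partition the Riemann-Stieltjes sum of f(x) = x^p recovers the discrete p-th moment ∑ x_j^p m_j.

```python
masses = [(0.2, 1.5), (0.5, 2.0), (0.8, 0.5)]  # (x_j, m_j), illustrative values

def phi(x):
    """Cumulative mass contained in [0, x]."""
    return sum(m for xj, m in masses if xj <= x)

def rs_sum(f, N):
    """Riemann-Stieltjes sum of f w.r.t. phi over a uniform partition of [0, 1],
    evaluating f at the right endpoint of each cell."""
    return sum(f(k / N) * (phi(k / N) - phi((k - 1) / N)) for k in range(1, N + 1))

p = 2
moment = sum(xj ** p * m for xj, m in masses)  # the discrete p-th moment
print(abs(rs_sum(lambda x: x ** p, 10_000) - moment))  # essentially zero
```

Each jump of ϕ contributes f evaluated near the mass point times the mass, which is exactly Stieltjes' motivation.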
Throughout this section we work on a fixed interval [a, b] and we fix a nondecreasing right-continuous function ϕ : [a, b] → R. (We assume right-continuity so that the corresponding Lebesgue-Stieltjes set function is a measure, a fact we'll need later.) A partition of [a, b] is just a set of numbers P = {x_0, x_1, . . . , x_N} where a = x_0 < x_1 < · · · < x_{N−1} < x_N = b. The length of P, denoted by ‖P‖, is the maximum of the numbers x_k − x_{k−1} where k = 1, . . . , N. Given a bounded function f : [a, b] → R, a Riemann-Stieltjes sum of f with respect to ϕ is a sum of the form

S(P) = ∑_{k=1}^N f(x_k^*) {ϕ(x_k) − ϕ(x_{k−1})},

where P is a partition of [a, b] and x_k^* ∈ [x_{k−1}, x_k] for each k. This sum depends on P, f, ϕ, and the choices of the x_k^*'s, but to simplify notation we omit these facts except P. To make precise the idea of taking "finer and finer partitions," let

D = {nondecreasing sequences of partitions of [a, b] whose lengths are approaching zero};

explicitly, an element of D is a sequence {P_1, P_2, P_3, . . .} of partitions of [a, b] such that

P_1 ⊆ P_2 ⊆ P_3 ⊆ · · ·  and  ‖P_n‖ → 0.

Thus, a sequence {Pn} represents the intuitive idea of adding more and more partition points in such a way that the distance between any two adjacent partition points → 0. We say that f is (Riemann-Stieltjes) integrable with respect to ϕ if there is a real number I such that given any sequence {Pn} ∈ D, we have

lim_{n→∞} S(Pn) = I;

explicitly, given any ε > 0 there is a p such that

(6.13)    |I − S(Pn)| < ε  for all n ≥ p,

and all choices of the intermediate points x_k^* in the Riemann-Stieltjes sums.^9 The number I is called the Riemann-Stieltjes integral of f with respect to ϕ and we shall denote it by

I = ∫_R f dϕ.

If ϕ(x) = x, then we simply say that f is Riemann integrable.
There is another approach to the Riemann-Stieltjes integral via lower and upper sums, due to Jean Gaston Darboux (1842–1917), which I think allows one to prove theorems more easily than using the definition (6.13). He introduced this new method for understanding the Riemann integral in his 1875 paper Mémoire sur les fonctions discontinues [65]. This paper is really quite remarkable: it was basically a reworking of Riemann's theory of integration

^9 Our sequence definition is equivalent to the traditional ε-δ definition, which is: For each ε > 0 there is a δ > 0 such that for any partition P of [a, b] with ‖P‖ < δ, we have |I − S(P)| < ε for any Riemann-Stieltjes sum corresponding to the partition P. Our sequence definition has the advantage that it gives an efficient proof of Theorem 6.10 via the DCT.
from scratch, packed with all the important theorems on Riemann integration. Because of mathematical convenience, we shall introduce Darboux's approach, which Lebesgue later used in his paper Sur une généralisation de l'intégrale définie.
If P = {x_0, x_1, . . . , x_N} is a partition of [a, b], then we define simple functions ℓP and uP by

ℓP = ∑_{k=1}^N m_k χ_{(x_{k−1}, x_k]},    uP = ∑_{k=1}^N M_k χ_{(x_{k−1}, x_k]},

where

m_k = inf{f(x) ; x_{k−1} ≤ x ≤ x_k},    M_k = sup{f(x) ; x_{k−1} ≤ x ≤ x_k}.
These lower and upper functions of f have the property that

(6.14)    ℓP ≤ f ≤ uP

on the interval [a, b]; see Figure 6.3 for pictures of ℓP and uP.

Figure 6.3. Here f is a linear function. The solid horizontal lines represent uP and the dotted lines ℓP.

The lower and
upper sums of f with respect to ϕ are the sums

L(P) = ∑_{k=1}^N m_k {ϕ(x_k) − ϕ(x_{k−1})},    U(P) = ∑_{k=1}^N M_k {ϕ(x_k) − ϕ(x_{k−1})}.

Observe that

L(P) = ∫ ℓP dµϕ,    U(P) = ∫ uP dµϕ,

where these integrals are Lebesgue integrals with respect to the Lebesgue-Stieltjes measure µϕ : Mϕ → [0, ∞], defined by µϕ(c, d] = ϕ(d) − ϕ(c) on elements of I^1, and where Mϕ denotes the µϕ-measurable sets.^10 This last observation is key to relating Riemann-Stieltjes integrals to Lebesgue integrals.
In order to prove Darboux's theorem relating lower and upper sums to the Riemann-Stieltjes integral, we need the following lemma.

Lemma 6.7. If f : [a, b] → R is bounded, P and Q are partitions of [a, b], and P ⊆ Q (so every partition point in P is a partition point in Q), then

ℓP ≤ ℓQ ≤ f ≤ uQ ≤ uP.

^10 To be precise we should be writing (µϕ)^* because the measure µϕ : Mϕ → [0, ∞] is really the Carathéodory extension of the measure µϕ : I^1 → [0, ∞); however, for the sake of notational simplicity we drop the ^*.
Proof : Suppose first that the partition Q contains exactly one more point than the partition P; an induction argument then proves the lemma when Q contains any number of extra points. Let this one extra point be denoted by y and suppose that x_{k−1} < y < x_k. Observe that if m_k = inf{f(x) ; x_{k−1} ≤ x ≤ x_k}, then

m_k ≤ inf_{x_{k−1} ≤ x ≤ y} f(x)  and  m_k ≤ inf_{y ≤ x ≤ x_k} f(x).

It follows that

ℓP ≤ ℓQ.

An analogous argument shows that uQ ≤ uP; then (6.14) implies that

ℓP ≤ ℓQ ≤ f ≤ uQ ≤ uP.
Recall that

D = {nondecreasing sequences {Pn} of partitions of [a, b] with ‖Pn‖ → 0}.

Given {Pn} ∈ D, let ℓ_n and u_n be, respectively, the lower and upper simple functions of f corresponding to the partition Pn. In view of Lemma 6.7, we have

ℓ_1(x) ≤ ℓ_2(x) ≤ · · · ≤ f(x) ≤ · · · ≤ u_2(x) ≤ u_1(x),

so

L(P_1) ≤ L(P_2) ≤ · · · ≤ L(Pn) ≤ · · · ≤ U(Pn) ≤ · · · ≤ U(P_2) ≤ U(P_1).

Being monotone sequences, the limits lim L(Pn) and lim U(Pn) exist. The following theorem, which we shall call Darboux's theorem, says that f is Riemann-Stieltjes integrable with respect to ϕ if and only if these limits equal the same value for all partition sequences in D.

Darboux's theorem
Theorem 6.8. A bounded function f is integrable with respect to ϕ if and only if there exists a real number I such that for any {Pn} ∈ D, we have

lim L(Pn) = I = lim U(Pn);

in this case, I = ∫_R f dϕ, the Riemann-Stieltjes integral of f.
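Darboux's theorem is easy to watch in action (an editorial sketch, not part of the text): take ϕ(x) = x and f(x) = x^2 on [0, 1], so that on each cell of a uniform partition m_k = f(x_{k−1}) and M_k = f(x_k) because f is increasing. The lower and upper sums then squeeze the Riemann integral 1/3:

```python
def darboux_sums(N):
    """Lower and upper sums of f(x) = x^2 w.r.t. phi(x) = x on [0, 1],
    uniform partition with N cells; f increasing gives m_k = f(x_{k-1}), M_k = f(x_k)."""
    f = lambda x: x * x
    lower = sum(f((k - 1) / N) / N for k in range(1, N + 1))
    upper = sum(f(k / N) / N for k in range(1, N + 1))
    return lower, upper

for N in (2, 4, 8, 1024):  # nested dyadic partitions, so lengths -> 0
    L, U = darboux_sums(N)
    print(N, L, U)  # L increases toward 1/3, U decreases toward 1/3
```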
We leave the proof of Darboux's theorem for Problem 3.

6.2.2. Lebesgue vs. Riemann integrals. We now characterize Riemann-Stieltjes integrable functions and show that any Riemann-Stieltjes integrable function is also Lebesgue integrable and the two integrals agree. In order to do so, we need to prove an important lemma. Let f : [a, b] → R be bounded, let {Pn} ∈ D, and let ℓ_n and u_n be, respectively, the lower and upper simple functions of f corresponding to the partition Pn. Then by Lemma 6.7, for each x ∈ [a, b] we have

ℓ_1(x) ≤ ℓ_2(x) ≤ · · · ≤ f(x) ≤ · · · ≤ u_2(x) ≤ u_1(x).

In particular, the sequences ℓ_n and u_n are monotone sequences, so the limits

ℓP(x) := lim_{n→∞} ℓ_n(x)  and  uP(x) := lim_{n→∞} u_n(x)

exist for each x ∈ [a, b], and

ℓP ≤ f ≤ uP.
Here, the subscript P denotes the union P = ⋃_{n=1}^∞ Pn of all the partitions. In the following lemma we list the important properties of the functions ℓP and uP.

Lemma 6.9. Let f be a bounded function on [a, b] and let {Pn} ∈ D.
(a) If f is continuous at x, then ℓP(x) = f(x) = uP(x).
(b) If x ∉ P and ℓP(x) = f(x) = uP(x), then f is continuous at x.

Proof : Suppose that f is continuous at x. Let ε > 0. Then there is a δ > 0 such that

(6.15)    |y − x| < δ  ⟹  |f(y) − f(x)| < ε.

Since the lengths of the partitions approach zero, we can choose p such that the lengths of all of the partitions Pn are less than δ for all n ≥ p. Given a partition Pn = {x_0, . . . , x_N} with n ≥ p, there exists a k such that x ∈ (x_{k−1}, x_k]; then (6.15) implies that

|m_k − f(x)| ≤ ε  and  |M_k − f(x)| ≤ ε.

Hence,

|ℓ_n(x) − f(x)| ≤ ε  and  |u_n(x) − f(x)| ≤ ε.

Taking n → ∞ implies that ℓP(x) = f(x) = uP(x) since ε > 0 was arbitrary.
Suppose now that x ∉ P and ℓP(x) = f(x) = uP(x). Let ε > 0. Since ℓP(x) = lim_{n→∞} ℓ_n(x) and uP(x) = lim_{n→∞} u_n(x), there exists a p such that

|ℓ_n(x) − f(x)| < ε  and  |u_n(x) − f(x)| < ε  for all n ≥ p.

Fix n ≥ p and let Pn = {x_0, . . . , x_N}. Then x_{k−1} < x < x_k for some k (recall that x ∉ P), so ℓ_n(x) = m_k and u_n(x) = M_k, and hence,

(6.16)    |m_k − f(x)| < ε  and  |M_k − f(x)| < ε.

Let δ > 0 be chosen such that the interval (x − δ, x + δ) is contained in (x_{k−1}, x_k). Then (6.16) implies that

|y − x| < δ  ⟹  |f(y) − f(x)| < ε.

Thus, f is continuous at x.
We're now ready to prove the main result of this section, which characterizes Riemann-Stieltjes integrability in terms of Lebesgue-Stieltjes measures.

Characterization of Riemann-Stieltjes integrability
Theorem 6.10. A bounded function f on a finite interval is Riemann-Stieltjes integrable with respect to ϕ if and only if it is continuous µϕ-a.e., that is, the set of discontinuity points of f has µϕ-measure zero. When f is Riemann-Stieltjes integrable, then f is (in the Lebesgue sense) µϕ-integrable too, and

∫_R f dϕ = ∫ f dµϕ,

where the right-hand side denotes the Lebesgue integral of f with respect to the measure µϕ.
Proof : Let f be a bounded function on [a, b].
Step 1: We prove that f is Riemann-Stieltjes integrable with respect to ϕ if and only if given any {Pn} ∈ D, we have ℓP = f = uP µϕ-a.e. Indeed, by Darboux's Theorem we know that f is Riemann-Stieltjes integrable with respect to ϕ if and only if there is a real number I such that given any {Pn} ∈ D, we have

(6.17)    lim L(Pn) = I = lim U(Pn),

in which case I = ∫_R f dϕ. Recall that

L(Pn) = ∫ ℓ_n dµϕ  and  U(Pn) = ∫ u_n dµϕ,

where ℓ_n and u_n are, respectively, the lower and upper simple functions of f corresponding to the partition Pn. Since f is bounded, ℓ_n and u_n are bounded, so the Dominated Convergence Theorem implies that

lim L(Pn) = ∫ ℓP dµϕ  and  lim U(Pn) = ∫ uP dµϕ.

Therefore, (6.17) holds if and only if

(6.18)    ∫ ℓP dµϕ = ∫ uP dµϕ  (which then equal ∫_R f dϕ),

or equivalently,

∫ (uP − ℓP) dµϕ = 0.

This holds if and only if uP − ℓP = 0 (that is, uP = ℓP) µϕ-a.e. since uP − ℓP is nonnegative. As ℓP ≤ f ≤ uP, it follows that uP = ℓP µϕ-a.e. if and only if ℓP = f = uP µϕ-a.e. This proves Step 1.
Step 2: We now prove that when f is Riemann-Stieltjes integrable, it is also (Lebesgue) µϕ-integrable and the two notions of integral agree. To see this, let {Pn} ∈ D and recall from Step 1 that ℓP = f = uP µϕ-a.e. Since ℓP and uP are µϕ-measurable functions and µϕ is complete, it follows that f is µϕ-measurable too (see Proposition 5.10). Moreover, since Lebesgue integrals are equal on a.e. equal functions, we have

∫ ℓP dµϕ = ∫ f dµϕ = ∫ uP dµϕ.

By (6.18) we see that

∫ ℓP dµϕ = ∫_R f dϕ = ∫ uP dµϕ,

so the Riemann-Stieltjes and the Lebesgue µϕ-integral of f agree. This completes Step 2.
Before completing our proof, we need two facts: (1) Any monotone function has at most countably many points of discontinuity; you will prove this in Problem 9. (2) If ϕ is continuous at a point α, then µϕ({α}) = 0. This fact isn't difficult to prove and was shown back in Problem 2 of Exercises 3.5.
Step 3: It remains to prove that f is Riemann-Stieltjes integrable with respect to ϕ if and only if f is continuous µϕ-a.e. By Step 1, we just have to show that

ℓP = f = uP µϕ-a.e. for any {Pn} ∈ D  ⟺  f is continuous µϕ-a.e.

Note that the direction ⟸ follows directly from Part (a) of Lemma 6.9, so we just have to prove the direction ⟹. Assume that ℓP = f = uP µϕ-a.e. for any {Pn} ∈ D. Since the set of discontinuity points of ϕ is countable by Fact (1), we can choose a sequence {Pn} ∈ D such that P contains no discontinuity points of ϕ. Then ϕ is
continuous at each point of P, which is a countable set, and hence by Fact (2), the set P has µϕ-measure zero. Now, ℓP = f = uP µϕ-a.e. on [a, b] implies, by Part (b) of Lemma 6.9, that f is continuous µϕ-a.e. on [a, b] \ P. Since P has µϕ-measure zero it follows that f is continuous µϕ-a.e. on [a, b].
As a corollary, we get Lebesgue's characterization of Riemann integrability.

Corollary 6.11. A bounded function on a finite interval is Riemann integrable if and only if it is continuous a.e. (with respect to Lebesgue measure), in which case the function is also Lebesgue integrable and the two notions of integral agree.

In particular, this theorem shows that the Dirichlet function

D(x) = 1 if x ∈ [0, 1] ∩ Q,  D(x) = 0 if x ∉ [0, 1] ∩ Q,

is not Riemann integrable on [0, 1], a fact we already knew. Since now we have proved that the Lebesgue integral agrees with the Riemann integral on Riemann integrable functions, when integrating Riemann integrable functions we shall henceforth use without proof common results concerning Riemann integration, e.g. the Fundamental Theorem of Calculus, change of variables in 1-dimensional integrals, integration by parts, and so forth.

6.2.3. Neat functions that are Riemann integrable. In Riemann's habilitationsschrift he gave the following interesting example of a Riemann integrable function. First, Riemann introduced the function ρ : R → R defined as follows:^11 Put ρ(x) = x for −1/2 < x < 1/2, ρ(−1/2) = 0, and ρ(1/2) = 0, then extend ρ(x) to the whole real line by reproducing its graph periodically, as shown in Figure 6.4.
Figure 6.4. Graph of ρ(x).

Note that ρ is continuous except at the half-integers:

±1/2, ±3/2, ±5/2, ±7/2, . . . ,  in general at all numbers odd/2,

where "odd" is an odd integer. Next, Riemann defined his function:

f(x) := ∑_{n=1}^∞ ρ(n x)/n^2.

^11 Riemann denoted ρ(x) by (x).
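Riemann's function is pleasant to experiment with numerically. A sketch of its partial sums (an editorial illustration, not part of the text), which also respects the bound |f| ≤ π^2/12 derived in the text:

```python
import math

def rho(x):
    """Riemann's rho: rho(x) = x on (-1/2, 1/2), rho(+-1/2) = 0, extended 1-periodically."""
    t = x - round(x)          # nearest integer; at half-integers |t| = 1/2
    return 0.0 if abs(t) == 0.5 else t

def f_partial(x, N):
    """N-th partial sum of Riemann's function: sum of rho(n x)/n^2 for n = 1,...,N."""
    return sum(rho(n * x) / n ** 2 for n in range(1, N + 1))

# |rho| <= 1/2 forces |f| <= (1/2) * pi^2/6 = pi^2/12, and partial sums respect it:
xs = [i / 997 for i in range(998)]
assert all(abs(f_partial(x, 200)) <= math.pi ** 2 / 12 for x in xs)
print(f_partial(0.25, 200))
```

Plotting f_partial over [−1, 1] for N = 2, 4, 8 reproduces the erratic graphs of Figure 6.5.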
Since |ρ(x)| ≤ 1/2 it follows that this series converges for all x ∈ R. Moreover,

|f(x)| ≤ ∑_{n=1}^∞ (1/2)/n^2 = (1/2) ∑_{n=1}^∞ 1/n^2 = π^2/12,

so f : R → R is a bounded function. Figure 6.5 shows graphs of f_2, f_4, and f_8, where f_N denotes the N-th partial sum of f. Note how erratic these functions look.

Figure 6.5. Graphs of f_2 (on the left), f_4 (in the middle), and f_8 (on the right).

Since ρ is discontinuous at the half-integers odd/2, the n-th term (1/n^2)ρ(n x) in Riemann's function is discontinuous at the points odd/(2n). In Problem 10 you will prove that f(x) is continuous at all real numbers except those rational numbers of the form odd/even. In particular, the set of discontinuity points is a set of measure zero and hence f(x) is Riemann integrable on any finite interval. Note also that the set of discontinuity points is dense in R, so f is indeed a strange function. Because Riemann's integral can handle such discontinuous functions, mathematicians such as Karl Theodor Wilhelm Weierstrass (1815–1897) said that Riemann's integral "has been seen as the most general thinkable" and Paul David Gustav du Bois-Reymond (1831–1889) said that "Riemann extended the scope of integrable functions up to its extreme limit" (quotes taken from [139, p. 266]).
Another interesting function that is Riemann integrable is T : R → R defined by

T(x) = 1/q if x ∈ Q and x = p/q in lowest terms and q > 0,  T(x) = 0 if x is irrational.
This function is called Thomae’s function (amongst other names such as the “Ruler function”), named after Johannes Karl Thomae (1840–1921), who discussed this function in §20, page 14, of his 1875 book [282], where he gives several examples of pathological functions (including Riemann’s function) that are Riemann integrable. See Figure 6.6 for a picture of Thomae’s function on [0, 1]. In Problem 11 you will show that T is discontinuous on the rationals and continuous on the irrational numbers. In particular, Thomae’s function is continuous a.e. and hence Riemann integrable. ◮ Exercises 6.2. 1. Evaluate
∫_0^{1/2} f(x) dx  and  ∫_0^{1/2} T(x) dx,

where f is Riemann's function and T is Thomae's function.
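Thomae's function can be evaluated exactly at rationals using exact arithmetic (an editorial sketch; the helper below is hypothetical, not from the text). It also hints at why T is continuous at the irrationals: near any fixed point, rationals with small denominator are scarce.

```python
from fractions import Fraction

def thomae(x):
    """T(p/q) = 1/q for a rational x = p/q in lowest terms (q > 0); T = 0 otherwise.
    Fraction stores p/q already reduced to lowest terms."""
    if isinstance(x, Fraction):
        return Fraction(1, x.denominator)
    return 0  # irrational inputs are not representable here; T would be 0

print(thomae(Fraction(3, 6)))    # 3/6 reduces to 1/2, so T = 1/2
print(thomae(Fraction(2, 7)))    # 1/7

# Rationals k/1000 near 1/2 (excluding 1/2 itself) all have large reduced
# denominators, so T is small there:
near_half = [Fraction(k, 1000) for k in range(495, 506) if Fraction(k, 1000) != Fraction(1, 2)]
print(max(thomae(x) for x in near_half))   # 1/125, from 496/1000 = 62/125
```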
Figure 6.6. This is a plot of the points (p/q, T(p/q)) for 0 ≤ p/q ≤ 1 and q at most 13.

2. To avoid certain problems, in this section we have focused on bounded functions. To understand why, let f : [a, b] → R be unbounded and consider the function ϕ(x) = x. (E.g. the function defined by f(0) = 0 and f(x) = 1/√x for 0 < x ≤ 1 is unbounded on [0, 1].) Given any m > 0 and given any partition P of [a, b], prove that there exists a Riemann-Stieltjes sum S(P) (since ϕ(x) = x, S(P) is usually called a Riemann sum) such that |S(P)| > m. This shows that the condition (6.13) cannot be satisfied for any I ∈ R.

3. Prove Darboux's theorem, Theorem 6.8.

4. Let f, ϕ : [0, 1] → R be the functions

f(x) = 0 if 0 ≤ x ≤ 1/2, f(x) = 1 if 1/2 < x ≤ 1;    ϕ(x) = 0 if 0 ≤ x < 1/2, ϕ(x) = 1 if 1/2 ≤ x ≤ 1.

(a) Prove, using our original sequence-of-partitions definition of Riemann-Stieltjes integrability, or using Darboux's theorem if you wish, that f is not integrable with respect to ϕ on [0, 1].
(b) Let I_1 = [0, 1/2] and I_2 = [1/2, 1]. Using our sequence-of-partitions definition of Riemann-Stieltjes integrability, or Darboux's theorem, prove that f is integrable with respect to ϕ on I_1 and prove the same for I_2.
Note that although f is integrable with respect to ϕ on I_1 and on I_2, f is not integrable with respect to ϕ on [0, 1] = I_1 ∪ I_2! To avoid this pathology, it's common to use the following definition of the integral.

5. (Another definition of the Riemann-Stieltjes integral) We say that f is RS integrable with respect to ϕ if there is a real number I such that for some sequence of partitions {Pn} ∈ D, we have

lim_{n→∞} S(Pn) = I,

where this limit is in the sense of (6.13). We denote the number I by ∫_{RS} f dϕ. Note that all we did was replace "for every sequence of partitions" in our original definition of the Riemann-Stieltjes integral with "for some sequence of partitions". However, this slight change makes a world of difference.
(i) Prove that the number I, if it exists, is unique; that is, if {Qn} ∈ D satisfies lim_{n→∞} S(Qn) = I′, then I = I′.
(ii) Prove that a bounded function f is RS integrable with respect to ϕ if and only if there exists a real number I such that for some sequence of partitions {Pn} ∈ D, we have

lim L(Pn) = I = lim U(Pn);

in this case, I = ∫_{RS} f dϕ, the Riemann-Stieltjes integral of f.
6. SOME APPLICATIONS OF INTEGRATION
(iii) It’s clear that if f is integrable with respect to ϕ (in our original definition), then f is RS integrable with respect to ϕ. The converse is false; if f and ϕ are the functions in Problem 4, prove that f is RS integrable with respect to ϕ on [0, 1].
(iv) (Additivity on intervals) Let a ≤ c ≤ b and suppose that f is RS integrable with respect to ϕ on both subintervals [a, c] and [c, b]. Prove that f is RS integrable with respect to ϕ on [a, b]. This additivity property is false for our original Riemann-Stieltjes definition of integrability by Problem 4.
6. (Characterization of RS integrability)
(i) Given {Pn} ∈ D, let ℓP and uP be as in Lemma 6.9. Prove that for x ∈ [a, b], we have
ℓP(x) = f(x) = uP(x) ⇐⇒ f is continuous at x if x ∉ P, and f is left-continuous at x if x ∈ P.
By Lemma 6.9 you just have to prove the statement for x ∈ P.
(ii) Let ϕ : [a, b] → R be a nondecreasing right continuous function and let A = the set of continuity points of ϕ. Prove the following theorem: A bounded function f is RS integrable with respect to ϕ if and only if f is continuous µϕ-a.e. on A and f is not discontinuous from the left at any point of [a, b] \ A = the set of discontinuity points of ϕ. When f is Riemann-Stieltjes integrable, it is also (Lebesgue) µϕ-integrable and the two notions of integral agree. Another way to state the integrability condition is as follows: f is RS integrable with respect to ϕ if and only if the set of discontinuity points of f in A has µϕ measure zero and f and ϕ are never simultaneously discontinuous from the left.
7. (Fichtenholz–Lichtenstein Theorem) This theorem (cf. [267], [180], [181], [98]) is named after Grigori Fichtenholz (1888–1959) and Leon Lichtenstein (1878–1933). Let (X, S, µ) be a measure space and let f : [a, b] × X → R, where [a, b] ⊆ R is a closed interval, be a function such that
(a) For a.e. x ∈ X, the function fx : [a, b] → R, defined by fx(t) = f(t, x) for all t ∈ [a, b], is Riemann integrable on [a, b].
(b) For all t ∈ [a, b], the function ft : X → R, defined by ft(x) = f(t, x) for all x ∈ X, is µ-measurable.
(c) There is a µ-integrable function g : X → [0, ∞] such that for a.e. x ∈ X, |f(t, x)| ≤ g(x) for all t ∈ [a, b].
We shall prove that
(6.19) ∫_R ∫_X f(t, x) dµ dt = ∫_X ∫_R f(t, x) dt dµ.
Here, “∫_X” denotes the Lebesgue integral over X while “∫_R” denotes the Riemann integral over [a, b]. To prove (6.19) you may proceed as follows.
(a) Let {Pn} be a nondecreasing sequence of partitions of [a, b] whose lengths are approaching zero and let S(Pn, fx) be a Riemann sum of fx(t) in the t variable with respect to the partition Pn. Prove that
S(Pn, F) = ∫_X S(Pn, fx) dµ,
where F(t) = ∫_X f(t, x) dµ.
(b) Show that the Dominated Convergence Theorem can be applied to obtain (6.19) in the limit as n → ∞.
8. (cf. [162, 179]) This theorem states that a bounded function on an interval is continuous almost everywhere if and only if it has a left-hand limit almost everywhere. To prove this, proceed as follows.
(a) Let f be a bounded real-valued function on an interval I. Given a point a ∈ I, we define the oscillation of f at a by
osc(f, a) = lim_{δ→0} [ sup{f(x) ; |x − a| < δ} − inf{f(x) ; |x − a| < δ} ].
Let D denote the set of discontinuities of f. Prove that D = ∪_{n=1}^∞ Dn, where Dn = {x ; osc(f, x) > 1/n}.
(b) Let L denote the set of points where f has a left-hand limit. Prove that every point of Dn ∩ L is the right endpoint of an open interval that contains no point of Dn ∩ L. Conclude that Dn ∩ L must be countable, and hence D ∩ L must be countable.
(c) Now prove that a bounded function on an interval is continuous a.e. if and only if it has a left-hand limit a.e.
9. Let I ⊆ R be an interval (open, closed, half-open, bounded, unbounded, . . .). In this problem we prove that any monotone function ϕ : I → R has at most a countable number of discontinuities.
(i) Show that if this statement holds for any compact interval I, then the statement holds for any interval. Show that if the statement holds for any nondecreasing function, then it holds for any monotone function.
(ii) Assume that ϕ : I → R is nondecreasing where I is compact. By e.g. Lemma 1.19, note that ϕ is discontinuous at a point x if and only if d(x) := lim_{y→x+} ϕ(y) − lim_{y→x−} ϕ(y) > 0. For each n ∈ N, let Dn = {x ; d(x) > 1/n}. Prove that Dn is a finite set and that the set of discontinuities of ϕ is ∪_{n=1}^∞ Dn.
10. (Riemann’s function) In this problem we analyze Riemann’s function.
(i) For each n ∈ N, let fn : R → R be a function and suppose that for some a ∈ R and for each n ∈ N, lim_{x→a} fn(x) exists. Suppose there exists a convergent series ∑_{n=1}^∞ Mn of nonnegative real numbers such that |fn| ≤ Mn for all n. Prove that f(x) = ∑_{n=1}^∞ fn(x) converges for each x ∈ R, and
lim_{x→a} f(x) = ∑_{n=1}^∞ lim_{x→a} fn(x).
We can replace lim_{x→a} with left or right-hand limits . . . the proof is the same.
(ii) Let D be the set of all rational numbers of the form odd/even. Prove that D is a dense subset of R.
(iii) Let f(x) = ∑_{n=1}^∞ ρ(nx)/n², Riemann’s function. Prove that f is continuous on R \ D. For the remainder of this problem we prove that f is discontinuous at each point in D. To do so, we shall prove an interesting result of Riemann: If r ∈ D and we write r = p/(2q) where p ∈ Z is odd, q ∈ N, and p and q have no common factors, then
f(r±) = f(r) ∓ π²/(16q²).
Here, f(r+) (resp. f(r−)) denotes the right (resp. left)-hand limit of f at r. In particular, the formula for f(r±) shows that f is discontinuous on D, which is a countable (and hence measure zero) subset of R.
(iv) As a first step to prove the formula for f(r±), show that for any c ∈ R, we have
f(c±) = ∑_{n=1}^∞ ρ(nc±)/n².
(v) Prove that ρ(h±) = ρ(h) ∓ 1/2 for any half-integer h. Now let r ∈ D and write r = p/(2q) where p ∈ Z is odd, q ∈ N, and p and q have no common factors. Show that
ρ(nr±) = ρ(nr) if n is not a multiple of q, and ρ(nr±) = ρ(nr) ∓ 1/2 if n is a multiple of q.
(vi) With r as in (v), show that
f(r±) = f(r) ∓ (1/(2q²)) (1/1² + 1/3² + 1/5² + · · ·),
and from this prove that f(r±) = f(r) ∓ π²/(16q²), just as Riemann stated.
11. (Thomae’s function) Prove that Thomae’s function T is discontinuous at each rational number. We now prove that T is continuous at each irrational number as follows:
(i) Prove that T(x ± 1) = T(x) for every x ∈ R. This shows that T is a periodic function on R with period 1, so we just have to prove that T is continuous at each irrational number in (0, 1).
(ii) Fix an irrational number c ∈ (0, 1) and let ε > 0. Choose any n ∈ N with n ≥ 1/ε and put
A = { r = p/q ∈ Q in lowest terms ; 1 ≤ p, q ≤ n },
which is a finite set of numbers. Prove that for all x ∈ [0, 1] \ A, we have |T(x) − T(c)| = |T(x)| < ε.
(iii) Now prove that T is continuous at c.
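Riemann’s jump formula from Problem 10 can be checked numerically. The sketch below (helper names are ours) truncates the series and evaluates it just to the left and right of r = 1/2 (so p = 1, q = 1), where the two one-sided limits should differ by f(r+) − f(r−) = −π²/(8q²):

```python
import math

def rho(x: float) -> float:
    """(x) = x minus the nearest integer, taken to be 0 at half-integers."""
    d = x - round(x)
    return d if abs(d) < 0.5 else 0.0

def riemann_f(x: float, n_terms: int = 20000) -> float:
    """Truncation of Riemann's function f(x) = sum rho(n x)/n^2."""
    return sum(rho(n * x) / n**2 for n in range(1, n_terms + 1))

r, q = 0.5, 1                      # r = p/(2q) with p = 1, q = 1
eps = 1e-7                         # small enough that n*eps stays tiny for all n used
jump = riemann_f(r + eps) - riemann_f(r - eps)
expected = -math.pi**2 / (8 * q**2)   # f(r+) - f(r-) per Riemann's formula
```

With these parameters the truncation and perturbation errors are on the order of 10⁻⁴, so the computed jump agrees with −π²/8 ≈ −1.2337 to about three decimals.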
Prelude to the general SLLN

We saw back in Sections 2.4 and 4.2 that the “right way” to think about the Laws of Large Numbers is as statements concerning limits of functions. Let X = S∞ with S = {0, 1} be the space of infinite sequences of Bernoulli trials where on each trial 1 occurs with probability p and 0 with probability 1 − p, and let µ denote the infinite product measure. For each i ∈ N, define fi := χ_{Ai} : X → R, where
Ai = S × S × S × · · · × S × {1} × S × · · ·
such that the {1} occurs in the i-th slot. Observe that for any i, E(fi) = µ(Ai) = p. Let E (which is just p) denote the common expectation of all the fi’s. Then the SLLN is the statement that the event
(6.20) lim_{n→∞} (f1 + f2 + · · · + fn)/n = E
occurs with probability one. There are two properties of the fi’s that stand out. First, the values (namely, 0 and 1) of the fi’s are “distributed” exactly the same in the sense that each value occurs with the same probability regardless of i; explicitly, for any i, fi = 1 with probability p and fi = 0 with probability 1 − p. Because the values of the fi’s are distributed the same, we say that the fi’s are “identically distributed.” Second, it’s clear that the sets A1, A2, A3, . . . are independent. Because of this we say that f1, f2, f3, . . . are “independent.” The general SLLN is the statement (6.20) but for any independent and identically distributed (or “iid”) random variables. In Section 6.3 we study the notion of distributions, in Section 6.4 we study the notion of “iid” and in Section 6.6 we prove the general SLLN.

6.3. Probability distributions, mass functions and pdfs

The goal of this section is to understand . . .
6.3.1. Probability distributions. The measurable space (R, B), the real line with the Borel σ-algebra, plays a special role in real-life probabilities because numerical data, real numbers, is gathered whenever a random experiment is performed. It’s important to analyze this data probabilistically and for this reason we shall call a law on R any probability measure on (R, B); “law” in the sense that a measure gives a “rule” by which to judge the (probabilistic) behavior of data. Now data (numbers assigned to outcomes of an experiment) is described mathematically by random variables, so let (X, S , µ) be a probability space and recall that a random variable on X is just a measurable function on X. Given a random variable f : X → R, we are interested in questions revolving around the data described by f , such as “What is the likelihood that f takes values in a Borel set A ⊆ R?”
For example, if A = (10, 20), this is the question:
“What is the likelihood that f lies strictly between 10 and 20?” For a general Borel set A ⊆ R, the event that f takes values in the set A is {f ∈ A} = {x ∈ X ; f(x) ∈ A} = f⁻¹(A).
Note that f −1 (A) ∈ S because f is measurable. The likelihood, or probability, that this event occurs is given by µ{f ∈ A} = µ(f −1 (A)) = the probability that the values of f are in the set A. For different A’s, the numbers µ(f −1 (A)) tell us probabilistically how the values of f are “distributed” amongst different Borel sets. For this reason, it’s natural to define the (probability) distribution of f , or the law of f , as the measure Pf : B → [0, 1] defined by Pf (A) := µ{f ∈ A}
for all Borel sets A ⊆ R.
Note that Pf : B → [0, 1] is a measure because Pf(∅) = µ(f⁻¹(∅)) = µ(∅) = 0, Pf(R) = µ{f ∈ R} = µ(X) = 1, and if A1, A2, . . . are pairwise disjoint Borel sets, then
Pf(∪_{n=1}^∞ An) = µ(∪_{n=1}^∞ f⁻¹(An)) = ∑_{n=1}^∞ µ(f⁻¹(An)) = ∑_{n=1}^∞ Pf(An).
The measure Pf : B → [0, 1] contains everything you need to know concerning the probabilistic behavior of the data f represents. Here’s a picture of the situation:
Figure 6.7. Pf is a probability measure on (R, B) such that for all Borel sets A ⊆ R, Pf (A) = probability that f lies in A. (In the picture, A is an open interval and f −1 (A) = {f ∈ A} is an oval.)
6.3.2. Discrete probability distributions. Recall from Section 1.5 that if Ω is a countable set, then a probability mass function m : Ω → [0, 1] is a function satisfying
∑_{ω∈Ω} m(ω) = 1.
Such a mass function determines the discrete probability measure P : P(Ω) → [0, 1] via
P(A) := ∑_{ω∈A} m(ω) for all A ⊆ Ω,
where the summation is only over those (at most countably many) points ω ∈ A. Conversely, given a probability measure P on P(Ω), the function m(ω) := P{ω} defines a mass function whose corresponding measure is P. Such probability measures are related to discrete random variables, where a random variable f : X → R is said to be discrete if f has a countable range; otherwise f is called a continuous random variable. Suppose that f is discrete and denote the range of f by Ω. Then for any A ∈ B not intersecting Ω, we have f⁻¹(A) = ∅, so Pf(A) = 0; in particular, it’s common to restrict Pf to subsets of Ω. That is, we consider Pf as a map Pf : P(Ω) → [0, 1]. Then Pf has the mass function¹² m : Ω → R defined by
m(ω) := Pf{ω} = µ{f = ω} for all ω ∈ Ω.
The mass function m measures how much f is concentrated at each ω ∈ Ω. Given A ⊆ Ω, we have
Pf(A) = ∑_{ω∈A} m(ω),
where the sum on the right is summed only over those (at most countably many) points ω ∈ A.
Example 6.1. (A couple of well-known distributions) Let Ω be a finite set. Then the (discrete) uniform distribution on Ω is the probability measure on Ω with mass function m : Ω → R given by
m(ω) = 1/#Ω for all ω ∈ Ω.
Any random variable f with range Ω whose distribution is the uniform distribution is said to be uniformly distributed because its values are distributed with equal probabilities. Observe that if A ⊆ Ω, then
Pf(A) = #A/#Ω,
which is just the classical “fair” probability measure. Let p ∈ (0, 1) and suppose that Ω = {0, 1, 2, . . . , n} and m : Ω → R is given by the binomial mass function
m(k) := b(k; n, p) = (n choose k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n,
studied back in Section 2.5.2. Recall from Theorem 2.13 that m(k) is the probability that in a sequence of n Bernoulli trials we obtain exactly k successes (each success occurring with probability p). The corresponding measure on Ω is called the binomial distribution and is denoted by B(n, p). Any random variable f with such a distribution is said to be binomially distributed and we write f ∼ B(n, p). See Problem 1 for another common distribution.

¹² m is sometimes called the distribution function of f, although I don’t like using this name because it causes confusion with the (cumulative) distribution function in Theorem 6.12.
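The binomial mass function and the measure it determines via Pf(A) = ∑_{k∈A} m(k) are easy to compute directly; a minimal sketch (function names are ours, not the book’s):

```python
from math import comb

def binom_mass(k: int, n: int, p: float) -> float:
    """b(k; n, p): probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def law(A, n: int, p: float) -> float:
    """P_f(A) = sum of the mass over the points of A, A a subset of {0,...,n}."""
    return sum(binom_mass(k, n, p) for k in A)

# the mass function sums to 1 over Omega = {0, 1, ..., n}
total = law(range(5), 4, 0.3)
```

For instance, b(2; 4, 1/2) = (4 choose 2)(1/2)⁴ = 6/16 = 0.375.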
Binomially distributed random variables can be obtained experimentally from the Galton board, named after Francis Galton (1822–1911) and looked at in the notes and references on Chapter 1. Consider a Galton board with four rows as seen in Figure 6.8.

Figure 6.8. A Galton board with n = 4 rows of pegs and bins labeled 0–4 where the balls eventually land.

A ball is dropped at the top and suppose that for some p ∈ (0, 1), when the ball hits a peg it bumps to the right (a “success”) with probability p and to the left (a “failure”) with probability 1 − p. By studying this figure, one can see that in order for the ball to land in bin k, the ball must have exactly k successful bumps. Thus, if f is the random variable: f = the bin in which the ball lands, then f is binomially distributed with n = 4. Of course, the same can be said for any (finite) number n of rows of pegs. This is an abstract definition of f in the sense that we haven’t said what the sample space is (thus leaving out the domain of the function f).¹³ Of course, it’s easy to define a sample space describing this experiment. Indeed, let S = {0, 1} = {left bump, right bump} with probabilities of p for 1 and 1 − p for 0. Then a sample space is Sⁿ with the product measure, and f : Sⁿ → R is the function
f(x1, x2, . . . , xn) = x1 + x2 + · · · + xn.
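The Galton board is also easy to simulate, and the empirical bin frequencies line up with the binomial mass function. A minimal sketch with a fixed seed (all names are ours):

```python
import random
from math import comb

def galton(n_rows: int = 4, p: float = 0.5, balls: int = 200_000, seed: int = 0):
    """Drop `balls` balls through a Galton board and return bin frequencies."""
    rng = random.Random(seed)
    bins = [0] * (n_rows + 1)
    for _ in range(balls):
        # number of right bumps = bin in which the ball lands
        k = sum(rng.random() < p for _ in range(n_rows))
        bins[k] += 1
    return [c / balls for c in bins]

freqs = galton()
pmf = [comb(4, k) * 0.5**4 for k in range(5)]   # b(k; 4, 1/2)
```

With 200,000 balls the frequencies match b(k; 4, 1/2) = (1/16, 4/16, 6/16, 4/16, 1/16) to within about 0.01.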
To encompass all n ∈ N simultaneously, it is convenient to let X = S∞ with the infinite product measure; then given n ∈ N, f = Sn = the random variable giving the number of successes on the first n trials. From Section 2.5 we know that Sn looks much like the normal density function, a particular case of a probability density function, which we now describe.
6.3.3. Probability density functions. Here, a probability density function, or pdf, is a Lebesgue measurable function φ : R → [0, ∞) such that
∫_R φ(x) dx = 1.
¹³ In this business it’s common to describe random variables abstractly; that is, describing them in terms of what is observed without specifying the sample space. In fact, after a fascinating lecture of an expert probabilist on random variables dealing with the stock market, I asked him on what sample space his random variables were defined; he had no idea!
Such a pdf determines a law
P : B → [0, 1]
by
P(A) := ∫_A φ(x) dx for all Borel sets A ⊆ R.
(Of course, not all laws have pdfs, such as measures vanishing everywhere except on a Borel set of measure zero.) Observe that P(A) is just the area under φ(x) and above A:
P(A) = ∫_A φ(x) dx = the shaded area under the graph of φ over A.
The most celebrated laws with pdfs are the normal distributions.
Example 6.2. (Normal distributions) The most famous pdf is
φ(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},
which is called a normal density function, where µ ∈ R is called the mean and σ > 0 the standard deviation, with σ² called the variance (we shall discuss these terms in Subsection 6.4.3). The law corresponding to a normal density function is called a normal distribution and is denoted by N(µ, σ); thus,
N(µ, σ) : B → [0, 1]
is the measure defined on a Borel set A ⊆ R by
N(µ, σ)(A) := (1/√(2πσ²)) ∫_A e^{−(x−µ)²/(2σ²)} dx.
The standard normal distribution is the measure N(0, 1). The graph of φ(x) has the ubiquitous bell curve shape centered at µ, and N(µ, σ)(A) is the shaded area under the bell curve over A.
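For intervals, N(µ, σ)(a, b] can be computed from the error function erf, since the normal integral has no elementary antiderivative; a minimal sketch (the function names are ours, not the book’s):

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """The cdf of N(mu, sigma), expressed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def normal_law(a: float, b: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """N(mu, sigma)(a, b]: the normal measure of the interval (a, b]."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```

For example, normal_law(-1, 1) recovers the familiar fact that about 68.3% of the standard normal mass lies within one standard deviation of the mean.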
Now let f : X → R be a random variable and let Pf denote its distribution. If a pdf exists for the measure Pf , then f must be continuous, meaning not discrete; however, the converse is false because there are continuous random variables not having pdfs (see Problem 4 for the proof). Amongst those continuous random variables with pdfs, there is a special place for those said to be normally distributed, which means their pdfs are normal densities. We usually write f ∼ N (µ, σ), when the pdf is a normal density function with mean µ and standard deviation σ.
A normally distributed random variable can be obtained by letting (X, S , µ) = (R, B, N (µ, σ))
and putting f : X → R as the identity function
f (x) = x for all x ∈ X = R.
Then given any Borel set A ⊆ R, we have
Pf (A) = µ{f ∈ A} = N (µ, σ){x ; x ∈ A} = N (µ, σ)(A).
Hence, Pf = N(µ, σ) and f is normally distributed. Of course, replacing N(µ, σ) by an arbitrary probability measure on R, this trick allows one to produce a random variable whose distribution is exactly the given measure. In fact, this example (except for trivial modifications) is the only normally distributed random variable we can give without going into mathematics outside the scope of this book!¹⁴ What makes the normal distribution so important is not that (exactly) normally distributed random variables are everywhere, but that, as we discussed in Section 2.5, approximately normal random variables are everywhere, where approximately normal means that
Pf(A) ≈ ∫_A φ(x) dx
for some normal density function φ, where the approximation “≈” depends on the situation. For example, as we were discussing immediately before this example, let Sn be the random variable on S∞ giving the number of successes on the first n trials of a Bernoulli sequence. Then from the De Moivre-Laplace Theorem (Theorem 2.15) we know that if fn is the random variable
fn := (Sn − np)/√(npq),
where q = 1 − p, then for any interval I ⊆ [−∞, ∞], we have
lim_{n→∞} µ{fn ∈ I} = (1/√(2π)) ∫_I e^{−x²/2} dx.
That is, for all intervals I ⊆ [−∞, ∞],
(6.21) lim_{n→∞} Pfn(I) = N(0, 1)(I).
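The convergence in (6.21) is visible numerically: for a moderately large n, the exact binomial probability of an interval is already close to the corresponding N(0, 1) value. A minimal sketch (names and the particular n, p, a, b are our illustrative choices; log-gamma avoids floating-point underflow in the binomial mass):

```python
from math import erf, exp, lgamma, log, sqrt

n, p = 10_000, 0.3
q = 1.0 - p
sd = sqrt(n * p * q)

def binom_pmf(k: int) -> float:
    """Binomial(n, p) mass at k, computed with log-gamma to avoid underflow."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(q))

a, b = -1.0, 2.0
# mu{ a < (S_n - np)/sqrt(npq) <= b } = mu{ lo < S_n <= hi }
lo, hi = n * p + a * sd, n * p + b * sd
exact = sum(binom_pmf(k) for k in range(int(lo) + 1, int(hi) + 1))
limit = 0.5 * (erf(b / sqrt(2)) - erf(a / sqrt(2)))   # N(0, 1)(a, b]
```

Here the exact binomial probability and the limiting normal value agree to about two decimal places; the residual discrepancy is the usual O(1/√(npq)) discreteness effect.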
See Problems 5 and 7 for related results.
Here’s another common example of a probability distribution with a pdf.
Example 6.3. (Continuous uniform distributions) Given an interval [a, b] with −∞ < a < b < ∞, the uniform (or rectangular) distribution on the interval [a, b], denoted by U(a, b), is the law U(a, b) : B → [0, 1] with pdf
φ(x) := 1/(b − a) if x ∈ [a, b], and φ(x) := 0 otherwise.
The graph of φ is a rectangle of height 1/(b − a) over [a, b], and for a set A ⊆ [a, b],
U(a, b)(A) = ∫_A 1/(b − a) dx = m(A)/(b − a) = the shaded area over A.
U(a, b) is “uniform” over [a, b] in the sense that it assigns equal probabilities to subsets of [a, b] with the same length. A random variable f is said to be uniformly distributed if its probability distribution is a uniform distribution on some interval [a, b], in which case we write f ∼ U(a, b). Generally speaking, if the range of a random variable lies in an interval [a, b] and its values are “equally likely” to lie anywhere in [a, b], then the random variable is uniformly distributed. Some examples of uniformly distributed random variables include (under appropriate conditions) spinning a needle on a circular dial and observing where it stops, and also picking a number at random from a given interval. See Problem 2 for another common distribution.

¹⁴ One such example involves Brownian motion; see [109, Ch. 5] for a thorough treatment.
6.3.4. (Cumulative) distribution functions. One can also approach probability distributions through Lebesgue-Stieltjes set functions associated to special nondecreasing functions. A (cumulative) distribution function (cdf) is a function F : R → [0, 1] such that
(i) F is nondecreasing; (ii) F is right continuous; (iii) lim_{x→−∞} F(x) = 0; (iv) lim_{x→∞} F(x) = 1.
Cdfs are important because they characterize laws.

Laws and cdfs

Theorem 6.12. Laws are in one-to-one correspondence with cdfs in the following sense: Given a cdf F : R → R, its corresponding Lebesgue-Stieltjes measure µF : B → [0, 1] is a law (probability measure). Moreover, given a law µ : B → [0, 1], the function F : R → R defined by
F(x) := µ(−∞, x] for all x ∈ R,
is the unique cdf whose corresponding Lebesgue-Stieltjes measure is µ.

Proof: We shall leave you to prove the first statement and the uniqueness part of the second statement. Let µ be a law and define F : R → R as stated. We need to show that F is a cdf such that µF = µ.
Proof that F is a cdf: Since µ is monotone it follows that F is nondecreasing. To prove that F is right continuous, it suffices to show that if a ∈ R and {an} is a nonincreasing sequence of points approaching a, then F(an) → F(a). To see this, observe that
(−∞, a] = ∩_{n=1}^∞ (−∞, an],
hence by continuity of measures, we have
µ(−∞, a] = lim_{n→∞} µ(−∞, an].
That is, F(a) = lim F(an), which shows that F is right continuous. If we put a = −∞ (in this case, (−∞, −∞] = ∅) and {an} is a nonincreasing sequence approaching −∞, the same argument shows that 0 = µ(∅) = lim µ(−∞, an]; that is, 0 = lim F(an), which shows that lim_{x→−∞} F(x) = 0. Finally, if {an} is any nondecreasing sequence with an → ∞, then as
R = ∪_{n=1}^∞ (−∞, an],
by continuity and using the fact that µ(R) = 1 since µ is a probability measure, we have
1 = µ(R) = lim_{n→∞} µ(−∞, an].
This shows that 1 = lim F(an), which proves that lim_{x→∞} F(x) = 1.
Proof that µF = µ: If (a, b] ∈ I¹ with a < b, then by subtractivity of measures we have
µF(a, b] = F(b) − F(a) = µ(−∞, b] − µ(−∞, a] = µ((−∞, b] \ (−∞, a]) = µ(a, b].
Thus, µF = µ on I¹; by the Extension theorem it follows that µF = µ on B = S(I¹) as well.
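The correspondence in Theorem 6.12 is concrete even for a purely atomic law. The sketch below (our own toy example, not from the text) takes the law of a p-coin, concentrated on {0, 1}, builds its cdf F(x) = µ(−∞, x], and recovers interval probabilities via µF(a, b] = F(b) − F(a):

```python
# A discrete law mu on R concentrated on {0, 1}: mass 1-p at 0 and p at 1.
p = 0.3
mass = {0: 1 - p, 1: p}

def F(x: float) -> float:
    """cdf of the law: F(x) = mu(-infinity, x]."""
    return sum(m for point, m in mass.items() if point <= x)

def mu_F(a: float, b: float) -> float:
    """Lebesgue-Stieltjes measure of the half-open interval (a, b]."""
    return F(b) - F(a)
```

Note F jumps at each atom but is right continuous there, exactly as conditions (i)–(iv) require, and µF(−1, 0] recovers the mass 1 − p at 0.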
Given a random variable f : X → R on a measure space (X, S, µ), the (cumulative) distribution function (cdf) of f is the cdf of its law Pf; thus, the cdf of f is the function F : R → R defined by F(x) := µ{f ≤ x}, where µ{f ≤ x} is shorthand for µ{f ∈ (−∞, x]}, which equals Pf(−∞, x]. By the previous theorem we know that the Lebesgue-Stieltjes measure µF and the distribution Pf are identical laws on B. Thus, the study of probability distributions of random variables is really the study of Lebesgue-Stieltjes measures (of cdfs)!

◮ Exercises 6.3.

1. (Poisson distribution) In this and the next problem we introduce two important probability distributions by describing two different models of particle decay. Suppose that we have a large number of radioactive particles, say n of them, and we observe their decay over a time interval [0, t] (say time is in hours). Assume that the average rate of decay during the time interval [0, t] is µ/hour, meaning that on average, µ particles decay per hour during our observation interval [0, t].
(i) Let k ∈ N0 := {0, 1, 2, . . .}. Treat each of the n particles as a Bernoulli trial: Assign it a 1 “success” if it decays in the interval [0, t] and a 0 “failure” if it does not decay. Argue that the probability a given particle decays in the interval [0, t] is tµ/n. Conclude that, in the time interval [0, t], the probability the number of decays is k is exactly the binomial mass function
b(k; n, tµ/n) = (n choose k) (tµ/n)^k (1 − tµ/n)^{n−k}.
Conclude that the law describing probabilistically the number of decays in the interval [0, t] is given by the binomial distribution B(n, p) where p = tµ/n.
(ii) Prove that
lim_{n→∞} b(k; n, tµ/n) = ((tµ)^k / k!) e^{−tµ}.
(iii) The function m : N0 → [0, ∞) defined by m(k) = ((tµ)^k / k!) e^{−tµ} for all k ∈ N0 is called the Poisson mass function. Prove that m is a probability mass function; its corresponding measure on N0 is called the Poisson distribution, which we
denote by Pois(tµ). This distribution is named after Siméon-Denis Poisson (1781–1840). In conclusion, you have proven that for n large,¹⁵
B(n, tµ/n) ≈ Pois(tµ);
or in terms of our radioactive decay model, if the average rate of decay is µ/hour in a time interval [0, t] hours, then for n large,
the probability the number of decays is k ≈ ((tµ)^k / k!) e^{−tµ}.
Note that since the probability of a success is p = tµ/n, we can write this approximation as
the probability the number of decays is k ≈ ((np)^k / k!) e^{−np}.
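The Poisson limit in Parts (ii)–(iii) can be checked numerically: for large n the binomial masses b(k; n, tµ/n) are already very close to the Poisson masses. A minimal sketch (the particular tµ and n are our illustrative choices):

```python
from math import comb, exp, factorial

def binom_mass(k: int, n: int, p: float) -> float:
    """b(k; n, p), the binomial mass function."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_mass(k: int, lam: float) -> float:
    """Poisson mass function: lam^k e^{-lam} / k!."""
    return lam**k / factorial(k) * exp(-lam)

t_mu = 2.0        # expected number of decays in [0, t]
n = 10_000        # number of particles (large)
gap = max(abs(binom_mass(k, n, t_mu / n) - poisson_mass(k, t_mu))
          for k in range(10))
```

With n = 10,000 the largest discrepancy over k = 0, …, 9 is already well below 10⁻², illustrating B(n, tµ/n) ≈ Pois(tµ).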
2. (Exponential distribution) We now make a different model for radioactive decay. Assume as before that we have n particles to begin with. We now do not work on a fixed time interval but we let t vary. Let N(t) = number of particles remaining after t hours and we assume that the rate at which N(t) changes is some fixed proportion r, called the decay constant, of the number of particles present at time t; that is, we assume that for some constant r ∈ (0, 1),
N′(t) = −r N(t) for all t > 0.
(i) Using that N(0) = n, prove that N(t) = n e^{−rt} for all t ≥ 0.
(ii) Show that the proportion of particles that have decayed during the time interval [0, t] is 1 − e^{−rt}. From this fact, argue that the probability that a given particle decays in the time interval [0, t] is also 1 − e^{−rt}.
(iii) If Pt = the probability that a given particle decays in the time interval [0, t], prove that
Pt = ∫_0^t r e^{−rx} dx.
(iv) Prove that the function φ : R → [0, ∞) defined by φ(x) := r e^{−rx} for x ≥ 0 and φ(x) = 0 otherwise, is a probability density function; such a density function is called an exponential density function. Then show that if µ : B → [0, 1] is the law whose pdf is φ, called an exponential distribution, then for any interval I ⊆ R, µ(I) = the probability that a given particle decays in the time interval I.
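The closed form Pt = 1 − e^{−rt} from Parts (ii)–(iii) can be sanity-checked against a direct numerical integration of the exponential density; a minimal sketch with illustrative values of r and t (our own choices):

```python
from math import exp

r, t = 0.25, 3.0            # illustrative decay constant (per hour) and horizon

closed_form = 1 - exp(-r * t)          # P_t = 1 - e^{-rt}

# midpoint-rule approximation of  integral_0^t  r e^{-rx} dx
steps = 100_000
dx = t / steps
numeric = sum(r * exp(-r * (i + 0.5) * dx) * dx for i in range(steps))
```

The two values agree to many decimal places, confirming that the exponential density integrates over [0, t] to the decay probability.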
(ii) To relate Formula (6.22) to Problem 1, recall from Problem 2 that the number of particles remaining after t hours equals n e−rt . Prove that the average rate of
15 We use the term “large” meaning “Given any error bound, we can take n as large as necessary to get an approximation within the error bound.”
decay of the number of particles in the time interval [0, t] is n(1 − e^{−rt})/t. Thus, using the notation from Problem 1, we have
(6.23) µ = n(1 − e^{−rt})/t;
solving this equation for e^{−rt} and plugging the result into (6.22), obtain for the time interval [0, t] and for k ∈ N0,
the probability the number of decays is k = (n choose k) (tµ/n)^k (1 − tµ/n)^{n−k}.
This is the formula in Part (i) of Problem 1.
(iii) We now relate (6.23) to the Poisson mass function. Fix t > 0. Recall in Part (iii) of Problem 1 we got the Poisson mass function assuming (1) the number of particles n is large and (2) the average number of decays/hour, µ, is some constant. Unfortunately, in our case, µ = n(1 − e^{−rt})/t, so µ depends on n and is not constant! For this reason, let us assume the decay constant r is inversely proportional to n; specifically, we assume that r = λ/n for some constant λ. Explain why for n large, µ ≈ λ; thus, µ is approximately constant! From this, conclude that from our assumption on r, for the time interval [0, t] and for n large,
the probability the number of decays is k ≈ ((tµ)^k / k!) e^{−tµ},
exactly as before. 4. (Problems on probability distributions) (a) Prove that if a pdf exists for the probability distribution of a random variable f , then f must be continuous (= not discrete), which means the range of f is uncountable. (b) Let X = [0, 1] with Lebesgue measure and let ψ : X → R be Cantor’s function. Show that ψ does not have a pdf. (c) Let X = S ∞ with S = {0, 1} and consider the infinite product measure on X assigning probabilities 1/2 to both 0 and 1 on each factor. Prove that the function f : X → R defined by f (x1 , x2 , x3 , . . .) :=
∑_{n=1}^∞ xn/2ⁿ
for all (x1 , x2 , x3 , . . .) ∈ X,
is measurable and Pf = m = Lebesgue measure on Borel subsets of [0, 1].
5. Using the notation from the De Moivre-Laplace Theorem, we know that
(6.24) lim_{n→∞} µn(A) = ν(A),
where A = (a, b] for any a, b ∈ [−∞, ∞], µn(A) := µ{(Sn − np)/√(npq) ∈ A}, and ν = N(0, 1). Does (6.24) hold when A ⊆ R is a Borel set? The answer is “sometimes”; to understand what this means, proceed as follows.
(i) If U ⊆ R is open, prove that ν(U) ≤ lim inf µn(U). Suggestion: Write U = ∪_{k=1}^∞ Ik where I1, I2, . . . ∈ I¹ are pairwise disjoint. Then µn(U) ≥ ∑_{k=1}^N µn(Ik) for any N. Take lim inf of both sides, then take N → ∞.
(ii) If C ⊆ R is closed, prove that lim sup µn(C) ≤ ν(C).
(iii) If A ⊆ R is a Borel set whose boundary has Lebesgue measure zero, prove that lim µn(A) = ν(A). (Recall that the boundary of A is Ā \ A⁰ where Ā is the closure of A and A⁰ is the interior of A.)
(iv) Find a set A ⊆ R such that lim µn(A) ≠ ν(A).
6. (Convergence of distribution functions) Let F, F1, F2, . . . be distribution functions on R where F is continuous and suppose that Fn → F pointwise. In this problem we prove that Fn → F uniformly on R. Fix ε > 0; we must show that there is an N such that |Fn(x) − F(x)| < ε for all x ∈ R and n > N. Proceed as follows.
(i) Show that there are finitely many extended real numbers −∞ = a0 < a1 < a2 < · · · < ak = ∞ such that F(ai+1) − F(ai) < ε for i = 0, . . . , k − 1, where we define F(−∞) := 0 and F(∞) := 1.
(ii) Find an N such that |Fn(ai) − F(ai)| < ε for all i = 0, 1, . . . , k − 1 and n > N.
(iii) Show that |Fn(x) − F(x)| < ε for all x ∈ R and n > N.
7. (More normal approximations) With the set-up of the De Moivre-Laplace Theorem, we know that for all −∞ ≤ a ≤ b ≤ ∞,
lim_{n→∞} µ{a < (Sn − np)/√(npq) ≤ b} = N(0, 1)(a, b].
It’s often convenient to have a related statement just for Sn rather than for the somewhat complicated (Sn − np)/√(npq). In fact, it’s often said that “Sn is approximately normal with mean np and standard deviation √(npq),” by which we mean for all −∞ ≤ a ≤ b ≤ ∞,
√
npq)(a, b] → 0
as n → ∞.
In this problem we prove this fact. −np and F the cdf of N (0, 1). Show that the above (i) Let Fn denote the cdf of S√nnpq limit is equivalent to the limit µFn (an , bn ] − µF (an , bn ] → 0
as n → ∞
(∗),
b−np √ where an = a−np and bn = √ . npq npq (ii) Prove that Fn → F pointwise. Then assuming the result from the previous problem, that Fn → F uniformly as well, prove (∗). 8. (Scheff´ e’s theorem, after HenryR Scheff´e (1907–1977) [249, 228]) Let {fn } be a sequence of pdfs (thus, fn ≥ 0 and fn = 1 for each n) and suppose that f := lim fn n→∞
exists a.e. and f is also a pdf. In this problem we prove that for all Lebesgue measurable sets A ⊆ R, Z Z lim fn = lim fn . A
A
This result is interesting because it gives the conclusion of the DCT, although there is no mention of a dominating function (indeed, a dominating function may not exist). R (i) Prove that |fn − f | → 0 as n → ∞ and from this result prove Scheff´e’s theorem. Suggestion: Apply Fatou’s lemma to gn := fn + f − |fn − f |. (ii) Here’s a situation for whichPScheff´e’s theorem applies, but the DCT does not. n For each n ∈ N, let an = k=1 1/k and define fn : R → R by fn (x) = 1 for 1/(n + 1) ≤ x ≤ 1 and for Ran ≤ x ≤ aRn+1 and fn (x) = 0 otherwise. Using Scheff´e’s theorem prove that A fn → lim A fn for all Lebesgue measurable sets A ⊆ R; however, prove that there is no integrable function g : R → R such that |fn | ≤ g for all n.
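A quick numerical companion to part (ii) of Scheffé's theorem exercise (a sketch; the helper names are mine, not the text's): the a.e. limit of the f_n defined there is f = χ_(0,1], and ∫ |f_n − f| = 1/(n + 1) + (a_{n+1} − a_n) = 2/(n + 1) → 0, which is exactly the L¹ convergence that part (i) establishes. The computation below checks this with exact rational arithmetic.

```python
from fractions import Fraction

def harmonic(n):
    # a_n = sum_{k=1}^n 1/k, computed exactly
    return sum(Fraction(1, k) for k in range(1, n + 1))

def l1_distance(n):
    # f_n = 1 on [1/(n+1), 1] and on [a_n, a_{n+1}]; the a.e. limit is f = 1 on (0, 1].
    # |f_n - f| = 1 on (0, 1/(n+1)) and on [a_n, a_{n+1}], so
    # the integral of |f_n - f| equals 1/(n+1) + (a_{n+1} - a_n) = 2/(n+1).
    gap = Fraction(1, n + 1)                # mass f_n misses inside (0, 1]
    bump = harmonic(n + 1) - harmonic(n)    # mass f_n carries far to the right
    return gap + bump

for n in [1, 10, 100]:
    print(n, l1_distance(n))    # equals 2/(n+1), which tends to 0
```

Note why no dominating function can exist: each f_n carries mass 1/(n + 1) on [a_n, a_{n+1}] and a_n → ∞, so any g with g ≥ |f_n| for all n would have to be ≥ 1 on all of [a_1, ∞), hence non-integrable.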
6.4. Independence and identically distributed random variables
The goal of this section is to understand what iid random variables are. As a reward for our work, we give a probabilistic proof of the Weierstrass Approximation Theorem and we also study the celebrated Stone–Weierstrass Theorem.
6.4.1. Id random variables. Given random variables f_1, f_2, . . . : X → R, we say that f_1, f_2, . . . are identically distributed (or id) if P_{f_i} = P_{f_j} for all i, j. Thus, "identically distributed" just means having "identical distributions."
Example 6.4. (Repeating an experiment) One standard way to generate identically distributed random variables is through repeating an experiment countably many times. Let S be a sample space with a probability measure µ_0 : I → [0, 1] on some semiring I of subsets of S and let f : S → R be a random variable. Let X = S^∞ with probability measure µ : S(C) → [0, 1], the infinite product of µ_0 with itself, where C is the collection of cylinder sets generated by I. For each i ∈ N define
(6.25) f_i : X → R by f_i(x_1, x_2, . . .) = f(x_i).
For example, if S = {0, 1} and f(x) = x (thus, f(x) = 0 when x = 0 and f(x) = 1 when x = 1), then f_i = χ_{A_i} where A_i = S × S × · · · × S × {1} × S × · · · with {1} occurring in the i-th slot. These f_i's are exactly the random variables that occurred in Bernoulli's theorem and Borel's Strong Law of Large Numbers we covered back in Sections 2.4 and 4.2.
In the general case in (6.25) we claim that f_1, f_2, . . . are identically distributed, that is, have the same distribution. To see this, let A ⊆ R be a Borel set and observe that
f_i^{−1}(A) := {(x_1, x_2, . . .) ; f_i(x_1, x_2, . . .) ∈ A} = {(x_1, x_2, . . .) ; f(x_i) ∈ A} = S × S × · · · × S × {f ∈ A} × S × S × · · · ,
where {f ∈ A} occurs in the i-th factor. Thus, P_{f_i}(A) = µ_0{f ∈ A}, which is independent of i. Hence, P_{f_i}(A) = P_{f_j}(A) for any i and j and therefore f_1, f_2, . . . are identically distributed.
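The claim of Example 6.4 can be seen empirically: repeating a biased coin flip and looking at different coordinates gives the same distribution. Here is a minimal simulation sketch (the seed, parameters, and variable names are mine, not from the text); with many trials, the empirical frequencies of f_1 = 1 and f_5 = 1 should both approximate µ_0{f = 1} = p.

```python
import random

# Repeat the coin-flip experiment of Example 6.4 and check empirically
# that the coordinate random variables f_1 and f_5 have the same distribution.
random.seed(0)
p = 0.3            # probability assigned to the outcome 1 in S = {0, 1}
trials = 100_000

count_f1 = count_f5 = 0
for _ in range(trials):
    # a sample point truncated to (x_1, ..., x_10) in S^10; f_i reads the i-th slot
    x = [1 if random.random() < p else 0 for _ in range(10)]
    count_f1 += x[0]    # f_1(x) = x_1
    count_f5 += x[4]    # f_5(x) = x_5

# Both empirical values of P_{f_1}{1} and P_{f_5}{1} should be close to p.
print(count_f1 / trials, count_f5 / trials)
```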
Generally speaking, id random variables have identical characteristics insofar as these characteristics can be expressed in terms of probabilities; for example, they have the same expected values. In order to prove this fact, we first prove the following theorem, whose proof is a typical use of the "Principle of Appropriate Functions" as explained in Section 5.5.2.
Integration and distributions
Theorem 6.13. If f : X → R is a random variable and ϕ : R → R is a Borel measurable function, then
∫_X ϕ(f) dµ = ∫_R ϕ dP_f
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal. In particular, provided f is integrable, we have
E(f) = ∫_R x dP_f.
Proof : Given a random variable f : X → R, we will prove that
(6.26) ∫_X ϕ(f) dµ = ∫_R ϕ dP_f
holds for any Borel measurable function ϕ : R → R (for which the integrals are defined) using the "Principle of Appropriate Functions": We first prove (6.26) when ϕ is a characteristic function, then for nonnegative simple functions, then for nonnegative measurable functions, and then finally for arbitrary functions.
First, let A be a Borel subset of R; we need to show that
∫_X χ_A(f) dµ = ∫_R χ_A dP_f.
To see this, observe that for any point x ∈ X, we have
χ_A(f(x)) = 1 ⟺ f(x) ∈ A ⟺ x ∈ f^{−1}(A) ⟺ χ_{f^{−1}(A)}(x) = 1.
Thus, we obtain the very useful formula
(6.27) χ_A(f) = χ_{f^{−1}(A)}.
Hence,
∫_X χ_A(f) dµ = ∫_X χ_{f^{−1}(A)} dµ = µ(f^{−1}(A)) (def. of integral)
= P_f(A) (def. of P_f)
= ∫_R χ_A dP_f. (def. of integral)
Thus, (6.26) holds for characteristic functions. By linearity of the integral, (6.26) holds for simple functions: If s = Σ_{n=1}^N a_n χ_{A_n}, where the a_n's are nonnegative and the A_n's are Borel sets, then
∫_X s(f) dµ = ∫_X Σ_{n=1}^N a_n χ_{A_n}(f) dµ = Σ_{n=1}^N a_n ∫_X χ_{A_n}(f) dµ
= Σ_{n=1}^N a_n ∫_R χ_{A_n} dP_f = ∫_R Σ_{n=1}^N a_n χ_{A_n} dP_f = ∫_R s dP_f.
If ϕ is a nonnegative measurable function, then writing ϕ = lim s_n as a nondecreasing limit of nonnegative simple functions, it follows that ϕ(f) = lim s_n(f) is also a nondecreasing limit of measurable functions, so by the Monotone Convergence Theorem, we have
∫_X ϕ(f) dµ = lim_{n→∞} ∫_X s_n(f) dµ = lim_{n→∞} ∫_R s_n dP_f = ∫_R ϕ dP_f.
Thus, (6.26) holds for any nonnegative measurable function ϕ. Finally, if ϕ is an arbitrary Borel measurable function, then writing ϕ = ϕ_+ − ϕ_− as the difference of its nonnegative and nonpositive parts, it follows that
∫_X ϕ_+(f) dµ = ∫_R ϕ_+ dP_f and ∫_X ϕ_−(f) dµ = ∫_R ϕ_− dP_f.
Hence, ϕ(f) is µ-integrable if and only if ϕ is P_f-integrable, in which case
∫_X ϕ(f) dµ := ∫_X ϕ_+(f) dµ − ∫_X ϕ_−(f) dµ = ∫_R ϕ_+ dP_f − ∫_R ϕ_− dP_f =: ∫_R ϕ dP_f.
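Theorem 6.13 can be checked concretely on a finite sample space, where both integrals are finite sums. The following sketch (the four-point space and the weights are made up for illustration, not from the text) computes ∫_X ϕ(f) dµ directly and then again through the distribution P_f, using exact rational arithmetic.

```python
from fractions import Fraction

# Check \int_X phi(f) dmu = \int_R phi dP_f on a four-point sample space.
X = ['a', 'b', 'c', 'd']
mu = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 8), 'd': Fraction(1, 8)}
f = {'a': 1, 'b': 2, 'c': 2, 'd': 5}    # a random variable X -> R
phi = lambda t: t * t                   # a Borel function, here phi(t) = t^2

# Left-hand side: integrate phi(f) over X against mu.
lhs = sum(phi(f[x]) * mu[x] for x in X)

# Right-hand side: P_f is supported on {1, 2, 5}, with P_f{y} = mu{f = y}.
Pf = {}
for x in X:
    Pf[f[x]] = Pf.get(f[x], Fraction(0)) + mu[x]
rhs = sum(phi(y) * w for y, w in Pf.items())

print(lhs, rhs)    # the same rational number, 41/8
```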
Integration and pdfs
Corollary 6.14. If f : X → R is a random variable with probability distribution function φ : R → [0, ∞) and ψ : R → R is a Borel measurable function, then
∫_X ψ(f) dµ = ∫_R ψ(x) φ(x) dx
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal. In particular, provided f is integrable, we have
E(f) = ∫_R x φ(x) dx.
Proof : Given that
P_f(A) = ∫_A φ(x) dx for all Borel sets A ⊆ R,
and that ∫_X ψ(f) dµ = ∫_R ψ dP_f from the preceding theorem, all we have to do is check that
∫_R ψ dP_f = ∫_R ψ(x) φ(x) dx.
This is an application of the Principle of Appropriate Functions, so we shall leave this as an exercise for you. (We have two more PAF proofs to do in this section, so we won't bore you with another one; in any case, a slightly more general version of this exercise was given in Problem 10 of Exercises 5.5.)
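Corollary 6.14 is easy to test numerically for a familiar pdf. Below is a sketch (the exponential pdf, the cut-off at 40, and the `integrate` helper are my choices, not the text's): for φ(x) = λe^{−λx} on [0, ∞) the pdf integrates to 1 and E(f) = ∫ x φ(x) dx = 1/λ.

```python
import math

# For a pdf phi, E(f) = \int x phi(x) dx. Check numerically with the
# exponential pdf phi(x) = lam * exp(-lam * x), whose mean is 1/lam.
lam = 2.0
phi = lambda x: lam * math.exp(-lam * x)

def integrate(g, a, b, n=200_000):
    # simple trapezoid rule on [a, b]
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b)) + sum(g(a + k * h) for k in range(1, n))
    return total * h

mass = integrate(phi, 0.0, 40.0)                     # total mass, ~ 1
mean = integrate(lambda x: x * phi(x), 0.0, 40.0)    # E(f), ~ 1/lam = 0.5
print(mass, mean)
```

The tail beyond x = 40 contributes e^{−80}, far below the quadrature error, so truncating the integral there is harmless.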
The following theorem contains some important properties of distributions and identically distributed random variables. The proof is another typical use of the "Principle of Appropriate Functions."
Properties of id R.V.'s
Theorem 6.15. Let f : X → R and g : X → R be identically distributed random variables. Then for any Borel measurable function ϕ : R → R,
(1) ϕ(f) and ϕ(g) are also identically distributed.
(2) We have
∫_X ϕ(f) dµ = ∫_X ϕ(g) dµ
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal.
Proof : To prove (1), observe that for any Borel set A ⊆ R,
P_{ϕ(f)}(A) = µ(ϕ(f)^{−1}(A))
= µ(f^{−1}(ϕ^{−1}(A))) (since (ϕ ◦ f)^{−1} = f^{−1} ◦ ϕ^{−1})
= P_f(ϕ^{−1}(A))
= P_g(ϕ^{−1}(A)) (since f and g are id)
= µ(g^{−1}(ϕ^{−1}(A)))
= µ(ϕ(g)^{−1}(A))
= P_{ϕ(g)}(A).
To prove (2), observe that if f and g are identically distributed, then P_f = P_g, so
∫_R ϕ dP_f = ∫_R ϕ dP_g,
provided one integral (and hence both) is defined. By Theorem 6.13 the left and right sides equal ∫_X ϕ(f) dµ and ∫_X ϕ(g) dµ, respectively. This concludes our proof.
Example 6.5. By (2) of Theorem 6.15, if f and g are identically distributed, then with ϕ(x) = max{x, 0} we see that f+ and g+ are identically distributed, and with ϕ(x) = − min{x, 0} we see that f− and g− are identically distributed. With ϕ(x) = |x|, we also see that |f | and |g| are identically distributed.
6.4.2. Independent random variables. Now that we know what identically distributed random variables are, we define independent random variables, which intuitively speaking are random variables whose values are distributed independently. More precisely, real-valued random variables f_1, f_2, f_3, . . . are independent if the events {f_1 ∈ B_1}, {f_2 ∈ B_2}, {f_3 ∈ B_3}, . . . are independent for any family B_1, B_2, B_3, . . . ⊆ R of Borel sets. Written another way, we require that the subsets of X,
f_1^{−1}(B_1), f_2^{−1}(B_2), f_3^{−1}(B_3), . . . ,
be independent. Here, we recall that events A_1, A_2, A_3, . . . are independent means that for any finite subcollection A_i, A_j, . . . , A_k of A_1, A_2, A_3, . . ., we have
µ(A_i ∩ A_j ∩ · · · ∩ A_k) = µ(A_i) µ(A_j) · · · µ(A_k).
We say that f_1, f_2, f_3, . . . are pairwise independent if for any i ≠ j, f_i, f_j are independent. Finally, we say that random variables f_1, f_2, f_3, . . . are iid if they are independent and identically distributed.
Example 6.6. (Repeating an experiment, again) Going back to Example 6.4, we claim that the sequence f_1, f_2, . . . in (6.25) is independent. To see this, let B_1, B_2, . . . ⊆ R be Borel sets and recall from Example 6.4 that
f_i^{−1}(B_i) = S × · · · × S × {f ∈ B_i} × S × S × · · · ,
where {f ∈ B_i} occurs in the i-th factor. It's easy to see that such sets are independent, so f_1, f_2, . . . are independent. Thus, f_1, f_2, . . . are iid.
An important property of independent random variables is that expectations behave multiplicatively on such variables. The proof is yet one more typical use of the "Principle of Appropriate Functions"!
Theorem 6.16. If f_1, . . . , f_n are independent random variables, then for any Borel measurable functions ϕ_1, . . . , ϕ_n,
(1) ϕ_1(f_1), . . . , ϕ_n(f_n) are independent.
(2) We have
∫ ϕ_1(f_1) ϕ_2(f_2) · · · ϕ_n(f_n) = (∫ ϕ_1(f_1)) (∫ ϕ_2(f_2)) · · · (∫ ϕ_n(f_n)),
provided each of these integrals is defined. That is, the integral of the product is the product of the integrals, or stated in terms of expectations,
E(ϕ_1(f_1) ϕ_2(f_2) · · · ϕ_n(f_n)) = E(ϕ_1(f_1)) · E(ϕ_2(f_2)) · · · E(ϕ_n(f_n)).
Proof : Using the definition of independence, we leave it as an exercise to check that ϕ_1(f_1), . . . , ϕ_n(f_n) are independent. For saneness of notation, we shall only prove (2) for two independent random variables f and g. Using the "Principle of Appropriate Functions," we shall prove that
(6.28) ∫ ϕ(f) ψ(g) = (∫ ϕ(f)) (∫ ψ(g))
holds for any Borel measurable functions ϕ, ψ (for which the integrals are defined). First, let A and B be Borel subsets of R; we need to show that
∫ χ_A(f) χ_B(g) = (∫ χ_A(f)) (∫ χ_B(g)).
By the formula (6.27), we have χ_A(f) = χ_{f^{−1}(A)} and χ_B(g) = χ_{g^{−1}(B)}. Hence,
∫ χ_A(f) χ_B(g) = ∫ χ_{f^{−1}(A)} χ_{g^{−1}(B)} = ∫ χ_{f^{−1}(A) ∩ g^{−1}(B)}
= µ(f^{−1}(A) ∩ g^{−1}(B))
= µ(f^{−1}(A)) µ(g^{−1}(B)) (by independence)
= (∫ χ_{f^{−1}(A)}) (∫ χ_{g^{−1}(B)})
= (∫ χ_A(f)) (∫ χ_B(g)).
Since simple functions are just linear combinations of characteristic functions, a short computation shows that
∫ s(f) t(g) = (∫ s(f)) (∫ t(g))
for any nonnegative Borel measurable simple functions s and t. (Just write s and t as linear combinations of characteristic functions and multiply out the left and right-hand sides of the above equality to see that it's in fact an equality.) Using the Monotone Convergence Theorem, it follows that if ϕ and ψ are nonnegative measurable functions, then writing them as limits of nondecreasing sequences of nonnegative simple functions, we get
∫ ϕ(f) ψ(g) = (∫ ϕ(f)) (∫ ψ(g)).
Finally, if ϕ and ψ are arbitrary integrable functions, then we have ϕ = ϕ_+ − ϕ_− and ψ = ψ_+ − ψ_−, so
ϕ(f) ψ(g) = ϕ_+(f) ψ_+(g) + ϕ_−(f) ψ_−(g) − ϕ_+(f) ψ_−(g) − ϕ_−(f) ψ_+(g),
and hence
∫ ϕ(f) ψ(g) = ∫ ϕ_+(f) ψ_+(g) + ∫ ϕ_−(f) ψ_−(g) − ∫ ϕ_+(f) ψ_−(g) − ∫ ϕ_−(f) ψ_+(g)
= (∫ ϕ_+(f)) (∫ ψ_+(g)) + (∫ ϕ_−(f)) (∫ ψ_−(g)) − (∫ ϕ_+(f)) (∫ ψ_−(g)) − (∫ ϕ_−(f)) (∫ ψ_+(g))
= ((∫ ϕ_+(f)) − (∫ ϕ_−(f))) ((∫ ψ_+(g)) − (∫ ψ_−(g)))
= (∫ ϕ(f)) (∫ ψ(g)).
Example 6.7. By (1) of Theorem 6.16, if f and g are independent, then with ϕ(x) = ψ(x) = max{x, 0} we see that f_+ = ϕ(f) and g_+ = ψ(g) are independent. Similarly, f_− and g_− are independent, and |f| and |g| are independent.
6.4.3. Variance. Let f : X → R be an integrable random variable. Recall that the idea behind defining the expectation E = E(f) = ∫ f was that f may be quite complicated (being a function with possibly many values on a possibly very complicated sample space), so we wanted a number that gives useful information about f. The expectation E is one such number since it represents the average, or mean, value of f if the experiment is repeated a large number of times. It would also be useful to find a number that tells us how far the value of f may be from its expected value on any single experiment. To define this number, observe that the random variable |f − E| measures the deviation of f from its mean. We could also use (f − E)^2 as a measure of the deviation of f from its mean; because squaring large numbers makes them larger (e.g. 10^2 = 100) and squaring small numbers makes them smaller (e.g. (1/10)^2 = 1/100), the random variable (f − E)^2 tends to emphasize the larger deviations of f from E. If we take the average value of (f − E)^2 we get what is called the variance of f:
Var f := E[(f − E)^2] = ∫ (f − E)^2.
Since (f − E)^2 = f^2 − 2Ef + E^2, we have
E[(f − E)^2] = E(f^2) − 2E · E(f) + E^2 = E(f^2) − 2E^2 + E^2 = E(f^2) − E^2,
so an alternative way of writing the variance is Var f = E(f^2) − E^2. The standard deviation of f is the square root of the variance:
σ(f) := √(Var f) = (∫ (f − E)^2)^{1/2}.
Both Var f and σ(f) measure how much the values of f are spread from its mean.
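The two expressions for the variance, E[(f − E)^2] and E(f^2) − E^2, agree term for term; here is a minimal exact check on a three-point distribution (the distribution is my own example, not from the text):

```python
from fractions import Fraction

# Compute Var f both as E[(f - E)^2] and as E(f^2) - E^2; they agree.
dist = {0: Fraction(1, 2), 1: Fraction(1, 3), 6: Fraction(1, 6)}  # P{f = value}

E = sum(v * p for v, p in dist.items())
var_centered = sum((v - E) ** 2 * p for v, p in dist.items())
var_moment = sum(v ** 2 * p for v, p in dist.items()) - E ** 2
print(E, var_centered, var_moment)
```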
Example 6.8. (Normally distributed random variables) Let f be a random variable and assume that f ∼ N(µ, σ) for some µ ∈ R and σ > 0. We called µ the mean and σ the standard deviation . . . we now show these labels are correct! Indeed, by Corollary 6.14 we know that
E(f) = (1/√(2πσ^2)) ∫_R x e^{−(x−µ)^2/(2σ^2)} dx and E(f^2) = (1/√(2πσ^2)) ∫_R x^2 e^{−(x−µ)^2/(2σ^2)} dx.
Making the change of variables x = µ + √(2σ^2) y, we obtain
E(f) = (1/√π) ∫_R (µ + √(2σ^2) y) e^{−y^2} dy
and
E(f^2) = (1/√π) ∫_R (µ + √(2σ^2) y)^2 e^{−y^2} dy.
Since
(1) ∫_R e^{−y^2} dy = √π,
(2) ∫_R y e^{−y^2} dy = 0 (since y e^{−y^2} is an odd function on R),
(3) and^16 ∫_R y^2 e^{−y^2} dy = √π/2,
we leave you to show that E(f) = µ and E(f^2) = µ^2 + σ^2. This proves our result.
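The identities E(f) = µ and E(f^2) = µ^2 + σ^2 can also be confirmed by direct quadrature against the normal pdf. A sketch (my own parameter choices and integration window, not from the text):

```python
import math

# Numerically verify E(f) = mu and E(f^2) = mu^2 + sigma^2 for f ~ N(mu, sigma).
mu, sigma = 1.5, 2.0
pdf = lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(g, a, b, n=400_000):
    # trapezoid rule on [a, b]
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b)) + sum(g(a + k * h) for k in range(1, n))
    return total * h

a, b = mu - 12 * sigma, mu + 12 * sigma    # normal tails beyond are negligible
m1 = integrate(lambda x: x * pdf(x), a, b)
m2 = integrate(lambda x: x * x * pdf(x), a, b)
print(m1, m2)    # approximately mu and mu^2 + sigma^2
```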
The following theorem is useful when studying sums of random variables.
Theorem 6.17. If f_1, . . . , f_n are pairwise independent with finite expectations and variances, then
Var(Σ_{k=1}^n f_k) = Σ_{k=1}^n Var f_k.
Proof : Let g = Σ_{k=1}^n f_k, and observe that E(g) = Σ_{k=1}^n E_k, where E_k = E(f_k), and that
(g − E(g))^2 = (Σ_{k=1}^n (f_k − E_k))^2 = Σ_{j,k=1}^n (f_j − E_j)(f_k − E_k)
= Σ_{k=1}^n (f_k − E_k)^2 + Σ_{j≠k} (f_j − E_j)(f_k − E_k).
By independence and Theorem 6.16, for j ≠ k we have
E[(f_j − E_j)(f_k − E_k)] = E(f_j − E_j) · E(f_k − E_k) = (E_j − E_j)(E_k − E_k) = 0,
so
Var g = E[(g − E(g))^2] = Σ_{k=1}^n E[(f_k − E_k)^2] = Σ_{k=1}^n Var f_k.
^16 Differentiate both sides of the equality ∫_R e^{−ty^2} dy = √π t^{−1/2} with respect to t (to verify the equality, change variables y ↦ y t^{−1/2} in the integral) to get ∫_R y^2 e^{−ty^2} dy = √π t^{−3/2}/2, then take t = 1.
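Theorem 6.17 is easy to see in a simulation: for iid summands, the empirical variance of the sum should be close to n times the variance of one summand. A sketch (the uniform-on-{1,2,3,4} choice, seed, and sample sizes are mine):

```python
import random

# Simulate iid (in particular pairwise independent) random variables and
# check Var(f_1 + ... + f_n) ~ sum of the variances, as in Theorem 6.17.
random.seed(1)
n, trials = 5, 200_000
# each f_k uniform on {1, 2, 3, 4}: Var f_k = E(f^2) - E^2 = 7.5 - 6.25 = 1.25
sums = [sum(random.randint(1, 4) for _ in range(n)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, var)    # mean ~ n * 2.5 = 12.5, var ~ n * 1.25 = 6.25
```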
6.4.4. Probabilistic quantities for iid random variables. The following theorem is no surprise.
Theorem 6.18. Identically distributed random variables have the same expectation value, variance, and standard deviation (provided these notions are defined for the random variables).
Proof : Let f and g be identically distributed and integrable. Then with ϕ(x) = x, Property (2) of Theorem 6.15 implies that
∫ f = ∫ g; that is, E(f) = E(g),
so identically distributed random variables have the same expectation. With E = E(f) = E(g) and ϕ(x) = (x − E)^2, Property (2) of Theorem 6.15 implies that
∫ (f − E)^2 = ∫ (g − E)^2,
so Var f = Var g. In particular, σ(f) = σ(g).
Let's recall the set-up in our "Repeating an Experiment" Examples 6.4 and 6.6. Let S be a sample space with a probability measure and let f : S → R be a random variable. Then X = S^∞ is the sample space with measure the infinite product of the measure on S. For each i ∈ N, define
f_i : X → R by f_i(x_1, x_2, . . .) = f(x_i),
which represents the data the random variable f assigns to the outcome on the i-th trial of the experiment. We summarize the content of Examples 6.4 and 6.6 and Theorem 6.18 in the following:
Iid R.V.'s from repeating an experiment
Theorem 6.19. The random variables f_1, f_2, f_3, . . . are iid and
E(f_i) = E, Var(f_i) = σ^2, σ(f_i) = σ for all i,
where E and σ are the common expectation and standard deviation of the f_i's, which equal the expectation and standard deviation of the original random variable f. In particular, if S_n = f_1 + · · · + f_n, then
E(S_n) = nE, Var(S_n) = nσ^2, σ(S_n) = √n σ.
The equality for Var(S_n) (and consequently for σ(S_n)) follows from Theorem 6.17. Note that the same theorem holds for any sequence of iid random variables f_1, f_2, f_3, . . . if we drop the phrase "which equal the expectation and standard deviation of the original random variable f" from the theorem.
Example 6.9. (Binomially distributed random variables) As we've seen many times, let S = {0, 1}, assign {1} a probability p ∈ (0, 1) and {0} the probability q where q = 1 − p, and define f : S → R by f(0) = 0 and f(1) = 1. Then
E(f) = 0 · q + 1 · p = p,
and
Var(f) = E(f^2) − (E(f))^2 = 0^2 · q + 1^2 · p − p^2 = p − p^2 = p(1 − p) = pq.
Thus, σ(f) = √(pq). It follows that if S_n = f_1 + · · · + f_n, then
E(S_n) = np, Var(S_n) = npq, and σ(S_n) = √(npq).
More generally, if f : X → R is any random variable with f ∼ B(n, p), then
E(f) = np, Var(f) = npq, and σ(f) = √(npq),
as you can readily check.
◮ Exercises 6.4.
1. Here are some useful formulas, which you are free to use in subsequent problems.
(a) If f : X → [0, ∞) is integer valued (that is, f(X) ⊆ {0, 1, 2, . . .}), prove that
E(f) = Σ_{n=0}^∞ µ{f > n}.
(b) Let f : R → R be Borel measurable and let g : X → R be an integrable function taking countably many values a_1, a_2, . . . ∈ R. Prove that
E(f(g)) = Σ_{n=1}^∞ f(a_n) µ{g = a_n}.
(c) If f : X → [0, ∞], prove that
E(f) = ∫_0^∞ µ{f ≥ x} dx.
Suggestion: Principle of Appropriate Functions. 2. (Royal Oak Lottery) Here’s a part of the preface of Abraham de Moivre’s (1667– 1754) famous book The Doctrine of Chances [71], first published in 1718 and whose third and final edition was in 1756: When the Play of the Royal Oak was in use, some Persons who lost considerably by it, had their Losses chiefly occasioned by a Argument of which they could not perceive the Fallacy. The Odds against any particular Point of the Ball were One and Thirty to One, which entitled the Adventurers, in case they were winners, to have thirty two Stakes returned, including their own; instead of which they having but Eight and Twenty, it was very plain that on the single account of the disadvantage of the Play, they lost one eighth part of all the Money they played for. But the Master of the Ball maintained that they had no reason to complain; since he would undertake that any particular point of the Ball should come up in Two and Twenty Throws; of this he would offer to lay a Wager, and actually laid it when required. The seeming contradiction between the Odds of One and Thirty to One, and Twenty-two Throws for any Chance to come up, so perplexed the Adventurers, that they begun to think the Advantage was on their side; for which reason they played on and continued to lose. Here’s what I think de Moivre is saying: For the Royal Oak lottery, a single ball had 32 distinctive points and “Adventurers” would bet on which point would turn up when the “Master of the Ball” throws the ball. (Most lotteries are played with many balls but the Royal Oak was only played with one.) Thus, the probability of winning is 1/32. However, the “Adventurers” would only get paid 28 to 1 if they won and they thought this was unfair since after all, the odds are 1/32, not 1/28. However, the “Master of the Ball” said that any particular point of the ball should appear once in every 22 throws. Let’s verify the Master of the Ball’s statement.
(i) Let's fix a particular point on the ball, throw the ball an infinite number of times in sequence, and let f be the number of throws needed to obtain that particular point. Write down a sample space and show that f is measurable.
(ii) Show that for any n = 0, 1, 2, . . ., µ{f > n} = (31/32)^n. Conclude that µ{f ≤ 22} > 1/2. Thus, the Master of the Ball will throw that particular point within 22 throws more than half the time; hence, he felt confident to put a wager that he would throw any particular point at least once in 22 throws.
(iii) Show that E(f) = 32. Hence, the expected number of throws you need to obtain a particular point on the ball is 32 (just as you would expect)! Suggestion: To compute E(f), use a formula in Problem 1.
3. (St Petersburg Paradox) You walk to a booth in St Petersburg and pay a fee to play the following game. A fair coin is flipped until a head appears. If the head appears on the n-th flip, you are paid $2^n. What do you expect to gain if you play such a game?
(i) Write down a sample space X, the corresponding probability measure µ, and let f : X → R be the function representing the amount you gain. Show that f is measurable.
(ii) Show that E(f) = ∞, which seems to say that if you play this game, you can expect to win an infinite amount of money! Hence, it seems like you should be willing to pay any initial fee to play the St Petersburg game.
(iii) Given n ∈ N, what is the probability that you win $2^n? In particular, what are your chances of winning $2^4 = $16?
(iv) With (ii) and (iii) in view, do you see why this scenario is called the St Petersburg paradox?
(v) One reason the St Petersburg game is a paradox is that it assumes the booth has an infinite amount of money. Suppose that the booth only has $2^N, so if the first head appears on the n-th flip where n ≥ N, then you only get 2^N dollars (instead of $2^n). Now show that the expectation is finite.
4.
(The Coupon collector's problem) A certain bag of chips contains one of n different coupons (say labelled 1, . . . , n), each coupon equally likely to be found. In this problem we prove that the expected number of bags you need to obtain at least one of each type of coupon is n Σ_{k=1}^n 1/k. In particular, since Σ_{k=1}^n 1/k ∼ log n, meaning that lim_{n→∞} (log n)^{−1} Σ_{k=1}^n 1/k = 1,^17 it follows that
(the expected number of bags to get all n coupons)/n ∼ log n;
what a strange place to see logarithms! Proceed as follows.
(i) Write down a sample space X, the corresponding probability measure µ, and let f : X → R be the function representing the number of bags required to complete the set of n coupons. Show that f is measurable.
(ii) Using an expectation formula in Problem 1 and looking back at Problem 7 in Exercises 2.3, show that
E(f) = n a_n, where a_n = Σ_{k=1}^n (−1)^{k−1} (n choose k) (1/k).
(iii) Prove that
a_{n+1} = a_n + 1/(n + 1).
From this, prove that a_n = Σ_{k=1}^n 1/k.
(iv) The coupon collector's problem deals with other situations too. In a pack of 52 cards, you draw a card one at a time, returning the card you drew before you draw the next one. Show that the expected number of draws you need to produce a card of every suit is 8 1/3.
^17 Can you prove this fact?
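The answer n Σ_{k=1}^n 1/k is pleasant to confirm by simulation; a sketch (seed, trial counts, and helper names are my own):

```python
import random

# Simulate the coupon collector's problem and compare the average number of
# bags with the exact answer n * (1 + 1/2 + ... + 1/n).
random.seed(2)

def bags_needed(n):
    # draw coupons uniformly until all n types have been seen
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        count += 1
    return count

n, trials = 10, 20_000
avg = sum(bags_needed(n) for _ in range(trials)) / trials
exact = n * sum(1 / k for k in range(1, n + 1))
print(avg, exact)
```

For part (iv), the same formula with n = 4 suits gives 4(1 + 1/2 + 1/3 + 1/4) = 25/3 = 8 1/3 expected draws.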
5. Let µ_1, µ_2, µ_3, . . . be a sequence of probability measures on B. Prove that there exists a probability space (X, S, µ) and independent random variables f_1, f_2, f_3, . . . such that P_{f_n} = µ_n for each n. Suggestion: Let X = R^∞ with the infinite product measure of the µ_i's.
6. Let f_1, f_2, . . . be iid random variables on a probability space (X, S, µ) and suppose that the range of each f_i is {0, 1}, taking the value 1 with probability p and the value 0 with probability 1 − p. With S_n = f_1 + · · · + f_n, we shall give two proofs that
µ{S_n = k} = (n choose k) p^k (1 − p)^{n−k}.
(i) (Easy) You can prove this using a similar argument as we did back in Theorem 2.13 in Section 2.5.
(ii) (Harder, but neat) We can also give an "analytic" proof as follows. First prove that for all t ∈ R,
∫ e^{tS_n} = Σ_{k=0}^n a_k e^{tk}, where a_k = µ{S_n = k}.
Next, show that for any i, ∫ e^{tf_i} = p e^t + q where q = 1 − p, and use this to prove that
∫ e^{tS_n} = (p e^t + q)^n.
Finally, prove the desired formula.
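The key identity of part (ii), ∫ e^{tS_n} = (p e^t + q)^n, can be verified by brute force for small n, since the sample space {0, 1}^n is finite. A sketch (the specific p, n, t values are mine):

```python
import math
from itertools import product

# Verify E(e^{t S_n}) = (p e^t + q)^n by enumerating the sample space {0, 1}^n.
p, q, n, t = 0.3, 0.7, 8, 0.5

# Each outcome x has probability p^{#heads} q^{#tails}; S_n(x) = number of heads.
lhs = sum(math.exp(t * sum(x)) * p ** sum(x) * q ** (n - sum(x))
          for x in product([0, 1], repeat=n))
rhs = (p * math.exp(t) + q) ** n
print(lhs, rhs)    # equal up to floating-point rounding
```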
6.5. Approximations and the Stone-Weierstrass theorem
As a reward for our work on iid random variables, we give a probabilistic proof of the Weierstrass Approximation Theorem and we also study the celebrated Stone-Weierstrass Theorem.
6.5.1. The WAT and Lebesgue's very first publication. Concerning continuous functions, Karl Weierstrass (1815–1897) proved two strikingly different results. In 1872 [306] he showed there are continuous functions that are nowhere differentiable. In other words, there are continuous functions so jagged they don't have a tangent line anywhere! We shall study Weierstrass' nondifferentiable function in Section ??. On the other hand, in 1885 [307, 225, 297], when he was 70 years old, he showed that any continuous function can be approximated as closely as one wishes by polynomials. Thus, although continuous functions may be very jagged, they can always be approximated arbitrarily closely by the smoothest of all functions, polynomials! This last result is the Fundamental Theorem of Approximation Theory, or the
Weierstrass Approximation Theorem
Theorem 6.20. If f : I → R is a continuous function on a compact interval I and ε > 0 is given, there exists a polynomial p(x) such that
|p(x) − f(x)| < ε for all x ∈ I.
We remark that Lebesgue's first publication provided another proof of Weierstrass' theorem. The paper was Sur l'approximation des fonctions [164] and it was published in 1898. He probably discovered the proof while an undergraduate student at École Normale Supérieure in Paris, which he entered in 1894 and obtained
his teaching diploma in mathematics in 1897; he obtained his doctorate in 1902. You will learn Lebesgue's elegant proof in Problem 6.
Now why is Weierstrass' theorem in a section involving probability? The reason is that one of the most intuitive proofs of the theorem relies on probability! This proof is due to Sergei Natanovich Bernstein (1880–1968) and was published in 1912 [26]. Here's Bernstein's idea for a function defined on [0, 1] [154].
Bernstein's game: Let f : [0, 1] → R be a continuous function and let n ∈ N. Given t ∈ [0, 1], suppose that we toss a coin n times where the probability of flipping a head on any given toss is t. In Bernstein's game, if we toss k heads, we gain f(k/n) dollars (if f(k/n) < 0 we lose this many dollars). As we know, a sample space for this experiment is X = {0, 1}^n with a head "1" assigned the probability t and tail "0" the probability 1 − t on each toss. If S_n : X → R is the number of heads in n tosses, then our gain is modeled by the random variable
g_n : X → R, defined by g_n = f(S_n/n),
the composition of f with S_n/n. Indeed, if k heads are tossed we have S_n/n = k/n, so g_n = f(k/n), which is our required gain. We know that
the probability of getting k heads in n tosses = (n choose k) t^k (1 − t)^{n−k},
so our expected gain for playing Bernstein's game is
E(g_n) = Σ_{k=0}^n (Gain when k heads are tossed) × (probability k heads are tossed)
= Σ_{k=0}^n f(k/n) (n choose k) t^k (1 − t)^{n−k}.
Now recalling that t is the probability of getting a head on any given toss, it follows that if n is large, we should get approximately nt heads in playing Bernstein's game. Thus, S_n ≈ nt, so we should expect to gain approximately f(nt/n) = f(t) dollars. Since our expected gain is also E(g_n), we conclude that
f(t) ≈ Σ_{k=0}^n f(k/n) (n choose k) t^k (1 − t)^{n−k}.
Hence, for n large, the continuous function f(t) is approximately a polynomial in t, specifically, the polynomial to the right of ≈, which is called the n-th Bernstein polynomial. This heuristic argument for Weierstrass' theorem will be made rigorous in Problem 1.
Now that we know Weierstrass' theorem, we shall present a very useful generalization due to Marshall Harvey Stone (1903–1989) [276, 277]. Let T be a topological space. A collection A of real-valued continuous functions on T is called an algebra of functions if for all f, g ∈ A, we have fg ∈ A and af + bg ∈ A for all a, b ∈ R. The collection A is said to separate points if given any two points x, y ∈ T there exists a function f ∈ A such that f(x) ≠ f(y). We denote by C(T, R) the set of all continuous real-valued functions on T.
Stone-Weierstrass theorem
Theorem 6.21. Let T be a compact space and let A ⊆ C(T, R) be an algebra of functions that separates points of T and contains the constant functions. Then any function in C(T, R) can be uniformly approximated by functions in A; that is, given any ε > 0 and f ∈ C(T, R) there is a function g ∈ A such that
|f(x) − g(x)| < ε for all x ∈ T.
Proof : We shall give a "Lebesgue-like" proof of this result following [103]. Let A′ ⊆ C(T, R) be the space of functions that can be uniformly approximated by functions in A. We must show that A′ = C(T, R). Since A is an algebra, one can check that A′ is an algebra. Now let f ∈ C(T, R); we need to prove that f ∈ A′. Since T is compact, |f| is bounded by a constant, say M. Now f̃ := f + M is nonnegative and if we can prove that f̃ ∈ A′, then as the constant function M belongs to A (and hence to A′) we get f = f̃ − M ∈ A′. Thus, we might as well assume from now on that f is nonnegative.
Let ε > 0 and fix n ∈ N such that 1/n < ε/2. Following Lebesgue, let's partition the range of f using the partition
0, 1/n, 2/n, 3/n, 4/n, . . . ,
and next consider the function (which should be familiar looking by now)
g = Σ_{k=0}^∞ (k/n) χ_{{k/n < f ≤ (k+1)/n}}.
Figure 6.10. The horizontal strips have area µ{f > 0}, . . . , µ{f > 3}.
Proof : We remark that if you take a close look at Figure 6.10, it's easy to see why this lemma holds: As seen in Figure 6.10, we certainly have
E(f) = ∫ f = area below the graph of f ≤ Σ_{n=0}^∞ µ{f > n}.
Also, if you imagine shifting f one unit up, then the graph of f + 1 is above the horizontal rectangles in Figure 6.10. This shows that
Σ_{n=0}^∞ µ{f > n} ≤ area below the graph of f + 1 = ∫ (f + 1) = E(f) + 1.
6.6. THE LAW OF LARGE NUMBERS AND NORMAL NUMBERS
377
Here's a proof of this geometric argument:
Σ_{n=0}^∞ µ{f > n} = Σ_{n=0}^∞ Σ_{k=n}^∞ µ{k < f ≤ k + 1}
= Σ_{k=0}^∞ Σ_{n=0}^k µ{k < f ≤ k + 1} (interchange summation order)
= Σ_{k=0}^∞ (k + 1) µ{k < f ≤ k + 1}
= Σ_{k=0}^∞ (k + 1) ∫ χ_{{k < f ≤ k+1}}
= ∫ Σ_{k=0}^∞ (k + 1) χ_{{k < f ≤ k+1}}.
Since f ≤ Σ_{k=0}^∞ (k + 1) χ_{{k < f ≤ k+1}} ≤ f + 1 on the set {f > 0}, integrating gives
E(f) ≤ Σ_{n=0}^∞ µ{f > n} ≤ E(f) + 1.
Σ_{n=0}^∞ µ{|f_n| > n} = Σ_{n=0}^∞ µ{|f_1| > n} = ∞,
where we used that µ{|f1 | > n} = µ{|fn | > n} as the fn ’s (and hence |fn |’s — see Example 6.5) are identically distributed. Since the fn ’s are also pairwise independent, the sets {|fn | > n} are pairwise independent (see Example 6.7), so by the second Borel–Cantelli Lemma, µ{|fn | > n ; i.o.} = 1. This means that, a.e., |fn | > n (or |fnn | > 1) for infinitely many n’s. However, this cannot be possible if lim fnn = 0 a.e. Step 2: We now begin our proof of the SLLN assuming the fn ’s are integrable. To do so, we claim that if the SLLN holds for integrable nonnegative random variables, then it holds as stated. To see this, write each fn in terms of its nonnegative and nonpositive parts, fn+ , fn− , and observe that Pn Pn + f− Sn k=1 fk = − k=1 k . n n n Since {fn } is pairwise independent and identically distributed, by Examples 6.5 and 6.7 we know that {fn+ } and {fn− } have the same properties, so if we can prove the SLLN for nonnegative random variables, it follows that Pn Pn + − + k=1 fn k=1 fn lim = E(f1 ) a.e. and lim = E(f1− ) a.e. n→∞ n→∞ n n Hence, a.e. we have Sn lim = E(f1+ ) − E(f1− ) = E(f1 ), n→∞ n so the SLLN holds as stated. Thus, we shall henceforth assume that the fn ’s are nonnegative. Our goal is to prove that lim Snn = E a.e. The idea to prove this limit is to first truncate the fn ’s so that each fn is bounded. According to Boris Vladimirovich Gnedenko (1912–1995) [105, p. 230], this “method of truncation” was first introduced by Andrei Andreyevich Markov (1856–1922) in 1907. The method of truncation is the following technique. For each n ∈ N, let ϕn : R → R be the truncation function ( x if x ≤ n (6.32) ϕn (x) := 0 otherwise, and consider the truncated random variable ( fn if fn ≤ n gn := ϕn (fn ) = 0 otherwise. Let Tn = g1 + · · · + gn . In Steps 4–6 shall prove the SLLN for the truncated random variables: Tn = E a.e. (6.33) lim n→∞ n That this fact proves the SLLN follows from our next step. 
Step 3: In this step we prove the following:

    Claim: If lim_{n→∞} T_n/n = E a.e., then lim_{n→∞} S_n/n = E a.e.

To prove this claim, note that

(6.34)    S_n/n = (S_n − T_n)/n + T_n/n.

By assumption, the last term tends to E a.e. as n → ∞, so we just have to prove that the first term on the right in (6.34) vanishes a.e. as n → ∞. To this end,
6.6. THE LAW OF LARGE NUMBERS AND NORMAL NUMBERS
observe that if f_n(x) = g_n(x) for all n from some point on, then lim_{n→∞} (f_n(x) − g_n(x)) = 0, so by Cauchy's arithmetic mean theorem, we have lim_{n→∞} (S_n(x) − T_n(x))/n = 0. Another way of saying that "f_n(x) = g_n(x) for all n from some point on" is "f_n(x) ≠ g_n(x) for only finitely many n's," therefore

    f_n ≠ g_n for at most finitely many n a.e.  ⟹  lim_{n→∞} (S_n − T_n)/n = 0 a.e.

Now the left-hand side here means that µ{f_n ≠ g_n for infinitely many n} = 0, so by definition of "i.o.," Step 3 is completed once we show that µ{f_n ≠ g_n ; i.o.} = 0. To prove this we use the First Borel–Cantelli Lemma, which says that µ{f_n ≠ g_n ; i.o.} = 0 if ∑_{n=1}^∞ µ{f_n ≠ g_n} < ∞. To prove this, note that by definition of g_n, we have f_n ≠ g_n if and only if f_n > n, thus

    ∑_{n=1}^∞ µ{f_n ≠ g_n} = ∑_{n=1}^∞ µ{f_n > n} = ∑_{n=1}^∞ µ{f_1 > n} < ∞,
where we used that µ{f_1 > n} = µ{f_n > n} as the f_n's are identically distributed. By Lemma 6.29 we have ∑_{n=1}^∞ µ{f_1 > n} ≤ E(f_1) + 1, which is finite because the f_n's are assumed integrable. This completes Step 3.

Step 4: It remains to prove that lim_{n→∞} T_n/n = E a.e. To prove this we use the following "subsequence trick." Fix any α > 1 and let a_n = ⌊α^n⌋, where for any real number x, ⌊x⌋ denotes the largest integer ≤ x. In this step we shall prove that

    lim_{n→∞} T_{a_n}/a_n = E a.e.,

and in Step 5, we shall use this to deduce that lim_{n→∞} T_n/n = E a.e. Now, observe that

    E(T_n)/n = (∑_{k=1}^n E(ϕ_k(f_k)))/n = (∑_{k=1}^n E(ϕ_k(f_1)))/n,

where we used that E(ϕ_k(f_k)) = E(ϕ_k(f_1)) as the f_n's are identically distributed. Since 0 ≤ ϕ_1(f_1) ≤ ϕ_2(f_1) ≤ ϕ_3(f_1) ≤ · · · and ϕ_k(f_1) → f_1, by the MCT we have lim_{n→∞} E(ϕ_n(f_1)) = E(f_1) = E. Therefore, by Cauchy's arithmetic mean theorem, we have lim_{n→∞} E(T_n)/n = E, which implies that lim_{n→∞} E(T_{a_n})/a_n = E as well. Hence,

(6.35)    lim_{n→∞} T_{a_n}/a_n = E a.e.  ⟺  lim_{n→∞} (T_{a_n} − E(T_{a_n}))/a_n = 0 a.e.
We henceforth focus on proving the right-hand side. To this end, note that by Part (3) of Lemma 4.5, to prove that lim_{n→∞} (T_{a_n} − E(T_{a_n}))/a_n = 0 a.e., we just have to show that given any ε > 0, we have²³

    ∑_{n=1}^∞ µ{ |T_{a_n} − E(T_{a_n})|/a_n ≥ ε } < ∞,

[²³ So you don't have to review Lemma 4.5, note that ∑_{n=1}^∞ µ{|T_{a_n} − E(T_{a_n})|/a_n ≥ ε} < ∞ implies, by the First Borel–Cantelli Lemma, that µ{|T_{a_n} − E(T_{a_n})|/a_n ≥ ε ; i.o.} = 0; in other words, the complement of the set of points where |T_{a_n} − E(T_{a_n})|/a_n < ε for all n sufficiently large has measure zero. Since ε > 0 is arbitrary, with a little more work, we get that lim (T_{a_n} − E(T_{a_n}))/a_n = 0 a.e.]
6. SOME APPLICATIONS OF INTEGRATION
which is exactly in the form where we can try to use Chebyshev's Inequality! (This, of course, was the reason we derived the equivalence (6.35): to set us up for Chebyshev.) By Chebyshev's inequality,

    µ{ |T_{a_n} − E(T_{a_n})|/a_n ≥ ε } ≤ (1/(a_n² ε²)) Var T_{a_n}.

Since the {f_n} are pairwise independent, by Theorem 6.16, {g_n} = {ϕ_n(f_n)} are pairwise independent too, so by the properties of the variance (see Theorem 6.17), for any n we have

    Var T_n = Var(∑_{k=1}^n g_k) = ∑_{k=1}^n Var g_k = ∑_{k=1}^n [E(g_k²) − E(g_k)²]
            ≤ ∑_{k=1}^n E(g_k²)
            = ∑_{k=1}^n E(ϕ_k(f_k)²)
            = ∑_{k=1}^n E(ϕ_k(f_1)²)
            ≤ n E(ϕ_n(f_1)²),

where we used that E(ϕ_k(f_k)²) = E(ϕ_k(f_1)²) as the f_n's are identically distributed, and at the last step we used that ϕ_k ≤ ϕ_n for 1 ≤ k ≤ n, so E(ϕ_k(f_1)²) ≤ E(ϕ_n(f_1)²) for 1 ≤ k ≤ n. Thus,

    µ{ |T_{a_n} − E(T_{a_n})|/a_n ≥ ε } ≤ (1/(a_n² ε²)) Var T_{a_n} ≤ (1/(a_n ε²)) E(ϕ_{a_n}(f_1)²).

Hence,

    ∑_{n=1}^∞ µ{ |T_{a_n} − E(T_{a_n})|/a_n ≥ ε } ≤ ∑_{n=1}^∞ (1/(a_n ε²)) E(ϕ_{a_n}(f_1)²) ≤ (1/ε²) E( ∑_{n=1}^∞ ϕ_{a_n}(f_1)²/a_n ),

so we are left to show that the right-hand side is finite. To do so, we claim that for some constant C we have

    Claim:  ∑_{n=1}^∞ ϕ_{a_n}(x)²/a_n ≤ C x  for all x ≥ 0.

Assuming this claim for a moment, we have

    ∑_{n=1}^∞ ϕ_{a_n}(f_1)²/a_n ≤ C f_1,

which shows that

    ∑_{n=1}^∞ µ{ |T_{a_n} − E(T_{a_n})|/a_n ≥ ε } ≤ (C/ε²) E(f_1) < ∞.

Thus, we just have to prove our claim. Notice that so far we haven't used anything about the explicit form a_n = ⌊α^n⌋ with α > 1; however, we shall do so now. Observe that a_1 ≤ a_2 ≤ a_3 ≤ · · · → ∞, so given x ≥ 0 there is a smallest m ∈ N such that a_m ≥ x. Then by definition of ϕ_n in (6.32),

    ∑_{n=1}^∞ ϕ_{a_n}(x)²/a_n = ∑_{n=m}^∞ x²/a_n = x² ∑_{n=m}^∞ 1/a_n.
For any t ≥ 1, observe that

    t ≤ ⌊t⌋ + 1 ≤ ⌊t⌋ + ⌊t⌋ = 2⌊t⌋,

which implies 1/⌊t⌋ ≤ 2/t, so 1/a_n ≤ 2/α^n. Therefore, recalling the geometric series formula ∑_{n=m}^∞ r^n = r^m/(1 − r) for any r ∈ R with |r| < 1, we see that

    ∑_{n=1}^∞ ϕ_{a_n}(x)²/a_n = x² ∑_{n=m}^∞ 1/a_n ≤ x² ∑_{n=m}^∞ 2/α^n ≤ x² · 2(1/α)^m/(1 − 1/α) = C x²/α^m,

where C = 2/(1 − 1/α). By the way we chose m, we have x ≤ a_m = ⌊α^m⌋ ≤ α^m, so x/α^m ≤ 1 and hence ∑_{n=1}^∞ ϕ_{a_n}(x)²/a_n ≤ C x, just as we claimed.

Step 5: Our last step is to prove that

    lim_{n→∞} T_{a_n}/a_n = E a.e.  ⟹  lim_{n→∞} T_n/n = E a.e.

Recalling that a_n = ⌊α^n⌋ with α > 1, we have a_1 = 1 ≤ a_2 ≤ a_3 ≤ · · · → ∞, so given k ∈ N, there is a largest natural number n ∈ N such that a_n ≤ k; for this n we have a_n ≤ k ≤ a_{n+1}. (Although n depends on k, we omit the explicit dependence.) Note that as k → ∞, we also have n → ∞, a fact that will be used later. In any case, observe that the inequality a_n ≤ k ≤ a_{n+1} implies that

    T_{a_n} ≤ T_k ≤ T_{a_{n+1}}    and    1/a_{n+1} ≤ 1/k ≤ 1/a_n,

therefore

    T_{a_n}/a_{n+1} ≤ T_k/k ≤ T_{a_{n+1}}/a_n,

or, written in a slightly different way,

(6.36)    (a_n/a_{n+1}) · (T_{a_n}/a_n) ≤ T_k/k ≤ (a_{n+1}/a_n) · (T_{a_{n+1}}/a_{n+1}).

Since a_n = ⌊α^n⌋, an argument (which you can provide) shows that lim_{n→∞} a_{n+1}/a_n = α. So far, α > 1 has been arbitrary. We now choose α = 1 + 1/m where m ∈ N. Then there is a measurable set A_m of measure zero such that

    lim_{n→∞} T_{a_n}(x)/a_n = E  for x ∉ A_m,

which is the precise meaning of "lim T_{a_n}/a_n = E a.e." Hence, for x ∉ A_m,

    lim_{n→∞} (a_n/a_{n+1}) · (T_{a_n}(x)/a_n) = E/(1 + 1/m)    and    lim_{n→∞} (a_{n+1}/a_n) · (T_{a_{n+1}}(x)/a_{n+1}) = (1 + 1/m) E.

Let A = ∪_{m=1}^∞ A_m, which has measure zero since it's a countable union of sets of measure zero. Let ε > 0 be given and choose m ∈ N such that 1/m < ε/(2E). Then from the inequalities in (6.36), and the fact that n → ∞ as k → ∞, it follows that given x ∈ X with x ∉ A (which implies x ∉ A_m), for k sufficiently large we have

    E/(1 + 1/m) − ε/2 ≤ T_k(x)/k ≤ (1 + 1/m) E + ε/2.

Observe that

    E/(1 + 1/m) − ε/2 = (1 − 1/m + 1/m² − 1/m³ + · · ·) E − ε/2 ≥ (1 − 1/m) E − ε/2 ≥ E − ε,
where we used the fact that E/m < ε/2. The same fact implies that

    (1 + 1/m) E + ε/2 < E + ε.

Thus, for k sufficiently large we have

    E − ε ≤ T_k(x)/k ≤ E + ε.

Since ε > 0 was arbitrary, it follows that lim_{k→∞} T_k(x)/k = E for all x ∉ A. This proves the SLLN.
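The two facts about a_n = ⌊α^n⌋ that carried Steps 4 and 5, namely the Claim ∑_n ϕ_{a_n}(x)²/a_n ≤ Cx with C = 2/(1 − 1/α) and the limit a_{n+1}/a_n → α, are easy to sanity-check numerically. Here is a small Python sketch; the value α = 1.5 and the test points are arbitrary choices of ours, and the infinite sum is truncated at n = 60 (dropping nonnegative terms only shrinks it, so this is a check of evidence, not a proof).

```python
import math

alpha = 1.5                      # any alpha > 1 works; 1.5 is an arbitrary choice
C = 2 / (1 - 1 / alpha)          # the constant C = 2/(1 - 1/alpha) from the Claim
a = [math.floor(alpha**n) for n in range(1, 61)]   # a_n = floor(alpha^n)

def phi(x, level):
    # the truncation function (6.32): phi_n(x) = x if x <= n, else 0
    return x if x <= level else 0.0

def claim_sum(x):
    # sum over n of phi_{a_n}(x)^2 / a_n, truncated to the a_n's computed above
    return sum(phi(x, a_n)**2 / a_n for a_n in a)

print(a[-1] / a[-2])             # should be close to alpha
for x in (0.5, 1.0, 7.3, 100.0, 12345.6):
    assert claim_sum(x) <= C * x  # the Claim of Step 4
```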
We remark that the same argument given at the very end of Section 4.2.1 shows that Etemadi's SLLN implies the following WLLN: If f_1, f_2, . . . are pairwise independent, identically distributed random variables with a finite (common) expectation E, then for each ε > 0, we have

    lim_{n→∞} µ{ |(f_1 + f_2 + · · · + f_n)/n − E| < ε } = 1.

This version of the WLLN is stronger than the one in Corollary 6.27 because in that corollary we assumed that the f_n's had finite variances, whereas here we don't.
6.6.3. Borel's Normal Number Theorem. If you did Exercises 4 and 5 in Exercises 4.1, this section will be a breeze; we begin by reviewing material from those exercises. The basic gist of a normal number is as follows. Roughly speaking, we say that a number is normal in base 10 if any given finite sequence of digits occurs in its decimal expansion with the expected frequency. Thus, for example, if x ∈ [0, 1] is normal in base 10 and we write x = 0.x_1 x_2 x_3 · · · in its decimal expansion, then the digit "5" occurs with frequency 1/10, the string "34" occurs with frequency 1/10², and so forth. We now make this precise. Let b ∈ N with b ≥ 2. Given a number x ∈ [0, 1], we can write it in its b-adic expansion, otherwise known as its base b expansion:

(6.37)    x = x_1/b + x_2/b² + x_3/b³ + · · · ,

for some x_i's in the set of digits S := {0, 1, . . . , b − 1}. If x is rational, it has two such expansions, one terminating and the other non-terminating; in order to have unique expansions we agree to write rational numbers in their non-terminating b-adic expansions. We shall call a finite string of digits a word. Thus, w = (d_1, d_2, . . . , d_k) for some k ∈ N (called the length of w) and d_1, . . . , d_k ∈ S; let us fix such a word. For each i ∈ N, define

    f_i : [0, 1] → R  by  f_i(x) = 1 if (x_i, x_{i+1}, . . . , x_{i+k−1}) = w, and f_i(x) = 0 otherwise.

Thus, f_i observes whether the word w occurs in the b-adic expansion of x starting from the i-th digit of x. Now consider the average

    (f_1(x) + f_2(x) + · · · + f_n(x))/n,

which is exactly the average number of times the word w occurs in the first n digits of x. Intuitively speaking, since there are a total of b possible digits, the word w (consisting of k specified digits) should occur at any given position in the b-adic
expansion of x with probability 1/b^k. Hence, it seems reasonable that the word w should appear with frequency 1/b^k; that is, it should be that

(6.38)    lim_{n→∞} (f_1(x) + f_2(x) + · · · + f_n(x))/n = 1/b^k.
If this is indeed the case, we say that x is normal in base b with respect to the word w. If (6.38) holds for all words, then we say that x is normal in base b. Finally, we say that x is normal or absolutely normal if it's normal in all bases b ≥ 2. The following result was proved by Émile Borel in 1909 [36]:

Borel's Normal Number Theorem

Theorem 6.31. Almost all numbers in [0, 1] are (absolutely) normal.

You shall prove this theorem in Problem 5. We remark that although almost all numbers in [0, 1] are normal, and this has been known for over 100 years now, I don't know of any simple example! Maybe one of you will produce one! There are complicated examples of normal numbers, which can be computed in theory; the first such number was found by Sierpiński in 1916 [257, 20]. On the other hand, we can give simple examples of numbers normal in specific bases. For example, the first nontrivial numbers which are normal in a given base were constructed by D. G. Champernowne (1912–2000) in 1933 [55]. For example, in base 10, the number

    0.12345678910111213141516171819202122232425262728293031 . . . ,

obtained by stringing together all the natural numbers written in base 10, is normal in base 10. In any base b, the number obtained by stringing together all the natural numbers written in base b is normal in base b. Champernowne conjectured that

    0.2357111317192329313741434753596167717379838997101103107 . . . ,

obtained by stringing together all the prime numbers, is also normal in base 10; this was subsequently proved in 1946 by Copeland and Erdős [61]. However, it's not known whether "naturally occurring" numbers such as the decimal parts of e, π, √2 or log 2 are normal in any base.

◮ Exercises 6.6.

1. (Cantelli's SLLN) In this problem we give Cantelli's SLLN, proved in 1917 [47]. Let f_1, f_2, f_3, . . . be integrable independent random variables on X. For each n, k, put

    B_{n,k} = ∑_{i=1}^n E[ |f_i − E(f_i)|^k ].

In this problem we prove that if ∑_{n=1}^∞ (B_{n,4} + B_{n,2}²)/n² < ∞, then

    lim_{n→∞} (S_n − E(S_n))/n = 0 a.e.,

where S_n = f_1 + · · · + f_n. By Part (3) of Lemma 4.5, we just have to show that given any ε > 0, we have

    ∑_{n=1}^∞ µ{ |S_n − E(S_n)|/n ≥ ε } < ∞.

To prove this, proceed as follows:
(i) Use Chebyshev's inequality to prove that

    µ{ |S_n − E(S_n)|/n ≥ ε } ≤ (1/(n⁴ ε⁴)) ∫ (∑_{i=1}^n g_i)⁴,

where g_i = f_i − E(f_i).
(ii) Multiplying out (∑_{i=1}^n g_i)⁴, show that we get a sum of terms of the following form: (a) g_i⁴ where i = 1, . . . , n; (b) g_i² g_j², where i ≠ j and 1 ≤ i, j ≤ n; and (c) terms of the form g_i g_j g_k g_ℓ in which at least one g_m is not repeated. Show that the integrals of the type (c) terms are zero.
(iii) Show that

    (1/(n⁴ ε⁴)) ∫ (∑_{i=1}^n g_i)⁴ ≤ (B_{n,4} + B_{n,2}²)/(n² ε⁴),

and use this to prove Cantelli's SLLN.

2. (Kac's proof of the SLLN; cf. [143, 107]) Here's a proof of a SLLN by Mark Kac (1914–1984). Let f_1, f_2, f_3, . . . be iid random variables on X. Suppose there is a constant C such that |f_i| ≤ C for all i. We shall prove that lim S_n/n = E a.e., where S_n = f_1 + · · · + f_n and E = E(f_1) = E(f_2) = · · · .
(i) Show that lim S_n/n = E a.e. is equivalent to lim T_n/n = 0 a.e., where T_n = g_1 + · · · + g_n with g_i = f_i − E.
(ii) Show that

    ∫ (T_n/n)⁴ ≤ constant/n².

Suggestion: Multiply out T_n⁴ and see Part (ii) of Cantelli's SLLN in the previous problem to see how to deal with the various terms you get.
(iii) Conclude that ∑_{n=1}^∞ ∫ (T_n/n)⁴ < ∞ and then use Theorem 5.21 to prove that lim (T_n/n)⁴ = 0 a.e., which implies lim T_n/n = 0 a.e.
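For a concrete feel for the fourth-moment bound in (ii), take the f_i to be Rademacher variables (±1 with probability 1/2 each, a choice of ours, so g_i = f_i and E = 0). Multiplying out as in Cantelli's problem, only the n terms g_i⁴ and the 3n(n−1) terms g_i²g_j² survive, giving E(T_n⁴) = n + 3n(n−1) = 3n² − 2n, so ∫(T_n/n)⁴ ≤ 3/n², which is summable. The brute-force Python check below confirms the formula for small n.

```python
from itertools import product

def fourth_moment(n):
    """E[(x_1 + ... + x_n)^4] for iid Rademacher x_i, by full enumeration."""
    total = sum(sum(signs)**4 for signs in product((-1, 1), repeat=n))
    return total // 2**n          # the expectation is an integer here

for n in range(1, 13):
    # matches n + 3n(n-1) = 3n^2 - 2n
    assert fourth_moment(n) == 3*n**2 - 2*n
print(fourth_moment(12))  # 408
```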
3. (Khintchine's WLLN) This theorem was proved by Aleksandr Yakovlevich Khinchin (1894–1959) in 1929 [147] and it strengthens Corollary 6.27 so that we don't need finite variances. Here's the statement: If f_1, f_2, f_3, . . . are pairwise independent and identically distributed with finite common expectation E and S_n = f_1 + · · · + f_n, then for each ε > 0, we have

    lim_{n→∞} µ{ |S_n/n − E| < ε } = 1.

(i) We first need the following (very) useful inequality: Given f with finite expectation and variance, prove that

    (∫ |f|)² ≤ ∫ f²;  that is,  E(|f|)² ≤ E(f²).

Suggestion: Note that Var |f| ≥ 0.
(ii) Case I: Assume that ∫ f_1² < ∞. Prove that ∫ |S_n/n − E|² = Var(f_1)/n. Use this fact, together with (i) and Chebyshev's inequality, to prove Khintchine's theorem.
Case II: We henceforth do not require ∫ f_1² to be finite. We proceed as follows.
(iii) Let ϕ_k be the truncation function in (6.32) and let f_{ik} = ϕ_k(f_i) and S_{nk} = f_{1k} + · · · + f_{nk}. Show that ∫ |S_n − S_{nk}| ≤ n ∫_{A_k} |f_1|, where A_k = {|f_1| > k}. (Suggestion: Show that ∫ |f_i − f_{ik}| ≤ ∫_{A_k} |f_1| and use the triangle inequality.)
(iv) Show that

    ∫ |S_n/n − E| = ∫ |S_n/n − S_{nk}/n + S_{nk}/n − E(f_{1k}) + E(f_{1k}) − E(f_1)|
                  ≤ ∫_{A_k} |f_1| + ∫ |S_{nk}/n − E(f_{1k})| + ∫_{A_k} |f_1|.
Show that the first and last terms (which are the same) → 0 as k → ∞, and show by Case I that for fixed k, the middle term → 0 as n → ∞. Use these facts, plus an "ε/3-trick," to finish off the proof of Khintchine's theorem.

4. (Sub-gaussian random variables; cf. [286, 281]) A random variable f : X → R is called sub-gaussian with parameter α > 0 if

    E(e^{tf}) ≤ e^{t²α²/2}  for all t ∈ R.
(Properties of sub-gaussians) Prove the following properties:
(i) If f is sub-gaussian with parameter α, then so is −f.
(ii) If f is sub-gaussian with parameter α, then µ{f > λ} ≤ e^{−λ²/(2α²)}.
(iii) If f_1, . . . , f_n are independent and sub-gaussian with parameters α_1, . . . , α_n, respectively, then S_n = ∑_{k=1}^n f_k is sub-gaussian with parameter √(α_1² + · · · + α_n²).
(iv) If f is bounded, say |f| ≤ M for some constant M, and E(f) = 0, then f is sub-gaussian with parameter √2 M.

Suggestions: For (ii), observe that µ{f > λ} = µ{λα⁻²f > λ²α⁻²} = µ{e^{λα⁻²f} > e^{λ²α⁻²}}. Don't forget your friend Mr. Chebyshev. To prove (iv), first assume that M = 1 and in this case show that E(e^{tf}) ≤ e^{t²}. If t ≥ 1, show that E(e^{tf}) ≤ e^t ≤ e^{t²}. For 0 ≤ t ≤ 1, prove that

    e^{tf} = 1 + t f + t²f²/2! + t³f³/3! + · · · ≤ 1 + t f + t²/2! + t³/3! + · · · ≤ 1 + t f + t²,

and use this to show that E(e^{tf}) ≤ e^{t²}. If M ≠ 1, apply the result for M = 1 to the function f/M, which satisfies |f/M| ≤ 1.
and use this to show that E(etf ) ≤ et . If M 6= 1, apply the result for M = 1 to the function f /M , which satisfies |f /M | ≤ 1. (v) (A SLLN for sub-gaussians) Prove the following version of the SLLN: Let f1 , f2 , . . . be independent, sub-gaussian random variables with parameters α1 , α and suppose that there are constants C, d > 0 such that P2n, . . ., 2respectively, 2−d α ≤ Cn for all n. Then, k=1 k lim
Sn = 0 a.e., n
where Sn = f1 + · · · + fn . Suggestion: By Part (3) of Lemma 4.5, we just have to show that given any ε > 0, we have X ∞ ∞ X Sn µ {|Sn | ≥ nε} < ∞. µ ≥ ε = n n=1 n=1 Show that
µ {|Sn | ≥ nε} = µ {Sn ≥ nε} + µ {−Sn ≥ nε} , and use this, together with P Properties (ii) and (iii) above for sub-gaussian random variables, to prove that ∞ n=1 µ {|Sn | ≥ nε} < ∞. (vi) Let f1 , f2 , f3 , . . . be iid random variables on X. Suppose there is a constant C such that |fi | ≤ C for all i. Using the SLLN for sug-gaussians, prove that lim Snn = E a.e. where Sn = f1 + · · · + fn and E = E(f1 ) = E(f2 ) = · · · . 5. (Normal Number Theorem) In this problem we prove Borel’s fascinating result. Fix b ∈ N with b ≥ 2 and let S = {0, 1, . . . , b − 1} be the set of digits in base b. Let
µ_0 : P(S) → [0, 1] assign "fair" probabilities, µ_0(A) = #A/b, and let µ : S(C) → [0, 1] denote the infinite product of µ_0 with itself. Fix a word w ∈ S^k for some k and define

    g : S^k → R  by  g(x) = 1 if x = w, and g(x) = 0 otherwise.

For each i, define g_i : S^∞ → R by g_i(x_1, x_2, . . .) = g(x_i, x_{i+1}, . . . , x_{i+k−1}). Thus, g_i observes whether the word w occurs in the sequence (x_1, x_2, . . .) starting from the i-th digit. We begin our proof by proving that for a.e. x ∈ S^∞,

(6.39)    lim_{n→∞} S_n(x)/n = 1/b^k,

where S_n(x) = g_1(x) + g_2(x) + · · · + g_n(x). To prove this, and the rest of Borel's theorem, proceed as follows.
(i) Given ℓ ∈ N, prove that g_ℓ, g_{ℓ+k}, g_{ℓ+2k}, . . . are iid. Conclude that for some A_ℓ ∈ S(C) with µ(A_ℓ) = 1, for all x ∈ A_ℓ,

    lim_{n→∞} (g_ℓ(x) + g_{ℓ+k}(x) + g_{ℓ+2k}(x) + · · · + g_{ℓ+(n−1)k}(x))/n = 1/b^k.

(ii) Let A = A_1 ∩ · · · ∩ A_k. Prove that µ(A) = 1 and that for all x ∈ A,

    lim_{m→∞} S_{mk}(x)/(mk) = 1/b^k.

Suggestion: Break up the sum S_{mk}(x) = g_1(x) + g_2(x) + g_3(x) + · · · + g_{mk}(x) into k sums of the form g_ℓ(x) + g_{ℓ+k}(x) + g_{ℓ+2k}(x) + · · · + g_{ℓ+(m−1)k}(x), where ℓ = 1, 2, . . . , k, and use (i).
(iii) Now prove that (6.39) holds for all x ∈ A. Suggestion: For n > k, let m = ⌈n/k⌉, the smallest integer ≥ n/k. Prove that

    (S_{(m−1)k}(x)/((m−1)k)) · ((m−1)/m) ≤ S_n(x)/n ≤ (S_{mk}(x)/(mk)) · (m/(m−1)),

then let n → ∞. Note that m is the unique integer satisfying mk − k < n ≤ mk.
(iv) Using Problem 7 in Exercises 2.4, prove that the set of all x ∈ [0, 1] such that (6.38) holds is Borel and has Lebesgue measure 1.
(v) Finally, prove that the set of all words is countable and use the fact that the set of all bases b ≥ 2, b ∈ N is countable, to complete the proof of Borel's Normal Number Theorem. Congratulations, you have just proven one of the most historic theorems in probability theory!
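Champernowne's number from the previous subsection invites a quick empirical check: concatenate the base-10 numerals, count single-digit frequencies in a long prefix, and watch them approach 1/10 (the word-length k = 1 case of (6.38)). The Python sketch below is ours; the prefix length is an arbitrary choice, and convergence is visibly slow for the digit 0, which never appears as a leading digit.

```python
from collections import Counter

def champernowne_digits(n_numbers):
    """Digits of 0.123456789101112... coming from the first n_numbers naturals."""
    return "".join(str(k) for k in range(1, n_numbers + 1))

digits = champernowne_digits(100_000)
counts = Counter(digits)
freqs = {d: counts[d] / len(digits) for d in "0123456789"}
print(freqs["5"], freqs["0"])  # "5" is already near 0.1; "0" lags behind
```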
Notes and references on Chapter 6

§6.1: The first published proof of the FTA appeared in 1748 and is due to Jean le Rond d'Alembert (1717–1783), and a second proof was published in 1749 by Leonhard Euler (1707–1783). D'Alembert's and Euler's proofs were flawed and, in fact, many others such as Joseph-Louis Lagrange (1736–1813) and Pierre-Simon Laplace (1749–1827) published flawed proofs. Many credit the first proof of the FTA to Johann Carl Friedrich Gauss (1777–1855) in his 1799 doctoral thesis "A New Proof of the Theorem That Every Algebraic Rational Integral Function In One Variable can be Resolved into Real Factors of the First or the Second Degree." During a large part of Gauss' thesis, he pointed out flaws in the previous works, starting from d'Alembert and ending with Lagrange; for example [92]:

    Although the proofs for our theorem which are given in most of the elementary textbooks are so unreliable and so inconsistent with mathematical rigor that they are hardly worth mentioning, I shall nevertheless briefly touch upon them so as to leave nothing out. In order to demonstrate that any equation x^m + A x^{m−1} + B x^{m−2} + etc. + M = 0, or X = 0, has indeed m roots, they undertake to prove that X can be
    resolved into m simple factors. To this end they assume m simple factors x − α, x − β, x − γ, etc., where α, β, γ, etc. are as yet unknown, and set their product equal to the function X . . .

(I bolded the word "assume.") In other words, they assumed what they were trying to prove! What they did was assume that a polynomial had some mysterious types of roots (which later in Gauss' paper he called "impossible" roots), and then they tried to show that the roots must be complex numbers (the "possible" roots). We sometimes forget that even geniuses such as Euler made fundamental errors. So we too shouldn't get discouraged if we make a mistake now and again. Actually, Gauss' proof also was flawed, so Gauss really shouldn't have been so critical of others; what Gauss did was make a claim he didn't prove [92]:

    It seems to have been proved with sufficient certainty that an algebraic curve can neither be broken off suddenly anywhere (as happens e.g. with the transcendental curve whose equation is y = 1/ln x) nor lose itself, so to say, in some point after infinitely many coils (like the logarithmic spiral). As far as I know, nobody has raised any doubts about this. However, should someone demand it, then I will undertake to give a proof that is not subject to any doubt, on some other occasion.

The "other occasion," however, never came until 1920, 65 years after Gauss' death, when Alexander Ostrowski (1893–1986) [216] filled in the details of Gauss' missing claim. Thus, although something may seem obvious to us or others with "sufficient certainty," we should still prove it! Gauss was very fond of the FTA: he gave altogether four different proofs of it during his lifetime. My favorite, and perhaps the most elementary of all proofs of the FTA, is due to Jean-Robert Argand (1768–1822), who published it in 1806 and an improved version in 1814/1815; to read about Argand's argument see [275, p. 268]. For more on the history of the Basel problem, see [12, 44, 77].
§6.2: The earliest proof of Theorem 6.10 on the characterization of Riemann–Stieltjes integrability that I could find is due to William Young [315, p. 133]; Corollary 6.11 was proved by Lebesgue in 1902 (sufficiency) [171] and 1904 (necessity) [172].

§6.3–6.6: Aleksandr Khinchin (1894–1959) introduced the term "Strong Law of Large Numbers" in 1928 [146]. In 1930, Kolmogorov [152] proved Etemadi's Strong Law of Large Numbers with the additional assumption that the random variables were independent (rather than pairwise independent), and in 1933, in his famous book [151], he states a SLLN for independent and not-necessarily-identically distributed random variables; his theorem reads: If f_1, f_2, . . . are independent random variables with finite expectations and ∑_{n=1}^∞ Var(f_n)/n² < ∞, then the SLLN holds for f_1, f_2, . . . in the sense that the event

    lim_{n→∞} [f_1 + · · · + f_n − (E(f_1) + · · · + E(f_n))]/n = 0

occurs with probability one.
Part 4
Further results in measure and integration
CHAPTER 7
Fubini's theorem and Change of Variables

    I consider here the two-dimensional integral of a function of two variables x, y. And, as it is now necessary in this field of study, I refer to the integral of Lebesgue. The theorem, which we will prove, is the following: If f(x, y) is a function of two variables x, y, bounded or unbounded, integrable in an area Γ of the plane (x, y), then one always has

        ∫∫_Γ f(x, y) dσ = ∫ dy ∫ f(x, y) dx = ∫ dx ∫ f(x, y) dy,

    where by dσ we denote the area element.

Opening words of Guido Fubini's (1879–1943) 1907 paper [101].
7.1. Introduction: Iterated integration

In calculus, when evaluating double integrals of functions of several variables, we often switched the order of integration; we'll see that funny things can happen in the Riemann world when we aren't careful with the functions we're integrating.

7.1.1. Fubini's theorem. Consider a rectangle R = [a, b] × [c, d] in R² and a bounded function f : R → R. We are interested in when the following equalities hold for Lebesgue and Riemann integrals:¹

(7.1)    ∫_R f(x, y) dxdy = ∫_c^d ( ∫_a^b f(x, y) dx ) dy = ∫_a^b ( ∫_c^d f(x, y) dy ) dx.
              (A)                     (B)                          (C)

The integral (A) is a two-dimensional integral using whatever integration theory you are considering. The integrals (B) and (C) are two (iterated) one-dimensional integrals, which mean to first do the inner integral and second the outer integral, again using whatever integration theory you are considering. In Figure 7.1 we give a heuristic review of these integrals as Riemann integrals (there are related explanations for the integrals as Lebesgue integrals):
Figure 7.1. Pictures to understand double integrals. On the left is a graph of z = f(x, y) over the rectangle R. (For this example, the graph of f looks like part of a plane.)

¹In Riemann integration, the double integral in (A) is usually denoted by ∬_R f(x, y) dA with two integral signs, but we shall write it with one integral sign.

The volume of the "tower" in the middle picture is height × area of base = f(x, y) × dxdy. Summing up these volumes gives the Riemann integral ∫_R f(x, y) dxdy. (Here, "summing" really means to take a limit of Riemann sums; you can review the precise definition if you want to, although it's not necessary.) Now consider the iterated integral (B) as Riemann integrals, which means that the inside and the outside integrals have to be defined as Riemann integrals. Thus, first of all, for each y ∈ [c, d], the Riemann integral

(I)    F(y) := ∫_a^b f(x, y) dx  must exist.
Notice that at each point y ∈ [c, d] this integral represents the area of the cross-sectional slice of the solid at y, the shaded trapezoidal region in the right-hand picture in Figure 7.1. The inside integral in (B) is thus a cross-sectional "area function" depending on y ∈ [c, d]. Then, second of all, for (B) to make sense, this area function must be Riemann integrable:

(II)    ∫_c^d F(y) dy  must exist.
Given the two conditions (I) and (II), we have the definition

(7.2)    ∫_c^d ( ∫_a^b f(x, y) dx ) dy := ∫_c^d F(y) dy.

Another way to describe the function F(y) in (I) uses "slice functions" as follows. Let us hold y ∈ [c, d] fixed. Then define

    f_y : [a, b] → R  by  f_y(x) := f(x, y) for all x ∈ [a, b];

thus, f_y is just the function f(x, y) as a function of x, with y held fixed, called the y-section, or y-slice, of f. The left-hand picture in Figure 7.2 shows a schematic of the y-slice.
Figure 7.2. The graph of z = f(x, y) is the top slanted portion of the box. Graphs of the slices f_y and f_x are obtained by slicing the top at y and x, respectively. The areas of the shaded regions represent the integrals ∫_a^b f_y(x) dx and ∫_c^d f_x(y) dy, respectively.
Provided that the function f_y : [a, b] → R is Riemann integrable over [a, b], its Riemann integral is the function F(y) defined in (I) above:

    F(y) := ∫_a^b f_y(x) dx = ∫_a^b f(x, y) dx.

Then, as already mentioned, the iterated integral (B) means that the function F(y) is Riemann integrable on [c, d], and then the iterated integral in (B) is defined in
(7.2). The iterated integral (C) in (7.1) has a similar meaning using "x-sections" as follows. For each x ∈ [a, b], we define (see the right-hand picture in Figure 7.2)

    f_x : [c, d] → R  by  f_x(y) := f(x, y) for all y ∈ [c, d].

Then, for the iterated integral in (C) to be defined, we require that for each x ∈ [a, b],

    G(x) := ∫_c^d f_x(y) dy = ∫_c^d f(x, y) dy

is defined; the value G(x) is just the area of the shaded region on the right-hand side in Figure 7.2. We further require that G(x) be Riemann integrable, and then

    ∫_a^b ( ∫_c^d f(x, y) dy ) dx := ∫_a^b G(x) dx.

To define the iterated integrals in (B) and (C) using Lebesgue integrals, just replace the word "Riemann" with "Lebesgue" in the above descriptions. Now there are six possible things that can happen with the integrals (A), (B), (C):
(7.3)
    (1) (A) exists but (B), (C) do not exist;
    (2) (B) exists but (A), (C) do not exist;
    (3) (C) exists but (A), (B) do not exist;
    (4) (A), (B) exist but (C) does not exist;
    (5) (A), (C) exist but (B) does not exist;
    (6) (B), (C) exist but (A) does not exist.
It may surprise you that

    all these situations are possible in the Riemann world!

That is, if all the integrals in (7.1) are Riemann integrals, then one can find examples of bounded functions realizing every one of these six situations! This quagmire shows just how complicated Riemann integration can be when it comes to iterated integration. However, if all the integrals are Lebesgue integrals, for all practical purposes the integrals in (7.1) always exist and are always equal!² To summarize: For Riemann's integral we have to worry about the contradictions in (7.3), while for Lebesgue's theory we don't have such worries (at least for bounded functions; unbounded functions can cause trouble). This gives another instance of our thesis: Lebesgue's integral simplifies life! We shall give an example illustrating situations (2) and (3) in (7.3) and leave the other situations to the exercises. We shall do (3), since a function satisfying (2) can be found by switching the roles of x and y.

²Here we keep the boundedness assumption and we require f to be Borel measurable. If we replace Borel measurable with Lebesgue measurable, then the equalities in (7.1) still hold in an appropriate sense; see Theorem 7.13. For what can happen when f is not Lebesgue measurable, see the Notes and references section.
7.1.2. Hobson's book. Ernest William Hobson (1856–1933) wrote the first (now free) book on Lebesgue integration in English, The theory of functions of a real variable and the theory of Fourier's series [132], whose first edition was published in 1907, five years after Lebesgue's thesis was published. On pages 428–430 of his book he gives various examples of the strange situations referred to above. In Hobson's first example, he considers the function f : R → R, where R = [0, 1]², defined by

    f(x, y) := 1 if x ∈ Q,  and  f(x, y) := 2y if x ∉ Q.
Hobson attributes this function to Carl Johannes Thomae (1840–1921) [283]. We shall prove that in (7.1), in the sense of Riemann integrals, (A) and (B) do not exist while (C) does exist. On the other hand, we shall see that in the Lebesgue context, all the integrals in (7.1) exist and are equal.

Riemann case: Here we need the R² version of Lebesgue's Theorem on Riemann integrability from Section 6.2 (Theorem 6.10; actually, just Corollary 6.11), which states that f : R → R is Riemann integrable if and only if f is continuous a.e. on the rectangle R. For a proof of this fact, one can simply generalize the results in Section 6.2 to higher dimensions. In particular, in Hobson's first example, it's not hard to prove that f is discontinuous at all points of R = [0, 1]² minus the horizontal line {y = 1/2}. Thus, by Lebesgue's theorem, the Riemann integral ∫_R f(x, y) dxdy does not exist. We now show that the iterated integral in (B) also does not exist. To see this, fix any y ∈ [0, 1]. Then the y-slice function f_y : [0, 1] → R is given by the formula

(7.4)    f_y(x) := 1 if x ∈ Q,  and  f_y(x) := 2y if x ∉ Q.

If y ≠ 1/2, we have 2y ≠ 1, so in this case it follows that f_y(x) is discontinuous at all points x ∈ [0, 1] (the function f_y is a Dirichlet-type function). This shows, by Lebesgue's theorem, that for y ≠ 1/2, the Riemann integral ∫_0^1 f_y(x) dx = ∫_0^1 f(x, y) dx does not exist. Since the inside integral in (B) fails to make sense (for all y ∈ [0, 1] except y = 1/2), the iterated integral (B) does not exist.

Finally, we show that the iterated integral (C) does exist. Fix x ∈ [0, 1] and consider the x-slice function f_x : [0, 1] → R. If x ∈ Q, then

    f_x(y) = 1 for all y ∈ [0, 1],

so ∫_0^1 f_x(y) dy exists and equals ∫_0^1 1 dy = 1. If x ∉ Q, then

    f_x(y) = 2y for all y ∈ [0, 1],

so ∫_0^1 f_x(y) dy exists and equals ∫_0^1 2y dy = [y²]_0^1 = 1. To summarize, for any x ∈ [0, 1], the Riemann integral G(x) := ∫_0^1 f_x(y) dy exists and equals 1. The constant function 1 is Riemann integrable, so the iterated integral

    ∫_0^1 ( ∫_0^1 f(x, y) dy ) dx

exists and equals ∫_0^1 G(x) dx = ∫_0^1 1 dx = 1. To summarize: In the Riemann world, for this example the integrals (A) and (B) don't exist while (C) does exist and equals 1. In the Lebesgue case, this strange scenario won't happen.
7.1. INTRODUCTION: ITERATED INTEGRATION
Lebesgue case: Notice that f(x, y) = 2y for a.e. (x, y) ∈ R (the exceptional set Q × [0, 1] has measure zero), therefore ∫_R f exists and equals ∫_R 2y = 1. Also, from the formula (7.4), we see that for any y ∈ [0, 1], f_y(x) = 2y for a.e. x ∈ [0, 1] (precisely, for all non-rational x), so ∫_0^1 f_y(x) dx exists as a Lebesgue integral and equals ∫_0^1 2y dx = 2y. Therefore, F(y) := ∫_0^1 f_y(x) dx = 2y. The function 2y is Lebesgue integrable on [0, 1], so the iterated integral

∫_0^1 ( ∫_0^1 f(x, y) dx ) dy

exists and equals ∫_0^1 F(y) dy = ∫_0^1 2y dy = 1. A very similar argument shows that

∫_0^1 ( ∫_0^1 f(x, y) dy ) dx

also exists and equals 1. Therefore, as Lebesgue integrals, we have

∫_R f = ∫_0^1 ∫_0^1 f(x, y) dx dy = ∫_0^1 ∫_0^1 f(x, y) dy dx,

all of which equal 1.
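The two slice computations above are easy to check numerically. The following sketch (our own illustration, not from the text) integrates the two slice formulas f_x(y) = 1 and f_x(y) = 2y with a midpoint rule; floating-point numbers cannot distinguish rational from irrational inputs, so we integrate the two formulas directly rather than sampling f itself.

```python
def midpoint_integral(g, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# x-slice for rational x: f_x(y) = 1.
inner_rational = midpoint_integral(lambda y: 1.0, 0.0, 1.0)
# x-slice for irrational x: f_x(y) = 2y.
inner_irrational = midpoint_integral(lambda y: 2.0 * y, 0.0, 1.0)

# Both slice integrals equal 1, so F(x) = 1 for every x and the
# iterated integral (C) equals 1, as computed above.
print(inner_rational, inner_irrational)
```

Both numbers agree with the exact value 1 to high accuracy, since the midpoint rule is exact for constant and linear integrands up to rounding.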
◮ Exercises 7.1.

For each function in the following problems, prove that the integrals (A), (B), and (C) in (7.1) each exist as Lebesgue integrals and the formula (A) = (B) = (C) holds. Considering the integrals as Riemann integrals, please follow the directions concerning (A), (B) and (C) as indicated in each problem.

1. Scenario (1) in (7.3): Define f : [0, 1]² → R by

f(x, y) := { 1/min{q₁, q₂} if x, y ∈ Q;  0 otherwise },

where in the first line, x, y ∈ Q are written as x = p₁/q₁ and y = p₂/q₂ in lowest terms. Prove that as Riemann integrals, (A) exists but (B) and (C) do not exist.

2. Scenario (6) in (7.3): Here is an example due to Alfred Israel Pringsheim (1850–1941) in 1900 [229]. Let P ⊆ [0, 1]² be the set of all points (x, y) in [0, 1]² which have finite decimal expansions of the same length; that is, (x, y) ∈ P if for some k ∈ N, we can write

x = 0.a₁a₂···a_k,   y = 0.b₁b₂···b_k,

where a_k and b_k are not both zero.
(i) Prove that P is dense in [0, 1]².
(ii) Let f = χ_P : [0, 1]² → R be the characteristic function of P. Prove that as Riemann integrals, (B), (C) exist (and equal zero) but (A) does not exist.

3. Another scenario (6) in (7.3): Here is an example related to Pringsheim's example. Let D ⊆ [0, 1]² be the set of all points (x, y) in [0, 1]² of the form (i/2ⁿ, j/2ⁿ) where i, j, n ∈ N with 1 ≤ i, j ≤ 2ⁿ − 1.
(i) Prove that D is dense in [0, 1]².
(ii) Let f = χ_D : [0, 1]² → R be the characteristic function of D. Prove that as Riemann integrals, (B), (C) exist (and equal zero) but (A) does not exist.

4. Scenarios (4) and (5) in (7.3): Define f : [0, 1]² → R by

f(x, y) := { 1/q if x, y ∈ Q;  0 otherwise },

where x ∈ Q is written as x = p/q in lowest terms. Prove that as Riemann integrals, (A) and (B) exist but (C) does not exist. Give a similar example where (A) and (C) exist but (B) does not exist.
7. FUBINI’S THEOREM AND CHANGE OF VARIABLES
7.2. Product measures, volumes by slices, and volumes of balls

Throughout this and the next sections we work with two σ-finite measure spaces (X, S, µ) and (Y, T, ν). Our goal is to define a measure ω on X × Y (the product of µ and ν) and prove the Fubini–Tonelli theorem:

∫_{X×Y} f dω = ∫_X ( ∫_Y f(x, y) dν ) dµ = ∫_Y ( ∫_X f(x, y) dµ ) dν

for any measurable f : X × Y → R that is either nonnegative or integrable. The first step toward this goal is to understand the measure ω on X × Y, which is our immediate goal in this section; Section 7.3 deals with the Fubini–Tonelli theorem. We remark that for notational simplicity we focus on the product of only two measures, although it is straightforward to study the product of any finite number of measures using the same techniques. We also remark that we work with σ-finite measures because the Fubini–Tonelli theorem is generally false for non-σ-finite measures, as seen in Problem 1.

7.2.1. The Product Measure I: Rectangle Method. We know from Theorem 2.5 that the product of finitely additive set functions on a semiring is finitely additive on the product semiring. In particular, the product ω : S × T → [0, ∞], defined by³

ω(A × B) := µ(A) · ν(B)   for all A ∈ S, B ∈ T,

is finitely additive. Here's a picture of the product measure:

Figure 7.3. The semiring S × T consists of all "rectangles" A × B where A ∈ S and B ∈ T (see Proposition 1.2). The ω-measure of a rectangle A × B is µ(A) ν(B) = "length" × "height".
The following lemma says that ω is not just finitely additive, but countably additive.

Lemma 7.1. The set function ω : S × T → [0, ∞] is a σ-finite measure on the semiring S × T.

Proof: (The proof of this theorem is almost identical to the proof of Theorem 2.5.) Let A × B ∈ S × T, and suppose that

A × B = ⋃_{n=1}^∞ A_n × B_n

³ We always follow the conventions that 0 · ∞ = 0 = ∞ · 0.
is a union of pairwise disjoint elements A_n × B_n ∈ S × T. Since the A_n × B_n's are pairwise disjoint sets, one can check that

χ_{A×B}(x, y) = Σ_{n=1}^∞ χ_{A_n×B_n}(x, y),

which can further be written as

χ_A(x) χ_B(y) = Σ_{n=1}^∞ χ_{A_n}(x) χ_{B_n}(y).

Fixing y, and integrating both sides of this equality term-by-term with respect to x (and using the series MCT in Theorem 5.21), we obtain

µ(A) χ_B(y) = ∫_X χ_A(x) χ_B(y) dµ = Σ_{n=1}^∞ ∫_X χ_{A_n}(x) χ_{B_n}(y) dµ = Σ_{n=1}^∞ µ(A_n) χ_{B_n}(y).

Integrating the equality µ(A) χ_B(y) = Σ_{n=1}^∞ µ(A_n) χ_{B_n}(y) term-by-term with respect to y, we obtain

µ(A) ν(B) = Σ_{n=1}^∞ µ(A_n) ν(B_n).

To verify the σ-finiteness claim, write X = ⋃_{n=1}^∞ X_n and Y = ⋃_{n=1}^∞ Y_n where X_n ∈ S and Y_n ∈ T and µ(X_n), ν(Y_n) < ∞ for all n. Then it follows that X × Y = ⋃_{m,n=1}^∞ X_m × Y_n where X_m × Y_n ∈ S × T and ω(X_m × Y_n) < ∞. Hence, ω is σ-finite. ∎
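For a concrete finite instance of the additivity just proved, one can tile a rectangle by finitely many disjoint subrectangles and compare measures. A small sketch of our own (with µ = ν = length on intervals):

```python
# Disjoint rectangles tiling [0,1) x [0,1): split the base at 1/2, then split
# the left column at height 0.3 and the right column at height 0.7.
pieces = [((0.0, 0.5), (0.0, 0.3)),
          ((0.0, 0.5), (0.3, 1.0)),
          ((0.5, 1.0), (0.0, 0.7)),
          ((0.5, 1.0), (0.7, 1.0))]

def length(interval):
    a, b = interval
    return b - a

# omega(A x B) = mu(A) * nu(B); summing over the disjoint tiling recovers
# the omega-measure of the whole unit square.
total = sum(length(a) * length(b) for a, b in pieces)
print(total)
```

The four products sum to the measure of the square, 1, exactly as the lemma's additivity requires.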
Now that we have a σ-finite measure ω : S × T → [0, ∞] on a semiring, we know by the Extension Theorem that it extends uniquely to a measure on S(S × T), the σ-algebra generated by S × T. The notation S(S × T) is quite ugly and confusing, so we shall denote this σ-algebra by S ⊗ T. This σ-algebra is called the product σ-algebra. Thus,

The product σ-algebra, S ⊗ T, is the σ-algebra generated by S × T.

We summarize our findings in the following theorem.

The product measure
Theorem 7.2. If (X, S, µ) and (Y, T, ν) are σ-finite measure spaces, then there is a unique measure ω on the product σ-algebra S ⊗ T that gives the usual volume on "rectangles":

ω(A × B) = µ(A) ν(B)   for all A × B ∈ S × T.

This measure is also denoted by µ ⊗ ν.
7.2.2. Borel and Lebesgue measurable sets. Consider the Borel measure spaces (R^k, B^k, m_k) and (R^ℓ, B^ℓ, m_ℓ), where m_i denotes i-dimensional Lebesgue measure. Let n = k + ℓ. Then Theorem 7.2 shows that we have a σ-algebra B^k ⊗ B^ℓ of subsets of R^n (= R^k × R^ℓ) and a measure

m_k ⊗ m_ℓ : B^k ⊗ B^ℓ → [0, ∞]

such that

(m_k ⊗ m_ℓ)(A × B) = m_k(A) m_ℓ(B)   for all A × B ∈ B^k × B^ℓ.

Of course, we already have a perfectly natural σ-algebra of subsets of R^n, namely the Borel subsets B^n of R^n, and a perfectly natural measure,

m_n : B^n → [0, ∞],

n-dimensional Lebesgue measure. This raises two questions:

(a) Is B^n = B^k ⊗ B^ℓ?   (b) Is m_n = m_k ⊗ m_ℓ?

By definition of the Borel sets, the first question is whether or not S(I^n) = S(I^k) ⊗ S(I^ℓ). It's straightforward to check that I^n = I^k × I^ℓ, so the first question can be written as

Is S(I^k × I^ℓ) = S(I^k) ⊗ S(I^ℓ)?
Theorem 7.3 below shows that the answer to this question is "yes." Knowing that the σ-algebras B^n and B^k ⊗ B^ℓ are identical, we get that the answer to our second question is "yes" as well. Indeed, given any I ∈ I^n we can split the box as I = I₁ × I₂ where I₁ ∈ I^k and I₂ ∈ I^ℓ. Then one can check that m_n(I) = m_k(I₁) · m_ℓ(I₂). By definition of m_k ⊗ m_ℓ, we have m_k(I₁) · m_ℓ(I₂) = (m_k ⊗ m_ℓ)(I₁ × I₂). Thus, m_n(I) = (m_k ⊗ m_ℓ)(I) for all I ∈ I^n. Now the uniqueness in the Extension Theorem implies that m_n = m_k ⊗ m_ℓ. Here's the theorem that in particular implies (a).

Theorem 7.3. Let A and B be collections of subsets of nonempty sets X and Y, respectively, and suppose that the sets X and Y are countable unions of sets in A and B, respectively. Then S(A × B) = S(A) ⊗ S(B).

Proof: (If you drop the countable union assumption on X and Y, this theorem is false, as you'll find out in Problem 2.) We have to show that S(A × B) equals the σ-algebra generated by S(A) × S(B), that is,

S(A × B) = S(S(A) × S(B));

in other words, we need to show that

(1) S(A × B) ⊆ S(S(A) × S(B));
(2) S(S(A) × S(B)) ⊆ S(A × B).

To prove (1) and (2) we use (as you might have guessed) the Principle of Appropriate Sets. The first inclusion is easy: Observe that A × B ⊆ S(A) × S(B) ⊆ S(S(A) × S(B)). Since S(A × B) is the smallest σ-algebra containing A × B, we get (1). The inclusion (2) is not so easy, and it's here that we need the assumption that X and Y are countable unions of sets in A and B, respectively. By the Principle of Appropriate Sets we just have to show that S(A) × S(B) is a subset of S(A × B).
The trick to show this is to work on each factor of S(A) × S(B) separately; for example, let's show that

(7.5)   A ∈ S(A)  ⟹  A × Y ∈ S(A × B).
An analogous argument shows that

B ∈ S(B)  ⟹  X × B ∈ S(A × B).
Suppose for the moment that we have proven these implications. Then given any A × B ∈ S(A) × S(B), we have A × Y, X × B ∈ S(A × B), so

A × B = (A × Y) ∩ (X × B) ∈ S(A × B),

since σ-algebras are closed under intersections. This shows that S(A) × S(B) is a subset of S(A × B). Thus, it remains to prove (7.5).

First of all, if A ∈ A, then (7.5) is easy: By assumption, we can write Y = ⋃_{i=1}^∞ B_i where B_i ∈ B for each i; hence, as A × B_i ∈ A × B for each i,

A × Y = ⋃_{i=1}^∞ A × B_i ∈ S(A × B).

To prove (7.5) for general A ∈ S(A), consider the set

A′ := {A ⊆ X ; A × Y ∈ S(A × B)}.

Using the fact that S(A × B) is a σ-algebra, it's easy to check that A′ is a σ-algebra, and by what we just proved, we know that A ⊆ A′. Thus, S(A) ⊆ A′, which implies that (7.5) holds. ∎
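On finite sets, σ-algebras can be generated by brute force, which gives a quick sanity check of Theorem 7.3 when X and Y are covered by the generators. The sketch below is our own illustration (the sets X, Y, A, B are hypothetical toy choices, not from the text):

```python
def sigma_algebra(universe, gens):
    """Brute-force sigma-algebra on a finite universe: close the generators
    under complements and (finite) unions until nothing new appears."""
    U = frozenset(universe)
    sets = {frozenset(), U} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for s in list(sets):
            if U - s not in sets:
                sets.add(U - s)
                changed = True
        for s in list(sets):
            for t in list(sets):
                if s | t not in sets:
                    sets.add(s | t)
                    changed = True
    return sets

X, Y = {1, 2}, {1, 2, 3}
A, B = [{1}, {2}], [{1}, {2, 3}]          # X and Y are unions of generators
XY = {(x, y) for x in X for y in Y}

rect = [{(x, y) for x in a for y in b} for a in A for b in B]
lhs = sigma_algebra(XY, rect)             # S(A x B)

SA, SB = sigma_algebra(X, A), sigma_algebra(Y, B)
prod = [{(x, y) for x in a for y in b} for a in SA for b in SB]
rhs = sigma_algebra(XY, prod)             # sigma-algebra generated by S(A) x S(B)

print(lhs == rhs, len(lhs))
```

Here the four generating rectangles partition the 6-point set X × Y, so both σ-algebras consist of the 16 unions of those atoms, and the two closures coincide, as the theorem predicts. (Problem 2 in Exercises 7.2 shows how this fails when the covering hypothesis is dropped.)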
A topological space is said to be second countable if it has a countable basis, that is, a countable collection of sets C such that any open set is a countable union of sets in C. For instance, R^n is second countable since the collection of all open boxes with rational vertices forms a countable basis. In Problem 3 you will prove the following corollary.

Corollary 7.4. If X and Y are second countable topological spaces and B(X) and B(Y) are their corresponding Borel σ-algebras, then

B(X × Y) = B(X) ⊗ B(Y),

where the set X × Y is endowed with the product topology. Thus, a set U ⊆ X × Y is open if any point in U is contained in a set of the form V × W ⊆ U where V and W are open sets in X and Y, respectively. This corollary also implies that B^n = B^k ⊗ B^ℓ (why?). Is the same true for Lebesgue measurable sets? That is, if M^i denotes the i-dimensional Lebesgue measurable sets,

Is M^n = M^k ⊗ M^ℓ, where n = k + ℓ?

The answer is "no." In Problem 5 you will show that the σ-algebra M^k ⊗ M^ℓ is a proper subset of M^n. In order to get equality, we have to complete the σ-algebra M^k ⊗ M^ℓ as described at the end of Section 4.3.5. You will see how to do this in Problem 6 — this problem is a must-do exercise!

7.2.3. The π-λ theorem. Let I and L be collections of sets with I ⊆ L. A question that we have considered quite often in the past is

Question: When can we conclude that S(I) ⊆ L?
Of course, we know one answer given by our old friend, the
Principle of Appropriate Sets: If L is a σ-algebra and it contains the "appropriate sets," namely all sets in I, then L must contain all of S(I).

Thus, given I ⊆ L, to answer "yes" to our question we can try to prove three things:
(a) ∅ ∈ L.
(b) If A ∈ L, then A^c ∈ L.
(c) If A₁, A₂, ... ∈ L, then ⋃_n A_n ∈ L.

In concrete situations, proving (a) and (b) is usually easy, but (as we'll see in Example 7.1 below and in Theorem 7.6 later) proving (c) is sometimes difficult! This should not be surprising: (a) is usually given and (b) involves checking complements of individual sets in L, while (c) involves checking closure under unions for an arbitrary family of countably many sets in L.

Example 7.1. Let I be a semiring and suppose that we're given two probability measures

µ : S(I) → [0, 1]   and   ν : S(I) → [0, 1]

such that µ = ν on I; must it be true that µ = ν on S(I)? Of course, we know the answer is "yes" by uniqueness in the Extension Theorem, but let's try to use the Principle of Appropriate Sets to prove it. Thus, let us put L := {A ∈ S(I) ; µ(A) = ν(A)}; then we want to show that S(I) ⊆ L. (Again, by the Extension Theorem we know that µ = ν on S(I) and hence S(I) ⊆ L — however, the point here is to try to use the PAS to prove it.) We're given I ⊆ L, so to use the PAS we just have to check the conditions (a), (b) and (c) above for L to be a σ-algebra. (a) is obvious, and to check (b), observe that if A ∈ L, then µ(A) = ν(A), so

µ(A^c) = 1 − µ(A) = 1 − ν(A) = ν(A^c),
which implies that A^c ∈ L. Now we come to (c): Is L closed under arbitrary countable unions? This is not obvious, because measures only behave well on countable unions of pairwise disjoint sets. Indeed, if A₁, A₂, ... are pairwise disjoint elements of L, then µ(A_n) = ν(A_n) for each n, so with A = ⋃_{n=1}^∞ A_n, we have

µ(A) = Σ_{n=1}^∞ µ(A_n) = Σ_{n=1}^∞ ν(A_n) = ν(A).

To summarize, we've shown that the collection L (1) contains the empty set, (2) is closed under complements and (3) is closed under countable unions of pairwise disjoint sets. There is a special name given to such collections.
A collection L of sets is called a λ-system (λ for lattice) if L
(1) contains the empty set;
(2) is closed under complements;
(3) is closed under countable unions of pairwise disjoint sets.

Back to our original question: Given collections of sets I and L with I ⊆ L, when can we conclude that S(I) ⊆ L? Theorem 7.5 below, called the π-λ theorem (π for product), says that if we assume that I is closed under intersections (or "products") and L is a λ-system, then we get S(I) ⊆ L. λ-systems and the π-λ theorem were first introduced in Eugene Borisovich Dynkin's (1924–) 1959 book on Markov processes (according to the supplementary notes in the 1961 English translation [78]).
Example 7.2. Since the I in Example 7.1 was a semiring (hence closed under intersections), by the π-λ theorem it follows that S(I) ⊆ L in that example, a fact we already knew.
π-λ theorem
Theorem 7.5. Let I be a collection of sets closed under intersections and let L be a collection of sets such that

I ⊆ L   and   L is a λ-system.

Then S(I) ⊆ L.
Proof: Let L₀ be the smallest⁴ λ-system containing I. Then L₀ ⊆ L. We shall prove that L₀ is a σ-algebra. Then S(I) ⊆ L₀ since S(I) is the smallest σ-algebra containing I, and hence S(I) ⊆ L as well. Our first step toward showing that L₀ is a σ-algebra is to prove the following claim:

Step 1: If L₀ is closed under intersections, then it's a σ-algebra. To prove this, assume that L₀ is closed under intersections and let A₁, A₂, ... ∈ L₀; we have to show that ⋃_{n=1}^∞ A_n ∈ L₀. Observe that

⋃_{n=1}^∞ A_n = ⋃_{n=1}^∞ B_n,

where B_n = A_n ∩ A₁^c ∩ A₂^c ∩ ··· ∩ A_{n−1}^c. By definition of a λ-system, L₀ is closed under complements and, by assumption, intersections, so B_n ∈ L₀. Moreover, one can check that the B_n's are pairwise disjoint, so ⋃_{n=1}^∞ B_n ∈ L₀. This shows that L₀ is a σ-algebra.

Step 2: Therefore, according to Step 1, to prove that L₀ is a σ-algebra we just have to show that L₀ is closed under intersections. For this reason, given any A ∈ L₀, let us put

L_A = {B ∈ L₀ ; A ∩ B ∈ L₀}.

Our goal is to prove that L₀ ⊆ L_A, which shows that for all B ∈ L₀, A ∩ B ∈ L₀; since A was arbitrary, it follows that L₀ is closed under intersections. To prove that L₀ ⊆ L_A, we just have to show that L_A is a λ-system and that it contains I; this shows that L₀ ⊆ L_A because L₀ is the smallest λ-system containing I. Given A ∈ L₀, we show that L_A is a λ-system; we shall prove that L_A contains I in Step 4 below. It's easy to check that ∅ ∈ L_A. Let B ∈ L_A; we need to show that B^c ∈ L_A, which means that A ∩ B^c ∈ L₀. To see this, note that

A ∩ B^c = (A^c ∪ (A ∩ B))^c,

by a little set theory algebra. Observe that A ∩ B ∈ L₀ (because B ∈ L_A) and, since L₀ is closed under complements, A^c ∈ L₀. Also, A^c and A ∩ B are disjoint, so as L₀ is closed under unions of disjoint sets, we have A^c ∪ (A ∩ B) ∈ L₀. By closedness under complements again, it follows that A ∩ B^c ∈ L₀. Thus, B^c ∈ L_A. Finally, let B₁, B₂, ... be pairwise disjoint elements of L_A; we need to show that C := ⋃_{n=1}^∞ B_n ∈ L_A. To do so, observe that

A ∩ C = ⋃_{n=1}^∞ A ∩ B_n.
4 Copying the proof of Theorem 1.4 you can prove that any collection of subsets of a given set always has a smallest λ-system containing it.
By definition of L_A, we have A ∩ B_n ∈ L₀ for each n, and since L₀ is closed under countable unions of disjoint sets, it follows that A ∩ C ∈ L₀. Thus, C ∈ L_A.

Step 3: Given I ∈ I, we claim that L₀ ⊆ L_I. Indeed, since I ∈ L₀ also, by Step 2 we know that L_I is a λ-system. Moreover, since I is closed under intersections, by definition of L_I it follows that I ⊆ L_I. Since L₀ is the smallest λ-system containing I, we get L₀ ⊆ L_I.

Step 4: Finally, given A ∈ L₀, we show that I ⊆ L_A. Let I ∈ I; we must show that I ∈ L_A, which means A ∩ I ∈ L₀. However, observe that

A ∈ L₀  ⟹  A ∈ L_I  (by Step 3)  ⟹  A ∩ I ∈ L₀  (definition of L_I).

This completes our proof. ∎
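On a finite set everything can be checked by brute force: the sketch below (our own hypothetical example, not from the text) closes a π-system under the λ-system operations and confirms that the result is the generated σ-algebra, as the theorem predicts.

```python
def close_lambda(universe, gens):
    """Smallest lambda-system on a finite universe containing gens: close
    under complements and unions of disjoint sets (finite unions suffice)."""
    U = frozenset(universe)
    sets = {frozenset()} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for s in list(sets):
            if U - s not in sets:
                sets.add(U - s)
                changed = True
        for s in list(sets):
            for t in list(sets):
                if not (s & t) and s | t not in sets:
                    sets.add(s | t)
                    changed = True
    return sets

def close_sigma(universe, gens):
    """Smallest sigma-algebra: close under complements and all unions."""
    U = frozenset(universe)
    sets = {frozenset(), U} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for s in list(sets):
            if U - s not in sets:
                sets.add(U - s)
                changed = True
        for s in list(sets):
            for t in list(sets):
                if s | t not in sets:
                    sets.add(s | t)
                    changed = True
    return sets

X = {1, 2, 3, 4}
I = [{1, 2}, {2, 3}, {2}]   # closed under intersections: a pi-system
print(close_lambda(X, I) == close_sigma(X, I))
```

For this π-system both closures give the full power set of X, so the smallest λ-system containing I really is S(I), in agreement with the proof above.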
7.2.4. The Product Measure II: Calculus Method. In calculus, how would you determine the area of the region A shown in Figure 7.4?

Figure 7.4. A calculus exercise: Find the area of A, the region between the graphs of f and g, by the method of cross sections.

Probably most of you would use the method of cross-sections, also called the method of slices, which finds the area of A ⊆ R² by integrating all the vertical (or horizontal) slices of A. More precisely, for each x ∈ X, let A_x = {y ∈ R ; (x, y) ∈ A}, which is the vertical cross section (or slice) of A at x as seen in Figure 7.4. In the picture, A_x is just the interval [g(x), f(x)]. Then m(A_x) is the length of A_x and the calculus answer is

Area of A = ∫_X m(A_x) dx,

or in more familiar notation ∫_a^b (f(x) − g(x)) dx. The same principle is used to compute volumes in R³, as in Figure 7.5.
Figure 7.5. Another calculus exercise: Denoting the (solid) ball by A, find the volume of A by the method of cross sections.
As before, denote by A the region in question, which in this case is a (solid) ball,
and for each x on the horizontal axis in Figure 7.5, let A_x = {y ∈ R² ; (x, y) ∈ A}, the cross section of the ball at x, which is a disk. Then the

Volume of A = ∫_X m(A_x) dx.
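As a quick numerical illustration of the slice method (a sketch of our own, not from the text): the x-slices of the unit ball in R³ are disks of radius √(1 − x²), so m(A_x) = π(1 − x²), and integrating the slice areas recovers the familiar volume 4π/3.

```python
import math

def midpoint_integral(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# Each x-slice of the unit ball in R^3 is a disk of area pi*(1 - x^2).
volume = midpoint_integral(lambda x: math.pi * (1.0 - x * x), -1.0, 1.0)
print(volume)  # close to 4*pi/3
```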
In Theorem 7.6 below, we show that this method of computing areas and volumes in Euclidean space works for products of abstract measure spaces. Given measure spaces (X, S, µ) and (Y, T, ν) and a subset A ⊆ X × Y, let (see Figure 7.6)

A_x = {y ; (x, y) ∈ A},   A^y = {x ; (x, y) ∈ A}.

A_x is called the section of A at x and A^y is the section of A at y.

Figure 7.6. Sections, or slices, of the set A ⊆ X × Y.

Measures by cross-sections
Theorem 7.6. Let ω be the product measure ω = µ ⊗ ν. Then for any set A ∈ S ⊗ T, we have
(i) A_x ∈ T for each x ∈ X and A^y ∈ S for each y ∈ Y.
(ii) µ(A^y) and ν(A_x) are measurable functions of y and x, respectively.
(iii) We have

(7.6)   ω(A) = ∫_X ν(A_x) dµ = ∫_Y µ(A^y) dν.

In particular, if A ∈ S ⊗ T, then ω(A) = 0 if and only if µ(A^y) = 0 for a.e. y, if and only if ν(A_x) = 0 for a.e. x.
Proof: We only prove the statements concerning the x-sections, as the statements for the y-sections are proved similarly. Also, we shall prove this theorem assuming that µ(X) < ∞ and ν(Y) < ∞. The σ-finite case is handled in a very similar manner as in the proof of uniqueness in Step 2 of the Extension Theorem, so we won't repeat that argument. As you'll see, this problem is a very nice application of the π-λ theorem.

Step 1: Let's set up the problem. Let L be the collection of all sets having the "appropriate" properties we want, namely L consists of all sets A ∈ S ⊗ T satisfying (i), (ii), and (iii) for the x-sections. Explicitly, A ∈ L means
(i) A_x ∈ T for each x ∈ X.
(ii) ν(A_x) is a measurable function of x.
(iii) ω(A) = ∫_X ν(A_x) dµ.

We want to show that S ⊗ T ⊆ L, that is, every element of S ⊗ T has properties (i), (ii), and (iii). Since S ⊗ T is by definition the σ-algebra generated by S × T, to get S ⊗ T ⊆ L we just have to show that S × T ⊆ L and that L is a λ-system.
Step 2: We prove that S × T ⊆ L. Let A = B × C ∈ S × T; we shall prove that A ∈ L. To see that A satisfies (i), given x ∈ X, observe that

A_x := {y ; (x, y) ∈ B × C} = { C if x ∈ B;  ∅ if x ∉ B }.

Hence, for any x ∈ X, we have A_x ∈ T. In addition, the formula for A_x shows that

ν(A_x) = { ν(C) if x ∈ B;  0 if x ∉ B } = ν(C) · χ_B(x),

so ν(A_x) is a measurable function of x. Finally,

∫_X ν(A_x) dµ = ∫_X ν(C) · χ_B(x) dµ = ν(C) µ(B) = ω(B × C) = ω(A).

Thus, L contains S × T.

Step 3: We now show that L is a λ-system. It's easily checked that ∅ ∈ L. To see that L is closed under complements, let A ∈ L, which recall means that A satisfies (i), (ii), and (iii) above. We leave you to check that for any x ∈ X, we have (A^c)_x = (A_x)^c, that is, the x-section of a complement is the complement of the x-section. Since A ∈ L, we know that A_x ∈ T, so, as σ-algebras are closed under complements, we have (A^c)_x = (A_x)^c ∈ T. Thus, A^c satisfies (i). By subtractivity,

ν((A^c)_x) = ν((A_x)^c) = ν(Y \ A_x) = ν(Y) − ν(A_x).

Since ν(A_x) is a measurable function of x and ν(Y) is just a constant, it follows that ν((A^c)_x) = ν(Y) − ν(A_x) is a measurable function of x. Thus, A^c satisfies (ii). In particular, we have

∫_X ν((A^c)_x) dµ = ∫_X (ν(Y) − ν(A_x)) dµ = ∫_X ν(Y) dµ − ∫_X ν(A_x) dµ = µ(X) · ν(Y) − ω(A) = ω(X × Y) − ω(A) = ω(A^c),

so A^c satisfies (iii). Finally, let A = ⋃_{n=1}^∞ A_n be a countable union of pairwise disjoint sets in L. Since A_x = ⋃_{n=1}^∞ (A_n)_x and each (A_n)_x ∈ T and T is a σ-algebra, we have A_x ∈ T as well. Moreover, ν(A_x) = Σ_{n=1}^∞ ν((A_n)_x), so ν(A_x) is a measurable function of x, being a series of measurable functions. Furthermore, by countable additivity of the integral, the fact that A_n ∈ L for each n, and by countable additivity of measures, we have

∫_X ν(A_x) dµ = Σ_{n=1}^∞ ∫_X ν((A_n)_x) dµ = Σ_{n=1}^∞ ω(A_n) = ω(A).

Thus, we have shown that L is a λ-system. In Step 2 we showed that S × T ⊆ L. Since S × T is closed under intersections (in fact, it's a semiring), by Dynkin's π-λ theorem it follows that S ⊗ T ⊆ L. This completes our proof. ∎
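When X and Y are finite sets and µ, ν are weighted counting measures, the integrals in (7.6) become finite sums, and the theorem can be verified directly. A toy sketch of our own (the particular weights and the set A are hypothetical choices):

```python
# Weighted counting measures on finite spaces.
mu = {1: 0.5, 2: 1.5}                # measure on X = {1, 2}
nu = {'a': 2.0, 'b': 0.25}           # measure on Y = {'a', 'b'}
A = {(1, 'a'), (1, 'b'), (2, 'b')}   # a subset of X x Y

# omega(A) computed directly from the product weights ...
omega_A = sum(mu[x] * nu[y] for (x, y) in A)

# ... equals the "integral" of nu(A_x) with respect to mu: for each x,
# nu(A_x) is the nu-mass of the x-section, weighted by mu({x}).
by_slices = sum(mu[x] * sum(nu[y] for y in nu if (x, y) in A) for x in mu)

print(omega_A, by_slices)
```

The two numbers agree, illustrating ω(A) = ∫_X ν(A_x) dµ in the simplest possible setting.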
7.2.5. Volumes of balls in R^n. Using volumes by slices, we investigate the volume of balls in R^n (cf. [209, 263]). First consider a cube [−r, r]^n of "radius" r in R^n. It's clear that the volume of the cube behaves as follows: As n → ∞,

m([−r, r]^n) = (2r)^n → { 0 if r < 1/2;  1 if r = 1/2;  ∞ if r > 1/2 }.
What happens if we consider volumes of balls? Consider the n-dimensional ball of radius r > 0:

B_n(r) = {z ∈ R^n ; ‖z‖ ≤ r},  where ‖z‖ = √(z₁² + ··· + z_n²).

What happens to m(B_n(r)) as n → ∞? Although not quite as shocking as the Banach-Tarski paradox, the answer might still surprise you! First of all, observe that with B_n := B_n(1), we have

z ∈ B_n(r)  ⟺  ‖z/r‖ ≤ 1  ⟺  z/r ∈ B_n  ⟺  z ∈ rB_n.

Thus, B_n(r) = rB_n, so by the dilation properties of Lebesgue measure, we have

(7.7)   m(B_n(r)) = r^n a_n,  where a_n = m(B_n).
Next, let's compute the volume m(B_n(r)) by slices by writing R^n = R × R^{n−1}. If x denotes the variable on R and y the variable on R^{n−1}, and if A = B_n(r), then

A_x = {y ; (x, y) ∈ B_n(r)} = {y ; x² + y₁² + ··· + y_{n−1}² ≤ r²} = {y ; y₁² + ··· + y_{n−1}² ≤ r² − x²} = B_{n−1}(√(r² − x²)),

where we regard B_{n−1}(√(r² − x²)) to be empty if |x| > r. Hence, by the method of slices,

m(B_n(r)) = ∫_R m(A_x) dx
 = ∫_{−r}^r m(B_{n−1}(√(r² − x²))) dx
 = ∫_{−r}^r (√(r² − x²))^{n−1} a_{n−1} dx   (by (7.7))
 = 2 a_{n−1} ∫_0^r (√(r² − x²))^{n−1} dx
 = 2 a_{n−1} r^n ∫_0^{π/2} sin^n θ dθ   (put x = r cos θ)
 = 2 r m(B_{n−1}(r)) ∫_0^{π/2} sin^n θ dθ   (by (7.7)).

To summarize, if we denote by V_n(r) the volume of the n-dimensional ball B_n(r), then we have

(7.8)   V_n(r) = 2r V_{n−1}(r) ∫_0^{π/2} sin^n θ dθ.

Using this result we shall prove that for any r > 0, however large you wish to take r, we have lim_{n→∞} V_n(r) = 0. For instance, the volumes of the n-dimensional balls of radius one billion tend to zero as n → ∞, quite a different story from cubes!
Volumes of spheres
Theorem 7.7. For any r > 0, we have Σ_{n=1}^∞ V_n(r) < ∞. In particular, for any r > 0, we have lim_{n→∞} V_n(r) = 0. Moreover, we have

V₁(1) < V₂(1) < V₃(1) < V₄(1) < V₅(1)   and   V₅(1) > V₆(1) > V₇(1) > ··· ;

that is, the volumes of the unit balls increase from dimension one to dimension five and decrease thereafter.

Proof: By (7.8), for any r > 0 we have

V_n(r) = 2r V_{n−1}(r) ∫_0^{π/2} sin^n θ dθ.

Since 0 ≤ sin θ < 1 for all θ ∈ [0, π/2), we have lim_{n→∞} sin^n θ = 0 for a.e. θ ∈ [0, π/2] (all θ except θ = π/2) and sin^n θ is bounded by 1. Thus, by the Dominated (or Bounded) Convergence Theorem we have lim_{n→∞} ∫_0^{π/2} sin^n θ dθ = 0. Hence,

lim_{n→∞} V_n(r)/V_{n−1}(r) = 0.

Thus, Σ_{n=1}^∞ V_n(r) converges by the Ratio Test. Plugging r = 1 into (7.8) we see that

V_n(1)/V_{n−1}(1) = b_n,  where b_n = 2 ∫_0^{π/2} sin^n θ dθ.

Since sin^{n+1} θ < sin^n θ for all 0 < θ < π/2 (because 0 < sin θ < 1 for such θ), it follows that {b_n} is a strictly decreasing sequence of real numbers:

b₁ > b₂ > b₃ > b₄ > b₅ > ···.

One can check numerically that b₁, b₂, b₃, b₄, b₅ > 1 and b₆, b₇, b₈, ... < 1. Since V_n(1)/V_{n−1}(1) = b_n, we conclude that

V_{n−1}(1) < V_n(1) for n = 2, 3, 4, 5   and   V_{n−1}(1) > V_n(1) for n = 6, 7, 8, ....

This proves our last statement. ∎
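The numerical claim about the b_n is easy to verify with the Wallis recursion ∫_0^{π/2} sin^n θ dθ = ((n−1)/n) ∫_0^{π/2} sin^{n−2} θ dθ (a standard formula; the sketch below is our own illustration, not from the text):

```python
import math

# Wallis recursion for W_n = integral of sin^n over [0, pi/2].
W = {0: math.pi / 2, 1: 1.0}
for n in range(2, 13):
    W[n] = (n - 1) / n * W[n - 2]

b = {n: 2 * W[n] for n in range(1, 13)}   # the b_n of the proof

# Unit-ball volumes via (7.8): V_n(1) = b_n * V_{n-1}(1), with V_0(1) = 1.
V = {0: 1.0}
for n in range(1, 13):
    V[n] = b[n] * V[n - 1]

print([round(b[n], 4) for n in range(1, 7)])   # b_1..b_5 > 1 > b_6
print([round(V[n], 4) for n in range(1, 7)])   # volumes peak at n = 5
```

The computed ratios confirm b₁, ..., b₅ > 1 > b₆, and the volumes V₁(1), ..., V₅(1) increase while V₅(1) > V₆(1), exactly as the theorem states.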
We can use (7.8) to actually compute V_n(r).

Theorem 7.8. For any n ∈ N and r > 0, we have

V_n(r) = π^{n/2} r^n / Γ(n/2 + 1),

where Γ denotes the Gamma function.

Proof: We first need the formula

∫_0^{π/2} sin^n θ dθ = Γ((n+1)/2) √π / (2 Γ(n/2 + 1)),

a formula you can find in Problem 2 of Exercises 6.1. Then (7.8) reads

V_n(r) = r V_{n−1}(r) √π Γ((n+1)/2) / Γ(n/2 + 1).

Using this formula, it's not difficult to prove that V_n(r) = π^{n/2} r^n / Γ(n/2 + 1) using induction on n. We leave the details for your enjoyment. ∎
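The closed form is easy to sanity-check against the recursion (7.8) using Python's `math.gamma` (a sketch of our own, not from the text):

```python
import math

def unit_ball_volume(n):
    """V_n(1) = pi^(n/2) / Gamma(n/2 + 1): Theorem 7.8 with r = 1."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# Low dimensions give the familiar values: 2, pi, and 4*pi/3.
print(unit_ball_volume(1), unit_ball_volume(2), unit_ball_volume(3))

# The recursion V_n = V_{n-1} * sqrt(pi) * Gamma((n+1)/2) / Gamma(n/2 + 1)
# from the proof reproduces the closed form for every n we try.
v, ok = 1.0, True
for n in range(1, 15):
    v *= math.sqrt(math.pi) * math.gamma((n + 1) / 2) / math.gamma(n / 2 + 1)
    ok = ok and abs(v - unit_ball_volume(n)) < 1e-9
print(ok)
```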
◮ Exercises 7.2.

1. (The σ-finiteness assumption) Let X = Y = R, S = T = B¹, let µ : S → [0, ∞] be Lebesgue measure, and let ν : T → [0, ∞] be the counting measure (observe that ν is not σ-finite). If A = {(x, x) ; x ∈ [0, 1]} ⊆ X × Y, prove that

ω(A) = ∞,  ∫_X ν(A_x) dµ = 1,  ∫_Y µ(A^y) dν = 0.

Thus, Theorem 7.6 fails quite badly when we drop the σ-finiteness condition. Conclude that the three integrals

∫_{X×Y} f dω,  ∫_X ( ∫_Y f(x, y) dν ) dµ,  ∫_Y ( ∫_X f(x, y) dµ ) dν

are all different for the nonnegative measurable function f = χ_A.

2. Let X = Y = {1, 2} and A = B = {{1}}. Note that X and Y are not countable unions of sets in A = B. Show that S(A) ⊗ S(B) ≠ S(A × B).

3. Prove Corollary 7.4.

4. Two problems on the π-λ theorem.
(i) If I is a collection of sets closed under intersections (not necessarily a semiring) and µ and ν are two measures on S(I) with µ(X) = ν(X) < ∞ that agree on I, using the π-λ theorem prove that µ and ν agree on all of S(I).
(ii) Let X = {1, 2, 3} and let I be the collection of subsets {{1, 2}, {2, 3}} of X. Note that I is not closed under intersections. Show that given any a ∈ [1, 2] there exists a measure µ_a : S(I) → [0, ∞) such that µ_a(X) = a and µ_a{1, 2} = 1 = µ_a{2, 3}. Why does this give a counterexample to (i) if we drop the "π-condition?"

5. Let k, ℓ ∈ N and put n = k + ℓ; then we know that B^n = B^k ⊗ B^ℓ. In this problem we investigate the validity of this expression for Lebesgue measurable sets.
(a) Show that B^n ⊆ M^k ⊗ M^ℓ. Using that there exists a Lebesgue but not Borel measurable subset of R^k (see Problem 7 in Exercises 4.5 for the k = 1 case, from which the general k case can be obtained), show that B^n ⊆ M^k ⊗ M^ℓ is a proper inclusion. Suggestion: For the properness statement, let A ⊆ R^k be a Lebesgue but not Borel measurable set and show that A × {0}^ℓ is in M^k ⊗ M^ℓ but not in B^k ⊗ B^ℓ. What is a y-section of A × {0}^ℓ?
(b) Show that M^k × M^ℓ ⊆ M^n. Suggestion: An element of M^i is of the form F ∪ N where F ∈ B^i and N is a subset of an element of B^i of measure zero.
(c) Show that M^k ⊗ M^ℓ ⊆ M^n and that this inclusion is proper. Thus, we have B^n ⊆ M^k ⊗ M^ℓ ⊆ M^n and these inclusions are proper. Suggestion: Let A ⊆ R^k be a non-Lebesgue measurable set and show that A × {0}^ℓ is in M^n but not in M^k ⊗ M^ℓ. What is a y-section of A × {0}^ℓ?

6. (Complete product measures I) Before doing this problem, please review complete measures in Section 4.3.5 and use any results in Problem 9 of Exercises 4.3 on the completion of measures. Let µ : S → [0, ∞] and ν : T → [0, ∞] be σ-finite complete measures and let µ ⊗ ν : S ⊗ T → [0, ∞] be the induced product measure. We denote the completion of this measure (as defined in Problem 9 of Exercises 4.3) using bars over ⊗: µ ⊗̄ ν : S ⊗̄ T → [0, ∞]. Briefly, recall that S ⊗̄ T consists of all sets of the form F ∪ N, where F ∈ S ⊗ T and N is a subset of an element of S ⊗ T of measure zero. Given such an element of S ⊗̄ T we define µ ⊗̄ ν(F ∪ N) := µ ⊗ ν(F). Show that if µ = m_k on M^k and ν = m_ℓ on M^ℓ, then with n = k + ℓ,

M^k ⊗̄ M^ℓ = M^n   and   m_k ⊗̄ m_ℓ = m_n.
7. (Complete product measures II) For complete measures, Theorem 7.6 takes the following form. Let µ⊗ν : S ⊗T → [0, ∞] be the completed product measure as described in the previous problem for two complete measures µ and ν. Prove that given A ∈ S ⊗T , we have (i) Ax ∈ T for a.e. x ∈ X and Ay ∈ S for a.e. y ∈ Y . (ii) µ(Ay ) and ν(Ax ) are measurable functions of y and x, respectively, where we define µ(Ay ) and ν(Ax ) to equal zero on the sets of measure zero where Ay ∈ /S and Ax ∈ / T. (iii) We have Z Z µ⊗ν(A) =
ν(Ax ) dµ =
X
µ(Ay ) dν.
Y
In particular, if A ∈ S ⊗T , then µ⊗ν(A) = 0 if and only if µ(Ay ) = 0 for a.e. y, if and only if ν(Ax ) = 0 for a.e. x. 8. (Lebesgue’s geometric definition) In this problem, you are free to use any material from the previous two problems. We shall give Lebesgue’s first definition of the integral in his 1902 thesis [171, p. 20], which Lebesgue called the Geometric definition of the integral. Lebesgue called the Analytic definition of the integral [171, p. 28] the definition of the integral we have emphasized in this book obtained by partitioning the range, which Lebesgue introduced in his 1901 paper [165]. (1) Let B ⊆ Rℓ be Lebesgue measurable with positive measure. Prove that if A ⊆ Rk and A×B is Lebesgue measurable in Rk+ℓ , then A is a Lebesgue measurable subset of Rk . Is the conclusion still true if B has zero measure? (2) Let f be a real-valued nonnegative function on an interval [a, b] and let A ⊆ R2 be the region between the x-axis and the curve f . Prove that f : [a, b] → [0, ∞) is Lebesgue measurable if and only if A ⊆ R2 is Lebesgue measurable, and prove that Z b
m2 (A) =
f (x) dx.
a
(In particular, if f : [a, b] → R and A+ is the area under the graph of f+ and A− the area under the graph of f− , then Z b f (x) dx = m2 (A+ ) − m2 (A− ); a
this is what Lebesgue refers to as the geometric definition of the integral.) (3) Let f and g be Lebesgue measurable functions on an interval [a, b] with f ≤ g and let A = {(x, y) ; a ≤ x ≤ b , f (x) ≤ y ≤ g(x)} ⊆ R2 , the region between the curves f and g between a and b. Prove that A is Lebesgue measurable and Z b m2 (A) = [g(x) − f (x)] dx. a
9. In this problem, we define the product of arbitrary measures [23]. Let (S , µ) and (T , ν) be two measure spaces, not necessarily σ-finite.
(a) Let S_f and T_f be the subsets of S and T , respectively, consisting of sets with finite measure. Prove that for any set E = E₁ × E₂ ∈ S_f × T_f , there exists a unique measure ω_E on S ⊗ T such that
$$ \omega_E(A \times B) = \mu(E_1 \cap A)\,\nu(E_2 \cap B) $$
for all A × B ∈ S × T .
(b) Prove that if E, F ∈ S_f × T_f with E ⊆ F , then ω_E ≤ ω_F in the sense that ω_E(A) ≤ ω_F(A) for all A ∈ S ⊗ T .
(c) Define ω : S ⊗ T → [0, ∞] by
$$ \omega(A) = \sup\{\omega_E(A)\ ;\ E \in S_f \times T_f\}. $$
Prove that ω is a measure on S ⊗ T .
(d) Prove that ω is the unique measure on S ⊗ T satisfying the following two conditions:
$$ \text{(a)}\qquad \omega(A \times B) = \mu(A)\,\nu(B) \quad\text{for all } A \times B \in S_f \times T_f, $$
and for each A ∈ S ⊗ T ,
$$ \text{(b)}\qquad \omega(A) = \sup\{\omega(A \cap E)\ ;\ E \in S_f \times T_f\}. $$
7.3. The Fubini-Tonelli Theorems on iterated integrals

With the work we did in the last section, it will be easy to prove the Fubini and Tonelli theorems on iterated integration for Lebesgue integrals, which basically say that for nonnegative or integrable functions, in the Lebesgue world you can always evaluate a multiple integral by iterated integration "without fear of contradictions, or of failing examinations" [70, p. 229].

7.3.1. Tonelli's theorem. Our first theorem on iterated integrals treats the case of nonnegative measurable functions and is due to Leonida Tonelli (1885–1946), who proved a version of it in his 1909 paper "Sull'integrazione per parti" [287]. Although Fubini's theorem is from 1907, it's common to present Tonelli's theorem first because Fubini's theorem can be easily deduced from it, as Tonelli remarked in his original paper. Let (X, S , µ) and (Y, T , ν) be two σ-finite measure spaces and let ω be the product measure on the σ-algebra S ⊗ T of X × Y . The Fubini-Tonelli theorem says that for a measurable function f : X × Y → R, we always have
$$ \int_{X\times Y} f\,d\omega = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu, $$
provided that f is either integrable or nonnegative; the "nonnegative" part is Tonelli's contribution. Let us analyze the iterated integrals more closely; consider the middle iterated integral. Since we only defined the integral for measurable functions, in order for the inside integral ∫_Y f(x, y) dν to even be defined we need that for fixed x ∈ X, the function f(x, y) is a measurable function of y. More precisely, we need that for each x ∈ X,
$$ f_x : Y \to \mathbb{R}, \quad\text{defined by}\quad f_x(y) := f(x, y) \ \text{ for all } y \in Y, $$
is a ν-measurable function. Similarly, we need that for each y ∈ Y , the function
$$ f^y : X \to \mathbb{R}, \quad\text{defined by}\quad f^y(x) := f(x, y) \ \text{ for all } x \in X, $$
is a µ-measurable function. The function f_x is called the section, or slice, of f at x and the function f^y is called the section, or slice, of f at y.

Lemma 7.9. If f : X × Y → R is measurable, then for each x ∈ X, the section f_x is ν-measurable, and for each y ∈ Y , the section f^y is µ-measurable.

Proof : Given x ∈ X and a ∈ R, we have
$$ f_x^{-1}(a, \infty] = \{y\ ;\ f_x(y) \in (a, \infty]\} = \{y\ ;\ f(x, y) \in (a, \infty]\} = A_x, $$
where A = f^{-1}(a, ∞] and A_x is the x-section of A. Since f : X × Y → R is measurable, we have A ∈ S ⊗ T . By Theorem 7.6, A_x ∈ T and thus f_x is ν-measurable. Similarly, f^y is µ-measurable.
7. FUBINI’S THEOREM AND CHANGE OF VARIABLES
The proof of Tonelli's theorem is a perfect example of using the "Principle of Appropriate Functions" as explained in Section 5.5.2.

Tonelli's Theorem

Theorem 7.10. Let (X, S , µ) and (Y, T , ν) be two σ-finite measure spaces and let ω be the product measure ω = µ ⊗ ν. Then for any measurable function f : X × Y → [0, ∞],
$$ (7.9)\qquad \int f\,d\omega = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu. $$
Moreover, an ω-measurable function f : X × Y → R is ω-integrable if and only if
$$ (7.10)\qquad \int_X\Big(\int_Y |f(x,y)|\,d\nu\Big)d\mu < \infty, \ \text{ if and only if, }\ \int_Y\Big(\int_X |f(x,y)|\,d\mu\Big)d\nu < \infty. $$
Proof : We remark that in order for the outside integrals in the middle and right integrals in (7.9) to be defined, implicit in Tonelli's Theorem is that the inner integrals
$$ \int_Y f(x,y)\,d\nu \quad\text{and}\quad \int_X f(x,y)\,d\mu $$
are µ-measurable and ν-measurable, respectively; we shall establish these facts in the course of our proof.

By the "Principle of Appropriate Functions", to prove (7.9) for all nonnegative measurable functions, we just have to prove that if C denotes the set of nonnegative measurable functions for which (7.9) does hold, then C contains all characteristic functions of measurable sets, C is closed under linear combinations by nonnegative numbers, and C is closed under limits of nondecreasing sequences of nonnegative functions.

To apply this principle, first let f = χ_A, where A ⊆ X × Y is a measurable set. Then we leave you to check that f_x = χ_{A_x} where A_x is the x-section of A. By definition of the integral, we have
$$ (7.11)\qquad \nu(A_x) = \int_Y \chi_{A_x}\,d\nu = \int_Y f_x\,d\nu = \int_Y f(x,y)\,d\nu. $$
By Theorem 7.6, we know that ∫_Y f_x dν, which equals ν(A_x), is a µ-measurable function of x, and
$$ \omega(A) = \int_X \nu(A_x)\,d\mu. $$
Using (7.11) to rewrite the right-hand side and using that ω(A) = ∫ χ_A dω = ∫ f dω to rewrite the left-hand side, we obtain
$$ \int f\,d\omega = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu. $$
Thus, the first equality in (7.9) holds for characteristic functions of measurable subsets in X × Y .

By using linearity of the integral and measurable functions, it's easy to check that C is closed under linear combinations by nonnegative numbers. Now let {f_n} be a nondecreasing sequence of functions in C converging to a function f. Then for each x ∈ X, {(f_n)_x} is a nondecreasing sequence of nonnegative T -measurable functions on Y converging to f_x. Hence, by the Monotone Convergence Theorem, we have
$$ \int_Y f_x\,d\nu = \lim_{n\to\infty} \int_Y (f_n)_x\,d\nu. $$
Since for each n, ∫_Y (f_n)_x dν is a µ-measurable function of x, and measurable functions are closed under limits, ∫_Y f_x dν is also µ-measurable. Finally, since the sequence ∫_Y (f_n)_x dν is nondecreasing and converges to ∫_Y f_x dν,
$$\begin{aligned} \int f\,d\omega &= \lim_{n\to\infty} \int f_n\,d\omega &&\text{(by the Monotone Convergence Theorem)}\\ &= \lim_{n\to\infty} \int_X\Big(\int_Y (f_n)_x\,d\nu\Big)d\mu &&\text{(since each } f_n \text{ satisfies (7.9))}\\ &= \int_X\Big(\int_Y f_x\,d\nu\Big)d\mu &&\text{(by the Monotone Convergence Theorem)}\\ &= \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu. \end{aligned}$$
This proves the first equality in (7.9) for nonnegative measurable functions.

To prove the last statement of this theorem, observe that if f is ω-integrable, then by definition, ∫ |f| dω < ∞, and thus (7.10) holds by the first statement of this theorem. On the other hand, if either integral in (7.10) is finite, then again by the first statement of this theorem, ∫ |f| dω < ∞ too. Hence, f is ω-integrable.
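The characteristic-function step of the proof, ω(A) = ∫_X ν(A_x) dµ, is Cavalieri's principle in disguise: a set's measure is the integral of the measures of its sections. As an illustrative numerical sketch (hypothetical Python of ours, not part of the proof), the area of the unit disk is recovered from the lengths of its vertical sections:

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# A = the unit disk in R^2.  Its x-section A_x is an interval of length
# nu(A_x) = 2 sqrt(1 - x^2) for |x| <= 1, so omega(A) = int_X nu(A_x) dmu
# should recover the disk's area, pi.
slice_length = lambda x: 2.0 * math.sqrt(max(0.0, 1.0 - x * x))
area = simpson(slice_length, -1.0, 1.0)

assert abs(area - math.pi) < 1e-3
```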
Using Tonelli's theorem we can give Pierre-Simon Laplace's (1749–1827) 1781 derivation of the probability integral via double integrals. In fact, since it's quite readable, we shall simply take Laplace's argument from his 1781 paper Mémoire sur les probabilités published in Mémoires de l'Académie royale des Sciences de Paris. The following translation is from [160, pp. 60–61]:

There is concern therefore only to have the integral ∫ e^{−t²} dt from t = 0 to t = ∞. For this, we consider the double integral
$$ \iint e^{-s(1+u^2)}\,ds\,du, $$
and we take it from s = 0 to s = ∞ and from u = 0 to u = ∞; by integrating first with respect to s, we will have
$$ \iint e^{-s(1+u^2)}\,ds\,du = \int \frac{du}{1+u^2}. $$
Now we have, as we know,⁵
$$ \int \frac{du}{1+u^2} = \frac{\pi}{2}, $$
π being the ratio of the semi-circumference to the radius; therefore
$$ \iint e^{-s(1+u^2)}\,ds\,du = \frac{\pi}{2}. $$
If we take this double integral first with respect to u, by making u√s = t, it will become ∫ e^{−s} ds/√s · ∫ e^{−t²} dt; let ∫ e^{−t²} dt = B (the integral being taken from t = 0 to t = ∞), we will have
$$ \iint e^{-s(1+u^2)}\,ds\,du = B\int e^{-s}\,\frac{ds}{\sqrt{s}}. $$
Now, by making s = s′², we have
$$ \int e^{-s}\,\frac{ds}{\sqrt{s}} = 2\int e^{-s'^2}\,ds' = 2B $$

⁵ Because ∫₀^∞ du/(1 + u²) = tan⁻¹(u)|₀^∞ = π/2 − 0 = π/2.
(the integral being taken from s′ = 0 to s′ = ∞); therefore
$$ \iint e^{-s(1+u^2)}\,ds\,du = 2B^2 = \frac{\pi}{2}, $$
whence we deduce that $B = \frac{1}{2}\sqrt{\pi}$.
In fact, in Laplace's paper he uses the double integral trick several times!⁶ Just to make sure you understand Laplace's trick to compute the probability integral, we suggest that you work out both sides of
$$ \int_0^\infty\!\!\int_0^\infty f(x,y)\,dy\,dx = \int_0^\infty\!\!\int_0^\infty f(x,y)\,dx\,dy $$
with f(x, y) = x e^{−x²(1+y²)}; this is a very slight modification of Laplace's argument.
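If you want to check your work on this suggested modification numerically, here is an illustrative Python sketch (ours; the truncation points and the Simpson quadrature are ad hoc choices): both iterated integrals of f(x, y) = x e^{−x²(1+y²)} come out to π/4, which forces B = √π/2.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

f = lambda x, y: x * math.exp(-x * x * (1.0 + y * y))

# dy first: for fixed x > 0 the integrand is negligible past y = 8/x.
def inner_y(x):
    return simpson(lambda y: f(x, y), 0.0, 8.0 / x, 1000)

I1 = simpson(inner_y, 1e-9, 8.0, 800)

# dx first: the inner integral is exactly 1/(2 (1 + y^2)), so the other
# order gives (1/2) arctan(y) evaluated at infinity, i.e. pi/4; we
# spot-check the inner integral at y = 1, where it should equal 1/4.
def inner_x(y):
    return simpson(lambda x: f(x, y), 0.0, 8.0 / math.sqrt(1.0 + y * y), 1000)

assert abs(inner_x(1.0) - 0.25) < 1e-6
assert abs(I1 - math.pi / 4) < 1e-4

# Laplace's conclusion: B = int_0^inf e^{-t^2} dt = sqrt(pi)/2.
B = simpson(lambda t: math.exp(-t * t), 0.0, 8.0, 2000)
assert abs(B - math.sqrt(math.pi) / 2) < 1e-8
```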
7.3.2. Fubini's theorem. Our second theorem treats the case of general integrable functions and is named after Guido Fubini (1879–1943) who proved it in 1907 [101]. However, before presenting Fubini's theorem, we pause for an observation. Let (X, S , µ) and (Y, T , ν) be two σ-finite measure spaces and let f : X × Y → R be an ω-integrable function where ω = µ ⊗ ν is the product measure. Being integrable, we have ∫ |f| dω < ∞ and by Tonelli's theorem,
$$ \int |f|\,d\omega = \int_X\Big(\int_Y |f|_x\,d\nu\Big)d\mu. $$
Since this integral is finite and integrable functions are finite a.e. (this is Property (6) of Proposition 5.20), for a.e. x ∈ X, we have
$$ \int_Y |f|_x\,d\nu < \infty. $$
This means that for a.e. x ∈ X, f_x : Y → R is ν-integrable. In particular, for a.e. x ∈ X, the integral
$$ \int_Y f(x,y)\,d\nu = \int_Y f_x\,d\nu $$
is a perfectly well-defined real number. By convention, let us define ∫_Y f(x, y) dν to be zero for those x's on the set of measure zero where this integral is not a real number. By a similar argument, for a.e. y ∈ Y , the integral
$$ \int_X f(x,y)\,d\mu $$
is a well-defined real number; by convention we define this integral to be zero for those y's on the set of measure zero where this integral is not a real number. The inner integrals in Equation (7.12) below are defined via the preceding conventions.

⁶ Although, I have to be honest and say that I spent a few hours filling in the details of his other double integral tricks and I got stuck on some of his tricky substitutions on page 66 of [160] . . . if you can figure out what he did, please let me know!
Fubini's Theorem

Theorem 7.11. If f : X × Y → R is ω-integrable, then
$$ (7.12)\qquad \int f\,d\omega = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu. $$
Proof : We remark that implicit in Fubini's theorem is that the inner integrals in (7.12) (defined by the convention described above),
$$ \int_Y f(x,y)\,d\nu \quad\text{and}\quad \int_X f(x,y)\,d\mu, $$
are µ-measurable and ν-measurable, respectively, and are integrable as well.

Consider the first equality in (7.12); the second equality in (7.12) is proved similarly. Applying Tonelli's theorem to the nonnegative functions f^±, we have
$$ \int f^+\,d\omega = \int_X F^+\,d\mu \quad\text{and}\quad \int f^-\,d\omega = \int_X F^-\,d\mu, $$
where F^± : X → [0, ∞] are the functions
$$ F^+(x) = \int_Y f^+(x,y)\,d\nu \quad\text{and}\quad F^-(x) = \int_Y f^-(x,y)\,d\nu, $$
which are (µ-measurable and µ-) integrable functions; here we used that f is assumed integrable and hence f^± have finite integrals and so F^± must be integrable. Since integrable functions are finite a.e., it follows that F^± are finite a.e. Let A ⊆ X be the measurable set on which both F^± are finite. Then by the convention described before this theorem, we have
$$ \int_Y f(x,y)\,d\nu = \begin{cases} \displaystyle\int_Y f^+(x,y)\,d\nu - \int_Y f^-(x,y)\,d\nu & \text{if } x \in A;\\ 0 & \text{if } x \in A^c \end{cases} \;=\; \chi_A F^+(x) - \chi_A F^-(x). $$
Since F^+ and F^- are µ-measurable and A is a measurable set, it follows that ∫_Y f(x, y) dν is µ-measurable also. Moreover, we have
$$\begin{aligned} \int f\,d\omega &= \int f^+\,d\omega - \int f^-\,d\omega &&\text{(def.\ of integral)}\\ &= \int_X F^+\,d\mu - \int_X F^-\,d\mu &&\text{(Tonelli's theorem)}\\ &= \int_X \chi_A F^+\,d\mu - \int_X \chi_A F^-\,d\mu &&\text{(} A^c \text{ has measure zero)}\\ &= \int_X \big(\chi_A F^+ - \chi_A F^-\big)\,d\mu &&\text{(linearity of the integral)}\\ &= \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu. \end{aligned}$$
This proves the first equality in (7.12).
In Fubini's Theorem it is important that f : X × Y → R is assumed integrable with respect to the product measure ω; we cannot just assume that the iterated integrals exist. Indeed, in Problem 3, we shall see that if we drop the assumption that f be integrable, then even if both iterated integrals on the right of (7.12) exist, they may not be equal. Combining the Fubini and the Tonelli theorems, we get the
Fubini-Tonelli Theorem

Theorem 7.12. Let (X, S , µ) and (Y, T , ν) be two σ-finite measure spaces and let ω = µ ⊗ ν. Then for any ω-measurable function f : X × Y → R that is either nonnegative or ω-integrable, we have
$$ \int f\,d\omega = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu. $$
Moreover, an ω-measurable function f : X × Y → R is ω-integrable if and only if
$$ \int_X\Big(\int_Y |f(x,y)|\,d\nu\Big)d\mu < \infty, \ \text{ if and only if, }\ \int_Y\Big(\int_X |f(x,y)|\,d\mu\Big)d\nu < \infty. $$
If we want to break up R^n as R^n = R^k × R^ℓ, where n = k + ℓ, then as (see Section 7.2.2) B^n = B^k ⊗ B^ℓ and m_n = m_k ⊗ m_ℓ, the Fubini-Tonelli Theorem implies that
$$ \int_{\mathbb{R}^n} f\,dx\,dy = \int_{\mathbb{R}^k}\Big(\int_{\mathbb{R}^\ell} f(x,y)\,dy\Big)dx = \int_{\mathbb{R}^\ell}\Big(\int_{\mathbb{R}^k} f(x,y)\,dx\Big)dy $$
for any Borel measurable function f : R^n → R that is either nonnegative or integrable, where we use the notation dx dy = dm_n, dx = dm_k, and dy = dm_ℓ. We technically cannot use this formula for Lebesgue measurable functions on R^n because the product of the Lebesgue measures on R^k and R^ℓ is not the Lebesgue measure on R^n (see the discussion at the end of Section 7.2.2). Of course, the Fubini-Tonelli theorem for Lebesgue measure is still true! This follows from the following theorem, whose proof we shall leave for Problem 5.

Fubini-Tonelli Theorem for Complete Measures

Theorem 7.13. Let (X, S , µ) and (Y, T , ν) be two σ-finite complete measure spaces and let $\overline{\omega} = \overline{\mu\otimes\nu}$ denote the completion of the product measure ω = µ ⊗ ν. Then for any $\overline{\omega}$-measurable function f : X × Y → R that is either nonnegative or $\overline{\omega}$-integrable, we have
$$ (7.13)\qquad \int f\,d\overline{\omega} = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu. $$
Moreover, an $\overline{\omega}$-measurable function f : X × Y → R is $\overline{\omega}$-integrable if and only if
$$ \int_X\Big(\int_Y |f(x,y)|\,d\nu\Big)d\mu < \infty, \ \text{ if and only if, }\ \int_Y\Big(\int_X |f(x,y)|\,d\mu\Big)d\nu < \infty. $$
I have to admit that in reading many books, I often got confused⁷ on what "Fubini's theorem" was and what "Tonelli's theorem" was, so I decided to look at the original papers. (When all else fails, go to the original sources!) Here's Fubini stating his theorem [101]:

⁷ Of making many books there is no end, and much study is a weariness of the flesh. Ecclesiastes 12:12. Holy Bible, King James Version.
I consider here the two-dimensional integral of a function of two variables x, y. And, as it is now necessary in this field of study, I refer to the integral of Lebesgue. The theorem, which we will prove, is the following: If f(x, y) is a function of two variables x, y, bounded or unbounded, integrable in an area Γ of the plane (x, y), then one always has
$$ \int_\Gamma f(x,y)\,d\sigma = \int dy \int f(x,y)\,dx = \int dx \int f(x,y)\,dy, $$
where with dσ we denote the area element.

On the other hand, here is Tonelli stating his theorem [287]:

. . . we show that a function, measurable in a [rectangular] region R, nonnegative, and such that the following integral exists:
$$ \int_a^x dx \int_c^y f(x,y)\,dy, $$
is integrable in the region R. Moreover, it follows that
$$ (7.14)\qquad \int_a^x dx \int_c^y f(x,y)\,dy = \int_a^x\!\!\int_c^y f(x,y)\,dx\,dy = \int_c^y dy \int_a^x f(x,y)\,dx. $$

Immediately after finishing his proof of (7.14), Tonelli states the following:

From the foregoing it follows that a measurable function of two variables, such that the following integral exists:
$$ \int_a^x dx \int_c^y |f(x,y)|\,dy, $$
is integrable in the region, and (7.14) holds.
7.3.3. Some applications of Fubini's Theorem.

Example : (Taylor's Formula) Our first application is a cute method to prove Taylor's formula with integral remainder [153]. Let f : I → R be a function on an open interval I containing 0 and suppose that for some n ∈ N, f and all its derivatives up to n + 1 are continuous. For any x ∈ I, we shall prove the Taylor formula for f(x) about 0:
$$ (7.15)\qquad f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n + R(x), $$
with remainder R(x) given by
$$ R(x) = \frac{1}{n!}\int_0^x (x-y)^n f^{(n+1)}(y)\,dy. $$
We shall prove (7.15) based on the following simple idea, whose origin goes back at least to Charles Hermite (1822–1901) (see his 1878 paper [124]): Try to reconstruct f by integrating f^{(n+1)}, n + 1 times. Thus, consider
$$ (7.16)\qquad R(x) := \int_0^x\!\!\int_0^{y_n}\!\!\int_0^{y_{n-1}}\!\!\cdots\int_0^{y_2}\!\!\int_0^{y_1} f^{(n+1)}(y)\,dy\,dy_1\cdots dy_{n-1}\,dy_n. $$
Let's evaluate this integral in two ways. The first way is by direct integration using the Fundamental Theorem of Calculus: Observe that
$$ \int_0^x\!\!\int_0^{y_1} f^{(n+1)}(y)\,dy\,dy_1 = \int_0^x \{f^{(n)}(y_1) - f^{(n)}(0)\}\,dy_1 = f^{(n-1)}(x) - f^{(n-1)}(0) - xf^{(n)}(0). $$
Using this formula to evaluate the dy dy₁ inner integral in the following, we see that
$$ \int_0^x\!\!\int_0^{y_2}\!\!\int_0^{y_1} f^{(n+1)}(y)\,dy\,dy_1\,dy_2 = \int_0^x \big[f^{(n-1)}(y_2) - f^{(n-1)}(0) - y_2 f^{(n)}(0)\big]\,dy_2 = f^{(n-2)}(x) - f^{(n-2)}(0) - xf^{(n-1)}(0) - \frac{x^2}{2}f^{(n)}(0). $$
Continuing this argument shows that
$$ (7.17)\qquad R(x) = f(x) - f(0) - xf'(0) - \frac{x^2}{2!}f''(0) - \frac{x^3}{3!}f^{(3)}(0) - \cdots - \frac{x^n}{n!}f^{(n)}(0). $$
The second way to integrate (7.16) is by using Fubini's theorem. The region of integration in the y y₁-plane is the triangle {(y, y₁) ; 0 ≤ y ≤ y₁ ≤ x} bounded by the lines y₁ = y and y₁ = x, so Fubini implies that
$$ \int_0^x\!\!\int_0^{y_1} dy\,dy_1 = \int_0^x\!\!\int_y^x dy_1\,dy. $$
Hence we see that
$$\begin{aligned} \int_0^x\!\!\int_0^{y_1} f^{(n+1)}(y)\,dy\,dy_1 &= \int_0^x\!\!\int_y^x f^{(n+1)}(y)\,dy_1\,dy &&\text{(Fubini's theorem)}\\ &= \int_0^x\Big(\int_y^x dy_1\Big) f^{(n+1)}(y)\,dy\\ &= \int_0^x (x-y)\,f^{(n+1)}(y)\,dy. \end{aligned}$$
This computation plus another application of Fubini's theorem give
$$\begin{aligned} \int_0^x\!\!\int_0^{y_2}\!\!\int_0^{y_1} f^{(n+1)}(y)\,dy\,dy_1\,dy_2 &= \int_0^x\!\!\int_0^{y_2} (y_2-y)\,f^{(n+1)}(y)\,dy\,dy_2 &&\text{(previous computation)}\\ &= \int_0^x\!\!\int_y^x (y_2-y)\,f^{(n+1)}(y)\,dy_2\,dy &&\text{(Fubini's theorem)}\\ &= \int_0^x\Big(\int_y^x (y_2-y)\,dy_2\Big) f^{(n+1)}(y)\,dy\\ &= \frac{1}{2}\int_0^x (x-y)^2 f^{(n+1)}(y)\,dy. \end{aligned}$$
Continuing this argument by induction shows that
$$ (7.18)\qquad R(x) = \frac{1}{n!}\int_0^x (x-y)^n f^{(n+1)}(y)\,dy. $$
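Formula (7.18) is easy to sanity-check numerically; here is an illustrative Python sketch (ours, with a hand-rolled Simpson rule) for f = sin and n = 3, where the remainder must make up the gap between sin x and the degree-3 Taylor polynomial x − x³/6.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# Take f = sin and n = 3 in (7.15)/(7.18): f^{(4)} = sin, the degree-3
# Taylor polynomial about 0 is x - x^3/6, and the remainder should be
# R(x) = (1/3!) int_0^x (x - y)^3 sin(y) dy.
x = 1.3
taylor_poly = x - x ** 3 / 6
R = simpson(lambda y: (x - y) ** 3 * math.sin(y), 0.0, x) / math.factorial(3)

assert abs((taylor_poly + R) - math.sin(x)) < 1e-10
```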
Equating (7.17) and (7.18) yields the Taylor formula (7.15).

Example : (Euler's formula for π²/6) Here's one of my favorite applications of Fubini's theorem: Yet another proof of Euler's formula for π²/6 — in fact, the proof below is simply a double integral version of the one presented at the end of Section 6.1.4!⁸ Consider the equal double integrals
$$ \int_0^1\!\!\int_0^\infty \frac{x}{(1+x^2)(1+x^2y^2)}\,dx\,dy = \int_0^\infty\!\!\int_0^1 \frac{x}{(1+x^2)(1+x^2y^2)}\,dy\,dx; $$

⁸ For another version of this argument, you can let the y integral limits go from y = 1 to y = ∞ instead of y = 0 to y = 1 and make the change of variables y = e^u; if you do this, you get the double integral considered in [184].
we have equality because of Tonelli. The left-hand double integral is a bit tricky to evaluate, but the right-hand double integral is easy:
$$ (7.19)\qquad \int_0^\infty\!\!\int_0^1 \frac{x}{(1+x^2)(1+x^2y^2)}\,dy\,dx = \int_0^\infty \frac{\tan^{-1}(xy)}{1+x^2}\Big|_{y=0}^{y=1}\,dx = \int_0^\infty \frac{\tan^{-1}(x)}{1+x^2}\,dx = \frac{\pi^2}{8} $$
(let u = tan⁻¹(x) to evaluate the integral). Now to evaluate the left-hand double integral, we first use the method of partial fractions (details left to you!) to find
$$ \int_0^\infty \frac{x}{(1+x^2)(1+x^2y^2)}\,dx = \frac{\log y}{y^2-1}. $$
Thus,
$$ \int_0^1\!\!\int_0^\infty \frac{x}{(1+x^2)(1+x^2y^2)}\,dx\,dy = \int_0^1 \frac{\log y}{y^2-1}\,dy. $$
Expanding in geometric series, 1/(y² − 1) = −1/(1 − y²) = −Σ_{k=0}^∞ y^{2k}, we see that
$$ \int_0^1 \frac{\log y}{y^2-1}\,dy = \int_0^1 \sum_{k=0}^\infty \big({-y^{2k}\log y}\big)\,dy = \sum_{k=0}^\infty \int_0^1 -y^{2k}\log y\,dy, $$
where we used that −y^{2k} log y are nonnegative on [0, 1] and we can always interchange summations with integrals involving nonnegative functions. Another short elementary calculus exercise, using integration by parts, shows that⁹
$$ \int_0^1 -y^{2k}\log y\,dy = \frac{1}{(2k+1)^2}. $$
Thus,
$$ \int_0^1\!\!\int_0^\infty \frac{x}{(1+x^2)(1+x^2y^2)}\,dx\,dy = \sum_{k=0}^\infty \frac{1}{(2k+1)^2}. $$
Recalling (7.19), we conclude that $\sum_{k=0}^\infty \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$. It follows that $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$, by the now familiar trick of summing over even and odd n; see the end of Section 6.1.
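An illustrative numerical sketch (hypothetical Python of ours, not part of the proof) of the two key evaluations: the iterated integral equals π²/8, and the odd reciprocal squares sum to the same value.

```python
import math

def simpson(f, a, b, n=100):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# Right-hand side: int_0^inf arctan(x)/(1+x^2) dx becomes int_0^{pi/2} u du
# after the substitution u = arctan(x), so there is no truncation to worry about.
I = simpson(lambda u: u, 0.0, math.pi / 2)
assert abs(I - math.pi ** 2 / 8) < 1e-12

# Left-hand side: the odd reciprocal squares creep up to the same value,
# and adding back the even terms gives pi^2/6.
odd_sum = sum(1.0 / (2 * k + 1) ** 2 for k in range(200000))
assert abs(odd_sum - math.pi ** 2 / 8) < 1e-5
full_sum = sum(1.0 / n ** 2 for n in range(1, 400000))
assert abs(full_sum - math.pi ** 2 / 6) < 1e-5
```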
◮ Exercises 7.3.

1. Let (X, S , µ) and (Y, T , ν) be two σ-finite measure spaces and let f(x) and g(y) be integrable functions on X and Y , respectively. Prove that
$$ \int fg\,d(\mu\otimes\nu) = \Big(\int f\,d\mu\Big)\Big(\int g\,d\nu\Big). $$
2. In this exercise, we use the Tonelli and Fubini theorems to derive interesting integral formulas. Verify the hypotheses of the theorem you use.
(a) By considering the iterated integral of e^{−xy} over [0, ∞) × [a, b], where 0 < a < b, show that
$$ \int_0^\infty \frac{e^{-ax} - e^{-bx}}{x}\,dx = \log\frac{b}{a}. $$
(b) By considering the iterated integral of 1/(1 + x²y²) over [0, ∞) × [a, b], where 0 < a < b, deduce the value of
$$ \int_0^\infty \frac{\tan^{-1}(bx) - \tan^{-1}(ax)}{x}\,dx. $$

⁹ You can also differentiate with respect to a the formula ∫₀¹ y^a dy = 1/(a + 1), then put a = 2k.
(c) Show that y/(1 + y²) = ∫₀^∞ e^{−xy} cos x dx, and use this to prove that
$$ \int_0^\infty \frac{e^{-bx} - e^{-ax}}{x}\cos x\,dx = \frac{1}{2}\log\frac{1+a^2}{1+b^2}, \qquad a, b > 0. $$
(d) From Equation (6.1) in Section 6.1.1 we know that ∫₀^∞ e^{−x²} cos 2xy dx = (√π/2) e^{−y²}. Use this formula to prove that
$$ \int_0^\infty e^{-x^2}\,\frac{\sin 2xy}{x}\,dx = \frac{\pi}{2}\,\operatorname{erf} y, $$
where erf y := ∫₀^y e^{−t²} dt is the error function.
(e) Verify that the iterated integrals of y e^{−(1+x²)y²} cos(ax) over [0, ∞) × [0, ∞) exist and are equal, and using Equation (6.1) in Section 6.1.1 and Problem 7c in Exercises 6.1, establish the formula
$$ \int_0^\infty \frac{\cos ax}{1+x^2}\,dx = \frac{\pi}{2}\,e^{-|a|}. $$
(f) Here is a slight modification of Laplace's trick (cf. [104, 245]) to derive the Gaussian integral. Let I = ∫₀^∞ e^{−x²} dx and show that
$$ I^2 = \int_0^\infty\!\!\int_0^\infty e^{-(x^2+y^2)}\,dy\,dx. $$
Changing variables y ↦ xy in the inner integral, prove that I = √π/2.

3. Let f(x, y) = xy/(x² + y²)² and g(x, y) = (x² − y²)/(x² + y²)². Show that
$$ \int_{-1}^1\!\!\int_{-1}^1 f(x,y)\,dx\,dy = \int_{-1}^1\!\!\int_{-1}^1 f(x,y)\,dy\,dx, $$
but f is not integrable on [−1, 1] × [−1, 1], and also show that
$$ \int_0^1\!\!\int_0^1 g(x,y)\,dx\,dy \neq \int_0^1\!\!\int_0^1 g(x,y)\,dy\,dx. $$
4. We give a couple useful formulas for integrating over special regions in Rn . (a) Let U ⊆ Rn−1 be Lebesgue measurable and let ϕ1 and ϕ2 be real-valued Lebesgue measurable functions on U such that ϕ1 < ϕ2 . Let A ⊆ Rn be the region between the graphs of ϕ1 and ϕ2 , thus, A = {(z, t) ∈ U × R ; ϕ1 (z) < t < ϕ2 (z)}.
Prove that A ⊆ Rⁿ is Lebesgue measurable and for any Lebesgue integrable function f on A, we have
$$ \int_A f\,dx = \int_U\Big(\int_{\varphi_1(z)}^{\varphi_2(z)} f(z,t)\,dt\Big)dz. $$
Using this formula compute the volume of the ball of radius r in R³ by letting φ₁ = 0, φ₂(x₁, x₂) = √(r² − x₁² − x₂²), and f = 1.
(b) Let Rⁿ = R^k × R^ℓ and let φ₁ : U₁ → R and φ₂ : U₂ → R be Lebesgue measurable, where U₁ ⊆ R^k and U₂ ⊆ R^ℓ are Lebesgue measurable. Let
$$ A = \{(y, z) \in U_1 \times U_2\ ;\ \varphi_1(y) < \varphi_2(z)\}. $$
Prove that A ⊆ Rⁿ is Lebesgue measurable and for any Lebesgue integrable function f on A, we have
$$ \int_A f\,dx = \int_{U_2}\Big(\int_{U_1(z)} f(y,z)\,dy\Big)dz, $$
where U₁(z) = {y ∈ U₁ ; φ₁(y) < φ₂(z)}. Next, compute the volume of the ball of radius r in R³ by letting φ₁(x₁) = x₁², φ₂(x₂, x₃) = r² − x₂² − x₃², and f = 1.
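As a sanity check on the slicing formula in part (a), here is an illustrative Python sketch (ours): with φ₁ = 0 and φ₂(x₁, x₂) = √(r² − x₁² − x₂²), the region A is the upper half-ball, and passing to polar coordinates in the outer integral reduces the volume to ∫₀^r 2πρ √(r² − ρ²) dρ = 2πr³/3.

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

r = 1.0
# Slicing formula for the upper half-ball: integrate the slice height
# phi_2(x1, x2) = sqrt(r^2 - x1^2 - x2^2) over the disk U, which in polar
# coordinates is int_0^r 2 pi rho sqrt(r^2 - rho^2) drho = 2 pi r^3 / 3.
half_ball = 2 * math.pi * simpson(
    lambda rho: rho * math.sqrt(max(0.0, r * r - rho * rho)), 0.0, r)
ball = 2 * half_ball  # the full ball, to compare with 4 pi r^3 / 3

assert abs(half_ball - 2 * math.pi / 3) < 1e-4
assert abs(ball - 4 * math.pi / 3) < 2e-4
```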
5. (Fubini-Tonelli for complete measures) In this problem we prove Theorem 7.13; feel free to use any results in Problem 7 in Exercises 7.2. Let µ : S → [0, ∞] and ν : T → [0, ∞] be σ-finite complete measures and let $\overline{\omega}$ denote the completion of the product measure ω = µ ⊗ ν.
(i) Prove that if f : X × Y → R is $\overline{\omega}$-measurable, then for a.e. x ∈ X, the section f_x is ν-measurable, and for a.e. y ∈ Y , the section f^y is µ-measurable.
(ii) Let f : X × Y → R be $\overline{\omega}$-measurable; then by (i) for a.e. x ∈ X, the section f_x is ν-measurable, and for a.e. y ∈ Y , the section f^y is µ-measurable. On the set of µ-measure zero where f_x is not ν-measurable, put f_x := 0. On the set of ν-measure zero where f^y is not µ-measurable, put f^y := 0. Using the Principle of Appropriate Functions, prove that if f is nonnegative, then
$$ x \mapsto \int_Y f(x,y)\,d\nu \ \text{ is µ-measurable} \quad\text{and}\quad y \mapsto \int_X f(x,y)\,d\mu \ \text{ is ν-measurable}, $$
and
$$ (7.20)\qquad \int f\,d\overline{\omega} = \int_X\Big(\int_Y f(x,y)\,d\nu\Big)d\mu = \int_Y\Big(\int_X f(x,y)\,d\mu\Big)d\nu. $$
(iii) Prove that an $\overline{\omega}$-measurable function f : X × Y → R is $\overline{\omega}$-integrable if and only if
$$ \int_X\Big(\int_Y |f(x,y)|\,d\nu\Big)d\mu < \infty, \ \text{ if and only if, }\ \int_Y\Big(\int_X |f(x,y)|\,d\mu\Big)d\nu < \infty. $$
(iv) Finally, prove that for any $\overline{\omega}$-integrable function f : X × Y → R, the formula (7.20) holds with the conventions explained just before the statement of Fubini's theorem (Theorem 7.11). You also have to prove that the inside integrals in (7.20) are both measurable and then integrable with respect to the measures given by the outside integrals.

6. (Beta function) A cousin of the Gamma function is the Beta function, defined by
$$ B(p, q) = \int_0^1 x^{p-1}(1-x)^{q-1}\,dx, \qquad p, q \in (0, \infty). $$
In this problem we derive some of its properties. If you wish to do so, you can make the change of variables x = 1 − t to check that B(p, q) = B(q, p) and, making the change of variables x = t/(1 + t) and x = cos² θ, you can also check that
$$ (7.21)\qquad B(p, q) = \int_0^\infty \frac{t^{p-1}}{(1+t)^{p+q}}\,dt = 2\int_0^{\pi/2} \cos^{2p-1}\theta\,\sin^{2q-1}\theta\,d\theta. $$
(i) We now prove a very interesting formula for the Beta function in terms of the Gamma function. First, prove that for any p, q > 0,
$$ \Gamma(p)\,\Gamma(q) = \int_0^\infty\!\!\int_0^\infty x^{p-1} y^{q-1} e^{-x-y}\,dx\,dy. $$
With y > 0 held fixed, in the inner integral make the change of variables x = ty, then use Tonelli's theorem to change the order of integration to derive the Beta-Gamma formula
$$ B(p, q) = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)}. $$
(ii) As a straightforward application of the Beta-Gamma formula, we give two proofs of the Legendre duplication formula (named after Adrien-Marie Legendre (1752–1833)): For all p > 0,
$$ \Gamma(2p) = \frac{2^{2p-1}}{\sqrt{\pi}}\,\Gamma(p)\,\Gamma\Big(p + \frac{1}{2}\Big). $$
(1) To derive this formula, use the Beta-Gamma formula with q = p to prove that
$$ \frac{\Gamma(p)^2}{\Gamma(2p)} = \int_0^1 x^{p-1}(1-x)^{p-1}\,dx = 2\int_{1/2}^1 x^{p-1}(1-x)^{p-1}\,dx. $$
Next, make the change of variables x = ½(1 + √t) and use the Beta-Gamma formula to get the duplication formula.
(2) For another (easier!) derivation, use the identity 2 cos θ sin θ = sin 2θ in the last integral in (7.21) to prove that B(p, p) = 2^{1−2p} B(p, ½), then use this formula to prove the duplication formula.
(iii) By expressing the integrals as Beta functions, prove the formulas
$$ \int_0^1 \frac{dx}{\sqrt[4]{1-x^4}} = \int_0^\infty \frac{dx}{1+x^4} = \frac{\pi}{2\sqrt{2}}. $$
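Both the Beta-Gamma formula and the part (iii) evaluations can be checked numerically; the following Python sketch is illustrative only (the quadrature parameters are ad hoc, and p = 2.5, q = 3.5 is an arbitrary test point).

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# Beta-Gamma formula at the test point p = 2.5, q = 3.5; taking
# p, q > 1 keeps the integrand bounded, so plain quadrature is safe.
p, q = 2.5, 3.5
B_quad = simpson(lambda x: x ** (p - 1) * (1 - x) ** (q - 1), 0.0, 1.0)
B_gamma = math.gamma(p) * math.gamma(q) / math.gamma(p + q)
assert abs(B_quad - B_gamma) < 1e-8

# Part (iii): int_0^inf dx/(1+x^4) = pi/(2 sqrt 2); truncating at 200
# costs only about 1/(3 * 200^3).
I = simpson(lambda x: 1.0 / (1.0 + x ** 4), 0.0, 200.0, 200000)
assert abs(I - math.pi / (2 * math.sqrt(2))) < 1e-6

# The companion integral, via t = x^4, is (1/4) B(1/4, 3/4), which the
# Beta-Gamma formula evaluates to the same constant pi/(2 sqrt 2).
J = math.gamma(0.25) * math.gamma(0.75) / 4
assert abs(J - math.pi / (2 * math.sqrt(2))) < 1e-9
```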
7. (Fourier Convolution) In this problem we define convolution of functions on Rⁿ. Given integrable functions f, g : Rⁿ → R, for each x ∈ Rⁿ, the integral
$$ (7.22)\qquad h(x) = \int_{\mathbb{R}^n} f(x-y)\,g(y)\,dy $$
is called the convolution of f and g and is denoted by f ∗ g. This definition of convolution is important for the Fourier transform as we shall see in Chapter ??. It is not obvious that f(x − y) g(y) is integrable as a function of y, so h(x) may not even be defined! However, we shall see that h(x) is defined a.e. and moreover, h is integrable.
(a) Show that F : Rⁿ × Rⁿ → R defined by F(x, y) = f(x − y) g(y) is Lebesgue measurable. Suggestion: To show that f₁(x, y) := f(x − y) is measurable, note that f₁ = f ◦ G ◦ H, where G : Rⁿ × Rⁿ → Rⁿ is the map G(x, y) = x and H : Rⁿ × Rⁿ → Rⁿ × Rⁿ is the invertible linear transformation H(x, y) = (x − y, y).
(b) Show that
$$ \iint |F(x,y)|\,dx\,dy = \Big(\int |f|\Big)\Big(\int |g|\Big). $$
(c) Show that h(x) is defined a.e., h is integrable (defining h to be whatever you want to, say zero, on the set of measure zero where it's not defined by the convolution integral), and
$$ \int h = \Big(\int f\Big)\Big(\int g\Big). $$

8. (Laplace Convolution) In this problem we study convolution on [0, ∞), which is important for the Laplace transform. Please assume the results of Problem 7.
(a) Given integrable functions f, g : [0, ∞) → R, we extend them to all of R by defining them to be zero on the negative real axis. Denoting the extended functions by the same letters, for each x ∈ [0, ∞), let h := f ∗ g where ∗ is the convolution defined in (7.22) (now with n = 1) of the previous problem. Prove that
$$ h(x) = \int_0^x f(x-y)\,g(y)\,dy. $$
(b) Show that if f and g are only assumed to be locally integrable on [0, ∞), which means that f, g are integrable on [0, a] for all a ≥ 0, then h(x) is defined a.e. and h is also locally integrable on [0, ∞).
(c) Recall (see Problem 8 in Exercises 6.1) that the Laplace transform of a measurable function f on [0, ∞) is defined by L(f)(s) = ∫₀^∞ e^{−sx} f(x) dx, provided this integral exists. Let f and g be locally integrable on [0, ∞) and suppose that both L(f)(s) and L(g)(s) exist for some s ∈ R. Prove that L(f ∗ g)(s) is also defined, and
$$ \mathscr{L}(f*g)(s) = \mathscr{L}(f)(s)\cdot\mathscr{L}(g)(s). $$
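Before proving part (c), it can be reassuring to see the multiplicativity numerically; here is an illustrative Python sketch (ours) with f(x) = g(x) = x, where f ∗ g(x) = x³/6, L(f)(s) = 1/s², and so L(f ∗ g)(s) should be 1/s⁴.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

f = lambda x: x
g = lambda x: x

def conv(x):
    """(f*g)(x) = int_0^x f(x-y) g(y) dy; here it equals x^3/6."""
    return simpson(lambda y: f(x - y) * g(y), 0.0, x, 200)

def laplace(func, s, cutoff=80.0, n=4000):
    """L(func)(s) = int_0^inf e^{-sx} func(x) dx, truncated where e^{-sx} dies."""
    return simpson(lambda x: math.exp(-s * x) * func(x), 0.0, cutoff / s, n)

s = 2.0
assert abs(conv(1.0) - 1.0 / 6.0) < 1e-12
lhs = laplace(conv, s)               # L(f*g)(s), should be 1/s^4
rhs = laplace(f, s) * laplace(g, s)  # L(f)(s) L(g)(s) = (1/s^2)^2
assert abs(laplace(f, s) - 1.0 / s ** 2) < 1e-6
assert abs(lhs - rhs) < 1e-6
```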
(d) (Beta-Gamma formula [266, 45]) As a neat application of the multiplicativity of the Laplace transform, we give another proof of the Beta-Gamma formula from Problem 6. Let f(x) = x^{p−1} and g(x) = x^{q−1}. Show that f ∗ g(x) = x^{p+q−1} B(p, q), where "∗" denotes the Laplace convolution. Next, computing the Laplace transforms of f, g, and f ∗ g, prove the Beta-Gamma formula.

9. (Laplace's double integral trick) We end this problem set with a very interesting problem taken directly from Laplace's 1781 paper [160]. Take the integrals in Part (iii) of the Beta function Problem 6 as known.
(i) We begin with an elementary calculus exercise, computing the arclength of the lemniscate r² = cos(2θ) (it looks like an infinity sign). Recall from calculus that the arclength integral in polar coordinates is given by
$$ \int \sqrt{r^2 + \Big(\frac{dr}{d\theta}\Big)^2}\,d\theta. $$
Taking r = √(cos 2θ) for 0 ≤ θ ≤ π/4, and then making the substitution z = √(cos 2θ), prove that the arclength of the lemniscate is 2̟, where ̟ (read "pi script") is the integral
$$ \varpi = 2\int_0^1 \frac{dz}{\sqrt{1-z^4}}. $$
The integral on the right appeared in a paper of Jacob Bernoulli in 1691 and the notation ̟ for it is due to Gauss. If C = ∫₀^∞ e^{−t⁴} dt, for the rest of this problem we shall derive Laplace's formula for C in terms of ̟.
(ii) Consider the double integral
$$ (7.23)\qquad \int_0^\infty\!\!\int_0^\infty e^{-s(1+u^4)}\,ds\,du = \int_0^\infty\!\!\int_0^\infty e^{-s(1+u^4)}\,du\,ds. $$
Show that the left-hand integral equals π/(2√2), and in the right-hand integral make the change of variables u = t/⁴√s in the inside integral; then, after evaluating the inside integral, make the change of variables s = (s′)⁴, and you should get that the right-hand integral equals 4CC′, where
$$ \frac{\pi}{2\sqrt{2}} = 4CC', \qquad C = \int_0^\infty e^{-t^4}\,dt, \quad C' = \int_0^\infty s'^2\,e^{-s'^4}\,ds'. $$
Thus, if you can determine C, you can determine C′, and vice versa. Here's what Laplace says [160, p. 65]:

As for the value of C, it has not yet been possible, in spite of many attempts, to restore it to the arcs of a circle or to logarithms; but I have found that it depended on the rectification of the elastic rectangle curve or, what amounts to the same, on the integral ∫ dx/√(1−x⁴) taken from 0 to 1;

Recall from (i) that the integral ∫₀¹ dx/√(1−x⁴) equals ̟/2 where 2̟ is the arclength of the lemniscate r² = cos(2θ), which Laplace calls an "elastic rectangle curve".
(iii) To relate C to the arclength of the lemniscate, Laplace first relates the constant C to the new constant E := ∫₀^∞ du/√(1+u⁴). To do so, consider (7.23) with s replaced by s²: in the inner integral on the right-hand side, make the change of variables u = t/√s, then make the change of variables s = (s′)², and you should get
$$ \int_0^\infty\!\!\int_0^\infty e^{-s^2(1+u^4)}\,du\,ds = 2C^2. $$
In the inner integral on the left-hand side, make the change of variables s = s″/√(1+u⁴); the end result is the equality ½E√π = 2C².
(iv) Finally, Laplace proves E = ̟/√2 by the following ingenious device. First show that
$$ E = \int_0^1 \frac{ds}{(1-s^4)^{3/4}} $$
by making the change of variables s = 1/⁴√(1+u⁴) in the
definition of E = ∫₀^∞ du/√(1+u⁴). Next, show that
$$ \int_0^1\!\!\int_0^{\sqrt[4]{1-z^2}} \frac{dx\,dz}{(1-z^2-x^4)^{3/4}} = \int_0^1\!\!\int_0^{\sqrt{1-x^4}} \frac{dz\,dx}{(1-z^2-x^4)^{3/4}}. $$
In the inner integral on the left, make the change of variables x = x′ ⁴√(1−z²), and in the inner integral on the right, make the change of variables z = z′ √(1−x⁴).
7.4. Change of variables in multiple integrals

Probably the most used integration trick in single variable calculus is the "u-substitution," or change of variables, method of integration: If x = F(u), then
$$ \int f(x)\,dx = \int f(F(u))\,F'(u)\,du. $$
This section is devoted to the n-dimensional version of this formula.
7.4.1. The statement of the change of variables formula. In multi-variable calculus you probably learned the two and three-dimensional change of variables formulas. To review the two-dimensional formula, let x = F1 (u, v) and y = F2 (u, v) be a change of variables from an open subset U of the uv-plane to an open subset V of the xy-plane. Assume that F1 , F2 have continuous first order partial derivatives and that the transformation Carl Jacobi (1804–1851).
F :U →V
defined by F (u, v) = (F1 (u, v), F2 (u, v))
is a homeomorphism and its Jacobian
$$J_F = \det\begin{pmatrix} \partial_u F_1 & \partial_v F_1\\ \partial_u F_2 & \partial_v F_2 \end{pmatrix}$$
is everywhere nonzero; here, "Jacobian" is named after Carl Gustav Jacob Jacobi (1804–1851), who wrote a treatise on such determinants in 1841 [138]. Then the change of variables formula says that for any Riemann integrable function $f$ on $\mathcal{V}$, we have
$$(7.24)\qquad \int_{\mathcal{V}} f(x,y)\, dx\, dy = \int_{\mathcal{U}} f(F(u,v))\, |J_F(u,v)|\, du\, dv.$$
This two-dimensional change of variables formula was first derived by Leonhard Paul Euler (1707–1783) in 1770 in the paper De formulis integralibus duplicatis (On double integral formulas) [90], which is also considered the first paper to develop the notion of the double integral and to write double integrals as iterated integrals. More than half a century later, the $n$-dimensional formula was established by Mikhail Ostrogradski (1801–1862) in 1836 [215]; see [145] for more history on the problem, and see the Notes section at the end of this chapter for Euler's ingenious proof of the change of variables formula.
Example: (Polar coordinates) Here, $\mathcal{U} = \{(r,\theta)\,;\, 0 < r < \infty,\ 0 < \theta < 2\pi\}$, $\mathcal{V} = \mathbb{R}^2 \setminus \{(x,0)\,;\, x \ge 0\}$ (the plane minus the nonnegative $x$-axis), and $F : \mathcal{U} \to \mathcal{V}$ is defined by $F(r,\theta) = (r\cos\theta, r\sin\theta)$.
Then $F$ is a homeomorphism and
$$J_F = \det\begin{pmatrix} \partial_r F_1 & \partial_\theta F_1\\ \partial_r F_2 & \partial_\theta F_2 \end{pmatrix} = \det\begin{pmatrix} \cos\theta & -r\sin\theta\\ \sin\theta & r\cos\theta \end{pmatrix} = r.$$
Thus, for any Riemann integrable function $f$ on $\mathcal{V}$, we have
$$\int_{\mathcal{V}} f(x,y)\, dx\, dy = \int_{\mathcal{U}} f(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta = \int_0^{2\pi}\!\!\int_0^\infty f(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta.$$
In Problems 1 and 4 and in Section 7.6 we’ll study polar coordinates in great detail.
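The polar formula can be spot-checked numerically; the sketch below (our own illustration) integrates a Gaussian both ways, truncating the unbounded domains where the integrand is negligible:

```python
import math

def grid_sum(g, ax, bx, ay, by, n=600):
    # midpoint rule on an n×n grid
    hx, hy = (bx - ax) / n, (by - ay) / n
    return sum(g(ax + (i + 0.5) * hx, ay + (j + 0.5) * hy)
               for i in range(n) for j in range(n)) * hx * hy

f = lambda x, y: math.exp(-(x * x + y * y))  # ∫∫_{R²} f = π

# Cartesian coordinates, truncated to the box [−6,6]² (the tail is ≈ e⁻³⁶)
cart = grid_sum(f, -6, 6, -6, 6)

# polar coordinates: ∫ f(r cos θ, r sin θ) r dr dθ, truncated at r = 8
polar = grid_sum(lambda r, t: f(r * math.cos(t), r * math.sin(t)) * r,
                 0, 8, 0, 2 * math.pi)

assert abs(cart - math.pi) < 1e-3 and abs(polar - math.pi) < 1e-3
```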
The generalization of (7.24) to higher dimensions is as follows. Consider a homeomorphism F :U →V
where $\mathcal{U}$ and $\mathcal{V}$ are open sets in $\mathbb{R}^n$. Since $F$ maps into a subset of $\mathbb{R}^n$, we can write $F(x)$ in its components, $F(x) = (y_1(x), \dots, y_n(x))$, where $y_i : \mathcal{U} \to \mathbb{R}$ is the $i$-th component of $F$. We say that $F$ is a $C^1$ diffeomorphism if each $y_i : \mathcal{U} \to \mathbb{R}$ has continuous first-order partial derivatives. In general, we say that $F$ is a $C^k$ diffeomorphism if each $y_i$ has continuous $k$-th order partial derivatives, and we say that $F$ is a smooth, or $C^\infty$, diffeomorphism if each $y_i$ has continuous partial derivatives of all orders. We remark that from the Inverse Function Theorem of undergraduate analysis it follows that if $F$ is a $C^k$ diffeomorphism, then $F^{-1} : \mathcal{V} \to \mathcal{U}$ is also a $C^k$ diffeomorphism — if you've forgotten the Inverse Function Theorem, just define $F$ to be a $C^k$ diffeomorphism if each component function of $F$ and $F^{-1}$ has continuous $k$-th order partial derivatives. The Jacobian matrix of $F$ is the $n\times n$ matrix of partial derivatives
$$DF = \left[\frac{\partial y_i}{\partial x_j}\right]_{i,j=1,\dots,n},$$
where the $(i,j)$-th component of the matrix is $\frac{\partial y_i}{\partial x_j}$. If $F$ is a diffeomorphism, $DF(x)$ is always an invertible matrix. The Jacobian, or Jacobian determinant, of $F$ is
$$J_F := \det DF.$$
The following theorem is the main goal of this section.

The Change of Variables Theorem

Theorem 7.14. If $F : \mathcal{U} \to \mathcal{V}$ is a $C^1$ diffeomorphism, then for any Lebesgue measurable function $f : \mathcal{V} \to \mathbb{R}$ which is either nonnegative or integrable, we have
$$(7.25)\qquad \int_{\mathcal{U}} (f\circ F)\, |J_F| = \int_{\mathcal{V}} f.$$
When $F$ is an affine transformation, this theorem was proved back in Section 5.5.2; see Equation (5.21) in that section. There are many proofs of (7.25), some of which you'll find in the exercises. My favorite proof10 of (7.25) is based on the observation that for a fixed function $f$ on $\mathcal{V}$, the left-hand side of (7.25) does not depend at all on the transformation $F : \mathcal{U} \to \mathcal{V}$ (for all such transformations, the left-hand side of (7.25) equals the constant $\int_{\mathcal{V}} f$).

10 Found in Herbert Leinfelder and Christian Simader's paper [174]; see Problem 7.

Here's a rough idea of how to
exploit this fact. Suppose for simplicity that $\mathcal{U} = \mathcal{V} = \mathbb{R}^n$ and suppose that for each $t \in [0,1]$ there is a transformation
$$F_t : \mathbb{R}^n \to \mathbb{R}^n$$
such that $F_1 = F$ and $F_0$ is a particularly simple transformation for which we know the change of variables formula holds (e.g. $F_0$ could be an affine transformation). Next, consider the function $F : [0,1] \to \mathbb{R}$ defined by
$$F(t) = \int (f\circ F_t)\, |J_{F_t}|,$$
which should be a constant function of $t \in [0,1]$ (because it should equal $\int f$ for any $t$). In Lemma 7.17, under certain assumptions we shall prove that $F'(t) = 0$, which implies that $F(t)$ is constant; it follows in particular that
$$F(1) = F(0), \quad\text{or}\quad \int (f\circ F)\, |J_F| = \int f,$$
and the change of variables formula holds. To prove that $F'(t) = 0$ we need to take derivatives of $f$, so this argument will work in the case that $f$ is differentiable. We shall then finish the proof for a general Lebesgue integrable function $f$ using the Principle of Appropriate Functions; this is done in Subsection 7.4.3. Actually, we also have to take derivatives involving $F$ and $J_F$, so, as we don't want to worry about how many derivatives we are allowed to take, for convenience we shall for the rest of this section assume that the diffeomorphism $F$ in the change of variables formula is smooth, that is, $C^\infty$. However, the change of variables formula holds assuming only that $F$ is $C^1$ — see Problem 10 — which is why we stated Theorem 7.14 above for $C^1$ diffeomorphisms.11

7.4.2. Simader's Theorem. Assuming that we can differentiate through the integral sign, we want to show that
$$F'(t) = \int \frac{\partial}{\partial t}\Big[(f\circ F_t)\, J_{F_t}\Big]$$
is zero. How do we prove that an integral vanishes? One of the easiest ways is to show that the integrand is a divergence12 and use the following lemma.

Lemma 7.15. Let $\nu : \mathbb{R}^n \to \mathbb{R}^n$ be a smooth function and define the divergence of $\nu$ as the scalar function $\operatorname{div}\nu : \mathbb{R}^n \to \mathbb{R}$ by the formula
$$\operatorname{div}\nu := \sum_{j=1}^n \frac{\partial}{\partial x_j}\nu_j,$$
where we write $\nu = (\nu_1, \nu_2, \dots, \nu_n)$ in components. If $\nu$ has compact support, then $\int_{\mathbb{R}^n} \operatorname{div}\nu = 0$.

11 If you need a change of variables formula where $F$ is not even assumed $C^1$, see Rudin's book [243, Th. 7.26]. However, we remark that all the changes of variables you've ever seen (e.g. polar, cylindrical, and spherical coordinates, and even more obscure ones like parabolic or elliptical coordinates) are smooth diffeomorphisms.
12 From multi-variable calculus, you should remember that if $\nu = (\nu_1, \nu_2, \nu_3)$ is a vector field on $\mathbb{R}^3$ (say with coordinates $x, y, z$), then $\operatorname{div}\nu = \partial_x\nu_1 + \partial_y\nu_2 + \partial_z\nu_3$. This is why the $n$-dimensional version of divergence is defined the way it is in Lemma 7.15.
Proof: By assumption, each $\nu_j$ vanishes outside of some box $(-a,a)^n$; therefore, writing the integral over $\mathbb{R}^n$ as iterated integrals and using Fubini's theorem, for any $j = 1, 2, \dots, n$, we have
$$\int_{\mathbb{R}^n} \frac{\partial}{\partial x_j}\nu_j = \int_{\mathbb{R}^{n-1}}\left(\int_{-a}^{a} \frac{\partial}{\partial x_j}\nu_j\, dx_j\right) dx_1\cdots dx_{j-1}\, dx_{j+1}\cdots dx_n.$$
By the Fundamental Theorem of Calculus,
$$\int_{-a}^{a} \frac{\partial}{\partial x_j}\nu_j\, dx_j = \nu_j\,\Big|_{x_j=-a}^{x_j=a} = 0.$$
It follows that $\int \operatorname{div}\nu = 0$.
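Lemma 7.15 is easy to test numerically. In the sketch below (a two-dimensional illustration of our own), the divergence of a compactly supported field is computed analytically from a standard bump function and integrated by the midpoint rule; the result vanishes even though $\operatorname{div}\nu$ is not identically zero:

```python
import math

def bump(x):
    # smooth, compactly supported in (−1, 1)
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

def dbump(x):
    # derivative of bump, also supported in (−1, 1)
    return bump(x) * (-2.0 * x / (1.0 - x * x) ** 2) if abs(x) < 1 else 0.0

# a smooth vector field ν = (ν1, ν2) supported in the box (−1, 1)²
# ν1(x,y) = bump(x) bump(y),  ν2(x,y) = sin(x) bump(x) bump(y)
def div_nu(x, y):  # ∂x ν1 + ∂y ν2, computed analytically
    return dbump(x) * bump(y) + math.sin(x) * bump(x) * dbump(y)

n, h = 500, 2.0 / 500
total = sum(div_nu(-1 + (i + 0.5) * h, -1 + (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h
assert abs(total) < 1e-6  # ∫ div ν = 0
```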
We now state the precise assumptions on the family of transformations $F_t$. Let $I \subseteq \mathbb{R}$ be a closed interval, let $\mathcal{U}$ be an open subset of $\mathbb{R}^n$, and suppose that for each $t \in I$ we are given a smooth function $F_t : \mathcal{U} \to \mathbb{R}^n$.
We shall call Ft a smooth family of functions if the function F : I × U → Rn
defined by $F(t,x) := F_t(x)$ is smooth (infinitely differentiable); in other words, writing $F_t(x)$ in its components, $F_t(x) = (y_1(t,x), \dots, y_n(t,x))$, we assume that each $y_i(t,x)$ is a smooth function13 of $(t,x) \in I\times\mathcal{U}$. Theorem 7.16 below is due to Christian Simader, who used it as part of his 1973 Antrittsvorlesung (first public lecture) for his Habilitation (admission as a university professor). It can be used not only to prove the change of variables theorem, but also to prove many other results, as we'll see in Section 7.5. A function $f : X \to \mathbb{R}$ on a set $X$ is said to be supported in (or on) a subset $A \subseteq X$ if $f(x) = 0$ for all $x \notin A$; that is, at all points not in $A$, the function $f$ vanishes. When $X$ is a topological space, $f$ has compact support if $f$ is supported in a compact subset $A$ of $X$.

Simader's Theorem

Theorem 7.16. If $f : \mathbb{R}^n \to \mathbb{R}$ is smooth and $f\circ F_t$ is supported in some fixed compact set for all $t \in I$, then the integral
$$\int (f\circ F_t)\, J_{F_t}$$
is a constant function of t ∈ I.
13 At the endpoints of $I$ we require the one-sided derivatives to exist.

Proof: To prove that $F(t) := \int (f\circ F_t)\, J_{F_t}$ is constant, we just have to prove that $F'(t) = 0$. Since $f\circ F_t$ is supported in some fixed compact set for all $t \in I$, we can apply the Differentiation Theorem for Lebesgue integrals (Theorem 5.34), which implies that
$$F'(t) = \int \frac{\partial}{\partial t}\Big[(f\circ F_t)\, J_{F_t}\Big].$$
To prove this vanishes, we shall write $\frac{\partial}{\partial t}\big[(f\circ F_t)\, J_{F_t}\big]$ as the divergence of a certain vector field; the divergence lemma above then finishes the proof. This divergence formula is provided in Step 2 below.
Step 1: Determinant identities. We denote by $A_{ij}$ the $(i,j)$-th cofactor of $DF_t = [\partial y_i/\partial x_j]$. Thus, $A_{ij}$ is $(-1)^{i+j}$ times the determinant of the matrix
formed by deleting the $i$-th row and $j$-th column of the matrix $DF_t$. In Problem 6 we ask you to prove the following identities: for any $i, k = 1, \dots, n$, we have
$$(7.26)\qquad \sum_{j=1}^n \frac{\partial}{\partial x_j} A_{ij} = 0 \quad\text{and}\quad \sum_{j=1}^n \frac{\partial y_k}{\partial x_j} A_{ij} = \begin{cases} J_{F_t}, & \text{if } k = i;\\ 0, & \text{if } k \ne i. \end{cases}$$
The proofs of these identities are elementary in the sense that they involve just elementary calculus and manipulating determinants, which is why we leave the (somewhat boring) calculations for those interested.
Step 2: Divergence formula. We shall prove that $\frac{\partial}{\partial t}\big[(f\circ F_t)\, J_{F_t}\big] = \operatorname{div}\nu$, where $\nu = (\nu_1, \dots, \nu_n)$ is the vector field with $j$-th component
$$\nu_j = (f\circ F_t)\sum_{i=1}^n \frac{\partial y_i}{\partial t}\, A_{ij}.$$
Once we prove this claim, the divergence lemma completes our proof! Now to verify our claim, observe that by the product rule and the first identity in (7.26), we have
$$(7.27)\qquad \operatorname{div}\nu = \sum_{j=1}^n \frac{\partial}{\partial x_j}\Big[(f\circ F_t)\sum_{i=1}^n \frac{\partial y_i}{\partial t} A_{ij}\Big] = \sum_{i,j=1}^n \frac{\partial}{\partial x_j}\big[f\circ F_t\big]\,\frac{\partial y_i}{\partial t}\, A_{ij} + \sum_{i,j=1}^n (f\circ F_t)\,\frac{\partial^2 y_i}{\partial x_j\,\partial t}\, A_{ij}.$$
We evaluate the first term on the right as follows. Using the chain rule and the second identity in (7.26), we obtain
$$\begin{aligned}
\sum_{i,j=1}^n \frac{\partial}{\partial x_j}\big[f\circ F_t\big]\,\frac{\partial y_i}{\partial t}\, A_{ij}
&= \sum_{i,j,k=1}^n \Big(\frac{\partial f}{\partial y_k}\circ F_t\Big)\frac{\partial y_k}{\partial x_j}\frac{\partial y_i}{\partial t}\, A_{ij}\\
&= \sum_{i,k=1}^n \Big(\frac{\partial f}{\partial y_k}\circ F_t\Big)\frac{\partial y_i}{\partial t}\sum_{j=1}^n \frac{\partial y_k}{\partial x_j}\, A_{ij}\\
&= \sum_{k=1}^n \Big(\frac{\partial f}{\partial y_k}\circ F_t\Big)\frac{\partial y_k}{\partial t}\, J_{F_t} = \frac{\partial}{\partial t}\big[f\circ F_t\big]\, J_{F_t}.
\end{aligned}$$
Hence, the first term on the right in (7.27) is just $J_{F_t}\,\partial_t(f\circ F_t)$. We evaluate the second term on the right in (7.27) as follows. Let $c_i$ denote the $i$-th column of $DF_t$. Then we can write $J_{F_t} = \det(c_1, \dots, c_n)$, where the notation "$\det(c_1, \dots, c_n)$" means the determinant of the matrix with columns $c_1, \dots, c_n$; in this case, the matrix is just $DF_t$. Now, as an exercise in derivatives, we leave you to check that (you can think of this as the "product rule for determinants")
$$\frac{\partial}{\partial t}\det(c_1, \dots, c_n) = \sum_{j=1}^n \det(c_1, \dots, c_{j-1}, \partial_t c_j, c_{j+1}, \dots, c_n),$$
where $\partial_t c_j$ is the vector obtained from $c_j$ by taking the partial derivative $\partial_t$ of each of its entries. Observe that the $j$-th term is
$$\det(c_1, \dots, c_{j-1}, \partial_t c_j, c_{j+1}, \dots, c_n) = \sum_{i=1}^n \frac{\partial}{\partial t}\frac{\partial y_i}{\partial x_j}\, A_{ij},$$
where we expanded the determinant about the $j$-th column. Hence, the second term on the right in (7.27) is just $(f\circ F_t)\,\partial_t J_{F_t}$. Thus, by the product rule, the expression in (7.27) equals $\partial_t\big[(f\circ F_t)\, J_{F_t}\big]$, which is what we set out to prove.
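Simader's theorem can be illustrated in one dimension. In the sketch below (our own choice of family, not from the text), $F_t(x) = x + t\,s(x)$ with $s$ a bump function satisfying $|s'| < 1$, so each $F_t$ is a diffeomorphism with $J_{F_t} = 1 + t\,s'(x)$; the integral comes out the same for every $t$:

```python
import math

def s(x):  # smooth bump supported in (−1, 1); |s'| < 1 for this bump
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

def ds(x):  # s'(x)
    return s(x) * (-2.0 * x / (1.0 - x * x) ** 2) if abs(x) < 1 else 0.0

def f(y):  # smooth function supported in (−2, 2), so f∘F_t lives in [−3, 3]
    return math.exp(-1.0 / (1.0 - (y / 2) ** 2)) if abs(y) < 2 else 0.0

def integral(t, n=4000):
    # ∫ f(F_t(x)) J_{F_t}(x) dx  with  F_t(x) = x + t s(x),  J_{F_t} = 1 + t s'(x)
    h = 6.0 / n
    return sum(f(x + t * s(x)) * (1 + t * ds(x))
               for x in (-3 + (i + 0.5) * h for i in range(n))) * h

vals = [integral(t) for t in (0.0, 0.3, 0.7, 1.0)]
assert max(vals) - min(vals) < 1e-9  # constant in t, as the theorem predicts
```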
As our first application of Simader's theorem, we prove the following special case of the change of variables formula;14 the general change of variables formula is proved afterwards.

Lemma 7.17. Let $F : \mathbb{R}^n \to \mathcal{V}$ be a smooth diffeomorphism onto an open subset $\mathcal{V}$ of $\mathbb{R}^n$. Then for any smooth function $f$ on $\mathbb{R}^n$ supported in an open box in $\mathcal{V}$, the change of variables formula holds:
$$\int (f\circ F)\, |J_F| = \int f.$$

Proof: We shall apply Simader's theorem to an appropriate family of smooth functions $F_t$. Before doing so, observe that since the change of variables formula holds for affine transformations, by appropriate translations we may assume that $\mathcal{V}$ contains the origin, that the open box in which $f$ is supported is centered at the origin, and, finally, that $F(0) = 0$. We now prove the lemma in two steps.
Step 1: We define a family of smooth functions $F_t : \mathbb{R}^n \to \mathbb{R}^n$, where $t \in [0,1]$, such that $F_1 = F$ and $F_0 = A$, where $A = DF(0)$ is the linear transformation with matrix $DF(0)$. Indeed, simply define
$$(7.28)\qquad F_t(x) = \begin{cases} F(tx)/t, & \text{if } 0 < t \le 1;\\ Ax, & \text{if } t = 0. \end{cases}$$
Certainly $F_t$ is smooth in $t$ for $t \ne 0$. To see that $F_t$ is smooth near $t = 0$, let $g(t,x) = F(tx)$. Then $g(t,x)$ is smooth in $t$ and $x$, $g(0,x) = 0$, and it's an exercise to check that $(\partial_t g)(t,x) = DF(tx)\, x$. Thus, by the Fundamental Theorem of Calculus,
$$g(t,x) - g(0,x) = \int_0^t (\partial_s g)(s,x)\, ds, \quad\text{or}\quad F(tx) = \int_0^t DF(sx)\, x\, ds.$$
Making the change of variables $s \mapsto st$ in the integral, we obtain
$$F(tx) = t\int_0^1 DF(tsx)\, x\, ds;$$
thus, dividing by $t$ we obtain
$$(7.29)\qquad F_t(x) = \int_0^1 DF(tsx)\, x\, ds.$$
Note that putting $t = 0$ in (7.29) gives $F_0(x) = DF(0)x$, which is in agreement with the original definition (7.28). Moreover, the formula (7.29) shows that $F_t(x)$ is smooth near $t = 0$.
Step 2: Let $f$ be a continuously differentiable function supported on some compact set $K$ in an open box $B$ centered at the origin in $\mathcal{V}$. We claim that $f\circ F_t$ is supported in some fixed compact set for all $t \in [0,1]$. Indeed, observe that
$$f\circ F_t(x) \ne 0 \implies F_t(x) \in K \implies \begin{cases} F(tx)/t \in K, & \text{if } 0 < t \le 1;\\ Ax \in K, & \text{if } t = 0 \end{cases} \implies \begin{cases} x \in F^{-1}(tK)/t, & \text{if } 0 < t \le 1;\\ x \in A^{-1}(K), & \text{if } t = 0. \end{cases}$$

14 I thank Marcin Mazur for his help in developing the proof of Lemma 7.17.
It follows that if we define $G : [0,1]\times B \to \mathbb{R}^n$ by
$$G(t,x) := \begin{cases} F^{-1}(tx)/t, & \text{if } 0 < t \le 1;\\ A^{-1}x, & \text{if } t = 0, \end{cases}$$
then $f\circ F_t$ is supported in the set $G([0,1]\times K)$. The same argument we used to show that $F_t$ is smooth shows that $G$ is smooth. In particular, $G : [0,1]\times B \to \mathbb{R}^n$ is continuous, so, as the image of a compact set under a continuous map is again compact, it follows that $G([0,1]\times K)$ is compact. Hence, $f\circ F_t$ is supported in a fixed compact set for all $t \in [0,1]$. Therefore,
$$\int (f\circ F_t)\, J_{F_t}$$
is constant by Simader's Theorem.
In particular, taking $t = 1$ and $t = 0$, we see that
$$(7.30)\qquad \int (f\circ F)\,\det DF = \int (f\circ A)\,\det A.$$
Now, $DF(x)$ is invertible for all $x \in \mathbb{R}^n$, so $\det DF$ is never zero (because the determinant of a matrix is zero if and only if the matrix is not invertible), and hence $\det DF(x)$ is either always negative or always positive. Multiplying both sides of (7.30) by $-1$ if $\det DF(x)$ is always negative, or otherwise leaving (7.30) as is, we see that
$$\int (f\circ F)\, |\det DF| = \int (f\circ A)\, |\det A|.$$
The right-hand side equals $\int f$, since the change of variables formula holds for affine transformations. This completes our proof.
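Since the affine case anchors this argument, here is a numerical spot-check of the affine change of variables formula in $\mathbb{R}^2$, with an arbitrarily chosen invertible matrix (our own illustration):

```python
import math

a11, a12, a21, a22 = 2.0, 1.0, 0.5, 1.5   # an invertible 2×2 matrix A
detA = a11 * a22 - a12 * a21              # = 2.5

f = lambda x, y: math.exp(-(x * x + y * y))  # ∫∫_{R²} f = π

# midpoint rule for ∫∫ (f∘A) |det A| du dv over a box big enough that the
# Gaussian tail outside it is negligible
n, L = 800, 8.0
h = 2 * L / n
acc = 0.0
for i in range(n):
    u = -L + (i + 0.5) * h
    for j in range(n):
        v = -L + (j + 0.5) * h
        acc += f(a11 * u + a12 * v, a21 * u + a22 * v)
rhs = acc * h * h * abs(detA)

assert abs(rhs - math.pi) < 5e-3  # agrees with ∫∫ f = π
```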
7.4.3. Proof of the change of variables formula. Let $F : \mathcal{U} \to \mathcal{V}$ be a smooth diffeomorphism between two open sets in $\mathbb{R}^n$. We need to prove that for any Lebesgue measurable function $f : \mathcal{V} \to \mathbb{R}$ that is either nonnegative or integrable, we have
$$\int_{\mathcal{U}} (f\circ F)\, |J_F| = \int_{\mathcal{V}} f.$$
Using the Principle of Appropriate Functions, all we have to do is prove this formula for characteristic functions of Lebesgue measurable sets (we'll let you check the other conditions in the Principle of Appropriate Functions). Thus, given a Lebesgue measurable set $A \subseteq \mathcal{V}$, we need to show that $\int_{\mathcal{U}} (\chi_A\circ F)\, |J_F| = \int_{\mathcal{V}} \chi_A$; that is,
$$\int_{\mathcal{U}} (\chi_A\circ F)\, |J_F| = m(A).$$
We prove this equality in two steps: First, we prove it when A is a Borel set. This is done in Lemma 7.19, whose proof uses Lemma 7.17. Second, in Lemma 7.20 we prove the change of variables formula for A Lebesgue measurable, using the fact that a Lebesgue measurable set is a union of a Borel set and a set of measure zero. This completes our proof. In order to use the special case of the change of variables formula in Lemma 7.17, we need to extend the domain of F , which is U, to all of Rn . To do so, we shall use the following . . . Lemma 7.18. Let I = (a, b) and I ′ = (a−δ, b+δ) be open boxes in Rn with δ > 0. Then there exists a diffeomorphism G : Rn → I ′ such that G(x) = x for all x ∈ I.
Proof: Here, recall that the notation $(a,b)$, where $a = (a_1, \dots, a_n)$, $b = (b_1, \dots, b_n) \in \mathbb{R}^n$, is shorthand for the box $(a_1,b_1)\times\cdots\times(a_n,b_n)$; for $\delta > 0$, we use the notation $(a-\delta, b+\delta)$ for the box $(a_1-\delta, b_1+\delta)\times\cdots\times(a_n-\delta, b_n+\delta)$. Now, observe that if this lemma holds for $n = 1$, then it holds for general $n$. Indeed, assuming the lemma holds for $n = 1$, for each $i = 1, 2, \dots, n$ there is a diffeomorphism $G_i : \mathbb{R} \to (a_i-\delta, b_i+\delta)$ such that $G_i(t) = t$ for all $t \in (a_i, b_i)$; then it's easy to check that $G : \mathbb{R}^n \to I'$ defined by
$$G(x) := \big(G_1(x_1), G_2(x_2), \dots, G_n(x_n)\big) \quad\text{for all } x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$$
is a diffeomorphism such that $G(x) = x$ for all $x \in I$. So now to the $n = 1$ case. By translation, we may assume that $I$ is centered at the origin, so that $I = (-a,a)$ and $I' = (-b,b)$ where $b = a + \delta$. The left-hand picture in Figure 7.7 shows what $G$ should look like. To construct $G$,
Figure 7.7. Left: $G : \mathbb{R} \to I'$ such that $G(x) = x$ for all $x \in I$. Right: $H : I' \to \mathbb{R}$ such that $H(x) = x$ for all $x \in I$.
let’s instead construct its inverse. To construct the inverse, recall from Lemma 5.2 that there is a smooth nondecreasing function ϕ : R → R that is zero for x ≤ a and one for x ≥ b. Define H : (−b, b) → R by π H(x) = x + ϕ(|x|) tan x . 2b The right-hand picture in Figure 7.7 shows a picture of H. Using the formula for H, one can check that H → ±∞ as x → ±b and H has a positive derivative (in fact, H ′ (x) ≥ 1 for all x). Thus, H is invertible with a smooth inverse G : R → (−b, b), which has the properties we want.
Lemma 7.19. We have
$$(7.31)\qquad \int (\chi_B\circ F)\, |J_F| = m(B)$$
for all Borel sets $B \subseteq \mathcal{V}$.

Proof: It's easy to check that for any set $B \subseteq \mathcal{V}$,
$$(7.32)\qquad \chi_B\circ F = \chi_{F^{-1}(B)}.$$
Since Borel sets are preserved under homeomorphisms (this is Proposition 1.15), it follows that for B a Borel set, χB ◦ F is Borel and hence Lebesgue measurable. Thus, the left-hand side of (7.31) is defined. Moreover, the left-hand side of (7.31) is a measure on the Borel sets in V (countable additivity follows from the
series MCT — Theorem 5.21), so by the Extension Theorem, to prove the equality (7.31) all we have to do is check it on left-hand open boxes. To do so, we break up the remaining proof into two steps.
Step 1: It turns out that checking (7.31) for certain "small" boxes is easy. We shall say that an open box $(a,b)$ is properly inside $\mathcal{U}$ if15 for some $\delta > 0$ we have $(a-\delta, b+\delta) \subseteq \mathcal{U}$. We shall call a box $J \subseteq \mathcal{V}$ $F$-small if there is an open box $I$ that is properly inside of $\mathcal{U}$ such that $J \subseteq F(I)$.
Figure 7.8 shows a picture of this situation.

Figure 7.8. $I \subseteq \mathcal{U}$ is properly inside $\mathcal{U}$ if you can enlarge $I$ a little, forming $I'$, and $I'$ is still in $\mathcal{U}$. A box $J \subseteq \mathcal{V}$ is $F$-small if $J \subseteq F(I)$ where $I$ is an open box properly inside of $\mathcal{U}$.

We claim that (7.31) holds for $F$-small left-hand open boxes. To prove this, let $J$ be such a box. Then there is an open box $I = (a,b) \subseteq \mathcal{U}$ and a $\delta > 0$ such that $I' := (a-\delta, b+\delta) \subseteq \mathcal{U}$ and $J \subseteq F(I)$. By Lemma 7.18 there is a diffeomorphism $G : \mathbb{R}^n \to I'$ such that $G(x) = x$ for all $x \in I$. Now, choosing $\varepsilon = 1/k$, $k = 1, 2, \dots$, in Proposition 5.3, we can find a sequence $\{\varphi_k\}$ of smooth functions supported on the interior of $J$ such that $0 \le \varphi_k \le 1$ and $1 = \lim \varphi_k(x)$ at each point $x$ in the interior of $J$. In particular, $\chi_J = \lim \varphi_k$ a.e. Applying Lemma 7.17 to the diffeomorphism $F\circ G : \mathbb{R}^n \to F(I')$ and the function $\varphi_k$, which is supported inside $J$, we see that
$$(7.33)\qquad \int (\varphi_k\circ F\circ G)\, |J_{F\circ G}| = \int \varphi_k.$$
Observe that because $\varphi_k$ vanishes outside of $J$, the function $\varphi_k\circ F$ vanishes outside of $I$; thus, the left-hand side of (7.33) is actually just an integral over $I$. Since $G$ is the identity function on $I$, it follows that
$$\int (\varphi_k\circ F)\, |J_F| = \int \varphi_k.$$
Taking $k \to \infty$ and noting that $\chi_J(x) = \lim \varphi_k(x)$ a.e., the Dominated Convergence Theorem implies that
$$\int (\chi_J\circ F)\, |J_F| = \int \chi_J = m(J).$$
This shows that (7.31) holds for $F$-small boxes.

15 Using elementary topology, one can prove that $I$ is properly inside $\mathcal{U}$ if and only if the closure of $I$ is a subset of $\mathcal{U}$.

Step 2: To finish our proof, we make the following Claim: We can write any left-hand open box in $\mathcal{V}$ as a countable union of pairwise disjoint $F$-small boxes. Once we prove this claim we're done: given any left-hand open box $J \subseteq \mathcal{V}$, we can write $J = \bigcup_{k=1}^\infty J_k$, where the $J_k$'s are $F$-small and pairwise disjoint, so as (7.31) holds for each $J_k$, by countable additivity it also holds for $J$. Now to prove our claim, let $J \subseteq \mathcal{V}$ be a left-hand open box and let $p \in J$. Then $F^{-1}(p) \in \mathcal{U}$, and since $\mathcal{U}$ is open there is an open box $I' \subseteq \mathcal{U}$ with $F^{-1}(p) \in I'$. By making $I'$
smaller, we get an open box $I \subseteq I'$ properly inside of $\mathcal{U}$. Since $F : \mathcal{U} \to \mathcal{V}$ is a homeomorphism, $F(I)$ is an open set, so there is a left-hand open box $J_p$ containing $p$ such that $J_p \subseteq F(I)$; moreover, we can take $J_p$ to have rational endpoints. To summarize: for each $p \in J$ there is an $F$-small left-hand open box $J_p$ with rational endpoints containing $p$ with $J_p \subseteq J$. It follows that
$$J = \bigcup_p J_p.$$
The right-hand side is a countable union (since the rationals are countable), so by the Fundamental Lemma of Semirings, we can write $J = \bigcup_k J_k$, a countable union, where the $J_k$'s are $F$-small pairwise disjoint left-hand open boxes. This completes our proof.
Lemma 7.20. We have
$$\int (\chi_A\circ F)\, |J_F| = m(A)$$
for all Lebesgue measurable sets $A \subseteq \mathcal{V}$.

Proof: Recalling the identity (7.32), we need to show that
$$\int_{\mathcal{U}} \chi_{F^{-1}(A)}\, |J_F| = m(A).$$
To prove this, we first of all need to know that $F^{-1}(A)$ is Lebesgue measurable (so that the left-hand integral is defined). To prove this, recall by Corollary 4.8 that $A = B\cup N$, where $B$ is a Borel set and $N$ has measure zero. Thus,
$$F^{-1}(A) = F^{-1}(B)\cup F^{-1}(N).$$
Since homeomorphisms preserve Borel sets, we know that $F^{-1}(B)$ is a Borel, hence Lebesgue measurable, set. To see that $F^{-1}(N)$ is also measurable, we use regularity on $N$ to find a Borel set $N' \subseteq \mathcal{V}$ with $N \subseteq N'$ and $m(N') = m(N) = 0$. The previous lemma implies that
$$0 = m(N') = \int \chi_{F^{-1}(N')}\, |J_F| = \int_{F^{-1}(N')} |J_F|,$$
which implies that $m(F^{-1}(N')) = 0$ by Property 5 of Theorem 5.20, since $|J_F| > 0$. As $F^{-1}(N) \subseteq F^{-1}(N')$ and Lebesgue measure is complete, it follows that $F^{-1}(N)$ is also Lebesgue measurable with measure zero. Thus, $F^{-1}(A) = F^{-1}(B)\cup F^{-1}(N)$ is measurable. Since $F^{-1}(N)$ has measure zero, we have $\chi_{F^{-1}(A)} = \chi_{F^{-1}(B)}$ a.e., so
$$\int \chi_{F^{-1}(A)}\, |J_F| = \int \chi_{F^{-1}(B)}\, |J_F|.$$
By our previous lemma, the right-hand side equals $m(B)$, and since $N$ has measure zero, we have $m(B) = m(A)$. This proves that
$$\int \chi_{F^{-1}(A)}\, |J_F| = m(A)$$
and concludes our proof.
Note that in the proof of this lemma, we showed that if A ⊆ V is Lebesgue measurable, then F −1 (A) is also Lebesgue measurable; replacing F by F −1 , we conclude: A smooth diffeomorphism takes Lebesgue measurable sets to Lebesgue measurable sets.
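With the theorem fully proved, here is a quick numerical spot-check on a non-affine smooth diffeomorphism — a shear of our own choosing, with $J_F \equiv 1$:

```python
import math

# F(u, v) = (u + sin v, v) is a smooth diffeomorphism of R² (inverse
# u = x − sin y) with DF = [[1, cos v], [0, 1]], so J_F ≡ 1.
# Theorem 7.14 then predicts ∫ (f∘F) |J_F| = ∫ f.
f = lambda x, y: math.exp(-(x * x + y * y))  # ∫∫_{R²} f = π

n, L = 700, 7.0   # midpoint grid on [−L, L]²; the Gaussian tail is negligible
h = 2 * L / n
acc = 0.0
for i in range(n):
    u = -L + (i + 0.5) * h
    for j in range(n):
        v = -L + (j + 0.5) * h
        acc += f(u + math.sin(v), v)   # (f∘F)(u, v), times |J_F| = 1
lhs = acc * h * h

assert abs(lhs - math.pi) < 2e-3  # matches ∫∫ f = π
```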
◮ Exercises 7.4.
1. Let $F : \mathbb{R}^2 \to \mathbb{R}^2$ be the "polar coordinates map" $F(r,\theta) = (r\cos\theta, r\sin\theta)$.
(a) Show that $F : (0,\infty)\times(0,2\pi) \to \mathbb{R}^2$ is a diffeomorphism onto its image, which consists of $\mathbb{R}^2$ minus the nonnegative $x$-axis.
(b) Using the change of variables formula, prove that for any integrable function $f$ on $\mathbb{R}^2$, we have
$$\int_{\mathbb{R}^2} f = \int_0^\infty\!\!\int_0^{2\pi} f(r\cos\theta, r\sin\theta)\, r\, d\theta\, dr = \int_0^{2\pi}\!\!\int_0^\infty f(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta,$$
where $x = r\cos\theta$ and $y = r\sin\theta$. We shall generalize this formula to $\mathbb{R}^n$ in Problem 4 and in Section 7.6.
2. Let $f : [a,b] \to (0,\infty)$ be a Lebesgue measurable function with $[a,b] \subseteq [0,2\pi)$. Let $F : \mathbb{R}^2 \to \mathbb{R}^2$ denote the polar coordinates map and let $A = \{F(r,\theta)\,;\, a \le \theta \le b,\ 0 \le r \le f(\theta)\}$ be the set of points in the "polar sector described by $f$".
(i) Show that $A \subseteq \mathbb{R}^2$ is Lebesgue measurable.
(ii) Using the polar coordinates change of variables formula, derive the familiar result from elementary calculus:
$$m(A) = \frac{1}{2}\int_a^b [f(\theta)]^2\, d\theta.$$
(iii) (Archimedes' spiral) Archimedes of Syracuse (287 BC–212 BC), in his work On spirals [5, p. 151], considered the spiral $f(\theta) = a\theta$, where $a > 0$, and said that the area of the "polar sector described by $f$" equals one-third the area of the circle whose radius is $f(2\pi) = 2\pi a$. Prove this.
3. In this problem we give the most famous proof that $\int_{\mathbb{R}} e^{-x^2}\, dx = \sqrt{\pi}$, a proof probably due to Siméon-Denis Poisson (1781–1840) [279, pp. 18–19].
(i) Switching to polar coordinates, prove that
$$\int_{\mathbb{R}^2} e^{-(x^2+y^2)}\, dx\, dy = \pi.$$
(ii) Use Fubini's theorem on the left side of this equation to derive the result.
4. (Polar coordinates in $\mathbb{R}^n$) This problem is highly recommended! For $n \ge 2$, we shall define the polar coordinates map16 $F : \mathbb{R}^n \to \mathbb{R}^n$ by $F(r, \theta_1, \theta_2, \dots, \theta_{n-1}) = (x_1, x_2, \dots, x_n)$, where
$$\begin{aligned}
x_1 &= r\cos\theta_{n-1}\cos\theta_{n-2}\cdots\cos\theta_4\cos\theta_3\cos\theta_2\cos\theta_1\\
x_2 &= r\cos\theta_{n-1}\cos\theta_{n-2}\cdots\cos\theta_4\cos\theta_3\cos\theta_2\sin\theta_1\\
x_3 &= r\cos\theta_{n-1}\cos\theta_{n-2}\cdots\cos\theta_4\cos\theta_3\sin\theta_2\\
x_4 &= r\cos\theta_{n-1}\cos\theta_{n-2}\cdots\cos\theta_4\sin\theta_3\\
&\;\;\vdots\\
x_n &= r\sin\theta_{n-1}.
\end{aligned}$$
When $n = 2$, this is the standard polar coordinate system, and when $n = 3$, this is the geographic spherical coordinate system;17 see Figure 7.9.
(i) Here's one way to understand polar coordinates: prove that if we denote by $F_n$ polar coordinates in $\mathbb{R}^n$, then polar coordinates in $\mathbb{R}^{n+1} = \mathbb{R}^n\times\mathbb{R}^1$ are given by
$$F_{n+1}(r, \theta_1, \dots, \theta_{n-1}, \theta_n) = \big(F_n(r\cos\theta_n, \theta_1, \dots, \theta_{n-1}),\ r\sin\theta_n\big).$$

16 Also called hyperspherical coordinates. For general $n$, there is no universal convention for hyperspherical coordinates. These are some coordinates we thought about one evening.
17 "Geographic" because $\theta_2$ represents latitude on the earth — the usual convention in elementary calculus is to use the zenith angle $= \frac{\pi}{2} - \theta_2$ instead of the latitude angle.
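Part (i)'s recursion is easy to implement and test. The sketch below takes the recursion as the definition (base case $n = 2$) and checks it against the explicit $n = 3$ formulas and the norm property of part (ii):

```python
import math

def F(n, r, thetas):
    # the hyperspherical map above; thetas = (θ1, …, θ_{n−1})
    if n == 2:
        return (r * math.cos(thetas[0]), r * math.sin(thetas[0]))
    # part (i): F_n(r, θ1,…,θ_{n−1}) = (F_{n−1}(r cos θ_{n−1}, θ1,…,θ_{n−2}), r sin θ_{n−1})
    return F(n - 1, r * math.cos(thetas[-1]), thetas[:-1]) + (r * math.sin(thetas[-1]),)

# n = 3 agrees with the explicit geographic-coordinate formulas
r, t1, t2 = 1.5, 0.2, 0.7
x1, x2, x3 = F(3, r, (t1, t2))
assert abs(x1 - r * math.cos(t2) * math.cos(t1)) < 1e-12
assert abs(x2 - r * math.cos(t2) * math.sin(t1)) < 1e-12
assert abs(x3 - r * math.sin(t2)) < 1e-12

# part (ii): ‖F(r, θ)‖ = r in any dimension
x = F(5, 2.0, (0.3, 1.1, -0.4, 0.8))
assert abs(math.sqrt(sum(c * c for c in x)) - 2.0) < 1e-12
```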
Figure 7.9. Polar coordinates in $\mathbb{R}^2$ ($x_1 = r\cos\theta_1$, $x_2 = r\sin\theta_1$) and in $\mathbb{R}^3$ ($x_1 = r\cos\theta_2\cos\theta_1$, $x_2 = r\cos\theta_2\sin\theta_1$, $x_3 = r\sin\theta_2$).

Figure 7.10. Polar coordinates for $(x', x_n) \in \mathbb{R}^{n+1} = \mathbb{R}^n\times\mathbb{R}$ are obtained by writing $x_n = r\sin\theta_n$ and writing $x'$ in polar coordinates in $\mathbb{R}^n$ with radius $r\cos\theta_n$.
See Figure 7.10 for an interpretation of this formula. For the rest of this problem we don't write the subscript "$n$" in $F_n$.
(ii) Here's another way to understand polar coordinates: show that the distance of $F(r, \theta_1, \dots, \theta_{n-1})$ from the origin is $r$, just as you would think (that is, show that $\|F(r, \theta_1, \dots, \theta_{n-1})\| = r$, where for any $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, $\|x\| = \sqrt{x_1^2 + \cdots + x_n^2}$); then prove that
$$F(r, \theta_1, \dots, \theta_{n-1}) = r\omega \quad\text{where}\quad \omega = F(1, \theta_1, \dots, \theta_{n-1}) \in S^{n-1},$$
so $F(r, \theta_1, \dots, \theta_{n-1})$ is simply $r$ times a point on $S^{n-1}$ (namely $\omega$, which has the coordinates $(\theta_1, \dots, \theta_{n-1})$).
(iii) Prove that
$$J_F = r^{n-1}\cos\theta_2\cos^2\theta_3\cos^3\theta_4\cdots\cos^{n-2}\theta_{n-1}.$$
(For example, when $n = 2$, $J_F = r$, and when $n = 3$, $J_F = r^2\cos\theta_2$.) Suggestion: To make the computation simpler, put $t_i = \tan\theta_i$; then prove that
$$J_F = \det\begin{pmatrix}
\frac{x_1}{r} & -x_1 t_1 & -x_1 t_2 & \cdots & -x_1 t_{n-1}\\
\frac{x_2}{r} & \frac{x_2}{t_1} & -x_2 t_2 & \cdots & -x_2 t_{n-1}\\
\frac{x_3}{r} & 0 & \frac{x_3}{t_2} & \cdots & -x_3 t_{n-1}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\frac{x_n}{r} & 0 & 0 & \cdots & \frac{x_n}{t_{n-1}}
\end{pmatrix}
= \frac{1}{r}\, x_1 x_2\cdots x_n\, t_1 t_2\cdots t_{n-1}\,
\det\begin{pmatrix}
1 & -1 & -1 & \cdots & -1\\
1 & \frac{1}{t_1^2} & -1 & \cdots & -1\\
1 & 0 & \frac{1}{t_2^2} & \cdots & -1\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & 0 & 0 & \cdots & \frac{1}{t_{n-1}^2}
\end{pmatrix}.$$
Show that the determinant equals $\big(1 + \frac{1}{t_1^2}\big)\cdots\big(1 + \frac{1}{t_{n-1}^2}\big)$ (try to use elementary row/column operations).
(iv) If $\mathcal{U} = (0,\infty)\times(0,2\pi)\times(-\frac{\pi}{2}, \frac{\pi}{2})^{n-2}$, prove that $F : \mathcal{U} \to F(\mathcal{U})$ is a smooth diffeomorphism, and that $m(\mathbb{R}^n\setminus F(\mathcal{U})) = 0$.
(v) If $f$ is an integrable function on $\mathbb{R}^n$, prove that
$$\int_{\mathbb{R}^n} f = \int_0^\infty\!\left(\int f(r\omega)\, d\sigma\right) r^{n-1}\, dr,$$
where
$$\int f(r\omega)\, d\sigma := \int f(r\omega)\,\cos\theta_2\cos^2\theta_3\cdots\cos^{n-2}\theta_{n-1}\, d\theta_1\cdots d\theta_{n-1},$$
with $\omega = F(1, \theta_1, \dots, \theta_{n-1}) \in S^{n-1}$, and where the range of integration is $0 < \theta_1 < 2\pi$ and $-\pi/2 < \theta_i < \pi/2$ for $i = 2, \dots, n-1$.
(vi) (Volume of balls II) Find the volume of $B^n(r) = \{x \in \mathbb{R}^n\,;\, |x| \le r\}$ by using $f = \chi_{B^n(r)}$ in the integral $\int_{\mathbb{R}^n} f = \int_0^\infty \big(\int f(r\omega)\, d\sigma\big)\, r^{n-1}\, dr$. You will need the formula
$$\int_0^{\pi/2} \cos^n\theta\, d\theta = \frac{\sqrt{\pi}}{2}\,\frac{\Gamma\big(\frac{n+1}{2}\big)}{\Gamma\big(\frac{n}{2}+1\big)},$$
where $\Gamma$ denotes the Gamma function; you can deduce this formula from Problem 2 of Exercises 6.1.
5. Is there a change of variables formula for general measure spaces? There is, in a certain sense. Let $(X, \mathscr{S}, \mu)$ be a measure space and let $(Y, \mathscr{T})$ be a measurable space, which means that $Y$ is a set and $\mathscr{T}$ is a σ-algebra of subsets of $Y$.
(i) Let $F : X \to Y$ be a measurable map, which means that $F^{-1}(A) \in \mathscr{S}$ for all $A \in \mathscr{T}$. We define the pushforward $F_*\mu$ of $\mu$ under $F$ by $(F_*\mu)(A) = \mu(F^{-1}(A))$ for all $A \in \mathscr{T}$. Show that $F_*\mu : \mathscr{T} \to [0,\infty]$ is a measure.
(ii) Prove that for any nonnegative measurable function $f$ on $(Y, \mathscr{T}, F_*\mu)$, the following "change of variables formula" holds:
$$(7.34)\qquad \int_Y f\, d(F_*\mu) = \int_X (f\circ F)\, d\mu.$$
(iii) If $f$ is not necessarily nonnegative, prove that $f$ is $F_*\mu$-integrable if and only if $f\circ F$ is $\mu$-integrable, in which case the formula (7.34) holds.
(iv) As a specific example, let $T : \mathbb{R}^n \to \mathbb{R}^n$ be an invertible linear transformation. Prove that $T_*(m) = |\det T|^{-1}\, m$, where $m$ is Lebesgue or Borel measure.
6. (Determinant identities) In this problem we prove the determinant identities in Simader's theorem. We denote by $A_{ij}$ the $(i,j)$-th cofactor of $DF_t = [\partial y_i/\partial x_j]$. For any $i, k = 1, \dots, n$, prove the following identities:
$$(7.35)\qquad \sum_{j=1}^n \frac{\partial}{\partial x_j} A_{ij} = 0 \quad\text{and}\quad \sum_{j=1}^n \frac{\partial y_k}{\partial x_j} A_{ij} = \begin{cases} J_{F_t}, & \text{if } k = i;\\ 0, & \text{if } k \ne i. \end{cases}$$
To get you started, consider the first identity for $i = 1$. For $j = 1, \dots, n$, let $c_j$ be the $j$-th column of the matrix obtained by deleting the first row of $DF_t$; thus, the $k$-th entry ($k = 1, \dots, n-1$) in $c_j$ is $\partial y_{k+1}/\partial x_j$. By definition of cofactor, we have $A_{1j} = (-1)^{j+1}\det(c_1, \dots, \hat{c}_j, \dots, c_n)$, where the determinant is the determinant of the matrix whose columns are $c_1, c_2, \dots, c_{j-1}, c_{j+1}, \dots, c_n$ — the hat over $c_j$ means to omit this column. Use the "product rule for determinants," as we did in the proof of Simader's theorem, to find $\frac{\partial}{\partial x_j} A_{1j}$. Using the equality of mixed partial derivatives and the fact that the determinant changes sign whenever two columns are switched, prove the first identity in (7.35). To prove the second identity in (7.35), recall that given any matrix $B = [b_{ij}]$ and any $i = 1, \dots, n$, the determinant $\det B$ can be computed by "expanding about
the $i$-th row," which means that if $B_{ij}$ denotes the $(i,j)$-th cofactor of $B$, then
$$\det B = \sum_{j=1}^n b_{ij} B_{ij}.$$
7. Here’s (a slight modification of) Herbert Leinfelder and Christian Simader’s proof of the change of variables formula [174]. Let F : U → V be a smooth diffeomorphism between open subsets of Rn . (i) We begin with the following lemma: For each p ∈ V there is an open box Bp centered at p such that for each smooth function f : V → R with compact support in Bp , we have Z Z (f ◦ F ) |JF | = f.
Assume this lemma for the moment and then prove Theorem 7.14 for $F$ assumed smooth. Suggestion: One way to go about it is to proceed as in the proof presented in the text, which reduces the problem to proving the change of variables formula for characteristic functions of left-hand open boxes in $\mathcal{V}$. To prove the change of variables formula for characteristic functions of left-hand open boxes in $\mathcal{V}$, first prove it for "small boxes," which is the name we'll give to left-hand open boxes $I \subseteq \mathcal{V}$ such that $I \subseteq B_p$ for some open box $B_p$ described in the lemma above. Then prove that any left-hand open box in $\mathcal{V}$ can be written as a union of pairwise disjoint "small boxes".
(c) Let ε > 0 be arbitrary. Prove that C can be divided into closed cubes C_1, . . . , C_m with disjoint interiors and centers x_1, . . . , x_m, such that for each k = 1, . . . , m,
\[ \max_{x \in C_k} \big( |DF(x_k)^{-1} DF(x)| \big)^n - 1 < \varepsilon, \qquad \max_{x \in C_k} |J_F(x_k) - J_F(x)| < \varepsilon. \tag{7.39} \]
Suggestion: Use the fact that a continuous function on a compact set is uniformly continuous.
(d) Apply (7.38) with T = DF(x_k) for k = 1, . . . , m, add them together, and then use (7.39) to prove that
\[ m(F(C)) \leq \int_C |J_F(x)|\, dx + E, \]
where |E| ≤ Cε, with C > 0 a constant that depends only on m(C) and max{|J_F(x)| ; x ∈ C}. Conclude that
\[ m(F(C)) \leq \int_C |J_F(x)|\, dx. \]
(e) Prove that for any nonnegative measurable function f on V, we have
\[ \int_V f\, dx \leq \int_U (f \circ F)\, |J_F|\, dx. \tag{7.40} \]
(f) Apply (7.40) to F^{-1} to prove that
\[ \int_V f\, dx \leq \int_U (f \circ F)\, |J_F|\, dx \leq \int_V \big( (f \circ F) \circ F^{-1} \big)\, \big( |J_F| \circ F^{-1} \big)\, |J_{F^{-1}}|\, dx. \]
Show that the right-hand side is just \int_V f\, dx, concluding that
\[ \int_V f\, dx = \int_U (f \circ F)\, |J_F|\, dx. \]
(g) Finally, conclude that the change of variables formula holds for any integrable function on V.
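As a numerical illustration of the formula in (g) (ours, not part of the problem), one can Monte Carlo both sides of \int_V f = \int_U (f ∘ F) |J_F| for a sample diffeomorphism. Here we take the polar map F(r, t) = (r cos t, r sin t) from U = (0, 1) × (0, π/2) onto the open quarter disk V, with |J_F| = r, and f(x, y) = x + y; the exact common value is 2/3. All names in the sketch are ours.

```python
# Sketch (not from the text): Monte Carlo check of
#     int_V f = int_U (f o F) |J_F|
# for F(r, t) = (r cos t, r sin t), U = (0,1) x (0, pi/2),
# V = open quarter disk, |J_F| = r, f(x, y) = x + y.  Exact value: 2/3.
import math
import random

random.seed(1)
N = 200_000

def f(x, y):
    return x + y

# Right-hand side: average of (f o F) * |J_F| over U, times area(U) = pi/2.
acc = 0.0
for _ in range(N):
    r = random.random()
    t = random.uniform(0.0, math.pi / 2)
    x, y = r * math.cos(t), r * math.sin(t)   # F(r, t)
    acc += f(x, y) * r                        # |J_F| = r
rhs = (math.pi / 2) * acc / N

# Left-hand side: average of f over [0,1]^2 restricted to the quarter disk
# (the bounding square has area 1, so no rescaling is needed).
acc = 0.0
for _ in range(N):
    x, y = random.random(), random.random()
    if x * x + y * y < 1.0:
        acc += f(x, y)
lhs = acc / N

assert abs(rhs - 2 / 3) < 0.02 and abs(lhs - 2 / 3) < 0.02
```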
7.5. Some applications of change of variables

Using the change of variables formula and Simader's Theorem we prove some interesting results such as the Fundamental Theorem of Algebra and the Brouwer fixed point theorem. However, before getting to these results, we give . . .

7.5.1. Yet another proof of Euler's formula for π²/6. The earliest published version of the following proof seems to be from 1956, in William LeVeque's (1923–2007) book [176, p. 122], although LeVeque doesn't take credit for the proof, as stated in [3, p. 35], the book from which we learned this proof. The trick is to evaluate the integral
\[ I := \int_0^1 \int_0^1 \frac{1}{1 - xy}\, dx\, dy \]
in two ways. First, we expand the integrand in a geometric series:
\[ \frac{1}{1 - xy} = \sum_{k=0}^{\infty} (xy)^k = \sum_{k=0}^{\infty} x^k y^k. \]
Applying the series Monotone Convergence Theorem, we see that
\[ I = \sum_{k=0}^{\infty} \int_0^1 \int_0^1 x^k y^k\, dx\, dy. \]
Since \int_0^1 x^k\, dx = 1/(k+1) = \int_0^1 y^k\, dy, we see that
\[ \int_0^1 \int_0^1 x^k y^k\, dx\, dy = \frac{1}{(k+1)^2}. \]
Thus,
\[ I = \sum_{k=0}^{\infty} \frac{1}{(k+1)^2} = \sum_{n=1}^{\infty} \frac{1}{n^2}. \]
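A quick numerical sanity check of this first evaluation (ours, not the text's): the partial sums of \sum 1/(k+1)^2 approach π²/6, and midpoint quadrature confirms the building block \int_0^1 x^k\, dx = 1/(k+1) for, say, k = 3.

```python
# Sketch: numerically confirm  sum_{k>=0} 1/(k+1)^2 = pi^2/6  and
# the building block  int_0^1 x^k dx = 1/(k+1)  (midpoint rule, k = 3).
import math

s = sum(1.0 / (k + 1) ** 2 for k in range(1_000_000))
# The tail of the series beyond N terms is about 1/N, here ~1e-6.
assert abs(s - math.pi ** 2 / 6) < 1e-5

n, k = 10_000, 3
q = sum(((j + 0.5) / n) ** k for j in range(n)) / n  # midpoint quadrature
assert abs(q - 1.0 / (k + 1)) < 1e-6
```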
Second, we evaluate the integral via the change of variables formula. Let U be the rotated open square on the left-hand side of Figure 7.11, let V be the open square (0, 1) × (0, 1), and let F : U → V be the transformation
\[ x = u - v \quad \text{and} \quad y = u + v. \]
7. FUBINI’S THEOREM AND CHANGE OF VARIABLES
Figure 7.11. Left: The vertices of the square U are (0, 0), (1/2, −1/2), (1, 0), (1/2, 1/2). Right: V = (0, 1) × (0, 1).

Figure 7.12. Breaking up the integral as a sum \int_{U_1} + \int_{U_2}; here U_1 is the part of the upper half of U with u ≤ 1/2 (below the line v = u) and U_2 the part with u ≥ 1/2 (below the line v = 1 − u).
(That F indeed maps U onto V is left as an exercise.) One can check that J_F = 2, and
\[ \frac{1}{1 - xy} \circ F(u, v) = \frac{1}{1 - (u - v)(u + v)} = \frac{1}{1 - u^2 + v^2}. \]
Thus, by the change of variables formula,
\[ I = 2 \int_U \frac{1}{1 - u^2 + v^2}\, du\, dv. \]
The integrand is an even function of v, so it follows that
\[ I = 4 \int_{\text{upper part of } U} \frac{1}{1 - u^2 + v^2}\, du\, dv. \]
Now, as shown in Figure 7.12, we can break up the integral over the upper part of U into integrals over U_1 and U_2:
\[ I = 4 \int_{U_1} \frac{du\, dv}{1 - u^2 + v^2} + 4 \int_{U_2} \frac{du\, dv}{1 - u^2 + v^2} = 4 \int_0^{1/2} \int_0^{u} \frac{dv\, du}{1 - u^2 + v^2} + 4 \int_{1/2}^{1} \int_0^{1-u} \frac{dv\, du}{1 - u^2 + v^2}. \]
Recalling the elementary integral identity \int \frac{dv}{a^2 + v^2} = \frac{1}{a} \tan^{-1}(v/a) + C, we see that
\[ \int \frac{dv}{1 - u^2 + v^2} = \frac{1}{\sqrt{1 - u^2}} \tan^{-1} \frac{v}{\sqrt{1 - u^2}} + C. \]
Therefore,
\[ I = 4 \int_0^{1/2} \tan^{-1}\!\Big( \frac{u}{\sqrt{1 - u^2}} \Big) \frac{du}{\sqrt{1 - u^2}} + 4 \int_{1/2}^{1} \tan^{-1}\!\Big( \frac{1 - u}{\sqrt{1 - u^2}} \Big) \frac{du}{\sqrt{1 - u^2}}. \]
We now let r = \tan^{-1} \frac{u}{\sqrt{1 - u^2}} in the first integral and s = \tan^{-1} \frac{1 - u}{\sqrt{1 - u^2}} in the second integral; then after some algebra we get
\[ dr = \frac{du}{\sqrt{1 - u^2}} \quad \text{and} \quad ds = -\frac{1}{2} \frac{du}{\sqrt{1 - u^2}}. \]
Making these substitutions and being careful with the limits of integration, we get
\[ I = 4 \int_0^{\pi/6} r\, dr + 4 \int_{\pi/6}^{0} (-2s)\, ds = 2\, r^2 \Big|_0^{\pi/6} - 4\, s^2 \Big|_{\pi/6}^{0} = 2 \cdot \Big( \frac{\pi}{6} \Big)^2 + 4 \cdot \Big( \frac{\pi}{6} \Big)^2 = \frac{\pi^2}{6}. \]
This shows that I = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}, as we wanted to show. See Problems 1 and 2 for other ingenious proofs of Euler's formula using Fubini + Change of Variables.

7.5.2. The Vanishing Theorem and the FTA. Theorem 7.21 below says that if you are given a smooth family of functions
\[ F_t : B \to R^n, \]
where B ⊆ R^n is an open ball of positive (even infinite) radius and t ∈ [0, 1], then certain positivity conditions on F_t (assumptions (1) and (2)) tell us that if F_0 vanishes at the origin, then F_1 must vanish at some point x ∈ B. For this reason, we call this theorem the . . .

Vanishing theorem

Theorem 7.21. Let B ⊆ R^n be an open ball centered at the origin of radius r, with 0 < r ≤ ∞, and let F_t : B → R^n, where t ∈ [0, 1], be a smooth family of functions such that for some ε > 0,
(1) J_{F_0}(x) > 0 for all x ∈ B with 0 < ‖x‖ < ε;
(2) ‖F_t(x)‖ > ε for all t ∈ [0, 1] and x ∈ B with ‖x‖ ≥ ρ, for some 0 < ρ < r.
Then F_0(0) = 0 implies there is an x ∈ B such that F_1(x) = 0.

Proof: Here, we recall (see Section ?? in the Appendix) that the notation "‖ ‖" represents the (Euclidean) norm: given x ∈ R^n, the number ‖x‖ is the distance of x from the origin,
\[ \|x\| := \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}. \]
Figure 7.13 gives a picture of this theorem.

Figure 7.13. The big disk is the open ball B, and the middle and small disks are the balls of radii ρ and ε, respectively. Information about J_{F_0} near the origin and F_t(x) away from the origin tells us that if F_0(0) = 0, then F_1(x) must vanish at some point x ∈ B.

Assume (1) and (2) and F_0(0) = 0, and suppose, by way of contradiction, that F_1(x) ≠ 0 for all x ∈ B. Then, since a continuous function achieves a minimum on a compact set, ‖F_1(x)‖ attains a minimum on the closed ball of radius ρ (where ρ is given in (2)); this minimum is some positive number since by assumption F_1(x) is never zero. Hence, by
choosing ε > 0 smaller if necessary we may assume that ‖F_1(x)‖ > ε for all x with ‖x‖ ≤ ρ. By assumption (2), we also have ‖F_1(x)‖ > ε for all x ∈ B with ‖x‖ ≥ ρ, so we have ‖F_1(x)‖ > ε for all x ∈ B. Let f : R^n → [0, ∞) be a smooth function supported inside the open ball of radius ε and strictly positive near the origin; in particular, f(x) = 0 for all x with ‖x‖ ≥ ε. Then by assumption (2) we have f ∘ F_t(x) = 0 for all t ∈ [0, 1] and all x ∈ B with ‖x‖ ≥ ρ. Thus by Simader's Theorem (Theorem 7.16), the integral \int (f ∘ F_t)\, J_{F_t} is constant in t ∈ [0, 1]. In particular,
\[ \int (f \circ F_1)\, J_{F_1} = \int (f \circ F_0)\, J_{F_0}. \tag{7.41} \]
Since ‖F_1(x)‖ > ε for all x ∈ B and f is supported in the ball of radius ε, f ∘ F_1 is identically zero, so the left-hand side of (7.41) is zero. On the other hand, since f is nonnegative and strictly positive near the origin, by assumption (1) and the assumption F_0(0) = 0, it follows that (f ∘ F_0) J_{F_0} is nonnegative and strictly positive near (but not necessarily at) the origin. In particular, the right-hand side of (7.41) must be positive. This gives a contradiction, and proves the theorem.
We now use the Vanishing Theorem to give a proof of the FTA found in [187]. The proof is different from the one in Section 6.1.3. Let
\[ p(z) = z^n + a_{n-1} z^{n-1} + a_{n-2} z^{n-2} + \cdots + a_1 z + a_0 \]
be a polynomial of degree n ≥ 1 with complex coefficients; we shall prove that p(z) has a zero. (Here we assume, without loss of generality, that the leading coefficient of p is one.) To do so, the idea is to deform p(z) to the monomial z^n. To this end, write p(z) = z^n + q(z), where q(z) = a_{n-1} z^{n-1} + \cdots + a_0 is a polynomial of degree at most n − 1. For t ∈ [0, 1], we define F_t : C → C by
\[ F_t(z) = z^n + t\, q(z) \quad \text{for all } z \in C. \]
In particular, F_1(z) = p(z) and F_0(z) = z^n; thus, F_t deforms p(z) to z^n. We now identify the plane R² with C in the usual fashion:
\[ R^2 \ni (x_1, x_2) \longleftrightarrow x_1 + i x_2 \in C. \]
Note that the norm of (x_1, x_2) ∈ R² is \sqrt{x_1^2 + x_2^2}, and the absolute value of the complex number x_1 + i x_2 is, by definition, also \sqrt{x_1^2 + x_2^2}; thus,
\[ \|(x_1, x_2)\| = |x_1 + i x_2|. \tag{7.42} \]
It's useful to go back and forth between C and R² using the identification (7.42), and in fact, as time goes on, this identification becomes second nature.¹⁸ Under the identification of C with R², we can consider F_t : R² → R². Since F_0(z) = z^n, we have F_0(0) = 0, and in Lemma 7.22 below we shall prove that conditions (1) and (2) of the Vanishing Theorem hold. Hence, there is an x ∈ R² such that F_1(x) = 0; that is, there is a z ∈ C with p(z) = 0, which proves the FTA!

¹⁸ Actually, many mathematicians define C to be R² and the absolute value of a complex number to be the norm, so such an identification is actually a tautology!
Lemma 7.22. There are ρ, ε > 0 such that ‖F_t(x)‖ ≥ ε for all t ∈ [0, 1] and x ∈ R² with ‖x‖ ≥ ρ. Also, J_{F_0}(x) > 0 for all x ∈ R² \ {0}.

Proof: Identifying x = (x_1, x_2) with z = x_1 + i x_2 and recalling that the norm of an element of R² equals the absolute value of the corresponding complex number, we have
\[ \|F_t(x)\| = |z^n + t q(z)| = |z|^n \Big| 1 + t \frac{q(z)}{z^n} \Big| = |z|^n \Big| 1 + t \Big( \frac{a_{n-1}}{z} + \frac{a_{n-2}}{z^2} + \cdots + \frac{a_0}{z^n} \Big) \Big|. \]
For each k, because 1/|z|^k → 0 as |z| → ∞, there is a ρ_k > 0 such that for all z ∈ C with |z| ≥ ρ_k,
\[ \Big| \frac{a_{n-k}}{z^k} \Big| \leq \frac{1}{2n}. \]
Hence, if ρ is the largest of the ρ_k's, then for |z| ≥ ρ and t ∈ [0, 1],
\[ \Big| t \frac{q(z)}{z^n} \Big| = \Big| t \frac{a_{n-1}}{z} + t \frac{a_{n-2}}{z^2} + \cdots + t \frac{a_0}{z^n} \Big| \leq \frac{1}{2n} + \frac{1}{2n} + \cdots + \frac{1}{2n} = \frac{1}{2}. \]
Thus, for |z| ≥ ρ and t ∈ [0, 1], by the triangle inequality it follows that¹⁹
\[ \Big| 1 + t \frac{q(z)}{z^n} \Big| \geq \frac{1}{2}. \]
Thus, for |z| ≥ ρ and t ∈ [0, 1],
\[ \|F_t(x)\| = |z|^n \Big| 1 + t \frac{q(z)}{z^n} \Big| \geq \rho^n \cdot \frac{1}{2} = \varepsilon, \]
where ε = ρ^n/2. We now prove that J_{F_0}(x) > 0 for x ∈ R² \ {0}. Before doing so, we first prove the following claim: If P(z) is a polynomial with complex coefficients and we consider P : R² → R² using the identification of R² with C, its Jacobian is given by J_P = |∂_{x_1} P|². To prove this claim, write
\[ P(x) = P_1(x) + i P_2(x), \]
which we identify with (P_1(x), P_2(x)), where P_1 and P_2 are the real and imaginary parts of P(x). Then by definition of the Jacobian, we have
\[ J_P = \det \begin{pmatrix} \partial_{x_1} P_1 & \partial_{x_2} P_1 \\ \partial_{x_1} P_2 & \partial_{x_2} P_2 \end{pmatrix} = \partial_{x_1} P_1\, \partial_{x_2} P_2 - \partial_{x_2} P_1\, \partial_{x_1} P_2. \]
Now by the Cauchy–Riemann equations
\[ \partial_{x_2} P_1 = -\partial_{x_1} P_2 \quad \text{and} \quad \partial_{x_2} P_2 = \partial_{x_1} P_1 \]
found in Problem 6, we conclude that
\[ J_P = \partial_{x_1} P_1\, \partial_{x_2} P_2 - \partial_{x_2} P_1\, \partial_{x_1} P_2 = (\partial_{x_1} P_1)(\partial_{x_1} P_1) - (-\partial_{x_1} P_2)(\partial_{x_1} P_2) = (\partial_{x_1} P_1)^2 + (\partial_{x_1} P_2)^2 = |\partial_{x_1} P|^2, \]
¹⁹ Indeed, we have 1 = |1| = |1 + t q(z)/z^n − t q(z)/z^n| ≤ |1 + t q(z)/z^n| + |t q(z)/z^n| ≤ |1 + t q(z)/z^n| + 1/2. Subtracting 1/2 from both sides, we get 1/2 ≤ |1 + t q(z)/z^n|.
and our claim is proved. We can now finish the proof. Noting that F_0(z) = z^n and setting z = x_1 + i x_2, we have F_0 = (x_1 + i x_2)^n. Therefore, by the chain rule it follows that
\[ \partial_{x_1} F_0 = n (x_1 + i x_2)^{n-1}. \]
Finally, since |x_1 + i x_2|² = x_1² + x_2², our claim gives
\[ J_{F_0} = |\partial_{x_1} F_0|^2 = n^2 |(x_1 + i x_2)^{n-1}|^2 = n^2 (x_1^2 + x_2^2)^{n-1}. \]
In particular, J_{F_0} > 0 for all x ∈ R² \ {0}. This concludes our proof.
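As a numerical sanity check (ours) of the claim J_P = |∂_{x_1}P|², note that for F_0(z) = z^n it gives J_{F_0} = n²(x_1² + x_2²)^{n−1}, since |n w|² = n²|w|². A finite-difference Jacobian of z ↦ z^n can be compared against this formula:

```python
# Sketch (ours): finite-difference check that the Jacobian of the real map
# (x1, x2) |-> z^n, z = x1 + i x2, equals |d/dx1 z^n|^2 = n^2 (x1^2 + x2^2)^(n-1).

def F0(x1, x2, n):
    w = (x1 + 1j * x2) ** n
    return w.real, w.imag

def jacobian_det(x1, x2, n, h=1e-5):
    # Central finite differences for the 2x2 Jacobian matrix.
    u1p, v1p = F0(x1 + h, x2, n); u1m, v1m = F0(x1 - h, x2, n)
    u2p, v2p = F0(x1, x2 + h, n); u2m, v2m = F0(x1, x2 - h, n)
    a, b = (u1p - u1m) / (2 * h), (u2p - u2m) / (2 * h)
    c, d = (v1p - v1m) / (2 * h), (v2p - v2m) / (2 * h)
    return a * d - b * c

n, x1, x2 = 3, 0.5, 0.7
expected = n ** 2 * (x1 ** 2 + x2 ** 2) ** (n - 1)
assert abs(jacobian_det(x1, x2, n) - expected) < 1e-5
```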
7.5.3. The Brouwer fixed point theorem. We now study a celebrated fixed point theorem named after Luitzen Egbertus Jan Brouwer (1881–1966). Before looking at the general Brouwer theorem, consider the simple case of a continuous function f : [a, b] → [a, b] from a closed interval into itself. We claim that f has a fixed point, which means there is a point c ∈ [a, b] such that f(c) = c; this fact is "obvious" from Figure 7.14 and is

Figure 7.14. If f : [a, b] → [a, b] is continuous, there is a point where the graph of y = f(x) crosses the line y = x.

a special case of the Brouwer fixed point theorem we'll discuss below. The proof of this fact is very simple. Consider the continuous function
\[ g(x) = x - f(x). \]
Because a ≤ f(x) ≤ b for all x ∈ [a, b], we have g(a) = a − f(a) ≤ 0 and g(b) = b − f(b) ≥ 0. Hence, by the Intermediate Value Theorem, there is a point c ∈ [a, b] such that g(c) = 0, which implies that f(c) = c.

Here's one neat application. Take two identical strings and twist one of them up, then lay them side by side as seen in Figure 7.15. Then there is always a point

Figure 7.15. There is a point in the twisted string that lies exactly above the corresponding twin point on the untwisted string. The dot represents such a point.

on the twisted string that lies directly above its corresponding identical point on the untwisted string! To prove this result from the fixed point theorem, let ℓ be the length of the strings, so we can consider the strings to be the interval [0, ℓ]. Define f : [0, ℓ] → [0, ℓ] as follows: Given x ∈ [0, ℓ], think of x as a point on the twisted string, and define f(x) to be the point on the straight string exactly below x. Then f is continuous,
so there is a point c ∈ [0, ℓ] such that f(c) = c; done. We shall prove the following generalization of the above one-dimensional fixed point theorem.

The Brouwer fixed point theorem

Theorem 7.23. If X is a topological space homeomorphic to the closed unit ball in R^n, then any continuous function from X to itself has a fixed point.

Here's an application of Brouwer's theorem similar to the strings example above. Take two identical pieces of paper and crumple one of them up, then lay the crumpled piece on top of and inside the flat piece as seen in Figure 7.16. Then there
Figure 7.16. There is a point on the crumpled paper that lies exactly above the corresponding twin point on the flat paper.
is always a point on the crumpled paper that lies directly above the corresponding identical point on the flat paper! We leave this as an exercise (here, X is a rectangle representing the paper). Here’s another interesting example. Take a cup of coffee and stir it as much as you like, then take a snapshot of it. Then Brouwer’s theorem says that you can find in the snapshot a point in the exact position it was before you started stirring!
Figure 7.17. In a stirred cup of coffee, there is always a point in the same place it was before stirring. Original photo by Julius Schorzman.
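Before the general proof, the one-dimensional argument above is easy to run numerically: bisection on g(x) = x − f(x) locates a fixed point. A sketch (the sample f is ours):

```python
# Sketch: find a fixed point of a sample continuous f : [0, 1] -> [0, 1]
# by bisecting g(x) = x - f(x), exactly as in the IVT argument above.
import math

f = math.cos          # maps [0, 1] into [cos 1, 1], a subset of [0, 1]
a, b = 0.0, 1.0       # g(a) <= 0 <= g(b)
for _ in range(60):
    c = (a + b) / 2
    if c - f(c) < 0:
        a = c
    else:
        b = c

# c is (numerically) the fixed point of cos, about 0.739085.
assert abs(c - f(c)) < 1e-9
```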
We now prove Brouwer's theorem using Simader's Theorem (cf. [174, 123]). We just have to focus on the unit ball B^n = {x ∈ R^n ; ‖x‖ ≤ 1}: if h : X → B^n is a homeomorphism and g : X → X is continuous, then h ∘ g ∘ h^{-1} : B^n → B^n has a fixed point p, so h^{-1}(p) is a fixed point of g. Let f : B^n → B^n be a continuous map; we need to show that f has a fixed point. For the sake of contradiction, assume that f does not have a fixed point. For t ∈ [0, 1], define f_t : B^n → R^n by
\[ f_t(x) = x - t f(x) \quad \text{for all } x \in B^n. \]

Lemma 7.24. Assuming f does not have a fixed point, there exist ε, ρ ∈ (0, 1) such that ‖f_1(x)‖ > 2ε for all x ∈ B^n, and ‖f_t(x)‖ > 2ε for all t ∈ [0, 1] and x ∈ B^n with ρ ≤ ‖x‖ ≤ 1.

Proof: The function f, by assumption, does not have a fixed point, so ‖f_1(x)‖ = ‖x − f(x)‖ > 0 for all x ∈ B^n. We list this as

Fact 1: ‖f_1(x)‖ > 0 for x ∈ B^n.

Now we claim that ‖f_t(x)‖ > 0 for all t ∈ [0, 1] and x ∈ S^{n−1}. Indeed, by Fact 1, we know that for t = 1, ‖f_1(x)‖ > 0 for x ∈ B^n, so in particular for
x ∈ S^{n−1}. Now consider the case 0 ≤ t < 1. Observe that ‖f_t(x)‖ = 0 if and only if t f(x) = x. Taking norms of both sides, we obtain t ‖f(x)‖ = 1 for x ∈ S^{n−1}, which cannot hold for 0 ≤ t < 1 since ‖f(x)‖ ≤ 1 for all x (recall that the range of f is contained in B^n). Thus, ‖f_t(x)‖ > 0 for 0 ≤ t < 1 and x ∈ S^{n−1}. As already noted, this inequality also holds for t = 1, so ‖f_t(x)‖ > 0 for all t ∈ [0, 1] and x ∈ S^{n−1}, just as we claimed. We list this as

Fact 2: ‖f_t(x)‖ > 0 for t ∈ [0, 1] and x ∈ S^{n−1}.

Next, we claim that there is a ρ ∈ (0, 1) such that ‖f_t(x)‖ > 0 for all t ∈ [0, 1] and ρ ≤ ‖x‖ ≤ 1. Indeed, put h(t, x) = ‖f_t(x)‖ and proceed by contradiction: If we suppose there does not exist such a ρ, then for each k ∈ N there are t_k ∈ [0, 1] and x_k with 1 − 1/k ≤ ‖x_k‖ ≤ 1 such that h(t_k, x_k) = 0. Since for each k we have (t_k, x_k) ∈ [0, 1] × B^n, a compact set, the sequence (t_1, x_1), (t_2, x_2), . . . has a subsequence converging to some point (t_0, x_0) with t_0 ∈ [0, 1] and x_0 ∈ B^n. Since 1 − 1/k ≤ ‖x_k‖ ≤ 1 for all k, we must have ‖x_0‖ = 1. Moreover, since h(t_k, x_k) = 0 for all k, by continuity we must also have h(t_0, x_0) = 0. However, this implies that ‖f_{t_0}(x_0)‖ = 0, which contradicts Fact 2. Hence, we have proved

Fact 3: There is a ρ ∈ (0, 1) such that ‖f_t(x)‖ > 0 for t ∈ [0, 1] and ρ ≤ ‖x‖ ≤ 1.

By Fact 1 and the fact that a continuous function on a compact set attains a minimum, ‖f_1(x)‖ attains a positive minimum for x ∈ B^n. By Fact 3 and the same fact, ‖f_t(x)‖ attains a positive minimum for 0 ≤ t ≤ 1 and ρ ≤ ‖x‖ ≤ 1. Finally, choosing ε ∈ (0, 1) with 2ε smaller than each of these two minimum values proves our result.

Proof of Brouwer's theorem: Assuming that f does not have a fixed point, we shall derive a contradiction via the Vanishing Theorem. To use the Vanishing Theorem we need smooth functions, so we first use Corollary 6.22 of the Stone–Weierstrass Theorem to approximate the components of f(x) = (f_1(x), . . . , f_n(x)) arbitrarily closely by polynomials (in particular, by smooth functions). Thus, with ε > 0 as in Lemma 7.24, there is a smooth function g : B^n → R^n such that ‖g(x) − f(x)‖ < ε for all x ∈ B^n. Let B = {x ∈ R^n ; ‖x‖ < 1}, the interior of B^n, and for t ∈ [0, 1] define F_t : B → R^n by F_t(x) = x − t g(x). Observe that
\[ \|f_t(x)\| = \|x - t f(x)\| = \|F_t(x) + t (g(x) - f(x))\| \leq \|F_t(x)\| + t \|g(x) - f(x)\| < \|F_t(x)\| + \varepsilon. \]
It follows, by Lemma 7.24, that
\[ \|F_1(x)\| > \varepsilon \quad \text{for } x \in B, \tag{7.43} \]
and, with ρ as in Lemma 7.24,
\[ \|F_t(x)\| > \varepsilon \quad \text{for } t \in [0, 1] \text{ and } x \in B \text{ with } \|x\| \geq \rho. \tag{7.44} \]
However, noting that F_0(x) = x, we have F_0(0) = 0 and J_{F_0} = 1, which in combination with (7.44) and the Vanishing Theorem imply that F_1(x) = 0 for some x ∈ B. This, however, is contrary to (7.43). This contradiction proves Brouwer's theorem.

◮ Exercises 7.5.

1. (Another proof of Euler's formula) This one is due to Ákos László [161].
(i) Show that
\[ \int_R \frac{\tanh^{2k} x}{\cosh^2 x}\, dx = \frac{2}{2k + 1}, \]
where tanh x = (e^x − e^{−x})/(e^x + e^{−x}) is the hyperbolic tangent and cosh x = (e^x + e^{−x})/2 is the hyperbolic cosine.
(ii) Show that
\[ \frac{1}{\cosh(x + y)\, \cosh(x - y)} = \sum_{k=0}^{\infty} \frac{\tanh^{2k} x \; \tanh^{2k} y}{\cosh^2 x \; \cosh^2 y}. \]
(iii) Using (i) and (ii), show that
\[ \frac{1}{4} \int_{R^2} \frac{1}{\cosh(x + y)\, \cosh(x - y)}\, dx\, dy = \sum_{k=0}^{\infty} \frac{1}{(2k + 1)^2}. \]
On the other hand, show that the left-hand integral equals π²/8 by making the change of variables x = (u + v)/2, y = (u − v)/2 (or, u = x + y and v = x − y). This shows that \sum_{k=0}^{\infty} \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}, which implies \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.

2. (Yet another proof of Euler's formula) Here's a proof of Euler's formula due to Frits Beukers, Eugenio Calabi, and Johan A. C. Kolk [28], published in 1993. (Actually, they find not only Euler's formula for π²/6, but many formulas all at once using the Fubini + Change of Variables trick. This paper is also written at a first-year grad student level, so check it out!)
(i) For the Beukers–Calabi–Kolk proof, consider the integral
\[ I := \int_{[0,1]^2} \frac{1}{1 - x^2 y^2}\, dx\, dy \]
over the unit square. In the rest of the proof we'll show that I = \sum_{k=0}^{\infty} \frac{1}{(2k+1)^2}, which implies Euler's formula for π²/6.
(ii) Consider the change of variables x = \frac{\sin u}{\cos v}, y = \frac{\sin v}{\cos u}. In (iii) and (iv) below, we shall prove that if F denotes this transformation, then F : U → V is a smooth diffeomorphism, where U and V are shown in Figure 7.18. Before doing so, show
Figure 7.18. The vertices of the open left-hand triangle U are (0, 0), (π/2, 0), (0, π/2). The right-hand square is V = (0, 1) × (0, 1).

that
\[ u = \sin^{-1}\!\left( x \sqrt{\frac{1 - y^2}{1 - x^2 y^2}} \right), \qquad v = \sin^{-1}\!\left( y \sqrt{\frac{1 - x^2}{1 - x^2 y^2}} \right). \]
(iii) Show that if u, v > 0 and u + v < π/2, then 0 < x, y < 1. Suggestion: Apply the sine function to each term in the inequalities 0 < u < π/2 − v.
(iv) Conversely, show that if 0 < x, y < 1, then u, v > 0 and u + v < π/2. Suggestion: Use the identities \sin^{-1} z = \cos^{-1}(\sqrt{1 - z^2}) and \cos^{-1} z = \frac{\pi}{2} - \sin^{-1} z.
(v) Using the Change of Variables Formula, show that
\[ I = \int_V \frac{1}{1 - x^2 y^2}\, dx\, dy = \int_U 1\, du\, dv = \text{Area of } U = \frac{\pi^2}{8}. \]
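The endpoint identities of Problems 1 and 2 can be checked numerically (a sketch, ours): the partial sums of \sum 1/(2k+1)^2 approach π²/8, and multiplying by 4/3 recovers π²/6.

```python
# Sketch: partial sums of sum_{k>=0} 1/(2k+1)^2 approach pi^2/8; since the
# even-index terms of sum 1/n^2 contribute a quarter of the total, the full
# sum is 4/3 times the odd part, recovering pi^2/6.
import math

N = 1_000_000
s = sum(1.0 / (2 * k + 1) ** 2 for k in range(N))
# The tail beyond N terms is about 1/(4N), here ~2.5e-7.
assert abs(s - math.pi ** 2 / 8) < 1e-6
assert abs((4 / 3) * s - math.pi ** 2 / 6) < 2e-6
```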
3. (Gamma and Beta functions) In this problem we give two change of variables proofs of the Beta–Gamma formula from Problem 6 in Exercises 7.3; please review that problem before proceeding. Let p, q > 0 and note that
\[ \Gamma(p)\, \Gamma(q) = \int_{[0,\infty)^2} x^{p-1} y^{q-1} e^{-x-y}\, dx\, dy. \]
Using the following changes of variables, determining the appropriate domain for each transformation, prove the Gamma–Beta formula.
(1) x = uv and y = u(1 − v).
(2) x = u cos² v, y = u sin² v.

4. (Mellin Convolution) In this problem we introduce the "Mellin" convolution of functions on R⁺ = (0, ∞).
(a) Let M¹₊ denote the Lebesgue measurable subsets of R⁺ and define the measure µ : M¹₊ → [0, ∞] by µ(A) = \int χ_A\, x^{-1}\, dx for all A ∈ M¹₊. It's easy to prove (see Problem 10 in Exercises 5.5) that a function f : R⁺ → R is µ-integrable if and only if f(x) x^{-1} is Lebesgue integrable, in which case
\[ \int_0^{\infty} f\, d\mu = \int_0^{\infty} f(x)\, \frac{dx}{x}. \]
Given a function f on R⁺, define the function T f on R by (T f)(x) = f(e^x). Show that f is measurable on R⁺ if and only if T f is measurable on R, and f is µ-integrable if and only if T f is Lebesgue integrable, in which case
\[ \int_0^{\infty} f(x)\, \frac{dx}{x} = \int_R (T f)(x)\, dx. \]
(b) Given measurable functions f, g on R⁺, for each x ∈ R⁺ the integral
\[ h(x) = \int_0^{\infty} f\Big( \frac{x}{y} \Big)\, g(y)\, \frac{dy}{y} \]
is called the (Mellin) convolution of f and g and is denoted by f ∗ g, provided this integral exists. This definition is different from the ones introduced in Problems 7 and 8 of Exercises 7.3, and is useful in the theory of Mellin transforms studied in Problem 5 below. Prove that if the functions f, g on R⁺ are µ-integrable, then h(x) is defined for a.e. x ∈ R⁺ and is also µ-integrable, with
\[ \int_0^{\infty} h\, d\mu = \Big( \int_0^{\infty} f\, d\mu \Big) \Big( \int_0^{\infty} g\, d\mu \Big). \]
(c) Let f and g be µ-integrable functions on R⁺ and let h be the Mellin convolution of f and g. Prove that T h = T f ∗ T g, where the convolution on the right is the one defined in Problem 7 of Exercises 7.3.

5. (Mellin Transform)²⁰ The Mellin transform of a measurable function f on R⁺ is defined by
\[ M(f)(s) = \int_0^{\infty} x^s f(x)\, \frac{dx}{x}, \]
defined for those s ∈ R such that x^s f(x) is µ-integrable. (We refer the reader to the previous problem for the definition of µ.)

²⁰ This Mellin transform problem doesn't seriously use the change of variables formula, but it shows some applications of the convolution product of the previous problem, which is why this problem is here.
(a) Let f, g be µ-integrable functions on R⁺ and suppose that both M(f)(s) and M(g)(s) are defined for some s ∈ R. Prove that M(f ∗ g)(s) is also defined, where '∗' denotes the Mellin convolution, and
\[ M(f * g)(s) = M(f)(s) \cdot M(g)(s). \]
(b) (Legendre duplication formula) Using the Mellin convolution formula, we can give an interesting proof [155] of the Legendre duplication formula found in Problem 6 of Exercises 7.3. Indeed, just compute the Mellin transforms of f(x) = e^{−x}, g(x) = x^{1/2} e^{−x}, and f ∗ g, where '∗' denotes the Mellin convolution. Suggestion: To get a formula for f ∗ g, use the integral in Problem 7b of Exercises 6.1.

6. (Cauchy–Riemann Equations) For any polynomial p(z), where z = x_1 + i x_2, with complex coefficients, prove that ∂_{x_2} p = i ∂_{x_1} p. If p = p_1 + i p_2, where p_1 and p_2 are respectively the real and imaginary parts of p, prove that ∂_{x_2} p_1 = −∂_{x_1} p_2 and ∂_{x_2} p_2 = ∂_{x_1} p_1.

7. (Surjectivity theorem) We shall prove the following interesting fact: If F : R^n → R^n is a smooth map such that F(x) = x for all x outside of some compact ball, then F must be onto.
(a) Given any c ∈ R^n, show that F_t(x) = t(F(x) − c) + (1 − t)x satisfies the assumptions of the Vanishing Theorem. Conclude that F(x) = c for some x ∈ R^n.
In case you're interested, here are two other proofs, one via Lax's change of variables formula and the other using a homotopy argument, cf. [163], [238]. If we suppose that F is not onto, then there is a point q not in the image of F. Prove that there is an open ball B centered at q that does not intersect the image of F. Next, derive contradictions in two ways (by using a specially chosen function f):
(b) assuming and using Lax's change of variables formula (Problem 8 in Exercises 7.4),
(c) using Simader's Theorem (Theorem 7.16) and the smooth family of maps F_t(x) = tF(x) + (1 − t)x.

8.
(No Retraction Theorem) Prove that there does not exist a retraction of B^n onto S^{n−1}; that is, there does not exist a continuous map f : B^n → S^{n−1} such that f(x) = x for all x ∈ S^{n−1}. Suggestion: Supposing such an f existed, apply the Brouwer Theorem to the function g : B^n → B^n defined by g(x) = −f(x).

9. (Intermediate Value Theorem) We shall prove the following result:

(Intermediate Value Theorem) Any continuous function from B^n into R^n that is the identity on the sphere S^{n−1} must cover every point of B^n. Explicitly, if f : B^n → R^n is a continuous map such that f(x) = x for all x ∈ S^{n−1}, then B^n ⊆ f(B^n).

In particular, the IVT implies the No Retraction Theorem (why?). We shall prove the IVT from the Surjectivity Theorem (Problem 7 above) as follows:
(i) For the sake of contradiction, suppose there is a continuous map f : B^n → R^n equal to the identity on S^{n−1} that does not cover all of B^n. Show that there is a closed ball B in the interior of B^n such that the image of f does not intersect B.
(ii) Approximating the component functions of f(x) by polynomials (invoking Corollary 6.22 of the Stone–Weierstrass Theorem), show that there is a smooth function g : B^n → R^n such that the image of g does not intersect B and such that for some ε > 0, B is contained in the open ball of radius 1 − 2ε centered at the origin, and
\[ \|g(x)\| \geq 1 - \varepsilon \quad \text{and} \quad \|g(x) - x\| < \varepsilon \]
for all x with 1 − ε ≤ ‖x‖ ≤ 1.
(iii) From Lemma 5.2 there is a smooth nondecreasing function φ : R → R that is zero for t ≤ 0 and one for t ≥ ε. Let ψ(t) = 1 − φ(t − 1 + ε). Then ψ(t) = 1 for 0 ≤ t ≤ 1 − ε and ψ(t) = 0 for t ≥ 1. Define F : R^n → R^n by
\[ F(x) = \psi(\|x\|)\, g(x) + \big( 1 - \psi(\|x\|) \big)\, x. \]
Prove that F(x) is a smooth function such that F(x) = x for all x with ‖x‖ ≥ 1 and ‖F(x)‖ ≥ 1 − 2ε for all x with ‖x‖ ≥ 1 − ε. Conclude that the image of F does not intersect B.
(iv) Finally, derive a contradiction using Problem 7.

10. Brouwer's fixed point theorem can be derived from the No Retraction or Intermediate Value Theorem as follows. Let f : B^n → B^n be a continuous map and suppose that f does not have a fixed point. Then for each point x ∈ B^n, the ray from f(x) through x pierces the sphere at a point, which we denote by g(x). Show that g : B^n → R^n is continuous, maps into S^{n−1}, and is the identity on the sphere, and then derive a contradiction using the Intermediate Value Theorem.
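The ray map g in Problem 10 is computable in closed form: writing the ray as p(t) = f(x) + t(x − f(x)) with t ≥ 1 and solving ‖p(t)‖ = 1 gives a quadratic in t. A sketch for n = 2 (the sample map and helper names are ours; the sketch only evaluates at points where x ≠ f(x)):

```python
# Sketch: the point g(x) where the ray from f(x) through x pierces S^1.
# Parametrize p(t) = a + t d with a = f(x), d = x - f(x); then ||p(t)|| = 1
# is the quadratic ||d||^2 t^2 + 2 (a.d) t + (||a||^2 - 1) = 0, and we take
# the larger root, which satisfies t >= 1.
import math

def ray_to_sphere(x, fx):
    a, d = fx, (x[0] - fx[0], x[1] - fx[1])
    ad = a[0] * d[0] + a[1] * d[1]
    dd = d[0] * d[0] + d[1] * d[1]
    aa = a[0] * a[0] + a[1] * a[1]
    t = (-ad + math.sqrt(ad * ad - dd * (aa - 1.0))) / dd
    return (a[0] + t * d[0], a[1] + t * d[1])

f = lambda x: (-x[0], -x[1])   # sample map; fixed-point-free away from 0

# For this f, the ray from -x through x exits the disk at x/||x||.
g = ray_to_sphere((0.3, 0.4), f((0.3, 0.4)))
assert abs(math.hypot(*g) - 1.0) < 1e-9
assert max(abs(g[0] - 0.6), abs(g[1] - 0.8)) < 1e-9

# On the sphere itself, g is the identity, as Problem 10 requires.
g = ray_to_sphere((0.6, 0.8), f((0.6, 0.8)))
assert max(abs(g[0] - 0.6), abs(g[1] - 0.8)) < 1e-9
```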
7.6. Polar coordinates and integration over spheres

In this section we give a precise notion of measure and integration on spheres in R^n and relate them to n-dimensional polar coordinates. In order to integrate we first need a measure, so we begin by studying . . .

7.6.1. Measures on spheres. Consider the unit sphere
\[ S^{n-1} = \{ x \in R^n \; ; \; \|x\| = 1 \}, \]
where ‖x‖ = \sqrt{x_1^2 + \cdots + x_n^2} with x = (x_1, . . . , x_n). Since S^{n−1} is a topological space (with the topology inherited from R^n), its Borel σ-algebra B(S^{n−1}) is defined as the σ-algebra generated by the open subsets of S^{n−1}, and our goal is to define a measure on B(S^{n−1}). We begin by defining polar coordinates. If you did Problem 4 in Exercises 7.4, you studied polar coordinates on R^n using cosines and sines, generalizing polar coordinates on R²; the result was quite complicated, if you remember. However, we can define polar coordinates in a much simpler way. If x ∈ R^n \ {0}, we define the polar coordinates of x as the pair
\[ (r, \omega) \in (0, \infty) \times S^{n-1}, \quad \text{where (see Figure 7.19)} \quad r := \|x\| \in (0, \infty) \quad \text{and} \quad \omega := \frac{x}{\|x\|} \in S^{n-1}. \]

Figure 7.19. Polar coordinates: r is the distance of x from the origin and ω is the direction of x from the origin.
Observe that ‖ω‖ = 1, so indeed ω ∈ S^{n−1}, and also observe that we can obtain x from r and ω via x = r ω.

Example: If n = 2, then ω ∈ S¹, so there exists an angle θ ∈ R such that ω = (cos θ, sin θ).
(Any two such angles differ by a multiple of 2π.) Therefore,
\[ (x_1, x_2) = r \omega = (r \cos \theta, r \sin \theta) \implies x_1 = r \cos \theta, \; x_2 = r \sin \theta, \]
which are the standard polar coordinates learned in elementary calculus.
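These coordinates are trivial to compute in code; a sketch (ours) for a sample point in R³, checking that ‖ω‖ = 1 and x = rω:

```python
# Sketch: polar coordinates (r, omega) of x in R^n \ {0}:
# r = ||x||, omega = x / ||x||; then ||omega|| = 1 and x = r * omega.
import math

def polar(x):
    r = math.sqrt(sum(t * t for t in x))
    return r, tuple(t / r for t in x)

x = (1.0, 2.0, 2.0)          # ||x|| = 3
r, omega = polar(x)
assert abs(r - 3.0) < 1e-12
assert abs(sum(t * t for t in omega) - 1.0) < 1e-12
assert all(abs(r * w - t) < 1e-12 for w, t in zip(omega, x))
```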
In elementary calculus we learned the two-dimensional polar coordinates equation
\[ \int_{R^2} f\, dx = \int_0^{2\pi} \int_0^{\infty} f(r \cos \theta, r \sin \theta)\, r\, dr\, d\theta. \]
Thus, we have written the integral of a function f on R² in terms of a radial integration and an angular integration, which we can think of as an integral over the circle S¹. Since we have polar coordinates on R^n for any n, and not just n = 2, we should be able to express an integral of a function over R^n in terms of an integral over (0, ∞) × S^{n−1}. In fact, if we take a quick glance at Part (v) of Problem 4 in Exercises 7.4, we conjecture that if f is an integrable function on R^n, the integration formula should look something like
\[ \int_{R^n} f = \int_0^{\infty} \Big( \int_{S^{n-1}} f(r\omega)\, d\sigma \Big)\, r^{n-1}\, dr, \tag{7.45} \]
where σ is some measure on S^{n−1}. Now we haven't even defined the measure on S^{n−1}, so (7.45) is nonsense at this point! However, the idea is to pretend that we've already established Equation (7.45) and then make a judicious choice for the function f to find an expression for σ; after finding this expression, we prove (7.45) as a consequence! To this end, let A ⊆ S^{n−1} be a Borel subset of the sphere. To find the measure σ(A), recall that we can find the measure of a set by taking the integral of the characteristic function of the set. For this reason, let us define f : R^n → R as follows: Given x ∈ R^n,
\[ f(x) := \begin{cases} \chi_A(\omega) & \text{if } 0 < \|x\| \leq 1, \text{ where } \omega := x/\|x\|, \\ 0 & \text{else.} \end{cases} \]
Figure 7.20 illustrates how f is defined.

Figure 7.20. If x is in the closed unit ball minus the origin, then we project x to the sphere and define ω := x/‖x‖ ∈ S^{n−1} and f(x) := χ_A(ω). Otherwise, we define f(x) = 0. If A ⊆ S^{n−1} is as shown on the right, Ã is the set of points inside the closed unit ball minus the origin subtended by A. Then f(x) = 1 if and only if x ∈ Ã.

Putting this function f into the right-hand
side of the equality (7.45), we obtain²¹
\[ \int_0^{\infty} \Big( \int_{S^{n-1}} f(r\omega)\, d\sigma \Big)\, r^{n-1}\, dr = \int_0^1 \Big( \int_{S^{n-1}} \chi_A(\omega)\, d\sigma \Big)\, r^{n-1}\, dr = \sigma(A) \int_0^1 r^{n-1}\, dr = \frac{\sigma(A)}{n}. \]
Hence,
\[ \sigma(A) = n \int_{R^n} f. \]
On the other hand, by reviewing Figure 7.20, we see that f = χ_{Ã}, where
\[ \widetilde{A} = \Big\{ x \in R^n \; ; \; 0 < \|x\| \leq 1, \; \frac{x}{\|x\|} \in A \Big\}, \tag{7.46} \]
which (minus the origin) is the sector, or solid angle, subtended by A. It follows that
\[ \int_{R^n} f = \int_{R^n} \chi_{\widetilde{A}} = m(\widetilde{A}), \]
where m here denotes Lebesgue measure on R^n. Thus, we have
\[ \sigma(A) = n\, m(\widetilde{A}). \tag{7.47} \]
For example, when n = 2 and A is an arc on the unit circle as in Figure 7.21, then σ(A) = the angle θ formed by A in radians, and (7.47) is the equation "Area of Ã = θ/2," which is a well-known fact you probably learned in elementary calculus while studying polar coordinates.²²

Figure 7.21. A is an arc on the unit circle and Ã is the sector (minus the origin) subtended by A.

Now technically speaking, for the right-hand
side of (7.47) to be defined, we need Ã to be Lebesgue measurable. In fact, Ã is a Borel set, and to see this, define
\[ F : B^n \setminus \{0\} \to S^{n-1} \]
²¹ Remember, although we have not yet proved (7.45), we are assuming we have done so, just to figure out what σ is.
²² More generally, the area of the sector of θ radians inside a circle of radius ρ is \frac{1}{2} \rho^2 \theta. In our case, ρ = 1.
e = F −1 (A). Since A is a Borel set by F (x) := x/kxk for all x ∈ Bn \ {0}; then A e is a Borel set by Proposition 1.14. and F is continuous, it follows that A Summarizing what we have done: Assuming the polar coordinates integration formula (7.45) we derived the formula (7.47) for the measure σ on Sn−1 in terms of Lebesgue measure on Rn . Now that we have the formula (7.47), we shall take it as the definition of σ. Thus, we define (surface) measure on the sphere Sn−1 as the measure σ : B(Sn−1 ) → [0, ∞) defined by e σ(A) := n m(A) for all A ∈ B(Sn−1 ). Note that σ depends on n, but for notational simplicity we suppress the dimension. Now that we have a measure on Sn−1 , the integral Z f dσ Sn−1
is defined for any integrable Borel measurable function f : S^{n-1} → ℝ. For example, in Problem 1 you will show that in the case n = 2,
\[
\int_{S^1} f\, d\sigma = \int_0^{2\pi} f(\cos\theta, \sin\theta)\, d\theta.
\]
7.6.2. The n-dimensional Polar Coordinates Integration Formula. We now prove the polar coordinates integration formula (7.45).

Integration in polar coordinates

Theorem 7.25. For any Borel integrable function f on ℝ^n, we have
\[
(7.48) \qquad \int_{\mathbb{R}^n} f
  = \int_0^\infty \int_{S^{n-1}} f(r\omega)\, r^{n-1}\, d\sigma\, dr
  = \int_{S^{n-1}} \int_0^\infty f(r\omega)\, r^{n-1}\, dr\, d\sigma.
\]
Proof: Consider the "polar coordinates map"
\[
P : (0, \infty) \times S^{n-1} \longrightarrow \mathbb{R}^n \setminus \{0\}, \qquad P(r, \omega) = r\,\omega.
\]
We leave you to check that this map is a homeomorphism with inverse x ↦ (r, ω), where r = ‖x‖ and ω = x/‖x‖. Since homeomorphisms preserve Borel sets (Proposition 1.15), it follows that P induces a bijection P : 𝓑((0, ∞) × S^{n-1}) → 𝓑(ℝ^n \ {0}), and hence by Corollary 7.4,
\[
P : \mathcal{B}(0, \infty) \otimes \mathcal{B}(S^{n-1}) \to \mathcal{B}(\mathbb{R}^n \setminus \{0\})
\]
is a bijection. We now prove our theorem in two steps.

Step 1: For all A ∈ 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}), we shall prove that
\[
(7.49) \qquad m(P(A)) = \int_{S^{n-1}} \int_0^\infty \chi_A(r, \omega)\, r^{n-1}\, dr\, d\sigma.
\]
Let μ be the measure on 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}) where μ(A) is defined by the right-hand side of the equality in (7.49) for all A ∈ 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}). We must show that m(P(A)) = μ(A) for all A ∈ 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}). By the Extension Theorem, all we have to do is verify this equality on a semiring that generates 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}); one can check that such a semiring consists
452
7. FUBINI’S THEOREM AND CHANGE OF VARIABLES
of sets of the form (a, b] × B where 0 < a < b and B ∈ 𝓑(S^{n-1}). Thus, consider such a set A = (a, b] × B where B ∈ 𝓑(S^{n-1}) and note that
\[
\mu(A) = \int_{S^{n-1}} \int_0^\infty \chi_{(a,b] \times B}(r, \omega)\, r^{n-1}\, dr\, d\sigma
       = \int_{S^{n-1}} \int_0^\infty \chi_{(a,b]}(r)\, \chi_B(\omega)\, r^{n-1}\, dr\, d\sigma
       = \int_a^b r^{n-1}\, dr \int_{S^{n-1}} \chi_B(\omega)\, d\sigma
       = \frac{1}{n}\big(b^n - a^n\big)\, \sigma(B).
\]
On the other hand,
\[
P(A) = \{ r\omega \;;\; a < r \le b,\ \omega \in B \}
     = \{ r\omega \;;\; 0 < r \le b,\ \omega \in B \} \setminus \{ r\omega \;;\; 0 < r \le a,\ \omega \in B \}
     = b\,\{ r\omega \;;\; 0 < r \le 1,\ \omega \in B \} \setminus a\,\{ r\omega \;;\; 0 < r \le 1,\ \omega \in B \}
     = b\widetilde{B} \setminus a\widetilde{B},
\]
where B̃ = {x ∈ 𝔹^n \ {0} ; x/‖x‖ ∈ B}. By subtractivity, the dilation property of Lebesgue measure (Proposition 4.12), and the definition of σ(B), we have
\[
m(P(A)) = m(b\widetilde{B}) - m(a\widetilde{B}) = b^n m(\widetilde{B}) - a^n m(\widetilde{B}) = \frac{1}{n}\big(b^n - a^n\big)\, \sigma(B).
\]
This shows that μ(A) = m(P(A)) and proves (7.49).

Step 2: We now finish the proof. Let A ∈ 𝓑(ℝ^n \ {0}). Then applying Step 1 to P^{-1}(A) ∈ 𝓑(0, ∞) ⊗ 𝓑(S^{n-1}), we have
\[
\int_{\mathbb{R}^n \setminus \{0\}} \chi_A = m(A) = m\big(P(P^{-1}(A))\big)
  = \int_{S^{n-1}} \int_0^\infty \chi_{P^{-1}(A)}(r, \omega)\, r^{n-1}\, dr\, d\sigma
  = \int_{S^{n-1}} \int_0^\infty (\chi_A \circ P)(r, \omega)\, r^{n-1}\, dr\, d\sigma \quad (\text{since } \chi_{P^{-1}(A)} = \chi_A \circ P)
  = \int_{S^{n-1}} \int_0^\infty \chi_A(r\omega)\, r^{n-1}\, dr\, d\sigma
  = \int_0^\infty \int_{S^{n-1}} \chi_A(r\omega)\, r^{n-1}\, d\sigma\, dr \quad (\text{Fubini}).
\]
Hence, (7.48) holds on characteristic functions of Borel sets in ℝ^n \ {0}. Applying the now familiar Principle of Appropriate Functions, we see that (7.48) holds for Borel integrable functions on ℝ^n \ {0}. Finally, since {0} is a set of measure zero,
\[
\int_{\mathbb{R}^n} f = \int_{\mathbb{R}^n \setminus \{0\}} f
\]
for any Borel integrable function on ℝ^n. This completes our proof.

Example 7.3. The standard proof of ∫ e^{-x²} dx = √π is due to Siméon-Denis Poisson (1781–1840) [279, pp. 16–17]; it uses Fubini's Theorem and the integration formula
for polar coordinates: If I = ∫ e^{-x²} dx (where the integral is over ℝ), then
\[
I^2 = \Big( \int e^{-x^2}\, dx \Big) \Big( \int e^{-y^2}\, dy \Big)
    = \int_{\mathbb{R}^2} e^{-x^2 - y^2} \quad (\text{Fubini})
    = \int_0^{2\pi} \int_0^\infty e^{-r^2} r\, dr\, d\theta
    = \int_0^{2\pi} \Big[ -\tfrac{1}{2} e^{-r^2} \Big]_{r=0}^{r=\infty} d\theta
    = \int_0^{2\pi} \tfrac{1}{2}\, d\theta = \pi.
\]
Hence, I = √π.
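Poisson's computation is easy to confirm numerically. Here is a quick Python check (our illustration, not part of the text): a midpoint rule on [−8, 8], where the truncated Gaussian tail is below e^{−64} and hence negligible.

```python
import math

# Numerically approximate I = int_{-inf}^{inf} e^{-x^2} dx by a midpoint
# rule on [-8, 8]; the tail beyond |x| = 8 is smaller than e^{-64}.
N = 50000
a, b = -8.0, 8.0
h = (b - a) / N
I = sum(math.exp(-((a + (k + 0.5) * h) ** 2)) for k in range(N)) * h

# Poisson's answer: I = sqrt(pi).
assert abs(I - math.sqrt(math.pi)) < 1e-6
```

For a rapidly decaying smooth integrand like the Gaussian, the midpoint rule converges extremely fast, so even this crude truncation reproduces √π to many digits.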
7.6.3. Spheres of different radii. Now that we have a measure σ on S^{n-1}, we define a measure on the sphere of radius ρ > 0, which is the set
\[
S^{n-1}_\rho := \{ x \in \mathbb{R}^n \;;\; \|x\| = \rho \}.
\]
To see how to do so, let A ⊆ S^{n-1}_ρ be a Borel set and write A as
\[
A = \rho A',
\]
where A' ⊆ S^{n-1} is the Borel set
\[
A' = \{ x \in S^{n-1} \;;\; \rho x \in A \} \in \mathcal{B}(S^{n-1});
\]
see Figure 7.22 for a picture of the relationship between A and A'.

Figure 7.22. Given a point x ∈ S^{n-1}, we have ρx ∈ S^{n-1}_ρ.

Now, for n-dimensional Lebesgue measure, we know that m(rB) = r^n m(B) for any r > 0 and Lebesgue measurable set B ⊆ ℝ^n. Since a sphere is (n − 1)-dimensional and A = ρA', by analogy we should have
\[
\text{the measure of } A = \rho^{n-1} \times \text{the measure of } A'.
\]
Since A' ⊆ S^{n-1} is a Borel set, the right-hand side equals ρ^{n-1} σ(A'), where σ is surface measure on S^{n-1}. Since the measure of A should equal ρ^{n-1} σ(A'), we simply declare it to be so: We define (surface) measure on S^{n-1}_ρ as the map
\[
\sigma_\rho : \mathcal{B}(S^{n-1}_\rho) \to [0, \infty)
\]
defined by
\[
(7.50) \qquad \sigma_\rho(A) = \rho^{n-1} \sigma(A') \quad \text{for all } A \in \mathcal{B}(S^{n-1}_\rho).
\]
The following proposition is the integral version of the formula (7.50) and is a good exercise in using the Principle of Appropriate Functions. We shall leave its proof to you (see Problem 2).
Proposition 7.26. For any ρ > 0 and Borel integrable function f on S^{n-1}_ρ, we have
\[
\int_{S^{n-1}_\rho} f\, d\sigma_\rho = \rho^{n-1} \int_{S^{n-1}} f(\rho\, \omega)\, d\sigma.
\]
For example (see Problem 1), for any Borel integrable function f : S^1_ρ → ℝ, we have
\[
(7.51) \qquad \int_{S^1_\rho} f\, d\sigma_\rho = \rho \int_0^{2\pi} f(\rho \cos\theta, \rho \sin\theta)\, d\theta.
\]
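As a concrete instance of (7.51), here is a small Python illustration (ours, not from the text), using the test function f(x, y) = x² on the circle of radius ρ: the right-hand side of (7.51) is ρ ∫₀^{2π} ρ² cos²θ dθ = πρ³, which the midpoint rule reproduces to high accuracy. The value of ρ is an arbitrary choice.

```python
import math

# Evaluate the right-hand side of (7.51) for f(x, y) = x^2 on the circle
# of radius rho, i.e. rho * int_0^{2pi} (rho*cos(theta))^2 dtheta, by the
# midpoint rule; the closed form is pi * rho^3.
rho = 1.7        # arbitrary radius
N = 20000
h = 2 * math.pi / N
integral = rho * sum((rho * math.cos((k + 0.5) * h)) ** 2 for k in range(N)) * h

assert abs(integral - math.pi * rho ** 3) < 1e-8
```

The midpoint rule is spectrally accurate for smooth periodic integrands, which is why such a tight tolerance holds.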
◮ Exercises 7.6.
1. Consider the bijection T : [0, 2π) → S^1 defined by T(θ) = (cos θ, sin θ) for all θ ∈ [0, 2π). Prove that σ(A) = m(T^{-1}(A)) for all A ∈ 𝓑(S^1). Use this to show that for any Borel integrable function f : S^1 → ℝ, we have
\[
\int_{S^1} f\, d\sigma = \int_0^{2\pi} f(\cos\theta, \sin\theta)\, d\theta.
\]
Then show that for any ρ > 0 and any Borel integrable function f : S^1_ρ → ℝ, (7.51) holds.
2. Prove Proposition 7.26.
3. Prove that measure and integration over any sphere are invariant under orthogonal transformations, in the sense that if O is an orthogonal matrix, then for any ρ > 0 and Borel set A ⊆ S^{n-1}_ρ,
\[
\sigma_\rho(OA) = \sigma_\rho(A),
\]
and for any integrable function f : S^{n-1}_ρ → ℝ,
\[
\int_{S^{n-1}_\rho} f\, d\sigma_\rho = \int_{S^{n-1}_\rho} f(O\omega)\, d\sigma_\rho.
\]
4. Let f : ℝ^n → ℝ be a Borel measurable function that is homogeneous of degree a > 0, that is, f(tx) = t^a f(x) for all t > 0 and x ∈ ℝ^n. Assuming that f is integrable over 𝔹^n, prove that
\[
\int_{\mathbb{B}^n} f(x)\, dx = \frac{1}{a + n} \int_{S^{n-1}} f(\omega)\, d\sigma.
\]
5. (Volume of balls III) In Section 7.2.5 and also in Problem 4 of Exercises 7.4 we found the volume of the n-ball. In this problem we find it again, together with the area of the (n − 1)-sphere:
\[
\text{Area of } S^{n-1} = \omega_n := \frac{2\,\pi^{n/2}}{\Gamma(n/2)}, \qquad
\text{Volume of } \mathbb{B}^n = \frac{\omega_n}{n} = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2} + 1)}.
\]
Proceed as follows. Integrate e^{-|x|²} = e^{-x₁² - x₂² - ⋯ - xₙ²} over ℝ^n in two ways: first using Fubini's theorem, then switching to polar coordinates. This will give you the formula for the area of S^{n-1}. Next, integrate χ_{𝔹^n} in polar coordinates to get the second formula.
6. Prove that for any ρ > 0, the area of S^{n-1}_ρ is ρ^{n-1}ω_n and the volume of 𝔹^n(ρ) is ρ^n ω_n / n, where 𝔹^n(ρ) = {x ∈ ℝ^n ; ‖x‖ ≤ ρ} and ω_n is given in the previous problem.
7. (Volume of balls IV) Here's yet one more method to find volumes! Breaking up ℝ^n as ℝ² × ℝ^{n-2} and using volume of slices as in Section 7.2 and polar coordinates on the ℝ² factor, prove that for n ≥ 3, we have
\[
V_n(1) = V_{n-2}(1) \int_0^{2\pi} \int_0^1 (1 - r^2)^{\frac{n}{2} - 1} r\, dr\, d\theta = V_{n-2}(1)\, \frac{2\pi}{n}.
\]
Using this formula, prove by induction that V_n(1) = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2} + 1)}.
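The closed forms in Problems 5–7 are easy to cross-check with Python's `math.gamma` (our illustration, not part of the text): the sphere-area formula against the familiar low-dimensional values, the relation V_n = ω_n/n, and the recursion V_n(1) = V_{n−2}(1) · 2π/n.

```python
import math

def omega(n):
    """Area of the unit sphere S^{n-1}: 2*pi^{n/2} / Gamma(n/2)."""
    return 2 * math.pi ** (n / 2) / math.gamma(n / 2)

def vol(n):
    """Volume of the unit n-ball: pi^{n/2} / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

assert abs(omega(2) - 2 * math.pi) < 1e-12     # circumference of S^1
assert abs(omega(3) - 4 * math.pi) < 1e-12     # area of S^2
assert abs(vol(3) - 4 * math.pi / 3) < 1e-12   # volume of B^3
for n in range(3, 12):
    assert abs(vol(n) - vol(n - 2) * 2 * math.pi / n) < 1e-12  # Problem 7
    assert abs(vol(n) - omega(n) / n) < 1e-12                  # V_n = omega_n / n
```

The last identity is just Γ(n/2 + 1) = (n/2) Γ(n/2), which is why both formulas in Problem 5 agree.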
8. (Pappus's centroid theorem) Let R ⊆ ℝ² be a measurable set in the positive right half of the (x, z)-plane. (Thus, (x, z) ∈ R implies x > 0.) We put
\[
\overline{x} = \frac{1}{\text{Area}(R)} \int_{\mathbb{R}^2} x\, \chi_R(x, z)\, dx\, dz = \frac{1}{\text{Area}(R)} \int_R x\, dx\, dz,
\]
which is the x-coordinate of the geometric centroid of R. Put
\[
E = \{ (x, y, z) \in \mathbb{R}^3 \;;\; (\sqrt{x^2 + y^2},\, z) \in R \},
\]
which is the solid of revolution obtained by rotating R about the z-axis; for example, if R is a disk in the (x, z)-plane, then E is a solid torus. In general,
\[
(x, y, z) \in E \iff (r, z) \in R, \quad \text{where } r = \sqrt{x^2 + y^2} = \text{distance to the } z\text{-axis}.
\]
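Before tackling the proof, one can sanity-check Pappus's prediction numerically on the torus (our illustration, not part of the text): rotating the disk of radius r₀ centered at (d, 0) should give volume 2π · Area(R) · x̄ = 2π²r₀²d. Switching to polar coordinates in (x, y) as the problem suggests turns the volume into 2π ∫_R x dx dz, which we approximate on a grid; the values of d and r₀ are arbitrary choices.

```python
import math

# Rotate the disk R of radius r0 centered at (d, 0) in the (x, z)-plane
# about the z-axis.  Pappus predicts Vol(E) = 2*pi^2 * r0^2 * d.
d, r0 = 2.0, 0.5
N = 400
h = 2 * r0 / N
volume_num = 0.0   # Vol(E) = 2*pi * int_R x dx dz, via a midpoint grid
for i in range(N):
    x = d - r0 + (i + 0.5) * h
    for j in range(N):
        z = -r0 + (j + 0.5) * h
        if (x - d) ** 2 + z ** 2 <= r0 ** 2:
            volume_num += 2 * math.pi * x * h * h

pappus = 2 * math.pi ** 2 * r0 ** 2 * d
assert abs(volume_num - pappus) / pappus < 2e-2
```

The residual error here comes entirely from the grid cells straddling the boundary circle; refining the grid shrinks it.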
Prove that Vol(E) = 2π Area(R) x̄. This formula is called Pappus's centroid theorem, which is attributed to Pappus of Alexandria (c. 290 – c. 350). Suggestion: We have Vol(E) = ∫_{ℝ³} χ_E(x, y, z) dx dy dz. Switch to polar coordinates in (x, y).
9. (Yet another proof of the FTA) In this problem, we give a polar coordinates based proof of the Fundamental Theorem of Algebra [188].
(i) Given ρ > 0, prove that for any smooth function f : 𝔹²(ρ) → ℝ, we have
\[
(7.52) \qquad \int_{\mathbb{B}^2(\rho)} (\partial_x^2 f + \partial_y^2 f)\, dx\, dy = \rho \int_0^{2\pi} \partial_\rho f\, d\theta,
\]
where ∂_ρ f is the partial derivative of the function f(ρ cos θ, ρ sin θ) with respect to ρ. Suggestion: Look up the polar coordinates formula for ∂_x² + ∂_y² (which is called the Laplacian).
Idea of the proof: Now let p(z) be a polynomial of degree n ≥ 1 with complex coefficients, and with leading coefficient equal to 1. We shall prove that p(z) must have a zero. As with other proofs of the FTA, we proceed by contradiction, so assume that p(z) ≠ 0 for all z ∈ ℂ. Define f : ℝ² → ℝ by
\[
f(x_1, x_2) = \log |p(x_1 + i x_2)|^2 \quad \text{for all } (x_1, x_2) \in \mathbb{R}^2;
\]
by assumption, p is never zero, so f is a smooth function on ℝ². We shall work out both sides of the formula (7.52) for this particular f. We shall prove that the left-hand side is zero, while (for ρ sufficiently large) the right-hand side is not, a contradiction!
(ii) Assume the Cauchy-Riemann equations (proved in Problem 6 of Exercises 7.5): ∂_{x₂} p = i ∂_{x₁} p. Taking complex conjugates, we also have ∂_{x₂} p̄ = −i ∂_{x₁} p̄, where the bar over p (in p̄) denotes complex conjugation. Using these equations, prove that ∂²_{x₁} f = −∂²_{x₂} f. This shows that the left-hand side of (7.52) is zero. Suggestion: Recall that |p|² = p p̄, so f = log(p p̄).
(iii) Since the leading coefficient of p(z) is one, we know that p(z) = z^n + p₀(z), where p₀(z) is a polynomial in z of degree at most n − 1. Using this fact, prove that with z = x₁ + ix₂ where (x₁, x₂) ∈ ℝ²,
\[
q(x_1, x_2) := |p(z)|^2 - (x_1^2 + x_2^2)^n
\]
is a polynomial in the variables x₁ and x₂ of degree at most 2n − 1. Setting x₁ = ρ cos θ and x₂ = ρ sin θ, conclude that
\[
|p(z)|^2 = \rho^{2n} + q(\rho, \theta),
\]
where we write q(ρ, θ) for q(ρ cos θ, ρ sin θ). Note that q(ρ, θ) is a polynomial in ρ, cos θ, and sin θ of degree at most 2n − 1.
(iv) Using that f = log |p|² = log(ρ^{2n} + q(ρ, θ)), prove that
\[
g(\rho, \theta) := \rho\, \partial_\rho f - 2n
\]
is bounded by a constant times ρ^{-1} for ρ ≥ 1.
(v) Prove that
\[
\lim_{\rho \to \infty} \int_0^{2\pi} \rho\, \partial_\rho f\, d\theta = 4\pi n.
\]
This shows that the right-hand side of (7.52) is nonzero for ρ sufficiently large.
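The limit in part (v) can be watched numerically. The sketch below (our illustration; the sample polynomial p(z) = z² + 1 is our choice, so n = 2 and the limit should be 4πn = 8π) evaluates ρ ∂_ρ f by a central finite difference and integrates over θ by the midpoint rule at a large radius.

```python
import cmath
import math

# Sample polynomial p(z) = z^2 + 1 (never zero would fail here, but on the
# circle |z| = rho with rho large it is nonvanishing, which is all we need).
def p(z):
    return z * z + 1

def f(rho, theta):
    # f = log |p|^2 evaluated at the point rho * e^{i*theta}
    return math.log(abs(p(rho * cmath.exp(1j * theta))) ** 2)

rho, eps, N = 1e4, 1e-3, 4000
h = 2 * math.pi / N
# rho * d/d_rho f via central differences, integrated over [0, 2*pi)
integral = sum(
    rho * (f(rho + eps, (k + 0.5) * h) - f(rho - eps, (k + 0.5) * h)) / (2 * eps)
    for k in range(N)
) * h

assert abs(integral - 8 * math.pi) < 1e-3   # 4*pi*n with n = 2
```

Part (iv) explains the rapid convergence: the integrand is 2n + O(ρ^{-1}) uniformly in θ, so the integral is within O(ρ^{-1}) of 4πn.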
Notes and references on Chapter 7

§7.1: We have in some sense been unfair to the Riemann integral, for there is actually an iterated integral theorem that holds for the Riemann integral, but fails for the Lebesgue integral! This theorem is called the Fichtenholz-Lichtenstein Theorem [181], [98], after Grigori Fichtenholz (1888–1959) and Leon Lichtenstein (1878–1933), and states the following: Let f : [a, b] × [c, d] → ℝ be a bounded function on a compact rectangle and suppose that the Riemann integrals
\[
\int_a^b f(x, y)\, dx \quad \text{and} \quad \int_c^d f(x, y)\, dy
\]
exist for each y ∈ [c, d] and each x ∈ [a, b], respectively. Then the iterated Riemann integrals exist and are equal:
\[
\int_a^b \int_c^d f(x, y)\, dy\, dx = \int_c^d \int_a^b f(x, y)\, dx\, dy.
\]
That is, given that the inside Riemann integrals exist, the outside Riemann integrals exist and the iterated integrals are equal. (For the proof of the Fichtenholz-Lichtenstein theorem, follow the proof in Problem 7 in Exercises 6.2.) It may come as a surprise, but Waclaw Sierpiński (1882–1969) in 1920 [260] showed that if we assume the Continuum Hypothesis (discussed briefly in Section ?), then the Fichtenholz-Lichtenstein Theorem is false for the Lebesgue integral! It turns out that under the Continuum Hypothesis, there is a well-ordering^{23} ≼ on the interval [0, 1] such that for every a ∈ [0, 1], the set of b ≼ a is countable. Let
\[
A = \{ (x, y) \in [0, 1]^2 \;;\; x \preceq y \} \subseteq [0, 1]^2 \quad \text{and let} \quad f = \chi_A : [0, 1]^2 \to \mathbb{R}.
\]
Observe that fixing y ∈ [0, 1], the y-section f_y(x) is zero except on a countable set (namely, those x's with x ≼ y, which form a countable set). Thus, f_y is Borel integrable and ∫₀¹ f(x, y) dx = 0. In particular, we have
\[
\int_0^1 \int_0^1 f(x, y)\, dx\, dy = 0.
\]
^{23} A well-ordering ≼ on a set X is a relation on X such that each nonempty subset A ⊆ X has a minimum element. By minimum element we mean an element b ∈ A such that for all a ∈ A, b ≼ a, and if a ≼ b, then we have a = b. Now you may remember from a footnote in Section ? that ℵ₁ denotes the next larger cardinality than ℵ₀, the cardinality of ℕ. Any set X with cardinality ℵ₁ has the following interesting property: There is a well-ordering ≼ on X such that for every a ∈ X, the set of b ≼ a is countable; see [126, p. 30]. Thus, there is a well-ordering of [0, 1] with this same property if we assume the Continuum Hypothesis, that the cardinality of ℝ (and hence of [0, 1]) is ℵ₁.
On the other hand, fixing x ∈ [0, 1], the x-section f_x(y) equals one except on a countable set (namely, f_x(y) = 1 except at those y's such that y ≼ x and y ≠ x). Thus, f_x is Borel measurable and ∫₀¹ f(x, y) dy = 1. Hence,
\[
\int_0^1 \int_0^1 f(x, y)\, dy\, dx = 1.
\]
Thus, the iterated integrals are not equal! Of course, the failure here of Fubini's theorem is that the function f : [0, 1]² → ℝ is not Borel (not even Lebesgue) measurable.

§7.2: In Lebesgue's 1902 thesis [171], he gave two definitions of the Lebesgue integral: first a geometric definition in terms of the signed area under the graph of a function, and second a definition via partitioning the range of the function, the one Lebesgue originally gave in his seminal 1901 paper [165] (and the approach we have emphasized in our book). In Problem 8 of Exercises 7.2 we described "Lebesgue's geometric definition;" here's what Lebesgue had to say about it [171, p. 18–20] (translation from [120, p. 1059–1061]):^{24}

§15. From a geometric perspective, the problem of integration can be posed as follows: For a curve C given by the equation y = f(x) (f being a continuous positive function, and the axes rectilinear), find the area of the domain bounded by an arc of C, a segment of 0x and the two lines parallel to the y axis for given abscissa values a and b, where a < b. . . . Of whatever sign the function f is, corresponding to it we shall have the set E of points whose coordinates satisfy the following three inequalities:
\[
a \le x \le b, \qquad y\, f(x) \ge 0, \qquad 0 \le y^2 \le (f(x))^2.
\]
The set E is the sum of two sets E₁ and E₂ formed by the points with positive y-axis values for E₁ and negative values for E₂ (∗). The deficient integral is the interior extension of E₁ less the exterior extension of E₂; the excess integral is the exterior extension of E₁ less the interior extension of E₂. If E is (J) measurable (in which case E₁ and E₂ are as well), the function is integrable, the integral being m(E₁) − m(E₂).

§17. These results immediately suggest the following generalization: if the set E is measurable (in which case E₁ and E₂ are as well), we shall call the quantity m(E₁) − m(E₂) the definite integral of f between a and b. The corresponding functions f will be said to be summable.

(∗) It scarcely matters whether the points on the x-axis be considered as part of E₁ or E₂.

§7.3: Reading Hawkins' book [121, p. 160], I found it very interesting to read that Fubini's theorem was actually first stated by Beppo Levi (1875–1961) as a footnote in his 1906 paper "Sul principio di Dirichlet" [178, p. 322]. In Levi's footnote he reviews some history on iterated integration for Riemann's integral, then he states "Fubini's theorem" (not discovered by Fubini until a year later!), but he says that he didn't want to use

^{24} Here we corrected a mistake Lebesgue missed (yes, even the masters do make mistakes!) He originally defined the set E as the set of points (x, y) with a ≤ x ≤ b, x f(x) ≥ 0, and 0 ≤ y² ≤ (f(x))²; the x f(x) should have been y f(x). He also wrote \overline{f(x)}^2, instead of (f(x))², to emphasize that f(x) is being squared, but today a bar over a number usually denotes complex conjugation. Also, the interior/exterior extensions Lebesgue is talking about refer to inner/outer Jordan content, and (J) measurability refers, of course, to Jordan measurability.
"Fubini's theorem" because he wanted to use Riemann integration in his work. Here's Levi's footnote:

A detailed argument concerning surface integration and double integration, in the sense of Riemann, was made by Pringsheim [Zur Theorie des Doppel-Integrals. München Ber., XXVIII (1898), pp. 59–74 — Zur Theorie des Doppel-Integrals, des Green'schen und Cauchy'schen Integralsatzes. München Ber., XXIX (1899), pp. 38–62]. From the formula^{25}
\[
\int_{x_0}^{x_1} \int_{y_0}^{y_1} f\, dx\, dy
  = \int_{y_0}^{y_1} dy\, \underline{\int}_{\,x_0}^{\,x_1} f\, dx
  = \int_{y_0}^{y_1} dy\, \overline{\int}_{\,x_0}^{\,x_1} f\, dx,
\]
which he establishes in the first note, he remarks (in the second note, §1) that from the existence of the double integral, it follows that the ordinary integrals [∫_{x₀}^{x₁} f dx] exist on a dense set of lines y = const., and that the upper and lower integrals [the two inner integrals above] differ by more than any given ε only on a subset of measure zero in the sense of Jordan. The formula above of Pringsheim has been extended by Mr. Lebesgue to his integral [Intégrale, etc., loc. cit., nᶦ 37–38, p. 276 and following]: The same observations could then be repeated with the simplification that, with the adoption of his definition of measure, one could certainly conclude that the integrals exist over every line y = const., with the exception of the lines belonging to a set of measure zero (see also Vitali, Sulle funzioni ad integrali nullo [this same Rendiconti, t. XX (1905), pp. 136–141]). Disregarding the collection of these lines (which is possible because it has two-dimensional measure zero) the two-dimensional Lebesgue integral can therefore be achieved with two successive integrations. — These observations of a general nature could partially replace those of this paper, but I have preferred to express my results as a special case in order to stay within the theory of the Riemann integral.
Too bad Levi didn't provide some type of proof of his statements, for then we would be calling "Fubini's Theorem" the "Levi Theorem".

^{25} Translator footnote: The bars under and over the inner integrals represent lower and upper Riemann integrals.

§7.4: In most calculus books there are geometric arguments on why the change of variables formula (7.24) should hold by considering how small rectangles in the uv-plane get mapped into curved parallelograms in the xy-plane, and it is good for you to review these common arguments. What is not so common is to find Euler's proof of the 2-dimensional change of variables formula (7.24). Because Euler's proof is quite clever and not so long, we shall give (in spirit, although not exactly) Euler's original proof of the change of variables formula (7.24), found on pages 88–90 in [90]. We shall omit the function f in (7.24) in all the calculations below, and we shall write all double integrals as iterated integrals. Let x = F₁(u, v) and y = F₂(u, v). Assume that we can solve the equation x = F₁(u, v) for u in terms of x and v, say u = G₁(x, v). Then,
\[
(7.53) \qquad y = F_2(u, v) = F_2(G_1(x, v), v), \quad \text{or} \quad y = G_2(x, v),
\]
where G₂(x, v) = F₂(G₁(x, v), v). Now if x is held fixed, then y = G₂(x, v) is a function of v only, so the usual one-dimensional change of variables says that
\[
\int dy = \int \frac{\partial G_2}{\partial v}(x, v)\, dv.
\]
(Recall that there should be a function f here, but we are omitting it for simplicity.) Thus,
\[
(7.54) \qquad \iint dx\, dy = \iint dy\, dx = \iint \frac{\partial G_2}{\partial v}(x, v)\, dv\, dx = \iint \frac{\partial G_2}{\partial v}(x, v)\, dx\, dv.
\]
For the inside integral in (7.54), the variable v is held fixed, so x = F₁(u, v) can be considered a function only of u; therefore, by the usual one-dimensional change of variables formula, the inner integral in (7.54) equals
\[
\int \frac{\partial G_2}{\partial v}(x, v)\, dx = \int \frac{\partial G_2}{\partial v}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v)\, du.
\]
Hence,
\[
\iint dx\, dy = \iint \frac{\partial G_2}{\partial v}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v)\, du\, dv.
\]
Now we just have to figure out what this complicated expression is! To do so, notice that from (7.53) we have G₂(F₁(u, v), v) = G₂(x, v) = y = F₂(u, v), or
\[
(7.55) \qquad G_2(F_1(u, v), v) = F_2(u, v).
\]
Taking the partial derivative of both sides of (7.55) with respect to u, we get (using the chain rule)
\[
(7.56) \qquad \frac{\partial G_2}{\partial x}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v) = \frac{\partial F_2}{\partial u}(u, v),
\]
and then taking the partial derivative of both sides of (7.55) with respect to v, we get (again using the chain rule)
\[
(7.57) \qquad \frac{\partial G_2}{\partial x}(F_1(u, v), v)\, \frac{\partial F_1}{\partial v}(u, v) + \frac{\partial G_2}{\partial v}(F_1(u, v), v) = \frac{\partial F_2}{\partial v}(u, v).
\]
Multiplying both sides of (7.57) by \frac{\partial F_1}{\partial u}(u, v) and then solving the resulting equation for \frac{\partial G_2}{\partial v}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v), we see that
\[
\frac{\partial G_2}{\partial v}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v)
  = \frac{\partial F_1}{\partial u}(u, v)\, \frac{\partial F_2}{\partial v}(u, v)
  - \frac{\partial F_1}{\partial u}(u, v)\, \frac{\partial G_2}{\partial x}(F_1(u, v), v)\, \frac{\partial F_1}{\partial v}(u, v).
\]
By (7.56), the last term equals -\frac{\partial F_2}{\partial u}(u, v)\, \frac{\partial F_1}{\partial v}(u, v). Thus,
\[
\frac{\partial G_2}{\partial v}(F_1(u, v), v)\, \frac{\partial F_1}{\partial u}(u, v)
  = \frac{\partial F_1}{\partial u}(u, v)\, \frac{\partial F_2}{\partial v}(u, v)
  - \frac{\partial F_2}{\partial u}(u, v)\, \frac{\partial F_1}{\partial v}(u, v).
\]
The right-hand side is exactly the Jacobian J_F. Hence, we obtain
\[
\iint dx\, dy = \iint J_F\, du\, dv,
\]
which is the change of variables formula.^{26}
^{26} The resulting formula should be \iint dx\, dy = \iint |J_F|\, du\, dv, with an absolute value around J_F. However, we have been very imprecise concerning signs in the above argument. For example, consider the equality \int dy = \int \frac{\partial G_2}{\partial v}(x, v)\, dv near the beginning of Euler's argument. If \frac{\partial G_2}{\partial v}(x, v) is negative, then we really should write this equality as \int dy = -\int \frac{\partial G_2}{\partial v}(x, v)\, dv. If we kept track of the negative signs, the final formula would be \iint dx\, dy = \iint |J_F|\, du\, dv.
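The change of variables formula lends itself to a quick numerical illustration (ours, not part of the text). Take the polar map F(u, v) = (u cos v, u sin v), whose Jacobian is J_F = u, and a nontrivial integrand f(x, y) = x² + y² over the annulus a ≤ ‖(x, y)‖ ≤ b: the right-hand side is ∫₀^{2π} ∫_a^b u² · u du dv, which should equal the known value π(b⁴ − a⁴)/2 of the left-hand side. The values of a and b are arbitrary choices.

```python
import math

# Change of variables with the polar map F(u, v) = (u*cos v, u*sin v),
# |J_F| = u, and f(x, y) = x^2 + y^2 on the annulus a <= r <= b:
#   int int f dx dy = int_0^{2pi} int_a^b u^2 * u du dv = pi*(b^4 - a^4)/2.
a, b = 0.5, 1.5
N = 2000
hu = (b - a) / N
rhs = 0.0
for i in range(N):
    u = a + (i + 0.5) * hu
    rhs += (u ** 2) * u * hu        # f(F(u, v)) * |J_F|, independent of v
rhs *= 2 * math.pi                   # the v-integral contributes 2*pi

exact = math.pi * (b ** 4 - a ** 4) / 2
assert abs(rhs - exact) < 1e-6
```

Note the integrand on the right is independent of v, which is what collapses the double integral to a one-dimensional midpoint rule here.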
Bibliography
1. William J. Adams, The life and times of the central limit theorem, Kaedmon Publishing Co., New York, 1974.
2. Robert Adrian, Research concerning the probabilities of the errors which happen in making observations, etc., The Analyst, or Mathematical Museum I (1808), no. 4, 93–109.
3. Martin Aigner and Günter M. Ziegler, Proofs from The Book, third ed., Springer-Verlag, Berlin, 2004. Including illustrations by Karl H. Hofmann.
4. R. C. Archibald, A rare pamphlet of Moivre and some of his discoveries, Isis 8 (1926), no. 4, 671–683.
5. Archimedes, The works of Archimedes, Dover Publications Inc., Mineola, NY, 2002. Reprint of the 1897 edition and the 1912 supplement, edited by T. L. Heath.
6. C. Arzelà, Sulla integrabilità di una serie di funzioni, Rom. Acc. L. (4) I. (1885), 321–326.
7. ———, Sulla integrazione per serie, Rom. Acc. L. (4) I. (1885), 532–537, 566–569.
8. ———, Un teorema intorno alle serie di funzioni, Rom. Acc. L. (4) I. (1885), 262–267.
9. Robert B. Ash, A primer of abstract mathematics, Classroom Resource Materials Series, Mathematical Association of America, Washington, DC, 1998.
10. John Ashton, A history of English lotteries now for the first time written, Leadenhall Press, London, 1893. Available at http://www.archive.org/details/cu31924030209260.
11. Joe Dan Austin, The birthday problem revisited, Math. Mag. 7 (1976), no. 4, 39–42.
12. Raymond Ayoub, Euler and the zeta function, Amer. Math. Monthly 81 (1974), no. 10, 1067–1086.
13. R. Baire, Sur les fonctions discontinues qui rattachent aux fonctions continues, CR Acad. Sci. Paris 126 (1898), 1621–1623.
14. ———, Sur les fonctions de variables réelles, Ann. Math. 3 (1899), no. 3, 1–122.
15. St. Banach, Sur l'équation fonctionnelle f(x + y) = f(x) + f(y), Fund. Math. 1 (1920), 123–124.
16. ———, Sur le problème de la mesure, Fund. Math. 4 (1923), 7–33.
17. St. Banach and A. Tarski, Sur la décomposition des ensembles de points en parties respectivement congruentes, Fund. Math. 6 (1924), 244–277.
18. John Beam, Probabilistic expectations on unstructured spaces, preprint, 2007.
19. ———, Unfair gambles in probability, Statist. Probab. Lett. 77 (2007), no. 7, 681–686.
20. Verónica Becher and Santiago Figueira, An example of a computable absolutely normal number, Theoret. Comput. Sci. 270 (2002), no. 1-2, 947–958.
21. Jordan Bell, On the sums of series of reciprocals. Available at http://arxiv.org/abs/math/0506415. Originally published as De summis serierum reciprocarum, Commentarii academiae scientiarum Petropolitanae 7 (1740), 123–134, and reprinted in Leonhard Euler, Opera Omnia, Series 1: Opera mathematica, Volume 14, Birkhäuser, 1992. Original text, numbered E41, is available at the Euler Archive, http://www.eulerarchive.org.
22. D. R. Bellhouse, The Genoese lottery, Statist. Sci. 6 (1991), no. 2, 141–148.
23. S. K. Berberian, The product of two measures, Amer. Math. Monthly 69 (1962), 961–968.
24. S. K. Berberian and J. F. Jakobsen, A note on Borel sets, Amer. Math. Monthly 70 (1963), 55.
25. S. J. Bernau, The evaluation of a Putnam integral, Amer. Math. Monthly 95 (1988), no. 10, 935.
26. S. N. Bernstein, Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités, Comm. Soc. Math. Kharkov 13 (1912/13), 1–2. Available at http://www.focm.net/at/HAT/papers.html.
27. Geoffrey C. Berresford, The uniformity assumption in the birthday problem, Math. Mag. 53 (1980), no. 5, 286–288. 28. Frits Beukers, Johan A. C. Kolk, and Eugenio Calabi, Sums of generalized harmonic series and volumes, Nieuw Arch. Wisk. (4) 11 (1993), no. 3, 217–224. 29. K. P. S. Bhaskara Rao and M. Bhaskara Rao, Theory of charges, Pure and Applied Mathematics, vol. 109, Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York, 1983, A study of finitely additive measures, With a foreword by D. M. Stone. 30. Patrick Billingsley, Probability and measure, third ed., Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons Inc., New York, 1995, A Wiley-Interscience Publication. 31. N. H. Bingham, Studies in the history of probability and statistics. XLVI. Measure into probability: from Lebesgue to Kolmogorov, Biometrika 87 (2000), no. 1, 145–156. 32. David Blackwell and Persi Diaconis, A non-measurable tail set, Statistics, probability and game theory, IMS Lecture Notes Monogr. Ser., vol. 30, Inst. Math. Statist., Hayward, CA, 1996, pp. 1–5. 33. L. M. Blumenthal, “A paradox, a paradox, a most ingenious paradox.”, Amer. Math. Monthly 47 (1940), 346–353. 34. Ralph P. Boas, Jr., Lion hunting & other mathematical pursuits, The Dolciani Mathematical Expositions, vol. 15, Mathematical Association of America, Washington, DC, 1995, A collection of mathematics, verse and stories, Edited and with an introduction by Gerald L. Alexanderson and Dale H. Mugler. ´ Borel, Un th´ 35. E. eor` eme sur les ensembles mesurables., C. R. 137 (1904), 966–967 (French). 36. E. Borel, Les probabilit´ es denombrables et leurs applications arithm´ etiques, Palermo Rend. 27 (1909), 247–271. ´ 37. Emile Borel, Le¸cons sur la th´ eorie des fonctions, Gauthier-Villars, Paris, 1898. , Les paradoxes de l’infini, third ed., Gallimard, Paris, 1949. 38. ´ 39. , Œvres de Emile borel, Centre National de la Recherche Scientifique, Paris, 1972. 40. 
Truman Botts, Probability theory and the lebesgue integral, Math. Mag. 42 (1969), no. 3, 105–111. ´ ements de math´ 41. N. Bourbaki, El´ ematique. I: Les structures fondamentales de l’analyse. Livre IV: Fonctions d’une variable r´ eelle (th´ eorie ´ el´ ementaire). Chapitres 1, 2 et 3: D´ eriev´ ees. Primitives et int´ egrales. Fonctions ´ el´ ementaires, Deuxi` eme ´ edition. Actualit´ es Sci. Indust., No. 1074, Hermann, Paris, 1958. 42. Nicolas Bourbaki, Elements of the history of mathematics, Springer-Verlag, Berlin, 1994, Translated from the 1984 French original by John Meldrum. 43. David M. Bradley, Representations of Catalan’s constant, preprint, 2001. 44. Robert E. Bradley, Lawrence A. D’Antonio, and C. Edward Sandifer (eds.), Euler at 300, MAA Spectrum, Mathematical Association of America, Washington, DC, 2007, An appreciation. 45. James W. Brown, The beta-gamma function identity, Amer. Math. Monthly 68 (1961), no. 2, 165. 46. R. Creighton Buck, The measure theoretic approach to density, Amer. J. Math. 68 (1946), 560–580. 47. F.P. Cantelli, Sulla probabilit` a come limite della frequenza, Atti Accad. Naz. Lincei 26 (1917), 39–45. 48. G. Cantor, Fondements d’une th´ eorie g´ en´ erale des ensembles, Acta Mathematica 2 (1883), no. 1, 381–408. ¨ , Uber unendliche, lineare punktmannigfaltigkeiten v, Math Ann. 21 (1883), 545–591. 49. 50. , De la puissance des ensembles parfaits des points. (extrait d’une lettre adress´ ee ` a l’´ editeur), Acta Math. 4 (1884), 381–392. 51. Georg Cantor, Gesammelte Abhandlungen mathematischen und philosophischen Inhalts, Springer-Verlag, Berlin, 1980, Reprint of the 1932 original. 52. C. Carath´ eodory, Vorlesungen u ¨ber reelle funktionen., Leipzig und Berlin: B. G. Teubner, X u. 704 S. 8◦ , 1918 (German). 53. Constantin Carath´ eodory, Vorlesungen u ¨ber reelle Funktionen, Third (corrected) edition, Chelsea Publishing Co., New York, 1968. 54. Donald R. Chalice, A characterization of the Cantor function, Amer. Math. 
Monthly 98 (1991), no. 3, 255–258.
55. D.G. Champernowne, The construction of decimals normal in the scale of ten,, J. London Math. Soc. 8 (1933), 254–260. 56. P. L. Chebyshev, On the mean values, J. Math. Pure. Appl. 12 (1867), 177–184, In Russian. English translation in D. E. Smith, A source book in mathematics. Vol. 1, 2., Dover, New York, 1959. 57. Frank A. Chimenti, Pascal’s wager: a decision-theoretic approach, Math. Mag. 63 (1990), no. 5, 321–325. P∞ 2 2 58. Boo Rim Choe, An elementary proof of n=1 1/n = π /6, Amer. Math. Monthly 94 (1987), no. 7, 662–663. 59. J. R. Choksi, Vitali’s convergence theorem on term by term integration, Enseign. Math. (2) 47 (2001), no. 3-4, 269–285. 60. K. L. Chung and P. Erd¨ os, On the application of the Borel-Cantelli lemma, Trans. Amer. Math. Soc. 72 (1952), 179–186. 61. Arthur H. Copeland and Paul Erd¨ os, Note on normal numbers, Bull. Amer. Math. Soc. 52 (1946), 857–860. 62. E. Czuber, Wahrscheinlichkeitstheorie, fehlerausgleichung, kollektivmaßlehre, second ed., B. G. Teubner, Leipzig, 1908. 63. Janusz Czy˙z, Paradoxes of measures and dimensions originating in Felix Hausdorff ’s ideas, World Scientific Publishing Co. Inc., River Edge, NJ, 1994. 64. P.J. Daniell, Integrals in an infinite number of dimensions, Ann. of Math. 194 (1919), no. 20, 281–288. ´ 65. G. Darboux, M´ emoire sur les fonctions discontinues, Annales scientifiques de l’E.N.S. 4 (1875), 57–112, available at http://archive.numdam.org/. 66. R.B. Darst, Some Cantor sets and Cantor functions, Amer. Math. Monthly 45 (1972), no. 1, 2–7. 67. Joseph Warren Dauben, Georg Cantor, Princeton University Press, Princeton, NJ, 1990, His mathematics and philosophy of the infinite. 68. R. H. Daw, Why the normal distribution, Journal of the institute of Actuaries Students Society 18 (1972), no. 1, 2–25. 69. R. H. Daw and E. S. Pearson, Studies in the history of probability and statistics. XXX. Abraham de Moivre’s 1733 derivation of the normal curve: a bibliographical note, Biometrika 59 (1972), 677–680. 70. 
Bruno de Finetti, Theory of probability. Vol. 1, Wiley Classics Library, John Wiley & Sons Ltd., Chichester, 1990, A critical introductory treatment, Translated from the Italian and with a preface by Antonio Mach`ı and Adrian Smith, With a foreword by D. V. Lindley, Reprint of the 1974 translation. 71. Abraham de Moivre, The doctrine of chances: A method of calculating the probabilities of events in play, third ed., Printed for A. Millar, in the Strand, London, 1756, Available at http://www.ibiblio.org/chance/. 72. Joseph L. Doob, The development of rigor in mathematical probability (1900–1950) [in development of mathematics 1900–1950 (luxembourg, 1992), 157–170, Birkh¨ auser, Basel, 1994; MR1298633 (95i:60001)], Amer. Math. Monthly 103 (1996), no. 7, 586–595. 73. R. E. Dressler and K. R. Stromberg, The Tonelli integral, Amer. Math. Monthly 81 (1974), 67–68. R 74. P. du Bois-Reymond, Der beweis des fundamentalsatzes der integralrechnung: ab f ′ (x) dx = f (b) − f (a), Math. Ann. 16 (1880), no. 1, 115–128. 75. , Die allgemeine funktionentheorie, H. Laupp, T¨ ubingen, 1882. 76. R. M. Dudley, Real analysis and probability, Cambridge Studies in Advanced Mathematics, vol. 74, Cambridge University Press, Cambridge, 2002, Revised reprint of the 1989 original. 77. William Dunham, Euler: the master of us all, The Dolciani Mathematical Expositions, vol. 22, Mathematical Association of America, Washington, DC, 1999. 78. E. B. Dynkin, Theory of Markov processes, Dover Publications Inc., Mineola, NY, 2006, Translated from the Russian by D. E. Brown and edited by T. K¨ ov´ ary, Reprint of the 1961 English translation. 79. Gerald A. Edgar (ed.), Classics on fractals, Studies in Nonlinearity, Westview Press. Advanced Book Program, Boulder, CO, 2004. 80. A. W. F. Edwards, Pascal and the problem of points, Internat. Statist. Rev. 50 (1982), no. 3, 259–266.
BIBLIOGRAPHY
81. ———, Pascal's problem: the "gambler's ruin", Internat. Statist. Rev. 51 (1983), no. 1, 73–79.
82. D. Egorov, Sur les suites de fonctions mesurables, C. R. Acad. Sci. Paris 152 (1911), 244–246.
83. Saber Elaydi, An introduction to difference equations, third ed., Undergraduate Texts in Mathematics, Springer, New York, 2005.
84. P. Erdős and A. Rényi, On Cantor's series with convergent ∑ 1/q_n, Ann. Univ. Sci. Budapest. Eötvös. Sect. Math. 2 (1959), 93–109.
85. N. Etemadi, An elementary proof of the strong law of large numbers, Z. Wahrsch. Verw. Gebiete 55 (1981), no. 1, 119–122.
86. Leonhard Euler, De summatione innumerabilium progressionum (the summation of an innumerable progression), Commentarii academiae scientiarum Petropolitanae 5 (1738), 91–105, Presented to the St. Petersburg Academy on March 5, 1731. Published in Opera Omnia: Series 1, Volume 14, pp. 25–41, Eneström Index is E20, and is available at EulerArchive.org.
87. ———, De summis serierum reciprocarum (on the sums of series of reciprocals), Commentarii academiae scientiarum Petropolitanae 7 (1740), 115–127, Read in the St. Petersburg Academy on December 5, 1735. Published in Opera Omnia: Series 1, Volume 14, pp. 73–86, Eneström Index is E41, and is available at EulerArchive.org.
88. ———, Demonstration de la somme de cette suite 1 + 1/4 + 1/9 + 1/16 . . . (demonstration of the sum of the following series: 1 + 1/4 + 1/9 + 1/16 . . .), Journ. lit. d'Allemagne, de Suisse et du Nord 2 (1743), 115–127, Published in Opera Omnia: Series 1, Volume 14, pp. 177–186, Eneström Index is E63, and is available at EulerArchive.org.
89. ———, Institutionum calculi integralis volumen primum (foundations of integral calculus, volume 1), Impensis academiae imperialis scientiarum, St. Petersburg, 1768, Opera Omnia: Series 1, Volume 11.
90. ———, De formulis integralibus duplicatis (on double integral formulas), Novi Commentarii academiae scientiarum Petropolitanae 14 (1770), 72–103, Published in Opera Omnia: Series 1, Volume 17, pp. 289–315, Eneström Index is E391, and is available at EulerArchive.org.
91. ———, Introductio in analysin infinitorum, volume 1 (introduction to analysis of the infinite. Book II), Springer-Verlag, New York, 1990, Translated from the Latin and with an introduction by John D. Blanton.
92. Ernest Fandreyer, A new proof of the theorem that every algebraic rational integral function in one variable can be resolved into real factors of the first or the second degree, http://www.fsc.edu/library/archives/manuscripts/gauss.cfm.
93. P. Fatou, Séries trigonométriques et séries de Taylor, Acta Math. 30 (1906), 335–400.
94. William Feller, An introduction to probability theory and its applications. Vol. I, Third edition, John Wiley & Sons Inc., New York, 1968.
95. José Ferreirós, Labyrinth of thought, second ed., Birkhäuser Verlag, Basel, 2007, A history of set theory and its role in modern mathematics.
96. Richard Feynman, Michelle Feynman, and Timothy Ferris, Perfectly reasonable deviations from the beaten track: The letters of Richard P. Feynman, Basic Books, New York, NY, 2005.
97. Richard Feynman, Ralph Leighton, Edward Hutchings, and Albert R. Hibbs, Surely you're joking, Mr. Feynman! (adventures of a curious character), W. W. Norton & Company, New York, NY, 1985.
98. Grigori Fichtenholz, Un théorème sur l'intégration sous le signe intégrale, Rend. Circ. Mat. Palermo 36 (1913), 111–114.
99. M. Fréchet, Sur l'intégrale d'une fonctionnelle étendue à un ensemble abstrait, Bull. Soc. Math. France 43 (1915), 249–267.
100. ———, Des familles et fonctions additives d'ensembles abstraits, Fund. Math. 5 (1924), 206–251.
101. G. Fubini, Sugli integrali multipli, Rom. Acc. L. Rend. (5) 16 (1907), 608–614.
102. Francis Galton, Natural inheritance, Macmillan, London, 1889, Available at http://www.archive.org.
103. I. Gelfand, D. Raikov, and G. Shilov, Commutative normed rings, Translated from the Russian, with a supplementary chapter, Chelsea Publishing Co., New York, 1964.
104. Constantine Georgakis, A note on the Gaussian integral, Math. Mag. 67 (1994), no. 1, 47.
105. B.V. Gnedenko, Theory of probability, Translated from the Russian by B. D. Seckler, second ed., Chelsea Publishing Company, 1962.
106. Samuel Goldberg, Introduction to difference equations, second ed., Dover Publications Inc., New York, 1986, With illustrative examples from economics, psychology, and sociology.
107. Gerald S. Goodman, Statistical independence and normal numbers: an aftermath to Mark Kac's Carus Monograph, Amer. Math. Monthly 106 (1999), no. 2, 112–126.
108. I. Grattan-Guinness, From the calculus to set theory, 1630–1910, Princeton Paperbacks, Princeton University Press, Princeton, NJ, 2000, An introductory history, Edited by I. Grattan-Guinness, Reprint of the 1980 original.
109. Ronald B. Guenther and John W. Lee, Partial differential equations of mathematical physics and integral equations, Dover Publications Inc., Mineola, NY, 1996, Corrected reprint of the 1988 original.
110. E. Hairer and G. Wanner, Analysis by its history, Undergraduate Texts in Mathematics, Springer-Verlag, New York, 1996, Readings in Mathematics.
111. P. R. Halmos, The foundations of probability, Amer. Math. Monthly 51 (1944), 493–510.
112. Paul R. Halmos, Measure Theory, D. Van Nostrand Company, Inc., New York, N. Y., 1950.
113. G. Hamel, Eine basis aller zahlen und die unstetigen lösungen der funktionalgleichung f(x + y) = f(x) + f(y), Math. Ann. 60 (1905), 459–462.
114. Richard Hamming, The art of probability, Addison Wesley, Redwood City, CA, 1991.
115. H. Hankel, Untersuchungen über die unendlich oft oszillierenden und unstetigen functionen, presented in March 1870 at the University of Tübingen, Math. Ann. 20 (1882), 63–112.
116. Axel Harnack, Vereinfachung der beweise in der theorie der Fourier'schen reihe, Math. Ann. 19 (1882), 235–279.
117. John M. Harris, Jeffry L. Hirst, and Michael J. Mossinghoff, Combinatorics and graph theory, second ed., Undergraduate Texts in Mathematics, Springer-Verlag, New York, 2008.
118. F. Hausdorff, Bemerkung über den inhalt von punktmengen, Math. Ann. 75 (1914), 428–434.
119. ———, Grundzüge der mengenlehre, Veit and Company, Leipzig, 1914.
120. Stephen Hawking, God created the integers: The mathematical breakthroughs that changed history, Running Press, Philadelphia, PA, 2005.
121. Thomas Hawkins, Lebesgue's theory of integration, second ed., AMS Chelsea Publishing, Providence, RI, 2001, Its origins and development.
122. Thomas Heath, The works of Archimedes, Dover Publications Inc., New York, 2002.
123. Erhard Heinz, An elementary analytic theory of the degree of mapping in n-dimensional space, J. Math. Mech. 8 (1959), 231–247.
124. Ch. Hermite, Sur la formule d'interpolation de Lagrange, J. Reine Angew. Math. (1878), 70–79.
125. Horst Herrlich, Axiom of choice, Lecture Notes in Mathematics, vol. 1876, Springer-Verlag, Berlin, 2006.
126. Edwin Hewitt and Karl Stromberg, Real and abstract analysis, Springer-Verlag, New York, 1975, A modern treatment of the theory of functions of a real variable, Third printing, Graduate Texts in Mathematics, No. 25.
127. Edwin Hewitt and Herbert S. Zuckerman, Remarks on the functional equation f(x + y) = f(x) + f(y), Math. Mag. 42 (1969), no. 3, 121–123.
128. David Hilbert, Mathematical problems, Reprinted from Bull. Amer. Math. Soc. 8 (1902), 437–479, Bull. Am. Math. Soc., New Ser. 37 (2000), no. 4, 407–436.
129. T. H. Hildebrandt, On integrals related to and extensions of the Lebesgue integrals, Bull. Amer. Math. Soc. 24 (1917), no. 3, 113–144.
130. ———, On integrals related to and extensions of the Lebesgue integrals, Bull. Amer. Math. Soc. 24 (1918), no. 4, 177–202.
131. E. Hille and J. D. Tamarkin, Remarks on a Known Example of a Monotone Continuous Function, Amer. Math. Monthly 36 (1929), no. 5, 255–264.
132. Ernest William Hobson, The theory of functions of a real variable and the theory of Fourier's series, second ed., Cambridge University Press, Cambridge, 1921, 1st edition published in 1907. 2nd edition available at http://www.archive.org.
133. John G. Hocking and Gail S. Young, Topology, Addison-Wesley Publishing Co., Inc., Reading, Mass.-London, 1961.
134. Josef Hofbauer, A simple proof of 1 + 1/2² + 1/3² + · · · = π²/6 and related identities, Amer. Math. Monthly 109 (2002), no. 2, 196–200.
135. Alfred Horn and Alfred Tarski, Measures in Boolean algebras, Trans. Amer. Math. Soc. 64 (1948), 467–497.
136. Karel Hrbacek and Thomas Jech, Introduction to set theory, third ed., Monographs and Textbooks in Pure and Applied Mathematics, vol. 220, Marcel Dekker Inc., New York, 1999.
137. C. Huygens, Libellus de ratiociniis in ludo aleae, S. Keimer for T. Woodward, London, 1657, Available at http://www.stat.ucla.edu/history/.
138. Carl Jacobi, De determinantibus functionalibus, Crelle Journal für die reine und angewandte Mathematik 22 (1841), 319–359, Available at http://mathdoc.emath.fr/OEUVRES/.
139. Hans Niels Jahnke (Editor), A history of analysis, American Mathematical Society, Providence, RI, 2003.
140. E. T. Jaynes, Probability theory, Cambridge University Press, Cambridge, 2003, The logic of science, Edited and with a foreword by G. Larry Bretthorst.
141. Thomas Jech, Set theory, Academic Press [Harcourt Brace Jovanovich Publishers], New York, 1978, Pure and Applied Mathematics.
142. Thomas J. Jech, The axiom of choice, North-Holland Publishing Co., Amsterdam, 1973, Studies in Logic and the Foundations of Mathematics, Vol. 75.
143. Mark Kac, Statistical independence in probability, analysis and number theory, The Carus Mathematical Monographs, No. 12, Published by the Mathematical Association of America. Distributed by John Wiley and Sons, Inc., New York, 1959.
144. Toni Kasper, Integration in finite terms: the Liouville theory, Math. Mag. 53 (1980), no. 4, 195–201.
145. Victor J. Katz, Change of variables in multiple integrals: Euler to Cartan, Math. Mag. 55 (1982), no. 1, 3–11.
146. A.Ya. Khinchin, Sur la loi forte des grands nombres, C.R. Acad. Sci. Paris 186 (1928), 285–287.
147. ———, Sur la loi des grands nombres, C.R. Acad. Sci. Paris 188 (1929), 477–479.
148. J. F. C. Kingman and S. J. Taylor, Introduction to measure and probability, Cambridge University Press, London, 1966.
149. Takuma Kinoshita, On interchanging sums and integrals of series of functions, Bull. Fac. Educ. Univ. Kagoshima 13 (1961), 2–4.
150. Leonard F. Klosinski, G.L. Alexanderson, and Loren C. Larson, The William Lowell Putnam mathematical competition, Amer. Math. Monthly 93 (1986), no. 8, 620–626.
151. A. N. Kolmogorov, Foundations of the theory of probability, Chelsea Publishing Co., New York, 1956, Translation edited by Nathan Morrison, with an added bibliography by A. T. Bharucha-Reid. Available at http://www.kolmogorov.com/.
152. A. N. Kolmogorov, Sur la loi forte des grands nombres, C.R. Acad. Sci. Paris 191 (1930), 910–911.
153. Dimitri Kountourogiannis and Paul Loya, A derivation of Taylor's formula with integral remainder, Math. Mag. 74 (2001), no. 2, 109–122.
154. Robert G. Kuller, Coin tossing, probability, and the Weierstrass approximation theorem, Math. Mag. 37 (1964), no. 4, 262–265.
155. S.K. Lakshamana, A proof of Legendre's duplication formula, Amer. Math. Monthly 62 (1955), no. 2, 120–121.
156. E. Landau, Über die approximation einer stetigen funktion durch eine ganze rationale funktion, Rend. Circ. Mat. Palermo 25 (1908), no. 5, 337–345, Available at http://www.focm.net/at/HAT/papers.html.
157. Pierre Simon Laplace, Théorie analytique des probabilités, third ed., Courcier, Paris, 1820, Available at http://www.openlibrary.org/details/theorieanaldepro00laplrich.
158. ———, A philosophical essay on probabilities, English ed., Dover Publications Inc., New York, 1995, Translated from the sixth French edition by Frederick William Truscott and Frederick Lincoln Emory, With an introductory note by E. T. Bell. The 1902 edition available at http://www.archive.org/details/philosophicaless00lapliala.
159. P.S. Laplace, Mémoire sur la probabilité des causes par les évènemens, Savants étrangers (1774), 621–656, Reprinted in Laplace's Oeuvres 8, pp. 27–65.
160. ———, Mémoire sur les probabilités, Mém. Acad. R. Sci. Paris (1778 (1781)), 227–332, Translated by Richard J. Pulskamp and available at http://cerebro.xu.edu/math/Sources/.
161. Ákos László, The sum of some convergent series, Amer. Math. Monthly 108 (2001), no. 9, 851–855.
162. James A. LaVita, A necessary and sufficient condition for Riemann integration, Amer. Math. Monthly 71 (1964), no. 2, 193–196.
163. Peter D. Lax, Change of variables in multiple integrals, Amer. Math. Monthly 106 (1999), no. 6, 497–501.
164. H. Lebesgue, Sur l'approximation des fonctions, Bull. Sci. Math. 22 (1898), 278–287 (French), Available at http://www.focm.net/at/HAT/papers.html.
165. ———, Sur une généralisation de l'intégrale définie, C. R. 132 (1901), 1025–1028 (French).
166. ———, Sur une propriété des fonctions, C. R. 137 (1904), 1228–1230 (French).
167. ———, Contribution à l'étude des correspondances de M. Zermelo, S. M. F. Bull. 35 (1907), 202–212, available at http://archive.numdam.org/.
168. ———, Sur la méthode de M. Goursat pour la résolution de l'équation de Fredholm, S. M. F. Bull. 36 (1908), 3–19, available at http://archive.numdam.org/.
169. ———, Sur les intégrales singulières, Ann. Fac. Sci. Toulouse 3 (1909), no. 1, 25–117, available at http://archive.numdam.org/.
170. ———, Sur l'intégration des fonctions discontinues, Ann. de l'École Normale Supérieure 3 (1910), no. 27, 361–450.
171. Henri Lebesgue, Intégrale, longueur, aire, Annali di Mat. (3) 7 (1902), 231–359; auch sep. Thèse. Milan: Rebeschini. 129 S. 4°.
172. ———, Leçons sur l'intégration et la recherche des fonctions primitives, Gauthier-Villars, Paris, 1904.
173. ———, Measure and the integral, Edited with a biographical essay by Kenneth O. May, Holden-Day Inc., San Francisco, Calif., 1966.
174. Herbert Leinfelder and Christian G. Simader, The Brouwer fixed point theorem and the transformation rule for multiple integrals via homotopy arguments, Exposition. Math. 1 (1983), no. 4, 349–355.
175. John L. Leonard, On nonmeasurable sets, Amer. Math. Monthly 76 (1969), no. 5, 551–552.
176. William Judson LeVeque, Topics in number theory. Vol. I, II, Dover Publications Inc., Mineola, NY, 2002, Reprint of the 1956 original [Addison-Wesley Publishing Co., Inc., Reading, Mass.; MR0080682 (18,283d)], with separate errata list for this edition by the author.
177. Beppo Levi, Sopra l'integrazione delle serie, Rendiconti Istituto Lombardo Scienze 2 (1906), no. 39, 775–780.
178. ———, Sul principio di Dirichlet, Palermo, Circ. Mat. Rend. (1906), no. 22, 293–359.
179. Leo M. Levine, On a necessary and sufficient condition for Riemann integrability, Amer. Math. Monthly 84 (1977), no. 3, 205.
180. Jonathan W. Lewin, Some applications of the bounded convergence theorem for an introductory course in analysis, Amer. Math. Monthly 94 (1987), no. 10, 988–993.
181. Leon Lichtenstein, Ueber die integration eines bestimmten integrals in bezug auf einen parameter, Göttinger Nachrichten 257 (1910), 468–475.
182. Leonard Lipkin, Tossing a fair coin, The College Math. J. 34 (2003), no. 2, 128–133.
183. J.E. Littlewood, Lectures on the theory of functions, Cambridge University Press, London, 1944.
184. N.J. Lord, Yet another proof that ∑ 1/n² = π²/6, Math. Gazette 86 (2002), no. 507, 477–479.
185. P. Loya, Another (new) proof of the fundamental theorem of algebra, preprint, 2003.
186. ———, The fundamental theorem of algebra (FTA), some background and proofs, New College Florida, MAA sectional meeting, Keynote speaker, 2005.
187. P. Loya and M. Mazur, The change of variables formula in multiple integrals, preprint, 2003.
188. Paul Loya, Green's theorem and the fundamental theorem of algebra, Amer. Math. Monthly 68 (1961), no. 1, 56–57.
189. N. Y. Luther, A characterization of almost uniform convergence, Amer. Math. Monthly 74 (1967), 1230–1231.
190. N. Luzin, Sur les propriétés des fonctions mesurables, C. R. Acad. Sci. Paris 154 (1912), 1688–1690.
191. Desmond MacHale, Comic sections: The book of mathematical jokes, humour, wit, and wisdom, Boole Press, Dublin, 1993.
192. N.C. Bose Majumder, On the distance set of the Cantor middle third set, III, Amer. Math. Monthly 72 (1965), no. 7, 725–729.
193. Benoit B. Mandelbrot, The fractal geometry of nature, W. H. Freeman and Co., San Francisco, Calif., 1982.
194. Yu. Manin, Note on the history of the method of least squares, The Analyst 4 (1877), no. 5, 140–143.
195. ———, On the history of the method of least squares, The Analyst 4 (1877), no. 2, 33–36.
196. Elena Anne Marchisotto and Gholam-Ali Zakeri, An invitation to integration in finite terms, The College Math. J. 25 (1994), no. 4, 295–308.
197. Javad Mashreghi, On improper integrals, Crux Mathematicorum 27 (2001), no. 3, 188–190.
198. Frank H. Mathis, A generalized birthday problem, SIAM Rev. 33 (1991), no. 2, 265–270.
199. Fyodor A. Medvedev, Scenes from the history of real functions, Science Networks. Historical Studies, vol. 7, Birkhäuser Verlag, Basel, 1991, Translated from the Russian by Roger Cooke.
200. Ioana Mihaila, Cantor, 1/4, and its family and friends, The College Math. J. 33 (2002), no. 1, 21–23.
201. P. R. Montmort, Essay d'analyse sur les jeux de hazard, second ed., published anonymously, Quillau, Paris, 1713, First Edition, 1708. Available at http://www.york.ac.uk/depts/maths/histstat/lifework.htm.
202. Gregory H. Moore, Zermelo's axiom of choice, Studies in the History of Mathematics and Physical Sciences, vol. 8, Springer-Verlag, New York, 1982, Its origins, development, and influence.
203. ———, Lebesgue's measure problem and Zermelo's axiom of choice: the mathematical effects of a philosophical dispute, History and philosophy of science: selected papers, Ann. New York Acad. Sci., vol. 412, New York Acad. Sci., New York, 1983, pp. 129–154.
204. A. G. Munford, A note on the uniformity assumption in the birthday problem, Amer. Statist. 31 (1977), no. 3, 119.
205. James R. Munkres, Topology: a first course, Prentice-Hall Inc., Englewood Cliffs, N.J., 1975.
206. Jan Mycielski, Finitely additive invariant measures. I, Colloq. Math. 42 (1979), 309–318.
207. James R. Newman (ed.), The world of mathematics. Vol. 3, Dover Publications Inc., Mineola, NY, 2000, Reprint of the 1956 original.
208. Albert Novikoff and Jack Barone, The Borel law of normal numbers, the Borel zero-one law, and the work of Van Vleck, Historia Math. 4 (1977), 43–65.
209. Jeffrey Nunemacher, The largest unit ball in any Euclidean space, Math. Mag. 59 (1986), no. 3, 170–171.
210. J.E. Nymann, Another generalization of the birthday problem, Math. Mag. 48 (1975), no. 1, 46–47.
211. Oystein Ore, Pascal and the invention of probability theory, Amer. Math. Monthly 67 (1960), 409–419.
212. ———, Cardano, the gambling scholar, With a translation from the Latin of Cardano's Book on Games of Chance, by Sydney Henry Gould, Dover Publications Inc., New York, 1965.
213. W. F. Osgood, A geometrical method for the treatment of uniform convergence and certain double limits, Bull. Amer. Math. Soc. 3 (1896), 59–86.
214. ———, Non-uniform convergence and the integration of series term by term, American J. 19 (1897), 155–190.
215. M. Ostrogradski, Mémoire sur le calcul des variations des intégrales multiples, Mémoires de l'Académie Impériale des Sciences de St. Petersbourg 3 (1836), 36–58, submitted to the St. Petersburg Academy of Sciences on January 24, 1834; English translation appeared in 1861 in "A History of the Calculus of Variations during the 19th Century" by I. Todhunter.
216. Alexander Ostrowski, Über den ersten und vierten gaussschen beweis des fundamentalsatzes der algebra, Carl Friedrich Gauss Werke 10, Part 2, Springer-Verlag, Berlin, 1920, pp. 1–18.
217. John C. Oxtoby, Measure and category, second ed., Graduate Texts in Mathematics, vol. 2, Springer-Verlag, New York, 1980, A survey of the analogies between topological and measure spaces.
218. L. Paciuoli, Summa de Arithmetica, Geometria, Proportioni et Proportionalità, Venice, 1494.
219. J. M. Patin, A very short proof of Stirling's formula, Amer. Math. Monthly 96 (1989), no. 1, 41–42.
220. E. S. Pearson, Studies in the history of probability and statistics. XIV. Some incidents in the early history of biometry and statistics, 1890–94, Biometrika 52 (1965), 3–18.
221. Karl Pearson, Historical note on the origin of the normal curve of errors, Biometrika 16 (1924), no. 3/4, 402–404.
222. L. L. Pennisi, Expansions for π and π², Amer. Math. Monthly 62 (1955), 653–654.
223. Ivan N. Pesin, Classical and modern integration theories, Translated from the Russian and edited by Samuel Kotz. Probability and Mathematical Statistics, No. 8, Academic Press, New York, 1970.
224. James Pierpont, Lectures on the theory of functions of real variables, 2 Vols, Dover Publications Inc., New York, 1959.
225. Allan Pinkus, Weierstrass and approximation theory, J. Approx. Theory 107 (2000), no. 1, 1–66.
226. G. Polya, How to solve it, Princeton Science Library, Princeton University Press, Princeton, NJ, 2004, A new aspect of mathematical method, Expanded version of the 1988 edition, with a new foreword by John H. Conway.
227. Janet Bellcourt Pomeranz, The dice problem—then and now, College Math. J. 15 (1984), no. 3, 229–237.
228. John W. Pratt, On interchanging limits and integrals, Ann. Math. Statist. 31 (1960), 74–77.
229. A. Pringsheim, Zur theorie des doppel-integrals, des green'schen und cauchy'schen integralsatzes, München, Ak. Sber. 29 (1900), 39–62.
230. R. A. Proctor, Chance and luck. Discussion of the laws of luck, coincidences, wagers, lotteries, and the fallacies of gambling, Longmans, Green, and Co., London, 1887, Available at http://www.gutenberg.org/etext/17224.
231. Richard J. Pulskamp, Diverse problems concerning the game of treize, 2007.
232. J. Radon, Theorie und anwendungen der absolut additiven mengenfunktionen, Wien. Ber. 122 (1913), 1295–1438.
233. S. Ramakrishnan and W. D. Sudderth, A sequence of coin toss variables for which the strong law fails, Amer. Math. Monthly 95 (1988), no. 10, 939–941.
234. J.F. Randolph, Distances between points of the Cantor set, Amer. Math. Monthly 47 (1940), no. 8, 549–551.
235. Hans Riedwyl, Rudolf Wolf's contribution to the Buffon needle problem (an early Monte Carlo experiment) and application of least squares, Amer. Statist. 44 (1990), no. 2, 138–139.
236. F. Riesz, Sur quelques points de la théorie des fonctions sommables, C. R. Acad. Sci. (1912), no. 154, 641–643.
237. Raphael M. Robinson, On the decomposition of spheres, Fund. Math. 34 (1947), 246–260.
238. C. A. Rogers, A less strange version of Milnor's proof of Brouwer's fixed-point theorem, Amer. Math. Monthly 87 (1980), no. 7, 525–527.
239. N. Rose, Mathematical maxims and minims, Rome Press Inc., Raleigh, NC, 1988.
240. Maxwell Rosenlicht, Integration in finite terms, Amer. Math. Monthly 79 (1972), 963–972.
241. Vladimir Rotar, Probability theory, World Scientific Publishing Co. Inc., River Edge, NJ, 1997, Translated from the Russian by the author.
242. H. L. Royden, Real analysis, third ed., Macmillan Publishing Company, New York, 1988.
243. Walter Rudin, Real and complex analysis, third ed., McGraw-Hill Book Co., New York, 1987.
244. Stanislaw Saks, Theory of the integral, Second revised edition. English translation by L. C. Young. With two additional notes by Stefan Banach, Dover Publications Inc., New York, 1964.
245. H.F. Sandham, A well-known integral, Amer. Math. Monthly 53 (1946), no. 10, 587.
246. C. Edward Sandifer, The early mathematics of Leonhard Euler, MAA Spectrum, Mathematical Association of America, Washington, DC, 2007.
247. ———, How Euler did it, MAA Spectrum, Mathematical Association of America, Washington, DC, 2007, Individual articles available on the web.
248. Norbert Schappacher and René Schoof, Beppo Levi and the arithmetic of elliptic curves, Math. Intelligencer 18 (1996), no. 1, 57–69.
249. Henry Scheffé, A useful convergence theorem for probability distributions, Ann. Math. Statistics 18 (1947), 434–438.
250. J. Schwartz, The formula for change in variables in a multiple integral, Amer. Math. Monthly 61 (1954), 81–85.
251. C. Severini, Sopra gli sviluppi in serie di funzioni ortogonali, Atti Acc. Gioenia di Catania (5), T. 3, Mem. XI., 1910.
252. ———, Sulle successioni di funzioni ortogonali, Atti Acc. Gioenia di Catania (5), T. 3, Mem. XIII., 1910.
253. A. Shen and N. K. Vereshchagin, Basic set theory, Student Mathematical Library, vol. 17, American Mathematical Society, Providence, RI, 2002, Translated from the 1999 Russian edition by Shen.
254. A. Shenitzer and J. Steprāns, The evolution of integration, Amer. Math. Monthly 101 (1994), no. 1, 66–72.
255. O. B. Sheynin, Studies in the history of probability and statistics. XXI. On the early history of the law of large numbers, Biometrika 55 (1968), 459–467.
256. A. N. Shiryaev, Probability, second ed., Graduate Texts in Mathematics, vol. 95, Springer-Verlag, New York, 1996, Translated from the first (1980) Russian edition by R. P. Boas.
257. M.W. Sierpinski, Démonstration élémentaire du théorème de M. Borel sur les nombres absolument normaux et détermination effective d'un tel nombre, Bull. Soc. Math. France 45 (1917), 127–132.
258. W. Sierpinski, L'axiome de M. Zermelo et son rôle dans la théorie des ensembles et l'analyse, Bull. Acad. Sci. Cracovie (1918), 97–152.
259. ———, Sur l'équation fonctionnelle f(x + y) = f(x) + f(y), Fund. Math. 1 (1920), 116–122.
260. ———, Sur les rapports entre l'existence des intégrales ∫₀¹ f(x, y) dx, ∫₀¹ f(x, y) dy, et ∫₀¹ dx ∫₀¹ f(x, y) dy, Fund. Math. 1 (1920), 142–147.
261. W.M. Smart, Combination of observations, Cambridge University Press, Cambridge, UK, 1958.
262. David Eugene Smith, A source book in mathematics. Vol. 1, 2, Dover Publications, Inc, New York, 1959, Unabridged and unaltered republ. of the first ed. 1929.
263. David J. Smith and Mavina K. Vamanamurthy, How small is the unit ball?, Math. Mag. 62 (1989), no. 2, 101–107.
264. Henry Smith, On the integration of discontinuous functions, London Math. Soc. Proc. 6 (1875), 140–153, Available at http://www.archive.org/details/collectedmathema02smituoft.
265. Robert M. Solovay, A model of set-theory in which every set of reals is Lebesgue measurable, Ann. of Math. (2) 92 (1970), 1–56.
266. M.R. Spiegel, The beta function, Amer. Math. Monthly 58 (1951), no. 7, 489–490.
267. D.P. Squier, Repeated Riemann integration, Amer. Math. Monthly 70 (1963), no. 5, 550–552.
268. Saul Stahl, The evolution of the normal distribution, Math. Mag. 79 (2006), no. 2, 96–113.
269. H. Steinhaus, Les probabilités dénombrables et leur rapport à la théorie de la mesure, Fund. Math. 4 (1922), 286–310.
270. T.-J. Stieltjes, Recherches sur les fractions continues, Ann. Fac. Sci. Toulouse Sci. Math. Sci. Phys. 8 (1894), no. 4, J1–J122.
271. Thomas Jan Stieltjes, Note sur l'intégrale ∫₀^∞ e^{−u²} du (French), Nouv. Ann. (3) IX. (1890), 479–480.
272. Stephen M. Stigler, Laplace's 1774 memoir on inverse probability, Statist. Sci. 1 (1986), no. 3, 359–378.
273. ———, Casanova, "Bonaparte", and the Loterie de France, Journal de la Société Française de Statistique 144 (2003), 5–34.
274. ———, Casanova's lottery, The University of Chicago Record 37 (2003), no. 4, 2–5.
275. John Stillwell, Mathematics and its history, second ed., Undergraduate Texts in Mathematics, Springer-Verlag, New York, 2002.
276. M. H. Stone, Applications of the theory of Boolean rings to general topology, Trans. Amer. Math. Soc. 41 (1937), no. 3, 375–481.
277. ———, The generalized Weierstrass approximation theorem, Math. Mag. 21 (1948), 167–184, 237–254.
278. Karl Stromberg, The Banach-Tarski paradox, Amer. Math. Monthly 86 (1979), no. 3, 151–161.
279. Jacob Sturm, Cours d'analyse de l'école polytechnique, tome second, Gauthier-Villars, Paris, 1868, Available free at Google books.
280. Jules Tannery, Introduction à la théorie des fonctions d'une variable (volume 1), A. Hermann, Paris, 1904.
281. Robert Lee Taylor and Tien Chung Hu, Sub-Gaussian techniques in proving strong laws of large numbers, Amer. Math. Monthly 94 (1987), no. 3, 295–299.
282. Johannes Thomae, Einleitung in die theorie der bestimmten integrale, Verlag von Louis Nebert, Halle, 1875, Freely available at http://books.google.com/.
283. K.J. Thomae, Ueber bestimmte integrale, Zeitschr. Math. Phys. 23 (1878), 67–68.
284. H. Tietze, Über funktionen, die auf einer abgeschlossenen menge stetig sind, J. reine angew. Math. (1915), no. 145, 9–14.
285. Isaac Todhunter, A history of the mathematical theory of probability, Macmillan and Co., Cambridge and London, 1865, Available free at http://www.archive.org.
286. R. James Tomkins, Another proof of Borel's strong law of large numbers, Amer. Statist. 38 (1984), no. 3, 208–209.
287. L. Tonelli, Sull'integrazione per parti, Rendiconti Accad. Nazionale dei Lincei 18 (1909), 246–253.
288. ———, Sulla nozione di integrale, Annali di Mat. 1 (1924), no. 1, 105–145.
289. Ian Tweddle, James Stirling's methodus differentialis, Sources and Studies in the History of Mathematics and Physical Sciences, Springer-Verlag London Ltd., London, 2003, An annotated translation of Stirling's text.
290. S. M. Ulam, Zur masstheorie in der allgemeinen mengenlehre, Fundamenta 16 (1930), 140–150.
291. ———, What is measure?, Amer. Math. Monthly 50 (1943), 597–602.
292. P. Urysohn, Über die mächtigkeit der zusammenhängenden mengen, Math. Ann. 94 (1925), 262–295.
293. James Victor Uspensky, Introduction to mathematical probability, McGraw-Hill Book Co, New York, London, 1937.
294. W.R. Utz, The distance set for the Cantor discontinuum, Amer. Math. Monthly 58 (1951), no. 6, 407–408.
295. Jean van Heijenoort, From Frege to Gödel. A source book in mathematical logic, 1879–1931, Harvard University Press, Cambridge, Mass., 1967.
296. E. B. van Vleck, On non-measurable sets of points, with an example, Trans. Amer. Math. Soc. 9 (1908), 237–244.
297. J. Veselý, Weierstrass' theorem before Weierstrass, http://www.focm.net/at/HAT/fpapers/jiri.pdf.
298. G. Vitali, Sul problema della misura dei gruppi di punti di una retta (1905), in: G. Vitali, Opere sull'analisi reale e complessa - Carteggio, Unione Matematica Italiana, Bologna, 1984.
299. ———, Una proprietà delle funzioni misurabili, Reale Istituto Lombardo 2 (1905), no. 38, 599–603.
300. ———, Sull'integrazione per serie, Palermo Rend. 23 (1907), 137–155.
301. V. Volterra, Alcune osservazioni sulle funzioni punteggiate discontinue, Giornale di matematiche di Battaglini 19 (1881), 76–86.
302. John von Neumann, Functional Operators. I. Measures and Integrals, Annals of Mathematics Studies, no. 21, Princeton University Press, Princeton, N. J., 1950.
303. Stan Wagon, The Banach-Tarski paradox, Cambridge University Press, Cambridge, 1993, With a foreword by Jan Mycielski, Corrected reprint of the 1985 original.
304. Helen Walker, Bi-centenary of the normal curve, Journal of the American Statistical Association 29 (1934), no. 185, 72–75.
305. Leonard M. Wapner, The pea & the sun, A K Peters Ltd., Wellesley, MA, 2005, A mathematical paradox.
306. K. Weierstrass, Über continuierliche Functionen eines reellen Arguments, die für keinen Werth des letzteren einen bestimmten Differentialquotienten besitzen, Königliche Akademie der Wissenschaften, 18 Juli 1872. [See also "Mathematische Werke," Vol. 2, pp. 71–74, Mayer & Müller, Berlin, 1895.]
307. ———, Über die analytische Darstellbarkeit sogenannter willkürlicher Funktionen einer reellen Veränderlichen, Sitzungsberichte der Akademie zu Berlin, 633–639, 789–805.
308. E. Wigner, The unreasonable effectiveness of mathematics in the natural sciences, Comm. Pure Appl. Math. 13 (1960), 1–14.
309. J.B. Wilker, Rings of sets are really rings, Amer. Math. Monthly 89 (1982), no. 3, 211.
310. G.S. Young, The linear functional equation, Amer. Math. Monthly 65 (1958), no. 1, 37–38.
311. Robert M. Young, Excursions in calculus, The Dolciani Mathematical Expositions, vol. 13, Mathematical Association of America, Washington, DC, 1992, An interplay of the continuous and the discrete.
312. W. H. Young, On the general theory of integration, Lond. Phil. Trans. (A) 204 (1905), 221–252.
313. ———, On a new method of integration, Proc. London Math. Soc. 9 (1910), no. 2, 15–50.
314. ———, On semi-integrals and oscillating successions of functions, Proc. London Math. Soc. 9 (1910), no. 2, 286–324.
315. ———, On integration with respect to a function of bounded variation, Proc. Roy. Soc. London 2 (1912), no. 13, 109–150.
316. ———, On the new theory of integration, Proc. Roy. Soc. London 88A (1912), 170–178.
317. J. Van Yzeren, Moivre's and Fresnel's integrals by simple integration, Amer. Math. Monthly 86 (1979), no. 8, 690–693.