Wiebe R. Pestman
Mathematical Statistics
Second revised edition
Walter de Gruyter Berlin · New York
Author: Wiebe R. Pestman, Centre for Biostatistics, Utrecht University, Padualaan 14, 3584 CH Utrecht, The Netherlands
Mathematics Subject Classification 2000: 62-01
Keywords: Estimation theory, hypothesis testing, regression analysis, non-parametrics, stochastic analysis, vectorial (multivariate) statistics
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.
ISBN 978-3-11-020852-8
© Copyright 2009 by Walter de Gruyter GmbH & Co. KG, 10785 Berlin, Germany. All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Printed in Germany.
Cover design: Martin Zech, Bremen. Typeset using the author's LaTeX files: Kay Dimler, Müncheberg. Printing and binding: Hubert & Co. GmbH & Co. KG, Göttingen.
Preface
This book arose from a series of lectures I gave at the University of Nijmegen (Holland). It presents an introduction to mathematical statistics and it is intended for students already having some elementary mathematical background. In the text theoretical results are presented as theorems, propositions or lemmas, for which as a rule rigorous proofs are given. In the few exceptions to this rule, references are given to indicate where the missing details can be found. In order to understand the proofs, an elementary course in calculus and linear algebra is a prerequisite. However, if the reader is not interested in studying proofs, they can simply be skipped. Instead one can study the many examples, in which applications of the theoretical results are illustrated. In this way the book is also useful to students in the applied sciences. To have the starting statistician well prepared in the field of probability theory, Chapter I is wholly devoted to this subject. Nowadays many scientific articles (in statistics and probability theory) are presented in terms of measure theoretical notions. For this reason, Chapter I also contains a short introduction to measure theory (without getting submerged by it). However, the subsequent chapters can very well be read when skipping this section on measure theory and my advice to the student reading it is therefore not to get bogged down by it. The aim of the remainder of the book is to give the reader a broad and solid base in mathematical statistics. Chapter II is devoted to estimation theory. The probability distributions of the usual elementary estimators, when dealing with normally distributed populations, are treated in detail. Furthermore, there is in this chapter a section about Bayesian estimation. In the final sections of Chapter II, estimation theory is put into a general framework. In these sections, for example, the information inequality is proved. Moreover, the concepts of maximum likelihood estimation and that of sufficiency are discussed. In Chapters III, IV and V the classical subjects in mathematical statistics are treated: hypothesis testing, normal regression analysis and normal analysis of variance. In these chapters there is no "cheating" concerning the notion of statistical independence. Chapter VI presents an introduction to non-parametric statistics. In Chapter VII it is illustrated how stochastic analysis can be applied in modern statistics. Here subjects like the Kolmogorov–Smirnov test, smoothing techniques, robustness, density estimation and bootstrap methods are treated. Finally, Chapter VIII is about vectorial statistics.
I sometimes find it annoying, when teaching this subject, that many undergraduate students do not seem to be aware (or are no longer aware) of the fundamental theorems in linear algebra. To meet this need I have given a summary of the main results of elementary linear algebra in §VIII.1. The book is written in quite simple English; this might be an advantage to students who do not have English as their native language (like myself). The text contains some 250 exercises, which vary in difficulty. Ivo Alberink, one of my students, has worked out detailed solutions to all these exercises in a book titled Mathematical Statistics. Problems and Detailed Solutions. Ivo has also read this book (which you now have in front of you) in a critical way. The many discussions which arose from this have been very useful. I wish to thank him for this. I hereby also wish to thank Inge Stappers, Flip Klijn and many other students at the University of Nijmegen. They too read the text critically and gave constructive comments. Furthermore I would like to thank Alessandro di Bucchianico and Mark van de Wiel from the Eindhoven University of Technology for providing most of the statistical tables. Finally, I am very indebted to the secretariat of the department of mathematics in Nijmegen, in particular to Hanny Heitink. She did almost all the typewriting, a colossal amount of work.

Nijmegen, November 1998
Wiebe R. Pestman
Preface to the second edition The second edition of this book very much resembles the first. However, there have been some additions, modifications and corrections. As to the latter, I wish to thank readers for drawing my attention to a couple of errors in the first edition. Moreover I would like to mention here the very pleasant collaboration with Walter de Gruyter publishing. In particular I wish to thank Dr. Robert Plato, Kay Dimler and Simon Albroscheit for the formidable job they performed in realizing this second edition. Utrecht, March 2009
Wiebe R. Pestman
Contents
Preface

I     Probability theory
      I.1   Probability spaces
      I.2   Stochastic variables
      I.3   Product measures and statistical independence
      I.4   Functions of stochastic vectors
      I.5   Expectation, variance and covariance of stochastic variables
      I.6   Statistical independence of normally distributed variables
      I.7   Distribution functions and probability distributions
      I.8   Moments, moment generating and characteristic functions
      I.9   The central limit theorem
      I.10  Transformation of probability densities
      I.11  Exercises

II    Statistics and their probability distributions, estimation theory
      II.1  Introduction
      II.2  The gamma distribution and the χ²-distribution
      II.3  The t-distribution
      II.4  Statistics to measure differences in mean
      II.5  The F-distribution
      II.6  The beta distribution
      II.7  Populations that are not normally distributed
      II.8  Bayesian estimation
      II.9  Estimation theory in a framework
      II.10 Maximum likelihood estimation, sufficiency
      II.11 Exercises

III   Hypothesis tests
      III.1 The Neyman–Pearson theory
      III.2 Hypothesis tests concerning normally distributed populations
      III.3 The χ²-test on goodness of fit
      III.4 The χ²-test on statistical independence
      III.5 Exercises

IV    Simple regression analysis
      IV.1  The least squares method
      IV.2  Construction of an unbiased estimator of σ²
      IV.3  Normal regression analysis
      IV.4  Pearson's product-moment correlation coefficient
      IV.5  The sum of squares of errors as a measure of linear structure
      IV.6  Exercises

V     Normal analysis of variance
      V.1   One-way analysis of variance
      V.2   Two-way analysis of variance
      V.3   Exercises

VI    Non-parametric methods
      VI.1  The sign test, Wilcoxon's signed-rank test
      VI.2  Wilcoxon's rank-sum test
      VI.3  The runs test
      VI.4  Rank correlation tests
      VI.5  The Kruskal–Wallis test
      VI.6  Friedman's test
      VI.7  Exercises

VII   Stochastic analysis and its applications in statistics
      VII.1  The empirical distribution function associated with a sample
      VII.2  Convergence of stochastic variables
      VII.3  The Glivenko–Cantelli theorem
      VII.4  The Kolmogorov–Smirnov test statistic
      VII.5  Metrics on the set of distribution functions
      VII.6  Smoothing techniques
      VII.7  Robustness of statistics
      VII.8  Trimmed means, the median and their robustness
      VII.9  Statistical functionals
      VII.10 The von Mises derivative and influence functions
      VII.11 Bootstrap methods
      VII.12 Estimation of densities by means of kernel densities
      VII.13 Estimation of densities by means of histograms
      VII.14 Exercises

VIII  Vectorial statistics
      VIII.1  Linear algebra
      VIII.2  The expectation and covariance of stochastic vectors
      VIII.3  Vectorial samples
      VIII.4  Vectorial normal distributions
      VIII.5  Conditional probability distributions related to Gaussian ones
      VIII.6  Vectorial samples from Gaussian distributed populations
      VIII.7  Vectorial versions of the fundamental limit theorems
      VIII.8  Normal correlation analysis
      VIII.9  Multiple regression analysis
      VIII.10 The multiple correlation coefficient
      VIII.11 Exercises

Appendix
      A  Lebesgue's convergence theorems
      B  Product measures
      C  Conditional probabilities
      D  The characteristic function of the Cauchy distribution
      E  Metric spaces, equicontinuity
      F  The Fourier transform and the existence of stoutly tailed distributions

List of elementary probability densities
Frequently used symbols
Statistical tables
Bibliography
Index
Chapter I
Probability theory
I.1 Probability spaces
Probability theory is the branch of applied mathematics concerning probability experiments: experiments in which randomness plays a principal part. In trying to capture the concept of a probability experiment in a definition, one could say: A probability experiment is an experiment which, when repeated under exactly the same conditions, does not necessarily give the same results. In a probability experiment there is, a priori, always a variety of possible outcomes. The set of all possible outcomes of a probability experiment is said to be the sample space or outcome space. This set will systematically be denoted by the symbol Ω.

Example 1. When casting a die, the possible outcomes are the casts of 1, 2, 3, 4, 5 or 6 pips. Here one can write Ω := {1, 2, 3, 4, 5, 6}.

Example 2. Molecules of a gas collide mutually. The distance covered by some molecule between two successive collisions is a random quantity. Such a distance is always a positive number. One could set Ω := (0, +∞).

Definition I.1.1. An event is understood to be a subset of Ω.

Example 3. In Example 1 the event "an even number of pips has been thrown" corresponds to the subset A = {2, 4, 6} of Ω.

Example 4. In Example 2 the event "the distance between two successive collisions is less than 1" corresponds to the interval (0, 1).

Generally, in a probability model not all subsets of Ω are admissible. Measure theoretical complications are involved here. These complications will not be discussed in this book. In a probability model the collection A of so-called "admissible events" is presented by a collection of subsets of Ω, satisfying more or less specific properties. To this end the following definition:

Definition I.1.2. A collection A of subsets of Ω is said to be a σ-algebra if it satisfies the following three properties:

(i) Ω ∈ A.
(ii) If A ∈ A then also A^c ∈ A.
(iii) If A_1, A_2, ... ∈ A then also ⋃_{i=1}^∞ A_i ∈ A.
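For a finite sample space the three defining properties can be checked mechanically; in particular the collection of all subsets of Ω is always a σ-algebra. The following is a minimal sketch added here as an illustration (it is not part of the book's text); it uses only the Python standard library and all function names are ad hoc.

```python
from itertools import chain, combinations

def power_set(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

def is_sigma_algebra(family, omega):
    """Check properties (i)-(iii) of Definition I.1.2 for a finite family."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    has_omega = omega in family                                    # (i)
    closed_complement = all(omega - a in family for a in family)   # (ii)
    # (iii): for a finite family, countable unions reduce to pairwise unions
    closed_union = all(a | b in family for a in family for b in family)
    return has_omega and closed_complement and closed_union

omega = {1, 2, 3, 4, 5, 6}
print(is_sigma_algebra(power_set(omega), omega))                  # True
print(is_sigma_algebra({frozenset(), frozenset(omega)}, omega))   # True: the trivial sigma-algebra
print(is_sigma_algebra({frozenset({2, 4, 6})}, omega))            # False: fails (i) and (ii)
```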
In a probability experiment, one generally defines an associated sample space Ω together with a σ-algebra of so-called admissible events. Now suppose that A is such a σ-algebra of subsets of Ω. The family of sets A will be regarded as being the collection of all possible events. With every event A ∈ A there is an associated chance or probability that A is going to occur. This chance or probability will be denoted by P(A), a real number in the interval [0, 1]. Roughly speaking this number can be interpreted in the following way:

(i) P(A) = 1 → one can be sure that A is going to occur.
(ii) P(A) = 0 → one can be sure that A is not going to occur.
(iii) P(A) = 0.3 → after endless repetition of the experiment, under exactly the same conditions, event A will occur in 30% of the repetitions.

Actually one is dealing here with a map A ↦ P(A) defined on a σ-algebra A. Measure theory is the branch of mathematics dealing with maps of this kind. The general definition of the concept of a "measure" is as follows:

Definition I.1.3. A map μ : A → [0, +∞] is said to be a measure if the following two conditions are satisfied:

(i) μ(∅) = 0.
(ii) For every mutually disjoint collection A_1, A_2, ... ∈ A one has

  μ(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).

In condition (ii) one has by convention that (+∞) + (+∞) = +∞. Condition (ii) is said to be the σ-additivity of μ. The concept of a measure could intuitively be perceived as follows: μ(A) is the "size" of the set A, as seen through the glasses of μ. Although the measure μ is defined on A, one often talks about a "measure on Ω".
Given an arbitrary collection F of subsets of Ω, there always exists the so-called "smallest σ-algebra enclosing F". This σ-algebra is said to be the σ-algebra generated by F. In this way an important σ-algebra on R can be defined:

Example 5. Set Ω := R (or R^n). Let F be the collection of all open subsets of Ω. Now F is not a σ-algebra (see Exercise 4). The σ-algebra generated by F is called the collection of Borel sets. This σ-algebra will be denoted by B (or B_n in case of R^n). In practice every set one encounters is a Borel set. However, it can be proved (applying the axiom of choice) that there exist subsets of R (and of R^n) that are not Borel. A measure on B (or B_n) is said to be a Borel measure.

In measure theory the following statement is proved: "There exists on R a unique translation invariant Borel measure λ satisfying λ([0, 1]) = 1." The following notations are introduced to make this statement more precise:

(i) If A ⊂ R^n and a ∈ R^n then A + a := {x + a : x ∈ A},
(ii) [0, 1]^n := {(x_1, ..., x_n) ∈ R^n : 0 ≤ x_i ≤ 1 (i = 1, 2, ..., n)}.

Theorem I.1.1. On R^n there exists a unique Borel measure λ (one should write λ_n) satisfying

(i) λ([0, 1]^n) = 1,
(ii) λ(A + a) = λ(A) for all A ∈ B_n and a ∈ R^n.

Proof. See [11], [28], [58], [64].

The measure λ in Theorem I.1.1 is said to be the Lebesgue measure on R^n. It turns out that it is not possible to extend λ to a translation invariant measure on the σ-algebra comprising all subsets of R^n. On R the measure λ generalizes the concept of length, on R² the concept of area and on R³ the concept of volume.

Example 6.
(i) For an interval I ⊂ R one has: λ(I) = length of I.
(ii) The Lebesgue measure of a disk D_R(a) ⊂ R² with center a and radius R is given by λ(D_R(a)) = πR².
(iii) The Lebesgue measure of a ball B_R(a) ⊂ R³ with center a and radius R is given by λ(B_R(a)) = 4πR³/3.

In measure theory one generalizes the idea of a "continuous function" to the idea of a "Borel function".

Definition I.1.4. A function f : R^n → R^m is said to be a Borel function if for every open set O ⊂ R^m the set f⁻¹(O) is a Borel set in R^n.
Remark 1. Obviously every continuous function is a Borel function.

Remark 2. In practice all functions f : R^n → R^m one encounters are Borel functions. However, it can be proved that in pure mathematics there exist functions that are not Borel.

Proposition I.1.2. If f : R^n → R^m is a Borel function and A ⊂ R^m is a Borel set, then f⁻¹(A) is a Borel set in R^n.

Proof. See Exercise 5.

In the following a brief look will be taken at Lebesgue's theory of integration. It turns out that it makes sense to define for every Borel function f : R^n → [0, +∞) its integral

  ∫ f dλ = ∫ f(x) dλ(x).

A general line will be sketched to introduce this integral, thereby ignoring the technical details (see [11], [28], [58], [64]). As a first step the integral will be defined in cases where f is a positive simple function, that is to say, a linear combination of indicator functions

  f = Σ_{i=1}^n α_i 1_{A_i},

where α_1, ..., α_n > 0 and A_1, ..., A_n are mutually disjoint Borel sets in R^n. For such functions f the integral is defined as

  ∫ f dλ := Σ_{i=1}^n α_i λ(A_i).

For two functions f and g of this type one has

  f ≤ g ⟹ ∫ f dλ ≤ ∫ g dλ.

It is not difficult to prove that every positive Borel function is the pointwise limit of an increasing sequence of positive simple functions. More precisely, if f : R^n → [0, +∞) is a Borel function, then there exists a sequence (f_n)_{n∈N} such that

(i) f_n is a positive simple function for all n,
(ii) f_n ≤ f_{n+1} for all n ∈ N,
(iii) f(x) = lim_{n→∞} f_n(x) for all x ∈ R^n.

It is now clear that the sequence (∫ f_n dλ)_{n∈N} is increasing and consequently convergent in [0, +∞]. One defines (finite or infinite):

  ∫ f(x) dλ(x) = ∫ f dλ := lim_{n→∞} ∫ f_n dλ.

It turns out that this expression does not depend on the particular choice of the sequence (f_n)_{n∈N} and therefore it presents a correct definition of ∫ f dλ. The integrals introduced in this way satisfy the elementary properties that integrals should have:

(i) ∫ f dλ ∈ [0, +∞].
(ii) ∫ (f + g) dλ = ∫ f dλ + ∫ g dλ.
(iii) If α ≥ 0, then ∫ αf dλ = α ∫ f dλ.
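The standard increasing approximation can be made concrete with the staircase functions f_n(x) = min(n, ⌊2ⁿ f(x)⌋/2ⁿ). The sketch below (an illustration only, not from the book) assumes NumPy and replaces the exact Lebesgue measure of each level set on [0, 1] by the fraction of points of a fine grid lying in it, which is only a numerical shortcut, not the measure-theoretic definition.

```python
import numpy as np

def simple_approximation(f, n, grid):
    """n-th staircase simple function f_n(x) = min(n, floor(2^n f(x)) / 2^n)."""
    return np.minimum(n, np.floor(2**n * f(grid)) / 2**n)

def integral_of_simple(values, grid):
    """Sum over levels alpha of alpha * lambda(A_alpha), with lambda(A_alpha)
    approximated by (fraction of grid points at that level) * interval length."""
    length = grid[-1] - grid[0]
    total = 0.0
    for alpha in np.unique(values):
        if alpha > 0:
            total += alpha * np.mean(values == alpha) * length
    return total

f = lambda x: x**2
grid = np.linspace(0.0, 1.0, 200_001)            # fine grid on [0, 1]
for n in (1, 2, 4, 8):
    fn = simple_approximation(f, n, grid)
    print(n, integral_of_simple(fn, grid))        # increases towards 1/3
```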
To extend the concept of an integral to Borel functions f : R^n → R that are not necessarily positive, one proceeds in the following way. Write

  A⁺ := {x ∈ R^n : f(x) ≥ 0}  and  A⁻ := {x ∈ R^n : f(x) < 0}.

Now the functions

  f⁺ = 1_{A⁺} f  and  f⁻ = −1_{A⁻} f

present positive Borel functions (see Exercise 7) and one has

  f = f⁺ − f⁻  and  |f| = f⁺ + f⁻.

Thus, it is natural to define

  ∫ f dλ := ∫ f⁺ dλ − ∫ f⁻ dλ.

To avoid in this definition the (undefined) expression (+∞) − (+∞), it is assumed that

  ∫ f⁺ dλ < +∞  and  ∫ f⁻ dλ < +∞.

This is the same as saying that

  ∫ |f| dλ < +∞.

Definition I.1.5. A Borel function f : R^n → R that satisfies the assumption above is said to be λ-summable or λ-integrable.

It turns out (see Exercise 7) that for every Borel function f : R^n → R and every Borel set A ⊂ R^n the function 1_A f is again a Borel function. If f ≥ 0, or if f is λ-summable, one defines

  ∫_A f dλ := ∫ 1_A f dλ.
Proposition I.1.3. Let f : R^n → [0, +∞) be a Borel function. Suppose that A = ⋃_{i=1}^∞ A_i, where A_1, A_2, ... is a collection of mutually disjoint Borel sets in R^n. Then

  ∫_A f dλ = Σ_{i=1}^∞ ∫_{A_i} f dλ.

Proof. See for example [11], [28], [58], [64].

A direct consequence of this proposition is that the map

  ν : A ↦ ∫_A f dλ

presents a Borel measure on R^n. This measure will be denoted as ν = fλ. The measure ν is said to have density f with respect to the measure λ. As to such measures, the following measure theoretical theorem plays a role in the foundations of probability theory:

Theorem I.1.4. Let g : R^n → [0, +∞) be a Borel function, μ a Borel measure and ν = gμ. A Borel function f : R^n → R is ν-summable if and only if fg is μ-summable. In that case one has

  ∫ f dν = ∫ fg dμ.

For f ≥ 0 the above (finite or infinite) is always valid.

Proof. For details see for example [11], [28], [58], [64]. The proof is not difficult: First one proves the theorem in the strongly simplified case where f = 1_A, A being a Borel set. Next, the case where f is a simple function follows. The general case is proved by an argument involving approximation by simple functions. In Lebesgue's integration theory many proofs run along this line.

Integrals with respect to the Lebesgue measure λ on R^n are said to be Lebesgue integrals; they present a generalization of the "classical" Riemann integrals. More precisely, if A ⊂ R^n is a set of type A = [a_1, b_1] × ... × [a_n, b_n] and if f : A → R is (piecewise) continuous, then

  ∫_A f dλ = ∫_A f(x) dx = ∫_{a_1}^{b_1} ... ∫_{a_n}^{b_n} f(x_1, ..., x_n) dx_1 ... dx_n,

with the right hand members presenting integrals in the sense of Riemann.
Example 7. Let g : R → [0, +∞) be the function defined by

  g(x) := (1/√(2π)) e^{−x²/2}   (x ∈ R).

Suppose ν = gλ, that is, ν is the Borel measure defined by

  ν : A ↦ ∫_A g dλ = (1/√(2π)) ∫_A e^{−x²/2} dx.

If f(x) = x², then by Theorem I.1.4 one has

  ∫ f dν = ∫ x² dν(x) = ∫ x² g(x) dλ(x) = ∫_{−∞}^{+∞} x² (1/√(2π)) e^{−x²/2} dx.

Besides the Lebesgue measure, the Dirac measures (or point measures) also play an important role. The Dirac measure in a point a ∈ R^n is defined by:

  δ_a(A) = 1 if a ∈ A,  δ_a(A) = 0 if a ∉ A.

It is easily verified that δ_a is a measure. An integral with respect to δ_a is characterized by

  ∫ f dδ_a = f(a).

In terms of Dirac measures one can easily construct new measures. This is illustrated in the following two examples.

Example 8. On R one can define a measure ν by

  ν(A) := number of elements in A ∩ N = #(A ∩ N).

In terms of Dirac measures this can be expressed by

  ν(A) = Σ_{n=0}^∞ δ_n(A).

The above will often be written more briefly as

  ν = Σ_{n=0}^∞ δ_n  or  ν = Σ_{n∈N} δ_n.

For f ≥ 0 one has (finite or infinite):

  ∫ f dν = Σ_{n=0}^∞ f(n).

A function f is ν-summable if and only if

  ∫ |f| dν = Σ_{n=0}^∞ |f(n)| < +∞,

in which case one has

  ∫ f dν = Σ_{n=0}^∞ f(n),

where the series on the right hand side is absolutely convergent.

Example 9. Let (c_n)_{n∈N} be a sequence of non-negative real numbers. Define the measure ν on R by

  ν(A) := Σ_{n=0}^∞ c_n δ_n(A).

It is not difficult to see that this defines a measure on R indeed. The above can be written more briefly as:

  ν = Σ_{n=0}^∞ c_n δ_n.

A function f is ν-summable if and only if

  ∫ |f| dν = Σ_{n=0}^∞ c_n |f(n)| < +∞,

in which case

  ∫ f dν = Σ_{n=0}^∞ c_n f(n).

The power of Lebesgue's integration theory, beyond that of Riemann, emanates from a number of powerful convergence theorems (see Appendix A).

In probability theory one is mainly interested in measures on the sample space Ω and measures associated with these. A measure P on Ω that satisfies P(Ω) = 1 is said to be a probability measure. In the following, these measures will systematically be denoted by the symbol P. Along the lines proposed by the famous Russian mathematician Andrej N. Kolmogorov (1903–1987), the concept of a probability space is defined as follows:

Definition I.1.6. A probability space associated with a probability experiment is understood to be a triplet (Ω, A, P), where Ω is a sample space, A a σ-algebra consisting of subsets of Ω and P a probability measure on A.
Example 10. Suppose one is testing the life time of a certain type of lamps. The outcome of an experiment like this can be characterized by a real number, so one could set Ω := R. As announced before, in practice every subset of R is Borel so one is playing quite safe by setting A := B. In experiments done before it turned out that the following probability measure P fits to this experiment: P = fλ, where

  f(x) = (1/1000) e^{−x/1000} if x ≥ 0,  f(x) = 0 elsewhere.

One has

  P(R) = ∫_R f dλ = ∫_0^{+∞} (1/1000) e^{−x/1000} dx = 1.

So P is a probability measure indeed. Now let A be the event "the life time of the tested lamp is more than 1000 hours". This event corresponds to the subset (1000, +∞) of R. This is a Borel set so A presents an admissible event. The chance P(A) that A will occur is given by

  P(A) = ∫_A f dλ = ∫_{1000}^{+∞} (1/1000) e^{−x/1000} dx = [−e^{−x/1000}]_{1000}^{+∞} = 1/e ≈ 0.37.

Example 11. For the probability space belonging to a cast of a die one could take Ω := {1, 2, 3, 4, 5, 6}, A the collection of all subsets of Ω and P defined by

  P(A) := (number of elements in A)/6 = #(A)/6.

Now (Ω, A, P) is a probability space. The probability that the event "number of pips more than 4" will occur is presented by the number P(A), where A = {5, 6}. Of course one has P(A) = 2/6 = 1/3.
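The computations in Example 10 are easy to check numerically. The following quick sketch is an illustration only (not part of the book) and assumes that NumPy and SciPy are installed.

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x / 1000) / 1000 if x >= 0 else 0.0

total, _ = quad(f, 0, np.inf)       # P(R): integral of the density over [0, infinity)
tail, _ = quad(f, 1000, np.inf)     # P(A): life time exceeds 1000 hours
print(total)                         # 1.0
print(tail, np.exp(-1))              # both approximately 0.3679
```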
I.2 Stochastic variables
Let (Ω, A, P) be a probability space and let X : Ω → R^n be a function. For every subset A ⊂ R^n one has by definition

  X⁻¹(A) := {ω ∈ Ω : X(ω) ∈ A}.

The function X is said to be A-measurable if

  X⁻¹(A) ∈ A

for every Borel set A in R^n (see also Exercise 13).

Definition I.2.1. An A-measurable function X : Ω → R^n is said to be a stochastic n-vector.

Remark. In cases where n = 1 the notation X (rather than the boldface X) will be used; a stochastic 1-vector will also be called a stochastic variable.

Example 1. A probability experiment consists of casting a die twice. Here the sample space can be defined by

  Ω := {(i, j) : i, j = 1, 2, 3, 4, 5, 6},

where (i, j) corresponds to "first time i pips, second time j pips". Let A be the collection of all subsets of Ω and define P by

  P(A) := (number of elements in A)/36 = #(A)/36.

Now the triplet (Ω, A, P) is a probability space corresponding to the given probability experiment. The stochastic variable X : (i, j) ↦ i + j presents the total number of pips thrown.

Given a stochastic n-vector X on Ω, a Borel measure P_X on R^n can be obtained by defining

  P_X(A) := P[X⁻¹(A)]

for all Borel sets A in R^n (see Exercise 11). One often writes: P(X ∈ A) := P_X(A).

Definition I.2.2. The Borel measure P_X is said to be the probability distribution of X.

Example 2. Let λ be the Lebesgue measure on R. Suppose f : R → [0, +∞) is defined by

  f(x) := (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}.

For any probability space (Ω, A, P) a stochastic variable X : Ω → R is said to be normally distributed with parameters μ and σ if P_X = fλ. In that case one has

  P(X ∈ A) = P_X(A) = ∫_A (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx.

If A is an interval [a, b], then

  P(a ≤ X ≤ b) = ∫_a^b (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx.

More details about normally distributed variables can be found in §I.6.

Example 3. Let (Ω, A, P) be a probability space. Suppose X : Ω → R is a constant variable, that is: there exists an element a ∈ R such that X(ω) = a for all ω ∈ Ω. Then P_X = δ_a.

Example 4. Suppose δ_n is on R the Dirac measure in the point n ∈ N. Define ν by

  ν(A) := Σ_{n=0}^∞ (λ^n/n!) e^{−λ} δ_n(A)   (λ > 0).

More briefly one could write

  ν = Σ_{n∈N} (λ^n/n!) e^{−λ} δ_n.

For any probability space (Ω, A, P) a variable X : Ω → R is said to be Poisson distributed with parameter λ > 0 if

  P_X = Σ_{n∈N} (λ^n/n!) e^{−λ} δ_n.

The Poisson distribution (described above) is an example of a so-called discrete probability distribution:

Definition I.2.3. A stochastic n-vector X with a countable range (that is, the set X(Ω) := {X(ω) : ω ∈ Ω} is countable) is said to have a discrete probability distribution.

Proposition I.2.1. If X is discretely distributed then, setting W := X(Ω), one has

  P_X = Σ_{a∈W} P(X = a) δ_a.

Proof. See Exercise 17.
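For the variable X of Example 1 (the total number of pips in two casts) the weights P(X = a) in Proposition I.2.1 can be tabulated by brute force over the 36 equally likely outcomes. The short sketch below is an illustration only and uses just the Python standard library.

```python
from fractions import Fraction
from collections import Counter

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely outcomes
X = lambda outcome: outcome[0] + outcome[1]

counts = Counter(X(w) for w in omega)
P_X = {a: Fraction(c, 36) for a, c in sorted(counts.items())}
print(P_X)                 # weights P(X = a) for a = 2, ..., 12, e.g. P(X = 7) = 1/6
print(sum(P_X.values()))   # 1, as required of a probability distribution
```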
If X is discretely distributed then the map a ↦ P(X = a) is said to be the discrete probability density of X. As an example, a stochastic variable X is said to have a binomial distribution with parameters n ∈ {1, 2, ...} and θ ∈ [0, 1] if W = {0, 1, ..., n} and if for all k ∈ W one has

  P(X = k) = (n choose k) θ^k (1 − θ)^{n−k}.

As another example, a stochastic variable X is said to be geometrically distributed if W = {1, 2, 3, ...} and

  P(X = k) = θ (1 − θ)^{k−1}.

In Example 2 the normal distribution on R was introduced. This is an example of a so-called absolutely continuous distribution:

Definition I.2.4. A stochastic n-vector X is said to have an absolutely continuous probability distribution if P_X has a density with respect to the Lebesgue measure λ on R^n. In other words: X has an absolutely continuous probability distribution if there exists a Borel function f : R^n → [0, +∞) such that P_X = fλ. If this is the case, then the function f is called the probability density of X.

Remark. The probability density associated with an absolutely continuous probability distribution is not necessarily a continuous function.

Proposition I.2.2. If X is a stochastic n-vector with an absolutely continuous probability distribution, then its corresponding probability density f satisfies the following three basic properties:

(i) f is a Borel function,
(ii) f ≥ 0,
(iii) ∫ f dλ = 1.

Conversely, every function f : R^n → R satisfying the properties (i), (ii) and (iii) is the probability density function of at least one stochastic n-vector.

Proof. It is direct from Definition I.2.4 that a probability density function satisfies (i), (ii) and (iii). Suppose, conversely, that f satisfies (i), (ii) and (iii). Define Ω := R^n, A := B_n and P := fλ. Then (Ω, A, P) is a probability space. Let X : Ω → R^n be the identity on R^n. Then P_X = fλ, so f is the probability density function of the stochastic variable X.

Example 5. For all α ∈ R and β > 0, define f : R → R by

  f(x) := (1/π) β/((x − α)² + β²)   (x ∈ R).

The function f satisfies the conditions (i), (ii) and (iii) in Proposition I.2.2. So f is the probability density of at least one stochastic variable. Stochastic variables having this specific probability density f are said to have a Cauchy distribution with parameters α and β.

Remark. There exist stochastic variables that have neither a discrete, nor an absolutely continuous probability distribution. Such variables occur for example in quantum mechanics (see also the intermezzo in §I.7).

As said before, if X : Ω → R^n is a stochastic n-vector then P_X is a probability measure on the Borel sets of R^n. Thus the triplet (R^n, B_n, P_X) presents a probability space. In probability theory it is often this space that is in the picture, just ignoring the idea of an underlying triplet (Ω, A, P).

Remark. In measure theory it is usual to call P_X the image measure of P under the map X. Very often this measure is denoted by X(P), but other notations are used as well.

If X_1, ..., X_n are stochastic variables, then an associated stochastic n-vector X can be defined by setting

  X(ω) := (X_1(ω), ..., X_n(ω)) ∈ R^n.

See Appendix B to clear up the problem of measurability of X. In this case P_X will also be denoted by P_{X_1,...,X_n}. This measure is often called the joint probability distribution of X_1, ..., X_n.
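Property (iii) of Proposition I.2.2 can be checked numerically for the Cauchy density of Example 5. The sketch below is an illustration only and assumes NumPy and SciPy; the parameter values α = 0, β = 1 are chosen arbitrarily for the check.

```python
import numpy as np
from scipy.integrate import quad

def cauchy_density(x, alpha=0.0, beta=1.0):
    """f(x) = (1/pi) * beta / ((x - alpha)^2 + beta^2)."""
    return beta / (np.pi * ((x - alpha) ** 2 + beta ** 2))

total, _ = quad(cauchy_density, -np.inf, np.inf)
print(total)   # 1.0 up to quadrature error: property (iii)
# Properties (i) and (ii) hold because the function is continuous (hence Borel)
# and clearly non-negative.
```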
I.3 Product measures and statistical independence
In this section the concept of statistical independence of stochastic variables will be defined in terms of measure theory. For every A_1, ..., A_n ⊂ R the Cartesian product A_1 × A_2 × ... × A_n ⊂ R^n is understood to be the set defined by

  A_1 × ... × A_n := {(x_1, ..., x_n) ∈ R^n : x_i ∈ A_i (i = 1, ..., n)}.

The projection of R^n onto the i-th coordinate will be denoted by π_i, so: π_i(x_1, ..., x_n) := x_i. The map π_i is continuous and therefore surely a Borel function. One has

  A_1 × ... × A_n = π_1⁻¹(A_1) ∩ ... ∩ π_n⁻¹(A_n).

Consequently, if A_1, ..., A_n are Borel in R then so is A_1 × ... × A_n in R^n.

If P is a probability measure on R^n then the measure P_i defined by

  P_i(A) := P[π_i⁻¹(A)]   (A Borel in R)

is a probability measure on R. One has

  P_i(A) = P(R × ... × R × A × R × ... × R),  where A is on the i-th place.

Proposition I.3.1. Suppose X : Ω → R^n is a stochastic n-vector with components X_1, ..., X_n, then (P_X)_i = P_{X_i}.

Proof. One has

  (P_X)_i(A) = P_X[π_i⁻¹(A)] = P[X ∈ π_i⁻¹(A)] = P[(X_1, ..., X_n) ∈ π_i⁻¹(A)] = P(X_i ∈ A) = P_{X_i}(A).

Remark. P_{X_i} is said to be a marginal probability distribution of X. Note that in fact P_{X_i} is the image of P_X under the map π_i (see §I.2).

Definition I.3.1. A probability measure P on R^n is said to be the product of the measures P_1, ..., P_n and denoted by P = P_1 ⊗ ... ⊗ P_n if

  P(A_1 × ... × A_n) = P_1(A_1) P_2(A_2) ... P_n(A_n)

for all Borel sets A_1, ..., A_n ⊂ R (see also Appendix B).

Example 1. Given a point (a, b) ∈ R², the Dirac measure δ_{(a,b)} in the point (a, b) satisfies δ_{(a,b)} = δ_a ⊗ δ_b.

Definition I.3.2. The stochastic variables X_1, ..., X_n are said to be statistically independent if

  P_{X_1,...,X_n} = P_{X_1} ⊗ ... ⊗ P_{X_n}.

Example 2. In the case of two variables X_1 and X_2 one has

  P_{X_1,X_2}(A_1 × A_2) = P[(X_1, X_2) ∈ A_1 × A_2] = P[X_1 ∈ A_1 and X_2 ∈ A_2].

So the variables X_1 and X_2 are statistically independent if and only if

  P[X_1 ∈ A_1 and X_2 ∈ A_2] = P(X_1 ∈ A_1) · P(X_2 ∈ A_2)

for all Borel sets A_1, A_2.

Example 3. Suppose (Ω, A, P) is a probability space and B_1, B_2 ∈ A. Setting X_1 := 1_{B_1} and X_2 := 1_{B_2}, the variables X_1 and X_2 are statistically independent if and only if

  P(B_1 ∩ B_2) = P(B_1) P(B_2)

(see Exercise 15).

Definition I.3.3. The events B_1, ..., B_n are understood to be statistically independent if the stochastic variables 1_{B_1}, ..., 1_{B_n} are so.

Example 4. Three events B_1, B_2, B_3 are statistically independent if and only if

  P(B_1 ∩ B_2 ∩ B_3) = P(B_1) P(B_2) P(B_3),
  P(B_1 ∩ B_2) = P(B_1) P(B_2),
  P(B_1 ∩ B_3) = P(B_1) P(B_3),
  P(B_2 ∩ B_3) = P(B_2) P(B_3)

(see Exercise 16). It may very well occur that variables X_1, ..., X_n are pairwise independent but still do not form an independent system! An example of this phenomenon is:

Example 5. Throw a die twice and define the events A, B and C by

  A := {1st throw is even},
  B := {2nd throw is even},
  C := {1st and 2nd throw are both even or both odd}.

Then

  P(A ∩ B) = 1/4 = (1/2) · (1/2) = P(A) P(B),
  P(A ∩ C) = 1/4 = (1/2) · (1/2) = P(A) P(C),
  P(B ∩ C) = 1/4 = (1/2) · (1/2) = P(B) P(C).

However, P(A ∩ B ∩ C) = 1/4, so:

  P(A ∩ B ∩ C) ≠ P(A) P(B) P(C).

Defining X := 1_A, Y := 1_B and Z := 1_C, the variables X, Y and Z are pairwise independent. However the system X, Y, Z is not statistically independent.
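Example 5 can be verified by brute force over the 36 equally likely outcomes of two casts. The sketch below is an illustration only and uses just the Python standard library; the names A, B, C mirror the events above.

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0                 # first throw even
B = lambda w: w[1] % 2 == 0                 # second throw even
C = lambda w: (w[0] % 2) == (w[1] % 2)      # both even or both odd

print(P(lambda w: A(w) and B(w)) == P(A) * P(B))   # True: pairwise independent
print(P(lambda w: A(w) and C(w)) == P(A) * P(C))   # True
print(P(lambda w: B(w) and C(w)) == P(B) * P(C))   # True
print(P(lambda w: A(w) and B(w) and C(w)))          # 1/4
print(P(A) * P(B) * P(C))                            # 1/8: not jointly independent
```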
The tensor product of a system of functions f_1, ..., f_n : R → R is understood to be the function f : R^n → R defined by

  f(x_1, ..., x_n) := f_1(x_1) ... f_n(x_n).

This function will be denoted as

  f = f_1 ⊗ ... ⊗ f_n.

The following theorem is formulated in terms of this notation.

Theorem I.3.2. Suppose X_1, ..., X_n are stochastic variables with probability densities f_{X_1}, ..., f_{X_n}. The system X_1, ..., X_n is then statistically independent if and only if the stochastic n-vector (X_1, ..., X_n) has as its probability density the function f_{X_1} ⊗ ... ⊗ f_{X_n}.

Proof. (i) Suppose X_1, ..., X_n are independent. Then for all Borel sets A_1, ..., A_n in R and for X := (X_1, ..., X_n) one has:

  P(X ∈ A_1 × ... × A_n) = P(X_1 ∈ A_1) ... P(X_n ∈ A_n)
    = ∫_{A_1} f_{X_1}(x_1) dx_1 ... ∫_{A_n} f_{X_n}(x_n) dx_n
    (Fubini's theorem) = ∫_{A_1 × ... × A_n} f_{X_1}(x_1) ... f_{X_n}(x_n) dx_1 ... dx_n
    = ∫_{A_1 × ... × A_n} (f_{X_1} ⊗ ... ⊗ f_{X_n})(x_1, ..., x_n) dx_1 ... dx_n.

Thus one finds that

  P[(X_1, ..., X_n) ∈ A] = ∫_A (f_{X_1} ⊗ ... ⊗ f_{X_n})(x) dx   (*)

for all sets of type A = A_1 × ... × A_n. This, however, implies that (*) holds for all Borel sets A ⊂ R^n (see Appendix B).

(ii) Conversely, suppose that f_{X_1} ⊗ ... ⊗ f_{X_n} is the probability density of the stochastic n-vector (X_1, ..., X_n). Then

  P[(X_1, ..., X_n) ∈ A_1 × ... × A_n] = ∫_{A_1 × ... × A_n} f_{X_1}(x_1) ... f_{X_n}(x_n) dx_1 ... dx_n
    (Fubini's theorem) = ∫_{A_1} f_{X_1}(x_1) dx_1 ... ∫_{A_n} f_{X_n}(x_n) dx_n
    = P(X_1 ∈ A_1) ... P(X_n ∈ A_n).

Consequently X_1, ..., X_n is an independent system.
One can also speak about statistical independence when dealing with stochastic vectors. To explain this, suppose X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) are two stochastic vectors. A stochastic (m + n)-vector (X, Y) can then be defined by

  (X, Y) := (X_1, ..., X_m, Y_1, ..., Y_n).

The stochastic vectors X and Y are said to be statistically independent if

  P_{(X,Y)} = P_X ⊗ P_Y.

So X and Y are independent if and only if for all Borel sets A ⊂ R^m and B ⊂ R^n one has

  P(X ∈ A and Y ∈ B) = P(X ∈ A) · P(Y ∈ B).

Proposition I.3.3. If X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) are statistically independent stochastic vectors, then for all possible i and j the components X_i and Y_j are also statistically independent.

Proof. Suppose A and B are Borel sets in R. Define

  Ã := R × ... × R × A × R × ... × R ⊂ R^m  and  B̃ := R × ... × R × B × R × ... × R ⊂ R^n,

where A and B are on the i-th and j-th place respectively. In this notation one has

  P_{(X_i,Y_j)}(A × B) = P[(X_i, Y_j) ∈ A × B] = P[(X, Y) ∈ Ã × B̃] = P_{(X,Y)}(Ã × B̃)
    = P_X(Ã) P_Y(B̃) = P(X ∈ Ã) P(Y ∈ B̃) = P(X_i ∈ A) P(Y_j ∈ B) = P_{X_i}(A) P_{Y_j}(B).

It follows that P_{(X_i,Y_j)} = P_{X_i} ⊗ P_{Y_j}, consequently X_i and Y_j are statistically independent.

The converse of Proposition I.3.3 is not valid. That is, it may occur that all possible pairs of components X_i and Y_j form independent systems, whereas the vectorial pair X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) does not form an independent system! This is illustrated in the following example:

Example 6. Let X, Y and Z be as in Example 5. Consider the stochastic 1-vector X and the stochastic 2-vector (Y, Z). Now the pair X and Y is independent, just as the pair X and Z. However, X and (Y, Z) are not independent. To see this, note that

  P[X = 1 and (Y, Z) = (1, 1)] ≠ P[X = 1] · P[(Y, Z) = (1, 1)].

Here the left side equals 1/4, the right side 1/8.
Proposition I.3.4. Let X_1, ..., X_n be a statistically independent system of stochastic variables and let 1 ≤ s ≤ n − 1. Then the stochastic vectors (X_1, ..., X_s) and (X_{s+1}, ..., X_n) are statistically independent.

Proof. Denote

  X = (X_1, ..., X_s)  and  Y = (X_{s+1}, ..., X_n).

In this notation one has, because of the independence of the system X_1, ..., X_n,

  P_{(X,Y)} = P_{X_1} ⊗ ... ⊗ P_{X_s} ⊗ P_{X_{s+1}} ⊗ ... ⊗ P_{X_n}.

It follows from this (see Appendix B) that

  P_{(X,Y)} = (P_{X_1} ⊗ ... ⊗ P_{X_s}) ⊗ (P_{X_{s+1}} ⊗ ... ⊗ P_{X_n}).

However

  P_X = P_{X_1} ⊗ ... ⊗ P_{X_s}  and  P_Y = P_{X_{s+1}} ⊗ ... ⊗ P_{X_n}.

Consequently

  P_{(X,Y)} = P_X ⊗ P_Y,

which proves the independence of X and Y.

In concluding this section the concept of statistical independence is now defined for arbitrary systems of stochastic vectors.

Definition I.3.4. A system X_1, ..., X_n of stochastic vectors is said to be statistically independent (*) if

  P_{X_1,...,X_n} = P_{X_1} ⊗ P_{X_2} ⊗ ... ⊗ P_{X_n}.

In a natural way the propositions and theorems in this section can be generalized in several ways to systems of stochastic vectors.

(*) The reader should not confuse statistical independence and linear independence in linear algebra!
I.4 Functions of stochastic vectors

In probability theory and statistics one often encounters functions of stochastic vectors.

Example 1. Let (Ω, A, P) be a probability space and X : Ω → R a stochastic variable. Define for all ω ∈ Ω

  Y(ω) := [X(ω)]².

Now Y is also a stochastic variable. Briefly, one could write Y = X². Defining the function f : R → R by f(x) := x², one can set Y = f ∘ X.
Suppose X = (X_1, ..., X_m) is a stochastic m-vector, defined on the probability space (Ω, A, P). Let f : R^m → R^n be any Borel function. Now one can define a stochastic n-vector Y by Y = f ∘ X. The vector Y is said to be a function of X and is denoted by Y = f(X).

Example 2. Define f : R^m → R by

  f(x_1, ..., x_m) := (x_1 + ... + x_m)/m.

If X = (X_1, ..., X_m) is a stochastic m-vector, then f(X) is a stochastic 1-vector. In this specific case the notation will be f(X) = X̄. So in this notation one has:

  X̄ = (X_1 + ... + X_m)/m.

Given a stochastic m-vector X, one could look upon a Borel function f : R^m → R^n as a stochastic n-vector defined on the probability space (R^m, B_m, P_X). The corresponding probability distribution of f on R^n will be denoted by (P_X)_f. On the other hand, one also has the probability distribution P_{f(X)} of f(X) on R^n.

Proposition I.4.1. One has (P_X)_f = P_{f(X)}.

Proof. A direct consequence of the identity: (f ∘ X)⁻¹(A) = X⁻¹[f⁻¹(A)].

The next proposition is often applied in statistics. For reasons of simplicity the proposition is stated in terms of two stochastic vectors X, Y.

Proposition I.4.2. Suppose f : R^m → R^p and g : R^n → R^q are Borel functions. If X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) are statistically independent stochastic vectors, then so are f(X) and g(Y).

Proof. For all Borel sets A ⊂ R^p and B ⊂ R^q one has

  P_{f(X),g(Y)}(A × B) = P[(f(X), g(Y)) ∈ A × B] = P[(X, Y) ∈ f⁻¹(A) × g⁻¹(B)]
    = P_X[f⁻¹(A)] · P_Y[g⁻¹(B)] = (P_X)_f(A) · (P_Y)_g(B)
    (Proposition I.4.1) = P_{f(X)}(A) · P_{g(Y)}(B).

It follows that P_{f(X),g(Y)} = P_{f(X)} ⊗ P_{g(Y)}, consequently f(X) and g(Y) are independent.
Remark. It is easily verified that Proposition I.4.2 is also valid if more than two stochastic vectors and more than two Borel functions are involved.

Example 3. If X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) are statistically independent, then so are X̄ and Ȳ. To see this, just define

  f(x_1, ..., x_m) := (x_1 + ... + x_m)/m  and  g(x_1, ..., x_n) := (x_1 + ... + x_n)/n

and apply Proposition I.4.2.

The following example shows that caution should be exercised.

Example 4. Let X, Y and Z be as in Example 5 of §I.3. Now the pair X, Y is statistically independent, just as the pair X, Z. However the pair X, Y + Z is not independent. To see this, note that

  P(X = 1 and Y + Z = 2) ≠ P(X = 1) · P(Y + Z = 2).

The left side equals 1/4, the right side 1/8.

Now suppose that X is a stochastic m-vector and f : R^m → R^n a Borel function. One then has probability measures P_X on R^m and P_{f(X)} on R^n. The next theorem is a well-known result in measure theory. It describes the connection between integration with respect to P_X and integration with respect to P_{f(X)}.

Theorem I.4.3. Suppose f : R^m → R^n and φ : R^n → R are Borel functions. Let X be a stochastic m-vector. Then φ is P_{f(X)}-summable if and only if φ ∘ f is P_X-summable. In that case one has

  ∫ φ(t) dP_{f(X)}(t) = ∫ (φ ∘ f)(s) dP_X(s).

For φ ≥ 0 the above (finite or infinite) is always valid.

Proof. As a first step the theorem is proved for cases where φ = 1_A, where A is a Borel set. The result then follows for cases where φ is a simple function. The general case is proved by an argument involving approximation of φ by means of simple functions (see [11], [28]).

Example 5. Let X be a stochastic variable, f(x) := x² and φ(x) := x. Then one has:

  ∫ t dP_{X²}(t) = ∫ s² dP_X(s).
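Example 5 says that integrating t against the distribution of X² gives the same number as integrating s² against the distribution of X. A rough Monte Carlo sketch of this identity follows (an illustration only, assuming NumPy and SciPy); here X is taken standard normal, so both sides should be close to 1. The seed and sample size are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.standard_normal(2_000_000) ** 2          # a large sample drawn from P_{X^2}

lhs = y.mean()                                    # Monte Carlo estimate of the integral of t dP_{X^2}(t)
rhs, _ = quad(lambda s: s**2 * norm.pdf(s), -np.inf, np.inf)   # integral of s^2 dP_X(s)
print(lhs, rhs)                                   # both close to 1
```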
As before the projection of R^n onto the i-th coordinate is denoted by π_i, that is: π_i(x_1, ..., x_n) := x_i.

Definition I.4.1. A function φ : R^m → R is said to depend only on the i-th coordinate if there exists a function φ̃ : R → R such that φ = φ̃ ∘ π_i.

Remark. The function φ̃ is unique. If φ is Borel then so is φ̃.

Example 6. Let φ : R² → R be the function defined by φ(x, y) := x². Defining φ̃ : R → R by φ̃(x) := x², one can set φ = φ̃ ∘ π_1. Thus φ depends on the first coordinate only.

Proposition I.4.4. Suppose φ : R^m → R is a Borel function depending on the i-th coordinate only. If φ = φ̃ ∘ π_i then φ is P_X-summable if and only if φ̃ is P_{X_i}-summable. In that case one has

  ∫ φ dP_X = ∫ φ̃ dP_{X_i}.

Proof. Note that P_{X_i} = P_{π_i(X)} and apply Theorem I.4.3.

Example 7. Let (X, Y) be a stochastic 2-vector and let φ and φ̃ be as in Example 6. Then

  ∫ φ(x, y) dP_{X,Y}(x, y) = ∫ φ̃(x) dP_X(x).

That is:

  ∫ x² dP_{X,Y}(x, y) = ∫ x² dP_X(x).

I.5 Expectation, variance and covariance of stochastic variables
Suppose X is a stochastic m-vector and g : R^m → R is a Borel function. If ∫ |g| dP_X < +∞, in other words, if g is a P_X-summable function, then it makes sense to speak about the integral ∫ g dP_X (see §I.1). In that case the expectation (or mean) of g(X) is defined by

  E[g(X)] := ∫ g dP_X.

In particular, if X is a stochastic variable and g : R → R is defined by g(x) := x, one can express the expectation (mean) of X as

  E(X) = ∫ x dP_X(x),

provided ∫ |x| dP_X(x) < +∞.

The expectation of X could be perceived as follows. When repeating the experiment many times, the arithmetic mean of the outcomes of X can be expected to be close to E(X). In §VII.2 this will be explained in more detail.

Example 1. If X equals the constant a, then P_X = δ_a. One then has

  E[g(X)] = ∫ g(x) dP_X(x) = ∫ g(x) dδ_a(x) = g(a).

In particular: E(X) = a.

Generalizing this example one arrives at:

Example 2. Suppose X is discretely distributed with range W. Then

  P_X = Σ_{a∈W} P(X = a) δ_a.

In that case

  E[g(X)] = Σ_{a∈W} g(a) P(X = a),

provided Σ_{a∈W} |g(a)| P(X = a) < +∞.

Example 3. Suppose X has an absolutely continuous probability distribution with probability density f. Then P_X = fλ, where λ is the Lebesgue measure. Now E[g(X)] exists if and only if

  ∫ |g| dP_X = ∫ |g(x)| f(x) dx < +∞,

in which case the expectation of g(X) can be expressed as

  E[g(X)] = ∫ g(x) f(x) dx

(see Theorem I.1.4). The following example describes a case where E(X) does not exist.

Example 4. Suppose X is Cauchy distributed with parameters α = 0 and β = 1 (see §I.2, Example 5). One then has P_X = fλ, where f is defined by

  f(x) := (1/π) · 1/(x² + 1).

Here

  ∫ |x| dP_X(x) = (1/π) ∫ |x|/(x² + 1) dx = +∞,

so E(X) does not exist.
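The failure of E(X) to exist for the Cauchy distribution shows up in simulation: running sample means of Cauchy observations do not settle down, whereas those of normal observations do. The sketch below is an illustration only, assumes NumPy, and uses an arbitrary seed and sample size.

```python
import numpy as np

rng = np.random.default_rng(42)
size = 100_000
cauchy_means = np.cumsum(rng.standard_cauchy(size)) / np.arange(1, size + 1)
normal_means = np.cumsum(rng.standard_normal(size)) / np.arange(1, size + 1)

for k in (100, 1_000, 10_000, 100_000):
    # the normal running mean approaches 0; the Cauchy running mean keeps jumping around
    print(k, normal_means[k - 1], cauchy_means[k - 1])
```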
Proposition I.5.1.
(i) Suppose X is a stochastic variable and a ∈ R. If the expectation of X exists, then so does the expectation of aX and one has E(aX) = a E(X).
(ii) If E(X) and E(Y) exist, then so does E(X + Y) and E(X + Y) = E(X) + E(Y).

Proof. (i) If E(X) exists, then

  ∫ |x| dP_X(x) < +∞.

Defining g : R → R by g(x) := ax, one has

  ∫ |g| dP_X = ∫ |ax| dP_X(x) = |a| ∫ |x| dP_X(x) < +∞.

So E[g(X)] = E(aX) exists. Furthermore one can write

  E(aX) = E[g(X)] = ∫ ax dP_X(x) = a ∫ x dP_X(x) = a E(X).

(ii) Suppose E(X) and E(Y) exist. Define g : R² → R by: g(x, y) = x + y. Then X + Y = g(X, Y). To prove the existence of E(X + Y), note that

  ∫ |g| dP_{X,Y} = ∫ |x + y| dP_{X,Y}(x, y)
    ≤ ∫ (|x| + |y|) dP_{X,Y}(x, y)
    = ∫ |x| dP_{X,Y}(x, y) + ∫ |y| dP_{X,Y}(x, y)
    = ∫ |x| dP_X(x) + ∫ |y| dP_Y(y) < +∞.

The last equality in this chain is a consequence of Proposition I.4.4. By applying Proposition I.4.4 again one proves that E(X + Y) = E(X) + E(Y).

In connection to Proposition I.5.1 one may wonder whether the equality E(XY) = E(X)E(Y) is also valid. If X and Y are statistically independent then this is indeed the case. Fubini's theorem about successive integration is needed to prove this. The following version of this theorem (see also Appendix B) is formulated here:

Theorem I.5.2 (G. Fubini). Let X = (X_1, ..., X_m) and Y = (Y_1, ..., Y_n) be statistically independent stochastic vectors and let g : R^{m+n} → R be a Borel function. If g ≥ 0 or if g is P_{X,Y}-summable, then one has

  ∫ g(x, y) dP_{X,Y}(x, y) = ∫ {∫ g(x, y) dP_X(x)} dP_Y(y) = ∫ {∫ g(x, y) dP_Y(y)} dP_X(x).

Proof. See for example [11], [28], [64].

Proposition I.5.3. If X and Y are statistically independent with existing expectations, then the expectation E(XY) also exists and one has

  E(XY) = E(X)E(Y).

Proof. Suppose X and Y are independent and suppose both E(X) and E(Y) exist. Define g : R² → R by g(x, y) := xy. Then, of course, XY = g(X, Y). Moreover

  ∫ |g| dP_{X,Y} = ∫ |xy| dP_{X,Y} = ∫ |x||y| d(P_X ⊗ P_Y)
    (Fubini's theorem) = ∫ {∫ |x||y| dP_X(x)} dP_Y(y)
    = ∫ |y| {∫ |x| dP_X(x)} dP_Y(y)
    = ∫ |x| dP_X(x) · ∫ |y| dP_Y(y) < +∞.

This proves the existence of E(XY). Similarly it can be proved that E(XY) = E(X) E(Y).

Suppose X is a stochastic variable of which the expectation E(X²) exists. It can be proved (see Exercise 18) that this implies the existence of E(X). The variance of X, denoted by var(X), is in such cases defined by

  var(X) := E(X²) − E(X)².
The variance of X is a measure of the amount of spread of the outcomes of X around the expectation E(X), when the experiment is repeated unendingly.

Example 5. If X is the constant a, then E(X) = a and E(X²) = a² (see Example 1), so

  var(X) = a² − a² = 0.

Obviously the outcomes of X will show no spread in this case.

Example 6. Let X be a stochastic variable. Suppose that P_X = fλ, where f is defined by

  f(x) = 1/(β − α) if x ∈ (α, β),  f(x) = 0 elsewhere.

In that case X is said to be uniformly distributed in the interval (α, β). One then has (see Exercise 19): var(X) = (β − α)²/12.

Proposition I.5.4. Suppose E(X²) exists. Denoting μ = E(X), one has

  var(X) = E[(X − μ)²].

Proof. Left as an exercise to the reader.

Remark. By the proposition above one always has: var(X) ≥ 0.

Proposition I.5.5. If var(X) = 0, then P(X = μ) = 1, where μ = E(X). In other words, with probability 1 the outcome of X will be equal to μ.

Proof. Define A ⊂ R by A := {x : x ≠ μ} or, equivalently, by

  A := {x : (x − μ)² > 0}.

Furthermore, define A_0 by A_0 := {x : (x − μ)² > 1} and define for all n ≥ 1

  A_n := {x : 1/(n + 1) < (x − μ)² ≤ 1/n}.

Now (A_n)_{n∈N} is a collection of mutually disjoint Borel sets in R and one has

  A = ⋃_{n=0}^∞ A_n.

Consequently

  P_X(A) = Σ_{n=0}^∞ P_X(A_n).

Now it is proved that P_X(A) = 0 by verifying that P_X(A_n) = 0 for all n ∈ N. To see the latter, note that for all x ∈ A_n one has

  1/(n + 1) ≤ (x − μ)².

Integrating this inequality over the set A_n one arrives at

  P_X(A_n)/(n + 1) ≤ ∫_{A_n} (x − μ)² dP_X(x).

However, one also has

  ∫_{A_n} (x − μ)² dP_X(x) ≤ ∫ (x − μ)² dP_X(x) = E[(X − μ)²] = var(X) = 0.

As a result, P_X(A_n) = 0 for all n ∈ N. As said before this implies that P_X(A) = 0. This is the same as saying that P_X(A^c) = 1. However, A^c is the singleton {μ}. It thus appears that P_X({μ}) = P(X = μ) = 1.

Suppose X and Y are stochastic variables the variance of which exists. From the inequality |xy| ≤ ½(x² + y²) it follows (applying Proposition I.4.4) that

  ∫ |xy| dP_{X,Y}(x, y) ≤ ½ ∫ (x² + y²) dP_{X,Y}(x, y)
    = ½ ∫ x² dP_{X,Y}(x, y) + ½ ∫ y² dP_{X,Y}(x, y)
    = ½ ∫ x² dP_X(x) + ½ ∫ y² dP_Y(y) < +∞.

Therefore ∫ |xy| dP_{X,Y}(x, y) < +∞, which guarantees the existence of E(XY). This observation justifies the following definition:

Definition I.5.1. If the variance of the stochastic variables X and Y exists, then the covariance of X and Y is defined by

  cov(X, Y) := E(XY) − E(X)E(Y).

Note that cov(X, X) = var(X).

Proposition I.5.6. Suppose var(X) and var(Y) exist. Denoting μ_X = E(X) and μ_Y = E(Y), one has

  cov(X, Y) = E[(X − μ_X)(Y − μ_Y)].

Proof. Left as an exercise to the reader.
Proposition I.5.7. The map X, Y ↦ cov(X, Y) is bilinear, symmetric, and one has cov(X, X) ≥ 0 for all X.

Proof. Left as an exercise to the reader.

Proposition I.5.8. If cov(X, Y) = 0, then var(X + Y) = var(X) + var(Y).

Proof. One has

  var(X + Y) = cov(X + Y, X + Y)
    = cov(X, X) + cov(Y, X) + cov(X, Y) + cov(Y, Y)
    = var(X) + var(Y).

Remark. Proposition I.5.8 can easily be generalized: If X_1, ..., X_n are stochastic variables with existing variances, then the condition that cov(X_i, X_j) = 0 for all i ≠ j guarantees that

  var(Σ_i X_i) = Σ_i var(X_i).

Proposition I.5.9. If var(X) exists and a ∈ R, then var(aX) exists also and

  var(aX) = a² var(X).

Proof. This is obvious from the definition of variance.

Proposition I.5.10. Suppose X and Y are stochastic variables with existing variances. If X and Y are statistically independent, then

  cov(X, Y) = 0.

Proof. This follows directly from Definition I.5.1 and Proposition I.5.3.

The following example shows that the condition cov(X, Y) = 0 is in general not sufficient to guarantee statistical independence of X and Y.

Example 7. Define a stochastic variable X satisfying:

  P(X = +1) = 1/3,  P(X = 0) = 1/3  and  P(X = −1) = 1/3.

Let Y = X². Then

  cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) − E(X)E(X²) = 0,

because E(X³) = 0 and E(X) = 0. However, the variables X and Y are not statistically independent! (Verify this.)
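Example 7 can be verified directly from the discrete distribution. The sketch below is an illustration only and uses exact fractions from the Python standard library.

```python
from fractions import Fraction

p = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}   # distribution of X
E = lambda g: sum(g(x) * w for x, w in p.items())

EX, EY = E(lambda x: x), E(lambda x: x**2)           # Y = X^2
cov = E(lambda x: x * x**2) - EX * EY                # E(XY) - E(X)E(Y)
print(cov)                                           # 0

# Yet X and Y are not independent:
P_X1_and_Y1 = p[1]                                   # the event {X = 1} implies {Y = 1}
P_X1, P_Y1 = p[1], p[1] + p[-1]                      # P(Y = 1) = P(X = 1) + P(X = -1)
print(P_X1_and_Y1, P_X1 * P_Y1)                      # 1/3 versus 2/9
```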
The covariance of two variables turns out to be a translation invariant quantity:

Proposition I.5.11. If X and Y are stochastic variables with existing variance and if a, b ∈ R, then

  cov(X + a, Y + b) = cov(X, Y).

Proof. For reasons of symmetry it is sufficient to prove cov(X + a, Y) = cov(X, Y). To see this, note that

  cov(X + a, Y) = E[(X + a)Y] − E(X + a)E(Y)
    = E(XY) + aE(Y) − (E(X) + a)E(Y)
    = E(XY) − E(X)E(Y) = cov(X, Y).

Definition I.5.2. If X is a stochastic variable with existing variance, then the variable

  X̃ := (X − E(X))/√(var(X))

is said to be the standardized form of X (or the standardization of X).

This definition is justified by the following proposition:

Proposition I.5.12. The standardized form X̃ of X always satisfies:

  E(X̃) = 0  and  var(X̃) = 1.

Proof. Set μ := E(X) and σ := √(var(X)). Then one can write, applying Proposition I.5.1,

  E(X̃) = E((X − μ)/σ) = (1/σ) E(X − μ) = (1/σ)(E(X) − μ) = 0.

Furthermore, by the Propositions I.5.9 and I.5.11, one has

  var(X̃) = var((X − μ)/σ) = (1/σ²) var(X − μ) = (1/σ²) cov(X − μ, X − μ)
    = (1/σ²) cov(X, X) = var(X)/var(X) = 1.

Lemma I.5.13. Let a ∈ R and let X be a stochastic variable that satisfies the identity P(X = a) = 1. Then for every stochastic variable Y the pair X, Y is statistically independent.
29
Proof. Suppose A and B are Borel sets in R. If a 2 A, then P .X 2 A/ D 1 and P .X 2 Ac / D 0: Consequently P .X 2 Ac
and
Y 2 B/ D 0:
Therefore P .X 2 A and Y 2 B/ D P .X 2 A and Y 2 B/ C P .X 2 Ac and Y 2 B/ D P .X 2 R and Y 2 B/ D P .Y 2 B/
D P .X 2 A/P .Y 2 B/: If a … A, then P .X 2 A/ D 0 and hence also P .X 2 A and Y 2 B/ D 0. Thus it appears that in this case one also has P .X 2 A and Y 2 B/ D P .X 2 A/P .Y 2 B/: The equality is valid for all Borel sets A; B R; consequently X and Y are statistically independent. Proposition I.5.14 (Inequality of Cauchy–Schwarz–Bunyakovski). If X and Y are stochastic variables with existing variance, then 2 cov.X; Y / var.X / var.Y /:
Equality occurs if and only if there exist constants a; b; c (where a2 C b 2 ¤ 0) such that P .aX C bY D c/ D 1. Proof. If var.X / D 0, then by Proposition I.5.5 there is a constant c 2 R such that P .X D c/ D 1. Therefore, by Proposition I.5.13, the pair X; Y is statistically independent. From Proposition I.5.10 it follows that cov.X; Y / D 0, so in this case 2 cov.X; Y / D var.X / var.Y /:
Note that P .aX C bY D c/ D 1 if one chooses a D 1 and b D 0. Next, consider the case where var.X/ ¤ 0. Define the quadratic function f W R ! R by f ./ WD 2 var.X/ C 2 cov.X; Y / C var.Y /: This is the same as defining f by: f ./ D var.X C Y / D cov.X C Y; X C Y /:
30
Chapter I Probability theory
It follows that f(λ) ≥ 0 for all λ ∈ R. For this reason f cannot have more than one zero. Consequently the discriminant of this quadratic function cannot be a positive number:

   {2 cov(X, Y)}² − 4 var(X) var(Y) ≤ 0.

The inequality of Cauchy–Schwarz–Bunyakovski immediately follows from this. If equality occurs then the discriminant of f vanishes. In that case f has exactly one zero a ∈ R:

   f(a) = var(aX + Y) = 0.

Now by Proposition I.5.5 there exists a constant c such that P(aX + Y = c) = 1. Therefore, if one chooses b = 1, then P(aX + bY = c) = 1. Conversely, if

   P(aX + bY = c) = 1   (a² + b² ≠ 0),

then one has equality in the inequality of Cauchy–Schwarz–Bunyakovski. This can be proved by easy verification (just read the above in a reversed direction).

If X is a stochastic variable with existing variance then the expression √var(X) is said to be the standard deviation of X. This quantity will be denoted by σ(X) or σ_X.

Definition I.5.3. If X and Y are stochastic variables with existing variance and if neither var(X) nor var(Y) equals zero, then the correlation coefficient of X and Y is defined by

   ρ(X, Y) := cov(X, Y) / (σ(X) σ(Y)).

Proposition I.5.15. If X and Y are stochastic variables with existing correlation coefficient, then

   −1 ≤ ρ(X, Y) ≤ +1.

Equality to ±1 occurs if and only if there exist constants a, b, c (with a² + b² ≠ 0) such that P(aX + bY = c) = 1.

Proof. Apply Proposition I.5.14.
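The following short Python sketch is not part of the original text; it is added here as a numerical illustration of Definition I.5.2, Definition I.5.3 and Proposition I.5.15. The simulated variables, the sample size and the seed are arbitrary choices made for the illustration only.

    import random, statistics

    random.seed(1)
    n = 100_000

    # Simulate X with mean 2 and standard deviation 3, and a Y that depends on X.
    x = [random.gauss(2, 3) for _ in range(n)]
    y = [0.5 * xi + random.gauss(0, 1) for xi in x]

    # Empirical standardization of X (Definition I.5.2): mean 0, variance 1.
    mx, sx = statistics.mean(x), statistics.stdev(x)
    x_std = [(xi - mx) / sx for xi in x]
    print(statistics.mean(x_std), statistics.variance(x_std))

    # Sample correlation coefficient (Definition I.5.3); by Proposition I.5.15
    # its theoretical counterpart always lies between -1 and +1.
    my, sy = statistics.mean(y), statistics.stdev(y)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    print(cov_xy / (sx * sy))

Running the sketch prints values close to 0 and 1 for the standardized sample, and a correlation coefficient strictly between −1 and +1.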
I.6
Statistical independence of normally distributed variables
In Example 4 of §I.4 it was shown that, concerning statistical independence, one could easily tumble into venomous pitfalls. In this section it will be explained that the world looks much friendlier if it is assumed that the components of stochastic vectors are
normally distributed. Recall that a stochastic variable is said to be normally distributed with parameters μ and σ if it has the function

   f : x ↦ (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}

as its probability density. In Exercise 35 the reader is asked to verify that this function satisfies the characterizing properties of a probability density indeed. In the following a normal distribution with parameters μ and σ will be referred to as an N(μ, σ²)-distribution. The N(0, 1)-distribution is said to be the standard normal distribution.

Theorem I.6.1. If the stochastic variable X is N(μ, σ²)-distributed, then:

(i) The variable pX + q, where p ≠ 0, is N(pμ + q, p²σ²)-distributed.

(ii) The variable (X − μ)/σ is N(0, 1)-distributed.
Proof. The function φ : x ↦ px + q is introduced to prove (i). Now for every Borel set A ⊂ R one has

   P(φ(X) ∈ A) = P(X ∈ φ⁻¹(A)) = ∫_{φ⁻¹(A)} (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx.

Carrying out the substitution x = φ⁻¹(s) = (s − q)/p in the integral on the right hand side, one obtains

   P(φ(X) ∈ A) = ∫_A (1/(σ√(2π))) e^{−(φ⁻¹(s)−μ)²/(2σ²)} (1/|p|) ds
               = ∫_A (1/(|p|σ√(2π))) e^{−(s−q−pμ)²/(2p²σ²)} ds.

The integrand in the last integral is exactly the probability density corresponding to an N(pμ + q, p²σ²)-distribution. It therefore appears that φ(X) = pX + q is N(pμ + q, p²σ²)-distributed, which proves (i). Statement (ii) is a direct consequence of (i).

The choice of the suggestive characters μ and σ in the definition of a normal probability density is justified by the following theorem:

Theorem I.6.2. If a stochastic variable X is N(μ, σ²)-distributed, then

   E(X) = μ   and   var(X) = σ².
Proof. (Split up into two steps.)

Step 1. The case that X is N(0, 1)-distributed. One then has

   ∫_{−∞}^{+∞} |x| f_X(x) dx = ∫_{−∞}^{+∞} |x| (1/√(2π)) e^{−x²/2} dx < +∞

and

   ∫_{−∞}^{+∞} x² (1/√(2π)) e^{−x²/2} dx < +∞.

In this way it follows that E(X) and var(X) exist. Now on the one hand one has

   E(X) = ∫_{−∞}^{+∞} x f_X(x) dx = ∫_{−∞}^{+∞} x (1/√(2π)) e^{−x²/2} dx
        = lim_{a→+∞} (1/√(2π)) ∫_{−a}^{+a} x e^{−x²/2} dx
        = lim_{a→+∞} (1/√(2π)) [−e^{−x²/2}]_{−a}^{+a} = 0.

On the other hand, applying partial integration, one gets

   var(X) = E(X²) − E(X)² = E(X²) = ∫_{−∞}^{+∞} x² f_X(x) dx
          = ∫_{−∞}^{+∞} x² (1/√(2π)) e^{−x²/2} dx
          = lim_{a→+∞} (1/√(2π)) { [−x e^{−x²/2}]_{−a}^{+a} + ∫_{−a}^{+a} e^{−x²/2} dx }
          = (1/√(2π)) ∫_{−∞}^{+∞} e^{−x²/2} dx = 1.

This proves Step 1.

Step 2. To reduce the general case to Step 1, set

   X̃ := (X − μ)/σ.

By Theorem I.6.1 the variable X̃ is N(0, 1)-distributed. Applying the result of Step 1 to X̃ one gets

   E(X) = E(σX̃ + μ) = σ E(X̃) + μ = μ.

Furthermore, from Proposition I.5.9 and Proposition I.5.11 it follows that

   var(X) = var(σX̃ + μ) = var(σX̃) = σ² var(X̃) = σ².

This proves the theorem.
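As an informal numerical check of Theorems I.6.1 and I.6.2 (an addition to the text, not part of the original), the following Python sketch simulates an N(μ, σ²)-distributed variable, forms pX + q, and compares the empirical mean and variance of the result with pμ + q and p²σ². The parameter values are arbitrary.

    import random, statistics

    random.seed(2)
    mu, sigma, p, q = 1.5, 2.0, -3.0, 4.0

    x = [random.gauss(mu, sigma) for _ in range(200_000)]
    y = [p * xi + q for xi in x]

    # Empirical mean and variance of pX + q versus the values predicted
    # by Theorem I.6.1 combined with Theorem I.6.2.
    print(statistics.mean(y), p * mu + q)
    print(statistics.variance(y), p**2 * sigma**2)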
Remark. If X is an N(μ, σ²)-distributed variable, then (by Theorem I.6.2) the variable

   X̃ = (X − μ)/σ

presents the standardized form of X in the sense of Definition I.5.2. By Theorem I.6.1 the standardized form X̃ of X always has a standard normal distribution. Our terminology is therefore consistent.

In the following inner products will be used to compactify notations. The standard inner product of two vectors x = (x₁, …, xₙ) and y = (y₁, …, yₙ) in Rⁿ will systematically be denoted by ⟨x, y⟩. That is:

   ⟨x, y⟩ := x₁y₁ + ··· + xₙyₙ.

The next proposition is formulated in this notation:

Proposition I.6.3. Suppose that X₁, …, Xₙ are statistically independent N(0, σ²)-distributed stochastic variables. Then, with respect to the Lebesgue measure on Rⁿ, the measure P_{X₁,…,Xₙ} has a density f : Rⁿ → [0, +∞) defined by

   f(x) := (1/(σⁿ(2π)^{n/2})) e^{−⟨x,x⟩/(2σ²)}   (x ∈ Rⁿ).

Proof. By Theorem I.3.2 the stochastic n-vector (X₁, …, Xₙ) has, under the mentioned conditions, a probability density f defined by f := f₁ ⊗ ··· ⊗ fₙ, where

   fᵢ(xᵢ) := (1/(σ√(2π))) e^{−xᵢ²/(2σ²)}   (i = 1, …, n).

Consequently, when writing x = (x₁, …, xₙ), then

   f(x₁, …, xₙ) = (1/(σⁿ(2π)^{n/2})) e^{−(x₁²+···+xₙ²)/(2σ²)} = (1/(σⁿ(2π)^{n/2})) e^{−⟨x,x⟩/(2σ²)}.

Definition I.6.1. Let X = (X₁, …, Xₙ) be a stochastic n-vector. The probability distribution P_X of X is said to be rotation invariant if for every orthogonal linear operator Q : Rⁿ → Rⁿ one has P_X = P_{QX}. If the reader is not familiar with the concept of orthogonal operators in linear algebra, then consult for example [52], [55], [65] (see also §VIII.1).
Note that P_X is rotation invariant if and only if for every Borel set A ⊂ Rⁿ and every orthogonal linear map Q : Rⁿ → Rⁿ one has P(X ∈ A) = P(X ∈ QA).

Proposition I.6.4. Suppose the X₁, …, Xₙ constitute a statistically independent set of N(0, σ²)-distributed variables, then P_{X₁,…,Xₙ} is rotation invariant.

Proof. For an arbitrary Borel set A ⊂ Rⁿ and an arbitrary orthogonal linear operator Q : Rⁿ → Rⁿ one has (by Proposition I.6.3)

   P((X₁, …, Xₙ) ∈ QA) = ∫_{QA} (1/(σⁿ(2π)^{n/2})) e^{−⟨x,x⟩/(2σ²)} dx.

In the integral on the right hand side a change of variables is carried out: x = Qu. The substitution formula for multiple integrals now gives (see [43], [64]):

   P((X₁, …, Xₙ) ∈ QA) = ∫_A (1/(σⁿ(2π)^{n/2})) e^{−⟨Qu,Qu⟩/(2σ²)} J_Q du,

where J_Q is the Jacobian of Q, that is, |det(DQ)|. Because Q is a linear map one has DQ = Q. It is well known (see [52], [55], [65] or §VIII.1) that det(Q) = ±1. It follows that the Jacobian of an orthogonal linear map is 1. Furthermore one has ⟨Qu, Qu⟩ = ⟨u, u⟩. It thus appears that

   P_{X₁,…,Xₙ}(QA) = P((X₁, …, Xₙ) ∈ QA) = ∫_A (1/(σⁿ(2π)^{n/2})) e^{−⟨u,u⟩/(2σ²)} du.

By Proposition I.6.3 the right side of this equality is P_{X₁,…,Xₙ}(A). Therefore P_{X₁,…,Xₙ}(QA) = P_{X₁,…,Xₙ}(A) for every orthogonal linear operator Q : Rⁿ → Rⁿ and every Borel set A ⊂ Rⁿ. This proves the proposition.

Remark. It is an interesting note that a converse of the statement in Proposition I.6.4 is also true (see Exercise 40).

Suppose X₁, …, Xₙ are statistically independent and N(0, 1)-distributed. Let V be the linear span of the X₁, …, Xₙ. That is:

   V := { Σᵢ cᵢXᵢ : cᵢ ∈ R (i = 1, …, n) }.

On V an inner product ⟨·, ·⟩ is defined by

   ⟨M, N⟩ := cov(M, N)   for all M, N ∈ V.
The verification that ⟨·, ·⟩ presents an inner product on V is left to the reader. Note that, with respect to this inner product, the system {X₁, …, Xₙ} is an orthonormal base in V.

Lemma I.6.5. Suppose X₁, …, Xₙ are statistically independent and N(0, 1)-distributed. If the elements Y₁, …, Yₙ constitute an orthonormal base in V, then the stochastic n-vectors X = (X₁, …, Xₙ) and Y = (Y₁, …, Yₙ) are identically distributed; that is

   P(X ∈ A) = P(Y ∈ A)   for every Borel set A ⊂ Rⁿ.

Proof. Let Φ : V → V be the orthogonal linear map which converts the base {X₁, …, Xₙ} into the base {Y₁, …, Yₙ}. Denote the matrix of Φ with respect to the base {X₁, …, Xₙ} by [Φ], its transposed by [Φ]ᵗ. Note that [Φ]⁻¹ = [Φ]ᵗ. Now

   Yⱼ = Σᵢ₌₁ⁿ [Φ]ᵢⱼ Xᵢ = Σᵢ₌₁ⁿ [Φᵗ]ⱼᵢ Xᵢ.

It follows from this that for the orthogonal linear operator Q : Rⁿ → Rⁿ with matrix [Φ]ᵗ one has Y = QX. In virtue of the rotation invariance of P_X it follows that for every Borel set A ⊂ Rⁿ one has:

   P(Y ∈ A) = P(QX ∈ A) = P(X ∈ Q⁻¹A) = P(X ∈ A).

This proves that X and Y are identically distributed.

Lemma I.6.5 will now be used to prove two important theorems.

Theorem I.6.6. Suppose X₁, …, Xₙ are statistically independent stochastic variables. If for all i = 1, …, n the variable Xᵢ is N(μᵢ, σᵢ²)-distributed, then the variable S := X₁ + ··· + Xₙ is N(μ₁ + ··· + μₙ, σ₁² + ··· + σₙ²)-distributed.
Proof. (Split up into three steps.)

Step 1. Suppose X₁, X₂ are statistically independent and both N(0, 1)-distributed. As before, let V be the linear span of X₁, X₂, equipped with the inner product mentioned above. Now for arbitrary a, b ∈ R (at least one of which is non-zero) the system

   { (aX₁ + bX₂)/√(a² + b²) , (bX₁ − aX₂)/√(a² + b²) }

is an orthonormal base in V. Therefore, using Lemma I.6.5, it follows that the two stochastic 2-vectors

   (X₁, X₂)   and   ( (aX₁ + bX₂)/√(a² + b²) , (bX₁ − aX₂)/√(a² + b²) )

are identically distributed. But then the components

   X₁   and   (aX₁ + bX₂)/√(a² + b²)

are also identically distributed (Exercise 27), that is, N(0, 1)-distributed.

Step 2. Suppose that X₁, X₂ are statistically independent and that for i = 1, 2 the variable Xᵢ is N(μᵢ, σᵢ²)-distributed. Then the standardized forms X̃₁ and X̃₂ are also statistically independent (Proposition I.4.2) and they are both N(0, 1)-distributed (Theorem I.6.2). By Step 1 it follows that

   (σ₁X̃₁ + σ₂X̃₂) / √(σ₁² + σ₂²)

is N(0, 1)-distributed. One can write

   X₁ + X₂ = √(σ₁² + σ₂²) · ( (σ₁X̃₁ + σ₂X̃₂) / √(σ₁² + σ₂²) ) + (μ₁ + μ₂).

Therefore, as a consequence of Theorem I.6.1, the variable X₁ + X₂ is N(μ₁ + μ₂, σ₁² + σ₂²)-distributed.

Step 3. Using Step 2, the proof of the theorem can now easily be completed by an argument involving induction.

As will be seen later on, the following theorem will be a powerful tool in proving statistical independence when dealing with samples from normally distributed populations.

Theorem I.6.7. Suppose X₁, …, Xₙ are statistically independent normally distributed stochastic variables. Let V be the linear span of X₁, …, Xₙ. If M₁, …, M_p ∈ V and N₁, …, N_q ∈ V and for all i, j one has

   cov(Mᵢ, Nⱼ) = 0,

then the stochastic vectors (M₁, …, M_p) and (N₁, …, N_q) are statistically independent.
Proof. (Split up into two steps.)

Step 1. The case in which the X₁, …, Xₙ are independent and N(0, 1)-distributed. Let M ⊂ V be the linear span of the M₁, …, M_p and N the linear span of the N₁, …, N_q. Then, with respect to the inner product on V, one has M ⊥ N. Choose orthonormal bases {E₁, …, E_s} and {F₁, …, F_t} in M and N respectively. Extend the joint system {E₁, …, E_s, F₁, …, F_t} to an orthonormal base {E₁, …, E_s, F₁, …, F_t, G₁, …, G_u} of V. Denote

   E := (E₁, …, E_s),   F := (F₁, …, F_t)   and   G := (G₁, …, G_u).

Now by Lemma I.6.5 the vectors X = (X₁, …, Xₙ) and (E, F, G) are identically distributed. But then the vectors (X₁, …, X_{s+t}) and (E, F) are also identically distributed (see Exercise 27). According to Proposition I.3.4 the vectors (X₁, …, X_s) and (X_{s+1}, …, X_{s+t}) are statistically independent; therefore one has for A ⊂ R^s and B ⊂ R^t that

   P((E, F) ∈ A × B) = P((X₁, …, X_{s+t}) ∈ A × B)
                     = P((X₁, …, X_s) ∈ A) P((X_{s+1}, …, X_{s+t}) ∈ B).

By choosing successively A = R^s and B = R^t it appears that

   P((E, F) ∈ A × B) = P(E ∈ A) P(F ∈ B).

This proves that E and F are statistically independent. Moreover, there exist linear maps φ : R^s → R^p and ψ : R^t → R^q such that

   (M₁, …, M_p) = φ(E₁, …, E_s)   and   (N₁, …, N_q) = ψ(F₁, …, F_t).

An application of Proposition I.4.2 shows that the stochastic vectors (M₁, …, M_p) and (N₁, …, N_q) are statistically independent. This completes the proof of Step 1.

Step 2. Suppose that the X₁, …, Xₙ are statistically independent and that Xᵢ is N(μᵢ, σᵢ²)-distributed (all i). For all X ∈ V denote, as before, the standardized form of X by

   X̃ = (X − μ)/σ.

Now the X̃₁, …, X̃ₙ are statistically independent and N(0, 1)-distributed. Let Ṽ be the linear span of the X̃₁, …, X̃ₙ. Then M̃ᵢ ∈ Ṽ and Ñⱼ ∈ Ṽ for all i, j. Furthermore

   cov(M̃ᵢ, Ñⱼ) = 0   for all i, j.

By Step 1 it follows that the stochastic vectors (M̃₁, …, M̃_p) and (Ñ₁, …, Ñ_q) are statistically independent. By Proposition I.4.2 it follows that (M₁, …, M_p) and (N₁, …, N_q) are statistically independent.
Remark 1. If M, N ∈ V, then

   cov(M, N) = 0   ⟺   M and N statistically independent.

Remark 2. If X, Y, Z ∈ V, while the pair X, Y is independent and the pair X, Z is independent, then X and Y + Z are also independent. This can obviously be explained by the fact that

   cov(X, Y + Z) = cov(X, Y) + cov(X, Z) = 0.

Thus a pitfall as illustrated in §I.4, Example 4, is not possible within V.

Remark 3. In Theorem I.6.7 the statistical independence of two stochastic vectors (M₁, …, M_p) and (N₁, …, N_q) is involved. However, by applying similar arguments one can prove versions of this theorem in which more than two stochastic vectors occur.

Remark 4. Theorem I.6.7 is highly characteristic of normal distributions. That is to say, if the X₁, …, Xₙ enjoy any non-normal distribution then the statement in the theorem is always wrong (see [69]).
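To make Remark 1 concrete, here is a small simulation, added for illustration only: for independent N(0, 1)-distributed X₁, X₂ the combinations M = X₁ + X₂ and N = X₁ − X₂ lie in V and satisfy cov(M, N) = var(X₁) − var(X₂) = 0, so by Theorem I.6.7 they are independent. The sketch checks one consequence, namely P(M ≤ 0 and N ≤ 0) ≈ P(M ≤ 0) P(N ≤ 0).

    import random

    random.seed(3)
    trials = 200_000
    count_m = count_n = count_both = 0
    for _ in range(trials):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
        m, n = x1 + x2, x1 - x2        # cov(M, N) = 0 within V
        count_m += (m <= 0)
        count_n += (n <= 0)
        count_both += (m <= 0 and n <= 0)

    p_m, p_n, p_both = count_m / trials, count_n / trials, count_both / trials
    print(p_both, p_m * p_n)           # both numbers are close to 0.25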
I.7
Distribution functions and probability distributions
Let (Ω, A, P) be a probability space and X : Ω → R a stochastic variable. The (cumulative) distribution function of X is understood to be the function F_X : R → [0, 1], defined by

   F_X(x) := P(X ≤ x) = P_X((−∞, x])   (x ∈ R).

A distribution function always has the following three properties (see Exercise 44):

(i) lim_{x→−∞} F_X(x) = 0 and lim_{x→+∞} F_X(x) = 1,

(ii) F_X is an increasing function on R,

(iii) F_X is continuous from the right.

One can prove that to every function F : R → [0, 1] that satisfies (i), (ii) and (iii) there corresponds a unique Borel probability measure P on R such that

   F(x) = P((−∞, x])

(see for example [11], [64]). If, on the probability space (R, B, P), the stochastic variable X is defined by setting X(x) := x for all x ∈ R, then one obviously has

   F(x) = P((−∞, x]) = P_X((−∞, x]).
This shows that every function F that satisfies the conditions (i), (ii) and (iii) is the distribution function of at least one stochastic variable. The observations above are summarized in the following theorem.

Theorem I.7.1. To every distribution function F there is at least one stochastic variable X such that F = F_X. Moreover, for any pair X, Y of stochastic variables one has F_X = F_Y if and only if P_X = P_Y.

Remark. Traditionally one often writes

   ∫ g(x) dF_X(x)   or   ∫ g(x) F_X(dx)

instead of

   ∫ g(x) dP_X(x).

These notations can be justified by the previous theorem. The integrals can be regarded as Riemann–Stieltjes integrals.

If X has an absolutely continuous distribution with probability density f : R → [0, +∞), then

   F_X(x) = P_X((−∞, x]) = ∫_{−∞}^{x} f(t) dt.

In that case one has in every point x in which f is continuous: F_X′(x) = f(x). Some illustrating examples are given now:

Example 1. Suppose X is an N(0, 1)-distributed stochastic variable. Then the probability density f is defined by

   f(x) = (1/√(2π)) e^{−x²/2}.

The distribution function F_X is usually denoted by the Greek capital letter Φ, so in this notation one has

   Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

Consequently

   Φ′(x) = (1/√(2π)) e^{−x²/2}   for every x ∈ R.

The values of Φ are tabulated in detail (see Table II).
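The values of Φ can also be computed directly, without tables. The following lines, added here as an illustration, use the identity Φ(x) = (1 + erf(x/√2))/2 (a standard fact about the error function which is not stated in the text) and reproduce the kind of values listed in Table II.

    import math

    def Phi(x):
        # Distribution function of the standard normal distribution.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    print(Phi(0.0))     # 0.5
    print(Phi(1.25))    # approximately 0.8944
    print(Phi(-1.25))   # approximately 0.1056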
Example 2. Suppose X has an absolutely continuous distribution with a probability density given by

   f(x) = e^{−x}   if x ≥ 0,
   f(x) = 0        if x < 0.

Then

   F_X(x) = ∫_{−∞}^{x} f(t) dt = ∫_{0}^{x} e^{−t} dt   if x ≥ 0,   and   F_X(x) = 0   elsewhere.

It follows that

   F_X(x) = 1 − e^{−x}   if x ≥ 0,
   F_X(x) = 0            elsewhere.
Here one has F_X′(x) = f(x) for all x ∈ R, with the exception of the point x = 0 (see Exercise 20).

Intermezzo. In Definition I.2.4 the concept of an "absolutely continuous probability distribution" was introduced. One talks about an absolutely continuous probability distribution of a variable X if there exists a Borel function f : R → [0, +∞) such that

   P_X(A) = ∫_A f(t) dt   (A Borel in R).

This is equivalent to

   F_X(x) = P_X((−∞, x]) = ∫_{−∞}^{x} f(t) dt   for all x ∈ R.

As noticed earlier, in the phrase "absolutely continuous distribution" the adjective "continuous" has nothing to do with possible continuity of the probability density function f. It is however true that in the case of an absolutely continuous probability distribution the distribution function F is always continuous. Is perhaps absolute continuity of the probability distribution equivalent to continuity of the distribution function F? The answer is: No! To be more precise, one can prove the existence*) of distribution functions F which are continuous everywhere on R but nevertheless cannot be represented as

   F(x) = ∫_{−∞}^{x} f(t) dt   (x ∈ R).

*) In the past, probability distributions corresponding to these kinds of distribution functions were supposed to be of no practical value. In modern physics, however, they nowadays play an important role.

Can absolute continuity of the probability distribution be recognized in terms of its distribution function F? The answer is: Yes! This problem is closely related to
Lebesgue's differentiation theorems. A beautiful exposition (including rigorous proofs) of these deep theorems can be found in [60]. It turns out that a probability distribution is absolutely continuous if and only if F has the following property: To every ε > 0 there is a δ > 0 such that

   Σᵢ₌₁ⁿ (βᵢ − αᵢ) < δ   implies   Σᵢ₌₁ⁿ |F(βᵢ) − F(αᵢ)| < ε

whenever (α₁, β₁), …, (αₙ, βₙ) are disjoint intervals.

A function F satisfying this property is (also) called an absolutely continuous function. So, in this terminology, a probability distribution is absolutely continuous if its distribution function F is so. It is easy to see that an absolutely continuous F is automatically continuous; the converse is (of course) not true. On the other hand it is easily verified that a continuously differentiable F is automatically absolutely continuous. So "absolutely continuous" is somewhere in between "continuous" and "continuously differentiable".

If X = (X₁, …, Xₙ) is a stochastic n-vector, then the distribution function F_{X₁,…,Xₙ} (or briefly F_X) is given by

   F_{X₁,…,Xₙ}(x₁, …, xₙ) = P_{X₁,…,Xₙ}((−∞, x₁] × ··· × (−∞, xₙ]).

The function F_{X₁,…,Xₙ} is often called the joint distribution function of X₁, …, Xₙ. For a continuous F_X the open set

   O_{F_X} = {(x₁, …, xₙ) ∈ Rⁿ : F_X(x₁, …, xₙ) > 0}

is called the open support of F_X. If F_X has on its open support continuous partial derivatives of order n, then X has an absolutely continuous probability distribution. The probability density of X is then given by

   f_X = ∂ⁿF_X / (∂x₁ ··· ∂xₙ)   on O_{F_X},   and   f_X = 0   elsewhere.

If X has a probability density, then the distribution function F_X can be represented as

   F_X(x₁, …, xₙ) = ∫_{−∞}^{x₁} ··· ∫_{−∞}^{xₙ} f_X(t₁, …, tₙ) dt₁ ··· dtₙ.

It can be proved that a system X₁, …, Xₙ of stochastic variables is statistically independent if and only if

   F_{X₁,…,Xₙ} = F_{X₁} ⊗ ··· ⊗ F_{Xₙ}.

Except in some of the exercises, the concept of joint distribution functions will not be used in this book.
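Although joint distribution functions will hardly be used later on, the factorization criterion just stated is easy to probe numerically. The following sketch, added as an illustration, compares the empirical joint distribution function of two independently simulated variables with the product of the empirical marginal distribution functions at a few points; the chosen distributions are arbitrary.

    import random

    random.seed(9)
    n = 200_000
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.expovariate(1.0) for _ in range(n)]   # generated independently of x

    for (a, b) in ((0.0, 0.5), (1.0, 1.0), (-1.0, 2.0)):
        f_joint = sum(1 for xi, yi in zip(x, y) if xi <= a and yi <= b) / n
        f_x = sum(1 for xi in x if xi <= a) / n
        f_y = sum(1 for yi in y if yi <= b) / n
        print(f_joint, f_x * f_y)      # the two columns nearly agree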
I.8
Moments, moment generating functions and characteristic functions
In probability theory one associates with every stochastic variable its moments (as far as these exist):

Definition I.8.1. Let X be a stochastic variable. The nth moment of X is understood to be the real number

   μₙ := E(Xⁿ),

provided this expression exists.

Remark. The first moment μ₁ is identical to the expectation value of X. Furthermore, the variance of X is related to the first and the second moment of X:

   var(X) = μ₂ − (μ₁)².

It will turn out that, as a rule, the probability distribution of a stochastic variable is completely characterized by its moments. To explain this the following important concept in probability theory is introduced:

Definition I.8.2. Let X be a stochastic variable. The moment generating function M_X of X is defined by

   M_X(t) := E(e^{tX}) = ∫ e^{tx} dP_X(x)

for all values of t where this expression makes sense.

Remark. M_X is closely related to the so-called Laplace transform of the probability measure P_X (see [16], [19], [41], [66], [74]).

Theorem I.8.1. If E(e^{ε|X|}) exists for some ε > 0 then:

(i) All moments of X exist.

(ii) M_X is on the interval (−ε, +ε) an analytic function with power series expansion

   M_X(t) = Σ_{n=0}^{∞} tⁿ E(Xⁿ)/n!.

Proof. If E(e^{ε|X|}) exists, then

   ∫ e^{ε|x|} dP_X(x) < +∞.
However, for t ∈ (−ε, +ε) one has the inequality e^{t|x|} ≤ e^{ε|x|}. Consequently one has for all t ∈ (−ε, +ε)

   ∫ e^{t|x|} dP_X(x) < +∞.

In this way it follows that E(e^{t|X|}) exists for all t ∈ (−ε, +ε). Similarly, using the inequality e^{tx} ≤ e^{|t||x|} one verifies that E(e^{tX}) exists for all t ∈ (−ε, +ε). Next, note that the series

   Σₙ (|t|ⁿ/n!) |x|ⁿ

has positive terms. For this reason summation and integration are exchangeable (see Appendix A or [11], [45], [64]):

   Σₙ (|t|ⁿ/n!) ∫ |x|ⁿ dP_X(x) = ∫ Σₙ (|t|ⁿ|x|ⁿ/n!) dP_X(x) = ∫ e^{|t||x|} dP_X(x).

Because the last integral is finite for all t ∈ (−ε, +ε), it follows that

   Σₙ (|t|ⁿ/n!) ∫ |x|ⁿ dP_X(x) < +∞   for all t ∈ (−ε, +ε).

Consequently

   ∫ |x|ⁿ dP_X(x) < +∞   for all n ∈ N,

so it turns out that all moments of X exist, which is the statement in (i). To prove (ii), note that

   Σₙ ∫ (|tx|ⁿ/n!) dP_X(x) < +∞   for all t ∈ (−ε, +ε).

It follows from this that it is allowed to exchange summation and integration (see Appendix A, or [11], [45], [64]) in the series

   Σₙ ∫ ((tx)ⁿ/n!) dP_X(x).
It thus appears that for all t ∈ (−ε, +ε)

   Σₙ tⁿ E(Xⁿ)/n! = Σₙ ∫ ((tx)ⁿ/n!) dP_X(x) = ∫ Σₙ ((tx)ⁿ/n!) dP_X(x)
                  = ∫ e^{tx} dP_X(x) = E(e^{tX}) = M_X(t).

This proves the theorem.
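The power series of Theorem I.8.1(ii) can be made tangible with a small numerical experiment, added here as an illustration. The exponential distribution with parameter 1 is an example of my own choosing; for it one has E(Xⁿ) = n! and M_X(t) = 1/(1 − t) for t < 1, so the empirical moment generating function, the truncated moment series and the closed form should all agree for |t| < 1.

    import math, random, statistics

    random.seed(4)
    x = [random.expovariate(1.0) for _ in range(200_000)]   # Exp(1) sample

    t = 0.3
    mgf_empirical = statistics.mean(math.exp(t * xi) for xi in x)

    moments = [math.factorial(n) for n in range(25)]        # E(X^n) = n! for Exp(1)
    mgf_series = sum(t**n * m / math.factorial(n) for n, m in enumerate(moments))

    print(mgf_empirical, mgf_series, 1.0 / (1.0 - t))       # all approximately 1/(1 - t)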
The following proposition is a direct consequence of Theorem I.8.1.

Proposition I.8.2. Under the conditions of Theorem I.8.1 one can obtain the moments of X by differentiation of M_X in the point t = 0:

   E(Xⁿ) = (dⁿ/dtⁿ) M_X(t) |_{t=0}.

By Theorem I.8.1 and Proposition I.8.2 the moments of X are completely characterized by M_X and vice versa. In turn, the probability distribution of X is completely characterized by M_X:

Theorem I.8.3. Suppose X and Y are stochastic variables obeying the conditions imposed in Theorem I.8.1. Then the following three statements are equivalent:

(i) P_X = P_Y,

(ii) M_X = M_Y,

(iii) E(Xⁿ) = E(Yⁿ) for all n ∈ N.

Proof. The equivalence (i) ⟺ (ii) presents the fact that the Laplace transformation is injective (see [16], [19], [66], [74]). The equivalence (ii) ⟺ (iii) is a consequence of Theorem I.8.1 and Proposition I.8.2.

Remark. In the theorem above the imposed conditions cannot be dropped. For a discussion and counterexamples see [18], [62].

Given a sequence X₁, X₂, X₃, …, Xₙ, … of stochastic variables, it is in mathematical statistics often useful to study the asymptotic distribution, that is: the probability distribution of Xₙ for very large n. In this kind of statistical analysis one is often dealing with the concept of so-called convergence in distribution:

Definition I.8.3. The sequence (Xₙ)_{n∈N} is said to converge in distribution to a variable X, if

   lim_{n→∞} F_{Xₙ}(x) = F_X(x)

in every point x where F_X is continuous.
Remark 1. In this definition convergence of the sequence F_{Xₙ}(x) is required only in the points x where F_X is continuous. In Exercise 30 it is illuminated that it is not reasonable to require convergence in points where F_X is discontinuous.

Remark 2. If (Xₙ)_{n∈N} converges in distribution to both X and Y this does not at all mean that X = Y. However, one does have F_X = F_Y in that case, so X and Y are identically distributed (see also Exercise 31).

Convergence in distribution can be characterized by means of moment generating functions:

Theorem I.8.4 (P. Lévy). Let X be a stochastic variable and (Xₙ)_{n∈N} a sequence of stochastic variables. Suppose the moment generating functions M_X and M_{Xₙ} exist (for all n) on a common interval (−∞, δ] and suppose that

   lim_{n→∞} M_{Xₙ}(t) = M_X(t)   for all t ∈ (−∞, δ].

Then the sequence (Xₙ)_{n∈N} converges in distribution to X.

Proof. See [21].

Exercise 29 illustrates a nice application of this theorem. Quite similar to moment generating functions there is (in statistics and stochastic analysis) the concept of "characteristic functions":

Definition I.8.4. Let X be any stochastic variable. The characteristic function χ (more precisely denoted by χ_X) of X is understood to be defined by

   χ(t) := E(e^{itX}) = ∫ e^{itx} dP_X(x)   (t ∈ R),

where i denotes the imaginary unit (i² = −1).

Remark. In the definition above one has

   E(e^{itX}) = E(cos(tX)) + i E(sin(tX)).

A simple application of Lebesgue's dominated convergence theorem (see Appendix A) shows that this expression is continuous in t. In contrast to the moment generating function, the characteristic function always exists for every t ∈ R. Of course this is an advantage; a disadvantage is (in a didactical sense) that complex numbers
come across. Actually, the function χ_X is the Fourier transform of the probability measure P_X. The measure P_X is completely characterized by χ_X, that is to say:

   P_X = P_Y   ⟺   χ_X = χ_Y.

Details can be found in [6], [18], [19], [21], [41], [64]; see also Appendix F.

Convergence in distribution can also be expressed in terms of characteristic functions:

Theorem I.8.5 (P. Lévy). The sequence of stochastic variables (Xₙ)_{n∈N} converges in distribution to the variable X if and only if

   lim_{n→∞} χ_{Xₙ}(t) = χ_X(t)   for all t ∈ R.

Proof. See [6], [18], [19], [21].

The following proposition describes how characteristic functions and moment generating functions change when scale transformations are applied on X:

Proposition I.8.6. Let X be an arbitrary stochastic variable and let Y = pX + q. Then:

(i) M_Y(t) = e^{qt} M_X(pt) for all t where these expressions make sense,

(ii) χ_Y(t) = e^{iqt} χ_X(pt) for all t ∈ R.

Proof. To avoid trite explanations, only the proof of (ii) is given. For all t ∈ R one has

   χ_Y(t) = E(e^{itY}) = E(e^{it(pX+q)}) = E(e^{iptX} e^{itq}) = e^{iqt} E(e^{iptX}) = e^{iqt} χ_X(pt).

Proposition I.8.7. If X is N(μ, σ²)-distributed, then:

(i) M_X(t) = e^{μt + σ²t²/2} for all t ∈ R,

(ii) χ_X(t) = e^{iμt − σ²t²/2} for all t ∈ R.

Proof. Only (ii) is proved; the proof of (i) can be given by similar arguments.

Step 1. The case that X is N(0, 1)-distributed. One then has

   χ_X(t) = E(e^{itX}) = ∫_{−∞}^{+∞} e^{itx} (1/√(2π)) e^{−x²/2} dx.
Now the derivative of χ_X is determined by differentiation across the integral sign. This can be justified by an application of Lebesgue's dominated convergence theorem (see Appendix A) on quotients of the form

   (χ_X(t + h) − χ_X(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) (1/√(2π)) e^{−x²/2} dx.

As a result one obtains

   χ_X′(t) = ∫_{−∞}^{+∞} e^{itx} ix (1/√(2π)) e^{−x²/2} dx
           = lim_{a→+∞} (i/√(2π)) ∫_{−a}^{+a} e^{itx} e^{−x²/2} x dx
           = (i/√(2π)) lim_{a→+∞} { [−e^{itx} e^{−x²/2}]_{x=−a}^{x=+a} + ∫_{−a}^{+a} e^{itx} it e^{−x²/2} dx }   (partial integration)
           = −t ∫_{−∞}^{+∞} e^{itx} (1/√(2π)) e^{−x²/2} dx = −t χ_X(t).

It thus appears that χ_X satisfies the differential equation

   χ_X′(t) = −t χ_X(t)   while   χ_X(0) = 1.

The unique solution of this equation is given by

   χ_X(t) = e^{−t²/2}   (t ∈ R).

Step 2. The case that X is N(μ, σ²)-distributed. Then one has

   X = σX̃ + μ,

where X̃ is the standardized form of X. By Proposition I.8.6 together with Step 1 one obtains

   χ_X(t) = e^{iμt} χ_X̃(σt) = e^{iμt} e^{−σ²t²/2} = e^{iμt − σ²t²/2}.

This proves part (ii) of the proposition.
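As a quick numerical illustration of Proposition I.8.7(ii), added here and not part of the original text, one can estimate χ_X(t) = E(e^{itX}) by averaging e^{itX} over simulated standard normal outcomes and compare the result with e^{−t²/2}.

    import cmath, math, random

    random.seed(5)
    x = [random.gauss(0, 1) for _ in range(200_000)]

    for t in (0.5, 1.0, 2.0):
        phi_emp = sum(cmath.exp(1j * t * xi) for xi in x) / len(x)
        # The real part is close to exp(-t^2/2); the imaginary part is close to 0.
        print(t, phi_emp, math.exp(-0.5 * t * t))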
Remark. The previous proposition can also be proved (in an elegant way) by using an argument involving the calculus of residues (see Exercise 45). The following proposition is often useful when dealing with sums of independent variables.
Proposition I.8.8. If X and Y are statistically independent variables and S = X + Y, then:

(i) M_S(t) = M_X(t) M_Y(t) for all t where the right side makes sense,

(ii) χ_S(t) = χ_X(t) χ_Y(t) for all t ∈ R.

Proof. See Exercise 46.

The proposition above can easily be generalized to the case where S consists of more than two independent summing terms. As an illustration of how Proposition I.8.8 can be applied, a second proof of Theorem I.6.6 is given here:

Assertion. If X and Y are statistically independent variables which are N(μ_X, σ_X²)- and N(μ_Y, σ_Y²)-distributed respectively, then S = X + Y is a variable with an N(μ_X + μ_Y, σ_X² + σ_Y²)-distribution.

Proof. The moment generating functions of X and Y are given by (see Proposition I.8.7)

   M_X(t) = e^{μ_X t + σ_X²t²/2}   and   M_Y(t) = e^{μ_Y t + σ_Y²t²/2}.

Because X and Y are assumed to be independent one has by Proposition I.8.8

   M_S(t) = e^{(μ_X + μ_Y)t + (σ_X² + σ_Y²)t²/2}.

However, this is exactly the moment generating function corresponding to an N(μ_X + μ_Y, σ_X² + σ_Y²)-distribution.
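The following sketch, an added illustration, checks Proposition I.8.8(i) empirically for two independent normal variables: the empirical moment generating function of S = X + Y is compared with the product of the empirical moment generating functions of X and Y, and with the closed form from Proposition I.8.7(i). The chosen parameters and the value of t are arbitrary.

    import math, random, statistics

    random.seed(6)
    n = 200_000
    x = [random.gauss(1.0, 2.0) for _ in range(n)]    # N(1, 4)
    y = [random.gauss(-0.5, 1.0) for _ in range(n)]   # N(-0.5, 1)

    t = 0.2
    m_x = statistics.mean(math.exp(t * v) for v in x)
    m_y = statistics.mean(math.exp(t * v) for v in y)
    m_s = statistics.mean(math.exp(t * (a + b)) for a, b in zip(x, y))
    m_theory = math.exp(0.5 * t + 0.5 * 5.0 * t * t)  # mgf of N(0.5, 5)

    print(m_s, m_x * m_y, m_theory)                   # all three approximately equal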
I.9
The central limit theorem
In this section the central limit theorem will be proved. This theorem is about the asymptotic distribution of sums of independent identically distributed variables.

Definition I.9.1. A sequence X₁, X₂, … of stochastic variables is said to be statistically independent if for all n the variables X₁, …, Xₙ constitute a statistically independent system.

First of all, as a preparation to the proof of the central limit theorem, two lemmas are proved:

Lemma I.9.1. If α₁, α₂, … is a sequence of complex numbers such that

   lim_{n→∞} n αₙ = 0,
then for all x ∈ R

   lim_{n→∞} (1 + x/n + αₙ)ⁿ = eˣ.

Proof. Let "log" be the principal branch of the complex logarithm (see [1], [7], [64]). Then, denoting the real part of a complex number by Re(z), one has for sufficiently large n

   Re(1 + x/n + αₙ) > 0.

It therefore makes sense to talk about

   log(1 + x/n + αₙ).

Taking into account the fact that

   lim_{z→0} log(1 + z)/z = 1,

one may write

   lim_{n→∞} n log(1 + x/n + αₙ) = lim_{n→∞} [ log(1 + x/n + αₙ) / (x/n + αₙ) ] · n (x/n + αₙ)
                                 = lim_{n→∞} [ log(1 + x/n + αₙ) / (x/n + αₙ) ] · lim_{n→∞} (x + nαₙ) = x.

It follows from this that

   lim_{n→∞} (1 + x/n + αₙ)ⁿ = lim_{n→∞} e^{n log(1 + x/n + αₙ)} = eˣ.

Lemma I.9.2. There exists a bounded continuous function r : R → C such that

(i) e^{ix} = 1 + ix − ½x² + x² r(x),

(ii) lim_{x→0} r(x) = 0.
Proof. Expanding the function x ↦ e^{ix} in its (everywhere convergent) Taylor series, one finds that

   e^{ix} = 1 + ix − ½x² + Σ_{k=3}^{∞} (ix)ᵏ/k!.

Now define

   r(x) := ( Σ_{k=3}^{∞} (ix)ᵏ/k! ) / x²   (x ∈ R).
By a well-known result in the theory of power series, r is a continuous function. This, together with the fact that r(0) = 0, implies that, by construction of r, statement (ii) is true. It remains to be proved that r is a bounded function on R. To verify this, note that

   r(x) = (e^{ix} − 1 − ix + ½x²) / x².

Applying the triangle inequality on the right side, one obtains

   |r(x)| ≤ 2/x² + 1/|x| + ½.

In this way it follows that r is bounded on the set (−∞, −1] ∪ [1, +∞). Moreover, being a continuous function, r is of course also bounded on the compact interval [−1, +1]. Glueing the pieces together it follows that r is bounded on R.

The proof of the central limit theorem can now be started. It is easier to understand the core of the proof when first the Exercises 29 and 32 are done.

Theorem I.9.3 (Central limit theorem). Let X₁, X₂, … be a statistically independent sequence of identically distributed stochastic variables, all of them with expectation μ and variance σ². It is assumed that σ ≠ 0. Under these conditions the variable

   (1/n) Sₙ := (X₁ + X₂ + ··· + Xₙ)/n

is for large n approximately N(μ, σ²/n)-distributed. More precisely, the sequence

   (Sₙ/n − μ) / (σ/√n)

converges in distribution to the N(0, 1)-distribution.

Proof. (Split up into two steps; Lévy's theorem is applied.)

Step 1. (The case where μ = 0 and σ = 1.) To compactify notations, set

   X := X₁   and   Zₙ := Sₙ/√n.

By Proposition I.8.8 one has

   χ_{Sₙ}(t) = {χ_X(t)}ⁿ   for all n ∈ N.
Applying Proposition I.8.6 it follows that

   χ_{Zₙ}(t) = {χ_X(t/√n)}ⁿ.

To prove that the sequence Zₙ converges in distribution to the N(0, 1)-distribution it is, by Theorem I.8.5 and Proposition I.8.7, sufficient to establish that

   lim_{n→∞} χ_{Zₙ}(t) = e^{−t²/2}.   (∗)

To this end, fix any t ∈ R. In the notation of Lemma I.9.2 one has

   e^{itx} = 1 + itx − ½t²x² + t²x² r(tx).

Consequently

   χ_X(t) = E(e^{itX}) = 1 + it E(X) − ½t² E(X²) + t² E(X² r(tX)).

Exploiting that μ = 0 and σ = 1, this can be simplified into:

   χ_X(t) = 1 − ½t² + t² E(X² r(tX)).

Therefore

   χ_{Zₙ}(t) = {χ_X(t/√n)}ⁿ = ( 1 − t²/(2n) + (t²/n) E(X² r(tX/√n)) )ⁿ.

In this expression the terms

   αₙ = (t²/n) E(X² r(tX/√n))

turn out to be negligible for large n. More precisely, it will appear that

   lim_{n→∞} n αₙ = 0.

To verify this, note that

   n αₙ = t² ∫ x² r(tx/√n) dP_X(x).

Here the following properties are satisfied:

(i) ∫ x² dP_X(x) < +∞,

(ii) |x² r(tx/√n)| ≤ c x² for all x ∈ R, some c ∈ R,

(iii) lim_{n→∞} x² r(tx/√n) = 0.
Chapter I Probability theory
Using Lebesgue’s dominated convergence theorem (see Appendix A) it follows that Z tx lim n˛n D lim t 2 x 2 r p dPX .x/ n!1 n!1 n Z tx 2 2 Dt lim x r p dPX .x/ D 0: n!1 n It thus appears that the expression Zn .t / can be presented as: Zn .t / D .1
1 t2 C ˛n /n 2n
where lim n ˛n D 0: n!1
By Lemma I.9.1 this implies lim Zn .t / D lim .1
n!1
n!1
1 2 t C ˛n /n D e 2
1 2 2t
:
Hence ./, and therefore Step 1, is proved by this. Step 2. (The general case.) Suppose the X1 ; X2 ; : : : constitute an independent sequence of identically distributed variables with expectation and variance 2 . As before, write e i WD Xi .i D 1; 2; : : :/: X e e Now the X 1 ; X 2 ; : : : constitute an independent sequence of identically distributed variables with expectation 0 and variance 1. By Step 1 the sequence e1 C C X en X 1 Sn D p e p n n
converges in distribution to the N.0; 1/-distribution. By recognizing that
the theorem is proved.
1 Sn =n Sn D p e p n = n
Remark 1. There exist more general and deeper versions of the central limit theorem (see [18], [50]). Remark 2. It is both amusing and instructive to visualize the central limit theorem in the case of an independent sequence of variables that are all uniformly distributed in the interval . 21 ; C 12 /. In Exercise 51 the reader is asked to actually carry this out. Remark 3. The central limit theorem explains partly the special role played by the normal distribution in probability theory and statistics. As to this, see also Exercise 40.
53
Section I.10 Transformation of probability densities
I.10
Transformation of probability densities
Let X be an absolutely continuous distributed stochastic n-vector of which the probability density is given. If g W Rn ! R is a given continuous function, how can one express the probability density of g.X/ in terms of the density of X? In this section a technique based on differentiation of the distribution function is illustrated. Example 1. Let g W Rn ! R defined by g.x/ WD a (constant). Now for every stochastic n-vector X the probability distribution of g.X/ is presented by the Dirac measure ıa . It follows that, even if g is continuous and X has an absolutely continuous probability distribution, there is no guarantee that g.X/ has an absolutely continuous probability distribution! Example 2. Suppose X is N.0; 1/-distributed. Define Y WD X 2 . Is Y continuously distributed? If it is, how can one obtain an explicit form for the probability density of Y ? Solution. The strategy is to first find an explicit form for the distribution function FY of Y . Because Y cannot possibly assume negative values one has FY .y/ D 0 if y < 0. For y 0 one has p p FY .y/ D P .Y y/ D P .X 2 y/ D P . y X y/ Z Cpy Z py 2 =2 2 1 x p1 e x =2 dx: D p p e dx D 2 y
Substituting x D
p
2
0
2
t in the last integral one finds that Z y 1 p1 t 2 e t=2 dt: FY .y/ D 0
2
The density function fY of Y can now be obtained by differentiation of FY .y/ with respect to y : ´ 1 p1 y 2 e y=2 if y > 0; 2 fY .y/ D 0 elsewhere: Example 3. Let .X; Y / be a stochastic 2-vector with a probability density f defined by ´ x C y if .x; y/ 2 Œ0; 1 Œ0; 1; f .x; y/ D 0 elsewhere: Suppose Z WD XY . Does Z have a probability density? If it has, what is its explicit form?
54
Chapter I Probability theory
Solution. As in the previous example, the first step is to derive an explicit form for the distribution function FZ of Z. If a < 0 then P .Z a/ D 0, so FZ .a/ D 0 if a < 0. In the same way FZ .a/ D 1 e a of R2 are defined as: if a > 1. For 0 a 1 the subsets Ga and G Ga WD ¹.x; y/ W xy aº
and
e a WD Ga \ .Œ0; 1 Œ0; 1/: G
Now one can express FZ .a/ in the following way:
FZ .a/ D P .Z a/ D P .XY a/ ZZ ZZ D P .X; Y / 2 Ga D f .x; y/ dxdy D .x C y/ dxdy e Ga Ga Z a °Z 1 Z 1 ° Z a=x ± ± D .x C y/ dy dx C .x C y/ dy dx: 0
0
Working out these integrals, one finds 8 ˆ 1:
Differentiation of FZ gives the density fZ of Z: ´ 2 2a if 0 a 1; fZ .a/ D 0 elsewhere: Example 4. Let X and Y be statistically independent stochastic variables with probability densities fX and fY given by ´ 1 2 x e x if x 0; fX .x/ D 2 0 elsewhere; ´ e y if y 0; fY .y/ D 0 elsewhere: Let Z WD
Y X
. Find, if it exists, an explicit form of the probability density of Z.
Solution. As in the preceding two examples, the first step towards a solution is to find an explicit expression for the distribution function FZ of Z. To this end (see Theorem I.3.2) note that the density of the 2-vector .X; Y / is given by ´ 1 2 x e .xCy/ if x; y 0; fX;Y .x; y/ D 2 0 elsewhere:
55
Section I.11 Exercises
If a < 0, then obviously FZ .a/ D 0. For all a 0 define the subset Ga of R2 by Ga D ¹.x; y/ W x; y 0I y axº:
Then
FZ .a/ D P .Z a/ D P .Y =X a/ D P .Y aX / D P .X; Y / 2 A ZZ Z C1 ° Z ax 2 x ± x e D fX;Y .x; y/ dxdy D e y dy dx 2 A 0 0 1 D1 : .1 C a/3
By differentiation of FZ .a/ to a one gets ´ 3.1 C a/ fZ .a/ D 0
I.11
4
if a 0; elsewhere:
Exercises
Detailed solutions to all exercises can be found in [2]. 1. Suppose A is a -algebra. If A; B 2 A then also A [ B; A \ B; AnB 2 A. Prove this. T 2. If A1 ; A2 ; : : : 2 A, then also 1 iD1 Ai 2 A. Prove this.
3. Let F be an arbitrary collection of subsets of . Suppose that A is the intersection of all -algebras containing F. Prove that: (i) A is a -algebra. (ii) If B is a -algebra which contains F then A B. A is said to be the smallest -algebra containing F, or the -algebra generated by F. 4. Is the collection of all open subsets in R a -algebra? 5. Suppose f W Rm ! Rn is a Borel function. Define A WD ¹A Rn W f
1
.A/ Borel in Rm º:
(i) Show that A is a -algebra which contains all open subsets of Rn . (ii) Prove that f 1 .A/ is Borel in Rm for all A Borel in Rn . This exercise shows that a function is a Borel function if and only if it is measurable with respect to the Borel -algebra.
56
Chapter I Probability theory
6. Prove that the composition of two Borel functions is again a Borel function. 7. If f and g are real-valued Borel functions then so are f C g and fg. Prove this. (In particular it follows that 1A f is a Borel function if A and f are Borel.) 8. Let A be a -algebra and a measure on A. If A; B 2 A and A B , then .A/ .B/. Prove this. 9. Suppose is a measure on A and A1 ; A2 ; : : : 2 A. Prove that ! 1 1 [ X Ai .Ai /: iD1
iD1
10. If is a measure on A and A1 ; A2 ; : : : are elements of A such that then
A1 A2 : : :
and .A1 / < C1;
lim .An / D
n!1
Prove this.
1 \
iD1
!
Ai :
11. Prove that PX is a Borel measure for every stochastic n-vector X. 12. As a probability experiment one throws a die once. Let X be the square of the number of pips. Express PX as a linear combination of Dirac measures ı1 ; ı2 ; ı3 ; : : : ; ı36 . 13. Let .; A; P / be a probability space. If X W ! Rn and X 1 .O/ 2 A for all open sets O in Rn , then X is a A-measurable variable. Prove this. 14. Let .; A; P / be a probability space. Suppose A 2 A and X D 1A . Prove that X is a stochastic variable. Express PX in terms of the Dirac measures ı0 and ı1 . 15. Prove the assertion in Example 3 of §I.3. 16. Prove the assertion in Example 4 of §I.3. 17. Let X be a discrete stochastic variable with range ¹a1 ; a2 ; : : :º. Prove that PX D
1 X iD1
P .X D ai /ıai :
18. Prove that for every stochastic variable X and every n D 1; 2; : : : one has Z Z jxjn dPX .x/ jxjnC1 dPX .x/ C 1: Consequently, if E.X nC1 / exists then so does E.X n /.
57
Section I.11 Exercises
19. Prove the assertion in Example 6 of §I.5. 20. Let X be as in Example 2 of §I.7. Determine the left and right derivative of FX in the point x D 0. 21.
(i) Determine E.X/; E.Y / and E.Z/ in Example 4 of §I.10. (ii) Because X and Y are independent, one has E.X Y / D E.X /E.Y /. Is it also true that Y E.Y / E D ‹ X E.X /
22. Compute in Example 3 of §I.10 the value of E.X / and P .X Y /. 23. A stochastic 2-vector .X; Y / has a probability density given by ´ xe xy if .x; y/ 2 Œ0; 1 Œ0; C1/; f .x; y/ D 0 elsewhere: (i) Determine, as far as they exist, E.X/ and E.Y /. (ii) Determine the probability density of Z D X Y .
(iii) Evaluate E.Z/.
24. Let X and Y be exponentially distributed stochastic variables with E.X / D 1 and E.Y / D 1. q Y (i) Determine the probability density of X. q Y (ii) Evaluate E X .
25. Suppose X has an absolutely continuous distribution with probability density fX . Prove that the variable Y WD pX C q, where p ¤ 0, has also an absolutely continuous distribution and one has 1 y q fY .y/ D fX : jpj p 26. In this exercise the reader is asked to prove two fundamental inequalities in probability theory. (i) Let X be a stochastic variable with expectation . If X takes on values only in Œ0; C1/, then P .X "/
"
for all " > 0:
Prove this. In probability theory this inequality is known as the Markov inequality.
58
Chapter I Probability theory
(ii) Let X be a stochastic variable with expectation and variance 2 . Show that for all " > 0 2 P .jX j "/ 2 : " This inequality is the well-known Chebyshev inequality. 27. If .X1 ; : : : ; Xn / and .Y1 ; : : : ; Yn / are identically distributed and f W Rn ! R is an arbitrary Borel function, then f .X1 ; : : : ; Xn / and f .Y1 ; : : : ; Yn / are also identically distributed. Prove this. 28. Suppose X1 and X2 are independent and both N.0; 1/-distributed. Define for all a; b > 0 and c 2 R the set A.c/ R2 by A.c/ WD ¹.x1 ; x2 / 2 R2 W ax1 C bx2 cº: (i) Prove that c P .X1 ; X2 / 2 A.c/ D P X1 p : a2 C b 2
p (ii) Prove that .aX1 C bX2 /= a2 C b 2 is N.0; 1/-distributed.
(iii) Prove that aX1 C bX2 is N.0; a2 C b 2 /-distributed.
29. Suppose X1 ; X2 ; : : : ; Xn ; : : : is an independent sequence of identically distributed variables, all of them having the probability distribution given by: x
0
1
P .X D x/
1 2
1 2
That is, the Xi are independent and Bernoulli distributed with parameter D (the Bernoulli distribution is introduced in §II.7). (i) Show that MXi .t / D e t=2
e t=2 C e 2
t=2
!
D e t=2 cosh
(cosh D cosinus hyperbolicus).
(ii) Prove that if Sn D X1 C X2 C C Xn , then MSn .t / D e nt=2 cosh
t n : 2
t 2
1 2
59
Section I.11 Exercises
(iii) Denoting Zn WD .Sn
1 1p 2 n/= 2 n,
verify that ² ³n t MZn .t / D cosh p : n p (iv) Expanding in a Taylor series, show that cosh.t = n/ can be decomposed as t t2 cosh p D 1 C C ˛n .t /; 2n n where lim n˛n .t / D 0: n!1
(v) Prove that the sequence Z1 ; Z2 ; : : : converges in distribution to the standard normal distribution. 30. Suppose X and Xn are the constant stochastic variables defined by X D 2 and Xn D 2 C n 1 . Does one have here limn!1 FXn .x/ D FX .x/ for all x 2 R? Does the sequence X1 ; X2 ; : : : converge in distribution to X ? 31. An unbiased coin is tossed unendingly. The stochastic variable Xn is defined by Xn D 1 Xn D 0
if the outcome of the nth tossing is head, if the outcome of the nth tossing is tail:
The variables X1 ; X2 ; : : : are statistically independent, consequently Xm ¤ Xn if m ¤ n. Check that for all fixed m the sequence X1 ; X2 ; : : : converges in distribution to Xm . 32. Formulate and solve the analogue of Exercise 29 in terms of characteristic functions. 33. Suppose X has an absolutely continuous probability distribution. If ' W R ! R is differentiable, ' 0 is continuous and ' 0 > 0, then '.X / also has an absolutely continuous distribution. Prove this and give an expression for the density of '.X /. 34. (Petersburger paradox). In a game of chance one is tossing an unbiased coin until the outcome “head” occurs. If k tossings are needed to see for the first the outcome “head”, then the player gets 2k roubles. Evaluate, if existing, the expectation of the amount of money gained by the player. 35. A well-known result in mathematical analysis states that Z C1 1p 2 e x dx D : 2 0
60
Chapter I Probability theory
(i) Using this, prove that the function 1 p e 2
f W x 7!
1 2 .x
/2 = 2
presents a probability density. (ii) The stochastic variables X1 ; : : : ; Xn are said to have a joint (central) vectorial normal distribution if the stochastic n-vector .X1 ; : : : ; Xn / has a probability distribution f of the form f .x/ D c e
1 2 hTx;xi
.x 2 Rn /;
where c is a constant and T a symmetric, invertible linear operator of positive type (that is to say, hTx; xi 0 for all x 2 Rn ). Express c in terms of det.T/, n and . 36. In this exercise it is illustrated how, starting from normally distributed variables, the Cauchy distribution can emerge. (i) If the stochastic 2-vector .X; Y / has a continuous probability density f , Y then the variable Z D X has a probability density given by fZ .z/ D
Z
C1 1
jxjf .x; xz/ dx:
Prove this. (ii) If X and Y are statistically independent and both N.0; 1/-distributed, then Y is Cauchy distributed with parameters ˛ D 0 and the variable Z D X ˇ D 1. That is: fZ .z/ D
1 1 z2 C 1
for all z 2 R:
37. Suppose the stochastic 2-vector .X1 ; X2 / has a continuous probability density f W R2 ! Œ0; C1/. Then the probability distribution of .X1 ; X2 / is rotation invariant if and only if f Q.x1 ; x2 / D f .x1 ; x2 /, for all .x1 ; x2 / 2 R2 and for all orthogonal linear maps Q W R2 ! R2 . Prove this. 38. Suppose f1 and f2 are continuously differentiable (differentiable with a continuous derivative) on R. Define, as before, f1 ˝ f2 on R2 by .f1 ˝ f2 /.x1 ; x2 / WD f1 .x1 /f2 .x2 /:
61
Section I.11 Exercises
Now assume that f1 ˝ f2 is rotation invariant, that is: .f1 ˝ f2 / Q.x1 ; x2 / D .f1 ˝ f2 /.x1 ; x2 /
for all orthogonal Q and .x1 ; x2 / 2 R2 .
(i) Show that fi .x/ D fi . x/ for all x 2 R .i D 1; 2/.
(ii) Prove that f1 D f2 .
(iii) Prove that there exist constants A and B such that 1
2
f .x/ D f1 .x/ D f2 .x/ D Ae 2 Bx : Hint to (iii): Suppose Q.t / is the orthogonal linear map belonging to the matrix cos t sin t : sin t cos t Then one has for all .x1 ; x2 / 2 R2
.f ˝ f / Q.t /.x1 ; x2 / D .f ˝ f /.x1 ; x2 /:
Evaluate the derivative of the left side in the point t D 0 and deduce that f 0 .x2 / f 0 .x1 / D D constant. x1 f .x1 / x2 f .x2 / 39. Suppose X1 ; X2 are statistically independent stochastic variables with strictly positive, continuously differentiable probability densities. If the probability distribution of .X1 ; X2 / is rotation invariant, then X1 and X2 have a common N.0; 2 /-distribution. Prove this. 40. Suppose that X1 ; : : : ; Xn are statistically independent stochastic variables with strictly positive continuously differentiable probability densities. If the probability distribution of .X1 ; : : : ; Xn / is rotation invariant, then the variables X1 ; : : : ; Xn have a common N.0; 2 /-distribution. Prove this. Remark. This result presents a converse to Proposition I.6.4. It can be applied in an interesting way in kinetic gas theory (see §II.2, Example 1). 41. In this exercise the reader is asked to prove the existence of two stochastic variables X and Y satisfying: I: X and Y are not statistically independent. II: cov.X; Y / D 0.
III: X and Y are N.0; 1/-distributed.
62
Chapter I Probability theory
In fact this presents a refinement of Example 7 in §I.5. As a first step, set up a probability space .; A; P / in the following way: WD R, A WD B and P defined by Z 1 2 1 P .A/ WD p e 2 x dx .A Borel in R/: 2 A The stochastic variable X W ! R, defined by X.x/ WD x
for all x 2
is obviously N.0; 1/-distributed. For all c 0 the variable Yc is defined by ´ x if jxj c; Yc .x/ D x if jxj < c: (i) Sketch a graph of Yc . (ii) Show that Y0 D X and that limc!C1 Yc D
X.
(iii) Prove that Yc is N.0; 1/-distributed for all c 0. The reader may take for granted that the function f W c 7! cov.X; Yc / is sequentially continuous and thus continuous. (This can be proved for example by applying Lebesgue’s dominated convergence theorem (see Appendix A)). (iv) Evaluate f .0/ and limc!C1 f .c/. (v) Prove that there exists an element a 2 R such that cov.X; Ya / D 0.
(vi) Show that for all c > 0:
P jX j 1 and jYc j 1 ¤ P jX j 1 P jYc j 1 :
(vii) Deduce from (vi) that X and Yc are statistically dependent for all c 0. Now the pair X and Ya has the three properties listed in the beginning of this exercise. (viii) Is not the above contradictory to Theorem I.6.7? 42. Given any finite collection of prescribed probability densities f1 ; : : : ; fn , there exists a statistically independent system of variables X1 ; : : : ; Xn such that Xi has probability density fi .i D 1; : : : ; n/. Prove this. 43. If both X and Y have absolutely continuous distributions, then so does the 2vector .X; Y /. Is this statement true or false?
63
Section I.11 Exercises
44. Prove that the distribution function of an arbitrary stochastic variable has the properties (i), (ii) and (iii) mentioned in §I.7. 45. Using a technique involving the calculus of residues (see [7, p. 219]) it is easily verified that Z
C1
e
.xCi a/2
1
dx D
Z
C1
e
x2
dx
for all a 2 R:
1
Deduce statement (ii) in Proposition I.8.7 from this. 46. Prove Proposition I.8.8. 47. Prove that the sum of two (or more) statistically independent Poisson distributed variables is Poisson distributed. 48. Can you give conditions under which the sum of two independent binomially distributed variables is binomially distributed? 49. The characteristic function of a Cauchy distribution is described in Appendix D. Is the sum of two independent Cauchy distributed variables Cauchy distributed? 50. Suppose X and Y are independent stochastic variables with probability densities fX and fY . It is assumed that fX and fY are bounded piece-wise continuous functions and that there exists an a > 0 such that fX .x/ D 0
if jxj a:
The convolution product fX fY W R ! Œ0; C1/ of such a pair of probability densities is defined by .fX fY /.s/ WD
Z
C1
fX .x/fY .s
x/ dx
1
.s 2 R/:
(i) Prove that the integral on the right side exists and that fX fY is a continuous function. (ii) Prove that if S D X C Y , then fS D fX fY : (iii) Prove that .fX fY /.s/ D 0 if jsj 2a: (iv) Formulate and prove a discrete analogue of the above.
64
Chapter I Probability theory
51. In this exercise the reader is asked to visualize the central limit theorem. Let X1 ; X2 and X3 be independent variables, uniformly distributed in the interval 1 1 ; C 2 2 . Define the sums S1 ; S2 and S3 as S1 WD X1 ;
S2 WD X1 C X2
and
S3 WD X1 C X2 C X3 :
1 (i) One has E.S1 / D 0 and var.S1 / D 12 . Sketch the probability density of 1 S1 and the one corresponding to the N.0; 12 /-distribution in one and the same figure.
(ii) Determine the probability density of S2 . (iii) One has E.S2 / D 0 and var.S2 / D 61 . Sketch the probability density of S2 and the one corresponding to the N.0; 16 /-distribution in one and the same figure. (iv) Determine the probability distribution of S3 . (v) One has E.S3 / D 0 and var.S3 / D 14 . Sketch the probability density of S3 and the one corresponding to the N.0; 14 /-distribution in one and the same figure. 52. Given are two statistically independent variables X and Y , having a common probability density f of type 8 < 1 e x= if x 0; f .x; / D :0 elsewhere: Here is a fixed positive number.
(i) Prove that X C Y and X=Y are statistically independent.
(ii) Prove that X C Y and X=.X C Y / are statistically independent.
Chapter II
Statistics and their probability distributions, estimation theory
II.1 Introduction

In order to determine the acceleration of gravity g on earth's surface a heavy leaden ball is dropped from a height of 100 meter. Now g and the time X it takes to reach the ground are related as follows:

   X = √(200/g)   (X in seconds).

Consequently, by measuring X one could get an impression of the numerical value of g. Repetition of this experiment will, in general, not result in exactly the same measured value of X. This is due to restricted accuracy of the pieces of apparatus, to inaccuracy in reading off things and to other (exterior) disturbing factors. The process of measuring X can therefore be considered as "observing the outcome of a stochastic variable". This stochastic variable, which exists in one's mind only, will also be denoted by X. In the physical sciences it is often assumed that such a variable is normally distributed, say N(μ, σ²)-distributed. The stochastic variable X − μ is called the error. This variable is N(0, σ²)-distributed (see Theorem I.6.1).

Suppose, which is in practice almost never the case, that the numerical value of σ is known. Suppose for example that σ = 0.8. How could one then determine the chance that the error in the measurements will be (in absolute sense) more than 1 second? In other words: how can one evaluate P(|X − μ| > 1)? To solve this problem the variable X is standardized in the sense of Definition I.5.2. By Theorem I.6.1 the variable

   Z := X̃ = (X − μ)/σ

is N(0, 1)-distributed. (In statistics, N(0, 1)-distributed variables are often denoted by the letter Z.) Using this the probability P(|X − μ| > 1) can be evaluated in the following way:

   P(|X − μ| > 1) = P(|X − μ|/σ > 1/0.8) = P(|Z| > 1.25),
where, as said before, Z is N(0, 1)-distributed. One therefore has

   P(|Z| > 1.25) = P(|Z| ≥ 1.25) = ( ∫_{−∞}^{−1.25} + ∫_{+1.25}^{+∞} ) (1/√(2π)) e^{−x²/2} dx.

The numerical value of this integral can be determined by expressing it in terms of the distribution function Φ of the standard normal distribution (see Table II). Because of symmetry one may write

   P(|Z| ≥ 1.25) = 2 Φ(−1.25) = 2 × 0.1056 = 0.2112.

It thus appears that in the experiment there is a chance of roughly 20% that the error (in absolute sense) will be more than 1 second.

The accuracy in the experiment could be improved by repeating it a number of times and afterwards considering the average of the measured values of X. Then, in a natural way, there are n stochastic variables X₁, …, Xₙ involved, representing the n measurements of the falling-time. It is often assumed that there is no mutual interference in the measurements. In statistical language: the variables X₁, …, Xₙ are supposed to be statistically independent. A sequence like X₁, …, Xₙ is said to be a sample from a normally distributed population. More generally and more precisely:

Definition II.1.1. A sample of size n is understood to be a statistically independent sequence X₁, …, Xₙ of stochastic variables, all of them identically distributed. The common probability distribution of the Xᵢ is said to be the (probability) distribution of the population.

If X₁, …, Xₙ is a sample and g : Rⁿ → R a Borel function (in practice usually a continuous function) then the stochastic variable g(X₁, …, Xₙ) is said to be a statistical quantity, or briefly a statistic. Statistics will as a rule be denoted by the capital letter T. If (after completing an experiment) an outcome t for T appears, then this will be bluntly written as T = t. By this notation one surely does not mean that T be equal to the constant stochastic variable t. A more precise (but too lengthy) formulation would be: "In the underlying probability space the element ω appeared as an outcome and one has T(ω) = g(X₁(ω), …, Xₙ(ω)) = t."

A statistic of fundamental importance is the sample mean:

   X̄ := (X₁ + ··· + Xₙ)/n.

As to this sample mean one has:
Proposition II.1.1. If X₁, …, Xₙ is a sample from a population with mean μ and variance σ², then

   E(X̄) = μ   and   var(X̄) = σ²/n.

Proof. One has

   E(X̄) = E( (1/n) Σᵢ₌₁ⁿ Xᵢ ) = (1/n) E( Σᵢ₌₁ⁿ Xᵢ ) = (1/n) Σᵢ₌₁ⁿ E(Xᵢ) = (1/n) Σᵢ₌₁ⁿ μ = μ.

Furthermore, the X₁, …, Xₙ being statistically independent, one has

   var(X̄) = var( (1/n) Σᵢ Xᵢ ) = (1/n²) var( Σᵢ Xᵢ ) = (1/n²) Σᵢ var(Xᵢ) = (1/n²) n σ² = σ²/n.

In the case of a sample from a normally distributed population the previous proposition can be sharpened:

Proposition II.1.2. If X₁, …, Xₙ is a sample from an N(μ, σ²)-distributed population, then X̄ is an N(μ, σ²/n)-distributed variable.

Proof. By Theorem I.6.6 the sum Σᵢ Xᵢ is N(nμ, nσ²)-distributed. By applying Theorem I.6.1 the proof can be completed in an obvious way.

Now return to the experiment in which one was measuring the falling-time X of a leaden ball. Repeating this experiment 4 times, a sample X₁, X₂, X₃, X₄ of size 4 arises from an N(μ, 0.64)-distributed population. By Proposition II.1.2 the sample mean X̄ is an N(μ, 0.64/4)-distributed variable. The chance that one is going to make an error that is (in absolute sense) more than 1 second, is expressed by the probability P(|X̄ − μ| ≥ 1). This probability can be computed as follows:

   P(|X̄ − μ| > 1) = P(|X̄ − μ| ≥ 1) = P(|X̄ − μ|/0.4 ≥ 1/0.4) = P(|Z| ≥ 2.5),

where Z is an N(0, 1)-distributed variable. Consulting Table II one obtains

   P(|X̄ − μ| ≥ 1) = 0.012.
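The two tail probabilities used in this introduction (0.2112 and 0.012) can be reproduced without tables; the lines below, added as an illustration, use the identity Φ(x) = (1 + erf(x/√2))/2.

    import math

    def Phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    # One measurement, sigma = 0.8:  P(|X - mu| > 1) = P(|Z| > 1.25)
    print(2.0 * Phi(-1.25))   # approximately 0.211

    # Mean of four measurements, standard deviation 0.4:  P(|Xbar - mu| >= 1) = P(|Z| >= 2.5)
    print(2.0 * Phi(-2.5))    # approximately 0.012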
In words, the probability that there will be an absolute error of more than 1 second is roughly 1%. Recall that for just one single measurement this chance was 20%!

In the example above it was presumed that the numerical value of $\sigma$ was known beforehand. In practice this is as a rule not the case. For this reason one will usually have to make an estimate of $\sigma$, based on the available measurements. For this purpose a statistic is introduced that is generally used to estimate the value of $\sigma^2$:

Definition II.1.2. For any sample $X_1,\ldots,X_n$ of size $n \geq 2$ the statistic
$$S^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$$
is called the sample variance.

The following proposition explains why, rather than $\frac{1}{n}$, in the definition of $S^2$ a scaling factor $\frac{1}{n-1}$ has been chosen.
Proposition II.1.3. If $X_1,\ldots,X_n$ is a sample from a population with variance $\sigma^2$, then
$$E(S^2) = \sigma^2.$$
For this reason $S^2$ is called an unbiased estimator of $\sigma^2$.

Proof. One may write
$$\sum_i (X_i - \mu)^2 = \sum_i \{(X_i - \overline{X}) + (\overline{X} - \mu)\}^2 = \sum_i (X_i - \overline{X})^2 + 2(\overline{X}-\mu)\sum_i (X_i - \overline{X}) + \sum_i (\overline{X}-\mu)^2.$$
Because $\sum_i (X_i - \overline{X}) = \bigl(\sum_i X_i\bigr) - n\overline{X} = 0$, this can be simplified as follows:
$$\sum_i (X_i - \mu)^2 = \sum_i (X_i - \overline{X})^2 + n(\overline{X}-\mu)^2.$$
Taking the expectation of the left and the right side, one obtains
$$\sum_i E[(X_i-\mu)^2] = E\Bigl\{\sum_i (X_i - \overline{X})^2\Bigr\} + n\, E[(\overline{X}-\mu)^2].$$
However, $E[(X_i-\mu)^2] = \sigma^2$ and furthermore (by Proposition II.1.1) one has that $E[(\overline{X}-\mu)^2] = \sigma^2/n$. Thus one obtains
$$n\sigma^2 = E\Bigl\{\sum_i (X_i - \overline{X})^2\Bigr\} + \sigma^2.$$
Consequently
$$E\Bigl\{\sum_i (X_i - \overline{X})^2\Bigr\} = (n-1)\sigma^2.$$
Therefore
$$E(S^2) = \sigma^2. \qquad\Box$$

Remark. Actually the proofs of Propositions II.1.1 and II.1.3 lean only on the condition that the $X_1,\ldots,X_n$ be statistically independent, together with the condition that they all have a common expectation $\mu$ and a common variance $\sigma^2$.

This section is closed now by returning to the experiment where the falling-time of a ball was measured in order to get an impression of the numerical value of the acceleration of gravity on earth. Suppose, repeating the experiment 8 times, that one obtains the following measurements (in seconds):
$$4.4 \quad 4.7 \quad 4.8 \quad 4.5 \quad 4.4 \quad 4.2 \quad 4.2 \quad 4.0.$$
This results in the following numerical value of the sample mean: $\overline{X} = 4.4$. The numerical value of the sample variance is $S^2 = 0.0714$. Thus one arrives at the estimate $\sigma^2 \approx 0.0714$. Which amount of accuracy can be ascribed to this estimate? In order to answer this question it is necessary to study the probability distribution of $S^2$ in cases where samples are drawn from normally distributed populations. To this end a number of special probability distributions, all of them related to the normal distribution, will pass in review in the next sections.
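The two numbers can be verified with a few lines of code; the sketch below, using only the Python standard library, recomputes them from the eight measurements.

```python
# Recompute the sample mean and the (unbiased) sample variance of the
# eight falling-time measurements quoted above.
from statistics import mean, variance  # variance() uses the 1/(n-1) form

t = [4.4, 4.7, 4.8, 4.5, 4.4, 4.2, 4.2, 4.0]
print(round(mean(t), 2), round(variance(t), 4))  # 4.4 and ~0.0714
```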
II.2 The gamma distribution and the $\chi^2$-distribution

In §I.10, Example 2, the distribution of $X^2$ was studied in the case where $X$ was an $N(0,1)$-distributed stochastic variable. It turned out that the probability density $f$ of $X^2$ is then given by
$$f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2} & \text{if } x > 0, \\ 0 & \text{elsewhere.} \end{cases}$$
This probability distribution is a member of the family of so-called gamma distributions. Before giving the general definition of gamma distributions, the definition of a celebrated function in mathematical analysis is recalled: The function $\Gamma\colon (0,+\infty) \to [0,+\infty)$ defined by
$$\Gamma(x) := \int_0^{+\infty} t^{x-1} e^{-t}\, dt \qquad (x > 0)$$
is said to be the gamma function. By using the technique of partial integration the following proposition is easily deduced:

Proposition II.2.1. For all $x > 0$ one has $\Gamma(x+1) = x\,\Gamma(x)$.

Proof. See Exercise 1.

As a corollary of Proposition II.2.1 one has $\Gamma(n+1) = n!$ for all $n = 1, 2, \ldots$. The values of $\Gamma$ have been tabulated. For $x = \frac12$ the value of $\Gamma$ can be expressed in terms of $\pi$:

Proposition II.2.2. One has $\Gamma(\tfrac12) = \sqrt{\pi}$.

Proof. See Exercise 2.

Definition II.2.1. A stochastic variable is said to be gamma distributed with parameters $\alpha, \beta > 0$ if it has a probability density given by
$$f(x) = \begin{cases} \dfrac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta} & \text{if } x > 0, \\ 0 & \text{elsewhere.} \end{cases}$$
For an exercise, check that this indeed defines a probability density. Note that $\alpha$ and $\beta$ are not exchangeable.

For $\alpha = 1$ and $\beta = \lambda$ the gamma distribution is usually called the exponential distribution with parameter $\lambda > 0$.

Proposition II.2.3. If $X$ is $N(0,1)$-distributed, then $X^2$ is gamma distributed with parameters $\alpha = \frac12$ and $\beta = 2$.

Proof. A direct consequence of §I.10, Example 2, together with Proposition II.2.2.
Proposition II.2.4. The moment generating function of a gamma distributed variable exists on the interval $(-\infty, \frac{1}{\beta})$ and is on this interval defined by
$$M(t) = \frac{1}{(1-\beta t)^{\alpha}}.$$

Proof. By definition of a moment generating function one has
$$M(t) = \int_0^{+\infty} e^{tx}\, \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}\, dx,$$
which is the same as
$$M(t) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)} \int_0^{+\infty} x^{\alpha-1} e^{(t - \frac{1}{\beta})x}\, dx.$$
It is easily verified that this integral makes sense if $t - \frac{1}{\beta} < 0$, that is, if $t \in (-\infty, \frac{1}{\beta})$. By writing $x = \beta u$ and applying the substitution formula for integrals one obtains
$$M(t) = \frac{1}{\Gamma(\alpha)} \int_0^{+\infty} u^{\alpha-1} e^{-(1-\beta t)u}\, du.$$
If, in turn, one substitutes $u = v/(1-\beta t)$ in this integral then one arrives at
$$M(t) = \frac{1}{\Gamma(\alpha)\,(1-\beta t)^{\alpha}} \int_0^{+\infty} v^{\alpha-1} e^{-v}\, dv = \frac{1}{(1-\beta t)^{\alpha}}. \qquad\Box$$

Through differentiation of the moment generating function it is easy to deduce the following proposition:

Proposition II.2.5. If $X$ is gamma distributed with parameters $\alpha$ and $\beta$, then:
(i) $E(X) = \alpha\beta$,
(ii) $\operatorname{var}(X) = \alpha\beta^2$.

Proof. See Exercise 3.

Using Proposition II.2.4 it is also easy to prove the following proposition:

Proposition II.2.6. Suppose $X_1,\ldots,X_n$ are statistically independent variables. If for all $i$ the variable $X_i$ is gamma distributed with parameters $\alpha_i$ and $\beta$, then the variable $X_1 + \cdots + X_n$ is gamma distributed with parameters $\alpha_1 + \cdots + \alpha_n$ and $\beta$.

Proof. This is a direct consequence of Proposition II.2.4 together with a generalized version of Proposition I.8.8.
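As a quick sanity check of Propositions II.2.5 and II.2.6, the following sketch (NumPy assumed; the parameter values are arbitrary illustrative choices) simulates a sum of independent gamma variables with a common $\beta$ and compares its mean and variance with the predicted values.

```python
# Illustrative simulation: independent gamma variables with a common scale beta
# add up to a gamma variable whose shape is the sum of the individual shapes.
import numpy as np

rng = np.random.default_rng(0)
alphas, beta, n = [0.5, 1.5, 2.0], 2.0, 100_000

s = sum(rng.gamma(shape=a, scale=beta, size=n) for a in alphas)
a_tot = sum(alphas)                  # 4.0
print(s.mean(), a_tot * beta)        # both close to 8.0   (E = alpha * beta)
print(s.var(), a_tot * beta**2)      # both close to 16.0  (var = alpha * beta^2)
```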
Now suppose that $X_1,\ldots,X_n$ are statistically independent $N(0,1)$-distributed variables. Then the variables $X_1^2,\ldots,X_n^2$ are independent and gamma distributed with $\alpha = \frac12$ and $\beta = 2$. By Proposition II.2.6 it follows that the variable $X_1^2 + \cdots + X_n^2$ is gamma distributed with $\alpha = n/2$ and $\beta = 2$. Such probability distributions, for historical reasons, have a special name (introduced by the statistician K. Pearson (1857–1936)):

Definition II.2.2. A stochastic variable is said to be $\chi^2$-distributed with $n$ degrees of freedom if it is gamma distributed with parameters $\alpha = n/2$ and $\beta = 2$.

In this terminology one has:

Proposition II.2.7. If $X_1,\ldots,X_n$ is a sample from an $N(0,1)$-distributed population, then the variable $X_1^2 + \cdots + X_n^2$ is $\chi^2$-distributed with $n$ degrees of freedom.

Furthermore, as a special case of Proposition II.2.5, one has:

Proposition II.2.8. If $X$ is a $\chi^2$-distributed variable with $n$ degrees of freedom, then one has:
(i) $E(X) = n$,
(ii) $\operatorname{var}(X) = 2n$.

Example 1. The following Gedankenexperiment$^{*}$ is set up: Inside a sphere with center in the origin live a number of gas molecules. At a certain instant $t = t_0$ an arbitrary molecule is chosen and its velocity $\mathbf{V}$ is measured. When this experiment is repeated it generally does not result in the same measured value for $\mathbf{V}$. Therefore the measurement is regarded as the outcome of a stochastic 3-vector $\mathbf{V} = (V_x, V_y, V_z)$.

Figure 1. (Sketch of the sphere with coordinate axes $x$, $y$, $z$.)

It is assumed that the outcome of one of the components $V_x, V_y, V_z$ does not interfere with the probability distribution of the remaining components (still to be measured). More specifically, it is assumed that the variables $V_x, V_y, V_z$ are statistically independent.

$^{*}$ “Gedankenexperiment” (thought experiment) is a German word. It means an experiment, consistent with the laws of physics, that can only be done in one’s imagination.
For reasons of symmetry one may assume that the $V_x, V_y, V_z$ are identically distributed and that $E(V_x) = E(V_y) = E(V_z) = 0$. In fact, because of the perfect symmetry in the experiment, one may assume that the probability distribution of $(V_x, V_y, V_z)$ is rotation invariant. This, however, implies (see §I.11, Exercise 40) that the variables $V_x, V_y, V_z$ have one and the same $N(0,\sigma^2)$-distribution! Let $\|\mathbf{V}\|$ be the length of $\mathbf{V}$. One has
$$\frac{\|\mathbf{V}\|^2}{\sigma^2} = \Bigl(\frac{V_x}{\sigma}\Bigr)^2 + \Bigl(\frac{V_y}{\sigma}\Bigr)^2 + \Bigl(\frac{V_z}{\sigma}\Bigr)^2.$$
Looking back upon Proposition II.2.7 it follows that $\|\mathbf{V}\|^2/\sigma^2$ is $\chi^2$-distributed with 3 degrees of freedom. (This also explains why the parameter $n$ is called the number of degrees of freedom. In a 2-dimensional model the variable $\|\mathbf{V}\|^2/\sigma^2$ would be $\chi^2$-distributed with 2 degrees of freedom.) Using Proposition II.2.8 one now obtains a well-known result in kinetic gas theory:
$$E(\|\mathbf{V}\|^2) = 3\sigma^2 \quad\text{and}\quad \operatorname{var}(\|\mathbf{V}\|^2) = 6\sigma^4.$$
Now, on the one hand, for an ideal gas one has
$$P = \frac{NM\, E(\|\mathbf{V}\|^2)}{3I} = \frac{NM\sigma^2}{I},$$
where $N$ is the total number of molecules, $M$ the total mass of the gas, $I$ the volume of the ball, and $P$ the pressure. On the other hand it is known that for an ideal gas the law of Boyle–Gay-Lussac is valid:
$$PI = NR\,T,$$
$T$ denoting the absolute temperature and $R$ being a universal constant (the universal gas constant). It follows that $\sigma^2 = RT/M$. Thus the microscopic quantity $\sigma^2$ is expressed in terms of macroscopic quantities. It thus appears that the numerical value of $\sigma^2$ can be determined by measurement of macroscopic quantities. As noted above, $\|\mathbf{V}\|^2/\sigma^2$ is $\chi^2$-distributed with 3 degrees of freedom. Therefore the probability density of $\|\mathbf{V}\|^2/\sigma^2$ can be described explicitly. Using the techniques discussed in §I.10, one deduces that the probability density of $\|\mathbf{V}\|$ is given by
$$f(x) = \begin{cases} \sqrt{\dfrac{2}{\pi}} \left(\dfrac{M}{RT}\right)^{3/2} x^2\, e^{-Mx^2/2RT} & \text{if } x \geq 0, \\ 0 & \text{elsewhere.} \end{cases}$$
This result is due to the brilliant Scottish physicist James C. Maxwell (1831–1879). The probability density of $\|\mathbf{V}\|$ is therefore called Maxwell’s molecular velocity density in ideal gases.
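For illustration, a short simulation (NumPy assumed; the value of $\sigma$ is an arbitrary choice) confirms the two moments of $\|\mathbf{V}\|^2$ derived above.

```python
# Simulate velocity vectors with independent N(0, sigma^2) components and check
# that E(||V||^2) = 3*sigma^2 and var(||V||^2) = 6*sigma^4.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.7                                     # arbitrary illustrative value
v = rng.normal(0.0, sigma, size=(100_000, 3))   # rows are simulated velocity vectors
speed2 = (v**2).sum(axis=1)
print(speed2.mean(), 3 * sigma**2)   # both close to 1.47
print(speed2.var(), 6 * sigma**4)    # both close to 1.44
```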
Remark. In classical mechanics, where velocities are not too high, it is reasonable to presume statistical independence of $V_x$, $V_y$ and $V_z$. However, in relativistic mechanics (to be used when dealing with extremely high velocities) one will have to count with the fact that the velocity of a particle cannot exceed the velocity of light. This is the same as saying that $V_x^2 + V_y^2 + V_z^2 \leq c^2$, where $c$ denotes the velocity of light. It follows that a large measured value of $V_x$ implies that the values of $V_y$ and $V_z$ (still to be measured) can only be small. For this reason there cannot possibly be independence of the components $V_x$, $V_y$ and $V_z$ in the relativistic model!

A direct consequence of Proposition II.2.6 is:

Proposition II.2.9. Let $X$ and $Y$ be statistically independent and $\chi^2$-distributed with $m$ and $n$ degrees of freedom respectively. Then the variable $X + Y$ is $\chi^2$-distributed with $m + n$ degrees of freedom.

In the sequel a converse to the previous proposition will also be needed:

Proposition II.2.10. Suppose $T = X + Y$, where $X$ and $Y$ are statistically independent. If $X$ and $T$ are $\chi^2$-distributed with $m$ and $n$ degrees of freedom respectively, where $m < n$, then $Y$ is $\chi^2$-distributed with $n - m$ degrees of freedom.

Proof. By Proposition I.8.8 the moment generating functions of $X$, $Y$ and $T$ are related as
$$M_T(t) = M_X(t)\, M_Y(t).$$
Consequently
$$\frac{1}{(1-2t)^{n/2}} = \frac{1}{(1-2t)^{m/2}}\, M_Y(t).$$
Therefore
$$M_Y(t) = \frac{1}{(1-2t)^{(n-m)/2}}.$$

This is, however, the moment generating function corresponding to a $\chi^2$-distribution with $n - m$ degrees of freedom. By Theorem I.8.3 this proves the proposition. $\Box$

It turns out that in the case of a sample from a normally distributed population the distribution of the sample variance $S^2$ is related to the $\chi^2$-distribution. To state this more precisely the following theorem, which is interesting in itself, is formulated (and proved). At first sight the statement in the theorem might seem odd, for the quantity $\overline{X}$ occurs explicitly in the defining expression of $S^2$ (see Definition II.1.2).

Theorem II.2.11. When sampling from a normally distributed population the statistics $S^2$ and $\overline{X}$ are statistically independent.
Proof. With help of Theorem I.6.7 the proof of this theorem is reduced to a simple verification. Let $X_1,\ldots,X_n$ be a sample from a normally distributed population. As in §I.6, denote the linear span of the $X_1,\ldots,X_n$ by $V$. Then for all $i$ one has $X_i - \overline{X} \in V$ and $\overline{X} \in V$. Furthermore
$$\operatorname{cov}(X_i - \overline{X}, \overline{X}) = \operatorname{cov}(X_i, \overline{X}) - \operatorname{cov}(\overline{X}, \overline{X}) = \frac{1}{n}\sum_j \operatorname{cov}(X_i, X_j) - \operatorname{var}(\overline{X}).$$
For $i \neq j$ the pair $X_i$, $X_j$ is statistically independent; therefore one has
$$\operatorname{cov}(X_i, X_j) = 0 \quad\text{if } i \neq j.$$
Moreover, by Proposition II.1.1, one has
$$\operatorname{var}(\overline{X}) = \frac{\sigma^2}{n}.$$
It is now easy to verify that
$$\operatorname{cov}(X_i - \overline{X}, \overline{X}) = \frac{1}{n}\operatorname{var}(X_i) - \frac{\sigma^2}{n} = \frac{\sigma^2}{n} - \frac{\sigma^2}{n} = 0.$$
Using Theorem I.6.7 it follows that the stochastic $n$-vector $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ and $\overline{X}$ are statistically independent. However, $S^2$ is a continuous function of the vector $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ and it therefore follows (by Proposition I.4.2) that the pair $S^2$ and $\overline{X}$ is also statistically independent. $\Box$

Remark. The peculiar independence of $\overline{X}$ and $S^2$ is a characteristic property, which distinguishes normal distributions from all other probability distributions (see [51] and also Exercise 24).

The previous theorem provides a tool to determine the probability distribution of $S^2$, when dealing with normally distributed populations.

Theorem II.2.12. If $X_1,\ldots,X_n$ (where $n \geq 2$) is a sample from an $N(\mu,\sigma^2)$-distributed population, then the statistic $(n-1)S^2/\sigma^2$ is $\chi^2$-distributed with $n-1$ degrees of freedom.

Proof. Under the above-mentioned conditions the variables $(X_i - \mu)/\sigma$ are all $N(0,1)$-distributed and they constitute a statistically independent system. Therefore, by Proposition II.2.7, the quantity
$$\sum_{i=1}^n \Bigl(\frac{X_i - \mu}{\sigma}\Bigr)^2$$
is $\chi^2$-distributed with $n$ degrees of freedom. Moreover, by Proposition II.1.2, the variable $\sqrt{n}\,(\overline{X}-\mu)/\sigma$ is $N(0,1)$-distributed. It follows from this that $n(\overline{X}-\mu)^2/\sigma^2$ is $\chi^2$-distributed with 1 degree of freedom. Furthermore one has (see the proof of Proposition II.1.3)
$$\sum_{i=1}^n \Bigl(\frac{X_i-\mu}{\sigma}\Bigr)^2 = \sum_{i=1}^n \frac{(X_i-\overline{X})^2}{\sigma^2} + n\Bigl(\frac{\overline{X}-\mu}{\sigma}\Bigr)^2 = \frac{(n-1)S^2}{\sigma^2} + n\Bigl(\frac{\overline{X}-\mu}{\sigma}\Bigr)^2.$$
Applying Theorem II.2.11 together with Proposition II.2.10, it now follows that $(n-1)S^2/\sigma^2$ is $\chi^2$-distributed with $n-1$ degrees of freedom. $\Box$

The $\chi^2$-distribution has been tabulated in detail (see Table IV). In the following way (for example) one can utilize such tables:

Example 2. In the closing of §II.1 there was talk of 8 measurements involving the falling-time of a leaden ball from a height of 100 meters. Here the statistic $7S^2/\sigma^2$ is (a priori) $\chi^2$-distributed with 7 degrees of freedom. From Table IV one can extract:
$$P\Bigl(\frac{7S^2}{\sigma^2} \leq 2.17\Bigr) = 0.05 \quad\text{and}\quad P\Bigl(\frac{7S^2}{\sigma^2} \geq 14.1\Bigr) = 0.05.$$
It thus appears that
$$P\Bigl(2.17 < \frac{7S^2}{\sigma^2} < 14.1\Bigr) = 0.90. \quad\text{Consequently}\quad P\Bigl(0.31 < \frac{S^2}{\sigma^2} < 2.01\Bigr) = 0.90.$$
It follows that, when doing this experiment, one has a priori a 90% chance that the variable $S^2/\sigma^2$ will assume a value between 0.31 and 2.01. Once the experiment is finished, the numerical value of $S^2$ can be determined by computation. In the experiment one got (see §II.1) $S^2 = 0.0714$. Substitution of this value of $S^2$ in the inequality $0.31 < S^2/\sigma^2 < 2.01$ then provides an interval estimate for $\sigma^2$.
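A short sketch (SciPy assumed) reproduces the two table values used in Example 2 and carries out the substitution $S^2 = 0.0714$ to obtain the resulting bounds for $\sigma^2$.

```python
# Reproduce the chi-square quantiles for 7 degrees of freedom and the resulting
# 90% interval estimate for sigma^2 when S^2 = 0.0714 (SciPy assumed).
from scipy.stats import chi2

lo, hi = chi2.ppf(0.05, df=7), chi2.ppf(0.95, df=7)
print(round(lo, 2), round(hi, 2))   # ~2.17 and ~14.07
s2 = 0.0714
print(7 * s2 / hi, 7 * s2 / lo)     # lower and upper bound for sigma^2
```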
For $a \leq 0$ one clearly has $P(U \leq a) = 0$; for $a > 0$ the probability $P(U \leq a)$ can be expressed via the $\chi^2$-distribution of $Y$. Differentiation of this expression with respect to $a$ gives us the probability density $f_U$ of $U$:
$$f_U(u) = \begin{cases} \dfrac{u^{n-1} e^{-u^2/2}}{2^{n/2-1}\,\Gamma(n/2)} & \text{if } u > 0, \\ 0 & \text{if } u \leq 0. \end{cases}$$

Step 2. (The probability distribution of $Z/U$.) Using the techniques discussed in §I.10, one obtains
$$P\Bigl(\frac{Z}{U} \leq a\Bigr) = P(Z \leq aU) = P\bigl((Z,U) \in G_a\bigr),$$
where $G_a = \{(x,u) \in \mathbb{R}^2 : u > 0 \text{ and } x \leq au\}$. Because of the independence of $Z$ and $U$, one finds the following expression for the probability density of $(Z,U)$:
$$f_{Z,U}(x,u) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, f_U(u) & \text{if } u > 0, \\ 0 & \text{if } u \leq 0. \end{cases}$$
Keeping $u$ fixed, substitution of $x = ut$ in the inner integral gives
$$P\Bigl(\frac{Z}{U} \leq a\Bigr) = \int_0^{+\infty} f_U(u) \Bigl\{\int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\, e^{-(ut)^2/2}\, u\, dt\Bigr\}\, du
\;\overset{\text{(Fubini)}}{=}\; \int_{-\infty}^{a} \Bigl\{\int_0^{+\infty} f_U(u)\, \frac{1}{\sqrt{2\pi}}\, e^{-(ut)^2/2}\, u\, du\Bigr\}\, dt,$$
where the order of integration was changed. It follows from this that the probability density $\tilde f$ of $Z/U$ can be written as
$$\tilde f(a) = \int_0^{+\infty} f_U(u)\, \frac{1}{\sqrt{2\pi}}\, e^{-(au)^2/2}\, u\, du
= \frac{2}{\sqrt{2\pi}\; 2^{n/2}\, \Gamma(\frac n2)} \int_0^{+\infty} u^{n} e^{-(1+a^2)u^2/2}\, du.$$
Now a substitution $u = v/\sqrt{1+a^2}$ in the last integral gives
$$\tilde f(a) = \frac{2}{\sqrt{2\pi}\; 2^{n/2}\, \Gamma(\frac n2)}\, (1+a^2)^{-\frac{n+1}{2}} \int_0^{+\infty} v^{n} e^{-v^2/2}\, dv.$$
By a substitution $v = \sqrt{2s}$, the integral on the right side can be expressed in terms of the gamma function:
$$\int_0^{+\infty} v^{n} e^{-v^2/2}\, dv = \frac{2^{n/2}}{\sqrt{2}} \int_0^{+\infty} s^{\frac{n-1}{2}} e^{-s}\, ds = \frac{2^{n/2}}{\sqrt{2}}\, \Gamma\Bigl(\frac{n+1}{2}\Bigr).$$
In this way one obtains
$$\tilde f(a) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{\pi}\, \Gamma(\frac n2)}\, (1+a^2)^{-\frac{n+1}{2}}.$$

Step 3. (The probability density of $Z/\sqrt{Y/n}$.) One may write
$$\frac{Z}{\sqrt{Y/n}} = \sqrt{n}\, \frac{Z}{\sqrt{Y}} = \sqrt{n}\, \frac{Z}{U}.$$
By applying the result of Exercise 25 in §I.11 one obtains an expression for the probability density $f$ of $\sqrt{n}\, Z/U$:
$$f(a) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac n2)} \Bigl(1 + \frac{a^2}{n}\Bigr)^{-\frac{n+1}{2}}.$$
This completes the proof of the proposition. $\Box$
Warning. The proposition above is generally false if the condition of independence of $Z$ and $Y$ is dropped. To illustrate this, choose any $N(0,1)$-distributed variable $Z$ and set $Y = Z^2$. Then $Y$ is $\chi^2$-distributed with one degree of freedom. Now the variable $Z/\sqrt{Y} = Z/|Z|$ can assume the values $+1$ and $-1$ only. Therefore this variable cannot possibly have an absolutely continuous distribution, and therefore definitely not a $t$-distribution.

Definition II.3.1. A stochastic variable is said to be $t$-distributed with $n$ degrees of freedom if it has a probability density of the type described in Proposition II.3.1.

Remark. The $t$-distribution was first introduced by William S. Gosset (1876–1937). He was employed at a brewery. The brewery, however, forbade its employees to publish scientific results. For this reason he published his scientific work under the pseudonym “Student”. Accordingly, many people nowadays talk about a “Student distribution” instead of a “$t$-distribution”.

The $t$-distribution has been tabulated in detail (see Table III).

Theorem II.3.2. If $X_1,\ldots,X_n$ is a sample from an $N(\mu,\sigma^2)$-distributed population, then the statistic
$$\frac{\overline{X} - \mu}{S/\sqrt{n}}$$
is $t$-distributed with $n-1$ degrees of freedom.

Proof. A direct consequence of Proposition II.3.1 and the discussion preceding it. $\Box$

The theorem above can be used to construct confidence intervals for $\mu$:

Example. This example leans on the experiment discussed in §II.1. If the falling-time is measured 8 times, then the quantity
$$\frac{\overline{X} - \mu}{S/\sqrt{8}}$$
is (by Theorem II.3.2) a priori $t$-distributed with 7 degrees of freedom. From Table III it can therefore be read off that
$$P\Bigl(\frac{\overline{X}-\mu}{S/\sqrt{8}} \geq 1.895\Bigr) = 0.05.$$
Examination of the probability density in Proposition II.3.1 shows that the $t$-distribution is symmetric with respect to the origin. It follows from this that one also has
$$P\Bigl(\frac{\overline{X}-\mu}{S/\sqrt{8}} \leq -1.895\Bigr) = 0.05.$$
By combining this one obtains
$$P\Bigl(-1.895 < \frac{\overline{X}-\mu}{S/\sqrt{8}} < 1.895\Bigr) = 0.90.$$
In other words, when doing this experiment one has (a priori) a 90% chance that the statistic
$$\frac{\overline{X}-\mu}{S/\sqrt{8}}$$
will assume a value between $-1.895$ and $+1.895$. Substitution of the outcomes $S = 0.267$ and $\overline{X} = 4.40$ in the inequality $-1.895 < (\overline{X}-\mu)/(S/\sqrt{8}) < 1.895$ then provides a 90% confidence interval for $\mu$.
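As a sketch of the arithmetic (SciPy assumed), the critical value 1.895 and the corresponding interval for $\mu$ can be obtained as follows.

```python
# Critical value of the t-distribution with 7 degrees of freedom and the
# resulting 90% confidence interval for mu, using the outcomes quoted above.
from scipy.stats import t

xbar, s, n = 4.40, 0.267, 8
c = t.ppf(0.95, df=n - 1)               # ~1.895
half = c * s / n**0.5
print(round(c, 3), (round(xbar - half, 2), round(xbar + half, 2)))
```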
$$P\Bigl(\frac{(\overline{X}-\overline{Y}) - (\mu_X - \mu_Y)}{0.0095} \geq 1.96\Bigr) = 0.025.$$
Because of symmetry, one also has
$$P\Bigl(\frac{(\overline{X}-\overline{Y}) - (\mu_X - \mu_Y)}{0.0095} \leq -1.96\Bigr) = 0.025.$$
Since $U/V \geq 0$, one has $P(U/V \leq a) = 0$ if $a \leq 0$. For $a > 0$ one can write
$$P\Bigl(\frac{U}{V} \leq a\Bigr) = P(U \leq aV) = P\bigl((U,V) \in A\bigr),$$
where $A = \{(u,v) : u \leq av \text{ and } u, v \geq 0\} \subset \mathbb{R}^2$. By using the techniques discussed in §I.10 one finds that
$$P\Bigl(\frac{U}{V} \leq a\Bigr) = \iint_A \frac{1}{2^{\frac{m+n}{2}}\,\Gamma(\frac m2)\,\Gamma(\frac n2)}\; u^{\frac m2 - 1} e^{-u/2}\; v^{\frac n2 - 1} e^{-v/2}\, du\, dv.$$
Denote (temporarily)
$$C = \frac{1}{2^{\frac{m+n}{2}}\,\Gamma(\frac m2)\,\Gamma(\frac n2)}.$$
By applying Fubini’s theorem about successive integration on the last integral one obtains
$$P\Bigl(\frac{U}{V} \leq a\Bigr) = C \int_0^{+\infty} \Bigl\{\int_0^{av} u^{\frac m2 - 1} e^{-u/2}\, du\Bigr\}\, v^{\frac n2 - 1} e^{-v/2}\, dv.$$
Keeping $v$ fixed and substituting $u = vt$ in the inner integral, the right side becomes
$$C \int_0^{+\infty} v^{\frac n2 - 1} e^{-v/2} \Bigl\{\int_0^{a} v^{\frac m2}\, t^{\frac m2 - 1} e^{-vt/2}\, dt\Bigr\}\, dv
= C \int_0^{a} t^{\frac m2 - 1} \Bigl\{\int_0^{+\infty} v^{\frac{m+n}{2} - 1} e^{-(1+t)v/2}\, dv\Bigr\}\, dt.$$
Keeping $t$ fixed and substituting $v = \dfrac{2s}{1+t}$ in the inner integral, one obtains
$$C \int_0^{a} t^{\frac m2 - 1} \Bigl(\frac{2}{1+t}\Bigr)^{\frac{m+n}{2}} \Bigl\{\int_0^{+\infty} s^{\frac{m+n}{2} - 1} e^{-s}\, ds\Bigr\}\, dt
= C\, 2^{\frac{m+n}{2}}\, \Gamma\Bigl(\frac{m+n}{2}\Bigr) \int_0^{a} t^{\frac m2 - 1} (1+t)^{-\frac{m+n}{2}}\, dt.$$
Consequently
$$P\Bigl(\frac{U}{V} \leq a\Bigr) = \frac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac m2)\,\Gamma(\frac n2)} \int_0^{a} t^{\frac m2 - 1}\, (1+t)^{-\frac{m+n}{2}}\, dt.$$
Now, by differentiation with respect to $a$, the probability density $\tilde f$ of $U/V$ is obtained:
$$\tilde f(a) = \begin{cases} \dfrac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac m2)\,\Gamma(\frac n2)}\; a^{\frac m2 - 1}\, (1+a)^{-\frac{m+n}{2}} & \text{if } a \geq 0, \\ 0 & \text{elsewhere.} \end{cases}$$
Chapter II Statistics and their probability distributions, estimation theory
Step 2. (The probability distribution of
U=m .) V =n
Evidently one may write
U=m nU D : V =n mV Applying §I.11, Exercise 25, one obtains the probability density f of one of
U V
U=m V =n
from the
: 8 mCn m ˆ < . 2 / m 2 a m2 n m f .a/ D . 2 /. 2 / n ˆ :0
1
m 1C a n
mCn 2
if a > 0; elsewhere:
This expression presents the probability density of an F -distribution with m and n degrees of freedom in the numerator and denominator respectively. The F -distribution has been tabulated in detail (see Table V). When making use of such tables, it is frequently handy to apply the next proposition: Proposition II.5.2. Suppose X is F -distributed with m degrees of freedom in the numerator and n degrees of freedom in the denominator. Then the variable 1=X is F -distributed with n degrees of freedom in the numerator and m degrees of freedom in the denominator. Proof. See Exercise 35. The use of Proposition II.5.2 is illustrated in the following example: Example 1. Suppose X is F -distributed with 7 and 10 degrees of freedom in numerator and denominator respectively. Through Table V one wants to find a number a such that P .X a/ D 0:05. Solution. One has P .X a/ D P
1 1 : X a
By Proposition II.5.2 the variable 1=X is F -distributed with 10 and 7 degrees of freedom in numerator and denominator respectively. From Table V it can now be read off that 1 1 P D 0:05 X a if a
1
D 3:64, that is a D 0:27. Consequently
P .X 0:27/ D 0:05:
93
Section II.5 The F -distribution
Theorem II.5.3. Suppose X1 ; : : : ; Xm and Y1 ; : : : ; Yn are statistically independent samples from an N.X ; X2 /-distributed and an N.Y ; Y2 /-distributed population respectively. Then the statistic SX2 =X2 SY2 =Y2 is F -distributed with m 1 degrees of freedom in the numerator and n of freedom in the denominator respectively.
1 degrees
Proof. By Theorem II.2.12 the variables .m 1/SX2 =X2 and .n 1/SY2 =Y2 are 2 distributed with m 1 and n 1 degrees of freedom respectively. By Proposition I.4.2 the pair SX2 and SY2 is independent. Now an application of Theorem II.5.1 completes the proof. Example 2. Return to the experiment in which the falling-time of a ball was measured 10 times with apparatus A and 16 times with apparatus B . These measurements can be represented by two independent samples X1 ; : : : ; X10 and Y1 ; : : : ; Y16 . It is assumed that they emanate from normally distributed populations with variances X2 and Y2 respectively. The outcome of the samples reveals (after some calculations): SX2 D 0:0030 and SY2 D 0:0035: Using this information a 95% confidence interval is computed for X =Y . By Theorem II.5.3 the statistic SX2 =X2 SY2 =Y2 is (a priori) F -distributed with 9 and 15 degrees of freedom in the numerator and denominator. From Table V it can be read off that ! SX2 =X2 1 P < 2 2 < 3:12 D 0:95: 3:77 SY =Y Substituting SX2 D 0:0030 and SY2 D 0:0035 in the inequality S 2 = 2 1 < X2 X2 < 3:12; 3:77 SY =Y the following inequality for Y =X emerges: 0:56
0 one defines Z 1 B.p; q/ WD t p 1 .1
t /q
1
dt:
0
The function B W .0; C1/ .0; C1/ ! .0; C1/ is called the beta function. Proposition II.6.1. The beta function is symmetric, that is to say, for all p; q > 0 one has B.p; q/ D B.q; p/: Proof. Substitute t D 1
s in the integral
R1 0
tp
1 .1
t /q
1 dt .
The beta function can be represented in various ways. In the following propositions some useful representations will be given. Proposition II.6.2. For all p; q > 0 one has Z C1 B.p; q/ D s p 1 .1 C s/
.pCq/
ds:
0
Proof. Substitute t D
s 1Cs
in the integral
R1 0
tp
1 .1
t /q
1 dt .
95
Section II.6 The beta distribution
Proposition II.6.3. For all p; q > 0 one has B.p; q/ D 2
Z
1
s 2p
1
s 2 /q
.1
1
ds:
0
Proof. Substitute t D s 2 in the integral
R1 0
tp
1 .1
t /q
1 dt .
Proposition II.6.4. For all p; q > 0 one has B.p; q/ D 2 Proof. Substitute t D
s2 1Cs 2
Z
C1
s 2p
1
0
.1 C s 2 /
R1
in the integral
0
tp
.pCq/
1 .1
t /q
ds:
1 dt .
Proposition II.6.5. For all p; q > 0 one has B.p; q/ D 2
Z
=2
cos2p
1
sin2q
1
d:
0
Proof. Substitute t D cos2 in the integral
R1 0
tp
1 .1
t /q
1 dt:
It is interesting that the beta function is closely connected to the gamma function: Theorem II.6.6. For all p; q > 0 one has B.p; q/ D
.p/.q/ : .p C q/
Proof. By definition of the gamma function one has .x/ D
Z
C1
tx
1
e
t
dt
.x > 0/:
0
Substituting t D u2 in the integral on the right side one obtains the following representation of the gamma function: .x/ D 2
Z
C1
u2x
1
e
u2
du:
./
0
Applying Fubini’s theorem about successive integration, one can therefore express .p/.q/ as a double integral .p/.q/ D 4
Z
0
C1 Z C1 0
x 2p
1 2q 1
y
e
.x 2 Cy 2 /
dxdy:
96
Chapter II Statistics and their probability distributions, estimation theory
In this double integral one passes to polar coordinates, that is to say, there is a change of variables of type: x D r cos and y D r sin : The Jacobian of this transformation is given by ˇ " @x ˇ ˇ @r J.r; / D ˇdet @y ˇ @r
@x @ @y @
#ˇ ˇ ˇ ˇ D r: ˇ
By the substitution formula for multiple integrals, the double integral now assumes the form Z C1 Z =2 2 4 cos2p 1 sin2q 1 r 2pC2q 1 e r d dr: 0
0
Applying (again) Fubini’s theorem, one can write this integral as ! Z Z =2
2p 1
cos
4
sin
2q 1
C1
d
0
r
2pC2q 1
0
e
r2
dr :
By using Proposition II.6.5 together with the representation ./ of the gamma function one arrives at .p/.q/ D B.p; q/.p C q/:
This proves the theorem.
The next example illustrates a simple application of the previous theorem. Example 1. The area enclosed by a semi-circle with radius R is presented by the integral Z 1p Z CR p 2 2 2 R x dx D 2R 1 s 2 ds: R
0
B. 12 ; 1 12 /R2 .
By Proposition II.6.3 this can be written as rem II.6.6 that . 21 /.1 12 / : B. 12 ; 1 12 / D .2/
Via Propositions II.2.1 and II.2.2 one gets p 1p 2 1 1 B. 2 ; 1 2 / D D 21 : 1Š Consequently
Z
CR R
p R2
x 2 dx D 21 R2 ;
which of course presents a familiar result.
It follows from Theo-
97
Section II.6 The beta distribution
Now the beta distribution is introduced: Definition II.6.2. If X has an absolutely continuous distribution with a probability density f given by 8 < 1 x p 1 .1 x/q 1 if x 2 .0; 1/, f .x/ D B.p; q/ : 0 elsewhere; then it is said that X is beta distributed with parameters p; q > 0.
Remark. In the beta distribution the parameters p and q are not exchangeable.
p D 4, q D 4
p D 12, q D 12
p D 6, q D 20
p D 12, q D 4
Figure 3. The density of the beta distribution for several values of p and q.
Proposition II.6.7. If X is a beta distributed with parameters p and q, then: p (i) E.X / D , pCq pq (ii) var.X / D . 2 .p C q/ .p C q C 1/
98
Chapter II Statistics and their probability distributions, estimation theory
Proof. See Exercise 17 for the proof of (ii). One has Z Z 1 p x .1 x/q 1 E.X/ D xfX .x/ dx D dx B.p; q/ 0 B.p C 1; q/ .p C 1/.q/ .p C q/ D D B.p; q/ .p C q C 1/ .p/.q/ p .p/.q/ .p C q/ p D D ; .p C q/.p C q/ .p/.q/ pCq which directly proves (i). The beta distribution is related to the binomial distribution. A preparatory lemma is used to reveal this relationship: Lemma II.6.8. Suppose p and q are positive integers. Let Xb be a stochastic variable which is binomially distributed with parameters p C q 1 and 2 .0; 1/. Then for p > 1 one has Z .p C q/ p 1 t .1 t /q 1 dt 0 .p/.q/ Z .p C q/ D P .Xb D p 1/ C t p 2 .1 t /q dt: 1/.q C 1/ 0 .p Proof. Denote I.p; q/ WD
Z
0
.p C q/ p t .p/.q/
1
.1
t /q
1
dt:
Recall (Proposition II.2.1) that for any positive integer n one has that .n/ D .n 1/Š. Applying partial integration one obtains the following expression for I.p; q/: ´ µ Z .p C q 1/Š t p 1 .1 t /q p 1 p 2 C t .1 t /q dt : I.p; q/ D .p 1/Š.q 1/Š q q 0 0 Therefore I.p; q/ D
.p C q 1/Š p .p 1/Š qŠ
1
.1
/q C
.p C q 1/Š .p 2/Š qŠ
Z
tp
0
It thus appears that I.p; q/ D This proves the lemma.
P .Xb D p
1/ C I.p
1; q C 1/:
2
.1
t /q dt:
99
Section II.7 Populations that are not normally distributed
Proposition II.6.9. Suppose Xˇ is beta distributed with parameters p and q, both positive integers, where p > 1. Let Xb be binomially distributed with parameters p C q 1 and 2 .0; 1/. Then P .Xˇ / D P .Xb p/: Proof. Applying Lemma II.6.8 again and again, one has (in the notation introduced in the proof of this lemma) P .Xˇ / D
Z
0
.p C q/ p t .p/.q/
D I.p; q/ D
1
.1
t /q
P .Xb D p
1
dt
1/ C I.p
1; q C 1/
D
P .Xb D p
D
P .1 Xb p
D
P .1 Xb p
D
P .1 Xb p
1/
.1
D
P .1 Xb p
1/
P .Xb D 0/ C 1
1/
P .Xb D 1/ C I.1; q C p 1/ Z .p C q/ 1/ C .1 t /qCp .1/.q C p 1/ 0 .1 t /qCp 1 .p C q 1/Š 1/ C .q C p 2/Š qCp 1 0 /qCp
1
2
dt
C1
D P .Xb p/: This proposition will be used in the construction of so-called Bayesian interval estimates in §II.8.
II.7 Populations that are not normally distributed Up to now it was in this chapter systematically assumed that the samples emanated from normally distributed populations. If the population is not normally distributed, then what can one say about the distribution of, for example, the statistic X ? For large-sized samples it turns out that X is approximately normally distributed. More specific, if X1 ; : : : ; Xn is a sample from a population with mean and variance 2 , whereas n is large, then by the central limit theorem (Theorem I.9.3) the variable 1 P X i Xi n p D p = n = n is approximately N.0; 1/-distributed. Below some applications of this are given in the case of so-called Bernoulli distributed populations:
100
Chapter II Statistics and their probability distributions, estimation theory
Definition II.7.1. Suppose X is a stochastic variable that can assume the values 0 and 1 only. Denoting P .X D 1/ D , one says that X is Bernoulli distributed with parameter . Example 1. In a big city 20% of the citizens are infected with a certain disease. At random a citizen is chosen and tested whether he or she is infected. To this experiment an associated stochastic variable X is defined as follows: ´ 1 if the person is infected, XD 0 if the person is not infected. Now X is Bernoulli distributed with parameter D 0:2. Proposition II.7.1. If X is Bernoulli distributed with parameter , then: (i) E.X / D , (ii) var.X / D .1
/.
Proof. Trivial. By Proposition II.7.1 and the considerations preceding it, it follows that: Theorem II.7.2. Suppose X1 ; : : : ; Xn is a sample from a Bernoulli distributed population with parameter . Then for large n the statistic r .1 / .X / n is approximately N.0; 1/-distributed. Remark. In practice, Theorem II.7.2 is also applied for relatively small values of n. To be assured of a reasonable approximation, one frequently requires that n 6; n.1
/ 6 and n 30:
This should be regarded as a mere empirical rule-of-thumb. Example 2. In a big city there is a wish to make an interval estimate of the percentage of the citizens infected with a certain disease. To this end successively 200 persons are chosen. After testing, it turns out that among these 200 citizens there are 36 persons infected. Actually, a sample X1 ; : : : ; X200 was drawn here / from a Bernoulli distributed population with unknown parameter . Among the variables X1 ; : : : ; X200 the / Formally, the selection of the citizens should be done “with replacement”, which means that one and the same person possibly can be chosen several times. Otherwise “the probability distribution is changing” while carrying out the experiment. Because one is dealing with a big city it is, practically spoken, of no consequence whether one replaces or not.
Section II.7 Populations that are not normally distributed
101
value 1 has been observed 36 times, so one may write XD
X1 C C X200 36 D D 0:18: 200 200
Starting from this information, a 95% confidence interval for is constructed. By Theorem II.7.2 the statistic r .1 / .X / n is approximately N.0; 1/-distributed. For this reason one finds (using Table II) 1 0 ˇ ˇ ˇ X ˇ ˇ C Bˇ P @ˇ q ˇ < 1:96A 0:95: ˇ .1 / ˇ 200
Substitution of the outcome X D 0:18 in the inequality ˇ ˇ ˇ X ˇ ˇ ˇ ˇq ˇ < 1:96 ˇ .1 / ˇ 200
provides
r
j0:18
.1 / : 200
j < 1:96
However, for all 2 .0; 1/ one has .1 j0:18
./
/ 14 . Consequently r
j < 1:96
1=4 : 200
In this way one obtains the following inequality for : 0:11 < < 0:25: Now, roughly, the interval .0:11; 0:25/ presents a 95% confidence interval for . The calculations can be refined a bit by squaring ./. Then a quadratic inequality emerges. The solution of it is presented by 0:13 < < 0:24: One thus obtains a slightly smaller, that is to say, a slightly better 95% confidence interval for .
102
II.8
Chapter II Statistics and their probability distributions, estimation theory
Bayesian estimation
In this section an introductory exposition of the Bayesian point of view in making estimates is given. In this point of view one assumes that the parameter to be estimated enjoys beforehand a given probability distribution. After the outcome of a sample has been obtained, a new (conditional) probability distribution for the parameter in question emerges. Frequently this newly obtained distribution has a smaller variance than the original one. Sharpened estimates may emanate from this. The above will now be illustrated in detail in the case of a Bernoulli distributed population: Let X1 ; : : : ; Xn be a sample of size n from a Bernoulli distributed population with parameter . Denote X WD X1 C X2 C C Xn : Then X is binomially distributed (see Exercise 26) with parameter n and , that is to say: ! n k P .X D k/ D .1 /n k .k D 0; 1; : : : ; n/: k From now on will be regarded as the outcome of a stochastic variable ‚. One then may write the above as a conditional probability (see Appendix C): ! n k P .X D k j ‚ D / D .1 /n k .k D 0; 1; : : : ; n/: k Next, additionally, assume that ‚ has a beta distribution with parameters p and q. Given an outcome X D k, then what can one say about the new (conditional) probability distribution of ‚? Before answering this question, one has to create order in the mathematical usage. What exactly is meant by the expression P .X D k j ‚ D /? This is discussed below: In probability theory one defines for two events A and B the chance that “A, given B ” occurs as (see Appendix C): P .A j B/ D
P .A \ B/ ; P .B/
provided
P .B/ ¤ 0:
Therefore, as a first step, it would be quite natural to define: P .X D k j ‚ D / D
P .X D k and ‚ D / : P .‚ D /
However, under the assumption that ‚ is beta distributed one has P .‚ D / D 0. For this reason the above is untenable. Instead the concept of a conditional probability density is introduced:
Definition II.8.1. Suppose X and Y enjoy a discrete or absolutely continuous probability distribution. Then the function f . j Y D y/, defined by 8 < fX;Y .x; y/ if f .y/ > 0; Y fY .y/ fX .x j Y D y/ D : 0 if fY .y/ D 0; is said to be the conditional probability density of X , given Y D y .
Why is it defined in this particular way? A heuristical motivation could be as follows: Denote (classically) a small increment of x and y by x and y . Then one may write P .x X x C x j Y D y/ P .x X x C x j y Y y C y/
P .x X x C x and y Y y C y/ P .y Y y C y/ R yCy R xCx f .s; t / ds dt X;Y y x R yCx fY .t /dt y R xCx f .s; y/ ds y X;Y x fY .y/y Z xCx fX;Y .s; y/ ds: fY .y/ x
If one now lets x # 0 and y # 0 (carefully screening off this page so that it will not be read by a severe mathematician), then one can write slyly: P .x X x C x j Y D y/ fX;Y .x; y/ D : x fY .y/ x#0 lim
Roughly spoken, this means that a conditional probability density should indeed be defined in the way presented in Definition II.8.1. As to conditional densities, the following proposition is fundamental. Proposition II.8.1. If X and Y have discrete or absolutely continuous probability distributions, then: (i) fX;Y .x; y/ D fX .x j Y D y/fY .y/ D fY .y j X D x/fX .x/, Z C1 (ii) fX .x/ D fX .x j Y D y/fY .y/ dy , 1
(iii) fY .y/ D
Z
C1 1
fY .y j X D x/fX .x/ dx.
Integration must be replaced by summation in the case of discrete distributions. Proof. (i) is a direct consequence of Definition II.8.1. In turn, (ii) and (iii) are direct consequences of (i). Returning to the problem, one was in trouble with the expressions P .X D k j ‚ D /
and P .‚ D j X D k/;
where X has a binomial and ‚ a beta distribution. It is now quite natural to replace these expressions by conditional densities: fX .k j ‚ D /
and f‚ . j X D k/:
These densities are described in the following elegant theorem: Theorem II.8.2. Let X be a binomially distributed stochastic variable with parameters n and . Consider as the outcome of a stochastic variable ‚, which is beta distributed with parameters p and q. Then the conditional probability density of ‚, given the outcome X D k, is a beta density with parameters p 0 D p C k and q 0 D q C n k. More specific: for k D 0; 1; : : : ; n, one has 8
0. For members A 2 A that are subsets of 0 the above reads simply as P .A/ : P .A j 0 / D P .0 / The map A 7! P .A j 0 / actually defines a probability measure on the set 0 . It is defined on the -algebra A0 consisting of all elements of A that happen to be subsets of 0 . The measure above will be denoted as P . j 0 / or briefly as P0 . Thus a probability space .0 ; A0 ; P0 / arises. The measure P0 will be called the probabilistic restriction of P to 0 . Now suppose that two stochastic variables X and Y are defined on a probability space .; A; P /. Given two Borel sets A and B in R, the above could be applied to the events ¹X 2 Aº and 0 D ¹Y 2 Bº: Note that, relative to the foregoing, the role of A is now played by ¹X 2 Aº. Given a fixed Borel set B in R, the map A 7! P .X 2 A j Y 2 B/ defines a Borel measure on R that will be denoted as PX . j Y 2 B/. This measure is actually the probability distribution of the restriction of X to the space .0 ; A0 ; P0 /. Now suppose that there is another variable Z, defined on .; A; P /, and a Borel set C in R such that ® ¯ ® ¯ ! 2 W Y .!/ 2 B D 0 D ! 2 W Z.!/ 2 C :
In words, the same event 0 is formulated in two different ways. Then, provided P .0 / > 0, one has of course P .X 2 A j Y 2 B/ D P .X 2 A j Z 2 C / for all Borel sets A in R. More briefly, one then has PX . j Y 2 B/ D PX . j Z 2 C /: A paradox in this is, however, that in cases where P .0 / D 0 the above may occur to be false! The following example testifies to this.
Example 2. In this example the so-called Borel paradox is explained. The Borel paradox deals with conditional probabilities of the form P .X 2 A j Y D y/ where A is an arbitrary Borel set in R and where Y D y are events of probability zero. To set the scenario, let be a sphere in R3 with radius 1 and center in the origin. Equip with the relative topology with respect to R3 . That is to say, a subset V in is said to be open in if it can be written as V D\O
with O open in R3 :
The -algebra generated by all open subsets of will be denoted by B. The elements of B are, by definition, the Borel sets of . It can be proved (see for example [58]) that there exists on the -algebra B a unique probability measure that is invariant under rotations around the origin. More precisely, there is on exactly one probability measure P that satisfies P .G/ D P .QG/ for every Borel set G in and every orthogonal linear transformation Q. This measure presents the uniform probability distribution on the sphere and could, in a less pedantic way, also be defined as P .G/ D
area.G/ 4
for all Borel sets G in :
Thus the triplet .; B; P / emerges as a probability space in the sense of Kolmogorov, that is, in the sense of Definition I.1.6. Now, in order to quickly define two stochastic variables, think of the sphere as if it were the earth. Two variables X and Y can then be defined as X.!/ D geographic latitude of the point ! 2 ;
Y .!/ D geographic longitude of the point ! 2 :
The outcomes of X and Y will be taken in radials, not in degrees. Thus the variable X assumes its values in the interval Œ ; C/ and the variable Y in the interval Œ =2; C=2. The sets ¹! W X.!/ D 0º
and
¹! W Y .!/ D 0º
are the equator and the prime meridian respectively. From elementary mathematical analysis it is known that the simultaneous probability distribution of .X; Y / is given by ³ Z ²Z 1 P .X 2 A and Y 2 B/ D cos. / d d' 4 A B
110
Chapter II Statistics and their probability distributions, estimation theory
for all Borel sets A and B in Œ ; C and Œ =2; C=2 respectively. From the above it is easily derived that P .X 2 A and Y 2 B/ D P .X 2 A/P .Y 2 B/
()
with 1 P .X 2 A/ D 2
Z
1 d'
A
and
1 P .Y 2 B/ D 2
Z
cos. / d:
()
B
From () it follows that X and Y are statistically independent. This, together with (), implies that Z 1 P .Y 2 B j X D 0/ D P .Y 2 B/ D 1 d': 2 B In words, on the equator one has a uniform distribution. In contrast to this, however, one has Z 1 cos. / d: P .X 2 B j Y D 0/ D P .X 2 B/ D 2 B
Hence on the prime meridian a non-uniform distribution emerges. In the following this prime meridian will serve as the set 0 . To create the paradox, a third variable Z is now defined such that ® ¯ ® ¯ ! 2 W Y .!/ D 0 D 0 D ! 2 W Z.!/ D 0 :
()
To this end the point !0 is defined to be the place on the equator with longitude =2. For an arbitrary point ! a great circle is drawn through ! and !0 . The point S.!/ is defined to be the particular point where this circle intersects the prime meridian. Figure 4 could serve as an illustration.
S.!/
!
!0
Figure 4
Define Z.!/ to be the arc (in radials) when traveling eastward from S.!/ to ! . Thus the variable Z may assume all values in the interval Œ0; 2/. For this Z one has () indeed. Arguments similar to those given above show that Z 1 P .X 2 A j Z D 0/ D P .X 2 A/ D 1 d: 2 A That is, now a uniform distribution on the prime meridian arises. Hence, in spite of (), one has PX . j Y D 0/ ¤ PX . j Z D 0/: So, despite all rotation invariance on the sphere , there is no canonic probabilistic restriction of P to the prime meridian. This also applies, of course, to other meridians and to arbitrary great circles. For geometers this phenomenon may come as a nightmare. From the geometric point of view, namely, the candidates for a probabilistic restriction P0 of P to a great circle ought to be rotation invariant. Thus, on such circles, a uniform distribution would be left as the only possibility.
The Borel paradox shows that extreme caution is advisable when dealing with conditional probability distributions. Counter-intuitive scenarios may be encountered!
II.9 Estimation theory in a framework

In the previous sections certain statistics were introduced to estimate some particular, characterizing, parameter of the probability distribution of the population. For example, as a point estimate of the mean $\mu$ of a population one took the outcome of the statistic $\overline{X}$. For this reason, such a statistic is called an estimator of $\mu$. By Proposition II.1.1 the estimator $\overline{X}$ of $\mu$ has the following property:
$$E(\overline{X}) = \mu.$$
That is to say, when repeating the experiment unendingly and averaging all the associated outcomes of $\overline{X}$, one obtains$^{*}$ the exact value of $\mu$. Therefore, $\overline{X}$ is said to be an unbiased estimator of $\mu$. In estimation theory, an important branch of statistics, one deals with the construction of estimators that are optimal in some sense. In the following a general framework will be set up, in which the observations above find their natural place.

$^{*}$ This interpretation is justified by the weak and (in a stronger sense) by the strong law of large numbers. These fundamental laws in probability theory and statistics are treated in §VII.2. For the moment, the reader can take this interpretation for granted (without having moral dilemmas concerning the logic in the set-up).

Let $\Theta$ be an arbitrary set and let $f\colon \mathbb{R}\times\Theta \to [0,+\infty)$ be a function. For every fixed $\theta \in \Theta$ denote the function $x \mapsto f(x,\theta)$ by $f(\cdot\,,\theta)$. If
for all 2 ‚ this function presents a (possibly discrete) probability density, then one talks about a family of probability densities with parameter 2 ‚. In this context the set ‚ is called the parameter space. Example 1. For all 2 R and all 2 .0; C1/ the function x 7! f .x; ; /, defined by 1 1 2 2 f .x; ; / WD p e 2 .x / = .x 2 R/ 2 is a probability density. Setting ‚ WD R .0; C1/ and WD .; /, one can say that the family f .; /, where 2 ‚, presents a family of probability densities. Now, once and for all, let f .; / 2‚ be any family of probability densities. It is assumed that the population from which the samples are drawn has a probability density f .; /, where 2 ‚. There is a wish to make estimates (based on samples) of certain “characteristics” of the population: Definition II.9.1. A function W ‚ ! R (or C) is said to be a characteristic of the population. If is constant, then one talks about a trivial characteristic. Example 2. Given is a population with probability density f .; /, where 2 ‚. It is known that the variance 2 of the population exists. Therefore the mean of the population also exists (see §I.11 Exercise 18). In the case of an absolutely continuous probability distribution one can write Z D xf .x; / dx D 1 . /; Z 2 D .x /2 f .x; / dx D 2 . /:
It thus appears that and 2 are functions of : they are characteristics of the population. This can also be observed in the case of discretely distributed populations. Example 3. The characteristic function W R ! C of a population with probability density f .; /, where 2 ‚, was (see §I.8) defined as Z .t; / WD e i tx f .x; / dx .t 2 R; 2 ‚/: For all fixed t 2 R the function 7! .t; /, which will be denoted by .t; /, presents a characteristic of the population.
Definition II.9.2. Let X1 ; : : : ; Xn be a sample from a population with probability density f .; /, where 2 ‚. Suppose is a characteristic of the population. A statistic T D g.X1 ; : : : ; Xn /, the outcome of which is used as an estimate of . /, is said to be an estimator of (based on the sample X1 ; : : : ; Xn ).
113
Section II.9 Estimation theory in a framework
Note that the expectation value of a statistic T D g.X1 ; : : : ; Xn / will, if existing, in general depend on the parameter . For example, in the case of a population with probability density f .; / one has Z Z E.T / D : : : g.x1 ; : : : ; xn /f .x1 ; / : : : f .xn ; / dx1 : : : dxn : If in some context one has to count with this, one could write E .T / instead of E.T /. Definition II.9.3. A statistic T is said to be an unbiased estimator of if E .T / D ./
for all 2 ‚:
Example 4. The statistics X and S 2 present (by Propositions II.1.1 and II.1.3) unbiased estimators of the characteristics and 2 respectively. Unbiased estimators do not always exist, and if they do, they are not always useful. The following two examples testify to this. Example 5. One draws a sample X1 , of size 1, from a binomially distributed population with parameters n and p, where p 2 .0; 1/. Assume that n is known, whereas p is unknown. The probability density of the population is now described by ! n k f .k/ D p .1 p/n k .k D 0; 1; : : : ; n/: k Here the parameter space can be chosen as ‚ WD .0; 1/. Suppose T D g.X1 / is an unbiased estimator of the characteristic p 7! p1 D .p/. Then for all p 2 .0; 1/ one ought to have ! n X n k 1 Ep .T / D g.k/ p .1 p/n k D .p/ D : k p kD0
This equality cannot possibly be valid for all p 2 .0; 1/. To see this, let p # 0. It follows that there is no unbiased estimator of the characteristic p1 . Example 6. Listening to a Geiger–Müller counter, one counts during one minute the passage of cosmic particles. The number of counted particles is denoted by X . Repeating this experiment (under exactly the same conditions), X does not necessarily assume the same value. Thus one can look upon X as a stochastic variable. It is assumed that X is Poisson distributed with parameter > 0. That is to say, it is assumed that k P .X D k/ D e .k D 0; 1; 2; : : :/: kŠ
114
Chapter II Statistics and their probability distributions, estimation theory
During three successive minutes this experiment is repeated. So here one has a sample X1 ; X2 ; X3 of size 3 from a Poisson distributed population with parameter . Once obtained the outcome of X1 , there is a wish to estimate the probability that in the next two minutes there will be no passage of any cosmic particle. More precisely, there is a wish to make an estimate of the characteristic P .X2 C X3 D 0/; based on the sample X1 of size 1. The variable X2 C X3 is (see §I.11, Exercise 47) Poisson distributed with parameter 2. Consequently one is dealing here with a characteristic , defined by ./ WD P .X2 C X3 D 0/ D e
2
:
Now suppose that T D g.X1 / is an unbiased estimator of the characteristic . Then for all > 0 one has E .T / D
1 X
g.k/e
k
kŠ
kD0
D ./ D e
2
:
This is the same as saying that for all > 0 1 X
kD0
g.k/
k De kŠ
D
1 X . /k : kŠ
kD0
By the theory of power series, this is possible only if g.k/ D . 1/k . Therefore T D g.X1 / D . 1/X1 : In other words, the characteristic e 2 is estimated C1 if X1 is even and odd. Obviously such estimates are useless.
1 if X1 is
Proposition II.9.1. Concerning a characteristic , always exactly one of the following statements is true: (i) There is no unbiased estimator of . (ii) There is only one unbiased estimator of . (iii) There are infinitely many unbiased estimators of . Proof. In Examples 5 and 6 it was shown that (i) and (ii) could possibly occur. To complete the proof of the proposition note the following: If there are two different unbiased estimators T1 and T2 in sight, then one can construct infinitely many of them. Namely, for every 2 R the statistic T1 C .1
/T2
is also an unbiased estimate of . To see this, note that for all 2 ‚ one has E .T1 C .1
/T2 / D E .T1 / C .1 D ./ C .1
/E.T2 / /. / D . /:
If one is in the comfortable position where (concerning a characteristic) there are infinitely many unbiased estimators, then one may ask oneself to select in some way “the best ones” among them. To state this more precisely, Chebyshev’s inequality (see §I.11, Exercise 26) is now formulated in terms of estimators. Because the probability distribution of the population depends on 2 ‚, probabilities P will be written as P and variances by var instead of var. Proposition II.9.2 (Chebyshev’s inequality). If T is an unbiased estimator of the characteristic , then one has for all " > 0 and all 2 ‚ that P jT
var .T / : ./j " "2
Proof. See §I.11, Exercise 26. Intuitively, Chebyshev’s inequality expresses that in the case of a small value of var .T / one may expect the outcome of T to be close to the (unknown) value of . /. Of course this is what is to be preferred. Therefore, concerning the variance var .T / of an unbiased estimator T , one can say: “the smaller the variance, the better the estimator.” In connection to this the adjective efficient is often used; the smaller the variance, the more efficient the estimator. The choice to take X as an estimator of is, in connection to the above, not at all a silly one. This is explained by the following exploration. Definition II.9.4. An estimator T , based on a sample X1 ; : : : ; Xn , is said to be linear if it is a linear combination of the X1 ; : : : ; Xn . In other words, T is a linear estimator if T D c1 X1 C C cn Xn ; where c1 ; : : : ; cn are constants. In this terminology the sample mean X obviously presents a linear estimator (just take c1 D D cn D n1 ).
Theorem II.9.3. Let X1 ; : : : ; Xn be a sample from a population with probability density f .; /, where 2 ‚. Assume that for all possible the mean and the variance 2 of the population exists. Moreover, assume that the characteristic is non-trivial. Under these conditions X is among the unbiased linear estimators of the unique one of minimum variance. Proof. Because and 2 are functions of they will be denoted as . / and 2 . /. Let T D c1 X1 C C cn Xn
be an arbitrary linear estimator of , based on the sample X1 ; : : : ; Xn . Because of the statistical independence of X1 ; : : : ; Xn , one may write (see Propositions I.5.8 and I.5.9): var .T / D var .c1 X1 / C C var .cn Xn /
D c12 var .X1 / C C cn2 var .Xn / D .c12 C C cn2 / 2 . /:
The estimator T is unbiased if and only if for all 2 ‚ E .T / D c1 E .X1 / C C cn E .Xn / D .c1 C C cn /. / D . /: Because it was presumed that the characteristic is not trivial, there is an element 2 ‚ such that ./ ¤ 0. In this way it follows that T is an unbiased estimator of if and only if g.c1 ; : : : ; cn / WD c1 C C cn D 1:
Under this condition one has to find (for an arbitrary fixed ) a minimum value of the expression f .c1 ; : : : ; cn / WD var .T / D .c12 C C cn2 / 2 . /:
The existence of at least one minimum can be proved as follows. The function f on Rn is bounded below .f 0/, therefore the following infimum exists: m WD inf ¹f .c1 ; : : : ; cn / W .c1 ; : : : ; cn / 2 Rn ; satisfying c1 C C cn D 1º : Obviously, one now has m f .1; 0; : : : ; 0/ D 2 . /: Let B be the closed unit ball in Rn . Then 2 ./ < f .c1 ; : : : ; cn / for all .c1 ; : : : ; cn / … B:
Section II.9 Estimation theory in a framework
117
It follows that m D inf¹f .c1 ; : : : ; cn / W .c1 ; : : : ; cn / 2 B; satisfying c1 C C cn D 1º: However, f is a continuous function and the set ¹.c1 ; : : : ; cn / 2 B W c1 C C cn D 1º is compact / . Therefore, by a fundamental result (see [14], [63]) in mathematical analysis, the infimum of f on this set is realized in some point .c1 ; : : : ; cn /. In other words, there exists at least one point .c1 ; : : : ; cn / 2 Rn such that m D f .c1 ; : : : ; cn /: By a theorem of the French mathematician Joseph L. Lagrange (1736–1813) such a point necessarily satisfies an equation of the following type (see [14], [43]): .grad f /.c1 ; : : : ; cn / D .grad g/.c1 ; : : : ; cn / . 2 R/; where “grad” denotes the gradient (also denoted by r ). By writing this out one obtains .2c1 2 ; : : : ; 2cn 2 / D .1; : : : ; 1/:
It thus appears that the constants c1 ; : : : ; cn are mutually identical. In connection to the condition c1 C C cn D 1
it now follows that c1 D D cn D n1 . With these coefficients only, the unbiased estimator T is of minimum variance. However, this implies that T D
1 1 X1 C C Xn D X ; n n
which proves the theorem. Generally, finding an unbiased estimator of minimum variance is a difficult problem. To explain the difficulties, the mathematical content of this problem is now discussed in the case of an arbitrary population with probability density f .; /, where 2 ‚. A sample X1 ; : : : ; Xn is drawn from such a population and there is a wish to find an unbiased estimator T D g.X1 ; : : : ; Xn / of minimum variance for the characteristic . In other words, there is a wish to construct a function g W Rn ! R that has the following property: For every 2 ‚, the function g realizes a minimum value of the expression V .g/ WD var .T / Z Z D : : : .g.x1 ; : : : ; xn / /
.//2 f .x1 ; / : : : f .xn ; / dx1 : : : dxn
A subset of Rn is compact if and only if it is both closed and bounded (see [14], [63]).
118
Chapter II Statistics and their probability distributions, estimation theory
under the additional condition Z Z E .T / D : : : g.x1 ; : : : ; xn /f .x1 ; / : : : f .xn ; / dx1 : : : dxn D . /:
For a fixed this problem presents a so-called isoperimetric problem (see for example [16]). In this sense the solution of the problem can be found by solving a family of isoperimetric problems simultaneously. It is interesting to deal with this problem in the context of the theory of Hilbert spaces (L2 -spaces). Unbiased estimators of minimum variance are, if they exist, unique. This is, roughly spoken, the content of the following theorem. In order to formulate this theorem properly, the following definition is given.
Definition II.9.5. Suppose X1 ; : : : ; Xn is a sample from a population with probability density f .; /, where 2 ‚. Two statistics T1 and T2 are said to be essentially equal with respect to the family f .; / 2‚ if P .T1 D T2 / D 1
for all 2 ‚:
In other words, T1 and T2 are essentially equal if there is zero probability that T1 and T2 will assume different outcomes, whatever the value of may be. Essential equality of T1 and T2 is (in other texts) frequently indicated by the phrase “T1 D T2
almost surely”.
Remark. If $T_1$ and $T_2$ are essentially equal then for all Borel sets $A \subset \mathbb{R}$
$$P_\theta(T_1 \in A) = P_\theta(T_2 \in A) \qquad (\theta \in \Theta).$$
Consequently, if $T_1$ and $T_2$ are essentially equal then they are identically distributed (see Exercise 36). A direct consequence of this is that
$$\mathbb{E}_\theta(T_1) = \mathbb{E}_\theta(T_2) \quad \text{and} \quad \operatorname{var}_\theta(T_1) = \operatorname{var}_\theta(T_2)$$
for all $\theta \in \Theta$ (provided these expressions make sense). Now let $\mathcal{C}$ be a collection of statistics, all of them with existing variance and all based on the sample $X_1,\dots,X_n$. For every characteristic $\tau$ of the population one denotes
$$\mathcal{C}_\tau = \{\, T \in \mathcal{C} : \mathbb{E}_\theta(T) = \tau(\theta) \text{ for all } \theta \in \Theta \,\}.$$
That is to say, $\mathcal{C}_\tau$ comprises those $T \in \mathcal{C}$ which present unbiased estimators of the characteristic $\tau$. One says that $\mathcal{C}$ is convex if for all $\lambda \in [0,1]$ the following implication is valid:
$$T_1, T_2 \in \mathcal{C} \implies \lambda T_1 + (1-\lambda)T_2 \in \mathcal{C}.$$
If $\mathcal{C}$ is convex, then so is $\mathcal{C}_\tau$ (verify this).
For all $\theta \in \Theta$ one defines
$$m(\theta) := \inf\{\operatorname{var}_\theta(T) : T \in \mathcal{C}_\tau\}.$$
Example 7. In Theorem II.9.3 the case where $\mathcal{C} = \{T : T = c_1X_1 + \cdots + c_nX_n\}$ was treated. This set is convex. As to the characteristic $\mu$ one has
$$\mathcal{C}_\mu = \{\, T : T = c_1X_1 + \cdots + c_nX_n \text{ where } c_1 + \cdots + c_n = 1 \,\}.$$
For all $\theta \in \Theta$ the element $\overline{X}$ is the one of minimum variance in $\mathcal{C}_\mu$. One may write
$$m(\theta) = \operatorname{var}(\overline{X}) = \frac{\sigma^2(\theta)}{n}.$$
The next theorem is stated in the terminology and notations introduced above.

Theorem II.9.4. Let $X_1,\dots,X_n$ be a sample from a population with probability density $f(\cdot\,;\theta)$, where $\theta \in \Theta$. Suppose $\mathcal{C}$ is a convex collection of statistics, based on this sample, all of them with existing variance. Moreover, suppose that $\tau$ is a characteristic of the population. If $T_1$ and $T_2$ are two elements in $\mathcal{C}_\tau$ for which
$$\operatorname{var}_\theta(T_1) = m(\theta) = \operatorname{var}_\theta(T_2) \qquad \text{for all } \theta \in \Theta,$$
then $T_1$ and $T_2$ are essentially equal with respect to the family $\{f(\cdot\,;\theta)\}_{\theta\in\Theta}$.
Proof. Suppose $T_1$ and $T_2$ satisfy the imposed conditions. Because $\mathcal{C}_\tau$ is convex, one has $\tfrac12(T_1 + T_2) \in \mathcal{C}_\tau$. For this reason one has for all $\theta \in \Theta$ that
$$\operatorname{var}_\theta\bigl(\tfrac12(T_1 + T_2)\bigr) \geq \operatorname{var}_\theta(T_1) = \operatorname{var}_\theta(T_2).$$
It follows from this that
$$\tfrac14\operatorname{var}_\theta(T_1) + \tfrac12\operatorname{cov}_\theta(T_1,T_2) + \tfrac14\operatorname{var}_\theta(T_2) \geq \operatorname{var}_\theta(T_1) = \operatorname{var}_\theta(T_2).$$
Therefore one has on the one hand
$$\operatorname{cov}_\theta(T_1,T_2) \geq \operatorname{var}_\theta(T_1) = \operatorname{var}_\theta(T_2).$$
On the other hand, by Theorem I.5.14, one may write
$$\operatorname{cov}_\theta(T_1,T_2) \leq \sqrt{\operatorname{var}_\theta(T_1)}\,\sqrt{\operatorname{var}_\theta(T_2)} = \operatorname{var}_\theta(T_1).$$
Summarizing:
$$\operatorname{cov}_\theta(T_1,T_2) = \operatorname{var}_\theta(T_1) = \operatorname{var}_\theta(T_2).$$
However, then
$$\operatorname{var}_\theta(T_1 - T_2) = \operatorname{var}_\theta(T_1) - 2\operatorname{cov}_\theta(T_1,T_2) + \operatorname{var}_\theta(T_2) = 0.$$
Taking into account the fact that
$$\mathbb{E}_\theta(T_1 - T_2) = \mathbb{E}_\theta(T_1) - \mathbb{E}_\theta(T_2) = \tau(\theta) - \tau(\theta) = 0,$$
this implies, by Proposition I.5.5, that
$$P_\theta(T_1 - T_2 = 0) = 1.$$
This is the same as saying that $P_\theta(T_1 = T_2) = 1$. It therefore appears that $T_1$ and $T_2$ are essentially equal with respect to the family $\{f(\cdot\,;\theta)\}_{\theta\in\Theta}$.

An estimator $T \in \mathcal{C}_\tau$ that satisfies
$$\operatorname{var}_\theta(T) = m(\theta) \qquad \text{for all } \theta \in \Theta$$
is said to be an "unbiased estimator of minimum variance". Such estimators will frequently be denoted by $T_{\min}$. In the literature these estimators are often indicated by the bizarre abbreviation "U.M.V.U. estimator". The letters U.M.V.U. stand for "Uniform Minimum Variance Unbiased". Here the adjective "uniform" expresses the fact that minimum variance is simultaneously attained for all $\theta \in \Theta$.

If a statistic is used as an estimator, then it is quite reasonable to require that it does not "favor" a certain measurement above the others. For example, the statistic
$$T := \frac{X_1 + 3X_2 + X_3 + \cdots + X_n}{n+2}$$
presents an unbiased estimator of $\mu$ that ascribes extra weight to measurement number 2. If one considers the measurements as being mutually of equal significance, then such an estimator is of course not acceptable. In a way that might at first glance seem a bit pedantic, these considerations are now translated into mathematical language. First of all the concept of a permutation. By a permutation of $\{1,2,\dots,n\}$ a bijective map
$$\sigma: \{1,2,\dots,n\} \to \{1,2,\dots,n\}$$
is meant. In this sense a permutation is of course equivalent to an arrangement of the numbers $1,2,\dots,n$ in some order. The set of all permutations of $\{1,2,\dots,n\}$ will be denoted by $P_n$. One associates with every $\sigma \in P_n$ an orthogonal linear map $Q(\sigma): \mathbb{R}^n \to \mathbb{R}^n$ defined by
$$Q(\sigma)(x_1,\dots,x_n) := (x_{\sigma(1)},\dots,x_{\sigma(n)}).$$
An orthogonal map $Q(\sigma)$ is called an exchange of coordinates. For an arbitrary function $g: \mathbb{R}^n \to \mathbb{R}$, denote
$$g^\sigma := g \circ Q(\sigma) \qquad (\sigma \in P_n),$$
or, which amounts to the same thing:
$$g^\sigma(x_1,\dots,x_n) := g(x_{\sigma(1)},\dots,x_{\sigma(n)}) \qquad \text{for all } (x_1,\dots,x_n) \in \mathbb{R}^n.$$

Definition II.9.6. A function $g: \mathbb{R}^n \to \mathbb{R}$ is said to be symmetric if $g^\sigma = g$ for all $\sigma \in P_n$.

Now let $X_1,\dots,X_n$ be a sample from a population with probability density $f(\cdot\,;\theta)$, where $\theta \in \Theta$. For an arbitrary statistic $T = g(X_1,\dots,X_n)$ denote
$$T^\sigma := g^\sigma(X_1,\dots,X_n) = g(X_{\sigma(1)},\dots,X_{\sigma(n)}) \qquad (\sigma \in P_n).$$
The following proposition is stated in this notation:
Proposition II.9.5. The statistics $T$ and $T^\sigma$ are always identically distributed.

Proof. The variables $X_1,\dots,X_n$ are statistically independent and identically distributed. Consequently for all Borel sets $A_1,\dots,A_n$ in $\mathbb{R}$ and all $\sigma \in P_n$ one has
$$P\bigl((X_1,\dots,X_n) \in A_1\times\cdots\times A_n\bigr) = P(X_1 \in A_1)\cdots P(X_n \in A_n) = P(X_{\sigma(1)} \in A_1)\cdots P(X_{\sigma(n)} \in A_n) = P\bigl((X_{\sigma(1)},\dots,X_{\sigma(n)}) \in A_1\times\cdots\times A_n\bigr).$$
This implies (see Appendix B) that
$$P\bigl((X_1,\dots,X_n) \in A\bigr) = P\bigl((X_{\sigma(1)},\dots,X_{\sigma(n)}) \in A\bigr)$$
for all Borel sets $A$ in $\mathbb{R}^n$. Therefore the stochastic $n$-vectors $(X_1,\dots,X_n)$ and $(X_{\sigma(1)},\dots,X_{\sigma(n)})$ are identically distributed. However, then the statistics $T = g(X_1,\dots,X_n)$ and $T^\sigma = g(X_{\sigma(1)},\dots,X_{\sigma(n)})$
are also identically distributed (see §I.11, Exercise 27).
Remark. By this proposition one always has (as far as the expressions make sense)
$$\mathbb{E}(T^\sigma) = \mathbb{E}(T) \quad \text{and} \quad \operatorname{var}(T^\sigma) = \operatorname{var}(T).$$
Definition II.9.7. A statistic $T$, based on the sample $X_1,\dots,X_n$, is said to be essentially symmetric if for all $\sigma \in P_n$ the variables $T$ and $T^\sigma$ are essentially equal with respect to the family $\{f(\cdot\,;\theta)\}_{\theta\in\Theta}$.

If the measurements $X_1,\dots,X_n$ can be considered to be of mutually equal importance, then it is quite reasonable to require essential symmetry of an estimator based on $X_1,\dots,X_n$. Is this point of view not in conflict with the aspiration to find unbiased estimators of minimum variance? As will become apparent, there is no need to lose sleep over this: it turns out that under mild conditions an unbiased estimator of minimum variance is automatically essentially symmetric. This is the content of the following theorem.

Definition II.9.8. A collection $\mathcal{C}$ of statistics is called symmetric if
$$T \in \mathcal{C} \implies T^\sigma \in \mathcal{C} \qquad \text{for all } \sigma \in P_n.$$
Theorem II.9.6. Let $X_1,\dots,X_n$ be a sample from a population with probability density $f(\cdot\,;\theta)$, where $\theta \in \Theta$. Suppose $\mathcal{C}$ is a convex symmetric collection of statistics, all of them with existing variance. Moreover, suppose that $\tau$ is a characteristic of the population. If the element $T_{\min} \in \mathcal{C}_\tau$ satisfies
$$\operatorname{var}_\theta(T_{\min}) = m(\theta) = \inf\{\operatorname{var}_\theta(T) : T \in \mathcal{C}_\tau\} \qquad \text{for all } \theta \in \Theta,$$
then $T_{\min}$ is essentially symmetric with respect to the family $\{f(\cdot\,;\theta)\}_{\theta\in\Theta}$.

Proof. By Proposition II.9.5 one has for all $\sigma \in P_n$ and all $\theta \in \Theta$
$$\mathbb{E}_\theta(T_{\min}^\sigma) = \mathbb{E}_\theta(T_{\min}) = \tau(\theta).$$
Therefore $T_{\min}^\sigma$ is an unbiased estimator of $\tau$. Furthermore, by symmetry of $\mathcal{C}$, one has $T_{\min}^\sigma \in \mathcal{C}_\tau$. Besides this, one has (again by Proposition II.9.5)
$$\operatorname{var}_\theta(T_{\min}^\sigma) = \operatorname{var}_\theta(T_{\min}) = m(\theta).$$
By Theorem II.9.4 it follows that $T_{\min}$ and $T_{\min}^\sigma$ are essentially equal with respect to the family $\{f(\cdot\,;\theta)\}_{\theta\in\Theta}$. In other words, $T_{\min}$ is essentially symmetric.
The theorem above presents a handy tool to find, provided they exist, unbiased estimators of minimum variance. This is illustrated in the following:
Let $\tau$ be a characteristic of the population. Suppose that $\mathcal{C}$ is a convex symmetric collection of so-called polynomial estimators. That is to say, it is presumed that $\mathcal{C}$ consists of elements
$$T = g(X_1,\dots,X_n),$$
where $g: \mathbb{R}^n \to \mathbb{R}$ is a polynomial in $n$ variables. Now one is looking in $\mathcal{C}_\tau$ for an element of minimum variance. If such an element $T_{\min}$ exists in $\mathcal{C}_\tau$, then it satisfies

Property I: $\mathbb{E}_\theta(T_{\min}) = \mathbb{E}_\theta\bigl(g(X_1,\dots,X_n)\bigr) = \tau(\theta)$ for all $\theta \in \Theta$.

By Theorem II.9.6 the estimator $T_{\min}$ is essentially symmetric, which means that for $\sigma \in P_n$ the variables $T_{\min} = g(X_1,\dots,X_n)$ and $T_{\min}^\sigma = g^\sigma(X_1,\dots,X_n)$ are essentially equal. Under mild conditions on the population (see [3]) this implies that the polynomials $g$ and $g^\sigma$ are identical. Therefore $T_{\min}$ also satisfies

Property II: $T_{\min} = g(X_1,\dots,X_n)$, where $g$ is a symmetric polynomial.
It turns out that in many cases the Properties I and II characterize $T_{\min}$ completely. Two examples of this phenomenon are given now (see also [3]):

Example 8. As in Theorem II.9.3, one is looking for an unbiased estimator $T_{\min}$ for $\mu$ in the following collection of statistics
$$\mathcal{C} := \{T : T = c_1X_1 + \cdots + c_nX_n\}.$$
Here, Property I is equivalent to
$$c_1 + \cdots + c_n = 1. \qquad (*)$$
Property II says that the function $g$, defined by $g(x_1,\dots,x_n) := c_1x_1 + \cdots + c_nx_n$, is symmetric. Consequently, if for $i \neq 1$ the coordinates $x_1$ and $x_i$ are exchanged, then one should have
$$c_1x_1 + \cdots + c_ix_i + \cdots + c_nx_n = c_1x_i + \cdots + c_ix_1 + \cdots + c_nx_n$$
for all $(x_1,\dots,x_n) \in \mathbb{R}^n$. This is possible only if $c_i = c_1$. It thus appears that $c_1 = c_2 = \cdots = c_n$. From this, together with $(*)$, it follows that
$$c_1 = c_2 = \cdots = c_n = \frac{1}{n}.$$
In other words: $T_{\min} = \overline{X}$.
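The conclusion of Example 8 can also be checked numerically. The following is a minimal Monte Carlo sketch in Python (not part of the book); the population parameters $\mu = 10$, $\sigma = 2$, the sample size $n = 5$ and the second weight vector are hypothetical choices made only for illustration.

    import numpy as np

    # Among unbiased linear estimators c1*X1 + ... + cn*Xn with sum(c) = 1,
    # equal weights 1/n (the sample mean) give the smallest variance.
    rng = np.random.default_rng(0)
    n, reps = 5, 200_000
    mu, sigma = 10.0, 2.0                          # hypothetical parameter values
    X = rng.normal(mu, sigma, size=(reps, n))      # reps samples of size n

    w_mean = np.full(n, 1.0 / n)                   # equal weights: the sample mean
    w_skew = np.array([0.4, 0.3, 0.1, 0.1, 0.1])   # another weight vector with sum 1

    for name, w in [("sample mean", w_mean), ("unequal weights", w_skew)]:
        T = X @ w
        print(f"{name:15s}  mean(T) = {T.mean():.3f}   var(T) = {T.var():.3f}")
    # Both estimators are (approximately) unbiased, but the sample mean has
    # variance close to sigma^2 / n = 0.8, the minimum value m(theta).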
In connection with Theorem II.9.3, the example above is all ancient history. This is quite different in the next example.

Example 9. Let $X_1, X_2, X_3$ be a sample of size 3 from an $N(\mu,\sigma^2)$-distributed population with unknown parameters $\mu$ and $\sigma$. Basing oneself on this sample there is a wish to find in the convex symmetric collection
$$\mathcal{C} := \{\, T = g(X_1,X_2,X_3) : g \text{ is a polynomial in three variables of degree} \leq 2 \,\}$$
an unbiased estimator $T_{\min}$ of minimum variance concerning the characteristic $\sigma^2$. It can be proved that such an estimator exists (see [3]). By Property II it follows that $T_{\min} = g(X_1,X_2,X_3)$, where $g$ is a symmetric polynomial in three variables, of degree $\leq 2$. Such a polynomial is necessarily of the following form:
$$g(x_1,x_2,x_3) = a(x_1^2+x_2^2+x_3^2) + b(x_1x_2+x_1x_3+x_2x_3) + c(x_1+x_2+x_3) + d.$$
Consequently
$$T_{\min} = a(X_1^2+X_2^2+X_3^2) + b(X_1X_2+X_1X_3+X_2X_3) + c(X_1+X_2+X_3) + d.$$
By Property I one has for all $\mu$ and $\sigma$
$$\mathbb{E}(T_{\min}) = \sigma^2.$$
Taking into account the fact that
$$\mathbb{E}(X_i^2) = \mu^2 + \sigma^2 \quad \text{and} \quad \mathbb{E}(X_iX_j) = \mathbb{E}(X_i)\mathbb{E}(X_j) = \mu^2 \quad \text{if } i \neq j,$$
it follows that for all $\mu$ and $\sigma$
$$3a(\mu^2+\sigma^2) + 3b\mu^2 + 3c\mu + d = \sigma^2.$$
This is possible only if
$$3a = 1, \qquad 3a + 3b = 0, \qquad 3c = 0, \qquad d = 0.$$
Therefore
$$T_{\min} = \tfrac{1}{3}(X_1^2+X_2^2+X_3^2) - \tfrac{1}{3}(X_1X_2+X_1X_3+X_2X_3).$$
However, this is exactly the same as
$$S^2 = \tfrac{1}{2}\bigl[(X_1-\overline{X})^2 + (X_2-\overline{X})^2 + (X_3-\overline{X})^2\bigr].$$
It thus appears that $S^2$ is, within $\mathcal{C}$, an unbiased estimator of minimum variance (concerning the characteristic $\sigma^2$). The previous arguments also apply when samples of an arbitrary size $n$ are concerned (see [3]).
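A small simulation sketch (illustrative only, not from the book) can make Example 9 concrete: it compares $S^2$ with another unbiased quadratic estimator of $\sigma^2$, here $(X_1-X_2)^2/2$, which is not symmetric. The parameter values $\mu = 5$, $\sigma = 3$ are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 5.0, 3.0                  # hypothetical parameter values
    X = rng.normal(mu, sigma, size=(500_000, 3))

    S2 = X.var(axis=1, ddof=1)            # S^2 = (1/2) * sum (Xi - Xbar)^2
    T  = (X[:, 0] - X[:, 1])**2 / 2       # also unbiased for sigma^2, not symmetric

    print("true sigma^2       :", sigma**2)
    print("mean, var of S^2   :", S2.mean(), S2.var())
    print("mean, var of T     :", T.mean(), T.var())
    # Both means are close to 9, but S^2 shows the clearly smaller variance,
    # in line with S^2 being of minimum variance within C.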
Is it generally possible to determine a useful lower bound for the variance of an unbiased estimator of some characteristic? The next theorem states, surprisingly, that the answer to this question is affirmative.

Theorem II.9.7 (Information inequality). Let $X_1,\dots,X_n$ be a sample from a population with probability density $f(\cdot\,;\theta)$, where $\theta \in \Theta$. Suppose $\Theta$ is an open interval in $\mathbb{R}$ and suppose that the characteristic $\tau: \Theta \to \mathbb{R}$ presents a differentiable function on $\Theta$. Then under mild regularity conditions (emerging in the proof of this theorem) every unbiased estimator $T$ of $\tau$ satisfies
$$\bigl(\tau'(\theta)\bigr)^2 \;\leq\; n\,\operatorname{var}_\theta(T)\;\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right] \qquad (\theta \in \Theta),$$
where $X$ is one of the variables $X_1,\dots,X_n$.

Proof. The theorem is proved in the case where the population has an absolutely continuous distribution. The proof can easily be adapted to the case where discrete probability densities are involved. To start the proof, define for all $\theta \in \Theta$ the function $L_\theta: \mathbb{R}^n \to [0,+\infty)$ by $L_\theta := f_{X_1,\dots,X_n}$. More specifically,
$$L_\theta(x_1,\dots,x_n) := f(x_1;\theta)\cdots f(x_n;\theta).$$
The remaining part of the proof is split up into five steps.

Step 1. If $X$ is one of the variables $X_1,\dots,X_n$, then
$$\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = 0.$$
To verify this, observe that
$$\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = \int \left(\frac{\partial}{\partial\theta}\log f(x;\theta)\right) f(x;\theta)\,dx = \int \frac{1}{f(x;\theta)}\,\frac{\partial}{\partial\theta}f(x;\theta)\; f(x;\theta)\,dx = \int \frac{\partial}{\partial\theta}f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = \frac{\partial}{\partial\theta}(1) = 0.$$

Step 2. For all $\theta \in \Theta$ one has $\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] = 0$. This is a consequence of Step 1:
$$\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] = \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i;\theta)\right] = \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\sum_{i=1}^n \log f(X_i;\theta)\right] = \sum_{i=1}^n \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X_i;\theta)\right] = 0.$$
Step 3. If $X$ is one of the variables $X_1,\dots,X_n$, then
$$\operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$
To verify this, use (again) Step 1:
$$\operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right] - \left(\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]\right)^2 = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$

Step 4. One has
$$\operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] = n\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$
To prove this, write
$$\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i;\theta).$$
The variables $X_1,\dots,X_n$ are statistically independent. By Proposition I.4.2 (generalized version) the variables
$$\frac{\partial}{\partial\theta}\log f(X_1;\theta),\;\dots,\;\frac{\partial}{\partial\theta}\log f(X_n;\theta)$$
are therefore also independent. Now by Proposition I.5.8, together with Step 3, one may write
$$\operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] = \sum_{i=1}^n \operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log f(X_i;\theta)\right] = n\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$
This proves Step 4.

Step 5. The finale. Suppose that $T = g(X_1,\dots,X_n)$ is an unbiased estimator of the characteristic $\tau$. Then for all $\theta \in \Theta$ one has
$$\tau(\theta) = \int \cdots \int g(x_1,\dots,x_n)\,L_\theta(x_1,\dots,x_n)\,dx_1\cdots dx_n.$$
The left and the right hand side of this equality are now differentiated with respect to $\theta$. As a result one obtains
$$\tau'(\theta) = \frac{\partial}{\partial\theta}\int \cdots \int g(x_1,\dots,x_n)\,L_\theta(x_1,\dots,x_n)\,dx_1\cdots dx_n = \int \cdots \int g(x_1,\dots,x_n)\,\frac{\partial}{\partial\theta}L_\theta(x_1,\dots,x_n)\,dx_1\cdots dx_n.$$
However, one has
$$\frac{\partial}{\partial\theta}L_\theta(x_1,\dots,x_n) = \left(\frac{\partial}{\partial\theta}\log L_\theta(x_1,\dots,x_n)\right) L_\theta(x_1,\dots,x_n).$$
Therefore one can write
$$\tau'(\theta) = \int \cdots \int g(x_1,\dots,x_n)\left(\frac{\partial}{\partial\theta}\log L_\theta(x_1,\dots,x_n)\right) L_\theta(x_1,\dots,x_n)\,dx_1\cdots dx_n = \mathbb{E}_\theta\!\left[T\,\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] \overset{\text{(Step 2)}}{=} \operatorname{cov}_\theta\!\left(T,\;\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right).$$
Now by Proposition I.5.14, together with Step 4, it follows that
$$\bigl(\tau'(\theta)\bigr)^2 \;\leq\; \operatorname{var}_\theta(T)\,\operatorname{var}_\theta\!\left[\frac{\partial}{\partial\theta}\log L_\theta(X_1,\dots,X_n)\right] = \operatorname{var}_\theta(T)\;n\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right].$$
This proves the theorem.

Remark 1. The expression
$$\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right]$$
is called the Fisher information about $\theta$, supplied by a single measurement.

Remark 2. If the Fisher information about $\theta$ is positive then the following inequality emerges:
$$\operatorname{var}_\theta(T) \;\geq\; \frac{\bigl(\tau'(\theta)\bigr)^2}{n\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right]}.$$
In this way a lower bound for $\operatorname{var}_\theta(T)$ is realized.
Remark 3. In the literature the information inequality is usually called the "Cramér–Rao inequality" (named after the famous statisticians H. Cramér and C. R. Rao). Nowadays, however, it is believed that the French mathematician Maurice R. Fréchet (1878–1973) was the one who first discovered and proved this inequality.

Example 10. Suppose one is dealing with an $N(\mu,\sigma^2)$-distributed population. It is presumed that the numerical value of $\sigma$ is known, whereas the value of $\mu$ is unknown. Concerning this population there is a wish to determine the Fisher information about $\mu$, supplied by a single measurement. Here one is dealing with a family of probability densities $(f_\mu)_{\mu\in\mathbb{R}}$, the members of which are defined by
$$f_\mu(x) := \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\mu)^2/\sigma^2} \qquad (x \in \mathbb{R}).$$
Therefore
$$\log f_\mu(x) = \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x-\mu)^2}{2\sigma^2}.$$
Differentiation with respect to $\mu$ gives
$$\frac{\partial}{\partial\mu}\log f_\mu(x) = \frac{x-\mu}{\sigma^2}.$$
Consequently
$$\mathbb{E}_\mu\!\left[\left(\frac{\partial}{\partial\mu}\log f_\mu(X)\right)^2\right] = \mathbb{E}_\mu\!\left[\left(\frac{X-\mu}{\sigma^2}\right)^2\right] = \frac{1}{\sigma^4}\,\mathbb{E}_\mu\bigl[(X-\mu)^2\bigr] = \frac{1}{\sigma^4}\operatorname{var}_\mu(X) = \frac{1}{\sigma^2}.$$
It thus appears that an arbitrary unbiased estimator $T$ of the characteristic $\tau(\mu) = \mu$, based on a sample $X_1,\dots,X_n$ of size $n$, satisfies
$$\operatorname{var}_\mu(T) \;\geq\; \frac{1}{n\cdot\frac{1}{\sigma^2}} = \frac{\sigma^2}{n}.$$
When taking for $T$ the estimator $\overline{X}$ one obtains equality in this inequality. When other populations are concerned, $\overline{X}$ does not necessarily realize equality in the information inequality. There are examples of populations where, among the unbiased estimators of $\mu$, there are statistics (non-linear ones) doing their work better than $\overline{X}$ (see Exercise 32).
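The Cramér–Rao bound of Example 10 can be checked numerically. Below is a short illustrative Monte Carlo sketch in Python (not part of the book); the values $\mu = 0$, $\sigma = 2$, $n = 25$ are hypothetical, and the sample median is included only as a contrasting unbiased estimator.

    import numpy as np

    # For a normal population with known sigma, the Cramér–Rao lower bound for
    # unbiased estimators of mu is sigma^2 / n; the sample mean attains it.
    rng = np.random.default_rng(2)
    mu, sigma, n, reps = 0.0, 2.0, 25, 100_000
    X = rng.normal(mu, sigma, size=(reps, n))

    print("Cramér–Rao bound     :", sigma**2 / n)
    print("var of sample mean   :", X.mean(axis=1).var())          # close to the bound
    print("var of sample median :", np.median(X, axis=1).var())    # noticeably larger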
II.10 Maximum likelihood estimation, sufficiency
Aside from the construction of unbiased estimators of minimum variance, there are in statistics other constructions of importance. One of them is based on maximizing the so-called likelihood function:

Definition II.10.1. Let $X_1,\dots,X_n$ be a sample from a population with probability density $f$. The likelihood function associated with this sample is understood to be the probability density of the $n$-vector $(X_1,\dots,X_n)$. In other words, the likelihood function is the function $L: \mathbb{R}^n \to [0,+\infty)$ defined by
$$L(x_1,\dots,x_n) := f(x_1)\cdots f(x_n) \qquad \bigl((x_1,\dots,x_n) \in \mathbb{R}^n\bigr).$$
If on a certain region $A \subset \mathbb{R}^n$ the likelihood function assumes small values only, then it is not likely that the outcome of $(X_1,\dots,X_n)$ will be an element of $A$. This can be explained by simply writing
$$P\bigl((X_1,\dots,X_n) \in A\bigr) = \int \cdots \int_A L(x_1,\dots,x_n)\,dx_1\cdots dx_n.$$
Outcomes $(x_1,\dots,x_n)$ for which $L$ assumes a small value are therefore said to be unlikely. In a direct line with this terminology an outcome $(x_1,\dots,x_n)$ is called likely if $L(x_1,\dots,x_n)$ assumes a large value.
Now suppose one draws a sample $X_1,\dots,X_n$ from a population with probability density $f(\cdot\,;\theta)$, where $\theta \in \Theta$. The likelihood function now depends on $\theta \in \Theta$; it is therefore denoted by $L_\theta$ instead of $L$. The experiment results in an outcome $(x_1,\dots,x_n) \in \mathbb{R}^n$ of $(X_1,\dots,X_n)$. An element $\hat\theta$ in $\Theta$ is now chosen in such a way that it maximizes the function
$$\theta \mapsto L_\theta(x_1,\dots,x_n).$$
In other words, the value of $\theta$ is adjusted in such a way that the given outcome $(x_1,\dots,x_n)$ emerges as one of maximum likelihood. In the following it will be systematically assumed that, given the outcome $(x_1,\dots,x_n)$, there is exactly one such $\hat\theta$ in $\Theta$. Under this assumption $\hat\theta$ is called the maximum likelihood estimate of $\theta$. In general $\hat\theta$ depends on the outcome $(x_1,\dots,x_n)$. One should therefore write
$$\hat\theta = \hat\theta(x_1,\dots,x_n).$$
Now for an arbitrary characteristic $\tau: \Theta \to \mathbb{R}$ and outcome $(x_1,\dots,x_n)$ the number $\tau\bigl(\hat\theta(x_1,\dots,x_n)\bigr)$ will be interpreted as an estimate of $\tau(\theta)$. In this way estimators of the following type are obtained:

Definition II.10.2. If $\tau \circ \hat\theta: \mathbb{R}^n \to \mathbb{R}$ is a Borel function (which is in practice always the case) then the statistic $T := \tau\bigl(\hat\theta(X_1,\dots,X_n)\bigr)$ is called the maximum likelihood estimator of the characteristic $\tau$.
Example 1. Given is an exponentially distributed population with parameter $\theta$. A sample $X_1,\dots,X_n$ from this population results in the outcome $(x_1,\dots,x_n)$ of the stochastic $n$-vector $(X_1,\dots,X_n)$. Starting from this there is a wish to construct the maximum likelihood estimate of $\theta$. Here the probability density of the population is given by
$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\,e^{-x/\theta} & \text{if } x \geq 0, \\ 0 & \text{elsewhere.} \end{cases}$$
Consequently, the value of the likelihood function in the point $(x_1,\dots,x_n)$ is given by
$$L_\theta(x_1,\dots,x_n) = \frac{1}{\theta^n}\,e^{-(x_1+\cdots+x_n)/\theta} \qquad (\theta > 0).$$
By differentiation with respect to $\theta$ this expression can be maximized:
$$\frac{\partial}{\partial\theta}L_\theta(x_1,\dots,x_n) = \left(\frac{x_1+\cdots+x_n}{\theta^{n+2}} - \frac{n}{\theta^{n+1}}\right)e^{-(x_1+\cdots+x_n)/\theta}.$$
This derivative shows a change of sign in the point
$$\theta = \frac{x_1+\cdots+x_n}{n}.$$
In this way one obtains an explicit expression of $\hat\theta$ in terms of $(x_1,\dots,x_n)$:
$$\hat\theta = \hat\theta(x_1,\dots,x_n) = \frac{x_1+\cdots+x_n}{n}.$$
Thus the maximum likelihood estimator of $\theta$ emerges:
$$\hat\theta(X_1,\dots,X_n) = \frac{X_1+\cdots+X_n}{n} = \overline{X}.$$
The maximum likelihood estimator of an arbitrary characteristic $\theta \mapsto \tau(\theta)$ is easily obtained from this: it is presented by the statistic $\tau\bigl(\hat\theta(X_1,\dots,X_n)\bigr) = \tau(\overline{X})$. As an illustration, the maximum likelihood estimator of the characteristic $\theta \mapsto \cos\theta$ is presented by the statistic $\cos(\overline{X})$.

The reader may wonder whether a maximum likelihood estimator is automatically unbiased and/or of minimum variance. As to this, the next example gives an answer in the negative.

Example 2. One draws a sample $X_1,\dots,X_n$ from an $N(\mu,\sigma^2)$-distributed population, where $\mu$ and $\sigma$ are unknown. Given an outcome $(x_1,\dots,x_n)$ of the stochastic $n$-vector $(X_1,\dots,X_n)$, there is a wish to make a maximum likelihood estimate of the 2-vector $(\mu,\sigma^2)$. Here the probability density of the population is
$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}.$$
Consequently the likelihood function assumes in the point $(x_1,\dots,x_n)$ the value
$$L_{\mu,\sigma}(x_1,\dots,x_n) = \frac{1}{\sigma^n(2\pi)^{n/2}}\,e^{-[(x_1-\mu)^2+\cdots+(x_n-\mu)^2]/2\sigma^2}.$$
Now maximization of $L_{\mu,\sigma}(x_1,\dots,x_n)$ is equivalent to maximization of the expression $\log L_{\mu,\sigma}(x_1,\dots,x_n)$. Maximization of the latter expression is easier, because its partial derivatives assume a less complicated form than the ones of $L_{\mu,\sigma}(x_1,\dots,x_n)$. One has
$$\log L_{\mu,\sigma}(x_1,\dots,x_n) = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{(x_1-\mu)^2+\cdots+(x_n-\mu)^2}{2\sigma^2}.$$
Differentiation with respect to $\mu$ and $\sigma$ gives
$$\frac{\partial}{\partial\mu}\log L_{\mu,\sigma}(x_1,\dots,x_n) = \frac{(x_1+\cdots+x_n) - n\mu}{\sigma^2},$$
$$\frac{\partial}{\partial\sigma}\log L_{\mu,\sigma}(x_1,\dots,x_n) = -\frac{n}{\sigma} + \frac{(x_1-\mu)^2+\cdots+(x_n-\mu)^2}{\sigma^3}.$$
These derivatives vanish if
$$\mu = \frac{x_1+\cdots+x_n}{n} \quad \text{and} \quad \sigma^2 = \frac{1}{n}\bigl[(x_1-\mu)^2+\cdots+(x_n-\mu)^2\bigr].$$
It can be proved that the likelihood function indeed attains a global maximum for these values of $\mu$ and $\sigma^2$. Denoting $\overline{x} = (x_1+\cdots+x_n)/n$, one can express the maximum likelihood estimate of $\theta = (\mu,\sigma^2)$ as follows:
$$\hat\theta = \hat\theta(x_1,\dots,x_n) = \left(\overline{x},\;\frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})^2\right).$$
It thus appears that the maximum likelihood estimator of the characteristic $\mu$ (formally one should speak about the characteristic $(\mu,\sigma^2) \mapsto \mu$) is presented by the statistic $\overline{X}$. Moreover, it turns out that the maximum likelihood estimator of the characteristic $\sigma^2$ is given by
$$\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 = \frac{n-1}{n}\,S^2,$$
where $S^2$ is the sample variance of $X_1,\dots,X_n$. With the help of Proposition II.1.3 one now derives that
$$\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2\right] = \frac{n-1}{n}\,\mathbb{E}(S^2) = \frac{n-1}{n}\,\sigma^2 = \left(1-\frac{1}{n}\right)\sigma^2.$$
It follows that the maximum likelihood estimator of $\sigma^2$ is not unbiased! Estimates of $\sigma^2$ emanating from this estimator may be expected to be too low by a fraction $\frac{1}{n}$. Once the property of being unbiased is lost, it is useless to speak about minimum variance anymore.

Maximum likelihood is preserved when transformations are applied; this is the content of the next proposition.

Proposition II.10.1. Let $T$ be the maximum likelihood estimator of the characteristic $\tau$. Then for every Borel function $\varphi: \mathbb{R} \to \mathbb{R}$ the statistic $\varphi(T)$ presents the maximum likelihood estimator of the characteristic $\varphi \circ \tau$.

Proof. A direct consequence of Definition II.10.2.
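Returning to Example 2, the bias of the maximum likelihood estimator of $\sigma^2$ can be made visible by simulation. The following is an illustrative Python sketch (not from the book); the parameter values and the sample size $n = 5$ are hypothetical.

    import numpy as np

    # The MLE of sigma^2 (divisor n) underestimates sigma^2 by a factor (n-1)/n,
    # while the sample variance S^2 (divisor n-1) is unbiased.
    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 0.0, 1.0, 5, 200_000
    X = rng.normal(mu, sigma, size=(reps, n))

    sigma2_mle = X.var(axis=1, ddof=0)    # (1/n) * sum (Xi - Xbar)^2
    S2         = X.var(axis=1, ddof=1)    # (1/(n-1)) * sum (Xi - Xbar)^2

    print("true sigma^2          :", sigma**2)
    print("mean of MLE estimator :", sigma2_mle.mean())   # close to (n-1)/n = 0.8
    print("mean of S^2           :", S2.mean())           # close to 1.0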
As to estimators, the property of being unbiased is frequently lost when transformations are applied. More specifically: if $T$ is an unbiased estimator of $\tau$ then there is no guarantee at all that $\varphi(T)$ presents an unbiased estimator of $\varphi \circ \tau$. As an illustrating example:

Example 3. Suppose $X_1,\dots,X_n$ is a sample from a population with mean $\mu$ and variance $\sigma^2$. It is known that $\overline{X}$ is an unbiased estimator of $\mu$. However, $(\overline{X})^2$ is in general not an unbiased estimator of $\mu^2$. To see this, write
$$\mathbb{E}\bigl[(\overline{X})^2\bigr] = \frac{1}{n^2}\,\mathbb{E}\!\left(\sum_{i,j=1}^n X_iX_j\right) = \frac{1}{n^2}\,\mathbb{E}\!\left(\sum_{i=1}^n X_i^2 + \sum_{i\neq j} X_iX_j\right) = \frac{1}{n^2}\bigl(n(\mu^2+\sigma^2) + n(n-1)\mu^2\bigr) = \mu^2 + \frac{\sigma^2}{n}.$$
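A quick numerical check of Example 3 (illustrative Python sketch, not from the book; the values $\mu = 3$, $\sigma = 2$, $n = 10$ are hypothetical):

    import numpy as np

    # E[(Xbar)^2] = mu^2 + sigma^2/n, so (Xbar)^2 is a biased estimator of mu^2.
    rng = np.random.default_rng(4)
    mu, sigma, n, reps = 3.0, 2.0, 10, 200_000
    Xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

    print("mu^2             :", mu**2)                  # 9.0
    print("mu^2 + sigma^2/n :", mu**2 + sigma**2 / n)   # 9.4
    print("mean of (Xbar)^2 :", (Xbar**2).mean())       # close to 9.4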
Maximum likelihood estimators share an important property with unbiased estimators of minimum variance:

Theorem II.10.2. A maximum likelihood estimator is always (essentially) symmetric.

Proof. See Exercise 39.

Next, by means of an example, the reader is introduced to the concept of sufficiency in statistics. Suppose one draws a sample $X_1,\dots,X_n$ from a Poisson distributed population with parameter $\lambda$. After one has obtained the outcomes of the $X_1,\dots,X_n$, one computes the outcome of the sum $S = X_1 + \cdots + X_n$. The day after, it appears that the data of the experiment, that is to say the outcomes of the $X_1,\dots,X_n$, have got lost. However, one still has the outcome of $S = X_1 + \cdots + X_n$. How much information (in the estimation process concerning the parameter $\lambda$) is lost in this situation? Or, as one still knows the outcome of $S$, is there perhaps no loss of any relevant information? It will become apparent that the latter is true. To see this it will now be studied how much "probabilistic room" there is left to live in for the stochastic $n$-vector $\mathbf{X} = (X_1,\dots,X_n)$, once it is known that $S$ assumes some fixed value $s$. More precisely, the conditional probability distribution of $\mathbf{X}$, given $S = s$, will be studied. By generalizing Definition II.8.1 a bit one easily arrives at
$$f_{\mathbf{X}}(\mathbf{k} \mid S = s) = \begin{cases} \dfrac{f_{\mathbf{X},S}(\mathbf{k},s)}{f_S(s)} & \text{if } f_S(s) > 0, \\ 0 & \text{elsewhere.} \end{cases}$$
To make this more explicit, note that on the one hand one has
$$f_{\mathbf{X},S}(\mathbf{k},s) = P(\mathbf{X} = \mathbf{k} \text{ and } S = s).$$
Clearly this probability equals zero if $k_1+\cdots+k_n \neq s$. If $k_1+\cdots+k_n = s$, then
$$f_{\mathbf{X},S}(\mathbf{k},s) = P(\mathbf{X}=\mathbf{k}) = P(X_1=k_1)\cdots P(X_n=k_n) = e^{-\lambda}\frac{\lambda^{k_1}}{k_1!}\cdots e^{-\lambda}\frac{\lambda^{k_n}}{k_n!} = e^{-n\lambda}\,\frac{\lambda^{s}}{k_1!\cdots k_n!}.$$
On the other hand it can be proved (see §I.11, Exercise 47) that the variable $S$ enjoys (a priori) a Poisson distribution with parameter $n\lambda$. Therefore one has
$$f_S(s) = P(S=s) = e^{-n\lambda}\,\frac{(n\lambda)^s}{s!}.$$
Summarizing, one arrives at the conditional probability density of $\mathbf{X}$, given $S = s$:
1. Prove that for all $x > 0$ one has $\Gamma(x+1) = x\,\Gamma(x)$.

2. Prove that $\Gamma(\tfrac12) = \sqrt{\pi}$.

3. Prove Proposition II.2.5.

4. Suppose $X_1,\dots,X_n$ is a sample from an $N(0,1)$-distributed population. Define $S_n := X_1^2 + \cdots + X_n^2$.

(i) Prove that for all $a \in \mathbb{R}$
$$\lim_{n\to\infty} P\!\left(\frac{S_n - n}{\sqrt{2n}} \leq a\right) = \Phi(a),$$
where ˆ is the distribution function of the N.0; 1/-distribution. (ii) Prove Proposition II.2.13. 5. If can be read off in Table IV that for a 2 -distributed variable X with 100 degrees of freedom one has P .X 118:50/ D 0:90: Using Proposition II.2.13, approximate the value of P .X 118:50/: 6. Suppose X1 ; : : : ; Xn is a sample from an N.; 2 /-distributed population. Prove that 2 4 var.S 2 / D : n 1 7. Let X1 ; X2 be a sample from a population which is uniformly distributed in the interval Œ0; 1. (i) Show that D
1 2
(ii) Verify that S 2 D .X1
(iii) Prove that
E.S 4 /
1 12 . /2 =2.
and 2 D
D
X2
1 60 .
(iv) Is it true, as in Exercise 6, that var.S 2 / D 2 4 =.n
1/?
(v) Determine the numerical value of P .X 14 / and P .S 2 18 /.
(vi) Determine the numerical value of P .X
1 4
and S 2 81 /.
(vii) Prove that X and S 2 are not statistically independent. Is not this contradictory to Theorem II.2.11? 8. Prove that if X is t -distributed with n degrees of freedom, then: (i) E.X / D 0
.n 2/,
(ii) var.X/ D n=.n
2/ .n 3/.
9. Recall that a continuous function ' W .0; C1/ ! R is convex if for all x1 ; x2 2 .0; C1/ and all 2 Œ0; 1 it satisfies ' x1 C .1 /x2 '.x1 / C .1 /'.x2 /:
In this exercise the reader is asked to prove that for large n the t -distribution is approximately the same as the N.0; 1/-distribution. The fact that the gamma function is “logarithmically convex” will be used here, that is to say, the exercise leans on the fact that the function x 7! log .x/ is convex (see [5]). (i) Prove that the logarithmical convexity of the gamma function implies that for all x > 0 .x C 12 / 1: p x .x/ (ii) Derive from (i) that for all x > s (iii) Prove that
1 2
x x
1 2
one has
.x C 12 / p 1: x .x/
nC1 1 2 Dp : lim p n!1 n . n2 / 2
(iv) Let fn be the probability density of the t -distribution with n degrees of freedom. Prove that for all x 2 R 1 lim fn .x/ D p e 2
x 2 =2
n!1
:
(v) Denote the distribution function of the t -distribution with n degrees of freedom by Fn . As before, the distribution function of the standard normal distribution is denoted by ˆ. Prove that for all x 2 R one has lim Fn .x/ D ˆ.x/:
n!1
10. Let Fn and ˆ be as in Exercise 9 (v). Using appropriate tables, determine the numerical value of jFn .x/ ˆ.x/j for n D 20 and x D 1:325. 11. Let X be an F -distributed variable with m and n degrees of freedom in the numerator and denominator respectively. Assume that n > 4. Express E.X / and var.X/ in terms of m and n. 12. Suppose X is F -distributed with 4 and 8 degrees of freedom in the numerator and denominator respectively. (i) Show that the probability density of X can be expressed as ´ 5x.1 C 12 x/ 6 if x 0; fX .x/ D 0 elsewhere: (ii) Using the fundamental theorem of (integral) calculus, compute Z
C1
f .x/ dx:
2:81
(iii) Using tables for the F -distribution, check your outcome in (ii). 13. Let X be a variable that is t -distributed with n degrees of freedom. Prove that Y D X 2 is F -distributed with 1 degree of freedom in the numerator and n degrees of freedom in the denominator. 14. Suppose X is beta distributed with parameters p D variable Y by nX Y WD : m.1 X /
m 2
and q D
n 2.
Define the
Prove that Y is F -distributed with m and n degrees of freedom in the numerator and denominator respectively. 15. Let X be beta distributed with parameters p D 4 and q D 7. (i) Determine the numerical value of P .X 0:2/.
(ii) Draw a graph of the distribution function FX of X . Generally, the median of a stochastic variable is defined as the number m for which P .X m/ D 0:5: ./ (iii) Determine the median of X .
(iv) Are D E.X/ and m equal?
(v) Can you give an example of a probability distribution where ./ holds for more than one value of m?
16. The lifetime of a certain type of battery is tested. A sample of size 9 is drawn. As a result the following measurements of the life time (in hours) are obtained: 220; 205; 192; 198; 201; 207; 195; 201; 204: Assume that the life time of the batteries is N.1 ; 2 /-distributed, where 1 and 1 are unknown. (i) Construct a 95% confidence interval for 1 . (ii) Construct a 95% confidence interval for 1 . Next, 7 batteries of an other type are tested (concerning life time). The results (again in hours): 213; 215; 198; 209; 211; 205; 203: Assume that the life time of this type of batteries is N.2 ; 22 /-distributed with unknown parameters 2 and 22 . (iii) Construct a 90% confidence interval for 1 =2 . Next the additional assumption is made that 1 D 2 . (iv) Construct a 95% confidence interval for 2
1 .
17. Suppose that X is beta distributed with parameters p and q. Prove that var.X/ D
pq : .p C q/2 .p C q C 1/
18. Among the Dutch people a poll is conducted to estimate the percentage of voters on the socialist party. In a representative selection of 500 people one counts 108 voters on the socialist party. Construct a 90% confidence interval for the percentage mentioned above. 19. In this exercise it is assumed that the population is Cauchy distributed with parameters ˛ D 0 and ˇ D 1. This means that the corresponding probability density is given by f .x/ D
1 1 x2 C 1
for all x 2 R:
Peculiar properties of such populations will be revealed in this exercise. The reader may take for granted the following result in mathematical analysis (see Appendix D): Z
C1 1
e i tx dx D e x2 C 1
jtj
for all t 2 R:
(i) Let X1 ; : : : ; Xn be a sample of size n from the above-mentioned population. Prove that the characteristic function (see §I.8) of X1 and the one of X are identical. (ii) Now what can you say about the probability distribution of X ? In the following X1 ; X2 is a sample of size 2 from the population mentioned above. (iii) Show that the characteristic function of both X1 C X2 and X1 given by .t / D e 2jtj :
X2 is
(iv) Prove that X1 C X2 and X1 X2 are not statistically independent. Could this occur if X1 ; X2 were a sample from a normally distributed population? 20. Derive Maxwell’s probability density for molecular velocities in an ideal gas (see §II.2, Example 1). 21. By genetic manipulations the HVA (high velocity ant) emerged. This type of ant is able to move on with incredible speeds. One is examining the velocity distribution of these ants. To this end, a number of ants is set out on a homogeneous disc, equipped with Cartesian coordinate axes (with the origin in the center of the disc). At a randomly chosen instant, the velocity V D .Vx ; Vy / of a randomly chosen ant is measured. It is assumed that Vx and Vy are statistically independent stochastic variables. (i) What kind of probability distribution has kVk2 ?
(ii) Give a formula for the probability density of kVk.
(iii) Comment upon the assumption that Vx and Vy be statistically independent. 22. The probability distribution f of a certain population has the following properties: (a) f .x/ D 0 if x 0, (b) f .x/ > 0 if x > 0,
(c) f is continuous on .0; C1/. (i) Prove that S 2 D .X1
X2 /2 =2.
(ii) For all a > 0 one has P .S 2 2a2 / ¤ 0 and P .X a/ ¤ 0. Prove this statement. (iii) Determine the numerical value of P .S 2 2a2 and X a/. (iv) Are X and S 2 statistically independent?
23. Given is a population with mean and variance 2 . Moreover it is given that the population has a probability density f which is symmetric with respect to the origin. That is to say: f .x/ D f . x/ for all x 2 R: (i) Verify that the characteristic function of this probability distribution can be expressed as .t / D
Z
C1 1
f .x/e i tx dx D
Z
C1
f .x/ cos.xt / dx:
1
(ii) Show (by differentiation across the integral sign) that .0/ D 1; 0 .0/ D 0 and 00 .0/ D
2:
Now let X1 ; X2 be a sample of size 2 from the above-mentioned population. e WD .X1 X2 /=2. Define: X WD .X1 C X2 /=2 and X e satisfy (iii) Prove that the characteristic functions of X and X 2 X .t / D e X .t / D ¹.t =2/º :
e / D 0. (iv) Verify that cov.X; X
e are In the case of a normally distributed population (iv) implies that X and X statistically independent. (From this, in turn, it is easily deduced that X and S 2 are independent). e are statistically independent. The goal is Now conversely assume that X and X to deduce from this that the population is necessarily normally distributed. (v) Prove that .t / D ¹. 2t /º4 for all t 2 R.
(vi) Set '.t / WD log .t /. Show that '.0/ D 0, ' 0 .0/ D 0, ' 00 .0/ D
(vii) Setting
.t / WD '.t /=t 2 , prove that lim t!0
(viii) Prove that .t / D 1 2 2 for all t ¤ 0.
. 2t /
.t / D
for all t ¤ 0. Deduce from this that
(ix) Prove that the population is N.0; 2 /-distributed.
2.
1 2 2 .
.t / D
24. This exercise deals with a converse to Theorem I.6.7 and Theorem II.2.11. Given is a population with mean and variance 2 . The probability density f of the population is symmetric with respect to the point , that is to say: f . C x/ D f .
x/ for all x 2 R:
Moreover, the population has the following property: For all samples X1 ; X2 of e D .X1 X2 /=2 are statistically independent. Prove size 2 the statistics X and X that the population is N.; 2 /-distributed. (See also [69], where the above is generalized to an astonishing extent). 25. Suppose that X1 and X2 are two statistically independent N.0; 1/-distributed variables. Denote the probability density of the stochastic 2-vector .X1 ; X2 / by e W R2 ! Œ0; C1/ (see Figure f W R2 ! Œ0; C1/. Next define the function f 5) as follows: 8 ˆ 2f .x1 ; x2 / on the interior of squares I and III; ˆ ˆ ˆ 0 and 1 C C p D 1º: Consequently dim ‚ D p
1:
Summarizing, one may say that the variable $-2\log\Lambda(F_1,\dots,F_p)$ is, for large $n$, approximately $\chi^2$-distributed with $p-1$ degrees of freedom. The "proof" is now completed by showing that for large $n$ one has
$$-2\log\Lambda(F_1,\dots,F_p) \;\approx\; \sum_{i=1}^{p} \frac{\bigl(F_i - E(F_i)\bigr)^2}{E(F_i)}.$$
The following arguments can be used to bring this about: Under $H_0$ we have (see §II.11, Exercise 34)
$$E(F_i) = n\,\pi_{i,0}.$$
In connection with Theorem III.3.2 one may therefore write
$$\Lambda(F_1,\dots,F_p) = \left(\frac{E(F_1)}{F_1}\right)^{F_1}\cdots\left(\frac{E(F_p)}{F_p}\right)^{F_p}.$$
Hence
$$-2\log\Lambda(F_1,\dots,F_p) = -2\sum_{i=1}^{p} F_i \log\frac{E(F_i)}{F_i}.$$
Thus, all one has to do is to make it plausible that
$$-2\sum_{i=1}^{p} F_i \log\frac{E(F_i)}{F_i} \;\approx\; \sum_{i=1}^{p} \frac{\bigl(F_i - E(F_i)\bigr)^2}{E(F_i)}.$$
To this end the second order Taylor expansion of the function $x \mapsto \log(1+x)$ can be used in the point $x = 0$:
$$\log(1+x) = x - \tfrac12 x^2 + o(x^2), \qquad \text{where } \lim_{x\to 0}\frac{o(x^2)}{x^2} = 0.$$
Accordingly, one has for small $x$ the second order estimation
$$\log(1+x) \approx x - \tfrac12 x^2.$$
Now under $H_0$ one may expect that for large $n$ it will be observed that
$$\frac{F_i}{n} \approx \pi_{i,0}.$$
For large $n$ one therefore has
$$\frac{E(F_i)}{F_i} = \frac{n\,\pi_{i,0}}{F_i} = \frac{\pi_{i,0}}{F_i/n} \approx 1.$$
For this reason one may write
$$\log\frac{E(F_i)}{F_i} = \log\left(1 + \left(\frac{E(F_i)}{F_i} - 1\right)\right) \approx \left(\frac{E(F_i)}{F_i} - 1\right) - \frac12\left(\frac{E(F_i)}{F_i} - 1\right)^2.$$
It follows from this that
$$F_i\log\frac{E(F_i)}{F_i} \approx \bigl(E(F_i) - F_i\bigr) - \frac12\,\frac{\bigl(E(F_i) - F_i\bigr)^2}{F_i}.$$
Summation (left and right) over $i$, thereby using the fact that
$$\sum_{i=1}^{p} E(F_i) = \sum_{i=1}^{p} n\,\pi_{i,0} = n = \sum_{i=1}^{p} F_i,$$
leads to
$$\sum_{i=1}^{p} F_i\log\frac{E(F_i)}{F_i} \;\approx\; -\frac12\sum_{i=1}^{p} \frac{\bigl(E(F_i) - F_i\bigr)^2}{F_i}.$$
Again using the fact that $F_i/E(F_i) \approx 1$, it follows that
$$-2\log\Lambda(F_1,\dots,F_p) \;\approx\; -2\sum_{i=1}^{p} F_i\log\frac{E(F_i)}{F_i} \;\approx\; \sum_{i=1}^{p} \frac{\bigl(F_i - E(F_i)\bigr)^2}{F_i} \;\approx\; \sum_{i=1}^{p} \frac{\bigl(F_i - E(F_i)\bigr)^2}{E(F_i)}.$$
Here the left hand side, and therefore the right hand side too, is approximately $\chi^2$-distributed with $p-1$ degrees of freedom. This "proves" the theorem.

Remark. The "proof" that has been given here can be made rigorous by refining the argumentation.

In practice the following decision rule is often used. If, after drawing a sample, the test statistic
$$\sum_{i=1}^{p} \frac{\bigl(F_i - E(F_i)\bigr)^2}{E(F_i)}$$
shows a "large" outcome (then $-2\log\Lambda$ assumes a large value and therefore $\Lambda$ a small one) then the null hypothesis is rejected. What is meant by "large" and "small" can (given a certain level of significance) be specified, for one knows the probability distribution of the test statistic.

Example 2. This example leans on Example 1. There was a wish there to test whether the following fractions

class:      1       2       3
fraction:   0.38    0.47    0.15

are applicable to the population of the city of Nijmegen. The level of significance was set to $\alpha = 0.05$. In a sample of size 200 the following frequencies were observed:

class:                             1     2     3
observed frequency:                84    96    20
expected frequency under $H_0$:    76    94    30
The following test statistic is used:
$$T = \sum_{i=1}^{3} \frac{\bigl(F_i - E(F_i)\bigr)^2}{E(F_i)}.$$
The outcome of this variable is
$$T = \frac{(84-76)^2}{76} + \frac{(96-94)^2}{94} + \frac{(20-30)^2}{30} = 4.22.$$
Should one look upon this outcome as being a large one? In this, note that under $H_0$ the variable $T$ is a priori approximately $\chi^2$-distributed with $3-1 = 2$ degrees of freedom. The critical region for $T$, at the prescribed level of significance $\alpha = 0.05$, is therefore presented (see Table IV) by the interval $[5.99, +\infty)$. The observed outcome of $T$ is not an element of this interval, so it should not be considered a large outcome. In conformity with the decision rule the hypothesis $H_0$ is maintained.
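For readers who want to redo Example 2 on a computer, the following short Python sketch (not part of the book) uses SciPy's standard goodness-of-fit routine; the data are exactly those of the example.

    from scipy.stats import chisquare, chi2

    observed = [84, 96, 20]
    expected = [200 * p for p in (0.38, 0.47, 0.15)]   # [76, 94, 30]

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    critical = chi2.ppf(0.95, df=2)                    # 5.99 at alpha = 0.05

    print(f"T = {stat:.2f}, p-value = {p_value:.3f}, critical value = {critical:.2f}")
    # T is about 4.22 < 5.99, so H0 is maintained, in agreement with the text.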
III.4 The $\chi^2$-test on statistical independence

In this section a decision procedure will be developed to check statistical independence of two discretely distributed stochastic variables. The procedure will be very much along the lines of §III.3 and will be based on likelihood ratios. Here is an illustrating and introducing example as a starting point:

Example. At the exit of a supermarket randomly chosen customers are asked on the one hand about the amount of money spent on shopping ($X$), on the other hand about their gross annual income ($Y$). In the following way four possible values ($X = 1, 2, 3, 4$) are assigned to $X$, according to the amount of money spent:
less than 25 25 0. One draws a sample of size 2 from this population in order to test the hypothesis H0 W D 1 against H1 W D 2. (i) Use the Neyman–Pearson lemma to find a critical region of size ˛ D 0:10, generating maximal power. (ii) Determine the numerical value of the power generated by the region constructed in (i). Define F .c/ WD ¹.x1 ; x2 / 2 R2 W x1 ; x2 cº. (iii) Adjust the value of c in such a way that F .c/ assumes a size ˛ D 0:10. Determine the power generated by F .c/ for this value of c.
5. From a Poisson distributed population one draws a sample X1 ; X2 ; X3 of size 3. Basing a decision on this sample, one wishes to test H0 W D 2
against
H1 W D 1:
For all c > 0 define the region G.c/ by G.c/ WD ¹.n1 ; n2 ; n3 / 2 N 3 W n1 C n2 C n3 cº: (i) Show that a critical region based on the Neyman–Pearson lemma is a set of type G.c/. (ii) Determine the size of the critical region G.c/ for c D 2.
(iii) Determine the power of the test if G.2/ is chosen to be the critical region. (iv) Is it possible to construct another critical region of the same size as G.2/, but generating a larger power?
6. From an N.0; 2 /-distributed population one draws a sample X1 ; : : : ; Xn of size n in order to test the hypothesis H0 W D 1
against H1 W ¤ 1:
(i) Express the likelihood ratio ƒ.x1 ; : : : ; xn / in terms of x1 ; : : : ; xn . Let BR be the closed ball with radius R in Rn , centered at .0; : : : ; 0/. (ii) Show that a critical region G based on the likelihood ratio is of the form c G D BR1 [ BR ; 2 c c the closure of BR where R1 < R2 , and BR , that is to say 2 2 c D ¹x 2 Rn W kxk R2 º: BR 2
(iii) In the case where n D 2, determine the constants R1 and R2 in such a c way that both BR1 and BR are of size ˛ D 0:05. Then what is the power 2 of the test in D 2? 7. From an exponentially distributed population with parameter one draws a sample X1 ; : : : ; Xn of size n. There is a wish to test the hypotheses H0 W D 1
against H1 W ¤ 1:
(i) Find an expression for the likelihood ratio in terms of x1 ; : : : ; xn . (ii) Show that a critical region G based on the likelihood ratio is a disjoint union G D G1 [ G2 where .c1 < c2 /: G1 D ¹.x1 ; : : : ; xn / 2 RnC W x c1 º and G2 D ¹.x1 ; : : : ; xn / 2 RnC W x c2 º: (iii) Evaluate the level of significance of the test if one chooses c1 D c2 D 1 12 . 8. Prove Theorem III.2.3. 9. Prove Theorem III.2.4.
1 2
and
10. (i) Prove Lemma III.2.5. (ii) Prove Lemma III.2.6. (iii) Prove Theorem III.2.7. 11. Prove Lemma III.2.8. 12. Prove Theorem III.2.9. 13. Prove Lemma III.2.10. 14. Prove Theorem III.2.11. 15. Develop a method to test paired differences (see Theorem II.4.6). 16. Let X1 ; : : : ; Xn be a sample of a very large size n from a population having a probability density f , given by ´ x 1 if x 2 .0I 1/; f .x/ D 0 elsewhere: There is a wish to test the hypothesis H0 W D 0
against
H1 W ¤ 0 :
In this, a critical region G.ı/ based on the likelihood ratio is chosen: G.ı/ D ¹x 2 Rn W ƒ.x/ ıº: Adjust ı in such a way that the critical region assumes a size ˛ D 0:05. 17. As to the smoking of cigarettes, the Dutch population is partitioned into four classes:
class 1 class 2 class 3 class 4
” ” ” ”
average number of cigarettes a day no smoking 1 < 10 10 < 20 20 or more
It is known that nationally the following percentages apply: class percentage
1 46
2 11
3 27
4 16
One is wondering whether these percentages also apply to the population of Dutch university students. In a sample consisting of 500 randomly chosen stu-
dents one reveals the following frequencies: class frequency
1 206
2 41
3 155
4 98
Work out a goodness-of-fit test at a prescribed level of significance ˛ D 0:10. 18. Using a Geiger–Müller counter one is observing an average number of 4.3 passages of cosmic particles a minute. There is a conjecture that the number .X / of passing particles (a minute) is Poisson distributed. To test this conjecture an experiment is set up: During 100 successive minutes every minute the number of passing particles is counted. The experiment results in the following measurement: X frequency
1 9
2 11
3 16
4 21
5 18
6 8
7 12
8 5
Carry out a goodness-of-fit test at a level of significance of 5%. 19. A manufacturer of cassettes states that the recording time in minutes of his tapes has a N.92; 1/-distribution. A sample is drawn, consisting of 100 randomly chosen tapes. The recording time of each of these tapes is measured. It appears that 20 of them have a recording time in the interval . 1; 91, 30 in the interval .91; 93 and 24 in the interval .93; C1/. (i) Determine the expected frequencies in the intervals mentioned above, under the hypothesis that the statement of the manufacturer is correct. (ii) Carry out a goodness-of-fit test at a level of significance of 5%. Is the statement of the manufacturer rejected? (iii) Could it occur, applying other intervals, that the test results in another decision? If so, give an example. 20. (i) In the notations of §III.4, one has ij D i j for all i; j if and only if there exist real numbers ˛1 ; : : : ; ˛p 0 and ˇ1 ; : : : ; ˇq 0 such that ˛1 C C ˛p D 1; ˇ1 C C ˇq D 1 and ij D ˛i ˇj : Prove this. (ii) Use (i) to prove Lemma III.4.1. 21. Give a heuristical proof of Theorem III.4.3, along the lines of “the proof” of Theorem III.3.3. 22. On a civic sports-festival for students one measures the individual sporting records Y , the weight G (in kilos) and the height L (in meters) of 250 ran-
domly chosen students. One defines the variable X to be the so-called “Quetelet quotient”: G X WD 2 : L One is wondering whether the variables X and Y are statistically independent. The outcome of the experiment is shown in the following scheme: Y
bad
average
good
very good
20 21; : : : ; 25 26; : : : ; 30 > 30
5 16 12 11
6 9 18 9
8 61 10 5
6 34 5 5
X
Carry out a 2 -test on statistical independence for the variables X and Y at a level of significance ˛ D 0:05.
Chapter IV
Simple regression analysis
IV.1 The least squares method
Introduction. A physicist is carrying out $n$ measurements in an electrical circuit concerning the strength of current $x$ and the voltage $y$. As pairs, currents $x_i$ and voltages $y_i$ are measured ($i = 1,\dots,n$). The results are summarized in a scheme:

x:   $x_1$   $x_2$   ...   $x_n$
y:   $y_1$   $y_2$   ...   $y_n$

Setting out the points $(x_i,y_i)$ in a graph one obtains what is usually called the scatter diagram corresponding to these measurements. In the following it will be systematically assumed that among the measurements $x_i$ there are at least two different values.

[Figure 8: scatter diagram of the points $(x_i,y_i)$ in the $(x,y)$-plane.]

Theoretically there is a linear relation between the variables $x$ and $y$ (Ohm's law):
$$y = R\,x.$$
Consequently, if there were no inaccuracies in reading off the instruments and no other disturbing factors, the points $(x_i,y_i)$ would be positioned exactly on a straight line through the origin. In practice this will be observed only approximately. There is a wish to make an estimate $\widehat{R}$ of the constant $R$, the resistance in Ohm's law. Whenever such an estimate $\widehat{R}$ is achieved one can, given an arbitrary current $x$, make an estimate $\widehat{y}$ of the corresponding voltage $y$ by setting
$$\widehat{y} := \widehat{R}\,x.$$
Whatever the estimate $\widehat{R}$ of $R$ may be, there will be no escape from the phenomenon that, given a current $x_i$, a discrepancy will appear between the measured value $y_i$ and the estimated value $\widehat{y}_i = \widehat{R}\,x_i$ of $y_i$. This difference between $y_i$ and $\widehat{y}_i$ is called
the error or the residual corresponding to measurement no. $i$; it will be denoted by $e_i$.

[Figure 9: scatter diagram with the fitted line $y = \widehat{R}x$; the vertical deviation of a measured point from the line is the error $e_i$.]

More specifically:
$$e_i := y_i - \widehat{R}\,x_i = y_i - \widehat{y}_i.$$
Of course it is to be preferred to have the errors as small as possible. One is therefore searching for an estimate $\widehat{R}$ of $R$ that minimizes the errors in some sense. For example, one could demand from $\widehat{R}$ that it minimizes the expression $\sum_i |e_i|$ or the expression $\max_i |e_i|$. This leads to the problem of minimizing functions of type
$$f: \widehat{R} \mapsto \sum_i |y_i - \widehat{R}\,x_i| \quad \text{or} \quad f: \widehat{R} \mapsto \max_i |y_i - \widehat{R}\,x_i|.$$
This, however, is rather complicated, for these functions are in general not differentiable with respect to $\widehat{R}$. One could get around this problem (as proposed a long time ago by the French mathematician Adrien M. Legendre (1752–1833)) by choosing, instead of the expressions above, the sum $\sum_i e_i^2$ to be minimized. The expression
$$f(\widehat{R}) := \sum_i e_i^2 = \sum_i (y_i - \widehat{R}\,x_i)^2$$
is called the sum of squares of errors. Differentiation of this expression with respect to $\widehat{R}$ yields
$$f'(\widehat{R}) = -2\sum_i (y_i - \widehat{R}\,x_i)\,x_i.$$
In this way one reveals that $f'$ changes sign in the point $\widehat{R} = \sum_i x_iy_i \big/ \sum_i x_i^2$. In this point $f$ attains a global minimum. The value of $\widehat{R}$ that arises in this way is called the least squares estimate of $R$.
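The least squares estimate through the origin is easy to compute directly. Here is a minimal Python sketch (not from the book); the current and voltage values are hypothetical numbers chosen only for illustration.

    import numpy as np

    # Least squares through the origin: R_hat = sum(x*y) / sum(x*x), as derived above.
    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])    # hypothetical currents
    y = np.array([1.1, 1.9, 3.2, 4.1, 4.8])    # hypothetical voltages

    R_hat = np.sum(x * y) / np.sum(x * x)
    residuals = y - R_hat * x

    print("least squares estimate R_hat:", R_hat)
    print("sum of squares of errors    :", np.sum(residuals**2))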
Of course one can also use this technique in cases where $x$ and $y$ are theoretically related as
$$y = \alpha + \beta x.$$
Suppose that, starting from $n$ measurements $(x_1,y_1),\dots,(x_n,y_n)$, one wishes to make estimates of the parameters $\alpha$ and $\beta$. If one chooses $a$ and $b$ to be estimates of $\alpha$ and $\beta$ respectively, then the error corresponding to measurement number $i$ is given by
$$e_i := y_i - (a + b\,x_i).$$
Consequently, the sum of squares of errors is in this case a function of two variables:
$$f(a,b) := \sum_i \{y_i - (a + b\,x_i)\}^2.$$
This expression is now minimized by finding the zeros of the partial derivatives:
$$\frac{\partial}{\partial a} f(a,b) = -2\sum_i \{y_i - (a + b\,x_i)\} = 0,$$
$$\frac{\partial}{\partial b} f(a,b) = -2\sum_i x_i\{y_i - (a + b\,x_i)\} = 0.$$
This results in a system of two linear equations in $a$ and $b$; these equations are said to be the normal equations:
$$n\,a + \Bigl(\sum_i x_i\Bigr) b = \sum_i y_i,$$
$$\Bigl(\sum_i x_i\Bigr) a + \Bigl(\sum_i x_i^2\Bigr) b = \sum_i x_iy_i.$$
The determinant of this system cannot be zero (see Exercise 3), so there is exactly one solution for $a$ and $b$. These solutions will systematically be denoted by $\hat\alpha$ and $\hat\beta$ respectively. One has
$$\hat\beta = \frac{n\bigl(\sum_i x_iy_i\bigr) - \bigl(\sum_i x_i\bigr)\bigl(\sum_i y_i\bigr)}{n\bigl(\sum_i x_i^2\bigr) - \bigl(\sum_i x_i\bigr)^2}, \qquad \hat\alpha = \frac{\sum_i y_i - \hat\beta\sum_i x_i}{n}.$$
In order to simplify these expressions the following notations are introduced:
$$\overline{x} := \frac{1}{n}\sum_i x_i, \qquad \overline{y} := \frac{1}{n}\sum_i y_i,$$
$$S_{xx} := \sum_i (x_i-\overline{x})^2, \qquad S_{yy} := \sum_i (y_i-\overline{y})^2, \qquad S_{xy} := \sum_i (x_i-\overline{x})(y_i-\overline{y}).$$
The least squares estimates $\hat\alpha$ and $\hat\beta$ can now be expressed as follows:

Proposition IV.1.1. Suppose that, concerning the quantities $x$ and $y$, there is a theoretical relationship $y = \alpha + \beta x$. Then the least squares estimates of $\alpha$ and $\beta$, based on $n$ measurements $(x_1,y_1),\dots,(x_n,y_n)$, are given by
$$\hat\beta = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat\alpha = \overline{y} - \hat\beta\,\overline{x}.$$

Proof. See Exercise 4.

The method of least squares can also be applied in cases where one is dealing with non-linear models. To illustrate this, suppose that, theoretically, there is an exponential relationship between $x$ and $y$ of type
$$y = \alpha + \beta\,e^{\gamma x}.$$
Given $n$ measurements $(x_1,y_1),\dots,(x_n,y_n)$, the sum of squares of errors now assumes the form
$$f(a,b,c) = \sum_i \{y_i - (a + b\,e^{c x_i})\}^2.$$
The least squares estimates of $\alpha$, $\beta$ and $\gamma$ can be obtained by solving the equations
$$\frac{\partial f}{\partial a} = 0, \qquad \frac{\partial f}{\partial b} = 0, \qquad \frac{\partial f}{\partial c} = 0$$
simultaneously. In this sense the method of least squares can be applied universally.

Now look back upon the experiment where in an electrical circuit measurements were done with respect to the electrical current $x$ and the voltage $y$. Suppose that the current can be adjusted very accurately to a fixed value of $x$. At such a fixed value of $x$ one carries out measurements on the voltage $y$. Because of inaccuracies in reading off the instruments and because of other disturbing factors, these measured values of $y$ are generally not equal. Therefore, given a fixed value of $x$, these values of $y$ are regarded as outcomes of a stochastic variable $Y_x$. In such cases the variable $x$ is called a controlled variable and $Y_x$ a response variable. It is assumed that, when dealing with larger and larger numbers of measurements $y_1,\dots,y_n$ at an adjusted current $x$, one will have with great accuracy
$$\overline{y} = \frac{y_1 + \cdots + y_n}{n} \approx R\,x.$$
This can be formalized by saying that one assumes
$$\mathbb{E}(Y_x) = R\,x.$$
In the following $\mathbb{E}(Y_x)$ will frequently be denoted as $\mu_x$. Choosing in succession fixed values $x_1,\dots,x_n$ for the current $x$, a sequence $Y_{x_1},\dots,Y_{x_n}$ of stochastic variables emerges. Typography can be simplified by setting
$$Y_i := Y_{x_i} \quad \text{and} \quad \mu_i := \mu_{x_i}.$$
In these notations one may write
$$\mu_i = \mathbb{E}(Y_i) = \mathbb{E}(Y_{x_i}) = R\,x_i.$$
In the theory of linear regression one often assumes that the variables $Y_1,\dots,Y_n$ constitute a statistically independent system and that they all have the same variance $\sigma^2$. Now, given the measurements $(x_1,y_1),\dots,(x_n,y_n)$, the least squares estimate $\widehat{R} = \sum_i x_iy_i \big/ \sum_i x_i^2$ of $R$ can be regarded as an outcome of the stochastic variable
$$\widehat{R} = \frac{\sum_i x_iY_i}{\sum_i x_i^2}.$$
In this way an estimator $\widehat{R}$ of $R$ appears. The estimator $\widehat{R}$ has the following important properties:

(i) $\widehat{R}$ is a linear combination of the variables $Y_1,\dots,Y_n$; that is to say, $\widehat{R}$ is a linear estimator in the $Y_1,\dots,Y_n$.
(ii) $\widehat{R}$ is an unbiased estimator of $R$.
(iii) $\operatorname{var}(\widehat{R}) = \sigma^2 \big/ \sum_i x_i^2$.
(iv) Among the unbiased linear estimators of $R$ the variable $\widehat{R}$ is of minimum variance, so $\widehat{R}$ can be considered the best unbiased linear estimator of the parameter $R$.

In Exercise 5 the reader is invited to prove these properties.

In the following the above will be generalized to the case where there are two quantities $x$ and $y$, between which there is a theoretical relationship
$$y = \alpha + \beta x.$$
For fixed $x$, as before, the value of $y$ is regarded as being the outcome of the stochastic variable $Y_x$ and it is assumed that:

Assumption I: $\mathbb{E}(Y_x) = \alpha + \beta x$.
In this notation one may write i D ˛ C ˇxi : Besides the above it will be systematically assumed in this section that: Assumption II:
Y1 ; : : : ; Yn are statistically independent.
Assumption III: One has var.Y1 / D D var.Yn / D 2 . The latter assumption is often referred to as homoscedastiticity. It turns out that under these assumptions the least squares estimators b ˛ and b ˇ have a number of very desirable properties. Some of them are presented in the following proposition. Proposition IV.1.2. The least squares estimators b ˛ and b ˇ of ˛ and ˇ are unbiased linear estimators in Y1 ; : : : ; Yn . Moreover, one has P .xi x/Yi b ˇD i ; b ˛DY b ˇ x; Sxx 2 var.b ˇ/ D ; Sxx
var.b ˛/ D
.Sxx C n.x/2 / 2 nSxx
and
cov.b ˛; b ˇ/ D
Proof. From Proposition IV.1.1 it follows that P SxY .xi x/.Yi Y / b D i : ˇ D Sxx Sxx P Realizing that i .xi x/ D 0, one can rewrite this expression as: P X xi x .xi x/Yi b ˇ D i D Yi : Sxx Sxx
x 2 : Sxx
i
In this way it can be seen that b ˇ is a linear combination of the Y1 ; : : : ; Yn ; that is to say: b ˇ is a linear estimator in Y1 ; : : : ; Yn . Furthermore one has 1 X 1 X E.b ˇ/ D .xi x/ E.Yi / D .xi x/.˛ C ˇxi / Sxx Sxx i i ± X 1 ° X D ˛ .xi x/ C ˇ .xi x/xi : Sxx i
Again using the fact that
P
i
i .xi
E. b ˇ/ D
x/ D 0, one deduces from this that
X 1 ˇ .xi Sxx i
x/.xi
x/ D ˇ:
This proves that b ˇ is an unbiased estimator of ˇ. To prove that b ˛ is an unbiased estimator of ˛, note that, by Proposition IV.1.1, the estimator b ˛ is given by b ˇ x:
b ˛ DY
Because both Y and b ˇ are linear combinations of the Y1 ; : : : ; Yn , the variable b ˛ is also a linear combination in Y1 ; : : : ; Yn . In other words, b ˛ is a linear estimator in Y1 ; : : : ; Yn . Moreover, one has 1X x E.b ˇ/ D E.Yi / n i ± 1°X .˛ C ˇxi / ˇx D n i X ± 1° D xi ˇx n˛ C ˇ n
E.b ˛ / D E.Y /
xˇ
i
D˛Cˇ x
ˇ x D ˛:
This proves that b ˛ is an unbiased estimator of ˛.
Using the independence of the Y1 ; : : : ; Yn and applying Propositions I.5.8 and I.5.9, an explicit form of the variance of the variable b ˇ can be deduced: ! X xi x Yi var.b ˇ/ D var Sxx i X xi x 2 D var.Yi / Sxx i
D
1 .Sxx /2
X .xi i
x/2 2 D
2 : Sxx
Furthermore, one may write cov.b ˛; b ˇ/ D cov.Y
b ˇ x; b ˇ/
D cov.Y ; b ˇ/ cov.b ˇ x; b ˇ/ X 1 D cov Yi ; b ˇ x var.b ˇ/ n i
1X ˇ/ cov.Yi ; b D n i
x
2 : Sxx
()
However, one has X xj x b cov.Yi ; ˇ/ D cov Yi ; Yj Sxx j X xj x xi x 2 cov.Yi ; Yj / D : D Sxx Sxx j
Again using the fact that way ./ reduces to
P
i .xi
x/ D 0, it follows that x 2 : Sxx
cov.b ˛; b ˇ/ D
P
i
cov.Yi ; b ˇ / D 0. In this
In Exercise 7 the reader is asked to verify that var.b ˛/ D
.Sxx C n.x/2 / 2 : nSxx
The least squares estimators b ˛ and b ˇ are also doing good work in the following sense:
Proposition IV.1.3. Among the unbiased linear estimators of ˛ and ˇ the least squares estimators b ˛ and b ˇ are the ones of minimum variance. They are unique in this.
P Proof. This statement is verified for the estimator b ˇ . Suppose e ˇ D i ci Yi is an unbiased linear estimator ˇ. Whatever the values of ˛ and ˇ may be, one then has Therefore E. e ˇ/ D
X i
E. e ˇ / D ˇ:
ci E.Yi / D
X i
ci .˛ C ˇxi / D ˇ
for all possible values of ˛ and ˇ. It thus appears that unbiasedness of the estimator e ˇ is equivalent to the following two constraints on the coefficients c1 ; : : : ; cn : X X ci D 0 and ci xi D 1: ./ i
i
Among the unbiased linear estimators of ˇ one is looking for one of minimum variance. Using Propositions I.5.8 and I.5.9, together with the independence of the variables Y1 ; : : : ; Yn , one obtains X X X ci2 : var. e ˇ/ D ci2 var.Yi / D ci2 2 D 2 i
i
i
In this way it follows that the mathematical content of the problem is the minimization of $\sum_i c_i^2$ under the constraints $(*)$. A theorem of Joseph L. Lagrange (1736-1813) is applied to solve this (see [14], [43]; see also the proof of Theorem II.9.3): in a point $(c_1,\dots,c_n)$ where $\sum_i c_i^2$ is minimal, one necessarily has
$$2(c_1,\dots,c_n)=\lambda(1,\dots,1)+\mu(x_1,\dots,x_n).$$
Consequently
$$2c_i=\lambda+\mu x_i.$$
The condition $\sum_i c_i=0$ implies
$$\sum_i(\lambda+\mu x_i)=n\lambda+n\mu\bar x=0.$$
Therefore $\lambda=-\mu\bar x$. As a result one gets
$$2c_i=\lambda+\mu x_i=\mu(x_i-\bar x). \qquad(**)$$
Substitution of this equality in the constraint $\sum_i c_ix_i=1$ gives us
$$\sum_i c_ix_i=\sum_i\tfrac12\mu\,(x_i-\bar x)x_i=1.$$
Because $\sum_i(x_i-\bar x)=0$, one can rewrite this as
$$\tfrac12\mu\sum_i(x_i-\bar x)^2=1.$$
Hence
$$\mu=\frac{2}{\sum_i(x_i-\bar x)^2}=\frac{2}{S_{xx}}.$$
According to $(**)$ one may write
$$c_i=\frac{\mu}{2}(x_i-\bar x)=\frac{x_i-\bar x}{S_{xx}}.$$
However, these are (Proposition IV.1.2) exactly the coefficients of $\widehat\beta$ and for this reason one has $\widetilde\beta=\widehat\beta$. This proves that an unbiased linear estimator of $\beta$ of minimum variance is necessarily equal to $\widehat\beta$ (but: see Exercise 17). In a similar way this statement is proved for $\widehat\alpha$.
After having finished the experiment, the estimators $\widehat\alpha$ and $\widehat\beta$ show a certain outcome. Choosing an arbitrary fixed x one can now make an estimate of $\mu_x=\alpha+\beta x$ by setting $\mu_x\approx\widehat\alpha+\widehat\beta x$. In fact one is dealing here with the following estimator of $\mu_x$:
$$\widehat Y_x:=\widehat\alpha+\widehat\beta x.$$
The following proposition describes some properties of this estimator.

Proposition IV.1.4. For an arbitrary fixed x, one has:
(i) $\widehat Y_x$ is an unbiased estimator of $\mu_x$,
(ii) $\widehat Y_x$ is a linear estimator in $Y_1,\dots,Y_n$,
(iii) $\operatorname{var}(\widehat Y_x)=\Big\{\dfrac1n+\dfrac{(x-\bar x)^2}{S_{xx}}\Big\}\sigma^2$.
Proof. The statements (i), (ii), and (iii) can be deduced in a direct way from Proposition IV.1.2. This is left to the reader.

Here is an example that illustrates how the proposition above can be applied.

Example 1. In a certain experiment one carries out five measurements concerning two quantities x and y. Theoretically, x and y should satisfy the linear relation $y=\alpha+\beta x$. The outcomes of the measurements are summarized in the following scheme:

  measurement no.   1      2      3      4      5
  x                 2.30   3.71   4.22   5.87   6.33
  y                 2.74   4.50   5.53   6.06   7.27

Some calculations provide: $\bar x=4.486$, $S_{xx}=10.76732$ and $\bar y=5.220$. The estimators $\widehat\alpha$ and $\widehat\beta$ of $\alpha$ and $\beta$ show the following outcomes:
$$\widehat\beta=\frac{\sum_i(x_i-\bar x)\,y_i}{S_{xx}}=1.007,\qquad \widehat\alpha=\bar y-\widehat\beta\,\bar x=0.704.$$
It follows that the outcome $\widehat y$ of the estimator $\widehat Y_x$ is given by $0.704+1.007\,x$. In particular, in the point $x=6.50$ one obtains the following estimate of $\mu_x=\alpha+\beta x$:
$$\widehat y=0.704+1.007\times 6.50=7.25.$$
This can be interpreted in the following way: when doing again and again measurements on y in the point $x=6.50$, the averaged value of the y-measurements can be expected to be close to the estimated value 7.25. The variance of the estimator $\widehat Y_x$ in the point $x=6.50$ is a measure of the accuracy of the estimate. By Proposition IV.1.4 one has
$$\operatorname{var}(\widehat Y_x)=\Big\{\frac15+\frac{(6.50-4.486)^2}{10.76732}\Big\}\sigma^2=0.577\,\sigma^2.$$
Because $\sigma^2$ is unknown, this does not tell much about the variance of $\widehat Y_x$. Suppose, which in practice is usually not the case, that the numerical value of $\sigma^2$ is known; for example, suppose that $\sigma^2=0.20$. Then one has $\operatorname{var}(\widehat Y_x)=0.115$ in the point $x=6.50$. If in addition it is assumed that $\widehat Y_x$ is (approximately) normally distributed, then an interval estimate for $\mu_x$ in the point $x=6.50$ can be constructed: in Table II it can be read off that
$$P\Big(\Big|\frac{\widehat Y_x-\mu_x}{\sqrt{0.115}}\Big|<1.96\Big)\approx 0.95.$$
Substituting the outcome $\widehat Y_x=7.25$ in the inequality $|\widehat Y_x-\mu_x|<1.96\sqrt{0.115}$, one obtains for $\mu_x$ the following inequality:
$$6.59<\mu_x<7.91.$$
Thus the interval (6.59, 7.91) appears as a 95% confidence interval for $\mu_x$ in the point $x=6.50$. As said before, the numerical value of $\sigma^2$ is usually not known. In the next section this problem will be dealt with.
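The computation in this example is easily reproduced by machine. The following Python fragment is an illustrative sketch and is not part of the original text; the data, the value $\sigma^2=0.20$ and the normal quantile 1.96 are taken from the example above, while the variable names are chosen freely.

```python
# Least squares fit for the five measurements and the 95% interval estimate
# for mu_x = alpha + beta*x at x = 6.50, assuming sigma^2 = 0.20 is known.
import math

x = [2.30, 3.71, 4.22, 5.87, 6.33]
y = [2.74, 4.50, 5.53, 6.06, 7.27]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares estimates (Propositions IV.1.1 and IV.1.2)
beta_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / S_xx
alpha_hat = y_bar - beta_hat * x_bar
print(f"beta_hat  = {beta_hat:.3f}")    # approximately 1.007
print(f"alpha_hat = {alpha_hat:.3f}")   # approximately 0.704

# Estimate of mu_x at x0 = 6.50 and its variance (Proposition IV.1.4)
x0, sigma2 = 6.50, 0.20
y0_hat = alpha_hat + beta_hat * x0
var_y0 = (1.0 / n + (x0 - x_bar) ** 2 / S_xx) * sigma2

# 95% interval, treating the estimator as (approximately) normal
half_width = 1.96 * math.sqrt(var_y0)
print(f"estimate at x = 6.50: {y0_hat:.2f}")
print(f"95% interval: ({y0_hat - half_width:.2f}, {y0_hat + half_width:.2f})")
```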
IV.2 Construction of an unbiased estimator of $\sigma^2$

In the preceding section the assumption was made that $\operatorname{var}(Y_i)=\sigma^2$ for all $i=1,\dots,n$. If the numerical value of $\sigma^2$ is not known, one will have to make estimates of it. This section is devoted to the construction of an unbiased estimator of $\sigma^2$. Running
ahead of things, it will appear that, given a series of outcomes $(x_1,y_1),\dots,(x_n,y_n)$ of the experiment, the expression
$$\frac{1}{n-2}\sum_i(y_i-\widehat y_i)^2$$
can be regarded as a genuine estimate of $\sigma^2$. Note that in this expression the sum of squares of errors $\sum_i(y_i-\widehat y_i)^2$ occurs.

In the following the $y_i$ and $\widehat y_i$ will be looked upon as being outcomes of the variables $Y_i$ and $\widehat Y_i=\widehat\alpha+\widehat\beta x_i$ respectively. Thus the sum of squares of errors $\sum_i(y_i-\widehat y_i)^2$ can be regarded as the outcome of the stochastic variable $\sum_i(Y_i-\widehat Y_i)^2$. This stochastic variable will systematically be denoted by SSE, so in this notation one has
$$\mathrm{SSE}:=\sum_i(Y_i-\widehat Y_i)^2.$$
For every x the variable $E_x$ is defined by
$$E_x:=Y_x-(\alpha+\beta x).$$
For typographical reasons $E_{x_i}$ will be denoted as $E_i$. The next lemma is a prelude to the construction of an unbiased estimator of $\sigma^2$.

Lemma IV.2.1. Denoting $\widehat E_i=\widehat Y_i-(\alpha+\beta x_i)$ and $\overline E=(E_1+\dots+E_n)/n$, one has:
(i) $\widehat E_i-\overline E=(\widehat\beta-\beta)(x_i-\bar x)$,
(ii) $E_i-\widehat E_i=Y_i-\widehat Y_i$,
(iii) $\sum_i(\widehat E_i-\overline E)(E_i-\widehat E_i)=0$.

Proof. (i) The difference between $\widehat E_i$ and $\overline E$ can be captured as:
$$\widehat E_i-\overline E=\widehat Y_i-(\alpha+\beta x_i)-\overline Y+(\alpha+\beta\bar x)
=\widehat\alpha+\widehat\beta x_i-(\widehat\alpha+\widehat\beta\bar x)-\beta(x_i-\bar x)
=(\widehat\beta-\beta)(x_i-\bar x).$$
(ii) For the discrepancy between $E_i$ and $\widehat E_i$ one may write:
$$E_i-\widehat E_i=\{Y_i-(\alpha+\beta x_i)\}-\{\widehat Y_i-(\alpha+\beta x_i)\}=Y_i-\widehat Y_i.$$
(iii) Using (i), (ii) and Proposition IV.1.2, it follows that
$$\begin{aligned}
\sum_i(\widehat E_i-\overline E)(E_i-\widehat E_i)
&=\sum_i(\widehat\beta-\beta)(x_i-\bar x)(Y_i-\widehat Y_i)\\
&=(\widehat\beta-\beta)\Big\{\sum_i(x_i-\bar x)Y_i-\sum_i(x_i-\bar x)\widehat Y_i\Big\}\\
&=(\widehat\beta-\beta)\Big\{S_{xx}\widehat\beta-\sum_i(x_i-\bar x)(\widehat\alpha+\widehat\beta x_i)\Big\}\\
&=(\widehat\beta-\beta)\Big\{S_{xx}\widehat\beta-\widehat\alpha\sum_i(x_i-\bar x)-\widehat\beta\sum_i(x_i-\bar x)x_i\Big\}\\
&=(\widehat\beta-\beta)\{S_{xx}\widehat\beta-S_{xx}\widehat\beta\}=0.
\end{aligned}$$
Proposition IV.2.2. The following identity is generally valid:
$$\sum_i(E_i-\overline E)^2=\mathrm{SSE}+(\widehat\beta-\beta)^2S_{xx}.$$

Proof. The sum of squares on the left side can be written out as:
$$\begin{aligned}
\sum_i(E_i-\overline E)^2
&=\sum_i\{(E_i-\widehat E_i)+(\widehat E_i-\overline E)\}^2\\
&=\sum_i(E_i-\widehat E_i)^2+\sum_i(\widehat E_i-\overline E)^2+2\sum_i(E_i-\widehat E_i)(\widehat E_i-\overline E)\\
&=\sum_i(Y_i-\widehat Y_i)^2+(\widehat\beta-\beta)^2\sum_i(x_i-\bar x)^2+0\\
&=\mathrm{SSE}+(\widehat\beta-\beta)^2S_{xx}.
\end{aligned}$$
Theorem IV.2.3. The variable $\mathrm{SSE}/(n-2)$ presents an unbiased estimator of the parameter $\sigma^2$.

Proof. The variables $E_1,\dots,E_n$ all have a common expectation value and a common variance. Furthermore, they constitute a statistically independent system. Therefore (see the remark subsequent to Proposition II.1.3) one has
$$E\Big(\sum_i(E_i-\overline E)^2\Big)=(n-1)\,\sigma^2.$$
Moreover, one has by Proposition IV.1.2:
$$E\big((\widehat\beta-\beta)^2S_{xx}\big)=S_{xx}\,E\big((\widehat\beta-\beta)^2\big)=S_{xx}\operatorname{var}(\widehat\beta)=S_{xx}\,\frac{\sigma^2}{S_{xx}}=\sigma^2.$$
Using Proposition IV.2.2 one arrives at
$$(n-1)\,\sigma^2=E(\mathrm{SSE})+\sigma^2.$$
Hence $E(\mathrm{SSE})=(n-2)\,\sigma^2$, that is, $\mathrm{SSE}/(n-2)$ is an unbiased estimator of $\sigma^2$. This proves the theorem.
An application of the theorem above is illustrated in the following example.

Example 1. As to the measurements in Example 1 of §IV.1 one could set up the following scheme. (Here $\widehat y$ denotes the outcome of $\widehat Y_x$ and $y-\widehat y$ the error corresponding to measurements in the point x.)

  x       y       yhat      y - yhat    (y - yhat)^2
  2.30    2.74    3.0192    -0.2792     0.0780
  3.71    4.50    4.4387    +0.0613     0.0038
  4.22    5.53    4.9522    +0.5778     0.3339
  5.87    6.06    6.6134    -0.5534     0.3063
  6.33    7.27    7.0765    +0.1935     0.0374
                            ---------   ---------
                            +0.00       +0.76

It follows that $\mathrm{SSE}=0.76$. The following estimate for $\sigma^2$ arises from this outcome of SSE:
$$\frac{\mathrm{SSE}}{n-2}=\frac{0.76}{3}=0.25.$$
In this stage nothing can be said about the accuracy of this estimate, for the probability distribution of the variable SSE is not known. This will be a topic in the next section.
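The scheme above can be generated automatically. The following Python fragment is an illustrative sketch (not part of the original text); it reuses the data of Example 1 in §IV.1 and only the variable names are chosen freely.

```python
# Fitted values, residuals, SSE and the unbiased estimate SSE/(n-2) of sigma^2.
x = [2.30, 3.71, 4.22, 5.87, 6.33]
y = [2.74, 4.50, 5.53, 6.06, 7.27]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / S_xx
alpha_hat = y_bar - beta_hat * x_bar

y_hat = [alpha_hat + beta_hat * xi for xi in x]          # fitted values
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]      # errors y - yhat
SSE = sum(r ** 2 for r in residuals)

print("residuals:", [round(r, 4) for r in residuals])    # they sum (numerically) to ~0
print(f"SSE              = {SSE:.2f}")                   # approximately 0.76
print(f"sigma^2 estimate = {SSE / (n - 2):.2f}")         # approximately 0.25
```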
IV.3 Normal regression analysis

In this section the notations introduced before are retained and the basic Assumptions I, II and III in §IV.1 are sharpened. In fact Assumptions I and III are replaced by a new Assumption I in the following way:

Assumption I: For all x the variable $Y_x$ is $N(\alpha+\beta x,\sigma^2)$-distributed.

Assumption II: For every fixed sequence $x_1,\dots,x_n$ the variables $Y_{x_1},\dots,Y_{x_n}$ constitute a statistically independent system.

Under these two assumptions one talks about normal regression analysis.
Proposition IV.3.1. The residual vector $(Y_1-\widehat Y_1,\dots,Y_n-\widehat Y_n)$ and the 2-vector $(\widehat\alpha,\widehat\beta)$ are statistically independent.

Proof. Let V be the linear span of the $Y_1,\dots,Y_n$. By Proposition IV.1.4 one has for all i that $Y_i-\widehat Y_i\in V$. From Proposition IV.1.2 it follows that $\widehat\alpha,\widehat\beta\in V$. To prove the proposition it is, by Theorem I.6.7, sufficient to verify that
$$\operatorname{cov}(Y_i-\widehat Y_i,\widehat\alpha)=0,\qquad \operatorname{cov}(Y_i-\widehat Y_i,\widehat\beta)=0\qquad\text{for all }i.$$
These verifications are standard and are left to the reader.
Through the following theorem one gets more grip on the sum of squares of errors:

Theorem IV.3.2. In the case of normal regression analysis one has:
(i) SSE and $(\widehat\alpha,\widehat\beta)$ are statistically independent,
(ii) $\mathrm{SSE}/\sigma^2$ is $\chi^2$-distributed with $n-2$ degrees of freedom.

Proof. (i) By definition one has $\mathrm{SSE}=\sum_i(Y_i-\widehat Y_i)^2$, a function of the residual vector. By Proposition IV.3.1 and Proposition I.4.2 it now follows that SSE and $(\widehat\alpha,\widehat\beta)$ are statistically independent.

To prove (ii), note that, by Proposition IV.2.2, one has
$$\frac{\sum_i(E_i-\overline E)^2}{\sigma^2}=\frac{\mathrm{SSE}}{\sigma^2}+\frac{(\widehat\beta-\beta)^2S_{xx}}{\sigma^2}.$$
Theorem II.2.12 tells that the left side of this equality has a $\chi^2$-distribution with $n-1$ degrees of freedom. The variable $\widehat\beta$ is, being a non-trivial linear combination of the independent normally distributed $Y_1,\dots,Y_n$, also normally distributed (combine Theorems I.6.1 and I.6.6). Using Proposition IV.1.2 it follows that the variable $(\widehat\beta-\beta)\sqrt{S_{xx}}/\sigma$ is $N(0,1)$-distributed. Consequently, the variable $(\widehat\beta-\beta)^2S_{xx}/\sigma^2$ is $\chi^2$-distributed with one degree of freedom. By (i) the variables $\mathrm{SSE}/\sigma^2$ and $(\widehat\beta-\beta)^2S_{xx}/\sigma^2$ are independent. Applying Proposition II.2.10 it follows that $\mathrm{SSE}/\sigma^2$ has a $\chi^2$-distribution with $n-2$ degrees of freedom.

The theorem above can be used to make interval estimates for $\sigma^2$:

Example 1. Now return to Example 1 of §IV.1 and assume that one is here dealing with a case where normal regression analysis can be applied. Then the variable $\mathrm{SSE}/\sigma^2$ is (a priori) $\chi^2$-distributed with $5-2=3$ degrees of freedom. Now Table IV provides
$$P\Big(0.35<\frac{\mathrm{SSE}}{\sigma^2}<7.81\Big)=0.90.$$
Once the experiment is finished one observes the following outcome of SSE (see Example 1 of §IV.2): $\mathrm{SSE}=0.76$. Substituting this outcome in the inequality $0.35<\mathrm{SSE}/\sigma^2<7.81$
$\delta>0$. Define for all $k=1,2,\dots$ and all $n=1,2,\dots$ the set $A_{n,k}$ by
$$A_{n,k}:=\Big\{\omega\in\Omega:|X_n(\omega)|\ge\tfrac1k\Big\}=\Big\{|X_n|\ge\tfrac1k\Big\}.$$
Starting from the premise in the lemma, one has for all fixed k
$$\sum_{n=1}^{\infty}P(A_{n,k})=\sum_{n=1}^{\infty}P\Big(|X_n|\ge\tfrac1k\Big)<+\infty.$$
This implies that to every k there is a number $N_k\in\mathbb N$ such that
$$\sum_{n=N_k}^{\infty}P(A_{n,k})\le\frac{\delta}{2^k}.$$
Hence, if the set $A_k$ (one should write $A_k^\delta$) is defined by
$$A_k:=\bigcup_{n=N_k}^{\infty}A_{n,k},$$
then one has
$$P(A_k)\le\sum_{n=N_k}^{\infty}P(A_{n,k})\le\delta/2^k. \qquad(*)$$
Next, the set $B_\delta$ is defined by
$$B_\delta:=\bigcap_{k=1}^{\infty}A_k^c.$$
Now it turns out that sets of the form $B_\delta$ have the following two properties:
(i) $P(B_\delta)\ge 1-\delta$ and (ii) $X_n(\omega)\to 0$ for all $\omega\in B_\delta$.
To prove (i), note that by De Morgan's laws one has
$$\Omega\setminus B_\delta=\Omega\setminus\bigcap_{k=1}^{\infty}A_k^c=\bigcup_{k=1}^{\infty}A_k.$$
Via $(*)$ one thus arrives at
$$P(\Omega\setminus B_\delta)\le\sum_{k=1}^{\infty}P(A_k)\le\sum_{k=1}^{\infty}\delta/2^k=\delta,$$
which proves that $P(B_\delta)\ge 1-\delta$.

Property (ii) can be proved as follows: If $\omega\in B_\delta$, then by definition of $B_\delta$ one has for all k: $\omega\notin A_k$. In turn, by definition of $A_k$, this implies that for all $n\ge N_k$: $\omega\notin A_{n,k}$. This is, however, the same as saying that
$$|X_n(\omega)|<\tfrac1k\quad\text{for all }n\ge N_k.$$
Summarizing, it follows that for $\omega\in B_\delta$ one has $|X_n(\omega)|<1/k$ for all $n\ge N_k$ and all k.
$C>0$ such that $|X_n|\le C$ for all n. In this book this presumption will not form any obstacle. A proof of the general case can be found in [17], [18], [21], [50]. By replacing $X_i$ by $X_i-\mu$ one may assume, without loss of generality, that $\mu=0$. Now define the variables $Y_1,Y_2,\dots$ by
$$Y_n:=\frac{X_1+\dots+X_n}{n}.$$
It is very easy to see that the sequence $Y_1,Y_4,Y_9,\dots,Y_{n^2},\dots$ converges strongly to zero. Namely, by Chebyshev's inequality one has for all $\varepsilon>0$
$$P\big(|Y_{n^2}|\ge\varepsilon\big)\le\frac{\operatorname{var}(Y_{n^2})}{\varepsilon^2}=\frac{\operatorname{var}(X_1)}{n^2\varepsilon^2}\le\frac{C^2}{n^2\varepsilon^2}.$$
It is immediate from this that
$$\sum_{n=1}^{\infty}P\big(|Y_{n^2}|\ge\varepsilon\big)<+\infty.$$
By Lemma VII.2.13 this implies that
$$\lim_{n\to\infty}Y_{n^2}=0\quad\text{strongly}.$$
The next step is now to sandwich every positive integer k between two successive perfect squares: to every $k\in\mathbb N$ there is a unique number n (one should write $n_k$) such that $n^2<k\le(n+1)^2$.
The remainder of the proof consists in verifying that
$$|Y_k-Y_{n^2}|\to 0\quad\text{if }k\to\infty.$$
To this end, first note that $k-n^2\le 2n+1$. Using this one has
$$\begin{aligned}
|Y_k-Y_{n^2}|
&=\Big|\frac{X_1+\dots+X_k}{k}-\frac{X_1+\dots+X_{n^2}}{n^2}\Big|
=\Big|\frac{n^2(X_1+\dots+X_k)-k(X_1+\dots+X_{n^2})}{kn^2}\Big|\\
&=\Big|\frac{n^2(X_1+\dots+X_{n^2})+n^2(X_{n^2+1}+\dots+X_k)-k(X_1+\dots+X_{n^2})}{kn^2}\Big|\\
&\le\frac{(k-n^2)\,|X_1+\dots+X_{n^2}|}{kn^2}+\frac{n^2\,|X_{n^2+1}+\dots+X_k|}{kn^2}\\
&\le\frac{(2n+1)n^2C}{kn^2}+\frac{n^2(2n+1)C}{kn^2}
\le\frac{(2n+1)n^2C}{n^4}+\frac{n^2(2n+1)C}{n^4}.
\end{aligned}$$
If $k\to\infty$, then also $n=n_k\to\infty$. Hence the above implies that
$$\lim_{k\to\infty}|Y_k-Y_{n_k^2}|=0. \qquad(**)$$
Because $\lim_{k\to\infty}Y_{n_k^2}=0$ strongly, there exists a set C (more precisely, an element $C\in\mathcal A$) such that $P(C)=1$ and such that
$$\lim_{k\to\infty}Y_{n_k^2}(\omega)=0\quad\text{for all }\omega\in C.$$
This, together with $(**)$, implies that
$$\lim_{k\to\infty}Y_k(\omega)=0\quad\text{for all }\omega\in C.$$
That is to say $Y_k\to 0$ strongly if $k\to\infty$. This proves the theorem.
As will be seen in the sequel, the strong law of large numbers can be helpful to get a natural perception of the expectation value of a variable and of the concept of unbiasedness concerning estimators. First of all, however, an intermezzo. Without any problem the reader can skip this, for in the sequel there will be no reliance on its content. Intermezzo. In this intermezzo the strong law of large numbers is translated into the language of functional analysis. Define RN as being the set of all functions f W N ! R. For reasons of notational consistency as to the elements of RN , write x instead of f . So in this notation RN D ¹x j x W N ! Rº: In a natural way one can define on RN an operation, called addition, by for all x; y 2 RN the element x C y 2 RN is defined by .x C y/.i / WD x.i / C y.i /; and a scalar multiplication by for all 2 R and x 2 RN the element x 2 RN is defined by .x/.i / WD x.i /: This turns RN into a linear space, that is, a vector space. A useful metric (see Appendix E) can be defined on RN as follows: d.x; y/ WD
1 X iD0
2
i
jx.i / y.i /j : 1 C jx.i / y.i /j
It is left to the reader to verify that this defines a metric on RN indeed. In this metric a sequence x1 ; x2 ; : : : in RN converges to x if and only if for every fixed i the sequence xk .i / converges to x.i/ in R. In the metric space .RN ; d / addition and scalar multiplication are continuous operations. More precisely:
and
if xk ! x and yk ! y in .RN ; d /; then xk C yk ! x C y in .RN ; d /; if k ! in R and xk ! x in .RN ; d /; then k xk ! x in .RN ; d /:
Vector spaces in which this phenomenon can be observed are called (metrizable) topological vector spaces. Functional analysis is the branch of mathematics dealing with such spaces. Now let .; A; P / be a probability space and let Xi W ! R, where i 2 N , be a sequence of stochastic variables. Then with every outcome ! 2 one could associate a sequence i 7! Xi .!/ 2 R:
This sequence will be denoted by X.!/; so in this notation one may write X.!/.i/ D Xi .!/: For every ! 2 the sequence X.!/ presents an element of RN and thus a map X W ! RN emerges. It can be proved that this map is A-measurable, that is to say: X
1
.A/ 2 A
for all Borel sets A in RN :
For this reason X is called a stochastic variable (assuming its values in RN ). Next, define for all 2 R the set M./ in RN by ² ³ n 1X M./ D x W lim x.i / D : n!1 n iD1
This set, as can be proved, is a Borel set in .RN ; d /. Consequently it makes sense to talk about the probability P X 2 M./ D P X 1 M./ : Details as to measurability and the fact that M./ is Borel in RN are left to the reader. In [67] one can find suitable tools to prove these facts.
Now let .Xi /i2N be a statistically independent sequence of identically distributed variables all of them with expectation . In terms of the above, the strong law of large numbers states that P X 2 M./ D 1: More colloquial, one can be almost sure that the stochastic variable X will show an outcome that is in M./.
As announced before, the strong law of large numbers can be helpful to understand what actually is presented by an expectation value. Namely, if in some experiment one is dealing with a stochastic variable X of expectation , then one could look upon in the following way: When repeating the experiment unendingly (under exactly the same conditions) the value will strike the eye when averaging all the observed outcomes of X . To see this, define the following sequence of stochastic variables: X1 is the outcome in the first experiment, X2 is the outcome in the second experiment, :: : et cetera:
Now the strong law of large numbers expresses the phenomenon that, when computing 1 Pn the averaging sums n iD1 Xi , one can be quite sure that for large n the number will emerge. Similarly, when making estimates of a characteristic . / by means of an unbiased estimator T , one can be confident that (when repeating the experiment many many times) the averaging sums of the individual estimates will generate the exact value of . /. In the next section the strong law of large numbers will be exploited to study the asymptotic behavior of empirical distribution functions.
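This interpretation can be made visible in a small simulation. The following Python fragment is an illustrative sketch and not part of the original text; the choice of an exponential population with mean 2 and of the random seed are arbitrary.

```python
# The running averages of an i.i.d. sequence settle down near the expectation,
# as the strong law of large numbers asserts.
import random

random.seed(1)
mu = 2.0            # population mean of the exponential distribution used below
n_max = 100_000

running_sum = 0.0
for n in range(1, n_max + 1):
    running_sum += random.expovariate(1.0 / mu)   # one observation X_n
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}:  average = {running_sum / n:.4f}   (mu = {mu})")
```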
VII.3 The Glivenko–Cantelli theorem

Let $X_1,X_2,\dots$ be an infinite sample from a population with distribution function F. In §VII.1 it was explained that, for an arbitrary fixed x, the statistic
$$\widehat F(X_1,\dots,X_n)(x)=\frac{\#\{i:i\le n,\;X_i\le x\}}{n}$$
can be looked upon as a sample mean belonging to a sample of size n from a Bernoulli distributed population with parameter $F(x)$. Hence, by the strong law of large numbers one has for every fixed x that
$$\lim_{n\to\infty}\widehat F(X_1,\dots,X_n)(x)=F(x)\quad\text{strongly}.$$
In a similar way it can be derived (see Exercise 41) that
$$\lim_{n\to\infty}\widehat F(X_1,\dots,X_n)(x-)=F(x-).$$
In the estimation process the function F is estimated by means of the step function $x\mapsto\widehat F(X_1,\dots,X_n)(x)$. The maximal (vertical) distance between the graphs of the functions $x\mapsto\widehat F(X_1,\dots,X_n)(x)$ and $x\mapsto F(x)$ is presented by the expression
$$\sup_{x\in\mathbb R}\big|\widehat F(X_1,\dots,X_n)(x)-F(x)\big|. \qquad(*)$$
For bounded functions $\varphi:\mathbb R\to\mathbb R$ it is in (functional) analysis quite usual to denote
$$\|\varphi\|_\infty=\sup_{x\in\mathbb R}|\varphi(x)|.$$
So in this notation the supremum in $(*)$ can be written as
$$\|\widehat F(X_1,\dots,X_n)-F\|_\infty.$$
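The quantity $\|\widehat F(X_1,\dots,X_n)-F\|_\infty$ is easy to evaluate numerically, for $\widehat F$ is a step function and the supremum is attained at its jump points. The following Python sketch is not part of the original text; the standard normal population, the sample sizes and the random seed are arbitrary choices made only for illustration.

```python
# The supremum distance between the empirical distribution function and the
# true distribution function, for samples of increasing size.
import math
import random

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(sample):
    """sup_x | F_hat(x) - Phi(x) |, computed at the jump points of F_hat."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, xi in enumerate(xs, start=1):
        # compare Phi with F_hat just before and at the jump point x_i
        d = max(d, abs(i / n - Phi(xi)), abs((i - 1) / n - Phi(xi)))
    return d

random.seed(2)
for n in (10, 100, 1000, 10000):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    print(f"n = {n:>5}:  sup|F_hat - F| = {sup_distance(sample):.4f}")
```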
The next lemma will be useful for both technical and theoretical reasons. Recall that a subset C R is said to be dense in R if every real number can be represented as the limit of a sequence of elements in C . Lemma VII.3.1. Let F and G be arbitrary distribution functions and let C be an arbitrary dense set in R. Then one has kF
Gk1 D sup jF .x/
G.x/j:
x2C
Proof. See Exercise 38, where the case C D Q is considered. b .X1 ; : : : ; Xn / F k1 is a stochastic Using the lemma above it can be proved that k F variable in the sense that it is a measurable function on the underlying probability space of the variables X1 ; X2 ; : : : . For more details about this, see Exercises 37 and 38. It turns out, as will be seen below, that the quantity b .X1 ; : : : ; Xn / kF
F k1
converges to zero almost surely if n ! 1. This important and interesting result is known as the Glivenko–Cantelli theorem. The next two lemmas are preparatory to the proof of this theorem. Lemma VII.3.2. Let F be an arbitrary distribution function. Then for all fixed " > 0 the set ¹x W F .x/ F .x / "º is finite. Proof. By Lemma VII.2.9 the set ¹x W F .x/
F .x / > 0º
is at most countably infinite. If this set is finite then the statement in the lemma is immediate. If not, then there is an enumeration x1 ; x2 ; : : : of the set in question. For all n one then has n X F .xi / F .xi / 1: iD1
Hence the series
1 X
F .xi /
iD1
F .xi /
is convergent. It follows from this that one has F .xi / for all but finitely many values of i .
F .xi / < "
The proof of the Glivenko–Cantelli will lean on a suitable partitioning of the real axis. The following lemma will be used to bring this about. Lemma VII.3.3. Let " be any positive number and let F be any distribution function. If for all elements x in the interval .a; b/ one has F .x/
F .x / < ";
then there exists a sequence x0 ; x1 ; : : : ; xk with the following properties: (i) a D x0 < x1 < < xk D b,
(ii) F .xi /
(iii) F .xk /
F .xi
1/
F .xk
< " for i D 1; : : : ; k
1/
1,
< ".
Proof. Under the premises of the lemma there exists to every element c 2 .a; b/ a number ı.c/ > 0 such that jF .x/
F .y/j < "
ı.c/; c C ı.c/ :
for all x; y 2 c
To see this, fix any c 2 R. Then F .c/
F .c / D F .c C/
F .c / < ":
It follows that there exists a number ı > 0, depending on c, such that F .c C ı/
F .c
ı/ < ":
Evidently this implies that F .y/
F .x/ < "
for all x; y 2 .c
ı; c C ı/:
Furthermore there is a number ı.a/ > 0 such that jF .x/
F .a/j < "
and a number ı.b/ > 0 such that jF .b /
F .x/j < "
for all x 2 a; a C ı.a/ for all x 2 .b
ı.b/; b/:
Because Œa; b is a compact set, there exist numbers c1 < < cn such that Œa; b Œa; a C 12 ı.a// [ .c1 [ .cn
1 2 ı.c1 /; c1 C 1 2 ı.cn /; cn C
1 2 ı.c1 // 1 2 ı.cn //
[ [ .b
1 2 ı.b/; b:
Next, define the number ı > 0 as follows: ı WD
1 min¹ı.a/; ı.c1 /; : : : ; ı.cn /; ı.b/º: 2
Now, if x and y are in .a; b/ and if jx
a; ı.a/ ; c1
yj < ı, then there is among the intervals
ı.c1 /; c1 C ı.c1 / ; : : : ; .cn
ı.cn /; cn C ı.cn //; .b
ı.b/; b/
at least one containing both x and y . From this it follows that one has for such x and y the following inequality: jF .x/
as soon as
F .y/j < "
jx
yj < ı:
It is now easy to see that every sequence a D x 0 < x1 < < xn D b for which xi xi
1
< ı satisfies the properties (i), (ii) and (iii) listed in the lemma.
Theorem VII.3.4 (V. Glivenko, F. P. Cantelli). Let X1 ; X2 ; : : : be an infinite sample from a population having distribution function F . Then b .X1 ; : : : ; Xn / lim k F
F k1 D 0
n!1
strongly:
Proof. Let .; A; P / be the underlying probability space of the X1 ; X2 ; : : : . Choose any fixed " > 0. It will now be proved that then b .X1 ; : : : ; Xn / lim k F
n!1
F k1 "
almost surely:
.1/
As a first step in this, note that from Lemma VII.3.2, together with the fact that lim F .a/ D 0
a! 1
and
lim F .a/ D 1;
a!C1
it follows that there exists a sequence a0 < a1 < < am such that F .a0 / < ";
F .am / > 1
"
.2/
1 ; ai /
.3/
and F .x/
F .x / < "
if x 2 .ai
for i D 1; : : : ; m. In order to simplify notation, denote
b n .!/ WD F b X1 .!/; : : : ; Xn .!/ : F
Furthermore, set b n .!/ kF
b n .!/ kF
b n .!/ kF
b n .!/ .x/ F k0 WD sup jF
F .x/j;
xa0
F ki WD
sup
ai
1 dL .F; G/ and for this reason one may conclude that dL .G; F / dL .F; G/: Interchanging the role of F and G one obtains dL .F; G/ dL .G; F /: Summarizing: dL .F; G/ D dL .G; F /. Step 2. The equality dL .F; G/ D 0 is possible if and only if F D G . If F D G , then dL .F; G/ D 0; this is trivial. Suppose conversely that dL .F; G/ D 0. For all ˛ > 0 one then has (by Lemma VII.5.1) F .x
˛/
˛ G.x/ F .x C ˛/ C ˛
for all x 2 R. Letting ˛ # 0, it follows from this that F .x / G.x/ F .x C/ D F .x/: Therefore G F . However, by using Step 1 it can be seen that dL .F; G/ D 0 implies dL .G; F / D 0. In turn this implies F G . The conclusion is that F D G , thus finishing the proof of Step 2. Step 3. For all F; G; H 2 one has dL .F; G/ dL .F; H / C dL .H; G/: To prove this, choose arbitrary fixed numbers ˛ and ˇ such that and
˛ > dL .F; H /
ˇ > dL .H; G/:
By Lemma VII.5.1 it follows that for all x 2 R the following inequalities hold: F .x
˛/
˛ H.x/ F .x C ˛/ C ˛;
H.x
ˇ/
ˇ G.x/ H.x C ˇ/ C ˇ:
Combining these inequalities it follows that for all x 2 R G.x/ H.x C ˇ/ C ˇ F .x C ˛ C ˇ/ C ˛ C ˇ and G.x/ H.x
ˇ/
ˇ F .x
˛
ˇ/
˛
ˇ:
By definition of dL .F; G/ this implies dL .F; G/ ˛ C ˇ: Because this holds for all ˛ > dL .F; H / and all ˇ > dL .H; G/ one arrives at the conclusion that dL .F; G/ dL .F; H / C dL .H; G/: This proves Step 3 and, together with Steps 1 and 2, the proposition. Definition VII.5.2. The metric dL in the previous proposition is called the Lévy metric on . Remark. The Lévy metric is named after the well-known French mathematician PaulPierre Lévy (1886–1971).
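Because the defining infimum ranges over all real x, the Lévy distance is awkward to evaluate by hand. The following Python sketch, which is not part of the original text, approximates it numerically for two concrete distribution functions; the grid, the tolerance and the bisection scheme are ad hoc choices made only for illustration.

```python
# Rough numerical approximation of the Levy distance
#   d_L(F, G) = inf{ eps > 0 : F(x-eps)-eps <= G(x) <= F(x+eps)+eps for all x },
# found by bisection over eps, checking the condition on a finite grid of x.
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def levy_distance(F, G, lo=-10.0, hi=10.0, points=4001, tol=1e-4):
    xs = [lo + i * (hi - lo) / (points - 1) for i in range(points)]

    def ok(eps):
        return all(F(x - eps) - eps <= G(x) <= F(x + eps) + eps for x in xs)

    a, b = 0.0, 1.0           # the Levy distance never exceeds 1
    while b - a > tol:
        mid = 0.5 * (a + b)
        if ok(mid):
            b = mid
        else:
            a = mid
    return b

# distance between the N(0,1) distribution function and a shifted copy of it
F = Phi
G = lambda x: Phi(x - 0.5)
print(f"approximate Levy distance = {levy_distance(F, G):.3f}")
```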
Example 1 showed that the concept of distance belonging to the norm metric is not quite in tune with intuition. This discord dissolves when replacing dN by dL . Namely, one has: Proposition VII.5.3. If Ea D 1Œa;C1/ , then dL .Ea ; Eb / D min.1; ja
bj/ ja
bj:
Proof. See Exercise 23. Proposition VII.5.4. For the norm metric and the Lévy metric one has dL .F; G/ dN .F; G/ for all F; G 2 . Proof. See Exercise 20. Hence, if a sequence F1 ; F2 ; : : : converges to F in .; dN /, then it also converges to F in .; dL /. The converse is not true; looking back on Example 1 and Proposition VII.5.3 a counterexample is easily found. However, if the limit F is a continuous distribution function, then the metrics dL and dN are looking through the same glasses (concerning convergence to F ). The following lemma is proved as a first step to see this: Lemma VII.5.5. A continuous distribution function is automatically uniformly continuous on R. Proof. Let F W R ! Œ0; C1/ be a continuous distribution function. Choose an arbitrary fixed " > 0. Because lim F .x/ D 0
x! 1
and
lim F .x/ D 1
x!C1
there exist real numbers a and b such that 0 F .x/ " " F .x/ 1
1
for all x 2 . 1; a/;
for all x 2 .b; C1/:
()
The interval Œa "; b C " is bounded and closed, that is, compact. In elementary mathematical analysis there is a well-known theorem (due to the German mathematician Karl T. W. Weierstraß (1815–1897)) stating that a continuous function on such an interval is automatically uniformly continuous. So by this theorem there exists a ı > 0 such that 0 < ı < " and such that for all x; y 2 Œa "; b C " one has jx
yj < ı H) jF .x/
F .y/j < ":
./
In virtue of ./ and ./ it follows that for all x; y 2 R jF .x/ whenever jx
F .y/j < ";
yj < ı. This proves the lemma.
Using the previous lemma it is easy to prove the next theorem: Theorem VII.5.6. Let F1 ; F2 ; : : : be a sequence in . If F is a continuous distribution function, then Fk ! F in .; dN / ” Fk ! F in .; dL / : Proof. The implication ) is a direct consequence of Proposition VII.5.4. To prove the implication (, suppose that F is continuous and that Fk ! F in .; dL /. Choose an arbitrary " > 0. By the foregoing lemma F is uniformly continuous, so there exists a ı > 0 such that 0 < ı < " and such that jx
yj < ı H) jF .x/
F .y/j < ":
./
Because Fk ! F in .; dL /, there exists an integer N such that dL .Fk ; F / < 12 ı
for all k N:
By Lemma VII.5.1 this implies that for k N one has F .x
1 2 ı/
1 2ı
Fk .x/ F .x C 21 ı/ C 12 ı
for all x 2 R. In this way it follows that for k N and arbitrary x 2 R the numbers Fk .x/ and F .x/ both belong to the interval F .x
1 2 ı/
1 2 ı; F .x
C 12 ı/ C 12 ı :
Because of ./ this interval has a length less than 2". Consequently one has for all kN jFk .x/ F .x/j 2" for all x 2 R: So for every " > 0 there exists an N 2 N such that dN .Fk ; F / D kFk
F k1 2"
This is the same as saying that Fk ! F in .; dN /.
for all k N:
Proposition VII.5.4 and Theorem VII.5.6 are summarized in the scheme below. Fk ! F in .; dN / H) Fk ! F in .; dL / (H if F continuous The content of the next theorem is, roughly spoken, that convergence in distribution can be captured in terms of the Lévy metric. Theorem VII.5.7. For every F 2 and every sequence F1 ; F2 ; : : : in the following two statements are equivalent: (i) Fk ! F in .; dL /,
(ii) Fk .x/ ! F .x/ in every point of continuity of F . Proof. To prove the implication (i) ) (ii), suppose Fk ! F in .; dL /. One has to deduce from this that Fk .x/ ! F .x/ for every point of continuity of F . To this end, choose an arbitrary fixed " > 0. Because Fk ! F in .; dL /, there exists an integer N" such that dL .Fk ; F / < " for all k N" : By Lemma VII.5.1 this implies that for all k N" one has F .x
" Fk .x/ F .x C "/ C "
"/
for all x 2 R. It follows from this that for all " > 0 and all x 2 R F .x
"/
" lim Fk .x/ lim Fk .x/ F .x C "/ C ": k!1
k!1
By letting " # 0 it follows that F .x / lim Fk .x/ lim Fk .x/ F .x C/: k!1
k!1
In a point of continuity x one has F .x / D F .x C/ and for this reason one has in such points lim Fk .x/ D F .x/: k!1
This proves the implication (i) ) (ii). Next the implication (ii) ) (i) is proved. Suppose that Fk .x/ ! F .x/ in all points of continuity of F . It must be derived from this that lim dL .Fk ; F / D 0:
k!1
To this end, choose an arbitrary fixed " > 0. The proof will be completed once it is shown that there exists an N" such that for all k N" :
dL .Fk ; F / "
Because F is a distribution function, there exist real numbers a and b such that 0 F .x/ 12 "
1
1 2"
for all x 2 . 1; a;
()
for all x 2 Œb; C1/:
F .x/ 1
By Lemma VII.2.9 one may assume, without loss of generality, that F is continuous in the points a and b. Now choose a sequence x0 ; x1 ; : : : ; xn of points of continuity of F such that a D x0 < x1 < < xn D b and such that xi
xi
1
0. Because the sequence F1 ; F2 ; : : : is Cauchy in .; dL /, there is an integer N" 2 N such that dL .Fm ; Fn / < "
for all m; n N" :
By Lemma VII.5.1 this implies that for all x 2 R and all m; n N" one has Fn .x
"/
" Fm .x/ Fn .x C "/ C ":
It follows that for all x 2 R and all m N" lim Fn .x
"/
n!1
" Fm .x/ lim Fn .x C "/ C "; n!1
that is: H.x
"/
" Fm .x/ G.x C "/ C ":
./
Hence for all " > 0 the following inequality holds: H.x Letting " # 0 one arrives at
"/
" G.x C "/ C ":
H.x / G.x C/:
./
From ./ and ./ together it follows that in a point x in which both G and H are continuous one necessarily has G.x/ D H.x/ D lim Fk .x/: k!1
Next, define the function F W R ! Œ0; 1 by 1 F .x/ WD ¹G.x C/ C H.x C/º: 2 By Lemma VII.5.8 this function is right-continuous and one has F .x/
F .x / D
1® G.x C/ 2
G.x / C H.x C/
¯ H.x / :
It follows that in a point x where F is continuous automatically also G and H are continuous. In connection to the definition of F , this implies that for every point of continuity (of F ) one has lim Fk .x/ D F .x/: k!1
Now, by Theorem VII.5.7, the proof is finished once one has proved that F is a distribution function on R. It is already known that F is right-continuous and increasing, so the only thing left to prove is that lim F .x/ D 0
x! 1
and
lim F .x/ D 1:
x!C1
To this end, look back on ./ and conclude that for fixed m N" one has lim Fm .x/
x!C1
lim G.x/ C ":
x!C1
However, Fm is a distribution function, so 1
lim G.x/ C ":
x!C1
This holds for all " > 0, hence, because G.x/ 1 for all x 2 R, one has 1
lim G.x/
x!C1
lim G.x/ 1:
x!C1
So lim G.x/ D 1:
x!C1
Moreover, for all x 2 R one has and it follows from this that
G.x/ H.x/ 1 lim H.x/ D 1:
x!C1
Using Lemma VII.5.8 again it follows that lim F .x/ D 1:
x!C1
In a similar way it can be proved that lim F .x/ D 0;
thus finishing the proof.
x! 1
Now that the completeness of .; dL / has been proved, this section is completed by presenting its last theorem (though in the sequel there will be no reliance on this theorem): Theorem VII.5.10. The metric space .; dL / is separable. Proof. See Exercise 29. On there exist other metrics that are of considerable importance in stochastic analysis. Without going in details some of them are mentioned here:
The Prokhorov metric. To every F 2 there is, as explained in §I.7, a unique Borel measure P on R such that F .x/ D P Œ. 1; x
for all x 2 R:
This measure will be denoted by PF . For every Borel set A R and every ı > 0 the set Aı is defined as Aı WD ¹x 2 R W jx
aj < ı
for some a 2 Aº:
In these notations the Prokhorov metric dP on can be defined as follows: dP .F; G/ WD inf¹ı 0 W PF .A/ PG .Aı / C ı
and PG .A/ PF .Aı / C ı for all A Borel in Rº:
The Prokhorov metric and the Lévy metric generate the same topology on , but they are not equivalent in the sense of Appendix E (see Exercise 28). The Prokhorov metric is named after the Russian mathematician Youri V. Prokhorov (born in 1929). More details can be found for example in [10], [17], [39]. The bounded Lipschitz metric. This metric will be denoted by dBL . It is defined in the following way: ˇZ ˇ Z ˇ ˇ dBL .F; G/ WD sup ˇˇ ' dPF ' dPG ˇˇ ; where the supremum is taken over all functions that satisfy the Lipschitz condition: j'.x/
'.y/j
jx yj 1 C jx yj
.x; y 2 R/:
It can be proved (see [39]) that for all F; G 2
dP .F; G/2 dBL .F; G/ 2 dP .F; G/: From this it follows that dP and dBL are equivalent metrics (in the sense of Appendix E). In particular they generate the same topology on . The Skorokhod metric. The definition of this metric can not be captured in one or two words. The Skorokhod metric is defined not only on but even on D.R/, the set of all cadlag functions. In the Skorokhod metric, is a closed subset of D.R/. Denoting the Skorokhod metric by dS , one has Fk ! F in .; dN / H) Fk ! F in .; dS / H) Fk ! F in .; dL / :
The Skorokhod metric is named after the Ukrainian mathematician A. V. Skorokhod (born in 1930). He also proved Theorem VII.2.11 in a far more general setting than presented in this book. For more details see for example [10], [56]. Metrics on will be exploited in §VII.7 where a precise meaning of the concept of robustness of an estimator will be given.
VII.6 Smoothing techniques
Introduction. In order to estimate the unknown distribution function of some popb associated with a sample ulation, one could use the empirical distribution function F from the population in question. An empirical distribution function, however, always shows discontinuities. If beforehand it is known that F is continuous, then there may be a wish to make an estimate of F that is also continuous itself. This could be done, b . That is to say, by modifying F b a bit in for example, by so-called smoothing of F order to turn it into a continuous function. Smoothing techniques are often based on what is called “convolution”. What this means is explained in this section. Let F and G be arbitrary distribution functions. Then there exists (see Exercise 17) a pair of stochastic variables X; Y such that: I: The distribution function of X is F , II: The distribution function of Y is G , III: X and Y are statistically independent. The sum X C Y of such variables is, of course, again a stochastic variable. The distribution function of such a sum will be denoted by F G . This distribution function does not depend on the particular choice of the variables X and Y . To see this, choose a second pair of variables X 0 ; Y 0 having properties I, II and III. Then the two 2-vectors .X; Y / and .X 0 ; Y 0 / are identically distributed (see Exercise 21), which implies (see §I.11, Exercise 27) that X C Y and X 0 C Y 0 are also identically distributed. It follows that the distribution function F G is defined in a correct way. Definition VII.6.1. The function F G is called the distribution product of the functions F and G . In this section, as before, the set of all distribution functions on R will systematically be denoted by . Furthermore, the distribution function of the constant stochastic variable X D 0 will be denoted by E . So ´ 0 if x < 0; E.x/ D 1 if x 0:
Proposition VII.6.1. The distribution product on has the following three properties: (i) .F G/ H D F .G H / for all F; G; H 2 ,
(ii) F G D G F for all F; G 2 ,
(iii) E F D F E D F for all F 2 . Proof. (i) Choose (see Exercise 17) a statistically independent system X; Y; Z such that X; Y and Z have distribution functions F; G and H respectively. Exploiting the fact that both the pair X C Y; Z and the pair X; Y C Z are statistically independent systems it follows that both .F G/ H
and
F .G H /
present the distribution function of X C Y C Z. For this reason they are equal. Statement (ii) is immediate from the definition of F G . To see (iii), let X be a variable that has F as its distribution function and let Y D 0. Then Y has E as its distribution function. Moreover, X and Y by Lemma I.5.13 form a statistically independent system. Because X C Y D X C 0 D X; it follows that F E D F . By using (ii) it can be seen that also E F D F . Remark. The proposition above can be reformulated by saying that , equipped with the operation , is a semigroup with unit E . It can be proved (see Exercise 36) that is definitely not a group in the sense of Galois. As noticed in §I.7, to every F 2 there is a unique Borel measure PF on R such that PF . 1; x D F .x/: In this notation the following proposition is formulated.
Proposition VII.6.2. For all F; G 2 and all x 2 R one has Z .F G/.x/ D F .x y/ dPG .y/: Proof. Let F and G be arbitrary elements in . Choose a statistically independent pair X; Y such that X and Y have distribution functions F and G respectively. Then PX D PF
and
PY D PG :
Because X and Y are independent, one has furthermore P.X;Y / D PX ˝ PY D PF ˝ PG : Next, choose an arbitrary fixed element c 2 R. Then one may write
where
.F G/.c/ D P .X C Y c/ D P.X;Y / A.c/ ; A.c/ D ¹.x; y/ 2 R2 W x C y cº:
Hence .F G/.c/ D
Z
1A.c/ .x; y/ dP.X;Y / .x; y/
Z
1A.c/ .x; y/ d.PF ˝ PG /.x; y/ Z °Z ± (Fubini’s theorem) D 1A.c/ .x; y/ dPF .x/ dPG .y/: D
However, it is easily verified that 1A.c/ .x; y/ D 1. For this reason one has Z ²Z .F G/.c/ D 1. 1;c Z D PF . 1; c This proves the proposition.
1;c y .x/:
³ y .x/ dPF .x/ dPG .y/ Z y dPG .y/ D F .c
y/ dPG .y/:
In connection to the representation of F G in the proposition above, the concept of a convolution product is now defined: Definition VII.6.2. Let ' W R ! R be a bounded Borel function and P a probability measure on R. The function ' P W R ! R, defined by Z .' P /.x/ WD '.x y/ dP .y/ .x 2 R/; is called the convolution product of ' and P .
If P D f , where f is a probability density and the Lebesgue measure on R, then Z .' P /.x/ D ' .f / .x/ D '.x y/f .y/ dy:
In that case (see also §I.11, Exercise 50) one usually denotes ' f instead of ' .f /. In terms of a convolution product the content of Proposition VII.6.2 can be reformulated as F G D F PG : The next proposition describes the behavior of a convolution product under differentiation. Proposition VII.6.3. Suppose ' is continuously differentiable with a derivative that is bounded on R. Then ' P is also continuously differentiable and one has .' P /0 D ' 0 P : Proof. For all x 2 R and all h ¤ 0 one has Z '.x C h .' P /.x C h/ .' P /.x/ D h
y/ h
'.x
y/
dP .y/:
./
Because ' 0 is presumed to be bounded on R, there exists a constant M 0 such that j' 0 .t /j M for all t 2 R. Using the mean value theorem one arrives at the inequality ˇ ˇ ˇ '.x C h y/ '.x y/ ˇ ˇ ˇM ˇ ˇ h for all x; y 2 R, provided h ¤ 0. Now choose a sequence h1 ; h2 ; : : : such that hn ! 0 if n ! 1. Applying Lebesgue’s theorem on dominated convergence (see Appendix A) it follows that Z .' P /0 .x/ D ' 0 .x y/ dP .y/ D .' 0 P /.x/:
In a similar way it can be proved that ' 0 P is continuous on R. Our smoothing techniques will be based on the above proposition. Proposition VII.6.4. Suppose that F; G 2 and that G is continuously differentiable with a bounded derivative. Then G F is continuously differentiable and its derivative is given by .G F /0 D G PF :
Proof. By Proposition VII.6.2 one may write G F D G PF ; so this proposition is an immediate consequence of Proposition VII.6.3. Remark. If G is p times differentiable and if all derivatives of order p are bounded and continuous, then G F is also p times differentiable with continuous derivatives. This can be verified by repeated application of Proposition VII.6.3. Example 1. Let ˆ be the distribution function belonging to the standard normal distribution. This distribution function is infinitely many times differentiable and all its derivatives are bounded. It follows from this that for an arbitrary F 2 the function ˆ F is infinitely many times differentiable. As in §VII.5 the characteristic function belonging to an element F 2 will be denoted by F . Proposition VII.6.5. For all F; G 2 one has F G .t / D F .t / G .t / .t 2 R/: Proof. With the definition of F G in mind, this is a direct consequence of Proposition I.8.8. It is now very easy to prove: Proposition VII.6.6. If Fk ! F and Gk ! G in .; dL / then also Fk Gk ! F G
in
.; dL /:
Proof. To prove this, note that for such sequences one has for all t 2 R Fk .t / ! F .t /
and
Gk .t / ! G .t /:
However, this means that Fk Gk .t / D Fk .t / Gk .t / ! F .t / G .t / D F G .t / for all t 2 R. In turn this implies (see Theorem II.5.7) that Fk Gk ! F G in .; dL /, which proves the proposition. Remark. This proposition expresses the fact that the operation on .; dL / is a continuous operation. For this reason , equipped with the operation and the metric dL , is called a “(metrizable) topological semigroup”.
Next, for every G 2 and every > 0, define the distribution function G by G .x/ WD G.x=/
.x 2 R/:
If X is a stochastic variable with distribution function G , then G is the distribution function corresponding to X . Proposition VII.6.7. For all G 2 one has lim dL .G ; E/ D 0:
#0
Proof. If x > 0, then lim G .x/ D lim G.x=/ D
#0
#0
lim G.y/ D 1:
y!C1
On the other hand, if x < 0, then lim G .x/ D lim G.x=/ D lim G.y/ D 0:
#0
y! 1
#0
Thus it follows that lim G .x/ D E.x/
#0
in every point of continuity x of E . By Theorem VII.5.7 this implies that lim dL .G ; E/ D 0:
#0
This proves the proposition. The foregoing propositions are now summarized in the following theorem. Theorem VII.6.8. For all F; G 2 F one has lim dL .F G ; F / D lim dL .G F; F / D 0:
#0
#0
Proof. Combine Propositions VII.6.6 and VII.6.7. This theorem, together with Proposition VII.6.4, expresses the fact that discontinuous distribution functions can always be approximated (in the Lévy metric) by smooth distribution functions. One therefore talks about smoothing techniques.
Example 2. Denote, as always, the distribution function of the N.0; 1/-distribution by ˆ. For all > 0 the function ˆ is infinitely many times differentiable and all derivatives are bounded. It follows from this that for an arbitrary F 2 the distribution functions F ˆ D ˆ F D ˆ PF are infinitely many times differentiable. Moreover
lim dL .ˆ F; F / D 0:
#0
This shows that an arbitrary F in can always be approximated (in the Le´vy metric) by super smooth elements in . When estimating an unknown distribution function of some population, smoothing techniques can be used as follows: b , based on the outExample 3. Suppose that the empirical distribution function F come of a sample, is used as an estimate of the unknown distribution function F of some population. There is, however, a wish to estimate F by means of a continuous distribution function. This is very well possible. Just choose > 0 so small that (for b is very close to F b . Then replace the estimate F b by ˆ F b. example) ˆ F
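Example 3 can be carried out concretely as follows. The Python sketch below is not part of the original text; it evaluates the smoothed estimate $\Phi_\lambda*\widehat F$, which for a sample $x_1,\dots,x_n$ reduces to $\frac1n\sum_i\Phi((x-x_i)/\lambda)$. The sample values (the y-measurements of Example 1 in §IV.1) and the choice $\lambda=0.3$ are for illustration only.

```python
# Smoothing an empirical distribution function by convolution with Phi_lambda:
#   (Phi_lambda * F_hat)(x) = (1/n) * sum_i Phi((x - x_i)/lambda).
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def empirical_cdf(sample, x):
    return sum(1 for xi in sample if xi <= x) / len(sample)

def smoothed_cdf(sample, x, lam):
    return sum(Phi((x - xi) / lam) for xi in sample) / len(sample)

sample = [2.74, 4.50, 5.53, 6.06, 7.27]
lam = 0.3      # a small lambda gives a smooth function close to F_hat
for x in (3.0, 4.5, 5.5, 6.5, 7.5):
    print(f"x = {x:4.1f}   F_hat = {empirical_cdf(sample, x):.2f}"
          f"   smoothed = {smoothed_cdf(sample, x, lam):.2f}")
```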
VII.7 Robustness of statistics
In this section the concept of robustness, as sketched at the end of §I.10, will be studied in more detail. Let g W Rn ! R be a Borel function and let X1 ; : : : ; Xn be a sample from a population having a distribution function F . Then one may talk about the statistic T D g.X1 ; : : : ; Xn /. The distribution function FT of this quantity generally depends on the distribution function F of the population. So in a natural way a map e g W ! emerges, defined by e g W F 7! FT :
Here is, as before, the set of all distribution functions on R. Remark. Let PF be the unique Borel measure on R that satisfies PF . 1; x D F .x/ .x 2 R/:
In this notation the distribution function e g .F / can also be characterized as e g .F /.x/ D g.PF ˝ ˝ PF / . 1; x
where g.PF ˝ ˝PF / is the image of PF ˝ ˝PF under the map g W Rn ! R. A statistic T D g.X1 ; : : : ; Xn / is called robust if a small perturbation of the distribution function F cannot possibly bring about large deviations in the distribution function FT . This can be formulated more precisely when, in some way, continuity of the map e g W F 7! FT
is brought forward (see Appendix E).
Definition VII.7.1. Let d1 and d2 be metrics on . A statistic T D g.X1 ; : : : ; Xn / is said to .d1 ; d2 /-continuous in F 2 if the map e g W .; d1 / ! .; d2 /
is continuous in F . If e g is continuous in every F 2 , then T is said to be .d1 ; d2 /continuous on . In cases where d1 D d D d2 one talks about d -continuity rather than .d; d /-continuity. Remark. If d10 generates the same topology as d1 and d20 the same topology as d2 , then .d10 ; d20 /-continuity is the same as .d1 ; d2 /-continuity. As in the foregoing sections the Lévy metric on will be denoted by dL . In this notation the following theorem, which generalizes the result in Exercise 16, is formulated. Theorem VII.7.1. If the function g W Rn ! R is continuous, then the statistic T D g.X1 ; : : : ; Xn / is dL -continuous on . Proof. Suppose g W Rn ! R is continuous. In order to prove that T is a dL continuous statistic one has to prove that the map e g W ! is dL -continuous in every point F 2 . To this, choose an arbitrary F 2 and suppose that Fk ! F . It needs to be clarified that this implies that e g .Fk / ! e g .F / in .; dL /. To this end a generalized version of Skorokhod’s will be needed (see Exercise 18). This version of the theorem states there exists in this scenario a stochastic n-vector X D .X1 ; : : : ; Xn / together with a sequence stochastic n-vectors Xk D .X1k ; : : : ; Xnk / such that: (i) The variables X1 ; : : : ; Xn have a common distribution function F and they form a statistically independent system. (ii) For all fixed k the variables X1k ; : : : ; Xnk have a common distribution function Fk and they form a statistically independent system. (iii) Xk ! X strongly if k ! 1.
From (iii), together with the fact that g is continuous, it immediately follows that g.Xk / ! g.X/ strongly if k ! 1. This implies that also g.Xk / ! g.X/ in distribution if k ! 1. However, by construction g.Xk / has e g .Fk / and g.X/ has e g .F / as its distribution function. So e g .Fk /.x/ ! e g .F /.x/
in every point of continuity x of e g .F /. By Theorem VII.5.7 this means that e g .Fk / ! e g .F /
in .; dL /:
This completes the proof of the theorem.
By the theorem above both the sample mean and the sample variance are dL continuous statistics. In mathematical statistics these quantities are usually not considered as being robust. The intuitive statistical concept of robustness of a statistic T D g.X1 ; : : : ; Xn / is not captured by saying that T is dL -continuous. Below the concept of robustness will therefore be defined as a stronger form of continuity: Suppose that for all n D 1; 2; : : : a Borel function gn W Rn ! R is given. For any infinite sample X1 ; X2 ; : : : such a sequence of Borel functions generates a sequence of statistics T1 ; T2 ; : : : defined by Tn WD gn .X1 ; : : : ; Xn /
.n D 1; 2; : : :/:
In mathematical statistics robustness is often defined in terms of these sequences of statistics. Along the lines proposed by Frank R. Hampel in his thesis (1968) the concept of robustness is now defined in terms of equicontinuity (see Appendix E): Definition VII.7.2. The sequence T1 ; T2 ; : : : is called .d1 ; d2 /-robust in F if the sequence of maps e g n W .; d1 / ! .; d2 / .n D 1; 2; : : :/
is equicontinuous in F . If the sequence e g 1; e g 2 ; : : : is equicontinuous in every point F in , then the sequence T1 ; T2 ; : : : is said to be .d1 ; d2 /-robust on . Remark. If d10 generates the same topology as d1 and d20 the same topology as d2 , then .d10 ; d20 /-robustness is the same as .d1 ; d2 /-robustness.
In the remainder of this section it will be proved that the sequence of sample means, belonging to an infinite sample, is in no element of F in a dL -robust sequence. To this end some preparatory lemmas and propositions are needed. They are interesting in themselves.
Lemma VII.7.2. A characteristic function F of an arbitrary distribution function F always has the following three properties: (i) F W R ! C is a continuous function,
(ii) F .0/ D 1,
(iii) jF .t /j 1 for all t 2 R. Proof. See Exercise 30. Lemma VII.7.3. There exists a distribution function G 2 such that ² ³n t lim G D0 n!1 n
for all t ¤ 0. Proof. See Appendix F. For reasons that become clear later on one now defines: Definition VII.7.3. An element G 2 that satisfies ² ³n t D 0 for all t ¤ 0 lim G n!1 n will be called a stoutly tailed distribution function. Populations having a stoutly tailed distribution function show unruly behavior with respect to the law of large numbers: Proposition VII.7.4. Let X1 ; X2 ; : : : be an infinite sample from a population having a stoutly tailed distribution function. Then the corresponding sequence X1 C C Xn n
.n D 1; 2; : : :/
of sample means does not converge in distribution. Proof. Let G be the stoutly tailed distribution function of the population. Combining Propositions I.8.6 and I.8.8 it follows that the characteristic function of the statistic Tn D
X1 C C Xn n
can be represented as ² ³n t Tn .t / D G n
.t 2 R/:
This sequence of characteristic functions converges pointwise to the function , defined by ´ 0 if t ¤ 0; .t / D 1 if t D 0: However, the function is not continuous and for this reason it cannot be the characteristic function belonging to some distribution function. By Lévy’s theorem (Theorem I.8.5) this implies that the sequence T1 ; T2 ; : : : is not convergent in distribution. Looking back on the weak or strong law of large numbers, it follows from the proposition above that the mean of a population having a stoutly tailed distribution function does not exist. Roughly spoken, such a thing occurs when expressions of the form PG . 1; a and/or
PG Œa; C1/
are tending too slowly to zero if jaj ! 1. These sort of probabilities are called left tail and right tail probabilities. This is the reason they are called here “stoutly tailed” distribution functions. Being “stoutly tailed” enjoys the following permanence properties: Lemma VII.7.5. Suppose G 2 is stoutly tailed. Then: (i) For all > 0 the function G W x 7! G.x=/ is stoutly tailed.
(ii) For all F 2 the function F G is stoutly tailed. Proof. (i) follows from the fact that (see Exercise 25) G .t / D G .t /:
(ii) is a direct consequence of Proposition VII.6.5, which says that GF .t / D G .t / F .t /: Remark. Denote, as always, the distribution function of the N.0; 1/-distribution by ˆ. Then for all G 2 the distribution function G ˆ is smooth (see §VII.6, Example 1). By the lemma above, together with Lemma VII.7.3, it follows that there exist stoutly tailed distribution functions that are smooth.
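The unruly behaviour described in Proposition VII.7.4 is easy to observe in a simulation. The following Python sketch is not part of the original text; the specific heavy-tailed population (the law of $1/Z^2$ with Z standard normal, a one-sided stable law of index 1/2) is chosen here merely as a convenient example of a population whose mean does not exist, and the seed and sample sizes are arbitrary.

```python
# Sample means from a very heavy-tailed population do not settle down: the
# averages keep jumping by orders of magnitude as n grows.
import random

random.seed(3)
running_sum = 0.0
for n in range(1, 1_000_001):
    z = random.gauss(0.0, 1.0)
    running_sum += 1.0 / (z * z)      # one observation from the heavy-tailed law
    if n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"n = {n:>8}:  sample mean = {running_sum / n:12.2f}")
```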
Proposition VII.7.6. Every distribution function is in .; dL / the limit of a sequence of stoutly tailed (smooth) distribution functions. In other words, the set of stoutly tailed distribution functions is dense in .; dL /. Proof. Choose any (smooth) stoutly tailed element G 2 . Set k WD 1=k, where k D 1; 2; : : : . Now, by Theorem VII.6.8, one has that for all F 2 G k F ! F
in .; dL /:
This, together with the previous lemma (and Proposition VII.6.4), proves the proposition. Roughly spoken, the above expresses the fact that there are plenty of stoutly tailed elements in . There is in also an abundance of elements that are, in contrast to stoutly tailed elements, “tailless”: Definition VII.7.4. A distribution function F is said to be tailless if there exist real numbers a and b such that F .a/ D 0
and F .b/ D 1:
Example. The distribution function F belonging to the uniform distribution on the interval .˛; ˇ/ is tailless, for one has F .˛/ D 0 and F .ˇ/ D 1: Proposition VII.7.7. If X is a stochastic variable such that FX is tailless, then all moments of X exist. Proof. See Exercise 43. A special case of tailless distribution functions is presented by functions of type b .x1 ; : : : ; xn /, where x1 ; : : : ; xn 2 R and F b .x1 ; : : : ; xn / WD #¹i W xi xº : F n
As agreed in §VII.1, this kind of functions are called empirical distribution functions. Proposition VII.7.8. The set of all empirical distribution functions is dense in the metric space .; dL /.
Proof. Let F be an arbitrary element in . Choose an arbitrary fixed " > 0. Then there exists (just smoothen F ) a continuous distribution function G such that dL .F; G/ < "=2:
./
Next, choose a positive integer n such that 1=n < "=2 and choose real numbers x1 ; : : : ; xn such that G.xi / D
i n
for
i D 1; : : : ; n
1;
and
G.xn / > 1
1 : n
Because G is continuous the above is possible by the Weierstrass theorem on intermediate values. It is easy to see that then for all x 2 R jG.x/ Therefore
b .x1 ; : : : ; xn /.x/j < "=2: F
b .x1 ; : : : ; xn / D kG dN G; F
b .x1 ; : : : ; xn /k1 "=2: F
By Proposition VII.5.4 it follows from this that
b .x1 ; : : : ; xn / < "=2: dL G; F
./
From ./ and ./ it now follows that
b .x1 ; : : : ; xn / < ": dL F; F
Because F 2 and " > 0 were chosen arbitrarily, this means that the empirical distribution functions form a dense set in .; dL /. Theorem VII.7.9. There is in no element in which the sample mean is dL -robust. Proof. As before, define gn W Rn ! R by gn .x1 ; : : : ; xn / WD
x1 C C xn : n
To prove that the sample mean is nowhere dL -robust on , one has to show that the sequence e g n W .; dL / ! .; dL / .n D 1; 2; : : :/
cannot possibly be equicontinuous in any element of . Appendix E, Theorem 10, is now applied to bring this about: Let K WD ¹F W the sequence e g 1 .F /; e g 2 .F /; : : : is convergent in .; dL /º:
Section VII.8 Trimmed means, the median and their robustness
341
This set K contains all tailless elements in . To see this, let F be an arbitrary tailless element in . Choose an infinite sample X1 ; X2 ; : : : from a population having F as its distribution function. By Proposition VII.7.7 the mean of such a population exists; it will be denoted by .F /. Now the weak law of large numbers (Theorem VII.2.3) guarantees that X1 C C Xn Tn WD ! .F / n in probability and therefore also in distribution. However, the distribution function of Tn is precisely e g n .F /. In this way it follows that the sequence e g 1 .F /; e g 2 .F /; : : : converges in .; dL /, that is, F is an element of K . So K contains all tailless elements in . By Proposition VII.7.8 this implies that K is dense in .; dL /. On the other hand, by Proposition VII.7.4, the set K c contains all stoutly tailed elements. It follows from this (Proposition VII.7.6) that K c is also dense .; dL /. Furthermore, the metric space .; dL / is complete (Theorem VII.5.9). Thus all premises of Appendix E Theorem 10 are satisfied and one may conclude that the sequence e g 1; e g 2 ; : : : is nowhere equicontinuous on . This proves the theorem.
The theorem above is a strong dissonant in the songs of praise as to the sample mean. In the next section robust statistics will be constructed that can be used to estimate “the center” of a population. The idea to capture robustness of statistics in terms of equicontinuity is called the qualitative approach of robustness. The so-called quantitative approach of robustness is based on the concept of the “asymptotic break-even-point” (see [31]). The quantitative approach will not be discussed in this book. Last but not least there is the so-called infinitesimal approach of robustness; this will be discussed in §VII.10. More detailed studies on robustness can be found in [30], [31], [39].
VII.8 Trimmed means, the median and their robustness In the previous section it was explained that the sample mean provides an example of a statistic that is not robust. To estimate the “center” of a population in a robust way one can use (for example) a so-called “trimmed mean”. Trimmed means can be defined as follows: At a prescribed percentage, say 10%, one deletes from the outcome of a sample the 10% smallest and the 10% largest observations. Of the remaining 80% of the observations the arithmetic mean is taken. The result is now called the “10%-trimmed mean of the sample”. Of course the above can be generalized to any arbitrarily prescribed percentage between 0 and 50.
Trimmed means have been in use since the old days. One uses them when there is a wish to limit the impact of so-called "outliers". An outlier in a sample is understood to be an observation that finds itself in an isolated position relative to the other observations. It turns out, as will be seen in this section, that trimmed means are robust statistics in the sense of Definition VII.7.2. Trimmed means can be defined in a more precise way in terms of the quantile function belonging to the empirical distribution function. Recall that for an arbitrary sample outcome $x_1,\dots,x_n$ the empirical distribution function $\widehat{F} = \widehat{F}(x_1,\dots,x_n)$ was defined as
$$\widehat{F}(x) = \widehat{F}(x_1,\dots,x_n)(x) = \frac{\#\{i : x_i \le x\}}{n}.$$
Definition VII.8.1. The quantile function $\widehat{q}(x_1,\dots,x_n) : (0,1) \to \mathbb{R}$ belonging to the empirical distribution function $\widehat{F}(x_1,\dots,x_n)$ is called the empirical quantile function. In the notations of §VII.2 one may write $\widehat{q}(x_1,\dots,x_n) = q_{\widehat{F}(x_1,\dots,x_n)}$.

Given a certain sample outcome, the quantile function is described in an explicit form in the following proposition.

Proposition VII.8.1. Let $x_1,\dots,x_n$ be arbitrary real numbers and let $\sigma$ be a permutation of the integers $1,\dots,n$ such that $x_{\sigma(1)} \le x_{\sigma(2)} \le \dots \le x_{\sigma(n)}$. Then
$$\widehat{q}(x_1,\dots,x_n) = \sum_{i=1}^{n} x_{\sigma(i)}\,\mathbf{1}_{I_i^n},$$
where for all $i = 1,2,\dots,n$ the interval $I_i^n$ is given by
$$I_i^n = \Bigl(\frac{i-1}{n}, \frac{i}{n}\Bigr].$$

Proof. Left as an exercise to the reader.

As a direct consequence of this proposition one has
$$\overline{x} = \frac{x_1 + \dots + x_n}{n} = \int_0^1 \widehat{q}(x_1,\dots,x_n)(u)\,du.$$
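As a small computational illustration of the empirical quantile function and of the identity just stated, here is a minimal sketch in Python (the function and variable names are mine, not the text's):

```python
import numpy as np

def empirical_quantile(xs, u):
    """q_hat(x_1, ..., x_n)(u) for u in (0, 1), following Proposition VII.8.1:
    on the interval ((i-1)/n, i/n] the value is the i-th order statistic."""
    xs = np.sort(np.asarray(xs, dtype=float))
    n = len(xs)
    i = min(max(int(np.ceil(u * n)), 1), n)
    return xs[i - 1]

xs = [3.2, 1.5, 2.7, 4.1]
# Integrating q_hat over (0, 1) (midpoint rule) reproduces the sample mean.
grid = (np.arange(1000) + 0.5) / 1000
print(np.mean([empirical_quantile(xs, u) for u in grid]), np.mean(xs))
```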
Next, define for all $\alpha \in [0, \tfrac12)$ the function $g_{n,\alpha} : \mathbb{R}^n \to \mathbb{R}$ by
$$g_{n,\alpha}(x_1,\dots,x_n) := \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} \widehat{q}(x_1,\dots,x_n)(u)\,du.$$
Proposition VII.8.2. The function $g_{n,\alpha} : \mathbb{R}^n \to \mathbb{R}$ is continuous on $\mathbb{R}^n$.

Proof. Using Proposition VII.8.1 it is easily verified that for fixed $u \in (0,1)$ the map
$$(x_1,\dots,x_n) \mapsto \widehat{q}(x_1,\dots,x_n)(u)$$
is continuous. Moreover one has
$$\sup_{u \in (0,1)} |\widehat{q}(x_1,\dots,x_n)(u)| = \max\bigl(|x_1|,\dots,|x_n|\bigr).$$
Now, applying Lebesgue's theorem on dominated convergence (see Appendix A), it is easily seen that $g_{n,\alpha}$ is continuous. The proposition above allows one to define:

Definition VII.8.2. Let $\alpha \in [0, \tfrac12)$ and let $X_1,\dots,X_n$ be an arbitrary sample. Then the statistic $(\overline{X})_\alpha$, defined by
$$(\overline{X})_\alpha := g_{n,\alpha}(X_1,\dots,X_n) = \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} \widehat{q}(X_1,\dots,X_n)(u)\,du,$$
is called the $\alpha$-trimmed mean of the sample. When it is necessary to include the size of the sample in the notation, then one writes $(\overline{X})_{n,\alpha}$ instead of $(\overline{X})_\alpha$. The number $\alpha$ is called the trim level.

It turns out (as will be seen below) that for $\alpha \in (0,\tfrac12)$ the sequence of trimmed means $(\overline{X})_{n,\alpha}$, associated with an infinite sample, is robust. In order to prove this, trimmed versions of the strong law of large numbers will be needed. These results are interesting in themselves. As a first step the following definition is given:

Definition VII.8.3. Let there be given a population having distribution function $F$. Denoting the quantile function belonging to $F$ by $q_F$, for every $\alpha \in (0,\tfrac12)$ the number
$$\mu_\alpha = \mu_\alpha(F) := \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} q_F(u)\,du$$
is called the $\alpha$-trimmed mean of the population.
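As a numerical illustration, the $\alpha$-trimmed mean of a concrete sample can be computed directly from the empirical quantile function. The following sketch (Python; the function name and the toy data are illustrative, not from the text) exploits the fact that the empirical quantile function is constant on the intervals $((i-1)/n, i/n]$:

```python
import numpy as np

def trimmed_mean(xs, alpha):
    """alpha-trimmed mean of a sample, for 0 <= alpha < 1/2, computed as
    (1/(1-2*alpha)) * integral over (alpha, 1-alpha) of the empirical
    quantile function; this reduces to a weighted sum of order statistics."""
    xs = np.sort(np.asarray(xs, dtype=float))
    n = len(xs)
    grid = np.linspace(0.0, 1.0, n + 1)            # breakpoints i/n of q_hat
    left = np.clip(grid[:-1], alpha, 1.0 - alpha)  # overlap of ((i-1)/n, i/n]
    right = np.clip(grid[1:], alpha, 1.0 - alpha)  # with (alpha, 1-alpha)
    weights = right - left
    return float(weights @ xs / (1.0 - 2.0 * alpha))

# The 20%-trimmed mean ignores the single large outlier, the ordinary mean does not.
sample = [1.8, 2.1, 2.4, 1.9, 2.0, 2.2, 50.0]
print(trimmed_mean(sample, 0.20), np.mean(sample))
```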
In these notations the $\alpha$-trimmed mean of the real numbers $x_1,\dots,x_n$ can be written as
$$(\overline{x})_{n,\alpha} = \mu_\alpha\bigl(\widehat{F}(x_1,\dots,x_n)\bigr).$$

Remark. For $\alpha \in (0,\tfrac12)$ the function $q_F$ is increasing on the interval $[\alpha, 1-\alpha]$, hence one has
$$q_F(\alpha) \le q_F(u) \le q_F(1-\alpha)$$
for all $u \in [\alpha, 1-\alpha]$. This shows that $q_F$ is bounded on $[\alpha, 1-\alpha]$. Furthermore, by Lemma VII.2.6, $q_F$ is a Borel function. It follows that it always makes sense to speak about the integral
$$\int_\alpha^{1-\alpha} q_F(u)\,du$$
whenever $0 < \alpha < \tfrac12$. It thus appears that the $\alpha$-trimmed population mean always exists if $0 < \alpha < \tfrac12$. In the case $\alpha = 0$ one has to consider the existence of the integral
$$\int_0^1 q_F(u)\,du.$$
In Exercise 44 the reader is asked to prove that this integral exists if and only if the mean $\mu$ of the population exists. If so, then
$$\mu_0 = \int_0^1 q_F(u)\,du = \int x\,dP_F(x) = \mu.$$
It thus appears that for $\alpha = 0$ the $\alpha$-trimmed mean is just the ordinary population mean. Preparatory to the proof of the trimmed version of the strong law of large numbers, two lemmas are now presented:

Lemma VII.8.3. If for two distribution functions $F$ and $G$ one has the inequality $d_L(F, G) < \varepsilon$, where $d_L$ denotes the Lévy metric, then
$$q_F(u - \varepsilon) - \varepsilon \le q_G(u) \le q_F(u + \varepsilon) + \varepsilon$$
for all $u \in (\varepsilon, 1 - \varepsilon)$.
Proof. Suppose dL .F; G/ < ". Choose an arbitrary u 2 .0; 1 "/ and choose an arbitrary ı > 0. Then, by definition of the quantile function, there exists an x such that u C " F .x/ and x qF .u C "/ C ı: ./
Because dL .F; G/ < ", one has furthermore (by Lemma VII.5.1) F .x/ G.x C "/ C ": Together with ./ this implies that u G.x C "/: In turn this, together with ./, implies that qG .u/ x C " qF .u C "/ C ı C ": The above holds for all ı > 0, therefore one has qG .u/ qF .u C "/ C "
for all u 2 .0; 1
"/:
./
The roles played by F and G are exchangeable. For this reason one also has qF .u/ qG .u C "/ C " If u 2 ."; 1
"/, then u
" 2 .0; 1 qF .u
for all u 2 .0; 1
"/:
"/ and hence "/ qG .u/ C ":
Together with ./ this implies that qF .u for all u 2 ."; 1
"/
" qG .u/ qF .u C "/ C "
"/. This proves the lemma.
Proposition VII.8.4. Suppose ˛ 2 .0; 12 /. If the sequence F1 ; F2 ; : : : converges in the Lévy metric to F , then lim ˛ .Fn / D ˛ .F /:
n!1
In other words, ˛ is a continuous function on .; dL /. Proof. Let M be the supremum of the function jqF j on the interval Œ 21 ˛; 1 12 ˛. Choose an arbitrary fixed " in the interval .0; 21 ˛/. Then there is an integer N" such that dL .Fn ; F / < " for all n N" : By the previous lemma this means that for all u 2 ."; 1 qF .u
"/
"/ and all n N" one has
" qFn .u/ qF .u C "/ C ":
Hence
Z
1 ˛ ˛
qFn .u/ du
Z
1 ˛®
¯ qF .u C "/ C " du:
˛
./
For the right side of this inequality one has Z 1 ˛ Z 1 ˛C" ® ® ¯ ¯ qF .u C "/ C " du D qF .u/ C " du ˛
˛C"
D
Z
1 ˛
˛
Z
Z
qF .u/ du C ˛C"
˛ 1 ˛
˛
Z
1 ˛C"
qF .u/ du
1 ˛
qF .u/ du C .1
2˛/ "
qF .u/ du C "M C "M C ":
Together with ./ this leads to the conclusion that Z 1 ˛ Z 1 ˛ qFn .u/ du qF .u/ du C .2M C 1/" ˛
˛
for all n N" . In the same way it can be verified that for all n N" one has Z 1 ˛ Z 1 ˛ qF .u/ du .2M C 1/": qFn .u/ du ˛
˛
Summarizing one arrives at the inequality ˇ ˇZ 1 ˛ Z 1 ˛ ˇ ˇ ˇ qF .u/ duˇˇ .2M C 1/" qFn .u/ du ˇ ˛
˛
for all n N" . Because M does not depend on ", it follows from the above that Z 1 ˛ Z 1 ˛ qF .u/ du D ˛ .F /: lim ˛ .Fn / D lim qFn .u/ du D n!1
n!1 ˛
˛
This proves the proposition.

It is very easy to prove a strong law of large numbers for trimmed means. Note that here, in contrast to the untrimmed version, not a single condition on the distribution function of the population is imposed.

Theorem VII.8.5. Let $X_1, X_2, \dots$ be an infinite sample from a population having distribution function $F$ and let $0 < \alpha < \tfrac12$. Then
$$\lim_{n\to\infty} (\overline{X})_{n,\alpha} = \mu_\alpha(F) \quad\text{strongly}.$$
Proof. Let .; A; P / be the underlying probability space of the variables X1 ; X2 ; : : :. By the Glivenko–Cantelli theorem there exists an element A 2 A such that P .A/ D 1 and such that b X1 .!/; : : : ; Xn .!/ lim k F F k1 D 0 for all ! 2 A: n!1
Evidently, by Theorem VII.5.4, for such ! one also has b X1 .!/; : : : ; Xn .!/ ; F D 0: lim dL F n!1
By the previous lemma this implies that b X1 .!/; : : : ; Xn .!/ D ˛ .F /: lim ˛ F n!1
However, this is the same as saying that
lim . X /n;˛ .!/ D ˛ .F /
n!1
for all ! 2 A:
Because $P(A) = 1$, the theorem is proved by this.

Regardless of the distribution $F$, the expression
$$\frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} q_F(u)\,du$$
is not defined for $\alpha = \tfrac12$. However, there is a limit if $\alpha \uparrow \tfrac12$. A purely analytical lemma will be needed to see this:

Lemma VII.8.6. Let $f : [a, b] \to \mathbb{R}$ be an increasing function. Then for any fixed $x \in (a, b)$ one has
$$\lim_{h\downarrow 0} \frac{1}{2h}\int_{x-h}^{x+h} f(t)\,dt = \frac12\{f(x-) + f(x+)\}.$$
Proof. Choose and fix an x 2 .a; b/. As a first step it will be proved that Z xC 1 lim f .t / dt D f .x C/: #0 x To this end, note that ˇ Z xC ˇ1 ˇ f .t / dt ˇ x
ˇ ˇ Z xC ˇ ˇ ˇ1 ˇ ˇ ˇ f .x C/ˇ D ˇ f .t / f .x C/ dt ˇˇ x Z ˇ 1 xC ˇˇ f .t / f .x C/ˇ dt: x
./
()
Next, choose an arbitrary " > 0. Because lim f .t / D f .x C/; t#x
there is a ı > 0 such that for x < t < x C ı ˇ ˇ ˇf .t / f .x C/ˇ < ": Together with ./ this implies that ˇ Z xC ˇ1 ˇ f .t / dt ˇ x
ˇ ˇ f .x C/ˇˇ < "
whenever 0 < < ı. This proves ./. In the same way one can prove that Z 1 x lim f .t / dt D f .x /: #0 x
./
The proof can now be completed by taking the mean of ./ and ./. Now let F be an arbitrary distribution function. For the ˛-trimmed mean ˛ .F / belonging to F one then has Z 1 ˛ Z 1 C 2 1 1 lim ˛ .F / D lim qF .u/ du D lim qF .u/ du 1 1 1 1 2˛ ˛ #0 2 ˛" 2 ˛" 2 2 1 D ¹qF . 12 / C qF . 12 C/º ; 2 where Lemma VII.8.6 was applied to the quantile function qF . Definition VII.8.4. By the median of a population with distribution function F one means the number ¯ 1® b .F / WD qF . 12 / C qF . 21 C/ : 2 One talks about a strict median (concerning F ) if qF . 12 / D qF . 12 C/:
Along these lines it is now natural to define the sample median in the following, perhaps somewhat pedantic, way: Definition VII.8.5. Let x1 ; : : : ; xn be the outcome of a sample. The number b .x1 ; : : : ; xn / ; b x WD b F
b .x1 ; : : : ; xn / is the empirical distribution function belonging to x1 ; : : : ; xn , where F is called the sample median.
The following proposition is presented to link this definition with the more usual definitions of the sample median: Proposition VII.8.7. Let x1 ; : : : ; xn be the outcome of an arbitrary sample and let be a permutation of the numbers 1; : : : ; n such that x.1/ x.2/ x.n/ : Then b x D
´
x.mC1/ 1 2 ¹x.m/ C x.mC1/ º
if n D 2m C 1; if n D 2m:
Proof. An immediate consequence of Definition VII.8.4 and Proposition VII.8.1. The proposition above can be used to prove that the map gn W Rn ! R, defined by gn .x1 ; : : : ; xn / D b x;
is continuous and therefore surely Borel. For this reason it makes sense, given a sample X1 ; : : : ; Xn , to talk about the statistic b WD gn .X1 ; : : : ; Xn /: X
b /n If there is a wish to incorporate the size of the sample in the notation then write . X b instead of X . The following proposition, which is interesting in itself, will be needed to prove the strong law of large numbers for the median: Proposition VII.8.8. The function b W .; dL / ! R
is continuous in precisely those elements of that have a strict median. Proof. Suppose that F is an element in having a strict median. To prove that b is continuous in F , let F1 ; F2 ; : : : be any sequence that converges in the Lévy metric to F . One has to show that then necessarily lim b .Fn / D b .F /:
n!1
Well, for all n one has
b .Fn / D
¯ 1® qFn . 12 / C qFn . 12 C/ : 2
It will first be proved that lim qFn . 12 C/ D qF . 12 /:
./
n!1
To this end, choose an arbitrary fixed " > 0. Then there is an integer N" such that dL .Fn ; F / < "=2
for all n N" :
By Lemma VII.8.3 this implies that for all u 2 ."=2; 1 qF .u
"=2/ one has
"=2 qFn .u/ qF .u C "=2/ C "=2:
"=2/
It follows from this that for all u 2 ."; 1
"/
qF .u
"/
" qFn .uC/ qF .u C "/ C ":
qF . 12
"/
" qFn . 12 C/ qF . 12 C "/ C ":
In particular Therefore one may write qF . 21
"/
" lim qFn . 12 C/ lim qFn . 21 C/ qF . 12 C "/ C ": n!1
n!1
This holds for all " > 0, hence qF . 12 / lim qFn . 12 C/ lim qFn . 12 C/ qF . 12 C/: n!1
n!1
The assumption that F has a strict median means that qF . 21 / D qF . 12 / D qF . 12 C/: It thus appears that ./ holds. In the same way it can be proved that lim qFn . 21 / D qF . 12 /:
n!1
From ./ and ./ together it follows that b .Fn / D 12 ¹qFn . 12 / C qFn . 12 C/º ! qF . 21 / D b .F /
if n ! 1. This proves that b is continuous if F has a strict median.
To prove the converse, suppose that F does not have a strict median. Then qF . 12 / < qF . 21 C/:
./
Choose an arbitrary 2 qF . 12 /; qF . 12 C/ and an arbitrary " 2 .0; 21 /. Define the distribution function F;" as follows (draw a figure): 8 ˆ 0 there exists an N" such that
In other words:
g n .F /; Ec " dL e e g n .F / ! Ec
for all n N" : in .; dL /:
By Proposition VII.8.14 this can only be so if F has a strict median. Hence e g 1; e g 2 ; : : : equicontinuous in F H) F has a strict median : This completes the proof of the theorem.
The reader should be aware of the fact that all results in this section remain valid if the Lévy metric is replaced by another metric that generates the same topology. For further reading the reader is referred to [30], [31], [39].
VII.9 Statistical functionals
This section has, with respect to the foregoing ones, a unifying character. This will be brought about by studying functions $\Lambda$ of the form $\Lambda : D_\Lambda \to \mathbb{R}$, where $D_\Lambda$ is a set of distribution functions. The set $D_\Lambda$ is called the domain of $\Lambda$. In mathematics, functions defined on a domain that itself consists of functions are usually called functionals. If a functional as above satisfies certain (mild) conditions, then it is called a "statistical functional":

Definition VII.9.1. A function $\Lambda : D_\Lambda \to \mathbb{R}$, where $D_\Lambda$ is a set of distribution functions, is said to be a statistical functional if it satisfies the following two conditions:

(i) $\widehat{F}(x_1,\dots,x_n) \in D_\Lambda$ for all finite sequences $x_1,\dots,x_n$.

(ii) The map $(x_1,\dots,x_n) \mapsto \Lambda\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$ is for all fixed $n$ a Borel function on $\mathbb{R}^n$.

Remark. Because $\widehat{F}(a) = E_a$, the definition above implies that $E_a \in D_\Lambda$ for all $a \in \mathbb{R}$.

A statistical functional generates in a natural way a sequence of statistics. Namely, for any sample $X_1,\dots,X_n$ one may define $T_n$ by
$$T_n := \Lambda\bigl(\widehat{F}(X_1,\dots,X_n)\bigr).$$
Now $T_n$ is a correctly defined statistic in the sense of Definition II.1.1. To see this, note that the function $g_n : \mathbb{R}^n \to \mathbb{R}$ defined by
$$g_n(x_1,\dots,x_n) := \Lambda\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$$
is a Borel function. In this notation $T_n$ can be represented as $T_n = g_n(X_1,\dots,X_n)$.
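As a small illustration of this "plug-in" mechanism, a sketch in Python (the helper names are mine, not the text's): a statistical functional can be represented by any function acting on the outcome $x_1,\dots,x_n$, which determines $\widehat{F}(x_1,\dots,x_n)$ completely, and the induced statistic $T_n$ is obtained by applying it to the observed sample.

```python
import numpy as np

def plug_in_statistic(functional, sample):
    """T_n = Lambda(F_hat(X_1, ..., X_n)); the empirical distribution function
    is represented here simply by the array of observations that determines it."""
    return functional(np.asarray(sample, dtype=float))

# Two functionals treated in the examples below: the mean and the variance.
mean_functional = lambda xs: xs.mean()
variance_functional = lambda xs: (xs ** 2).mean() - xs.mean() ** 2

sample = [2.0, 3.5, 1.0, 4.5]
print(plug_in_statistic(mean_functional, sample),
      plug_in_statistic(variance_functional, sample))
```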
In mathematical statistics many quantities emanate in this way from statistical functionals. This is illustrated in the following examples. Example 1. Let Dƒ be the set of all distribution functions with an existing first moment, that is: Z ° ± Dƒ WD F 2 W jxj dPF .x/ < C1 :
Define ƒ W Dƒ ! R by
ƒ.F / WD
Z
x dPF .x/:
b WD Now ƒ.F / presents the mean of a population with distribution function F . If F b .x1 ; : : : ; xn /, then one has F Pb D F
1 .ıx C C ıxn /; n 1
where ıxi is the Dirac measure in xi . Thus it follows that Dƒ contains all empirical distribution functions. Furthermore Z b .x1 ; : : : ; xn / D x dP .x/ D x1 C C xn : ƒ F b F n
Hence the map ƒ W Dƒ ! R is a statistical functional from which the sample mean emanates. Example 2. Let Dƒ be the set defined by Z ° ± Dƒ WD F 2 W x 2 dPF .x/ < C1 :
Define on Dƒ the map ƒ by
ƒ.F / WD
Z
x 2 dPF .x/
Z
2 x dPF .x/ :
Now ƒ.F / presents the variance of a population with distribution function F . It is easily seen that ƒ W Dƒ ! R is a statistical functional and that ! n n X 1 1X 2 b .x1 ; : : : ; xn / D ƒ F xi .x/2 D .xi x/2 : n n iD1
It follows that
iD1
b .X1 ; : : : ; Xn / D n ƒ F
1 2 Sn ; n where Sn2 is the sample variance of a sample X1 ; : : : ; Xn . Example 3. Let 0 < ˛
0 the number a .F / is defined by a .F / .x/ WD F .x=a/ .x 2 R/:
If X has distribution function F , then a .F / is the distribution function of aX . The quantities a .F / and a .F / will often be denoted as a F and a F . The next proposition describes the effect of the operations introduced above on quantile functions. Proposition VII.9.3. For every distribution function F and all u 2 .0; 1/ one has: (i) qa F .u/ D qF .u/ C a. (ii) q e .u/ D qF .1 u/C . F (iii) qa F .u/ D a qF .u/.
(iv) If F G , then qF qG .
Proof. The proof of (i), (iii) and (iv) is trivial. For the proof of (ii), see Exercise 47. Definition VII.9.4. A statistical functional ƒ W Dƒ ! R is called a location functional if it satisfies the following conditions: (i) If a 2 R and F 2 Dƒ , then a F 2 Dƒ and ƒ.a F / D ƒ.F / C a. e 2 Dƒ and ƒ. F e / D ƒ.F /. (ii) If F 2 Dƒ , then F
(iii) If F 2 Dƒ and a > 0, then a F 2 Dƒ and ƒ.a F / D aƒ.F /. (iv) If F; G 2 Dƒ and F G , then ƒ.F / ƒ.G/.
The following proposition is presented before giving examples of location functionals:
Proposition VII.9.4. If ƒ W Dƒ ! R is a location functional, then the following statements hold: (i) If F is symmetric around a, then ƒ.F / D a.
(ii) For all a 2 R one has ƒ.Ea / D a.
(iii) If there are real numbers a and b such that F .a/ D 0 and F .b/ D 1, then ƒ.F / 2 Œa; b. e D F , hence Proof. (i) If F is symmetric around the origin (case a D 0) then F e/ D ƒ.F / D ƒ. F
ƒ.F /:
This is possible only if ƒ.F / D 0. If F is symmetric around the point a, then is symmetric around the origin and therefore ƒ.F /
a D ƒ.
aF /
aF
D 0:
It follows that ƒ.F / D a. This proves statement (i). Statement (ii) is a direct consequence of (i). (iii) If there exist real numbers a and b such that F .a/ D 0 and F .b/ D 1, then Eb F Ea : Therefore a D ƒ.Ea / ƒ.F / ƒ.Eb / D b:
This proves statement (iii).
As announced, examples of location functionals are given now. Example 7. Let ƒ W Dƒ ! R be, as in Example 1, the statistical functional belonging to the mean. That is to say, if F 2 and X has distribution function F , then F is an element of Dƒ if and only if the expectation value of X exists; if so, then ƒ.F / D E.X /: It is easily verified that ƒ is a location functional. Example 8. Let ƒ˛ W Dƒ ! R be the statistical functional belonging to the ˛trimmed mean. That is to say, Dƒ D and ƒ˛ is defined by Z 1 ˛ 1 ƒ˛ .F / D qF .u/ du D ˛ .F /: 1 2˛ ˛ Applying Proposition VII.9.3 it is easily verified that ƒ˛ presents a location functional.
Example 9. Let ƒ W Dƒ ! R be the statistical functional belonging to the median. Then Dƒ D and ƒ is defined by ƒ.F / D
¯ 1® qF . 12 / C qF . 12 C/ : 2
Exploiting (for example) the fact that
ƒ.F / D lim ˛ .F /; ˛" 21
the functional ƒ is easily seen to be a location functional. Example 10. Let 0 < ˛ ƒ.F / D
1 2
and let Dƒ D . Define ƒ W Dƒ ! R by
1® qF .˛ / C qF .˛ C/ C qF .1 4
˛/
C qF .1
˛/C
¯
:
Using Proposition VII.9.3 it is easy to see that ƒ is a location functional. In the following this functional will be denoted by Q˛ . Note that for ˛ D 12 the functional Q˛ presents the median. A convex linear combination of two location functionals is again a location functional. More precisely, suppose that ƒ1 W Dƒ1 ! R and ƒ2 W Dƒ2 ! R are location functionals and that 0 < t < 1. Set Dƒ D Dƒ1 \ Dƒ2 and define ƒ W Dƒ ! R by ƒ.F / WD .1
t /ƒ1 .F / C tƒ2 .F /;
where F 2 Dƒ . Then ƒ is a location functional. Combining Examples 8 and 10 one thus arrives at the “˛-Winsorized mean”: Example 11. Let 0 < ˛ < number W˛ .F / defined by
1 2.
The ˛-Winsorized mean of F is understood to be the
W˛ .F / WD .1
2˛/ ƒ˛ .F / C 2 ˛ Q˛ .F /:
Here $\Lambda_\alpha$ is the $\alpha$-trimmed mean and $Q_\alpha$ the functional as defined in Example 10. Roughly speaking, a location functional $\Lambda$ assigns to every $F \in D_\Lambda$ a number that can be interpreted as "the center" of a population with distribution function $F$. Another important class of statistical functionals is presented by the so-called dispersion functionals. Such functionals measure the spread in a population.
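Before turning to dispersion functionals, here is a small numerical sketch of the $\alpha$-Winsorized mean of Example 11 (Python; scipy's trim_mean and numpy's interpolating quantiles are used as stand-ins for $\Lambda_\alpha$ and $Q_\alpha$, so the values only approximate the exact functionals, and all names are mine):

```python
import numpy as np
from scipy.stats import trim_mean

def winsorized_mean(sample, alpha):
    """W_alpha = (1 - 2*alpha) * Lambda_alpha + 2*alpha * Q_alpha, where
    Lambda_alpha is the alpha-trimmed mean and Q_alpha averages the
    alpha- and (1 - alpha)-quantiles (Example 10)."""
    xs = np.asarray(sample, dtype=float)
    lam_alpha = trim_mean(xs, alpha)            # approximates the trimmed mean
    q_alpha = 0.5 * (np.quantile(xs, alpha) + np.quantile(xs, 1 - alpha))
    return (1 - 2 * alpha) * lam_alpha + 2 * alpha * q_alpha

print(winsorized_mean([1.8, 2.1, 2.4, 1.9, 2.0, 2.2, 50.0], 0.2))
```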
Definition VII.9.5. A statistical functional ƒ W Dƒ ! R is called a dispersion functional if it satisfies the following properties: (i) If F 2 Dƒ and a 2 R, then a F 2 Dƒ and ƒ.a F / D ƒ.F /. e 2 Dƒ and ƒ. F e / D ƒ.F /. (ii) If F 2 Dƒ , then F
(iii) If F 2 Dƒ and a > 0, then a F 2 Dƒ and ƒ.a F / D aƒ.F /. (iv) ƒ.F / 0 for all F 2 Dƒ .
Proposition VII.9.5. If ƒ W Dƒ ! R is a dispersion functional, then ƒ.Ea / D 0 for every a 2 R. Proof. See Exercise 48. Some examples of dispersion functionals are given now: Example 12. Define Dƒ by Z ° ± Dƒ WD F 2 W x 2 dPF .x/ < C1 :
Next, define ƒ W Dƒ ! R in the following way: If F 2 Dƒ and if X is a variable having F as its distribution function, then p ƒ.F / WD var.X / D X :
Now ƒ is a dispersion functional.
Example 13. Let Dƒ be the set Z ° ± Dƒ WD F 2 W jxj dPF .x/ < C1 :
Define for every F 2 Dƒ the number ƒ.F / as follows:
If X is a variable having F as its distribution function, then ƒ.F / WD E jX E.X /j :
The above defines a dispersion functional. Example 14. Let Dƒ D and define ƒ.F / WD
1® qF . 34 C/ C qF . 34 / 4
qF . 14 C/
qF . 14 /
¯
for all F 2 Dƒ . Now ƒ W Dƒ ! R is a dispersion functional. The functional ƒ is called the interquartile distance.
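A sample analogue of this dispersion functional can be sketched as follows (Python; np.quantile interpolates, so the left- and right-hand quantile limits of the definition coincide here, and the function name is mine):

```python
import numpy as np

def interquartile_distance(sample):
    """Sample version of Example 14: one quarter of
    {q(3/4+) + q(3/4-) - q(1/4+) - q(1/4-)}, i.e. half the usual
    interquartile range when the quartiles are unambiguous."""
    xs = np.asarray(sample, dtype=float)
    q1, q3 = np.quantile(xs, [0.25, 0.75])
    return 0.25 * ((q3 + q3) - (q1 + q1))

print(interquartile_distance([3.1, 2.7, 4.0, 3.6, 2.9, 3.3, 3.8]))
```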
This section is closed now with a discussion about so-called linear functionals: Definition VII.9.6. A statistical functional ƒ W Dƒ ! R is called linear if Dƒ is convex and if there exists a Borel function g W R ! R such that: R (i) Dƒ D ¹F 2 W jg.x/j dPF .x/ < C1º, R (ii) ƒ.F / D g.x/ dPF .x/. Remark 1. For every linear statistical functional g is unique. Namely, one has PEa D ıa ; where ıa is the Dirac measure in a. The function g W R ! R is now determined by Z Z b .a/ D ƒ.Ea / D ƒ F g.x/ dPEa .x/ D g.x/ dıa .x/ D g.a/:
Remark 2. A linear statistical functional ƒ W Dƒ ! R is not linear in the sense of functional analysis. However, for every F; G 2 Dƒ one does have ƒ .1 t /F C t G D .1 t /ƒ.F / C t ƒ.G/; where 0 t 1. In functional analysis such maps are usually called affine.
Proposition VII.9.6. Let ƒ W Dƒ ! R be a linear statistical functional and let F1 ; : : : ; Fn 2 Dƒ . If 0 t1 ; : : : ; tn 1 and t1 C C tn D 1, then: (i) t1 F1 C C tn Fn 2 Dƒ ,
(ii) ƒ.t1 F1 C C tn Fn / D t1 ƒ.F1 / C C tn ƒ.Fn /. Proof. See Exercise 49. Next, the proposition above is applied together with the fact that b .x1 ; : : : ; xn / D 1 .Ex1 C C Exn /: F n
It follows that if ƒ W Dƒ ! R is a linear statistical functional, one has
® ¯ ® ¯ b .X1 ; : : : ; xn / D 1 ƒ.Ex1 / C C ƒ.Exn / D 1 g.x1 / C C g.xn / : ƒ F n n
It thus appears that statistics emanating from linear statistical functionals are always of the following form: n X b .X1 ; : : : ; Xn / D 1 Tn D ƒ F g.Xi /: n iD1
Such statistics always present unbiased estimators of ƒ. To see this, choose an F 2 Dƒ . Let X1 ; : : : ; Xn be a sample from a population with distribution function F . Then n 1X E.Tn / D E g.Xi / D E g.X1 / n iD1 Z Z D g.x/ dPX1 .x/ D g.x/ dPF .x/ D ƒ.F /:
Because this holds for all F 2 Dƒ , the statistic Tn is an unbiased estimator for ƒ. Theorem VII.9.7. Linear statistical functionals are strongly consistent in every element of their domain. Proof. Suppose ƒ W Dƒ ! R is a linear statistical functional. Let F 2 Dƒ and let X1 ; X2 ; : : : be an infinite sample from a population with distribution function F . In order to show that ƒ is strongly consistent in F , one has to verify that b .X1 ; : : : ; Xn / ! ƒ.F / strongly: ƒ F This is easy. Namely, one has Z Z E jg.Xi /j D jg.x/j dPXi .x/ D jg.x/j dPF .x/ < C1:
Thus it appears that the sequence g.X1 /; g.X2 /; : : : constitutes an infinite sample from a population with mean ƒ.F /. Consequently, by the strong law of large numbers, one has n X b .X1 ; : : : ; Xn / D 1 g.Xi / ! ƒ.F / ƒ F n
strongly:
iD1
This proves the theorem.
The theorem above is, properly speaking, a reformulation of the strong law of large numbers. The central limit theorem can also be reformulated in this way:

Theorem VII.9.8. Let $\Lambda : D_\Lambda \to \mathbb{R}$ be a linear statistical functional and let $F \in D_\Lambda$. Suppose that
$$\int g(x)^2\,dP_F(x) < +\infty.$$
Then for any infinite sample $X_1, X_2, \dots$ from a population with distribution function $F$ the sequence
$$\sqrt{n}\,\bigl(T_n - \Lambda(F)\bigr) = \sqrt{n}\,\Bigl(\Lambda\bigl(\widehat{F}(X_1,\dots,X_n)\bigr) - \Lambda(F)\Bigr) \quad (n = 1, 2, \dots)$$
converges in distribution to the $N(0, \sigma^2)$-distribution, where
$$\sigma^2 = \operatorname{var}\bigl(g(X_i)\bigr) = \int \bigl(g(x) - \Lambda(F)\bigr)^2\,dP_F(x).$$

Proof. Just apply the central limit theorem (Theorem I.9.3) to the sample $g(X_1), g(X_2), g(X_3), \dots$.

The theorem above can be used to make interval estimates of $\Lambda(F)$. This runs as follows: Suppose that $\Lambda : D_\Lambda \to \mathbb{R}$ is a linear statistical functional. Let $X_1,\dots,X_n$ be a sample from a population with unknown distribution function $F$, having the numbers $x_1,\dots,x_n$ as its outcome. Based on this, one makes the estimate $\Lambda(F) \approx \Lambda\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$. By Theorem VII.9.8 the quantity
$$\Lambda\bigl(\widehat{F}(X_1,\dots,X_n)\bigr) - \Lambda(F)$$
is for large $n$ approximately $N(0, \sigma^2/n)$-distributed. In a rough way, one could now proceed by making (in some way) an estimate of $\sigma^2$. In order to construct a, say, 90% confidence interval for $\Lambda(F)$, solve (approximately) the probability equation
$$P\Bigl(\bigl|\Lambda\bigl(\widehat{F}(X_1,\dots,X_n)\bigr) - \Lambda(F)\bigr| < \delta\Bigr) = 0.90.$$
Then the inequality
$$\Lambda\bigl(\widehat{F}(x_1,\dots,x_n)\bigr) - \delta < \Lambda(F) < \Lambda\bigl(\widehat{F}(x_1,\dots,x_n)\bigr) + \delta$$
presents a 90% confidence interval for $\Lambda(F)$.
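A sketch of this recipe in Python (illustrative only: here $\Lambda$ is the mean functional with $g(x) = x$, $\sigma$ is estimated by the sample standard deviation, and 1.645 is the approximate standard-normal quantile used for a 90% interval; the function name and data are mine):

```python
import numpy as np

def linear_functional_ci(sample, g=lambda x: x, level=0.90):
    """Approximate confidence interval for Lambda(F) = integral of g dP_F,
    based on the normal approximation of Theorem VII.9.8."""
    gx = np.asarray([g(x) for x in sample], dtype=float)
    n = len(gx)
    estimate = gx.mean()                    # Lambda(F_hat(x_1, ..., x_n))
    sigma_hat = gx.std(ddof=1)              # estimate of sigma = sd of g(X_i)
    z = {0.90: 1.645, 0.95: 1.960}[level]   # normal quantile for the level
    delta = z * sigma_hat / np.sqrt(n)
    return estimate - delta, estimate + delta

print(linear_functional_ci([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]))
```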
This section is now closed by giving an example of a map ƒ W Dƒ ! R that is not a statistical functional, whereas it does play an important role in statistics. Example 15. Let Dƒ be the set given by Dƒ D ¹F 2 W F is continuously differentiableº: Define for fixed x 2 R the map ƒ W Dƒ ! R by ƒ.F / D F 0 .x/: Now ƒ W Dƒ ! R is not a statistical functional for the simple reason that Dƒ does not contain the empirical distribution functions. Derivatives of distribution functions do play an important role, however, for they present probability densities. In Sections VII.12 and VII.13 the estimation of population probability densities will be the main topic.
VII.10 The von Mises derivative and influence functions
Introduction. In this section differentiation of statistical functionals will be studied. As will be well known to the reader, differentiation of functions defined on Rp presents an essential technique in mathematical analysis. Likewise, differentiation of statistical functionals is a valuable tool in stochastic analysis. Before discussing this kind of differentiation, first a recapitulation how functions on Rp can be differentiated: Let ' W Rp ! R be any function and let a be a fixed point in Rp . For every fixed v 2 Rp one considers the function t 7! '.a C t v/ .t 2 R/:
If for all v 2 Rp this function is differentiable in the point t D 0, then one says that the function ' has in a derivatives in all directions. If, moreover, the map d v 7! '.a C t v/ .v 2 Rp / dt tD0
is a linear form on Rp , then ' is said to be Gâteaux differentiable. Then the linear form in question is denoted by ' 0 .a/, or more precisely by ' 0 .a/ W Rp ! R:
The 1 n-matrix Œ' 0 .a/ of this linear map is given by @' @' 0 Œ' .a/ D .a/; : : : ; .a/ : @x1 @xp
The function ' is called differentiable (or Fréchet differentiable) in a if lim
v!0
j'.a C v/
where kvk D
'.a/ kvk
' 0 .a/vj
D 0;
q v12 C C vp2 :
The following scheme of implications is valid: Fréchet differentiable
H)
Gâteaux differentiable
H)
all partial derivatives exist
In this scheme no implication may be reversed. A well-known result in mathematical analysis is the chain rule. If f W R ! Rp is (componentwise) differentiable in the point a 2 R and if ' W Rp ! R is differentiable in f.a/, then the function t 7! ' f.t / D ' f1 .t /; : : : ; fp .t /
is differentiable in a and one has d @' @' ' f.t / D f.a/ f10 .a/ C C f.a/ fp0 .a/: dt @x1 @xp tDa
A similar result holds if f has only a left or right-derivative in a.
The introductory considerations above will now be generalized to statistical functionals. As a first tentative definition of a derivative of a statistical functional ƒ W Dƒ ! R one could think of the derivative in t D 0 of functions of type t 7! ƒ.F C t G/: However, for t ¤ 0 the function F C t G is not a distribution function and for this reason it cannot possibly be an element of Dƒ . Hence the expression written down above does not make sense in general. Instead, a look is taken now at the rightderivative of the function t 7! ƒ .1 t /F C t G
in the point t D 0. Distribution functions of the form .1 t /F C t G are often called mixtures of F and G . For small t one also speaks about a contamination of F by G . In the expression above there must of course be a guarantee that .1
t /F C t G 2 Dƒ
for
0 t 1:
For this reason it will be systematically assumed that Dƒ is convex. Starting from this, choose and fix an F 2 Dƒ . If for some G 2 Dƒ the rightderivative of the function t 7! ƒ .1 t /F C t G
exists in the point t D 0 then one says that ƒ is in F differentiable in the direction of G . The right differential quotient in question will be denoted by ƒ0F .G/. So ƒ0F .G/ D
d ƒ .1 dt
In terms of this derivative one defines:
t /F C t G
: tD0
Definition VII.10.1. The set of all elements G 2 Dƒ for which ƒ0F .G/ exists will be called the von Mises hull of F . This set will be denoted by Mƒ .F /. Note that always F 2 Mƒ .F /, for if G D F then the function t 7! ƒ .1
t /F C t G D ƒ.F /
is constant in t . Hence one always has ƒ0F .F / D 0: If the map ƒ0F W Mƒ .F / ! R is a linear statistical functional, then ƒ is said to be von Mises differentiable in F . If so, the corresponding linear functional ƒ0F is said to be the von Mises derivative of ƒ in F . The function F W R ! R defined by F .x/
WD ƒ0F .Ex / .x 2 R/
is called, for reasons that become clear later on, the influence function belonging to ƒ and F . The assumption that ƒ0F W Mƒ .F / ! R be linear implies that Z ° ± Mƒ .F / D G 2 W j F .x/j dPG .x/ < C1 and that
ƒ0F .G/
D
for all G 2 Mƒ .F /:
Z
F .x/ dPG .x/
In the following it will be discussed how this kind of differentiation can be used. First, however, some examples of statistical functionals that are differentiable in the sense of von Mises: Example 1. If ƒ W Dƒ ! R is a linear statistical functional, then Mƒ .F / D Dƒ for all F . One then has ƒ0F .G/ D ƒ.G/ ƒ.F / G 2 Mƒ .F / ; which shall abbreviate by writing
ƒ0F D ƒ
ƒ.F /:
It is easily seen that the above is true, as for all 0 t 1 and all F; G 2 Dƒ one has ƒ .1 t /F C t G D .1 t /ƒ.F / C t ƒ.G/: Just differentiate this expression in the point t D 0 and the above follows.
Example 2. Let ƒ W Dƒ ! R be the statistical functional belonging to the mean. That is to say Z ° ± Dƒ WD F W jxj dPF .x/ < C1
and
ƒ.F / WD
Z
x dPF .x/ D .F /:
Now ƒ is a linear statistical functional and therefore ƒ is von Mises differentiable in every element F 2 Dƒ .
Now suppose that ƒ1 ; : : : ; ƒp are statistical functionals defined on the domains Dƒ1 ; : : : ; Dƒp respectively. Define the map ƒ W Dƒ ! Rp by Dƒ WD Dƒ1 \ \ Dƒp and
ƒ.F / WD ƒ1 .F /; : : : ; ƒp .F /
for all F 2 Dƒ :
Now for any Borel function ' W Rp ! R a statistical functional can be defined in the following way F 7! ' ƒ1 .F /; : : : ; ƒp .F / D ' ƒ.F / ; where F 2 Dƒ . This functional will be denoted by '.ƒ1 ; : : : ; ƒp / or, briefly, by '.ƒ/. The next theorem is formulated in these notations:
Theorem VII.10.1. Suppose that ƒ1 ; : : : ; ƒp are linear statistical functionals. Let ' W Rp ! R be a Borel function that is differentiable in the point ƒ.F /, where F 2 Dƒ . Then the statistical functional '.ƒ/ is von Mises differentiable in F , and one has @' @' '.ƒ/0F D ƒ.F / ƒ1 ƒ1 .F / C C ƒ.F / ƒp ƒp .F / : @x1 @xn
Proof. Just apply the chain rule to the function t 7! ' ƒ .1
t /F C t G
and evaluate the right-derivative in the point t D 0.
This theorem is applied in the following example: Example 3. Define on the domain Z ° ± D WD F 2 W x 2 dPF .x/ < C1
the linear statistical functionals ƒ1 and ƒ2 by Z Z 2 ƒ1 .F / WD x dPF .x/ and ƒ2 .F / WD x dPF .x/: Furthermore, let ' W R2 ! R be given by '.x1 ; x2 / WD x1
.x2 /2 :
Then '.ƒ1 ; ƒ2 / W D ! R is the statistical functional belonging to the variance. That is to say, if X has the element F 2 D as its distribution function, then '.ƒ1 ; ƒ2 /.F / D var.X /:
By Theorem VII.10.1 the functional '.ƒ1 ; ƒ2 / is von Mises differentiable in every F 2 D. The idea to differentiate statistical functionals was first proposed by Richard M. E. von Mises (1883–1953). In the following a heuristical sketch of von Mises’ ideas is given: Suppose that ƒ W Dƒ ! R is von Mises differentiable in F 2 Dƒ and that G 2 Mƒ .F /. Expanding the function t 7! ƒ .1 t /F C t G
in a first order Taylor expansion around the point t D 0, one arrives at something of the form ƒ .1 t /F C t G D ƒ.F / C t ƒ0F .G/ C remainder term: Setting t D 1 in this expression one arrives at
ƒ.G/ D ƒ.F / C ƒ0F .G/ C RF .G/; where RF .G/ should be regarded as a remainder term. As to this remainder term one may expect that RF .G/ 0 if F and G are close to each other.
Now, in applying the above, suppose that X1 ; X2 ; : : : is an infinite sample from a population with distribution function F . When taking for G the empirical distribution b .X1 ; : : : ; Xn / one obtains function F b .X1 ; : : : ; Xn / : b .X1 ; : : : ; Xn / C RF F b .X1 ; : : : ; Xn / D ƒ.F / C ƒ0 F ƒ F F Because ƒ0F .F / D 0, this can also be read as b .X1 ; : : : ; Xn / b .X1 ; : : : ; Xn / ƒ F ƒ.F / D ƒ0F F
ƒ0F .F /
b .X1 ; : : : ; Xn / : C RF F
p Multiplication by a factor n gives ± p ° ± p ° b .X1 ; : : : ; Xn / b .X1 ; : : : ; Xn / n ƒ F ƒ.F / D n ƒ0F F ƒ0F .F / p b .X1 ; : : : ; Xn / : C n RF F
The fact that ƒ0F is linear implies, by Theorem VII.9.8, that the sequence ± p ° 0 b .X1 ; : : : ; Xn / n ƒF F ƒ0F .F / .n D 1; 2; : : :/
converges in distribution to a N.0; 2 /-distribution, where Z 2 2 D F .x/ dPF .x/:
If for the remainder term one has p b .X1 ; : : : ; Xn / ! 0 n RF F
in distribution,
then one may expect that the sequence p
n Tn
p ° b .X1 ; : : : ; Xn / ƒ.F / D n ƒ F
± ƒ.F /
converges in distribution function to an N.0; 2 /-distribution. In mathematical statistics this phenomenon turns up with an amazing frequency. For example it can be observed when dealing with the functionals that belong to the ˛-trimmed mean and to the median. The arguments sketched above will be turned into rigorous mathematics for functionals of the form '.ƒ1 ; : : : ; ƒp /, where ' is differentiable and ƒ1 ; : : : ; ƒp linear. To this end, however, vectorial versions of the normal distribution will be needed. For this reason this is postponed to §VIII.6.
Figure 19
In an estimation process where Tn is used as an estimator of ƒ.F / it is important to know p that the sequence n Tn ƒ.F / converges in distribution to an N.0; 2 /-distribution. This, because then for large n the estimator Tn is approximately N ƒ.F /; 2 =n -distributed (see Figure 19). So for large n it is quite probable that the outcome of Tn will be close to ƒ.F /. More precisely, one has for all " > 0: lim P jTn ƒ.F /j " D 0: n!1
By Proposition VII.2.2, this is the same as saying that the sequence of estimators T1 ; T2 ; : : : for ƒ.F / converges in distribution to the constant ƒ.F /. This, in turn, is the same as saying that ƒ is weakly consistent (see also Exercise 2). In the following there will be a discussion of the statistical meaning of the influence function F W R ! R belonging to a von Mises derivative ƒ0F of a functional ƒ W Dƒ ! R. This function was defined as F .x/
D ƒ0F .Ex /;
where x 2 R. In a more explicit form one may express ƒ .1 t /F C t Ex F .x/ D lim t t#0
F .x/
as
ƒ.F /
:
Let X1 ; X2 ; : : : be an infinite sample from a population with distribution function F . Now, by observing the influence function F carefully, robustness properties of the sequence of estimators b .X1 ; : : : ; Xn / .n D 1; 2; : : :/ Tn D ƒ F
of ƒ.F / can be read off. This could be explained by means of a “statistical Gedankenexperiment”: Suppose that a sample X1 ; : : : ; Xn from a population with distribution function F shows an outcome x1 ; : : : ; xn such that b .x1 ; : : : ; xn / D ƒ.F /: ƒ F b .x1 ; : : : ; xn / presents In other words, the outcome is such that the estimate ƒ F the exact value of ƒ.F /. Now, in an artificial way, one adds to the observations x1 ; : : : ; xn an extra “observation” x and then one takes a look how and to what extent this new observation disturbs the initial (perfect) estimate. More precisely, a look is taken at the discrepancy between the “estimates” b .x1 ; : : : ; xn / and ƒ F b .x1 ; : : : ; xn ; x/ : ƒ F The involved empirical distribution functions are related as 1 b .x1 ; : : : ; xn ; x/ D 1 b F nC1 F .x1 ; : : : ; xn / C
1 nC1 Ex :
./
The discrepancy between the mentioned estimates can now be expressed by the following “scaled” difference: b .x1 ; : : : ; xn ; x/ b .x1 ; : : : ; xn / ƒ F ƒ F : 1 nC1
By ./ this is the same as the quotient 1 b .x1 ; : : : ; xn / C ƒ .1 nC1 /F
1 nC1 Ex
1 nC1
b .x1 ; : : : ; xn / ƒ F
:
b .x1 ; : : : ; xn / For large n one may expect, by the Glivenko–Cantelli theorem, that F F . In this way the expression above is related to the quotient 1 1 ƒ .1 nC1 /F C nC1 Ex ƒ.F / : 1 nC1
It will be clear that this expression tends to F .x/ if n ! 1. Apparently the number F .x/ is a measure for the extent of disturbance, brought about by the added observation x. This point of view is called the infinitesimal approach of robustness. The influence function will now be determined in some standard cases. First of all the influence function of the mean: Proposition VII.10.2. Let ƒ W Dƒ ! R be the statistical functional defined by Z ƒ.F / WD x dPF .x/; where F is an element of the set Z ° ± Dƒ WD F 2 W jxjdPF .x/ < C1 : Then the influence function
F
belonging to ƒ and F is given by
F .x/
Dx
ƒ.F /
.x 2 R/:
Proof. As was explained in the beginning of this section, the von Mises derivative of ƒ in F is given by ƒ0F D ƒ ƒ.F /:
Consequently one has for all x 2 R F .x/
D ƒ0F .Ex / D ƒ.Ex /
ƒ.F / D x
ƒ.F /:
This proves the proposition.
Figure 20
It follows that, concerning the mean, an added observation x at a large distance from ƒ.F / brings about a large value (in absolute sense) of F .x/. The sample mean is apparently very sensitive to outliers. So, in the infinitesimal approach too, the sample mean is not robust (compare with Theorem VII.7.9).
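The influence function can also be approximated numerically straight from its definition as a right-hand derivative, $[\Lambda((1-t)F + tE_x) - \Lambda(F)]/t$ for a small $t$. A sketch (Python; the finite-difference step, the representation of $F$ by simulated draws and all names are assumptions of mine) which reproduces the unbounded behaviour $x - \Lambda(F)$ for the mean and, for comparison, a bounded curve for the median treated later in this section:

```python
import numpy as np

def influence_numeric(statistic, x, f_draws, t=0.01):
    """Finite-difference approximation of the influence function at x for the
    functional computed by `statistic`: [T((1-t)F + t*E_x) - T(F)] / t.
    F is represented by a large array of draws; the contaminated distribution
    (1-t)F + t*E_x is represented by appending the right number of copies of x."""
    f_draws = np.asarray(f_draws, dtype=float)
    m = int(round(t * len(f_draws) / (1 - t)))       # copies of x to append
    contaminated = np.concatenate([f_draws, np.full(m, float(x))])
    return (statistic(contaminated) - statistic(f_draws)) / t

rng = np.random.default_rng(0)
f_draws = rng.normal(size=100_000)                   # stands in for F = N(0, 1)
for x in (0.5, 3.0, 30.0):
    print(x,
          influence_numeric(np.mean, x, f_draws),    # grows like x - Lambda(F)
          influence_numeric(np.median, x, f_draws))  # stays bounded
```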
As another example, a look is taken now at the influence function of the functional $\Lambda$ defined by
$$\Lambda(F) := \tfrac12\{q_F(u-) + q_F(u+)\} = \tfrac12\{q_F(u) + q_F(u+)\},$$
where $q_F$ is the quantile function belonging to $F$ and $u$ some prescribed number in the interval $(0,1)$. This functional turns out to be von Mises differentiable in every
element F 2 that is continuously differentiable with F 0 > 0. In order to prove this one must take a look at the right-hand derivative of functions of the form t 7! q.1
t/F Ct G .u/
in the point t D 0. For the influence function G D Ex . In terms of the function F t;x WD .1
F .x/
it suffices to consider the case
t /F C t Ex
one has the following lemma: Lemma VII.10.3. Suppose F 2 has 8 u ˆ q ˆ ˆ F 1 t < qF t;x .u/ D x ˆ ˆ u t ˆ :qF 1 t
is continuously differentiable. If F 0 > 0 then one if u .1 if .1
t /F .x/;
t /F .x/ < u .1
if u > .1
t /F .x/ C t;
t /F .x/ C t:
Proof. See Exercise 53. As to the above, the reader could also consider a graphical “proof”. Using the previous lemma, it is easy to prove: Lemma VII.10.4. Let ƒ W ! R be the statistical functional given by ƒ.F / WD qF .u/: If F is continuously differentiable and if F 0 > 0, then the influence function belonging to ƒ and F is given by 8 0 ˆ ƒ.F /:
F
Proof. Under the mentioned conditions on F the quantile function qF W .0; 1/ ! R is just the inverse of F W R ! .0; 1/. Now suppose that x > ƒ.F / D qF .u/. Then F .x/ > u, which implies that for sufficiently small t one has .1 t /F .x/ > u. In this way it follows, using Lemma VII.10.3, that for small t one has u ƒ .1 t /F C t Ex D qF t;x .u/ D qF : 1 t
By definition, is to say:
F .x/
is just the right-hand derivative of this in the point t D 0. That F .x/
D u .qF /0 .u/:
./
In order to find an explicit form for the derivative .qF /0 .u/, note that F qF .v/ D v
for all v 2 .0; 1/:
Differentiation with respect to v leads to F 0 qF .v/ .qF /0 .v/ D 1:
By setting v D u one obtains
qF0 .u/ D
F0
1 1 D : 0 qF .u/ F ƒ.F /
In connection with ./ it thus appears that F .x/
D
F0
u ƒ.F /
if x > ƒ.F /:
For x ƒ.F / in a similar way an explicit expression for
F .x/
can be found.
It is left to the reader to verify that the functional ƒ W ! R defined by ƒ.F / WD qF .uC/ has the same influence function as the functional F 7! qF .u/. Summarizing one arrives at: Proposition VII.10.5. Let ƒ W ! R be statistical functional defined by ƒ.F / D
¯ 1® qF .u/ C qF .uC/ : 2
If F is continuously differentiable with F 0 > 0, then the influence function longing to ƒ and F is given by 8 0 ˆ ƒ.F /:
F
be-
Figure 21
The case u D 21 presents the median. The influence function in that case is sketched in Figure 21. It is clear that the influence function is bounded on R. This can be interpreted by saying that the median is not very sensitive to outliers. Furthermore a discontinuity in the point x D ƒ.F / can be observed. This is an indication that a small perturbation of an observation that is close to ƒ.F / can very well bring about a significant disturbance in the estimation process (think of rounding off errors). The size of the jump in the point x D ƒ.F / is inversely proportional to the value of the probability density of the population in the point x D ƒ.F /. If the value of the probability density in this point is small, then it is difficult to estimate ƒ.F / in a proper way. As a last topic in this section the influence function of the ˛-trimmed mean will be determined. Proposition VII.10.6. Let 0 < ˛ < by ƒ.F / WD
1 2
and let ƒ W ! R be the functional defined
1 1
2˛
Z
1 ˛
qF .u/ du:
˛
If F is continuously differentiable with F 0 > 0, then the influence function belonging to ƒ and F is given by 8 ± 1 ° ˆ ˆ ˛ q .1 ˛/ C .1 ˛/ q .˛/ ƒ.F / if x < qF .˛/; F F ˆ ˆ 1 2˛ ˆ ˆ < ± 1 ° if qF .˛/ x ˛ qF .1 ˛/ ˛ qF .˛/ C x ƒ.F / F .x/ D ˆ 1 2˛ qF .1 ˛/; ˆ ˆ ˆ ± ˆ 1 ° ˆ : .1 ˛/ qF .1 ˛/ ˛ qF .˛/ ƒ.F / if x > qF .1 ˛/: 1 2˛
Proof. For all x and all 0 < t < 1 one has ƒ .1
t /F C t Ex / t
ƒ.F /
D
1 1
2˛
Z
1 ˛ ˛
qF t;x .u/ qF .u/ du: t
In the integrand, if t # 0, the following expression emerges: g.x; u/ WD lim t#0
qF t;x .u/ qF .u/ : t
Using Lemma VII.10.3 it is easily verified that ´ u .qF /0 .u/ g.x; u/ D .u 1/ .qF /0 .u/
if u < F .x/; if u > F .x/:
By applying the mean value theorem, together with Lebesgue’s theorem on dominated convergence, it follows that Z 1 ˛ ƒ .1 t /F C t Ex / ƒ.F / 1 .x/ D lim D g.x; u/ du: F t 1 2˛ ˛ t#0
In order to find an explicit form for the integral on the right side, three cases are distinguished: Case 1. x < qF .˛/, or F .x/ < ˛. For such x one has the following expression for F .x/: Z 1 ˛ Z 1 ˛ 1 1 .x/ D g.x; u/ du D .u F 1 2˛ ˛ 1 2˛ ˛ Via partial integration one arrives at 1 ˛ 1 .u 1/ qF .u/ F .x/ D 1 2˛ ˛ 1 D ¹ ˛ qF .1 ˛/ C .1 1 2˛
1 1
2˛
Z
1/ .qF /0 .u/ du:
1 ˛
qF .u/ du ˛
˛/ qF .˛/º
ƒ.F /:
Case 2. $q_F(\alpha) \le x \le q_F(1-\alpha)$, or $\alpha \le F(x) \le 1-\alpha$. In this scenario the value of the influence function at $x$ can be written as
$$\frac{1}{1-2\alpha}\int_\alpha^{F(x)} u\,(q_F)'(u)\,du + \frac{1}{1-2\alpha}\int_{F(x)}^{1-\alpha} (u-1)\,(q_F)'(u)\,du.$$
Applying partial integration and using the fact that $q_F(F(x)) = x$, these integrals too can be turned into an explicit form.

Case 3. $x > q_F(1-\alpha)$, or $F(x) > 1-\alpha$. This case is left to the reader.
Influence functions belonging to ˛-trimmed means are of the form as sketched in Figure 22. Note that these functions are both bounded and continuous. It follows that there are no problems here as to sensitivity to outliers (compare the untrimmed mean); nor are there local problems (as in the case of the median).
Figure 22
VII.11 Bootstrap methods
In this section a peculiar method is discussed to estimate the distribution function of an arbitrary statistic, basing oneself on a sample. Let g W Rm ! R be a Borel function and let e g W ! be the associated map, as defined in §VII.7. When dealing with a population having a distribution function F , it is often relevant to have an impression of the distribution function e g .F /. Sometimes e g .F / can be determined in an analytical way. Two examples of this are given now. Example 1. Define g W Rm ! R by
g.x1 ; : : : ; xm / WD
x1 C C xm : m
If F is the distribution function belonging to the N.; 2 /-distribution, then e g .F / 2 is (by Proposition II.1.2) the distribution function belonging to the N.; =m/distribution. Example 2. Define g W Rm ! R by g.x1 ; : : : ; xm / WD
m X .xi iD1
x/2 :
If F is the distribution function belonging to N.; 2 /-distribution, then the function x 7! e g .F /.x 2 /
is (by Theorem II.2.12) the distribution function belonging to a 2 -distribution with m 1 degrees of freedom. In general, however, even when the distribution function of the population is exactly known, the situation is far too complex to determine e g .F / by means of analytical methods. Then one will have to content oneself with estimates of e g .F /. In cases where g is continuous, it is quite natural to make such estimates in the following way. Let X1 ; : : : ; Xn be a sample from a population with distribution function F , showing an outcome X1 D x1 ; : : : ; Xn D xn : By the Glivenko–Cantelli theorem it may be expected that for large n one has b .x1 ; : : : ; xn / F: F
If g W Rm ! R is continuous then so is (Theorem VII.7.1) the map e g W .; dL / ! .; dL /:
This means, roughly speaking, that
$$\widetilde{g}\bigl(\widehat{F}(x_1,\dots,x_n)\bigr) \approx \widetilde{g}(F).$$
In this way the distribution function $\widetilde{g}\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$ emerges as a natural estimate for $\widetilde{g}(F)$. An alternative interpretation of this distribution function $\widetilde{g}\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$ is now given. Let $X_1^*,\dots,X_m^*$ be a sample from a population with distribution function $\widehat{F}(x_1,\dots,x_n)$. Then $\widetilde{g}\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$ is just the distribution function of the statistic
$$T^* = g(X_1^*,\dots,X_m^*).$$
The (discrete) probability distribution of the variables $X_k^*$ is given by
$$P(X_k^* = x) = \frac{\#\{i : x_i = x\}}{n}. \qquad (*)$$
This means that actually one is sampling (with replacement) from the set of outcomes $\{x_1,\dots,x_n\}$ of the original sample $X_1,\dots,X_n$. It is as if one is trying to pull oneself up by one's own bootstraps:

Definition VII.11.1. The distribution function $\widetilde{g}\bigl(\widehat{F}(x_1,\dots,x_n)\bigr)$ is called the bootstrap distribution function belonging to $x_1,\dots,x_n$.
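In practice the bootstrap distribution function is almost always approximated by Monte Carlo simulation: one repeatedly resamples with replacement from $\{x_1,\dots,x_n\}$ and recomputes the statistic. A minimal sketch (Python; the names, the number of resamples and the choice of the statistic are illustrative, and the resample size is taken equal to the original sample size, the usual choice):

```python
import numpy as np

def bootstrap_distribution(sample, statistic, n_resamples=2000, rng=None):
    """Monte Carlo approximation of the bootstrap distribution of a statistic:
    draw len(sample) observations with replacement from the outcomes
    x_1, ..., x_n and evaluate the statistic on each resample."""
    rng = np.random.default_rng(rng)
    sample = np.asarray(sample, dtype=float)
    return np.array([statistic(rng.choice(sample, size=len(sample), replace=True))
                     for _ in range(n_resamples)])

outcome = [4.1, 3.7, 5.2, 4.8, 4.4, 3.9, 5.0, 4.6]
boot_means = bootstrap_distribution(outcome, np.mean, rng=1)
# The empirical distribution of boot_means approximates the bootstrap
# distribution function belonging to the outcome (here for the sample mean).
print(boot_means.mean(), boot_means.std())
```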
In order to understand the computational problems involved when determining a bootstrap distribution function, an underlying probability space . ; A ; P / for the sto / is now constructed: chastic vector .X1 ; : : : ; Xm 8 ˆ ˆ ˆ
0º:
Now let X1 ; X2 ; : : : be an infinite sample from a population with a continuous distribution function F 2 Dƒ . The underlying probability space of the X1 ; X2 ; : : : will, as always, be denoted by .; A; P /. For all ! 2 one now has b X1 .!/; : : : ; Xn .!/ D min X1 .!/; : : : ; Xn .!/ : Tn .!/ D ƒ F As before, denote D ƒ.F /. Because F is presumed to be continuous, one has F . / D 0. From this it follows that n p Gn .0/ D P n .Tn / 0 D P .Tn / D 1 1 F . / D 0: Next, define the set A by
A WD ¹! 2 W there is no tie in the sequence X1 .!/; X2 .!/; : : :º: It can be proved (see Exercise 54) that A 2 A and that P .A/ D 1. Now choose a fixed ! 2 A and write X1 .!/ D x1 ; X2 .!/ D x2 ; : : : ; Xn .!/ D xn : The bootstrap statistic Tn .!/ belonging to Tn assumes in this case the form Tn .!/ D min.X1 ; : : : ; Xn /;
b .x1 ; : : : ; xn / as its distribuwhere X1 ; : : : ; Xn is a sample from a population with F tion function. Moreover, one has b .x1 ; : : : ; xn / D min.x1 ; : : : ; xn / D Tn .!/: ƒ F
It is easy to see that
p n Tn .!/ Tn .!/ 0 D P Tn .!/ D min.x1 ; : : : ; xn / D 1
Gn .!/.0/ D P
Hence
lim jGn .!/.0/
n!1
Gn .0/j D lim
n!1
Consequently one has for all ! 2 A lim kGn .!/
n!1
°
1
Gn k1 lim kGn .!/ n!1
1
1 n : n
1
1 n ± D1 n
Gn k 1 1
1 : e
1 > 0: e
Because P .A/ D 1, this implies that the bootstrap does not work in this example. In the following the distribution function of the N.0; 2 /-distribution will be denoted by ˆ , so Z x 1 1 2 ˆ .x/ WD p e 2 .t=/ dt: 1 2 As said before, it is often so that p
n .Tn
/ .n D 1; 2; : : :/
converges in distribution to an N.0; 2 /-distribution. In other words, in many cases there exists a > 0 such that lim dL .Gn ; ˆ / D 0:
n!1
The next proposition provides a sufficient condition for the bootstrap to be working. Proposition VII.11.2. If there exists a > 0 such that lim dL .Gn ; ˆ / D 0
n!1
then the bootstrap works.
and
lim dL .Gn ; ˆ / D 0
n!1
strongly;
Proof. Because ˆ is continuous, one can apply Theorem VII.5.6 and conclude that lim kGn
n!1
ˆ k1 D 0 and
lim kGn
n!1
ˆ k1 D 0
strongly:
Of course this implies that kGn
Gn k1 ! 0 strongly:
That is to say, the bootstrap works. It can be proved that for various types of statistical functionals the premises in the proposition above are fulfilled. For example the functional belonging to the mean Z ƒ.F / WD .F / D x dPF .x/; and the functional belonging to the variance Z ƒ.F / WD 2 .F / D x
2 .F / dPF .x/
satisfy the premises of the mentioned proposition (see [9]). So the bootstrap works in these cases. Today, in an era where computational techniques become more and more powerful, bootstrap methods present an extremely useful tool in statistics. Their theoretical aspects, however, are still far from fully understood and remain the subject of many deep analyses. For further reading, see for example [9], [20].
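The failure of the bootstrap for the sample minimum discussed earlier can also be observed numerically: a large fraction of the bootstrap replicates of the minimum coincides with the observed minimum, exactly as in the computation $1 - (1 - 1/n)^n$ carried out above, whereas under the true (continuous) sampling distribution this event has probability zero. A rough sketch (Python; the exponential population, sample sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_boot = 50, 5000

# One observed sample from a continuous population (exponential, as an example).
x = rng.exponential(scale=1.0, size=n)
t_n = x.min()                                  # T_n = min(X_1, ..., X_n)

# Bootstrap draws of the minimum: minima of resamples drawn with replacement.
boot_mins = np.array([rng.choice(x, size=n, replace=True).min()
                      for _ in range(n_boot)])

# Fraction of bootstrap minima equal to the observed minimum: close to
# 1 - (1 - 1/n)**n, i.e. roughly 1 - 1/e, mirroring the argument in the text.
print((boot_mins == t_n).mean(), 1 - (1 - 1/n) ** n)
```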
VII.12 Estimation of densities by means of kernel densities
Instead of making estimates of the distribution function of a population, one may be willing to make estimates of the probability density $f$ of the population. There are several methods to do this. In this section one of them is discussed. Suppose $X_1,\dots,X_n$ is a sample from a population with probability density $f$. With the outcome $x_1,\dots,x_n$ of such a sample a so-called empirical density function $\widehat{f}$ is associated, defined by
$$\widehat{f}(x) := \frac{\#\{i : x_i = x\}}{n} \qquad (x \in \mathbb{R}).$$
This empirical density function is always a discrete probability density. If beforehand it is known that the probability density of the population is also discrete, it is quite b as being an estimate of f . natural to look upon f
Example 1. One is dealing with a biased die. There is a wish to get an impression of the probability density concerning the outcomes generated by this die. To this end the die is thrown 100 times, leading to a sample $X_1,\dots,X_{100}$, whereupon the frequencies of the outcomes $1, 2, \dots, 6$ are counted. The result is summarized in the following scheme:

outcome      1    2    3    4    5    6
frequency   21    9    8   12   19   31

The empirical density function $\widehat{f}$ belonging to this outcome is given by
$$\widehat{f}(1) = \frac{21}{100}, \quad \widehat{f}(2) = \frac{9}{100}, \quad \widehat{f}(3) = \frac{8}{100}, \quad \dots, \quad \widehat{f}(x) = 0 \ \text{ if } \ x \ne 1, 2, 3, 4, 5, 6.$$
The real probability density f belonging to this die can now be estimated by means b. of f
If, however, the population has an absolutely continuous probability distribution, then the empirical density function is not at all appropriate to estimate the density f of the population. An example of this is:
Example 2. It is assumed that the height of a randomly chosen Dutchman has an absolutely continuous probability distribution with density $f$. A sample of size 3 shows as its outcome (in centimeters):
$$167.0, \quad 172.6, \quad 182.3.$$
The associated empirical density function is now given by
$$\widehat{f}(167.0) = \tfrac13, \quad \widehat{f}(172.6) = \tfrac13, \quad \widehat{f}(182.3) = \tfrac13, \quad \widehat{f}(x) = 0 \ \text{ if } \ x \ne 167.0,\, 172.6,\, 182.3.$$
b as a final estimate for f then the estimate of the probability correWhen using f sponding to a height strictly between 168:0 and 172:0 centimeters would be zero. Of course this is untenable. When dealing with populations having absolutely continuous probability distributions, one often uses the empirical density function in a “spread out form” as an estimate for the population density. In the following this is discussed in more detail. Let X1 ; : : : ; Xn be a sample from a population having an absolutely continuous distribution with density f . Suppose that the sample shows an outcome x1 ; : : : ; xn
b be the associated empirical density function. The probability distribution and let f b , expressed as a Borel measure, is then given by belonging to f n
X bD 1 ıxi : P n iD1
The involved Dirac measures will now be smoothened. As a first step this is carried out for the Dirac measure ı0 in the point 0. Recall that ı0 presents the probability distribution belonging to a stochastic variable that assumes with probability 1 the outcome 0. Define the continuous probability density k W R ! Œ0; C1/ by 8 ˆ 1 x 0; 0 defined by K .x/ WD K.x=/ for all x 2 R. Now K is the distribution function belonging to the continuous density k , given by 1 k .x/ WD k.x=/
C
Figure 23
for all x 2 R:
A stochastic variable that has k as its probability density assumes with probability 1 an outcome in the interval . ; C/ and the number 0 is the outcome of maximum likelihood (see Figure 23). The distribution function K is of the form 8 0 if x ; ˆ ˆ ˆ ˆ ˆ if x C; 0/: K F
The function K , having k as its derivative, is of course continuously differentiable. By Lemma VII.12.1 this derivative is bounded on R. Applying Proposition VII.6.4 b is continuously differentiable. As an estimate for the one may conclude that K F density f of the population the derivative of b /0 .K F
b ? is now taken. How is this estimate for f related to the kernel density estimate f It turns out that they are the same. To see this, note that the Borel measure P b on R, F b , is given by belonging to the distribution function F n
X bD 1 Pb D P ıxi : F n iD1
By Proposition VII.6.4 one therefore has
b /0 D .K P /0 D K 0 P D k P : .K F b b b F F F
Hence, by definition of a convolution product, one has for all x 2 R Z n 1X k .x xi / .k Pb /.x/ D k .x y/ dP .y/ D b F F n iD1
n 1X1 x xi b .x/: k Df D n iD1
One may conclude that
b: b /0 D k P D f .K F b F
In the remainder of this section the asymptotic behavior of kernel density estimates b .x/ depends on the outcome will be studied. To start with, note that the value of f x1 ; : : : ; xn of the sample X1 ; : : : ; Xn , for one has n
X b .x/ D 1 f k .x n
xi /:
iD1
b .x/ as a statistic and for this reason it will Therefore, for fixed x, one may regard f be denoted by n
X b .x/ D f b .X1 ; : : : ; Xn /.x/ WD 1 f k .x n
Xi /:
iD1
In order to prove the main result of this section (Theorem VII.12.5), first some preparatory work has to be done.
Proposition VII.12.2. One always has b .x/ D E f b .X1 ; : : : ; Xn /.x/ D .k f /.x/; E f where
.k f /.x/ D
Z
k .x
t /f .t / dt:
Proof. Suppose that X1 ; : : : ; Xn is a sample from a population with density f . Then fXi D f and consequently Z C1 E k .x Xi / D k .x t /f .t / dt: 1
The proposition follows directly from this.
The next lemma is a well-known result in mathematical analysis:

Lemma VII.12.3. If $x$ is a point of continuity of $f$, then
\[
\lim_{\lambda\downarrow 0}(k_\lambda * f)(x) = f(x).
\]
Proof. A proof is given here for the case where $k$ has a compact support, that is to say, the case where there is a number $A\ge 0$ such that
\[
\{t : k(t)\ne 0\}\subset[-A,+A].
\]
It follows from this that $k_\lambda(t)=0$ if $|t|\ge\lambda A$. Therefore one may write
\[
\begin{aligned}
\big|(k_\lambda * f)(x)-f(x)\big|
&= \Big|\int k_\lambda(x-t)\,f(t)\,dt - f(x)\Big|
 = \Big|\int k_\lambda(t-x)\,f(t)\,dt - f(x)\Big|\\
&= \Big|\int k_\lambda(t)\,f(t+x)\,dt - \int k_\lambda(t)\,f(x)\,dt\Big|
 = \Big|\int k_\lambda(t)\,\big(f(t+x)-f(x)\big)\,dt\Big|\\
&\le \int_{-\lambda A}^{+\lambda A} k_\lambda(t)\,\big|f(t+x)-f(x)\big|\,dt
 \le \sup_{|t|\le\lambda A}\big|f(t+x)-f(x)\big|\int_{-\lambda A}^{+\lambda A} k_\lambda(t)\,dt\\
&= \sup_{|t|\le\lambda A}\big|f(t+x)-f(x)\big|.
\end{aligned}
\]
If $f$ is continuous in $x$, then the last expression in this chain of inequalities converges to zero if $\lambda\downarrow 0$. Therefore
\[
\lim_{\lambda\downarrow 0}\big|(k_\lambda * f)(x)-f(x)\big| = 0.
\]
This proves the lemma.

Lemma VII.12.4. Let $Y_1,Y_2,\ldots$ be a sequence of stochastic variables such that

(i) $E(Y_n)\to c$ if $n\to\infty$,

(ii) $\operatorname{var}(Y_n)\to 0$ if $n\to\infty$.

Then the sequence $Y_1,Y_2,\ldots$ converges in probability to $c$.

Proof. See Exercise 2.

Theorem VII.12.5. Let $X_1,X_2,\ldots$ be an infinite sample from a population with probability density $f$. Furthermore, let $\lambda_1,\lambda_2,\ldots$ be a sequence of positive real numbers such that $\lambda_n\to 0$ and $n\lambda_n^2\to+\infty$. Then for every point $x$ of continuity of $f$ one has
\[
\lim_{n\to\infty}\hat f_{\lambda_n}(X_1,\ldots,X_n)(x) = f(x) \quad\text{in probability}.
\]
Proof. (Split up into three steps.)
Step 1. There exists a constant $M\ge 0$ such that for all $\lambda>0$ the following inequality holds:
\[
\operatorname{var}\big(k_\lambda(x-X_i)\big)\le\frac{M^2}{\lambda^2}.
\]
This is not difficult to see. Namely, by Lemma VII.12.1 the function $k$ is bounded on $\mathbb{R}$. Therefore, there is a number $M>0$ such that
\[
0\le k(y)\le M \quad\text{for all } y\in\mathbb{R}.
\]
By definition of $k_\lambda$ this implies that
\[
0\le k_\lambda(y)\le\frac{M}{\lambda}\qquad(y\in\mathbb{R},\ \lambda>0).
\]
Hence
\[
0\le k_\lambda(x-X_i)\le\frac{M}{\lambda},
\]
which implies that
\[
0\le E\big[k_\lambda(x-X_i)\big]\le\frac{M}{\lambda}.
\]
Consequently
\[
\big|\,k_\lambda(x-X_i)-E\big[k_\lambda(x-X_i)\big]\,\big|\le\frac{M}{\lambda}.
\]
It follows that
\[
\Big(k_\lambda(x-X_i)-E\big[k_\lambda(x-X_i)\big]\Big)^2\le\frac{M^2}{\lambda^2},
\]
so that
\[
\operatorname{var}\big(k_\lambda(x-X_i)\big)
= E\Big[\Big(k_\lambda(x-X_i)-E\big[k_\lambda(x-X_i)\big]\Big)^2\Big]
\le\frac{M^2}{\lambda^2}.
\]
This proves Step 1.

Step 2. Because the variables $k_\lambda(x-X_1),\ldots,k_\lambda(x-X_n)$ are statistically independent one may write
\[
\operatorname{var}\big(\hat f_\lambda(X_1,\ldots,X_n)(x)\big)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{var}\big(k_\lambda(x-X_i)\big)
\le\frac{M^2}{n\lambda^2}.
\]

Step 3. Now set $Y_n:=\hat f_{\lambda_n}(X_1,\ldots,X_n)(x)$. Using Proposition VII.12.2 and Lemma VII.12.3 it follows that, if $f$ is continuous in $x$, one has

(i) $E(Y_n) = E\big[\hat f_{\lambda_n}(X_1,\ldots,X_n)(x)\big] = (k_{\lambda_n}*f)(x)\to f(x)$.

Using Step 2, together with the fact that $n\lambda_n^2\to+\infty$, it follows that

(ii) $\operatorname{var}(Y_n) = \operatorname{var}\big(\hat f_{\lambda_n}(X_1,\ldots,X_n)(x)\big)\le\dfrac{M^2}{n\lambda_n^2}\to 0$.
Applying Lemma VII.12.4 it follows that $Y_n\to f(x)$ in probability. This completes the proof of the theorem.

Remark. Roughly speaking, the theorem above says that for large $n$ the estimator $\hat f_{\lambda_n}(X_1,\ldots,X_n)(x)$ performs well, provided the parameter $\lambda$ is chosen in a proper way. On the one hand, in order to have the estimator $\hat f_\lambda(X_1,\ldots,X_n)(x)$ approximately unbiased, one must choose $\lambda$ so small that
\[
E\big[\hat f_\lambda(X_1,\ldots,X_n)(x)\big] = (k_\lambda * f)(x)\approx f(x).
\]
On the other hand, $\lambda$ should not be chosen too small, because one likes to see a large value for $n\lambda^2$: then the inequality
\[
\operatorname{var}\big(\hat f_\lambda(X_1,\ldots,X_n)(x)\big)\le\frac{M^2}{n\lambda^2}
\]
implies a small value for the variance of the estimator (which is of course also a desirable property for an estimator). For further reading, see [34], [68], [73].
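This trade-off is easy to observe in a small simulation. The sketch below is purely illustrative (the standard normal population, the sample size and the values of $\lambda$ are arbitrary choices, not taken from the text); it estimates the bias and the variance of $\hat f_\lambda(X_1,\ldots,X_n)(0)$ by repeated sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_hat_at(x, sample, lam):
    """Kernel estimate in a single point x, triangular kernel as above."""
    u = (x - sample) / lam
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0).sum() / (sample.size * lam)

n, x0 = 200, 0.0
true_density_at_x0 = 1.0 / np.sqrt(2.0 * np.pi)   # N(0,1) density in the point 0

for lam in (0.05, 0.3, 1.0, 3.0):
    estimates = np.array([f_hat_at(x0, rng.standard_normal(n), lam)
                          for _ in range(2000)])
    bias = estimates.mean() - true_density_at_x0
    print(f"lambda={lam:4.2f}  bias={bias:+.4f}  variance={estimates.var():.5f}")
```

Small values of $\lambda$ give a small bias but a large variance; large values of $\lambda$ give the opposite, in accordance with the remark above.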
VII.13
Estimation of densities by means of histograms
In this section a technique will be discussed to make estimates of the probability density of a population by means of functions that are called "histograms". An interval in $\mathbb{R}$ will systematically be denoted by $I$ and its length by $|I|$. A sequence of intervals $I_1,I_2,\ldots$ is called an interval partition of $\mathbb{R}$ if

(i) $I_p\cap I_q=\emptyset$ if $p\ne q$,

(ii) $\mathbb{R}=\bigcup_{p\in\mathbb{N}} I_p$.

Such a sequence of intervals will be denoted by $\mathcal{I}$, so $\mathcal{I}:=(I_p)_{p\in\mathbb{N}}$. Besides this the following notation will be used:
\[
|\mathcal{I}| := \sup_p |I_p| \quad\text{and}\quad \delta(\mathcal{I}) := \inf_p |I_p|.
\]
It will systematically be assumed (without saying it again and again) that
\[
0 < \delta(\mathcal{I}) \le |\mathcal{I}| < +\infty. \tag{$*$}
\]
If all intervals belonging to $\mathcal{I}$ have one and the same length $\varepsilon$, then one has of course
\[
|\mathcal{I}| = \delta(\mathcal{I}) = \varepsilon. \tag{$**$}
\]
In that case the condition $(*)$ is automatically fulfilled. The analysis that follows simplifies a bit when starting from $(**)$ instead of $(*)$.

Definition VII.13.1. A step function based on the intervals of $\mathcal{I}$ is a function of the form
\[
h = \sum_p c_p\,1_{I_p}.
\]
If such a function presents a probability density, then one talks about a histogram.
Now let $X_1,\ldots,X_n$ be a sample from a population having an absolutely continuous probability distribution with density $f$. Furthermore, let $\mathcal{I}$ be any prescribed interval partition of $\mathbb{R}$. Then with every given outcome $x_1,\ldots,x_n$ of the sample one associates a histogram given by
\[
h = h(\mathcal{I};\,x_1,\ldots,x_n) := \sum_p c_p\,1_{I_p},
\quad\text{where}\quad
c_p = \frac{\#\{i : x_i\in I_p\}}{n\,|I_p|}.
\]
It is easily verified that the step function $h$ defined above presents a probability density, hence it is indeed a histogram. This $h$ can be used as an estimate for $f$. One then talks about a histogram estimate. The main theorem of this section (Theorem VII.13.2) is a song of praise on this way of making estimates. In order to prove this theorem, first a closer look will be taken at the coefficients $c_p$. Given an interval partition $\mathcal{I}$, these coefficients depend on the sample outcome $x_1,\ldots,x_n$. For this reason, one could look upon $c_p$ as being a statistic. Therefore one could write
\[
c_p = c_p(X_1,\ldots,X_n).
\]
In this notation one has
\[
c_p(X_1,\ldots,X_n) = \frac{\#\{i : X_i\in I_p\}}{n\,|I_p|} = \frac{1}{n\,|I_p|}\sum_{i=1}^{n}1_{I_p}(X_i).
\]
The variables $1_{I_p}(X_1),\ldots,1_{I_p}(X_n)$ can assume the values $0$ and $1$ only. Hence they have a Bernoulli distribution the parameter of which is given by
\[
\theta = P\big(1_{I_p}(X_i)=1\big) = P(X_i\in I_p) = \int_{I_p} f(t)\,dt.
\]
By applying Proposition I.4.2 it follows that the variables $1_{I_p}(X_1),\ldots,1_{I_p}(X_n)$ form a statistically independent system. This makes it possible to determine the expectation value and the variance of the variables $c_p(X_1,\ldots,X_n)$:

Lemma VII.13.1. One has

(i) $\displaystyle E\big[c_p(X_1,\ldots,X_n)\big] = \frac{1}{|I_p|}\int_{I_p} f(t)\,dt$,

(ii) $\displaystyle \operatorname{var}\big(c_p(X_1,\ldots,X_n)\big) = \frac{1}{n\,|I_p|^2}\int_{I_p} f(t)\,dt\int_{I_p^{\,c}} f(t)\,dt \le \frac{1}{n\,|I_p|}\sup_{t\in I_p} f(t)$.

Proof. See Exercise 59.
The main theorem of this section can now be proved. Recall that, given a sample $X_1,\ldots,X_n$ and a prescribed interval partition $\mathcal{I}$ of $\mathbb{R}$, the associated histogram was defined as
\[
h(\mathcal{I};\,X_1,\ldots,X_n) = \sum_p c_p(X_1,\ldots,X_n)\,1_{I_p}.
\]
Now for fixed $x\in\mathbb{R}$ the expression $h(\mathcal{I};\,X_1,\ldots,X_n)(x)$ presents a real-valued statistic.

Theorem VII.13.2. Let $X_1,X_2,\ldots$ be an infinite sample from an absolutely continuous population with probability density $f$. Suppose that $\mathcal{I}_1,\mathcal{I}_2,\ldots$ is a sequence of interval partitions of $\mathbb{R}$ such that
\[
|\mathcal{I}_n|\to 0 \quad\text{and}\quad n\,\delta(\mathcal{I}_n)\to+\infty \qquad\text{if } n\to\infty.
\]
Then in every point of continuity $x$ of $f$ one has
\[
\lim_{n\to\infty} h(\mathcal{I}_n;\,X_1,\ldots,X_n)(x) = f(x) \quad\text{in probability}.
\]
Proof. Choose and fix a point $x$ in which $f$ is continuous. Suppose that the interval partition $\mathcal{I}_n$ consists of the intervals $I_1^n,I_2^n,\ldots$. In this sequence the interval that contains $x$ will be denoted by $I^n(x)$ and the rank number of this interval will be denoted by $p(n,x)$. So, in this notation, one has
\[
I^n(x) = I^n_{p(n,x)} \quad\text{and}\quad h(\mathcal{I}_n;\,X_1,\ldots,X_n)(x) = c_{p(n,x)}(X_1,\ldots,X_n).
\]
By Lemma VII.13.1 one therefore has
\[
E\big[h(\mathcal{I}_n;\,X_1,\ldots,X_n)(x)\big] = \frac{1}{|I^n(x)|}\int_{I^n(x)} f(t)\,dt. \tag{$*$}
\]
Typography can be simplified by setting $Y_n:=h(\mathcal{I}_n;\,X_1,\ldots,X_n)(x)$. The remainder of the proof is now split up into two steps.
Step 1. $E(Y_n)\to f(x)$ if $n\to\infty$. This is well known from mathematical analysis; it can be proved as follows:
\[
\begin{aligned}
\Big|\frac{1}{|I^n(x)|}\int_{I^n(x)} f(t)\,dt - f(x)\Big|
&= \Big|\frac{1}{|I^n(x)|}\int_{I^n(x)} f(t)\,dt - \frac{1}{|I^n(x)|}\int_{I^n(x)} f(x)\,dt\Big|\\
&= \frac{1}{|I^n(x)|}\Big|\int_{I^n(x)}\big(f(t)-f(x)\big)\,dt\Big|
\le \frac{1}{|I^n(x)|}\int_{I^n(x)}\big|f(t)-f(x)\big|\,dt\\
&\le \frac{1}{|I^n(x)|}\,|I^n(x)|\sup_{t\in I^n(x)}\big|f(t)-f(x)\big|
= \sup_{t\in I^n(x)}\big|f(t)-f(x)\big|.
\end{aligned}
\]
Because $f$ is continuous in $x$ and $|I^n(x)|\to 0$ if $n\to\infty$, the last expression in this chain of inequalities converges to zero if $n\to\infty$. This proves Step 1.

Step 2. $\operatorname{var}(Y_n)\to 0$ if $n\to\infty$. To see this, note that by $(*)$ and Lemma VII.13.1 one has
\[
\operatorname{var}(Y_n) = \operatorname{var}\big(c_{p(n,x)}(X_1,\ldots,X_n)\big)\le\frac{1}{n\,|I^n(x)|}\sup_{t\in I^n(x)} f(t).
\]
The assumption that $f$ is continuous in $x$ implies that
\[
\sup_{t\in I^n(x)} f(t)\to f(x) \quad\text{if } n\to\infty.
\]
Furthermore $|I^n(x)|\ge\delta(\mathcal{I}_n)$. Together with the premise that $n\,\delta(\mathcal{I}_n)\to+\infty$ if $n\to\infty$, this implies that
\[
\operatorname{var}(Y_n)\to 0 \quad\text{if } n\to\infty.
\]
Combining Step 1, Step 2 and Lemma VII.12.4 it follows that $Y_n\to f(x)$ in probability. This completes the proof.

The theorem above praises the asymptotic behavior of histogram estimates. However, there are also shortcomings. For example, a histogram is always a discontinuous function. Furthermore, the shape of a histogram depends strongly on the choice of the interval partition $\mathcal{I}$ (see Exercise 58). For further reading, see for example [34], [68], [73].
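For the equidistant partition of $(**)$ the histogram estimate is just as easy to compute as the kernel estimate of the previous section. The following minimal sketch is an added illustration (the function name, the choice $I_p=[p\varepsilon,(p+1)\varepsilon)$ and the toy data are not from the text); it returns the nonzero coefficients $c_p$.

```python
import numpy as np

def histogram_estimate(sample, eps):
    """Histogram estimate for the interval partition I_p = [p*eps, (p+1)*eps),
    p in Z (the equidistant case (**)).  Returns a dict {p: c_p} with
    c_p = #{i : x_i in I_p} / (n * eps); all other coefficients are zero."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    indices = np.floor(sample / eps).astype(int)   # the p with x_i in I_p
    p_values, counts = np.unique(indices, return_counts=True)
    return {int(p): c / (n * eps) for p, c in zip(p_values, counts)}

heights = np.array([168.2, 170.1, 171.9, 169.5, 173.0, 172.4])
h = histogram_estimate(heights, eps=2.0)
print(h)                                   # the nonzero coefficients c_p
print(sum(c * 2.0 for c in h.values()))    # total mass: equals 1.0
```

Shifting the partition (for instance to $I_p=[p\varepsilon+\tfrac12\varepsilon,(p+1)\varepsilon+\tfrac12\varepsilon)$) generally changes the shape of the histogram, which illustrates the dependence on $\mathcal{I}$ mentioned above.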
VII.14
Exercises
Detailed solutions to all exercises can be found in [2].

1. A sample $X_1,X_2,X_3,X_4$ from a population with unknown distribution function $F$ shows the following outcome:
\[
4.5,\quad 1.0,\quad 2.0,\quad 3.5.
\]
Make a sketch of the corresponding empirical distribution function.

2. Let $X_1,X_2,\ldots$ be an infinite sample from a population with the probability density $f(\cdot\,;\theta)$, where $\theta\in\Theta$. Furthermore, let $\gamma:\Theta\to\mathbb{R}$ be a characteristic of the population and $T_n$ an estimator of $\gamma$, based on the finite sample $X_1,\ldots,X_n$. A sequence of estimators $(T_n)$ is called asymptotically unbiased if
\[
\lim_{n\to\infty} E_\theta(T_n) = \gamma(\theta) \quad\text{for all }\theta\in\Theta.
\]
Quite in tune with Definition VII.9.3 the sequence $T_1,T_2,\ldots$ will be called weakly (strongly) consistent if
\[
\lim_{n\to\infty} T_n = \gamma(\theta) \quad\text{in distribution (strongly)}.
\]
(i) If the sequence $T_1,T_2,\ldots$ of estimators of $\gamma$ is asymptotically unbiased and if $\operatorname{var}(T_n)\to 0$ if $n\to\infty$, then $(T_n)$ is weakly consistent. Prove this.

(ii) Let $X_1,X_2,\ldots$ be an infinite sample from a population with variance $\sigma^2$. Assume that the fourth moment of the population exists. Prove that the sample variances
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^2 \qquad (n=2,3,\ldots)
\]
present a weakly consistent sequence of estimators of $\sigma^2$.

3. Let $(\Omega,\mathcal{A},P)$ be an arbitrary probability space and let $X_n:\Omega\to\mathbb{R}$ and $X:\Omega\to\mathbb{R}$ be stochastic variables.

(i) Prove that $\{\omega\in\Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}\in\mathcal{A}$.

(ii) Prove that $\{\omega\in\Omega : \lim_{n\to\infty} X_n(\omega)\ \text{exists}\}\in\mathcal{A}$.
4. Give an example of

(i) a sequence of stochastic variables that converges in distribution, but not in probability.

(ii) a sequence of stochastic variables that converges in probability, but not strongly.

5. If the sequence $X_1,X_2,\ldots$ converges in probability to $X$, then there is a subsequence $X_{n_1},X_{n_2},\ldots$ such that
\[
\lim_{k\to\infty} X_{n_k} = X \quad\text{strongly}.
\]
Prove this statement.

6. Let $X_1,\ldots,X_n$ be a sample from a population with distribution function $F$. If $F(a)=F(b)$, then the stochastic variables $\hat F(X_1,\ldots,X_n)(a)$ and $\hat F(X_1,\ldots,X_n)(b)$ are essentially equal. Prove this.

7. Let $X_1,\ldots,X_n$ be a sample from a population with distribution function $F$. Determine
\[
\operatorname{cov}\big(\hat F(X_1,\ldots,X_n)(a),\ \hat F(X_1,\ldots,X_n)(b)\big)
\]
for all $a,b\in\mathbb{R}$.

8. Let $F:\mathbb{R}\to[0,1]$ be an arbitrary distribution function and $q:(0,1)\to\mathbb{R}$ the corresponding quantile function in the sense of Definition VII.2.4.

(i) Show that $q$ is an increasing function, taking on its values in $\mathbb{R}$.

(ii) Check that $\{y : u\le F(y)\} = [q(u),+\infty)$.

(iii) Prove the following equivalence: $u\le F(x)\iff q(u)\le x$.

(iv) Prove that always $F\big(q(u)\big)\ge u$.

(v) Prove that $q$ is always a left-continuous function.

(vi) Show that for all $\varepsilon>0$ one has $F\big(q(u)-\varepsilon\big)<u$.

(vii) If $F$ is continuous in $q(u)$ then $F\big(q(u)\big)=u$. Prove this.

(viii) Prove that
\[
x\in\big[q(u),\,q(u+)\big)\implies F(x)=u.
\]
(ix) If $X$ is a stochastic variable with a continuous distribution function $F$, then $F(X)$ is uniformly distributed on the interval $(0,1)$. Prove this statement.

(x) If $X$ is a stochastic variable with distribution function $F$ and if $F(X)$ is uniformly distributed on $(0,1)$, then $F$ is continuous. Prove this.

9. A sample $X_1=X$, of size 1, is drawn from a population with continuous distribution function $F$.

(i) Describe for every possible outcome of $X$ the empirical distribution function $\hat F$.

(ii) Show that
\[
D_1 = \sup_{x\in\mathbb{R}}\big|\hat F(x)-F(x)\big| = \|\hat F-F\|_\infty = \max\big(F(X),\ 1-F(X)\big).
\]
(iii) Show that the distribution function of $D_1$ is given by
\[
d\mapsto\begin{cases} 0 & \text{if } d<\tfrac12,\\ 2d-1 & \text{if } \tfrac12\le d\le 1,\\ 1 & \text{if } d>1.\end{cases}
\]
If $\lambda_i>0$ then both $X_i$ and $Y_i$ are $N(\mu_i,\lambda_i)$-distributed; if $\lambda_i=0$ then both $X_i$ and $Y_i$ are essentially equal to the constant $\mu_i$. Hence for all $i$ the variables $X_i$ and $Y_i$ are identically distributed. Moreover, both the system $\{X_1,\ldots,X_p\}$ and the system $\{Y_1,\ldots,Y_p\}$ are statistically independent. Together this implies that $X=(X_1,\ldots,X_p)$
and $Y=(Y_1,\ldots,Y_p)$ are identically distributed. This proves Step 1.

Step 2. (The general case.) Suppose $X$ and $Y$ are Gaussian distributed with
\[
E(X)=E(Y) \quad\text{and}\quad C(X)=C(Y).
\]
Then (see §VIII.1(j)) there exists an orthogonal linear operator $Q$ such that
\[
Q^*C(X)\,Q = Q^*C(Y)\,Q = \operatorname{diag}(\lambda_1,\ldots,\lambda_p).
\]
It follows that both $Q^*X$ and $Q^*Y$ are stochastic vectors with uncorrelated components. Hence (see the corollary to Theorem VIII.4.2) the components of both $Q^*X$ and $Q^*Y$ constitute statistically independent systems. Furthermore, by Proposition VIII.4.1, the components of $Q^*X$ and $Q^*Y$ are Gaussian distributed scalar variables. Summarizing, both $Q^*X$ and $Q^*Y$ have an elementary Gaussian distribution. Moreover, using Propositions VIII.2.2 and VIII.2.6, one has
\[
E(Q^*X)=Q^*E(X)=Q^*E(Y)=E(Q^*Y)
\]
and
\[
C(Q^*X)=Q^*C(X)\,Q=Q^*C(Y)\,Q=C(Q^*Y).
\]
By Step 1 this implies that $Q^*X$ and $Q^*Y$ are identically distributed. However, then (see §I.11, Exercise 27) the variables
\[
(Q^*)^{-1}Q^*X = X \quad\text{and}\quad (Q^*)^{-1}Q^*Y = Y
\]
are also identically distributed. This proves Step 2 and the theorem.

The previous theorem allows for the following convention.

Convention. A stochastic vector $X$ is said to be $N(\boldsymbol\mu,C)$-distributed if $X$ has a Gaussian distribution with expectation vector $\boldsymbol\mu$ and covariance operator $C$.

How does a Gaussian distributed variable behave under elementary transformations? Does one then obtain new variables that are also Gaussian distributed? To put it in other words, what are the permanence properties of the Gaussian distribution? To this end the following two theorems.
Theorem VIII.4.5. Suppose $X$ is an $N(\boldsymbol\mu,C)$-distributed stochastic $p$-vector. Let $T:\mathbb{R}^p\to\mathbb{R}^q$ be an arbitrary linear operator. Then the stochastic $q$-vector $TX$ is $N(T\boldsymbol\mu,\,TCT^*)$-distributed.

Proof. Choose (Theorem VIII.2.8) an orthogonal linear operator $Q:\mathbb{R}^q\to\mathbb{R}^q$ such that $Q^*TX$ has uncorrelated components. Then, according to Theorem VIII.4.2, the components of $Q^*TX$ constitute a statistically independent system. Moreover, using Proposition VIII.4.1, it is easily verified that the components of $Q^*TX$ are Gaussian distributed scalar variables. It follows that $Q^*TX$ has an elementary Gaussian distribution. Hence $TX$ is Gaussian distributed. By Propositions VIII.2.2 and VIII.2.6 one has
\[
E(TX)=T\,E(X)=T\boldsymbol\mu \quad\text{and}\quad C(TX)=T\,C(X)\,T^*=TCT^*.
\]
Altogether it follows that $TX$ is $N(T\boldsymbol\mu,\,TCT^*)$-distributed.

Example 1. Let $X=(X_1,\ldots,X_p)$ be a Gaussian distributed vector and $a$ any fixed element in $\mathbb{R}^p$. Then the map $x\mapsto\langle x,a\rangle$ from $\mathbb{R}^p$ into $\mathbb{R}$ is linear. Therefore, according to Theorem VIII.4.5, the scalar variable $\langle X,a\rangle$ is Gaussian distributed. Furthermore, one has
\[
E\big(\langle X,a\rangle\big)=\langle\boldsymbol\mu,a\rangle \quad\text{and}\quad \operatorname{var}\big(\langle X,a\rangle\big)=\langle Ca,a\rangle.
\]
Hence $\langle X,a\rangle$ has an $N(\langle\boldsymbol\mu,a\rangle,\,\langle Ca,a\rangle)$-distribution. That is to say, if $\langle Ca,a\rangle\ne 0$ then $\langle X,a\rangle$ is normally distributed; if $\langle Ca,a\rangle=0$ then $\langle X,a\rangle$ is (essentially) constant.
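Example 1 can be checked by simulation. The following sketch is an added illustration (not the author's code); it uses the expectation vector, covariance matrix and direction $v=\frac1{\sqrt5}(1,2)$ that appear in Example 3 below, for which $\langle\boldsymbol\mu,a\rangle=9/\sqrt5\approx 4.02$ and $\langle Ca,a\rangle=79/5=15.8$.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([5.0, 2.0])
C = np.array([[3.0, 6.0],
              [6.0, 13.0]])                       # covariance matrix of X
a = np.array([1.0, 2.0]) / np.sqrt(5.0)           # the direction <.,a>

X = rng.multivariate_normal(mu, C, size=200_000)  # rows are outcomes of X
lin = X @ a                                       # outcomes of <X, a>

print("sample mean of <X,a>:", lin.mean(), "   <mu,a> =", mu @ a)
print("sample var  of <X,a>:", lin.var(),  "   <Ca,a> =", a @ C @ a)
```

The empirical mean and variance should be close to $\langle\boldsymbol\mu,a\rangle$ and $\langle Ca,a\rangle$ respectively, in agreement with the example.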
Example 2. If $X=(X_1,\ldots,X_p)$ enjoys a Gaussian distribution then (as a special case of Example 1) so do the components $X_i$. If, conversely, it is given that all the components of a stochastic $p$-vector $X$ have a Gaussian distribution, is then $X$ automatically Gaussian distributed? Contrary to a frequent misunderstanding, the answer to this question is: no! To convince the reader, choose two normally distributed scalar variables $X_1$ and $X_2$ such that $X_1$ and $X_2$ are uncorrelated but not statistically independent. (The existence of such a pair of variables is the topic of §I.11, Exercise 41.) Next, consider the stochastic 2-vector $X=(X_1,X_2)$. By the corollary belonging to Theorem VIII.4.2, $X$ cannot possibly be Gaussian distributed, though its two components are so. Another example of this kind is provided by §II.11, Exercise 25.
Example 3. Suppose one is dealing with an $N(\boldsymbol\mu,C)$-distributed stochastic 2-vector $X=(X_1,X_2)$ where
\[
\boldsymbol\mu = (5,2) \quad\text{and}\quad \Sigma(X)=[C]=\begin{pmatrix}3 & 6\\ 6 & 13\end{pmatrix}.
\]
Let $M$ be the linear subspace in $\mathbb{R}^2$ defined by
\[
M := \{(x_1,x_2)\in\mathbb{R}^2 : 2x_1-x_2=0\}.
\]
The orthogonal projection from $\mathbb{R}^2$ onto $M$ will, as agreed in §VIII.1(g), be denoted by $P_M$.

Problem. What is the probability that the vector $P_M X$ will show a length $>1$?

Solution. The linear subspace $M$ is spanned by the vector $v=\frac{1}{\sqrt5}(1,2)$. This vector is of length 1 and thus one has (see §VIII.1(l))
\[
P_M = v\otimes v.
\]
More specifically:
\[
P_M\,x = \langle x,v\rangle\,v = \tfrac15\,(x_1+2x_2,\ 2x_1+4x_2).
\]
Hence
\[
\|P_M X\| = |\langle X,v\rangle|.
\]
By Example 1 the variable $\langle X,v\rangle$ is normally distributed. More precisely, one has
\[
\langle\boldsymbol\mu,v\rangle = \tfrac{1}{\sqrt5}\langle(5,2),(1,2)\rangle = 9/\sqrt5
\quad\text{and}\quad
\langle Cv,v\rangle = \tfrac15\langle C(1,2),(1,2)\rangle = \tfrac{79}{5}.
\]
It follows from this that $\langle X,v\rangle$ is $N(4.02,\ 15.8)$-distributed. Standardizing $\langle X,v\rangle$ and using Table II leads to
\[
P\big(\|P_M X\|>1\big) = P\big(|\langle X,v\rangle|>1\big) = 1 - P(-1.26\le Z\le -0.76) \approx 0.88,
\]
where $Z$ stands for an $N(0,1)$-distributed variable.
Proposition VIII.4.6. Let X be any N.µ; C/-distributed stochastic p-vector. Then X is normally distributed if and only if C is invertible. Proof. If X is Gaussian distributed then there exists an orthogonal linear operator Q W Rp ! Rp and an elementarily Gaussian distributed p-vector E such that X D Q E. Therefore the operator C .X/ can be written as C .X/ D Q C .E/ Q D Q diag.1 ; : : : ; p / Q :
Now $C(X)$ is invertible if and only if $\operatorname{diag}(\lambda_1,\ldots,\lambda_p)$ is so. The latter is the case if and only if $\lambda_1,\ldots,\lambda_p>0$. This, in turn, occurs exactly when $E=Q^*X$ has among its components no constant variables. The latter is the same as saying that $E$ has an elementary normal distribution.

A very special normal distribution is the $N(0,I)$-distribution, where $I$ is the identity operator on $\mathbb{R}^p$. This probability distribution is the vectorial analogue of the scalar $N(0,1)$-distribution. For reasons rooted in physics the $N(0,I)$-distribution is often called the white noise on $\mathbb{R}^p$.

Next, let $X$ be an arbitrary stochastic $p$-vector with $E(X)=\boldsymbol\mu$ and $C(X)=C$. Suppose that $C$ is invertible. Then the linear operator $C^{-1}$ is, as $C$ is, self-adjoint and of positive type. One can therefore talk (see §VIII.1(k)) about the linear operator $\sqrt{C^{-1}}$, also denoted by $C^{-\frac12}$. In this notation one may define
\[
\tilde X := C^{-\frac12}(X-\boldsymbol\mu).
\]
The vector $\tilde X$ is called the standardized form of $X$. Quite analogous to the scalar case one has

Proposition VIII.4.7. For the standardized form $\tilde X$ of a variable $X$ one always has $E(\tilde X)=0$ and $C(\tilde X)=I$. If, moreover, $X$ is normally distributed then $\tilde X$ is $N(0,I)$-distributed.

Proof. Using Propositions VIII.2.2 and VIII.2.6 it follows that
\[
E(\tilde X) = C^{-\frac12}\,E(X-\boldsymbol\mu) = C^{-\frac12}\,0 = 0
\]
and
\[
C(\tilde X) = C^{-\frac12}\,C\,C^{-\frac12} = C^{-\frac12}\,C^{\frac12}\,C^{\frac12}\,C^{-\frac12} = I.
\]
If X is normally distributed then by Theorems VIII.4.4 and VIII.4.5 the variable e X has a Gaussian distribution. Because of the above this is necessarily an N.0; I/distribution. Lemma VIII.4.8. Let X be an N.0; I/-distributed p-vector and let M be a linear subspace of Rp with orthogonal projection PM . Then the scalar variable kPM Xk2 has a 2 -distribution and the number of degrees of freedom is equal to dim.M/. Proof. Choose an orthonormal basis v1 ; : : : ; vm in M. Then (see §VIII.1(l)) for all x 2 Rp one has PM x D hx; v1 iv1 C C hx; vm ivm :
Therefore kPM xk2 D Next, consider the stochastic m-vector
m X hx; vi i2 : iD1
.hX; v1 i; : : : ; hX; vm i/: This vector is Gaussian distributed, for it is the image of X under the linear map x 7! .hx; v1 i; : : : ; hx; vm i/ from Rp into Rm . For all i D 1; : : : ; m one has E hX; vi i D hE.X/; vi i D h0; vi i D 0:
Moreover, by Proposition VIII.2.4, cov hX; vi i; hX; vj i D hC .X/vi ; vj i D hI vi ; vj i D hvi ; vj i:
Because the v1 ; : : : ; vm form an orthonormal system, it follows from this that var hX; vi i D 1 and cov hX; vi i; hX; vj i D 0 if i ¤ j:
Summarizing (with the help of Theorems VIII.4.2 and VIII.4.5) one arrives at the conclusion that hX; v1 i; : : : ; hX; vm i constitute a statistically independent system of N.0; 1/-distributed variables. By Proposition II.2.7 this implies that the scalar variable m X kPM Xk D hX; vi i2 2
iD1
is 2 -distributed with m degrees of freedom. This proves the lemma. The following theorem, which is important in vectorial statistics, is now easily proved: Theorem VIII.4.9 (W. G. Cochran). Let M1 ; : : : ; Mr be mutually orthogonal linear subspaces of Rp and let PM1 ; : : : ; PMr be the corresponding orthogonal projections. Then for every N.0; I/-distributed p-vector X one has: (i) The PM1 X; : : : ; PMr X constitute a statistically independent system. (ii) The variable kPMi Xk2 is 2 -distributed with dim.Mi / degrees of freedom.
Proof. Statement (ii) is immediate from Lemma VIII.4.8. Theorem VIII.4.2, with Ti D PMi , is used in order to prove (i). By this theorem it suffices to prove that for all i ¤ j the variables PMi X and PMj X are uncorrelated. To see this, first note that component number s of the variable PMi X can be written as .PMi X/s D hPMi X; es i D hX; PMi es i: Therefore, if i ¤ j , one has for all possible s and t that cov .PMi X/s ; .PMj X/ t D cov hX; PMi es i; hX; PMj e t i D hC .X/ PMi es ; PMj e t i
D hI PMi es ; PMj e t i D hPMi es ; PMj e t i D 0; the latter because Mi ? Mj . This shows that for i ¤ j the stochastic vectors PMi X and PMj X are uncorrelated. Combining Proposition VIII.4.7 and Lemma VIII.4.8, one obtains a result that will be exploited in §VIII.6 in order to construct so-called confidence regions for the parameter µ . Proposition VIII.4.10. Let X be a normally distributed stochastic p-vector with expectation vector µ and covariance operator C. Then the scalar variable hC
1
.X
µ/; .X
µ/i
is 2 -distributed with p degrees of freedom. 1
Proof. According to Proposition VIII.4.7 the variable C 2 .X µ/ is N.0; I/-distributed. By applying Lemma VIII.4.8 with M D Rp it follows that the variable kPM C
1 2
.X
µ/k2 D kC
D h.C D hC
1 2
1 2 1 2
µ/k2 D hC
.X / C
C
1 2
1 2
.X
.X
1 2
.X
µ/; .X
µ/; .X
µ/; .C
1 2
.X
µ/i
µ/i
µ/i D hC
1
.X
µ/; .X
µ/i
is 2 -distributed with p degrees of freedom. It turns out, as will be seen below, that the vectorial normal distribution has a density with respect to the Lebesgue measure on Rp . The aim is to find an explicit form of this density in terms of µ and C. In this, the following lemma will prove to be helpful.
Lemma VIII.4.11. Suppose the stochastic vector X has a probability density fX . Let Y be the stochastic vector defined by Y WD A X C b where A is an invertible linear operator and b a constant vector. Then Y has a probability fY , given by 1 fX A j det.A/j
fY .x/ D
1
.x
b/ :
Proof. See Exercise 39. With the help of this lemma it is easy to prove: Theorem VIII.4.12. A normally distributed stochastic p-vector with expectation vector µ and covariance operator C has a probability density given by fX .x/ D
h 1 exp p .2/p=2 det.C/
1 1 2 hC .x
µ/; .x
i
µ/i :
Proof. (Split up into two steps.)
Step 1. (The special case where X is N.0; I/-distributed.) In this case one has X D .X1 ; : : : ; Xp / where the X1 ; : : : ; Xp are statistically independent and all N.0; 1/-distributed. According to Theorem I.3.2 one therefore has 1 1 fX .x/ D fX1 .x1 / fXp .xp / D p exp. 21 x12 / p exp. 21 xp2 / 2 2 h h i i 1 1 2 1 2 1 D exp exp .x C C x / D hx; xi : 1 p 2 2 .2/p=2 .2/p=2 This proves Step 1. Step 2. (The general case.) In the general case one may write 1
X D C2 e X C µ;
where e X is the standardized form of X. Using Lemma VIII.4.11 together with Step 1 one obtains 1 1 2 .x C fX .x/ D f µ / e 1 X det.C 2 / h i 1 1 1 1 1 2 .x 2 .x D exp hC µ /; C µ /i 1 2 p=2 det.C 2 / .2/ h i 1 exp 21 hC 1 .x µ/; .x µ/i : D p .2/p=2 det.C/ 1 2
In the last equality in this chain it was exploited that C a self-adjoint linear operator of positive type one has 1
is self-adjoint and that for
1
\[
\det\big(C^{\frac12}\big)\,\det\big(C^{\frac12}\big) = \det(C).
\]
The last topic in this section is dimension reduction. Let $X$ be an $N(\boldsymbol\mu,C)$-distributed $p$-vector. Renumber the eigenvalues $\lambda_1,\ldots,\lambda_p$ of $C$ in such a way that
\[
\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0.
\]
Suppose $\{v_1,\ldots,v_p\}$ is a corresponding orthonormal basis of eigenvectors for $C$. Then, as was seen in the previous section, $X$ can be decomposed as follows:
\[
X = Y_1 v_1 + \cdots + Y_p v_p,
\]
where the $Y_1,\ldots,Y_p$ are mutually uncorrelated, with $\operatorname{var}(Y_i)=\lambda_i$. The vector $Y=(Y_1,\ldots,Y_p)$ is a linear image of $X$ and therefore (by Theorem VIII.4.5) it has a Gaussian distribution. By Theorem VIII.4.2 this implies that the $Y_1,\ldots,Y_p$ constitute a statistically independent system. The idea of dimension reduction is now to identify variables of very small variance with constants. This is illustrated in the following example:

Example 4. Suppose an $N(\boldsymbol\mu,C)$-distributed vector $X=(X_1,X_2,X_3)$ is given with
\[
\boldsymbol\mu = E(X) = (5.201,\ 7.312,\ 2.798)
\quad\text{and}\quad
\Sigma = [C] = \begin{pmatrix} 2.7834 & 0.0042 & -0.2274\\ 0.0042 & 1.4131 & 1.3856\\ -0.2274 & 1.3856 & 1.8035 \end{pmatrix}.
\]
The eigenvalues of $C$ can be computed to be
\[
\lambda_1 = 3.100,\qquad \lambda_2 = 2.700,\qquad \lambda_3 = 0.200.
\]
A corresponding orthonormal basis $\{v_1,v_2,v_3\}$ of eigenvectors is given by
\[
v_1 = (-0.4800,\ 0.5561,\ 0.6785),\quad
v_2 = (0.8753,\ 0.3557,\ 0.3277),\quad
v_3 = (-0.0591,\ 0.7512,\ -0.6575).
\]
Now one has $X = Y_1 v_1 + Y_2 v_2 + Y_3 v_3$, where the $Y_1,Y_2,Y_3$ are statistically independent with
\[
\operatorname{var}(Y_1)=3.100,\qquad \operatorname{var}(Y_2)=2.700,\qquad \operatorname{var}(Y_3)=0.200.
\]
Apparently the variable $Y_3$ is of a very small variance compared with the variables $Y_1$ and $Y_2$. Therefore it is decided to identify $Y_3$ with the constant $E(Y_3)$. Next, the numerical value of this constant is determined. To this end, note that
\[
X = QY \quad\text{and hence}\quad Y = Q^*X,
\]
where $Q$ is the orthogonal linear operator that converts the standard basis $\{e_1,e_2,e_3\}$ into $\{v_1,v_2,v_3\}$. For the matrix of $Q^*$ one has
\[
[Q^*] = [Q]^{t} = \begin{pmatrix} -0.4800 & 0.5561 & 0.6785\\ 0.8753 & 0.3557 & 0.3277\\ -0.0591 & 0.7512 & -0.6575 \end{pmatrix}.
\]
According to Theorem VIII.2.2 one may write
\[
E(Y) = E(Q^*X) = Q^*\,E(X) = Q^*(5.201,\ 7.312,\ 2.798).
\]
Therefore
\[
E(Y) = \big(E(Y_1),E(Y_2),E(Y_3)\big) = (3.468,\ 8.070,\ 3.346).
\]
Summarizing, the variable $Y_3$ is identified with the constant 3.346. As an automatic consequence $X$ is now identified with the variable
\[
X^0 := Y_1 v_1 + Y_2 v_2 + 3.346\, v_3.
\]
In contrast to $X$, the variable $X^0$ takes on its values in a 2-dimensional affine subspace in $\mathbb{R}^3$ (which is academic blah blah for a plane in $\mathbb{R}^3$). The transition from $X$ to $X^0$ is called dimension reduction.
In carrying out a dimension reduction one loses a certain percentage of the total variance of $X$. In the example above this percentage is given by
\[
\frac{\operatorname{var}(X)-\operatorname{var}(X^0)}{\operatorname{var}(X)}\cdot 100\%
= \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3}\cdot 100\% \approx 3\%.
\]
Instead of the above one also could have carried out a transition from $X$ to $X^{00}:=(X_1,0,X_3)$. Namely, among the components $X_1,X_2,X_3$ the variable $X_2$ shows the smallest variance. One then loses the following percentage of the total variance:
\[
\frac{\operatorname{var}(X)-\operatorname{var}(X^{00})}{\operatorname{var}(X)}\cdot 100\%
= \frac{1.4131}{6.0000}\cdot 100\% \approx 24\%.
\]
The loss in variance is now larger than in the transition from $X$ to $X^0$. This phenomenon is structural (see Exercises 8 and 9). Besides this, there is a statistical dependence between the split-off component $X_2$ and the remaining components $X_1$ and $X_3$. For this reason a small variation in the outcome of $X_2$ could possibly imply large deviations in the outcome of $X_1$ and $X_3$. Hence $X_2$ was not completely eliminated, for its roots "remain active". In the general case where $X = Y_1 v_1 + \cdots + Y_p v_p$, one could carry out dimension reduction by identifying (for some $r\le p$) the variables $Y_r,Y_{r+1},\ldots,Y_p$ with the constants $E(Y_r),\ldots,E(Y_p)$ respectively. The percentage indicating the loss in variance is then given by
\[
\frac{\lambda_r+\lambda_{r+1}+\cdots+\lambda_p}{\lambda_1+\cdots+\lambda_p}\cdot 100\%.
\]
Though theoretically not fully justified, dimension reduction is also often applied in cases where $X$ is not Gaussian distributed.
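Computations such as those in Example 4 are conveniently delegated to an eigenvalue routine. The following sketch is an added illustration (names are illustrative, and the sign pattern of $[C]$ is the one used in Example 4 above); it reproduces the eigenvalues, the vector $E(Y)=Q^*\boldsymbol\mu$ and the percentage of lost variance.

```python
import numpy as np

mu = np.array([5.201, 7.312, 2.798])
C = np.array([[ 2.7834, 0.0042, -0.2274],
              [ 0.0042, 1.4131,  1.3856],
              [-0.2274, 1.3856,  1.8035]])

lam, V = np.linalg.eigh(C)          # ascending eigenvalues, eigenvectors in columns
order = np.argsort(lam)[::-1]       # reorder so that lam[0] >= lam[1] >= lam[2]
lam, V = lam[order], V[:, order]

# eigenvectors are only determined up to sign; make the largest entry of each
# column positive, which reproduces the sign choices of Example 4
for j in range(3):
    if V[np.argmax(np.abs(V[:, j])), j] < 0:
        V[:, j] *= -1.0

print("eigenvalues:", np.round(lam, 3))          # approx. 3.100, 2.700, 0.200
print("E(Y) = Q* mu:", np.round(V.T @ mu, 3))    # approx. 3.468, 8.070, 3.346

r = 2                                            # number of components kept
loss = lam[r:].sum() / lam.sum() * 100.0         # percentage of variance lost
print(f"lost variance: {loss:.1f}%")             # approx. 3.3%
```

The same routine applies in any dimension $p$ and for any choice of $r$, with the loss in variance given by the percentage formula above.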
VIII.5 Conditional probability distributions related to Gaussian ones

In this section some permanence properties of Gaussian distributions will be proved. The permanence properties are related to conditional probability. Before diving into this topic, however, something has to be said about the concept of conditional probability when dealing with arbitrary stochastic vectors. This will be done in a rather general setting. In §II.8 the concept of a conditional probability density was defined for cases where two real-valued stochastic variables are involved. It was assumed there that the joint
probability distribution of the variables in question is absolutely continuous. This can easily be generalized to cases where a stochastic p-vector X and a stochastic q-vector Y are involved, whenever the vector .X; Y/ has an absolutely continuous probability distribution. Namely, then the variables X and Y also have absolutely continuous distributions and their densities are given by Z Z fX .x/ D fX;Y .x; y/ dy and fY .y/ D fX;Y .x; y/ dx: As in §II.8, the conditional probability density of X, given Y D y, is defined as 8 < fX;Y .x; y/ if f .y/ > 0; Y fY .y/ fX .x j Y D y/ D : 0 if fY .y/ D 0:
The conditional probability distribution of X, given Y D y is in such cases presented by a Borel measure PX . j Y D y/ on Rp , which is defined by Z PX .A j Y D y/ WD fX .x j Y D y/ dx .A Borel in Rp /: A
Using Fubini’s theorem on successive integration it can easily be derived that for every Borel set A and B in Rp and Rq one has Z P.X;Y/ .A B/ D PX .A j Y D y/fY .y/ dy: B
The above can also be read as
P.X;Y/ .A B/ D
Z
B
PX .A j Y D y/ dPY .y/:
()
The point is now that in the general case too (where .X; Y/ does not necessarily have an absolutely continuous probability distribution) there is always a privileged family ¹PX . j Y D y/ W y 2 Rq º such that () holds for every Borel set A and B in Rp and Rq . What is more, the family in question is (essentially) unique. In the general case too, the measure PX . j Y D y/ is called the conditional probability of X, given Y D y. It is far beyond the scope of this book to prove the general existence and unicity of such a family ¹PX . j Y D y/º. Tools to bring this about can be found in [15], p. 39, where disintegrations of measures (such as ()) are discussed in a very general setting. If X and Y are statistically independent, then unicity (together with Fubini’s theorem) implies that PX .A j Y D y/ D PX .A/
.A Borel in Rp ; y 2 Rq /:
In words, if X and Y are statistically independent, then the conditional probability distribution of X, given Y D y, is the same as the (unconditional) probability of X. Of course this is how things ought to be.
In the following there will be reliance on the following property, which is intuitively immediately clear: PXCY . j Y D y/ D PXCy . j Y D y/: In words, the conditional probability distribution of X C Y, given Y D y, is the same as the conditional probability distribution of X C y, given Y D y. Similar arguments apply to variables of the form Y X and X=Y . In the following example it is illustrated how one could apply the above. Example 1. Let X and Y be statistically independent N.0; 1/-distributed variables. Then the conditional probability distribution of X=Y , given Y D y , is the same as that of X=y , given Y D y . However, X=y and Y being statistically independent, the conditional probability distribution of X=y , given Y D y , is the same as the (unconditional) probability distribution of X=y . By Theorem I.6.1 the variable X=y has an N.0; 1=y 2 /-distribution. Altogether it follows that Z jyj 2 2 PX=Y .A j Y D y/ D PX=y .A j Y D y/ D p e x y =2 dx: 2 A It is easy to derive the probability distribution of X=Y from this. Namely, one has PX=Y .A j Y D y/
Z
D P.X=Y;Y / .A R/ D PX=Y .A j Y D y/ dPY .y/ ³ Z Z ²Z jyj x 2 y 2 =2 D PX=y .A j Y D y/ dPY .y/ D dx dPY .y/ p e 2 A ³ Z C1 ²Z jyj 1 2 x 2 y 2 =2 D dx p e y =2 dy p e 2 2 1 A Z jyj .1Cx 2 /y 2 =2 D e dxdy 2 AR ³ Z ²Z C1 jyj .1Cx 2 /y 2 =2 D e dy dx A 1 2 ³ Z ²Z C1 jyj .1Cx 2 /y 2 =2 D e dy dx A 0 " #yDC1 Z 2 2 e .1Cx /y =2 D dx .1 C x 2 / A yD0 Z 1 1 D dx: 2 A 1Cx
This shows that $X/Y$ has a Cauchy distribution with parameters $\alpha=0$ and $\beta=1$ (see also §I.11, Exercise 36). Theorems I.6.6, II.2.6, II.3.1 and II.5.1 can (just as an exercise) also be proved along the lines sketched above.

Now let $Z$ be an arbitrary $N(\boldsymbol\mu,C)$-distributed stochastic $p$-vector. Let $M$ and $N$ be arbitrary linear subspaces of $\mathbb{R}^p$ and write
\[
X = P_M Z \quad\text{and}\quad Y = P_N Z.
\]
The vectors $X$ and $Y$ are called the marginals of $Z$ (with respect to $M$ and $N$). The aim is to describe the conditional probability of $X$, given $Y=y$. To this end a bit more machinery will be needed than was developed up to now. Namely, rather than the covariance operator of one stochastic vector, the concept of a covariance operator of two stochastic vectors will be needed. In §VIII.2 it was explained that the covariance operator of a single stochastic $p$-vector $X$ is the unique linear operator $C(X):\mathbb{R}^p\to\mathbb{R}^p$ that satisfies
\[
\operatorname{cov}\big(\langle X,a\rangle,\langle X,b\rangle\big) = \langle C(X)\,a,b\rangle \quad\text{for all } a,b\in\mathbb{R}^p.
\]
To generalize this, suppose now that a stochastic $p$-vector $X$ and a stochastic $q$-vector $Y$ are given. It will be assumed that their Cartesian components are all of finite variance. Then the map
\[
(a,b)\mapsto\operatorname{cov}\big(\langle X,a\rangle,\langle Y,b\rangle\big)
\]
presents a bilinear form on $\mathbb{R}^p\times\mathbb{R}^q$. It follows (§VIII.1(n)) that there is a unique linear operator $C(X,Y):\mathbb{R}^p\to\mathbb{R}^q$ that satisfies
\[
\operatorname{cov}\big(\langle X,a\rangle,\langle Y,b\rangle\big) = \langle C(X,Y)\,a,b\rangle
\]
for all $a\in\mathbb{R}^p$ and $b\in\mathbb{R}^q$. This operator $C(X,Y)$ is called the covariance operator of $X$ and $Y$. Of course one has
\[
C(X,X)=C(X) \quad\text{and}\quad C(Y,Y)=C(Y).
\]
The matrix ŒC .X; Y/ of C .X; Y/ is given by 0 1 cov.X1 ; Y1 / cov.X2 ; Y1 / cov.Xp ; Y1 / Bcov.X1 ; Y2 / cov.Xp ; Y2 /C B C ŒC .X; Y/ D B C: :: :: @ A : : cov.X1 ; Yq / cov.Xp ; Yq /
In literature this matrix is often denoted by † =.X; Y/. Note that X and Y are uncorrelated if and only if C .X; Y/ D 0.
The following proposition generalizes Proposition VIII.2.6. Proposition VIII.5.1. Let X be a stochastic p-vector and Y a stochastic q-vector. Assume that the components of X and Y are of finite variance. Then: (i) For all a 2 Rp and b 2 Rq one has C .X C a; Y C b/ D C .X; Y/: (ii) For every linear operator S W Rp ! Rm and T W Rq ! Rn one has C .SX; TY/ D TC .X; Y/S :
(iii) C .X; Y/ D C .Y; X/. Proof. Just mimic the proof of Proposition VIII.2.6 to prove (i) and (ii). The proof of (iii) is trivial. Proposition VIII.5.2. Let X and Y be two stochastic p-vectors and Z a stochastic q-vector. Assume that the components of X, Y and Z are of finite variance. Then one has C .˛X C ˇY; Z/ D ˛C .X; Z/ C ˇC .Y; Z/ and for all ˛; ˇ 2 R.
C .Z; ˛X C ˇY/ D ˛C .Z; X/ C ˇC .Z; Y/
Proof. Trivial. Lemma VIII.5.3. Let X be a stochastic p-vector and Y a stochastic q-vector. Assume that the components of X and Y are of finite variance. Then: (i) Ker C .X; Y/ Ker C .X/, (ii) Im C .X; Y/ Im C .Y/. Proof. To prove (i), let a 2 Ker C .X/. Then one has var hX; ai D cov hX; ai; hX; bi D hC .X/a; ai D 0:
Hence, by Proposition I.5.5, the variable hX; ai is essentially constant. But then, by Lemma I.5.13, the variables hX; ai and hY; bi form a statistically independent system for every b 2 Rq . It follows that hC .X; Y/a; bi D cov hX; ai; hY; bi D 0 for every b 2 Rq . This implies that
C .X; Y/a D 0:
It thus appears that (i) holds, that is to say, Ker C .X/ Ker C .X; Y/: Statement (ii) can be derived from this, for the above is equivalent to ? ? Ker C .X/ Ker C .X; Y/ :
By §VIII.1(h), this is the same as saying that
Im C .X/ Im C .X; Y/ Im C .Y; X/: Exchanging X and Y one obtains (ii) from this. Proposition VIII.5.4. Let X be a stochastic p-vector and Y a stochastic q-vector. Assume that the components of X and Y are of finite variance. Then there exists a unique linear operator T W Rq ! Rp such that: (i) C .Y; X
TY/ D 0,
(ii) Tx D 0 for all x 2 Ker C .Y/. Proof. First of all, by Propositions VIII.5.1 and VIII.5.2, one has C .Y; X
TY/ D C .Y; X/
TC .Y/:
It follows that condition (i) is satisfied for every linear operator T W Rq ! Rp such that TC .Y/ D C .Y; X/: ()
By the above T is pinned down on Im C .Y/. Moreover, by condition (ii) the linear operator T is pinned down on Ker C .Y/. Because ? Rq D Im C .Y/ ˚ Im C .Y/ D Im C .Y/ ˚ Ker C .Y/ it follows that there can be at most one linear operator T W Rq ! Rp such that conditions (i) and (ii) are fulfilled. To prove the existence of such a T, an operator T is defined on Im C .Y/ by setting TC .Y/a D C .Y; X/a
for all a 2 Rq :
In this way T is a correctly defined map from Im C .Y/ to Rp , for C .Y/a D C .Y/b
()
implies C .Y; X/a D C .Y; X/b:
()
To see this, note that () implies that b 2 Ker C .Y/:
a
However, by the previous lemma one has Ker C .Y/ Ker C .Y; X/: Hence a which evidently implies ().
b 2 Ker C .Y; X/;
So, by the above, a correctly defined map T W Im C .Y/ ! Rp was born. It is left to the reader to show that T is linear on Im C .Y/. As a next step the operator T is now defined on ? Im C .Y/ D Ker C .Y/
by just setting
Tx D 0
for all x 2 Ker C .Y/:
Altogether a linear operator T W Rq ! Rp is defined by the above. By construction this operator satisfies the conditions (i) and (ii). The operator T in the proposition above will be referred to by saying that T makes X free from Y. The foregoing was a preparation to prove the following permanence property of Gaussian distributions. Theorem VIII.5.5. Let Z be an N.µ; C/-distributed stochastic p-vector and let M and N be linear subspaces of Rp with corresponding marginals X and Y of Z. Then the conditional probability distribution of X, given Y D y, is an N.e µ; e C/distribution with e µ D µX C T.y
µY /
and
e C D C .X/
TC .X; Y/:
Here T W Rq ! Rp is the linear operator that makes X free from Y; the vectors µX and µY stand for the expectations of X and Y respectively. Proof. Let T W Rq ! Rp be the linear operator that makes X free from Y. Then the stochastic vectors X TY and X are uncorrelated. By Theorem VIII.4.2 this implies that X TY and X are actually statistically independent. For that reason one has PX
TY . j Y
D y/ D PX
TY :
In words: the conditional probability of X TY, given Y D y, is the same as the (unconditional) probability distribution of X TY. Because TY is completely pinned
down by the condition Y D y, it follows that the conditional distribution of X Ty, given Y D y, is the same as the (unconditional) distribution of X TY. It follows that the conditional distribution of X, given Y D y, is the same as the (unconditional) distribution of X TY C Ty D X C T.y Y/: The distribution of the above is, however, Gaussian. Its expectation vector is e µ D E .X C T.y Y/ D E .X C T y E.Y/ D µX C T.y µY /:
To determine the covariance operator of the distribution, note that the covariance of X C T.y Y/ is equal to that of X TY. Hence one has e C D C .X
TY; X
TY/ D C .X; X
TY/
C .TY; X
TY/:
Because Y and X TY are statistically independent, the stochastic vectors TY and X TY are also independent.The above thus reduces to e C D C .X; X
TY/
0 D C .X; X/
C .X; TY/ D C .X/
TC .X; Y/;
which completes the proof. Remark 1. How to construct the operator T can be found in the proof of Proposition VIII.5.4. Remark 2. The 2-dimensional case with M D ¹.x; 0/ W x 2 Rº
and
N D ¹.0; y/ W y 2 Rº
is discussed in Exercise 17. Remark 3. It is instructive to figure out what the statements in the theorem are in the particular case where one takes M D N, or M D Rp and N D .0/.
VIII.6 Vectorial samples from Gaussian distributed populations

Let $X_1,\ldots,X_n$ be a scalar sample from an $N(\mu,\sigma^2)$-distributed population. Then the sample mean $\overline X$ is $N(\mu,\sigma^2/n)$-distributed, $\sqrt{n}\,(\overline X-\mu)/S$ is $t$-distributed with $n-1$ degrees of freedom, and $(n-1)S^2/\sigma^2$ is $\chi^2$-distributed with $n-1$ degrees of freedom. In this section these results are generalized to vectorial samples. First of all some preparatory work:
For every set of linear operators T1 ; : : : ; Tn 2 L.Rp / a linear operator p p p p e T W R R… !R R… „ ƒ‚ „ ƒ‚ n Cartesian factors
n Cartesian factors
can be defined by setting
e T.x1 ; : : : ; xn / WD .T1 x1 ; : : : ; Tn xn /:
Denote
e T D T1 ˚ T2 ˚ ˚ Tn :
The next lemma is presented in this notation:
Lemma VIII.6.1. Suppose the stochastic p-vectors X1 ; : : : ; Xn constitute a statistically independent system. If (for all i ) the p-vector Xi has an N.µi ; Ci /-distribution, then the stochastic vector .X1 ; : : : ; Xn /, taking on its values in Rp Rp , is N.e µ; e C/-distributed with e µ D .µ1 ; : : : ; µn /
and
Proof. See Exercise 13.
e C D C1 ˚ ˚ Cn :
It is now easy to prove: Proposition VIII.6.2. Suppose X1 ; : : : ; Xn is a vectorial sample from an N.µ; C/distributed population, then the statistic XD
X 1 C C Xn n
is N.µ; C=n/-distributed. Proof. Under the mentioned conditions the stochastic vector .X1 ; : : : ; Xn /, assuming its values in Rp Rp , has by Lemma VIII.6.1 a Gaussian distribution. Define T W Rp Rp ! Rp by
x1 C C xn : n Now T is a linear operator. Therefore, by Theorem VIII.4.5, the variable T.x1 ; : : : ; xn / WD
X 1 C C Xn D T.X1 ; : : : ; Xn / n also has a Gaussian distribution. Moreover, one has (Proposition VIII.3.1) XD
E.X/ D µ
and
C .X/ D
C : n
Summarizing one learns that X is N.µ; C=n/-distributed.
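Proposition VIII.6.2 can be illustrated numerically: when a sample of size $n$ is drawn repeatedly, the outcomes of $\overline X$ should show an empirical covariance close to $C/n$. The sketch below is an added illustration; the population parameters are invented for the purpose of the example.

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -2.0, 0.5])
C = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.4],
              [0.0, 0.4, 1.5]])
n, repetitions = 10, 20_000

means = np.array([rng.multivariate_normal(mu, C, size=n).mean(axis=0)
                  for _ in range(repetitions)])

print("mean of the sample means:\n", np.round(means.mean(axis=0), 3))
print("covariance of the sample means (compare with C/n):\n",
      np.round(np.cov(means, rowvar=False), 3))
print("C/n:\n", C / n)
```

The empirical covariance of the stored sample means should be close to $C/n$, and their mean close to $\boldsymbol\mu$.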
The next proposition can, as will be seen in the sequel, be useful in estimation processes for the parameter µ . Proposition VIII.6.3. Suppose the X1 ; : : : ; Xn form a p-dimensional sample from a N.µ; C/-distributed population. Then, provided C is invertible, the statistic nhC
1
.X
µ/; .X
µ/i
is 2 -distributed with p degrees of freedom. Proof. Combine Propositions VIII.4.10 and VIII.6.2. If, as to some N.µ; C/-distributed population, C is known then one could use this proposition to make “region estimates” for µ . This runs as follows: Because the statistic nhC
1
.X
µ/; .X
µ/i
is (a priori) 2 -distributed with p degrees of freedom one can, at a prescribed level ˛, determine a bound g˛ such that P nhC 1 .X µ/; .X µ/i < g˛ D ˛: Given an outcome x of X one now calls the region G˛ D ¹µ 2 Rp W nhC D ¹µ 2 Rp W nhC
1
.x
µ/; .x
1
µ/i < g˛ º
.µ
x/; .µ
x/i < g˛ º
an ˛ 100% confidence region for µ . Generally G˛ is the interior region of an ellipsoid in Rp . Here is an explicit example: Example 1. One is dealing with a 2-dimensional normally distributed population with unknown µ . The covariance operator is known; it is given by the matrix ! 2:12 0:87 ŒC D † =D : 0:87 3:11 From this population a sample of size 7 is drawn in order to get information about the parameter µ . The outcome of X turns out to be x D .1:30; 2:27/:
Basing oneself on the outcome of the sample, one wants a 90% confidence region for µ . Well, harking back to the considerations preceding this example, the set G D ¹µ 2 R2 W 7hC
1
.µ
x/; .µ
x/i < 4:605º
is such a confidence region. To get an impression of the shape of this region, first carry out a translation over the vector x: x D ¹v 2 R2 W 7hC
G
1
v; vi < 4:605º:
Next determine the eigenvalues of C and a corresponding orthonormal basis of eigenvectors. The eigenvalues of C are given by and
1 D 1:614
2 D 3:616:
Corresponding eigenvectors (of length 1) are and
v1 D .0:864; 0:503/ These vectors are also eigenvectors of C ing to the eigenvalues 1 1 D
1 1:614
v2 D .0:503; 0:864/:
1 , namely, they are eigenvectors correspond-
and
2 1 D
1 : 3:616
Every arbitrarily chosen vector v 2 R2 can now be decomposed with respect to the orthonormal basis ¹v1 ; v2 º for R2 : v D 1 v1 C 2 v2 : In these notations one has E D1 2 2 1 hC 1 v; vi D 1 v1 C 2 v2 ; 1 v1 C 2 v2 D 1 C 2 : 1 2 1 2 The constraint
7hC
1
v; vi < 4:605
is therefore equivalent to 2 12 C 2 < 1: 1:06 2:38 Thus it follows that ° ± 2 2 G x D v 2 R2 W v D 1 v1 C 2 v2 ; where 1 C 2 < 1 : 1:06 2:38
This region presents the interior of an ellipse having the origin as its center and having principal axes of length 1:03 and 1:54. The small axis is pointing in the v1 -direction,
[Figure 27: the confidence region $G$ for $\boldsymbol\mu$]
the large one in the v2 -direction. The region G can be found by a backward translation G D .G x/ C x. See Figure 27. The previous example started from a covariance operator C that was known beforehand. In practice this is rarely the case. Usually C is not known and hence one will have to make estimates of it. Using the outcome of S2 as an estimate of C one could handle in the following way to obtain confidence regions (compare this with the discussion in §II.1): Example 2. A sample X1 ; : : : ; X5 from a 3-dimensional normally distributed population shows the outcomes X1 D .4:93; 3:37; 1:42/;
X2 D .4:02; 3:16; 1:14/;
X4 D .5:58; 2:46; 1:86/;
X5 D .4:34; 2:84; 0:92/:
X3 D .5:68; 3:92; 1:22/;
This implies for X the outcome X D .4:91; 3:15; 1:31/: The outcome of the matrix of S2 can be computed to be 0 1 0:54 0:08 0:17 0:08A : ŒS2 D @0:08 0:30 0:17 0:08 0:13
It is now assumed that C S2 and that (a priori) the variable 5hS
2
.X
µ/; .X
µ/i D 5hS
2
.µ
X/; .µ
X/i
is approximately 2 -distributed with 3 degrees of freedom. When basing oneself on these dubious assumptions one could construct a “95% confidence region” of the form G 0 D ¹µ 2 R3 W 5hS
2
.µ
X/; .µ
X/i < 7:81º:
The eigenvalues of S2 are
1 D 0:331;
2 D 0:607;
3 D 0:031:
Corresponding eigenvectors (of length 1) are v1 D . 0:035; 0:927; 0:373/; v2 D .0:939; 0:158; 0:304/; v3 D . 0:341; 0:340; 0:876/: In Exercise 18 the reader is asked to describe the region G in detail. In Example 2 it was assumed as a matter of fact that, given a sample X1 ; : : : ; Xn from a normally distributed population, the variable nhS
2
.X
µ/; .X
µ/i
./
has a 2 -distribution with p degrees of freedom. For large n this is indeed correct in an approximate sense. For small n, as will be seen in Example 3, this assumption is very dubious. Later on in this section some more attention is paid to the probability distribution of ./. Now a brief look is taken at the probability distribution of S2 . To start with, the reader is introduced to the so-called “Wishart distribution”: Definition VIII.6.1. Let X1 ; : : : ; Xn be a p-dimensional sample from an N.0; C/distributed population. Then the probability distribution of the variable n X iD1
Xi ˝ Xi
is called the Wishart distribution with parameters n and C. This distribution will be denoted as the W .n; C/-distribution. When sampling from Gaussian distributed populations, the variable .n 1/S2 has some Wishart distribution. To prove this, first some preparatory work has to be done: Lemma VIII.6.4. Let X be an N.0; C/-distributed variable and Q an orthogonal linear operator. Then X and QX are identically distributed if and only if Q commutes with C (that is to say: if and only if QC D CQ).
Proof. By Theorem VIII.4.5 the variable QX is N.0; Q C Q /-distributed. It follows from this that X and QX are identically distributed if and only if Q C Q D C: Because Q is orthogonal one has Q D Q 1 . The above is therefore the same as saying that QC D CQ: This proves the lemma.
Remark. Note that in the particular case where C is an operator of type C D 2 I the content of this lemma is exactly that of Proposition I.6.4. Next, with every orthogonal linear operator Q W Rn ! Rn one associates an accompanying linear operator p p p p QWR R… !R ; R… „ ƒ‚ „ ƒ‚ n Cartesian factors
n Cartesian factors
defined in the following way:
Q.x1 ; : : : ; xn / WD .y1 ; : : : ; yn /; where yj D
n X
ŒQ˛j x˛ :
˛D1
It is trivially verified that Q is a linear operator. Furthermore, an inner product on Rp Rp is defined by setting h.x1 ; : : : ; xn /; .y1 ; : : : ; yn /i WD hx1 ; y1 i C C hxn ; yn i: When identifying Rp Rp in a natural way with Rpn , this inner product presents the standard inner product on Rpn . Lemma VIII.6.5. If Q W Rn ! Rn is an orthogonal linear operator then:
(i) Q W Rp Rp ! Rp Rp is an orthogonal linear operator.
(ii) If .y1 ; : : : ; yn / D Q.x1 ; : : : ; xn / then n X
j D1
yj ˝ yj D
n X
j D1
xj ˝ xj :
(iii) Q commutes with every linear operator of type e T D T ˚ ˚ T.
Proof. To prove (i), first note that DX E X kyj k2 D hyj ; yj i D ŒQ˛j x˛ ; ŒQˇj xˇ ˛
D
X ˛;ˇ
ˇ
X
ŒQ˛j ŒQ jˇ hx˛ ; xˇ i:
XX
ŒQ˛j ŒQ jˇ hx˛ ; xˇ i
ŒQ˛j ŒQˇj hx˛ ; xˇ i D
˛;ˇ
It follows from this that 2
k.y1 ; : : : ; yn /k D D D
n X
j D1
X ˛;ˇ
n X
˛D1
kyj k2 D
˛;ˇ
j
ŒQQ ˛ˇ hx˛ ; xˇ i D
X ˛;ˇ
ŒI˛;ˇ hx˛ ; xˇ i
kx˛ k2 D k.x1 ; : : : ; xn /k2 :
This implies (§VIII.1(j)) that Q is an orthogonal linear operator on Rp Rp . Statement (ii) can be proved by applying the same kind of algebra. To prove (iii), first set .y1 ; : : : ; yn / WD Q.x1 ; : : : ; xn /. One has Q.T ˚ ˚ T/.x1 ; : : : ; xn / D Q.Tx1 ; : : : ; Txn /: The j th component of the right side is given by X X ŒQ˛j Tx˛ D T ŒQ˛j x˛ D Tyj : ˛
˛
In turn, the right side of this equality is exactly the j th component of the vector .T ˚ T ˚ ˚ T/ .y1 ; : : : ; yn /: This proves (iii). The following theorem (in the notations introduced above) can now be proved. Theorem VIII.6.6. If X1 ; : : : ; Xn is a p-dimensional sample from an N.µ; C/-distributed population, then the variable .n 1/S2 is W .n 1; C/-distributed. Proof. The quantity S2 does not alter when Xi is replaced by Xi may be assumed that µ D 0.
µ ; therefore it
Starting from this, let ¹e1 ; : : : ; en º be the standard basis in Rn and set e WD .1; : : : ; 1/ 2 Rn . Now there exists an orthogonal linear operator Q W Rn ! Rn such that p Qen D e= n. The matrix of such an operator is of the form 0 p 1 1= n B 1=pnC C B ŒQ D B : : :: :: C : : : @: : : : A p 1= n
Therefore, if one has
.y1 ; : : : ; yn / D Q.x1 ; : : : ; xn /;
then necessarily
yn D
p x1 C C xn D n x: p n
./
Furthermore (Lemma VIII.6.1) the variable .X1 ; : : : ; Xn / is N.0; e C/-distributed, where e C D C ˚ ˚ C:
By Lemma VIII.6.5 the operator Q commutes with e C . Applying Lemma VIII.6.4 it follows that the stochastic vectors .X1 ; : : : ; Xn / and .Y1 ; : : : ; Yn / WD Q.X1 ; : : : ; Xn /
are identically distributed. It follows from this that the Y1 ; : : : ; Yn also constitute a sample from an N.0; C/-distributed population. Moreover, by the well-known algebraic argument, one may write .n
1/S2 D
n X iD1
.Xi
X/ ˝ .Xi
X/ D
n X iD1
Xi ˝ Xi
nX ˝ X:
Using Lemma VIII.6.5 this can be rewritten as .n
1/S2 D
n X iD1
Yi ˝ Yi
nX ˝ X:
In virtue of ./ this can be read as .n
1/S2 D
n X1 iD1
Yi ˝ Yi :
Of course the right side of this equality is W .n the proof.
1; C/-distributed, thus completing
As announced before, some attention is now paid to the probability distribution of the variable hS 2 .X µ/; .X µ/i; where X1 ; : : : ; Xn is a p-dimensional sample from an N.µ; C/-distributed population. It can be proved that for n p C 1, with probability 1, the outcome of S2 will be invertible (see [4]). It makes therefore sense to speak about the variable S 2 , assuming values in L.Rp /. Lemma VIII.6.7. If X1 ; : : : ; Xn is a sample from an N.0; I/-distributed population then for n p C 1 the scalar variable n.n p.n is F -distributed with p and n nator respectively.
p/ hS 1/
2
X; Xi
p degrees of freedom in the numerator and denomi-
Proof. See [4]. In order to study probability distributions of variables of type hS
2
.X
µ/; .X
µ/i
in a more general setting, the following notations are introduced: Let X1 ; : : : ; Xn and Y1 ; : : : ; Yn be samples from arbitrary (not necessarily equal) populations with existing expectation vectors. One then writes S2x WD S2y WD
1 n
1 1
n
1
n X .Xi iD1
n X .Yi iD1
X/ ˝ .Xi
X/
and
µx WD E.Xi /;
Y/ ˝ .Yi
Y/
and
µy WD E.Yi /:
In these notations the next lemma is presented: Lemma VIII.6.8. Let X1 ; : : : ; Xn be a p-dimensional sample from an arbitrary population with expectation vector µx . Define the Y1 ; : : : ; Yn by Yi WD A Xi C b; where A is a fixed invertible linear operator and b a fixed element in Rp . Then, provided it makes sense to speak about Sx 2 , the scalar variables hSx 2 .X are identical.
µx /; .X
µx /i
and
hSy 2 .Y
µy /; .Y
µy /i
Proof. For all i one has Y D A.Xi
Yi
X/:
Thus, using §VIII.1(l), it follows that
S2y D A S2x A : Hence Sy 2 D .A /
1
Sx 2 A
1
Using this, together with the fact that Y
1
/ Sx 2 A
D .A
µy D A.X
1
:
µx /;
one derives that hSy 2 .Y
µy /; .Y
µy /i D h.A
1
/ Sx 2 A
D hSx 2 .X
D hSx 2 .X
1
A.X
µx /; A
1
µx /; .X
µx /; A.X
A.X
µx /i
µx /i
µx /i:
This proves the lemma. Transformations of Rp of type x 7! A x C b; where A is an invertible linear operator and b an element in Rp , are called scale transformations. With composition as multiplication, they form a group. The foregoing lemma says that variables of type hSx 2 .X
µx /; .X
µx /i
are invariant under scale transformations. This, together with the fact that they are not trivial, indicates that the probability distributions of such variables might very well have “a right to exist” in mathematics. When sampling from normally distributed populations, this is confirmed by the next theorem: Theorem VIII.6.9. Suppose that X1 ; : : : ; Xn is a p-dimensional sample from an N.µ; C/-distributed population. If C is invertible and if n p C 1, then the scalar variable n.n p/ hS 2 .X µ/; .X µ/i p.n 1/ is F -distributed with p and n nator respectively.
p degrees of freedom in the numerator and denomi-
Proof. Define the variables Y1 ; : : : ; Yn by Yi WD e Xi D C
1 2
1 2
µ/ D C
.Xi
Xi
C
1 2
µ:
Now (Proposition VIII.4.7) the Y1 ; : : : ; Yn constitute a sample from an N.0; I/distributed population. Furthermore, by Lemma VIII.6.8 one has n.n p.n
p/ hS 2 .X 1/ x
µx /i D
µx /; .X
n.n p.n
p/ hS 2 Y; Yi: 1/ y
By Lemma VIII.6.7 the right side of this equality has an F -distribution with p and n p degrees of freedom in the numerator and denominator respectively. Hence so does the left side. Remark. Note that in the case p D 1 the theorem above is in agreement with Theorem II.3.2 (see also §II.11, Exercise 13). The following example shows how dubious the reasoning in Example 2 was. Example 3. This example leans on the sample described in Example 2. Denote the outcome of S 2 by A. Then the “confidence region” for µ , constructed in Example 2, can be expressed as G 0 D ¹µ 2 R3 W 5hA.µ
x/; .µ
x/i < 7:81º:
Now, by Theorem VIII.6.9, one knows that the variable 3 5 hS 2 .X µ/; .X µ/i 3 5 1 is (a priori) F -distributed with 3 and 2 degrees of freedom in the numerator and denominator respectively. It can be read off in Table V that 5 hS 2 .X µ/; .X µ/i < 19:2 D 0:95: P 6 5
Substituting the outcomes of X and S
2
in the inequality
5 hS 2 .X µ/; .X µ/i < 19:2: 6 One arrives at the following 95% confidence region for µ : G D ¹µ 2 R3 W 5hA.µ
x/; .µ
x/i < 115:2º:
Both the regions G 0 and G present the interior regions of ellipsoids with center x and the principal axes of both ellipsoids point into the same directions. However, the length of the principal axes of G is about four times as large as the corresponding axes of G 0 (see Exercise 18). It follows that constructions as in Example 2 should not be carried out when dealing with such small-sized samples.
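The construction used in Example 3 is easily turned into a small routine. The sketch below is an added illustration: the data are invented, the function name is arbitrary, and scipy is used only to supply the $F$-quantile that Table V provides (for $n=5$, $p=3$ and level $0.95$ this quantile is about $19.2$, as in the example).

```python
import numpy as np
from scipy.stats import f as f_dist

def f_confidence_region_contains(mu0, xbar, S2, n, level=0.95):
    """True if mu0 lies in the confidence region of Theorem VIII.6.9, i.e. if
    n(n-p)/(p(n-1)) * <S^{-2}(xbar-mu0), (xbar-mu0)>  <  F-quantile(p, n-p)."""
    xbar, mu0 = np.asarray(xbar, float), np.asarray(mu0, float)
    p = xbar.size
    d = xbar - mu0
    statistic = n * (n - p) / (p * (n - 1)) * (d @ np.linalg.solve(S2, d))
    return statistic < f_dist.ppf(level, p, n - p)

# invented data: a sample of size n = 5 from a 3-dimensional population
rng = np.random.default_rng(3)
sample = rng.multivariate_normal([4.9, 3.1, 1.3], np.eye(3) * 0.2, size=5)
xbar = sample.mean(axis=0)
S2 = np.cov(sample, rowvar=False)     # outcome of the matrix of S^2

print(f_confidence_region_contains([4.9, 3.1, 1.3], xbar, S2, n=5))
print(f_confidence_region_contains([0.0, 0.0, 0.0], xbar, S2, n=5))
```

Because the $F$-quantile for such small samples is much larger than the corresponding $\chi^2$-quantile, the resulting region is considerably larger than the one obtained by the dubious reasoning of Example 2, as observed above.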
482
Chapter VIII Vectorial statistics
The previous examples and theorems complete the discussion about the probability distribution of X, that of S2 and that of hS 2 .X µx /; .X µx /i.
As was seen in §VIII.2, the vectorial statistics X and S2 present unbiased estimators of the parameters µ and C. In cases where the population is normally distributed these parameters are completely characterizing. The reader may wonder whether the vectorial statistic .X; S2 /, assuming its values in Rp L.Rp /, is sufficient for the parameters µ and C when sampling from normally distributed populations. It turns out, as will be seen below, that this can be answered in the affirmative. In matters like sufficiency, the likelihood function of a sample plays a principal role. For a vectorial sample X1 ; : : : ; Xn this likelihood function L W Rp Rp ! Œ0; C1/ is defined in exactly the same way as in the scalar case, namely as follows: L.x1 ; : : : ; xn / D fX1 ;:::;Xn .x1 ; : : : ; xn / D fX1 .x1 / : : : fXn .xn /: When sampling from a normally distributed population the likelihood function assumes the form as described in the following lemma: Lemma VIII.6.10. Suppose that X1 ; : : : ; Xn is a sample from an N.µ; C/-distributed population, where C is invertible. Then the likelihood function Lµ;C of the sample is given by ± ° n 1 1 Lµ;C .x1 ; : : : ; xn / D exp µ /; .x µ /i hC .x n=2 2 .2/np=2 det.C/ ° 1 ± X exp hC 1 ; .xi x/ ˝ .xi x/i : 2 i
Proof. Using Theorem VIII.4.12 it follows that Lµ;C .x1 ; : : : ; xn / D
1 .2/np=2
n=2 det.C/
By §VIII.1(q) one may write X hC 1 .xi µ/; .xi i
° 1X exp hC 2
µ/i D hC
1
.xi
µ/; .xi
±
µ/i :
i
1
;
X .xi i
µ/ ˝ .xi
µ/i:
()
./
Here, on the right side, the Hilbert–Schmidt inner product appears. Furthermore, by the well-known algebraic argument, one has X X .xi µ/ ˝ .xi µ/ D .xi x/ ˝ .xi x/ C n.x µ/ ˝ .x µ/: ./ i
i
The proof can be completed by substituting ./ into ./ into ./.
483
Section VIII.6 Vectorial samples from Gaussian distributed populations
It is now easy to prove: Theorem VIII.6.11. When sampling from an N.µ; C/-distributed population, where C is invertible, the compound statistic quantity .X; S2 / is sufficient for the compound parameter .µ; C/. Proof. A vectorial version of the factorization theorem (Theorem II.10.3) will be applied. To this end the function p p gWR R… ! Rp LC .Rp / „ ƒ‚ n Cartesian factors
is defined as follows:
g.x1 ; : : : ; xn / WD Of course one then has
x1 C C xn 1 X ; .xi n n 1 i
.X; S2 / D g.X1 ; : : : ; Xn /:
x/ ˝ .xi
x/ : ./
Moreover, define the function ' W Rp LC .Rp / Rp LC .Rp / ! Œ0; C1/;
where LC .Rp / is the set of all positive invertible self-adjoint elements in L.Rp /, by ± ° n 1 1 '.a; A; b; B/ D hA 1 ; Bi n=2 exp 2 .2/np=2 det.A/ ° n ± exp hA 1 .b a/; .b a/i : 2 Setting h 1 one obtains Lµ;C .x1 ; : : : ; xn / D h.x1 ; : : : ; xn / ' µ; C; g.x1 ; : : : ; xn / : ./ Now ./ and ./ imply, by the factorization theorem, that the statistic .X; S2 / is sufficient for the parameter .µ; C/ 2 Rp LC .Rp /.
As a last topic in this section the maximum likelihood estimators of µ and C are determined, when sampling from normally distributed populations. As in the scalar case, the maximum likelihood estimate of .µ; C/ is, by definition, the element for which (for fixed x1 ; : : : ; xn ) the map .t; T/ 7! L.t;T/ .x1 ; : : : ; xn /
attains a global maximum. Here .t; T/ runs through the parameter space Rp LC .Rp /, where (as before) LC .Rp / denotes the set of all positive self-adjoint invertible elements in L.Rp /. To begin with, a lemma is derived that will appear to be the crux in the maximization process:
484
Chapter VIII Vectorial statistics
Lemma VIII.6.12. Let D be a fixed element in LC .Rp / and let f W LC .Rp / ! R be the function defined by f .T/ WD log det.T/ hT; Di:
Then f attains a global maximum and T D D which this maximum is taken on.
1
is the only element in LC .Rp / in
Proof. (Split up into two steps.) Step 1. (The case D D I.) Define the function ' W .0; C1/ .0; C1/ ! R by '.x1 ; : : : ; xp / WD log.x1 xp /
.x1 C C xp /:
If T has eigenvalues 1 ; : : : ; p , then f .T/ D log det.T/ hT; Ii D log det.T/
tr.T/ D '.1 ; : : : ; p /:
Therefore, instead of maximizing f , it suffices to maximize ' . By writing '.x1 ; : : : ; xp / D .log x1
x1 / C C .log xn
xn /;
it can be seen that this maximization essentially concerns the maximization of the single-variable function x 7! log x x on the interval .0; C1/. It is easily verified that this function attains a global maximum and that this maximum is realized in the point x D 1 only. It follows that ' attains a global maximum in the point .1; : : : ; 1/ and that there are no other points where this maximum is taken on. In turn it follows from this that f .T/ attains a global maximum if and only if all the eigenvalues of T are equal to 1. This is possible only if T D I, thus proving Step 1. Step 2. (The case where D is an arbitrary fixed element in LC .Rp /.) Define ˆ W LC .Rp / ! LC .Rp / as follows: ˆ.T/ WD D
1 2
TD
1 2
:
Now ˆ is bijective (check this). It follows that f attains a global maximum if and only if f ı ˆ does so. However, for f ı ˆ one has 1 1 1 1 .f ı ˆ/ .T/ D log det.D 2 T D 2 / hD 2 T D 2 ; Di 1 1 tr.D 2 T D 2 D/ D log det.D/ C log det.T/ 1 1 D log det.D/ C log det.T/ tr.T D 2 DD 2 / hT; Ii: D log det.D/ C log det.T/
Section VIII.6 Vectorial samples from Gaussian distributed populations
485
So by Step 1 it follows that T D I is the unique point where f ı ˆ assumes a global maximum. Hence ˆ.I/ D D 1 is the unique point where f takes on a global maximum. Theorem VIII.6.13. When sampling from normally distributed populations, the maximum likelihood estimators of µ and C are n
X
and
1X .Xi n iD1
X/ ˝ .Xi
X/
respectively. Proof. Let x1 ; : : : ; xn be a fixed outcome of the p-dimensional sample X1 ; : : : ; Xn . Only the case where n > p will be dealt with. Then, with probability 1, the operator D WD
n X .xi iD1
x/ ˝ .xi
x/
is invertible. Given this outcome, one has to maximize the likelihood function .t; T/ 7! L.t;T/ .x1 ; : : : ; xn /; where .t; T/ runs through Rp LC .Rp /. This is equivalent to maximization of the map .t; T/ 7! log L.t;T/ .x1 ; : : : ; xn /: By Lemma VIII.6.10 one can write down the following explicit form of this map: D 1 1 E± np n° log L.t;T/ .x1 ; : : : ; xn / D log 2 C T ; D log det.T 1 / 2 2 n n 1 hT .x t/; .x t/i: 2
Whatever the value of T 2 LC .Rp / is, the map n t 7! hT 1 .x t/; .x t/i 2 attains a global maximum (being zero) in the unique point t D x. Furthermore, realizing that T 7! T 1 is one to one from LC .Rp / onto LC .Rp /, one can apply Lemma VIII.6.12 and conclude that the map D 1 1 E± n° T 7! log det.T 1 / T ; D 2 n takes on a global maximum in the unique point T D n1 D.
Summarizing, the likelihood function attains a global maximum in the point .t; T/ D .x; n1 D/ and in this point only. This proves the theorem.
486
Chapter VIII Vectorial statistics
VIII.7 Vectorial versions of the fundamental limit theorems Strong convergence, as defined for scalar variables, can of course also be defined for vectorial variables: To see this, suppose that X is a stochastic p-vector and that X1 ; X2 ; : : : is a sequence of stochastic p-vectors. Let .; A; P / be the common underlying probability space of these variables. Definition VIII.7.1. The sequence X1 ; X2 ; : : : is said to converge strongly (or almost surely) to X if there exists a set A 2 A such that P .A/ D 1 and such that lim Xn .!/ D X.!/
n!1
for all ! 2 A:
The following proposition describes how this kind of convergence can be reduced to strong convergence of scalar variables. Proposition VIII.7.1. For any sequence X; X1 ; X2 ; : : : of stochastic p-vectors the next three statements are equivalent: (i) Xn ! X strongly.
(ii) For all v 2 Rp one has hXn ; vi ! hX; vi strongly.
(iii) For all i D 1; : : : ; p one has hXn ; ei i ! hX; ei i strongly. Proof. See Exercise 23. As in the scalar case a sequence X1 ; X2 ; : : : will be called a statistically independent system if for all n the finite system X1 ; : : : ; Xn is so. A statistically independent sequence X1 ; X2 ; : : : is said to be an infinite sample if the Xi are mutually identically distributed. Then the common probability distribution is called the distribution of the population. Now a vectorial version of the strong law of large numbers is presented: Theorem VIII.7.2. Let X1 ; X2 ; : : : be an infinite sample from a population with expectation vector µ . Then X 1 C C Xn !µ n
strongly:
Proof. Apply Proposition VIII.7.1 and reduce the assertion to the scalar case (Theorem VII.2.14). The concept of convergence in probability can also in a natural way be defined for vectorial variables:
Section VIII.7 Vectorial versions of the fundamental limit theorems
487
Definition VIII.7.2. One says that the sequence X1 ; X2 ; : : : converges in probability to X if for all " > 0 lim P kXn Xk " D 0: n!1
Convergence in probability too can be reduced to scalar convergence of the components:
Proposition VIII.7.3. For any sequence X; X1 ; X2 ; : : : of stochastic p-vectors the next three statements are equivalent: (i) Xn ! X in probability.
(ii) For all v 2 Rp one has hXn ; vi ! hX; vi in probability.
(iii) For all i D 1; : : : ; p one has hXn ; ei i ! hX; ei i in probability. Proof. See Exercise 24. For vectorial variables convergence in distribution is defined in the following way: Definition VIII.7.3. A sequence of p-vectors X1 ; X2 ; : : : converges in distribution to X if for all bounded continuous functions ' W Rp ! R one has E '.Xn / ! E '.X/ :
Theorem VII.2.12 makes sure that in the scalar case this definition amounts to the same as Definition I.8.3. An direct consequence of Definition VIII.7.3 is the following theorem: Theorem VIII.7.4. Let g W Rp ! Rq be any continuous function. If the sequence of stochastic p-vectors X1 ; X2 ; : : : converges in distribution to X, then the sequence of stochastic q-vectors g.X1 /; g.X2 /; : : : converges in distribution to g.X/. Next, suppose that the sequence X1 ; X2 ; : : : of stochastic p-vectors converges to X in distribution. Define for fixed i D 1; : : : ; p, the continuous function g W Rp ! R by g.x/ WD hx; ei i
.x 2 Rp /:
Now, by Theorem VIII.7.4, one has for all i D 1; : : : ; p hXn ; ei i D g.Xn / ! g.X/ D hX; ei i in distribution. Thus it follows that componentwise the sequence X1 ; X2 ; : : : is also converging in distribution to X. A venomous pitfall in stochastic analysis is that the converse is not true:
488
Chapter VIII Vectorial statistics
Example 1. Choose two statistically independent N.0; 1/-distributed scalar variables Z1 and Z2 . By §I.11, Exercise 42 (or §VII.14, Exercise 17), this is possible. Define the stochastic 2-vectors Xn and X by Xn WD .Z1 ; Z2 /
and
X WD .Z1 ; Z1 /:
Evidently one now has for i D 1; 2: in distribution:
hXn ; ei i ! hX; ei i However, it is not true that hXn ; e2
e1 i ! hX; e2
in distribution.
e1 i
To see this, note that for all n the scalar variable hXn ; e2
e1 i D Z2
Z1
has an N.0; 2/-distribution, whereas hX; e2
e1 i D Z1
Z1 D 0:
Now Theorem VIII.7.4 excludes the possibility that Xn ! X in distribution. By the characteristic function of a stochastic p-vector X one means the function X W Rp ! C, defined by X .v/ WD E e ihX;vi .v 2 Rp /: The vectorial analogue of Lévy’s theorem (Theorem I.8.5) is:
Theorem VIII.7.5. A sequence X1 ; X2 ; : : : of stochastic p-vectors converges weakly to X if and only if for all v 2 Rp one has lim Xn .v/ D X .v/:
n!1
Proof. See for example [6], [11]. Using this theorem it is easy to prove the following important result: Proposition VIII.7.6. For any sequence X; X1 ; X2 ; : : : of stochastic p-vectors the following two statements are equivalent: (i) Xn ! X in distribution.
(ii) For all v 2 Rp one has hXn ; vi ! hX; vi in distribution.
489
Section VIII.7 Vectorial versions of the fundamental limit theorems
Proof. The implication (i) ) (ii) is an immediate consequence of Theorem VIII.7.4. To prove the implication (ii) ) (i) first note to the following relationship between the characteristic functions of a vectorial variable X and an associated scalar variable hX; vi. Namely, for all t 2 R and v 2 Rp one has X .t v/ D E e ihX;tvi D E e ithX;vi D hX;vi .t /:
If for all v 2 Rp one has hXn ; vi ! hX; vi in distribution, then
Xn .v/ D hXn ;vi .1/ ! hX;vi .1/ D X .v/ for all v 2 Rp . By Theorem VIII.7.5 this implies that Xn ! X in distribution. In reducing to scalar convergence of variables of type hXn ; vi it is easily seen that the following scheme applies: Xn ! X strongly ) Xn ! X in probability ) Xn ! X in distribution Using Proposition VIII.7.6 it is easy to prove the following vectorial version of the central limit theorem: Theorem VIII.7.7. Let X1 ; X2 ; : : : be an infinite sample from a population with expectation vector µ and covariance operator C. Then the sequence p n
²
³
X1 C C Xn n
µ
.n D 1; 2; : : :/
converges in distribution to the N.0; C/-distribution. Proof. Let X1 ; X2 ; : : : be as described above and let v be an arbitrary fixed element in Rp . Then the sequence hX1 ; vi; hX2 ; vi; : : : presents an infinite scalar sample from a population with mean hµ; vi and variance hCv; vi. Let X be an arbitrary N.0; C/-distributed variable. By Theorem VIII.4.5 the scalar variable hX; vi is N.0; hCv; vi/-distributed. From the scalar central limit theorem it now follows that for all v 2 Rp the sequence p X 1 C C Xn n n
µ ;v D
p
n
hX1 ; vi C C hXn ; vi n
hµ; vi
490
Chapter VIII Vectorial statistics
converges in distribution to hX; vi. By Proposition VIII.7.6 this implies that p X1 C C Xn n µ ! X in distribution: n This proves the theorem. In the previous section it was explained that the sample mean X of a vectorial sample X1 ; : : : ; Xn from an N.µ; C/-distributed population always has an N.µ; C=n/distribution. The vectorial version of the central limit theorem says that, even if a population with mean µ and covariance operator C is not Gaussian distributed, the sample mean is for large n approximately N.µ; C=n/-distributed. As already announced in §VII.10, this phenomenon can also be observed with statistics other than the sample mean. In order to prove a result of this kind, first a theorem of the Russian mathematician Jevgeni J. Slutsky (1880–1948) is formulated and proved. The content of this theorem looks trivial, but it is not. It should be seen in the light of the following problematic nature of convergence in distribution, which already occurs in the scalar case: If the sequences X1 ; X2 ; : : : and Y1 ; Y2 ; : : : converge in distribution to X and Y respectively then there is no guarantee at all that Xn C Yn converges in distribution to X C Y . Neither is there a guarantee that Xn Yn converges in distribution to X Y . See, as to this phenomenon, Exercise 25. Theorem VIII.7.8 (J. J. Slutsky). Let X1 ; X2 ; : : : be a sequence of stochastic pvectors converging in distribution to X. Then one has: (i) If the sequence Y1 ; Y2 ; : : : of stochastic p-vectors converges in distribution to the constant vector a, then Xn C Yn ! X C a in distribution.
(ii) If the sequence Y1 ; Y2 ; : : : of scalar stochastic variables converges in distribution to the constant a, then Yn Xn ! aX in distribution. Proof. By Proposition VIII.7.6 it suffices to prove this theorem in the case where all variables are scalar (check this). To prove (i), suppose that Xn ! X and Yn ! a in distribution. Denote Zn WD Xn C Yn
and
Z WD X C a:
In this notation it must be proved that FZn .z/ ! FZ .z/ in all points z 2 R where FZ is continuous. To this end, choose an arbitrary fixed " > 0 and an arbitrary z 2 R. Then, by Lemma VII.2.9, there exists a z 0 such that
Section VIII.7 Vectorial versions of the fundamental limit theorems
491
z < z 0 < z C " and such that FZ is continuous in z 0 . Now
FZn .z/ D P Zn z D P Xn C Yn z D P Xn C Yn z and jYn aj < z 0 z C P Xn C Yn z and jYn aj z 0 z P Xn C Yn a z a and jYn aj < z 0 z C P jYn aj z 0 z P Xn z 0 a C P jYn aj z 0 z D FXn z 0 a C P jYn aj z 0 z :
Because FZ is continuous in z 0 , the function FX is continuous in z 0 a. One therefore has lim FXn .z 0 a/ D FX .z 0 a/ D FZ .z 0 /: n!1
Furthermore, by Proposition VII.2.2, one has lim P jYn
n!1
Summarizing one may conclude that
aj z 0
z D 0:
lim FZn .z/ FZ .z 0 / FZ .z C "/:
n!1
As " > 0 was chosen arbitrarily, it follows from this that lim FZn .z/ FZ .z C/:
n!1
Giving the expression 1
./
FZn .z/ a similar treatment, one arrives at lim FZn .z/ FZ .z /:
./
n!1
From () and () it follows that lim FZn .z/ D FZ .z/
n!1
in all points z where FZ is continuous. This means that Zn ! Z in distribution, thus proving statement (i). The proof of (ii) can be given in a similar way (see Exercise 26).
492
Chapter VIII Vectorial statistics
Next, some notational preparation to the following (interesting) theorem about statistical functionals. Suppose ƒ1 ; : : : ; ƒp are linear statistical functionals with domains Dƒ1 ; : : : ; Dƒp respectively. As in §VII.10 an associated vectorial map ƒ W Dƒ ! Rp is defined by ƒ.F / WD ƒ1 .F /; : : : ; ƒp .F /
for all F 2 Dƒ D Dƒ1 \ \ Dƒp . Denote i .x/
So
WD ƒi .Ex / ψ.x/ D
and in these notations one has Z ƒi .F / D i .x/ dPF .x/; Z Z ƒ.F / D ψ.x/ dPF .x/ D
and
ψ.x/ WD ƒ.Ex /:
1 .x/; : : : ;
p .x/
1 .x/ d PF .x/;
::: ;
Z
p .x/ dPF .x/
Hence, if X is a stochastic variable with distribution function F , then
:
ƒ.F / D EŒψ.X /: Imposing on F 2 Dƒ the condition that Z 2 i .x/ dPF .x/ < C1
for all i D 1; : : : ; p is the same as requiring that Z kψ.x/k2 dPF .x/ < C1:
In these notations the next theorem is formulated and proved.
Theorem VIII.7.9. Suppose that ƒ1 ; : : : ; ƒp are linear statistical functionals and that F 2 Dƒ is a distribution function such that Z kψ.x/k2 dPF .x/ < C1:
Moreover, let ' W Rp ! R be a Borel function that is differentiable in the point ƒ.F /. Then for every infinite scalar sample X1 ; X2 ; : : : from a population with distribution function F the sequence ± p ° b .X1 ; : : : ; Xn / n ' ƒ F 'Œƒ.F / .n D 1; 2; : : :/
Section VIII.7 Vectorial versions of the fundamental limit theorems
493
converges in distribution to an N.0; 2 /-distribution. Here 2 is given by 2 D ' 0 Œƒ.F / C Œψ.Xi / ' 0 Œƒ.F / : Proof. (Split up into three steps.) Step 1. Let T W Rp ! Rq be a linear operator and Y1 ; Y2 ; : : : a sequence of stochastic p-vectors converging in distribution to an N.0; C/-distribution. Then the sequence T Y1 ; T Y2 ; : : : converges in distribution to an N.0; T C T /-distribution. To see this, choose any N.0; C/-distributed variable Y. Then Yn ! Y in distribution: A linear operator T W Rp ! Rq is automatically continuous. Hence, by Theorem VIII.7.4, it follows that T Yn ! T Y
in distribution:
By Theorem VIII.4.5 the variable T Y is N.0; T C T /-distributed. This proves Step 1. Step 2. For an arbitrary set of scalars x1 ; : : : ; xn one has 1 b ƒ F .x1 ; : : : ; xn / D ƒ .Ex1 C C Exn / n 1 ψ.x1 / C C ψ.xn / D Œƒ.Ex1 / C C ƒ.Exn / D : n n Now suppose that F 2 Dƒ is such that Z kψ.x/k2 dPF .x/ < C1 and suppose that X1 ; X2 ; : : : is an infinite sample from a population with distribution function F . One then has for all n D 1; 2; : : : that b .X1 ; : : : ; Xn / D ψ.X1 / C C ψ.Xn / : ƒ F n
Applying the vectorial version of the central limit theorem (Theorem VIII.7.7) to the sample ψ.X1 /; ψ.X2 /; : : :, it follows that the sequence ± p ° b .X1 ; : : : ; Xn / n ƒ F ƒ.F / .n D 1; 2; : : :/ converges in distribution to an N.0; D/-distribution. Here D is given by D WD C Œψ.Xi /:
494
Chapter VIII Vectorial statistics
Furthermore, the strong law of large numbers (Theorem VIII.7.2) says that b .X1 ; : : : ; Xn / ƒ F ƒ.F / ! 0 strongly:
Step 3. If a function ' W Rp ! R is differentiable in a 2 Rp , then for the function r W Rp ! R defined by r.h/ WD '.a C h/ one has
' 0 .a/h .h 2 Rp /
'.a/
r.h/ D 0: h!0 khk lim
Define " W Rp ! R by r.h/ khk
".h/ WD
if h ¤ 0
and
".0/ WD 0:
Then " is continuous in 0 and one has by construction that '.a C h/
'.a/ D ' 0 .a/h C khk ".h/:
Next, in the equality above, set a D ƒ.F /
and
b .X1 ; : : : ; Xn / h D Hn WD ƒ F
p Then, after multiplying by a factor n, one obtains i ± p ° h b .X1 ; : : : ; Xn / n ' ƒ F 'Œƒ.F / p p D ' 0 Œƒ.F /. n Hn / C k n Hn k ".Hn /:
ƒ.F /:
()
Now a look is taken at the asymptotic behavior of the right side of this equality. In Step 2 one has seen that Hn ! 0 strongly. Because lim ".h/ D 0;
h!0
it follows that Then of course, one also has
".Hn / ! 0
strongly:
in distribution: p Moreover, as explained in Step 2, the sequence n Hn converges for n ! 1 in distribution to an N.0; D/-distribution. As the map ".Hn / ! 0
x ! kxk
Section VIII.7 Vectorial versions of the fundamental limit theorems
495
is continuous it follows (Theorem VIII.7.4) that the sequence p k n Hn k .n D 1; 2; : : :/ is also convergent in distribution. Applying Slutsky’s theorem (Theorem VIII.7.8) it follows that p k n Hn k ".Hn / ! 0 in distribution: ./ p As said before, the sequence n Hn converges for n ! 1 in distribution to an N.0; D/-distribution, where D D C Œψ.Xi /: By Step 1 this implies that the sequence p ' 0 Œƒ.F / . n Hn / .n D 1; 2; : : :/
converges in distribution to an N.0; 2 /-distribution, where 2 is given by 2 D ' 0 Œƒ.F / C Œψ.Xi / ' 0 Œƒ.F / :
./
The above, together with ./ and ./, implies by Slutsky’s theorem that i ± p °h b .X1 ; : : : ; Xn / n ƒ F 'Œƒ.F / .n D 1; 2; : : :/
converges in distribution to the N.0; 2 /-distribution, where the value of 2 is given by ./.
Remark 1. This theorem can be generalized in several directions. For example, for ' one could also take a function of the form ϕ W Rp ! Rq . Furthermore one could generalize to cases where infinite vectorial samples X1 ; X2 ; : : : are involved. Remark 2. Once again the special role played by the normal distribution is emphasized here. Example 2. Define the linear statistical functionals ƒ1 and ƒ2 on the domains Z Z ° ± ° ± Dƒ1 WD F W jxj dPF .x/ < C1 and Dƒ2 WD F W x 2 dPF .x/ < C1 by
ƒ1 .F / WD
Z
x dPF .x/
and
Moreover, define ' W R2 ! R as follows: '.x1 ; x2 / WD x2
.x1 /2
ƒ2 .F / WD
Z
x 2 dPF .x/:
.x1 ; x2 / 2 R2 :
This function is everywhere differentiable on R2 .
496
Chapter VIII Vectorial statistics
For all F 2 Dƒ D Dƒ1 \ Dƒ2 D Dƒ2 the scalar ' ƒ1 .F /; ƒ2 .F / presents the variance of a population with distribution function F . Next, let X1 ; X2 ; : : : be an infinite sample from a population with distribution function F . Then for all n D 2; 3; : : : one has b .X1 ; : : : ; Xn / D n 1 Sn2 ; ' ƒ F n where Sn2 is the sample variance of the sample X1 ; : : : ; Xn . If the distribution function F of the population satisfies the condition Z x 4 dPF .x/ < C1;
then by Theorem VII.6.9 the sequence ² ³ p n 1 2 n Sn ' ƒ.F / n
.n D 1; 2; : : :/
converges in distribution to an N.0; 2 /-distribution. Now 2 is expressed in terms of the first four moments of the population. To this end, as a first step, note that the function ψ is in this case given by ψ.x/ D ƒ.Ex / D ƒ1 .Ex /; ƒ2 .Ex / D .x; x 2 /: Hence the covariance operator C ψ.Xi / is given by † = ψ.Xi / D C ψ.Xi / cov.X; X/ cov.X; X 2 / 2 21 3 1 2 D D : cov.X; X 2 / cov.X 2 ; X 2 / 3 1 2 4 22 The derivative ' 0 .x1 ; x2 / has the following 1 2 matrix: Œ' 0 .x1 ; x2 / D . 2x1
Therefore The 1 1-matrix
Œ' 0 ƒ.F / D . 21
' 0 Œƒ.F / C Œψ.X
2 D . 21 D 4
1/
i
1/: 1/:
/ ' 0 Œƒ.F /
2 21 3 1 2
43 1 C 82 21
can now be expressed as 3 1 2 21 4 22 1
441
22 :
In this way one has expressed 2 in terms of the moments 1 ; 2 ; 3 ; 4 of the population. Using Slutsky’s theorem, it is easily proved (see Exercise 27) that the sequence ¯ p ® 2 n Sn ' ƒ.F / .n D 1; 2; : : :/
also converges in distribution to this N.0; 2 /-distribution.
497
Section VIII.8 Normal correlation analysis
VIII.8 Normal correlation analysis When there are two quantities X and Y in the field of view it is often important to know whether they are statistically independent or not. (For example, think of the cholesterol level X and the blood pressure Y of a randomly chosen person.) Rather rough methods to test independence are for example the 2 -test on independence and Spearman’s rank correlation test, both non-parametric. In this section a parametric test is developed, based on the assumption that it is known beforehand that the 2vector .X; Y / has a normal distribution. In that case X and Y are independent if and only if they are uncorrelated (see Theorem VIII.4.2). In terms of X ; Y and the matrix † = of the covariance operator of the 2-vector .X; Y / can be expressed as follows: cov.X; X/ cov.X; Y / X2 X Y † = D ŒC D D : () cov.Y; X/ cov.Y; Y / X Y Y2 In this notation one has Proposition VIII.8.1. The probability density of a normally distributed 2-vector .X; Y / is of the form 1 fX;Y .x; y/ D exp p 2X Y 1 2
x X 2 2.1 2 / X y Y y Y 2 x X C : 2 X Y Y 1
Proof. Theorem VIII.4.12 in the 2-dimensional case can be read as: X2 X Y D X2 Y2 .1 2 /: det.C/ D det.† =/ D det X Y Y2 Furthermore the matrix of the operator C
ŒC
1
D† =
1
1
is † = 0
1
, so
1 B 1 B X2 D 1 2 @ X Y
1 X Y C C 1 A: Y2
As said before, if .X; Y / is normally distributed then statistical independence of X and Y is equivalent to D 0. Therefore, given a 2-dimensional normally distributed population, it is useful to have an impression of the magnitude of . For this reason now an explicit expression for the maximum likelihood estimator of is given:
498
Chapter VIII Vectorial statistics
Proposition VIII.8.2. Suppose that the .X1 ; Y1 /; : : : ; .Xn ; Yn / form a sample from a 2-dimensional normally distributed population. Then the maximum likelihood estimator of , based on this sample, is of the form
b Ds
n P
.Xi
iD1 n P
.Xi
iD1
X /.Yi Y / : s n P X/2 .Yi Y /2 iD1
Proof. By Theorem VIII.6.13, given an outcome .x1 ; y1 /; : : : ; .xn ; yn /, the likelihood function takes on a global maximum when .X ; Y / D .x; y/ and
1 X .xi ; yi / n n
CD
iD1
.x; y/ ˝ .xi ; yi /
.x; y/ :
The matrix of this C is given by n 1 X xi x .xi x yi y/ ŒC D yi y n iD1 1 1 Sxx Sxy †i .xi x/2 †i .xi x/.yi y/ : D D †i .yi y/2 n †i .xi x/.yi y/ n Sxy Syy Using the representation () it is easy to verify that this implies that b Dp
Sxy p Sxx Syy
is the maximum likelihood estimator of . This proves the proposition. For another proof, see Exercise 28. Remark. From the above it follows that in fact b is Pearson’s product-moment correlation coefficient of the cloud of points .X1 ; Y1 /; : : : ; .Xn ; Yn /. In §VI.4 this quantity was denoted by RP . This notation will be retained. So the notation b disappears to make room for RP . To get more grip on estimation processes in which RP is used as an estimator, it would be convenient to know the probability distribution of RP . As said before (in §VI.4) it is a rather complicated piece of work to find an explicit form for this distribution. / / The probability distribution of R , in the case of bivariate normally distributed populations, was P first determined by Sir R. A. Fisher (see [4], [23]).
Section VIII.8 Normal correlation analysis
499
To start with, Fubini’s theorem (Theorem I.5.2) is reformulated in the form of the next lemma: Lemma VIII.8.3. Let X and Y be a statistically independent pair of stochastic n-vectors. Suppose g W Rn Rn ! R is a Borel function. Define V WD g.X; Y/ and define for all fixed x 2 Rn the variable Vx by Vx WD g.x; Y/: Then the probability distributions of V and Vx are related as follows: Z PV .A/ D PVx .A/ dPX .x/ for all Borel sets A in R. In particular, if the Vx are mutually identically distributed then for all x 2 Rn the variables V and Vx are identically distributed. Proof. Choose an arbitrary Borel set A in R. Define for all x 2 Rn the set Bx Rn by Bx WD ¹y 2 Rn W g.x; y/ 2 Aº:
In this notation, exploiting the independence of the pair X; Y together with Theorem I.5.2, one may write PV .A/ D P g.X; Y/ 2 A D P .X; Y/ 2 g 1 .A/ Z Z D d PX;Y D 1g 1 .A/ .x; y/ dPX;Y .x; y/ g
(Fubini) D
1 .A/
Z ²Z
Z ²Z
1g
³ dPX .x/ 1 .A/ .x; y/ d PY .y/
³ D 1Bx .y/ d PY .y/ dPX .x/ Z Z Z D PY .Bx / d PX .x/ D P .Y 2 Bx / d PX .x/ D P g.x; Y/2A dPX .x/ Z Z () D P .Vx 2 A/ dPX .x/D PVx .A/ dPX .x/:
If the Vx are mutually identically distributed then for an arbitrary fixed element a 2 Rn one has PVx D PVa for all x 2 Rn : Then ./ can be read as follows: Z Z PV .A/ D PVa .A/ dPX .x/ D PVa .A/ 1 dPX .x/ D PVa .A/: This shows that PV D PVa .
500
Chapter VIII Vectorial statistics
Now apply this lemma is applied to prove: Proposition VIII.8.4. Suppose .X1 ; Y1 /; : : : ; .Xn ; Yn / is a sample of size n from a 2-dimensional normally distributed population with D 0. Then the variable p n 2 RP q 1 RP2 is t -distributed with n
2 degrees of freedom.
Proof. (Split up into two steps.) Step 1. The system ¹X1 ; : : : ; Xn ; Y1 ; : : : ; Yn º is statistically independent. To see this, note that, by Lemma VIII.6.1, the stochastic vector .X1 ; : : : ; Xn ; Y1 ; : : : ; Yn / has a Gaussian distribution. The components of this vector are pairwise statistically independent and therefore uncorrelated. By Theorem VIII.4.2 this implies that the system ¹X1 ; : : : ; Xn ; Y1 ; : : : ; Yn º is statistically independent. Step 2. One may write RP D p
SXY p
SXX SYY
;
where X D .X1 ; : : : ; Xn / and Y D .Y1 ; : : : ; Yn /. The quantity RP is, by Theorem IV.4.4, invariant under positive scale transformations. Without changing the value of RP one may therefore replace Xi by .Xi X /=X and Yi by .Yi Y /=Y . Hence, without loss of generality one may assume that ¹X1 ; : : : ; Xn ; Y1 ; : : : ; Yn º is a statistically independent system consisting of N.0; 1/-distributed variables. One has p p p p p n 2 RP n 2 SXY = SXX SYY n 2 SXY = SXX D q D q q 2 =S 2 =S 1 RP2 1 SXY S SYY SXY XX YY XX p p n 2 SXY = SXX Ds : n P 2 2 Yi n.Y /2 SXY =SXX iD1
Define g W Rn Rn ! R by
g.x; y/ WD qP n
p
n
2 iD1 yi
p 2 Sxy = Sxx n.y/2
2 =S Sxy xx
:
501
Section VIII.8 Normal correlation analysis
It will be proved that for all fixed x 2 Rn the quantity p n 2 .RP /x D g.x; Y/ p 1 .RP /2x
has a t -distribution with n 2 degrees of freedom. To this end, note that the probability distribution of the variable Y D .Y1 ; : : : ; Yn / is rotation invariant (see Proposition I.6.4). This can be exploited, as will be seen below, by choosing in a sly way an orthogonal linear operator Q W Rn ! Rn . First, note that the vectors 1 v1 D p .1; : : : ; 1/ n
1 v2 D p .x1 Sxx
and
x; : : : ; xn
x/
form an orthonormal system in Rn . Complete this system to an orthonormal basis ¹v1 ; : : : ; vn º of Rn . Let Q be the orthogonal linear operator that converts the standard basis ¹e1 ; : : : ; en º of Rn into ¹v1 ; : : : ; vn º. For the matrix ŒQ of Q one then has 0 1 1 p p .x1 x/= Sxx n B : :: :: :: :: C C : ŒQ D B : : : :A: @ : p 1 p .xn x/= Sxx n Define
Z WD Q
1
Y:
Now the components Z1 ; : : : ; Zn of Z form a statistically independent system of N.0; 1/-distributed variables. Realizing that the matrix of Q 1 is ŒQt , it is easily verified that p 1 SxY : Z1 D n Y and Z2 D p Sxx Of course one has n X iD1
Yi2 D kYk2 D kQ Zk2 D kZk2 D
Hence g.x; Y/ D s
p n P
iD1
n
Zi2
2 Z2 Z12
Z22
By Proposition I.4.2 the variables
Z2
and
n X iD3
Zi2
n X
Zi2 :
iD1
p n D s
2 Z2 n P
iD3
Zi2
:
./
502
Chapter VIII Vectorial statistics
are statistically independent. Moreover the variable Z2 is N.0; 1/-distributed and, by Proposition II.2.7, the variable n X Zi2 iD3
2 -distributed
is with n 2 degrees of freedom. Looking back at ./ and Definition II.3.1, one may conclude that for all fixed x 2 Rn the variable g.x; Y/ is t -distributed with n 2 degrees of freedom. Because X and Y are statistically independent one may apply Lemma VIII.8.3, by which p n 2 RP g.X; Y/ D p 1 .RP /2
p n 2 .RP /x g.x; Y/ D p 1 .RP /2x
and
are identically distributed. This proves the proposition.
Remark. It is a nice exercise to prove the previous proposition by means of the theory developed in §IV.3 (see Exercise 32). Lemma VIII.8.5. If the stochastic variable X is t -distributed with n degrees of freedom, then the variable X Y Dp X2 C n has a probability density given by nC1 2 .1 fY .y/ D p n2
1
y 2/ 2 n
1
where y 2 Œ 1; C1:
Proof. Rely on §I.11, Exercise 33. In Exercise 29 the reader is asked to fill in the details. Proposition VIII.8.6. Suppose that the .X1 ; Y1 /; : : : ; .Xn ; Yn / form a sample from a 2-dimensional normally distributed population. If D 0, then the probability density of the statistic RP is given by n2 1 fRP .x/ D p .1 n2 2
1
x2/ 2 n
2
where x 2 Œ 1; C1:
Furthermore one has
E.RP / D 0
and
var.RP / D
1 n
1
:
503
Section VIII.8 Normal correlation analysis
Proof. The explicit form for the probability density of RP can be deduced by combining Proposition VIII.8.4 and Lemma VIII.8.5. By symmetry it follows that Z C1 E.RP / D xfRP .x/ dx D 0: 1
Using Proposition II.6.3 and Theorem II.6.6 it follows that var.RP / D E.RP2 /
E.RP /2 Z C1 2 D E.RP / D x 2 fRP .x/ dx D2
Z
0
1
x2 p
1 n 1 2 .1 n2 2
1
x2/ 2 n
2
dx D
1 n
1
:
Basing oneself on the outcome of RP , the previous proposition can be used to test the statistical independence of two quantities: Example 1. Given is a 2-dimensional normally distributed population. There is a wish to test H0 W D 0 versus H1 W > 0 at a prescribed level of significance ˛ D 0:10. A sample of size 11 results in an outcome 0:28 for RP . The corresponding P -value is now ! p p 11 2 RP 11 2 0:28 P -value D P .RP 0:28/ D P q p 1 0:282 1 RP2 ! 3RP DP q 0:875 > 0:10: 1 RP2 q Here Table IV was used and the fact that 3RP = 1 RP2 is under H0 t -distributed with 9 degrees of freedom. The P -value exceeds the prescribed level of significance. Therefore it is decided to look upon H0 as being true.
Proposition VIII.8.7. Suppose .X1 ; Y1 /; : : : ; .Xn ; Yn / is a sample of size n from a 2-dimensional normally distributed population with correlation coefficient . Then for large n the statistic 1 1 C RP Z WD log 2 1 RP is approximately N.; 2 /-distributed with 1C 1 D log and 2 1
2 D
1 n
3
:
504
Chapter VIII Vectorial statistics
Proof. See [4], [40]. A detailed study on RP , carried out by the American statistician Harold Hotelling, can be found in [38]. Remark. The statistic
1 1 C RP Z D log 2 1 RP
is often called Fisher’s Z. Here the letter Z does not stand for an N.0; 1/-distributed variable! Example 2. The height and the weight of a randomly chosen Amsterdam woman is denoted by X and Y respectively. It is presumed that .X; Y / has (approximately) a 2-dimensional normal distribution with correlation coefficient . At a level of significance of 5% one wants to test H0 W D 0:80
against
H1 W < 0:80:
To this end a sample of size 40 is drawn from this population. After the experiment is completed the statistic RP shows the following outcome: RP D 0:690;
which means that
1 1 C RP D 0:848: Z D log 2 1 Rp
By Proposition VIII.8.7 the variable Z was a priori (approximately) N.; 2 /-distributed with D
1 1 C 0:80 log D 1:099 2 1 0:80
and
2 D
1 40
3
D
1 : 37
Using this, the P -value can be computed P -value D P .RP 0:690/ D P .Z 0:848/ 0:06: For this reason it is decided to accept H0 . Remark. As already mentioned, instead of the tests discussed above one could also use the 2 -test on independence or Spearman’s rank correlation test. However, when it is beforehand known that the 2-dimensional population is normally distributed, the latter two tests lead to an unnecessary loss of power (see [40]). Alternatively one could, of course, use bootstrap techniques.
505
Section VIII.9 Multiple regression analysis
VIII.9 Multiple regression analysis In Chapter IV simple regression theory has been treated. In that chapter one started from two (scalar) quantities x and y between which a linear relationship y D ˛ C ˇx was presumed. The aim was to make in an intelligent way estimates, based on measurements, of the parameters ˛ and ˇ. In the following a scalar quantity y is considered, depending linearly on variables x1 ; : : : ; xm . Otherwise stated, it is assumed that the quantities y and x1 ; : : : ; xm are linked by y D ˛ C ˇ1 x1 C C ˇm xm : ./ Under this assumption one talks about multiple linear regression. In order to simplify notation one could write β D .ˇ1 ; : : : ; ˇm /
and
x D .x1 ; : : : ; xm /:
In these notations ./ can be rewritten as y D ˛ C hβ; xi: Analogous to the considerations in Chapter IV it is presumed that the variable x can be adjusted at any prescribed value, that is to say, it is assumed that x is a controlled variable. At any adjusted value for x the measurements concerning y are looked upon as being outcomes of a stochastic variable Yx . Given a set of prescribed adjustments x1 ; : : : ; xn of x the Yx1 ; : : : ; Yxn will frequently be written as Y1 ; : : : ; Yn , thus simplifying typography. The theory that will be developed is based on the following three axioms of linear regression: Axiom I:
For every set of adjusted values x1 ; : : : ; xn for x the variables Yx1 ; : : : ; Yxn constitute a statistically independent system.
Axiom II:
E.Yx / D ˛ C hβ; xi.
Axiom III: For every sequence x1 ; : : : ; xn one has var.Yx1 / D D var.Yxn / D 2 : One talks of normal regression analysis when a fourth axiom is added: Axiom IV: For all x the variable Yx is normally distributed. Note that Axioms II, III and IV can be summarized by saying that for all x the variable Yx has an N.˛ C hβ; xi; 2 )-distribution.
506
Chapter VIII Vectorial statistics
When at adjusted values x1 ; : : : ; xn for x respectively the values y1 ; : : : ; yn for y are measured, then the set of points .x1 ; y1 /; : : : ; .xn ; yn / 2 RmC1 will be called the cloud of points belonging to the experiment. For an arbitrary estimate a of ˛ and b of β the sum of squares of errors is defined by f .a; b/ WD
n X
.yi
iD1
a
hb; xi i/2 :
A geometrical interpretation is now assigned to this sum. To this end the following notations are introduced: x1 D .x11 ; x21 ; : : : ; xm1 /; : : : ; x1 D .x11 ; x12 ; : : : ; x1n /;
xn D .x1n ; : : : ; xmn /;
: : : ; xm D .xm1 ; : : : ; xmn /:
So xi is a vector in Rn representing the n adjusted values of (controlled) variable xi . Now set y WD .y1 ; : : : ; yn / and e WD .1; 1; : : : ; 1/: In these notations the sum of squares of errors reads as f .a; b/ D ky
ae
b1 x1
b2 x2
bm xm k2 :
Using a bit linear algebra it is easy to minimize this expression. Namely, let M be the linear subspace of Rn defined by M WD linear span ¹e; x1 ; : : : ; xm º: Looking back at §VIII.1(g) it is now immediate that the sum of squares of errors takes on a global minimum for those values of a and b making valid the equality ae C b1 x1 C C bm xm D PM y; where PM is the orthogonal projection from Rn onto M. It follows that the minimizing elements a and b are unique if the system ¹e; x1 ; : : : ; xm º is linearly independent in Rn . This will be systematically presumed. The unique minimizing elements a and b will be denoted by b ˛ and b β . One calls them the least squares estimates of ˛ and β respectively. The n .m C 1/-matrix
0 1 1 x11 x21 xm1 B1 x12 x22 xm2 C B C M WD B : :: :: :: C @ :: : : : A 1 x1n x2n xmn
507
Section VIII.9 Multiple regression analysis
is usually called the design matrix. The matrix of PM can be expressed in terms of this matrix (see §VIII.1(g)): ŒPM D M.Mt M/
1
Mt :
The defining equality b ˛ eCb ˇ 1 x1 C C b ˇ m xm D PM y
for the ˛; b ˇ 1; : : : ; b ˇ m now implies (see §VIII.1(g)) 0 1 0 1 b ˛ y1 C Bb ˇ B 1C t 1 t B :: C M B : C D M.M M/ M @ : A : @ :: A yn b ˇm
Because the columns of M are linearly independent, it follows from this that 0 1 0 1 b ˛ y1 Bb C B ˇ1 C t 1 t B :: C B :: C D .M M/ M @ : A : @ : A yn b ˇm
Interpreting the y1 ; : : : ; yn as outcomes of the variables Y1 ; : : : ; Yn respectively, the quantities b ˛; b ˇ 1; : : : ; b ˇ m can be looked upon as defined by the matrix product 0 1 0 1 b ˛ Y1 Bb C B ˇ1 C t 1 t B :: C B :: C D .M M/ M @ : A : @ : A Yn b ˇm The following notation will be used:
Y D .Y1 ; : : : ; Yn /: In these notations the considerations above are summarized: Proposition VIII.9.1. The least squares estimators b ˛; b β of ˛; β are given by .b ˛; b β / D TY D T.Y1 ; : : : ; Yn /
where T W Rn ! RmC1 is the linear operator belonging to the matrix ŒT D .Mt M/
1
Mt :
508
Chapter VIII Vectorial statistics
Using this proposition it is easy to deduce: Proposition VIII.9.2. For the vectorial least squares estimator .b ˛; b β / one has: (i) E.b ˛; b β / D .˛; β/, (ii) † = .b ˛; b β / D ŒC .b ˛; b β / D 2 .Mt M/
1.
Proof. Adapt Example 2 of §VIII.2.
The components of the vectorial estimator .b ˛; b β / are all linear in Y1 ; : : : ; Yn . For b this reason .b ˛ ; β / is called a linear estimator of .˛; β/. Now a brief look is taken at the class of all linear estimators of .˛; β/. It is clear that every element .e ˛; e β / in this class can be written as .e ˛; e β / D SY D S.Y1 ; : : : ; Yn /
where S W Rn ! RmC1 is some linear operator. Denoting by V W RmC1 ! Rn the linear operator belonging to the design matrix M, one has the following criterion for unbiasedness of a linear estimator: Lemma VIII.9.3. The linear estimator .e ˛; e β / D SY of .˛; β/ is unbiased if and only mC1 mC1 if SV D I, where I W R !R is the identity. Proof. The estimator .e ˛; e β / is unbiased if and only if E.e ˛; e β / D .˛; β/;
./
whatever the value of .˛; β/ may be. When writing out the left side of this equality one arrives at: E.e ˛; e β/ D E S.Y1 ; : : : ; Yn / D S E.Y1 ; : : : ; Yn / D S E.Y1 /; : : : ; E.Yn / D S .˛ C hβ; x1 i; : : : ; ˛ C hβ; xn i/ D S .˛e C ˇ1 x1 C C ˇm xm / D ˛Se C ˇ1 Sx1 C C ˇm Sxm : The linear operator V W RmC1 ! Rn with matrix M converts the standard basis ¹e1 ; : : : ; emC1 º in RmC1 into the system ¹e; x1 ; : : : ; xm º. Therefore, counting with the above, ./ can be rewritten as E.e ˛; e β / D ˛SVe1 C ˇ1 SVe2 C C ˇm SVemC1 D ˛e1 C ˇ1 e2 C C ˇm emC1 :
This holds for all .˛; β/ 2 RmC1 if and only if SVei D ei that is, if and only if SV D I.
for all i D 1; : : : ; m C 1;
509
Section VIII.9 Multiple regression analysis
Using the lemma above the following important theorem is easily proved: Theorem VIII.9.4 (C. F. Gauß, A. A. Markov). Among the unbiased linear estimators of .˛; β/ the least squares estimator .b ˛; b β / is the unique one of minimum total variance. Proof. Let .e ˛; e β / D SY be any unbiased linear estimator of .˛; β/. It must be proved that for such an estimator one always has var.e ˛; e β / var.b ˛; b β/
and that equality can occur only if .e ˛; e β / D .b ˛; b β /. To this end, first note that var.e ˛; e β/
var.b ˛; b β / D tr C .e ˛; e β/
tr C .b ˛; b β/ :
./
For a linear estimator .e ˛; e β / of .˛; β/ the covariance operator is always given by C .e ˛; e β / D C .SY/ D S C .Y/ S D S 2 I S D 2 SS :
In particular
C .b ˛; b β / D 2 TT ;
where (by Proposition VIII.9.1) T is given by T D .V V/
1
V :
It follows that ./ can be rewritten as var.e ˛; e β/
var.b ˛; b β / D 2 tr.SS
TT /:
./
Now a kind of an algebraic miracle follows. Namely, it turns out that SS
TT D .S
T/ :
T/ .S
./
To see this, just write out the right side of this equality: .S
T/ .S
T / D SS
D SS
ST
TS C TT
SV.V V/
C .V V/
1
1
.V V/
V V.V V/
1
1
V S
:
The estimator .e ˛; e β / is assumed to be unbiased, so by Lemma VIII.9.3 one has SV D ImC1 and consequently V S D ImC1 . Exploiting this one arrives at .S
T/ .S
T / D SS D SS
.V V/
1
.V V/
1
.V V/
D SS
1
C .V V/
TT ;
1
510
Chapter VIII Vectorial statistics
which proves ./. The proof of the theorem can now quickly be completed, for ./ can now be read as var.e ˛; e β/ var.b ˛; b β/ D 2 tr .S T/ .S T/ D 2 kS Tk2 ;
where on the right side the Hilbert–Schmidt norm appears. This shows that var.e ˛; e β / var.b ˛; b β/
and that equality can only occur if S D T, that is, if .e ˛; e β / D .b ˛; b β /.
Remark. The theorem above, in this generality, was proved by Andrej A. Markov (1856–1922), a famous Russian name in the field of probability theory. In the case of simple regression analysis the result was already known to the brilliant German mathematician, physicist and astronomer Carl F. Gauß (1777–1855). Besides the parameters ˛ and β there is in the model of linear regression another parameter, namely 2 , the variance of the variables Yx . The next goal is to construct a suitable estimator for 2 . To this, first take a look at the quantity SSE WD kY
PM Yk2 D kPM? Yk2 :
Lemma VIII.9.5. The quantity SSE can be expressed as SSE D kPM? Y0 k2 ; where M? is the orthogonal complement of M and Y0 D Y
˛e
ˇ1 x1
ˇm xm :
Proof. By §VIII.1(g) one may write SSE D kY
PM Yk2 D k.I
PM /Yk2 D kPM? Yk2 :
Because ˛ e C ˇ1 x1 C C ˇm xm 2 M, one has PM? .˛ e C ˇ1 x1 C C ˇm xm / D 0: Hence and consequently
PM? Y D PM? Y0 SSE D kPM? Yk2 D kPM? Y0 k2 :
The next proposition provides an unbiased estimator of 2 .
511
Section VIII.9 Multiple regression analysis
Proposition VIII.9.6. For the sum of squares of errors one has E.SSE/ D n .m C 1/ 2 :
In other words, the variable
SSE= n is an unbiased estimator of 2 .
.m C 1/
Proof. First note that E.Y/ D E.Y1 /; : : : ; E.Yn / D .˛ C hβ; x1 i; : : : ; ˛ C hβ; xn i/ D ˛ e C ˇ1 x1 C C ˇm xm :
It follows from this that for Y0 D Y
˛e
ˇ1 x1
ˇm xm one has
E.Y0 / D 0: By Proposition VIII.2.2 this implies that E.PM? Y0 / D 0: Using this, together with Propositions VIII.2.6, VIII.2.10 and the previous lemma one may write E.SSE/ D E kPM? Y0 k2 D var PM? Y0 D tr C .PM? Y0 / D tr PM? C .Y0 / P ? : M
However, one has C .Y0 / D C .Y/ D 2 I, P
M?
D PM? and P2
M?
E.SSE/ D tr.PM? . 2 I/PM? / D tr. 2 P2
M?
D PM? , hence
/ D 2 tr.PM? /:
By §VIII.1(m) one has furthermore tr.PM? / D dim.M? / D n
dim M D n
.m C 1/:
This proves the proposition. As promised in §IV.3 it will now be derived that the least squares statistics b ˛; b β and SSE are sufficient for the parameters ˛; β and .
Theorem VIII.9.7. In the case normal regression analysis the vectorial statistic .b ˛; b ˇ 1; : : : ; b ˇ m ; SSE/ is sufficient for the vectorial parameter .˛; ˇ1 ; : : : ; ˇm ; /.
512
Chapter VIII Vectorial statistics
Proof. The factorization theorem (Theorem II.10.3) is applied to the probability density fY of the stochastic n-vector Y D .Y1 ; : : : ; Yn /. In normal regression analysis this density is given by fY .y/ D
1 1 exp ky 2 2 n .2/n=2
˛e
ˇ1 x1
ˇm xm k2 :
Analogous to the notation in Lemma VIII.9.5, set y0 WD y
˛e
ˇ1 x1
ˇm xm :
The density fY can then be expressed as 1 1 0 2 exp ky k 2 2 n .2/n=2 1 1 0 2 0 2 D n exp kP y k : kP y k C ? M M 2 2 .2/n=2
fY .y/ D
However, one has kPM y0 k2 D kb ˛eCb ˇ 1 x1 C C b ˇ m xm D k.b ˛
˛/e C .b ˇ1
Furthermore (Lemma VIII.9.5):
˛e
ˇ1 x1
ˇ1 /x1 C C .b ˇm
ˇm xm k2
ˇm /xm k2 :
kPM? y0 k2 D SSE: It thus appears that fY can be represented in the following way: fY .y/ D '.˛; ˇ1 ; : : : ; ˇm ; ; b ˛; b ˇ 1; : : : ; b ˇ m ; SSE/;
where ' W RmC1 .0; C1/ RmC1 .0; C1/ ! R is a continuous function. Moreover there exists a continuous function g W Rn ! RmC2 such that
It follows that
.b ˛; b ˇ 1; : : : ; b ˇ m ; SSE/ D g.y/: fY .y/ D h.y/ ' ˛; β; ; g.y/ ;
where h 1. By the factorization theorem (continuous version) the above implies that the statistic .b ˛; b ˇ 1; : : : ; b ˇ m ; SSE/ D g.Y/ is sufficient for the vectorial parameter .˛; ˇ1 ; : : : ; ˇm ; /.
513
Section VIII.9 Multiple regression analysis
It is often very useful to consider, aside from a starting model y D ˛ C ˇ1 x1 C C ˇm xm ; a so-called reduced model: y D ˛ C ˇ1 x1 C C ˇk xk
.k < m/:
When doing this the original model is usually called the full model. Given a cloud of points .x1 ; y1 /; : : : ; .xn ; yn / in RmC1 , the sum of squares of errors in the reduced model will generally not be the same as in the full model. This phenomenon will be studied in more detail now. To start with, let R be the linear subspace of Rn defined by R WD linear span of ¹e; x1 ; : : : ; xk º: Denoting the sum of squares of errors in the reduced model by SSE R , one may write SSE R D kY
PR Yk2 D kPR? Yk2 :
In order to describe the discrepancy between SSE R and SSE (the latter being the sum of squares of errors in the full model) the next lemma will be helpful: Lemma VIII.9.8. If R M, then:
(i) R? D M? ˚ .R? \ M/, (ii) kPR? yk2 D kPM? yk2 C kPR? \M yk2 .
Proof. (i) If R M then R? M? . Moreover R? R? \ M. Because R? is a linear subspace, this implies that R? M? C .R? \ M/: In order to prove the converse inclusion, suppose y 2 R? . Then PM? y 2 M? R? , and y PM? y D PM?? y D PM y 2 R? \ M:
Therefore
This proves that
y D PM? y C .y
Summarizing, one has
PM? y/ 2 M? C .R? \ M/:
R? M? C .R? \ M/: R? D M? C .R? \ M/:
In the above, because M? and R? \ M are orthogonal linear subspaces, the symbol + can be replaced by the symbol ˚. This proves (i). Statement (ii) is a direct consequence of (i).
514
Chapter VIII Vectorial statistics
Proposition VIII.9.9. One always has SSE R SSE. Proof. Using the previous lemma it follows that SSE R D kY
PR Yk2 D kPR? Yk2 kPM? Yk2 D kY
PM Yk2 D SSE:
Proposition VIII.9.10. In the most extreme case of model reduction the reduced model is given by y D ˛. One then has SSE R D
n X .Yi iD1
Y /2 D SYY :
Proof. See Exercise 36. In the case of normal regression analysis the exact probability distribution of the statistics b ˛; b ˇ and SSE can be determined: Theorem VIII.9.11. In normal regression analysis one has: (i) .b ˛; b β / is N.µ; C/-distributed with µ D .˛; β/
and
ŒC D 2 .Mt M/
(ii) The variable SSE= 2 is 2 -distributed with n
1
:
.m C 1/ degrees of freedom.
(iii) Under the assumption that the reduced model is valid, that is, if for all x D .x1 ; : : : ; xm / one has E.Yx / D ˛ C ˇ1 x1 C C ˇk xk ; the variables SSE= 2 and .SSE R SSE/= 2 are statistically independent and they have 2 -distributions with n .m C 1/ and m k degrees of freedom respectively. Proof. (i) In normal regression analysis the n-vector Y D .Y1 ; : : : ; Yn / has an (elementary) Gaussian distribution. Apply Theorem VIII.4.5 together with Propositions VIII.9.1 and VIII.9.2 and statement (i) follows. (ii) By Lemma VIII.9.5 one has SSE= 2 D kPM? Y0 =k2 ; where Y0 D Y ˛ e ˇ1 x1 ˇm xm . However, the variable Y0 = 2 is N.0; I/distributed. Using Lemma VIII.4.8 (or theorem VIII.4.9) it follows that SSE= 2 is 2 -distributed with dim.M? / D n .m C 1/ degrees of freedom.
515
Section VIII.9 Multiple regression analysis
(iii) Next the reduced model is taken as a starting point. Then ˇkC1 D D ˇm D 0 and therefore the variable Y0 D Y
˛e
ˇ1 x1
ˇk xk
is N.0; 2 I/-distributed. Hence Y0 = is N.0; I/-distributed. Furthermore, one has ˛e C ˇ1 x1 C C ˇk xk ? R? :
Consequently
PR? Y D PR? Y0 and PM? Y D PM? Y0 :
It thus appears that besides the equality
SSE= 2 D kPM? Y0 =k2
./
one has (apply Lemma VIII.9.5) .SSE R
SSE/= 2 D kPR? Y=k
D kPR? Y0 =k2
kPM? Y=k2
kPM? Y0 =k2
D kPR? \M Y0 =k2 :
()
Using ./, ./ and Theorem VIII.4.9 (with M1 D M? and M2 D R? \ M/ statement (iii) follows. This theorem makes it possible to develop a decision procedure when testing the hypothesis H0 W “the reduced model is valid”, that is: ˇkC1 D D ˇm D 0, against the alternative H1 W “the reduced model is not valid”, that is: ˇi ¤ 0 for at least one i D k C 1; : : : ; m: Namely, the outcome of the sum of squares of errors can be looked upon as a measure of discrepancy between the actual measurements and the presumed model. If the sum of squares of errors is small (large), this is an indication that the measurements are in good (bad) agreement with the model. If in a model reduction the sum of squares of errors increases out of proportion then the agreement between the measurements and the model is getting lost. With this in mind a decision procedure will be developed to test H0 against H1 . It is, roughly spoken, the following procedure: “if SSE R is much larger than SSE, then reject H0 ” and “if SSE R SSE, then accept H0 ”.
516
Chapter VIII Vectorial statistics
The following test statistic is introduced to make the above more precise: T WD
.SSE R SSE/=.m k/ : SSE=.n m 1/
The decision procedure sketched above can now also be formulated as follows: “if T is large, then reject H0 ” and “if T is small, then accept H0 ”. As always, in order to specify what is meant by a large or a small outcome of T one needs to know something about the probability distribution of T . The next theorem is throwing light on this: Theorem VIII.9.12. In normal regression analysis the test statistic T D
.SSE R SSE/=.m k/ SSE=.n m 1/
has under H0 an F -distribution with m k degrees of freedom in the numerator and n m 1 degrees of freedom in the denominator. Proof. A direct consequence of Theorems II.5.1 and VIII.9.11. This theorem can be used, for example, in the following way: Example. In a certain experiment one starts from the axioms of normal regression analysis in a model y D ˛ C ˇ1 x1 C ˇ2 x2 C ˇ3 x3 C ˇ4 x4 : Measurements result in a cloud of points consisting of 20 elements. The model turns out to be in good agreement with the measurements. At a prescribed level of significance of 10% one wants to test whether perhaps the reduced model y D ˛ C ˇ1 x1 C ˇ2 x2 is also in good agreement with the measurements. After the necessary computations one learns that the sum of squares of errors increases by 43%. This implies the following outcome for T : SSE R .SSE R SSE/=.4 2/ 15 D T D 1 D .1:43 1/ 7:5 D 3:23: SSE=.20 4 1/ SSE 2
Section VIII.10 The multiple correlation coefficient
517
A priori T was (under H0 ) F -distributed with 2 and 15 degrees of freedom in the numerator and denominator respectively. Thus it can be read off in Table V that P -value D P .T 3:23/ < 0:10: It thus appears that the P -value is smaller than the prescribed level of significance. Conform the agreed decision procedure the hypothesis H0 is rejected. That is to say, the announced model reduction is rejected. Remark 1. In order to have an indication to which extent the model is compatible with the actual measurements, one can take a look at the so-called multiple correlation coefficient. This quantity will be discussed in the next section. Remark 2. In this section it will be systematically assumed that the vectors x1 ; : : : ; xm are linearly independent in Rn . However, it could occur that these vectors are “almost” linearly dependent (for example if x1 is close to being in the linear span of the x2 ; : : : ; xm ). One then talks about multicollinearity. This phenomenon can manifest itself for example at the moment when the inverse of the matrix Mt M has to be computed (see Proposition VIII.9.2). Model reduction can be considered to get around this problem.
VIII.10 The multiple correlation coefficient In §IV.4 Pearson’s product-moment correlation coefficient RP was introduced. It was explained there that the value of RP can be interpreted as a measure for the amount of linear structure in the cloud of points belonging to the measurements. Otherwise stated, the value of Rp is a measure for the discrepancy between the linear model y D ˛ C ˇx and the paired measurements .x1 ; y1 /; : : : ; .xn ; yn /. In this section the above will be generalized to the model of multiple linear regression y D ˛ C ˇ1 x1 C C ˇm xm in relation to a given cloud of points .x1 ; y1 /; : : : ; .xn ; yn / in RmC1 . In the following, Proposition IV.5.2 will serve as a guide in defining the multiple correlation coefficient: Definition VIII.10.1. The scalar R in the interval Œ0; 1, defined by R2 WD 1 is called the multiple correlation coefficient.
SSE Syy
518
Chapter VIII Vectorial statistics
Remark 1. Propositions VIII.9.8 and VIII.9.9 guarantee that Syy SSE, so the above makes sense. Remark 2. The multiple correlation coefficient generalizes Pearson’s product-moment correlation coefficient in the sense that R D jRP j when m D 1. Here too the scalar R will be interpreted geometrically. To this end, denote (as was done in §IV.4) the linear span of the vector e D .1; 1; : : : ; 1/ by E and the quotient space Rn =E by V . The equivalence class belonging to an element a 2 Rn will be denoted by Œa. As proved in §IV.4, an inner product on V can be defined by setting hŒa; Œbi WD Sab : Given a cloud of points .x1 ; y1 /; : : : ; .xn ; yn / in RmC1 , define the vectors x1 ; : : : ; xm , y in Rn and the linear subspace M of Rn as in the previous section. Evidently ŒM D ¹Œa 2 V W a 2 Mº is a linear subspace of V . In a natural way an analogue to Lemma IV.4.2 is now presented: Lemma VIII.10.1. The following three statements are equivalent: (i) Œy 2 ŒM.
(ii) yi D ˛ C ˇ1 x1i C ˇ2 x2i C C ˇm xmi .i D 1; : : : ; n/.
(iii) The elements of the cloud of points fit exactly in the prescribed linear model. Proof. Analogous to the proof of Lemma IV.4.2. In the next theorem the multiple correlation coefficient is interpreted in terms of geometry in V : Theorem VIII.10.2. The multiple correlation coefficient is the cosine of the angle included by Œy and ŒM. Proof. The case m 2 will be considered. Let P be the orthogonal projection from Rn onto M, e P the orthogonal projection from V onto ŒM and the angle between Œy and ŒM. The proof is split up into four (small) steps. Step 1. If a ? e in Rn , then hŒa; Œbi D ha; bi for all b 2 Rn . To see this, note that a ? e is the same as saying that n
aD
1X 1 ai D ha; ei D 0: n n iD1
Hence
\[ \langle [a], [b] \rangle = S_{ab} = \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b}) = \sum_{i=1}^{n} a_i (b_i - \bar{b}) = \sum_{i=1}^{n} a_i b_i - \Big( \sum_{i=1}^{n} a_i \Big) \bar{b} = \sum_{i=1}^{n} a_i b_i = \langle a, b \rangle. \]
Step 2. For all $a \in \mathbb{R}^n$ one has $\widetilde{P}[a] = [Pa]$.

To prove this, proceed as follows: Set $v_1 := \frac{1}{\sqrt{n}}\, e$; then $v_1$ is of length 1 in $\mathbb{R}^n$. Complete the element $v_1$ in $M$ to an orthonormal basis $v_1, \ldots, v_{m+1}$ for $M$. According to Step 1 one now has for all $a \in \mathbb{R}^n$
\[ \langle [a], [v_i] \rangle = \langle a, v_i \rangle \qquad \text{for } i = 2, \ldots, m+1. \]
In particular it follows that the elements $[v_2], \ldots, [v_{m+1}]$ form an orthonormal basis for $[M]$. However, then (by §VIII.1(g)) one may write
\[ \widetilde{P}[a] = \sum_{i=2}^{m+1} \langle [a], [v_i] \rangle\, [v_i] = \sum_{i=2}^{m+1} \langle a, v_i \rangle\, [v_i] = \Big[ \sum_{i=2}^{m+1} \langle a, v_i \rangle\, v_i \Big] = \Big[ \sum_{i=1}^{m+1} \langle a, v_i \rangle\, v_i \Big] = [Pa]. \]
This proves Step 2.

Step 3. For all $a \in \mathbb{R}^n$ one has $\|a - Pa\|^2 = \|[a] - \widetilde{P}[a]\|^2$.

This is very easy to see. Namely, $(a - Pa) \perp M$ and therefore one surely has $(a - Pa) \perp e$. By Step 1 and Step 2 one may now write
\[ \|[a] - \widetilde{P}[a]\|^2 = \|[a] - [Pa]\|^2 = \|[a - Pa]\|^2 = \langle [a - Pa], [a - Pa] \rangle = \langle a - Pa,\, a - Pa \rangle = \|a - Pa\|^2. \]
Step 4. Completing the proof. By the theorem of Pythagoras, in $V$ one has for the vector $y$ (emerging from the cloud of points)
\[ \|[y]\|^2 = \|\widetilde{P}[y]\|^2 + \|[y] - \widetilde{P}[y]\|^2. \]
Consequently the cosine of the angle between $[y]$ and $[M]$ is given by
\[ \cos^2 \varphi = \frac{\|\widetilde{P}[y]\|^2}{\|[y]\|^2} = \frac{\|[y]\|^2 - \|[y] - \widetilde{P}[y]\|^2}{\|[y]\|^2} \overset{\text{(Step 3)}}{=} 1 - \frac{\|y - Py\|^2}{\|[y]\|^2} = 1 - \frac{SSE}{S_{yy}} = \frac{S_{yy} - SSE}{S_{yy}} = R^2. \]
This proves the theorem, as the angle $\varphi$ is always acute.

Remark 1. If the cloud of points fits exactly into the presumed linear model, then $[y] \in [M]$ and hence the angle between $[y]$ and $[M]$ equals zero. For the multiple correlation coefficient one then observes $R = 1$. An outcome $R = 0$, in contrast to the above, stands for $[y] \perp [M]$. This is an indication that the cloud of points is not at all in tune with the linear model.

Remark 2. If $m \geq 2$ (that is: if $\dim([M]) \geq 2$) the angle between $[y]$ and $[M]$ is always acute. For this reason $R$ is always taken positive.

In §IV.4 it was explained that Pearson's product-moment correlation coefficient is invariant under positive scale transformations. It turns out, as will be seen below, that the multiple correlation coefficient also has this kind of invariance property.

Definition VIII.10.2. By a multiple scale transformation of a given cloud of points $(x_1, y_1), \ldots, (x_n, y_n)$ in $\mathbb{R}^{m+1}$, one means a transition to a new cloud of points $(\widetilde{x}_1, \widetilde{y}_1), \ldots, (\widetilde{x}_n, \widetilde{y}_n)$ that is related to the old one as
\[ \begin{pmatrix} \widetilde{x}_{1j} \\ \vdots \\ \widetilde{x}_{mj} \end{pmatrix} = A \begin{pmatrix} x_{1j} \\ \vdots \\ x_{mj} \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \qquad \text{and} \qquad \widetilde{y}_j = p\, y_j + q. \]
Here $A$ is an invertible $m \times m$ matrix and $p \neq 0$. One talks about a positive scale transformation if $A$ is of positive type and $p > 0$.
Theorem VIII.10.3. The multiple correlation coefficient is invariant under multiple scale transformations.

Proof. Let $A$ be any invertible $m \times m$ matrix. In order to simplify typography, set $[A]_{ij} = a_{ij}$.
In the notations of Definition VIII.9.2 one has for all $i = 1, \ldots, m$
\[ \widetilde{x}_{ij} = \Big( \sum_{k=1}^{m} a_{ik} x_{kj} \Big) + b_i. \]
It follows from this that
\[ \widetilde{x}_i = \Big( \sum_{k=1}^{m} a_{ik} x_k \Big) + b_i e. \]
However, this implies that
\[ [\widetilde{x}_i] = \sum_{k=1}^{m} a_{ik} [x_k]. \]
Denoting the linear span of $e, \widetilde{x}_1, \ldots, \widetilde{x}_m$ by $\widetilde{M}$ one has, because the matrix $A$ is invertible, that
\[ [\widetilde{M}] = [M]. \]
Furthermore
\[ [\widetilde{y}] = p\, [y]. \]
For the multiple correlation coefficient $\widetilde{R}$ of the transformed cloud of points one now has
\begin{align*}
\widetilde{R} &= \text{cosine of the angle between } [\widetilde{y}] \text{ and } [\widetilde{M}] \\
&= \text{cosine of the angle between } p\,[y] \text{ and } [M] \\
&= \text{cosine of the angle between } [y] \text{ and } [M] = R.
\end{align*}
This proves the theorem.

As said before, the value of $R$ shows to what extent the presumed linear model is compatible with the measurements in the experiment. It is therefore important to have an objective criterion for what is meant by saying that $R$ is close to 0 or close to 1. An outcome of $R$ close to 0 implies that
\[ 1 - \frac{SSE}{S_{yy}} \approx 0. \]
In turn this implies that $S_{yy} \approx SSE$. One is then in a situation where the sum of squares of errors does not increase significantly when passing to the reduced model $y = \alpha$ (see Proposition VIII.9.10). Hence, if the linear model $y = \alpha + \beta_1 x_1 + \cdots + \beta_m x_m$ is not at all compatible with the measurements in the experiment, then one could just as well take the reduced model $y = \alpha$ as the starting model.
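The following small numerical experiment (not part of the book; it assumes Python with numpy and uses simulated data) computes $R$ as in Definition VIII.10.1 and illustrates the invariance statement of Theorem VIII.10.3: after an invertible affine transformation of the regressors and a scale transformation of $y$, the outcome of $R$ is unchanged.

```python
import numpy as np

def multiple_correlation(x, y):
    """R computed as in Definition VIII.10.1: R^2 = 1 - SSE / S_yy."""
    M = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    sse = np.sum((y - M @ coef) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    return np.sqrt(1.0 - sse / syy)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)

A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])  # invertible matrix
b = np.array([5.0, -1.0, 2.0])
x_new = x @ A.T + b        # transformed regressors
y_new = 4.0 * y + 7.0      # p = 4, q = 7

print(multiple_correlation(x, y), multiple_correlation(x_new, y_new))  # equal up to rounding
```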
Theorem VIII.10.4. Starting from the axioms of normal regression analysis, assume that the reduced model $y = \alpha$ is valid (that is: assume that $\beta_1 = \cdots = \beta_m = 0$). Then the test statistic
\[ \frac{R^2 / m}{(1 - R^2)/(n - m - 1)} \]
is a priori $F$-distributed with $m$ and $n - m - 1$ degrees of freedom in the numerator and denominator respectively.

Proof. Apply Theorem VIII.9.12 with $k = 0$. It then follows that the quantity
\[ \frac{(S_{YY} - SSE)/m}{SSE/(n - m - 1)} \]
has an $F$-distribution with $m$ and $n - m - 1$ degrees of freedom in the numerator and denominator respectively. However, it is readily verified that
\[ \frac{(S_{YY} - SSE)/m}{SSE/(n - m - 1)} = \frac{R^2 / m}{(1 - R^2)/(n - m - 1)}. \]
This proves the theorem.

The next example shows how the previous theorem can be applied.

Example 1. In a certain experiment one is carrying out 14 measurements. One wonders whether the resulting cloud of points fits into the linear model
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3. \]
The axioms of normal regression analysis are presumed to be valid. After the necessary computations an outcome $R = 0.82$ emerges. Basing oneself on this outcome, there is a wish to test the hypothesis
\[ H_0: \text{"the cloud of points does not at all fit in the linear model", that is: } \beta_1 = \beta_2 = \beta_3 = 0, \]
against
\[ H_1: \text{"to some extent the cloud of points fits in the linear model", that is: } \beta_i \neq 0 \text{ for at least one value of } i. \]
By Theorem VIII.10.4 the statistic
\[ \frac{R^2/3}{(1 - R^2)/10} \]
is (under $H_0$) a priori $F$-distributed with 3 and 10 degrees of freedom in the numerator and denominator respectively. This makes it possible to determine the P-value belonging to the outcome $R = 0.82$. Namely, in Table V it can be read off that
\[ \text{P-value} = P(R \geq 0.82) = P\Big( \frac{R^2/3}{(1 - R^2)/10} \geq 6.84 \Big) < 0.05. \]
The hypothesis $H_0$ is rejected.

The defining equality for $R$ was given by
\[ R^2 = 1 - \frac{SSE}{S_{yy}}. \]
It was already noticed that $S_{yy}$ represents the sum of squares of errors belonging to the reduced model $y = \alpha$. Less extreme reductions, for example to the model
\[ y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k, \]
bring about sums of squares of errors $SSE_R$ (in the notation of the foregoing section). It is now quite natural to consider, aside from and analogous to $R$, quantities $R_R$ defined via
\[ R_R^2 = 1 - \frac{SSE}{SSE_R}. \]
The scalar $R_R$ is called a partial correlation coefficient. The statistic
\[ T = \frac{(SSE_R - SSE)/(m - k)}{SSE/(n - m - 1)}, \]
which was used in the previous section in order to test model reduction, can be expressed in terms of the partial correlation coefficient $R_R$. Namely,
\[ T = \frac{R_R^2/(m - k)}{(1 - R_R^2)/(n - m - 1)}. \]
So in decision procedures of this kind (concerning model reduction) the outcome of $R_R$ is decisive. By successive model reductions in the linear model $y = \alpha + \beta_1 x_1 + \cdots + \beta_m x_m$ one can, among the variables $x_1, \ldots, x_m$, remove those that do not contribute significantly to the minimization of the sum of squares of errors. See, in connection to this, also Exercise 35.
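Before turning to the exercises, here is a minimal sketch (not part of the book; it assumes Python with scipy) of the two F tests discussed above: the overall test of Theorem VIII.10.4 with the outcome $R = 0.82$ of Example 1, and the model reduction statistic $T$ expressed through $R_R$. The value $R_R = 0.55$ and the choice $k = 1$ are made up purely for illustration.

```python
from scipy.stats import f

n, m, k = 14, 3, 1           # full model with m regressors; the reduced model keeps k of them
r_squared = 0.82 ** 2        # outcome of R from Example 1
rr_squared = 0.55 ** 2       # hypothetical outcome of the partial correlation coefficient R_R

# Theorem VIII.10.4: under beta_1 = ... = beta_m = 0 the statistic below is F(m, n-m-1)-distributed.
f_overall = (r_squared / m) / ((1.0 - r_squared) / (n - m - 1))
p_overall = f.sf(f_overall, m, n - m - 1)      # P-value of the overall test

# Model reduction: T = (R_R^2/(m-k)) / ((1 - R_R^2)/(n-m-1)) is F(m-k, n-m-1)-distributed under H0.
t_stat = (rr_squared / (m - k)) / ((1.0 - rr_squared) / (n - m - 1))
p_reduction = f.sf(t_stat, m - k, n - m - 1)

print(f_overall, p_overall, t_stat, p_reduction)
```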
VIII.11 Exercises
Detailed solutions to all exercises can be found in [2].

1. Suppose $v_1, \ldots, v_m$ is an independent set of vectors in $\mathbb{R}^p$, with linear span $M$. Let $T: \mathbb{R}^m \to \mathbb{R}^p$ be the linear operator defined by
\[ T e_i = v_i \qquad (i = 1, \ldots, m). \]
(i) Prove that the linear operator $T^* T$ is invertible.
(ii) Show that $\operatorname{Im} T = M$.
(iii) Show that the orthogonal projection $P_M$ is given by $P_M = T (T^* T)^{-1} T^*$.
(iv) Prove that, if the $v_1, \ldots, v_m$ form an orthonormal system, one has $P_M = v_1 \otimes v_1 + \cdots + v_m \otimes v_m$.

2. Verify that for all $x, y, z \in \mathbb{R}^p$ one has
\[ (\alpha x + \beta y) \otimes z = \alpha (x \otimes z) + \beta (y \otimes z) \qquad \text{and} \qquad z \otimes (\alpha x + \beta y) = \alpha (z \otimes x) + \beta (z \otimes y). \]

3. Prove that for each pair $S, T \in L(\mathbb{R}^p)$ one has
\[ \operatorname{tr}(T^* S) = \sum_{j=1}^{p} \langle S e_j, T e_j \rangle = \sum_{i,j=1}^{p} [S]_{ij} [T]_{ij}. \]
Moreover, show that by setting $\langle S, T \rangle := \operatorname{tr}(T^* S)$ an inner product is defined on $L(\mathbb{R}^p)$.

4. Show that for arbitrary $x, y, a, b \in \mathbb{R}^p$ and $T \in L(\mathbb{R}^p)$ one has:
(i) $\langle T, x \otimes y \rangle = \langle T y, x \rangle$,
(ii) $\langle T, e_i \otimes e_j \rangle = [T]_{ij}$,
(iii) $\langle x \otimes y, a \otimes b \rangle = \langle x, a \rangle \langle y, b \rangle$,
(iv) $\| x \otimes y \| = \| x \|\, \| y \|$,
(v) $(x \otimes y)^* = y \otimes x$.
5. Prove that the $p^2$ operators $e_i \otimes e_j$ form an orthonormal basis of $L(\mathbb{R}^p)$, equipped with the Hilbert–Schmidt inner product.

6. Prove that the map $T \mapsto T^*$ presents an orthogonal linear operator on $L(\mathbb{R}^p)$, equipped with the Hilbert–Schmidt inner product. Determine the trace of this operator.

7. This exercise starts from a stochastic $p$-vector $X$, of which the covariance operator $C$ is given by the matrix
\[ \Sigma = [C] = \begin{pmatrix} 66 & 12 \\ 12 & 59 \end{pmatrix}. \]
(i) Determine $\operatorname{var}(X)$.
(ii) Express the principal components of $X$ in terms of $X_1$ and $X_2$.

8. Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues belonging to $C(X)$, where $X$ is a stochastic $p$-vector. Prove that for all $a \in \mathbb{R}^p$ one has
\[ \| a \|^2 \min_{i=1,\ldots,p} \lambda_i \;\leq\; \operatorname{var}(\langle X, a \rangle) \;\leq\; \| a \|^2 \max_{i=1,\ldots,p} \lambda_i. \]
9. Let $X$ be a stochastic $p$-vector having $C$ as its covariance operator. Suppose that the eigenvalues of $C$ are given by $\lambda_1 > \lambda_2 > \cdots > \lambda_p$ and that $\{v_1, \ldots, v_p\}$ is a corresponding orthonormal basis of eigenvectors.
(i) Prove that, under the constraint $\|a\| \leq 1$, the variable $\langle X, a \rangle$ is of maximum variance for $a = \pm v_1$.
(ii) Under the constraints $\|a\| \leq 1$ and $\langle X, a \rangle$ uncorrelated to $\langle X, v_1 \rangle$, the variable $\langle X, a \rangle$ is of maximum variance for $a = \pm v_2$. Prove this.
(iii) Characterize the principal components of $X$.

10. Prove Proposition VIII.2.10.
11. A vectorial sample $X_1, X_2, X_3, X_4$ shows the following outcome:
\[ X_1 = (2, 4, 1), \quad X_2 = (3, 3, 2), \quad X_3 = (3, 4, 1), \quad X_4 = (4, 1, 2). \]
Determine the matrix of $S^2$.
12. A vectorial sample of size $n$ consists of the $p$-vectors $X_1, \ldots, X_n$.
(i) Prove that $\dim(\operatorname{Ker} S^2) \geq \max(p - n,\, 0)$.
(ii) Is it possible for S2 to be invertible if n < p? 13. Prove Lemma VIII.6.1. 14. If the X1 ; : : : ; Xn are a vectorial sample from a Gaussian distributed population, then X and .X1 X; X2 X; : : : ; Xn X/ are statistically independent. Prove this statement. 15. Let X1 and X2 be statistically independent stochastic p-vectors, of which X1 is N.µ1 ; C1 /-distributed and X2 is N.µ2 ; C2 /-distributed. Prove that X1 C X2 is N.µ1 C µ2 ; C1 ˚ C2 /-distributed.
16. Suppose X is a normally distributed p-vector. Let T W Rp ! Rq be a linear operator. Is T X necessarily normally distributed? 17. Suppose the 2-vector .X; Y / is normally distributed with expectation vector .X ; Y / and covariance operator C. Then the matrix ŒC of C is given by X2 X Y ŒC D : X Y Y2 Prove that, given Y D y , the variable X has an N.; 2 /-distribution with D X C XY .y Y / and 2 D X2 .1 2 /. 18. Describe in detail the confidence regions G in Examples 2 and 3 of §VIII.6. 19. A sample X1 ; : : : ; X16 is drawn from a N.µ; C/-distributed population of which µ is unknown, whereas C is known: 0 1 14 4 4 1A : ŒC D @ 4 11 4 1 11 The outcome of X is x D .2:1; 3:8; 1:7/. Construct a 95% confidence region for µ .
20. Let X be a N.µI C/-distributed stochastic 3-vector, for which 0 1 2 3 1 µ D .3; 1; 2/ and ŒC D @3 6 5 A : 1 5 10
Apply the method of dimension reduction to X, where the loss of total variance is not allowed to exceed 5%.
21. Let there be given a cloud of points .x1 ; y1 /; : : : ; .xn ; yn / in R2 , presenting the outcome of a 2-dimensional sample from a population of which both the expectation vector and covariance operator exist. It is assumed that both the covariance operator and the outcome of S2 are invertible. Suppose that v D .vx ; vy / is an eigenvector of S2 corresponding to its largest eigenvalue. A confidence region for µ , as constructed in the way described in §VIII.6, has the form of an ellipsoid of which the large axis points into the direction of v. Does the straight line, given by ¹.x; y/ C t .vx ; vy / W t 2 Rº necessarily coincide with the “least squares fit”, as described in Chapter IV?
22. Suppose X is as in Exercise 20; let M be the linear span of the vector .1; 1; 1/. Determine P .kPM Xk 1/. 23. Prove Proposition VIII.7.1. 24. Prove Proposition VIII.7.2. 25. Give an example of two sequences X1 ; X2 ; : : : and Y1 ; Y2 ; : : : of variables that converge in distribution to variables X and Y , but for which it is not true that Xn C Yn ! X C Y in distribution. Do the same with respect to multiplication. 26. If Xn ! X and Yn ! a in distribution (where a is a constant) then also Yn Xn ! aX in distribution. Prove this statement. 27. Prove that, starting from a population with distribution function F , one has p n ¹Sn2 F2 º ! N.0; 2 / in distribution, where F2 is the variance of the population.
28. Using the technique of partial differentiation, find the maximum likelihood estimators for $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$ and $\rho$ when sampling from a population that has a 2-dimensional normal distribution.

29. Prove Lemma VIII.8.5.

30. A sample of size 25 from a 2-dimensional normally distributed population provides $R_P = 0.16$. Test
\[ H_0: \rho = 0 \qquad \text{versus} \qquad H_1: \rho \neq 0 \]
at a level of significance of 5%.

31. A sample of size 50 from a 2-dimensional normally distributed population provides $R_P = 0.70$. Test
\[ H_0: \rho = 0.90 \qquad \text{versus} \qquad H_1: \rho < 0.90 \]
at a level of significance of 5%.
32. Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a 2-dimensional normally distributed population with $\rho = 0$. For any sequence $x_1, \ldots, x_n \in \mathbb{R}$ that is non-constant, simply by setting $\beta := 0$ and $\alpha := \mu_Y$ one could say that for all $i$ the variable $Y_i$ is $N(\alpha + \beta x_i, \sigma_Y^2)$-distributed. In that way the sequence $Y_1, \ldots, Y_n$ satisfies the conditions of normal analysis of regression as treated in §IV.3.
(i) Prove that
\[ \frac{\sqrt{n-2}\,(R_P)_x}{\sqrt{1 - (R_P)_x^2}} = \frac{(\widehat{\beta} - \beta)\sqrt{(n-2)\, S_{xx}}}{\sqrt{SSE}}. \]
(ii) Deduce Proposition VIII.8.4 from Proposition IV.3.3. 33. This exercise starts from a 2-dimensional normally distributed population, for which D 0. A sample .X1 ; Y1 /; : : : ; .Xn ; Yn / is drawn from this population. Now the outcome of RP2 is surely an element of the interval Œ0; 1. Is RP2 perhaps beta distributed? 34. Prove that in the case of normal multiple regression analysis the .m C 1/-vector .b ˛; b β / and the variable SSE are statistically independent.
35. (i) Prove that in the case of normal regression analysis the statistic
\[ \frac{(\widehat{\beta}_i - \beta_i)\sqrt{n - m - 1}}{\sqrt{\operatorname{var}(\widehat{\beta}_i)}\;\sqrt{SSE/\sigma^2}} \]
has a $t$-distribution with $n - m - 1$ degrees of freedom.
(ii) Prove that
\[ \operatorname{var}(\widehat{\beta}_i) = \big( (i+1)\text{th diagonal element of } (\mathrm{M}^t \mathrm{M})^{-1} \big) \cdot \sigma^2. \]
(iii) Indicate how the above can be used to perform hypothesis tests on $\beta_i$ and to construct confidence intervals for $\beta_i$.

36. Prove Proposition VIII.9.10.

37. In the case of multiple normal regression, determine the maximum likelihood estimators of $\alpha$, $\beta$ and $\sigma^2$.

38. Indicate in which way the theory of multiple linear regression can be used in the case of a model
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 e^{x_2}. \]

39. Prove Lemma VIII.4.11.
Appendix A
Lebesgue’s convergence theorems
Suppose a sequence of functions $f_n: \mathbb{R}^m \to \mathbb{R}$ is given, where $n \in \mathbb{N}$. One says that $f_n$ converges pointwise if $\lim_{n\to\infty} f_n(x)$ exists for all possible $x$. If this is the case, then the function $f$ defined by
\[ f(x) := \lim_{n\to\infty} f_n(x) \]
is called the limit function. It is easy to see that the limit function of a sequence of continuous functions is not necessarily continuous: Example 1. Define fn W R ! R by 8 n ˆ 0. Here the reader is cautioned, for Radon measures are in literature usually defined on more general topological spaces and in a way which does not even look like the definition above. As seen through the glasses of a Radon measure, bounded Borel sets are always of a finite size. An example of a Radon measure on Rn is the Lebesgue measure. Moreover, every probability distribution PX belonging to some stochastic n-vector X is a Radon measure on Rn . An example of a Borel measure that is not a Radon measure is the so-called counting measure, which is defined as follows: ´ number of elements in A if A contains a finite number of elements, .A/ D C1 if A contains infinitely many elements. Theorem 3. To every (non-trivial) pair of Radon measures and on Rp and Rq respectively, there exists a unique Radon measure ˝ on RpCq such that . ˝ /.A B/ D .A/ .B/ for all Borel sets A and B in Rp and Rq . If three Radon measures ; and on Rp , Rq and Rr are given, then, in a canonical way, the next two Radon measures on RpCqCr arise: . ˝ / ˝
and ˝ . ˝ /:
Unicity in Theorem 3 guarantees that these two measures are identical. For this reason they will simply be written as ˝ ˝ :
Of course this can be generalized to cases where there are more than three Radon measures in sight.

Example. Denote the Lebesgue measure on $\mathbb{R}^n$ by $\lambda_n$. In this notation one may write $\lambda_{p+q} = \lambda_p \otimes \lambda_q$. Consequently
\[ \lambda_n = \lambda_1 \otimes \cdots \otimes \lambda_1. \]
This can be verified by applying Theorem I.1.1.
The following theorem is frequently useful when it is to be proved that two measures are equal.

Theorem 4. Suppose $\mu$ and $\nu$ are Radon measures on $\mathbb{R}^n$. If $\mu(A_1 \times \cdots \times A_n) = \nu(A_1 \times \cdots \times A_n)$ for all Borel sets $A_1, \ldots, A_n$ in $\mathbb{R}$, then $\mu = \nu$.

Proofs of Theorems 1, 2, 3 or 4 can be found in any elementary textbook on measure theory, for example in [11], [28], [64].
Appendix C
Conditional probabilities
Introduction.
Suppose one is throwing a die twice. Define the event $A$ as
\[ A \;\longleftrightarrow\; \text{"both outcomes are different"}, \]
and the event $B$ as
\[ B \;\longleftrightarrow\; \text{"the sum of both outcomes is 10 or more"}. \]
This experiment is repeated indefinitely and the following question is posed: at what rate does $A$ occur among the cases where $B$ occurs? The following notation is introduced to get more insight into this: after repeating the experiment $N$ times, the frequencies of occurrence of $A$, $B$ and $A \cap B$ are denoted by $N(A)$, $N(B)$ and $N(A \cap B)$ respectively. Intuitively*) one could write
\[ P(A) = \lim_{N\to\infty} \frac{N(A)}{N}, \qquad P(B) = \lim_{N\to\infty} \frac{N(B)}{N}, \qquad P(A \cap B) = \lim_{N\to\infty} \frac{N(A \cap B)}{N}. \]
The fraction in which one is interested is given by
\[ \lim_{N\to\infty} \frac{N(A \cap B)}{N(B)} = \lim_{N\to\infty} \frac{N(A \cap B)/N}{N(B)/N} = \frac{P(A \cap B)}{P(B)}. \]
In the introductory case this expression can easily be evaluated. One could choose here the following underlying probability space:
\[ \Omega := \{ (i, j) : i, j = 1, 2, \ldots, 6 \}, \]
where for all subsets $C \subseteq \Omega$ one defines
\[ P(C) = \frac{\#(C)}{36}. \]
In this case one obviously has
\[ \#(A \cap B) = 4 \qquad \text{and} \qquad \#(B) = 6. \]
*) Many years ago the French mathematician P. S. Laplace (1749–1827) defined the concept of probability in this way. Such a definition can be "justified" by the strong law of large numbers.
So one has
\[ \frac{P(A \cap B)}{P(B)} = \frac{4/36}{6/36} = \frac{2}{3}. \]
The percentage in which one is interested is roughly 67%.

The introductory considerations above are meant as a prelude to the next definition:

Definition. Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $A, B \in \mathcal{A}$. The probability that $A$, given $B$, occurs is understood to be the fraction
\[ P(A \mid B) := \frac{P(A \cap B)}{P(B)}, \]
provided $P(B) \neq 0$.

It is now immediate that for any pair of events $A_1, A_2$ one has
\[ P(A_1 \cap A_2) = P(A_1)\, P(A_2 \mid A_1), \]
provided $P(A_1) \neq 0$. This can be generalized as follows:

Theorem 1. For $n$ events $A_1, \ldots, A_n$ one has
\[ P(A_1 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}), \]
provided $P(A_1 \cap \cdots \cap A_{n-1}) \neq 0$.
5 1 D ; 10 2
4 P .A2 j A1 / D ; 9
3 P .A3 j A1 \ A2 / D : 8
538
Appendix C Conditional probabilities
Consequently P .A1 \ A2 \ A3 / D
1 4 3 1 D : 2 9 8 12
Another important elementary theorem about conditional probabilities is the so-called law of the total probability: Theorem 2. Let .; A; P / be a probability space. Suppose that the events B1 ; : : : ; Bn in A constitute a partition of . If P .Bi / ¤ 0 for all i , then for every A 2 A one has n X P .A j Bi / P .Bi /: P .A/ D iD1
Proof. Under the given conditions on B1 ; : : : ; Bn one has AD
n [
.A \ Bi /:
iD1
The sets A \ B1 ; : : : ; A \ Bn are mutually disjoint, hence P .A/ D
n X iD1
P .A \ Bi / D
n X iD1
P .A j Bi / P .Bi /:
Example 2. Two urns contain white, black and red balls. Urn I contains 3 white, 4 black and 5 red balls. Urn II contains 5 white, 2 black and 3 red balls. In a random way John draws a ball from urn I and puts it into urn II. Then Peter randomly draws a ball from urn II. What is the probability that Peter will draw a red ball? Solution. The events “John is transferring a white, a black, a red ball” will be denoted respectively by B1 ; B2 ; B3 . The events B1 ; B2 ; B3 are mutually excluding, which is the same as saying that B1 ; B2 ; B3 are mutually disjoint. Moreover, it is sure that one of the events B1 ; B2 ; B3 will occur. Therefore these events constitute a partition of the space of all possible outcomes . Denoting the event “Peter draws a red ball” by A, one has P .A/ D P .A j B1 / P .B1 / C P .A j B2 / P .B2 / C P .A j B3 / P .B3 /: Evidently, here P .B1 / D
3 ; 12
P .B2 / D
4 12
and
P .B3 / D
5 : 12
539
Appendix
Moreover P .A j B1 / D
3 ; 11
P .A j B2 / D
3 ; 11
P .A j B3 / D
4 : 11
Hence P .A/ D
3 3 3 4 4 5 41 C C D 0:31: 11 12 11 12 11 12 132
Now Bayes’ formula is presented: Theorem 3. Let .; A; P / be a probability space and let B1 ; : : : ; Bn 2 A be a partition of . Suppose that P .Bi / ¤ 0 for all i . Then for every A 2 A one has P .Bj j A/ D
P .A j Bj / P .Bj / P .A j Bj / P .Bj / D ; P .A j B/ P .B1 / C C P .A j Bn / P .Bn / P .A/
provided P .A/ ¤ 0. Proof. Left to the reader as an exercise. Example 3. If Peter (in Example 2) has drawn a red ball, then what is the probability that previously John transferred a ball that was also red? Solution. Here there is a wish to determine the numerical value of the conditional probability of the event B3 , given A. By Theorem 3 one may write P .B3 j A/ D
P .A j B3 /P .B3 / D P .A/
4 11
41 132
5 12
D
20 0:49: 41
More about conditional probability can be found (for example) in [6], [18], [21], [26].
Appendix D
The characteristic function of the Cauchy distribution
This appendix is devoted to the following theorem in mathematical analysis. The imaginary unit will systematically be denoted by the symbol i . Theorem. For all ˛ 2 R and all ˇ > 0 one has Z
C1 1
.x
e itx dx D e i ˛t e 2 2 ˛/ C ˇ ˇ
ˇ jtj
.t 2 R/:
Proof. (Split up into three steps, applying contour integration.) Step 1. (The case ˛ D 0, ˇ D 1 and t > 0.)
For every n 2 N the contour Cn in the complex plane is defined by Figure 28. ni
Cn
n Figure 28
By the residue theorem in the theory of complex functions one has (see [1], [7], [42], [64]): Z
Cn
e itz dz D 2 i .sum of the residues in the interior region of Cn /: z2 C 1
The function f W z 7!
e itz z2 C 1
541
Appendix
has in the interior region of Cn exactly one simple pole, namely the pole in the point z D i . To see this, decompose f as follows: e i tz D C1 .z
z2
e itz : i /.z C i /
The residue in the point z D i is given by Res zDi
e i tz z2 C 1
D lim .z z!i
D lim
z!i
z!i
e i tz zCi
t
D
e : 2i
Thus one obtains Z Cn
itz e i/ 2 z C1
i/f .z/ D lim .z
i tz e i tz e dz D 2 i Res D e 2 2 zDi z C 1 z C1
t
for all n > 2:
The integral along the contour Cn can be split up in the following way: e
t
D
Z
Cn
e i tz dz D z2 C 1
Z
Cn n
e itx dx C x2 C 1
Z
0
i
e itne ni e i d: n2 e 2i C 1
./
The second integral on the right hand side converges to zero if n ! 1. To see this, note that jn2 e 2i C 1j n2 1. Therefore ˇ i tnei ˇ ˇ e ˇ ne nt sin n i ˇ ˇ ˇ n2 e 2i C 1 nie ˇ n2 1 n2 1 :
Hence ˇZ ˇ ˇ ˇ
0
ˇ Z ˇ itnei ˇ i ˇ ˇ e ˇ e i tne i i ˇ ˇ ˇ nie d ni e ˇ ˇ ˇ d n2 e 2i C 1 n2 e 2i C 1 0 Z n 2 n d D 2 : 2 1 n 1 0 n
Consequently lim
Z
n!1 0
i
e i tne ni e i d D 0: 2 2i n e C1
Letting n ! 1 in ./, one arrives at Z
C1 1
e i tx dx D e t : x2 C 1
542
Appendix D The characteristic function of the Cauchy distribution
n
Cn
ni Figure 29
Step 2. (The case ˛ D 0, ˇ D 1 and t < 0.) Repeat the arguments in Step 1, using the contour shown in Figure 29. This leads to Z C1 i tx e dx D e t : 2 1 x C1 Step 3. (The general case.) From Step 1 and Step 2 it follows that Z C1 i tx e dx D e jtj : 2 1 x C1 Now in the integral Z C1 e itx dx: ˛/2 C ˇ 2 1 .x First carry out the substitution x D v C ˛. This gives Z C1 Z C1 i t.vC˛/ Z e itx e it˛ dx D dv D e 2 2 ˛/2 C ˇ 2 1 .x 1 v Cˇ
C1 1
e itv dv: v2 C ˇ2
Carrying out the substitution v D ˇu in the last integral one obtains Z C1 Z e i tx 1 i t˛ C1 e i tˇ u dx D e du 2 ˛/2 C ˇ 2 ˇ 1 .x 1 u C1 D e it˛ e jtˇ j D e i˛t e ˇ jtj : ˇ ˇ This proves the theorem.
In an alternative formulation the theorem above states that the characteristic function belonging to a Cauchy distribution with parameters ˛ and ˇ is given by W t 7! e i˛t e
ˇ jtj
This result will be used in some of the exercises.
:
Appendix E
Metric spaces, equicontinuity
In mathematics, including statistics and probability theory, there is often a wish to express the discrepancy between two (possibly very abstract) object in terms of a socalled “distance”. For example, the Euclidean distance d.x; y/ between two points x D .x1 ; : : : ; xn / and y D .y1 ; : : : ; yn / in Rn is defined by d.x; y/ WD
X n .xi
yi /2
iD1
12
:
Any collection of objects where, in a generalized sense, on can talk about the distance between two arbitrarily chosen objects is called a “metric space”. More precisely: Definition 1. Let S be an arbitrary set and d W S S ! Œ0; C1/ a function enjoying the following three properties: (i) d.x; y/ D 0 if and only if x D y .x; y 2 S /,
(ii) d.x; y/ D d.y; x/ .x; y 2 S/,
(iii) d.x; y/ d.x; z/ C d.z; y/ .x; y; z 2 S/. Then d is called a metric on S (or a distance function). The pair .S; d / is then called a metric space. Remark 1. The idea to axiomatize the concept of a metric space in this way was originated by the French mathematician Maurice R. Fréchet (1878–1973). He has contributed significantly to the development of the theory of metric spaces, nowadays generally accepted. The theory plays a fundamental role in modern mathematics. Remark 2. The property that d.x; y/ d.x; z/ C d.z; y/ is called the triangle inequality. Remark 3. It is easy to see that .Rn ; d /, where d is the Euclidean distance, is a metric space in the sense of Definition 1. This is left as an exercise to the reader. The concept of a general metric space is rather abstract. For this reason it is frequently advisable to think of .R2 ; d / with its accompanying natural visualizations.
544
Appendix E Metric spaces, equicontinuity
Remark 4. If .S; d / is a metric space and S0 S , then in a natural way a metric d0 can be defined by setting d0 .x; y/ WD d.x; y/ for all x; y 2 S0 : Now .S0 ; d0 / is a metric space too. In this way a subset (in this context often called a subspace) of a metric space is always a metric space in itself. In metric spaces one can, in a very natural way, introduce the concept of convergence of a sequence of elements, namely: Definition 2. A sequence of elements x1 ; x2 ; : : : in a metric space .S; d / is said to be convergent if there exists an element x in S such that lim d.xk ; x/ D 0:
k!1
In this context the element x is called a limit of the sequence. A sequence that is not convergent is called divergent. Remark. If x1 ; x2 ; : : : has x as a limit then this will be indicated by writing xk ! x. Proposition 1. A sequence x1 ; x2 ; : : : cannot possibly have more than one limit. Proof. Left as an exercise, or see [45], [63]. In an arbitrary metric space one can, as in Rn , talk about closed and about open subsets: Definition 3. A subset F of a metric space .S; d / is said to be closed if limits of convergent sequences of elements in F are automatically elements of F themselves. In other words, F is a closed subset if the following implication is valid: µ x1 ; x2 ; : : : 2 F H) x 2 F: xk ! x 2 S A set O S is called open if O c is closed. Remark. From this definition it follows immediately that S is a closed set in .S; d /. Applying formal logic it also follows from the definition above that the empty set should be looked upon as a closed subset of .S; d /. In turn it follows that S and the empty set are open sets. By a so-called ball, as seen through the glasses of a metric d , a set of the following form is meant:
545
Appendix
Definition 4. Let .S; d / be a metric space, a 2 S and R > 0. Sets of the form B.a; R/ WD ¹x 2 S W d.a; x/ < Rº; B.a; R/ WD ¹x 2 S W d.a; x/ Rº
are called balls with center a and radius R.
Example 1. Define on R3 the following metric: d.x; y/ WD max jxi iD1;2;3
yi j:
Check that indeed this defines a metric on R3 . In this metric a ball B.a; R/ is a cube with center a. In the Euclidean metric on R3 balls are of the form one is used to. Proposition 2. In a metric space a ball of type B.a; R/ is an open and a ball of type B.a; R/ a closed set. Proof. Left as an exercise, or see [45], [63]. Proposition 3. A set O in a metric space is open if and only if to every element x 2 O there is at least one ball B.x; R/ such that B.x; R/ O . Proposition 4. (i) The intersection of a finite number of open sets is open. (ii) The union of an arbitrary collection of open sets is open. Proof. Left as an exercise, or see [45], [63]. By applying the laws of De Morgan the following proposition can be derived from the previous one: Proposition 5. (i) The union of a finite number of closed sets is closed. (ii) The intersection of an arbitrary collection of closed sets is closed. The collection of all open sets in a metric space .S; d / is called the topology on S . It can very well happen that two different metrics on S generate the same topology. In this, there is the following criterion: Proposition 6. For any pair of metrics d1 and d2 the following two statements are equivalent: (i) d1 and d2 generate the same topology. (ii) An arbitrary sequence x1 ; x2 ; : : : is d1 -convergent if and only if it is d2 -convergent. Proof. Left as an exercise.
546
Appendix E Metric spaces, equicontinuity
Example 2. Define on R the following two metrics d1 and d2 : d1 .x; y/ WD jx
yj and
d2 .x; y/ WD arctan jx
yj :
Now xk ! x in d1 if and only if xk ! x in d2 . Hence d1 and d2 generate the same topology. Later on it will be seen that, in another sense, the metrics d1 and d2 are very much different. Just as in Rn , in a metric space .S; d / the collection of Borel sets is defined as the smallest -algebra that contains all open sets. This -algebra will be denoted by B.S; d /, by B.S/, or simply by B. On this -algebra (probability) measures can be defined. In statistics and probability theory they often emerge in the following way: Let .; A; P / be an arbitrary probability space and let X W!S be a stochastic variable. By the latter one means a map that has the property that X
1
.A/ 2 A for all Borel sets A in S:
For all Borel sets A in S it then makes sense to speak about the probability P .X 2 A/ WD P X 1 .A/ :
The associated map
A 7! P .X 2 A/
is a Borel measure on S . Analogous to the case S D R (or S D Rn ) such a Borel measure is denoted by PX and it is called the probability distribution of the variable X . In the case where S D Rn (equipped with the Euclidean metric) Ameasurability of X is, practically spoken, never an obstacle. In general metric spaces this can, however, very well form an obstruction (see §VII.14, Exercise 13). For this reason there is in mathematics often a wish to work in metric spaces that are both separable and complete / . In such spaces it is usually possible to overcome the measurability problems that are encountered. A brief look is taken now at the concept of separability and that of completeness. Definition 5. A subset A is said to be dense in the metric space .S; d / if every element in S is the limit of a sequence of elements in A. Example 3. In the space .R; d /, where d is the standard (Euclidean) metric, Q is a dense subset: every real number is the limit of a sequence of rational numbers. /
Such metric spaces are called Polish spaces.
547
Appendix
Definition 6. A metric space is said to be separable if it contains a countable dense subset. By Example 3 the space .R; d /, where d is the standard metric, is a separable metric space. Definition 7. A sequence x1 ; x2 ; : : : in a metric space .S; d / is called a Cauchy sequence if to every " > 0 there is a number N" such that d.xp ; xq / < " for all p; q N" : It is easy to see that a convergent sequence is automatically a Cauchy sequence. The converse, however, is not always true: Example 4. Define on the set Q of rational numbers the metric d by d.x; y/ D jx
yj;
x; y 2 Q:
p Choose in .Q; d / a sequence x1 ; x2 ; : : : that converges to 2 in R. Such a sequence is Cauchy in .Q; d /, however it is not convergent in .Q; d /. Definition 8. A metric space .S; d / is called complete if every Cauchy sequence is convergent in .S; d /. Example 5. Define on R the metrics d1 and d2 as in Example 2. A fundamental result in mathematical analysis states that .R; d1 / is a complete metric space. Although d2 generates on R the same topology, the metric space .R; d2 / is not complete. To see why not, define the sequence x1 ; x2 ; : : : by xk WD k. This sequence is Cauchy in .R; d2 / but it is not a convergent sequence. For functions defined on metric spaces the notion of continuity can be defined in the same way as for functions on R or Rn : Definition 9. Suppose .S; dS / and .T; dT / are metric spaces. A function g W S ! T is said to be continuous in x if for all sequences x1 ; x2 ; : : : in S the following implication holds: xk ! x in .S; dS / H) g.xk / ! g.x/ in .T; dT / : Uniform continuity is defined in the following way:
548
Appendix E Metric spaces, equicontinuity
Definition 10. A function g W .S; dS / ! .T; dT / is said to be uniformly continuous if to every " > 0 there exists a ı > 0 such that for all x; y 2 S one has the following implication: dS .x; y/ < ı H) dT g.x/; g.y/ < ": A uniformly continuous function is continuous, but the converse is not always true:
Example 6. Equip R with the standard metric d . Now the function x 7! x 2 from .R; d / to .R; d / is continuous in every point of R, but it is not uniformly continuous. From now on, on arbitrary sets, the identity map will be denoted by I . That is to say, I W S ! S is the function defined by I.x/ WD x. Definition 11. Two metrics d1 and d2 on S are called equivalent if both the maps I W .S; d1 / ! .S; d2 /
are uniformly continuous.
and I W .S; d2 / ! .S; d1 /
Proposition 7. Suppose d1 and d2 are equivalent metrics on S . Then the following statements hold: (i) xk ! x in .S; d1 / ” xk ! x in .S; d2 /. (ii) x1 ; x2 ; : : : Cauchy in .S; d1 / ” x1 ; x2 ; : : : Cauchy in .S; d2 /. (iii) .S; d1 / is complete ” .S; d2 / is complete. Proof. Left as an exercise. Remark. Two equivalent metrics always generate the same topology. The converse of this statement is not true! To see this, look back upon Examples 2 and 5. In these examples d1 and d2 generate the same topology on R. However, .R; d1 / is complete whereas .R; d2 / is not. Hence d1 and d2 cannot possibly be equivalent. Equivalence of metrics can often (but not always) be verified by using the following proposition. Proposition 8. Suppose d1 and d2 are metrics on S . If there exist two strictly positive constants ˛ and ˇ such that for all x; y 2 S one has ˛ d1 .x; y/ d2 .x; y/ ˇ d1 .x; y/; then d1 and d2 are equivalent. Proof. Trivial.
549
Appendix
Remark. The converse of this proposition is not true. It may very well occur that two metrics are equivalent while there are no constants ˛ and ˇ as described in the proposition. Definition 12. A sequence of functions gn W .S; dS / ! .T; dT /, where n D 1; 2; : : :, is said to be equicontinuous in a point x 2 S if to every " > 0 there is a ı > 0 such that dS .x; y/ < ı H) dT gn .x/; gn .y/ < " for all n D 1; 2; : : : :
If the sequence is equicontinuous in every point of S , then one says that it is equicontinuous on S . Remark. In an evident way one can also define uniform equicontinuity. The elements of a (uniformly) equicontinuous sequence of functions are automatically (uniformly) continuous. The converse of this statement is not true. When verifying whether a certain sequence is (or is not) equicontinuous the following two theorems can be very helpful. Theorem 9. Let gn W .S; dS / ! .T; dT /, where n D 1; 2; : : :, be a sequence of functions, all of them continuous in the point x 2 S . Suppose that for every sequence x1 ; x2 ; : : : converging to x in .S; dS / the sequence g1 .x1 /; g2 .x2 /; : : : converges in .T; dT /. Then the sequence g1 ; g2 ; : : : is equicontinuous in x. Proof. Suppose that for every sequence x1 ; x2 ; : : : converging to x the sequence g1 .x1 /; g2 .x2 /; : : : is convergent in .T; dT /. Then, as a special case, the sequence g1 .x/; g2 .x/; : : : is necessarily convergent. Denote this limit by a. Next let x1 ; x2 ; : : : be an arbitrary sequence that converges to x. Of course the sequence gk .xk / converges and it is left to the reader to prove that its limit can only be a. Now suppose, in order to get a contradiction, that the sequence g1 ; g2 ; : : : is not equicontinuous in x. Then there exists an " > 0 with the following property: For every ı > 0 there exists an element y 2 S such that () dS .x; y/ < ı whereas dT gn .x/; gn .y/ "
for at least one value for n. Exploiting this, choose ı1 D 1 and choose x1 and n1 in such a way that dS .x; x1 / < ı1 whereas dT gn1 .x/; gn1 .x1 / ": dS .x; y/ < ı2
H)
dT
1 2
and such that gn .x/; gn .y/ < " for all n D 1; 2; : : : ; n1 :
Next, choose ı2 such that 0 < ı2
0, together with a sequence n1 < n2 < of positive integers, such that dT gnk .xk /; a " for all k D 1; 2; : : : : ./
By ./, however, one has
Now choose k so large that
1 dT gnk .xk /; gnk .x/ : k
dT gnk .xk /; gnk .x/ < "=2 and dT gnk .x/; a < "=2:
552
Appendix E Metric spaces, equicontinuity
Because of the triangle inequality such a k necessarily satisfies dT gnk .xk /; a < ":
This is contradictory to ./. The contradiction was derived from the assumption that E contains an element x; it follows that E is necessarily empty.
Appendix F
The Fourier transform and the existence of stoutly tailed distributions
A very important concept in mathematical analysis is that of the Fourier transform of a function. In this appendix a survey is given of the very elements of Fourier analysis. At the end of the appendix the theory will be applied to prove the existence of stoutly tailed distribution functions. To begin with, the definition of the characteristic function X belonging to a stochastic variable X is recalled: Z X .t / WD E.e i tX / D e itx dPX .x/: Next the particular case where X has a probability density f is considered. That is to say, the case PX D f , where f is a probability density and the Lebesgue measure on R. The characteristic function of X then assumes the form Z C1 X .t / D e i tx f .x/ dx: 1
This expression is, aside from a constant scalar factor, exactly what in mathematical analysis is called the (conjugate) Fourier transform of f . The above will now be set in a more general framework. As a first step the Lp -spaces .p > 0/ are defined: Z ˇ ± ° ˇ j'.x/jp dx < C1 : Lp WD ' W R ! C ˇ ' is a Borel function satisfying
So Lp is a collection of Borel functions satisfying a certain integrability condition. It can be proved that, equipped with the natural scalar multiplication and addition, Lp is a linear space. In this appendix the only Lp -spaces playing a role are L1 and L2 . Theorem 1. If ';
2 L2 , then '
2 L1 .
Proof. Direct consequence of the inequality j'
j
1 j'j2 C j j2 : 2
554
Appendix F The Fourier transform, existence of stoutly tailed distribution functions
Another linear space that will play a role is C0 , defined by ˇ ° ± ˇ C0 WD ' W R ! C ˇ ' continuous and lim '.x/ D lim '.x/ D 0 : x!C1
x! 1
For all ' 2 L1 the function F ' W R ! C is defined by 1 .F '/.t / WD p 2
Z
C1
e
itx
'.x/ dx
1
.x 2 R/:
The function F ' is called the Fourier transform of ' . Analogously, the conjugate Fourier transform F ' W R ! C of ' 2 L1 is defined as 1 .F '/.t / WD p 2
Z
C1
e itx '.x/ dx
1
.x 2 R/:
Furthermore for all ' W R ! C the function e ' W R ! C is defined by e ' .x/ WD '. x/
.x 2 R/:
The following theorem is formulated in these notations: Theorem 2. For all ' 2 L1 one has F .e ' / D F .'/: Proof. Left as an exercise to the reader. The Fourier transform F ' of an element ' 2 L1 is in general not an element in L1 . What one does have, however, is the following: Theorem 3. For all ' 2 L1 both F ' and F ' are elements of C0 . Proof. See [19], [41], [64], [66]. The linear map F W L1 ! C0 is also called the Fourier transform. It should be noted here that there are elements in C0 that can not be written as the Fourier transform of some element in L1 . In other words, the map F W L1 ! C0 is not surjective. Though not being surjective, it is injective in some sense. In order to discuss this, first the following definition: Definition. Two Borel functions ' and are said to be equal almost everywhere if the Borel set ¹x 2 R W '.x/ ¤ .x/º has Lebesgue measure zero. This will also be expressed by writing “' D almost everywhere”.
555
Appendix
Theorem 4. If ' and are equal almost everywhere and if they are both continuous in the point x, then '.x/ D .x/. Proof. Left as an exercise to the reader. Theorem 5. If ' 2 L1 and F ' 2 L1 , then: (i) ' D F .F '/ almost everywhere,
(ii) ' D F .F '/ almost everywhere. Proof. See [19], [41], [64], [66].
From this theorem it follows immediately (look at ' ) that the Fourier transform is injective in the following sense: F' DF H) ' D almost everywhere :
As said before, the Fourier transform F W L1 ! C0 is not surjective. It is, in general, rather difficult to check out whether a function in C0 is, or is not, the Fourier transform of some element in L1 . This is, in fact, the underlying problem in proving the existence of stoutly tailed distribution functions. In order to solve this problem, some more fundamental theorems in Fourier analysis are presented now. Theorem 6. If ' 2 L1 \ L2 , then F ' 2 L2 and F ' 2 L2 . Proof. See [19], [41], [64], [66]. For two Borel functions ' and with §VII.6) .'
one defines the convolution product as (compare
/.y/ WD
Z
'.x/ .y
x/ dx
for all y for which the integral on the right side exists. If '; 2 L1 then the set of all y for which the integral does not exist has Lebesgue measure zero. In such points y the convolution product is defined to be equal to zero. Theorem 7. The convolution product enjoys the following properties: (i) '
D '. e (ii) e ' D ' .
A
(iii) If ' 2 L1 and
2 L1 , then '
Proof. See [19], [41], [64], [66].
2 L1 .
556
Appendix F The Fourier transform, existence of stoutly tailed distribution functions
The Fourier transform p converts convolution into ordinary multiplication of functions (aside from a factor 2 ): 2 L1 , then: p / D 2 F ' F . p / D 2 F ' F .
Theorem 8. If '; (i) F .'
(ii) F .'
Proof. See [19], [41], [64], [66]. Now the attack on the existence problem of stoutly tailed distribution functions can be started: If '; 2 L1 \ L2 then one has (combining Theorems 1, 6 and 8): F .'
/ 2 L1 :
Therefore, by Theorem 5, ' In particular, setting
D F F .'
WD e ' , one obtains
/:
'e ' D F F .' e '/ :
Applying Theorems 2 and 8 one arrives at p ˇ ˇ2 2 ˇF .'/ˇ D F .f /: 'e ' DF
ˇ2 p ˇ Here f WD 2 ˇF .'/ˇ , which implies that f 2 L1 and that f 0. Gauging ' in such a way ' /.0/ D 1, one has for f the additional property that its integral p that .' e equals 2 . Then, of course, p1 f is a probability density. If the distribution 2 function F is defined by Z x 1 F .x/ WD p f .s/ ds 2 1 then one may write 'e ' D F
where F denotes the characteristic function belonging to F . Along these lines and in these notations the following theorem can be proved: Theorem 9. Stoutly tailed distribution functions exist.
557
Appendix
Proof. A function ' 2 L1 \ L2 is chosen as follows: '.x/ D
´
p1 2
log.x/
if x 2 .0; 1; elsewhere:
0
Define a function by setting WD ' e ':
Now, because ' is so, is real-valued. Moreover, one has
A
e D 'e ' De ' ' De ' ' D'e ' D :
Summarizing, the above shows that is symmetric around the origin, that is: .t / D . t / for all t 2 R: The function , being a convolution product, can be represented as an integral: .t / D
Z
1
'.x/ '.x
t / dx:
0
This integral vanishes if jt j > 1. For 0 t 1 it assumes the form .t / D . t / D
1 2
Z
0
1 t
log.x/ log.x C t / dx:
It is easily verified that .0/ D 1. Consequently, by the preparatory considerations, there exists a distribution function F such that D F . The remaining part of the proof consists in verifying that this F is necessarily stoutly tailed, or ² ³n t lim D 0: n!1 n
A brief consideration will convince the reader that it is sufficient to prove the latter for t > 0. To show this, first the derivative of is determined. To this end, in turn, the function h W .0; 1/ .0; 1/ ! R is introduced by setting Z 1 u h.u; v/ WD log.x/ log.x C v/ dx: 2 0 The partial derivatives of h are given by 1 @h .u; v/ D log.u/ log.u C v/ @u 2
and
@h 1 .u; v/ D @v 2
Z
u 0
log.x/ dx: xCv
558
Appendix F The Fourier transform, existence of stoutly tailed distribution functions
Defining the function g W .0; 1/ ! .0; 1/ .0; 1/ by g.t / u.t /; v.t / WD .1
t; t /
one can represent the function , for 0 < t < 1, as
.t / D .h ı g/.t / D h g.t / :
Using the chain rule for functions of several variables, one obtains du.t / @h dv.t / @h g.t / C g.t / @u dt @v dt Z Z 1 1 1 1 1 t log.x/ log.1 t / log 1 C dx D D 2 2 0 xCt 2 0
0 .t / D
t
log.x/ dx: xCt
Applying (in an appropriate way) Lebesgue’s monotone convergence theorem in the last integral, one learns that lim 0 .t / D 1: t#0
Next the mean value theorem will be exploited. By this theorem there is some element .t / in the interval .0; t / such that .t / .0/ D t 0 .t / : Recalling that .0/ D 1 and setting ˛.t / WD 0 .t / , this reads as .t / D 1 C t ˛.t /:
Now fix any t > 0. Then t t lim ˛ D0 n!1 n n
and
t lim ˛ D n!1 n
1:
Taking the logarithm of this it follows that ² ³n ² ³n t t t lim D lim 1 C ˛ D 0: n!1 n!1 n n n Because D F , this is the same as saying that F is stoutly tailed. As outlined in §VII.7 the existence of one stoutly tailed distribution function implies that these kind of functions are dense in the metric space .; dL /. Note that the Cauchy distribution, though showing heavy tails, is not stoutly tailed in the sense defined in §VII.7 (see Appendix D).
List of elementary probability densities
The support of a density f is the closure of the set ¹x 2 R W f .x/ ¤ 0º.
Densities belonging to discrete probability distributions name
support
density
Bernoulli distribution with parameter
¹0; 1º
Binomial distribution with parameters n and
¹0; 1; : : : ; nº
f .k/ D
Geometrical distribution with parameter
¹1; 2; 3; : : : º
f .k/ D .1
Poisson distribution with parameter
¹0; 1; 2; : : : º
f .k/ D e
Negative binomial distribution with parameters r and
¹r; r C 1; : : : º
, f .1/ D
f .0/ D 1
f .k/ D
n k k .1
/k
/n
k
1
k kŠ
k 1 r r 1 .1
/k
r
560
List of elementary probability densities
Densities belonging to absolutely continuous probability distributions name
support p1 2
density exp 12 . x /2
Normal distribution with parameters and
. 1; C1/
f .x/ D
Exponential distribution with parameter
Œ0; C1/
f .x/ D
1
Cauchy distribution with parameters ˛ and ˇ
. 1; C1/
f .x/ D
ˇ 1 .x ˛/2 Cˇ 2
Gamma distribution with parameters ˛ and ˇ
Œ0; C1/
f .x/ D
1 ˇ ˛ .˛/
2 -distribution with n degrees of freedom
Œ0; C1/
f .x/ D
1 2n=2 . 12 n/
x .n
f .x/ D
nC1 2 p n . n 2/
1C
t -distribution with n degreesof freedom F -distribution with m; n degrees of freedom in the numerator and denominator respectively Beta distribution with parameters p and q Weibull distribution with parameters ˛ and ˇ
. 1; C1/
Œ0; C1/
f .x/ D
e
.
x=
x˛
mCn 2 m n 2 2
/ . /
1 e x=ˇ
Œ0; C1/
f .x/ D
1 B.p;q/
f .x/ D ˛ˇ x ˇ
xp
x2 n
nC1 2
m m=2 m x2 1 n
1C Œ0; 1
2/=2 e x=2
1 .1
1 exp.
m nx
mCn 2
x/q
˛x ˇ /
1
Frequently used symbols
Chapter I
sample space (set of all possible outcomes)
!
outcome (element of )
A
-algebra (of admissible events)
P
probability measure
1A
indicator function belonging to the set A
; n
Lebesgue measure (on Rn )
X
(real-valued) stochastic variable
X
stochastic n-vector
PX ; PX
probability distribution of X; X
PX1 ;:::;Xn
probability distribution of .X1 ; : : : ; Xn /
fX ; fX
probability density of X; X
ıa
Dirac measure in the point a
A1 An
Cartesian product of the sets A1 ; : : : ; An
P1 ˝ ˝ Pn i
E.X /; E g.X/ ; X
product measure of P1 ; : : : ; Pn projection on i th coordinate expectation value (also: mean) of X; g.X/ mean (of X )
var.X /
variance of X
; X e X
standard deviation (of X ) standardized form of X
cov.X; Y /
covariance of X and Y
D .X; Y /
correlation coefficient of X and Y
log
natural logarithm
exp
exponential map
N.; 2 /
normal distribution with parameters and
#.A/
number of elements in the set A
hx; yi
(standard) inner product of x and y
562
Frequently used symbols
Q
orthogonal linear operator
V
linear span of X1 ; : : : ; Xn
F; FX
distribution function (of X )
ˆ
distribution function belonging to the N.0; 1/-distribution
n ; n .X /
nth moment (of X )
M; MX
moment generating function (of X )
S
usually (not always) indicating a sum
; X
characteristic function (of X )
Chapter II T; T
(vectorial) statistic
X
sample mean (of X1 ; : : : ; Xn /
S 2;
SX2
sample variance (of X1 ; : : : ; Xn /
Sp2
pooled variance
gamma function
Di D Yi
Xi
differences in a paired sample .X1 ; Y1 /; : : : ; .Xn ; Yn /
B
beta function
P .A j B/
probability of A, given B
fX . j Y D y/ f .; / 2‚
probability density of X , given Y D y
‚
parameter space
W‚!R
characteristic (of a population)
Pn
set of all permutations of the numbers 1; : : : ; n
permutation
Q./
exchange of coordinates
Tmin
unbiased estimator of minimum variance
family of probability densities
L; L likelihood function b ; b .X1 ; : : : ; Xn / maximum likelihood estimation (estimator)
Chapter III H0
null hypothesis
H1
alternative hypothesis
Frequently used symbols
G Rn
critical region
˛; ˛.G/
probability to commit a type I error, level of significance, size of G
ˇ
probability to commit a type II error
ƒ
likelihood ratio
G.ı/
critical region based on the likelihood ratio
Chapter IV ei
error (residual) corresponding to measurement no. i
Yx
the quantity y to be measured, considered as a stochastic variable, at an adjusted value x
x
expectation value of Yx , that is: E.Yx /
Yi ; i b ˛; b ˇ
simplified typography for Yxi and xi
Sxy
bx Y
SSE e
E Œx
least squares estimators of ˛ and ˇ P i .xi x/.yi y/ the linear estimator b ˛ Cb ˇ x for x D ˛Cˇx sum of squares of errors
the vector .1; 1; : : : ; 1/ 2 Rn
the one-dimensional linear subspace spanned by e ı equivalence class in Rn E belonging to x
RP
Pearson’s product-moment correlation coefficient
SSR
sum of squares of regression
Chapter V i
mean of population no. i
X i
sample mean of Xi1 ; : : : ; Xi n
X
sample mean of all Xij
MSL
mean of squared . errors, that is, the sum P 2 p.n 1/ ij .Xij X i /
ij
mean of population .i; j /
MSE
mean of squared level deviations, that is, the sum 2 . P .p 1/ i X i X n
563
564 i j
Frequently used symbols
ı level mean, that is: .i1 C C iq / q ı level mean, that is: .ij C C pj / p
grand mean, that is, the mean of all ij
˛i
effect of factor A at level i
ˇj
effect of fact B at level j
ij
interaction of factors A and B at simultaneous level .i; j /
Chapter VI T C; T
usually a sum of certain rank numbers
r.Xi /
rank number of Xi in the sequence X1 ; : : : ; Xn , ordered from small to large
r.Xi /
RK
rank number of Xi in the sequence X1 ; : : : ; Xm ; Y1 ; : : : ; Yn , ordered from small to large P the sum i r.Xi /
RS
Spearman’s rank correlation coefficient
Rij k
rank number of Xij k in the sequence X1j1 ; : : : ; X1j n ; X2j1 ; : : : ; X2j n ; : : : ; Xpj1 ; : : : ; Xpj n , ordered from small to large
TX
Kendall’s coefficient of concordance
Chapter VII b .x1 ; : : : ; xn /.x/ value in the point x of the empirical distribution function F belonging to an outcome x1 ; : : : ; xn
qF
quantile function belonging to the distribution function F
kF Gk1
x2R
Dn
Kolmogorov–Smirnov test statistic
dL
Lévy metric
dN
norm metric
dP
Prokhorov metric
dBL
bounded Lipschitz metric
dS
Skorokhod metric
F G
distribution product of F and G ı the function x 7! F .x /
F
sup jF .x/ G.x/j
Frequently used symbols
'P
convolution product of ' and P
e g .F /
distribution function of the statistic g.X1 ; : : : ; Xn /, where X1 ; : : : ; Xn are independent variables with distribution function F
b q .x1 ; : : : ; xn /
quantile function of the empirical distribution function b .x1 ; : : : ; xn / F
.X /˛ ; .X /n;˛
˛ .F / b X b .F /
ƒ W Dƒ ! R a F
a F ƒ0F .G/ F T
˛-trimmed mean of a sample
˛-trimmed mean of a population with distribution function F
median of a sample median of a population with distribution function F statistical functional with domain Dƒ the function x 7! F .x a/ ı the function x 7! F .x a/
von Mises derivative of ƒ in the point F in direction G influence function bootstrap statistic
X1 ; : : : ; Xm
bootstrap sample
k
kernel
k b f
ı kernel x 7! 1 k.x /
kernel density estimation
jI j
length of the interval I
I
interval partition I1 ; I2 ; : : : of R
jIj
supp jIp j
h.II x1 ; : : : ; xn /
histogram belonging to I and sample outcome x1 ; : : : ; xn
ı.I/
565
infp jIp j
Chapter VIII hx; yi
inner product of the vectors x and y
kxk
length of the vector x
T
linear operator; in this chapter not a vectorial statistic
x˝y
linear operator z 7! hz; yix
PM
orthogonal projection on linear subspace M
M?
orthogonal complement of M
566
Frequently used symbols
ŒT
matrix of T
det.T/
determinant of T
tr.T/
trace of T
Q
orthogonal linear operator
E.X/; µ; µX
expectation vector of the stochastic vector X
C .X/
covariance operator of the stochastic vector X
† =.X/
matrix of C .X/
C
(usually:) some covariance operator
N.µ; C/
Gaussian distribution with mean µ and covariance operator C
W .n; C/
Wishart distribution with parameters n and C
Z
(possibly) “Fisher’s Z”
M
(design) matrix
SSE
sum of squares of errors of the full model
SSER
sum of squares of errors of the reduced model
567
Frequently used symbols
The Greek alphabet
The Gothic alphabet
A
˛
alpha
A
a
a
B
ˇ
beta
B
b
b
gamma
C
c
c
ı
delta
D
d
d
E
"
epsilon
E
e
e
Z
zeta
F
f
f
H
eta
G
g
g
‚
#
theta
H
h
h
I
iota
I
i
i
K
kappa
J
j
j
ƒ
lambda
K
k
k
M
mu
L
l
l
N
nu
M
m
m
„
xi
N
n
n
O
o
omicron
O
o
o
…
pi
P
p
p
P
rho
Q
q
q
†
sigma
R
r
r
T
tau
S
s
s
‡
upsilon
T
t
t
ˆ
'
phi
U
u
u
X
chi
V
v
v
psi
W
w
w
omega
X
x
x
Y
y
y
Z
z
z
‰
!
Statistical tables
Table I: The binomial distribution   569
Table II: The standard normal distribution   573
Table III: The t-distributions   574
Table IV: The χ²-distributions   575
Table V: The F-distributions   576
Table VI: Wilcoxon's signed-rank test   580
Table VII: Wilcoxon's rank-sum test   581
Table VIII: The runs test   583
Table IX: Spearman's rank correlation test   585
Table X: Kendall's tau   585
Table XI: The Kolmogorov–Smirnov test   586
Tables II up to V taken over with kind permission of Druk Boompers drukkerijen bv, Meppel, from W. van den Brink and P. Koele, Statistiek, Deel 3 (Boom Meppel, Amsterdam, 1987). Table I and Tables VI up to XI computed by Alessandro di Bucchianico and Mark van de Wiel, Eindhoven University of Technology, using M ATHEMATICA.
569
Statistical tables
Table I: The binomial distribution
n k 0:05 0:10 1 0 0:9500 0:9000 2 0 0:9025 0:8100 1 0:9975 0:9900 3 0 0:8574 0:7290 1 0:9928 0:9720 2 0:9999 0:9990 4 0 0:8145 0:6561 1 0:9860 0:9477 2 0:9995 0:9963 3 1:0000 0:9999 5 0 0:7738 0:5905 1 0:9774 0:9185 2 0:9988 0:9914 3 1:0000 0:9995 4 1:0000 1:0000 6 0 0:7351 0:5314 1 0:9672 0:8857 2 0:9978 0:9842 3 0:9999 0:9987 4 1:0000 0:9999 5 1:0000 1:0000 7 0 0:6983 0:4783 1 0:9556 0:8503 2 0:9962 0:9743 3 0:9998 0:9973 4 1:0000 0:9998 5 1:0000 1:0000 6 1:0000 1:0000 8 0 0:6634 0:4305 1 0:9428 0:8131
0:15 0:8500 0:7225 0:9775 0:6141 0:9393 0:9966 0:5220 0:8905 0:9880 0:9995 0:4437 0:8352 0:9734 0:9978 0:9999 0:3771 0:7765 0:9527 0:9941 0:9996 1:0000 0:3206 0:7166 0:9262 0:9879 0:9988 0:9999 1:0000 0:2725 0:6572
0:20 0:8000 0:6400 0:9600 0:5120 0:8960 0:9920 0:4096 0:8192 0:9728 0:9984 0:3277 0:7373 0:9421 0:9933 0:9997 0:2621 0:6554 0:9011 0:9830 0:9984 0:9999 0:2097 0:5767 0:8520 0:9667 0:9953 0:9996 1:0000 0:1678 0:5033
p 0:25 0:7500 0:5625 0:9375 0:4219 0:8438 0:9844 0:3164 0:7383 0:9492 0:9961 0:2373 0:6328 0:8965 0:9844 0:9990 0:1780 0:5339 0:8306 0:9624 0:9954 0:9998 0:1335 0:4449 0:7564 0:9294 0:9871 0:9987 0:9999 0:1001 0:3671
0:30 0:7000 0:4900 0:9100 0:3430 0:7840 0:9730 0:2401 0:6517 0:9163 0:9919 0:1681 0:5282 0:8369 0:9692 0:9976 0:1176 0:4202 0:7443 0:9295 0:9891 0:9993 0:0824 0:3294 0:6471 0:8740 0:9712 0:9962 0:9998 0:0576 0:2553
0:35 0:6500 0:4225 0:8775 0:2746 0:7183 0:9571 0:1785 0:5630 0:8735 0:9850 0:1160 0:4284 0:7648 0:9460 0:9947 0:0754 0:3191 0:6471 0:8826 0:9777 0:9982 0:0490 0:2338 0:5323 0:8002 0:9444 0:9910 0:9994 0:0319 0:1691
0:40 0:6000 0:3600 0:8400 0:2160 0:6480 0:9360 0:1296 0:4752 0:8208 0:9744 0:0778 0:3370 0:6826 0:9130 0:9898 0:0467 0:2333 0:5443 0:8208 0:9590 0:9959 0:0280 0:1586 0:4199 0:7102 0:9037 0:9812 0:9984 0:0168 0:1064
0:50 0:5000 0:2500 0:7500 0:1250 0:5000 0:8750 0:0625 0:3125 0:6875 0:9375 0:0313 0:1875 0:5000 0:8125 0:9688 0:0156 0:1094 0:3438 0:6562 0:8906 0:9844 0:0078 0:0625 0:2266 0:5000 0:7734 0:9375 0:9922 0:0039 0:0352
The distribution function belonging to the binomial distribution. For example: if X is binomially distributed with parameters n = 5 and p = 0.15, then P(X ≤ 2) = 0.9734; a short numerical check of this value follows the table.
n k 0:05 8 2 0:9942 3 0:9996 4 1:0000 5 1:0000 6 1:0000 7 1:0000 9 0 0:6302 1 0:9288 2 0:9916 3 0:9994 4 1:0000 5 1:0000 6 1:0000 7 1:0000 8 1:0000 10 0 0:5987 1 0:9139 2 0:9885 3 0:9990 4 0:9999 5 1:0000 6 1:0000 7 1:0000 8 1:0000 9 1:0000 11 0 0:5688 1 0:8981 2 0:9848 3 0:9984 4 0:9999 5 1:0000 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 12 0 0:5404 1 0:8816
0:10 0:9619 0:9950 0:9996 1:0000 1:0000 1:0000 0:3874 0:7748 0:9470 0:9917 0:9991 0:9999 1:0000 1:0000 1:0000 0:3487 0:7361 0:9298 0:9872 0:9984 0:9999 1:0000 1:0000 1:0000 1:0000 0:3138 0:6974 0:9104 0:9815 0:9972 0:9997 1:0000 1:0000 1:0000 1:0000 1:0000 0:2824 0:6590
0:15 0:8948 0:9786 0:9971 0:9998 1:0000 1:0000 0:2316 0:5995 0:8591 0:9661 0:9944 0:9994 1:0000 1:0000 1:0000 0:1969 0:5443 0:8202 0:9500 0:9901 0:9986 0:9999 1:0000 1:0000 1:0000 0:1673 0:4922 0:7788 0:9306 0:9841 0:9973 0:9997 1:0000 1:0000 1:0000 1:0000 0:1422 0:4435
0:20 0:7969 0:9437 0:9896 0:9988 0:9999 1:0000 0:1342 0:4362 0:7382 0:9144 0:9804 0:9969 0:9997 1:0000 1:0000 0:1074 0:3758 0:6778 0:8791 0:9672 0:9936 0:9991 0:9999 1:0000 1:0000 0:0859 0:3221 0:6174 0:8389 0:9496 0:9883 0:9980 0:9998 1:0000 1:0000 1:0000 0:0687 0:2749
p 0:25 0:6785 0:8862 0:9727 0:9958 0:9996 1:0000 0:0751 0:3003 0:6007 0:8343 0:9511 0:9900 0:9987 0:9999 1:0000 0:0563 0:2440 0:5256 0:7759 0:9219 0:9803 0:9965 0:9996 1:0000 1:0000 0:0422 0:1971 0:4552 0:7133 0:8854 0:9657 0:9924 0:9988 0:9999 1:0000 1:0000 0:0317 0:1584
0:30 0:5518 0:8059 0:9420 0:9887 0:9987 0:9999 0:0404 0:1960 0:4628 0:7297 0:9012 0:9747 0:9957 0:9996 1:0000 0:0282 0:1493 0:3828 0:6496 0:8497 0:9527 0:9894 0:9984 0:9999 1:0000 0:0198 0:1130 0:3127 0:5696 0:7897 0:9218 0:9784 0:9957 0:9994 1:0000 1:0000 0:0138 0:0850
0:35 0:4278 0:7064 0:8939 0:9747 0:9964 0:9998 0:0207 0:1211 0:3373 0:6089 0:8283 0:9464 0:9888 0:9986 0:9999 0:0135 0:0860 0:2616 0:5138 0:7515 0:9051 0:9740 0:9952 0:9995 1:0000 0:0088 0:0606 0:2001 0:4256 0:6683 0:8513 0:9499 0:9878 0:9980 0:9998 1:0000 0:0057 0:0424
0:40 0:3154 0:5941 0:8263 0:9502 0:9915 0:9993 0:0101 0:0705 0:2318 0:4826 0:7334 0:9006 0:9750 0:9962 0:9997 0:0060 0:0464 0:1673 0:3823 0:6331 0:8338 0:9452 0:9877 0:9983 0:9999 0:0036 0:0302 0:1189 0:2963 0:5328 0:7535 0:9006 0:9707 0:9941 0:9993 1:0000 0:0022 0:0196
0:50 0:1445 0:3633 0:6367 0:8555 0:9648 0:9961 0:0020 0:0195 0:0898 0:2539 0:5000 0:7461 0:9102 0:9805 0:9980 0:0010 0:0107 0:0547 0:1719 0:3770 0:6230 0:8281 0:9453 0:9893 0:9990 0:0005 0:0059 0:0327 0:1133 0:2744 0:5000 0:7256 0:8867 0:9673 0:9941 0:9995 0:0002 0:0032
n k 0:05 12 2 0:9804 3 0:9978 4 0:9998 5 1:0000 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 11 1:0000 13 0 0:5133 1 0:8646 2 0:9755 3 0:9969 4 0:9997 5 1:0000 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 11 1:0000 12 1:0000 14 0 0:4877 1 0:8470 2 0:9699 3 0:9958 4 0:9996 5 1:0000 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 11 1:0000 12 1:0000 13 1:0000
0:10 0:8891 0:9744 0:9957 0:9995 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 0:2542 0:6213 0:8661 0:9658 0:9935 0:9991 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 0:2288 0:5846 0:8416 0:9559 0:9908 0:9985 0:9998 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:15 0:7358 0:9078 0:9761 0:9954 0:9993 0:9999 1:0000 1:0000 1:0000 1:0000 0:1209 0:3983 0:6920 0:8820 0:9658 0:9925 0:9987 0:9998 1:0000 1:0000 1:0000 1:0000 1:0000 0:1028 0:3567 0:6479 0:8535 0:9533 0:9885 0:9978 0:9997 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:20 0:5583 0:7946 0:9274 0:9806 0:9961 0:9994 0:9999 1:0000 1:0000 1:0000 0:0550 0:2336 0:5017 0:7473 0:9009 0:9700 0:9930 0:9988 0:9998 1:0000 1:0000 1:0000 1:0000 0:0440 0:1979 0:4481 0:6982 0:8702 0:9561 0:9884 0:9976 0:9996 1:0000 1:0000 1:0000 1:0000 1:0000
p 0:25 0:3907 0:6488 0:8424 0:9456 0:9857 0:9972 0:9996 1:0000 1:0000 1:0000 0:0238 0:1267 0:3326 0:5843 0:7940 0:9198 0:9757 0:9944 0:9990 0:9999 1:0000 1:0000 1:0000 0:0178 0:1010 0:2811 0:5213 0:7415 0:8883 0:9617 0:9897 0:9978 0:9997 1:0000 1:0000 1:0000 1:0000
0:30 0:2528 0:4925 0:7237 0:8822 0:9614 0:9905 0:9983 0:9998 1:0000 1:0000 0:0097 0:0637 0:2025 0:4206 0:6543 0:8346 0:9376 0:9818 0:9960 0:9993 0:9999 1:0000 1:0000 0:0068 0:0475 0:1608 0:3552 0:5842 0:7805 0:9067 0:9685 0:9917 0:9983 0:9998 1:0000 1:0000 1:0000
0:35 0:1513 0:3467 0:5833 0:7873 0:9154 0:9745 0:9944 0:9992 0:9999 1:0000 0:0037 0:0296 0:1132 0:2783 0:5005 0:7159 0:8705 0:9538 0:9874 0:9975 0:9997 1:0000 1:0000 0:0024 0:0205 0:0839 0:2205 0:4227 0:6405 0:8164 0:9247 0:9757 0:9940 0:9989 0:9999 1:0000 1:0000
0:40 0:0834 0:2253 0:4382 0:6652 0:8418 0:9427 0:9847 0:9972 0:9997 1:0000 0:0013 0:0126 0:0579 0:1686 0:3530 0:5744 0:7712 0:9023 0:9679 0:9922 0:9987 0:9999 1:0000 0:0008 0:0081 0:0398 0:1243 0:2793 0:4859 0:6925 0:8499 0:9417 0:9825 0:9961 0:9994 0:9999 1:0000
0:50 0:0193 0:0730 0:1938 0:3872 0:6128 0:8062 0:9270 0:9807 0:9968 0:9998 0:0001 0:0017 0:0112 0:0461 0:1334 0:2905 0:5000 0:7095 0:8666 0:9539 0:9888 0:9983 0:9999 0:0001 0:0009 0:0065 0:0287 0:0898 0:2120 0:3953 0:6047 0:7880 0:9102 0:9713 0:9935 0:9991 0:9999
n k 0:05 15 0 0:4633 1 0:8290 2 0:9638 3 0:9945 4 0:9994 5 0:9999 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 11 1:0000 12 1:0000 13 1:0000 14 1:0000 20 0 0:3585 1 0:7358 2 0:9245 3 0:9841 4 0:9974 5 0:9997 6 1:0000 7 1:0000 8 1:0000 9 1:0000 10 1:0000 11 1:0000 12 1:0000 13 1:0000 14 1:0000 15 1:0000 16 1:0000 17 1:0000 18 1:0000 19 1:0000
0:10 0:2059 0:5490 0:8159 0:9444 0:9873 0:9978 0:9997 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 0:1216 0:3917 0:6769 0:8670 0:9568 0:9887 0:9976 0:9996 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:15 0:0874 0:3186 0:6042 0:8227 0:9383 0:9832 0:9964 0:9994 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 0:0388 0:1756 0:4049 0:6477 0:8298 0:9327 0:9781 0:9941 0:9987 0:9998 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:20 0:0352 0:1671 0:3980 0:6482 0:8358 0:9389 0:9819 0:9958 0:9992 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 0:0115 0:0692 0:2061 0:4114 0:6296 0:8042 0:9133 0:9679 0:9900 0:9974 0:9994 0:9999 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
p 0:25 0:0134 0:0802 0:2361 0:4613 0:6865 0:8516 0:9434 0:9827 0:9958 0:9992 0:9999 1:0000 1:0000 1:0000 1:0000 0:0032 0:0243 0:0913 0:2252 0:4148 0:6172 0:7858 0:8982 0:9591 0:9861 0:9961 0:9991 0:9998 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:30 0:0047 0:0353 0:1268 0:2969 0:5155 0:7216 0:8689 0:9500 0:9848 0:9963 0:9993 0:9999 1:0000 1:0000 1:0000 0:0008 0:0076 0:0355 0:1071 0:2375 0:4164 0:6080 0:7723 0:8867 0:9520 0:9829 0:9949 0:9987 0:9997 1:0000 1:0000 1:0000 1:0000 1:0000 1:0000
0:35 0:0016 0:0142 0:0617 0:1727 0:3519 0:5643 0:7548 0:8868 0:9578 0:9876 0:9972 0:9995 0:9999 1:0000 1:0000 0:0002 0:0021 0:0121 0:0444 0:1182 0:2454 0:4166 0:6010 0:7624 0:8782 0:9468 0:9804 0:9940 0:9985 0:9997 1:0000 1:0000 1:0000 1:0000 1:0000
0:40 0:0005 0:0052 0:0271 0:0905 0:2173 0:4032 0:6098 0:7869 0:9050 0:9662 0:9907 0:9981 0:9997 1:0000 1:0000 0:0000 0:0005 0:0036 0:0160 0:0510 0:1256 0:2500 0:4159 0:5956 0:7553 0:8725 0:9435 0:9790 0:9935 0:9984 0:9997 1:0000 1:0000 1:0000 1:0000
0:50 0:0000 0:0005 0:0037 0:0176 0:0592 0:1509 0:3036 0:5000 0:6964 0:8491 0:9408 0:9824 0:9963 0:9995 1:0000 0:0000 0:0000 0:0002 0:0013 0:0059 0:0207 0:0577 0:1316 0:2517 0:4119 0:5881 0:7483 0:8684 0:9423 0:9793 0:9941 0:9987 0:9998 1:0000 1:0000
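The tabulated value quoted in the caption of Table I can be cross-checked numerically. The following minimal Python sketch (not part of the book; it assumes SciPy is available and is only an illustration of how such a value can be reproduced) evaluates the binomial distribution function for n = 5 and p = 0.15.

    # Cross-check of Table I: P(X <= 2) for X ~ binomial(n = 5, p = 0.15).
    from scipy.stats import binom

    print(binom.cdf(2, n=5, p=0.15))   # approximately 0.9734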
Table II: The standard normal distribution
z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 0.500 0.540 0.579 0.618 0.655 0.692 0.726 0.758 0.788 0.816 0.841 0.864 0.885 0.903 0.919 0.933 0.945 0.955 0.964 0.971 0.977 0.982 0.986 0.989 0.992 0.994 0.995 0.997 0.997 0.998 0.999
1 0.504 0.544 0.583 0.622 0.659 0.695 0.729 0.761 0.791 0.819 0.844 0.867 0.887 0.905 0.921 0.935 0.946 0.956 0.965 0.972 0.978 0.983 0.986 0.990 0.992 0.994 0.995 0.997 0.998 0.998 0.999
2 0.508 0.548 0.587 0.626 0.663 0.699 0.732 0.764 0.794 0.821 0.846 0.869 0.889 0.907 0.922 0.936 0.947 0.957 0.966 0.973 0.978 0.983 0.987 0.990 0.992 0.994 0.996 0.997 0.998 0.998 0.999
last decimal of z 3 4 5 6 0.512 0.516 0.520 0.524 0.552 0.556 0.560 0.564 0.591 0.595 0.599 0.603 0.629 0.633 0.637 0.641 0.666 0.670 0.674 0.677 0.702 0.705 0.709 0.712 0.736 0.739 0.742 0.745 0.767 0.770 0.773 0.776 0.797 0.800 0.802 0.805 0.824 0.826 0.829 0.832 0.849 0.851 0.853 0.855 0.871 0.873 0.875 0.877 0.891 0.893 0.894 0.896 0.908 0.910 0.912 0.913 0.924 0.925 0.927 0.928 0.937 0.938 0.939 0.941 0.948 0.950 0.951 0.952 0.958 0.959 0.960 0.961 0.966 0.967 0.968 0.969 0.973 0.974 0.974 0.975 0.979 0.979 0.980 0.980 0.983 0.984 0.984 0.985 0.987 0.988 0.988 0.988 0.990 0.990 0.991 0.991 0.993 0.993 0.993 0.993 0.994 0.995 0.995 0.995 0.996 0.996 0.996 0.996 0.997 0.997 0.997 0.997 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.999 0.999 0.999 0.999 0.999
7 0.528 0.568 0.606 0.644 0.681 0.716 0.749 0.779 0.808 0.834 0.858 0.879 0.898 0.915 0.929 0.942 0.953 0.962 0.969 0.976 0.981 0.985 0.988 0.991 0.993 0.995 0.996 0.997 0.998 0.999 0.999
8 0.532 0.571 0.610 0.648 0.684 0.719 0.752 0.782 0.811 0.837 0.860 0.881 0.900 0.916 0.931 0.943 0.954 0.963 0.970 0.976 0.981 0.985 0.989 0.991 0.993 0.995 0.996 0.997 0.998 0.999 0.999
9 0.536 0.575 0.614 0.652 0.688 0.722 0.755 0.785 0.813 0.839 0.862 0.883 0.902 0.918 0.932 0.944 0.955 0.963 0.971 0.977 0.982 0.986 0.989 0.992 0.994 0.995 0.996 0.997 0.998 0.999 0.999
The distribution function corresponding to the N(0, 1)-distribution. For example: if Z has this distribution then P(Z ≤ 1.84) = 0.967 and P(|Z| ≤ 1.84) = 0.967 − (1 − 0.967) = 0.934. In general one has: P(Z ≥ z) = P(Z ≤ −z) = 1 − P(Z ≤ z).
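As a numerical cross-check (an illustration assuming SciPy, not the way the table was produced), the worked example with z = 1.84 can be reproduced as follows.

    # Cross-check of Table II for the worked example with z = 1.84.
    from scipy.stats import norm

    print(norm.cdf(1.84))                    # about 0.967
    print(norm.cdf(1.84) - norm.cdf(-1.84))  # P(|Z| <= 1.84), about 0.934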
Table III: The t-distributions
n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
0.90 3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.383 1.372 1.363 1.356 1.350 1.345 1.341 1.337 1.333 1.330 1.328 1.325 1.323
0.95 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721
1 ˛ 0.975 12.71 4.303 3.182 2.776 2.571 2.447 2.365 2.306 2.262 2.228 2.201 2.179 2.160 2.145 2.131 2.120 2.110 2.101 2.093 2.086 2.080
0.99 31.82 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518
0.995 63.66 9.925 5.841 4.604 4.032 3.707 3.499 3.355 3.250 3.169 3.106 3.055 3.012 2.977 2.947 2.921 2.898 2.878 2.861 2.845 2.831
n 22 23 24 25 26 27 28 29 30 35 40 45 50 60 70 80 90 100 200 500 1000
0.90 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 1.306 1.303 1.301 1.299 1.296 1.294 1.292 1.291 1.290 1.286 1.283 1.282
0.95 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.690 1.684 1.679 1.676 1.671 1.667 1.664 1.662 1.660 1.652 1.648 1.646
1 ˛ 0.975 2.074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 2.030 2.021 2.014 2.009 2.000 1.994 1.990 1.987 1.984 1.972 1.965 1.962
0.99 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.438 2.423 2.412 2.403 2.390 2.381 2.374 2.368 2.364 2.345 2.334 2.330
0.995 2.819 2.807 2.797 2.787 2.779 2.771 2.763 2.756 2.750 2.724 2.704 2.690 2.678 2.660 2.648 2.639 2.632 2.626 2.601 2.586 2.581
Quantiles with the t-distribution. For example: if X has the t-distribution with 20 degrees of freedom, then P(X ≤ 2.086) = 0.975. Or: if Y has the t-distribution with 30 degrees of freedom, then P(|Y| ≤ 2.457) = 0.98.
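Both quantities in the example above can be cross-checked with a short Python sketch (assuming SciPy; not part of the book).

    # Cross-check of Table III: quantiles of the t-distribution.
    from scipy.stats import t

    print(t.ppf(0.975, df=20))                         # about 2.086
    print(t.cdf(2.457, df=30) - t.cdf(-2.457, df=30))  # P(|Y| <= 2.457), about 0.98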
Table IV: The χ²-distributions
n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 30 40 50 60 70 80 90 100
0.005 .000 .010 .072 .207 .412 .676 .989 1.34 1.73 2.16 2.60 3.07 3.57 4.07 4.60 5.14 5.70 6.26 6.84 7.43 8.03 8.64 9.26 9.89 10.5 13.8 20.7 28.0 35.5 43.3 51.2 59.2 67.3
0.01 .000 .020 .115 .297 .554 .872 1.24 1.65 2.09 2.56 3.05 3.57 4.11 4.66 5.23 5.81 6.41 7.01 7.63 8.26 8.90 9.54 10.2 10.9 11.5 15.0 22.2 29.7 37.5 45.4 53.5 61.8 70.1
˛ 0.025 .001 .051 .216 .484 .831 1.24 1.69 2.18 2.70 3.25 3.82 4.40 5.01 5.63 6.26 6.91 7.56 8.23 8.91 9.59 10.3 11.0 11.7 12.4 13.1 16.8 24.4 32.4 40.5 48.8 57.2 65.6 74.2
0.05 .004 .103 .352 .711 1.15 1.64 2.17 2.73 3.33 3.94 4.57 5.23 5.89 6.57 7.26 7.96 8.67 9.39 10.1 10.9 11.6 12.3 13.1 13.8 14.6 18.5 26.5 34.8 43.2 51.7 60.4 69.1 77.9
0.10 .016 .211 .584 1.06 1.61 2.20 2.83 3.49 4.17 4.87 5.58 6.30 7.04 7.79 8.55 9.31 10.1 10.9 11.7 12.4 13.2 14.0 14.8 15.7 16.5 20.6 29.1 37.7 46.5 55.3 64.3 73.3 82.4
0.90 2.71 4.61 6.25 7.78 9.24 10.6 12.0 13.4 14.7 16.0 17.3 18.5 19.8 21.1 22.3 23.5 24.8 26.0 27.2 28.4 29.6 30.8 32.0 33.2 34.4 40.3 51.8 63.2 74.4 85.5 96.6 107.6 118.5
0.95 3.84 5.99 7.81 9.49 11.1 12.6 14.1 15.5 16.9 18.3 19.7 21.0 22.4 23.7 25.0 26.3 27.6 28.9 30.1 31.4 32.7 33.9 35.2 36.4 37.7 43.8 55.8 67.5 79.1 90.5 101.9 113.1 124.3
1 ˛ 0.975 5.02 7.38 9.35 11.1 12.8 14.4 16.0 17.5 19.0 20.5 21.9 23.3 24.7 26.1 27.5 28.8 30.2 31.5 32.9 34.2 35.5 36.8 38.1 39.4 40.6 47.0 59.3 71.4 83.3 95.0 106.6 118.1 129.6
0.99 6.63 9.21 11.3 13.3 15.1 16.8 18.5 20.1 21.7 23.2 24.7 26.2 27.7 29.1 30.6 32.0 33.4 34.8 36.2 37.6 38.9 40.3 41.6 43.0 44.3 50.9 63.7 76.2 88.4 100.4 112.3 124.1 135.8
0.995 7.88 10.6 12.8 14.9 16.7 18.5 20.3 22.0 23.6 25.2 26.8 28.3 29.8 31.3 32.8 34.3 35.7 37.2 38.6 40.0 41.4 42.8 44.2 45.6 46.9 53.7 66.8 79.5 92.0 104.2 116.3 128.3 140.2
Quantiles with the χ²-distribution. For example: if X is χ²-distributed with 18 degrees of freedom, then P(X ≤ 31.5) = 0.975, as well as P(X ≥ 10.9) = 0.90.
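The same two probabilities can be recomputed with the following minimal sketch (an illustration assuming SciPy, not part of the book).

    # Cross-check of Table IV for 18 degrees of freedom.
    from scipy.stats import chi2

    print(chi2.ppf(0.975, df=18))  # about 31.5
    print(chi2.sf(10.9, df=18))    # P(X >= 10.9), about 0.90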
n 1 ˛ 1 0.90 0.95 0.90 2 0.95 0.975 0.99 0.90 3 0.95 0.975 0.99 0.90 4 0.95 0.975 0.99 0.90 5 0.95 0.975 0.99 0.90 6 0.95 0.975 0.99
1 39.9 161 8.53 18.5 38.5 98.5 5.54 10.1 17.4 34.1 4.54 7.71 12.2 21.2 4.06 6.61 10.0 16.3 3.78 5.99 8.81 13.7
2 49.5 200 9.00 19.0 39.0 99.0 5.46 9.55 16.0 30.8 4.32 6.94 10.7 18.0 3.78 5.79 8.43 13.3 3.46 5.14 7.26 10.9
3 53.6 216 9.16 19.2 39.2 99.2 5.39 9.28 15.4 29.5 4.19 6.59 10.0 16.7 3.62 5.41 7.76 12.1 3.29 4.76 6.60 9.78
4 55.8 225 9.24 19.2 39.3 99.2 5.34 9.12 15.1 28.7 4.11 6.39 9.60 16.0 3.52 5.19 7.39 11.4 3.18 4.53 6.23 9.15
5 57.2 230 9.29 19.3 39.3 99.3 5.31 9.01 14.9 28.2 4.05 6.26 9.36 15.5 3.45 5.05 7.15 11.0 3.11 4.39 5.99 8.75
Table V: The F-distributions
6 58.2 234 9.33 19.3 39.3 99.3 5.28 8.94 14.7 27.9 4.01 6.16 9.20 15.2 3.40 4.95 6.98 10.7 3.05 4.28 5.82 8.47
m : degrees of freedom in numerator 7 8 9 10 15 20 30 58.9 59.4 59.9 60.2 61.2 61.7 62.3 237 239 241 242 246 248 250 9.35 9.37 9.38 9.39 9.42 9.44 9.46 19.4 19.4 19.4 19.4 19.4 19.4 19.5 39.4 39.4 39.4 39.4 39.4 39.5 39.5 99.4 99.4 99.4 99.4 99.4 99.4 99.5 5.27 5.25 5.24 5.23 5.20 5.18 5.17 8.89 8.85 8.81 8.79 8.70 8.66 8.62 14.6 14.5 14.5 14.4 14.3 14.2 14.1 27.7 27.5 27.3 27.2 26.9 26.7 26.5 3.98 3.95 3.94 3.92 3.87 3.84 3.82 6.09 6.04 6.00 5.96 5.86 5.80 5.75 9.07 8.98 8.90 8.84 8.66 8.56 8.46 15.0 14.8 14.7 14.5 14.2 14.0 13.8 3.37 3.34 3.32 3.30 3.24 3.21 3.17 4.88 4.82 4.77 4.74 4.62 4.56 4.50 6.85 6.76 6.68 6.62 6.43 6.33 6.23 10.5 10.3 10.2 10.1 9.72 9.55 9.38 3.01 2.98 2.96 2.94 2.87 2.84 2.80 4.21 4.15 4.10 4.06 3.94 3.87 3.81 5.70 5.60 5.52 5.46 5.27 5.17 5.07 8.26 8.10 7.98 7.87 7.56 7.40 7.23 40 62.5 251 9.47 19.5 39.5 99.5 5.16 8.59 14.0 26.4 3.80 5.72 8.41 13.7 3.16 4.46 6.18 9.29 2.78 3.77 5.01 7.14
50 62.7 252 9.47 19.5 39.5 99.5 5.15 8.58 14.0 26.4 3.80 5.70 8.38 13.7 3.15 4.44 6.14 9.24 2.77 3.75 4.98 7.09
60 62.8 252 9.47 19.5 39.5 99.5 5.15 8.57 14.0 26.3 3.79 5.69 8.36 13.7 3.14 4.43 6.12 9.20 2.76 3.74 4.96 7.06
100 63.0 253 9.48 19.5 39.5 99.5 5.14 8.55 14.0 26.2 3.78 5.66 8.32 13.6 3.13 4.41 6.08 9.13 2.75 3.71 4.92 6.99
500 63.3 254 9.49 19.5 39.5 99.5 5.14 8.53 13.9 26.1 3.76 5.64 8.27 13.5 3.11 4.37 6.03 9.04 2.73 3.68 4.86 6.90
1 63.3 254 9.49 19.5 39.5 99.5 5.13 8.53 13.9 26.1 3.76 5.63 8.26 13.5 3.10 4.36 6.02 9.02 2.72 3.67 4.85 6.88
n 1 ˛ 0.90 7 0.95 0.975 0.99 0.90 8 0.95 0.975 0.99 0.90 9 0.95 0.975 0.99 0.90 10 0.95 0.975 0.99 0.90 11 0.95 0.975 0.99 0.90 12 0.95 0.975 0.99
1 3.59 5.59 8.07 12.2 3.46 5.32 7.57 11.3 3.36 5.12 7.21 10.6 3.29 4.96 6.94 10.0 3.23 4.84 6.72 9.65 3.18 4.75 6.55 9.33
2 3.26 4.74 6.54 9.55 3.11 4.46 6.06 8.65 3.01 4.26 5.71 8.02 2.92 4.10 5.46 7.56 2.86 3.98 5.26 7.21 2.81 3.89 5.10 6.93
3 3.07 4.35 5.89 8.45 2.92 4.07 5.42 7.59 2.81 3.86 5.08 6.99 2.73 3.71 4.83 6.55 2.66 3.59 4.63 6.22 2.61 3.49 4.47 5.95
4 2.96 4.12 5.52 7.85 2.81 3.84 5.05 7.01 2.69 3.63 4.72 6.42 2.61 3.48 4.47 5.99 2.54 3.36 4.28 5.67 2.48 3.26 4.12 5.41
5 2.88 3.97 5.29 7.46 2.73 3.69 4.82 6.63 2.61 3.48 4.48 6.06 2.52 3.33 4.24 5.64 2.45 3.20 4.04 5.32 2.39 3.11 3.89 5.06
6 2.83 3.87 5.12 7.19 2.67 3.58 4.65 6.37 2.55 3.37 4.32 5.80 2.46 3.22 4.07 5.39 2.39 3.09 3.88 5.07 2.33 3.00 3.73 4.82
m : degrees of freedom in numerator 7 8 9 10 15 20 30 2.78 2.75 2.72 2.70 2.63 2.59 2.56 3.79 3.73 3.68 3.64 3.51 3.44 3.38 4.99 4.90 4.82 4.76 4.57 4.47 4.36 6.99 6.84 6.72 6.62 6.31 6.16 5.99 2.62 2.59 2.56 2.54 2.46 2.42 2.38 3.50 3.44 3.39 3.35 3.22 3.15 3.08 4.53 4.43 4.36 4.30 4.10 4.00 3.89 6.18 6.03 5.91 5.81 5.52 5.36 5.20 2.51 2.47 2.44 2.42 2.34 2.30 2.25 3.29 3.23 3.18 3.14 3.01 2.94 2.86 4.20 4.10 4.03 3.96 3.77 3.67 3.56 5.61 5.47 5.35 5.26 4.96 4.81 4.65 2.41 2.38 2.35 2.32 2.24 2.20 2.16 3.14 3.07 3.02 2.98 2.85 2.77 2.70 3.95 3.85 3.78 3.72 3.52 3.42 3.31 5.20 5.06 4.94 4.85 4.56 4.41 4.25 2.34 2.30 2.27 2.25 2.17 2.12 2.08 3.01 2.95 2.90 2.85 2.72 2.65 2.57 3.76 3.66 3.59 3.53 3.33 3.23 3.12 4.89 4.74 4.63 4.54 4.25 4.10 3.94 2.28 2.24 2.21 2.19 2.10 2.06 2.01 2.91 2.85 2.80 2.75 2.62 2.54 2.47 3.61 3.51 3.44 3.37 3.18 3.07 2.96 4.64 4.50 4.39 4.30 4.01 3.86 3.70 40 2.54 3.34 4.31 5.91 2.36 3.04 3.84 5.12 2.23 2.83 3.51 4.57 2.13 2.66 3.26 4.17 2.05 2.53 3.06 3.86 1.99 2.43 2.91 3.62
50 2.52 3.32 4.28 5.86 2.35 3.02 3.81 5.07 2.22 2.80 3.47 4.52 2.12 2.64 3.22 4.12 2.04 2.51 3.03 3.81 1.97 2.40 2.87 3.57
60 2.51 3.30 4.25 5.82 2.34 3.01 3.78 5.03 2.21 2.79 3.45 4.48 2.11 2.62 3.20 4.08 2.03 2.49 3.00 3.78 1.96 2.38 2.85 3.54
100 2.50 3.27 4.21 5.75 2.32 2.97 3.74 4.96 2.19 2.76 3.40 4.41 2.09 2.59 3.15 4.01 2.01 2.46 2.96 3.71 1.94 2.35 2.80 3.47
500 2.48 3.24 4.16 5.67 2.30 2.94 3.68 4.88 2.17 2.72 3.35 4.33 2.06 2.55 3.09 3.93 1.98 2.42 2.90 3.62 1.91 2.31 2.74 3.38 1 2.47 3.23 4.14 5.65 2.29 2.93 3.67 4.86 2.16 2.71 3.33 4.31 2.06 2.54 3.08 3.91 1.97 2.40 2.88 3.60 1.90 2.30 2.72 3.36
n 1 ˛ 0.90 13 0.95 0.975 0.99 0.90 14 0.95 0.975 0.99 0.90 15 0.95 0.975 0.99 0.90 16 0.95 0.975 0.99 0.90 17 0.95 0.975 0.99 0.90 18 0.95 0.975 0.99
1 3.14 4.67 6.41 9.07 3.10 4.60 6.30 8.86 3.07 4.54 6.20 8.68 3.05 4.49 6.12 8.53 3.03 4.45 6.04 8.40 3.01 4.41 5.98 8.29
2 2.76 3.81 4.97 6.70 2.73 3.74 4.86 6.51 2.70 3.68 4.77 6.36 2.67 3.63 4.69 6.23 2.64 3.59 4.62 6.11 2.62 3.55 4.56 6.01
3 2.56 3.41 4.35 5.74 2.52 3.34 4.24 5.56 2.49 3.29 4.15 5.42 2.46 3.24 4.08 5.29 2.44 3.20 4.01 5.18 2.42 3.16 3.95 5.09
4 2.43 3.18 4.00 5.21 2.39 3.11 3.89 5.04 2.36 3.06 3.80 4.89 2.33 3.01 3.73 4.77 2.31 2.96 3.66 4.67 2.29 2.93 3.61 4.58
5 2.35 3.03 3.77 4.86 2.31 2.96 3.66 4.69 2.27 2.90 3.58 4.56 2.24 2.85 3.50 4.44 2.22 2.81 3.44 4.34 2.20 2.77 3.38 4.25
6 2.28 2.92 3.60 4.62 2.24 2.85 3.50 4.46 2.21 2.79 3.41 4.32 2.18 2.74 3.34 4.20 2.15 2.70 3.28 4.10 2.13 2.66 3.22 4.01
m : degrees of freedom in numerator 7 8 9 10 15 20 30 2.23 2.20 2.16 2.14 2.05 2.01 1.96 2.83 2.77 2.71 2.67 2.53 2.46 2.38 3.48 3.39 3.31 3.25 3.05 2.95 2.84 4.44 4.30 4.19 4.10 3.82 3.66 3.51 2.19 2.15 2.12 2.10 2.01 1.96 1.91 2.76 2.70 2.65 2.60 2.46 2.39 2.31 3.38 3.29 3.21 3.15 2.95 2.84 2.73 4.28 4.14 4.03 3.94 3.66 3.51 3.35 2.16 2.12 2.09 2.06 1.97 1.92 1.87 2.71 2.64 2.59 2.54 2.40 2.33 2.25 3.29 3.20 3.12 3.06 2.86 2.76 2.64 4.14 4.00 3.89 3.80 3.52 3.37 3.21 2.13 2.09 2.06 2.03 1.94 1.89 1.84 2.66 2.59 2.54 2.49 2.35 2.28 2.19 3.22 3.12 3.05 2.99 2.79 2.68 2.57 4.03 3.89 3.78 3.69 3.41 3.26 3.10 2.10 2.06 2.03 2.00 1.91 1.86 1.81 2.61 2.55 2.49 2.45 2.31 2.23 2.15 3.16 3.06 2.98 2.92 2.72 2.62 2.50 3.93 3.79 3.68 3.59 3.31 3.16 3.00 2.08 2.04 2.00 1.98 1.89 1.84 1.78 2.58 2.51 2.46 2.41 2.27 2.19 2.11 3.10 3.01 2.93 2.87 2.67 2.56 2.44 3.84 3.71 3.60 3.51 3.23 3.08 2.92 40 1.93 2.34 2.78 3.43 1.89 2.27 2.67 3.27 1.85 2.20 2.59 3.13 1.81 2.15 2.51 3.02 1.78 2.10 2.44 2.92 1.75 2.06 2.38 2.84
50 1.92 2.31 2.74 3.38 1.87 2.24 2.64 3.22 1.83 2.18 2.55 3.08 1.79 2.12 2.47 2.97 1.76 2.08 2.41 2.87 1.74 2.04 2.35 2.78
60 1.90 2.30 2.72 3.34 1.86 2.22 2.61 3.18 1.82 2.16 2.52 3.05 1.78 2.11 2.45 2.93 1.75 2.06 2.38 2.83 1.72 2.02 2.32 2.75
100 1.88 2.26 2.67 3.27 1.83 2.19 2.56 3.11 1.79 2.12 2.47 2.98 1.76 2.07 2.40 2.86 1.73 2.02 2.33 2.76 1.70 1.98 2.27 2.68
500 1.85 2.22 2.61 3.19 1.80 2.14 2.50 3.03 1.76 2.08 2.41 2.89 1.73 2.02 2.33 2.78 1.69 1.97 2.26 2.68 1.67 1.93 2.20 2.59 1 1.85 2.21 2.60 3.17 1.80 2.13 2.49 3.00 1.76 2.07 2.40 2.87 1.72 2.01 2.32 2.75 1.69 1.96 2.25 2.65 1.66 1.92 2.19 2.57
1 ˛ 0.90 0.95 0.975 0.99 0.90 0.95 0.975 0.99 0.90 0.95 0.975 0.99 0.90 0.95 0.975 0.99 0.90 0.95 0.975 0.99
1 2.99 4.38 5.92 8.18 2.97 4.35 5.87 8.10 2.79 4.00 5.29 7.08 2.75 3.92 5.15 6.85 2.71 3.84 5.02 6.63
2 2.61 3.52 4.51 5.93 2.59 3.49 4.46 5.85 2.39 3.15 3.93 4.98 2.35 3.07 3.80 4.79 2.30 3.00 3.69 4.61
3 2.40 3.13 3.90 5.01 2.38 3.10 3.86 4.94 2.18 2.76 3.34 4.13 2.13 2.68 3.23 3.95 2.08 2.60 3.12 3.78
4 2.27 2.90 3.56 4.50 2.25 2.87 3.51 4.43 2.04 2.53 3.01 3.65 1.99 2.45 2.89 3.48 1.94 2.37 2.79 3.32
5 2.18 2.74 3.33 4.17 2.16 2.71 3.29 4.10 1.95 2.37 2.79 3.34 1.90 2.29 2.67 3.17 1.85 2.21 2.57 3.02
6 2.11 2.63 3.17 3.94 2.09 2.60 3.13 3.87 1.87 2.25 2.63 3.12 1.82 2.17 2.52 2.96 1.77 2.10 2.41 2.80
m : degrees of freedom in numerator 7 8 9 10 15 20 30 2.06 2.02 1.98 1.96 1.86 1.81 1.76 2.54 2.48 2.42 2.38 2.23 2.16 2.07 3.05 2.96 2.88 2.82 2.62 2.51 2.39 3.77 3.63 3.52 3.43 3.15 3.00 2.84 2.04 2.00 1.96 1.94 1.84 1.79 1.74 2.51 2.45 2.39 2.35 2.20 2.12 2.04 3.01 2.91 2.84 2.77 2.57 2.46 2.35 3.70 3.56 3.46 3.37 3.09 2.94 2.78 1.82 1.77 1.74 1.71 1.60 1.54 1.48 2.17 2.10 2.04 1.99 1.84 1.75 1.65 2.51 2.41 2.33 2.27 2.06 1.94 1.82 2.95 2.82 2.72 2.63 2.35 2.20 2.03 1.77 1.72 1.68 1.65 1.55 1.48 1.41 2.09 2.02 1.96 1.91 1.75 1.66 1.55 2.39 2.30 2.22 2.16 1.94 1.82 1.69 2.79 2.66 2.56 2.47 2.19 2.03 1.86 1.72 1.67 1.63 1.60 1.49 1.42 1.34 2.01 1.94 1.88 1.83 1.67 1.57 1.46 2.29 2.19 2.11 2.05 1.83 1.71 1.57 2.64 2.51 2.41 2.32 2.04 1.88 1.70 40 1.73 2.03 2.33 2.76 1.71 1.99 2.29 2.69 1.44 1.59 1.74 1.94 1.37 1.50 1.61 1.76 1.30 1.39 1.48 1.59
50 1.71 2.00 2.30 2.71 1.69 1.97 2.25 2.64 1.41 1.56 1.70 1.88 1.34 1.46 1.56 1.70 1.26 1.35 1.43 1.52
60 1.70 1.98 2.27 2.67 1.68 1.95 2.22 2.61 1.40 1.53 1.67 1.84 1.32 1.43 1.53 1.66 1.24 1.32 1.39 1.47
100 1.67 1.94 2.22 2.60 1.65 1.91 2.17 2.54 1.36 1.48 1.60 1.75 1.28 1.37 1.45 1.56 1.18 1.24 1.30 1.36
500 1.64 1.89 2.15 2.51 1.62 1.86 2.10 2.44 1.31 1.41 1.51 1.63 1.21 1.28 1.34 1.42 1.08 1.11 1.13 1.15 1 1.63 1.88 2.13 2.49 1.61 1.84 2.09 2.42 1.29 1.39 1.48 1.60 1.19 1.25 1.31 1.38 1.00 1.00 1.10 1.00
Quantiles with the F-distribution. For example: if X has the F-distribution with 10 and 20 degrees of freedom in the numerator and in the denominator respectively, then P(X ≤ 2.35) = 0.95. On the other hand, in such a case 1/X has the F-distribution with 20 and 10 degrees of freedom respectively, which leads to P(X ≥ 1/2.20) = 0.90.
In the block above, each line lists, for one value of m, the quantiles for 1 − α = 0.90, 0.95, 0.975 and 0.99, successively for n = 19, 20, 60, 120 and ∞ (n: degrees of freedom in the denominator).
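The example in the caption can be cross-checked numerically with the sketch below (assuming SciPy; an illustration, not the way the table was produced).

    # Cross-check of Table V for the example in the caption.
    from scipy.stats import f

    print(f.ppf(0.95, dfn=10, dfd=20))     # about 2.35
    print(f.sf(1 / 2.20, dfn=10, dfd=20))  # P(X >= 1/2.20), about 0.90
    print(f.cdf(2.20, dfn=20, dfd=10))     # the same probability computed via 1/X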
Table VI: Wilcoxon’s signed-rank test n 0:005 0:01 0:025 0:05 3 4 5 0 6 0 2 7 0 2 3 8 0 1 3 5 9 1 3 5 8 10 3 5 8 10 11 5 7 10 13 12 7 9 13 17 13 9 12 17 21 14 12 15 21 25 15 15 19 25 30 16 19 23 29 35 17 23 27 34 41 18 27 32 40 47 19 32 37 46 53 20 37 43 52 60 21 42 49 58 67 22 48 55 65 75 23 54 62 73 83 24 61 69 81 91 25 68 76 89 100 26 75 84 98 110
0:10 0 2 3 5 8 10 14 17 21 26 31 36 42 48 55 62 69 77 86 94 104 113 124
n 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
0:005 83 91 100 109 118 128 138 148 159 171 182 194 207 220 233 247 261 276 291 307 322 339 355 373
0:01 92 101 110 120 130 140 151 162 173 185 198 211 224 238 252 266 281 296 312 328 345 362 379 397
0:025 107 116 126 137 147 159 170 182 195 208 221 235 249 264 279 294 310 327 343 361 378 396 415 434
0:05 119 130 140 151 163 175 187 200 213 227 241 256 271 286 302 319 336 353 371 389 407 426 446 466
0:10 134 145 157 169 181 194 207 221 235 250 265 281 297 313 330 348 365 384 402 422 441 462 482 503
Left critical values with Wilcoxon's signed-rank test. For example: for a sample size of n = 18, significance level α = 0.05 and left-sided alternative (the median of the population is < m0), one accepts H0 (m = m0) precisely if T+ > 47. If the alternative is right-sided one accepts H0 if T+ < ½ · 18 · 19 − 47 = 124.
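The critical values in this table can in principle be recomputed from the exact null distribution of T+. The following minimal Python sketch (not part of the book; it only assumes NumPy) builds that distribution by repeated convolution and checks the entry for n = 18 at α = 0.05.

    # Exact null distribution of Wilcoxon's T+ (no ties): T+ = sum of r * B_r,
    # where B_1, ..., B_n are independent Bernoulli(1/2) variables.
    import numpy as np

    def signed_rank_null_pmf(n):
        pmf = np.array([1.0])                           # distribution of the empty sum
        for r in range(1, n + 1):
            padded = np.concatenate([pmf, np.zeros(r)])   # contribution of B_r = 0
            shifted = np.concatenate([np.zeros(r), pmf])  # contribution of B_r = 1 (adds r)
            pmf = 0.5 * (padded + shifted)
        return pmf                                      # pmf[t] = P(T+ = t)

    cdf = np.cumsum(signed_rank_null_pmf(18))
    print(cdf[47], cdf[48])   # the table claims P(T+ <= 47) <= 0.05 < P(T+ <= 48)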
Table VII: Wilcoxon’s rank-sum test n m 0:005 0:01 0:025 0:05 3 2 3 6 4 2 3 6 4 10 11 5 2 3 3 6 7 4 10 11 12 5 15 16 17 19 6 2 3 3 7 8 4 10 11 12 13 5 16 17 18 20 6 23 24 26 28 7 2 3 3 6 7 8 4 10 11 13 14 5 16 18 20 21 6 24 25 27 29 7 32 34 36 39 8 2 3 4 3 6 8 9 4 11 12 14 15 5 17 19 21 23 6 25 27 29 31 7 34 35 38 41 8 43 45 49 51 9 1 2 3 4 3 6 7 8 10 4 11 13 14 16 5 18 20 22 24 6 26 28 31 33 7 35 37 40 43 8 45 47 51 54 9 56 59 62 66
0:10 3 7 3 7 13 4 8 14 20 4 9 15 22 30 4 10 16 23 32 41 5 11 17 25 34 44 55 1 5 11 19 27 36 46 58 70
0:10 9 14 11 17 23 12 19 26 35 14 21 29 38 48 16 23 32 42 52 64 17 25 35 45 56 68 81 10 19 28 37 48 60 73 86 101
0:05 0:025 0:01 0:005 15 18 25 26 13 20 21 28 29 30 36 38 39 40 15 22 23 31 32 33 34 40 42 43 44 50 52 54 55 17 25 26 27 34 35 37 38 44 45 47 49 55 57 59 60 66 69 71 73 18 19 27 28 30 37 38 40 41 47 49 51 53 59 61 63 65 71 74 77 78 85 87 91 93 20 21 29 31 32 33 40 42 43 45 51 53 55 57 63 65 68 70 76 79 82 84 90 93 97 99 105 109 112 115
n m 0:005 0:01 0:025 0:05 10 1 2 3 4 3 6 7 9 10 4 12 13 15 17 5 19 21 23 26 6 27 29 32 35 7 37 39 42 45 8 47 49 53 56 9 58 61 65 69 10 71 74 78 82 11 1 2 3 4 3 6 7 9 11 4 12 14 16 18 5 20 22 24 27 6 28 30 34 37 7 38 40 44 47 8 49 51 55 59 9 61 63 68 72 10 73 77 81 86 11 87 91 96 100 12 1 2 4 5 3 7 8 10 11 4 13 15 17 19 5 21 23 26 28 6 30 32 35 38 7 40 42 46 49 8 51 53 58 62 9 63 66 71 75 10 76 79 84 89 11 90 94 99 104 12 105 109 115 120
0:10 1 6 12 20 28 38 49 60 73 87 1 6 13 21 30 40 51 63 76 91 106 1 7 14 22 32 42 54 66 80 94 110 127
0:10 11 20 30 40 52 64 77 92 107 123 12 22 32 43 55 68 82 97 113 129 147 13 23 34 46 58 72 86 102 118 136 154 173
0:05 22 32 43 54 67 81 96 111 128 24 34 46 58 71 86 101 117 134 153 25 37 49 62 76 91 106 123 141 160 180
0:025 23 33 45 57 70 84 99 115 132 25 36 48 61 74 89 105 121 139 157 26 38 51 64 79 94 110 127 146 165 185
0:01 35 47 59 73 87 103 119 136 38 50 63 78 93 109 126 143 162 40 53 67 82 98 115 132 151 170 191
0:005 36 48 61 75 89 105 122 139 39 52 65 80 95 111 128 147 166 41 55 69 84 100 117 135 154 174 195
Left and right critical values with Wilcoxon's rank-sum test. The critical values belong to the sum of the ranks of the smallest sample, say TY. For example: given sample sizes n = 11 and m = 5 and given α = 0.10, in a two-sided test one accepts H0 (the medians of the two populations are the same) precisely if 27 < TY < 58.
Table VIII: The runs test 1 ˛ n m 0:005 0:01 0:025 0:05 0:10 0:10 4 1 2 3 2 7 4 2 2 8 5 1 2 2 3 2 2 7 4 2 2 3 8 5 2 2 3 3 9 6 1 2 2 3 2 2 2 4 2 2 3 3 9 5 2 2 3 3 3 9 6 2 2 3 3 4 10 7 1 2 2 3 2 2 3 4 2 2 3 3 9 5 2 2 3 3 4 10 6 2 3 3 4 4 11 7 3 3 3 4 5 11 8 1 2 2 2 3 2 2 3 4 2 2 3 3 3 9 5 2 2 3 3 4 10 6 3 3 3 4 5 11 7 3 3 4 4 5 12 8 3 4 4 5 5 13 9 1 2 2 2 3 2 2 2 3 4 2 2 3 3 4 9 5 2 3 3 4 4 10 6 3 3 4 4 5 11
0:05 0:025 0:01 0:005 7 8 9 9 9 9 10 10 9 9 10 10 11 11 11 11 12 12 9 10 11 11 11 12 12 13 12 13 13 13 11 11 12 12 13 13 13 13 14 14 13 14 14 15 11 12 13 13
1 ˛ n m 0:005 0:01 0:025 0:05 0:10 0:10 9 7 3 4 4 5 5 12 8 3 4 5 5 6 13 9 4 4 5 6 6 14 10 2 2 2 3 2 2 3 3 4 2 2 3 3 4 5 3 3 3 4 5 11 6 3 3 4 5 5 12 7 3 4 5 5 6 13 8 4 4 5 6 6 13 9 4 5 5 6 7 14 10 5 5 6 6 7 15 11 2 2 2 3 2 2 3 3 4 2 2 3 3 4 5 3 3 4 4 5 11 6 3 4 4 5 5 12 7 4 4 5 5 6 13 8 4 5 5 6 7 14 9 5 5 6 6 7 15 10 5 5 6 7 8 15 11 5 6 7 7 8 16 12 2 2 2 2 3 2 2 2 3 3 4 2 3 3 4 4 5 3 3 4 4 5 11 6 3 4 4 5 6 12 7 4 4 5 6 6 13 8 4 5 6 6 7 14 9 5 5 6 7 7 15 10 5 6 7 7 8 16 11 6 6 7 8 9 16 12 6 7 7 8 9 17
0:05 0:025 0:01 0:005 13 14 14 15 14 14 15 15 14 15 16 16 11 12 13 13 14 15 15 14 15 15 16 15 16 16 17 16 16 17 17 13 13 14 14 15 15 15 15 16 16 15 16 17 17 16 17 18 18 17 17 18 19 13 13 14 14 15 15 16 16 17 16 16 17 18 17 17 18 19 17 18 19 19 18 19 19 20
Critical values belonging to the runs test. For example: if one has two groups, consisting of 12 and 10 members respectively, then under H0 the total number of chains T satisfies the inequality P(T ≤ 6) ≤ 0.01, and 6 is the largest number for which this inequality is still correct.
Table IX: Spearman's rank correlation test

                         α
 n     0.20    0.10    0.05    0.02    0.01
 5     0.800   0.900   1.000   1.000     –
 6     0.657   0.829   0.886   0.943   1.000
 7     0.571   0.714   0.786   0.893   0.929
 8     0.524   0.643   0.738   0.833   0.881
 9     0.483   0.600   0.700   0.783   0.833
10     0.455   0.564   0.648   0.745   0.794
11     0.427   0.536   0.618   0.709   0.755
12     0.406   0.503   0.587   0.678   0.727
13     0.385   0.484   0.560   0.648   0.703
14     0.367   0.464   0.538   0.626   0.679
15     0.354   0.446   0.521   0.604   0.654
Critical values for Spearman's rank correlation coefficient. For example: for a sample size of n = 13 and α = 0.10 one accepts H0 (no correlation) if and only if |R_S| < 0.484. The values are for two-sided tests.
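These critical values can be approximated by simulation. The sketch below (not from the book; the number of replications and the random seed are arbitrary choices) uses the classical no-ties formula R_S = 1 − 6 Σ dᵢ² / (n(n² − 1)) and estimates the tail probability at the tabulated value 0.484 for n = 13.

    # Monte Carlo approximation of the null distribution of Spearman's R_S.
    import numpy as np

    rng = np.random.default_rng(1)        # arbitrary seed
    n, reps = 13, 100_000
    ranks = np.arange(1, n + 1)
    sims = np.empty(reps)
    for i in range(reps):
        d = ranks - rng.permutation(ranks)                 # rank differences under H0
        sims[i] = 1 - 6 * np.sum(d * d) / (n * (n * n - 1))

    # Estimated P(|R_S| >= 0.484); by the table this should be close to (at most) 0.10.
    print(np.mean(np.abs(sims) >= 0.484))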
Table X: Kendall’s tau n 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0:20 6 8 9 11 12 14 17 19 20 24 25 29 30 34 37 39 42
0:10 6 8 11 13 16 18 21 23 26 28 33 35 38 42 45 49 52
˛ 0:05 10 13 15 18 20 23 27 30 34 37 41 46 50 53 57 62
0:02 10 13 17 20 24 27 31 36 40 43 49 52 58 63 67 72
0:01 15 19 22 26 29 33 38 44 47 53 58 64 69 75 80
n 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 40
0:20 44 47 51 54 58 61 63 68 70 75 77 82 86 89 93 112
0:10 56 61 65 68 72 77 81 86 90 95 99 104 108 113 117 144
˛ 0:05 66 71 75 80 86 91 95 100 106 111 117 122 128 133 139 170
0:02 78 83 89 94 100 107 113 118 126 131 137 144 152 157 165 200
0:01 86 91 99 104 110 117 125 130 138 145 151 160 166 175 181 222
Critical values for the sum of concordances with Kendall's tau. For example: for a sample size of n = 19 and α = 0.10 one accepts H0 (no correlation) if (½ · 19 · 18) |R_K| < 49. The values are for two-sided tests.
Table XI: The Kolmogorov–Smirnov test
n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0:10 0:950 0:776 0:636 0:565 0:509 0:468 0:436 0:410 0:388 0:369 0:352 0:338 0:326 0:314 0:304 0:295 0:286 0:279 0:271 0:265 0:259 0:253 0:248 0:242 0:238
˛ 0:05 0:02 0:975 0:990 0:842 0:900 0:708 0:785 0:624 0:689 0:563 0:627 0:519 0:577 0:483 0:538 0:454 0:507 0:430 0:480 0:409 0:457 0:391 0:437 0:375 0:419 0:361 0:404 0:349 0:390 0:338 0:377 0:327 0:366 0:318 0:355 0:309 0:346 0:301 0:337 0:294 0:329 0:287 0:321 0:281 0:314 0:275 0:307 0:269 0:301 0:264 0:295
0:01 0:995 0:929 0:829 0:734 0:669 0:617 0:576 0:542 0:513 0:489 0:468 0:449 0:433 0:418 0:404 0:392 0:381 0:371 0:361 0:352 0:344 0:337 0:330 0:323 0:317
n 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
0:10 0:234 0:229 0:225 0:221 0:218 0:214 0:211 0:208 0:205 0:202 0:199 0:197 0:194 0:192 0:189 0:187 0:185 0:183 0:181 0:179 0:177 0:175 0:173 0:172 0:170
˛ 0:05 0:02 0:259 0:290 0:254 0:284 0:250 0:279 0:246 0:275 0:242 0:270 0:238 0:266 0:234 0:262 0:231 0:258 0:227 0:254 0:224 0:251 0:221 0:247 0:218 0:244 0:215 0:241 0:213 0:238 0:210 0:235 0:208 0:232 0:205 0:229 0:203 0:227 0:201 0:224 0:198 0:222 0:196 0:219 0:194 0:217 0:192 0:215 0:190 0:213 0:188 0:211
0:01 0:311 0:305 0:300 0:295 0:290 0:285 0:281 0:277 0:273 0:269 0:265 0:262 0:258 0:255 0:252 0:249 0:246 0:243 0:240 0:238 0:235 0:233 0:230 0:228 0:226
Quantiles belonging to the Kolmogorov–Smirnov test statistic. For example, denoting this test statistic by Dn, one has: P(D20 ≤ 0.294) = 0.95.
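Recent versions of SciPy expose the exact null distribution of the two-sided statistic Dn as scipy.stats.kstwo; assuming that is available, the entry for n = 20 at α = 0.05 can be checked with the following sketch (an illustration, not part of the book).

    # Cross-check of Table XI: 0.95-quantile of the null distribution of D_n for n = 20.
    from scipy.stats import kstwo

    print(kstwo.ppf(0.95, n=20))   # should be close to the tabulated 0.294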
Bibliography
[1] Ahlfors, L. V., Complex Analysis, McGraw-Hill Inc., New York, 1966. [2] Alberink, I. B. and Pestman, W. R., Mathematical Statistics, Problems and Detailed Solutions, Walter de Gruyter, Berlin, 1997. [3] Alberink, I. B., Polynomial estimators of minimal variance, Report 9531, Dept. of Math. University of Nijmegen, Holland, 1995. [4] Anderson, T. W., An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York, 1984. [5] Artin, E., The gammafunction, Holt, Rinehart and Winston, New York, 1964. [6] Bauer, H., Probability Theory, Walter de Gruyter, Berlin, 1996. (This book is also available in the German language.) [7] Behnke, H. and Sommer, F., Theorie der analytischen Funktionen einer komplexen Veränderlichen, 2nd edition, Springer-Verlag, Berlin, 1962. [8] Bickel, P. J. and Doksum, K. A., Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco, 1977. [9] Bickel, P. J. and Freedman, D. A., Some asymptotic theory for the bootstrap, Annals of Statistics 9 (1981), no. 6, 1196–1217. [10] Billingsley, P., Convergence of Probability Measures, John Wiley & Sons, New York, 1968. [11] Billingsley, P., Probability and Measure, John Wiley & Sons, New York, 1994. [12] Blair, R. C. and Higgins, J. J., A comparison of the power of Wilcoxon’s rank-sum statistic to that of Student’s t -statistic under various non-normal distributions, Journal of Educational Statistics 5 (1980), 309–335. [13] Blair, R. C. and Higgins, J. J., Comparison of the power of the paired samples t -test to that of Wilcoxon’s signed-ranks test under various population shapes, Psychological Bulletin 97 (1985), 119–128. [14] Blatter, Chr., Analysis I, II, Heidelberger Taschenbücher, Springer-Verlag, Berlin, 1992. [15] Bourbaki, N., Eléments de mathématique, livre VI, Intégration, chapitre IX, Hermann, Paris, 1969. [16] Courant, R. and Hilbert, D., Methoden der Mathematischen Physik I, II, 2nd edition, Heidelberger Taschenbücher, Springer-Verlag, Berlin, 1968. [17] Dudley, R. M., Probabilities and Metrics, Convergence of Laws on Metric Spaces with a View to Statistical Testing, Lecture Notes Series 45, Matematisk Institut, Aarhus Universitet, 1976.
[18] Durrett, R., Probability: Theory and Examples, Wadsworth & Brooks/Cole, Belmont, California, 1991. [19] Dym, H. and McKean, H. P., Fourier Series and Integrals, Academic Press, New York, 1972. [20] Efron, B. and Tibshirani, J., An Introduction to the Bootstrap, Chapman & Hall, London, 1993. [21] Feller, W., An Introduction to Probability Theory and its Applications I, II, John Wiley & Sons Inc., New York, 1971. [22] Ferguson, T. S., Mathematical Statistics: A Decision Theoretic Approach, Academic Press Inc., New York, 1967. [23] Fisher, R. A., Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population, Biometrika 10 (1915), 507–521. [24] Freund, J. E., Mathematical Statistics, Prentice Hall Inc., London, 1992. [25] Gibbons, J. D. and Chakraborti, S., Nonparametric Statistical Inference, Dekker Inc., New York, 1992. [26] Grimmett, G. R. and Stirzaker, D. R., Probability and Random Processes, Oxford University Press, Oxford, 1993. [27] Grimmett, G. R. and Welsh, D. J. A., Probability. An Introduction, Clarendon Press, Oxford, 1986. [28] Halmos, P. R., Measure Theory, Van Nostrand Company Inc., New York, 1950. [29] Halmos, P. R. and Savage, L. R., Application of the Radon–Nikodym theorem to the theory of sufficient statistics, Annals of Mathematical Statistics 20 (1949), 225–241. [30] Hampel, F.R., The influence curve and its role in robust estimation, Journal of the American Statistical Association 69 (1974), 383–393. [31] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A., Robust Statistics, John Wiley & Sons, New York, 1986. [32] Hajék, J., Nonparametric Statistics, Holden-Day, San Francisco, 1969. [33] Hajék, J. and Sidak, Z., Theory of Rank Tests, Academic Press, New York, 1967. [34] Härdle, W., Smoothing Techniques, Springer-Verlag, Berlin, 1991. [35] Hays, W. L., Statistics, Holt, Rinehart and Winston, New York, 1981. [36] Hoffmann-Jørgensen, J., Probability with a View Toward Statistics I, II, Chapman & Hall, London, 1994. [37] Hogg, R. V. and Craig, A. T., Introduction to Mathematical Statistics, Macmillan Inc., New York, 1978. [38] Hotelling, H., New light on the correlation coefficient and its transforms, Journal of the Royal Statistical Society, B 15 (1953), 193–225. [39] Huber, P. J., Robust Statistics, John Wiley & Sons, New York, 1981.
[40] Kendall, M. G. and Stuart, A., The Advanced Theory of Statistics I, II, Macmillan Inc., New York, 1977. [41] Körner, T. W., Fourier Analysis, Cambridge University Press, Cambridge, 1993. [42] Kortram, R. A., De theorie van complexe functies, Epsilon Uitgaven, Utrecht, 1989. [43] Kortram, R. A. and van Rooij, A., Analyse, Epsilon Uitgaven, Utrecht, 1990. [44] Kruskal, W. H., A nonparametric test for the several sample problem, Annals of Math. Statistics 23 (1952), 525–540. [45] Lang, S., Real and Functional Analysis, Springer-Verlag, Berlin, 1993. [46] Lehmann, E. L., The Theory of Point Estimation, John Wiley & Sons Inc., New York, 1983. [47] Lehmann, E. L., Testing Statistical Hypotheses, John Wiley & Sons Inc., New York, 1986. [48] Lehmann, E. L., Nonparametrics: Statistical Methods Based on Ranks, Holden-Day Inc., San Francisco, 1975. [49] Lindgren, B. W., Statistical Theory, Chapman Hall, London, 1993. [50] Loève, M., Probability Theory I, II, Springer-Verlag, Berlin, 1978. [51] Lukacz, E., A characterization of the normal distribution, Annals of Mathematical Statistics 13 (1942), 91–93. [52] Messer, R., Linear Algebra, Gateway to Mathematics, Harper Collins, New York, 1994. [53] Mood, A. M., Graybill, F. A. and Boes, D. C., Introduction to the Theory of Statistics, McGraw-Hill, New York, 1974. [54] Neveu, J., Calcul des Probabilités, Masson, Paris, 1980. [55] Nicholson, K. W., Linear Algebra (with applications), PWS Publishing Company, Boston, 1995. [56] Pestman, W. R., Measurability of linear operators in the Skorokhod topology, Bull. of the Belgian Math. Soc. 2(1995), no. 4, 381–388. [57] Pfanzagl, J., Parametric Statistical Theory, Walter de Gruyter, Berlin, 1994. [58] Reed, M. and Simon, B., Functional Analysis, Academic Press, New York, 1972. [59] Rice, J. A., Mathematical Statistics and Data Analysis, Wadsworth & Brooks, Belmont, California, 1988. [60] Riesz, F. and Nagy, B. Sz., Leçons d’Analyse Fonctionelle, Akadémiai Kiadó, Budapest, 1952. (This work is also available in the English language (by Frederick Ungar Publ., New York).) [61] Rohatgi, V. K., An Introduction to Probability Theory and Mathematical Statistics, John Wiley & Sons Inc., New York, 1976. [62] Romano, J. P. and Siegel, A. F., Counterexamples in Probability and Statistics, Wadsworth & Brooks/Cole, Belmont California, 1986.
[63] Rudin, W., Principles of Mathematical Analysis, 3rd edition, McGraw-Hill Inc., New York, 1976. [64] Rudin, W., Real and Complex Analysis, Mc Graw-Hill Inc., New York, 1974. [65] Satake, I., Linear Algebra, Dekker Inc., New York, 1975. [66] Schwartz, L., Mathematics for the Physical Sciences, Hermann, Paris, 1966. (This work is also available in the German language in the series “Heidelberger Taschenbücher”, Springer-Verlag, Berlin.) [67] Schwartz, L., Radon Measures on Arbitrary Topological Spaces and Cylindrical Measures, Oxford University Press, London, 1973. [68] Silverman, B. W., Density Estimation, Chapman & Hall, London, 1993. [69] Skitovitch, V. P., Linear forms of independent random variables and the normal distribution law, Izvestia Acad. Nauk. SSSR 18 (1954), 185–200. [70] Snedecor, G. W. and Cochran, W. G., Statistical Methods, 8th edition, Iowa University Press, Iowa, 1994. [71] Ventsel, H., Théorie des Probabilités, Éditions Mir, Moscou, 1977. [72] Waerden, B. L. van der, Mathematical Statistics, Springer-Verlag, Berlin, 1969. (This is a translation of the originally German version.) [73] Wand, M. P. and Jones, M. C., Kernel Smoothing, Chapman & Hall, London, 1995. [74] Widder, D. V., The Laplace Transform, Princeton University Press, Princeton, 1941. [75] Wilks, S. S., Mathematical Statistics, John Wiley & Sons Inc., New York, 1962.
Index
A absolutely continuous, 12, 40 adjoint operator, 427 affine subspace, 421 algebra, 422 α-trimmed mean of the population, 343 alternative left-sided, 170 one-sided, 170 right-sided, 170 two-sided, 170 analysis of variance, 229–249 angle (between two vectors), 425
B band width, 395 basis, 420 orthonormal, 425 standard, 420 Bayesian estimates, 107 Bayesian estimation, 102 Behrens–Fisher problem, 86 Bernoulli distributed, 100 beta distribution, 97 beta function, 94 binomial distribution, 12 bootstrap, 384–392 distribution function, 385 sample, 386 statistic, 386 works, 390 Borel function, 3, 55 Borel measure, 3 Borel paradox, 109 Borel set, 3 bounded Lipschitz metric, 327
C cadlag function, 310
Cauchy distribution, 13, 143 Cauchy–Schwarz–Bunyakovski inequality, 29, 425 central limit theorem, 48–52 vectorial version, 489 characteristic, 112 trivial, 112 characteristic function, 112 Chebyshev inequality, 58, 115 χ²-distribution, 72 χ²-test – on goodness of fit, 182–188 – on independence, 188–193 clustering, 261 Cochran's theorem, 458 coefficient of determination, 224 cofactor, 432 collinear, 220 conditional probability, 102, 463–470 conditional probability density, 103 confidence interval, 77 confidence region, 459, 472–475 consistency of a statistical functional, 361 contamination, 373 continuity correction, 152 contour lines, 450 controlled variable, 203 convergence, 284–303, 486–496 almost sure, 289, 486 – in distribution, 44, 487 – in measure, 285 – in probability, 285, 487 strong, 289, 486 convex, 118 convolution product, 63, 330 correlation coefficient, 30 covariance, 26
covariance operator, 436–440, 466 Cramér–Rao inequality, 128 critical region, 156, 157 based on likelihood ratio, 165
D (d1, d2)-continuous, 335 (d1, d2)-robust, 336 density, 6 conditional probability, 103 density estimation, 392–404 density of normal distribution, 460 design matrix, 507 determinant, 432 diagonal operator, 421 diagonalized form (of operators), 428 differences in mean, 83 dimension, 420 dimension reduction, 461 Dirac measure, 7 discrete probability density, 12 discrete probability distribution, 11 dispersion functional, 368 distribution function, 38 joint, 41 distribution of S², 75 distribution product, 328
E effect, 231, 238 efficient, 115 eigenvalue, 428 eigenvector, 428 elementary Gaussian distribution, 451 empirical density function, 392 empirical distribution function, 282–284 empirical quantile function, 342 Epanechnikov kernel, 395 error, 65, 201 type I, 156 type II, 156 essentially equal, 118
estimation theory, 65–154 estimator, 112 linear, 115, 204 unbiased, 113, 361 event, 1 exchange of coordinates, 121 expectation (value), 21 expectation vector, 434 exponential distribution, 70
F F -distribution, 90 factor, 229 factorization theorem, 135 family of probability densities, 112 Fisher information, 128 Fisher’s Z, 504 form linear, 427 multilinear, 431 Friedman’s test, 273–277 Fubini’s theorem, 24 full model, 513 functions of stochastic vector, 18
G gamma distribution, 70 gamma function, 70 Gaussian distribution, 450–463 elementary, 451 Gauß–Markov theorem, 509 generalized linear regression, 218 geometrical distribution, 12 Glivenko–Cantelli theorem, 306 goodness of fit, 182 grand population mean, 229, 230, 236 grand sample mean, 231
H Hilbert–Schmidt inner product, 434 histogram, 401 histogram density estimation, 401–404
homoscedasticity, 205 hypothesis (statistical), 155 alternative, 155 composite, 155 null, 155 simple, 155 test, 157 hypothesis testing, 155–199
I image measure, 13 independence of X and S 2 , 74 independence, statistical, 14–19, 36 independent, statistical, 48 inequality of Cauchy–Schwarz–Bunyakovski, 29, 425 Chebyshev, 58, 115 Cramér–Rao, 128 Markov, 57 influence function, 374, 377 information inequality, 125 inner product, 424, 433 Hilbert–Schmidt, 434 standard, 424 interaction, 238 interquartile distance, 368 interval partition, 401 invertible matrix, 424 invertible operator, 422 isoperimetric problem, 118
J joint distribution function, 41 joint probability distribution, 13
K Kendall’s coefficient of concordance, 268–270 Kendall’s tau, 269 kernel, 395 kernel (of operator), 428
kernel density, 395 kernel density estimate, 395 Kolmogorov–Smirnov test, 311–316 Kruskal–Wallis test, 270–273
L law of large numbers strong, 299 vectorial version, 486 weak, 288 least squares estimate, 201 least squares estimation, 506 least squares method, 201–203 Lebesgue integral, 6 Lebesgue measure, 3 length (of a vector), 424 level mean, 236 level of significance, 158 level sample mean, 231 Lévy metric, 319 completeness, 324 separability, 326 Lévy’s theorem, 45, 322 vectorial form, 488 likelihood function, 129 likelihood ratio, 164 likelihood ratio test, 165 linear algebra, 419–434 linear combination, 419 linear estimator, 115, 204 linear form, 427 linear independence, 419 linear operator, 421 orthogonal, 428 trace, 430 linear span, 419 linear statistical functional, 369–371, 492–496 linear structure, 419 linear subspace, 420 location functional, 365–367 logistic regression, 218
M Mann–Whitney test, 261 marginal probability distribution, 14 Markov inequality, 57 matrix, 422 – of a linear operator, 422 symmetric, 424 transposed, 424 maximum likelihood estimator, 130 –s of (µ, C), 485 mean of squared errors, 232 squared level deviations, 234 measurable, 10 measure, 2, 56 probability, 8 product, 14 median of – a population, 348 – a sample, 348 metric bounded Lipschitz, 327 Lévy, 319 norm, 310 Prokhorov, 327 Skorokhod, 327 midrank, 273 minimum variance estimator, 111–129 mixing, 262 model full, 513 reduced, 513–517 molecular velocities, 72–73 moment, 42 moment generating function, 42 Monte-Carlo-methods, 388 MSE, 234 MSL, 234 multicollinearity, 517 multilinear form, 431 symmetric, 431
multinomial distribution, 150 multiple correlation coefficient, 517– 523 multiple linear regression, 505–517 multiple regression analysis, 505–517 multiple scale transformation, 520 multivariate sample, 446
N Neyman–Pearson lemma, 161 non-parametric methods, 250–281 non-parametric test, 250 non-robustness of sample mean, 139, 340 norm metric, 310 normal analysis of variance, 229–249 normal correlation analysis, 497–504 normal distribution, 10, 30–38 density of, 31, 460 vectorial, 450–463 normal equations, 202 normal kernel, 395 normal regression analysis, 213–217, 505
O open support, 41 operator covariance, 436–440 diagonal, 421 invertible, 422 kernel, 428 linear, 421 matrix, 422 orthogonal, 428 trace, 430 positive type, 429 range, 428 self adjoint, 428 spectrum, 428 symmetric, 428 orthogonal complement, 425
orthogonal linear operator, 428 orthogonal projection, 426 orthogonal system, 425 orthogonal vectors, 425 orthonormal basis, 425
P P -value, 173 paired sample, 88 parameter space, 112 partial correlation coefficient, 523 Pearson’s product-moment correlation coefficient, 219–221, 498 point estimate, 78 Poisson distribution, 11 Poisson regression, 218 pooled variance, 86 population level mean, 236 positive type (of operators), 429 power (function), 158 principal components, 440 probability conditional, 102 probability density, 11–12 probability distribution, 10 absolutely continuous, 12 discrete, 11 joint, 13 marginal, 14 probability measure, 8 probability space, 8 product measure, 14 projection, 13 Prokhorov condition, 413 Prokhorov metric, 327 proportional, 220
Q quantile function, 291, 406 empirical, 342
R range (of operators), 428
rank correlation tests, 265–270 reduced model, 513–517 regression analysis, 200–228, 505–517 regular (matrix), 424 residual, 201 residual vector, 214 response variable, 203 robustness, 139, 334–342 rotation invariance, 33 runs test, 261–265
S sample, 66 bivariate, 266 median, 348 multivariate, 446 paired, 88 vectorial, 266, 446–450 sample covariance operator, 448 sample mean, 66 sample variance, 68 scale transformation, 46, 221, 480 multiple, 520 scatter diagram, 200 self adjoint (operator), 428 σ-algebra, 1, 55 generated by, 3, 55 smallest, 3, 55 sign test, 250–253 size of critical region, 158 Skorokhod metric, 327 Skorokhod's theorem, 294 Slutsky's theorem, 490 smoothing, 328–333 smoothing techniques, 333 Spearman's rank correlation coefficient, 266 spectrum (of operators), 428 standard basis, 420 standard deviation, 30 standard inner product, 424 standard twin, 292
standardization, 28 standardized form, 28, 457 statistic, 66 statistical functional, 358–371 continuity, 361 strong consistency, 361 weak consistency, 361 statistical independence, 48 statistical quantity, 66 statistically independent, 48 stochastic variable, 10 stochastic vector, 10 stoutly tailed distribution function, 339, 558 stoutly tailed distribution function, 337 strict median, 348 strong consistency (of a statistical functional), 361 strong law of large numbers, 299 vectorial version, 486 subspace affine, 421 linear, 420 sufficiency, 133–139 – of (X, S²), 483 – of (α̂, β̂, SSE), 511 sum of squares of errors, 201, 211, 506 sum of squares of regression, 222 symmetric (essentially), 122 symmetric matrix, 424 symmetric multilinear form, 431 symmetric operator, 428
T t -distribution, 82 tail probability (left/right), 338 tailless (distribution function), 339 tensor product of two vectors, 429 test statistic, 169 theorem of Cochran, 458 Fubini, 24
Gauß–Markov, 509 Glivenko–Cantelli, 306 Lévy, 45, 322 vectorial form, 488 Skorokhod, 294 Slutsky, 490 tie, 257 topological vector spaces, 301 total variance, 441 trace (of operators), 430 transformation – of probability densities, 53–55 scale, 46, 221, 480 multiple, 520 transposed (of a matrix), 424 triangular kernel, 395 trim level, 343 trimmed (population) mean, 343 trivial characteristic, 112 twin (sequence), 290 type I error, 156 type II error, 156
U U -test, 261 U.M.V.U. estimator, 120 unbiased estimator, 113, 361 uncorrelated components, 438, 466 uniform distribution, 25
V variable controlled, 203 response, 203 variance, 24 total, 441 vector, 419 vectorial form of Lévy’s theorem, 488 vectorial normal distribution, 450–463 vectorial sample, 266, 446–450 vectorial statistics, 419–528
vectorial version of the – central limit theorem, 489 – strong law of large numbers, 486 von Mises derivative, 372–384 von Mises hull, 373
W weak consistency (of a statistical functional), 361 weak law of large numbers, 288 white noise, 457 Wilcoxon’s rank-sum test, 256–261 Wilcoxon’s signed-rank test, 253–256 Winsorized mean, 367 Wishart distribution, 475