de Gruyter Textbook Pestman • Mathematical Statistics
At a young age, the author's inability to convert statistical data into round figures caused much distress
Wiebe R. Pestman
Mathematical Statistics An Introduction
Walter de Gruyter · Berlin · New York 1998
Author Wiebe R. Pestman University of Nijmegen NL-6525 ED Nijmegen The Netherlands
1991 Mathematics Subject Classification: 62-01 Keywords: Estimation theory, hypothesis testing, regression analysis, non-parametrics, stochastic analysis, vectorial (multivariate) statistics
© Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress - Cataloging-in-Publication Data Pestman, Wiebe R„ 1954— Mathematical statistics : an introduction / Wiebe R. Pestman. p. cm. - (De Gruyter textbook) Includes bibliographical references and index. ISBN 3-11-015357-2 (hardcover: alk. paper). ISBN 3-11-015356-4 (pbk.: alk. paper) 1. Mathematical statistics. I. Title. II. Series. QA276.P425 1998 519.5-dc21 98-18771 CIP
Die Deutsche Bibliothek - Cataloging-in-Publication Data Pestman, Wiebe R.: Mathematical statistics - an introduction / Wiebe R. Pestman - Berlin; New York : de Gruyter, 1998 (De Gruyter textbook) Erg. bildet: Pestman, Wiebe R.: Mathematical statistics - problems and detailed solutions ISBN 3-11-015356-4 brosch. ISBN 3-11-015357-2 Gb. © Copyright 1998 by Walter de Gruyter GmbH & Co., D-10785 Berlin. All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the publisher. Printed in Germany. Typesetting using the author's TgX files: I. Zimmermann, Freiburg. Printing and Binding: WB-Druck GmbH & Co., Rieden/Allgäu.
Preface
This book arose from a series of lectures I gave at the University of Nijmegen (Holland). It presents an introduction to mathematical statistics and is intended for students who already have some elementary mathematical background. In the text theoretical results are presented as theorems, propositions or lemmas, of which as a rule rigorous proofs are given. In the few exceptions to this rule, references are given to indicate where the missing details can be found. In order to understand the proofs, an elementary course in calculus and linear algebra is a prerequisite. However, a reader who is not interested in studying proofs can simply skip them and instead study the many examples, in which applications of the theoretical results are illustrated. In this way the book is also useful to students in the applied sciences. To have the starting statistician well prepared in the field of probability theory, Chapter I is wholly devoted to this subject. Nowadays many scientific articles (in statistics and probability theory) are presented in terms of measure-theoretical notions. For this reason, Chapter I also contains a short introduction to measure theory (without getting submerged by it). However, the subsequent chapters can very well be read when skipping this section on measure theory, and my advice to the student reading it is therefore not to get bogged down by it. The aim of the remainder of the book is to give the reader a broad and solid base in mathematical statistics. Chapter II is devoted to estimation theory. The probability distributions of the usual elementary estimators, when dealing with normally distributed populations, are treated in detail. Furthermore, there is in this chapter a section about Bayesian estimation. In the final sections of Chapter II, estimation theory is put into a general framework. In these sections, for example, the information inequality is proved. Moreover, the concept of maximum likelihood estimation and that of sufficiency are discussed. In Chapters III, IV and V the classical subjects in mathematical statistics are treated: hypothesis testing, normal regression analysis and normal analysis of variance. In these chapters there is no "cheating" concerning the notion of statistical independence. Chapter VI presents an introduction to non-parametric statistics. In Chapter VII it is illustrated how stochastic analysis can be applied in modern statistics. Here subjects like the Kolmogorov-Smirnov test, smoothing techniques, robustness, density estimation and bootstrap methods are treated. Finally, Chapter VIII is about vectorial statistics. I sometimes find it annoying, when teaching this subject, that many undergraduate students no longer seem to be aware of the fundamental theorems of linear algebra. To meet this need I have given a summary of the main results of elementary linear algebra in §VIII.1.
The book is written in quite simple English; this might be an advantage to students who do not have English as their native language (like myself). The text contains some 260 exercises, which vary in difficulty. Ivo Alberink, one of my students, has expounded detailed solutions of all these exercises in a companion book, titled:
Mathematical Statistics: Problems and Detailed Solutions

Ivo has also read this book (which you now have in front of you) in a critical way. The many discussions which arose from this have been very useful. I wish to thank him for this. I hereby also wish to thank Inge Stappers, Flip Klijn and many other students at the University of Nijmegen. They too read the text critically and gave constructive comments. Furthermore I would like to thank Alessandro di Bucchianico and Mark van de Wiel from the Eindhoven University of Technology for providing most of the statistical tables. Finally, I am very indebted to the secretariat of the department of mathematics in Nijmegen, in particular to Hanny Heitink. She did almost all the typewriting, a colossal amount of work.
Nijmegen, March 1998
Wiebe R. Pestman
Table of contents

Chapter I. Probability theory
I.1 Probability spaces
I.2 Stochastic variables
I.3 Product measures and statistical independence
I.4 Functions of stochastic vectors
I.5 Expectation, variance and covariance of stochastic variables
I.6 Independence of normally distributed variables
I.7 Distribution functions and probability distributions
I.8 Moments, moment generating functions and characteristic functions
I.9 The central limit theorem
I.10 Transformation of probability densities
I.11 Exercises

Chapter II. Statistics and their probability distributions, estimation theory
II.1 Introduction
II.2 The gamma distribution and the χ²-distribution
II.3 The t-distribution
II.4 Statistics to measure differences in mean
II.5 The F-distribution
II.6 The beta distribution
II.7 Populations which are not normally distributed
II.8 Bayesian estimation
II.9 Estimation theory in a more general framework
II.10 Maximum likelihood estimation, sufficiency
II.11 Exercises

Chapter III. Hypothesis tests
III.1 The Neyman-Pearson theory
III.2 Hypothesis tests concerning normally distributed populations
III.3 The χ²-test on goodness of fit
III.4 The χ²-test on statistical independence
III.5 Exercises

Chapter IV. Simple regression analysis
IV.1 The method of least squares
IV.2 Construction of an unbiased estimator of σ²
IV.3 Normal regression analysis
IV.4 Pearson's product-moment correlation coefficient
IV.5 The sum of squares of errors as a measure of the amount of linear structure
IV.6 Exercises

Chapter V. Normal analysis of variance
V.1 One-way analysis of variance
V.2 Two-way analysis of variance
V.3 Exercises

Chapter VI. Non-parametric methods
VI.1 The sign test, Wilcoxon's signed-rank test
VI.2 Wilcoxon's rank-sum test
VI.3 The runs test
VI.4 Rank correlation tests
VI.5 The Kruskal-Wallis test
VI.6 Friedman's test
VI.7 Exercises

Chapter VII. Stochastic analysis and its applications in statistics
VII.1 The empirical distribution function associated with a sample
VII.2 Convergence of stochastic variables
VII.3 The Glivenko-Cantelli theorem
VII.4 The Kolmogorov-Smirnov test statistic
VII.5 Metrics on the set of distribution functions
VII.6 Smoothing techniques
VII.7 Robustness of statistics
VII.8 Trimmed means, the median, and their robustness
VII.9 Statistical functionals
VII.10 The von Mises derivative; influence functions
VII.11 Bootstrap methods
VII.12 Estimation of densities by means of kernel densities
VII.13 Estimation of densities by means of histograms
VII.14 Exercises

Chapter VIII. Vectorial statistics
VIII.1 Linear algebra
VIII.2 The expectation vector and the covariance operator of stochastic vectors
VIII.3 Vectorial samples
VIII.4 The vectorial normal distribution
VIII.5 Conditional probability distributions that emanate from Gaussian ones
VIII.6 Vectorial samples from Gaussian distributed populations
VIII.7 Vectorial versions of the fundamental limit theorems
VIII.8 Normal correlation analysis
VIII.9 Multiple regression analysis
VIII.10 The multiple correlation coefficient
VIII.11 Exercises

Appendix A. Lebesgue's convergence theorems
Appendix B. Product measures
Appendix C. Conditional probabilities
Appendix D. The characteristic function of the Cauchy distribution
Appendix E. Metric spaces, equicontinuity
Appendix F. The Fourier transform and the existence of stoutly tailed distribution functions
List of elementary probability densities
Frequently used symbols
Statistical tables
References
Index
Chapter I Probability theory
I.1 Probability spaces

Probability theory is the branch of applied mathematics concerning probability experiments: experiments in which randomness plays a principal part. Trying to capture the notion of a probability experiment in a definition, one could say: A probability experiment is an experiment which, when repeated under exactly the same conditions, does not necessarily give the same results. In a probability experiment there is, a priori, always a variety of possible outcomes. The set of all possible outcomes of a probability experiment is said to be the sample space. This set will systematically be denoted by the symbol Ω.

Example 1. Throwing a die, the possible outcomes are the throwing of 1, 2, 3, 4, 5 or 6 pips. Here we can write Ω := {1, 2, 3, 4, 5, 6}. •

Example 2. Molecules of a gas collide mutually. The distance covered by some molecule between two successive collisions is a random quantity. Such a distance is always a positive number. We can set Ω := (0, +∞). •

Definition I.1.1. An event is understood to be a subset of Ω.

Example 3. In Example 1 the event "an even number of pips has been thrown" corresponds to the subset A = {2, 4, 6} of Ω. •

Example 4. In Example 2 the event "the distance between two successive collisions is less than 1" corresponds to the interval (0, 1). •

Generally, in a probability model not all subsets of Ω are admissible; here measure-theoretical complications are involved, which we shall not discuss in this book. In a probability model the collection 𝔄 of so-called "admissible events" is presented by a collection of subsets of Ω satisfying more or less specific properties. To this end the following definition:

Definition I.1.2. A collection 𝔄 of subsets of Ω is said to be a σ-algebra if it satisfies the following three properties:
(i) Ω ∈ 𝔄,
(ii) if A ∈ 𝔄 then also Aᶜ ∈ 𝔄,
(iii) if A₁, A₂, … ∈ 𝔄 then also ⋃_{i=1}^∞ Aᵢ ∈ 𝔄.
In general, in a probability experiment, one defines an associated sample space together with a σ-algebra of so-called admissible events. Now suppose that 𝔄 is such a σ-algebra of subsets of Ω. We shall look upon 𝔄 as being the collection of all possible events. With every event A ∈ 𝔄 is associated the chance that A is going to occur. We shall denote this chance by P(A), a real number in the interval [0, 1]. Roughly speaking, we can interpret this number in the following way:

i) P(A) = 1: the event A is certain to occur;
ii) P(A) = 0: the event A will not occur;
iii) P(A) = 0.3: when the experiment is repeated a large number of times, the event A occurs in roughly 30% of the repetitions.
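The frequency interpretation above is easy to check empirically. The following short Python sketch (not part of the original text; the event A = {2, 4, 6} is taken from Example 3) simulates a large number of die throws and compares the relative frequency of A with the exact value P(A) = 1/2.

    import random

    def relative_frequency(event, trials=100_000):
        # Simulate 'trials' throws of a fair die and count how often
        # the outcome falls in the given event (a subset of {1,...,6}).
        hits = sum(1 for _ in range(trials) if random.randint(1, 6) in event)
        return hits / trials

    A = {2, 4, 6}                      # "an even number of pips"
    print(relative_frequency(A))       # close to 0.5 for a large number of trials

For a growing number of trials the printed value stabilizes around 0.5, which is exactly the behaviour that interpretation iii) describes.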
… a ≥ 0, then ∫ af dμ = a ∫ f dμ.

To extend the notion of an integral to Borel functions f : ℝⁿ → ℝ which are not necessarily positive, we continue in the following way. Write

A⁺ := {x ∈ ℝⁿ : f(x) ≥ 0}  and  A⁻ := {x ∈ ℝⁿ : f(x) < 0}.

Now the functions

f⁺ = 1_{A⁺} f  and  f⁻ = −1_{A⁻} f

present positive Borel functions (see Exercise 7) and one has

f = f⁺ − f⁻  and  |f| = f⁺ + f⁻.

Thus, it is natural to define

∫ f dμ := ∫ f⁺ dμ − ∫ f⁻ dμ.

To avoid in this definition the (undefined) expression (+∞) − (+∞), we assume that

∫ f⁺ dμ < +∞  and  ∫ f⁻ dμ < +∞.

This is the same as saying that

∫ |f| dμ < +∞.

Definition I.1.5. A Borel function f : ℝⁿ → ℝ satisfying the assumption above is said to be μ-summable or μ-integrable.

It turns out (see Exercise 7) that for every Borel function f : ℝⁿ → ℝ and every Borel set A ⊂ ℝⁿ the function 1_A f is again a Borel function. If f ≥ 0, or if f is μ-summable, one defines

∫_A f dμ := ∫ 1_A f dμ.
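As a quick sanity check of this construction, one can compute the integral of a sign-changing function by integrating its positive and negative parts separately. The sketch below (an added illustration, not from the book; it uses the Lebesgue measure on an interval, a simple Riemann sum as a stand-in for the abstract integral, and the arbitrary test function f(x) = x³ − x) verifies numerically that ∫ f dλ = ∫ f⁺ dλ − ∫ f⁻ dλ on [−2, 2].

    import numpy as np

    x, dx = np.linspace(-2.0, 2.0, 200_001, retstep=True)
    f = x**3 - x                               # a Borel function changing sign
    f_plus = np.maximum(f, 0.0)                # f+ = 1_{A+} f
    f_minus = np.maximum(-f, 0.0)              # f- = -1_{A-} f

    # f = f+ - f-  and  |f| = f+ + f-
    assert np.allclose(f, f_plus - f_minus)
    assert np.allclose(np.abs(f), f_plus + f_minus)

    direct = np.sum(f) * dx                            # Riemann-sum approximation of ∫ f dλ
    split = np.sum(f_plus) * dx - np.sum(f_minus) * dx # ∫ f+ dλ − ∫ f− dλ
    print(direct, split)                               # both close to 0, since f is odd

The two printed numbers agree, in line with the definition ∫ f dμ := ∫ f⁺ dμ − ∫ f⁻ dμ.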
Proposition I.1.3. Let f : ℝⁿ → [0, +∞) be a Borel function. Suppose A = ⋃_{i=1}^∞ Aᵢ, where A₁, A₂, … is a collection of mutually disjoint Borel sets in ℝⁿ. Then

∫_A f dμ = Σ_{i=1}^∞ ∫_{Aᵢ} f dμ.

Proof. See for example [11], [28], [45], [64]. •

… ≥ 0 the above (finite or infinite) is always valid.

Proof. For details see for example [11], [28], [45], [64]. The proof is not difficult: first one proves the theorem in the strongly simplified case where f = 1_A, A being a Borel set. Next, the case where f is a simple function follows. The general case is proved by an argument involving approximation by simple functions. In Lebesgue's integration theory many proofs are given along this line. •

Integrals with respect to the Lebesgue measure λ on ℝⁿ are said to be Lebesgue integrals; they present a generalization of the "classical" Riemann integral. More precisely, if A ⊂ ℝⁿ is a set of type A = [a₁, b₁] × ⋯ × [aₙ, bₙ] and if f : A → ℝ is (piecewise) continuous, then

∫_A f dλ = ∫_{a₁}^{b₁} ⋯ ∫_{aₙ}^{bₙ} f(x₁, …, xₙ) dxₙ ⋯ dx₁,
where the right-hand members present integrals in the sense of Riemann.

Example 7. Let g : ℝ → [0, +∞) be the function defined by

g(x) := (1/√(2π)) e^(−x²/2)   (x ∈ ℝ).

Suppose μ = gλ, that is, μ is the Borel measure defined by

μ(A) := ∫_A g dλ.

If f(x) = x², then by Theorem I.1.4 one has

∫ f dμ = ∫ x² dμ(x) = ∫ x² g(x) dλ(x) = ∫_{−∞}^{+∞} x² (1/√(2π)) e^(−x²/2) dx. •
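The integral in Example 7 can also be approximated directly on a computer, which makes the relation ∫ f dμ = ∫ fg dλ for μ = gλ quite concrete. The sketch below (added for illustration only; it simply truncates the integration range to [−10, 10], where the integrand is negligible) approximates ∫ x² g(x) dx for the function g above.

    import numpy as np

    def g(x):
        # the density of Example 7: g(x) = (1/sqrt(2*pi)) * exp(-x**2 / 2)
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    x, dx = np.linspace(-10.0, 10.0, 400_001, retstep=True)
    integral = np.sum(x**2 * g(x)) * dx   # approximates ∫ x² dμ(x) = ∫ x² g(x) dλ(x)
    print(integral)                       # approximately 1.0

The exact value is 1, the variance of the N(0, 1)-distribution.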
Besides the Lebesgue measure, the Dirac measures (or point measures) also play an important role. The Dirac measure in a point a ∈ ℝⁿ is defined by

δ_a(A) := 1 if a ∈ A,  and  δ_a(A) := 0 if a ∉ A.

It is easily verified that δ_a is a measure. An integral with respect to δ_a … Suppose cₙ ≥ 0 for all n ∈ ℕ. Define the measure μ on ℝ by

μ(A) := Σ_{n=0}^∞ cₙ δₙ(A).

It is not difficult to see that indeed this defines a measure on ℝ. We shall write

μ = Σ_{n=0}^∞ cₙ δₙ.

A function f is μ-summable if and only if

∫ |f| dμ = Σ_{n=0}^∞ cₙ |f(n)| < +∞.

… a Borel function f : ℝⁿ → [0, +∞) such that P_X = fλ. If this is the case then the corresponding f is called the probability density of X.

Remark. The probability density associated with an absolutely continuous probability distribution is not necessarily a continuous function.
Proposition I.2.2. If X is a stochastic n-vector enjoying an absolutely continuous probability distribution, then its corresponding probability density f satisfies the following three basic properties:

i) f is a Borel function,
ii) f ≥ 0,
iii) ∫ f dλ = 1.

Conversely, every function f : ℝⁿ → ℝ satisfying the properties i), ii) and iii) is the probability density function of at least one stochastic n-vector.

Proof. It is direct from Definition I.2.4 that a probability density function satisfies i), ii) and iii). Suppose, conversely, that f satisfies i), ii) and iii). Define Ω := ℝⁿ, 𝔄 := 𝔅ⁿ and P := fλ. Then (Ω, 𝔄, P) is a probability space. Let X : Ω → ℝⁿ be the identity on ℝⁿ. Then P_X = fλ, so f is the probability density function of the stochastic variable X. •

Example 7. For all α ∈ ℝ and β > 0, define f : ℝ → ℝ by

f(x) := β / (π (β² + (x − α)²))   (x ∈ ℝ).

This function f satisfies the conditions i), ii) and iii) in Proposition I.2.2. So f is the probability density of at least one stochastic variable. A stochastic variable having this probability density f is said to have a Cauchy distribution with parameters α and β. •

Remark. There exist stochastic variables which have neither a discrete nor an absolutely continuous probability distribution. Such variables occur, for example, in quantum mechanics (see also the intermezzo in §I.7).

As said before, if X : Ω → ℝⁿ is a stochastic n-vector then P_X is a probability measure on the Borel sets of ℝⁿ. Thus the triplet (ℝⁿ, 𝔅ⁿ, P_X) presents a probability space. In probability theory it is often this space which is in the picture (just "forgetting" the triplet (Ω, 𝔄, P)). In measure theory it is usual to call P_X the image measure of P under the map X. Very often this measure is denoted by X(P), but other notations are used as well. If X₁, …, Xₙ are stochastic variables, then an associated stochastic n-vector X can be defined by (see Appendix B to clear up the problem of measurability of X):

X(ω) := (X₁(ω), …, Xₙ(ω)) ∈ ℝⁿ.

In this case P_X will also be denoted by P_{X₁,…,Xₙ}. This measure is often called the joint probability distribution of X₁, …, Xₙ.
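A Cauchy distributed variable is easy to simulate: if U is uniformly distributed in (0, 1), then α + β·tan(π(U − ½)) has the density of Example 7. The sketch below (an added illustration; the inverse-transform recipe used here is a standard fact, not something taken from this section) draws a large Cauchy sample and shows its characteristic heavy-tailed behaviour.

    import random, math

    def cauchy_sample(alpha, beta, n):
        # Inverse-transform sampling: if U ~ uniform(0,1), then
        # alpha + beta * tan(pi * (U - 0.5)) has the Cauchy(alpha, beta) density.
        return [alpha + beta * math.tan(math.pi * (random.random() - 0.5))
                for _ in range(n)]

    sample = cauchy_sample(alpha=0.0, beta=1.0, n=100_000)
    print(max(sample), min(sample))       # typically enormous in absolute value
    print(sum(sample) / len(sample))      # the "sample mean" is erratic

The erratic behaviour of the sample mean is no accident; it reflects the standard fact that a Cauchy distributed variable has no expectation (compare Appendix D on its characteristic function).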
I.3 Product measures and statistical independence

In this section we shall define statistical independence of stochastic variables in terms of measure theory. For every A₁, …, Aₙ ⊂ ℝ the Cartesian product A₁ × A₂ × ⋯ × Aₙ ⊂ ℝⁿ is understood to be the set defined by

A₁ × ⋯ × Aₙ := {(x₁, …, xₙ) ∈ ℝⁿ : xᵢ ∈ Aᵢ (i = 1, …, n)}.

The projection of ℝⁿ onto the i-th coordinate will be denoted by πᵢ, so

πᵢ(x₁, …, xₙ) := xᵢ.

The mapping πᵢ is continuous and therefore surely a Borel function. One has

A₁ × ⋯ × Aₙ = π₁⁻¹(A₁) ∩ ⋯ ∩ πₙ⁻¹(Aₙ).

Consequently, if A₁, …, Aₙ are Borel in ℝ then so is A₁ × ⋯ × Aₙ in ℝⁿ. If P is a probability measure on ℝⁿ then the measure Pᵢ defined by

Pᵢ(A) := P[πᵢ⁻¹(A)]   (A Borel in ℝ)

is a probability measure on ℝ. One has

Pᵢ(A) = P(ℝ × ⋯ × ℝ × A × ℝ × ⋯ × ℝ),

where A is on the i-th place.

Proposition I.3.1. If X : Ω → ℝⁿ is a stochastic n-vector with components X₁, …, Xₙ, then (P_X)ᵢ = P_{Xᵢ}.

Proof. (P_X)ᵢ(A) = P_X(πᵢ⁻¹(A)) = P(X ∈ πᵢ⁻¹(A)) = P((X₁, …, Xₙ) ∈ πᵢ⁻¹(A)) = P(Xᵢ ∈ A) = P_{Xᵢ}(A). •

Remark. P_{Xᵢ} is said to be a marginal probability distribution of X. Note that in fact P_{Xᵢ} is the image of P_X under the map πᵢ (see §I.2).

Definition I.3.1. A probability measure P on ℝⁿ is said to be the product of P₁, …, Pₙ, and denoted by P = P₁ ⊗ ⋯ ⊗ Pₙ, if

P(A₁ × ⋯ × Aₙ) = P₁(A₁) P₂(A₂) ⋯ Pₙ(Aₙ)

for all Borel sets A₁, …, Aₙ ⊂ ℝ (see also Appendix B).
I. Probability theory
Example 1. Given a point (a, b) 6 l 2 , the Dirac measure ) = 5 a ® • Definition 1.3.2. The stochastic variables Xx,..., independent if
X„ axe said to be statistically
Example 2. In case of two variables X\ and X2 we have I x At) = n(X1,X2)
G Ax x A2] = r[Xx € Ax and X 2 e A2}.
So the variables Xi and X2 are statisticaJly independent if and only if P[Xi € Ax and X2 € A2] = P(Xi e A I ) P ( X 2 € A2) for all Borel sets Ax,A2.
•
Example 3. Suppose P) is a probability space and Bx,B2 € 21. Setting Xx : = lfij and X2 := 1 B2, the variables Xx and X2 are statistically independent if and only if P ( £ i n S a ) = P(£i)P(£3) (see Exercise 15).
•
Definition 1.3.3. The events Bx,.--,Bn are understood to be statisticaJly independent if the stochastic variables l ^ j , . . . , are so. Example 4. Three events B x , B 2 , B z are statistically independent if and only if P(Bi n B2 n B3) - P(Bi) P(B a ) P(B 3 ) P ( B i n B 2 ) = P(B1)P(Ba) F(BxnB3)
=
F(Bi)F{B3)
P(B2nB3)
=
F(B2)F(B3)
(see Exercise 16).
•
It can occur that variables Xx,..., Xn are pairwise independent but still do not form an independent system! We give an example of this phenomenon. Example 5. Throwing a die twice we define the events A, B and C by A := {1 s t throw is even} B = {2 n d throw is even} C = {1 s t and 2 n d throw are both even or both odd}.
Then P ( i n 5 ) = I = i X i =P(¿)P(£) 4 Z Z
P(A n C) = \ = ^ x J =P(i4)P(C) P(BnC) = i = i x i
=F(B)F(C).
However, P(A D B n C) = \, so P ^ n ñ n C ) ^ p ( A ) P (B) P(C). Defining X := 1,4, Y" := IB and Z :— le, the variables X, Y and Z are pairwise independent. However the system X, Y, Z is not statistically independent. • The tensor product of a system of functions /i, . . . , / „ : R to be the function / : R " —• R defined by f(xi,...
, x „ ) := fi(xi)...
R is understood
/„(x n ).
We shall denote / = fi ® • • • ® /„. T h e o r e m 1.3.2. Suppose X\,..., Xn are stochastic variables with probability densities fxi fx„ • The system X\,..., Xn is statistically independent if and only if the stochastic n-vector (X\,..., Xn) has a probability density fxi • • • ® fx„ • Proof, i) Suppose X\,..., Xn are independent. Then for all Borel sets Ai,..., in M and for X := ( X i , . . . , Xn) one has P ( X e A! X •• • X An) = r(Xi
e A)...V(Xn
e
K)
= / fxt (xi) dx 1... / fXn(xn) JAl JAn
(Fubini's theorem) =
J " J
fxAxi)
An
• • -fx„(xn)
dxn
dxi..
,dx„
So we find that P ( ( * ! , . . . , Xn) E A) = Jf A { f x , ® • • • ® /JCJ(X) d x (*) for all sets of type A = A\ x • • • x An. This, however, implies that (*) holds for all Borel sets i c R " (see Appendix B).
ii) Conversely, suppose that fx1 • • • ® fxn stochastic n-vector ( X i , . . . , Xn). Then P ((X1,...,Xn)eA1x...xAn)=
J...J
is the probability density of the
f x 1 ( x i ) . . . fxn(x„)
(Fubini's theorem) = / fXl (xi) dxi... Jax = Consequently X\,...
dx 1 . . . dxn
/ fXn (x n ) dxn JA„
F(X1tA1)---P(XneAn).
,Xn is an independent system.
•
One can also speak about statistical independence of stochastic vectors. Suppose X = ( X i , . . . , X m ) and Y = (Yi,..., Yn) are two stochastic vectors. A stochastic (m + n)-vector (X, Y ) can then be defined by (X, Y) := ( X i , . . .,Xm,Yi,..
.,Yn).
The stochastic vectors X and Y are said to be statistically independent if P(X,Y) = P X ® P Y . So X and Y are independent if and only if for all Borel sets A C R m and B c R " one has P(X 6 A and Y € B) = P(X € A)P(Y € B). P r o p o s i t i o n 1.3.3. If X = ( X i , . . . , Xm) and Y = (Yi,..., Yn) are statistically independent stochastic vectors, then for ail possible i and j the components Xi and Yj are also statistically independent. Proof. Suppose A and B are Borel sets in R. Define A : = R x - - - x R x j 4 x K x - " X R c Rm
and
B := Rx-• - x R x S x R x - • -xR c R n
where A and B are on the i t h and j t h place respectively. In this notation we have VlXitYj)(A
xB)
= P[(Xi, Y,) € A x B] = P[(X, Y) e A x B] = P ( x , Y ) ( I x B) = P X ( I ) P y ( B ) = P(X € A) P(Y € B) = P(Xf € A)P(Yj e B) =
PXi(A)¥Yj(B).
We see that P(x i ,y J ) = Px< ® Py 3 , consequently Xi and Yj are statistically independent. • The converse of Proposition 1.3.3 is not valid. That is, it can occur that every pair of components Xi and Yj is independent, whereas X = ( X i , . . . , Xm) and Y = (Yi,..., Yn) are not independent! This is illustrated in the following example:
E x a m p l e 6. Let X, Y and Z be as in Example 5. Consider the stochastic 1-vector X and the stochastic 2-vector (Y, Z). Now the pair X and Y is independent, just as the pair X and Z. However, X and (Y, Z) axe not independent. To see this, note that F[X = 1 and (Y, Z) = (1,1)] ^ V[X = 1] P[(Y, Z) = (1,1)]. Here the left side equals \ , the right side | .
•
P r o p o s i t i o n 1.3.4. Let Xi,... ,Xn be a statistically independent system of stochastic variables and let 1 < s < n — 1. Then the stochastic vectors {X\, ...,X3) and (X3+i,... ,Xn) are statistically independent. Proof. We denote X = (.Xi,...,X„)
and
Y = (Xs+i,...,Xn).
In this notation we have, because of the independence of the system Xi,..., P(X,Y)
=P*.
Xn
Px. ® P x . + 1 ® - " ® P j C B .
It follows from this (see Appendix B) that P(X.Y)
= (Pxx ® • • • ® Px.) ® (Px. + 1 ® • • • ® P x J -
However Px = P x 1 ® - - - ® P x .
and
PY =P*.+1 ®•••®Px„.
Consequently, P(x,Y)
=Px®Py,
which proves the independence of X and Y .
•
To conclude this section we define the notion of statistical independence for a system of stochastic vectors. Definition 1.3.4. A system X i , . . . , X „ of stochastic vectors is said to be statistically independent*) if Px 1) ...,x„ = Pxx ® Px 2 ® • • • ® Px„In a natural way the propositions and theorems in this section can be generalized in several ways.
*)
The reader should not confuse statistical independence and linear independence in linear algebra!
1.4 Functions of stochastic vectors In probability theory and statistics we often encounter functions of stochastic vectors. Example 1. Let (fi,2l, P) be a probability space and X : ii —> R a stochastic variable. Define for all w e ii Y(u>)
:=[X{u)}2.
Now Y is also a stochastic variable. We shall write briefly Y = X2. Defining the function / : R —> R by f(x) := x2, we can set Y = f o X. • Suppose X = ( X j , . . . , Xm) is a stochastic m-vector, defined on the probability space (ii,2l,P). Let f : R m —• R n be any Borel function. Now we can define a stochastic n-vector Y by Y = f o X . The vector Y is said to be a function of X and is denoted by Y = f ( X ) . Example 2. Define / : R m —• R by ,/ f{xt,...,xm)
x
:=
Xi H
1- xm m
.
If X = ( X i , . . . ,Xm) is a stochastic m-vector, then / ( X ) is a stochastic 1-vector. Here we shall systematically denote / ( X ) = X. One has _ Xi H
H Xm m
• Given a stochastic m-vector X , we can look upon a Borel function f : R m —* R n as a stochastic n-vector defined on the probability space (R m ,58 m ,Px)-The corresponding probability distribution of f on R n will be denoted by (Px)f • On the other hand, we also have on R " the probability distribution Pf(x) of the variable f ( X ) . Proposition 1.4.1. One has (Px)f = Pf(x) • Proof. A direct consequence of the identity (foX)-1(J4) = X"1[f-1(A)].
• The next proposition is often applied in statistics. We state this proposition in terms of two stochastic vectors X , Y .
Proposition 1.4.2. Suppose f : R m -> 1 " and g : R n -> R 9 are Borel functions. I f X . = (Xi,... ,Xm) and Y = (Yi,... , Y„) are statistically independent stochastic vectors, then so are f(X) and g(Y). Proof. For all Borel sets A c R p and B C R 9 we have Pf(X),«(Y)(i4 x B) = P[(f(X),g(Y)) € A x B] = P[(X,Y)ef-1(i)xg-1(B)] = Px[r1(^)]PY[g-1(B)] = (Px)f(A)(P Y ) g (B) (Proposition 1.4.1) = P f ( x ) (yl)P g ( Y ) (B). We see that Pf(x),g(Y) = Pf(x) ®Pg(Y). consequently f(X) and g(Y) are independent. • R e m a r k . It is easily verified that Proposition 1.4.2 is also valid if more than two stochastic vectors and more than two Borel functions are involved. E x a m p l e 3. If X = ( X i , . . . , X m ) and Y = (Yi,..., Yn) are statistically independent, then so Eire X and Y. (Define ,/ V fix i,...,xm):=
^H
1-Xm m
, , and g(xlt...
X Xi H ,xn) :—
and apply Proposition 1.4.2.)
1- xn n •
The following example shows that caution should be exercised. E x a m p l e 4. Let X, Y and Z be as in Example 5 of §1.3. Then the pair X,Y is statistically independent, just as the pair X, Z. However, the pair X, Y + Z is not independent. To see this, note that P(X = 1 and Y + Z = 2) ^ P(JST = 1)P(Y + Z = 2). The left side equals | , the right side | .
•
Let X be a stochastic m-vector and f : R m —+ R n a Borel function. We then have probability measures P x on R m and Pf ( X ) on R n - The next theorem is a wellknown result in measure theory. It describes the connection between integration with respect to P x and integration with respect to Pf(x) • T h e o r e m 1.4.3. Suppose f : R m —> R" and ip : R n -» R are Borel functions. Let X be a stochastic m-vector. Then ip is Pf(x) -summable if and only if ip of is
Px -summable. In that case one has J v(t)dP,(X)(t) =
J( 0 the above (finite or infinite) is always valid. Proof. First of all the theorem is proved in the case that
0 or if g is Px,y -summable, then one has J 5 ( X ) y ) d P x , Y ( x , y ) = J ^ J < 7 ( x , y ) d P x ( x ) j dP Y (y) = /{/5(x,y)dPY(y)}dPx(x).
Proof See for example [11], [28], [45], [64],
•
Proposition 1.5.3. If X and Y are statistically independent wth existing expectations, then the expectation of E(XY) also exists and one has E ( x y ) = E(x)E(y).
Proof. Suppose X and Y are independent and suppose both E(X) and E(Y) exist. Define g : R 2 -»• R by g{x,y) xy. Then, of course, XY = g(X,Y).
Moreover
J \g\dV ,Y = J M x
dPx,Y
(Fubini's theorem)
=J
WMd(Px®IPy)
= J ^ J \x\\y\dFx(x)^
dPY(y)
= J \ y \ { J |x|dPx(x)}dPy(y) = J |x|dP*(s) J |y|dPy(y) < +00. This proves the existence of E(XV). Similarly one proves that E(XF) E(X)E(y).
= •
Suppose X is a stochastic variable of which the expectation E(X 2 ) exists. One can prove (see Exercise 18) that this implies the existence of E(X). The variance of X, denoted by var(X), is then defined by var(X) := E(X 2 ) - E(X) 2 . The variance of X is a measure of the amount of spreading of the outcomes of X around the expectation E(X), when the experiment is repeated unendingly. Example 5. If X is the constant a, then E(X) = a and E,(X2) = a2 (see Example 1), so var(X) = a2-a2
= 0.
Obviously there is no spreading in this case.
•
Example 6. Let X be a stochastic variable. Suppose that Px = /A, where / is defined by f{x) = l j ^
if x 6 (a,/3),
0 elsewhere. In that case X is said to be uniformly distributed in the interval (a, (3). We then have (see Exercise 19): var(X) = ¿(/3 - a) 2 . • Proposition 1.5.4. Suppose E(X 2 ) exists. Denoting fi = E(X), one has varpQ = E[(JC - pt)2].
Proof. Left as an exercise to the reader. Remark. By the proposition above we see that always var(X) > 0.
•
Proposition 1.5.5. If var(X) = 0, then F(X = n) = 1, where /z = E(X). In other words, with probability 1 the outcome of X will be equal to fi.
Proof. Define A C R by A := {x : x ^ n} or, equivalently, by A := {x : {x - fi)2 > 0}.
Furthermore, define AQ by Ao := {x : (x — fj,)2 > 1} and define for all n > 1 An : =
:
^TT
R n we have (by Proposition 1.6.3) F(
W
, ...,*„)
6
QA ) .
, - W * .
In the integral on the right hand side we carry out a change of variables: x = Q u . The substitution formula for multiple integrals now gives (see [43], [64]) F((*, • • • M
e OA) = fA — j ^ j
JQ du,
where JQ is the Jacobian of Q, that is, | det(Z)Q)|. Because Q is a linear map we have DQ = Q . It is well known (see [52], [55], [65] or §VIII.l) that det(Q) = ±1. It follows that the Jacobian of an orthogonal linear map is 1. Furthermore we have (Qu, Qu) = (u, u). It thus appear that xn(QA)
= H(X1,...,Xn)eQA)
= JA
- J - ^ e - ^ ^ ' d u .
By Proposition 1.6.3 the right side of this equality is Pxi...x„(- that is to say: PX = FY ^ X x =
Xy
Details can be found in [6], [18], [19], [21], [41], [64]; see also Appendix F. Convergence in distribution can also be expressed in terms of characteristic functions: Theorem 1.8.5 (P. Levy). The sequence of stochastic variables (Xn)ngN converges in distribution to the variable X if and only if JLim,Xx» W = ^x (0 Proof. See [6], [18], [19], [21].
f°r
d l t €
•
The following proposition describes how characteristic functions and moment generating functions change when "scale-transformations" are applied on X : Proposition 1.8.6. Let X be an arbitrary stochastic variable and let Y — pX+q. Then: i) My(i) = egtMx(pt) for all t where these expressions make sense, ii) xY W = eiqtXx (Pt) for edit & R.
1.8 Moments, moment generating functions and characteristic functions To avoid trite explanations, only the proof of ii) is given. For all
Proof.
have
XY(t)
=
E
= E (eit(pX+9)) = E
(eitY)
(eiptX)
= e^E
Proposition 1.8.7. i)
Mx(t)
ii)
X x
Proof. Step
=
If X
e"i+*&x) anoo \ y/n J
Using Lebesgue's dominated convergence theorem (see Appendix A ) we deduce lim nan = lim t2 [x2r
n-»oo
n—oo
(-^L n
J
\V J
) d¥x(x)
= t2 [ J
lim x2r (-^L
n_>°°
) d,Px(x)
\\nJ
= 0.
Summarizing, we reveal that X7
It2 n (t) = (1 - ~ 1 n 1- " n )
wheren—too lim n an = 0.
By Lemma 1.9.1 this implies lim
n->oo
z
"
(i) w
lim (1 - I - + a „ ) n =
n—>oo
2 71
Hence (*), and therefore step 1, has now been proved. Step 2. The general case. Suppose the X\,X2, • • • constitute an independent sequence of identically distributed variables with expectation fi and variance a 2 . As before, write Xi:=^LZJi o
(i =
1,2,...).
Now the Xi, X2,... constitute an independent sequence of identically distributed variables with expectation 0 and variance 1. By step 1 the sequence 1 ~ _X1 — y/n
—
+ --- + Xn y/n
converges in distribution to the N{0,1)-distribution. Recognizing that _1_~ yfn we have proved the theorem.
= n
Sn/n - fi cr/y/n •
R e m a r k 1. There exist more general and deeper versions of the central limit theorem (see for example [18], [50]). R e m a r k 2. It is both amusing and instructive to visualize the central limit theorem in the case of an independent sequence of variables which are all uniformly distributed in the interval ( — + 5 ) . In Exercise 51 the reader is asked to actually carry this out. R e m a r k 3. The central limit theorem explains in part the special role played by the normal distribution in probability theory and statistics. Concerning this, see also Exercise 40.
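Remark 2 is easy to act on with a few lines of code. The sketch below (an added illustration in the spirit of Exercise 51, not a solution printed in the book) simulates standardized sums of independent variables uniformly distributed in (−½, +½) and compares a histogram of the outcomes with the N(0, 1) density.

    import numpy as np

    rng = np.random.default_rng(2)
    n, repetitions = 12, 100_000
    # each row: n independent uniform(-1/2, 1/2) variables
    u = rng.uniform(-0.5, 0.5, size=(repetitions, n))
    # standardize the sum: E(S_n) = 0 and var(S_n) = n/12
    z = u.sum(axis=1) / np.sqrt(n / 12.0)

    hist, edges = np.histogram(z, bins=np.linspace(-4, 4, 41), density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    normal = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
    print(np.abs(hist - normal).max())    # small: the histogram hugs the N(0,1) curve

Already for a sum of twelve such variables the histogram is practically indistinguishable from the standard normal density, which is the phenomenon Remark 2 invites the reader to visualize.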
1.10 Transformation of probability densities
1.10 Transformation of probability densities Let X be an absolutely continuous distributed stochastic n-vector of which the probability density is given. If g : R n —> R is a given continuous function, how can we express the probability density of g(X.) in terms of the density of X? In this section we illustrate a technique based on differentiation of the distribution function. Example 1. Let g : R n —• R defined by g{x) := a (constant). Now for every stochastic n-vector X the probability distribution of p(X) is presented by the Dirac measure 0, then FY(y)
F(X2
= P ( Y
1.
if 0 < a < 1, elsewhere.
n
Example 4. Let X and Y be statistically independent stochastic variables with probability densities f x and fY given by , ix2e~x
«•{i f r y
if x >
0,
elsewhere,
( e - y
if y > 0,
^0
elsewhere.
Let Z := Y- Find, if it exists, an explicit form of the probability density of Z. Solution. As in the preceding two examples, we first determine an explicit expression for the distribution function Fz of Z. To this end (see Theorem 1.3.2) we note that the density of the 2-vector ( X , Y) is given by Jix2e-(*+v) /x,y(x,y) = ^ 2 (, 0
if x, y > 0, elsewhere.
If a < 0, then obviously Fz(a) = 0. For a > 0 we define the subset Ga of R 2 by Ga
= {{x,y)
: x,y
> 0;y
R 2 . Prove this. 2
Suppose / i and / 2 axe continuously differentiable (differentiable with a continuous derivative) on R. Define, as before, fi ® f2 on K 2 by (/i ® /2)(xi,x 2 ) := /i(xi)/ 2 (x 2 ). We assume that / i / 2 is rotation invariant, that is: (/I®/
2
)(Q(XI,X
2
))
=
(/I®/
A
)(®I,X
A
)
for all orthogonal Q and (xi,x 2 ) € R 2 . i)
Show that fi(x) = fi(-x)
ii)
Prove that / i = / 2 .
for all x 6 R (i = 1,2).
iii) Prove that there exist constants A and B such that / ( x ) = h{x) = h(x) =
AelBx\
(Hint to iii): Suppose Q(t) is the orthogonal linear map belonging to the matrix / cosi — sint\ \ sin t cos t J ' Then we have for all (xi,x 2 ) e R 2 : (/®/)(QW(ai,®a)) = (/®/)(®i,»a). Evaluate the derivative of the left side in the point t = 0 and deduce that f'(xi) ^ f'(x2) Xl/(xi) x2/(x2)
_^conctant)
Suppose X\, X2 are statistically independent stochastic variables with strictly positive, continuously differentiable probability densities. If the probability distribution of {X\,X2) is rotation invariant, then Xi and X2 have a common N(0, a2)-distribution. Prove this. Suppose Xi,... ,Xn are statistically independent stochastic variables with strictly positive continuously differentiable probability densities. If the probability distribution of ( X i , . . . , X n ) is rotation invariant, then the variables . X i , . . . , Xn have a common N(0, 0 the variable Yc is defined by
x Y r (eTi \s -; ii f
if | x |M< Nec .
i) Sketch a graph of Yc. ii) Show that Yo = X and that lim Yc = —X. c—>+oo iii) Prove that Yc is N(0,1)-distributed for all c > 0. The reader may take for granted that the function f :c— i • cov(X, Yc) is sequentially continuous and therefore continuous. (This can be proved for example by applying Lebesgue's dominated convergence theorem (see Appendix A)). iv) Evaluate /(0) and lim /(c). v) Prove that there exists an element a € R such that cov(X, Y^) = 0. vi) Show that for all c > 0: P(|X| < 1 and |yc| < 1) ^ P(|X| < i)p(|y c | < 1)vii) Deduce from vi) that X and Yc are statistically dependent for all c > 0. Now the pair X and Ya enjoys the three properties listed in the beginning of this exercise. viii) Is not the above contradictory to Theorem 1.6.7? Given any finite collection of prescribed probability densities / i , . . . , fn, there exists a statistically independent system of variables X\,..., Xn such that Xi has probability density /» (i — 1,... ,n). Prove this. If both X and Y enjoy absolutely continuous distributions, then so does the 2-vector (X, Y). Is this statement true or false? Prove that the distribution function of an arbitrary stochastic variable enjoys the properties i), ii) and iii) mentioned in §1.7.
45) Using a technique involving the calculus of residues (see [7], p. 219) it is easily verified that +oo f+OO / -oo e-(x+ia)
d x = J—oo
g-x2^
for
gjj a g
R
Deduce statement ii) in Proposition 1.8.7 from this. 46) Prove Proposition 1.8.8. 47) Prove that the sum of two (or more) statistically independent Poisson distributed variables is Poisson distributed. 48) Can you give conditions under which the sum of two independent binomially distributed variables is binomially distributed? 49) The characteristic function of a Cauchy distribution is described in appendix D. Is the sum of two independent Cauchy distributed variables Cauchy distributed? 50) Suppose X and Y axe independent stochastic variables with probability densities f x and / y . We assume that f x and / y are bounded piecewise continuous functions and that there exists an a > 0 such that fx(x)
= 0 if |x| > a.
The convolution product fx* fv : R —> [0, +00) of such a pair of probability densities is defined by +00 /
i) ii)
-00
fx(x)fy(s
- x)dx
( s € R).
Prove that the integral on the right side exists and that f x * / y is a continuous function. Prove that if S = X + Y, then f s = fx*
fr-
iti) Prove that ( f x * f y ) ( s ) = 0 if |s| > 2a. iv) Formulate and prove a discrete analogue of the above. 51) In this exercise the reader is asked to visualize the central limit theorem. Let X\, Xi and X3 be independent variables, uniformly distributed in the interval ( - i , + § ) . We define Si i)
ii)
Xi,
S2 '•= Xi + X2
and
S3 := Xi + X2 + X3.
We have E(Si) = 0 and var(Si) = yj. Sketch the probability density of -distribution in one and the Si and the one corresponding to the N(0, same figure. Determine the probability density of S2 •
1.11 Exercises
61
iii) We have £(¿>2) = 0 and var(5a) = g. Sketch the probability density of 52 and the one corresponding to the iV(0, ^-distribution in one and the same figure. iv) Determine the probability distribution of S3. v) We have E(5a) = 0 and var(£3) = Sketch the probability density of 53 and the one corresponding to the N(0, ^-distribution in one and the same figure. 52) Given are two statistically independent variables X and V, having a common probability density / of type
Here 9 is a fixed positive number. i) ii)
Prove that X + Y and X/Y are statistically independent. Prove that X + Y and X/{X + Y) sure statistically independent.
Chapter II Statistics and their probability distributions, estimation theory
II. 1 Introduction To determine the acceleration of gravity g on earth's surface we drop a heavy leaden ball from a height of 100 meter. Now g and the time T it takes to reach the ground are related as follows: (T in seconds). Consequently, by measuring T we can get an impression of the numerical value of g. Repetition of this experiment will, in general, not result in exactly the same measured value of T . This is due to restricted accuracy of the pieces of apparatus, to inaccuracy in reading off things and to other (exterior) disturbing factors. The process of measuring T can therefore be considered as "observing the outcome of a stochastic variable". This stochastic variable, which exists in our mind only, will also be denoted by T . In the physical sciences it is often assumed that such a variable is normally distributed, say 1)? To solve this problem we standardize T in the sense of Definition 1.5.2. By Theorem 1.6.1 the variable z-.=
f
=
is N(0, l)-distributed. (In statistics, N(0,1)-distributed variables axe often denoted by the letter Z). Using this we can evaluate P(|T — /¿| > 1) in the following way:
II. 1 Introduction
where, as said before, Z is 7V(0,1)-distributed. We therefore have ¥{\Z\ > 1.25) = F(\Z\ > 1.25) = ( f ^ \J-oo
+ f+°°) J 1.25 / V27T
e~x2'2dx.
The numerical value of this integral can be determined by expressing it in terms of the distribution function $ of the standard normal distribution (see Table II). Because of symmetry we can write P(|Z| > 1.25) = 2 $(-1.25) = 2 • 0.1056 = 0.2112. We see that in our experiment there is a chance of roughly 20% that the error (in absolute sense) will be more than 1 second. We can improve the accuracy in our experiment by repeating it a number of times and consider afterwards the average of the measured values of T. Then in a natural way there are n stochastic variables T i , . . . ,T n involved, representing the n measurements of the falling-time. We assume that there is no mutual interference in our measurements. In statistical language: The variables Ti,...,Tn Eire supposed to be statistical independent. A sequence like T i , . . . , Tn is said to be a sample from a normally distributed population. More general and more precise: Definition II. 1.1. A sample of size n is understood to be a statistically independent sequence Xi,... ,Xn of stochastic variables, all of them identically distributed. The common probability distribution of the Xi is said to be the (probability) distribution of the population. If Xi,..., Xn is a sample and g : R n —* R a Borel function (in practice usually a continuous function) then the stochastic variable g(Xi,..., Xn) is said to be a statistical quantity, or briefly a statistic. We shall mostly denote statistics by the capital letter T. If (after completing an experiment) an outcome t for T appeaxs, then we shall write bluntly T = t. By this notation we surely do not mean that T be equal to the constant stochastic variable t. A more precise (but too lengthy) formulation would be: "In the underlying probability space fi the element ui appeared as an outcome and we have T(u)=g(X1(u),...,Xn(u))
=t."
A statistic of fundamental importance is the sample mean n Proposition II.1.1. If Xi,... and variance a2, then
H Xn
,Xn is a sample from a population with mean n
E(X) = /x and
var(X) = — .
Proof. We can write
¿=i Furthermore, the Xi,...,
»=1
X„ being statistically independent, we have
var 1
V^
fv\
1
0-2
2
• In case of a sample from a normally distributed population the previous proposition can be sharpened. P r o p o s i t i o n II.1.2. If Xi,... ,Xn is a sample from a N(fi, c 2 )-distributed population, then X is a N (fi, a2/n) -distributed variable. Proof. By Theorem 1.6.6 the sum is N(nfi, ner 2 )-distributed. Applying Theorem 1.6.1 the proof can be completed in an obvious way. • We now return to our experiment in which we were measuring the falling-time T of a leaden ball. Repeating this experiment 4 times, we get a sample T%, T2,13, T4 of size 4 from a N(fi, 0.64)-distributed population. By Proposition II.1.2 the sample mean T is a N(fi, 0.64/4)-distributed variable. The chance that we are going to make an error which is (in absolute sense) more than 1 second, is now expressed byP(|T-/x|>l). P ( | r - H\ > 1) = P(|T - Ml > 1) = P ( | ^ where Z is a N(0,1)-distributed
>
= n\Z\
> 2.5),
variable. Consulting Table II, we learn that
P(|T — n\ > 1) =0.012. In words, the chance that we are going to make an absolute error of more than 1 second is roughly 1%. Recall that for just one single measurement this chance was 20%! In this example we presumed that the numerical value of a was known beforehand. In practice this is often not the case. Usually we shall have to make an estimation of a , based on our measurements. To this end we introduce a statistic the outcome of which is generally used as an estimation of a 2 .
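The two probabilities in this introduction (roughly 0.21 for a single measurement and 0.012 for the mean of four) can be reproduced with the standard normal distribution function Φ. The sketch below is an added illustration; Φ is written here in terms of the error function instead of Table II, and the standard deviation σ = 0.8 (so σ² = 0.64, as in the text) is used throughout.

    import math

    def Phi(x):
        # distribution function of the N(0,1)-distribution
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def prob_abs_error_exceeds(c, sigma, n):
        # P(|mean of n measurements - mu| > c) for a N(mu, sigma^2) population
        z = c / (sigma / math.sqrt(n))
        return 2.0 * Phi(-z)

    print(prob_abs_error_exceeds(1.0, sigma=0.8, n=1))   # about 0.21
    print(prob_abs_error_exceeds(1.0, sigma=0.8, n=4))   # about 0.012

The standardization inside prob_abs_error_exceeds is exactly the one carried out in the text, first for a single measurement and then for the sample mean of four measurements.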
II. 1 Introduction Definition II.1.2. For any sample Xi,...,
65
Xn of size n > 2 the statistic
i=l is called sample variance. The following proposition explains why in the definition of S2 a factor been chosen (instead of Proposition II.1.3. If Xi,..., a2, then
has
Xn is a sample from a population with variance E(S 2 ) = a 2 .
(For this reason S2 is called an unbiased estimator of a2). Proof. We can write i
- /¿)2 =
Because
-X)
i
-X)2
i
+ (X+ 2
n)}2 - X)(X -M) + £(X i
i
- M)2.
~ X) = (J2i Xi) — nX = 0, this can be simplified as follows: J2(Xi i
- nf =
i
- X? + n(X - f!)2.
Taking the expectation of the left and the right side, we obtain i
- H)2] = E / ^ ^ - X)2\ + n E [ ( I - ti)2]. ^ i '
However, E[(Xj — fi)2} = a 2 and furthermore (by Proposition II.1.1) we have that E[(X - /i)2] = a2/n. Thus we get na2
i
=W^{Xi-X)2]+o2.
Consequently
Therefore
E ^ - I f l H n - l V i E(S 2 ) = 0, elsewhere.
This probability distribution is a member of the "family of gamma distributions". Before giving the general definition of a gamma distribution we recall the definition of a celebrated function in mathematical analysis: The function T : (0, +oo) [0, +oo) defined by r+ o o
r(®):= /
Jo
tx~1e~tdt
(x
> 0)
is said to be the gamma function. Using the technique of partial integration one can deduce the following proposition.
II. 2 The gamma distribution and the x 2 -distribution
67
Proposition II.2.1. For all x > 0 we have r(x + l) =
xT(x).
•
Proof. See Exercise 1. As a corollary of Proposition II.2.1 we have r ( n + 1 ) = n!
for all n = 1 , 2 , . . . .
The values of T have been tabulated. For x = | the value of T can be expressed in terms of TT: Proposition II.2.2. We have
=yft.
•
Proof. See Exercise 2.
Definition II.2.1. A stochastic variable is said to be gamma distributed with parameters a, /3 > 0 if it has a probability density given by
elsewhere. For an exercise, check that this indeed defines a probability density. Note that a and ¡3 are not exchangeable. For a = 1 and ¡3 = 6 the gamma distribution is usually called the exponential distribution with parameter 0 > 0. Proposition II.2.3. If X is N(0,1)-distributed, •with parameters a = ^ and /3 = 2.
then X2 is gamma distributed
Proof. A direct consequence of §1.10 Example 2, together with Proposition II.2.2.D Proposition II.2.4. The moment generating function of a gamma distributed variable exists on the interval (—oo, 4) and is on this interval defined by
Proof. By definition of a moment generating function we have
68
II. Statistics and their probability distributions, estimation theory
which is the same as 1
r +f+OO oo
It is easily verified that this integral makes sense if t—^ < 0, that is, if t € (—oo,^). Writing x = /3u and applying the substitution formula for integrals we get
M ( i
>-fikr^0"***
If, in turn, we substitute u — v/(l— 1
1
(3t) in this integral we get r+oo
1 •
Exploiting differentiation of the moment generating function, it is easy to deduce the next proposition. Proposition II.2.5. If X is gamma distributed with parameters a and (3, then: i) ii)
E(X) = a(3, var(X) = a/3 2 .
Proof. See Exercise 3.
•
Using Proposition II.2.4 it is also easy to prove the following important permanence property of the gamma distribution. Proposition II.2.6. Suppose Xi,... ,Xn are statistically independent variables. If for all i the variable Xi is gamma distributed with parameters ati and /3, then the variable Xi + • • • + Xn is gamma distributed with parameters ai + • • • + an and 0. Proof. This is a direct consequence of Proposition II.2.4 together with a generalized version of Proposition 1.8.8. • Now suppose that X\,...,Xn are statistically independent N(0, l)-distributed variables. Then the variables . . . , X^ are independent and gamma distributed with a = ^ and ¡3 = 2. By Proposition II.2.6 it follows that the variable Xi + • • • + X% is gamma distributed with a — n / 2 and (3 = 2. Such probability distributions, for historical reasons, have a special name (introduced by the famous British statistician K. Pearson (1857-1936)): Definition II.2.2. A stochastic variable is said to be x2-distributed with n degrees of freedom if it is gamma distributed with parameters a = n/2 and 0 = 2. In this terminology we have:
II.2 The gamma distribution and the x2-distribution P r o p o s i t i o n II.2.7. If Xi,..., lation, then the variable Xf H
69
Xn is a sample from a N(0,1 )-distributed popu1-X% is x2 -distributed with n degrees of freedom.
Furthermore, as a special case in Proposition II.2.5, we have: P r o p o s i t i o n II.2.8. If X is a x2-distributed variable with n degees of freedom, then one has: i) E(X) = n, ii) var(X) = 2 n.
E x a m p l e 1. We set up the following Gedankenexperiment*): Inside a sphere with center in the origin live a number of gas molecules. At a certain instant t = fo we choose an arbitrary molecule and measure its velocity V . When this experiment is repeated it generally does not result in the same measured value for V . Therefore we consider our measurement as the outcome of a stochastic 3-vector V = (Vx,Vy,Vz).
We assume that the outcome of one of the components Vx,Vv,Vz does not interfere with the probability distribution of the remaining components (still to be measured). More specific, we assume that the variables Vx,Vy,Vz are statistically independent. For reasons of symmetry we may assume that the Vx,Vy,Vz are identically distributed and that E(VX) = E(V^) = E(V^) = 0. In fact, because of the perfect symmetry in our experiment, we may assume that the probability distribution of (Vx, Vy, Vz) is rotation invariant. This, however, implies (see §1.11 Exercise 40) that the variables Vx,Vy,Vz have one and the same iV(0, «redistribution ! Let ||V|| be the length of V. We have:
Looking back upon Proposition II.2.7, we conclude that || V||2/ 2) is a sample from a N(FI,(redistributed. population, then the statistic (n — 1 )S2/a2 is x2-distributed with n — 1 degrees of freedom. Proof. Under the above-mentioned conditions the variables (Xi — fi)/a are all N(0,1)-distributed and they constitute a statistically independent system. Therefore, by Proposition II.2.7, the quantity
is x2-distributed with n degrees of freedom. Moreover, by Proposition II.1.2, the variable y/n(X — n)/o is N(0, l)-distributed. It follows from this that n(X — /¿)2/14.1
J
= 0.05.
II. 2 The gamma distribution sind the x 2 -distribution
73
Thus we see P ^2.17 < ^
< 14.1^ = 0.90.
P ^0.31
1.05) = 2 P ( Z > 1.05) where Z is iV(0,1)-distributed. Consulting Table II, we get P(|T-/i| > 0.1) « 0.29. Then P(|T — fi\ dx}du --irMiiw^h
p S =
Keeping it fixed, substitution of x = ut in the inner integral gives
(Fubini)
= f
f ^ - ^ e ' W ^ u d u ^ d t ,
where wechanged the order of integration. It follows from this that the probability density / of Z/U can be written as f(a) = J =
o
f u
( u ) - j = e - M
f * " Jo 2 " /
/ 2udu
n -(iWW/2
d u
u e 2
v ^ r ( f )
Now a substitution u = v/y/1 + a 2 in the last integral gives f ( a ) = (1 + a
2
) - ^ —r~F=
v ne- v*'
2
dv.
78
II. Statistics and their probability distributions, estimation theory
By a substitution v = \/2s, the integral on the right side can be expressed in terms of the gamma function:
In this way we get
Step 3. The probability density of J^^ Z
. We can write
r- Z —V n — = VY
r-Z V n 77u'
Applying the result of Exercise 25 in §1.11, we obtain an expression for the probability density / of y/nZ/U: / ( a ) = \firn 1 ( j J
n
This completes the proof of the proposition.
•
Warning. The proposition above is in general false if we drop the condition of independence of Z and Y. To illustrate this, choose any N(0,1)-distributed variable Z and set . Then Y is x -distributed with one degree of freedom. Now the variable Z/^/Y — Z/\Z\ can assume the values + 1 and —1 only. Therefore this variable can impossibly enjoy an absolutely continuous distribution, and certainly not a i-distribution. Definition I I . 3 . 1 . A stochastic variable is said to be t-distributed with n degrees of freedom, if it has a probability density of type described in Proposition II.3.1. R e m a r k . The i-distribution was first introduced by William S. Gosset (18761937). He was employed on a brewery. The brewery, however, forbade their employees to publish scientific results. For this reason he has been publishing his scientific activities under the pseudonym "Student". Accordingly, many people nowadays talk about a "Student distribution" instead of a "i-distribution". The i-distribution has been tabulated in detail (see Table III). T h e o r e m I I . 3 . 2 . If Xi,..., tion, then the statistic
is t-distributed
Xn is a sample from a N(n, a2)-distributed X —n S/y/n
with n — 1 degrees of freedom.
popula-
II.3 The ¿-distribution
79
Proof. A direct consequence of Proposition II.3.1 and the discussion preceding it.D The theorem above can be used to construct confidence intervals for fi: Example. Again, we axe looking back upon the experiment discussed in §11.1. If we measure the falling-time 8 times, then the quantity T-ji S/V8 is (by Theorem II.3.2) a priori f-distributed with 7 degrees of freedom. In Table III we can read off that P ( — % > 1.895 ) = 0.05. \S/V8 ~ J Examining the probability density in Proposition II.3.1, we see that the ¿-distribution is symmetric with respect to the origin. It follows from this that we also have _ P (^ ^
< -1.895 ) = 0.05.
\s/Vs ~
Combining this, we get
J
P (-1.895 < ^ ^
V
s/Vs
< 1.895 ) = 0.90.
J
In other words, doing this experiment we have (a priori) a 90% chance that the statistic _ T-fi S/V8 will assume a value between —1.895 and +1.895. Substitution of the outcomes 5 = 0.267 and T = 4.40 in the inequality -1.895
2), ii) var(X) = n/(n - 2) (n > 3).
80
II. Statistics and their probability distributions, estimation theory
II.4 Statistics to measure differences in mean Introduction. We want to check whether the gravitational acceleration on the north pole is the same as the one on the equator. We therefore set up an experiment in which the falling-time of a leaden ball from a height of 100 meters is measured. We repeat the experiment m times on the north pole and n times on the equator. Here in a natural way two samples X i , . . . , X m and Y i , . . . , Yn are involved, representing the m and n measurements on the north pole and the equator respectively. We assume that Xi,..., Xm is a sample from a N(fix, ) -distributed and Y i , . . . , Yn a sample from a N(fiy, -distributed population. We have to specify what is meant by "independence of two samples". Definition II.4.1. Two samples X\,..., Xm and Y i , . . . ,Yn axe said to be statistically independent if the two stochastic vectors ( X i , . . . , X m ) and ( Y i , . . . , Yn) are so. We return to our experiment on the north pole and the equator: To reveal a possible difference in the mean falling-time fix and fiy on the north pole and the equator respectively, we interpret the outcome of the statistic X — Y as an estimation of fix — HY- If the numerical values of a \ and 0.0095
1.A6) = 0.025. J
Because of symmetry, we also have P ( ( V
X
"
" t X ~ 0.0095
F ) n
M y )
< -1.96) = 0.025. J
Consequently, p f-i.96 < V
0.0095
< 1.96) = 0.95. J
After rounding off, we see that P ^-0.019 < ( X - Y ) - ( n x - My) < 0.019^ = 0.95. That is to say, a priori we have a 95% chance that the error in our measurements will be less than 0.019. _ Suppose the measurements result in the following numerical values for X and
F:
X = 4.520 and Y = 4.490. After substitution of these values in the inequality -0.019 < ( X - V ) - { f i x - My) < 0.019, an inequality for fix — My appears: 0.011 < MX - My < 0.049. As before, the open interval (0.011,0.049) is said to present a 95% confidence interval for mx — My • It followsfromour measurements that we can be 95% confident that mx is at least 0.01 laxger than My- (The mean falling-time is larger on the north pole than on the equator. This implies that the gravitational acceleration on the equator is larger than the one on the north pole. Physicist explain this by the fact that the shape of the earth is not a perfect ball. Apart from that, in this example the numerical values of X and Y are fancy). • Remark. Concerning the previous example, see also Exercise 37. As noted already in §11.1, the numerical values of 30 and n > 30 the variable (X-7) - fax - py) y/S'jc/m + S'jr/n is approximately N(0, l)-distributed. For such values of m, n we shall use this statistic, together with Table II, to construct confidence intervals for fix ~ Hy • If we are dealing with small-sized samples, whereas a \ and a \ axe unknown, then there is a problem. To be more specific, in these cases the probability distribution of the statistic _ (JK-Y)-(hx-HY) y/ S'x/m + Sy /n is not related to the common statistical probability distributions. This inconvenience is known as the "Behrens-Fisher problem". Under the additional assumption that a x = 0Y > there is a simple solution to this problem. We discuss it in the following. Definition II.4.2. Suppose Xi,..., dent samples. Then the variable
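The interval just obtained follows from nothing more than the standardization used above. The sketch below is an added illustration; it takes the standard error 0.0095 of X̄ − Ȳ directly from the text rather than recomputing it from m, n, σ_X and σ_Y, and reproduces the 95% confidence interval for μ_X − μ_Y.

    xbar, ybar = 4.520, 4.490       # observed sample means
    se = 0.0095                     # standard error of (xbar - ybar), as given in the text
    z = 1.96                        # 97.5% quantile of the N(0,1)-distribution

    diff = xbar - ybar
    lower, upper = diff - z * se, diff + z * se
    print(round(lower, 3), round(upper, 3))   # 0.011 0.049

After rounding, this is the interval (0.011, 0.049) stated in the text.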
Xm and Y\,... ,Yn are statistically indepen-
_ (m-1)5%+ (»-1)5?. ' m+n—2
52 p
is said to be the pooled variance. Proposition II.4.2. If X\,..., Xm and Yi,... ,Yn are statistically independent samples from a N(nx,