201 106 19MB
English Pages 220 [218] Year 2020
PROCEEDINGS OF THIRD BERKELEY
VOLUME I
THE
SYMPOSIUM
PROCEEDINGS of the THIRD BERKELEY SYMPOSIUM ON MATHEMATICAL STATISTICS AND PROBABILITY Held at the Statistical Laboratory University of California December, 1954 July and August, 1955
V O L U M E
I
CONTRIBUTIONS TO THE T H E O R Y OF STATISTICS
EDITED BY J E R Z Y
NEYMAN
UNIVERSITY OF CALIFORNIA BERKELEY AND LOS ANGELES
1956
PRESS
UNIVERSITY OF CALIFORNIA PRESS B E R K E L E Y A N D LOS A N G E L E S CALIFORNIA
CAMBRIDGE UNIVERSITY PRESS LONDON, ENGLAND
C O P Y R I G H T , 1 9 5 6 , BY T H E REGENTS OF THE UNIVERSITY OF CALIFORNIA
T h e United States Government and its officers, agents, and employees, acting within the scope of their duties, may reproduce, publish, and use this material in whole or in part for governmental purposes without payment of royalties thereon or therefor. T h e publication or republication by the government either separately or in a public document of any material in which copyright subsists shall not be taken to cause any abridgment or annulment of the copyright or to authorize any use or appropriation of such copyright material without the consent of the copyright proprietor.
LIBRARY OF CONGRESS CATALOG C A R D N U M B E R : 4 9 - 8 1 8 9
P R I N T E D IN T H E U N I T E D S T A T E S O F A M E R I C A
CONTENTS OF PROCEEDINGS, VOLUMES II-V Volume II—Probability Theory DAVID BLACKWELL, On a class of probability spaces. SALOMON BOCHNER, Stationarity, boundedness, almost periodicity of random-valued functions. K. L. CHUNG, Foundations of the theory of continuous parameter Markov chains. A. H. COPELAND, SR., Probabilities, observations and predictions. J. L. DOOB, Probability methods applied to the first boundary value problem. ROBERT FORTET, Random distributions with an application to telephone engineering. J. M. HAMMERSLEY, Zeros of a random polynomial. T. E. HARRIS, The existence of stationary measures for certain Markov processes. KIYOSIITÓ, Isotropic random current. PAUL LÉVY, A special problem of Brownian motion, and a general theory of Gaussian random functions. MICHEL LOÈVE, Variational terms and central limit problem. EUGENE LUKACS, Characterization of populations by properties of suitable statistics. KARL MENGER, Random variables from the point of view of a general theory of variables. EDITH MOURIER, /.-random elements and Z. "-random elements in Banach spaces. R. SALEM and A. ZYGMUND, A note on random trigonometric polynomials.
Volume III—Astronomy and Physics O. J. EGGEN, The relationship between the color and the luminosity of stars near the sun. J. L. GREENSTEIN, The spectra and other properties of stars lying below the normal main sequence. H. L. JOHNSON, Photoelectric studies of stellar magnitudes and colors. G. E. KRON, Evidence for sequences in the color-luminosity relationship for the M-dwarfs. G. C. McVITTIE, Galaxies, statistics and relativity. JERZY NEYMAN, ELIZABETH SCOTT and C. D. SHANE, Statistics of images of galaxies with particular reference to clustering. BENGT STROMGREN, The Hertzsprung-Russell diagram. F. ZWICKY, Statistics of clusters of galaxies. ANDRÉ BLANC-LAPIERRE and ALBERT TORTRAT, Statistical mechanics and probability theory. M. KAC, Foundations of kinetic theory. J. KAMPÉ DE FÉRIET, Random solutions of partial differential equations. E. W. MONTROLL, Theory of the vibration of simple cubic lattices with nearest neighbor interactions. NORBERT WIENER, Nonlinear prediction and dynamics.
Volume IV—Biology and Problems of Health JAMES CROW and MOTOO KIMURA, Some genetic problems in natural populations. E. R. DEMPSTER, Some genetic problems in controlled populations. JERZY NEYMAN, THOMAS PARK and ELIZABETH SCOTT, Struggle for existence. The Tribolium Model: biological and statistical aspects. M. S. BARTLETT, Deterministic and stochastic models for recurrent epidemics. A. T. BHARUCHAREID, On the stochastic theory of epidemics. C. L. CHIANG, J. L. HODGES, JR. and J. YERUSHALMY, Statistical problems in medical diagnoses. JEROME CORNFIELD, A statistical problem arising from restrospective studies. D. G. KENDALL, Deterministic and stochastic epidemics in closed populations. W. F. TAYLOR, Problems in contagion.
Volume V—Econometrics, Industrial Research, and Psychometry K. J. ARROW and LEONID HURWICZ, Reduction of constrained maxima to saddle-point problems. E. W. BARANKIN, Toward an objectivistic theory of probability. C. W. CHURCHMAN, Problems of value measurement for a theory of induction and decisions. PATRICK SUPPES, The role of subjective probability and utility in decision-making. A. H. BOWKER, Continuous sampling plans. CUTHBERT DANIEL, Fractional replication in industrial research. MILTON SOBEL, Sequential procedures for selecting the best exponential population. T. W. ANDERSON and HERMAN RUBIN, Statistical inference in factor analysis. FREDERICK MOSTELLER, Stochastic learning models. HERBERT SOLOMON, Probability and statistics in psychometric research: Item analysis and classification techniques.
ACKNOWLEDGMENT Most of the papers published in this volume were delivered at the sessions of the Symposium held during the summer of 1955. These sessions were organized in cooperation with the IMS Summer Institute, 1955, Professor David Blackwell, Chairman.
PREFACE on Mathematical Statistics and Probability was held in two parts, one from December 26 to 31, 1954, emphasizing applications, and the other, in July and August, 1955, emphasizing theory. The Symposium was thus divided because, on the one hand, it was thought desirable to provide an opportunity for contacts between American and foreign scholars who could come to Berkeley in the summer, but not in the winter, and because, on the other hand, the 121st Annual Meeting of the American Association for the Advancement of Science held in Berkeley in December, 1954, provided an opportunity for joint sessions on the various fields of applications with its many member societies. With the help of Dr. Raymond L. Taylor, of the AAAS, nine cosponsored sessions of the Symposium were organized. Two of these were given to problems of astronomy and one each to biology, medicine and public health, statistical mechanics, industrial research, psychometry, philosophy of probability, and to statistics proper. The importance of the second part of the Symposium, which emphasized theory, was increased by the decision of the Council of the Institute of Mathematical Statistics to hold its first Summer Institute in Berkeley and to hold this Institute "in conjunction with the Third Berkeley Symposium"; all members of the IMS Summer Institute were invited to participate in the Symposium and the two enterprises were conducted in parallel. In particular, the cooperation of Professor David Blackwell, Chairman of the IMS Summer Institute, made it possible to ensure that representatives of all the major centers of statistical research in this country be invited. As will be seen from the lists of contents of the Proceedings, the response was good, although various circumstances, including the concurrent Rio de Janeiro meeting of the International Statistical Institute, prevented some of the prospective participants from attending the Berkeley meetings. Two months were allotted to the second part of the Symposium in order to provide an opportunity not only for formal presentation of papers, but also for informal contacts among the participants. To facilitate such personal associations, after three weeks of intensive lectures and discussions, a trip was made to the Sierra. There, animated discussions of stochastic processes and of decision functions were interspersed with expressions of delight at the beauty of Yosemite Valley, Emerald Bay, and Feather River Canyon. After this vacation there was another period of intensive lecturing. Although much effort was expended to arrange lectures and personal contacts, the primary concern of the Statistical Laboratory and of the Department of Statistics was with the Proceedings. Because of the participation of the AAAS, the amount of material submitted for publication was estimated to be equivalent to thirteen hundred printed pages, roughly twice the length of the Proceedings of the Second Berkeley Symposium. This presented a most embarrassing problem. That it was finally solved is largely the result of the most effective support and advice of Dr. John Curtiss, Executive Director of the American Mathematical Society. His organizational talent and friendly help are greatly appreciated. Special thanks are due Mr. August Fruge, the Manager of the Publishing Department of the University of California Press, and also his staff, who undertook the difficult and costly publication in the best spirit of cooperation with, and of service to, the scholarly community. 
T H E T H I R D B E R K E L E Y SYMPOSIUM
vii
viii
PREFACE
Since a single thirteen-hundred-page volume would have been difficult to handle and, for the majority of scholars, too expensive to buy, it was decided to issue the Proceedings in five relatively small volumes, each given to a specialized and, so far as possible, unified cycle of ideas. A list of contents of the other four volumes of the Proceedings will be found preceding this preface. The initial steps in the organization of the Symposium were based on a grant obtained from the University of California through the good offices of Professor Clark Kerr, Chancellor of the Berkeley campus of the University of California, to whom sincere thanks are due. This grant was followed by an appropriation from the Editorial Committee of the University of California, which provided the nucleus of the fund eventually collected for the publication of the Proceedings. This action of the Editorial Committee is gratefully acknowledged. For further effective support of the Symposium thanks must be given the National Science Foundation, the United States Air Force Research and Development Command, the United States Army Office of Ordnance Research, and the United States Navy Office of Naval Research. I t is hoped that the material in the present Proceedings and, particularly, the scientific developments stimulated by the Symposium, will be sufficiently important to prove that the money received from these organizations was well spent. The success of the Symposium was, in large part, made possible by the generous and effective support of a number of scholarly societies. Sessions of the first part of the Symposium were sponsored by the American Physical Society; the American Statistical Association; the Astronomical Society of the Pacific; the Biometric Society, Western North American Region; the Ecological Society of America; the Institute of Mathematical Statistics; the Philosophy of Science Association; and the Western Psychological Society. The American Mathematical Society sponsored the second part of the Symposium, delegating for organizational work two of its most distinguished members, Professor J. L. Doob and Professor G. Polya, whose advice and cooperation were most helpful. The 1955 Summer Institute of the Institute of Mathematical Statistics was held in conjunction with the Symposium; the IMS also held its Western Regional Meeting in Berkeley in July. All papers published in these Proceedings were written at the invitation of the Statistical Laboratory, acting either on its own initiative or at the suggestion of the cooperating groups, and the Laboratory is, therefore, responsible for the selection of the authors, a responsibility that does not extend to the contents of the papers. The editorial work on the manuscripts submitted was limited to satisfying the requirements of the University of California Press regarding the external form of the material submitted, the numbering of formulas, etc., and to correcting obvious errors in transcription. Most of this was done by the staff of the Laboratory, in particular, Miss Catherine FitzGibbon, Mrs. Jeanne Lovasich, Mrs. Kathleen Wehner, and my colleague, Dr. Elizabeth L. Scott, who supervised some of the work. Occasionally, manuscripts were read by other participants in the Symposium particularly interested in them, and the authors were offered suggestions. However, in no case was there any pressure on the authors to introduce any material change into their work. Jerzy Neyman
Director, Statistical Laboratory Chairman, Department of Statistics
CONTENTS JOSEPH BERKSON—-Estimation by Least Squares and by Maximum Likeli-
hood Z. W . BIRNBAUM—On a Use of the Mann-Whitney Statistic
1 . . . .
13
HERMAN CHERNOFF and HERMAN RUBIN—The Estimation of the Location
of a Discontinuity in Density
19
ARYEH DVORETZKY—On Stochastic Approximation
39
SYLVAIN EHRENFELD—Complete Class Theorems in Experimental Design
57
G. ELFVING—Selection of Nonrepeatable Observations for Estimation.
69
.
ULF GRENANDER and MURRAY ROSENBLATT—Some Problems in Estimating
the Spectrum of a Time Series
77
J . L . HODGES, JR. and E. L . LEHMANN—Two Approximations to the Rob-
bins-Monro Process
95
WASSILY HOEFFDING—The Role of Assumptions in Statistical Decisions .
105
SAMUEL KARLIN-—Decision Theory for Polya Type Distributions. Case of
Two Actions, I
115
L. LE CAM—On the Asymptotic Theory of Estimation and Testing Hypotheses
129
HERBERT ROBBINS—An Empirical Bayes Approach to Statistics
.
157
MURRAY ROSENBLATT—Some Regression Problems in Time Series Analysis
165
CHARLES STEIN—Efficient Nonparametric Testing and Estimation
187
.
.
.
CHARLES STEIN—Inadmissibility of the Usual Estimator for the Mean of a
Multivariate Normal Distribution B . L . VAN DER WAERDEN—The C o m p u t a t i o n of t h e X - D i s t r i b u t i o n
197 207
ESTIMATION BY LEAST SQUARES AND BY MAXIMUM LIKELIHOOD JOSEPH BERKSON MAYO CLINIC AND MAYO FOUNDATION *
We are concerned with a functional relation: (1)
Pi = F(Xi,
(2)
Yi=
a+
a, ß) =F(
F,)
ßXi
where P , represents a true value corresponding to xit a, j3 represent parameters, and F,is the linear transform of Pi. At each of r ^ 2 values of x, we have an observation of pi which at Xi is distributed as a random variable around P , with variance a\. We are to estimate the parameters as a, /3 for the predicting equation (3)
Pi = F(Xi,
â, ß ) .
By a least squares estimate of a, /3 is generally understood one obtained by minimizing (4) Although statements to the contrary are often made, application of the principle of least squares is not limited to situations in which p is normally distributed. The GaussMarkov theorem is to the effect that, among unbiased estimates which are linear functions of the observations, those yielded by least squares have minimum variance, and the independence of this property from any assumption regarding the form of distribution is just one of the striking characteristics of the principle of least squares. The principle of maximum likelihood, on the other hand, requires for its application a knowledge of the probability distribution of p. Under this principle one estimates the parameters a, ¡3 so that, were the estimates the true values, the probability of the total set of observations of p would be maximum. This principle has great intuitive appeal, is probably the oldest existing rule of estimate, and has been widely used in practical applications under the name of "the most probable value." If the pi's are normally distributed about Pi with a\ independent of P „ the principle of maximum likelihood yields the same estimate as does least squares, and Gauss is said to have derived least squares from this application. In recent years, the principle of maximum likelihood has been strongly advanced under the influence of the teachings of Sir Ronald Fisher, who in a renowned paper of 1922 and in later writings [1] outlined a comprehensive and unified system of mathematical statistics as well as a philosophy of statistical inference that has had profound and wide development. Neyman [2] in a fundamental paper in 1949 defined a family of estimates, * The Mayo Foundation, Rochester, Minnesota, is a part of the Graduate School of the University of Minnesota. i
THIRD BERKELEY SYMPOSIUM: BERKSON
2
the R.B.A.N. estimates, based on the principle of minimizing a quantity asymptotically distributed as x2, which have the same asymptotic properties as those of maximum likelihood. F. Y. Edgeworth [3] in an article published in 1908 presented in translation excerpts from a letter of Gauss to Bessel, in which Gauss specifically repudiated the principle of maximum likelihood in favor of minimizing some function of the difference between estimate and observation, the square, the cube or perhaps some other power of the difference. Edgeworth scolded Gauss for considering the cube or any other power than the square, and advocated the square on the basis of considerations that he advanced himself as well as on the basis of Gauss's own developments in the theory of least squares. Fisher's revival of maximum likelihood in 1922 is thus seen to be historically a retrogression. Whether scientifically it was also a retrogression or an advance awaits future developments of statistical theory for answer, for I do not think the question is settled by what is now known. When one looks at what actually has been proved respecting the variance properties of maximum likelihood estimates, we find that it comes to little or nothing, except in some special cases in which maximum likelihood and least squares estimates coincide, as in the case of the normal distribution or the estimate of the binomial parameter. What has been mathematically proved in regard to the variance of maximum likelihood estimates almost entirely concerns asymptotic properties, and no one has been more unequivocal than Sir Ronald Fisher himself in emphasizing that this does not apply to real statistical samples. I hasten to note that, from what has been proved, there is a great deal that reasonably can be inferred as respects approximate minimum variance of the maximum likelihood estimate in large samples. But these are reasonable guesses, not mathematical proof; and sometimes the application in any degree, and always the measure of approximation, is in question. Of greatest importance is this: the maximum likelihood estimate is not unique in possession of the property of asymptotic efficiency. The members of Neyman's class of minimum x2 estimates have these properties and he introduced a new estimate in this class, the estimate of minimum reduced x2. Taylor's [4] proof that the minimum logit x2 estimate for the logistic function and the minimum normit x 2 estimate for the normal function advanced by me [5], [6] fall in this class directs attention to the possibility of its extension. In this paper is presented such an extension applying to a particular situation in which Pi = 1 — Qi is the conditional probability given Xi of some event such as death, and where F, = a + fixi is the linear transform of i \ . This is the situation of bio-assay as it has been widely discussed. We define a class of least squares estimates either by the minimization of (5)
(A)
^Wi(Pi-Pi)*
where pi is an observed relative frequency at Xi, distributed binomially about P „ p, is the estimate of Pi and l/m>i is any consistent estimate of the variance of or by the minimization of (6)
(B)
^Wi(
y i
-Si)
2
where j i is the value of the linear transform Yi corresponding to pi, = a + $Xi is the estimated value of the linear transform in terms of a, the estimates of a, ft, respec-
LEAST SQUARES AND MAXIMUM LIKELIHOOD
3
tively, and 1 /Wi is any consistent estimate of the variance of yi. The quantities (5) and (6) which are minimized are asymptotically distributed as x2The minimum logit x2 estimate and the minimum normit x2 estimate fall in the defined class of least squares estimates (B), and, as I mentioned previously, Taylor proved that these are R.B.A.N, estimates. Recently Le Cam kindly examined the class of estimates given by the extended definition and in a personal communication informed me that, on the basis of what is demonstrated in the paper of Neyman previously referred to and Taylor's paper, this whole class of least squares estimates can be shown to have the properties of R.B.A.N, estimates. They are therefore asymptotically efficient. The defined class contains an infinity of different specific estimates, of which a particular few suggest themselves for immediate consideration. Suppose we minimize (7) where »; is the number in the sample on which the observed pi is based and the pi of the weight Wi is constrained to be the same value as the estimate rpl. If f>t is a consistent estimate of Pi, then Pifti/ni is a consistent estimate of the variance of />,• and this estimate falls in the defined class. Now the expression (7) is identically equal to the classic x2 of Pearson, so that this particular least squares estimate is identical with the minimum x2 estimate. Suppose we have some other consistent estimate of Pi which we shall symbolize as •po = 1 — (omitting the subscripts i) and we minimize (8)
£
*
to-^)»;
then this is a least squares estimate as defined. The weights Wi — ni/p0Q0 are now known constants, and to minimize (8) we set the first derivatives equal to zero and obtain the equations of estimate
If now we specify that in the conditional equations (9), (10), p0 = pi, that is, that the values yielded as the estimates shall be the same as those used in the coefficients, then the equations of estimate become
The equations (11) and (12) are just the equations of estimate of the M.L.E. Therefore the M.L.E. is also a particular member of the defined class of least squares estimates. This may be presented more directly in a way that emphasizes an interesting point. Suppose the independently determined consistent estimate p 0 to be used in the constant weights for minimizing (8) is in fact the one obtained as the solution of (11) and (12). Then pi, the estimate obtained, will be the same as was used in the weights and this is the M.L.E. This is clear if we observe that we should obtain these least squares estimates as the solution of (9), (10), and we already have noted that these are
4
THIRD BERKELEY SYMPOSIUM: BERKSON
satisfied with p0 = Pi if p, is the M.L.E. The estimate obtained by minimizing (8) is consistent with the estimate used in the weights, only if the estimate appearing in the weights in equation (8) is the M.L.E. For instance, if we use for p„ in the weights w( not the M.L.E. but the minimum x 2 estimate, the estimate which will be obtained is not the minimum x2 estimate, nor is it the M.L.E., but another estimate which is neither, although it too is asymptotically efficient. This is seen at once if we note that the conditional equations of estimate for the minimum x 2 estimate [7] are not (11), (12) but (13) (14) I should like to make quite clear and convincing that the M.L.E. is derivable as a least squares estimate and that this is not an artificial contrivance used to lure the M.L.E. into the family of defined least squares estimates. In fact there is a reasonable way of proceeding by which the M.L.E. is derived as the most natural or least arbitrary of the least squares estimates of the family (A). Suppose one had never heard of the M.L.E. but only of a least squares estimate in the sense of minimizing (5). To obtain such an estimate, it would be natural to wish to use for 1 /wt the value P,Qi/nt. It would be realized that we did not have PjQi/iti because we did not have the true value P , = 1 — Qi, and it would be natural to suggest that, not knowing the true value, we should use an estimate. But what estimate? It would be reasonable to say that we should use the same estimate as the one which we were going to use for the final estimate. If this is what we wished to accomplish how would we go about getting the desired result? We might proceed by taking first some provisional estimate, f 0 = 1 — ?• We would then obtain the least squares estimates by differentiating (5), yielding the estimating equations (9), (10). The least squares estimates would thus be the solution of these equations (9) and (10). In general the estimates obtained would not be the same as those used provisionally, that is p, p0- We would then take the estimates just obtained and use them as new provisional values to obtain new estimates. We would notice that the new estimates are closer to those used originally and repeating the procedure we would notice that the two estimates, the one used in the weights and the one obtained using these weights, became closer and closer to one another. At some point we would be satisfied that we had fulfilled closely enough the objective of obtaining a least squares estimate minimizing (5) with the weights Wi in terms of the estimates, that is = Now the procedure that I have described is just the mechanics of obtaining a M.L.E. in the standard way. For what we would be doing is obtaining by iteration a solution of equations (11), (12), which are the estimating equations of the M.L.E. Objectively an estimate is defined by the estimating equations of which it is the solution, and not by the motivation by which these equations are obtained. The estimating equations (11), (12) can be obtained from a requirement to meet the condition of maximizing the probability of the sample, but they are also derived if a least squares criterion is set up, as just described. It is therefore as legitimate to consider the estimate a least squares estimate as it is to consider it a M.L.E. 
It suggests itself as a possibility that the minimum variance characteristics of the M.L.E., such as they are, are obtained by this estimate because it minimizes a squared deviation of the observation from the estimate rather than because it maximizes the probability of the sample.
LEAST SQUARES AND MAXIMUM LIKELIHOOD
5
The most familiar consistent estimate of the variance of pi is given by piqi/fii where pi = 1 — qi is the observed relative frequency and is the number in the sample on which it is based. If we use this estimate to define Wi we shall minimize (15) The expression (15) is equal to the reduced x 2 of Neyman, so that another familiar estimate in the defined class of least squares estimates is the minimum reduced x2 estimate of Neyman. Now I turn for a moment to the least squares estimates (B) defined in terms of the linear transform y. A consistent estimate of the variance of y is given by (16)
0, a pair of numbers Mt:a, Ne,a so that (3.2.3)
Pr{pg.4> + e}
^ 1 - a ,
if
m ^ Mta, n ^ Nt a.
From here on we shall be concerned mainly with this problem.
3.3. A one-sided confidence interval for p, not depending on norma
us first consider the case when F is known and G not known. This situation arises, for example, when it is easy to obtain a practically unlimited number of observations of X, and hence to reconstruct F as accurately as desired, but only a finite sample Fi, • • •, Y„ of Y can be obtained [this would correspond to lim (n/m) = 0 in the general case]. Let Y* < Y* < • • • < F * be the ordered sample of Y and 0 (3.3.1)
G,(j)
=
-n 1
Y* for Yt ^ s < Y*+i for i
0 and a left-hand limit /3o > 7o > 0. If there were available a consistent estimate of ao, one should be able to obtain an interval whose length approaches zero but which contains ao and many observations with large probability. The density in this interval is very similar to that in the original problem. We may transform our small interval to (0, 1), forget all observations not in this interval and apply the estimate a which is maximum likelihood for the original problem. It is reasonable to expect this "quasi-maximum likelihood estimate" to have the asymptotic distribution discussed above. One may now ask whether the "true" maximum likelihood estimate would have a better distribution. We shall show that under certain regularity conditions, this is not so. In fact, the two estimates differ by a quantity which is small compared to 1 ¡n. In the general problem the case where 7 = 0 introduces technical difficulties and has not been treated. 2. Consistency for the special problem We shall later treat the generalized version of the problem introduced above under certain regularity conditions. One of these conditions will be the consistency of the maximum likelihood estimate. In this section we shall prove that the maximum likelihood estimate is consistent for the special problem of the introduction. In doing so we shall make use of the Op and op notation of Mann and Wald [1] and a paraphrase due to Pratt of a section of one of their theorems [2]. Roughly speaking, this theorem states that the calculus of Op and op is the same as that of 0 and 0. In later sections we shall make use of a slight extension of another of their theorems which states that JZ\g(yn)] if £{y{n)) Jl(y) and the set of discontinuities of g is of probability measure zero with respect to the limiting distribution J}{y).
22
THIRD BERKELEY SYMPOSIUM: CHEKNOFF AND RUBIN
Since a maximizes (2) we may treat a as a function of the sample c.d.f. Fn . (16)
A =*(&).
Let F0 denote the "true" c.d.f. That is, (17)
F
0
F
0
( x ) = p ( x )
0
x
= l - y
f o r O ^ » g a ( l - x )
0
f o r a
0
0
^ 3 ^ 1
The function H
( 1 8 )
0
( x )
PoM
, „
r- / „ M ,
= F o ( s ) l o g ^ ^ + [ l - F o ( s ) ] l o g
1-F0(*) 1- *
O ^ x ^ l ,
reaches a positive peak at x = a 0 . Both right-hand and left-hand derivatives exist at x = ao, but they are not in general equal to zero. It can be shown that for a sequence of nonrandom c.d.f.'s G„(x) such that (19)
G
sup 0 0 such that M(00, a) < (47) sup > 1 x/n 1 - 2 «
for b > K'.
(B)
PROOF. A comparison of M(6o, a) with Br{y ) clearly establishes the desired result once it is shown that there is a K' such that
(72)
+
where K is the constant in the proof of lemma 1. [There we proved that n(a — a0) fi K with large probability.] The above inequality follows immediately upon applying Chebyshev's inequality. LEMMA 7. As n—» °°
(73)
£ [R (y l -
e
which implies the desired result. [In fact, T(y, b) —» T(y, ®) with probability one as &-» « . ] Applying lemmas 5 and 8 we have THEOREM 1. If the maximum likelihood estimates (6, a) converge to (6a, AO) when these are the true values of the parameters and conditions Ri, i?2 andR3 are satisfied and j3(0o, ao) > 7(^0, ao) > 0, then (79)
£[n{a-a,)\
->£[T(y,
®) ].
29
ESTIMATION OF A DISCONTINUITY Remarks.
1. Essentially we have also proved that a is close to x (r n) with probability almost equal to P { R ( y , = r\. 2. The case where y(6o, ao) > /3(0o, a0) > 0 is trivially related to the one we treated. 3. Except for a scale factor, the distribution of T(y, depends only on 0(00, a o ) M K ao). 4. It is clear that the special problem discussed in section 2 satisfies the regularity condition with 6 = ¡3. Since consistency was established in section 3, it follows that theorem 1 applies to this example. 5. Because of technical difficulties, it was decided not to treat the case where 7(0o, ao) = 0. One would expect that in this case a tends to be very close to the largest observation to the left of ao. 6. A variation of this problem arises when several related points are discontinuity points of the density. For example, we may be interested in the parameters of a rectangular distribution of known range. In general suppose that ^i(a), (a), - • •, \j/m(a) are the location of m discontinuities. Then a reasonable modification of condition R3 would require that (80)
r
J-00
{X
'
da
a)
d* = ] £ > < (a0) [ y i (0, a0) - ft (0, a0) ]
where ft and 7,• represent the left-hand and right-hand limits at the ith discontinuity point. Then M(6, ao) would be naturally replaced by (81)
m
I1««
~ l Q g y* (
e
>
a
» )
1
+
(a-a«)
^
1
i— 1
- F B [ ^ i ( a 0 ) ] } - ^ - ( a o ) [ f t (0, a0)
- y
{
( 0 ,
a0) ] (a - a0) .
The value of a which is close to ao and maximizes this expression has an asymptotic distribution which is determined by the natural extension of the random walk problem we treated in this section. 5. Quasi-maximum likelihood estimates In this section we shall define the quasi-maximum likelihood estimate and examine its properties. This estimate tends to be insensitive to irregularities in the distribution function and its properties can be shown to be the expected ones without the application of involved regularity conditions. Suppose that a* is a consistent estimate of a. Then there is a sequence {an\ —»0 such that nan—> °° and a* — a0 = o ( a ' i ) . Map the interval [a* — a„, a* + a ] into [0, 1] by a linear transformation. If m observations fall in [a* — an, a* + an], they give rise to a sample c.d.f. F* on [0, 1] under the above transformation. It would seem natural to define the quasi-maximum likelihood estimate with respect to a* and {an} by p
( 8 2 )
a * + a
n
[ 2 $ > ( F *
n
m
)
-
1]
where $ is defined in equation (16). [i^FJ is the true maximum likelihood estimate of a for the special problem of section 2.] However, this estimate has the following short-
30
THIRD BERKELEY SYMPOSIUM: CHERNOFF AND RUBIN
coming. Suppose a* is selected so that there is some observation within a* — an and a* — fl„ + n~n. Then $(F*) is likely to be close to 0 instead of 1/2. To avoid this possibility we define to be that value of r in | r — 1/21 ^ an which maximizes (83)
S(r)
=Fm
(t) l
o
g
[
i
-
p
l
jog1
( t ) ]
1 - T
Let our quasi-maximum likelihood estimate be (84)
a**
= a * + < i n [ 2 $ * (Ft)
THEOREM 2. If a* converges in probability
-1],
to AO and
lim f (x, d0, a0) = P (d0, a0) (8 5) lim / (x, do, a 0 ) = y (60, ao)
I—» 7 (00, ao) > 0 ,
£[n(a**-*o)]->£[T(y,
P (0o, a 0 ) +y
(do, a 0 )
2y (dp, a0) P (6o, ao) +7
=-{F m
n
[a
0
+2a
(0o, ao) n
( r - r„) ] - F „ ( a „ ) }.
Let (92)
a = a 0 + 2a„ (t — To) .
Then |M
(93)
- = a„ [p (d0, a 0 ) + t ( 0 o , a 0 ) ] [ 1 + n
(1) ] ,
and (94)
5 (r) = 5 (T«)
an(P
(do, aa) +y
, + (60, a0)
[1 +
(1) 1 M* (R) + (a — ao) op (1) | ,
ESTIMATION OF A DISCONTINUITY
31
where (95)
M* (r) = [Fn (a) -Fn
(96)
c0 = 0
(a0) ] -
C0
(a - a„)
0, there is, by the law of large numbers, an M such that i=2 (97)
( . X i - u i )
> 0 for all r > m \ > l - t . Hence
1 — e < P {Pi = Rim and P i is unique} ^ P {Pi is unique}.
32
THIRD B E R K E L E Y SYMPOSIUM: CHERNOFF AND RUBIN
Since e is arbitrary, P\R\ is unique} = 1. Furthermore, for M ^ RI, P{SI g SI\RI = rx} = P{S\m ^ j Rim = nl - The proof for (R2, S2) is similar. Examining equation (65), it becomes evident that we are interested in (R, S, T) where « i = jo/co, co2 = /?o/co (98)
R
=
R
(99)
R = - R
2
S
,
= — SI,
T = S + —
To
S =
po
c0
Co
=
Co
if
Si < - 5 , To ßo
1 Co
if
-s Si > 2 To ßo
1 c0
Suppose that the densities of (RI, SI), (R2, S2) and (R, S) are given by fr(s), gr(s) and hT(s), respectively. Suppose also that the cumulative distribution functions of SI and S2 are F(s) and G(s). We express hT(s) in terms of the other quantities. LEMMA 1 0 . For
(100)
r >
0
hr(s)
=7ofr(yos)G(-p0s-w2)
and, (101)
hT(s)
= /Sog-r [ — Wos-0,2)
] [ 1 - F ( T o 5) ]
for
r0,
kr{t) =hT(t-r-^~-),
r < 0,
(102)
and the density of T is = J > r ( i - £ ) ,
t> 0 ,
(103)
¿(0 = 2 A ' O T = —
1 ^ V
) .
C0
,
for
s* > — (r + 1) a>-
s*>
for gr {s)
ir 2 —P{Ri
(110)
i * > — coi
=p{S1^0}
g i ( i * ) = vi e~'*-»*
(108) (109)
= p { i ? 1 = 1}
for
= 1 } =P
ds
for
s*
> -
{S2 = 0 } .
We shall transform the above recursion relationships to a slightly more suitable form. Let (111)
,,(«)
then (112)
[ ^ ( « - r - 1 ) ] ^ . .
770(u) = 1 /•min ( r + l , u * )
7]r+i (w*) = /
J0
7ir(u)du
for
u> 0
for
U* > 0 .
Since Vr(u) is an r — i fold integral for 0 < u < i + 1, i ^ r, it must have r — i continuous derivatives on this interval. On the other hand, rjr {u) is clearly a polynomial of degree r — i for i < u < i + 1, i g r. We use the fact that (113)
r,r{u) =—t f!
for
0\—r m (r — 1)! T T
— + •••+
a
fl4
( r
and (117)
jjr(«)=ar
for
u> r .
i-17
_
0 l
(u-i) r~ i+i
=-T-TV7 (r —t-\-1)!
34
THIRD BERKELEY SYMPOSIUM: CHERNOFF AND RUBIN
Furthermore,
CO
(118) 7T =P{51>0} = V (119)
1= -
V
C O fC°fr(s)ds
aT (e—
/"\(w)
= *1'y/
Wl)
= -
V~
(
e-^^+'du
« )r
and (120)
(121)
1-tt1=P{51 - l
and (141)
?r(u)du
r,+i(«*) = /
for
M*>-1.
I t is easy to check that „42,
f r M
-
gr(s)
(143)
(»+;_+»'-,,>
= — («»«--.) r e - ' o)2
7-
• — JJ !
,
5>-w2.
The closed form expression for ¿(0, t < 0, seems to be difficult to obtain. The authors hope that the difficulty may be resolved in the near future so that the moments of the asymptotic distribution can be conveniently computed. I t may be remarked, however, that replacing c0 by one (as Karlin's method does for the special problem) may lead to a relatively poor asymptotic distribution in the case where po/yo is large. A P P E N D I X . REGULARITY CONDITIONS CONDITION Ri. The following limits
lim / (x, 0, a) = / 3 ( 0 , a o ) *->«„ a a
(144)
>0
x-?< aa
and
lim / (x,d, a) =y(6,a0)
(145)
>0
«-»«„ X>a
hold uniformly in 0 for 0 in some neighborhood about 0O. Also ¡3(9, a 0 ) and y(6, a0) are continuous in 0 at 0O. We may assume without loss of generality that /3(0O, o 0 ) > 7(0o, ao). CONDITION i?2- For x € B(a, a 0 ) (x not between a and a 0 ) (146)
^
/ (*» e> a ) ~
lp
g / (x> e> a °)
o. — Clo
=
d lp
g / (*» e>a") +h(x) d CL
o( 1)
where E{ \H(X) \ |0O, a0} < For x and 0 in some intervals about a 0 and 0O, d log f(x, 0, ao)/da is bounded. CONDITION R3. For some interval about 60, ri47i
1 y > log / ( X j , 0, B0)
U
nAi
;
da
converges uniformly in probability to (148)
5(0)
where 5(0) is continuous at 0O and 5(0o) = 7(0o> ao) — /3(0o, ao)-
ESTIMATION OF A DISCONTINUITY
37
The last part of R 3 is related to some simple interchange of derivative and integral conditions which are apparent from the following argument. (149)
C / (^M) ~ / ^Al^L J—co a — cio
Suppose a > ao. Consider the intervals (— 00 j (150)
r°^{x,e,a J-co da
0
)dx+o(i) +
Ja
aa
d x =
0
.
(ao, a),
00
). Then
e,a0)dx
+ o(l)+[jS(0,ao) - 7 « U o ) ] + o(l) (151)
y(0,a0)
-/3(0,ao)
= f ^ (x, 6, a 0 ) dx =e\— J-cx> a a (
aa
L I
=0 Ooj
)
.
It is clear that the special problem treated in section 2 satisfies the regularity conditions. REFERENCES [1] H. B. MANN and A. WALD, "On stochastic limit and order relationships," Annals of Math. Stat., Vol. 14 (1943), pp. 217-226. [2] H. CHERNOFF, "Large sample theory: parametric case," Annals of Math. Stat., Vol. 27 (1956), pp. 1-22. [3] T. W. ANDERSON and D. A. DARLING, "Asymptotic theory of certain 'goodness of fit' criteria based on stochastic processes," Annals of Math. Stat., Vol. 23 (1952), pp. 193-212. [4] N. H. ABEL, Oeuvres Completes, Vol. 1, Christiania, C. Grondahl, 1839, p. 102.
ON STOCHASTIC APPROXIMATION ARYEH DVORETZKY HEBREW UNIVERSITY OF JERUSALEM AND COLUMBIA UNIVERSITY
1. Introduction Stochastic approximation is concerned with schemes converging to some sought value when, due to the stochastic nature of the problem, the observations involve errors. The interesting schemes are those which are self-correcting, that is, in which a mistake always tends to be wiped out in the limit, and in which the convergence to the desired value is of some specified nature, for example, it is mean-square convergence. The typical example of such a scheme is the original one of Robbins-Monro [7] for approximating, under suitable conditions, the point where a regression function assumes a given value. Robbins and Monro have proved mean-square convergence to the root; Wolfowitz [8] showed that under weaker assumptions there is still convergence in probability to the root; and Blum [1] demonstrated that, under still weaker assumptions, there is not only convergence in probability but even convergence with probability 1. Kiefer and Wolfowitz [6] have devised a method for approximating the point where the maximum of a regression function occurs. They proved that under suitable conditions there is convergence in probability and Blum [1] has weakened somewhat the conditions and strengthened the conclusion to convergence with probability 1. The two schemes mentioned above are rather specific. We shall deal with a vastly more general situation. The underlying idea is to think of the random element as noise superimposed on a convergent deterministic scheme. The Robbins-Monro and KieferWolfowitz procedures, under conditions weaker than any previously considered, are included as very special cases and, despite this generality, the conclusion is stronger since our results assert that the convergence is both in mean-square and with probability 1. The main results are stated in section 2 and their proof follows in sections 3 and 4. Various generalizations are given in section 5, while section 6 furnishes an instructive counterexample. The Robbins-Monro and Kiefer-Wolfowitz procedures are treated in section 7. Because of the generality of our results the proofs in sections 3 and 4 have to overcome a number of technical difficulties and are somewhat involved. A special case of considerable scope where the technical difficulties disappear is discussed in section 8. This section is essentially self-contained and includes an extremely simple complete proof of the mean-square convergence result in the special case, which illustrates the underlying idea of our method. In section 8 we also find the best (unique minimax in a nonasymptotic sense) way of choosing the an in a special case of the Robbins-Monro scheme [they are of the form c/(n + c')}. The concluding section 9 contains some remarks on extensions to nonreal random variables and other topics. Since the primary object of this paper is to give the general approach, no attempt has been made to study any specific procedures except the well-known Robbins-Monro and Kiefer-Wolfowitz schemes which serve as illustrations. Research sponsored in part by the Office of Scientific Research of the Air Force under contract AF 18 (600)-442, Project R-34S-20-7.
39
4°
THIRD BERKELEY SYMPOSIUM: DVORETZKY
2. Statement of the main results Let (0 = {«}, y) be a probability space. X = X(o>), Y = F(co) and Z — Z{co), as well as the same letters with primes or subscripts or both, will denote (real) random variables, and the corresponding lower-case letters will denote values assumed by the random variables. Tn, T'n and T'n', n = 1, 2, • • •, will denote measurable transformations from «-dimensional real space into the reals. Instead of writing Tn(ri, •••,?•„) we shall often write Tn(rn) exhibiting only the last argument. E{ } and P{ } will denote the expected value of the random variable and the probability of the event within the braces, respectively. It is difficult to strike the proper balance between generality of result and simplicity of statement. We shall first state only a moderately general version of our results and follow it by an extension. Further generalizations will be given in section 5. THEOREM. Let a„, /3n and yn,n= 1, 2,•••, be nonnegative real numbers satisfying (2.1)
lima„ = 0 , = CO 00 2 > < » »=1
(2.2) and
00 ^ 7 » =
(2.3)
00
71 = 1
Let 6 be a real number and Tn, n — l,2,---,be (2.4)
\TArw-,
rn) -6
measurable transformations satisfying
\ ^ max [a„, (1 + /3n) | r n - 0 | - T n l
for all real n , • • •, r n . Let X\ and F„, n = 1, 2, • • •, be random variables and define (2.5)
X n + 1 (hi) = Tn [Xi ( « ) , • • • , X , ( « ) ] + F„ (co)
for w Si 1. Then the conditions E{X\\
< °°, 00
(2.6)
n=l
and
(2.7)
E{ F.I *!,•••, xn\ = 0
with probability 1 for all n, imply (2.8)
lim E{ (Xn — 0)2} = 0 n=co
and (2.9)
P{lim Xn= CO
6} = 1 .
The main difficulty is in proving (2.8); once this is done (2.9) follows by a simple device. In the theorem a„, /3„ and the restoring effect y„ are assumed independent of the observations xh • • •, xn. This need not be so and the following statement dispenses with this assumption. EXTENSION. The theorem remains valid if a„, P„ and yn in (2.4) are replaced by non-
ON STOCHASTIC APPROXIMATION negative Junctions a„(ri, • • •, rn), /3n(ri, •• •, r „ ) and y„(n,"
(2.10)
• •, rn), respectively, provided they
a„(n, • • •, r„) are uniformly
satisfy the conditions: The functions
41
bounded and
lim a n ( f 1 ; • • • , ) • „ ) = 0 n = co
uniformly for all sequences rh • • •, rny • • •; the functions
fin(ri,''
*>
measurable and
CO
(2.11)
^¡3n(ri,---,rJ n=l
w uniformly
bounded and uniformly
^unctions yn(ru
convergent for all sequences ru - • •, rn, • • •; awd //ze
• • •, rn) satisfy 00
(2.12)
' ' r*>
= a >
71=1
uniformly for all sequences r\, • • •, rn, • • • ,for which (2.13)
sup
| r„ | a and Z = -a
— a we have (3.4)
f [Tm{X J A
J
-Z]*dv=
f [(| Tm(Xm) Jk
Y2mdn.
| — a) +]
2d(i
if TJXm)
0 be given. Choose a = a(e) so that (3.9)
0
Then choose an integer k — k(a, e) > 1 so that (3.10)
max a,- g a ,
and (3.11)
u
k
( v
k
+
where (3.12)
u
k
= n
(l +
fr)2(l+a&)
,
Vk =
a
i=k
y
£ l 3
j
{ l + a i 3
j
) - ,
j=k
such an integer k exists in virtue of (2.1) and the fact that, by (2.2) and (2.6), all infinite series and products involved are convergent. For every j and a> put (3.13)
i 3 = Si («) = sgn Tj {X))
where sgn r denotes, as usual, 1 when r > 0 , - 1 when r < 0 and 0 when r = 0. For j > 1 let B,' and B " denote the events described by (3.14)
B; = { u : sgn Xy ^
sy-1}
and (3.15)
Bj' = i«:|7V-i ( X , - , ) | g a } .
ON STOCHASTIC APPROXIMATION P u t B; = B,' u B " and, for m ^
43
k,
(3.16)
Am = B
m
-UBy. i—k
Finally, let (3.17)
R„=UAM, m=k
A„ = N - R„
for every n ^ k. From (2.5) and (3.14) it follows t h a t | X m | ^ | i | throughout B^, while from (2.5) and (3.15) we have \Xn\ g 0 + | F „ _ i | in B^'. Thus | X m | - a g | Y ^ ] throughout B m and, in particular, in A m . Hence it results from (3.7), (3.8) and (3.12) that (3.18)
f
in
[(IJr.1 - a ) + ] * d ^ ^ u k f
(t>*+ 2 m j—k—1
Y
i)d>1
whenever n ^ m ^ k. Since the sets A m are disjoint, it follows from (3.10) on summing the inequalities (3.18) t h a t (3.19)
f
[ (|X
B
|-a)
As | Z „ | ^ ( | ATn| - o)+ + a, it follows a t once from (3.19) and (3.9) t h a t (3.20) for every n ^
f
T
X
"
d
» =
2
($
+
f
r
a 2 d
>l)=^
k.
Now let us turn to A„. B y (3.14) we have outside B,' (3.21)
I Xy | = Xy sgn Xy = i y - i T y - i ( Z y - j ) + 5y_, Fy-x = \Ti-AXi--d
1+Jy-iFy-x,
while outside B,-' we have for j ^ k by (3.15) and (3.10) (3.22)
| Ty-x (Xy_ x )
+
X , - ! | - Yy-i.
Hence outside By we have (3.23) \ X j \ £ ( l + /3y-i) | X y - i | - 7 , - 1 +
Fy-x
whenever,;' jg k. Since A„ is contained in the complement of By for every k < j ^ n, we can in A„ iterate the inequalities (3.23) and obtain for w in An (3.24)
| X n | ^wk,n\Xk\
-
m
smwm,nYm
m
where (3.25)
».,.= [ i ( l + m
W.
Putting n (3.26)
Zn = wk
n
\X
k
\-
m=k
n 7m,
Z'n=^
smwm,
nF„
44
we obtain from
THIRD BERKELEY SYMPOSIUM: DVORETZKY
(3.24)
(3.27)
f
XUn^f
[(Zn+Zi)+]2dix g f
(z++\zi\)2dv.
Hence (3.28)
f
^n
X l i p ^ l f (Z+) 2 rf M + 2 / | Z t f d n . ^n ^n
But by (3.26) and (2.7), since Sjn is defined by (3.29)
f J An
r|Zi|2dM=yym,*£{F2m} J12 ^L. m =«
and hence CO
(3.30)
f
^n
\Zi\'dvg
00
n (l + j—k
ft)2^£{F2m} m~k
by (3.11) and (3.12). Finally, we remark that if Z is any random variable with E{Z2}< oo then E{[(Z — r)+]2} tends to zero as r —> + °°. By (3.7) with m = 1 and n = k the random 00
variable Xk, and hence also Z = | Xk | J~J (1 + fij) have finite second moments. But, i=k n by (3.26), Zt ^ Z — ^ ym and it follows from (2.3) and the remark made at the bem=k ginning of this paragraph that (3.31)
f
J
{Zi)2dn^E\{Zt)'} N = N(e, k). Combining (3.20), (3.27), (3.30) and (3.31) we have E{Xl} < e for n > N. Since e > 0 is arbitrary this completes the proof of (2.8). The proof of (2.9) will now be easily achieved. Applying (3.7) with A = fi we can obtain for all n > m an inequality of the form (3.32)
E{X\\ 0 and e > 0 there exists = ?j( ® by (5.13), we have (2.8); the proof of (2.9) is exactly the same. The last generalization we wish to present extends the class of transformations Tn. Instead of considering transformations Tn determined by x\, xi, - • • ,xn we may consider random ones depending on the sample point co, that is, measurable mappings of R X 12 into R, R being the real line. In this case xi, • • •, xn do not determine the value tn as-
ON STOCHASTIC APPROXIMATION
49
sumed by Tn(X„). However, except for this fact which necessitates a restatement of (2.7), nothing is changed in all our arguments. Hence we have GENERALIZATION 5. The Extended Theorem remains valid also if Tn, n = 1, 2, • • - , are random, transformations provided (2.4) holds for all co and (2.7) is replaced by (5.16)
E[ Y„ | xi, •••, xn, tu •••,l/q„, and thus (2.6) and (2.7) are satisfied. On the other hand, the probability that X ^ i = sn for every n equals the probability that F n = vn for every n and, being equal to J^J (1 — qn), is positive. Hence not only (2.8) and (2.9) fail to hold but Xn does not even converge in probability to zero. [In this example the Tn are discontinuous; this is easily remedied. All we have to do is to define Tn(rn) = s„_i for r„ = s„_i, Tn(rn) = 0 for r„ g j„_2 or r„ Si s„ and by linear interpolation for the remaining values of rn.] 7. The Robbins-Monro and Kiefer-Wolfowitz procedures In this section we deal with a very special case of the general theory. It will be shown that specializing the general results will, without further ado, improve the best results previously obtained for the specific procedures. Let Z„ be a one-parameter family of random variables, the parameter space being the real line, and assume that (7.1)
/ ( « ) =£{Z U }
exists for every u. The Robbins-Monro and Kiefer-Wolfowitz procedures are concerned with finding, under suitable assumptions, the location of the root f(u) = 0 and of the maximum of the regression function /(«).
SO
THIRD BERKELEY SYMPOSIUM: DVORETZKY
The Robbins-Monro procedure is based on a sequence of positive numbers an,n = 1, 2, • • •, satisfying QO
(7.2)
CO
^>„=00,
£«2 p we have \f(u + c„) — f(u — c„) | > 2yc„
THIRD BERKELEY SYMPOSIUM: DVORETZKY
52
where y = min [—
sup
p/20, (8.1) holds for large n with Fn = 1 — Aan. Therefore if the an satisfy (7.2) the assumption is verified and hence E{Xl\ tends to zero. As the sequence (8.9) clearly satisfies (7.2) the conclusion holds in this case. [So far no use was made of (8.8). Also all the above follows directly from result 1.] From (8.11) it follows that if (8.12)
~
then T„ satisfies (8.1) with F„ = 1 — Aan; hence we have in this case according to (8.3) (8.13)
Aan)*Vl
+ aW
54
THIRD BERKELEY SYMPOSIUM: DVORETZKY
by (7.9) and (7.10). The minimum of the right side of (8.13) is achieved at (8.14)
ff2+
A V„ A1V2
0) and u* 6 S* n ( / > 0). It follows that there exists a constant h > 0 (in case f = 0 in some parts of «-space, h might not be uniquely determined) such that (3.6)
|
^h
c'M~lu\
in in
Sn
(/>0)
S*n(/>0).
This completes the proof of theorem 3.1. We do not as yet know whether, in the general case, the system (3.2) possesses a solution, or whether this, if existing, is unique and constitutes a solution of the variational problem presented in equations (2.3), (2.4) and (2.5). It seems likely that some iterative procedure might be designed for solving the system mentioned. We shall here only discuss a special case in which an explicit solution is easily obtainable. The resulting procedure will probably be useful as an approximate solution also in more general cases.
4. A transformation lemma Before proceeding to the special case referred to above, we shall prove the following simple LEMMA 4.1. Let v = Lube a linear transformation of k-space onto itself, with \L\ — 1. I f M, S, c satisfy
(4.1)
(3.2), then
M =LML',
S=LS,
c=Lc
satisfy the same system, with u replaced by v andf(u) by f(v) = f(Lrlv)\ and conversely. PROOF. Applying the transformation v = Lu to the second integral in (3.2), we find that the second part of (3.2) remains true when S and / are replaced by S a n d / , respectively. Applying the same transformation to the first part of (3.2) we find (4.2)
M = L~1- fvv'J(v)
dvL'~l;
it follows that the matrix M = LML', together with 8 and / , satisfies the first part of (3.2). A s t o t h e t h i r d e q u a t i o n of (3.2), w e h a v e
(4.3)
{v:\c'M~1L-lv\>h}-,
S=LS=
replacing c by L~H and M by ZrlMX' - 1 , we find that S, M, c satisfy the third part of (3.2). The converse is proved in the same way, replacing L by Lr1.
5. The spherically symmetric case When the density function f(u) is spherically symmetric with respect to the origin, it is intuitively almost obvious that the twin half-space of theorem 3.1 has to be symmetric with respect to the "relevant direction" determined by the vector c. This is the content of the following proposition. THEOREM 5.1. If f(u) is constant on every sphere u'u = C, then the system (3.2) has a unique solution, determined by the region (5.1)
S = {«: | c'u | > k}
where K has to be chosen so as to satisfy the second part of (3.2). PROOF, (i) We shall first prove that the solution (5.1) is sufficient in the standardized case c — ye where e is the first coordinate vector (1, 0, - • •, 0)' and y any positive constant.
NONREPEATABLE OBSERVATIONS
73
Take 5 according to (5.1), determine k SO as to satisfy the second part of (3.2), and M from the first part of (3.2). It remains to show that, for an appropriate h, the third part of (3.2) is satisfied. For this purpose, we first note that in the present case, M is of form
(5.2)
M =
mi 0 • • • Ol 0 m ••• 0 0 0
m,A
this follows from (2.6), noting that (a) 5 is now of form |«i| > constant, (b) f(u) is an even function of each variable Uj separately, and hence, (c) the integral from — °o to + 0. From (i) we know that the entities (5.5)
c=ye,
S = | v: | e' v \ > — | ,
M = J
vv' J {v) dv,
with appropriate choice of k, satisfy (3.2). Applying lemma 4.1 and noting that in the spherically symmetric case /(») = f(v), we conclude that the entities c = L'c, S = L'S, and M = L'ML, satisfy the same system (3.2). Moreover, (5.6)
S =L'S = jtt: | e'Lu | >-[
= {«:| c'u \ > «}
is actually the region (5.1). (iii) It remains to show that a region 5 which, together with the corresponding M and h, satisfies (3.2) is necessarily of form (5.1); that is, that the coefficient vector c'M~x in the third part of (3.2) is proportional to c'. For this purpose, take an orthogonal transformation v = Lu that takes the vector M~lc onto ye, y > 0, that is, such that (5.7)
M~1c =
yL'e.
This transformation takes the region 5 = {«: [c'M~ l u\ > h\ into (5.8)
1 " S=LS=\v:\vi
'
\
h]
By lemma 4.1 the transformed matrix M = LML' satisfies (5.9)
M=
f v v ' f ( v ) dv,
/Q
74
THIRD BERKELEY SYMPOSIUM: ELFVING
and hence, by the argument of (i), is of form (5.2) with first diagonal element, say, mi. From (5.7) then follows (5.10)
c = yML'e = yL'ML-L'e
= yL'Me = ym1L'e
,
and hence, again by (5.7), c = rh\M~lc. But this is the proportionality that we set out to establish, and so the proof of theorem 5.1 is complete. 6. The quadrically symmetric case Theorem 5.1 is easily generalized to the case that/(w) can be made spherically symmetric by a linear transformation of the argument. Assume that there is a nonsingular linear transformation v = Lu such that f(v) = f(L~h) is spherically symmetric. Since a constant factor in the argument does not affect this property, we may without restriction assume \L\ = 1. Our assumption says t h a t / remains constant whenever the squared distance (Z,-1 v) ' (L- 1 v) = v' (LL') - 1 v
(6.1)
remains constant, that is, on each member of a certain family of homothetic ellipsoids. We shall refer to this situation as the quadrically symmetric case. Consider any set of entities c, S, M in «-space and their counterparts c = Lc, S = LS, M = LML' in w-space. According to lemma 4.1, the former set satisfies (3.2) if and only if the latter set satisfies the same equations, with u replaced by v and f(u) by f(v) = f{L~h). Since/(D) is spherically symmetric, we know from theorem 5.1 that the latter system has a unique solution, generated by the region (6.2)
S= { o:| c'v\ > «}
where k has to be determined so as to meet the size condition. It then follows that the original system (3.2) has a unique solution generated by the transformed region (6.3)
S=L~1S=
{u: | c'Lu | > k\ = [u:\c'
(L'L) u\ > «}.
The matrix L'L can be expressed in terms of the covariance matrix A of the distribution /(«). Integrating over the whole w-space, we have (6.4)
A = fuu'f
(u) du=L-1-fvv'f
(v) dvL''1
.
Since f(v) is spherically symmetric, the last integral is of form 91 where 6 is a positive scalar. It follows that A = 0(L'L) -1 , L'L = OK-1. Inserting this result in (6.3) and denoting k/6 = X, we have the following theorem. THEOREM 6.1. / / / ( « ) is quadrically symmetric, then the system (3.2) has a unique solution generated, by the region (6.5)
S=
{u:\c'A~1u\>\},
where A is the covariance matrix of the /-distribution, and where X has to be determined so as to satisfy the size condition of the second part of (3.2). 7. Practical procedure We now finally turn back to our original discrete problem. If the distribution of the item points in «-space is regular enough to justify a description by means of a quadrically symmetric density function, then we may apply theorem 6.1 and use for A the
empirical covariance matrix of the item points. The size requirement may be met simply by counting off from the "outer end," that is, in order of decreasing $|c'\Lambda^{-1}u_i|$, as many items as desired. We thus end up with the following practical procedure (a computational sketch is given after (7.3)): (i) Compute the moment matrix Λ with elements
(7.1) $\lambda_{jh} = \frac{1}{N} \sum_{i=1}^{N} u_{ij} u_{ih}, \qquad j, h = 1, \ldots, k;$

(ii) Compute the vector $g = \Lambda^{-1}c$, that is, solve the equations
(7.2) $\lambda_{11} g_1 + \cdots + \lambda_{1k} g_k = c_1, \quad \ldots, \quad \lambda_{k1} g_1 + \cdots + \lambda_{kk} g_k = c_k;$
(iii) Compute, for each item i, the quantity

(7.3) $w_i = g'u_i = \sum_{j=1}^{k} g_j u_{ij}$

and select the items with largest $|w_i|$.
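If the item points are held as the rows of a matrix, steps (i)-(iii) are a few lines of linear algebra. The following is a minimal sketch in Python with NumPy; the array names, the simulated data, and the selection count are illustrative assumptions, not part of the original procedure.

    import numpy as np

    def select_items(U, c, n):
        """Steps (i)-(iii): keep the n items with largest |w_i| = |c' Λ⁻¹ u_i|."""
        N = U.shape[0]
        Lam = U.T @ U / N                  # (7.1) moment matrix Λ, λ_jh = (1/N) Σ_i u_ij u_ih
        g = np.linalg.solve(Lam, c)        # (7.2) solve Λ g = c, i.e. g = Λ⁻¹ c
        w = U @ g                          # (7.3) w_i = g' u_i
        return np.argsort(-np.abs(w))[:n]  # count off from the "outer end"

    # illustrative use on simulated item points: 100 items, k = 3
    rng = np.random.default_rng(0)
    U = rng.normal(size=(100, 3))
    print(select_items(U, np.array([1.0, 0.0, 0.0]), 10))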
Note added in proof: A direct treatment of the discrete selection problem will be published in Soc. Sci. Fenn. Comment. Phys.-Math. (1956).
SOME PROBLEMS IN ESTIMATING THE SPECTRUM OF A TIME SERIES

ULF GRENANDER, UNIVERSITY OF STOCKHOLM, AND MURRAY ROSENBLATT*, THE UNIVERSITY OF CHICAGO
1. Introduction

We consider a probability model that has proved to be useful in various applied fields. Statistical problems that arise in the analysis of data obtained in these fields are discussed. The basic model is that of a stochastic process $\{y_t\}$,

(1.1) $y_t = x_t + m_t,$
where $m_t = E y_t$ is the mean value of $y_t$ and $x_t$, $E x_t = 0$, is the residual. The residual $x_t$ is assumed to be stationary with respect to the parameter t, that is,

(1.2) $x_{t_1}, \ldots, x_{t_n}$

have the same probability distribution as

(1.3) $x_{t_1+h}, \ldots, x_{t_n+h}$
for all possible values of $t_1, \ldots, t_n, h$. In other words, the probability distribution of $x_t$ is invariant under t displacement. This implies that the set of possible values of t, which we shall call T, is a group or semigroup under addition. Typical examples of the parameter set T are the set of all points in Euclidean k-space or the set of lattice points in Euclidean k-space. These are in fact the examples of greatest interest and they will be discussed in some detail. The process $\{y_t\}$ may be vector valued. An example of interest in which the vector-valued case is appropriate will be described. A usual situation is that in which t is thought of as time. The parameter t will then be a point of the form kh if the observations are taken at discrete time points with h seconds between each observation. If the observation is continuous, t will be any real number. The case in which $m_t = 0$ is of considerable importance. Such a model is appropriate where the phenomenon studied consists of random fluctuations which are of a stable character. Some of the fields in which such a model has been used will be discussed in section 2. These fields are in the physical sciences. They are discussed to give some motivation to

Research carried out in part at the Statistical Research Center, University of Chicago, under sponsorship of the Statistics Branch, Office of Naval Research.
* Now at Indiana University.
the development and because the methods considered have been successful in some degree in their application. Before going on to consider the statistical questions of interest, we shall develop some of the probability theory required. We want to give a survey of some of the results that have been obtained in the past five years and indicate that there are many problems of interest that still remain open. We shall therefore sketch out some of the older results and try to motivate them. Proofs will be given only when they are relevant to the discussion or when they relate to new results.

The paper discusses a spectral representation of the residual $\{x_t\}$. We are basically concerned with the statistical problem of estimating the spectrum from a time series or partial realization of the process. The estimation of the spectrum is of great interest for various reasons. First of all, knowledge of the spectrum gives us information about the structure of the process. Knowledge of the spectrum is also essential in the linear problems of prediction, interpolation and filtering.

2. Some fields of application

One of the earliest fields of application is in the study of random noise. In electrical or electronic circuits noise of random character sometimes arises. This may be due to the random drift of electrons in the circuit or perhaps from shot noise due to tubes in the circuit. Another example is clutter on a radar scope or a television screen due to reflection from surrounding buildings or snow. In the first example t is time and $y_t$ is a number. The residual $x_t$ can be thought of as the noise which masks the message $m_t$ which one would want to estimate. A detailed discussion of some of the problems that arise in this context can be found in Lawson and Uhlenbeck [10].

Another field of application is in the study of turbulence. Consider a fluid forced through a rectangular grid. Let the kinematic viscosity of the fluid be small. The velocity field V(τ, u) of the fluid behind the grid seems to be random. Here τ is the time and u is the point at which the observation is taken. The velocity field of the fluid fluctuates even though the macroscopic conditions are the same. It then seems reasonable to consider the velocity field as a stochastic process. Assume that energy is fed in at the same rate at which it is dissipated. At an intermediate distance from the grid the distribution of the velocity field appears to be invariant under space displacement. Homogeneous turbulence is an idealization in which one imagines all of space filled with a turbulent fluid whose velocity field has a probability distribution invariant under space displacement. Let us also assume that the turbulence is stationary in time since the energy fed in balances out that dissipated. Here t = (τ, u) is 4-dimensional while $x_t$ = V(τ, u) is 3-dimensional. One then looks for a stationary process satisfying the equations of motion, that is, the continuity equation (assuming incompressibility) and the Navier-Stokes equation. A detailed discussion of homogeneous turbulence can be found in Batchelor's book [3]. Various meteorologists are now studying the atmosphere, considering it as a turbulent fluid. The assumption of local homogeneity and stationarity which they make may not be a bad one (see Panofsky [12]).

Still another field is the study of storm generated ocean waves. Consider a storm on the ocean surface. Let h(τ, u) be the vertical displacement of the ocean surface at time τ and position u with respect to the undisturbed surface.
Well within the storm area h(τ, u) can be considered stationary with respect to displacement in u along the sea surface. If the period of observation is small with respect to the duration of the storm, the disturbance can be considered stationary with respect to time. Here t = (τ, u) is 3-di-
mensional and $x_t$ = h(τ, u) is 1-dimensional. See Pierson [13] for a discussion of this application.

3. Spectral representation of the process

Assume that $m_t = 0$. We shall now consider a basic representation of the process $x_t$. In effect, we are carrying out a random Fourier analysis of the process. It is reasonable to assume that the process is real valued or has real-valued components. The covariance matrices¹

(3.1) $R_{t,\tau} = R_{t-\tau} = E\, x_t x_\tau'$
depend only on the difference t − τ of the parameters because of the stationarity. Consider the case in which t is integral and $x_t$ possibly vector valued. The results cited in this section are still valid with appropriate modification if t is continuous or vector valued. If $x_t$ is k-dimensional and $\{c_t\}$ is any finite sequence of k-vectors,

(3.2) $\sum_{t,\tau} c_t'\, R_{t-\tau}\, c_\tau \geq 0,$

since $\{R_t\}$ is a sequence of covariance matrices. The process $x_t$ has the Fourier representation
(3.3) $x_t = \int_{-\pi}^{\pi} e^{it\lambda}\, dZ(\lambda)$

where Z(λ) is an orthogonal process, that is,

(3.4) $E[dZ(\lambda)] = 0, \qquad E[dZ(\lambda)\, dZ(\mu)'] = \delta_{\lambda\mu}\, dF(\lambda)$
where $\delta_{\lambda\mu}$ is the Kronecker delta and F(λ) is a nondecreasing matrix-valued function (see Cramér [4]). The differential notation occasionally used in the discussion is to be understood in the usual way. Here dF(λ) denotes the increment of the function F(λ) over a small λ-interval. In (3.3) $x_t$ is expressed as a superposition of harmonics exp(itλ) with corresponding random amplitudes dZ(λ). The random amplitudes dZ(λ) are orthogonal and the covariance matrix of the random vector weight dZ(λ) is dF(λ) ≥ 0. The process Z(λ) with orthogonal increments is given by

(3.5) $Z(\lambda) = \frac{1}{2\pi} \sum_{t=-\infty}^{\infty} x_t\, \frac{e^{-it\lambda} - e^{it\pi}}{-it},$

the term t = 0 being read as $x_0(\lambda + \pi)/2\pi$;
that is, Z(λ) is the integral of the formal Fourier series with Fourier coefficients $x_t$. The process Z(λ) is introduced because the formal Fourier series referred to does not exist. The function F(λ) is usually called the spectral distribution function of the process $x_t$. The representation (3.3) of $x_t$ implies that

(3.6) $R_t = \int_{-\pi}^{\pi} e^{it\lambda}\, dF(\lambda).$

¹ $x_t$ is a column vector. If A is a matrix, A' denotes the conjugated transpose of A.
From this one can see that knowledge of the covariance sequence $R_t$ and knowledge of the spectral distribution function F(λ) are equivalent. Nonetheless, in many fields, especially those referred to in section 2, it is much more natural to think of the process in terms of its spectrum rather than its covariance sequence. Moreover, the statistical problems that arise are answered much more naturally and elegantly when looked at from the point of view of the spectrum of the process rather than the covariance sequence. Note that $dF(\lambda) = \overline{dF(-\lambda)}$ since the process $x_t$ has real-valued components. In the representation (3.3) of the process $x_t$, Z(λ) has complex-valued components in general. Because of this unpleasantness it is convenient to introduce the auxiliary processes

(3.7) $Z_1(\lambda) = Z(\lambda) - Z(-\lambda), \qquad Z_2(\lambda) = i[Z(\lambda) + Z(-\lambda) - 2Z(0)]$
with real-valued components. We then have the following real representation for $x_t$:

(3.8) $x_t = \int_0^{\pi} \cos t\lambda\, dZ_1(\lambda) + \int_0^{\pi} \sin t\lambda\, dZ_2(\lambda).$
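The equivalence between the covariance sequence and the spectrum expressed by (3.6) is easy to check numerically in a concrete case. The following is a minimal sketch in Python with NumPy for a first-order autoregressive scheme, whose spectral density and covariances are known in closed form; the particular scheme and constants are illustrative, not part of the text.

    import numpy as np

    # AR(1) scheme x_t = a x_{t-1} + e_t with Var(e_t) = s2:
    # spectral density f(λ) = s2 / (2π |1 - a e^{-iλ}|²), covariance R_t = s2 a^|t| / (1 - a²)
    s2, a = 1.0, 0.6
    lam = np.linspace(-np.pi, np.pi, 4001)
    f = s2 / (2 * np.pi * np.abs(1 - a * np.exp(-1j * lam)) ** 2)
    for t in range(5):
        Rt = np.trapz(np.exp(1j * t * lam) * f, lam).real  # (3.6): R_t = ∫ e^{itλ} f(λ) dλ
        exact = s2 * a ** t / (1 - a ** 2)
        print(t, round(Rt, 6), round(exact, 6))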
The case of greatest practical interest is that in which F(λ) has no singular part, that is, F(λ) has an absolutely continuous part and jumps. In the remainder of this paper we shall assume that F(λ) is absolutely continuous, that is,

(3.9) $F(\lambda) = \int_{-\pi}^{\lambda} f(u)\, du.$
The matrix-valued function f(λ) is called the spectral density of the process. Let us consider the case of a two-dimensional process $x_t = ({}_1x_t, {}_2x_t)'$ in some detail.² Then

(3.10) $\frac{dF(\lambda)}{d\lambda} = f(\lambda) = \begin{pmatrix} f_{11}(\lambda) & f_{12}(\lambda) \\ f_{21}(\lambda) & f_{22}(\lambda) \end{pmatrix} \geq 0.$
The functions $f_{11}(\lambda), f_{22}(\lambda) \geq 0$ are the spectral densities of ${}_1x_t$, ${}_2x_t$, respectively, while $f_{12}(\lambda)$ is the cross-spectral density of ${}_1x_t$ and ${}_2x_t$. Now $f_{11}(\lambda) = f_{11}(-\lambda)$, $f_{22}(\lambda) = f_{22}(-\lambda)$, and $f_{12}(\lambda) = f_{21}(-\lambda) = \overline{f_{21}(\lambda)}$ since ${}_1x_t$ and ${}_2x_t$ are real valued. In the real representation (3.7) of $x_t$ let

(3.11) $Z_1(\lambda) = \begin{pmatrix} Z_{11}(\lambda) \\ Z_{21}(\lambda) \end{pmatrix}, \qquad Z_2(\lambda) = \begin{pmatrix} Z_{12}(\lambda) \\ Z_{22}(\lambda) \end{pmatrix}.$

Then

(3.12) $\begin{aligned} E\, dZ_{11}(\lambda)\, dZ_{11}(\mu) &= E\, dZ_{12}(\lambda)\, dZ_{12}(\mu) = 2\delta_{\lambda\mu}\, f_{11}(\lambda)\, d\lambda \\ E\, dZ_{21}(\lambda)\, dZ_{21}(\mu) &= E\, dZ_{22}(\lambda)\, dZ_{22}(\mu) = 2\delta_{\lambda\mu}\, f_{22}(\lambda)\, d\lambda \\ E\, dZ_{11}(\lambda)\, dZ_{12}(\mu) &= E\, dZ_{21}(\lambda)\, dZ_{22}(\mu) = 0 \\ E\, dZ_{11}(\lambda)\, dZ_{21}(\mu) &= E\, dZ_{12}(\lambda)\, dZ_{22}(\mu) = 2\delta_{\lambda\mu}\, \mathrm{Re}\, f_{12}(\lambda)\, d\lambda \\ E\, dZ_{11}(\lambda)\, dZ_{22}(\mu) &= -E\, dZ_{21}(\lambda)\, dZ_{12}(\mu) = 2\delta_{\lambda\mu}\, \mathrm{Im}\, f_{12}(\lambda)\, d\lambda. \end{aligned}$

The real part of the cross-spectral density, Re $f_{12}(\lambda)$, is often called the cospectrum of ${}_1x_t$ and ${}_2x_t$, while Im $f_{12}(\lambda)$ is called the quadrature spectrum of ${}_1x_t$ and ${}_2x_t$. The cospectrum measures the dependence of the in-phase harmonics of the two processes ${}_1x_t$
and ${}_2x_t$, that is, the dependence between $\cos t\lambda\, dZ_{11}(\lambda)$ and $\cos t\lambda\, dZ_{21}(\lambda)$ or $\sin t\lambda\, dZ_{12}(\lambda)$ and $\sin t\lambda\, dZ_{22}(\lambda)$. The quadrature spectrum measures the dependence of the out-of-phase components of the two processes, $\cos t\lambda\, dZ_{11}(\lambda)$ and $\sin t\lambda\, dZ_{22}(\lambda)$.

² The discussion in the remainder of this section and in section 8 was suggested by conversations one of the authors had with W. J. Pierson, Jr. of the Department of Meteorology and L. J. Tick of the Research Division of N.Y.U.

If the class of admissible distributions in a nonparametric problem is parametrized or labeled in a natural way, one is led to an infinite-dimensional parameter space. It seems natural to think of a statistical problem characterized by such an infinite-dimensional parameter space as a nonparametric problem. It would be natural to parametrize the class of processes we deal with by their covariance sequences. Since this parameter space is infinite-dimensional, the techniques we employ in the statistical analysis of time series would be nonparametric techniques in the sense described above. Note that the normal processes are determined by their spectra since they are determined by their first and second moments. The representation (3.3) of the process $x_t$ is a linear representation and thus is especially natural in the case of a normal process. Many of the statistical techniques employed are linear techniques since they are based on this linear representation. Nonetheless, they are still quite useful in obtaining information about the linear structure of nonnormal processes.

If the time parameter t is continuous, the range of the λ integration in (3.3) is from −∞ to ∞. If the parameter t ranges over the lattice points in k-dimensional Euclidean space, the analogue of representation (3.3) is

(3.13) $x_t = \int e^{it'\lambda}\, dZ(\lambda), \qquad -\pi \leq \lambda_j \leq \pi, \quad j = 1, \ldots, k,$

and the spectral distribution function F(λ) is a function of the k-vector λ. When t is a continuous parameter, the λ integration in (3.13) ranges over all k-space. The representations of the process $x_t$ given here are valid if the weaker assumption of weak stationarity of $x_t$ is made (see Doob [5]). Strong stationarity of $x_t$ has been assumed because it is really made use of later on.

4. Moving averages and linear processes

We shall now motivate an assumption on the distribution of the stationary processes $x_t$ dealt with. Consider the representation (3.3) with

(4.1) $E\, dZ(\lambda)\, dZ(\mu)' = \delta_{\lambda\mu}\, f(\lambda)\, d\lambda.$
Note that $f(-\lambda) = \overline{f(\lambda)}$. Assume that $f(\lambda) \geq 0$ is a nonsingular matrix for almost all λ. We can then write

(4.2) $f(\lambda) = \frac{1}{2\pi}\, a(\lambda)\, a(\lambda)',$

where a(λ) is nonsingular almost everywhere and may be chosen so that

(4.3) $a(-\lambda) = \overline{a(\lambda)}.$

Then, with

(4.4) $\xi_t = \int_{-\pi}^{\pi} e^{it\lambda}\, a^{-1}(\lambda)\, dZ(\lambda),$

we have

(4.5) $E\, \xi_t \xi_s' = \int_{-\pi}^{\pi} e^{i(t-s)\lambda}\, a^{-1}(\lambda)\, f(\lambda)\, [a^{-1}(\lambda)]'\, d\lambda = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i(t-s)\lambda}\, d\lambda \cdot I = \delta_{ts}\, I,$

so that the process $\{\xi_t\}$ is orthonormal.

• • •

The weight functions entering the two-dimensional spectral estimates considered below are, for large N and M,
highly peaked at u, v = 0. Let $w_{N,M}(u, v)$ be nonnegative and of total mass one,

(7.10) $\int\!\!\int w_{N,M}(u, v)\, du\, dv = 1.$

We also assume that for every ε > 0,

(7.11) $w_{N,M}(u, v) \to 0$

uniformly for |u|, |v| ≥ ε as N, M → ∞. Consider the following estimate of f(λ, μ):

(7.12) $f^*_{N,M}(\lambda, \mu) = \int\!\!\int w_{N,M}(u - \lambda,\, v - \mu)\, I_{N,M}(u, v)\, du\, dv,$

where $I_{N,M}(u, v)$ is the two-dimensional periodogram. Then

(7.13) $f^*_{N,M}(\lambda, \mu) = \frac{1}{4\pi^2} \sum_{j,k} r_{j,k}\, \hat w_{N,M}(j, k)\, e^{-ij\lambda - ik\mu},$

where $\hat w_{N,M}(j, k) = \int\!\!\int w_{N,M}(u, v)\, e^{iju + ikv}\, du\, dv$ and, for j, k ≥ 0,

(7.14) $r_{j,k} = \frac{1}{NM} \sum_{t=1}^{N-j} \sum_{\tau=1}^{M-k} x_{t,\tau}\, x_{t+j,\tau+k}$

is the sample covariance; it is analogously defined for j, k with different sign.
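On a finite lattice the estimate (7.12) can be computed by smoothing the two-dimensional periodogram directly. The following is a minimal sketch in Python with NumPy, using a uniform rectangular weight of total mass one applied by circular convolution over the frequency grid; the kernel size s and the simulated data are illustrative choices, not prescribed by the text.

    import numpy as np

    def spectral_estimate_2d(x, s=9):
        """Smooth the 2-D periodogram with an s-by-s uniform weight of total mass one."""
        N, M = x.shape
        # two-dimensional periodogram I_{N,M}(u, v) on the FFT frequency grid
        per = np.abs(np.fft.fft2(x)) ** 2 / ((2 * np.pi) ** 2 * N * M)
        # uniform weight w_{N,M}, wrapped onto the grid, summing to one
        k = np.zeros((N, M))
        idx = np.arange(-(s // 2), s // 2 + 1)
        k[np.ix_(idx % N, idx % M)] = 1.0 / s ** 2
        # (7.12): circular convolution of the periodogram with the weight
        return np.fft.ifft2(np.fft.fft2(per) * np.fft.fft2(k)).real

    # for unit-variance white noise the estimate should hover near 1/(2π)² ≈ 0.0253
    x = np.random.default_rng(1).normal(size=(128, 128))
    print(spectral_estimate_2d(x).mean())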
We shall obtain the asymptotic variance of the estimate $f^*_{N,M}(\lambda, \mu)$ under fairly mild assumptions on the weight function and the spectral density. Now

(7.15) $NM\, D^2[f^*_{N,M}(\lambda, \mu)] = \frac{1}{(2\pi)^4 NM} \sum_{j,k,l,m} \hat w_{N,M}(j, k)\, \hat w_{N,M}(l, m)\, e^{-ij\lambda - ik\mu + il\lambda + im\mu} \sum_{t_1,\tau_1,t_2,\tau_2} \mathrm{cov}\big( x_{t_1,\tau_1} x_{t_1+j,\tau_1+k},\; x_{t_2,\tau_2} x_{t_2+l,\tau_2+m} \big) = S_1 + S_2,$

where $S_1$ and $S_2$ collect the two groups of terms that arise when the covariances of the sample covariances are expanded. Both of the terms $S_1$, $S_2$ can be treated in the same way. We shall consider the term $S_2$. Now

(7.16) $S_2 \sim (2\pi)^2 \int\!\!\int \frac{1}{2\pi N} \frac{\sin^2 \tfrac{1}{2}Nu}{\sin^2 \tfrac{1}{2}u} \cdot \frac{1}{2\pi M} \frac{\sin^2 \tfrac{1}{2}Mv}{\sin^2 \tfrac{1}{2}v}\; \big\{ f(u, v)\, w_{N,M}(u - \lambda, v - \mu) \big\} * \big\{ f(u, v)\, w_{N,M}(u - \lambda, v - \mu) \big\}\, du\, dv,$
where the * denotes convolution with respect to u, v. Assume that f(u, v) is continuous and positive. Let

(7.17) $\psi_{N,M}(u, v) = w_{N,M}(u, v) * w_{N,M}(u, v).$

We also assume that

(7.18) $\frac{\psi_{N,M}(u, v)}{\psi_{N,M}(0, 0)} \to 1$

as N, M → ∞ when |u| < A/N, |v| < A/M, where A is any positive constant. It then follows that

(7.19) $S_2 \sim (2\pi)^2 \int\!\!\int f^2(u, v)\, \psi_{N,M}(u - \lambda, v - \mu)\, du\, dv \sim (2\pi)^2 f^2(\lambda, \mu) \int\!\!\int w^2_{N,M}(u, v)\, du\, dv$
as N, M → ∞. The same sort of argument shows that $S_1 = o(S_2)$ if (λ, μ) ≠ (0, 0) and $S_1 \sim S_2$ if (λ, μ) = (0, 0). But then, if (λ, μ) ≠ (0, 0),

(7.20) $D^2[f^*_{N,M}(\lambda, \mu)] \sim \frac{(2\pi)^2}{NM}\, f^2(\lambda, \mu) \int\!\!\int w^2_{N,M}(u, v)\, du\, dv$

as N, M → ∞. The asymptotic expression given in (7.20) should be doubled when (λ, μ) = (0, 0). Thus $f^*_{N,M}(\lambda, \mu)$ is a consistent estimate of f(λ, μ) if

(7.21) $\frac{1}{NM} \int\!\!\int w^2_{N,M}(u, v)\, du\, dv \to 0$

as N, M → ∞. These estimates are well suited for computation on a digital machine. It is rather doubtful whether one could build useful analogue computers making use of the 2-dimensional counterparts of the estimates discussed in section 6. The following problem is interesting and has not yet been answered satisfactorily. Suppose that $x_{t,\tau}$ is a continuous parameter process and f(λ, μ) is known to be circularly symmetric about zero. What would then be an efficient estimate of f(λ, μ) making use of the known symmetry? There are, of course, many higher dimensional analogues of this problem.

8. Estimation of the cospectrum and quadrature spectrum

Let $x_t = ({}_1x_t, {}_2x_t)'$ be a process with real-valued components and t integral. The spectrum is assumed to be absolutely continuous with a continuous and nonsingular spectral density. The sample $x_t$, t = 1, ..., N, has been observed and we want to estimate the cospectrum Re $f_{12}(\lambda)$ and the quadrature spectrum Im $f_{12}(\lambda)$ (see section 3). We first discuss estimation of the cospectrum. Let a be any real number. The time series ${}_1x_t + a\,{}_2x_t$ then has the spectral density

(8.1) $f_{11}(\lambda) + 2a\, \mathrm{Re}\, f_{12}(\lambda) + a^2 f_{22}(\lambda).$
Now
(8.2) $\int_{-\pi}^{\pi} w_N(u - \lambda)\, \frac{1}{2\pi N} \Big| \sum_{t=1}^{N} ({}_1x_t + a\, {}_2x_t)\, e^{itu} \Big|^2\, du$

is a reasonable estimate of the spectral density (8.1) if the weight function $w_N(u)$ satisfies the conditions cited in section 6. This implies that

(8.3) $\int_{-\pi}^{\pi} w_N(u - \lambda)\, \mathrm{Re}\, I_{12,N}(u)\, du$

is an estimate of Re $f_{12}(\lambda)$, where

(8.4) $I_{12,N}(u) = \frac{1}{2\pi N} \Big( \sum_{t=1}^{N} {}_1x_t\, e^{itu} \Big) \overline{\Big( \sum_{t=1}^{N} {}_2x_t\, e^{itu} \Big)}$

is the cross-periodogram.
But an argument similar to that of section 7 implies that

(8.5) $\mathrm{cov}\Big\{ \int_{-\pi}^{\pi} w_N(u - \lambda)\, \frac{1}{2\pi N} \Big| \sum_{t=1}^{N} {}_1x_t e^{itu} \Big|^2 du,\; \int_{-\pi}^{\pi} w_N(u - \lambda)\, \frac{1}{2\pi N} \Big| \sum_{t=1}^{N} {}_2x_t e^{itu} \Big|^2 du \Big\} \sim \frac{2\pi}{N}\, |f_{12}(\lambda)|^2 \int_{-\pi}^{\pi} w_N^2(u)\, du = \frac{2\pi}{N} \big\{ [\mathrm{Re}\, f_{12}(\lambda)]^2 + [\mathrm{Im}\, f_{12}(\lambda)]^2 \big\} \int_{-\pi}^{\pi} w_N^2(u)\, du.$

It then follows that the asymptotic variance of (8.3) is

(8.6) $\frac{\pi}{N} \big\{ f_{11}(\lambda) f_{22}(\lambda) + [\mathrm{Re}\, f_{12}(\lambda)]^2 - [\mathrm{Im}\, f_{12}(\lambda)]^2 \big\} \int_{-\pi}^{\pi} w_N^2(u)\, du.$
(8.7) f\N(u-\) Im
du 1 N = tt" y ^ w W - . r * 12 sinfX. 2 "
TT _ '
An argument similar to that given above indicates that the asymptotic variance of (8.7) is (8.8)
^ { / n ( X ) / 2 2 ( X ) + [Im / 1 2 ( X ) P - [ R e / 1 2 ( X ) P } £ < ( « ) < * " ,
X^O.
The expression (8.8) for the asymptotic variance should not be taken seriously very close to λ = 0, as the estimate and Im $f_{12}(\lambda)$ are both zero at λ = 0.

9. Prefiltering of a time series

The results given above are asymptotic results. Here asymptotic must be interpreted not only in terms of the magnitude of N but also in terms of the variation of the spectral
density. If, for example, one has a rapidly changing spectral density, one must expect a certain amount of contamination in the estimation of the spectral density at such a point from the immediate neighborhood. In many cases one will not know this beforehand. However, one may be led to this belief from rapid change in the estimated spectral density. Tukey suggests that if one suspects this at a point one should prefilter the series so as to smooth out the spectrum in the neighborhood of the point and estimate the spectrum of the filtered process. Assume the weight function $w_N(u)$ is rectangular for convenience, that is,

(9.1) $w_N(u) = \begin{cases} 1/2h, & |u| \leq h, \\ 0, & \text{otherwise} \end{cases}$

(see [9]). Then the bias is asymptotically proportional to $f''(\lambda)$. Let us see what effect such prefiltering has on an estimate of the spectral density and how it should most advantageously be set up.
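A minimal sketch of the prefiltering idea in Python with NumPy: pass the series through a simple linear filter chosen to flatten the suspected feature, estimate the spectrum of the filtered series, and divide by the squared gain of the filter. The first-difference filter and the uniform smoothing weight below are only illustrative choices.

    import numpy as np

    def prefiltered_estimate(x, s=11):
        """Estimate f(λ) via a first-difference prefilter."""
        N = len(x)
        y = x - np.roll(x, 1)  # prefilter y_t = x_t - x_{t-1} (applied circularly, for brevity)
        per = np.abs(np.fft.fft(y)) ** 2 / (2 * np.pi * N)
        # smooth the periodogram of the filtered series with a uniform weight
        k = np.zeros(N)
        idx = np.arange(-(s // 2), s // 2 + 1)
        k[idx % N] = 1.0 / s
        f_y = np.fft.ifft(np.fft.fft(per) * np.fft.fft(k)).real
        # undo the filter: divide by its squared gain |1 - e^{-iλ}|²
        lam = 2 * np.pi * np.fft.fftfreq(N)
        gain = np.abs(1 - np.exp(-1j * lam)) ** 2
        gain[0] = np.inf  # the difference filter annihilates λ = 0, so no estimate there
        return f_y / gain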
• • •

$2K$, and K is any positive number not greater than $\inf\, [M(x) - \alpha]/(x - \theta)$. As Chung remarks (footnote 4), (V) may be replaced by the weaker assumption that $V(x) = E[Y(x) - M(x)]^2$ is continuous and has the value σ² at x = θ. We observe that the proof of the theorem also remains valid with only minor changes if we merely assume $na_n \to c$ instead of $na_n = c$. We shall now show that Chung's theorem 9 remains valid if we remove assumption (VIb). This permits the theorem to be applied to problems (such as the bio-assay problem) in which M(x) is bounded. It was discovered independently by Blum [3], Kallianpur [4], and Kiefer and Wolfo-
witz [5] that $X_n$ tends to θ with probability one under certain conditions. The conditions of Blum's theorem 1 are implied by the assumptions of theorem 9. Let A be any positive number, and suppose that a model M satisfies all of the assumptions of theorem 9 except possibly (VIb). Let $K' = \inf_{|x - \theta| \leq A} [M(x) - \alpha]/(x - \theta)$, and construct a new model M' by defining Y(x) to have the same distribution as before if $|x - \theta| \leq A$, but the normal distribution $N[K'(x - \theta), 1]$ otherwise. The new model satisfies the conditions of theorem 9. Now introduce a sequence of coefficients $a_n$ such that $na_n \to c > 1/2K'$, and consider the process $X_1, X_2, \ldots$ generated under the model M'. Inasmuch as $X_n$ converges to θ with probability one, we can associate with each ε > 0 a number N(ε) such that the probability is at least 1 − ε that $|X_n - \theta| < A$ for all n > N(ε). We define a new process, whose starting point is $X_1' = X_{N+1}$ and which is generated by the model M' and coefficients $a_n' = a_{n+N}$. We shall have $X_m' = X_{N+m}$ for all m with probability at least 1 − ε. Let us denote generically the distribution of a random variable Z by $F_Z$. We observe that for all m, $|F_{X_{N+m}} - F_{X_m'}| < \varepsilon$. Since the process X' satisfies the conditions of theorem 9, we can choose m so large that $|F_{\sqrt{m}(X_m' - \theta)} - \Phi| < \varepsilon$, where Φ is the normal distribution function with mean 0 and variance

• • •

if n is large, so that it weights large errors much more heavily. In practice our estimates are usually truncated, which suggests that we consider the random variables $Y_n^A$ obtained by truncating $Y_n$ at ±A. Let $v_n^A$ denote the error variance of this truncated estimate, and call $t_A = \lim_n v_n^A$, if it exists, the asymptotic error vari-
ance truncated at A. It is easy to show that $t_A \to w$ as A → ∞, which suggests that w has the interpretation of a limiting truncated error variance. This notion does not require that we introduce the limit law or even that a limit law exist. We note that more precisely

$\lim_{A \to \infty} \lim_{n \to \infty} v_n^A = \liminf_{A,n \to \infty} v_n^A \leq \limsup_{A,n \to \infty} v_n^A = \lim_{n \to \infty} \lim_{A \to \infty} v_n^A,$

provided the limits involved all exist. This is a simple consequence of the fact that $v_n^A$ is for each n a nondecreasing function of A. Since in practice both n and A are finite, it is not clear whether w or u (in those cases in which w < u) will be the better approximation to $v_n^A$. However, if u > w, this can only mean that very large errors occur with very small probability. The situation is similar to that in the Petersburg paradox. The usual human practice of not attaching undue importance to large errors which are extremely unlikely to happen would lead to the use of w in preference to u. Another argument which also supports this choice lies in questioning the reasonableness of squared error as a loss function when the errors are very large. As a consequence of these considerations, we are inclined to use the asymptotic normal variance as a reasonable means of appraising the estimates discussed in section 2 when n is large. In particular, we recommend that coefficients $a_n \sim c/n$ be used in the quantal response problem. Further, we suggest that c be chosen so as to minimize $\sigma^2 c^2/(2\alpha_1 c - 1)$. This leads to $c = 1/\alpha_1$ and reduces the asymptotic normal variance to $\sigma^2/\alpha_1^2$. (In practice, of course, it will usually be necessary to guess at the value of $\alpha_1$.)

It is hardly necessary to remark that the only interest in any asymptotic theory resides in the hope that it will provide a useful approximation for the values of n with which we are dealing. Thus, for example, we use the asymptotic normal variance as an approximation to the variance of a normal distribution which approximates to the actual distribution of the estimate. For many statistical problems the only way of appraising the accuracy of these approximations lies in comparing them with computed values for small n or sampling experiments with moderate n. In the next section we consider computed values of the variance for small n of an approximate model, leading to conclusions about the choice of c in general agreement with those given above.

4. A linear approximation

If one attempts to apply the asymptotic normal theory of section 2, two difficulties arise. As with most asymptotic theories, it is not known how large n must be before the theory becomes applicable. Furthermore, the theory holds only if c > 1/2K, where K is any positive number satisfying $K \leq \inf\, [M(x) - \alpha]/(x - \theta)$. An examination of the proof shows that in this condition the infimum may be restricted to the values $|x - \theta| \leq A$, where A is an arbitrarily small positive number. Since we assume $M'(\theta) = \alpha_1$, it is therefore enough to require $c > 1/2\alpha_1$. This is consistent with the recommendation made in section 3 that $c = 1/\alpha_1$. In practice, however, $\alpha_1$ is usually not exactly known, and one might be tempted to use a "safe" small a priori estimate for $\alpha_1$, and a correspondingly large c, to avoid the possibility that $c \leq 1/2\alpha_1$, in which case the estimates have unknown behavior. This tendency would produce a bias towards values of c too high for greatest efficiency, but as $c^2/(2c - 1)$ increases slowly when c increases beyond 1, it would be natural to prefer a c which might be too large to one which might be too small. We shall now present an alternative approach which (while it also has drawbacks) does work for all values of c > 0 and does provide measures of precision for finite n.
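The behavior of the recommended coefficients is easy to examine by simulation. The following is a minimal sketch in Python with NumPy; the bounded regression function M, the constants, and the run lengths are illustrative assumptions, chosen so that $M'(\theta) = \alpha_1 = 1$ and the error variance is one, making the asymptotic normal variance $c^2/(2c - 1)$.

    import numpy as np

    def robbins_monro_mse(c, theta=0.3, n_steps=2000, reps=4000, seed=0):
        """Empirical n·E(X_n - θ)² for X_{n+1} = X_n - (c/n) Y_n, with α = 0."""
        rng = np.random.default_rng(seed)
        x = np.zeros(reps)  # all replicates start at x_1 = 0
        for n in range(1, n_steps + 1):
            y = np.tanh(x - theta) + rng.normal(size=reps)  # Y(x): bounded M plus unit-variance error
            x -= (c / n) * y
        return n_steps * np.mean((x - theta) ** 2)

    for c in (0.75, 1.0, 1.5):
        print(c, robbins_monro_mse(c), c * c / (2 * c - 1))  # empirical vs asymptotic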
The current approach is based on replacing the actual model by a simpler linear model, for which the actual error variances can be computed. Specifically, we assume $M(x) = \alpha + \beta(x - \theta)$ and $V(x) = \tau^2$, where β and τ² are known constants. We might take $\beta = \alpha_1$, $\tau^2 = \sigma^2\, [= V(\theta)]$, obtaining a model which is a good approximation to the actual one when x is near θ; alternatively, we might attempt to fit a straight line to that portion of M(x) where the $x_n$ are likely to fall. To simplify the notation, we shall set α = θ = 0. It is easily shown that

(4.1) $E(X_{n+1}^2) = (1 - \beta a_n)^2\, E(X_n^2) + a_n^2 \tau^2,$

from which it follows that $E(X_{n+1}^2)$ equals

(4.2) $E(X_1^2) \Big[ \prod_{\nu=1}^{n} (1 - \beta a_\nu) \Big]^2 + \tau^2 \sum_{k=1}^{n} a_k^2 \Big[ \prod_{\nu=k+1}^{n} (1 - \beta a_\nu) \Big]^2.$
Both of the terms of (4.2) have a significance. Since $X_{n+1} = X_n - a_n Y_n$, $E(X_{n+1}) = (1 - \beta a_n) E(X_n)$, so that given $X_1 = x_1$, $E(X_{n+1}) = x_1 \prod_{\nu=1}^{n} (1 - \beta a_\nu)$. If we square and take expectations, we find that the first term of (4.2) is the expected squared bias of the estimate. It is the contribution to the total error variance of the error of our initial guess $x_1$, and vanishes if $x_1 = \theta = 0$. The second term of (4.2) is independent of $X_1$, and represents the variance component of the error variance.
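Because the model is linear, (4.1) can simply be iterated to give the exact error variance for any finite n, with the harmonic coefficients $a_n = c/n$ recommended above. A minimal sketch in Python; β, τ, the starting error, and the horizon are illustrative values.

    def exact_mse(c, beta=1.0, tau=1.0, x1=1.0, n_steps=1000):
        """Iterate (4.1) with harmonic coefficients a_n = c/n."""
        m = x1 ** 2  # E(X_1²) for a fixed starting guess x_1
        for n in range(1, n_steps + 1):
            a = c / n
            m = (1 - beta * a) ** 2 * m + a ** 2 * tau ** 2
        return m

    for c in (0.75, 1.0, 1.5):
        # n·E(X²_{n+1}) approaches the asymptotic value c²τ²/(2βc − 1) for c > 1/(2β)
        print(c, 1000 * exact_mse(c), c * c / (2 * c - 1))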
We shall now specialize to harmonic coefficients $a_n = c/n$, so that (4.2) becomes

(4.3) $E(X_{n+1}^2) = E(X_1^2)\, B_n(c\beta) + (c\tau)^2\, V_n(c\beta),$

where