English. 871 pages. 1987.
Proceedings of the 1st World Congress of the
BERNOULLI SOCIETY
Tashkent, USSR, 8-14 September 1986. Volume 2: Mathematical Statistics, Theory and Applications. Editors
Yu. A. Prohorov and V. V. Sazonov
VNU SCIENCE PRESS
Utrecht, The Netherlands 1987
VNU Science Press BV, P.O. Box 2093, 3500 GB Utrecht, The Netherlands. © 1987 VNU Science Press BV. First published in 1987. ISBN 90-6764-103-0 (set); ISBN 90-6764-104-9 (Vol. 1); ISBN 90-6764-105-7 (Vol. 2). All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.
Printed in Great Britain by J. W. Arrowsmith Ltd, Bristol
CONTENTS
ABSTRACT INFERENCE (semi-parametric models...) (Session 1 - Chairman: P.J. Bickel) INVITED PAPERS Efficient testing in a class of transformation models: an outline P.J. Bickel
3
Abstract inference in image processing U. Grenander
13
CONTRIBUTED PAPERS Bayesian inference in semiparametric problems O. Bunke
27
Semi-parametric Bayes estimators N.L. Hjort
31
On estimating in models with infinitely many nuisance parameters A.W. van der Vaart
35
INFERENCE FOR STOCHASTIC PROCESSES (Session 2 - Chairman: A.N. Shiryaev) INVITED PAPERS Semimartingale convergence theory and conditional inference for stochastic processes P.D. Feigin
41
The foundations of finite sample estimation in stochastic processes - II V.P. Godambe CONTRIBUTED PAPERS Some uses of maximum-entropy methods for ill-posed problems in signal and crystallography theories D. Dacunha-Castelle
49
55
On asymptotic efficiency of the Cox estimator K. Dzhaparidze
59
Asymptotic properties of the maximum likelihood estimator, Ito-Ventzel's formula for semimartingales and its application to the recursive estimation in a general scheme of statistical models N.L. Lazrieva and T.A. Toronjadze
63
Maximum entropy selection of solutions to ill-posed martingale problems R. Rebolledo
67
Asymptotic inference for the Galton-Watson process without restriction to the super-critical case D.J. Scott
71
CROSS-VALIDATION (Session 3 - Chairman: D.V. Hinkley) INVITED PAPERS On resampling methods for confidence limits D.V. Hinkley
77
The interplay between cross-validation and smoothing methods B.W. Silverman
87
CONTRIBUTED PAPERS Bootstrap of the mean in the infinite variance case K.B. Athreya
95
Automatic curve smoothing W. Härdle
99
Non-parametric smoothing of the bootstrap A. Young
105
DATA ANALYSIS (projection pursuit, curve estimation . . .) (Session 4 - Chairman: E. Diday) INVITED PAPERS On constructing a general theory of automatic classification S.A. Aivazyan
111
Data analysis: geometric and algebraic structures B. Fichet
123
CONTRIBUTED PAPERS Generalized canonical analysis M. Tenenhaus
133
Detecting outliers and clusters in multivariate data based on projection pursuit I.S. Yenyukov
137
The inverse problem of the principal component analysis S.U. Zhanatauov
141
DESIGN OF EXPERIMENTS (nearest neighbour designs . . .) (Session 5 - Chairman: H.P. Wynn) INVITED PAPERS Numerical methods of optimal design construction V.V. Fedorov
147
Ordering experimental designs F. Pukelsheim
157
Observation and experimental design for autocorrelated processes H.P. Wynn
167
CONTRIBUTED PAPERS The design of experiments for model selection A.M. Herzberg and A.V. Tsukanov
175
The design and analysis of field trials in the presence of fertility effects C. Jennison
179
On the existence of multifactor designs with given marginals O. Krafft
183
Local asymptotic normality in Gaussian model of variance components M.B. Maljutov
187
ASYMPTOTIC METHODS IN STATISTICS (second order asymptotics, saddle point methods etc.) (Session 6 - Chairman: W. van Zwet) INVITED PAPERS Differential geometrical method in asymptotics of statistical inference S. Amari
195
Likelihood, ancillarity and strings O.E. Barndorff-Nielsen
205
On asymptotically complete classes of tests A.V. Bernstein
215
Tail probability approximations H.E. Daniels
223
On second order admissibility in simultaneous estimation B.Ya. Levit
225
Bounds for the asymptotic efficiency of estimators based on functional contractions; applications to the problem of estimation in the presence of random nuisance parameters J. Pfanzagl
237
CONTRIBUTED PAPERS On chi-squared goodness-of-fit tests for location-scale models F.C. Drost, J. Oosterhoff and W.C.M. Kallenberg
249
Some problems in statistics F. Hampel
253
Adaptive procedures for detection of change M. Hušková
257
On local and non-local measures of efficiency W.C.M. Kallenberg
263
Maximal deviations of Gaussian processes and empirical density functions V.D. Konakov
267
Chi-squared test statistics based on subsamples M. Mirvaliev
271
On Hodges-Lehmann indices of nonparametric tests Ya.Yu. Nikitin
275
Differential geometry and statistical inference L.T. Skovgaard
279
Large sample properties for generalizations of the trimmed mean N. Veraverbeke
283
MULTIVARIATE ANALYSIS (large number of parameters . . .) (Session 7 - Chairman: Y. Escoufier) INVITED PAPERS Discriminant analysis for special parameter structures J. Läuter
289
Asymptotics of increasing dimensionality in classification L.D. Meshalkin
299
Extensions and asymptotic studies of multivariate analyses A. Pousse
307
Estimation of symmetric functions of parameters and estimation of covariance matrix K. Takeuchi and A. Takemura
317
CONTRIBUTED PAPERS Introduction to general statistical analysis V.L. Girko
327
Estimation and testing of hypotheses in multivariate general Gauss-Markoff model W. Oktaba
331
Symmetry groups and invariant statistical tests for families of multivariate Gaussian distributions E.A. Pukhal'skii
335
Prescribed conditional interaction models for binary contingency tables T. Rudas
339
Quadratic invariant estimators with maximally bounded mean square error F. Stulajter
343
TIME SERIES (long range dependence, non-linear processes, estimation of spectra) (Session 8 - Chairman: H. Tong) INVITED PAPERS Non-Gaussian sequences and deconvolution M. Rosenblatt
349
Non-linear time series models of regularly sampled data: a review H. Tong
355
Robust spectral estimation I.G. Zhurbenko
369
CONTRIBUTED PAPERS On marginal distributions of threshold models J. Anděl and A. Fuchs
379
On the boundary of the central limit theorem for stationary ρ-mixing sequences R.C. Bradley
385
Detection of parameter changes at unknown times in linear regression models V.K. Jandhyala and I.B. MacNeill
389
Some aspects of directionality in time series analysis A.J. Lawrance
393
The algorithm of maximum mutual information for model fitting and spectrum estimation Z. Xie
397
BOUNDARY CROSSING PROBLEMS AND SEQUENTIAL ANALYSIS (Session 11 - Chairman: D.O. Siegmund) INVITED PAPERS Optimal sequential tests for relative entropy cost functions H.R. Lerche
403
Asymptotic expansions in some problems of sequential testing V.I. Lotov and A.A. Novikov
411
CONTRIBUTED PAPERS Asymptotic methods for boundary crossings of vector processes K. Breitung
421
First passage densities of Gaussian and point processes to general boundaries with special reference to Kolmogorov-Smirnov tests when parameters are estimated J. Durbin
425
Converse results for existence of moments for stopped random walks A. Gut
429
Mathematical programming in sequential testing theory U. Müller-Funk
435
EXTREME VALUES AND APPLICATIONS (strength of materials) (Session 12 - Chairman: R.L. Smith) INVITED PAPERS Theory of extremes and its applications to mechanics of solids and structures V.V. Bolotin
443
Extremes, loads and strengths H. Rootzén
461
Statistical models for composite materials R.L. Smith
471
CONTRIBUTED PAPERS The distribution of bundle strength under general assumptions H.E. Daniels
485
An estimate of the rate of convergence in the law of large numbers for sums of order statistics and their applications M.U. Gafurov and I.M. Khamdamov
489
The index of the outstanding observation among n independent ones L. de Haan and I. Weissman
493
Rain flow cycle distributions for fatigue life prediction under Gaussian load processes G. Lindgren and I. Rychlik
495
High-level excursions of Gaussian fields: a geometrical approach based on convexity V.P. Nosko
501
EPIDEMIOLOGY (mainly observational studies) (Session 13 - Chairman: N.T.J. Bailey) INVITED PAPERS Epidemic prediction and public health control, with special reference to influenza and AIDS N.T.J. Bailey and J. Estreicher
507
Mathematical models for chronic disease epidemiology K.G. Manton
517
Global forecast and control of fast-spreading epidemic process V. Vasilyeva, L. Belova, D. Donovan, P. Fine, D. Fraser, M. Gregg, I. Longini, L.A. Rvachev, L.L. Rvachev and V. Shashkov
527
CONTRIBUTED PAPERS
A statistical analysis of the seasonality in sudden infant death syndrome (SIDS) H. Bay
535
Epidemiological models for sexually transmitted infections K. Dietz
539
Results and perspectives of the mathematical forecasting of influenza epidemics in the USSR Yu.G. Ivannikov
543
The generalized discrete-time epidemic model with immunity I.M. Longini
547
GEOLOGY AND GEOPHYSICS (Session 14 - Chairman: D. Vere-Jones) INVITED PAPERS
Stochastic model of the mineral crystallization process from magmatic melt D.A. Rodionov
555
Applications of stochastic geometry in geology D. Stoyan
563
Classification and partitioning of igneous rocks E.H.T. Whitten
573
CONTRIBUTED PAPERS
Application of fuzzy sets theory to the solution of pattern recognition problems in oil and gas geology B.A. Bagirov, I.S. Djafarov and N.M. Djafarova
579
The application of multidimensional random functions for structural modelling of the platform cover J. Harff and G. Schwab
583
Prediction of rock types in oil wells from log data M. Homleid, V. Berteig, E. Bølviken, J. Helgeland and E. Mohn
589
On tests for outlying observations V.I. Pagurova and K.D. Rodionov
593
The ideas of percolation theory in geophysics and failure theory V.F. Pisarenko and A.Ya. Reznikova
597
HYDROLOGY AND METEOROLOGY (Session 15 - Chairman: A.H. Murphy) INVITED PAPERS Some stochastic models of rainfall with particular reference to hydrological applications D.R. Cox and I. Rodríguez-Iturbe
605
Canonical correlations for random processes and their meteorological applications A.M. Obukhov, M.I. Fortus and A.M. Yaglom
611
Statistical decisions and problems of the optimum use of meteorological information E.E. Zhukovsky
625
CONTRIBUTED PAPERS Application of data analysis methods for the evaluation of efficiency of weather modification experiments G. Der Megreditchian
637
Predictor-counting confidence intervals for the value of effect in randomized rainfall enhancement experiments E.M. Kudlaev
643
On behaviour of sea surface temperature anomalies L.I. Piterbarg and D.D. Sokolov
647
The sampling variability of the autoregressive spectral estimates for two-variate hydrometeorological processes V.E. Privalsky, I.G. Protsenko and G.A. Fogel
651
On the forecasting of the fluctuations in levels of closed lakes M.I. Zelikin, L.F. Zelikina and J. Schultze
655
BIOLOGICAL MODELS AND GENETICS (Session 16 - Chairman: P. Jagers) INVITED PAPERS The equilibrium laws and dynamic processes in population genetics Yu.I. Lyubich
661
Community size and age at infection: how are they related? A.R. McLean
671
Branching processes and neutral mutations O. Nerman
683
CONTRIBUTED PAPERS Limit theorem for some statistics of multitype Galton-Watson process I.S. Badalbaev and A.A. Mukhitdinov The regularity of metaphase chromosomes organization in cereals N.L. Bolsheva, N.S. Badaev, E.D. Badaeva, O.V. Muravenko and Yu.N. Turin
693
697
The genealogy of the infinite alleles model P. Donnelly and S. Tavaré
701
The relationship between the stochastic and deterministic version of a model for the growth of a plant cell population M.C.M. de Gunst
705
Gene action for agronomic characters in winter wheat W. Lone
709
Total progeny of a critical branching process S.M. Sagitov
713
Limit theorems for critical branching Crump-Mode-Jagers processes V.A. Topchii
717
The behaviour of the prey-predator system in the neighbourhood of statistical equilibrium Ye.F. Tsarkov
721
Bellman-Harris branching processes and distributions of marks in proliferating cell populations A.Yu. Yakovlev, M.S. Tanushev and N.M. Yanev
725
STOCHASTIC SIMULATION (Session 18 - Chairman: G.A. Mikhailov) INVITED PAPERS Methods in Quantum Monte Carlo M.H. Kalos
731
CONTRIBUTED PAPERS The Monte Carlo method and asynchronic calculations S.M. Ermakov
739
Increasing the efficiency of statistical sampling with the aid of infinite-dimensional uniformly distributed sequences I.M. Sobol'
743
Controlled unbiased estimators for certain functional integrals W. Wagner
747
Kac's model for a gas of n particles and Monte Carlo simulation in rarefied gas dynamics V.E. Yanitskii
751
STATISTICAL COMPUTING (Session 21 - Chairman: S. Mustonen) INVITED PAPERS Mathematical programming in statistics: an overview Y. Dodge
757
Model search in large model families T. Havránek
767
Programming languages and opportunities they offer to the statistical community P. Naeve
779
CONTRIBUTED PAPERS Statistical software for micro-computers R. Gilchrist
793
Data classification by comparison of computer implementations of statistical algorithms N.N. Lyashenko and M.S. Nikulin
797
Time-discrete approximation of Itô processes E. Platen
801
Fast algorithm of peak location in spectrum I.I. Surina
805
EMPIRICAL PROCESSES (Session 23 - Chairman: E. Giné) INVITED PAPERS Approximations of weighted empirical processes with applications to extreme, trimmed and self-normalized sums S. Csörgő and D.M. Mason
811
Minimization of expected risk based on empirical data V.N. Vapnik and A.Ja. Chervonenkis
821
CONTRIBUTED PAPERS Rates of convergence in the invariance principle for empirical measures I.S. Borisov
833
Almost sure behaviour of weighted empirical processes in the tails J.H.J. Einmahl and D.M. Mason
837
Convergence of the empirical characteristic functionals V.I. Kolčinskii
841
Sample approximation of the distribution by means of k points: a consistency result for separable metric spaces K. Pärna
845
Author index
849
ABSTRACT INFERENCE (semi-parametric models. . .) (Session 1) Chairman: P.J. Bickel
EFFICIENT TESTING IN A CLASS OF TRANSFORMATION MODELS: AN OUTLINE by P.J. Bickel, University of California, Berkeley

Transformation models of the following type have been discussed by Cox (1972), Clayton and Cuzick (1985), and Doksum (1985), among others. We observe (Z_i, Y_i) with Y_i ∈ J_1, an open subinterval of R, which are a sample from a population characterized as follows. There exists an unknown transformation τ from J_0, an open subinterval of R, onto J_1 with τ' > 0 such that Y = τ(T), where (Z,T) follow a parametric model. The intervals J_i here may be proper intervals, half-rays, or R itself. Colloquially, if Y is expressed in the proper unknown scale, i.e. as T, then the joint behaviour of (Z,T) has some nice parametric form. The case considered by previous authors is log T = θᵀZ + ε, where ε is independent of Z. The distributions of ε considered so far include: Cox (1972):
e^ε has an exponential distribution.
Clayton and Cuzick (1985): e^ε has a Pareto distribution with density

(1)  f(t) = (1 + ct)^{-(1 + 1/c)},  t > 0, c ≥ 0,

where c = 0 is the Cox model. An important special case of (1), considered by Bennett (1983), is the log-logistic model, c = 1, which has the attractive proportional-odds property. Doksum (1985): in generalization of the Box-Cox model, ε has a Gaussian distribution. It seems reasonable in these models to base inference about the parameters of the underlying parametric model, such as θ, c above, on the maximal invariant of the group of transformations generating this semiparametric model, {(z,t) → (z, τ(t))}. This maximal invariant is
just M = (Z,R), where Z = (Z_1, ..., Z_N) and R = (R_1, ..., R_N) is the vector of ranks of the Y_i. The likelihood of M, or the conditional likelihood L(θ) of R given Z = z, can in general only be expressed as an N-dimensional integral. It can be evaluated explicitly for the Cox model. Clayton and Cuzick propose some ingenious approximations, and Doksum proposes that both the value of L and its distribution be calculated approximately by Monte Carlo. So far, however, the asymptotic behaviour of these procedures is not well understood. In this paper we specialize to Z = 0,1 as in Bickel (1985). Moreover we suppose, as did Clayton and Cuzick, that the parameter θ governing the conditional density of T = τ^{-1}(Y) given Z = j, denoted f_j(·,θ), is real, and in particular that the distribution of ε is assumed known. In this context, for a subclass of transformation models, we indicate how to construct asymptotically efficient tests of H: θ = θ_0 vs K: θ > θ_0. The proofs of our results and a detailed treatment are given in Bickel (1986). The subclass includes the Pareto model for c > 1. The testing problem as such is not very interesting save in the case where θ_0 corresponds to independence of Y and Z, which is already well understood. However, the solution of the testing problem is a first step in the solution of the estimation problem, whose importance is clear. The tests we propose are based on "quadratic rank statistics",

(2)  T_N = N^{-1} Σ_{i=1}^N a(R_i/N, Z_i) + N^{-2} Σ_{i,j} b(R_i/N, R_j/N, Z_i, Z_j).
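A statistic of this form, T_N = N^{-1} Σ_i a(R_i/N, Z_i) + N^{-2} Σ_{i,j} b(R_i/N, R_j/N, Z_i, Z_j), is cheap to compute from data. The sketch below is ours, with simple Wilcoxon-type choices of a and b for concreteness; it does not implement the efficient scores constructed in the paper.

```python
def ranks(y):
    """Ranks R_i of the sample y (1 = smallest); assumes no ties."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    r = [0] * len(y)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def quadratic_rank_statistic(y, z, a, b):
    """T_N = N^{-1} sum_i a(R_i/N, Z_i) + N^{-2} sum_{i,j} b(R_i/N, R_j/N, Z_i, Z_j)."""
    n = len(y)
    r = ranks(y)
    lin = sum(a(r[i] / n, z[i]) for i in range(n)) / n
    quad = sum(b(r[i] / n, r[j] / n, z[i], z[j])
               for i in range(n) for j in range(n)) / n ** 2
    return lin + quad

# Illustrative (Wilcoxon-type) scores, NOT the efficient scores of the paper:
a_fn = lambda u, z: z * (u - 0.5)
b_fn = lambda u, v, zi, zj: zi * zj * (u - 0.5) * (v - 0.5)
```

For y = (3, 1, 2) and z = (1, 0, 1) the ranks are (3, 1, 2) and the statistic can be checked by hand.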
We interpret efficiency in this context conditionally on Z, or equivalently on the two sample sizes Σ_{i=1}^N Z_i and N - Σ_{i=1}^N Z_i. We show:

i) If θ_N = θ_0 + tN^{-1/2}, t > 0,

(3)  L((T_N - c_N)/σ_N | Z) → N(at,1) in probability for some a > 0,

where c_N is a sequence of normalizing constants.

ii) If S_N is any other sequence of statistics, not necessarily depending on the ranks only, such that plim_N P_{θ_0}[S_N > s | Z] = α, then, for each t and θ_N as above, plim_N P_{(θ_N,t)}[S_N ≥ s | Z] ≤ 1 - Φ(z_α - at).

An efficient statistic T_N is given by N^{1/2} S_N, where

S_N = N^{-1} Σ_i [Z_{i0} E_{θ_0}{c_0(T_i) | Z,R} + Z_{i1} E_{θ_0}{c_1(T_i) | Z,R}],

c_j(t) = (∂/∂θ) log f_j(t,θ_0), and Z_{ij} = I(Z_i = j). Equivalently, if D = (D_1, ..., D_N) are the antiranks defined by T_{(i)} = T_{D_i}, where T_{(1)} < ... < T_{(N)} are the order statistics of the sample, then

(4)  S_N = N^{-1} Σ_i Σ_j Z_{D_i j} E_{θ_0}(c_j(T_{(i)}) | Z,D).
To get an approximation to the scores in (4) we write f_j(·,θ_0) as f_j(·) and define π̂_1 = n/N, π̂_0 = m/N, where n = Σ_i Z_i and m = N - n. We treat the π̂_j as deterministic constants in the sequel. Let

h = π̂_0 f_0(·) + π̂_1 f_1(·),

with H the corresponding distribution function. Note that h and H depend on N and are random only through the π̂_j. Finally let, for 0 < t < 1,

(5)  ψ_j(t) = c_j(H^{-1}(t)),  g_j(t) = f_j(H^{-1}(t)) / h(H^{-1}(t)),

the density of H(T_i) given Z_i = j, and

(6)  γ_j(t) = (g_j'/g_j)(t).

We can rewrite (4) as S_N = S_{N1} + S_{N0},
where S_Nj = N^{-1} Σ_i Z_{D_i j} E(ψ_j(U_{(i)}) | Z,D), and the (Z_i,U_i) are i.i.d. with U_i given Z_i = j having density g_j; the marginal density of U_i is uniform, since

(7)  π̂_0 g_0 + π̂_1 g_1 = 1.

The next step is to expand ψ_j about i/N,

(8)  ψ_j(U_{(i)}) = ψ_j(i/N) + ψ_j'(i/N)(U_{(i)} - i/N) + ...,

so that

S_Nj = N^{-1} Σ_i Z_{D_i j} {ψ_j(i/N) + ψ_j'(i/N) E[(U_{(i)} - i/N) | Z,D]}

plus terms we expect to be of order O(N^{-1}). The first term of the approximation is a linear rank statistic. For the second we use a heuristic argument of Clayton and Cuzick, who argue that if y_i = E(U_{(i)} | Z,D) then the y_i satisfy approximately the recurrence relation

(9)  (y_{i+1} - y_i)^{-1} - (y_i - y_{i-1})^{-1} = (1 - Z_{D_i}) γ_0(y_i) + Z_{D_i} γ_1(y_i).
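A recurrence of this shape, (y_{i+1} - y_i)^{-1} = (y_i - y_{i-1})^{-1} + r_i(y_i) with r_i(y) = (1 - Z_{D_i})γ_0(y) + Z_{D_i}γ_1(y), can be evaluated by a forward pass once y_1 and the first gap d_1 = y_2 - y_1 are fixed. The sketch below is our illustration only; choosing y_1 and d_1 to satisfy the boundary conditions is the shooting problem that the integral-equation treatment addresses. As a sanity check, with γ_0 = γ_1 = 0 and y_1 = d_1 = 1/(N+1) the pass reproduces y_i = i/(N+1), the means of the uniform order statistics.

```python
def solve_recurrence(gamma0, gamma1, zd, y1, d1):
    """Forward pass for the recurrence: starting from y_1 and the gap d_1 = y_2 - y_1,
    update 1/d_i = 1/d_{i-1} + r_i(y_i), where
    r_i(y) = (1 - Z_{D_i}) * gamma0(y) + Z_{D_i} * gamma1(y)."""
    y = [y1]
    d = d1
    for i in range(1, len(zd)):
        y.append(y[-1] + d)
        r = (1 - zd[i]) * gamma0(y[-1]) + zd[i] * gamma1(y[-1])
        d = 1.0 / (1.0 / d + r)
    return y
```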
Let G_j(t) = (Nπ̂_j)^{-1} Σ_i I(U_i ≤ t, Z_i = j). Q_0 is the distribution function with jumps of size m^{-1} at the points i/N such that Z_{D_i} = 0, while Q_1 jumps (N-m)^{-1} at the points i/N with Z_{D_i} = 1. Evidently y_i is a function of i/N, Z, Q_0, Q_1 only. Interpolate smoothly in some way between the points i/N, 1 ≤ i ≤ N, to obtain a function v on (0,1) such that y_i = v(i/N).
Any solution of (9) must satisfy, for some c, d,

y_i = d + Σ_{j ≤ i} (c + Σ_{k ≥ j} [(1 - Z_{D_k}) γ_0(y_k) + Z_{D_k} γ_1(y_k)])^{-1},

or, for u = i/N,

(12)  v(u) = d + ∫_0^u (c + ∫_t^1 [γ_0(v(s)) π̂_0 dQ_0(s) + γ_1(v(s)) π̂_1 dQ_1(s)])^{-1} dt.

This is essentially the integral equation of Bickel (1985), save that we make the transformation H(·) and apply (8). Unfortunately, the hopes for analytic approximation of solutions to (12) expressed in Bickel (1985) have so far not been realized. However, suppose we (still formally) extend the definition of (12) to functions v(·,Q,Q') by replacing Q_0, Q_1 by arbitrary Q, Q' such that π̂_0 Q(t) + π̂_1 Q'(t) = t for t = 0, 1/N, ..., 1, with c, d depending on Q, Q'. Then, if Q = G_0, Q' = G_1, c = 1 and d = 0, v(u) = u formally satisfies the extension of (12), since by (7) γ_0 π̂_0 g_0 + γ_1 π̂_1 g_1 = 0. Therefore, if Δ(u) = v(u,Q_0,Q_1) - u and v = v(·,Q_0,Q_1),

Δ(u) = d + ∫_0^u {c + ∫_t^1 [γ_0(v(s)) π̂_0 dQ_0(s) + γ_1(v(s)) π̂_1 dQ_1(s)]}^{-1} dt - ∫_0^u {1 + ∫_t^1 [γ_0(s) π̂_0 dG_0(s) + γ_1(s) π̂_1 dG_1(s)]}^{-1} dt.

We determine the constants c(Q_0,Q_1), d(Q_0,Q_1) formally by a smooth fit at the boundaries,

(13)  Δ(0) = Δ(1) = 0.
ON ESTIMATING IN MODELS WITH INFINITELY MANY NUISANCE PARAMETERS by A.W. van der Vaart

Let p(·,θ,z) be a density with respect to a σ-finite measure ν on a measurable space (X,B), and let H be a collection of probability measures on a measurable space (Z,A). Suppose that for each θ, p(x,θ,z) is measurable as a function of (x,z), and set

(1)  p(x,θ,η) = ∫ p(x,θ,z) dη(z).
A mixture model is defined by:

X_1, X_2, ..., X_n are i.i.d. random elements;
(θ,η) ∈ Θ×H is unknown;
X_i has density p(·,θ,η) of type (1) w.r.t. ν on (X,B).

Mixture models are sometimes called structural models, as opposed to functional models. The latter type of model is described by:

X_1, X_2, ... are independent random elements;
(θ,z_1,z_2,...) ∈ Θ×Z^∞ is unknown;
X_i has density p(·,θ,z_i) w.r.t. ν on (X,B).

In fact it is possible to embed the two models in a single more general model: for each n = 1,2,...,

(2)  X_{n1}, X_{n2}, ..., X_{nn} are independent random elements;
(θ, η_{n1}, ..., η_{nn}) ∈ Θ×H^n is unknown;
X_{ni} has density p(·,θ,η_{ni}) w.r.t. ν on (X,B), where p(·,θ,η) takes the form (1).
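A mixture density of type (1) can be approximated by simple Monte Carlo, averaging the kernel p(x,θ,z) over draws z from the mixing distribution η. A minimal sketch; the exponential kernel and the Exp(1) mixing distribution are our illustrative choices (they admit the closed form p(x,θ,η) = θ/(θx+1)², convenient for checking), not the paper's example.

```python
import math, random

def kernel(x, theta, z):
    """p(x, theta, z): an exponential density with rate theta*z (illustrative choice)."""
    return theta * z * math.exp(-theta * z * x) if x > 0 else 0.0

def mixture_density(x, theta, draw_z, m=100_000, seed=0):
    """Monte Carlo approximation of p(x, theta, eta) = integral p(x, theta, z) d eta(z)."""
    rng = random.Random(seed)
    return sum(kernel(x, theta, draw_z(rng)) for _ in range(m)) / m
```

With η = Exp(1), θ = 1 and x = 1, the exact mixture density is 1/(1+1)² = 0.25, so the estimate should land close to that value.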
If H contains the degenerate distributions, then the functional model is a submodel of (2), since p(·,θ,δ_z) = p(·,θ,z). Next suppose that there exist measurable functions h(·,θ): (X,B) → R, φ(·,θ): (X,B) → R^m, and g(·,θ,z): R^m → R such that

(3)  p(·,θ,z) = h(·,θ) g(φ(·,θ), θ, z).

In many examples which have the structure (4), the efficient score function for θ (cf. Begun et al. (1983)) is given by

l̃(·,θ,η) = l̇(·,θ,η) - E_θ(l̇(X_11,θ,η) | φ(X_11,θ)).
While the estimator T_n is often optimal in the i.i.d. model, it is difficult to make the same statement for its performance in the model (4), due to the fact that it is unclear how to define an optimality concept in models with infinitely many nuisance parameters. However,
the estimator improves on other constructions in the literature. As an example, consider the model (1)-(2) with Z = (0,∞) and

p(x,y,θ,z) = z e^{-zx} · θz e^{-θzy} · 1{x>0, y>0}.

In the functional form of this model we have a sequence of pairs (X_j,Y_j) of independent, exponentially distributed random variables with hazard rates z_j and θz_j respectively, and the problem is to estimate the ratio θ of the hazard rates. Set η̄_n = n^{-1} Σ_{j=1}^n η_{nj}. Under the condition that the sequence of measures {η̄_n} is tight in such a way that all limit points η
have η(0,∞) = 1 (no mass should escape to either zero or infinity), the construction sketched above can be made rigorous and gives an asymptotically normal estimator with variance determined by (8). This follows from general results which will appear in van der Vaart (1987).

REFERENCES
Begun, J.M., Hall, W.J., Huang, W.M. and Wellner, J. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432-452.
Vaart, A.W. van der (1986). Estimating a Real Parameter in a Class of Semi-Parametric Models. Report 86-9, Dep. Math., Univ. Leiden.
Vaart, A.W. van der (1987). Statistical Estimation in Models with Large Parameter Spaces. Thesis.
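The exponential-pairs example can be exercised numerically. The estimator sketched below is not the efficient construction discussed in the paper; it is a simple consistent alternative that exploits the fact that the ratio R_j = X_j/Y_j has density θ/(θ+r)², which is free of the nuisance parameters z_j, so the marginal likelihood of the ratios can be maximized by bisection (its score is strictly decreasing in θ).

```python
import random

def simulate_pairs(theta, zs, rng):
    """(X_j, Y_j): independent exponentials with hazard rates z_j and theta*z_j."""
    return [(rng.expovariate(z), rng.expovariate(theta * z)) for z in zs]

def estimate_theta(pairs, lo=1e-9, hi=1e9, iters=200):
    """MLE of theta based only on the ratios R_j = X_j / Y_j, whose density
    theta/(theta + r)^2 does not involve the z_j; the score
    sum (r_j - t)/(t + r_j) is strictly decreasing in t, so bisect on a log scale."""
    r = [x / y for x, y in pairs]
    def score(t):
        return sum((ri - t) / (t + ri) for ri in r)
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5
```

Because the ratio distribution does not depend on the z_j, the estimator is insensitive to how the nuisance hazards are generated.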
INFERENCE FOR STOCHASTIC PROCESSES (Session 2) Chairman: A.N. Shiryaev
SEMIMARTINGALE CONVERGENCE THEORY AND CONDITIONAL INFERENCE FOR STOCHASTIC PROCESSES Feigin, P.D., Technion - Israel Institute of Technology, Haifa, Israel.

1. CONVERGENCE OF EXPERIMENTS

We commence by giving some brief and non-rigorous background to the decision-theoretic approach to asymptotics for statistical experiments. Suppose we have a parameter space Θ and two experiments with observables X and Y and families of measures P = {P_θ} and Q = {Q_θ} respectively. Consider procedures (statistics) ρ_P = ρ_P(X) and ρ_Q = ρ_Q(Y) for each experiment. For the loss function l(ρ,θ) we may define the risk function as

(1.1)  R_P(ρ_P,θ) = ∫ l(ρ_P,θ) dP_θ.

Given a sequence of experiments {Q^t} = {Q_θ^t} and a sequence of corresponding procedures {ρ_t}, we need to show that

(1.4)  L[(ρ_t, {H_θ^t ; θ ∈ Θ}) | P_0^t] → L[(ρ, {H_θ ; θ ∈ Θ}) | P_0],

where {H_θ} is a likelihood ratio process (i.e. E_{P_0}(H_θ) = 1 for all θ ∈ Θ). This would then allow us to give asymptotic bounds to "best" (e.g. "minimax") procedures for {Q^t}. In most cases, by a tightness argument, the joint weak convergence of (ρ_t, H^t) will follow from the weak convergence of the second component, at least along subsequences. Thus the essential aspect of verifying the convergence of experiments is in showing the weak convergence of the likelihood ratio processes.

Remark 1. If we think of t (or call it n) as a sample size index, then typically Q_θ^t and Q_{θ_0}^t will separate as t → ∞. So, for the usual asymptotics to fit into this framework, we are required to localize the original experiment about a given value θ_0 and define

(1.5)  P^t = {P_h^t = Q^t_{θ_0 + h a_t(θ_0)} ; h ∈ H},

with a_t ↓ 0 at a rate fast enough to prevent separation of P_h^t and P_0^t, and slow enough to make them asymptotically distinguishable.

Remark 2. Given the convergence of {P^t} to P in the sense of convergence of likelihood ratios, together with weak convergence of procedures {ρ^t} to ρ, we define the asymptotic risk of {ρ^t} as that of ρ in P. (The loss function l is given and fixed.) These will be locally asymptotic if ρ^t and P^t are localized versions of some ρ_t and Q^t; see Feigin (1986) for more details.

2. LOCALLY ASYMPTOTICALLY MIXED NORMAL EXPERIMENTS

When the (localized) likelihood ratio process converges weakly to that of a normal shift experiment with a random variance, then we say that the corresponding experiments are locally asymptotically mixed normal (LAMN). This behaviour occurs for many families of non-ergodic processes, and we will illustrate the counting process case below. The LAMN conditions were discussed by Jeganathan (1982), Basawa and Scott (1983), and others.

Given {Q^t} as in Section 1, we proceed to formally define the LAMN property at θ_0 ∈ Θ. We let θ_t(h) = θ_0 + a_t h, where {a_t = a_t(θ_0)} and h ∈ H. If Θ is some (finite-dimensional) vector space, then {a_t} may be a (matrix) linear operator sequence. We define P^t = {P_h^t = Q^t_{θ_t(h)}} as the local experiment and {Λ_t(h) = log (dP_h^t/dP_0^t) ; h ∈ H} as the logarithm of the likelihood ratio process.

Definition. {Q^t} is LAMN at θ_0 if there exists a sequence {a_t} as above and sequences of random variables {W_t}, {σ_t²} satisfying:

(i)  Λ_t(h) + ½[(W_t - h)² - W_t²] → 0 in P_0^t-probability for each h;
(ii)  (W_t, σ_t²) converges in law under P_0^t ... .

We denote by P_0 the extension of {P_0^t} to ∨_t F_t. A Taylor expansion in the scalar θ case will demonstrate the semimartingale approach to verifying (i) and (ii). We denote by [M] the "square bracket" process of a martingale M, and by Λ̇_t the derivative of Λ_t with respect to h, evaluated at h = 0. Then

(2.4)  Λ_t(h) ≈ h Λ̇_t(0) + ½ h² Λ̈_t(0).

We may readily show, under regularity conditions, that

(2.5)  Λ^t = {Λ̇_{st}(0) ; 0 ≤ s ≤ 1}

and R^t = Λ^t + [Λ^t] = {Λ̇_{st}(0) + [Λ̇(0)]_{st} ; 0 ≤ s ≤ 1}, G^t = {F_{st} ; 0 ≤ s ≤ 1} ... .

ON ASYMPTOTIC EFFICIENCY OF THE COX ESTIMATOR K. Dzhaparidze

II. P_{α,β}(sup{W_t^{(j)} ; 0 ≤ t ≤ 1} > δ) → 0, j = 0,1,2, as n → ∞, uniformly in β ∈ B.

III (Asymptotic regularity). The function f_i(β) and its first two derivatives f_i^{(1)} = (∂/∂β) f_i(β), f_i^{(2)} are continuous in β ∈ B uniformly in t ∈ [0,1]; they are bounded on B × [0,1] and σ²(α,β) = ... > 0.

Define the Cox estimator β̂_n for β by the condition

(2)  sup_β Σ_{i=1}^{r_n} ∫_0^1 ln Ψ_i^{n,s}(β) dM_i^{n,s} = Σ_{i=1}^{r_n} ∫_0^1 ln Ψ_i^{n,s}(β̂_n) dM_i^{n,s},  Ψ^n = col(Ψ_i^n, i = 1, ..., r_n).

Before characterising the asymptotic properties of β̂_n we give the following definitions.

Definition 1. Let H^n(α,β) = (H^n(α,β), F) be an r_n-variate predictable process such that

(3)  ∫_0^1 (H^n S^n)ᵀ dM^n ..., i,j = 1,2,

where M^n = M^n(α,β) = N^n - A^n(α,β), while for each b ∈ R¹ and a ∈ L₂, S^n = S^n(a,b) = col{b Z_i^n + a, i = 1, ..., r_n}. Concerning the second component, the above requirement is met under Conditions I-III, with the limiting variance c₂₂ = ∫ [b² ...] ... .

ASYMPTOTIC PROPERTIES OF THE MAXIMUM LIKELIHOOD ESTIMATOR, ITO-VENTZEL'S FORMULA FOR SEMIMARTINGALES AND ITS APPLICATION TO THE RECURSIVE ESTIMATION IN A GENERAL SCHEME OF STATISTICAL MODELS N.L. Lazrieva and T.A. Toronjadze

(f) constants μ > 0, λ > 0, C > 0 exist such that for any θ ∈ K and u > 0, h_θ(u) ≥ C ..., where h_θ(u) = -ln E_θ{exp(- ...)} ... . Then: 1) the family P_θ satisfies the uniform LAN condition, uniformly w.r.t. θ on any compactum; 2) ...; 3) ... .

2°. Itô-Ventzel's formula for semimartingales. Let, on some stochastic basis (Ω, F, (F_t), P), a semimartingale ξ and a family of semimartingales F(t,x) = M(t,x) + A(t,x), x ∈ R¹, be given. Assume that: 1) the mapping x → F(t,x) is twice continuously differentiable in the sense of the norm ||·||_T (if S is a semimartingale and S = M + A, then ||S||_T = E(∫_0^T |dA| + [M]_T^{1/2})), with the second derivative F_xx(t,x) continuous in x for all t and ω, and the processes F_x(t,0), F_xx(t,0) ...; 2) ... .

MAXIMUM ENTROPY SELECTION OF SOLUTIONS TO ILL-POSED MARTINGALE PROBLEMS R. Rebolledo

... is concave and, for all u ∈ C_b(D) and all (s,x) in R₊ × R^d,

Λ_s(u) = sup{F_s(P) - ∫_D u dP ; P ∈ Prob(s,x,L)} = sup{H_s(P) - ∫_D u dP ; P ∈ Prob(s,x,L)}.

With the notations of Theorem 1 we then have the following

COROLLARY. For all (s,x) in R₊ × R^d,

H_s(P_{(s,x)}) = sup{F_s(Q) ; Q ∈ Prob(s,x,L)} = F_s(P_{(s,x)}) = 0.

Thus the procedure presented in Theorem 1 gives the maxima of both the entropy and the free energy together with the Markovian property.

ACKNOWLEDGEMENTS. The author wishes to express his gratitude to the Academy of Sciences of the USSR. This research was partially supported by a UNDP-UNESCO grant and by FONDECYT grant No. 1087/86.
REFERENCES.
BOBADILLA, Gladys (1986a). Problemas de Martingalas sin condiciones de unicidad. Existencia y Aproximación de soluciones Markovianas Fuertes. Doctoral Thesis, Universidad Católica de Chile.
BOBADILLA, Gladys (1986b). Une méthode de sélection de probabilités markoviennes. C.R. Acad. Sci. Paris, t. 303, Série I, no. 4, 147-150.
TAKAHASHI, Yoichiro (1984). Entropy Functional for Dynamical Systems and their Random Perturbations. In: Stochastic Analysis, Proceedings of the Taniguchi Int. Symp., Katata and Kyoto, K. Itô (ed.), North-Holland, Amsterdam-New York-Oxford, 437-467.
ASYMPTOTIC INFERENCE FOR THE GALTON-WATSON PROCESS WITHOUT RESTRICTION TO THE SUPER-CRITICAL CASE David J .
Scott
Department of S t a t i s t i c s ,
La T r o b e U n i v e r s i t y ,
Consider the Galton-Watson process distribution
p
Bundoora,
{Z^;n>0 }
with
Australia. offspring
g i v e n by
8
=pr(z=k|z =1) = a . 0 k / A ( 6 ) n n-1 k
p . (k) ö
k = 0 , l ,2,...
for
,
6 £ 0
.
Then u(e)
=E(zJz0=i)
=eA'(6)/A(0)
and p2(9)
= E[(Z1-y (6))2|z0=1]
=6y'(0)
•
Set Y = Z . + Z, + . . . + Z . n 0 1 n The maximum l i k e l i h o o d e s t i m a t o r
(MLE) o f
U(6)
is
y (6 ) = (Y -Y ) / Y n n u n-l and
9
n
t h e MLE o f •
(
V
Y
0
) / Y
0
is the solution
n-1
•
L e t t h e p r i o r d i s t r i b u t i o n of posterior $(x)
5
n »
d i s t r i b u t i o n by =
(2TT)-ì
[
e~U
of
0
be d e n o t e d by
¥ (6 | Z , . . . Z^) / 2
du
,
W J H n - l * *
and
71
n(0)
and d e f i n e
,
the
72
u (a,b) = {9
(6 ) + a a < y (9 ) < y (9 ) + b a } . n n n n
Heyde (1979) proved the following. Theorem. and
p e
o
If
6 ->• 9„ , Y , °° , n 0 n-1 is non-degenerate then
IT (6 I Z
tt(9)
, . . . Z N ) D 9 -»• $ (b) - $ ( a ) n
- VJ (a,b)
is continuous at
9„ 0
.
From this Theorem, for
n and Y large, n-1 the probability assuming the prior II ,
letting
P
TT
denote
ir(9 Z . . . . Z )d9 0 n u(i,°°)
P (u(0)>l) = TT
Of 1 - i([l-y(8 )]/a ) n n which allows calculation of the posterior probability that the process is supercritical. Heyde's result is notable in that in contrast to the well-known results using a frequentist approach, no restriction to the supercritical case is required.
A corresponding result may be obtained using a frequentist approach, the essential change being that the asymptotics are as $Y_{n-1}$ converges to infinity.

Theorem. For the Galton-Watson process, if $\theta_0$ is the true value of $\theta$, then
$$\Pr\big(|\hat\theta_n - \theta_0| > \varepsilon \,\big|\, Y_{n-1} > N\big) \to 0$$
as $N \to \infty$, and
$$\sup_{x \in \mathbb{R}} \Big|\Pr\big([\mu(\hat\theta_n) - \mu(\theta_0)]/\sigma_n \le x \,\big|\, Y_{n-1} > N\big) - \Phi(x)\Big| \to 0$$
as $N \to \infty$.
This result justifies an asymptotic test of $H_0: \mu(\theta) \le 1$ vs $H_1: \mu(\theta) > 1$, which is to reject $H_0$ if $\mu(\hat\theta_n) > 1 + \sigma_n \Phi^{-1}(1 - \alpha)$. The P-value of the observation $\mu(\hat\theta_n)$ for this test is
$$1 - \Phi\big([\mu(\hat\theta_n) - 1]/\sigma_n\big) = \Phi\big([1 - \mu(\hat\theta_n)]/\sigma_n\big),$$
which is the same as the posterior probability calculated using Heyde's result.

Heyde's result is a form of asymptotic posterior normality, and it is of interest to compare it to the usual Bernstein-von Mises Theorem, which also gives asymptotic posterior normality. Consider observations $X_1, X_2, \ldots, X_n$ from a stochastic process with density $p_n(x_1, x_2, \ldots, x_n \mid \theta)$. Let $\hat\theta_n$ be the MLE of $\theta$ and let $\ell_n(\theta) = \log p_n(X_1, \ldots, X_n \mid \theta)$ be the log-likelihood. Suppose $\pi(\theta)$ is a continuous prior density for $\theta$ and $\pi_n(\theta \mid X_1, X_2, \ldots, X_n)$ is the posterior density.
A fairly recent version of the Bernstein-von Mises Theorem was given by Heyde and Johnstone (1979).

Theorem. Under regularity conditions,
$$\int_{\hat\theta_n + a\sigma_n}^{\hat\theta_n + b\sigma_n} \pi_n(\theta \mid X_1, \ldots, X_n)\, d\theta \to \Phi(b) - \Phi(a)$$
in $P_{\theta_0}$-probability, where $\sigma_n = [-\ell_n''(\hat\theta_n)]^{-1/2}$.

It is important to note that the regularity conditions involve convergence of various quantities in $P_{\theta_0}$-probability, and the result also gives convergence in $P_{\theta_0}$-probability. The Bernstein-von Mises Theorem can actually be viewed as a composition of two statements. The first is analytic: for certain sequences of observed values, asymptotic posterior normality holds. The second is that, with $P_{\theta_0}$-probability approaching one, the observed values of the process are such that asymptotic posterior normality holds.
Thus Heyde's Theorem concerning asymptotic posterior normality of the Galton-Watson process consists of the analytic part of the Bernstein-von Mises Theorem only. This suggests it can be obtained by stripping off the probabilistic aspects of Heyde and Johnstone's proof of the Bernstein-von Mises Theorem. This does indeed work, producing a new result from which Heyde's Theorem may be obtained.
Theorem. If $\hat\theta_n \to \theta_0$ in $P_{\theta_0}$-probability, $Y_{n-1} \to \infty$, and $\pi(\theta)$ is continuous at $\theta_0$, then
$$\int_{\hat\theta_n + a\sigma_n}^{\hat\theta_n + b\sigma_n} \pi_n(\theta \mid Z_0, \ldots, Z_n)\, d\theta \to \Phi(b) - \Phi(a),$$
where $\sigma_n = \hat\theta_n / \big[Y_{n-1}\, \sigma^2(\hat\theta_n)\big]^{1/2}$.

The connection between this result and Heyde's Theorem is quite simple. The theorem above states that asymptotically $\theta \sim N(\hat\theta_n, \sigma_n^2)$. Thus asymptotically
$$\mu(\theta) \sim N\big(\mu(\hat\theta_n),\, [\mu'(\hat\theta_n)\sigma_n]^2\big),$$
and since $\sigma^2(\theta) = \theta\mu'(\theta)$ we have $\mu'(\hat\theta_n)\sigma_n = [\sigma^2(\hat\theta_n)/Y_{n-1}]^{1/2}$, which is Heyde's result.

References

Heyde, C.C. (1979). On assessing the potential severity of an outbreak of a rare infectious disease: a Bayesian approach. Austr. J. Statist. 21, 282-292.
Heyde, C.C. and Johnstone, I.M. (1979). On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. B 41, 184-189.
CROSS-VALIDATION (Session 3) Chairman: D.V. Hinkley
ON RESAMPLING METHODS FOR CONFIDENCE LIMITS

David V. Hinkley
Center for Statistical Sciences and Department of Mathematics, The University of Texas at Austin

SUMMARY
Some recent research on bootstrap resampling methods is reviewed. Topics include: Monte Carlo and theoretical approximation as efficient alternatives to naive simulation; construction of approximate pivots; inversion of bootstrap tests; and conditional bootstraps. The majority of the discussion is addressed to statistics based on homogeneous random samples.

Key words and phrases: ancillary statistic, bootstrap, double bootstrap, likelihood, Monte Carlo, pivot, saddlepoint method.

1.
INTRODUCTION

This is a review of some bootstrap techniques associated with confidence limit calculations. The objective is to be reasonably comprehensive, and to introduce some topics of current research interest. Our starting point is a summary and illustration of the basic bootstrap method for homogeneous, independent data; see Efron (1982).

Suppose that $x = (x_1, \ldots, x_n)$ is a random sample of fixed size $n$ from an infinite population for which $\Pr(X \le x) = F(x)$ is the cumulative distribution function of a randomly sampled datum. The population quantity $\theta$, which is a differentiable functional $t(F)$, is of interest. It is estimated by $T = T(x_1, \ldots, x_n) = t(\hat F)$, where $\hat F$ is the empirical distribution function, defined by $n\hat F(x) = \operatorname{card}\{i : x_i \le x\}$, and the functional $t(\cdot)$ is assumed regular.

For the purpose of calculating confidence limits, distributions of quantities $D$ such as $D = T - \theta$ are required. Because $F$ will be unknown, although possibly belonging to a known family indexed by $\theta$ and nuisance parameters, it will be usual to estimate this distribution by $\hat F$, say, and thence to estimate the distribution of $T - \theta$. If the latter step is not amenable to theoretical calculation, then we can approximate the distribution by a Monte Carlo simulation method. To be specific, consider $D = T - \theta$ and its distribution function $G(d) = G(d, F) = \Pr(T - \theta \le d \mid F)$, which is to be estimated by
$$\hat G(d) = G(d, \hat F) = \Pr(T^* - T \le d \mid \hat F). \qquad (1)$$
The simplest Monte Carlo technique for approximating $\hat G$ is as follows:

Step 1°. Use a Monte Carlo simulation method to generate a random sample $x^* = (x_1^*, \ldots, x_n^*)$ from $\hat F$.
Step 2°. Calculate $T^* = T(x_1^*, \ldots, x_n^*)$ and thence the simulated value $T^* - T$, which is to $\hat F$ what $T - \theta$ is to $F$.
Step 3°. Perform Steps 1° and 2° a total of $B$ times and approximate $\hat G(d)$ by
$$\hat G_B(d) = B^{-1} \operatorname{freq}\{T^* - T \le d\}. \qquad (2)$$

Superscript $*$ will always denote a random variable generated from $\hat F$. When $\hat F$ is used for $F$ in Step 1°, $x^*$ is obtained by uniform random sampling with replacement from $x$; hence the name "resampling method." While this nonparametric case is of most interest, many theoretical aspects of bootstrap methods are most easily discussed in the parametric case where $F$ belongs to a known family.

The estimated distribution $\hat G$ yields confidence limits in the usual way. Thus if $d_p = \hat G^{-1}(p)$, then the lower $1 - \alpha$ confidence limit for $\theta$ is $T - d_{1-\alpha}$ and the upper $1 - \alpha$ limit is $T - d_\alpha$. If the Monte Carlo approximation (2) is used, then $\hat G_B^{-1}(p)$ is taken to be the $[(B+1)p]$th ordered value of $T^* - T$.

Example 1. Suppose that the first row of Table 1 is a random sample from a population whose mean is $\theta = \int x\, dF(x) = \mu$, and that we estimate $\mu$ by $T = \bar x = n^{-1}\sum x_i$, whose value is 17.87. We wish to estimate the distribution of $\bar x - \mu$ and hence obtain an upper 90% confidence limit for $\mu$. (Superior alternatives to use of $\bar x - \mu$ will be discussed later.)

If $F$ is assumed to be a normal distribution, then we estimate $F$ by the $N(T, \hat\sigma^2)$ distribution with $\hat\sigma^2 = n^{-1}\sum (x_i - \bar x)^2$, and thence calculate $\hat G$ theoretically to be $\hat G(d) = \Phi(\sqrt{n}\, d/\hat\sigma)$. Because $\hat\sigma^2 = 46.53$, $\hat G(d) = \Phi(0.46\, d)$ for these data. Then the 90% upper confidence limit for $\mu$ is $T - \hat G^{-1}(0.10) = 17.87 - (0.46)^{-1}\Phi^{-1}(0.10) = 20.63$.
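Steps 1°-3° take only a few lines of code. The sketch below is my own illustration: the data are made up (the values of Table 1 are not reproduced here), so the numbers are illustrative only.

```python
import random
import statistics

random.seed(1)

# Hypothetical sample standing in for the first row of Table 1
x = [11.2, 23.5, 17.1, 9.8, 25.0, 14.3, 20.7, 16.4, 22.9, 18.1]
T = statistics.mean(x)

# Steps 1-3: B resamples, each recording the simulated value T* - T
B = 999
d = sorted(statistics.mean(random.choices(x, k=len(x))) - T for _ in range(B))

# Upper 1 - alpha limit: T - d_alpha, with d_alpha the [(B+1)alpha]-th ordered value
alpha = 0.10
d_alpha = d[int((B + 1) * alpha) - 1]
upper = T - d_alpha
print(round(T, 2), round(upper, 2))
```

With $B = 999$ and $\alpha = 0.10$, the quantile $d_\alpha$ is the 100th ordered value of $T^* - T$, matching the $[(B+1)p]$ rule above.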
If nothing is assumed about $F$ and we take $\hat F$ to be the empirical distribution function, then the estimate (1), now written $\hat G(d) = \Pr(\bar x^* - \bar x \le d \mid \hat F)$, will be approximated by (2) using the simulation resampling procedure. A very small application with $B = 9$ is illustrated in Table 1, wherein each sample $x^*$ is given in the equivalent form of frequencies $f_i^*$ for the data values $x_i$. We approximate

Let $m, n \to \infty$ be such that $m n^{-1} \to 0$, and let $a_{m,n}$ and $b_{m,n}$ be normalizing constants defined through the truncation function $\tau_\alpha$, where $\tau_\alpha(x) = x$ if $1 < \alpha < 2$ and $\tau_\alpha$ is the $\tau(x)$ of (2) for $0 < \alpha < 1$. Let
$$H_{m,n}(x, \omega) = P\Big(\Big(\sum_{j=1}^{m} Y_j - b_{m,n}\Big) a_{m,n}^{-1} \le x \,\Big|\, \underline{X}\Big). \qquad (6)$$

Theorem 3. Let $\alpha < 2$ and $m, n \to \infty$ such that $m n^{-1} \to 0$, and let $H_{m,n}(\cdot,\cdot)$ be as in (6). Then
$$\sup_x \big|H_{m,n}(x, \omega) - G_\alpha(x)\big| \xrightarrow{p} 0,$$
where $G_\alpha(\cdot)$ is the distribution function of a stable law of order $\alpha$ whose characteristic function $\psi(t)$ is given by
$$\psi(t) = \exp\Big(\int \big(e^{itx} - 1 - it\,\tau_\alpha(x)\big)\, \Lambda_\alpha(dx)\Big),$$
where $\Lambda_\alpha(\cdot)$ is as in (3).

Notice that $\big(\sum_{j=1}^{m} Y_j - b_{m,n}\big) a_{m,n}^{-1}$ is the bootstrap version of the statistic $\big(\sum_{j=1}^{n} X_j - L_n\big) a_n^{-1}$, where $L_n$ and $a_n$ are as in case (i). This indicates that the bootstrap method works when $\alpha < 2$ provided the resample size $m$ is small compared to the original sample size $n$.

Details of the proofs of the results of this paper may be found in [1, 2].
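The m-out-of-n prescription is easy to experiment with. The sketch below is my own illustration, with an arbitrary heavy-tailed sample; it only sets up the two resampling regimes and makes no distributional claims.

```python
import random

random.seed(0)

# Heavy-tailed sample: Pareto tail index alpha = 1.5, so infinite variance
alpha, n = 1.5, 5000
X = [random.paretovariate(alpha) for _ in range(n)]

def boot_means(sample, m, B):
    """Bootstrap means with resample size m (take m << n for m-out-of-n)."""
    return [sum(random.choices(sample, k=m)) / m for _ in range(B)]

small = boot_means(X, m=50, B=500)    # m/n -> 0 regime of Theorem 3
naive = boot_means(X, m=n, B=200)     # classical n-out-of-n resampling
print(len(small), len(naive))
```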
REFERENCES:
1. Athreya, K.B. (1986). Bootstrap of the mean in the infinite variance case, I and II. Technical Reports 86-21 and 86-22, Department of Statistics, Iowa State University, Ames, Iowa 50011.
2. Athreya, K.B. (1986). Bootstrap of the mean in the infinite variance case. To appear in the Annals of Statistics (1987).
3. Bickel, P.J. and Freedman, D. (1981). Some asymptotic theory for the bootstrap. Annals of Statistics, 9, 1196-1217.
4. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
5. Feller, W. (1971). An Introduction to Probability Theory and Its Applications. John Wiley, N.Y.
6. Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics, 9, 1187-1195.
AUTOMATIC CURVE SMOOTHING
Wolfgang Härdle Institut Wirtschaftstheorie II Universität Bonn Adenauerallee 24-26 D-5300 Bonn, Federal Republic of Germany
1. INTRODUCTION

Regression smoothing is a method for estimating the mean function from observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ of the form
$$Y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where the observation errors are independent, identically distributed, mean-zero random variables. There are a number of approaches for estimating the regression function $m$. Here we discuss nonparametric smoothing procedures, which are closely related to local averaging: to estimate $m(x)$, average the $Y_i$'s which are in some neighborhood of $x$. The width of this neighborhood, commonly called the bandwidth or smoothing parameter, controls the smoothness of the curve estimate. Under weak conditions (the bandwidth shrinks to zero, not too rapidly, as $n$ increases) the curve smoothers consistently estimate the regression function $m$. In practice, however, one has to select a smoothing parameter in some way. A too-small bandwidth, resulting in high variance, is not acceptable, and neither is oversmoothing, which creates a large bias. It is therefore highly desirable to have some automatic curve smoothing procedure.
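As a minimal illustration of the local-averaging idea (my own sketch, not the paper's: the kernel, bandwidth, and test function are arbitrary, and a Nadaraya-Watson-style weighted average is used rather than the paper's exact $n^{-1}h^{-1}\sum K\,Y_i$ form):

```python
import math
import random

def kernel_smooth(x_grid, xs, ys, h):
    """Local average with Gaussian kernel weights (Nadaraya-Watson form)."""
    out = []
    for x in x_grid:
        w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
        out.append(sum(wi * yi for wi, yi in zip(w, ys)) / sum(w))
    return out

random.seed(0)
n = 200
xs = [i / n for i in range(1, n + 1)]                 # equally spaced design
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]

m_hat = kernel_smooth(xs, xs, ys, h=0.05)
err = max(abs(m - math.sin(2 * math.pi * x)) for x, m in zip(xs, m_hat))
print(round(err, 3))
```

Away from the boundary the estimate tracks the true curve closely; the largest errors occur near the interval endpoints, where the kernel window is one-sided.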
Proposed methods for choosing the window size automatically are based on estimates of the prediction error or on adjustments of the residual sum of squares. It has been shown by Härdle, Hall and Marron (1986) (HHM) that all these proposals are asymptotically equivalent but can be quite different in a practical situation. In this paper we highlight these difficulties with automatic curve smoothing and construct situations where some of the proposals seem to be preferable.

2. AUTOMATIC CURVE SMOOTHING

To simplify the presentation, assume the design points are equally spaced, i.e. $x_i = i/n$, and assume that the errors have equal variance, $E\varepsilon_i^2 = \sigma^2$. We study kernel smoothers
$$\hat m_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} K\Big(\frac{x - x_i}{h}\Big) Y_i,$$
where $h$ is the bandwidth and $K$ is a symmetric kernel function. It is certainly desirable to tailor the automatic curve smoothing so that the resulting regression estimate is close to the true curve. Most automatic bandwidth procedures are designed to optimize the averaged squared error (ASE)
$$d_A(h) = n^{-1} \sum_{i=1}^{n} \big[\hat m_h(x_i) - m(x_i)\big]^2 w(x_i),$$
where $w$ is some weight function. These automatic bandwidth selectors are defined by multiplying $p(h) = n^{-1} \sum_{i=1}^{n} \big(Y_i - \hat m_h(x_i)\big)^2 w(x_i)$ by a correction factor $\Xi(n^{-1}h^{-1})$. The examples we treat here are Generalized Cross-Validation (Craven and Wahba 1979),
$$\Xi_{GCV}(n^{-1}h^{-1}) = \big(1 - n^{-1}h^{-1}K(0)\big)^{-2};$$
Akaike's Information Criterion (Akaike 1970),
$$\Xi_{AIC}(n^{-1}h^{-1}) = \exp\big(2n^{-1}h^{-1}K(0)\big);$$
Finite Prediction Error (Akaike 1974),
$$\Xi_{FPE}(n^{-1}h^{-1}) = \big(1 + n^{-1}h^{-1}K(0)\big)\big/\big(1 - n^{-1}h^{-1}K(0)\big);$$
a model selector of Shibata (1981),
$$\Xi_{S}(n^{-1}h^{-1}) = 1 + 2n^{-1}h^{-1}K(0);$$
and the bandwidth selector T of Rice (1984),
$$\Xi_{T}(n^{-1}h^{-1}) = \big(1 - 2n^{-1}h^{-1}K(0)\big)^{-1}.$$

Let $\hat h$ denote the bandwidth that minimizes $(p \cdot \Xi)(h)$. The automatic curve smoother is defined as $\hat m_{\hat h}(x)$. This automatic curve smoothing procedure is asymptotically optimal for the above $\Xi$ in the sense that $d_A(\hat h)/d_A(h_0) \to 1$, where $h_0$ denotes the minimizer of $d_A$. The relative differences are quantified in the following theorem.

Theorem. Let $\hat h, h_0 \sim n^{-1/5}$. Then
$$n^{3/10}(\hat h - h_0) \to N(0, \sigma^2) \quad \text{and} \quad n\big[d_A(\hat h) - d_A(h_0)\big] \to C\,\chi_1^2$$
in distribution, where $\sigma^2$ and $C$ are defined in HHM. A very remarkable feature of this result is that the constants $\sigma^2$ and $C$ are independent of $\Xi$. In a simulated example we generated 100 samples of size $n = 75$ with $\sigma = 0.05$ and $m(x) = \sin(2\pi x)$. The kernel function was taken to be $K(x) = (15/8)(1 - 4x^2)^2\,\mathbf{1}(|x| \le 1/2)$

$\ldots (E(1), \ldots, E(K))$, where $E(k) = s^{-1}(k)$. If $Z$ is an ordered set, this corresponds to a preference or hierarchical classification. The fuzzy classification problem corresponds to $Z = (z_1, \ldots, z_K)$, where the $z_k$ are real numbers such that $z_k \ge 0$ and $\sum_k z_k = 1$; formally, in this case $Z$ is a simplex whose vertices correspond to the numbers of the classes and the coordinates of points to the probabilities that an object belongs to a class. Obviously, the availability of a priori information concerning
the type of an AC problem and the admissible classifications leads to constraints which pinpoint a set S(E) in the set of all mappings. For example, if there is a training (verified) subsample E'(k) in E, then one obtains the conditions s(x) = k for x ∈ E'(k). Thus to specify S, one needs to formalize the problem, i.e., to set Z and to state the conditions which single out S in the set of all the mappings E → Z.

L(E) is the set of the descriptions of E in the framework of a given algorithm. This component relates to the choice of the classification means and corresponds to the representativeness subspace, which is understood in a broader sense than in Diday et al. (1979). The set L is regarded as a subset of the set of all mappings Z → Y, where Y is the set of the values which represent the classification results. For example, if E ⊂ R^p and Z = {1, ..., K}, and if every class is supposed to be described by the sample mean, then Y = R^p and L(E) = {Z → Y} = R^p × ... × R^p. If a class is described by a standard, and if the standard of the k-th class is known to lie in a δ_k-neighborhood of the representative y_k (δ_k may be zero for some k), then L(E) = {(ỹ_1, ..., ỹ_K) : ỹ_k ∈ R^p, ‖ỹ_k − y_k‖ ≤ δ_k}. If the classes are described by standard sets, then the corresponding part of the set of all subsets of E is taken as Y. The set Y can be of a completely different nature than the space of the characters to be measured. Moreover, Y itself can consist of spaces which differ in nature, as required by the standards of different classes. Note that in this way we can incorporate many well-known algorithms which cannot be described by the above-mentioned nuées dynamiques method (for example, FOREL; see Aivazyan et al. (1974)) and construct some new algorithms.

R(E) is the set of certain finite subsets of E, the so-called 'portions', into which E is subdivided for classification. This component is introduced in order that one can treat algorithms both in parallel (R(E) consists of a single element symbolizing the entire set E) and in series (for example, when objects are classified in a one-at-a-time manner, R(E) = E). Note that although Diday et al. (1979) describes only parallel procedures, the nuées dynamiques method for serial procedures was developed in Diday (1975).

K is an operator from S × L × R into S, called a classifier, since it shows how to apply the available means Y to an AC problem of type Z in order to pass from the state s_n of the sample E with the description l_n, given the current portion r_{n+1}.

If the dimension p grows together with the sample size m (p/m → const > 0), then some basic statistical assumptions of the classical asymptotic analysis (which are valid for m → ∞ and p = const) are violated.
In situations like this it is required that the statistical properties of the employed rules and procedures be analyzed under the conditions of the above-mentioned asymptotics (often called the Kolmogorov asymptotics). Specifically, the paper of Tsibel' (1987) is devoted to these problems. In the theory and practice of automatic classification it is also important that the dimension of a mathematical model be chosen correctly depending on the sample size m. The formulation and solution of such problems can be found in Enyukov (1986).

2.4 The methods of constructing partitions stable to variation in the controlled free parameters of an AC algorithm

The idea of multiple solution (by different methods) of the same problem and subsequent selection of the most frequent variants has long been used in statistics. This idea underlies the approach (see Aivazyan et al. (1983) and Aivazyan (1980)) to developing statistical methods which enable us to obtain inference stable to variation in initial conditions on the nature and accuracy of the data. Specifically, it is suggested that an AC problem (in its optimization statement) should be solved repeatedly for different objective functions, for example for a parametric family of objectives. As a result we obtain a set of partitions into classes: every objective is associated with its best method, and conversely every best method corresponds to its partition. In the set obtained one has to select one or several partitions which are relatively stable to changing the objectives. Obviously, a change in the values of the controlled free parameters of an AC algorithm, in particular the parameters which determine the specific form of an objective, is equivalent to varying the initial conditions concerning the nature and accuracy of the data under classification.

2.5 The employment of training elements in the choice of an appropriate metric for AC problems

The definition of the distance between the objects (or groups of objects) to be classified is a 'bottleneck' of AC theory. Usually, a priori information on the probabilistic and geometrical nature of the multivariate observations is lacking. In such a case the successful choice of a metric depends on the statistician's skill in formalizing the professional knowledge and intuition of the expert in the area where the AC problem arises. An interesting approach to using training elements for the 'adjustment' of an appropriate metric is suggested, for example, in Diday and Moreau (1984).

2.6 Estimation of the number of classes in AC problems

The problems related to estimating an integer parameter are traditionally difficult in mathematical statistics (for example, estimating the number of factors in factor analysis, the number of basis functions in regression analysis, etc.). In automatic classification the problem of estimating the unknown number of classes can be stated (in probabilistic terms) as the problem of determining the number of modes of a multivariate density (in the nonparametric formulation) or the number of components in the mixture of distributions characterizing the multivariate observations to be classified (in the parametric formulation). Some interesting results, both asymptotic (the number of observations grows infinitely) and nonasymptotic, have been obtained in the latter case (see Orlov (1983), Tsibel' (1987)).
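In the nonparametric formulation, a naive one-dimensional sketch (my own illustration; the data, bandwidth, and grid are arbitrary choices) counts the modes of a kernel density estimate:

```python
import math
import random

random.seed(2)

# Two well-separated clusters on the line (toy data)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(6, 1) for _ in range(200)])

def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    c = 1.0 / (len(sample) * h * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

grid = [-4 + 0.05 * i for i in range(281)]       # covers [-4, 10]
dens = [kde(x, data, 1.0) for x in grid]

# Number of interior local maxima = estimated number of modes (classes)
modes = sum(1 for i in range(1, len(dens) - 1)
            if dens[i - 1] < dens[i] > dens[i + 1])
print(modes)
```

The mode count is, of course, sensitive to the bandwidth; that sensitivity is exactly why the integer-estimation problem is hard.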
REFERENCES

Aivazyan, S.A. (1979). Extremal formulation of the basic problems of applied statistics. In: National School on Applied Multivariate Statistical Analysis: Algorithms and Software. Computer Centre of the Planning Committee of the Armenian SSR, Erevan, pp. 24-49.
Aivazyan, S.A. (1980). Statistique mathématique appliquée et problème de la stabilité des inférences statistiques. In: Data Analysis and Informatics. North-Holland Publ. Comp.
Aivazyan, S.A., Bezhaeva, Z.I., and Staroverov, O.V. (1974). Classification of Multivariate Observations. Statistika, Moscow.
Aivazyan, S.A., and Bukhshtaber, V.M. (1985). Data analysis, applied statistics, and constructing a general theory of automatic classification. In: Diday et al. (1979) (Russian translation), pp. 5-22.
Aivazyan, S.A., Enyukov, I.S., and Meshalkin, L.D. (1983). Applied Statistics: Introduction to Modelling and Primary Data Processing. Finansy i Statistika, Moscow.
Bukhshtaber, V.M., and Maslov, V.K. (1977). Factor analysis and extremum problems on Grassmann varieties. In: Mathematical Methods of Solving Economic Problems, N 7, pp. 87-102.
Bukhshtaber, V.M., and Maslov, V.K. (1980). The problems of applied statistics as extremum problems on irregular domains. In: Algorithms and Software of Applied Statistical Analysis. Uchenye Zapiski po Statistike, v. 36, Nauka, Moscow, pp. 381-395.
Bukhshtaber, V.M., and Maslov, V.K. (1985). Tomography methods of multivariate data analysis. In: Statistics. Probability. Economics. Nauka, Moscow, pp. 108-116.
Demonchaux, E., Quinqueton, J., and Ralambondrainy, H. (1985). CLAVECIN: un système expert en analyse de données. Rapports de Recherche, N 431. Institut National de Recherche en Informatique et en Automatique, Le Chesnay.
Diday, E., et al. (1979). Optimisation en classification automatique. Institut National de Recherche en Informatique et en Automatique, Le Chesnay.
Diday, E., and Moreau, J.V. (1984). Learning Hierarchical Clustering from Examples. Rapports de Recherche, N 289. Institut National de Recherche en Informatique et en Automatique, Le Chesnay.
Enyukov, I.S. (1986). Methods, Algorithms, and Programs of Multivariate Statistical Analysis. Finansy i Statistika, Moscow.
Enyukov, I.S. (1986). Projection pursuit in reconnaissance data analysis. Reviews of the First World Congress of the Bernoulli Society for Mathematical Statistics and Probability, Tashkent.
Friedman, J.H., and Tukey, J.W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23, pp. 881-890.
Girko, V.L. (1985). 'Struggle against dimension' in multivariate statistical analysis. In: Application of Multivariate Statistical Analysis in Economics and Quality of Product. Tartu, pp. 43-52.
Hahn, G.J. (1985). The American Statistician, v. 39, N 1, pp. 1-16.
Huber, P.J. (1985). Projection pursuit. The Annals of Statistics, v. 13, N 2, pp. 435-475.
McQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symp. Math. Stat. and Probab., v. 1, pp. 281-297.
Orlov, A.I. (1983). Some probabilistic aspects of classification theory. In: Applied Statistics. Nauka, Moscow, pp. 166-179.
Sebestyen, G.S. (1962). Decision-Making Processes in Pattern Recognition. The Macmillan Company.
Shlezinger, M.I. (1965). On spontaneous pattern recognition. In: Reading Automata. Naukova Dumka, Kiev, pp. 88-106.
Tsibel', N.A. (1987). Statistical investigation into the properties of the estimates of multivariate analysis model dimension. Ekonomika i Matematicheskie Metody, the USSR Acad. of Sci. (in print).
Vapnik, V.N. (1979). Restoration of Dependences from Empirical Data. Nauka, Moscow.
DATA ANALYSIS: GEOMETRIC AND ALGEBRAIC STRUCTURES

B. Fichet
Laboratoire de Biomathématiques, Faculté de Médecine, Université d'Aix-Marseille II
Here we present a survey of mathematical
structures which arise in
data analysis. After recalling the fundamental triple of data analysis, we investigate the usual representations of this triple : Euclidean embeddings, Lp-embeddings, hierarchies, pyramidal representations, additive trees, star graphs... Each representation corresponds to special dissimilarities and the set of these dissimilarities is shown to be a cone in a finite dimensional vector space. We examine the respective inclusions of the cones, as well as their geometric nature, especially convexity and closure. Then, approximation problems may be studied : least squares approximation with respect to a given norm, subdominant (or submaximal) and superdominated (or superminimal) approximation, additive constants... For all these mathematical aspects, many problems remain unsolved and some conjectures have been made. Finally, we pay particular attention to monotone transformations on data. They play an important role in data analysis and we will present their impact on the afore-mentioned representations.
1 - DATA STRUCTURES

The main basic concept in data analysis is dissimilarity. A dissimilarity $d$ on a finite nonempty set $I$ is a nonnegative real function defined on $I^2$ such that
$$\forall i \in I,\ d(i,i) = 0; \qquad \forall (i,j) \in I^2,\ d(i,j) = d(j,i).$$
The finite-dimensional vector space of real functions which satisfy the same conditions will be denoted by $D$, and a dissimilarity is an element of the positive orthant $D_+$. In practice, $I$ represents individuals, quantitative variables, categories of a qualitative variable, logical or presence-absence characters, etc. Let us assign a mass $m_i$ (i.e. a positive number) to each unit $i$ in $I$. These masses arise essentially in view of approximation problems. The fundamental triple $(I, d, \{m_i, i \in I\})$ will be called a data structure. Here we recall some common data structures.

For a set $I$ of individuals and a set $J$ of quantitative variables, let $x_i^j$ be the observation of the variable $j$ on the individual $i$. Denoting by $\sigma_j$, $\rho_{jj'}$ respectively the standard deviation of the variable $j$ and the correlation coefficient of the variables $j$ and $j'$, and supposing $\sigma_j > 0$ for every $j$, we generally consider the data structures $(I, d_I, \{m_i, i \in I\})$ and $(J, d_J, \{m_j, j \in J\})$ such that
$$\forall (i,i') \in I^2, \quad d_I^2(i,i') = \sum_j \Big(\frac{x_i^j - x_{i'}^j}{\sigma_j}\Big)^2, \qquad \forall i \in I,\ m_i = 1/|I|;$$
$$\forall (j,j') \in J^2, \quad \tfrac{1}{2} d_J^2(j,j') = 1 - \rho_{jj'}, \qquad \forall j \in J,\ m_j = 1.$$

Let $I$ and $J$ be two qualitative variables and $\{f_{ij}, (i,j) \in I \times J\}$ be a frequency table (derived from a contingency table). We use the following notations:
$$\forall i \in I,\ f_{i\cdot} = \sum_j f_{ij}; \qquad \forall j \in J,\ f_{\cdot j} = \sum_i f_{ij}.$$
Supposing that the previous quantities are all strictly positive, we generally consider on $I$ the data structure $(I, d_I, \{f_{i\cdot}, i \in I\})$, where
$$\forall (i,i') \in I^2, \quad d_I^2(i,i') = \sum_j \frac{1}{f_{\cdot j}} \Big(\frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}}\Big)^2 \qquad (\chi^2 \text{ metric}),$$
and a symmetrical data structure $(J, d_J, \{f_{\cdot j}, j \in J\})$ on $J$.

For a set $I$ of individuals and a set $J$ of presence-absence characters, let $x_i^j$ be the observation of the character $j$ on the individual $i$ ($x_i^j$ equals 1 (or 0) for presence (or absence) of $j$). We use the following notations:
$$\forall (i,i') \in I^2,\ m_{ii'} = \sum_j x_i^j x_{i'}^j; \qquad \forall i \in I,\ m_i = \sum_j x_i^j.$$
Then we may consider on $I$ the data structure $(I, d_I, \{m_i, i \in I\})$, where
$$\forall (i,i') \in I^2, \quad \tfrac{1}{2} d_I^2(i,i') = 1 - m_{ii'}/\sqrt{m_i m_{i'}} \qquad (\text{Ochiai's metric}),$$
and a symmetrical data structure $(J, d_J, \{m_j, j \in J\})$ on $J$.
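These formulas are straightforward to compute. The sketch below is my own illustration with a made-up 3×4 frequency table and 0/1 profiles, evaluating the χ² metric between row profiles and Ochiai's dissimilarity:

```python
from math import sqrt

# Hypothetical 3 x 4 frequency table f[i][j] (not from the paper)
f = [[10, 5, 3, 2],
     [4, 8, 6, 2],
     [2, 3, 9, 6]]
total = sum(sum(row) for row in f)
F = [[v / total for v in row] for row in f]                    # relative frequencies
fi = [sum(row) for row in F]                                   # f_{i.}
fj = [sum(F[i][j] for i in range(len(F))) for j in range(4)]   # f_{.j}

def chi2_dist(i, ip):
    """Chi-square metric between the row profiles of i and i'."""
    return sqrt(sum((F[i][j] / fi[i] - F[ip][j] / fi[ip]) ** 2 / fj[j]
                    for j in range(4)))

def ochiai(x, y):
    """d with (1/2) d^2 = 1 - m_xy / sqrt(m_x m_y), for 0/1 profiles."""
    m_xy = sum(a * b for a, b in zip(x, y))
    return sqrt(2.0 * (1.0 - m_xy / sqrt(sum(x) * sum(y))))

print(round(chi2_dist(0, 1), 3), round(ochiai([1, 1, 0, 1], [1, 0, 0, 1]), 3))
```

Both functions vanish on identical profiles and are symmetric, as a dissimilarity must be.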
A dissimilarity $d$ on $I$ is said to be:
proper iff $d(i,j) = 0$ implies $i = j$;
semi-proper iff $d(i,j) = 0$ implies $\forall k \in I,\ d(i,k) = d(j,k)$.

If $d$ is semi-proper, an equivalence is introduced as follows: $i \sim j$ iff $d(i,j) = 0$. Then a proper dissimilarity $\bar d$ may be constructed on the quotient space $\bar I$; we have $\bar d(\bar i, \bar j) = d(i,j)$, where $i$ (resp. $j$) is in $\bar i$ (resp. $\bar j$). Moreover, if masses are assigned to the units $i$ in $I$, we put
$$\forall \bar i \in \bar I,\ m_{\bar i} = \sum \{m_i \mid i \in \bar i\}.$$
Then $(\bar I, \bar d, \{m_{\bar i}, \bar i \in \bar I\})$ is called the induced quotient data structure. It is usual practice to aggregate units which are equivalent. In a mathematical sense, a property of $d$ which is preserved on the quotient space has only to be proved when $d$ is proper.
2 - REPRESENTATIONS IN DATA ANALYSIS

Given a data structure, different graphical representations may be proposed. Each of them corresponds to a particular dissimilarity. Here we recall some usual representations and their associated dissimilarities.

- $L_p$-spaces

A dissimilarity $d$ on $I$ is said to be $L_p$ iff there exist an integer $N$ and a family of real numbers $\{x_i^j,\ j = 1, \ldots, N;\ i \in I\}$ satisfying
$$\forall (i,i') \in I^2, \quad d^p(i,i') = \sum_{j=1}^{N} |x_i^j - x_{i'}^j|^p.$$
It admits an embedding in an $L_p$-space, and the set of such dissimilarities will also be denoted by $L_p$. Two particular cases are noteworthy: Euclidean dissimilarities (i.e. dissimilarities in $L_2$), which yield a Euclidean representation, and the city-block semi-distances (i.e. dissimilarities in $L_1$), which yield an $L_1$-representation.

(Figures: a Euclidean representation and an $L_1$-representation.)

Mathematically, it is useful to consider the sets $L_p^p$, obtained from $L_p$ by the transformation $d \mapsto d^p$. In particular, $L_2^2$ is the set of squared Euclidean dissimilarities. Moreover, every finite metric space is shown to be embeddable in an $L_\infty$-space. According to this property, the set of semi-distances on $I$ will also be denoted by $L_\infty$.
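The $L_\infty$ embedding of a finite metric space is constructive: the classical Fréchet map sends $i$ to the vector of its distances $(d(i,k))_{k \in I}$, and the triangle inequality makes this an isometry into $\ell_\infty$. A quick check, with an arbitrary metric of my own choosing:

```python
# Distance matrix of a small metric space (symmetric, zero diagonal,
# triangle inequality holds)
d = [[0, 2, 3, 4],
     [2, 0, 2, 3],
     [3, 2, 0, 2],
     [4, 3, 2, 0]]
n = len(d)

# Frechet embedding: point i maps to the vector of its distances to all points
coords = [[d[i][k] for k in range(n)] for i in range(n)]

def linf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# For a true metric, the L-infinity distance between images equals d(i, j):
# <= follows from the triangle inequality, = is attained at coordinate k = j
for i in range(n):
    for j in range(n):
        assert linf(coords[i], coords[j]) == d[i][j]
print("isometric")
```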
- Hierarchical representation

A hierarchy on $I$ is a class $\mathcal{H}$ of non-empty subsets satisfying:
(i) $I \in \mathcal{H}$;
(ii) $\forall (H, H') \in \mathcal{H}^2,\ H \cap H' \in \{H, H', \emptyset\}$;
(iii) $\forall H \in \mathcal{H},\ \bigcup\{H' \in \mathcal{H} : H' \subset H,\ H' \neq H\} \in \{H, \emptyset\}$.
If $f : \mathcal{H} \to \mathbb{R}^+$ is such that:
iv)
... $(p + 2)$ and a preordonance on $I$ such that there exist figures satisfying the inequalities of the preordonance only in spaces with dimension $p$ and $(n - 1)$.
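The hierarchy conditions (ii)-(iii) above are mechanical to verify for a candidate family of subsets; a small sketch of my own, on a toy example:

```python
def is_hierarchy(I, classes):
    """Check that I belongs to the family, nestedness (ii), and that each
    class is the union of its proper subclasses or has none (iii)."""
    H = [frozenset(h) for h in classes]
    if frozenset(I) not in H:
        return False
    empty = frozenset()
    for A in H:
        for B in H:
            if (A & B) not in (A, B, empty):          # (ii)
                return False
    for A in H:
        parts = [B for B in H if B < A]               # proper subclasses of A
        union = empty.union(*parts) if parts else empty
        if union not in (A, empty):                   # (iii)
            return False
    return True

I = {1, 2, 3, 4}
good = [{1}, {2}, {3}, {4}, {1, 2}, {3, 4}, {1, 2, 3, 4}]
bad = [{1, 2}, {2, 3}, {1, 2, 3, 4}]                  # {1,2} and {2,3} cross
print(is_hierarchy(I, good), is_hierarchy(I, bad))
```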
We end here our survey. Obviously, many other problems could have been raised in this field. However, we hope to have shown that, even within the context evoked here, a great number of questions remain open. A note on bibliography
: the field is so vast that we have been
obliged to suppress all bibliographical
references.
Bernoulli, Vol. 2, pp. 133-136 Copyright 1987 VNU Science Press
GENERALIZED CANONICAL ANALYSIS
Michel TENENHAUS, Centre HEC-ISA 1, rue de la Libération, 78350 JOUY-EN-JOSAS, France
Introduction

Canonical analysis has been considered for a long time as a method having real theoretical interest, but few practical applications. The situation is changing, as we can see from the new book of R. Gittins (1985): "Canonical Analysis: A Review with Applications in Ecology". Generalized canonical analysis (McKeon, 1965; Carroll, 1968; Kettenring, 1971) includes canonical analysis as a particular case (and, consequently, multiple regression, analysis of variance, discriminant analysis, correspondence analysis, ...) and also principal component analysis and multiple correspondence analysis. This method is also useful for studying a population described by several numerical or nominal variables and evolving in time. Generalized canonical analysis thus represents a remarkable synthesis of multivariate linear methods. We present in this paper generalized canonical analysis from a geometrical point of view, showing its relationship with principal component analysis (Saporta, 1975; Pagès, Cailliez, Escoufier, 1979). This permits many numerical simplifications useful for the writing of a computer program.
I - GENERALIZED CANONICAL ANALYSIS
1. The data
We consider $p$ data tables $X_1, \ldots, X_t, \ldots, X_p$. Each table $X_t$ is formed with $n$ rows corresponding to the same $n$ subjects; the columns represent standardized numerical variables or dummy variables associated with the categories of nominal variables. In other words, the data can be numerical or nominal variables. If the variables are nominal, they are transformed into binary tables.
2. The problem
We look for standardized and uncorrelated variables $z_1, \ldots, z_m$ maximizing the quantity
$$\sum_{h=1}^{m} \sum_{t=1}^{p} R^2(z_h, X_t), \qquad (1)$$
where $R^2(z_h, X_t)$ represents the coefficient of determination between the variable $z_h$ and the table $X_t$.

3. The solution

The centering operator $P_0$ is equal to $I - (1/n) u u'$, where $u$ is a column vector of $n$ ones. We denote by $P_t$ the projection operator onto the subspace $L(X_t)$ of $\mathbb{R}^n$ generated by the columns of the table $X_t$. The coefficient of determination $R^2(z_h, X_t)$ being equal to $(1/n) z_h' P_t z_h$, quantity (1) may be written
$$(1/n) \sum_{h=1}^{m} z_h' P z_h, \qquad (2)$$
where $P = \sum_{t=1}^{p} P_t$. The standardized and uncorrelated variables $z_1, \ldots, z_m$ maximizing (2) are obtained as the eigenvectors of the matrix $P_0 P$ associated with the $m$ largest eigenvalues $\lambda_1, \ldots, \lambda_m$, ranked in decreasing order. Then the maximum of (2) is equal to $\sum_{h=1}^{m} \lambda_h$. In effect it will be enough to diagonalize the matrix $P$. The vector $u$ is an eigenvector of $P$ associated with the eigenvalue $\lambda_0$ equal to the number of tables $X_t$ containing a binary table. Consequently the eigenvectors of the matrix $P_0 P$ are the eigenvectors of the matrix $P$ different from the eigenvector $u$ associated with the eigenvalue $\lambda_0$.
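A numerical sketch of this solution (my own illustration: the tables are random, and pseudo-inverses stand in for the generalized inverses):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
tables = [rng.standard_normal((n, 3)), rng.standard_normal((n, 2))]
tables = [Xt - Xt.mean(axis=0) for Xt in tables]      # centered tables

# P_t = X_t (X_t' X_t)^- X_t' : projector onto the column space of X_t
projectors = [Xt @ np.linalg.pinv(Xt) for Xt in tables]
P = sum(projectors)

u = np.ones((n, 1))
P0 = np.eye(n) - u @ u.T / n                          # centering operator
vals, vecs = np.linalg.eig(P0 @ P)
top = np.argmax(vals.real)
lam1 = vals.real[top]

z1 = vecs[:, top].real
z1 = z1 - z1.mean()
z1 = z1 / np.sqrt((z1 ** 2).mean())                   # standardized: (1/n) z'z = 1

# Sum of R^2(z1, X_t) over the tables equals the top eigenvalue
r2 = sum((z1 @ Pt @ z1) / n for Pt in projectors)
print(round(lam1, 4), round(r2, 4))
```

Each $R^2$ is at most 1, so the eigenvalue is bounded by the number of tables $p$, as the theory requires.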
4. The associated principal component analysis

We denote by X = [X_1, ..., X_t, ..., X_p] the data table obtained by horizontally adjoining the several X_t, by M the block-diagonal matrix formed with the generalized inverses n(X_t'X_t)⁻ of the matrices (1/n)X_t'X_t, and we set N = (1/n)I. Generalized canonical analysis of the tables X_1, ..., X_p is a principal component analysis of the triplet (X, M, N) (Saporta, 1975). The standardized principal components of the triplet (X, M, N) are the canonical components z_1, ..., z_m; they are independent of the chosen generalized inverses n(X_t'X_t)⁻. The duality diagram (Pages, Cailliez, Escoufier, 1979) associated with the triplet (X, M, N) (Figure 1) immediately gives useful relations for numerical calculations, allowing one to go from calculations in R^n to calculations in R^k, where k is the number of columns of X.
Figure 1. Duality diagram associated with the triplet (X, M, N): the spaces E = R^k and F = R^n, linked by X and X', with the metrics M on E and N = (1/n)I on F.
The projection operator P_t onto the subspace L(X_t) may be written P_t = X_t(X_t'X_t)⁻X_t', and so we obtain WN = P. The following relations are useful for writing a generalized canonical analysis computer program: 1) z_h = X ...
... where E_f is the averaging operator with the density f(z). Here the "theoretical" quantity of the projection index is designated by Q_g(U, X), in contrast to its sampling value. We give without proof the inequalities connecting (3) with the ratio t²(U). If the remaining eigenvalues are small enough in comparison with t_2, then by continuity the same two vectors extract the main part of such information. On the other hand, if all the eigenvalues t_i are approximately equal, it is immaterial which vectors are taken for projecting, provided only that they belong to R⁺.

REFERENCES
Friedman, J.H. and Tukey, J.W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput., C-23, 881-889.
Huber, P.J. (1985). Projection pursuit. Ann. Statist., 13, 435-475.
Rao, C.R. (1965). Linear Statistical Inference and its Applications. Wiley, New York.
Yenyukov, I.S. (1986). Methods, algorithms and programs of multivariate statistical analysis (in Russian). Finances and Statistics, Moscow.
THE INVERSE PROBLEM OF THE PRINCIPAL COMPONENT ANALYSIS

Zhanatauov S.U., Computing Center, Novosibirsk, USSR

In data analysis it is usually impossible to connect a single real multivariate sample with one of the theoretical distribution functions and to obtain additional samples from the same population. In this situation one generates on the computer artificial samples which are in some way similar to the real multivariate sample. Let N(x̄, W) be the set of multivariate samples X⁰ = {x_i⁰}, i = 1, ..., m, x_i⁰ = (x_i1, ..., x_in) ∈ Eⁿ, generated from m independent observations over dependent multivariate random values with dependent components and having the given vector of sampling mean values x̄ = (x̄_1, ..., x̄_n), x̄_j = (1/m) Σ_{i=1}^{m} x_ij, and the given sampling covariance matrix W = (1/m) Σ_{i=1}^{m} (x_i - x̄)'(x_i - x̄); let there also be given a set of correlation matrices (c.m.) with a given spectrum Λ = diag(λ_1, ..., λ_n). The functions f_1(Λ) = tr(Λ) = n, f_j(Λ, l) = (Σ_{i=1}^{l} λ_i)/n, ... will be called the main f-parameters of the spectrum Λ of a c.m.; they are stable and reliably calculated statistics. Problem: to obtain a multivariate sample of volume m > n ≥ 2 satisfying one of the following requirements: a) the sample should have a sampling c.m. exactly equal to the c.m. given; b) the sample should have a sampling c.m. whose spectrum is either exactly equal to the spectrum given, or whose main f-parameters are equal to the given values with given accuracy. Here it is everywhere required that the samples have the given vectors of mean values and dispersions.
For solving this problem it is sufficient to obtain on a computer samples with sampling c.m. having one and the same given spectrum, as well as spectra whose main f-parameters equal the given values with given accuracies.

Theorem (Zhanatauov S.U., 1980). Let m > n ≥ 2 and let the elements of the diagonal matrix Λ = diag(λ_1, ..., λ_k, λ_{k+1}, ..., λ_n) satisfy the relations tr(Λ) = λ_1 + ... + λ_n = n, 1 ≤ k ≤ n. Then there exist infinite sets:

a) of orthogonal matrices C^(l), with number l = 1, 2, ...;
b) of correlation matrices R^(l) = C^(l) Λ C^(l)', having the spectrum Λ and eigenvectors located in the columns of the matrix C^(l);
c) of multivariate samples Z^(t,l), with number t = 1, 2, ..., where Y^(t) ∈ N_s(0, Λ), the matrices Λ, C^(l) satisfying all the relations of the direct model of principal components of H. Hotelling (DM PC), and of the Λ-samples of the inverse model of principal components (IM PC), having the properties:

1) with the number l fixed and t = 1, ..., k_t, M = m_1 + ... + m_{k_t}: if Y_{m_t n} ∈ N_s(0, Λ) and Z^(t)_{m_t n} ∈ N_s(0, R), then Y_{Mn} ∈ N_s(0, Λ) and Z_{Mn} = [Z^(1)' : ... : Z^(k_t)']' ∈ N_s(0, R);
2) with the number t fixed and l = 1, ..., k_1, N = k_1·m, the analogous relations hold;
3) with the numbers l = t = 1, ..., k_1 and M = m_1 + ... + m_{k_1}: if Y_{m_t n} ∈ N_s(0, Λ) and Z^(t,l)_{m_t n} ∈ N_s(0, R^(l)), then Y_{Mn} ∈ N_s(0, Λ) and Z_{Mn} = [Z^(1,1)' : ... : Z^(k_1,k_1)']' ∈ N_s(0, Σ_l β_l R^(l)), where β_l = m_l/M, 0 < β_l < 1, Σ_l β_l = 1.
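A numerical sketch of the correlation-matrix construction in the theorem, with a randomly generated orthogonal matrix standing in for the special matrices C^(l) (obtaining an exactly unit diagonal as well is the harder part the theorem addresses):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([2.0, 1.0, 0.6, 0.4])              # prescribed spectrum, tr = n = 4
n = lam.size
C, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
R = C @ np.diag(lam) @ C.T                        # R = C Lambda C' has spectrum lam
spectrum = np.sort(np.linalg.eigvalsh(R))         # recovers the prescribed eigenvalues
```

Any sample drawn with covariance R then has a sampling c.m. whose spectrum approaches Λ as the volume grows.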
For computation of a spectrum with the above properties, algorithms have been developed making use of the following relations. Let n ≥ k ≥ 1, l ≥ 1 be integers and a_i ≥ 1, i = 2, ..., k, be real numbers. Then the elements of the spectrum of the c.m. are uniquely defined: λ_j = f_j/B(k,k), f_1 = n, j = 1, ..., k-1, where B(l,k) = Σ_{i=1}^{l} Π_{j=i+1}^{k} a_j. The f-parameters of the spectrum are of the form f_1(Λ) = B(k,k)·λ_k, f_2(Λ) = D(k,k), f_3(Λ) = B(1,k), f_4(Λ, l) = B(l,k)/B(k,k), f_5(Λ) = B(k)·a_1, f_6(Λ) = f_5·a_1, where D(t,k) = Σ_{i=1}^{t} (Π_{j=i+1}^{k} a_j)² and B(k) = Π_{i=2}^{k} a_i⁻¹. These algorithms compute monotone sequences of values of the f-parameters, obtained after the increments B(i,k) ← a_{i+1}·B(i,k)·y_{i+1}⁻¹ + (y_{i+1} - 1)·B(i,k), D(k,k) ← D(k,k) + (y_{i+1}² - 1)·D(i,k), B(k) ← B(k)·y_{i+1}⁻¹.
With the use of the IM PC there have been developed a nonparametric algorithm for interval estimation of statistics characterizing interrelations between sample variables (including missing values in variables) (Zhanatauov S.U., 1985a,b) and an algorithm for the point estimation of missing values in a multivariate sample (Zhanatauov S.U., 1981). The degree of adequacy of the Λ-samples to a real sample and the accuracy of the estimates of the algorithms presented are practically independent of the law of distribution of the standard population whose samples are transformed into Λ-samples. A multidimensional Gaussian distribution and the uniform distribution in a unit hypercube are used as standard distributions. Λ-samples were used both for comparison and for elucidation of the domains of preferable application of methods of incomplete data analysis. The algorithms for calculation of the spectrum of a c.m. with given algebraic properties, along with procedures of the IM PC and its applications, form part of the program package "Spectrum", which is a package for modeling multivariate samples (adequate to real ones) and for testing data analysis and data processing methods using the IM PC.

REFERENCES
Zhanatauov, S.U. (1980). Technique of computation of the sample with given eigenvalues of its sampling correlation matrix. In: Matematicheskie voprosy analiza dannykh, Drobyshev Ju.P. (Ed.). VC SOAN SSSR, Novosibirsk, pp. 62-76.
Zhanatauov, S.U. (1985a). A nonparametrical algorithm of interval estimations. In: Vses. simp. "Metody i programmnoe obespechenie obrabotki informacii i prikl. stat. analiza dannykh na EVM", BGU, Minsk, pp. 53-54.
Zhanatauov, S.U. (1985b). Determination of confidence intervals for estimates of missing values of a real sample. In: Struktury i analiz dannykh, Drobyshev Ju.P. (Ed.). VC SOAN SSSR, Novosibirsk, pp. 111-122.
Zhanatauov, S.U. (1981). The method of incomplete data analysis. VC SOAN SSSR, Novosibirsk. (Preprint No 257, 15 p.)
DESIGN OF EXPERIMENTS (nearest neighbour designs ...) (Session 5)
Chairman: H.P. Wynn
NUMERICAL METHODS OF OPTIMAL DESIGN CONSTRUCTION

V.V. Fedorov
International Institute for Applied Systems Analysis, Laxenburg, Austria

1. INTRODUCTION

In this paper numerical approaches to the construction of optimal designs are considered for experiments described by the regression model

(1)   y_i = θ'f(x_i) + ε_i,

where f(x) is a given set of basic functions, x ∈ X, and X is compact; at least some of the variables x can be controlled by the experimenter; θ ∈ R^m are the estimated parameters; y_i ∈ R¹ is the i-th observation; and ε_i ∈ R¹ is the random error, E[ε_i] = 0, E[ε_i ε_j] = δ_ij. In practice technically more complicated problems may be faced (for instance, y_i could be a vector, or the errors could be correlated), but usually the methods are straightforward generalizations of those developed for problem (1).

The most elegant theoretical results and algorithms were created for the continuous (or approximate) design problem, in which a design is considered to be a probability measure ξ defined on X, and the information matrix is defined by the integral M(ξ) = ∫ f(x)f'(x) ξ(dx). In this case the optimal design of the experiment turns out to be an optimization problem in the space of probability measures:

(2)   ξ* = arg min_ξ Φ[M(ξ)],   ∫_X ξ(dx) = 1,

where Φ is the objective function defined by the experimenter. The first ideas on the numerical construction of optimal designs can be found in the pioneering works of Box and Hunter (1965) and Sokolov (1963), where some sequential designs were suggested. These procedures can be considered as very particular cases of iterative procedures for optimal design construction, but they nevertheless implicitly contain the idea that one can reach an optimal design by improving intermediate designs, transferring a finite measure to some given point of X at every step of the sequential design.
This idea was developed and clarified by many authors, and the majority of the algorithms presented in this survey (which does not pretend to be a historical one) are based on it.

2. FIRST-ORDER ITERATIVE PROCEDURES

It will be assumed that (a) the functions f(x) are continuous on the compact X; (b) Φ(M) is a convex function; (c) there exists q such that Ξ(q) = {ξ : Φ[M(ξ)] ≤ q < ∞} is nonempty; and (d) for any ξ ∈ Ξ(q) and any other design ξ̄,

(3)   Φ[(1 - α)M(ξ) + αM(ξ̄)] = Φ[M(ξ)] + α ∫_X φ(x, ξ) ξ̄(dx) + o(α).
If these assumptions hold, then the following iterative procedure will converge to an optimal design: ξ_{s+1} = (1 - α_s)ξ_s + α_s ξ(x_s), where ξ(x_s) is the design concentrated at the point x_s maximizing φ(x, ξ_s).

B ∈ convex hull of the orbit of A, for some B.
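The first-order procedure just described can be sketched for the D-criterion Φ = -log det M on a grid (a minimal illustration, not Fedorov's actual program; the model f(x) = (1, x, x²) is an assumed example):

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 101)
F = np.column_stack([np.ones_like(grid), grid, grid**2])   # f(x) on the grid
w = np.full(grid.size, 1.0 / grid.size)                    # initial design xi_0

for s in range(200):
    M = F.T @ (w[:, None] * F)                             # information matrix M(xi_s)
    d = np.einsum("ij,jk,ik->i", F, np.linalg.inv(M), F)   # sensitivity d(x, xi_s)
    j = int(np.argmax(d))                                  # best point x_s
    alpha = 1.0 / (s + 2)                                  # step length alpha_s -> 0
    w = (1 - alpha) * w                                    # xi_{s+1} = (1-a)xi_s + a xi(x_s)
    w[j] += alpha
```

For this model the mass accumulates near the known D-optimal support {-1, 0, 1}; by the equivalence theorem max_x d(x, ξ) decreases towards m = 3 as the design improves.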
The corresponding information-increasing ordering for information matrices will also be denoted by >>. That these information preorderings nicely agree with the various levels of our problem is shown by the following.

Theorem (Giovagnoli et al. (1986)). M >> A  =>  C(M) >> C(A)  =>  Φ(C(M)) ≥ Φ(C(A)), for all invariant Φ.
3.3 Universal optimality vs. simultaneous optimality

The preceding theorem suggests discriminating between the notion of universal optimality, whenever C >> D for all competing D, and that of simultaneous optimality, whenever Φ(C) ≥ Φ(D) for all competing D and for all invariant Φ. Frequently these notions will coincide, according to the following.
Theorem (Giovagnoli et al. (1986)). If the underlying group is compact and the information matrix C is invariant, then C is universally optimal if and only if C is simultaneously optimal.
When the group fails to be compact or the matrix C is not invariant, it seems that the notion of simultaneous optimality is of greater bearing. The following table gives an overview of some known results and open problems.

group               ordering                               invariant functionals
{I_s}               Loewner                                all Φ
Perm(s)             ?                                      ?
Orth(s)             upper weak majorization of             symmetric functions of
                    ordered eigenvalues                    ordered eigenvalues
Unim(s)             ?                                      determinant
reflection groups   ?                                      ?
?                   ?                                      p-means
As an outstanding result we mention that this provides a further justification for the most popular criterion of D-optimality as being the sole invariant information functional under the group of unimodular linear transformations (i. e. those with determinant ±1). On the other hand it would be of interest to study finite reflection groups as they also arise in other aspects of multivariate analysis, or to find a group such that the invariant functionals are determined by the p-means.
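The closing remark can be checked numerically: the determinant criterion is invariant under congruence by a unimodular matrix A, since det(A M A') = det(A)² det(M) = det(M) when |det A| = 1 (a random example under assumed data):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
M = B @ B.T + np.eye(4)                 # a positive definite information matrix
A = rng.standard_normal((4, 4))
A /= abs(np.linalg.det(A)) ** 0.25      # rescale so that |det A| = 1 (unimodular)
lhs = np.linalg.det(A @ M @ A.T)        # det after the transformation
rhs = np.linalg.det(M)                  # original determinant
```

The same check fails for other p-means of the eigenvalues, which is what singles the determinant out.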
4 Quadratic regression; regression over the unit cube

As mentioned above, the model for quadratic regression over the symmetrized unit interval [-1, +1] is

Y(t) = β₀ + β₁t + β₂t² + σe.

A design ξ is invariant under the sign-change group if and only if ξ is symmetric about 0. This reduces the corresponding moment matrices to a two-parameter subset. If we augment this with an improvement in the Loewner ordering, we obtain a reduction to the one-parameter family of symmetric three-point designs, given by mass a/2 at each of the points -1 and +1 and mass 1 - a at the point 0.
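The one-parameter family can be examined directly: for mass a/2 at ±1 and 1 - a at 0, the even moments are μ₂ = μ₄ = a, so det M(a) = a²(1 - a), maximized at a = 2/3 (the grid search below is an illustrative check, not from the paper):

```python
import numpy as np

def M(a):
    # Moment matrix for f(t) = (1, t, t^2): mu_0 = 1, mu_1 = mu_3 = 0, mu_2 = mu_4 = a
    return np.array([[1.0, 0.0, a],
                     [0.0, a,   0.0],
                     [a,   0.0, a]])

a_grid = np.linspace(0.01, 0.99, 981)
dets = np.array([np.linalg.det(M(a)) for a in a_grid])
a_star = float(a_grid[np.argmax(dets)])   # D-optimal endpoint mass, close to 2/3
```

With a = 2/3 this is the familiar D-optimal design putting mass 1/3 at each of -1, 0, +1.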
There is a 1-1 correspondence between such sequences and realizations (with probability one) of stationary binary sequences. Consider the balanced case, so that π₁(0) = 1/2. The set of c(r) or π₁₁(r) (r = 1, ..., p) is a closed convex polytope whose extreme points are given by very special periodic sequences. Moreover, since (1) is a linear programme for fixed c_r, one (or more) of these sequences must be the optimum. Computational methods of Martins de Carvalho and Clark (1983) and of Karakostas and Wynn (1986) give results up to p = 5. Thus theoretically, and to some extent computationally, the design problem is solved. Kiefer and Wynn (1984) contains a full discussion of the k-treatment case.

3.
SAMPLING

The most closely related sampling model is the following. Observe Z_t if X_t = 1 and do not observe Z_t when X_t = 0. Let Γ_s be the covariance matrix of the sample. Then consider various criteria, based on the observed values:

(1) min var(θ̂), where θ̂ is the BLUE of θ;

(2) min E(T̂ - T)², where T = Σ_{t=1}^{N} Z_t and T̂ is the BLUE of T in the sense of prediction;

(3) min var(Z̄_s), where Z̄_s = (1/n) Σ_s Z_t and n ≤ N is the sample size.
Under the conditions of Section 2 we see that (1) and (2) lead to the same criterion asymptotically: maximize the sum of the elements of Γ_s⁻¹. Criterion (3) leads to minimizing the sum of the elements of Γ_s. We can compare these criteria with that from Section 2, which with obvious notation can be re-expressed as minimizing the sum of the elements of

(1)   Γ_{s̄,s̄} - Γ_{s̄,s} Γ_{s,s}⁻¹ Γ_{s,s̄},

where s̄ is the complement of the sample. A final class of criteria is obtained in the pure prediction problem when θ = 0. In this case we may be interested again in E(T̂ - T)², which is the sum of the elements of

(2)   Γ_{s̄|s} = Γ_{s̄,s̄} - Γ_{s̄,s} Γ_{s,s}⁻¹ Γ_{s,s̄},

the conditional covariance matrix of the unsampled Z_t given the sample. This quantity is asymptotically independent of the choice of the {X_t}. A criterion under investigation is to minimize det(Γ_{s̄|s}) (for fixed n), which is equivalent to maximizing det(Γ_s). We have referred elsewhere to this criterion as "maximum entropy sampling" and there are analogies with statistical mechanics. It is clear from consideration of the asymptotic form of Γ⁻¹ that this criterion again depends only on the structure of the {X_t} process up to lag p.
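A brute-force sketch of maximum entropy sampling for a small AR(1)-type covariance (the numbers N = 8, n = 3, ρ = 0.6 are illustrative assumptions, not from the paper):

```python
import numpy as np
from itertools import combinations

N, n, rho = 8, 3, 0.6
idx = np.arange(N)
Gamma = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1) covariance, lag decay rho

best_S, best_det = None, -np.inf
for S in combinations(range(N), n):                  # all samples of size n
    dS = np.linalg.det(Gamma[np.ix_(S, S)])          # det(Gamma_s) for this sample
    if dS > best_det:
        best_S, best_det = S, dS
```

The maximizing sample spreads the points out, which pushes the sampled values towards independence, as the entropy interpretation suggests.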
REFERENCES
Kiefer, J. (1960). Optimum experimental designs V, with applications to systematic and rotatable designs. Proc. Fourth Berk. Symp., 2, 381-405.
Kiefer, J. and Wynn, H.P. (1983). Autocorrelation-robust design of experiments. In: Scientific Inference, Data Analysis and Robustness. Academic Press, New York.
Kiefer, J. and Wynn, H.P. (1984). Optimum and minimax exact treatment designs for one-dimensional autoregressive error processes. Ann. Statist., 431-450.
Karakostas, K. and Wynn, H.P. (1986). Optimum systematic sampling for autocorrelated superpopulations. J. Stat. Plan. Inf., submitted.
Martins de Carvalho, J.L. and Clark, J.M.-C. (1983). Characterising the autocorrelation of binary sequences. IEEE Trans. Inf. Th., 24, 502-508.
THE DESIGN OF EXPERIMENTS FOR MODEL SELECTION

A.M. Herzberg
Department of Mathematics, Imperial College of Science & Technology, London, U.K.

A.V. Tsukanov
Sevastopol Instrument Making Institute, Sevastopol, U.S.S.R.

1. INTRODUCTION
The problem of the selection of the optimal model has a large literature. For example, Mallows (1973), Akaike (1974), Woodroofe (1982), Shibata (1980, 1981), Allen (1974) and Vapnik (1982) were concerned with techniques for selecting an appropriate model to fit a given set of data; Andrews (1971) and Atkinson and Cox (1974) were concerned with the design of experiments for model selection. Herzberg and Tsukanov (1985a) discussed the design of experiments for linear model selection with the jackknife criterion. In other papers, Herzberg and Tsukanov (1985b, 1986) gave a Monte-Carlo comparison of the C_p criterion with that of the jackknife under different measures, and considered modifications of the usual jackknife procedure to include the possibility of the removal of different numbers of observations at a time, and selected observations. In this paper, further considerations for the optimal design in the selection of models will be presented.

2.
THE CRITERION FOR THE SELECTION OF MODELS
Let the true functional relationship be represented by

(1)   y_i = η(x_i) + e_i   (i = 1, ..., N),

where y_i is the ith observation of the dependent variable at the k-dimensional design point x_i, the independent and controlled variable, η(·) is the true but unknown function (model), and e_i is an independent random variable with mean 0 and constant variance σ². For ease in presentation and without loss of generality, suppose that the problem is to choose one of two models
(2)   η_j(x, α_j)   (j = 1, 2),
where α_j is a vector of unknown parameters to be determined by least squares. Consider as a measure for the adequacy of the jth model the jackknife criterion

(3)   TJK_j = (1/N) Σ_{i=1}^{N} [y_i - η_j(x_i, α_j(-i))]²   (j = 1, 2),

where α_j(-i) is the least squares estimator of α_j determined from the N-1 points consisting of all the design points except x_i. The model η_1(·) will be preferred to η_2(·) if TJK_1 < TJK_2.
This and other related criteria and measures for the discrimination among two or more models are given and elaborated on in the papers by Herzberg and Tsukanov.

Mallows (1973) suggested choosing the model for which

(4)   C_p = C_j/σ̂² - N
is a minimum, where C_j = RSS_j + ℓ p_j σ̂² and σ̂² is an estimate of σ². Further, RSS_j is the residual sum of squares and p_j is the number of unknown parameters for the jth model. The constant ℓ may be changed; Mallows set ℓ = 2. In particular, Herzberg and Tsukanov (1986) considered modifications of the usual jackknife procedure to include the possibility of the removal of different numbers of observations at a time, and selected observations.
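The two criteria can be sketched together; the data-generating choices below (exact linear data, σ² = 1, ℓ = 2) are illustrative assumptions, not the paper's simulation:

```python
import numpy as np

def tjk(X, y):
    """Jackknife criterion (3): mean leave-one-out squared prediction error."""
    N = len(y)
    total = 0.0
    for i in range(N):
        keep = np.arange(N) != i
        a_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # alpha_j(-i)
        total += (y[i] - X[i] @ a_i) ** 2
    return total / N

def mallows_cp(X, y, sigma2, ell=2.0):
    """Mallows' criterion (4): C_p = (RSS_j + ell * p_j * sigma2)/sigma2 - N."""
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ a) ** 2))
    return (rss + ell * X.shape[1] * sigma2) / sigma2 - len(y)

x = np.linspace(-1.0, 1.0, 12)
y0 = x                                               # noiseless data from eta_1 = x
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])
```

On noiseless linear data every leave-one-out fit is exact, so TJK vanishes, while C_p reduces to ℓp_j - N.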
3. THE DESIGN OF EXPERIMENTS FOR MODEL SELECTION
In order to determine the optimal design for model selection, the following method is used:
(i) a function r_ij specifying the goodness of decisions is determined, where r_ij is the price of the selection of a model from the set S_i when the true model is in the set S_j;
(ii) the function of average risk is obtained, i.e. R = Σ_{i,j} r_ij p_ij, where p_ij is the probability of the selection of a model from the set S_i when the true model is in the set S_j;
(iii) the p_ij are varied according to the criteria and the design used.

The value of R depends on the vector of unknown parameters, α.
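The average risk of step (ii) is a simple weighted sum; with hypothetical prices r_ij and selection probabilities p_ij (illustrative numbers only):

```python
import numpy as np

r = np.array([[0.0, 1.0],      # r_ij: price of selecting a model from S_i
              [5.0, 0.0]])     # when the true model lies in S_j
p = np.array([[0.64, 0.06],    # p_ij: selection probabilities, summing to 1
              [0.16, 0.14]])
R = float(np.sum(r * p))       # average risk R = sum_ij r_ij p_ij
```

Asymmetric prices (here 5 vs 1) encode that one kind of wrong selection is costlier than the other.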
In this case, a minimax or Bayesian approach may be used. It is always possible to transform the response function in such a way that

(5)   a < η(x) < b   and   c < x_i < d   (i = 1, ..., k).

Consequently, the vector of unknown parameters α is restricted to a finite region Ω. The function R can be investigated further for Ω and fixed variance of the error, σ².
4. A MONTE-CARLO EXAMPLE
Consider one-dimensional polynomial models. In order to compare the criteria and the designs, it is necessary to choose a set of test models. The restrictions of (5) are used with a = c = -1 and b = d = 1. It is possible to use a Tchebycheff system of orthogonal polynomials as a network of models which approximates the behaviour of the risk in the region Ω. One set of such polynomials is

(6)   η_1 = x,  η_2 = -1 + 2x²,  η_3 = 3x - 4x³,  η_4 = 1 - 8x² + 8x⁴,  η_5 = -5x + 20x³ - 16x⁵.
Consider the following two designs with 12 points:
X_1: 12 equally spaced points ±1, ±0.82, ±0.64, ±0.45, ±0.27, ±0.09;
X_2: the D-optimal design for a polynomial of degree five, i.e. 12 points, two replicates at each of ±1, ±0.77, ±0.29.
Table 1 gives the results of a computer simulation for the two designs. Observations were generated from the models given in (6) with normally distributed errors with zero mean and variance σ² = 1. The table gives the number of correct decisions out of 500 simulations. The computing was done on the computer complex CDC Cyber 174 and CDC 6500 of the Imperial College of Science and Technology, London.
Table 1: Frequency of correct decisions for the C_p and TJK criteria, for X_1 and X_2 and σ² = 1

                           true model
Design   Criterion     1     2     3     4     5
X_1      C_p         325   282   267   271   334
         TJK         330   295   248   180   113
X_2      C_p         328   300   259   286   395
         TJK         352   311   255   270   330
TJK If the m a t r i x of the r . . ' s ij
is k n o w n , t h e n the f u n c t i o n R c a n be
a p p r o x i m a t e d a n d the c h o i c e o f the d e s i g n a n d c r i t e r i o n m a d e
to-
gether . REFERENCES A k a i k e , H. ( 1 9 7 4 ) . A n e w look at s t a t i s t i c a l m o d e l i d e n t i f i c a t i o n . I E E E T r a n s . A u t o m a t i c C o n t r o l , 19, 7 1 6 - 7 2 3 . Allen, D.M. (1974). The relationship between variable selection a n d d a t a a u g m e n t a t i o n a n d a m e t h o d for p r e d i c t i o n . Technometrics, 16, 125-127. Andrews, D.F. (1971). S e q u e n t i a l l y d e s i g n e d e x p e r i m e n t s for s c r e e n ing o u t b a d m o d e l s w i t h F - t e s t s . B i o m e t r i k a , 58, 4 2 7 - 4 3 2 . Atkinson, A.C. and Cox, D.R. (1974). P l a n n i n g e x p e r i m e n t s for d i s J . R . S t a t i s t . Soc. criminating between models (with discussion). B36, 321-348. Herzberg, A.M. and Tsukanov, A.V. (1985a). T h e d e s i g n of e x p e r i m e n t s for l i n e a r m o d e l s e l e c t i o n w i t h the j a c k k n i f e c r i t e r i o n . U t i l i t a s M a t h e m a t i c a , 28, 2.43-253. Herzberg, A.M. and Tsukanov, A.V. (1985b). The Monte Carlo compari s o n of two c r i t e r i a for the s e l e c t i o n of m o d e l s . J. S t a t i s t . C o m p u t . S i m u l . , 22, 1 1 3 - 1 2 6 . H e r z b e r g , A . M . a n d T s u k a n o v , A . V . ( 1 9 8 6 ) . A n o t e o n m o d i f i c a t i o n s of the j a c k k n i f e c r i t e r i o n for m o d e l s e l e c t i o n . Utilitas Mathematica, 29, 2 0 9 - 2 1 6 . M a l l o w s , C.L. (1973). S o m e c o m m e n t s o n C . T e c h n o m e t r i c s , 15, P 661-675. S h i b a t a , R. ( 1 9 8 0 ) . A s y m p t o t i c a l l y e f f i c i e n t s e l e c t i o n of the o r d e r o f the m o d e l for e s t i m a t i n g p a r a m e t e r s o f a l i n e a r p r o c e s s . A n n . S t a t i s t . 8, 1 4 7 - 1 6 4 . S h i b a t a , R. ( 1 9 8 1 ) . A n optimal selection of regression v a r i a b l e s . B i o m e t r i k a , 68, 4 5 - 5 4 . V a p n i k , V. ( 1 9 8 2 ) . T r a n s l a t e d b y S. K o t z . Estimation of
Dependences Based on Empirical Data. Woodruffe, M. (1982). On model A n n . S t a t i s t . 10, 1 1 8 2 - 1 1 9 4 .
Springer-Verlag, New York. s e l e c t i o n a n d the arc sine laws.
THE DESIGN AND ANALYSIS OF FIELD TRIALS IN THE PRESENCE OF FERTILITY EFFECTS
C. Jennison School of Mathematical Sciences University of Bath BATH BA2 7AY United Kingdom
The recent interest in analyses of field trials incorporating adjustments for variations in fertility or other systematic effects can be traced back to the work of Papadakis (1937), who demonstrated how conventional treatment estimates can be improved by performing a second analysis using, for each plot, the average of the residuals of its neighbours as a covariate. Over thirty years later Atkinson (1969) investigated this unconventional use of a function of the response variables as a covariate and showed the resulting treatment estimates to be close to those obtained by fitting the first-order autoregressive model of Williams (1952). The use of spatial models for field experiments has since developed in its own right, with major contributions from the work of Besag (1974, 1977) and Bartlett (1978). In general, spatial models define a covariance structure for the observations, possibly involving variance ratios to be estimated from the data, and both treatment estimates and estimates of standard error are obtained by the usual methods for general linear models. A convincing model for one-dimensional layouts has recently been developed by Besag and Kempton (1986) and consists of a fertility process with independent first differences plus superimposed independent error for each plot; Williams (1986) proposes a similar model based on the relationship between correlation and inter-plot distance determined by Patterson and Hunter (1983) for a set of 166 cereal variety trials. The approach of Wilkinson, Eckert, Hancock and Mayo (1983) is more in the spirit of Papadakis, although "adjustment" is by the yields rather than the residuals of neighbouring plots. These authors propose a smooth trend plus independent error model

Y = Dτ + ξ + η,
where Y is the vector of yields, D the design matrix, τ the vector of treatment effects, ξ represents a trend term which varies smoothly within columns, and η denotes independent errors. Let plots be indexed along columns and suppose ξ is locally approximately linear within each column, so that estimating equations are formed from the adjusted yields Ỹ_i = Y_i - (1/2)(Y_{i-1} + Y_{i+1}), thereby removing almost completely the effect of trend. In the "least squares smoothing" method of Green, Jennison and Seheult (1985) this same model is fitted by the penalty function approach well known in nonparametric regression. Values of τ, ξ and η are found by minimizing the penalty function ||Y - Dτ - ξ||² + λ||Aξ||², where Aξ is the vector of second differences of ξ and λ a tuning constant controlling the degree of smoothness of the fitted ξ. An appropriate value of λ must be chosen, either by inspection of the fitted ξ and η or by an automatic method such as cross-validation - see Green (1985). A full decomposition of Y is obtained and the fitted trend and residuals η̂_i can be inspected for features of interest. Note that minimizing the above penalty function is equivalent to solving the pair of simultaneous equations

τ̂ = (D'D)⁻¹D'(Y - ξ̂),   ξ̂ = S(Y - Dτ̂),   where S = (I + λA'A)⁻¹.

Thus, for given τ̂, ξ̂ is obtained by applying the smoothing matrix S to Y - Dτ̂, and for given ξ̂, τ̂ is the ordinary least squares estimate based on the adjusted yields Y - ξ̂. The form of these equations suggests extensions in which τ is estimated robustly from Y - ξ and ξ is obtained by applying a robust, nonlinear smoother to Y - Dτ, a solution being found by iterating between the two equations. In a recent paper Papadakis (1984) describes modifications to his original method to deal with single abnormal observations and apparent discontinuities in fertility attributable to, say, changes in soil type or drainage pattern; both these problems can be handled by the simultaneous equation approach, using a treatment estimate which downweights extreme values and a smoother that recognises jumps in fertility and does not smooth across them. The use of blocks in the design and analysis of field trials deserves some comment.
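A minimal sketch of iterating between the pair of simultaneous equations (the synthetic column of 30 plots, 3 cycling treatments and sinusoidal trend are illustrative assumptions, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 30, 10.0
D = np.zeros((n, 3))
D[np.arange(n), np.arange(n) % 3] = 1.0        # 3 treatments cycling down the column
tau_true = np.array([0.0, 1.0, 2.0])
trend = np.sin(np.linspace(0.0, np.pi, n))     # smooth fertility trend xi
Y = D @ tau_true + trend + 0.05 * rng.standard_normal(n)

A = np.zeros((n - 2, n))                       # second-difference matrix
for i in range(n - 2):
    A[i, i:i + 3] = [1.0, -2.0, 1.0]
S = np.linalg.inv(np.eye(n) + lam * A.T @ A)   # smoothing matrix S = (I + lam A'A)^-1

xi = np.zeros(n)
for _ in range(100):                           # iterate between the two equations
    tau = np.linalg.lstsq(D, Y - xi, rcond=None)[0]  # least squares on adjusted yields
    xi = S @ (Y - D @ tau)                            # smooth the treatment-corrected yields
```

Only treatment contrasts are identified (the trend can absorb a constant), and they are recovered to roughly the noise level.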
Patterson and Hunter (1983) discuss incomplete block designs for cereal variety trials and demonstrate the substantial reduction in variance of treatment estimates relative to complete block designs (typically 35% for large blocks), but they show that the further improvements obtained by fitting a full spatial model are rather small (5 or 10%). The blocks in these designs are physically contiguous and divisions between blocks have no direct physical meaning; rather, the fitting of block effects allows a step function approximation to a smooth trend. The inclusion of such artificial blocks is unnecessary in other methods of analysis (although they do appear in the model of Williams (1986), ostensibly as a means of curtailing long range correlations). Blocking by real physical criteria
is of course desirable and a method which can detect from the data where blocks should be introduced is most useful; there is greater scope for detection of "regions" when the experimental layout is two-dimensional and in this case there are interesting parallels with the identification of objects and distinct areas in image analysis. As the above discussion illustrates, recent research has led to a variety of analyses and the experimenter may be faced with a bewildering choice. Fortunately, different methods usually give very similar estimates of treatment effects - most give the least squares estimate for some assumed correlation structure of the observations and changes in this assumed structure tend to affect the estimates only slightly. Rather than find fault with particular methods we should recognise the potential of a selection of tools for data analysis: model based methods which provide both treatment estimates and estimates of standard errors, as long as we accept the assumptions of the model, and more exploratory methods which allow a full investigation of the data and have greater flexibility to adapt to features of the data as they are discovered. I would now like to turn to the problem of design. Firstly, it should be pointed out that special designs are in no way essential for a "neighbour" analysis, in fact, these methods can be used to retrieve a satisfactory analysis from a poorly designed experiment; for example, fitting an appropriate spatial model can remove the bias that would otherwise be introduced by an improperly randomized or even a completely systematic design. Good design will of course improve efficiency and several recent papers discuss optimal designs for correlated observations, see for example Gill and Shukla (1985) and references therein. The general conclusion is that designs should be balanced, i.e. 
treatments should be neighbours and possibly also second neighbours of each other an equal number of times but no treatment should appear next to itself. One aspect of the theory of optimal design that may need to adapt to new methods of analysis is the role of blocks. As mentioned previously, artificial blocking is no longer necessary for analysis and experiments with a small number of very large blocks may become more common: a typical variety trial can consist of three replicates of 50 varieties so if the replicates were physically separate we would have just three blocks of size 50. To assess the importance of optimal design I performed calculations for an example with four replicates of 20 varieties, comparing the average standard error of treatment differences from a second order balanced design and a design with treatments allocated randomly within each replicate. Using a variety of autoregressive and moving average processes for the true model and both correct and slightly incorrect models in the analysis I found the balanced design to be always superior but often only marginally and at most by 1 or 2%. Other factors may be of greater practical importance. When correlations are high it is noticeable that the variance of treatment estimates for treatments appearing on end plots is considerably higher than average. The suggestion by Wilkinson et al
(1983) of adding extra plots at the end of each column in order to give an "adjusted" yield for each internal plot has led to some confusion: clearly these plots are in many ways no different from other plots and they must certainly be counted when discussing efficiency, but such additional plots could be used to ensure that no single treatment estimate is too variable. Alternatively, one or more treatments may be given an extra replicate in return for appearing on several end plots, thereby equalising as nearly as possible the variances of treatment estimates. To conclude, there is presently a great deal of practical and theoretical interest in the analysis and design of field trials and we are seeing an influx of ideas from many different areas of statistics. Future work offers an exciting prospect as ideas not traditionally associated with field trials are developed and the areas of application are extended to the whole range of agricultural experiments.

ACKNOWLEDGEMENTS

My own work in this area has been in collaboration with Peter Green and Allan Seheult. I am particularly grateful to Julian Besag for stimulating my interest in this topic.

REFERENCES

Atkinson, A.C. (1969) The use of residuals as a concomitant variable. Biometrika, 56, 33-41.
Bartlett, M.S. (1978) Nearest neighbour models in the analysis of field experiments (with Discussion). J.R.Statist.Soc., B, 40, 147-174.
Besag, J.E. (1974) Spatial interaction and the statistical analysis of lattice systems (with Discussion). J.R.Statist.Soc., B, 36, 192-236.
Besag, J.E. (1977) Errors-in-variables estimation for Gaussian lattice schemes. J.R.Statist.Soc., B, 39, 73-78.
Besag, J.E. and Kempton, R.A. (1986) Statistical analysis of field experiments using neighbouring plots. Biometrics, 42, 231-251.
Gill, P.S. and Shukla, G.K. (1985) Efficiency of nearest neighbour balanced block designs for correlated observations. Biometrika, 72, 539-544.
Green, P.J. (1985) Linear models for field trials, smoothing and cross-validation. Biometrika, 72, 527-537.
Green, P.J., Jennison, C. and Seheult, A.H. (1985) Analysis of field experiments by least squares smoothing. J.R.Statist.Soc., B, 47, 299-315.
Papadakis, J.S. (1937) Méthode statistique pour des expériences sur champ. Bull. Inst. Amél. Plantes à Salonique, 23.
Papadakis, J.S. (1984) Advances in the analysis of field experiments. Πρακτικά της Ακαδημίας Αθηνών (Proc. Acad. Athens), 59, 326-342.
Patterson, H.D. and Hunter, E.A. (1983) The efficiency of incomplete block designs in National List and Recommended List cereal variety trials. J.Agric.Sci., 101, 427-433.
Wilkinson, G.N., Eckert, S.R., Hancock, T.W. and Mayo, O. (1983) Nearest neighbour (NN) analysis of field experiments (with Discussion). J.R.Statist.Soc., B, 45, 151-211.
Williams, E.R. (1986) A neighbour model for field experiments. Biometrika, 73, 279-287.
Williams, R.M. (1952) Experimental designs for serially correlated observations. Biometrika, 39, 151-167.
ON THE EXISTENCE OF MULTIFACTOR DESIGNS WITH GIVEN MARGINALS
Krafft, Olaf
Institut für Statistik der RWTH Aachen, Federal Republic of Germany

Consider the following row-and-column design with p = 3 treatment levels (index k), m = 3 row-factor levels (index i) and n = 3 column-factor levels (index j):
[A 3 x 3 layout of treatment labels was displayed here, together with the corresponding row-frequency matrix R1 = (r_ik) and column-frequency matrix C1 = (c_jk); the entries could not be recovered from the scan.]
When performing an analysis of variance for such a design, it turns out that only the row-frequencies r_ik and column-frequencies c_jk of the treatment levels are relevant; call these matrices R1 = (r_ik) and C1 = (c_jk).
E.g. for investigations on design-optimality one is thus inclined to take these matrices as a basis. But changing R1 and C1 only slightly into matrices R2 and C2 [whose entries could not be recovered from the scan],
one easily sees that a design corresponding to R2 and C2 does not exist. Hence we have the problem: given R = (r_ik) in N^(m x p) and C = (c_jk) in N^(n x p), which conditions on R and C guarantee the existence of a design corresponding to R and C? Using indicators x_ijk (x_ijk = 1 iff treatment level k is combined with row and column levels i, j), this problem can formally be stated as that of finding conditions for consistency of the system

    sum_{j=1}^{n} x_ijk = r_ik ,   (i,k) in M x P
    sum_{i=1}^{m} x_ijk = c_jk ,   (j,k) in N x P          (I)
    sum_{k=1}^{p} x_ijk = 1 ,      (i,j) in M x N
    x_ijk in {0,1} ,               (i,j,k) in M x N x P ,

where M = {1,...,m}, N = {1,...,n}, P = {1,...,p}.
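For small m, n and p the consistency of system (I) can be checked directly by exhaustive search. The following sketch (not from the paper, and far too slow beyond toy sizes) enumerates all p**(m*n) assignments of a treatment to each cell:

```python
from itertools import product

def design_exists(R, C):
    """Brute-force check of system (I): is there an assignment of one treatment
    k to every cell (i, j) whose row marginals equal R (m x p) and whose column
    marginals equal C (n x p)?  Enumerates all p**(m*n) assignments."""
    m, p = len(R), len(R[0])
    n = len(C)
    for cells in product(range(p), repeat=m * n):
        r = [[0] * p for _ in range(m)]
        c = [[0] * p for _ in range(n)]
        for idx, k in enumerate(cells):
            i, j = divmod(idx, n)   # cell (i, j) receives treatment k
            r[i][k] += 1
            c[j][k] += 1
        if r == R and c == C:
            return True
    return False
```

For instance, the all-ones marginals of a 3 x 3 Latin square are feasible, while marginals forcing every row to carry a single treatment and every column to carry a single (different) treatment are not.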
In case p = 2 the solution is known as the Gale-Ryser theorem: let r*_j = |{ i : r_i1 >= j }| be the conjugate of the first column of R, and let c~ be obtained from (c_j1) by arrangement in non-ascending order. Then (I) is consistent iff the vector c~ is majorized by the vector r* in the Schur ordering.
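The Gale-Ryser condition is straightforward to implement. A sketch (assuming the standard formulation for 0-1 matrices with given row and column sums; the helper names are ours):

```python
def conjugate(r):
    """Conjugate partition: r_star[j-1] = #{ i : r_i >= j }."""
    return [sum(1 for ri in r if ri >= j) for j in range(1, max(r) + 1)] if r else []

def gale_ryser(r, c):
    """Gale-Ryser test: a 0-1 matrix with row sums r and column sums c exists
    iff sum(r) == sum(c) and the non-increasing rearrangement of c is majorized
    by the conjugate of r (its partial sums never exceed those of the conjugate)."""
    if sum(r) != sum(c):
        return False
    c = sorted(c, reverse=True)
    r_star = conjugate(r)
    L = max(len(c), len(r_star))
    c = c + [0] * (L - len(c))              # pad for the partial-sum comparison
    r_star = r_star + [0] * (L - len(r_star))
    pc = pr = 0
    for cj, rj in zip(c, r_star):
        pc += cj
        pr += rj
        if pc > pr:
            return False
    return True
```

For example, row sums (2, 1) with column sums (1, 1, 1) are feasible, whereas row sums (2, 0) with column sums (2, 0) are not.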
A generalization to the case p >= 3 is unknown: neither an adequate generalization of the majorization ordering to matrices R and C has been found, nor could the known proofs of the Gale-Ryser theorem (the algorithm for explicit construction (Gale-Ryser), the application of Hall's theorem on systems of distinct representatives (Higgins)) be carried over. We have the following conjecture: (I) is consistent iff

    (1)  sum_{k=1}^{p} r_ik = n ,  1 <= i <= m , ...

[the remaining conditions of the conjecture are lost in the scan].
Let us consider a classical mixed ANOVA model

    X = Phi*beta + sum_{i=1}^{k} U_i a_i + e ,

where X is the (n x 1) vector of measurements, Phi and the U_i are (n x q) and (n x q_i) matrices of known constants, beta is an unknown parameter, the a_i are independent (q_i x 1) normally distributed random vectors, a_i ~ N(0, theta_i I_{q_i}), e ~ N(0, theta_0 I_n), and I_n is an identity matrix. It is clear that

    V = Cov X = theta_0 I_n + sum_{i=1}^{k} theta_i U_i U_i' ;

the parameters theta_0, theta_1, ..., theta_k are called variance components.
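For the balanced one-way special case of such a model (one random factor, equal group sizes), the classical ANOVA method-of-moments estimators of the two variance components can be sketched as follows; the simulation setup is illustrative only and not from the paper:

```python
import random

def oneway_vc(y):
    """ANOVA (method-of-moments) estimators of the variance components in the
    balanced one-way mixed model y_ij = mu + a_i + e_ij, with group effects
    a_i ~ N(0, s2a) and errors e_ij ~ N(0, s2e); y is a list of equal-size groups."""
    I, J = len(y), len(y[0])
    gmeans = [sum(g) / J for g in y]
    gmean = sum(gmeans) / I
    msa = J * sum((m - gmean) ** 2 for m in gmeans) / (I - 1)        # between-group mean square
    mse = sum((x - m) ** 2 for g, m in zip(y, gmeans) for x in g) / (I * (J - 1))
    return mse, max((msa - mse) / J, 0.0)                            # (s2e_hat, s2a_hat)

# illustrative simulation: mu = 0, s2a = 4, s2e = 1
random.seed(0)
I, J = 200, 5
y = []
for _ in range(I):
    a = random.gauss(0, 2.0)                     # group effect, variance 4
    y.append([a + random.gauss(0, 1.0) for _ in range(J)])
s2e_hat, s2a_hat = oneway_vc(y)
```

With 200 groups of size 5 the estimates land close to the true values (1 and 4); the truncation at zero reflects the fact that msa - mse can be negative in small samples.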
The pioneer work where ANOVA methods were applied to testing hypotheses about variance components for a balanced model was Fisher (1918); later R. Fisher devoted some attention to those models in his famous book (1925). Important contributions to this theory were made later by F. Yates, A. Wald, C. Eisenhart, H. Scheffé, S. R. Searle, C. R. Rao and T. W. Anderson, among many others. Modern works on this subject may be classified into two main streams. In the first one (see Rao and Kleffe (1980)) invariant unbiased quadratic estimates (MINQUE and others) were examined. The second stream deals with maximum likelihood estimates (MLE)
for the distribution of the observations. It is assumed that the sequence of designs is such that the limits required in condition (2) exist and some other conditions similar to those of Miller (1977) are fulfilled. [The displays (2)-(5), an asymptotic expansion of the normalized estimates together with associated positive-definiteness conditions, could not be recovered from the scan.] The condition (4) (respectively, (5)) is suitable for deriving the asymptotic normality of Q (of the MLE). The method of deriving (3) is standard: the first term of (3) is a principal part of the first term of the Taylor expansion, and the second term is a principal part of the second degree term. The components of the limiting vector with indices at most k and greater than k are uncorrelated because of the symmetry of the Gaussian distribution. Uniform convergence of the residual can be proved by the methods of Maljutov (1983).

Our second aim is to investigate designs optimizing some function of the matrix in (3), which is simultaneously the normalized covariance matrix of the MLE or of Q. We begin with the simplest one-way mixed model

    y_ij = mu + a_i + e_ij ,  j = 1, ..., x_i ,  i = 1, ..., I ,

where mu is a common mean and the random effects a_i and errors e_ij are independent. [The matrix notation and the limiting likelihood equations displayed here in the original could not be recovered from the scan.] We consider the asymptotics of the MLEs as the group sizes grow. Thus when the group sizes are comparable (e.g. when x_i = const) we get that the limiting distribution of the vector of normalized estimates has a covariance matrix which does not depend on the mutual relationship between x_1, ..., x_I as I tends to infinity. Designs optimizing a convex differentiable function by choosing x_1, ..., x_I may easily be found by standard methods known for homoscedastic independent measurements. If the normalized group sizes converge to a limit D > 0, then the covariance matrix of the limiting distribution of the vector (sqrt(I)(theta_hat_0 - theta_0), ...) can be written down similarly. Generalization to the general case of the mean Phi*beta is obvious. A survey of non-asymptotic designs for estimating multi-way variance components is in Anderson (1975).

REFERENCES

Anderson, R.L. (1975). Designs and estimators for variance components. In: A Survey of Statistical Design and Linear Models, Srivastava, J.N. (Ed.). North-Holland Publ. Co., pp. 1-29.
Fisher, R.A. (1918). The correlation between relatives... Trans. R. Soc. Edinburgh, v. 52, 399-433.
Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver & Boyd, London.
Goldstein, H. (1986). Multilevel mixed linear models analysis using iterative generalized least squares. Biometrika, v. 73, N 1, 43-56.
Hartley, H.O. and Rao, J.N.K. (1967). MLE for the mixed ANOVA model. Biometrika, v. 54, 93-108.
Ibragimov, I.A. and Khasminsky, R.Z. (1981). Asymptotic Theory of Estimation. Springer, N.Y.
Luanchi, M. (1983). Asymptotic Investigation of Iterative Estimates (thesis). Moscow Lomonosov University, Depart. of Math. and Mech.
Maljutov, M.B. (1983). Lower bounds for average sample size... Izv. vusov, Matematika, N 11, 19-41.
Maljutov, M.B. and Luanchi, M. (1985). Iterative quadratic estimates of mixed ANOVA models. Abstracts of III-th conference "Application of multivariate analysis...", part II, Tartu, pp. 49-51.
Miller, J.J. (1977). Asymptotics for MLE's in the mixed ANOVA model. Ann. Statist., v. 5, 746-762.
ASYMPTOTIC METHODS IN STATISTICS
(second order asymptotics, saddle point methods, etc.) (Session 6)
Chairman: W. van Zwet
DIFFERENTIAL GEOMETRICAL METHOD IN ASYMPTOTICS OF STATISTICAL INFERENCE
Shun-ichi AMARI, University of Tokyo, Tokyo 113, Japan

Abstract. Differential geometry provides a new powerful method of asymptotics of statistical inference. Geometrical concepts are explained intuitively in the framework of a curved exponential family without technical details. We show some fundamental results of higher-order asymptotics of estimation and testing obtained by the geometrical method. Further prospects of the geometrical method are given.

Why geometry?
A typical statistical problem is to make some inference on the underlying probability distribution p(x) based on N independent observations x_1, ..., x_N therefrom. In many cases, statisticians do not directly treat the function space F = { p(x) } of all the possible distributions, but presume a parametric statistical model M = { p(x, u) }, where u is an n-dimensional vector parameter. Then, a model M is regarded as an n-dimensional manifold imbedded in F, and it is assumed that the true distribution p(x) is included in M or at least is close to M. Roughly speaking, a naive distribution p̂(x) is obtained in F from the observations as, for example, the empirical distribution or its smoothed version. Then, based on this p̂, we infer the true distribution, which is supposed to belong to M. Hence, it is important to know the geometrical shape and the relative position of M inside F. When the number N of observations is sufficiently large, p̂ is very close to the true distribution p(x, u), so that one may use linear approximation at p̂ of M in F in evaluating inferential procedures. Hence, linear geometry is sufficient for the first order asymptotic theory. This is the reason why one can construct a first order asymptotic theory in a unified manner.

In order to evaluate higher-order characteristics, for example second and third order, linear approximation is insufficient. It is necessary to connect these tangent spaces or linear approximations of M obtained at various points, thus taking the non-linear effects into account. To this end, one needs to introduce invariant affine connections by which the curvatures are defined. However, this is not a trivial task. We introduce two dually coupled affine connections, the exponential or α = 1 connection and the mixture or α = −1 connection. Then it will be shown that the related exponential and mixture curvatures play a fundamental role in higher order asymptotic theory. The notion of a more general fibre bundle is useful for studying non-parametric or semi-parametric models.
Curved exponential family. A curved exponential family is a very tractable statistical model for the following two reasons: it has a minimal sufficient statistic x̄, and the enveloping exponential family is flat with respect to the α = ±1 connections. This makes the related geometrical theory very simple and transparent, and we can avoid the technicality of differential geometry by using this model. Therefore, we use a curved exponential family to explain the results of the geometrical method. However, it should be noted that we can construct a differential geometrical theory for a general parametric model or even a non-parametric model by using proper differential geometrical notions and their extensions.

An exponential family has the probability density functions

    q(x, θ) = exp{ Σ_{i=1}^m θ^i x_i − ψ(θ) }

with respect to a suitable measure on the sample space, where x = (x_i) and θ = (θ^i), i = 1, ..., m, are m-dimensional vectors and ψ(θ) is the cumulant generating function. The family S = { q(x, θ) } is an m-dimensional manifold in F, where the natural or canonical parameter θ defines a coordinate system of S: a distribution q(x, θ) in S is specified by θ, i.e. a point θ implies a distribution. We say that this coordinate system is α = 1-affine (or e-affine), regarding θ as an α = 1 (or exponentially) linear coordinate system of S. This is a definition of the e-linearity. (Obviously, we need to define the e-linearity in a general statistical model by introducing an e- or 1-affine connection.) A curve θ = θ(t) parametrized by a scalar t is e-linear when θ(t) is linear in t. More generally, a submanifold of S which is represented by a set of linear equations in θ is e-linear. An e-curvature of a submanifold can be defined in an ordinary way, when it is not e-linear. (This curvature is a tensor, but we do avoid technical descriptions.)

There is another coordinate system η = (η_i), called the expectation parameter or the expectation coordinate system, which is defined by

    η_i = E[x_i],

where E denotes the expectation with respect to q(x, θ). This is also an important coordinate system, dually coupled with θ, and there is a one-to-one relation between them, θ = θ(η), η = η(θ). We may use this η to specify a distribution in S. The η is said to be m-affine or α = −1-affine, and any submanifold of S which is defined by a set of linear equations in η is said to be m- or α = −1-flat. We have thus defined the m- or α = −1 flatness. When a submanifold is not m-flat, the m-curvature can be defined in a similar manner.

Let x_1, ..., x_N be N independent vector observations. Their arithmetic mean x̄ = (1/N) Σ_{t=1}^N x_t is a minimal sufficient statistic. This x̄ defines a point (distribution) in S as follows. Let η̂ be the point in S whose η-coordinates are put equal to the sufficient statistic,

    η̂ = x̄.

We call the point η̂ (or more precisely the distribution specified by η̂) the observed point. Its θ-coordinates θ̂ = θ(η̂) give the m.l.e. of θ. A curved exponential family M = { p(x, u) } is a submanifold of S parametrized by an n-dimensional parameter u = (u^a), a = 1, ..., n, where n < m, such that

    p(x, u) = q{ x, θ(u) }.

The submanifold M is represented by θ = θ(u) or η = η(u) in the respective coordinate systems. The e- and m-curvatures can be calculated by differentiating θ(u) and η(u) twice with respect to u.
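A minimal numerical illustration (not from the paper): take S to be the bivariate normal location family N(θ, I), for which η = E[x] = θ, and take as a hypothetical curved family M the unit circle θ(u) = (cos u, sin u), so n = 1 < m = 2. The m.l.e. û is then the projection of the observed point η̂ = x̄ onto M, which a crude grid maximization of the likelihood reproduces:

```python
import math, random

# Enveloping family S: x ~ N2(theta, I); here eta = E[x] = theta.
# Hypothetical curved family M (n = 1 < m = 2): theta(u) = (cos u, sin u).
random.seed(1)
u_true, N = 0.7, 400
xs = [(math.cos(u_true) + random.gauss(0, 1), math.sin(u_true) + random.gauss(0, 1))
      for _ in range(N)]
xbar = (sum(p[0] for p in xs) / N, sum(p[1] for p in xs) / N)    # observed point eta-hat

def loglik(u):
    # log likelihood of u up to constants: maximizing it is the same as
    # minimizing |xbar - theta(u)|^2, i.e. projecting eta-hat onto M
    dx, dy = xbar[0] - math.cos(u), xbar[1] - math.sin(u)
    return -0.5 * N * (dx * dx + dy * dy)

u_grid = max((2 * math.pi * k / 100_000 for k in range(100_000)), key=loglik)
u_proj = math.atan2(xbar[1], xbar[0])                            # closed-form projection
```

The grid maximizer and the closed-form projection agree to grid accuracy, and both are close to the true u for moderate N.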
Inferential procedures. Estimation of the true parameter u is stated geometrically as follows: given an observed point η̂ = x̄ ∈ S which belongs to S but does not in general lie in M, find a point û ∈ M, or a point η(û) ∈ M, which is closest to η̂ in some sense. Let û = e(η̂) be an estimator, which is a mapping e from S to M, e : η̂ ↦ û. Let A(u) be the inverse image of this mapping, i.e.,

    A(u) = e^(-1)(u) = { η ∈ S | u = e(η) }.

Then, A(u) forms an (m − n)-dimensional submanifold, and { A(u) } is a foliation of S. We call this A(u) the ancillary submanifold or estimating submanifold attached to u by the estimator e. The value of the estimator is u when and only when the observed point η̂ is in A(u). When an estimator is consistent, A(u) passes through the point η(u) ∈ M, because û tends to u as N grows. Let us introduce a coordinate system v = (v^κ), which is (m − n)-dimensional, in each A(u) such that (u, v) is a coordinate system of S. Then, any point η ∈ S is uniquely specified by (u, v) as

    η = η(u, v),

where u shows that the point η belongs to A(u) and v shows its relative position in A(u). The origin v = 0 is put at the intersection of A(u) and M, so that v = 0 if and only if the η is in M.

The sufficient statistic x̄, or equivalently the observed point η̂, is decomposed into two statistics (û, v̂) by solving η̂ = η(û, v̂). As can be easily seen, the statistic û is an estimator including most of the Fisher information in x̄, and v̂ is rather ancillary (asymptotically ancillary), including little information concerning the true u. We can obtain the Edgeworth expansion of the distribution of (û, v̂) up to the third order in terms of the geometric quantities related to the curvatures of M and A(u) and their angle of intersection. This elucidates how geometric quantities are related to the performances of estimators. Problems of conditional inference and ancillarity can also be understood from this geometric point of view.

Testing a hypothesis H_0 : u ∈ D against H_1 : u ∉ D can be analyzed in a similar way. Since observations give an observed point η̂ in S, the critical region R of a test is set in S such that the hypothesis H_0 is rejected if, and only if, η̂ ∈ R. Now let us compose an ancillary family { A(u) } such that the critical region R is composed of some of these A(u)'s,

    R = ∪_{u ∈ R_M} A(u).

Then, the decomposed statistics (û, v̂) have the following meaning with respect to this { A(u) }: the test is a function of û only, while the statistic v̂ has little information but can be used as a conditioning statistic. The characteristics of various tests can be analyzed by using the geometric shape of the related A(u) through the Edgeworth expansion of the distribution of (û, v̂). The characteristics of interval estimators are closely related to those of associated tests, and they can be analyzed in a similar manner.
Asymptotic theory of estimation. Let û be a consistent estimator. Then, its mean square error is expanded as

    N E[(û − u)'(û − u)] = A_1 + N^(-1) A_2 + O(N^(-2)).

The first-order error matrix A_1 is given by

    A_1 = (g − g_A)^(-1),

where g is the Fisher information matrix of M and g_A represents the square of the cosine of the angle between A(u) and M, the angle being defined with respect to the Fisher information matrix of S. Therefore, as is well known, an estimator is efficient when the estimating manifolds A(u) are orthogonal to M, and then A_1 = g^(-1) holds.

Let u* be the one-step bias corrected version of an efficient estimator û. Then, its A_2 term is decomposed into the sum of three non-negative terms as

    A_2 = (Γ_u^m)^2 + (H_M^e)^2 + (H_A^m)^2.

Here, (Γ_u^m)^2 is the square of the mixture connection of the coordinate system u, and (H_M^e)^2 is the square of the (α = 1)-curvature of M, both of which do not depend on the estimator. The third term (H_A^m)^2 is the square of the (α = −1)-curvature of the estimating manifold A(u), and it vanishes for the m.l.e., because the estimating manifolds of the m.l.e. are mixture-(α = −1)-flat. Hence, the m.l.e. is second-order efficient.

Asymptotic theory of tests. Before studying the power function of a test, we show a geometric result obtained from the Neyman-Pearson fundamental lemma: the critical region R of the most powerful test of H_0 : u = u_0 against H_1 : u = u_1 is bounded by an m-flat hypersurface which is orthogonal to the e-flat curve connecting u_0 and u_1. The e-flat curve forms an exponential family connecting p(x, u_0) and p(x, u_1), and the critical region R remains the same for any alternative H_1' : u = u_1' if u_1' is on the curve. This shows the reason why a uniformly most powerful test exists for an exponential family. However, when M is curved, there are no uniformly most powerful tests.

We consider the simplest case of testing H_0 : u = u_0 against H_1 : u ≠ u_0 in a scalar parameter case where M = { p(x, u) } forms a curve in S. The power function P_T(t) of a test T is defined by the probability of rejecting H_0 when the true distribution is at u = u_0 + t/sqrt(Ng), where g is the Fisher information of M at u_0. It is expanded as

    P_T(t) = P_T1(t) + P_T2(t)/sqrt(N) + P_T3(t)/N + O(N^(-3/2)).

It is well known that there exist a number of first order uniformly efficient tests, in the sense that they are most powerful at any t to first order; they are, for example, the likelihood ratio test, the Wald test, the efficient score test, etc. Geometrically, a test is first-order efficient if and only if the boundary ∂R (or equivalently A(u)) is (asymptotically) orthogonal to M at the intersecting point. It is also known that a first-order efficient test is automatically second-order efficient, in the sense that P_T2 is most powerful at any t whenever the test is first-order efficient. However, there exist no third-order uniformly most powerful tests, implying that a test can be good at a specific t_0 but not so good at other t's. Then, what are the third order characteristics of the widely used tests mentioned above? Geometry can answer this problem. The characteristics depend on the cosine of the asymptotic angle between ∂R and M, which plays a role of canceling the e-curvature (non-exponentiality) of M.

We show the results. Let us define the deficiency or third order power loss function ΔP_T(t) of an efficient test T by

    ΔP_T(t) = lim_{N→∞} N { P_{T(t),3}(t) − P_{T,3}(t) },

where T(t) is the test which is most powerful at t (but not at other t'). Then ΔP_T(t) is obtained explicitly as

    ΔP_T(t) = a(t, α) { c − b(t, α) }^2 γ^2,

where a(t, α) and b(t, α) are known functions depending on the level α, γ^2 is the square of the e-curvature of M (Efron's curvature), and c is a factor compensating the e-curvature through the asymptotic angle between M and A(u) or ∂R. The values of c are calculated for various tests. The results are as follows: c = 0 for the Wald test, c = 1/2 for the likelihood ratio test, c = 1 for the locally most powerful test, etc. We show the universal deficiency curve for various tests. It should be noted that the results hold after both the level and the bias of the corresponding test statistics are adjusted up to O(N^(-1)), as we do in the Bartlett adjustment. The adjustment is given from the Edgeworth expansion of the related (û, v̂).

[A final section, partly lost in the scan, treats estimation in the presence of an incidental or nuisance parameter ξ = (ξ_1, ξ_2, ...), an infinite sequence of values.] Let us define the asymptotic variance of an estimator û by

    AV[û] = lim_{N→∞} N E[(û − u)^2].

An estimator û is said to be optimal when its asymptotic variance AV[û] is not larger than that of any other estimator for any sequence ξ. This definition of optimality is very strong, so that there might not exist the optimal estimator. The geometrical method can solve the problem of obtaining a necessary and sufficient condition that guarantees the existence of the optimal estimator, and of obtaining the estimating function y(x, u) of the optimal estimator when it exists. To this end, we define a vector space R(u, ξ) at each point (u, ξ) of M by the set of random variables

    R(u, ξ) = { r(x) | E[r] = 0, E[r^2] < ∞ }.

[The text breaks off here in the scan.]
[The opening of this paper, including its title, author and the definition (1) of the mixed log model derivatives, is lost in the scan; l = l(ω; ω̂, a) denotes the log likelihood as a function of the parameter ω, the maximum likelihood estimator ω̂ and an auxiliary statistic a, and ^ indicates the operation of substituting ω̂ for ω.] The discussion in the present paper evolves around the following formula for the conditional model function for ω̂ given a (Barndorff-Nielsen (1980, 1983)):

    p(ω̂; ω | a) ≐ p*(ω̂; ω | a),                                  (2)

where p* is defined by

    p*(ω̂; ω | a) = c(ω, a) |ĵ|^(1/2) e^{l(ω) − l(ω̂)}.            (3)

Let

    c̄ = c̄(ω, a) = log{ (2π)^(d/2) c(ω, a) }.                      (4)

This quantity c̄ is often close to 0.

The interest of (2) is tied to the conditionality viewpoint, that is to cases where a is not only auxiliary, in the above sense, but is also distribution constant, either exactly or approximately. We refer to a statistic having both these properties as an ancillary (statistic). In the case of ordinary repeated sampling with sample size n, if the approximation (2) is correct to order O(n^(-v/2)) (typically, v = 2, 3 or 4), then to that order many developments traditionally requiring calculation of moments or cumulants can instead be carried out in terms of the mixed log model derivatives (1). For instance, if (2) holds to O(n^(-3/2)), confidence limits for a one-dimensional interest parameter, valid to O(n^(-3/2)) conditionally on a as well as unconditionally, can be thus determined. Bartlett adjustment affords another instance, to be discussed later (Barndorff-Nielsen (1986a,b)). Yet another exemplification is provided by formula (5) below.

By Taylor expansion of the right hand side of (3) in ω̂ around ω one may derive an asymptotic expansion of p*(ω̂; ω | a) of the form

    p*(ω̂; ω | a) = φ_d(ω̂ − ω; ĵ) { 1 + R_1 + R_2 + ... },        (5)

where φ_d(·; Λ) denotes the d-dimensional normal probability density function with mean vector 0 and precision matrix Λ, where ĵ is the observed information matrix with elements given by (6) [display lost in the scan], and where R_v is generally of order O(n^(-v/2)) under repeated sampling. In particular, R_1 is a cubic expression in ω̂ − ω: one term contracts the products (ω̂ − ω)^r (ω̂ − ω)^s (ω̂ − ω)^t with connection symbols, the other involves the tensorial Hermite polynomial h^{rst}(ω̂ − ω; ĵ) of degree 3 corresponding to the precision ĵ [the exact display is not recoverable from the scan]. Here, and elsewhere, we employ Einstein's summation convention. The quantities appearing as coefficients in R_1 are affine connections, in the sense of differential geometry, on the parameter space of the model M. For any real a the observed a-connection Γ^a is defined by (Barndorff-Nielsen (1986a))

    Γ^a_{rst} = ((1 + a)/2) l_{rs;t} + ((1 − a)/2) l_{t;rs}.       (7)

These connections are 'observed analogues' of the Chentsov-Amari connections Γ^a (Chentsov (1972), Amari (1985)).

The expansion (5) has some similarity to but is distinct from the Edgeworth expansion for the distribution of ω̂. Thus (5) employs mixed log model derivatives instead of (approximate) cumulants of ω̂. Note that (5) is valid as an asymptotic expansion irrespective of the accuracy with which p*(ω̂; ω | a) approximates p(ω̂; ω | a) and irrespective of whether a is (approximately) distribution constant or not.

For fixed value of the auxiliary statistic a there is in general (locally, at least) a smooth one-to-one correspondence between ω̂ and the score vector l_* = (l_1(ω), ..., l_d(ω)). Hence, by the usual formula for transformation of probability densities, (2) can be transformed to a formula for the conditional distribution of l_*. Writing l_{r;s} for the corresponding mixed log model derivatives, one finds an expression (8) for p(l_*; ω | a) of the same form as (3), with |ĵ|^(1/2) replaced by a ratio of determinants of ĵ and the matrix [l_{r;s}] [the display is not recoverable from the scan], where on the right hand side ω̂ has to be expressed in terms of l_* (and a). The relation (2) is, in fact, exact for most transformation models as well as for a variety of other models (Barndorff-Nielsen (1983), Blæsild and Jensen (1985), Barndorff-Nielsen and Blæsild (1986a)). Outside these cases the best that can generally be achieved is an asymptotic (relative) error of order O(n^(-3/2)). In particular, if M is an exponential model of order k and if d = k then (2) is valid to order O(n^(-3/2)).
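The exactness of (2) for full exponential models can be checked numerically. A sketch (not from the paper) for the simplest (1,1) case, the exponential distribution with rate ω, where ω̂ = 1/x̄, ĵ = n/ω̂^2 and l(ω) − l(ω̂) = n log(ω/ω̂) − nω/ω̂ + n; renormalized, p* should coincide with the exact (inverse-gamma) density of ω̂:

```python
import math

# p*-formula check in the (1,1) exponential model x_1..x_n ~ Exp(rate w):
# w_hat = 1/xbar, j_hat = n/w_hat^2, l(w) - l(w_hat) = n log(w/w_hat) - n w/w_hat + n
n, w = 5, 2.0

def pstar_unnorm(w_hat):
    j_hat = n / w_hat ** 2
    return math.sqrt(j_hat) * math.exp(n * math.log(w / w_hat) - n * w / w_hat + n)

def exact(w_hat):
    # w_hat = 1/xbar with xbar ~ Gamma(n, rate n*w), so w_hat is inverse-gamma
    return (n * w) ** n / math.gamma(n) * w_hat ** (-n - 1) * math.exp(-n * w / w_hat)

h = 0.001
grid = [h * (k + 1) for k in range(60_000)]                  # grid over (0, 60]
Z = sum(pstar_unnorm(t) for t in grid) * h                   # numerical norming constant
ratios = [pstar_unnorm(t) / Z / exact(t) for t in (0.5, 1.0, 2.0, 4.0, 8.0)]
```

All the ratios come out equal to 1 up to discretization error, reflecting that here p* differs from the exact conditional density only through the norming constant c.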
Now suppose that (2) holds with error O(n^(-3/2)) and let M_0 be a submodel of M, having parametric dimension d_0 < d. Using the asymptotic normality of the score vector for M, under the hypothesis M_0, one may, by standardizing the part of the score vector orthogonal to M_0 so as to have variance matrix equal to the unit matrix asymptotically, construct a supplementary auxiliary statistic of dimension d − d_0 so as to make (2) valid to order O(n^(-1)) under M_0. In fact, there is a considerable variety of approximately distribution constant statistics of dimension d − d_0 which could serve in the capacity of supplementary auxiliary and yield accuracy O(n^(-1)). However, demanding accuracy O(n^(-3/2)) of (2) narrows down the choice significantly. More specifically and supposing, for simplicity, that d_0 = d − 1, it can be shown that accuracy O(n^(-3/2)) of (2) as applied to M_0 may be achieved by taking as supplementary auxiliary

    r^+ = r − bias correction,                                     (9)

where

    r = ±[ 2{ l(ω̂) − l_0(ω̂_0) } ]^(1/2)

is the signed log likelihood ratio statistic for testing M_0 against M; and that this choice of a supplementary auxiliary is unique to the asymptotic order concerned (Barndorff-Nielsen (1984, 1986b)).

The statistic r^+ is asymptotically N(0,1) distributed to order O(n^(-1)). By introducing a variance adjustment it is possible to establish a statistic

    r* = r^+ / s.d. adjustment,                                    (10)

which is asymptotically standard normal to order O(n^(-3/2)). This may be used for a refined test of M_0 versus M, as well as for the role of supplementary auxiliary. Moreover, with error O(n^(-3/2)),

    r* = r − r^(-1) log K,                                         (11)

where K is a certain explicitly given function of the observations, and the right hand side of (11) is often simpler to calculate than (10). In case M is a (k,k) exponential model while M_0 is a (k,k-1) exponential model, K has an explicit determinantal form involving ∂θ/∂ω, ĵ and θ(ω̂) − θ(ω̂_0) [the display is not recoverable from the scan]. For details and examples, see Barndorff-Nielsen (1986b).

Let

    b = d^(-1) E_ω w,                                              (12)

where w is the log likelihood ratio statistic for testing a particular value of ω under M, i.e.

    w = 2{ l(ω̂) − l(ω) }.                                         (13)

The quantity b and suitable approximations thereof are termed Bartlett adjustments for the log likelihood ratio statistic. The Bartlett adjusted version w' = w/b of w is, in wide generality, asymptotically χ²-distributed on d degrees of freedom, the degree of approximation to the limiting χ² distribution being O(n^(-3/2)), or even O(n^(-2)), as opposed to O(n^(-1)) for w itself. The norming quantity c̄ is related to b by the approximate relation

    b ≐ e^(−(2/d) c̄)                                               (14)

(Barndorff-Nielsen and Cox (1984)).

Decomposition of the norming quantity c (or, equivalently, of Bartlett adjustments) into invariant terms can be achieved by the use of strings, a differential geometric concept generalizing those of tensors, connections, and derivatives of scalars (functions) (Barndorff-Nielsen (1986c), Barndorff-Nielsen and Blæsild (1986b,c); see also McCullagh and Cox (1986), which provides and discusses the first example of such a decomposition). A (p,q) string of length M is a sequence of multiarrays satisfying a certain transformation law under changes of coordinates [the defining display is not recoverable from the scan]. The string is said to be a costring if n = 0 and a contrastring if m = 0. These types of strings can be represented in terms of tensors and special, simple kinds of strings. In particular, any costring can be represented as the intertwining of a connection string, i.e. a (1,0) costring, and a sequence of tensors. Mixed log model derivatives provide examples of strings, and so do moments and cumulants of log likelihood derivatives. Tensors determined by associated intertwining operations may be used to obtain invariant decompositions. For example, to order O(n^(-3/2)) the norming quantity can be written as a sum of four invariant (separately interpretable) terms [the display is not recoverable from the scan], due to the fact that the quantities H appearing in it are tensors. These tensors were obtained by means of intertwining, applied to the first few of the mixed log model derivatives. For further examples and an extensive study of the mathematical properties of strings, see Barndorff-Nielsen and Blæsild (1986b,c).
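The effect of the Bartlett adjustment w' = w/b can be seen in a small Monte Carlo sketch (not from the paper), using n observations from the exponential distribution with known true rate 1, for which w = 2n(x̄ − 1 − log x̄) and d = 1:

```python
import math, random

# Monte Carlo sketch: Bartlett adjustment of the likelihood ratio statistic
# for testing the true rate (= 1) from n exponential observations, where
# w = 2n( xbar - 1 - log xbar ) and d = 1.
random.seed(2)
n, reps = 5, 200_000
ws = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    ws.append(2 * n * (xbar - 1 - math.log(xbar)))

b = sum(ws) / reps                       # estimate of b = E[w]/d; here about 1 + 1/(6n)
q95 = 3.8414588                          # upper 5% point of chi-squared on 1 df
cov_raw = sum(wi <= q95 for wi in ws) / reps          # coverage without adjustment
cov_adj = sum(wi / b <= q95 for wi in ws) / reps      # coverage of w' = w/b
```

With n = 5 the estimated b is about 1.03, the unadjusted statistic under-covers the nominal 95% level, and dividing by b brings the coverage essentially onto the nominal value, in line with the O(n^(-3/2)) claim.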
REFERENCES

Amari, S.-I. (1985). Differential-Geometric Methods in Statistics. Lecture Notes in Statistics 28. Springer-Verlag, Heidelberg.
Barndorff-Nielsen, O.E. (1980). Conditionality resolutions. Biometrika 67, 293-310.
Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, O.E. (1984). On conditionality resolution and the likelihood ratio for curved exponential models. Scand. J. Statist. 11, 157-170. Corrigendum: 1985, Scand. J. Statist. 12, 191.
Barndorff-Nielsen, O.E. (1986a). Likelihood and observed geometries. To appear in Ann. Statist.
Barndorff-Nielsen, O.E. (1986b). Inference on full or partial parameters, based on the standardized signed log likelihood ratio. Biometrika 73, 307-322.
Barndorff-Nielsen, O.E. (1986c). Strings, tensorial combinants, and Bartlett adjustments. Proc. Roy. Soc. London A 406, 127-137.
Barndorff-Nielsen, O.E. and Blaesild, P. (1986a). Combination of reproductive models. Research Report 107, Dept. Theor. Statist., Aarhus University. To appear in Ann. Statist.
Barndorff-Nielsen, O.E. and Blaesild, P. (1986b). Strings: Mathematical theory and statistical examples. Research Report 146, Dept. Theor. Statist., Aarhus University. To appear in Proc. Roy. Soc. London.
Barndorff-Nielsen, O.E. and Blaesild, P. (1986c). Strings: contravariant aspect. Research Report 152, Dept. Theor. Statist., Aarhus University.
Barndorff-Nielsen, O.E. and Cox, D.R. (1984). Bartlett adjustments to the likelihood ratio statistic and the distribution of the maximum likelihood estimator. J. R. Statist. Soc. B 46, 483-495.
Blaesild, P. and Jensen, J.L. (1985). Saddlepoint formulas for reproductive exponential models. Scand. J. Statist. 12, 193-202.
Chentsov, N.N. (1972). Statistical Decision Rules and Optimal Inference. (In Russian.) Nauka, Moscow. English translation 1982: Translations of Mathematical Monographs Vol. 53. American Mathematical Society, Providence, Rhode Island.
McCullagh, P. and Cox, D.R. (1986). Invariants and likelihood ratio statistics. To appear in Ann. Statist.
ON ASYMPTOTICALLY COMPLETE CLASSES OF TESTS

Bernstein A.V.
Moscow, USSR

Let X^{(n)} = (X_1, ..., X_n) be the sample from the distribution P_θ depending on the unknown parameter θ ∈ R^k. Let P_θ have the density p(x, θ) w.r.t. some σ-finite measure. Based on this sample the null hypothesis H_0: θ = 0 is tested against the sequence of the contiguous alternatives H_{1n}: θ = h n^{-1/2}, h ∈ A, where A is the given subset of R^k. The decision rule in this asymptotic (as.) testing problem (t.p.) is determined by the test-sequence (t.s.) {φ_n}, where for each n the test φ_n(X^{(n)}) is based on the sample X^{(n)} of size n. This test is characterized by its power function β_n(θ) = ∫ φ_n dP_θ^{(n)}, where P_θ^{(n)} denotes the n-fold product measure of identical components P_θ. Denote by β_n(0) the level of the test φ_n.
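The role of contiguous alternatives of order n^{-1/2} can be seen in a small simulation: the power of a fixed-level test settles to a nondegenerate limit instead of tending to 1. The normal location model and all settings below are illustrative choices, not from the paper:

```python
import math
import random

# For X_i ~ Normal(theta, 1), test H0: theta = 0 with the statistic
# sqrt(n) * Xbar and critical value 1.96, against theta_n = h / sqrt(n).
def power(n, h, reps=5000, seed=3):
    rng = random.Random(seed)
    theta_n = h / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta_n, 1.0) for _ in range(n)) / n
        if abs(math.sqrt(n) * xbar) > 1.96:
            hits += 1
    return hits / reps

powers = {n: power(n, h=2.0) for n in (25, 100, 400)}
print(powers)               # roughly the same value for every n
print(power(100, h=0.0))    # the level, roughly 0.05
```

Against a fixed alternative the power would instead converge to 1 as n grows; the n^{-1/2} scaling is exactly what keeps the testing problem nontrivial in the limit.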
Remark 2. The coefficient C(B) in fined as the
^
m | CJ €
(8) for &g ^
(4) the if IB)
in
3cC+l$-l(ç
./QY,oÇ+S
(11) reduces to
~(a))
and for the l.f. (5)
« J
M
-
d
W
^ d&j
+ lfadlM/
à
+
Therefore in this casegthe m.l.e. © g U) (4)
for
oTfôMl
y=i «
*
£1
where for the l.f.
«
3ot+£$-%•
a expl-^fauMaj(12) u V-r 0
S2 (D=UD (5)
i
d(r)'1
ilx(2)=
e x p { -
¡m(ti)du}.n3i
%
0
4. Some admissibility results.

Definition. An estimator θ̂ is said to be second order admissible (s.o.a.) in P if there do not exist estimators in P whose second order risk nowhere exceeds, and somewhere falls below, that of θ̂ (apart from estimators coinciding with θ̂).

The next assertion is an immediate consequence from the theorems 1, 2.

Assertion 1. The estimator θ̂ is s.o.a. iff there does not exist a nontrivial positive solution to the inequality Lω ≤ 0, ω ∈ C.

Proof. Note that Lω ≤ 0 for a positive ω ∈ C with ω ≢ const implies strict inequality at least in a point. Then apply theorem 2 and the remark 1.

Let W(Θ) be the class of piecewise continuously differentiable functions vanishing outside compacts in Θ, with square integrable first order derivatives.

Theorem 3. The estimator θ̂ is s.o.a. iff

    inf { J(ω) : ω ∈ W(Θ), ω ≥ 1 on K } = 0

for any compact K ⊂ Θ, where

    J(ω) = ∫_Θ ω′(θ)² a(θ) d(θ) dθ.    (14)

The proof is a straight generalization of that of Theorem 4.1 in Levit (1985).
We present next a useful necessary condition for s.o.a. in a special case.

Theorem 4. Let, for a given relatively open subset Q of {θ : |θ| = 1}, the cone C = {θ : θ = λq, λ > 0, q ∈ Q} be contained in Θ. Suppose that both the functions a(z) and m(z) are regularly varying at the points 0, ∞ and that θ̂ is s.o.a. Then the integral ∫ (a(z) d(z))^{-1} dz diverges at the points 0 and ∞.

The proof is a straight generalization of that of Theorem 4.5 in Levit (1985). One has only to note that for any relatively open Q₁ not having any points in common with the coordinate axes the expression a(λq), λ > 0, is bounded from zero uniformly in Q₁.
5. Some applications. It is clear from (12), (13) that the s.o.a. of the m.l.e. or of the functions of θ̂ depends only on the behaviour of m(z) and a(z) near the endpoints (but not on the particular family at hand). However, for the illustrative purposes we consider here two classical families, the first being the normal density, the other being the Poisson distribution.
Let us confine to one of the loss functions (15), (16), each indexed by a parameter α > 0.
We s t a t e f i r s t some a d m i s s i b i l i t y r e s u l t s postponing l a t e r a remark on t h e i r s t a t i s t i c a l A s s e r t i o n 2. The m . l . e . a) f o r b) f c)
o
iff r
i
f
is
^=3, S
forf^Wi
meaning.
s.o.a.
(
f
a
n
d
then f o r any oC > 0 );
or i c = 3 ^
iff
£ = -
=£
P r o o f , a) From ( 6 ) ( 1 2 )
o r S ^ ^ ^ ^ O .
one e a s i l y
obtains
a(z) = const• 2\ q H*) = 1 \
= (ot+s){3-k)/(z (PWJ.
According t o Theorem 4 the necessary c o n d i t i o n is
c
o
Substituting
to
Kf) =
f
£
I
a
s
t
in
(14)
.
(
1
7
)
reduces
~
w
iff
for
= 0 , in which case act) =
Now i t
-
oC=Jt;
d) f o r f ^ VC^j, i f f
s.o.a.
till
,
(^T-J
.
dz
^
-
,
VCW(R$)
ljr>0j
k * 0 )
etc.
J
Remark 6. There is an abundant statistical literature on admissibility for these families as well as for other distributions within the exponential family, mainly restricted to the case α = 2 in (15), (16); see the list of references below and further references therein. Analyzing s.o.a. of the m.l.e. for different α exhibits the curiously sensitive way in which its admissibility depends on the particular choice of parametrization, as is the case with one of the families considered, as well as on the loss function at hand, as in the case of the other. The situation thus presented seems to demonstrate that the whole affair of relying on the admissibility of estimators has in a way to be reconsidered. But looking at it in yet another way forces one to admit the fruitfulness of relating the admissibility properties to the existence of corresponding non-trivial positive superharmonic functions, and through this to some more involved mathematical fields of current interest.

6. A sufficient condition for s.o.a. Turning back to Assertion 1, let Lω be of the form considered there, with both q(t) and a(t) positive even functions, a(t) ∈ C¹(R¹), m(t) ∈ C(R¹). Denote h₁(t) = a(t)q(t)(∫₀ᵗ m(u)du)⁻¹ and h₂(t) = a(t)m(t)∫₀ᵗ q(u)du, provided the last integral converges.
L
0
S
0
J
*Z- /
Lemma 1. Let J (•¿^¿¿
-
C(P)=
Q
rG -v
o c L t l
k ~ H u )
as ij, —^
P_{ϑ′,Γ′} = P_{ϑ″,Γ″} implies ϑ′ = ϑ″.
Let Q := {Q_{ϑ,Γ} : ϑ ∈ Θ, Γ ∈ G}, where G denotes the family of prior distributions Γ. It is easy to see (see e.g. Pfanzagl and Wefelmeyer (1982), p. 227 and p. 228) that T(Q) contains all functions (x, η) → c ℓ′(x, ϑ, η) + k(η) with c ∈ R and k ∈ T(Γ, G), and that

    k*((x, η), Q_{ϑ,Γ}) = ℓ′(x, ϑ, η)/Q_{ϑ,Γ}(ℓ′(·, ϑ, ·)²),

where ℓ′(x, ϑ, η) := (∂/∂ϑ) log p(x, ϑ, η). Finally, the level space is

    K(ϑ, Γ) = {(x, η) → k(η) : k ∈ T(Γ, G)}.
According to the general prescription outlined in Part 1, we have to determine conditional expectations, given U. Thanks to the special nature of the function U as the projection into the first component, this becomes particularly simple: For any function / : X x H —> R which is integrable with respect to Qi9,r, we have
(7)    (Q_{ϑ,Γ} f)(x) = ∫ f(x, η) p(x, ϑ, η) Γ(dη) / ∫ p(x, ϑ, η) Γ(dη),    x ∈ X.

To obtain the canonical gradient, k*(·, P_{ϑ,Γ}), we determine the conditional expectations of the elements of the level space,

(8)    x → ∫ k(η) p(x, ϑ, η) Γ(dη) / ∫ p(x, ϑ, η) Γ(dη),    k ∈ T(Γ, G),

and the conditional expectation of ℓ′(·, ϑ, ·), which is

(9)    ∫ ℓ′(x, ϑ, η) p(x, ϑ, η) Γ(dη) / ∫ p(x, ϑ, η) Γ(dη).
Let d(·, ϑ, Γ) denote the orthogonal component of this function with respect to K(ϑ, Γ). With this notation, the canonical gradient can be expressed as

(10)    k*(x, P_{ϑ,Γ}) = d(x, ϑ, Γ)/P_{ϑ,Γ}(d(·, ϑ, Γ)²).

From this we obtain the asymptotic variance bound

(11)    1/P_{ϑ,Γ}(d(·, ϑ, Γ)²).
The application of the general result of P a r t 1 to the problem of unknown random nuisance parameters has led us to a result which was obtained earlier (see Pfanzagl and Wefelmeyer (1982), Section 14.3) in a direct way.
Now we apply this result to a more special model, namely

(12)    p(x, ϑ, η) = q(x, ϑ) p₀(S(x, ϑ), ϑ, η).

The representation (12) is not unique, and it is convenient in applications to have a certain freedom in choosing q and p₀. For some purposes it is, however, advantageous to use a certain canonical form of (12), in which p₀(·, ϑ, η) is a density of P_{ϑ,η} * S(·, ϑ) (with respect to an appropriate σ-finite measure ν not depending on ϑ). Whenever a representation (12) exists, there also exists a canonical representation of this type. (The argument brought forward in connection with (14.3.11) in Pfanzagl and Wefelmeyer (1982), p. 232, remains true if the sufficient statistic S depends on ϑ.) The level space K(ϑ, Γ) (see (8)) now consists of all functions

    x → ∫ k(η) p₀(S(x, ϑ), ϑ, η) Γ(dη) / ∫ p₀(S(x, ϑ), ϑ, η) Γ(dη).

All these functions are contractions of S(·, ϑ) (i.e. they depend on x through S(x, ϑ) only). Determining orthogonal components with respect to K(ϑ, Γ) becomes particularly simple if K(ϑ, Γ) is the class of all functions in L₂(X, A, P_{ϑ,Γ}) with expectation zero which are contractions of S(·, ϑ). This is the case if (see Pfanzagl and Wefelmeyer (1985), p. 95, Proposition 3.2.5)

(i) the family G of prior distributions is full (i.e. if T(Γ, G) = {k ∈ L₂(H, C, Γ) : Γ(k) = 0}), and

(ii) the family {P_{ϑ,η} * S(·, ϑ) : η ∈ H} is complete for every ϑ ∈ Θ.

Since many interesting applications are of this type, we consider it now in more detail. Up to now we have not yet introduced explicitly the image space of S(·, ϑ). Assume this is a measurable space (Y, D). Then the class of all functions in L₂(X, A, P_{ϑ,Γ}) which are contractions of S(·, ϑ) is

    {h ∘ S(·, ϑ) : h ∈ L₂(Y, D, P_{ϑ,Γ} * S(·, ϑ))}.

Hence our assumption about K(ϑ, Γ) may be written as

    K(ϑ, Γ) = {h ∘ S(·, ϑ) : h ∈ L₂(Y, D, P_{ϑ,Γ} * S(·, ϑ)), P_{ϑ,Γ}(h ∘ S(·, ϑ)) = 0}.
Since P_{ϑ,Γ}(a*(·,ϑ)) = −P_{ϑ,Γ}(a(·,ϑ)d(·,ϑ,Γ)), this asymptotic variance is, in general, larger than the asymptotic variance bound (11). The asymptotic variance coincides with the asymptotic variance bound if a*(·,ϑ) belongs to K(ϑ,Γ), for in this case d(x,ϑ,Γ) = a(x,ϑ,Γ). The condition a*(·,ϑ) ∈ K(ϑ,Γ) is trivially fulfilled if S(·,ϑ) is in its canonical form, we have