De Gruyter Graduate
Reinhard Höpfner
Asymptotic Statistics With a View to Stochastic Processes
De Gruyter
Mathematics Subject Classification 2010: 62F12, 62M05, 60J60, 60F17, 60J25

Author
Prof. Dr. Reinhard Höpfner
Johannes Gutenberg University Mainz
Faculty 8: Physics, Mathematics and Computer Science
Institute of Mathematics
Staudingerweg 9
55099 Mainz
Germany
[email protected]
ISBN 978-3-11-025024-4
e-ISBN 978-3-11-025028-2

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2014 Walter de Gruyter GmbH, Berlin/Boston
Typesetting: PTP-Berlin Protago-TEX-Production GmbH, www.ptp-berlin.de
Printing and binding: CPI buch bücher.de GmbH, Birkach
Printed on acid-free paper
Printed in Germany

www.degruyter.com
To the students who like both statistics and stochastic processes
Preface
At Freiburg in 1991 I gave my very first lectures on local asymptotic normality; the first time I explained likelihood ratio processes or minimum distance estimators to students in a course was 1995 at Bonn, during a visiting professorship. Then, during a short time at Paderborn and now almost 15 years at Mainz, I lectured from time to time on related topics, or on parts of these. In cooperation with colleagues in the field of statistics of stochastic processes (where most of the time I was learning) and in discussions with good students (who exposed me to very precise questions on the underlying mathematical notions), the scope of subjects entering the topic increased steadily; initially, my personal preference was for null recurrent Markov processes and for local asymptotic mixed normality. Later, lecturing on such topics, a first tentative version of a script came to life at some point. It grew and completed itself in successive steps of approximation.

It is now my hope that the combination of topics may serve students and interested readers to get acquainted, on one hand, with the purely statistical theory, developed carefully in a 'probabilistic' style, and on the other hand with some (there are others) typical applications to statistics of stochastic processes.

The present book can be read in different ways, according to the possibly different mathematical preferences of a reader. In the author's view, the core of the book consists of Chapters 5 (Gaussian shift models), 6 (mixed normal and quadratic models), 7 (local asymptotics where the limit model is a Gaussian shift or a mixed normal or a quadratic experiment, often abbreviated as LAN, LAMN or LAQ), and finally 8 (examples of statistical models in a context of diffusion processes where local asymptotics of type LAN, LAMN or LAQ appear).

A reader who wants to concentrate on the statistical theory alone should skip chapters or subsections marked by an asterisk: he or she would read only Sections 5.1 and 6.1, and then all subsections of Chapter 7. This route includes a number of examples formulated in the classical i.i.d. framework, and allows the reader to follow the statistical theory without gaps.

In contrast, chapters or subsections marked by an asterisk are designed for readers with an interest in both statistics and stochastic processes. Such a reader is assumed to be acquainted with basic facts on continuous-time martingales, semimartingales, the Itô formula and the Girsanov theorem, and may go through the entire Chapters 5 to 8 consecutively. In view of the stochastic process examples in Chapter 8, he or she may consult from time to time the Appendix, Section 9, for further background and for references (on subjects such as Harris recurrence, positive or null, convergence of martingales, and convergence of additive functionals of a Harris process).
In both cases, a reader may previously have consulted or read Sections 1.1 and 1.2 as well as Chapters 3 and 4, for statistical notions such as score and information in the classical definition, contiguity, or $L^2$-differentiability, to be prepared for the core of the book. Given Sections 1.1 and 1.2, Chapters 3 and 4 can be read independently of each other. Only a few basic notions of classical mathematical statistics (such as sufficiency, the Rao–Blackwell theorem, exponential families, ...) are assumed to be known. Sections 1.3 and 1.4 are of complementary character and may be skipped; they discuss naive belief in maximum likelihood and provide some background helpful to appreciate the theorems of Chapter 7.

Chapter 2 stands isolated and can be read separately from all other chapters. In an i.i.d. framework we study in detail one particular class of estimators for the unknown parameter which 'works reasonably well' in a large variety of statistical problems under weak assumptions. From a theoretical point of view, this allows us to construct explicitly estimator sequences which converge at a certain rate. From a practical point of view, we find it interesting to start – prior to all optimality considerations in later chapters – with estimators that tolerate small deviations from theoretical model assumptions.

Fruitful exchanges and cooperations over a long period of time have contributed to the scope of topics treated in this book, and I would like to thank my colleagues, coauthors and friends for those many long and stimulating discussions around successive projects related to our joint papers. Their influence, well visible in the relevant parts of the book, is acknowledged with deep gratitude. In a similar way, I would like to thank my coauthors and partners up to now in other (formal or quite informal) cooperations. There are some teachers and colleagues in probability and statistics to whom I owe much, either for encouragement and help at decisive moments of my mathematical life, or for mathematical discussions on specific topics, and I would like to take this opportunity to express my gratitude. Furthermore, I have to thank those who from the beginning allowed me to learn and to start to do mathematics, and – beyond mathematics, sharing everyday life – my family.

Concerning a more recent time period, I would like to thank my colleague Eva Löcherbach, my PhD student Michael Diether as well as Tobias Berg and Simon Holbach: they agreed to read longer or shorter parts of this text in close-to-final versions and made critical and helpful comments; remaining errors are my own.

Mainz, June 2013
Reinhard Höpfner
Contents

Preface

1 Score and Information
 1.1 Score, Information, Information Bounds
 1.2 Estimator Sequences, Asymptotics of Information Bounds
 1.3 Heuristics on Maximum Likelihood Estimator Sequences
 1.4 Consistency of ML Estimators via Hellinger Distances

2 Minimum Distance Estimators
 2.1 Stochastic Processes with Paths in $L^p(T,\mathcal T,\mu)$
 2.2 Minimum Distance Estimator Sequences
 2.3 Some Comments on Gaussian Processes
 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences

3 Contiguity
 3.1 Le Cam's First and Third Lemma
 3.2 Proofs for Section 3.1 and some Variants

4 $L^2$-differentiable Statistical Models
 4.1 $L^r$-differentiable Statistical Models
 4.2 Le Cam's Second Lemma for i.i.d. Observations

5 Gaussian Shift Models
 5.1 Gaussian Shift Experiments
 5.2 Brownian Motion with Unknown Drift as a Gaussian Shift Experiment

6 Quadratic Experiments and Mixed Normal Experiments
 6.1 Quadratic and Mixed Normal Experiments
 6.2 Likelihood Ratio Processes in Diffusion Models
 6.3 Time Changes for Brownian Motion with Unknown Drift

7 Local Asymptotics of Type LAN, LAMN, LAQ
 7.1 Local Asymptotics of Type LAN, LAMN, LAQ
 7.2 Asymptotic Optimality of Estimators in the LAN or LAMN Setting
 7.3 Le Cam's One-step Modification of Estimators
 7.4 The Case of i.i.d. Observations

8 Some Stochastic Process Examples for Local Asymptotics of Type LAN, LAMN and LAQ
 8.1 Ornstein–Uhlenbeck Process with Unknown Parameter Observed over a Long Time Interval
 8.2 A Null Recurrent Diffusion Model
 8.3 Some Further Remarks

9 Appendix
 9.1 Convergence of Martingales
 9.2 Harris Recurrent Markov Processes
 9.3 Checking the Harris Condition
 9.4 One-dimensional Diffusions

Bibliography

Index
Chapter 1
Score and Information
Topics for Chapter 1:

1.1 Score, Information, Information Bounds
 Statistical models admitting score and information 1.1–1.2'
 Example: one-parametric paths in nonparametric models 1.3
 Example: location models 1.4
 Score and information in product models 1.5
 Cramér–Rao bound 1.6–1.7'
 Van Trees bound 1.8

1.2 Estimator Sequences, Asymptotics of Information Bounds
 Sequences of experiments, sequences of estimators, consistency 1.9
 Asymptotic Cramér–Rao bound in i.i.d. models 1.10
 Asymptotic van Trees bound in i.i.d. models 1.11
 Example: an asymptotic minimax property of the empirical distribution function 1.11'

1.3 Heuristics on Maximum Likelihood Estimator Sequences
 Heuristics I: certain regularity conditions 1.12
 Heuristics II: maximum likelihood (ML) estimators 1.13
 Heuristics III: asymptotics of ML sequences 1.14
 Example: a normal distribution model 1.15
 A normal distribution model by Neyman and Scott 1.16
 Example: likelihoods which are not differentiable in the parameter 1.16'

1.4 Consistency of ML Estimators via Hellinger Distances
 Definition of ML estimators and ML sequences 1.17
 An example using Kullback divergence 1.17''
 Hellinger distance 1.18
 Hellinger distances in i.i.d. models: a set of conditions 1.19
 Some lemmata on likelihood ratios 1.20–1.23
 Main result: consistency via Hellinger distance 1.24
 Example: location model generated from the uniform law $\mathcal R(-\frac12,\frac12)$ 1.25

Exercises: 1.4', 1.4'', 1.8', 1.8'', 1.17', 1.24''
The chapter starts with classically defined notions of ‘score’, ‘information’ and ‘information bounds’ in smoothly parameterised statistical models. These are studied in Sections 1.1 and 1.2; the main results of this part are the asymptotic van Trees bounds for i.i.d. models in Theorem 1.11. Here we encounter in a restricted setting the type of information bounds which will play a key role in later chapters, for more general sequences of statistical models, under weaker assumptions on smoothness of parameterisation, and for a broader family of risks. Section 1.3 then discusses ‘classical heuristics’ which link limit distributions of maximum likelihood estimator sequences to the notion of ‘information’, together with examples of statistical models where such heuristics either work or do not work at all; we include this – completely informal – discussion since later sections will show how similar aims can be attained in a rigorous way based on essentially different mathematical techniques. Finally, a different route to consistency of ML estimator sequences in i.i.d. models is presented in Section 1.4, with the main result in Theorem 1.24, based on conditions on Hellinger distances in the statistical model.
1.1 Score, Information, Information Bounds

1.1 Definition. (a) A statistical model (or statistical experiment) is a triplet $(\Omega,\mathcal A,\mathcal P)$ where $\mathcal P$ is a specified family of probability measures on a measurable space $(\Omega,\mathcal A)$. A statistical model is termed parametric if there is a 1-1 mapping between $\mathcal P$ and some subset $\Theta$ of some $\mathbb R^d$, $d\ge1$,

$$ \mathcal P = \{P_\vartheta : \vartheta\in\Theta\}\,,\qquad \Theta\subset\mathbb R^d\,, $$

and dominated if there is some $\sigma$-finite measure $\mu$ on $(\Omega,\mathcal A)$ such that $P_\vartheta\ll\mu$ for every $\vartheta\in\Theta$.

1.2 Definition. (a) Let $\Theta\subset\mathbb R^d$ be open, and let $E=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$ be an experiment dominated by a $\sigma$-finite measure $\mu$, with densities $f(\vartheta,\cdot)=\frac{dP_\vartheta}{d\mu}$. Assume that $f(\vartheta,\cdot):(\Omega,\mathcal A)\to(\mathbb R,\mathcal B(\mathbb R))$ is measurable for every $\vartheta\in\Theta$, and that $\vartheta\to f(\vartheta,\omega)$ is partially differentiable on $\Theta$ for every $\omega\in\Omega$. Define

$$ M_\vartheta(\omega) := 1_{\{f(\vartheta,\cdot)>0\}}(\omega)\,\Bigl( \tfrac{\partial}{\partial\vartheta_1}\log f\,,\ \dots\,,\ \tfrac{\partial}{\partial\vartheta_d}\log f \Bigr)^{\!\top}(\vartheta,\omega)\,,\qquad \vartheta\in\Theta\,,\ \omega\in\Omega\,. $$

This yields a well-defined random variable on $(\Omega,\mathcal A)$ taking values in $(\mathbb R^d,\mathcal B(\mathbb R^d))$. Assume further

$(*)$ for every $\vartheta\in\Theta$, we have $M_\vartheta\in L^2(P_\vartheta)$ and $E_\vartheta(M_\vartheta)=0$.

If all these conditions are satisfied, $M_\vartheta$ is termed score in $\vartheta$, and the covariance matrix under $P_\vartheta$

$$ I_\vartheta := \mathrm{Cov}_\vartheta(M_\vartheta) = E_\vartheta\bigl(M_\vartheta M_\vartheta^\top\bigr) $$

Fisher information in $\vartheta$. We then call $E$ an experiment admitting score and Fisher information.

(b) More generally, we may allow for modification of $M_\vartheta$ defined in (a) on sets of $P_\vartheta$-measure zero, and call any family of measurable mappings $\{\widetilde M_\vartheta:\vartheta\in\Theta\}$ from $(\Omega,\mathcal A)$ to $(\mathbb R^d,\mathcal B(\mathbb R^d))$ with the property

$$ \widetilde M_\vartheta = M_\vartheta \quad P_\vartheta\text{-almost surely, for every } \vartheta \text{ in } \Theta\,, $$

a version of the score in the statistical model $\{P_\vartheta:\vartheta\in\Theta\}$.
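As a quick numerical illustration of Definition 1.2 (a sketch added here, not part of the surrounding text: the Poisson model and all names in it are our own illustrative choices), one can check by Monte Carlo that for $P_\vartheta=\mathrm{Poisson}(\vartheta)$ the score $M_\vartheta(x)=x/\vartheta-1$ is centred and has second moment $I_\vartheta=1/\vartheta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5                                  # true parameter
x = rng.poisson(theta, size=10**6)

# score of Poisson(theta): d/dtheta log f(theta, x) = x/theta - 1
score = x / theta - 1.0

print("E_theta[M_theta]   ~", score.mean())       # close to 0
print("E_theta[M_theta^2] ~", (score**2).mean())  # close to 1/theta
print("I_theta = 1/theta  =", 1.0 / theta)
```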
Even if many classically studied parametric statistical models do admit score and Fisher information, it is indeed an assumption that densities $\vartheta\to f(\vartheta,\omega)$ should satisfy the smoothness conditions in Definition 1.2. Densities $\vartheta\to f(\vartheta,\omega)$ can be continuous and non-differentiable, their smoothness being e.g. the smoothness of the Brownian path; densities $\vartheta\to f(\vartheta,\omega)$ can be discontinuous, their jumps corresponding e.g. to the jumps of a Poisson process. For examples, see (1) and (3) in [17] (which goes back to [60]). We will start from the classical setting.

1.2' Remark. For a statistical model, score and information – if they exist – depend essentially on the choice of the parameterisation $\{P_\vartheta:\vartheta\in\Theta\}=\mathcal P$ for the model, but not on the choice of a dominating measure: in a dominated experiment, for different measures $\mu_1$ and $\mu_2$ dominating $\mathcal P$, we have with respect to $\mu_1+\mu_2$

$$ \frac{dP_\vartheta}{d\mu_1}\cdot\frac{d\mu_1}{d(\mu_1+\mu_2)} \;=\; \frac{dP_\vartheta}{d(\mu_1+\mu_2)} \;=\; \frac{dP_\vartheta}{d\mu_2}\cdot\frac{d\mu_2}{d(\mu_1+\mu_2)} $$

where the second factor on the r.h.s. and the second factor on the l.h.s. do not involve $\vartheta$, hence do not contribute to the score $(\nabla\log f)(\vartheta,\cdot)$.

1.3 Example. ($d=1$, one-parametric paths in nonparametric models) Fix any probability measure $F$ on $(\mathbb R,\mathcal B(\mathbb R))$, fix any function $h:(\mathbb R,\mathcal B(\mathbb R))\to(\mathbb R,\mathcal B(\mathbb R))$ with the properties

$(\diamond)$ $\displaystyle\int h\,dF=0\,,\qquad 0<\int h^2\,dF<\infty\,,$

and write $\mathcal F$ for the class of all probability measures on $(\mathbb R,\mathcal B(\mathbb R))$. We shall show that for suitable $\varepsilon>0$ and $\Theta:=(-\varepsilon,\varepsilon)$, there are parametric models

$$ \{F_\vartheta:\vartheta\in\Theta\}\subset\mathcal F \qquad\text{with}\qquad F_0=F $$

of mutually equivalent probability measures on $(\mathbb R,\mathcal B(\mathbb R))$ where score $M_\vartheta$ and Fisher information $I_\vartheta$ exist at every point $\vartheta\in\Theta$, and where we have at $\vartheta=0$

$(\diamond\diamond)$ $M_0(\omega)=h(\omega)\,,\qquad I_0=\displaystyle\int h^2\,dF\,.$

Such models are termed one-parametric paths through $F$ in direction $h$.

(a) We consider first the case of bounded functions $h$ in $(\diamond)$. Let $\sup_{x\in\mathbb R}|h(x)|\le M<\infty$ and put

$(*)$ $E=\bigl(\mathbb R,\mathcal B(\mathbb R),\{F_\vartheta:|\vartheta|<M^{-1}\}\bigr)\,,\qquad F_\vartheta(d\omega):=(1+\vartheta\,h(\omega))\,F(d\omega)\,.$

Then the family $(*)$ is dominated by $\mu:=F$, with strictly positive densities

$$ f(\vartheta,\omega)=\frac{dF_\vartheta}{dF}(\omega)=1+\vartheta\,h(\omega)\,,\qquad \vartheta\in\Theta\,,\ \omega\in\mathbb R\,, $$

hence all probability laws in $(*)$ are equivalent. For all $|\vartheta|<M^{-1}$, score $M_\vartheta$ and Fisher information $I_\vartheta$ according to Definition 1.2 exist and have the form

$(1.3')$ $\displaystyle M_\vartheta(\omega)=\frac{h(\omega)}{1+\vartheta h(\omega)}\,,\qquad I_\vartheta=\int\Bigl(\frac{h}{1+\vartheta h}\Bigr)^2 dF_\vartheta=\int\frac{h^2}{1+\vartheta h}\,dF<\infty$

which at $\vartheta=0$ gives $(\diamond\diamond)$ as indicated above.

(b) Now we consider general functions $h$ in $(\diamond)$. Select some truncation function $\psi\in C_0^1(\mathbb R)$, the class of continuously differentiable functions $\mathbb R\to\mathbb R$ with compact support, such that the properties

$$ \psi(x)=x \ \text{ on }\ \Bigl\{|x|<\frac13\Bigr\}\,,\qquad \psi(x)=0 \ \text{ on }\ \{|x|>1\}\,,\qquad \max_{x\in\mathbb R}|\psi|<\frac12 $$

are satisfied. Put

$(**)$ $E=\bigl(\mathbb R,\mathcal B(\mathbb R),\{F_\vartheta:|\vartheta|<1\}\bigr)\,,\qquad F_\vartheta(d\omega):=\Bigl(1+\Bigl[\psi(\vartheta h(\omega))-\displaystyle\int\psi(\vartheta h)\,dF\Bigr]\Bigr)\,F(d\omega)$

and note that in the special case of bounded $h$ as considered in (a) above, paths $(*)$ and $(**)$ coincide when $\vartheta$ ranges over some small neighbourhood of $0$. By choice of $\psi$, the densities

$$ f(\vartheta,\omega)=1+\Bigl[\psi(\vartheta h(\omega))-\int\psi(\vartheta h)\,dF\Bigr] $$

are strictly positive, hence all probability measures in $(**)$ are equivalent on $(\mathbb R,\mathcal B(\mathbb R))$. Since $\psi$ is in particular Lipschitz, dominated convergence shows

$$ \frac{d}{d\vartheta}\int\psi(\vartheta h)\,dF = \int h\,\psi'(\vartheta h)\,dF\,. $$

This gives scores $M_\vartheta$ at $|\vartheta|<1$

$$ M_\vartheta(\omega)=\frac{d}{d\vartheta}\log f(\vartheta,\omega)= \frac{h(\omega)\,\psi'(\vartheta h(\omega))-\int h\,\psi'(\vartheta h)\,dF}{1+\bigl[\psi(\vartheta h(\omega))-\int\psi(\vartheta h)\,dF\bigr]} $$

which by $(\diamond)$ belong to $L^2(F_\vartheta)$, and are centred under $F_\vartheta$ since

$$ \int \frac{h(\omega)\,\psi'(\vartheta h(\omega))-\int h\,\psi'(\vartheta h)\,dF}{1+\bigl[\psi(\vartheta h(\omega))-\int\psi(\vartheta h)\,dF\bigr]}\;F_\vartheta(d\omega) = \int \Bigl( h(\omega)\,\psi'(\vartheta h(\omega))-\int h\,\psi'(\vartheta h)\,dF \Bigr)\,F(d\omega) = 0\,. $$
Thus with $M_\vartheta$ as specified here, the pairs

$(1.3'')$ $\displaystyle \bigl(M_\vartheta\,,\ I_\vartheta\bigr)\,,\qquad I_\vartheta=\int \frac{\bigl(h\,\psi'(\vartheta h)-\int h\,\psi'(\vartheta h)\,dF\bigr)^2}{1+\bigl[\psi(\vartheta h)-\int\psi(\vartheta h)\,dF\bigr]}\;dF<\infty$

satisfy all assumptions of Definition 1.2 for all $|\vartheta|<1$. At $\vartheta=0$, expressions $(1.3'')$ reduce to $(\diamond\diamond)$.

1.4 Example. (location models) Fix a probability measure $F$ on $(\mathbb R,\mathcal B(\mathbb R))$ having density $f$ with respect to Lebesgue measure $\lambda$ such that

$f$ is differentiable on $\mathbb R$ with derivative $f'$,
$f$ is strictly positive on $(a,b)$, and $0$ outside,

for some open interval $(a,b)$ in $\mathbb R$ ($-\infty\le a$, $b\le+\infty$), and assume

$$ \int_{(a,b)}\Bigl(\frac{f'}{f}\Bigr)^2 dF<\infty\,. $$

Then the location model generated by $F$

$$ E:=\bigl(\mathbb R,\mathcal B(\mathbb R),\{F_\vartheta:\vartheta\in\mathbb R\}\bigr)\,,\qquad dF_\vartheta:=f(\cdot-\vartheta)\,d\lambda $$

satisfies all assumptions made in Definition 1.2, with score at $\vartheta$

$$ M_\vartheta(\omega)=-\,1_{(a+\vartheta,\,b+\vartheta)}(\omega)\,\frac{f'}{f}(\omega-\vartheta)\,,\qquad \omega\in\mathbb R $$

since $F_\vartheta$ is concentrated on the interval $(a+\vartheta,b+\vartheta)$, and Fisher information

$$ I_\vartheta=E_\vartheta\bigl(M_\vartheta^2\bigr)=\int_{(a,b)}\Bigl(\frac{f'}{f}\Bigr)^2 dF\,,\qquad \vartheta\in\mathbb R\,. $$

In particular, in the location model, the Fisher information does not depend on the parameter.

1.4' Exercise. In Example 1.4, we may consider $(a,b):=(0,1)$ and $f(x):=c(\alpha)\,1_{(0,1)}(x)\,[x(1-x)]^\alpha$ for parameter value $\alpha>1$ (we write $c(\alpha)$ for the norming constant of the Beta $B(\alpha+1,\alpha+1)$ density).

1.4'' Exercise. (Location-scale models) For laws $F$ on $(\mathbb R,\mathcal B(\mathbb R))$ with density $f$ differentiable on $\mathbb R$ and supported by some open interval $(a,b)$ as in Example 1.4, consider the location-scale model generated by $F$

$$ E:=\Bigl(\mathbb R,\mathcal B(\mathbb R),\{F_{(\vartheta_1,\vartheta_2)}:\vartheta_1\in\mathbb R\,,\ \vartheta_2>0\}\Bigr)\,,\qquad dF_{(\vartheta_1,\vartheta_2)}:=\frac1{\vartheta_2}\,f\Bigl(\frac{\cdot-\vartheta_1}{\vartheta_2}\Bigr)\,d\lambda\,. $$
This model is parameterised by $\Theta:=\{\vartheta=(\vartheta_1,\vartheta_2):\vartheta_1\in\mathbb R\,,\ \vartheta_2>0\}$. Show that the condition

$$ \int_{(a,b)}\Bigl(\frac{f'}{f}\Bigr)^2(x)\,\bigl[1+x^2\bigr]\;F(dx)<\infty $$

guarantees that the location-scale model admits score and Fisher information at every point in $\Theta$. Write $G(x):=\dfrac{f'(x)}{f(x)}$, $H(x):=\Bigl[1+\dfrac{x\,f'(x)}{f(x)}\Bigr]$, $a<x<b$, ...

1.5 Lemma. Let $E=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$, $\Theta\subset\mathbb R^d$ open, satisfy all assumptions of Definition 1.2, with score $\{M_\vartheta:\vartheta\in\Theta\}$ and Fisher information $\{I_\vartheta:\vartheta\in\Theta\}$.

(a) For every $n\ge1$, the product experiment

$$ E_n:=\Bigl(\ \underset{i=1}{\overset{n}{\times}}\Omega\ ,\ \underset{i=1}{\overset{n}{\bigotimes}}\mathcal A\ ,\ \bigl\{P_{n,\vartheta}:=\underset{i=1}{\overset{n}{\bigotimes}}P_\vartheta : \vartheta\in\Theta\bigr\}\Bigr) $$

admits score and Fisher information, given by

$$ M_{n,\vartheta}(\omega_1,\dots,\omega_n)=\sum_{i=1}^n M_\vartheta(\omega_i) \quad P_{n,\vartheta}\text{-almost surely}\,,\qquad I_{n,\vartheta}=n\,I_\vartheta\,. $$

(b) The analogous statement holds on the infinite product space $\bigl(\times_{i=1}^\infty\Omega,\bigotimes_{i=1}^\infty\mathcal A\bigr)$ for the laws $Q_\vartheta:=\bigotimes_{i=1}^\infty P_\vartheta$ in restriction to the sub-$\sigma$-fields $\mathcal F_n=\sigma(X_1,\dots,X_n)$, $n<\infty$, where $X_i$ denotes the $i$-th coordinate projection.

Proof. (1) With densities $f_\vartheta:=f(\vartheta,\cdot)$ as in Definition 1.2, the product measures $P_{n,\vartheta}$ have densities

$$ f_{n,\vartheta}(\omega_1,\dots,\omega_n)=\prod_{i=1}^n f_\vartheta(\omega_i) \qquad\text{with respect to}\quad \underset{i=1}{\overset{n}{\bigotimes}}\mu\,, $$

and

$$ A_{n,\vartheta}:=\{f_{n,\vartheta}>0\}=\underset{i=1}{\overset{n}{\times}}\{f_\vartheta>0\} \ \in\ \underset{i=1}{\overset{n}{\bigotimes}}\mathcal A\,. $$

On the product space $\bigl(\times_{i=1}^n\Omega,\bigotimes_{i=1}^n\mathcal A\bigr)$ we have with the conventions of 1.2(a)

$$ M_{n,\vartheta}(\omega_1,\dots,\omega_n)=1_{A_{n,\vartheta}}(\omega_1,\dots,\omega_n)\,\sum_{i=1}^n M_\vartheta(\omega_i)\,. $$

Since $P_{n,\vartheta}(A_{n,\vartheta})=1$, the measurable mappings

$$ (\omega_1,\dots,\omega_n)\ \to\ M_{n,\vartheta}(\omega_1,\dots,\omega_n)\,,\qquad (\omega_1,\dots,\omega_n)\ \to\ \sum_{i=1}^n M_\vartheta(\omega_i) $$

coincide $P_{n,\vartheta}$-almost surely on $\bigl(\times_{i=1}^n\Omega,\bigotimes_{i=1}^n\mathcal A\bigr)$, and are identified under $P_{n,\vartheta}$ according to Definition 1.2(b). We write $M_{\vartheta,j}$ for the components of $M_\vartheta$, $1\le j\le d$, and $(I_\vartheta)_{j,l}=E_\vartheta\bigl(M_{\vartheta,j}M_{\vartheta,l}\bigr)$.

(2) Under $P_{n,\vartheta}$, successive observations $\omega_1,\dots,\omega_n$ are independent, hence

$$ (\omega_1,\dots,\omega_n)\ \to\ M_\vartheta(\omega_i)\,,\qquad 1\le i\le n $$

are $\mathbb R^d$-valued i.i.d. random variables on $\bigl(\times_{i=1}^n\Omega,\bigotimes_{i=1}^n\mathcal A,P_{n,\vartheta}\bigr)$. On this space, we have for the components $M_{n,\vartheta,j}=1_{\{f_{n,\vartheta}>0\}}\bigl(\frac{\partial}{\partial\vartheta_j}\log f_{n,\vartheta}\bigr)$ of $M_{n,\vartheta}$

$$ E_{P_{n,\vartheta}}\bigl(M_{n,\vartheta,j}\,M_{n,\vartheta,l}\bigr) = \int \sum_{r=1}^n\sum_{k=1}^n M_{\vartheta,j}(\omega_r)\,M_{\vartheta,l}(\omega_k)\ P_{n,\vartheta}(d\omega_1,\dots,d\omega_n) = n\,(I_\vartheta)_{j,l} $$

using $(*)$ in Definition 1.2. This is (a).

(3) The proof of assertion (b) is analogous since on the infinite product space $\bigl(\times_{i=1}^\infty\Omega,\bigotimes_{i=1}^\infty\mathcal A\bigr)$, the laws $Q_\vartheta|_{\mathcal F_n}$ in restriction to sub-$\sigma$-fields $\mathcal F_n=\sigma(X_1,\dots,X_n)$ where $n<\infty$ are dominated by $\bigl[\bigotimes_{i=1}^\infty\mu\bigr]|_{\mathcal F_n}$ with density $(\omega_1,\omega_2,\dots)\to\prod_{i=1}^n f_\vartheta(\omega_i)$ as above. $\square$

The Fisher information yields bounds for the quality of estimators. We present two types of bounds.

1.6 Proposition (Cramér–Rao bound). Consider as in Definition 1.2 an experiment $E=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$, $\Theta\subset\mathbb R^d$ open, with score $\{M_\vartheta:\vartheta\in\Theta\}$ and Fisher information $\{I_\vartheta:\vartheta\in\Theta\}$. Consider a mapping $\tau:\Theta\to\mathbb R^k$ which is partially differentiable, and let $Y:(\Omega,\mathcal A)\to(\mathbb R^k,\mathcal B(\mathbb R^k))$ denote a square-integrable unbiased estimator for $\tau$.

(a) At points $\vartheta\in\Theta$ where the two conditions

$(+)$ the Fisher information matrix $I_\vartheta$ is invertible,

$(++)$ $\begin{pmatrix} \frac{\partial}{\partial\vartheta_1}\tau_1 & \dots & \frac{\partial}{\partial\vartheta_d}\tau_1\\ \dots & \dots & \dots\\ \frac{\partial}{\partial\vartheta_1}\tau_k & \dots & \frac{\partial}{\partial\vartheta_d}\tau_k \end{pmatrix}(\vartheta) \;=\; E_\vartheta\bigl(Y\,M_\vartheta^\top\bigr)$

are satisfied, we have a lower bound

$(\diamond)$ $\mathrm{Cov}_\vartheta(Y)\ \ge\ V_\vartheta\,I_\vartheta^{-1}\,V_\vartheta^\top$

at $\vartheta$, where $V_\vartheta$ denotes the Jacobi matrix of $\tau$ on the l.h.s. of condition $(++)$.

(b) In the setting of (a), $Y$ attains the bound $(\diamond)$ at $\vartheta$ (i.e. achieves $\mathrm{Cov}_\vartheta(Y)=V_\vartheta I_\vartheta^{-1}V_\vartheta^\top$) if and only if its estimation error at $\vartheta$ admits a representation

$$ Y-\tau(\vartheta)\;=\;V_\vartheta\,I_\vartheta^{-1}\,M_\vartheta \qquad P_\vartheta\text{-almost surely.} $$
(c) In the special case $\tau=\mathrm{id}$ in (a) and (b) above, the bound $(\diamond)$ at $\vartheta$ reads $\mathrm{Cov}_\vartheta(Y)\ge I_\vartheta^{-1}$, and $Y$ attains the bound $(\diamond)$ at $\vartheta$ if and only if

$$ Y-\vartheta=I_\vartheta^{-1}M_\vartheta \qquad P_\vartheta\text{-almost surely.} $$

Proof. According to Definition 1.2, we have $M_\vartheta\in L^2(P_\vartheta)$ and $E_\vartheta(M_\vartheta)=0$ for all $\vartheta\in\Theta$. Necessarily $I_\vartheta=E_\vartheta(M_\vartheta M_\vartheta^\top)$ is symmetric and non-negative definite for all $\vartheta\in\Theta$. We consider a point $\vartheta\in\Theta$ such that $(+)$ holds, together with a mapping $\tau:\Theta\to\mathbb R^k$ and a random variable $Y\in L^2(P_\vartheta)$ satisfying $E_\vartheta(Y)=\tau(\vartheta)$ and $(++)$.

(1) We start by defining $V_\vartheta$ by the right-hand side of $(++)$: $V_\vartheta:=E_\vartheta\bigl(Y M_\vartheta^\top\bigr)$. $\vartheta$ being fixed, introduce a random variable

$$ W:=\bigl(Y-E_\vartheta(Y)\bigr)-V_\vartheta\,I_\vartheta^{-1}\,M_\vartheta $$

taking values in $\mathbb R^k$. Then we have $W\in L^2(P_\vartheta)$ with $E_\vartheta(W)=0$ and

$$ \mathrm{Cov}_\vartheta(W)=E_\vartheta(WW^\top) = \mathrm{Cov}_\vartheta(Y) - E_\vartheta\bigl((Y-\tau(\vartheta))\,M_\vartheta^\top\bigr)\,I_\vartheta^{-1}V_\vartheta^\top - V_\vartheta I_\vartheta^{-1}\,E_\vartheta\bigl(M_\vartheta\,(Y-\tau(\vartheta))^\top\bigr) + V_\vartheta I_\vartheta^{-1}\,E_\vartheta\bigl(M_\vartheta M_\vartheta^\top\bigr)\,I_\vartheta^{-1}V_\vartheta^\top\,. $$

On the r.h.s. of this equation, $M_\vartheta$ being centred under $P_\vartheta$ and $E_\vartheta\bigl(YM_\vartheta^\top\bigr)=V_\vartheta$ by definition, both the second and the third summand reduce to $-V_\vartheta I_\vartheta^{-1}V_\vartheta^\top$. By definition of the Fisher information, the fourth summand on the r.h.s. is $+V_\vartheta I_\vartheta^{-1}V_\vartheta^\top$. In the sense of half-ordering of symmetric and non-negative definite matrices, we thus arrive at

$(\circ)$ $0\ \le\ \mathrm{Cov}_\vartheta(W)\ =\ \mathrm{Cov}_\vartheta(Y)-V_\vartheta\,I_\vartheta^{-1}\,V_\vartheta^\top$

which is (a). Equality in $(\circ)$ is possible only in the case where $W=E_\vartheta(W)=0$ $P_\vartheta$-almost surely. This is the 'only if' part of assertion (b), the 'if' part is obvious. So far, we did not make use of assumption $(++)$.

(2) We consider the special case $k=d$ and $\tau=\mathrm{id}$: here assumption $(++)$ is needed to identify $V_\vartheta$ with the $d\times d$ identity matrix. Then (c) is an immediate consequence of (a) and (b). $\square$

1.7 Remarks. (a) A purely heuristic argument for $(++)$: one should be allowed to differentiate

$$ \tau(\vartheta)=E_\vartheta(Y)=\int \mu(d\omega)\,f(\vartheta,\omega)\,Y(\omega) $$

with respect to $\vartheta$ under the integral sign, with partial derivatives

$$ \frac{\partial}{\partial\vartheta_j}E_\vartheta(Y_i) = \int \mu(d\omega)\,\frac{\partial}{\partial\vartheta_j}f(\vartheta,\omega)\,Y_i(\omega) = \int P_\vartheta(d\omega)\,\frac{\partial}{\partial\vartheta_j}\log f(\vartheta,\omega)\,Y_i(\omega) = E_\vartheta\bigl(Y M_\vartheta^\top\bigr)_{i,j}\,. $$

(b) Condition $(++)$ in 1.6 holds in naturally parameterised $d$-parametric exponential families when the parameter space is open (see Barra [4, Chap. X and formula (2) on p. 178], Witting [127, pp. 152–153], or van der Vaart [126, Lemma 4.5 on p. 38]): in this case, the score at $\vartheta$ is given by the canonical statistic of the exponential family centred under $\vartheta$. See also Barndorff–Nielsen [3, Chap. 2.4] or Küchler and Sørensen [77].

1.7' Remark. Within the class of unbiased and square-integrable estimators $Y$ for $\tau$, the covariance matrix $\mathrm{Cov}_\vartheta(Y)=E_\vartheta\bigl((Y-\tau(\vartheta))(Y-\tau(\vartheta))^\top\bigr)$, quantifying spread/concentration of estimation errors $Y-\tau(\vartheta)$ at $\vartheta$, allows to compare different estimators at the point $\vartheta$. The lower bound in $(\diamond)$ under the assumptions of Proposition 1.6 involves the inverse of the Fisher information $I_\vartheta^{-1}$ which thus may indicate an 'optimal concentration'; when $\tau=\mathrm{id}$ the lower bound in $(\diamond)$ equals $I_\vartheta^{-1}$. However, there are two serious drawbacks: (i) these bounds are attainable bounds only in a few classical parametric models, see [27, p. 198], or [127, pp. 312–317], and [86, Theorem 7.15 on p. 300]; (ii) unbiasedness is not 'per se' relevant for good estimation: a famous example due to Stein (see [60, p. 26] or [59, p. 93]) shows that even in a normal distribution model

$$ \Bigl(\ \underset{i=1}{\overset{n}{\times}}\mathbb R^k\ ,\ \underset{i=1}{\overset{n}{\bigotimes}}\mathcal B(\mathbb R^k)\ ,\ \bigl\{\underset{i=1}{\overset{n}{\bigotimes}}\mathcal N(\vartheta,I_k):\vartheta\in\Theta\bigr\}\ \Bigr)\,,\qquad \Theta:=\mathbb R^k \ \text{ where } k\ge3\,, $$

estimators admitting bias can be constructed which (with respect to squared loss) concentrate better around the true $\vartheta$ than the empirical mean, the best unbiased square-integrable estimator in this model.

The following bound, of a different nature, allows to consider arbitrary estimators for the unknown parameter. We give it in dimension $d=1$ only (multivariate generalisations exist, see [29]), and assuming strictly positive densities. The underlying idea is 'Bayesian': some probability measure $\pi(d\vartheta)$ playing on the parameter space $\Theta$ selects the true parameter $\vartheta\in\Theta$. Our proof follows Gill and Levit [29].

1.8 Proposition (van Trees inequality). Consider an experiment $E:=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$ where $\Theta$ is an open interval in $\mathbb R$, with strictly positive densities $f(\vartheta,\cdot)=\frac{dP_\vartheta}{d\mu}$ on $\Omega$ with respect to a dominating measure $\mu$, for all $\vartheta\in\Theta$. Assume that $E$, satisfying all assumptions of Definition 1.2, has score $\{M_\vartheta:\vartheta\in\Theta\}$ and Fisher information $\{I_\vartheta:\vartheta\in\Theta\}$, and consider a differentiable mapping $\tau:\Theta\to\mathbb R$.
Fix any subinterval $(a,b)$ of $\Theta$ and any a priori law $\pi$ with Lebesgue density $g:=\frac{d\pi}{d\lambda}$ such that

(i) $g$ is differentiable on $\mathbb R$, strictly positive on $(a,b)$, and $0$ else;

(ii) $\displaystyle\int_{(a,b)}\Bigl(\frac{g'}{g}\Bigr)^2 d\pi =: J<\infty$

hold. Whenever $a$ or $b$ coincide with a boundary point of $\Theta$, assume that $\tau:\Theta\to\mathbb R$ admits at this point a finite limit denoted by $\tau(a)$ or $\tau(b)$, and $f(\cdot,\omega)$ a finite limit denoted by $f(a,\omega)$ or $f(b,\omega)$ for fixed $\omega\in\Omega$. Then we have the bound

$$ \int_{(a,b)} E_\vartheta\bigl((T-\tau(\vartheta))^2\bigr)\,\pi(d\vartheta) \ \ge\ \frac{\Bigl(\int_{(a,b)}\tau'(\vartheta)\,\pi(d\vartheta)\Bigr)^2}{\int_{(a,b)} I_\vartheta\,\pi(d\vartheta)+J} $$

for arbitrary estimators $T$ for $\tau$ in the experiment $E$.

Proof. (0) Note that $J$ is the Fisher information in the location model generated by $\pi$ on $(\mathbb R,\mathcal B(\mathbb R))$ which satisfies all assumptions of Example 1.4.

(1) We show that the assumptions on the densities made in Definition 1.2 imply measurability of

$(*)$ $\Theta\times\Omega\ni(\vartheta,\omega)\ \to\ f(\vartheta,\omega)\in(0,\infty)$

when $\Theta\times\Omega$ is equipped with the product $\sigma$-field $\mathcal B(\Theta)\otimes\mathcal A$. In Definition 1.2, we have $\Theta$ open, and

$$ \forall\,\vartheta\in\Theta:\quad f(\vartheta,\cdot):(\Omega,\mathcal A)\to(\mathbb R,\mathcal B(\mathbb R)) \ \text{ is measurable,} $$
$$ \forall\,\omega\in\Omega:\quad f(\cdot,\omega):\Theta\to\mathbb R \ \text{ is continuous.} $$

Write $A_k$ for the set of all $j\in\mathbb Z$ such that $\bigl(\frac{j}{2^k},\frac{j+1}{2^k}\bigr]$ has a non-void intersection with $\Theta$, and select some point $\vartheta(k,j)$ in $\Theta\cap\bigl(\frac{j}{2^k},\frac{j+1}{2^k}\bigr]$ for $j\in A_k$. Then all mappings

$$ f_k(\vartheta,\omega):=\sum_{j\in A_k} 1_{(\frac{j}{2^k},\frac{j+1}{2^k}]\cap\Theta}(\vartheta)\ f(\vartheta(k,j),\omega)\,,\qquad k\ge1 $$

are measurable in the pair $(\vartheta,\omega)$, hence the same holds for their pointwise limit $(*)$ as $k\to\infty$.

(2) The product measurability established in $(*)$ allows to view

$$ (\vartheta,A)\ \to\ P_\vartheta(A)=\int 1_A(\omega)\,f(\vartheta,\omega)\,\mu(d\omega)\,,\qquad \vartheta\in\Theta\,,\ A\in\mathcal A $$

as a transition probability from $(\Theta,\mathcal B(\Theta))$ to $(\Omega,\mathcal A)$.
(3) We consider estimators $T:(\Omega,\mathcal A)\to(\mathbb R,\mathcal B(\mathbb R))$ for $\tau$ which have the property

$$ \int_{(a,b)} E_\vartheta\bigl((T-\tau(\vartheta))^2\bigr)\,\pi(d\vartheta)<\infty $$

(otherwise the bound in Proposition 1.8 would be trivial). To prove Proposition 1.8, it is sufficient to consider the restriction of the parameter space $\Theta$ to its subset $(a,b)$: thus we identify $\Theta$ with $(a,b)$ – then, by assumption, $g=\frac{d\pi}{d\lambda}$ will be strictly positive on $\Theta$ with limits $g(a)=g(b)=0$, and $\tau$ as well as $f(\cdot,\omega)$ for all $\omega$ will have finite limits at the endpoints of $\Theta$ – and work on the product space

$$ \bigl(\overline\Omega,\overline{\mathcal A}\bigr):=\bigl(\Theta\times\Omega\,,\ \mathcal B(\Theta)\otimes\mathcal A\bigr) $$

equipped with the probability measure

$$ \overline P(d\vartheta,d\omega):=\pi(d\vartheta)\,P_\vartheta(d\omega)=(\lambda\otimes\mu)(d\vartheta,d\omega)\ g(\vartheta)f(\vartheta,\omega)\,,\qquad \vartheta\in\Theta\,,\ \omega\in\Omega\,. $$

(4) In the following steps, we write $'$ for the derivative with respect to the parameter (from the set of assumptions in Definition 1.2, recall differentiability of $\vartheta\to f(\vartheta,\omega)$ for fixed $\omega$ when $d=1$). Then

$(+)$ $\displaystyle\int_\Theta d\vartheta\ \bigl(f(\vartheta,\omega)g(\vartheta)\bigr)'=f(b,\omega)g(b)-f(a,\omega)g(a)=0$

for all $\omega\in\Omega$ since $g(a)=0=g(b)$ by our assumption, together with

$(++)$ $\displaystyle\int_\Theta d\vartheta\ \tau(\vartheta)\,\bigl(f(\vartheta,\omega)g(\vartheta)\bigr)'=-\int_\Theta d\vartheta\ \tau'(\vartheta)\,\bigl(f(\vartheta,\omega)g(\vartheta)\bigr)$

by partial integration. Combining $(+)$ and $(++)$ we get the equation

$$ \int_{\Theta\times\Omega}(\lambda\otimes\mu)(d\vartheta,d\omega)\ \bigl(f(\vartheta,\omega)g(\vartheta)\bigr)'\,\bigl(T(\omega)-\tau(\vartheta)\bigr) = 0+\int\mu(d\omega)\int_\Theta d\vartheta\ \tau'(\vartheta)\,f(\vartheta,\omega)\,g(\vartheta) = \int_\Theta \pi(d\vartheta)\,\tau'(\vartheta)\,. $$

By strict positivity of the densities and strict positivity of $g$ on $\Theta$, the l.h.s. of the first equality sign is

$$ \int_{\Theta\times\Omega}\pi(d\vartheta)P_\vartheta(d\omega)\ \frac{\bigl(f(\vartheta,\omega)g(\vartheta)\bigr)'}{f(\vartheta,\omega)g(\vartheta)}\,\bigl(T(\omega)-\tau(\vartheta)\bigr) = \int_{\Theta\times\Omega}\overline P(d\vartheta,d\omega)\ \Bigl(\frac{g'}{g}(\vartheta)+\frac{f'}{f}(\vartheta,\omega)\Bigr)\bigl(T(\omega)-\tau(\vartheta)\bigr)\,. $$

In the last integrand, both factors

$$ \Bigl(\frac{g'}{g}(\vartheta)+\frac{f'}{f}(\vartheta,\omega)\Bigr)\,,\qquad \bigl(T(\omega)-\tau(\vartheta)\bigr) $$

are in $L^2(\overline P)$: the second is the estimation error of an estimator $T$ for $\tau$ which at the start of step (3) was assumed to be in $L^2(\overline P)$, the first is the sum of the score in the location experiment generated by $\pi$ and the score $\{M_\vartheta:\vartheta\in\Theta\}$ in the experiment $E$, both necessarily orthogonal in $L^2(\overline P)$:

$(+++)$ $\displaystyle\int_{\Theta\times\Omega}\overline P(d\vartheta,d\omega)\ \frac{g'}{g}(\vartheta)\,\frac{f'}{f}(\vartheta,\omega) = \int_\Theta\pi(d\vartheta)\,\frac{g'}{g}(\vartheta)\int_\Omega P_\vartheta(d\omega)\,\frac{f'}{f}(\vartheta,\omega)=0\,.$

Putting the last three blocks of arguments together, the Cauchy–Schwarz inequality with $(+++)$ gives

$$ \Bigl(\int_\Theta\pi(d\vartheta)\,\tau'(\vartheta)\Bigr)^2 = \Bigl(\int_{\Theta\times\Omega}\overline P(d\vartheta,d\omega)\,\Bigl(\frac{g'}{g}(\vartheta)+\frac{f'}{f}(\vartheta,\omega)\Bigr)\bigl(T(\omega)-\tau(\vartheta)\bigr)\Bigr)^2 $$
$$ \le\ \int_{\Theta\times\Omega}\overline P(d\vartheta,d\omega)\,\Bigl(\frac{g'}{g}(\vartheta)+\frac{f'}{f}(\vartheta,\omega)\Bigr)^2 \ \cdot\ \int_{\Theta\times\Omega}\overline P(d\vartheta,d\omega)\,\bigl(T(\omega)-\tau(\vartheta)\bigr)^2 = \Bigl(J+\int_\Theta\pi(d\vartheta)\,I_\vartheta\Bigr)\ \int_\Theta\pi(d\vartheta)\,E_\vartheta\bigl((T-\tau(\vartheta))^2\bigr) $$

which is the assertion. $\square$
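A numerical sanity check (a sketch added here; the single-observation model $\mathcal N(\vartheta,1)$, the cosine-squared prior on $(-1,1)$, and the crude estimator $T(x)=x$ are all our own illustrative assumptions) evaluates both sides of the van Trees bound for $\tau=\mathrm{id}$, where $I_\vartheta\equiv1$ and $g'/g=-\pi\tan(\pi u/2)$ in closed form:

```python
import numpy as np

def trap(y, x):
    # plain trapezoidal rule (avoids NumPy version differences around np.trapz)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# prior g(u) = cos^2(pi*u/2) on (-1,1): differentiable on R, zero outside
u = np.linspace(-1 + 1e-6, 1 - 1e-6, 200_001)
g = np.cos(np.pi * u / 2) ** 2
g = g / trap(g, u)                                       # normalise
J = trap((np.pi * np.tan(np.pi * u / 2)) ** 2 * g, u)    # J = int (g'/g)^2 dpi = pi^2

bound = 1.0 / (1.0 + J)        # van Trees bound: (int tau' dpi)^2 / (int I dpi + J)

rng = np.random.default_rng(1)
theta = rng.choice(u, p=g / g.sum(), size=10**5)         # theta ~ prior (discretised)
x = theta + rng.standard_normal(theta.size)              # one observation per theta
risk = np.mean((x - theta) ** 2)                         # Bayes risk of T(x) = x

print(f"bound = {bound:.4f} (J ~ {J:.3f}, pi^2 = {np.pi**2:.3f}); risk of T=x: {risk:.4f}")
```

As expected, the risk of $T(x)=x$ (namely $1$) lies well above the bound $1/(1+\pi^2)\approx0.092$; the posterior mean of the next exercise comes much closer to it.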
1.8' Exercise. For $\Theta\subset\mathbb R^d$ open, assuming densities which are continuous in the parameter, cover $\mathbb R^d$ with half-open cubes of side length $2^{-k}$ to prove that

$$ \Theta\times\Omega\ni(\vartheta,\omega)\ \to\ f(\vartheta,\omega)\in[0,\infty) \quad\text{is } \bigl(\mathcal B(\Theta)\otimes\mathcal A\bigr)\text{–}\mathcal B([0,\infty))\text{-measurable}\,. $$

In the case where $d=1$, the assertion holds under right-continuity (or left-continuity) of the densities in the parameter.

1.8'' Exercise. For $\Theta\subset\mathbb R^d$ open, for product measurable densities $(\vartheta,\omega)\to f(\vartheta,\omega)$ with respect to some dominating measure $\mu$ on $(\Omega,\mathcal A)$, consider as in Proposition 1.8 the product space $(\overline\Omega,\overline{\mathcal A})=(\Theta\times\Omega,\mathcal B(\Theta)\otimes\mathcal A)$ equipped with

$$ \overline P(d\vartheta,d\omega)=\pi(d\vartheta)\,P_\vartheta(d\omega)=(\lambda\otimes\mu)(d\vartheta,d\omega)\ g(\vartheta)f(\vartheta,\omega)\,,\qquad \vartheta\in\Theta\,,\ \omega\in\Omega $$

where $\pi$ is some probability law on $(\Theta,\mathcal B(\Theta))$ with Lebesgue density $g$, and view $(\vartheta,\omega)\to\vartheta$ and $(\vartheta,\omega)\to\omega$ as random variables on $(\overline\Omega,\overline{\mathcal A})$. We wish to estimate some measurable mapping $\tau:(\Theta,\mathcal B(\Theta))\to(\mathbb R^k,\mathcal B(\mathbb R^k))$: fixing some loss function $\ell:\mathbb R^k\to[0,\infty)$, we call any estimator $T:(\Omega,\mathcal A)\to(\mathbb R^k,\mathcal B(\mathbb R^k))$ with the property

$$ \int_\Theta \pi(d\vartheta)\,E_\vartheta\bigl(\ell(T-\tau(\vartheta))\bigr) = \inf_{\widetilde T\ \mathcal A\text{-mb}}\ \int_\Theta \pi(d\vartheta)\,E_\vartheta\bigl(\ell(\widetilde T-\tau(\vartheta))\bigr) $$

$\ell$-Bayesian with respect to the a priori law $\pi$. Here 'inf' is over the class of all measurable mappings $\widetilde T:(\Omega,\mathcal A)\to(\mathbb R^k,\mathcal B(\mathbb R^k))$, i.e. over the class of all possible estimators for $\tau$. So far, we leave open questions of existence (see Section 37 in Strasser [121]). In the case where $\ell(y)=|y|^2$ and $\tau\in L^2(\pi)$, prove that a squared-loss Bayesian exists and is given by

$$ T(\omega)=\begin{cases} \displaystyle\int_\Theta \tau(\eta)\,\frac{f(\eta,\omega)\,g(\eta)}{\int_\Theta f(\zeta,\omega)\,g(\zeta)\,d\zeta}\;d\eta & \text{if } \displaystyle\int_\Theta f(\zeta,\omega)\,g(\zeta)\,d\zeta>0\,,\\[2mm] \vartheta_0 & \text{if } \displaystyle\int_\Theta f(\zeta,\omega)\,g(\zeta)\,d\zeta=0 \end{cases} $$

with arbitrary default value $\vartheta_0\in\Theta$.

Hint: In a squared-loss setting, the Bayes property reduces to the $L^2$-projection property of conditional expectations in $(\overline\Omega,\overline{\mathcal A},\overline P)$. Writing down conditional densities, we have a regular version of the conditional law of $\vartheta:(\vartheta,\omega)\to\vartheta$ given $\omega:(\vartheta,\omega)\to\omega$. The random variable $\tau:(\vartheta,\omega)\to\tau(\vartheta)$ belonging to $L^2(\overline P)$, the conditional expectation of $\tau$ given $\omega$ under $\overline P$ is the integral of $\tau(\cdot)$ with respect to this conditional law.
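A minimal sketch of the squared-loss Bayes estimator of Exercise 1.8'' (the grid discretisation, the $\mathcal N(\vartheta,1)$ likelihood and the cosine-squared prior are assumptions for the example, not part of the exercise):

```python
import numpy as np

def trap(y, x):
    # simple trapezoidal rule, kept self-contained
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def bayes_estimate(omega, theta_grid, f, g, default=0.0):
    """Squared-loss Bayes estimate (posterior mean) for tau = id,
    computed by discretising Theta; cf. Exercise 1.8''."""
    w = f(theta_grid, omega) * g(theta_grid)   # f(., omega) times prior density
    z = trap(w, theta_grid)                    # normalising constant
    if z <= 0.0:
        return default                         # arbitrary default value theta_0
    return trap(theta_grid * w, theta_grid) / z

# illustrative choices: N(theta,1) likelihood, cos^2 prior on (-1,1)
grid = np.linspace(-1.0, 1.0, 4001)
f = lambda th, x: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2.0 * np.pi)
g = lambda th: np.cos(np.pi * th / 2.0) ** 2
print("T(0.3) =", bayes_estimate(0.3, grid, f, g))
```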
1.2 Estimator Sequences, Asymptotics of Information Bounds
1.9 Definition. Consider a sequence of experiments

$$ E_n=\bigl(\Omega_n,\mathcal A_n,\{P_{n,\vartheta}:\vartheta\in\Theta\}\bigr)\,,\qquad n\ge1 $$

parameterised by the same parameter set $\Theta\subset\mathbb R^d$ which does not depend on $n$, and a mapping $\tau:\Theta\to\mathbb R^k$. An estimator sequence for $\tau$ is a sequence $(Y_n)_n$ of measurable mappings $Y_n:(\Omega_n,\mathcal A_n)\to(\mathbb R^k,\mathcal B(\mathbb R^k))$, $n\ge1$.

(a) An estimator sequence $(Y_n)_n$ is called consistent for $\tau$ if

$$ \text{for every }\vartheta\in\Theta\,,\text{ every }\varepsilon>0:\qquad \lim_{n\to\infty}P_{n,\vartheta}\bigl(|Y_n-\tau(\vartheta)|>\varepsilon\bigr)=0 $$

(convergence in $(P_{n,\vartheta})_n$-probability of the sequence $(Y_n)_n$ to $\tau(\vartheta)$, for every $\vartheta\in\Theta$).

(b) Associate sequences $(\varphi_n(\vartheta))_n$ to parameter values $\vartheta\in\Theta$, either taking values in $(0,\infty)$ and such that $\varphi_n(\vartheta)$ increases to $\infty$ as $n\to\infty$, or taking values in the space of invertible $d\times d$ matrices such that minimal eigenvalues $\lambda_n(\vartheta)$ of $\varphi_n(\vartheta)$ increase to $\infty$ as $n\to\infty$. Then an estimator sequence $(Y_n)_n$ for $\tau$ is called $(\varphi_n)_n$-consistent if

$$ \text{for every }\vartheta\in\Theta\,,\qquad \bigl\{\mathcal L\bigl(\varphi_n(\vartheta)\,(Y_n-\tau(\vartheta))\ \big|\ P_{n,\vartheta}\bigr):n\ge1\bigr\} \ \text{ is tight in }\mathbb R^k\,. $$

(c) $(\varphi_n)_n$-consistent estimator sequences $(Y_n)_n$ for $\tau$ are called asymptotically normal if

$$ \text{for every }\vartheta\in\Theta\,,\qquad \mathcal L\bigl(\varphi_n(\vartheta)\,(Y_n-\tau(\vartheta))\ \big|\ P_{n,\vartheta}\bigr)\ \to\ \mathcal N(0,\Sigma(\vartheta)) \quad\text{as } n\to\infty $$

(weak convergence in $\mathbb R^k$), for suitable normal distributions $\mathcal N(0,\Sigma(\vartheta))$, $\vartheta\in\Theta$.

For the remaining part of this section we focus on independent replication of an experiment $E:=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$, $\Theta\subset\mathbb R^d$ open, which satisfies all assumptions of Definition 1.2, with score $\{M_\vartheta:\vartheta\in\Theta\}$ and Fisher information $\{I_\vartheta:\vartheta\in\Theta\}$. We consider for $n\to\infty$ the sequence of product models

$(\diamond)$ $\displaystyle E_n:=\bigl(\Omega_n,\mathcal A_n,\{P_{n,\vartheta}:\vartheta\in\Theta\}\bigr) = \Bigl(\ \underset{i=1}{\overset{n}{\times}}\Omega\ ,\ \underset{i=1}{\overset{n}{\bigotimes}}\mathcal A\ ,\ \bigl\{P_{n,\vartheta}:=\underset{i=1}{\overset{n}{\bigotimes}}P_\vartheta:\vartheta\in\Theta\bigr\}\Bigr)$

as in Lemma 1.5(a), with score $M_{n,\vartheta}$ in $\vartheta$ and information $I_{n,\vartheta}=n\,I_\vartheta$. In this setting we present some asymptotic lower bounds for the risk of estimators, in terms of the Fisher information.

1.10 Remark (Asymptotic Cramér–Rao Bound). Consider $(E_n)_n$ as in $(\diamond)$ and assume that $I_\vartheta$ is invertible for all $\vartheta\in\Theta$. Let $(T_n)_n$ denote some sequence of unbiased and square-integrable estimators for the unknown parameter, $\sqrt n$-consistent and asymptotically normal:

$(\circ)$ for every $\vartheta\in\Theta$: $\quad\mathcal L\bigl(\sqrt n\,(T_n-\vartheta)\ \big|\ P_{n,\vartheta}\bigr)\ \to\ \mathcal N(0,\Sigma(\vartheta))\,,\quad n\to\infty$

(weak convergence in $\mathbb R^d$). The Cramér–Rao bound in $E_n$

$$ E_\vartheta\bigl( [\sqrt n\,(T_n-\vartheta)]\,[\sqrt n\,(T_n-\vartheta)]^\top \bigr) = n\,\mathrm{Cov}_\vartheta(T_n)\ \ge\ n\,I_{n,\vartheta}^{-1} = I_\vartheta^{-1} $$

makes an 'optimal' limit variance $\Sigma(\vartheta)=I_\vartheta^{-1}$ appear in $(\circ)$, for every $\vartheta\in\Theta$. Given one estimator sequence $(T_n)_n$ whose rescaled estimation errors at $\vartheta$ attain the limit law $\mathcal N\bigl(0,I_\vartheta^{-1}\bigr)$, one would like to call this sequence 'optimal'. The problem is that Cramér–Rao bounds do not allow for comparison within a sufficiently broad class of competing estimator sequences. Fix $\vartheta\in\Theta$. Except for unbiasedness of estimators at $\vartheta$, Cramér–Rao needs the assumption $(++)$ in Proposition 1.6 for $\tau=\mathrm{id}$, and needs $(*)$ from Definition 1.2: both last assumptions

$$ E_\vartheta(M_{n,\vartheta})=0\,,\qquad E_\vartheta\bigl(T_n\,M_{n,\vartheta}^\top\bigr)=I\,,\qquad n\ge1 $$

(with $0$ the zero vector in $\mathbb R^d$ and $I$ the identity matrix in $\mathbb R^{d\times d}$) combine in particular to

$(*)$ $E_\vartheta\bigl([T_n-\vartheta]\,M_{n,\vartheta}^\top\bigr)=I$ for all $n\ge1$.

Thus from the very beginning, condition $(++)$ of Proposition 1.6 for $\tau=\mathrm{id}$ establishes a close connection between the sequence of scores $(M_{n,\vartheta})_n$ on the one hand and those estimator sequences $(T_n)_n$ to which we may apply Cramér–Rao on the other. Hence the Cramér–Rao setting turns out to be a restricted setting.

1.11 Theorem (Asymptotic van Trees bounds). Consider an experiment $E:=(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\})$ where $\Theta$ is an open interval in $\mathbb R$, with strictly positive densities $f(\vartheta,\cdot)=\frac{dP_\vartheta}{d\mu}$ on $\Omega$ with respect to a dominating measure $\mu$, for all $\vartheta\in\Theta$. Assume that $E$, satisfying all assumptions of Definition 1.2, admits score $\{M_\vartheta:\vartheta\in\Theta\}$ and Fisher information $\{I_\vartheta:\vartheta\in\Theta\}$, and assume in addition

$(\diamond\diamond)$ $\Theta\ni\vartheta\ \to\ I_\vartheta\in(0,\infty)$ is continuous.

Then for independent replication $(\diamond)$ of the experiment $E$, for arbitrary choice of estimators $T_n$ for the unknown parameter $\vartheta\in\Theta$ in the product experiments $E_n$, we have, for every $\vartheta_0\in\Theta$, the two bounds $(I)$ and $(II)$

$(I)$ $\displaystyle \lim_{c\downarrow0}\ \liminf_{n\to\infty}\ \inf_{T_n\,\mathcal A_n\text{-mb}}\ \sup_{|\vartheta-\vartheta_0|<c}\ E_\vartheta\Bigl(\bigl[\sqrt n\,(T_n-\vartheta)\bigr]^2\Bigr)\ \ge\ I_{\vartheta_0}^{-1}$

$(II)$ $\displaystyle \lim_{C\uparrow\infty}\ \liminf_{n\to\infty}\ \inf_{T_n\,\mathcal A_n\text{-mb}}\ \sup_{|\vartheta-\vartheta_0|\le C n^{-1/2}}\ E_\vartheta\Bigl(\bigl[\sqrt n\,(T_n-\vartheta)\bigr]^2\Bigr)\ \ge\ I_{\vartheta_0}^{-1}$

where 'inf' is taken over all $\mathcal A_n$-measurable estimators $T_n$ ('$T_n$ $\mathcal A_n$-mb') in $E_n$.

[...]

1.4 Consistency of ML Estimators via Hellinger Distances

[...] Suppose that for every $\vartheta\in\Theta$ and $\varepsilon>0$ one can find a finite collection of open subsets $V_1,\dots,V_l$ of $\Theta$ (with $l\in\mathbb N$ and $V_1,\dots,V_l$ depending on $\vartheta$ and $\varepsilon$) such that $(*)$ and $(**)$ hold:
$(*)$ $\displaystyle \vartheta\ \notin\ \bigcup_{i=1}^l V_i \qquad\text{and}\qquad \bigl\{\eta\in\Theta:|\eta-\vartheta|>\varepsilon\bigr\}\ \subset\ \bigcup_{i=1}^l V_i\,,$

$(**)$ $\displaystyle \sup_{\eta\in V_j}\,\log\frac{f(\eta,\cdot)}{f(\vartheta,\cdot)}\ \in\ L^1(P_\vartheta) \qquad\text{and}\qquad E_\vartheta\Bigl(\sup_{\eta\in V_j}\,\log\frac{f(\eta,\cdot)}{f(\vartheta,\cdot)}\Bigr)<0\,,$
for $j=1,\dots,l$. Then every ML sequence for the unknown parameter is consistent.

Proof. (i) Given any ML sequence $(T_n)_n$ for the unknown parameter with 'good sets' $(A_n)_n$ in the sense of Definition 1.17(b), fix $\vartheta\in\Theta$ and $\varepsilon>0$ and select $l$ and $V_1,\dots,V_l$ according to $(*)$. Then

$$ A_n\cap\{\,|T_n-\vartheta|\le\varepsilon\,\}\ \supset\ A_n\cap\bigcap_{j=1}^l\Bigl\{\ \sup_{\eta\in V_j}\log f_n(\eta,\cdot)\ <\ \sup_{\eta\in\Theta:|\eta-\vartheta|\le\varepsilon}\log f_n(\eta,\cdot)\ \Bigr\} $$

for every $n$ fixed, by Definition 1.17(b) and since all densities are strictly positive. Inverting this,

$$ \{\,|T_n-\vartheta|>\varepsilon\,\}\ \subset\ \bigcup_{j=1}^l\Bigl\{\ \sup_{\eta\in V_j}\log f_n(\eta,\cdot)\ \ge\ \sup_{\eta\in\Theta:|\eta-\vartheta|\le\varepsilon}\log f_n(\eta,\cdot)\ \Bigr\}\ \cup\ A_n^c \ \subset\ \bigcup_{j=1}^l\Bigl\{\ \sup_{\eta\in V_j}\log f_n(\eta,\cdot)\ \ge\ \log f_n(\vartheta,\cdot)\ \Bigr\}\ \cup\ A_n^c\,. $$

(ii) In the sequence of product experiments $(E_n)_n$, the $(A_n)_n$ being 'good sets' for $(T_n)_n$, we use the strong law of large numbers thanks to assumption $(**)$ to show that

$$ P_{n,\vartheta}\Bigl(\Bigl\{\,(\omega_1,\dots,\omega_n):\ \frac1n\sum_{i=1}^n\ \sup_{\eta\in V_j}\log\frac{f(\eta,\omega_i)}{f(\vartheta,\omega_i)}\ \ge\ 0\,\Bigr\}\Bigr)\ \to\ 0 $$

as $n\to\infty$ for every $j$, $1\le j\le l$: this gives $\displaystyle\lim_{n\to\infty}P_{n,\vartheta}\bigl(|T_n-\vartheta|>\varepsilon\bigr)=0$. $\square$
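To see this Kullback-divergence-type consistency at work numerically (a sketch added here; the $\mathcal N(\vartheta,1)$ location model, whose ML estimator is the empirical mean, is our own illustrative assumption), one can estimate $P_{n,\vartheta}(|T_n-\vartheta|>\varepsilon)$ by simulation and watch it vanish as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, eps, reps = 0.7, 0.2, 20_000

for n in (10, 50, 250, 1250):
    x = rng.normal(theta, 1.0, size=(reps, n))
    mle = x.mean(axis=1)                     # ML estimator in the N(theta,1) model
    p = np.mean(np.abs(mle - theta) > eps)   # MC estimate of P(|T_n - theta| > eps)
    print(f"n = {n:5d}   P(|T_n - theta| > {eps}) ~ {p:.4f}")
```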
In order to introduce a geometry on the parametric statistical experiment, to be distinguished from Euclidean geometry on the parameter set $\Theta\subset\mathbb R^d$, we make use of the Hellinger distance $H(\cdot,\cdot)$ which defines a metric on the space of all probability measures on $(\Omega,\mathcal A)$; see e.g. [124, Chap. 2.4] or [121, Chap. I.2]. Recall that countable collections of probability measures on the same space are dominated.

1.18 Definition. The Hellinger distance $H(\cdot,\cdot)$ between probability measures $Q_1$ and $Q_2$ on $(\Omega,\mathcal A)$ is defined as the square root of

$$ H^2(Q_1,Q_2):=\frac12\int\bigl|\,g_1^{1/2}-g_2^{1/2}\,\bigr|^2\,d\nu\ \in\ [0,1] $$

where $g_i=\frac{dQ_i}{d\nu}$ are densities with respect to a dominating measure $\nu$, $i=1,2$. The affinity $A(\cdot,\cdot)$ between $Q_1$ and $Q_2$ is defined by

$$ A(Q_1,Q_2):=1-H^2(Q_1,Q_2)=\int g_1^{1/2}\,g_2^{1/2}\,d\nu\,. $$
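A small numerical check of Definition 1.18 (an added illustration; the choice of two unit-variance normals is an assumption for the example): for $Q_i=\mathcal N(\mu_i,1)$ the affinity has the closed form $A=\exp(-(\mu_1-\mu_2)^2/8)$, which simple quadrature reproduces.

```python
import numpy as np

def hellinger2(g1, g2, x):
    """H^2(Q1,Q2) = 1/2 * int (sqrt(g1) - sqrt(g2))^2 dnu, on a grid x."""
    y = (np.sqrt(g1) - np.sqrt(g2)) ** 2
    return 0.5 * float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(-10.0, 10.0, 200_001)
phi = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
m1, m2 = 0.0, 1.5

H2 = hellinger2(phi(m1), phi(m2), x)
print("affinity 1 - H^2   =", 1.0 - H2)
print("exp(-(m1-m2)^2/8)  =", np.exp(-((m1 - m2) ** 2) / 8.0))
```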
The integrals in Definition 1.18 do not depend on the choice of a dominating measure for $Q_1$ and $Q_2$ (similar to Remark 1.2'). We have $H(Q,Q')=0$ if and only if the probability measures $Q,Q'$ coincide, and $H(Q,Q')=1$ if and only if $Q,Q'$ are mutually singular. Below, we focus on i.i.d. models $E_n$ and follow Ibragimov and Khasminskii's route [60, Chap. 1.4] to consistency of maximum likelihood estimators under conditions on the Hellinger geometry in the single experiment $E$. For the remaining part of this section, the following assumptions will be in force. Note that we do not assume equivalence of probability laws, and do not assume continuity of densities in the parameter for fixed $\omega$.

1.19 Notations and Assumptions. (a) (i) In the single experiment

$$ E=\bigl(\Omega,\mathcal A,\{P_\vartheta:\vartheta\in\Theta\}\bigr)\,,\qquad \Theta\subset\mathbb R^d \text{ open} $$

we have densities $f_\vartheta=\frac{dP_\vartheta}{d\mu}$ with respect to a dominating measure $\mu$. For $\eta\ne\vartheta$, the likelihood ratio of $P_\eta$ with respect to $P_\vartheta$ is

$$ L^{\eta/\vartheta}=\frac{f_\eta}{f_\vartheta}\,1_{\{f_\vartheta>0\}}+\infty\cdot1_{\{f_\vartheta=0\}}\,. $$

(ii) We write $\mathcal K$ for the class of compact sets in $\mathbb R^d$ which are contained in $\Theta$. A compact exhaustion of $\Theta$ is a sequence $(K_m)_m$ in $\mathcal K$ such that $K_m\subset\mathrm{int}(K_{m+1})$ for all $m$, and $\bigcup_m K_m=\Theta$. With $B_\rho(\vartheta)$ the open ball of radius $\rho$ centred at $\vartheta$, we introduce the notations

$$ h(\vartheta,\rho,K):=\inf_{\vartheta'\in K\setminus B_\rho(\vartheta)} H^2(P_{\vartheta'},P_\vartheta)\,,\qquad K\in\mathcal K\,,\ \vartheta\in K\,,\ \rho>0\,, $$

$$ a(\vartheta,K^c):=\int\ \sup_{\vartheta'\in K^c\cap\Theta} f_{\vartheta'}^{1/2}\ \cdot\ f_\vartheta^{1/2}\ d\mu\,,\qquad K\in\mathcal K\,,\ \vartheta\in\mathrm{int}(K)\,, $$

$$ w(\vartheta,\delta):=\Bigl(\int\ \sup_{\vartheta'\in B_\delta(\vartheta)\cap\Theta}\bigl|\,f_{\vartheta'}^{1/2}-f_\vartheta^{1/2}\,\bigr|^2\ d\mu\Bigr)^{1/2}\,,\qquad \vartheta\in\Theta\,,\ \delta>0\,. $$

(iii) For the single experiment $E$, we assume a condition

$(+)$ $\displaystyle\lim_{\delta\downarrow0}\,w(\vartheta,\delta)=0$ for every $\vartheta\in\Theta$

together with

$(++)$ $\displaystyle\lim_{m\to\infty}\,a(\vartheta,K_m^c)<1$ for every $\vartheta\in\Theta$, for some compact exhaustion $(K_m)_m$ of $\Theta$.

(b) We consider the sequence of i.i.d. product experiments $E_n$ as in $(\diamond)$ of Section 1.2, write $P_{n,\vartheta}=\bigotimes_{i=1}^n P_\vartheta$ with expectation $E_{n,\vartheta}$, densities $f_{n,\vartheta}(\omega_1,\dots,\omega_n)=\prod_{i=1}^n f(\vartheta,\omega_i)$, and likelihood ratios $L_n^{\eta/\vartheta}$ of $P_{n,\eta}$ with respect to $P_{n,\vartheta}$.

1.20 Lemma. Under Assumptions 1.19, for $K\in\mathcal K$ with $\vartheta\in\mathrm{int}(K)$ and $V:=K^c\cap\Theta$, we have

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in V}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ \bigl(a(\vartheta,K^c)\bigr)^n\,,\qquad n\ge1\,. $$

Proof. On $\{f_{n,\vartheta}>0\}$,

$$ \sup_{\eta\in V}\sqrt{L_n^{\eta/\vartheta}}\,(\omega_1,\dots,\omega_n) = \sup_{\eta\in V}\ \prod_{i=1}^n\frac{f^{1/2}(\eta,\omega_i)}{f^{1/2}(\vartheta,\omega_i)} \ \le\ \prod_{i=1}^n\Bigl(\sup_{\eta\in V}f^{1/2}(\eta,\omega_i)\Bigr)\,\frac1{f^{1/2}(\vartheta,\omega_i)} $$

and thus

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in V}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ \Bigl(\int\ \sup_{\eta\in V}f_\eta^{1/2}\ \cdot\ f_\vartheta^{1/2}\ d\mu\Bigr)^n = \bigl(a(\vartheta,K^c)\bigr)^n \qquad\text{for every } n\ge1\,. \ \square $$
1.21 Lemma. Condition 1.19 implies Hellinger continuity of the parameterisation

$$ H(P_{\vartheta'},P_\vartheta)\ \to\ 0 \qquad\text{whenever } \vartheta'\to\vartheta \text{ in } \Theta $$

which in turn guarantees identifiability within compacts contained in $\Theta$:

$$ h(\vartheta,\rho,K)>0 \qquad\text{for all } K\in\mathcal K\,,\ \vartheta\in K \text{ and } \rho>0\,. $$

Proof. By definition of $w(\vartheta,\delta)$ and by $(+)$ in Assumptions 1.19, the first assertion of the lemma holds by dominated convergence. From this, all mappings

$(\circ)$ $\Theta\ni\vartheta'\ \to\ H(P_{\vartheta'},P_\vartheta)\in[0,1]\,,\qquad \vartheta\in\Theta$ fixed,

are continuous, by the inverse triangle inequality $|d(x,z)-d(y,z)|\le d(x,y)$ for the metric $d(\cdot,\cdot)=H(\cdot,\cdot)$. Since we have always $P_{\vartheta'}\ne P_\vartheta$ when $\vartheta'\ne\vartheta$ (cf. Definition 1.1), the continuous mapping $(\circ)$ has its unique zero at $\vartheta'=\vartheta$. As a consequence, fixing $K\in\mathcal K$, $\vartheta\in K$, $\rho>0$ and restricting $(\circ)$ to the compact $K\setminus B_\rho(\vartheta)$, the second assertion of the lemma follows. $\square$
This allows us to control likelihood ratios Ln small balls Bı . 0 / which are distant from #.
under Pn,# when ranges over
1.22 Lemma. Fix K 2 K, # 2 int.K/, > 0. Under Assumptions 1.19, consider points 0 2 K n B .#/ and (using Lemma 1.21 and (+) in Assumptions 1.19) choose ı > 0 small enough to have . 0 , ı/ < h.#, , K/ . Then we have geometric decrease q =# Ln En,# sup Œ 1 h.#, , K/ C . 0 , ı/ n , 2 Bı .0 /\K
n1.
Proof. Write for short $V:=B_\delta(\vartheta_0)\cap K$. For $(\omega_1,\dots,\omega_n)\in\{f_{n,\vartheta}>0\}$, observe that

$$ \sup_{\eta\in V}\sqrt{L_n^{\eta/\vartheta}}\,(\omega_1,\dots,\omega_n)=\sup_{\eta\in V}\ \prod_{i=1}^n\frac{f^{1/2}(\eta,\omega_i)}{f^{1/2}(\vartheta,\omega_i)} $$

is smaller than

$$ \prod_{i=1}^n\frac1{f^{1/2}(\vartheta,\omega_i)}\Bigl(f^{1/2}(\vartheta_0,\omega_i)+\sup_{\eta\in V}\bigl|f^{1/2}(\eta,\omega_i)-f^{1/2}(\vartheta_0,\omega_i)\bigr|\Bigr) $$

which yields

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in V}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ \Bigl\{\int f_\vartheta^{1/2}\,\Bigl(f_{\vartheta_0}^{1/2}+\sup_{\eta\in V}\bigl|f_\eta^{1/2}-f_{\vartheta_0}^{1/2}\bigr|\Bigr)\,d\mu\Bigr\}^n\,. $$

With Notations 1.19, we have by Lemma 1.21 and the choice of $\vartheta_0$

$$ \int f_\vartheta^{1/2}\,f_{\vartheta_0}^{1/2}\,d\mu=1-H^2(P_{\vartheta_0},P_\vartheta)\ \le\ 1-h(\vartheta,\rho,K)\ <\ 1 $$

whereas the Cauchy–Schwarz inequality and the choice of $V$ imply

$$ \int f_\vartheta^{1/2}\ \sup_{\eta\in V}\bigl|f_\eta^{1/2}-f_{\vartheta_0}^{1/2}\bigr|\,d\mu\ \le\ 1\cdot w(\vartheta_0,\delta)=w(\vartheta_0,\delta)\,. $$

Putting all this together, assumption $(+)$ in Assumptions 1.19 and our choice of $\delta$ implies

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in B_\delta(\vartheta_0)\cap K}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ \bigl\{\,1-h(\vartheta,\rho,K)+w(\vartheta_0,\delta)\,\bigr\}^n $$

where the term $\{1-h(\vartheta,\rho,K)+w(\vartheta_0,\delta)\}$ is strictly between $0$ and $1$. $\square$
1.23 Lemma. For all $K\in\mathcal K$, $\vartheta\in\mathrm{int}(K)$, $\rho>0$, we have under Assumptions 1.19 exponential bounds

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in K\setminus B_\rho(\vartheta)}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ C\,e^{-n\,\frac12 h(\vartheta,\rho,K)}\,,\qquad n\ge1 $$

where the constant $C<\infty$ depends on $\vartheta$, $\rho$, $K$.

Proof. Fix $K\in\mathcal K$, $\vartheta\in\mathrm{int}(K)$, $\rho>0$. By Lemma 1.21 and $(+)$ in Assumptions 1.19, we associate to every point $\vartheta_0$ in $K\setminus B_\rho(\vartheta)$ some radius $\delta(\vartheta_0)>0$ such that $w(\vartheta_0,\delta(\vartheta_0))\le\frac12 h(\vartheta,\rho,K)$. Finitely many such balls $B_{\delta(\vartheta_0^{(1)})}(\vartheta_0^{(1)}),\dots,B_{\delta(\vartheta_0^{(C)})}(\vartheta_0^{(C)})$ cover the compact $K\setminus B_\rho(\vartheta)$; Lemma 1.22 applied to each of them gives

$$ E_{n,\vartheta}\Bigl(\sup_{\eta\in K\setminus B_\rho(\vartheta)}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\ \le\ \sum_{i=1}^{C}\bigl[\,1-h(\vartheta,\rho,K)+w(\vartheta_0^{(i)},\delta(\vartheta_0^{(i)}))\,\bigr]^n\ \le\ C\,\bigl[\,1-\tfrac12 h(\vartheta,\rho,K)\,\bigr]^n\ \le\ C\,e^{-n\,\frac12 h(\vartheta,\rho,K)}\,. \ \square $$

1.24 Theorem. Under Assumptions 1.19, every ML sequence $(T_n)_n$ for the unknown parameter (in the sense of Definition 1.17, with 'good sets' $(A_n)_n$) is consistent.

Proof. Fix $\vartheta\in\Theta$ and $\rho>0$; we have to show

$$ \lim_{n\to\infty}P_{n,\vartheta}\bigl(\{\,|T_n-\vartheta|\ge\rho\,\}\bigr)=0\,. $$

Put $U:=\Theta\setminus B_\rho(\vartheta)$. Since $T_n$ is maximum likelihood on $A_n$, we have

$$ A_n\cap\{\,T_n\notin U\,\}\ \supset\ A_n\cap\Bigl\{\ \sup_{\eta\in U}f_{n,\eta}\ <\ f_{n,\vartheta}\ \Bigr\} $$

for every $n\ge1$. Taking complements we have

$$ \{\,T_n\in U\,\}\ \subset\ \Bigl\{\ \sup_{\eta\in U}f_{n,\eta}\ \ge\ f_{n,\vartheta}\ \Bigr\}\ \cup\ A_n^c\,,\qquad n\ge1 $$

and write for the first set on the right-hand side

$$ P_{n,\vartheta}\Bigl(\sup_{\eta\in U}\frac{f_{n,\eta}}{f_{n,\vartheta}}\ \ge\ 1\Bigr) = P_{n,\vartheta}\Bigl(\sup_{\eta\in U}\sqrt{L_n^{\eta/\vartheta}}\ \ge\ 1\Bigr)\ \le\ E_{n,\vartheta}\Bigl(\sup_{\eta\in U}\sqrt{L_n^{\eta/\vartheta}}\Bigr)\,. $$

Fix a compact exhaustion $(K_m)_m$ of $\Theta$. Take $m$ large enough for $B_\rho(\vartheta)\subset K_m$ and large enough to have $a(\vartheta,K_m^c)<1$ in virtue of condition $(++)$ in Assumptions 1.19. Then we decompose

$$ U=\Theta\setminus B_\rho(\vartheta)=U_1\cup U_2\,,\qquad U_1:=K_m\setminus B_\rho(\vartheta)\,,\quad U_2:=\Theta\cap K_m^c $$

where Lemma 1.20 applies to $U_2$ and Lemma 1.23 to $U_1$. Thus

$$ P_{n,\vartheta}\bigl(\{\,|T_n-\vartheta|\ge\rho\,\}\cap A_n\bigr)\ \le\ E_{n,\vartheta}\Bigl(\sup_{\eta\in\Theta\cap K_m^c}\sqrt{L_n^{\eta/\vartheta}}\Bigr) + E_{n,\vartheta}\Bigl(\sup_{\eta\in K_m\setminus B_\rho(\vartheta)}\sqrt{L_n^{\eta/\vartheta}}\Bigr) \ \le\ \bigl(a(\vartheta,K_m^c)\bigr)^n + C\,e^{-n\,\frac12 h(\vartheta,\rho,K_m)}\,,\qquad n\ge1\,. $$

This right-hand side decreases exponentially fast as $n\to\infty$. By definition of the sequence of 'good sets' in Definition 1.17, we also have

$(\circ)$ $\displaystyle\lim_{n\to\infty}P_{n,\vartheta}\bigl(A_n^c\bigr)=0$

and are done. $\square$
39
Section 1.4 Consistency of ML Estimators via Hellinger Distances
Prove that the parameterisation is Hölder continuous of order 12 in the sense of Hellinger distance: p H.P 0 , P / D j 0 j for 0 sufficiently close to . Prove that condition (+) in Assumptions 1.19 holds with p . , ı/ D 2 ı whenever ı > 0 is sufficiently small (i.e. 0 < ı ı0 . / for 2 ‚). Show also that it is impossible to satisfy condition (++) whenever diam.‚/ 1. In the case where diam.‚/ > 1, prove c .1 dist. , @‚// lim a. , Km / 2 .1 dist. , @‚// m!1
when
0 < dist. , @‚/
1 2
for any choice of a compact exhaustion of ‚, @‚ denoting the boundary of ‚: in this case condition (++) is satisfied. Hence, in the case where diam.‚/ > 1, Theorem 1.24 establishes consistency of any sequence of ML estimators for the unknown parameter under independent replication of the experiment E.
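The Hölder-$\frac12$ behaviour claimed in the exercise is easy to confirm numerically (an added illustration, with the grid discretisation as its only assumption): for unit-length uniforms the affinity is the overlap length, so $H^2(P_{\vartheta'},P_\vartheta)=|\vartheta'-\vartheta|$ for $|\vartheta'-\vartheta|\le1$.

```python
import numpy as np

def hellinger2_uniform(t1, t2, grid_size=10**6):
    """Numerical H^2 between R(t1-1/2, t1+1/2) and R(t2-1/2, t2+1/2)."""
    lo, hi = min(t1, t2) - 0.5, max(t1, t2) + 0.5
    x = np.linspace(lo, hi, grid_size)
    f1 = ((x >= t1 - 0.5) & (x <= t1 + 0.5)).astype(float)
    f2 = ((x >= t2 - 0.5) & (x <= t2 + 0.5)).astype(float)
    y = (np.sqrt(f1) - np.sqrt(f2)) ** 2
    return 0.5 * float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

for d in (0.05, 0.1, 0.2, 0.4):
    print(f"|t'-t| = {d:4.2f}   H^2 ~ {hellinger2_uniform(0.0, d):.4f}")
```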
We conclude this section by underlining the following. First, in i.i.d. models, rates of convergence need not be the well-known $\sqrt n$; second, in the case where ML estimation errors at $\vartheta$ are consistent at some rate $\varphi_n(\vartheta)$ as defined in Definition 1.9(b), this is not more than just tightness of rescaled estimation errors $\mathcal L\bigl(\varphi_n(\vartheta)(\widehat\vartheta_n-\vartheta)\,\big|\,P_{n,\vartheta}\bigr)$ as $n\to\infty$. Either there may be no weak convergence at all, or we may end up with a variety of limit laws whenever the model allows to define a variety of particular ML sequences.

1.25 Example. In the location model generated from the uniform law $\mathcal R(-\frac12,\frac12)$

$$ E=\Bigl(\ \mathbb R\,,\ \mathcal B(\mathbb R)\,,\ \Bigl\{P_\vartheta:=\mathcal R\Bigl(\vartheta-\frac12\,,\ \vartheta+\frac12\Bigr):\vartheta\in\Theta=\mathbb R\Bigr\}\ \Bigr) $$

(extending Exercise 1.24''), any choice of an ML estimator sequence for the unknown parameter is consistent at rate $n$. There is no unicity concerning limit distributions for rescaled estimation errors at $\vartheta$ to be attained simultaneously for any choice of an ML sequence.

Proof. We determine the density of $P_0$ with respect to $\lambda$ as $f_0(x):=1_{[-\frac12,\frac12]}(x)$, using the closed interval of length $1$. This will simplify the representation since we may use two particular 'extremal' definitions – the sequences $(T_n^{(1)})_n$ and $(T_n^{(3)})_n$ introduced below – of an ML sequence.

(1) We start with a preliminary remark. If $Y_i$ are i.i.d. random variables distributed according to $\mathcal R((0,1))$, binomial trials show that the probability of an event

$$ \Bigl\{\ \max_{1\le i\le n}Y_i<1-\frac{u_1}{n}\ ,\quad \min_{1\le i\le n}Y_i>\frac{u_2}{n}\ \Bigr\} $$
tends to $e^{-(u_1+u_2)}$ as $n\to\infty$ for $u_1,u_2>0$ fixed. We thus have weak convergence in $\mathbb R^2$ as $n\to\infty$

$$ \Bigl(\ n\,\bigl(1-\max_{1\le i\le n}Y_i\bigr)\ ,\quad n\,\min_{1\le i\le n}Y_i\ \Bigr)\ \longrightarrow\ (Z_1,Z_2) $$

where $Z_1,Z_2$ are independent and exponentially distributed with parameter $1$.

(2) For $n\ge1$, the likelihood function $\eta\to L_n^{\eta/\vartheta}$ coincides $P_{n,\vartheta}$-almost surely with the function

$(\star)$ $\displaystyle \mathbb R\ni\eta\ \to\ 1_{\bigl[\max_{1\le i\le n}X_i-\frac12\,,\ \min_{1\le i\le n}X_i+\frac12\bigr]}(\eta)\ \in\ \{0,1\}$

in the product model $E_n$. Hence any estimator sequence $(T_n)_n$ with 'good sets' $(A_n)_n$ such that

$(\star\star)$ $\displaystyle T_n\in\Bigl[\ \max_{1\le i\le n}X_i-\frac12\ ,\ \min_{1\le i\le n}X_i+\frac12\ \Bigr]$ on $A_n$

will be a maximum likelihood sequence for the unknown parameter. The random interval in $(\star\star)$ is of strictly positive length since

$$ \max_{1\le i\le n}X_i-\min_{1\le i\le n}X_i<1 \qquad P_{n,\vartheta}\text{-almost surely.} $$

All random intervals above are closed by the determination $1_{[-\frac12+\eta,\,\frac12+\eta]}$ of the density $f_\eta$ for $\eta\in\Theta$.

(3) By $(\star\star)$ in step (2), the following $(T_n^{(i)})_n$ with good sets $(A_n)_n$ are ML sequences in $(E_n)_n$:

$$ T_n^{(1)}:=\min_{1\le i\le n}X_i+\frac12 \qquad\text{with}\quad A_n:=\Omega_n\,, $$
$$ T_n^{(2)}:=\min_{1\le i\le n}X_i+\frac12-\frac1{n^2} \qquad\text{with}\quad A_n:=\Bigl\{\max_{1\le i\le n}X_i-\min_{1\le i\le n}X_i<1-\frac1{n^2}\Bigr\}\,, $$
$$ T_n^{(3)}:=\max_{1\le i\le n}X_i-\frac12 \qquad\text{with}\quad A_n:=\Omega_n\,, $$
$$ T_n^{(4)}:=\max_{1\le i\le n}X_i-\frac12+\frac1{n^2} \qquad\text{with}\quad A_n:=\Bigl\{\max_{1\le i\le n}X_i-\min_{1\le i\le n}X_i<1-\frac1{n^2}\Bigr\}\,, $$
$$ T_n^{(5)}:=\frac12\Bigl(\min_{1\le i\le n}X_i+\max_{1\le i\le n}X_i\Bigr) \qquad\text{with}\quad A_n:=\Omega_n\,. $$

Without the above convention on closed intervals in the density $f_0$ we would remove the sequences $(T_n^{(1)})_n$ and $(T_n^{(3)})_n$ from this list.

(4) Fix $\vartheta\in\Theta$ and consider at stage $n$ of the asymptotics the model $E_n$ locally at $\vartheta$, via a reparameterisation $\eta=\vartheta+u/n$ according to the rate $n$ suggested by step (1). Reparameterised, the likelihood function $u\to L_n^{(\vartheta+u/n)/\vartheta}$ coincides $P_{n,\vartheta}$-almost surely with

$$ \mathbb R\ni u\ \to\ 1_{\bigl[\,n\,(\max_{1\le i\le n}(X_i-\vartheta)-\frac12)\ ,\ n\,(\min_{1\le i\le n}(X_i-\vartheta)+\frac12)\,\bigr]}(u)\ \in\ \{0,1\}\,. $$

As $n\to\infty$, under $P_{n,\vartheta}$, we obtain by virtue of step (1) the 'limiting likelihood function'

$(\star\star\star)$ $\mathbb R\ni u\ \to\ 1_{[-Z_1,\,Z_2]}(u)\in\{0,1\}$

where $Z_1,Z_2$ are independent exponentially distributed.

(5) Step (4) yields the following list of convergences as $n\to\infty$:

$$ \mathcal L\bigl(n\,(T_n^{(i)}-\vartheta)\ \big|\ P_{n,\vartheta}\bigr)\ \longrightarrow\ \begin{cases} -Z_1 & \text{for } i=3,4\,,\\ Z_2 & \text{for } i=1,2\,,\\ \frac12\,(Z_2-Z_1) & \text{for } i=5\,. \end{cases} $$

Clearly, we can also realise convergences

$$ \mathcal L\bigl(n\,(T_n-\vartheta)\ \big|\ P_{n,\vartheta}\bigr)\ \longrightarrow\ \alpha Z_2-(1-\alpha)Z_1 $$

for any $0<\alpha<1$ if we consider $T_n:=\alpha T_n^{(2)}+(1-\alpha)T_n^{(4)}$, or we can realise that

$$ \mathcal L\bigl(n\,(T_n-\vartheta)\ \big|\ P_{n,\vartheta}\bigr) \quad\text{does not converge weakly in } \mathbb R \text{ as } n\to\infty $$

if we define $T_n$ as $T_n^{(2)}$ when $n$ is odd, and by $T_n^{(4)}$ if $n$ is even. Thus, for maximum likelihood estimation in the model $E$ as $n\to\infty$, it makes no sense to put forward the notion of limit distribution, the relevant property of ML estimator sequences being $n$-consistency (and nothing more). $\square$
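The different limit laws in step (5) are easy to visualise by simulation (an added sketch; the estimator definitions follow step (3), all numerical parameters are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.0, 500, 10_000
x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
mn, mx = x.min(axis=1), x.max(axis=1)

err1 = n * (mn + 0.5 - theta)            # T^(1): limit  Z2 ~ Exp(1)
err3 = n * (mx - 0.5 - theta)            # T^(3): limit -Z1
err5 = n * (0.5 * (mn + mx) - theta)     # T^(5): limit (Z2 - Z1)/2

for name, e in [("T1", err1), ("T3", err3), ("T5", err5)]:
    print(f"{name}: mean ~ {e.mean():+.3f}, var ~ {e.var():.3f}")
# expected limits: T1 mean +1, var 1; T3 mean -1, var 1; T5 mean 0, var 1/2
```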
Chapter 2
Minimum Distance Estimators
Topics for Chapter 2:

2.1 Measurable Stochastic Processes with Paths in $L^p(T,\mathcal T,\mu)$
 Measurable stochastic processes and their paths 2.1
 Measurable stochastic processes with paths in $L^p(T,\mathcal T,\mu)$ 2.2
 Characterising weak convergence in $L^p(T,\mathcal T,\mu)$ 2.3
 Main theorem: sufficient conditions for weak convergence in $L^p(T,\mathcal T,\mu)$ 2.4
 Auxiliary results on integrals along the path of a process 2.5–2.5'
 Proving the main theorem 2.6–2.6''

2.2 Minimum Distance Estimator Sequences
 Example: fitting the empirical distribution function to a parametric family 2.7
 Assumptions and notations 2.8
 Defining minimum distance (MD) estimator sequences 2.9
 Measurable selection 2.10
 Strong consistency of MD estimator sequences 2.11
 Some auxiliary results 2.12–2.13
 Main theorem: representation of rescaled estimation errors in MD estimator sequences 2.14
 Variant: weakly consistent MD estimator sequences 2.15

2.3 Some Comments on Gaussian Processes
 Gaussian and $\mu$-Gaussian processes 2.16
 Some examples (time-changed Brownian bridge) 2.17
 Existence of $\mu$-Gaussian processes 2.18
 $\mu$-integrals along the path of a $\mu$-Gaussian process 2.19

2.4 Example: Asymptotic Normality of Minimum Distance Estimator Sequences
 Assumptions and notations 2.20–2.21
 Main theorem: asymptotic normality for MD estimator sequences 2.22
 Empirical distribution functions and Brownian bridges 2.23
 Example: MD estimators defined from the empirical distribution function 2.24
 Example: MD estimator sequences for symmetric stable i.i.d. observations 2.25

Exercises: 2.1', 2.3', 2.24', 2.25'.
This chapter is devoted to the study of one class of estimators in parametric families which – without aiming at any notion of optimality – do have reasonable asymptotic properties under assumptions which are weak and easy to verify. Most examples will consider i.i.d. models, but the setting is more general: sequences of experiments where empirical objects $\widehat\Psi_n$, calculated from the data at level $n$ of the asymptotics, are compared to theoretical counterparts $\Psi_\vartheta$ under $\vartheta$, for all values of the parameter $\vartheta\in\Theta$; the theoretical counterparts are deterministic and independent of $n$. Our treatment of asymptotics of minimum distance (MD) estimators follows Millar [100]; for an outline see Kutoyants [78–80]. Below, Sections 2.1 and 2.3 contain the mathematical tools and are of auxiliary character; the statistical part is concentrated in Sections 2.2 and 2.4. The main statistical results are Theorem 2.11 (almost sure convergence of MD estimators), Theorem 2.14 (representation of rescaled MD estimation errors) and Theorem 2.22 (asymptotic normality of rescaled MD estimation errors). We conclude with an example where the parameters of a symmetric stable law are estimated by means of MD estimators based on the empirical characteristic function of the first $n$ observations.
2.1 Stochastic Processes with Paths in $L^p(T,\mathcal T,\mu)$
In preparation for asymptotics of MD estimator sequences, we characterise weak convergence in relevant path spaces. We follow Cremers and Kadelka [16], see also Grinblat [37]. Throughout this section, the following set 2.1 of assumptions will be in force.

2.1 Assumptions and Notations for Section 2.1. (a) Let $\mu$ denote a finite measure on a measurable space $(T,\mathcal T)$ with countably generated $\sigma$-field $\mathcal T$. For $1\le p<\infty$ fixed, the space $L^p(\mu)=L^p(T,\mathcal T,\mu)$ of ($\mu$-equivalence classes of) $p$-integrable functions $f:(T,\mathcal T)\to(\mathbb R,\mathcal B(\mathbb R))$ is equipped with its norm

$$ \|f\|=\|f\|_{L^p(T,\mathcal T,\mu)}=\Bigl(\int_T|f(t)|^p\,\mu(dt)\Bigr)^{1/p}<\infty $$

and its Borel $\sigma$-field $\mathcal B(L^p(\mu))$. $\mathcal T$ being countably generated, the space $L^p(T,\mathcal T,\mu)$ is separable ([127, p. 138], [117, pp. 269–270]): there is a countable subset $S\subset L^p(\mu)$ which is dense in $L^p(T,\mathcal T,\mu)$, and $\mathcal B(L^p(\mu))$ is generated by the countable collection of open balls

$$ B_r(g):=\{\,f\in L^p(\mu):\|f-g\|_{L^p(\mu)}<r\,\}\,,\qquad r\in\mathbb Q^+\,,\ g\in S\,. $$
(b) With parameter set $T$ as in (a), a real-valued stochastic process $X=(X_t)_{t\in T}$ on a probability space $(\Omega,\mathcal A,P)$ is a collection of random variables $X_t$, $t\in T$, defined on $(\Omega,\mathcal A)$ and taking values in $(\mathbb R,\mathcal B(\mathbb R))$. This process is termed measurable if $(t,\omega)\to X(t,\omega)$ is a measurable mapping from $(T\times\Omega,\mathcal T\otimes\mathcal A)$ to $(\mathbb R,\mathcal B(\mathbb R))$. In a measurable process $X=(X_t)_{t\in T}$, every path

$$ X_\bullet(\omega):\quad T\ni t\ \to\ X(t,\omega)\in\mathbb R\,,\qquad \omega\in\Omega $$

is a measurable mapping from $(T,\mathcal T)$ to $(\mathbb R,\mathcal B(\mathbb R))$.

(c) Points in $\mathbb R^m$ are written as $t=(t_1,\dots,t_m)$. We write $(s_1,\dots,s_m)\le(t_1,\dots,t_m)$ if $s_j\le t_j$ for all $1\le j\le m$; in this case, we put $(s,t]:=\times_{j=1}^m(s_j,t_j]$ or $[s,t):=\times_{j=1}^m[s_j,t_j)$ etc. and speak of intervals in $\mathbb R^m$ instead of rectangles.
bn .t , !/ :D F
1X 1X 1.1,t .Yi .!// D 1¹Yi tº .!/ , n n n
n
i D1
t 2 Rm , ! 2
i D1
bn D .F bn .t // t2Rm on ., A/. Prove that F bn is a measurable process is a stochastic process F in the sense of 2.1. Hint: with respect to grids 2k Zm , k 1, l D .l1 , : : : , lm / 2 Zm , write m l1 C1 lm C1 lj lj C1 , : : : , .k/ :D , , A ; l C .k/ :D X l 2k 2k 2k j D1 2k then for every k, the mappings .t , !/ !
X
bn .l C .k/, !/ 1Al .k/ .t / F
l2Zm
are measurable from .Rm , B.Rm /˝A/ to .R, B.R//, so the same holds for their pointwise bn .t , !/. limit as k ! 1 which is .t , !/ ! F (b) Note that the reasoning in (a) did not need more than continuity from the right – in bn ., !/ which for fixed ! 2 every component of the argument t 2 Rm – of the paths F are distribution functions on Rm . Deduce the following: every real valued stochastic process .Y t / t2Rm on ., A/ whose paths are continuous from the right is a measurable process. p bn F / (c) Use the argument of (b) to show that rescaled differences n.F p bn .t , !/ F .t / , t 2 Rm , ! 2 .t , !/ ! n F are measurable stochastic processes, for every n 1. (d) Prove that every real valued stochastic process .Y t / t2Rm on ., A/ whose paths are continuous from the left is a measurable process: proceed in analogy to (a) and (b), replacing the l l C1 Al .k/ :D XjmD1 . 2jk , j2k , and the points `C .k/ by 2k `. intervals A` .k/ used in (a) by e
2.2 Lemma. For a measurable stochastic process $X=(X_t)_{t\in T}$ on $(\Omega,\mathcal A,P)$, introduce the condition

$(*)$ $\displaystyle\int_T|X(t,\omega)|^p\,\mu(dt)<\infty$ for all $\omega\in\Omega$.

Then the mapping

$$ X_\bullet:\quad \Omega\ni\omega\ \to\ X_\bullet(\omega)\in L^p(\mu) $$

is a well-defined random variable on $(\Omega,\mathcal A)$ taking values in the space $\bigl(L^p(T,\mathcal T,\mu),\mathcal B(L^p(T,\mathcal T,\mu))\bigr)$, and we call $X$ a measurable stochastic process with paths in $L^p(\mu)$.

Proof. For arbitrary $g\in L^p(T,\mathcal T,\mu)$ fixed, for measurable processes $X=(X_t)_{t\in T}$ on $(\Omega,\mathcal A)$ which satisfy the condition $(*)$, consider

$$ \omega\ \to\ \Bigl(\int_T|X(t,\omega)-g(t)|^p\,\mu(dt)\Bigr)^{1/p}\,; $$

by measurability of $(t,\omega)\to X(t,\omega)$ and Tonelli's theorem, this is an $\mathcal A$-measurable mapping $\Omega\to[0,\infty)$. Hence preimages under $X_\bullet$ of the open balls $B_r(g)$, $r\in\mathbb Q^+$, $g\in S$, belong to $\mathcal A$; by 2.1(a), these balls generate $\mathcal B(L^p(\mu))$, and the assertion follows. $\square$

[...] $\varepsilon>0$ was arbitrary, we have $(+)$. This finishes the proof. $\square$
T
for functions g : T ! R which belong to Lq ./,
C q1 D 1, and thus also weak convergence
Z
Z T
1 p
gn .s/ Xsn .ds/ !
g.s/ Xs .ds/
(weakly in R, as n ! 1)
T
for arbitrary sequences .gn /n in Lq ./ with the property gn ! g in Lq ./.
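A quick simulation in the spirit of Exercise 2.3' (an added sketch; the choices $X^n=\sqrt n\,(\widehat F_n-F)$ on $T=[0,1]$ with $g\equiv1$ and uniform observations are assumptions): here $\int_0^1 X_s^n\,ds=\sqrt n\,(\frac12-\bar Y_n)$, which converges weakly to a centred normal with variance $\int_0^1\!\!\int_0^1(s\wedge t-st)\,ds\,dt=\frac1{12}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 400, 20_000
y = rng.uniform(size=(reps, n))

# int_0^1 sqrt(n)(F_n(t) - t) dt  =  sqrt(n) * (1/2 - mean(Y))
stat = np.sqrt(n) * (0.5 - y.mean(axis=1))

print("sample var ~", stat.var())     # should approach 1/12 ~ 0.0833
print("target var =", 1.0 / 12.0)
```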
The following Theorem 2.4, the main result of this section, gives sufficient conditions for weak convergence in $L^p(T,\mathcal T,\mu)$ – of type 'convergence of finite-dimensional distributions plus uniform integrability' – from which weak convergence in $L^p(T,\mathcal T,\mu)$ can be checked quite easily.

2.4 Theorem (Cremers and Kadelka [16]). Consider measurable stochastic processes $(X_t^n)_{t\in T}$ defined on $(\Omega_n,\mathcal A_n,P_n)$, $n\ge1$, and $(X_t)_{t\in T}$ defined on $(\Omega,\mathcal A,P)$, with paths in $L^p(T,\mathcal T,\mu)$, under all assumptions of 2.1.

(a) In order to establish

$$ X^n\ \overset{\mathcal L}{\longrightarrow}\ X \qquad\text{(weak convergence in } L^p(T,\mathcal T,\mu)\text{, as } n\to\infty\text{)}\,, $$

a sufficient condition is that the following properties (i) and (ii) hold simultaneously:

(i) convergence of finite-dimensional distributions up to some exceptional set $N\in\mathcal T$ such that $\mu(N)=0$: for arbitrary $l\ge1$ and any choice of $t_1,\dots,t_l$ in $T\setminus N$, one has

$$ \mathcal L\bigl((X_{t_1}^n,\dots,X_{t_l}^n)\ \big|\ P_n\bigr)\ \to\ \mathcal L\bigl((X_{t_1},\dots,X_{t_l})\ \big|\ P\bigr) \qquad\text{(weak convergence in }\mathbb R^l\text{, as } n\to\infty\text{)}\,; $$

(ii) uniform integrability of $\{|X^n(\cdot,\cdot)|^p:n\ge0\}$ for the random variables (including $X^0:=X$ and $P_0:=P$)

$$ X^n \ \text{ defined on } \bigl(T\times\Omega_n,\mathcal T\otimes\mathcal A_n,\mu\otimes P_n\bigr) \text{ with values in } (\mathbb R,\mathcal B(\mathbb R))\,,\qquad n\ge0\,, $$

in the following sense: for every $\varepsilon>0$ there is some $K=K(\varepsilon)<\infty$ such that

$$ \sup_{n\ge0}\ \int_{T\times\Omega_n} 1_{\{|X^n|>K\}}\,|X^n|^p\ d(\mu\otimes P_n)\ <\ \varepsilon\,. $$

(b) Whenever condition (a.i) is satisfied, any one of the following two conditions is sufficient for (a.ii):

$(2.4')$ $X^n(\cdot,\cdot)$, $n\ge1$, and $X(\cdot,\cdot)$ are elements of $L^p(T\times\Omega_n,\mathcal T\otimes\mathcal A_n,\mu\otimes P_n)$ resp. $L^p(T\times\Omega,\mathcal T\otimes\mathcal A,\mu\otimes P)$, and $\displaystyle\limsup_{n\to\infty}\int_{T\times\Omega_n}|X^n|^p\,d(\mu\otimes P_n)\ \le\ \int_{T\times\Omega}|X|^p\,d(\mu\otimes P)\,;$

$(2.4'')$ there is some function $f\in L^1(T,\mathcal T,\mu)$ such that for $\mu$-almost all $t\in T$: $E\bigl(|X_t^n|^p\bigr)\le f(t)$ for all $n\ge1$, and $\displaystyle\lim_{n\to\infty}E\bigl(|X_t^n|^p\bigr)=E\bigl(|X_t|^p\bigr)\,.$

The remaining parts of this section contain the proof of Theorem 2.4, to be completed in Proofs 2.6 and 2.6', and some auxiliary results; we follow Cremers and Kadelka [16]. W.l.o.g., we take the finite measure $\mu$ on $(T,\mathcal T)$ as a probability measure: $\mu(T)=1$.
48
Chapter 2 Minimum Distance Estimators
2.5 Lemma. Write H for the class of bounded measurable functions ' : T R ! R such that for every t 2 T fixed, the mapping '.t , / : x ! '.t , x/ is continuous. Then condition (a.i) of Theorem 2.4 gives for every ' 2 H convergence in law of the integrals Z Z n '.s, Xs / .ds/ ! '.s, Xs / .ds/ (weak convergence in R, as n ! 1) . T
T
Proof. For real valued measurable processes .X t0 / t2T defined on some .0 , A0 , P 0 /, the mapping .t , !/ ! .t , X 0 .t , !// is measurable from T ˝A to T ˝B.R/, hence composition with ' gives a mapping .t , !/ ! '.t , X 0 .t , !// which is T ˝A–B.R/measurable. As a consequence, Z '.s, X 0 .s, !// .ds/ ! ! T
0 0 values in .R, B.R//. In order to is a well-defined random variable on . R , A / taking prove convergence in law of integrals T '.s, Xsn / .ds/ as n ! 1, we shall show Z Z n '.s, Xs / .ds/ ! EP g '.s, Xs / .ds/ .˘/ EPn g T
T
for arbitrary g 2 Cb .R/, the class of bounded continuous functions R ! R. Fix any constant M < 1 such that sup j'j M . Thanks to the convention T R
.T / D 1 above, for functions g 2 Cb .R/ to be considered in .˘/, only the restriction gjŒM ,M to the interval ŒM , M is relevant. According to Weierstrass, approximating g uniformly on ŒM , M by polynomials, it will be sufficient to prove .˘/ for polynomials. By additivity, it remains to consider the special case g.x/ :D x l , l 2 N, and to prove .˘/ in this case. We fix l 2 N and show l ! l ! Z Z EPn '.s, Xsn / .ds/ '.s, Xs / .ds/ ! EP , n!1. T
T
Put X 0 :D X and P0 :D P , and write left- and right-hand sides in the form l ! Z n EPn '.s, Xs / .ds/ T
Z
Z D
T
T
EPn
Z
Z D
T
T
l Y
EPn
! '.si , Xsni /
.ds1 / : : : .dsl /
iD1 n n .s1 ,:::,sl / .Xs1 , : : : , Xsl /
.ds1 / : : : .dsl / .
Section 2.1 Stochastic Processes with Paths in Lp .T , T , /
49
Here, at every point .s1 , : : : , sl / of the product space XliD1 T , a function .x1 , : : : , xl / D
.s1 ,:::,sl / .x1 , : : : , xl / :D
l Y
'.si , xi /
2 Cb .Rl /
iD1
arises which indeed is bounded and continuous: for ' 2 H , we exploit at this point of the proof the defining property of class H . Condition (a.i) of Theorem 2.4 guarantees convergence of finite dimensional distributions of X n to those of X up to some exceptional -null set T 2 T , thus we have .Xs1 , : : : , Xsl / , n ! 1 EPn .Xsn1 , : : : , Xsnl / ! EP for any choice of .s1 , : : : , sl / such that s1 , : : : , sl 2 T n N . Going back to the above integrals on the product space XliD1 T , note that all expressions in the last convergence are bounded by M l , uniformly in .s1 , : : : , sl / and n: hence Z Z n n ... EPn .s1 ,:::,sl / .Xs1 , : : : , Xsl / .ds1 / : : : .dsl / T
T
will tend as n ! 1 by dominated convergence to Z Z ... EP .s1 ,:::,sl / .Xs1 , : : : , Xsl / .ds1 / : : : .dsl / . T
T
This proves .˘/ in the case where g.x/ D x l , for arbitrary l 2 N. This finishes the proof. 2.5’ Lemma. For g 2 Lp .T , T , / fixed, consider the family of random variables Z n .g/ :D kXn gkp ,
n0
as in the proof of Lemma 2.2 (with X 0 :D X , P0 :D P ), and the class H as defined in Lemma 2.5. Under condition R(a.ii) of Theorem 2.4, we can approximate Z n .g/ uniformly in n 2 N0 by integrals T '.s, Xsn / .ds/ for suitable choice of ' 2 H : for every g 2 Lp ./ and every ı > 0 there is some ' D 'g,ı 2 H such that ˇ ˇ Z ˇ ˇ 'g,ı .s, Xsn /.ds/ˇˇ > ı < ı . sup Pn ˇˇkXn gkp n0
T
Proof. (1) Fix g 2 Lp .T , T , / and ı > 0. Exploiting the uniform integrability condition (a.ii) of Theorem 2.4, we select C D C.ı/ < 1 large enough for Z 1 1¹jX n j>C º jX n jp d.˝Pn / < 2.pC1/ ı 2 .˘1/ sup n0 T 4 n
50
Chapter 2 Minimum Distance Estimators
and D .ı/ > 0 small enough such that Gn 2 T ˝An , .˝Pn /.Gn / < H) Z 1 1Gn jX n jp d.˝Pn / < 2p ı 2 4 T n
.˘2/
independently of n. Assertion .˘2/ amounts to the usual "-ı-characterisation of uniform integrability, in the case where random variables .t , !/ ! jXn .t , !/jp are defined on different probability spaces .T n , T ˝An , ˝Pn /. In the same way, we view .t , !/ ! jg.t /jp as a family of random variables on .T n , T ˝An , ˝Pn / for n 1 which is uniformly integrable. Increasing C < 1 of .˘1/ and decreasing > 0 of .˘2/ if necessary, we obtain in addition Z 1 1¹jgj>C º jgjp d < 2.pC1/ ı 2 .˘3/ 4 T together with Gn 2 T ˝An , .˝Pn /.Gn / < H) Z 1 1Gn jgjp d.˝Pn / < 2p ı 2 . 4 T n
.˘4/
A final increase of C < 1 makes sure that in all cases n 0 the sets Gn :D ¹jX n j > C º or
.˘5/
Gn :D ¹jgj > C º
satisfy the condition .˝Pn /.Gn / < , and thus can be inserted in .˘2/ and .˘4/. (2) With g, ı, C of step (1), introduce a truncated identity h.x/ D .C /_x^.CC / and define a function ' D 'g,ı : T R ! R by '.s, x/ :D jh.x/ h.g.s//jp ,
s2T , x2R.
Then ' belongs to class H as defined in Lemma 2.5, and is such that .C/ jXsn .!/ g.s/jp D '.s, Xsn .!// on ¹.s, !/ : jX n .s, !/j C , jg.s/j C º . As a consequence of (+), for ! 2 fixed, the difference ˇ ˇ R n ˇ ˇ kX n .!/ gkp T '.s, Xs .!// .ds/ Z ., n/ j jXsn .!/ g.s/jp '.s, Xsn .!// j .ds/ T
admits, for ! 2 fixed, the upper bound Z 1¹jX n .s,!/j>C º [ ¹jg.s/j>C º j jXsn .!/ g.s/jp '.s, Xsn .!// j .ds/ T
Section 2.1 Stochastic Processes with Paths in Lp .T , T , /
51
which is (using the elementary ja C bjp .jaj C jbj/p 2p .jajp C jbjp /, and definition of ') smaller than Z 1¹jX n .s,!/j>C º [ ¹jg.s/j>C º 2p jXsn .!/jp C 2p jg.s/jp C 2p C p .ds/ T ²Z 1¹jX n .s,!/j>C º .jX n .s, !/jp C C p / .ds/ 2p T Z Z 1¹jX n .s,!/j>C º jg.s/jp .ds/ C 1¹jg.s/j>C º jX n .s, !/jp .ds/ C T ³ ZT p p 1¹jg.s/j>C º .jg.s/j C C / .ds/ C T
and finally smaller than ² Z Z p n p 1¹jX n .s,!/j>C º jX .s, !/j .ds/ C 1¹jX n .s,!/j>C º jg.s/jp .ds/ 2 2 T T ³ Z Z n p p 1¹jg.s/j>C º jX .s, !/j .ds/ C 2 1¹jg.s/j>C º jg.s/j .ds/ . C T
T
The last right-hand side is the desired bound for ., n/, for ! 2 fixed. (3) Integrating this bound obtained in step (2) with respect to Pn , we obtain ˇ ˇ Z ˇ n ˇ p n ˇ '.s, Xs .!// .ds/ˇˇ EPn ˇkX .!/ gk T Z Z pC1 n p p 1¹jX n j>C º jX j d.˝Pn / C 2 1¹jX n j>C º jgjp d.˝Pn / 2 T n T n Z Z p n p pC1 1¹jgj>C º jX j d.˝Pn / C 2 1¹jgj>C º jgjp d C2 T n
T
where .˘1/–.˘5/ make every term on the right-hand side smaller than 14 ı 2 , independently of n. Thus ˇ ˇ Z ˇ n ˇ p n '.s, Xs .!// .ds/ˇˇ < ı 2 sup EPn ˇˇkX .!/ gk n0
T
for ' D 'ı,g 2 H , and application of the Markov inequality gives ˇ ˇ Z ˇ n ˇ p n '.s, Xs /.ds/ˇˇ > ı < ı sup Pn ˇˇkX gk n0
as desired.
T
2.6 Proof of Theorem 2.4(a). (1) We shall prove part (a) of Theorem 2.4 using ‘acen /n of real valued rancompanying sequences’. We explain this for some sequence .Y e dom variables whose convergence in law to Y 0 we wish to establish. Write Cu .R/ for
52
Chapter 2 Minimum Distance Estimators
the class of uniformly continuous and bounded functions R ! R; for f 2 Cu .R/ put e n /n – where for every n 0, Z e n and Y en live on the Mf :D 2 sup jf j. A sequence .Z n e same probability space – is called a ı-accompanying sequence for .Y /n if n e Z enj > ı < ı . sup Pn jY n0
Selecting for every f 2 Cu .R/ and every > 0 some ı D ı.f , / > 0 such that sup
jxx 0 j 0 some sequence .Z n 0 e e such that Z .ı/ converges in law to Z .ı/ as n ! 1, we do have convergence in e0 , thanks to .ı/. en /n to Y law of .Y (2) We start the proof of Theorem 2.4(a). In order to show Xn ! X
(weak convergence in Lp .T , T , /, n ! 1)
it is sufficient by Proposition 2.3 to consider arbitrary g1 , ..., gl in Lp .T , T , /, l 1, and to prove n kX g1 k, : : : , kXn gl k ! .kX g1 k, : : : , kX gl k/ (weakly in Rl , n ! 1) or equivalently n kX g1 kp , : : : , kXn gl kp ! kX g1 kp , : : : , kX gl kp (weakly in Rl , n ! 1) . According to Cramér–Wold (or to P. Lévy’s continuity theorem for characteristic functions), to establish the last assertion we have to prove for all ˛ D .˛1 , : : : , ˛` / 2 Rl
.CC/
Y n :D
l X
˛i kXn gi kp !
iD1
l X
˛i kX gi kp D: Y 0
iD1
(weakly in R, n ! 1) where w.l.o.g. we can assume first ˛i ¤ 0 for 1 i l, and second by multiplication of .˛1 , : : : , ˛` / with some constant.
1 l
Pl
1 iD1 j˛i j
D 1,
Section 2.1 Stochastic Processes with Paths in Lp .T , T , /
53
(3) With .˛1 , : : : , ˛` / and g1 , : : : , g` of (++), for ı > 0 arbitrary, select functions 'gi , ı in H such that lj˛i j
ˇ ˇ Z ˇ n ˇ ı ı p n 'gi , ı .s, Xs /.ds/ˇˇ > < sup Pn ˇˇkX gi k lj˛i j n0 lj˛i j lj˛i j T
with notations of Lemma 2.5’. Then also '˛,g1 ,:::,gl ,ı .t , x/ :D
l X iD1
˛i 'gi ,
ı lj˛i j
.t , x/ ,
t 0, x2R
is a function in class H . Introducing the random variables Z '˛,g1 ,:::,gl ,ı .s, Xsn / .ds/ , n 1, Z n :D ZT 0 Z :D '˛,g1 ,:::,gl ,ı .s, Xs / .ds/ , T
by Lemma 2.5, condition (a.i) of Theorem 2.4 gives convergence in law of the integrals ./
Z n ! Z 0
(weak convergence in R), n ! 1 .
P Now, triangle inequality and norming 1l liD1 j˛1 j D 1 show that independently of i n0 Pn jY n Z n j > ı ˇ ˇ Z l X ˇ ˇ ı n p n ˇ ˇ Pn ˇ˛i kX gi k ˛i 'gi , ı .s, Xs /.ds/ˇ > lj˛i j l T iD1 ˇ ˇ Z l X ˇ ˇ ı D Pn ˇˇkXn gi kp 'gi , ı .s, Xsn /.ds/ˇˇ > D‰# H j #j We require in addition that the components Dj ‰# , 1 j d , be linearly independent in H . 2.9 Definition. A sequence of estimators #n : ., Fn / ! .Rd , B.Rd // for the unknown parameter # 2 ‚ is called a minimum distance (MD) sequence if there is a sequence of events An 2 Fn such that P# lim inf An D 1 n!1
for all # 2 ‚ fixed, and such that for n 1 b b #n .!/ 2 ‚ and ‰ n .!/ ‰#n .!/ D inf ‰ n .!/ ‰ H
H
2‚
Whenever the symbolic notation
b #n D arginf ‰ ‰ n 2‚
H
,
n!1
will be used below, we assume that the above properties do hold.
for all ! 2 An .
59
Section 2.2 Minimum Distance Estimator Sequences
2.10 Proposition. Consider a compact exhaustion .Kn /n of ‚. Define with respect to Kn the event ² ³ b b An :D min ‰ ‰ D inf ‰ 2 Fn , n 1 . ‰ n n 2Kn
H
H
2‚
Then one can construct a sequence .Tn /n such that the following holds for all n 1: Tn : ., Fn / ! .Rd , B.Rd // is measurable, b b 8 ! 2 An : Tn .!/ 2 Kn , ‰ n .!/ ‰Tn .!/ D inf ‰ n .!/ ‰ H
2‚
H
.
Proof. (1) We fix a compact exhaustion .Kn /n of ‚: Kn compact in Rd ,
Kn int.KnC1 / ,
Kn " ‚ .
For every n and every ! 2 , define ² ³ b b Mn .!/ :D 2 Kn : ‰ n .!/ ‰ H D inf ‰ n .!/ ‰ H 2‚
bn .!/ ‰ kH attains its global the set of all points in Kn where the mapping ! k‰ minimum. By continuity of this mapping and definition of An , Mn .!/ is a non-void closed subset of Kn when ! 2 An , hence a non-void compact set. Thus, for ! 2 An , out of arbitrary sequences in Mn .!/, we can select convergent subsequences having limits in Mn .!/. (2) For fixed n 1 and fixed ! 2 An we specify one particular point ˛.!/ 2 Mn .!/ as follows. Put ˛1 .!/ :D inf ¹ 1 : there are points D . 1 , 2 , ..., d / 2 Mn .!/º . Selecting convergent subsequences in the non-void compact Mn .!/, we see that Mn .!/ contains points of the form .˛1 .!/, 2 , ..., d /. Next we consider ˛2 .!/ :D inf¹ 2 : there are points D .˛1 .!/, 2 , ..., d / 2 Mn .!/º . Again, selecting convergent subsequences in Mn .!/, we see that Mn .!/ contains points of the form .˛1 .!/, ˛2 .!/, 3 , ..., d /. Continuing in this way, we end up with a point ˛.!/ :D .˛1 .!/, ˛2 .!/, ..., ˛d .!// 2 Mn .!/ which has the property ˛j .!/ D min¹ j : there are points D .˛1 .!/, : : : , ˛j 1 .!/, j , : : : , d / 2 Mn .!/º for all components 1 j d .
60
Chapter 2 Minimum Distance Estimators
(3) For ! 2 An and for the particular point ˛.!/ 2 Mn .!/ selected in step (2), write ˛ .n/ .!/ :D ˛.!/ for clarity, and define (fixing some default value #0 2 ‚) Tn .!/ :D #0 1Acn .!/ C ˛ .n/ .!/ 1An .!/ ,
!2.
bn .!/ ‰ kH Then Tn .!/ represents one point in Kn such that the mapping ! k‰ attains its global minimum on ‚ at Tn .!/, provided ! 2 An . It remains to show that Tn is Fn -measurable. b n ‰ kH . By construction of the sequence (4) Write for short :D inf2‚ k‰ .n/ .˛ /n , fixing arbitrary .b1 , : : : , bd / 2 Rd and selecting convergent subsequences in compacts Kn \ .XjmD1 .1, bj X Rd m /, the following is seen to hold successively in 1 m d : ± ° .n/ .n/ bm D An \ ˛1 b1 , : : : , ˛m ° ± [ \ b n ‰ kH < C r 2 Fn k‰ r>0 rational
²
2 Qd \Kn j bj ,1j m
b n .!/ ‰ kH on (where again we have used continuity (+) of the mapping ! k‰ ‚ for ! fixed). As a consequence of this equality, taking m D d , the mapping ! ! ˛ .n/ .!/1An .!/ is Fn -measurable. Thus, with constant #0 and An 2 Fn , Tn defined in step (3) is a Fn -measurable random variable, and all assertions of the proposition are proved. The core of the last proof was ‘measurable selection’ out of the set of points where b n .!/ ‰ kH attains its global minimum on ‚. In asymptotic statistics, ! k‰ problems of this type arise in almost all cases where one wishes to construct estimators through minima, maxima or zeros of suitable mappings ! H. , !/. See [43, Thm. A.2 in App. A] for a general and easily applicable result to solve ‘measurable selection problems’ in parametric models, see also [105, Thm. 6.7.22 and Lem. 6.7.23]. 2.11 Proposition. Assume SLLN(#) and I(#) for all # 2 ‚. Then the sequence .Tn /n constructed in Proposition 2.10 is a minimum distance estimator sequence for the unknown parameter. Moreover, arbitrary minimum distance estimator sequences .#n /n for the unknown parameter as defined in Definition 2.9 are strongly consistent: 8# 2‚:
#n ! #
P# -almost surely as n ! 1.
Proof. (1) For the particular sequence .Tn /n which has been constructed in Proposition 2.10, using a compact exhaustion .Kn /n of ‚ and measurable selection on events ² ³ b b An :D min ‰ ‰ D inf ‰ 2 Fn ‰ n n 2Kn
H
2‚
H
61
Section 2.2 Minimum Distance Estimator Sequences
defined with respect to Kn , we have to show that the conditions of our proposition imply .ı/ P# lim inf An D 1 n!1
for all # 2 ‚. Then, by Proposition 2.10, the sequence .Tn /n with ‘good sets’ .An /n will have all properties required in Definition 2.9 of an MD estimator sequence. Fix # 2 ‚. Since .Kn /n is a compact exhaustion of the open parameter set ‚, there is some n0 and some "0 > 0 such that B2"0 .#/ Kn for all n n0 . Consider 0 < " < "0 arbitrarily small. By the definition of An and Tn in Proposition 2.10, we have for n n0 ² ³ b b min ‰ n ‰ < inf ‰ n ‰ An \ ¹jTn #j "º . H
:j#j"
H
:j#j>"
Passing to complements this reads for n n0 ¹jTn #j > "º [ Acn ² ³ b b min ‰ n ‰ inf ‰ n ‰ H H :j#j" :j#j>" ² ³ b b inf ‰ ‰ n ‰# n ‰ H H :j#j>" ² ³ bn ‰# kH bn ‰# kH k‰# ‰ kH k‰ inf k‰ :j#j>" ² ³ b ‰ inf ‰ ‰ Cn :D 2 ‰ n # # H
H
:j#j>"
(in the third line, we use inverse triangular inequality). For the event Cn defined by the right-hand side of this chain of inclusions, SLLN(#) combined with I(#) yields P# lim sup Cn D P# .¹! : ! 2 Cn for infinitely many nº/ D 0 . n!1
But Acn is a subset of Cn for n n0 , hence we have P# .lim sup Acn / D 0 and thus .ı/. n!1
(2) Next we consider an arbitrary MD estimator sequence .#n /n for the unknown parameter according to Definition 2.9, and write .e An /n for its sequence of ‘good sets’: bn .!/ ‰ kH attains its global An the mapping ! k‰ thus e An 2 Fn , for ! 2 e minimum on ‚ at #n .!/, and we have .ıı/ P# lim inf e An D 1 . n!1
By Definition 2.9, we have necessarily ² b ‰ inf min ‰ n < :j#j"
H
:j#j>"
³ b ‰ e ‰ Acn [ ¹j#n #j "º n H
62
Chapter 2 Minimum Distance Estimators
(note the different role played by general ‘good sets’ e An in comparison to the particular construction of Proposition 2.10 and step (1) above). If we transform the left-hand side by the chain of inclusions of step (1) with Cn defined as there, we now obtain ¹j#n #j > "º \ e An C n
for all n 1, and thus
¹j#n #j > "º Cn [ e Acn
,
n1.
Again SLLN(#) and I(#) guarantee P# .lim sup Cn / D 0 as in step (1) above. Since n!1 Acn / D 0 , we arrive at property .ıı/ yields P# .lim sup e n!1
P# lim sup ¹j#n #j > "º D 0 . n!1
This holds for all " > 0, and we have proved P# -almost sure convergence of .#n /n to #. Two auxiliary results prepare for the proof of our main result – the representation of rescaled MD estimator errors, Lemma 2.13 below – in this section. The first is purely analytical. 2.12 Lemma. Under I(#) and D(#) we have lim lim inf inf 'n ‰#Ch='n ‰# H D 1 c"1 n!1 jhj>c
for any sequence 'n " 1 of real numbers. Proof. By assumption I(#), we have for " > 0 fixed as n ! 1 'n ‰#Ch=' ‰# D 'n inf ‰ ‰# ! 1 . .C/ inf n H H h:jh='n j"
:j#j"
Assumption D(#) includes linear independence in H of the components D1 ‰# , : : : , Dd ‰# of the derivative D‰# . Since u ! ku> D‰# kH is continuous, we have on the unit sphere in Rd :D min ku> D‰# kH > 0 .
.CC/
jujD1
Assumption D(#) shows also .C C C/
sup
:j#j 0 is small enough. For any 0 < c < 1, because of (+), all what remains to consider in inf 'n ‰#Ch='n ‰# H jhj>c
63
Section 2.2 Minimum Distance Estimator Sequences
are contributions inf
jhj>c , jh='n j 0 arbitrarily small
which by (++) and (+++) allow for lower bounds ‰#Ch=' ‰# n H inf jhj jh='n j jhj>c , jh='n j D‰# H c inf j #j : 0K='n
we thus obtain .ı/
ˇ ®ˇ ¯ ˇ'n .# #/ˇ > K Cn .K/ [ Ac n n
for n large enough, with ‘good sets’ An satisfying P# .lim inf An / D 1. n!1
(2) Under assumptions I(#) and D(#), Lemma 2.12 shows that deterministic quantities lim inf inf 'n ‰#Ch=' ‰# n!1 jhj>K
n
H
64
Chapter 2 Minimum Distance Estimators
can be made arbitrarily large by choosing K large, whereas the tightness condition T(#) yields b n ‰# > M D 0 . lim sup P# 'n ‰ H M "1 n1
Combining both statements, we obtain for the events Cn .K/ in step (1) for every " > 0 there is K D K."/ < 1 such that lim sup P# . Cn .K."// / < " . n!1
The ‘good sets’ An of step (1) satisfy a fortiori lim P# .An / D 1
(view lim inf An D n!1
S T m
n!1
nm An
as increasing limit of events
T
nm An
as m
tends to 1). Combining both last assertions with .ı/, we have for every " > 0 some K."/ < 1 such that ˇ ˇ lim sup P# ˇ'n .#n #/ˇ > K."/ < " n!1
holds. This finishes the proof.
We arrive at the main result of this section. 2.14 Theorem (Millar [100]). Assume SLLN(#), I(#), D(#), and T(#) with a sequence of norming constants 'n D 'n .#/ " 1. With d d matrix ˛ ˝ ƒ# :D Di ‰# , Dj ‰# 1i,j d (invertible by assumption D(#)) define a linear mapping …# : H ! Rd 0 1 hD1 ‰# , f i @ A , f 2H . ::: …# .f / :D ƒ1 # hDd ‰# , f i Then rescaled estimation errors at # of arbitrary minimum distance estimator sequences .#n /n as in Definition 2.9 admit the representation b n ‰# / C oP .1/ , n ! 1 . 'n .#n #/ D …# 'n .‰ # Proof. (1) We begin with a preliminary remark. By assumption D(#), see Assumptions 2.8(III), the components D1 ‰# , .., Dd ‰# of the derivative D‰# are linearly independent in H . Hence V# :D span.Di ‰# : 1 i d / is a d -dimensional closed linear subspace of H . For points h 2 Rd with components h1 , : : : , hd and for elements f 2 H , the following two statements are equivalent:
65
Section 2.2 Minimum Distance Estimator Sequences
(i) the orthogonal projection of f on V# takes the form
Pn
iD1 hi
Di ‰# D h> D‰# ;
(ii) one has …# .f / D h . Note that the orthogonal projection of f on V# corresponds to the unique h in Rd such that f h> D‰# ? V# . This can be rewritten as
d d X X hi Di ‰# , Dj ‰# D hf , Dj ‰# i hi .ƒ# /i,j , for all 1 j d 0D f iD1
iD1
by definition of ƒ# , and thus in the form .hD1 ‰# , f i, : : : , hDd ‰# , f i/ D h> ƒ# ; transposing and using symmetry of ƒ# the last line gives 0 1 hD1 ‰# , f i A. ::: ƒ# h D @ hDd ‰# , f i Inverting ƒ# we have the assertion. b n ‰# / under P# in place of f (2) Inserting the H -valued random variable 'n .‰ in step (1), we define b b n ‰# / hn :D …# 'n .‰ which is a random variable taking values in Rd . Then by step (1), b ./ b h> n D‰# is the orthogonal projection of 'n .‰ n ‰# / on the subspace V# . b n ‰# /kH under P# as Moreover, assumption T.#/ imposing tightness of k'n .‰ n ! 1 guarantees the family of laws L. b hn j P# /, n 1 , is tight in Rd since …# is a linear mapping. (3) Let .#n /n denote any MD estimator sequence for the unknown parameter with ‘good sets’ .An /n as in Definition 2.9: for every n 1, on the event An , the mapping b ‰ 2 Œ0, 1/ ‚ 3 ! ‰ n H
#n , and we have (in particular, cf. proof of Lemma 2.13)
attains its global minimum at lim P# .An / D 1. With the norming sequence of assumption T.#/, put n!1
hn :D 'n .#n #/ ,
‚#,n :D ¹h 2 Rd : # C h='n 2 ‚º .
Then the sets ‚#,n increase to Rd as n ! 1 since ‚ is open. Rewriting points 2 ‚ relative to # in the form # C h='n , h 2 ‚#,n , we have the following: for every n 1, on the event An , the mapping b n ‰#Ch=' / D: n,# .h/ ./ ‚#,n 3 h ! 'n .‰ n
66
Chapter 2 Minimum Distance Estimators
attains a global minimum at hn . By Lemma 2.13, the family of laws L. hn j P# /, n 1 , is tight in Rd . (4) In order to prove our theorem, we have to show hn C oP .1/ as n ! 1 h D b n
#
with the notation of steps (2) and (3). The idea is as follows. On the one hand, the function n,# ./ defined in ./ attains a global minimum at hn (on the event An ), on the other hand, by ./, b n ‰# / 'n .‰#Ch=' ‰# / n,# .h/ D 'n .‰ n b n ‰# / h> D‰# 'n ‰#Ch=' ‰# .h='n /> D‰# D 'n .‰ n will be close (for n large) to the function b n ‰# / h> D‰# . / Dn,# .h/ :D 'n .‰ which admits a unique minimum at b hn according to ./ in step (2), up to small perturbations of order 'n ‰#Ch=' ‰# .h='n /> D‰# n
which have to be negligible as n ! 1, uniformly in compact h-intervals, in virtue of the differentiability assumption D.#/. We will make this precise in steps (5)–(7) below. (5) By definition of the random functions n,# ./ and Dn,# ./, an inverse triangular inequality in H establishes j n,# .h/ Dn,# .h/ j 'n ‰#Ch=' ‰# .h='n /> D‰# n
for all ! 2 , all h 2 ‚n,# and all n 1. The bound on the right-hand side is deterministic. The differentiability assumption D.#/ shows that for arbitrarily large constants C < 1, sup 'n ‰#Ch=' ‰# .h='n /> D‰# jhjC
n
‰#Ch=' ‰# .h='n /> D‰# n C sup jh='n j jhjC
vanishes as n ! 1. For n sufficiently large, ‚n,# includes ¹jhj C º. For all ! 2 , we thus have .˘/
sup j n,# .h/ Dn,# .h/ j ! 0 as n ! 1
h2K
on arbitrary compacts K in Rd . (6) Squaring the random function Dn,# ./ defined in . /, we find a quadratic lower bound ˇ ˇ2 2 2 .h/ Dn,# .b hn / C ˇ h b hn ˇ 2 , for all ! 2 , h 2 Rd , n 1 .C/ Dn,#
67
Section 2.2 Minimum Distance Estimator Sequences
around its unique minimum at b hn , with D min k u> D‰# k > 0 jujD1
introduced in (++) in the proof of Lemma 2.12. Assertion (+) is proved as follows. From ./ in step (2) b b h> n D‰# is the orthogonal projection of 'n .‰ n ‰# / on the subspace V# or equivalently
b n ‰# / b h> 'n .‰ n D‰#
? V#
we have for the function Dn,# ./ according to Pythagoras 2 ˇ2 ˇ 2 2 2 Dn,# .h/ D .h b .b hn / ˇh b .b hn / . hn /> D‰# C Dn,# hn ˇ 2 C Dn,# This assertion holds for all ! 2 , all h 2 Rd and all n 1. (7) To conclude the proof, we fix " > 0 arbitrarily small. By tightness of .b hn /n and .hn /n under P# , see steps (2) and (3) above, there is a compact K D K."/ in Rd such that " " hn … K < , sup P# hn … K < . sup P# b n1 n1 2 2 Then for the MD estimator sequence .#n /n of step (3), with ‘good sets’ .An /n , P# jhn b hn j > " hn … K C P# hn … K C P# Acn P# b hn 2 Kº \ ¹hn 2 Kº \ An \ ¹jhn b C P# ¹b hn j > "º " " C C P# Acn 2 2 ¯ ® 2 2 hn 2 Kº \ ¹hn 2 Kº \ An \ Dn,# .hn / Dn,# .b h n / C "2 2 C P# ¹b 2 where we have used the quadratic lower bound (+) for Dn,# ./ around b hn from step (6). Recall that lim P# .An / D 1, as in step (2) in the proof of Lemma 2.13. Thanks n!1
to the approximation .˘/ from step (5), we can replace – uniformly on K as n ! 1 2 ./ by 2n,# ./. This gives – the random function Dn,# hn j > " lim sup P# jhn b n!1 ¯ ® 1 h n / C "2 2 . < " C lim sup P# An \ 2n,# .hn / 2n,# .b n!1 2
68
Chapter 2 Minimum Distance Estimators
We shall show that the last assertion implies hn j > " < " . .˘˘/ lim sup P# jhn b n!1
For n large enough, the compact K is contained in ‚#,n . For ! 2 An , by ./ in step (3), the global minimum M of the mapping ‚#,n 3 h ! 2n,# .h/ is attained at hn . Hence for n large enough, the intersection of the sets ² ³ 1 2 2 2 b n,# .hn / M " An and 2 must be void: this proves .˘˘/. By .˘˘/, " > 0 being arbitrary, the sequences .hn /n and .b hn /n under P# are asymptotically equivalent. Thus we have proved hn C oP# .1/ hn D b
as n ! 1
which – as stated at the start of step (4) – concludes the proof.
2.15 Remark. We indicate a variant of our approach. Instead of almost sure convergence of the MD sequence .#n /n to the true parameter one might be interested only in convergence in probability (consistency in the usual sense of Definition 1.9). For this, it is sufficient to work with Definition 2.9 of MD estimator sequences with good sets .An /n which satisfy the weaker condition limn!1 P# .An / D 1 , and to weaken b n ‰# kH ! 0 in P# -probability, i.e. to a weak the condition SLLN(#) to k‰ law of large numbers WLLN(#). With these changes, Proposition 2.12, Lemma 2.13 and Theorem 2.14 remain valid, and Proposition 2.11 changes to convergence in P# probability instead of P# -almost sure convergence.
2.3 Some Comments on Gaussian Processes In the case where the Hilbert space of Section 2.2 is H D L2 .T , T , / as considered in Assumption 2.1(a), Gaussian processes arise as limits of (weakly convergent b n ‰# / under P# . subsequences of) 'n .‰ 2.16 Definition. Consider a measurable space .T , T / with countably generated -field T , and a symmetric mapping K., / : T T ! R. (a) For finite measures on .T , T /, a real valued measurable process .X t / t2T with the property 8 set < there N 2 T of -measure zero such that is an exceptional L X t1 , : : : , X tl D N 0l , .K.ti , tj /i,j D1,...,l : for arbitrary t1 , : : : , tl 2 T n N , ` 1 is called (centred) -Gaussian with covariance kernel K., /.
69
Section 2.3 Some Comments on Gaussian Processes
(b) A real valued measurable process .X t / t2T with the property ² L X t1 , : : : , X tl D N 0l , .K.ti , tj /i,j D1,...,l for arbitrary t1 , : : : , tl 2 T , ` 1 is called (centred) Gaussian with covariance kernel K., /. 2.17 Examples. (a.i) For T D Œ0, 1/ or T D Œ0, 1, consider standard Brownian motion .B t / t2T with B0 0. By continuity of all paths, B defined on any .0 , A0 , P 0 / is a measurable stochastic process (cf. Exercise 2.1’(b)). By independence of increments, B t2 B t1 having law N .0, t2 t1 / , we have E.B t1 B t2 / D E.B t21 / C E.B t1 .B t2 B t1 // D t1 for t1 < t2 . Hence, writing K.t1 , t2 / :D E.B t1 B t2 / D t1 ^ t2
for all t1 , t2 in T ,
Brownian motion .B t / t2T is a Gaussian process in the sense of Definition 2.16(b) with covariance kernel K., /. (ii) If a finite measure on .T , T / satisfies the condition Z Z Z 2 B t .dt / D K.t , t / .dt / D t .dt / < 1 , E T
T
T
R we can modify the paths of B on the P 0 -null set A :D ¹ T B t2 .dt / D 1º in A0 (put B.t , !/ 0 for all t 2 T if ! 2 A): then Brownian motion .B t / t2T is a measurable process with paths in L2 .T , B.T /, / according to Lemma 2.2. (b.i) Put T D Œ0, 1 and consider Brownian bridge B 0 D B t0 0t1 , B t0 :D B t t B1 , 0 t 1 . Transformation properties of multidimensional normal laws under linear transformations and 0 1 0 1 0 0 1 B t1 1 0 : : : 0 t1 B t1 B ::: C B 0 1 : : : 0 t2 C C, e C @ ::: A D e AB A :D B @ A @ ::: A B tl B t0l B1 0 0 : : : 1 tl yield the finite dimensional distributions of B 0 : hence Brownian bridge is a Gaussian process in the sense of Definition 2.16(b) with covariance kernel K.t1 , t2 / D t1 ^ t2 t1 t2 ,
t1 , t2 2 Œ0, 1 .
(ii) The paths of B 0 being bounded functions on Œ0, 1, Brownian bridge is a measurable stochastic process with paths in L2 .Œ0, 1, B.Œ0, 1/, / (apply Lemma 2.2) for every finite measure on .Œ0, 1, B.Œ0, 1//.
70
Chapter 2 Minimum Distance Estimators
(c.i) Fix a distribution function F , associated to an arbitrary probability distribution on .R, B.R//. Brownian bridge time-changed by F is the process , B t0,F :D BF0 .t/ , t 2 R . B 0,F D B t0,F t2R
All paths of B 0,F being cJadlJag (right continuous with left-hand limits: this holds by construction since F is cJadlJag), B 0,F is a measurable stochastic process (cf. Exercise 2.1’(b)). Using (b), B 0,F is a Gaussian process in the sense of Definition 2.16(b) with covariance kernel K.t1 , t2 / D F .t1 / ^ F .t2 / F .t1 /F .t2 / ,
t 1 , t2 2 R .
(ii) The paths of B 0,F being bounded functions on R, Brownian bridge time-changed by F is a process with paths in L2 .R, B.R/, / by Lemma 2.2, for arbitrary choice of a finite measure . Gaussian processes have been considered since about 1940, together with explicit orthogonal representations of the process in terms of eigenfunctions and eigenvalues of the covariance kernel K., / (Karhunen–Loève expansions). See Loeve [89, vol. 2, Sects. 36–37, in particular p. 144] or [89, 3rd. ed., p. 478], see also Gihman and Skorohod [28, vol. II, pp. 229–230]. 2.18 Theorem. Consider T compact in Rk , equipped with its Borel -field T D B.T /. Consider a mapping K., / : T T ! R which is symmetric, continuous, and non-negative definite in the sense X ˛i K.ti , tj / ˛j 0 for all ` 1, t1 , : : : , t` 2 T , ˛1 , : : : , ˛` 2 R . i,j D1,:::,`
Then for every finite measure on .T , T /, there is a real valued measurable process .X t / t2T which is -Gaussian with covariance kernel K., / as in Definition 2.16(a). Proof. (1) Since the kernel K.., ./ is symmetric and non-negative definite, for arbitrary choice of t1 , : : : , t` in T , ` 1, there is a centred normal law P t1 ,:::,t` with covariance matrix .K.ti , tj //i,j D1,:::,` on .R` , B.R` //, with characteristic function >
R` 3 ! e 2 † , † :D .K.ti , tj //i,j D1,:::,` . By the consistency theorem of Kolmogorov, the family of laws ` ` P t1 ,:::,t` probability measure on X R , ˝ B.R/ , t1 , : : : , t` 2 T , 1
iD1
iD1
being consistent, there exists a unique probability measure P on ., A/ :D X R , ˝ B.R/ t2T
t2T
`1
71
Section 2.3 Some Comments on Gaussian Processes
such that the canonical process X D .X t / t2T on ., A, P / – the process of coordinate projections – has finite dimensional distributions L .X t1 , : : : , X t` / j P D N .0/iD1,:::,` , .K.ti , tj //i,j D1,:::,` , .C/ t 1 , : : : , t` 2 T , ` 1 . Since K.., ./ is continuous, any convergent sequence tn ! t in T makes EP .X tn X t /2 D K.tn , tn / 2K.tn , t / C K.t , t / vanish as n ! 1. In this sense, the process .X t / t2T under P is ‘mean square continuous’. As a consequence, X under P is continuous in probability: .CC/
for convergent sequences tn ! t :
X tn D X t C oP .1/ ,
n!1.
So far, being a canonical process on a product space, X has no path properties. (2) We define T ˝A–measurable approximations .X m /m to the process X . As in Exercise 2.1’, we cover Rk with half-open cubes Al .m/ k lj lj C1 , m Al .m/ :D X , l D .l1 , ..., lk / 2 Zk m 2 j D1 2 according to a k-dimensional dyadic grid with step size 2m in each dimension, and define ƒ.m, T / as the set of all indices l 2 Zk such that Al .m/ \ T ¤ ;. For l 2 ƒ.m, T / we select some point tl .m/ in Al .m/ \ T . Note that for t 2 T fixed and m 2 N, there is exactly one l 2 Zk such that t 2 Al .m/ holds: we write l.t , m/ for this index l. Define T ˝A-measurable processes X X m .t , !/ :D 1Al .m/\T .t / X.tl .m/, !/ , t 2 T , ! 2 , m 1 . l2ƒ.m,T /
Then by (+) in step (1), arbitrary finite dimensional distributions of X m are Gaussian with covariances E X m .t1 /X m .t2 / D K tl.t .m/ , tl.t .m/ , t1 , t2 2 T . 1 ,m/ 2 ,m/ By continuity of K., / and convergence tl.t,m/ .m/ ! t , the finite dimensional distrim butions of X converge as m ! 1 to those of the process X constructed in step (1): the last equation combined with (+) gives for arbitrary t1 , : : : , t` 2 T and ` 1 L .X tm1 , : : : , X tm` / j P ! L .X t1 , : : : , X t` / j P ./ (weak convergence in R` , as m ! 1) .
(3) We fix any finite measure on .T , T / and show that the sequence X m converges e Renormalising , it in L2 .T , T ˝A, ˝P / as m ! 1 to some limit process X. is sufficient to consider probability measures e on .T , T /.
72
Chapter 2 Minimum Distance Estimators
(i) We prove that .X m /m is a Cauchy sequence in L2 .T , T ˝A, e ˝P /. First, m 2 ˝P / since K., / is bounded on the compact every X belongs to L .T , T ˝A, e T T : Z Z m 2 X .t , !/ .e .dt / ˝P /.dt , d!/ D E .X tm /2 e T
T
Œ sup K.t 0 , t 0 / e .T / < 1 . 0 t 2T
Consider now pairs .m, m0 / where m0 > m. By construction of the covering of Rk in step (2) at stages m and m0 , we have for any two indices l 0 and l either Al 0 .m0 / Al .m/ or Al 0 .m0 / \ Al .m/ D ;. Thus Z Z 0 m0 m 2 .dt / jX X j d.e ˝P / D E .X tm X tm /2 e T
takes the form
T
X
e Al 0 .m0 / \ Al .m/ \ T K.tl0 .m0 /, tl0 .m0 //
l 0 2ƒ.m0 ,T / l2ƒ.m,T /
.ı/
2K.tl0 .m0 /, tl .m// C K.tl .m/, tl .m// .
Since K., / is uniformly continuous on the compact T T and since jtl0 .m0 / p tl .m/j k 2m for indices l 0 and l such that Al 0 .m0 / Al .m/, integrands in .ı/ K.tl0 .m0 /, tl0 .m0 // 2K.tl0 .m0 /, tl .m// C K.tl .m/, tl .m// for t 2 Al 0 .m0 / \ Al .m/ vanish as m ! 1 uniformly in t 2 T and uniformly in m0 > m. Hence, e being a finite measure, integrals .ı/ vanish as m ! 1 uniformly in m0 > m, and we have proved Z 0 lim sup jX m X m j2 d.e ˝P / D 0 . 0 m!1 m >m
(ii) For the Cauchy sequence T ˝A, e ˝P / such that .C C C/
T .X m /m
e 2 L2 .T , under e ˝P there is some X
e in L2 .T , T ˝A, e ˝P / as m ! 1 . X m ! X
e is measurable, and we deduce from (+++) convergence X tm ! X e t in In particular, X 2 -almost all t 2 T ; the exceptional e -null set in T arising here L .P / as m ! 1 for e can in general not be avoided. ˝P -almost sure con(4) In (+++), we select a subsequence .mk /k along which e vergence holds: e X mk ! X
e ˝P -almost surely on .T , T ˝A/ as k ! 1 .
Section 2.3 Some Comments on Gaussian Processes
73
Thus there is some set M 2 T ˝A of full measure under e ˝P such that ./
e 1M X mk ! 1M X
pointwise on T as k ! 1 .
With notation M t for t -sections through M M t D ¹! 2 : .t , !/ 2 M º 2 A , t 2 T R we have 1 D .e ˝P /.M / D T P .M t / e .dt /. Hence there is some e -null set N 2 T such that . /
P .M t / D 1 for all t in T n N .
Now the proof is finished: for arbitrary ` 1 and t1 , : : : , t` in T nN , we have pointwise convergence e t1 , : : : , 1M t X e t` , k ! 1 .˘/ 1M t1 X tm1 k , : : : , 1M t` X tm` k ! 1M t1 X ` for all ! 2 by ./, and at the same time –combining . / and ./– weak convergence in Rl .˘˘/ L 1M t1 X tm1 k , : : : , 1M t` X tm` k ! N 0l , K.ti , tj / i,j D1,...,l , k ! 1 . e is -Gaussian with This shows that the (real valued and measurable) process 1M X covariance kernel K., / in the sense of Definition 2.16(a). Frequently in what follows, we will consider -integrals along the path of a Gaussian process. 2.19 Proposition. Consider a -Gaussian process .X t / t2T with covariance kernel K., /, defined on some .0 , A0 , P 0 /. Assume that T is compact in Rk and that K., / is continuous on T T . Then, modifying paths of X on some P 0 -null set N 0 2 A0 if necessary, X is a process with paths in L2 .T , T , /, and we have Z g.t / X t .dt / N .0, 2 / TZ Z g.t1 / K.t1 , t2 / g.t2 / .dt1 /.dt2 / where 2 :D T
for functions g 2
L2 .T , T
T
, /.
Proof. For X -Gaussian, fix an exceptional -null set N 2 T such that whenever t1 , : : : , tr do not belong to N , finite dimensional laws L .X t1 , : : : , X tr j P 0 / are normal laws with covariance matrix .K.ti , tj //i,j D1,:::,r . (1) X on .0 , A0 , P 0 / being real valued and measurable, the set ² ³ Z 0 0 2 X .t , !/ .dt / D C1 2 A0 N :D ! 2 : T
74
Chapter 2 Minimum Distance Estimators
has P -measure zero since under our assumptions Z Z X 2 d.˝P 0 / D EP 0 .X t2 / .dt / Œsup K.t , t / .T / < 1 . T 0
t2T
T
Redefining X.t , !/ D 0 for all t 2 T if ! 2 N 0 , all paths of X are in L2 .T , T , /. l l C1 (2) Cover Rk with half-open intervals Al .m/ :D XjkD1 Œ 2jm , j2m /, l D .`1 , : : : , lk / 2 Zk , as in step (2) of the proof of Theorem 2.18. Differently from the proof of Theorem 2.18, define ƒ.m, T / as the set of all l such that .Al .m/ \ T / > 0. For l 2 ƒ.m, T /, select tl .m/ in Al .m/ \ T such that tl .m/ does not belong to the exceptional set N T , and define X 1Al .m/\T .t / X.tl .m/, !/ , t 2 T , ! 2 , m 1 . X m .t , !/ :D l2ƒ.m,T /
Then all finite dimensional distributions of X m , m 1, are normal distributions; ƒ.m, T / being finite since T is compact, all X m have paths in L2 .T , T , /. For functions g 2 L2 .T , T , /, -integrals Z X g.t / X tm .dt / D X.tl .m// g 1Al .m/\T T
l2ƒ.m,T /
are random variables on .0 , A0 , P 0 / following normal laws 2 where N 0 , m X 2 m :D g 1Al .m/\T K tl .m/, tl0 .m/ g 1Al 0 .m/\T . l,l 0 2ƒ.m,T /
Under our assumptions, jgj.t /.dt / is a finite measure on .T , T /, and the kernel K., / is continuous and bounded on T T . Thus by dominated convergence Z 2 ! g.t1 / K.t1 , t2 / g.t2 / .˝/.dt1 , dt2 / D 2 , m ! 1 m T T
which yields
Z
.C/ T
L g.t / X tm .dt / ! N 0 , 2
as
m!1.
(3) As in the proof of Theorem 2.18 we have for points t1 , : : : , tr in T n N L L .X tm1 , : : : , X tmr / j P 0 ! L .X t1 , : : : , X tr / j P 0 (weak convergence in Rr , as m ! 1) where all laws on the right-hand side are normal laws, and for points t in T n N sup E.jX tm j2 / sup K.t 0 , t 0 / < 1 0 m
t 2T
and
lim E.jX tm j2 / D E.jX t j2 / .
m!1
Section 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences
75
Thus the assumptions of Theorem 2.4(a) are satisfied (here we use the sufficient condition (2.4”) with p D 2 and constant f to establish uniform integrability). Applying Theorem 2.4(a) we obtain L
Xm ! X
(weak convergence in L2 .T , T , / as m ! 1) ,
from which the continuous mapping theorem (see Exercise 2.3’) gives weak convergence of integrals Z Z L m g.t / X t .dt / ! g.t / X t .dt / .CC/ T T (weak convergence in R as m ! 1) .
Comparing (++) to (+), the assertion is proved.
2.4
Asymptotic Normality for Minimum Distance Estimator Sequences
In this section, we continue the approach of Section 2.2: based on the representation of Theorem 2.14 of rescaled MD estimation errors, the results of Sections 2.3 and 2.1 – with H D L2 .T , T , / as Hilbert space – will allow to prove asymptotic normality. Our set of Assumptions 2.20 provides additional structure for the empirical quantib n from which MD estimators were defined, merging the Assumptions 2.1 of ties ‰ Section 2.1 with the Assumptions 2.8(I) of Section 2.2. We complete the list of Assumptions 2.8(III) by introducing an asymptotic normality condition AN(#). The main result of this subsection is Theorem 2.22. 2.20 Assumptions and Notations for Section 2.4. (a) T is compact in Rk , T D B.T /, and H :D L2 .T , T , / for some finite measure on .T , T /. For ‚ Rd open, ¹P# : # 2 ‚º is a family of probability measures on some ., A/; we have an increasing family of sub- -fields .Fn /n in A, and Pn,# :D P# jFn is the restriction of P# to Fn . We consider the sequence of experiments En :D ., Fn , ¹Pn,# : # 2 ‚º/ ,
n1,
b n of Fn -measurable H -valued random variables and have a sequence ‰ bn : ‰
., Fn / ! .H , B.H // ,
n1
76
Chapter 2 Minimum Distance Estimators
together with a deterministic family ¹‰ : 2 ‚º in H such that .C/
the mapping ‚ 3 ! ‰ 2 H is continuous.
Specialising to compacts in Rk , this unites the Assumptions 2.1(a) and 2.8(I). b n , n 1: there is a measurable (b) We now impose additional spatial structure on ‰ process X n : .T , T ˝Fn / ! .R, B.R//
with paths in H D L2 .T , T , /
b n in (a) arises as path (cf. Lemma 2.2) such that ‰ b n .!/ :D Xn .!/ D X n ., !/ , ‰
!2
of X n , at every stage n 1 of the asymptotics. The spatial structure assumed in (b) was already present in Example 2.7. We complete the list of Assumptions 2.8(III) by strengthening the tightness condition T(#) in 2.8(III): 2.21 Asymptotic Normality Condition AN(# ). Following Notations and Assumptions 2.20, there is a sequence of norming constants 'n D 'n .#/ " 1 such that processes W n :D 'n X n ‰# , n 1 , under P# have the following properties: there is a kernel K., / : T T ! R
symmetric, continuous and non-negative definite,
an exceptional set N 2 T of -measure zero such that for arbitrary points t1 , : : : , tl in T n N , l 1 , L L .W tn1 , : : : , W tnl / j P# ! N 0 , .K.ti , tj //1i,j l (weak convergence in Rl , n ! 1) , and some function f 2 L1 .T , T , / such that for -almost all t in T E. jW tn j2 j P# / f .t / < 1 ,
n 1,
lim E. jW tn j2 j P# / D K.t , t / .
n!1
Now we state the main theorem on minimum distance estimators. It relies directly on Proposition 2.11 and on the representation of rescaled estimation errors in Theorem 2.14. 2.22 Theorem. Under Assumptions 2.20, let for every # 2 ‚ the set of conditions SLLN(#), I(#), D(#), AN(#), with 'n D 'n .#/ " 1
Section 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences
77
hold. Then any minimum distance estimator sequence .#n /n for the unknown parameter # 2 ‚ defined according to Definition 2.9 is (strongly consistent and) asymptotically normal. We have for every # 2 ‚ 1 ! N 0 , ƒ1 L 'n .#n #/ j P# # „# ƒ# (weak convergence in Rd , as n ! 1), with a d d matrix „# having entries Z Z Di ‰# .t1 / K.t1 , t2 / Dj ‰# .t2 / .dt1 /.dt2 / .„# /i,j D .C/ T T 1 i, j d , and with ƒ# as defined in Theorem 2.14: ˛ ˝ .ƒ# /i,j D Di ‰# , Dj ‰# ,
1 i, j d .
Proof. Fix # 2 ‚. T being compact in Rk , for any finite measure on .T , T / and for any covariance kernel K., / on T T which is symmetric, continuous and nonnegative definite, there exists a real valued measurable process W which is -Gaussian with covariance kernel K., /, by Theorem 2.18. Depending on #, W D W .#/ is defined on some space .0 , A0 , P 0 /. As in the proof of Proposition 2.19, W has paths in L2 .T , T , / (after modification of paths on some P 0 -null set in A0 ). b n D Xn as in Assumption 2.20(b), and W n D 'n .X n ‰# / under P# (1) For ‰ as in Condition 2.21, we have an exceptional set N 2 T of -measure zero such that for arbitrary points t1 , : : : , tl in T n N , l 1 , L L .W tn1 , : : : , W tnl / j P# ! N 0 , .K.ti , tj //1i,j l (weak convergence in Rl , n ! 1) by assumption AN(#). Defining N 0 2 T as the union of N with the exceptional -null set which is contained in the definition of W as a -Gaussian process (cf. Definition 2.16), we have for t1 , : : : , tl in T n N 0 L L .W tn1 , : : : , W tnl / j P# ! L .W t1 , : : : , W tl / j P# (weak convergence in Rl , n ! 1). Thus we have checked assumption (a.i) of Theorem 2.4. Next, again by assumption AN(#), there is some function f 2 L1 .T , T , / such that for -almost all t 2 T E. jW tn j2 j P# / f .t / < 1 ,
n 1,
lim E. jW tn j2 j P# / D K.t , t / .
n!1
This is assumption (a.ii) of Theorem 2.4, via condition (2.4”); hence Theorem 2.4 establishes L
Wn ! W
(weak convergence in L2 .T , T , /, as n ! 1) .
78
Chapter 2 Minimum Distance Estimators
Write h., .i for the scalar product in L2 .T , T , /. Then the continuous mapping theorem gives L
h g, Wn i ! h g, W i
.˘/
(weakly in R, as n ! 1)
for any g 2 L2 .T , T , / where according to Proposition 2.19 Z Z 0 .˘˘/ L hg, W i j P D N 0, g.t1 / K.t1 , t2 / g.t2 / .dt1 /.dt2 / . T
T
(2) By assumption D(#), the components D1 ‰# , : : : , Dd ‰# of the derivative D‰# are elements of H D L2 .T , T , /. If we apply .˘/+.˘˘/ to g :D h> D‰# , h 2 Rd arbitrary, Cramér–Wold yields 1 00 1 1 1 00 hD1 ‰# , W i hD1 ‰# , Wn i A j P# A ! L @@ A j P 0 A D N . 0 , „# / ::: ::: L @@ n hDd ‰# , W i hDd ‰# , W i (weak convergence in Rd , as n ! 1) since the covariance matrix „# in (+) has entries Z Z Di ‰# .s/ K.s, t / Dj ‰# .t / .ds/.dt / D E hDi ‰# , W i , hDj ‰# , W i , T
T
i , j D 1, : : : , d . To conclude the proof, it is sufficient to combine the last convergence with the representation of Theorem 2.14 'n .#n #/ D …# Wn C oP# .1/ , n ! 1 of rescaled MD estimation errors. To apply Theorem 2.14, note that assumption AN(#) – through step (1) above and the continuous mapping theorem – establishes L weak convergence kWn k ! kW k in R as n ! 1, and thus in particular tightness as required in condition T(#). In Example 2.7, we started to consider MD estimator sequences based on the empirical distribution function for i.i.d. observations. In Lemma 2.23 and Example 2.24 below, we will put this example in rigorous terms. 2.23 Lemma. Let Y1 , Y2 , : : : denote i.i.d. random variables taking values in Rk , with b n : .t , !/ ! F b n .t , !/ continuous distribution function F : Rk ! Œ0, 1. Let F denote the empirical distribution function based on the first n observations. Consider T compact in Rk , with Borel- -field T , and a finite measure on .T , T /. Write p bn F , n 1 W n :D n F
Section 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences
79
and consider the kernel K.s, t / D F .s ^ t / F .s/F .t / , s, t 2 Rk with minimum taken componentwise in Rk : s ^ t D .si ^ ti /1ik . Then all requirements of the asymptotic normality condition AN(#) are satisfied, there is a measurable process W which is -Gaussian with covariance kernel K., /, and we have Wn ! W
(weak convergence in L2 .T , T , /, as n ! 1) .
In the case where d D 1, W is Brownian bridge B 0,F time-changed by F . Proof. (1) For t , t 0 2 T , with minimum t ^ t 0 defined componentwise in Rk , we have
K.t , t 0 / D F .t ^ t 0 / F .t /F .t 0 / D EF 1.1,t .Y1 / F .t / 1.1,t 0 .Y1 / F .t 0 / with intervals in Rk written according to Assumption 2.1(c), and X l l X
0 0 ˛i K.ti , ti / ˛i D VarF ˛i 1.1,ti .Y1 / F .ti / 0 i,i 0 D1
iD1
for arbitrary l 1, ˛1 , : : : , ˛l in R, t1 , : : : , tl in T . The distribution function F being continuous, the kernel K., / : T T ! Œ0, 1 is thus symmetric, continuous and nonnegative definite. By Theorem 2.18, a -Gaussian process W with covariance kernel K., / exists. p b n .t , !/ (2) Let .Yi /i1 be defined on some ., A, P /, and put W n .t , !/ :D n.F F .t //. We prove that for arbitrary t1 , : : : , tl in T , l 1, convergence of finite dimensional distributions .C/ L .W tn1 , : : : , W tnl / j P ! N .0, †/ , † :D K.ti , tj / i,j D1,:::,l (weakly in Rl as n ! 1) holds. Using Cramér–Wold we have to show weak convergence in R X l n ˛i W ti j P ! N . 0 , ˛ > † ˛ / , n ! 1 L iD1
for arbitrary ˛ D .˛1 , : : : , ˛l / 2 Rl . By definition of W n and by the central limit theorem we have l X iD1
1 X ˛i W tni D p Rj ! N . 0 , Var.R1 / / n n
j D1
with centred i.i.d. random variables Rj :D
l X iD1
˛i 1.1,ti .Yj / F .ti / ,
j 2N.
80
Chapter 2 Minimum Distance Estimators
By step (1) we have Var.R1 / D ˛ > † ˛ : this proves .C/. (3) A simple particular case of the above, with f .t / :D K.t , t / D F .t /.1 F .t //, is E jW tn j2 D f .t / for all t , n, and E jW t j2 D f .t / for all t ; f being continuous and T compact, we have f 2 L1 .T , T , / . Hence all requirements of assumption AN(#) are satisfied. Thus Theorem 2.4 applies –using condition (2.4”) as a sufficient condition for condition (a.ii) in Theorem 2.4 – and gives weak convergence of paths Wn ! W in L2 .T , T , /. MD estimator sequences based on the empirical distribution function behave as follows. 2.24 Example (Example 2.7 continued). For ‚ Rd open, consider .Yi /i1 i.i.d. observations in Rk , with continuous distribution function F# : Rk ! Œ0, 1 under # 2 ‚. Fix T compact in Rk , T D B.T /, some finite measure on .T , T /, and assume the following: for all # 2 ‚, .˘/
the parameterisation ‚ 3 # ! ‰# :D F# 2 H D L2 .T , T , / satisfies I(#) and D(#) .
Then MD estimator sequences based on the empirical distribution function .t , !/ ! b n .t , !/ F b n F ./ kL2 ./ , #n D arginf k F
.C/
2‚
n1
(according to Definition 2.9; a particular construction may be the construction of Proposition 2.10) are strongly consistent and asymptotically normal for all # 2 ‚: p 1 ! N 0 , ƒ1 (weakly in Rd , as n ! 1) n .#n #/ j P# L # „# ƒ# where „# and ƒ# take the form Z Z .ƒ# /i,j D Di F# .t1 / Dj F# .t2 / .dt1 /.dt2 / , 1 i , j d , Z Z T T Di F# .t1 / Œ F# .t1 ^ t2 / F# .t1 /F# .t2 / Dj F# .t2 / .dt1 /.dt2 / , .„# /i,j D T
T
1 i, j d with ‘min’ taken componentwise in Rk . For the proof of these statements, recall from Example 2.7 that condition SLLN(#) holds as a consequence of Glivenko–Cantelli, cf. ./ in Example 2.7. I(#) holds by .˘/ assumed above, hence Proposition 2.11 gives strong consistency of any version of the MD estimator sequence (+). Next, Lemma 2.23 establishes condition AN(#) with covariance kernel K.t1 , t2 / D F# .t1 ^
Section 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences
81
t2 / F# .t1 /F# .t2 /. Since D(#) holds by assumption .˘/, Theorem 2.22 applies and yields the assertion. 2.24’ Exercise. In dimension d D 1, fix a distribution function F on .R, B.R// which admits a continuous and strictly positive Lebesgue density f on R, and consider as a particular case of Example 2.24 a location model .R, B.R/, ¹F# : # 2 ‚º/ ,
F# :D F . #/ , f# :D f . #/
where ‚ is open in R. Check that for any choice of a finite measure on .T , T /, T compact in R, assumptions I(#) and D(#) are satisfied (with DF# D f# ) for all # 2 ‚.
MD estimators in i.i.d. models may be defined in many different ways, e.g. based on empirical Laplace transforms when tractable expressions for the Laplace transforms under # 2 ‚ are at hand, from empirical quantile functions, from empirical characteristic functions, and so on. The next example is from Höpfner and Rüschendorf [58]. 2.25 Example. Consider real valued i.i.d. symmetric stable random variables .Yj /j 1 with characteristic function ˛ u ! E.˛,/ e i uY1 D e juj , u 2 R and estimate both stability index ˛ 2 .0, 2/ and weight parameter 2 .0, 1/ from the first n observations Y1 , : : : , Yn . Put # D .˛, /, ‚ D .0, 2/ .0, 1/. Symmetry of L.Y jP# / – the characteristic functions being real valued P – allows to work with the real parts of the empirical characteristic functions n1 jnD1 e i uYj only, so we put X b n .u/ D 1 ‰ cos.uYj / , n n
‰# .u/ :D e juj , ˛
# 2‚.
j D1
Fix a sufficiently large compact interval T , symmetric around zero and including open neighbourhoods of the points u D ˙1, T D B.T /, and a finite measure on .T , T /, symmetric around zero such that Z " j log uj2 .du/ < 1 for all " > 0 , .C/ "
and assume in addition: 8 < for some open neighbourhood U of the point u D 1, the density of the -absolutely continuous part of in restriction to U .˘/ : is strictly positive and bounded away from zero. Then any MD estimator sequence based on the empirical characteristic function b #n :D arginf ‰ n ‰# 2 #2‚
L .T ,T ,/
82
Chapter 2 Minimum Distance Estimators
according to Definition 2.9 (a particular construction may be the construction in Proposition 2.10) is strongly consistent and asymptotically normal for all # 2 ‚: p 1 (weakly in R2 , as n ! 1) L n .#n #/ j P# ! N 0 , ƒ1 # „# ƒ# where „# and ƒ# take the form Z Z Di ‰# .u/ Dj ‰# .v/ .du/.dv/ , .ƒ# /i,j D T T Z Z ‰# .u C v/ C ‰# .u v/ Di ‰# .u/ ‰# .u/‰# .v/ .„# /i,j D 2 T T Dj ‰# .v/ .du/.dv/ for 1 i , j 2. The proof for these statements is in several steps. (1) We show that .˘/ establishes the identifiability condition I(#) for all # 2 ‚. Write CN for the compact Œ0, 2 Œ N1 , N in R2 . Fix # 2 ‚. Fix some pair ."0 , N0 / such that the ball B"0 .#/ is contained in the interior of CN0 . For N N0 sufficiently large, and for parameter values # 0 D .˛ 0 , 0 / for which N1 0 N , we introduce linearisations at u D 1 e # 0 .t / :D e 0 Œ1 ˛ 0 0 .t 1/ ‰
on
e :D Br .1/ U
1 of the characteristic functions ‰# 0 ./ , where 0 < r < 4N is a radius sufficiently small 0 e .˛0 , 0 / can be extended e D Br .1/. For fixed, the family ‰ such that .˘/ holds on U :D je . Then, CN being compact and to include limits ˛ 0 D 0 or ˛ 0 D 2. We put e U the mapping e # 0 2 L2 .U e, e / CN 3 # 0 ! ‰
continuous, we have for every 0 < " < "0 ° ± 0 0 e# 2 min e : # 2 C , j# #j " >0. ‰# 0 ‰ N L .e U ,e / This last inequality is identically rewritten in the form ° ± 1 0 0 0 0 0 e# 2 inf e : # D .˛ , / 2 ‚ , N , j# #j " > 0. ‰# 0 ‰ L .e U ,e / N e # is uniformly on U e separated from all functions ‰ e # 0 with 0 < 1 or 0 > N Since ‰ N if N is large, the last inequality can be extended to ° ± 0 0 0 0 e# 2 inf e : # D .˛ , / 2 ‚ , j# #j " > 0. ‰# 0 ‰ L .e U ,e / Now we repeat exactly the same reasoning with the characteristic functions ‰# 0 , ‰# e , instead of their linearisations ‰ e# 0 , ‰ e # at u D 1 which we considered restricted to U so far, and obtain ° ± 0 0 inf k‰# 0 ‰# kL2 .e : # 2 ‚ , j# #j " > 0. U ,e /
83
Section 2.4 Asymptotic Normality for Minimum Distance Estimator Sequences
Since there is some constant c > 0 such that .dt / .je /.dt / c e .dt /, by U e to T in the last assertion and obtain a fortiori assumption .˘/, we may pass from U ¯ ® inf k‰# 0 ‰# kL2 .T ,T ,/ : # 0 2 ‚ , j# 0 #j " > 0 . Since " > 0 was arbitrary, this is the identifiability condition I(#). (2) Following Prakasa Rao [107, Proposition 8.3.1], condition SLLN(#) holds for all # 2 ‚: We start again from the distribution functions F# associated to ‰# , and Glivenko– Cantelli. Fix # 2 ‚ and write A# 2 A for some set of full P# -measure such b n .t , !/ F# .t /j ! 0 as n ! 1 when ! 2 A# . In particular that sup t2R jF b F n .!, t / ! F# .t / at continuity points t of F# , hence empirical measures associated to .Y1 , : : : , Yn /.!/ converge weakly to F# , for ! 2 A# . This gives pointwise b n .t , !/ ! ‰# .t / for t 2 R convergence of the associated characteristic functions ‰ when ! 2 A# . By dominated convergence on T with respect to the finite measure this establishes SLLN(#). (3) Condition (+) is a sufficient condition for differentiability D(#) at all points # 2 ‚: By dominated convergence under (+) assumed above Z " j log uj2 .du/ < 1 for " > 0 arbitrarily small "
the mapping ‚ 3 ! ‰# 2 L2 .T , T , / is differentiable at #, and the derivative D‰# has components D ‰# .u/ D juj˛ ‰# .u/ ,
D˛ ‰# .u/ D .log juj/ juj˛ ‰# .u/ ,
u2R
which are linearly independent in L2 .T , T , /. (4) For all # 2 ‚, the asymptotic normality condition AN(#) holds with covariance kernel 1 K# .u, v/ :D . ‰# .u C v/ C ‰# .u v/ / ‰# .u/‰# .v/ , u, v 2 R: 2 Fix # 2 ‚. Proceeding as in the proof of Lemma 2.23, for any collection of points u1 , : : : , ul 2 T , we replace the random variable Rj defined there by Rj :D
l X
˛i cos.ui Yj / ‰# .ui / .
iD1
Calculating the variance of R1 under #, we arrive at a kernel .u, v/ ! E# .Œcos.uY1 / ‰# .u/ Œcos.vY1 / ‰# .v// . As a consequence of cos.x/ cos.y/ D K., / defined above.
1 2
.cos.x C y/ C cos.x y//, this is the kernel
84
Chapter 2 Minimum Distance Estimators
(5) Now we conclude the proof: we have conditions SLLN(#)+I(#)+D(#)+AN(#) by steps (1)–(4) above, so Theorem 2.22 applies and yields the assertion. 2.25’ Exercise. For the family P D ¹ .a, p/ : a > 0 , p > 0º of Gamma laws on .R, B.R// with densities p a a1 px x e , fa,p .x/ D 1.0,1/ .x/ .a/ for i.i.d. observations Y1 , Y2 , : : : from P , construct MD estimators for .a, p/ based on empirical Laplace transforms n 1 X Yi Œ0, 1/ 3 ! e 2 Œ0, 1 n i D1
and discuss their asymptotics (hint: to satisfy the identifiability condition, work with measures on compacts T D Œ0, C which satisfy .dy/ c dy in restriction to small neighbourhoods Œ0, "/ of 0C ).
We conclude this chapter with some remarks. Many classical examples for MD estimator sequences are given in Millar [100, Chap. XIII]. Also, a large number of extensions of the method exposed here are possible. Kutoyants [79] considers MD estimator sequences for i.i.d. realisations of the same point process on a fixed spatial window, in the case where the spatial intensity is parameterised by # 2 ‚. Beyond the world of i.i.d. observations, MD estimator sequences can be considered in ergodic Markov processes when the time of observation tends to infinity, see a large number of diffusion process examples in Kutoyants [80], or in many other stochastic process models.
Chapter 3
Contiguity
Topics for Chapter 3: 3.1 Le Cam’s First and Third Lemma Notations: likelihood ratios in sequences of binary experiments 3.1 d Some conventions for sequences of R -valued random variables 3.1’ Definition of contiguity 3.2 Contiguity and R-tightness of likelihood ratios 3.3–3.4 Le Cam’s first lemma: statement and interpretation 3.5–3.5’ Le Cam’s third lemma: statement and interpretation 3.6–3.6’ Example: mean shift when limit laws are normal 3.6” 3.2 Proofs for Section 3.1 and some Variants An "-ı-characterisation of contiguity 3.7 Proving Proposition 3.3’ 3.7’–3.8 Proof of Theorem 3.4(a) 3.9 One-sided contiguity and Le Cam’s first lemma 3.10–3.12 Proof of Theorem 3.4(b) 3.13 Proving LeCam’s first lemma 3.14–3.15 One-sided contiguity and Le Cam’s third lemma 3.16 Proving LeCam’s third lemma 3.17 Proof of Proposition 3.6” 3.18 Exercises: 3.5”, 3.5”’, 3.5””, 3.10’, 3.18’ This chapter discusses the notion of contiguity which goes back to Le Cam, see Hájek and Sidák [41], Le Cam [81], Roussas [113], Strasser [121], Liese and Vajda [87], Le Cam and Yang [84], van der Vaart [126], and is of crucial importance in the context of convergence of local models. Mutual contiguity considers sequences of likelihood ratios whose accumulation points in the sense of weak convergence have the interpretation of a likelihood ratio between two equivalent probability laws. Section 3.1 fixes setting and notations, and states two key results on contiguity, termed ‘Le Cam’s first lemma’ and ‘Le Cam’s third lemma’ (3.5 and 3.6 below) since Hájek and Sidák [41, Chap. 6.1]. Section 3.2 contains the proofs together with some variants of the main results.
86
Chapter 3 Contiguity
3.1 Le Cam’s First and Third Lemma For two -finite measures P and Q on the same ., A/, P is absolutely continuous with respect to Q (notation P 0 there is some ı D ı."/ > 0 such that .C/
lim sup Qn .An / < ı n!1
H)
lim sup Pn .An / < " n!1
holds for arbitrary sequences of events An in An , n 1. Proof. (1) We prove that condition (+) is sufficient for contiguity .Pn /n C .Qn /n . To " > 0 arbitrarily small associate ı D ı."/ > 0 such that (+) holds. Fix any sequence of events An in An , n 1, with the property lim Qn .An / D 0 . For this sequence, n!1
condition (+) makes sure that we must have lim Pn .An / D 0 too: this is contiguity n!1
.Pn /n C .Qn /n as in Definition 3.2. (2) We prove that (+) is a necessary condition. Let us assume that for some " > 0, irrespectively of the smallness of ı > 0, an implication like (+) never holds true. In this case, considering in particular ı D k1 , k 2 N arbitrarily large, there are sequences of events .Akn /n such that lim sup Qn .Akn / < n!1
1 holds together with lim sup Pn .Akn / " . n!1 k
93
Section 3.2 Proofs for Section 3.1 and some Variants
From this we can select a sequence .nk /k increasing to 1 such that for every k 2 N we have Pnk .Aknk / > "2 together with Qnk .Aknk / < k2 . Using Definition 3.2 along the subsequence .nk /k , contiguity .Pnk /k C .Qnk /k does not hold. A fortiori, as an easy consequence of Definition 3.2, contiguity .Pn /n C .Qn /n does not hold. We use the "-ı-characterisation of contiguity to prove Proposition 3.3’. 3.7’ Proof of Proposition 3.3’. We have to show that any sequence of Rd - valued random variables .Yn /n which is tight under .Qn /n remains tight under .Pn /n when contiguity .Pn /n C .Qn /n holds. Let " > 0 be arbitrarily small. Assuming contiguity .Pn /n C .Qn /n , we make use of Proposition 3.7 and select ı D ı."/ > 0 such that the implication (+) in Proposition 3.7 is valid. From tightness of .Yn /n under .Qn /n , there is some large K D K.ı/ < 1 such that lim sup Qn .jYn j > K/ < ı , n!1
from which we deduce thanks to (+) lim sup Pn .jYn j > K/ < " . n!1
Since " > 0 was arbitrary, Proposition 3.3’ is proved.
We prepare the next steps by some comments on the conventions in Notations 3.1’. d
3.8 Remarks. With Notations 3.1’, consider on .n , An , Qn / R -valued random b n :D Xn 1¹jX j K/ D 0 .
K "1
n!1
(b) For probability measures F on .Rd , B.Rd //, L .Xn j Qn / ! F
(weakly in Rd as n ! 1)
as defined in Notation 3.1’(ii) is equivalent to the following condition . /: Z Z . / for all f 2 Cb .Rd / : lim f .Xn / 1¹jXn j 0 such that Z n 1 , An 2 An , Qn .An / < ı H)
An
jXn j dQn < " .
According to the definition in Notation 3.1’(iii), this is proved exactly as in the usual case of real valued random variables living on some fixed probability space. By Proposition 3.3 which was proved in Section 3.1, the sequence of likelihood ratios .Ln /n is R-tight under .Qn /n , without further assumptions. Now we can prove part (a) of Theorem 3.4: contiguity .Pn /n C .Qn /n is equivalent to R-tightness of .Ln /n under both .Pn /n and .Qn /n . 3.9 Proof of Theorem 3.4(a). (1) To prove (i)H)(ii) of Theorem 3.4(a), we assume .Pn /n C .Qn /n and have to verify – according to Remark 3.8(a) – that the following holds true: lim lim sup Pn .Ln 2 ŒK, 1/ D 0 . K "1
n!1
If this assertion were not true, we could find some " > 0, a sequence Kj " 1, and a sequence of natural numbers nj increasing to 1 such that Pnj Lnj 2 ŒKj , 1 > " for all j 1 . By Proposition 3.3 we know that .Ln /n under .Qn /n is R-tight, thus we know lim lim sup Qn Ln 2 ŒKj , 1 D 0 . j !1
n!1
Combining the last two formulas, there would be a sequence of events Am 2 Am Am :D ¹Lnj 2 ŒKj , 1º in the case where m D nj , and Am :D ; else with the property lim Qm .Am / D 0 ,
m!1
lim sup Pm .Am / " m!1
in contradiction to the assumption .Pn /n C .Qn /n . This proves (i)H)(ii) of Theorem 3.4(a). (2) To prove (ii)H)(i) in Theorem 3.4(a), we assume R-tightness of .Ln /n under .Pn /n and have to show that contiguity .Pn /n C .Qn /n holds true. To prove this, we
95
Section 3.2 Proofs for Section 3.1 and some Variants
start from a sequence of events An 2 An with the property lim Qn .An / D 0 and n!1 write Pn .An / D Pn .An \ ¹1 Ln > Kº/ C Pn .An \ ¹Ln Kº/ Z Ln dQn Pn .1 Ln > K/ C An \¹Ln Kº
Pn .1 Ln > K/ C K Qn .An / . This gives lim sup Pn .An / lim sup Pn .1 Ln > K/ n!1
n!1
where by R-tightness of .Ln /n under .Pn /n and by Remark 3.8(i), the righthand side can be made arbitrarily small by suitable choice of K. This gives limn!1 Pn .An / D 0 and thus proves (ii)H)(i) of Theorem 3.4(a). Preparing for the proof of Theorem 3.4(b) which will be completed in Proof 3.13 below, we continue to give characterisations of contiguity .Pn /n C .Qn /n . 3.10 Proposition. The following assertions are equivalent: (i) .Pn /n C .Qn /n ; (ii) the sequence .Ln /n under .Qn /n is uniformly integrable, and we have limn!1 Pn .Ln D 1/ D 0 ; (iii) the sequence .Ln /n under .Qn /n is uniformly integrable, and we have limn!1 EQn .Ln / D 1 . Proof. (1) Assertions (ii) and (iii) are equivalent since Pn .Ln D 1/ C EQn .Ln / D 1 ,
n1
as a direct consequence of the Lebesgue decomposition in Notation 3.1(i) of Pn with respect to Qn : here EQn .Ln / is the total mass of the Qn -absolutely continuous part of Pn , and Pn .Ln D 1/ the total mass of the Qn -singular part of Pn . (2) For 0 < K < 1 arbitrarily large we can write Z Pn .K < Ln 1/ D Pn .Ln D 1/ C 1¹K 0 arbitrary, in combination with the last lines gives Z Z eL .d l, dx/ ; f .Ln , Xn / Ln dQn D f .l, x/ l F lim n!1
FL being concentrated on Œ0, 1/, the limit is Z Z e eL l, dx/ . L f .l, x/ .l _ 0/ F .d l, dx/ D f .l, x/ G.d (3) Second, we prove for f ., / as above Z Z eL l, dx// f .l, x/ G.d f .Ln , Xn / dPn !
3
3
where
.Ln , Xn / D .Ln , Xn / 1¹Ln m D . 1/. 2 / z i i 2 k
iD1
X 1 1 1 zi i C 2 .z e0 />ƒ .z e0 / D z>ƒ z 2 2 2 2 k
iD1
as
8 2 9 3 k < = X 1 1 z ! exp 4 .C 2 / C zi .i Ci /5 C z>ƒ z : ; 2 2 iD1
or equivalently in compact notation ± ° 1 e C z>ƒ z z ! exp z> m 2
where
m e :D
C 12 2 C
.
This is the Laplace transform of N .e m, ƒ/. By the uniqueness theorem for Laplace e , dx/ :D transforms (see e.g. [4, Chap. 10.1]), we have identified the law G.d e .d , dx/, for F e of step (1), as e F 1 2 2 > C2 e D N , G C † which proves the second assertion of Proposition 3.6”.
3.18’ Exercise. Write M for the space of all piecewise constant right-continuous jump functions f : Œ0, 1/ ! N0 such that f has at most finitely many jumps over compact time intervals, all these with jump height C1, and starts from f .0/ D 0. Write t for the coordinate projections t .f / D f .t /, t 0, f 2 M . Equip M with the -field generated by the coordinate projections, consider the filtration F D .F t / t0 where F t :D .r : 0 r t /, and call D . t / t0 the canonical process on .M , M, F /. For > 0, let P ./ denote the (unique) probability law on .M , M/ such that the canonical process . t / t0 is a Poisson process with parameter . Then (e.g. [14, p. 165]) the process 0
Lt = :D
0
t
® ¯ exp .0 / t , t 0
is the likelihood ratio process of P .0 / relative to P ./ with respect to F , i.e., for all t 2 Œ0, 1/, 0
Lt = is a version of the likelihood ratio of P .0 /jF t relative to P ./jF t where jF t denotes restriction of a probability measure on .M , M/ to the sub- -field F t . Accepting this as background (as in Jacod [63], Kabanov, Liptser and Shiryaev [68], and Brémaud [14]), prove the following.
107
Section 3.2 Proofs for Section 3.1 and some Variants
(1) For fixed reference value > 0, reparameterising the family of laws on .M , M, F / with respect to as # ! P .e # / , # 2 ‚ :D R ,
.˘/
the likelihood ratio process takes the form which has been considered in Exercise 1.16’: ° ± # = # :D exp # .e 1/ t , t 0. Le t t For t fixed, this is an exponential family in # with canonical statistic t . Specify a functional : ‚ ! .0, 1/ and an estimator T t : M ! R for this functional 1 .#/ :D e # , T t :D t . t By classical theory of exponential families, T t is the best estimator for in the sense of uniformly minimum variance within the class of unbiased estimators (e.g. [127, pp. 303 and 157]). (2) For some h 2 R which we keep fixed as n ! 1, define sequences of probability laws Qn :D P ./jFn ,
Pn :D P .e h=
p n
/jFn ,
n1
and prove contiguity .Pn /n CB .Qn /n via Lemma 3.5: for this, use a representation of loglikelihood ratios ƒn of Pn with respect to Qn in the form p 1 h n ƒn D h p .n n/ e h= n 1 p n n and weak convergence in R 1 L. ƒn j Qn / ! N . h2 , h2 / , n ! 1 . 2 p (3) Writing Xn :D n.Tn /, the reference point being fixed according to the reparameterisation .˘/, extend the last result to joint convergence in R2 1 2 2 2 h h h L. .ƒn , Xn / j Qn / ! N , 0 h such that by Le Cam’s Third Lemma 3.6
L. .ƒn , Xn / j Pn / ! N
C 12 h2 h
2 h h , . h
In particular, this gives .˘˘/
L. Xn h j Pn / ! N . 0 , / .
(4) Deduce from .˘˘/ the following ‘equivariance’ property of the estimator Tn in shrinking neighbourhoods of radius O. p1n / – in the sense of the reparameterisation .˘/ above – of the p
reference point , using approximations e h= n D 1 C phn C O. n1 / as n ! 1: for every h 2 R fixed, we have weak convergence p p p L n.Tn e h= n / j P .e h= n / ! N . 0 , / as n ! 1 where the limit law does not depend on h. This means that on small neighbourhoods with radius O. p1n / of a fixed reference point , asymptotically as n ! 1, the estimator Tn identifies true parameter values with the same precision.
Chapter 4
L2-differentiable Statistical Models
Topics for Chapter 4: 4.1 Lr -differentiability when r 1 Lr -differentiability in dominated families 4.1–4.1” Lr -differentiability in general 4.2–4.2’ Example: one-parametric paths in non-parametric models 4.3 Lr -differentiability implies Ls -differentiability for 1 s r 4.4 Lr -derivatives are centred 4.5 Score and information in L2 -differentiable families 4.6 L2 -differentiability and Hellinger distances locally at # 4.7–4.9 4.2 Le Cam’s Second Lemma for i.i.d. observations Assumptions for Section 4.2 4.10 Statement of Le Cam’s second lemma 4.11 Some auxiliary results 4.12–4.14 Proof of Le Cam’s second lemma 4.15 Exercises: 4.1”’, 4.9’ This chapter, closely related to Section 1.1, generalises the notion of ‘smoothness’ of statistical models. We define it in an L2 -sense which e.g. allows us to consider families of laws which are not pairwise equivalent, or where log-likelihood ratios for ! fixed are not smooth functions of the parameter. To the new notion of smoothness corresponds a new and more general definition of score and information in a statistical model. In the new setting, we will be able to prove rigorously – and under weak assumptions – quadratic expansions of log-likelihood ratios, valid locally in small neighbourhoods of fixed reference points #. Later, such expansions will allow to prove assertions very similar to those intended – with questionable heuristics – in Section 1.3. Frequently called ‘second Le Cam lemma’, expansions of this type can be proved in large classes of statistical models – for various stochastic phenomena observed over long time intervals or in growing spatial windows – provided the parameterisation is L2 -smooth, and provided we have strong laws of large numbers and corresponding martingale convergence theorems. For i.i.d. models, we present a ‘second Le Cam lemma’ in Section 4.2 below.
Section 4.1 Lr -differentiable Statistical Models
4.1
109
Lr -differentiable Statistical Models
We start with models which are dominated, and give a preliminary definition of Lr differentiability of a statistical model at a point #. The domination assumption will be removed in the sequel. 4.1 Motivation. We consider a dominated experiment . , A , P :D ¹P# : # 2 ‚º / , with densities # :D
dP# , d
‚ Rd open , # 2‚. 1=r
Let r 1 be fixed. It is trivial that for all # 2 ‚, # (1) In the normed space Lr ., A, /, the mapping .˘/
‚ 3
1=r
!
P V r L . / j #j
j #j ! 0 .
(2) We prove: in a statistical model P , (i) and (ii) imply the assertion ./
e #,i D 0 -almost surely on ¹# D 0º , 1 i d . V
To see this, consider sequences . n /n ‚ of type n D # ˙ ın ei where ei is the i -th unit vector in Rd , and .ın /n any sequence of strictly positive real numbers tending to 0. Then (ii) with D n restricted to the event ¹# D 0º gives 1¹ D0º 1 1=r e #,i V ! 0 , r # #Cı e n i ın L . / 1 1=r 1¹ D0º e #,i #ın e C V ! 0 r # i ın L . / as n ! 1. Selecting subsequences .nk /k along which -almost sure convergence holds, we have 1 1=r 1 1=r e #,i and e #,i ! V ! V on ¹# D 0º : ınk #Cınk ei ınk #ınk ei
Chapter 4 L2 -differentiable Statistical Models
110
-almost surely as k ! 1. The densities of the left-hand sides being non-negative, e #,i necessarily equals 0 -almost surely on the event ¹# D 0º. Thus any version V e # in (i) and (ii) has the property (). of the derivative V (3) By step (2) combined with the definition (ii) of the Fréchet derivative, we can e # on some -null set in A in order to achieve modify V e # D 0 on ¹# D 0º . V e # into a new object This allows to transform V 1=r
V#,i :D 1¹ # >0º r #
e #,i , V
1i d
which gives (note that with respect to (i) we change the integrating measure from to P# ) V# with components V#,1 , : : : , V#,d 2 Lr ., A, P# / , e #,i D 1 1=r V#,i , 1 i d . V r # e # , Fréchet differentiability of the mapping .˘/ takes the With V# thus associated to V following form: 4.1’ Definition (preliminary). For ‚ 2 Rd open, consider a dominated family P D ¹P : 2 ‚º of probability measures on ., A/ with -densities , 2 ‚. Fix r 1 and # 2 ‚. The model P is called Lr -differentiable in # with derivative V# if the following (i) and (ii) hold simultaneously: .i/
.ii/
V# has components 1 j#jr
V#,1 , : : : V#,d 2 Lr ., A, P# /
R ˇ 1=r 1=r ˇ # as
1 r
ˇr 1=r # . #/> V# ˇ d ! 0
j #j ! 0.
We remark that the integral in Definition 4.1’(ii) does not depend on the choice of the dominating measure . Considering for P different dominating measures 0º .!/ @#i which was considered in Definition 1.2, for all 1 i d . We do have V#,i 2 Lr .P# / by Definition 4.1’(i), and shall see later in Corollary 4.5 that V#,i is necessarily centred under P# . When r D 2, .ıı/ being the classical definition of the score at # (Definition 1.2), L2 -differentiability is a notion of smoothness of parameterisation which extends the classical setting of Chapter 1. We will return to this in Definition 4.6 below. To check that components V#,i of Lr -derivatives V# coincide with .ıı/ P# -almost surely, consider in Definition 4.1’ sequences n D # C ın ei (with ei the i -th unit vector in Rd , and ın # 0). From Lr ./-convergence in Definition 4.1’(ii), we can select subsequences .nk /k along which -almost sure convergence holds ˇ ˇ 1=r 1=r ˇ ˇ ˇ ˇ #Cınk ei # 1 1=r ˇ ˇ V #,i ˇ ! 0 as k ! 1 . ˇ # ınk r ˇ ˇ Thus for the mapping .ı/, the gradient r.1=r /.#, / coincides –almost surely with 1 1=r e r # V# D V # . On ¹# > 0º we thus have r.log /.#, / D r r.log.1=r //.#, / D r
r.1=r /.#, / D V# 1=r .#, /
–almost surely
e # according to Assertion 4.1./. This whereas on ¹# D 0º we put V# 0 V gives the representation .ıı/. 4.1”’ Exercise. Consider the location model P :D ¹F . / : 2 Rº generated by the doubly exponential distribution F .dx/ D 12 e jxj dx on .R, B.R//. Prove that P is L2 -differentiable in D #, for every # 2 R, with derivative V# D sgn. #/.
The next step is to remove the domination assumption from the preliminary Definition 4.1’. 4.2 Definition. For ‚ Rd open, consider a (not necessarily dominated) family 0 P D ¹P : 2 ‚º of probability measures on ., A/. For 0 , 2 ‚ let L = denote a version of the likelihood ratio of P 0 with respect to P , as defined in Notation 3.1(i). Fix r 1. The model P is called Lr -differentiable in # with derivative V# if the following (i), (iia), (iib) hold simultaneously: .i/ .iia/
V# has components V#,1 , : : : V#,d 2 Lr ., A, P# / , =# 1 L P D 1 ! 0 , j #jr
j #j ! 0 ,
112 1 .iib/ j #jr
Chapter 4 L2 -differentiable Statistical Models
ˇr Z ˇ ˇ =# 1=r ˇ 1 > ˇ L 1 . #/ V# ˇˇ dP# ! 0 , ˇ r
j #j ! 0 .
4.2’ Remarks. (a) For arbitrary statistical models P D ¹P : 2 ‚º and reference points # 2 ‚, it is sufficient to check (iia) and (iib) along sequences . n /n ‚ which converge to # as n ! 1. (b) Fix any sequence . n /n ‚ which converges to # as n ! 1. Then the countable subfamily P ¹Pn , n 1, P# º is dominated by some -finite measure (e.g. take :D P# C n1 2n Pn ) which depends on the sequence under consideration. Let n , n 1, # denote densities of Pn , n 1, P# with respect to . Along this subfamily, the integral in Definition 4.1’(ii) Z Z Z ˇ ˇr 1 1=r ˇ 1=r ˇ 1=r > : : : d C : : : d ˇn # # . n #/ V# ˇ d D r ¹ # D0º ¹ # >0º splits into the two expressions which appear in (iia) and (iib) of Definition 4.2: ˇ ˇr Z Z ˇ 1=r ˇ 1 ˇ n ˇ n d C 1 . n #/> V# ˇ # d ˇ ˇ r ¹ # D0º ¹ # >0º ˇ # ˇ ˇr Z ˇ =# 1=r ˇ 1 n =# > n ˇ D1 C 1 . n #/ V# ˇˇ dP# . D P n L ˇ L r Thus the preliminary Definition 4.1’ and the final Definition 4.2 are equivalent in restriction to the subfamily ¹Pn , n 1, P# º . Since we can consider arbitrary sequences . n /n converging to # , we have proved that Definition 4.2 extends Definition 4.1’ consistently. 4.3 Example. On an arbitrary space ., A/ consider P :D M1 ., A/, the set of all probability measures on ., A/. Fix r 1 and P 2 P . In the non-parametric model E D ., A, P /, we shall show that one-parametric paths ¹Q# : j#j < "º through P in directions Z g dP D 0 g 2 Lr .P / such that r (defined in analogy to Example 1.3) are L -differentiable at all points of the parameter
interval, and have Lr -derivative V0 D g at the origin Q0 D P . (a) Consider first the case of directions g which are bounded, and write M :D sup jgj. Then the one-dimensional path through P in P ° 1 ± g , dQ# :D .1 C # g/ dP ./ SP :D , A, Q# : j#j < M 1 is Lr -differentiable at every parameter value #, j#j < M , and the derivative V# at # is given by g 2 Lr ., A, Q# / V# D 1C#g
Section 4.1 Lr -differentiable Statistical Models
113
(in analogy to expression (1.3’), and in agreement with Remark 4.1”). At # D 0 we have simply V0 D g. 1 , V# is bounded. Thus We check this as follows. From jgj M and j#j < M r V# 2 L ., A, Q# / holds for all # in ./ and all r 1. This is condition (i) in Definition 4.2. Since SPg is dominated by :D P , conditions (iia) and (iib) of Definition 4.2 are equivalent to Condition 4.1’(ii) which we shall check now. For all r 1, 1 1 1 g .1 C g/1=r .1 C # g/1=r D .1 C g/ r 1 g D .1 C g/1=r # r r 1C g
for every ! fixed, with some (depending on !) between and #. For " small enough 1 : thus dominated convergence under P# and 2 B" .#/, j j remains separated from M as ! # in the above line gives ˇr Z ˇˇ ˇ 1=r 1=r 1 g ˇ .1 C g/ .1 C # g/ ˇ .1 C # g/1=r ˇ ˇ d ! 0 ˇ # r 1 C # gˇ as j #j ! 0 which is Condition 4.1’(ii). (b) Consider now arbitrary directions g 2 Lr .P / with EP .g/ D 0. As in the second part of Example 1.3, use of truncation avoids boundedness assumptions. With 2 C01 .R/ as there .x/ D x
1 on ¹jxj < º , 3
.x/ D 0 on ¹jxj > 1º ,
max j j < x2R
1 2
we define ./
E D . , A, ¹Q# : j#j < 1º / , R Q# .d!/ :D 1 C Œ .#g.!//
.#g/dP P .d!/ .
In the special case of bounded g as in (a), one-parametric paths ./ and ./ coincide for parameter values # close to 0. Uniformly in .#, !/, densities Z f .#, !/ :D 1 C Œ .#g.!// .#g/dP in ./ are bounded away from both 0 and 2, by choice of . Since dominated convergence gives Z Z d .#g/ dP D g 0 .#g/ dP . d#
is Lipschitz,
e # defined from the mapping .˘/ in 4.1, our assumptions on First, with V that Z d 1 . r1 1/ d 1 . r1 1/ 1=r 0 e D f# f f# D f# g .#g/ g V# D d# # r d# r
and g imply 0
.#g/ dP
Chapter 4 L2 -differentiable Statistical Models
114
belongs to Lr ., A, P / (since f# is bounded away from 0; 0 is bounded), or equivalently that R g 0 .#g/ g 0 .#g/ dP @ 1=r e R D V# D r f# V# D log f .#, / @# 1 C Œ .#g.!// .#g/dP belongs to Lr ., A, Q# /, in analogy to Example 1.3(b). Thus condition (i) in Definition 4.1’ is checked. Second, exploiting again our assumptions on and g to work with dominated convergence, we obtain Z ˇ ˇr 1 ˇ 1=r 1=r e # ˇˇ dP ! 0 as j #j ! 0 . f . #/ V f ˇ # j #jr This establishes condition (ii) in Definition 4.1’. According to Definition 4.1’, we have proved Lr -differentiability in the one-parametric path E at every parameter value j#j < 1. In particular, the derivative at # D 0 is V0 .!/ D g.!/ ,
.ı/
!2
which does not depend on the particular construction – through choice of – of the path E through P . Note that the last assertion .ı/ holds for arbitrary P 2 P and for arbitrary directions g 2 Lr .P / such that EP .g/ D 0. 4.4 Theorem. Consider a family P D ¹P : 2 ‚º where ‚ Rd is open. If P is Lr -differentiable at D # with derivative V# for some r > 1, then P is Ls -differentiable at D # with same derivative V# for all 1 s r. Proof. Our proof follows Witting [127, pp. 175–176] r 1 (such that t > 1, and . r=s /C (1) Fix 1 s < r, put t :D rs 1 1 1 from Œ0, 1/ to .0, 1/ by ts D s r ) and define functions ', '.y/ :D
s y 1=s 1 , r y 1=r 1
rs
.y/ :D ' rs .y/ D ' ts .y/ ,
1 t
D 1 or
y0.
Check that ' is continuous at y D 1 (with '.1/ D 1), and thus continuous on Œ0, 1/. For ° r r ± ,1 y > max r s we have the inclusions r r r r 1=r 1=r > 1/ > y 1=r H) 1 y 1=r > H) .y y r s s s s and thus by definition of t .y/ D ' ts .y/ D
y 1=s 1 r 1=r 1/ s .y
!ts
1 . #/ V . / s L # s L .P# /
in a first step (using triangular inequality) by h ° i ± =# 1=r > =# ' L 1 . #/ V r L s # L .P# / i h > =# 1 s C . #/ V# ' L
L .P# /
1 and in a second step using Hölder inequality (we have . r=s / C 1t D 1 ) by ° ± 1=r 1 . #/> V# r ' L=# t s r L=# L .P# / L .P# / > =# 1 t s C . #/ V# r ' L . L .P# /
L .P# /
In this last right-hand side, as tends to #, the first product is of order o.j #j/ ' L=# t s D o.j #j/ L .P# /
and the second of order
j #j ' L=# 1
Lt s .P# /
D o.j #j/
by Lr -differentiability at # – we use (iib) of Definition 4.2 in the first case, and max ku> V# kLr .P# / < 1 in the second case – in combination with Lts -converjujD1
gence .˘˘/. Summarising, we have proved that . / is of order o.j #j/ as ! #: thus condition (iib) in Definition 4.2 holds for 1 s < r if it holds for r > 1. As a consequence, we now prove that derivatives V# arising in Definitions 4.1’ or 4.2 are always centred: 4.5 Corollary. Consider a family P D ¹P : 2 ‚º where ‚ Rd is open, fix r 1. If P is Lr -differentiable at D # with derivative V# , then E# .V# / D 0. Proof. From Theorem 4.4, P is L1 -differentiable at D # with same derivative V# . Consider a unit vector ei in Rd , a sequence ın # 0, a dominating measure for the countable family ¹P#Cın ei , n 1, P# º, and associated densities. Then Definition 4.1’
Section 4.1 Lr -differentiable Statistical Models
117
with r D 1 gives #Cın ei # # V#,i ! 0 in L1 ./ as n ! 1 ın R and thus EP# .V#,i / D # V#,i d D 0. This holds for 1 i d .
From now on we focus on the case r D 2 which is of particular importance. Assuming L2 -differentiability at all points # 2 ‚, the derivatives have components ./
V#,i 2 L2 .P# /
such that
EP# .V#,i / D 0 ,
1i d
by Definition 4.2(i) and Corollary 4.5. In the special case of dominated models admitting continuous densities for which partial derivatives exist, V# necessarily has all properties of the score considered in Definition 1.2, cf. Remark 4.1”. In this sense, the present setting generalises Definition 1.2 and allows to transfer notions such as score or information to L2 -differentiable statistical models. 4.6 Definition. Consider P D ¹P : 2 ‚º where ‚ Rd is open. If a point # 2 ‚ is such that .˘/
the family P is L2 -differentiable at # with derivative V# ,
we call V# the score and
J# :D E# V# V#>
the Fisher information at #. The family P is called L2 -differentiable if .˘/ holds for all # 2 ‚. In the light of Definition 4.2, Hellinger distance H., / between probability measures Z ˇ2 1 ˇˇ 1=2 1=2 ˇ 2 H .Q1 , Q2 / D ˇ 1 2 ˇ d 2 Œ0, 1 2 on ., A/ (as in Definition 1.18, not depending on the choice of the -finite measure i which dominates Q1 and Q2 , and not on versions of the densities i D dQ , i D 1, 2) d
2 gives the geometry of experiments which are L -differentiable. This will be seen in Proposition 4.8 below. 4.7 Proposition. Without further assumptions on a statistical model P D ¹P : 2 ‚º, we have p 0 2 0 C P 0 L = D 1 L = 1 2 H 2 P 0 , P D E for all , 0 in ‚, and E
p
L 0 = 1 D H 2 .P 0 , P / .
Chapter 4 L2 -differentiable Statistical Models
118
Proof. By Notations 3.1, for any -finite measure on ., A/ which dominates P 0 0 and P and for any choice 0 , of -densities, L = coincides .P C P 0 /-almost
0 surely with 1¹ >0º C 1 1¹ D0º . As in step (1) of the proof of Proposition 3.10, ® 0 ¯ 0 E 1 L = D P 0 D 0 D P 0 L = D 1 is the total mass of the P -singular part of P 0 . Now the decomposition Z Z p p p 2 p 2 0 d D 0 d C P 0 D 0 ¹ >0º
p 0 yields the first assertion. Since L = 2 L2 .P /, all expectations in p 0 2 1 0 1 p 0 = L 1 .C/ E L = 1 D E L = 1 E 2 2 p arepwell defined, where equality in (+) is the elementary a 1 D 12 .a 1/ 1 2 2 . a 1/ for a 0. Taking together the preceding three lines yields the second assertion. 4.8 Proposition. In a model P D ¹P : 2 ‚º where ‚ Rd is open, consider sequences . m /m which approach # from direction u 2 Sd 1 (Sd 1 the unit sphere in Rd ) at arbitrary rate .ım /m : ım :D j m #j ! 0 ,
um :D
m # ! u , j m #j
m!1.
If P is L2 -differentiable in D # with derivative V# , then (with notation an bn for limn!1 abnn D 1) p 2 2 1 >
ım Lm =# 1 u J# u 2 H 2 Pm , P# E# 4
as m ! 1 whenever u>J# u is strictly positive. Proof. With respect to a given sequence m ! # satisfying the assumptions of the proposition, define a dominating measure and densities m , # as in Remark 4.2’(b). By L2 -differentiability at # we approximate first p p m # 1p # u > in L2 ./ by m V# j m #j 2 as m ! 1, using Definition 4.2(iib), and then 1p 1p # u > by # u> V# m V# 2 2
in L2 ./
Section 4.2 Le Cam’s Second Lemma for i.i.d. Observations
119
since . m /m approaches # from direction u. This gives p Z p m # 2 2 2 H Pm , P# D d 2 ım j m #j 1 1 ! E# Œu> V# 2 D u> J# u 4 4 as m ! 1, by Definition 4.6 of the Fisher information. L2 -differentiability at # also guarantees 2 Pm Lm =# D 1 D o ım as m ! 1, from Definition 4.2(iia), thus the first assertion of Proposition 4.7 completes the proof. The sphere Sd 1 being compact in Rd , an arbitrary sequence . n /n ‚ converging to # contains subsequences . nm /m which approach # from some direction u 2 Sd 1 , as required in Proposition 4.8. 4.9 Remark. By Definition 4.6, in an L2 -differentiable experiment, the Fisher information J# at # is symmetric and non-negative definite. If J# admits an eigenvalue 0, for a corresponding eigenvector u and for sequences . m /m approaching # from direction u, the Hellinger distance will be unable to distinguish m from # at rate j m #j as m ! 1, and the proof of the last proposition only gives p H 2 Pm , P# D o j m #j2 , E# . Lm =# 1/2 D o j m #j2 as m ! 1. This explains the crucial role of assumptions on strictly positive definite Fisher information in an experiment. Only in this last case, the geometry of the experiment in terms of Hellinger distance is locally at # equivalent – up to some deformation expressed by J# – to Euclidean geometry on the parameter space ‚ as a subset of Rd .
4.2
Le Cam’s Second Lemma for i.i.d. Observations
Under independent replication of experiments, L2 -differentiability of a statistical model E at a point # makes log-likelihood ratios in product models Enpresemble – locally in small neighbourhoods of # which we reparameterise via # C h= n in terms of a local parameter h – the log-likelihoods in the Gaussian shift model ® ¯ N .J# h , J# / : h 2 Rd where J# is the Fisher information at # in E. This goes back to Le Cam, see [81]; since Hájek and Sidák [41], results which establish an approximation of this type
Chapter 4 L2 -differentiable Statistical Models
120
are called a ‘second Le Cam lemma’. From the next chapter on, we will exploit similar approximations of log-likelihood ratios. Beyond the i.i.d. setting, such approximations do exist in a large variety of contexts where a statistical model is smoothly parameterised and where strong laws of large numbers and central limit theorems are at hand (e.g. autoregressive processes, ergodic diffusions, ergodic Markov processes, . . . ). 4.9’ Exercise. For J 2 Rd d symmetric and strictly positive definite, calculate log-likelihood ratios in the experiment ¹N .J h, J / : h 2 Rd º and compare to the expansion given in 4.11 below.
Throughout this section the following assumptions will be in force. 4.10 Assumptions and Notations for Section 4.2. (a) We work with an experiment E D , A , P D ¹P : 2 ‚º ,
‚ Rd open
0
with likelihood ratios L = of P 0 with respect to P . Notation # for a point in the parameter space implies that the following is satisfied: P D ¹P : 2 ‚º is L2 -differentiable at D # with derivative V# . In this case, we write J# for the Fisher information at # in E, cf. Definition 4.6: J# D E# V# V#> . (b) In n-fold product experiments ² ³ n n n n En D n :D X , An :D ˝ A , P :D ˝ P : 2 ‚ , iD1
iD1
0 =
iD1
0 =
Ln and ƒn denote the likelihood ratio and the log-likelihood ratio of Pn0 with respect to Pn . At points # 2 ‚ where E is L2 -differentiable, cf. (a), we write 1 X V# .!j / , Sn .#/.!1 , : : : , !n / :D p n n
! D .!1 , : : : , !n / 2 n
j D1
and have from the Definition 4.6 (which includes Corollary 4.5) and the central limit theorem L Sn .#/ j P#n ! N .0, J# /
(weak convergence in Rd ) as n ! 1 .
Section 4.2 Le Cam’s Second Lemma for i.i.d. Observations
121
Here is the main result of this section: 4.11 Le Cam’s Second Lemma for i.i.d. Observations. At points # 2 ‚ which d satisfy p the Assumptions 4.10, for bounded sequences .hn /n in R (such that # C hn = n is in ‚), we have p 1 > .#Ch = n/=# ƒn n D h> h J# hn C o.P#n / .1/ , n ! 1 n Sn .#/ 2 n where L Sn .#/ j P#n converges weakly in Rd as n ! 1 to N .0, J# /. The proof of Lemma 4.11 will be given in 4.15, after a series of auxiliary results. p 4.12 Proposition. For bounded sequences .hn /n in Rd and points n :D # C hn = n in ‚, we have n p X j D1
n 1 1 X > 1 Ln =# .!j / 1 D p hn V# .!j / h> J# hn C o.P#n / .1/ , 2 n 8 n
n!1.
j D1
Proof. (1) Fix # satisfying Assumptions 4.10, and define for 0 ¤ h 2 Rd p p 1 1 1 r# .n, h/ :D L.#Ch= n/=# 1 p h> V# p 2 n jhj= n in the experiment E. Then for any choice of a constant C < 1, L2 -differentiability at # implies .˘/ sup E# Œr# .n, h/2 ! 0 as n ! 1 : 0 .#Ch= n/=# An,h .!/ :D L 1 p h V# .!j / 2 n j D1 ./ n jhj X D p r# .n, h/.!j / n j D1
for 0 ¤ h 2 Rd . From the second assertion in Proposition 4.7 combined with Corollary 4.5 we know p p 1 1 > .#Ch= n/=# 1 p h V# D n H 2 P#Ch=pn , P# . E# .An,h / D n E# L 2 n
Chapter 4 L2 -differentiable Statistical Models
122
Consider now a bounded sequence .hn /n in Rd . The unit sphere S d 1 being compact, hn we can select subsequences .hnk /k whose directions uk :D jhnk j approach limits k
u 2 S d 1 . Thus we deduce from Proposition 4.8 1 ./ E# An,hn D n H 2 P#Chn =pn , P# D h> J# hn C o.1/ 8 n
as n ! 1. On the other hand, calculating variances for ./ we find from .˘/ sup Var# .An,h / C 2 sup Var# .r# .n, h// jhjC 2 C sup E# Œr# .n, h/2
jhjC
jhjC
!
0
as n ! 1. Thus the sequence ./ behaves as the sequence of its expectations ./: 1 An,hn D E# An,hn C o.P#n / .1/ D h> J# hn C o.P#n / .1/ 8 n as n ! 1 which is the assertion.
p 4.13 Proposition. For bounded sequences .hn /n in Rd and points n :D # C hn = n, we have n p X 2 1 Ln =# .!j / 1 D h> J# hn C o.P#n / .1/ , n ! 1 . 4 n j D1
Proof. We write n .!/, e n .!/, : : : for remainder terms o.P#n / .1/ defined in En . By Definition 4.6 of the Fisher information and the strong law of large numbers, n 1 X V# V#> .!j / D E# V# V#> C n .!/ D J# C n .!/ n j D1
as n ! 1. Thus we have for bounded sequences .hn /n 2 n X 1 > .C/ p hn V# .!j / D h> n J# hn C o.P#n / .1/ , n
n!1.
j D1
With notation r# .n, h/ introduced at the start of the proof of Proposition 4.12 we write p 1 1 jhn j .CC/ Ln =# .!j / 1 D p h> n V# .!j / C p r# .n, hn /.!j / . 2 n n As n ! 1, we take squares of both the right-hand sides and the left-hand sides in (++), and sum over 1 j n. Thanks to .˘/ in the proof of Proposition 4.12 sup E# r# .n, h/2 ! 0 as n ! 1 jhjC
Section 4.2 Le Cam’s Second Lemma for i.i.d. Observations
123
and Cauchy–Schwarz inequality, the terms 1X Œr# .n, hn /2 .!j / and n n
j D1
ˇ ˇ n ˇ 1 X ˇˇ h> ˇ ˇ n V# .!j / r# .n, hn /.!j / ˇ ˇ jhn j ˇ n j D1
then vanish as n ! 1 on the right-hand sides, and we obtain 2 n p n 2 X X 1 1 > =# n L .!j / 1 D p h V# .!j / Ce n .!/ 2 n n j D1
j D1
1 > h J# hn C o.P#n / .1/ 4 n as a consequence of (++), .˘/ in the proof of Proposition 4.12, and (+). D
p 4.14 Proposition. For bounded sequences .hn /n in Rd and n :D # C hn = n, ƒnn =# .!/
D2
n p X
Ln =# .!j /
j D1
n p X 2 1 Ln =# .!j / 1 C o.P#n / .1/ j D1
as n ! 1. Proof. (1) The idea of the proof is a logarithmic expansion 1 log.1 C z/ D z z 2 C o.z 2 / as z ! 0 2 p which for bounded sequences .hn /n and n D # C hn = n should give ƒnn =# .!/ D
n X
log Ln =# .!j / D 2
j D1 n X
D2
j D1
n X
p log 1 C Ln =# 1 .!j /
j D1
p
n p X 2 Ln =# 1 .!j / Ln =# 1 .!j / C j D1
where we have to consider carefully the different remainder terms which arise as n ! 1. (2) In En , we do have .ı/
ƒnn =# .!/
D
n X
log Ln =# .!j / for P#n -almost all ! D .!1 , : : : , !n / 2 n
j D1
which justifies the first ‘D’ in the chain of equalities in (1) above. To see this, fix the sequence . n /n , choose on ., A/ some dominating measure for the restricted experiment ¹Pn , n 1, P# º, and select densities n , n 1, # . In the restricted product experiment ¹Pnn , n 1, P#n º, the set An :D ¹! 2 n : # .!j / > 0 for all 1 j nº 2 An
Chapter 4 L2 -differentiable Statistical Models
124
=#
has full measure under P#n , the likelihood ratio Lnn coincides .Pnn C P#n /-almost surely with n Y n .!j / C 1 1Acn .!/ , ! ! 1An .!/ # .!j / j D1
and the expressions ƒnn =# .!/ D log.Lnn =# .!//
and
n X
log Ln =# .!j /
j D1
are well defined and Œ1, C1/-valued in restriction to An , and coincide on An . This is .ı/. 2 p (3) We exploit L -differentiability pat # in the experiment E. Fix ı > 0, write Zn D =# L n 1 where n D # C hn = n . Then we have P# .jZn j > ı/ D P# .Zn > ı/ C P# .Zn < ı/ 1 ı ı 1 > > C P# P# Zn > ı , p hn V# p h V# > 2 2 2 n 2 n n 1 ı ı 1 > h C P# Zn < ı , p h> V V < C P p # # # 2 2 2 n n 2 n n ˇp ˇ ˇ ˇ p 1 ˇ ˇ ı ˇ ˇ P# ˇ Ln =# 1 . n #/> V# ˇ > C P # ˇh > n V# ˇ > ı n . 2 2 Since .hn /n is a bounded sequence, we have for suitable C as n ! 1 p ˇ ˇ p ı n > ˇ ˇ P# hn V# > ı n P# jV# j > C 1 C2 p 2 E# jV# j2 1 D o ı n ¹jV# j> C º ı n n since V# is in L2 .P# /. A simpler argument works for the first term on the right-hand side above since ˇp ˇ2 1 1 ˇ ˇ E# ˇ Ln =# 1 . n #/> V# ˇ D o.j n #j2 / D o 2 n by part (iib) of Definition 4.2. Taking all this together, we have 1 P# .jZn j > ı/ D o as n ! 1 n for ı > 0 fixed. (4) In the logarithmic expansion of step (1), we consider remainder terms ˇ ˇ ˇ 1 2 ˇˇ ˇ R.z/ :D ˇ log.1 C z/ z C z ˇ , z 2 .1, 1/ 2
125
Section 4.2 Le Cam’s Second Lemma for i.i.d. Observations
together with random variables in the product experiment En p Zn,j .!/ :D Ln =# 1 .!j / , ! D .!1 , : : : , !n / 2 n , and shall prove in step (5) below n X
.C/
j D1
R.Zn,j / D o.P#n / .1/
as n ! 1 .
Then (+) will justify the last ‘D’ in the chain of heuristic equalities in step (1), and thus will finish the proof of Proposition 4.14. (5) To prove (+), we will consider ‘small’ and ‘large’ absolute values of Zn,j separately, ‘large’ meaning n X
R.Zn,j /1¹jZn,j j>ıº D
j D1
n X
R.Zn,j /1¹Zn,j >ıº C
j D1
n X
R.Zn,j /1¹Zn,j 0 fixed. For Zn,j positive, we associate to ı > 0 the quantity D .ı/ :D inf¹R.z/ : z > ıº > 0 which allows to write ³ ²X n ® ¯ R.Zn,j / 1¹Zn,j >ıº > D Zn,j > ı for at least one 1 j n . j D1
Using step (3) above, the probability of the last event under P#n is smaller than n P# .Zn > ı/ D o .1/
as
n!1.
Negative values of Zn,j can be treated analogously. We thus find that the contribution of ‘large’ absolute values of Zn,j to the sum (+) is negligible: .CC/ for any ı > 0 fixed :
n X
R.Zn,j / D
j D1
n X j D1
R.Zn,j /1¹jZn,j jıº C o.P#n / .1/
as n ! 1. It remains to consider ‘small’ absolute values of Zn,j . Fix " > 0 arbitrarily small. As a consequence of R.z/ D o.z 2 / for z ! 0, we can associate ı D ı."/ > 0 such that
jR.z/j < " z 2
on ¹jzj ıº
to every value of " > 0. Since .hn /n is a bounded sequence, we have as n ! 1 . /
n X j D1
2 Zn,j .!/ D
n p X j D1
2 1 Ln =# .!j / 1 D h> J# hn C o.P#n / .1/ 4 n
Chapter 4 L2 -differentiable Statistical Models
126
from Proposition 4.13, and at the same time n X
.C C C/
R.Zn,j /1¹jZn,j jıº "
j D1
n X
2 Zn,j
j D1
2 n as n ! 1, and we for ı D ı."/. Now, . / implies tightness of L Z j P j D1 n,j # have the possibility to choose " > 0 arbitrarily small. Combining (+++) and (++) we thus obtain n X R.Zn,j / D o.P#n / .1/ as n ! 1 . P n
j D1
This is (+), and the proof of Proposition 4.14 is finished.
Now we can conclude this section: 4.15 Proof of Le Cam’s Second Lemma 4.11. Under Assumptions 4.10, we put together Propositions 4.14, 4.12 and 4.13 ƒnn =# .!/ D 2
n p X
Ln =# .!j /
j D1
n p X 2 1 Ln =# .!j / 1 C o.P#n / .1/ j D1
n 1 1 X > 1 > 1 > D2 p hn V# .!j / hn J# hn h J# hn C o.P#n / .1/ 2 n 8 4 n j D1
1 Dp n
n X j D1
h> n V# .!j /
1 > h J# hn C o.P#n / .1/ 2 n
and have proved the representation of log-likelihoods in Lemma 4.11. Weak convergence of the scores 1 X V# .!j / Sn .#, !/ D p n n
under P#n
j D1
has already been stated in Assumption 4.10(b). Le Cam’s Second Lemma 4.11 is now proved.
Chapter 5
Gaussian Shift Models
Topics for Chapter 5: 5.1 Gaussian Shift Experiments A classical normal distribution model 5.1 Gaussian shift experiment E.J / 5.2–5.3 Equivariant estimators 5.4 Boll’s convolution theorem 5.5 Subconvex loss functions 5.6 Anderson’s lemma, and some consequences 5.6’–5.8 Total variation distance 5.8’ Under ‘very diffuse’ a priori, arbitrary estimators are approximately equivariant 5.9 Main result: the minimax theorem 5.10 5.2 Brownian Motion with Unknown Drift as a Gaussian Shift Experiment Local dominatedness/equivalence of probability measures 5.11 Filtered spaces and density processes 5.12 Canonical path space .C , C , G/ for continuous processes 5.13 Example: Brownian motion with unknown drift, special case d D 1 5.14 The statistical model ‘scaled BM with unknown drift’ in dimension d 1 5.15 Statistical consequence: optimal estimation of the drift parameter 5.16 Exercises: 5.4’, 5.4”, 5.4”’, 5.5’, 5.10’
This chapter considers the Gaussian shift model and its statistical properties. The main stages of Section 5.1 are Boll’s Convolution Theorem 5.5 for equivariant estimators, the proof that arbitrary estimators are approximately equivariant under a very diffuse prior, and – as a consequence of both – the Minimax Theorem 5.10 which establishes a lower bound for the maximal risk of arbitrary estimators, in terms of the central statistic. We will see a stochastic process example in Section 5.2.
5.1
Gaussian Shift Experiments
Up to a particular representation of the location parameter, the following experiment is well known:
128
Chapter 5 Gaussian Shift Models
5.1 Example. Fix J 2 Rd d symmetric and strictly positive definite, and consider the normal distribution model ¯ ® . Rd , B.Rd / , Ph :D N .J h, J / : h 2 Rd Here densities fh D
dPh d
with respect to Lebesgue measure on Rd are given by 1 d=2 1=2 > 1 f .h, x/ D .2/ .det J / exp .x J h/ J .x J h/ , 2 x 2 Rd ,
h 2 Rd
and all laws Ph , h 2 Rd , are pairwise equivalent. (a) It follows that likelihood ratios of Ph with respect to P0 have the form dPh 1 > h=0 > :D D exp h S h J h , h 2 Rd .C/ L dP0 2 where S.x/ :D x denotes the canonical variable on Rd for which, as a trivial assertion, .CC/
L .S j P0 / D N .0, J / .
(b) Taking an arbitrary h0 2 Rd as reference point, we obtain in the same way dPh 1 D exp .h h0 /> .S J h0 / .h h0 /> J .h h0 / , Lh= h0 :D dPh0 2 h 2 Rd and have, again as a trivial assertion, L S J h0 j Ph0 D N .0, J / . (c) Thus a structure analogous to (+) and (++) persists if we reparameterise around arbitrary reference points h0 2 Rd : we always have 1 > e e h> .S J h0 / e h 2 Rd h J h , e . / L.h0 Ch/= h0 D exp e 2 together with . /
L S J h0 j Ph0 D N .0, J / .
(d) The above quadratic shape (+) of likelihoods combined with a distributional property (++) was seen to appear as a limiting structure – by Le Cam’s Second Lemma 4.10 and 4.11 – over shrinking neighbourhoods of fixed reference points # in L2 -differentiable experiments.
129
Section 5.1 Gaussian Shift Experiments
5.2 Definition. A model ., A, ¹Ph : h 2 Rd º/ is called a Gaussian shift experiment E.J / if there exists a statistic S : ., A/ ! .Rd , B.Rd // and a deterministic matrix J 2 Rd d symmetric and strictly positive definite such that for every h 2 Rd ,
1 ! ! exp h> S.!/ h> J h D: Lh=0 .!/ 2
.C/
is a version of the likelihood ratio of Ph with respect to P0 . We call Z :D J 1 S central statistic in the Gaussian shift experiment E.J /. In a Gaussian shift experiment, the name ‘central statistic’ indicates a benchmark for good estimation, simultaneously under a broad class of loss functions: this will be seen in Corollary 5.8 and Theorem 5.10 below. For every given matrix J 2 Rd d symmetric and strictly positive definite, a Gaussian shift experiment E.J / exists by Example 5.1. The following proposition shows that E.J / as a statistical experiment is completely determined from the matrix J 2 Rd d symmetric and strictly positive definite. 5.3 Proposition. In an experiment E.J / with the properties of Definition 5.2, the following holds: (a) all laws Ph , h 2 Rd , are equivalent probability measures; (b) we have L.Z h j Ph / D N .0, J 1 / for all h 2 Rd ; (c) we have L.S J h j Ph / D N .0, J / for all h 2 Rd ; (d) we have for all h 2 Rd .hCe h/= h
L
1 e> e > e D exp h .S J h/ h J h , 2
e h 2 Rd .
It follows that in the classical sense of Definition 1.2, Mh :D S J h is the score in h 2 Rd and J D Eh Mh Mh> the Fisher information in h. The Fisher information does not depend on the parameter h 2 Rd .
130
Chapter 5 Gaussian Shift Models
Proof. (1) For any h 2 Rd , the likelihood ratio Lh=0 in Definition 5.2 is strictly positive and finite on : hence neither a singular part of Ph with respect to P0 nor a singular part of P0 with respect to Ph exists, and we have P0 Ph . (2) Recall that the Laplace transform of a normal law N .0, ƒ/ on .Rd , B.Rd /// is Z 1 > > d e u x N .0, ƒ/.dx/ D e C 2 u ƒ u R 3 u ! (ƒ 2 Rd d symmetric and non-negative definite); the characteristic function of N .0, ƒ/ is Z 1 > > e i u x N .0, ƒ/.dx/ D e 2 u ƒ u . Rd 3 u ! (3) In E.J /, J being deterministic, (1) and Definition 5.2 give 1 > > 1 D E0 L.h/=0 D E0 e h S e 2 h J h , h 2 Rd which specifies the Laplace transform of the law of S under P0 and establishes L .S j P0 / D N .0, J / . For the central statistic Z D J 1 S , the law of Z h under Ph is determined – via scaling and change of measure, by (1) and Definition 5.2 – from the Laplace transform of the law of S under P0 : we have 1 > > > > 1 > Eh e .Zh/ D E0 e .Zh/ Lh=0 D e C h e 2 h J h E0 e .J h/ S >h
D e C
>J
e 2 h 1
h
e C 2 .J 1
1 h/> J
.J 1 h/
> J 1
D eC 2 1
for all 2 Rd and all h 2 Rd . This shows L.Z h j Ph / D N .0, J 1 / for arbitrary h 2 Rd . This is (b), and (c) follows via standard transformations of normal laws. (4) From (1) and Definition 5.2 we obtain the representation (d) in the same way as . / in Example 5.1(b). Then (d) and (c) together show that the experiment E.J / admits score and Fisher information as indicated. For the statistical properties of a parametric experiment, the space ., A/ supporting the family of laws ¹P : 2 ‚º is of no importance: the structure of likelihoods L=# matters when and # range over ‚. Hence, one may encounter the Gaussian shift experiment E.J / in quite different contexts. In a Gaussian shift experiment E.J /, the problem of estimation of the unknown parameter h 2 Rd seems completely settled in a very classical way. The central statistic
131
Section 5.1 Gaussian Shift Experiments
Z is a maximum likelihood estimator. The theorems by Rao–Blackwell and Lehmann– Scheffé (e.g. [127, pp. 349 and 354], using the notions of a canonical statistic in a d -parametric exponential family, of sufficiency and of completeness, assert that in the class of all unbiased and square integrable estimators for the unknown parameter in E.J /, Z is the unique estimator which uniformly on ‚ minimises the variance. However, a famous example by Stein (see e.g. [60, pp. 25–27] or [59, p. 93]) shows that for normal distribution models in dimension d 3, there are estimators admitting bias which improve quadratic risk strictly beyond the best unbiased estimator. Hence it is undesirable to restrict from the start the class of estimators under consideration by imposing unbiasedness. The following definition does not involve any such restrictions. 5.4 Definition. In a statistical model .0 , A0 , ¹Ph0 : h 2 ‚0 º/, ‚0 Rd , an estimator for the unknown parameter h 2 ‚0 is called equivariant if for all h 2 ‚0 . L h j Ph0 D L j P00 An equivariant estimator simply ‘works equally well’ at all points of the statistical model. By Proposition 5.3(b), the central statistics Z is equivariant in the Gaussian shift model E.J /. 5.4’ Exercise. With C the space of continuous functions Rd ! R, write Cp C for the cone of strictly positive f vanishing at 1 faster than any polynomial: for every ` 2 N, sup¹jxj>Rº ¹jxj` f .x/º ! 0 as R ! 1. Consider an experiment E 0 D .0 , A0 , ¹Ph0 : h 2 Rd º/ of mutually equivalent probability laws for which paths h/= h0 Rd 3 e h ! L.h0 Ce .!/ 2 .0, 1/
belong to Cp almost surely, for arbitrary h0 2 Rd fixed, and such that the laws h/= h0 j Ph0 0 L L.h0 Ce e h2Rd on .C , C / do not depend on h0 2 Rd (as an example, all this is satisfied in Gaussian shift experiments E.J / in virtue of Proposition 5.3(c) and (d) for arbitrary dimension d 1). As a consequence, laws of pairs Z Z h/= h0 e h/= h0 e e L h L.h0 Ce dh , L.h0 Ce d h j Ph0 0 Rd
Rd
are well defined and do not depend on h0 . Prove the following (a) and (b): (a) In E 0 , a Bayesian estimator h with ‘uniform over Rd prior’ for the unknown parameter h 2 Rd R 0 h0 =0 .!/ dh0 d h L R h .!/ :D R 0 =0 h .!/ dh0 Rd L (sometimes called Pitman estimator) is well defined, and is an equivariant estimator.
132
Chapter 5 Gaussian Shift Models
(b) In E 0 , a maximum likelihood estimator b h for the unknown parameter h 2 Rd ¯ ® 0 b h.!/ D argmax Lh =0 .!/ : h0 2 Rd (with measurable selection of an argmax – if necessary – similar to Proposition 2.10) is well defined, and is equivariant. (c) In general statistical models, estimators according to (a) or to (b) will have different properties. Beyond the scope of the chapter, we add the following remark, for further reading: in the statistical model with likelihoods 1 u=0 L D exp Wu juj , u 2 R 2 where .Wu /u2R is a two-sided Brownian motion and dimension is d D 1 (the parameter is ‘time’: this is a two-sided variant of Example 1.16’, in the case where X in Example 1.16’ is Brownian motion), all assumptions above are satisfied, (a) and (b) hold true, and the Bayesian h under squared loss. See the estimator h outperforms the maximum likelihood estimator b references quoted at the end of Example 1.16’. For the third point, see [114]; for the second point, see [56, Lemma 5.2]. 5.4” Exercise. Prove the following: in a Gaussian shift model E.J /, the Bayesian estimator with ‘uniform over Rd prior’ (from Exercise 5.4’) for the unknown parameter h 2 Rd R 0 h0 =0 .!/ dh0 d h L R h .!/ :D R , !2 0 =0 h .!/ dh0 Rd L coincides with the central statistic Z, and thus with the maximum likelihood estimator in E.J /. Hint: varying the i -th component of h0 , start from one-dimensional integration Z 1 Z 1 @ h0 =0 0 0 0D L .!/ dh D .S.!/ J h0 /i Lh =0 .!/ dh0i , 1 i d i 0 1 @hi 1 Z
and prove Z.!/
Rd
0
Lh =0 .!/ dh0 D
Z
0
h0 Lh =0 .!/ dh0 .
Rd
Then Z and h coincide on , for the Gaussian shift model E.J /.
5.4”’ Exercise. In a Gaussian shift model E.J / with unknown parameter h 2 Rd and central statistic Z D J 1 S , fix some point h0 2 ‚ and some 0 < ˛ < 1, and use a statistic T :D ˛ Z C .1 ˛/ h0 as an estimator for the unknown parameter. Prove that T is not an equivariant estimator, and specify the law L.T hjPh / for all h 2 Rd from Proposition 5.3.
The following implies a criterion for optimality within the class of all equivariant estimators in E.J /. This is the first main result of this section.
133
Section 5.1 Gaussian Shift Experiments
5.5 Convolution Theorem (Boll [13]). Consider a Gaussian shift experiment E.J /. If is an equivariant estimator for the unknown parameter h 2 Rd , there is some probability measure Q on .Rd , B.Rd // such that L . h j Ph / D N .0, J 1 / ? Q
for all
h 2 Rd .
In addition, the law Q coincides with L . Z j P0 / . Proof. being equivariant for the unknown parameter h 2 Rd , it is sufficient to prove .˘/
L . j P0 / D N .0, J 1 / ? L . Z j P0 / .
(1) Using characteristic functions for laws on Rd , we fix t 2 Rd . Equivariance of gives > > > .C/ E0 e i t D Eh e i t .h/ D E0 e i t .h/ Lh=0 , h 2 Rd . Let us replace h 2 Rd by z 2 C d in (+) to obtain an analytic function > 1 > > f : C d 3 z ! E0 e i t .z/ e z S 2 z J z 2 C . By (+), the restriction of f ./ to Rd C d being constant, f ./ is constant on C d , thus in particular > E0 e i t D f .0/ D f . iJ 1 t / . Calculating the value of f at iJ 1 t we find > > 1 > 1 ./ E0 e i t D e 2 t J t E0 e i t .Z/ . (2) We have ./ for every t 2 Rd . We thus have equality of characteristic functions on Rd . The right-hand side of .C/ admits an interpretation as characteristic function of L e C . Z/ j P0 for some random variable e independent of Z under P0 and such that L.e j 1 P0 / D N .0, J /. Hence, writing Q :D L . Z j P0 /, the right-hand side of ./ is the characteristic function of a convolution N .0, J 1 / ? Q. (3) It is important to note the following: the above steps (1) and (2) did not establish (and the assertion of the theorem did not claim) that Z and . Z/ should be inde pendent under P0 . Combined with Proposition 5.3(b), Boll’s Convolution Theorem 5.5 states that in a Gaussian shift experiment E.J /, estimation errors of equivariant estimators are ‘more spread out’ than the estimation error of the central statistic Z, as an estimator for the unknown parameter. By Theorem 5.5, the best possible concentration of estimation errors (within the class of all equivariant estimators) is attained for D Z: then
134
Chapter 5 Gaussian Shift Models
L .Z h j Ph / equals N .0, J 1 /, and Q D 0 is a point mass at 0. However, a broad class of estimators with possibly interesting properties is not equivariant. 5.5’ Exercise. For R < 1 arbitrarily large, consider closed balls C centred at 0 with radius R. (a) In a Gaussian shift model E.J /, check that a Bayesian with uniform prior over the compact C R 0 h0 =0 h L .!/ dh0 hC .!/ :D CR , !2 h0 =0 .!/ dh0 C L is not an equivariant estimator for the unknown parameter. (b) For arbitrary estimators T : ., A/ ! .Rd , B.Rd // for the unknown parameter which may exist in E.J /, consider quadratic loss and Bayes risks Z 1 Eh .jT hj2 / dh 1 . R.T , C / :D .C / C In the class of all A-measurable estimators T for the unknown parameter, hC minimises the squared risk for h chosen randomly from C , and provides a lower bound for the maximal squared risk over C : .C/
sup Eh .jT hj2 / R.T , C / R.hC , C / .
h2C
This is seen as follows: the first inequality in (+) being a trivial one (we replace an integrand by e :D A ˝ B.C / ; :D C and A its upper bound), it is sufficient to prove the second. Put e e/, let id : .!, h/ ! .!, h/ denote the canonical statistic. Define e, A on the extended space . a probability measure
e.d!, dh0 / :D 1C .h0 / P
0
dh0 Lh =0 .!/ 0 Ph0 .d!/ D P0 .d!/ 1C .h0 / dh .C / .C /
e/, and write e e, P e, A e. On the space . e, A e/, identify hC as on . E for expectation under P conditional expectation of the second component of id given the first component of id . Note e/ since C is that the random variable h – the second component of id – belongs to L2 .P e/ gives compact. Then the projection property of conditional expectations in L2 .P e .jh T j2 / e E E jh hC j2 e/ which depend only on the first component of id . But this for all random variables T 2 L2 .P is the class of all T : ! Rd which are A-measurable, thus the class of all possible estimators for the unknown parameter in the experiment E.J /, and we obtain the second inequality in (+). 5.6 Definition. In an arbitrary experiment .0 , A0 , ¹Ph0 : h 2 ‚0 º/, ‚0 Rd , (i) we call a loss function ` : Rd ! Œ0, 1/
B.Rd /-measurable
subconvex or bowl-shaped if all level sets Ac :D ¹x 2 Rd : `.x/ cº ,
c0
135
Section 5.1 Gaussian Shift Experiments
are convex and symmetric with respect to the origin (i.e. x 2 Ac ” x 2 Ac , for x 2 Rd ); (ii) we associate – with respect to a loss function which we keep fixed – a risk function Z `.T h/ dPh 2 Œ0, 1 R.T , / : ‚0 3 h ! R.T , h/ :D 0
to every estimator T : .0 , A0 / ! .Rd , B.Rd // for the unknown parameter h 2 ‚0 . Note that risk functions according to Definition 5.6(ii) are well defined, but not necessarily finite-valued. When A is a subset of ‚ of finite Lebesgue measure, we also write Z R.T , A/ :D A .dh/ R.T , h/ 2 Œ0, 1 for the Bayes risk where A is the uniform distribution on A. For short, we write n for the uniform distribution on the closed ball Bn :D ¹jxj nº. The following lemma is based on an inequality for volumes of convex combinations of convex sets in Rd , see [1, p. 170–171]) for the proof. 5.6’ Lemma (Anderson [1]). Consider C Rd convex and symmetric with respect to the origin. Consider f : Rd ! Œ0, 1/ integrable, symmetric with respect to the origin, and such that all sets ¹x 2 Rd : f .x/ cº are convex, 0 < c < 1. Then Z
Z C
f .x/ dx
C Cy
f .x/ dx
for all y 2 Rd , where C C y is the set C shifted by y. We will use Anderson’s lemma in the following form: 5.7 Corollary. For matrices ƒ 2 Rd d which are symmetric and strictly positive definite, sets C 2 B.Rd / convex and symmetric with respect to the origin, subconvex loss functions ` : Rd ! Œ0, 1/, points a 2 Rd n ¹0º and probability measures Q on .Rd , B.Rd //, we have N .0, ƒ/.C / N .0, ƒ/.C a/ ; Z `.x/ N .0, ƒ/.dx/ `.x a/ N .0, ƒ/.dx/ ; Z Z `.x/ N .0, ƒ/.dx/ `.x/ ŒN .0, ƒ/ ? Q .dx/ .
Z
136
Chapter 5 Gaussian Shift Models
Proof. Writing f ./ for the Lebesgue density of the normal law N .0, ƒ/ on Rd , the first assertion rephrases Anderson’s Lemma 5.6’. An immediate consequence is Z Z e `.x a/ N .0, ƒ/.dx/ `.x/ N .0, ƒ/.dx/ e for ‘elementary’ subconvex loss functions which are finite sums e `D
m X
˛i 1Rd nCi ,
iD1
˛i > 0, Ci 2 B.Rd / convex and symmetric with respect to the origin, m 1. The second assertion follows if we pass to general subconvex loss functions `./ by monotone convergence `n " ` where all `n are ‘elementary’: writing Cn,j :D ¹` 2jn º we can take n2 n2 1 0 X X 1 j ° ± C n1 `n :D 1 D 1 j0 d j 0 C1 ¹`>nº , R nCn,j n n n < ` 2n 2 2 2 0 n
n
j D1
n 1.
j D0
The second assertion now allows us to compare Z Z Z `.x/ ŒN .0, ƒ/ ? Q .dx/ D `.y C b/ N .0, ƒ/.dy/ Q.db/ Z Z `.y/ N .0, ƒ/.dy/ Q.db/ Z D `.y/ N .0, ƒ/.dy/
which is the third assertion.
The last inequalities of Corollary 5.7 allow to rephrase Boll’s Convolution Theorem 5.5. This yields a powerful way to compare (equivariant) estimators, in the sense that ‘optimality’ appears decoupled from any particular choice of a loss function which we might invent to penalise estimation errors. 5.8 Corollary. In the Gaussian shift experiment E.J /, with respect to any subconvex loss function, the central statistic Z minimises the risk R., h/ R.Z, h/ ,
h 2 Rd
in the class of all equivariant estimators for the unknown parameter.
Proof. Both and Z being equivariant, their risk functions are constant over Rd , thus it is sufficient to prove the assertion for the parameter value h D 0. We have L .Z j P0 / D N .0, J 1 /
137
Section 5.1 Gaussian Shift Experiments
from Proposition 5.3. Theorem 5.5 associates to a probability measure Q such that L . j P0 / D N .0, J 1 / ? Q . The loss function `./ being subconvex, the third assertion in Corollary 5.7 shows Z R.Z, 0/ D E0 . `.Z 0/ / D `.x/ N .0, J 1 /.dx/ Z
`.x/ N .0, J 1 / ? Q .dx/ D E0 . `. 0/ / D R., 0/
which is the assertion.
Given Theorem 5.5 and Corollary 5.8, we are able to compare equivariant estimators; the next aim is the comparison of arbitrary estimators for the unknown parameter. We quote the following from [121, Chap. I.2] or [124, Chap. 2.4]: 5.8’ Remark. Total variation distance d1 .P , Q/ between probability measures P , Q living on the same space .0 , A0 / is defined by Z d1 .P , Q/ :D sup jP .A/ Q.A/j D sup j .p q/ d j 2 Œ0, 1 , A2A0
A2A0
dP ,q d
A
dQ d
D versions of the densities, and with some dominating measure and p D one has sup jP .A/ Q.A/j D 0 A2A² ˇZ ˇ ³ Z ˇ .C/ ˇ 0 0 ˇ ˇ sup ˇ dP dQ ˇ : an A -measurable function ! Œ0, 1 . The following lemma represent a key tool: in a Gaussian shift experiment E.J /, arbitrary estimators for the unknown parameter can be viewed as ‘approximately equivariant’ in the absence of any a priori knowledge except that the unknown parameter should range over large balls centred at the origin. 5.9 Lemma. In a Gaussian shift experiment E.J /, every estimator for the unknown parameter h 2 Rd is associated to a sequence of probability measures .Qn /n on .Rd , B.Rd // such that Z 1 .i/ d1 n .dh/ L. h j Ph / , N .0, J / ? Qn ! 0 as n ! 1 . As a consequence, for any choice of a loss function `./ which is subconvex and bounded, we have Z R ., Bn / D n .dh/ Eh .`. h// Z .ii/
D `.x/ N .0, J 1 / ? Qn .dx/ C k`k1 o.1/
138
Chapter 5 Gaussian Shift Models
as n ! 1 (with Bn the closed ball ¹jhj nº in Rd , and n the uniform distribution on Bn ). In the last line, k`k1 is the sup-norm of the loss function, and the o.1/-terms do not depend on `./. Proof. Assertion (ii) is a consequence of (i), via (+) in Remark 5.8’. We prove (i) in several steps. (1) The central statistic Z D J 1 S is a sufficient statistic in the Gaussian shift experiment E.J / (e.g. [4, Chap. II.1 and II.2], or [127, Chap. 3.1]. Thus, for any random variable taking values in .Rd , B.Rd //, there is a regular version of the conditional law of given Z D which does not depend on the parameter h 2 Rd : there is a transition kernel K., / on .Rd , B.Rd // such that jZDz
K., / is a regular version of .z, A/ ! Ph
.A/ for every value of h 2 Rd .
jZDz
.A/. In the same sense, the conditional law of We write this as K.z, A/ D P Z given Z D does not depend on the parameter, and we define a sequence of probability measures .Qn /n on .Rd , B.Rd //: Z . Z/jZDz .A/ , A 2 B.Rd / , n 1 . Qn .A/ :D n .dz/ P Note that the sequence .Qn /n is defined only in terms of the pair ., Z/. Comparing with the first expression in (i), this definition signifies that the observed value z of the central statistic starts to take over the role of the parameter h. (2) For every fixed value of x 2 Rd , we have uniformly in A 2 B.Rd / the bound ˇZ ˇ ˇ ˇ . Bn 4 .Bn C x/ / ˇ ˇ ./ ˇ n .dh/ K.x C h, A C h/ Qn .A x/ˇ .Bn / where A x is the set A shifted by x, and 4 denotes symmetric difference. To see this, write Z Z 1 .dh/ 1Bn .h/ K.x C h, A C h/ n .dh/ K.x C h, A C h/ D .Bn / Z 1 .dh0 / 1Bn Cx .h0 / K.h0 , A x C h0 / D .Bn / Z 1 jZDh0 .A x C h0 / .dh0 / 1Bn Cx .h0 / P D .Bn / Z 1 ZjZDh0 .A x/ D .dh0 / 1Bn Cx .h0 / P .Bn / and compare the last right-hand side to the definition of Qn in step (1) Z 1 ZjZDh0 .A x/ . .dh0 / 1Bn .h0 / P Qn .A x/ D .Bn /
139
Section 5.1 Gaussian Shift Experiments
It follows that the difference on the left-hand side of ./ is uniformly in A smaller than Z ˇ ˇ 1 . Bn 4 .Bn C x/ / .dh0 / ˇ1Bn .h0 / 1Bn Cx .h0 /ˇ D . .Bn / .Bn / (3) To conclude the proof of (i), we condition the first law in (i) with respect to the central statistic Z. For A 2 B.Rd / we obtain from Proposition 5.3(b), definition of K., / and substitution z D x C h Z Z Z jZDz Z Ph .dz/ Ph .A C h/ n .dh/ Ph . h 2 A / D n .dh/ Z Z D n .dh/ N .h, J 1 /.dz/ K.z, A C h/ Z Z n .dh/ K.x C h, A C h/ D N .0, J 1 /.dx/ whereas the second law in (i) charges A with mass Z N .0, J 1 / ? Qn .A/ D N .0, J 1 /.dx/ Qn .A x/ . By the bounds ./ obtained in step (2) which are uniform in A for fixed value of x, we can compare the last right-hand sides. Thus the definition of total variation distance gives Z d1 n .dh/ L. h j Ph / , N .0, J 1 / ? Qn Z . Bn 4 .Bn C x/ / N .0, J 1 /.dx/ .Bn / where the integrand on the right-hand side, trivially bounded by 2, converges to 0 pointwise in x 2 Rd as n ! 1. Assertion (i) now follows from dominated convergence. Lemma 5.9 is the key to the minimax theorem in Gaussian shift experiments E.J /. It allows to compare all possible estimators for the unknown parameter h 2 Rd , with respect to any subconvex loss function: it turns out that for all choices of `./, the maximal risk on Rd is minimised by the central statistic. This is – following Theorem 5.5 – the second main result of this section. 5.10 Minimax Theorem. In the Gaussian shift experiment E.J /, the central statistic Z is a minimax estimator for the unknown parameter with respect to any subconvex loss function `./; we have Z sup R., h/ `.z/ N .0, J 1 /.dz/ D R.Z, 0/ D sup R.Z, h/ . h2Rd
h2Rd
140
Chapter 5 Gaussian Shift Models
Proof. Consider risk with respect to any subconvex loss function `./. The last equality is merely equivariance of Z as an estimator for the unknown parameter h 2 Rd , and we have to prove the first sign ‘’. Consider any estimator for the unknown parameter, define n , Bn , Qn as in Lemma 5.9, and recall that the sequence .Qn /n depends only on the pair ., Z/. A trivial chain of inequalities is .C/
1 suph2Rd R., h/ suph2Bn R., h/ Z n .dh/ R., h/ D: R., Bn /
for arbitrary n 2 N. We shall show that Z lim inf R., Bn / .` ^ N /.x/ N .0, J 1 /.dx/ n!1 .CC/ for every constant N < 1 . Given (+) and (++), monotone convergence as N ! 1 will finish the proof of the theorem. In order to prove (++), observe first that it is sufficient to work with loss functions `./ which are subconvex and bounded. For such `./, Lemma 5.9(ii) shows Z
R., Bn / D `.x/ N .0, J 1 / ? Qn .dx/ C k`k1 o.1/ as n ! 1, where Anderson’s inequality 5.7 gives Z Z
`.x/ N .0, J 1 / ? Qn .dx/ `.x/ N .0, J 1 /.dx/ Both assertions together establish (++); the proof is complete.
for every n .
To resume, a Gaussian shift experiment E.J / allows – thanks to the properties of its central statistic – for two remarkable results: first, Boll’s Convolution Theorem 5.5 for equivariant estimators; second, Lemma 5.9 for arbitrary estimators , according to which the structure of risks of equivariant estimators is mimicked by Bayes risks over large balls Bn . All this is independent of the choice of a subconvex loss function. The Minimax Theorem 5.10 is then an easy consequence of both results. So neither a notion of ‘maximum likelihood’ nor any version of a ‘Bayes property’ turns out to be intrinsic for good estimation: it is the existence of a central statistic which allows for good estimation in the Gaussian shift experiment E.J /. 5.10’ Exercise. In a Gaussian shift model E.J / with unknown parameter h 2 Rd and central statistic Z D J 1 S , fix some point h0 2 ‚, and consider for 0 < ˛ < 1 estimators T :D ˛ Z C .1 ˛/ h0 according to exercise 5.4000 which are not equivariant. Under squared loss, calculate the risk of T as a function of h 2 Rd . Evaluate what T ‘achieves’ at the particular point h0 in comparison
Section 5.2
Brownian Motion with Unknown Drift as a Gaussian Shift Experiment
141
to the central statistic, and the price to be paid for this at parameter points h 2 Rd which are distant from h0 . Historically, estimators of similar structure have been called ‘superefficient at h0 ’ and have caused some trouble; in the light of Theorem 5.10 it is clear that any denomination of this type is misleading.
5.2
Brownian Motion with Unknown Drift as a Gaussian Shift Experiment
In the following, all sections or subsections preceded by an asterisk will require techniques related to continuous-time martingales, semi-martingales and stochastic analysis, and a reader not interested in stochastic processes may skip these and keep on with the statistical theory. Some typical references for sections marked by are e.g. Liptser and Shiryaev [88], Metivier [98], Jacod and Shiryaev [64], Ikeda and Watanabe [61], Chung and Williams [15], Karatzas and Shreve [69], Revuz and Yor [112]. In the present section, we introduce the notion of a density process (or likelihood ratio process), and then look in particular to statistical models for Brownian motion with unknown drift as an illustration to Section 5.1. 5.11 Definition. Consider a probability space ., A/ equipped with a right-continuous filtration F D .F t / t0 . For any probability measure P on ., A/ and any 0 t < 1, we write P t for the restriction of P to the -field F t . For pairs P 0 , P of probability measures on ., A/, we write loc
P 0 j h 12 s > j 1 s h> s ds p 1A .s/ e , e P0 .dj / 1B \ DC .j / e Rd .2/d jdet.j /j and after rearranging terms 12 h> j h 12 s > j 1 sCh> s D 12 .sj h/> j 1 .sj h/ gives Z Z 1 1 > 1 ds p e 2 .sj h/ j .sj h/ 1A .s/ . P0J .dj / 1B \ DC .j / Rd .2/d jdet.j /j By (i) and Definition 6.2, the last expression equals Z Z J P0 .dj / 1B \ DC .j / N .j h, j /.A/ D PhJ .dj / 1B \ DC .j / N .j h, j /.A/ .
152
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
The set DC has full measure under PhJ since Ph P0 . So the complete chain of equalities shows Z Eh .1B .J / 1A .S // D PhJ .dj / 1B .j / N .j h, j /.A/ where A 2 B.Rd / and B 2 B.Rd d / are arbitrary. This gives (iii) and completes the proof. 6.4 Definition. In a quadratic experiment E.S , J /, for h 2 Rd , we call S J h the score in h and J the observed information. Note that S J h is not a score in h in the classical sense of Definition 1.2: if we have Mh :D S J h D .r log f / .h, / on , there is so far no information on integrability properties of Mh under Ph . Similarly, the random object J is not Fisher information in h in the classical sense of Definition 1.2. 6.4’ Remark. In mixed normal experiments E.S , J /, the notion of ‘observed information’ due to Barndorff-Nielsen (e.g. [3]) gains a new and deeper sense: the observation ! 2 itself communicates through the observed value j D J.!/ the amount of ‘information’ it carries about the unknown parameter. Indeed from Definition 6.2 and Proposition 6.3(iii), the family of conditional probability measures ± ° jJ Dj : h 2 Rd Ph is a Gaussian shift experiment E.j / as discussed in Proposition 5.3 for PJ -almost all j 2 Rd d ; this means that we can condition on the observed information. By Proposition 6.3, this interpretation is valid under mixed normality, and does not carry over to general quadratic experiments. Conditioning on the observed information, the convolution theorem carries over to mixed normal experiments – as noticed by [66] – provided we strengthen the general notion of Definition 5.4 of equivariance and consider estimators which are equivariant conditional on the observed information. 6.5 Definition. In a mixed normal experiment E.S , J /, an estimator for the unknown parameter h 2 Rd is termed strongly equivariant if the transition kernel ./
.h/jJ Dj
.j , A/ ! Ph
.A/ ,
j 2 DC , A 2 B.Rd /
admits a version which does not depend on the parameter h 2 Rd .
153
Section 6.1 Quadratic and Mixed Normal Experiments
In a mixed normal experiment E.S , J /, the central statistic Z D 1¹J 2DC º J 1 S is a strongly equivariant estimator for the unknown parameter, by Proposition 6.3(iv). In the particular case of a Gaussian shift experiment E.J /, J is deterministic, and the notions of equivariance (Definition 5.4) and strong equivariance (Definition 6.5) coincide: this follows from a reformulation of Definition 6.5. 6.5’ Proposition. Consider a mixed normal experiment E.S , J / and an estimator for the unknown parameter h 2 Rd . Then is strongly equivariant if and only if .C/
L . . h, J / j Ph / does not depend on the parameter h 2 Rd .
Proof. Clearly Definition 6.2 combined with ./ of Definition 6.5 gives (+). To prove the converse, fix a countable and \–stable generator S of the -field B.Rd /, and assume that Definition 6.2 holds in combination with (+). Then for every A 2 S and every h 2 Rd , we have from (+) Eh . 1C .J / 1A . h/ / D E0 . 1C .J / 1A ./ / for all C 2 B.Rd d /, and thus Z Z hjJ Dj jJ Dj J .A/ D PJ .dj / 1C .j / P0 .A/ P .dj / 1C .j / Ph for all C 2 B.Rd d /. This shows that the functions hjJ Dj
j ! Ph
.A/
and
jJ Dj
j ! P0
.A/
coincide PJ -almost surely on Rd d . Thus there is some PJ -null set Nh,A 2 B.Rd / such that hjJ Dj jJ Dj .A/ D P0 .A/ for all j 2 DC nNh,A Ph where DSC is the set defined before Definition 6.1. Keeping h fixed, the countable union Nh D A2S Nh,A is again a PJ -null set in B.Rd /, and we have j 2 DC nNh :
hjJ Dj
Ph
jJ Dj
.A/ D P0
.A/ for all A 2 S .
S being a \–stable generator of B.Rd /, we infer: hjJ Dj
. /
the probability laws Ph
jJ Dj
./ and P0
./
coincide on .Rd , B.Rd // when j 2 DC nNh . jJ Dj
Let K., / denote a regular version of the conditional law .j , A/ ! P0 .A/ , j 2 DC , A 2 B.Rd /. Since DC nNh is a set of full measure under PJ , assertion . / hjJ Dj shows that K., / is at the same time a regular version of .j , A/ ! Ph .A/ .
154
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
So far, h 2 Rd was fixed but arbitrary: hence K., / is a regular version simultanehjJ Dj ously for all conditional laws .j , A/ ! Ph .A/ , h 2 Rd . We thus have ./ in Definition 6.5 which finishes the proof. We state the first main result on estimation in mixed normal experiments, due to Jeganathan [66]. 6.6 Convolution Theorem. Consider a strongly equivariant estimator in a mixed normal experiment E.S , J /. Then there exist probability measures ¹Qj : j 2 DC º on .Rd , B.Rd // such that for PJ -almost all j 2 DC : .h/jJ Dj
Ph
D N .0, j 1 / ? Qj
does not depend on h 2 Rd ,
and estimation errors of are distributed as Z
L . h j Ph / D PJ .dj / N .0, j 1 / ? Qj independently of h 2 Rd . In addition, .j , A/ ! Qj .A/ is obtained as a regular .Z/jJ Dj .A/ where j 2 DC and A 2 B.Rd /. version of the conditional law P0 Proof. Since is a strongly equivariant estimator for the unknown parameter h 2 Rd , it is sufficient to determine a family of probability measures ¹Qj : j 2 Rd d º on .Rd , B.Rd // such that for PJ -almost all j 2 DC :
jJ Dj
P0
.A/ D N .0, j 1 / ? Qj .
e / of the conditional law of the pair ., S / given J D Prepare a regular version K., under P0 e , C / :D P0.,S /jJ Dj .C / , .j , C / ! K.j
j 2 DC ,
C 2 B.Rd /˝B.Rd /
with second marginals specified according to Proposition 6.3(ii): e , Rd / D N .0, j / K.j
for all j 2 DC .
(1) For t 2 Rd fixed, for arbitrary events B 2 B.Rd d / and arbitrary h 2 Rd , we start from Proposition 6.5’ > > E0 1B .J / e i t D Eh 1B .J / e i t .h/
155
Section 6.1 Quadratic and Mixed Normal Experiments
which gives on the right-hand side > E0 1B .J / e i t 1 > > > > D E0 1B .J / e i t .h/ Lh=0 D E0 1B .J / e i t .h/ e h S 2 h J h Z Z J i t > .kh/ h> s 12 h> j h e K.j , .d k, ds// e e D P0 .dj / 1B .j / Z Z e , .d k, ds// e i t > .kh/ e h> s 12 h> j h K.j D PJ .dj / 1B\DC .j / and on the left-hand side Z Z i t > J i t >k e D P .dj / 1B\DC .j / K.j , .d k, ds// e E0 1B .J / e e /. We compare the expressions in square brackets in both equalby definition of K., ities: since B 2 B.Rd d / is arbitrary and since PJ is concentrated on DC , the functions Z e , .d k, ds// e i t > .kh/ e h> s 12 h> j h j ! K.j Z
and j !
e , .d k, ds// e i t > k K.j
coincide PJ -almost surely on DC , for every fixed value of h 2 Rd . Let Nh denote of h in the last formula. The an exceptional PJ -null set S with respectJ to a fixed value Nh is a P -null set in DC with the property countable union N :D h2 Qd
.˘/
8R i t > .kh/ e h> s e ˆ < Rd RdRK.j , .d k, ds// e e , .d k, ds// e i t > k D Rd Rd K.j ˆ : for all j 2 DC n N and for all h 2 Qd .
1 2
h> j h
(2) Now we fix j 2 DC n N . We consider the first integral in .˘/ as a function of h Z e , .d k, ds// e i t > .kh/ e h> s 12 h> j h Rd 3 h ! K.j which as in the proof of Theorem 5.5 is extended to an analytic function Z d e , .d k, ds// e i t > .kz/ e z > s 12 z > j z D: fj .z/ . K.j C 3 z ! The second integral in .˘/ equals fj .0/. The function fj being analytic, assertion .˘/ yields for all j 2 DC n N : the function fj ./ is constant on C d .
156
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
In particular, for all j 2 DC n N , fj .0/ D fj .i j 1 t / which gives in analogy to ./ in the proof of Theorem 5.5 Z Z e , .d k, ds// e i t > k D e 12 t > j 1 t e , .d k, ds// e i t > .kj 1 s/ . .C/ K.j K.j e / we have proved By definition of K., .CC/
for all j 2 DC n N : > > 1 > 1 E0 e i t j J D j D e 2 t j t E0 e i t .Z/ j J D j .
(3) So far, we have considered some fixed value of t 2 Rd . Hence the PJ -null set N in .CC/ depends on t . Taking the union of such null sets for all t 2 Qd we obtain e for which dominated convergence with respect to the argument t in a PJ -null set N both integrals in (+) establishes e: for all j 2 DC n N > > 1 > 1 E0 e i t j J D j D e 2 t j t E0 e i t .Z/ j J D j , t 2 Rd . This is an equation between characteristic functions of conditional laws. Introducing a regular version .Z/jJ Dj
.j , A/ ! Qj .A/ D P0
.A/
for the conditional law of Z given J D under P0 we have e , all A 2 B.Rd / : for all j 2 DC n N
jJ Dj
P0
.A/ D ŒN .0, j 1 / ? Qj .A/ .
This finishes the proof of the convolution theorem in the mixed normal experiment E.S , J /. In combination with Proposition 6.3(iv), the Convolution Theorem 6.6 shows that within the class of all strongly equivariant estimators for the unknown parameter h 2 Rd in a mixed normal experiment E.S , J /, the central statistic Z D 1¹J 2DC º J 1 S achieves minimal spread of estimation errors, or equivalently, achieves the best possible concentration of an estimator around the true value of the parameter. In analogy to Corollary 5.8 this can be reformulated – again using Anderson’s Lemma 5.7 – as follows: 6.6’ Corollary. In a mixed normal experiment E.S , J /, with respect to any subconvex loss function, the central statistic Z minimises the risk R., h/ R.Z, h/ ,
h 2 Rd
in the class of all strongly equivariant estimators for the unknown parameter.
Section 6.1 Quadratic and Mixed Normal Experiments
157
6.6” Remark. In the Convolution Theorem 6.6 and in Corollary 6.6’, the best concentrated distribution for estimation errors Z L .Z j P0 / D L PJ .dj / N 0 , j 1 does not necessarily admit finite second moments. As an example, the mixing distribution PJ might be in dimension d D 1 some Gamma law .a, p/ with shape parameter a 2 .0, 1/: then Z 1 .a, p/.du/ D 1 EP0 Z 2 D u .0,1/ and the central statistic Z does not belong to L2 .P0 /. We shall see examples in Chapter 8. When L .ZjP0 / does not admit finite second moments, comparison of estimators based on squared loss `.x/Dx 2 – or based on polynomial loss functions – does not make sense, thus bounded loss functions are of intrinsic importance in Theorem 6.6 and Corollary 6.6’, or in Lemma 6.7 and Theorem 6.8 below. Note that under mixed normality, estimation errors are always compared conditionally on the observed information, never globally. In mixed normal experiments, one would like to be able to compare the central statistic Z not only to strongly equivariant estimators, but to arbitrary estimators for the unknown parameter h 2 Rd . Again we can condition on the observed information to prove in analogy to Lemma 5.9 that arbitrary estimators are ‘approximately strongly invariant’ under a ‘very diffuse’ prior, i.e. in the absence of any a priori information except that the unknown parameter should range over large balls centred at the origin. Again Bn denotes the closed ball ¹x 2 Rd : jxj nº, and n the uniform law on Bn . 6.7 Lemma. In a mixed normal experiment E.S , J /, every estimator for the unj known parameter h 2 Rd can be associated to probability measures ¹ Qn : j 2 DC , n 1 º on .Rd , B.Rd // such that Z Z i h J 1 j d1 P .dj / N .0, j / ? Qn n .dh/ L . hjPh / , ! 0 as n ! 1 , and for every bounded subconvex loss function the risks Z R., Bn / D n .dh/ Eh .`. h// can be represented as n ! 1 in the form Z Z h i J N .0, j 1 / ? Qnj .dv/ `.v/ C o.1/ , R., Bn / D P0 .dj /
n!1.
158
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
Proof. The second assertion is a consequence of the first since `./ is measurable and bounded, by (+) in Remark 5.8’. We prove the first assertion in analogy to the proof of Lemma 5.9. (1) The pair .Z, J / is a sufficient statistic in the mixed normal experiment E.S , J /. Thus, for any random variable taking values in .Rd , B.Rd //, there is a regular version of the conditional law of given .Z, J / D which does not depend on the e / from Rd DC to Rd which parameter h 2 Rd . We fix a transition kernel K., provides a common regular version j.Z,J /D.z,j / e .A/ , K..z, j /, A/ D Ph
A 2 Rd , .z, j / 2 Rd DC
under all values h 2 Rd of the parameter. When we wish to keep j 2 DC fixed, we write for short e j /, A/ , K j .z, A/ :D K..z,
A 2 Rd , z 2 Rd .
Sufficiency also allows to consider mixtures Z . Z/j.Z,J /D.h,j / j Qn .A/ :D n .dh/ P .A/ ,
A 2 B.Rd /
for every j 2 DC and every n 2 N. (2) With notations introduced in step (1), we have from Proposition 6.3(iv) Z n .dh/ Ph . h 2 A/ Z Z Z ZjJ Dj e .dz/ K..z, j /, A C h/ D n .dh/ PhJ .dj / Ph Z Z Z D PJ .dj / n .dh/ N .h, j 1 /.dz/ K j .z, A C h/ Z Z Z J 1 j N .0, j /.dx/ n .dh/ K .x C h, A C h/ D P .dj / in analogy to step (3) of the proof of Lemma 5.9. (3) In analogy to step (2) in the proof of Lemma 5.9, we have bounds ˇZ ˇ ˇ ˇ j j ˇ n .dh/ K .x C h, A C h/ Q .A x/ ˇ .Bn 4.Bn C x// ./ n ˇ ˇ .Bn / which are uniform in A 2 B.Rd / and j 2 DC . To check this, start from Z 1 .dh/ 1Bn .h/ K j .x C h, A C h/ .Bn / Z 1 D .dh0 / 1Bn Cx .h0 / K j .h0 , A x C h0 / .Bn /
159
Section 6.1 Quadratic and Mixed Normal Experiments
e / equals which by definition of K., Z 1 e .h0 , j /, A x C h0 .dh0 / 1Bn Cx .h0 / K .Bn / Z 1 D .dh0 / 1Bn Cx .h0 / P 2 .A x C h0 / j .Z, J / D .h0 , j / .Bn / Z 1 .dh0 / 1Bn Cx .h0 / P . Z/ 2 .A x/ j .Z, J / D .h0 , j / . D .Bn / Now the last expression Z 1 . Z/j.Z,J /D.h0 ,j / .A x/ .dh0 / 1Bn Cx .h0 / P .Bn / can be compared to Qnj .A
1 x/ D .Bn /
Z
. Z/j.Z,J /D.h0 ,j /
.dh0 / 1Bn .h0 / P
.A x/
up to the error bound on the right-hand side of ./, uniformly in A 2 B.Rd / and j 2 DC . (4) Combining steps (2) and (3), we approximate Z Z Z PJ .dj / N .0, j 1 /.dx/ Qnj .A x/ n .dh/ Ph . h 2 A/ by uniformly in A 2 B.Rd / as n ! 1, up to error terms Z Z .Bn 4.Bn C x// N .0, j 1 /.dx/ PJ .dj / .Bn / as in the proof of Lemma 5.9. Using dominated convergence twice, these vanish as n ! 1. The following is – together with the Convolution Theorem 6.6 – the second main result on estimation in mixed normal experiments. 6.8 Minimax Theorem. In a mixed normal experiment E.S , J /, the central statistic Z is a minimax estimator for the unknown parameter with respect to any subconvex loss function `./: we have Z Z ZjJ Dj PJ .dj / P0 .dz/ `.z/ D sup R.Z, h/ . sup R., h/ E0 .`.Z// D h2Rd
h2Rd
Proof. Consider any estimator for the unknown parameter h 2 Rd in E.S , J /, and any subconvex loss function `./. Since E0 .`.Z// (finite or not) is the increasing limit
160
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
of E0 .Œ` ^ N .Z// as N ! 1 where all ` ^ N are subconvex loss functions, it is sufficient to prove a lower bound sup R., h/ E0 .`.Z// D R.Z, 0/
h2Rd
for subconvex loss functions which are bounded. Then we have a chain of inequalities (its first two ‘’ are trivial) where we can apply Lemma 6.7: Z sup R., h/ sup R., h/ n .dh/ Eh .`. h// D R., Bn / h2B h2Rd Z h Z n i J N .0, j 1 / ? Qnj .dv/ `.v/ C o.1/ as n ! 1 . D P0 .dj / Now Anderson’s Lemma 5.7 allows us to compare integral terms for all n, j fixed, and gives lower bounds Z Z J P .dj / N .0, j 1 /.dv/ `.v/ C o.1/ as n ! 1 . The last integral does not depend on n, and by Definition 6.2 and Proposition 6.3(iv) equals Eh .`.Z h// D R.Z, h/ for arbitrary h 2 Rd . This finishes the proof. What happens beyond mixed normality, in the genuinely quadratic case? We know that Z D J 1 S is a maximum likelihood estimator for the unknown parameter, by Remark 6.1’, we know that for J.!/ 2 DC the log-likelihood surface h
!
ƒh=0 .!/ D h> S.!/
1 > h J.!/ h 2
1 .h Z.!//> J.!/ .h Z.!// C expressions not depending on h 2 has the shape of a parabola which opens towards 1 and which admits a unique maximum at Z.!/, but this is no optimality criterion. For quadratic experiments which are not mixed normal, satisfactory optimality results seem unknown. For squared loss, Gushchin [38, Thm. 1, assertion 3] proves admissibility of the ML estimator Z under random norming by J (which makes the randomly normed estimation errors at h coincide with the score S J h at h) in dimension d D 1. He also has Cramér–Rao type results for restricted families of estimators. Beyond the setting of mixed normality, results which allow to distinguish an optimal estimator from its competitors simultaneously under a sufficiently large class of loss functions – such as those in the convolution theorem or in the minimax theorem – seem unknown. D
6.2
Likelihood Ratio Processes in Diffusion Models
Statistical models for diffusion processes provide natural examples for quadratic experiments. For laws of solutions of stochastic differential equations, we consider first
Section 6.2
161
Likelihood Ratio Processes in Diffusion Models
the problem of local absolute continuity or local equivalence of probability measures and the structure of the density processes (this relies on Liptser and Shiryayev [88] and Jacod and Shiryaev [64]), then we specialise to settings where quadratic statistical models arise. Since this section is preceded by an asterisk , a reader not interested in stochastic processes may skip this and go directly to Chapter 7. As in Notation 5.13, .C , C , G/ is the canonical path space for Rd -valued stochastic processes with continuous paths; G is the (right-continuous) filtration generated by the canonical process D . t / t0 . Local absolute continuity and local equivalence of probability measures were introduced in Definition 5.11; density processes were defined in Theorem 5.12. 6.9 Assumptions and Notations for Section 6.2. (a) We consider d -dimensional stochastic differential equations (SDE) .I/
dX t D b.t , X t / dt C .t , X t / d W t ,
t 0,
X 0 x0
.II/
dX t0
t 0,
X 0 x0
Db
0
.t , X t0 / dt
C
.t , X t0 / d W t
,
driven by m-dimensional Brownian motion W . SDE (I) and (II) share the same diffusion coefficient : Œ0, 1/ Rd ! Rd m but have different drift coefficients b , b0 :
Œ0, 1/ Rd ! Rd .
Both equations (I) and (II) have the same starting value x0 2 Rd . All coefficients are measurable in .t , x/; we assume Lipschitz continuity in the second variable jb.t , x/ b.t , y/j C jb 0 .t , x/ b 0 .t , y/j C k .t , x/ .t , y/k K jx yj (for t 0 and x, y 2 Rd ) together with linear growth jb.t , x/j2 C jb 0 .t , x/j2 C k .t , x/k2 K .1 C jxj2 / where K is some global constant. The d d -matrices c.t , x/ :D .t , x/ > .t , x/ will be important in the sequel. (c) Under the assumptions of (b), equations (I) and (II) have unique strong solutions (see e.g. [69, Chap. 5.2.B]) on the probability space ., A, F , P / carrying the driving Brownian motion W , with F the filtration generated by W . We will be interested in the laws of the solutions Q :D L. X j P / and
Q0 :D L. X 0 j P /
on the canonical path space
.C , C , G/ .
162
Chapter 6 Quadratic Experiments and Mixed Normal Experiments 0
(d) We write mX and mX for the .P , F /-local martingale parts of X and X 0 Z t Z t X0 0 0 mX D X X b.s, X / ds , m D X X b.s, Xs0 / ds . t 0 s t t t 0 0
0
Their angle brackets are the predictable processes Z t Z t E D 0E D X X D c.s, Xs / ds , m D c.s, Xs0 / ds m t
t
0
0
taking values in the space of symmetric and non-negative definite d d -matrices. We shall explain in Remark 6.12’ below why equations (I) and (II) have been assumed to share the same diffusion coefficient. The next result is proved with the arguments which [88, Sect. 7] use on the way to their Theorems 7.7 and 7.18. See also [64, pp. 159–160, 179–181, and 187]). 6.10 Theorem. Let the drift coefficients b in (I) and b 0 in (II) be such that a measurable function : Œ0, 1/ Rd ! Rd exists which satisfies the following conditions (+) and (++): b 0 .t , x/ D b.t , x/ C c.t , x/ .t , x/ ,
.C/ .CC/
t 0 , x 2 Rd ,
Z t > c .s, s / ds < 1 for all t < 1, Q- and Q0 -almost surely. 0
(a) Then the probability laws Q0 and Q on .C , C / are locally equivalent relative to G. (b) The density process of Q0 with respect to Q relative to G is the .Q, G/-martingale ²Z t ³ Z 1 t > > .s, s / d mQ c .s, / ds L D .L t / t0 , L t D exp s s 2 0 0 where mQ denotes the local martingale part of the canonical process under Q: Z t Q Q mQ D .m t / t0 , m t D t 0 b.s, s / ds 0
and where the local martingale Z
> .s, s / d mQ s D: M
under Q has angle bracket Z t hM i t D > c .s, s / ds , 0
t 0.
Section 6.2
163
Likelihood Ratio Processes in Diffusion Models
(c) The process L in (b) is the exponential ± ° 1 L t D exp M t hM i t , 2
t 0
of M under Q, in the sense of a solution .L t / t0 under Q to the SDE Z
t
Lt D 1 C
L t dM t ,
t 0.
0
A usual notation for the exponential of M under Q is EQ .M /. Proof. (1) On .C , C , G/ define stopping times ² ³ Z t > c .s, s / ds > n , n :D inf t > 0 :
n1
0
(which may take the value C1 with positive probability under Q or under Q0 ) and write .n/ .s, s / :D .s, s / 1ŒŒ0,n ŒŒ .s/ , n 1 . By assumption (++) which is symmetric in Q and Q0 , we have Q- and Q0 -almost surely.
n " 1
(2) For every n 2 N, on ., A, F , P /, we also have unique strong solutions for equations (II.n/ ): .n/
.II.n/ / dX t
.n/
.n/
D Œb C c .n/ .t , X t / dt C .t , X t / d W t ,
t 0,
X 0 x0 .
By unicity, X .n/ coincides on ŒŒ0, n with the solution X 0 to SDE (II) where n is the F -stopping time n :D n ıX 0 . If we write Q.n/ for the law L.X .n/ jP / on .C , C , G/, then the laws Q.n/ and Q0 coincide in restriction to the -field Gn of events up to time n . (3) On .C , C , G/, the Novikov criterion [61, p. 152] applies – as a consequence of the stopping in step (1) – and grants that for fixed n 2 N ²Z t ³ Z 1 t .n/ > .n/ .n/ Œ Œ .n/ > .s, s / d mQ c Œ .s, / ds , t 0 LL t D exp s s 2 0 0 is a Q-martingale. Thus LL .n/ defines a probability measure QL .n/ on .C , C , G/ via R .n/ QL .n/ .A/ D A LL t dQ if A 2 G t . Writing ˇ for arbitrary directions in Rd , Girsanov’s theorem [64, Thm. III.3.11] then states that
Z > Q > Q .n/ > Q Œ .s, s / d ms , 1 i d ˇ m ˇ m , 0
164
Chapter 6 Quadratic Experiments and Mixed Normal Experiments
is a QL .n/ -local martingale whose angle bracket under QL .n/ coincides with the angle bracket of ˇ >mQ under Q. For the d -dimensional process mQ this shows that Z Q Œ c .n/ .s, s / ds m 0
is a QL .n/ -local martingale whose angle bracket under QL .n/ coincides with the angle bracket of mQ under Q. Since C is generated by the coordinate projections, this signifies that QL .n/ is the unique law on .C , C , G/ such that the canonical process under QL .n/ is a solution to SDE .II.n/ /. As a consequence, the two laws coincide: Q.n/ from step (2) is the same law as QL .n/ , on .C , C , G/. (4) For every n fixed, step (3) shows the following: the laws Q.n/ and Q are locally equivalent relative to G since the density process of Q.n/ with respect to Q relative to G L^n D LL .n/ is strictly positive Q-almost surely. Recall also that the laws Q.n/ and Q0 coincide in restriction to Gn . loc
(5) The following argument proves Q0 t C P . n t / P X^t .n/ P X^t 2 A C P . n t / D Q.n/ . A / C Q0 . n t / D Q.n/ . A / C o.1/ as n ! 1 since X 0 and X .n/ coincide up to time n D n ı X 0 as above, and since n " 1 Q0 -almost surely. Now, Q.n/ and Q being locally equivalent relative to G for all n, by step (4), the above gives A 2 G t , Q.A/ D 0
H)
Q.n/ .A/ D 0 8 n 1
H)
Q0 .A/ D 0
loc
which proves Q0 Kº, the "-ı-characterisation of mutual contiguity in Proposition 3.7 shows that .Yn /n under .Pn,# /n is Rq -tight
”
.Yn /n under .Pn,#Cın hn /n is Rq -tight.
In particular, the quadratic expansions under Pn,# which define LAQ in Definition 7.1 remain valid under Pn,#Cın hn whenever .hn /n is bounded: when LAQ holds at # we have 1 > hn =0 D h> h Jn hn C o.Pn,#Cın hn / .1/ as n ! 1 .ı/ ƒn,# n Sn 2 n for bounded sequences .hn /n . (4) Now we can prove the second part of the proposition. Under LAQ at #, weak convergence of pairs L .Sn , Jn j Pn,# / ! L .S , J j P0 /
(in Rd Rd d , as n ! 1)
implies by the continuous mapping theorem and by representation of log-likelihood ratios in Definition 7.1 h=0 L ƒn,# , Sn , Jn j Pn,# ! L ƒh=0 , S , J j P0 (weakly in R Rd Rd d , as n ! 1) for fixed h 2 Rd . Proposition 7.2 allows to extend this to convergent sequences hn ! h: hn =0 , Sn , Jn j Pn,# ! L ƒh=0 , S , J j P0 L ƒn,# (weakly in R Rd Rd d , for n ! 1) . h =0
n is the log-likelihood ratio of Pn,#Cın hn with respect to Pn,# in En .#/, and Here ƒn,# h=0 the log-likelihood ratio of Ph with respect to P0 in E1 .#/. Now we apply Le ƒ Cam’s third lemma, cf. Lemma 3.6 and Remark 3.6’, and obtain weak convergence under contiguous alternatives hn =0 , Sn , Jn j Pn,#Cın hn ! L ƒh=0 , S , J j Ph L ƒn,#
(weakly in R Rd Rd d , for n ! 1). This finishes the proof of the proposition.
185
Section 7.1 Local Asymptotics of Type LAN, LAMN, LAQ
Among competing sequences of estimators for the unknown parameter in the sequence of experiments .En /n , one would like to identify – if possible – sequences which are ‘asymptotically optimal’. Fix # 2 ‚. In a setting of local asymptotics at # with local scale .ın .#//n , the basic idea is as follows. To any estimator Tn for the unknown parameter in En associate Un D Un .#/ :D ın1 .#/ .Tn #/ which is the rescaled estimation error or Tn at #; clearly Un h D ın1 .#/ .Tn Œ# C ın .#/h/ for h 2 ‚n,# . Thus, # being fixed, we may consider Un D Un .#/ as an estimator for the local parameter h in the local experiment En .#/ D ¹Pn,#Cın .#/h : h 2 ‚n,# º at #. The aim will be to associate to such sequences .Un /n limit objects U D U.#/ in the limit experiment E1 .#/, in a suitably strong sense. Then one may first discuss properties of the estimator U in the limit experiment E1 .#/ and then convert these into properties of Un as estimator for the local parameter h 2 ‚n,# in the local experiment En .#/. This means that we will study properties of Tn in shrinking neighbourhoods of # defined from local scale ın .#/ # 0. Lemma 7.5 below is the technical key to this approach. It requires the notion of a Markov extension of a statistical experiment. 7.4 Definition. Consider a statistical experiment .E, E, P /, another measurable space .E 0 , E 0 / and a transition probability K.y, dy 0 / from .E, E/ to .E 0 , E 0 /. Equip the product space b b b D E E0 , b E D E ˝ E0 .E, E/ , E with probability laws PK .dy, dy 0 / :D P .dy/ K.y, dy 0 / ,
P 2P.
This yields a statistical model b :D ¹PK : P 2 P º P
on
b b .E, E/
with the following properties: (i) for every pair Q, P and every version L of the likelihood ratio of P with respect to Q in the experiment .E, E, P /, b L defined by b L.y, y 0 / :D L.y/ on .E E 0 , E ˝ E 0 / is a version of the likelihood ratio of PK with respect to QK in the experiment b /; b b .E, E, P
186
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
b /, statistics U taking values in .E 0 , E 0 / and realising the b b (ii) in the model .E, E, P prescribed laws Z 0 P .dy/ K.y, A0 / , A0 2 E 0 , PK.A / :D E
b under the probability measure PK 2 P b b b 3 exist: on .E, E/, we simply define the random variable U as the projection E 0 0 0 0 .y, y / ! y 2 E on the second component which gives PK .U 2 A / D PK.A0 / for A0 2 E 0 . b / is called a Markov extension of .E, E, P / ; clearly every b b The experiment .E, E, P b/ b b statistic Y already available in the original experiment is also available in .E, E, P 0 via lifting Y .y, y / :D Y .y/. b / is statistically the b b We comment on this definition. By Definition 7.4(i), .E, E, P same experiment as .E, E, P / since ‘likelihoods remain unchanged’: for arbitrary P .0/ and P .1/ , : : : , P .`/ in P , writing L.i/ for the likelihood ratio of P .i/ with re.i/ .0/ L.i/ for the likelihood ratio of PK with respect to PK spect to P .0/ in .E, E/, and b b /, we have (cf. Strasser [121, 53.10 and 25.6]) by construction b b in .E, E, P .1/ .0/ L ,:::,b L.`/ / j PK . L .L.1/ , : : : , L.`/ / j P .0/ D L .b Whereas .E 0 , E 0 /-valued random variables having law PK under P 2 P might not exist in the original experiment, their existence is granted on the Markov extension b / of .E, E, P / by Definition 7.4(ii). b b .E, E, P 7.5 Lemma. Fix # 2 ‚ such that LAQ holds at #. Relative to #, consider An measurable mappings Un D Un .#/ : n ! Rk with the property L .Un j Pn,# / ,
n1,
is tight in Rk .
Then there are subsequences .nl /l and A-measurable mappings U : ! Rk in (if necessary, Markov extensions of) the limit experiment E1 D E.S , J /, with the following property: 8 d d d k ˆ < weak convergence in R R R L Snl , Jnl , Unl j Pnl ,#Cınl hl ! L .S , J , U j Ph / as l ! 1 ˆ : holds for arbitrary h and convergent sequences hl ! h . Proof. (1) By LAQ at # we have tightness of L. Sn , Jn j Pn,# /, n 1; by assumption, we have tightness of L.Un jPn,# /, n 1. Thus L . Sn , Jn , Un j Pn,# / ,
n1,
is tight in Rd Rd d Rk .
Section 7.1 Local Asymptotics of Type LAN, LAMN, LAQ
187
Extracting weakly convergent subsequences, let .nl /l denote a subsequence of N and 0 a probability law on Rd Rd d Rk such that ./ L Snl , Jnl , Unl j Pn,# ! 0 , l ! 1 . For convergent sequences hl ! h we deduce from the quadratic expansion of loglikelihoods in Definition 7.1 together with Proposition 7.2 h =0 0 , l ! 1 .C/ L ƒnl ,# , Snl , Jnl , Unl j Pn,# ! e l
where the law e 0 on R Rd Rd d Rk is the image measure of 0 under the mapping 1 .s, j , u/ ! h>s h>j h , s, j , u . 2 Mutual contiguity .˘/ in Proposition 7.3 and Le Cam’s Third Lemma 3.6 transform (+) into weak convergence h =0 h , l ! 1 L ƒnl ,# , Snl , Jnl , Unl j Pnl ,#Cınl hl ! e l
0 by where e h is a probability law on R Rd Rd d Rk defined from e 0 .d , ds, dj , du/ . e h .d , ds, dj , du/ D e e Projecting on the components .s, j , u/, we thus have proved for any limit point h 2 Rd and any sequence .hl /l converging to h .CC/ L Snl , Jnl , Unl j Pnl ,#Cınl hl ! h , l ! 1 where – with 0 as in ./ above – the probability measures h are defined from 0 by >s 1 2
.CCC/ h .ds, dj , du/ :D e h
h>j h
0 .ds, dj , du/ on Rd Rd d Rk .
Note that the statistical model ¹h : h 2 Rd º arising in (+++) is attached to the particular accumulation point 0 for ¹L.Sn , Jn , Un j Pn,# / : n 1º which was selected in ./ above, and that different accumulation points for ¹L.Sn , Jn , Un j Pn,# / : n 1º lead to different models (+++). (2) We construct a Markov extension of the limit experiment E.S , J / carrying a random variable U which allows to identify h in (++) as L .S , J , U j Ph /, for all h 2 Rd . Let sL , jL, uL denote the projections which map .s, j , u/ 2 Rd Rd d Rk to either one of its coordinates s 2 Rd or j 2 Rd d or u 2 Rk . In the statistical model ¹h : h 2 Rd º fixed by (+++), the pair .Ls , jL/ is a sufficient statistic. By sufficiency, conditional distributions given .Ls , jL/ of Rk -valued random variables admit regular
188
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
versions which do not depend on the parameter h 2 Rd . Hence there is a transition probability K., / from Rd Rd d to Rk such that ./
uj.L L s ,jL/D.s,j /
K..s, j /, du/ D:
.du/ uj.L L s ,jL/D.s,j /
provides a common determination of all conditional laws h Defining . /
b :D A˝B.Rk / , b :D Rk , A b h .d!, du/ :D Ph .d!/K. .S.!/, J.!// , du / , P
we have a Markov extension
.du/ , h 2 Rd .
h 2 Rd
® ¯ b, P b, A b h : h 2 Rd
of the original limit experiment E1 .#/ D E.S , J / D ., A, ¹Ph : h 2 Rd º/ of Definition 7.1. Exploiting again sufficiency, we can put the laws h of (++) and (+++) in the form .Ls ,jL/
h .ds, dj , du/ D h
.Ls ,jL/
D h
.Ls ,jL/
D h
uj.L L s ,jL/D.s,j /
.ds, dj / h
uj.L L s ,jL/D.s,j /
.ds, dj /
.du/ .du/
.ds, dj / K..s, j /, du/
for all h 2 Rd . Combining ./ and (++) with Proposition 7.3, we can identify the last expression with .S ,J / Ph .ds, dj / K..s, j /, du/ for all h 2 Rd . The Markov extension . / of the original limit experiment allows us to write this as b .S ,J ,U / .ds, dj , du/ P h b to Rk . Hence we have proved where U denotes the projection U.!, u/ :D u from b .S ,J ,U / .ds, dj , du/ for all h 2 Rd h .ds, dj , du/ D P h which is the assertion of the lemma.
From now on, we need no longer distinguish carefully between the original limit experiment and its Markov extension. From ./ and ./ in the last proof, we see that any accumulation point of ¹L.Sn , Jn , Un jPn,# / : n 1º can be written in the form L.S , J jP0 /.ds, dj /K..s, j /, du/ for some transition probability from Rd Rd d to Rk . In this sense, statistical models ¹h : h 2 Rd º which can arise in (++) and (+++) correspond to transition probabilities K., / from Rd Rd d to Rk .
189
Section 7.1 Local Asymptotics of Type LAN, LAMN, LAQ
7.6 Theorem. Assume LAQ at #. For any estimator sequence .Tn /n for the unknown parameter in .En /n , let Un D Un .#/ D ın1 .Tn #/ denote the rescaled estimation errors of Tn at #, n 1. Assume joint weak convergence in Rd Rd d Rd L . Sn , Jn , Un j Pn,# / ! L . S , J , U j P0 /
as n ! 1
where U is a statistic in the (possibly Markov extended) limit experiment E1 .#/ D E.S , J /. Then we have for arbitrary convergent sequences hn ! h L Sn , Jn , Un hn j Pn,#Cın hn ! L . S , J , U h j Ph / .C/ as n ! 1 (weak convergence in Rd Rd d Rd ), and from this ˇ ˇ supjhjC ˇEn,#Cın h ` ın1 .Tn .# C ın h// Eh .`.U h//ˇ .CC/ ! 0 as n ! 1 for bounded and continuous loss functions `./ on Rd and for arbitrarily large constants C < 1. Proof. (1) Assertion (+) of the theorem corresponds to the assertion of Lemma 7.5, under the stronger assumption of joint weak convergence of .Sn , Jn , Un / under Pn,# as n ! 1, without selecting subsequences. For loss functions `./ in Cb .Rd /, (+) contains the following assertion: for arbitrary convergent sequences hn ! h, as n ! 1, En,#Cın hn ` ın1 .Tn .# C ın hn // D En,#Cın hn . `.Un hn // . / ! Eh . `.U h// . (2) We prove that in the limit experiment E1 .#/ D E.S , J /, for `./ continuous and bounded, h ! Eh . `.U h// is continuous on Rd . Consider convergent sequences hn ! h. The structure in Definition 6.1 of likelihoods in the limit experiment implies pointwise convergence of Lhn =0 as n ! 1 to Lh=0 ; these are non-negative, and E0 .Lh=0 / D 1 D E0 .Lhn =0 / holds for all n. This gives (cf. [20, Nr. 21 in Chap. II]) the sequence Lhn =0 under P0 , n 1, is uniformly integrable. For `./ continuous and bounded, we deduce the sequence Lhn =0 `.U hn / under P0 , n 1, is uniformly integrable which contains the assertion of step (2): it is sufficient to write as n ! 1 Ehn . `.U hn // D E0 Lhn =0 `.U hn / ! E0 Lh=0 `.U h/ D Eh `.U h/ .
190
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
(3) Now it is easy to prove (++): in the limit experiment, thanks to step (2), we can rewrite assertion . / of step (1) in the form . /
En,#Cın hn . ` .Un hn // Ehn .`.U hn // D o.1/ ,
n!1
for arbitrary convergent sequences hn ! h. For large constants C < 1 define ˇ ˇ ˛n .C / :D sup ˇ En,#Cın h . ` .Un h// Eh .`.U h// ˇ , n 1 . jhjC
Assume that for some C the sequence .˛n .C //n does not tend to 0. Then there is a sequence .hn /n in the closed ball ¹jhj C º and a subsequence .nk /k of N such that for all k ˇ ˇˇ ˇ ˇEnk ,#Cınk hnk ` Unk hnk Ehnk `.U hnk / ˇ > " for some " > 0. The corresponding .hnk /k taking values in a compact, we can find some further subsequence .nk` /` and a limit point hL such that convergence hnk` ! hL holds as ` ! 1, whereas ˇ ˇˇ ˇ ˇEnk ,#Cınk hnk ` Unk` hnk` Ehnk ` U hnk` ˇ > " `
`
`
`
still holds for all `. This is in contradiction to . /. We thus have ˛n .C / ! 0 as n ! 1. We can rephrase Theorem 7.6 as follows: when LAQ holds at #, any estimator sequence .Tn /n for the unknown parameter in .En /n satisfying a joint convergence condition ./
L . Sn , Jn , Un j Pn,# / ! L . S , J , U j P0 / ,
Un D ın1 .Tn #/
works over shrinking neighbourhoods of #, defined through local scale ın D ın .#/ # 0 , as well as the limit object U viewed as estimator for the unknown parameter h 2 Rd in the limit experiment E1 .#/ D E.S , J / of Definition 7.1, irrespective of choice of loss functions in the class Cb .Rd /, and thus irrespective of any particular way of penalising estimation errors. Of particular interest are sequences .Tn /n coupled to the central sequence at # by .˘/
ın1 .#/.Tn #/ D Zn .#/ C o.Pn,# / .1/ ,
n!1
where Zn D 1¹Jn 2DC º Jn1 Sn is as in Definition 7.1’. In the limit experiment, P0 almost surely, J takes values in DC (Definitions 7.1 and 6.1). By the continuous mapping theorem, the joint convergence condition ./ then holds with U :D J 1 S D Z on the right-hand side, for sequences satisfying .˘/, and Theorem 7.6 reads as follows:
Section 7.2 Asymptotic optimality of estimators in the LAN or LAMN setting
191
7.7 Corollary. Under LAQ in #, any sequence .Tn /n with the coupling property .˘/ satisfies ˇ ˇ sup ˇEn,#Cın h ` ın1 .Tn .# C ın h// Eh .`.Z h//ˇ ! 0 , n ! 1 jhjC
for continuous and bounded loss functions `./ and for arbitrary constants C < 1. This signifies that .Tn /n works over shrinking neighbourhoods of #, defined through local scale ın D ın .#/ # 0 , as well as the maximum likelihood estimator Z in the limit model E1 .#/ D E.S , J /. Recall that in Corollary 7.7, the laws L.Z hjPh / may depend on the parameter h 2 Rd , and may not admit finite higher moments (cf. Remark 6.6”; examples will be seen in Chapter 8). Note also that the statement of Corollary 7.7 – the ‘best’ result as long as we do not assume more than LAQ – should not be mistaken as an optimality criterion: Corollary 7.7 under .˘/ is simply a result on risks of estimators at # – in analogy to Theorem 7.6 under the joint convergence condition ./ – which does not depend on a particular choice of a loss function, and which is uniform over shrinking neighbourhoods of # defined through local scale ın D ın .#/ # 0 . We do not know about the optimality of maximum likelihood estimators in general quadratic limit models (recall the remark following Theorem 6.8).
7.2
Asymptotic optimality of estimators in the LAN or LAMN setting
In local asymptotics at # of type LAMN or LAN, we can do much better than Corollary 7.7. We first consider estimator sequences which are regular at #, following a terminology well established since Hájek [40]: in a local asymptotic sense, this corresponds to the definition of strong equivariance in Definition 6.5 and Proposition 6.5’ for a mixed normal limit experiment, and to the definition of equivariance in Definition 5.4 for a Gaussian shift. The aim is to pass from criteria for optimality which we have in the limit experiment (Theorems 6.6 and 6.8 in the mixed normal case, and Theorems 5.5 and 5.10 for the Gaussian shift) to criteria for local asymptotic optimality for estimator sequences .Tn /n in .En /n at #. The main results are Theorems 7.10, 7.11 and 7.12. 7.8 Definition. For n 1, consider estimators Tn for the unknown parameter # 2 ‚ in En . (a) (Hájek [40]) If LAN holds at #, the sequence .Tn /n is termed regular at # if there is a probability measure F D F .#/ on Rd such that for every h 2 Rd L ın1 .Tn .# C ın h// j Pn,#Cın h ! F (weak convergence in Rd as n ! 1) where the limiting law F does not depend on the value of the local parameter h 2 Rd .
192
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
(b) (Jeganathan [66]) If LAMN holds at #, .Tn /n is termed regular at # if there is eDF e .#/ on Rd d Rd such that for every h 2 Rd some probability measure F e L Jn , ın1 .Tn .# C ın h// j Pn,#Cın h ! F (weakly in Rd d Rd as n ! 1) e does not depend on h 2 Rd . where the limiting law F Thus ‘regular’ is a short expression for ‘locally asymptotically equivariant’ in the LAN case, and for ‘locally asymptotically strongly equivariant’ in the LAMN case. 7.9 Example. Under LAMN or LAN at #, estimator sequences .Tn /n in .En /n linked to the central sequence .Zn /n at # by the coupling condition of Corollary 7.7 .˘/
ın1 .#/.Tn #/ D Zn .#/ C o.Pn,# /n .1/ ,
n!1
are regular at #. This is seen as follows. As in the remarks preceding Corollary 7.7, rescaled estimation errors Un :D ın1 .#/.Tn #/ satisfy a joint convergence condition with U D Z on the right-hand side: L . Sn , Jn , Un j Pn,# / ! L . S , J , Z j P0 / ,
n ! 1,
from which by Theorem 7.6, for every h 2 Rd , L Sn , Jn , .Un h/ j Pn,#Cın h ! L . S , J , .Z h/ j Ph / ,
n ! 1.
When LAN holds at #, F :D L.Z hjPh / does not depend on h 2 Rd , see Proposie :D L.J , Z hjPh / does not depend on h , tion 5.3(b). When LAMN holds at #, F see Definition 6.2 together with Proposition 6.3(iv). This establishes regularity at # of sequences .Tn /n which satisfy condition .˘/, under LAN or LAMN. 7.9’ Exercise. Let E denote the location model ¹F ./ D F0 . / : 2 Rº generated by the doubly exponential distribution F0 .x/ D 12 e jxj dx on .R, B.R//. Write En for the n-fold product experiment. Prove the following, for every reference point # 2 ‚: (a) Recall from Exercise 4.1’’’ that E is L2 -differentiable at D #, p and use Le Cam’s Second Lemma 4.11 to establish LAN at # with local scale ın .#/ D 1= n. (b) The median of the first n observations (which is the maximum likelihood estimator in this model) yields a regular estimator sequence at #. The same holds for the empirical mean, or for arithmetic means between upper and lower empirical ˛-quantiles, 0 < ˛ < 12 fixed. Also Bayesians with ‘uniform over R prior’ as in Exercise 5.4’ R1 L=0 n d , n1 Tn :D R1 1 =0 1 Ln d are regular in the sense of Definition 7.8. Check this using only properties of a location model.
Section 7.2 Asymptotic optimality of estimators in the LAN or LAMN setting
193
7.9” Exercise. We continue Exercise P 7.9’, with notations and assumptions as there. Focus on the empirical mean Tn D n1 niD1 Xi as estimator for the unknown parameter. Check that the sequence .Tn /n , regular by Exercise 7.9’(b), induces the limit law F D N .0, 2/ in Definition 7.8(a). Then give an alternative proof for regularity of .Tn /n based on the LAN property: from joint convergence of ! n n 1 X 1 X p sgn.Xi #/ , p .Xi #/ under Pn,# n i D1 n i D1 specify the two-dimensional normal law which arises as limit distribution for p , n.Tn #/ j Pn,# L ƒh=0 as n ! 1 , n,# finally determine the limit law for p p , n Tn .# C h= n/ j Pn,#Ch=pn L ƒh=0 n,#
as n ! 1
using Le Cam’s Third Lemma 3.6 in the particular form of Proposition 3.6”.
In the LAN case, the following is known as Hájek’s convolution theorem. 7.10 Convolution Theorem. Assume LAMN or LAN at #, and consider a sequence .Tn /n of estimators for the unknown parameter in .En /n which is regular at #. (a) (Hájek [40]) When LAN holds at #, any limit distribution F arising in Definition 7.8(a) can be written as F D N 0, J 1 ? Q for some probability law Q on Rd as in Theorem 5.5. e in Definition 7.8(b) (b) (Jeganathan [66]) When LAMN holds at #, any limit law F admits a representation “ i h e .A/ D F PJ .dj / N .0, j 1 / ? Qj .du/ 1A .j , u/ , A 2 B.Rd d Rd / for some family of probability laws ¹Qj : j 2 DC º as in Theorem 6.6. Proof. We prove (b) first. With notation Un :D ın1 .Tn #/, regularity means that for e, some law F e, n!1 L Jn , Un h j Pn,#Cın h ! F e does not depend on h 2 Rd . Selecting subsequences (weakly in Rd Rd d ) where F e has a representation according to Lemma 7.5, we see that F .C/
e D L .J , U h j Ph / F
not depending on h 2 Rd
where U is a statistic in the (possibly Markov extended) limit experiment E.S , J /. Then U in (+) is a strongly equivariant estimator for the parameter h 2 Rd in the
194
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
mixed normal limit experiment E.S , J /, and the Convolution Theorem 6.6 applies to U and gives the assertion. To prove (a), which corresponds to deterministic J , we use a simplified version of the above, and apply Boll’s Convolution Theorem 5.5. Recall from Definition 5.6 that loss functions on Rd are subconvex if all levels sets are convex and symmetric with respect to the origin in Rd . Recall from Anderson’s Lemma 5.7 that for `./ subconvex, Z Z `.u/ ŒN .0, j 1 / Q0 .du/ `.u/ N .0, j 1 /.du/ for every j 2 DC and any law Q0 on Rd . Anderson’s lemma shows that best concentrated limit distributions in the Convolution Theorem 7.10 are characterised by .7.100 /
Q D 0 under LAN at #, Qj D 0 for P0 -almost all j 2 Rd d under LAMN at #.
Estimator sequences .Tn /n in .En /n which are regular at # and attain in the convolution theorem the limit distribution (7.10’) are called efficient at #. 7.10” Exercise. In the location model generated from the two-sided exponential distribution, continuing Exercises 7.9’ and 7.9”, check from Exercise 7.9” that the sequence of empirical means is not efficient.
In some problems we might find efficient estimators directly, in others not. Under some additional conditions – this will be the topic of Section 7.3 – we can apply a method which allows us to construct efficient estimator sequences. We have the following characterisation. 7.11 Theorem. Consider estimators .Tn /n in .En /n for the unknown parameter. Under LAMN or LAN at #, the following assertions (i) and (ii) are equivalent: (i) the sequence .Tn /n is regular and efficient at # ; (ii) the sequence .Tn /n has the coupling property .˘/ of example 7.9 (or of Corollary 7.7): ın1 .#/.Tn #/ D Zn .#/ C o.Pn,# / .1/ ,
n!1.
Proof. We consider the LAMN case (the proof under LAN is then a simplified version). Consider a sequence .Tn /n which is regular at #, and write Un D ın1 .#/.Tn #/. The implication (ii)H)(i) follows as in Example 7.9 where we have in particular under (ii) L Jn , .Un h/ j Pn,#Cın h ! L .J , .Z h/ j Ph /
Section 7.2 Asymptotic optimality of estimators in the LAN or LAMN setting
195
for every h. But LAMN at # implies according to Definition 6.2 and Proposition 6.3 Z L .J , .Z h/ j Ph / .A/ D L .J , Z j P0 / .A/ D PJ .dj / N .0, j 1 /.du/ 1A .j , u/ for Borel sets A in Rd d Rd . According to (7.10’) above, we have (i). To prove (i)H)(ii), we start from the regularity assumption e, n!1 L Jn , .Un h/ j P#Cın h ! F e does not depend on h 2 Rd . Fix any subsequence for arbitrary h where the limit law F of the natural numbers. Applying Lemma 7.5 along this subsequence, we find a further subsequence .nl /l and a statistic U in the limit experiment E.S , J / (if necessary, after Markov extension) such that L Snl , Jnl , Unl j P#Cınl h ! L .S , J , U j Ph / , l ! 1 for all h, or equivalently by definition of .Zn /n L .Znl h/, Jnl , .Unl h/ j P#Cınl h ! L ..Z h/, J , .U h/ j Ph / , .ı/ l !1 for every h 2 Rd . Again, regularity yields e D L .J , U j P0 / L . J , .U h/ j Ph / D F
does not depend on h 2 Rd
and allows to view U as a strongly equivariant estimator in the limit experiment E1 .#/ D E.S , J / which is mixed normal. The Convolution Theorem 6.6 yields a representation Z Z J e F .A/ D P .dj / ŒN .0, j 1 / ? Qj .du/ 1A .j , u/ , A 2 B.Rd d Rd / . Now we exploit the efficiency assumption for .Tn /n at #: according to (7.10’) above we have Qj D 0 for PJ -almost all j 2 Rd d . Since the Convolution Theorem 6.6 identifies .j , B/ ! Qj .B/ as a regular version U ZjJ Dj .B/, the last line establishes of the conditional distribution P0 U DZ
P0 -almost surely.
Using .ı/ and the continuous mapping theorem, this gives L Unl Znl j Pnl ,# ! L .U Z j P0 / D 0 ,
l ! 1.
But convergence in law to a constant limit is equivalent to stochastic convergence, thus .ıı/
Unl D Znl C o.Pn
l ,#
/ .1/ ,
l ! 1.
196
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
We have proved that every subsequence of the natural numbers contains some further subsequence .nl /l which has the property .ıı/: this gives Un D Zn C o.Pn,# / .1/ ,
n!1
which is (ii) and finishes the proof.
However, there might be interesting estimator sequences .Tn /n for the unknown parameter in .En /n which are not regular as required in the Convolution Theorem 7.10, or we might be unable to prove regularity: when LAMN or LAN holds at a point #, we wish to include these in comparison results. 7.12 Local Asymptotic Minimax Theorem. Assume that LAMN or LAN holds at #, consider arbitrary sequences of estimators .Tn /n for the unknown parameter in .En /n , and arbitrary loss functions `./ which are continuous, bounded and subconvex. (a) A local asymptotic minimax bound lim inf lim inf sup En,#Cın h ` ın1 .Tn .# C ın h// E0 . `.Z// c!1
n!1 jhjc
holds whenever .Tn /n has estimation errors at # which are tight at rate .ın .#//n : L ın1 .#/.Tn #/ j Pn,# , n 1 , is tight in Rd . (b) Sequences .Tn /n satisfying the coupling property .˘/ of Example 7.9 ın1 .#/.Tn #/ D Zn .#/ C o.Pn,# / .1/ , n ! 1 attain the local asymptotic minimax bound at #. One has under this condition lim sup En,#Cın h ` ın1 .Tn .# C ın h// D E0 . `.Z// n!1 jhjc
for arbitrary choice of a constant 0 < c < 1. Proof. We give the proof for the LAMN case (again, the proof under LAN is a simplified version), and write Un D ı 1 .Tn #/ for the rescaled estimation errors of Tn at #. (1) Fix c 2 N. The loss function `./ being non-negative and bounded, lim inf sup En,#Cın h . ` .Un h// n!1 jhjc
is necessarily finite. Select a subsequence of the natural numbers along which ‘liminf’ in the last line can be replaced by ‘lim’, then – using Lemma 7.5 – pass to some further subsequence .nl /l and some statistic U in the limit experiment E1 .#/ D E.S , J / (if
197
Section 7.2 Asymptotic optimality of estimators in the LAN or LAMN setting
necessary, after Markov extension) such that the following holds for arbitrary limit points h and convergent sequences hl ! h: L Snl , Jnl , Unl hl j Pnl ,#Cınl hl ! L . S , J , U h j Ph / , l ! 1 . From this we deduce as in Theorem 7.6 as l ! 1 ˇ ˇ ˇ ˇ .C/ sup ˇEnl ,#Cınl h ` Unl h Eh . ` .U h//ˇ jhjc
!
0
since ` 2 Cb . Recall that h ! Eh . ` .U h// is continuous, see step (2) in the proof of Theorem 7.6. Write c for the uniform law on the closed ball Bc in Rd centred at 0 with radius c, as in Lemma 6.7 and in the proof of Theorem 6.8. Then by uniform convergence according to (+) ! sup Enl ,#Cınl h ` Unl h jhjc Z sup Eh . ` .U h// c .dh/ Eh . ` .U h// jhjc
as l ! 1. The last ‘’ is a trivial bound for an integral with respect to c . (2) Now we exploit Lemma 6.7. In the mixed normal limit experiment E1 .#/ D E.S , J /, arbitrary estimators U for the parameter h 2 Rd can be viewed as being approximately strongly equivariant under a very diffuse prior, i.e. in absence of any a priori information on h except that h should range over large balls centred at the j origin: there are probability laws ¹Qc : j 2 DC , c 2 Nº such that Z Z ! 0 d1 c .dh/ L.U hjPh / , PJ .dj / Œ N .0, j 1 / ? Qcj as c increases to 1, with d1 ., / the total variation distance. As a consequence, `./ being bounded, Z R.U , Bc / D c .dh/ Eh . ` .U h// takes the form R.U , Bc / D
Z PJ .dj /
Z h
i N .0, j 1 / ? Qcj .du/ l.u/ C .c/
where remainder terms .c/ involve an upper bound for `./ and vanish as c increases to 1. In the last line, at every stage c 2 N of the asymptotics, Anderson’s Lemma 5.7 allows for a lower bound Z Z J R.U , Bc / P .dj / N .0, j 1 /.du/ l.u/ C .c/ since `./ is subconvex. According to Definition 6.2 and Proposition 6.3, the law appearing on the right-hand side is L .Z j P0 /, and we arrive at R.U , Bc / E0 .`.Z// C .c/
where
lim .c/ D 0.
c!1
198
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
(3) Combining steps (1) and (2) we have for c 2 N fixed lim inf sup En,#Cın h . ` .Un h// D lim n!1 jhjc
sup Enl ,#Cınl h ` Unl h
l!1 jhjc
D sup Eh . ` .U h// R.U , Bc / jhjc
(this U depends on the choice of the subsequence at the start) where for c tending to 1 lim inf R.U , Bc / E0 .`.Z// . c!1
Both assertions together yield the local asymptotic minimax bound in part (a) of the theorem. (4) For estimator sequences .Tn /n satisfying condition .˘/ in 7.9, (+) above can be strengthened to ˇ ˇ sup ˇ En,#Cın h . ` .Un h// Eh . `.Z h// ˇ ! 0 , n ! 1 jhjc
for fixed c, without any need to select subsequences (Corollary 7.7). In the mixed normal limit experiment, Eh . `.Z h// D E0 .`.Z// does not depend on h. Exploiting this we can replace the conclusion of step (1) above by the stronger assertion sup En,#Cın h . ` .Un h//
jhjc
!
E0 .`.Z//
as n ! 1, for arbitrary fixed value of c. This is part (b) of the theorem.
7.13 Remark. Theorem 7.12 shows in particular that whenever we try to find estimator sequences .Tn /n which attain a local asymptotic minimax bound at #, we may restrict our attention to sequences which have the coupling property .˘/ of Example 7.9 (or of Corollary 7.7). We rephrase this statement according to Theorem 7.11: under LAMN or LAN at #, in order to attain the local asymptotic minimax bound of Theorem 7.12, we may focus – within the class of all possible estimator sequences – on those which are regular and efficient at # in the sense of the convolution theorem. Under LAMN or LAN at # plus some additional conditions, we can construct efficient estimator sequences explicitely by ‘one-step-modification’ starting from any preliminary estimator sequence which converges at rate ın .#/ at #. This will be the topic of Section 7.3. 7.13’ Example. In the set of all probability measures on .R, B.R//, let us consider a one-parametric path E D ¹P : j j < 1º in direction sgn./ through the law P0 :D R.1, C1/, the uniform distribution with support .1, C1/, defined as in Examples 1.3 and 4.3 by g.x/ :D sgn.x/ ,
P .dx/ :D .1 C g.x// P0 .dx/ ,
x 2 .1, C1/ .
Section 7.2 Asymptotic optimality of estimators in the LAN or LAMN setting
199
(1) Write ‚ D .1, C1/. According to Example 4.3, the model E is L2 differentiable at D # with derivative V# .x/ D
g 1 1 .x/ D 1¹x0º , 1C#g 1# 1C#
x 2 .1, C1/ .
at every reference point # 2 ‚ (it is irrelevant how the derivative is defined on Lebesgue null sets). As in Theorem 4.11, Le Cam’s second lemma yields LAN at p # with local scale ın .#/ D 1= n, with score at # 1 X Sn .#/.X1 , : : : , Xn / D p V# .Xi / n n
iD1
at stage n of the asymptotics, and with Fisher information given by 1 J# D E# .V#2 / D , # 2 ‚. 1 #2 b n ./ the empirical distribution function based on the first n observations, (2) With F the score in En Sn .#/ D
1 p b 1 p b n .0// nF n .0/ C n.1 F 1# 1C#
can be written in the form .C/
Sn .#/ D J#
p
n .Tn #/
for all # 2 ‚
when we estimate the unknown parameter # 2 ‚ by b n .0/ . Tn :D 1 2 F From representation (+), for every # 2 ‚, rescaled estimation errors of .Tn /n coincide with the central sequence .Zn .#//n . From Theorems 7.11, 7.10 and 7.12, for every # 2 ‚, the sequence .Tn /n thus attains the best concentrated limit distribution (7.10’) in the convolution theorem, and attains the local asymptotic minimax bound of Theorem 7.12(b). An unsatisfactory point with the last example is that n-fold product models En in Example 7.13’ are in fact classical exponential families b b dPn, D .1 C /nŒ1F n .0/ .1 /nF n .0/ dPn,0 D .1 C /n exp¹ . / TLn º dPn,0 , 2‚ 1 b n .0/. Hence we shall generalise it (considering difin . / :D log. 1C / and TLn :D nF ferent one-parametric paths through a uniform law which are not exponential families) in Example 7.21 below.
200
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
7.3 Le Cam’s One-step Modification of Estimators In this section, we go back to the LAQ setting of Section 7.1 and construct estimator en /n with the coupling property .˘/ of Corollary 7.7 sequences .T .˘/
en #/ D Zn .#/ C o.P / .1/ , ı 1 .#/.T n,#
n!1
starting from any preliminary estimator sequence .Tn /n which converges at rate ın .#/ at #. This ‘one-step modification’ is explicit and requires only few further conditions in addition to LAQ at #; the main result is Theorem 7.19 below. In particular, when LAMN or LAN holds at #, one-step modification yields optimal estimator sequences locally asymptotically at #, via Example 7.9 and Theorem 7.11: the modified sequence will be regular and efficient in the sense of the Convolution Theorem 7.10, and will attain the local asymptotic minimax bound of Theorem 7.12. In the general LAQ case, we only have the following: the modified sequence will work over shrinking neighbourhoods of # as well as the maximum likelihood estimator Z in the limit experiment E1 .#/, according to Corollary 7.7 where L.Z hjPh / may depend on h. We follow Davies [19] for this construction. We formulate the conditions which we need simultaneously for all points # 2 ‚, such that the one-step modifications en /n of .Tn /n will have the desired properties simultaneously at all points # 2 ‚. .T This requires some compatibility between quantities defining LAQ at # and LAQ at # 0 whenever # 0 is close to #. 7.14 Assumptions for Section 7.3. We consider a sequence of experiments En D .n , An , ¹Pn,# : # 2 ‚º/ ,
n1,
‚ Rd open
enjoying the following properties (A)–(D): (A) At every point # 2 ‚ we have LAQ as in Definition 7.1(a), with sequences .ın .#//n ,
.Sn .#//n ,
.Jn .#//n
depending on #, and a quadratic limit experiment E1 .#/ D E .S.#/, J.#// depending on #. (B) Local scale: (i) For every n 1 fixed, ın ./ : ‚ ! .0, 1/ is a measurable mapping which is bounded by 1 (this is no loss of generality: we may always replace ın ./ by ın ./ ^ 1). (ii) For every # 2 ‚ fixed, we have for all 0 < c < 1 ˇ ˇ ˇ ın .# C ın .#/ h/ ˇ 1 ˇˇ ! 0 as n ! 1 . sup ˇˇ ın .#/ jhjc
201
Section 7.3 Le Cam’s One-step Modification of Estimators
(C) Score and observed information: For every # 2 ‚ fixed, in restriction to the set of dyadic numbers S :D ¹˛2k : k 2 N0 , ˛ 2 Zd º in Rd , we have ˇ ®
¯ˇ ˇSn . / Sn .#/ Jn .#/ ı 1 .#/. #/ ˇ sup n 2 S\‚ , j#j c ın .#/ .i/ D oPn,# .1/ , n ! 1 .ii/
sup
2 S\‚ , j#j c ın .#/
jJn . / Jn .#/j D oPn,# .1/ ,
n!1
for arbitrary choice of a constant 0 < c < 1. (D) Preliminary estimator sequence: We have some preliminary estimator sequence .Tn /n for the unknown parameter in .En /n which at all points # 2 ‚ is tight at the rate specified by (A): for every # 2 ‚ : L ın1 .#/ .Tn #/ j Pn,# , n 1 , is tight in Rd . Continuity # ! ın .#/ of local scale in the parameter should not be imposed (a stochastic process example is given in Chapter 8). Similarly, we avoid to impose continuity of the score or of the observed information in the parameter, not even measurability. This is why we consider, on the left-hand sides in (C), only parameter values belonging to a countably dense subset. Under the set of Assumptions 7.14, the first steps are to define a local scale with an estimated parameter, observed information with an estimated parameter, and a score with an estimated parameter. 7.15 Proposition. (a) Define a local scale with estimated parameter by Dn :D ın .Tn / 2 .0, 1 ,
n 1.
Then we have for every # 2 ‚ Dn D 1 C oPn,# .1/ , ın .#/
n!1.
(b) For every n 1, define a N0 -valued random variable .n/ by .n/ D k if and only if Dn 2 .2.kC1/ , 2k , k 2 N0 . Then for every # 2 #, the two sequences 2.n/ ın .#/
and
ın .#/ 2.n/
are tight in R under Pn,# as n ! 1. Proof. According to Assumption 7.14(B.i), Dn is a random variable on .n , An / taking values in .0, 1. Consider the preliminary estimator Tn of Assumption 7.14(D).
202
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
Fix # 2 ‚ and write Un .#/ :D ın1 .#/.Tn #/ for the rescaled estimation errors at #. Combining tightness of L .Un .#/jPn,# / as n ! 1 according to Assumption 7.14(D) with a representation Dn D ın .Tn / D ın . # C ın .#/ Un .#/ / and with Assumption 7.14(B.ii) we obtain (a). From 2..n/C1/ Dn 2.n/ we get 1 ..n/C1/ D , thus (b) follows from (a). n 2 Dn 2 For every k 2 N0 , cover Rd with half-open cubes h d h C.k, ˛/ :D X ˛i 2k , .˛i C 1/2k , iD1
˛ D .˛1 , : : : , ˛d / 2 Zd .
° ± Write Z.k/ :D ˛ 2 Zd : C.k, ˛/ ‚ for the collection of those which are contained in ‚. Fix any default value #0 in S\‚. From the preliminary estimator sequence .Tn /n and from local scale .Dn /n with estimated parameter, we define a discretisation .Gn /n of .Tn /n as follows: for n 1, ² .n/ ˛2 if ˛ 2 Z..n// and Tn 2 C..n/, ˛/ Gn :D else #0 1 X X .˛2k #0 / 1C.k,˛/ .Tn / 1 2.kC1/ ,2k .Dn / . D #0 C kD0
˛2Z.k/
Clearly Gn is an estimator for the unknown parameter # 2 ‚ in En , taking only countably many values. We have to check that passing from .Tn /n to .Gn /n does not modify the tightness rates. 7.16 Proposition. .Gn /n is a sequence of S \ ‚-valued estimators satisfying ® 1 ¯ L ın .#/ .Gn #/ j Pn,# : n 1 is tight in Rd for every # 2 ‚. Proof. Fix # 2 ‚. The above construction specifies ‘good’ events ² ³ [ C..n/, ˛/ 2 An Bn :D Tn 2 ˛2Z..n//
together with a default value Gn D #0 on Bnc . Since ‚ is open and rescaled estimation errors of .Tn /n at # are tight at rate .ın .#//n , since .Dn /n or .e .n/ /n defined in Proposition 7.15(b) are random tightness rates which are equivalent to .ın .#//n under .Pn,# /n , we have by construction lim Pn,# .Bn / D 1
n!1
203
Section 7.3 Le Cam’s One-step Modification of Estimators
together with jGn Tn j
" Pn,# jUL n .#/j > c ˇ ®
¯ˇ ˇ Sn . / Sn .#/ Jn .#/ ı 1 .#/. #/ ˇ > " C Pn,# sup n 2S\‚,j#jcın .#/
and apply part (C.i) of Assumption 7.14.
205
Section 7.3 Le Cam’s One-step Modification of Estimators
Now we resume: Under Assumptions 7.14, with preliminary estimator sequence .Tn /n and its discretisation .Gn /n as in Proposition 7.16, with local scale with J n and score with estimated parameter Dn , information with estimated parameter b estimated parameter b S n as defined in Propositions 7.15, 7.17 and 7.18, with ‘inverse’ b n for b J n as in Proposition 7.17, we have K 7.19 Theorem. (a) With these assumptions and notations, the one-step modification bn b en :D Gn C Dn K Sn , T
n1
en /n for the unknown parameter in .En /n which has yields an estimator sequence .T the property en #/ D Zn .#/ C o.P / .1/ , ın1 .#/.T n,#
n!1
for every # 2 ‚. en /n works over shrinking (b) In the sense of Corollary 7.7, for every # 2 ‚, .T neighbourhoods of # as well as the maximum likelihood estimator in the limit model E1 .#/ D E.S.#/, J.#// . en /n is regular and efficient at # in (c) If LAMN or LAN holds at #, the sequence .T the sense of the Convolution Theorem 7.10, and attains the local asymptotic minimax bound at # according to the Local Asymptotic Minimax Theorem 7.12. Proof. Only (a) requires a proof. Fix # 2 ‚. Then from Propositions 7.15 and 7.17 Dn b b Kn S n ın .#/
D ın1 .#/.Gn #/ C 1 C oPn,# .1/ Jn1 .#/ C oPn,# .1/ b Sn D ın1 .#/.Gn #/ C Jn1 .#/ b S n C oP .1/
en #/ D ın1 .#/.Gn #/ C ın1 .#/.T
n,#
which according to Proposition 7.18 equals ın1 .#/.Gn #/
C Jn1 .#/ Sn .#/ Jn .#/ ın1 .#/.Gn #/ C oPn,# .1/ C oPn,# .1/ where terms ın1 .#/.Gn #/ cancel out and the last line simplifies to Jn1 .#/ Sn .#/ C oPn,# .1/ D Zn .#/ C oPn,# .1/ , This is the assertion.
n ! 1.
206
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
7.4 The Case of i.i.d. Observations p To establish LAN with local scale ın D 1= n in statistical models for i.i.d. observations, we use in most cases Le Cam’s Second Lemma 4.11 and 4.10. We recall this important special case here, and discuss some examples. During this section, E D . , A , P D ¹P : 2 ‚º / is an experiment, ‚ Rd is open, and En is the n-fold product experiment with canonical variable .X1 , : : : , Xn /. We write Pn, D ˝niD1 P for the laws in En and h=0
p .Ch= n/=
ƒn, D ƒn
D log
d Pn,Ch=pn d Pn,
p for 2 ‚ and h 2 ‚,n , the set of all h 2 Rd such that C h= n belongs to ‚. We shall assume 8 < there is an open set ‚0 Rd contained in ‚ such ./ that the following holds: for every # 2 ‚0 , the experiment : E is L2 -differentiable at D # with derivative V# . Recall from Assumptions 4.10, Corollary 4.5 and Definition 4.2 that # 2 ‚0 implies that V# is centred and belongs to L2 .P# /, write J# D E# V# V#> for the Fisher information in the sense of Definition 4.6. Then Le Cam’s Second Lemma 4.11 yields a quadratic expansion of log-likelihood ratios p 1 > .#Ch = n/ ƒn n D h> h J# hn C oPn,# .1/ n Sn .#/ 2 n as n ! 1, for arbitrary bounded sequences .hn /n in Rd , at every reference point # 2 ‚0 , with 1 X Sn .#/ D p V# .Xi / n n
j D1
such that
L . Sn .#/ j Pn,# / ! N . 0 , J# / ,
n ! 1.
In terms of Definitions 5.2 and 7.1 we can rephrase Le Cam’s Second Lemma 4.11 as follows: 7.20 Theorem. For n-fold independent replication of an experiment E satisfying ./ as n ! 1, LAN holds at every # 2 ‚0 with local scale ın .#/ D n1=2 ; the limit experiment E1 .#/ is the Gaussian shift E.J# / . We present some examples of i.i.d. models for which Le Cam’s second lemma establishes LAN at all parameter values. The aim is to specify efficient estimator
207
Section 7.4 The Case of i.i.d. Observations
sequences, either by checking directly the coupling condition in Theorem 7.11, or by one-step modification according to Section 7.3. The first example is very close to Example 7.13’. 7.21 Example. Put ‚ :D . 1 , C 1 / and P0 :D R.I / , the uniform distribution on I :D ., C/. In the set of all probability measures on .R, B.R//, define a oneparametric path E D ¹P : 2 ‚º in direction g through P0 by g.x/ :D sin.x/ , 1 P .dx/ :D .1 C g.x// P0 .dx/ D .1 C sin.x// dx , x 2 I 2 as in Examples 1.3 and 4.3, where the parameterisation is motivated by 1 F .0/ D , 2 ‚ 2 with F ./ the distribution function corresponding to P F .x/ D
1 .Œx C Œcos.x/ C 1/ 2
when
x2I.
(1) According to Example 4.3, the model E is L2 -differentiable at D # with derivative g sin.x/ V# .x/ D .x/ D , x2I 1C#g 1 C # sin.x/ at every reference point # 2 ‚. As resumed in Theorem 7.20, LepCam’s second lemma yields LAN at # for every # 2 ‚, with local scale ın .#/ D 1= n and score 1 X V# .Xi / Sn .#/.X1 , : : : , Xn / D p n n
iD1
at #, and with finite Fisher information in the sense of Definition 4.6 J# D E# .V#2 / < 1 ,
# 2 ‚.
In this model, we prefer to keep the observed information 1X 2 V# .Xi / n n
Jn .#/ D
iD1
in the quadratic expansion of log-likelihood ratios in the local model at #, and write p 1 > .#Ch = n/ .C/ ƒn n D h> h Jn .#/ hn C oPn,# .1/ n Sn .#/ 2 n as n ! 1, for arbitrary bounded sequences .hn /n in Rd .
208
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
(2) We show that the set of Assumptions 7.14 is satisfied. A preliminary estimator 1 b n .0/ Tn :D F 2 p for the unknown parameter # in En is at hand, clearly n-consistent as n ! 1. From step (1), parts (A) and (B) of Assumptions 7.14 are granted; part (C.ii) holds by continuity of Jn .#/ in the parameter. Calculating ®
¯ Sn . / Sn .#/ Jn .#/ ın1 .#/. #/ , with Jn .#/ as in (+) (the quantities arising in part (C.i) of Assumption 7.14), we find n p 1X 2 1 C # sin.Xi / V# .Xi / Sn . / Sn .#/ D n. #/ n 1 C sin.Xi / iD1
from which we deduce easily that Assumption 7.14(C.i) is satisfied. (3) Now the one-step modification according to Theorem 7.19 (there is no need for discretisation since score and observed information depend continuously on the parameter) en :D Tn C p1 ŒJn .Tn /1 Sn .Tn / , n 1 T n en /n which is regular and efficient at every point yields an estimator sequence .T # 2 R (a simplified version of the proof of Theorem 7.19 is sufficient to check this). However, even if the set of assumptions 7.14 can be checked in a broad variety of statistical models, not all of the assumptions listed there are harmless. 7.22 Example. Let E D ., A, ¹P : 2 Rº/ be the location model on .R, B.R// generated from the two-sided exponential distribution P0 .dy/ :D 12 e jyj dy ; this example has already been considered in exercises 4.1000 , 7.90 and 7.900 . We shall see that assumption 7.14 C) i) on the score with estimated parameter does not hold, hence one-step correction according to theorem 7.19 is not applicable. However, it is easy –for this model– to find optimal estimator sequences directly. 1) For every # 2 ‚ :D R, we have L2 -differentiability at D # with derivative V# given by V# .x/ D sgn.x #/ p 000 (cf. exercise 4.1 ). For all #, put ın .#/ D 1= n. Then Le Cam’s second lemma (theorem 7.20 or theorem 4.11) establishes LAN at # with score 1 X V# .yi / Sn .#/.y1 , : : : , yn / D p n n
iD1
at #, and with Fisher information J# D E# .V#2 / 1 not depending on #.
209
Section 7.4 The Case of i.i.d. Observations
2) We consider the set of assumptions 7.14. From 1), parts A) and B) of 7.14 are granted; 7.14 C) part ii) is trivial since Fisher information does not depend on the parameter. Calculate now ®
¯ , Jn .#/ D J# 1 Sn . / Sn .#/ Jn .#/ ın1 .#/. #/ P in view of 7.14 C) part i). This takes a form p1n niD1 Yn,i where
2 1.#,/ .Xi / C 12 . #/ for # < , Yn,i :D for < # . 2 1.,#/ .Xi // 12 .# / Rr Defining a function .r/ :D 0 12 e y dy for r > 0, we have ²
jE# .Yn,i /j D j# j 2.j# j/ ,
Var# .Yn,i / D 4 Œ.1 /.j# j//
where .r/ 12 r as r # 0. We shall see that this structure violates assumption 7.14 C) i), thus one-step correction according to theorem 7.19 breaks down. To see this, consider a constant sequence .hn /n with hn :D 12 for all n, and redefine the Yn,i above using n D # C n1=2 h in place of : first, E# .Yn,i / D o.n1=2 / as n ! 1 and thus 1 X 1 X p Yn,i D p ŒYn,i E# .Yn,i / C oPn,# .1/ , n n n
n
iD1
iD1
second, Var# .Yn,i / D O.n1=2 / as n ! 1 and by choice of the particular sequence .hn /n n n X 1 X Yn,i E# .Yn,i / Yn,i D C oPn,# .1/ p p n Var# .Yn,1 / C : : : C Var# .Yn,i / iD1 iD1
The structure of the Yn,i implies that the Lindeberg condition holds, thus as n ! 1 n ®
¯ 1 X Sn . n / Sn .#/ Jn .#/ ın1 .#/. n #/ D p Yn,i ! N .0, 1/ n iD1
weakly in R. This is incompatible with 7.14 C) i). 3) In the location model generated by the two-sided exponential distribution, there is no need for one-step correction. Consider the median Tn :D median.X1 , : : : , Xn / which in our model is maximum likelihood in En for every n. Our model allows to apply a classical result on asymptotic normality of the median: from [128, p. 578]), p L n .Tn #/ j Pn,# ! N .0, 1/
210
Chapter 7 Local Asymptotics of Type LAN, LAMN, LAQ
as n ! 1. In all location models, the median is equivariant in En for all n. By this circonstance, the last convergence is already regularity in the sense of Hajek: for any h 2 R, p n Tn Œ# C n1=2 h j Pn,#Cn1=2 h ! F :D N .0, 1/ . L Recall from 1) that Fisher information in E equals J# D 1 for all #. Thus F appearing here is the optimal limit distribution F D N .0, J#1 / in Hajek’s convolution theorem. By (7.10’) and Theorem 7.11, the sequence .Tn /n is thus regular and efficient in the sense of the convolution theorem, at every reference point # 2 ‚. Theorem 7.11 establishes that the coupling condition p n .Tn #/ D Zn .#/ C o.Pn,# / .1/ , n ! 1 holds, for every # 2 R. Asymptotically as n ! 1, this links estimation errors of the P median Tn to differences n1 niD1 sgn.Xi #/ between the relative number of observations above and below #. The coupling condition in turn implies that the estimator sequence .Tn /n attains the local asymptotic minimax bound 7.12. 7.22’ Exercise. Consider the location model E D ., A, ¹P : 2 Rº/ generated from P0 D N .0, 1/, show that L2 -differentiability holds with L2 -derivative V# D . #/ at every point # 2 R. Show that the set of Assumptions 7.14 is satisfied, with left-hand sides in Assumption 7.14(C) identical to zero.pConstruct an MDE sequence as in Chapter 2 for the unknown parameter, with tightness rate n, and specify the one-step modification according to Theorem 7.19 which as n ! 1 grants regularity and efficiency at all points # 2 ‚ (again, as in Example 7.21, there is no need for discretisation). Verify that this one-step modification directly replaces the preliminary estimator by the empirical mean, the MLE in this model. 7.22” Exercise. We continue with the location model generated from the two-sided exponential law, under all notations and assumptions of Exercises 7.9’, 7.9” and of Example 7.22. We focus on the Bayesians with ‘uniform over R prior’ R1 L=0 n d Tn D R1 , n1 1 =0 1 Ln d and shall prove that .Tn /n is efficient in the sense of the Convolution Theorem 7.10 and of the Local Asymptotic Minimax Theorem 7.12. (a) Fix any reference point # 2 R and recall (Example 7.22, or Exercises 7.9’(a), or 4.1’’’ and Lemma 4.11) that LAN holds at # in the form n =0 ƒhn,# D hn Sn .#/
1 2 h C o.Pn,# / .1/ , 2 n
1 X Sn .#/ :D p sgn.Xi #/ n i D1 n
as n ! 1, for bounded sequences .hn /n in R. Also, we can write under Pn,# R1 p u Lu=0 du n,# . n Tn # D R1 1 u=0 1 Ln,# du
211
Section 7.4 The Case of i.i.d. Observations
(b) Prove joint convergence Z 1 Z u=0 u Ln,# du , Sn .#/ , 1
as n ! 1 to a limit law
Z
1
1
ue
S, 1
1
uS 12 u2
Lu=0 n,# Z
du
1
du ,
e
uS 12 u2
under Pn,# du
1
where S N .0, 1/ generates the Gaussian limit experiment E.1/ D ¹N .h, 1/ : h 2 Rº. / allows to deal e.g. with integrals (Hint: finite-dimensional convergence of .Lu=0 n,# u2R R u=0 u Ln,# du on compacts K R, as in Lemma 2.5, then give some bound for RK u=0 K c u Ln,# du as n ! 1). (c) Let U denote the Bayesian with ‘uniform over R prior’ in the limit experiment E.1/ R1 1 2 u e uS 2 u du U :D R1 1 uS 12 u2 du 1 e and recall from Exercise 5.4” that in a Gaussian shift experiment, U coincides with the central statistic Z. (d) Use (a), (b) and (c) to prove that the coupling condition p n Tn # D Zn .#/ C o.Pn,# / .1/ as n ! 1 holds. By Theorem 7.11, comparing to Theorem 7.10, (7.10’) and Theorem 7.12, we have efficiency of .Tn /n at #.
Chapter 8
Some Stochastic Process Examples for Local Asymptotics of Type LAN, LAMN and LAQ
Topics for Chapter 8: 8.1 *The Ornstein–Uhlenbeck Process with Unknown Parameter Observed over a Long Time Interval The process and its long-time behaviour (8.1)–8.2 Statistical model, likelihoods and ML estimator 8.3 Local models at # 8.3’ LAN in the ergodic case # < 0 8.4 LAQ in the null recurrent case # D 0 8.5 LAMN in the transient case # > 0 8.6 Remark on non-finite second moments in the LAMN case 8.6’ Sequential observation schemes: LAN at all parameter values 8.7 8.2 *A Null Recurrent Diffusion Model The process and its long-time behaviour (8.8)–8.9 Regular variation, i.i.d. cycles, and norming constants for invariant measure 8.10–(8.11”) Statistical model, likelihoods and ML estimator 8.12 Convergence of martingales together with their angle brackets 8.13 LAMN for local models at every # 2 ‚ 8.14 Remarks on non-finite second moments 8.15 Random norming 8.16 One-step modification is possible 8.17 8.3 *Some Further Remarks LAN or LAMN or LAQ arising in other stochastic process models Example: a limit experiment with essentially different statistical properties Exercises: 8.2’, 8.6”, 8.7’ We discuss in detail some examples for local asymptotics of type LAN, LAMN or LAQ in stochastic process models (hence the asterisk in front of all sections). Martingale convergence and Harris recurrence (positive or null) will play an important role in our arguments and provide limit theorems which establish convergence of local models to a limit model. Background on these topics and some relevant apparatus are collected in an Appendix (Chapter 9) to which we refer frequently, so one might have a look to Chapter 9 first before reading the sections of the present chapter.
213
Section 8.1 Ornstein-Uhlenbeck Model
8.1
Ornstein–Uhlenbeck Process with Unknown Parameter Observed over a Long Time Interval
We start in dimension d D 1 with the well-known example of Ornstein–Uhlenbeck processes depending on an unknown parameter, see [23] and [5, p. 4]). We have a probability space ., A, P / carrying a Brownian motion W , and consider the unique strong solution X D .X t / t0 to the Ornstein–Uhlenbeck SDE dX t D # X t dt C d W t ,
.8.1/
t 0
for some value of the parameter # 2 R and some starting point x 2 R. There is an explicit representation of the solution Z t Xt D e# t x C e # s d Ws , t 0 0
satisfying L
Z e
#t
t
e
# s
´
d Ws
D
0
N
0,
1 2#
e 2# t 1 if # ¤ 0
N .0, t /
if # D 0 ,
and thus an explicit representation of the semigroup .P t ., // t0 of transition probabilities of X P t .x, dy/ :D P .XsCt 2 dy j Xs D x/ ´ N e # t x , 21# e 2# t 1 .dy/ if # ¤ 0 D N . x , t / .dy/ if # D 0 where x, y 2 R and 0 s, t < 1. 8.2 Long-time Behaviour of the Process. Depending on the value of the parameter # 2 ‚, we have three different types of asymptotics for the solution X D .X t / t0 to equation (8.1). (a) Positive recurrence in the case where # < 0: When # < 0, the process X is positive recurrent in the sense of Harris (cf. Definition 9.4) with invariant measure 1 . :D N 0, 2j#j This follows from Proposition 9.12 in the Appendix where the function Z y Z x 2 s.y/ dy with s.y/ D exp 2# v dv D e # y , S.x/ D 0
x, y 2 R
0
corresponding to the coefficients of equation (8.1) in the case where # < 0 is a bijection from R onto R, and determines the invariant measure for the process (8.1)
214
Chapter 8 Some Stochastic Process Examples
1 as s.x/ dx , unique up to constant multiples. Normed to a probability measure this specifies as above. Next, the Ratio Limit Theorem 9.6 yields for functions f 2 L1 ./ Z 1 t lim f . s / ds D .f / almost surely as t ! 1 t!1 t 0
for arbitrary choice of a starting point. We thus have strong laws of large numbers for a large class of additive functionals of X in the case where # < 0. (b) Null recurrence in the case where # D 0: Here X is one-dimensional Brownian motion with starting point x, and thus null recurrent (cf. Definition 9.5’) in the sense of Harris. The invariant measure is , the Lebesgue measure on R. (c) Transience in the case where # > 0: Here trajectories of X tend towards C1 or towards 1 exponentially fast. In particular, any given compact K in R will be left in finite time without return: thus X is transient in the case where # > 0. This is proved as follows. Write F for the (right-continuous) filtration generated by W . For fixed starting point x 2 R, consider the .P , F /-martingale Z t # t Xt D x C e #s d Ws , t 0 . Y D .Y t / t0 , Y t :D e 0
Then E ..Y t x/2 / D
Rt 0
e 2#s ds and thus sup E Y t2 < 1 t0
in the case where # > 0 : hence Y t converges as t ! 1 P -almost surely and in L2 ., A, P /, and the limit is Z 1 1 e #s d Ws N x , . Y1 D x C 2# 0 But almost sure convergence of trajectories t ! Y t .!/ signifies that for P -almost all ! 2 , . /
X t .!/ Y1 .!/ e # t
as
t !1.
Asymptotics . / can be transformed into a strong law of large numbers for some few additive functionals of X , in particular Z t Z t 1 2# t 2 2 Xs2 .!/ ds Y1 .!/ e 2 # s ds Y1 .!/ as t ! 1 e 2# 0 0 for P -almost all ! 2 . The following tool will be used several times in this chapter.
215
Section 8.1 Ornstein-Uhlenbeck Model
8.2’ Lemma. On some space ., A, F , P / with right-continuous filtration F , consider a continuous .P , F /-semi-martingale X admitting a decomposition X D X0 C M C A under P , where A is continuous F -adapted with paths locally of bounded variation starting in A0 D 0, and M a continuous local .P , F )-martingale starting in M0 D 0. Let P 0 denote a second probability measure on ., A, F / such that X is a continuous .P 0 , F /-semi-martingale with representation X D X 0 C M 0 C A0 under P 0 . Let H be an F -adapted process with left-continuous paths, locally bounded loc under both P and P 0 . Then we have the following: under the assumption RP P 0 relative to F , any determination .t , !/ ! I.t , !/ of the stochastic integral Hs dXs R under P is also a determination of the stochastic integral Hs dXs under P 0 . Proof. (1) For the process H , consider some localising sequence .n /n under P , and time horizon N < 1 we have some localising sequence T .n0 /n under P 0 . For T finite 0 0 PN PN , hence events n ¹n N º and n ¹n N º in FN are null sets under both probability measures P and P 0 . This holds for all N , thus .n ^n0 /n is a common localising sequence under both P and P 0 . (2) Localising further, we may assume that H as well as M , hM iP , jjjAjjj and M 0 , hM 0 iP 0 , jjjA0 jjj are bounded (we write jjjAjjj for the total variation process of A, and hM iP for the angle bracket under P ). R (3) Fix a version .t , !/ ! J.t , !/ of the stochastic integral Hs dXs under P , and R a version .t , !/ ! J 0 .t , !/ of the stochastic integral Hs dXs under P 0 . For n 1, define processes .t , !/ ! I .n/ .t , !/ 1 Z t X .n/ It D H kn X t^ kC1 X t^ kn D Hs.n/ dXs , H .n/ :D
kD0 1 X kD0
2n
2
H kn 1 k 2
2n
,
2
0
kC1 2n
to be considered under both P and P 0 . Using [98, Thm. 18.4] with respect to the martingale parts of X , select a subsequence .n` /` from N such that simultaneously as `!1 8 for P -almost all !: the paths I .n` / ., !/ converge uniformly on Œ0, 1/ ˆ ˆ < to the path J., !/ ˆ for P 0 -almost all !: the paths I .n` / ., !/ converge uniformly on Œ0, 1/ ˆ : to the path J 0 ., !/ . Since PN PN0 for N < 1, there is an event AN 2 FN of full measure under both P and P 0 such that ! 2 AN implies J., !/ D J 0 ., !/ on Œ0, N . As a consequence,
216
Chapter 8 Some Stochastic Process Examples
J and J 0 are indistinguishable processes for the probability measure P as well as for the probability measure P 0 . We turn to statistical models defined by observing a trajectory of an Ornstein– Uhlenbeck process continuously over a long time interval, under unknown parameter # 2 R. 8.3 Statistical Model. Consider the Ornstein–Uhlenbeck equation (8.1). Write ‚ :D R. Fix a starting point x D x0 which does not depend on # 2 ‚. Let Q# denote the law of the solution to equation (8.1) under #, on the canonical path space .C , C , G/ or .D, D, G/. Then as in Theorem 6.10 or Example 6.11, all laws Q# are locally equivalent relative to G, and the density process of Q# with respect to Q0 relative to G is ² Z t ³ Z 1 2 t 2 #=0 .0/ s d ms # s ds , t 0 L t D exp # 2 0 0 with m.0/ the Q0 -local martingale part of the canonical process under Q0 . (1) Fix a determination .t , !/ ! Y .t , !/ of the stochastic integral Z Z Z .0/ .0/ D m C m.0/ under Q0 : Y D s ds D s d m.0/ 0 s s d ms as in Example 6.11, using L.Œ 0 , m.0/ j Q0 / D L .B, B/ , there is an explicit representation 1 2 t 20 t , t 0 . Yt D 2 As a consequence, in statistical experiments E t corresponding to observation of the canonical process up to time 0 < t < 1, the likelihood function ² ³ Z t 1 #=0 ‚ 3 # ! L t D exp # Y t # 2 2s ds 2 .0, 1/ 2 0 and the maximum likelihood (ML) estimator b # t :D R t 0
Yt 2s ds
D
1 2
2t 20 t Rt 2 0 s ds
are expressed without reference to any particular probability measure. (2) Simultaneously for all # 2 ‚, .t , !/ ! Y .t , !/ provides a common determination for Z t Z t Z t 2 s ds D s d m.#/ C # ds under Q# s s 0
t0
0
0
t0
by Lemma 8.2’: is a continuousR semi-martingale under both Q0 and Q# , andRany determination .t , !/ ! Y .t , !/ of s ds under Q0 is also a determination of s ds
217
Section 8.1 Ornstein-Uhlenbeck Model
Rt .#/ under Q# . Obviously has Q# -martingale part m t D t 0 # 0 s ds . There are two statistical consequences: (i) for every every 0 < t < 1, ML estimation errors under # 2 ‚ take the form of the ratio of a Q# -martingale divided by its angle bracket under Q# : b #t # D
Rt
.#/
s d ms Rt 2 0 s ds
0
under Q# ;
(ii) for the density process L=# of Q with respect to Q# relative to G, the representation ² ³ Z t 1 =0 #=0 2s ds , t 0 L t =L t D exp . #/ Y t . 2 # 2 / 2 0 coincides with the representation from Theorem 6.10 applied to Q and Q# : ² ³ Z t Z t 1 =# .#/ 2 2 s d ms . #/ s ds . .C/ L t D exp . #/ 2 0 0 (3) The representation (+) allows to reparameterise the model E t with respect to fixed reference points # 2 ‚, and makes quadratic models appear around #. We will call Z t
Z Z t .#/ 2 .#/ s d ms and s ds D dm under Q# 0
t0
0
t0
score martingale at # and information process at #. It is obvious from 8.2 that the law of the observed information depends on #. Hence reparameterising as in (+) with respect to different reference points # or # 0 makes statistically different models appear. The last part of Model 8.3 allows to introduce local models at # 2 R. 8.3’ Local Models at #. (1) Localising around a fixed reference point # 2 R, write Qn for Q restricted to Gn . With suitable choice ın .#/ of local scale to be specified below, consider local models ® n ¯ : h 2 R , n1 E#,n D C , Gn , Q#Cı .#/ h n when n tends to 1. Then from (+), the log-likelihoods in E#,n are h=0
.˘/
ƒ#,n D log Ln.#Cın .#/h/=# Z n Z n 1 2 2 # 2 D h ın .#/ s d ms h ın .#/ s ds , 2 0 0
h2R.
218
Chapter 8 Some Stochastic Process Examples
By .˘/ and in view of Definition 7.1, the problem of choice of local scale at # turns out to be the problem of choice of norming constants for the score martingale: we need weak convergence as n ! 1 of pairs Z n Z n s d m#s , ın2 .#/ 2s ds under Q# .˘˘/ . Sn .#/ , Jn .#/ / :D ın .#/ 0
0
to some pair of limiting random variables . S.#/ , J.#/ / which generate as in Definition 6.1 and Remark 6.1” a quadratic limit experiment. (2) From .˘/, rescaled ML estimation errors at # take the form Rn ın .#/ 0 s d m#s 1 b Rn under Q# ın .#/ # n # D ın2 .#/ 0 2s ds and thus act in the local experiment E#,n as estimators b # n # D Jn1 .#/ Sn .#/ .˘ ˘ ˘/ hn D ın1 .#/ b for the local parameter h 2 R.
Now we show that local asymptotic normality at # holds in the case where # < 0, local asymptotic mixed normality in the case where # > 0, and local asymptotic quadraticity in the case where # D 0. We also specify local scale .ın .#//n . 8.4 LAN in the Positive Recurrent Case. By 8.2(a), in the case where # < 0, the canonical process is positive recurrent under Q# with invariant probability # D 1 /. For the information process in step (3) of Model 8.3 we thus have the N . 0 , 2j#j following strong law of large numbers Z Z 1 1 t 2 s ds D x 2 # .dx/ D D: ƒ Q# -almost surely. lim t!1 t 0 2j#j Correspondingly, if at stage n of the asymptotics we observe a trajectory over the time interval Œ0, n, we take local scale ın .#/ at # such that ın .#/ D n1=2
for all values # < 0 of the parameter.
(1) Rescaling the score martingale in step (3) of Model 8.3 in space and time we put G n :D .G tn / t0 and M#n
D
M#n .t / t0
,
M#n .t /
1=2
Z
:D n
0
tn
s d m.#/ , s
t 0.
219
Section 8.1 Ornstein-Uhlenbeck Model
This yields a family .M#n /n of continuous .Q# , G n /-martingales with angle brackets ./
8t 0 :
Z ˝ n˛ 1 tn 2 1 s ds ! t Dt ƒ M# t D n 0 2j#j Q# -almost surely as n ! 1 .
From Jacod and Shiryaev [64, Cor. VIII.3.24], the martingale convergence theorem – we recall this in Appendix 9.1 below – establishes weak convergence in the Skorohod space D of càdlàg functions Œ0, 1/ ! R to standard Brownian motion with scaling factor ƒ1=2 : M#n ! ƒ1=2 B
(weakly in D under Q# , as n ! 1) .
From this, for the particular time t D 1, ./
M#n .1/ ! ƒ1=2 B1
(weakly in R under Q# , as n ! 1)
since projection mappings D 3 ˛ ! ˛.t / 2 R are continuous at every ˛ 2 C , 1=2 cf. [64, VI.2.1]), and C has full measure under L ƒ B in the space D. (2) Combining ./ and ./ above with .˘/ in 8.3’, log-likelihoods in the local model En,# at # are h=0
ƒ#,n D log L.#Cn n
1=2 h/=#
D h M#n .1/
1 2 ˝ n˛ h M# 1 , 2
h2R
which gives for arbitrary bounded sequences .hn /n 8 h =0 ˆ ƒ n D hn M#n .1/ 12 h2n ƒ C oQ# .1/ as n ! 1 ˆ < #,n . / L M#n .1/ j Q# ! N . 0 , ƒ / as n ! 1 ˆ ˆ : 1 with ƒ D 2j#j . This establishes LAN at parameter values # < 0, cf. Definition 7.1(c), and the limit 1 / in the notation of Definition 5.2. experiment E1 .#/ is the Gaussian shift E. 2j#j (3) Once LAN is established, the assertion .˘˘˘/ in 8.3’ is the coupling condition of Theorem 7.11. From Hájek’s Convolution Theorem 7.10 and the Local Asymptotic Minimax Theorem 7.12 we deduce the following properties for the ML estimator sequence .b # n /n : at all parameter values # < 0, the maximum likelihood estimator sequence is regular and efficient for the unknown parameter, and attains the local asymptotic minimax bound. 8.5 LAQ in the Null Recurrent Case. In the case where # D 0, the canonical process under Q0 is a Brownian motion with starting point x, cf. 8.2(b), and self-similarity properties of Brownian motion turn out to be the key to local asymptotics at # D 0, as
220
Chapter 8 Some Stochastic Process Examples
e for standard Brownian motion, Ito formula pointed out by [23] or [38]. Writing B or B and scaling properties give Z t Z t Z t Z t Z t 1 2 2 2 Bs dBs , jBs j ds , Bs ds D jBs j ds , Bs ds B t , 2 t 0 0 0 0 0 Z t ˇ 2 Z t ˇp p 1 p e 2 d ˇ es ˇ e D t B st ds Œ t B 1 t , ˇ t B t ˇ ds , 2 0 0 Z 1 Z 1 Z 1 d D t Bs dBs , t 3=2 jBs j ds , t 2 Bs2 ds . 0
0
0
From step (3) of Model 8.3, the score martingale at # D 0 is Z Z .0/ D m C .s 0 / d m.0/ under Q0 s d m.0/ 0 s s where L.Œ 0 , m.0/ j Q0 / D L .B, B/ . Consequently, observing at stage n of the asymptotics the canonical process up to time n, the right choice of local scale is ın .#/ :D
1 n
at parameter value # D 0 ,
and rescaling of the score martingale works as follows: with G n :D .G tn / t0 , Z 1 tn s d m.0/ t 0 M n D M n .t / t0 , M n .t / :D s , n 0 where in the case where # D 0 we suppress subscript # D 0 from our notation. Note the influence of the starting point x for equation (8.1) on the form of the score martingale. (A) Consider first starting point x D 0 in equation (8.1) as a special case. (1) Applying the above scaling properties, we have exact equality in law Z Z d n n 2 Bs dBs , Bs ds for all n 1 . ./ L M , hM i j Q0 D L According to .˘/ in 8.3’, the log-likelihoods in the local model En,0 at # D 0 are . /
h=0
ƒ0,n D log Ln.0Cn
1 h/=0
D h M n .1/
1 2 h hM n i1 , 2
h2R
n for every n 1. Thus, as a statistical experiment, local experiments En,0 D ¹Q0C : 1 nh 1 h 2 Rº at # D 0 coincide for all n 1 with the experiment E1 D ¹Qh : h 2 Rº where an Ornstein–Uhlenbeck trajectory (having initial point x D 0) is observed over the time interval Œ0, 1. Thus self-similarity creates a particular LAQ situation for which approximating local experiments and limit experiment coincide. We have seen in Example 6.11 that the limit experiment E1 is not mixed normal.
Section 8.1 Ornstein-Uhlenbeck Model
221
(2) We look to ML estimation in the special case of starting value x D 0 for equation (8.1). Combining the representation of rescaled ML estimation errors .˘˘˘/ in 8.3’ with ./ in step (1), the above scaling properties allow for equality in law which does not depend on n 1 1 #n 0 C h j QŒ0C n1 h D L b .ı/ L n b # 1 h j Qh n at every value h 2 R. To see this, write for functions f 2 Cb .R/ 1 b EQŒ0C 1 h f n # n 0 C h n n D EQŒ0C 1 h f b hn h n Œ0C 1 h=0 f b hn h D EQ0 Ln n ² ³ n 1 2 M .1/ n n h D EQ0 exp h M .1/ h hM i1 f 2 hM n i1 which by ./ above is free of n 1. We can rephrase .ı/ as follows: observing over longer time intervals, we do not gain anything except scaling factors. (B) Now we consider the general case of starting values 0 x ¤ 0 for equation (8.1). In this case, by the above decomposition of the score martingale, M n under Q0 is of type Z Z x 1 tn 1 tn .x C Bs / dBs D B tn C Bs dBs , t 0 n 0 n n 0 for standard Brownian motion B. Decomposing M n in this sense, we can control ˇ ˇ Z sn ˇ n ˇ 1 .0/ . 0 /v d mv ˇˇ under Q0 sup ˇˇM .s/ 0st n 0
in the same way as
x p sup n 0st
jBs j , for arbitrary n and t , and
ˇ ˇ Z sn ˇ n ˇ 1 2 . 0 /v dv ˇˇ under Q0 sup ˇˇhM i .s/ 2 0st n 0 R 2 t in the same way as xnt C 2pjxj jBs j ds , where we use the scaling property stated n 0 at the start. (1) In the general situation (B), the previous equality in law ./ of score and information is replaced by weak convergence of the pair (score martingale, information process) under the parameter value # D 0 Z Z Bs dBs , Bs2 ds L M n , hM n i j Q0 ! L ./ weakly in D.R2 / as n ! 1 .
222
Chapter 8 Some Stochastic Process Examples
Let us write e E 1 for the limit experiment which appears in (A.1), in order to avoid confusion about the different starting values. In our case (B) of starting value x ¤ 0 for equation (8.1), we combine . / for the likelihood ratios in the local models En,0 at # D 0 (which is .˘/ in 8.3’) Œ0C n1 h=0
h=0
ƒ0,n D log Ln
D h M n .1/
1 2 h hM n i1 , 2
h2R
with weak convergence ./ to establish LAQ at # D 0 with limit experiment e E 1. e h for the laws in e E 1. (2) Given LAQ at # D 0 with limit experiment e E 1 , write Q Then Corollary 7.7 shows for ML estimation .ıı/ ˇ ˇ ˇ ˇ b # n Œ0 C n1 h/ Ee ` # h sup ˇ EQŒ0C 1 h ` n .b ˇ ! 0 1 Qh jhjC
n
for arbitrary loss functions `./ which are continuous and bounded and for arbitrary constants C < 1. In case (B) of starting value x ¤ 0 for equation (8.1), .ıı/ replaces equality of laws .ı/ which holds in case (A) above. Recall that .ıı/ merely states the following: the ML estimator b hn for the local parameter h in the local model En,# at # D 0 works approximately as well as the ML estimator in the limit model e E 1 . In particular, .ıı/ is not an optimality criterion. 8.6 LAMN in the Transient Case. By 8.2(c), in the case where # > 0, the canonical process under Q# is transient, and we have the following asymptotics for the information process when # > 0: Z t 1 2 e 2 # t 2s ds ! Y1 .#/ Q# -almost surely as t ! 1 . 2# 0 Here Y1 .#/ is the G1 -measurable limit variable 1 , Y1 D Y1 .#/ N x, 2# for the martingale Y of 8.2(c), with x the starting point for equation (8.1). Observing at stage n of the asymptotics a trajectory of up to time n, we have to chose local scale as ın .#/ D e # n at # > 0 . Thus in the transient case, local scale depends on the value of the parameter. SpaceC time scaling of the score martingale is done as follows. With notation f D f _ 0 n for the positive part of f , we put G :D G.nClog.t//C t0 and M#n
D
M#n .t / t0
,
M#n .t /
:D e
# n
Z 0
.nClog.t//C
s d m.#/ , s
t 0
223
Section 8.1 Ornstein-Uhlenbeck Model
for n 1. Then angle brackets of M#n under Q# satisfy as n ! 1 ˝ n˛ M# t D e 2 # n
Z
.nClog.t//C
2s ds
0
!
2 Y1 .#/
1 2# t 2#
Q# -almost surely
for every 0 < t < 1 fixed. Writing 2 .#/ '# .t / :D Y1
1 2# t , 2#
t 0,
we have a collection of G1 -measurable random variables with the properties ´ '# .0/ 0, t ! '# .t / is continuous and strictly increasing, lim '# .t / D C1; t!1 ˛ ˝ for every 0 < t < 1: M#n t ! '# .t / Q# -almost surely as n ! 1. All M#n being continuous .Q# , G n /-martingales, a martingale convergence theorem (from Jacod and Shiryaev [64, VIII.5.7 and VIII.5.42]) which we recall in Theorem 9.2 in the Appendix (the nesting condition there is satisfied for our choice of the Gn ) yields M#n ! B ı '#
(weak convergence in D, under Q# , as n ! 1)
where standard Brownian motion B is independent from '# . Again by continuity of all M#n , we also have weak convergence of pairs under Q# as n ! 1 n ˝ n ˛ ! . B ı '# , '# / M# , M# (cf. [64, VI.6.1]); we recall this in Theorem 9.3 in the Appendix) in the Skorohood space D.R2 / of càdlàg functions Œ0, 1/ ! R2 . By continuity of projection mappings on a subset of D.R2 / of full measure, we end up with weak convergence n ˛ ˝ M# .1/ , M#n 1 ! . B.'# .1// , '# .1/ / ./ (weakly in R2 , under Q# , as n ! 1) . (1) Log-likelihood ratios in local models E#,n at # are h=0
ƒ0,n D log LnŒ#Ce
# n h=#
D h M#n .1/
1 2 ˝ n˛ h M# 1 , 2
h2R
according to .˘/ in 8.3’. Combining this with weak convergence ./, we have established LAMN at parameter values # > 0; the limit model E1 .#/ is Brownian motion with unknown drift observed up to the independent random time '# .1/ as above. This type of limit experiment was studied in 6.16. (2) According to step (2) in 8.3’, rescaled ML estimator errors at # M n .1/ # n # D ˝ # n ˛ D Zn .#/ under Q# , for all n 1 e# n b M# 1
224
Chapter 8 Some Stochastic Process Examples
coincide with the central sequence at # and converge to the limit law Z B.'# .1// 1 . /
L .'# .1// .du/ N 0 , . '# .1// u By local asymptotic mixed normality according to step (1), Theorems 7.11, 7.10 and 7.12 apply and show the following: at all parameter values # > 0, the ML estimator sequence .b # n /n is regular and efficient in the sense of Jeganathan’s version of the Convolution Theorem 7.10, and attains the local asymptotic minimax bound of Theorem 7.12. 8.6’ Remark. Under assumptions and notations of 8.6, we comment on the limit law arising in . /. With x the starting point for equation (8.1), recall from 8.2(c) and the start of 8.6 1 1 2 .#/ , '# .1/ D Y1 Y1 .#/ N x , 2# 2# which gives (use e.g. [4, Sect. VII.1]) 8 < 12 , 2# 2 in the case where x D 0 p L .'# .1// D : 1 , x 2# 3 , 2# 2 in the case where x ¤ 0 2 where notation .a, , p/ is used for decentral Gamma laws (a > 0, > 0, p > 0) .a, , p/ D
1 X e k .aCm, p/ . kŠ mD0
In the case where D 0 this reduces to the usual .a, p/. It is easy to see that variance mixtures of type Z 1 where 0 < a < 1 .a, p/.du/ N 0 , u do not admit finite second moments. .a, p/ is the first contribution (summand m D 0) to .a, , p/. Thus the limit law . / which is best concentrated in the sense of Jeganathan’s Convolution Theorem 7.10 and in the sense of the Local Asymptotic Minimax Theorem 7.12 in the transient case # > 0 Z Z 1 1 1 2 D L Y1 .#/ .du/ N 0 , L .'# .1// .du/ N 0 , u 2# u is of infinite variance, for all choices of a starting point x 2 R for SDE (8.1). Recall in this context Remark 6.6”: optimality criteria in mixed normal models are conditional on the observed information, never in terms of moments of the laws of rescaled estimation errors.
225
Section 8.1 Ornstein-Uhlenbeck Model
Let us resume the tableau 8.4, 8.5 and 8.6 for convergence of local models when we observe an Ornstein–Uhlenbeck trajectory under unknown parameter over a long time interval: optimality results are available in restriction to submodels where the process either is positive recurrent or is transient; except in the positive recurrent case, the rates of convergence and the limit experiments are different at different values of the unknown parameter. For practical purposes, one might wish to have limit distributions of homogeneous and easily tractable structure which hold over the full range of parameter values. Depending on the statistical model, one may try either random norming of estimation errors or sequential observation schemes. 8.6” Exercise (Random norming). In the Ornstein–Uhlenbeck model, the information process Rt t ! 0 2s ds is observable, its definition does not involve the unknown parameter. Thus we may consider random norming for ML estimation errors using the observed information. (a) Consider first the positive recurrent cases # < 0 in 8.4 and the transient cases # > 0 in 8.6. Using the structure of the limit laws for the pairs ˝ ˛ n under Q# as n ! 1 M# .1/ , M#n 1 from ./+./ in 8.4 and ./ in 8.6 combined with the representation
b #n # D
Rn
s d m.#/ s Rn , 2 ds 0 s
0
of ML estimation errors we can write under Q# sZ n M n .1/ 2s ds b # n # D ˝ # ˛ 1 0 M#n 1 2
n2N,
under Q#
! N .0, 1/ ,
n!1
to get a unified result which covers the cases # ¤ 0. (b) However, random norming as in the last line is not helpful when # D 0. This has been noted by Feigin [23]. Using the scaling properties in 8.5, we find in the case where # D 0 that the weak limit of the laws sZ ! n L 2s ds b # n 0 j Q0 as n ! 1 0 1
.B 2 1/
is the law L. .R 12 B 21ds/1=2 /. Feigin notes simply that this law lacks symmetry around 0 0
P
s
1 .B 2 1/ R21 1 . 0 Bs2 ds/1=2
!
1 0 D P B12 1 D P .B1 2 Œ1, 1/ 0.68 ¤ 2
and thus cannot be a normal law. Hence there is no unified result extending (a) to cover all cases # 2 R.
226
Chapter 8 Some Stochastic Process Examples
In our model, the information process can be calculated from the observation without knowing the unknown parameter, cf. step (3) in Model 8.3. This allows to define a time change and thus a sequential observation scheme by stopping when the observed information hits a prescribed level. 8.7 LAN at all Points # 2 ‚ by Transformation R 2 of Time. For the Ornstein– Uhlenbeck Model 8.3 with information process s ds, define for every n 2 N a random time change ² ³ Z t 2 .n, u/ :D inf t > 0 : s ds > u n , u 0 . 0
Define score martingale and information process scaled and time-changed by u ! .n, u/: Z .n,u/ 1 n e n :D G.n,u/ f s d m.#/ G , M # .u/ :D p s , u0 n 0 Z ˝ n˛ 1 .n,u/ 2 f M# u D s ds D u . n 0 Then P. Lévy’s characterisation n theorem (cf.n [61, p. 74]) shows for all n 2 N and for n e , Q# /-standard Brownian motion. f f .u/ is a .G all # 2 ‚ that M # D M # u0 (1) If at stage n of the asymptotics we observe a trajectory of the canonical process up to p the random time .n, 1/, and write e E n,# for the local model at # with local scale 1= n: ® ¯ e E n,# D C , G.n,1/ , Q#Cn1=2 h j G.n,1/ : h 2 R , then log-likelihoods in the local model at # are ˝ n˛ 1=2 1 2 fn .1/ 1 h2 M fn f eh=0 :D log L.#Cn h/=# D h M h , h2R ƒ # # 1 D h M # .1/ #,n .n,1/ 2 2 where we have for all n 2 N and all # 2 ‚ n f .1/ j Q# D N . 0 , 1 / . L M # According to Definition 7.1(c), this is LAN at all parameter values # 2 R, and even more than that: not only for all values of the parameter # 2 ‚ are limit experiments e E 1 .#/ given by the same Gaussian shift E.1/, in the notation of Definition 5.2, but also all local experiments e E n,# at all levels n 1 of the asymptotics coincide with E.1/. Writing 2.n,1/ 20 .n, 1/ 2.n,1/ 20 .n, 1/ e b b D # n D # .n,1/ D R .n,1/ 2n 2 0 2s ds
Section 8.2
227
A Null Recurrent Diffusion Model
for the maximum likelihood estimator when we observe up to the stopping time .n, 1/, cf. step (1) of Model 8.3, we have at all parameter values # 2 ‚ the following properties: the ML sequence is regular and efficient for the unknown parameter at # (Theorems 7.11 and 7.10), and attains the local asymptotic minimax bound at # (Remark 7.13). This sequential observation scheme allows for a unified treatment over the whole range of parameter values # 2 R (in fact, everything thus reduces to an elementary normal distribution model ¹N .h, 1/ : h 2 Rº). 8.7’ Exercise. We compare the observation schemes used in 8.7 and in 8.6 in the transient case # > 0. Consider only the starting point x D 0 for equation (8.1). Define 1
# .u/ :D inf ¹ t > 0 : '# .t / > u º D . u Œ'# .1/1 / 2#
./
for 0 < u < 1, with # .0/ 0, and recall from 8.6 that 1 1 2 '# .1/ D Y1 .#/
, 2# 2 . 2# 2 (a) Deduce from ./ that for all parameter values # > 0 E .j log # .u/j/ < 1 . (b) Consider in the case where # > 0 the time change u ! .m, u/ in 8.7 and prove that for 0 < u < 1 fixed, 1 log.m/ under Q# .m, u/ 2# converges weakly as m ! 1 to log.# .u// . ˝ n˛ Hint: Observe that the definition of M# t in 8.6 and .n, u/ in 8.7 allows us to replace n 2 N by arbitrary 2 .0, 1/. Using this, we can write ˝ ˛ P . # .u/ > t / D P . '# .t / < u / D lim Q# M#n t < u n!1 D lim Q# .e 2# n , u/ > Œn C log.t /C n!1
for fixed values of u and t in .0, 1/, where the last limit is equal to 1 1 log.m/ C log.t / D lim Q# e .m,u/ 2# log.m/ > t . lim Q# .m, u/ > m!1 m!1 2#
8.2
A Null Recurrent Diffusion Model
We discuss a statistical model where the diffusion process under observation is recurrent null for all values of the parameter. Our presentation follows Höpfner and Kutoyants [54] and makes use of the limit theorems in Höpfner and Löcherbach [53] and of a result by Khasminskii [73].
228
Chapter 8 Some Stochastic Process Examples
In this section, we have a probability space ., A, P / carrying a Brownian motion W ; for some constant > 0 we consider the unique strong solution to equation Xt .8.8/ dX t D # dt C d W t , X0 x 1 C X t2 depending on a parameter # which ranges over the parameter space 1 1 .8.80 / ‚ :D 2 , 2 . 2 2 We shall refer frequently to the Appendix (Chapter 9). 8.9 Long Time Behaviour of the Process. Under # 2 ‚ where the parameter space ‚ is defined by equation (8.8’), the process X in equation (8.8) is recurrent null in the sense of Harris with invariant measure 2# 1 p 2 m.dx/ D 2 1 C y 2 dx , x 2 R . .8.90 / v We prove this as follows. Fix # 2 ‚, write b.v/ D # 1Cv 2 for the drift coefficient in equation (8.8), and consider the mapping S : R ! R defined by Z y Z x 2b S.x/ :D s.y/ dy where s.y/ :D exp .v/ dv , x, y 2 R 2 0 0 Ry as in Proposition 9.12 in the Appendix; we have 0 2b2 .v/ dv D #2 ln.1 C y 2 / and thus p # 2#2 s.y/ D 1 C y 2 2 D 1 C y 2 ,
S.x/ sign.x/
1 1
2#
2# 2
jxj1 2
as x ! ˙1 .
Since j 2# 2 j < 1 by equation (8.8’), the function S./ is a bijection onto R: thus Proposition 9.12 shows that X under # is Harris with invariant measure 1 1 dx on .R, B.R// m.dx/ D 2 s.x/ which gives equation (8.9’). We have null recurrence since m has infinite total mass. We remark that ‚ defined by equation (8.8’) is the maximal open interval in R such that null recurrence holds for all parameter values. For the next result, recall from Remark 6.17 the definition of a Mittag–Leffler process V .˛/ of index 0 < ˛ < 1: to the stable increasing process S .˛/ of index 0 < ˛ < 1, the process with independent and stationary increments having Laplace transforms .˛/ .˛/ ˛ E e .S t2 S t1 / D e .t2 t1 / , 0 , 0 t1 < t2 < 1
Section 8.2
229
A Null Recurrent Diffusion Model .˛/
and starting from S0 0, we associate the process of level crossing times .˛/ V .˛/ . Paths of V .˛/ are continuous and non-decreasing, with V0 D 0 and .˛/ lim t!1 V t D 1. Part (a) of the next result is a consequence of two results due to Khasminskii [73] which we recall in Proposition 9.14 of the Appendix. Based on (a), part (b) then is a well-known and classical statement, see Feller [24, p. 448] or Bingham, Goldie and Teugels [12, p. 349], on domains of attraction of one-sided stable laws. For regularly varying functions see [12]. X being a one-dimensional process with continuous trajectories, there are many possibilities to define a sequence of renewal times .Rn /n1 which decompose the trajectory of X into i.i.d. excursions .X 1ŒŒRn ,RnC1 /n1 away from 0; a particular choice is considered below. 8.10 Regular Variation. Fix # 2 ‚, consider the function S./ and the measure m.dx/ of 8.9 (both depending on #), and define 1 2# 0 ˛ D ˛.#/ D 1 2 2 .0, 1/ . .8.10 / 2 (a) The sequence of renewal times .Rn /n1 defined by Rn :D inf¹t > Sn : X t < 0º ,
Sn :D inf¹t > Rn1 : X t > S 1 .1/º ,
n 1,
R0 0
has the following properties (i) and (ii) under #: 1 ˛ 4 .i/ P .RnC1 Rn > t /
t ˛ 2 2 .˛/ Z .ii/
!
RnC1
f .Xs / ds
E Rn
as t ! 1 ,
D 2 m.f / ,
f 2 L1 .m/ .
(b) For any norming function a./ with the property .8.1000 /
a.t /
1 .1 ˛/ P .R2 R1 > t /
as
t !1
and for any function f 2 L1 .m/ we have weak convergence Z n 1 f .Xs / ds ! 2 m.f / V .˛/ a.n/ 0 .8.10000 / (weakly in D, under #, as n ! 1) where V .˛/ is the Mittag–Leffler process of index ˛, 0 < ˛ < 1 . In particular, the norming function in equation (8.10”) varies regularly at 1 with index ˛ D ˛.#/, and all objects above depend on # 2 ‚.
230
Chapter 8 Some Stochastic Process Examples
Proof. (1) From Khasminskii’s results [73] which we recall in Proposition 9.14 of the Appendix, we deduce the assertions (a.i) and (a.ii): Fix # 2 ‚. We have the functions S./, s./ defined in 8.9 which depend on #. S./ is a bijection onto R. According to step (3) in the proof of Proposition 9.12, the e :D S.X / process X e t / d Wt et D e .X dX
where e D .s / ı S 1
is a diffusion without drift, is Harris recurrent, and has invariant measure 1 1 1 de xD 2 de x on .R, B.R// . m e.de x/ D 2 1 e .e x/ Œs ı S 2 .e x/ Write for short D 2# 2 : by choice of the parameter space ‚ in equation (8.8’), belongs to .1, 1/. From 8.9 we have the following asymptotics: s.y/ jyj , S.x/ sign.x/
y ! ˙1
1 jxj1 , 1
x ! ˙1
1
S 1 .z/ sign.z/ . .1 /jzj / 1 , Œs ı S 1 .v/ . .1 /jvj /
1
,
z ! ˙1 v ! ˙1 .
By the properties of S./, the sequence .Rn /n1 defined in (a) can be written in the form .C/
e t < 0º , Sn :D inf¹t > Rn1 : X e t > 1º , Rn :D inf¹t > Sn : X n 1,
R0 0
e D S.X /. As a consequence of (+), the stopping times .Rn /n1 with respect to X e into i.i.d. excursions induce the particular decomposition of trajectories of X e 1ŒŒR ,R /n1 away from 0 to which Proposition 9.14 applies. With respect to .X n nC1 e we have .Rn /n1 and X 2 2 2 2 2 2 1 2 1 1 .v/ 2 . .1 /jvj / D .1 / D 2 jvj 1 , 2 1 2 2 e .v/ Œs ı S v ! ˙1 which shows that the condition in Proposition 9.14(b) is satisfied: Z 1 x ˇ 2 jvj lim dv D A˙ , x!˙1 x 0 e 2 .v/ ˇ :D
2 > 1 , 1
AC D A :D
2 2 .1 / 1 . 2
Section 8.2
231
A Null Recurrent Diffusion Model
Recall that , ˇ and A˙ depend on #. With ˛ D ˛.#/ defined as in equation (8.10’) 1 1 2# 1 D .1 / D 1 2 2 .0, 1/ ˛ :D ˇC2 2 2 we have ˇ D
and rewrite the constants A˙ in terms of ˛ D ˛.#/ as
12˛ ˛
12˛ 2 2 .1 /ˇ D 2 .2˛/ ˛ . 2 From this, Proposition 9.14(b) gives us
A˙ D
P .RnC1 Rn > t /
˛ 2˛ .ŒAC ˛ C ŒA ˛ / ˛ 1 ˛ 4 t D t ˛ .1 C ˛/ 2 2 .˛/
as t ! 1. This proves part (a.i) of the assertion. Next, we apply Proposition 9.14(a) e the i.i.d. excursions defined by .Rn /n1 correspond to the norming to the process X: constant Z RnC1 e.X e/ , f e 0 measurable e s / ds D 2 m f e .f E Rn
e :D f ı S 1 and apply formula .ıı/ in the for the invariant measure m e. If we write f proof of Proposition 9.12 m.f / D m e f ı S 1 we obtain
Z
RnC1
f .Xs / ds
E Rn
D 2 m.f / ,
f 0 measurable
which proves part (a.ii) of the assertion. Thus, 8.10(a) is now proved. (2) We prove 8.10(b). Under # 2 ‚, for ˛ given by equation (8.10’) and for the renewal times .Rn /n1 considered in (a), select a strictly increasing continuous norming function a./ such that a.t /
1 .1 ˛/ P .R2 R1 > t /
as t ! 1 .
In particular, a./ varies regularly at 1 with index ˛. Fix an asymptotic inverse b./ to a./ which is strictly increasing and continuous. All this depends on #, and (a.i) above implies 1 1 4 .1 ˛/ ˛ 1 t ˛ as t ! 1 . .8.11/ b.t /
2 2 .˛/ From a.b.n// n as n ! 1, a well-known result on convergence of sums of i.i.d. variables to one-sided stable laws (cf. [24, p. 448], [12, p. 349]) gives weak convergence Rn .˛/ (weakly in R, under #, as n ! 1) ! S1 b.n/
232
Chapter 8 Some Stochastic Process Examples .˛/
where S1 follows the one-sided stable law on .0, 1/ with Laplace transform ! exp.˛ /, 0. Regular variation of b./ at 1 with index ˛1 implies that b.t n/
1 t ˛ b.n/ as n ! 1. Thus, scaling properties and independence of increments in the one-sided stable process S ˛ of index 0 < ˛ < 1 show that the last convergence extends to finite-dimensional convergence RŒn b.n/
.8.110 /
f .d .
S .˛/
!
as n ! 1 .
Associate a counting process N D .N t / t0 ,
® ¯ N t :D max j 2 N : Rj t ,
to the renewal times .Rn /n and write for arbitrary 0 < t1 < < tm < 1, Ai 2 B.R/ and xi > 0 N ti b.n/ RŒxi n Œxi n P < ,1i m DP > ti , 1 i m . n n b.n/ Then .8.110 / gives finite dimensional convergence under # as n ! 1 Nb.n/ n
f .d .
!
V .˛/
to the process inverse V .˛/ of S .˛/ . On the left-hand side of the last convergence, we may replace n by a.n/ and make use of b.a.n// n as n ! 1. Then the last convergence takes the form Nn a.n/
.8.1100 /
f .d .
!
V .˛/
as n ! 1 .
By (a.ii) we have for functions f 0 belonging to L1 .m/ Z RnC1 f .Xs / ds D 2 m.f / . E Rn
The counting process N increasing by 1 on Rn , RnC1 , we have almost sure convergence under # Z Z t 1 1 Rn f .Xs / ds D lim f .Xs / ds D 2 m.f / lim n!1 n 0 t!1 N t 0 from the classical strong law of large numbers with respect to the i.i.d. excursions X 1ŒŒRj ,Rj C1 , j 1. Together with .8.1100 / we arrive at 1 a.n/
Z 0
n
f .d .
f .Xs / ds ! 2 m.f / V .˛/ as n ! 1
Section 8.2
233
A Null Recurrent Diffusion Model
for functions f 0 belonging to L1 .m/. All processes in the last convergence are increasing processes, and the limit process is continuous. In this case, according to Jacod and Shiryaev (1987, VI.3.37), finite dimensional convergence and weak convergence in D are equivalent, and part (b) of 8.10 is proved. We turn to statistical models defined by observation of a trajectory of the process (8.8) under unknown # 2 ‚ over a long time interval, with parameter space ‚ defined by equation (8.8’). 8.12 Statistical Model. For ‚ given by equation (8.8’) and for some starting point x0 2 R which does not depend on # 2 ‚, let Q# denote the law of the solution to (8.8) under #, on the canonical path space .C , C , G/ or .D, D, G/. Applying Theorem 6.10, all laws Q# are locally equivalent relative to G, and the density process of Q# with respect to Q0 relative to G is ² Z t ³ Z 1 2 t 2 #=0 2 L t D exp # .s / d m.0/ . / ds , t 0 # s s 2 0 0 with m.0/ the Q0 -local martingale part of the canonical process under Q0 , and .8.120 /
.x/ :D
1 x , 2 1 C x2
x2R.
(1) Fix a determination .t , !/ ! Y .t , !/ of the stochastic integral Z Z under Q0 .s / ds D .s / d m.0/ s where L. 0 jQ0 / D L.m.0/ jQ0 / D L.B/. Thus, in statistical experiments E t corresponding to observation of the canonical process up to time 0 < t < 1, both likelihood function ² ³ Z t 1 #=0 2 .s / 2 ds 2 .0, 1/ ‚ 3 # ! L t D exp # Y t # 2 2 0 and maximum likelihood (ML) estimator b # t :D R t
Yt
2 2 0 .s / ds
are expressed without reference to a particular probability measure. (2) Simultaneously for all # 2 ‚, .t , !/ ! Y .t , !/ provides a common determination for Z t Z t Z t 2 2 .s / ds D .s / d m.#/ C # . / ds under Q# : s s 0
t0
0
0
t0
234
Chapter 8 Some Stochastic Process Examples
R any determination .t , !/ ! Y .t , !/ of the stochastic integral .s / ds under Q0 R is also a determination of the stochastic integral .s / ds under Q# , by Lemma Rt .#/ 8.2’, and the Q# -martingale part of equals m t D t 0 # 0 .s / 2 ds . This allows us to write ML estimation errors under # 2 ‚ in the form Rt .#/ .s / d ms b E # t # D DR0 under Q# .#/ .s / d ms t
and allows us to write density processes L=# of Q with respect to Q# relative to G either as ² ³ Z t 1 2 =0 #=0 2 2 2 .s / ds , t 0 L t =L t D exp . #/ Y t . # / 2 0 or equivalently – coinciding with Theorem 6.10 applied to Q and Q# – as .C/
=# Lt
²
Z
D exp . #/
t
0
.s / d m.#/ s
1 . #/2 2
Z
³
t 2
2
.s / ds . 0
(3) The representation (+) allows to reparameterise the model E t with respect to fixed reference points # such that the model around # is quadratic in . #/. We call Z t .#/ .s / d ms 0
and
Z
t0
t 2
Z D
2
.s / ds 0
./ d m
.#/
under
Q#
t0
score martingale at # and information process at #. It is clear from .8.10000 / that the law of the observed information depends on #. Hence reparameterising as in (+) with respect to a reference point # or with respect to # 0 ¤ # yields statistical models which are different. From now on we shall keep trace of the parameter # in some more notations: instead of m as in .8.90 / we write # .dx/ D
1 p 1 C y2 2
2# 2
dx ,
x2R
for the invariant measure of the canonical process under Q# ; instead of a./ we write a# ./ for the norming function in .8.1000 / which is regularly varying at 1 with index 1 2# ˛.#/ D 1 2 2 .0, 1/ 2
Section 8.2
235
A Null Recurrent Diffusion Model
as in .8.100 /. By choice of ‚ in .8.80 /, there is a one-to-one correspondence between indices 0 < ˛ < 1 and parameter values # 2 ‚. Given the representation of log-likelihoods in Model 8.12, the following proposition allows to prove convergence of local models at #, or convergence of rescaled estimation errors at # for a broad class of estimators. It is obtained as a special case of Theorems 9.8 and 9.10 and of Corollary 9.10’ (from Höpfner and Löcherbach [53]) in the Appendix. 8.13 Proposition. For functions f 2 L2 .# / and with respect to G n D .G tn / t0 , consider locally square integrable local Q# -martingales Z tn 1 M n D .M tn / t0 , M tn :D p f .s / d m.#/ s ˛.#/ 0 n with m.#/ the Q# -martingale part of the canonical process . Under Q# , we have weak convergence n M , hM n i ! .ƒ.#//1=2 B ı V .˛.#// , ƒ.#/ V .˛.#// in D.R R/ as n ! 1: in this limit, Brownian motion B and Mittag–Leffler process V .˛.#// of index 0 < ˛.#/ < 1 are independent, and the constant is 1C˛.#/ .˛.#// .8.130 / ƒ.#/ :D 2 2 # .f 2 / . 4 .1˛.#// Proof. (1) Fix # 2 ‚. For functions f 0 in L1 .# /, we have from 8.10(b) weak convergence in D, under Q# , of integrable additive functionals: Z n 1 f .s / ds ! 2 # .f / V ˛.#/ , n ! 1 . a# .n/ 0 We rephrase this as weak convergence on D, under Q# , of Z n 1 f .s / ds ! C.#/ # .f / V ˛.#/ , n˛.#/ 0
n!1
for a constant C.#/ which according to 8.10 is given by 1 ˛.#/ 4 .1˛.#// 1 00 C.#/ :D 2 . .8.13 / 2 2 .˛.#// (2) Step (1) allows to apply Theorem 9.8(a) from the Appendix: we have the regular variation condition (a.i) there with ˛ D ˛.#/, m D .#/ and `./ 1=C.#/ on the right-hand side. (3) With the same notations, we apply Theorem 9.10 from the Appendix: for locally f satisfying the conditions made there, this gives square integrable local martingales M weak convergence of 1 1 f tn f tn M M p Dp under Q# t0 t0 n˛.#/ C.#/ n˛.#/=`.n/
236
Chapter 8 Some Stochastic Process Examples
in D as n ! 1 to
1=2 e B ı V .˛.#// .ƒ.#//
where Brownian motion B and Mittag–Leffler process V .˛.#// are independent, with constant ˝ ˛ e f . ƒ.#/ :D E# M .8.13000 / 1 (4) We put together steps (1)–(3) above. For g 2 L2 .# / consider the local .Q# , G/-martingale M Z t M t :D g.s / d m.#/ t 0 s , 0
m.#/
is the Q# -local martingale part of the canonical process . Then M is a where martingale additive functional as defined in Definition 9.9 and satisfies Z 1 2 2 g .s / ds D 2 # .g 2 / < 1 . E# .hM i1 / D E# 0
Thus, as a consequence of step (3), M n :D p
1 n˛.#/
.M tn / t0
under Q# converges weakly in D as n ! 1 to .ƒ.#//1=2 B ı V .˛.#// where combining .8.1300 / and .8.13000 / we have 1C˛.#/ e ƒ.#/ D C.#/ ƒ.#/ D C.#/ 2 # .g 2 / D 2 2
.˛.#// # .g 2 / . 4 .1˛.#//
This constant ƒ.#/ appears in .8.130 /. (5) The martingales in step (4) are continuous. Thus Corollary 9.10’ in the Appendix extends the result of step (4) to weak convergence of martingales together with their angle brackets, and concludes the proof of Proposition 8.13. With Model 8.12 and Proposition 8.13 we have all elements which we need to prove LAMN at arbitrary reference points # 2 ‚. Recall that the parameter space ‚ is defined by .8.80 /, and that the starting point for equation (8.8) is fixed and does not depend on #. Combining the representation of likelihoods with respect to # in formula (+) of Model 8.12 with Proposition 8.13 we get the following: 8.14 LAMN at Every Reference Point # 2 ‚. Consider the statistical model En which corresponds to observation of a trajectory of the solution to equation (8.8) continuously over the time interval Œ0, n, under unknown # 2 ‚.
Section 8.2
237
A Null Recurrent Diffusion Model
(a) For every # 2 ‚, we have LAMN at # with local scale 1 for ˛.#/ 2 .0, 1/ defined by .8.100 / : ı.n/ :D p n˛.#/ for every n, log-likelihoods in the local model E#,n are quadratic in the local parameter h=0
ƒ#,n D ƒn.#Cın .#/h/=# D h Sn .#/
1 2 h Jn .#/ , 2
h2R
and we have weak convergence as n ! 1 of score and information at # .˛.#// .˛.#// , ƒ.#/ V1 D: .S , J / ƒ1=2 .#/ B V1 . Sn .#/, Jn .#/ j Q# / ! with ƒ.#/ given by .8.130 /, and with Mittag–Leffler process V ˛.#/ independent from B. For the limit experiment E.S , J / at # see Construction 6.16 and Example 6.18. (b) For every # 2 ‚, rescaled ML estimation errors at # coincide with the central sequence at #. Hence ML estimators are regular and efficient in the sense of Jeganathan’s version 7.10(b) of the convolution theorem, and attain the local asymptotic minimax bound of Theorem 7.12. Proof. (1) Localising around a fixed reference point # 2 R, write Qn for Q restricted to Gn . With 1 1 2# with ˛.#/ D 1 2 2 .0, 1/ ın .#/ :D p 2 n˛.#/ according to 8.10, Model 8.12 and Proposition 8.13 we consider local models at # ° ± n : h 2 R , n1 E#,n D C , Gn , Q#Cı n .#/ h when n tends to 1. According to (+) in step (2) of Model 8.12, log-likelihoods in E#,n are Z n Z n 1 h=0 .s / d m#s h2 ın2 .#/ 2 .s / 2 ds , h 2 R . .˘/ ƒ#,n D h ın .#/ 2 0 0 Now we can apply Proposition 8.13: note that for every # 2 ‚, ./ belongs to L2 .# / since 2# 2 .x/ D O x 2 as jxj ! 1 , d# .x/ D O jxj 2 dx as jxj ! 1 where j 2# 2 j < 1. Proposition 8.13 yields weak convergence as n ! 1 of pairs Z n Z n 1 1 .s / d m#s , ˛.#/ 2 .s / 2 ds . Sn .#/ , Jn .#/ / :D p n .˘˘/ 0 n˛.#/ 0 under Q#
238
Chapter 8 Some Stochastic Process Examples
to the pair of limiting random variables .˛.#// .˛.#// , ƒ.#/ V1 . S.#/ , J.#/ / D ƒ1=2 .#/ B V1 .8.140 / where Brownian motion B and Mittag–Leffler process V ˛.#/ are independent, and ƒ.#/ D .2 2 /1C˛.#/
.˛.#// # . 2 / 4 .1˛.#//
according to .8.130 /, with ./ from .8.120 /. We have proved in Construction 6.16 and in Example 6.18 that the pair of random variables .8.140 / indeed generates a mixed normal experiment. (2) Combining this with step (2) of Model 8.12, we see that rescaled ML estimation errors at # coincide with the central sequence at #: p # n # D Jn1 .#/ Sn .#/ D Zn .#/ , n 1 . n˛.#/ b .˘ ˘ ˘/ In particular, the coupling condition of Theorem 7.11 holds. By LAMN at #, by the Convolution Theorem 7.10 and the Local Asymptotic Minimax Theorem 7.12, we see that the ML estimator sequence is regular and efficient at #, and attains the local asymptotic minimax bound as in Theorem 7.12(b). 8.15 Remark. According to .8.140 / and to Remark 6.6’, the limit law for .˘ ˘ ˘/ as n ! 1 has the form Z 1 ˛.#/ .du/ N 0 , L .Z.#/jP0 / D L ƒ.#/ V1 . u As mentioned in Example 6.18, this law – the best concentrated limit distribution for rescaled estimation errors at #, in the sense of the Convolution Theorem 6.6 or of the Local Asymptotic Minimax Theorem 6.8 – does not admit finite second moments (see Exercise 6.180 ). 8.16 Remark. Our model presents different speeds of convergence at different parameter values, different limit experiments at different parameter values, but has an information process 2 Z t Z t 1 s 2 .s / 2 ds D 2 ds , t 0 0 1 C 2s 0 which can be calculated from the observation . t / t0 without knowing the unknown parameter # 2 ‚. This fact allows for random norming using the observed information: directly from LAMN at # in 8.14, combining .˘/, .˘˘/ and .˘ ˘ ˘/ there, we can write sZ 2 n s b ds # # ! N .0, 2 / n 1 C 2s 0
Section 8.2
239
A Null Recurrent Diffusion Model
for all values of # 2 ‚. The last representation allows for practical work, e.g. to fix confidence intervals for the unknown parameter determined from an asymptotically efficient estimator sequence, and overcomes the handicap caused by non-finiteness of second moments in Remark 8.15. We conclude the discussion of the statistical model defined by (8.8) and (8.8’) by pointing out that one-step correction is possible, and that we may start from any preliminary estimator sequence .Tnp/n for the unknown parameter in .En /n whose estimation errors at # are tight at rate n˛.#/ as n ! 1 for every # 2 ‚, and modify .Tn /n en /n which according to Theorem 7.19 in order to obtain a sequence of estimators .T satisfies the coupling condition of Theorem 7.11 p en # D Zn .#/ C o.Q / .1/ as n ! 1 n˛.#/ T # at every # 2 ‚. Again there is no need for discretisation as in Proposition 7.16 since local scale, observed information and score depend continuously on #. To do this, it is sufficient to check the set of Conditions 7.14 for our model. 8.17 Proposition. With the above notations, the sequence of models .En /n satisfies all assumptions stated in 7.14. For .Tn /n as above, one-step modification 1 en :D Tn C p 1 #n Sn .Tn / D b T ˛.T / J .T n n n/ n directly leads to the ML estimator b # n. Proof. Let Y denote a common determination – as in step (2) of Model 8.12, by R Lemma 8.2’ – of the stochastic integral .s / ds under all laws Q , 2 ‚. We check the set of conditions in 7.14. First, it is obvious that the model allows for a broad collection of preliminary estimator sequences. As an example, select A 2 B.R/ with .A/ > 0, write L RL .x/ :D .x/1A .x/, choose a common determination Y for the stochastic integral L .s / ds under all laws Q , 2 ‚, and put TLn :D R n 0
YLn , L 2 .s / 2 ds
n1.
p Then Proposition 8.13 establishes weak convergence of L. n˛.#/ .TLn #/ j Q# / as n ! 1 under all values of the parameter # 2 ‚. Even if such estimators can be arbitrarily bad, depending on the choice of A, they converge at the right speed at all points of the model: this establishes Assumption 7.14(D).
240
Chapter 8 Some Stochastic Process Examples
In the local model E#,n at # Z n Z n 1 1 2 2 Sn .#/ D p .s / d m.#/ D # . / ds Y p n s s 0 n˛.#/ 0 n˛.#/ is a version of the score, according to 8.14 and (+) in Model 8.12, and the information Z n 1 Jn .#/ D ˛.#/ 2 .s / 2 ds n 0 depends on # only through local scale. In particular,
! n˛.#/ 1 Jn .#/ ; n˛./
Jn . / Jn .#/ D
.C/
writing down Sn . / and transforming tion term we get
R
.s / d m./ into
R
.s / d m.#/ plus correc-
Sn . / pSn .#/ D ! p i n˛.#/ n˛.#/ hp ˛.#/ p 1 Sn .#/ p n . #/ Jn .#/ . n˛./ n˛./
.CC/
From (+) and (++) it is obvious that parts (B) and (C) of Assumptions 7.14 simultaneously will follow from p ˇ ˇ ˇ n˛. #Ch= n˛.#/ / ˇ ˇ ˇ 1 ./ sup ˇ ˇ ! 0 as n ! 1 ˛.#/ ˇ ˇ jhjc n for arbitrary values of a constant 0 < c < 1. By .8.100 /, ˛. / differs from ˛.#/ by Œ 12 . #/. Hence, to check ./ it is sufficient to cancel out n˛.#/ in the ratio in ./ and then take logarithms (note that this argument exploits again 0 < ˛.#/ < 1 for every # 2 ‚). Now parts (B) and (C) of Assumptions 7.14 are established. This finishes the proof, and we write down the one-step modification Rn Yn Tn 0 2 .s / 2 ds 1 1 en :D Tn C p Rn Sn .Tn / D Tn C Db #n T 2 2 n˛.Tn / Jn .Tn / 0 .s / ds according to (a simplified version of) Theorem 7.19.
8.3
Some Further Remarks
We point out several (out of many more) references on LAN, LAMN, LAQ in stochastic process models of different types. Some of these papers prove LAN or LAMN or LAQ in a particular stochastic process model, others establish properties of estimators
Section 8.3
241
Some Further Remarks
at # which indicate that the underlying statistical model should be LAMN at # or LAQ at #. Cox–Ingersoll–Ross process models are treated in Overbeck [103], Overbeck and Ryden [104] and Ben Alaya and Kebaier [8]. For ergodic diffusions with state space R dX t D b.#, X t / dt C .X t / d W t ,
t 0
where the drift depends on a d -dimensional parameter # 2 ‚, the books by Kutoyants [78, 80] present a large variety of interesting models and examples. For LAN in parametric models for non-time-homogeneous diffusions – where it is assumed that the drift is periodic in time and that in a suitable sense ‘periodic ergodicity’ holds – see Höpfner and Kutoyants [55]. An intricate tableau of LAN, LAMN or LAQ properties arising in a setting of delay equations can be found in Gushchin and Küchler [39]. Markov step process models with transition intensities depending on an unknown parameter # 2 ‚ are considered in Höpfner, Jacod and Ladelli [51] and in Höpfner [46, 48, 49]. For general semi-martingale models with jump parts and with diffusive parts see Sørensen [119] and Luschgy [95–97]. For branching diffusions – some finite random number of particles diffusing in space, branching with random number of offspring at rates which depend on the position and on the whole configuration, whereas immigrants are arriving according to Poisson random measure – LAN or LAMN has been proved by Löcherbach [90, 91]; ergodicity properties and invariant measures for such processes have been investigated by Hammer [44]. Diffusion process models where the underlying process is not observed continuT (either with T fixed ously in time but at discrete time steps ti D i with 0 i or with T increasing to 1), and where asymptotics are in tending to 0, have received a lot of interest: see e.g. Yoshida [129], Genon-Catalot and Jacod [25], Kessler [70], or Shimizu and Yoshida [118]. Proofs establishing LAMN (in the case where T is fixed) or LAN (in the case where T " 1 provided the process is ergodic) have been given by Dohnal [22] and by Gobet [30,31]. The book [71] edited by Kessler, Lindner and Sørensen gives a broad recent overview on discretely observed semi-martingale models. If the underlying process is a point process, e.g. Poisson random measure .dx/ on some space .E, E/ with deterministic intensity # .dx/ depending on some unknown parameter # 2 ‚, one may consider sequences of windows Kn E which increase to E as n ! 1, and identify the experiment En at stage n of the asymptotics with observation of the point process in restriction to the window Kn , cf. Kutoyants [79] or Höpfner and Jacod [50]. If our Chapters 7 and 8 deal with local asymptotics of type LAN, LAMN or LAQ, let us mention one example for a limit experiment of different type, inducing statistical properties via local asymptotics which are radically different from what we have discussed above. Ibragimov and Khasminskii [60] put forward a limit experiment where
242
Chapter 8 Some Stochastic Process Examples
the likelihood ratios are .IK/
e u 2 juj , Lh=0 D e W 1
h2R
e u /u2R , and studied convergence to this limit with two-sided Brownian motion .W experiment in ‘signal in white noise’ models. Obviously (IK) is not a quadratic experiment since the parameter plays the role of time: in this sense, the experiment (IK) is linked to the Gaussian shift limit experiment as we pointed out in Example 1.16’. Dachian [17] associated to the limit experiment (IK) approximating experiments where likelihood ratios have – separately on the positive and on the negative branch – the form of the trajectory of a particular Poisson process with suitable linear terms subtracted. In the limit model (IK), Rubin and Song [114] could calculate the risk of the Bayesian u – for quadratic loss, and with ‘uniform prior over the real line’ – and could show that quadratic risk of u is by some factor smaller than quadratic risk of the maximum likelihood estimator b u. Recent investigations by Dachian show quantitatively how this feature carries over to the approximating experiments which he considers: there is a large domain of parameter values in the approximating experiments un where rescaled estimation errors of an analogously defined un outperform those of b under quadratic risk. However, not much seems to be known on comparison of b u and u using other loss functions than the quadratic, a fortiori not under a broader class of loss functions, e.g. subconvex and bounded. There is nothing like a central statistic in the sense of Definition 6.1 for the experiment (IK) where the only sufficient statistic is the whole two-sided Brownian path, hence nothing like a central sequence in the sense of Definition 7.1’(c) in the approximating experiments. Still, both estimators b u and u in the experiment (IK) are equivariant in the sense of Definition 5.4: in Höpfner and Kutoyants [56, 57] we have studied local asymptotics with limit experiment (IK) – in a context of diffusions carrying a deterministic discontinuous periodic signal in their drift, and being observed continuously over a long time interval – using some of the techniques of Chapter 7; the main results of Chapter 7 have no counterpart here. The limit experiment (IK) is of importance in a broad variety of contexts, e.g. Golubev [32], Pflug [106], Küchler and Kutoyants [76] and the references therein.
Chapter 9
Appendix
Topics: 9.1 Convergence of Martingales Convergence to a Gaussian martingale 9.1 Convergence to a conditionally Gaussian martingale 9.2 Convergence of pairs (martingale, angle bracket) 9.3 9.2 Harris Recurrent Markov Processes Harris recurrent Markov processes 9.4 Invariant measure 9.5–9.5’ Additive functionals 9.5” Ratio limit theorem 9.6 Tightness rates under null recurrence (9.7) Regular variation condition, Mittag-Leffler processes, weak convergence of integrable additive functionals 9.8 Martingale additive functionals 9.9–9.9’ Regular variation condition: weak convergence of martingale additive functionals together with their angle brackets 9.10–9.10’ 9.3 Checking the Harris Condition A variant of the Harris condition 9.11 From grid chains via segment chains to continuous time processes 9.11’ 9.4 One-dimensional Diffusions Harris properties for diffusions with values in R 9.12 Some examples 9.12’ Diffusions taking values in some open interval I in R 9.13 Some examples 9.13’–9.13” Exact constants for i.i.d. cycles in null recurrent diffusions without drift 9.14
This Appendix collects facts of different nature which we quote in the stochastic process sections of this book. In most cases, they are stated without proof, and we indicate references. An asterisk in front of this chapter (and in front of all its sections) indicates that the reader should be acquainted with basic properties of stochastic processes in continuous time, with semi-martingales and stochastic differential equations. Our
244
Chapter 9
Appendix
principal references are the following. The book by Métivier [98] represents a wellwritten source for the theory of stochastic processes; a useful overview appears in the appendix sections of Bremaud [14]. A detailed treatment can be found in Dellacherie and Meyer [20]. For stochastic differential equations, we refer to Karatzas and Shreve [69] and Ikeda and Watanabe [61]. For semi-martingales and their (weak) convergence, see Jacod and Shiryaev [64].
9.1
Convergence of Martingales
All filtrations which appear in this section are right-continuous; all processes below have càdlàg paths. .D, D, G/ denotes the Skorohod space of d -dimensional càdlàg functions (see [64, Chap. VI]). A d -dimensional locally square integrable local martingale M D .M t / t0 starting from M0 D 0 is called a continuous Gaussian martingale if there are no jumps and if the angle bracket process hM i is continuous and deterministic: in this case, M has independent increments, and all finite dimensional distributions are Gaussian laws. We quote the following from [64, Coroll. VIII.3.24]: 9.1 Theorem. For n 1, consider d -dimensional locally square integrable local martingales .n/ M .n/ D M t t0 defined on ..n/ , A.n/ , F .n/ , P .n/ / .n/
starting from M0 D 0. Let .n/ .ds, dy/ denote the point process of jumps of M .n/ and .n/ .ds, dy/ its .P .n/ , F .n/ /-compensator, for .s, y/ in .0, 1/ .Rd n ¹0º/. Assume a Lindeberg condition for all 0 < t < 1 and all " > 0, Z tZ jyj2 .n/ .ds, dy/ D oP .n/ .1/ 0
¹jyj>"º
as n ! 1 .
Let M 0 denote a continuous Gaussian martingale M 0 D M t0 t0 defined on .0 , A0 , F 0 , P 0 / and write C :D hM 0 i for the deterministic angle bracket. Then stochastic convergence ˛ ˝ for every 0 < t < 1 fixed, M .n/ t D C t C oP .n/ .1/ as n ! 1 implies weak convergence of martingales Q.n/ :D L M .n/ j P .n/ ! in the Skorohod space D as n ! 1.
Q0 :D L M 0 j P 0
Section 9.1
245
Convergence of Martingales
Next we fix one probability space ..n/ , A.n/ , P .n/ / D ., A, P / for all n, equipped with a filtration F , and assume that M .n/ and F .n/ as above are derived from the same locally square integrable local .P , F /-martingale M D .M t / t0 through .n/ .n/ space-time rescaling (such as e.g. F t :D F tn and M t :D n1=2 M tn ). We need the following nesting condition (cf. [64, VIII.5.37]): 8 is a sequence of positive real numbers ˛n # 0 such that ˆ < there .n/ .nC1/ in F˛nC1 for all n 1, and F˛n is contained .C/ ˆ : S F .n/ D S F D: F . 1 n ˛n t t On ., A, P /, we also need a collection of F1 -measurable random variables ˆ D .ˆ t / t0 such that .CC/
paths t ! ˆ t are continuous and strictly increasing with ˆ0 D 0 and lim ˆ t D 1, P -a.s. t!1
(such as e.g. ˆ t D t for some F1 -measurable random variable > 0). On some other probability space .0 , A0 , F 0 , P 0 /, consider a continuous Gaussian martingale M 0 with deterministic angle bracket hM 0 i D C . We define M 0 subject to independent time change t ! ˆ t M 0 ı ˆ D Mˆ0 t t0 as follows. Let K 0 ., / denote a transition probability from ., F1 / to .D, D/ such that for the first argument ! 2 fixed, the canonical process on .D, D/ under K 0 .!, / is a continuous Gaussian G-martingale with angle bracket t ! .C ı ˆ/.!, t / D C.ˆ t .!//. Lifting ˆ and to D , A˝D , .F1 ˝G t / t0 , .PK 0 /.d!, df / :D P .d!/K 0 .!, df / the pair .ˆ, M 0 ı ˆ/ is well defined on this space (cf. [64, p. 471]). By this construction, M 0 ı ˆ is a conditionally Gaussian martingale. We quote the following result from Jacod and Shiryaev [64, VIII.5.7 and VIII.5.42)]: 9.2 Theorem. For n 1, for filtrations F .n/ in A such that the nesting condition (+) above holds, consider d -dimensional locally square integrable local martingales .n/ M .n/ D M t t0 on ., A, F .n/ , P / .n/
starting from M0
D 0 and satisfying a Lindeberg condition
for all 0 < t < 1 and all " > 0, Z tZ jyj2 .n/ .ds, dy/ D oP .1/ 0
¹jyj>"º
as n ! 1 .
246
Chapter 9
Appendix
As above, for a collection of F1 -measurable variables ˆ D .ˆ t / t0 satisfying (++) and for a continuous Gaussian martingale M 0 with deterministic angle bracket hM 0 i D C , we have a (continuous) conditionally Gaussian martingale M 0 ıˆ living on . D, A˝D, .F1 ˝G t / t0 , PK 0 /. Then stochastic convergence ˛ ˝ for every 0 < t < 1 fixed, M .n/ t D .C ı ˆ/ t C oP .1/ as n ! 1 implies weak convergence of martingales L M .n/ j P !
L M 0 ı ˆ j PK 0
in the Skorohod space D as n ! 1. Finally, consider locally square integrable local martingales M .n/ Z Z M .n/ D M .n,c/ C y ..n/ .n/ /.ds, dy/ 0
Rd n¹0º
where M .n,c/ is the continuous local martingale part. Writing M .n,i/ , 1 i d , for the components of M .n/ and ŒM .n/ for the quadratic covariation process, we obtain for 0 < t < 1 and i , j D 1, : : : , d ˇ
˛ ˇˇ ˝ ˇ sup ˇ M .n,i/ , M .n,j / s M .n,i/ , M .n,j / s ˇ D oP .1/ as n ! 1 st
provided the sequence .M .n/ /n satisfies the Lindeberg condition. In this situation, since Jacod and Shiryaev [64, VI.6.1] show that weak convergence of M .n/ as n ! 1 in D to a continuous limit martingale implies weak convergence of pairs .M .n/ , ŒM .n/ / in the Skorohod space of càdlàg functions Œ0, 1/ ! Rd Rd d , we also have weak convergence of pairs .M .n/ , hM .n/ i/. Thus the following result is contained in [64, VI.6.1]: 9.3 Theorem. For n 1, consider d -dimensional locally square integrable local martingales .n/ M .n/ D M t t0 defined on ..n/ , A.n/ , F .n/ , P .n/ / .n/ f denote a with M0 D 0. Assume the Lindeberg condition of Theorem 9.1. Let M continuous local martingale e e e A, e/ fD M ft defined on ., F, P M t0
f0 D 0. Then weak convergence in D starting from M fjP e , ! L M L M .n/ j P .n/
n!1
Section 9.2
247
Harris Recurrent Markov Processes
implies weak convergence ˛ ˝ L M .n/ , M .n/ j P .n/
!
˝ ˛ f, M f jP e L M
,
n!1
in the Skorohod space D.Rd Rd d / of càdlàg functions Œ0, 1/ ! Rd Rd d .
9.2
Harris Recurrent Markov Processes
For Harris recurrence of Markov chains, we refer to Revuz [111] and Nummelin [101]. For Harris recurrence of continuous time Markov processes, our main reference is Azema, Duflo and Revuz [2]. On some underlying probability space, we consider a time homogeneous strong Markov process X D .X t / t0 taking values in Rd , with càdlàg paths, having infinite life time, and its semigroup .P t ., // t0
:
x, y 2 Rd , s, t 2 Œ0, 1/ . R1 We write U 1 for the potential kernel U 1 .x, dy/ D 0 e t P t .x, dy/ dt which corresponds to observation of X after an independent exponential time. For -finite measures m on .Rd , B.Rd //, measurable functions f : Rd ! Œ0, 1 and t 0 write Z P t f .x/ D P t .x, dy/ f .y/ , Z mP t .dy/ D m.dx/ P t .x, dy/ , Z Em f D m.dx/ Ex .f / . P t .x, dy/ D P . XsCt 2 dy j Xs D x / ,
A -finite measure m on .Rd , B.Rd // with the property m P t D m for all t 0 is termed invariant for X . On the canonical path space .D, D, G/ for Rd -valued càdlàg processes, write Qx for the law of the process X starting from x 2 Rd . Let D . t / t0 denote the canonical process on .D, D, G/, and . t / t0 the collection of shift operators on .D, D/: t .˛/ :D .˛.t C s//s0 for ˛ 2 D. Systematically, we speak of ‘properties of the process X ’ when we mean ‘properties of the semigroup .P t ., // t0 ’: these in turn will be formulated as properties of the canonical process in the system .D, D, G, . t / t0 , .Qx /x2Rd /. 9.4 Definition. The process X D .X t / t0 is called recurrent in the sense of Harris (or Harris for short) if there is a -finite measure on .Rd , B.Rd // such that the
248
Chapter 9
Appendix
following holds: .˘/
A 2 B.Rd / , .A/ > 0 H) Z 1 1A .s / ds D C1 Qx -almost surely, for every x 2 Rd . 0
In Definition 9.4, a process X satisfying .˘/ will accumulate infinite occupation time in sets of positive -measure, independently of the starting point. The following is from [2, 2.4–2.5]: 9.5 Theorem. If X is recurrent in the sense of Harris, (a) there is a unique (up to multiplicative constants) invariant measure m on .Rd , B.Rd //, (b) condition .˘/ of Definition 9.4 holds with m in place of , (c) any measure in Definition 9.4 for which condition .˘/ holds satisfies 0 : Sv.˛/ > t ,
t 0, .˛/
and paths of V .˛/ are continuous and non-decreasing with V0 1 . We extend this definition to the case ˛ D 1 by .1/
S .1/ D id D V .1/ , i.e. S t
.1/
t Vt
,
.˛/
0 and lim V t t!1
t 0.
D
250
Chapter 9
Appendix
The last definition is needed in view of null recurrent processes where suitably normed integrable additive functionals converge weakly to V .1/ D id . As an example, we might have a recurrent atom in the state space of X such that the distribution of the time between successive visits in the atom is ‘relatively stable’ in the sense of [12, Chap. 8.8 combined with p. 359]; at every visit, the process might spend an exponential time in the atom; then the occupation time A t of the atom up to time t defines an additive functional A D .A t / t0 of X for which suitable norming functions vary regularly at 1 with index ˛ D 1. Index ˛ D 1 is also needed for the strong law of large numbers in Theorem 9.6(b) in positive recurrent Harris processes. We quote the following from [53, Thm. 3.15]. 9.8 Theorem. Consider a Harris process X with invariant measure m. (a) For 0 < ˛ 1 and `./ varying slowly at 1, the following assertions (i) and (ii) are equivalent: (i) for every function g : Rd ! Œ0, 1/ which is B.Rd /-measurable and satisfies 0 < m.g/ < 1, one has Z 1 t˛ t1 s .R1=t g/.x/ :D Ex e g.s / ds
m.g/ as t ! 1 `.t / 0 for m-almost all x 2 Rd , the exceptional null set depending on g; (ii) for every f : Rd ! Œ0, 1/ which is B.Rd /-measurable with 0 < m.f / < 1, for every x 2 Rd , Z `.n/ t n f .s / ds under Qx n˛ 0 t0 converges weakly in D as n ! 1 to m.f / V .˛/ where V .˛/ is the Mittag–Leffler process of index ˛. (b) The cases in (a) are the only ones where for functions f as in (a.ii) and for suitable choice of a norming function v./, weak convergence in D of Z tn 1 f .s / ds under Qx v.n/ 0 t0 to a continuous non-decreasing limit process V (starting at V0 D 0, and such that L.V1 / is not Dirac measure 0 at 0) is available. In statistical models for Markov processes under time-continuous observation, we have to consider martingales when the starting point x 2 Rd for the process is arbitrary but fixed. As an example, in the context of Chapter 8, the score martingale is a
Section 9.2
251
Harris Recurrent Markov Processes
locally square integrable local Qx -martingale M D .M t / t0 on .D, D, G/. We need limit theorems for the pair .M , hM i/ under Qx to prove convergence of local models. Under suitable assumptions, Theorem 9.8 above can be used to settle the problem for the angle bracket hM i under Qx . Thus we need weak convergence for the score martingale M under Qx . From this – thanks to a result in [64], which was recalled in Theorem 9.3 above – we can pass to joint convergence of the pair .M , hM i/ under Qx . 9.9 Definition. On the path space .D, D, G/, for given x 2 Rd , consider a locally square integrable local Qx -martingale M D .M t / t0 together with its quadratic variation ŒM and its angle bracket hM i under Qx ; assume in addition that hM i under Qx is locally bounded. We call M a martingale additive functional if the following holds: (i) ŒM and hM i under Qx admit versions which are additive functionals of ; (ii) for every choice of 0 s < t < 1 and y 2 Rd , one has M t Ms D M ts ı s Qy -almost surely. 9.9’ Example. (a) On .D, D, G/, for starting point x 2 R, write Qx for the law of a diffusion dX t D b.X t / dt C .X t / d W t and m.x/ for the local martingale part of the canonical process on .D, D, G/ under Qx . Write L for the Markov generator of X and consider some C 2 function F : R ! R with derivative f . Then Z t f .s / d m.x/ , t 0 Mt D s 0
t0
is a locally square integrable local Qx -martingale admitting a version .t , !/ ! Y .t , !/ Z t Y t :D F . t / F .0 / LF .s / ds , t 0 0
by Ito formula. This version satisfies Y t Ys D Y ts ı s for 0 s < t < 1. Quadratic variation and angle bracket of M under Qx Z t ŒM t D hM i t D f 2 .s / 2 .s /ds , t 0 0
are additive functionals. Hence M is a martingale additive functional as defined in Definition 9.9. (b) Prepare a Poisson process N D .N t / t0 with parameter > 0 independent of Brownian motion W . Under parameter values > 0, > 0, " > 0, for starting point x 2 R, write Qx for the law on .D, D, G/ of the solution to dX t D 1¹X t >0º dt C " 1¹X t 0
H)
lim sup 1A . t / D 1Qx -almost surely, for every x 2 Rd t!1
implies Harris recurrence. Proof. We use arguments of Revuz [111, pp. 94–95] and Revuz and Yor [112, p. 395]. Since m in condition .ı/ is invariant, we consider in .ı/ sets A 2 B.Rd / with
254
Chapter 9
Appendix
Rt Em . 0 1A .s / ds/ D t m.A/ > 0 . As a consequence, we can specify some " > 0 and some ı > 0 such that ® ¯ > 0 m x 2 Rd : Qx . < 1/ > ı .ıı/ Rt where :D A D inf¹t > 0 : 0 1A .s / ds > "º . (1) Write f : Rd ! Œ0, 1 for the B.Rd /-measurable function v ! Qv . < 1/ . Then f . t / D E 1¹tCı t 0 :
1 X
1F .kT Cv /0vT D 1
kD0
almost surely, independently of the choice of a starting point. Thus the segment chain .C/
D . k /k2N0 , k :D .kT Cv /0vT
is a Harris chain taking values in .D.Œ0, T /, D.Œ0, T //. It admits a unique invariant measure. From the definition of m we see that the measure m is invariant for the segment chain. Hence is Harris with invariant measure m. (3) We show that . t / t0 is a Harris process. Introduce a -finite measure on .E, E/: Z T .A/ :D ds Œb mPs .A/ , A 2 E . 0
Fix some set B 2 E with m b.B/ D 1 (recall that we are free to multiply m b with positive constants). Together with the D.Œ0, T /-measurable function G : D.Œ0, T / ! R
given by
˛ ! 1¹0 2Bº .˛/ , ˛ 2 D.Œ0, T /
Section 9.3
257
Checking the Harris Condition
consider a family of D.Œ0, T /-measurable functions indexed by A 2 E, Z FA : D.Œ0, T / ! R
given by
˛ !
T
ds 1A .˛.s// ,
˛ 2 D.Œ0, T / .
0
b.B/ D 1 . The ratio limit Note that m.FA / D .A/ for A 2 E, and m.G/ D m theorem for the segment chain (+) yields the convergence R mT Pm1 1A .s / ds FA . k / m.FA / 0 D lim PkD0 D D .A/ .CC/ lim Pm1 m1 m!1 m!1 m.G/ kD0 1B . kT / mD0 G. k / Qx -almost surely, for every choice of a starting point x 2 E. From m b.B/ > 0 and the Harris property of the grid chain b D .kT /k2N0 in (1), denominators in (++) tend to 1 Qx -almost surely, for every choice of a starting point x 2 E. Hence, for sets A 2 E with .A/ > 0, also numerators in (++) increase to 1 Qx -almost surely, for every choice of a starting point x 2 E: Z 1 A 2 E , .A/ > 0 : 1A .s / ds D 1 Qx -almost surely, for every x 2 E . 0
This is property .˘/ in Definition 9.4: hence the continuous-time process D . t / t0 is Harris. By Theorem 9.5, admits a unique (up to constant multiples) invariant measure m on .E, E/; it remains to identify m . (4) We show that the three measures m, m b , can be identified. Select some set C 2 E with the property m.FC / D .C / D 1 . Then similarly to (++), for every A 2 E, the ratio limit theorem for the segment chain (+) yields Rt 1A .s / ds m.FA / lim R 0t D .A/ Qx -almost surely, for every x 2 E . D t!1 m.F C/ 0 1C .s / ds Thus the measures m and coincide, up to some constant multiple, by Theorem 9.6 RT and the Harris property of . t / t0 in step (3). Next, the measure D 0 ds Œb mP s is by definition invariant for the grid chain .kT /k2N0 . Thus the Harris property in step (1) shows that m b and coincide up to constant multiples. This concludes the proof of the proposition. It follows from Azema, Duflo and Revuz [2] that the Harris property of .X t / t0 in continuous time is equivalent to the Harris property of the chain .XSn /n2N0 which corresponds to observation of the continuous-time process after independent exponential waiting times (i.e. Sn :D 1 C C n where .j /j are i.i.d. exponentially distributed and independent of X ).
258
9.4
Chapter 9
Appendix
One-dimensional Diffusions
For one-dimensional diffusions, it is an easy task to check Harris recurrence, in contrast to higher dimensions. We consider first diffusions with state space .1, 1/ and then – see [69, Sect. 5.5] – diffusions taking values in open intervals I R. The following criterion is widely known; we take it from from Khasminskii [73, 1st. ed., Chap. III.8, Ex. 2 on p. 105]. 9.12 Proposition. In dimension d D 1, consider a continuous semi-martingale dX t D b.X t / dt C .X t / d W t where W is one-dimensional Brownian motion, and where b./ and ./ > 0 are continuous on R. (a) If the (strictly increasing) mapping S : R ! R defined by Z y Z x 2b s.y/ dy where s.y/ :D exp .v/ dv , x, y 2 R ./ S.x/ :D 2 0 0 is a bijection onto R, then the process X D .X t / t0 is recurrent in the sense of Harris, and the invariant measure m is given by Z x 1 1 2b 1 .v/ dv dx on .R, B.R// . dx D exp m.dx/ D 2 s.x/ 2 .x/ 2 .x/ 0 The mapping S./ defined by ./ is called the scale function. (b) In the special case b./ 0, we have S./ D id on R: thus a diffusion X without drift – for ./ continuous and strictly positive on R – is Harris with invariant measure 1 dx. 2 .x/ Proof. Let Qx on .D, D, G/ denote the law of X with starting point x. We give a proof in three steps, from Brownian motion via the driftless case to continuous semimartingales as above. (1) Special case ./ 1 and b./ 0. Here X is one-dimensional Brownian motion starting from x, thus Lebesgue measure on R is an invariant measure since Z Z Z 1 .yx/2 1 e 2 t f .y/ D dy f .y/ D .f / . P t .f / D dx dy p 2 t For any Borel set A of positive Lebesgue measure and for any starting point x for X , select n large enough for .A \ Bn1 .0// > 0 and x 2 Bn1 .0/, and define G-stopping times Sm :D inf ¹t > Tm1 : t > nº , m 0,
Tm :D inf ¹t > Sm : t < nº , T0 0 .
Section 9.4
259
One-dimensional Diffusions
By the law of the iterated logarithm, these are finite Qx -almost surely, and we have Tm " 1 Qx -almost surely as m ! 1. Hence A is visited infinitely often by the canonical process under Qx , and Proposition 9.11 establishes Harris recurrence. By Theorem 9.5, m D is the unique invariant measure, and we have null recurrence since m has infinite total mass on R. (2) Special case b./ 0. Here X is a diffusion without drift. On .D, D, G/, let A denote the additive functional Z t 2 .s / ds , t 0 At D 0
and define M :D t 0 , t 0. Then M is a continuous local martingale under Qx with angle bracket hM i D A. A result by Lepingle [85] states that Qx -almost surely ° ± lim M t exists in R . on the event Rc :D lim hM i t < 1 , t!1
t!1
c Since R 1 2 ./ is continuous and strictly positive on R, we infer that the event R D ¹ 0 .s / ds < 1º is a null set under Qx . As a consequence, A is (strictly) increasing to 1 Qx -almost surely. Define
0 < u < 1,
.u/ :D inf¹t > 0 : A t > uº ,
0 0 .
Then .u/ tends to 1 Qx -almost surely as u ! 1. As a consequence, the characterisation theorem of P. Lévy (cf. [61, p. 85]) shows that B :D M.u/ u0 :D .u/ 0 u0 is a standard Brownian motion. For f 0 measurable, the time change gives Z
.v/
Z f .s / .s /ds D
.v/
2
0
f .s / dAs 0
Z
.C/
D
.v/
0
Z D
v
f .0 C BAs / dAs
f .0 C Br / dr .
0
By assumption on ./, we may replace measurable functions f 0 in (+) by f2 and obtain Z .v/ Z v f .CC/ f .s / ds D .0 C Br / dr , 0 < v < 1 . 2 0 0 By step (1), one-dimensional Brownian motion is a Harris process. Letting v tend to 1 in (++) and exploiting again continuity and strict positivity of ./, we obtain from (++) Z 1 A 2 B.R/ , .A/ > 0 H) 1A .s / ds D 1 Qx -almost surely . 0
260
Chapter 9
Appendix
This holds for an arbitrary choice of a starting point x 2 R. We thus have condition .˘/ in Definition 9.4 with :D , thus the driftless diffusion X is Harris recurrent. It remains to determine the invariant measure m , unique by Theorem 9.5 up to constant multiples. We shall show 1 dx . m.dx/ D 2 .x/ Select g 0 measurable with m.g/ D 1. Then for measurable functions f 0, the Ratio Limit Theorem 9.6 combined with (++) gives Rv f Rt R .v/ f .s / ds 2 .0 C Bs / ds 0 f .s / ds 0 D lim R0v g . D lim R .v/ m.f / D lim R t v!1 v!1 t!1 . C B / ds g. / ds 2 0 s g. / ds s 0 s 0 0 By step (1) above, B is a Harris process whose invariant measure is translation invariant. Thus the ratio limit theorem for B applied to the last right-hand side shows f2 m.f / D g for all f 0 measurable . 2 As a consequence we obtain 0 < g2 < 1 (by definition of an invariant measure, m is -finite and non-null); then, up to multiplication with a constant, we identify m on .R, B.R// as m.dx/ D 21.x/ dx. This proves part (b) of the proposition. (3) We consider a semi-martingale X admitting the above representation, and assume that S : R ! R defined by ./ is a bijection onto R. Let L denote the Markov generator of X acting on C 2 functions 1 Lf .x/ D b.x/f 0 .x/ C 2 .x/f 00 .x/ , x 2 R . 2 It follows from the definition in ./ that LS 0. Thus S ı X is a local martingale by Ito formula. More precisely, e :D S ı X D .S.X t // t0 X is a diffusion without drift as in step (2) solving the equation e t / d Wt et D e .X dX
with e :D .s / ı S 1 .
e is a Harris process with invariant measure m It has been shown in step (2) that X e given by 1 1 de xD de x , e x2R m e.de x/ D 2 2 e .e x/ .s / .S 1 .e x // and thus Z 1 e X e non-negative, B.R/-measurable, m e/ > 0 : e t dt D 1 f .ı/ f e .f 0
e almost surely, independently of the choice of a starting point for X.
Section 9.4
261
One-dimensional Diffusions
If we define a measure m on .R, B.R// as the image of m e under the mapping 1 x / , then e x ! S .e Z .ıı/ m.f / D m e.de x / f .S 1 .e x // D m e f ı S 1 and the transformation formula gives m.dx/ D
1 1 1 s.x/ dx D 2 dx , e 2 .S.x// .x/ s.x/
x2R.
Now, for f 0 measurable, we combine .ı/ and .ıı/ with the obvious representation Z t Z t e :D f ı S 1 e X e v dv for f f f .Xv / dv D 0
0
e gives in order to finish the proof. First, in this combination, the Harris property of X Z 1 f non-negative, B.R/-measurable, m.f / > 0 : f .X t / dt D 1 0
almost surely, for all choices of a starting point for X . This is property .˘/ in Definition 9.4 with D m. Thus X is a Harris process. Second, for g 0 as in step 2 above and e.e g / D 1, the ratio limit theorem e g :D g ı S 1 which implies m.g/ D m Rt Rt e.X e s / ds f 0 f .Xs / ds lim R t D lim R0t Dm e f ı S 1 D m.f / t!1 t!1 e s / ds g.Xs / ds e g .X 0
0
identifies m as the invariant measure for X . We have proved part (a) of the proposition. 9.12’ Examples. The following examples (a)–(c) are applications of Proposition 9.12. (a) The Ornstein–Uhlenbeck process dX t D #X t dt C d W t with parameters # < 0 2 /. and > 0 is positive Harris recurrent with invariant probability m D N .0, 2j#j p (b) The diffusion dX t D .˛ ˇX t / dt C 1 C X t2 d W t with parameters ˛ 2 R, ˇ > 0, 0 < < 1 is positive Harris recurrent. We have Z x ˇ 2b .v/ dv
jxj2.1 / as x ! C1 or x ! 1 2 1 0 with notations of ./ in Proposition 9.12. The invariant measure m of X admits finite moments of arbitrary order. 2
Xt (c) The diffusion dX t D # 1CX 2 dt C d W t with parameters > 0 and # 2 t is recurrent in the sense of Harris. Normed as in Proposition 9.12, the invariant
262
Chapter 9
measure is m.dx/ D j#j
2 2 .
1 2
p
1 C x2
2# 2
Appendix
dx . Null recurrence holds if and only if
The following is a variant of Proposition 9.12 for diffusions taking values in open intervals I R. 9.13 Proposition. In dimension d D 1, consider an open interval I R, write B.I / for the Borel- -field. Consider a continuous semi-martingale X taking values in I and admitting a representation dX t D b.X t / dt C .X t / d W t ,
t 0.
We assume that b./ and ./ > 0 are continuous functions I ! R. Fix any point x0 2 I . If the mapping S : I ! R defined by Z y Z x 2b s.y/ dy where s.y/ :D exp .v/ dv ./ S.x/ :D 2 x0 x0 maps I onto R, then X is recurrent in the sense of Harris, and the invariant measure m on .I , B.I // is given by Z x 1 1 1 2b .v/ dv dx , x 2 I . dx D 2 exp m.dx/ D 2 2 .x/ s.x/ .x/ x0 Proof. This is a modification of part (3) of the proof of Proposition 9.12 since I is e D S ı X is an open. For the I -valued semi-martingale X , the transformed process X R-valued diffusion without drift as in step (2) of the proof of Proposition 9.12, thus Harris, with invariant measure m e on .R, B.R//. Defining m from m e by 1 for f : I ! Œ0, 1/ B.I /-measurable m.f / :D m e f ıS we obtain a -finite measure on .I , B.I // such that sets A 2 B.I / with m.A/ > 0 are visited infinitely often by the process X , almost surely, under every choice of a starting value in I . Thus X is Harris, and the ratio limit theorem identifies m as the invariant measure of X . 9.13’ Example. For every choice of a starting p point in I :D .0, 1/, the Cox– Ingersoll–Ross process dX t D .˛ ˇX t / dt C X t d W t with parameters ˇ > 0, > 0 and 2˛ 2 1 almost surely never hits 0 (e.g. Ikeda and Watanabe [61, p. 235– 237]). Proposition 9.13 shows that X is positive Harris recurrent. The invariant prob2ˇ ability m is a Gamma law . 2˛ 2 , 2 /. 9.13” Example. On some ., A, P /, for some parameter # 0 and for deterministic starting point in I :D .0, 1/, consider the diffusion p 1 dX t D # . X t / dt C X t .1 X t / d W t 2
Section 9.4
263
One-dimensional Diffusions
up to the stopping time :D inf¹t > 0 : X t … .0, 1/º. Then the following holds: P . D 1/ D 1 in the case where # 1 , P . < 1/ D 1 in the case where 0 # < 1 . Only in the case where # 1 is the process X a diffusion taking values in I in the sense of Proposition 9.13. For all # 1, X is positive Harris, and the invariant probability on .0, 1/ is the Beta law B.#, #/. Proof. (1) We fix # 0. Writing b.x/ D #. 12 x/ and 2 .x/ D x.1 x/ for 0 < x < 1, we have Z
y x0
2b .v/ dv D # 2
Z
y x0
y.1 y/ 1 2v dv D # ln , v.1 v/ x0 .1 x0 /
0