Functional Estimation for Density, Regression Models and Processes [2 ed.] 9811272832, 9789811272837

Nonparametric kernel estimators apply to the statistical analysis of independent or dependent sequences of random variables.


English Pages 244 [259] Year 2023


Table of contents:
Contents
Preface of the Second Edition
Preface of the First Edition
About the Author
1. Introduction
1.1 Estimation of a density
1.2 Estimation of a regression curve
1.3 Estimation of functionals of processes
1.4 Content of the book
2. Kernel Estimator of a Density
2.1 Introduction
2.2 Risks and optimal bandwidths for the kernel estimator
2.3 Weak convergence
2.4 Estimation of the density on Rd
2.5 Minimax risk
2.6 Histogram estimator
2.7 Estimation of functionals of a density
2.8 Density of absolutely continuous distributions
2.9 Hellinger distance between a density and its estimator
2.10 Estimation of the density under right-censoring
2.11 Estimation of the density of left-censored variables
2.12 Kernel estimator for the density of a process
2.13 Exercises
3. Kernel Estimator of a Regression Function
3.1 Introduction and notation
3.2 Risks and convergence rates for the estimator
3.3 Optimal bandwidths for derivatives
3.4 Weak convergence of the estimator
3.5 Estimation of a regression function on Rd
3.6 Estimation of a regression curve by local polynomials
3.7 Estimation in regression models with functional variance
3.8 Estimation of the mode of a regression function
3.9 Estimation of a regression function under censoring
3.10 Proportional odds model
3.11 Estimation for the regression function of processes
3.12 Exercises
4. Limits for the Varying Bandwidths Estimators
4.1 Introduction
4.2 Estimation of densities
4.3 Estimation of regression functions
4.4 Estimation for processes
4.5 Exercises
5. Nonparametric Estimation of Quantiles
5.1 Introduction
5.2 Asymptotics for the quantile processes
5.3 Bandwidth selection
5.4 Estimation of the conditional density of Y given X
5.5 Estimation of conditional quantiles for processes
5.6 Inverse of a regression function
5.7 Quantile function of right-censored variables
5.8 Conditional quantiles with varying bandwidth
5.9 Exercises
6. Nonparametric Estimation of Intensities for Stochastic Processes
6.1 Introduction
6.2 Risks and convergences for estimators of the intensity
6.2.1 Kernel estimator of the intensity
6.2.2 Histogram estimator of the intensity
6.3 Risks and convergences for kernel estimators (6.4)
6.3.1 Models with nonparametric regression functions
6.3.2 Models with parametric regression functions
6.4 Histograms for intensity and regression functions
6.5 Estimation of the density of duration excess
6.6 Estimators for processes on increasing intervals
6.7 Conditional intensity under left-truncation
6.8 Conditional intensity under left-truncation and right-censoring
6.9 Models with varying intensity or regression coefficients
6.10 Estimation in nonparametric frailty models
6.11 Bivariate hazard functions
6.12 Progressive censoring of a random time sequence
6.13 Model with periodic baseline intensity
6.14 Exercises
7. Estimation in Semi-parametric Regression Models
7.1 Introduction
7.2 Convergence of the estimators
7.3 Nonparametric regression with a change of variables
7.4 Exercises
8. Diffusion Processes
8.1 Introduction
8.2 Kernel estimation of time dependent diffusions
8.3 Auto-regressive diffusions
8.4 Estimation for auto-regressive diffusions by discretization
8.5 Estimation for continuous diffusion processes
8.6 Estimation of a diffusion with stochastic volatility
8.7 Estimation of an auto-regressive spatial diffusions
8.8 Estimation of discretely observed diffusions with jumps
8.9 Continuous estimation for diffusions with jumps
8.10 Transformations of a nonstationary Gaussian process
8.11 Exercises
9. Applications to Time Series
9.1 Nonparametric estimation of the mean
9.2 Periodic models for time series
9.3 Nonparametric estimation of the covariance function
9.4 Nonparametric transformations for stationarity
9.5 Change-points in time series
9.6 Exercises
10. Appendix
10.1 Appendix A
10.2 Appendix B
10.3 Appendix C
10.4 Appendix D
Notations
Bibliography
Index

Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data
Names: Pons, Odile, author.
Title: Functional estimation for density, regression models and processes / Odile Pons, French National Institute for Agronomical Research, France.
Description: Second edition. | New Jersey : World Scientific, [2024] | Includes bibliographical references and index.
Identifiers: LCCN 2023023159 | ISBN 9789811272837 (hardcover) | ISBN 9789811272844 (ebook for institutions) | ISBN 9789811272851 (ebook for individuals)
Subjects: LCSH: Estimation theory. | Nonparametric statistics.
Classification: LCC QA276.8 .P66 2024 | DDC 519.5/44--dc23/eng20230919
LC record available at https://lccn.loc.gov/2023023159

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

Copyright © 2024 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit
https://www.worldscientific.com/worldscibooks/10.1142/13322#t=suppl

Desk Editors: Nimal Koliyat/Lai Fun Kwong
Typeset by Stallion Press
Email: [email protected]
Printed in Singapore

Preface of the Second Edition

The second edition improves several parts of the book with simpler proofs, and it adds optimal Lp risks and minimax properties of the estimators, depending on the order of differentiability of the functions and on the dimension of the space, for every real p ≥ 1, in regression models, in the intensity of point processes, and in the drift and variance of time-dependent or auto-regressive diffusion processes. Several new kernel and histogram-type estimators are defined for the functions defining processes, and their risks and convergences are established: multivariate models, periodic intensities, frailty models, time-dependent diffusions, diffusions with stochastic volatility.

April 7, 2023


Preface of the First Edition

Nonparametric estimators have been intensively used for the statistical analysis of independent or dependent sequences of random variables and for samples of continuous or discrete processes. The optimization of the procedures is based on the choice of a bandwidth that minimizes an estimation error for functionals of their probability distributions. This book presents new mathematical results about statistical methods for the density and regression functions, widely presented in the mathematical literature. There is no doubt that its origin benefits from earlier publications and from other subjects I have worked on in other models for processes. Some questions of great interest for optimizing the methods motivated much work some years ago; they are mentioned in the introduction and they give rise to new developments in this book. The methods are generalized to estimators with kernel sequences varying on the sample space and to adaptive procedures for estimating the optimal local bandwidth of each model. More complex models are defined by several nonparametric functions or by vector parameters and nonparametric functions, such as the models for the intensity of point processes and the single-index regression models. New estimators are defined and their convergence rates are compared.

Odile M.-T. Pons


About the Author

Odile Pons is a retired director of research at the Department of Mathematics of INRA. She obtained her PhD in mathematics and the habilitation in mathematics (probability-statistics) at the University of Paris (France). She mainly published articles in peer-reviewed journals and books in probability and mathematical statistics. Her books published by World Scientific include Functional Estimation for Density, Regression Models and Processes (2011), Statistical Tests of Nonparametric Hypotheses: Asymptotic Theory (2014), Analysis and Differential Equations (2015), Estimations and Tests in Change-Point Models (2018), Orthonormal Series Estimators (2020), Probability and Stochastic Processes: Work Examples (2020), Inequalities in Analysis and Probability (3rd edition, 2022), and Analysis and Differential Equations (2nd edition, 2022).



Chapter 1

Introduction

The aim of this book is to present, within a common approach, estimators for the functions defining probability models: density, intensity of point processes, regression curves and diffusion processes. The observations may be continuous for processes or discretized for samples of densities, regressions and time series, with sequential observations over time. The regular sampling scheme of time series is not common in regression models, where stochastic explanatory variables X are recorded together with a response variable Y according to a random sampling of independent and identically distributed observations (Xi, Yi)i≤n. The discretization of a continuous diffusion process yields a regression model, and the approximation error can be made sufficiently small to extend the estimators of the regression model to the drift and variance functions of a diffusion process. The functions defining the probability models are not specified by parameters and they are estimated in functional spaces. This chapter is a review of well known estimators for density and regression functions and a presentation of models for continuous or discrete processes where nonparametric estimators are defined.

On a probability space (Ω, A, P), let X be a random variable with distribution function F(x) = Pr(X ≤ x) and Lebesgue density f, the derivative of F. The empirical distribution function and the histogram are the simplest estimators of a distribution function and a density, respectively. With a sample (Xi)i≤n of the variable X, the distribution function F(x) is estimated by Fn(x), the proportion of observations smaller than x, which converges uniformly to F in probability and almost surely if and only if F is continuous. A histogram with bandwidth hn consists of a partition of the range of the observations into disjoint subintervals of length hn, where the density is estimated by the proportion of observations Xi in each subinterval, divided by hn.


The bandwidth hn tends to zero as n tends to infinity and $nh_n^2$ tends to infinity, thus the size of the partition tends to infinity with the sample size. For a variable X defined in a metric space (X, A, μ), the histogram is the local nonparametric estimator defined by a set of neighborhoods Vh = {Vh(x), x ∈ X}, with Vh(x) = {s; d(x, s) ≤ h} for the metric d of (X, A, μ)
$$\widehat f_{n,h}(x) = \Big\{ n \int_{V_h(x)} dF_X \Big\}^{-1} \sum_{i=1}^{n} 1_{\{X_i \in V_h(x)\}}. \qquad (1.1)$$
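As a quick illustration of the histogram estimator of (1.1) on the real line, the following sketch estimates the density at a point by the proportion of observations falling in the bin of width h that contains that point, divided by nh. The function name, the simulated sample and the bandwidth are illustrative choices, not taken from the book.

```python
import numpy as np

def histogram_density(x, sample, h):
    """Histogram estimator: proportion of observations in the bin of
    width h containing x, divided by n*h."""
    sample = np.asarray(sample, dtype=float)
    origin = sample.min()
    left = origin + np.floor((x - origin) / h) * h      # left edge of the bin containing x
    count = np.sum((sample >= left) & (sample < left + h))
    return count / (len(sample) * h)

rng = np.random.default_rng(0)
obs = rng.normal(size=500)
print(histogram_density(0.0, obs, h=0.3))   # rough estimate of the N(0,1) density at 0
```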

The empirical distribution function and the histogram are stepwise estimators, and smooth estimators have later been defined for regular functions.

1.1 Estimation of a density

Several kinds of smooth methods have been developed. The first one was the projection of functions onto regular and orthonormal bases of functions $(\phi_k)_{k\ge 0}$. The density of the observations is approximated by a countable projection on the basis, $f_n(x) = \sum_{k\le K_n} a_k \phi_k(x)$, where $K_n$ tends to infinity and the coefficients are defined by the scalar product specific to the orthonormality of the basis, with
$$\int \phi_k^2(x)\,\mu_\phi(x)\,dx = 1, \qquad \int \phi_k(x)\phi_l(x)\,\mu_\phi(x)\,dx = 0 \ \text{ for all } k \ne l,$$
then $a_k = \langle f, \phi_k\rangle = \int f(x)\phi_k(x)\,\mu_\phi(x)\,dx$. The coefficients are estimated by integrating the basis with respect to the empirical distribution of the variable X
$$\widehat a_{kn} = \int \phi_k(x)\,\mu_\phi(x)\,d\widehat F_n(x),$$
which yields an estimator of the density $\widehat f_n(x) = \sum_{k\le K_n} \widehat a_{kn}\,\phi_k(x)$. The same principle applies to other stepwise estimators of functions. Well known bases of $L^2$-orthogonal functions are
(i) Legendre's polynomials (Legendre, French mathematician, 1752–1833), defined on the interval $[-1, 1]$ as solutions of the differential equations

$$(1 - x^2)P_n''(x) - 2x\,P_n'(x) + n(n+1)P_n(x) = 0,$$


with $P_n(1) = 1$. Their solutions have an integral form attributed to Hermite and his student Stieltjes
$$P_n(\cos\theta) = \frac{2}{\pi}\int_0^{\pi} \frac{\sin\{(n+\tfrac12)\phi\}}{\{2\cos\theta - 2\cos\phi\}^{1/2}}\, d\phi.$$
The polynomial $P_n(x)$ has also been expressed as the coefficient of $z^{-(n+1)}$ in the expansion of $(z^2 - 2xz + 1)^{-\frac12}$ by Stieltjes (1890). They are orthogonal with the scalar product
$$\langle f, g\rangle = \int_{-1}^{1} f(x)g(x)\,dx;$$

(ii) Hermite's polynomials (Hermite, French mathematician, 1822–1901), of degree n, defined by the derivatives
$$H_n(x) = (-1)^n e^{\frac{x^2}{2}}\,\frac{d^n}{dx^n}\big(e^{-\frac{x^2}{2}}\big), \qquad n \ge 1;$$
they satisfy the recurrence equation $H_{n+1}(x) = xH_n(x) - H_n'(x)$, with $H_0(x) = 1$. They are orthogonal with the scalar product
$$\langle f, g\rangle = \int_{-\infty}^{+\infty} f(x)g(x)\,e^{-\frac{x^2}{2}}\,dx$$

and their norm is $\|H_n\| = (n!\,\sqrt{2\pi})^{1/2}$;
(iii) Laguerre's polynomials (Laguerre, French mathematician, 1834–1886), defined by the derivatives
$$L_n(x) = \frac{e^{x}}{n!}\,\frac{d^n}{dx^n}\big(e^{-x}x^n\big), \qquad n \ge 1,$$
and $L_0(x) = 1$. They satisfy the recurrence equation $L_{n+1}(x) = (2n+1-x)L_n(x) - n^2 L_{n-1}(x)$ and they are orthogonal with the scalar product
$$\langle f, g\rangle = \int_{0}^{+\infty} f(x)g(x)\,e^{-x}\,dx.$$
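To make the projection estimator defined above concrete, here is a minimal sketch with the orthonormal Legendre basis on [−1, 1]: the coefficients are empirical means of the basis functions, and the density estimate is the truncated series. The function name, the truncation level K and the simulated sample are illustrative assumptions, not taken from the book.

```python
import numpy as np
from numpy.polynomial import legendre

def projection_density(x, sample, K):
    """Projection estimator on [-1, 1] with the orthonormal Legendre basis:
    f_hat(x) = sum_{k<=K} a_hat_k * phi_k(x), with a_hat_k = mean_i phi_k(X_i)."""
    x = np.asarray(x, dtype=float)
    sample = np.asarray(sample, dtype=float)
    estimate = np.zeros_like(x)
    for k in range(K + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0
        norm = np.sqrt((2 * k + 1) / 2.0)                     # makes P_k orthonormal on [-1, 1]
        a_hat = norm * legendre.legval(sample, coef).mean()   # integral of phi_k w.r.t. F_n
        estimate += a_hat * norm * legendre.legval(x, coef)
    return estimate

rng = np.random.default_rng(1)
obs = rng.uniform(-1.0, 1.0, size=2000)                       # true density is 1/2 on [-1, 1]
print(projection_density(np.array([-0.5, 0.0, 0.5]), obs, K=8))
```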

The orthogonal polynomials are normalized by their norm. If the function f is Lipschitz, the polynomial approximations converge to f in $L^2$ and pointwise, and the corresponding projection estimators also converge in $L^2$ and pointwise. Though the bases generate functional spaces of smooth integrable functions, the estimation is parametric. The estimator converges to the approximation function in $L^2$, with the norm $\|\widehat f_n - f_n\|_2 = \{\int_{-\infty}^{+\infty} E(\widehat f_n - f_n)^2(x)\,\mu_\phi(x)\,dx\}^{\frac12}$ tending to zero if $n^{-1}K_n$ tends to zero, so that
$$\|\widehat f_n - f_n\|_2^2 = E\int_{-\infty}^{+\infty} \sum_{k\le K_n} (\widehat a_{kn} - a_k)^2\,\phi_k^2(x)\,\mu_\phi(x)\,dx = \sum_{k\le K_n} E(\widehat a_{kn} - a_k)^2,$$
$$E(\widehat a_{kn} - a_k)^2 = E\Big\{\int \phi_k(x)\,\mu_\phi(x)\,d(\widehat F_n - F)(x)\Big\}^2 = n^{-1}\int\!\!\int \phi_k(x)\phi_k(y)\,\mu_\phi(x)\mu_\phi(y)\,dC(x, y)$$

 φk (x)φk (y)μφ (x)μφ (y) dC(x, y)

where C(x, y) = F (x ∧ y) − F (x)F (y) is the covariance function of the 1 empirical process n 2 (Fn − F ). The convergence rate of the norm the 1 1 density estimator is the sum of the norm fn − fn 2 = O(n− 2 Kn2 ) and ∞ 1 the approximation error fn − f 2 = ( i=Kn +1 a2k ) 2 , it is determined by the convergence rate of the sum of the squared coefficients and therefore by the degree of derivability of the function f . Splines are also bases of functions constrained at fixed points or by a condition of derivability of the function f , with an order of integration for its higher derivative. They have been introduced by Whittaker (1923) and developed by Schoenberg (1964), Wold (1975), Wahba and Wold (1975), De Boor (1978), Wahba (1978), Eubank (1988). They allow the approximation of functions having different degrees of smoothness on different intervals which can be fixed. A comparison between splines and kernel estimators of densities may be found in Silverman (1984) who established an uniform asymptotic bound for the difference between the kernel function and the weight function of cubic 1 splines, with a bandwidth kernel λ− 4 where λ is the smoothing parameters of the splines. Each spline operator corresponds to a kernel operator and the bias and variance of both estimators have the same rate of convergence (Rice and Rosenblatt, 1983, Silverman, 1984). Messer (1991) provides an explicit expression of the kernel corresponding to a cubic sinuso¨ıdal spline, with their rates of convergence. Kernel estimators of densities have first been introduced and studied by Rosenblatt (1956), Whittle (1958), Parzen (1962), Watson and Laedbetter (1963), Bickel and Rosenblatt (1973). Consider a real random variable X defined on (Ω, A, P ) with density fX and distribution function FX . A continuous density fX is estimated by smoothing the empirical distribution

Introduction

5

function FX,n of a sample (Xi )1≤i≤n distributed as X by the means of its convolution with a kernel K, over a bandwidth h = hn tending to zero as n tends to infinity  n x − Xi 1    fX,n,h (x) = Kh (x − s) dFX,n (s) = K , (1.2) nh i=1 h where Kh (x) = h−1 K(h−1 x) is the kernel of bandwidth h. The weighting kernel is a bounded symmetric density satisfying regularity properties and moment conditions. With a p-variate vector X, the kernel may be defined −1 on Rp and Kh (x) = (h1 , . . . , hp )−1 K(h−1 1 x1 , . . . , hp xp ), for p-dimensional vectors x = (x1 , . . . , xp ) and h = (h1 , . . . , hp ). Scott (1992) gives a detailed presentation of the multivariate density estimators with graphical visualizations. Another estimator is based on the topology of the space (X, A, μ), with (1.1) or using a real function K and Kh (x) = h−1 K(h−1 xμ ), h > 0. The regularity of the kernel K entails the continuity of the estimator fX,n,h . All results established for a real valued variable X apply straightforwardly to a variable defined in a metric space. Deheuvels (1977) presented a review of nonparametric methods of estimation for the density and compared the mean squared error of several kernel estimators including the classical polynomial kernels which do not satisfy the above conditions, some of them diverge and their orders differ from those of the density kernels. Classical kernels are the normal density with support R and densities with a compact support such as the Bartlett– Epanechnikov kernel with support [−1, 1], K(u) = 0.75(1 − u2 )1{|u|≤1} , other kernels are presented in Parzen (1962), Prakasa Rao (1983), etc. With a sequence hn converging to zero at a convenient rate, the estimator fX,n,h is biased, with an asymptotically negligible bias depending on the regularity properties of the density. Constants depending on moments of the kernel function also appear in the bias function E fX,n,h − fX and the k of the estimated density. The variance does not depend moments E fX,n,h on the class of the density. The weak and strong uniform consistency of the kernel density estimator and its derivatives were proved by Silverman (1978) under derivability conditions for the density. Their performances are measured by several error criteria corresponding to the estimation of the density at a single point or over its whole support. The mean squared error criterion is common for that purpose and it splits into a variance and

6

Functional Estimation for Density, Regression Models and Processes

the square of a bias term M SE(fX,n,h ; x, h) = E{fX,n,h (x) − fX (x)}2 = E{fX,n,h (x) − E fX,n,h (x)}2 + {E fX,n,h (x) − fX (x)}2 . A global random measure of the distance between the estimator fX,n,h and the density fX is the integrated squared error (ISE) given by   (1.3) ISE(fX,n,h ; h) = {fX,n,h (x) − fX (x)}2 dx. A global error criterion is the mean integrated squared error introduced by Rosenblatt (1956)  M ISE(fX,n,h ; h) = E{ISE(fX,n,h ; h)} = M SE(fX,n,h ; x, h) dx. (1.4) The first order approximations of the MSE and the MISE as the sample size increases are the AMSE and the AMISE. Let (hn )n be a bandwidth sequence converging to zero and and let K  such that nh tends to infinity  be a kernel satisfying m2K = x2 K(x) dx < ∞ and κ2 = K 2 (x) dx < ∞. Consider a variable X such that EX 2 is finite and the density FX is twice continuously differentiable h4 AM SE(fX,n,h ); x = (nh)−1 fX (x)κ2 + m22K f 2 (x). 4 They depend on the bandwidth h of the kernel and the AMSE is minimized at a value  2

K (x) dx  15 . hAMSE (x) = fX (x) nm22K f 2 (x) The global optimum of the AMISE is attained at  2  15

K (x) dx  hAMISE = . 2 2 nm2K f (x) dx 4

Then the optimal AMSE tends to zero with the order n− 5 , it depends on the kernel and on the unknown values at x of the functions fX and f 2 , or their integrals for the integrated error (Silverman, 1986). If the bandwidth has a smaller order, the variance of the estimator is predominant in the expression of the errors and the variations of estimator are larger, if the bandwidth is larger than the optimal value, the bias increases and the variance is reduced. The approximation made by suppressing the higher order terms in the expansions of the bias and the variance of

Introduction

7

the density estimator is obviously another source of error in the choice of the bandwidth, Hall and Marron (1987) proved that hMISE h−1 AMISE tends tends to 1 in probability as n tends to infinity. to 1 and hISE h−1 AMISE Surveys on kernel density estimators and their risk functions were given by Nadaraya (1989), Rosenblatt (1956, 1971), Prakasa Rao (1983), Hall (1984), H¨ardle (1991), Khasminskii (1992). The smoothness conditions for the density are sometimes replaced by Lipschitz or H¨ older conditions and the expansions for the MSE are replaced by expansions for an upper bound. Parzen (1962) also proved the weak convergence of the mode of a kernel density estimator. The derivatives of the density are naturally estimated by those of the kernel estimator and the weak and strong convergence of derivative estimators have been considered by Bhattacharya (1967) and Schuster (1969) among others. The L1 (R) norm of the difference between the kernel estimator and its expectation converges to zero, as a consequence of the properties of the convolution. Devroye (1983) studied the consistency of the L1 -norm fX,n,h − f 1 = |fX,n,h − f | dx, Gin´e, Mason and Zaitsev (2003) established the weak convergence of the process 1 n 2 (fX,n,h − E fX,n,h 1− EfX,n,h − E fX,n,h 1 ) to a normal variable with variance depending on K(u)K(u + t) du. Bounds for minimax estimators have been established by Beran (1972). A minimax property of the kernel 2 estimator with the optimal convergence rate n 5 was proved by Bretagnolle and Huber (1981). Though the estimator of a monotone function is monotone with probability tending to 1 as the number of observations tends to infinity, the number of observations is not always large enough to preserve this property and a monotone kernel estimator is built for monotone density functions by isotonisation of the classical kernel estimator. Monotone estimators for a distribution function and a density have been first defined by Grenander (1956) as the least concave minorant of the empirical distribution function and its derivative. This estimator has been studied by Barlow, Bartholomew, Bremmer and Brunk (1972), Kiefer and Wolfowitz (1976), Groeneboom (1989), Groeneboom and Wellner (1997). The isotonisation of the kernel estimator fn,h for a density function is  v 1 fSI,n,h (x) = inf sup (1.5) fn,h (t) dt v≥x u≤x v − u u  and 1{t≤x} fSI,n,h (t) dt is the greatest convex minorant of the integrated  1 estimator 1{t≤x} fn,h (t) dt. Its convergence rate is n 3 (van der Vaart and van der Laan, 2003). Groeneboom and Wellner studied the weak

8

Functional Estimation for Density, Regression Models and Processes

convergence of local increments of the isotonic estimator of the distribution function. The estimation of a convex decreasing and twice continuously differentiable density on R+ by a piecewise linear estimator with knots between observations points was studied by Groeneboom, Jonkbloed 2 and Wellner (2001), the estimator is n 5 -consistent. Dumbgen and Rufibach (2009) proposed a similar estimator for a log-concave density on R+ and β older densities of established its convergence rate {n(log n}−1 ) 2β+1 , for H¨ Hβ,M . Stone (1974), De Boor (1975), Bowman (1983), Marron (1987) introduced automatic data driven methods for the choice of the global bandwidth. They minimize the risk ISE or the cross 2integrated random n (x) dx−2n−1 i=1 fX,n,h,i (Xi ) where validation criterion CV (h) = fX,n,h fX,n,h,i is the kernel estimator based on the data sample without the i-th observation, or the empirical version of the Kullback–Leibler loss-function K(fX,n,h , f ) = −E log fX,n,h dFX (Bowman, 1983). The CV (h) criterion is an unbiased estimator of the MISE and its minimum is the minimum for the estimated ISE using the empirical distribution function. The global bandwidth estimator  hCV minimizing this estimated criterion achieves the bound for the convergence rate of any optimal bandwidth for the ISE, 1 − 10  ) and  hCV − hMISE has a normal asymptotic hCV h−1 MISE − 1 = Op (n distribution (Hall and Marron, 1987). The cross-validation is more variable with the data and often leads to oversmoothing or undersmoothing (Hall and Marron, 1987, Hall and Johnstone, 1992). As noticed by Hall and Marron (1987), the estimation of the density and the mean squared error are different goals and the best bandwidth for the density might not be the optimal for the MSE, hence the bandwidth minimizing the cross-validation induces variability of the density estimator. Other methods for selecting the bandwidth have been proposed such as higher order kernel estimators of the density (Hall and Marron, 1987, 1990) or bootstrap estimations. An uniform weak convergence of the distribution of fn,h was proved using consecutive approximations of empirical processes by Bickel and Rosenblatt (1973), other approaches for the convergence in distribution rely on the small variations of moments of the sample-paths, as in Billingsley (1968) for continuous processes. The Hellinger distance h(fX,n,h , f ) between a density and its estimator has been studied by van de Geer (1993, 1996), here the weak convergence of the process fX,n,h −fX provides a more precise convergence rate for h(fX,n,h , f ). All result are extended to the limiting marginal density of a continuous process under ergodicity and mixing conditions.
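The cross-validation criterion CV(h) recalled above can be computed directly for a kernel density estimator; the following sketch evaluates it on a grid of bandwidths for a Gaussian kernel and returns the minimizer. The grid, the kernel choice and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def cv_criterion(sample, h):
    """CV(h) = integral of f_hat^2 - (2/n) * sum_i f_hat_{-i}(X_i)."""
    x = np.asarray(sample, dtype=float)
    n = len(x)
    diff = (x[:, None] - x[None, :]) / h
    # For the Gaussian kernel, the integral of f_hat^2 has the closed form
    # (1/(n^2 h)) * sum_{i,j} (K*K)((X_i - X_j)/h), with K*K the N(0, 2) density.
    int_f2 = gaussian_kernel(diff / np.sqrt(2.0)).sum() / (np.sqrt(2.0) * n ** 2 * h)
    kmat = gaussian_kernel(diff)
    loo = (kmat.sum(axis=1) - gaussian_kernel(0.0)) / ((n - 1) * h)  # leave-one-out f_hat_{-i}(X_i)
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(2)
obs = rng.normal(size=300)
grid = np.linspace(0.05, 1.0, 40)
h_cv = grid[np.argmin([cv_criterion(obs, h) for h in grid])]
print(h_cv)
```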

Introduction

9

Uniform strong consistency of the kernel density estimator requires stronger conditions, results can be found in Silverman (1978), Singh (1979), Prakasa Rao (1983), H¨ ardle, Janssen and Serfling (1988) for the strong consistency and, for its conditional mode, Ould Sa¨ıd (1997). The law of the iterated logarithm has been studied by Hall (1981), Stute (1982), Bosq (1998). A periodic density f on an interval [−T, T ] is analyzed in the frequency domain where it is expanded according to the amplitudes and the frequency or period of its components. Let T = 2π/w, the density f is expressed +∞ as the limit of series due to Fourier,4 f (x) = k=−∞ ck eiwkx with coefT ficients ck = T −1 −2T f (x)e−iwkx dx and the Fourier transform of f is 2 T defined on R by F f (s) = T −1 −2T f (x)e−iwsx dx. The inversion formula 2  +∞ of the Fourier transform is f (x) = −∞ F f (w)eiwsx ds. For a nonperiodic density, the Fourier transform and its inverse are defined by  ∞ −1 F f (s) = (2π) f (x)e−isx dx, 

−∞

+∞

f (x) = −∞

F f (w)eisx ds.

The Fourier transform is an isometry as expressed by the equality   |F f (s)|2 ds = |f (s)|2 ds. Let (Xk )k≤n be a stationary time series with expectation zero, the spectral density is defined from the autocorrelation coefficients γk = E(X0 Xk ) +∞ −iwk and the inverse relationship for the autoby S(w) = k=−∞γk e ∞ correlations is γk = −∞ S(w)eiwsx dx. The periodogram of the series is n defined as In (w) = T −1 | k=1 Xk e−2πikw |2 and it is smoothed to yield a regular estimator of the spectral density Sn (s) = Kh (u − s)In (s) ds. Brillinger (1975) established that the optimal convergence rate for the 1 bandwidth is hn = O(n− 5 ) under regularity conditions and he proved the 2 weak convergence of the process n 5 (Sn − S) to the process defined as a transformed Brownian motion. Robinson (1986, 1991) studied the consistency of kernel estimators for auto-regression and density functions and for nonparametric models of time series. Cross-validation for the choice of the bandwidth was also introduced by Wold (1975). For time series, Chiu (1991) proposed a stabilized bandwidth criterion having a relative 1 1 convergence rate n 2 instead of n 10 for the cross-validation in density estimation. It is defined from the Fourier transform dY of the observation 4 French

mathematician (1768–1830).

10

Functional Estimation for Density, Regression Models and Processes

series (Yi )i , using the periodogram of the series IY = d2Y /(2πn) and the Fourier transform Wh (λ) of the kernel. The squared sum of errors is equal  to 2π j IY (λj ){1 − Wh (λj )}2 with λj = 2πj and Wh (λ) = n−1

n 

j exp(−iλj)Kh ( ). n j=1

Multivariate kernel estimators are widely used in the analysis of spatial data, in the comparison and the classification of vectors.

1.2

Estimation of a regression curve

Consider a two-dimensional variable (X, Y ) defined on (Ω, A, P ), with values in R2 . Let fX and fX,Y be the continuous densities of X and, respectively, fXY , and let FX and FXY be their distribution functions. In the nonparametric regression setting, the curve of interest is the relationship between two variables, Y a response variable for a predictor X. A continuous curve is estimated by the means of a kernel estimator smoothing the observations of Y for observations of X in the neighborhood of the predictor value. The conditional expectation of Y given X = x is the nonparametric regression function defined for every x inside the support of X by  fX,Y (x, y) dy, m(x) = E(Y |X = x) = y fX (x) it is continuous when the density fX,Y is continuous with respect to its first component. It defines regression models for Y with fixed or varying noises according to the model for its variance Y = m(X) + σε,

(1.6)

where E(ε|X) = 0 and V ar(ε|X) = 1 in a model with a constant variance V ar(Y |X) = σ 2 , or Y = m(X) + σ(X)ε,

(1.7) 2

with a varying conditional variance V ar(Y |X) = σ (X). The regression function of (1.6) is estimated by the integral of Y with respect to a smoothed empirical distribution function of Y given X = x  Kh (x − s)FXY,n (ds, dy) m  n,h (x) = y fX,n,h (x) n Yi Kh (x − Xi ) = i=1 . n i=1 Kh (x − Xi )

11

Introduction

This estimator has been introduced by Watson (1964) and Nadaraya (1964) and detailed presentations can bee found in the monographs by Eubank (1977), Nadaraya (1989) and H¨ardle (1990). The performance of the kernel estimator for the regression curve m is measured by error criteria corresponding to the estimation of the curve at a single point or over its whole support, like for the kernel estimator of a continuous density. A global random measure of the distance between the estimator m  n,h and the regression function m is the integrated squared error (ISE)   n,h (x) − m(x)}2 dx, (1.8) ISE(m  n,h ; h) = {m its convergence was studied by Hall (1984), H¨ardle (1990). The mean squared error criterion develops as the sum of the variance and the squared bias of the estimator  n,h (x) − m(x)}2 M SE(m  n,h ; x, h) = E{m = E{m  n,h (x) − E m  n,h (x)}2 + {E m  n,h (x) − m(x)}2 . A global mean squared error is the mean integrated squared error  M ISE(m  n,h ; h) = E{ISE(m  n,h ; h)} = M SE(m  n,h ; x, h) dx.

(1.9)

Assuming that the curve is twice continuously differentiable, the mean squared error is approximated by the asymptotic MSE (Chapter 3) −1 AM SE(m  n,h ; x) = (nh)−1 κ2 fX (x) V ar(Y |X = x)

2 4 h 2 −1 (2) + m2K fX (x) μ(2) (x) − m(x)fX (x) . 4 1 The AMSE is minimized at a value hm,AMSE which is still of order n− 5 and depends on the value at x of the functions defining the model and their second order derivatives. Automatic optimal bandwidth selection by cross-validation was developed by H¨ardle, Hall and Marron (1988) similarly to the density. Bootstrap methods were also widely studied. Splines were generalized to nonparametric regression by Wahba and Wold (1975), Silverman (1985) for cubic splines and the automatic choice of the degree of smoothing is also determined by cross-validation. In model (1.7) with a random conditional variance V ar(Y |X) = σ 2 (X), the estimator of the regression curve m has to be modified and it is defined as a weighted kernel estimator with weighting function w(x) = σ −1 (x) n w(Xi )Yi Kh (x − Xi ) m  w,n,h (x) = i=1 , n i=1 w(Xi )Kh (x − Xi )

12

Functional Estimation for Density, Regression Models and Processes

or more general function w. In Chapter 3, a kernel estimator of σ −1 (x) is introduced. The bias and variance of the estimator m  w,n,h are developed by the same expansions as the estimator (1.8). The convergence rate of the kernel estimator for σ 2 (x) is nonparametric and its bias depends on the bandwidths used in its definition, on V ar{(Y − m(x))2 |X = x}, on the functions fX , σ 2 , m and their derivatives. Results about the almost sure convergence and the L2 -errors of kernel estimators, their optimal convergence rates and the optimal bandwidth selection were introduced in Hall (1984), Nadaraya (1964). Properties similar to those of the density are developed here with sequences of bandwidths converging with specified rates. The methods for estimating a density and a regression curve by the means of kernel smoothing have been extensively presented in monographs by Nadaraya (1989), H¨ardle (1990, 1992) Wand and Jones (1995), Simonoff (1996), Bowman and Azalini (1997), among others. In this book, the properties of the estimators are extended with exact expansions, as for density estimation, and to variable bandwidth sequences (hn (x))n≥1 converging with a specified rate. Several monotone kernel estimators for a regression function m have been considered, they are built by kernel smoothing after an isotonisation of the data sample, or by an isotonisation of the classical kernel estimator. The isotonisation of the data consists in a transformation of the observation (Yi )i in a monotone set (Yi∗ )i . It is defined by Yi∗ = min max v≥i u≤i

v 1  Yi , v − u j=u

  and i≤k Yi∗ is the greatest convex minorant of i≤k Yi . The kernel estimator for the regression function built with the isotonic sample (Xi∗ , Yi∗ )i is denoted m  IS,n,h . The convergence rate of the isotonic estimator for a mono1 1  IS,n,h − mIS,n,h )(x) tone regression function is n 3 and the variable n− 3 (m converges to a Gaussian process for every x in IX . The isotonisation of the kernel estimator m  n,h for a regression function is 1 u≤x v − u



v

m  SI,n,h (x) = inf sup v≥x

m  n,h (t) dt

(1.10)

u

 and 1{t≤x} m  SI,n,h (t) dt is the greatest convex minorant of the process  1  n,h (t) dt. Its convergence rate is again n 3 (van der Vaart and van 1{t≤x} m der Laan, 2003). Meyer and Woodroof (2000) generalized the constraints

Introduction

13

to larger classes and proved that the variance of the maximum likelihood 1 estimator of a monotone regression has the optimal convergence rate n 3 . In the regression models (1.6) or (1.7) with a multidimensional regression vector X, a multidimensional regression function m(X) can be replaced by a semi-parametric single-index model m(x) = g(θT x), where θT denotes the transpose of a vector θ, or by a more general transformation model g ◦ ϕθ (X) with unknown function m and parameter θ. In the single-index model, several estimators for the regression function m(x) have been defined (Ihimura, 1993, H¨ ardle, Hall and Ihimura, 1993, Hristache, Juditski and Spokony, 2001, Delecroix, H¨ ardle and Hristache, 2003), the estimators of the function g and the parameter θ are iteratively calculated from approximations. The inverse of the distribution function FX of a variable X, or quantile function, is defined on [0, 1] by −1 (t) = inf{x ∈ IX : FX (x) ≥ t}, Q(t) = FX it is right-continuous with left-hand limits, like the distribution function. −1 (U ) has the distribution function FX and, For every uniform variable U , FX if F is continuous, then F (X) has an uniform distribution function. The −1 ◦ FX (x) = x for every x in inverse of the distribution function satisfies FX −1 the support of X and FX ◦FX = id for every continuity point x of FX . The weak convergence of the empirical uniform process and its functionals have been widely studied (Shorack and Wellner, 1986, van der Vaart and Well1 ner, 1996). For a differentiable functional ψ(FX ), n 2 {ψ(FX,n ) − ψ(FX )} converges weakly to (ψ  B) ◦ FX where B is a Brownian motion, limiting 1 distribution of the empirical process n 2 (FX,n − FX ). It follows that the 1 −1 −1 − FX ) converges weakly to B ◦ FX (fX ◦ FX )−1 . Kiefer process n 2 (FX,n (1972) established a law of iterated logarithms for quantiles of probabilities tending to zero, the same result holds for 1 − pn as pn tend to one. The results were extended to conditional distribution functions and Sheather and Marron (1990) considered kernel quantile estimators. The inverse function for a nonparametric regression curve determines thresholds for X given Y values, it is related to the distribution function of Y conditionally on X. The inverse empirical process for a monotone nonparametric regression function has been studied in Pin¸con and Pons (2006) and Pons (2008), the main results are presented and generalized in Chapter 5. The  Y,n,h of the conditional  X,n,h and Q behavior of the threshold estimators Q distribution is studied, with their bias and variance and the mean squared errors which determine the optimal bandwidths specific to the quantile processes.

14

Functional Estimation for Density, Regression Models and Processes

The Bahadur representation for the quantile estimators is an expansion t − FX,n −1 −1 −1 FX,n (t) = FX (t) + ◦ FX (t) + Rn (t), fX

t ∈ [0, 1],

where the main is a sum of independent and identically distributed ran1 dom variables and the remainder term Rn (t) is a op (n− 2 ) (Ghosh, 1971), Bahadur (1966) studied its a.s. convergence. Lo and Singh (1986), Gijbels and Veraverbeke (1988, 1989) extended this approach by differentiation to the Kaplan–Meier estimator of the distribution function of independent and identically distributed right-censored variables.

1.3

Estimation of functionals of processes

Watson and Laedbetter (1964) introduced smooth estimators for the hazard function of a point process. The functional intensity λ(t) of an inhomogeneous Poisson point process N is defined by λ(t) = lim δ −1 P {N (t + δ) − N (t− ) = 1 | N (t−)}, δ→0

it is estimated using a kernel smoothing, from the sample-path of the point process observed on an interval [0, T ]. Let Y (t) = N (T ) − N (t), then  T h (t) = λ Kh (t − s)1{Y (s)>0} Y −1 (s) dN (s). (1.11) 0

For a sample of a time variable T with distribution function F , let F¯ be the survival function of the variable T , F¯ = 1 − F − , the hazard function λ is now defined as λ(t) = f (t)F¯ −1 (t). The probability of excess is Pt (t + x) = Pr(T > t + x | T > t) = 1 −  t+x

= exp − λ(s) ds .

F (t + x) − F (t) F¯ (t)

t

The product-limit estimator has been defined for the estimation of the distribution function of a time variable under an independent rightcensorship by Kaplan and Meier (1957). Breslow and Crowley (1974) 1 studied the asymptotic behavior of the process Bn = n 2 (Fn − F ), they proved its weak convergence to a Gaussian process B with independent increments, expectation zero and a finite variance on every compact subinterval of [0, Tn:n ], where Tn:n = maxi≤n Ti . The weak convergence of

15

Introduction 1

n 2 (F¯n − F¯ ) has been extended by Gill (1983) to the interval [0, Tn:n ] using its expressions as a martingale up to the stopping time Tn:n . Let τ τF = sup{t; F (t) < 1}, for t < τF and if 0 F F¯ −1 dΛ < ∞, we have  n (t) = Λ Fn (t) = F − Fn (t) = 1−F



t∧Tn:n

dFn (s) , 1 − Fn− (s)

t∧Tn:n

¯ (s) dΛ  n (s), F n

t∧Tn:n

1 − Fn (s− )  {dΛn (s) − dΛ(s)} 1 − F (s)

0



0



0

as a consequence, the process n 2 (F − Fn )F¯ −1 converges weakly on [0, τF [ to a centered  tGaussian process BF , with independent increments and variance vF¯ (t) = 0 {(1 − F )−1 F¯ }2 dvΛ , where vΛ is the asymptotic variance of the 1  n − Λ). process n 2 (Λ The definition of the intensity is generalized to point processes having a random intensity. For a multiplicative intensity λY , with a predictable process Y , the hazard function λ has the estimator (1.11). For a random n time sample (Ti )i≤n , the counting process N (t) = i=1 1{Ti ≤t} has a multiplicative intensity λY defined by the process 1

Y (t) =

n 

1{Ti ≥t} .

i=1

Under a right-censorship of a time variable T by an independent variable C, only the censored time variable X = T ∧ C and the indicator δ of the event {T ≤ C} are observed. The counting processes for a n-sample of (X, δ) are N (t) =

 i

1{Ti ≤t∧Ci } ,

Y (t) =



1{Xi ≥t} .

i

Martingale techniques are used to expand the estimation errors, providing optimal convergence rates according to the regularity conditions for the hazard function (Pons, 1986) and the weak convergences with fixed or variable bandwidths (Chapter 6). Regression models for the intensity

16

Functional Estimation for Density, Regression Models and Processes

are classical, there have generally the form λ(t; β) = λ(t)rβ (Z(t)) with a regressor process (Z(t))t≥0 and a parametric regression function such as rβ (Z(t)) = r(β T Z(t)), with an exponential function r in the Cox model (1972). The classical estimators of the Cox model rely on the estimation t of the cumulated hazard function Λ(t) = 0 λ(s) ds by the stepwise pro n (t; β) at fixed β and the parameter β of the exponential regression cess Λ T function rZ (t; β) = eβ Z(t) is estimated by maximization of an expression  n (β) at Ti similar to the likelihood where λ is replaced by the jump of Λ (Cox, 1972). The asymptotic properties of the estimators for the cumulated hazard function and the parameters of the Cox model were established by Andersen and Gill (1982), among others. The estimators presented in this chapter are obtained by minimization of partial likelihoods based on kernel estimators of the baseline hazard function λ defined for each model and on histogram estimators. In the multiplicative intensity model, the kernel estimator of λ satisfies the same minimax property as the kernel estimator of a density (Pons, 1986) and this property is still satisfied in the multiplicative regression models of the intensity. The comparison of histogram-type and kernel estimators is extended to the new estimators defined for hazard functions in this book. For a spatial stationary process N on Rd , the k-th moment measures defined for k ≥ 2 and for every continuous and bounded function g on (Rd )k by  νk (g) = E g(x1 , . . . , xk )N (dx1 ) · · · N (dxk ) (Rd )k

have been intensively studied and they are estimated by empirical moments from observations on a subset G of Rd . The centered moments are immedik ately obtained from the mean measure m and μk = i=1 (−1)i Cki mi νk−i . The stationarity of the process implies that the k-th moment of N is expressed as the expectation of an integral of a translation of its (k − 1)-th moment  g(x1 − xk , . . . , xk−1 − xk , 0)N (dx1 ) · · · N (dxk ) νk (g) = E (Rd )k

which develops in the form    νk (g) = E gk−1 ◦ Tx (x1 , . . . , xk−1 )N (dx1 ) · · · N (dxk−1 ) N (dx) Rd

 =E

Rd

(Rd )k−1

νk−1 (gk−1 ◦ Tx ) N (dx),

17

Introduction

where gk−1 (x1 , . . . , xk−1 ) = g(x1 , . . . , xk−1 , 0). Let λ be the Lebesgue measure on Rd and Tx be the translation operator of x in Rd , then the moment estimators are built iteratively by the relationship  νk−1,Gk−1 (gk−1 ◦ Ty )dN (y). νk,G (g) = {λ(G)}−1 G k

The estimator is consistent and its convergence rate is {λ(G)} 2 . The stationarity of the process and a mixing condition imply that for every function k νk,G (g) − νk (g)) converges weakly to g of C b ((Rd )k ), the variable {λ(G)} 2 ( a normal variable with variance ν2k (g). The density of the k-th moment measures are defined as the derivatives of νk with respect to the Lebesgue measure on Rd and they are estimated by smoothing the empirical estimator νk,G using a kernel Kh on Rd and, by iterations, on Rkd . The convergence kd k of the kernel estimator is then h 2 {λ(G)} 2 , as a consequence of the k d-dimensional smoothing. Consider an auto-regressive diffusion model with nonparametric drift function α and variance function, or diffusion, β dXt = α(Xt )dt + β(Xt )dBt , t ≥ 0

(1.12)

where B is the standard Brownian motion. The drift and variance are expressed as limits of variations of X α(Xt ) = lim h−1 E{(Xt+h − Xt ) | Xt }, h→0

β(Xt ) = lim h−1 E{(Xt+h − Xt )2 | Xt }. h→0

The process X can be approximated by nonparametric regression models with regular or variable discrete sampling schemes of the sample-path of the process X. The diffusion equation uniquely defines a continuous prot cess (Xt )t>0 . Assuming that E exp{− 21 0 β 2 (Bs ) ds} is finite, the Girsanov theorem formulates the density of the process X. Parametric diffusion models have been much studied and estimators of the parameters are defined by maximum likelihood from observations at regularly spaced discretization points or at random stopping times. In a discretization scheme with a constant interval of length Δn between observations, nonparametric estimators are defined like with samples of variables in nonparametric regression models (Pons, 2008). Let (Xti , Yi )i≤1 be discrete observations with Yi = Xti+1 − Xti defined by Equation (1.12), the functions α and β 2 are

18

Functional Estimation for Density, Regression Models and Processes

estimated by n Y K (x − Xti ) i=1  n i hn α n (x) = , Δn i=1 Khn (x − Xti ) n Z 2 K (x − Xti ) i=1  n i hn βn2 (x) = , Δn i=1 Khn (x − Xti ) n (Xti ) is the variable of the centered variations for the where Zi = Yi − Δn α diffusion process. The variance of the variable Yi conditionally on Xti varies with Xti and weighted estimators are also defined here. Varying sampling intervals or random sampling schemes modify the estimators. Functional models of diffusions with discontinuities were also considered in Pons (2008) where the jump size was assumed to be a squared integrable function of the process X and a nonparametric estimator of this function was defined. Here the estimators of the discretized process are compared to those built with the continuously observed diffusion process X defined by (1.12), on an increasing time interval [0, T ]. The kernel bandwidth hT tends to zero as T tends to infinity with the same rate as hn . In Chapter 8, the MISE of each estimator and its optimal bandwidth are determined. The estimators are compared with those defined for the continuously observed diffusion processes. Nonparametric time and space transformations of a Gaussian process have been first studied by Perrin (1999), Guyon and Perrin (2000) who estimated the function Φ of nonstationary processes Z = X ◦ Φ, with X a stationary Gaussian process, Φ a monotone continuously differentiable function defined in [0, 1] or in [0, 1]3 . The covariance of the process Z is r(x, y) = R(Φ(x) − Φ(y)) where R is the stationary covariance of X, which implies R(u) = r(0, Φ−1 (u)) and R(−u) = R(u), with a singularity at zero. The singularity function of Z is the difference ξ(x) of the left and right of r(x, x), which implies Φ(x) = v −1 (1)v(x) where v(x) equals derivatives x ξ(u) du. The estimators are based on the covariances of the process 0 Z are built with its quadratic variations. For the time transformation, the [nx] estimator of Φ(x) is defined by linearisation of Vn (x) = k=1 (ΔZk )2 where the variables Zk = Z(n−1 k) − Z(n−1 (k − 1)) are centered and independent vn (x) = Vn (x) + (nx − [nx])(ΔZ[nx]+1 )2 , x ∈ [0, 1], vn (1) = Vn (1),  Φn (x) = vn−1 (1)vn (x),

(1.13)

 n −Φ is uniformly consistent and n 2 (Φ  n −Φ) is asymptotically the process Φ 3 Gaussian. The method was extended to [0, 1] . The diffusion processes 1

Introduction

19

cannot be reduced to the same model but the method for estimating its variance function relies on similar properties of Gaussian processes. In time series analysis, the models are usually defined by scalar parameters and a wide range of parametric models for stationary series have been intensively studied since many years. Nonparametric spectral densities of the parametric models have been estimated by smoothing the periodogram calculated from T discrete observations of stationary and mixing series (Wold, 1975; Brillinger, 1981; Robinson, 1986; Herrmann et al., 1992; Pons, 2008). The spectral density is supposed to be twice continuously differentiable and the bias, variance and moments of its kernel estimator have been expanded like those of a probability density. It converges weakly with 2 the rate T 5 to a Gaussian process, as a consequence of the weak convergence of the empirical periodogram.

1.4

Content of the book

In the first chapters, the classical estimators for samples of independent and identically distributed variables are presented, with approximations of their bias, variance and Lp -moments, as the sample size n tends to infinity and the bandwidth to zero. In each model, the weak convergence of the whole processes are considered and the limiting distributions are not centered for the optimal bandwidth minimizing the mean integrated squared error. Chapters 2 and 3 focus on the density and the regression models, respectively. In models with a constant variance, the regression estimator defined as a ratio of kernel estimators is approximated by a weighted sum of two kernel estimators and its properties are easily deduced. In models with a functional variance, a kernel estimator of the variance is also considered and the estimator of the regression function is modified by an empirical weight. The properties of the modified estimator are detailed. The Lp risk of the estimators, for every p in [1, ∞[, and their minimax property are proved. The estimators for independent and identically distributed variables are extended to a stationary continuous process (Xt )t≥0 continuously observed on an increasing time interval, for the estimation of the ergodic density of the process. The observations at times s and t are dependent so the methods for independent observations do not adapt immediately. The estimators are defined with the conditions necessary for their convergences and their approximation properties are proved. The optimal bandwidth minimizing


the mean squared error is a functional sequence of bandwidths, and for this reason the properties of the kernel estimators are extended to varying bandwidths in Chapter 4. The estimators of derivatives of the density, of the regression function and of the other functions are expressed by means of derivatives of the kernel, so that their convergence rate is modified, the $k$-th derivative of $K_h$ being normalized by $h^{-(k+1)}$ instead of $h^{-1}$ for $K_h$. Functionals of the densities and of the functions in the other models are considered, and the asymptotic properties of their estimators are deduced from those of the kernel estimators. The inverse function defined for the increasing distribution function $F$ is generalized in Chapter 5 to conditional distribution functions and to monotone regression functions. The bias, variance, norms, optimal bandwidths and weak convergences of the quantiles of their kernel estimators are established with detailed proofs. Exact Bahadur-type representations are written, with $L^2$ approximations. Chapter 6 provides new kernel estimators in nonparametric models for real point processes which generalize the martingale estimators of the baseline hazard functions already studied. They are compared to new histogram-type estimators built for these functional models. The probability density of excess duration for a point process and its estimator are defined, and the properties of the estimator are also studied. The second edition adds the $L^p$ risk for the estimator of the baseline hazard function and its minimax property, as well as estimators for periodic baseline intensities, for multivariate hazard functions and for a nonparametric frailty model, with their asymptotic properties. The single-index models are nonparametric regression models for linear combinations of the regression variables. The estimators of the parameter vector $\theta$ and of the nonparametric regression function $g$ of the model are proved to be consistent for independent and identically distributed variables. New estimators of $g$ and $\theta$ are considered in Chapter 7, with direct estimation methods, without numerical iteration procedures. The convergence rate of the estimator $\widehat\theta_{n,h}$ obtained by minimizing the empirical mean squared estimation error $V_n$ is $(nh^3)^{1/2}$. The estimator $\widehat m_{n,h}$ built with this estimator of $\theta$ has the same convergence rate, which is not as small as that of the nonparametric regression estimator with a $d$-dimensional regression variable. A differential empirical squared error criterion provides an estimator of the parameter which converges more quickly, and the


estimator of the regression function $m$ has the usual nonparametric convergence rate $(nh)^{1/2}$. More generally, the linear combination of the regressors can be replaced by a parametric change of variable, in a regression model $Y = g\circ\varphi_\theta(X) + \varepsilon$. Replacing the function $g$ by a kernel estimator at fixed $\theta$, the parameter is then estimated by minimizing an empirical version of the error $V(\theta) = \{Y - \widehat g_{n,h}\circ\varphi_\theta(X)\}^2$. Its asymptotic properties are similar to those of the single-index model estimators. The optimal bandwidths are made explicit. The estimators of the drift and variance of continuous auto-regressive diffusion processes depend on the sampling scheme for their discretization, and they are compared to the estimators built from the whole sample-path of the diffusion process. New results are presented in Chapter 8 and they are extended to the sum of a diffusion process and a jump process governed by the diffusion. The second edition adds estimators for time dependent diffusions, for multivariate diffusions and for diffusions with stochastic volatility, with their asymptotic properties. The $L^p$ risk of the estimators is established and their minimax property is proved. For nonstationary Gaussian models, a kernel estimator is defined for the singularity function of the covariance of the process. In Chapter 9, classical estimators of covariances and nonparametric regression functions used for stationary time series are generalized to nonstationary models. The expansions of the bias, variance and $L^p$-errors are detailed and optimal bandwidths are defined. Nonparametric estimators are defined for the stationarization of time series and for their mean function in auto-regressive models, based on the results of the previous chapters.


Chapter 2

Kernel Estimator of a Density

2.1

Introduction

Let $f$ be the continuous probability density of a real variable $X$ defined on a probability space $(\Omega, \mathcal A, P)$ and let $F$ be its distribution function. Let $I_X$ be the finite or infinite support of the density $f$ of $X$ with respect to the Lebesgue measure and $I_{X,h} = \{s \in I_X;\, [s-h, s+h] \subset I_X\}$. For a sample $(X_i)_{1\le i\le n}$ distributed as $X$ and a kernel $K$, estimators of $F$ and $f$ are defined on $\Omega\times\mathbb R$ as the empirical distribution function
$$\widehat F_{X,n}(x) = n^{-1}\sum_{i=1}^n 1_{\{X_i\le x\}}, \quad x \in I_X,$$
and the kernel estimator, defined for every $x$ in $I_{X,h}$ as
$$\widehat f_{X,n,h}(x) = \int K_h(x-s)\,d\widehat F_{X,n}(s) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i),$$
where $K_h(x) = h^{-1}K(h^{-1}x)$, $h = h_n$ tends to zero as $n$ tends to infinity and $1_A$ is the indicator of a set $A$. The empirical probability measure is $\widehat P_{X,n,h}(A) = n^{-1}\sum_{i=1}^n \delta_{X_i}(A)$, with $\delta_{X_i}(A) = 1_{\{X_i\in A\}}$. Let
$$f_{n,h}(x) = E\widehat f_{n,h}(x) = \int K_h(x-s)\,dF(s);$$
the bias of the kernel estimator $\widehat f_{n,h}(x)$ is
$$b_{n,h}(x) = f_{n,h}(x) - f(x) = \int K(t)\{f(x+ht) - f(x)\}\,dt. \quad (2.1)$$
The $L^p$-risk of the kernel estimator of the density $f$ of $X$ is its $L^p$-norm
$$\|\widehat f_{n,h}(x) - f(x)\|_p = \{E|\widehat f_{n,h}(x) - f(x)|^p\}^{1/p} \quad (2.2)$$


and it is bounded by the sum of a $p$-moment and a bias term. For every $x$ in $I_{X,h}$, the pointwise and uniform convergence of the kernel estimator $\widehat f_{n,h}$ are established under the following conditions on the kernel and the density.

Condition 2.1.
(1) $K$ is a symmetric bounded density such that $|x|^2K(x) \to 0$ as $|x|$ tends to infinity, or $K$ has a compact support with value zero on its frontier;
(2) the density function $f$ is bounded and belongs to the class $C^2(I_X)$ of twice continuously differentiable functions defined in $I_X$;
(3) the kernel function satisfies integrability conditions: the moments $m_{2K} = \int u^2K(u)\,du$, $\kappa_\alpha = \int K^\alpha(u)\,du$ for $\alpha \ge 0$, and $\int |K'(u)|^\alpha du$ for $\alpha = 1, 2$, are finite. As $n\to\infty$, $h_n \to 0$ and $nh_n \to \infty$;
(4) $nh_n^5$ converges to a finite limit $\gamma$.

The next conditions are stronger than Conditions 2.1(2)–(4), with higher degrees of differentiability and integrability.

Condition 2.2.
(1) The density function $f$ is $C^s(I_X)$, with a continuous and bounded derivative of order $s$, $f^{(s)}$, on $I_X$.
(2) As $n\to\infty$, $h_n \to 0$, $nh_n \to \infty$ and $nh_n^{2s+1}$ converges to a finite limit $\gamma > 0$. The kernel function satisfies $m_{jK} = \int u^jK(u)\,du = 0$ for $j < s$, and $m_{sK}$ and $\int |K(u)|^\alpha du$ are finite for $\alpha \le s$.

The conditions may be strengthened to allow a faster rate of convergence of the bandwidth to zero, by replacing the strictly positive limit of $nh_n^{2s+1}$ by $nh_n^{2s+1} = o(1)$. That question is crucial for the relative importance of the bias and the variance in the $L^2$-risk of $\widehat f_{n,h} - f$. The choice of the optimal bandwidth minimizing that risk corresponds to an equal rate for the squared bias and the variance, and implies the rates of Conditions 2.1(4) or 2.2(2), according to the differentiability of the density. Considering the normalized estimator, the reduction of the bias requires a faster convergence rate.
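Before turning to the risks, a short numerical illustration of the estimator may be useful. The following sketch is added for illustration only; the function names, the Gaussian kernel and all numerical values are arbitrary choices and not part of the text.

```python
import numpy as np

def kernel_density(x_grid, sample, h):
    """Kernel estimator f_hat(x) = (nh)^{-1} sum_i K((x - X_i)/h) with a Gaussian kernel."""
    u = (x_grid[:, None] - sample[None, :]) / h          # (n_grid, n) scaled differences
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)     # K(u) evaluated at each difference
    return k.mean(axis=1) / h                            # average of K_h(x - X_i) over the sample

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=500)                        # X_1, ..., X_n i.i.d. N(0, 1)
    grid = np.linspace(-3.0, 3.0, 61)
    h = 0.3                                              # bandwidth of order n^{-1/5}
    f_hat = kernel_density(grid, sample, h)
    print(f_hat[len(grid) // 2])                         # estimate near x = 0, close to 1/sqrt(2*pi)
```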

2.2 Risks and optimal bandwidths for the kernel estimator

Proposition 2.1. Under Condition 2.1(1), for a continuous density $f$, the estimator $\widehat f_{n,h}(x)$ converges in probability to $f(x)$, for every $x$ in $I_{X,h}$. Moreover, $\sup_{x\in I_{X,h}} |\widehat f_{n,h}(x) - f(x)|$ tends a.s. to zero as $n$ tends to infinity if and only if $f$ is uniformly continuous.

Proof. The first assertion is a consequence of an integration by parts
$$\sup_{x\in I_{X,h}} |\widehat f_{n,h}(x) - f_{n,h}(x)| \le \sup_{x\in I_{X,h}} \int |\widehat F_n(y) - F(y)|\,\frac{1}{h}\Big|dK\Big(\frac{x-y}{h}\Big)\Big| \le \frac{1}{h}\sup_y |\widehat F_n(y) - F(y)|\int |dK|.$$
The Dvoretzky, Kiefer and Wolfowitz (1956) exponential bound implies that for every $\lambda > 0$, $\Pr(\sup_{I_X} n^{1/2}|\widehat F_n - F| > \lambda) \le 58\exp\{-2\lambda^2\}$, then
$$\Pr\Big(\sup_{I_{X,h}} |\widehat f_{n,h} - f_{n,h}| > \varepsilon\Big) \le \Pr\Big(\sup_{I_X} |\widehat F_n - F| > \Big(\int|dK|\Big)^{-1} h_n\varepsilon\Big) \le 58\exp\{-\alpha nh_n^2\}$$
with $\alpha > 0$, and $\sum_{n=1}^{\infty}\exp\{-\alpha nh_n^2\}$ is finite under Conditions 2.1 or 2.2.

Proposition 2.2. Assume $h_n \to 0$ and $nh_n \to \infty$.
(a) Under Condition 2.1, the bias of $\widehat f_{n,h}(x)$ is
$$b_{n,h}(x) = \frac{h^2}{2}m_{2K}f^{(2)}(x) + o(h^2),$$
denoted $h^2 b_f(x) + o(h^2)$; its variance is
$$Var\{\widehat f_{n,h}(x)\} = (nh)^{-1}\kappa_2 f(x) + o((nh)^{-1}), \quad (2.3)$$
also denoted $(nh)^{-1}\sigma_f^2(x) + o((nh)^{-1})$, where all approximations are uniform. Let $K$ have the compact support $[-1,1]$; the covariance of $\widehat f_{n,h}(x)$ and $\widehat f_{n,h}(y)$ is zero if $|x-y| > 2h$, otherwise it is approximated by
$$(nh)^{-1}\,\frac{f(x)+f(y)}{2}\int K(v-\alpha_h)K(v+\alpha_h)\,dv,$$
where $\alpha_h = |x-y|/(2h)$ and $\delta_{x,y}$ denotes the indicator of $\{x = y\}$.
(b) Under Condition 2.2, for every $s\ge 2$, the bias of $\widehat f_{n,h}(x)$ is
$$b_{n,h}(x; s) = \frac{h^s}{s!}m_{sK}f^{(s)}(x) + o(h^s),$$
and $\|\widehat f_{n,h}(x) - f_{n,h}(x)\|_p = O((nh)^{-1/2})$ for every integer $p\ge 2$, where the approximations are uniform.


Proof. The bias as $h$ tends to zero is obtained from a second order expansion of $f(x+ht)$ under Condition 2.1, and from its $s$-order expansion under Condition 2.2. The variance of $\widehat f_{n,h}(x)$ is
$$Var\{\widehat f_{n,h}(x)\} = n^{-1}\Big\{\int K_h^2(x-s)f(s)\,ds - f_{n,h}^2(x)\Big\}.$$
The first term of the sum is $n^{-1}\int K_h^2(x-u)f(u)\,du = (nh)^{-1}\kappa_2 f(x) + o((nh)^{-1})$; the second term, $n^{-1}f^2(x) + O(n^{-1}h)$, is smaller. The covariance of $\widehat f_{n,h}(x)$ and $\widehat f_{n,h}(y)$ is written $n^{-1}\{\int_{I_X^2} K_h(u-x)K_h(u-y)f(u)\,du - f_{n,h}(x)f_{n,h}(y)\}$; its first term is zero if $|x-y| > 2h$. Otherwise let $\alpha_h = |x-y|/(2h) < 1$; changing the variables as $h^{-1}(x-u) = v-\alpha_h$ and $h^{-1}(y-u) = v+\alpha_h$, with $v = h^{-1}\{(x+y)/2 - u\}$, the covariance develops as
$$Cov\{\widehat f_{n,h}(x), \widehat f_{n,h}(y)\} = (nh)^{-1} f\Big(\frac{x+y}{2}\Big)\int K(v-\alpha_h)K(v+\alpha_h)\,dv + o((nh)^{-1}).$$
If $|x-y| \le 2h$, $f((x+y)/2) = f(x) + o(1) = f(y) + o(1)$, and the covariance is approximated by
$$(nh)^{-1}\,\frac{f(x)+f(y)}{2}\,1_{\{0\le\alpha_h<1\}}\int K(v-\alpha_h)K(v+\alpha_h)\,dv.$$
[…] $> 0$, then there exists a constant $C$ such that for every $x$ and $y$ in $I_{X,h}$,
$$|\widehat f^{(k)}_{n,h}(x) - \widehat f^{(k)}_{n,h}(y)| \le C_\alpha h^{-(k+1)}|x-y|.$$
The integral $\theta_k = \int f^{(k)2}(x)\,dx$ of the squared $k$-th derivative of the density is estimated by
$$\widehat\theta_{k,n,h} = \int \widehat f^{(k)2}_{n,h}(x)\,dx; \quad (2.5)$$
the variance $E(\widehat\theta_{k,n,h} - \theta_k)^2$ has the same order as the MISE of the estimator $\widehat f^{(k)}_{n,h}$ of $f^{(k)}$, hence $\widehat\theta_{k,n,h}$ converges to $\theta_k$ with the rate $O(n^{1/2}h^{k+1/2})$ and the estimator of the parameter $\theta_k$ does not achieve the parametric rate of convergence $n^{1/2}$. The $L^p$-risk of the estimator of the density decreases as $s$ increases and, for $p \ge 2$, a bound of the $L^p$-norm is
$$\|\widehat f_{n,h}(x) - f(x)\|_p^p \le 2^{p-1}\Big[\frac{h^{ps}}{(s!)^p}\{m_{sK}^p f^{(s)p}(x) + o(1)\} + (nh)^{-1}\{g_p(x) + o(1)\}\Big],$$
where $g_p(x) = \sum_{k=2}^{[p/2]}\cdots$
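The approximations of Proposition 2.2 can be checked by simulation. The sketch below is illustrative only; all names and numerical choices are assumptions of the example. It compares the empirical bias and variance of $\widehat f_{n,h}(x)$ with $\frac{h^2}{2}m_{2K}f^{(2)}(x)$ and $(nh)^{-1}\kappa_2 f(x)$ for a Gaussian kernel, for which $m_{2K} = 1$ and $\kappa_2 = 1/(2\sqrt\pi)$, and standard normal data.

```python
import numpy as np

def kde_at_point(x, sample, h):
    u = (x - sample) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).mean() / h

def check_bias_variance(x=0.0, n=500, h=0.25, replications=2000, seed=0):
    rng = np.random.default_rng(seed)
    estimates = np.array([kde_at_point(x, rng.normal(size=n), h) for _ in range(replications)])
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)     # f(x) for the standard normal density
    f2 = (x ** 2 - 1.0) * phi                              # second derivative f''(x)
    bias_theory = 0.5 * h ** 2 * f2                        # (h^2/2) m_2K f''(x), with m_2K = 1
    var_theory = phi / (2.0 * np.sqrt(np.pi) * n * h)      # (nh)^{-1} kappa_2 f(x)
    print("bias:", estimates.mean() - phi, "approx.", bias_theory)
    print("variance:", estimates.var(), "approx.", var_theory)

if __name__ == "__main__":
    check_bias_variance()
```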


Proposition 2.4. Assume $f$ is bounded and belongs to a Hölder class $H_{\alpha,M}$; then the bias of $\widehat f_{n,h}$ is bounded by $M m_{[\alpha]K} h^{[\alpha]}/([\alpha]!) + o(h^{[\alpha]})$, the optimal bandwidth is a $O(n^{-1/(2[\alpha]+1)})$ and the MISE at the optimal bandwidth is a $O(n^{-[\alpha]/(2[\alpha]+1)})$.

2.3 Weak convergence

The $L^p$-norms of the variations of the process $\widehat f_{n,h} - f_{n,h}$ are bounded by the same arguments as the bias and the variance. Assume that $K$ has the support $[-1,1]$.

Lemma 2.2. Under Conditions 2.1 and 2.2, there exists a constant $C$ such that for every $x$ and $y$ in $I_{X,h}$ satisfying $|x-y| \le 2h$,
$$E\{\widehat f_{n,h}(x) - \widehat f_{n,h}(y)\}^2 \le C(nh^3)^{-1}|x-y|^2.$$

Proof. Let $x$ and $y$ be in $I_{X,h}$; the variance of $\widehat f_{n,h}(x) - \widehat f_{n,h}(y)$ develops according to their variances given by Proposition 2.2 and the covariance between both terms, which has the same bound by the Cauchy–Schwarz inequality. The second order moment $E|\widehat f_{n,h}(x) - \widehat f_{n,h}(y)|^2$ develops as the sum $n^{-1}\int\{K_h(x-u) - K_h(y-u)\}^2 f(u)\,du + (1-n^{-1})\{f_{n,h}(x) - f_{n,h}(y)\}^2$. For an approximation of the integral $I_2(x,y) = \int\{K_h(x-u) - K_h(y-u)\}^2 f(u)\,du$, the Mean Value Theorem implies $K_h(x-u) - K_h(y-u) = (x-y)\varphi_n(z-u)$, where $\varphi_n(x) = K_h^{(1)}(x)$ and $z$ is between $x$ and $y$; then $\int\{K_h(x-u) - K_h(y-u)\}^2 f(u)\,du$ is approximated by
$$(x-y)^2\int\varphi_n^2(z-u)f(u)\,du = (x-y)^2 h_n^{-3}\Big\{f(x)\int K^{(1)2} + o(h)\Big\}.$$
Since $h^{-1}|x|$ and $h^{-1}|y|$ are bounded by 1, the order of the second moment of $\widehat f_{n,h}(x) - \widehat f_{n,h}(y)$ is a $O((x-y)^2(nh^3)^{-1})$ if $|x-y| \le 2h$, and the covariance is zero otherwise.

Theorem 2.1. Under Conditions 2.1 and 2.2, for a bounded density $f$ of class $C^s(I_X)$ and with $nh^{2s+1}$ converging to a constant $\gamma$, the process
$$U_{n,h} = (nh)^{1/2}\{\widehat f_{n,h} - f\}\,1_{\{I_{X,h}\}}$$
converges weakly to $W_f + \gamma^{1/2}b_f$, where $W_f$ is a continuous Gaussian process on $I_X$ with expectation zero and covariance $E\{W_f(x)W_f(x')\} = \delta_{x,x'}\sigma_f^2(x)$ at $x$ and $x'$.


Proof. The finite dimensional distributions of the process Un,h converge 1 weakly to those of Wf + γ 2 bf , as a consequence of Proposition 2.2. The covariance of Wf at x and x is Cf,n (x, x ) = limn nhCov{fn,h (x), fn,h (x )}, and Proposition 2.2 implies that Un,h (x) and Un,h (x ) are asymptotically independent as n tends to infinity. 1 If the support of X is bounded, let a = inf IX , η > 0 and c > γ 2 |bf (a)|+ 1 (2η −1 σf2 (a)) 2 , then 1 1 Pr{|Un,h (a)| > c} ≤ Pr{(nh) 2 |(fn,h − fn,h )(a)| + (nh) 2 |bn,h (a)| > c} 1 V ar{(nh) 2 (fn,h − fn,h )(a)} , ≤ 1 {c − (nh) 2 |bn,h (a)|}2

so that for $n$ sufficiently large
$$\Pr\{|U_{n,h}(a)| > c\} \le \frac{\sigma_f^2(a)}{\{c - \gamma^{1/2}|b_f(a)|\}^2} + o(1) < \eta,$$

the process $U_{n,h}(a)$ is therefore tight. Lemma 2.2 and the bound $\{f_{n,h}(x) - f(x) - f_{n,h}(y) + f(y)\}^2 \le |f(x)-f(y)|^2 + [\int K(z)\{f(x+hz) - f(y+hz)\}\,dz]^2 \le 2|x-y|^2\|f^{(1)}\|_\infty^2$ imply that the expectation of the squared variations of the process $U_{n,h}$ is $O(h^{-2}|x-y|^2)$ for $|x-y| \le 2h < 1$; otherwise the estimators $\widehat f_{n,h}(x)$ and $\widehat f_{n,h}(y)$ are independent. Billingsley's Theorem 3 implies the tightness of the process $U_{n,h}1_{[-h,h]}$ and the convergence is extended to any compact subinterval of the support. With an unbounded support for $X$ such that $E|X| < \infty$, for every $\eta > 0$ there exists $A$ such that $P(|X| > A) \le \eta$, therefore $P(|U_{n,h}(A+1)| > 0) \le \eta$ and the same result still holds on $[-A-1, A+1]$ instead of the support of the process $U_{n,h}$.

Corollary 2.1. The process
$$\sup_{x\in I_{X,h}} \sigma_f^{-1}(x)|U_{n,h}(x) - \gamma^{1/2}b_f(x)|$$
converges weakly to $\sup_{I_X}|W_1|$, where $W_1$ is the Gaussian process with expectation zero, variance 1 and covariances zero. For every $\eta > 0$, there exists a constant $c_\eta > 0$ such that
$$\Pr\Big\{\sup_{I_{X,h}} |\sigma_f^{-1}(U_{n,h} - \gamma^{1/2}b_f) - W_1| > c_\eta\Big\}$$
tends to zero as $n$ tends to infinity.


Lemma 2.2 concerning second moments does not depend on the smoothness of the density and it is not modified by the condition of a Hölder class instead of a class $C^s$. The variations of the bias are now bounded by $\{f_{n,h}(x) - f(x) - f_{n,h}(y) + f(y)\}^2 \le 2M|x-y|^{2\alpha}$ and the expectation of the squared variations of the process $U_{n,h}$ is $O(h^{-2}|x-y|^2)$ for $|x-y| \le 2h < 1$. The weak convergence of Theorem 2.1 is therefore fulfilled for every $\alpha > 1$. With the optimal bandwidth for the global MISE error
$$h_{AMISE} = \Big\{\frac{\kappa_2}{n\,m_{2K}^2\int f^{(2)2}(x)\,dx}\Big\}^{1/5},$$
the limit $\gamma$ of $nh_n^5$ is $\kappa_2 m_{2K}^{-2}\{\int f^{(2)2}(x)\,dx\}^{-1}$. The integral of the squared second derivative, $\int f^{(2)2}(x)\,dx$, and the bias term $b_f = \frac{1}{2}m_{2K}f^{(2)}$ are estimated using the second derivative of the estimator of $f$. Furthermore, the variance $\sigma_f^2 = \kappa_2 f$ is immediately estimated. More simply, the asymptotic criterion is written
$$AMISE_n(h) = \int\{h^4 b_f^2(x) + (nh)^{-1}\sigma_f^2(x)\}\,f^{-1}(x)\,dF(x)$$
and it is estimated by the empirical mean
$$n^{-1}\sum_{i=1}^n \{h^4 b_f^2(X_i) + (nh)^{-1}\sigma_f^2(X_i)\}\,f^{-1}(X_i).$$
This empirical error is estimated by
$$\widehat{AMISE}_n(h) = n^{-1}\sum_{i=1}^n \{h^4\,\widehat b^2_{f,n,h_2}(X_i) + (nh)^{-1}\widehat\sigma^2_{f,n,h_2}(X_i)\}\,\widehat f^{-1}_{h_2}(X_i),$$
with another bandwidth $h_2$ converging to zero. The global bandwidth $h_{AMISE}$ is then estimated at the value that achieves the minimum of $\widehat{AMISE}_n(h)$, i.e.
$$\widehat h_n = \Big\{\frac{4n\sum_{i=1}^n \widehat b^2_{f,n,h_2}(X_i)\,\widehat f^{-1}_{h_2}(X_i)}{\sum_{i=1}^n \widehat\sigma^2_{f,n,h_2}(X_i)\,\widehat f^{-1}_{h_2}(X_i)}\Big\}^{-1/5}.$$
Bootstrap estimators for the bias and the variance provide another estimation of $MISE_n(h)$ and $h_{AMISE}$. These consistent estimators are then used for centering and normalizing the process $\widehat f_{n,h} - f$ and provide an estimated process
$$\widehat U_n = (n\widehat h_n)^{1/2}\,\widehat\sigma^{-1}_{f,n,\widehat h_n}\{\widehat f_{n,\widehat h_n} - f - \widehat\gamma_{n,\widehat h_n}\,\widehat b_{f,n,\widehat h_n}\}\,1_{\{I_{X,\widehat h_n}\}}.$$
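A direct numerical analogue of the plug-in rule is easy to write. The sketch below is an illustration under simplifying assumptions, not the procedure of the text: it uses a Gaussian kernel, a pilot bandwidth $h_2$ chosen by hand, and a Riemann sum on a grid instead of the empirical weighting by $\widehat f^{-1}$.

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def gauss_dd(u):
    # second derivative of the standard Gaussian kernel
    return (u ** 2 - 1.0) * np.exp(-0.5 * u ** 2) / SQRT2PI

def plugin_bandwidth(sample, pilot_h):
    """AMISE plug-in bandwidth for a Gaussian kernel: kappa_2 = 1/(2 sqrt(pi)), m_2K = 1."""
    n = sample.size
    grid = np.linspace(sample.min(), sample.max(), 400)
    u = (grid[:, None] - sample[None, :]) / pilot_h
    f_dd = gauss_dd(u).sum(axis=1) / (n * pilot_h ** 3)        # kernel estimate of f''
    integral_fdd2 = np.sum(f_dd ** 2) * (grid[1] - grid[0])    # plug-in for integral of f''^2
    kappa2 = 1.0 / (2.0 * np.sqrt(np.pi))
    return (kappa2 / (n * integral_fdd2)) ** 0.2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    print(plugin_bandwidth(x, pilot_h=0.5))   # for N(0,1) data, roughly 1.06 * n^{-1/5} ~ 0.27
```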


A uniform confidence interval of level $\alpha$ for the density $f$ is deduced from Corollary 2.1, using a quantile of $\sup_{I_X}|W_1|$.

Theorem 2.2. Under Conditions 2.1 and 2.2, for a density $f$ of class $C^s(I_X)$ and with $nh^{2s+2k+1}$ converging to a constant $\gamma$, the process
$$U^{(k)}_{n,h} = (nh^{2k+1})^{1/2}\{\widehat f^{(k)}_{n,h} - f^{(k)}\}\,1_{\{I_{X,h}\}}$$
converges weakly to a Gaussian process $W_{f,k} + \gamma^{1/2}b_{f,k}$, where $W_{f,k}$ is a continuous Gaussian process on $I_X$ with expectation and covariances zero.

2.4 Estimation of the density on $\mathbb R^d$

Let $X$ be a multidimensional variable with density $f$ defined in a subset $I_X$ of $\mathbb R^d$; its density $f$ is estimated by smoothing its distribution function
$$\widehat F_n(x) = n^{-1}\sum_{i=1}^n 1_{\{X_{i1}\le x_1,\ldots,X_{id}\le x_d\}}, \quad x = (x_1, \ldots, x_d),$$
by a multivariate kernel $K$ defined on $[-1,1]^d$, with $K_h(x) = h^{-d}K(h^{-1}x)$ for a single bandwidth, or $K_h(x) = \prod_{j=1}^d h_j^{-1}K_j(h_j^{-1}x_j)$ for a vector bandwidth, such that $f$ and the $K_j$ satisfy Conditions 2.1 or 2.2, for $j = 1, \ldots, d$. A kernel estimator of the density $f$ is defined for every $x$ in $I_{X,h}\subset I_X$ as
$$\widehat f_{X,n,h}(x) = \frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K_{j,h}(x_j - X_{ij}) := \frac{1}{n}\sum_{i=1}^n K_h(x - X_i);$$
it has the expectation $f_{n,h}(x) = \int K_h(x-s)f(s)\,ds$ and the bias
$$b_{n,h}(x) = f_{n,h}(x) - f(x) = \int K(t)\{f(x+ht) - f(x)\}\,dt.$$
Proposition 2.2 extends to the multidimensional case, with the notations $m_{2Kd} = \prod_{j=1}^d m_{2K_j}$ and $\kappa_{2d} = \prod_{j=1}^d \kappa_{2,j}$. The notation $f^{(s)}$ stands for the sum of the derivatives of order $s$ of the density, and the rates of their moments depend on the dimension $d$. Let $h_k = h$.

Proposition 2.5. Under Condition 2.1(1), with hn tending to zero and nhdn to infinity as n tends to infinity (a) the estimator fX,n,h (x) is a.s. uniformly consistent, (b) the bias of fn,h (x) is bn,h (x) =

h2 m2Kd f (2) (x) + o(h2 ), 2


where f (2) (x) = ance is

d

∂2f j,j  =1 ∂xj ∂xj

(x), it is denoted h2 bf (x) + o(h2 ). Its vari-

V ar{fn,h (x)} = (nhd )−1 κ2d f (x) + o((nhd )−1 ), it is denoted (nhd )−1 σf2 (x) + o((nhd )−1 ), and the covariance of fn,h (x) and fn,h (y) is a o((nhd )−1 ). (c) Under Condition 2.2, the bias of fn,h (x) is bn,h (x; s) =

hs msKd f (s) (x) + o(hs ), s!

its variance is V ar{fn,h (x)} = (nhd )−1 κ2d f (x) + o((nhd )−1 ) and the covariance of fn,h (x) and fn,h (y) is a o((nhd )−1 ). For a density f of C s (IX ), the asymptotic mean squared error of fn,h (x) is AM SE(fn,h ; x) = (nhd )−1 κ2d f (x) +

1 m2 h2s f (s)2 (x), (s!)2 sKd

it is minimum for the bandwidth function 1

2s+d 1 κ2d f (x) hn (x) = n− 2s+d d(s!)2 . 2sm2sKd f (s)2 (x)

Theorem 2.3. Under Conditions 2.1 and 2.2, for a bounded density f of class C s (IX ) and as nh2s+d converges to a constant γ, the process 1 Un,h = (nhd ) 2 {fn,h − f }I{IX,h } 1

converges weakly to Wf + γ 2 bf , where Wf is a continuous Gaussian process on IX with expectation zero and covariance E{Wf (x)Wf (x )} = δx,x σf2 (x), at x and x . (k) Proposition 2.6. Under Conditions 2.1 and 2.2, the estimator fn,h of the k-order derivative of a density in class C s has a bias O(hs ) and a 1 variance O((nh2k+d )−1 ), its optimal bandwidth is a O(n− 2k+2s+d ) and the s corresponding L2 -risks are O(n− 2k+2s+d ).

35

Kernel Estimator of a Density

2.5

Minimax risk

Consider a class F of densities and a risk R(f, fn ) for the estimation of a density f of F by an estimator fn belonging to a space F. A minimax estimator fn∗ is defined as a minimizer of the maximal risk over F f∗ = arg inf sup R(f, fn ). n

 f ∈F fn ∈F

For a density of F = C (R), s ≥ 2, and with an optimal bandwidth related to the Lp -risk 

p1 p   Rp (fn , f ) = E |fn (t) − f (t)| dt s

for an integer p ≥ 2, the kernel estimator provides a Lp -risk of order n− 2s+1 . Bretagnolle and Huber (1979) and Huber (1991) proved that that this is the minimax risk order for p = 2, then for a real p ≥ 1, in a space F determined by the regularity of the kernel, so the kernel estimator is a minimax estimator. Ibragimov and Has’minskii (1981) extended this result to densities of C s (Rd ). The proof by Huber (1991) relies on a lower bound of the risk according to the Hellinger distance on hypercube F0 of cardinality 2N of the set F ,  generated by ε = {−1, +1}N which implies h2 (fε , fε ) = δ i |εi − εi | and  Rp (fε , fε ) = Δ i |εi − εi |, moreover (2.6) inf sup Rp (f, fn ) ≥ cN Δ exp(−8δ). s

 f ∈F fn ∈F

In C s (R), the optimal parameters are δ = O(nu2 N −1 ) and Δ = O(N −1 u2 ) 1 which yields a lower bound as u = O(N −s ) and N = O(n 2s+1 ). Proposition 2.7. Under Conditions 2.1 and 2.2, and for every real p > 0, an upper bound of the Lp -risk for the kernel estimator is s 1 with hn = O(n− 2s+1 ). Rp (fn , f ) = O(n− 2s+1 ), For every f in C 2 (R) we have, be convexity  p    p  p−1 Rp (fn , f ) ≤ 2 Kh (t − s)f (s) ds − f (t) dt  IX,h IX,h  p     +E Kh (t − s) {dFn (s) − dF (s)} dt , 

Proof.

IX,h

IX,h

where the first term is approximated by  hsp mpsK R1p = |f (s) (t)|p dt + o(hsp ). (k!)p IX,h

36

Functional Estimation for Density, Regression Models and Processes

The sum of the quadratic variations of the process SnF (t) = Fn (t) − F (t) is the increasing process [S]nF (t) = n

−2

n 

{1{Xi ≤t} − F (t)}2 ,

i=1 1 and for every t, SnF (t) − with expectation μ(t) = n F (t){1 − F (t)} ≤ 2n [S]nF (t) is a martingale, then by the Burkholder–Davis–Gundy inequality there exists a constant Cp such that p   p2      Kh (t − s) {dFn (s) − dF (s) ≤ Cp Kh2 (t − s) d[S]nF (s) .    IX,h IX,h −1

If 0 ≤ p < 2, by concavity the last integral In,h,p (t) has the bound   p2   E In,h,p (t) dt ≤ Kh2 (t − s) dμn (s) dt IX,h

IX,h



h

−1

IX,h

p2 κ2 sup μn (t) ≤ t∈IX

p

κ22 p p , (2n) 2 h 2

= O(n− 2s+1 ) with hn = O(n− 2s+1 ). then Rpp (fn , f ) ≤ ahsp + b(nh) Let 2 ≤ p < ∞ with integer part [p], the consistency of the kernel estimator implies  p     Kh (t − s) {dFn (s) − dF (s)} dt E  −p 2

IX,h

 ≤E as hn = O(n so  E

IX,h

1 − 2s+1

IX,h

  

IX,h

  

IX,h

[p]+1   Kh (t − s) {dFn (s) − dF (s)} dt ,

), the last term is an O((nh)−

IX,h

1

sp

[p]+1 2

) by Proposition 2.2,

p  p  Kh (t − s) {dFn (s) − dF (s)} dt = O((nh)− 2 ).



The minimax bound and Proposition 2.7 extend to densities on Rd . Proposition 2.8. A lower bound of the Lp -risk for the estimation of a density of Fs,p (Rd ) = {f ∈ C s (Rd ); f (s) ∈ Lp (Rd )} in a subset Fn of Fs,p (Rd ), for a real p > 1 and an integer s ≥ 2, is inf

sup

n f ∈Fs,p (Rd ) fn ∈F

s Rp (fn , f ) = O(n− 2s+d ).


Proposition 2.9. Under Conditions 2.1 and 2.2, and for every real p > 0, the kernel estimator of a density on Rd has the Lp -risk Rp (fn , f ) = O(n− 2s+d ), s

1

with hn = O(n− 2s+d ).

Let F = {f ∈ C s (Rd ); f (s) p ≤ M }, for a constant M , Propositions 2.7 and 2.9 provide an upper bound supf ∈F Rp (f, fn ), and its order for the kernel estimator of the densities of F . The kernel estimator of a density is therefore a minimax estimator under Conditions 2.1 and 2.2.

2.6

Histogram estimator

The histogram is the oldest unsmoothed nonparametric estimator of the density. It is defined as the empirical distribution of the observations cumulated on small intervals of equal length $h_n$, divided by $h_n$, with $h_n$ converging to zero and $nh_n$ to infinity as $n$ tends to infinity. Let $(B_{jh})_{j=1,\ldots,J_{X,h}}$ be a partition of $I_X$ into subintervals of length $h$ centered at $a_{jh}$, so the value of the density is approximated by $f(a_{jh})$ on $B_{jh}$. The kernel corresponding to the histogram is $K_h(x) = h^{-1}\sum_{j\in J_{X,h}} 1_{B_{jh}}(x)$, and the histogram is therefore defined as
$$\widehat f_{n,h}(x) = hK_h(x)\int K_h(s)\,d\widehat F_n(s) = \frac{1}{nh}\sum_{j\in J_{X,h}} 1_{B_{jh}}(x)\sum_{i=1}^n 1_{\{X_i\in B_{jh}\}}.$$

Let $p_{jh} = P(X_i\in B_{jh}) = F(a_{jh} + \frac{h}{2}) - F(a_{jh} - \frac{h}{2})$; it is asymptotically equivalent to $hf(a_{jh})$ as $h$ tends to zero. The expectation of the histogram is $f_{n,h}(x) = h^{-1}\sum_{j\in J_{X,h}} p_{jh}1_{B_{jh}}(x)$, and it is asymptotically equivalent to $\sum_{j\in J_{X,h}} f(a_{jh})1_{B_{jh}}(x)$. The bias of the histogram on $B_{jh}$ is
$$b_{f,h}(x) = f(a_{jh}) - f(x) = \frac{h}{2}f^{(1)}(a_{jh}) + o(h);$$
it is larger than the bias of kernel estimators, and its variance on $B_{jh}$ is
$$v_{f,h}(x) = \frac{1}{nh^2}\,p_{jh}(1-p_{jh}) = \frac{1}{nh}f(a_{jh}) + o((nh)^{-1}),$$


so the variance of the histogram is $v_{f,h}(x) = (nh)^{-1}v_f(x) + o((nh)^{-1})$ with
$$v_f(x) = \sum_{j\in J_{X,h}} 1_{B_{jh}}(x)f(a_{jh}),$$
due to the zero covariance between the empirical distributions on $B_{jh}$ and $B_{j'h}$ for $j \ne j'$. Let $b_f(x) = \sum_{j\in J_{X,h}} 1_{B_{jh}}(x)f^{(1)}(a_{jh})$.

Proposition 2.10. The normalized histogram $(nh)^{1/2}(\widehat f_{n,h} - f - hb_f)(x)$ converges weakly to $v_f^{1/2}(x)\,N(0,1)$, and $(nh_n)^{1/2}(\widehat f_{n,h_n} - f)$ is asymptotically unbiased with a bandwidth $h_n = o(n^{-1/3})$.
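A direct implementation of the histogram estimator is elementary; the sketch below is illustrative only (the bin origin, the simulated sample and the bandwidth are arbitrary choices).

```python
import numpy as np

def histogram_density(x_grid, sample, h, origin=0.0):
    """Histogram estimator: relative frequency of the bin containing x, divided by the bin width h."""
    bins_obs = np.floor((sample - origin) / h).astype(int)   # bin index of each observation
    bins_x = np.floor((x_grid - origin) / h).astype(int)     # bin index of each evaluation point
    counts = {}
    for b in bins_obs:
        counts[b] = counts.get(b, 0) + 1
    n = sample.size
    return np.array([counts.get(b, 0) / (n * h) for b in bins_x])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x = rng.exponential(size=2000)
    h = 2000 ** (-1.0 / 3.0)            # bandwidth of order n^{-1/3}, as suggested by the MSE expansion
    grid = np.linspace(0.1, 3.0, 30)
    print(histogram_density(grid, x, h)[:5])   # should be close to exp(-x) on the grid
```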

Increasing the order of $h_n$ reduces the variance of the histogram and increases its bias. The asymptotic mean squared error of the histogram is minimal for the bandwidth
$$h_n(x) = n^{-1/3}\{2b_f^2(x)\}^{-1/3}v_f^{1/3}(x) = \{2nf^{(1)2}(x)f^{-1}(x)\}^{-1/3},$$
and it is then approximated by
$$MSE_{opt}(x) = (3\times 2^{-2/3})\,n^{-2/3}\{v_f(x)b_f(x)\}^{2/3}.$$
These expressions do not depend on the degree of differentiability of the density. The optimal bandwidth, the bias $b_f(x)$ and the variance $v_f(x)$ of the histogram are estimated by plugging the estimators of the density and its derivative into their formulae. The $L^p$ moments of the histogram are determined by the higher order term in the expansion of $\|\widehat f_{n,h}(x) - f_{n,h}(x)\|_p^p$,

they are O((nh) p −1 ) for every integer p ≥ 2. The derivatives of fn,h (x) are defined by differences of its values in Bjh (1) i.e. fn,h (x) = h−1 {f (aj+1,h ) − f (aj,h )} + o(1) is estimated in Bjh by (1) fn,h (x) = h−1 {fn,h (aj+1,h ) − fn,h (aj,h )}

and the derivatives of higher order are defined in the same way. The bias (1) of fn,h is a O(1) and its variance is a O((nh2 )−1 ). 2.7

Estimation of functionals of a density

The estimation of the integral of a squared density   2 θ = f (x) dx = f (x) dF (x)


has been considered by many authors and several estimators have been proposed. The plug-in kernel density estimator  2 Kh (Xi − Xj ) θn,h = n(n − 1) 1≤i 14 and hn = O(n− 4α+1 ), the estimators have 1 variable, they are asymptotically the convergence rate n 2 to a Gaussian  1 dx} converges weakly to a centered equivalent and n 2 {θn,h − θ¯n,h − f 2 (x)   Gaussian variable with variance 4{ f 3 − ( f 2 )2 }. If α < 14 , the best 4α convergence rate of θn,h is n 4α+1 . The integrals of the squared k-th derivatives of the density  θk = f (k)2 (x) dx, are also estimated by the integral of the square of the kernel estimator for the derivative of the density   2 (k) (k) ¯ θn,h,k = Kh (x − Xi )Kh (x − Xj ) dx, n(n − 1) 1≤i 1, that is n 9 for a density in C 3 (IX ). For a smaller band1 f,n,h converges fastly, with the rate n(1− r3 )/2 . width of order n− r , r > 2, M f,n,h ) is deduced from If the density belongs to C 3 (IX ), the bias of f (1) (M (1) (1) the bias of the process (fn,h − f ) and it equals

2

2

f,n,h ) = − h m2K f (3) (M f,n,h )+o(h2 ) = − h m2K f (3) (Mf )+o(h2 ), Ef (1) (M 2 2 it does not depend on the degree of derivability of the density f . The support of a density f can be estimated from its graph defined as Gf = {(x, y); y = f (x), x ∈ IX }. For a continuous function f defined on an open interval IX with compact closure, Gf is an open set of R2 with compact closure. This closed set defines the support of the function f . For every y such that (x, y) belongs to a closed subset A of Gf , there exist x in a closed subinterval of IX such that y = f (x). The graph of a sum of two densities f1 and f2 is the union of their graphs G1 ∪ G2 and by difference G1 = G1 ∪ G2 − G2 \ G1 , with G2 \ G1 = {(x, y); y = f2 (x) = f1 (x), x ∈ IX }. Let Gf,n,h = {(x, y); y = fn,h (x), x ∈ IX } be the graph of the kernel estimator of an absolutely continuous density f on IX , then Gfn,h = Gf ∪ Gfn,h −f = Gf + Gfn,h −f \ Gf hence Gfn,h −f = Gf,n,h − Gf and it converges a.s. to zero as n tends to infinity. The support of the density f is consistently estimated by Gf,n,h and the extrema of the density are consistently estimated by those of the estimated graph.
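The plug-in estimator $\widehat\theta_{n,h} = \frac{2}{n(n-1)}\sum_{i<j}K_h(X_i - X_j)$ of $\theta = \int f^2$ discussed above can be computed directly. The sketch below is illustrative only (Gaussian kernel, arbitrary bandwidth and simulated sample).

```python
import numpy as np

def theta_pairwise(sample, h):
    """Estimator of the integral of f^2: 2/(n(n-1)) * sum_{i<j} K_h(X_i - X_j), Gaussian kernel."""
    n = sample.size
    diff = (sample[:, None] - sample[None, :]) / h
    k = np.exp(-0.5 * diff ** 2) / (np.sqrt(2.0 * np.pi) * h)   # K_h evaluated at all pairs
    off_diagonal_sum = k.sum() - np.trace(k)                    # drop the i = j terms
    return off_diagonal_sum / (n * (n - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    x = rng.normal(size=1500)
    # for the standard normal density, the integral of f^2 equals 1/(2*sqrt(pi)) ~ 0.282
    print(theta_pairwise(x, h=0.4))
```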

2.8 Density of absolutely continuous distributions

Let F0 be a distribution function in a functional space F and Fϕ0 be a distribution function absolutely continuous with respect to F0 , having a density ϕ0 with respect to F0 . The function ϕ0 belongs to a nonparametric space of continuous functions Φ and the distribution function Fϕ0 belongs to the ∞ nonparametric model PF ,Φ = {(F, ϕ); F ∈ F , ϕ ∈ Φ, 0 ϕ dF = 1}. The observations are two subsamples X1 , . . . , Xn1 with distribution function F0 and Xn1 +1 , . . . , Xn with distribution function Fϕ0 . The approach extends straightforwardly to a population stratified in K subpopulations. Estimation of the distributions of stratified populations has already been studied, in particular by Anderson (1979) with a specific parametric form for ϕθ , by Gill, Vardi  · and Wellner (1988) in biased sampling models with group distributions 0 wk dF , where the weight functions are known, by Gilbert (2000) in biased sampling models with parametric weight functions, by Cheng and Chu (2004), with the Lebesgue measure and kernel density estimators. The density with respect to the Lebesgue measure of a distribution function F in F is denoted f and the distributions of both samples are supposed to have the same support. Let n2 = n − n1 increasing with n, such that limn n−1 n1 = π in ]0, 1[, and let ρ be the sample indicator defined by ρ = 1 for individuals of the first sample and ρ = 0 for individuals of the second sample. Let F1 = πF0 and F2 = (1 − π)Fϕ0 be the subdistribution functions of the two subsamples, they are estimated by the corresponding empirical subdistribution functions n1 n   F1,n = n−1 ρi 1{Xi ≤t} = n−1 1{Xi1 ≤t} , i=1

$$\widehat F_{2,n}(t) = n^{-1}\sum_{i=1}^n (1-\rho_i)\,1_{\{X_i\le t\}},$$

and $\widehat\pi_n = n^{-1}n_1$. Their densities with respect to the Lebesgue measure are denoted $f_1$ and $f_2$, and the density of the second sample with respect to the distribution of the first one is $\varphi = \pi(1-\pi)^{-1}f_1^{-1}f_2$. The densities $f_1$ and $f_2$ are estimated by smoothing $\widehat F_{1,n}$ and $\widehat F_{2,n}$, then $f_0$, $f_\varphi$ and $\varphi$ are estimated by
$$\widehat f_{0,n,h}(t) = \widehat\pi_n^{-1}\int K_h(t-s)\,d\widehat F_{1,n}(s), \qquad
\widehat f_{n,h}(t) = (1-\widehat\pi_n)^{-1}\int K_h(t-s)\,d\widehat F_{2,n}(s), \qquad
\widehat\varphi_{n,h_n}(t) = \widehat f^{-1}_{0,n,h}(t)\,\widehat f_{n,h}(t)$$


on every compact subset of the support of the densities where f0 is strictly positive and ϕ n,h − ϕ0  converges in probability to0. The expectations of the estimators are approximated by f0;n,h (t) = Kh (t − s) dF0 (s) +  2 1 (2) O(n− 2 ) = f0 + h2 f0 +o(h2 ) and respectively fn,h (t) = Kh (t−s) dFϕ (s)+ 2 1 (2) n,h is expanded as O(n− 2 ) = fϕ + h2 fϕ + o(h2 ). The bias of ϕ bn,h =

h2 −1 (2) (1) f {ϕf0 + 2ϕ(1) f0 } m2K + o(h2 ), 2 0

its variance is vn,h = f0−2 {V arfn,h + ϕ2 V arf0,n,h } + o((nh)−1 ) with the variances given in Proposition 2.2, V arfj,n,h (t) = (nj h)−1 κ2 fj (t) {1 + o(1)} imply similar approximations for the variances of the estimators f0,n,h and fn,h . The following approximation with independent subsamples implies the weak convergence of the estimator of ϕ 1 1 n,h − ϕn,h ) = f0−1 (nh) 2 {(fn,h − fϕ,n,h ) − ϕ(f0,n,h (nh) 2 (ϕ

− f0,n,h )} + oL2 (1).
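The estimator $\widehat\varphi_{n,h}$ is a ratio of two kernel density estimators and can be computed as such. The sketch below is an illustration only, with two simulated Gaussian subsamples; all numerical choices are arbitrary and no correction is applied near points where $\widehat f_{0,n,h}$ is small.

```python
import numpy as np

def kde(x_grid, sample, h):
    u = (x_grid[:, None] - sample[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).mean(axis=1) / h

def density_ratio(x_grid, sample0, sample1, h):
    """Estimate phi = f_phi / f_0 by the ratio of two kernel density estimators."""
    f0 = kde(x_grid, sample0, h)
    f1 = kde(x_grid, sample1, h)
    return f1 / f0                     # meaningful where f0 is bounded away from zero

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    x0 = rng.normal(0.0, 1.0, size=1500)     # reference sample, distribution F_0
    x1 = rng.normal(0.5, 1.0, size=1500)     # second sample, absolutely continuous w.r.t. F_0
    grid = np.linspace(-1.5, 1.5, 7)
    # the true ratio for these two Gaussians is exp(0.5 * x - 0.125)
    print(density_ratio(grid, x0, x1, h=0.3))
```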

2.9

Hellinger distance between a density and its estimator

Let P and Q be two probability measures and let λ = P + Q be the dominating measure of their sum. Let F and G be the distribution functions of a variable X under the probability measures P and Q, respectively, and let f and g be the densities of P and Q, respectively, with respect to λ. The Hellinger distance between P and Q is  √    1 1 √ 2 2 ( dP − dQ) = ( f − g)2 dλ. h (P, Q) = 2 2 The affinity of P and Q is   ρ(P, Q) = 1 − h (P, Q) = f g dλ. 2


The following inequalities were proved by Lecam and Yang (1990) h2 (P, Q) ≤

1 1 P − Q1 ≤ {1 − ρ2 (P, Q)} 2 . 2

Applying this inequality to the probability density f of P , absolutely continuous with respect to the Lebesgue’s measure λ, and its estimator fn,h , we obtain ⎧ ⎞ ⎛  ⎞2 ⎫ 12 ⎛  ⎪ ⎪   ⎬ ⎨ fn,h ⎠ fn,h dF ⎠ . h2 (fn,h , f ) = ⎝1 − dF ≤ 1 − ⎝ ⎪ ⎪ f f ⎭ ⎩ The convergence to zero of the Hellinger distance h2 (fn,h , f ) is deduced from the obvious bound ⎞ ⎛ ⎞ ⎛    n,h n,h f f ⎠ dF ≤ ⎝ − 1⎠ d(Fn − F ) (2.9) h2 (fn,h , f ) = ⎝1 − f f  ' fn,h  which is consequence of the inequality f dFn ≥ 0. This inequality and the uniform a.s. consistency of the density estimator also imply the 1 a.s. convergence to zero of n 2 h2 (fn,h , f ). By differentiation, estimators of functionals of the density converges with the same rate as the estimator of the density, hence h2 (fn,h , f ) convergences to zero in probability with the 1

rate nhn2 . Applying these results to the probability measures P0 and P = Pϕ0 of the previous section, with distribution functions F0 and F , we get similar formulae   2  12  √ 2 1 √ 2 (1 − ϕ) dF0 ≤ 1 − ϕ dF0 , h (P0 , P ) = 2 ⎧   2 ⎫ 12     ⎬ ⎨ ϕ n,h ϕ n,h dF dF ≤ 1 − h2 (ϕ n,h , ϕ) = 1− . ⎭ ⎩ ϕ ϕ The bound (2.9) is adapted to the density ϕ    ϕ  n,h − 1 d(Fn − F ), h2 (ϕ n,h , ϕ) ≤ ϕ 1

it follows that the convergence rate of h2 (ϕ n,hn , ϕ) is nhn2 .

Kernel Estimator of a Density

2.10

45

Estimation of the density under right-censoring

On a probability space (Ω, A, P ), let X and C be two independent positive random variables with densities f and fC and such that P (X < C) is strictly positive, and let T = X ∧ C, δ = 1{X≤C} denote the observed variables when X is right-censored by C. Let  δi 1{Ti ≤t} Nn (t) = 1≤i≤n

be the number of observations before t and  Yn (t) = 1{Ti ≥t} 1≤i≤n

be the number of individuals at risk at t. The survival function F¯ = 1 − F − of X is now estimated by Kaplan–Meier’s product-limit estimator    δi Jn (Ti )  ¯ R (t) =  n (Ti ) , with F 1− 1 − ΔΛ = n Yn (Ti ) Ti ≤t Ti ≤t  t Jn (s)  n (t) = dNn (s) , Λ (2.10) 0 Yn (s) and Jn (s) = 1{Y (s)>0} . The process FnR is also written in an additive n

form (Pons, 2007) as a right-continuous increasing process identical to the product-limit estimator  t dNn (s) FnR (t) = , (2.11) n −1 } R 0 n− j=1 (1 − δj )1{Tj t

 ¯ c (t) is the continuous part of Λ(t) ¯ ¯ where Λ and s>t {1 + ΔΛ(s)} its rightcontinuous discrete part. On the interval In = ]mini Ti , maxi Ti ], the func¯ is estimated by tion Λ  ∞ dNn  ¯ n (t) = Λ 1{Yn < n} , n − Yn t and a product-limit estimator of the function F is defined on In from the  ¯ n by expression of Λ . /  ¯ (T ) δi . F L (t) = 1 + dΛ n

n

i

Ti ≥t

On the interval In = ]mini Ti , maxi Ti ] it satisfies  ∞ L − Fn (s )  F − FnL ¯ n (s) − dΛ(s)}, ¯ (t) = {dΛ (2.13) F F (s) t 1 and n 2 (F − FnL )F −1 converges weakly a centered Gaussian  ∞ to −1 − 2 ¯ with ¯ (F F ) (H − )−1 dΛ, process with covariance K(s, t) = s∧t

48

Functional Estimation for Density, Regression Models and Processes

 ¯ n is an 1 − H = (1 − F )(1 − FC ). From this expression, it follows that Λ L ¯  unbiased estimator of Λ and Fn is an unbiased estimator of the distribu1 tion function F , moreover FnL (t) − F (t)p = O(n− 2 ), for p ≥ 2. The density of T under left-censoring is estimated by smoothing the Kaplan–Meier estimator FnL of the distribution function  L  fn,h (t) = Kh (t − s) dFnL (s). The a.s. uniform consistency of the process FnL − F implies that L − f | converges in probability to zero, as n tends to the infinity supIX,h |fn,h and h to zero. From (2.13), the estimator fL satisfies n,h

L fn,h (t) = fn,h (t) +



 Kh (t − s)[f (s)



s

 FnL−  ¯ n − Λ) ¯ ds d(Λ F

 ¯ n − Λ)(s)] ¯ − FnL− (s) d(Λ  ∞ FL (s− )  ¯ n − Λ)(s) ¯ d(Λ = fn,h (t) + fn,h (s) n F (s) t   ¯ n − Λ)(s). ¯ − Kh (t − s)FnL (s− ) d(Λ As a consequence of the uniform consistency of the estimators FnL− and  ¯ n , the bias of the estimated density fL (t) is then the same as in the Λ n,h 2

uncensored case bf,n,h (t) = h2 f (2) (t) + o(h2 ). L (t) = (nh)−1 vfL (t), with the expansion vf,n,h  L (t) = E vf,n,h

t

+



Its variance is written

F L (s− ) 2 2 ¯ fn,h (s) n (n − Yn (s))−1 dΛ(s) F (s)

¯ Kh2 (t − s)FnL2 (s− )(n − Yn (s))−1 dΛ(s) 

−2



fn,h (s) t

FnL2 (s− ) ¯ Kh (t − s)(n − Yn (s))−1 dΛ(s), F (s)

where the last two terms are O((nh)−1 ) and the first one is a O(n−1 ). The optimal bandwidths for estimating the density under left-censoring are then 1 2 also O(n− 5 ) and the optimal L2 -risks are O(n− 5 ). Under Conditions 2.1 or 2.2 and if the support of K is compact, the variance vfL belongs to class C 2 (IX ) and for every t and t in IX,h , there

49

Kernel Estimator of a Density

exists a constant α such that for |t − t | ≤ 2h

2 L L E fn,h (t) − fn,h (t ) ≤ α(nh3 )−1 |t − t |2 . L L Under the conditions of Theorem 2.1, the process Un,h = (nh) 2 {fn,h − 1 L L f }I{IX,h } converges weakly to Wf + γ 2 bf , where Wf is a continuous Gaussian process on IX with expectation and covariances zero and with variance function vfL . 1

2.12

Kernel estimator for the density of a process

Consider a continuously observed stationary process (Xt )t∈[0,T ] with values in IX . The stationarity means that the distribution probability of Xt and Xt+s − Xs are identical for every s and t > 0. For a process with independent increments, this implies the ergodicity of the process that is expressed by the convergence of bounded functions of several observations of the process to a mean value: For every x in IX , there exists a measure πx on IX \ {x} such that for every bounded and continuous function ψ on 2 IX   ET −1 ψ(Xs , Xt ) ds dt → ψ(x, y) dπx (dy)dF (x) (2.14) 2 IX

[0,T ]2

as T tends to infinity. The distribution function F in (2.14) is defined as the limit of the expectation of the mean sample-path of the process X   ψ(Xt ) dt → ψ(x) dF (x). (2.15) ET −1 [0,T ]

IX

The mean marginal density f of the process is the density of the distribution function F , it is estimated by replacing the integral of a kernel function with respect to the empirical distribution function of a sample by an integral with respect to the Lebesgue measure over [0, T ] and the bandwidth sequence is indexed by T . For every x in IX,T,h  1 T Kh (Xs − x) ds, (2.16) fT,h (x) = T 0  its expectation is fT,h (x) = IX,n Kh (y − x)f (y) dy so its bias is  Kh (y − x){f (y) − f (x)} dy bT,h (x) = IX,T

hs = T msK f (s) (x) + o(hsT ) s!

50

Functional Estimation for Density, Regression Models and Processes

under Conditions 2.1–2.2. For a density in a H¨ older class Hα,M , bT,h (x) [α] tends  [α]to zero for every α > 0 and it is a O(h ) under the condition |u| K(u) du < ∞. Its variance is expressed through the integral of the covariance between Kh (Xs − x) and Kh (Xt − x). For Xs = Xt , the integral on the diagonal 2 is a (T hT )−1 κ2 f (x) + o((T hT )−1 ) and the integral outside the DX of IX,T diagonal denoted Io (T ) is expanded using the ergodicity property (2.14). Let αh (u, v) = |u − v|/2hT , the integral Io (T ) is written   ds dt Kh (u − x)Kh (v − x)fXs ,Xt (u, v) du dv 2 T T [0,T ]2 IX,T \DX    1−αh (u,v) = (T hT )−1 K(z − αh (u, v))K(z + αh (u, v)) dz IX

IX\{u}

−1+αh (u,v)



dπu (v) dF (u)}{1 + o(1) .  For every fixed u = v, the integral K(z −αh (u, v))K(z +αh (u, v)) dz tends to zero since αhT (u, v) tends to infinity as hT tends to zero. If αh (u, v) tends to zero with hT , then πu (v) also tends to zero and the integral Io (T ) is a o((T hT )−1 ) as T tends to infinity. The mean squared error of the estimator at x for a marginal density in C s is then −2 2 msK f (s)2 (x) M ISET,h (x) = (T hT )−1 κ2 f (x) + h2s T (s!)

+ o((T hT )−1 ) + o(h2s T ) and the optimal local and global bandwidths minimizing the mean squared 1 (integrated) errors are O(T 2s+1 ). If hT has the rate of the optimal band2s widths, the M ISE is a O(T 2s+1 ). The Lp -norm of the estimator satis1 fies fT,h (x) − fT,h (x)p = O((T hT )− 2 ) under an ergodicity condition for k (Xt1 , . . . , Xtk ) similar to (2.14) for bounded functions ψ defined on IX  ET −1 ψ(Xt1 , . . . , Xtk ) dt1 · · · dtk (2.17) [0,T ]k   → ψ(x1 , . . . , xk ) πxj (dxj+1 ) dF (x1 ), k IX

1≤j≤k−1

for every integer k = 2, . . . , p. The property (2.17) implies the weak conver1 gence of the finite dimensional distributions of the process (T hT ) 2 (fT,h − f − bT,h ) to those of a centered Gaussian process with expectation zero, covariances zero and variance κ2 f (x) at x. The proof is similar to the

Kernel Estimator of a Density

51

proof for a sample of variables, using the above expansions for the variance and covariances of the process. A Lipschitzian bound for increments E{fT,h (x)−fT,h (y)}2 is obtained by the Mean Value Theorem which implies  T E{Khn (x − Xt ) − Khn (y − Xt )}2 dt = O(|x − y|2 (T h3T )−1 ), T −2 0

1 as in Lemma 2.1. Then the process (T hT ) 2 (fT,h − f − bT,h ) converges weakly to a centered Gaussian process with covariances zero and variance κ2 f .

The Hellinger distance h2 (fT,hT , f ) is bounded like (2.9) ⎞ ⎛ ⎞ ⎛      fT,hT ⎠ fT,hT − 1⎠ d(FT − F ), dP ≤ ⎝ h2T (fT,hT , f ) = ⎝1 − f f where FT (t) = T −1

 0

T

1{Xt ≤s} dt,

is the empirical probability distribution of the mean marginal distribution function of the process (Xt )t≤T  −1 FT = T FXt dt, [0,T ]

and F is its limit under the ergodicity property (2.15). The convergence 1 rate of FT − F is T 2 , from the mixing property of the process X. Therefore 1 h2 (fT,hT − f ) convergences to zero in probability with the rate T hT2 . 2.13

Exercises

 (1) Let f and g be real functions defined on R  and let f ∗ g(x) = f (x − y)g(y) dy be their convolution. Calculate f ∗ g(x) dx and prove that, for 1 ≤ p ≤ ∞, if f belongs to Lp and g to Lq such that p−1 + q −1 = 1, then supx∈R |f ∗ g(x)| ≤ f p gq . If p is finite, prove that f ∗ g belongs to the space of continuous functions on R tending to zero at infinity. (2) Prove the approximation of the bias in (d) of Proposition 2.2 using a Taylor expansion and precise the expansions for the Lp -risk. (k) (m) (3) Calculate the covariance of fn,h (x) and fn,h (x), for strictly positive integers k = m.

52

Functional Estimation for Density, Regression Models and Processes

(4) Let (Xi )i=1,...,n be a sample of a variable with density f , define estimators of E{f (s) (X)}, for integers s ≥ 1, and determine their bias and variance. [The estimators of Section 2.7 give hints for these new estimators]. (5) Write the variance of the kernel estimator for the marginal density of dependent observations (Xi )i≤n in terms of the auto-covariance coeffin cients ρj = n−1 i=1 Cov(Xi , Xi+j ). (6) Consider a hierarchical sample (Xij )j=(1,...,Ji ),i=1,...,n , with n independent and finite sub-samples of Ji dependent observations. Let N =  n  Ji n −1 j=1 fXij be the limiting mean deni=1 Ji and f = limn N i=1 sity of the observations of X. Define an estimator of the density f and give the first order approximation of the variance of the estimator under relevant ergodicity conditions.

Chapter 3

Kernel Estimator of a Regression Function

3.1

Introduction and notation

The kernel estimation of nonparametric regression functions is related to the estimation of the conditional density of a variable and most authors have studied the asymptotic behavior of weighted risks, using weights proportional to the density estimator so that the random denominator of the regression function disappears. Weighted integrated errors are used for the empirical choice of a bandwidth and for tests about the regression. In this chapter, the bias, variance and norms of the kernel regression estimator are obtained from an asymptotic approximation of the estimator. (X, Y ) with joint density Let (Xi , Yi )i=1,...,n be a sample of a variable  fX,Y . The marginal density of X is fX (x)= fX,Y (x, y)dy and the density −1 fX,Y . Here, the density fX,Y is of Y conditionally on X is fY |X = fX 2 supposed to be C . Let FXY be the distribution function of (X, Y ) and for n an n-sample (Xi , Yi )i=1,...,n , let FXY,n (x, y)= n−1 i=1 1{Xi ≤ x, Yi ≤ y} be the empirical distribution function. Consider the regression model (1.6) on a subset IXY of the support of the distribution function FXY Y = m(X) + σε, where m is bounded on IX , a subset of the support of FX , and the error variable ε has the conditional expectation E(ε|X) = 0 and a constant conditional variance V ar(ε|X) = σ 2 . Let IX,h = {x ∈ IX ; [x − h, x + h] ∈ IX }, IXY,h = {(x, y) ∈ IXY ; [x − h, x + h] × {y} ∈ IXY },


be subsets of the supports. On an interval IXY,h , a continuous and bounded regression function  −1 m(x) = E(Y |X = x) = fX

yfXY (x, y) dy

is estimated by the kernel estimator n Yi Kh (x − Xi ) m  n,h (x) = i=1 . n i=1 Kh (x − Xi ) Its numerator is denoted  n 1 μ n,h (x) = Yi Kh (x − Xi ) = yKh (x − s) dFXY,n (s, y) n i=1

(3.1)

and its denominator is fX,n,h (x). The expectation of μ n,h (x) and its limit are respectively     μn,h (x) = yKh (x − s) dFXY (s, y) = yK(u)fXY (x + hu, y) du dy,   h2 m2K ∂ 2 fXY (x, y) dy + o(1) , = μ(x) + y 2 2 (∂x)  μ(x) = yfXY (x, y) dy = fX (x)m(x), whereas the expectation of m  n,h (x) is denoted mn,h (x). The notations for the parameters and estimators of the density f are unchanged. The variance of Y is supposed to be finite and its conditional variance is denoted σ 2 (x) = E(Y 2 |X = x) − m2 (x),  −1 2 E(Y |X = x) = fX (x)w2 (x) = y 2 fY |X (y; x) dy, with   w2 (x) = y 2 fXY (x, y) dy = fX (x) y 2 fY |X (y; x) dy. Let also σ4 (x) = E[{Y −m(x)}4 | X = x], they are supposed to be bounded functions on IX . The Lp -risk of the kernel estimator of the regression 1 function m is defined by its Lp -norm  · p = {E · p } p .
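The ratio form of $\widehat m_{n,h}$ translates directly into code. The sketch below is illustrative only: Gaussian kernel, simulated data and an arbitrary bandwidth; none of these choices come from the text.

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """m_hat(x) = sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i), with a Gaussian kernel."""
    u = (x_grid[:, None] - X[None, :]) / h
    w = np.exp(-0.5 * u ** 2)                        # unnormalised kernel weights
    return (w * Y[None, :]).sum(axis=1) / w.sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = rng.uniform(-2.0, 2.0, size=1000)
    Y = np.sin(X) + 0.3 * rng.normal(size=1000)      # regression function m(x) = sin(x)
    grid = np.linspace(-1.5, 1.5, 7)
    print(nadaraya_watson(grid, X, Y, h=0.2))        # close to sin(grid)
```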

3.2

Risks and convergence rates for the estimator

The following conditions are assumed, in addition to Conditions 2.1 and 2.2 about the kernel and the density. Condition 3.1. (1) The functions fX , m and μ are twice continuously differentiable on IX , with bounded second order derivatives; fX is strictly positive on IX ; (2) The functions fX , m and σ belong to the class C s (IX ).


Proposition 3.1. Under Conditions 2.1, 2.2 and 3.1(1), μn,h (x) − μ(x)| and supx∈IX,h |m  n,h (x) − m(x)| converge (a) supx∈IX,h | a.s. to zero if and only if μ and m are uniformly continuous. (b) The following expansions are satisfied mn,h (x) =

μn,h (x) + O((nh)−1 ), fX,n,h (x)

1

1

−1  n,h − mn,h }(x) = (nh) 2 fX (x){( μn,h − μn,h )(x) (3.2) (nh) 2 {m  − m(x)(fX,n,h − fX,n,h )(x)} + rn,h

where rn,h = oL2 (1). μn,h (x) − μ(x)p (c) For every x in IX and for every integer p > 1,  and m  n,h (x) − m(x)p converge to zero, the bias of the estimators μ n,h (x) and m  n,h (x) is uniformly approximated by bμ,n,h (x) = μn,h (x) − μ(x) = h2 bμ (x) + o(h2 ), m2K (2) μ (x), bμ (x) = 2 bm,n,h (x) = mn,h (x) − m(x) = h2 bm (x) + o(h2 ), −1 (x){bμ (x) − m(x)bf (x)} bm (x) = fX m2K −1 (2) f (x){μ(2) (x) − m(x)fX (x)}, = 2 X

(3.3)

(3.4)

the covariance between μ n,h (x) and fX,n,h (x) is Covμ,fX ,n,h (x) = (nh)−1 {Covμ,fX (x) + o(1)}, Covμ,fX (x) = μ(x)κ2 = m(x)fX (x)κ2 ,

(3.5)

and their variance vμ,n,h (x) = (nh)−1 {σμ2 (x) + o(1)}, σμ2 (x) = w2 (x)κ2 , −1

vm,n,h (x) = (nh) 2 (x) σm

2 {σm (x) 2

(3.6) + o(1)},

−2 = {w2 (x) − m (x)f (x)}κ2 fX (x) −1 (x)σ 2 (x). = κ2 f X

(3.7)

Proof. Condition 3.1 implies that the kernel estimator of fX is bounded away from zero on IX which may be a sub-interval of the support of the variable X. Proposition 2.2 and the almost sure convergence to zero of μn,h − μn,h |, proved by the same arguments as for the density, supx∈IX,h | imply the assertion (a). The bias and the variance are similar for μ n,h (x)

56

Functional Estimation for Density, Regression Models and Processes

and fX,n,h (x). For μ n,h (x), they are a consequence of (b). The first approximation of (b) comes from the first order Taylor expansion ( μX,n,h − μX,n,h )(x) μn,h (x) + m  n,h (x) = fX,n,h (x) fX,n,h (x)  μn,h (x){fX,n,h (x) − fX,n,h (x)} (3.8) − 2 fX,n,h (x) +o(|fX,n,h (x) − fX,n,h (x)| + | μX,n,h − μX,n,h )(x)|), the expectation of this equality yields μn,h (x) + o(h2 ) (3.9) mn,h (x) = fX,n,h (x) uniformly on IX , for any bounded regression function m. The bias of m  n,h (x) is 

μ (x) μn,h (x)  n,h bm,n,h (x) = − m(x) + mn,h (x) − , fX,n,h (x) fX,n,h (x) where the second difference is a o(h2 ), using (3.9). A Taylor expansion of −1 (x) as n tends to infinity leads to fX,n,h μn,h (x) −1 = m(x) + {bμ,n,h (x) − m(x)bfX ,n,h (x)}fX (x) + o(h2 ) fX,n,h (x) and the bias of m  n,h (x) follows immediately. The variance μ n,h (x) is vμ,n,h (x) = n−1 [E{Y 2 Kh2 (x − X)} − μ2nh (x)] = (nh)−1 {E(Y 2 | X = x + hu)K 2 (u)fX (x + hu) − hμ2nh (x)}, = (nh)−1 {κ2 w2 (x)fX (x) + o(1)}, and the variance vm,n,h (x) of m  n,h (x) is deduced as

μn,h (x) 2 μn,h (x) 2 vm,n,h (x) = E m  n,h (x) − + mn,h (x) − , fX,n,h (x) fX,n,h (x) where the non random term is a o(h4 ), by (3.9). From the expansion (3.8), the first term develops as  μ2n,h (x) −2 V ar{fX,n,h (x)} (x) V ar{ μn,h (x)} + 2 vm,n,h (x) = fX,n,h fX,n,h (x)  μn,h (x) −2 Cov{ μn,h (x), fX,n,h (x)} + o(n−1 h−1 ) fX,n,h (x)

μ2n,h (x) −2 fX (x) = (nh)−1 fX,n,h (x)κ2 w2 (x) + 2 fX,n,h (x)  μn,h (x) −2 m(x)fX (x) + o(n−1 h−1 ) fX,n,h (x) −2 = (nh)−1 fX,n,h (x)κ2 {w2 (x) − m2 (x)fX (x)} + o(n−1 h−1 ),

57

Kernel Estimator of a Regression Function

using the expressions (2.3, (3.5) and (3.6), and first order expansions of the expectations fX,n,h (x) = fX (x) + O(h2 ) and μn,h (x) = μ(x) + O(h2 ). The convergence to zero of the last term rn,h in (3.2) is satisfied and the other results are obtained by simple calculus.  For p ≥ 2, let wp (x) = E(Y p 1{X=x} )

(3.10) p

be the p-th moment of Y conditionally on X = x. The L risk is calculated from the approximation (3.2) of Proposition 3.1 and the next lemmas. Lemma 3.1. For every integer p ≥ 2 1

 μn,h (x) − μn,h (x)p = O((nh)− 2 ), 1 −1 −1 (x) − fX,n,h (x)p = O((nh)− 2 ), fn,h 1

m  n,h (x) − mn,h (x)p = O((nh)− 2 ), where the approximations are uniform. Proof. Proposition 2.2 extends to μ n,h (x) − μn,h (x) so the moments of 1 order p ≥ 2 of μ n,h (x) − μn,h (x) and fX,n,h (x) − fX,n,h are 0((nh)− 2 ). The result is deduced from the expansion (3.2).  The convergence rate of the bandwidth determines the behavior of the 1  n,h − m), with the following technical bias term of the process (nh) 2 (m results. They generalise Proposition 3.1 to integers p and s ≥ 2.  n,h (x) are Lemma 3.2. Under Condition 2.1, the bias of μ n,h (x) and m uniformly approximated as  hs ∂ s fX,Y (x, y) msK y dy + o(hs ), bμ,n,h (x; s) = s! ∂xs hs (s) −1 msK fX (x){μ(s) (x) − m(x)fX (x)} + o(hs ), (3.11) bm,n,h (x; s) = s! for every integer s ≥ 2, and their variances develop as in Proposition 3.1. Proposition 3.2. Under Conditions 2.1 and 3.1 with s = 2, for every x in IXh 1 1 (nh) 2 (m  n,h − m) = (nh) 2 f −1 {( μn,h − μn,h ) − m(fX,n,h − fX,n,h )} X

1

+ (nh5 ) 2 bm + rn,h , and the remainder term of (3.12) satisfies sup  rn,h 2 = o(1).

x∈IX,h

(3.12)

58

Functional Estimation for Density, Regression Models and Processes

Proof.

A first order expansion yields ( μX,n,h − μX,n,h )(x) μn,h (x) + fX,n,h (x) fX,n,h (x)  mn,h (x){fX,n,h (x) − fX,n,h (x)} − + εn , fX,n,h (x) ( μX,n,h − μX,n,h )(x) = mn,h (x) + fX (x)  1  1 +( μX,n,h − μX,n,h )(x) − fX,n,h (x) fX (x)  m(x){fX,n,h (x) − fX,n,h (x)} − fX (x)  m (x) m(x)   n,h − {fX,n,h (x) − fX,n,h (x)} + εn (3.13) − fX,n,h (x) fX (x)

m  n,h (x) =

where εn = o(|fX,n,h (x) − fX,n,h (x)| + | μX,n,h − μX,n,h )(x)|). 1 Let δn,h = (nh) 2 rn,h , then   1 1  1 μX,n,h − μX,n,h )(x) − δn,h = (nh) 2 ( fX,n,h (x) fX (x)    m (x) m(x) 1 n,h − {fX (x) − fX,n,h (x)}(nh) 2 + εn . (3.14) − fX,n,h (x) fX (x) Then

1 2 fX,n,h (x) fX (x)  m (x)  m(x) 2 n,h − + E{(fX − fX,n,h )2 (x)} + o(1) fX,n,h (x) fX (x)

  2 E(δn,h ) ≤ 2nh E{( μX,n,h − μX,n,h )2 (x)}

1



and it is a o(1) by Lemma 3.1 and Propositions 2.1 and 3.1.



2

For a regression function of class C , s ≥ 2, the L -norm of the remainder term rn,h is given by the next proposition. s

Proposition 3.3. Under Conditions 2.1, 2.2 and 3.1, for every s ≥ 2 the remainder term of (3.13) satisfies the uniform bounds rn,h 2 = o(1). sup 

IX,h

Proof. For functions fX and μ in C s , the risk of rn,h is modified by the bias terms of the previous expansion (3.14). The bias of mn,h is replaced by mn,h (x) − m(x) = hs bμ (x) + O(hs+1 ), and Conditions 2.2 and 3.2 imply  h2s = O((nh)−1 ). The result follows from Lemma 3.2.

59

Kernel Estimator of a Regression Function

Under Conditions 2.1, 2.2, 3.1, for a function μ in C s and a density fX in C r , the bias of m  n,h is

hs  hr −1 bμ (x) − m(x) bf (x) + o(hs∧r ), bm,n,h (x) = fX (x) s! r! and its variance does not depend on r and s. The derivability conditions fX and μ ∈ C s of 3.1 can be replaced by the condition: fX and μ belong to a H¨older class Hα,M . Proposition 3.4. Assume fX and μ are bounded and belong to Hα,M then −1 (x){1 + |m(x)|} + the bias of m  n,h (x) is bounded by M m[α]K hα ([α]!)−1 fX 1

o(h[α] ), by Equation (3.2). The optimal bandwidth is O(n 2[α]+1 ) and the [α]

MISE at the optimal bandwidth is O(n 2[α]+1 ). Proposition 2.7 extends to the regression estimators. Proposition 3.5. Under Conditions 2.1, 2.2 and 3.1, and for every real p > 0, the kernel estimator of a regression function m of C s has the Lp -risk Rp (m  n,h , m) = O(n− 2s+1 ), s

1

with hn = O(n− 2s+1 ).

For every f in C 2 (R), by convexity we have  p    Kh (t − s)f (s) ds − f (t) dt Rpp (fn , f ) ≤ 2p−1  IX,h IX,h   p    Kh (t − s) {dFn (s) − dF (s)} dt , +E 

Proof.

IX,h

IX,h

where the first term is approximated by  hsp mpsK R1p = |f (s) (t)|p dt + o(hsp ). (k!)p IX,h The random term has the approximation (3.8) and its risk is bounded by

 −p 2p−1 E fX,n,h (t)| μX,n,h (t) − μX,n,h (t)|p dt IX,h



+ IX,h

−p |μn,h |p (t)fX,n,h (t)|fX,n,h (t) − fX,n,h (t)|p dt



μX,n,h − μX,n,h )) + o(Rp (fX,n,h − fX,n,h )) + o(Rp ( the result is obtained by the same arguments as for Proposition 2.7 for the  risk of μ X,n,h .

60

Functional Estimation for Density, Regression Models and Processes

The minimax property of the estimator m  n,h is established by the same method as for density estimation. The minimax bound for a density of Lp applies to the function μ(x) = E(Y 1{X=x} ) and, by the expansion (3.2), to the minimax risk for the estimation of m(x). Proposition 3.6. For a real p > 1 and an integer s ≥ 2, a lower bound of the Lp -risk for the estimation of a regression function m of n of the set Ms,p (IX ) = {m ∈ C s (IX ); m(s) ∈ Lp (IX )}, in a subset M Ms,p (IX ), is s sup Rp (m  n , m) = O(n− 2s+1 ). inf n m∈Ms,p m  n ∈M

By Propositions 3.5 and 3.6, the kernel estimator m  n,h is a minimax estimator, under Conditions 2.1, 2.2 and 3.1.
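In practice the bandwidth of $\widehat m_{n,h}$ is often chosen by leave-one-out cross-validation rather than by the asymptotic formulas of the next section. The sketch below illustrates this common alternative; it is not a procedure taken from the text, and all numerical choices are arbitrary.

```python
import numpy as np

def loo_cv_score(X, Y, h):
    """Leave-one-out prediction error of the kernel regression estimator for a given bandwidth."""
    u = (X[:, None] - X[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(w, 0.0)                       # drop the i-th observation when predicting Y_i
    m_loo = (w * Y[None, :]).sum(axis=1) / w.sum(axis=1)
    return np.mean((Y - m_loo) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    X = rng.uniform(-2.0, 2.0, size=800)
    Y = np.sin(X) + 0.3 * rng.normal(size=800)
    bandwidths = np.linspace(0.05, 0.8, 16)
    scores = [loo_cv_score(X, Y, h) for h in bandwidths]
    print(bandwidths[int(np.argmin(scores))])      # data-driven bandwidth
```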

3.3

Optimal bandwidths for derivatives

The asymptotic mean squared error of m  n,h (x), for p = 2, is −2 2 (x) + h4 b2m (x) = (nh)−1 κ2 fX (x){w2 (x) − m2 (x)f (x)} (nh)−1 σm h4 m22K −2 (2) fX (x){μ(2) (x) − m(x)fX (x)}2 4 and its minimum is reached at the optimal bandwidth

κ n−1 {w (x) − m2 (x)f (x)}  15 2 2 hAMSE (x) = m22K {μ(2) (x) − m(x)f (2) (x)}2 +

4

X

where AM SE(x) = O(n− 5 ). The global mean squared error criterion is the integrated error and it isapproximated by AM ISE = (nh)−1 κ2

−2 (x){w2 (x) − m2 (x)f (x)} dx fX

 h4 m22K (2) −2 fX (x){μ(2) (x) − m(x)fX (x)}2 dx, + 4 and the optimal global bandwidth is

κ n−1  f −1 (x)V ar{Y |X = x}f (x)} dx  15 2 X . hn,AMISE =  m22K f −2 (x){μ(2) (x) − m(x)f (2) (x)}2 dx X X For every s ≥ 2, the asymptotic quadratic risk of the estimator for a regression curve of class C s is 2 AM SE(x) = (nh)−1 σm (x) + hs2 b2m,s (x) −2 (x){w2 (x) − m2 (x)f (x)} = (nh)−1 κ2 fX

+

h2s 2 −2 (s) m f (x){μ(s) (x) − m(x)fX (x)}2 , (s!)2 sK X

Kernel Estimator of a Regression Function

61

its minimum is reached at the optimal bandwidth
$$h_{AMSE}(x) = \Big[\frac{(s!)^2\,\kappa_2\, n^{-1}\{w_2(x) - m^2(x)f(x)\}}{2s\, m_{sK}^2\,\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2}\Big]^{\frac{1}{2s+1}},$$
where $AMSE(x) = O(n^{-\frac{2s}{2s+1}})$. The global mean squared error criterion is the integrated error and it is approximated by
$$AMISE(h,s) = (nh)^{-1}\kappa_2\int f_X^{-1}(x)\,Var\{Y\,|\,X=x\}\,dx + \frac{h^{2s} m_{sK}^2}{(s!)^2}\int f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2\,dx,$$
and the optimal global bandwidth is
$$h_{n,AMISE}(s) = \Big[\frac{(s!)^2\,\kappa_2\, n^{-1}\int f_X^{-1}(x)\,Var\{Y\,|\,X=x\}\,dx}{2s\, m_{sK}^2\int f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2\,dx}\Big]^{\frac{1}{2s+1}},$$
and again $AMISE(h_n(s),s) = O(n^{-\frac{2s}{2s+1}})$.

In order to estimate the constants of the optimal bandwidths, a consistent kernel estimator of the conditional variance of $Y$ is defined as
$$\widehat{Var}_{n,h}(Y\,|\,X=x) = \frac{\sum_{i=1}^n Y_i^2 K_h(x-X_i)}{\sum_{i=1}^n K_h(x-X_i)} - \widehat m_{n,h}^2(x).$$
More generally, the conditional moment of order $p$, $m_p(x) = E(Y^p\,|\,X=x)$, is estimated by
$$\widehat m_{p,n,h}(x) = \frac{\sum_{i=1}^n Y_i^p K_h(x-X_i)}{\sum_{i=1}^n K_h(x-X_i)},$$
with a bandwidth $h = h_n$ such that $h_n$ tends to zero and $nh_n^2$ tends to infinity as $n$ tends to infinity. For every $p \ge 2$, the estimator $\widehat m_{p,n,h}$ is also written $\widehat f_{n,h}^{-1}\widehat\mu_{p,n,h}$; it is a.s. uniformly consistent and approximations similar to those of Propositions 3.1 and 3.2 for the regression curve hold for $m_p(x)$
$$(nh)^{\frac12}(\widehat m_{p,n,h} - m_{p,n,h}) = (nh^5)^{\frac12} b_{p,m} + (nh)^{\frac12} f_X^{-1}\{(\widehat\mu_{p,n,h}-\mu_{p,n,h}) - m_p(\widehat f_{X,n,h}-f_{X,n,h})\} + r_{p,n,h},$$
$$\sup_{x\in I_{X,h}}\|r_{p,n,h}\|_2 = o(1),$$
and for its bias
$$b_{\mu_p,n,h} = \frac{m_{2K}h^2}{2}\int y^p\,\frac{\partial^2 f_{XY}(\cdot,y)}{\partial x^2}\,dy + o(h^2),$$
$$b_{m_p,n,h} = m_{p,n,h} - m_p = \frac{m_{2K}h^2}{2}\, f_X^{-1}\{b_{\mu_p} - m_p b_f\} + o(h^2).$$
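As a minimal numerical sketch of the estimators $\widehat m_{n,h}$ and $\widehat m_{p,n,h}$ above — assuming a Gaussian kernel and an arbitrary simulated model, neither of which is prescribed by the text — one may write:

```python
import numpy as np

def gaussian_kernel(u):
    # K: standard Gaussian kernel; any kernel satisfying Conditions 2.1-2.2 could be used.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def conditional_moment(x, X, Y, h, p=1):
    """Kernel estimator of m_p(x) = E(Y^p | X = x) (p = 1 gives the regression estimator)."""
    x = np.atleast_1d(x)
    W = gaussian_kernel((x[:, None] - X[None, :]) / h) / h   # K_h(x - X_i)
    return (W @ (Y ** p)) / W.sum(axis=1)

def conditional_variance(x, X, Y, h):
    """Plug-in estimator of Var(Y | X = x) = m_2(x) - m_1(x)^2."""
    return conditional_moment(x, X, Y, h, p=2) - conditional_moment(x, X, Y, h, p=1) ** 2

# illustration on simulated data (model and bandwidth are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 500)
Y = np.sin(2.0 * np.pi * X) + 0.3 * rng.standard_normal(500)
grid = np.linspace(0.05, 0.95, 19)
m_hat = conditional_moment(grid, X, Y, h=0.08)
v_hat = conditional_variance(grid, X, Y, h=0.08)
```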


The covariance between $\widehat\mu_{p,n,h}(x)$ and $\widehat f_{X,n,h}(x)$ is $(nh)^{-1} m_p(x) f_X(x)\kappa_2$ and the variances of the estimators of $\mu_p(x)$ and $m_p(x)$ are
$$v_{\mu_p,n,h}(x) = (nh)^{-1}\{w_p(x)\kappa_2 + o(1)\}, \qquad v_{m_p,n,h}(x) = (nh)^{-1}\{\kappa_2 f_X^{-2}(x)\{\sigma^2_{\mu_p}(x) - m_p^2(x)f(x)\} + o(1)\}.$$

The first derivative of the kernel $K_h$ is $K_h^{(1)}(u) = h^{-2}K^{(1)}(h^{-1}u)$ and its derivative of order $k$ is $K_h^{(k)}(u) = h^{-(k+1)}K^{(k)}(h^{-1}u)$. The estimator of the first derivative of the regression function $m$ is
$$\widehat m_{n,h}^{(1)}(x) = \frac{\sum_{i=1}^n Y_i K_h^{(1)}(x-X_i)}{\sum_{i=1}^n K_h(x-X_i)} - \frac{\{\sum_{i=1}^n Y_i K_h(x-X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x-X_i)\}}{\{\sum_{i=1}^n K_h(x-X_i)\}^2} = \widehat f_{X,n,h}^{-1}(x)\{\widehat\mu_{n,h}^{(1)}(x) - \widehat m_{n,h}(x)\widehat f_{X,n,h}^{(1)}(x)\}, \qquad (3.15)$$
and all consecutive derivatives of this expression. The first derivatives
$$\widehat\mu_{n,h}^{(1)}(x) = n^{-1}\sum_{i=1}^n Y_i K_h^{(1)}(x-X_i) \quad\text{and}\quad \widehat f_{X,n,h}^{(1)}(x) = n^{-1}\sum_{i=1}^n K_h^{(1)}(x-X_i)$$
converge uniformly on $I_{X,h}$ to their expectations $\mu_{n,h}^{(1)}(x) = h^{-1}E\,Y K_h^{(1)}(x-X)$ and $f_{X,n,h}^{(1)}(x)$, respectively, where
$$f_{X,n,h}^{(1)}(x) = f_X^{(1)}(x) + \frac{h^2}{2} m_{2K} f_X^{(3)}(x) + o(h^2),$$
$$\mu_{n,h}^{(1)}(x) = h^{-1}\int y K_h^{(1)}(u-x)f_{X,Y}(u,y)\,du\,dy = (mf_X)^{(1)}(x) + \frac{h^2 m_{2K}}{2}(mf_X)^{(3)}(x) + o(h^2),$$
then $\widehat m_{n,h}^{(1)}$ converges uniformly to $f_X^{-1}(x)\{(mf_X)^{(1)} - m f_X^{(1)}\} = m^{(1)}$, as $h$ tends to zero. The moments of the derivatives of the kernel estimator for the regression function are presented in Chapter 3; here the proofs are detailed. The bias of $\widehat m_{n,h}^{(1)}(x)$ is obtained by an application of Proposition 3.1 to Equation (3.15)
$$b(\widehat m_{n,h}^{(1)}(x)) = \frac{h^2}{2} m_{2K} f_X^{-1}(x)\{(mf_X)^{(3)} - m f_X^{(3)}\}(x) + o(h^2).$$

Proposition 3.7.
$$Var\{\widehat m_{n,h}^{(1)}(x)\} = (nh^3)^{-1} f^{-1}(x)\sigma^2(x)\int K^{(1)2}(u)\,du + o((nh^3)^{-1}).$$
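The following sketch evaluates the derivative estimator (3.15) with a Gaussian kernel, whose derivative is available in closed form; the kernel choice and the data are illustrative assumptions rather than part of the text.

```python
import numpy as np

def m_first_derivative(x, X, Y, h):
    """Estimator (3.15): f_hat^{-1} { mu_hat^(1) - m_hat f_hat^(1) },
    with K_h^(1)(u) = h^{-2} K^(1)(u/h) and K the Gaussian kernel."""
    x = np.atleast_1d(x)
    U = (x[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi)
    K1 = -U * K                                  # K^(1)(u) for the Gaussian kernel
    f_hat = (K / h).mean(axis=1)                 # f_{n,h}(x)
    mu_hat = (K / h * Y).mean(axis=1)            # mu_{n,h}(x)
    f1_hat = (K1 / h**2).mean(axis=1)            # f^(1)_{n,h}(x)
    mu1_hat = (K1 / h**2 * Y).mean(axis=1)       # mu^(1)_{n,h}(x)
    m_hat = mu_hat / f_hat
    return (mu1_hat - m_hat * f1_hat) / f_hat
```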


Proof. The variance of $\widehat m_{n,h}^{(1)}(x) = \widehat f_{n,h}^{-1}(x)\{\widehat\mu_{n,h}^{(1)}(x) - \widehat m_{n,h}(x)\widehat f_{n,h}^{(1)}(x)\}$ is obtained by approximations similar to (3.2) in Proposition 3.1
$$\widehat f_{n,h}^{-1}(x)\widehat\mu_{n,h}^{(1)}(x) = f_{n,h}^{-1}(x)\mu_{n,h}^{(1)}(x) + f_{n,h}^{-1}(x)\{\widehat\mu_{n,h}^{(1)}(x) - \mu_{n,h}^{(1)}(x)\} - f_{n,h}^{-2}(x)\mu_{n,h}^{(1)}(x)\{\widehat f_{n,h}(x) - f_{n,h}(x)\} + r_{1n}(x),$$
$$\frac{\widehat m_{n,h}(x)\widehat f_{n,h}^{(1)}(x)}{\widehat f_{n,h}(x)} = \frac{m_{n,h}(x) f_{n,h}^{(1)}(x)}{f_{n,h}(x)} + \frac{f_{n,h}^{(1)}(x)\{\widehat\mu_{n,h}(x) - \mu_{n,h}(x)\}}{f_{n,h}^2(x)} - 2\,\frac{m_{n,h}(x) f_{n,h}^{(1)}(x)\{\widehat f_{n,h}(x) - f_{n,h}(x)\}}{f_{n,h}^2(x)} + \frac{m_{n,h}(x)\{\widehat f_{n,h}^{(1)}(x) - f_{n,h}^{(1)}(x)\}}{f_{n,h}(x)} + r_{2n}(x),$$
where $r_{1n}(x) = o_{L^2}(|\widehat f_{n,h}(x)-f_{n,h}(x)|) + o_{L^2}(|\widehat\mu_{n,h}^{(1)}(x)-\mu_{n,h}^{(1)}(x)|)$ and $r_{2n}(x) = o_{L^2}(|\widehat f_{n,h}^{(1)}(x)-f_{n,h}^{(1)}(x)|) + o_{L^2}(|\widehat f_{n,h}(x)-f_{n,h}(x)|) + o_{L^2}(|\widehat\mu_{n,h}(x)-\mu_{n,h}(x)|)$. For functions $\mu$ and $f$ in $C^3(I_X)$, Lemma 2.1 implies the following approximations for the bias of the estimators of $f^{(1)}$ and $\mu^{(1)}$
$$b_{f^{(1)},n,h} = h^2 b_{f^{(1)}} + o(h^2), \qquad b_{f^{(1)}} = \frac12 f^{(3)}\int u^2 K^{(1)}(u)\,du,$$
$$b_{\mu^{(1)},n,h} = h^2 b_{\mu^{(1)}} + o(h^2), \qquad b_{\mu^{(1)}} = \frac12 \mu^{(3)}\int u^2 K^{(1)}(u)\,du,$$
their variances are
$$v_{f^{(1)},n,h} = (nh^3)^{-1}\sigma^2_{f^{(1)}}(x) + o((nh^3)^{-1}), \qquad \sigma^2_{f^{(1)}}(x) = f(x)\int K^{(1)2}(u)\,du,$$
$$v_{\mu^{(1)},n,h}(x) = (nh^3)^{-1}\sigma^2_{\mu^{(1)}}(x) + o((nh^3)^{-1}), \qquad \sigma^2_{\mu^{(1)}}(x) = w_2(x)f(x)\int K^{(1)2}(u)\,du,$$
and their covariances are
$$Cov\{\widehat f_{n,h}^{(1)}(x), \widehat\mu_{n,h}^{(1)}(x)\} = (nh^3)^{-1}\mu(x)\int K^{(1)2}(u)\,du,$$
$$Cov\{\widehat f_{n,h}^{(1)}(x), \widehat f_{n,h}(x)\} = (nh^2)^{-1} f(x)\int K(u)K^{(1)}(u)\,du,$$
$$Cov\{\widehat\mu_{n,h}(x), \widehat\mu_{n,h}^{(1)}(x)\} = (nh^2)^{-1} w_2(x)f(x)\int K(u)K^{(1)}(u)\,du,$$
$$Cov\{\widehat\mu_{n,h}(x), \widehat f_{n,h}^{(1)}(x)\} = (nh^2)^{-1}\mu(x)\int K(u)K^{(1)}(u)\,du = Cov\{\widehat f_{n,h}(x), \widehat\mu_{n,h}^{(1)}(x)\},$$


it follows that the variance of $\widehat m_{n,h}^{(1)}(x)$ is a $O((nh^3)^{-1})$
$$Var\{\widehat m_{n,h}^{(1)}(x)\} = \Big[f_{n,h}^{-2}\,Var\{\widehat\mu_{n,h}^{(1)}(x)\} + f_{n,h}^{-2}(x)\,m_{n,h}^2(x)\,Var\{\widehat f_{n,h}^{(1)}(x)\} - 2\, m_{n,h}(x) f_{n,h}^{-2}\,Cov\{\widehat f_{n,h}^{(1)}(x), \widehat\mu_{n,h}^{(1)}(x)\}\Big]\{1 + o(1)\}.$$

The convergence rate of $\widehat m_{n,h}^{(1)}$ is $(nh^3)^{1/2}$ and the optimal global bandwidth for the estimation of $m^{(1)}$ follows. For the second derivative
$$\widehat m_{n,h}^{(2)}(x) = \frac{\sum_{i=1}^n Y_i K_h^{(2)}(x-X_i)}{\sum_{i=1}^n K_h(x-X_i)} - 2\,\frac{\{\sum_{i=1}^n Y_i K_h^{(1)}(x-X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x-X_i)\}}{\{\sum_{i=1}^n K_h(x-X_i)\}^2} - \frac{\{\sum_{i=1}^n Y_i K_h(x-X_i)\}\{\sum_{i=1}^n K_h^{(2)}(x-X_i)\}}{\{\sum_{i=1}^n K_h(x-X_i)\}^2} + 2\,\frac{\{\sum_{i=1}^n Y_i K_h(x-X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x-X_i)\}^2}{\{\sum_{i=1}^n K_h(x-X_i)\}^3},$$
the estimators $\widehat f_{n,h}^{(2)}$ and $\widehat\mu_{n,h}^{(2)}(x) = n^{-1}\sum_{i=1}^n Y_i K_h^{(2)}(x-X_i)$ converge uniformly to $f^{(2)}$ and $\mu^{(2)}$, respectively, with respective biases $\frac{h^2}{2} m_{2K} f_X^{(4)}(x) + o(h^2)$ and $\frac{h^2}{2} m_{2K}\mu^{(4)}(x) + o(h^2)$. The result extends to a derivative of order $k \ge 1$, generalizing Proposition 2.3.

Proposition 3.8. Under Conditions 2.2 and 3.1 with $nh^{2k+2s+1} = O(1)$, for $k \ge 1$, and functions $m$ and $f_X$ in class $C^s(I_X)$, the estimator $\widehat m_{n,h}^{(k)}$ is a uniformly consistent estimator of the $k$-order derivative of the regression function, its bias is a $O(h^s)$ and its variance a $O((nh^{2k+1})^{-1})$; the optimal bandwidth is a $O(n^{-\frac{1}{2s+2k+1}})$ and the optimal $L^2$-risk is $O(n^{-\frac{2s}{2s+2k+1}})$.

The nonparametric estimator (3.1) is often used in nonparametric time series models with correlated errors. The bias is unchanged and the variance of the estimator depends on the covariances between the observation errors, $E(\varepsilon_i\varepsilon_{i+a}) = \beta_a$, for a weakly stationary process $(Y_i)_i$ corresponding to correlated measurements of $Y = m(X) + \varepsilon$. The variance $\sigma_f^2$ is then replaced by $S = \sigma_f^2 + 2\sum_{a\ge1}\beta_a$, assumed to be finite (Billingsley, 1968). A consistent estimator of $S$ was defined by $\widehat S_m = \sum_{i=-m}^m\widehat\beta_i$, where the correlation is estimated by the mean correlation error, with a mean over the lag between the terms of the product and a sum over observations, and $n^{-1}m^2$ tends to zero (Herrmann et al., 1992).


3.4 Weak convergence of the estimator

The weak convergence of the process $U_{n,h} = (nh)^{1/2}\{\widehat m_{n,h} - m\}1_{\{I_{X,h}\}}$ relies on bounds for the moments of its increments, which are first proved as in Lemma 2.2 for the increments of the centered process defined by the kernel estimator, with a kernel having the compact support $[-1,1]$. For a function or a process $\varphi$ defined on $I_{X,h}$, let $\Delta\varphi(x,y) = \varphi(x) - \varphi(y)$.

Lemma 3.3. Under Condition 3.1, there exist positive constants $C_1$ and $C_2$ such that for every $x$ and $y$ in $I_{X,h}$ satisfying $|x-y| \le 2h$
$$E|\Delta(\widehat\mu_{n,h} - \mu_{n,h})(x,y)|^2 \le C_1 (nh^3)^{-1}|x-y|^2, \qquad E|\Delta(\widehat m_{n,h} - m_{n,h})(x,y)|^2 \le C_2 (nh^3)^{-1}|x-y|^2;$$
if $|x-y| > 2h$, they are $O((nh)^{-1})$ and the estimators at $x$ and $y$ are independent.

Proof. Let $x$ and $y$ in $I_{X,h}$ such that $|x-y| \le 2h$; $E|\widehat\mu_{n,h}(x) - \widehat\mu_{n,h}(y)|^2$ develops as the sum
$$n^{-1}\int w_2(u)\{K_h(x-u) - K_h(y-u)\}^2 f(u)\,du + (1 - n^{-1})\{\mu_{n,h}(x) - \mu_{n,h}(y)\}^2.$$
For an approximation of the integral, the Mean Value Theorem implies $K_h(x-u) - K_h(y-u) = (x-y)K_h^{(1)}(z-u)$ where $z$ is between $x$ and $y$, and
$$\int\{K_h(x-u) - K_h(y-u)\}^2 w_2(u)f(u)\,du = (x-y)^2\int K_h^{(1)2}(z-u)\,w_2(u)f(u)\,du = (x-y)^2 h^{-3}\Big\{w_2(x)f(x)\int K^{(1)2} + o(h)\Big\}.$$
The order of the second moment $E|\widehat f_{n,h}(x) - \widehat f_{n,h}(y)|^2$ is therefore a $O((x-y)^2(nh^3)^{-1})$ if $|x-y| \le 2h$, and it is the sum of the variances of $\widehat\mu_{n,h}(x)$ and $\widehat\mu_{n,h}(y)$ otherwise. This bound and Lemma 2.2 imply the same orders for the estimator of the regression function $m$.

Theorem 3.1. For $h > 0$, the process $U_{n,h} = (nh)^{1/2}\{\widehat m_{n,h} - m\}1_{\{I_{X,h}\}}$ converges in distribution to $\sigma_m W_1 + \gamma^{1/2} b_m$, where $W_1$ is a centered Gaussian process on $I_X$ with variance 1 and covariances zero.


Proof. For any $x \in I_{X,h}$, from the approximation (3.2) of Proposition 3.1 and the weak convergences of $\widehat\mu_{n,h} - \mu_{n,h}$ and $\widehat f_{X,n,h} - f_{X,n,h}$, the variable $U_{n,h}(x)$ develops as $(nh)^{1/2}\{\widehat m_{n,h}(x) - m_{n,h}(x)\} + (nh^5)^{1/2} b_m(x) + o((nh^5)^{1/2})$, and it converges to a non-centered distribution $\{W + \gamma^{1/2} b_m\}(x)$, where $W(x)$ is the Gaussian variable with expectation zero and variance $\sigma_m^2(x)$. In the same way, the finite dimensional distributions of the process $U_{n,h}$ converge weakly to those of $\{W + \gamma^{1/2} b_m\}$, where $W$ is a Gaussian process with the same distribution as $W(x)$ at $x$. The covariance matrix $\{\sigma^2(x_k,x_l)\}_{k,l=1,\dots,m}$ between the components $W(x_k)$ and $W(x_l)$ of the limiting process is the limit of
$$Cov\{U_{n,h}(x_k), U_{n,h}(x_l)\} = \frac{nh}{f_X(x_k)f_X(x_l)}\big[Cov\{\widehat\mu_{n,h}(x_k), \widehat\mu_{n,h}(x_l)\} - m(x_k)Cov\{\widehat f_{X,n,h}(x_k), \widehat\mu_{n,h}(x_l)\} - m(x_l)Cov\{\widehat\mu_{n,h}(x_k), \widehat f_{X,n,h}(x_l)\} + m(x_k)m(x_l)Cov\{\widehat f_{X,n,h}(x_k), \widehat f_{X,n,h}(x_l)\} + o(1)\big],$$
where the $o(1)$ is deduced from Propositions 3.1, 3.2 and 3.3. For all integers $k$ and $l$, let $\alpha_h = |x_l - x_k|/(2h)$ and let $v = \{(x_l+x_k)/2 - s\}/h$ be in $[0,1]$, hence $h^{-1}(x_k - s) = v - \alpha_h$ and $h^{-1}(x_l - s) = v + \alpha_h$. By a Taylor expansion in a neighborhood of $(x_l+x_k)/2$, the integral of the first covariance term develops as
$$Cov\{\widehat\mu_{n,h}(x_k), \widehat\mu_{n,h}(x_l)\} = n^{-1}h^{-1}\, w_2\Big(\frac{x_k+x_l}{2}\Big) f_X\Big(\frac{x_k+x_l}{2}\Big)\int K(v-\alpha_h)K(v+\alpha_h)\,dv + o(n^{-1}h^{-1}),$$
and zero otherwise, with the notation (3.10). Similar expansions are satisfied for the other terms of the covariance. Using the approximations, for $|x_k - x_l| \le 2h$, $w_2(\{x_k+x_l\}/2) = w_2(x_k) + o(1) = w_2(x_l) + o(1)$ and $f_X(\{x_k+x_l\}/2) = f_X(x_k) + o(1) = f_X(x_l) + o(1)$, the covariance of $U_{n,h}(x_k)$ and $U_{n,h}(x_l)$ is approximated by
$$\tfrac12\{Var(Y\,|\,X=x_k) + Var(Y\,|\,X=x_l)\}\,1_{\{0\le\alpha_h<1\}}\int K(v-\alpha_h)K(v+\alpha_h)\,dv + o(1),$$
which tends to zero as soon as $|x_k - x_l| > 2h$. For every $a$ in $I_{X,h}$, every $\eta > 0$ and $c > \gamma^{1/2}|b_m(a)| + (2\eta^{-1}\sigma^2(a))^{1/2}$, then
$$\Pr\{|U_{n,h}(a)| > c\} \le \Pr\{(nh)^{1/2}|(\widehat m_{n,h} - m_{n,h})(a)| + (nh)^{1/2}|b_{n,h}(a)| > c\} \le \frac{Var\{(nh)^{1/2}(\widehat m_{n,h} - m_{n,h})(a)\}}{\{c - (nh)^{1/2}|b_{n,h}(a)|\}^2},$$
and for $n$ sufficiently large
$$\Pr\{|U_{n,h}(a)| > c\} \le \frac{\sigma^2(a)}{\{c - \gamma^{1/2}|b_m(a)|\}^2} + o(1) < \eta.$$
The process $U_{n,h}$ is written $W_{n,h} + (nh)^{1/2} b_{n,h}$, where
$$(b_{n,h}(x) - b_{n,h}(y))^2 \le k h^{2s}(x-y)^{2s} = O((nh)^{-1})(x-y)^{2s}$$
and $W_{n,h} = (nh)^{1/2}(\widehat m_{n,h} - m_{n,h})$. From Lemma 3.3, there exists a constant $C_W$ such that $|x-y| \le 2h$ entails $E\{W_{n,h}(x) - W_{n,h}(y)\}^2 \le C_W h^{-2}|x-y|^2$, which implies the tightness of the process $U_{n,h}$ and its weak convergence to a continuous Gaussian process defined on $I_X$.

Note that the tightness of the process implies, for every $\eta > 0$, the existence of a constant $c_\eta > 0$ such that
$$\Pr\big\{\sup_{I_{X,h}}|\sigma_m^{-1}(U_{n,h} - \gamma^{1/2} b_m) - W_1| > c_\eta\big\} \le \eta.$$
The limiting distribution of the process $U_{n,h}$ does not depend on the bandwidth $h$, so one can state the following corollary.

Corollary 3.1. $\sup_{h>0:\,nh^{2s+1}\to\gamma}\ \sup_{I_{X,h}}\sigma_m^{-1}|U_{n,h} - \gamma^{1/2} b_m|$ converges in distribution to $\sup_{I_X}|W_1|$.

A uniform confidence interval for the regression curve $m$ is deduced as for the density.
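The limiting Gaussian behaviour can be used for approximate confidence bands. A rough numerical sketch, assuming a Gaussian kernel, ignoring the bias term (undersmoothing) and taking a user-supplied quantile z in place of the exact quantile of $\sup|W_1|$ — all of which are illustrative assumptions — is:

```python
import numpy as np

def nw_confidence_band(x, X, Y, h, z=2.81):
    """Approximate band m_hat(x) +/- z * { kappa_2 sigma^2(x) / (n h f_hat(x)) }^{1/2},
    based on the asymptotic variance of Theorem 3.1; z is a crude stand-in for the
    quantile of the supremum of the limiting process."""
    x = np.atleast_1d(x)
    n = X.size
    K = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2) / np.sqrt(2.0 * np.pi) / h
    f_hat = K.mean(axis=1)
    m_hat = (K * Y).mean(axis=1) / f_hat
    sigma2 = np.clip((K * Y**2).mean(axis=1) / f_hat - m_hat**2, 0.0, None)
    kappa2 = 1.0 / (2.0 * np.sqrt(np.pi))        # integral of K^2 for the Gaussian kernel
    half = z * np.sqrt(kappa2 * sigma2 / (n * h * f_hat))
    return m_hat - half, m_hat + half
```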

3.5 Estimation of a regression function on R^d

Let X be a multidimensional variable with density fX on IX ⊂ Rd and let (X, Y ) be a variable with joint density fX,Y on IX,Y ⊂ Rd+1 and satisfying the regression model (1.6) Y = m(X) + σε


where $m$ is a bounded function on $I_X$, the real error variable $\varepsilon$ has the conditional expectation $E(\varepsilon\,|\,X) = 0$ and a constant conditional variance $Var(\varepsilon\,|\,X) = \sigma^2$. A kernel estimator of the regression function $m$ is defined from a sample $(X_i, Y_i)_{i=1,\dots,n}$ of the variable $(X,Y)$, for every $x$ in $I_{X,h} \subset I_X$, as
$$\widehat m_{n,h}(x) = \frac{n^{-1}\sum_{i=1}^n Y_i\prod_{j=1}^d K_{j,h}(x_j - X_{ij})}{\widehat f_{X,n,h}(x)} = \frac{\widehat\mu_{n,h}(x)}{\widehat f_{X,n,h}(x)},$$
where $K = \prod_{j=1}^d K_j$ is a kernel on $\mathbb{R}^d$ such that $K_j$ satisfies the Conditions 2.1 or 2.2, for $j = 1,\dots,d$. The numerator $\widehat\mu_{n,h}(x)$ has the expectation
$$\mu_{n,h}(x) = \int_{I_{X,Y}} y K_h(x-s) f_{X,Y}(s,y)\,ds\,dy = \int_{I_X} m(s) K_h(x-s) f_X(s)\,ds$$
and it converges to $\mu(x) = m(x)f_X(x)$; the estimator $\widehat m_{n,h}(x)$ has the expectation $m_{n,h}(x)$ which converges to $m(x)$. Proposition 3.1 extends to the multidimensional case, with similar notations $m_{2Kd} = \prod_{j=1}^d m_{2K_j}$, $\kappa_{2d} = \prod_{j=1}^d\kappa_{2,j}$ and $\mu^{(s)}(x) = \sum_{j_1,\dots,j_s=1}^d\frac{\partial^s\mu(x)}{\partial x_{j_1}\cdots\partial x_{j_s}}$.

Proposition 3.9. Under Conditions 2.1(1), 2.1(2) and 3.1, with $h_n$ tending to zero and $nh_n^d$ tending to infinity as $n$ tends to infinity:
(a) the estimator $\widehat m_{n,h}(x)$ is a.s. uniformly consistent on $I_{X,h}$ and the process $(nh^d)^{\frac12}\{\widehat m_{n,h}(x) - m_{n,h}(x)\}$ has the approximation (3.2);
(b) the bias of the estimators $\widehat\mu_{n,h}(x)$ and $\widehat m_{n,h}(x)$ of functions $\mu$ and $m$ of $C^s(I_X)$ is uniformly approximated on $I_{X,h}$ by
$$b_{\mu,n,h}(x) = h^s b_\mu(x) + o(h^s), \qquad b_\mu(x) = \frac{m_{sKd}}{s!}\mu^{(s)}(x),$$
$$b_{m,n,h}(x) = h^s b_m(x) + o(h^s), \qquad b_m(x) = f_X^{-1}(x)\{b_\mu(x) - m(x)b_f(x)\} = \frac{m_{sKd}}{s!} f_X^{-1}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}.$$

The covariance between $\widehat\mu_{n,h}(x)$ and $\widehat f_{X,n,h}(x)$ is
$$Cov_{\mu,f_X,n,h}(x) = (nh^d)^{-1}\{Cov_{\mu,f_X}(x) + o(1)\}, \qquad Cov_{\mu,f_X}(x) = \kappa_{2d}\,\mu(x) = \kappa_{2d}\, m(x)f_X(x), \qquad (3.16)$$
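A sketch of the multivariate estimator with a product of Gaussian kernels (an assumption made only for the example; any kernels $K_j$ satisfying the conditions could replace them) is:

```python
import numpy as np

def product_kernel_regression(x, X, Y, h):
    """Nadaraya-Watson estimator on R^d with the product kernel prod_j K_{j,h}(x_j - X_{ij});
    h may be a scalar or a vector of d coordinate bandwidths."""
    x = np.atleast_2d(np.asarray(x, float))          # (m, d) evaluation points
    h = np.broadcast_to(np.asarray(h, float), (X.shape[1],))
    U = (x[:, None, :] - X[None, :, :]) / h          # (m, n, d)
    K = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi) / h
    W = K.prod(axis=2)                               # product over the d coordinates
    return (W * Y).sum(axis=1) / W.sum(axis=1)

# illustration with d = 2 (arbitrary simulated model and bandwidth)
rng = np.random.default_rng(1)
X = rng.uniform(size=(800, 2))
Y = X[:, 0] * np.sin(4.0 * X[:, 1]) + 0.2 * rng.standard_normal(800)
m_hat = product_kernel_regression([[0.5, 0.5]], X, Y, h=0.1)
```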


and their variances are
$$v_{\mu,n,h}(x) = (nh^d)^{-1}\{\sigma_\mu^2(x) + o(1)\}, \qquad \sigma_\mu^2(x) = \kappa_{2d}\,Var(Y\,|\,X=x)\,f_X(x) = \kappa_{2d}\,w_2(x)f_X(x),$$
$$v_{m,n,h}(x) = (nh^d)^{-1}\{\sigma_m^2(x) + o(1)\}, \qquad \sigma_m^2(x) = \kappa_{2d}\,f_X^{-1}(x)\{E(Y^2\,|\,X=x) - m^2(x)\} = \kappa_{2d}\,f_X^{-1}(x)\sigma^2(x),$$
and the covariance of $\widehat m_{n,h}(x)$ and $\widehat m_{n,h}(y)$ is a $o((nh^d)^{-1})$. For functions $\mu$ and $m$ in $C^s(I_X)$, the asymptotic mean squared error of $\widehat\mu_{n,h}(x)$ is
$$AMSE(\widehat\mu_{n,h};x) = (nh^d)^{-1}\sigma_\mu^2(x) + h^{2s} b_\mu^2(x);$$
it is minimum for the bandwidth function
$$h_n(x) = n^{-\frac{1}{2s+d}}\big[\{2s\,b_\mu^2(x)\}^{-1}\sigma_\mu^2(x)\big]^{\frac{1}{2s+d}}.$$

Theorem 3.2. Under Conditions 2.1, 2.2 and 3.1, for functions $\mu$ and $m$ in $C^s(I_X)$ and as $nh^{2s+d}$ converges to a constant $\gamma$, the process
$$U_{n,h} = (nh^d)^{\frac12}\{\widehat m_{n,h} - m\}1_{\{I_{X,h}\}}$$
converges weakly to $W_f + \gamma^{\frac12} b_f$, where $W_f$ is a continuous Gaussian process on $I_X$ with expectation zero and covariance $E\{W_f(x)W_f(x')\} = \delta_{x,x'}\sigma_f^2(x)$ at $x$ and $x'$.

Proposition 3.10. Under Conditions 2.1, 2.2 and 3.1, the estimator $\widehat m_{n,h}^{(k)}$ of the $k$-order derivative of a multivariate regression function of $C^s$ has a bias $O(h^s)$ and a variance $O((nh^{2k+d})^{-1})$; its optimal bandwidth is a $O(n^{-\frac{1}{2k+2s+d}})$ and the corresponding $L^2$-risk is $O(n^{-\frac{s}{2k+2s+d}})$.

Proposition 2.9 and the expansion (3.2) imply the following.

Proposition 3.11. Under Conditions 2.1, 2.2 and 3.1, and for every real $p \ge 0$, the kernel estimator $\widehat m_{n,h}$ has the $L^p$-risk
$$R_p(\widehat m_n, m) = O(n^{-\frac{s}{2s+d}}),$$
with $h_n = O(n^{-\frac{1}{2s+d}})$.

By generalization of Proposition 2.8, we obtain the minimax risk for the estimation of a regression function on $\mathbb{R}^d$.

Proposition 3.12. For a real $p > 1$ and an integer $s \ge 2$, a lower bound of the $L^p$-risk for the estimation of a regression function $m$ of the set $M_{s,p}(I_X) = \{m \in C^s(I_X); m^{(s)} \in L^p(I_X)\}$, in a subset $\widehat M_n$ of $F_{s,p}(I_X)$, is
$$\inf_{\widehat m_n\in\widehat M_n}\ \sup_{m\in M_{s,p}} R_p(\widehat m_n, m) = O(n^{-\frac{s}{2s+d}}).$$


By Propositions 3.11 and 3.12, the kernel estimator of a regression function on $\mathbb{R}^d$ is minimax.

3.6 Estimation of a regression curve by local polynomials

The regression function $m$ is approximated by a Taylor expansion of order $p$, for every $s$ in a neighborhood $V_{x,h}$ of a fixed $x$, with radius $h$,
$$m(s) = m(x) + (s-x)m'(x) + \cdots + \frac{(s-x)^p}{p!}\, m^{(p)}(x) + o((s-x)^p). \qquad (3.17)$$
This expansion is a local polynomial regression where the derivatives at $x$ are considered as parameters. Estimating the derivatives by the derivatives of the estimator $\widehat m_{n,h}$ yields an estimator whose variance is a sum of terms of different orders, its main term being the variance of $\widehat m_{n,h}$. Let $(H_{k,h})_k$ be a square integrable orthonormal basis of real functions with respect to the distribution function of $X$, with support $V_{x,h}$ for $h$ converging to zero, and let $\delta_{k,l}$ be the Dirac indicator of equality of $k$ and $l$, $k,l \ge 0$. Equation (3.17) is also written
$$m(s) = \sum_{k=0}^p\theta_k(x) H_k(s-x) + o((s-x)^p) = m_p(x) + o((s-x)^p),$$
for $s$ in $V_{x,h}$, and the properties of the functional basis entail
$$E\{H_k(X-x)H_l(X-x)\} = \int H_k(s-x)H_l(s-x)\,dF(s) = \delta_{k,l}, \qquad k,l \ge 0.$$
In the regression model $E(Y\,|\,X) = m(X)$,
$$\theta_k(x) = E\{Y H_k(X-x)\} = E\{H_k(X-x)m(X)\}, \quad k \ge 1, \qquad m(x) = E\{Y H_0(X-x)\} = E\{H_0(X-x)m(X)\}.$$
For fixed $x$, $\theta_k(x)$ is considered as a constant parameter. This expansion is an extension of the kernel smoothing if the functional basis has regularity properties. The nonparametric regression function is approximated by an expansion on the first $p$ elements of the basis and its projections satisfy $\theta_k(x) = \int m(s)H_k(x-s)\,dF(s)$. The estimation of the parameters is performed by the projection of the observations of $Y$ onto the first $p$ elements of the orthonormal basis. Let $(X_i, Y_i)_{i=1,\dots,n}$ be a sample of the regression variables $(X,Y)$, so that $Y_i = m(X_i) + \varepsilon_i$, where $\varepsilon_i$ is an observation error having a finite variance $\sigma^2 = E\{Y - m(X)\}^2$ and such that $E(\varepsilon\,|\,X) = 0$.


An estimator of the parameter is defined as the empirical conditional mean of the projection of $Y$ onto the space generated by the basis. For $k \ge 1$,
$$\widehat\theta_{k,n}(x) = n^{-1}\sum_{i=1}^n Y_i H_k(X_i - x)$$
is therefore a consistent estimator of $\theta_k$. Its conditional variance is $n^{-1}\{E(Y^2\,|\,X)H_k^2(X-x) - \theta_k^2\}\{1 + o(1)\}$. This approach may be compared to the local polynomials defined by minimizing the local smoothed empirical mean squared error
$$ASE(x) = \sum_{i=1}^n\{Y_i - m_p(X_i,\theta)\}^2 K_h(X_i - x).$$
This provides an estimator of $\theta$ with components satisfying
$$\sum_{i=1}^n\{Y_i - m_p(X_i,\theta)\}H_k(X_i - x)K_h(X_i - x) = 0.$$
They are solutions of a system of linear equations and $\widehat\theta_{nk}$ is approximated by
$$\frac{\sum_{i=1}^n Y_i H_k(X_i - x)K_h(X_i - x)}{\sum_{i=1}^n K_h(X_i - x)},$$
if the orthogonality of the basis entails that $E\,H_k(X-x)H_l(X-x)K_h(X-x)$ converges to zero as $h$ tends to zero, for every $k \ne l \le p$. This estimator is consistent and its behavior is studied by the same method as the estimator of the nonparametric regression. A multidimensional regression function $m(X_1,\dots,X_d)$ can be expanded in sums of univariate regression functions $E(Y\,|\,X_k = x)$ and their interactions, like a nonparametric analysis of variance, if the regression variables $(X_1,\dots,X_d)$ generate orthogonal spaces. The orthogonality is a necessary condition for the estimation of the components of this expansion since
$$E\{Y K_h(x_k - X_k)\} = \int E(Y\,|\,X=x)\,F_X(dx_1,\dots,dx_{k-1},dx_{k+1},\dots,dx_d) + o(1) = m(x_k)f_k(x_k) + o(1),$$
$$E\{Y K_h(x_k - X_k)K_h(x_l - X_l)\} = C_K\, m(x_k,x_l) f_{X_k,X_l}(x_k,x_l) + o(1),$$
where $C_K = \int K(u)K(v)\,du\,dv$, and $E\{Y K_h(x_k - X_k)K_h(x_l - X_l)\}\, f_{X_k,X_l}^{-1}(x_k,x_l)$ can be factorized or expanded as a sum of regression functions only if $X_k$ and $X_l$ belong to orthogonal spaces. The orthogonalisation of the space generated by a vector variable $X$ can be performed by a preliminary principal component analysis providing orthogonal linear combinations of the initial variables.
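A numerical sketch of the local polynomial fit minimizing the smoothed mean squared error above — here with the monomial basis $(s-x)^k$ rather than an orthonormal basis $H_k$, and a Gaussian kernel, both illustrative choices — is:

```python
import numpy as np

def local_polynomial_fit(x0, X, Y, h, degree=1):
    """Weighted least squares of Y on the basis (X - x0)^k, k = 0..degree, with kernel
    weights K((X - x0)/h); theta[0] estimates m(x0) and k! * theta[k] estimates m^(k)(x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)                       # kernel weights
    B = np.vander(X - x0, N=degree + 1, increasing=True)         # columns (X - x0)^k
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * B, sw * Y, rcond=None)
    return theta
```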


3.7 Estimation in regression models with functional variance

Consider the nonparametric regression model with an observation error function of the regression variable $X$, $Y = m(X) + \sigma(X)\varepsilon$, defined by (1.7), with $E(\varepsilon\,|\,X) = 0$ and $Var(\varepsilon\,|\,X) = 1$. The conditional variance $\sigma^2(x) = E[\{Y - m(X)\}^2\,|\,X=x]$ is assumed to be continuous and it is estimated by a localisation of the empirical error in a neighborhood of $x$
$$\widehat\sigma^2_{n,h,\delta}(x) = \frac{\sum_{i=1}^n\{Y_i - \widehat m_{n,h}(X_i)\}^2\, 1_{\{X_i\in V_\delta(x)\}}}{\sum_{i=1}^n 1_{\{X_i\in V_\delta(x)\}}},$$
or by smoothing it with a kernel
$$\widehat\sigma^2_{n,h,\delta}(x) = \frac{\sum_{i=1}^n\{Y_i - \widehat m_{n,h}(X_i)\}^2 K_\delta(x - X_i)}{\sum_{i=1}^n K_\delta(x - X_i)}. \qquad (3.18)$$
The estimator is denoted $\widehat\sigma^2_{n,h,\delta}(x) = \widehat f_{X,n,\delta}^{-1}(x)\,\widehat S_{n,h,\delta}(x)$, with
$$\widehat S_{n,h,\delta}(x) = n^{-1}\sum_{i=1}^n\{Y_i - \widehat m_{n,h}(X_i)\}^2 K_\delta(x - X_i) = \int\{y - \widehat m_{n,h}(s)\}^2 K_\delta(x - s)\,d\widehat F_{X,Y,n}(s,y).$$
The expectation of $\widehat S_{n,h,\delta}(x)$ is denoted $S_{n,h,\delta}(x)$. By the uniform consistency of $\widehat m_{n,h}$, $\widehat S_{n,h,\delta}$ converges uniformly to $S$ as $n$ tends to infinity, with $h$ and $\delta$ tending to zero. At $X_j$, it is written
$$\widehat S_{n,h,\delta}(X_j) = n^{-1}\sum_{i\ne j}\{Y_i - \widehat m_{n,h}(X_i)\}^2 K_\delta(X_j - X_i) + o((nh)^{-1}).$$
The rate of convergence of $\delta$ to zero is governed by the degree of derivability of the variance function $\sigma^2$.

Condition 3.2. For a density $f_X$ in $C^r(I_X)$, a function $\mu$ in $C^s(I_X)$ and a variance $\sigma^2$ in $C^k(I_X)$, with $k,s,r \ge 2$, the bandwidth sequences $(\delta_n)_n$ and $(h_n)_n$ satisfy
$$\delta_n = O(n^{-\frac{1}{2k+1}}), \qquad h_n = O(n^{-\frac{1}{2(s\wedge r)+1}}),$$
as $n$ tends to infinity.

Proposition 3.13. Under Conditions 2.1, 2.2, 3.1 and 3.2, for every function $\mu$ in $C^s$, density $f_X$ in $C^r$ and variance function $\sigma^2$ in $C^k$,
$$E\{Y - \widehat m_{n,h}(x)\}^2 = \sigma^2(x) + O(h^{2(s\wedge r)}) + O((nh)^{-1}),$$


the bias of the estimator $\widehat S_{n,h,\delta}(x)$ of $\sigma^2(x)$ defined by (3.18) is
$$\beta_{n,h,\delta}(x) = b_{m,n,h}^2(x)f_X(x) + \sigma_{m,n,h}^2(x)f_X(x) + \frac{\delta^{2k}}{(k!)^2}\,(\sigma^2(x)f_X(x))^{(2)} + o(\delta^{2k} + h^{2(s\wedge r)} + (nh)^{-1}),$$
and its variance is written $(n\delta)^{-1}\{v_\sigma^2 + o(1)\}$ with $v_\sigma^2(x) = \kappa_2\,Var\{(Y-m(x))^2\,|\,X=x\}$. The process $(n\delta)^{\frac12}(\widehat\sigma^2_{n,h,\delta} - \sigma^2 - \beta_{n,h,\delta})$ converges weakly to a Gaussian process with expectation zero, variance $v_\sigma^2$ and covariances zero.

Proof. Using Proposition 2.2 and Lemma 3.2, the mean squared error for $\widehat m_{n,h}$ at $x$ is $E[\{Y - \widehat m_{n,h}(x)\}^2\,|\,X=x]$ and it is expanded as $\sigma^2(x) + b_{m,n,h}^2(x) + \sigma_{m,n,h}^2(x) + E[\{Y - m(x)\}\{m(x) - \widehat m_{n,h}(x)\}\,|\,X=x]$, where the last term is zero. For the variance of $\widehat S_{n,h,\delta}(x)$, the fourth conditional moment $E[\{Y - \widehat m_{n,h}(x)\}^4\,|\,X=x]$ is the conditional expectation of $\{(Y-m(x)) + (m - m_{n,h})(x) + (m_{n,h} - \widehat m_{n,h})(x)\}^4$ and it is expanded in a sum of $\sigma_4(x) = E\{(Y-m(x))^4\,|\,X=x\}$, a bias term $b_{m,n,h}^4(x) = O(h^{8(s\wedge r)})$, $E\{(m_{n,h} - \widehat m_{n,h})(x)\}^4 = O((nh)^{-1})$ by Proposition 3.1, and products of squared terms, the main of which being $\sigma^2(x)\,\|\widehat m_{n,h} - m\|_2^2(x)$ of order $O((nh)^{-1}) + O(h^{4(s\wedge r)})$, the others being smaller. The variance of $\widehat S_{n,h,\delta}(x)$ follows. Moreover, for every $i \ne j \le n$ and for every function $\psi$ in $C^2$ and integrable with respect to $F_X$, $E\psi(X_j)K_\delta^2(X_i - X_j) = \int\psi(x)K_\delta^2(x - x')\,dF_X(x)\,dF_X(x')$ equals $\kappa_2 E\psi(X) + o(\delta^2)$ and the main term of the variance does not depend on the bandwidth $\delta$.

The bandwidths $h_n$ and $\delta_n$ appear in the bias and the variance, therefore the mean squared error for the variance is minimum under Condition 3.2. Note that the function $m$ which achieves the minimum of the empirical mean squared error $V_{n,h}(x) = n^{-1}\sum_{i=1}^n K_h(x-X_i)\{Y_i - m(x)\}^2$ is the estimator $\widehat m_{n,h}$ of (3.1), and $V_{n,h}(x)$ converges in probability to $\sigma^2(x)$. In a parametric regression model with a Gaussian error having a constant variance, $V_n(x) = n^{-1}\sum_{i=1}^n\{Y_i - m(x)\}^2$ is the sufficient statistic for the estimation of the parameters of $m$. In a Gaussian regression model with a functional variance $\sigma^2(x)$, each term of the sum defining the error is normalized by a different variance $\sigma(X_i)$ and the sufficient statistic for the estimation of the parameters of the function $m$ is the weighted mean squared error
$$V_{w,n}(\theta) = n^{-1}\sum_{i=1}^n\sigma^{-1}(X_i)\{Y_i - m_\theta(X_i)\}^2.$$


For a nonparametric regression function, an empirical local mean weighted squared error is defined as
$$V_{w,n,h}(x) = n^{-1}\sum_{i=1}^n w(X_i)\{Y_i - m(x)\}^2 K_h(x - X_i),$$
with $w(x) = \sigma^{-1}(x)$. A weighted estimator of the nonparametric regression curve $m$ is then defined as
$$\widehat m_{w,n,h}(x) = \frac{\sum_{i=1}^n w(X_i) Y_i K_h(x - X_i)}{\sum_{i=1}^n w(X_i) K_h(x - X_i)}; \qquad (3.19)$$
if the variance is known, it achieves the minimum of $V_{w,n,h}(x)$. With an unknown variance, minimizing the weighted squared error leads to the estimator built with the estimated weight $\widehat w_n = \widehat\sigma^{-1}_{n,h_n,\delta_n}$, using (3.18),
$$\widehat m_{\widehat w_n,n,h}(x) = \frac{\sum_{i=1}^n\widehat w_n(X_i) Y_i K_h(x - X_i)}{\sum_{i=1}^n\widehat w_n(X_i) K_h(x - X_i)}. \qquad (3.20)$$
The uniform consistency of $\widehat w_{n,h}$ implies that $\sup_{I_{n,h}}|\widehat m_{\widehat w_n,n,h} - m_w|$ tends to zero as $n$ tends to infinity.

Assuming that $\sigma$ belongs to $C^2(I_X)$, the convergence results for $\widehat m_{n,h}$ in Propositions 3.1 and 3.2 adapt to the estimator (3.19), with $\mu_w = w\mu$ instead of $\mu$ and $w(x)f_{X,Y}(x,y)$ instead of $f_{X,Y}(x,y)$. The approximation (3.2) is unchanged, hence the bias and the variance of the weighted estimator $\widehat m_{w,n,h}$ are
$$b_{m,w,n,h}(x) = \frac{h^s m_{sK}}{s!\,w(x)f_X(x)}\{(mwf_X)^{(s)}(x) - m(x)(wf_X)^{(s)}(x)\} + o(h^s), \qquad v_{m,w,n,h}(x) = v_{m,n,h}(x).$$
In the approximations of Propositions 3.2 and 3.3, the order of convergence of $\sup_{x\in I_{X,h}}\|\widehat r_{n,h}\|_2$ is not modified and the weak convergence of Theorem 3.1 is fulfilled for the process $(nh)^{\frac12}\{\widehat m_{w,n,h} - m\}1_{\{I_{X,h}\}}$, with the modified bias and variance. With an estimated weight, the expectation of the numerator $\widehat\mu_{\widehat w_n,n,h}(x)$ is $E\,\widehat w_n(X)m(X)K_h(x-X)$ and it equals $\int E\,\widehat w_n(y)m(y)K_h(x-y)f_X(y)\,dy$, since $\widehat\sigma^2_{n,h_n,\delta_n}(X_i)$ is equivalent to the estimator of the variance at $X_i$ calculated from the observations without $X_i$. With an empirical weight $\widehat w_n(x) = \psi(\widehat\sigma^2_{n,h_n,\delta_n}(x))$, the expectation of the numerator of the estimator (3.20) is then $E N_n(x) = E\,w(X)m(X)K_h(x-X) + E\{(\widehat\sigma^2_{n,h_n,\delta_n} - \sigma^2)(X)\psi'(\sigma^2(X))m(X)K_h(x-X)\}\{1 + o(1)\}$ and the bias of the numerator of (3.20) is modified by adding $m(x)f_X(x)\beta_{n,h,\delta}(x)\psi'(\sigma^2(x))$ to the bias of the expression with a fixed weight $w$. In the same way,


the expectation of the denominator is $E D_n(x) = E\,w(X)K_h(x-X) + E\{(\widehat\sigma^2_{n,h_n,\delta_n} - \sigma^2)(X)\psi'(\sigma^2)(X)K_h(x-X)\}\{1 + o(1)\}$ and it is approximated by $f_X(x)\{w(x) + \beta_{n,h,\delta}(x)\psi'(\sigma^2(x))\}$. Using the approximation (3.2) of Proposition 3.1, the first order approximation of the bias of (3.20) is identical to $b_{m,w,n,h}(x)$. The variances of each term are
$$Var\,N_n(x) = \frac{\kappa_2}{nh}\, E\{\widehat w_n^2(x)E(Y^2\,|\,X=x)f_X(x)\} + o((nh)^{-1}),$$
$$Var\,D_n(x) = \frac{\kappa_2}{nh}\, E\{\widehat w_n^2(x)f_X(x)\} + o((nh)^{-1}),$$
$$Var\,\widehat m_{\widehat w_n,n,h}(x) = \frac{\kappa_2}{nh\,w^2(x)f_X(x)}\, Var(\widehat w_n(x)Y\,|\,X=x) + o((nh)^{-1}).$$

The variance of the estimator with an empirical weight is therefore modified by a random factor in the variance of Y and a normalization by w(x). The convergence rates are not modified.
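A sketch of the variance estimator (3.18) and of the weighted estimator (3.20) with the empirical weights $\widehat w_n = \widehat\sigma^{-1}$, assuming a Gaussian kernel for both bandwidths $h$ and $\delta$ (an illustrative choice):

```python
import numpy as np

def _weights(x, X, h):
    U = (np.atleast_1d(x)[:, None] - X[None, :]) / h
    return np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi) / h

def nw(x, X, Y, h):
    W = _weights(x, X, h)
    return (W * Y).sum(axis=1) / W.sum(axis=1)

def sigma2_hat(x, X, Y, h, delta):
    """Estimator (3.18): kernel smoothing of the squared residuals with bandwidth delta."""
    resid2 = (Y - nw(X, X, Y, h)) ** 2
    W = _weights(x, X, delta)
    return (W * resid2).sum(axis=1) / W.sum(axis=1)

def weighted_nw(x, X, Y, h, delta):
    """Estimator (3.20): Nadaraya-Watson estimator weighted by w_n(X_i) = sigma_hat(X_i)^{-1}."""
    w = 1.0 / np.sqrt(np.clip(sigma2_hat(X, X, Y, h, delta), 1e-8, None))
    W = _weights(x, X, h)
    return (W * (w * Y)).sum(axis=1) / (W * w).sum(axis=1)
```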

3.8 Estimation of the mode of a regression function

The mode of a real regression function $m$ on $I_X$ is
$$M_m = \sup_{I_X} m(x). \qquad (3.21)$$
The mode $M_m$ of a regular regression function is estimated by the mode $\widehat M_{m,n,h} = M_{\widehat m_{n,h}}$ of a regular estimator of the function. Under Conditions 2.1–3.1, the regression function is locally concave in a neighborhood $N_M$ of the mode and its estimator has the same property for $n$ sufficiently large, by the uniform consistency of $\widehat m_{n,h}$; hence $m^{(1)}(M_m) = 0$, $m^{(2)}(M_m) < 0$, $\widehat m_{n,h}^{(1)}(\widehat M_{m,n,h}) = 0$ and $\widehat M_{m,n,h}$ converges to $M_m$ in probability. A Taylor expansion of $m^{(1)}$ at the estimated mode implies
$$\widehat M_{m,n,h} - M_m = \{m^{(2)}(M_m)\}^{-1}\{m^{(1)}(\widehat M_{m,n,h}) - \widehat m_{n,h}^{(1)}(\widehat M_{m,n,h})\} + o(1).$$
The weak convergence of the process $(nh^3)^{\frac12}(\widehat m_{n,h}^{(1)} - m^{(1)})$ (Proposition 3.8) determines the convergence rate of $\widehat M_{m,n,h} - M_m$ as $(nh^3)^{\frac12}$ and it implies the asymptotic behavior of the estimator $\widehat M_{m,n,h}$.

Proposition 3.14. Under Conditions 2.1, 2.2 and 3.1, $(nh^3)^{\frac12}(\widehat M_{m,n,h} - M_m)$ converges weakly to a centered Gaussian variable with finite variance $m^{(2)-2}(M_m)\,Var\,\widehat m_{n,h}^{(1)}(M_m)$.


If the regression function belongs to $C^3(I_X)$, the bias of $\widehat m_{n,h}^{(1)}(\widehat M_{m,n,h})$ is deduced from the bias of the process $\widehat m_{n,h}^{(1)}$ defined by (3.15); it equals
$$E\,\widehat m_{n,h}^{(1)}(\widehat M_{m,n,h}) = -\frac{h^2}{2}\, m_{2K} f_X^{-1}(x)\{(mf_X)^{(3)} - m f_X^{(3)}\}(M_m) + o(h^2)$$
and does not depend on the degree of derivability of the regression function $m$. All results extend to the search of the local maxima and minima of the function $m$, the local minima being local maxima of $-m$. The maximization of the function on the interval $I_X$ is then replaced by sequential maximizations or minimizations.

3.9 Estimation of a regression function under censoring

Consider the nonparametric regression (1.6) where the variable $Y$ is right-censored by a variable $C$ independent of $(X,Y)$ and the observed variables are $(X, Y^*, \delta)$, where $Y^* = Y\wedge C$ and $\delta = 1_{\{Y\le C\}}$. Let $F_{Y|X}$ denote the distribution function of $Y$ conditionally on $X$. The regression function $m(x) = E(Y\,|\,X=x) = \int y\,F_{Y|X}(dy;x)$ is estimated using an estimator of the conditional distribution of $Y$ given $X$ under right-censoring. Extending the results of Section 2.10 to the nonparametric regression, the conditional distribution function $F_{Y|X}$ defines a cumulative conditional hazard function
$$\Lambda_{Y|X}(y;x) = \int 1_{\{s\le y\}}\{1 - F_{Y|X}(s;x)\}^{-1} F_{Y|X}(ds;x);$$
conversely, the function $\Lambda_{Y|X}$ uniquely defines the conditional distribution function as
$$1 - F_{Y|X}(y;x) = \exp\{-\Lambda^c_{Y|X}(y;x)\}\prod_{z\le y}\{1 - \Delta\Lambda_{Y|X}(z^-;x)\},$$
where $\Lambda^c_{Y|X}$ is the continuous part of $\Lambda_{Y|X}$ and $\prod_s\{1 - \Delta\Lambda(s)\}$ its right-continuous discrete part. Let
$$N_n(y;x) = \sum_{1\le i\le n} K_h(x - X_i)\,\delta_i\, 1_{\{Y_i\le y\}}, \qquad Y_n(y;x) = \sum_{1\le i\le n} K_h(x - X_i)\, 1_{\{Y_i^*\ge y\}}$$
be the counting processes related to the observations of the censored variable $Y^*$, with regressors in a neighborhood $V_h(x)$ of $x$, and let $J_n(y;x)$ be the indicator of $Y_n(y;x) > 0$. The process $M_n(y;x) = N_n(y;x) - \int_{-\infty}^y Y_n(s;x)\,d\Lambda_{Y|X}(s;x)$ is a centered martingale with respect to the filtration generated by the observed processes up to $y^-$, conditionally on


regressors in $V_h(x)$. The functions $\Lambda_{Y|X}$ and $F_{Y|X}$ are estimated by
$$\widehat\Lambda_{Y|X,n,h}(y;x) = \int 1_{\{s\le y\}}\frac{J_n(s;x)\,N_n(ds;x)}{Y_n(s;x)}, \qquad \widehat F_{Y|X,n,h}(y;x) = 1 - \prod_{Y_i\le y}\{1 - \Delta\widehat\Lambda_{Y|X,n,h}(Y_i;x)\};$$
the estimator $\widehat\Lambda_{Y|X,n,h}$ is unbiased and $\widehat F_{Y|X,n,h}$ is the Kaplan–Meier estimator of the distribution function of $Y$ conditional on $\{X=x\}$. The regression function $m$ is then estimated by
$$\widehat m_{n,h}(x) = \int y\,\widehat F_{Y|X,n,h}(dy;x) = \sum_{i=1}^n Y_i\{1 - \widehat F_{Y|X,n,h}(Y_i^-;x)\}\,\frac{J_n(Y_i;x)}{Y_n(Y_i;x)}.$$
The estimators are uniformly consistent: $\sup_{I_X\times I}|\widehat\Lambda_{Y|X,n,h} - \Lambda_{Y|X}|$, $\sup_{I_{X,Y}}|\widehat F_{Y|X,n,h} - F_{Y|X}|$ and $\sup_{I_X}|\widehat m_{n,h} - m|$ converge in probability to zero as $n$ tends to infinity, for every compact subinterval $I$ of $I_Y$. For every $y \le \max Y_i^*$, the conditional Kaplan–Meier estimator, given $x$ in $I_{X,n,h}$, still satisfies
$$\frac{F_{Y|X} - \widehat F_{Y|X,n,h}}{1 - F_{Y|X}}(y;x) = \int_{-\infty}^y\frac{1 - \widehat F_{Y|X,n,h}(s^-;x)}{1 - F_{Y|X}(s;x)}\,d(\widehat\Lambda_{Y|X,n,h} - \Lambda_{Y|X})(s;x).$$
The expectation of this integral with respect to a centered martingale is zero, so the conditional Kaplan–Meier estimator and $\widehat\Lambda_{Y|X,n}$ are unbiased estimators. The bias of the estimator of the regression function for censored variables $Y$ is then a $O(h^2)$.
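A sketch of the locally weighted product-limit construction above — Gaussian kernel weights, ties ignored, purely illustrative — computing $\widehat m(x_0)$ as the integral of $y$ with respect to the conditional Kaplan–Meier estimator:

```python
import numpy as np

def km_conditional_regression(x0, X, Ystar, delta, h):
    """m_hat(x0) = int y dF_hat_{Y|X}(y; x0) under right censoring:
    Ystar = min(Y, C), delta = 1{Y <= C}."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)          # K_h(x0 - X_i), constant factors cancel
    order = np.argsort(Ystar)
    t, d, w = Ystar[order], delta[order], w[order]
    at_risk = np.cumsum(w[::-1])[::-1]              # Y_n(t_i; x0)
    jump = np.where(at_risk > 0, d * w / at_risk, 0.0)   # Delta Lambda_hat(t_i; x0)
    surv = np.cumprod(1.0 - jump)                   # 1 - F_hat(t_i; x0)
    surv_left = np.concatenate(([1.0], surv[:-1]))  # 1 - F_hat(t_i-; x0)
    return np.sum(t * surv_left * jump)             # sum of t_i times the mass of F_hat at t_i
```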

3.10 Proportional odds model

Consider a regression model with a discrete response variable Y corresponding to a categorization of an unobserved continuous real variable Z in a partition (Ik )k≤K of its range, with the probabilities Pr(Z ∈ Ik ) = Pr(Y = k). With a regression variable X and intervals Ik = (ak−1 , ak ), the cumulated conditional probabilities are πk (X) = Pr(Y ≤ k | X) = Pr(Z ≤ ak | X), and EπK (X) = 1. The proportional odds model is defined through the logistic model for the probabilities πk (X) = p(ak − m(X)), with the logistic


probability $p(y) = \exp(y)/\{1 + \exp(y)\}$ and a regression function $m$. This model is equivalent to $\pi_k(X)\{1 - \pi_k(X)\}^{-1} = \exp\{a_k - m(X)\}$ for every function $\pi_k$ such that $0 < \pi_k(x) < 1$ for every $x$ in $I_X$ and for $1 \le k < K$. This implies that the odds-ratio for the observations $(X_i, Y_i)$ and $(X_j, Y_j)$ with $Y_i$ and $Y_j$ in the same class does not depend on the class,
$$\frac{\pi_k(X_i)\{1 - \pi_k(X_j)\}}{\{1 - \pi_k(X_i)\}\,\pi_k(X_j)} = \exp\{m(X_j) - m(X_i)\},$$
for every $k = 1,\dots,K$; this is the proportional odds model. For $k = 1,\dots,K$, let $p_k(x) = (\pi_k - \pi_{k-1})(x) = \Pr(Y = k\,|\,X = x)$. Assuming that $p_1(x) > 0$ for every $x$ in $I_X$, the conditional distribution of the discrete variable is also determined by the conditional probabilities $\alpha_k(x) = P(Y=k\,|\,X=x)\,P^{-1}(Y=1\,|\,X=x)$. Equivalently
$$P(Y=k\,|\,X=x) = \frac{\alpha_k(x)}{1 + \sum_{j=1}^K\alpha_j(x)}, \qquad k = 1,\dots,K,$$
with the constraint $\sum_{k=1}^K P(Y=k\,|\,X=x) = 1$ for every $x$. This reparametrization of the conditional probabilities $\alpha_k$ is not restrictive, though it is called the logistic model. Estimating first the support of the regression variable reduces the number of unknown parameters to $2(K-1)$, the thresholds of the classes and their probabilities, for $k \le K-1$, in addition to the nonparametric regression function $m$. The probability functions $\pi_k(x)$ are estimated by the proportions $\widehat\pi_{n,k}(x)$ of observations of the variable $Y$ in class $k$, conditionally on the regressor value $x$. Let
$$U_{ik} = \log\frac{\widehat\pi_{n,k}(X_i)}{1 - \widehat\pi_{n,k}(X_i)}, \qquad i = 1,\dots,n,$$
calculated from the observations $(X_i, Y_i)_{i=1,\dots,n}$ such that $Y_i = k$. The variations of the regression function $m$ between two values $x$ and $y$ are estimated by
$$\widehat m_{n,h}(x) - \widehat m_{n,h}(y) = K^{-1}\sum_{k=1}^K\Big\{\frac{\sum_{i=1}^n U_{ik} K_h(X_i - x)}{\sum_{i=1}^n K_h(X_i - x)} - \frac{\sum_{i=1}^n U_{ik} K_h(X_i - y)}{\sum_{i=1}^n K_h(X_i - y)}\Big\}.$$
This estimator yields an estimator for the derivative of the regression function, $\widehat m_{n,h}^{(1)}(x) = \lim_{|x-y|\to 0}(x-y)^{-1}\{\widehat m_{n,h}(x) - \widehat m_{n,h}(y)\}$, which is written as


the mean over the classes of the derivative estimator (3.15) with response variables $U_{ik}$. Integrating the mean derivative provides a nonparametric estimator of the regression function $m$. The bounds of the classes cannot be identified without observations of the underlying continuous variable $Z$, thus the odds ratio makes it possible to remove the unidentifiable parameters from the model for the observed variables. With a multidimensional regression variable $X$, the single-index model or a transformation model (Chapter 7) reduce the dimension of the variable and speed up the convergence of the estimators.

3.11 Estimation for the regression function of processes

Consider a continuously observed stationary and ergodic process $(Z_t)_{t\in[0,T]} = (X_t, Y_t)_{t\in[0,T]}$ with values in $I_{XY}$, and the regression model
$$Y_t = m(X_t) + \sigma(X_t)\varepsilon_t,$$
where $(\varepsilon_t)_{t\in[0,T]}$ is a conditional Brownian motion such that $E(\varepsilon_t\,|\,X_t) = 0$ and $E(\varepsilon_t\varepsilon_s\,|\,X_t\wedge X_s) = E\{(\varepsilon_t\wedge\varepsilon_s)^2\,|\,X_t\wedge X_s\} = 1$. The ergodicity property is expressed by (2.14) or (2.17) for the bivariate process $Z$. The regression function $m$ is estimated on an interval $I_{X,Y,T,h}$ by the kernel estimator
$$\widehat m_{T,h}(x) = \frac{\int_0^T Y_s K_h(x - X_s)\,ds}{\int_0^T K_h(x - X_s)\,ds}. \qquad (3.22)$$
Its numerator is denoted
$$\widehat\mu_{T,h}(x) = \frac{1}{T}\int_0^T Y_s K_h(x - X_s)\,ds,$$
and its denominator is $\widehat f_{X,T,h}(x)$. The expectation of $\widehat\mu_{T,h}(x)$ and its limit are respectively
$$\mu_{T,h}(x) = \int_{I_{XY}} y K_h(x-u)\,dF_{XY}(u,y), \qquad \mu(x) = \int_{I_{XY}} y f_{XY}(x,y)\,dy = f_X(x)m(x).$$
Under Conditions 2.1, 2.2 and 3.1, the bias of $\widehat\mu_{T,h}(x)$ is
$$b_{\mu,T,h}(x) = \int_{I_{XY,T}} y K_h(x-u)\,dF_{XY}(u,y) - \mu(x) = \frac{h_T^s}{s!}\, m_{sK}\,\mu^{(s)}(x) + o(h_T^s),$$


its variance is expressed through the integral of the covariance between $Y_s K_h(X_s - x)$ and $Y_t K_h(X_t - x)$. For $X_s = X_t$, the integral on the diagonal $D_X$ of $I^2_{X,T}$ is a $(Th_T)^{-1}\kappa_2 w_2(x) + o((Th_T)^{-1})$ and the integral outside the diagonal, denoted $I_o(T)$, is expanded using the ergodicity property (2.14). Let $\alpha_h(u,v) = |u-v|/(2h_T)$,
$$I_o(T) = \int_{[0,T]^2}\int_{I_{XY}^2\setminus D_X} y_1 y_2\, K_h(u-x)K_h(v-x)\,dF_{Z_s,Z_t}(u,y_1,v,y_2)\,\frac{ds}{T}\,\frac{dt}{T}$$
$$= (Th_T)^{-1}\Big\{\int_{I_X}\int_{I_X\setminus\{u\}}\Big(\int_{-\frac12}^{\frac12} K(z - \alpha_h(u,v))K(z + \alpha_h(u,v))\,dz\Big)\mu(u)\mu(v)\,d\pi_u(v)\,dF_X(u)\Big\}\{1 + o(1)\}.$$
For every fixed $u \ne v$, $\alpha_{h_T}(u,v)$ tends to infinity as $h_T$ tends to zero, then the integral $\int_{-\frac12}^{\frac12} K(z - \alpha_h(u,v))K(z + \alpha_h(u,v))\,dz$ tends to zero with $h_T$. If $|u-v| = O(h_T)$, this integral does not tend to zero but the transition probability $\pi_u(v)$ tends to zero as $h_T$ tends to zero, therefore the integral $I_o(T)$ is a $o((Th_T)^{-1})$ as $T$ tends to infinity. The $L^p$-norm of the estimator satisfies
$$\|\widehat\mu_{T,h}(x) - \mu_{T,h}(x)\|_p = O((Th_T)^{-\frac12})$$
under the ergodicity condition (2.17) for $k$-uplets of the process $Z$, and the approximation (3.2) is also satisfied for the estimator $\widehat m_{T,h}$. It follows that its bias, for $s \ge 2$, and its variance are approximated by
$$b_{m,T,h}(x;s) = h_T^s b_m(x;s) + o(h_T^s), \qquad b_m(x;s) = \frac{m_{sK}}{s!}\, f_X^{-1}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\},$$
$$v_{m,T,h}(x) = (Th_T)^{-1}\{\sigma_m^2(x) + o(1)\}, \qquad \sigma_m^2(x) = \kappa_2 f_X^{-1}(x)\sigma^2(x),$$
and the covariance between $\widehat m_{T,h}(x)$ and $\widehat m_{T,h}(y)$ tends to zero. The mean squared error of the estimator at $x$, for a marginal density in $C^s$, is then
$$MISE_{T,h_T}(x) = (Th_T)^{-1}\kappa_2 f_X^{-1}(x)\sigma^2(x) + h_T^{2s} b_m^2(x;s) + o((Th_T)^{-1}) + o(h_T^{2s}),$$
and the optimal local and global bandwidths minimizing the mean squared (integrated) errors are $O(T^{-\frac{1}{2s+1}})$
$$h_{AMSE,T}(x) = \Big\{\frac{1}{T}\,\frac{\sigma_m^2(x)}{2s\,b_m^2(x;s)}\Big\}^{\frac{1}{2s+1}}$$


and, for the asymptotic mean integrated squared error criterion,
$$h_{AMISE,T} = \Big\{\frac{1}{T}\,\frac{\int\sigma_m^2(x)\,dx}{2s\int b_m^2(x;s)\,dx}\Big\}^{\frac{1}{2s+1}}.$$
With the optimal bandwidth rate, the asymptotic mean (integrated) squared errors are $O(T^{-\frac{2s}{2s+1}})$. The same expansions as for the variances of $\widehat\mu_{T,h}(x)$ and $\widehat f_{X,T,h}(x)$ in Section 2.12 prove that the finite dimensional distributions of the process $(Th_T)^{\frac12}(\widehat f_{T,h} - f - b_{T,h})$ converge to those of a centered Gaussian process with expectation zero, covariances zero and variance $\kappa_2 f(x)$ at $x$. Lemma 3.3 generalizes and the increments are approximated as
$$E|\Delta(\widehat m_{T,h} - m_{T,h})(x,y)|^2 = O(|x-y|^2(Th_T^3)^{-1}),$$
for every $x$ and $y$ in $I_{X,h}$ such that $|x-y| \le 2h_T$. Then the process $(Th_T)^{\frac12}\{\widehat m_{T,h} - m\}1_{\{I_{X,T}\}}$ converges weakly to $\sigma_m W_1 + \gamma^{\frac12} b_m$, where $W_1$ is a centered Gaussian process on $I_X$ with variance 1 and covariances zero.
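For a process observed on a fine time grid, the time integrals in (3.22) can be approximated by Riemann sums; the following sketch assumes a Gaussian kernel and sampled observations $(X_{t_j}, Y_{t_j})$, both illustrative assumptions.

```python
import numpy as np

def process_regression(x, t, Xs, Ys, hT):
    """Discretized version of the estimator (3.22) for a continuously observed process."""
    x = np.atleast_1d(x)
    dt = np.diff(t, prepend=t[0])                   # quadrature weights of the time grid
    K = np.exp(-0.5 * ((x[:, None] - Xs[None, :]) / hT) ** 2) / np.sqrt(2.0 * np.pi) / hT
    num = (K * (Ys * dt)).sum(axis=1)               # approximates int_0^T Y_s K_h(x - X_s) ds
    den = (K * dt).sum(axis=1)                      # approximates int_0^T K_h(x - X_s) ds
    return num / den
```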

3.12 Exercises

(1) Detail the proof for the approximations of the biases and variances of Proposition 3.1.
(2) Suppose $Y$ is a binary variable with $P(Y=1\,|\,X=x) = p(x)$ and express the bias and the variance of the estimator of the nonparametric probability function $p$.
(3) Consider a discrete variable with values in an infinite countable set. Define an estimator of the function $m$ under suitable conditions and give the expression of its bias and variance.
(4) Define nonparametric estimators for the bias of the function $m$ and its variance.
(5) Define the optimal bandwidths for the estimation of the function $\mu$ and its first order derivative.
(6) Detail the expression of $\|\widehat m_{n,h}(x) - m(x)\|_p$ using the orders of the norms established in Section 3.2.
(7) Detail the expressions of the bias and the second order approximation of the variance of $\widehat\sigma^2_{n,h,\delta}(x)$ in Proposition 3.13.


(8) Let $F_{Y|X}(y;x) = \Pr(Y \le y\,|\,X \le x)$ be the distribution function of $Y$ conditionally on $X$ and let
$$\widehat F_{Y|X,n,h}(y;x) = n^{-1}\sum_{i=1}^n 1_{\{Y_i\le y\}} H_h(X_i - x)$$
be a smooth estimator of the conditional distribution function (see Exercise 2.11(6)). Find the expression of the bias and the variance of $\widehat F_{Y|X,n,h}(y;x)$.

Chapter 4

Limits for the Varying Bandwidths Estimators

4.1 Introduction

The pointwise mean squared error for a density or regression function reaches its minimum at a bandwidth function varying in the domain of the variable $X$. The question of the behavior of the estimators of density and regression functions with a varying bandwidth then arises. All results of Chapters 2 and 3 are modified by this function. Consider a density or a regression function of class $C^s(I_X)$. Let $(h_n)_n$ be a sequence of functional bandwidths in $C^1(I_X)$, converging uniformly to zero and uniformly bounded away from zero on $I_X$. In order to have an optimal bandwidth for the estimation of functions of class $C^2$, the functional sequence is assumed to satisfy a uniform convergence condition for the uniform norm $\|h_n\|$.

Condition 4.1. There exists a strictly positive function $h$ in $C^1(I_X)$, such that $\|h\|$ is finite and $\|nh_n^{2s+1} - h\|$ tends to zero as $n$ tends to infinity.

Under this condition, the bandwidth is uniformly approximated as
$$h_n(x) = n^{-\frac{1}{2s+1}}\, h^{\frac{1}{2s+1}}(x) + o(n^{-\frac{1}{2s+1}}).$$
The increasing intervals $I_{X,h_n}$ are now defined with respect to the uniform norm of the function $h_n$ by $I_{X,h_n} = \{s \in I_X; [s - \|h_n\|, s + \|h_n\|] \in I_X\}$. The main results of the previous chapters are extended to kernel estimators with functional bandwidth sequences satisfying this convergence rate. That is the case of the kernel estimators built with estimated optimal local bandwidths calculated from independent observations. The second point of this chapter is the definition of an adaptative estimator of the bandwidth, when the degree of derivability of the density varies in its domain of definition, and the behavior of the estimator of



the density with an adaptative estimator. In Chapter 2, the optimal bandwidth was obtained under the assumption that the degree of smoothness of the density is known and constant on the interval of the observations. The latter assumption flattens the estimated curve by the use of a too large bandwidth in areas with a smaller derivability order, and the above variable bandwidth $h_n(x)$ does not solve that problem. The cross-validation method makes it possible to define a global bandwidth without knowledge of the class of the density. Other adaptative methods are based on the maximal variations of the estimator as the bandwidth varies in a grid $D_n$ corresponding to a discretization of the possible domain of the bandwidth according to the order of regularity of the density. This can be performed globally or pointwise.

4.2 Estimation of densities

Let us consider the random process $U_{n,h_n}(x) = (nh_n(x))^{\frac12}\{\widehat f_{n,h_n(x)}(x) - f(x)\}$ for $x$ in $I_{X,h_n}$. Under Conditions 2.1 and 4.1, $\sup_I|\widehat f_{n,h_n}(x) - f(x)|$ converges a.s. to zero for every compact subinterval $I$ of $I_{X,h_n}$ and $\|\widehat f_{n,h_n(x)} - f(x)\|_p$ tends to zero, as $n$ tends to infinity. The bias of $\widehat f_{n,h_n(x)}(x)$ is $b_{n,h_n}(x) = \frac12 h_n^2(x) m_{2K} f^{(2)}(x) + o(\|h_n\|^2)$, its variance is $Var\{\widehat f_{n,h_n(x)}(x)\} = (nh_n(x))^{-1}\kappa_2 f(x) + o(n^{-1}\|h_n^{-1}\|)$ and $\|\widehat f_{n,h_n(x)}(x) - f_{n,h_n(x)}(x)\|_p = O((n^{-1}\|h_n^{-1}\|)^{\frac12})$.
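A sketch of the kernel density estimator with a bandwidth function $h_n(x)$ evaluated at the estimation point; the Gaussian kernel and the particular bandwidth function are illustrative assumptions, not prescriptions of the text.

```python
import numpy as np

def varying_bandwidth_density(x, X, h_fun):
    """f_hat(x) = n^{-1} sum_i K_{h_n(x)}(x - X_i) with a functional bandwidth h_fun."""
    x = np.atleast_1d(x)
    h = np.broadcast_to(np.asarray(h_fun(x), float), x.shape)
    U = (x[:, None] - X[None, :]) / h[:, None]
    K = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / h

# example: a local bandwidth of order n^{-1/5}, larger in the tails (arbitrary choice)
rng = np.random.default_rng(2)
X = rng.standard_normal(1000)
h_fun = lambda x: 0.9 * X.std() * X.size ** (-0.2) * (1.0 + 0.5 * np.abs(x))
f_hat = varying_bandwidth_density(np.linspace(-3.0, 3.0, 13), X, h_fun)
```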

Under Conditions 2.1–4.1, for a density of class C s (IX ) and for every x in IX,h , the moments of order p ≥ 2 are unchanged and the bias of fn,hn (x) (x) is modified as bn,hn (x; s) =

hsn (x) msK f (s) (x) + o(hn s ). s!

The MISE and the optimal local bandwidth are similar to those of Chapter 2 using these expressions. For every u in [−1, 1], let αn and v in [−1, 1], |u| in [0, {x + hn (x)} ∧ {y + hn (y)}] be defined by 1 −1 {(u − x)h−1 (4.1) n (x) − (u − y)hn (y)}, 2 1 −1 v = vn (x, y, u) = {(u − x)h−1 n (x) + (u − y)hn (y)} 2 1 [{hn (x) + hn (y)}u − xhn (y) − yhn (x)], = 2hn (x)hn (y) αn (x, y, u) =



u = un (x, y, v) = {hn (x) + hn (y)}−1 {xhn (y) + yhn (x) + 2vhn (x)hn (y)}, zn (x, y) = {hn (x) + hn (y)}−1 {xhn (y) + yhn (x)}, δn (x, y) = 2hn (x)hn (y){hn (x) + hn (y)}−1 = o(1), hence αn (x, y, u) is also denoted αn (x, y, v). Lemma 4.1. The covariance of fn,h (x) and fn,h (y)} equals  2 f (zn (x, y)) K(v − αn (v))K(v + αn (v)) dv n{hn (x) + hn (y)}

 (1) + δn (x, y)f (zn (x, y)) vK(v − αn (v))K(v + αn (v)) dv + o(hn ) . Proof.

The integral

EKhn (x) (x − X)Khn (y) (y − X) =

 Khn (x) (x − u)Khn (y) (y − u)fX (u) du

is expanded changing the variable u in v and it equals  2 K(v − αn (v))K(v + αn (v))f (un (x, y, v)) dv hn (x) + hn (y)  2 = f (zn (x, y)) K(v − αn (v))K(v + αn (v)) dv hn (x) + hn (y)

 (1) + δn (x, y)f (zn (x, y)) vK(v − αn (v))K(v + αn (v)) dv + o(hn ) .



Lemma 4.2. For functions of class C s (IX ), s ≥ 1, and under Conditions 3.1 and 4.1, for every x and y in IX,hn the mean variation of fn,hn between x and y has the order O(|x − y|) and its mean squared variation for −1 2 −1 3 2   h−1 |xh−1 n (x)−yhn (y)| ≤ 1 are E|fn,hn (x)− fn,h (y)| = O(n n  (x−y) ).   Otherwise, it is a O(n−1 h−1 n ) and the variables fn,h (x) and fn,h (y) are independent. Proof. By the Mean Value Theorem, for every x and y in IX,h there exists s between x and y such that |fn,hn (x) − fn,hn (y)| = |x − y|f (1) (s) and |fn,hn (x) − fn,hn (y)| ≤ |x − y|f (1) . Let z = limn zn (x, y) defined in (4.1). The expectation of |fn,h (x) −  fn,h (y)|2 develops as n−1 {Khn (x) (x − u) − Khn (y) (y − u)}2 f (u) du + (1 − n−1 ){fn,hn (x) − fn,hn (y)}2 .



Using the notations (4.1), the first term of this sum is expanded as  1 S1n = {hn (x)K(v − αn (v)) nhn (x)hn (y){hn (x) + hn (y)} − hn (y)K(v + αn (v))}2 f (zn (v)) dv. The derivability of the bandwidth functions implies  1 {hn (x)K(v − αn ) − hn (y)K(v + αn )}2 f (zn ) dv hn (x)hn (y) 0  hn (x) {K(v − αn ) − K(v + αn )}2 f (zn ) dv ≤2 hn (y) 1  {hn (x) − hn (y)}2 2 + K (v − αn )f (zn ) dv , hn (x)hn (y) 0  hn (x) 2 f (z) {K(v − αn ) − K(v + αn )}2 dv S1n ≤ n{hn (x) + hn (y)} hn (y)  (1)2 2 hn (η(x − y)) 2 + (x − y) K (v − αn )f (zn ) dv , hn (x)hn (y) (1)2

where η lies in (−1, 1), by the Mean Value Theorem, hn (η) and hn (x)hn (y) have the same order, and   2 2 2 {K(v − αn ) − K(v + αn )} dv = 4αn K (1)2 (v) dv = O(|x − y|2 h−1 n  ). 2 −1 2 Since h−1 It follows that S1n = O(n−1 h−1 n  |x − y| hn  ). n (x)|x| −1 −1 −1 12  and hn (y)|y| are bounded by 1, the order of E(nhn  ) |fn,h (x) − fn,h (y)|2 = O((x−y)2 ) if |xhn (y)−yhn (x)| ≤ hn (y)hn (x), otherwise fn,h (x)  and fn,h (y) are independent and it is a sum of variances.

Theorem 4.1. Under the conditions, for a density f of class C s (IX ) and a varying bandwidth sequence such that nhn 2s+1 converges to h, the process 1 Un,hn (x) = (nhn (x)) 2 {fn,hn (x) − f (x) − bn,hn (x)}I{x ∈ IX, hn }

converges weakly on IX to a continuous centered Gaussian process Wf (x), with covariance σf2 (x)δ{x,x } between Wf (x) and Wf (x ). Proof. The weak convergence of the variable Un,h (x) is a consequence to 1 the L2 -convergence of (nhn (x)) 2 {fn,hn (x)−f (x)−bn,hn (x)} to κ2 f (x). Furthermore, the finite dimensional distributions of the process Un,h converge



weakly to those of a centered Gaussian process. The quadratic variations of the bias {fn,hn(x) (x) − f (x) − fn,hn (y) (y) + f (y)}2 are bounded by 2     K(z){f (x + hn (x)z) − f (x) − f (y + hn (y)z) − f (y)} dz    msK hn 2s = s!

,

h(x) h

2s f

(s)

(x) −

2s -2 h(y) (s) f (y) , h

and it is a O(hn 2s |x − y|2 ). This bound and Lemma 4.2 imply that the mean of the squared variations of the process Un,h on small intervals are O(|x − y|2 ), therefore the process Un,h is tight, so it converges weakly to a centered Gaussian process. The covariance of the limiting process at x and y is the limit of the covariance between Un,h (x) and Un,h (y) and it 1 1 equals limn nhn2 (x)hn2 (y)Cov{fn,h (x), fn,h (y)}. The covariance of fn,h (x)  and fn,h (y) is approximated by n−1 Kh (x) (x − u)Kh (y) (y − u)f (u)du, n

n

for x = y it develops as  1{0≤αn 14 , or C α with α ≥ 2. All estimators of the bias of a density depend on its regularity through the constant of the bias and the exponent of h and it cannot be directly estimated without knowledge of α. The bandwidth minimizing the mean squared error of the estimator fn,h (x) depends on α with an order of smoothness α > 14 , so only the lower bound of the degree is necessary to obtain a bound of the MSE. As the variance of fn,h(x)(x) does not depend on the class of f , it can be estimated using a bandwidth function h2 such that nh2  tends to zero and nh2 2 tends to infinity, by V arn,h2 fn,h (x) = (nh(x))−1 κ2 fn,h2 (x) (x). Let  M SE n,h,an (x) be the estimator of M SEn,h,an (x) obtained by plugging the estimator of V arfn,h (x). It can be compared with the bootstrap estima∗ (x) = V ar∗ fn,h (x) + B ∗2 (fn,h )(x) tor of the mean squared error M SEn,h calculated from a bootstrap sample of independent variables having the distribution Fn . This estimator and the bootstrap estimator V ar∗ fn,h (x)



yield an estimator of α. An optimal local bandwidth can then be estimated from the estimator of α. The choice of the bandwidth function h2 relies on the same procedure and the optimal estimator  hn (x) requires iterations of this procedure, starting to an empirical bandwidth calculated from a discretization of its range. Adaptative estimators of the bandwidth were previously defined using empirical thresholds for the variations of the estimator of the density according to the bandwidth, however constants in the thresholds were chosen by numerical recursive procedures. Another variable bandwidth kernel estimator is defined with a bandwidth function of the variables Xi rather than x n 1 Khn (Xi ) (x − Xi ). fX,n,hn (x) = n i=1  Its expectation is EfX,n,hn (x) = EKhn (X) (x − X) = Khn (y) (x − y)fX (y) dy and its limit is fx (x), approximating y by x in the integral. Its bias and variance are not expanded as above, the bandwidth (1) at y is now developed as hn (y) = hn (x){1 + zhn (x)} + o(hn 2 ) where (1) y = x + hn (y)z = x + hn (x)z + hn (x)hn (x)z 2 + o(z 2 ), hence 1 (1) (2) fX (y) = fX (x) + hn (x)zfX (x) + hn (x)z 2 {hn (x)fX (x) 2 (1) 2 + 2h(1) n (x)fX (x)} + o(hn  ), and the bias of the estimator is m2K (2) (1) 2 hn (x){hn (x)fX (x) + 2h(1) bfX,n,h (x) = n (x)fX (x)} + o(hn  ). 2 Its variance is

  −1  V arfX,n,hn (x) = n Kh2n (y) (x − y)fX (y) dy − E 2 fX,n,hn (x) ,   Kh2n (y) (x − y) dy = K 2 (z)fX (x + hn (x)z + hn (x)h(1) n (x)z) dz + o(hn 2 ) = m2K {fX (x) + hn (x)f (1) (x)



zK 2 (z) dz + o(hn ),

so the first order approximation of the variances are identical.

4.3

Estimation of regression functions

Let us consider the variable bandwidth kernel estimator m  n,hn (x) (x) of the regression function m of a variable Y on a real variable X, and the random



process related to the estimated regression function 1

 n,hn (x) − m(x)}I{x∈IX,hn  } . Um,n,hn (x) = (nhn (x)) 2 {m Conditions 2.1 and 4.1 for kernel estimators of densities with variable bandwidth are supposed to be satisfied in addition to Condition 3.1 for kernel estimators of regression functions. With the notations of Chapter 2.13,  n,hn (x) (x) − m(x)| converges a.s. to zero with the uniform supx∈IX,hn  |m approximations mn,hn (x) (x) =

μn,hn (x) (x) + O((nhn )−1 ), fX,n,hn (x) (x)

1

1

(nhn (x)) 2 {m  n,hn(x) − mn,hn (x) }(x) = (nhn (x)) 2 {( μn,hn (x)  − μn,h (x) )(x) − m(x)(fX,n,h (x) − fX,n,h (x) )(x)}f −1 (x) + rn,h n

n

X

n

n (x)

,

where rn,hn = oL2 (1), uniformly.  n,hn (x) (x) − For every x in IX, hn and for every integer p > 1, m  n,hn (x) (x) is uniformly m(x)p converges to zero, the bias of the estimator m approximated by bm,n,hn(x) (x) = mn,hn (x) (x) − m(x) = h2n (x)bm (x) + o(hn 2 ), −1 (x){bμ (x) − m(x)bf (x)} bm (x) = fX m2K −1 (2) f (x){μ(2) (x) − m(x)fX (x)}, = 2 X

and its variance is deduced from (3.7) 2 (x) + o(1)}, vm,n,hn (x) (x) = (nhn (x))−1 {σm −2 2 (x) = κ2 fX (x){w2 (x) − m2 (x)f (x)}. σm

For a regression function and a density fX in class C s (IX ), s ≥ 2, and under Condition 2.2, the bias of m  n,hn (x) (x) is uniformly approximated by bm,n,hn (x) (x; s) =

hsn (x) (s) −1 msK fX (x){μ(s) (x) − m(x)fX (x)} + o(hn s ), s!

and its moments are not modified by the degree of derivability. For every x in IX, hn 1

1

−1 (nhn (x)) 2 (m  n,hn − m)(x) = (nhn (x)) 2 fX (x){( μn,hn (x) − μn,hn (x) ) − m(fX,n,h (x) − fX,n,h (x) )}(x) n

2s+1

+ (nhn (x)

n

1 2

) bm (x) + rn,hn (x) (x),



and supx∈IX,hn   rn,hn (x) 2 = o((nhn )− 2 ). The asymptotic mean squared error of m  n,h (x) is −1 2 (nhn (x)) σm (x) + hn (x)4 b2m (x) = (nhn (x))−1 κ2 {w2 (x) − m2 (x)f (x)} h4 (x)m22K −2 (2) −2 fX (x){μ(2) (x) − m(x)fX (x)}2 fX + n (x), 4 and its minimum is reached at the optimal local bandwidth   15 κ2 n−1 {w2 (x) − m2 (x)f (x)} hn,AMSE (x) = , m22K {μ(2) (x) − m(x)f (2) (x)}2 X

4

where AM SE(x) = O(n− 5 ). For every s ≥ 2, the asymptotic quadratic risk of the estimator for a regression curve of class C s is 2 2 (x) + h2s AM SE(x) = (nhn (x))−1 σm n (x)bm,s (x) −2 (x){w2 (x) − m2 (x)f (x)} = (nhn (x))−1 κ2 fX

h2s (s) n (x) 2 m f −2 (x){μ(s) (x) − m(x)fX (x)}2 , (s!)2 sK X its minimum is reached at the optimal bandwidth 1   2s+1 (s!)2 κ2 n−1 {w2 (x) − m2 (x)f (x)} hn,AMSE (x) = , 2sm2sK {μ(s) (x) − m(x)f (s) (x)}2 +

X

2s

where AM SE(x) = O(n− 2s+1 ).  n,hn (y) is calculated as for The covariance of m  n,hn (x) and m Theorems 3.1 and 4.1 and it is a o(1) for every x = y. Lemma 4.3. The covariance of m  n,hn (x) and m  n,hn (x)} equals 0  2 K(v − αn (v))K(v + αn (v)) dv σ 2 (zn (x, y))κ−1 2 n{hn (x) + hn (y)} m (1)

(1)

−2 (zn (x, y)){w2 − m2 fX }(zn (x, y)) + δn (x, y)fX 1  × vK(v − αn (v))K(v + αn (v)) dv + o(hn ) .

Proof. The integralEY 2 Khn (x) (x − X)Khn(y) (y − X) = EY 2 Khn (x) (x − X)Khn (y) (y − X) = Khn (x) (x − u)Khn (y) (y − u)w2 (u) du is expanded changing the variable u in v and it equals  2 K(v − αn (v))K(v + αn (v))w2 (un (x, y, v)) dv hn (x) + hn (y)  2 = w2 (zn (x, y)) K(v − αn (v))K(v + αn (v)) dv hn (x) + hn (y)

 (1) + δn (x, y)w2 (zn (x, y)) vK(v − αn (v))K(v + αn (v)) dv + o(hn ) ,



1

then the L2 -approximation of (nhn (x)) 2 {m  n,hn (x) − mn,hn (x) }(x) and Lemma 4.3 end the proof.  Lemma 3.3 is extended to μ n,hn and m  n,hn with functional bandwidths like 4.2 and the weak convergence on IX, hn of the process with varying 1 bandwidth Un,hn (x) = (nhn (x)) 2 {fn,hn (x) (x)−fn,hn (x) (x)} is proved as for the density estimator. Lemma 4.4. For a regression function m and density fX of class C s (IX ), s ≥ 2, and under Conditions 3.1 and 4.1, for every x and y in IX,hn the expectation of the variation of m  n,hn between x and y has the order O(|x − 3 2 −1  n,h (y)|2 = O(n−1 h−1 y|) and E|m  n,hn (x) − m n  (x − y) ) if |xhn (x) − −1 −1 −1 yhn (y)| ≤ 1. Otherwise, it is a O(n hn ). Theorem 4.2. Under the conditions, for a density f of class C s (IX ) and a varying bandwidth sequence such that nhn 2s+1 converges to h, the 1  n,hn (x) − mn,hn (x)}I{x ∈ IX, hn } conprocess Un,hn = {nhn (x)} 2 {m verges weakly to a continuous centered Gaussian process with covariance 2 (x)δ{x=x } at x and x . σm The estimators of the derivatives of the regression function are modified by the derivatives of the bandwidth and the kernel in each term of the estimators, as detailed in Appendix B, and the first derivative is (1) (1) (1) −1 { μn,h − m  n,h fn,h }, like in (3.15), with notations of the appendix m  n,h = fn,h for d{Khn (x) (x)}/dx. The results of Proposition 3.8 are extended to the (k)

estimator m  n,hn with a varying bandwidth sequence, its bias is a O(hn s ), 2k+1 ), hence the optimal bandwidth is a and its variance a O((nh−1 n ) 1 2s − 2k+2s+1 ) and the optimal mean squared error is a O(n− 2k+2s+1 ). O(n In the regression model with a conditional variance function σ 2 (x), the kernel estimator (3.18) with continuous functional bandwidths hn and δn can be written n {Yi − m  n,h (x) (Xi )}2 Kδn (x) (x − Xi ) 2 n n , σ n,hn (x),δn (x) (x) = i=1 i=1 Kδn (x) (x − Xi ) then a new estimator for the regression function is defined using this estima−1 tor as a weighting process w n = σ n,h in the estimator of the regression n ,δn function n w n (Xi )Yi Khn (x) (x − Xi ) . m  wn ,n,hn (x) (x) = i=1 n n (Xi )Khn (x) (x − Xi ) i=1 w 2 The bias and variance of the estimator σ n,h (x) and the fixed bandn (x),δn (x) 2  wn ,n,hn (x) (x) and width estimator for σ (x) are still similar. The bias of m



m  w,n,hn (x) (x) have the same approximations, the variance of m  w,n,hn(x) (x) is identical to the variance of m  n,hn (x) (x) whereas the variance of m  wn ,n,hn (x) (x) is modified like with the fixed bandwidth estimator. The weak Convergence Theorem 4.2 extends to the weighted regression estimator. All results of Section 3.5 are generalized to the estimation of a regression function on a multivariate regressor X, with a varying bandwidth. 4.4

Estimation for processes

Let (Xt )t∈[0,T ] be a continuously observed stationary and ergodic process satisfying (2.14), with values in IX . The limiting marginal density defined 1 by (2.15) is estimated with an optimal bandwidth of order O(T 2s+1 ) as proved in Section 2.12. For every x in IX,T, hT  1 T  KhT (x) (Xs − x) ds, (4.2) fT,hT (x) (x) = T 0 1

where T 2s+1 hT  = O(1). Conditions 2.1–2.2 are supposed to be satisfied, with a density f in class C s and assuming that the bandwidth function fulfills Condition 4.1 with the approximation 1

1

hT (x) = T − 2s+1 {h 2s+1 (x) + o(1)}.

(4.3)

The results of the previous sections extends to prove that for every x in hsT (x)  IX,T, hT , the bias of fT,h (x) is bT,hT (x) = s! msK f (s) (x) + o(hT s ), its variance is V ar{fT,hT (x)} = (T hT (x))−1 κ2 f (x) + o((T −1 h−1 T ), p its covariances are o((T −1 h−1 T ) and the L -norms are 2 fT,hT (x) − fT,hT (x)p = 0((T −1 h−1 T ) ). 1

The ergodic property (2.17) for k-dimensional vectors of values of the process (Xt )t entails the weak convergence of the finite dimensional distributions of the density estimator fT,h . Lemma 4.2 extends to the ergodic pro1 cess and entails the weak convergence of (T hT ) 2 (fT,h − fT,h ) to a centered Gaussian process with variance κ2 f (x) at x and covariances zero. For a continuously observed stationary and ergodic process (Xt , Yt )t≤T with values in IX,Y , consider the regression model Yt = m(Xt ) + σ(Xt )εt ,



where (εt )t∈[0,T ] is a Brownian motion such that E(εt | Xt ) = 0 and E(εt εs | Xt ∧ Xs ) = E{(εt ∧ εs )2 | Xt ∧ Xs ) = 1. The bivariate process Z = (X, Y ) is supposed to be ergodic, satisfying the properties (2.14) and (2.17). Under the same conditions as in Chapter 3, the regression function m is estimated on an interval IX,Y,T, hT by the kernel estimator T Ys KhT (x) (x − Xs ) ds . m  T,hT (x) = 0 T 0 KhT (x) (x − Xs ) ds The bias and variances established in Section 3.11 for the functions f and m of class C s and fixed bandwidth hT are modified, with the notation μ = mf bm,T,hT (x) (x) = hsT (x)bm (x) + o(hT s ), msK −1 (s) bm (x) = f (x){μ(s) (x) − m(x)fX (x)}, s! X 2 vm,T,h (x) = (T hT (x))−1 σm (x) + o((T hT )−1 ), −1 2 (x) = κ2 fX (x)V ar(Y | X = x), σm

and the covariance of m  T,hT (x) (x) and m  T,hT (x) (y) is a o((T hT )−1 ). The 1  T,hhT (x) (x) − m(x)} is then weak convergence of the process (T hT (x)) 2 {m proved by the same methods, under the ergodicity properties. In a model with a variance function, the regression function is also −1 T,h in the estimator of the estimated using a weighting process w T = σ T ,δT regression function T {Ys − m  T,hT (Xs ) (Xs )}2 KδT (x) (x − Xs ) ds 2 , σ T,hT ,δT (x) = 0 T K (x − X ) ds s δ (x) T 0 T w  (X )Y KhT (x) (x − Xi ) T i i . m  wT ,T,hT (x) (x) = 0 T T (Xi )KhT (x) (x − Xi ) 0 w The previous modifications of the bias and variance of the estimator extend to the continuously observed process (Xt )t≤T .

4.5

Exercises

(1) Compute the fixed and varying optimal bandwidths for the estimation of a density and compare the respective density estimators.

94

Functional Estimation for Density, Regression Models and Processes

(2) Give the expressions of the first moments of the varying bandwidth estimator of the conditional probability p(x) = P (Y |X = x) for a Y binary variable, conditionally on the value of a continuous variable X (Exercise 3.10(2)). (3) For the hierarchical observations of n independent sub-samples of Ji dependent observations of Exercise 2.11(5), determine a varying bandwidth estimator for the limiting density f and ergodicity conditions for the calculus of its bias and variance, and write their first order approximations. (4) Write the expressions of the bias and the variance of the continuous estimator FY |X,n,hn (x) for the distribution function of Y ≤ y conditionally on X ≤ x of Exercise 3.10(8), with a varying bandwidth and prove its weak convergence.

Chapter 5

Nonparametric Estimation of Quantiles

5.1

Introduction

Let F be a distribution function with density f on R, Fn its empirical 1 distribution function and νn = n 2 (Fn − F ) the normalized empirical process. The process Fn − F convergences to zero uniformly a.s. and in L2 , and νn converges weakly to B ◦ F , where B is the Brownian motion. The  n is the inverse functional for Fn , it converges therefore in probquantile Q ability to the inverse QF of F , uniformly on [0, 1]. The quantile estimator is approximated as 2     n = QF + (Fn − F ) ◦ QF − {(Fn − F ) ◦ QF } (f ◦ QF ) Q f ◦ QF 2(f ◦ QF )3 + o({(Fn − F ) ◦ QF }2 f ),

as a consequence, the quantile has the approximation process 1  n − QF ) = νn ◦ QF + rn , n 2 (Q f

(5.1)

with a remainder term such that supt∈[0,1] rn  = oL2 (1), it converges weakly to a centered Gaussian process with covariance function {F (s ∧ t) − F (s)F (t)}{f ◦ QF (s)}−1 {f ◦ QF (t)}−1 , for all s and t in [0, 1]. Consider the distribution function FY |X of the variable Y conditionally on the regression variable X, in the model Y = m(X) + ε with a continuous regression curve m(x) = E(Y |X = x) and an observation error ε such that E(ε|X) = 0 and V ar(ε|X) = σ 2 (X). It is defined with respect to the distribution function Fε of ε by FY |X (y; x) = P (Y ≤ y|X = x) = Fε (y − m(x)).

95

(5.2)

96

Functional Estimation for Density, Regression Models and Processes

 The marginal distribution function of Y is FY (y) = Fε (y − m(s)) dFX (s) and the joint distribution function of (X, Y ) is FX,Y (x, y) = 1{s≤x} Fε (y − m(s)) dFX (s). The estimator of FY |X is defined by smoothing the regression variable with a kernel is n Kh (x − Xi )1{Yi ≤y} n , FY |X,n,h (y; x) = i=1 i=1 Kh (x − Xi ) and an estimator of Fε is deduced from those of FY |X , FX and m as  Fε,n,h (s) = n−1 FY |X,n,h (s + m  n,h (Xi ); Xi ). 1≤i≤n

In this expression, the estimator of the regression function can be weighted by the inverse of the square root of the kernel estimator for the variance 2 , FY |X and Fε , are function σ 2 . Therefore, all functions of the model, m, σ easily estimated from the sample (Xi , Yi )i≤n . The quantile of the conditional distribution function of Y given X are first defined with respect to Y , then with respect to X. For every t in [0, 1] and at fixed x in IX , the conditional distribution FY |X (y; x) is increasing with respect to y and its inverse is defined as QY (t; x) = FY−1 |X (t; x) = inf{y ∈ IY : FY |X (y; x) ≥ t}.

(5.3)

It is right-continuous with left-hand limits, like the FY |X . For every x ∈ IX , FY |X ◦ QY (t; x) ≥ t with equality if and only if the function FY |X (·; x) is continuous for every x in the support of X. Assuming that the function m is monotone by intervals, the definition (5.2) implies the monotonicity on the same intervals of the conditional distribution function FY |X with respect to the Y , with the inverse monotonicity. On each interval of monotonicity and for every s in the image of IX , the quantile QX (y; s) is defined by inversion of the conditional distribution FY |X in the domain of the variable X, at fixed y, from Equation (5.2) inf{x ∈ IX : FY |X (y; x) ≥ t}, if m is decreasing, (5.4) QX (y; s) = sup{x ∈ IX : FY |X (y; x) ≤ t}, if m is increasing. For every y ∈ IY , QX ◦FY |X (y; x) = x if and only if m and Fε are continuous on IX , for every y in IY , and FY |X ◦ QX (y; s) = s if and only if m and Fε are strictly monotone, for every (s, y) in DX,Y . The empirical conditional distribution function defines in the same way  Y,n,h , according to (5.4) and,  X,n,h and Q the empirical quantile processes Q X,Y,n,h , the marginal components respectively (5.3). If (x, y) belongs to D Y,n,h which are the domains X,n,h and, respectively D x and y belong to D   of QX,n,h and, respectively QY,n,h .

Nonparametric Estimation of Quantiles

97

Another question of interest for a regression function m monotone on an interval Im is to determine its inverse with its distribution properties. Consider a continuous regression function m, increasing on a sub-interval Im of the support IX of X, its inverse is defined as m−1 (t) = inf{x ∈ IX : m(x) ≥ t}.

(5.5)

It is increasing and continuous on the image of Im by m and satisfies m−1 ◦ m = m ◦ m−1 = id. 5.2

Asymptotics for the quantile processes

Let IX,Y,h = {(s, y) ∈ IX,Y ; [s − h, s + h] ∈ IX }. Under conditions similar to those of the nonparametric regression,Proposition 3.1 applies considering y as fixed, with fX (x)FY |X (y; x) = 1{ζ≤y} fX,Y (x, ζ) dζ instead of μ(x) and with the conditional function FY |X (y; x), for every (x, y) in IX,Y . The weak convergence of the process defined on IX,h , at fixed y, 1 by (nh) 2 {FY |X,n,h (y; ·) − FY |X (y; ·)} is a corollary of Theorem 3.1. The expressions of the bias and the Lp -norms rely on an expansion up to higher order terms of its moments. Proposition 5.1. Let FXY be a distribution function of C s+1 (IX,Y ). Under Condition 2.1 for the density fX and 3.1 for the conditional distribution function FY |X (y; x) at fixed y, the variable supIX,Y,h |FY |X,n,h − FY |X | tends to zero a.s., its bias and its variance are bFY |X ,n,h (y; x) = hs bF (y; x) + o(h2 ), (5.6) s+1

∂ FX,Y (x, y) 1 (s) −1 bF (y; x) = msK fX (x) − FY |X (x, y)fX (x) , s! ∂xs+1 vFY |X ,n,h (y; x) = (nh)−1 vF (y; x) + o((nh)−1 ) vF (y; x) =

−2 κ2 f X (x)FY |X (y; x){1

(5.7)

− FY |X (y; x)}.

At every fixed y in IY , the process (nh) 2 {FY |X,n,h (y) − FY |X (y) − bFY |X ,n,h (y)}1{IX,h } , 1

converges weakly to a Gaussian process defined in IX , with mean function 1 limn (nh5 ) 2 bFY |X (y; ·), covariances zero and variance function vFY |X (y; ·). 1

1

At the optimal bandwidth hn = O(n− 2s+1 ), (nh) 2 bFY |X ,n,h is a O(1) under 1 Condition 3.1 so the empirical process (nh) 2 {FY |X,n,h (y) − FY |X (y)} is not centered, like for the density or the regression functions.

98

Functional Estimation for Density, Regression Models and Processes

1 The weak convergence of the bivariate process (nh) 2 (FY |X,n,h − FY |X − bFY |X ,n,h ) defined on IX,Y,h requires an extension of the previous results as for the empirical distribution function of Y .

Proposition 5.2. The process 1 νY |X,n,h = (nh) 2 {FY |X,n,h − FY |X − bFY |X ,n,h )}1{IX,Y,h }

converges weakly to a Gaussian process Wν on IY,X , with mean function 1 limn (nh) 2 bF (y; ·), variance vY |X and covariances at fixed x −2 (x){FY |X (y ∧ y  ; x) − FY |X (y; x)FY |X (y  ; x)}, CovY |X (y, y  ; x) = κ2 fX

and zero otherwise. Proof. This is a consequence of the weak convergence of the finite dimensional distributions of νY |X,n,h and of its tightness, due to the bound obtained for the moments of the squared variations between (x, y) and (x , y  ) of the joint empirical process, νY |X,n,h (y; x) − νY |X,n,h (y; x ){νY |X,n,h (y  ; x )−νY |X,n,h (y; x )} is a O((x −x)2 +(y  −y)2 ). The bound O((y  − y)2 ) is obtained for the empirical process at fixed x ,  and O((x − x)2 ) as in the proof of Lemma 3.3, at fixed y. Let FY |X (y; x) be monotone with respect to x. If n is sufficiently large, then FY |X,n,h is monotone, as proved in the following lemma. The expecta Y,n,h and, tions are denoted FY |X,n,h , QY,n,h and QX,n,h for E FY |X,n,h , E Q  X,n,h . respectively, E Q Lemma 5.1. If n ≥ n0 large enough, FY |X,n,h is monotone on IX,Y,h . Moreover, if FY |X is increasing with respect to x in IX then, for every x1 < x2 and ζ > 0, there exists C > 0 such that Pr{FY |X,n,h (x2 ) − FY |X,n,h (x1 ) > C} ≥ 1 − ζ. Proof. Let y be considered as fixed in IY , x1 < x2 be in IX,h and such that FY |X (y; x2 ) − FY |X (y; x1 ) = d > 0. For n large enough the bias of FY |X,n,h (y; x2 ) − FY |X,n,h (y; x1 ) is strictly larger than d/2, by Proposition 5.2. The uniform consistency of Proposition 5.1 implies, for every η and ζ > 0, the existence of an integer n0 such that for every n ≥ n0 , Pr{|FY |X,n,h (y; x1 )−FY |X (y; x1 )|+|FY |X,n,h (y; x2 )−FY |X (y; x2 )| > η} < ζ. For the monotonicity of the empirical conditional distribution function, let

Nonparametric Estimation of Quantiles

99

d > η > 0, then Pr{FY |X,n,h (y; x2 ) − FY |X,n,h (y; x1 ) > d − η} = 1 − Pr{(FY |X,n,h − FY |X )(y; x1 ) − (FY |X,n,h − FY |X )(y; x2 ) ≥ η} ≥ 1 − Pr{|FY |X,n,h − FY |X |(y; x2 ) + |FY |X,n,h − FY |X |(y; x1 ) ≥ η} ≥ 1 − ζ.



The asymptotic behavior of the quantile processes follows the same principles as the distribution functions. We first consider the quantile QY defined by (5.3) conditionally on fixed X = x, it is always increasing. The empirical quantile function is increasing with probability tending to 1, as in Lemma 5.1 and the functions QX,n,h and QY,n,h are monotone, for n large enough. The results of Section 5.1, are adapted to the empirical quantiles. The derivative with respect to y of FY |X (y; x) belonging to C 2 (IY ) is fY |X (y; x), for every x in IX . Let bF and vF be defined by (5.5) and (5.7), respectively. Proposition 5.3. Let FX|Y be a continuous conditional distribution func Y,n,h − QY |(u; x) converges in probability tion, the process supDY,n,h ×IX |Q to zero. If the density fX,Y of (X, Y ) belongs to C s (IX,Y ), then for every Y,n,h , the bias of Q  Y,n,h equals x in IX and u in D bY,h (u; x) = h2 bY (u; x) + o(h2 ), bF bY (u; x) = ◦ QY (u; x), fY |X and its variance is bY,n,h(u; x) = (nh)−1 vY (u; x) + o((nh)−1 ), vF vY (u; x) = 2 ◦ QY (u; x). fY |X Proof. The proof relies on the expansion of the quantile of a distribution function according to the empirical process of this distribution, it is similar to the expansion (5.1) of the introduction. By the derivability of the inverse function, it has the approximation   Y,n,h (u; x) − QY (u; x) = FY |X,n,h − FY |X ◦ QY (u; x) Q f  +o(FY |X,n,h (·; x) − FY |X (·; x)L2 ).

100

Functional Estimation for Density, Regression Models and Processes

By the uniform convergence in probability of FY |X,n,h to FY |X on IX,n,h and under the condition that the density is bounded away from zero on  Y,n,h and the functions Q  Y,n,h convergence uniformly IX , the processes Q to QY . The bias and the variance of the empirical conditional quantile  Y,n,h are deduced from those of the kernel estimator FY |X,n,h detailed in Q Proposition 5.1.  Another quantile function is defined for n large enough and for v in  DY,n,h by . /  Y,n,h (v; x) = sup y : (x, y) ∈ IX,Y,h , FY |X,n,h (y; x) ≤ v , Q (5.8)  Y,n,h at fixed x. The uniform convergence of FY |X,n,h to FY |X implies that Q Y,n,h and let converges uniformly to QY . At fixed x, let u be in D  Y,n,h (u) − FY |X ◦ Q  Y,n,h (u), ηY,n,h (u) = FY |X ◦ Q

(5.9)

which converge in probability to zero, the quantile estimator satisfies  Y,n,h = QY ◦ (  Y,n,h ). Q ηY,n,h + FY |X ◦ Q

(5.10)

 Y,n,h as a function of Q and Q  Y,n,h as Taylor expansions allow to express Q    a function of QY,n,h and of the process ηY,n,h . Since FY |X,n,h ◦ QY,n,h and  Y,n,h equal identity, (5.9) is also written FY |X,n,h ◦ Q  Y,n,h (u) − {FY |X,n,h − FY |X } ◦ Q  Y,n,h(u). ηY,n,h (u) = bF,n,h ◦ Q  Y,n,h (u) − h2 E{bF ◦ Q  Y,n,h(u)} + o(h2 ), (5.11) E{ ηY,n,h (u)} = h2 bF ◦ Q and the variance of ηY,n,h (u) equals  Y,n,h (u)|Q  Y,n,h (u)}] V ar{ ηY,n,h (u)} = E[V ar{FY |X,n,h ◦ Q  Y,n,h (u)} + V ar{bF,n,h ◦ Q  Y,n,h (u)} = (nh)−1 E{vF ◦ Q  Y,n,h (u)} + o(n−1 h−1 + h4 ). (5.12) + h4 V ar{bF ◦ Q The expression of the bias of FY |X,n,h implies  n,h (u) = h2 bF ◦ Q  n,h (u) + o(h2 ), {FY |X,n,h − FY |X } ◦ Q therefore  Y,n,h(u) = F −1 (u − h2 bF ◦ Q  Y,n,h (u)) + o(h2 ) Q Y |X = QY (u) − h2

 Y,n,h(u) bF ◦ Q + o(h2 ). fY |X ◦ QY (u)

101

Nonparametric Estimation of Quantiles

 Y,n,h (u) = QY (u) + O(h2 ), bF ◦ Q  Y,n,h(u) = bF ◦ QY (u) + O(h2 ), so Since Q that  Y,n,h(u) = QY (u) − h2 bF ◦ QY (u) + o(h2 ). (5.13) Q fY |X Furthermore, by (5.9),  Y,n,h(u) = F −1 (FY |X ◦ Q  Y,n,h(u) + ηY,n,h (u)) Q Y |X  Y,n,h (u) + =Q

ηY,n,h (u) 2 (u)), + O( ηY,n,h  Y,n,h (u) fY |X ◦ Q

and, using (5.13),  Y,n,h(u) = Q  Y,n,h (u) + Q

ηY,n,h (u) fY |X ◦ QY (u)

2 + O(h2 ηY,n,h (u)) + O( ηY,n,h (u)).

(5.14)

 Y,n,h (u) = bF ◦ QY (u) + o(1). With The expansion (5.13) implies bF ◦ Q ηY,n,h (u)} are o(1) (5.14) and since E{ ηY,n,h (u)} and V ar{  Y,n,h (u) = bF ◦ Q  Y,n,h (u) + ηY,n,h (u) bf ◦ Q  Y,n,h(u) bF ◦ Q FY |X ◦ QY (u) 2 + O(h2 ηY,n,h (u)) + O( ηY,n,h (u)),

 Y,n,h (u)} = bF ◦ Q  Y,n,h (u) + o(1) = bF ◦ QY (u) + o(1). E{bF ◦ Q  Y,n,h (u)} = O(h4 + n−1 h−1 ) because of the approxiMoreover, V ar{bF ◦ Q

4 mations V ar{ ηY,n,h (u)} = O(h4 +n−1 h−1 ) and E{ ηY,n,h (u)} = o(n−1 h−1 ). From (5.11), the expectation of ηY,n,h (u) becomes

E{ ηY,n,h (u)} = o(h2 ). (5.15)  Y,n,h (u)} = vF ◦ QY (u) + o(1) and In the expansion (5.12), E{vF ◦ Q 4 8  h V ar{bF ◦ QY,n,h (u)} = O(h + n−1 h3 ) = o(n−1 h−1 ). The variance of ηY,n,h (u) is then equal to V ar{ ηY,n,h (u)} = (nh)−1 vF ◦ QY (u) + o((nh)−1 ). Finally, (5.10), (5.13), (5.14) and (5.15) imply   Y,n,h = QY + ηY,n,h + FY |X ◦ QY,n,h − FY |X ◦ QY {1 + o(1)}. Q fY |X ◦ QY ) 1  Y,n,h − QY }1  Theorem 5.1. The process UY,n,h = (nh) 2 {Q {DY,n,h } con1

verges weakly to UY = limit of νY |X,n,h .

Wν + γ 2 bF 1

vF2 Y

◦ QY where Wν is the Gaussian process

102

Functional Estimation for Density, Regression Models and Processes

Y,n,h , the expansion For every x in IX,n,h and for every u in D  Y,n,h (u; x) = QY (u; x) + (FY |X,n,h − FY |X )(QY (u; x); x) Q fY |X (QY (u; x); x)

Proof.



{(FY |X,n,h − FY |X )2 (QY (u; x); x) fY |X (QY (u; x); x) 2fY3 |X (QY (u; x); x)

+o((FY |X,n,h − FY |X )2 (QY (u; x); x)), and the weak convergence of νY |X,n,h (Proposition 5.2) imply  Y,n,h − QY )(u; x) = (nh) 2 (Q 1

=

1

(nh) 2 (FY |X,n,h − FY |X )

 Y,n,h νY |X,n,h ◦ Q + rn , fY |X (QY (u; x); x)

where rn is a oP (1), and its limit follows.



 Y,n,h has the representation The conditional quantile process Q   Y,n,h = QY + ηY,n,h + FY |X ◦ QY,n,h − FY |X ◦ QY + rY,n,h , (5.16) Q  Y,n,h ) fY |X ◦ Q where ηY,n,h is defined by (5.9) and where the remainder term rY,n,h is 1 oL2 ((nhn )− 2 ). An analogous representation holds for the empirical quantile  X,n,h process Q    X,n,h = QX + ζX,n,h + FY |X ◦ QX,n,h − FY |X ◦ QX + rX,n,h , Q fY |X ◦ QX )

(5.17)

 X,n,h − FY |X ◦ Q  X,n,h and rX,n,h = oL2 ((nhn )− 12 ). where ζX,n,h = FY |X ◦ Q  X,n,h are The bias bX , the variance vX and the weak convergence of Q (1) obtained using the derivative FY |X of FY |X (y; x) with respect to x. Proposition 5.4. Let FX|Y be a continuous conditional distribution func X,n,h − QX |(u; x) converges in probability tion, the process supDX,n,h ×IY |Q to zero. If the density fX,Y of (X, Y ) belongs to C s (IX,Y ), then for every X,n,h , the bias of Q  X,n,h equals x in IX and u in D bX,h (y; u) = hs bX (y; u) + o(hs ), bF ◦ QX (y; u), bX (y; u) = ∂FY |X /∂x and its variance is vX,n,h (y; u) = (nh)−1 vX (y; u) + o((nh)−1 ), vF ◦ QX (y; u). vX (y; u) = {∂FY |X /∂x}2

Nonparametric Estimation of Quantiles

103

1  X,n,h − QX }1  Theorem 5.2. The process UX,n,h = (nh) 2 {Q {DX,n,h } con1

verges weakly to UX =

5.3

Wν + limn (nh5n ) 2 bF 1

vX2

◦ QX .

Bandwidth selection

 n,h (u) of the The error criteria measuring the accuracy of an estimator Q quantile of a distribution function Q(u) are generally sums of the variance  n,h (u), where the variance increases as h tends and the squared bias of Q to zero whereas the bias decreases. Under the assumption that the conditional distribution function FY |X is twice continuously differentiable with respect to y and using results of Proposition 5.3, the mean squared error  Y,n,h(u) − QY (u)}2 is asymptotically equivalent to M SEY (h) = E{Q vF ◦ QY (u; x) {fY |X ◦ QY (u; x)}2

2 bF + h4 ◦ QY (u; x) . fY |X

AMSEQY (u; x, h) = (nh)−1

Its minimization in h leads to an optimal local bandwidth, varying with u and x

1 1 vF ◦ QY (u; x) 5 . hopt,loc (u; x) = n− 5 4b2F ◦ QY (u; x) The optimal local bandwidth minimizing the AMSE AMSEF (u; x, h) = (nh)−1 vF (u; x) + h4 b2F (u; x), of FY |X,n,h (u; x) for the unique value of x such that y = QY (u) has the same form

1 vF (u; x) 5 1 . hopt,loc (u; x) = n− 5 4b2F (u; x) If the density fX has a continuous derivative, the optimal local bandwidth  X,n,h (y; x), at fixed y, is also similar, with the minimizing the AMSE of Q notations for the bias and the variance given by Proposition 5.4. 1 Since the optimal rate for the bandwidth has the order n 5 , the optimal 4  Y,n,h −  Y,n,h to QY is n 5 and the process (nh) 12 {Q rate of convergence of Q QY } converges to a noncentered Gaussian process with an expectation different from zero because nh5 = O(1). Estimating the bias bF and the variance vF by bootstrap allows to estimate the optimal bandwidths for

104

Functional Estimation for Density, Regression Models and Processes

the quantile estimator without knowledge of its order s of derivability. direct kernel estimator of the variance function of the process νY |X,n,h −2 vF,n,h = κ2 fX,n,h F|X,n,h (1 − F|X,n,h ), according to the expression (5.7) vF . For a conditional distribution function FY |X (·; x) having derivatives order s, the bias is modified bs,F,n,h (y; x) = hs bF (y; x) + o(hs ) s+1 ∂ FX,Y (x, y) hs −1 msK fX (x) = s! ∂xs+1 −



(s) FY |X (x, y)fX (x)

A is of of

+ o(h2 ),

and the optimal local bandwidth is modified by this s-order bias. The global mean integrated squared error criteria are defined by inteY,n,h the AM SEQY (u; x, h), conditionally grating over all values of u in D on a fixed value of x

 Y,n,h (u; x) − QY (u; x)}2 du E{Q 2     vF ◦ QY (u; x) bY ◦ QY (u; x) −1 4 +h = (nh) du {fY |X ◦ QY (u; x)}2 fY |X ◦ QY (u; x)  AMSEF (y; x, h) dy, = fY |X (y; x) IY,n,h

AMISEQY (x, h) =

 which differs from the integral AMISEF (x, h) = IY,n,h AMSEF (y; x, h) dy, conditional to X = x. The expectation of the conditional random criterion AMISEQY (X, h) is  AMSEF (y; x, h) dy dFX (x). fY2 |X (y; x) IX,Y,n,h In the same way, the global mean integrated squared error criteria is X,n,h , for a fixed value of defined by integrating AMSEQX (y, h) over D y, AMISEQX (y, h) = IX,n,h AMSEF (y; x, h){fY |X (y; x)}−1 dx, at fixed y. X,Y,n,h The global AMISE criteria for QX and QY defined as integrals over D are both equal to AMISEQ (h) =

 IX,Y,n,h

AMSEF (y; x, h) dx dy, fY2 |X (y; x)

and they differ from the global criterion  AMSEF (Y ; X, h) AMISEF = . AMSEF (y; x, h) dx dy = E 2 fX,Y (X, Y ) IX,Y,n,h

105

Nonparametric Estimation of Quantiles

Some discretized versions of these criteria are the Asymptotic Mean Average Squared Errors such as the AMASEF corresponding to AMISEF , AMASEQY (x, h) corresponding to AMISEQY (x, h) and EAMISEQY (X, h) are respectively defined by AMASEF = n−1 AMASEQY (x, h) = n−1

n  AMSEF (Yi ; Xi ) i=1 n  i=1

AMASEQY (h) = n−2

fX,Y (Xi , Yi , h)

,

AMSEF (Yi ; x, h) , fY3 |X (Yi ; x, h)

n  n  AMSEF (Yi ; Xj , h) i=1 j=1

fY3 |X (Yi ; Xj , h)

,

which is the empirical mean of AMASEQY (X, h). Similar ones are defined for QX and other means. Note that no computation of the global errors and bandwidths require the computation of integrals of errors for the empirical inverse functions, all are expressed through integrals or empirical means of AMSEF with various weights depending on the density of X and the conditional density of Y given X. The optimal window for AMASEQF (h) is n

− 15

11 0 n {vF (fX,Y )−1 }(Xi , Yi , h) 5 i=1 , n 4 i=1 {b2F (fX,Y )−1 }(Xi , Yi , h)

for AMASEQY (h) it is , n n

− 15

n

i=1 4 ni=1

{vF (fX,Y )−2 }(Xi , Yj , h) j=1 n 2 −1 }(X , Y , h) i j j=1 {bF (fX,Y )

- 15 .

The expressions of other optimal global bandwidths are easily written and all are estimated by plugging estimators of the density, the bias bF and the variance vF with another bandwidth. The derivatives of the conditional distribution function are simply the derivatives of the conditional empirical distribution function, as nonparametric regression curves. The mean  X,n,h squared errors and the optimal bandwidths for the quantile process Q are written in similar forms, with the bias bX and variance vX . 5.4

Estimation of the conditional density of Y given X

The conditional density fY |X (y; x) is deduced from the conditional distribution function FY |X (y; x) by derivative with respect to y and it is estimated

106

Functional Estimation for Density, Regression Models and Processes

using the kernel K with another bandwidth h −1 (x)fX,Y,n,h,h (x, y) fY |X,n,h,h (y; x) = fX,n,h n Kh (x − Xi )Kh (y − Yi ) = i=1 n . i=1 Kh (x − Xi ) The order of the bias of the bidimensional density estimator is (hh )2 and the order of its variance is (nhh )−1 . Proposition 5.5. If the conditional density fY |X belongs to the class 1 C s (IXY ), s ≥ 2, the process (nhh ) 2 (fY |X,n,h,h − fY |X ) converges weakly 1 to a Gaussian process with expectation limn (nhn hn ) 2 (E fY |X,n,h,h − fY |X ), with covariances zero and variance function vf = κ2 fY |X (1 − fY |X ). 1 If hn = hn and s = 2, the process n 2 hn (fY |X,n,hn − fY |X ) converges 1 weakly to a Gaussian process with expectation limn (nh6n ) 2 bf where 1 (2) −1 {∂ 2 fX,Y /∂y 2 + ∂ 2 fX,Y /∂x2 − fY |X fX }, bf = m2K fX 2 with covariances zero and variance function vf . The optimal bandwidth is 1 O(n− 6 ). Proof. By Proposition 5.1, if fY |X belongs to C 2 (IXY ), its expectation develops as   E fY |X,n,h,h (y; x) = Kh (v − y) E FY |X,n,h (dv; x)  = Kh (v − y) (FY |X + bF,n,h )(dv; x) 

∂ 2 fY |X (y; x) h2 m2K 2 ∂y 2  ∂bF,n,h (y; x) + o(h2 ) + o(h 2 ). + ∂y

= fY |X (y; x) +

Assuming that h = h, its bias bf,n,h (y; x) = h2 bf (y; x) + o(h2 ), with 1 (2) −1 {∂ 2 fX,Y /∂y 2 + ∂ 2 fX,Y /∂x2 − fY |X fX }. bf = m2K fX 2 Generally, the range of the variables X and Y differs and two distinct kernels have to be used, the bias is then expressed as a sum of two terms (1)





bf,n,h,h (y; x) = h2 bF (y; x) + h 2 bf (y; x) + o(h2 ) + o(h 2 ), 1 (1) (2) −1 bF = m2K fX {∂ 2 fX,Y /∂x2 − fY |X fX }, 2 1 −1 2 bf = m2K fX ∂ fX,Y /∂y 2 . 2

Nonparametric Estimation of Quantiles

107

The variance of the estimator is the limit of V arfY |X,n,h,h (y; x) written  Kh (u − y)Kh (v − y)Cov{FY |X,n,h (du; x), FY |X,n,h (dv; x)}  −1 = (nh)−1 κ2 fX (x) Kh2 (u − y) FY |X (du; x)

 − Kh (u − y)Kh (v − y) FY |X (du; x)FY |X (dv; x) . 

−1 The first integral  develops as I1 = h {κ2 fY |X (y; x) + o(1)}, the second integral I2 = Kh (u − y)Kh (v − y) FY |X (du; x)FY |X (dv; x) is the sum 2 ; |u − v| < 2hn }, of the integral outside the diagonal DY = {(u, v) ∈ IX,T which is zero, and an integral restricted to the diagonal which is expanded by changing the variables like in the proof of Proposition 2.2. Let αh (u, v) = |u − v|/(2h ), u = y + h (z + αh ), v = y + h (z − αh ) and z = {(u + v)/2 − y}/(h )  I2 = Kh (u − x)Kh (v − x)fY |X (u; x)fY |X (v; x) du dv DY

  = h −1 K(z − αh (u, v))K(z + αh (u, v)) dzdufY2 |X (y; x) + o(1) , DY





and it is equivalent to h −1 κ2 fY2 |X (y; x) + o(h −1 ). The variance of the estimator of the conditional density fY |X is then vfY |X,n,h,h = (nhh )−1 vf (y; x) + o((nhh )−1 ), vf (y; x) = κ2 fY |X (1 − fY |X ), and its covariances at every y = y  tends to zero.

(5.18) 

The asymptotic mean squared error for the estimator of the conditional  (1)2 density is M SEfY |X y; x; hn , hn ) = h4n bF (y; x)+hn4 b2f (y; x)+(nhn hn )−1 vf , it is minimal at the optimal bandwidth   15 1 v f (y; x) . hn,opt,fY |X (y; x) = nhn 4b2f In this expression, hn can be chosen as the optimal bandwidth for the kernel estimator of the conditional distribution function FY |X . The convergence 1 rate (nhn hn ) 2 of the estimator for the conditional density is smaller than 1 2 the convergence rate of a density and than (nh2n ) 2 = O(n 5 ), at the optimal bandwidth.

108

Functional Estimation for Density, Regression Models and Processes

Assuming that hn = hn , the optimal bandwidth is now   16 1 vf (y; x) , hn,opt,fY |X (y; x) = nh 2b2f and the convergence rate for the estimator of the conditional density fY |X 1 is n 3 which is larger than the previous rate with two optimal bandwidths. The mode of the conditional density fY |X is estimated by the mode of its estimator and the proof of Proposition 2.11 applies with the modified rates of convergence and limit. The derivative of fY |X,n,h (y; x) with respect to y  1 1 converges with the rate (nhn3 hn ) 2 that is n 2 h2n for identical bandwidths.

5.5

Estimation of conditional quantiles for processes

Let (Zt )t∈[0,T ] = (Xt , Yt )t∈[0,T ] be a continuously observed stationary and ergodic process with values in a metric space IXY and the regression model Yt = m(Xt ) + σ(Xt )εt as in Section 3.11. Under the ergodicity condition (2.14) for (Zt )t>0 , the conditional distribution function of the limiting distribution corresponds to FY |X (y; x) for a sample of variables and it is estimated from the sample-path of the process on [0, T ], similarly to (3.22), with a bandwidth indexed by T T 1{Y ≤y} KhT (x − Xs ) ds  FY |X,T,hT (y; x) = 0  T s . (5.19) KhT (x − Xs ) ds 0 T The numerator of (5.19), μ F,T,hT (y; x) = T1 0 1{Ys ≤y} KhT (x − Xs ) ds, has the expectation  KhT (x − u) FXs ,Ys (du, y) μF,T,hT (x) = IX

= F (y; x)f (x) + h2T bF (y; x) + o(h2T ) with bμ (y; x) = ∂ 3 {F (x, y)}/∂x3 , for a conditional density of C 2 (IXY ). Proposition 5.6. Under the ergodicity conditions and for a conditional density fY |X in class C s (IXY ), the bias bF,T,hT and the variance vF,T,hT of the estimator FY |X,T,hT are bF,T,hT (y; x) = hsT bF (y; x) + o(hsT ), msK −1 (s) f (x){∂ s+1 F (x, y)/∂xs+1 − F (x)fX (x)}, bF (y; x) = s! X vF,T,hT (y; x) = (T hT )−1 {σF2 (y; x) + o(1)}, −1 (x)FY |X (y; x){1 − FY |X (y; x)}, σF2 (x) = κ2 fX

Nonparametric Estimation of Quantiles

109

−1 its covariances are CovY |X (y, y  ; x) = κ2 fX (x){FY |X (y ∧ y  ; x) −   FY |X (y; x)FY |X (y ; x)} and zero for x = x .

The weak convergence of Proposition 5.2 is still satisfied with the conver1 gence rate (T hT ) 2 and the notations of Proposition 5.6. The quantile processes of Section 5.2 are generalized to the continuous process (Xt , Yt )t>0 and their asymptotic behavior is deduced by the same arguments from the 1 weak convergence of (T hT ) 2 (FY |X,T,hT − FY |X ). The conditional density fY |X (y; x) of the ergodic limit of the process is estimated using the kernel KhT , with the same bandwidth as the estimator of the distribution function FY |X   T  T K (x − X )1 1 h s {Y ≤Y } T s t 0 dt. KhT (Yt − y) fY |X,T,hT (y; x) = T T 0 K (x − X ) ds h s T 0 Its expectation is approximated by  T KhT (Yt − y){FY |X (Yt ; x) + h2T bF (Yt ; x)} dt fY |X,T,hT (y; x) = T −1 E 0

+ o(h2T )  ∂ =E KhT (v − y) {FY |X + h2T bF }(v; x) dv + o(h2T ) ∂v IY = fY |X (y; x) + h2T

∂ 2 fY |X (y; x) ∂bF (y; x) h2T + m2K ∂y 2 ∂y 2

+ o(h2T ), where bF is the bias (5.7) of the estimator FY |X,n,h . Let vf be defined by (5.18), the variance of fY |X,T,hT (y; x) has an expansion similar to the variance of the estimator fY |X,n,h (y; x) vfY |X ,T,hT = (T h2T )−1 vf (y; x) + o((T h2T )−1 ). Proposition 5.7. Under the ergodicity conditions and for a conditional density fY |X in class C s (IXY ), the bias and the variance of the estimator fY |X,T,hT are bfY |X ,T,hT (y; x) = hsT bfY |X (y; x) + o(hsT ), msK −1 (s) f (x){∂ s f (x, y)/∂xs − fY |X fX (x)}, bfY |X (y; x) = s! X vfY |X ,T,hT (y; x) = (T h2T )−1 {vf (y; x) + o(1)}, −1 (x)fY |X (y; x){1 − fY |X (y; x)}, vf (y; x) = κ2 fX

its covariances are zero for x = x or y = y  .

110

Functional Estimation for Density, Regression Models and Processes

1 The process T 2 hT (fY |X,T,hT − fY |X ) converges weakly to a Gaussian 1 bfY |X , with variance vf and covariprocess with expectation limT T 2 hs+1 T ances zero.

The optimal bandwidth for fY |X,T,hT is O(T − 2s+2 ) and the convergence S 1 rate of fY |X,T,hT with the optimal bandwidth is T 2s+2 , hence it is T 3 for s = 2, and the expression of the optimal bandwidth is hT,opt,fY |X defined in the previous section. 1

5.6

Inverse of a regression function

Consider the inverse function (5.5) for a regression function m of the model (1.6), monotone on a sub-interval Im of the support IX of the regression variable X. The kernel estimator of the function m is monotone on the same interval with a probability converging to 1 as n tends to infinity, by an extension of Lemma 5.1 to an increasing function. The maxima and minima of the estimated regression function, considered in Section 3.8, define empirical intervals for monotonicity where the inverse of the regression function is estimated by the inverse of its estimator. Let t belong to the image Jm by m of an interval Im where m is increasing  m,n,h (t) = m Q  −1  n,h (x) ≥ t}. n,h (t) = inf{x ∈ Im : m

(5.20)

This estimator is continuous like m  n,h , so that m  n,h ◦ m  −1 n,h = id on Jm , −1 and m  n,h ◦ m  n,h = id on Im . The results proved in Section 5.2 for the  m,n,h conditional quantiles adapt to the estimator Q  n,h − m)  m,n,h − Qm = (m Q ◦ Qm + o(m  n,h − m). m(1)

(5.21)

The bias and the variance of the estimator (5.20) on Jm are deduced from those of the estimator m  n,h , as in Proposition 5.3 bm ◦ Qm (t) + o(h2 ), m(1) 2 σm vQm ,n,h (t) = (nh)−1 (1)2 ◦ Qm (t) + o((nh)−1 ), m bQm ,n,h (t) = h2

they are denoted bQm ,n,h = h2 bQm + o(h2 ) and, respectively 2 vQm ,n,h (t) = (nh)−1 σQ + o((nh)−1 ). m

Nonparametric Estimation of Quantiles

111

1  m,n,h − Qm ) is a consequence of Theorem The weak convergence of (nh) 2 (Q 3.1 and it is proved by the same arguments as Theorem 5.1. Let W1 be the 1 −1 (nh) 2 (m  n,h − mn,h ) on Im . Gaussian process limit of σQ m

 m,n,h − Qm } conTheorem 5.3. On Jm , the process UQm ,n,h = (nh) 2 {Q 1 verges weakly to UQm = σQm W1 ◦ Qm + γ 2 bQm . 1

The inverse of the estimator (3.22) for a regression function of an ergodic 1 and mixing process (Xt , Yt )t≥0 is (T hT ) 2 -consistent and it satisfies the same approximations and weak convergence, with the notations and conditions of Section 3.11. Under derivability conditions for the kernel, the regression function and the density of the variable X, the estimators m  n,h and its inverse are differentiable and they belong to the same class which is supposed to be sufficiently large to allow expansions of order s for estimator of function m in C k+s . The derivatives of the quantile are determined by consecutive derivatives of the inverse 1  (1) = Q , m,n,h (1)  m,n,h m  ◦Q n,h

 (2) = − Q m,n,h

  m,n,h Q m  n,h ◦ Q m,n,h (2)

(1)2  m,n,h m  n,h ◦ Q (2)

=−

(1)

m  n,h (1)3

m  n,h

 m,n,h . ◦Q (1)

Their convergence rates are deduced from those of the derivatives m  n,h (2)

and m  n,h . For m in C s (R), their bias is O(hs ) and their variances are O((nh3 )−1 ) and, respectively O((nh5 )−1 ), if m belongs to C s (R), their bias is O(hs ) and their variances are O((nh2+d )−1 ) and, respectively O((nh4+d )−1 ). Consider a partition of the sample in J disjoint groups of size nj , and J let Aj be the indicator of a group j, for j = 1, . . . , J. Let Y = j=1 Yj 1Aj J and X = j=1 Xj 1Aj where (Xj , Yj ) is the variable set in group j. For j = 1, . . . , J, the regression model for the variables (Xji , Yji )i=1,...,nj is Yji = mj (Xji ) + εji where mj (x) = E(Y | 1Aj X = x) and the expectation in the whole sample is defined from the probability pj = P (X ∈ Aj ). The conditional density fj of X given Aj is the derivative of the conditional distribution function

112

Functional Estimation for Density, Regression Models and Processes

Fj (x) = P (X ≤ x | X ∈ Aj ), and the conditional regression functions given the group Aj are m(x) =

J 

πj (x)mj (x),

j=1

πj (x) = pj

fj (x) = P (Aj | X = x). f (x)

The density of X in the whole sample is a mixture of J densities condiJ tionally on the group fX (x) = j=1 pj fj (x) and the ratio f −1 (x)fj (x) is one if the partition is independent of X. The regression functions and the conditional probability densities are estimated from the sub-samples nj i=1 Yji Kh (x − Xji ) m  j,n,h (x) =  , nj i=1 Kh (x − Xji ) J nj i=1 Yji Kh (x − Xji ) j=1 , m  n,h (x) = J nj i=1 Kh (x − Xji ) j=1 nj Kh (x − Xji ) . π j,n,h (x) = J i=1 nj i=1 Kh (x − Xji ) j=1  j,m,n,h are defined as in Equation (5.20) for each The inverse processes Q group. The inverse of the conditional probability densities πj are estimated using the same arguments, their asymptotic properties follow. 5.7

Quantile function of right-censored variables

The product-limit estimator Fn for a differentiable distribution function F on R+ under right-censorship satisfies Equation (2.12) on [0, max Ti [  x 1 − Fn (s− )  d(Λn − Λ)(s), Fn (x) = F − {1 − F (x)} 1 − F (s) 0 denoted F − ψn where Eψn = 0 and supt≤τ ψn (t)2 converges a.s. to  n converges zero for every τ < τF = sup{x > 0; F (x) < 1}. Its quantile Q therefore in probability to the quantile QF of F , uniformly on [0, τ ]. Let f be the density probability for F and let G be the distribution function of 1 the independent censoring times, the process n 2 ψn converges weakly to a centered Gaussian process with covariance function  x∧y {(1 − F )(1 − G)}−1 dΛ, CF (x, y) = {1 − F (x)}{1 − F (y)} 0

113

Nonparametric Estimation of Quantiles

at every x and y in [0, τ ], for every τ < τ0 = τF ∧ τG . As a consequence, the quantile process  n − QF ) = −n 2 n 2 (Q 1

1

ψn ◦ QF + rn , f

(5.22)

is unbiased and it converges weakly to a centered Gaussian process with covariance function CF (QF (s), QF (t)) , c(s, t) = f ◦ QF (s) f ◦ QF (t) for all s and t in [0, F (τ0 )]. The remainder term is such that supt≤F (τ0 ) rn  is a oL2 (1). A smoothed quantile process is defined by integrating the smoothed process   n (s), qn,h (t) = Kh (t − s) dQ which is an uniformly consistent estimator of the derivative of QF (t) (1)

q(t) = QF (t) = Its expectation is qn,h (t) =



1 . f ◦ QF (t)

Kh (t − s) dQ(s) and its bias

h2 (3) m2K QF (t) + o(h2 ) 2 if F belongs to C 3 . Its variance and covariance functions are deduced from the representation (5.22) of the quantile, for s = t and as n tends to infinity  t s 1  1 −1 1{u =v} Kh (u − u ) Cov{ qn,h (t), qn,h (s)} = n bqF ,n,h =

0

0

−1

−1

× Kh (v − v  ) du dv d2 c(u , v  ) 1

+ (nh)−1 κ2 c(s ∧ t, s ∧ t) + o(n− 2 ) = (nh)−1 κ2 c(s ∧ t, s ∧ t) + o(n 5.8

− 12

).

Conditional quantiles with varying bandwidth

The pointwise conditional mean squared errors for the empirical conditional distribution function and its inverses reach their minimum at a varying bandwidth function. So the behaviour of the estimators with such bandwidth is now considered. Condition 4.1 are supposed to be satisfied in addition to 2.1 or 2.2. The results of Propositions 5.1 and 5.2 still hold with a

114

Functional Estimation for Density, Regression Models and Processes

functional bandwidth sequence hn and approximation orders o(hn 2 ) for the bias and o(nh−1 n ) for the variance and a functional convergence rate 1 (nhn ) 2 for the process νY |X,n,h . This is an application of Section 4.3 with the following expansion of the covariances. Lemma 5.2. The covariance of FY |X,n,h (y; x1 ) and FY |X,n,h (y; x2 ) equals 2[nκ2 {hn (x1 ) + hn (x2 )}]−1 × [vF (y; zn (x1 , x2 ))

 K(v − αn (v))K(v + αn (v)) dv (1)

−2 + δn (x1 , x2 )fX (zn (x, y)){(vΛ Y |XfX )(1) − m2 fX }(zn (x, y))  × vK(v − αn (v))K(v + αn (v)) dv + o(hn )].

The mean squared errors the functional bandwidth sequences are similar to the MSE and MISE of Section 5.3. The conditional quantiles are now defined with functional bandwidths satisfying the convergence condition 4.1. The representations (5.16) and (5.17) of the empirical conditional quantiles are modified by the remainder 1 terms rY,n,h and rX,n,h which become oL2 ((nhn )− 2 ). The expansions of their bias and variance are also written with the uniform norms of the bandwidths, generalizing Propositions 5.3 and 5.4 and the weak convergence of the quantile processes is proved as for the kernel regression function with variable bandwidth in Section 4.4. 5.9

Exercises

(1) Consider the quantile process Fn−1 of a continuous  1 distribution function F and the smooth quantile estimator Tn (t) = 0 Kh (t − s)Fn−1 (s) ds, for t in [0, 1]. Prove its consistency and write expansions for its bias and its variance. (2) Determine the limiting distribution of the quantiles with respect to X and Y for the estimator of the distribution function of Y ≤ y conditionally on X ≤ x. (3) Determine the limiting distribution of smoothed quantiles with respect to X and Y for the estimator of the distribution function of Y ≤ y conditionally on X ≤ x.

Chapter 6

Nonparametric Estimation of Intensities for Stochastic Processes

6.1

Introduction

Let Nn = {Nn (t), t ≥ 0} be a sequence of counting processes defined on a probability space (Ω, A, P ) associated to a sequence of random time variables (Ti )1≤i≤n Nn (t) =

n 

1{Ti ≤t} ,

t ≥ 0,

i=1

where Ti = inf{t; Nn (t) = i}, and let Fn = (Fnt )t∈R+ denote the history generated by observations of Nn and other observed processes before t. The predictable compensator of Nn with respect to Fn is the unique n such that Nn − N n is a Fn Fn− -measurable (or predictable) process N martingale on (Ω, A, P ). Consider a counting process Nn with a predictable compensator n (t) = N

n   i=1

0

t

Yi (s)μ(s, Zi (s)) ds

where Yi and Zi are predictable processes with values in metric spaces Y and Z and μ(s, z) = λ(s)r(z) is a strictly positive function for s > 0. This model with a random variable or process Z is classical when the observations are right-censored. The right-censorship is defined by a sequence of random censoring variables (Ci )1≤i≤n and the observations are the sequences of times (Ti ∧ Ci )1≤i≤n and indicators (δi )i = (1{Ti ∧Ci } )i , with values 1 if Ti is observed n and 0 otherwise, so that the counting processes NnT (t) = i=1 1{Ti ≤t} and 115

116

NnC (t) =

Functional Estimation for Density, Regression Models and Processes

n i=1

1{Ci ≤t} are partially observed. They define the processes Nn (t) =

n 

δi 1{Ti ∧Ci ≤t} =

i=1

Yn (t) =



n 

1{Ti ≤t∧Ci } ,

i=1

1{Ti ∧Ci ≥t} .

1≤i≤n

All processes are observed in an increasing time interval [0, τ ] such that Nn (τ ) = n tends to infinity. With independent and identically distributed fT , the relationships variables Ti with distribution function FT and density t between the survival function 1−FT (t) = exp{− 0 λ(s) ds} and the hazard function λT = (1 − FT )−1 fT are equivalent. With independent and identically distributed censoring variables Ci , independent of the time sequence (Ti )1≤i≤n and with distribution function FC , the hazard function of the cen sored counting process Nn (t) = 1≤i≤n δi 1{Ti ≤t} is identical to λT . The aim of this chapter is to define smooth estimators for the baseline hazard function and regression function of intensity models and to compare them with histogram-type estimators. Several regression models are considered, with parametric or nonparametric regression functions. Let Jn (t) = 1{Yn (t)>0} be the indicator of censored times occurring after t. The baseline intensity λ of an intensity μn (t) = λ(t)Yn (t) is estimated for t in [h, τ − h] by smoothing the estimator of the cumulative t hazard function Λ(t) = 0 λ(s) ds, which is asymptotically equivalent to t n (s) as Jn tends to 1 in probability. The unbiased Nelson J (s)Yn−1 (s) dN 0 n estimator (1972) is defined as  t  n (t) = Λ Jn (s)Yn−1 (s) dNn (s), 0

n with the convention 0/0 = 0. The function λ is estimated by smoothing Λ  1 n,h (t) = Yn−1 (s)Jn (s)Kh (t − s) dNn (s). (6.1) λ −1

A stepwise estimator for λ is also defined on an observation time [0, τ ] as the ratio of integrals over the subintervals (Bjh )j≤Jh of a partition of the observation interval into Jh = h−1 τ disjoint intervals with length h tending to zero. For every t belonging to Bjh , the histogram-type estimator of the function λ is estimated at t by  B Jn (s) dNn  λn,h (t) =  jh (6.2) Y (s) ds Bjh n

Nonparametric Estimation of Intensities for Stochastic Processes

117

where the normalization of the histogram for a density is replaced by the integral of Yn . Let Ni (t) = 1{Ti ≤t} be the counting process of the i-th time variable, with a multiplicative intensity μ(t, Zi (t)) = λ(t)r(Zi (t))Yi (t). In the Cox model, the regression function r defining the point process is exp(β T z), with an unknown parameter β belonging to an open bounded set of Rd and z in the metric space (Z, ·, ) of the sample-paths of a regression processes Zi . The estimators of λ and β are defined by the expectations (0) of the process Sn defined by weighting each term of the sum Yn by the regression function at the jump time Sn(0) (t; β) =

n 

rZi (t; β)1{Ti ∧Ci ≥t} ,

i=1

with the parametric function rZ (t; β) = exp{β T Z(t)}. For k = 1, 2, let also n (k) (0) ⊗k with Sn (t; β) = i=1 rZi (t; β)Zi (t)1{Ti ≥t} be the derivatives of Sn respect to β, let Z ⊗0 = 1, Z ⊗1 = Z and Z ⊗2 be the scalar product. The true regression parameter value is β0 , or r0 for the function r and the predictable compensator of the point process Nn is  t  Nn (t) = Sn(0) (s; β0 )λ(s) ds . (6.3) 0

Let Jn = 1{S (0) >0} , the classical estimators of the Cox model rely on the n estimation of the function Λ(t) by the stepwise process  τ  Λn (t; β) = Jn (s){Sn(0) (s; β)}−1 dNn (s), 0

at fixed β and the parameter β of the exponential regression function T rZ (t; β) = eβ Z(t) is estimated by maximization of the partial likelihood with the convention 00 = 1  βn = arg max {rZi (t; β)Sn(0)−1 (Ti ; β)}δi . β

Ti ≤τ

The baseline hazard function λ is estimated by smoothing the estimator of the cumulative hazard function at βn  τ n,h (t; β) = Jn (s){Sn(0) (s; β)}−1 Kh (t − s) dNn (s), (6.4) λ 0

n,h = λ n,h (βn,h ). and λ

118

Functional Estimation for Density, Regression Models and Processes

A histogram-type estimator for the baseline intensity is now defined as    n,h (t; β) = 1Bjh (t)[ Jn (s) dNn (s)][ Sn(0) (s; β) ds]−1 , (6.5) λ Bjh

j≤Jh

Bjh

n,h = λ n,h (βn,h ). and λ More generally, a nonparametric function r is estimated by a stepwise process rn,h defined on each set Bjh of the partition (Bjh )j≤Jh , centered at ajh . Let also (Dlh )l≤Lh be a partition of the values Zi (t), i ≤ n, centered at zlh . The function r is estimated in the form rn,h (Z(t)) =  n,h (zlh )1Dlh (Z(t)) l≤Lh r 

n,h (t; r) = λ

,

 1Bjh (t)

Jn (s) dNn (s) Bjh

j≤Jh

rn,h (zlh ) = arg max rl

 Ti ≤τ

-−1 Sn(0) (s; r) ds

,

Bjh

⎫ ⎡⎧ ⎤δi ⎨ ⎬ n,h (Ti ; rlh )⎦ ,(6.6) ⎣ rl 1Dlh (Zi (Ti )) λ ⎩ ⎭ l≤Lh

n (0) where Sn (t; r) = i=1 rZi (t)1{Ti ≥t} is now defined for a nonparametric n,h (t, rn (Z(t))). A kernel estimator n,h (t, Z) = λ regression function, then λ for the functions λ is similarly defined by n,h (t; r) = λ



1

−1

Jn (s){Sn(0) (s; r)}−1 Kh (t − s) dNn (s) .

(6.7)

An approximation of the covariates values at jump times by z when they are sufficiently close allows to build a nonparametric estimator of the regression function r like β in the parametric model for r rn,h (z) = arg max rz

n   i=1

1

−1

n (s; rz )}Kh (z − Zi (s)) dNi (s), {log rz (s) + log λ 2

where h2 = hn2 is a bandwidth sequence satisfying the same conditions as n (t; rn (t, Z(t)). n (t, Z) = λ h, and λ 2 The L -risk of the estimators of the intensity functions splits into a squared bias and a variance term and the minimization of the quadratic risk provides an optimal bandwidth depending on the parameters and functions of the models and having similar rates of convergence, following the same arguments as for the optimal bandwidths for densities.

Nonparametric Estimation of Intensities for Stochastic Processes

6.2

119

Risks and convergences for estimators of the intensity

Conditions for expanding the bias and variance of the estimators are added or adapted from Conditions 2.1 and 2.2 of the previous chapters concerning the kernel and the bandwidths. A kernel function with a compact support will be considered as defined on [0, 1]. For the intensities, they are regularity and boundedness conditions for the functions of the models and for the processes. Condition 6.1. (1) As n tends to infinity, the process Yn is positive with a probability tending to 1, i.e. P {inf [0,τ ] Yn > 0} tends to 1, and there exists a g such that sup[0,τ ] |n−1 Yn − g| tends a.s. to zero; function τ −1 (2) 0 g (s)λ(s) ds < ∞; (3) The functions λ and g belong to C s (R+ ), s ≥ 2. In the model of right-censored random times and under the condition F¯T (τ )F¯C (τ ) > 0, P {inf [0,τ ] Yn > 0} tends to 1 as n tends to infinity and the function g(t) = F¯T (t)F¯C (t) is strictly positive on [0, τ ]. 6.2.1

Kernel estimator of the intensity

We consider the kernel estimator (6.1) on [h, τ − h], let  1 λn,h (t) = Jn (s)Kh (t − s)λ(s) ds, for t ∈ [h, τ − h], −1

be its expectation and let λn (t) = Jn (t)λ(t), defined as λ(t) on the random interval In,τ = 1{Jn =1 }[0, τ ]. Let also In,h,τ = 1{Jn =1 }[h, τ − h] be the interval where all convergences will be considered. Since Jn (t) − 1 tends uniformly to zero in probability, supt∈In,h,τ |λ(t) − λn,h (t)| tends to zero in probability. Proposition 6.1. Under Conditions 2.1 and 6.1 with hn converging to zero and nh to infinity n,h (t) − λ(t)| converges to zero in probability. (a) supt∈In,h,τ |λ (b) For every t in In,h,τ , the bias λn,h (t) − λ(t) of the estimator is hs msK λ(s) (t) + o(hs ), bλ,n,h (t; s) = s! denoted hs bλ (t) + o(hS ), its variance is n,h (t)} = (nh)−1 κ2 g −1 (t)λ(t) + o((nh)−1 ), V ar{λ

120

Functional Estimation for Density, Regression Models and Processes

n,h (t) and also denoted (nh)−1 σλ2 (t) + o((nh)−1 ) and the covariance of λ n,h (s) is λ

   n,h (t)} = (nh)−1 λ s + t n,h (s), λ Cov{λ K(v−αh )K(v+αh )dv+o(1) g 2 if αh = |t − s|/2h ≤ 1 and zero otherwise, with uniform approximations on In,h,τ .  n − Λ, its predictable compensator is N 2n . By Proof. (a) Let MΛ,n = Λ Lenglart’s inequality applied to the martingale MΛ,n , for every n ≥ 1  1 3 4 n,h − λ| ≥ η ≤ η −2 E sup P sup |λ Kh2 (t − u)(Jn Yn−1 λ)(u) du [h,τ −h]



−2

−1

(nh)

κ2 E

.

t∈[h,τ −h]

sup t∈[0,τ ]

−1

/ ,

Jn (t)Yn−1 (t)λ(t)

and it tends to zero as n tends to infinity. (b) For every t in In,h,τ , the bias bλ,n,h (t) = λh (t) − λn,h (t) develops as  1 bλ,n,h (t) = Kh (t − u)E{Jn (u)λ(u) − Jn (t)λ(t)} du −1 1

 =

−1 s

E{Jn (t + hz)λ(t + hz) − Jn (t)λ(t)}K(z) dz

h (s) λ (t) + o(hs ), s! where EYn (s) = P (Yn (s) > 0) = P (maxi≤n Ti > s) = 1 − FTn (s) belongs to ]0, 1[ for every s ≤ τ , for independent times Ti , i ≤ n. Its variance is  1 n,h (t)} = E Kh2 (t − u)Jn (u)Yn−1 (u)λ(u) du V ar{λ −1  = (nh)−1 K 2 (z)EJn (t + hz)g −1 (t + hz)λ(t + hz) dz =

= (nh)−1 κ2 g −1 (t)λ(t) + o((nh)−1 ).  n,h (t) and λ n,h (s) is The covariance of λ Kh (s − u)Kh (t − u)Jn (u)Yn−1 (u)λ(u) du, it is zero if αh = |x − y|/(2h) > 1 and, if αh ≤ 1, it is approximated by a change of variables as in Proposition 2.2 for the density, under Condition 6.1.  Proposition 6.2. For every integer p ≥ 2, the Lp -norms of the estimator are n,h (t) − λn,h (t)p = O((nh)− 2 ), sup λ 1

t∈In,h,τ

Nonparametric Estimation of Intensities for Stochastic Processes

121

and if λ belongs to C s (R+ ) n,h (t) − λ(t)p = O((nh)− 12 ) + O(hs ). sup λ t∈In,h,τ

Proof. The Lp -moment of a martingale (Mt )t≤τ are approximated on a partition (ti )i=0,...,n of [0, τ ] such that δn = |ti+1 − ti | tends to zero as n  tends to infinity, and where Mt = ti ≤t (Mti − Mti−1 ). By the martingale property, it follows that ⎧⎛ ⎞k ⎫ ⎪ ⎪ p ⎨ ⎬  k  p ⎠ ⎝ E E(Mt ) = (Mti − Mti−1 ⎪ ⎪ p ⎩ ti ≤t ⎭ k=0 ⎧⎛ ⎞p−k ⎫ ⎪ ⎪ ⎨ ⎬  . (Mtj − Mtj−1 .⎠ ×E ⎝ ⎪ ⎪ ⎩ tj ≤t,tj =ti ⎭ n and its Applying this property to the centered martingale M n = Nn − N τ −1  stochastic integral λn,h (t) − λn,h (t) = 0 Kh (t − u)Jn (u)Yn (u) d(Nn − n )(u), with the discretization N n  n,h (t) − λn,h (t) = Kh (t − ti )Jn (ti )Yn−1 (ti ){Mn (ti+1 ) − Mn (ti )} + o(δn ), λ i=1

we obtain n,h − λn,h 3 = E Eλ 3

, n 

Kh3 (u − ti )Jn (ti )Yn−3 (ti ){Mn (ti+1 ) − Mn (ti )}3

i=1

× {1 + o(1)}  n 3  K (u − t ) i h E{Mn (ti+1 ) − Mn (ti )}3 | Fti }] n−3 g 3 (ti ) i=1

 =E

× {1 + o(1)} = O((nhn )−2 ). For every integer p ≥ 3 n,h − λn,h p = E Eλ p



n−1

n 

Kh (t − ti )Jn (ti )nYn−1 (ti )

i=1

× {Mn(ti+1 ) − Mn (ti )}

p 

{1 + o(1)}

and the higher order terms of the expansion of this term is the largest number of products of moments of conditionally independent variables with the minimum order 2 if p is even, 2 and 3 if p is odd, like for the moments  of fn,h in Proposition 2.2.

122

Functional Estimation for Density, Regression Models and Processes

n,h (t) − λn,h (t)}2 develops on In,h,τ as the sum of For p = 2, E{λ n,h (t)} a squared bias and a variance terms {λn,h (t) − λ(t)}2 + V ar{λ −2 2 2s (s)2 and its first order expansions are (s!) msK h λ (t) + o(h2s ) + (nh)−1 κ2 g −1 (t)λ(t) + o(n−1 h−1 ). The asymptotic mean squared error n,h ) = (nh)−1 κ2 g −1 (t)λ(t) + 1 m2 h2s λ(s)2 (t), AM SE(t; λ (s!)2 sK is minimum for the bandwidth function 1

s!(s − 1)! κ λ(t)  2s+1 1 2 hAMSE,n (t) = n− 2s+1 . 2m2sK g(t)λ(s)2 (t) n,h at t is The global asymptotic mean integrated squared error for λ  τ  τ 1 −1 −1 2 2s  g (t)λ(t) dt + m h λ(s)2 (t) dt, AM ISE(λn,h ) = (nh) κ2 (s!)2 sK 0 0 it is minimum for the global bandwidth 1

s!(s − 1)! κ  τ g −1 (t)λ(t) dt  2s+1 1 2 0 − 2s+1  . hAMISE,n = n τ 2m2sK 0 λ(s)2 (t) dt It is estimated by 1  hAMISE,n,h = n− 2s+1

,

- 1 τ  n (t) 2s+1 s!(s − 1)! κ2 0 Yn−1 (t) dΛ .  τ (s)  (t)}2 dt 2m2sK 0 {λ n,h

Another integrated asymptotic mean squared  τ error is the average of AM SE(T ) for the intensity, E{AM SE(T )} = 0 AM SE dFT also written  AM ISEn (h; FT ) = {h2s bλ (t) + (nh)−1 σλ2 (t)}{1 − F (t)} dΛ(t), and it is estimated by plugging estimators of the intensity, bλ and σλ2 into the empirical mean  τ −1  n,h(t)}Y −1 (t) dNn (t). n {h2s b2λ (t) + (nh)−1 σλ2 (t)} exp{−Λ n 0

Its minimum empirical bandwidth is 1   2s+1 τ n,h exp{−Λ  n,h}Yn−2 dNn λ 1 − 0  hλ,n,h = n 2s+1 .  τ (s)  }2 exp{−Λ  n,h}Yn−1 dNn 2m2sK 0 {λ n,h Proposition 6.3. Under Conditions 2.1 and 6.1, and for every real p > 0, the kernel estimator has the Lp -risk s n,h , λ) = O(n− 2s+1 Rp (λ ),

1

with hn = O(n− 2s+1 ).

Nonparametric Estimation of Intensities for Stochastic Processes

Proof.

123

For every f in C 2 (R) we have p  τ  τ   n,h , λ) ≤ 2p−1 Rpp (λ Kh (t − s)λ(s) ds − λ(t) dt   τ0 τ0 p    n (s) − dΛ(s)} dt , Kh (t − s) {dΛ +E  0

0

where the first term is approximated by  hsp mpsK τ (s) p |λ (t)| dt + o(hsp ). R1p = (k!)p 0  n (t) − The sum of the quadratic variations of the process martingale Λ t −1 Λ(t) is the increasing process 0 Jn Yn dΛ with expectation μn (t) then, by the Burkholder–Davis–Gundy inequality for martingales, there exists a constant Cp such that  τ p

 τ  p2      Kh (t − s) {dΛn (s) − dΛ(s) ≤ Cp Kh2 (t−s)Jn (s)Yn−1 (s) dΛ(s) .  0

0

If 0 < p < 2, by concavity the last integral In,h,p (t) has the bound  τ

 τ  τ  p2 E In,h,p (t) dt ≤ Kh2 (t − s) dμn (s) dt 0

0

0



p2 p −1 ≤ h κ2 sup μn (t) = O((nh)− 2 ), t≤τ

1 n,h , λ) = O(n then Rpp (λ ) with hn = O(n− 2s+1 ). If 2 ≤ p < ∞, the result is established from Proposition 6.2 for the integer part of p, like in Proposition 2.7 for independent variables.  sp − 2s+1

Proposition 6.4. Let λ belong to Fs,p = {u ∈ C s (R+); u(s) ∈ Lp (R+)} for a real p > 1 and an integer s ≥ 2, the minimax risk for an intensity λ in Fs,p and an estimator in the set Fn of its estimators in Fs,p based on the observations of the point process Nn on a bounded interval [0, τ ] is inf

s n , λ) = O(n− 2s+1 sup Rp (λ ).

 n ∈F n λ∈Fs,p λ

Proof. that

Let λn be a small perturbation of the function λ on [0, τ ], such λn,a (t) = λ(t) + ηn



aj g(b−1 n (t − tj )),

j=1,...,Kn

where aj belongs to {−1, 1} for every j, ηn and bn tend to zero as n tends to infinity, Kn bn = b − a, g is a symmetric and positive function of Fs,p ([0, 1]),

124

Functional Estimation for Density, Regression Models and Processes

and (tj )j=1,...,Kn is a partition of [a, b] with path bn so that the functions g(b−1 n (t − tj ) have disjoint supports. Let Ht be the σ-algebra generated by the observation of the point processes Nn and Yn up to t, there exist probabilities Pn,λ and P0 such that under Pn,λ the distribution of the point process Nn is absolutely continuous with respect to its distribution under P0 , with density at t   dP n,λ | Ft fn,λ (t) = E dP0  t  t  = f0 exp log{λ(s)Yn (s)} dNn (s) + {1 − λ(s)Yn (s)} ds , 0

0

in the same way, there exists a probability Pn,λn,a , a in {−1, +1}Kn , such that under Pn,λn,a , Nn (t) has a distribution absolutely continuous with respect to its distribution under P0 , with density fn,λn,a (t). The information −1 K(fn,λn,a (t), fn,λn,a (t)) = EPn,λn,a {Pn,λn,a dPn,λ | Ft }, n,a is expressed as  τ λn,a (s) λn,a (s)  K(fn,λn,a , fn,λn,a ) = log +1− Yn (s)λn,a (s) ds λn,a (s) λn,a (s) 0   1 τ  {λn,a (s) − λn,a (s)}2 {1 + o(1)} Yn (s) ds = 2 0 λn,a (s)  τ   = nηn2 (a2j − aj2 ) bn g 2 (u) n−1 Yn (tj + bn u) du 0

j=1,...,Kn

= O(nηn2 bn ), and the Lp norm of λn,a − λn,a satisfies    p   Rpp (λn,a , λn,a ) = ηnp  (aj − aj )g(b−1 n (x − xj )) dx j=1,...,Kn

=

bn ηnp

gpp

Kn 

|aj − aj |p .

j=1

The inequality (2.6) with h2 (f1 , f2 ) ≤ K(f1 , f2 ) implies the existence of constants c1 and c2 such that n ) ≥ c1 η p exp(−c2 nη 2 bn ). inf sup Rpp (λ, λ n n  n ∈F  λ∈F λ

n belongs to Fs,p if This bound is maximum if nηn2 bn = O(1), moreover λ s 1 − 2s+1 −s ) and bn = O(n− 2s+1 ), then ηn bn = O(1), it follows that ηn = O(n  a lower bound for the minimax risk Lp is a O(ηn ). n,h of the intensity is therefore minimax. The kernel estimator λ

Nonparametric Estimation of Intensities for Stochastic Processes

125

An estimator of the derivative λ(k) or its integral are defined by the means of the derivatives K (k) of the kernel, for k ≥ 1  (k) (t) = K (k) (t − s)Jn (s)Y −1 (s) dNn (s). λ n n,h h Proposition 2.3 established for the densities is generalized to the intensity λ. Lemma 2.1 allows to develop the expectation of the estimator of the first derivative as  (1) (1) λn,h (t) = Kh (u − t)λ(u) du   h2 = λ(1) (t) zK (1) (z) dz + λ(3) (t) z 3 K (1) (z) dz + o(h2 ) 6 2 h = λ(1) (t) + m2K λ(3) (t) + o(h2 ), 2 (1)

s

for an intensity of C 3 or λn,h (t) = λ(1) (t) + hs! msK λ(s) (t) + o(hs ) for an  (1) is (nh3 )−1 g −1 (t)λ(t) K (1)2 (z) dz + intensity of C s . The variance of λ n,h

o((nh3 )−1 ). The optimal local bandwidth for estimating λ(1) belonging to C s is therefore  1

1 λ(t) K (1)2 (z) dz  2s+3 hAMSE (λ(1) ; t) = n− 2s+3 s!(s − 1)! . 2m2sK g(t)λ(3)2 (t) (2) is the derivaFor the second derivative of the intensity, the estimator λ n,h (1)  tive of λ expressed by the means of the second derivative of the kern,h

(2) nel. For a function λ in class C 4 , the expectation of the estimator λ n,h (2)

h2 (4) (t) + o(h2 ), 2 m2K λ 2 (2) h (4)  of λ (t) n,h is 2 m2K λ (2)2 5 −1

is λn,h (t) = λ(2) (t) + (2)

to λ . The bias  (nh5 )−1 g −1 (t)λ(t) K

(z) dz + o((nh )

so it converges uniformly + o(h2 ) and its variance

).

Proposition 6.5. Under Condition 2.2, for every integers k ≥ 0 and s ≥ 2 (k) of the k-order and for intensities belonging to class C s , the estimator λ n,h derivative λ(k) has a bias O(hs ) and a variance O((nh2k+1 )−1 ) on In,h,τ . s Its optimal local and global bandwidths are O(n− 2k+2s+1 ) and the optimal s L2 -risks are O(n− 2k+2s+1 ). Consider the normalized process n,h (t) − λ(t)}, t ∈ In,h,τ . Uλ,n,h (t) = (nh) 2 {λ 1

126

Functional Estimation for Density, Regression Models and Processes

The tightness and the weak convergence of Uλ,n,h on In,h,τ are proved by studying moments of its variations and the convergence of its finite dimensional distributions. For independent and identically distributed observations of right-censored variables, the intensity of the censored counting process has the same degree of derivability as the density functions for the random times of interest. Lemma 6.1. Under Condition 6.1, for every intensity of C s there exists a constant C such that for every t and t in In,h,τ satisfying |t − t| ≤ 2h n,h (s)}2 ≤ C(nh3 )−1 |t − t |2 . n,h (t) − λ V ar{λ n,h (t ) − λ n,h (t) develops Proof. Let t and t in In,h,τ , the variance of λ according to their variances given by Proposition 6.1 and the covariance between both terms which is zero if |t − t | > 1 as established in the same n,h (t )|2 develops as n,h (t) − λ E|λ proposition. The second2 order moment −1 {Kh (t−u)−Kh (t −u)} Jn (u)Yn (u)λ(u) du and it is a O((t−t )2 n−1 h−3 n ), by the same approximation as for the proof of Lemma 2.2 and the uniform  convergence of Jn Yn−1 . Theorem 6.1. Under Condition 6.1, for a density λ of class C s (Iτ ) and with nh2s+1 converging to a constant γ, the process 1 n,h − λ}1{I , Uλ,n,h = (nh) 2 {λ n,h,τ } 1

converges weakly to Wλ + γ 2 bλ , where Wλ is a continuous Gaussian process on Iτ with expectation zero and covariance E{Wλ (t )Wλ (t)} = σλ2 (t)δ{t ,t} , at t and t in Iτ , and σλ2 (t) = g −1 (t)λ(t). Proof. The weak convergence of the finite dimensional distributions  1 n,h − λn,h )(t) = (nh) 12 1 Kh (t − of the process Wλ,n,h (t) = (nh) 2 (λ −1 u)Jn (u)Yn−1 (u) dMn (u) on In,h,τ is a consequence of the convergence 1 of its variance and of the weak convergence of the martingale n− 2 Mn to a continuous Gaussian process with expectation zero and covariance  t∧t g(u)λ(u) du at t and t . The covariance between Wλ,n,h (t) and 0 Wλ,n,h (t ), for t = t , is approximated by  n−1 Kh (t − u)Kh (t − u)g(u)−1 λ(u) du

  1{0≤α 2hn . The process Un,h is therefore tight.  Corollary 6.1. The process 1

sup σλ−1 (t)|Uλ,n,h (t) − γ 2 bλ (t)|

t∈In,h,τ

converges weakly to supIτ |W1 |, where W1 is the Gaussian process with expectation zero, variance 1 and covariances zero. For every η > 0, there exists a constant cη > 0 such that 

1 Pr sup |σλ−1 (Uλ,n,h − γ 2 bλ ) − W1 | > cη In,h,τ

tends to zero as n tends to infinity. The Hellinger distance between two probability measures P1 and P2 defined by intensity functions λ1 and λ2 is   λ  12  1 − F  12  1 1 2 1− dF1 , h (P1 , P2 ) = λ2 1 − F2   12 1   1 − λλ12 e− 2 (Λ1 −Λ2 ) dF1 . The and it is also written h2 (P1 , P2 ) = estimator of a function λ satisfies  λ n,h 1 − Fn  12  2  1− dF h (λn,h , λ) = λ 1−F    λn,h 1 − Fn 1 = − 1 {1 + o(1)} dF 2 λ 1−F    1 − F  n,h 1  λ n = −1 + − 1 {1 + o(1)} dF 2 λ 1−F     λn,h 1 1 − Fn + −1 − 1 {1 + o(1)} dF {1 + o(1)} dF, 2 λ 1−F 1

n,h , λ) to zero is nhn2 . and the convergence rate of h2 (λ A varying bandwidth estimator is defined for multiplicative intensities under Condition 4.1, with the optimal convergence rate. The bias and the variance of the estimator are modified as hn (t)s m2K λ(s) (t) + o(hn 2 ), bλ,n,hn (t) (t) = s! its variance is n,h (t) (t)} = (nhn (t))−1 κ2 g −1 (t)λ(t) + o((n−1 h−1 ), V ar{λ n

n

128

Functional Estimation for Density, Regression Models and Processes

n,h (t) (t) − λn,h (t) (t)p = 0((n−1 h−1 ) 12 ). The covariance of and Eλ n n n n,h (t) (t )} equals n,h (t) (t) and λ λ n n 

Khn (t) (t − u)Khn (t ) (t − u)Yn−1 (u) dΛ(u)  2 −1  = (g λ)(zn (t, t )) K(v − αn (v))K(v + αn (v)) dv n{hn (t) + hn (t )}

 −1 (1) + δn (x, y)(g λ) (zn (x, y)) vK(v − αn (v))K(v + αn (v)) dv + o(hn ) ,

E

−1 with αn (x, y, u) = 12 {(u − x)h−1 n (x) − (u − y)hn (y)} and v = {(u − −1 −1 x)hn (x) + (u − y)hn (y)}/2. Lemma 4.2 is fulfilled for the mean n,h (t) (t) − n,h (t) (t) which satisfy E|λ squared variations of the process λ n n  2 −1 −1 3  2 −1  −1   λn,h(t ) (t )| = O(n hn  (t − t ) ), if |thn (t) − t hn (t )| ≤ 1. Othern,h are zero, this implies the weak wise, the mean squared variations of λ 1  convergence of the process (nhn (t)) 2 {λn,hn (t) (t) − λ(t)}I{t ∈ In, hn ,τ } to 1 the process Wf (t)+h 2 (t)bf (t), where Wf is a continuous centered Gaussian process on Iτ with covariance σλ2 (t)δ{t,t } at t and t .

6.2.2

Histogram estimator of the intensity

The histogram estimator (6.2) for the intensity λ is a consistent estimator  as h tends to zero and n to infinity. Let Kh (t) = h−1 j∈Jτ,h 1Bjh (t) be the kernel corresponding to the histogram, the histogram estimator (6.2) is defined as the ratio of two stochastic integrals on the same subintervals of the partition of [0, τ ]. Let  λj,h = λ(aj,h ) where aj,h is thecenter of Bjh , so that the expectations of Bjh Jn (s) dNn and, respectively Bjh Yn (s) ds, are   asymptotically equivalent to λj,h Bjh g(s) ds and, respectively Bjh g(s) ds. The convergence rates of its bias and its variances are determined under the additional condition Condition 6.2. 1

(1) As n tends to infinity, sup[0,τ ] n−1 Yn − g2 = O(n− 2 ), (2) The function g is differentiable on [0, τ ]. Proposition 6.6. Under Conditions 6.1(1), 6.2, n,h (x) − λ(x)| converges a.s. to zero. (a) supx∈IX,h |λ

Nonparametric Estimation of Intensities for Stochastic Processes

129

n,h (x) is a O(h) and its variance is (b) The bias of λ vλ,n,h (x) = (nh)−1 { σλ2 (x) + o(1)} 2 where σ λ,n,h (x) = {g  (ajh )}−1 on Bj,h .  Proof. Condition 6.1 implies that Bj,h g(s) ds is strictly positive on   [0, τ ], Bj,h {n−1 Yn (s)−g(s)} ds and n−1 Bj,h {Jn (s) dNn (s)−Yn (s)λ(s) ds} converge a.s. to zero, which imply (a). To prove (b), let  n  Jn (s) dNn = n−1 1{Yn (Ti ) > 0, Ti ≤ Ci , Ti ∈ Bjh }, n−1 Bjh

n−1

i=1



Yn (s) ds = n−1

Bjh

n   i=1

1{s ≤ Xi } ds,

Bjh

their expectations for n large enough are pj,h = P (Ti ≤ Ci , Ti ∈ Bjh ){1 + o(1)} = hF¯C (ajh )fT (ajh ) + o(h),  qj,h = E 1{s≤Xi } ds Bjh

1 0

h h = hP X ≥ ajh + +E X − ajh − 1{X∈Bjh } , 2 2 where |qj,h − hF¯T (ajh )F¯C (ajh )| ≤ h2 (fT F¯C + F¯T fC )(ajh ). A first order expansion implies  Jn (s) dNn − hF¯C (ajh )fT (ajh ) B  hn,j,h = λj,h + jh h(F¯T F¯C )(ajh )  Yn (s) ds − h(F¯T F¯C )(ajh ) B + r(h), −λj,h jh h(F¯T F¯C )(ajh )   where r(h) = o(| Bjh Jn (s) dNn − hF¯C (ajh )fT (ajh )|) + o(| Bjh Yn (s) ds − h(F¯T F¯C )(ajh )|) then λj,h + λC,j,h + o(h). F¯T (ajh ) t t The centered martingale Mn (t) = 0 Jn (s) Nn (s) − 0 Yn (s)λ(s) ds has t the variance E{Mn2 (t)} = 0 Yn (s)λ(s) ds. The approximation of the variance of the histogram on Bjh comes from the expansion   n−1 { Bjh Jn (s) dNn − nλjh Bjh g(s) ds}   λn,h (x) = λjh + Bjh g(s) ds   −1 n Bjh Yn (s) ds − Bjh g(s) ds  + rn,h , (6.8) −λjh g(s) ds Bjh |E  hn,j,h − λj,h | ≤ hλj,h

130

Functional Estimation for Density, Regression Models and Processes

   where rn,h = o(|n−1 Bjh Jn (s) dNn − Bjh g(s)λ(s) ds|) + o(| Bjh (n−1 Yn −  ng) ds|). The integrals Bjh (Yn − ng)(s) ds and    −1 −1 Jn (s) dNn − λjh g(s) ds = n Jn (s) dMn n Bjh

Bjh

Bjh



+λjh

(n−1 Yn − g) ds,

Bjh 1

are centered, their L2 norms are O((nh)− 2 ) under Condition 6.2 and due to the value n−2 E[{ Bjh Jn (s) dMn (s)}2 ] = n−1 λjh Bjh g(s) ds = O((nh)−1 ). n,h (x) is a O((nh)−1 ) from the expansion The variance vλ,n,h (x) of λ (6.8) 1 vm,n,h (x) = V ar {n Bjh g(s) ds}2 

= =



1 λjh E {n Bjh g(s) ds}2 

n

Jn (s) dMn (s) Bjh



Yn (s) ds Bjh

1 1 = + o((nh)−1 ).  (a ) g(s) ds nhg jh Bjh





n,h (t) is minimal for The asymptotic mean squared error of the estimator λ the bandwidth 1 1 1 hn (t) = n− 3 {2b2λ (t)}− 3 vλ3 (t).

6.3

Risks and convergences for kernel estimators (6.4)

The estimators (6.4) for the exponential regression of the intensity are special cases of those defined by (6.7) in a multiplicative intensity with explanatory predictable processes and an unknown regression function r. For every  n,h (t; r) is still λn,h (t) = 1 Kh (t−s)λ(s) ds t in In,h,τ , the expectation of λ −1 and their degree of derivability is the same as K. With a parametric regression function r, the convergence in the first condition of 6.1 is replaced by the a.s. convergence (k) sup |n−1 Sn(k) (t; β) − s0 (t)| = 0, lim lim sup (k) s0

n→∞ ε→0 t∈[0,τ ] β−β ≤ε 0 (k) = s0 (β0 ), for k = 0, 1, 2,

where and ε > 0 is a small real number. In a nonparametric model, this condition is replaced by the a.s. convergence (k) sup |n−1 Sn(k) (t; r) − s0 (t)| = 0, lim lim sup n→∞ ε→0 t∈[0,τ ] r−r ≤ε 0

where

(k) s0

= s(k) (r0 ), for k = 0, 1, 2.

Nonparametric Estimation of Intensities for Stochastic Processes

131

The previous Condition 6.1 are modified by the regression function. For expansions of the bias and the variance, they are now written as follows. Condition 6.3. (k)

(k)

(1) As n tends to infinity, the processes n−1 Sn (t; β) and n−1 Sn (t; r) are strictly positive with a probability tending to 1 and the function defined (k) by s(k) (t) = n−1 ESn (t) belongs to class C 2 (R+ ); (0) (2) The function pn (s) = Pr(Sn (s; r) > 0) belongs to class C 2 (R+ ) and pnτ(τ, r0 ) converges to 1 in probability; (3) 0 r(z)g −1 (s)λ(s) ds < ∞; (4) The functions λ and g belong to C 2 (R+ ) and r belongs to C s (Z). 6.3.1

Models with nonparametric regression functions

The regression function is estimated by rn,h (z) = arg maxrz Ln,h (z; r) where n    n (s; rz )} Kh (z−Zi (s)) dNi (s), Ln,h (z; r) = n−1 {log rz (s)+log ΔΛ i=1

Iτ,n,h

for t in In,h,τ . Its expectation Ln (z; r) = E Ln,h (z; r) is expanded as   n (s; rz )} Kh (z − Z(s))S (0) (s; rZ )λ(s) ds log{rz (s) ΔΛ Ln (z; r) = E n Iτ,n,h

 =E

2  n (s; rz )}{Sn(0) (s; rz ) + h κ2 Sn(2) (s; rz )} log{rz (s) ΔΛ 2 Iτ,n,h

× fZ(s) (z)λ(s) ds + o(h2 ), where fZ(s) is the marginal density of Z(s). It follows that rn (z) converges uniformly, in probability, to the value r0 (z) which maximizes the limit of n−1 Ln (z; rz )  1 L0 (z; rz ) = {log rz (s) + log ΔΛ(s; rz )}s(0) (s; rz )fZ(s) (z) λ(s) ds. −1

be the k-th order derivative of Ln (z; r) with respect to z, their Let (k) (k) limits are denoted L0 and their expectations Ln , for k = 1, 2. (k) Ln

1

rn,h − r) conProposition 6.7. Under Condition 6.3, the process (nh) 2 ( verges weakly to a Gaussian process with expectation zero and variance (1) (L(2) )−1 V(1) {L(2) }−1 (z; r) where V(1) = limn→∞ nhV arLn .

132

Functional Estimation for Density, Regression Models and Processes

Proof. The first derivative of Ln with respect to rz and its expectation n depend on the derivative of λ  (1) (t; rz ) = − Kh (t − s)Sn(1) (s; rz )Sn(0)−2 (s; rz ) dΛ(s), λ n In,h,τ



1  (1) ΔΛ n (s; rz ) Kh (z − Zi (s)) dNi (s), +  n (s; rz ) rz (s) ΔΛ i=1 In,h,τ  1   (1) 1 ΔΛ n (s; rz ) Kh (z − Z(s)) + =E  n (s; rz ) ΔΛ −1 rt,z (s)

−1 L(1) n = n

L(1) n

n  

×Sn(0) (s; rz ) fZ(s) (z)λ(s) ds, 1 (1) (1) is such that V arLn (z; r) = O((nh)−1 ), therefore (nh) 2 (Ln − L(1) )(z; r) is bounded in probability, and the second derivative is a Op (1). By a Taylor (1) expansion of Ln (z; r) in a neighborhood of the true value r0z ≡ r0 (z) of the regression function at z rn,h −r0 )2 (z(s))) L(1) (z; r)−L(1) (z; r0 ) = {(rz (0)−r0z (s)}T L(2) (z; r0 )+Op ((

n

n

and, by an inversion, the centered estimator ( rn,h − r0 )(z) is approxi(2) −1 (1)  mated by the variable {−Ln (z; r0 )} L )(z; r0 ) the variance of which is a O((nh)−1 ). For every z in Zn,h = {z ∈ Z; supz ∈∂Z z − z   ≥ h}, the 1 rn,h − r0 )(z) converges weakly to a Gaussian variable with variable (nh) 2 ( (1) (2) −1 variance (L ) limn nhn V arLn {L(2) }−1 (z; r0 ).  1 n,h − λn,h )(r0 )1I Proposition 6.8. The processes (nh) 2 (λ and τ,n,h  (1) 1 Sn n,h − λ) + (nh) 12 ( (nh) 2 (λ rn,h − r0 ) (s; rn,h )Kh (· − s) Jn (s) dNn (s)}, (0)3 Sn converge weakly to the same continuous and centered Gaussian process on Iτ , with covariances zero and variance function vλ = κ2 s(0)−1 (r0 )λ.

 n,h is The bias of Λ  T bΛ,n,h (t) = − E{Sn(0) (s, r)Sn(0)−1 (s, rn,h ) − 1} dΛ(s)

Proof.

 = 0

0

T

E{( rn,h − r)(s)

(1)

1

Sn (s, rn,h ) + Op (n− 2 ) (0)

Sn (s, r)S, rn,h )

} dΛ(s),

and it is also equivalent to the expectation of  T ( rn,h − r)(s)Sn(1) (s, rn,h ){Sn(0) (t, rn,h )}−3 dNn (s). 0

Nonparametric Estimation of Intensities for Stochastic Processes

133

n,h is obtained by smoothing the bias bΛ,n,h (t) The bias of the estimator λ and its first order approximation can be written as the expectation of  1 (1) Sn (s, rn,h ) bλ,n,h (t) = ( rn,h − r)(t) Kh (t − s) (0) dNn (s). −1 {Sn (t, rn,h )}3  6.3.2

Models with parametric regression functions T

In the exponential regression model rβ (z) = eβ z with observations (Xi , δi )1≤i≤n , let βn the estimator of the parameter β which maximizes n  n (Ti ; β)} . Ln (β) = n−1 δi {β T Zi (Ti ) + log λ i=1

Its expectation  Ln (β) =

In,h,τ

 {β T s(1) s(0) n (s; β0 ) + E log{λn (s; β)} n (s; β0 )}λ(s) ds

converges to L(β) =

τ 0

{β T s(1) (s; β0 ) + {log λ(s; β)}s(0) (s; β0 )}λ(s) ds.

Proposition 6.9. Under Condition 6.3, n 2 (βn,h − β0 ) converges weakly 1 n,h − to a Gaussian variable with expectation zero. The processes (nh) 2 (λ λn,h )(β0 ) 1Iτ,n,h and  (1) 1 Sn n,h − λ)(βn )+ n 12 (βn,h − β0 )T (nh) 2 (λ (s; βn,h )Kh (·− s) dNn (s), (0)2 Iτ,n,h Sn converge weakly to the same continuous and centered Gaussian process with covariances zero and variance function vλ = κ2 s(0)−1 (β0 )λ. 1

Proof. written

The derivatives with respect to β of the partial likelihood Ln are n

  (1) λ n (Ti ; β) , δi Zi (Ti ) + n (Ti ; β) λ i=1 λ  (1)⊗2 (2) λ n n −1 L(2) (Ti ; β), (β) = −n δ − i n 2 n λ λ

−1 L(1) n (β) = n

n

n,h with respect to β are written where the derivatives of λ  (1) Sn (1) (t; β) = −n−1 λ (s; β)Kh (t − s) dNn (s), n (0)2 Iτ,n,h Sn  (2)  S (1)⊗2 Sn  n (2) (t; β) = n−1 λ − 2 (s; β)Kh (t − s) dNn (s). n (0)3 (0)2 Iτ,n,h Sn Sn

134

Functional Estimation for Density, Regression Models and Processes

(k) As h tends to zero, the predictable compensators of Ln (β), k = 1, 2, develop as  (1) λ n (s; β) (0) −1 (1) S (s; β0 )}λ(s) ds, (β) = n {S (s; β ) + L(1) 0 n n n (s; β) n λ In,h,τ  λ  (1)⊗2 (2) λ n n −1 L(2) (s; β)Sn(0) (s; β0 )λ(s) ds. (β) = −n − n 2 n λ λ In,h,τ n

The expectation of n of Nn − N  λ(1) (t; β) = n

(1) λ n ,

k = 1, 2, is deduced from the martingale property (1)

Sn (s; β)

In,h,τ

(0)2 Sn (s; β)

(1)

=

Sn (t; β) (0)2 Sn (t; β)

+ 

Sn(0) (s; β0 )λ(s)Kh (t − s) ds,

Sn(0) (t; β0 )λ(t)

(1) (2) m2K h2 Sn (t; β) (0) (t; β )λ(t) + o(h2 ), S 0 n (0)2 2 Sn (t; β)

(2)  S (1)⊗2 Sn  n 2 (0)2 − (0) (s; β)Kh (t − s)λ(s) ds In,h,τ Sn Sn (2)   S (1)⊗2 Sn n = 2 (0)2 − (0) (t; β)λ(t) Sn Sn (1)⊗2 (2) (2) 2  m2K h Sn Sn  + 2 (0)2 − (0) (t; β)λ(t) + o(h2 ). 2 Sn Sn

λ(2) n (t; β) =

(1)

(2)

It follows that Ln (β) and Ln (β) converges to L(1) (t; β)  τ (1) (0)−2 L(1) (β) = {s(1) (s; β)s(0)2 n (s; β0 ) − sn (s; β)sn n (s; β0 )}λ(s) ds, 0

 τ  (1)⊗2 λ(2)  λ (s; β)s(0) − n (s; β0 )λ(s) ds, 2 λ λ 0  τ s(2) (s; β0 )λ(s) ds, L(2) (β0 ) = − L(2) (β) = −

0

(2)

where −L (β0 ) is positive definite and L(1) (t; β0 ) = 0 so that the maximum βn,h of Ln converges in probability to β0 , the maximum of the (1) limit L of Ln . The rate of convergence of βn,h − β0 is that of Ln (β0 ). 1 First n 2 (L(1) )n − L(1) )n )(β0 ) is the sum of stochastic integrals of predictable processes with respect to centered martingales and it convergences

135

Nonparametric Estimation of Intensities for Stochastic Processes

weakly to a centered Gaussian variable with variance v(1) =

 In,h,τ

{s(2) −

s(1)⊗2 s(0)−1 )(s; β0 )λ(s) ds. Secondly   1  1 (1) (1) 2 n 2 n−1 (Sn(1) − s(1) (s; β0 ) n (Ln − L )(β0 ) = In,h,τ 1

+ n2

λ (1) n

n λ

  n−1 Sn(0) − s(1) (s; β0 ) λ(s) ds,

1 (1) (1) is continuous and independent of n 2 (Ln − Ln )(β0 ), its integrand is a 1 (1) sum of three terms l1n + l2n + l3n where l1n = n 2 {n−1 (Sn − s(1) }(s; β0 ) convergences weakly to a centered Gaussian variable with a finite variance,

λ(1)  1 n,h −1 (0) 2 n Sn − s(1) (·; β0 ), l2n = n λn,h (1)

λ (1) λn,h  1 n,h (0) 2 (·; β0 ). l3n = n [Sn − n,h λn,h λ

The term l2n convergences weakly to a centered Gaussian variable with a finite variance and l3n has the same asymptotic distribution as 1 (1) − λ(1) ) − λ(1) λ−1 (λ n,h − λn,h )}(·; β0 ) n 2 s(0) λ−1 {(λ n,h

n,h

where the process 1 n,h − λn,h )(t; β0 ) (nh) 2 (λ  τ 1 n )(s) = n2 Sn(0)−1 (s; β0 )Kh (t − s) Yn (s) d(Nn − N 0

has the expectation zero and the variance h

 Iτ,n,h

(0)−1

Sn

(s; β0 )Kh2 (t −

s)λ(s) ds which converges in probability to vλ = κ2 s(0)−1 (t; β0 )λ(t). In the same way, the process  (1) 1 Sn (1) − λ(1) )(t; β0 ) = n 12 n )(s) (s; β0 )Kh (t − s) d(Nn − N n 2 (λ n,h n,h (0)2 In,h,τ Sn is consistent and it has the finite asymptotic variance vλ,(1) (t) = s(1)⊗2 s(0)−3 (t; β0 )λ(t). The term l3n with asymptotic variance zero converges in probability to zero. The proof of the weak convergence of βn ends as previously. The process 1 n,h − λ)(t; βn ) develops as (nh) 2 (λ  1 n )(s) n2 Jn (s)Sn(0)−1 (s; βn )Kh (t − s) d(Nn − N Iτ,n,h



+ Iτ,n,h

{Sn(0)−1 (s; βn ) − Sn(0)−1 (s; β0 )}Sn(0) (s; β0 )Kh (t − s)λ(s) ds

136

Functional Estimation for Density, Regression Models and Processes

the first term of the right-hand side converges weakly to a centered Gaussian process with variance κ2 s(0)−1 (t; β0 )λ(t) and covariances zero, and the second term is expanded into  1 T  2 Sn(1) (s; β0 )Sn(0)−1 (s; β0 )Kh (t − s)λ(s) ds − n (βn,h − β0 ) Iτ,n,h

= −n (βn,h − β0 )T s(1) (t; β0 )s(0)−1 (t; β0 )λ(t) + o(1). 1 2



The results are analogous for every parametric regression function rβ n (0) of C 2 , the processes are then defined by Sn (t; β) = i=1 rZi (t; β)1{Ti ≥t} (k) and Sn is its derivative of order k with respect to β n

r(1) (Zi (Ti )) λ (1) (s; β)   β n,h −1 dNn (s), L(1) + (β) = n δ i n  r (Z (T )) β i i λn (s; β) i=1 

r(1) (Zi (t) λ  (1) n (s; β) β + (β) = E L(1) n n (s; β) rβ (Zi (t) λ In,h,τ −1 L(2) n (β) = −n

n λ  (1)2 n δi 2 λ n

i=1

× Sn(0) (t; β0 )λ(s) ds,  (2) λ n (Ti ; β) . − n λ

All results of this section are extended to varying bandwidth estimators as before.

6.4

Histograms for intensity and regression functions

The histogram estimator (6.2) for the intensity λ with a parametric regression or (6.6), for nonparametric regression of the intensity, are consistent estimators as h tends to zero and n to infinity. Their expectations are approximated like in Section 6.2.2 by a ratio of expectations. Their bias and their variance are calculated by similar approximations. The nonparametric regression function r is estimated by  rn,h (zlh )1Dlh (z), rn,h (z) = l≤Lh

and the histogram estimator for the intensity defines the estimator rn,h of the regression function by  δi    n,h (Ti ; rlh ) , rn,h (zlh ) = arg max rl 1Dlh (Zn (Ti )) λ rl

n,h (t; r) = λ

 j≤Jh

Ti ≤τ

1Bjh (t)



l≤Lh



Jn (s) dNn (s) Bjh

Bjh

Sn(0) (s; r) ds

−1

.

Nonparametric Estimation of Intensities for Stochastic Processes

137

Proposition 6.10. Under Conditions 2.1, 6.1 and 6.3 with hn converging to zero and nh to infinity n,h (t) − λ(t)| converges to zero in probability. (a) supt∈In,h,τ |λ n,h is (b) For every t in In,h,τ , the bias λn,h (t) − λ(t) of the estimator λ s h msK λ(s) (t) + o(hs ), bλ,n,h (t; s) = s! denoted h2 bλ (t) + o(h2 ), its variance is n,h (t)} = (nh)−1 κ2 g −1 (t)λ(t) + o((nh)−1 ), V ar{λ also denoted (nh)−1 σλ2 (t) + o((nh)−1 ). n (0) Proof. For every t in Bj,h , let Sn (t; r) = i=1 rZi (t)Yi (t), the limit  (0) of n−1 Sn (t; r) is s(0) (t; r) = r Pr(Z(t) ∈ Dlh ) + o(1) and its l≤Lh zlh variance is  rz2lh [Pr(Z(t) ∈ Dlh )g(t)−{Pr(Z(t) ∈ Dlh )g(t)}2 ]+o(1) v (0) (t; r) = n−1 l≤Lh

under the conditions. n,h (t; zlh ) is The expectation of λ  s(0) (s; rlh )λ(s) ds B + O((nh)−1 ) = λ(aj,h ) + o(h) λn,h (t; rlh ) = j,h (0) (s; r ) ds s lh Bj,h and its bias is bλ,h (t; rlh ) = hλ(1) (t) + o(h), uniformly on Iτ,n,h . Its variance is

n,h (t) − vn,h (t; rlh ) = E λ

 Bj,h

s(0) (s; rlh )λ(s) ds 2

+ O((nh)−1 ) s(0) (s; rlh ) ds 

−2  V ar{(nh)−1 = s(0) (t; rlh ) Jn (s) dNn } 

Bj,h

− 2λ(t)Cov{(nh)−1

Bj,h



Jn dNn , (nh)−1

Bj,h

2

−1



Sn(0) (s; rlh ) ds}

+ o((nh)−1 ) + o(h),

Bj,h

where, for t in Bj,h and Z(t) in Dlh  

V ar (nh)−1 Jn dNn = (nh)−1 s(0) (t; rlh )λ(t) + o((nh)−1 ) Bj,h

−1



V ar (nh)

−1



 Sn(0) (s; rlh ) ds = (nh)−1 v (0) (t; rlh ) + o((nh)−1 ),

Bj,h

Cov (nh)

−1



Jn dNn , (nh) Bj,h

Sn(0) (s; rlh ) ds



Bj,h −1 (0)

= (nh)

v



Bj,h

Sn(0) (s; rlh ) ds}

+ λ (t)V ar{(nh)



(t; rlh )λ(t) + o((nh)−1 ),

138

Functional Estimation for Density, Regression Models and Processes

therefore vn,h (t) = (nh)−1 vλ (t) + o((nh)−1 ) with vλ (t) = s(0)−2 (s; rlh ){s(0) (t; rlh )λ(t) − v (0) (t; rlh )λ2 (t)}.



Proposition 6.11. Under the conditions of Proposition 6.10, the process 1 n,h − λn,h ) converges weakly to a centered Gaussian process with (nh) 2 (λ variance vλ . n,h (t) is still minimal The asymptotic mean squared error of the estimator λ 1 1 1 for the bandwidth hn (t) = n− 3 {2b2 (t)}− 3 v 3 (t). The stepwise constant λ

λ

estimator of the nonparametric regression function r maximizes ⎧ ⎫  ⎨ ⎬ log rl 1Dlh (Zn (s)) Jn (s) dNn (s) Ln,h (r) = ⎩ ⎭ l≤Lh  nh (s; rlh ) Jn (s) dNn (s), + λ (1)

(1)

r1h , . . . , rLh ,h ) = 0, where Ln,h is a vector with comand it satisfies Ln,h ( ponents the derivatives with respect to the components of rh = (rlh )l≤Lh ⎧ ⎫  ⎨ ⎬ 1 (1) Ln,h,l (rh ) = 1Dlh (Zn (s)) Jn (s) dNn (s) ⎩ ⎭ rlh l≤Lh  (1) (s; rlh ) Jn (s) dNn (s). + λ n,h,l The derivatives of the intensity are consistently estimated by differences of values of the histogram, in the same way as the derivatives of a density. (1) is a O((nh3 )−1 ) and the estimator of the regression The variance of λ n,h function converges with that rate. In the parametric regression model, the histogram estimator for the function λ and the related estimator of the regression parameter have the (0) same form (6.5), where the function r and the process Sn are indexed by the parameter β. Let t in Bj,h   −1  n,h (t; β) = 1Bjh (t) Jn (s) dNn (s) Sn(0) (s; β) ds , λ j≤Jh

βn,h = arg max β



Bjh

Bjh

n,h (Ti ; β)}δi . {rZi (Ti ; β)λ

Ti ≤τ

n,h are obtained by deriving Sn(0) with respect to β The derivative of λ  (1)    S (s; β) ds Bjh n (1)  1Bjh (t) Jn (s) dNn (s)  λn,h (s; β) = − (0) Bjh [ Bjh Sn (s; β) ds]2 j≤Jh

Nonparametric Estimation of Intensities for Stochastic Processes

139

and the derivative of the logarithm of the partial likelihood for β is  (1) n   rZi (1) (1) (s; β) dNn (s). (s; β) dNi (s) + n−1 λ Ln,h (β) = n−1 nh r Zi i=1 (1) . Therefore, It is zero at βn,h and its convergence rate is a O((nh3 )), like λ nh βn,h has the convergence rate O((nh3 )) and the estimator of the hazard function has the convergence rate O((nh)).

6.5

Estimation of the density of duration excess

For the indicator NT of a time variable T , the probability of excess is Pt (t + x) = Pr(T > t + x | T > t) = 1 − Pr(t < T ≤ t + x | T > t) = 1 − Pr{NT (t + x) − NT (t) = 1 | NT (t) = 0}. For a sample of n independent and identically distributed variables (Ti )i≤n ,  the processes n−1 Nn (t) = n−1 ni=1 1Ti ≤t and n−1 Yn (t− ) = 1 − n−1 Nn (t) converge respectively to the functions F (t) and F¯ (t), and Pt (t + x) is estimated by Nn (t + x) − Nn (t) Pn,t (t + x) = 1 − 1{Nn (t)0} {F¯Y |X (s; x)}−1 FY |X (ds; x) −∞  y = 1{B(s;x)>0} {B(s; x)}−1 A(ds; x). −∞

Let FYT (y) = P (T ≤ Y ≤ y) be the truncated distribution function of Y and let F¯YT (y) = P (T ≤ y ≤ Y ) be the associated truncated survival function of Y , their empirical estimators are FYT n (y) = n−1

n  i=1

T

1{Ti ≤Yi ≤y} ,

¯ (y) = n−1 F Yn

n 

1{Ti ≤y≤Yi } .

i=1

The conditional sub-distribution functions of the observed variables are H1 (y; x) = P (T ≤ Y ≤ y | X = x), H2 (y; x) = P (T ≤ y ≤ Y | X = x), they have the consistent kernel estimators n FYT |X,n,h (y; x) Kh (x − Xi )1{T ≤Y ≤y}  H1n,h (y; x) = i=1n , = fX,n,h (x) i=1 Kh (x − Xi ) n ¯ T F Y |X,n,h (y; x) i=1 Kh (x − Xi )1{T ≤y≤Y }  n H2n,h (y; x) = . = fX,n,h (x) i=1 Kh (x − Xi ) The conditional probability α(x) has the kernel estimator n Kh (x − Xi )1{T ≤Y } n α n,h (x) = i=1 i=1 Kh (x − Xi )

146

Functional Estimation for Density, Regression Models and Processes

and estimators of the conditional distribution functions A and B are the ratios n  1n,h (y; x) H i=1 Kh (x − Xi )1{T ≤Y ≤y}  An,h (y; x) = =  , n α n,h (x) i=1 Kh (x − Xi )1{T ≤Y } n  2n,h (y; x) H i=1 Kh (x − Xi )1{T ≤y≤Y }  Bn,h (y; x) = =  . n α n,h (x) i=1 Kh (x − Xi )1{T ≤Y } A kernel estimator of the hazard function of Y conditionally on X is deduced from (6.14) and (6.15)  y  n,h (ds; x) n,h (s; x)}−1 A ΛY |X,n,h (y; x) = 1{Bn,h (s;x)>0} {B −∞



y

=

1 −∞

 ¯T {F

Y |X,n,h (s;x)>0}

dFYT |X,n,h (s; x) T

¯ F Y |X,n,h (s; x)

,

and a product-limit estimator of FY |X (y; x) is  . /  Y |X,n,h (Yi ; x) . FY |X,n,h (y; x) = 1 − ΔΛ Yi ≤y

 1n,h and H  2n,h have the same form as kernel estiThe estimators α n,h , H mators of regression functions and their asymptotic properties are deduced n,h and Λ  Y |X,n,h are ratios n,h , A from Proposition 3.1. The estimators A of these estimators so their bias and variance have the same order and the same optimal rate of convergence. Proposition 6.12. Under Conditions 2.1, 2.2 and 3.1(1) form the funcs n,h − tions f , A and B of C s (IX ) and for the kernel, the processes n 2s+1 (A s 1 1  2s+1 2 2 (Bn − B − γ bB ) converge weakly to centered Gaussian A− γ bA ) and n processes on every subset of their support where the function α is strictly positive. Let I be a bounded subset of the support of ΛY |X where the function B is strictly positive.  nY |X a.s. Proposition 6.13. Under the same conditions, the estimator Λ s  2s+1 is uniformly consistent on I and the process n (ΛnY |X −ΛY |X ) converges weakly to a Gaussian process on I. The estimator FnY |X is a.s. uniformly s consistent on I and the process n 2s+1 (FnY |X − FY |X ) converges weakly to a Gaussian process on I.

Nonparametric Estimation of Intensities for Stochastic Processes

6.8

147

Conditional intensity under left-truncation and right-censoring

Let (X, Y ) be random variables such that Y is left-truncated by an unobserved variable T and right-censored by a variable C, such that T , C and (X, Y ) are independent. The notations α (6.13) and those of the joint and marginal distribution function of X, Y and T are in like in Section 6.7 and FC is the distribution function of C. The observations are δ = 1{Y ≤C} , and Y ∧ C, conditionally on Y ∧ C ≥ T . The conditional sub-distributions of the observed variables are A(y; x) = P (Y ≤ y ∧ C|X = x, T ≤ Y )  = α−1 (x) 1{v≤y} FT (v)F¯C (v) FY |X (dv; x), B(y; x) = P (T ≤ y ≤ Y ∧ C|X = x, T ≤ Y ) = α−1 (x)FT (y)F¯C (y)F¯Y |X (y; x). The conditional hazard function ΛY |X is defined for every y < τ , such that F¯Y |X (τ ; x)FT (τ )F¯C (τ ) > 0 as  y F¯ −1 (v; x) FY |X (dv; x) ΛY |X (y; x) =  =

τ1

Y |X

1{v≤y} 1{B(v;x)>0} B −1 (v; x)A(dv; x),

and the conditional survival function is F¯Y |X (y; x) = exp{−ΛY |X (y; x)}. The conditional sub-distribution functions H1 (y; x) = P (T ≤ Y ≤ y ∧ C | X = x), H2 (y; x) = P (T ≤ y ≤ Y ∧ C | X = x), and the function α(x) have the consistent estimators n FYT|X,n,h (y; x) Kh (x − Xi )1{T ≤Y ≤y} = α n (y; x) = i=1n , fX,n,h (x) i=1 Kh (x − Xi ) n Kh (x − Xi )1{T ≤Y ≤y∧C}  , H1n (y; x) = i=1 n i=1 Kh (x − Xi ) n h (x − Xi )1{T ≤y≤Y ∧C}  2n (y; x) = i=1 K n H . i=1 Kh (x − Xi ) The conditional sub-distributions A and B are consistently estimated by 1n (y; x), n (y; x) = α −1 (x)H A n

n (y; x) = α  B −1 n (x)H2n (y; x),

148

Functional Estimation for Density, Regression Models and Processes

 1n,h and H  2n,h still they are uniformly consistent. The estimators α n,h , H have the same form as kernel estimators of regression functions and their asymptotic properties are deduced from Proposition 3.1. The estimators n,h and Λ  Y |X,n,h are ratios of these estimators, their bias and varin,h , A A ance have the same order as the previous estimators and the same optimal rate of convergence. Proposition 6.14. Under Conditions 2.1, 2.2 and 3.1(1) form the funcs s n −A) and n 2s+1 n −B) (B tions f , A and B of C s (IX ) the processes n 2s+1 (A converge weakly to Gaussian processes on every subset of their support where the function α is strictly positive.  nY |X , and respectively FnY |X , of the conditional hazard The estimators Λ and distribution functions, have the same form as in Section 6.7 

n (dy; x) A , n (s; x) B −∞ δi   nY |X (Yi ; x) FnY |X (y; x) = 1 − ΔΛ .

 nY |X (y; x) = Λ

y

1{Bn (s;x)>0}

Yi ≤y

Let I be a bounded subset of the support of ΛY |X where the function α is strictly positive. Proposition 6.15. Under Conditions 2.1, 2.2 and 3.1(1) form the func nY |X and FnY |X are tions f , A and B of C s (IX ), the estimators Λ s  nY |X − ΛY |X ) and a.s. uniformly consistent on I, the process n 2s+1 (Λ s n 2s+1 (FnY |X − FY |X ) converge weakly to Gaussian processes on I. 6.9

Models with varying intensity or regression coefficients

More complex models are required to describe the distribution of event times when the conditions may change in time or according to the value of a variable. Pons (1999, 2002) presented results for two extensions of the classical exponential regression model for an intensity involving a nonparametric baseline hazard function and a regression on a p-dimensional process Z (1) a model for the duration X = T 0 − S of a phenomenon starting at S and ending at T 0 , with a nonstationary baseline hazard depending

Nonparametric Estimation of Intensities for Stochastic Processes

149

nonparametrically on the time S at which the observed phenomenon starts λX|S,Z (x | S, Z) = λX|S (x; S) eβ

T

Z(S+x)

,

(6.16)

(2) a model where the regression coefficients are smooth functions of an observed variable X λ(t | X, Z) = λ(t)eβ(X)

T

Z(t)

.

(6.17)

 n of β and of the The asymptotic properties of the estimators βn and Λ cumulative baseline hazard function follow the classical lines but the kernel smoothing of the likelihood requires modifications. In model (6.16), the time T 0 may be right-censored at a random time C independent of (S, T 0 ) conditionally on Z and non informative for the parameters β and λX,S , and that S is uncensored. We observe a sample (Si , Ti , δi , Zi )1≤i≤n drawn from the distribution of (S, T, δ, Z), where T = T 0 ∧ C and δ = 1{T 0 ≤C} is the censoring indicator. The data are observed on a finite time interval [0, τ ] strictly included in the support of the distributions of the variables S, T 0 and C, and (S, T 0 ) belongs to the triangle Iτ = {(s, x) ∈ [0, τ ] × [0, τ ]; s + x ≤ τ }. For Si in a neighborhood of s, the baseline hazard λX|S (·; Si ) is approximated by λX|S (·; s), which yields a local log-likelihood at s ∈ [hn , τ − hn ], defined as  ln (s) = Khn (s − Si )δi {log λX|S (Xi ; Si ) + β T Zi (Ti )} i



 0

τ

Yi (y) exp{β T Zi (Si + y)}λX|S (y; Si ) dy.

i = Xi ∧ (Ci − Si ) Let X In,τ = {(s, x); s ∈ [hn , τ − hn ], x ∈ [0, τ − s]} , Yi (x) = 1{T 0 ∧Ci ≥Si +x} = 1{Xi ≥x} , i  Khn (s − Sj )Yj (x) exp{β T Zj (Sj + x)}. Sn(0) (x; s, β) = n−1 j

x The estimator of Λ0,X|S (x; s) = 0 λ0,X|S (y; s) dy is defined for (s, x) ∈ In,τ ˆ n,X|S (x; s) = Λ ˆ n,X|S (x; s, βˆn ) with by Λ  Khn (s − Si )1{Si ≤Ci ,Xi ≤x∧(Ci −Si )} ˆ n,X|S (x; s, β) = Λ . (0) nSn (Xi ; s, β) i The estimator βn of the regression coefficient maximizes the following partial likelihood ln (β) =

 i

  δi β T Zi (Ti0 ) − log{nSn(0) (Xi ; Si , β)} εn (Si ),

150

Functional Estimation for Density, Regression Models and Processes

where εn (s) = 1[hn ,τ −hn ] (s). The bandwidth h is supposed to con1 verge to zero, with nh2 tends to infinity and h = o(n− 4 ) as n tends to infinity, the other conditions are precised by Pons (2002). 1 The variable n 2 (βn − β0 ) converges weakly to a Gaussian variable N (0, I −1 (β0 )) where the variance I −1 (β0 ), defined as the inverse of the limit of the second derivative of the partial likelihood ln , is the minimal variance for a regular estimator of β0 . The weak convergence of the estimated cumulative hazard function defined along the current time and the duration elapsed between two events relies on the bivariate empirical processes   n (s, x) = n−1 δi 1{Si ≤s} 1{Xi ≤x} , H i

¯ n(0) (s, x) = n−1 W



eβ0 Zi (Si +x) 1{Si ≤s} 1{Xi ≥x} , T

i

n = n (W ¯ n(0) − W (0) , H  n − H) 1In,τ , B 1 2

n converges under boundedness and regularity conditions, the process B (0) weakly to a Gaussian limit. With functions λ and s in class C 2 , the bias  n,X|S is a O(h2 ), thus the optimal bandwidth minimizing of the estimator Λ 1 the asymptotic mean squared error of Λ is O(n− 5 ) and it is still written in terms of the squared bias and the variance of the estimator. If the regressor Z is a bounded variable, then there exists a sequence of centered Gaussian 1 n − Bn In,τ = op (hn2 ). This property processes Bn on In,τ such that B 1  n,X|S −Λ0,X|S )1{I } implies the weak convergence of the process (nhn ) 2 (Λ n,τ to a centered Gaussian process.  n only involves kernel terms through the regression In model (6.17), Λ   n have the same nonparametric rate of converfunctions but both βn and Λ gence. In Pons (1999), the estimator βn,h (x) was defined as the value of β which maximizes  δi Khn (x − Xi )[{β(Xi )}T Zi (Ti ), (6.18) ln,x (β) = i≤n

− log{



Khn (x − Xj )Yj (Ti )e{β(Xi )}

T

Zj (Ti )

}],

(6.19)

j≤n

where Yi (t) = 1{Ti ≥t} is the risk indicator for individual i at t. Let  (0) β(Xi )T Zi (t) Sn (t, β) = , an estimator of the integrated baseline i Yi (t)e  t (0)−1  hazard function is Λn (t) = 0 Sn (s, βn,h ) dNn (s). For every x in IX,h ,

Nonparametric Estimation of Intensities for Stochastic Processes

151

the process n−1 ln,x converges uniformly to  τ (β − β0 )(x)T s(1) (t, β0 (x), x) lx (β) = 0

− s(0) (t, β0 (x), x) log

s(0) (t, β(x), x) dΛ0 (t), s(0) (t, β0 (x), x)

which is maximum at β0 hence βn,h (x) = arg max ln,x (x) converges to β0 (x). Let Un,h (·, x) and In,h (·, x) be the first two derivatives of the process ln,x with respect to β at fixed x in IX,h , the estimator of β(x) satisfies Un,h (βn,h (x), x) = 0 and In,h (x) ≤ 0 converges uniformly to a limit I(x). By a Taylor expansion Un,h (β0 (x), x) = (βn,h (x) − β0 (x))T {I(β0 , x) + o(1)} and (βn,h (x) − β0 (x)) = {In,h (β0 , x) + o(1)}}−1 Un,h (β0 (x). 1

Under the assumptions that the bandwidth is a O(n− 5 ) and the function by β belongs to the class C 2 (IX ), the bias of βn,h (x) is approximated τ I −1 (β0 , x)h2 u(x) where u(x) has the form u(x) = m22K 0 φ(t, x) dΛ0 (t) and its variance is (nh)−1 κ2 I −1 (β0 , x) + o((nh)−1 ). The asymptotic mean integrated squared error  βn,h (x) − β0 (x)w(x) dx, AM ISEw (h) = E Xn,h

for βn,h (x) is therefore minimal for the bandwidth  κ2 Xn,h I −1 (β0 , x)w(x) dx 1 hn,opt = n− 5  , u(x)I −1 (β0 , x)w(x) dx Xn,h 2

and the error AM ISEw (hn,opt ) has the order n− 5 . The limiting distributions of the estimators are now expressed in the 1 (0) (0) following proposition. Let Gn = (n−1 h) 2 {Sn (βn,h ) − Sn (β0 )}. 1 Proposition 6.16. For every x in IX,n,h , the variable (nhn ) 2 (βn,h −β0 )(x) converges weakly to a Gaussian variable N (0, γ2 (K)I0−1 (x)). 1  n − Λ0 ) converges weakly to the Gaussian proThe process (nhn ) 2 (Λ  (0) · cess − 0 G(t){ s (t, y) dy}−2 dΛ0 (t), where the process G is the limiting distribution of Gn . 1

The rate (nhn ) 2 for the estimator of Λ comes from the variance  t convergence (0) (0)−2  n (t) − Λ0 (t) developed by a first (s, βn,h ) dΛ0 (s) of Λ E 0 Sn (s, β0 )Sn order Taylor expansion.

152

Functional Estimation for Density, Regression Models and Processes

6.10

Estimation in nonparametric frailty models

In frailty models with proportional hazards, the conditional hazard function of the time variable Tij for the jth individual of the ith group depends on the vector of observed covariates processes Zij of Rp and on a random variable ui of group effect, shared by all individuals of the group and depending on covariates omitted in the model. A nonparametric hazard function of the variable Tij is defined as λij (t, ui , Zij ) = ui λi (t)ri (Zij (t)),

i = 1, . . . , I, j = 1, . . . , ni .

(6.20)

I The total sample size is n = i=1 ni and the ratios n−1 ni converges to strictly positive limits pi as n tends to infinity, for i = 1, . . . , I. The conditional survival function of the variable Tij is

 t F¯i (t | ui , Zij ) = exp −ui λij (s, Zij ) ds = F¯iui (t, Zij ).

(6.21)

0

The observations under independent right censoring are the independent time events Xij = Tij ∧ Cij and the censoring indicators δij = 1{Xij ≤Cij } , where Tij have the conditional distribution functions Fij and Cij are the censoring variables. Let Yij (t) = 1{Xij ≤t} , the counting process Nn,i (t) = ni conditional on j=1 δij 1{Tij ≤t} has the predictable compensator, (ui , (Zij )j=1,...,ni ) n,i (t, ui ) = N

ni   j=1



τ

0

Yij dΛij (t, ui , Zij (t)) = ui

τ 0

(0)

Sn,i (t)λi (t) dt,

(6.22)

ni (0) Yij (t)ri (Zij (t)). Parametric distributions of the variwhere Sn,i (t) = j=1 able ui have been used for the frailty model with a parametric Cox model, usually a Gamma distribution Γ(α, γ) with density fα,γ (u) =

γ α α−1 −γu u e , u > 0, Γ(α)

α > 0 and γ > 0, its expectation is μ = γ −1 α and its variance is σ 2 = γ −2 α. Then the conditional survival function of the variables Tij is F¯i (t, α, γ | Zij ) =

γα , {γ + exp{Λij (t, Zij )}α

153

Nonparametric Estimation of Intensities for Stochastic Processes

and the likelihood of the observations is  ∞ ni I  {uλij (Tij , Zij (Tij ))}δij F¯iu (t | Zij ) fα,γ (u) du Ln (α, γ, Λi ) = 0

=

i=1 j=1

ni I  

δ λijij (Tij , Zij (Tij ))

i=1 j=1

γα  = Γ(α) i=1 I

α







∞ 0

uNi e−uΛni fα,γ (u) du

uα+Nn,i −1 e−u(γ+Λni ) du

0

I 

ni 

δ

λijij (Tij , Zij (Tij ))

j=1 ni 

Γ(α + Nn,i ) γ δ λi ij (Tij , Zij (Tij )), α+N n,i Γ(α) i=1 (γ + Λn,i ) j=1 ni  τ where Λn,i = j=1 0 Yij (t)λi (t, Zij (t)) dt. Maximum likelihood estimators of the parameters are solutions of the score equation =

I  α n + Nn,i α n = n−1 . γ n γ + Λn,i  i=1 n

(6.23)

Considering Nn,i as an unbiased estimator of Λn,i , (6.23) implies the following convergence. Proposition 6.17. The maximum likelihood estimators of the parameters α and γ in bounded sets satisfy α n −  γn = op (1) as n tends to infinity. For the estimation of the parameter α, the derivative of the partial loglikelihood reduces to Γ (α)  Γ (α + Nn,i )  ln (α) = log α + 1 − + Γ(α) Γ(α + Nn,i ) i   α + Nn,i − + log(α + Λn,i ) . α + Λn,i i By Stirling’s formula, the derivative of log Γ(α + Nn,i ) with respect to α is approximated by log(α + Nn,i ) as n tends to infinity, then Γ (α)  α + Nn,i  α + Nn,i  + ln (α) = log α + 1 − log − + op (1) Γ(α) α + Λn,i α + Λn,i i i = log α + 1 −

Γ (α) + op (1), Γ(α)

and a maximum likelihood estimator of the parameter α is solution of the equation  ln (α) = 0.

154

Functional Estimation for Density, Regression Models and Processes

The predictable compensator of Ni , conditional on (Zij )j=1,...,ni is the expectation  with respect to the distribution of the frailty variable of (6.22), n,i (t) = τ S (0) dΛi and a consistent kernel estimator of the derivative λi N n,i 0 of Λi is deduced as  τ (0)−1 i,n,h (t) = λ Jn,i (s)Sn,i (s)Kn (t − s) dNn,i (s), 0

with Jn,i (s) = 1{S (0) (s)>0} . This estimators is similar to (6.4) and it has n,i

the properties of Proposition 6.1 as ni tends to infinity. The functions ri are estimated by maximization of the likelihood of each i,n,h ) so they αn,h , γ n,h , λ group, they are defined from the expression of Ln ( are proportional to Ln,i =

ni 

n,i (Tij )}δij , {ri (Zij (Tij ))λ

j=1

ri,n,h (z) = arg max rz

ni   j=1

T

0

Kh2 (z − Zij (s)){log rz (s)

i,n,h (s)} dNij (s). + log λ The results of Section 6.3 for the nonparametric regressions apply to the s estimators ri,n,h , they have the optimum convergence rate n 2s+1 if the functions ri belong to C 2 (IZ ) and they converge weakly to Gaussian variables (Proposition 6.3), they are mutually independent.

6.11

Bivariate hazard functions

Let (S, T ) be a pair of failure times with a distribution function FS,T and a density fS,T , and let F¯S,T (s, t) = FS,T (s, t) − F¯S (s) − F¯T (t) + 1 be the bivariate survival function of (S, T ). They determine the partial hazard functions λS|T ≥y (x) = lim h−1 P (x ≤ S < x + h | S ≥ x, T ≥ y), h→0

λT |S≥x (y) = lim h−1 P (y ≤ T < y + h | S ≥ x, T ≥ y), h→0

f (x, y) . λS,T (x, y) = ¯ F (x, y) When S and T are independent, the functions λS|T ≥y (x) and respectively λT |S≥x (y) reduce to the first and second marginal hazard functions λS (x)

Nonparametric Estimation of Intensities for Stochastic Processes

155

and respectively λT (y), and the function λS,T (x, y) is their product. The marginal survival functions are F¯ (x, 0) = F¯X (x) and F¯ (0, y) = F¯Y (y), the conditional hazard function λS|T ≥0 (x) is the marginal hazard function λS (x) for S, in the same way the conditional hazard function λT |S≥0 (y) is the marginal hazard function λT (x) for T . A bivariate survival function is the unique solution of the equation  s t F¯S,T (x, y)ΛS,T (dx, dy) FS,T (s, t) = 0

ss

ss

0

where ΛS,T (s, t) = 0 0 = 0 0 λS,T (x, y) dx dy is their joint cumulative hazard function. Let (Xi , Yi )i=1,...,n be a sample of n independent and identically dis¯ such that tributed censored variables with a survival function F¯X,Y G Xi = Si ∧ C1i and Yi = Ti ∧ C2i with a bivariate censoring variable C = (C1 , C2 ), and let δ1i = 1{Si ≤C1i } and δ2i = 1{Ti ≤C2i } be the indicators of censorship. The two-dimensional observations generate the processes ¯1n (x, y) = N ¯2n (x, y) = N ¯n (x, y) = N Yn (x, y) =

n  i=1 n  i=1 n  i=1 n 

δ1i 1{Xi ≤x,Yi ≥y} , δ2i 1{Xi ≥x,Yi ≤y} , δ1i δ2i 1{Xi ≤x,Yi ≤y} , 1{Xi ≥x,Yi ≥y} .

i=1

The conditional cumulative hazard functions  x  λS|T ≥y (s) ds, ΛT |S≥x (y) = ΛS|T ≥y (x) = 0

y 0

λT |S≥x (t) dt,

and the bivariate hazard function ΛS,T are estimated from the censored observations as  x ¯1n (dx, y),  n,S|T ≥y (x) = Λ Yn−1 (x, y)1{Yn (x,y)>0} N 0  y ¯2n (x, dy),  Λn,T |S≥x (y) = Yn−1 (x, y)1{Yn (x,y)>0} N 0  x y  ¯n (ds, dt). Λn,ST (x, y) = Yn−1 (s, t)1{Yn (s,t)>0} N 0

0

156

Functional Estimation for Density, Regression Models and Processes

¯ ) > 0, the process On every subset [0, τ ] of R2+ such that F¯ (τ )G(τ  n,ST − ΛST is a centered weak martingale so Λ  n,ST is P -uniformly conΛ sistent estimator of ΛST (Pons, 1986).  ¯ −1 dF finite, the proProposition 6.18. Under the condition [0,τ ] F¯ −2 G 1  n,ST − ΛST ) converges weakly on [0, τ ] to a centered Gaussian cess n− 2 (Λ process. n,h,T |S≥y of n,h,S|T ≥y of λS|T ≥y and, respectively λ Kernel estimators λ  n,S|T ≥y (x) and, respectively λT |S≥x (y), are obtained by smoothing Λ  n,T |S≥x (y), with a kernel on a real interval, the estimators are asympΛ n defined by (6.1). totically normal with the same convergence rate as λ  n,ST (x, y) with a An estimator of λS,T (x, y) is obtained by smoothing Λ bivariate kernel. Let  ¯n (du, dv) N  . (6.24) Jn (u, v)Kh (s − u)Kh (t − v) λn,h,ST (x, y) = Yn (u, v) [0,τ ] where Jn (u, v) = 1{Yn (u,v)>0} and let λn,h,ST (x, y) denote its expectation. n,ST is P -uniformly consistent on every subset of R2 where The estimator λ + ¯ is strictly positive. Let λST belong to C 2 ([0, τ ]), then its ¯ the process FST G bias is m2 (2) (2) bn,h,ST (x, y) = h2 2K {λxx,ST (x, y) + λyy,ST (x, y)} + o(h2 ), 2 and its variance is n,h,ST (x, y)} = (nh2 )−1 κ2 λST (x, y) + o((nh2 )−1 ). V ar{λ n,h,ST is The optimal rate of convergence which minimizes the AMSE of λ − 16 therefore a O(n ).  ¯ −1 dF finite, the Proposition 6.19. Under Condition 2.1 and [0,τ ] F¯ −2 G 1 n,h,ST −λn,h,ST )(x, y) converges weakly on [0, τ ] to a cenprocess (nh2 )− 2 (λ xy ¯ t)−1 ΛS,T (ds, dt). F¯ (s, t)−1 G(s, tered Gaussian process with variance 0

0

A bivariate proportional hazards model is defined by marginal models with T multiplicative intensities λk (t, Zk,i (t)) = λ0k (t)eβk Zk,i (t) Yk,i (t) for the i-th marginal time variable Tk,i , k = 1, 2, and by a bivariate multiplicative intensity λS,T (t, Zi (t)) = λ0,S,T (t)eβ

T

Zi (t)

Yi (t)

Nonparametric Estimation of Intensities for Stochastic Processes

157

T T T T for the i-th time variable Ti , where Zi = (Z1,i , Z2,i , Z3,i ) . For independent times T1,i and T2,i , the values of parameter β related to Z3,i is zero and λ0,S,T = λ01 λ02 . The estimators are defined like previously for the Cox model. The bivariate case extends to the d-dimensional case by considering all hazard functions of k time variables conditionally on the d − k other times, for k = 1, . . . , d − 1.

6.12

Progressive censoring of a random time sequence

Let (Ti )i=1,...,n be a sequence of independent random time variables and (Tj )j=1,...,m be an independent sequence of independent random censoring times such that a random number Rj of variables Ti are censored at Cj and m j=1 Rj = n. Then the censored variables Xi,j = Ti ∧ Cj are no longer independent, only m sets of variables are independent. Let Nn,m (t) =

Rj m  

1{Ti ≤t∧Cj } , Yn,m (t) =

j=1 i=1

Rj m  

1{Xi,j ≥t} .

j=1 i=1

Let F be the common distribution function of the variables Ti and Gj be the distribution function of Cj , the intensity of the point process Nn,m is still written λn,m (t) = λYn,m with λ = F¯ −1 f . Conditionally on the censoring number R = (R1 , . . . , Rm ), the expectations of Nn,m (t) and Yn,m (t) are  t m  ¯ j dF, G E{Nn,m (t) | R} = Rj E{Yn,m (t) | R} =

j=1 m 

0

¯ j (t)F¯ (t). Rj G

j=1

Let μR = lim Rj for j = 1, . . . , m, and Jn,m (t) = 1{Yn,m (t)>0} , the estimator of the cumulative hazard function Λ and its derivative λ are  t −1  n,m (t) = Λ Yn,m Jn,m dNn,m , 0  n,m (t) = Kh (t − s) dΛ  n,m (s). λ Assuming that there exists an uniform limit for the mean survival func−1 ¯ ¯ = limm→∞ m−1 m G Yn,m converges unition G j=1 j , the process n ¯ ¯ formly to its expectation μY (t) = μR G(t)F (t), and n−1 Nn,m converges

158

Functional Estimation for Density, Regression Models and Processes

t n,m are asymptoti¯ dF . The estimators Λ  n,m and λ uniformly to μR 0 G 1  n,m − Λ)(t) callyunbiased and uniformly consistent. The variance of n 2 (Λ  t t −1 −1 is E 0 nYn,m Jn,m dΛ and it converges to vΛ (t) = 0 μY dΛ. The process 1  n,m − Λ) converges weakly to a centered Gaussian processes with inden 2 (Λ pendent increments and variance vΛ . n,m (t) is (nh)−1 κ2 v (1) (t) + o((nh)−1 ) The variance of the estimator λ Λ and its asymptotic covariances are zero, its bias is a O(h2 ) hence it has the 1 2 n,m −λ) optimal convergence rate h = O(n 5 ) if λ is C 2 . The process(nh) 5 (λ (1) converges weakly to Gaussian process with variance κ2 vΛ and covariances zero. All results for multiplicative regression models with independent censoring times apply to this progressive random censoring scheme. With nonrandom numbers Rj , the necessary condition for the convergence of the m ¯j . processes is the uniform convergence of n−1 j=1 Rj G 6.13

Model with periodic baseline intensity

 Let N (t) = i≥1 1{Ti ≤T } be a Poisson process with a periodic intensity function λ on R+ , with period τ , and for t in [0, τ ], and let ¯n (t) = n−1 N

n−1 

{N (t + kτ ) − N (kτ )}

k=0

be the expectation of the process N on n periods. The intensity λ has the consistent estimator  τ  ¯n (s), t ∈ [0, τ ]. λn,h (t) = Kh (t − s) dN 0

Denoting (Sj )j≥1 in [0, τ ] the values modulo τ of the time variables Ti on [0, nτ ], the estimator is written n,h (t) = n−1 λ

n−1  k=0

its expectation is λh (t) =

τ 0

Kh (t − Si )1{kτ 0 αn,h , α) = O(T − 2s+1 ), Rp ( s

1

with hT = O(T − 2s+1 ),

177

Diffusion Processes 2 and for βn,h 2 Rp (βn,h , β 2 ) = O(n− 2s+1 ), s

1

with hn = O(n− 2s+1 ).

Proof. For the risk of the random term of the estimator α n,h , the sum of the quadratic variations of the process n−1 

Sn,X (t) = (nΔn )−1

{ΔX(ti ) − Δn α(ti )}1{ti ≤t}

i=0

is the increasing process [S]n,X (t) = (nΔn )−2 expectation μn,X (t) = (nΔn )−2

n−1 

n−1 i=0

{εi β(ti )}2 1{ti ≤t} with

β 2 (ti )Δn 1{ti ≤t} = O((nΔn )−1 ) = O(T −1 )

i=0

 −1 T

2 as T 0 β (s) ds is finite, the bound for p in ]0, 2[ is obtained by concavity, like for Proposition 2.7. The time dependent diffusion is a process with independent increments so the result for p ≥ 2 is obtained like for Proposition 2.7. 2 , the sum of the quadratic variations of the For the random term of βn,h process

Sn,Y (t) = (nΔn )−1

n−1 

2 {Zn,h (ti ) − Δn β 2 (ti )}1{ti ≤t}

i=0

is the increasing process [S]n,Y (t) = (nΔn )−2

n−1 

2 {Zn,h (ti ) − Δn β 2 (ti )}2 1{ti ≤t}

i=0

with expectation asymptotically equivalent, as T −1 μn,Y (t) = 2n−2 Δ−1 n

n−1 

T 0

β 4 (s) ds is finite, to

Δn β 4 (ti )1{ti ≤t} = O(n−1 ).

i=0

The result for p in ]0, 2[ follows by concavity, like for a density or a regression 2 (ti ) are independent so the result for p ≥ 2 is function. The variables Zn,h obtained like for a regression function.  The minimax property of the estimator α n,h is deduced from the distribution of the vector (Yi )i=1,...,n with components the independent Gaussian variables Yi = Δn αti + εi βti , having the density  f ⊗n = i=1,...,n fΔn αti ,Δn βt2 . i

178

Functional Estimation for Density, Regression Models and Processes

Proposition 8.4. Let α and β belong to Fs,p = {u ∈ C s (R+ ); u(s) ∈ Lp } for a real p > 1 and an integer s ≥ 2. The minimax risks for α and β 2 in Fs,p and estimators in the set Fn of their estimators in Fs,p based on n observations of the diffusions at (ti )i=1,...,n are inf

sup Rp ( αn , α) = O(T − 2s+1 ), s

n α∈Fs,p α  n ∈F

inf

s sup Rp (βn2 , β 2 ) = O(n− 2s+1 ).

n ∈F n β∈Fs,p β

Proof. that

Let αn be a small perturbation of the function α on [0, T ] such αn,a (t) = α(t) + ηn



aj g(b−1 n (tsj )),

(8.5)

j=1,...,N

where aj belongs to {−1, 1} for every j, ηn and bn tend to zero as n tends to infinity, g is a symmetric and positive function of Fs,p ([0, 1]), with value zero at 0 and 1, and (sij )j is a partition of [0, T ] distinct of (ti )i , with path bn so that the functions g(b−1 n (t − sj )) have disjoint supports and every ti belongs to a unique interval [sji , sji +1 [. The functions αn,a belong to = O(1), equivalently ηn b−s Fs,p (R+ ) if N ηnp b1−sp n = O(1).  n ⊗n Let fn = i=1,...,n fΔn αn,ti ,Δn βt2 denote the Gaussian densities of the i vector (Yi )i=1,...,n with the perturbated expectations. The information K(f1 , f2 ) = −Ef2 log(f2−1 f1 ) of two Gaussian densities f1 = fμ1 ,σ12 and f2 = fμ2 ,σ22 is −K(f1 , f2 ) = log

σ2 σ2 − σ2 (μ1 − μ2 )2 − 2 2 1 − , σ1 σ1 σ12

then the information for two Gaussian densities with the same variance σ 2 is −μ2 )2 and, since |αn,a,ti −αti | = ηn |g(b−1 K(f1 , f2 ) = (μ12σ 2 n (ti −sji ))| = O(ηn ), we have n  {αn,a (ti ) − αn,a (ti )}2 ⊗n ⊗n , fn,a = O(nbn ηn2 ), K(fn,a  ) = Δn 2 β ti i=1 if bn = O(Δn ). The Lp norm of αn,a − αn,a satisfies  sj+1 N  p Rpp (αn,a , αn,a ) = ηnp |aj − aj |p |g(b−1 n (t − sj ))| dt sj

j=1

= bn ηnp gpp

N  j=1

|aj − aj |p ,

179

Diffusion Processes

where N bn = T . Applying the inequality (2.6) with h2 (f1 , f2 ) ≤ K(f1 , f2 ) implies the existence of constants c1 and c2 such that inf sup Rpp (α, α n ) ≥ c1 ηnp exp(−c2 nbn ηn2 ).

 α∈F α  n ∈F

1

This bound is maximum if ηn = O((nbn )− 2 ), moreover αn,a belongs to 1 1 that bn = O(n− 2s+1 ) = O(T − 2s+1 ) Fs,p hence ηn b−s n = O(1), it follows s with Δn < 1, and ηn = O(T − 2s+1 ), we deduce that a lower bound for the s minimax risk is a O(T − 2s+1 ). The minimax risk for the estimation of β 2 is established by similar arguments, let βn2 be a small perturbation of the function β 2 on [0, T ], such that  2 (t) = β 2 (t) + ηn aj g(b−1 βn,a n (t − sj )), j=1,...,N

it is defined like in (8.5) and with the same conditions. The minimax prop2 is deduced from the Gaussian distribution of erty of the estimator βn,h  the vector (Yi )i=1,...,n with density f ⊗n = i=1,...,n fΔn αti ,Δn βt2i . For Gaussian densities f1 and f2 with the same expectation and if |σ1 − σ2 | is small enough, the information is σ2 1 ) − 2 (σ22 − σ12 ) σ1 2σ2 (σ2 − σ1 )4 (σ1 − σ2 )2 σ1 |σ 2 − σ 2 | σ1 ≤− ≤ ≤ 1 32 . 4 3 4σ2 σ2 σ2

K(f1 , f2 ) = log(

It follows that ⊗n ⊗n K(fn,a , fn,a ) ≤



2 2 |βn,a,t − βn,a  ,t | i i {1 + o(1)} = O(nηn ). 2 β ti i=1,...,n

(8.6)

2 2 p 2 2 p The Lp norm of βn,a − βn,a  satisfies again Rp (βn,a , βn,a ) = 0(T ηn ) and the inequality (2.6) implies the existence of constants c1 and c2 such that

inf sup Rpp (β 2 , βn2 ) ≥ c1 ηnp exp(−c2 nηn ).

2 ∈F  β 2 ∈F β n

The condition nηn constant implies nbn ηn2 = O(1) and the proof ends like for the estimation of α.  2 are miniBy Propositions 8.3 and 8.4, the kernel estimators α n,h and βn,h max, under Conditions of Proposition 8.3.

180

Functional Estimation for Density, Regression Models and Processes

Kernel estimators of the drift and variance functions for continuously observed diffusion processes (8.1) are defined as  T −1 Kh (t − s) dX(s), α T,h (t) = T 0

2 (t) = T −1 βT,h



0

T

2 Kh (t − s) dYT,h (s),

t T,h (s) ds. The expectation of α T,h (t) with the process YT,h (t) = X(t) − 0 α  −1 T is αT,h (t) = T α(s)Kh (t − s) ds and it converges to α(t) as n tends to 0 t infinity. The process YT,h has the expectation E YT,h (t) = − 0 bα,T,h (s) ds T,h (t) is a 0(h2 ) for every t, and its variance is where the bias bα,T,h (t) of α V (t) + o(1). The integral on [0, T ] is approximated by a sum of n terms on a partition T,h and (ti )i=1,...,n be of [0, T ] such that nΔn = T , so that the estimators α 2 2 are approximated by the estimators α n,h and βn,h defined for discrete βT,h observations of the diffusion. Proposition 8.5. For drift and variance functions of C s ([0, T ]) 2 a.s. uniformly consistent on IT,h , (a) the estimators α T,h and βT,h (b) for every t in IT,h , the bias of the estimators have the uniform approximations bα,T,h (t) = αT,h (t) − α(t) = hs bα (t) + o(hs ), msK (s) α (t), bα (t) = k! 2 bβ 2 ,T,h (t) = βT,h (t) − β 2 (t) = hs bβ 2 (t) + o(hs ), msK 2 (s) (β ) (t), bβ 2 (t) = k! and their variances are vα,T,h (t) = (hT )−1 {σα2 (t) + o(1)}, σα2 (t) = κ2 β 2 (t), vβ 2 ,T,h (t) = (hT )−1 {σβ2 2 (t) + o(1)}, σβ2 2 (t) = 4κ2 βt2 Vt , T,h (t) and, respectively βT,h (s) and and the covariances of α T,h (s) and α −1  βT,h (t), are 0((hT ) ) if |t − s| ≤ 2h and they are zero if |t − s| > 2h. The proofs are similar to the proofs of Proposition 8.1. The optimal band1 width hT = O(T − 2s+1 ) minimize the asymptotic mean squared errors and s the optimal rates of convergence of the estimators are O(T 2s+1 ).

181

Diffusion Processes

Proposition 8.6. Under the conditions of Proposition 8.1, the processes s s αT,h − αT,h ) and, respectively T 2s+1 (βT,h − βT,h ) converge weakly to T 2s+1 ( centered Gaussian processes with variances vα and, respectively vβ 2 , and covariances zero. 8.3

Auto-regressive diffusions

Let α and β be two functions of class C 2 on a functional metric space (X,  · ), and let B be the standard Brownian motion on R. Their norms on X, α1 , β2 , Eα(X(t))1 and Eβ(X(t))2 are supposed to be finite. A diffusion process (Xt )t∈[0,T ] is defined by a stochastic differential equation t ∈ [0, T ],

dXt = α(Xt )dt + β(Xt )dBt ,

(8.7)

and its initial value X0 such that E|X0 | < ∞. Equation (8.7) with locally Lipschitz drift  t and diffusion functions has a unique solution Xt = t X0 + 0 α(Xs )ds + 0 β(Xs )dBs , it is a continuous Gaussian process with t expectation E(Xt − X0 ) = 0 Eα(Xs ) ds and variance  t

 t V ar(Xt − X0 ) = V ar α(Xs ) ds + E β 2 (Xs ) ds 0

 ≤

t 0

2

0

2

E{α (Xs ) + β (Xs )} ds −

 0

2

t

Eα(Xs ) ds

.

The existence and unicity of the process X is proved by construction of a sequence of processes satisfying Equation (8.7) and starting from X0 and satisfying  t  t {α(Xn,s )−α(Xn−1,s )}ds+ {β(Xn,s )−β(Xn−1,s )}dBt , Xn,t −Xn−1,t = 0

0

hence Xn,t is the finite sum from X0 to Xn,t −Xn−1,t , where the convergence of the sum is a consequence of the Lipschitz property of α and β. By a discretization of the time interval [0, T ] in n sub-intervals of length tending to zero as n tends to infinity, Equation (8.7) is approximated as Xti+1 − Xti ≈ Yi = (ti+1 − ti )α(Xti ) + β(Xti ){Bti+1 − Bti },

(8.8)

considering the functions α and β as piecewise constant on the intervals of the partition generated by (ti )i=1,...,n . Let εi = Bti+1 − Bti , it is a random variable with expectation zero and variance (ti+1 − ti ) conditionally on the σ-algebra Fti generated by the sample-paths of X up to ti ,

182

Functional Estimation for Density, Regression Models and Processes

then Eα(Xti )εi = 0 and V ar(Yi |Xti ) = (ti+1 − ti )β 2 (Xti ). The process Xt solution of (8.7) is a continuous Gaussian process with independent increments. Its increments are approximated by the nonparametric regression model (8.8) with an independent normal error by considering the functions α and β as stepwise constant functions on the partition (ti )1≤i≤n . In the nonparametric regression model (8.8), EYi = (ti+1 − ti )Eα(Xti ) and V arYi = (ti+1 − ti )2 V arα(Xti ) + (ti+1 − ti )Eβ 2 (Xti ). Let t in Ii =]ti , ti+1 ], the approximation error of the process Xt − Xti t by the discretized sample-path (8.8) is et;ti = ti {α(Xs ) − α(Xti )}ds + t ti {β(Xs ) − β(Xti )}dBs , it satisfies Eet,ti = (t − ti )E{α(Xs∗ )}, s∗ ∈]ti , t[, E|et,ti | ≤ (t − ti )2 sup |α(1) (Xs )|, 

with the supremum over s in I_i, and its variance satisfies

Var e_{t,t_i} = Var( ∫_{t_i}^t {α(X_s) − α(X_{t_i})} ds + ∫_{t_i}^t {β(X_s) − β(X_{t_i})} dB_s )
  ≤ (t − t_i) ∫_{t_i}^t Var{α(X_s) − α(X_{t_i})} ds + ∫_{t_i}^t E{β(X_s) − β(X_{t_i})}² ds,

and it is bounded by (t − t_i)(‖α^{(1)}‖₂ + ‖β^{(1)}‖₂) E(X_t − X_{t_i})² = O((t − t_i)²), with

E(X_t − X_{t_i})² ≤ ∫_{t_i}^t E{α²(X_s) + β²(X_s)} ds.
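A path satisfying the discretized scheme (8.8) can be generated by an Euler-type recursion; the sketch below, with the hypothetical helper euler_maruyama and user-supplied (assumed Lipschitz) functions alpha and beta, is only meant to produce test paths for the estimators discussed in this chapter.

```python
import numpy as np

def euler_maruyama(alpha, beta, x0, T, n, rng=None):
    """Generate a path of the discretized scheme (8.8) for
    dX = alpha(X) dt + beta(X) dB on a regular grid of [0, T]; alpha and beta
    are user-supplied functions, assumed Lipschitz."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n
    X = np.empty(n + 1)
    X[0] = x0
    for i in range(n):
        eps = rng.normal(0.0, np.sqrt(dt))      # eps_i = B_{t_{i+1}} - B_{t_i}
        X[i + 1] = X[i] + alpha(X[i]) * dt + beta(X[i]) * eps
    return X

# example: a mean-reverting test path with alpha(x) = -x and beta(x) = 1
path = euler_maruyama(lambda x: -x, lambda x: 1.0, x0=0.0, T=10.0, n=2000)
```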

The following condition allows the moments of the increments of the diffusion process to be expressed as integrals with respect to a mean density.

Condition 8.1. There exists a mean density of the variables (X_{t_i})_{1≤i≤n} defined as the limit

f(x) = lim_{n→∞} n^{-1} Σ_{i=0}^{n−1} f_{X_{t_i}}(x) = E f_{X_t}(x).

This condition is satisfied under a mixing property of the process X_t

sup{Pr(B | A) − Pr(B); A ∈ F_0^t, B ∈ F_{t+s}^∞, s, t ∈ R_+} ≤ ϕ(s),  ∫_0^T ϕ(u) du < ∞,   (8.9)

where the σ-algebras F_0^t and F_{t+s}^∞ are respectively generated by {X_u, u ∈ [0, t]} and {X_u, u ∈ [t + s, ∞[}. That property is satisfied for the Brownian motion, its sample paths having independent increments. For the diffusion


process X_t, it is sufficient that E ∫_0^T β²(X_s) ds < ∞. Moments of discontinuous parts of a diffusion process with jumps require another ergodicity condition defining another mean density and it is satisfied under the mixing property (8.9).

8.4

Estimation for auto-regressive diffusions by discretization

The regression model (8.8) with observations at fixed points regularly spaced on a grid (t_i)_{1≤i≤n}, with step Δ_n = n^{-1}T, is written

Y_i = Δ_n α(X_{t_i}) + β(X_{t_i}) ε_i,  i = 1, . . . , n,

with E(Y_i | X_{t_i} = x) = Δ_n α(x) and Var(Y_i | X_{t_i} = x) = Δ_n β²(x). The variables ε_i have a normal distribution N(0, Δ_n), hence Eε_i^{2k+1} = 0 for every integer k, Eε_i² = Δ_n and Eε_i^{2k} = (2k − 1) Δ_n Eε_i^{2(k−1)} for every k ≥ 1, thus Eε_i^4 = 3Δ_n². Let I_X and I_{XY} be respectively subsets of the supports of the distribution functions F_X and F_{XY}, and for h > 0 let I_{X,h} and I_{XY,h} be defined like for the regression model. A nonparametric estimator of the function α requires a normalization of Y_i by the scale Δ_n^{-1}

α̂_{n,h}(x) = Σ_{i=0}^{n−1} Y_i K_h(x − X_{t_i}) / {Δ_n Σ_{i=0}^{n−1} K_h(x − X_{t_i})},

for every x in I_{X,h}. It is the ratio of

μ̂_{α,n,h}(x) = (nΔ_n)^{-1} Σ_{i=0}^{n−1} Y_i K_h(x − X_{t_i})

and f̂_{X,n,h}(x) = n^{-1} Σ_{i=0}^{n−1} K_h(x − X_{t_i}). The expectations of f̂_{X,n,h}(x) and μ̂_{α,n,h}(x), and their limits, are respectively

f_{X,n,h}(x) = n^{-1} Σ_{i=0}^{n−1} ∫_{I_{X_{t_i},h}} K_h(x − s) dF_{X_{t_i}}(s),
f_X(x) = lim_{n→∞} n^{-1} Σ_{i=0}^{n−1} f_{X_{t_i}}(x),
μ_{α,n,h}(x) = n^{-1} Σ_{i=0}^{n−1} ∫_{I_{X_{t_i},h}} α(s) K_h(x − s) dF_{X_{t_i}}(s),
μ_α(x) = α(x) f_X(x),


so α n,h (x) converges to α(x). The approximations and the convergences of Proposition 3.1 are satisfied for the estimator α n,h . Let αn,h be its expectation. Proposition 8.7. Under Conditions 2.1, 2.2 and 3.1(1) αn,h (x) − α(x)| converges a.s. to zero, (a) supx∈IX,h | (b) the following expansions are satisfied αn,h (x) =

μα,n,h (x) + O((nh)−1 ), fX,n,h (x)

−1 { αn,h − αn,h }(x) = fX (x){( μα,n,h − μα,n,h )(x) − α(x)(fX,n,h − fX,n,h )(x)} + rn,h , 1

where rn,h = oL2 ((nΔn h)− 2 ). n,h (x) are (c) The bias of μ α,n,h (x) and α bμ,n,h (x) = μn,h (x) − μ(x) = h2 bμ (x) + o(h2 ), m2K (2) m2K μ (x) = {α(x)fX (x)}(2) , bμ (x) = 2 2 bα,n,h (x) = αn,h (x) − α(x) = h2 bα (x) + o(h2 ), m2K −1 (2) f (x){μ(2) (x) − α(x)fX (x)}, bα (x) = 2 X and the variance of α n,h (x) is vα,n,h (x) = (T h)−1 {σα2 (x) + o(1)}, −1 (x)β 2 (x). σα2 (x) = κ2 fX

For the estimation of the function β defining the variance of the diffusion process, let n,h (Xti ) = Δn (α − α n,h )(Xti ) + β(Xti )εi , Zi = Yi − Δn α its expectation is E{Zi |Xti } = Δn (α − αn,h )(Xti ) = O(Δn h2 ) and its variance satisfies −1 Δ−1 n,h )(Xti ) + β(Xti )εi }2 n V ar{Zi |Xti = x} = Δn E{Δn (αn,h − α

= Δn V ar αn,h (Xti ) + β 2 (x) = β 2 (x) + o(1). A consistent estimator of the function β 2 (x) is therefore n−1 2 2 i=0 Zi Kh (x − Xti )  , βn,h (x) =  Δn n−1 i=0 Kh (x − Xti )

(8.10)


it is the ratio of μ β,n,h (x) =

n−1 1  2 Z Kh (x − Xti ), nΔn i=0 i

β,n,h (x) and its limits are and fX,n,h (x). The expectation of μ μβ,n,h (x) =

n−1  1 n i=0 IXt

i

β 2 (s)Kh (x − s) dFXti (s), ,h

μβ (x) = β 2 (x)fX (x). The approximations and the convergences of Proposition 3.1 are also sat2 isfied for the estimator βn2 . Let βn,h be its expectation. Proposition 8.8. Under Conditions 2.1, 2.2 and 3.1(1) 2 (a) supx∈IX,h |βn,h (x) − β 2 (x)| converges a.s. to zero, (b) the following expansions are satisfied 2 βn,h (x) =

μβ,n,h (x) + O((nh)−1 ), fX,n,h (x)

1 1 −1 2 2 − βn,h }(x) = (nh) 2 fX (x){( μβ,n,h − μβ,n,h )(x) (nh) 2 {βn,h

− β 2 (x)(fX,n,h − fX,n,h )(x)} + rn,h , where rn,h = oL2 (1). 2 (x) are (c) The bias of μ β,n,h (x) and βn,h bμ,n,h (x) = μβ,n,h (x) − μβ (x) = h2 bμ (x) + o(h2 ), m2K (2) m2K 2 μβ (x) = {β (x)fX (x)}(2) , bμ (x) = 2 2 2 (x) − β 2 (x) = h2 bβ (x) + o(h2 ), bβ,n,h (x) = βn,h m2K −1 (2) (2) f (x){μβ (x) − β 2 (x)fX (x)}, bβ (x) = 2 X 2 and the variance of βn,h (x) is

vβ,n,h (x) = (nh)−1 {σβ2 (x) + o(1)}, −1 (x)β 4 (x). σβ2 (x) = 2κ2 fX

Proof. (a) and (b) are proved like for the estimator of a regression function. The variance vβ,n,h (x) is deduced from the expansion of (b), with the order ((nh)−1 ) for the variance of fX,n,h and the covariance of fX,n,h and


μ β 2 n,,h , and where the variance σμ2 (x) of μ β,n,h is provided by the first term in the expansion of E(Zi4 | Xti = x) − E 2 (Zi2 | Xti = x) with E(Zi4 | Xti = x) = E{β 4 (x)ε4i + 4β 2 (x)ε2i ( αn,h − α)2 (x)} + o(Δn ) = 3Δ2n β 4 (x){1 + o(1)}, E(Zi2 | Xti = x) = Δn β 2 (x){1 + o(1)}, then V ar μβ,n,h (x) = 2(nh)−1 κ2 β 4 (x)fX (x) + o((nh)−1 ) and the variance 2   of βn,h follows. Proposition 8.9. Under Conditions 2.1, 2.2 and 3.1 for the functions α 2 n,h and βn,h are uniformly consisand β in class C s (IX ), the estimators α tent on X , with bias bα,n,h (x) = αn,h (x) − α(x) = hs bα (x) + o(hs ), msK −1 (s) f (x){μ(s) (x)}, bα (x) = α (x) − α(x)f s! 2 (x) − β 2 (x) = hs bβ (x) + o(hs ), bβ 2 ,n,h (x) = βn,h msK −1 (s) f (x){μβ (x) − β 2 (x)f (s) (x)}, bβ 2 (x) = s! and their variances are vα,n,h (x) and vβ,n,h (x). Let γ = limn→∞ nhn , the s 1 αn,h − α − γ 2 bα ) converges weakly to centered Gaussian proprocess T 2s+1 ( s 1 2 − β 2 − γ 2 bβ 2 ) converges cess with variance σα2 (x), the process n 2s+1 (βn,h weakly to centered Gaussian process with variance σβ2 (x) at x, and the limiting covariances are zero. 1
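A minimal numerical sketch of the estimators α̂_{n,h} and β̂²_{n,h} of Propositions 8.7–8.9 is given below, assuming an Epanechnikov kernel, a user-chosen bandwidth h and an increasing evaluation grid; the interpolation of the fitted drift at the observation points is an implementation convenience, not part of the definition.

```python
import numpy as np

def drift_diffusion_estimators(X, T, h, xgrid):
    """Kernel estimators alpha_hat_{n,h}(x) and beta^2_hat_{n,h}(x) from a
    discretely observed path X[0..n] on [0, T]; Epanechnikov kernel, fixed
    bandwidth h, increasing evaluation grid xgrid (illustrative choices)."""
    n = len(X) - 1
    dt = T / n                                   # Delta_n = n^{-1} T
    Xt = X[:-1]                                  # X_{t_i}, i = 0, ..., n-1
    Y = np.diff(X)                               # Y_i = X_{t_{i+1}} - X_{t_i}
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    alpha_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - Xt) / h) / h                  # K_h(x - X_{t_i})
        alpha_hat[j] = (w @ Y) / (dt * max(w.sum(), 1e-12))
    # residuals Z_i = Y_i - Delta_n * alpha_hat(X_{t_i}); the fitted drift is
    # interpolated at the observation points (xgrid is assumed increasing)
    Z = Y - dt * np.interp(Xt, xgrid, alpha_hat)
    beta2_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - Xt) / h) / h
        beta2_hat[j] = (w @ Z**2) / (dt * max(w.sum(), 1e-12))   # estimator (8.10)
    return alpha_hat, beta2_hat
```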

The order h_T = O(T^{-1/(2s+1)}) for the bandwidths is the order of the optimal bandwidth for the asymptotic mean squared error of the estimation of α and h_n = O(n^{-1/(2s+1)}) is the order of the optimal bandwidth for the asymptotic mean squared error of the estimation of β², in C^s(I_X). The conditions ensure a Lipschitz property for the second order moment of the increments of the processes, similar to Lemma 2.2 for the density; the covariances develop like in the proof of Theorem 2.1. Proposition 3.5 for regression functions extends to the estimators for discretized auto-regressive diffusions.

Proposition 8.10. Under Conditions 2.1, 2.2 and 3.1 for the kernel and for functions α and β of C^s(I_X) with derivatives in L^p(I_X), for a real p > 0, the kernel estimator α̂_{n,h} has the L^p-risk
R_p(α̂_{n,h}, α) = O(T^{-s/(2s+1)}),  with h_T = O(T^{-1/(2s+1)}),
and β̂²_{n,h} has the L^p-risk
R_p(β̂²_{n,h}, β²) = O(n^{-s/(2s+1)}),  with h_n = O(n^{-1/(2s+1)}).

The arguments of the proof are the same as for Propositions 8.3 and 3.5. The minimax property of the estimator α n,h is deduced from the distribution of the vector (Yi )i=1,...,n with the Gaussian density f ⊗n =  i=1,...,n fΔn Eα(Xti ),Δn Eβ 2 (Xti ) . Proposition 8.11. Let α and β belong to Fs,p = {u ∈ C s (IX ); u(s) ∈ Lp (IX )} for a real p > 1 and an integer s ≥ 2, the minimax risk for α in Fs,p and an estimator in a subset Fn of its estimators in Fs,p based on n observations of the diffusions at (ti )i=1,...,n is inf

sup Rp ( αn , α) = O(T − 2s+1 ). s

n α∈Fs,p α  n ∈F

The minimax risk for β 2 is inf

sup Rp (βn2 , β 2 ) = O(n− 2s+1 ). s

n ∈F n β∈Fs,p β

Proof. Let αn be a small perturbation of the function α on a bounded interval [a, b] such that  aj g(b−1 αn,a (x) = α(x) + ηn n (x − xj )), j=1,...,Nn

where aj belongs to {−1, 1} for every j, ηn and bn tend to zero as n tends to infinity, Nn bn = b − a, g is a symmetric and positive function of Fs,p ([0, 1]), and (xj )j=1,...,Nn is a partition of [a, b] with path bn so that the functions disjoint supports. g(b−1 n (x − xj ) have  Let fn⊗n = i=1,...,n fΔn Eαn (Xti ),Δn Eβ 2 (Xti ) denote the Gaussian density with the perturbated expectations. ⊗n ⊗n , fn,a The information K(fn,a  ) has the lower bound δn = Δn

n  {Eαn,a (Xti ) − Eαn,a (Xti )}2 , Eβ 2 (Xti ) i=1

and δn = O(nΔn ηn2 ) = O(nbn ηn2 ) if Δn = O(bn ), and the Lp norm of αn,a − αn,a satisfies    p   Rpp (αn,a , αn,a ) = ηnp  (aj − aj )g(b−1 (x − x ))  dx j n j=1,...,Nn

= bn ηnp gpp

N  j=1

|aj − aj |p ,


with N bn = O(1). Applying (2.6) with h2 (f1 , f2 ) ≤ K(f1 , f2 ) implies the existence of constants c1 and c2 such that inf sup Rpp (α, α n ) ≥ c1 ηnp exp(−c2 nbn ηn2 ).

 α∈F α  n ∈F

This bound is maximum if T bn ηn2 = O(1), moreover α n belongs to Fs,p s − 2s+1 −s if ηn bn = O(1), it follows that ηn = O(T ), with Δn < 1, and 1 bn = O(T − 2s+1 ), we deduce a lower bound for the minimax risk Lp . The lower bound of the risk Rp for the estimation of β 2 is proved by similar arguments, using small perturbations of the function β 2 on a bounded interval [a, b]  2 (x) = β 2 (x) + ηn aj g(b−1 βn,a n (x − xj )), j=1,...,Nn

under the same conditions as for αn,a . The minimax property of the estimator α n,h is deduced from the distribution of the vector (ΔY 2 (ti ))i=1,...,n  with the density f ⊗n = i=1,...,n fΔn Eβ 2 (Xti ),2Δ2n Eβ 4 (Xti ) . The informa⊗n ⊗n p tion has the bound (8.6), K(fn,a , fn,a norm of  ) = O(nηn ), and the L 2 2 βn,a − βn,a satisfies again 2 2 p p Rpp (βn,a , βn,a  ) = bn ηn gp

N 

|aj − aj |p ,

j=1

where N bn = b−a. Applying the inequality (2.6) with h2 (f1 , f2 ) ≤ K(f1 , f2 ) implies the existence of constants c1 and c2 such that inf sup Rpp (β 2 , βn2 ) ≥ c1 ηnp exp(−nηn ),

2 ∈F  β 2 ∈F β n

and the condition nηn constant implies nbn ηn2 = O(1), moreover βn,a s 1 − 2s+1 belongs to Fs,p if ηn b−s ), bn = O(n− 2s+1 ) n = O(1), hence ηn = O(n  and a lower bound for the minimax risk Lp of β 2 follows. The conditional variance of the variable Y in model (8.8) being a function of X, the regression function α is also estimated by the expectation of a weighted kernel as in Section 3.7, with weights w(X  ti ) = σα−1 (Xti ). As previously, the approximations of the bias and variance of the new estimator (3.20) of the drift function are modified by introducing w n and its asymptotic distribution is modified. With a partition of [0, T ] in subintervals Ii of unequal length Δn,i varying with the observation times ti of the process, the variable Yi has to be


normalized by Δn,i , 1 ≤ i ≤ n. For every x in Xn,h , the functions α and β are consistently estimated by n−1 −1 i=0 Δn,i Yi Kh (x − Xti ) α n,h (x) = , n−1 i=0 Kh (x − Xti ) −1

n,h (Xti )}, Zn,i = Δn,i2 {Yi − Δn,i α n−1 2 i=0 Zn,i Kh (x − Xti ) 2 . (x) =  βn,h n−1 i=0 Kh (x − Xti ) The results of Proposition 8.9 are satisfied, replacing the means of sums n−1 −1 −1 with terms Δ−1 n,i by means with coefficient n i=0 Δn,i and assuming that the lengths Δn,i have the order n−1 T . The optimal bandwidth for 1 the estimation of α is O(T − 2s+1 ) and its asymptotic mean squared error is AM SEα (x) = (T h)−1 σα2 (x) + hs2 b2α,s (x), it is minimum for the bandwidth function 1  2s+1

(s!)2 κ T −1 V ar(Δ−1 n,i Yti ) 2 . hα,AMSE (x) = (s) (x)}2 2sm2sK {μ(s) α (x) − α(x)f The optimal local bandwidth for estimating the variance function β 2 of the 1 diffusion is a O(n− 2s+1 ) and it minimizes AM SEβ (x) = (nh)−1 σβ2 (x) + hs2 b2β,s (x). A diffusion model including several explanatory processes in the coefficients α and β may be written using an indicator process (Jt )t with values in a discrete space {1, . . . , K} as dXt =

K 

αk (Xt )1{Jt = k}dt+

k=1

K 

βk (Xt )1{Jt = k}dBt , t ∈ [0, T ]. (8.11)

k=1

Let Xtk = Xt 1{Jt = k} be the partition of the variable corresponding to the models for the drift and the variance of Equation (8.11). The model is equivalent to dXt =

K  k=1

αk (Xtk )dt +

K 

βk (Xtk )dBt ,

k=1

and consistent estimators of the 2K functions αk and βk are defined for every for x in Xn,h as n−1 −1 i=0 Δn,i Yi Kh (x − Xti ,k ) , α k,n,h (x) = n−1 i=0 Kh (x − Xti ,k ) n−1 −1 2 i=0 Δn,i Zi Kh (x − Xti ,k ) 2  βk,n,h (x) = . n−1 i=0 Kh (x − Xti ,k )


Their expectations are approximated as the estimators for model (8.7) by −1 αk,n,h (x) = μαk ,n,h (x)fX,n,h (x) + O((nh)−1 ), −1 2 (x) = μβk ,n,h (x)fX,n,h (x) + O((nh)−1 ), βk,n,h

where μα,k,n,h (x) = En−1

n−1 

Δ−1 n,i

  yKh (x − s) dFXk,ti ,Yti (s, y)

i=0

h2 2 m2K μ(2) αk (x) + o(h ) 2 μαk (x) = f (x)αk (x), = μαk (x) +

μβk ,n,h (x) = n−1

n−1 

2 Δ−1 n,i E[Zi Kh (x − Xk,ti )]

i=0

h2 (2) m2K μβk (x) + o(h2 ), 2 μβk (x) = f (x)βk2 (x). = μβk (x) +

The norms and the asymptotic behavior of the estimators is the same as in Proposition 8.9. The two-dimensional model dXt = αX (Yt )dt + βX (Xt )dBX (t), dYt = αY (Yt )dt + βY (Yt )dBY (t), with independent Brownian processes BX and BY is a special case where all parameters are estimated as before. The process (8.7) is generalized with functions depending of the samplepath and of the current time dXt = α(t, Xt )dt + β(t, Xt )dBt , t ∈ [0, T ]

(8.12)

under similar conditions. A discretization of the time interval [0, T ] leads to Yi = Xti+1 − Xti = (ti+1 − ti )α(ti , Xti ) + β(ti , Xti )(Bti+1 − Bti ). The functions α and β are now defined in (R+ × X) and they are estimated by n−1 i=0 Yi Kh1 (x − Xti )Kh2 (t − ti ) , α n,h (t, x) = n−1 Δn i=0 Kh1 (x − Xti )Kh2 (t − ti ) n−1 2 2 i=0 Zi Kh1 (x − Xti )Kh2 (t − ti )  βn,h (t, x) = , n−1 Δn i=0 Kh1 (x − Xti )Kh2 (t − ti )


with Zi = Yi − Δn α n,h (ti , Xti ) = Δn (α − α n,h )(ti , Xti ) + β(ti , Xti )εi . Their 2 bias is now a O((h1 h2 ) ) and their variance has the order (nh1 h2 )−1 for α n,h and (T h1 h2 )−1 for βn,h , the convergence rate of the centered processes is related.
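The following sketch evaluates such a bivariate estimator at a single point (t, x), smoothing the increments with a product of one-dimensional kernels and two bandwidths h1 and h2; kernel and bandwidths are illustrative choices.

```python
import numpy as np

def drift_time_state(X, T, h1, h2, t_eval, x_eval):
    """Bivariate kernel estimate of the drift alpha(t, x) of model (8.12) at a
    single point, smoothing the increments in the state variable (bandwidth h1)
    and in time (bandwidth h2); an illustrative sketch."""
    n = len(X) - 1
    dt = T / n
    ti = np.arange(n) * dt
    Xt = X[:-1]
    Y = np.diff(X)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    w = (K((x_eval - Xt) / h1) / h1) * (K((t_eval - ti) / h2) / h2)
    return (w @ Y) / (dt * max(w.sum(), 1e-12))
```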

8.5

Estimation for continuous diffusion processes

The process {Xt , t ∈ [0, 1]} is extended to a time interval [0, T ] by rescaling: Xt = XT s , with s in [0, 1] and t in [0, T ]. Now the Gaussian process B is 1 mapped from [0, 1] onto [0, T ] by the same transform and Bs = T 2 BT −1 T is the Brownian motion extended from [0, 1] to [0, T ]. The observation of the sample-path of the process {Xt , t ∈ [0, T ]} allows to construct estimators similar those of smooth density and regression function in Sections 2.12 and 3.11, under the ergodic property (2.14). The Brownian process (Bt )t≥0 is a Gaussian martingale with respect to the filtration generated by (Bu )u 0 and the main term of its conditional variance  t

Var(Z_t | X_t) = Var ∫_0^t α̂_{T,h}(X_s) ds + ∫_0^t β²(X_s) ds − 2 Cov( ∫_0^t α̂_{T,h}(X_s) ds, ∫_0^t β(X_s) dB_s ) + O((T h_T)^{-1}),

is approximated by V(t; X) = ∫_0^t β²(X_s) ds. The variance function β²(X_t) is therefore consistently estimated by

β̂²_{T,h}(x) = 2 ∫_0^T Z_s K_h(X_s − x) dZ_s / ∫_0^T K_h(X_s − x) ds.   (8.15)

Under Conditions 2.1 and 3.1 for the functions α and β in class C^s(I_X), the bias of the estimator β̂²_{T,h} is

b_{β,T,h}(x) = h^s b_β(x; s) + o(h^s),  b_β(x; s) = (m_{sK}/s!) f^{-1}(x){(fβ²)^{(s)}(x) − β²(x)f^{(s)}(x)}.


Its variance is vβ,T,h (x) = (T h)−1 {σβ2 (x) + o(1)}, where the expression of σβ2 (x) is established like for Proposition 8.8 in the discrete model, from the first term in the expansion of V ar(Zt2 | Xt = x) σβ2 (x) = 2κ2 f −1 (x)β 4 (x). Proposition 8.12. Under the previous conditions, for functions α and β s s 2 in C s (IX ), the processes (T hT ) 2s+1 ( αT,h −α−bα,T,h ) and (T hT ) 2s+1 (βT,h − 2 β − bβ 2 ,T,h ) converge weakly to a centered Gaussian processes with expectation zero, covariances zero and respective variance functions σα2 and σβ2 . Lemma 3.3 generalizes and the increments E{ αT,h (x) − α T,h (y)}2 and 2 2 3 E{βT,h (x) − βT,h (y)} are approximated by O(|x − y| (T hT )−1 ) for every T,h (x) and x and y in IX,h such that |x − y| ≤ 2hT . The covariance of α βT,h (y), with 2|x − y| > hT develops using the approximation (3.2) as ,   T −2 Kh (Xs − x)β(Xs ) dBs E Cα,β,T,h (x, y) = {f (x)f (y)T }  ×

0

0

T

2

Kh (Xs − y)(2Zt dZt − β (Xt ) dt) 

− Eα(x)  − Eβ(y)

0

0



T

Kh (Xs − y)β(Xs ) dBs

(fT,h − fT,h )(x) 

T

2

Kh (Xs − x)(2Zt dZt − β (Xt ) dt)

× (fT,h − fT,h )(y) + α(x)β(y)(T hT )−1 Cov(fT,h (x), fT,h (y)), it is therefore a o((T hT )−1 ). The mean squared error of the estimator at x for a marginal density in C s is then −1 M ISET,hT (x) = (T hT )−1 κ2 fX (x)σα2 (x) −1 ) + o(h2s + h2s T bα (x; s) + o((T hT ) T ),

and the optimal local and global bandwidths minimizing the mean squared 1 (integrated) errors are O(T 2s+1 ) 1

1 σ 2 (x)  2s+1 α hAMSE,T (x) = T 2sb2α (x; s)


and, for the asymptotic mean integrated squared error criterion 1

1  σ 2 (x) dx  2s+1  α hAMISE,T = . T 2s b2α (x; s) dx With the optimal bandwidth rate, the asymptotic mean (integrated) 2s squared errors are O(T 2s+1 ). According to the local optimal bandwidths defined in the previous sections, estimators of the functions α and β may be defined with a functional bandwidth sequences (hn (x))n or (hT (x))T . The assumptions for the convergence of these sequences are similar to the assumptions for the nonparametric regression with a functional bandwidth and the results of Chapter 4 apply immediately for the estimators of the discretized or continuous processes (8.8) and (8.7).
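When the continuous estimators of this section are computed from a very finely sampled path, the stochastic integrals are approximated by Riemann–Itô sums. The sketch below assumes the drift estimator has the ratio form ∫K_h(X_s − x) dX_s / ∫K_h(X_s − x) ds and takes Z_t = X_t − X_0 − ∫_0^t α̂_{T,h}(X_s) ds before applying (8.15); both are working assumptions for the illustration.

```python
import numpy as np

def continuous_estimators(X, T, h, xgrid):
    """Riemann-sum approximation of the continuous-observation estimators from
    a very finely sampled path X[0..n] on [0, T]; the drift estimator is taken
    as the ratio int K_h(X_s - x) dX_s / int K_h(X_s - x) ds and the variance
    estimator follows (8.15), both as working assumptions for this sketch."""
    n = len(X) - 1
    ds = T / n
    Xs = X[:-1]
    dX = np.diff(X)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    alpha_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - Xs) / h) / h
        alpha_hat[j] = (w @ dX) / max((w * ds).sum(), 1e-12)
    # Z_t = X_t - X_0 - int_0^t alpha_hat(X_s) ds, computed along the path
    # (xgrid is assumed increasing for the interpolation)
    Z = np.concatenate(([0.0], np.cumsum(dX - np.interp(Xs, xgrid, alpha_hat) * ds)))
    Zs, dZ = Z[:-1], np.diff(Z)
    beta2_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - Xs) / h) / h
        beta2_hat[j] = 2.0 * (w @ (Zs * dZ)) / max((w * ds).sum(), 1e-12)  # (8.15)
    return alpha_hat, beta2_hat
```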

8.6

Estimation of a diffusion with stochastic volatility

A nonparametric model of diffusion with a hidden stochastic volatility is defined by the differential equations for an observed real centered Gaussian process Xt and an unobserved real Markovian process Vt defined on R+ as dXt = σt dBt , dVt = α(Vt ) dt + β(Vt ) dWt , 2

(8.16)

σt2

with functions α and β of L (R+ ), with Vt = > 0, X0 = 0 and V0 = η a random variable independent of the two dimensional standard Brownian motion (Bt , Wt ). Assuming that Bt and Wt are independent, the process Xt is centered and its variance is  t V ar0 Xt = E0 Vs ds. 0

The process Vt is an autoregressive Markov process and it satisfies the ergodic property (8.2) with an invariant distribution function FV on R+ . According to the first time-dependent equation, it is estimated like β 2 (t) in the diffusion model (8.1). With a discretely observed process Xt , at times ti such that ti+1 − ti = Δn = n−1 T for i = 1, . . . , n, X(ti+1 ) − X(ti ) has the approximation Y (ti ) = σ(ti )εi + o(Δn ), so Vt has the estimator Vn,h (t) = (nΔn )−1

Σ_{i=0}^{n−1} Y²(t_i) K_h(t − t_i).


Its expectation Vn,h (t) = E Vn,h (t) = n−1

Σ_{i=0}^{n−1} V(t_i) K_h(t − t_i),

converges to V (t) as nh tends to infinity with n and Vn,h (t) is an a.s. uniformly consistent estimator of V on IT,h . For a function V of C s ([0, T ]), the bias and the variance of Vn,h (t) are given by Proposition 8.1 bV,n,h(t) = hs bV (t) + o(hs ), msK (s) V (t), bV (t) = k! vV,n,h (t) = (nh)−1 {σV2 (t) + o(1)}, σV2 (t) = 2κ2 V 2 (t). The optimal convergence rate for the estimation of the function V is there1 s fore hn = n− 2s+1 and the process n 2s+1 {Vn,h (t)−Vn,h (t)} converges weakly to a centered Gaussian process with variance σV2 (t) and covariances zero. Equation (8.16) defining the model of the auto-regressive diffusion Vt is similar to (8.7) previously defined for the process Xt dVn,h,t = α(Vn,h,t ) dt + β(Vn,h,t ) dWt . The estimators of the drift and variance functions of the diffusion equations (8.7) and (8.16) are similar, replacing the process X of Sections 8.3, 8.4 and 8.5 by the estimator Vn,h of the unobserved process V . For i = 0, . . . , n − 1, let Ui = Δn α(Vn,h,ti ) + β(Vn,h,ti )εi , n,h (Vn,h,ti ) = Δn (α − α n,h )(Vn,h,ti ) + β(Vn,h,ti )εi , Zi = Ui − Δn α the estimators of the functions α and β are n−1  i=0 Ui Kh (y − Vn,h,ti ) α n,h (y) = , n−1 Δn i=0 Kh (y − Vn,h,ti ) n−1 2  2 i=0 Zi Kh (y − Vn,h,ti )  , βn,h (y) = n−1 Δn i=0 Kh (y − Vn,h,ti ) they are a.s. uniformly consistent on [h, T − h]. The distribution function n−1 FV (y) = limn→∞ n−1 i=0 P (Vti ≤ y) is estimated by Fn,h,V (y) = n−1

Σ_{i=0}^{n−1} 1{V̂_{n,h,t_i} ≤ y},


and the asymptotic approximations of the bias and the variance of α n,h and 2  βn,h are similar to those of Propositions 8.7 and 8.8, replacing the density fX by fV . 2 Propositions 8.3 and 8.10 apply to the estimators Vn,h , α n,h and βn,h 2 of the functions V , α and β . Proposition 8.13. Under Condition 2.1 for the kernel and for functions V , α and β of C s ([0, T ]) with derivatives in Lp ([0, T ]), the kernel estimators have the Lp -risk, p > 0 Rp (Vn,h , V ) = O(n− 2s+1 ),

with hn = O(n− 2s+1 ),

αn,h , α) = O(T − 2s+1 ), Rp ( s 2 Rp (βn,h , β 2 ) = O(n− 2s+1 ),

with hT = O(T − 2s+1 ),

1

s

1

s

1

with hn = O(n− 2s+1 ).

2 n,h and βn,h are By Propositions 8.4 and 8.11, the kernel estimators Vn,h , α minimax. Estimators for continuously observed diffusions with stochastic volatility (8.16) are defined as

V̂_{T,h}(t) = T^{-1} ∫_0^T 2X(s) K_h(t − s) dX(s),   (8.17)

it is an a.s. uniformly consistent estimator of V on [h, T − h]. The ergodic distribution function F_V(y) = lim_{T→∞} T^{-1} ∫_0^T P(V_t ≤ y) dt is estimated by

F̂_{T,h,V̂_{T,h}}(y) = T^{-1} ∫_0^T 1{V̂_{T,h}(t) ≤ y} dt,   (8.18)

1 Proposition 8.14. The empirical process νT,h,V = T 2 (FT,h,VT ,h − FVT ,h ) converges weakly to the Brownian bridge GFV related to the ergodic distri1 bution function FV , and GFV is the limit of νT,V (x) = T 2 (FT,V − FV ).

The mean density of V has the estimator

f̂_{T,h,V̂_{T,h}}(t) = ∫_0^T K_h(t − s) dF̂_{T,h,V̂_{T,h}}(s)

and from Proposition 8.14 it converges to f_V(t); its bias is a O(h²) and its variance a O((nh)^{-1}).
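A sketch of the volatility-path estimator V̂_{n,h}(t) built from squared increments is given below; the estimated path can then be plugged into the drift and diffusion estimators of Section 8.4, as described above. Kernel and bandwidth are again illustrative choices.

```python
import numpy as np

def volatility_path(X, T, h, tgrid):
    """Kernel estimator V_hat_{n,h}(t) of the hidden volatility V_t = sigma_t^2
    in model (8.16), smoothing the squared increments of the observed process;
    Epanechnikov kernel and bandwidth h are illustrative."""
    n = len(X) - 1
    dt = T / n                                   # Delta_n
    ti = np.arange(n) * dt
    Y2 = np.diff(X) ** 2                         # Y^2(t_i)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    V_hat = np.empty(len(tgrid))
    for j, t in enumerate(tgrid):
        w = K((t - ti) / h) / h                  # K_h(t - t_i)
        V_hat[j] = (w @ Y2) / (n * dt)           # (n Delta_n)^{-1} sum Y^2(t_i) K_h(t - t_i)
    return V_hat
```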


The diffusion equation for the process V has the approximation dV_{T,h,t} = α(V_{T,h,t}) dt + β(V_{T,h,t}) dW_t, the estimators of the functions α and β are similar to those of Section 8.5 and their asymptotic properties follow. Let X be the process defined by extension of the model (8.1) as a time-dependent model dX_t = μ_t dt + σ_t dB_t or an auto-regressive model dX_t = μ(X_t) dt + σ(X_t) dB_t, with a stochastic volatility σ² = V following Equation (8.1), or respectively (8.16). The estimators of the function V defined in Section 8.1 or 8.5 are used to estimate the drift and variance functions of the diffusion process X_t and their asymptotic properties follow. In the same way, the results obtained with discrete observations in Section 8.4 for the estimator of the variance of the process X apply to the estimation of the drift and variance functions of the diffusion equation (8.16) for V with discrete observations.

8.7

Estimation of auto-regressive spatial diffusions

Let α and β be two functions of class C 2 (IX ), for a subset IX of Rd , with values in Rd and let B1 and B2 be independent Brownian motions on a time interval [0, T ]. The norms β2 , Eα(X(t))1 and Eβ(X(t))2 are supposed to be finite. A spatial diffusion process (Xt )t∈[0,T ] is defined on Rd by a stochastic differential equation dXt = α(Xt )dt + β(Xt ) dBt , t ∈ [0, T ],

(8.19)

where Bt = (B1t , B2t )T , and its initial value X0 such that E|X0 | < ∞. The components of α = (α1 , . . . , αd )T and β = (β1 , . . . , βd )T are continuous diffusions on R2 , they are estimated by discretization (8.4) or from the continuous paths. Using its values on a grid (ti )1≤i≤n on [0, T ], of path Δn = n−1 T , let Yi = Δn α(Xti ) + β(Xti )εi ,

i = 1, . . . , n,

with the Gaussian variables εi = Bti+1 − Bti . The variables Yi have the expectation and the marginal variances E(Yi | Xti = x) = Δn α(x) and


V ar(Yki | Xti = x) = Δn βk2 (x), multivariate kernel estimators of α and β are now defined on IX,h as n−1 d i=0 Yi j=1 Kh (xj − Xj,ti α n,h (x) = , n−1 d Δn i=0 j=1 Kh (xj − Xj,ti ) n−1 2 d i=0 Zki j=1 Kh (xj − Xj,ti 2 , k = 1, 2, βn,h,k (x) = n−1 d Δn i=0 j=1 Kh (xj − Xj,ti ) where x = (x1 , . . . , xd ) and n,h,k (Xti ) = Δn (αk − α n,h,k )(Xti ) + βk (Xti )εki Zki = Yki − Δn α where the variables εki , i = 1, . . . , n, are independent. Propositions 8.7 and 8.8 are modified with variances O((nhd )−1 ) for the estimators of the drift and the diffusion, (a) and (b) and the expansions of the biases are 1 unchanged. In Proposition 8.9 the convergence rate is modified as (nhd ) 2 . The estimation for continuously observed spatial diffusions is adapted from Section 8.5 with a kernel on Rd , the order of the bias is unchanged and the order of the variances of the estimators is now (T hd )−1 . A dependence of the Brownian motions Bj , j = 1, . . . , d, induces a covariance between the components of the diffusion process and therefore between the components of Yi , let Cov(Yij , Yij  | Xti = x) = Δn σjj  (x). The function σjj  is consistently estimated by n−1 Zji Zj  i Kh (xj − Xj,ti )Kh (xj  − Xj  ,ti ) , σ n,h,jj  (x) = i=0n−1 Δn i=0 Kh (xj − Xj,ti )Kh (xj  − Xj  ,ti ) 2 this estimator as the same properties as βn,h (x), on R2 , with a variance 2 −1 O((nh ) ). The covariance between the components of Zt defined by (8.14) is approximated by Cjj  (t; x) = βj (x)βj  (x) Cov(Bjt , Bj  t ). If this covariance is constant with respect to t, it has on [0, T ] the estimator T Zjs Zj  s Kh (Xjs − xj )Kh (Xj  s − xj  ) dZj  s  CT,h,jj  (x) = 0  T . Kh (Xjs − xj )Kh (Xj  s − xj  ) ds 0

For a function Cjj  (x) in class C s (IX ), the bias of its estimator has the order hs and its variance has the order (T h2 )−1 .
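For the spatial model (8.19), the multivariate estimators use a product of one-dimensional kernels over the d coordinates; the following sketch shows the drift component only, with an Epanechnikov kernel and a common bandwidth h as illustrative choices.

```python
import numpy as np

def spatial_drift(X, T, h, x):
    """Product-kernel estimate of the drift alpha(x) of the spatial model (8.19)
    at a point x of R^d, from observations X of shape (n+1, d); an illustrative
    sketch with an Epanechnikov kernel and a common bandwidth h."""
    n = X.shape[0] - 1
    dt = T / n
    Xt = X[:-1]                                       # X_{t_i}, shape (n, d)
    Y = np.diff(X, axis=0)                            # increments, shape (n, d)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    w = np.prod(K((np.asarray(x) - Xt) / h) / h, axis=1)   # product over coordinates
    return (w[:, None] * Y).sum(axis=0) / (dt * max(w.sum(), 1e-12))
```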


Propositions 8.10 and 8.4 generalize to R^d.

Proposition 8.15. Under Conditions 2.1, 2.2 and 3.1 for the kernel and for functions α and β of C^s(I_X) with derivatives in L^p(I_X), for a real p > 0 and for a subset I_X of R^d, the kernel estimator α̂_{n,h} has the L^p-risk
R_p(α̂_{n,h}, α) = O(T^{-s/(2s+d)}),  with h_T = O(T^{-1/(2s+d)}),
and β̂²_{n,h} has the L^p-risk
R_p(β̂²_{n,h}, β²) = O(n^{-s/(2s+d)}),  with h_n = O(n^{-1/(2s+d)}).

Proposition 8.16. Let α and β belong to F_{s,p} = {u ∈ C^s(I_X); u^{(s)} ∈ L^p(I_X)} for a real p > 1 and an integer s ≥ 2. The minimax risks for α and β² in F_{s,p} and estimators in the sets F̂_n of their estimators in F_{s,p} based on n observations of the diffusions at (t_i)_{i=1,...,n} are

inf_{α̂_n ∈ F̂_n} sup_{α ∈ F_{s,p}} R_p(α̂_n, α) = O(T^{-s/(2s+d)}),
inf_{β̂_n ∈ F̂_n} sup_{β ∈ F_{s,p}} R_p(β̂_n², β²) = O(n^{-s/(2s+d)}).

8.8

Estimation of discretely observed diffusions with jumps

Let α, β and γ be functions of class C 2 on a metric space (X,  · ), let B  be a centered martinbe the standard Brownian motion, let M = N − N gale associated  t to a point process N , with the predictable compensator  (t) = N 0 Y dΛ, and such that M is independent of B. The process Y is predictable and there exists a function g defined on [0, 1] such that sups∈[0,1] |T −1 YT s − g(s)| converges to zero in probability as T tends to infinity, the function g and the hazard function λ is supposed to be in class  ). C 2 (R); the function γ belongs to L2 (dN The process Xt solution of the stochastic differential equation dXt = α(Xt )dt + β(Xt )dBt + γ(Xt )dMt , t ∈ [0, T ],

(8.20)

has a discrete and a continuous part. A discretization of this equation into n sub-intervals of equal length Δn tending to zero as n tends to infinity gives the approximated equation for Xti+1 − Xti as Yi = Δn α(Xti ) + β(Xti )ΔBti + γ(Xti )ΔMti . Let εi = ΔBti = Bti+1 − Bti , with expectation zero and variance Δn conditionally on the σ-algebra Fti generated by the sample-paths of X up


ti+1 − N ti = to ti , ηi = Mti+1 − Mti , with expectation zero and variance N O(Δn ) conditionally on the σ-algebra Fti generated by the sample-paths of X; E{α(Xti )εi } = 0, E{β(Xti )ηi } = 0, and the martingales (Bt )t≥0 and (Mt )t≥0 have independent increments, by definition. The functionals of the  are estimated from the observation of the martingale M and the process N point process N , as in Chapter 4. The variables XTi are supposed to satisfy an ergodic property for the random stopping times of the counting process N , in addition to Conditions 6.3 and 8.1. Condition 8.2. There exists a mean density of the variables XTi defined as the limit  T  T −1 −1 fN (x) = lim T fXs (x) dN (s) = lim T fXs (x)g(s) dΛ(s). T →∞

T →∞

0

0

This condition is satisfied if the jump part of the process Xt satisfies the T  (s). The diffuproperty (8.9) and the limit is fN (x) = T −1 E 0 fXs (x) dN sion process Xt defined by (8.20) has the expectation  t α(Xs )ds μt = EX0 + E 0   t  α(x)fXs (x) ds dx = EX0 + t α(x)f (x) dx, = EX0 + 0

X

1

X

and the variance of the normalized variable T − 2 (XT − μT ) is finite if the integrals   T −1 2 α (Xs ) ds = α2 (x)f (x) dx + o(1), Sα = ET 0

Sβ = ET −1 Sγ = ET −1 1



T

0



T

0 −1

β 2 (Xs ) ds =



X

β 2 (x)f (x) dx + o(1),

X

 (s) = γ 2 (Xs ) dN

 X

γ 2 (x)fN (x) dx + o(1),

are finite. Then T 2 (T XT − μT ) converges weakly to a centered Gaussian variable with variance SX = Sα +Sβ +Sγ . Let SX (t) be the function defined 1 as above with integrals on [0, t], with t = sT . The process T 2 (T −1 XsT − μsT )0≤s≤1 is a sum of stochastic integrals with respect to the martingales B and M . 1

Proposition 8.17. The process WT,s = T 2 (T −1 XsT −μsT )0≤s≤1 is a martingale. If SX < ∞, WT,s converges weakly to a centered Gaussian process sum of a Brownian motion and a Gaussian process with independent increments, with variance function SX (s) on [0, 1].


Moments of discontinuous parts of a diffusion process with jumps require another ergodicity condition defining another mean density and it is satisfied under the mixing property (8.9). Conditionally on Fti , the variables Yi have the expectation Δn α(Xti ) and the variance  ti+1  (s) V ar(Yi |Xti ) = β 2 (Xti )Δn + γ 2 (Xs ) dN ti

 (ti ) + o(Δn ) = O(Δn ), = β (Xti )Δn + γ 2 (Xti ) ΔN  (ti ) = N  (ti+1 ) − N  (ti ) = O(Δn ). where ΔN A nonparametric estimator of the function α is the kernel estimator normalized by Δn as in the previous section n−1 −1 Δn Yi Kh (x − Xti ) α n,h (x) = i=0 , n−1 i=0 Kh (x − Xti ) 2

n,h (x) is approximated by for x in Xn,h . The expectation α h2 m2K {(f α)(2) (x) − α(x)f (2) (x)} + o(h2 ), 2 n−1 where f (x) = limn→∞ n−1 i=0 fXti (x). The variance of α n,h (x) is a O((T h)−1 ) αn,h (x) = α(x) +

vα,n,h (x) = (T h)−1 {σα2 (x) + o(1)}, σα2 (x) = κ2 f −1 (x)Δ−1 n V ar(Yt | Xt = x) = κ2 f −1 (x){β 2 (x) + γ 2 (x)f −1 (x)fN (x)}, 1

and its covariances tend to zero. The process (T h) 2 ( αn,h − α) has the asymptotic variance κ2 σα2 (x), at x.  The discrete part of X is X d (t) = s≤t γ(Xs )ΔNs and its continuous   t t t s , with variations part is X c (t) = 0 α(Xs ) ds + 0 β(Xs ) dBs − 0 γ(Xs ) dN on (ti , ti+1 ) ΔXic = α(Xti ) Δn + β(Xti ) ΔBti − γ(Xti ) Δn Y (ti )λ(ti ) = Op (Δn ). t Then the sum of its jumps converges to 0 Eγ(Xs ) g(s)dΛs . Let (Ti )1≤i≤N (T ) be the jump times of the process N . The jump sizes ΔX d (Ti ) = γ(XTi ) yield a consistent estimator of γ(x), for x in Xn,h  d 1≤i≤N (T ) ΔX (Ti )Kh (x − XTi )  γ n,h (x) = 1≤i≤N (T ) Kh (x − XTi ) T γ(Xt )Kh (x − Xt ) dN (t) . = 0 T K (x − X ) dN (t) h t 0


The expectation of  γn,h (x) is approximated by the ratio of the means of the numerator and the denominator. For the numerator  T h2 γ(Xs )Kh (x−Xs ) dNs = (γfN )(x)+ m2K {(γfN )(x)}(2) +o(h2 ), ET −1 2 0 and, for the denominator  T h2 (2) −1 ET Kh (x − Xs ) dNs = fN (x) + m2K fN (x) + o(h2 ). 2 0 The bias of γ n,h (x) is then bγ,n,h(x) =

h2 (2) m2K {fN (x)}−1 [{γ(x)fN (x)}(2) − γ(x)fN (x)] + o(h2 ), 2

n,h (x) is deduced also denoted bγ,n,h (x) = h2 bγ + o(h2 ). The variance of γ from (3.2) with the variance of the numerator  T γ 2 (Xs )Kh2 (x − Xs ) dNs T −2 E 0

= (T h)−1 {κ2 fN (x) + κ22

h2 (2) f (x)} + o(hT −1 ), 2 N

the variance of the denominator  T −2 Kh2 (x − Xs ) dNs T E 0

= (T h)−1 {κ2 γ 2 (x)fN (x) + h2 κ22 (γ 2 fN )(2) (x)} + o(T −1 h), and their covariance  T −2 T E 0 −1

= (T h)

0

T

γ(Xs )Kh (x − Xs )Kh (x − Xt ) dNs dNt

{κ2 γ(x)fN (x) + h2 κ22 (γfN )(2) (x)} + o(hT −1 ),

therefore vγ,n,h (x) = T −1 hvγ (x) + o(hT −1 ) with (2)

vγ (x) = κ22 {fN (x)}−1 {(γ 2 fN )(x) − γ 2 (x)fN (x)}. Let c = limT →∞ h3 T . Proposition 8.18. Under Conditions 2.2 and 3.1 with functions γ and 1 1 γn − γ − c 2 bγ ) converges weakly to fN in class C 2 , the process (T h−1 ) 2 ( a centered Gaussian process with variance function vγ (x) and covariances zero.


For the estimation of the variance function β of model (8.20), let Zi = Yi − Δn α n,h (Xti ) − γ n,h (Xti )ηi = Δn (α − α n,h )(Xti ) + β(Xti )εi + (γ −  γn,h )(Xti )ηi , its conditional expectation E(Zi | Xti = x) = Δn (α − αn,h )(Xti ) tends to zero and its conditional variance satisfies 2 4 −1 Δ−1 ) + o((T h)−1 ). n V ar{Zi | Xti } = β (Xti ) + o(h ) + o((nh)

An estimator of the function β is deduced for x in Xn,h  −1 2 1≤i≤n Δn Zi Kh (x − Xti ) 2 n . βn,h (x) = i=1 Kh (x − Xti ) The previous approximations of the estimator βn,h given in Proposition 8.9 are modified, its expectation is approximated by  −1 2 2 2 (x) = n−1 Δ−1 βn,h n EZi Kh (x − Xti )fN (x) + o(h ), 1≤i≤n 2 − β 2 = bβ,n,h + o(h2 ) with therefore its bias is E βn,h

bβ,n,h =

h2 m2K f −1 (x){(f β 2 )(2) (x) − β 2 (x)f (2) (x)} + o(h2 ). 2

Under Conditions 2.1 and 3.1 for the function β in class C 2 (X ), the variance 2 is vβ,n,h (x) = (nh)−1 {σβ2 (x) + o(1)}, with of the estimator βn,h −1 2 σβ2 (x) = κ2 fX (x)Δ−2 n V ar(Zt | Xt = x). t 2 The normalized variance Δ−2 n V ar(Zt | Xt = x) develops as 4 4 E{Δ2n ( αn,h − α)4 (x) + β 4 (x)Δ−2 n,h )4 (x)Δ−2 n ε + (γ − γ n η

+ O(h4 ) + O((nh)−1 ) + O(hT −1 ), where the Burkholder–Davis–Gundy inequality implies that the order of Eηi4 is a O((Eηi2 )2 ) = O(Δ2n ). Then, from the expression of the moments of the variable ε 4 4 σβ2 (x) = β 4 (x)(Δ−2 n Eε − 1) + o(1) = 2β (x) + o(1). 2 is therefore written vβ,n,h (x) = (nh)−1 vβ (x){1+o(1)}, The variance of βn,h 1 1 −1 it is a O((nh) ) and the process (nh) 2 (βn −β−(nh5) 2 bβ ) converges weakly

to a centered Gaussian process with variance function vβ and covariances zero.
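The estimator γ̂_{n,h} built from the observed jump sizes admits a direct implementation; the sketch below assumes the jump times of N and the corresponding values X_{T_i} and jump sizes ΔX^d(T_i) have been extracted from the path, and uses an Epanechnikov kernel as an illustrative choice.

```python
import numpy as np

def jump_coefficient(jump_sizes, X_at_jumps, h, xgrid):
    """Kernel estimator of gamma(x) in model (8.20), smoothing the observed
    jump sizes Delta X^d(T_i) = gamma(X_{T_i}) against the values X_{T_i} of
    the process at the jump times of N; an illustrative sketch."""
    jump_sizes = np.asarray(jump_sizes)
    X_at_jumps = np.asarray(X_at_jumps)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    gamma_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - X_at_jumps) / h) / h              # K_h(x - X_{T_i})
        gamma_hat[j] = (w @ jump_sizes) / max(w.sum(), 1e-12)
    return gamma_hat
```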


8.9

Continuous estimation for diffusions with jumps

In model (8.20), the estimator α T,hT of Section 8.5 is unchanged and new estimators of the functions β and γ must be defined from the continuous observation of the sample path of X. The discrete part of X is also written t Xtd = 0 γ(Xs )dNs and the point process N is rescaled as Nt = NT s , with t in [0, T ] and s in [0, 1]. Let NT (s) = T −1 NT s ,

XT (s) = T −1 XT s , t ∈ [0, T ], s ∈ [0, 1].

 T (t) = T −1 t YT (s)λ(s) ds The predictable compensator of NT is written N 0 on [0, 1] and it is assumed to converge uniformly on [0, 1] to its expectation t T (t) = g(s)λ(s) ds, in probability. Then XTd (t) converges uniformly EN 0 t in probability to 0 Eγ(XT (s)) g(s)dΛ(s). The continuous part of X is dXtc = α(Xt ) dt + β(Xt ) dBt − γ(Xt )Yt λt dt. A consistent estimator of γ(x), for x in IX,T,h T γ T,h (x) = 0 T =

0 T 0

Kh (x − Xs ) dX d (s) Kh (x − Xs ) dN (s)

,

Kh (x − Xs )γ(Xs ) dN (s) , T K (x − X ) dN (s) h s 0

it is identical to the estimator previously defined for the discrete diffusion process. Its moments calculated in the continuous model (8.20) are iden1 1 2 γ tical to those of Section 8.8 then the process (T h−1 T,hT − γ − c 2 bγ ) T ) ( converges weakly to a centered Gaussian process with variance function vγ and covariances zero. The variance function β 2 (Xt ) is now estimated by smoothing the squared variations of the process  t  t α T,h (Xs ) ds − γT,h (Xs )dMs  (8.21) Z t = Xt − X0 − 0 0  t  t  t (α − α T,h )(Xs ) ds + β(Xs ) dBs + (γ − γ T,h )(Xs ) dMs . = 0

0

0

For every t in [0, T ], its first two conditional moments are  E(Zt | Ft ) = −

0

t

bα,T,h (Xs ) ds = O(h2 )


and

 V ar(Zt | Ft ) = V ar

0



t

α T,h (Xs )ds + E

 t s +E (γ − γ T,h )2 (Xs ) dN 0  = t β 2 (x)fXs (x) dx + o(1).

 0

t

β 2 (Xs )ds

X

The variance function β 2 (x) is then consistently estimated smoothing the process Zt2 T Kh (Xs − x) Zs dZs 2 . (8.22) βT,h (x) = 2 0  T 0 Kh (Xs − x) ds Under Conditions 2.1 and 3.1 for the function β in class C 2 (X ) and using the ergodicity property (2.14) for the limiting density f of the process (Xt )t∈[0,T ] , the expectation of the denominator of (8.22) is  T  T −1 −1 T E Kh (Xs − x) ds = T E fXs (x) ds 0

0

 T h2 (2) + m2K T −1 E fXs (x) ds + o(h2 ) 2 0 h2 = f (x) + m2K f (2) (x) + o(h2 ), 2 the expectation of the numerator is  T Kh (Xs − x)Zs dZs 2T −1 E 0

= 2T −1





T

E 0

= β 2 (x) f (x) +

X

Kh (u − x) β 2 (u) fXs (u) du + o(h4 )

h2 m2K {β 2 (x)f (x)}(2) + o(h2 ), 2 = h2 bβ + o(h2 ), with

and its bias is denoted bβ,T,h 1 bβ = m2K f −1 (x){(f β 2 )(2) (x) − β 2 (x)f (2) (x)}. 2 Under Conditions 2.1 and 3.1 for the function β inclass C 2 , the variance of the estimator βT,h is obtained from E(Zt2 | X) = β 2 (Xs ) ds, V ar(Zt2 | X) = O(t) and expanding  T −2 Kh2 (x − y)V ar(Zt2 | Xt = y)fXt (y) dy dt = O((hT )−1 ), ET 0

2 it is therefore written σβ,T,h = (hT )−1 vβ + o((hT )−1 ). Then the process 1 1 (T hT ) 2 (βT,h − β − (T h5T ) 2 bβ ) converges weakly to a centered Gaussian process with variance function vβ and covariances zero.


8.10

Transformations of a nonstationary Gaussian process

Consider the nonstationary processes Z = X ◦ Φ, where X is a stationary Gaussian process with covariance R(x, y) = E(Xx Xy ) and Φ is a monotone function C 1 ([0, 1]) with Φ(0) = 0 and Φ(1) = 1. The transform is expressed function of as Φ(x) = v −1 (1)v(x) with respect x  x to the integrated singularity the covariance r(x, x), v(x) = 0 ξ(u) du. Conversely, 0 ξ(u) du = cξ Φ(x) 1 with cξ = 0 ξ(u) du. A direct estimator of the regularity function ξ is  n (x) defined by (1.13) obtained by smoothing the estimator Φ  1  n (y) ξn,h (x) = Vn (1) Kh (x − y) dΦ  = 0

1

0

Kh (x − y) d vn (y).

1 The expectation of ξn,h (x) is ξn,h (x) = 0 Kh (x − y) dv(y) and the process 1  n − Φ)(y) is uniformly consistent, since Kh (x − y)d(Φ (ξn,h − ξn,h )(x) = 0

n

an integration by parts implies  1     Khn (s − y) d(Φn − Φ)(y) ≤ Φn − Φ 0

0

1

|dKhn (s − y)|

 n − Φ + sup |Khn | Φ   ≤ (sup |K| + |dK(z)|) h−1 n Φn − Φ,

which converges to zero in probability, by the convergence in probability of  n − Φ to zero. Φ √ x 1 vn − v) converges weakly to 2 0 v(y)dW (y) where W The process n 2 ( is a Gaussian process with expectation zero and covariances x ∧ y at (x, y),  x∧y 1 vn − v) is 2 0 v 2 (y) dy at x = y. then the covariance of the limit of n 2 ( The limiting variance of ξn,h (x) is  1

2  1 E Kh (x − y) d( vn − v)(y) = E Kh2 (x − y) dV ar( vn − v)(y) 0



1



1

0

+E Kh (x − y)Kh (x − u) dCov{( vn − v)(y), ( vn − v)(u) 0 0  1 Kh2 (x − y)v 2 (y) dy = O((nh)−1 ). = O n−1 0

The convergence rate of the process ξn,h is therefore (nhn ) 2 and the finite 1 dimensional distributions of (nhn ) 2 (ξn,h − ξn,h ) converge to those of a 1


Gaussian process with expectation zero, as normalized sums of the independent variables defined as the weighted quadratic variations of the increments 1 of Z. The covariances of (nhn ) 2 (ξn,h − ξn,h ) are zero except on the interval [−hn , hn ] where they are bounded, hence the covariance function converges to zero. The quadratic variations of ξn,h satisfy a Lipschitz property of moments E|(ξn,h − ξn,h )(x) − (ξn,h − ξn,h )(y)|2  1    −1  2 2 2 = 2n  {Kh (x − u) − Kh (y − u)}v (u) du, 0

O((nh3n )−1 |x

it is then a − y|2 ) for |x − y| ≤ 2hn . It follows that the 1 process (nhn ) 2 (ξn,h − ξn,h ) converges weakly to a continuous process with expectation zero and variance function 2v 2 and covariances zero. The singularity function of the spatial covariance of a Gaussian process Z is estimated by smoothing the estimator of the integrated spatial trans1 form of Z on [0, 1]3 , the convergence rate of the estimator is then (nh3 ) 2 . 8.11

Exercises

(1) Calculate the moments of the estimators for the continuous process (8.12) and write the necessary ergodic conditions for the convergences in this model.
(2) Calculate the bias and variance of the derivatives of the estimators of the functions α, β and γ in the stochastic differential equation model (8.20).
(3) Prove Proposition 8.17.

Chapter 9

Applications to Time Series

Let (X, ·) be a metric space and (Xt )t∈N be a time series defined on XN by its initial value X0 and a recursive equation Xt = m(Xt−p , . . . , Xt−1 ) + εt where m is a parametric or nonparametric function defined on Xp for some p > 1 and (εt )t is a sequence of independent noise variables such that E(εt | Xt−p , . . . , Xt−1 ) = 0 and V ar(εt | Xt−p , . . . , Xt−1 ) = σ 2 . The stationarity of a time series is a property of the joint distribution of consecutive observations. The weak stationarity is defined by a constant mean μ and a stationary covariance function ρs,t = Cov(Xs , Xt ) = Cov(X0 , Xt−s ) = ρt−s ,

for every s < t.

The series (Xt )t is strong stationary if the distributions of the sequences (Xt1 , . . . , Xtk ) and (Xt1 −s , . . . , Xtk −s ) are identical for every sequence (t1 , . . . , tk , s) in Nk+1 . The nonparametric estimation of the mean and the covariances is therefore useful for modelling the time series. The moving average processes are stationary, they are defined as linear combinations of past and present noise terms such as the MA(q) pro cess Xt = εt + qk=1 θk εt−k , with independent variables εj such that 2 Eεj = 0 and V arεj =  σ , for every integer j. The variance of Xt is q 2 2 2 and it is supposed to be finite. The covariance σq = σ k=1 θk + 1 of Xs and Xt such that 0 < t − s < q is Cov(Xs , Xt ) = σ 2

q   k=t−s

(t−s+q)∧q

θk +



 θk2 ,

k=t−s+1

it only depends on the difference t − s. The moving average processes with |θ| < 1 are reversible and the process X_t can be expressed as an auto-regressive process, sum of ε_t and an infinite combination of its past values. Generally, an AR process is not stationary.



In nonstationary series, the nonstationarity may be due to a smooth trend or regular and deterministic seasonal variations, to discontinuities or to continuous change-points. A transformation such as differencing a stochastic linear trend reduces the nonstationarity of the series; other classical transformations are the square root or power transformations for data with increasing variance. Periodic functions of the mean can be estimated after the identification of the period and a nonparametric estimator is proposed in Section 9.2. Change-points of nonparametric regressions in time or at thresholds of the series are stronger causes of non-regularity and several phases of the series must be considered separately, with estimation of their change-points. Their estimators are studied in Section 9.5.

9.1

Nonparametric estimation of the mean

The simplest nonparametric estimators for the mean of a stationary process are the moving average estimators

μ̂_{t,k} = (k + 1)^{-1} Σ_{i=0}^{k} X_{t−i},

for a lag k up to t. The transformed series is X_t − μ̂_{t,k} = k(k+1)^{-1} X_t − (k+1)^{-1} Σ_{i=1}^{k} X_{t−i} and it equals (X_t − X_{t−1})/2 for k = 2. A polynomial trend is estimated by minimizing the empirical mean squared error of the model, then the transformed series X_t − μ̂_{t,k} is expressed by means of moving averages of higher order, according to the degree of the polynomial model. Consider the auto-regressive process with nonparametric mean

Xt = μt + αXt−1 + εt , t ∈ N,

(9.1)

with a sequence of independent errors (ε_t)_t with expectation zero and variance σ². With α ≠ 1, the expectation of X_t may be written μ_t = (1 − α)m_t, with an unknown function m_t, and the solution X_t of Equation (9.1) is

X_t = Σ_{k=0}^{t−1} μ_{t−k} α^k + α^t X_0 + Σ_{k=1}^{t} α^k ε_{t−k}.

With a mean and an initial value zero, the covariance of Xs and Xt is s∧t ρs,t = σ 2 k=1 α2k and it is not stationary. The asymptotic behavior of the process X changes as the mean crosses the threshold value 1. For α = 1, the model is the classical nonparametric regression model.



The parameters of the auto-regressive series AR(1), X_t = μ + αX_{t−1} + ε_t with α ≠ 1, are estimated by

μ̂_t = (1 − α̂_t)X̄_t + t^{-1}(α̂_t X_t − X_0),  m̂_t = (1 − α̂_t)^{-1} μ̂_t,
α̂_t = Σ_{k=1}^{t} (X_{k−1} − m̂_t)(X_k − m̂_t) / Σ_{k=1}^{t} (X_{k−1} − m̂_t)²,   (9.2)
σ̂_t² = t^{-1} Σ_{k=1}^{t} {X_k − (1 − α̂_t)X̄_t − α̂_t X_{k−1}}².

k=1

¯ t + Op (t−1 ) and m ¯ t . For α = 1, the paramet )X  t =X For |α|= 1, μ t = (1 − α trization μ = (1 − α)m is meaningless and the mean is simply estimated by t μ t = t−1 k=1 (Xk − Xk−1 ). The estimators are consistent and asymptotically Gaussian, with different normalization sequences for the three domains p of α (α < 1, α = 1, α > 1). In the AR(p) model Xt = μt + j=1 αj Xt−j +εt , similar estimators are defined for the regression parameters αj t  t )(Xk − m  t) k=j (Xk−j − m α j,t,h = , t 2  t) k=j (Xk−j − m and the variance is estimated by the mean squared estimation error. The estimators m  t,h,k and μ t,h,k are similar to estimators defined for a nonparametric regression function and their properties are the same. In model (9.1) with a nonparametric mean function μt = (1 − α)mt , Xt − mt = α(Xt−1 − mt ) + εt , then the estimator (9.2) of α is modified by ¯ k by a local moving average mean or by a local mean replacing m k = X t j=0 Kh (j − k)Xj m  t,h,k = t , j=1 Kh (j − k) for every k, and the estimator of α becomes t (Xk−1 − m  k )(Xk − m  k) α t = k=1t . 2 (X − m  ) k−1 k k=1 Finally, the function μt is estimated by (1 − α t )m  t or by smoothing Xt − α t Xt−1 μ t,h,k = t−1

t 

Kh (j − k)(Xj − α j Xj−1 )

j=0

and the estimator of σ 2 is still defined by (9.2). The asymptotic distributions are modified as a consequence of the asymptotic behavior of m  k,



with mean tending to mk and variance converging to a finite limit. As h tends to zero, the weak convergence to centered Gaussian variables of 1  t − mt ), when |α| < 1, and tα−t (m  t − mt ), when |α| > 1, follows from t 2 (m martingale properties of the time series which imply its ergodicity and a mixing property (Appendix D). If |α| = 1 t (Xk−1 − m  k−1 )((1 − α)(mk − m  k ) + εk ) α t − α = k=1 t 2  k) k=1 (Xk−1 − m it is therefore approximated in the same way as in model AR(1) and it converges weakly with the same rate as in this model. When Equation (9.1) is defined by a regular parametrization of the expectation μt = (1 − α)mθ (t) for |α| = 1, the minimization of squared t t ¯t − α t )X t Xk−1 }2 estimation error  ε2(t) 2t = k=1 ε2k = k=1 {Xk − (1 − α yields estimators of the parameters α and θ for identically distributed error variables εk . If the variance of εk is σk2 (θ), maximum likelihood estima tors minimize k σk−1 ε2k . The robustness and the bias of the estimators in false models have been studied for generalized exponential distributions, the same methods are used in models for time series. In a nonparametric regression model Xt = m(Xt−1 ) + εt

(9.3)

with an initial random value X0 and with independent and identically distributed errors εt with expectation zero and variance σ 2 , let F be the continuous distribution function of the variables εt , and f its density. The nonparametric estimator of the function m is still t k=1 Kh (x − Xk−1 )Xk . m  t,h (x) =  t k=1 Kh (x − Xk−1 ) It is uniformly consistent under the ergodicity condition   t 1 ϕ(Xk , Xk−1 ) → ϕ(x, y)F (dx − m(y)) dπ(y), t k=1

with the invariant measure π of the process and for every continuous and bounded function ϕ on R2 . Conditions on the function m and the independence of the error variables εi ensure the ergodicity, then the process 1  t,h − m) converges weakly to a continuous centered Gaussian pro(th) 2 (m cess with covariances zero and variance κ2 f −1 (x)V ar{Xk | Xk−1 }, where V ar{Xk | Xk−1 } = σ 2 . In model (9.3) with a functional variance, the results of Section 3.7 apply.
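A sketch of the kernel autoregression estimator m̂_{t,h} for model (9.3) is given below; the Epanechnikov kernel and the fixed bandwidth h are illustrative choices.

```python
import numpy as np

def autoregression_fit(X, h, xgrid):
    """Kernel estimator m_hat_{t,h}(x) of the autoregression function in model
    (9.3), smoothing X_k against X_{k-1}; Epanechnikov kernel and fixed
    bandwidth h are illustrative choices."""
    lagged, current = X[:-1], X[1:]                  # pairs (X_{k-1}, X_k)
    K = lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0)
    m_hat = np.empty(len(xgrid))
    for j, x in enumerate(xgrid):
        w = K((x - lagged) / h) / h
        m_hat[j] = (w @ current) / max(w.sum(), 1e-12)
    return m_hat
```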



The observation of series in several groups or in distinct time intervals may introduce a group or time effect similar to population effect in regression samples and sub-regression functions may necessary as in Section 5.6.

9.2

Periodic models for time series

Let (Xt )t∈N be a periodic auto-regressive time series defined by X0 and Xt = ψ(t) +

p 

αp Xt−p + εt

(9.4)

i=1

where |α| < 1 and ψ is a periodic function defined in N with period τ , ψ(t) = ψ(t + kτ ), for every integers t and k. Let α = (α1 , . . . , αp ) and X(p),t = (Xt−1 , . . . , Xt−p ). As ψ(t) = E(Xt − αT X(p),t ), the value of the function ψ at t is estimated by an empirical mean over the periods, with a fixed parameter value α. Assuming that K periods are observed and T = Kτ values of the series are observed, the function ψ is estimated as a mean over the K periods of the remainder term of the auto-regressive process. For every t in {1, . . . , τ } K−1 1  (Xt+kτ − αT X(p),t+kτ ), ψK,α (t) = K

(9.5)

k=0

and the parameter vector is estimated by minimizing the mean squared error of the model T 1  {Xt − ψK,α (t) − αT X(p),t }2 . lK (α) = T t=1 The components of the first two derivatives of lK are   T  K,α 2 ∂ ψ T (t) + X(p),t {Xt − ψK,α (t) − α X(p),t } l˙T,K,t = − T t=1 ∂α  K−1  T 2  1  T  = {Xt − ψK,α (t) − α X(p),t } X(p),t+kτ − X(p),t , T t=1 K k=0 ⊗2  K−1 T   2 1 ¨ (X(p),t+kτ − X(p),t ) . lT,K,t = T t=1 K k=0

The vector α is estimated by α T = arg minα∈]−1,1[d lT,K,t (α). For the first 1 order derivative, T 2 l˙T,K,t (α0 ) converges weakly to a centered limiting distribution and the second order derivative ¨lT,K,t converges in probability to



a positive definite matrix E ¨lT,K,t which does not depend on α. Then the 1 1 −1 αT,K,t − α0 ) = ¨lT,K,t T 2 l˙T,K,t (α0 ) + o(1). The estimator of α satisfies T 2 ( 1 estimator α T is consistent and its weak convergence rate is T 2 , if all components of the vector α have a norm smaller than 1. The function ψ is then consistently estimated by ψK = ψK, αT and, for every t in {1, . . . , τ }, the 1 weak convergence rate of the estimator ψK (t) is K 2 . The true period of the function ψ was supposed to be known. With an T depend on the parameter τ unknown period, the estimators ψK and α αT,τ ). and it is consistently estimated by τT = arg minτ ≤T l[T /τ ] ( If the function ψ is parametric, its parameters vector θ is estimated by T minimizing the mean squared error between ψK and ψθ , T1 t=1 {ψK (t) − ψθ (t)}2 . As a minimum distance estimator, the estimator θK is consistent 1 and T 2 (θT − θ) converges weakly to a centered Gaussian variable. The trigonometric series with independent noise are a combination of periodic sines and cosines functions Xt = =

r  j=1 r 

M {cos(wj t + Φt ) + sin(wj t + Φt )} + εt {Aj cos(2πwj t) − Bj sin(2πwj t)} + εt ,

j=1

where (w_j)_{j=1,...,r} are frequencies w_j = j t^{-1}, A_j = M cos Φ_t and B_j = M sin Φ_t such that A_j² + B_j² = M² is the magnitude of the series, for j = 1, . . . , r, and Φ_t its phases. The estimators of the parameters are defined from the Fourier series, for j = 1, . . . , r,

Â_{tj} = 2t^{-1} Σ_{k=1}^{t} X_k cos(2πkj/t),
B̂_{tj} = 2t^{-1} Σ_{k=1}^{t} X_k sin(2πkj/t),
M̂_t = r^{-1} Σ_{j=1}^{r} (Â_{tj}² + B̂_{tj}²)^{1/2}.

j=1
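These Fourier coefficient estimators can be computed directly from the observed series; in the sketch below the sample length plays the role of t and r, the number of retained frequencies, is supplied by the user.

```python
import numpy as np

def fourier_coefficients(X, r):
    """Estimators of the trigonometric-series parameters from the Fourier
    series of the observations X_1, ..., X_t; r is the number of retained
    frequencies w_j = j / t."""
    t = len(X)
    k = np.arange(1, t + 1)
    A = np.array([2.0 / t * np.sum(X * np.cos(2 * np.pi * k * j / t))
                  for j in range(1, r + 1)])
    B = np.array([2.0 / t * np.sum(X * np.sin(2 * np.pi * k * j / t))
                  for j in range(1, r + 1)])
    M = np.mean(np.sqrt(A**2 + B**2))                # magnitude estimator
    return A, B, M
```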

9.3

Nonparametric estimation of the covariance function

The classical estimator for estimating the covariances function in a stationary model is similar to the moving average for the mean, with a lag k ≥ 1



between the variables X_i and X_{i−k}, for every i ≥ 1,

ρ̂_{k,t} = (t − k)^{-1} Σ_{i=k+1}^{t} (X_i − X̄_t)(X_{i−k} − X̄_t).

In the auto-regressive model AR(1) with independent errors with expectation zero and variance σ 2 , for k ≥ 1, the variable Xk is expressed from the initial value as k k−1   αk−j εj = αj εk−j . Xk − m = αk (X0 − m) + Sk,α , where Sk,α = j=1

j=0

Let B be the standard Brownian motion, if |α| < 1 the process S[ns],α 1 defined up to the integer part of ns converges weakly to σB{(1 − α2 ) 2 }−1 . 1 If α = 1, the process n− 2 S[ns],1 converges weakly to the process σB, and 1 if |α| > 1 the process α−[ns] S[ns],α converges weakly to σB{(α2 − 1) 2 }−1 . The independence of the error variables εj implies E(Xk − m)(Xk+s − m) = α2k+s V arX0 + Cov(Sk,α , Sk+s,α ), (9.6) ⎞2 ⎛ k k   αk−j εj ⎠ = σ 2 α2(k−j) , Cov(Sk,α , Sk+s,α ) = E ⎝ j=1

j=1

2k+s

V arX0 + V arSk,α and the covariance so E(Xk − m)(Xk+s − m) = α function of the series is not stationary. The estimator (9.2) of the variance σ 2 is defined as the empirical variance of the estimator of the noise variables which are identically distributed and independent. In the same way, the covariance is estimated by t  1 {Xi − m  t −α t (Xi−1 − α t )}{Xi−k − m  t −α t (Xi−k−1 − α t )}, ρt,k = t−k i=k+1

the estimators σ t2 and ρt,k are consistent (Pons, 2008). The estimators are defined in the same way in an auto-regressive model of order p, with a scalar Tt Xi−k−1 for p-dimensional variables Xi−1 and products α Tt Xi−1 and α Xi−k−1 . In model (9.1), the expansion (9.6) of the variables centered by the mean function is not modified and the covariance E(Xk −mk )(Xk+s −mk+s ) has the same expression depending only on the variances of the initial value and Sk,α , and on α and the rank of the observations. In auto-regressive series with deterministic models of the expectation, the covariance estimator is modified by the corresponding estimator of the mean. In model (9.3), the covariance estimator becomes t  1 {Xi − m  t,h (Xi )}{Xi−k − m  t,h (Xi−k )}, ρt,k = t−k i=k+1

and the estimators are consistent.
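A sketch of the classical lag-k covariance estimator from the beginning of this section, centring by the empirical mean, is given below; it applies to a weakly stationary series, the residual-based versions above being obtained by replacing X_i by the estimated residuals.

```python
import numpy as np

def autocovariances(X, max_lag):
    """Lag-k covariance estimators rho_hat_{k,t} for k = 1, ..., max_lag,
    centring by the empirical mean of the series; a sketch for a weakly
    stationary series."""
    t = len(X)
    Xbar = X.mean()
    return np.array([np.sum((X[k:] - Xbar) * (X[:t - k] - Xbar)) / (t - k)
                     for k in range(1, max_lag + 1)])
```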


9.4


Nonparametric transformations for stationarity

In the nonparametric regression model (9.3), Xt = m(Xt−1 ) + εt with an initial random value X0 and with independent and identically distributed errors εt with expectation zero and variance σ 2 , the covariance between Xk and Xk+l is ρt,k,l = E{Xk m∗l (Xk )} − EXk Em∗l (Xk ), with E{Xk m(Xk+l−1 )} = E{Xk m∗l (Xk )}, where m∗l is the composition of l functions m. The nonstationarity of ρt,k,l does not allow to estimate it using empirical means and it is necessary to remove the functional expectation μt before studying the covariance of the series. The centered series  t (Xt−1 ) = m(Xt−1 ) − m  t (Xt−1 ) + εt Yt = Xt − m has a conditional expectation equal to minus the bias of the estimator m t E(Yt | Xt−1 ) = −

 h2 (2) −1 (Xt−1 ) (mfXt−1 )(2) − mf Xt−1 (Xt−1 )m2K fX t−1 2

and it is negligible as t tends to infinity and h to zero. The time series Yt is then asymptotically equivalent to a random walk with a variance parameter σ 2 . The main transformations for nonstationary series (9.3) with a constant variance is therefore its centering. With a varying variance function Eε2i = σi2 = V ar(Xi | Xi−1 ), the estimator of the mean function of the series has to be weighted by the inverse of the square root of the nonparametric estimator of the variance at Xi , where t {Yi − m  t,h (Xi )}2 Kδ (x − Xi ) 2 , σ t,h,δ (x) = i=1 n i=1 Kδ (x − Xi ) as in Section 3.7, the estimator of the regression function is t w t,h,δ (Xi )Yi Kh (x − Xi ) m  w,t,h (x) = i=1 n t,h,δ (Xi )Kh (x − Xi ) i=1 w and the stationary series for (9.3) is Yi = Xi − m  w,t,h (Xi−1 ). A model for non independent stationary terms εt can then be detailed.


9.5


Change-points in time series

A change-point in a time series may occur at an unknown time $\tau$ or at an unknown threshold $\eta$ of the series. In both cases, $X_t$ splits into two processes at the unknown threshold
$$X_{1,t} = X_t I_t \quad\text{and}\quad X_{2,t} = X_t(1 - I_t),$$
with $I_t = 1\{X_t \le \eta\}$ for a model with a change-point at a threshold of the series, and $I_t = 1\{t \le \tau\}$ in a model with a time threshold. The $p$-dimensional parameter vector $\alpha$ is replaced by two vectors $\alpha$ and $\beta$. Both change-point models are written equivalently, with a time change-point or a change-point at a threshold of the series
$$\tau_\eta = \sup\{t;\, X_t \le \eta\}, \qquad \eta_\tau = \inf\{x;\, (X_s)_{s\in[0,\tau]} \le x\}. \tag{9.7}$$
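For illustration, the splitting of an observed series according to the indicator $I_t$ can be written as follows (a hypothetical helper, not from the text, taking either the threshold $\eta$ or the time $\tau$).

```python
import numpy as np

def split_series(x, eta=None, tau=None):
    """Split X_t into X_{1,t} = X_t * I_t and X_{2,t} = X_t * (1 - I_t), with
    I_t = 1{X_t <= eta} (threshold model) or I_t = 1{t <= tau} (time model)."""
    x = np.asarray(x, dtype=float)
    if eta is not None:
        indicator = (x <= eta).astype(float)
    else:
        indicator = (np.arange(len(x)) <= tau).astype(float)
    return x * indicator, x * (1.0 - indicator)
```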

With a change-point, the auto-regressive model AR($p$) is modified as
$$X_t = \mu_1 I_t + \mu_2(1 - I_t) + \alpha^T X_{1,t} + \beta^T X_{2,t} + \varepsilon_t, \tag{9.8}$$
where $X_t = \mu + \alpha^T X_{1,t} + \beta^T X_{2,t} + \varepsilon_t$ with $X_{1,t} = X_t I_t$ and $X_{2,t} = X_t(1 - I_t)$ for a model without change-point in the mean. Considering first that the change-point is known, the parameters are $\mu$, or $\mu_1$ and $\mu_2$, $\alpha$, $\beta$ and $\sigma^2$. As $t$ tends to infinity, a change-point at an integer time $\tau$ is denoted $[\gamma t]$ and the sums of variables up to $\tau$ increase with $t$. For the auto-regressive process of order 1 with a change-point in time, this equation yields a two-phase sample path
$$X_{t,\alpha} = m_\alpha + \alpha^{t}(X_0 - m_\alpha) + \sum_{k=1}^{t}\alpha^{t-k}\varepsilon_k, \quad t \le \tau,$$
$$X_{t,\beta} = m_\beta + \beta^{t-\tau}(X_{\tau,\alpha} - m_\alpha) + \sum_{k=1}^{t-\tau}\beta^{t-\tau-k}\varepsilon_{k+\tau}, \quad t > \tau,$$
where $m_\alpha = \mu(1-\alpha)^{-1}$ and $m_\beta = \mu(1-\beta)^{-1}$. With $\alpha = 1$, $X_{t,\alpha} = X_0 + (t-1)\mu + \sum_{k=1}^{t}\varepsilon_k$ and, with $\beta = 1$ and $t > \tau$, $X_{t,\beta} = X_{\tau,\alpha} + (t-\tau-1)\mu + \sum_{k=1}^{t-\tau}\varepsilon_{k+\tau}$. Let $\theta$ be the vector of parameters $\alpha$, $\beta$, $m_\alpha$, $m_\beta$, $\gamma$. The time $\tau$ corresponds either to a change-point of the series or to a stopping time defined by (9.7) for a change-point at a threshold of the process $X$, and the indicator $I_k$ relative to an unknown threshold $\tau_\eta$ of $X_{t-k}$ is denoted $I_{k,\tau}$. The coefficients are estimated by
$$\widehat\alpha_{t,\tau} = \frac{\sum_{k=1}^{t}(I_{k-1,\tau}X_{k-1} - \widehat m_{\alpha,\tau})(I_{k,\tau}X_k - \widehat m_{\alpha,\tau})}{\sum_{k=1}^{\tau}(I_{k-1,\tau}X_{k-1} - \widehat m_{\alpha,\tau})^2},$$
$$\widehat\beta_{t,\tau} = \frac{\sum_{k=1}^{t}((1 - I_{k-1,\tau})X_{k-1} - \widehat m_{\beta,t})((1 - I_{k,\tau})X_k - \widehat m_{\beta,t})}{\sum_{k=1}^{t}((1 - I_{k-1,\tau})X_{k-1} - \widehat m_{\beta,t})^2},$$


where the estimators of $m_\alpha = (1-\alpha)^{-1}\mu$ and $m_\beta = (1-\beta)^{-1}\mu$ are equivalent to
$$\widehat m_{\alpha,\tau} = \bar X_\tau, \qquad \widehat m_{\beta,\tau+k} = \bar X_{\tau,k} = k^{-1}\sum_{j=1}^{k} X_{\tau+j}, \quad\text{for } t = \tau + k \ge \tau,$$
and $\widehat\mu_t = \bar X_\tau - \widehat\alpha_{t,\tau}\bar X_\tau - \widehat\beta_{t,\tau}\bar X_{\tau,t}$ if $|\alpha|$ and $|\beta|$ differ from 1. The estimator of the change-point parameter minimizes with respect to $\tau$ the mean squared error of estimation. For $t > \tau$, we consider the estimation errors $\widehat\varepsilon_{\tau,k} = X_k - \widehat m_{\widehat\alpha_{t,\tau},\tau} - \widehat\alpha_{t,\tau}X_{k-1}$ if $k \le \tau$ and $\widehat\varepsilon_{t,k} = X_k - \widehat m_{\widehat\beta_{t,\tau},t} - \widehat\beta_{t,\tau}X_{k-1}$ if $k > \tau$. The variance $\sigma^2$ and the change-point parameter are estimated by
$$\widehat\sigma^2_t(\theta) = \tau^{-1}\sum_{k=1}^{\tau}\widehat\varepsilon^{\,2}_{\tau,k} + (t-\tau)^{-1}\sum_{k=\tau+1}^{t}\widehat\varepsilon^{\,2}_{t,k},$$
$$\widehat\gamma_t = \arg\min_{\tau\in[0,t]} \widehat\sigma^2_t(\tau).$$
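A schematic implementation of this minimization is sketched below, assuming a time change-point and fitting each phase of the AR(1) model by ordinary least squares; this differs in detail from the moment estimators displayed above, and the guard `min_seg` is an illustrative choice, not part of the text.

```python
import numpy as np

def fit_ar1_residuals(x):
    """Least squares fit of X_k = mu + alpha X_{k-1} + eps on a segment; returns residuals."""
    y, z = x[1:], x[:-1]
    A = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def changepoint_ar1(x, min_seg=10):
    """Select the time change-point tau by minimizing
    sigma2(tau) = tau^{-1} * SSE_1 + (t - tau)^{-1} * SSE_2 over candidate values."""
    x = np.asarray(x, dtype=float)
    t = len(x) - 1
    best_tau, best_crit = None, np.inf
    for tau in range(min_seg, t - min_seg):
        e1 = fit_ar1_residuals(x[:tau + 1])        # first phase: X_0, ..., X_tau
        e2 = fit_ar1_residuals(x[tau:])            # second phase: X_tau, ..., X_t
        crit = np.sum(e1 ** 2) / tau + np.sum(e2 ** 2) / (t - tau)
        if crit < best_crit:
            best_tau, best_crit = tau, crit
    return best_tau, best_crit
```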

The change-point estimator is approximated by
$$\widehat\gamma_t = \arg\min_{\tau\in[0,t]} \frac{1}{t}\Big\{\frac{1}{\tau}\sum_{k=\tau_0+1}^{\tau}(X_k - \mu_\alpha - \alpha X_{k-1})^2 - \frac{1}{t-\tau}\sum_{k=\tau_0+1}^{\tau}(X_k - \mu_\beta - \beta X_{k-1})^2 - \gamma_0\Big\} + o_p(1).$$

The difference $\widehat\gamma_t - \gamma_0$ is independent of the estimators of the parameter vector $\xi_t$ of the regression, and all estimators converge weakly to limits bounded in probability. Consider the model (9.8) of order 1 with a change-point at a threshold $\eta$ of the series, with the equivalence (9.7) between the chronological change-point model and the model for a series crossing the threshold $\eta$ at consecutive random stopping times $\tau_1 = \inf\{k \ge 0 : I_k = 0\}$ and $\tau_j = \inf\{k > \tau_{j-1} : I_k = 0\}$, $j \ge 1$. The series have similar asymptotic behavior starting from the first value of the series which crosses the threshold $\eta$ at time $s_j = \inf\{k > \tau_{j-1} : I_k = 1\}$ after $\tau_{j-1}$. The estimators of the parameters in the first phase of the model are restricted to the set of random intervals $[s_j, \tau_j]$ where $X_t$ stays below $\eta$; for the second phase the observations are restricted to the set of random intervals $]\tau_{j-1}, s_j[$ where $X$ remains above $\eta$.


The times $\tau_j$ are stopping times of the series defined for $t > s_{j-1}$ by
$$X_t = \begin{cases} m_\alpha + S_{s_{j-1},\,t-s_{j-1},\,\alpha} + o_p(1), & \text{if } |\alpha| < 1,\\ X_{s_{j-1}} + (t - s_{j-1} - 1)\mu + S_{s_{j-1},\,t-s_{j-1},\,1}, & \text{if } \alpha = 1,\\ m_\alpha + \alpha^{t-s_{j-1}}(X_{s_{j-1}-1} - m_\beta) + S_{s_{j-1},\,t-s_{j-1},\,\alpha}, & \text{if } |\alpha| > 1, \end{cases}$$
and the $s_j$ are stopping times defined for $t > \tau_{j-1}$ by
$$X_t = \begin{cases} m_\beta + S_{\tau_{j-1},\,t-\tau_{j-1},\,\beta} + o_p(1), & \text{if } |\beta| < 1,\\ X_{\tau_{j-1}} + (t - \tau_{j-1} - 1)\mu + S_{\tau_{j-1},\,t-\tau_{j-1},\,1}, & \text{if } \beta = 1,\\ m_\beta + \beta^{t-\tau_{j-1}}(X_{\tau_{j-1}} - m_\alpha) + S_{\tau_{j-1},\,t-\tau_{j-1},\,\beta}, & \text{if } |\beta| > 1. \end{cases}$$
The sequences $t^{-1}\tau_j$ and $t^{-1}s_j$ converge to the corresponding stopping times of the limit of $X_t$ as $t$ tends to infinity. The partial sums are therefore defined as sums over indices belonging to countable unions of intervals $[s_j, \tau_j]$ and $]\tau_j, s_{j+1}[$, respectively, for the two phases of the model. Their limits are deduced from integrals on the corresponding sub-intervals, instead of sums of the errors on the interval $(\tau, \tau_0)$. The estimators of the parameters are still expressed through their partial sums. The results generalize to processes of order $p$ with a possible change-point in each of the $p$ components. The estimators and their weak convergences are detailed in Pons (2009).

Change-points in nonparametric models for time series are estimated by replacing the estimators of the parameters by estimators of the functions of the models, and only the expression of the errors $\varepsilon_k$ determines their estimator. With a change-point at an unknown time $\tau_0$ in the nonparametric model (9.3), the model is written
$$X_t = I_{\tau,t}\, m_1(X_{t-1}) + (1 - I_{\tau,t})\, m_2(X_{t-1}) + \sigma\varepsilon_t.$$
For every $x$ of $I_X$, the two regression functions are estimated using a kernel estimator with the same bandwidth $h$ for $m_1$ and $m_2$
$$\widehat m_{1,t,h}(x, \tau) = \frac{\sum_{i=1}^{t} K_h(x - X_i)(1 - I_{\tau,i})Y_i}{\sum_{i=1}^{t} K_h(x - X_i)(1 - I_{\tau,i})}, \qquad \widehat m_{2,t,h}(x, \tau) = \frac{\sum_{i=1}^{t} K_h(x - X_i)I_{\tau,i}Y_i}{\sum_{i=1}^{t} K_h(x - X_i)I_{\tau,i}}.$$
The behavior of the estimators $\widehat m_{1,t,h}$ and $\widehat m_{2,t,h}$ is the same as in the model where $\tau_0$ is known, and it is the behavior described in Section 9.1. The variance $\sigma^2$ is estimated by
$$\widehat\sigma^2_{\tau,t,h} = t^{-1}\sum_{i=1}^{t}\{Y_i - I_{\tau,i}\,\widehat m_{1,t,h}(X_i, \tau) - (1 - I_{\tau,i})\,\widehat m_{2,t,h}(X_i, \tau)\}^2,$$


at the estimated $\tau$. The change-point parameter $\tau$ is estimated by minimizing the error of the model with a change-point at $\tau$,
$$\widehat\tau_{t,h} = \arg\min_{\tau \le t} \widehat\sigma^2_{\tau,t,h},$$

and the functions $m_1$ and $m_2$ by $\widehat m_{k,t,h}(x) = \widehat m_{k,t,h}(x, \widehat\tau_{t,h})$, for $k = 1, 2$. Let $\gamma = T^{-1}\tau$ with the corresponding change-point time $\tau_\gamma = T\gamma$, let $m = (m_1, m_2)$ with true functions $m_0$, and let
$$\sigma^2_t(m, \gamma) = t^{-1}\sum_{i=1}^{t}\{Y_i - I_{\tau_\gamma,i}\, m_1(X_i) - (1 - I_{\tau_\gamma,i})\, m_2(X_i)\}^2$$

be the mean squared error for the parameters $(m, \tau)$. The difference of the error from its minimum is
$$\begin{aligned} l_t(m, \tau) &= \sigma^2_t(m, \tau) - \sigma^2_t(m_0, \tau_0)\\ &= t^{-1}\sum_{i=1}^{t}\big[\{Y_i - I_{\tau,i}m_1(X_i) - (1 - I_{\tau,i})m_2(X_i)\}^2 - \{Y_i - I_{\tau_0,i}m_{10}(X_i) - (1 - I_{\tau_0,i})m_{20}(X_i)\}^2\big]\\ &= t^{-1}\sum_{i=1}^{t}\big[\{I_{\tau,i}m_1(X_i) - I_{\tau_0,i}m_{10}(X_i)\} + \{(1 - I_{\tau,i})m_2(X_i) - (1 - I_{\tau_0,i})m_{20}(X_i)\}\big]^2 \qquad (9.9)\\ &= t^{-1}\sum_{i=1}^{t}\big[\{(m_1 - m_{10})(X_i)I_{\tau_0,i} - (m_2 - m_{20})(X_i)(1 - I_{\tau_0,i})\}^2 + \{(I_{\tau,i} - I_{\tau_0,i})(m_1 - m_2)(X_i)\}^2\big]\{1 + o(1)\}. \end{aligned}$$
It converges a.s. to $l(m, \tau) = E_\alpha(m_1 - m_{10})^2(X) + E_\beta(m_2 - m_{20})^2(X) + |\tau - \tau_0|\, E(m_1 - m_2)^2(X)$, which is minimal at $(m_0, \tau_0)$, and the estimator $\widehat\tau_{nh}$ minimizes $l_t(\widehat m_{nh}, \tau)$. The a.s. consistency of the regression estimators $\widehat m_{nh} = (\widehat m_{1nh}, \widehat m_{2nh})$ and of $l_t(m, \tau)$ implies that $\widehat\gamma_{t,h} = t^{-1}\widehat\tau_{t,h}$ is an a.s. consistent estimator of $\gamma_0$ in $]0, 1[$. It follows that the estimator
$$\widehat m_{nh}(x) = \widehat m_{1nh}(x)\, I_{\widehat\tau_{t,h}} + \widehat m_{2nh}(x)(1 - I_{\widehat\tau_{t,h}})$$
of the regression function $m_0(x) = m_{10}(x) I_{\tau_0} + m_{20}(x)(1 - I_{\tau_0})$ is a.s. uniformly consistent, and the process $(th)^{1/2}(\widehat m_{th} - m_0)$ converges weakly under $P_{m_0}$ to a Gaussian process $G_m$ on $I_X$, with expectation and covariances zero and with variance function $V_m(x) = \kappa_2\, \operatorname{Var}(Y \mid X = x)$.
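The nonparametric change-point estimation can be sketched as follows, with the two regime functions estimated by indicator-weighted kernel smoothing and $\widehat\tau_{t,h}$ selected by minimizing $\widehat\sigma^2_{\tau,t,h}$; the Gaussian kernel, the labelling of the regimes by $I_{\tau,i} = 1\{i \le \tau\}$ and the `min_seg` guard are assumptions of the sketch, not prescriptions of the text.

```python
import numpy as np

def sigma2_tau(x, tau, h):
    """Mean squared error of the two-regime kernel fit for a candidate change-point tau."""
    x = np.asarray(x, dtype=float)
    x_prev, y = x[:-1], x[1:]
    t = len(y)
    ind = (np.arange(1, t + 1) <= tau).astype(float)        # I_{tau,i} = 1{i <= tau}
    resid = np.empty(t)
    for i in range(t):
        k = np.exp(-0.5 * ((x_prev[i] - x_prev) / h) ** 2)  # Gaussian kernel weights
        w1, w2 = k * ind, k * (1.0 - ind)
        m1 = np.dot(w1, y) / max(np.sum(w1), 1e-12)         # regime-1 kernel estimate at X_{i-1}
        m2 = np.dot(w2, y) / max(np.sum(w2), 1e-12)         # regime-2 kernel estimate at X_{i-1}
        resid[i] = y[i] - ind[i] * m1 - (1.0 - ind[i]) * m2
    return np.mean(resid ** 2)

def changepoint_nonparametric(x, h, min_seg=10):
    """Estimate the change-point by minimizing sigma2_{tau,t,h} over candidate tau."""
    t = len(x) - 1
    taus = list(range(min_seg, t - min_seg))
    errors = [sigma2_tau(x, tau, h) for tau in taus]
    return taus[int(np.argmin(errors))]
```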


For the weak convergence of the change-point estimator, let $\|\varphi\|_X$ be the $L^2(F_X)$-norm of a function $\varphi$ on $I_X$, let $\rho(\theta, \theta') = (|\gamma - \gamma'| + \|m - m'\|^2_X)^{1/2}$ be the distance between $\theta = (m^T, \gamma)^T$ and $\theta' = (m'^T, \gamma')^T$, and let $V_\varepsilon(\theta_0)$ be a neighborhood of $\theta_0$ with radius $\varepsilon$ for the metric $\rho$. The quadratic function $l_t(m, \tau)$ defined by (9.9) converges to its expectation, $l(\widehat m_{th}, \widehat\tau_{th}) = O(\|\widehat m_{nh} - m_0\|^2_X + |\widehat\tau_{nh} - \tau_0|)$. The process is bounded in the same way
$$l_t(m, \tau) = \Big[t^{-1}\sum_{i=1}^{t}\{(m_1 - m_{10})^2(X_i)\, I_{\tau_0,i} + (1 - I_{\tau_0,i})(m_2 - m_{20})^2(X_i)\} + t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2(m_1 - m_2)^2(X_i)\Big]\{1 + o(1)\},$$

it is denoted $l_t = (l_{1t} + l_{2t})\{1 + o(1)\}$. The process $W_t(m, \gamma) = t^{1/2}(l_t - l)(m, \tau_\gamma)$ is an $O_p(1)$. The estimator $\widehat m_{th}$ is a local maximum likelihood estimator of the nonparametric regression functions and the estimator of the change-point is a maximum likelihood estimator. The variable $l_{1t}(\widehat m_{th})$ converges to $l_1(m_0) = 0$ and $l_{2t}(\widehat m_{th}, \widehat\tau_{\gamma t})$ converges to zero with the same rate if the convergence rate of $\widehat\gamma_t$ is the same as that of $\widehat m_{th}$. We obtain the next bounds.

Lemma 9.1. For every $\varepsilon > 0$, there exists a constant $\kappa_0$ such that $E\sup_{(m,\gamma)\in V_\varepsilon(\tau_0)} l_t(m, \tau_\gamma) \le \kappa_0\varepsilon^2$ and $0 \le l(m, \tau_\gamma) \le \kappa_0\rho^2(\theta, \theta_0)$, for every $\theta$ in $V_\varepsilon(\tau_0)$.

Lemma 9.2. For every $\varepsilon > 0$, there exists a constant $\kappa_1$ such that $E\sup_{(m,\gamma)\in V_\varepsilon(\tau_0)} W_t(m, \gamma) \le \kappa_1\rho(\theta, \theta_0)$.

The lemmas imply that, for every $\varepsilon > 0$,
$$\limsup_{t\to\infty,\, A\to\infty} P_0\big(th\,|\widehat\gamma_{th} - \gamma_0| > A\big) = 0.$$
The proof is similar to Ibragimov and Has'minskii's (1981) for a change-point of a density. It implies that $l_t(\widehat\theta_{th}) = (l_{1t} + l_{2t})(\widehat\theta_{th}) + o_p(1)$ uniformly. For the weak convergence of $(th)^{1/2}(\widehat\gamma_{th} - \gamma_0)$, let
$$U_t = \{u = (u_m^T, u_\gamma)^T : u_m = (th)^{-1/2}(m - m_0),\; u_\gamma = (th)^{-1}(\gamma - \gamma_0)\}$$
and let $A$ be a bounded set. For every $A > 0$, let $U^A_{th} = \{u \in U_t;\, \|u\|_2 \le A\}$. Then for every $u = (u_m, u_\gamma)$ belonging to $U^A_{th}$, $\theta_{t,u} = (m_{t,u}, \gamma_{t,u})$ with $m_{t,u} = m_0 + (th)^{-1/2}u_m$ and $\gamma_{t,u} = \gamma_0 + (th)^{-1}u_\gamma$. The process $W_t$ defines a map $u \mapsto W_t(\theta_{t,u})$.


Theorem 9.1. For every $A > 0$, the process $W_t(\theta)$ develops as $W_t(\theta) = W_{1t}(m) + W_{2t}(\gamma) + o_p(1)$, where the $o_p$ is uniform on $U^A_t$, as $t$ tends to infinity. Then the change-point estimator of $\gamma_0$ is asymptotically independent of the estimators of the regression functions $\widehat m_{1th}$ and $\widehat m_{2th}$.

Proof.

For an ergodic process, the continuous part $l_{1t}$ of $l_t$ converges to
$$l_1(m) = \gamma_0\|m_1 - m_{01}\|^2_{F_{1X}} + (1 - \gamma_0)\|m_2 - m_{02}\|^2_{F_{2X}},$$

and the continuous part of $W_t$ is approximated by $W_{1t}(m) = t^{1/2}(l_{1t} - l_1)(m)$. On $U^A_{th}$, it is written
$$W_{1t}(m) = \gamma_0\int(m_1 - m_{01})^2\, d\nu_{1t} + (1 - \gamma_0)\int(m_2 - m_{02})^2\, d\nu_{2t},$$
where $\nu_{kt} = t^{1/2}(F_{kt} - F_{k0})$ is the empirical process of the series in phase $k = 1, 2$, with the ergodic distributions $F_{k0}$ of the process. The discrete part of $W_t$ is approximated by $W_{2t}(\gamma) = t^{1/2}(l_{2t} - l_2)(\gamma)$ where $l_{2t} = t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2(m_{10} - m_{20})^2(X_i) + o_p(|\tau - \tau_0|)$ and the sum is developed with the notation $a_i = (m_{10} - m_{20})^2(X_i)$
$$t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2(m_{10} - m_{20})^2(X_i) = \frac{1}{t}\Big\{\sum_{i=1}^{\tau_0}(1 - I_{\tau,i})a_i + \sum_{i=1+\tau_0}^{t} I_{\tau,i}a_i\Big\}$$