Foundations of Modern Statistics: Festschrift in Honor of Vladimir Spokoiny, Berlin, Germany, November 6–8, 2019, Moscow, Russia, November 30, 2019 (Springer Proceedings in Mathematics & Statistics, 425). ISBN 3031301137, 9783031301131

This book contains contributions from the participants of the international conference "Foundations of Modern Statistics".


English, 615 pages [603]


Table of contents:
Preface
Contents
Optimal Rates and Non-asymptotic Bounds in Nonparametrics
Adaptive Denoising of Signals with Local Shift-Invariant Structure
1 Introduction
2 Problem Description
2.1 Notation
2.2 Problem Statement
3 Oracle Inequalities for the ℓ2-Loss of Adaptive Estimators
3.1 Adaptive Signal Interpolation
3.2 Adaptive Signal Filtering
4 Risk Bounds for Adaptive Recovery Under ASI
4.1 Risk Bounds for Adaptive Signal Interpolation
4.2 Risk Bounds for Adaptive Signal Filtering
4.3 Harmonic Oscillation Denoising
References
Goodness-of-Fit Testing for Hölder Continuous Densities Under Local Differential Privacy
1 Introduction
2 Problem Statement
3 Non-interactive Privacy Mechanisms
3.1 Upper Bound in the Non-interactive Scenario
3.2 Lower Bound in the Non-interactive Scenario
4 Interactive Privacy Mechanisms
4.1 Upper Bound in the Interactive Scenario
4.2 Lower Bound in the Interactive Scenario
5 Examples
A Proofs of Sect. 3
A.1 Proof of Proposition 3.2
A.2 Proof of Theorem 3.4
A.3 Proof of Lemma 3.7
B Proofs of Sect. 4
B.1 Proof of Proposition 4.1
B.2 Analysis of the Mean and Variance of the Statistic DB
B.3 Proof of Theorem 4.3
B.4 Proof of Theorem 4.4
C Proofs of Sect. 5
C.1 Example 5.2
C.2 Example 5.3
C.3 Example 5.4
C.4 Example 5.5
C.5 Example 5.6
C.6 Example 5.7
C.7 Example 5.8
References
Nonasymptotic One- and Two-Sample Tests in High Dimension with Unknown Covariance Structure
1 Introduction
1.1 Relation to White Noise Model in Nonparametric Statistics
1.2 Relation to "Modern" and High-Dimensional Statistics
1.3 Relation to Machine Learning and Kernel Mean Embeddings of Distributions
1.4 Overview of Contributions
1.5 Organization of the Paper
2 Main Results
2.1 A General Result to Upper Bound Separation Rates
2.2 Concentration Properties of the Test Statistic
2.3 Quantile Estimation
2.4 Concluding Remarks
3 Proofs
3.1 Proof of Theorem 3
3.2 Proof of Propositions 6 and 9
3.3 Proof of Theorem 7
3.4 Proof of Theorem 8
3.5 Proof of Propositions 10 and 11
3.6 Proof of Propositions 12 and 13
3.7 Additional Proofs
References
The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls
1 Introduction
1.1 Organization of the Paper
2 Definitions
3 Bounds for the Empirical Process Using Approximation Numbers
4 Bounds for the Lasso Using Approximation Numbers
5 Some Entropy Results From the Literature
6 Explicit Entropy Bounds Using δ-Approximation Numbers
7 Bounds Using Covering Numbers Illustrated: One-Hidden-Layer Neural Networks
7.1 Definitions and Results
7.2 Simulation
8 Examples of Approximation Numbers
8.1 Second Order Discrete Derivatives
8.2 kth Order Discrete Derivatives
8.3 Higher-Dimensional Extensions
8.4 Entropy of the Class of Distribution Functions
9 Conclusion
10 Technical Proofs
10.1 Proof of Lemma 1
10.2 Proof of Lemma 2
10.3 Proof of Theorem 5
10.4 Proof of Lemma 3
10.5 Proof of Lemma 5
References
Local Linear Smoothing in Additive Models as Data Projection
1 Introduction
2 Local Linear Smoothing In Additive Models
3 Existence and Uniqueness of the Estimator, Convergence of the Algorithm
4 Asymptotic Properties of the Estimator
References
A Multivariate CLT for Weighted Sums with Rate of Convergence of Order O(1/n)
1 Introduction and Main Result
2 Notation and Auxiliary Results
3 Proof of the Main Theorem
References
Estimation of Matrices and Subspaces
Rate of Convergence for Sparse Sample Covariance Matrices
1 Introduction
2 Main Results
3 The Stieltjes Transforms Proximity
4 The Proof of Theorem 1
4.1 Truncation
4.2 The Proof of Theorem
5 The Proof of Theorem 2
5.1 Estimation of Resolvent Diagonal Elements
5.2 Estimation of Tn
6 Appendix
References
Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces
1 Introduction
2 A Van Trees Inequality for the Estimation of Principal Subspaces
3 Proof of Proposition 1
3.1 Reduction to a Pointwise Risk
3.2 A Pointwise Cramér-Rao Inequality for Equivariant Estimators
4 Applications
4.1 PCA and the Subspace Distance
4.2 PCA and the Excess Risk
4.3 Low-Rank Matrix Denoising
5 Proofs for Sect. 4
5.1 Specialization to Principal Subspaces
5.2 A Simple Optimization Problem
5.3 End of Proofs of the Consequences
References
Sparse Constrained Projection Approximation Subspace Tracking
1 Introduction
1.1 Main Setup
2 Methods
2.1 CPAST
2.2 Sparse CPAST
3 Error Bounds for CPAST and SCPAST
4 Numerical Results
5 Outlines of the Proofs
6 Conclusions
7 Proofs of Results in Section 5
8 Concentration of the Spectral Norm of the Perturbation
References
Bernstein–von Mises Theorem and Misspecified Models: A Review
1 Introduction
2 Frequentist Results for Misspecified Models
2.1 Probability Model
2.2 Best Parameter
2.3 Regular Models
2.4 Nonasymptotic LAN Condition
2.5 Optimal Variance for Unbiased Estimators
3 Bernstein–von Mises Theorem for Correctly Specified Models
4 Bernstein–von Mises Theorem and Model Misspecification
4.1 Bayesian Inference Under Model Misspecification
4.2 Concentration
4.3 Bernstein–von Mises-Type Results Under Model Misspecification
4.4 Example: Misspecified Linear Model
5 "Optimising" Bayesian Inference Under Model Misspecification
5.1 Asymptotic Risk of Parameter Estimation Under a Misspecified Model
5.2 Composite Likelihoods
5.3 Generalised (Gibbs) Posterior Distribution
5.4 Nonparametric Model for Uncertainty in p0 and Bootstrap Posterior
5.5 Curvature Adjustment
6 Discussion and Open Questions
References
On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems
1 Introduction
1.1 Problem Statement
1.2 Related Work
2 Sieve Approach
3 Semiparametric Bernstein-von Mises Theorem
3.1 Parametric Estimation: Main Definitions
3.2 Conditions
3.3 Posterior Contraction
3.4 Gaussian Approximation of Posterior Distribution
3.5 Critical Dimension and Examples
4 Tools
4.1 Some Inequalities for Normal Distribution
4.2 Linear Approximation of Log-likelihood Gradient and Other Tools
5 Proofs of Main Results
5.1 Proof of Corollary 7
5.2 Proof of Corollary 8
5.3 Proof of Theorem 1
5.4 Proof of Theorem 2
5.5 Proof of Theorem 3
5.6 Proof of Theorem 4
5.7 Proof of Corollary 3
5.8 Proof of Theorem 6
5.9 Proof of Theorem 7
References
Statistical Theory Motivated by Applications
An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity
1 Introduction
2 Covariate Balancing Weights and Double Robustness
3 A Parametric Alternative to Synthetic Control
3.1 Estimation With Low-Dimensional Covariates
3.2 Estimation With High-Dimensional Covariates
3.3 Asymptotic Properties
4 Monte Carlo Simulations
5 Empirical Applications
5.1 Job Training Program
5.2 California Tobacco Control Program
6 Conclusion
7 Algorithm for Feasible Penalty Loadings
8 Proofs
9 Auxiliary Lemmas
References
Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models
1 Introduction
2 Minimax Optimal Quadratic Functional Estimation
2.1 Preliminaries and Notation
2.2 A Leave-one-out, Sieve NPIV Estimator
2.3 Rate of Convergence
3 Rate Adaptive Estimation
4 Conclusion and Extensions
5 Proofs of Results in Section 2
6 Proofs of Results in Section 3
7 Supplementary Lemmas
References
A Minimax Testing Perspective on Spatial Statistical Resolution in Microscopy
1 Introduction
2 Theory
2.1 Assumptions
2.2 Results
2.3 Physical Implications
3 Simulations
4 Proofs
4.1 Most Difficult Setup for Even PSFs in the CLT Regime
References
Optimization
Unifying Framework for Accelerated Randomized Methods in Convex Optimization
1 Introduction
1.1 Related Work
1.2 Our Approach and Contributions
2 Preliminaries
2.1 Notation
2.2 Problem Statement and Assumptions
3 Unified Accelerated Randomized Method
4 Extension for Strongly Convex Functions
5 Examples of Applications
5.1 Accelerated Random Directional Search
5.2 Accelerated Random Coordinate Descent
5.3 Accelerated Random Block-Coordinate Descent
5.4 Accelerated Random Derivative-Free Directional Search
5.5 Accelerated Random Derivative-Free Coordinate Descent
5.6 Accelerated Random Derivative-Free Block-Coordinate Descent
5.7 Accelerated Random Derivative-Free Block-Coordinate Descent with Random Approximations for Block Derivatives
6 Model Generality in a Non-accelerated Random Block-Coordinate Descent
7 Conclusion
References
Surrogate Models for Optimization of Dynamical Systems
1 Introduction
2 Literature Review
3 Mathematical Framework
3.1 Optimal Control Problem for Dynamical Systems
3.2 Surrogate Models for Optimization Problems
4 Enhanced Surrogate Models
4.1 Iterative Algorithm
5 Application of POD-RBF Procedure on Dynamical Systems
5.1 Model 1: Science Policy
5.2 Model 2: Population Dynamics
5.3 Model 3: Quality Control in Production and Process Management
6 Conclusion
6.1 Limitations and Future Work
References
Appendix Interview with Vladimir Spokoiny on 29/01/21 by E. Mammen and M. Reiß

Springer Proceedings in Mathematics & Statistics

Denis Belomestny · Cristina Butucea · Enno Mammen · Eric Moulines · Markus Reiß · Vladimir V. Ulyanov   Editors

Foundations of Modern Statistics Festschrift in Honor of Vladimir Spokoiny, Berlin, Germany, November 6–8, 2019, Moscow, Russia, November 30, 2019

Springer Proceedings in Mathematics & Statistics Volume 425

This book series features volumes composed of selected contributions from workshops and conferences in all areas of current research in mathematics and statistics, including data science, operations research and optimization. In addition to an overall evaluation of the interest, scientific quality, and timeliness of each proposal at the hands of the publisher, individual contributions are all refereed to the high quality standards of leading journals in the field. Thus, this series provides the research community with well-edited, authoritative reports on developments in the most exciting areas of mathematical and statistical research today.

Denis Belomestny · Cristina Butucea · Enno Mammen · Eric Moulines · Markus Reiß · Vladimir V. Ulyanov Editors

Foundations of Modern Statistics Festschrift in Honor of Vladimir Spokoiny, Berlin, Germany, November 6–8, 2019, Moscow, Russia, November 30, 2019

Editors

Denis Belomestny, Faculty of Mathematics, University of Duisburg-Essen, Essen, Germany
Cristina Butucea, CREST, ENSAE, Institut Polytechnique de Paris, Palaiseau, France
Enno Mammen, Institute for Applied Mathematics, Heidelberg University, Heidelberg, Baden-Württemberg, Germany
Eric Moulines, CMAP, Ecole Polytechnique, Palaiseau, France
Markus Reiß, Institute of Mathematics, Humboldt-Universität zu Berlin, Berlin, Germany
Vladimir V. Ulyanov, Faculty of Computer Science, HSE University and Moscow State University, Moscow, Russia

ISSN 2194-1009 ISSN 2194-1017 (electronic) Springer Proceedings in Mathematics & Statistics ISBN 978-3-031-30113-1 ISBN 978-3-031-30114-8 (eBook) https://doi.org/10.1007/978-3-031-30114-8 Mathematics Subject Classification: 62-02, 62-06, 60-06, 62Axx, 62Gxx © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This volume contains contributions from the participants of the international conference "Foundations of Modern Statistics", which took place from November 6 to November 8, 2019, at WIAS in Berlin and on November 30, 2019, at HSE University in Moscow, two beautiful and ideal locations for this very special event. The event was organized in order to honor the numerous scientific achievements of Prof. Spokoiny on the occasion of his 60th birthday. Vladimir Spokoiny has pioneered the field of adaptive statistical inference and contributed to a variety of its applications. His more than 30 years of research in the field of mathematical statistics have had a major impact on the development of the mathematical theory of statistics up to its present state and have inspired many young researchers to take up research in this exciting field of mathematics. Vladimir Spokoiny studied Mathematics and Physics at the Moscow Institute of Railway Engineering (1976–1981). He received a Doktor degree in Mathematics from Lomonosov Moscow State University in 1987 under the supervision of M. B. Maljutov and A. N. Shiryaev. After spending the years 1990–1993 as a researcher at the Institute for Information Transmission Problems, he started a position at the Weierstrass Institute for Applied Analysis and Statistics (Berlin) in 1993, where he was promoted to Head of a Research Group in 2000. Three years later, he accepted an offer from Humboldt University, Berlin, for a Full Professorship (C4). Vladimir Spokoiny has received several awards and recognitions, and numerous grants from the German Research Foundation (DFG). The papers contained in this volume reflect the broad field of interests of Vladimir Spokoiny. The volume starts with six papers on optimal rates and non-asymptotic bounds in nonparametrics. In the paper by Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski, and Dmitrii Ostrovskii, optimal rates are discussed for discrete-time signal denoising.


In their contribution, Amandine Dubois, Thomas B. Berrett, and Cristina Butucea discuss minimax rates in a setting of the currently very active field of statistics under differential privacy. Gilles Blanchard and Jean-Baptiste Fermanian give non-asymptotic upper and lower bounds on the minimal separation in high-dimensional two-sample testing and relate them to estimation problems. In a more general setting, Sara van de Geer and Peter Hinz discuss whether nonparametric minimax rates can be achieved by projection-type arguments. In the paper of Munir Hiabu, Enno Mammen, and Joseph T. Meyer, local linear smooth backfitting in nonparametric additive models is interpreted as a projection of the data onto a linear space, and this is used to show optimal rates for the backfitting estimator and to develop asymptotic distribution theory. The paper of Sagak A. Ayvazyan and Vladimir V. Ulyanov generalizes a result of B. Klartag and S. Sodin to higher dimensions, namely for sums of i.i.d. vectors weighted with random weights from the unit sphere. The results say that, with high probability, the sums are approximately normal with a rate that is faster than the 1/√n Berry–Esseen rate. In the following section, three papers of the volume discuss results in the estimation of matrices and subspaces. F. Götze, A. Tikhomirov, and D. Timushev discuss the convergence rate of the empirical spectral distribution function of sparse sample covariance matrices. Martin Wahl states non-asymptotic lower bounds for the estimation of principal subspaces. Furthermore, Denis Belomestny and Ekaterina Krymova prove non-asymptotic error bounds for constrained orthogonal projection approximation subspace tracking algorithms. Two papers of the volume discuss Bayes approaches from a frequentist point of view: in the paper of Natalia Bochkina a review is given of the Bernstein–von Mises theorem in misspecified models, and Maxim Panov discusses Gaussian approximations in semiparametric settings. The next section contains three papers with statistical theory motivated by models in applied fields. Both the paper of Marianne Bléhaut, Xavier D'Haultfœuille, Jérémy L'Hour, and Alexandre B. Tsybakov and the paper of Christoph Breunig and Xiaohong Chen discuss models coming from econometrics. The first paper proposes estimates of average treatment effects that are doubly robust, consistent, and asymptotically normal, in classical and in sparse settings. The latter paper considers adaptive estimation of quadratic functionals in nonparametric instrumental variable models. The third paper in this section, contributed by Gytis Kulaitis, Axel Munk, and Frank Werner, is motivated by a problem in microscopy. It is shown that statistical inference on microscopy resolution leads to a minimax testing problem which requires novel asymptotic theory. The last section of the book contains two papers from optimization. In their paper, Pavel Dvurechensky, Alexander Gasnikov, Alexander Tyurin, and Vladimir Zholobov develop a detailed picture of convergence rates of accelerated randomized methods in convex optimization.


Finally, Kainat Khowaja, Mykhaylo Shcherbatyy, and Wolfgang Karl Härdle use dimension reduction techniques to solve complex systems of differential equations arising in optimization problems of dynamical systems. The book concludes with a conversation of Vladimir Spokoiny with Markus Reiß and Enno Mammen. This interview gives some background on the life of Vladimir Spokoiny and his many scientific interests and motivations.

Denis Belomestny (Essen, Germany)
Cristina Butucea (Paris, France)
Enno Mammen (Heidelberg, Germany)
Eric Moulines (Paris, France)
Markus Reiß (Berlin, Germany)
Vladimir V. Ulyanov (Moscow, Russia)

Contents

Optimal Rates and Non-asymptotic Bounds in Nonparametrics
Adaptive Denoising of Signals with Local Shift-Invariant Structure (Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski, and Dmitrii Ostrovskii) ... 3
Goodness-of-Fit Testing for Hölder Continuous Densities Under Local Differential Privacy (Amandine Dubois, Thomas B. Berrett, and Cristina Butucea) ... 53
Nonasymptotic One- and Two-Sample Tests in High Dimension with Unknown Covariance Structure (Gilles Blanchard and Jean-Baptiste Fermanian) ... 121
The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls (Sara van de Geer and Peter Hinz) ... 163
Local Linear Smoothing in Additive Models as Data Projection (Munir Hiabu, Enno Mammen, and Joseph T. Meyer) ... 197
A Multivariate CLT for Weighted Sums with Rate of Convergence of Order O(1/n) (Sagak A. Ayvazyan and Vladimir V. Ulyanov) ... 225

Estimation of Matrices and Subspaces
Rate of Convergence for Sparse Sample Covariance Matrices (F. Götze, A. Tikhomirov, and D. Timushev) ... 261
Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces (Martin Wahl) ... 301
Sparse Constrained Projection Approximation Subspace Tracking (Denis Belomestny and Ekaterina Krymova) ... 323
Bernstein–von Mises Theorem and Misspecified Models: A Review (Natalia Bochkina) ... 355
On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems (Maxim Panov) ... 381

Statistical Theory Motivated by Applications
An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity (Marianne Bléhaut, Xavier D'Haultfœuille, Jérémy L'Hour, and Alexandre B. Tsybakov) ... 417
Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models (Christoph Breunig and Xiaohong Chen) ... 459
A Minimax Testing Perspective on Spatial Statistical Resolution in Microscopy (Gytis Kulaitis, Axel Munk, and Frank Werner) ... 483

Optimization
Unifying Framework for Accelerated Randomized Methods in Convex Optimization (Pavel Dvurechensky, Alexander Gasnikov, Alexander Tyurin, and Vladimir Zholobov) ... 511
Surrogate Models for Optimization of Dynamical Systems (Kainat Khowaja, Mykhaylo Shcherbatyy, and Wolfgang Karl Härdle) ... 563
Interview with Vladimir Spokoiny on 29/01/21 by E. Mammen and M. Reiß ... 595

Optimal Rates and Non-asymptotic Bounds in Nonparametrics

Adaptive Denoising of Signals with Local Shift-Invariant Structure

Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski, and Dmitrii Ostrovskii

Abstract We discuss the problem of adaptive discrete-time signal denoising in the situation where the signal to be recovered admits a "linear oracle", an unknown linear estimate that takes the form of the convolution of observations with a time-invariant filter. It was shown by Juditsky and Nemirovski [20] that when the ℓ2-norm of the oracle filter is small enough, such an oracle can be "mimicked" by an efficiently computable adaptive estimate of the same structure with an observation-driven filter. The filter in question was obtained as a solution to the optimization problem in which the ℓ∞-norm of the Discrete Fourier Transform (DFT) of the estimation residual is minimized under a constraint on the ℓ1-norm of the filter DFT. In this paper, we discuss a new family of adaptive estimates which rely upon minimizing the ℓ2-norm of the estimation residual. We show that such estimators possess better statistical properties than those based on ℓ∞-fit; in particular, under the assumption of approximate shift-invariance we prove oracle inequalities for their ℓ2-loss and improved bounds for ℓ2- and pointwise losses. We also study the relationship of the approximate shift-invariance assumption with the signal simplicity introduced in [20], and discuss the application of the proposed approach to harmonic oscillation denoising.

Keywords Nonparametric estimation · Adaptive denoising · Adaptive filtering · Harmonic oscillation denoising

Z. Harchaoui, University of Washington, Seattle, WA 98195, USA. e-mail: [email protected]
A. Juditsky (corresponding author), LJK, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Saint-Martin-d'Hères, France. e-mail: [email protected]
A. Nemirovski, Georgia Institute of Technology, Atlanta, GA 30332, USA. e-mail: [email protected]
D. Ostrovskii, University of Southern California, Los Angeles, CA 90089, USA. e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_1


1 Introduction

The problem we consider in this paper is that of signal denoising: given noisy observations
$$
y_\tau = x_\tau + \sigma\,\zeta_\tau, \quad \tau \in \mathbb{Z}, \tag{1}
$$

we aim at recovering the signal (xt)t∈Z. It is convenient for us to assume that signal and noises are complex-valued. The observation noises ζτ are assumed to be i.i.d. standard complex-valued Gaussian random variables independent of x (denoted ζτ ∼ CN(0, 1)), meaning that ζτ = ζτ¹ + iζτ² with i.i.d. ζτ¹, ζτ² ∼ N(0, 1). Our goal may be, for instance, to recover the value xt of the signal at time t given observations (yτ), |τ − t| ≤ m, for some m ∈ Z+ (the problem referred to as signal interpolation in the signal processing literature), or to estimate the value xt+h given observations (yτ), t − m ≤ τ ≤ t (signal prediction or extrapolation), etc.

The above problem is classical in statistics and signal processing. In particular, linear estimates of the form
$$
\widehat{x}_t \;=\; \sum_{\tau\in\mathbb{Z}} \phi_\tau\, y_{t-\tau}
$$

are ubiquitous in nonparametric estimation; for instance, classical kernel estimators are of this type. More generally, linear estimates are considered both theoretically attractive and easy to use in practice [6, 9, 17, 25, 39, 41]. When the set X of signals is well-specified, one can usually compute a (nearly) minimax on X linear estimator in closed form. In particular, if X is a class of "smooth signals," such as a Hölder or a Sobolev ball, then the corresponding estimator is given by the kernel estimator with a properly selected bandwidth [39], and is minimax among all possible estimators. Moreover, linear estimators are known to be nearly minimax optimal with respect to the pointwise loss [6, 16] and the ℓ2-loss [8, 23, 24, 34] under rather weak assumptions about the set X of admissible signals. Besides this, if the set X of signals is specified in a computationally tractable way, then a near-minimax linear estimator can be efficiently computed by solving a convex optimization problem [23, 24].

The strength of this approach, however, comes at a price: in order to implement the estimator, the set X must be known to the statistician. Such knowledge is crucial: a near-minimax estimator for one signal set can be of poor quality for another one. Thus, the linear estimation approach cannot be directly implemented when no prior knowledge of X is available. In the statistical literature, this difficulty is usually addressed via adaptive model selection procedures [3, 12, 18, 27–30, 39]. However, model selection procedures usually impose strong structural assumptions on the signal set, assuming it to be known up to a few hyper-parameters.¹

¹ More general adaptation schemes have been recently introduced, e.g., routines from [13, 28] which can handle, for example, adaptation to inhomogeneous and anisotropic smoothness of the signal. However, the proposed schemes cannot be implemented in a numerically efficient fashion, and therefore are not practical.


An alternative approach to the denoising problem with unknown X was proposed in [32]. There, instead of directly restricting the class of signals and requiring a specification of X, one restricts the class of possible estimators. Namely, let us denote with C(Z) the space of complex-valued functions on Z, and let, for m ∈ Z+, Cm(Z) be the space of complex-valued sequences that vanish outside the set {−m, ..., m}. We consider linear convolution-type estimators, associated with filters φ ∈ Cm(Z), of the form
$$
\widehat{x}_t \;=\; [y * \phi]_t \;:=\; \sum_{\tau\in\mathbb{Z}} \phi_\tau\, y_{t-\tau} \;=\; \sum_{|\tau|\le m} \phi_\tau\, y_{t-\tau}. \tag{2}
$$
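The estimator (2) is a finite discrete convolution, so it is easy to simulate. The following minimal numpy sketch (our own illustration, not code from the paper; the toy signal, the moving-average filter, and all variable names are assumptions) draws observations according to (1) and applies a fixed filter φ ∈ Cm(Z) as in (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Observations per (1): y_t = x_t + sigma * zeta_t, zeta_t = zeta1 + i*zeta2, zeta1, zeta2 ~ N(0,1).
L, sigma = 200, 0.5
t = np.arange(-L, L + 1)
x = np.exp(1j * 0.3 * t) + 0.5 * np.exp(1j * 1.1 * t)   # toy harmonic oscillation
zeta = rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size)
y = x + sigma * zeta

# A convolution-type estimate (2) with a fixed filter on {-m, ..., m};
# here phi is the uniform "moving average" kernel, chosen only for illustration.
m = 10
phi = np.full(2 * m + 1, 1.0 / (2 * m + 1), dtype=complex)

# [y * phi]_t = sum_{|tau| <= m} phi_tau y_{t-tau}; 'same' keeps the original grid.
x_hat = np.convolve(y, phi, mode="same")

err = np.linalg.norm((x_hat - x)[m:-m]) / np.sqrt(x.size - 2 * m)
print(f"per-point RMS error of the moving-average filter: {err:.3f}")
```

Any other filter supported on {−m, ..., m} can be plugged into the same two lines; the adaptive procedures discussed below differ only in how the filter is chosen from the data.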

Informally, the problem we are interested in here is as follows: If we fix the structure (2) of the estimate and consider the form of the filter φ as a “free parameter,” is it possible to build an estimation procedure which is adaptive with respect to this parameter?

In other words, suppose that a "good" filter φᵒ with a small estimation error "exists in nature." Is it then possible to construct a data-driven estimation routine which has (almost) the same accuracy as the "oracle", a hypothetical optimal estimation method utilizing φᵒ? The above question was first answered positively in [20] using the estimation machinery from [32]. To present the ideas underlying the approach of [20] we need to define the class of "well-filtered" or "simple" signals [15, 20].

Definition 1 (Simple signals) Given parameters m, n ∈ Z+, ρ ≥ 1, and θ ≥ 0, a signal x ∈ C(Z) is called (m, n, ρ, θ)-simple if there exists φᵒ ∈ Cm(Z) satisfying
$$
\|\phi^o\|_2 \;\le\; \frac{\rho}{\sqrt{2m+1}}, \tag{3}
$$
and such that
$$
\bigl|x_\tau - [\phi^o * x]_\tau\bigr| \;\le\; \frac{\sigma\theta\rho}{\sqrt{2m+1}}, \quad \text{for all } |\tau| \le m+n. \tag{4}
$$

Decomposing the pointwise mean-squared error of the estimate (2) with φ = φᵒ as
$$
\mathbf{E}\bigl|x_\tau - [\phi^o * y]_\tau\bigr|^2 \;=\; \sigma^2\,\mathbf{E}\bigl|[\phi^o * \zeta]_\tau\bigr|^2 + \bigl|x_\tau - [\phi^o * x]_\tau\bigr|^2,
$$
we immediately arrive at the following bound on the pointwise expected error:
$$
\Bigl(\mathbf{E}\bigl|x_\tau - [\phi^o * y]_\tau\bigr|^2\Bigr)^{1/2} \;\le\; \frac{\sigma\sqrt{1+\theta^2}\,\rho}{\sqrt{2m+1}}, \quad |\tau| \le m+n. \tag{5}
$$


In other words, simple signals are those for which there exists a linear estimator that (i) utilizes observations in the m-neighbourhood of a point, (ii) is invariant in the (m + n)-vicinity of the origin, and (iii) attains the pointwise risk of order m⁻¹ᐟ² in that vicinity. (For brevity, here we refer to the quantity E[|xt − x̂t|²]¹ᐟ² as the pointwise risk (at t ∈ Z) of the estimate x̂.) Parameters ρ, θ allow for a refined control of the risk and specify the bias-variance balance.

Now, assume that the only prior information about the signal to be recovered is that it is (m, n, ρ, θ)-simple with some known (m, n, ρ, θ). As we have just seen, this implies the existence of a convolution-type linear estimator x̂ᵒ = φᵒ ∗ y with good statistical performance. The question is whether we can use this information to "mimic" x̂ᵒ, i.e., construct an estimator of (xτ), |τ| ≤ n, with comparable statistical performance using only the available observations.

Answering this question is not straightforward. In order to build such an adaptive estimator, one could implement a cross-validation procedure by minimizing some observable proxy of the quadratic loss of the estimate, say, the ℓ2-norm of the residual ([y − ϕ ∗ y]τ), |τ| ≤ m + n, over the set of filters ϕ satisfying (3). However, it is well known that the set of filters satisfying (3) is too "massive" to allow for construction of an adaptive estimate with a risk bound similar to (5) even when ρ = 1.² As a result, all approaches to adaptive estimation in this case known to us impose some extra constraints on the class of filters, such as regularity [11], or sparsity in a certain basis [7], etc.

Nevertheless, surprisingly, adaptive convolution-type estimators with favorable statistical performance guarantees can be constructed. The key idea, going back to [20], is to pass to a "new oracle" with a characterization which better suits the goal of adaptive estimation. Namely, one can easily verify (cf., e.g., [15, Proposition 3]) that if a filter φᵒ ∈ Cm(Z) satisfies relations (3) and (4), then its autoconvolution ϕᵒ = φᵒ ∗ φᵒ ∈ C2m(Z) (with a twice larger support) satisfies their analogues
$$
\bigl\| F_{2m}[\varphi^o] \bigr\|_1 \;\le\; \frac{2\rho^2}{\sqrt{4m+1}}, \qquad
\bigl| x_\tau - [\varphi^o * x]_\tau \bigr| \;\le\; \frac{2\sqrt{2}\,\sigma\theta\rho^2}{\sqrt{4m+1}}, \quad |\tau| \le n; \tag{6}
$$
here Fn is the unitary Discrete Fourier Transform (DFT), Fn : Cn(Z) → C²ⁿ⁺¹,
$$
\bigl(F_n[x]\bigr)_k \;=\; \frac{1}{\sqrt{2n+1}} \sum_{|\tau|\le n} \exp\!\Bigl(\frac{2\pi i k\tau}{2n+1}\Bigr) x_\tau, \quad 1 \le k \le 2n+1.
$$

² While this statement appears self-evident to statisticians of older generations, younger researchers may expect an explanation. This is why we provide a brief discussion of the "naive estimate" in Sect. 4.3 of the appendix.
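The first relation in (6) is easy to check numerically. Below is a small self-contained sketch (our own illustration, with hypothetical parameter choices) that draws a random filter satisfying (3) with a prescribed ρ and verifies the ℓ1-bound on the DFT of its autoconvolution.

```python
import numpy as np

rng = np.random.default_rng(1)
m, rho = 25, 2.0

# Random filter on {-m, ..., m}, rescaled so that the l2 bound (3) holds with equality.
phi = rng.standard_normal(2 * m + 1) + 1j * rng.standard_normal(2 * m + 1)
phi *= rho / (np.sqrt(2 * m + 1) * np.linalg.norm(phi))

varphi = np.convolve(phi, phi)          # autoconvolution, supported on 4m+1 points
N = 4 * m + 1

# Unitary DFT of the autoconvolution (index ordering does not affect the l1-norm).
F_varphi = np.fft.fft(varphi) / np.sqrt(N)

lhs = np.sum(np.abs(F_varphi))          # ||F_2m[phi * phi]||_1
rhs = 2 * rho**2 / np.sqrt(N)           # right-hand side of the first bound in (6)
print(f"||F_2m[phi*phi]||_1 = {lhs:.4f} <= {rhs:.4f} ?", lhs <= rhs + 1e-12)
```

By Parseval, the left-hand side equals √(4m+1)·‖φ‖₂², so the check passes for any filter obeying (3), which is exactly the mechanism behind (6).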


While the new bounds are inflated (the additional factor ρ is present in both bounds), the bound (6) is essentially stronger than its counterpart ‖Fm[φᵒ]‖₁ ≤ ρ one could extract from (3) directly. Based on this observation, the authors studied in [14, 15, 20] a class of adaptive convolution-type "uniform-fit" estimators which correspond to filters obtained by minimizing the uniform norm of the Fourier-domain residual Fn[y − y ∗ ϕ], constrained (or penalized) by the ℓ1-norm of the DFT of the filter. Such estimators can be efficiently computed, since the corresponding filters are given as optimal solutions to well-structured convex optimization problems.

As is common in adaptive nonparametric estimation, one can measure the quality of an adaptive estimator by the factor, the "cost of adaptation", by which the risk of such an estimator exceeds that of the corresponding "oracle" estimator which the adaptive one is trying to "mimic". As it turns out, the "uniform-fit" estimators studied in [14, 15, 20] admit pointwise risk bounds similar to (5), with an extra factor Cρ³ log(m + n) as compared to (5) (see [15, Theorem 5]). On the other hand, there is a lower bound stating that the adaptation factor cannot be less than cρ√(log m) when m ≥ c′n (cf. [15, Theorem 2]), leaving a gap between these two bounds which may be quite significant when ρ is large. Furthermore, the choice of the optimization objective (uniform fit of the Fourier-domain residual) in such estimators was dictated by technical considerations allowing simpler control of the pointwise risk, and seems artificial when the estimation performance is measured by the ℓ2-loss.

Contributions. In this paper, we propose a new family of adaptive convolution-type estimators. These estimators utilize an adaptive filter which is obtained by minimizing the ℓ2-norm of the residual, constrained or penalized by the ℓ1-norm of the DFT of the filter. Similarly to uniform-fit estimators, the new estimators can be efficiently computed via convex optimization routines. We prove oracle inequalities for the ℓ2-loss of these estimators, which lead to improved risk bounds compared to the case of uniform-fit estimators. Note that signal simplicity, as per Definition 1, involves a special sort of time-invariance of the oracle estimate: the filter φᵒ ∈ Cm(Z) in Definition 1 is assumed to be "good" (cf. (4)) uniformly over |t| ≤ m + n, which can be understood as some kind of "approximate local shift-invariance" of the signal to be recovered. In fact, this property of the signal is operational when deriving the corresponding risk bounds for adaptive recoveries. In the present paper, in order to derive the oracle inequalities we replace the assumption of signal simplicity, as per Definition 1, with an explicit approximate (local) shift-invariance (ASI) assumption. In a nutshell, the new assumption states that the unknown signal admits (locally) a decomposition x = xS + ε, where xS belongs to an unknown shift-invariant linear subspace S ⊂ C(Z) of small dimension, and the residual component ε is small in ℓ2-norm or ℓ∞-norm. The remainder terms in the established oracle inequalities explicitly depend on the subspace dimension s = dim(S) and the magnitude κ of the residual.³


We also study the relationship between our ASI assumption and the notion of signal simplicity introduced in [20]:

• On one hand, approximately shift-invariant signals constitute a subclass of simple signals (in fact, the widest such subclass known to us to date). In particular, a "uniform" version of the ASI assumption, in which the residual component ε is bounded in ℓ∞-norm, implies signal simplicity (cf. Definition 1) with a simple dependence of the parameters ρ and θ of the class on the ASI parameters s and κ. This, in turn, allows to derive improved bounds for the pointwise and ℓ2-loss of the novel adaptive estimators over the class of signals satisfying the "uniform" version of the ASI assumption.
• On the other hand, all examples of simple signals in C(Z) known to us are those of signals close to solutions of low-order linear homogeneous difference equations, see [21]; such signals are close to small-dimensional shift-invariant subspaces. The new bounds on the ℓ2- and pointwise risk for such signals established in this work improve significantly over the analogous bounds obtained in [15, 21].

As an illustration, we apply the proposed approach to the problem of denoising a harmonic oscillation, a sum of complex sinusoids with arbitrary (unknown) frequencies. The known approaches [1, 37] to this problem are based on ideas from sparse recovery [10] and impose frequency separation conditions to obtain sharp statistical guarantees (see Sect. 4.3 for more details). In contrast, deriving near-optimal statistical guarantees for adaptive convolution-type estimators in this problem does not require this type of assumption. Preliminary versions of some results presented in this paper were announced in [33].

³ In hindsight, ASI is a natural generalization of the classical "regularity assumption" for signals on the regular grid. Indeed, consider signals which are discretizations of smooth functions; such signals have a very simple structure: they are "locally close" to a given small-dimensional subspace, that of small-degree polynomials. Here we extend the notion of regularity by allowing signals to be (locally) close to an unknown subspace of moderate dimension; we refer to [14, 21] for a detailed discussion of the relationship of the developed framework with nonparametric estimation of regular functions. Our standing (technical) assumption about (local) shift-invariance of the approximating subspace is operational: it allows for successful application of the machinery of linear filtering and the Fourier transform.
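The footnote's observation is easy to see in a toy computation. The sketch below (an illustration under assumptions of our own: the function, window length, and grid are arbitrary) measures how far a short window of a smooth sampled function lies from the span of low-degree polynomials, a shift-invariant subspace of dimension s.

```python
import numpy as np

def dist_to_poly_subspace(window_values, degree):
    """l2 distance of a signal window to the span of polynomials of the given degree."""
    tt = np.arange(window_values.size, dtype=float)
    V = np.vander(tt, degree + 1)                   # polynomial basis on the window
    coef, *_ = np.linalg.lstsq(V, window_values, rcond=None)
    return np.linalg.norm(window_values - V @ coef)

grid = np.linspace(0.0, 1.0, 2001)
f = np.sin(2 * np.pi * grid) * np.exp(-grid)        # a smooth function on [0, 1]

window = f[1000:1000 + 61]                          # 61 consecutive samples
for s in range(1, 5):
    # the subspace of polynomials of degree s-1 has dimension s and is shift-invariant
    print(f"s = {s}: distance = {dist_to_poly_subspace(window, s - 1):.2e}")
```

The distances drop rapidly with s, which is the local-closeness phenomenon the ASI assumption formalizes.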


Manuscript organization. We present the problem of adaptive interpolation and prediction and introduce the necessary notation in Sect. 2. In Sect. 3, we introduce the adaptive estimators and present oracle inequalities for their ℓ2-loss. Then we use these inequalities to derive guarantees for the ℓ2- and pointwise risks of adaptive estimates in Sect. 4. In particular, in Sect. 4.2 we discuss the structure of the classes of approximately shift-invariant signals over Z and show that such signals are close, in a certain sense, to complex exponential polynomials, i.e., solutions to linear homogeneous difference equations. We then specify statistical guarantees for adaptive interpolation and prediction of such signals; in particular, we establish new bounds for adaptive prediction of generalized harmonic oscillations, which are sums of complex sinusoids modulated by polynomials. Finally, in Sect. 4.3 we consider an application of the proposed estimates to the problem of full recovery of a generalized (or usual) harmonic oscillation, and compare our approach against the state of the art for this problem. To streamline the presentation, we defer technical proofs to the appendix.

2 Problem Description

2.1 Notation

We follow the "Matlab convention" for matrices: [A, B] and [A; B] denote, respectively, the horizontal and vertical concatenations of two matrices of compatible dimensions. Unless explicitly stated otherwise, all vectors are column vectors. Given a signal x ∈ C(Z) and n1, n2 ∈ Z such that n1 ≤ n2, we define the "slicing" map
$$
x_{n_1}^{n_2} := [x_{n_1}\; \ldots\; x_{n_2}]. \tag{7}
$$

In what follows, when it is unambiguous, we use the shorthand notation τ ≤ n (τ < n, |τ| ≤ n, etc.) for the set of integers satisfying the inequality in question.

Convolution and filters. Recall that C(Z) is the linear space of all two-sided complex sequences, and Cn(Z) denotes the space of such sequences which vanish outside [−n, ..., n]. We call the smallest m ∈ Z+ such that φ ∈ Cm(Z) the width of φ and denote it w(φ). Note that (7) allows to identify Cn(Z) with the complex vector space C²ⁿ⁺¹. It is also convenient to identify x ∈ C(Z) with its Laurent series $x(z) = \sum_j x_j z^j$. The (discrete) convolution ϕ ∗ ψ ∈ C(Z) of ϕ, ψ ∈ C(Z) is defined as
$$
[\varphi * \psi]_t \;:=\; \sum_{\tau\in\mathbb{Z}} \varphi_\tau\, \psi_{t-\tau}
$$

and is, clearly, a commutative operation. One has [ϕ ∗ ψ](z) = ϕ(z)ψ(z) with w(ϕ ∗ ψ) ≤ w(ϕ) + w(ψ). In what follows, Δ stands for the forward shift operator on C(Z): [Δx]t = xt−1, and Δ⁻¹ for its inverse, the backward shift. Then ϕ ∗ ψ = ϕ(Δ)ψ. Given ϕ ∈ C(Z) with w(ϕ) < ∞ and observations y = (yτ), we can associate with ϕ the linear estimate x̂ of x ∈ C(Z) of the form
$$
\widehat{x} \;=\; \varphi * y \;=\; \varphi(\Delta)\, y \tag{8}
$$
(x̂ is simply a kernel estimate over the grid Z corresponding to a finitely supported discrete kernel ϕ). The just defined "convolution" (kernel) estimates are referred to as linear filters in signal processing; with some terminology abuse, we also call filters the elements of C(Z) with finitely many nonzero entries.

Norms. For x, y ∈ C(Z) we denote with ⟨x, y⟩ the Hermitian inner product $\langle x, y\rangle = \sum_{\tau\in\mathbb{Z}} \overline{x_\tau}\, y_\tau$, $\overline{x_\tau}$ being the complex conjugate of $x_\tau$; for n ∈ Z+ we put
$$
\langle x, y\rangle_n \;=\; \sum_{|\tau|\le n} \overline{x_\tau}\, y_\tau.
$$

Given p ≥ 1 and n ∈ Z+ we define semi-norms on C(Z) as follows:
$$
\|x\|_{n,p} \;:=\; \Bigl(\sum_{|\tau|\le n} |x_\tau|^p\Bigr)^{1/p},
$$
with ‖x‖_{n,∞} = max_{|τ|≤n} |xτ|. When such notation is unambiguous, we also use ‖·‖p to denote the "usual" ℓp-norm on C(Z), e.g., ‖x‖p = ‖x‖_{n,p} whenever w(x) ≤ n.

We define the (unitary) Discrete Fourier Transform (DFT) operator Fn : Cn(Z) → C²ⁿ⁺¹ by
$$
\bigl(F_n[x]\bigr)_k \;=\; \frac{1}{\sqrt{2n+1}} \sum_{|\tau|\le n} \exp\!\Bigl(-\frac{i 2\pi k\tau}{2n+1}\Bigr) x_\tau, \quad 1 \le k \le 2n+1.
$$
The unitarity of the DFT implies the Parseval identities: for any x, y ∈ C(Z) and n ∈ Z+, one has
$$
\langle x, x\rangle_n = \bigl\langle F_n[x], F_n[x]\bigr\rangle, \qquad \|x\|_{n,2} = \bigl\|F_n[x]\bigr\|_2. \tag{9}
$$
In what follows, c, C, C′, etc., stand for absolute constants whose exact values can be recovered from the proofs. We use the O(·) notation: for two functions f, g of the same argument t, f = O(g) means that there exists C < ∞ such that |f(t)| ≤ C|g(t)| for all t in the domain of f.
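For concreteness, here is a direct implementation of the centered, unitary DFT Fn and a numerical check of the Parseval identity (9). This is our own illustrative sketch, not code from the paper; the function name and test sizes are arbitrary.

```python
import numpy as np

def F(x_centered):
    """Unitary DFT of a sequence indexed by tau = -n, ..., n (length 2n+1 array)."""
    N = x_centered.size
    n = (N - 1) // 2
    tau = np.arange(-n, n + 1)
    k = np.arange(1, N + 1).reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * tau / N) / np.sqrt(N)   # unitary DFT matrix
    return W @ x_centered

rng = np.random.default_rng(2)
n = 16
x = rng.standard_normal(2 * n + 1) + 1j * rng.standard_normal(2 * n + 1)

Fx = F(x)
print("||x||_{n,2}  :", np.linalg.norm(x))
print("||F_n[x]||_2 :", np.linalg.norm(Fx))   # equal up to rounding, by (9)
```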

2.2 Problem Statement

We consider the problem of estimating the signal x ∈ C(Z) given noisy observations yτ := xτ + σζτ on the segment −L ≤ τ ≤ L (cf. (1)); here ζt ∼ CN(0, 1) are i.i.d. standard complex-valued Gaussian random variables. Here we discuss different settings of this problem:


• Signal interpolation, in which, when estimating xt, one can use observations both on the left and on the right of t. For the sake of simplicity, we consider the "symmetric" version of this problem, where the objective is, given m ≤ L, to build an estimate x̂t = [ϕ̂ ∗ y]t of xt for |t| ≤ L − m, with ϕ̂ ∈ Cm(Z) depending on observations.
• Signal prediction, in which, when computing the estimate of xt, we are allowed to use observations only on one side of t, e.g., observations for τ ≤ t − h where h ∈ Z+ is a given prediction horizon. For the sake of clarity, in this paper we only consider the version of this problem with h = 0 (often referred to as filtering in the signal processing literature); the general situation can be treated in the same way at the expense of more involved notation. In other words, we are looking to build a data-driven filter ϕ̂ ∈ Cm(Z) and the "left" estimate of xt, −L + 2m ≤ t ≤ L (utilizing observations yτ, τ ≤ t),
$$
\widehat{x}_t \;=\; \sum_{\tau=0}^{2m} \widehat{\varphi}_{\tau-m}\, y_{t-\tau} \;=\; \sum_{s=-m}^{m} \widehat{\varphi}_s\, y_{t-s-m} \;=\; [\widehat{\varphi} * (\Delta^m y)]_t.
$$
The corresponding "right" estimate of xt, −L ≤ t ≤ L − 2m (utilizing observations yτ, τ ≥ t), writes
$$
\widehat{x}_t \;=\; \sum_{\tau=0}^{2m} \widehat{\varphi}_{m-\tau}\, y_{t+\tau} \;=\; \sum_{s=-m}^{m} \widehat{\varphi}_s\, y_{t-s+m} \;=\; [\widehat{\varphi} * (\Delta^{-m} y)]_t.
$$
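The mechanics of the "left" estimate are simple: x̂t is a causal convolution of the last 2m+1 observations. The following sketch, entirely our own toy setup (a uniform averaging filter and arbitrary sizes), only illustrates the indexing of the one-sided estimate; any filter produced by the procedures of Sect. 3 can replace it.

```python
import numpy as np

rng = np.random.default_rng(3)
L, m, sigma = 150, 8, 0.3
t = np.arange(-L, L + 1)
x = np.exp(1j * 0.7 * t)
y = x + sigma * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))

phi = np.full(2 * m + 1, 1.0 / (2 * m + 1), dtype=complex)   # phi_s, s = -m, ..., m

# x_hat_t = sum_{tau=0}^{2m} phi_{tau-m} y_{t-tau} = [phi * (Delta^m y)]_t :
# a causal estimate based on y_{t-2m}, ..., y_t only.
x_hat = np.full(t.size, np.nan, dtype=complex)
for i in range(2 * m, t.size):
    window = y[i - 2 * m: i + 1]            # y_{t-2m}, ..., y_t
    x_hat[i] = np.sum(phi[::-1] * window)   # phi_{tau-m} multiplies y_{t-tau}

valid = slice(2 * m, t.size)
print("RMS error:", np.linalg.norm(x_hat[valid] - x[valid]) / np.sqrt(t.size - 2 * m))
```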

Given a set X of signals, m, n ∈ Z+, observations yτ for |τ| ≤ L = m + n, and the target estimation domain Dn of length 2n + 1 (e.g., Dn = {−n, ..., n} in the case of signal interpolation, or Dn = {−n + m, ..., n + m} in the case of filtering), we quantify the accuracy of an estimate x̂ using two types of risks:

– the maximal over X ℓ2 (integral) α-risk: the smallest maximal over x ∈ X (1 − α)-confidence ball in ‖·‖2-norm on Dn centered at x̂:
$$
\mathrm{Risk}_{D_n,2,\alpha}(\widehat{x}\,|\,\mathcal{X}) \;=\; \inf\Bigl\{ r :\; \sup_{x\in\mathcal{X}} \mathrm{Prob}\Bigl\{ \Bigl(\sum_{t\in D_n} \bigl|[\widehat{x}-x]_t\bigr|^2\Bigr)^{1/2} \ge r \Bigr\} \le \alpha \Bigr\};
$$
– the maximal over X pointwise α-risk: the smallest maximal over x ∈ X and t ∈ Dn (1 − α)-confidence interval for xt centered at x̂t:
$$
\mathrm{Risk}_{D_n,\alpha}(\widehat{x}\,|\,\mathcal{X}) \;=\; \inf\Bigl\{ r :\; \sup_{x\in\mathcal{X}} \mathrm{Prob}\bigl\{ \bigl|[\widehat{x}-x]_t\bigr| \ge r \bigr\} \le \alpha \;\;\forall\, t\in D_n \Bigr\}.
$$


When n = 0, the estimation interval Dn = {t} is a singleton, and the latter definition becomes that of the "usual" worst-case over X (1 − α)-confidence interval for xt:
$$
\mathrm{Risk}_{\alpha}(\widehat{x}_t\,|\,\mathcal{X}) \;=\; \inf\bigl\{ r :\; \sup_{x\in\mathcal{X}} \mathrm{Prob}\{ |[\widehat{x}-x]_t| \ge r \} \le \alpha \bigr\}.
$$

3 Oracle Inequalities for the ℓ2-Loss of Adaptive Estimators

3.1 Adaptive Signal Interpolation

Adaptive recovery. Given m, n ∈ Z+, L = m + n, and ϱ̄ > 0, consider the optimization problem
$$
\min_{\varphi\in\mathcal{C}_m(\mathbb{Z})} \;\|y - \varphi * y\|^2_{n,2} \quad \text{subject to} \quad \|F_m[\varphi]\|_1 \le \frac{\bar{\varrho}}{\sqrt{2m+1}}. \tag{Con}
$$
Note that (Con) is clearly solvable; we denote ϕ̂_con its optimal solution and refer to x̂_con = ϕ̂_con ∗ y as the constrained (least-squares) estimate of x. Computing ϕ̂_con requires setting the problem parameter ϱ̄ which, ideally, would be set proportional to the ℓ1-norm of the DFT of some ideal (oracle) filter, or a non-trivial upper bound on it. Since this is not often possible in practice, we also consider the penalized estimator x̂_pen = ϕ̂_pen ∗ y, where, for λ > 0, ϕ̂_pen ∈ Cm(Z) is selected as an optimal solution to the (solvable) problem
$$
\min_{\varphi\in\mathcal{C}_m(\mathbb{Z})} \;\|y - \varphi * y\|^2_{n,2} + \sigma^2\lambda^2(2m+1)\,\|F_m[\varphi]\|_1^2. \tag{Pen}
$$
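(Con) is a small second-order cone program and can be handed to an off-the-shelf convex solver. The sketch below is an illustrative formulation under our own conventions (a finite observation window, explicit convolution and DFT matrices, and the cvxpy modeling library); it is not the authors' implementation and the function name is hypothetical.

```python
import numpy as np
import cvxpy as cp

def constrained_filter(y, m, n, rho_bar):
    """Sketch of (Con): min ||y - phi*y||_{n,2}^2  s.t.  ||F_m[phi]||_1 <= rho_bar/sqrt(2m+1).

    y is observed on tau = -L, ..., L with L = m + n, passed as a complex array of length 2L+1.
    """
    L = m + n
    k_m = 2 * m + 1
    # Convolution matrix: row t (|t| <= n) has entries y_{t-tau}, |tau| <= m.
    A = np.array([[y[(t - tau) + L] for tau in range(-m, m + 1)]
                  for t in range(-n, n + 1)])
    # Unitary DFT matrix acting on filters supported on {-m, ..., m}.
    taus = np.arange(-m, m + 1)
    ks = np.arange(1, k_m + 1).reshape(-1, 1)
    F = np.exp(-2j * np.pi * ks * taus / k_m) / np.sqrt(k_m)

    phi = cp.Variable(k_m, complex=True)
    resid = y[L - n: L + n + 1] - A @ phi
    prob = cp.Problem(cp.Minimize(cp.sum_squares(resid)),
                      [cp.norm1(F @ phi) <= rho_bar / np.sqrt(k_m)])
    prob.solve()          # any SOCP-capable solver shipped with cvxpy will do
    return phi.value

# Hypothetical usage: phi_hat = constrained_filter(y, m, n, rho_bar); then x_hat = phi_hat * y via (2).
```

The penalized problem (Pen) is obtained by moving the ℓ1-constraint into the objective as a squared penalty with weight σ²λ²(2m+1).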

Instead of knowing ϱ̄, some knowledge of the noise variance σ² is required to tune this estimator. Hence, the practical recommendation is to use (Pen) when σ² is known or can be estimated.

Oracle inequalities for the ℓ2-loss. Despite a striking similarity with Lasso estimators [2, 5, 38], the proposed estimators are of quite a different nature. First of all, solving the optimization problems (Con) and (Pen) allows to recover a filter but not the signal itself, and this filter is generally not sparse, neither in the time nor in the Fourier domain (unless the signal to recover is a sum of harmonic oscillations with frequencies on the "DFT grid"). Second, the equivalent of the "regression matrices" involved in these procedures cannot be assumed to satisfy any kind of "restricted incoherence" conditions usually


imposed to prove statistical properties of "classical" ℓ1-recovery routines (see [4, Chap. 6] for a comprehensive overview of such conditions). Moreover, being constructed from noisy observations, these matrices depend on the noise, which poses some extra difficulties in the analysis of the adaptive estimates, in particular, leading to the necessity of imposing some restrictions on the signal class.

In what follows, when analyzing adaptive estimators we constrain the unknown signal x on the interval |τ| ≤ L to be "close" to some shift-invariant linear subspace S. Specifically, consider the following assumption:

Assumption 3.1 (Approximate local shift-invariance) We suppose that x ∈ C(Z) admits a decomposition x = xS + ε. Here, xS ∈ S where S is some (unknown) shift-invariant linear subspace of C(Z) with s := dim(S) ≤ 2n + 1, and ε is bounded in the ℓ2-norm: for some κ ≥ 0 one has
$$
\bigl\| \Delta^{-\tau}\varepsilon \bigr\|_{n,2} \;\le\; \kappa\sigma, \quad |\tau| \le m. \tag{10}
$$

We denote with X_{m,n}(s, κ) the class of such signals.

Remark Assumption 3.1 merits some comments. Observe that X_{m,n}(s, κ) is, in fact, the subset of C(Z) comprising sequences which are close, in the sense of (10), to some s-dimensional shift-invariant subspace of C(Z). Similarly to Assumption 3.1, signal "simplicity" as set by Definition 1 also postulates a kind of "local time-invariance" of the signal: it states that there exists a linear time-invariant filter which reproduces the signal "well" on a certain interval. However, the actual relationship between the two notions is rather intricate and will be discussed in Sect. 4.

Letting the signal be close, in ℓ2-norm, to a shift-invariant subspace, instead of simply belonging to the subspace, extends the set of signals and allows to address nonparametric situations. As an example, consider discretizations over a uniform grid in [0, 1] of functions from a Sobolev ball. Locally, such signals are close in ℓ2-norm to polynomials on the grid, which satisfy a linear homogeneous difference equation and hence belong to a shift-invariant subspace of small dimension [21]. Other classes of signals for which Assumption 3.1 holds are discretizations of complex sinusoids modulated with smooth functions and signals satisfying linear difference inequalities [21].

We now present oracle inequalities which relate the ℓ2-loss of the adaptive filter ϕ̂ with the best loss of any feasible solution ϕ to the corresponding optimization problem. These inequalities, interesting in their own right, are also operational when deriving bounds for the pointwise and ℓ2-losses of the proposed estimators. We first state the result for the constrained estimator.

Theorem 1 Let s, m, n ∈ Z+, κ ≥ 0. Suppose that x ∈ X_{m,n}(s, κ) and ϕ is feasible for (Con). Let ϕ̂_con be an optimal solution to (Con) with some ϱ̄ > 1, and let x̂_con = ϕ̂_con ∗ y. Then for any α ∈ ]0, 1[ it holds with probability at least 1 − α:
$$
\|x - \widehat{x}_{\mathrm{con}}\|_{n,2} \;\le\; \|x - \varphi * y\|_{n,2} + C\sigma\Bigl[\bar{\varrho}\,(\kappa^2_{m,n}+1)\log[(m+n)/\alpha] + \bar{\varrho}\,\kappa\log[1/\alpha] + s\Bigr]^{1/2} \tag{11}
$$
where κ_{m,n} := (2n + 1)/(2m + 1).

The counterpart of Theorem 1 for the penalized estimator is as follows.

Theorem 2 Let s, m, n ∈ Z+, κ, λ > 0. Suppose that x ∈ X_{m,n}(s, κ) and ϕ ∈ Cm(Z) with ϱ(ϕ) = √(2m+1) ‖Fm[ϕ]‖₁. Let ϕ̂_pen be an optimal solution to (Pen). Then for any α ∈ ]0, 1[ the estimate x̂_pen = ϕ̂_pen ∗ y satisfies with probability at least 1 − α:
$$
\|x - \widehat{x}_{\mathrm{pen}}\|_{n,2} \;\le\; \|x - \varphi * y\|_{n,2} + \sigma\Bigl[\lambda\,\varrho(\varphi) + C_1 Q_1/\lambda + C_2\, Q_2(\varphi)^{1/2}\Bigr] \tag{12}
$$
where
$$
\begin{aligned}
Q_1 &= Q_1(\kappa, \kappa_{m,n}, \alpha) = (\kappa^2_{m,n}+1)\log[(m+n)/\alpha] + \kappa\log[1/\alpha] + 1,\\
Q_2(\varphi) &= Q_2(\varphi, s, \kappa, \alpha) = \varrho(\varphi)\log[1/\alpha] + \kappa\log[1/\alpha] + s.
\end{aligned} \tag{13}
$$
In particular, when setting λ = Q₁^{1/2} we obtain
$$
\|x - \widehat{x}_{\mathrm{pen}}\|_{n,2} \;\le\; \|x - \varphi * y\|_{n,2} + C\sigma\Bigl[Q_1^{1/2}\,\varrho(\varphi) + Q_2(\varphi)^{1/2}\Bigr].
$$
One may observe that, ideally, ϱ̄ in (Con) should be selected as
$$
\varrho(\varphi^o) \;=\; \sqrt{2m+1}\,\bigl\|F_m[\varphi^o]\bigr\|_1
$$
where ϕᵒ is an ideal "oracle filter," while the penalty parameter in (Pen) would be set to λ = [C₁Q₁/ϱ(ϕᵒ)]^{1/2}. These choices would result in the same remainder terms in (11) and (12) (of order σ(ϱ(ϕᵒ)(1 + κ) + s)^{1/2} up to logarithmic factors). Obviously, this choice cannot be implemented since the value ϱ(ϕᵒ) is unknown. Nevertheless, Theorem 2 provides us with an implementable choice of λ that still results in an oracle inequality, at the expense of a larger remainder term which now scales as σ[ϱ(ϕᵒ)√(1 + κ) + √s].
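The implementable tuning λ = Q₁^{1/2} only needs m, n, κ, and the target confidence level. A small helper, written against the form of Q₁ as reconstructed in (13) above and therefore to be taken as an illustration under that assumption (the absolute constants are omitted, as in the statement):

```python
import numpy as np

def penalization_parameter(m, n, kappa, alpha):
    """lambda = Q_1^{1/2} with Q_1 as in (13); kappa and alpha are user-supplied."""
    kappa_mn = (2 * n + 1) / (2 * m + 1)
    Q1 = (kappa_mn**2 + 1) * np.log((m + n) / alpha) + kappa * np.log(1 / alpha) + 1
    return np.sqrt(Q1)

# e.g. penalization_parameter(m=32, n=256, kappa=1.0, alpha=0.05)
```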

3.2 Adaptive Signal Filtering

Here we consider the "left" version of the problem in which we are given observations (yτ) on the interval −L ≤ τ ≤ L, and our objective is to build a (left)


convolution estimate x̂t = [ϕ̂ ∗ (Δᵐy)]t of xt, t ∈ {−L + 2m, ..., L}, using an observation-driven filter ϕ̂ ∈ Cm(Z). Clearly, the treatment of the "right" version of the problem is completely analogous up to obvious modifications. Let us consider the following counterparts of (Con) and (Pen):
$$
\min_{\varphi\in\mathcal{C}_m(\mathbb{Z})} \;\bigl\|\Delta^{-m}(y - \varphi * \Delta^m y)\bigr\|^2_{n,2} \quad \text{subject to} \quad \|F_m[\varphi]\|_1 \le \frac{\bar{\varrho}}{\sqrt{2m+1}}, \tag{Con$^+$}
$$
$$
\min_{\varphi\in\mathcal{C}_m(\mathbb{Z})} \;\bigl\|\Delta^{-m}(y - \varphi * \Delta^m y)\bigr\|^2_{n,2} + \sigma^2\lambda^2(2m+1)\,\|F_m[\varphi]\|_1^2. \tag{Pen$^+$}
$$

Same as in the interpolation setting, both problems are clearly solvable, so their respective optimal solutions ϕ̂_con and ϕ̂_pen are well-defined. A close inspection of the proofs of Theorems 1 and 2 shows that their results remain valid, with obvious adjustments, in the setting of this section. Namely, we have the following analog of those statements.

Proposition 1 Let s, m, n ∈ Z+, κ ≥ 0, and x ∈ X_{m,n}(s, κ); let α ∈ ]0, 1[.
1. Let ϱ̄ > 1 be fixed, ϕ be feasible for (Con⁺), and let x̂_con = ϕ̂_con ∗ Δᵐy where ϕ̂_con is an optimal solution to (Con⁺); then with probability at least 1 − α the estimate x̂_con satisfies
$$
\bigl\|\Delta^{-m}(x - \widehat{x}_{\mathrm{con}})\bigr\|_{n,2} \le \bigl\|\Delta^{-m}(x - \varphi * \Delta^m y)\bigr\|_{n,2} + C\sigma\Bigl[\bar{\varrho}\,(\kappa^2_{m,n}+1)\log[(m+n)/\alpha] + \bar{\varrho}\,\kappa\log[1/\alpha] + s\Bigr]^{1/2}.
$$
2. Let ϕ ∈ Cm(Z) with ϱ(ϕ) = √(2m+1)‖Fm[ϕ]‖₁, and let x̂_pen = ϕ̂_pen ∗ Δᵐy where ϕ̂_pen is an optimal solution to (Pen⁺) with λ > 0; then x̂_pen satisfies with probability at least 1 − α
$$
\bigl\|\Delta^{-m}(x - \widehat{x}_{\mathrm{pen}})\bigr\|_{n,2} \le \bigl\|\Delta^{-m}(x - \varphi * \Delta^m y)\bigr\|_{n,2} + \sigma\Bigl[\lambda\,\varrho(\varphi) + C_1 Q_1/\lambda + C_2\,Q_2(\varphi)^{1/2}\Bigr]
$$
where Q1 and Q2(ϕ) are defined in (13).

4 Risk Bounds for Adaptive Recovery Under ASI

In order to transform the oracle inequalities of Theorems 1, 2 and Proposition 1 into risk bounds for adaptive recovery, we need to establish bounds for the oracle risks on the classes of approximately shift-invariant signals. We start with the interpolation setting.


4.1 Risk Bounds for Adaptive Signal Interpolation

The results of this section are direct corollaries of the following statement, which may be of independent interest.

Proposition 2 Let S be a shift-invariant subspace of C(Z) of dimension s ≤ m + 1. Then there exists a filter φᵒ ∈ Cm(Z) such that for all x ∈ S one has x = φᵒ ∗ x and
$$
\|\phi^o\|_2 \;\le\; \sqrt{\frac{2s}{2m+1}}.
$$
In other words, signals x ∈ S are (m, n, ρ, 0)-simple in the sense of Definition 1, for any n ∈ Z+ and m ≥ s − 1, with ρ = √(2s) and θ = 0.

When combined with Theorems 1 and 2, Proposition 2 implies the following bound on the integral risk of adaptive recovery.

Proposition 3 Let s, m, n ∈ Z+, m ≥ 2s − 1, κ ≥ 0, and let Dn = {−n, ..., n}.
(i) Assume that x̂_con = ϕ̂_con ∗ y where ϕ̂_con is an optimal solution to (Con) with some ϱ̄ ≥ 4s. Then for any α ∈ ]0, 1/2],
$$
\mathrm{Risk}_{D_n,2,\alpha}\bigl(\widehat{x}_{\mathrm{con}}\,\big|\,\mathcal{X}_{m,n}(s,\kappa)\bigr) \;\le\; C\,\psi^{\alpha}_{m,n}(\sigma, s, \kappa; \bar{\varrho})
$$
where
$$
\psi^{\alpha}_{m,n}(\sigma, s, \kappa; \bar{\varrho}) \;=\; \sigma s\bigl(\kappa_{m,n}\log[1/\alpha] + \kappa\bigr) + \sigma\Bigl[\bar{\varrho}\,(\kappa^2_{m,n}+1)\log[(m+n)/\alpha] + \bar{\varrho}\,\kappa\log[1/\alpha] + s\Bigr]^{1/2}.
$$
In particular, when ϱ̄ ≤ C′s is chosen in (Con), one obtains
$$
\mathrm{Risk}_{D_n,2,\alpha}\bigl(\widehat{x}_{\mathrm{con}}\,\big|\,\mathcal{X}_{m,n}(s,\kappa)\bigr) \;\le\; C\,\overline{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa) \tag{14}
$$
with
$$
\overline{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa) \;=\; \sigma s\bigl(\kappa_{m,n}\log[1/\alpha] + \kappa\bigr) + \sigma\Bigl[s(\kappa^2_{m,n}+1)\log[(m+n)/\alpha] + s\kappa\log[1/\alpha] + s\Bigr]^{1/2}.
$$
(ii) Let λ = Q₁^{1/2} with Q1 as defined in (13), and let x̂_pen = ϕ̂_pen ∗ y where ϕ̂_pen is an optimal solution to (Pen). Then for any α ∈ (0, 1/2],
$$
\mathrm{Risk}_{D_n,2,\alpha}\bigl(\widehat{x}_{\mathrm{pen}}\,\big|\,\mathcal{X}_{m,n}(s,\kappa)\bigr) \;\le\; C\,\widetilde{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa)
$$
where
$$
\widetilde{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa) \;=\; \sigma s\bigl(\kappa_{m,n}\log[1/\alpha] + \kappa\bigr) + \sigma s(\kappa_{m,n}+1)\log[(m+n)/\alpha].
$$

We are now ready to derive bounds for the pointwise risk of the adaptive estimates described in the previous section. To establish such bounds we need to replace Assumption 3.1 with a somewhat stronger uniform analog.

Assumption 4.1 (Approximate locally uniform shift-invariance) Let n ≥ m ∈ Z+. We suppose that x ∈ C(Z) admits a decomposition x = xS + ε. Here xS ∈ S where S is some (unknown) shift-invariant linear subspace of C(Z) with s := dim(S) ≤ 2n + 1, and ε is uniformly bounded: for some κ ≥ 0 one has
$$
|\varepsilon_\tau| \;\le\; \frac{\kappa\sigma}{\sqrt{2n+1}}, \quad |\tau| \le n + m. \tag{15}
$$

We denote by X̄_{m,n}(s, κ) the class of such signals. Observe that if x ∈ X̄_{m,n}(s, κ) then also x ∈ X_{m,n}(s, κ). Therefore, the bounds of Proposition 3 also hold true for the risk of adaptive recovery on X̄_{m,n}(s, κ). Furthermore, bound (15) of Assumption 4.1 now leads to the following bounds for the pointwise risk of the recoveries x̂_con and x̂_pen.

Proposition 4 Let s, m, n ∈ Z+ with m ≥ 2s − 1 and n ≥ ⌊m/2⌋ (here ⌊·⌋ stands for the integer part), κ ≥ 0; let also D_{n,m} = {−n + ⌊m/2⌋, ..., n − ⌊m/2⌋}.
(i) Let x̂_con = ϕ̂_con ∗ y where ϕ̂_con is an optimal solution to (Con) with ϱ̄ ∈ [4s, Cs] for some C ≥ 4.⁴ Then for any α ∈ ]0, 1/2]
$$
\mathrm{Risk}_{D_{n,m},\alpha}\bigl(\widehat{x}_{\mathrm{con}}\,\big|\,\overline{\mathcal{X}}_{m,n}(s,\kappa)\bigr) \;\le\; C\,\varsigma^{\alpha}_{m,n}(\sigma, s, \kappa) \tag{16}
$$
where
$$
\varsigma^{\alpha}_{m,n}(\sigma, s, \kappa) \;=\; \sqrt{\frac{s}{2m+1}}\,\overline{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa)
+ \frac{\sqrt{s}\,\sigma}{\sqrt{2m+1}}\Bigl(\sqrt{s}\,\kappa + \sqrt{\log[(2m+1)/\alpha]} + \sqrt{s\log[1/\alpha]}\Bigr)
\;\le\; C'\,\frac{s\sigma}{\sqrt{2m+1}}\Bigl(\kappa_{m,n}\sqrt{s\log[1/\alpha]} + \kappa + \kappa_{m,n}\sqrt{\log[(m+n)/\alpha]}\Bigr).
$$
(ii) Let x̂_pen = ϕ̂_pen ∗ y where ϕ̂_pen is an optimal solution to (Pen) with λ = Q₁^{1/2}, Q1 being defined in (13). Then for any α ∈ (0, 1/2]
$$
\mathrm{Risk}_{D_{n,m},\alpha}\bigl(\widehat{x}_{\mathrm{pen}}\,\big|\,\overline{\mathcal{X}}_{m,n}(s,\kappa)\bigr) \;\le\; C\,\widetilde{\varsigma}^{\alpha}_{m,n}(\sigma, s, \kappa)
$$
where
$$
\widetilde{\varsigma}^{\alpha}_{m,n}(\sigma, s, \kappa) \;=\; \sqrt{\frac{s}{2m+1}}\,\widetilde{\psi}^{\alpha}_{m,n}(\sigma, s, \kappa)
+ \frac{\sqrt{s}\,\sigma}{\sqrt{2m+1}}\Bigl(\sqrt{s}\,\kappa + \sqrt{\log[(2m+1)/\alpha]} + \sqrt{s\log[1/\alpha]}\Bigr)
\;\le\; \frac{C'\,s\sigma}{\sqrt{2m+1}}\Bigl(\sqrt{s}\,\kappa_{m,n}\sqrt{\log[1/\alpha]} + \kappa + \kappa\sqrt{\log[1/\alpha]} + \kappa_{m,n}\sqrt{\log[(m+n)/\alpha]}\Bigr).
$$

⁴ For the sake of brevity, here we only present the result for the constrained recovery with ϱ̄ of order s.

Remark The above bounds for the pointwise risk of adaptive estimates may be compared against the available lower bound and the bounds for the risk of the uniform-fit adaptive estimate in the case where the signal to recover is a sum of s complex sinusoids. In this situation, [15, Theorem 2] states the lower bound $c\sigma s\sqrt{\log m / m}$ for the pointwise risk of estimation, with the upper bound
$$
O\Bigl(\sigma s\,\log^{3}[s]\,\sqrt{\tfrac{\log m}{m}}\Bigr)
$$
up to a logarithmic in α factor (cf. [15, Sect. 4]). Since the signal in question belongs to a 2s-dimensional shift-invariant subspace of C(Z), the bound on the pointwise risk in Proposition 4 results (recall that we are in the situation of κ = 0) in the bound
$$
O\Bigl(\sigma s\,\sqrt{\tfrac{s + \log m}{m}}\Bigr)
$$
for the adaptive estimates x̂_con and x̂_pen, with a significantly improved dependence on s.

4.2 Risk Bounds for Adaptive Signal Filtering Our next goal is to bound the risk of the constrained and penalized adaptive filters. Recall that in order to obtain the corresponding bounds in the interpolation setting we first established the result of Proposition 2 which allows to bound the error of the oracle filter on any s-dimensional shift-invariant subspace of C(Z). This result, along with oracle inequalities of Theorems 1 and 2, directly led us to the bounds for the risk of adaptive interpolation estimators. In order to reproduce the derivation in the previous section, we first need to establish a fact similar to Proposition 2,

Adaptive Denoising of Signals with Local Shift-Invariant Structure

19

which would guarantee existence of a predictive filter of a small 2 -norm exactly reproducing all signals from any shift-invariant subspace of C(Z). However, as we will see in an instant, the prediction case is rather different from the interpolation case: generally, a “good predictive filter” one may look for—a reproducing predictive filter of a small norm—simply does not exist in the case of prediction. Moreover, the analysis of situations where such filter does exist is quite different from the simple proof of Proposition 2. This is why, before returning to our original problem, it is useful to get a better understanding of the structure of shift-invariant subspaces of C(Z). Characterizing shift-invariant subspaces of C(Z) We start with the following Proposition 5 The solution set of a homogeneous linear difference equation [ p()x]t =

s 

! pτ xt−τ

= 0, t ∈ Z,

(17)

τ =0

with a characteristic polynomial p(z) = 1 + p1 z + ... + ps z s is a shift-invariant subspace of C(Z) of dimension at most s. Conversely, any shift-invariant subspace of C(Z) of dimension s is the solution set of a difference equation of the form (17) with deg( p) = s; such polynomial is unique if normalized by p(0) = 1. Recall that the set of solutions of equation (17) is spanned by exponential polynomials. Namely, let z k , for k = 1, ..., r ≤ s, be the distinct roots of p(z) with corresponding multiplicities m k , and let ωk ∈ C be such that z k = e−iωk . Then solutions to (17) are exactly sequences of the form xt =

r 

qk (t)eiωk t

k=1

where qk (·) are arbitrary polynomials of deg(qk ) = m k − 1. For instance, discretetime polynomials of degree s − 1 satisfy (17) with p(z) = (1 − z)s ; another example is that of harmonic oscillations with given (all distinct) ω1 , ..., ωs ∈ [0, 2π[, xt =

s 

qk eiωk t ,

q ∈ Cs ,

(18)

k=1

" which satisfy (17) with p(z) = sk=1 (1 − eiωk z). Thus, the set of complex harmonic oscillations with fixed frequencies ω1 , ..., ωs is an s-dimensional shift-invariant subspace. In view of the above, it is now clear that simply belonging to a shift-invariant subspace does not guarantee that a signal x can be reproduced by a predictive filter of a small 2 -norm. For instance, given r ∈ C, |r | > 1, consider signals from the parametric family

20

Z. Harchaoui et al.

Xr = {x ∈ C(Z) : xτ = βr τ , β ∈ C}. Here Xr is a one-dimensional shift-invariant subspace of C(Z)—the solution set of the equation (1 − r )x = 0. Clearly, for x ∈ Xr xt cannot be estimated consistently using noisy observations on the left of t (cf. [35]), and we cannot expect a “good” predictive filter to exist for all x ∈ Xr . The above example is representative of the difficulties arising when predicting signals from shift-invariant subspaces of C(Z): the characteristic polynomial of the associated difference equation is unstable—its root z = 1/r lies inside the (open) unit disk. Therefore, to be able to build good “left” predictive filters, we need to reduce the class of signals to solutions of equations (17) with stable polynomials, with all roots lying outside the (open) unit disk—decaying exponents, harmonic oscillations, and their products. Note that if we are interested in estimating xt using only observations on the right of t, similar difficulties will arise when x is a solution of a homogeneous linear difference equation with roots outside the closed unit disc—this situation is completely similar to the above, up to the inversion of the time axis. Adaptive prediction of generalized harmonic oscillations The above discussion motivates our interest in a special family of shift-invariant subspaces which allow for constructing good “left” and “right” prediction filters—that of sets of solutions to linear homogeneous difference equations (17) with all roots z k on the unit circle, i.e., z k = e−iωk with real ωk ∈ [0, 2π[, k = 1, ..., s. In"other words, we are interested in the class of solutions to Eq. (17) with p(z) = sk=1 (1 − eiωk z) comprised of signals of the form r  xt = qk (t)eiωk t k=1

where ω1 , ..., ωr ∈ [0, 2π[ are distinct oscillation frequencies and qk (·), k = 1, ..., r , are (arbitrary) polynomials of degree m k − 1, m k being the multiplicity of the root

z k = e−iωk (i.e., rk=1 m k = s). We call such signals generalized harmonic oscillations; we denote Hs [ω] the space of such signals with fixed spectrum ω ∈ [0, 2π[s and denote Hs the set of generalized harmonic oscillations with at most s (unknown) frequencies. The problem of constructing a predictive filter for signals from Hs [ω] has already been studied in [22], where the authors proved (cf. [22, Lemma 6.1]) that for any s ≥ 1, vector of frequencies ω1 , ..., ωs , and m large enough there is φo ∈ Cm (Z) such that x = φo ∗ m x and  φo 2 ≤ Cs 3/2

log[s + 1] . m

(19)

Here we utilize a stronger version of this result. Proposition 6 Let s ≥ 1 and ω ∈ [0, 2π[s . Then for any m ≥ cs 2 log s there is a filter φo ∈ Cm (Z) which only depend on ω such that x = φo ∗ m x for all x ∈ Hs [ω], and

Adaptive Denoising of Signals with Local Shift-Invariant Structure

 φ 2 ≤ Cs o

log m . m

21

(20)

Let now Hm,n (s, κ) be the set of signals x ∈ C(Z) (locally) close to Hs in 2 -norm, i.e., which can be decomposed (cf. Assumption 3.1) as x = xH + ε where x H ∈ Hs and

 −τ   ε

n,2

≤ κσ, |τ | ≤ m.

Equipped with the bound of Proposition 6, we can now derive risk bounds for adaptive predictive estimates on Hm,n (s, κ). Specifically, following the proof of Propositions 3 and 4 we obtain the following corollaries of the oracle inequalities of Proposition 1. Proposition 7 Let s, m, n ∈ Z+ , m ≥ cs 2 log s with large enough c, and let κ ≥ 0. (i) Let ¯ = Cs 2 log m with C large enough, and let  xcon = ϕ con ∗ m y where ϕ con + is an optimal solution to (Con ); let also Dn = {−n + m, ..., n + m}. Then for any α ∈]0, 1/2], Risk Dn ,2,α ( xcon |Hm,n (s, κ)) ≤ C χαm,n (σ, s, κ) where   χαm,n (σ, s, κ) = σs 2 log[m] κm,n log[1/α] + κ +σs(κm,n + 1) log[m] log[(m + n)/α]. 1/2

(ii) Let λ = Q1 with Q1 as in (13), and let  xpen = ϕ pen ∗ m y where ϕ pen is an + optimal solution to (Pen ). Then for any α ∈]0, 1/2], Risk Dn ,2,α ( xpen |Hm,n (s, κ)) ≤ C χαm,n (σ, s, κ) where    χαm,n (σ, s, κ) = σs 2 log[m] (κm,n + 1) log[(m + n)/α] + κ .

Next, in order to state the result describing pointwise risks of the proposed estimate we need to replace the class Hm,n (s, κ) with the class of signals which are (locally) “uniformly” close to Hs . Namely, let Hm,n (s, κ) be the set of signals x ∈ C(Z) which can be decomposed (cf. Assumption 4.1) as x = xH + ε

22

Z. Harchaoui et al.

with x H ∈ Hs and |ετ | ≤ √

κσ 2n + 1

, |τ | ≤ n + m.

Proposition 8 Let s, m, n ∈ Z+ , m ≥ cs 2 log s with large enough c, n ≥ m/2, and let κ ≥ 0. We set Dn,m = {−n + 2m, ..., n + m}. (i) Let  xcon = ϕ con ∗ m y where ϕ con is an optimal solution to (Con+ ) where ¯ = 2 Cs log m with C large enough. Then for any α ∈]0, 1/2], α Risk Dn,m ,α ( xcon |Hm,n (s, κ)) ≤ C νm,n (σ, s, κ)

where 

α νm,n (σ, s, κ)

log m α σs 3 (log m)3/2 χm,n (σ, s, κ) + √ (κ + log[1/α]) m m σs 3 (log m)3/2 ≤ C √ (κ + log[1/α]) . m

=s

(ii) Let  xpen = ϕ pen ∗ m y where ϕ pen is an optimal solution to (Pen+ ) with λ = 1/2 Q1 , Q1 being defined in (13). Then for any α ∈]0, 1/2], α Risk Dn,m ,α ( xpen |Hm,n (s, κ)) ≤ C νm,n (σ, s, κ)

where 

α  νm,n (σ, s, κ)

log m α σs 3 (log m)3/2  χm,n (σ, s, κ) + √ (κ + log[1/α]) m m σs 3 (log m)3/2 ≤ C √ (κ + log[(m + n)/α]) . m

=s

4.3 Harmonic Oscillation Denoising To illustrate the results of the previous section, let us consider the problem of recovering generalized harmonic oscillations. Specifically, given observations yτ = xτ + σζτ , |τ | ≤ L ∈ Z+ we are to estimate the signal x ∈ Hs . We measure the statistical performance of adaptive estimate  x by the worst-case over Hs integral α-risk #

$

%

Risk DL ,2,α ( x |Hs ) = inf r : sup Prob  x − x L ,2 ≥ r ≤ α x∈Hs

&

Adaptive Denoising of Signals with Local Shift-Invariant Structure

23

on the entire observation domain D L = {−L , ..., L}. Note that if thefrequencies were known, the ordinary least-squares estimate would √  attain the risk O σ s (up to a logarithmic factor in α). When the frequencies are unknown, the lower bound (see, e.g., [37, Theorem 2]) states that Risk DL ,2, 21 ( x |Hs ) ≥ cσ s log L.

(21)

In the case where all frequencies are different, this bound is attained asymptotically by the maximum likelihood estimate [36, 40]. However, implementing that estimate involves computing maximal likelihood estimate of ω—a global minimizer in the optimization problem ⎛

' '2 ⎞1/2 s '  ''  iωk τ ' ⎠ ⎝ y min − α e ' ' τ k ' ' α∈Cs , ω∈Rs |τ |≤L

k=1

and becomes numerically challenging already for very moderate values of s. Moreover, the lower bound (21) is in fact attained by the Atomic Soft Thresholding (AST) estimate [1, 37]—which can be implemented efficiently—but only under the assumption that the frequencies {ω1 , ..., ωs } are well separated—precisely, when the minimal frequency separation in the wrap-around distance δmin :=

min min{|ω j − ωk |, 2π − |ω j − ωk |}

1≤ j=k≤s

(22)

2π satisfies δmin > 2L+1 (cf. [37, Theorem 1]). To the best of our knowledge, the question whether there exists an efficiently implementable estimate matching the lower bound (21) in the general case is open. A new approach to the problem was suggested in [15] where a uniform-fit adaptive estimate was used for estimation and prediction of (generalized) harmonic oscillations. That approach, using the bound (19) along with the  estimate for the risk of the uniform-fit recovery, resulted in the final risk bound O σs 3 log[s] log[L/α] . Using the results in the previous section, we can now build an improved adaptive estimate. Here we assume that the number s of frequencies (counted with their multiplicities) is known in advance, and utilize constrained recoveries (Con) and (Con+ ) with the parameter ¯ selected using this information5 ; note that s is precisely the dimension of the shift-invariant subspace to which x belongs, cf. Proposition 5. Let us consider the following procedure.

Choose K ≤ L, and split the observation interval D L into the central segment D K = {−K , ..., K } and left and right segments D− = {−L , ..., −K − 1} and D+ = {K + 1, ..., L}. In what follows we assume that L and K are even and put k = (L − K )/2. Then we act as follows. 5

It is worth mentioning that the AST estimate does not require the a priori knowledge of s; we can also get rid of this hypothesis when using the procedure which is adaptive to the unknown value of s, at the expense of an additional logarithmic factor.

24

Z. Harchaoui et al.

– Using the data yτ , |τ | ≤ L we compute an optimal solution ϕ  ∈ C L−K (Z) to the optimization problem (Con) with m = L − K , n = K , and ¯ = 4s; for t ∈ Dn we ϕ ∗ y]t . compute the interpolating (two-sided) estimate  xt = [ – We set m = (L + n)/2, n = k, ¯ = ¯ + := 2C 2 s 2 log L where C is as in the + ∈ Cm (Z) to bound (20) of Proposition 6 and compute an optimal solution ϕ + the optimization problem (Con ); for t ∈ D+ we compute the left (one-sided) ϕ+ ∗ m y]t . prediction  xt = [ – We set m = (L + n)/n, n = k, ¯ = ¯ + and compute an optimal solution ϕ − ∈ + 6 Cm (Z) to the “right” analog of (Con ) ; for t ∈ D− we compute the right (oneϕ− ∗ −m y]t . sided) prediction  xt = [ We select K to minimize the “total” risk bound of the adaptive recovery over D L . We have the following corollary of the Propositions 3 and 7 in the present setting. Proposition 9 Suppose that L ≥ cs 2 log s with large enough c > 0. Then, in the situation of this section, for any α ∈]0, 1/2] Risk DL ,2,α ( x |Hs ) ≤ Cσs 3/2 log[L/α].

(23)

Remark The risk bound (23), while significantly improved in terms of dependence √ on s over the corresponding bound of [15], contains an extra factor O(s log L) when compared to the lower bound (21). It is unclear to us whether this factor can be reduced for an efficiently computable estimate. It may be worth mentioning that when the frequency separation assumption holds, 2π where the separation δmin is defined in (22), the above estii.e., when δmin > 2L+1 mation procedure can be simplified: one can “remove” the central segment in the above construction only using left and right adaptive predictive estimates on two halfdomains. The “total” (1 − α)-reliable 2 -loss of the “simplified” adaptive recovery is then   O σ s 2 log[1/α] + s log[L/α] . The latter bound is a simple corollary of the oracle inequalities of Proposition 1 and the following result. Lemma 1 Let m ∈ Z+ , ν > 1, and let Hs [ω] be the set of harmonic oscillations x with the minimal frequency separation satisfying δmin ≥

2πν . 2m + 1

(24)

Then there exists a filter φo ∈ Cm (Z) satisfying x = φo ∗ m x for all x ∈ Hs [ω] and such that In the corresponding “right” optimization problem the “left prediction” ϕ ∗ m y is replaced with the “right prediction” ϕ ∗ −m y. Therefore, the objective to be minimized in this case is m (y − ϕ ∗ −m y)n,2 .

6

Adaptive Denoising of Signals with Local Shift-Invariant Structure

 φ 2 ≤ o

In particular, whenever δmin ≥

25

Qs ν+1 , where Q = . 2m + 1 ν−1

4π , 2m+1

one has 

φ 2 ≤ o

3s . 2m + 1

Acknowledgements Dmitrii Ostrovskii was supported by ERCIM Alain Bensoussan Scholarship while finalizing this project. Zaid Harchaoui received support from NSF CCF 1740551. Research of Anatoli Juditsky and Arkadi Nemirovski was supported by MIAI @ Grenoble Alpes (ANR-19P3IA-0003).

Appendix A: Preliminaries First, let us present some additional notation and technical tools to be used in the proofs.

A.1 Additional Notation In what follows, Re(z) and Im(z) denote, correspondingly, the real and imaginary parts of z ∈ C, and z denotes the complex conjugate of z. For a matrix A with complex entries, A stands for the conjugation of A (without transposition), AT for the transpose of A, and AH for its Hermitian conjugate. We denote A−1 the inverse of A when it exists. Tr(A) denotes the trace of a matrix A and det A its determinant; AF is the Frobenius norm of A, A∗ is the operator norm, and A is the nuclear norm. We also denote λmax (A) and λmin (A) the maximal and minimal eigenvalues of a Hermitian matrix A. For a ∈ Cn we denote Diag(a) the n × n diagonal matrix with diagonal entries ai . We use notation x∗n, p for the  p -norm of the DFT of x so that 2n+1 1/ p  '  'p ∗ ' ' Fn [x] k xn, p = Fn [x] p = k=1

with the standard interpretation of  · ∗n,∞ . In what follows, we associate linear maps Cn (Z) → Cn (Z) with matrices in C(2n+1)×(2n +1) . Convolution matrices. We use the following matrix-vector representations of discrete convolution. – Given y ∈ C(Z), we associate with it a (2n + 1) × (2m + 1) matrix

26

Z. Harchaoui et al.



y−n+m ⎢ .. ⎢ . ⎢ T (y) = ⎢ ⎢ ym ⎢ . ⎣ .. yn+m

⎤ · · · y−n · · · y−n−m . . ⎥ · · · .. · · · .. ⎥ ⎥ · · · y0 · · · y−m ⎥ ⎥, .. .. ⎥ ··· . ··· . ⎦ · · · yn · · · yn−m

(25)

such that [ϕ ∗ y]n−n = T (y)[ϕ]m −m for ϕ ∈ Cm (Z). Its squared Frobenius norm satisfies  T (y)2F = τ y2n,2 . (26) |τ |≤m

– Given ϕ ∈ Cm (Z), consider a (2n + 1) × (2m + 2n + 1) matrix ⎡

ϕm ⎢ 0 ⎢ ⎢ .. M(ϕ) = ⎢ ⎢ . ⎢ . ⎣ .. 0

··· ϕm .. .

· · · ϕ−m 0 · · · · · · ϕ−m .. . ··· ··· .. .. ··· . . ··· ··· ··· 0 ϕm

⎤ ··· ··· 0 0 ··· 0 ⎥ ⎥ .. ⎥ .. . ··· . ⎥ ⎥, .. ⎥ .. ··· . . ⎦

(27)

· · · · · · ϕ−m

such that for y ∈ C(Z) one has [ϕ ∗ y]n−n = M(ϕ)[y]m+n −m−n , and M(ϕ)2F = (2n + 1)ϕ2m,2 .

(28)

– Given ϕ ∈ Cm (Z), consider the following circulant matrix of size 2m + 2n + 1: ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ C(ϕ) = ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣

ϕ0 · · · · · · ϕ−m 0 ϕ1 ϕ0 · · · · · · ϕ−m .. ··· ··· . ··· ··· .. . ··· ··· ··· ··· .. . ··· ··· ··· ··· ··· ··· ··· ··· 0 · · · 0 ϕm .. . ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···

··· 0 .. .

··· ··· ··· ··· .. . ··· .. .. ··· . . .. ··· ··· .

.. . ··· ··· ··· .. . ··· .. .. . .

··· ··· ϕ0 · · · .. ··· .

··· ··· .. .. . . ··· ··· .. .. ··· ··· . .

⎤ 0 ϕm · · · · · · ϕ1 · · · 0 ϕm · · · ϕ2 ⎥ ⎥ ⎥ ⎥ ··· ··· ··· ··· ···⎥ ⎥ ⎥ ··· ··· ··· ··· ···⎥ ⎥ ⎥ ⎥ .. . ··· ··· ··· ···⎥ ⎥ ⎥ .. .. ⎥ . . ··· ··· ···⎥ ⎥ · · · ϕ−m 0 · · · 0 ⎥ ⎥. ⎥ ⎥ ··· ··· ··· ··· ···⎥ ⎥ ⎥ .. ⎥ . ··· ··· ··· ···⎥ ⎥ ⎥ .. . ··· ··· ···⎥ ⎥ ··· ⎥ ⎥ .. . ··· ···⎥ ··· ··· ⎥ ⎥ ⎥ .. .. . ··· ··· . ···⎦

.. ··· ··· ··· ··· ··· ··· ··· . ϕ−1 · · · · · · ϕ−m 0 · · · · · · · · · 0

ϕm · · · · · · ϕ0

(29)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

27

m+n Note that C(ϕ)[y]m+n −m−n is the circular convolution of [y]−m−n and the zero-padded filter ϕ˜ := [ϕ]m+n −m−n = [0; ...; ϕ−m ; ...; ϕm ; 0; ...; 0],

˜ evaluated on that is, convolution of the periodic extensions of [y]m+n −m−n and ϕ {−m − n, ..., m + n}. Hence, by the diagonalization property of the DFT operator one has √ H C(ϕ) = 2m + 2n + 1Fm+n diag(Fm+n ϕ)F ˜ m+n (30) where with some notational abuse we denote Fn the matrix of DFT with the entries   1 2πi(k − n) j , 1 ≤ k, j ≤ 2n + 1. [Fn ]k j = √ exp 2n + 1 2n + 1 Besides this, note that C(ϕ)2F = (2m + 2n + 1)ϕ2m,2 . Reformulation of approximate shift-invariance. The following reformulation of Assumption 3.1 will be convenient for our purposes. There exists an s-dimensional vector subspace Sn of C2n+1 and an idempotent Hermitian (2n + 1) × (2n + 1) matrix Sn of rank s—projector on Sn —such that   /   .    I2n+1 − S τ x n  = τ ε ≤ σκ, |τ | ≤ m (31) n n,2 −n 2 where I2n+1 is the (2n + 1) × (2n + 1) identity matrix.

A.2 Technical Tools Deviation bounds for quadratic forms. Let ζ ∼ CN (0, In ) be a standard complex Gaussian vector, meaning that ζ = ξ1 + iξ2 where ξ1 and ξ2 are two independent draws from N (0, In ). We use simple facts listed below. n – Due to the unitarity of the DFT, if ζ−n ∼ CN (0, I2n+1 ) we also have Fn [ζ] ∼ CN (0, I2n+1 ). – We use a simple bound

1 0 Prob ζn,∞ ≤ 2 log n + 2u ≥ 1 − e−u which can be verified directly using that |ζ1 |22 ∼ χ22 . – The following deviation bounds for ζ22 ∼ χ22n are due to [26, Lemma 1]:

(32)

28

Z. Harchaoui et al.



 √ ζ22 Prob ≤ n + 2nu + u ≥ 1 − e−u , 2   √ ζ22 ≥ n − 2nu ≥ 1 − e−u . Prob 2

(33)

By simple algebra we obtain an upper bound for the norm: 0 √ √ 1 Prob ζ2 ≤ 2n + 2u ≥ 1 − e−u .

(34)

– Further, let K be an n × n Hermitian matrix with the vector of eigenvalues λ = [λ1 ; ...; λn ]. Then the real-valued quadratic form ζ H K ζ has the same distribution as ξ T Bξ, where ξ = [ξ1 ; ξ2 ] ∼ N (0, I2n ), and B is a real 2n × 2n symmetric matrix with the vector of eigenvalues [λ; λ]. We have Tr(B) = 2Tr(K ), B2F = 2K 2F and B = K  ≤ K F . Invoking again [26, Lemma 1] (a close inspection of the proof shows that the assumption of positive semidefiniteness can be relaxed), we have 

√ ζH K ζ Prob ≤ Tr(K ) + (u + 2u)K F 2



≥ 1 − e−u .

(35)

Further, when K is positive semidefinite, we have K F ≤ Tr(K ), whence  Prob

 √ ζH K ζ ≤ Tr(K )(1 + u)2 ≥ 1 − e−u . 2

(36)

The following lemma, interesting in its own right, controls the inflation of the 1 -norm of the DFT of a zero-padded signal. Lemma 2 Let u ∈ Cm (Z) one has u∗m+n,1 ≤ u∗m,1 (1 + κ2m,n )1/2 [log(m + n + 1) + 3]. Proof It suffices to show that the bound u∗m+n,1 ≤ (1 + κ2m,n )1/2 [log(m + n + 1) + 3] holds for all u ∈ Cm (Z) such that u∗m,1 ≤ 1. We assume that n ≥ 1, the lemma statement being trivial otherwise. First of all, function u∗m+n,1 is convex so its maximum over the set u ∈ Cm (Z), u∗m,1 ≤ 1, is attained at an extreme point u j of the set given by Fm [u j ] = eiθ e j where e j is the j-th canonic basis vector and θ ∈ [0, 2π]. Note that 3  2 1 2πτ j , u τj = √ exp i θ + 2m + 1 2m + 1

Adaptive Denoising of Signals with Local Shift-Invariant Structure

thus, for γm,n :=  j ∗ u 

m+n,1

=

=



29

(2m + 2n + 1)(2m + 1) we obtain 1

γm,n 1 γm,n

' ' 3'  2 ' ' j k ' ' exp 2πiτ − ' 2m + 1 2m + 2n + 1 '' '|τ |≤m

' 2(m+n)+1  k=1 2(m+n)+1 

'  ' 'Dm ω jk ' ,

k=1

2

where ω jk

j k := 2π − 2m + 1 2m + 2n + 1

3

and Dm (·) is the Dirichlet kernel of order m: ⎧ ⎨ sin ((2m + 1)ω/2) , sin (ω/2) Dm (ω) := ⎩ 2m + 1,

ω = 2πl, ω = 2πl.

Hence, # γm,n u j ∗m+n,1 ≤ max

θ∈[0,2π]

m,n (θ) :=

' 2(m+n)+1 

' 'D m '

k=1



'& ' 2πk + θ '' . 2m + 2n + 1 (37)

For any θ ∈ [0, 2π], the summation in (37) is over the θ-shifted regular (2m + 2n + 1)-grid on the unit circle. The contribution to the sum m,n (θ) of the two closest to x = 1 points of this grid is at most 2(2m + 1). Using the bound Dm (ω) ≤ | sin(ω/2)|−1 ≤ for the remaining points, and because f (ω) = that n ≥ 1) we arrive at the bound  m,n (θ) ≤ 2 2m + 1 +

π . min(ω, 2π − ω) π ω

2π is decreasing on [ 2m+2n+1 , π] (recall

m+n+1  k=1

 2m + 2n + 1 . 2k

Now, using the inequality Hn ≤ log n + 1 for the n-th harmonic number we arrive at the bound   m,n (θ) ≤ 2(2m + 1) + (2m + 2n + 1) log(m + n + 1) + 1   ≤ (2m + 2n + 1) log(m + n + 1) + 3 which implies the lemma.



30

Z. Harchaoui et al.

Appendix B: Proof of Theorems 1 and 2 What is ahead. While it is difficult to describe informally the ideas underlying the proofs of the oracle inequalities, the “mechanics” of the proof of inequality (11), for instance, is fairly simple: for any ϕo which is feasible to (Con) one has y − ϕ  ∗ yn,2 ≤ y − ϕo ∗ yn,2 , and to prove the inequality (11) all we need to do is to bound tediously all terms of the remainder x − ϕ  ∗ yn,2 − x − ϕo ∗ yn,2 . This may be compared to bounding the 2 -loss of the Lasso regression estimate. Indeed, let m = n for simplicity, and, given y ∈ C(Z), let T (y) be the (2n + 1) × (2n + 1) “convolution matrix” as defined by (25) such that for ϕ ∈ Cn (Z) one has [ϕ ∗ y]n0 = T (y)[ϕ]n−n . When denoting f = Fn [ϕ], the optimization problem in (Con) can be recast as a “standard” 1 constrained least-squares problem with respect to f :

¯ min y − An f 2n,2 s.t.  f 1 ≤ √ 2n+1 f ∈C 2n + 1

(38)

where An = T (y)FnH . Observe that f o = Fn [ϕo ] is feasible for (38), so that y − An  f 2n,2 ≤ y − An f o 2n,2 , ϕ], and where  f = Fn [ x − An  f 2n,2 − x − An f o 2n,2   ≤ 2σ Re ζ, x − An f o n − Re ζ, x − An  f n ' ' n o  ≤ 2σ ' ζ, An ( f o −  f )n ' ≤ 2σAH n [ζ]−n ∞  f − f 1

¯ n ≤ 4σAH . n [ζ]−n ∞ √ n+1 In the “classical” situation, where [ζ]n−n is independent of An (see, e.g., [19]) one would have n AH n [ζ]−n ∞ ≤ cα log n max [An ] j 2 ≤ cα n log n max |Ai j | j

i, j

where cα is a logarithmic in α−1 factor. This would rapidly lead to the bound equivalent to (11). The principal difference with the standard setting which is also the source of the main difficulty in the analysis of the properties of adaptive estimates is that the “regression matrix” An in the case we are interested in is built of the noisy observations [y]n−n and thus depends on [ζ]n−n . In this situation, curbing the cross term is more involved and calls for Assumption 3.1.

Adaptive Denoising of Signals with Local Shift-Invariant Structure

31

B.1 Proof of Theorem 1 1o . Let ϕo ∈ Cm (Z) be any filter satisfying the constraint in (Con). Then, x − ϕ  ∗ y2n,2 ≤ (1 − ϕo ) ∗ y2n,2 − σ 2 ζ2n,2 − 2σRe ζ, x − ϕ  ∗ yn = x − ϕo ∗ y2n,2 − 2 σRe ζ, x − ϕ  ∗ yn 4 56 7 δ (1)

+2 σRe ζ, x − ϕo ∗ yn . 4 56 7

(39)

δ (2)

Let us bound δ (1) . Denote for brevity I := I2n+1 , and recall that Sn is the projector on Sn from (31). We have the following decomposition: δ (1) = σRe [ζ]n−n , Sn [x − ϕ  ∗ y]n−n  + σRe [ζ]n−n , (I − Sn )[x − ϕ  ∗ x]n−n  4 56 7 4 56 7 δ1(1)

δ2(1)

− σ 2 Re [ζ]n−n , (I − Sn )[ ϕ ∗ ζ]n−n  4 56 7 δ3(1)

One can easily bound δ1(1) under the premise of the theorem: ' '     ' (1) '  ∗ y]n−n 2 'δ1 ' ≤ σ Sn [ζ]n−n 2 Sn [x − ϕ      ∗ y . ≤ σ S [ζ]n  x − ϕ n

−n 2

n,2

Note that Sn [ζ]n−n ∼ CN (0, Is ), and by (34) we have 0 √ √ 1  Prob Sn [ζ]n−n 2 ≥ 2s + 2u ≤ e−u , which gives the bound 1 0' '  √   ∗ y n,2 Prob 'δ1(1) ' ≤ σ x − ϕ 2s + 2 log [1/α1 ] ≥ 1 − α1 .

(40)

2o . We are to bound the second term of (4.3). To this end, note first that δ2(1) = σRe [ζ]n−n , (I − Sn )[x]n−n  − σRe [ζ]n−n , (I − Sn )[ ϕ ∗ x]n−n .   By (31), (I − Sn )[x]n−n 2 ≤ σκ, thus with probability 1 − α, ' ' ' [ζ]n , (I − S )[x]n ' ≤ σκ 2 log[1/α]. n −n −n

(41)

32

Z. Harchaoui et al.

On the other hand, using the notation defined in (25), we have [ ϕ ∗ x]n−n =T (x)[ ϕ ]m −m , so that

[ζ]n−n , (I − Sn )[ ϕ ∗ x]n−n  = [ζ]n−n , (I − Sn )T (x)[ ϕ ]m −m . Note that [T (x)]τ = [τ x]n−n for the columns of T (x), |τ | ≤ m. By (31), we have (I − Sn )T (x) = T (ε), and by (26),    (I − S )T (x)2 = T (ε)2 = τ ε2n,2 ≤ (2m + 1)σ 2 κ 2 . n F F |τ |≤m

Due to (36) we conclude that     T (x)H (I − S )[ζ]n 2 ≤ 2(2m + 1)σ 2 κ 2 1 + log[1/α] 2 n −n 2 with probability at least 1 − α. Since '8 n   9'

¯ ' ' [ζ] , (I − S )T (x)[ T (x)H (I − S )[ζ]n  , ϕ] m n n −n −m ≤ √ −n 2 2m + 1 we arrive at the bound with probability 1 − α: '8 n 9' √   ' ' [ζ] , (I − S )T (x)[ ϕ] m 2σκ ¯ 1 + log[1/α] . n −n −m ≤ Along with (41) this results in the bound 0' ' √  1 Prob 'δ2(1) ' ≤ 2σ 2 κ(¯ + 1) 1 + log [1/ min(α2 , α3 )] ≥ 1 − α2 − α3 .(42) 3o . Let us rewrite δ3(1) as follows: (1)

δ3

m+n m+n m+n = σ 2 Re [ζ]n−n , (I − Sn )M( ϕ)[ζ]−m−n  = σ 2 Reσ 2 [ζ]−m−n , Q M( ϕ)[ζ]−m−n ,

where M( ϕ) ∈ C(2n+1)×(2m+2n+1) is defined by (27), and Q ∈ C(2m+2n+1)×(2n+1) is given by Q = [Om,2n+1 ; I − Sn ; Om,2n+1 ]  and (Hereafter we denote Om,n the m × n zero matrix.) Now, by the definition of ϕ since the mapping ϕ → M(ϕ) is linear,

Adaptive Denoising of Signals with Local Shift-Invariant Structure

δ3(1) =

33

σ2 H ([ζ]m+n M( ϕ) + M( ϕ)H Q H )[ζ]m+n −m−n ) (Q −m−n 4 56 7 2 K 1 ( ϕ)

σ ¯ ≤ √ 2 2m + 1 2

max

u ∈ Cm (Z), u∗m,1 ≤ 1

([ζ]n−m )H K 1 (u)[ζ]m+n −m−n

σ 2 ¯ 1 H iθ j m+n =√ max max ([ζ]m+n −m−n ) K 1 (e u )[ζ]−m−n , | j|≤m θ∈[0,2π] 2 2m + 1 H j j where u j ∈ Cm (Z), and [u j ]m −m = Fm e , e being the j-th canonic basis vector. m+n H m+n Indeed, ([ζ]−m−n ) K 1 (u)[ζ]−m−n is clearly a convex function of the argument u as a linear function of [Re(u); Im(u)]; as such, it attains its maximum over the set

Bm,1 = {u ∈ Cm (Z) : u∗m,1 ≤ 1}

(43)

at one of the extremal points eiθ u j , θ ∈ [0, 2π], of this set. It can be directly verified that K 1 (eıθ u) = K 1 (u) cos θ + K 2 (u) sin θ, where the Hermitian matrix K 2 (u) is given by   K 2 (u) = i Q M(u) − M(u)H Q H . j

m+n H j Denoting ql (ζ) = 21 ([ζ]m+n −m−n ) K l (u )[ζ]−m−n for l = 1, 2, we have

max

θ∈[0,2π]

1 H ıθ j m+n ([ζ]m+n −m−n ) K 1 (e u )[ζ]−m−n 2 j

j

= max q1 (ζ) cos θ + q2 (ζ) sin θ θ∈[0,2π]  √ j j j j = |q1 (ζ)|2 + |q2 (ζ)|2 ≤ 2 max(|q1 (ζ)|, |q2 (ζ)|).

(44)

Using (28), by simple algebra we get for l = 1, 2: Tr[K l (u j )2 ] ≤ 4 Tr[M(u j )M(u j )H ] = 4(2n + 1)u j 2m,2 ≤ 4(2n + 1). Now let us bound Tr[K l (u)], l ∈ {1, 2}, on the set Bm,1 cf. (43). One can verify that for the circulant matrix C(u), cf. (29), it holds that Q M(u) = RC(u), where R = Q Q H is an (2m + 2n + 1) × (2m + 2n + 1) projection matrix of rank s defined by

34

Z. Harchaoui et al.



⎤ Om,m Om,n+1 Om,m R = ⎣ On+1,m I − Sn On+1,m ⎦ Om,m Om,n+1 Om,m . Hence, we can bound Tr[K l (u)], l ∈ {1, 2}, as follows: ' ' | Tr[K l (u)]| ≤ 2' Tr[RC(u)]' ≤ 2R∗ C(u) √ ≤ 2C(u) = 2 2m + 2n + 1u ˜ ∗m+n,1 ,

(45)

where in the last transition we used the Fourier diagonalization property (30). Recall that u ∈ Cm (Z), hence Fm+n [u] is the Discrete Fourier Transform of the zero-padded filter 2m+2n+1 u˜ = [0; ...; 0; [u]m . −m ; 0; ...; 0] ∈ C Now, combining Lemma 2 with (45) we arrive at ' ' √ 'Tr[K l (u j )]' ≤ 2 2m + 1(κ2 + 1)(log[2m + 2n + 1] + 3), l = 1, 2. m,n By (35) we conclude that for any fixed pair (l, j) ∈ {1, 2} × {−m, ..., m}, with probability ≥ 1 − α, '    ' j ' '  'q (ζ)' ≤ 'Tr[K l (u j )]' +  K l (u j ) 1 + log[2/α] 2 . l F With α0 = 2(2m + 1)α, by the union bound together with (43) and (44) we get 0 √  Prob δ3(1) ≤ 2 2σ 2 ¯ (κ2m,n + 1)(log[2m + 2n + 1] + 3)  2 /1 ≥ 1 − α0 . +κm,n 1 + log [4(2m + 1)/α0 ]

(46)

4o . Bounding δ (2) is relatively easy since ϕo does not depend on the noise. We decompose δ (2) = σRe ζ, x − ϕo ∗ xn − σ 2 Re ζ, ϕo ∗ ζn . Note that Re ζ, x − ϕo ∗ xn ∼ N (0, x − ϕo ∗ x2n,2 ), therefore, with probability ≥ 1 − α, Re ζ, x − ϕo ∗ xn ≤



2 log[1/α]x − ϕo ∗ xn,2 .

On the other hand, defining

= we have

√ 2m + 1ϕo ∗m,1 ,

(47)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

35

x − ϕo ∗ xn,2 ≤ x − ϕo ∗ yn,2 + σϕo ∗ ζn,2 √   ≤ x − ϕo ∗ yn,2 + 2σ κm,n 1 + log[1/α]

(48)

with probability 1 − α. Indeed, one has  2  ϕo ∗ ζ2n,2 =  M(ϕo )[ζ]m+n −m−n 2 , where for M(ϕo ) by (28) we have    M(ϕo )2 = (2n + 1)ϕo 2 ≤ κ2 2 . m,2 m,n F

(49)

Using (36) we conclude that, with probability at least 1 − α,  2 ϕo ∗ ζ2n,2 ≤ 2κ2m,n 2 1 + log[1/α] ,

(50)

which implies (48). Using (47) and (48), we get that with probability at least 1 − α4 − α5 ,  Re ζ, x − ϕo ∗ xn ≤ 2 log [1/ min(α4 , α5 )] x − ϕo ∗ yn,2 √  / + 2σ κm,n 1 + log[1/ min(α4 , α5 )] ≤ x − ϕo ∗ yn,2 2 log [1/ min(α4 , α5 )]  2 +2σ κm,n 1 + log [1/ min(α4 , α5 )] . (51) Now, the (indefinite) quadratic form Re ζ, ϕo ∗ ζn =

1 H o m+n ([ζ]m+n −m−n ) K 0 (ϕ )[ζ]−m−n , 2

where K 0 (ϕo ) = [Om,2m+2n+1 ; M(ϕo ); Om,2m+2n+1 ] + [Om,2m+2n+1 ; M(ϕo ); Om,2m+2n+1 ]H ,

whence (cf. 3o ) ' ' | Tr[K 0 (ϕo )]| ≤ 2(2n + 1) 'ϕo0 ' ' ' Let us bound 'ϕo0 '. Let e0 be the discrete centered Dirac vector in R2m+1 , and note √ that Fm [e0 ]∞ = 1/ 2m + 1. Then, ' o' 'ϕ ' = | [ϕo ]m , e0 | ≤ ϕo ∗ Fm [e0 ]∞ ≤ m −m m,1 whence | Tr[K 0 (ϕo )]| ≤ 2κ2m,n . On the other hand, by (49),

, 2m + 1

36

Z. Harchaoui et al.

     K 0 (ϕo )2 ≤ 4  M(ϕo )2 ≤ 4κ2 2 . m,n F F Hence by (35), 0 2 1  ≥ 1 − α6 . Prob −Re ζ, ϕo ∗ ζn ≤ 2κ2m,n + 2κm,n 1 + 2 log [1/α6 ] (52) 5o . Let us combine the bounds obtained in the previous steps with initial bound (39). For any α ∈ (0, 1], putting αi = α/4 for i = 0, 1, 6, and α j = α/16, 2 ≤ j ≤ 5, by the union bound we get that with probability ≥ 1 − α, x − ϕ  ∗ y2n,2 ≤ x − ϕo ∗ y2n,2 + 2δ (2) − 2δ (1)

[by (51)] ≤ x − ϕo ∗ y2n,2 + 2σx − ϕo ∗ yn,2 2 log[16/α] .  2 / [by (51)–(52)] + 4σ 2 κ2m,n + 2κm,n 1 + 2 log[16/α] √  [by (40)] + 2σx − ϕ  ∗ yn,2 2s + 2 log[16/α] √   [by (42)] + 2 2σ 2 (¯ + 1) 1 + log[16/α] κ . √ [by (46)] + 4 2σ 2 ¯ (κ2m,n + 1)(log[2m + 2n + 1] + 3)  2 / +κm,n 1 + log [16(m + 1)/α] (53)

Now, denote cα :=

√ 2 log[16/α] and let

 √ 2 + cα ,   v1 (α) = 4 κ2m,n + 2κm,n (1 + cα )2 , √ . v2 (α) = 4 2 (κ2m,n + 1)(log[2m + 2n + 1] + 3)  2 / + κm,n 1 + log [16(2m + 1)/α] . u(α) = 2

(54) (55)

(56)

In this notation, (53) becomes x − ϕ  ∗ y2n,2 ≤ x − ϕo ∗ y2n,2 √   + 2σ( 2s + cα ) x − ϕ  ∗ yn,2 + x − ϕo ∗ yn,2 + u(α)σ 2 (¯ + 1)κ + (v1 (α) + v2 (α))σ 2 ¯ , which implies, by completing the squares, that √ x − ϕ  ∗ yn,2 ≤ x − ϕo ∗ yn,2 + 2σ( 2s + cα ) +σ u(α)(¯ + 1)κ + (v1 (α) + v2 (α))¯ . Let us simplify this bound. Note that

(57)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

37

u(α) ≤ 4cα ,

(58)

√ v1 (α) + v2 (α) ≤ 4 2(κ2m,n + 1)(log[2m + 2n + 1] + 4) √ +4.5(4 2 + 8)κm,n log [16(2m + 1)/α] 2  ≤ 8 1 + 4κm,n log [110(m + n + 1)/α] .

(59)

while on the other hand,

We finally arrive at x − ϕ  ∗ yn,2 ≤ x − ϕo ∗ yn,2 + 2σ where we put



¯ Vα +

 √ (¯ + 1)cα κ + 2s + cα (60)

2  Vα := 2 1 + 4κm,n log [110(m + n + 1)/α] .

(61)

The bound (11) of the theorem follows from (60) after straightforward simplifications. 

B.2 Proof of Theorem 2 √ √ Denote 

= 2m + 1 ϕ∗m,1 , and let = (ϕo ) = 2m + 1ϕo ∗m,1 for some ϕo ∈ Cm (Z). In the sequel, we use the notation defined in the proof of Theorem 1. We have the following counterpart of (39): x − ϕ  ∗ y2n,2 + λ2 σ 2

2 ≤ x − ϕo ∗ y2n,2 − 2δ (1) + 2δ (2) + λ2 σ 2 2 . When repeating steps 1o –4o of the proof of Theorem 1 we obtain a counterpart of (57): x − ϕ  ∗ y2n,2 + λ2 σ 2

2 ≤ x − ϕo ∗ y2n,2 + 2σ(x − ϕo ∗ yn,2 √ +x − ϕ  ∗ yn,2 )( 2s + cα ) + u(α)σ 2 κ + v1 (α)σ 2 +λ2 σ 2 2 + [u(α)κ + v2 (α)] σ 2

(62) with u(α), v1 (α), and v2 (α) given by (54)–(56). We now consider two cases as follows. (a) First, assume that √ x −  ϕ ∗ y2n,2 ≤ x − ϕo ∗ y2n,2 + 2σ(x − ϕo ∗ yn,2 + x −  ϕ ∗ yn,2 )( 2s + cα ) +u(α)σ 2 κ + v1 (α)σ 2 + λ2 σ 2 2 .

(63)

38

Z. Harchaoui et al.

In this case, clearly,  √ x − ϕ  ∗ yn,2 ≤ x − ϕo ∗ yn,2 + 2σ 2s + cα + u(α)σ 2 κ + v1 (α)σ 2 + λ2 σ 2 2 √ ≤ x − ϕo ∗ yn,2 + 2σ( 2s + cα ) + σ( u(α)κ + v1 (α) + λ )

(64)

(b) Suppose, on the contrary, that (63) does not hold, we then conclude from (62) that 

≤ λ−2 (u(α)κ + v2 (α)), and

u(α)

κ + v2 (α)

≤ λ−2 (u(α)κ + v2 (α))2 .

When substituting the latter bound into (62), we obtain the bound √ x − ϕ  ∗ yn,2 ≤ x − ϕo ∗ yn,2 + 2σ( 2s + cα ) +σ( u(α)κ + v1 (α) + λ−1 (u(α)κ + v2 (α)) + λ ), which also holds in the case of (a) due to (64). Finally, using (58), (59), and the bound v1 (α) ≤ 4(1 + κm,n )2 (1 + cα )2 which directly follows from (55), we conclude that x −  ϕ ∗ yn,2 ≤ x − ϕo ∗ yn,2 +σ(λ + 4λ−1 (cα κ + Vα )) + 2σ



Wα +



cα κ +



2s + cα



with Vα given by (61), and Wα = (1 + κm,n )2 (1 + cα )2 . The bound (12) of the theorem follows by a straightforward simplification of the above bound. 

Appendix C: Proofs for Sect. 4 C.1 Proof of Proposition 2 Let Sm be the m + 1-dimensional Euclidean projection matrix on the subspace Sm ⊂ Cm+1 of dimension ≤ s (in fact, this subspace is exactly of dimension s) generated by vectors x0m for x ∈ S (one may set, for instance, Sm = Z m (Z mH Z m )−1 Z mH ,

Adaptive Denoising of Signals with Local Shift-Invariant Structure

39

Z m = [z 1 , ..., z dim(Sm ) ], where z i are linearly independent and such that z i = [xi ]m 0 with xi ∈ S). Since dim(S) ≤ s, one has Sm 22 = Tr(Sm ) ≤ s. Thus, there is a j ∈ {0, ..., m} such that the j + 1-th column r = [Sm ] j of Sm satisfies   s 2s r 2 ≤ ≤ , m+1 2m + 1 and, because Sm is the projector on Sm one has x j − r, x0m  = 0 for all x ∈ S. Hence, using that S = S we obtain for all τ ∈ Z τ − j+m

xτ − r, xτ − j

 = 0, τ ∈ Z.

Finally, let φo ∈ Cm (Z) = − j φ(r ) where φ(r ) is the inverse slicing map of r˜ ∈ Cm+1 such that r˜i = rm+1−i . Obviously, φo ∈ Cm (Z); on the other hand,  φ 2 ≤ o

2s and xt − [φo ∗ x]t = 0, ∀t ∈ Z. 2m + 1



C.2 Proof of Proposition 3 In the proofs to follow, the following simple statement will be of use. Lemma 3 (i) Suppose that for all z ∈ S there is a filter φo ∈ Cm (Z) such that ρ for some ρ ≥ 1. Then for all x ∈ Xm,n (s, κ) z = φo ∗ z with φo 2 ≤ √2m+1 one has x − φo ∗ xn,2 ≤ σκ(1 + ρ).

(65)

Moreover, if x ∈ X m,n (s, κ) then σκ x − φo ∗ xn,∞ ≤ √ (1 + ρκn,m ). 2m + 1

(66)

(ii) Similarly, assume that for all z ∈ S there is φo ∈ Cm (Z) such that z = φo ∗ m z ρ for some ρ ≥ 1. Then for all x ∈ Xm,n (s, κ) one has and φo 2 ≤ √2m+1 −m (x − φo ∗ m x)n,2 ≤ σκ(1 + ρ).

40

Z. Harchaoui et al.

Furthermore, if x ∈ X m,n (s, κ) then −m (x − φo ∗ m x)n,∞ ≤ √

σκ 2m + 1

(1 + ρκn,m ).

Proof of the Lemma Here we prove the first statement of the lemma, proof of the second one being completely analogous. Recall that any x ∈ Xm,n (s, κ) can be decomposed as in x = x S + ε where x S ∈ S and τ εn,2 ≤ κσ for all |τ | ≤ m. Thus, x − [φo ∗ x]n,2 ≤ x S − φo ∗ x S n,2 + εn,2 + φo ∗ εn,2 = κσ + φo ∗ εn,2 .

(67)

On the other hand, by the Cauchy inequality, φ ∗ o

ε2n,2

' m '2 n '  n m '    ' ' o = φτ εt−τ ' ≤ φo 22 |εt−τ |2 ' ' ' t=−n τ =−m t=−n τ =−m = φo 22

m 

τ ε2n,2 ≤ ρ2 σ 2 κ 2 .

τ =−m

When substituting the latter bound into (67) we obtain (65). To show (66) recall that in the case of x ∈ X m,n (s, κ) we have x = x S + ε with κσ for all |τ | ≤ m + n. Then for |t| ≤ n we get |ετ | ≤ √2n+1 |xt − [φo ∗ x]t | ≤ |xtS − [φo ∗ x S ]t | + |εt | + |[φo ∗ ε]t | κσ ≤ √ + φo 2 −t εm,2 2n + 1 √ σκ 2m + 1 κσ ρ σκ ≤ √ +√ ≤√ (1 + ρκn,m ). √ 2n + 1 2m + 1 2n + 1 2m + 1  Proof of the Proposition W.l.o.g. we may assume that m = 2m o . In the premise of the proposition, by Proposition 2, for any m o ≥ s − 1 there exists a filter φo ∈ Cm o (Z) such that : 2s φo 2 ≤ , z = φo ∗ z ∀z ∈ S. (68) 2m o + 1 When setting ϕo = φo ∗ φo ∈ Cm we have z − ϕo ∗ z = 0 ∀z ∈ S, and7 In the case of m = 2m o + 1 one may consider two filters φo and ψ o of widths m o and m o + 1 respectively, and then build ϕo = φo ∗ ψ o ∈ Cm (Z). One easily verifies that in this case ϕo ∗m,1 ≤ √ 4s 2m + 1φo 2 ψ o 2 ≤ √2m+1 . 7

Adaptive Denoising of Signals with Local Shift-Invariant Structure

ϕo m,2 ≤ ϕo ∗m,1 ≤ √

4s 2m + 1

41

(69)

(cf. [15, Proposition 3] or [20, Lemma 16]). We now apply Lemma 3.i to obtain for all x ∈ Xm,n (s, κ) x − ϕo ∗ xn,2 ≤ σκ(4s + 1).

(70)

Moreover, note that ϕo ∗ ζ2n,2 = ζ, M(ϕo )ζn , where M(ϕ) is defined by (27). When using the bound (69) along with (28) we obtain M(ϕo )2F = (2n + 1)ϕo 22 ≤ 16κ2m,n s 2 ; by (36) this implies that for any α ∈ (0, 1), with probability at least 1 − α, √   ϕo ∗ ζn,2 ≤ 4 2σκm,n s 1 + log[1/α] .

(71)

The latter bound taken together with (70) implies that with probability ≥ 1 − α √   x − ϕo ∗ yn,2 ≤ 4 2κm,n σs 1 + log[1/α] + σκ(4s + 1)   ≤ Cσs κm,n log[1/α] + κ when α ≤ 1/2. We conclude the proof by substituting the above bound for the loss of the estimate  x = ϕo ∗ y and the bound ϕo ∗m,1 ≤ 4s into the oracle inequalities  of Theorems 1 and 2.

C.3 Proof of Proposition 4 We provide the proof for the case of constrained estimator  xcon , the proof of the = ϕ con . proposition for penalized estimator  xpen follows exactly same lines. Let ϕ 1o . W.l.o.g. we assume that m = 2m o . By Proposition 2, for such m o there is a filter φo ∈ Cm o (Z) satisfying relationships (68). When applying Lemma 3.i we obtain for all x ∈ X m,n (s, κ) √ σκ x − φo ∗ xn,∞ ≤ √ (1 + 2sκn,m o ). 2m o + 1

(72)

Next, replacing ϕo with φo and n with m in the derivation which led us to (71) in the proof of Proposition 3 we conclude that √    √  φo ∗ ζm,2 ≤ 2σκm o ,m s 1 + log[1/α] ≤ 2 2sσ 1 + log[1/α] . (73)

42

Z. Harchaoui et al.

2o . Let now |t| ≤ n − m o . We decompose |[x − ϕ  ∗ y]t | = |[(φo + (1 − φo )) ∗ (x − ϕ  ∗ y)]t | o ≤ |[φ ∗ (x − ϕ  ∗ y)]t | + |[(1 − φo ) ∗ (1 − ϕ ) ∗ x]t | +σ|[ ϕ ∗ ζ]t | + σ|[ ϕ ∗ φo ∗ ζ]t | (1) (2) (3) =: δ + δ + δ + δ (4) .

(74)

We have √   2 s δ (1) ≤ φo 2 −t [x − ϕ  ∗ y]m o ,2 ≤ √ x − ϕ  ∗ yn,2 . 2m + 1 Using the bound (14) of Proposition 3 we conclude that with probability ≥ 1 − α/3  δ (1) ≤ C

s α/3 ψ (σ, s, κ). 2m + 1 m,n

Next, using (72) we get    δ (2) ≤ 1 +  ϕ1 −t [(1 − φo ) ∗ x]m o ,∞ ≤ C s √

σκ 2m + 1

(1 +

√ Cs 3/2 σκ 2sκn,m o ) ≤ √ 2m + 1

(recall that n ≥ m o ). Further, by the Parseval’s identity, with probability ≥ 1 − α/3, δ (3) = σ| Fm [ ϕ], Fm [−t ζ]| ≤ σ ϕ∗m,1 −t ζ∗m,∞ C sσ ≤ √ 2 log [3(2m + 1)/α] 2m + 1 t+m+m o due to (32). Finally, using (73) and the fact that the distribution of ζt−m−m is the o m+m o same as that of ζ−m−m o we conclude that with probability ≥ 1 − α/3 it holds

√   −t [φo ∗ ζ]m,2 ≤ 2 2sσ 1 + log[3/α] . Therefore, we have for δ (4) : √     C sσ δ (4) ≤ σ ϕm,2 −t [φo ∗ ζ]m,2 ≤ √ 2 2sσ 1 + log[3/α] 2m + 1  C s 3/2 σ  1 + log[3/α] = √ 2m + 1 with prob. ≥ 1 − α/3. Substituting the bounds for δ (k) , k = 1, ·, 4, into (74) we  arrive at (16).

Adaptive Denoising of Signals with Local Shift-Invariant Structure

43

C.4 Proof of Proposition 5 As a precursory remark, note that if a finite-dimensional subspace S is shift-invariant, i.e., S ⊆ S, then necessarily S = S (indeed,  obviously is a linear transformation with a trivial kernel). 1o . To prove the direct statement, note that the solution set of (17) with deg( p(·)) = s is a shift-invariant subspace of C(Z) – let us call it S . Indeed, if x ∈ C(Z) satisfies (17), so does x, so S is shift-invariant. To see that dim(S ) = s, note that x → x1s is a bijection S → Cs : under this map arbitrary x1s ∈ Cs has a unique preimage. Indeed, as soon as one fixes x1s , (17) uniquely defines the next samples xs+1 , xs+2 , ... (note that p(0) = 0); dividing (17) by s , one can retrieve the remaining samples of x since deg( p(·)) = s (we used that  is bijective on S). 2o . To prove the converse, first note that any polynomial p(·) with deg( p(·)) = s and such that p(0) = 1 is uniquely expressed via its roots z 1 , ..., z s as p(z) =

s ;

(1 − z/z k ).

k=1

Since S is shift-invariant, we have S = S as discussed above, i.e.,  is a bijective linear operator on S. Let us fix some basis E = [e1 ; ...; es ] of S and denote A the s ai j ei . By the s × s representation matrix of  in this basis, that is, (e j ) = i=1 Jordan theorem basis E can be chosen in such a way that A is upper-triangular. Then, any vector x ∈ S satisfies q()x ≡ 0 where q(z) =

s ;

(aii − z) = det(A − z I )

i=1

"s aii = 0 since  is a is the characteristic polynomial of A. Note that det A = i=1 bijection. Hence, choosing q() p() = det A "s we obtain i=1 (1 − ci )x ≡ 0 for some complex ci = 0. This means that S is contained in the solution set S of (17) with deg( p(·)) = s and such that p(0) = 1. Note that by 1o S is also a shift-invariant subspace of dimension s, thus S and S coincide. Finally, uniqueness of p(·) follows from the fact that q(·) is a characteristic polynomial of A. 

44

Z. Harchaoui et al.

C.5 Proof of Proposition 6 To prove the proposition we need to exhibit

a vector q ∈ Cn+1 of small 2 -norm and n i such that the polynomial 1 − q(z) = 1 − i=0 qi z is divisible by p(z), i.e., that there is a polynomial r (z) of degree n − s such that 1 − q(z) = r (z) p(z). Indeed, this would imply that xt − [q ∗ x]t = [1 − q()]xt = r () p()xt = 0 due to p()xt = 0,

 Our objective is to prove the inequality q2 ≤ C s log[ns] . So, let θ1 , ..., θs be n complex numbers of modulus 1 – the roots of the polynomial p(z). Given δ = 1 −  ∈ (0, 1), let us set δ¯ = 2δ/(1 + δ), so that δ¯ − 1 = 1 − δ¯ > 0. δ

Consider the function q(z) ¯ =

(75)

s ; z − θi . δz − θi i=1

Note that q(·) ¯ has no singularities in the circle ¯ B = {z : |z| ≤ 1/δ}; ¯ so that z = δ¯ −1 w with |w| = 1. We besides this, we have q(0) ¯ = 1. Let |z| = 1/δ, have ¯ i| |z − θi | 1 |w − δθ = . δ |δz − θi | δ |w − ¯ θi | δ

¯

¯ i | ≤ |w − δ θi |. We claim that when |w| = 1, |w − δθ δ Indeed, assuming w.l.o.g. that w is not proportional to θi , consider triangle  with the vertices ¯ i and C = δ¯ θi . Let also D = θi . By (75), the segment AD is a median in , A = w, B = δθ δ and ∠C D A is ≥ π2 (since D is the closest to C point in the unit circle, and the latter contains ¯ i | ≤ |w − δ¯ θi |. A), so that |w − δθ δ

As a consequence, we get z ∈ B ⇒ |q(z)| ¯ ≤ δ −s ,

(76)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

whence also

45

|z| = 1 ⇒ |q(z)| ¯ ≤ δ −s . "s

Now, the polynomial p(z) =

i=1 (z

(77)

− θi ) on the boundary of B clearly satisfies

3 3s 2 1−δ s 1 −1 = , | p(z)| ≥ 2δ δ¯ 2

which combines with (76) to imply that the modulus of the holomorphic in B function s ;

r¯ (z) =

!−1 (δz − θi )

i=1

 −s  2 s is bounded with δ −s 1−δ = 1−δ on the boundary of B. It follows that the 2δ coefficients r j of the Taylor series of r¯ satisfy 2

2 |r j | ≤ 1−δ

3s

j δ¯ , j = 0, 1, 2, ...

When setting q  (z) = p(z)r  (z), r  (z) =

 

rjz j,

(78)

j=1

for |z| ≤ 1, utilizing the trivial upper bound | p(z)| ≤ 2s , we get 



|q (z) − q(z)| ¯ ≤ | p(z)||r (z) − r¯ (z)| ≤ 2 2

4 ≤ 1−δ

2 s

2 1−δ

3s  ∞

|r j |

j=+1

3s ¯ +1 δ . 1 − δ¯

(79)

Note that q  (0) = p(0)r  (0) = p(0)¯r (0) = 1, that q  is a polynomial of degree  + s, and that q  is divisible by p(z). Besides this, on the unit circumference we have, by (79), 2

4 |q (z)| ≤ |q(z)| ¯ + 1−δ 

3s ¯ +1 3d ¯ +1 2 δ δ 4 −s ≤δ + , ¯ 1 − δ 1 − δ¯ 1−δ 4 56 7 R

where we used (77). Now,

(80)

46

Z. Harchaoui et al.

δ¯ =

2 − 2 1− 2δ = = ≤ 1 − /2 ≤ e−/2 , 1+δ 2− 1 − /2

and

1 1 − δ¯

2− 2 1+δ = ≤ . 1−δ  

=

We can upper-bound R: 2 R=

4 1−δ

3s ¯ +1 δ 22s+1 ≤ s+1 e−/2  1 − δ¯

Now, given positive integer  and positive α such that α 1 ≤ ,  4 α let  = 2s . Since 0 <  ≤ 18 , we have − log(δ) = − log(1 − ) ≤ 2 = α that δ¯ ≤ e−/2 = e− 4s , and

2 R≤

8s α

3s+1

(81) α , implying s

0 α1 . exp − 4s

Now let us put α = α(, s) = 4s(s + 2) log(8s); observe that this choice of α satisfies (81), provided that  ≥ O(1)s 2 log(s + 1) with properly selected absolute constant O(1). With this selection of α, we have α ≥ 1, whence R

. α /−1 

0 α1 0 α 1 2 8s 3s+1  ≤ exp − [8s]s+2 ≤ exp − 4s α α 4s ≤ exp{−(s + 2) log(8s)} exp{(s + 2) log(8s)} = 1,

that is, R≤ Furthermore,

1 α ≤ .  4

(82)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

δ −s = exp{−s log(1 − )} ≤ exp{2s} = exp{ α } ≤ 2, δ −2s = exp{−2s log(1 − )} ≤ exp{4s} = exp{ 2α }  4α ≤ 1 + exp{ 21 } 2α ≤ 1 + .  

47

(83)

When invoking (80) and utilizing (83) and (82) we get 1 2π

< |z|=1

|q  (z)|2 |dz| ≤ δ −2s + 2δ −s R + R 2 ≤ 1 + 4

α 1 α + 4R + R ≤ 1 + 10 .  4 

On the other hand, denoting by q0 , q1 ,...,q+s the coefficients of the polynomial q  and taking into account that q¯ 0 = q  (0) = 1, we have 1+

+s 

|qi |2 = |q0 |2 + ... + |q+s |2 =

i=1

1 2π


0

48

Z. Harchaoui et al.

we may write Sm = UU H , where U = [U1 , ..., Us ] is the unitary normalization of V : U = [U1 · · · Us ] = V (V H V )−1/2 , U H U = Is . ..., u s ] be the last row of U , and v that of V . Note that the vector Let u = [u 1 ,

ψ = uU H = sk=1 u k [Uk ]H has the same 2 -norm as φo , and so φo 22 = u22 . On the other hand, because u = v(V H V )−1/2 , we arrive at H u22 ≤ v22 λ−1 min (V V ) ≤

s λ−1 (V H V ) 2m + 1 min

where the last inequality is due to the bound (2m + 1)−1/2 on the moduli of elements of v. Finally, we utilize the bound on the condition number of a Vandermonde matrix. Lemma 4 ([31, Theorem 2.3]) Let δmin be given by (22); one has     λmax (V H V ) 2π −1 2π m + . ≤ m − λmin (V H V ) δmin δmin We clearly have V ∗ ≥ 1, whence λmax (V H V ) ≥ 1. Together with (24) this results in ν+1 H λ−1 , min (V V ) ≤ ν−1 whence the required bound on φo 2 .



C.7 Proof of Proposition 9 Note that in the premise of the proposition k = L/(s log[L]) is correctly defined and K = L − 2k ≥ L/2 so that κ K ,k ≤ C(s log L)−1/2 and κk,K ≤ C s log L.

(85)

When applying Proposition 3 (recall that κ = 0 in our setting), we conclude that the error of the estimate ϕ  ∗ y satisfies, with probability at least 1 − α/3,   x −  x  K ,2 ≤ Cσ κk,K s log[1/α] + s log[L/α] .

(86)

Adaptive Denoising of Signals with Local Shift-Invariant Structure

49

On the other hand, due to κ K ,k ≤ 1, applying Proposition 7 we conclude that with probability 1 − α/3 the error of the left estimate ϕ + ∗ m y satisfies:     −m  (x −  ϕ+ ∗ m y)k,2 ≤ C σ κ K ,k s 2 log[L] log[1/α] + s log[L] log[L/α] ,

and the same estimation holds true for the right estimate ϕ − ∗ −m y:    m   (x −  ϕ− ∗ −m y)k,2 ≤ C σ κ K ,k s 2 log[L] log[1/α] + s log[L] log[L/α] .

When combining the latter bounds with (86) we arrive at the bound with probability ≥ 1 − α:     x −  x  L ,2 ≤ m (x −  ϕ− ∗ −m y)k,2 + x −  x  K ,2 + −m (x −  ϕ+ ∗ m y)k,2 ≤ Cσs log[L] log[L/α] + C σs log[1/α](κk,K + κ K ,k s log[L]) (by (85)) ≤ Cσs log[L/α] + C σs s log[L] log[1/α] ≤ Cσs 3/2 log[L/α].



Appendix D: Naive Adaptive Estimate  ∗ y where φ ∈ In this section,8 we consider the “naive” adaptive estimate  x =φ Cm (Z) solves the optimization problem min y − φ ∗ yn,2 subject to φ2 ≤ √

φ∈Cm (Z)

ρ 2m + 1

.

(87)

Recall that our goal is to show that using estimate  x is really not a good idea. To make the long story short, from now on, we consider the simplified version of the estimation problem in which m = n, signals are 2m + 1-periodic, and linear estimates are in the form of circular (periodic) convolution [φ ∗ y]t =

m 

φτ ys(t,τ ) ,

|t| ≤ m,

τ =−m

where s(t, τ ) = [t + m − τ mod 2m + 1] − m. Because the Discrete Fourier Transform diagonalizes the periodic convolution, problem (87) may be equivalently reformulated in the space of Fourier coefficients min z − Z wn,2 subject to w2 ≤ ρ

w∈C2m+1

8

We use notation defined in Sects. 2.1 and 4.3.

(88)

50

Z. Harchaoui et al.

where z = Fm [y], Z = diag(z) (with A = diag(a) being the diagonal matrix with √ entries Aii = ai ), and w is a properly “rephased” DFT of φ with |wk | = 2m + 1 | (Fm [φ])k |, 1 ≤ k ≤ 2m + 1. Consider the situation in which the signal to recover is just one “complex sinusoid,” 2πiτ e.g., xτ = ae 2m+1 , τ ∈ Z, a ∈ C, and let us show that the error of the naive estimate may be much larger √than the “oracle” error. We have Fm√[x] = f e1 where e1 is the first basis orth, f = a 2m + 1 with | f | = xm,2 = |a| 2m + 1, and the “sequencespace” observation z satisfies z = f e1 + σζ, ζ ∼ CN (0, In ). Obviously, in this case there exist a filter φo with φo 2 = (2m + 1)−1/2 such that x = φo ∗ x, so that the integral α-risk of the “oracle estimate” φo ∗ y is O(σ) up to logarithmic in α factor. Let us show that in this simple situation the risk of the naive estimate may be significantly higher. First of all, note that the optimal solution  w to the problem (88) with ρ = 1 is of the form |z k |2 , 1 ≤ k ≤ 2m + 1  wk = |z k |2 + λ where λ is chosen to ensure  w2 = 1. Let us bound λ from below. We have 1 =  w22 = ≥

2m+1  k=2

2m+1  σ 4 |ζk |4 |z 1 |4 + (|z 1 |2 + λ)2 (σ 2 |ζk |2 + λ)2 k=2

σ 4 |ζk |4 σ 4 Sm2 ≥ (σ 2 Mm2 + λ)2 2m(λ + σ 2 Mm )2

2 where Mm = max1≤k≤2m+1 |ζk |2 and Sm = 2m+1 k=2 |ζk | . Since with high probability (say, 1 − O(1/m)) Mm = O(log m) and Sm = O(m) (cf. (32) and (33)), for m large enough one has   √ Sm λ ≥ σ2 √ − Mm ≥ cσ 2 m 2m with probability at least 1 − O(1/m). As a result, 1− w1 = 1 −

λ λ |z 1 |2 = ≥ ≥ c 2 2 |z 1 | + λ |z 1 | + λ (| f | + σ|Mn |)2 + λ

√ whenever f satisfies | f |2 ≤ Cσ 2 m. Next, observe that x −  x 2m,2 = Fm [x] − Z w22 =  f e1 − Z w22 ≥ | f − z 1 w1 |2 1 ≥ | f (1 −  w1 )|2 − σ 2 |ζ1 |2 w12 ≥ c| f |2 − σ 2 Mm ≥ c | f |2 2

Adaptive Denoising of Signals with Local Shift-Invariant Structure

51

√ for | f | ≥ c σ log m. In other words, when the signal amplitude satisfies cσ 2 log m Cσ 2 ≤ |a|2 ≤ √ , m m the loss  x − xm,2 of the naive estimate is lower bounded, with probability at least 1 − O(1/m), with c xm,2 . In particular, when a  σm −1/4 this error is at least order of σm 1/4 , which is incomparably worse than the error O(σ) of the oracle estimate.

References 1. Bhaskar, B., Tang, G., Recht, B.: Atomic norm denoising with applications to line spectral estimation. IEEE Trans. Signal Processing 61(23), 5987–5999 (2013) 2. Bickel, P., Ritov, Y., Tsybakov, A.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009) 3. Birgé, L., Massart, P.: From model selection to adaptive estimation. In: Festschrift for Lucien le Cam, pp. 55–87. Springer (1997) 4. Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media (2011) 5. Candes, E., Tao, T.: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313–2351, 12 (2007) 6. Donoho, D.: Statistical estimation and optimal recovery. Ann. Stat. 22(1), 238–270 (1994) 7. Donoho, D., Johnstone, I.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994) 8. Donoho, D., Liu, R., MacGibbon, B.: Minimax risk over hyperrectangles, and implications. Ann. Stat. 18(3), 1416–1437 (1990) 9. Donoho, D., Low, M.: Renormalization exponents and optimal pointwise rates of convergence. Ann. Stat. 20(2), 944–970 (1992) 10. Duarte, M.F., Baraniuk, R.G.: Spectral compressive sensing. Appl. Comput. Harmon. Anal. 35(1), 111–129 (2013) 11. Efromovich, S., Pinsker, M.: Sharp-optimal and adaptive estimation for heteroscedastic nonparametric regression. Stat. Sin. 6, 925–942 (1996) 12. Goldenshluger, A., Lepski, O.: Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. Ann. Stat. 39(3), 1608–1632 (2011) 13. Goldenshluger, A., Lepski, O.: General selection rule from a family of linear estimators. Theory Probab. Appl. 57(2), 209–226 (2013) 14. Goldenshluger, A., Nemirovski, A.: Adaptive de-noising of signals satisfying differential inequalities. IEEE Trans. Inf. Theory 43(3), 872–889 (1997) 15. Harchaoui, Z., Juditsky, A., Nemirovski, A., Ostrovsky, D.: Adaptive recovery of signals by convex optimization. In: Proceedings of The 28th Conference on Learning Theory (COLT) 2015, Paris, France, July 3–6, 2015, pp. 929–955 (2015) 16. Ibragimov, I., Khasminskii, R.: Nonparametric estimation of the value of a linear functional in Gaussian white noise. Theor. Probab. Appl. 29(1), 1–32 (1984) 17. Ibragimov, I., Khasminskii, R.: Estimation of linear functionals in Gaussian noise. Theor. Probab. Appl. 32(1), 30–39 (1988) 18. Johnstone, I.: Gaussian estimation: sequence and multiresolution models (2011) 19. Juditsky, A., Nemirovski, A.: Functional aggregation for nonparametric regression. Ann. Stat. 28, 681–712 (2000)

52

Z. Harchaoui et al.

20. Juditsky, A., Nemirovski, A.: Nonparametric denoising of signals with unknown local structure, I: Oracle inequalities. Appl. Comput. Harmon. Anal. 27(2), 157–179 (2009) 21. Juditsky, A., Nemirovski, A.: Nonparametric denoising signals of unknown local structure, II: Nonparametric function recovery. Appl. Comput. Harmon. Anal. 29(3), 354–367 (2010) 22. Juditsky, A., Nemirovski, A.: On detecting harmonic oscillations. Bernoulli 23(2), 1134–1165 (2013) 23. Juditsky, A., Nemirovski, A.: Near-optimality of linear recovery from indirect observations. Math. Stat. Learn. 1(2), 171–225 (2018) 24. Juditsky, A., Nemirovski, A.: Near-optimality of linear recovery in gaussian observation scheme under  · 2 -loss. Ann. Stat. 46(4), 1603–1629 (2018) 25. Kailath, T., Sayed, A., Hassibi, B.: Linear Estimation. Prentice Hall (2000) 26. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28(5), 1302–1338 (2000) 27. Lepski, O.: On a problem of adaptive estimation in Gaussian white noise. Theory Probab. Appl. 35(3), 454–466 (1991) 28. Lepski, O.: Adaptive estimation over anisotropic functional classes via oracle approach. Ann. Stat. 43(3), 1178–1242 (2015) 29. Lepski, O., Mammen, E., Spokoiny, V.: Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Ann. Stat. 25(3), 929–947 (1997) 30. Massart, P.: Concentration Inequalities and Model Selection, vol. 6. Springer (2007) 31. Moitra, A.: Super-resolution, extremal functions and the condition number of Vandermonde matrices. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 821–830. ACM (2015) 32. Nemirovski, A.: On non-parametric estimation of functions satisfying differential inequalities (1991) 33. Ostrovsky, D., Harchaoui, Z., Juditsky, A., Nemirovski, A.: Structure-blind signal recovery. In: Advances in Neural Information Processing Systems, pp. 4817–4825 (2016) 34. Pinsker, M.: Optimal filtering of square-integrable signals in gaussian noise. Probl. Peredachi Inf. 16(2), 52–68 (1980) 35. Shiryaev, A.N., Spokoiny, V.G.: On sequential estimation of an autoregressive parameter. Stochast.: Int. J. Prob. Stochast. Processes 60(3–4), 219–240 (1997) 36. Stoica, P., Nehorai, A.: Music, maximum likelihood, and Cramer-Rao bound. IEEE Trans. Acoust. Speech Signal Process. 37(5), 720–741 (1989) 37. Tang, G., Bhaskar, B., Recht, B.: Near minimax line spectral estimation. In: 2013 47th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. IEEE (2013) 38. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat Methodol. 58(1), 267–288 (1996) 39. Tsybakov, A.: Introduction to Nonparametric Estimation. Springer (2008) 40. Tufts, D.W., Kumaresan, R.: Estimation of frequencies of multiple sinusoids: Making linear prediction perform like maximum likelihood. Proc. IEEE 70(9), 975–989 (1982) 41. Wasserman, L.: All of Nonparametric Statistics. Springer Texts in Statistics. Springer (2006)

Goodness-of-Fit Testing for Hölder Continuous Densities Under Local Differential Privacy Amandine Dubois, Thomas B. Berrett, and Cristina Butucea

Abstract We address the problem of goodness-of-fit testing for Hölder continuous densities under local differential privacy constraints. We study minimax separation rates when only non-interactive privacy mechanisms are allowed to be used and when both non-interactive and sequentially interactive can be used for privatisation. We propose privacy mechanisms and associated testing procedures whose analysis enables us to obtain upper bounds on the minimax rates. These results are complemented with lower bounds. By comparing these bounds, we show that the proposed privacy mechanisms and tests are optimal up to at most a logarithmic factor for several choices of f 0 including densities from uniform, normal, Beta, Cauchy, Pareto, exponential distributions. In particular, we observe that the results are deteriorated in the private setting compared to the non-private one. Moreover, we show that sequentially interactive mechanisms improve upon the results obtained when considering only non-interactive privacy mechanisms. Keywords Goodness-of-fit test · Local differential privacy · Minimax separation rates · Privacy mechanims · Total variation separation distance

Financial support from GENES and from the French ANR grant ANR-18-EURE-0004. Financial support from GENES and the French National Research Agency (ANR) under the grant Labex Ecodec (ANR-11-LABEX-0047). A. Dubois CREST, ENSAI, Campus de Ker-Lann - Rue Blaise Pascal, BP 37203, 35172 Bruz Cedex, France e-mail: [email protected] T. B. Berrett Department of Statistics, University of Warwick, Coventry CV4 7AL, United Kingdom e-mail: [email protected] C. Butucea (B) CREST, ENSAE, Institut Polytechnique de Paris, 5 avenue Henry Le Chatelier, 91120 Palaiseau, France e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_2




1 Introduction

Over the past few years, data privacy has become a fundamental problem in statistical data analysis. While more and more personal data are collected each day, stored and analyzed, private data analysis aims at publishing valid statistical results without compromising the privacy of the individuals whose data are analysed. Differential privacy has emerged from this line of research as a strong mathematical framework which provides rigorous privacy guarantees. Global differential privacy has been formalized by Dwork et al. [14]. Their definition requires a curator who gathers the confidential data of n individuals and generates a privatized output from this complete information. Only this privatized output can be released. In a nutshell, the differential privacy constraints require that altering a single entry in the original dataset does not affect the probability of a privatized output too much. One intuition behind this definition is that if the distribution of the privatized output does not depend too much on any single element of the database, then it should be difficult for an adversary to guess whether one given person is in the database or not. We refer the reader to [27] for a precise definition of global differential privacy and more discussion of its testing interpretation.

In this paper, we will rather focus on the stronger notion of local differential privacy, for which no trusted curator is needed. In the local setup, each individual generates a privatized version of its true data on its own machine, and only the privatized data are collected for analysis. Thus, the data-owners do not have to share their true data with anyone else. However, some interaction between the n individuals can be allowed. We will consider two specific classes of locally differentially private mechanisms: non-interactive and sequentially interactive privacy mechanisms, respectively. In the local non-interactive scenario, each individual generates a private view Z_i of its original data X_i on its own machine, independently of all the other individuals. In the sequentially interactive scenario, the privatized data Z_1, ..., Z_n are generated such that the i-th individual has access to the previously privatized data Z_1, ..., Z_{i-1} in addition to the original data X_i in order to generate its own Z_i.

In this paper, we study a goodness-of-fit testing problem for densities under local differential privacy constraints. Goodness-of-fit testing problems consist in testing whether n independent and identically distributed random variables X_1, ..., X_n were drawn from a specified distribution P_0 or from any other distribution P with d(P_0, P) ≥ ρ, for some distance d between distributions and some separation parameter ρ > 0. Here, the considered distributions will be assumed to have Hölder smooth densities, and we will measure the separation between distributions using the L_1 norm, which corresponds (up to a constant) to the total variation distance. Moreover, only privatised data Z_1, ..., Z_n are supposed to be available for designing testing procedures. Therefore we proceed in two steps: first randomize the original sample into a private sample, then build a test using the latter sample. Optimality is shown over all test procedures and additionally over all privacy mechanisms satisfying the privacy constraints. We adopt a minimax point of view and aim at determining the private minimax testing radius, which is the smallest separation parameter for which



there exists a private testing procedure whose first type and second type error probabilities are bounded from above by a constant fixed in advance.

Contributions Our contributions can be summarized as follows. First, when non-interactive privacy mechanisms are used, we present such an α-locally differentially private mechanism and construct a testing procedure based on the privatized data. Its analysis indicates how to tune the parameters of the test statistic and the threshold of the test procedure in order to obtain a tight upper bound on the non-interactive testing radius. This result is further complemented with a lower bound. Next, we prove that these bounds can be improved when allowing for sequential interaction. When previously privatized random variables are publicly available, we may proceed in two steps in order to improve on the detection rates. The first part of the sample is privatized as in the non-interactive case and it is used to acquire partial information on the unknown probability density. This information is further encoded in the private versions of the second part of the sample, and the whole procedure benefits from this information and attains faster rates of detection. This idea was previously introduced in [9] and was also successful for testing discrete distributions in [6]. Finally, we investigate the optimality of our results for many choices of the null density f_0. We prove that our lower bounds and upper bounds match up to a constant in the sequentially interactive scenario, and up to a logarithmic factor in the non-interactive scenario, for several f_0 including densities from uniform, Gaussian, Beta, Cauchy, Pareto and exponential distributions.

Related Work Goodness-of-fit testing for the separation norm ‖·‖_1 has recently received great attention in the non-private setting. Valiant and Valiant [25] study the case of discrete distributions. Given a discrete distribution P_0 and an unknown discrete distribution P, they tackle the problem of finding how many samples from P one should obtain to be able to distinguish with high probability the case that P = P_0 from the case that ‖P − P_0‖_1 ≥ ε. They provide both upper bounds and lower bounds on this sample complexity as a function of ε and the null hypothesis P_0. Other testing procedures for this problem have been proposed in [4, 12], where the problem is revisited in a minimax framework similar to the one considered in this paper (without privacy constraints). Note that before these papers, the majority of the works on this problem focused on the case where P_0 is the uniform distribution, or considered a worst-case setting. The upper and lower bounds obtained in [4, 25] appear to match in most usual cases but do not match for some pathological distributions. This problem has been fixed in [11], where the authors provide matching upper and lower bounds on the minimax



separation distance for the separation norm ‖·‖_t, t ∈ [1, 2]. As for the continuous case, [4] studies goodness-of-fit testing for densities with separation norm ‖·‖_1, focusing on the case of Hölder continuous densities. As has already been observed for the discrete case, they prove that the local minimax testing radius (or minimax separation distance) strongly depends on the null distribution. We extend their results to the private setting. Many papers have been devoted to the study of testing problems under global differential privacy constraints. This includes goodness-of-fit testing [2, 3, 10, 15, 26], independence testing [15, 26] and closeness testing [2, 3]. In the local setting of differential privacy, [17-19] study simple hypothesis testing, and [1, 16, 23] consider independence testing. Some of these references and a few others also deal with goodness-of-fit testing under local differential privacy constraints: [16] studies the asymptotic distribution of several test statistics used for fitting multinomial distributions, while [1, 23] provide upper and lower bounds on the sample complexity for fitting more general but finitely supported discrete distributions. However, [1] considers only the case where the null distribution P_0 is the uniform distribution, and both papers prove lower bounds only with respect to the choice of the test statistic for a fixed specific privacy mechanism. In the minimax results below we prove optimality over all test statistics and also over all privacy mechanisms subject to the local differential privacy constraints. Minimax goodness-of-fit testing for discrete random variables has first been studied with the L_2 separation norm in [20]. They consider the non-interactive scenario exclusively, and their lower bound result is proven for the uniform distribution P_0 under the null. Lam-Weil et al. [20] also tackle the problem of goodness-of-fit testing for continuous random variables with ‖·‖_2 separation norm. They are the first to study minimax testing rates for the problem of goodness-of-fit testing for compactly supported densities over Besov balls B^s_{2,∞}(L) in the setting of non-interactive local differential privacy. They provide an upper bound which holds for any density f_0, and a matching lower bound in the special case where f_0 is the uniform density over [0, 1]. In a parallel work, [9] investigates the estimation of the integrated square of a density over general Besov classes B^s_{p,q}, and proves that allowing for sequential interaction improves over the results obtained in the non-interactive scenario in terms of minimax estimation rates. As an application, they discuss non-interactive and sequentially interactive L_2-goodness-of-fit testing for densities supported on [0, 1] which lie in Besov balls. They thus extend the results obtained in [20] to more general Besov balls, to the interactive scenario, and to the case where f_0 is not assumed to be the uniform distribution, but has to be bounded from below on its support. Later, locally differentially private goodness-of-fit testing for discrete random variables (not necessarily finitely supported) has been studied in [6] in a minimax framework. The authors aim at computing the minimax testing rates when d(P, P_0) = Σ_{j=1}^{d} |P(j) − P_0(j)|^i, i ∈ {1, 2}.
They provide upper bounds on the minimax testing rates by constructing and analysing specific private testing procedures, complement these results with lower bounds, and investigate the optimality of their results for several choices of the null distribution P0 . Interestingly, they tackle both the sequentially interactive case and the non-interactive case and prove



that the minimax testing rates are improved when sequential interaction is allowed. Such a phenomenon appears neither for simple hypothesis testing [17], nor for many estimation problems (see for instance [5, 8, 13, 21]).

We pursue these works by considering goodness-of-fit testing of Hölder-smooth probability densities with the separation norm ‖·‖_1. Moreover, similarly to [4], we consider densities with Hölder smoothness β in (0, 1] that can tend to 0 on their support, with possibly unbounded support. Our goal is to show how differential privacy affects the minimax separation radius for this goodness-of-fit test. Balakrishnan and Wasserman [4], following works in discrete testing initiated by [25], have shown that two procedures need to be aggregated in this case. They split the support of the density f_0 into a compact set B where f_0 is bounded from below by some positive constant and build a weighted L_2 test on this set; then they build a tail test on the complement B̄, which is based on estimates of the total probability (P − P_0)(B̄). They show that the separation rates are of order

( (∫_B f_0^γ(x) dx)^{1/γ} / n )^{2β/(4β+d)}, where γ = 2β/(3β + d),

for d-dimensional observations, and depend on f_0 via an integral functional. The cut-off (choice of B) will depend on n, and their separation rates are not minimax optimal due to different cut-offs in the upper and lower bounds. We show that under local differential privacy constraints, we get for an optimal choice of B the separation rate

|B|^{(3β+3)/(4β+3)} (nα²)^{−2β/(4β+3)}

when only non-interactive privacy mechanisms are allowed, and we show that the better rate

|B|^{(β+1)/(2β+1)} (nα²)^{−2β/(4β+2)}

is obtained when interactive privacy mechanisms are allowed (using previously published privatized information). We see that our rates only depend on f_0 in a global way, through the length |B| of the set B, and that explains why we do not need to weight the L_2 test statistic. Further work will include extension to more general Hölder and Besov classes with β > 0 and adaptation to the smoothness β by aggregation of an increasing number of tests as introduced by [24].

Organization of the Paper The paper is organized as follows. In Sect. 2 we introduce the notion of local differential privacy and describe the minimax framework considered in the rest of the paper. In Sect. 3 we introduce a non-interactive privacy mechanism and an associated



testing procedure. Its analysis leads to an upper bound on the non-interactive testing radius, which is complemented by a lower bound. In Sect. 4 we give a lower bound on the testing radius for the sequentially interactive scenario and present a sequentially interactive testing procedure which improves on the rates of the non-interactive case. In Sect. 5 we prove that our results are optimal (at most up to a logarithmic factor) for several choices of the null density f_0.

2 Problem Statement

Let (X_1, ..., X_n) ∈ X^n be i.i.d. with common probability density function (pdf) f : X → R_+. We assume that f belongs to the smoothness class H(β, L) for some smoothness 0 < β ≤ 1 and L > 0, where

H(β, L) = { f : X → R_+ : |f(x) − f(y)| ≤ L|x − y|^β, for all x, y ∈ X }.

In the sequel, we will omit the space X in the definition of functions f and f_0 and in integrals, and we will choose a set B such that B ⊂ X and denote B̄ = X \ B. Given a probability density function f_0 in H(β, L_0) for some L_0 < L, we want to solve the goodness-of-fit test

H_0 : f ≡ f_0    versus    H_1(ρ) : f ∈ H(β, L) and ‖f − f_0‖_1 ≥ ρ,

where ρ > 0, under an α-local differential privacy constraint. We will consider two classes of locally differentially private mechanisms: sequentially interactive mechanisms and non-interactive mechanisms. In the sequentially interactive scenario, privatized data Z_1, ..., Z_n are obtained by successively applying suitable Markov kernels: given X_i = x_i and Z_1 = z_1, ..., Z_{i−1} = z_{i−1}, the i-th data-holder draws

Z_i ∼ Q_i(· | X_i = x_i, Z_1 = z_1, ..., Z_{i−1} = z_{i−1})

for some Markov kernel Q_i : 𝒵 × X × Z^{i−1} → [0, 1], where the measure spaces of the non-private and private data are denoted by (X, 𝒳) and (Z, 𝒵), respectively. We say that the sequence of Markov kernels (Q_i)_{i=1,...,n} provides α-local differential privacy, or that Z_1, ..., Z_n are α-local differentially private views of X_1, ..., X_n, if

sup_{A∈𝒵} sup_{z_1,...,z_{i−1}∈Z} sup_{x,x'∈X} Q_i(A | X_i = x, Z_1 = z_1, ..., Z_{i−1} = z_{i−1}) / Q_i(A | X_i = x', Z_1 = z_1, ..., Z_{i−1} = z_{i−1}) ≤ e^α, for all i = 1, ..., n.   (1)



We will denote by Q_α the set of all α-LDP sequentially interactive mechanisms. In the non-interactive scenario, Z_i depends only on X_i but not on Z_k for k < i. We have Z_i ∼ Q_i(· | X_i = x_i), and condition (1) becomes

sup_{A∈𝒵} sup_{x,x'∈X} Q_i(A | X_i = x) / Q_i(A | X_i = x') ≤ e^α, for all i = 1, ..., n.

We will denote by Q_α^NI the set of all α-LDP non-interactive mechanisms. Given an α-LDP privacy mechanism Q, let Φ_Q = {φ : Z^n → {0, 1}} denote the set of all tests based on Z_1, ..., Z_n. The sequentially interactive α-LDP minimax testing risk is given by

R_{n,α}(f_0, ρ) := inf_{Q∈Q_α} inf_{φ∈Φ_Q} sup_{f∈H_1(ρ)} [ P_{Q^n_{f_0}}(φ = 1) + P_{Q^n_f}(φ = 0) ].

We define similarly the non-interactive α-LDP minimax testing risk R^NI_{n,α}(f_0, ρ), where the first infimum is taken over the set Q_α^NI instead of Q_α. Given γ ∈ (0, 1), we study the α-LDP minimax testing radius defined by

E_{n,α}(f_0, γ) := inf { ρ > 0 : R_{n,α}(f_0, ρ) ≤ γ },

and we define similarly E^NI_{n,α}(f_0, γ).

Notation. For any positive integer n, we denote by ⟦1, n⟧ the set of integers {1, 2, ..., n}. If B is a compact set of R, we denote by |B| its length (its Lebesgue measure). For any function ψ and any positive real number h, we denote the rescaled function by ψ_h = (1/h) ψ(·/h). For two sequences (a_n)_n and (b_n)_n, we write a_n ≲ b_n if there exists some constant C > 0 such that a_n ≤ C b_n, and we write a_n ≍ b_n if both a_n ≲ b_n and b_n ≲ a_n.
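The testing risk defined above compares, for a fixed privacy mechanism Q and test φ, the first type error under f_0 with the second type error over the alternative. The following minimal Python sketch (not part of the paper) illustrates how these two error probabilities can be estimated by Monte Carlo for one fixed channel and one fixed test; the randomized-response channel `privatize` (of the same form as the tail-test channels used later), the threshold test and all numerical values are assumptions made only for this illustration.

```python
# Minimal sketch: estimating the two error probabilities of a fixed private test.
import numpy as np

rng = np.random.default_rng(0)

def privatize(x, alpha):
    """Toy non-interactive alpha-LDP channel: randomized response on I(x <= 0.5)."""
    c = (np.exp(alpha) + 1.0) / (np.exp(alpha) - 1.0)
    signs = np.where(rng.random(x.shape) < 0.5 * (1 + (x <= 0.5) / c), 1.0, -1.0)
    return c * signs            # Z takes values +-c, with E[Z | X=x] = I(x <= 0.5)

def test(z, threshold):
    """Toy test phi: reject H0 when the empirical mean of the private data is large."""
    return float(np.mean(z) > threshold)

def rejection_probability(sampler, alpha, threshold, n=500, reps=2000):
    """Estimate P(phi = 1) when the data are drawn from `sampler` and privatized."""
    rejections = [test(privatize(sampler(n), alpha), threshold) for _ in range(reps)]
    return float(np.mean(rejections))

# With the uniform null density this estimates the first type error; replacing the
# sampler by an alternative density f with ||f - f_0||_1 >= rho estimates the power.
type_I = rejection_probability(lambda m: rng.random(m), alpha=1.0, threshold=0.7)
print("estimated first type error:", type_I)
```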

3 Non-interactive Privacy Mechanisms

In this section we design a non-interactive α-locally differentially private mechanism and the associated testing procedure. We study successively its first and second type error probabilities in order to obtain an upper bound on the testing radius E^NI_{n,α}(f_0, γ). We then present a lower bound on the testing radius. The test and privacy mechanism proposed in this section will turn out to be (nearly) optimal for many choices of f_0, since the lower bound and the upper bound match up to a logarithmic factor for several f_0; see Sect. 5 for many examples.



3.1 Upper Bound in the Non-interactive Scenario

We propose a testing procedure that, like [4], combines an L_2 procedure on a bulk set B where the density f_0 under the null is bounded away from 0 by some (small) constant and an L_1 procedure on the tail B̄. However, we note that, unlike [4], the rate depends on f_0 in a global way, only through the length |B| of the set B. Our procedure also translates to the case of continuous distributions the one proposed by Berrett and Butucea [6] for locally private testing of discrete distributions. It consists in the following steps:

1. Consider a compact set B ⊂ R (its choice depends on f_0, and on the values of n and α).
2. Using the first half of the (privatized) data, define an estimator S_B of ∫_B (f − f_0)².
3. Using the second half of the (privatized) data, define an estimator T_B of ∫_{B̄} (f − f_0).
4. Reject H_0 if either S_B ≥ t_1 or T_B ≥ t_2.

Assume without loss of generality that the sample size is even and equal to 2n, so that we can split the data into equal parts, X_1, ..., X_n and X_{n+1}, ..., X_{2n}. Let B ⊂ R be a nonempty compact set, let (B_j)_{j=1,...,N} be a partition of B, h > 0 be the bandwidth and (x_1, ..., x_N) be the centering points, that is B_j = [x_j − h, x_j + h] for all j ∈ ⟦1, N⟧. Let ψ : R → R be a function satisfying the following assumptions.

Assumption 3.1 ψ is a bounded function supported in [−1, 1] such that

∫_{−1}^{1} ψ(t) dt = 1, and ∫_{−1}^{1} |t|^β |ψ(t)| dt < ∞.
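As a quick illustration (not from the paper), the box kernel ψ = (1/2)·1_{[−1,1]} is one admissible choice in Assumption 3.1; the short check below, with an arbitrary smoothness β, verifies the two integral conditions numerically.

```python
# Numerical check of Assumption 3.1 for an assumed box kernel.
import numpy as np

def psi(t):
    # box kernel: bounded, supported in [-1, 1]
    return 0.5 * (np.abs(t) <= 1.0)

t = np.linspace(-1.0, 1.0, 200001)
dt = t[1] - t[0]
beta = 0.7                                                        # any smoothness in (0, 1]
print("int psi            :", np.sum(psi(t)) * dt)                # should be ~ 1
print("int |t|^beta |psi| :", np.sum(np.abs(t) ** beta * psi(t)) * dt)  # finite
```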

In particular, Assumption 3.1 implies that ψ_h(x_j − y) = 0 if y ∉ B_j, where ψ_h(u) = (1/h) ψ(u/h). We now define our first privacy mechanism. For i ∈ ⟦1, n⟧ and j ∈ ⟦1, N⟧ set

Z_{ij} = (1/h) ψ((x_j − X_i)/h) + (2‖ψ‖_∞ / (αh)) W_{ij},

where (W_{ij})_{i∈⟦1,n⟧, j∈⟦1,N⟧} is a sequence of i.i.d. Laplace(1) random variables. Using these privatized data, we define the following U-statistic of order 2:

S_B := Σ_{j=1}^{N} (1/(n(n−1))) Σ_{i≠k} (Z_{ij} − f_0(x_j))(Z_{kj} − f_0(x_j)).

The second half of the sample is used to design a tail test. For all i ∈ ⟦n+1, 2n⟧ set

Z_i = ±c_α, with probabilities (1/2) (1 ± I(X_i ∉ B)/c_α),



where c_α = (e^α + 1)/(e^α − 1). Using these private data, we define the following statistic:

T_B = (1/n) Σ_{i=n+1}^{2n} Z_i − ∫_{B̄} f_0.

We then put

Φ = 1 if S_B ≥ t_1 or T_B ≥ t_2, and Φ = 0 otherwise,   (2)

where

t_1 = (3/2) L_0² C_β² N h^{2β} + 196 ‖ψ‖_∞² √N / (γ n α² h²),   t_2 = √(20 / (n α² γ)),   (3)

with C_β = ∫_{−1}^{1} |u|^β |ψ(u)| du. The privacy mechanism that outputs (Z_1, ..., Z_n, Z_{n+1}, ..., Z_{2n}) is non-interactive, since for all i ∈ ⟦1, 2n⟧, Z_i depends only on X_i. The following result establishes that this mechanism also provides α-local differential privacy. Its proof is deferred to Sect. A.1 in the Appendix.

Proposition 3.2 For all i ∈ ⟦1, 2n⟧, Z_i is an α-locally differentially private view of X_i.

The following proposition studies the properties of the test statistics. Its proof is given in Appendix A.2.

Proposition 3.3 1. It holds

E_{Q^n_f}[S_B] = Σ_{j=1}^{N} ([ψ_h ∗ f](x_j) − f_0(x_j))².   (4)

Under Assumption 3.1, it also holds, if α ∈ (0, 1],

Var_{Q^n_f}(S_B) ≤ (36‖ψ‖_∞² / (nα²h²)) Σ_{j=1}^{N} ([ψ_h ∗ f](x_j) − f_0(x_j))² + 164‖ψ‖_∞⁴ N / (n(n−1)α⁴h⁴).   (5)

2. It holds

E_{Q^n_f}[T_B] = ∫_{B̄} (f − f_0), and Var_{Q^n_f}(T_B) = (1/n) ( c_α² − (∫_{B̄} f)² ).

B

The study of the first and second type error probabilities of the test  in (2) with NI ( f 0 , γ). a convenient choice of h leads to the following upper bound on En,α

62

A. Dubois et al.

Theorem 3.4 Assume that α ∈ (0, 1) and β ≤ 1. The test procedure  in (2) with t1 and t2 in (3) and bandwidth h given by h |B|−1/(4β+3) (nα2 )−2/(4β+3) attains the following bound on the separation rate NI En,α ( f 0 , γ)

  3β+3 2β 1 2 − 4β+3 4β+3 , ≤ C(L , γ, ψ) · |B| (nα ) + f0 + √ nα2 B

for all compact set B ⊂ R. The proof can be found in Appendix A.2. Note that the tightest upper bound is obtained for the sets B that minimize the right-hand sides in Theorem 3.4. In order to do this, we note that theupper bounds sum a√term which increases with B, a term which decreases with B: B f 0 and a term 1/ nα2 free of B. Thus we suggest to choose B = Bn,α as a level set  Bn,α ∈ arg

inf

B compact set

|B| :

f 0 ≥ |B| B

3β+3 4β+3

− 2β (nz α2 ) 4β+3

1



+√ and inf f 0 ≥ sup f 0 . B nα2 B

(6)
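To make the non-interactive procedure concrete, here is a minimal simulation sketch of the privatisation step, the statistics S_B and T_B and the thresholds (3), under illustrative assumptions: the box kernel, B = [0, 1], f_0 the uniform density on [0, 1] (as in Example 5.1), and demo values of n, α, γ and L_0. It is a sketch of the construction above, not the authors' implementation.

```python
# Non-interactive mechanism and test of Sect. 3.1 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, gamma, L0 = 1.0, 1.0, 0.05, 1.0
Cbeta = 0.5                      # int |u|^beta psi(u) du for the box kernel and beta = 1
psi_sup = 0.5                    # ||psi||_inf for the box kernel
n = 2000                         # half-sample size; the full sample has size 2n

h = (n * alpha**2) ** (-2 / (4 * beta + 3))   # bandwidth of Theorem 3.4 with |B| = 1
N = int(1 / (2 * h)); h = 1 / (2 * N)         # N bins of length 2h partitioning B = [0, 1]
x = (2 * np.arange(N) + 1) * h                # centering points x_j

# first half of the sample: privatized kernel evaluations Z_ij
X1 = rng.random(n)
kernel = (np.abs(x[None, :] - X1[:, None]) <= h) * psi_sup / h          # psi_h(x_j - X_i)
Z = kernel + (2 * psi_sup / (alpha * h)) * rng.laplace(size=(n, N))     # + Laplace(1) noise

# U-statistic S_B (sum over i != k), with f_0(x_j) = 1 on [0, 1]
D = Z - 1.0
S_B = np.sum((D.sum(axis=0) ** 2 - (D ** 2).sum(axis=0)) / (n * (n - 1)))

# second half: randomized response for the tail statistic T_B (here int_{B bar} f_0 = 0)
X2 = rng.random(n)
c = (np.exp(alpha) + 1) / (np.exp(alpha) - 1)
outside = (X2 < 0) | (X2 > 1)
Z_tail = c * np.where(rng.random(n) < 0.5 * (1 + outside / c), 1.0, -1.0)
T_B = Z_tail.mean()

t1 = 1.5 * (L0 * Cbeta) ** 2 * N * h ** (2 * beta) \
     + 196 * psi_sup**2 * np.sqrt(N) / (gamma * n * alpha**2 * h**2)
t2 = np.sqrt(20 / (n * alpha**2 * gamma))
print("S_B=%.4f (t1=%.4f)  T_B=%.4f (t2=%.4f)  reject=%s"
      % (S_B, t1, T_B, t2, (S_B >= t1) or (T_B >= t2)))
```

Since the data are generated under H_0 here, the printed decision should be a rejection only with small probability, by the choice of t_1 and t_2; replacing the generating density by an alternative f with ‖f − f_0‖_1 above the rate of Theorem 3.4 should make rejections frequent.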

3.2 Lower Bound in the Non-interactive Scenario

We now complete the study of the testing radius E^NI_{n,α}(f_0, γ) with the following lower bound.

Theorem 3.5 Let α > 0. Assume that β ≤ 1. Set z_α = e^{2α} − e^{−2α} and C_0(B) = min{f_0(x) : x ∈ B}. For all compact sets B ⊂ R we get

E^NI_{n,α}(f_0, γ) ≥ C(γ, L, L_0) [ log( C |B|^{(4β+4)/(4β+3)} (nz_α²)^{2/(4β+3)} ) ]^{−1} · min{ |B| C_0(B), |B|^{(3β+3)/(4β+3)} (nz_α²)^{−2β/(4β+3)} }.

If, moreover, the compact set B satisfies

|B|^{β/(4β+3)} C_0(B) ≥ C (nz_α²)^{−2β/(4β+3)}   (7)

for some C > 0, it holds

E^NI_{n,α}(f_0, γ) ≥ C(γ, L, L_0) [ log( C |B|^{(4β+4)/(4β+3)} (nz_α²)^{2/(4β+3)} ) ]^{−1} |B|^{(3β+3)/(4β+3)} (nz_α²)^{−2β/(4β+3)}.

Discussion of the optimality of the bounds. The choice of the set B is crucial for obtaining matching rates in the upper and lower bounds. In the case where the support X of f_0 is compact with c_1 ≤ |X| ≤ c_2 for two constants c_1 > 0 and c_2 > 0, and if f_0 is bounded from below on X, one can take



B = X. Indeed, for such functions, the choice B = X yields an upper bound of order (nα²)^{−2β/(4β+3)}. Moreover, (7) holds with this choice of B, and Theorem 3.5 proves that the upper bound is optimal up to (at most) a logarithmic factor. In the case of densities with bounded support but which can tend to 0 on their support, and in the case of densities with unbounded support, we suggest to choose B = B_{n,α} as defined in (6) both in the upper and lower bounds. By inspection of the proof, we can also write that B_{n,α} in (6) is such that B_{n,α} ∈ arg

inf

B compact set

where ψn,α (B) = |B|h β + ∗





f 0 ≥ ψn,α (B) and inf f 0 ≥ sup f 0 ,

|B| :

B

B

3β+3 2β |B|√3/4 + √ 1 2 = |B| 4β+3 (nα2 )− 4β+3 h 3/4 nα2 nα 1/2 2 −2/(4β+3)

B

+

√1 nα2

for an opti-

mal choice of h = h*(B) = (|B|^{1/2} nα²)^{−2/(4β+3)}. Indeed, we choose B_{n,α} as a level set such that ∫_{B̄} f_0 (which is decreasing with B) equals ψ_{n,α}(B) (which is increasing with B). For the choices B = B_{n,α} and h = h*(B_{n,α}) we thus obtain an upper bound on E^NI_{n,α}(f_0, γ) of order



|Bn,α | 4β+3 (nα2 )− 4β+3 + √

1 nα2

.

Recall that f 0 is a Hölder smooth function and thus uniformly bounded. Moreover, ψn,α (B) and B f 0 are continuous quantities of the length of the set B when it varies in  the family of level sets. Thus, for small rates ψn,α (Bn,α ) we have necessarily Bn,α f 0 that does not tend to 0, hence |Bn,α | does not tend to 0. Then the term 3β+3



|Bn,α | 4β+3 (nα2 )− 4β+3 will be dominant. The following proposition gives a sufficient condition so that our upper and lower bounds match up to a logarithmic factor. Proposition 3.6 Let Bn,α be defined by (6). If there exists a compact set K ⊂ B n,α and some c ∈]0, 1[ such that |Bn,α |  1, (8) f0 ≥ c f 0 and c |K | K B n,α then it holds   −1 4β+4 3β+3 2β 2 log |Bn,α | 4β+3 (nα2 ) 4β+3 |Bn,α | 4β+3 (nα2 )− 4β+3 3β+3



NI  En,α ( f 0 , γ)  |Bn,α | 4β+3 (nα2 )− 4β+3 .

Proof Indeed, if K satisfies (8), then it holds f 0 ≤ |K | sup f 0 ≤ |K | sup f 0 ≤ |K | inf f 0 , K

K

B n,α

Bn,α



and



β

β f 0 ≥ cψn,α (Bn,α ) ≥ c|Bn,α | h ∗ (Bn,α )  |K | h ∗ (Bn,α ) ,

f0 ≥ c K

B n,α

β which yields inf Bn,α f 0  h ∗ (Bn,α ) , and condition (7) is thus satisfied with B =  Bn,α . Thus, the choice B = Bn,α ends the proof of the proposition. Let us now discuss a sufficient condition for the existence of a compact set K ⊂ B n,α satisfying (8). Let us consider the special case of decreasing densities f 0 with support X = [0, +∞). Note that for such functions, Bn,α takes the form Bn,α = [0, a]. Writing f 0 (x) = (x)/(1 + x), a sufficient condition for the existence of a compact set K ⊂ B n,α satisfying (8) is that sup x≥1

(t x) ≤c (x)

for some constant c < 1 and some t > 1. Indeed, in this case, taking K = [a, ta], it holds c|Bn,α |/|K | = c/t, and



ta

  ∞ ∞ (x/t) (u) t (1 + x) (u) dx = ct du ≤ c sup du f0 ≤ c 1 + x 1 + tu 1 + t x 1 +u x≥a ta a a   ∞ t −1 ≤c 1+ f0 , 1 + ta a

and thus





∞  f0 f0 K  = 1 − ta∞ ≥ 1 − c{1 + o(1)}, B n,α f 0 a f0

and (8) is satisfied if a is large enough. In this case our upper and lower bounds match up to a logarithmic factor. Note that f_0 in Example 5.2 checks the condition for all t > 1, and the only example where this condition is not satisfied is Example 5.8. In the latter, the density f_0(x) = A (log 2)^A / ((x + 2) log^{A+1}(x + 2)), x ∈ [0, ∞), for some A > 0 arbitrarily small but fixed, has very slowly decreasing tails. An additional logarithmic factor is lost in the lower bounds in this least favorable case.

Proof of Theorem 3.5 We use the well-known reduction technique. The idea is to build a family {f_ν : ν ∈ V} that belongs to the alternative set of densities H_1(ρ) and then reduce the test problem to testing between f_0 and the mixture of the f_ν. Our construction of such functions is inspired by the one proposed in [20] for goodness-of-fit testing over Besov balls B^s_{2,∞} in the special case where f_0 is the uniform distribution over [0, 1], and in [9] for the minimax estimation over Besov ellipsoids B^s_{p,q} of the integrated square of a density supported in [0, 1]. However, we need to make some modifications in order to consider Hölder smoothness instead of Besov



smoothness and to tackle the case of densities with unbounded support. Let B ⊂ R be a nonempty compact set, and let (B j ) j=1,...,N be a partition of B, h > 0 be the bandwidth and (x1 , . . . , x N ) be the centering points, that is B j = [x j − h, x j + h] for  2all j ∈ 1, N . Let ψ : [−1, 1] → R be such that ψ ∈ H (β, L), ψ = 0 and ψ = 1. For j ∈ 1, N , define   t − xj 1 . ψ j : t ∈ R → √ ψ h h  Note that the support of ψ j is B j , ψ j = 0 and (ψ j ) j=1,...,N is an orthonormal family. Fix a privacy mechanism Q = (Q 1 , . . . , Q n ) ∈ QNI α . According to Lemma B.3 in [9], we can consider for every i ∈ 1, n a probability measure μi on Zi and a family of μi -densities (qi (· | x))x∈R such that for every x ∈ R one has d Q i (· | x) = qi (· | x)dμi and e−α ≤ qi (· | x) ≤ eα . Denote by g0,i (z i ) = R qi (z i | x) f 0 (x)dx the density of Z i when X i has density f 0 . Define for all i = 1, . . . , n the operator K i : L 2 (R) → L 2 (Zi , dμi ) by Ki f =

R

qi (· | x) f (x)1 B (x)  dx, g0,i (·)

f ∈ L 2 (R).

 Note that this operator is well-defined since g0,i (z i ) ≥ R e−α f 0 (x)dx = e−α > 0 for all z i . Observe that its adjoint operator K i is given by K i :  ∈ L 2 (Zi , dμi ) →

Zi

(z i )qi (z i | ·)1 B (·)  dμi (z i ). g0,i (z i )

Using Fubini’s theorem we thus have for all f ∈ L 2 (R) K i K i



f =

Zi

R

R

Zi

 =

qi (z i | y) f (y)1 B (y)  dy g0,i (z i )



qi (z i | ·)1 B (·)  dμi (z i ) g0,i (z i )  qi (z i | y)qi (z i | ·)1 B (y)1 B (·) dμi (z i ) f (y)dy, g0,i (z i )

meaning that K i K i is an integral operator with kernel Fi (x, y) = qi (z i | x)qi (z i | y)1 B (x)1 B (y) dμi (z i ). Define the operator g0,i (z i ) K =

n 1  K Ki , n i=1 i

which is symmetric and positive semidefinite. Define also W N = span{ψ j , j = 1, . . . , N }.

 Zi



Let (v1 , . . . , v N ) be an orthonormal family of eigenfunctions of K as an operator on the linear L 2 (R)-subspace W N . Note that since vk can be written as a linear combination of the ψ j ’s, it holds R vk = 0 and Supp(vk ) ⊂ B. We also denote by λ21 , . . . , λ2N the corresponding eigenvalues. Note that they are non-negative. Define the functions f ν : x ∈ R → f 0 (x) + δ

N

νj v j (x), λ˜ j j=1

where for j = 1, . . . , N ν j ∈ {−1, 1}, δ > 0 may depend on B,h, N , ψ, γ, L, L 0 , β, n and α, and will be specified later, and   λj √ , 2h , z α = e2α − e−2α . λ˜j = max zα The following lemma shows that for δ properly chosen, for most of the possible ν ∈ {−1, 1} N , f ν is a density belonging to H (β, L) and f ν is sufficiently far away  from f 0 in a L 1 sense. Lemma 3.7 Let Pν denote the uniform distribution on {−1, 1} N . Let b > 0. If the parameter δ appearing in the definition of f ν satisfies 

   C0 (B) 1 L0 δ≤ , min 1− hβ , ψ∞ 2 L log(2N /b) h

where C0 (B) := min{ f 0 (x) : x ∈ B}, then there exists a subset Ab ⊆ {−1, 1} N with Pν (Ab ) ≥ 1 − b such that  (i) f ν ≥ 0 and f ν = 1, for all ν ∈ Ab , (ii) f ν ∈ H (β, L), for all ν ∈ Ab , 1 (iii)  f ν − f 0 1 ≥ 3C8 1 √ δ N 2N , for all ν ∈ Ab , with C1 = −1 |ψ|. log( b )  Denote by gν,i (z i ) = R qi (z i | x) f ν (x)dx the density of Z i when X i has density f ν , and   n  gν,i (z i )dμi (z i ) . d Q n (z 1 , . . . , z n ) = Eν i=1

If δ is chosen such that δ ≤ √

h log(2N /b)

ρ =

 min

C0 (B) 1 , ψ∞ 2

δN 3C1  , 8 log 2N b

1−

L0 L



h β , setting



we deduce from the above lemma that if ⎡ 2 ⎤ d Q n ⎦ ≤ 1 + (1 − γ − b)2 for all Q ∈ QNI E Q nf ⎣ α , 0 d Q nf0

(9)

then it holds inf

inf

sup

Q∈QNI α φ∈ Q f ∈H1 (ρ )

 P Q nf (φ = 1) + P Q nf (φ = 0) ≥ γ, 0

 where H1 (ρ ) := { f ∈ H (β, L) : f ≥ 0, f = 1,  f − f 0 1 ≥ ρ }, NI ( f 0 , γ) ≥ ρ . Indeed, if (9) holds, then we have consequently En,α inf

inf

sup

Q∈QNI α φ∈ Q f ∈H1 (ρ )

 P Q nf (φ = 1) + P Q nf (φ = 0) 0





P Q nf (φ = 1) + sup P Q nfν (φ = 0)

≥ inf

inf

≥ inf

   inf P Q nf (φ = 1) + Eν I (ν ∈ Ab )P Q nfν (φ = 0) ,

Q∈QNI α φ∈ Q Q∈QNI α φ∈ Q

0

ν∈Ab

0

and     Eν I (ν ∈ Ab )P Q nfν (φ = 0) = P Q n (φ = 0) − Eν I (ν ∈ Acb )P Q nfν (φ = 0) ≥ P Q n (φ = 0) − Pν (Acb ) ≥ P Q n (φ = 0) − b. Thus, if (9) holds, we have inf

inf

sup

Q∈QNI α φ∈ Q f ∈H1 (ρ )

 P Q nf (φ = 1) + P Q nf (φ = 0) 0



 inf P Q nf (φ = 1) + P Q n (φ = 0) − b 0 Q∈QNI α φ∈ Q

≥ inf 1 − TV(Q n , Q nf0 ) − b NI Q∈Qα % ⎛ ⎞ ⎡ & 2 ⎤ & d Qn ⎦ & ⎜ ⎟ − 1⎠ ≥ γ. = inf ⎝1 − b − 'E Q nf ⎣ n 0 NI d Q f0 Q∈Qα ≥ inf

We now prove that (9) holds under an extra assumption on δ.

and



We have that ⎡ E

Q nf 0



d Qn d Q nf0

 =E

Q nf 0

= Eν,ν

Eν,ν n  i=1



2 ⎤

⎡

⎦ = EQn ⎣ f0

n  i=1



+,n ,n

i=1 gν,i (Z i )

i=1 g0,i (Z i )



N

νk qi (Z i | ·), vk  · 1+δ g0,i (Z i ) λ˜ k k=1

- 2 ⎤ ⎦  

N

νk qi (Z i | ·), vk  · 1+δ · g0,i (Z i ) λ˜ k



k=1

. . / / N N

νk νk qi (Z i | ·), vk  qi (Z i | ·), vk  +δ E Q f0 E Q f0 1+δ ˜ ˜ g0,i (Z i ) g0,i (Z i ) k=1 λk k=1 λk ⎞ . / N

νk1 νk 2 qi (Z i | ·), vk1 qi (Z i | ·), vk2  ⎠ 2 , E Q f0 +δ (g0,i (Z i ))2 λ˜ k1 λ˜ k2 k1 ,k2 =1

where we have interverted E Q nf and Eν,ν and used the independence of the Z i , 0 i = 1, . . . , n. Now, observe that . / qi (Z i | ·), vk  qi (z i | ·), vk  = · g0,i (z i )dμi (z i ) E Q f0 g0,i (Z i ) g0,i (z i ) Z  i  qi (z i | x)vk (x)dx dμi (z i ) = Z R i = vk = 0, R

and, using that Supp(vk ) ⊂ B for all k, . / qi (Z i | ·), vk1 qi (Z i | ·), vk2  E Q f0 (g0,i (Z i ))2 qi (z i | ·), vk1 qi (z i | ·), vk2  = · g0,i (z i )dμi (z i ) (g0,i (z i ))2 Zi     1 qi (z i | x)vk1 (x)dx qi (z i | y)vk2 (y)dy dμi (z i ) = Z g0,i (z i ) R R  i  qi (z i | x)qi (z i | y)1 B (x)1 B (y) dμi (z i ) vk1 (x)vk2 (y)dxdy = g0,i (z i ) R R Zi = Fi (x, y)vk1 (x)vk2 (y)dxdy = vk1 , K i K i vk2 . R

R

Using 1 + x ≤ exp(x), we thus obtain

Goodness-of-Fit Testing for Hölder Continuous Densities …

⎡ E Q nf

0

⎣ d Qn d Q nf0

2 ⎤


 N

νk1 νk 2  vk1 , K i K i vk2  1+δ ˜ ˜ i=1 k1 ,k2 =1 λk1 λk2    N n

νk1 νk 2 2  ≤ Eν,ν exp δ vk1 , K i K i vk2  ˜ ˜ i=1 k1 ,k2 =1 λk1 λk2    N

νk1 νk 2 2 = Eν,ν exp nδ vk1 , K vk2  ˜ ˜ k1 ,k2 =1 λk1 λk2    N

νk1 νk 2 2 2 = Eν,ν exp nδ · λk2 vk1 , vk2  ˜ ˜ k1 ,k2 =1 λk1 λk2    N

≤ Eν,ν exp nδ 2 z α2 νk νk ,

⎦ = Eν,ν

n 



2

k=1

where we have used that λ2k λ2k ≤ z α2 . = max{z α−2 λ2k , 2h} λ˜ 2k Now, using that for k = 1, . . . , N , νk , νk are Rademacher distributed and independent random variables, we obtain ⎡  2 ⎤ N 

2 2 d Q n ⎦ ≤ Eν,ν E Q nf ⎣ exp nδ z α νk νk 0 d Q nf0 k=1  N   N  

2 2

2 2 N n 2 δ 4 z α4 = Eν , cosh nδ z α νk = cosh nδ z α ≤ exp 2 k=1 k=1 where the last inequality follows from cosh(x) ≤ exp(x 2 /2) for all x ∈ R. Thus, (9) holds as soon as 

1/4 2 log 1 + (1 − b − γ)2 δ≤ . N n 2 z α4 

Finally, taking δ = min obtain



h log(2N /b)

 min

C0 (B) 1 ψ∞ , 2



1−

L0 L







/1/4 . 2 log 1+(1−b−γ)2 , , we N n2 z4 α



1

log (2N /b)      N 3/4 L0 |B| C0 (B) 1 β 1− h , min  , min . ψ∞ 2 L log(2N /b) nz α2 

If B is chosen such that C0 (B) = min{ f 0 (x), x ∈ B} ≥ Ch β , then the bound becomes NI ( f 0 , γ) En,α



|B|h β

N 3/4 min  ≥ C(ψ, b, γ, L , L 0 )  , log (2N /b) log(2N /b) nz α2 1

 ,

and the choice h |B|−1/(4β+3) (nz α2 )−2/(4β+3) yields −1   4β+4 3β+3 2β 2 NI ( f 0 , γ) ≥ C(ψ, b, γ, L , L 0 ) log C|B| 4β+3 (nz α2 ) 4β+3 |B| 4β+3 (nz α2 )− 4β+3 . En,α Note that with this choice of h, the condition C0 (B) ≥ Ch β becomes |B|β/(4β+3) C0 (B) ≥ C(nz α2 )−2β/(4β+3) .

4 Interactive Privacy Mechanisms In this section, we prove that the results obtained in Sect. 3 can be improved when sequential interaction is allowed between data-holders.

4.1 Upper Bound in the Interactive Scenario

We first propose a testing procedure which relies on some sequential interaction between data-holders. We then prove that this test achieves a better separation rate than the one obtained in Sect. 3. We assume that the sample size is equal to 3n so that we can split the data in three parts. Like in the non-interactive scenario, we consider a non-empty compact set B ⊂ R, and B = ∪_{j=1}^{N} B_j a partition of B with |B_j| = 2h for all j ∈ ⟦1, N⟧. With the first third of the data, X_1, ..., X_n, we generate privatized arrays Z_i = (Z_{ij})_{j=1,...,N} that will be used to estimate p(j) := ∫_{B_j} f. Let us consider the following privacy mechanism. We first generate an i.i.d. sequence (W_{ij})_{i∈⟦1,n⟧, j∈⟦1,N⟧} of Laplace(1) random variables, and for i = 1, ..., n and j = 1, ..., N we set


Z i j = I (X i ∈ B j ) +

71

2 Wi j . α

For each j = 1, . . . , N , we then build an estimator of p( j) := 1 pj =

 Bj

f via

n 1 Zi j . n i=1 α

We now privatize the second third of the data. Set c_α = (e^α + 1)/(e^α − 1) and τ = (nα²)^{−1/2}. For all i ∈ ⟦n+1, 2n⟧, we generate Z_i ∈ {−c_α τ, c_α τ} using the estimator p̂_j and the true data X_i by

P Z i = ±cα τ | X i ∈ B j

where [x]τ−τ statistic



  [1 p j − p0 ( j)]τ−τ 1 1± , = 2 cα τ

1 P Z i = ±cα τ | X i ∈ B¯ = , 2  = max{−τ , min(x, τ )}, and p0 ( j) = B j f 0 . We then define the test DB =

2n N

1 Zi − p0 ( j)[1 p j − p0 ( j)]τ−τ . n i=n+1 j=1

The analysis of the mean and variance of this statistic can be found in Appendix B.2. It will be crucial in the analysis of our final test procedure. Finally, we define the same tail test statistic as in Sect. 3. For all i ∈ ⟦2n+1, 3n⟧, a private view Z_i of X_i is generated by Z_i = ±c_α with probabilities (1/2)(1 ± I(X_i ∉ B)/c_α), and we set

3n 1 TB = Zi − f0 . n i=2n+1 B 

The final test is = where

1 0

if D B ≥ t1 or TB ≥ t2 , otherwise

√ 2 5 t1 = √ , nα2 γ

(10)

 t2 =

20 . nα2 γ

(11)



We denote the privacy mechanism that outputs (Z 1 , . . . , Z n , Z n+1 , . . . , Z 2n , Z 2n+1 , . . . , Z 3n ) by Q. It is sequentially interactive since each Z i for i ∈ n + 1, 2n p j , but does not depend on the depends on the privatized data (Z 1 , . . . , Z n ) through 1 other Z k , k ∈ n + 1, 2n, k = i. The following result establishes that this mechanism provides α-local differential privacy. Its proof is deferred to Appendix B.1. Proposition 4.1 The sequentially interactive privacy mechanism Q provides αlocal differential privacy. The following Proposition gives properties of the test statistic D B . Its proof is in the Appendix B.2. +  p j − p0 ( j)]τ−τ . Proposition 4.2 1. It holds E Q f n [D B ] = Nj=1 { p( j) − p0 ( j)}E [1 In particular, E Q f0n [D B ] = 0. Moreover, we have E Q f n [D B ] ≥ with Dτ ( f ) =  p( j) := B j f . 2. It holds

N j=1

1 τ Dτ ( f ) − 6 √ , 6 n

(12)

| p( j) − p0 ( j)| min {| p( j) − p0 ( j)|, τ } where we recall that

Var Q f n (D B ) ≤

5 Dτ ( f ) + 67 . 2 2 (nα ) nα2

The following result presents an upper bound on E_{n,α}(f_0, γ). Its proof is in Appendix B.3.

Theorem 4.3 Assume that α ∈ (0, 1) and β < 1. The test procedure Φ in (10) with t_1 and t_2 in (11) and bandwidth h given by h ≍ |B|^{−1/(2β+1)} (nα²)^{−1/(2β+1)} attains the following bound on the separation rate:

E_{n,α}(f_0, γ) ≤ C(L, L_0, γ) ( |B|^{(β+1)/(2β+1)} (nα²)^{−β/(2β+1)} + ∫_{B̄} f_0 + 1/√(nα²) ).

This result indicates to choose the optimal set B = B_{n,α} as a level set

B_{n,α} = arg inf_{B compact set} { |B| : ∫_{B̄} f_0 ≥ |B|^{(β+1)/(2β+1)} (nα²)^{−β/(2β+1)} + 1/√(nα²) and inf_B f_0 ≥ sup_{B̄} f_0 }.   (13)
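Below is a minimal sketch of the two-stage, sequentially interactive mechanism and of the statistics D_B and T_B, under the same illustrative assumptions as before (B = [0, 1], f_0 uniform on [0, 1], demo constants); the data-generating density, the bin construction and all numerical values are assumptions made for the illustration, not the authors' code.

```python
# Sequentially interactive mechanism and test of Sect. 4.1 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, gamma = 1.0, 1.0, 0.05
n = 2000                                          # the full sample has size 3n

h = (n * alpha**2) ** (-1 / (2 * beta + 1))       # bandwidth of Theorem 4.3 with |B| = 1
N = max(1, int(round(1 / (2 * h))))
edges = np.linspace(0.0, 1.0, N + 1)
p0 = np.diff(edges)                               # p_0(j) = int_{B_j} f_0 for uniform f_0

def sample(m):                                    # data-generating density (here: the null itself)
    return rng.random(m)

# first third: noisy histogram counts, Z_ij = 1{X_i in B_j} + (2/alpha) W_ij
X1 = sample(n)
ind = (np.digitize(X1, edges[1:-1])[:, None] == np.arange(N)[None, :]).astype(float)
p_hat = (ind + (2 / alpha) * rng.laplace(size=(n, N))).mean(axis=0)

# second third: responses centred around the clipped estimates [p_hat - p_0]_{-tau}^{tau}
X2 = sample(n)
c = (np.exp(alpha) + 1) / (np.exp(alpha) - 1)
tau = (n * alpha**2) ** (-0.5)
clip = np.clip(p_hat - p0, -tau, tau)
j2 = np.digitize(X2, edges[1:-1])                 # bin index of each X_i (all points fall in B here)
prob_plus = 0.5 * (1 + clip[j2] / (c * tau))
Z2 = c * tau * np.where(rng.random(n) < prob_plus, 1.0, -1.0)
D_B = Z2.mean() - np.sum(p0 * clip)

# last third: same tail statistic as in the non-interactive case (int_{B bar} f_0 = 0)
X3 = sample(n)
outside = (X3 < 0) | (X3 > 1)
Z3 = c * np.where(rng.random(n) < 0.5 * (1 + outside / c), 1.0, -1.0)
T_B = Z3.mean()

t1 = 2 * np.sqrt(5) / np.sqrt(n * alpha**2 * gamma)
t2 = np.sqrt(20 / (n * alpha**2 * gamma))
print("D_B=%.4f (t1=%.4f)  T_B=%.4f (t2=%.4f)  reject=%s"
      % (D_B, t1, T_B, t2, (D_B >= t1) or (T_B >= t2)))
```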



4.2 Lower Bound in the Interactive Scenario

In this subsection we complement the study of E_{n,α}(f_0, γ) with a lower bound. This lower bound will turn out to match the upper bound for several f_0, proving the optimality of the test and privacy mechanism proposed in the previous subsection for several f_0. See Sect. 5 for the optimality.

Theorem 4.4 Let α ∈ (0, 1). Assume that β ≤ 1. Recall that z_α = e^{2α} − e^{−2α} and C_0(B) = min{f_0(x) : x ∈ B}. For all compact sets B ⊂ R we get

E_{n,α}(f_0, γ) ≥ C(γ, L, L_0) min{ |B| C_0(B), |B|^{(β+1)/(2β+1)} (nz_α²)^{−β/(2β+1)} }.

If, moreover, B satisfies

|B|^{β/(2β+1)} C_0(B) ≥ C (nz_α²)^{−β/(2β+1)}   (14)

for some C > 0, it holds

E_{n,α}(f_0, γ) ≥ C(γ, L, L_0) |B|^{(β+1)/(2β+1)} (nz_α²)^{−β/(2β+1)}.

The proof is deferred to Appendix B.4. Let us note that the same comment made after Theorem 3.5 also holds in this case. In all examples, we choose the set B_{n,α} as defined in (13) and show that it satisfies condition (14), thus giving minimax optimality of the testing rates.
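As an illustration of the level-set rule (13) (and, with the corresponding exponents, of (6)), the small numerical sketch below computes the end-point a* of B_{n,α} = [0, a*] for an exponential null as in Example 5.3; the bisection routine and the parameter values are assumptions made only for the illustration.

```python
# Level-set choice (13) for f_0(x) = lam * exp(-lam * x): B_{n,alpha} = [0, a*],
# where the decreasing tail mass exp(-lam * a) crosses the increasing right-hand side.
import numpy as np

def a_star(n, alpha, beta=1.0, lam=1.0):
    def rhs(a):
        return (a ** ((beta + 1) / (2 * beta + 1)) * (n * alpha**2) ** (-beta / (2 * beta + 1))
                + (n * alpha**2) ** (-0.5))
    lo, hi = 0.0, 1.0
    while np.exp(-lam * hi) >= rhs(hi):           # bracket the crossing point
        hi *= 2.0
    for _ in range(80):                           # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.exp(-lam * mid) >= rhs(mid) else (lo, mid)
    return lo

for n in (10**3, 10**5, 10**7):
    print("n=%8d   a* = %5.2f   log(n) = %5.2f" % (n, a_star(n, alpha=1.0), np.log(n)))
```

The printed values of a* grow logarithmically with nα², which is the source of the logarithmic factors appearing in the separation rates of Example 5.3.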

5 Examples In this section, we investigate the optimality of our lower and upper bounds for some examples of densities f 0 . For all the examples studied below, our bounds are optimal (up to a constant) in the interactive scenario, and optimal up to a logarithmic factor in the non-interactive scenario. The densities considered in this section are Hölder continuous with exponent β for all β ∈ (0, 1] unless otherwise specified. The results are stated for n large enough and α ∈ (0, 1) such that nα2 → +∞ as n → ∞. They are summarised in Table 1 for β = 1 and compared to the non-private separation rates. The proofs can be found in Appendix C. Example 5.1 Assume that f 0 is the density of the continuous uniform distribution on [a, b] where a and b are two constants satisfying a < b, that is f 0 (x) =

1 I (x ∈ [a, b]). b−a



Table 1 Some examples of separation rates for different choices of densities f_0 and β = 1. The non-private separation rates can be found in [4].

f_0             Non-private rate         Private rate, non-interactive (up to a log factor)   Private rate, interactive
U([a, b])       n^{−2/5}                 (nα²)^{−2/7}                                          (nα²)^{−1/3}
N(0, 1)         n^{−2/5}                 log(nα²)^{3/7} (nα²)^{−2/7}                           log(nα²)^{1/3} (nα²)^{−1/3}
Beta(a, b)      n^{−2/5}                 (nα²)^{−2/7}                                          (nα²)^{−1/3}
Spiky null      n^{−2/5}                 (nα²)^{−2/7}                                          (nα²)^{−1/3}
Cauchy(0, a)    (log n)^{4/5} n^{−2/5}   (nα²)^{−2/13}                                         (nα²)^{−1/5}
Pareto(a, k)    n^{−2k/(2+3k)}           (nα²)^{−2k/(7k+6)}                                    (nα²)^{−k/(3k+2)}
Exp(λ)          n^{−2/5}                 log(nα²)^{6/7} (nα²)^{−2/7}                           log(nα²)^{2/3} (nα²)^{−1/3}

Taking B = [a, b] in Theorems 3.5, 3.4, 4.4 and 4.3 yields the following bounds on the minimax radius   −1 2β 2β 2 NI log C(nα2 ) 4β+3 (nα2 )− 4β+3  En,α ( f 0 , γ)  (nα2 )− 4β+3 , and

β

En,α ( f 0 , γ) (nα2 )− 2β+1

Example 5.2 Assume that f 0 is the density of the Pareto distribution with parameters a > 0 and k > 0, that is ka k f 0 (x) = k+1 I (x ≥ a). x It holds .  /−1 4β+4 2β 2kβ 2kβ · + 2 − − NI log C(nα2 ) 4β+3 k(4β+3)+3β+3 4β+3 (nα2 ) k(4β+3)+3β+3  En,α ( f 0 , γ)  (nα2 ) k(4β+3)+3β+3 ,

and



En,α ( f 0 , γ) (nα2 )− k(2β+1)+β+1 .

Example 5.3 Assume that f 0 is the density of the exponential distribution with parameter λ > 0, that is f 0 (x) = λ exp(−λx)I (x ≥ 0).



It holds 

−1  4β+4 3β+3 2β 2 log(nα2 ) 4β+3 (nα2 )− 4β+3 log C log(nα2 ) 4β+3 (nα2 ) 4β+3 3β+3



NI  En,α ( f 0 , γ)  log(nα2 ) 4β+3 (nα2 )− 4β+3 ,

and

β+1

β

En,α ( f 0 , γ) log(nα2 ) 2β+1 (nα2 )− 2β+1

Example 5.4 Assume that f 0 is the density of the normal distribution with parameters 0 and 1, that is  2 1 x . f 0 (x) = √ exp − 2 2π It holds   −1 4β+4 3β+3 2β 2 log C log(nα2 ) 2(4β+3) (nα2 ) 4β+3 log(nα2 ) 2(4β+3) (nα2 )− 4β+3 3β+3



NI  En,α ( f 0 , γ)  log(nα2 ) 2(4β+3) (nα2 )− 4β+3 ,

and

β+1

β

En,α ( f 0 , γ) log(nα2 ) 2(2β+1) (nα2 )− 2β+1

Example 5.5 Assume that f 0 is the density of the Cauchy distribution with parameters 0 and a > 0, that is a2 1 f 0 (x) = . 2 πa x + a 2 It holds −1   4β+4 2β 2β 2 (nα2 )− 7β+6 log C(nα2 ) 4β+3 · 7β+6 + 4β+3 2β

NI  En,α ( f 0 , γ)  (nα2 )− 7β+6 ,

and

β

En,α ( f 0 , γ) (nα2 )− 3β+2

Example 5.6 Assume that the density f 0 is given by ⎧ if 0 ≤ x ≤ √1L ⎪ 0x ⎨ L√ 0 f 0 (x) = 2 L 0 − L 0 x if √1L ≤ x ≤ √2L 0 0 ⎪ ⎩ 0 otherwise.



It holds −1   2β 2 (nα2 )− 4β+3 log C(nα2 ) 4β+3 2β

NI  En,α ( f 0 , γ)  (nα2 )− 4β+3 ,

and

β

En,α ( f 0 , γ) (nα2 )− 2β+1

Example 5.7 Assume that f 0 is the density of the Beta distribution with parameters a ≥ 1 and b ≥ 1, that is f 0 (x) =

1 x a−1 (1 − x)b−1 I (0 < x < 1), B(a, b)

(15)

where B(·, ·) is the Beta function. It holds   −1 2β 2 log C(nα2 ) 4β+3 (nα2 )− 4β+3 and



NI  En,α ( f 0 , γ)  (nα2 )− 4β+3 ,

β

En,α ( f 0 , γ) (nα2 )− 2β+1 .

Note that the density f 0 given by (15) can be defined for all a > 0 and b > 0. However, f 0 is Hölder continuous for no exponent β ∈ (0, 1] if a < 1 or b < 1. Note also that if a = 1 and b = 1 then f 0 is the density of the continuous uniform distribution on [0, 1], and this case has already been tackled in Example 5.1. Now, if a = 1 and b > 1 (respectively a > 1 and b = 1), one can check that f 0 is Hölder continuous with exponent β for all β ∈ (0, min{b − 1, 1}] (respectively β ∈ (0, min{a − 1, 1}]). Finally, if a > 1 and b > 1 then f 0 is is Hölder continuous with exponent β for all β ∈ (0, min{a − 1, b − 1, 1}]. Example 5.8 Assume that the density f 0 is given by f 0 (x) =

A log(2) A I (x ≥ 0), (x + 2) log A+1 (x + 2)

for some A > 0 which can be arbitrarily small but fixed. It holds .

 4β+4 /−1 3β+3 3β+3 2 + -−1 4β+3 − 2β − 2β NI log(a∗ ) a∗ (nα2 ) 4β+3  En,α ( f 0 , γ)  a∗4β+3 (nα2 ) 4β+3 , log Ca∗4β+3 (nα2 ) 4β+3

where  a∗ = sup a ≥ 0 :

 3β+3 2β (log 2) A 1 4β+3 (nα2 )− 4β+3 + √ . ≥ a log A (2 + a) nα2


77 3β+3

It is easy to see that a∗ > 1 is up to some log factors a polynomial of nα2 : a∗4β+3 2β (nα2 ) 4β+3 / log A (2 + a∗ ) and therefore 3β+3



a∗4β+3 (nα2 )− 4β+3

1 . log (nα2 ) A

In the interactive case +

log(b∗ )

-−1

β+1

β

β+1

β

b∗2β+1 (nα2 )− 2β+1  En,α ( f 0 , γ)  b∗2β+1 (nα2 )− 2β+1 ,

where  b∗ = sup b ≥ 0 :

 β+1 β (log 2) A 1 2β+1 (nα2 )− 2β+1 + √ . ≥ b log A (2 + b) nα2

Similarly to the non-interactive case, b∗ is up to log factors a polynomial of nα2 and therefore β+1 β 1 b∗2β+1 (nα2 )− 2β+1 . log A (nα2 )

A Proofs of Sect. 3 A.1 Proof of Proposition 3.2 Let i ∈ 1, n. Set σ := 2ψ∞ /(αh). The conditional density of Z i given X i = y can be written as q

Z i |X i =y

  N  |z j − ψh (x j − y)| 1 exp − . (z) = 2σ σ j=1

Thus, by the reverse and the ordinary triangle inequality,   N  |z j − ψh (x j − y )| − |z j − ψh (x j − y)| q Z i |X i =y (z) = exp q Z i |X i =y (z) σ j=1 N 



|ψh (x j − y ) − ψh (x j − y)| ≤ exp σ j=1





⎞  6 N 6  

6 6 x x − y − y 1 j j 6⎠ 6ψ ≤ exp ⎝ −ψ 6 σh j=1 6 h h ⎛ ⎞ 6  6/ N .6  66

6 6 6 x x − y − y 1 j j 6ψ 6 + 6ψ 6 ⎠ ≤ exp ⎝ 6 6 6 σh j=1 6 h h   2ψ∞ ≤ exp σh ≤ exp(α), ⎛

where the second to last inequality follows from the fact that for a fixed y the quantity ψ((x j − y)/ h) is non-zero for at most one coefficient j ∈ 1, N . This is a consequence of Assumption 3.1. This proves that Z i is an α-locally differentially private view of X i for all i ∈ 1, n. Consider now i ∈ n + 1, 2n. For all j ∈ 1, N  it holds / B) 2eα P (Z i = cα | X i ∈ 1 =1+

. = α cα e +1 P Z i = cα | X i ∈ B j Since 2 ≤ eα + 1 ≤ 2eα , we obtain e−α ≤ 1 ≤

/ B) P (Z i = cα | X i ∈ ≤ eα .

P Z i = cα | X i ∈ B j

It also holds P (Z i = −cα | X i ∈ / B) 2 1 =1−

∈ [e−α , eα ]. = α cα e +1 P Z i = −cα | X i ∈ B j Now, for all ( j, k) ∈ 1, N 2 it holds P (Z i = cα | X i ∈ Bk ) P (Z i = −cα | X i ∈ Bk ) =

= 1 ∈ [e−α , eα ].

P Z i = cα | X i ∈ B j P Z i = −cα | X i ∈ B j This proves that Z i is an α-locally differentially private view of X i for all i ∈ n + 1, 2n.

A.2 Proof of Theorem 3.4 Proof of Proposition 3.3 1. Equality (4) follows from the independence of Z i and Z k for i = k and from E[Z i j ] = ψh ∗ f (x j ). We now prove (5). Set ah, j := ψh ∗ f (x j ) and let us define

Goodness-of-Fit Testing for Hölder Continuous Densities …

1B = U

79





1 Z i j − ah, j Z k j − ah, j , n(n − 1) i=k j=1 N

N n



1B = 2 ah, j − f 0 (x j ) Z i j − ah, j , V n i=1 j=1

and observe that we have 1B + V 1B + SB = U

N

(ah, j − f 0 (x j ))2 .

j=1

1B , V 1B ) = 0. We thus have Note that Cov(U 1B ) + Var(V 1B ), Var(S B ) = Var(U 1B ) separately. We begin with 1B ) and Var(V and we will bound from above Var(U 1B is centered, it holds 1B ). Since V Var(V 1B2 ] 1B ) = E[V Var(V =

n N N n

4



ah, j − f 0 (x j ) ah,k − f 0 (xk ) n 2 i=1 j=1 t=1 k=1

+

E Z i j − ah, j Z tk − ah,k .

Note that if t = i, the independence of Z i and Z t yields E

+

Z i j − ah, j



Z tk − ah,k

-

= 0.

Moreover, since the Wi j , j = 1, . . . , N are independent of X i and E[Wi j ] = 0 we have E

+

Z i j − ah, j



Z ik − ah,k

-

 

2ψ∞ = E[ ψh x j − X i + Wi j − ah, j (ψh (xk − X i ) αh  2ψ∞ + Wik − ah,k ] αh +

+

= E ψh x j − X i ψh (xk − X i ) − ah,k E ψh x j − X i 4ψ2 + + 2 2∞ E Wi j Wik − ah, j E [ψh (xk − X i )] + ah, j ah,k α h / . 2



8ψ2 = f (y)dy + 2 2∞ I ( j = k) − ah, j ah,k , ψh x j − y α h



where the last equality is a consequence of Assumption 3.1. We thus obtain . / N 2 2

4

8ψ2∞ 1 Var(VB ) = ah, j − f 0 (x j ) (ψh x j − y ) f (y)dy + 2 2 n j=1 α h −

N N

4

ah, j − f 0 (x j ) ah,k − f 0 (xk ) ah, j ah,k n j=1 k=1

. / N

2 2 4

8ψ2∞ ah, j − f 0 (x j ) = (ψh x j − y ) f (y)dy + 2 2 n j=1 α h ⎞2 ⎛ N 4

− ⎝ ah, j − f 0 (x j ) ah, j ⎠ n j=1

. / N 2 2

4

8ψ2∞ ≤ ah, j − f 0 (x j ) (ψh x j − y ) f (y)dy + 2 2 . n j=1 α h 

Now, (ψh x j − y )2 f (y)dy ≤ ψh 2∞ ≤ ψ2∞ / h 2 ≤ ψ2∞ /(α2 h 2 ) if α ∈ (0, 1]. We finally obtain

1B ) ≤ Var(V

N 2 36ψ2∞

ah, j − f 0 (x j ) . 2 2 nα h j=1

1B ). One can rewrite U 1B as We now bound from above Var(U 1B = U where h(Z i , Z k ) =

1 h(Z i , Z k ), n(n − 1) i=k N



Z i j − ah, j



Z k j − ah, j .

j=1

Using a result for the variance of a U -statistic (see for instance Lemma A, p. 183 in [22]), we have   n 1B ) = 2(n − 2)ζ1 + ζ2 , Var(U 2 where ζ1 = Var (E [h(Z 1 , Z 2 ) | Z 1 ]) , and ζ2 = Var (h(Z 1 , Z 2 )) .



We have ζ1 = 0 since E [h(Z 1 , Z 2 ) | Z 1 ] = 0 and thus 1B ) = Var(U

2 Var (h(Z 1 , Z 2 )) . n(n − 1)

Write h(Z 1 , Z 2 ) =

  N 

2ψ∞

2ψ∞ ψh x j − X 1 + ψh x j − X 2 + W1 j − ah, j W2 j − ah, j αh αh j=1

=

N N





4ψ2 ψh x j − X 1 − ah, j ψh x j − X 2 − ah, j + 2 2∞ W1 j W2 j α h j=1

+

+

j=1

2ψ∞ αh

N

W1 j (ψh (x j − X 2 ) − ah, j )

j=1

N 2ψ∞ W2 j (ψh (x j − X 1 ) − ah, j ) αh j=1

=: T˜1 + T˜2 + T˜3 + T˜4 .

4  We thus have Var(h(Z 1 , Z 2 )) = i=1 Var(T˜i ) + 2 i< j Cov(T˜i , T˜ j ). Observe that Cov(T˜i , T˜ j ) = 0 for i < j and Var(T˜3 ) = Var(T˜4 ). We thus have Var(h(Z 1 , Z 2 )) = Var(T˜1 ) + Var(T˜2 ) + 2Var(T˜3 ). The independence of the random variables (Wi j )i, j yields Var(T˜2 ) =

64ψ4∞ N . α4 h 4

The independence of the random variables (Wi j )i, j and their independence with X 2 yield   Var(T˜3 ) = E T˜32

⎡ ⎤ N N

4ψ2∞ ⎣ E W1 j (ψh (x j − X 2 ) − ah, j ) W1k (ψh (xk − X 2 ) − ah,k )⎦ = α2 h 2 j=1

=

4ψ2∞ α2 h 2

N N

k=1

- + + E W1 j W1k E (ψh (x j − X 2 ) − ah, j )(ψh (xk − X 2 ) − ah,k )

j=1 k=1

=

N 8ψ2∞ + E (ψh (x j − X 2 ) − ah, j )2 α2 h 2



8ψ2∞ α2 h 2

j=1

N

j=1

+ E (ψh (x j − X 2 ))2 .



Now, since y → ψh (x j − y) is null outside B j (consequence of Assumption 3.1), it holds N

N + - E (ψh (x j − X 2 ))2 =

j=1



j=1

Bj

N

2

ψh (x j − y) f (y)d y ≤ ψh 2∞ j=1

and thus Var(T˜3 ) ≤

Bj

f ≤ ψh 2∞ ,

8ψ4∞ . α2 h 4

By independence of X 1 and X 2 , it holds E[T˜1 ] = 0, and   Var(T˜1 ) = E T˜12 N N



+

= E ψh x j − X 1 − ah, j ψh x j − X 2 − ah, j j=1 k=1





ψh (xk − X 1 ) − ah,k ψh (xk − X 2 ) − ah,k

=

N N

+



E ψh x j − X 1 − ah, j ψh (xk − X 1 ) − ah,k j=1 k=1



+

E ψh x j − X 2 − ah, j ψh (xk − X 2 ) − ah,k /2 N N .

= ψh (x j − y)ψh (xk − y) f (y)dy − ah, j ah,k j=1 k=1

=

N N 

2 ψh (x j − y)ψh (xk − y) f (y)dy

j=1 k=1

−2

N N

ψh (x j − y)ψh (xk − y) f (y)dy

ah, j ah,k

j=1 k=1

+

N N

2 2 ah, j ah,k .

j=1 k=1



Assumption 3.1 yields ψh (x j − y)ψh (xk − y) f (y)dy = 0 if j = k. We thus obtain Var(T˜1 ) =

N 

2 (ψh (x j − y)) f (y)dy 2

j=1

−2

N

j=1

2 ah, j

⎛ ⎞2 N

2

2 ⎠ ah, . ψh (x j − y) f (y)dy + ⎝ j j=1



Now, since y → ψh (x j − y) is null outside B j (consequence of Assumption 3.1), observe that 2

N 

(ψh (x j − y))2 f (y)dy



j=1

 2 N ψ4∞ f h4 Bj j=1



ψ4∞ h4

N

j=1

f ≤ Bj

ψ4∞ , h4

and ⎞2 ⎛ ⎞ ⎛ 2 2 N N 

2 ⎠ ⎝ ah, =⎝ ψh (x j − y) f (y)dy ⎠ j j=1

j=1

⎡ 2 ⎤2  N 4 ψ4∞ ⎣ ⎦ ≤ ψ∞ , ≤ f h4 h4 Bj j=1

ψ yielding Var(T˜1 ) ≤ 2 h 4∞ . We thus have 4

1B ) ≤ Var(U

. / ψ4∞ 164ψ4∞ N 2 64ψ4∞ N 16ψ4∞ 2 4 + ≤ + . n(n − 1) h α4 h 4 α2 h 4 n(n − 1)α4 h 4

Finally, Var(S B ) ≤

N 2 36ψ2∞

164ψ4∞ N a − f (x ) + . h, j 0 j nα2 h 2 j=1 n(n − 1)α4 h 4

2. For all i ∈ n + 1, 2n it holds / B] P (X i ∈ / B) + E Q nf [Z i ] = E [Z i | X i ∈

N

-

+ E Zi | Xi ∈ B j P Xi ∈ B j j=1

 /   . 1 1 1 1 − cα · P (X i ∈ 1+ 1− = cα · / B) 2 cα 2 cα / N .

1 1 cα · − cα · P Xi ∈ B j + 2 2 j=1 = P (X i ∈ / B) . This yields E Q nf [TB ] = 1, . . . , 2n we obtain



B(

f − f 0 ), and using the independence of the Z i , i = n +



 cα2

2 

 −

f

.

B

i=n+1

 We can now prove Theorem 3.4. We first prove that the choice of t1 and t2 in (3) gives P Q nf ( = 1) ≤ γ/2. Since E Q nf [TB ] = 0, Chebyshev’s inequality and Propo0 0 sition 3.3 yield for α ∈ (0, 1] P Q nf (TB ≥ t2 ) ≤ P Q nf (|TB | ≥ t2 ) ≤ 0

Var Q nf (TB ) 0

t22

0



cα2 5 γ ≤ = . 2 2 2 4 nt2 nα t2

2 

If t1 > E Q nf [S B ] = Nj=1 [ψh ∗ f 0 ](x j ) − f 0 (x j ) , then Chebychev’s inequality 0 and Proposition 3.3 yield P Q nf (S B ≥ t1 ) ≤ P Q nf (|S B − E Q nf [S B ]| ≥ t1 − E Q nf [S B ]) 0

0



0

0

Var Q nf (S B ) 0

(t1 − E Q nf [S B ])2 0 2 36ψ2∞  N

j=1 [ψh ∗ f 0 ](x j ) − f 0 (x j ) nα2 h 2 ≤ 2 2 

t1 − Nj=1 [ψh ∗ f 0 ](x j ) − f 0 (x j ) +

t1 −

N

j=1

164ψ4∞ N n(n−1)α4 h 4

[ψh ∗ f 0 ](x j ) − f 0 (x j )

2 2 .

Observe that t1 ≥

N

2

[ψh ∗ f 0 ](x j ) − f 0 (x j ) j=1

⎧% ⎫  N ⎨& 4 N ⎬ & 288ψ2∞

2 1312ψ ∞ [ψh ∗ f 0 ](x j ) − f 0 (x j ) , + max ' . ⎩ γnα2 h 2 j=1 γn(n − 1)α4 h 4 ⎭

6 6 Indeed for f ∈ H (β, L) with β ≤ 1 it holds 6[ψh ∗ f ](x j ) − f (x j )6 ≤ LCβ h β for 1 all j ∈ 1, N  where Cβ = −1 |u|β |ψ(u)|du, and thus using ab ≤ a 2 /2 + b2 /2 we obtain



N

2 [ψh ∗ f 0 ](x j ) − f 0 (x j ) j=1

⎧% ⎫  N ⎨& 4 N ⎬ & 288ψ2∞

2 1312ψ ∞ [ψh ∗ f 0 ](x j ) − f 0 (x j ) , + max ' ⎩ γnα2 h 2 j=1 γn(n − 1)α4 h 4 ⎭ ⎧ ⎫  ⎨1 2 4 N ⎬ 1312ψ 144ψ ∞ ∞ L 2 C 2 N h 2β + ≤ L 20 Cβ2 N h 2β + max , ⎩2 0 β γnα2 h 2 γn(n − 1)α4 h 4 ⎭ √ 3 2 2 2β 144ψ2∞ 52ψ2∞ N ≤ L 0 Cβ N h + + √ 2 γnα2 h 2 γnα2 h 2 √ 3 196ψ2∞ N ≤ L 20 Cβ2 N h 2β + = t1 . 2 γnα2 h 2

Then it holds N

2 [ψh ∗ f 0 ](x j ) − f 0 (x j ) P Q nf (S B ≥ t1 ) ≤ 2 

0 (t1 − Nj=1 [ψh ∗ f 0 ](x j ) − f 0 (x j ) )2 36ψ2∞ nα2 h 2

j=1

164ψ4∞ N n(n−1)α4 h 4

+

2 

(t1 − Nj=1 [ψh ∗ f 0 ](x j ) − f 0 (x j ) )2 γ γ γ ≤ + ≤ , 8 8 4 and thus P Q nf ( = 1) ≤ P Q nf (TB ≥ t2 ) + P Q nf (S B ≥ t1 ) ≤ 0

0

0

γ . 2

We now exhibit ρ1 , ρ2 > 0 such that 

| f − f 0 | ≥ ρ1 ⇒ P Q nf (S B < t1 ) ≤ γ/2 n B¯ | f − f 0 | ≥ ρ2 ⇒ P Q f (TB < t2 ) ≤ γ/2.

B

In this case, for all f ∈ H (β, L) satisfying  f − f 0 1 ≥ ρ1 + ρ2 it holds P Q nf ( = 1) + P Q nf ( = 0) ≤ 0

γ  γ γ + min P Q nf (S B < t1 ), P Q nf (TB < t2 ) ≤ + = γ, 2 2 2

   since  B | f − f 0 | + B¯ | f − f 0 | =  f − f 0 1 ≥ ρ1 + ρ2 implies B | f − f 0 | ≥ ρ1 or B¯ | f − f 0 | ≥ ρ2 . Consequently, ρ1 + ρ2 will provide an upper bound on NI ( f 0 , γ). En,α



If



B(

f − f 0 ) = E Q nf [TB ] > t2 then Chebychev’s inequality yields   P Q nf (TB < t2 ) = P Q nf E Q nf [TB ] − TB > E Q nf [TB ] − t2 6 6  6 6 ≤ P Q nf 6E Q nf [TB ] − TB 6 > E Q nf [TB ] − t2 Var Q nf (TB ) ≤ 2 E Q nf [TB ] − t2 ≤

Now, observe that

n



cα2 B(

( f − f0 ) ≥

Thus, setting ρ2 = 2 B¯



| f − f0 | − 2



f0 .

 1 f 0 + 1 + √ t2 , 2 



2 .



we obtain that

f − f 0 ) − t2



| f − f 0 | ≥ ρ2 implies P Q nf (TB < t2 ) ≤

2cα2 10 γ ≤ = . 2 2 2 2 nt2 nα t2

 We now exhibit ρ1 such that B | f − f 0 | ≥ ρ1 implies P Q nf (S B < t1 ) ≤ γ/2. First note that if the following relation holds N

6 6 6[ψh ∗ f ](x j ) − f 0 (x j )62 ≥ t1 + n E Q f [S B ] =



2Var Q nf (S B ) γ

j=1

,

then Chebychev’s inequality yields ⎛ P Q nf (S B < t1 ) ≤ P Q nf ⎝ S B ≤ E Q nf [S B ] −

Using



2Var Q nf (S B ) γ

⎞ ⎠ ≤ γ. 2

√ √ √ a + b ≤ a + b for all a, b > 0 and ab ≤ a 2 /2 + b2 /2 we have

(16)




2Var Q nf (S B ) γ

Thus, if

87

% & N & 72ψ2∞

2 328ψ4∞ N [ψh ∗ f ](x j ) − f 0 (x j ) + ≤' 2 2 γnα h j=1 γn(n − 1)α4 h 4 %  & N & 72ψ2∞

2 656ψ4∞ N [ψh ∗ f ](x j ) − f 0 (x j ) + ≤' 2 2 γnα h j=1 γn 2 α4 h 4 √ N 2 36ψ2∞ 1

26ψ2∞ N [ψh ∗ f ](x j ) − f 0 (x j ) + ≤ + √ 2 j=1 γnα2 h 2 γnα2 h 2 √ N 2 62ψ2∞ N 1

[ψh ∗ f ](x j ) − f 0 (x j ) + ≤ . 2 j=1 γnα2 h 2

 √  N 2

62 6 N 62ψ ∞ 6[ψh ∗ f ](x j ) − f 0 (x j )6 ≥ 2 t1 + 2h2 γnα j=1

(17)

 6 then (16) holds and we have P Q nf (S B < t1 ) ≤ γ/2. We now link Nj=1 6[ψh ∗ f ](x j ) 62  − f 0 (x j )6 to | f − f 0 |. According to Cauchy-Schwarz inequality we have B

⎞2 N N

6 6 6 6 6[ψh ∗ f ](x j ) − f 0 (x j )6⎠ ≤ N 6[ψh ∗ f ](x j ) − f 0 (x j )62 . ⎝ ⎛

j=1

j=1

We also have 6 6 6 6 N

6 6 6 | f − f0 | − 6 2h|ψ ∗ f (x ) − f (x )| h j 0 j 6 6 6 B 6 j=1 6 6 6 6 N

6 6 N | f − f0 | − 2h|ψh ∗ f (x j ) − f 0 (x j )|66 = 66 6 6 j=1 B j j=1 6 6 6 6 6 N

6 | f (x) − f 0 (x)| − |ψh ∗ f (x j ) − f 0 (x j )| dx 66 = 66 6 6 j=1 B j ≤

N

j=1



6 6 6 f (x) − f 0 (x) − ψh ∗ f (x j ) + f 0 (x j )6 dx Bj

N

j=1

.



| f (x) − f (x j )| + | f (x j ) − ψh ∗ f (x j )| + | f 0 (x j ) − f 0 (x)| dx Bj

≤ 1 + Cβ +

/ L0 Lh β |B|. L



We thus have
$$
\sum_{j=1}^N\big|[\psi_h*f](x_j)-f_0(x_j)\big|^2\ \ge\ \frac1{4Nh^2}\bigg(\int_B|f-f_0|-\Big(1+C_\beta+\frac{L_0}{L}\Big)Lh^\beta|B|\bigg)^2.
$$
Thus, if
$$
\int_B|f-f_0|\ \ge\ \Big(1+C_\beta+\frac{L_0}{L}\Big)Lh^\beta|B|+2h\sqrt N\,\sqrt{2t_1+\frac{124\|\psi\|_\infty^2\sqrt N}{\gamma n\alpha^2h^2}}\ =:\ \rho_1,
$$
then (17) holds and we have $P_{Q_f^n}(S_B<t_1)\le\gamma/2$. Consequently,
$$
\mathcal E^{\mathrm{NI}}_{n,\alpha}(f_0,\gamma)\ \le\ \rho_1+\rho_2
\ \le\ \Big(1+C_\beta+\frac{L_0}{L}\Big)Lh^\beta|B|+2h\sqrt N\,\sqrt{2t_1+\frac{124\|\psi\|_\infty^2\sqrt N}{\gamma n\alpha^2h^2}}+2\int_{\bar B}f_0+\Big(1+\frac1{\sqrt2}\Big)t_2
$$
$$
\le\ C(L,L_0,\beta,\gamma,\psi)\bigg(h^\beta|B|+Nh^{\beta+1}+\frac{N^{3/4}}{\sqrt{n\alpha^2}}+\int_{\bar B}f_0+\frac1{\sqrt{n\alpha^2}}\bigg)
\ \le\ C(L,L_0,\beta,\gamma,\psi)\bigg(h^\beta|B|+\frac{|B|^{3/4}}{h^{3/4}\sqrt{n\alpha^2}}+\int_{\bar B}f_0+\frac1{\sqrt{n\alpha^2}}\bigg),
$$
where we have used $\sqrt{a+b}\le\sqrt a+\sqrt b$ for $a,b>0$ to obtain the second to last inequality. Taking $h\asymp|B|^{-1/(4\beta+3)}(n\alpha^2)^{-2/(4\beta+3)}$ yields
$$
\mathcal E^{\mathrm{NI}}_{n,\alpha}(f_0,\gamma)\ \le\ C(L,L_0,\beta,\gamma,\psi)\bigg(|B|^{\frac{3\beta+3}{4\beta+3}}(n\alpha^2)^{-\frac{2\beta}{4\beta+3}}+\int_{\bar B}f_0+\frac1{\sqrt{n\alpha^2}}\bigg).
$$

A.3 Proof of Lemma 3.7

For $j=1,\dots,N$, write
$$
v_j=\sum_{k=1}^N a_{kj}\psi_k.
$$
Note that since $(\psi_1,\dots,\psi_N)$ and $(v_1,\dots,v_N)$ are two orthonormal bases of $W_N$, the matrix $(a_{kj})_{kj}$ is orthogonal. We can write
$$
f_\nu(x)=f_0(x)+\delta\sum_{j=1}^N\sum_{k=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\,\psi_k(x),\qquad x\in\mathbb R.
$$


Define
$$
A_b=\Bigg\{\nu\in\{-1,1\}^N\ :\ \Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\le\frac1{\sqrt h}\sqrt{\log\frac{2N}{b}}\ \text{ for all }1\le k\le N\Bigg\}.
$$
The union bound and Hoeffding's inequality yield
$$
P_\nu(A_b^c)\ \le\ \sum_{k=1}^N P\Bigg(\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|>\frac1{\sqrt h}\sqrt{\log\frac{2N}{b}}\Bigg)
\ \le\ 2\sum_{k=1}^N\exp\Bigg(-\frac{2\log\frac{2N}{b}}{h\sum_{j=1}^N\frac{4a_{kj}^2}{\tilde\lambda_j^2}}\Bigg)\ \le\ b,
$$
where the last inequality follows from $\tilde\lambda_j^2\ge2h$ for all $j$ and $\sum_{j=1}^Na_{kj}^2=1$. We thus have $P_\nu(A_b)\ge1-b$.

We now prove i). Since $\int\psi_k=0$ for all $k=1,\dots,N$, it holds $\int f_\nu=\int f_0=1$ for all $\nu$. Since $\mathrm{Supp}(\psi_k)=B_k$ for all $k=1,\dots,N$, it holds $f_\nu\equiv f_0$ on $B^c$ and thus $f_\nu$ is non-negative on $B^c$. Now, for $x\in B_k$ it holds
$$
f_\nu(x)=f_0(x)+\delta\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\psi_k(x)\ \ge\ C_0(B)-\frac{\delta\|\psi\|_\infty}{\sqrt h}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|.
$$
Moreover, for any $\nu\in A_b$, we have
$$
\frac{\delta\|\psi\|_\infty}{\sqrt h}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\ \le\ \frac{\delta\|\psi\|_\infty}{h}\sqrt{\log\frac{2N}{b}}\ \le\ C_0(B),
$$
since $\delta$ is assumed to satisfy $\delta\le\frac{h}{\sqrt{\log(2N/b)}}\min\Big\{\frac{C_0(B)}{\|\psi\|_\infty},\ \frac12\big(1-\frac{L_0}{L}\big)h^\beta\Big\}$. Thus, $f_\nu$ is non-negative on $\mathbb R$ for all $\nu\in A_b$.

To prove ii), we have to show that $|f_\nu(x)-f_\nu(y)|\le L|x-y|^\beta$ for all $\nu\in A_b$ and all $x,y\in\mathbb R$. Since $f_\nu\equiv f_0$ on $B^c$ and $f_0\in H(\beta,L_0)$, this result is trivial for $x,y\in B^c$. If $x\in B_l$ and $y\in B_k$ it holds
$$
|f_\nu(x)-f_\nu(y)|\ \le\ |f_0(x)-f_0(y)|+\Bigg|\delta\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\psi_l(x)-\delta\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\psi_k(y)\Bigg|
$$
$$
\le\ L_0|x-y|^\beta+\Bigg|\delta\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\psi_l(x)-\delta\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\psi_l(y)\Bigg|
+\Bigg|\delta\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\psi_k(x)-\delta\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\psi_k(y)\Bigg|
$$
$$
\le\ L_0|x-y|^\beta+\frac{\delta}{\sqrt h}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\Bigg|\Big|\psi\Big(\tfrac{x-x_l}{h}\Big)-\psi\Big(\tfrac{y-x_l}{h}\Big)\Big|
+\frac{\delta}{\sqrt h}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\Big|\psi\Big(\tfrac{x-x_k}{h}\Big)-\psi\Big(\tfrac{y-x_k}{h}\Big)\Big|
$$
$$
\le\ L_0|x-y|^\beta+\frac{\delta}{h^{\beta+1/2}}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\Bigg|\cdot L|x-y|^\beta
+\frac{\delta}{h^{\beta+1/2}}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\cdot L|x-y|^\beta
$$
$$
=\ \Bigg(\frac{L_0}{L}+\frac{\delta}{h^{\beta+1/2}}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{lj}}{\tilde\lambda_j}\Bigg|+\frac{\delta}{h^{\beta+1/2}}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\Bigg)L|x-y|^\beta,
$$
where we have used $\psi\in H(\beta,L)$. Observe that for all $k=1,\dots,N$ and for all $\nu\in A_b$ it holds
$$
\frac{\delta}{h^{\beta+1/2}}\Bigg|\sum_{j=1}^N\frac{\nu_ja_{kj}}{\tilde\lambda_j}\Bigg|\ \le\ \frac{\delta}{h^{\beta+1}}\sqrt{\log\frac{2N}{b}}\ \le\ \frac12\Big(1-\frac{L_0}{L}\Big),
$$
since $\delta$ is assumed to satisfy $\delta\le\frac{h}{\sqrt{\log(2N/b)}}\min\Big\{\frac{C_0(B)}{\|\psi\|_\infty},\ \frac12\big(1-\frac{L_0}{L}\big)h^\beta\Big\}$. Thus, it

holds | f ν (x) − f ν (y)| ≤ L|x − y|β for all ν ∈ Ab , x ∈ Bl and y ∈ Bk . The case x ∈ B c and y ∈ Bk can be handled in a similar way, which ends the proof of ii). We now prove iii). It holds 6 6 6 6 6 6 6 N 6

6 6 6 N νj 6 N νj 6 dx = δ 6 dx 6δ 6 | fν − f0 | = v (x) v (x) j j 6 6 6 6 ˜ R R 6 j=1 λ˜ j 6 6 k=1 Bk 6 j=1 λ j 6 6 6 N 6

6 6 N ν j ak j 6 =δ ψk (x)66 dx 6 ˜ 6 k=1 Bk 6 j=1 λ j 6 6 6 N 6

6 N ν j ak j 6 6 6 |ψ (x)| dx =δ 6 ˜j 66 Bk k λ 6 k=1 j=1 6 6 6 N 6 √ 6 N ν j ak j 6 6 6, = C1 δ h 6 6 λ˜j 6 6



k=1

where C1 =

1 −1

j=1

|ψ|. For all ν ∈ Ab it thus holds


6 62 6 N 6 N

6 ν j ak j 66 δh 6 | f ν − f 0 | ≥ C1  . 6 ˜ 6 R log 2N k=1 6 j=1 λ j 6 b

Moreover, 6 62 ⎛ ⎞  2 6 N 6 N N



ν j ak j νl akl 6 N ν j ak j 6 ν a j k j 6 6 = ⎝ ⎠ + 6 ˜j 66 ˜j ˜j λ˜l λ λ λ 6 k=1 j=1 k=1 j=1 j=l =

N N N

ν j νl

1 2 ak j + ak j akl λ˜ 2 λ˜ j λ˜ l j k=1

j=1

=

N

j=1

j=l

k=1

1 , λ˜ 2j

since the matrix (ak j )k, j is orthogonal. Thus, for all ν ∈ Ab it holds

1 δh .

2N ˜2 log b j=1 λ j N

 f ν − f 0 1 ≥ C1 

Set J = { j ∈ 1, N  : z α−1 λ j ≥



2h}, we have for all ν ∈ Ab

  N

√ √ 1 δh z α2 −1 −1 I (z α λ j < 2h) + 2 I (z α λ j ≥ 2h)  f ν − f 0 1 ≥ C1  2h λj log 2N j=1 b ⎞ ⎛

z2 1 δh α⎠ = C1  ⎝ (N − |J |) + 2 2h λ 2N j log j∈J b





⎞−1 ⎞

|J | δh 2 2⎝ ⎝N λ2j ⎠ ⎠

2N 2h − 2h + z α |J | log b j∈J ⎛ ⎞−1 ⎞ ⎛  2

|J | δN |J | + = C1  ⎝1 − |B|z α2 ⎝ λ2j ⎠ ⎠ , N N 2N 2 log b j∈J ≥ C1 

where the second to last inequality follows from the inequality between harmonic and arithmetic means. Now,


j∈J

A. Dubois et al. λ2j ≤

N

λ2j =

j=1

N

K v j , v j 

j=1

=

: ;  N n 

qi (z i | y)qi (z i | ·)1 B (y)1 B (·) 1 dμi (z i ) v j (y)dy, v j n g0,i (z i ) R Zi j=1

=

=

=

1 n 1 n 1 n

i=1

n

i=1

Zi j=1

n

i=1

2 N 

qi (z i | x)1 B (x) v j (x)dx g0,i (z i )dμi (z i ) g0,i (z i ) R

Zi j=1

n

i=1

 N 

qi (z i | y)qi (z i | x)1 B (y)1 B (x) v j (x)v j (y)dxdy dμi (z i ) g0,i (z i ) R R

 2 N  

qi (z i | x) − e−2α 1 B (x)v j (x)dx g0,i (z i )dμi (z i ), g0,i (z i ) R

Zi j=1

 since 1 B (x)v j (x)d x = 0. Recall that qi satisfies e−α ≤ qi (z i | x) ≤ eα for all z i ∈ Zi and all x ∈ R. This implies e−α ≤ g0,i (z i ) ≤ eα , and therefore 0 ≤ f i,zi (x) := qi (z i |x) − e−2α ≤ z α . Writing f i,zi ,B = 1 B · f i,zi , we have g0,i (z i )  2 N   N

qi (z i | x) − e−2α 1 B (x)v j (x)dx =  f i,zi ,B , v j 2 g0,i (z i ) R j=1 j=1 α .

P∈H0

P∈Aδ

It follows that
$$
\delta^*\ \ge\ \sqrt{\|\Sigma/n\|_{\mathrm{op}}\,(1-\alpha)}.
$$

3.4 Proof of Theorem 8 This proof is similar to the proof of Theorem 7, so some details will be skipped. As in the one-sample case the upper bound is directly obtained using Theorem 3 and Proposition 6. We just additionally use the following upper bounds:


$$
\frac{\sqrt{\operatorname{Tr}\Sigma^2}}{n}+\frac{\sqrt{\operatorname{Tr}S^2}}{m}\ \le\ \sqrt2\,\sqrt{\frac{\operatorname{Tr}\Sigma^2}{n^2}+\frac{\operatorname{Tr}S^2}{m^2}}\ \le\ \sqrt2\,\sqrt{\operatorname{Tr}\Big(\frac\Sigma n+\frac Sm\Big)^2}\,;
$$
$$
\Big\|\frac\Sigma n\Big\|_{\mathrm{op}}+\Big\|\frac Sm\Big\|_{\mathrm{op}}\ \le\ 2\max\Big(\Big\|\frac\Sigma n\Big\|_{\mathrm{op}},\Big\|\frac Sm\Big\|_{\mathrm{op}}\Big)\ \le\ 2\Big\|\frac\Sigma n+\frac Sm\Big\|_{\mathrm{op}},
$$
where the last inequality holds because $\Sigma$, $S$ are both positive semidefinite. The lower bound in the two-sample case is a direct consequence of the one-sample case, by reduction to the case where one of the two sample means is known, say equal to zero. More specifically, let $\Sigma$ and $S$ be two symmetric positive semidefinite matrices; we consider again the distribution $Q^n_\Sigma$ defined in (37). Then

$$
\int_{\mathbb R^{d\times(n+m)}}\Bigg(\frac{\mathrm d\big(Q^n_\Sigma\otimes P^{\otimes m}_{0,S}\big)}{\mathrm d\big(P^{\otimes n}_{\nu,\Sigma}\otimes P^{\otimes m}_{0,S}\big)}\Bigg)^2\,\mathrm d\big(P^{\otimes n}_{\nu,\Sigma}\otimes P^{\otimes m}_{0,S}\big)
\ =\ \int_{\mathbb R^{d\times n}}\Bigg(\frac{\mathrm dQ^n_\Sigma}{\mathrm dP^{\otimes n}_{\nu,\Sigma}}\Bigg)^2\,\mathrm dP^{\otimes n}_{\nu,\Sigma}.
$$

Then using the previous results of the proof of Theorem 7 we obtain that
$$
\delta^*(\alpha)\ \ge\ \Big(n^{-1}\sqrt{\operatorname{Tr}\Sigma^2-\lambda_d^2}\,s+\eta^2\Big)^{\frac12}-\eta, \qquad(39)
$$
with $s=\sqrt{\frac2e\log\big(1+4(1-\alpha)^2\big)}$. By the same token we obtain that
$$
\delta^*(\alpha)\ \ge\ \Big(m^{-1}\sqrt{\operatorname{Tr}S^2-\ell_d^2}\,s+\eta^2\Big)^{\frac12}-\eta, \qquad(40)
$$
where $\ell_d$ is the smallest eigenvalue of the matrix $S$. Because $d_*\ge3$, it holds
$$
\max\Big(n^{-2}\big(\operatorname{Tr}\Sigma^2-\lambda_d^2\big),\,m^{-2}\big(\operatorname{Tr}S^2-\ell_d^2\big)\Big)\ \ge\ \frac23\max\big(n^{-2}\operatorname{Tr}\Sigma^2,\,m^{-2}\operatorname{Tr}S^2\big)\ \ge\ \frac16\operatorname{Tr}\Big(\frac\Sigma n+\frac Sm\Big)^2,
$$
and by combining (39) and (40), we obtain that
$$
\delta^*(\alpha)\ \ge\ (2\sqrt{12})^{-1}\,\sigma\,\min\Big(\sqrt s\,d_*^{\frac14},\ \frac{\sqrt s\,\sigma\,d_*^{\frac12}}{\eta}\Big),
\qquad\text{where }\sigma=\Big\|\frac\Sigma n+\frac Sm\Big\|_{\mathrm{op}}.
$$
We obtain (19) using again that $s\ge1-\alpha$. The last part of the lower bound is obtained as in the one-sample case, using first the distributions $P^{\otimes n}_{(\eta+\delta)e_1,\Sigma}\otimes P^{\otimes m}_{0,S}$ and $P^{\otimes n}_{\eta e_1,\Sigma}\otimes P^{\otimes m}_{0,S}$, where $e_1$ is still the eigenvector


associated to the largest eigenvalue of $\Sigma$. We obtain that $\delta^*(\alpha)\gtrsim\|\Sigma/n\|_{\mathrm{op}}^{1/2}$. By the same token, we obtain that $\delta^*(\alpha)\gtrsim\|S/m\|_{\mathrm{op}}^{1/2}$, and conclude the proof using that $2\max\big(\|\Sigma/n\|_{\mathrm{op}},\|S/m\|_{\mathrm{op}}\big)\ge\|\Sigma/n+S/m\|_{\mathrm{op}}$.

3.5 Proof of Propositions 10 and 11

We want to obtain a concentration inequality for the estimator $\|\widehat\Sigma\|_{\mathrm{op}}^{1/2}$. To this end, we will first study the following:
$$
\widetilde\Sigma:=\frac1n\sum_{i=1}^n(X_i-\mu)(X_i-\mu)^T, \qquad(41)
$$
where $\mu$ is the true mean of the sample $\mathbf X$. Then we have:
$$
\big\|\widehat\Sigma-\widetilde\Sigma\big\|_{\mathrm{op}}=\big\|{-(\widehat\mu-\mu)(\widehat\mu-\mu)^T}\big\|_{\mathrm{op}}=\|\widehat\mu-\mu\|^2. \qquad(42)
$$

3.5.1 Gaussian Setting

The concentration of $\|\widetilde\Sigma\|_{\mathrm{op}}^{1/2}$ is a consequence of the classical Gaussian Lipschitz concentration property (see e.g. Theorem 3.4 in Massart [29]).

Theorem 20 (Gaussian Lipschitz concentration) Let $X=(x_1,\dots,x_d)$ be a vector of i.i.d. standard Gaussian variables, and $f:\mathbb R^d\to\mathbb R$ be an $L$-Lipschitz function with respect to the Euclidean norm. Then for all $t\ge0$:
$$
\mathbb P\big[f(X)-\mathbb E[f(X)]\ge t\big]\ \le\ e^{-\frac{t^2}{2L^2}}. \qquad(43)
$$
The following corollary is a direct consequence of that theorem (we provide a proof in Sect. 3.7); it will be used to control the term in (42).

Corollary 21 Let $X$ be a random Gaussian vector of distribution $\mathcal N(\mu,\Sigma)$. Then for all $u\ge0$:
$$
\mathbb P\Big[\|X\|\ge\sqrt{\|\mu\|^2+\operatorname{Tr}\Sigma}+\sqrt{2\|\Sigma\|_{\mathrm{op}}u}\Big]\ \le\ e^{-u}. \qquad(44)
$$
We will use the results of Koltchinskii and Lounici [24] giving an upper bound on the expectation of the operator norm of the deviations of $\widetilde\Sigma$ from its expectation. The constants come from the improved version given by van Handel [40].


Theorem 22 ([40]) Let $X=(X_i)_{1\le i\le n}$ be a sample of independent Gaussian vectors of distribution $\mathcal N(0,\Sigma)$; then
$$
\mathbb E\big\|\widetilde\Sigma-\Sigma\big\|_{\mathrm{op}}\ \le\ \|\Sigma\|_{\mathrm{op}}\bigg((2+\sqrt2)\sqrt{\frac{d_e}{n}}+2\,\frac{d_e}{n}\bigg), \qquad(45)
$$
where $d_e=\operatorname{Tr}\Sigma/\|\Sigma\|_{\mathrm{op}}$ and $\widetilde\Sigma$ is defined in equation (41). We can now prove a concentration inequality for $\|\widetilde\Sigma\|_{\mathrm{op}}^{1/2}$.

Proposition 23 Let $X=(X_i)_{1\le i\le n}$ be a sample of independent $\mathcal N(\mu,\Sigma)$ Gaussian vectors; then for $u\ge0$, with probability at least $1-2e^{-u}$:
$$
\Big|\|\widetilde\Sigma\|_{\mathrm{op}}^{1/2}-\|\Sigma\|_{\mathrm{op}}^{1/2}\Big|\ \le\ 2\sqrt{\frac{2\operatorname{Tr}\Sigma}{n}}+\sqrt{\frac{2u\|\Sigma\|_{\mathrm{op}}}{n}}, \qquad(46)
$$
where $\widetilde\Sigma$ is defined in (41).

Remark 24 In (46), the lower and upper bounds have been brought together, but the lower bound is in fact slightly better than the upper bound. This is due to the lower bound of the expectation, where $\operatorname{Tr}\Sigma$ can be replaced by $\|\Sigma\|_{\mathrm{op}}$; see (49) below.

Proof We remark that

 1 Σ - 2 = sup u t Σu op ud =1

1 = sup √ n ud =1

 n 

 21 u, X i − μ

2

i=1

n 1  = sup sup √ u, X i − μ vi n i=1 ud =1 vn =1 dist

n / 1 . 1 sup √ u, Σ 2 gi vi , n i=1 ud =1 vn =1

∼ sup

where (gi )i=1...n are i.i.d. standard Gaussian vectors and  ·  p for p ∈ N is defined as the Euclidean norm in R p . Let u and v be unit vectors in Rd and Rn respectively and f u,v : Rd×n → R: n / 1 . 1 f u,v (y) := √ u, Σ 2 yi vi , n i=1

y ∈ Rd×n .

These functions are Lipschitz: indeed for all z, y ∈ Rd×n we have:


n n 1 / 1 . 1  1 2 u, Σ 2 (yi − z i ) vi ≤ √ f u,v (y) − f u,v (z) = √ Σop yi − z i d |vi | n i=1 n i=1 0 1 1 1 n 2 1 2 Σop 2 Σ op ≤ √ yi − z i 2d = √ y − zd×n . (47) n n i=1

A supremum of Lipschitz functions is Lipschitz, thus we can use the Gaussian Lipschitz concentration (Theorem 20), and get for all x ≥ 0: 

  1   1 - 2 ≥ - 2 − E Σ P Σ op op



2x Σop n

 ≤ e−x ,

(48)

with the same control for lower   deviations.  1/2   /2 - 1op − Σop . For one direction, using Jensen’s It remains to upper bound E Σ and triangle inequalities and inequality (29), we get:    1   1  

2 2   - − Σ  − Σop E Σ op − Σop ≤ Σop + E Σ op  ⎛  ⎞ - − Σ    E Σ op ⎠ - − Σ ,

≤ min ⎝ E Σ op 2 Σop  2 Tr Σ . ≤2 n For the inequality, we have used Theorem 22 for the expectation, then the fact that  last  1/2 √ √ √ √ min a x + bx , (a x + bx)/2 ≤ max( a + b, (a + b)/2) x where a = √ 2 + 2, b = 2 and x = de /n. This is achieved by treating cases x ≤ 1 and x ≥ 1 separately. For the other direction, a reformulation of (48) is that there exists a random variable g ∼ Exp(1) such that:   1   2gΣ  1 op Σ - 2 + - 2 ≤ E Σ . op op n Taking the square then the expectation and then applying Jensen’s inequality to the √ concave function x → (a + b x)2 (a, b ≥ 0), we obtain: Σop

⎡ 2 ⎤      1   2gΣ op ⎦ - ≤ Eg ⎣ E Σ - 2 + ≤ E Σ op op n  2   1   2Σ op 2 - + ≤ E Σ , op n


and thus

    1  1 2Σop 2 Tr Σ 2 2   E Σ op − Σop ≥ − ≥ −2 . n n


(49)

Proof of Proposition 10. It holds    1     1 1 1 1  - 2   2 2  2   − Σ - 2 . Σ −  ≤ Σ  Σ op − Σop op  + Σ op op Then, from (42): 1

2

− Σ - op Σ ≤ μ − μ .

According to Proposition 23 and Corollary 21, we obtain that for u ≥ 0, with probability at least 1 − 3e−u :       1  1 2uΣop 2Σop u 2 Tr Σ Tr Σ   2 2  Σ Σ ≤ 2 + + + . −  op  op n n n n So, for u ≥ 0, with probability at least 1 − 3e−u :     1  1 2uΣop 2 Tr Σ   2 2  Σ +2 .  Σ op − op  ≤ 3 n n
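As a quick numerical illustration of the quantities controlled above (not part of the proof), one can simulate Gaussian data and compare the deviation of the estimated operator norm with the dominating term of the bound. The sketch below is ours, with hypothetical dimensions and covariance; it only serves to make the objects $\|\widehat\Sigma\|_{\mathrm{op}}^{1/2}$, $\|\Sigma\|_{\mathrm{op}}^{1/2}$ and $2\sqrt{2\operatorname{Tr}\Sigma/n}$ concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, reps = 20, 500, 200
Sigma = np.diag(1.0 / np.arange(1, d + 1))       # an illustrative covariance
target = np.sqrt(np.linalg.norm(Sigma, 2))        # ||Sigma||_op^{1/2}

devs = []
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)     # empirical covariance, empirical mean subtracted
    devs.append(abs(np.sqrt(np.linalg.norm(Sigma_hat, 2)) - target))

print("empirical 95% deviation:", np.quantile(devs, 0.95))
print("leading term 2*sqrt(2 Tr(Sigma)/n):", 2 * np.sqrt(2 * np.trace(Sigma) / n))
```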

3.5.2

Bounded Setting

We first recall the following concentration result for bounded random vectors in the formulation of Bousquet [7].

Theorem 25 (Talagrand-Bousquet inequality) Assume $(X_i)_{1\le i\le n}$ are i.i.d. with marginal distribution $P$. Let $\mathcal F$ be a countable set of functions from $\mathcal X$ to $\mathbb R$ and assume that all functions $f$ in $\mathcal F$ are $P$-measurable, square-integrable, bounded by $M$ and satisfy $\mathbb E[f]=0$. Then we denote
$$
Z=\sup_{f\in\mathcal F}\sum_{i=1}^nf(X_i).
$$
Let $\sigma$ be a positive real number such that $\sigma^2\ge\sup_{f\in\mathcal F}\operatorname{Var}[f(X_1)]$. Then for all $u\ge0$, $\varepsilon>0$ we have:
$$
\mathbb P\Big[Z\ge\mathbb E[Z](1+\varepsilon)+\sqrt{2un\sigma^2}+\frac{Mu}{3}(1+\varepsilon^{-1})\Big]\ \le\ e^{-u}.
$$


The following corollary is a direct consequence of Theorem 25. Some refinements of this result in the same vein (including two-sided deviation control in the uncentered case) can be found in Marienwald et al. [28] (Proposition 6.2 and Corollary 6.3).

Corollary 26 Let $X_i$, $i=1,\dots,n$, be i.i.d. random vectors bounded by $L$, with expectation $\mu$ and covariance $\Sigma$, in a separable Hilbert space $\mathcal H$. Then for $u\ge0$, with probability at least $1-e^{-u}$:
$$
\bigg\|\frac1n\sum_{i=1}^nX_i-\mu\bigg\|\ \le\ 2\sqrt{\frac{\operatorname{Tr}\Sigma}{n}}+\sqrt{\frac{2\|\Sigma\|_{\mathrm{op}}u}{n}}+\frac{4Lu}{3n}.
$$
Lemma 27 Let $X_i$, $i=1,\dots,n$, be i.i.d. random vectors bounded by $L$, with expectation $\mu$ and covariance $\Sigma$, in a separable Hilbert space $\mathcal H$. Then
$$
\mathbb E\big\|\widetilde\Sigma-\Sigma\big\|_{\mathrm{op}}\ \le\ \sqrt{\frac{\operatorname{Var}\|X_1-\mu\|^2}{n}}, \qquad(50)
$$
where $\widetilde\Sigma$ is defined in (41).

Remark 28 Using the boundedness of the variables we can upper bound this variance: $\operatorname{Var}\|X_1-\mu\|^2\le4L^2\operatorname{Tr}\Sigma$.

Proposition 29 Let $(X_i)_{1\le i\le n}$ be i.i.d. random vectors in a separable Hilbert space $\mathcal H$, with norm bounded by $L$ and covariance $\Sigma$; then for any $u\ge1$, with probability at least $1-e^{-u}$:

  Σ - − Σ

op

 Var X 1 − μ2 2Σop u 8L 2 u +L + , ≤2 n n 3n

(51)

- is defined in (41). where Σ Proof We denote in this proof Z i := X i − μ for 1 ≤ i ≤ n. Let us first remark that if B1 is the unit ball of H, then:   Σ - − Σ

op

= sup u,v∈B1

n n  1  1 v, Z i Z iT − Σ u =: sup f u,v (X i ) . n i=1 u,v∈B1 n i=1

Since the variables X i have norm bounded by L, it can be assumed equivalently that they take their values in B L = L B1 , and it holds supx∈BL supu,v∈B1 f u,v (x) ≤ 8L 2 . Furthermore, since (u, v) → f u,v (x) is continuous, and the Hilbert space H is separable, the uncountable set B1 can be replaced by a countable dense subset. Thus we can apply Theorem 25, and obtain that with probability at least 1 − e−x :   Σ - − Σ

op

- − Σop + L ≤ 2E Σ



2Σop x 16L 2 x + , n 3n


where we have used for the variance term:   sup E v, Z i Z iT − Σ u 2 ≤ sup E v, Z i 2 Z i , u 2

u,v∈B1

u,v∈B1

≤ 4n L 2 Σop . We conclude using the upper bound of the expectation from Lemma 27. Proof of Proposition 11. As in the Gaussian case, we have:   1    1   1 1 1  - 2   2 2  2 

−Σ - 2 . Σ −  Σ op − Σop  ≤ Σ  + Σ op op op From Lemma 15 and Proposition 29, we have with probability at least 1 − e−u :   1  1  - 2 2   Σ op − Σop  ≤ 4L



Tr Σ + nΣop



16 L 2 u , 3 n

where we have used that: 

Using

√ Var Z 1 2 2L Tr Σ ≤ . √ n n 1  Σ

−Σ - 2 ≤ μ − μ , op

and according to Corollary 26, we obtain that for u ≥ 0, with probability at least 1 − 2e−u :        1 2u 1 L Tr Σ 16    2 2 +  ≤ 4L  Σ op − Σop nΣop 3 n     2Σop u 4Lu Tr Σ + + + 2 n n 3n    u Tr Σ 2u ≤ 8L + + 4L , nΣop n 3n where we have used for the last inequality that Σop ≤ 4L 2 .




3.6 Proof of Propositions 12 and 13

From a sample $\mathbf X=(X_i)_{1\le i\le n}$ of i.i.d. random vectors, we want to estimate $\operatorname{Tr}\Sigma^2$, where $\Sigma$ is their common covariance matrix. The statistic $\widehat T$ defined in (24) is an unbiased estimator of $\operatorname{Tr}\Sigma^2$. This statistic is also invariant by translation (in particular $\nabla_\mu\tau=0$). If we denote $\mathfrak S_n$ the set of permutations of $\{1,\dots,n\}$, $\widehat T$ can be rewritten as:
$$
\widehat T=\frac1{n!}\sum_{\sigma\in\mathfrak S_n}\frac1{\lfloor n/4\rfloor}\sum_{i=1}^{\lfloor n/4\rfloor}\frac14\big\langle X_{\sigma(4i)}-X_{\sigma(4i-2)},\,X_{\sigma(4i-1)}-X_{\sigma(4i-3)}\big\rangle^2; \qquad(52)
$$
namely, by symmetry, all the 4-tuples appear the same number of times in the right-hand side, so we just need to divide by the number of terms to obtain the identity (52). We will use this decomposition to obtain a concentration of the statistic $\widehat T$ for the Gaussian case and the bounded case, since the inner sum for each fixed permutation is a sum of $\lfloor n/4\rfloor$ i.i.d. terms.
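To make the structure of (52) concrete, the following short sketch computes the single-permutation (identity permutation) version of $\widehat T$, i.e. the average of $\frac14\langle X_{4i}-X_{4i-2},X_{4i-1}-X_{4i-3}\rangle^2$ over disjoint 4-tuples; this inner sum is exactly the i.i.d. average whose concentration is studied below. The code is an illustrative sketch, not the authors' implementation, and the function name `trace_sigma_sq_hat` is ours.

```python
import numpy as np

def trace_sigma_sq_hat(X):
    """One-permutation version of (52): an unbiased, translation-invariant
    estimate of Tr(Sigma^2) built from disjoint 4-tuples of the rows of X."""
    n = X.shape[0]
    m = n // 4                     # number of disjoint 4-tuples
    X = X[: 4 * m]
    A = X[0::4]                    # X_{4i-3}
    B = X[1::4]                    # X_{4i-2}
    C = X[2::4]                    # X_{4i-1}
    D = X[3::4]                    # X_{4i}
    inner = np.einsum("ij,ij->i", D - B, C - A)   # <X_{4i}-X_{4i-2}, X_{4i-1}-X_{4i-3}>
    return np.mean(inner ** 2) / 4.0

# quick sanity check on simulated Gaussian data
rng = np.random.default_rng(0)
d, n = 5, 4000
Sigma = np.diag(np.linspace(1.0, 2.0, d))
X = rng.multivariate_normal(np.ones(d), Sigma, size=n)
print(trace_sigma_sq_hat(X), np.trace(Sigma @ Sigma))
```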

3.6.1 Gaussian Setting

Because the statistic is invariant by translation, we can assume without loss of generality that $\mu=0$. To obtain a deviation inequality for $\widehat T^{1/2}$, we will first find a concentration inequality for $\widehat T$ and then use Lemma 15. We obtain concentration via control of the moments of $\widehat T$, so we first need some upper bounds on Gaussian moments. The following lemma is proved in Sect. 3.7.

Lemma 30 Let $Z_i:=\langle X_i^1-X_i^3,\,X_i^2-X_i^4\rangle^2/4$, where $X_i^j$ for $i=1,\dots,m$ and $1\le j\le4$ are i.i.d. Gaussian random vectors $\mathcal N(0,\Sigma)$. Then for all $q\in\mathbb N$:
$$
\mathbb E\Bigg[\bigg(\frac1m\sum_{i=1}^mZ_i-\operatorname{Tr}\Sigma^2\bigg)^{2q}\Bigg]\ \le\ \bigg(4\sqrt2\,\varphi\,q^2\,\frac{\operatorname{Tr}\Sigma^2}{\sqrt m}\bigg)^{2q}, \qquad(53)
$$
where $\varphi=(1+\sqrt5)/2$ is the golden ratio. We deduce from this lemma a concentration inequality for $\widehat T$.

is defined in (24). where T

(54)

Nonasymptotic One- and Two-Sample Tests in High Dimension …

155

Proof Using Lemma 30, (52) and the convexity of the function x → x 2q , we can

: upper bound the moments of T   2 2q √  

− Tr Σ 2 2q ≤ 4 2φq 2 √Tr Σ . E T n/4

(55)

Let t ≥ 0 and q ∈ N, then by Markov’s inequality     

− Tr Σ 2 2q . P  T − Tr Σ 2  ≥ t ≤ t −2q E T Let us choose q as:

3 q=

e−1 1 √ 1 t2 2 φ2 4



Tr Σ 2 √ n/4

− 21 4

(56)

,

so that (55), (56) entail   P  T − Tr Σ 2  ≥ t ≤ e−4q . Let us now take t=

√ u 2 Tr Σ 2 e2 2φ 2 Tr Σ 2 u √ ≤ 30 √ , 4 n n/4

where we have used n/4 ≥ n/7 for n ≥ 4; we obtain that for all u ≥ 0:     u 2 Tr Σ 2 ≤ e4 e−u . P  T − Tr Σ 2  ≥ 30 √ n  Proposition 12 directly follows from Proposition 31 and Lemma 15.

3.6.2

Bounded Setting

and then using As in the Gaussian case, we first obtain a concentration inequality for T 1/2

Lemma 15, we obtain one for T . We will need the following classical Bernstein’s inequality (see for instance Vershynin [41], Exercise 2.8.5 for the version below) which gives an upper bound on the Laplace transform of the sum of bounded random variables. Lemma 32 (Bernstein’s inequality) Let (X i )1≤i≤m be i.i.d. real centered random variables bounded by B such that E X 12 ≤ σ 2 .

156

G. Blanchard and J.-B. Fermanian

Then for all t < 3/B:   !  1 mσ 2 t 2 log E expt X i ≤ . 2 1 − Bt/3 Via Bernstein’s inequality we obtain the following result. Proposition 33 Let (X i )1 ≤i≤n , n ≥ 4 be i.i.d. Hilbert-valued random variables with

defined by (24). Then for all t ≥ 0: norm bounded by L and covariance Σ, and T 

  P  T − Tr Σ 2  ≥ 8L 2



10L 4 t Tr Σ 2 t + n n

 ≤ 2e−t .

(57)

is defined in (24). where T Proof Let X , X , Y , Y be i.i.d. Hilbert-valued random vectors of expectation μ, covariance Σ and with norm bounded by L, and Z := X − Y, X − Y 2 /4. Then it holds 0 ≤ Z ≤ 4L 4 , E[Z ] = Tr Σ 2 and |Z − E[Z ]| ≤ 4L 4 ; Var [Z ] ≤ 4L 4 E[Z ] = 4L 4 Tr Σ 2 . Now using the convexity of the exponential function, (52) and then Lemma 32, we

as follows: can upper bound the Laplace transform of T   

log E expt T ≤

4L 4 Tr Σ 2 t 2 1 , 2n/4 1 − 4L 4 t/(3n/4)

for all t such that the right-hand sise is well defined, i.e. the denominator is strictly positive. Now using Lemma 17, and n/4 ≥ n/7 for n ≥ 4, for all t ≥ 0 it holds 

   2 10L 4 t  2 2 Tr Σ t P T − Tr Σ  ≥ 8L + n n

 ≤ 2e−t .

(58)

Proof of Proposition 13 Assuming the event entering into (58) holds, we will use the inequalities of Lemma 15:



− T







Tr Σ 2 t √ − Tr Σ 2 + Tr Σ 2 ≤ Tr Σ 2 + 8L 2 n    t t 2 2 10t 2 +L ≤ 8L . ≤ 4L n n n



10L 4 t n

Nonasymptotic One- and Two-Sample Tests in High Dimension …

157

For the other side, we proceed analogously:

0    1 2t √ √ 1 Tr Σ 10L 4 t

− Tr Σ 2 ≥ 2 Tr Σ 2 − 8L 2 − Tr Σ 2 − T n n +    t 10t t ≥ −8L 2 − L2 ≥ −12L 2 . n n n 

3.7 Additional Proofs Proof of Lemma 15 This Lemma completes the Lemma 6.1.3 of Blanchard et al. [6]. This is its complete proof. Let a in R+ , it is well known that for b ≥ −a 2 : a−



|b| ≤



a 2 + b ≤ a + |b| .

On √ the other hand, suppose that b ≥ 0, the Taylor expansion of the function b → a 2 + b − a gives that there exists c ∈ (0, b) such that:

b b . ≤ a2 + b − a = √ 2 2a 2 a +c Suppose now that 0 ≥ b ≥ −a 2 , then

b2 b a 2 + b ≥ a + ⇔ b ≥ 2b + 2 ⇔ b ≥ −a 2 . a a The Eq. (29) is still true when b < −a 2 because then:

|b|  −a ≥ − |b| ≥ − . a Proof of Proposition 18 Let g be a standard Gaussian random vector in Rd , and U T DU be the singular value decomposition of the matrix S 1/2 Σ S 1/2 where D = diag(λ, . . . , λd ). Then we have the following equalities in distribution dist

1

1

dist

dist

Y T ΣY ∼ g T S 2 Σ S 2 g ∼ g T U T DU g ∼ g T Dg . The last equality is a consequence of the invariance by rotation of Gaussian vectors. Then for t < 1/ Σop Sop :


    2 1 2  2T  d t X,Y t g Dg t Σ 2 Y  t2  2 =E e 2 Ee λi gi . =E e 2 = E exp 2 i=1 x ≤ 1−x√x for Using the independence of the coordinates and that − log(1 − x) ≤ 1−x x < 1 (the first inequality can easily be checked by termwise power series comparison), we obtain: n 

    1 log E et X,Y = − log 1 − t λi 2 i=1

≤ 1

n  1

1

1

t 2 λi 1 t 2 Tr(S 2 Σ S 2 ) ≤ . 2 2 1 − t λi 2 1 − tS 21 Σ S 21  21 i=1 op

1

1

1

We conclude using that Tr(S 2 Σ S 2 ) = Tr(Σ S) and that S 2 Σ S 2 op ≤ Sop Σop .  dist

1

Proof of Corollary 21 We use the representation X ∼ (Σ 2 g + μ), where g is a standard Gaussian random variable. We then have dist

1

X d ∼ Σ 2 g + μd = f (g) , where for y ∈ Rd :

  1 f (y) = Σ 2 y + μd . 1

This function f is Lipschitz with constant Σ 2 op . We conclude using Theorem 20 and Jensen’s inequality: E[X d ] ≤

& μ2d + Tr Σ .

 Proof of Corollary 26 We apply Theorem 25, with ε = 1 and the set of functions F = { f u }uH =1 where f u : x ∈ H → x, u H for u ∈ H. We can find a countable subset of the unit sphere because H is separable. Then   n   n   X i − μ, u H =  X i − μ Z = sup  . uH =1 i=1

i=1

H

We conclude using that for all u in the unit sphere of H, Var [ X i − μ, u H ] ≤ Σop and | X i − μ, u H | ≤ 2L a.s. We use Jensen’s inequality to upper bound the expec1 tation: E[Z ] ≤ (n Tr Σ) 2 .  Proof of Lemma 27 We upper bound the operator norm with the Frobenius norm. We denote in this proof Z i := X i − μ. It holds:


   - E Σ − Σ op &   2 ≤ E Tr Σ − Σ ⎛ ⎡

⎤⎞ 21    1 - − ΣΣ - + Σ 2 ⎦⎠ ≤ ⎝E⎣Tr (Z i Z iT )2 + Z i Z iT Z j Z Tj − ΣΣ n2 i i = j 

  21 E Z 4 Tr Σ 2 − = n n  √ Var Z 2 2L Tr Σ ≤ = . √ n n



Proof Lemma 30 First let us remark that if X and X are independent N (0, Σ) Gaussian vectors, then d dist  λi gi gi , X, X ∼ i=1

where gi and gi are independent standard Gaussianrandom  variables and the λi s are 2q

the eigenvalues of Σ. Then for q ∈ N, recalling E gi E X, X 2q =

= (2q!)/(2q q!),



 p1 +...+ pd =q

(  d 2q (2 pi )! 2 2 pi (λi ) 2 p1 , . . . , 2 pd i=1 2 pi pi !



≤ (2q)!

d (

(λi2 ) pi

p1 +...+ pd =q i=1

≤ (2q)!(Tr Σ 2 )q , where we have used (2 p)! ≤ 22 p p!2 . Using this bound, we upper bound the moments of the Z i s:  q  E Z  = 2−2q E X 1 − X 3 , X 2 − X 4 2q ≤ (2q)!(Tr Σ 2 )q . i

i

i

i

i

We now upper bound of Z i − Tr Σ. Let Z i be an independent copy of the moments 2 Z i , then since E Z i = Tr Σ , by Jensen’s inequality     2q  2q  2q ≤ E Z i − Z i

≤ 22q E Z i ≤ (4q)!(2 Tr Σ 2 )2q . E Z i − Tr Σ 2 For the odd moments we use that the function (·)2q+1 is increasing:    2q+1  2q+1 ≤ E Zi ≤ (4q + 2)!(Tr Σ 2 )2q+1 , −(Tr Σ 2 )2q+1 ≤ E Z i − Tr Σ 2


so for all q ≥ 0:

  q   E Z i − Tr Σ 2  ≤ (2q)!(2 Tr Σ 2 )q .

(59)

It remains to upper bound the moments of the sum: ⎡

m 1  2 E⎣ Z − Tr Σ 2 m i=1 i

2q ⎤

=



⎦ 1 m 2q 1 m 2q



 p1 +...+ pm =2q pi =1





(2q)! ( (2 pi )!(2 Tr Σ 2 ) pi p1 ! . . . pm ! i=1 m

p1 +...+ pm =2q pi =1

2 Tr Σ 2 m

≤ (2q)!

( m   pi  2q E Z i − Tr Σ 2 p1 , . . . , pm i=1

2q



(2q)2q

1.

p1 +...+ pm =2q pi =1

Let us count the number of terms in this last sum. Consider first that we have k non-null terms ( pi1 , . . . , pik ). Their sum is equal to 2q but because these terms are strictly greater than 1, we also have that ( pi1 − 2) + . . . + ( pik − 2) = 2q − 2k, where sumare nonnegative. The number of k-partitions of 2q − 2k 2q−k−1  all termsof this = and then the number of terms in the sum is equal to: is (2q−2k)+(k−1) k−1 k−1 m    m 2q − k − 1 k=0

k

k−1

 m 2q − k − 1

m∧q

=

k=0

≤ mq

k−1

k

q   2q − k − 1 k−1

k=0

where F(·) is the Fibonacci sequence and φ = (1 + using that (2q)! ≤ (2q)q q q we obtain that ⎡

m 1  E⎣ Z i − Tr Σ 2 m i=1

2q ⎤ ⎦ ≤ (2φ2 )q

= m q F(2q − 1) ≤ m q φ2q , √



5)/2 is the golden ratio. So

Tr Σ 2 √ m

2q (2q)4q .

(60) 


Acknowledgements GB acknowledges support from: Deutsche Forschungsgemeinschaft (DFG)– SFB1294/1–318763901; Agence Nationale de la Recherche (ANR), ANR-19-CHIA-0021-01 “BiSCottE”; the Franco-German University (UFA) through the binational Doktorandenkolleg CDFA 01-18. Both authors are extremely grateful to the two reviewers and to the editor, who by their very careful read of the initial manuscript and their various suggestions allowed us to improve its quality significantly.

References 1. Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley Series in Probability and Mathematical Statistics. Wiley (2003) 2. Balasubramanian, K., Li, T., Yuan, M.: On the optimality of kernelembedding based goodnessof-fit tests. J. Mach. Learn. Res. 22(1), 1–45 (2021) 3. Baraud, Y.: Non-asymptotic minimax rates of testing in signal detection. Bernoulli 8(5), 577– 606 (2002) 4. Berger, J.O., Delampady, M.: Testing precise hypotheses. Stat. Sci. 2(3), 317–335 (1987) 5. Birgé, L.: An alternative point of view on Lepski’s method. In: State of the Art in Probability and Statistics (Leiden, 1999), vol. 36, pp. 113–133. IMS Lecture Notes Monograph Series Institute Mathematical Statistics (2001) 6. Blanchard, G., Carpentier, A., Gutzeit, M.: Minimax Euclidean separation rates for testing convex hypotheses in Rd . Electron. J. Stat. 12(2), 3713–3735 (2018) 7. Bousquet, O.: A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematiques de l’Académie des Sciences 334(6), 495–500 (2002) 8. Chwialkowski, K., Strathmann, H., Gretton, A.: A kernel test of goodness of fit. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), vol. 48, pp. 2606– 2615 (2016) 9. Cohn, D.L.: Measure Theory/Donald L. Cohn. English. Birkhauser Boston, ix, p. 373 (1980) 10. Dette, H., Kokot, K., Aue, A.: Functional data analysis in the Banach space of continuous functions. Ann. Stat. 48(2), 1168–1192 (2020) 11. Dette, H., Kokot, K., Volgushev, S.: Testing relevant hypotheses in functional time series via self-normalization. J. R. Stat. Soc. Ser. B 82(3), 629–660 (2020) 12. Dette, H., Munk, A.: Nonparametric comparison of several regression functions: exact and asymptotic theory. Ann. Stat. 26(6), 2339–2368 (1998) 13. Ermakov, M.S.: Minimax detection of a signal in a Gaussian white noise. Theory Probab. Appl. 35(4), 667–679 (1991) 14. Fromont, M., Laurent, B., Lerasle, M., Reynaud-Bouret, P.: Kernels based tests with nonasymptotic bootstrap approaches for two-sample problems. In: Mannor, S., Srebro, N., Williamson R.C. (eds.) Proceedings of the 25th Annual Conference on Learning Theory. Proceedings of Machine Learning Research, vol. 23, pp. 1–23 (2012) 15. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(25), 723–773 (2012) 16. Houdré, C., Reynaud-Bouret, P.: Exponential inequalities, with constants, for U-statistics of order two. In: Stochastic Inequalities and Applications. Progress in Probability, vol. 56, pp. 55–69 (2003) 17. Hsu, D., Kakade, S., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17, 6 (2012) 18. Ingster, Y.I.: Minimax nonparametric detection of signals in white Gaussian noise. Probl. Inf. Transm. 18(2), 130–140 (1982) 19. Ingster, Y. I.: Asymptotically minimax hypothesis testing for nonparametric alternatives I-II-III. Math. Methods Stat. 2(2–4), 85–114, 171–189, 249–268 (1993)


20. Ingster, Y., Suslina, I.A.: Nonparametric goodness-of-fit testing under Gaussian models. In: Lecture Notes in Statistics, vol. 169. Springer (2012) 21. Ingster, Y.I., Suslina, I.A.: Minimax detection of a signal for Besov bodies and balls. Problems Inf. Transm. 34(1), 48–59 (1998) 22. Jirak, M., Wahl, M.: Perturbation Bounds for Eigenspaces Under a Relative Gap Condition (2018). arXiv: 1803.03868 [math.PR] 23. Kim, I., Balakrishnan, S., Wasserman, L.: Minimax Optimality of Permutation Tests (2020). arXiv: 2003.13208 [math.ST] 24. Koltchinskii, V., Lounici, K.: Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23(1), 110–133 (2017) 25. Lam-Weil, J., Carpentier, A., Sriperumbudur, B.K.: Local Minimax Rates for Closeness Testing of Discrete Distributions (2021). arXiv: 1902.01219 [math.ST] 26. Lepski, O.V., Spokoiny, V.G.: Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative. Bernoulli 5(2), 333–358 (1999) 27. Lugosi, G., Mendelson, S.: Mean estimation and regression under heavy-tailed distributions: a survey. Found. Comput. Math. 19(5), 1145–1190 (2019) 28. Marienwald, H., Fermanian, J.-B., Blanchard, G.: High-dimensional multi-task averaging and application to kernel mean embedding. In: AISTATS 2021 (2020). arXiv: 2011.06794 [stat.ML] 29. Massart, P.: Concentration Inequalities and Model Selection. Springer (2003) 30. Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10(1–2), 1–141 (2017) 31. Munk, A., Czado, C.: Nonparametric validation of similar distributions and assessment of goodness of fit. J. R. Stat. Soc. Ser. B 60(1), 223–241 (1998) 32. Naumov, A., Spokoiny, V.G., Ulyanov, V.: Bootstrap confidence sets for spectral projectors of sample covariance. Probab. Theory Related Fields 174(3), 1091–1132 (2019) 33. Ostrovskii, D.M., Ndaoud, M., Javanmard, A., Razaviyayn, M.: Near-Optimal Model Discrimination with Non-Disclosure (2020). arXiv: 2012.02901 [math.ST] 34. Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: Proceedings of International Conference on Algorithmic Learning Theory (ALT 2007), pp. 13–31 (2007) 35. Spokoiny, V.G.: Adaptive hypothesis testing using wavelets. Ann. Stat. 24(6), 2477–2498 (1996) 36. Spokoiny, V.G.: Parametric estimation. Finite sample theory. Ann. Stat. 40(6), 2877–2909 (2012) 37. Spokoiny, V.G., Dickhaus, T.: Basics of Modern Mathematical Statistics. Springer Texts in Statistics. Springer (2015) 38. Spokoiny, V.G., Zhilova, M.: Sharp deviation bounds for quadratic forms. Math. Methods Stat. 22(2), 100–113 (2013) 39. Spokoiny, V.G., Zhilova, M.: Bootstrap confidence sets under model misspecification. Ann. Stat. 43(6), 2653–2675 (2015) 40. van Handel, R.: Structured random matrices. In: Convexity and Concentration, pp. 107–156. Springer (2017) 41. Vershynin, R.: High-Dimensional Probability: An Introduction with Applications to Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 47. Cambridge University Press (2018) 42. Wellek, S.: Testing Statistical Hypotheses of Equivalence. Chapman and Hall/CRC (2002)

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls Sara van de Geer and Peter Hinz

Abstract The paper [10] shows bounds for the prediction error of the Lasso. They use projection arguments to establish non-adaptive as well as adaptive bounds. We address the question whether these projection arguments can give tight non-adaptive bounds. It turns out that indeed this is the case when the design is “structured” and when one applies an appropriate type of approximation numbers. This follows from results in the approximation theory literature. We present the connection, an entropy bound based on projection arguments, and several examples. Keywords Approximation number · Convex hull · Entropy · Lasso

1 Introduction

In [10], adaptive and non-adaptive bounds for the prediction error of the Lasso (which is given here in Eq. (2)) are derived. Their work relies on uniform bounds for $\epsilon^\top f/n$ where $\epsilon\in\mathbb R^n$ is a standard Gaussian vector and $f$ varies in a class of vectors $\mathcal F\subset\mathbb R^n$. Such bounds are typically derived using empirical process theory, Dudley's entropy integrals, or Gaussian widths. The paper [10] however invokes projection arguments instead. Their main aim is to establish adaptive bounds, but as "by-product" they obtain non-adaptive bounds as well. The question we address in this paper is: do such projection arguments always allow one to recover minimax rates, possibly up to log-terms? We show that the answer is affirmative, at least when the design is "structured". The latter is for example the case if the approximation numbers, defined in Sect. 2, are polynomial in the inverse dimension. A sufficient condition for this is that the covering numbers, also defined in Sect. 2, of the design (or of the extreme points of its convex hull) are polynomial. Actually, a little more flexibility


than in [10] may be needed when choosing the space to project on. If one moreover uses a somewhat different measure of approximation, there will be no superfluous log-terms. This follows from results in approximation theory. In particular, we will make use of Theorem 2.4 in [25]. It is known that for a class of regression functions the minimax rate follows from the (local) entropy of this class. We will not review this theory, but instead only discuss that the projection arguments lead to tight entropy bounds. This is well-known in approximation theory and there is in fact a large body of results for Banach spaces and convex bodies. In this paper, we cite these in our very simple finite-dimensional context. Thus, the present work recalls parts of approximation theory that are very useful but perhaps less known to statisticians. The paper of [3] contains a bound for the entropy of convex hulls of classes of functions or vectors with polynomial covering numbers. It can be applied to derive a rate of convergence for the Lasso, when the design has polynomial covering numbers (see e.g. [28]). Although the bound in [3] is the best possible when no further structural properties are imposed, it is known to be too loose in particular cases. An improvement is based on using small ball estimates, and this in turn employs the duality theorem, which in turn invokes approximation numbers. We started to learn this from [16] and then from going back to the papers [19, 21] and to the monograph [25]. Because the bounds in the literature involve unspecified constants, we give a self-contained proof deriving entropy bounds from approximation numbers (see Theorem 5), where there appear log-terms. Our aim is not to remove log-terms however. Instead, we aim at revealing the relation between projection arguments and the existing approximation theory literature, without going too much into difficult computations and bounds with unspecified constants. Indeed, we provide simple bounds with explicit constants, but with a log-price to pay. We consider several applications. The first is for one-hidden-layer neural networks with ReLU activation function. We derive a bound for the covering number, and hence a bound for the entropy, and hence a rate of convergence for the Lasso with design the one-hidden-layer design. The next example is for functions with bounded total variation of their discrete derivatives. In that case covering numbers do not yield the right entropy bound, but the projections do give the right bounds up to log-terms. This thus allows one to reprove minimax rates up to log-terms for one-dimensional trend filtering. We also present higher-dimensional extensions, which can be used for example when employing multiplicative adaptive regression splines. However, we show that our multivariate extension for bounding approximation numbers does not give the right result when one considers the class of distribution functions in higher dimensions. Thus, in view of the previous findings, we conclude that there must be better bounds for the approximation numbers in this case. For this case it remains open how to explicitly construct the space to project on and obtain tight approximation numbers.


1.1 Organization of the Paper In the next section, Sect. 2, we present the definition of approximation numbers and covering numbers. In Sect. 3 we apply the approximation numbers to bound the empirical process   f /n where  ∈ Rn is a standard Gaussian vector and f varies within a class F ⊂ Rn . Section 4 invokes the bound for the empirical process to establish a (non-adaptive) bound for the prediction error of the Lasso. Section 5 connects the approximation numbers with results from the literature on approximation theory and also cites the result of [3] concerning entropy bounds based on covering numbers. In Sect. 6 we present an explicit bound for the entropy based on approximation numbers. For the local entropy, we re-establish here what was already known, but the advantage is that the constants are given explicitly. We do not know whether the log-term here is always redundant. In Sect. 7 we estimate the covering numbers for the case of one-hidden-layer neural networks. The weights are fixed and we consider the so-called ReLU activation functions. With this design, we call the resulting Lasso the N N Lasso. Section 7.1 has the definitions and results and Sect. 7.2 presents a small simulation study. Section 8 contains some examples of approximation numbers. Section 9 concludes. In Sect. 10 one can find most of the proofs. Some proofs however are included in the main text because they explain the main idea, while being relatively straightforward.

2 Definitions n Let x1 , . . . , xn be given elements in some space X and Q n := n1 i=1 δxi . Denote the L 2 (Q n )-norm by  ·  Q n . Let  be a given subset of L 2 (Q n ) with cardinality p := ||. Let V ⊂ L 2 (Q n ) be a linear space with dimension dim(V). (Clearly we always have dim(V) ≤ n.) For a function f ∈ L 2 (Q n ), write the projection of f on V as f V . Define the anti-projection or residual as f V ⊥ := f − f V . Write   δ(V, ) := sup ψV ⊥  Q n : ψ ∈  . We consider a fixed linear space N ⊂ L 2 (Q n ) with dimension dim(N ), and will require V ⊃ N . In the application to the Lasso, the space N consists of functions whose coefficients are not penalized. In what follows one may also apply an alternative formulation setting N = ∅ and replace  by {ψN ⊥ : ψ ∈ }. Definition 1 For N ∈ N let the best N -term δ-approximation of the functions in  be   V N ∈ arg min δ(V, ) : N ⊂ V ⊂ L 2 (Q n ) linear space, dim(V) = N + dim(N ) . V


The δ-approximation numbers of the functions in  are δ N () := δ(V N , ), N ∈ N. Let  ∈ Rn be a standard Gaussian vector and   √ 2  ⊥ I max | ψV |/ n . l (V, ) := E 2

ψ∈

Definition 2 For N ∈ N let the best N -term l-approximation of the functions in  be   V N ∈ arg min l(V, ) : N ⊂ V ⊂ L 2 (Q n ) linear space, dim(V) = N + dim(N ) . V

The l-approximation numbers of the functions in  are l N () := l(V N , ), N ∈ N. Moreover we define the l-inverse complexity as ico() := sup

N ∈N

δ(V N , ) . l(V N , )

Approximation numbers play an important role in approximation theory, see e.g. [11, 25]. Some properties are given in Lemma 1 below. Lemma 1 It holds that  δ N () ≤ l N () ≤ δ N () 2(log(2 p) + 1) ∀ N . Moreover   √ √  ≤ exp[−u], ∀ u > 0. IP max | ψV ⊥ |/ n ≥ l N () 1 + ico() 2u ψ∈

Finally, 1  ≤ ico() ≤ 1. 2 log(2 p) + 1 Proof See Sect. 10.1. Definition 3 For G ⊂ L 2 (Q n ), and δ > 0, the δ-covering number N (δ, G) of G is the minimum number of balls with radius δ necessary to cover G. The entropy of G is H (·, G) := log N (·, G).


By definition, if for some δ > 0 and for N = N (δ, ), the collection {ψ j } Nj=1 is a minimal δ-covering of , then for each ψ ∈  there is a k ∈ {1, . . . , N } such that ψ − ψk  Q n ≤ δ. Suppose now for simplicity that N = ∅. Clearly, if we let V be N the linear space spanned by {ψk }k=1 then the space V has dimension (at most) N and the distance of any ψ ∈  to V will be at most δ as well, but it can in fact be much smaller. In other words for N = N (δ, ) we have δ N () ≤ δ. For example, suppose for some positive constants A and W it holds that N (δ, ) ≤ A W δ −W for all δ > 0. Fix N ≥ 2 and choose δ such that N − 1 ≤ A W δ −W ≤ N . Then N (δ, ) ≤ N and δ ≤ A(N − 1)−1/W so that the approximation number δ N () can be bounded by A(N − 1)−1/W . We will see, in Sect. 8, examples where, for N = N (δ, ), actually δ N () δ due to the fact that we allow for any linear space V to approximate the functions in . In this sense (l- or) δ-approximation numbers are more refined than covering numbers. This means that Theorem 5 below which derives entropies based on δ-approximation numbers recovers the result of [3] (given here in Theorem 3) up to log-terms. Define the convex hull of  as  conv() :=

f =

p



ψ j b j : (b1 , . . . , b p ) ∈

p R+ ,

j=1

p

 bj = 1 .

(1)

j=1

The absolute convex hull is defined as  abconv() :=

f =

p

j=1

ψjbj :

p

 |b j | ≤ 1 .

j=1

Note that when the zero vector is in the set , then  conv() =

f =

p

j=1

ψ j b j : (b1 , . . . , b p ) ∈ R+ , p

p

 bj ≤ 1

j=1

and then N (·, conv()) ≤ N (·, abconv()) ≤ N 2 (·/2, conv()). Thus, for polynomial entropies for example, the difference between the two entropies is only in the constants.


3 Bounds for the Empirical Process Using Approximation Numbers Let  ∈ Rn be a standard Gaussian vector. Let  ⊂ L 2 (Q n ) with p := cardinality p ||. Let N ⊂ L 2 (Q n ) be a linear space. For b ∈ R p we let f b := j=1 ψ j b j = b. We define   F :=

f = f N + fb , f N ∈ N , b ∈ R p .

We use the following shorthand formulation for a random variable Z and constants c1 , c2 and 0 ≤ π ≤ 1 c1 with probability at least π Z≤ c2 means that IP(Z ≤ c1 ) ≥ π as well as IP(Z ≤ c2 ) ≥ π . Theorem 1 For all u, v > 0 and all N ∈ N with probability at least 1 − exp[−u] − exp[−v] it holds for all f = f N + f b ∈ F that   f /n ≤



N + dim(N ) + n



√ √ l N ()(1 + ico() 2u) 2v   f  Q n + b1 / n n δ N () 2 log(2 p) + 2u

.

Proof of Theorem 1 Let V ⊃ N , with dim(V) = N + dim(N ). Let f = f N + f b ∈ F be arbitrary. By a tail bound for χ 2 random variables (see [20], Lemma 1), we have with probability at least 1 − exp[−v]   f V /n ≤



Moreover, f b =

N + dim(N ) + n

p j=1



 2v N + dim(N ) 2v  fV  Qn ≤ +  f  Qn . n n n

ψ j b j so f V ⊥ =

  f V ⊥ /n =

p

p j=1

ψV ⊥ , j b j and therefore

  ψV ⊥ , j b j /n

j=1

≤ max |  ψV ⊥ , j |/nb1 1≤ j≤ p  ⎧ √ √ ⎨ l N () 1 + ico() 2u ≤ b1 / n  ⎩ δ N () 2 log(2 p) + 2u

,

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

169

both inequalities with probability at least 1 − exp[−u], where the last upper inequality follows by taking V N such that l N () = l(V N , ) and invoking Lemma 1 and the lower inequality by taking V N such that δ N () = δ(V N , ) and applying the union bound. 

4 Bounds for the Lasso Using Approximation Numbers Let Y ∈ Rn be a response vector with unknown mean f 0 := E I Y . We write the noise as  := Y − f 0 and assume  has an n-dimensional standard normal distribution. Consider a given design matrix  ∈ Rn× p . Let moreover N ⊂ L 2 (Q n ) be a given linear space with dimension dim(N ). Write f b := b for b ∈ R p . Define as in the previous section F := { f = f N + f b : f N ∈ N , b ∈ R p }. We consider the Lasso fˆ = fˆN + f βˆ := arg

min

f = f N + f b ∈F

  2 Y − f  Q n + 2λb1 .

(2)

Condition on the l-approximation numbers The l-approximation numbers satisfy for some positive constants A and W l N () ≤ AN −1/W ∀ N ∈ N.

(3)

Condition on the δ-approximation numbers The δ-approximation numbers satisfy for some positive constants A and W δ N () ≤ AN −1/W ∀ N ∈ N.

(4)

The following theorem is an extension of results in [10], as the latter require the linear space V onto which one projects to be spanned by a subset of the vectors in . Moreover, they use δ-approximation numbers but not l-approximation numbers. Let f ∗ = f ∗N + f β ∗ ∈ F be arbitrary. Theorem 2 Suppose Condition (3) and/or Condition (4) is met. Then for all u, v > 0, with probability at least 1 − exp[−u] − exp[−v] ˆ 1 ≤  f ∗ − f 0 2Q  fˆ − f 0 2Q n + λβ n +

2(2 Aλ0 (u))W 2(1 + dim(N ) + 2v) + 3λβ ∗ 1 . + W nλ n

170

where

S. van de Geer and P. Hinz

√ √ (1 + ico() 2u)/ n underCondition(3) λ0 (u) :=  . √ 2 log(2 p) + 2u/ n underCondition(4)

In e.g. [27] rates of convergence of penalized least squares estimators are derived using entropy bounds. If one applies Theorem 2 under Condition (3) on the l-approximation numbers, the result is up to constants the same as the one based on entropy bounds. This follows from Corollary 1 below where the relation between entropy and l-approximation numbers is given. In Sect. 8 we will present δapproximation numbers in some examples. These can then be inserted in Theorem 2 using Condition (4) (see Corollary 2 for an illustration). In view of Theorem 5, the result is then the same as when using entropy bounds but now up to log-terms. To prove the theorem we employ the next lemma termed “Basic Inequality”. The result is from [18] (Sect. 5.4), see also [17] (Chap. 5) or [29] (Chap. 2). We present a proof for completeness in Sect. 10.2. Define for f = f N + f b , the “penalized” squared distance τ∗2 ( f ) :=  f − f ∗ 2Q n + λb1 . Lemma 2 We have ˆ 1 ≤  f ∗ − f 0 2Q + 2  ( fˆ − f ∗ )/n + 2λβ ∗ 1 .  fˆ − f 0 2Q n + τ∗2 ( fˆ) + λβ n Proof See Sect. 10.2. Proof of Theorem 2 By Theorem 1 it holds with probability at least 1 − exp[−u] − exp[−v] that 

N + dim(N ) 2v +  fˆ − f ∗  Q n n n ˆ 1 + β ∗ 1 ) + 2 AN −1/W λ0 (u)(β 2(dim(N ) + 2v) 2N + +  fˆ − f ∗ 2Q n ≤ n n ˆ 1 + β ∗ 1 ). + 2 AN −1/W λ0 (u)(β

2 ( fˆ − f ∗ )/n ≤ 2 



Define N∗ = Then

 (2 Aλ0 (u))W . λW

ˆ 1 + β ∗ 1 ) ≤ λ(β ˆ 1 + β ∗ 1 ). 2 AN∗−1/W λ0 (u)(β

We choose N = N∗ to see that 2  ( fˆ − f ∗ )/n ≤

2(dim(N ) + 2v) 2N∗ + + τ∗2 ( fˆ) + λβ ∗ 1 . n n

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

Moreover N∗ ≤

171

(2 Aλ0 (u))W + 1. λW

Inserting the last two inequalities in the Basic Inequality of Lemma 2, we see that the term τ∗2 ( fˆ) cancels out, and thus with probability at least 1 − exp[−u] − exp[−v], ˆ 1  fˆ − f 0 2Q n + λβ ≤  f ∗ − f 0 2Q n +

2(2 Aλ0 (u))W 2(1 + dim(N ) + 2v) + 3λβ ∗ 1 . + W nλ n 

5 Some Entropy Results From the Literature In this section we consider the case N = ∅. We will refer to results from [3, 19, 21, 25]. It must be stressed however that we do not do these papers justice as we only consider a very simple (finite-dimensional) special case with a fixed discrete measure Q n and a finite set . The following result is obtained in [3], see also Theorem 2.6.9 in [30]. Theorem 3 ([3]) Suppose that the class  ⊂ L 2 (Q n ) is within a ball with radius 1: supψ∈ ψ Q n ≤ 1, and that for some positive constants A and W N (δ, ) ≤ A W δ −W , ∀ δ > 0. Then there exists a constant C depending on A and W such that H (δ, abconv()) ≤ Cδ − 2+W , ∀ δ > 0. 2W

Recall that we consider  as a subset of L 2 (Q n ). The above theorem actually holds not just for the measure Q n but for arbitrary probability measures Q and for arbitrary subsets  in the unit ball of L 2 (Q). Theorem 3 is tight in its generality, but in special cases it may not give tight entropy bounds. Thus, we need further refinements. We refer to [16] for an insightful summary on this for an infinite-dimensional setting and using small ball estimates. In [19, 21, 25], a very general theory is developed concerning convex bodies and subsets of (infinite-dimensional) Banach spaces and mappings from Banach spaces to (infinite-dimensional) Hilbert spaces and visa versa. These works are our reference here, but we note that there are many other impressing publications in this area e.g. [1, 15]. It would go too far to explain everything here in detail. We only consider a rather trivial case, where the Banach space is E := (R p ,  · ∞ ) or its dual E ∗ := (R p ,  · 1 ). Moreover, our mappings are simply

172

S. van de Geer and P. Hinz

u : Rn → E u(a) :=   a, a ∈ Rn and its dual u ∗ : E ∗ → Rn u ∗ (b) := b, b ∈ R p . We will cite Theorem 2.4 of [25] (see also Lemma 2.1 in [21]) in our simple case and using the original notation so that the translation to our context is clearer. In his monograph, [25] defines   √ 2 I u()∞ / n , l 2 (u) := E where  ∈ Rn is a standard Gaussian vector. Moreover, l N (u) defined in [25] is our l N (). Let e N (u ∗ ) := inf{δ > 0 : N (δ, u ∗ (B E ∗ )) ≤ 2 N −1 } where B E ∗ is the unit ball in E ∗ . Theorem 4 ([25], Theorem 2.4, special case) There exist universal constants c1 and c2 such that

ek (u ∗ )k −1/2 . l N (u) ≤ c1 k≥c2 N

Corollary 1 Suppose that for some positive constants C and W H (δ, u ∗ (B E ∗ )) ≤ Cδ − 2+W . 2W

Then for N ≥ 2



2C e N (u ) ≤ N log 2 ∗

2+W 2W

and so, for some constant A depending on C and W , l N (u) ≤ AN −1/W , ∀ N ∈ N. We have stated Theorem 4 and its corollary in the notation of the original reference. In our special case the set u ∗ (B E ∗ ) is u ∗ (B E ∗ ) = abconv(). Let us stress again that the results in literature are for general Banach spaces and allow for non-polynomial entropies/approximation numbers. The papers [19, 21] show that the bounds are tight. In our simple context, this implies

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

173

l N ()  N −1/W ⇔ H (δ, abconv())  δ − 2+W . 2W

Thus δ N ()  N −1/W ⇔ δ − 2+W  H (δ, abconv())  δ − 2+W log 2+W p, 2W

2W

W

 since δ N () ≤ l N () ≤ δ N () 2(log(2 p) + 1) by Lemma 1. A main point in the papers [19, 21] is to show that the entropies follow from small ball estimates, and the approximation numbers are an intermediate step to this end (with another ingredient being the duality theorem). In the present paper, Theorem 4 is of importance, as it allows us to conclude that the bounds obtained in Theorem 2 are the same as those based on tight bounds for the empirical process using Dudley’s entropy integral or Gaussian widths. This means for instance that Theorem 2 can be applied to prove minimax rates when the tuning parameter is chosen appropriately.

6 Explicit Entropy Bounds Using δ-Approximation Numbers The δ-approximation numbers are less than or equal to the l-approximation numbers, but as we have seen in Theorem 1, using δ-approximation may lead to a log-term price to pay. The same is true for entropy numbers: using l-approximations one gets tight bounds, but with δ-approximation numbers, there may be again a log-term price to pay. We show this in this section in Theorem 5: δ-approximation numbers give the right order for the entropy modulo a log-term. We will not be too careful so that the main message is clearer: modulo constants the (log p)-term can be replaced by a | log δ|-term in the δ-entropy. We consider for all R > 0 the collection F(R) := { f = f N + f b : f N ∈ N , b ∈ R+ , p

p

j=1

Theorem 5 Suppose that δ N () ≤ AN −1/W ∀ N ∈ N. Then for all R > 0 and δ > 0 it holds that

b j = 1,  f  Q n ≤ R}.

174

S. van de Geer and P. Hinz W   2+W 4R + δ 2 1 + log p log 2+W δ  4R + δ + 1 + log p. + (1 + dim(N )) log δ



H (δ, F(R)) ≤ 2

2W 2+W

2A δ

Proof See Sect. 10.3. The local entropy is the case where R  δ. We see from the theorem that then H (δ, F(R))  δ − 2+W log 2+W p + log p 2W

W

7 Bounds Using Covering Numbers Illustrated: One-Hidden-Layer Neural Networks 7.1 Definitions and Results We show that using covering numbers, one obtains by Theorem 3 a bound for the entropy of a class of one-hidden-layer neural networks with ReLU activation function. Let xi := (ξi,1 , . . . , ξi,d ) ∈ R1×d , i = 1, . . . , n, and let the input matrix be ⎛ ⎞ x1 ⎜ .. ⎟ X := ⎝ . ⎠ ∈ Rn×d . xn We assume the rows of X have unit length: max1≤i≤n xi 2 = 1. Given a matrix of weights W := (w1 , . . . , w p ) ∈ Rd× p with each column having length one: diag(W W) = I , and given a vector c := (c1 , . . . , c p ) ∈ R1× p , we consider f b (i) :=

p

(xi w j − c j )+ b j , i = 1, . . . , n,

(5)

j=1



where z + :=

z, z ≥ 0 0, z < 0

is the hinge function or Rectified Linear Unit (ReLU) function. For w ∈ Rd and c ∈ R we write

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

175

⎞ (x1 w − c)+ ⎟ ⎜ .. n (X w − c)+ = ⎝ ⎠∈R . . ⎛

(xn w − c)+

Moreover, we write  (X W − c)+ = (X w1 − c1 )+ , · · · , (X w p − c p )+ ∈ Rn× p . Let S d−1 := {w ∈ Rd : w2 = 1} be the unit sphere in Rd and   d−1  := ψw,c (x) = (xw − c)+ , w ∈ S , c ∈ R .

Lemma 3 We have

N (δ, ) ≤ 4d+1 dδ −d , ∀ δ > 0.

Proof See Sect. 10.4. We thus see by Theorem 3 that for a constant C depending only on d H (δ, abconv()) ≤ Cδ − 2+d ∀ δ > 0. 2d

(6)

Note (as is known, see e.g. [4]) that the entropy does not depend on the width p of the network. Let the one-hidden-layer neural network class be  N N :=

 f b := (X W − c)+ b : b ∈ R . p

We define the N N Lasso as fˆ := arg min

f b ∈N N

  2 Y − f b  Q n + 2λb1 .

See also [7] where the Lasso is used in the last layer of a multi-layer neural network. (It is moreover in the spirit of [24] Proposition 26.) Using the entropy bound (6), one obtains by standard arguments (see e.g. [27]), for constants c0 and c1 depending on d, for all v > 0 and δn2 given by δn2

√ c0 (1/ n)d c1 (1 + v) , = + d nλ n

with probability at least 1 − exp[−v] that

176

S. van de Geer and P. Hinz

ˆ 1 ≤  f ∗ − f 0 2Q + 2λβ ∗ 1 + δn2 .  fˆ − f 0 2Q n + λβ n

(7)

This bound depends of course on the width of the network via the approximation error  f ∗ − f 0 2Q n and the 1 -norm of β ∗ . We may alternatively apply Theorem 2 to obtain explicit constants at the price of an additional log-term. By Lemma 3, for N ≥ 2 δ N () ≤ 4(4d)1/d (N − 1)−1/d . To accommodate for N ≥ 2, we see from the proof of Theorem 2 that this can be done by adding 2/n to its bound. This gives: for all u, v > 0, with probability at least 1 − exp[−u] − exp[−v] 8d(8λ0 (u))d 2(2 + 2v) ˆ 1 ≤  f ∗ − f 0 2Q + +  fˆ − f 0 2Q n + λβ + 3λβ ∗ 1 n d nλ n  √ where λ0 (u) := 2 log(2 p) + 2u/ n. The log p-term in this result is sub-optimal, but an advantage is that the constants are explicit.

7.2 Simulation We now want to illustrate the above results for one-hidden-layer neural networks with ReLU activation function. To better describe the dependency on the sample size n, in contrast to Eq. (5), where f b was defined as a vector of function values, it is now more convenient to directly consider N N as the set of one-hidden-layer neural networks from Rd to R with the ReLU activation function:   N N := f b : R1×d → R, x → (xW − c)+ b : b ∈ R p with W and c defined as above. We consider the case where f 0 ∈ N N , i.e. f 0 = f b0 for some b0 ∈ R p . In order to notationally express the dependency on n and λ we introduce introduce the random variables fˆn,λ := f βˆn,λ βˆn,λ := arg minp ( b∈R

n 1

( f 0 (xi ) + εi − f b (xi ))2 + 2λb1 ) n i=1

∗ ∗ ∗ with b via i.i.d. standard normal noise ε1 , . . . , εn . Furthermore, let f n,λ := f bn,λ n,λ := 2 ∗ 0 0 arg minb∈R p  f − f b  Q n + 2λb1 for n ∈ N. In particular,  f n,λ − f 2Q n + ∗ 1 ≤ 2λb0 1 such that the probabilistic inequality in Eq. (7) implies 2λbn,λ

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

    2+d E  fˆn,λ − f 0 2Q n + λβˆn,λ 1 = O λb0 1 + λ−d n − 2 + n −1

177

(8)

for fixed d, λ → ∞, n → ∞ and bounded predictors satisfying supi∈N xi 2 ≤ 1. 2+d Balancing λ and n in this equation yields λb0 1  λ−d n − 2 or equivalently 1 − 1+d

λ  n − 2(1+d) b0 1 2+d

for n → ∞.

(9)

For a fixed input dimension d and different sample sizes n we want to rediscover the above balancing rate for λ in a simulation of optimal values λ∗n that minimize the expected mean squared error   for n ∈ N. λ∗n = arg min E  fˆn,λ − f 0 2Q n λ≥0

(10)

We choose input dimension d = 2 for the neural networks such that this rate evalu2 ates to approximately λ  n − 3 and want to observe a similar rate in our simulated approximations of λ∗n . Our simulation is is carried out as follows: For the fixed input dimension d = 2 and hidden layer width p = 10000 we presample N = 3000 predictor values x1 , . . . , x N ∈ Rd on the unit disc in Rd , the input layer weight matrix W ∈ Rd× p with i.i.d. columns uniformly on S d−1 and the bias vector c ∈ R1× p with i.i.d. entries from the uniform distribution between -1 and 1. In particular we have maxi∈{1,...,N } xi 2 ≤ 1. For the true function vector f 0 = ( f (x1 ), . . . , f (x N )) we evaluate the function f : R1×d → R, x = (ξ1 , . . . , ξd ) → αcos(ξ1 )

(11)

depending only on the first √ three input coordinates with a scale constant α > 0 chosen such that  f 0 2 = N . This way we balance the signal-to-noise ratio because E[ε12 + · · · + ε2N ] = N =  f 0 22 . Since N = 3000 < 10000 = p there exist solutions b ∈ R p of ⎛ ⎞ (x1 W + c)+ ⎜ ⎟ .. 0 (12) ⎝ ⎠b = f . . (x N W + c)+

Using basis pursuit, we can determine the solution b0 to (12) which has minimal 1 -norm. For a selection of sample sizes n ∈ n := {100, 200, 400, 800, 1500, 3000} we want to approximate λ∗n from Eq. (10) by the random quantity λˆ ∗n = argminλ≥0 where

M 1 ˆ(m) f − f 0 2Q n , M m=1 n,λ

(13)

178

S. van de Geer and P. Hinz

Fig. 1 Samples of 1 M 0 2 ˆ(m) m=1  f n,λ − f  Q n M for M = 6000, n ∈ {100, 200, 400, 800, 1500, 3000} and λ in the chosen sample size specific grids for λ. The solid black dots represent the minima

Fig. 2 Logarithmic plot of the estimated optimal penalty λˆ ∗n minimizing λ → 1 M 0 2 ˆ(m) m=1  f n,λ − f  Q n M against sample size n for n ∈ {100, 200, 400, 800, 1500, 3000} and M = 6000. The almost straight line with approximate slope −0.604 empirically suggests a relation λ∗n  n −0.604

1 (m) (m) := f βˆ (m) , βˆn,λ := argminb∈R p fˆn,λ n,λ n

n 

f 0 (xi ) + εi(m) − f b (xi )

2

+ 2λb1 .

i=1

(14) Since we train the hidden layer weights of a one-hidden-layer neural network, for each considered n and each λ ≥ 0 we require M Lasso fits without an intercept parameter. Practically, we resort to a discrete grid of values for λ and use a Lasso optimizer that is capable of optimizing for a multitude of penalty parameters in the same run. In order to trade off computational feasibility and a fine-grained grid for the λ values, for each considered n we choose a sample size specific λ-grid for the Lasso fits around a first estimate λˆ ∗n based on a crude grid with a large range and fewer Lasso fits. M 0 2 ˆ(m) For M = 6000 the simulated approximations M1 m=1  f n,λ − f  Q n for our considered sample sizes on the corresponding grids are depicted Fig. 1. The minimizers λˆ ∗n , n ∈ n are the λ-values of the minima represented by the black dots. As it can be expected, with increasing sample size n the optimal expected MSEs M 1 0 2 ˆ(m) ˆ∗ m=1  f n,λ − f  Q n and the optimizing penalty λn decrease. M

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

179

Fig. 3 Logarithmic plot of the quantities 1 M 0 2 ˆ(m) m=1  f ˆ ∗ − f  Q n M n,λn

(average decay rate −0.610) and λˆ ∗n b0 1 (average decay rate −0.604, b0 1 ≈ 4.38) for n ∈ {100, 200, 400, 800, 1500, 3000} and M = 6000

The observed relation of these optimal penalties and the corresponding sample n depicted in sizes Fig. 2 is approximately of the form λˆ ∗n  n −0.604 which is very close to what we would expect from Eq. (9). M ˆ(m) Also the minimized values of the estimated expected MSE M1 m=1  f n,λˆ ∗ − n

f 0 2 depicted in Fig. 7.2 are decreasing consistently, with a similar rate of approximately n −0.610 in the considered range of sample sizes (for larger values for n the computational effort became infeasible). For completeness Fig. 3, compares both, the values of λˆ ∗n b0 1 using the basis pursuit solution b0 to (12) and the values 1 M 0 2 ˆ(m) m=1  f n,λ − f  Q n . Both quantities behave similarly as should be expected M from Eq. (8) with the balanced λ rate (9).

8 Examples of Approximation Numbers The paper [21] contains as an example the general Riemann-Liouville integral operators  1 1 (x − c)k−1 f b (x) = + b(c)dc, x ∈ (0, 1]. (k) 0 where k > 0 is fixed. For  := { f b :



|b(c)|2 dc ≤ 1}, they show that for k > 1/2

l N ()  N −

2k−1 2

where the l-approximation numbers are with respect to Lebesgue measure. The result follows from small ball estimates. In the next two subsections, we consider the discrete counterpart and only for values k ∈ N. We first consider the case k = 2 and then general k ∈ N. Moreover, we look at δ-approximation numbers instead of l-approximation numbers.

180

S. van de Geer and P. Hinz

8.1 Second Order Discrete Derivatives Let xi := i/n and f i := f (xi ) (i = 1, . . . , n). Then f := ( f 1 , . . . , f n ) ∈ Rn . We write its first discrete derivative as ( f ) j := f j − f j−1 , j ∈ {2, . . . , n} and its second discrete derivative as (2 f ) j := ( f ) j − ( f ) j−1 , j ∈ {3, . . . , n}. 

Let N =

f

N





(x) = a1 ψ1 + a2 ψ2 : (a1 , a2 ) ∈ R

2

be the space spanned by {ψ1 , ψ2 }, ψi,1 = 1, ψi,2 = (i − 1)/n, i = 1, . . . , n, and fb =

n

ψjbj

j=3

with b j = n(2 f ) j , j ∈ {3, . . . , n}, and ψi, j = (i − j + 1)l{i ≥ j}/n, i ∈ {1, . . . , n}, j ∈ {3, . . . , n}. One easily checks that

N (δ, )  δ −1 .

In other words, the covering number has W = 1. We show in Lemma 4 that the δ-approximation number improves this to W = 2/3. Indeed, it is known (see [2] or [5]) that 1 (15) H (δ, F(1))  δ − 2 . where  F(1) :=

f

N

+ fb : f

N

∈ N, b ∈ R

n−2

,  f Qn

Lemma 4 It holds that δ N () ≤ 23/2 N −3/2 ∀ N ≤ n.

 ≤ 1, b1 ≤ 1 .

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

Proof of Lemma 4 Note that ⎛ 1 −2 ⎜0 1 ⎜ 2 = ⎜ . . ⎝ .. ..

1 −2 .. .

0 1 .. .

··· ··· .. .

0 0 .. .

0 0 .. .

181

⎞ 0 0⎟ ⎟ n×(n−2) . .. ⎟ =: D ∈ R .⎠

0 0 · · · 1 −2 1

0 0

Clearly N is the null space of D. It follows that for f = f N + f b we have f N ⊥ = D  (D D  )−1 b/n. In other words

N ⊥ = D  (D D  )−1 /n =: {ψ j,N ⊥ }nj=3 .

It is shown in [22] that for j ∈ {3, . . . . , n} ψ j,N ⊥ 22 ≤

 1 1 n−1 3 3 3 ( j − 2) ∧ (n + 1 − j) ≤ . n2 n2 2

Let L ∈ N and m := (n − 1)/(L + 1). In the matrix D, remove the rows with L and call the new matrix D−S . Then D−S index in the set S := {1 + im, 2 + im}i=1  is a block diagonal matrix, with each block a matrix in Rm ×n with m  ≤ m having the same structure as the original matrix D. Thus, if we let V be the space spanned by {ψ j } j∈S∪{1,2} we see that ψ j,V ⊥ 22 ≤ But then

 ψ j,V ⊥ 2Q n



m 2n

3

 1 m 3 . n2 2

≤ (L + 1)−3 ≤ L −3 .

Since dim(V) = dim(N ) + 2L, the result follows.



8.2 kth Order Discrete Derivatives Let as in Sect. 8.1, xi := i/n and f i := f (xi ) (i = 1, . . . , n). For k ∈ N, k < n, we let (k f ) j := ( f ) j − ( f ) j−1 , j ∈ {k + 1, . . . , n}. The class F is now

182

S. van de Geer and P. Hinz

F := { f ∈ Rn : f = f N + f b : f N ∈ N , b ∈ Rn−k } where N is the space of polynomials in x1 , . . . , xn of degree at most k − 1 and fb =

n

bjψj,

j=k+1

with b j = n k−1 (k f ) j , j ∈ {k + 1, . . . , n}. The vectors  := {ψ j }nj=1 form the falling factorial basis for equidistant design, see [26, 31]. They are defined as follows. Let • for k = 1, φ1, j (i) = 1{i≥ j} , i, j ∈ {1, . . . , n}, • and for k ≥ 2, φ j, j , j ∈ {1 . . . , k − 1} φk, j =  . l≥ j φk−1,l /n, j ∈ {k, . . . , n} Using the same arguments as for the case k = 2 and the bounds from [22], one sees 2k−1 2k−1 that δ N () ≤ k 2 N − 2 . Corollary 2 Let Y = f 0 +  where  has the n-dimensional standard normal distribution. Consider the following Lasso estimator for the trend filtering problem:   k−1 k ˆ f := arg minn Y − f  Q n + 2λn  f 1 . f ∈R

Then an application of Theorem 2 with W = probability at least 1 − exp[−u] − exp[−v]

2 2k−1

gives that for all u, v > 0, with

 fˆ − f 0 2Q n + λn k−1 k fˆ1 2



≤f −

f 0 2Q n

+

2(2 Aλ0 (u)) 2k−1 2 2k−1

+

2(1 + k + 2v) + 3λn k−1 k f ∗ 1 n

nλ  √ 2k−1 where A = k 2 and λ0 (u) = 2 log(2n) + 2u/ n. Thus with u = v = log n (say) 2k k and λ  n − 2k+1 log 2k+1 n one arrives at: with probability at least 1 − 2/n  fˆ − f 0 2Q n + λn k−1 k fˆ1 ≤  f ∗ − f 0 2Q n + 3λn k−1 k f ∗ 1 + η where η  λ. Thus, as in [22], one obtains up to log-terms the minimax rate over { f 0 : n k−1 k f 0 1 ≤ 1} ([12]).

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

183

8.3 Higher-Dimensional Extensions In this section we consider, for i = 1, . . . , n, variables xi = (ξi,1 , . . . , ξi,d ) ∈ p {1, . . . , n 0 }d /n 0 and let n = n d0 . Let {ψ j } j=1 be real-valued functions on {1, . . . , n 0 }/n 0 . Let for j = ( j1 , . . . , jd ) ∈ {1, . . . , p}d d

ψj (xi ) =

ψ jt (ξi,t ). t=1

Then functions of the form f =



bj ψj

j∈{1,..., p}d

with ψ j as in Sect. 8.2 correspond to those used in the context of multiplicative adaptive regression splines (MARS), see [14]. Let us simplify the situation to d = 3. Let ψu , ψv and ψw be functions in a class  of functions on {1, . . . , n 0 }/n 0 and ψu ψv ψw : {1, . . . , n 0 }3 /n 0 → R ψu ψv ψw (ξ1 , ξ2 , ξ3 ) := ψu (ξ1 )ψv (ξ2 )ψw (ξ3 ).

Lemma 5 Suppose that supψ∈ ψ Q n ≤ 1 and that for some positive constants A and W δ N () ≤ AN −1/W ∀ N ∈ N. Then for N ≥ 2  1   2 √ 6 (2 A W )3 (1+ 21 + 13 )W δ N ψu ψv ψw : ψu ∈ , ψv ∈ , ψw ∈  ≤ 10 . N −1 Proof See Sect. 10.5. In [23] the result of Lemma 5 is extended to d dimensions:  δN {

d

δ N () ≤ AN −1/W ∀ N ∈ N ⇒ 1 − 1 1 ψ jt }( j1 ,... jd )∈{1,..., p}d ≤ B(N − 1) (1+ 2 +···+ d )W ∀ N ∈ N, N ≥ 2,

t=1

where B is a constant depending on d, A and W . The paper [23] moreover details the application to the Lasso.

184

S. van de Geer and P. Hinz

8.4 Entropy of the Class of Distribution Functions We consider the situation of Sect. 8.3, with d = 2 and with, for j = 1, . . . , n 0 , ψ j (i/n 0 ) = l(i ≥ j), i = 1, . . . , n 0 . 0 we have Then for  = {ψ j }nj=1

δ N () ≤ N −1/2 ∀ N ∈ N. Using the notation and the argument of the previous section we see that  δ N {ψ j ψk : ψ j ∈ , ψk ∈ } ≤ B(N − 1)1/3 , ∀ N ∈ N N ≥ 2. This leads by Theorem 5 to an δ-entropy bound for conv({ψ j ψk : ψ j ∈ , ψk ∈ }) of order δ −6/5 modulo log-terms. Note now that {ψ j ψk : ψ j ∈ , ψk ∈ } is the collection of half-intervals ψ j1 , j2 (i 1 /n 0 , i 2 /n 0 ) = l{i 1 ≥ j1 , i 2 ≥ j2 }, (i 1 , i 2 ) ∈ {1, . . . n 0 }2 , ( j1 , j2 ) ∈ {1, . . . n 0 }2 . Thus conv({ψ j ψk : ψ j ∈ , ψk ∈ }) is the collection of discrete distribution functions of measures on {1, · · · , n 0 }2 /n 0 . In [6] however, it is shown that for all dimensions d, the collection of distribution functions on [0, 1]d has δ-entropy for Lebesgue measure of order 1/δ modulo log-terms. By Theorem 2.4 in [25], for the class of half-intervals in Rd  d := ψv : [0, 1] → {0, 1} : ψv (x) := l{x ≥ v}, v ∈ [0, 1] , 

d

the -approximation numbers for Lebesgue measure l N (d ) are modulo log-terms of order N −1/2 . We conjecture therefore that our estimate for the δ-approximation number that is derived from Sect. 8.3 is not tight in this case. The only discrepancy between our situation and the one in [6] is that we consider a discrete situation, whereas [6] assumes that all the distributions are absolutely continuous. However, we expect that this difference has no impact on the order of magnitude of approximation numbers or entropies. The paper [6] applies small ball estimates, using for upperbounding the entropy a small ball estimate of [13]. It is a question for future research whether the small ball estimates can be used to arrive at an explicit construction with tight δ-approximation numbers.

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

185

9 Conclusion With this paper, we hope to have presented some insight into the connection between projection arguments to establish bounds for the Lasso, and projection arguments to obtain entropy bounds as established in the literature on approximation theory. This connection may be of interest to statisticians working on Lasso problems and wondering whether projection arguments give tight bounds. It turns out they do. Some further research may be in the direction of the using the small ball estimates in statistical problems. Moreover, with these one may arrive at explicit constructions of N -term approximations. In the special case studied in [22] the active set happens to be (a subset of) these N -terms and this has implications for adaptivity of the Lasso. Does this occur in other examples as well?

10 Technical Proofs 10.1 Proof of Lemma 1 We first present an auxiliary lemma. Lemma 6 Let Z be a positive random variable satisfying for some constant μ > 0 IP(Z ≥ μ +



2t) ≤ e−t ∀ t > 0.

Then E I Z 2 ≤ 2μ2 + 4. Similarly, if IP(Z ≥

 μ2 + 2t) ≤ e−t ∀ t > 0,

then E I Z 2 ≤ μ2 + 2. Proof of Lemma 6 The first inequality of the lemma follows from

186

S. van de Geer and P. Hinz



 ∞ IP(Z 2 ≥ t)dt ≤ 2μ2 + IP(Z 2 ≥ t)dt 0 2μ2  ∞ = 2μ2 + IP(Z 2 ≥ 2μ2 + t)dt 0 ∞  2 ≤ 2μ + IP(Z ≥ μ + t/2)dt 0 ∞ t ≤ 2μ2 + e− 4 dt = 2μ2 + 4. ∞

E I Z2 =

0

The second inequality follows in the same way: 



E I Z2 =

 IP(Z 2 ≥ t)dt ≤ μ2 +

0

∞ μ2

IP(Z 2 ≥ t)dt ≤ μ2 + 2.

p

Proof of Lemma 1 With  := {ψ j } j=1 , write



√   ψV ⊥ , j / n Z j := , j = 1, . . . , p. ψV ⊥ , j n Then Z j has a standard Gaussian distribution for all j ∈ {1, . . . , p}. Thus by the union bound   IP max |Z j | ≥ 2 log(2 p) + 2t ≤ exp[−t], ∀ t. 1≤ j≤ p

Hence E I max |Z j |2 ≤ 2 log(2 p) + 2 1≤ j≤ p

by Lemma 6. But then l 2 (V, ) ≤ δ 2 (V, )IE max |Z j |2 ≤ δ 2 (V, )(2 log(2 p) + 2). 1≤ j≤ p

Since V is an arbitrary linear space as long as V ⊃ N and dim(V) = N + dim(N ) we see that for all such V l N2 () ≤ δ 2 (V, )(2 log(2 p) + 2). But then also l N2 () ≤ δ 2N ()(2 log(2 p) + 2). Let now V := V N be a best N -term l-approximation as in Definition 2. Then clearly l N () ≥ δ(V, ) ≥ δ N ().

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

187

as the expectation of the maximum never smaller than the maximum of the expectation, and since δ N () is the minimum δ-approximation. Define   √  ⊥ μ := E I max | ψV |/ n /δ(V, ). ψ∈

Note that μ ≤ 1/ico(). We have by concentration of measure, for all u > 0, with probability at least 1 − exp[−u]   √ √ max |  ψV ⊥ |/ n /δ(V, ) ≤ μ + 2u ψ∈

≤ 1/ico() +



2u.

This proves the second statement of the lemma. We turn to the last statement, which consists of two inequalities. The right inequality ico() ≤ 1 is clear as—again—the expectation of the maximum never smaller than the maximum of the expectation. For the left inequality use that μ≤E I max |Z j | ≤ 1≤ j≤ p



2 log(2 p).

From Lemma 6 we see that l N2 () ≤ 2μ2 δ 2N (V, ) + 4δ 2N (V, ) ≤ 4(log(2 p) + 1)δ 2N (V, ) so that

1 δ N (V, ) ≥  . 2 log(2 p) + 1 N ∈N l N ()

ico() = sup



10.2 Proof of Lemma 2 Proof of Lemma 2 Let ∂b1 := {z ∈ R p : z∞ ≤ 1, z j = sign(b j ) if b j = 0, j = 1, . . . p} be the subdifferential of b → b1 , b ∈ R p . Then by the KKT ˆ 1, conditions, for zˆ ∈ ∂β   ( fˆ − f 0 )/n =   /n − λˆz . ) , Moreover, writing N =: span(),  = {φ j }dim(N j=1

 ( fˆ − f 0 )/n =  /n.

188

S. van de Geer and P. Hinz

Thus ˆ 1 + λβ ∗ zˆ ( fˆ − f ∗ ) ( fˆ − f 0 )/n = ( fˆ − f ∗ ) /n − λβ ˆ 1 + λβ ∗ 1 . ≤ ( fˆ − f ∗ ) /n − λβ The proof is completed by noting that  f ∗ − f 0 2Q n =  fˆ − f ∗ 2Q n +  fˆ − f 0 2Q n − 2( fˆ − f ∗ ) ( fˆ − f 0 )/n. 

10.3 Proof of Theorem 5 The following lemma is known as Maurey’s Lemma (see [9]). We present a proof for completeness. Lemma 7 (Maurey’s Lemma) Let u := supψ∈ ψ Q n . For all vectors β = p p p (β1 , . . . , β p ) ∈ R+ with j=1 β j = 1 there exists a vector b ∈ R with entries  p in {0, 1, . . . , m}/m and with j=1 b j = 1 (thus with at most m non-zero elements) such that !2 ! p ! !

! (β j − b j )ψ j ! ≤ u 2 /m. ! ! Qn

j=1

Proof of Lemma 7 Let (M1 , . . . , M p ) be Multinomial(m, β): m! m β m1 · · · β p p , m1! · · · m p ! 1 p

(m 1 , . . . , m p ) ∈ {0, 1, . . . , m} p , m j = m.

IP(M1 = m 1 , . . . , M p = m p ) =

j=1

Define b j := M j /m, j = 1, . . . , p. Then ! p !2 !2 ! p p

!

! ! !

2 ! ! ! E I ! (b j − β j )ψ j ! = β j ψ j  Q n /m − ! βjψj! ! /m j=1

Qn

j=1



j=1

Qn

p



β j ψ j 2Q n /m ≤ u 2 /m.

j=1



The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

189

Lemma of vectors b ∈ R p with entries in {1, . . . , m}/m  p 8 Let m ∈ N. The number m and j=1 b j = 1 is at most (e p) . Proof of Lemma 8 The number of such vectors is at most  m+ p−1 . m Since by Stirling’s formula, m! ≥ (m/e)m , this can be bounded by em ( m+mp−1 )m ≤  (e p)m . The following lemma gives a bound on the entropy of conv() without exploiting possible structure in . Lemma 9 Let u := supψ∈ ψ Q n . We have for all δ > 0  H (δ, conv()) ≤

 u2 (1 + log p) δ2

Proof of Lemma 9 Taking m = u 2 /δ 2  we see from Lemmas 7 and 8 that we can cover conv() by at most1 p m balls with radius δ. The logarithm of this number is  m log(e p) ≤

 u2 (1 + log p). δ2

which thus is a bound for the δ-entropy of the convex hull conv().



Lemma 10 For all R > 0 and δ > 0,  H (δ, F(R)) ≤ min N ∈N





4R + δ N + dim(N ) log δ



 +

2δ N () δ

2

 (1 + log p) .

Proof of Lemma 10 Let δ > 0 be arbitrary, V ⊃ N be a best N -term δ-approximation of the functions in , and and u := δ N (). Write  FV (R) :=

   f V : f ∈ F(R) , FV ⊥ (R) := f V ⊥ : f ∈ F(R) .

Since  f V  Q n ≤  f  Q n ≤ R for all f ∈ F(R), we know that for all δ > 0 (using for example Lemma 14.17 in [8]) 2R + δ . H (δ, FV (R)) ≤ (N + dim(N )) log δ 

1

In fact, the proof of Lemma 8 has a tighter bound and this can be exploited to get rid of log-factors, but then the constants get complicated (and large).

190

S. van de Geer and P. Hinz

Since FV ⊥ (R) ⊂ conv(V ⊥ ) it follows from Lemma 9 that for all δ > 0  H (δ, FV ⊥ (R)) ≤

 u2 (1 + log p) δ2

It follows that 

2R + δ H (2δ, F(R)) ≤ (N + dim(N )) log δ



 u2 + 2 (1 + log p) δ 

 Proof of Theorem 5 Consider the mapping  N → N log

4R + δ δ



+ N −2/W



2A δ

2 (1 + log p).

It consists of a function increasing in N and one decreasing in N . Trading these off leads to choosing W   (2 A/δ)2 (1 + log p) 2+W . N∗ := log((4R + δ)/δ) The application of Lemma 10 thus yields the result. 

10.4 Proof of Lemma 3 Let for δ > 0, let N (δ, S d−1 ,  · 2 ) be the δ-covering number of (S d−1 ,  · 2 ), i.e. the number of balls with radius δ for the Euclidian metric, necessary to cover S d−1 . Lemma 11 We have N (δ, S d−1 ,  · 2 ) ≤ 2d

 d−1 3 , ∀ δ > 0. δ

Proof of Lemma 11 Let for ρ > 0, B(ρ) := {w ∈ Rd : w2 ≤ ρ} the d-dimensional ball with radius ρ and center at the origin. We have volume(B(ρ)) = Cd ρ d , where the constant Cd (= 2π d/2 /(d( d2 ))) plays no role in the final result. Let for some 0 < δ ≤ 1, the collection {w j } Nj=1 ⊂ S d−1 be a maximal δ-packing set of S d−1 . Thus w j − wk 2 > δ for all j = k and min j=1,...,N w − w j 2 ≤ δ for all w ∈ S d−1 \{w j } Nj=1 . For j = 1, . . . , N , let B j be the ball with radius δ/2 and center

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

191

w j . Then B1 , . . . , B N are disjoint and so volume(∪ Nj=1 B j ) ≥ N Cd (δ/2)d . On the other hand, for any j and for w ∈ B j we know that w2 ≤ w j 2 + w − w j 2 ≤ 1 + δ/2 and w2 ≥ w j 2 − w − w j 2 ≥ 1 − δ/2. Thus ˚ − δ/2) ∪ Nj=1 B j ⊂ B(1 + δ/2)\B(1 ˚ − δ/2) is the interior of B(1 − δ/2). Therefore where B(1 ˚ − δ/2)) volume(∪ Nj=1 B j ) ≤ volume(B(1 + δ/2)) − volume(B(1  d d = Cd (1 + δ/2) − (1 − δ/2) ≤ d(3/2)d−1 Cd δ. Hence N Cd (δ/2)d ≤ d(3/2)d−1 Cd δ or N ≤ 2d3d−1 /δ d−1 .  Proof of Lemma 3 If c ∈ / [0, 1] we have (wxi − c)+ = 0, so we only need to look at values c ∈ [0, 1]. Let δ > 0 be arbitrary. We can cover the interval [0, 1] by 2/δ intervals of length at most δ. By Lemma 11 N (3δ, S

d−1

 d−1 1 ,  · 2 ) ≤ 2d . δ

Now use that |(xw − c)+ − (x w˜ − c) ˜ + | ≤ w − w ˜ 2 + |c − c|. ˜ It follows that for all δ > 0  d−1 2 1 = 4dδ −d . N (4δ, ) ≤ 2d δ δ But then

N (δ, ) ≤ 4d+1 dδ −d . 

192

S. van de Geer and P. Hinz

10.5 Proof of Lemma 5 Proof of Lemma 5 Let 2 := {φ1 , . . . , φ N2 } be given, let 1 ⊂ 2 , |1 | = N1 and 0 ⊂ 1 , |0 | = N0 . Let Vi := span(i ), i = 1, 2, 3. Consider a given ψu ψv ψw . Let ψu i be the projection of ψu on Vi , ψvi be the projection of ψv on Vi , and ψwi be the projection of ψw on Vi , i = 1, 2, 3. Define ψ˜ := −ψu 0 ψv1 ψw1 − ψu 1 ψv0 ψw1 − ψu 1 ψv1 ψw0 + ψu 1 ψv0 ψw2 + ψu 2 ψv0 ψw1 + ψu 0 ψv1 ψw2 + ψu 0 ψv2 ψw1 + ψu 1 ψv2 ψw0 + ψu 2 ψv1 ψw0 − ψu 0 ψv0 ψw2 − ψu 0 ψv2 ψw0 − ψu 2 ψv0 ψw0 + ψu 0 ψv0 ψw0 Consider in this expansion for example ψu 0 ψv1 ψw2 . This function lies in a linear space, V{0,1,2} say, of dimension N0 N1 N2 , see Lemma 12. All the other functions ψu x ψv y ψwz with   (x, y, z) ∈ {0, 1, 2}3 ∪ {x + y + z ≤ 3} ∪ {min(x, y, z) = 0} , ˜ also lie in a space V{x,y,z} with dimension at most N0 N1 N2 . In fact, making up ψ, all lie in the union of V{x,y,z} with {x, y, z} running over all 3! = 6 permutations of {0, 1, 2}. Thus we may write ψu ψv ψw = ψ˜ + remainder, where ψ˜ ∈ V with V as space of dimension at most 6N0 N1 N2 and remainder = (ψu − ψu 0 )(ψv − ψv0 )(ψw − ψw0 ) + ψu 0 (ψv − ψv1 )(ψw − ψw1 ) + ψu 0 (ψv1 − ψv0 )(ψw − ψw2 ) + ψu 0 (ψv − ψv2 )ψw1 + (ψu − ψu 1 )ψv0 (ψw − ψw1 ) + ψu 1 ψv0 (ψw − ψw2 ) + (ψu − ψu 2 )ψv0 (ψw1 − ψw0 ) + (ψu − ψu 1 )(ψv − ψv1 )ψw0 + (ψu 1 − ψu 0 )(ψv − ψv2 )ψw0 + (ψu − ψu 2 )ψv1 ψw0

Since ψ Q n ≤ 1 for all ψ ∈  also ψu i  Q n ≤ 1, ψvi  Q n ≤ 1 and ψwi  Q n ≤ 2 2 1, i = 1, 2, 3. We assume that ψu − ψu i 2Q n ≤ δ 3−i , ψv − ψvi 2Q n ≤ δ 3−i , and 2

ψw − ψwi 2Q n ≤ δ 3−i , i = 1, 2, 3. Then ψu 1 − ψu 0 2Q n = ψu − ψu 0 2Q − ψu −

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

193

ψu 1 2Q ≤ δ 2/3 − δ ≤ δ 2/3 ≤ 1 and similarly ψv1 − ψv0 2Q ≤ δ 2/3 ≤ 1 and ψw1 − ψw0 2Q ≤ δ 2/3 ≤ 1. The remainder consists of 10 terms which are all orthogonal to each other.Therefore, using that Q n is a product measure remainder2Q n ≤ (δ 2/3 )3 + 3(δ 1 )2 + 3(δ 2 )1 + 3δ 2/3 δ 2 ≤ 10δ 2 . The assumption

δ N () ≤ AN −1/W

implies that for all u > 0

δ N () ≤ u

for a value of N ∈ N satisfying 2(A/u)W − 1 ≤ N ≤ 2(A/u)W . For i ∈ {0, 1, 2}, 1 we let N i be such a value with u = δ 3−i , i = 1, 2, 3. We let V i be the corresponding best N i -term δ-approximation, i = 1, 2, 3. Then we take V0 = V 0 , V1 as the direct product of V0 and V 1 and V2 the direct product of V 2 and V1 . Then N0 = N 0 ≤ 2 A W δ −W/3 , N1 ≤ N0 + N 1 ≤ 2 A W δ −W/3 + 2 A W δ −W/2 ≤ 4 A W δ −W/2 and N2 ≤ N1 + 2 A W δ −W ≤ 6A W δ −W . Therefore 6N0 N1 N2 ≤ 62 (2 A W )3 δ −(1+ 2 + 3 )W . 1

1

. Lemma 12 Consider linear spaces V0 ⊂ V1 ⊂ V2 with dimension Ni := dim(Vi ), i ∈ {0, 1, 2}. We have    = N0 N1 N2 . dim span ψu 0 ψv1 ψw2 : ψu 0 ∈ V0 , ψv1 ∈ V1 , ψw2 ∈ V2 Proof of Lemma 12 Suppose Vi = span(i ), i ∈ {0, 1, 2}. Without loss of generalelements of 2 and that 0 consists ity, we may assume that 1 consists of the first N1  N0 1 ai φi , ψv1 = Nj=1 b j φ j , and of the first N0 elements of 1 . Then for ψu 0 = i=1  N2 ψw2 = k=1 ck φk , it holds that

ψu 0 ψv1 ψw2 =



N0 i=1

ai φi



N1 j=1

bjφj



N2 k=1

ck φk

=

N0

N1

N2

ai b j ck φi φ j φk .

i=1 j=1 k=1

This is a linear combination of N0 N1 N2 basis functions.



Acknowledgements We acknowledge support for this project from the the Swiss National Science Foundation (SNF grant 200020_191960). We thank the Editor and Reviewer for their very helpful remarks.

194

S. van de Geer and P. Hinz

References 1. Artstein, V., Milman, S., Szarek, Tomczak-Jaegermann, N.: On convexified packing and entropy duality. Geomet. Funct. Anal. GAFA 14(5), 1134–1141 (2004) 2. Babenko, K.I.: Theoretical foundations and construction of numerical algorithms for problems in mathematical physics (1979) 3. Ball, K., Pajor, A.: The entropy of convex bodies with “few” extreme points. London Math. Soc. Lecture Note Series 158, 25–32 (1990) 4. Barron, A.R.: Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14(1), 115–133 (1994) 5. Birman, M.Š, Solomjak, M.Z.: Piecewise-polynomial approximations of functions of the classes W pα . Math. USSR-Sbornik 2, 295–317 (1967) 6. Blei, R., Gao, F., Li, W.: Metric entropy of high dimensional distributions. Proc. Am. Math. Soc. 135(12), 4009–4018 (2007) 7. Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1), 8–45 (2019) 8. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods. Theory and Applications. Springer, Berlin (2011) 9. Carl, B.: Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in banach spaces. In: Annales de l’Institut Fourier, vol. 35, 79–118 (1985) 10. Dalalyan, A., Hebiri, M., Lederer, J.: On the prediction performance of the Lasso. Bernoulli 23(1), 552–581 (2017) 11. DeVore, R.A.: Nonlinear Approximation, vol. 7. Cambridge University Press (1998) 12. Donoho, D.: Johnstone, Iain: Minimax estimation via wavelet shrinkage. Ann. Stat. 26(3), 879–921 (1998) 13. Dunker, T., Linde, W., Kühn, T., Lifshits, M.A.: Metric entropy of integration operators and small ball probabilities for the Brownian sheet. J. Approx. Theory 101(1), 63–77 (1999) 14. Friedman, J.H.: Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67 (1991) 15. Gao, F.: Entropy of absolute convex hulls in Hilbert spaces. Bull. Lond. Math. Soc. 36(4), 460–468 (2004) 16. Gao, F., Li, W., Wellner, J.A.: How many Laplace transforms of probability measures are there? Proc. Am. Math. Soc. 138(12), 4331–4344 (2010) 17. Giraud, C.: Introduction to High-Dimensional Statistics, vol. 138. CRC Press (2014) 18. Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011) 19. Kuelbs, J., Li, W.V.: Metric entropy and the small ball problem for gaussian measures. J. Funct. Anal. 116(1), 133–157 (1993) 20. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 1302–1338 (2000) 21. Li, W.V., Linde, W.: Approximation, metric entropy and small ball estimates for gaussian measures. Ann. Probab. 27(3), 1556–1578 (1999) 22. Ortelli, F., van de Geer, S.: Prediction bounds for (higher order) total variation regularized least squares. Ann, Stat (2021) 23. Ortelli, F., van de Geer, S.: Tensor denoising with trend filtering. Math. Statist. Learn. 4, 87–142 (2022) 24. Parhi, R., Nowak, R.D.: Minimum “norm” neural networks are splines. In: Proceedings of Machine Learning Research (2020) 25. Pisier, G.: The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press (1989) 26. Tibshirani, R.: Divided differences, falling factorials, and discrete splines: Another look at trend filtering and related problems (2020). arXiv preprint arXiv:2003.03886 27. van de Geer, S.: Least squares estimation with complexity penalties. Math. Methods Statist. 10(3), 355 (2001)

The Lasso with Structured Design and Entropy of (Absolute) Convex Hulls

195

28. van de Geer, S.: On non-asymptotic bounds for estimation in generalized linear models with highly correlated design. In: Asymptotics: Particles, Processes and Inverse Problems, pp. 121– 134. Institute of Mathematical Statistics (2007) 29. van de Geer, S.: Estimation and Testing Under Sparsity, vol. 2159. Springer (2016) 30. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York (1996) 31. Wang, Y.-X., Smola, A., Tibshirani, R.: The falling factorial basis and its statistical applications. In: International Conference on Machine Learning, pp. 730–738 (2014)

Local Linear Smoothing in Additive Models as Data Projection Munir Hiabu, Enno Mammen, and Joseph T. Meyer

Abstract We discuss local linear smooth backfitting for additive nonparametric models. This procedure is well known for achieving optimal convergence rates under appropriate smoothness conditions. In particular, it allows for the estimation of each component of an additive model with the same asymptotic accuracy as if the other components were known. The asymptotic discussion of local linear smooth backfitting is rather complex because typically an overwhelming notation is required for a detailed discussion. In this paper we interpret the local linear smooth backfitting estimator as a projection of the data onto a linear space with a suitably chosen semi-norm. This approach simplifies both the mathematical discussion as well as the intuitive understanding of properties of this version of smooth backfitting. Keywords Additive models · Local linear estimation · Backfitting · Data projection · Kernel smoothing

1 Introduction In this paper we consider local linear smoothing in an additive model E[Yi |X i ] = m 0 + m 1 (X i1 ) + · · · + m d (X id ),

(1)

where (Yi , X i ) (i = 1, . . . , n) are iid observations with values in R × X for a bounded connected open subset X ⊆ Rd . Here, m j ( j = 1, . . . , d) are some smooth functions which we aim to estimate and m 0 ∈ R. Below, we will add norming conditions on m 0 , . . . , m d such that they are uniquely defined given the sum. In [11] a local linear smooth backfitting estimator based on smoothing kernels was proposed M. Hiabu Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen O, Denmark E. Mammen (B) · J. T. Meyer Heidelberg University, Institute for Applied Mathematics, INF 205, 69120 Heidelberg, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_5

197

198

M. Hiabu et al.

for the additive functions m j . There, it was shown that their version of a local linear estimator mˆ j of the function m j has the same pointwise asymptotic variance and bias as a classical local linear estimator in the oracle model, where one observes i.i.d. observations (Yi∗ , X i j ) with E[Yi∗ |X i j ] = m j (X i j ), Yi∗ = Yi −



m k (X ik ).

k= j

In this respect the local linear estimator differs from other smoothing methods where the asymptotic bias of the estimator of the function m j depends on the shape of the functions m k for k = j. An example for an estimator with this disadvantageous bias property is the local constant smooth backfitting estimator which is based on a backfitting implementation of one-dimensional Nadaraya-Watson estimators. It is also the case for other smoothing estimators as regression splines, smoothing splines and orthogonal series estimators, where in addition also no closed form expression for the asymptotic bias is available. Asymptotic properties of local linear smoothing simplify the choice of bandwidths as well as the statistical interpretation of the estimators mˆ j . These aspects have made local linear smooth backfitting a preferred choice for estimation in additive models. Deriving asymptotic theory for local linear smooth backfitting is typically complicated by an overloaded notation that is required for detailed proofs. In this note we will use that the local linear smooth backfitting estimator has a nice geometric interpretation. This simplifies mathematical arguments and allows for a more intuitive derivation of asymptotic properties. In particular, we will see that the estimator can be characterized as a solution of an empirical integral equation of the second kind as is the case for local constant smooth backfitting, see [16]. Our main point is that the local linear estimator can be seen as an orthogonal projection of the response vector Y = (Y )i=1,...,n onto a subspace of a suitably chosen linear space. A similar point of view is taken in [12] for a related construction where it was also shown that regression splines, smoothing splines and orthogonal series estimators can be interpreted as projection of the data in an appropriately chosen Hilbert space. Whereas this interpretation is rather straight forward for these classes of estimators it is not immediately clear that it also applies for kernel smoothing and local polynomial smoothing, see [12]. In this paper we will introduce a new and simple view of local linear smoothing as data projection. In the next section we will define the required spaces together with a corresponding semi-norm. We will also introduce a new algorithm motivated by our interpretation of local linear smooth backfitting. The algorithm will be discussed in Sect. 3. In Sect. 4 we will see that our geometric point of view allows for simplified arguments for the asymptotic study of properties of the local linear smooth backfitting estimator. The additive model (1) was first introduced in [2] and enjoys great popularity for two main reasons. The first is estimation performance. While not being as restrictive as a linear model, in contrast to a fully flexible model, it is not subject to the curse of dimensionality. Assuming that E[Yi |X i = x] is twice continuously differen-

Local Linear Smoothing in Additive Models as Data Projection

199

tiable, the optimal rate of convergence of an estimator of E[Yi |X i = x] is n −2/(d+4) if no further structural assumptions are made, see [18]. This means the rate deteriorates exponentially in the dimension of the covariates d. Under the additive model assumption (1) and assuming that each function m j , j = 1, . . . , d is twice continuously differentiable, the optimal rate of convergence is n −2/5 . The second reason is interpretability. In many applications it is desirable to understand the relationship between predictors and the response. Even if the goal is prediction only, understanding this relationship may help detect systematic biases in the estimator, so that out of sample performance can be improved or adjusted for. While it is almost impossible to grasp the global structure of a multivariate function m in general, the additive structure (1) allows for visualisation of each of the univariate functions, providing a comprehensible connection between predictors and the response. Though the setting considered in this paper is fairly simple, it can be seen as a baseline for more complicated settings.One main drawback is the additive structure which cannot account for interactions between covariates. It is assuring however that even if the true model is not additive, the smooth backfitting estimator is still defined as the closest additive approximation. This will be shown in the next section. If the true regression function is far away from an additive structure, then a more complex structure may be preferable. This could be done by adding higher-dimensional covariates, products of univariate functions or considering a generalized additive model. For testing procedures that compare such specifications, see also [6, 15]. Besides such structural assumptions, other directions the ideas in this paper can be extended to are the consideration of time-series data or high dimensional settings. Settings using more complicated responses like survival times, densities or other functional data may also be approached. Some of these cases have been considered, e.g., in [3–5, 7, 8, 13–16, 19]. We hope that a better understanding of local linear estimation in this simple setting will help advance theory and methodology for more complicated settings in the future.

2 Local Linear Smoothing In Additive Models The local linear smooth backfitting estimator mˆ = (mˆ 0 , mˆ 1 , . . . , mˆ d , mˆ (1) ˆ (1) 1 ,...,m d ) is defined as the minimizer of the criterion S( f 0 , . . . , f d , f 1(1) , . . . , f d(1) ) ⎫2 ⎧ d d n  ⎨ ⎬    f j (x j ) − f j(1) (x j )(X i j − x j ) = n −1 Yi − f 0 − ⎭ X ⎩ i=1

×K hX i (X i

j=1

− x)dx

j=1

200

M. Hiabu et al.

under the constraint n   X

i=1

f j (x j )K hX i (X i − x)dx = 0

(2)

for j = 1, . . . , d. The minimization runs over all values f 0 ∈ R and all functions f j , f j(1) : X j → R with X j = {u ∈ R : there exists an x ∈ X with x j = u}. Under the constraint (2) and some conditions introduced in Sect. 3, the minimizer is unique. For j = 1, . . . , d the local linear estimator of m j is defined by mˆ j . In the definition of S the function K hu (·) is a boundary corrected product kernel, i.e.,

d j=1

K hu (u − x) =

d

κ

j=1 κ

X



u j −x j hj

u j −v j hj



. dv j

Here, h = (h 1 , . . . , h d ) is a bandwidth vector with h 1 , . . . , h d > 0 and κ : R → R is some given univariate density function, i.e., κ(t) ≥ 0 and κ(t)dt = 1. We use the variable u twice in the notation because away from the boundary of X , the kernel K hu (u − x) only depends on u − x. It is worth emphasizing that the empirical minimization criterion S depends on a choice of a kernel κ and a smoothing bandwidth h. While the choice of κ is not of great importance, see similar to e.g. [17, Sect. 3.3.2], the quality of estimation heavily depends on an appropriate choice of the smoothing parameter h. We will not discuss the choice of a (data-driven) bandwidth in this paper, but we note that the asymptotic properties of the local linear smoothing estimator do simplify the choice of bandwidths compared to other estimators. The reason is that the asymptotic bias of one additive component does not depend on the shape of the other components and on the bandwidths used for the other components. We now argue that the local linear smooth backfitting estimator can be interpreted as an empirical projection of the data onto a space of additive functions. We introduce the linear space  H = ( f i, j )i=1,...,n;

j=0,...,d |

 f i, j : X → R, f n < ∞

with inner product f, g n = n −1

⎧ n  ⎨  i=1



X



f i,0 (x) +

× g i,0 (x) +

j=1 d  k=1

 and norm f n = f, f n .

d 

f i, j (x)(X i j − x j )

⎫ ⎬ ⎭



g i,k (x)(X i j − x j ) K hX i (X i − x)dx

Local Linear Smoothing in Additive Models as Data Projection

201

We identify the response Y = (Yi )i=1,...,n as an element of H via Y i,0 ≡ Yi and Y ≡ 0 for j ≥ 1. We will later assume that the functions m j are differentiable. We identify the regression function i, j

m : X → R, m(x) = m 0 + m 1 (x1 ) + · · · + m d (xd )  as an element of H via m i,0 (x) = m 0 + j m j (x j ) and m i, j = ∂m j (x j )/∂x j for j ≥ 1. Note that the components of m ∈ H do not depend on i. We define the following subspaces of H: H f ull = { f ∈ H| the components of f do not depend on i} ,  (1) Hadd = f ∈ H f ull | f i,0 (x) = f 0 + f 1 (x1 ) + · · · + f d (xd ), f i, j (x) = f j (x j ) for (1)

some f 0 ∈ R and some univariate functions f j , f j ⎫ n  ⎬  X f j (x j )K h i (X i − x)dx = 0 . d with ⎭ X

: X j → R, j = 1, . . . ,

i=1

For a function f ∈ Hadd we write f 0 ∈ R and f j , f j(1) : X j → R for j = 1, . . . , d for the constant and functions that define f . In the next section we will state conditions under which the constant f 0 and functions f j , f j(1) are unique given any f ∈ Hadd . By a slight abuse of notation we also write f j for the element of H given by f i,0 (x) = f j (x j ) and f i,k (x) ≡ 0 for k = 1, . . . , d. We also write f j(1) for the element of H with f i,k (x) ≡ 0 for k = j and f i, j (x) = f j(1) (x j ). Furthermore, we define f j+d := f j(1) for j = 1, . . . , d for both interpretations. Thus, for f ∈ Hadd we have f = f 0 + · · · + f 2d .

(3)

Recall that the linear smooth backfitting estimator ˆ (1) m  = (mˆ 0 , mˆ 1 , . . . , mˆ d , mˆ (1) 1 ,...,m d ) is defined as the minimizer of the criterion S under the constraint (2). By setting mˆ i,0 (x) = mˆ 0 + dj=1 mˆ j (x j ) and m i, j (x) = mˆ (1) j (x j ) for j ≥ 1 it can easily be seen that m  = arg min Y − f n . f ∈Hadd

(4)

In the next section we will state conditions under which the minimization has a unique solution. Equation (4) provides a geometric interpretation of local linear smooth backfitting. The local linear smooth backfitting estimator is an orthogonal projection of the response vector Y onto the linear subspace Hadd ⊆ H. We will make repeated use of this fact in this paper.

202

M. Hiabu et al.

We now introduce the following subspaces of H:   H0 = f ∈ H| f i,0 (x) ≡ c for some c ∈ R, f i, j (x) ≡ 0 for j = 0 ,  Hk = f ∈ H| f i, j (x) ≡ 0 for j = 0, and f i,0 (x) = f k (xk ) for some univariate  n   Xi f k (xk )K h (X i − x)dx = 0 , function f k : Xk → R with Hk  =

i=1



X

f ∈ H| f i, j (x) ≡ 0 for j = k, f i,k (x) = f k(1) (xk ) for some univariate  function f k(1) : Xk → R

 for k = 1, . . . , d and k  := k + d. Using these definitions we have Hadd = 2d j=0 H j with H j ∩ Hk = {0}, j = k. In particular, the functions f j in (3) are unique elements in H j , j = 0, . . . , 2d. For k = 0, . . . , 2d we denote the orthogonal projection of H onto the space Hk by Pk . Note that for k = 0, . . . , d the operators Pk set all components of an element f = ( f i, j )i=1,...,n; j=0,...,d ∈ H to zero except the components with indices (i, 0), i = 1, . . . , n. Furthermore, for k = d + 1, . . . , 2d, only components with index (i, k − d) are not set to zero. Because H0 is orthogonal to Hk for k = 1, . . . , d, the orthogonal projection onto the space Hk is given by Pk = Pk − P0 where Pk is the projection onto H0 + Hk . In Appendix 1 we will state explicit formulas for the orthogonal projection operators. The operators Pk can be used to define an iterative algorithm for the approximation of m. ˆ For an explanation observe that mˆ is the projection of Y onto Hadd and Hk is a linear subspace of Hadd . Thus Pk (Y ) = Pk (m) ˆ holds for k = 0, . . . , 2d. This gives ⎛ ⎞ ⎛ ⎞ 2d   Pk (Y ) = Pk (m) ˆ = Pk ⎝ mˆ j ⎠ = mˆ k + Pk ⎝ mˆ j ⎠

(5)

j=k

j=0

or, equivalently, mˆ k = Pk (Y ) −



Pk (mˆ j ) = Pk (Y ) − Y¯ −

j=k



Pk (mˆ j ),

j=k

n Y i , (Y¯ )i, j ≡ 0 where Y¯ = P0 (Y ) = P0 (Y ) is the element of H with (Y¯ )i,0 ≡ n1 i=1 for j ≥ 1. This equation inspires an iterative algorithm where in each step approxiˆ k are updated by mations mˆ old k of m = Pk (Y ) − Y¯ − mˆ new k



Pk (mˆ old j ).

j=k

Algorithm 1 provides a compact definition of our algorithm for the approximation of m. ˆ In each iteration step, either mˆ j or mˆ (1) j is updated for some j = 1, . . . , d. This

Local Linear Smoothing in Additive Models as Data Projection

203

Algorithm 1 Smooth Backfitting algorithm 1: Start: mˆ k (xk ) ≡ 0, m k = Pk (Y ), err or = ∞ 2: while err or > tolerance do 3: err or ← 0 4: for k = 0, . . . , 2d do 5: mˆ old ˆk  k ←m 6: mˆ k ← m k − j=k Pk (mˆ j ) 7: err or ← err or + |mˆ k − mˆ old k |

 k = 0, . . . , 2d

8: return mˆ = (mˆ 0 , mˆ 1 , . . . , mˆ 2d )

is different from the algorithm proposed in [11] where in each step a function tuple (mˆ j , mˆ (1) j ) is updated. For the orthogonal projections of functions m ∈ Hadd one can use simplified formulas. They will be given in Appendix 1. Note that m˜ k , k = 0, . . . , 2d only needs to be calculated once at the beginning. Also the marginals pk (xk ), pk∗ (xk ), pk∗∗ (xk ), p jk (x j , xk ), p ∗jk (x j , xk ) and p ∗∗ jk (x j , x k ) which are needed in the evaluation of Pk only need to be calculated once at the beginning. Precise definitions of these marginals can be found in the following sections. In each iteration of the for-loop in line 4 of Algorithm 7, O(d × n × g) calculations are performed. Hence for a full cycle, the algorithm needs O(d 2 × n × g × log(1/tolerance)) calculations. Here g is the number of evaluation points for each coordinate xk . Existence and uniqueness of the local linear smooth backfitting estimator will be discussed in the next section. Additionally, convergence of the proposed iterative algorithm will be shown.

3 Existence and Uniqueness of the Estimator, Convergence of the Algorithm In this section we will establish conditions for existence and uniqueness of the local linear smooth backfitting estimator m . Afterwards we will discuss convergence of the iterative algorithm provided in Algorithm 7. Note that convergence is shown for arbitrary starting values, i.e., we can set mˆ k (xk ) to values other than zero in step 1 of Algorithm 7. For these statements we require the following weak condition on the kernel. (A1) The kernel k has support [−1, 1]. Furthermore, k is strictly positive on (−1, 1) and continuous on R. For k = 1, . . . , d and x ∈ Rd we write x−k := (x1 , . . . , xk−1 , xk+1 , . . . , xd ).

204

M. Hiabu et al.

In the following, we will show that our claims hold on the following event:  E=

For k = 1, . . . , d and xk ∈ X k there exist two observations i 1 , i 2 ∈ {1, . . . , n}

such that X i 1 ,k  = X i 2 ,k , |X i,k − xk | < h k , (xk , X i,−k ) ∈ X for i = i 1 , i 2 . Furthermore, there exist no b0 , . . . , bd ∈ R with b0 +

d 

 b j X i j = 0 ∀i = 1, . . . , n ,

j=1

where X k is the closure of Xk and by a slight abuse of notation (xk , X i,−k ) := (X i,1 , . . . , , X i,k−1 , xk , X i,k+1 , . . . , , X i,d ). Throughout this paper, we require the following definitions. n  1 K X i (X i − x)dx−k , pˆ k (xk ) = n i=1 X−k (xk ) h n  1 pˆ k∗ (xk ) = (X ik − xk )K hX i (X i − x)dx−k , n i=1 X−k (xk ) n  1 ∗∗ pˆ k (xk ) = (X ik − xk )2 K hX i (X i − x)dx−k , n i=1 X−k (xk )

where X−k (xk ) = {u −k | (xk , u −k ) ∈ X }. Lemma 1 Make Assumption (A1). Then, on the event E it holds that f n = 0 implies f 0 = 0 as well as f j ≡ 0 almost everywhere for j = 1 . . . 2d and all f ∈ Hadd . One can easily see that the lemma implies the following. On the event E, if a minimizer mˆ = mˆ 0 + · · · + mˆ 2d of Y − f n over f ∈ Hadd exists, the components mˆ 0 , . . . , mˆ 2d are uniquely determined: Suppose there exists another minimizer ˆ mˆ − m ˜ n = 0 and Y − m, ˜ mˆ − m ˜ n=0 m˜ ∈ Hadd . Then it holds that Y − m, which gives mˆ − m ˜ n = 0. An application of the lemma yields uniqueness of the components mˆ 0 , . . . , mˆ 2d . Remark 1 In Fig. 1 we give an example where a set X and data points X i do not belong to the event E and where the components of the function f ∈ Hadd are not identified. Note that in this example for all k = 1, 2 and xk ∈ Xk there exist i 1 , i 2 ∈ {1, . . . , n} such thatX i1 = X i2 and |X i,k − xk | < h, for i = i 1 , i 2 . However, for x1 ∈ a the condition (x1 , X i,2 ) ∈ X is not fulfilled for any i = 1, . . . , n with |X i,1 − x1 | < h. Therefore, K hX i (X i − x) = 0 for all x ∈ X with x1 ∈ a. Thus, any function satisfying f ∈ H1 with f 1 (x) = 0 for x ∈ X1 \{a} has the property f n = 0.

Local Linear Smoothing in Additive Models as Data Projection

205

Fig. 1 An example of a possible data set X ⊆ R2 including data points where the conditions of the event E are not satisfied and where the components of functions f ∈ Hadd are not identified. The data is visualized by blue dots. The size of the parameter h = h 1 = h 2 is showcased on the right hand side. For explanatory reasons, the interval a is included

Proof (of Lemma 1) First, for each pair i 1 , i 2 = 1, . . . , n define the set Mi1 ,i2 := {x1 ∈ X1 | |X i,1 − x1 | < h, (x1 , X i,−1 ) ∈ X for i = i 1 , i 2 } if X i1 ,1 = X i2 ,1 and Mi1 ,i2 = ∅ otherwise. It is easy to see that Mi1 ,i2 is open as an intersection of open sets. Note that on the event E we have 

Mi1 ,i2 = X1 .

(6)

i 1 ,i 2

Now, suppose that for some f ∈ Hadd we have f n = 0. We want to show that f 0 = 0 and that f j ≡ 0 for j = 1, . . . , 2d. From f n = 0 we obtain  f0 +

d 

f j (x j ) +

j=1

d 

2 f j  (x j )(X i j − x j )

K hX i (X i − x) = 0

j=1

for i = 1, . . . , n and almost all x ∈ X . Let i 1 , i 2 ∈ {1, . . . , n}. Then f 0 + f 1 (x1 ) +

d  j=2

f j (X i j ) + f d+1 (x1 )(X i1 − x1 ) = 0

(7)

206

M. Hiabu et al.

holds for all i = i 1 , i 2 and x1 ∈ Mi1 ,i2 almost surely. By subtraction of Equation (7) for i = i 1 and i = i 2 we receive f d+1 (x1 ) = v1  with constant v1 = − dj=2 ( f j (X i1 , j ) − f j (X i2 , j ))/(X i1 ,1 − X i2 ,1 ) for x1 ∈ Mi1 ,i2 . Furthermore, by using (7) once again we obtain f 1 (x1 ) = u 1 + v1 x1 with another constant u 1 ∈ R. Following (6), since X1 is connected and the sets Mi1 ,i2 are open we can conclude f d+1 (x1 ) = v1 and f 1 (x1 ) = u 1 + v1 x1 for almost all x1 ∈ X1 since the sets must overlap. Similarly one shows f j  (x j ) = v j and f j (x j ) = u j + v j x j for j = 2, . . . , d and almost all x j ∈ X j . We conclude that 0 = f 2n 2  n  d d   1  f0 + = f j (x j ) + f j  (x j )(X i j − x j ) K hX i (X i − x)dx n i=1 j=1 j=1 =

1 n 

=

  n 

f0 +

i=1

f0 +

d  j=1

d 

uj +

d 

j=1

uj +

d 

2

v j Xi j

K hX i (X i − x)dx

j=1

2

v j Xi j

.

j=1

On the event E the covariates X i do not lie in a linear subspace of Rd . This shows v j = 0 for 1 ≤ j ≤ d. Thus f j ≡ 0 for d + 1 ≤ j ≤ 2d and f j = u j for 1 ≤ j ≤ d. Now, f j (x j ) pˆ j (x j )dx j = 0 implies that f j ≡ 0 for 1 ≤ j ≤ d and f 0 = 0. This concludes the proof of the lemma. Existence and uniqueness of m  on the event E under Assumption (A1) follows immediately from the following lemma. Lemma 2 Make Assumption (A1). Then, on the event E, for every D ⊆ {0, . . . , 2d} the linear space k∈D Hk is a closed subset of H. In particular, Hadd is closed. For the proof of this lemma we make use of some propositions introduced below. In the following, we consider sums L = L 1 + L 2 of closed subspaces L 1 and L 2 of a Hilbert space with L 1 ∩ L 2 = {0}. In this setup, an element g ∈ L has a unique

Local Linear Smoothing in Additive Models as Data Projection

207

decomposition g = g1 + g2 with g1 ∈ L 1 and g2 ∈ L 2 . Thus, the projection operator from L onto L 1 along L 2 given by 1 (L 2 ) : L → L 1 , 1 (L 2 )(g) = g1 is well defined. Proposition 1 For the sum L = L 1 + L 2 of two closed subspaces L 1 and L 2 of a Hilbert space with L 1 ∩ L 2 = {0}, the following conditions are equivalent (i) L is closed. (ii) There exists a constant c > 0 such that for every g = g1 + g2 ∈ L with g1 ∈ L 1 and g2 ∈ L 2 we have (8) g ≥ c max{ g1 , g2 }. (iii) The projection operator 1 (L 2 ) from L onto L 1 along L 2 is bounded. (iv) The gap from L 1 to L 2 is greater than zero, i.e., γ(L 1 , L 2 ) := inf

g1 ∈L 1

dist (g1 , L 2 ) > 0, g1

where dist( f, V ) := inf h∈V f − h with the convention 0/0 = 1. Remark 2 A version of Proposition 1 is also true if L 1 ∩ L 2 = {0}. In this case, the quantities involved need to be identified as objects in the quotient space L/(L 1 ∩ L 2 ). Proposition 2 The sum L = L 1 + L 2 of two closed subspaces L 1 and L 2 of a Hilbert space with L 1 ∩ L 2 = {0} is closed if the orthogonal projection of L 2 on L 1 is compact. The proofs of Propositions 1 and 2 can be reconstructed from [1, A.4 Proposition 2], [9, Chap. 4, Theorem 4.2] and [10]. For completeness, we have added proofs of the propositions in Appendix Appendix 2. We now come to the proof of Lemma 2. Proof (of Lemma 2) First note that the spaces Hk are closed for k = 0, . . . , 2d. We show that Hk + Hk  is closed for 1 ≤ k ≤ d. Consider R = min M where M := {r ≥ 0 | ( pˆ k∗ (xk ))2 ≤ r pˆ k (xk ) pˆ k∗∗ (xk ) for all xk ∈ X k and 1 ≤ k ≤ d} and

    Ixk := i ∈ {1, . . . , n} 

u∈X−k (xk )

 K hX i (X i − u)du −k > 0

for xk ∈ X k . By the Cauchy-Schwarz inequality we have ( pˆ k∗ (xk ))2 ≤ pˆ k (xk ) pˆ k∗∗ (xk ) for xk ∈ X k and 1 ≤ k ≤ d. This implies R ≤ 1 . Now, equality in the inequality only holds if X ik − xk does not depend on i ∈ Ixk . On the event E for xk ∈ X k there exist

208

M. Hiabu et al.

1 ≤ i 1 , i 2 ≤ n with |xk − X i,k | < h for i = i 1 , i 2 and X i1 ,k = X i2 ,k . Thus, X ik − xk depends on i for i ∈ Ixk and the strict inequality holds for all xk . Furthermore, because the kernel function k is continuous, we have that pˆ k , pˆ k∗ and pˆ k∗∗ are continuous. Thogether with the compactness of X k this implies that R < 1 on the event E. Now let f ∈ Hk and g ∈ Hk  for some 1 ≤ k ≤ d. We will show f + g 2n ≥ (1 − R)( f 2n + g 2n ).

(9)

By application of Proposition 1 this immediately implies that Hk + Hk  is closed. For a proof of (9) note that f + g 2n = n −1 

n 

( f k (xk ) + g0 + (X ik − xk )gk  (xk ))2 K hX i (X i − x)dx

i=1

 ( f k (xk ) + g0 )2 pˆ k (xk )dxk + 2 ( f k (xk ) + g0 )gk  (xk ) pˆ k∗ (xk )dxk  + gk  (xk )2 pˆ k∗∗ (xk )dxk   ≥ ( f k (xk ) + g0 )2 pˆ k (xk )dxk + gk  (xk )2 pˆ k∗∗ (xk )dxk  −2R | f k (xk ) + g0 ||gk  (xk )|( pˆ k (xk ) pˆ k∗∗ (xk ))1/2 dxk   ≥ ( f k (xk ) + g0 )2 pˆ k (xk )dxk + gk  (xk )2 pˆ k∗∗ (xk )dxk

=

!1/2  !1/2 gk  (xk )2 pˆ k∗∗ (xk )dxk ( f k (xk ) + g0 )2 pˆ k (xk )dxk   ≥ (1 − R) ( f k (xk ) + g0 )2 pˆ k (xk )dxk + (1 − R) gk  (xk )2 pˆ k∗∗ (xk )dxk !   2 2 2 ∗∗ = (1 − R) f k (xk ) pˆ k (xk )dxk + g0 + gk  (xk ) pˆ k (xk )dxk 

−2R

≥ (1 − R)( f 2n + g 2n ), where in the second to last row, we used that f ∈ Hk . This concludes the proof of (9). Note that the statement of the lemma is equivalent to the statement:  following  For D1 , D2 ⊆ {1, . . . , d} and δ ∈ {0, 1} the space ◦H0 + k∈D1 Hk + k∈D2 Hk  is closed. We show this inductively over the number of elements s = |D2 ∩ D1 | of D1 ∩ D2 .   For the case s = 0, note that for D1 ∩ D2 = ∅, the space ◦H0 + k∈D1 Hk + k∈D2 Hk  is closed, which can be shown with similar but simpler arguments than , D2 ⊆ {1, . . . , d} with |D2 ∩ the ones used below. Now let s ≥ 1, δ ∈ {0, 1}, D1 D1 | = s − 1 and assume L 2 = δH0 + j∈D1 H j + j∈D2 H j  is closed. Without loss of generality, let k ∈ {1, . . . , d}\(D1 ∪ D2 ). We will argue that on the event E

Local Linear Smoothing in Additive Models as Data Projection

209

the orthogonal projection of L 2 on L 1 = Hk + Hk  is Hilbert-Schmidt, noting that a Hilbert-Schmidt operator is compact. 2 since L 1 and L 2 are  Using Proposition  closed, this implies that L = ◦H0 + j∈D1 ∪{k} H j + j∈D2 ∪{k} H j  is closed which completes the inductive argument.   For an element f ∈ L 2 with decomposition f = f 0 + j∈D1 f j + j∈D2 f j  the projection onto L 1 + H0 is given by univariate functions gk , gk  and g0 ∈ R which satisfy 0=

n   i=1

f0 +



f j (x j ) +

j∈D1

!

−gk  (xk )(X ik − xk )



f j  (x j )(X i j − x j )−g0 − gk (xk )

j∈D2

1 X ik − xk

! K X i (X i − x)dx−k .

Note that f 0 = 0 if δ = 0. This implies ! ! 1 g0 +gk (xk ) pˆ k∗∗ − pˆ k∗ (xk ) = gk  (xk ) ( pˆ k pˆ k∗∗ − ( pˆ k∗ )2 )(xk ) − pˆ k∗ pˆ k ! !   pˆ jk (x j , xk ) pˆ k (x dx j × f0 ) + (x ) f k j j pˆ k∗j (xk , x j ) pˆ k∗ j∈D1 !   pˆ ∗jk (x j , xk ) dx j , + f j  (x j ) pˆ ∗∗ jk (x j , x k ) j∈D2

where gk and g0 are chosen such that gk (xk ) pˆ k (xk )dxk = 0. We now use that the projection of g0 onto L 1 = Hk + Hk  is equal to rk (xk ) rk  (xk )

! =

! g0 (1 − ck sk (xk ) pˆ k∗∗ (xk )) , g0 ck sk (xk ) pˆ k∗ (xk )

where sk (xk ) = pˆ k (xk )/( pˆ k (xk ) pˆ k∗∗ (xk ) − ( pˆ k∗ )2 (xk )) and ck = ( sk (xk ) pˆ k∗∗ (xk ) pˆ k (xk )dxk )−1 . Thus, the projection of f onto L 1 is defined by (gk (xk ) + rk (xk ), gk  (xk ) + rk  (xk )) . Under our settings on the event E this is a Hilbert-Schmidt operator. This concludes the proof. We now come to a short discussion of the convergence of Algorithm 7. The algorithm is used to approximate m. ˆ In the lemma we denote by mˆ [r ] the outcome of the algorithm after r iterations of the while loop (see Algorithm 7). We prove the algorithm for arbitrary starting values, i.e. we can set the mˆ k (xk ) to values other than zero in step 1 of Algorithm 7. The vector of starting values of the algorithm is denoted by mˆ [0] ∈ Hadd . Lemma 3 Make Assumption (A1). Then, on the event E, for Algorithm 7 and all choices of starting values mˆ [0] ∈ Hadd we have

210

M. Hiabu et al.

‖m̂^[r] − m̂‖_n ≤ V^r ‖m̂^[0] − m̂‖_n,

where 0 ≤ V = 1 − ∏_{k=0}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) < 1 is a random variable depending on the observations.

Remark 3 On the event E, the algorithm converges with a geometric rate: in every iteration step the distance to the limiting value m̂ is reduced by a factor smaller than or equal to V. If the columns of the design matrix X are orthogonal, V will be close to zero; if they are highly correlated, V will be close to 1. The variable V depends on n and is random. Under additional assumptions, as stated in the next section, one can show that with probability tending to one, V is bounded by a constant smaller than 1.

Proof (of Lemma 3) For a subspace V ⊆ H_add we denote by P_V the orthogonal projection onto V. For k = 0, …, 2d let Q_k := P_{H_k^⊥} = 1 − P_k be the projection onto the orthogonal complement H_k^⊥ of H_k. The idea is to show the following statements:
(i) Y − m̂^[r] = (Q_{2d} ··· Q_0)^r (Y − m̂^[0]),
(ii) (Q_{2d} ··· Q_0)^r (Y − m̂) = Y − m̂.
This then implies
‖m̂^[r] − m̂‖_n = ‖m̂^[r] − Y + Y − m̂‖_n = ‖(Q_{2d} ··· Q_0)^r (m̂^[0] − m̂)‖_n.
The proof is concluded by showing

‖Q_{2d} ··· Q_0 g‖²_n ≤ ( 1 − ∏_{k=0}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) ) ‖g‖²_n   (10)

for all g ∈ H_add. Note that 0 ≤ V := 1 − ∏_{k=0}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) < 1 by Lemma 2 and Proposition 1. For (i), observe that for all r ≥ 1 and k = 0, …, 2d we have

Y − m̂_0^[r] − ··· − m̂_k^[r] − m̂_{k+1}^[r−1] − ··· − m̂_{2d}^[r−1]
 = (1 − P_k)(Y − m̂_0^[r] − ··· − m̂_{k−1}^[r] − m̂_{k+1}^[r−1] − ··· − m̂_{2d}^[r−1])
 = (1 − P_k)(Y − m̂_0^[r] − ··· − m̂_{k−1}^[r] − m̂_k^[r−1] − m̂_{k+1}^[r−1] − ··· − m̂_{2d}^[r−1])
 = Q_k(Y − m̂_0^[r] − ··· − m̂_{k−1}^[r] − m̂_k^[r−1] − ··· − m̂_{2d}^[r−1]).

The statement follows inductively, beginning with the case r = 1, k = 0. Secondly, (ii) follows from
(Q_{2d} ··· Q_0)^r (Y − m̂) = (Q_{2d} ··· Q_0)^r P_{H_0^⊥ ∩ ··· ∩ H_{2d}^⊥}(Y) = P_{H_0^⊥ ∩ ··· ∩ H_{2d}^⊥}(Y) = Y − m̂.
It remains to show the inequality in (10).

For 0 ≤ k ≤ 2d define N_k := H_k + ··· + H_{2d}. We prove

‖Q_{2d} ··· Q_j g‖²_n ≤ ( 1 − ∏_{k=j}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) ) ‖g‖²_n

for all g ∈ H_add and 0 ≤ j ≤ 2d using an inductive argument. The case j = 2d is trivial. For 0 ≤ j < 2d and any g ∈ H_add let g_j^⊥ := Q_j g = g'' + g' with g'' := P_{N_{j+1}^⊥}(g_j^⊥) and g' := P_{N_{j+1}}(g_j^⊥). Then, by orthogonality, we have

‖Q_{2d} ··· Q_{j+1} g_j^⊥‖²_n = ‖g'' + Q_{2d} ··· Q_{j+1} g'‖²_n = ‖g''‖²_n + ‖Q_{2d} ··· Q_{j+1} g'‖²_n.

Induction gives

‖Q_{2d} ··· Q_{j+1} g'‖²_n ≤ ( 1 − ∏_{k=j+1}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) ) ( ‖g_j^⊥‖²_n − ‖g''‖²_n ),

which implies

‖Q_{2d} ··· Q_{j+1} g_j^⊥‖²_n ≤ ( 1 − ∏_{k=j+1}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) ) ‖g_j^⊥‖²_n + ∏_{k=j+1}^{2d−1} γ²(H_k, H_{k+1} + ··· + H_{2d}) ‖g''‖²_n.

By Lemma 2 and Lemma 9 we have

‖g''‖²_n ≤ ‖P_{N_{j+1}^⊥} Q_j‖²_n ‖g_j^⊥‖²_n = ‖P_{N_{j+1}} P_{H_j}‖²_n ‖g_j^⊥‖²_n = ( 1 − γ²(H_j, H_{j+1} + ··· + H_{2d}) ) ‖g_j^⊥‖²_n.

This concludes the proof by noting that ‖g_j^⊥‖_n ≤ ‖g‖_n.

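The geometric rate in Lemma 3 can also be seen in a small numerical experiment. The following sketch is not part of the paper: it assumes finite-dimensional subspaces spanned by random design columns, applies the cyclic complement projections Q_k as in the proof above, and tracks how fast the residual approaches its limit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims = 30, [1, 8, 8, 8]                        # H_0 = constants plus three further subspaces

# Orthonormal bases of the subspaces H_k (columns); H_0 is the constant vector.
bases = [np.ones((n, 1)) / np.sqrt(n)] + [
    np.linalg.qr(rng.standard_normal((n, d)))[0] for d in dims[1:]
]
P = [B @ B.T for B in bases]                      # orthogonal projections P_k onto H_k
Q = [np.eye(n) - Pk for Pk in P]                  # complement projections Q_k = 1 - P_k

Y = rng.standard_normal(n)

def sweep(r):
    """Apply Q_{2d} ... Q_0 once to the residual r (one backfitting sweep)."""
    for Qk in Q:
        r = Qk @ r
    return r

# The limiting residual Y - m_hat is the projection of Y onto the intersection of the
# complements; a long run of sweeps approximates it.
limit = Y.copy()
for _ in range(2000):
    limit = sweep(limit)

res, dists = Y.copy(), []
for _ in range(15):
    res = sweep(res)
    dists.append(np.linalg.norm(res - limit))

ratios = [b / a for a, b in zip(dists, dists[1:]) if a > 0]
print(max(ratios))                                # ratios stay below one: geometric decay
```

The printed ratio plays the role of (a Monte Carlo estimate of) the contraction factor V: it is small when the subspaces are nearly orthogonal and close to one when they are strongly correlated.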
4 Asymptotic Properties of the Estimator

In this section we will discuss asymptotic properties of the local linear smooth backfitting estimator. For simplicity we consider only the case that X is a product of intervals X_j = (a_j, b_j) ⊂ R. We make the following additional assumptions:
(A2) The observations (Y_i, X_i) are i.i.d. and the covariates X_i have one-dimensional marginal densities p_j which are strictly positive on [a_j, b_j]. The two-dimensional marginal densities p_{jk} of (X_{i,j}, X_{i,k}) are continuous on their support [a_j, b_j] × [a_k, b_k].
(A3) It holds
Y_i = m_0 + m_1(X_{i1}) + ··· + m_d(X_{id}) + ε_i,   (11)


for twice continuously differentiable functions m_j : X_j → R with ∫ m_j(x_j) p_j(x_j) dx_j = 0. The error variables ε_i satisfy E[ε_i | X_i] = 0 and
sup_{x∈X} E[ |ε_i|^{5/2} | X_i = x ] < ∞.

(A4) There exist constants c_1, …, c_d > 0 with n^{1/5} h_j → c_j for n → ∞.
To simplify notation we assume that h_1 = ··· = h_d. In an abuse of notation we write h for h_j and c_h for c_j. From now on we will write m̂^n = (m̂_0^n, m̂_1^n, …, m̂_{2d}^n) for the estimator m̂ to indicate its dependence on the sample size n. The following theorem states an asymptotic expansion for the components m̂_1^n, …, m̂_d^n. Later in this section we will state some lemmas which will be used to prove the result.

Theorem 1 Make assumptions (A1)–(A4). Then

| m̂_j^n(x_j) − m_j(x_j) − ( β_j(x_j) − ∫ β_j(u_j) p_j(u_j) du_j ) − v_j(x_j) | = o_P(h² + (nh)^{−1/2}) = o_P(n^{−2/5})

holds uniformly over 1 ≤ j ≤ d and a_j ≤ x_j ≤ b_j, where v_j is a stochastic variance term

v_j(x_j) = ( (1/n) ∑_{i=1}^n h^{−1} k(h^{−1}(X_{ij} − x_j)) ε_i ) / ( (1/n) ∑_{i=1}^n h^{−1} k(h^{−1}(X_{ij} − x_j)) ) = O_P((nh)^{−1/2})

and β_j is a deterministic bias term

β_j(x_j) = (1/2) h² m_j''(x_j) ( b_{j,2}(x_j)² − b_{j,1}(x_j) b_{j,3}(x_j) ) / ( b_{j,0}(x_j) b_{j,2}(x_j) − b_{j,1}(x_j)² ) = O(h²),

with b_{j,l}(x_j) = ∫_{X_j} k(h^{−1}(u_j − x_j)) (u_j − x_j)^l h^{−l−1} b_j(u_j)^{−1} du_j and b_j(x_j) = ∫_{X_j} k(h^{−1}(x_j − w_j)) h^{−1} dw_j for 0 ≤ l ≤ 3.

The expansion for m̂_j^n stated in the theorem depends neither on d nor on the functions m_k (k ≠ j). In particular, this shows that the same expansion holds for the local linear estimator m̃_j^n in the oracle model where the functions m_k (k ≠ j) are known. More precisely, in the oracle model one observes i.i.d. observations (Y_i*, X_{ij}) with

Y_i* = m_j(X_{ij}) + ε_i,   Y_i* = Y_i − ∑_{k≠j} m_k(X_{ik}),   (12)

and the local linear estimator m̃_j^n is defined as the second component that minimises the criterion

S(f_0, f_j, f_j^{(1)}) = ∑_{i=1}^n ∫_{X_j} ( Y_i* − f_0 − f_j(x_j) − f_j^{(1)}(x_j)(X_{ij} − x_j) )² κ_h^{X_{ij}}(X_{ij} − x_j) dx_j

with boundary corrected kernel

κ_h^u(u − x) = κ((u − x)/h) / ∫_{X_j} κ((u − v)/h) dv.
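For readers who prefer code to formulas, here is a minimal sketch (not from the paper) of the oracle local linear fit defined by this criterion. It simplifies matters by dropping the separate constant f_0, using a Gaussian kernel in place of the boundary corrected kernel κ, and working with simulated data; at each evaluation point it solves the kernel-weighted least squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 400, 0.1
X = rng.uniform(0.0, 1.0, n)
m = np.sin(2 * np.pi * X)                         # the component m_j in the oracle model
Ystar = m + 0.3 * rng.standard_normal(n)          # Y_i^* = m_j(X_ij) + eps_i

def local_linear(x0, X, Y, h):
    # Kernel weights; a Gaussian kernel stands in for kappa here.
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    # Local design for Y ~ f(x0) + f'(x0) (X - x0).
    Z = np.column_stack([np.ones_like(X), X - x0])
    WZ = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)    # weighted least squares
    return beta[0]                                # the local level estimate

grid = np.linspace(0.05, 0.95, 10)
fit = np.array([local_linear(x0, X, Ystar, h) for x0 in grid])
print(np.max(np.abs(fit - np.sin(2 * np.pi * grid))))   # estimation error on the grid
```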

We conclude that the local linear smooth backfitting estimator m̂_j is asymptotically equivalent to the local linear estimator m̃_j^n in the oracle model. We formulate this asymptotic equivalence as a first corollary of Theorem 1. In particular, it implies that the estimators have the same first order asymptotic properties.

Corollary 1 Make assumptions (A1)–(A4). Then it holds uniformly over 1 ≤ j ≤ d and a_j ≤ x_j ≤ b_j that
| m̂_j^n(x_j) − m̃_j^n(x_j) | = o_P(h²).

For x_j ∈ (a_j + 2h, b_j − 2h) the bias term β_j simplifies and we have that
β_j(x_j) = (1/2) h² m_j''(x_j) ∫ k(v) v² dv.

This implies the following corollary of Theorem 1.

Corollary 2 Make assumptions (A1)–(A4). Then it holds uniformly over 1 ≤ j ≤ d and a_j + 2h ≤ x_j ≤ b_j − 2h that
| m̂_j^n(x_j) − m_j(x_j) − (1/2) ( m_j''(x_j) − ∫ m_j''(u_j) p_j(u_j) du_j ) h² ∫ k(v) v² dv − v_j(x_j) | = o_P(h²).

Corollary 2 can be used to derive the asymptotic distribution of m̂_j^n(x_j) for an x_j ∈ (a_j, b_j). Under the additional assumption that σ_j²(u) = E[ε_i² | X_{ij} = u] is continuous in u = x_j, we get under (A1)–(A4) that n^{2/5}( m̂_j^n(x_j) − m_j(x_j) ) has an asymptotic normal distribution with mean (c_h²/2) ( m_j''(x_j) − ∫ m_j''(u_j) p_j(u_j) du_j ) ∫ k(v) v² dv and variance c_h^{−1} σ_j²(x_j) p_j^{−1}(x_j) ∫ k(v)² dv. This is equal to the asymptotic limit distribution of the classical local linear estimator in the oracle model, in accordance with Corollary 1. Now, we come to the proof of Theorem 1.
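Before the proof, a brief practical aside that is not part of the paper: the limit law above suggests a plug-in pointwise confidence interval for m_j(x_j). The sketch below assumes that estimates of m̂_j^n(x_j), σ_j²(x_j) and p_j(x_j) are already available from a fit, ignores the bias term (as one would after undersmoothing), and uses the Epanechnikov value ∫ k(v)² dv = 3/5; all numbers in the call are hypothetical.

```python
import numpy as np

def pointwise_ci(mhat, sigma2_hat, p_hat, n, h, kernel_l2=0.6, z=1.96):
    """Approximate 95% interval from Var(m_hat(x)) ~ sigma^2(x) * int k^2 / (n h p(x))."""
    se = np.sqrt(sigma2_hat * kernel_l2 / (n * h * p_hat))
    return mhat - z * se, mhat + z * se

# Hypothetical plug-in values, for illustration only.
print(pointwise_ci(mhat=1.20, sigma2_hat=0.80, p_hat=0.50, n=500, h=0.15))
```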


First, we define the operator S_n = (S_{n,0}, S_{n,1}, …, S_{n,2d}) : G_n → G_n with G_n = {(g_0, …, g_{2d}) | g_0 ∈ R, g_l, g_{l'} ∈ L_2(p_l) with P_0(g_l) = 0 for l = 1, …, d}, where S_{n,k} maps g = (g_0, …, g_{2d}) ∈ G_n to f_k with

f_0 = P_0( ∑_{1≤l≤2d} g_l ) = P_0( ∑_{d+1≤l≤2d} g_l ) ∈ R,
f_k(x_k) = P_k( ∑_{0≤l≤2d, l≠k} g_l )(x)

for 1 ≤ k ≤ 2d. With this notation we can rewrite the backfitting equation (5) as

m̄^n(Y) = m̂^n + S_n m̂^n,   (13)

where for z ∈ R^n we define m̄_0^n(z) = z̄ = (1/n) ∑_{i=1}^n z_i and for 1 ≤ j ≤ d,

m̄_j^n(z)(x_j) = p̂_j(x_j)^{−1} (1/n) ∑_{i=1}^n ∫ (z_i − z̄) K_h^{X_i}(X_i − x) dx_{−j},
m̄_{j'}^n(z)(x_j) = p̂_j^{**}(x_j)^{−1} (1/n) ∑_{i=1}^n ∫ (X_{ij} − x_j) z_i K_h^{X_i}(X_i − x) dx_{−j}.

The following lemma shows that I + S_n is invertible on the event E. Here we denote the identity operator by I.

Lemma 4 On the event E the operator I + S_n : G_n → G_n is invertible.

Proof Suppose that for some g ∈ G_n it holds that (I + S_n)(g) = 0. We have to show that this implies g = 0. For the proof of this claim note that g_k + S_{n,k}(g) is the orthogonal projection of ∑_{j=0}^{2d} g_j onto H_k. Furthermore, we have that g_k is an element of H_k. This gives that

⟨ g_k, ∑_{j=0}^{2d} g_j ⟩_n = 0.

Summing over k gives

⟨ ∑_{j=0}^{2d} g_j, ∑_{j=0}^{2d} g_j ⟩_n = 0.

According to Lemma 1, on the event E we have g_0 = 0 and g_j ≡ 0 for j = 1, …, 2d. This concludes the proof of the lemma.
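A finite-dimensional sketch (not from the paper) of why I + S_n is invertible when the only additive decomposition of zero is the trivial one: represent each subspace H_k by an orthonormal basis, assemble the block operator g ↦ (g_k + P_k ∑_{l≠k} g_l)_k in coordinates, and check that its smallest singular value is positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n, dims = 50, [1, 4, 4, 4]                         # H_0 = constants, three further subspaces
bases = [np.ones((n, 1)) / np.sqrt(n)] + [
    np.linalg.qr(rng.standard_normal((n, d)))[0] for d in dims[1:]
]
K = len(bases)

# Block (k, l) maps coordinates of g_l in H_l to the coordinates of its projection onto H_k;
# the diagonal blocks are identities, so M is the coordinate matrix of I + S_n.
blocks = [[np.eye(dims[k]) if l == k else bases[k].T @ bases[l] for l in range(K)]
          for k in range(K)]
M = np.block(blocks)

print(np.linalg.svd(M, compute_uv=False).min())    # strictly positive: I + S_n is invertible
```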


One can show that under conditions (A1) – (A4) the probability of the event E converges to one. Note that we have assumed that X = dj=1 X j . We conclude that under (A1) – (A4) I + Sn is invertible with probability tending to one. Thus we have that with probability tending to one (14) mˆ n − m − m¯ n (ε) − βn + n m + n βn −1 n n = (I + Sn ) (I + Sn )(mˆ − m − m¯ (ε) − βn + n m + n βn ), where m has components m 0 , . . . , m 2d with m 0 , . . . , m d as in (11) and with m j  = m j for 1 ≤ j ≤ d. Furthermore, β(x) has components β0 = 0, β j (x j ) and β j  (x j ) =

b j,0 (x j )b j,3 (x j ) − b j,1 (x j )b j,2 (x j ) 1  m (x j ) h 2 j b j,0 (x j )b j,2 (x j ) − b j,1 (x j )2

for j = 1, . . . , d with b j,l (x j ) defined above. Additionally, the norming constants are given by  (n β) j = (n m) j =



β j (x j ) pˆ j (x j )dx j , m j (x j ) pˆ j (x j )dx j ,

(n β) j  = (n m) j  = 0 for j = 1, . . . , d, d   (n β)0 = β j  (x j ) pˆ ∗j (x j )dx j , j=1

(n m)0 =

d  

m j  (x j ) pˆ ∗j (x j )dx j .

j=1

One can verify that for a j + 2h j ≤ x j ≤ b j − 2h j one has β j  (x j ) = o P (h). We have  already seen that β j (x j ) = 21 m j (x j ) k(v)v 2 dv + o P (h 2 ) holds for such x j . For the statement of Theorem 1 we have to show that for 1 ≤ j ≤ d the j-th component on the left hand side of equation (14) is of order o P (h 2 ) uniformly for a j + 2h ≤ x j ≤ b j − 2h. For a proof of this claim we first analyze the term Dn = (I + Sn )(mˆ n − m − m¯ n (ε) − βn + n m + n βn ) = m¯ n (Y ) − (I + Sn )(m + m¯ n (ε) + βn − n m − n βn ).

(15)

For this sake we split the term m¯ n (Y ) into the sum of a stochastic variance term and a deterministic expectation term: m¯ n (Y ) = m¯ n (ε) +

d  j=0

m¯ n (μn, j ),

(16)


where ε=Y−

d 

μn, j ,

j=0

μn, j = (m j (X i j ))i=1,..,n for j = 1, . . . , d, μn,0 = (m 0 )i=1,..,n . β

We write Dn = Dn + Dnε , with Dnβ

=

d 

m¯ n (μn, j ) − (I + Sn )(m + βn − n m − n βn ),

j=0

Dnε = Sn (m¯ n (ε)). The following lemma treats the conditional expectation term D β . β

Lemma 5 Assume (A1) – (A4). It holds Dn,0 = o p (h 2 ) and    β o p (h 2 ) for 1 ≤ k ≤ d, sup  Dn,k (xk ) = o p (h) for d + 1 ≤ k ≤ 2d. xk ∈Xk Proof The lemma follows by application of lengthy calculations using second order Taylor expansions for m j (X i j ) and by application of laws of large numbers. We now turn to the variance term. ε Lemma 6 Assume (A1) – (A4). It holds Dn,0 = o p (h 2 ) and

   ε o p (h 2 ) for 1 ≤ k ≤ d, sup  Dn,k (xk ) = o p (h) for d + 1 ≤ k ≤ 2d. xk ∈Xk ε Proof One can easily check that Dn,k (xk ) consists of weighted sums of εi where the weights are of the same order for all 1 ≤ i ≤ n. For fixed xk the sums are of order O P (n −1/2 ) for 1 ≤ k ≤ d and of order O P (h −1 n −1/2 ) for d + 1 ≤ k ≤ 2d. Using the conditional moment conditions on εi in Assumption (A3) we get the uniform rates stated in the lemma. β

It remains to study the behaviour of (I + Sn )−1 Dnε and (I + Sn )−1 Dn . We will use a small transformation of Sn here which is better suitable for an inversion. Define the following 2 × 2 matrix An,k (x) by An,k (x) =

1 pˆ k pˆ k∗∗ − ( pˆ k∗ )2

pˆ k∗∗ pˆ k pˆ k∗∗ pˆ k∗ pˆ k∗ pˆ k pˆ k pˆ k∗∗

! (xk ).


Furthermore, define the 2d × 2d matrix An (x) where the elements with indices (k, k), (k, k  ), (k  , k), (k  , k  ) are equal to the elements of An,k (x) with indices (1, 1), (1, 2), (2, 1), (2, 2). We now define S˜n by the equation I + S˜n = An (I + Sn ). Below we will make use of the fact that S˜n is of the form     S˜n,k m(x) = (17) qk,l (xk , u)m l (u)du + ql (u)m l (u)du, } l ∈{k,k /

S˜n,k  m(x) =

 

l∈{k,k  }

qk  ,l (xk , u)m l (u)du +

} l ∈{k,k /

 

ql (u)m l (u)du

(18)

l∈{k,k  }

for 1 ≤ k ≤ d with some random functions qk,l , ql which fulfill that qk,l (xk , u)2 du and qk (u)2 du are of order O P (1) uniformly over 1 ≤ k, l ≤ 2d and xk . Note that we need S˜n because Sn can not be written in the form of (17) and (18). The operator S˜n differs from Sn in the h-neighbourhood of the boundary by terms of order h 2 . Otherwise the difference is of order o p (h 2 ). Outside of the h-neighbourhood of the boundary, for n → ∞, the matrix An (x) converges to the identity matrix. Thus S˜n is a second order modification of Sn with the advantage of having (17)-(18). 0 For our further discussion we now introduce the space

G of tuples f = ( f 0 , f 1 , . . . , f 2d ) with f 0 = 0 and f k , f k  : Xk → R with f k (xk ) pk (xk )dxk = 0  and endow it with the norm f 2 = dk=1 ( f k (xk )2 + f k  (xk )2 ) pk (xk )dxk . The next β lemma shows that the norm of Hn (I + Sn )−1 Dnε and Hn (I + Sn )−1 Dn is of order 2 o P (h ). Here Hn is a diagonal matrix where the first d + 1 diagonal elements equal 1. The remaining elements are equal to h. Lemma 7 Assume (A1) – (A4). Then it holds that Hn (I + Sn )−1 Dn∗ = Hn (I + β Q˜ n )−1 An Dn∗ = o P (h 2 ) for Dn∗ = Dnε and Dn∗ = Dn .

ε β ε (xk ) − Dn,k (u k ) pk (u k )du k and Proof Define D¯ nε and D¯ n by D¯ nεk (xk ) = Dn,k

β β β ε ε D¯ n,k (xk ) = Dn,k (xk ) − Dn,k (u k ) pk (u k )du k for 1 ≤ k ≤ d and D¯ n,k = Dn,k and β β ¯ Dn,k = Dn,k , otherwise. It can be checked that it suffices to prove the lemma with β β β Dnε and Dn replaced by D¯ nε and D¯ n . Note that D¯ nε and D¯ n are elements of G 0 . For the proof of this claim we compare the operator S˜n with the operator S0 defined by S0,0 (g) = 0, S0,k  (g)(xk ) = 0 and S0,k (g)(xk ) =

 j=k

Xj

g j (u j )

p j,k (u j , xk ) du j pk (xk )

for 1 ≤ k ≤ d. By standard kernel smoothing theory one can show that sup g∈G 0 , g ≤1

(S0 − Hn S˜n Hn−1 )g = o P (1).


For the proof of this claim one makes use of the fact that non-vanishing differences in the h-neighbourhood of the boundary are asymptotically negligible in the calculation of the norm because the size of the neighbourhood converges to zero. In the next lemma we will show that I + S0 has a bounded inverse. This implies the statement of the lemma by applying the following expansion: (I + Hn S˜n Hn−1 )−1 − (I + S0 )−1 = (I + S0 )−1 ((I + S0 )(I + Hn S˜n Hn−1 )−1 − I ) = (I + S0 )−1 (((I + Hn S˜n Hn−1 )(I + S0 )−1 )−1 − I ) = (I + S0 )−1 ((I + (Hn S˜n Hn−1 − S0 )(I + S0 )−1 )−1 − I ) ∞  (I + (−1) j (Hn S˜n Hn−1 − S0 )(I + S0 )−1 ) j . = (I + S0 )−1 j=1

This shows the lemma because of Hn (I + S˜n )−1 An Dn∗ = Hn (I + Q˜ n )−1 Hn−1 Hn β An Dn∗ = (I + Hn S˜n Hn−1 )−1 Hn An Dn∗ for Dn∗ = Dnε and Dn∗ = Dn . Lemma 8 Assume (A1) – (A4). The operator I + S0 : G 0 → G 0 is bijective and has a bounded inverse. Proof For a proof of this claim it suffices to show that the operator I + S∗ : where G ∗ is the space of tuples G ∗ → G ∗ is bijective and has a bounded inverse

f = ( f 1 , . . . , f d ) where f k : X j → R with f k (xk )dxk = 0 with norm f 2 = d 2 k=1 f k (x k ) pk (x k )dx k and S∗,k (g)(xk ) =

 j=k

Xj

g j (u j )

p j,k (u j , xk ) du j pk (xk )

for 1 ≤ k ≤ d. We will apply the bounded inverse theorem. For an application of this theorem we have to show that I + S∗ is bounded and bijective. It can easily be seen that the operator is bounded. It remains to show that it is surjective. We will show that (i) (I + S∗ )g n → 0 for a sequence g n ∈ G ∗ implies that g n → 0. (ii) gk (I + S∗ )k r (xk ) pk (xk )dxk = 0 for all g ∈ G ∗ implies that r = 0. Note that (i) implies that G ∗∗ = {(I + S∗ )g : g ∈ G ∗ } is a closed subset of G ∗ . To see this suppose that (I + S∗ )g n → g for g, g n ∈ G ∗ . Then (i) implies that g n is a Cauchy sequence and thus g n has a limit in G ∗ which implies that (I + S∗ )g n has a limit in G ∗∗ . Thus G ∗∗ is closed. From (ii) we conclude that the orthogonal complement of G ∗∗ is equal to {0}. Thus the closure of G ∗∗ is equal to G ∗ . This shows that G ∗ = G ∗∗ because G ∗∗ is closed. We conclude that (I + S∗ ) is surjective.


It remains to show (i) and (ii). For a proof of (i) note that (I + S∗ )g n → 0 implies that  gkn (xk )(I + S∗ )k g n (xk ) pk (xk )dxk → 0 which shows d  

gkn (xk )2 pk (xk )dxk +



gkn (xk ) pk j (xk , x j )g j (x j )dxk dx j → 0.

k= j

k=1

Thus we have E[(

d 

gk (X ik ))2 ] → 0.

k=1

By application of Proposition 1 (ii) we get that max1≤k≤d E[gk (X ik )2 ] → 0, which shows (i).

Claim (ii) can be seen by a similar argument. Note that gk (I + S∗ )k r pk (xk )dxk = 0 for all g ∈ G ∗ implies that rk (I + S∗ )k r (xk ) pk (xk )dxk = 0. We now apply the results stated in the lemma for the final proof of Theorem 1. Proof (of Theorem 1) From (14) and Lemma 7 we know that the L2 norm of mˆ n − β m − m¯ n (ε) − βn + n m + n βn = Hn (I + Sn )−1 (Dnε + Dn ) = Hn (I + S˜n )−1 β An (Dnε + Dn ) is of order o P (h 2 ). Note that Hn (I + S˜n )−1 = Hn − Hn S˜n (I + S˜n )−1 . β

We already know that the sup norm of all components in Hn An (Dnε + Dn ) are of order o P (h 2 ). Thus, it remains to check that the sup norm of the components of β Hn S˜n (I + S˜n )−1 An (Dnε + Dn ) is of order o P (h 2 ). But this follows by application of β the just mentioned bound on the L2 norm of Hn (I + Sn )−1 (Dnε + Dn ), by equations (17)–(18), and the bounds for the random functions qk,l and ql mentioned after the statement of the equations. One gets a bound for the sup norms by application of the Cauchy Schwarz inequality.

Appendix 1: Projection Operators In this section we will state expressions for the projection operators P0 , Pk , Pk and Pk  (1 ≤ k ≤ d) mapping elements of H to H0 , Hk , Hk + H0 and Hk  , respectively, see Sect. 2. For an element f = ( f i, j )i=1,...,n; j=0,...,d the operators P0 , Pk , and Pk (1 ≤ k ≤ d) set all components to zero but the components with indices (i, 0), i = 1, . . . , n. Furthermore, in the case d < k ≤ 2d only the components with index (i, k − d), i = 1, . . . , n are non-zero. Thus, for the definition of the operators it remains to set


1 n i=1 n

(P0 ( f ))i,0 (x) =

 X

{ f i,0 (u) +

d 

f i, j (x)(X i j − u j )}K hX i (X i − u)du.

j=1

For 1 ≤ k ≤ d it suffices to define (Pk ( f ))i,0 (x) = (Pk ( f ))i,0 (x) − (P0 ( f ))i,0 and  *   n  d  1 1 i,0 i, j f (u) + (Pk ( f )) (x) = f (u)(X i j − u j ) pˆ k (xk ) n i=1 u∈X−k (xk ) j=1 + Xi ×K h (X i − u)du −k , i,0

 *   n  d  1 1 i,0 i, j (P ( f )) (x) = ∗∗ f (u) + f (u)(X i j − u j ) pˆ k (xk ) n i=1 u∈X−k (xk ) j=1 + Xi ×(X ik − xk )K h (X i − u)du −k . k

i,0

For the orthogonal projections of functions m ∈ Hadd one can use simplified formulas. In particular, these formulas can be used in our algorithm for updating functions (1) m ∈ Hadd . If m ∈ Hadd has components m 0 , . . . , m d , m (1) 1 , . . . , m d the operators Pk and Pk  are defined as follows (P0 (m)(x)))i,0 = m 0 +

d   j=1

Xj

m (1) p ∗j (u j )du j , j (u j )

pˆ k∗ (xk ) (Pk (m)(x))i,0 = m 0 + m k (xk ) + m (1) k (x k ) pˆ k (xk ) + *   pˆ ∗jk (u j , xk ) pˆ jk (u j , xk ) (1) + m j (u j ) du j , m j (u j ) + pˆ k (xk ) pˆ k (xk ) 1≤ j≤d, j=k X−k, j (xk )   pˆ k∗ (xk ) − (x ) m (1) (u j ) pˆ ∗j (u j )du j (Pk (m)(x))i,0 = m k (xk ) + m (1) k k pˆ k (xk ) 1≤ j≤d X j j * +   pˆ ∗jk (u j , xk ) pˆ jk (u j , xk ) (1) m j (u j ) + m j (u j ) du j , + pˆ k (xk ) pˆ k (xk ) 1≤ j≤d, j=k X−k, j (xk ) pˆ k∗ (xk ) pˆ k∗∗ (xk ) * +  ∗  pˆ k j (xk , u j ) pˆ ∗∗ jk (u j , x k ) m j (u j ) ∗∗ + m (1) du j , + (u ) j j pˆ k (xk ) pˆ k∗∗ (xk ) 1≤ j≤d, j=k X−k, j (xk )

(Pk  (m)(x))i,0 = m (1) k (x k ) + (m 0 + m k (x k ))

where for 1 ≤ j, k ≤ d with k = j


n  1 K X i (X i − x)dx−( jk) , n i=1 X−( jk) (x j ,xk ) h n  1 pˆ ∗jk (x j , xk ) = (X i j − u j )K hX i (X i − x)dx−( jk) , n i=1 X−( jk) (x j ,xk ) n  1 pˆ ∗∗ (x , x ) = (X i j − u j )(X ik − xk )K hX i (X i − x)dx−( jk) j k jk n i=1 X−( jk) (x j ,xk )

pˆ jk (x j , xk ) =

with X−( jk) (x j , xk ) = {u ∈ X : u k = xk , u j = x j } and X−k, j (xk ) = {u ∈ X j : there exists v ∈ X with vk = xk and v j = u} and u −( jk) denoting the vector (u l : l ∈ {1, . . . , d}\{ j, k}).

Appendix 2: Proofs of Propositions 1 and 2 In this section we will give proofs for Propositions 1 and 2. They were used in Sect. 3 for the discussion of the existence of the smooth backfitting estimator as well as the convergence of an algorithm for its calculation. Proof (of Proposition 1) (ii) ⇒ (i). Let g (n) ∈ L be a Cauchy sequence. We must show limn→∞ g (n) ∈ L. By definition of L there exist sequences g1(n) ∈ L 1 and g2(n) ∈ L 2 such that g (n) = g1(n) + g2(n) . With (8), for i = 1, 2 we obtain " 1" " " " (n) (m) " "gi − gi " ≤ "g (n) − g (m) " → 0. c Hence, g1(n) and g2(n) are Cauchy sequences. Since L 1 and L 2 are closed their limits are elements of L 1 ⊆ L and L 2 ⊆ L, respectively. Thus, lim g (n) = lim g1(n) + g2(n) ∈ L .

n→∞

n→∞

(i) ⇒ (iii). We write 1 (L 2 ) = 1 . Since L is closed, it is a Banach space. Using the closed graph theorem, it suffices to show the following: If g (n) ∈ L and 1 g (n) ∈ L 1 are converging sequences with limits g, g1 , then 1 g = g1 . Let g (n) ∈ L and 1 g (n) ∈ L 1 be sequences with limits g and g1 , respectively. Write g (n) = g1(n) + g2(n) . Since " " " " " " " (n) " (n) (m) " (m) " "g2 − g2 " ≤ "g1 − g1 " + "g (n) − g (m) " g2(n) is a Cauchy sequence converging to a limit g2 ∈ L 2 . We conclude g = g1 + g2 , meaning 1 g = g1 .


(iii) ⇒ (ii). If 1 is a bounded operator, then so is 2 , since g2 ≤ g + g1 . Denote the corresponding operator norms by C1 and C2 , respectively. Then max{ g1 , g2 } ≤ max{C1 , C2 } g which concludes the proof by choosing c = (iii) ⇔ (iv). This follows from 1 = sup g∈L

1 . max{C1 ,C2 }

g1 g1 g1 1 = sup = sup = . g γ(L 1 , L 2 ) g1 ∈L 1 ,g2 ∈L 2 g1 + g2 g1 ∈L 1 dist(g1 , L 2 )

Lemma 9 Let L_1, L_2 be closed subspaces of a Hilbert space. For γ defined as in Proposition 1 we have γ(L_1, L_2)² = 1 − ‖P_2 P_1‖².

Proof
γ(L_1, L_2)² = inf_{g_1∈L_1, ‖g_1‖=1} ‖g_1 − P_2 g_1‖²
 = inf_{g_1∈L_1, ‖g_1‖=1} ⟨g_1 − P_2 g_1, g_1 − P_2 g_1⟩
 = inf_{g_1∈L_1, ‖g_1‖=1} ( ⟨g_1, g_1⟩ − ⟨P_2 g_1, P_2 g_1⟩ )
 = 1 − sup_{g_1∈L_1, ‖g_1‖=1} ⟨P_2 g_1, P_2 g_1⟩
 = 1 − sup_{g∈L, ‖g‖=1} ⟨P_2 P_1 g, P_2 P_1 g⟩

= 1 − ‖P_2 P_1‖².

Proof (of Proposition 2) Let P_j be the orthogonal projection onto L_j. Following Lemma 9 we have 1 − ‖P_2 P_1‖² = γ(L_1, L_2)². Using Proposition 1, it therefore suffices to prove ‖P_2 P_1‖ < 1 in order to conclude that L is closed. Observe that ‖P_2 P_1‖ ≤ 1 because for g ∈ L
‖P_j g‖² = ⟨P_j g, P_j g⟩ = ⟨g, P_j g⟩ ≤ ‖g‖ ‖P_j g‖,
which yields ‖P_i‖ ≤ 1 for i = 1, 2. To show the strict inequality, note that if P_2 |_{L_1} is compact, so is P_2 P_1, since the composition of two operators is compact if at least one of them is compact. Thus, for every ε > 0, P_2 P_1 has at most a finite number of eigenvalues greater than ε. Since 1 is clearly not an eigenvalue, we conclude ‖P_2 P_1‖ < 1.
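Lemma 9 and Proposition 2 are easy to check numerically in finite dimensions. The sketch below is not from the paper: it draws two random subspaces of R^10, computes γ(L_1, L_2) directly from its definition as the minimal distance of unit vectors in L_1 to L_2, and compares the square of that value with 1 − ‖P_2 P_1‖².

```python
import numpy as np

rng = np.random.default_rng(3)

# Two random subspaces L1, L2 of R^10, given by orthonormal bases (columns).
Q1, _ = np.linalg.qr(rng.standard_normal((10, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((10, 4)))
P1, P2 = Q1 @ Q1.T, Q2 @ Q2.T          # orthogonal projections

op_norm = np.linalg.norm(P2 @ P1, 2)    # spectral norm of P2 P1
gamma_sq = 1.0 - op_norm ** 2

# gamma(L1, L2) = min over unit g1 in L1 of |(I - P2) g1|; minimise via the SVD of (I - P2) Q1.
M = (np.eye(10) - P2) @ Q1
gamma_direct = np.linalg.svd(M, compute_uv=False).min()

print(gamma_sq, gamma_direct ** 2)      # the two numbers agree up to rounding
```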



A Multivariate CLT for Weighted Sums with Rate of Convergence of Order O(1/n) Sagak A. Ayvazyan and Vladimir V. Ulyanov

Abstract The “typical” asymptotic behavior of the weighted sums of independent random vectors in k-dimensional space is considered. It is shown that in this case the rate of convergence in the multivariate central limit theorem is of order O(1/n). This extends the one-dimensional Klartag and Sodin (2011) result. Keywords Weighted sums · Improved rate of convergence · Multivariate central limit theorem

1 Introduction and Main Result

Let X, X_1, X_2, …, X_n be independent identically distributed random vectors in R^k with finite third absolute moment γ³ = E‖X‖³ < ∞, zero mean EX = 0 and unit covariance matrix cov(X) = I. Let Z be the standard Gaussian random vector in R^k with zero mean and unit covariance matrix. Denote by B the class of all Borel convex sets in R^k. Sazonov [1] obtained the following error bound for the approximation of the distribution of the normalized sum of random vectors by the standard multivariate normal law:

sup_{B∈B} | P( (1/√n) ∑_{i=1}^n X_i ∈ B ) − P(Z ∈ B) | ≤ C(k) γ³/√n,   (1)

where C(k) depends on the dimension k only.


The bound (1) is optimal in general. Moreover, the rate O(1/√n) cannot be improved under higher order moment assumptions. This is easy to see in the one-dimensional case k = 1 by taking X such that P(X = 1) = P(X = −1) = 1/2. However, the situation is different when we consider a weighted sum θ_1 X_1 + ··· + θ_n X_n, where ∑_{j=1}^n θ_j² = 1. If we are interested in the typical behavior of these sums for most θ, in the sense of the normalized Lebesgue measure λ_{n−1} on the unit sphere
S^{n−1} = { (θ_1, …, θ_n) : ∑_{j=1}^n θ_j² = 1 },

then we have to refer to a recent remarkable result due to Klartag and Sodin. In [2] they have shown that in the one-dimensional case k = 1, for any ρ with 0 < ρ < 1, there exists a set Q ⊆ S^{n−1} with λ_{n−1}(Q) > 1 − ρ and a constant C(ρ) depending on ρ only such that for any θ = (θ_1, …, θ_n) ∈ Q one has

sup_{a,b∈R, a<b} | P( a ≤ ∑_{i=1}^n θ_i X_i ≤ b ) − ∫_a^b (1/√(2π)) exp(−x²/2) dx | ≤ C(ρ) δ⁴/n.   (2)

The main result of the present paper extends this to the multivariate case: for any ρ > 0, there is a subset Q ⊆ S^{n−1} with λ_{n−1}(Q) > 1 − ρ and a constant C(ρ, k) such that for any θ = (θ_1, …, θ_n) ∈ Q one has

sup_{B∈B} | P( ∑_{j=1}^n θ_j X_j ∈ B ) − Φ(B) | ≤ C(ρ, k) δ⁴/n.

Moreover, C(ρ, k) ≤ C(k) ln²(1/ρ), where C(k) is a universal constant depending only on the dimension k. If we replace the class B with the smaller class B_0 of all centered ellipsoids, the situation changes noticeably. In this case, the distribution of the normalized sum of i.i.d. random vectors with θ_1 = ··· = θ_n = 1/√n is approximated by a Gaussian distribution on the class B_0 with an accuracy of order between o(1/√n) and O(1/n), for appropriate dimensions of the space and when the summands satisfy suitable moment conditions, for example finiteness of the fourth absolute moment. See, e.g., [6–8]. For the case of non-i.i.d. random vectors see [9].
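To get a feel for the "typical θ" phenomenon, the following simulation sketch (not from the paper) draws one direction uniformly from S^{n−1}, uses Rademacher coordinates as a simple example of X with zero mean and unit variance, and estimates the Kolmogorov distance between the law of the weighted sum and the standard normal; for a typical θ the distance shrinks in n much faster than the worst-case rate 1/√n.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

def ks_distance_to_normal(n, n_mc=20_000):
    theta = rng.standard_normal(n)
    theta /= np.linalg.norm(theta)                  # a "typical" point of S^{n-1}
    X = rng.choice([-1.0, 1.0], size=(n_mc, n))     # Rademacher coordinates
    S = np.sort(X @ theta)                          # Monte Carlo sample of the weighted sum
    ecdf = np.arange(1, n_mc + 1) / n_mc
    Phi = 0.5 * (1.0 + np.array([erf(s / sqrt(2.0)) for s in S]))
    return float(np.max(np.abs(ecdf - Phi)))

# Monte Carlo error (~1/sqrt(n_mc)) limits the resolution, but the trend is visible.
for n in (25, 100, 400):
    print(n, ks_distance_to_normal(n))
```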

2 Notation and Auxiliary Results

Notation. In the following, the weight coefficients θ_1, θ_2, …, θ_n will, according to the statement of the theorem, belong to the unit sphere
S^{n−1} = { (θ_1, θ_2, …, θ_n) : ∑_{j=1}^n θ_j² = 1 }.

Define = ( 1 , 2 , . . . , n ) as a random vector uniformly distributed on S n−1 , and for a given set Q ⊆ S n−1 denote the normalized Lebesgue measure on the unit sphere as λn−1 (Q) = P( ∈ Q). We denote the absolute fourth-order moment of the random vector X j , j = 1, . . . , n, as δ 4j = EX j 4 , (3) fourth-order weighted absolute moment δθ4 =

∑_{j=1}^n θ_j⁴ δ_j⁴,   (4)

and the averaged absolute moment of the fourth order

δ⁴ = (1/n) ∑_{j=1}^n δ_j⁴.   (5)

We introduce truncated random variables Y j and Z j , j = 1, . . . , n, as Y j = X j 1 X (θ j X j  ≤ 1),

(6)

Z j = Y j − EY j , where 1 X (A) is the indicator function of an event A. For these random variables, we define the weighted expectation and the weighted covariance matrix as An =

n 

θ j EY j ,

j=1

D=

n 

θ 2j cov(Z j ),

(7)

j=1

Q 2 = D −1 .

(8)

We also use the notation for the distributions of random vectors appeared in the proof: over all Borel sets B n

 FX (B) = P θj X j ∈ B , (9) j=1

FY (B) = P

n



θjYj ∈ B ,

(10)

j=1

FZ (B) = P

n



θj Z j ∈ B ,

j=1

FX j (B) = P(θ j X j ∈ B), j = 1, . . . , n,

(11)

FY j (B) = P(θ j Y j ∈ B), j = 1, . . . , n.

(12)

Also a,V will denote the distribution of the multidimensional normal law with expectation a and covariance matrix V. The proof will use the technique of characteristic functions. The characteristic function of the random vector Z j is defined as (13) ϕ j (t) = E exp(i t, Z j ), j = 1, . . . , n.


ˆ a,V as the corresponding characteristic funcWe denote Fˆ Z = nj=1 ϕ j (θ j t) and   tions of weighted sums of random vectors nj=1 Z j θ j , and a random vector with multivariate normal distribution. The absolute moment of order s of the random vector Z is defined as ρs (Z ) = EZ s . Also, for the coefficients θ1 , θ2 , . . . , θn , the weighted absolute moment of order s is determined as n  ρs = ρs (θ j Z j ) (14) j=1

and ηs =

n 

ρs (Qθ j Z j ).

(15)

j=1

For 1 ≤ m ≤ s − 1 we have ρs (Qθ j Z j ) ≤ Qs ρs (θ j Z j ) = Qs ρs (θ j (Y j − EY j )) ≤ Qs 2s ρs (θ j Y j ) ≤ Qs 2s ρs−m (θ j X j ).

(16)

For a given nonnegative vector α, we define α the moment of the random vector Z as μα (Z ) = EZ α , (17) for this value the following inequality holds μα (Z ) ≤ ρ|α| (Z ).

(18)

Let the random vector Z j have finite absolute moments of order m. Then the characteristic function in a neighborhood of zero satisfies the Taylor expansion ϕ j (t) = 1 +

 1≤|ν|≤m

μν (Z j )

(it)ν + o(tm ), j = 1 . . . n, ν!

as t → 0. We define the logarithm of a nonzero complex number z = r exp(iξ ) as log(z) = log(r ) + iξ, where r > 0, ξ ∈ (−π, π ]. Thus, we always take the so-called main branch of the logarithm. The characteristic function of the random vector Z j is continuous and equal to one at zero. Consequently, in a neighborhood of zero, the Taylor expansion takes place


log(ϕ j (t)) =



κν (Z j )

1≤|ν|≤m

=

(it)ν + o(tm ) ν!

 κr (Z j , it) + o(tm ), r ! 1≤r ≤m

where κr (Z j , t) =

 r! κν (Z j )t ν . ν! |ν|=r

(19)

The expansion coefficients of the logarithm of the characteristic function κν are called the cumulants of the random vector Z j . The cumulants κν are explicitly expressed in terms of moments (see Chap. 2 Sect. 6 in [10]). In particular, the following inequality holds: |κν (Z j )| ≤ c1 (ν)ρ|ν| (Z j ), j = 1 . . . n. (20) We also point out the most important property of the cumulants of the sum of random vectors and denote the cumulants of the weighted sum κν

n



n  |ν| θj Z j = θ j κν (Z j ).

j=1

and κν = κν

(21)

j=1

n



θj Z j



j=1

and κr (t) = κr

n



θj Z j, t .

j=1

Also for the characteristic function of the weighted sum, in a neighborhood of zero, one has n m

 κr (it) log ϕ j (θ j t) = + o(tm ), r ! r =1 j=1 Let us define the polynomials Pr (t, ·) from the formal expression 1+

∞  r =1

Pr (t, κν (Z ))u r = exp



 κ s=1

u ,

s+2 (Z , t) s

(s + 2)!

(22)


explicitly, we obtain the expressions P0 (t, κν (Z )) = 1, Pr (t, κν (Z )) =

r  1 m! m=1



  i 1 ,...,i m : mj=1 i j =r

 ν1 ,...,νm :|ν j |=i j

 κν1 (Z ) . . . κνm (Z ) ν1 +···+νm t . ν1 ! . . . νm ! +2

(23)

For a given positive vector α and a number m, we denote D α and Dm as differential operators ∂ α1 +α2 +···+αk f (t) D α f (t) = (∂t1 )α1 (∂t2 )α2 . . . (∂tk )αk and Dm f (t) =

∂ f (t) . ∂tm

Lemma 1 Let Z 1 , Z 2 , . . . , Z n be independent random vectors (nondegenerate at zero) with a finite absolute moment of order s. Then for any 2 < r ≤ s 1

ρ r −2

r

r 2

ρ2



1

ρ s−2

s s

ρ22

,

where ρs is defined in (14). Proof The proof of Lemma follows the scheme of the proof of Lemma 6.2 [10]. Let us show that log ρr is a convex function on [2, s], where ρs = nj=1 ρs (θ j Z j ) = n s j=1 Eθ j Z j  (see 14). This follows from Hölder’s inequality for α + β = 1 and r1 , r2 ∈ [2, s] ραr1 +βr2 (Z j ) ≤ ρrα1 (Z j )ρrβ2 (Z j ), ραr1 +βr2 ≤

n 

ρrα1 (θ j Z j )ρrβ2 (θ j Z j )

j=1



n

 j=1

ρr1 (θ j Z j )

n α 

ρr2 (θ j Z j )

β

= ρrα1 ρrβ2 .

j=1

so log(ραr1 +βr2 ) ≤ α log(ρr1 ) + β log(ρr2 ).


Further, suppose that ρ2 = 1, then 1

log(ρrr −2 ) =

log(ρr ) − log(ρ2 ) r −2

increases due to the fact that it is the slope between the points (2, ρ2 ) and (r, ρr ) of the function ρr . In the general case, when ρ2 = 1, it is necessary to consider the √ random vectors Zˆ j = Z j / ρ , j = 1, . . . , n. 2
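Lemma 1 states, in cleaned-up form, that (ρ_r / ρ_2^{r/2})^{1/(r−2)} is non-decreasing in r on (2, s]. A quick numerical check of this monotonicity follows; it is not from the paper and uses fixed (point-mass) vectors Z_j so that the weighted moments reduce to explicit sums.

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((6, 3))                    # fixed (degenerate) vectors Z_j in R^3
theta = rng.standard_normal(6)
theta /= np.linalg.norm(theta)

def rho(s):
    # Weighted absolute moment of order s as in (14), here with point-mass Z_j.
    return sum(np.linalg.norm(t * z) ** s for t, z in zip(theta, Z))

vals = [(rho(r) / rho(2) ** (r / 2)) ** (1.0 / (r - 2)) for r in (3, 4, 5, 6)]
print(vals, all(a <= b + 1e-12 for a, b in zip(vals, vals[1:])))   # non-decreasing: True
```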

Lemma 2 Let X 1 , , . . . , X n be independent random vectors with zero mean, unit covariance matrix and finite fourth absolute moment. If the following condition holds δθ4 ≤ (8k)−1 , then the weighted covariance matrix D satisfies      t, Dt − t2  ≤ 2kδθ4 t2 and D − I  ≤

1 , 4

3 5 4 ≤ D ≤ , D −1  ≤ . 4 4 3

Where δθ4 and matrix D are defined in (4) and (7). Proof The proof of Lemma follows the scheme of the proof of Corollary 14.2 [10]. First, we prove two auxiliary inequalities for the mathematical expectations of the original and truncated random vectors         Eθ j Y ji  = EX ji θ j 1 X (X j θ j  > 1) ≤ EX j θ j 1 X (X j θ j  > 1) = θ 4j EX j 4 ≤ θ 4j δ 4j , and         θ 2j EX ji X jl − EY ji Y jl  = θ 2j EX ji X jl 1 X (θ j X ji  > 1) ≤ Eθ j X j 2 1 X (θ j X ji  > 1) ≤ θ 4j EX j 4 = θ 4j δ 4j . Also, note that |Eθ j Y ji | = |Eθ j X ji 1 X (X j θ j  ≤ 1)| ≤ 1. for i, j = 1, . . . , k. By definiDefine Kronecker delta function as δi j = 1 X (i = j), tion of covariance matrix of the weighted sums D = nj=1 θ 2j cov(Z j ) ( see (7)), one has k         ti tl (dil − δil ),  t, Dt − t, t  =  i,l

wherein


n          θ 2j cov(X ji , X jl ) − cov(Y ji , Y jl ) dil − δil  ≤ j=1



n 

n

     θ 4j δ 4j + θ 4j δ 4j = 2δθ4 , θ 2j EX ji X jl − EY ji Y jl + EY ji EY jl  ≤

j=1

j=1

where δ 4j = EX j 4 , δ 4 =

1 n

n j=1

δ 4j ( see (4), (5)). Finally we get

k k  

 2   ti tl (dil − δil ) ≤ 2δθ4 |ti | ≤ 2kδθ4 t2 .  i,l

i

By the definition of the matrix norm, it follows that     D − I  = sup  t, (D − I )t  ≤ 2kδθ4 . t≤1

Since δθ4 < (8k)−1 , then D − I  ≤ Further,

1 , 4

3 5 ≤ D ≤ . 4 4

1 3 t, Dt ≥ t2 − t2 = t2 . 4 4

Therefore, for the inverse matrix D −1 one has D −1  ≤

4 . 3

Lemma 3 Let Z 1 , Z 2 , . . . , Z n be independent random vectors with zero mean, nondegenerate covariance matrix D (see (7)) and finite absolute moment ρs of order s ≥ 3. Then there are  constantsc1 (k, s) and c2 (k, s) such that for any |α| ≤ s and −1

1 − s−2

t ≤ c1 (k, s) min ηs s , ηs

one has

n s−3   

1   α  ϕ j (θ j Qt) − exp − t2 Pr (i Qt, κν )  D 2 r =0 j=1



1 ≤ c2 (k, s)ηs (ts−|α| + t3(s−2)+|α| ) exp − t2 , 4


where ϕ(t) is defined in (13), matrix Q in (8), ηs in (15) and polynomial Pr (t, κν ) to (22), (23). Proof The proof follows the scheme of the proof of Theorem 9.11 [10]. First, suppose that the covariance  matrix is D = I , then  from the definition of weighted absolute moments ρs = nj=1 ρs (θ j Z j ), ηs = nj=1 ρs (Qθ j Z j )( see (14), (15)), where Q 2 = D −1 (see (8)), it follows that ηs = ρs . Using Hölder’s inequality, we obtain that − 1s

for the characteristic function of the random vector Z j at t ≤ ρs inequality holds E t, θ j Z j 2 t2 Eθ j Z j 2 ≤ 2 2

|ϕ j (θ j t) − 1| ≤ ≤

the following

2s t2

t2 2s 1 Eθ j Z j s ≤ ρs ≤ , 2 2 2

therefore, the characteristic function of the random vector θ j Z j , defined as ϕ j (t) = E exp(it Z j ) (see 13), does not vanish in a given interval, and therefore the following functions can be defined

h j (t) = log(ϕ j (θ j t)) − −

h(t) =

θ 2j t2

n 

s−3  κr +2 (Z j , it) , + (r + 2)! r =1

h j (t),

j

 κr +2 (it) 1 ζ (t) = t2 + , 2 (r + 2)! r =1 s−3

 ν νt where κr (t) = |ν|=r κν! (see (21), (19)). Since the weighted absolute moment of the second order is ρ2 = k, then by Lemma 1 for 2 < r ≤ s, (ρr ) r −2 ≤ (ρs ) s−2 (ρ2 ) r −2 − s−2 = (ρs ) s−2 (k) r −2 − s−2 ≤ (ρs ) s−2 k 1

1

1

1

1

1

1

1

and for the cumulants of the distribution κν , for 2 < |ν| ≤ s, due to (18), (20) 1

1

1

1

|κν | |ν|−2 ≤ (c1 (ν)ρ|ν| ) |ν|−2 ≤ c(ν)kρss−2 ≤ cˆ1 (s, k)ρss−2 . Next, consider the expression

A Multivariate CLT for Weighted Sums with Rate of Convergence of Order O(1/n)



n 

235

   ϕ j (θ j t) − exp(ζ (t)) = D α (exp(h(t) − 1) exp(ζ (t))

j=1

=



c(α, β)D β exp(ζ (t))D α−β (exp(h(t)) − 1).

(24)

0≤β≤a

Denote c(s, k) =

s−3  

1 ν! r =0 |ν|=r +2

1 −1 and notice that for t ≤ cˆ1 (s, k)8c(s, k)ρss−2 the following chain of inequalities holds 1 s−3  s−3  s−2 r κν ν     r +2 (cˆ1 (s, k)ρs ) t ≤ t  ν! ν! r =0 |ν|=r +2 r =0 |ν|=r +2 1

s−3   (cˆ1 (s, k)tρss−2 )r ≤ t ν! r =0 |ν|=r +2 2

≤ t2

≤ t2

r 1 1

− 1 cˆ1 (s, k)ρss−2 ρs s−2 (cˆ1 (s, k)8c(s, k))−1 ν! r =0 |ν|=r +2

s−3  

s−3  

s−3 1 t2   1 t2 ≤ = , ν!8c(s, k)r 8c(s, k) r =0 |ν|=r +2 ν! 8 r =0 |ν|=r +2

(25)

therefore, the module of the function ζ (t) is bounded |ζ (t)| ≤

t2 t2 5t2 + = . 2 8 8

Similarly, it can be shown that the modulus of the derivative of this function s−3 s−3         κν ν   κν  β    t = t ν−β  D ζ (t) = D β ν! (ν − β)! r =0 |ν|=r +2 r =max{0,|β|−2} |ν|=r +2,ν≥β



s−3 



r =max{0,|β|−2} |ν|=r +2,ν≥β

r

(cˆ1 (s, k)ρs ) s−2 tr +2−|β| (ν − β)!




s−3 



1 −r (cˆ1 (s, k)ρs ) s−2

cˆ1 (s, k)8c(s, k)ρss−2 t2−|β| (ν − β)! r

r =max{0,|β|−2} |ν|=r +2,ν≥β

≤ c2 (s, k, β)t2−|β| . non-negative numbers β1 , β2 , . . . , βr non-negative vecFurther, let j1 , j2 , . . . , jr be tors satisfying the equality ri=1 ji βi = β, since 2

t

r  i=1

ji

≤ t2 + t2−|β| ,

then the derivative ζ (t) has the following representation for t = 0 r    ji (2−|β j |)  β1 j1 βr jr  (D ζ (t)) . . . (D ζ (t))  ≤ c3 (s, k)ti=1

≤ c3 (s, k)(t2−|β| + t|β| ),

1 −1 Lemma 9.2 [10] implies that for t ≤ cˆ1 (s, k)8c(s, k)ρss−2     β D exp(ζ (t)) ≤ c4 (s, k)(t2−|β| + t|β| ) exp(ζ (t))

3 ≤ c4 (s, k)(t2−|β| + t|β| ) exp − t2 . 8

(26)

Further, for β : 0 ≤ |β| ≤ s, one has ˆ

ˆ ≤ s − |β| − 1]. D β (D β h j )(0) = 0, [0 ≤ |β| Therefore, applying Corollary 8.3 [10] and the fact that g ≡ D β h j we obtain the inequality    β  D h j (t) ≤

 ˆ |β|=s−|β|

ˆ 

 |t β |  ˆ  sup (D β g)(ut) : 0 ≤ u ≤ 1 . ˆ β!

ˆ = s − |β|, then, by Lemma 9.4 [10], the following relation holds: If |β|         ˆ   ˆ   βˆ D g(ut) = D β+β h j (ut) = D β+β log ϕ j (θ j ut)|θ j |s ≤ |θ j |s c2 (s)ρs (Z j ),

A Multivariate CLT for Weighted Sums with Rate of Convergence of Order O(1/n)

so

 

  β D h(t) ≤ c2 (s)ρs ts−|β|

 ˆ |β|=s−|β|

1 . βˆ

237

(27)

If β = 0, then similarly as in (25) we get that |h(t)| ≤ c2 (s)

 1 t2 ρs ts ≤ . 8 βˆ ˆ |β|=s

If α − β = 0       α−β   (exp(h(t)) − 1)) =  exp(h(t)) − 1 ≤ |h(t)| exp(|h(t)|) D ≤ c5 (s, k)ρs ts exp If α > β

t2 . 8

(28)

D α−β exp(h(t)) − 1 = D α−β exp(h(t)),

then in this case the derivative is represented as a linear combination of the following form (D β1 h(t)) j1 . . . (D βr h(t)) jr exp(h(t)),  where ri=1 ji βi = α − β. From (27) and the inequality xa ≤ xb + xc , 0 ≤ b ≤ a ≤ c, it follows that     β1 (D h(t)) j1 . . . (D βr h(t)) jr  r r  

1 (s−2)( ji −1) s−2+2 ji −|α−β| s−2 i=1 i=1 ≤ c6 (s, k)ρs tρs t

≤ c7 (s, k)ρs (ts−|α−β| + ts−|α−β|−2 ). Therefore, if α > β, then  

t2  α−β  . (exp(h(t)) − 1) ≤ c8 (s, k)ρs (ts−|α−β| + ts−|α−β|+2 ) exp D 8 (29) Using (26), (28), (29) in (24), we get




n 

s−3

t2  κr +2 (it)  ϕ j (θ j t) − exp − + 2 (r + 2)! r =1 j=1

t2 . ≤ c9 (s, k)ρs (ts−|α| + ts−|α|−2 ) exp − 4 Next, we use Lemma 9.7 [10] when setting u = 1 and the inequality 1

1

|κν | |ν|−2 ≤ cˆ1 (s, k)ρss−2 , for 2 < |ν| ≤ s. Also taking into account that the derivative is represented as a linear combination of terms of the following form D α exp and

−t2 2

f (t) =

 0≤β≤α

D α−β exp

−t2 2

D β f (t),



−t2 

−t2   α−β , exp  ≤ c(α − β, k)(1 + t|α−β| ) exp D 2 2

we get that s−3 s−3  

t2 

−t2   κr +2 (it)  α  − exp + Pr (it, κν )   D exp − 2 (r + 2)! 2 r =1 r =1



1 ≤ c(s, k)ρs (ts−|α + t3(s−2)+|α| ) exp − t2 , 4 which completes the proof of the Lemma for the case when the covariance matrix D = I. In order to prove in the general case D = I , it is necessary to consider the transformed sequence of random vectors Q Z 1 , Q Z 2 , . . . , Q Z n , then nj=1 ϕ j (θ j Qt)  will be the characteristic function of the sum of random vectors Q( nj=1 Z j ), and in this case, the weighted sum of random vectors has a unit covariance matrix, and the Lemma holds for these random vectors. Lemma 4 Let X 1 , X 2 , . . . , X n be independent random vectors with zero mean, unit covariance  matrix and finite moment. Let l > 0 be such that for  fourth absolute  2 subset U = j : |θ j | ≤ l/δ 2j , one has θ j ≥ 1/8. If δθ4 ≤ (8k)−1 , then for t ≤ j∈U √ (8 kl)−1 , it holds


n  

1   α ϕ j (θ j t) ≤ c1 (α, k)(1 + t|α| ) exp − t2 . D 48 j=1

Where δθ4 is defined in (5) and ϕ j (t) in (13). Proof The proof of Lemma follows the scheme of proofs of Lemma 2.2 [2] and Lemma 14.3 [10]. For the random vector Z j = Y j − EY j , where Y j = X j 1 X (θ j X j  ≤ 1), (see 6, 2), consider the expansion of the characteristic function of a Taylor series as t tends to zero, 1 1 ϕ j (θ j t) = 1 − θ 2j E Z j , t 2 + θ 3j E Z j , t 3 ξ, 2 6 √ −1 where |ξ | ≤ 1. For t ≤ (8 kl) and j ∈ U one has θ 2j E Z j , t 2 ≤ θ 2j t2 EZ j 2 ≤

EZ j 2 4EX j 2 k ≤ = < 1. 4 4 64kδ j 64kδ j 16kδ 4j

Also |θ j |3 E| Z j , t |3 ≤ |θ j |3 t3 EZ j 3 ≤ |θ j |2 t2

23 EX j 3 √ 8 kδ 2j

√ 23 kδ 2j ≤ |θ j | t √ 2 ≤ |θ j |2 t2 , 8 kδ j 2

hence

And

2

1 1 − θ 2j E Z j , t 2 > 0. 2 1 1 |ϕ j (θ j t)| ≤ 1 − θ 2j E Z j , t 2 + |θ j |3 E| Z j , t |3 2 6

1 1 ≤ exp − θ 2j E Z j , t 2 + |θ j |3 E| Z j , t |3 2 6

1 1 ≤ exp − θ 2j E Z j , t 2 + θ 2j t2 . 2 6

Further, note that exp ≤ exp

1

1 θ 2j E Z j , t 2 − |θ j |3 E| Z j , t |3 2 6

1



2 1 2 . (E|θ j Z j , t |3 ) 3 − |θ j |3 E| Z j , t |3 ≤ exp 2 6 3


Now we denote the subset Nr = { j1 , j2√ , . . . , jr } as a subset of N = {1, 2, . . . , n}, consisting of r elements, for t ≤ (8 kl)−1 the following chain of inequalities holds    

  1  1     − θ 2j E Z j , t 2 + θ 2j t2 ϕ j (θ j t) ≤  ϕ j (θ j t) ≤ exp  2 6 j∈N \N j∈U\N j∈U\N r

exp

r

 j∈U∩Nr



r

 1 1 − θ 2j E Z j , t 2 + θ 2j t2 − 2 6

≤ exp

 j∈U



 j∈U∩Nr

 1 1 − θ 2j E Z j , t 2 + θ 2j t2 2 6



2r 1 1 − θ 2j E Z j , t 2 + θ 2j t2 exp 2 6 3



n 1 2 1  2 θ j E Z j , t 2 + θ j E Z j , t 2 ≤ exp − 2 j=1 2 j∈N \U



2r 1  2 2 1  2 2 1 2 θ j t + θ j t + t exp 2 j∈N \U 2 j∈N \U 6 3

 1 

2r

1  1 1 . = exp − Dt, t + D2 t, t − t2 θ 2j + θ 2j t2 + t2 exp 2 2 2 6 3 j∈N \U

j∈N \U

Note that the norm of the covariance matrix satisfies the estimate D > 3/4, also, the inequality for the sum of the weight coefficients is satisfied  by definition, 2 θ ≤ 1/8. If we denote the matrix D2 as the weighted sum of the covariance j∈N \U j matrices of the random vectors X j , j ∈ N \ U, then by Lemma 2 we obtain     1   θ 2j  ≤ 2k θ 4j δ 4j t2 ≤ t2 ,  t, D2 t − t2 4 j∈N \U

so

j∈N \U

 

3 1

2r 1 1 2   t exp ϕ j (θ j t) ≤ exp − + + +  8 8 16 6 3 j∈N \N r



2r

1 . = exp − t2 exp 48 3 For α = 0 the statement of the Lemma is proved. Before proceeding to the proof of the case α = 0, consider the modulus of the derivative of the characteristic function of the random vector Z j


  

      Dm ϕ j (θ j t) = |θ j |EZ j,m exp i θ j t, Z j ,

(30)

If a positive vector β satisfies the condition |β| = 1 then   



      Dβ ϕ j (θ j t) = |θ j |EZ j,β exp i θ j t, Z j − 1      ≤ |θ j |EZ j,β θ j t, Z j  ≤ θ 2j tEZ j 2 ≤ θ 2j ρ2 (Z j )t, from (30) we also get that for any vector β with |β| ≥ 2    β  β D ϕ j (θ j t) ≤ |θ j ||β| E|Z j | ≤ 2|β| θ 2j ρ2 (X j ), finally we get that for any non-negative vector β > 0    β  D ϕ j (θ j t) ≤ c2 (α, k)θ 2j ρ2 (X j ) max{1, t}. Now consider the positive vector α > 0, according to the rule of differentiation of the product of functions, we obtain that Dα

n

j=1

ϕ j (θ j t) =



ϕ j (θ j t)D β1 ϕ j1 (θ j1 t) . . . D βr ϕ jr (θ jr t),

(31)

j∈N \Nr

where Nr = { j1 , . . . , jr }, 1 ≤ r ≤ |α|, β1 , β2 , β3 , . . . , βr vectors that meet the conditions |β j | ≥ 1 (1 ≤ j ≤ r ) and rj=1 β j = α. The number of multiplications in each of the n α terms of the expression (31) is α1 ! . . . αk ! , r k j=1 i=1 β ji ! where α = (α1 , . . . , αk ) and β j = (β j1 , . . . , β jk ), 1 ≤ j ≤ r. Each term in the expression (31) is bounded by the value exp

2r 3



1 t bj, 48 j∈N r

where b j = c2 (α, k)ρ2 (X j )θ 2j max{1, t}, therefore from (31) we obtain n  

2r   1   α ϕ j (θ j t) ≤ c3 (α, r ) exp − t bj, D 3 48 r j∈N 1≤r ≤|α| j=1 r


where the outer summation is over all r elements from N . It remains to evaluate the expression  r

j∈Nr

bj ≤

n



r bj

= (c2 (α, k)ρ2 max{1, t})r = (c2 (α, k)k)r (1 + tr ),

j=1

which completes the proof. Lemma 5 Let X 1 , X 2 , . . . , X n be independent random vectors with zero mean, unit covariance matrix and finite fourth absolute moment. Let ( 1 , 2 , . . . , n ) be a random vector uniformly distributed on the unit sphere S n−1 . Then with probability

greater than 1 − C`2 (α, k) exp − c`2 (k) δn4 one has  √ β(k) n ≤t≤ δn4 δ2

n   δ4  α  ϕ j ( j t)dt ≤ C3 (α, c, k) . D n j=1

Where ϕ j (t) is defined in (13) and δ 4 in (5). Proof The proof of Lemma follows the scheme of the proof of Lemma 3.5 [2]. Let us estimate the modulus of the characteristic function of the truncated random vector Y j = X j 1 X ( j X j  ≤ 1), (see (6) ), denote g j (t) as the characteristic function of the random vector X j . First, using the Chebyshev inequality, we obtain the estimate E1 X (θ j X j  ≥ 1) = P(θ j X j  ≥ 1) ≤ EX j 2 θ 2j = kθ 2j , then, applying a number of transformations, we obtain the following chain of inequalities for estimating the characteristic function 



           E exp i j t, Y j  j  = E exp i j t, X j 1 X ( j X j  ≤ 1)  j  



     = E exp i j t, X j 1 X ( j X j  ≤ 1) 1 X ( j X j  ≤ 1) + 1 X ( j X j  > 1)  j 

  

   = E exp i j t, X j 1 X ( j X j  ≤ 1) + E1 X ( j X j  > 1) j   



  = E exp i j t, X j − exp i j t, X j 1 X ( j X j  > 1)     +E1 X ( j X j  > 1) j 


 





     = E exp i j t, X j + E1 X ( j X j  > 1) 1 − exp i j t, X j  j  



     ≤ |g j ( j t)| + E1 X ( j X j  > 1) 1 − exp i j t, X j  j    

   ≤ |g j ( j t)| + E1 X ( j X j  > 1)1 − exp i j t, X j   j   ≤ |g j ( j t)| + 2E1 X ( j X j  > 1) j ≤ |g j ( j t)| + 2k 2j .

(32)

Next, we will show that for any r one has E|g j ( j r )|2 ≤ 1 − c1 min

 r 2 1  . , n δ 4j

For r = 0 the inequality holds automatically. Therefore, the case r > 0 is considered below. Let us denote X j as an independent copy of the random vector X j , and define a random vector Xˆ = X j − X j with the corresponding distribution Fˆx . Also denote f n and Jn as the distribution density and characteristic function of the component of a random vector uniformly distributed on the unit sphere, j . Further, changing the order of integration, we obtain  E|g j ( j r )|2 = =

 

|g j (r t)|2 f n (t)dt =

 



exp it t, y d Fˆx f n (t)dt



 Jn ( r, x )d Fˆx = EJn ( r, Xˆ ) f n (t) exp it r, y dt d Fˆx =

r, Xˆ . = EJn r  r 

Lemma 3.3 [2] implies that the estimate holds for the characteristic function of a random variable uniformly distributed on the unit sphere

 r 2 r, Xˆ 2  r, Xˆ ≤ 1 − c3 E min ,1 . EJn r  r  n r  Let us define the random variable X  as X  =

r, Xˆ 2 2r 2

and τ =

r 2 , n

then

k   2 2 r j ri Xˆ j Xˆ i  ri Xˆ i + 2 2r 2 i=1 1≤ j hence E|g j ( j r )|2 ≤ 1 − c1 min

τ , 2

 r 2 1  . , n δ 4j

Similarly, consider Z j − Z j , where Z j is an independent copy of Z j , also note that ˆ 2 . And, using the inequality (32), we obtain the estimate E 2j = n −1 and E 4j ≤ C/n 



      E|ϕ j ( j t)|2 = E E exp i j t, Z j − Z j  j = E E exp i j t, Y j − Y j  j

 

1 1 ≤ E |g j ( j t)| + 2k 2j |g j (− j t)| + 2k 2j ≤ E|g j ( j t)|2 + 2k + 4k 2 Cˆ 2 n n ≤ 1 − c1 min

1 1 1 , 4 + 2k + 4k 2 Cˆ 2 . n δ n n j

 t2

Since δ 4j ≥ k 2 and in the region t2 ≥ n/δ 4 , then the estimate holds

 t2 1  1 1 −1

C −1 1 − c1 min , 4 + 2k + 4k Cˆ 2 ≤ 1− 2 . n δj n n k In Lemma 4 it was shown that     β D ϕ j (θ j t) ≤ c2 (α, k)θ 2j ρ2 (X j ) max{1, t}, from this it immediately follows 2 1 21



  2 ED β ϕ j ( j t) ≤ c2 (α, k)ρ2 (X j ) max{1, t} E 4j = c2 (α, k)ρ2 (X j ) max{1, t}

 Cˆ n

.


Further, using Theorem 1 [14], we obtain      ϕ j ( j t) D β1 ϕ j1 ( j1 ) . . . D βr ϕ jr ( jr ) E j∈Nr



  1   1 1  E|ϕ j ( j t)|2 2 E|D β1 ϕ j1 ( j1 )|2 2 . . . E|D βr ϕ jr ( jr )|2 2 . j∈Nr

  Let us denote the subset of indices G = j : δ j < 2δ and we come to the conclusion that n 1 4 1  4 n − |G| δj ≥ δj ≥ 16δ 4 , δ4 = n j=1 n n j ∈G /

therefore, |G| ≥ n/2. Due to this, the following chain of inequalities is valid n n

   t2 1  1 1 21   1 − c1 min ϕ j ( j t) ≤ , 4 + 2k + 4k Cˆ 2 E n δj n n j=1 j=1





1 − c1 min

j∈G

 t2 1  1 1 21 , 4 + 2k + 4k 2 Cˆ 2 n δj n n

 t2 1  1 1 n4 ≤ 1 − c5 min , 4 + 2k + 4k 2 Cˆ 2 n δ n n ≤ exp



  ˆ 2 exp − c6 min t2 , n . + Ck 2 δ4

k

Finally, we come to the conclusion 21

−r 

k



2 n ˆ 2 1− C E|ϕ j ( j t)|2 ≤ exp , + Ck exp − c min t , 6 2 k2 δ4 j∈N r

and similarly, as in the proof of Lemma 4, we obtain the estimate n  

 n    ϕ j (θ j t) ≤ c3 (α, k)(1 + t|α| ) exp − c6 min t2 , 4 E D α δ j

and therefore


 √ β(k) n ≤t≤ δn4 δ2

 ≤ √ β(k) n ≤t≤ δn4 δ2

n     ED α ϕ j ( j t)dt j=1

 n  c3 (α, k)(1 + t|α| ) exp − c7 min t2 , 4 dt δ

 ≤ √ β(k) n ≤t≤ δn4 δ2

n c3 (α, k)(1 + t|α| ) exp − c8 4 dt 2δ

n ≤ C 7 (α, c, k) exp − c9 (k) 4 . δ Using the Chebyshev inequality, 

P

√ β(k) n ≤t≤ δn4 δ2

 n  

n  α  ϕ j ( j t)dt ≥ C 7 (α, k) exp − c9 (k) 4 D δ j=1





n C 7 (α, k) exp − c9 (k) 4 . δ

We see that there is a subset Q2 ⊆ S n−1 on the unit sphere with probability

n P( ∈ Q2 ) ≥ 1 − C` 2 (α, k) exp − c`2 (k) 2 , δ such that for any vector of weight coefficients (θ1 , θ2 , . . . , θn ) ∈ Q2 the inequality  √ β(k) n ≤t≤ δn4 δ2

 n  

n δ4  α  ϕ j (θ j t)dt ≤ C 7 (α, k) exp − c9 (k) 4 ≤ C3 (α, k) . D δ n j=1

Lemma 6 Let Z 1 , Z 2 , . . . , Z n be independent random vectors with zero mean and finite absolute moment of order r + 2, r ≥ 1. If the weighted covariance matrix D satisfies D > 3/4, then for any positive vector α such that |α| ≤ 3r the following inequalities hold  

1    α  |κν |,  D P1 (it, κν ) exp − Dt, t dt ≤ C6 (α, k) 2 |ν|=3


 

1    α D Pr (it, κν ) exp − Dt, t dt ≤ C(α, k, r )ρr +2 , 2 for r ≥ 2 , also 

1 

3  α  D exp − Dt, t  ≤ C(α, k)(1 + t|α| ) exp − t2 . 2 8 Where the matrix D is defined in (7), ρr +2 in (14) and the polynomial Pr (t, κν ) in (22), (23). Proof The proof of Lemma follows the scheme of the proof of Lemma 9.5 [10]. Lemma 9.3 [10] implies that 

1 

3   α D exp − Dt, t  ≤ C(α, k)(1 + t|α| ) exp − t2 , 2 8 now we show that for any positive vector α such that 0 ≤ |α| ≤ 3r one has     α  D Pr (z, κν ) ≤ C3 (α, r )(1 + ρ2r −1 )(1 + z3r −|α| )ρr +2 . Differentiating the polynomial of the asymptotic expansion, we obtain the following representation D α Pr (z, κν ) =

r  1 m! m=1

×

  i 1 ,...,i m : mj=1 i j =r



 ν1 ,...,νm :|ν j |=i j

κν1 . . . κνm ν !. . . . νm ! +2 1



(ν1 + · · · + νm )! z ν1 +···+νm −α , ((ν1 + · · · + νm − α)!)

 |ν| where κν = nj=1 θ j κν (Z j ), (see (21), (22), (23)). Estimating the cumulants in this expression through the corresponding absolute moments (see (20)), we come to the conclusion |κν1 . . . κνm | ≤ c1 (ν1 )ρi1 +2 . . . c1 (νm )ρim +2 i 1 +2+···+i m +2 2

≤ c1 (ν1 ) . . . c1 (νm )ρ2

r 2 +m

≤ c1 (ν1 ) . . . c1 (νm )ρ2



ρr +2 r +2 2

ρ2

m i1 +···+i r

ρi1 +2 . . . ρim +2 i 1 +2

i m +2 2

ρ2 2 . . . ρ2

= c1 (ν1 ) . . . c1 (νm )ρ2m−1 ρr +2 ,


using the well-known inequality ta ≤ tb + tc for 0 ≤ b ≤ a ≤ c, we come to the fact that |D α Pr (z, κν )| ≤ C3 (α, r )(1 + ρ2r −1 )(1 + z3r −|α| )ρr +2 . Further, since

1

1  D α−β exp − Dt, t D β Pr (it, κν ), D α exp − Dt, t Pr (it, κν ) = 2 2 0≤β≤α we obtain that for any r ≥ 1  

1    α D Pr (it, κν ) exp − Dt, t dt ≤ C(α, k, r )(1 + ρ2r −1 )ρr +2 . 2 Separately, it is necessary to consider the case when r = 1  

1    α D P1 (it, κν ) exp − Dt, t dt 2   

1    =  D α−β P1 (it, κν )D β exp − Dt, t dt 2 0≤β≤α =





  

1   κν   D α−β (it)ν D β exp − Dt, t dt  ν! 2 0≤β≤α |ν|=3

   

3  κν   D α−β (it)ν C(β, k)(1 + t|β| ) exp − t2 )dt  ν! 8 0≤β≤α |ν|=3

 |ν|=3

|κν |

   

3 (it)ν   D α−β C(β, k)(1 + t|β| ) exp − t2 dt  ν! 8 0≤β≤α ≤

 |ν|=3

C6 (α, ν, k)|κν | ≤ C6 (α, k)



|κν |.

|ν|=3

Lemma 7 Let be a random vector uniformly distributed on the unit sphere S n−1 , ν is a positive vector with |ν| = 3, then for any t > 0, n 

 

δ4 2   P  ≤ C` 3 exp − c`3 t 3 3j μν (Z j ) ≥ t n j=1

and


P

n

 j=1

δ 4j 4j ≥ t


√ δ4 ≤ C` 4 exp − c`4 t . n

Where δ 4j , δ 4 , μν (Z j ) are defined in (3), (5), (17), respectively. Proof The proof follows the scheme of the proof of Lemma 4.1 [2]. Let 1 , 2 , , . . . , n be a sequence of independent random variables with standard normal distribution and Z as a random variable independent of  j for j = 1, . . . , n, which of the normal distrihas a chi-square with n degrees of freedom. Using the properties √ bution, one can show that the representation j =  j / Z is holds for j = 1, . . . , n. Applying a number of transformations, we obtain n n 

 

  μν (Z j ) 3j  δ4 δ4    ≥ t =P  μν (Z j ) 3j  ≥ t P   3 n n Z2 j=1 j=1 n  t √nδ 2

 

  ≤P  + C 1 exp − c1 n . μν (Z j ) 3j  ≥ 4 j=1

Due to the fact that E exp(c2  2j ) ≤ 2, where c2 is an absolute constant, therefore the  random variable Y = nj=1 μν (Z j ) 3j belongs to class ψ2/3 see Sects. 2 and 3 in [12]. And, therefore, we can apply the inequality for the moments Sect. 3 in [12], which asserts that for any p ≥ 2 the following estimate holds n 1p

 21

3 E|Y | p ≤ C2 p 2 μν (Z j )2 . j=1

Using the inequality for ν = ν1 + ν2 with |ν1 | = 1, |ν2 | = 2 2ν2 2ν1 4 2 6 4 1 (μν (Z j ))2 = |EZ νj 1 Z νj 2 |2 ≤ (EZ 2ν j )(EZ j ) ≤ 2 EX j 2 ρ4 (X j ) = 2 δ j ,

we get that 3

C2 p 2

n n

 21

 21 3 3√ (μν (Z j ))2 ≤ C 3 p 2 δ 4j = C 3 p 2 nδ 2 . j=1

j=1

Applying the Chebyshev inequality, we come to the conclusion that n 

  √   P  μν (X j ) 3j  ≥ t nδ 2 ≤ j=1

C p 2 p E|Y | p 3 . √ 2 p ≤ t (t nδ ) 3


Put $p=\bigl(\tfrac12 C_{3}^{-1}t\bigr)^{2/3}$; if we consider $t\ge 2^{5/2}C_{3}$, then we get that $p\ge 2$. Note that
\[
\Bigl(\frac{C_{3}\,p^{3/2}}{t}\Bigr)^{p}=\exp\Bigl(-\bigl(\tfrac12 C_{3}^{-1}t\bigr)^{2/3}\ln 2\Bigr),
\]
therefore we get that for any $t\ge c_{3}$
\[
\mathbf{P}\Bigl(\Bigl|\sum_{j=1}^{n}\mu_{\nu}(X_{j})\,\theta_{j}^{3}\Bigr|\ge t\,\frac{\delta^{4}}{n}\Bigr)\le\exp\bigl(-c_{3}t^{2/3}\bigr)+C_{1}\exp(-c_{1}n).
\]
Since
\[
\Bigl|\sum_{j=1}^{n}\mu_{\nu}(X_{j})\,\theta_{j}^{3}\Bigr|
\le\Bigl(\sum_{j=1}^{n}(\mu_{\nu}(X_{j}))^{2}\theta_{j}^{2}\Bigr)^{1/2}
\le\Bigl(2^{6}\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{2}\Bigr)^{1/2}
\le 2^{3}\sqrt n\,\delta^{2},
\]
therefore, for any $t\ge 0$ one has
\[
\mathbf{P}\Bigl(\Bigl|\sum_{j=1}^{n}\mu_{\nu}(X_{j})\,\theta_{j}^{3}\Bigr|\ge t\,\frac{\delta^{4}}{n}\Bigr)\le \breve C_{3}\exp\bigl(-\breve c_{3}t^{2/3}\bigr).
\]

Similarly, for the random variable $Y_{2}=\sum_{j=1}^{n}\delta_{j}^{4}(\xi_{j}^{4}-3)$ we get
\[
\mathbf{P}\Bigl(\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{4}\ge 12\,\frac{\delta^{4}}{n}+t\,\frac{\delta^{4}}{n}\Bigr)
\le\mathbf{P}\Bigl(Y_{2}\ge t\,\frac{n\delta^{4}}{4}\Bigr)+C_{4}\exp(-c_{4}n),
\]
and reusing the moment inequality (Sect. 3 in [12]),
\[
\bigl(\mathbf{E}|Y_{2}|^{p}\bigr)^{1/p}\le C_{5}\,p^{2}\,n\delta^{4}.
\]
Similarly, we come to the conclusion that
\[
\mathbf{P}\Bigl(\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{4}\ge t\,\frac{\delta^{4}}{n}\Bigr)\le C_{6}\exp\bigl(-c_{5}\sqrt t\bigr)+C_{4}\exp(-c_{4}n),\qquad t>15.
\]
Also, from the fact that $\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{4}\le n\delta^{4}$, it follows that for any $t>0$
\[
\mathbf{P}\Bigl(\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{4}\ge t\,\frac{\delta^{4}}{n}\Bigr)\le \breve C_{4}\exp\bigl(-\breve c_{4}\sqrt t\bigr).
\]


3 Proof of the Main Theorem
Proof Assume that $\delta_{\theta}^{4}\le(8k)^{-1}$. We split the original integral into several terms; for this we add and subtract the term with the distribution of the sum of truncated random variables defined in (9), (10), and then estimate each term separately:
\[
\Bigl|\int_{B}d(F_{X}-\Phi)\Bigr|\le\Bigl|\int_{B}d(F_{X}-F_{Y})\Bigr|+\Bigl|\int_{B}d(F_{Y}-\Phi)\Bigr|.
\]
To estimate the first term in the sum, we use the following transformation:
\[
\Bigl|\int_{B}d(F_{X}-F_{Y})\Bigr|
\le\sum_{j=1}^{n}\Bigl|\int_{B}d\bigl(F_{X_{1}}*\cdots*F_{X_{j-1}}*(F_{X_{j}}-F_{Y_{j}})*F_{Y_{j+1}}*\cdots*F_{Y_{n}}\bigr)\Bigr|
\le\sum_{j=1}^{n}\Bigl|\int_{B}d(F_{X_{j}}-F_{Y_{j}})\Bigr|
\le\sum_{j=1}^{n}\mathbf{P}\bigl(\|\theta_{j}X_{j}\|>1\bigr)
\le\sum_{j=1}^{n}\delta_{j}^{4}\theta_{j}^{4}=\delta_{\theta}^{4},
\]

where $F_{X_{j}}$, $F_{Y_{j}}$ are defined in (11), (12) and "$*$" denotes convolution. To estimate the second term, we additionally split this integral into the sum of three integrals, adding and subtracting new terms:
\[
\Bigl|\int_{B}d(F_{Y}-\Phi)\Bigr|
=\Bigl|\int_{B_{n}=B+\{A_{n}\}}d(F_{Z}-\Phi_{A_{n},I})\Bigr|
\le\Bigl|\int_{B_{n}}d(F_{Z}-\Phi_{0,D})\Bigr|
+\Bigl|\int_{B_{n}}d(\Phi-\Phi_{0,D})\Bigr|
+\Bigl|\int_{B_{n}}d(\Phi_{A_{n},I}-\Phi)\Bigr|.
\]

To estimate the last integral, we show that, due to $\mathbf{E}X_{j}=0$, $j=1,\dots,n$, the following inequality holds:
\[
\|\mathbf{E}\theta_{j}Y_{j}\|^{2}
=\sum_{i=1}^{k}\bigl(\mathbf{E}\theta_{j}(Y_{ji}-X_{ji})\bigr)^{2}
\le\sum_{i=1}^{k}\bigl(\mathbf{E}\|\theta_{j}X_{j}\|\,\mathbf 1(\|\theta_{j}X_{j}\|>1)\bigr)^{2}
\le k\bigl(\mathbf{E}\|\theta_{j}X_{j}\|^{4}\bigr)^{2}=k\,(\theta_{j}^{4}\delta_{j}^{4})^{2}.
\]
From this it follows that the norm of the weighted mathematical expectation is bounded by the value


\[
\|A_{n}\|\le\sum_{j=1}^{n}\|\mathbf{E}\theta_{j}Y_{j}\|\le\sqrt k\sum_{j=1}^{n}\theta_{j}^{4}\delta_{j}^{4}<\frac{1}{8\sqrt k}.
\]
From Theorem 4 of [13] we obtain that
\[
\Bigl|\int_{B_{n}}d(\Phi_{A_{n},I}-\Phi)\Bigr|\le\sqrt{\frac2\pi}\,\|A_{n}\|\le C_{1}\sqrt k\,\delta_{\theta}^{4}.
\]
To estimate the next integral from the sum, we use Theorem 3 of [13] and Lemma 2:
\[
\Bigl|\int_{B_{n}}d(\Phi-\Phi_{0,D})\Bigr|\le C_{2}\|I-D\|_{2}
=C_{2}\Bigl(\sum_{i,l}(d_{il}-v_{il})^{2}\Bigr)^{1/2}
\le C_{2}\Bigl(\sum_{i,l}(2\delta_{\theta}^{4})^{2}\Bigr)^{1/2}\le 2kC_{2}\delta_{\theta}^{4}.
\]

It remains to estimate the last integral. For this we use the most important inequality from Corollary 11.5 of [2]:
\[
\Bigl|\int_{B_{n}}d(F_{Z}-\Phi_{0,D})\Bigr|\le C_{3}\Bigl|\int d\bigl((F_{Z}-\Phi_{0,D})\times K_{\epsilon}\bigr)\Bigr|+\hat C_{1}(k)\,\epsilon,
\]
where $\epsilon=4a\sqrt k\,\delta^{4}/n$ and $a=2^{-1/3}k^{5/6}$, and $K_{\epsilon}(x)$ is a kernel function (for details see 13.8--13.13 in [10]); the most important property of this function is that its characteristic function $\widehat K_{\epsilon}(t)$ is equal to zero for $\|t\|>n/\delta^{4}$. By Lemma 11.6 of [10], from estimating the difference of distributions we can pass to estimating the difference of the corresponding characteristic functions:
\[
\Bigl|\int d\bigl((F_{Z}-\Phi_{0,D})\times K_{\epsilon}\bigr)\Bigr|
\le\hat C_{2}(k)\max_{0\le|\alpha+\beta|\le k+1}\int\bigl|D^{\alpha}(\widehat F_{Z}-\widehat\Phi_{0,D})(t)\,D^{\beta}\widehat K_{\epsilon}(t)\bigr|\,dt.
\]
Since $|D^{\beta}\widehat K_{\epsilon}(t)|\le\hat c$, we get that
\[
\int\bigl|D^{\alpha}(\widehat F_{Z}-\widehat\Phi_{0,D})(t)\,D^{\beta}\widehat K_{\epsilon}(t)\bigr|\,dt
\le C_{3}(k)\int_{\|t\|\le n/\delta^{4}}\bigl|D^{\alpha}(\widehat F_{Z}-\widehat\Phi_{0,D})(t)\bigr|\,dt.
\]

Denote $E_{n}=c_{1}(k,k+3)\min\bigl(\eta_{k+3}^{-\frac{1}{k+1}},\eta_{k+3}^{-\frac{1}{k+3}}\bigr)$, where $\eta_{k+3}=\sum_{j=1}^{n}\rho_{k+3}(Q\theta_{j}Z_{j})$ (see (15)) and $c_{1}(k,k+3)$ is the constant from the statement of Lemma 3. Further, we add and subtract terms of the asymptotic expansion of the logarithm of the characteristic function. Considering that, by definition, $\widehat F_{Z}=\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)$ (see (13)), we can

split the integral into the sum of several terms by dividing the region of integration into several parts, similarly to inequality (71) of [11]:
\[
\begin{aligned}
\int_{\|t\|\le n/\delta^{4}}\bigl|D^{\alpha}(\widehat F_{Z}-\widehat\Phi_{0,D})(t)\bigr|\,dt
&\le\int_{\|t\|\le\sqrt{4/5}\,E_{n}}\Bigl|D^{\alpha}\Bigl[\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)-\exp\Bigl(-\tfrac12\langle Dt,t\rangle\Bigr)\sum_{r=0}^{k}P_{r}(it,\kappa_{\nu})\Bigr]\Bigr|\,dt\\
&\quad+\int_{\sqrt{4/5}\,E_{n}\le\|t\|\le\frac{\sqrt n}{1600\sqrt k\,\delta^{2}}}\Bigl|D^{\alpha}\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)\Bigr|\,dt
+\int_{\frac{\sqrt n}{1600\sqrt k\,\delta^{2}}\le\|t\|\le\frac{n}{\delta^{4}}}\Bigl|D^{\alpha}\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)\Bigr|\,dt\\
&\quad+\int_{\sqrt{4/5}\,E_{n}\le\|t\|}\Bigl|D^{\alpha}\exp\Bigl(-\tfrac12\langle Dt,t\rangle\Bigr)\Bigr|\,dt
+\int\sum_{r=1}^{k}\Bigl|D^{\alpha}\Bigl[P_{r}(it,\kappa_{\nu})\exp\Bigl(-\tfrac12\langle Dt,t\rangle\Bigr)\Bigr]\Bigr|\,dt\\
&=I_{1}+I_{2}+I_{3}+I_{4}+I_{5}.
\end{aligned}
\]
Then, by Lemma 2, we obtain
\[
\det Q\le\|Q\|_{2}^{k}\le\Bigl(\tfrac43\Bigr)^{k/2},\qquad
\|Qt\|\ge\|D\|^{-1/2}\|t\|\ge\Bigl(\tfrac45\Bigr)^{1/2}\|t\|,
\]
\[
\Bigl\{\|Qt\|\le\Bigl(\tfrac45\Bigr)^{1/2}E_{n}\Bigr\}\subset
\Bigl\{\|t\|\le c_{1}(k,k+3)\min\bigl(\eta_{k+3}^{-\frac{1}{k+1}},\eta_{k+3}^{-\frac{1}{k+3}}\bigr)\Bigr\},
\]
where the matrix $Q$ is defined in (8). Also, taking into account that, for any $s\ge4$, it follows from (16) that
\[
\eta_{s}\le 2^{s}\|Q\|^{s}\sum_{j=1}^{n}\rho_{4}(\theta_{j}X_{j})=2^{s}\|Q\|^{s}\delta_{\theta}^{4}\le 2^{s}\Bigl(\tfrac43\Bigr)^{s/2}\delta_{\theta}^{4}.
\]

Substituting t = Qt, s = k + 3 and using Lemma 3, we come to the fact that


\[
\begin{aligned}
I_{1}&=\int_{\|Qt'\|\le\sqrt{4/5}\,E_{n}}\Bigl|D^{\alpha}\Bigl[\prod_{j=1}^{n}\varphi_{j}(\theta_{j}Qt')-\exp\Bigl(-\tfrac{\|t'\|^{2}}{2}\Bigr)\sum_{r=0}^{k}P_{r}(iQt',\kappa_{\nu})\Bigr]\Bigr|\det Q\,dt'\\
&\le\Bigl(\tfrac43\Bigr)^{k/2}\int_{\|t'\|\le E_{n}}\Bigl|D^{\alpha}\Bigl[\prod_{j=1}^{n}\varphi_{j}(\theta_{j}Qt')-\exp\Bigl(-\tfrac{\|t'\|^{2}}{2}\Bigr)\sum_{r=0}^{k}P_{r}(iQt',\kappa_{\nu})\Bigr]\Bigr|\,dt'\\
&\le\Bigl(\tfrac43\Bigr)^{k/2}\int c_{2}(k,k+3)\,\eta_{k+3}\bigl(\|t'\|^{k+3-|\alpha|}+\|t'\|^{3(k+1)+|\alpha|}\bigr)\exp\Bigl(-\tfrac14\|t'\|^{2}\Bigr)\,dt'\\
&\le\hat C_{1}(\alpha,k)\,\eta_{k+3}\le 2^{k+3}\Bigl(\tfrac43\Bigr)^{\frac{k+3}{2}}\hat C_{1}(\alpha,k)\,\delta_{\theta}^{4}\le C_{1}(\alpha,k)\,\delta_{\theta}^{4}.
\end{aligned}
\]

Then we introduce the subset $G=\{j:\delta_{j}^{2}<5\delta^{2}\}$; the following holds:
\[
\delta^{4}=\frac1n\sum_{j=1}^{n}\delta_{j}^{4}\ge\frac1n\sum_{j\notin G}\delta_{j}^{4}>\frac{n-|G|}{n}\,(5\delta^{2})^{2},
\]
hence $|G|/n>24/25>4/5$. Using Lemma 3.2 of [2], we obtain that with probability greater than $1-\grave C_{1}\exp(-\grave c_{1}n)$ a random vector uniformly distributed on the unit sphere $S^{n-1}$ satisfies the condition
\[
\sum_{j\in U}\theta_{j}^{2}>\frac18,\qquad U=\Bigl\{j\in G:\ |\theta_{j}|<\frac{40}{\sqrt n}\Bigr\}.
\]
If we set $l=200\,\delta^{2}/\sqrt n$, then one has $40/\sqrt n\le l/\delta_{j}^{2}$ for $j\in G$. Therefore, using Lemma 4, we obtain that there is a subset $\mathcal Q_{1}$ with $\lambda_{n-1}(\mathcal Q_{1})\ge1-\grave C_{1}\exp(-\grave c_{1}n)$ such that
\[
\begin{aligned}
I_{2}&=\int_{\sqrt{4/5}\,E_{n}\le\|t\|\le\frac{\sqrt n}{1600\sqrt k\,\delta^{2}}}\Bigl|D^{\alpha}\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)\Bigr|\,dt
\le\int_{\sqrt{4/5}\,E_{n}\le\|t\|\le\frac{\sqrt n}{1600\sqrt k\,\delta^{2}}}c_{1}(\alpha,k)\,(1+\|t\|^{|\alpha|})\exp\Bigl(-\tfrac1{48}\|t\|^{2}\Bigr)\,dt\\
&\le\int_{\|t\|\ge\sqrt{4/5}\,E_{n}}\Bigl(\Bigl(\tfrac45\Bigr)^{1/2}c_{1}(k,k+3)\min\bigl(\eta_{k+3}^{-\frac1{k+1}},\eta_{k+3}^{-\frac1{k+3}}\bigr)\Bigr)^{-k-3}\|t\|^{k+3}\,c_{1}(\alpha,k)\,(1+\|t\|^{|\alpha|})\exp\Bigl(-\tfrac1{48}\|t\|^{2}\Bigr)\,dt\\
&\le\hat C_{2}(\alpha,k)\Bigl(\min\bigl(\eta_{k+3}^{-\frac1{k+1}},\eta_{k+3}^{-\frac1{k+3}}\bigr)\Bigr)^{-k-3}
=\hat C_{2}(\alpha,k)\,\eta_{k+3}\max\bigl(1,\eta_{k+3}^{\frac2{k+1}}\bigr)
\le\hat C_{2}(\alpha,k)\,2^{k+3}\Bigl(\tfrac43\Bigr)^{\frac{k+3}{2}}\delta_{\theta}^{4}\bigl(1+\eta_{k+3}^{\frac2{k+1}}\bigr)\le C_{2}(\alpha,k)\,\delta_{\theta}^{4}.
\end{aligned}
\]



By Lemma 5 there exists a subset $\mathcal Q_{2}$ with $\lambda_{n-1}(\mathcal Q_{2})\ge1-\grave C_{2}(\alpha,k)\exp\bigl(-\grave c_{2}(k)\tfrac{n}{\delta^{2}}\bigr)$ such that for any vector of weight coefficients $(\theta_{1},\theta_{2},\dots,\theta_{n})\in\mathcal Q_{2}$ one has
\[
I_{3}=\int_{\frac{\sqrt n}{1600\sqrt k\,\delta^{2}}\le\|t\|\le\frac{n}{\delta^{4}}}\Bigl|D^{\alpha}\prod_{j=1}^{n}\varphi_{j}(\theta_{j}t)\Bigr|\,dt\le C_{3}(\alpha,k)\,\frac{\delta^{4}}{n}.
\]
By Lemma 6,

\[
\begin{aligned}
I_{4}&=\int_{\sqrt{4/5}\,E_{n}\le\|t\|}\Bigl|D^{\alpha}\exp\Bigl(-\tfrac12\langle Dt,t\rangle\Bigr)\Bigr|\,dt
\le\int_{\sqrt{4/5}\,E_{n}\le\|t\|}C(\alpha,k)\,(1+\|t\|^{|\alpha|})\exp\Bigl(-\tfrac38\|t\|^{2}\Bigr)\,dt\\
&\le\int_{\|t\|\ge\sqrt{4/5}\,E_{n}}\Bigl(\Bigl(\tfrac45\Bigr)^{1/2}c_{1}(k,k+3)\min\bigl(\eta_{k+3}^{-\frac1{k+1}},\eta_{k+3}^{-\frac1{k+3}}\bigr)\Bigr)^{-k-3}\|t\|^{k+3}\,C(\alpha,k)\,(1+\|t\|^{|\alpha|})\exp\Bigl(-\tfrac38\|t\|^{2}\Bigr)\,dt
\le C_{4}(\alpha,k)\,\delta_{\theta}^{4}
\end{aligned}
\]
and
\[
\begin{aligned}
I_{5}&=\int\sum_{r=1}^{k}\Bigl|D^{\alpha}\Bigl[P_{r}(it,\kappa_{\nu})\exp\Bigl(-\tfrac12\langle Dt,t\rangle\Bigr)\Bigr]\Bigr|\,dt
\le\sum_{r=2}^{k}C(\alpha,k,r)\,\rho_{r+2}\,(1+\rho_{2}^{r-1})+C_{6}(\alpha,k)\sum_{|\nu|=3}|\kappa_{\nu}|\\
&=C_{5}(\alpha,k)\,\delta_{\theta}^{4}+C_{6}(\alpha,k)\sum_{|\nu|=3}\Bigl|\sum_{j=1}^{n}\theta_{j}^{3}\kappa_{\nu}(Z_{j})\Bigr|
=C_{5}(\alpha,k)\,\delta_{\theta}^{4}+C_{6}(\alpha,k)\sum_{|\nu|=3}\Bigl|\sum_{j=1}^{n}\theta_{j}^{3}\mu_{\nu}(Z_{j})\Bigr|.
\end{aligned}
\]


We get that there is a subset of weight coefficients $\mathcal Q_{1}\cap\mathcal Q_{2}$ with measure
\[
\lambda_{n-1}(\mathcal Q_{1}\cap\mathcal Q_{2})\ge1-\grave C_{2}(k)\exp\Bigl(-\grave c_{2}(k)\frac{n}{\delta^{4}}\Bigr)-\grave C_{1}\exp(-\grave c_{1}n)
\ge1-\grave C_{5}(k)\exp\Bigl(-\grave c_{5}(k)\frac{n}{\delta^{4}}\Bigr),
\]
such that for any vector of weight coefficients $(\theta_{1},\theta_{2},\dots,\theta_{n})\in\mathcal Q_{1}\cap\mathcal Q_{2}$ one has
\[
\Bigl|\int_{B}d(F_{X}-\Phi)\Bigr|\le C_{7}(k)\Bigl(\delta_{\theta}^{4}+\frac{\delta^{4}}{n}+\sum_{|\nu|=3}\Bigl|\sum_{j=1}^{n}\theta_{j}^{3}\mu_{\nu}(Z_{j})\Bigr|\Bigr).\tag{33}
\]
Note that if $\delta_{\theta}^{4}>(8k)^{-1}$, inequality (33) holds automatically for a certain choice of universal constants. Also, without loss of generality, we can require that
\[
\log\frac{2\grave C_{5}(k)}{\rho}\le\grave c_{5}(k)\,\frac{n}{\delta^{4}};
\]
otherwise the statement of the theorem holds for a special choice of the constant in the inequality. Further,
\[
\frac{\rho}{2\grave C_{5}(k)}\ge\exp\Bigl(-\grave c_{5}(k)\frac{n}{\delta^{4}}\Bigr)
\]
and
\[
\lambda_{n-1}(\mathcal Q_{1}\cap\mathcal Q_{2})>1-\grave C_{5}(k)\,\frac{\rho}{2\grave C_{5}(k)}\ge1-\frac\rho2.
\]
By Lemma 7 there exists a subset $\mathcal Q_{3}$ with $\lambda_{n-1}(\mathcal Q_{3})\ge1-\rho/2$, for which
\[
\delta_{\theta}^{4}+\sum_{|\nu|=3}\Bigl|\sum_{j=1}^{n}\theta_{j}^{3}\mu_{\nu}(Z_{j})\Bigr|
\le\hat C_{7}\Bigl(\log\frac1\rho\Bigr)^{1/2}\frac{\delta^{4}}{n}+\hat C_{8}(k)\Bigl(\log\frac1\rho\Bigr)^{3/2}\frac{\delta^{4}}{n}.
\]
Finally, for any vector of coefficients $(\theta_{1},\theta_{2},\dots,\theta_{n})\in\mathcal Q=\mathcal Q_{1}\cap\mathcal Q_{2}\cap\mathcal Q_{3}$ we have
\[
\sup_{B\in\mathcal B}\Bigl|\int_{B}d(F_{X}-\Phi)\Bigr|\le C(k)\Bigl(\log\frac1\rho\Bigr)^{3/2}\frac{\delta^{4}}{n}\le C(\rho,k)\,\frac{\delta^{4}}{n},
\]
moreover
\[
\lambda_{n-1}(\mathcal Q)\ge1-\frac\rho2-\frac\rho2=1-\rho.
\]
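The statement just proved can be checked empirically in a simple one-dimensional setting. The following sketch is illustrative only: the distribution of the summands, the sample sizes and the random seed are arbitrary choices and are not taken from the paper. It draws one weight vector uniformly from the unit sphere and measures the Kolmogorov distance between the law of the weighted sum and the standard normal law.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_rep = 400, 20000

# One weight vector theta drawn uniformly from the unit sphere S^{n-1}
g = rng.standard_normal(n)
theta = g / np.linalg.norm(g)

# Centered, variance-one summands with finite fourth moment (centered exponential)
X = rng.exponential(1.0, size=(n_rep, n)) - 1.0

S = X @ theta                                 # weighted sums sum_j theta_j X_j
ks = stats.kstest(S, "norm").statistic
print(f"Kolmogorov distance to N(0,1): {ks:.4f}  (compare with 1/n = {1/n:.4f})")
```

For a typical weight vector the measured distance should be far smaller than the worst-case Berry-Esseen rate $n^{-1/2}$, which is exactly the effect quantified by the theorem.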


Acknowledgements The research was done as part of the program of the Moscow Center for Fundamental and Applied Mathematics and within the framework of the HSE University Basic Research Program.

References
1. Sazonov, V.V.: On the multidimensional central limit theorem. Sankhya Ser. A 30, 181–204 (1968)
2. Klartag, B., Sodin, S.: Variations on the Berry–Esseen theorem. Theory Probab. Appl. 56(3), 514–533 (2011)
3. Bobkov, S.G.: Edgeworth corrections in randomized central limit theorems. In: Klartag, B., Milman, E. (eds.) Geometric Aspects of Functional Analysis. Lecture Notes in Mathematics, vol. 2256, pp. 71–97. Springer, Cham (2020)
4. Bobkov, S.G., Chistyakov, G.P., Götze, F.: Berry–Esseen bounds for typical weighted sums. Electron. J. Probab. 23(92), 1–22 (2018)
5. Götze, F., Naumov, A.A., Ulyanov, V.V.: Asymptotic analysis of symmetric functions. J. Theor. Probab. 30, 876–897 (2017)
6. Esseen, C.-G.: Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law. Acta Math. 77, 1–125 (1945)
7. Prokhorov, Yu.V., Ulyanov, V.V.: Some approximation problems in statistics and probability. In: Springer Proceedings in Mathematics & Statistics, vol. 42, pp. 235–249. Springer, Heidelberg (2013)
8. Götze, F., Zaitsev, A.Yu.: Explicit rates of approximation in the CLT for quadratic forms. Ann. Probab. 42(1), 354–397 (2014)
9. Ulyanov, V.V.: Asymptotic expansions for distributions of sums of independent random variables in H. Theory Probab. Appl. 31(1), 25–39 (1987)
10. Bhattacharya, R.N., Ranga Rao, R.: Normal Approximation and Asymptotic Expansions. Wiley, New York (1976)
11. Sazonov, V.V.: Normal Approximation-Some Recent Advances. Springer, Berlin, Heidelberg, New York (1981)
12. Adamczak, R., Litvak, A., Pajor, A., Tomczak-Jaegermann, N.: Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constr. Approx. 34(1), 61–88 (2011)
13. Barsov, S.S., Ulyanov, V.V.: Difference of Gaussian measures. J. Sov. Math. 5(38), 2191–2198 (1987)
14. Carlen, E., Lieb, E., Loss, M.: A sharp analog of Young's inequality on S^N and related entropy inequalities. J. Geom. Anal. 4(3), 487–520 (2004)

Estimation of Matrices and Subspaces

Rate of Convergence for Sparse Sample Covariance Matrices F. Götze, A. Tikhomirov, and D. Timushev

Abstract This work is devoted to the estimation of the convergence rate of the empirical spectral distribution function (ESD) of sparse sample covariance matrices in the Kolmogorov metric. We consider the case with the sparsity npn ∼ logα n, for some α > 1 and assume that the moments of the matrix elements satisfy the condition E |X jk |4+δ ≤ C < ∞, for some δ > 0. We also obtain approximation estimates for the Stieltjes transform in the bulk. Keywords Sparse sample covariance matrices · Marchenko–Pastur law · Rate of convergence · Stieltjes transform

1 Introduction
Random matrices first arose in the work of Wishart [1] in mathematical statistics, and later in the work of Wigner [2] in applications to nuclear physics. Wigner used the eigenvalues of random Hermitian matrices with centered independent identically distributed elements to describe the energy levels of heavy nuclei; such matrices were later called Wigner matrices. Wigner proved that the density of the empirical spectral distribution of the eigenvalues of such matrices converges to the semicircle law as the dimension of the matrix increases to infinity. The second important class of random matrices is the ensemble of sample covariance matrices. These are matrices of the form XX*, where X is an n × m matrix with


centered identically distributed independent elements. In the context of statistical analysis, the columns of the matrix X form a sample of size m of an n-dimensional data vector. When the size of X increases to infinity so that the ratio n/m converges to a constant, the density of the limiting empirical spectral distribution of the matrix XX* was found explicitly by Marchenko and Pastur [3].

Sample covariance matrices play an important role in problems of multivariate statistical analysis, in particular in principal component analysis (PCA). Random matrices of this type also appear in the theory of wireless communication: the spectral density of these matrices is used in calculations related to the capacity of a multiple-input multiple-output (MIMO) channel. This connection between random matrix theory and wireless communication was established by Telatar [4]. In this linear model, the element X_ij of the channel matrix X is the transmission coefficient from the j-th transmitter antenna to the i-th receiver antenna. The received signal can be represented by the linear relationship y = Xr + s, where r is the input signal and s is zero-mean Gaussian noise with some variance σ². In the case of i.i.d. Gaussian input signals, the channel capacity is determined by the expression (1/n) E log det(I + σ⁻² XX*). This application of random matrix theory was one of the reasons for the explosive growth of the topic in the following decade.

An important area of application of sample covariance matrices is graph theory. The adjacency matrix of a directed graph is asymmetric, so instead of its eigenvalues one has to consider its singular values, which naturally leads to sample covariance matrices. An example of such a graph is a bipartite random graph, whose vertices can be divided into two groups such that no two vertices within the same group are connected. If the edge probability p_n of such a graph tends to zero as the number of vertices increases to infinity, we obtain a sparse sample covariance matrix. The behaviour of the eigenvalues and eigenvectors of a sparse random matrix depends significantly on its sparsity, and the results obtained for non-sparse matrices cannot be applied directly. The sparse case has been considered in a number of works (see [5–10]). The spectral properties of sparse sample covariance matrices with sparsity np_n ∼ n^α were studied in [8]. In particular, a local law for the eigenvalue density was proved and the limiting distribution of the shifted, rescaled largest eigenvalue was found under the moment assumption E|X_jk|^q ≤ (Cq)^{cq}. Estimates for the Stieltjes transform were also obtained in the domain {z = u + iv : λ₋/2 ≤ u ≤ λ₊ + 1, 0 < v < 3}, where λ₋ and λ₊ are the left and right endpoints of the support of the Marchenko–Pastur distribution.

The present work is devoted to estimating the rate of convergence of the empirical spectral distribution function (ESD) of sparse sample covariance matrices in the Kolmogorov metric. We consider the case of sparsity np_n ∼ log^α n for some α > 1 and assume that the moments of the matrix elements satisfy E|X_jk|^{4+δ} ≤ C < ∞. In addition to the result on the rate of convergence, we also obtain approximation estimates for the Stieltjes transform in the support of the limiting distribution under a truncation condition (see Sect. 3).
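As a small numerical aside, the capacity formula quoted above can be estimated by Monte Carlo. The antenna numbers, noise level and number of samples in the sketch below are arbitrary illustrative choices, not values taken from the literature cited here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma2, n_mc = 8, 16, 0.5, 2000

caps = []
for _ in range(n_mc):
    # i.i.d. complex Gaussian channel matrix (real and imaginary parts N(0, 1/2))
    X = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)
    # log det(I + XX*/sigma^2) via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(np.eye(n) + (X @ X.conj().T) / sigma2)
    caps.append(logdet / n)

print(f"estimated per-antenna capacity (nats): {np.mean(caps):.3f}")
```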


2 Main Results
Let $m=m(n)$, $m\ge n$. Consider independent identically distributed zero-mean random variables $X_{jk}$, $1\le j\le n$, $1\le k\le m$, with $\mathbf E X_{jk}^{2}=1$, and an independent set of independent Bernoulli random variables $\xi_{jk}$, $1\le j\le n$, $1\le k\le m$, with $\mathbf E\xi_{jk}=p_n$. In addition suppose that $np_n\to\infty$ as $n\to\infty$. In what follows, for simplicity of notation, we will omit the index $n$ in $p_n$ if this does not cause confusion. Consider the sequence of random matrices
\[
\mathbf X=\frac{1}{\sqrt{mp_n}}\bigl(\xi_{jk}X_{jk}\bigr)_{1\le j\le n,\,1\le k\le m}.
\]
Denote by $s_1\ge\cdots\ge s_n$ the singular values of $\mathbf X$ and define the empirical spectral distribution function (ESD) of the sample covariance matrix $\mathbf W=\mathbf X\mathbf X^{*}$:
\[
\mathcal F_n(x)=\frac1n\sum_{j=1}^{n}\mathbb I\{s_j^{2}\le x\},
\]
where $\mathbb I\{A\}$ stands for the indicator of the event $A$. We are interested in the rate of convergence of the distribution $\mathcal F_n(x)$ to the Marchenko–Pastur distribution $G_y(x)$ with density
\[
g_y(x)=\frac{1}{2\pi yx}\sqrt{(x-a^{2})(b^{2}-x)}\;\mathbb I\{a^{2}\le x\le b^{2}\},
\]
where $y=y(n)=\frac nm$ and $a=1-\sqrt y$, $b=1+\sqrt y$. We will follow the standard symmetrization technique (see [11]). To linearize the ESD we consider the symmetrized distribution function
\[
\widetilde{\mathcal F}_n(x)=\frac{1+\operatorname{sign}(x)\,\mathcal F_n(x^{2})}{2},
\]
and the symmetrized Marchenko–Pastur distribution function $\widetilde G_y(x)$ with density
\[
\widetilde g_y(x)=\frac{1}{2\pi y|x|}\sqrt{(x^{2}-a^{2})(b^{2}-x^{2})}\;\mathbb I\{a^{2}\le x^{2}\le b^{2}\}.
\]
Then we have, for the rate of convergence of the ESD function to the Marchenko–Pastur distribution function in the Kolmogorov metric,
\[
\Delta_n:=\sup_x|\mathcal F_n(x)-G_y(x)|=2\sup_x|\widetilde{\mathcal F}_n(x)-\widetilde G_y(x)|.
\]
In what follows we shall consider the symmetrized ESD only and we shall omit the symbol “~” in the notation.
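The quantities introduced above are easy to compute for a simulated sparse matrix. The following sketch uses illustrative parameter values and sparsity level (they are not the regime of the theorems below); it builds X, evaluates the non-symmetrized ESD on a grid, and approximates the Kolmogorov distance to the Marchenko–Pastur distribution by numerical integration of the density g_y.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 1000
p = np.log(n) ** 2 / n                                   # sparsity np ~ log^2 n (illustrative)

X = rng.standard_normal((n, m)) * (rng.random((n, m)) < p) / np.sqrt(m * p)
eig = np.sort(np.linalg.svd(X, compute_uv=False) ** 2)   # eigenvalues of W = XX*

y = n / m
a2, b2 = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2    # support [a^2, b^2] of g_y

def G_y(x):
    # Marchenko-Pastur CDF, obtained by integrating the density g_y numerically
    if x <= a2:
        return 0.0
    xs = np.linspace(a2, min(x, b2), 2000)
    dens = np.sqrt(np.maximum((xs - a2) * (b2 - xs), 0.0)) / (2 * np.pi * y * xs)
    return float(np.trapz(dens, xs))

grid = np.linspace(a2 - 0.5, b2 + 0.5, 400)
Fn = np.searchsorted(eig, grid, side="right") / n        # ESD evaluated on the grid
Delta = np.max(np.abs(Fn - np.array([G_y(x) for x in grid])))
print(f"Delta_n ~ {Delta:.4f}, reference scale log(n)/(n p) = {np.log(n)/(n*p):.4f}")
```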


We state the following result.

Theorem 1 Let $\mathbf E X_{jk}=0$ and $\mathbf E|X_{jk}|^{2}=1$. Assume that $\mathbf E|X_{jk}|^{4+\delta}\le\mu_{4+\delta}<\infty$ for any $j,k\ge1$ and for some $\delta>0$. Suppose that there exists a positive constant $B$ such that $np_n\ge B\log^{\frac2\kappa}n$, where $\kappa=\frac{\delta}{2(4+\delta)}$. Then there exists a constant $C$ depending on $\delta$ and $\mu_{4+\delta}$ such that
\[
\Delta_n\le\frac{C\log n}{np_n}
\]
with high probability.

Remark 1 We say that some event $\mathcal A_n$ holds with high probability if, for any $Q>0$, $\mathbf P\{\mathcal A_n\}\ge1-n^{-Q}$ for sufficiently large $n$.

To prove Theorem 1, we will use the smoothing inequality developed in [11], Corollary 3.2. For the reader's convenience, we reproduce the statement. For any $x$ such that $|x|\in[1-\sqrt y,1+\sqrt y]$, define the quantities
\[
\beta=\beta(x):=\sqrt y-\bigl||x|-1\bigr|
\quad\text{and}\quad
J_\varepsilon=\Bigl[1-\sqrt y+\tfrac12\varepsilon,\;1+\sqrt y-\tfrac12\varepsilon\Bigr],\ \text{ for any }0<\varepsilon<\sqrt y.
\]

Lemma 1 (Smoothing inequality) Suppose that $y<1$. Let $\varepsilon$, $v_0$ be positive numbers such that $\varepsilon\le\sqrt y/2$ and $v_0<C\varepsilon^{3/2}$. Denote by $s_n(z)$ the Stieltjes transform of $\mathcal F_n(x)$ and by $S_y(z)$ the Stieltjes transform of $G_y(x)$. Note that $0\le\beta\le\sqrt y$, and for any $x$ with $|x|\in J_\varepsilon$ we have $\beta\ge\frac12\varepsilon$. Then the following inequality holds:
\[
\Delta_n\le2\int_{-\infty}^{\infty}|s_n(u+iV)-S_y(u+iV)|\,du
+C_1v_0+C_2\varepsilon^{3/2}
+2\sup_{|x|\in J_\varepsilon}\int_{v'}^{V}|s_n(x+iu)-S_y(x+iu)|\,du,
\tag{1}
\]
where $V>v_0$ and $v'=v_0/\sqrt\beta$.

Rate of Convergence for Sparse Sample Covariance Matrices

265

 Dε := {z = u + iv : |u| ∈ Jε , v0 / β ≤ v ≤ V }, where the exact values v0 , ε will be defined later. This will imply the bound for n . The paper is organized as follows. In Sect. 3 we investigate the Stieltjes transform of the symmetrized ESD function in the domain Dε assuming that the elements X jk satisfy some truncation and moment assumption (C0) (will be defined below). We state Theorem 2 and prove several Corollaries. In Sect. 4.1 we show that the elements X jk may satisfy the condition (C0). In Sect. 4.2 smoothing inequality (1) and Corollary 3 allow us to bound the quantity n . Sect. 5 is devoted to the proof of Theorem 2: in Sect. 5.1 we get the bound for the diagonal elements of resolvent matrix R(z) when z ∈ Dε (the main difficulty is with Im z close to real axis of the complex plane), and in Sect. 5.2 we estimate Tn . In the Appendix we state and prove some auxiliary results.

3 The Stieltjes Transforms Proximity Note that the symmetrized empirical spectral distribution function (ESD) of the sample covariance matrix W = XX∗ may be rewrite as Fn (x) =

n  1  I{s j ≤ x} + I{−s j ≤ x} . 2n j=1

By the definition of the Stieltjes transform we have sn (z) =

n n n  1  1 z 1  1 = + 2 2n j=1 s j − z −s − z n s − z2 j j=1 j=1 j

and S y (z) =

−(z 2 − ab) +



(z 2 − a 2 )(z 2 − b2 ) . 2yz

  1 Using the inequality |S y (z)| ≤ √1y we obtain |y Sy (z)+z| =  y S y (z) − √ 2 y, for z ∈ Dε . Note that Fn (x) is the ESD of the block matrix V=



1−y  z 

≤1+

On X , X∗ Om

where Ok is k × k matrix with zero elements. Let R = R(z) be the resolvent matrix of V: R = (V − zIn+m )−1 . It is easy to see that


\[
\mathbf R=\begin{pmatrix}
z(\mathbf X\mathbf X^{*}-z^{2}\mathbf I_n)^{-1}&(\mathbf X\mathbf X^{*}-z^{2}\mathbf I_n)^{-1}\mathbf X\\
\mathbf X^{*}(\mathbf X\mathbf X^{*}-z^{2}\mathbf I_n)^{-1}&z(\mathbf X^{*}\mathbf X-z^{2}\mathbf I_m)^{-1}
\end{pmatrix}.
\]
This implies
\[
s_n(z)=\frac1n\sum_{j=1}^{n}\mathbf R_{jj}=\frac1n\sum_{l=1}^{m}\mathbf R_{l+n,l+n}+\frac{m-n}{nz}.
\tag{2}
\]
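Identity (2) and the block form of R above are straightforward to verify numerically. The sketch below is only a sanity check with arbitrary matrix sizes and evaluation point z; it compares the Stieltjes transform computed from the singular values with the two traces appearing in (2).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 60, 120, 0.3
X = rng.standard_normal((n, m)) * (rng.random((n, m)) < p) / np.sqrt(m * p)

z = 1.2 + 0.5j
V = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])
R = np.linalg.inv(V - z * np.eye(n + m))

# Stieltjes transform of the symmetrized ESD from the singular values of X
s = np.linalg.svd(X, compute_uv=False)
sn = np.mean(z / (s**2 - z**2))

lhs = np.trace(R[:n, :n]) / n                      # (1/n) sum_j R_jj
rhs = np.trace(R[n:, n:]) / n + (m - n) / (n * z)  # (1/n) sum_l R_{l+n,l+n} + (m-n)/(nz)
print(np.allclose(sn, lhs), np.allclose(sn, rhs))
```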

Throughout this section the conditions (C0) are assumed to be fulfilled:
\[
\mathbf E|X_{jk}|^{4+\delta}\le\mu_{4+\delta}<\infty
\quad\text{and}\quad
|X_{jk}|\le C(np)^{\frac12-\kappa},\ \text{ with }\ \kappa=\frac{\delta}{2(4+\delta)}.
\tag{C0}
\]

Let $s_0>1$ be some constant depending on $\delta$ and let $V$ be some positive constant. The exact values of these constants will be determined below. For any $0<v\le V$ we define $k_v$ as $k_v=k_v(V):=\min\{l\ge0:\ s_0^{l}v\ge V\}$. Let $\Lambda_n=\Lambda_n(z):=s_n(z)-S_y(z)$, where $z=u+iv$ and $v>0$. Set the notation
\[
b(z)=z-\frac{1-y}{z}+2yS_y(z),\qquad
a_n(z)=\operatorname{Im}b(z)+\frac{\log(np)\log n}{np\,|b(z)|},
\]
and
\[
b_n(z)=z-\frac{1-y}{z}+2yS_y(z)+y\Lambda_n(z)=b(z)+y\Lambda_n(z).
\]
Note that, for $z\in\mathcal D_\varepsilon$,
\[
v+2y\operatorname{Im}S_y(z)\le\operatorname{Im}b(z)\le|b(z)|\le C(y,V),
\]
where $C(y,V)=2+3\sqrt y+V$. For given $\gamma>0$ consider the event
\[
\mathcal Q_\gamma(v):=\bigl\{|\Lambda_n(u+iv)|\le\gamma a_n(u+iv)\ \text{for all }u:\ |u|\in J_\varepsilon\bigr\}
\]
and the event $\widehat{\mathcal Q}_\gamma(v)=\bigcap_{l=0}^{k_v}\mathcal Q_\gamma(s_0^{l}v)$. For any $\gamma$ there exists a constant $V=V(\gamma)$ such that
\[
\Pr\{\mathcal Q_\gamma(V)\}=1.
\tag{3}
\]
It may be $V=2/\gamma$, for example. In what follows, we assume that $\gamma$ and $V$ are chosen so that equality (3) is satisfied. We will also assume that $\gamma$ is chosen so that for all $0<v\le V$
\[
\gamma\sqrt y\,a_n(z)\,\max\bigl(1,(1+2\sqrt y)\sqrt y\bigr)<1/4.
\]

We denote Q = Qγ (V ). Summing the equations in (2), we get sn (z) = −

1 z + y S y (z) −

1−y z

(1 + Tn − y n sn (z)) = S y (z)(1 + Tn − y n sn (z)), (4)

where Tn =

n 1 εj Rjj, n j=1

where ε j = ε j1 + ε j2 + ε j3 , ε j1 =

m m 1  1  ( j) Rl+n,l+n − R , m l=1 m l=1 l+n,l+n

m 1  2 ( j) (X ξ jl − p)Rl+n,l+n , mp l=1 jl  1 ( j) = X jl X jk ξ jl ξ jk Rl+n,k+n . mp 1≤l =k≤m

ε j2 = ε j3 ( j)

Here by symbol Rl,k we denote the elements of the resolvent matrix defined via matrix X with row j deleted. In this section we formulate the following result. Theorem 2 Under the conditions (C0), for all z ∈ Dε with v0 = Cn −1 log4 n, ε ≥ 2/3  C log n , for any K ≥ 1 there exists constant C = C(K ) and q ≤ K log n, such n that √ q E |Tn |q I{Q} ≤ C q q 2 (1 + nvq q/2 )βnq , where βn =

1 an (z) + . nv np

Corollary 1 Under the conditions of Theorem 2, in the domain Dε , for any K ≥ 1 there exists constant C = C(K ) and q ≤ K log n, such that q

E | n |q I{Q} ≤

C q q q βn . |b(z)|q

Proof From Theorem 2 it follows E |Tn |q I{Q} ≤ C q q q βnq .


It is straightforward to check that |bn (z)|I(Q) ≥ c|b(z)|I(Q) in the domain Dε . Thus n| . The Corollary is proved.  it is enough to consider the inequality | n | ≤ C|T |b(z)| Corollary 2 Under the conditions of Theorem 2, in the domain Dε for any Q > 1 there exists a constant C depending on Q, such that  βn log n  Pr | n | > C ; Q ≤ Cn −Q . |b(z)| Proof For the proof it is enough to apply Markov inequality choosing q ∼ log n and result of Corollary 1.  Corollary 3 Under the conditions of Theorem 2, in the domain Dε for any Q > 1 there exists a constant C depending on Q, such that  βn log n  ≤ Cn −Q . Pr | n | > C |b(z)| Proof For the proof of this Corollary we shall use the so called additive descent method. We partition interval [v , V ] by points vk = V − kδn , k = 1, . . . K , where K = (V − v )/δn and δn is choosing so that for any k = 1, . . . , K and any vk ≥ v ≥ vk+1 | n (u + ivk ) − n (u + iv)| ≤ δn /v 2 ≤ 1/n. It is enough to put δn = 1/n 3 . Note that by Lemma 6, we have  βn C|an (z)| 1 C 1 1 ≤ ≤ an (u + iv1 ) and max , ≤ an (u + iv1 ). |b(z)| nv|b(z)| 4 np|b(z)| np 4 These implies

Pr{| n (u + iv1 )| ≥ an (u + iv1 )} ≤ Cn −Q .

Applying Corollary 2, we get  βn log n  ; Q ≤ Cn −Q . Pr | n (u + iv1 )| ≥ C |b(z)| Repeating this for k = 1, . . . K and using the union bound, we get   βn log n , for any V ≥ v ≥ v ≤ Cn −Q . Pr | n (u + iv)| > C |b(z)| Thus Corollary 3 is proved.




Corollary 4 Under the conditions of Theorem 2, in the domain Dε we have, for q ∼ log n, 





Pr

√ √ 1− y≤|u|≤1+ y

| n (u + iv)| > K

βn log n  ≤ Cn −Q . |b(z)|

Proof Note that, for v ≥ v0 ,    ∂ n (u + iv)    ≤ 2 ≤ n2.   v2 ∂u √ √ Split interval [1 − y, 1 + y] into N = n 3 intervals with endpoints u k , k = 1, . . . , N such that |u k+1 − u k | ≤ CN and for any u k ≤ u k+1 , | n (u + iv) − n (u k + iv)| ≤ K

βn log n . 2|b(z)|

Further we write  Pr





√ ||u|−1|≤ y

| n (u + iv)| > K

 βn log n  βn log n  ≤ Pr . | n (u k + iv)| > K |b(z)| 2|b(z)| k

Applying the union bound, we get  Pr

 √ ||u|−1|≤ y



| n (u + iv)| > K

N βn log n    βn log n  ≤ . Pr | n (u k + iv)| > K |b(z)| 2|b(z)| k=1

Taking into account that N ≤ Cn 3 for q ∼ log n, and applying Corollary 3 we obtain the result. 

4 The Proof of Theorem 1 In this section we assume only that max E |X jk |4+δ = μ4+δ < ∞. j,k≥1

Using the standard truncation procedure we show that the Kolmogorov distance between the empirical spectral distribution and the limiting one is of order O((np)−1 log n). This implies, the proximity of the eigenvalues to the corresponding quantiles of the limit distribution of order O(n ) with probability close to 1.


4.1 Truncation In the proof we use the standard truncation procedure (see [11], Sect. 5, for example). Introduce the truncated random values 1

X jk = X jk I{|X jk | ≤ C(np) 2 −κ }, where κ =

δ , 2(4 + δ)

the centered random values  X jk = X jk − E X jk , and the normalized random values X jk |2 . X jk , where σ 2 = E |  X˘ jk = σ −1  ˘ matrices obtained from V by replacing X jk with X jk ,  X jk , X˘ jk . Denote by V,  V, V

n (x), F n (x), F˘n (x) be the symmetrized ESD functions of the corresponding Let F matrices. Lemma 2

C . sup |Fn (x) − G y (x)| ≤ sup | F˘n (x) − G y (x)| + np x x

n (x) − Fn (x)|, supx | F n (x) − Proof We will successively estimate the values supx | F ˘ 

Fn (x)| and supx | Fn (x) − Fn (x)|. First we note that μ4+δ . (5) Pr{ X jk = X jk } ≤ (np)2 Bai’s rank inequality (see [12], Theorem A.43) gives

n (x) − Fn (x)| ≤ sup | F x

rank{ V − V} . n

Since the rank does not exceed the number of non-zero elements of the matrix, we have n  n m m  1  1

n (x) − Fn (x)| ≤ 1 sup | F ξ jk I{|X jk | ≥ C(np) 2 −κ } = η jk , n j=1 k=1 n j=1 k=1 x

where η jk = ξ jk I{|X jk | ≥ C(np) 2 −κ } are Bernoulli distributed with E η jk ≤ 1/( pn 2 ). This implies 1


n  n  m m q  q    q q    

n (x) − Fn (x)| ≤ C E sup | F E η + (η − E η )   jk jk jk nq x j=1 k=1

Cq

≤ q q + n p

n m C q q q/2   

nq

E(η jk − E η jk )2

q/2

+

j=1 k=1 n  m q q  C q

E |η jk − E η jk |q .

nq

j=1 k=1

j=1 k=1

Applying (5) and Rosenthal’s inequality (see [11], Lemma 8.1), we obtain q q  q C q μ4+δ Cqq 2 Cq qq

n (x) − Fn (x)| ≤ E sup | F + q + q (np) nq p x nq p 2 q



C q max{1, (qp) 2 , (qp)q−1 q} . (np)q

n (x) − Fn (x)| ≥ K Pr sup | F

√ C max{1, qp, qp}  1 ≤ q. np K

Using Chebyshev’s inequality, we get 

x

Choosing q ∼ log n and K ∼ const, we write  

n (x) − Fn (x)| ≥ C max{1, p log n} ≤ Cn −Q . Pr sup | F np x

n (x). Note that the elements X jk are i.i.d., Consider the centered ESD function F

 so the rank of the matrix V − V is 1 (since all elements of the matrix are the same constant E X jk ). Therefore

n (x) − F n (x)| ≤ sup | F x

C . n

Consider now the normalized ESD function F˘n (x). It is easy to see that n (x) − G y (x)| ≤ sup | F˘n (x) − G y (x)| + sup |G y (xσ ) − G y (x)|. sup | F x

x

x

Note that the random variables X˘ jk satisfy the condition (C0). Given that the density of the distribution function G y (x) is bounded by a constant, we obtain n (x) − G y (x)| ≤ sup | F˘n (x) − G y (x)| + C|σ − 1|. sup | F x

x

It can be shown, that |1 − σ | ≤

C . (np)1+2κ


Thus, n (x) − G y (x)| ≤ sup | F˘n (x) − G y (x)| + sup | F x

x

C . (np)1+2κ 

The lemma is proved.

4.2 The Proof of Theorem Now we may consider the random variables satisfying condition (C0). Put v0 = 2/3  n Cn −1 log4 n, ε = C log . We shall apply smoothing inequality (1) and the result n of Corollary 3. The estimation of the first term in the smoothing inequality when Im z = V is standard procedure since |R j j | < 1/V . It is not difficult to obtain the estimation 



−∞

|sn (u + i V ) − S y (u + i V )|du ≤ C

1 n

+

1  . np

The main problem appears in the application of the second integral of the smoothing inequality. Using the bound from Corollary 3 we get 

V

sup

|u|∈Jε



≤ ≤

v

V

v



 a (z) 1  C log n n + dv nv np |b(z)| |u|∈Jε v

 V C log2 n log(np) C log n dv dv + sup 2 nvnp|b(z)| |u|∈Jε v np|b(z)|

|sn (u + iv) − S y (u + iv)|dv ≤ sup

C log n dv + sup nv |u|∈Jε 2

 v

V

V

4

C log n C log n C log n C log n + + ≤ . n npnβ np np

The Theorem is proved.

5 The Proof of Theorem 2 The main difficulty in proving the Theorem is to obtain an estimate for diagonal elements of resolvent R in the domain Dε .


5.1 Estimation of Resolvent Diagonal Elements Let T = {1, . . . , n}, J ⊂ T and T(1) = {1, . . . , m}, K ⊂ T(1) . Consider σ -algebras M(J,K) , generated by the elements of X with the exception of the rows with number in J and the columns with number in K. We will write for brevity M j instead of M(J∪{ j},K) and Ml+n instead of M(J,K∪{l}) . By symbol X(J,K) we denote the matrix X which rows with numbers in J are deleted, and which columns with numbers in K are deleted too. In a similar way, we will denote all objects defined via X(J,K) , and so such as the resolvent matrix R(J,K) , the ESD Stieltjes transform sn(J,K) , (J,K) n on. The symbol E j denotes the conditional expectation with respect to the σ -algebra M j , and El+n — with respect to σ -algebra Ml+n . Let Jc = T \ J, Kc = T(1) \ K. Consider the events A jk (v, J, K; C) = {|R (J,K) jk (u + iv)| ≤ C}.

(6)

Set A(1) (v, J, K) = ∩nj=1 ∩nk=1 A j,k (v, J, K), A(2) (v, J, K) = ∩mj=1 ∩nk=1 A j+n,k (v, J, K), A(3) (v, J, K) = ∩nj=1 ∩m k=1 A j,k+n (v, J, K), A(4) (v, J, K) = ∩mj=1 ∩m k=1 A j+n,k+n (v, J, K).

(7)

1 For all { j, k} ∈ (Jc ∪ Kc ) × (Jc ∪ Kc ) and u we have |R (J,K) jk | ≤ v . Introduce events

(v) := Q(J,K) γ It is easy to see that

kv    (J,K)  |J| + K|   . (u + is0l v) ≤ γ an (u, s0l v) + n ns0l v l=0

Qγ (v) ⊂ Q(J,K) (v). γ

For the diagonal elements of R we can write   (J,K) (J,K) (J,K) (J,K) , for j ∈ Jc , = S (z) 1 − ε R + y R R (J,K) y n jj j jj jj (J,K) =− Rl+n,l+n

(8)

  1 (J,K) (J,K) (J,K) 1 − εl+n , for l ∈ Kc . (9) Rl+n,l+n + y (J,K) Rl+n,l+n n z + y S y (z)

(J,K) Correction terms ε(J,K) for j ∈ Jc and εl+n for l ∈ Kc are defined as j


ε(J,K) = ε(J,K) + · · · + ε(J,K) j j1 j3 , ε(J,K) = j1

m m 1  (J,K) 1  (J∪{ j},K) Rl+n,l+n − R , m l=1 m l=1 l+n,l+n

m 1  2 (J∪{ j},K) (X ξ jl − p)Rl+n,l+n , mp l=1 jl  1 (J∪{ j},K) = X jl X jk ξ jl ξ jk Rl+n,k+n ; mp 1≤l =k≤m

ε(J,K) = j2 ε(J,K) j3 and

(J,K) (J,K) (J,K) εl+n = εl+n,1 + · · · + εl+n,3 , (J,K) εl+n,1 =

n n 1  (J,K) 1  (J,K∪{l+n}) Rjj − R , m j=1 m j=1 j j

(J,K) εl+n,2 =

n 1  2 (J,K∪{l+n}) (X ξ jl − p)R j j , mp j=1 jl

(J,K) εl+n,3 =

1 mp



(J,K∪{l+n})

X jl X kl ξ jl ξkl R jk

.

1≤ j =k≤n

Theorem 3 Under the conditions (C0), for any 0 < γ < γ0 and u 0 > 0 there exist constants H = H (δ, μ4+δ , γ , u 0 ) and C = C(δ, μ4+δ , γ , u 0 ) such that for all 1 ≤ j ≤ n, 1 ≤ k ≤ m and for all z ∈ Dε , we have Qγ (v) ≤ Cn −c log n , Pr |R jk | > H ; Pr max{|R j,k+n |, |R j+n,k |} > H ; Qγ (v) ≤ Cn −c log n , Pr |R j+n,k+n | > H ; Qγ (v) ≤ Cn −c log n . Proof In what follows, we put Q := Qγ (v). Representation (8) and (9) yield that for all γ ≤ γ0 and all J, K satisfying |u 0 |(|J| + |K|)/nv ≤ 1/4, the inequalities 2 | I{Q} ≤ 2|ε(J,K) ||R (J,K) | I{Q} + √ |R (J,K) jj j jj y

(10)

2 √ (J,K) (J,K) (J,K) |Rl+n,l+n | I{Q} ≤ 2|εl+n ||Rl+n,l+n | I{Q} + 1 + √ + y y

(11)

and

hold.


Consider the off-diagonal elements of the resolvent matrix. It can be shown that for j = k ∈ Jc m   1  (J∪{ j},K) (J,K) − , R (J,K) = R X ξ R √ jl jl l+n,k jk jj mp l=1

for j = k ∈ Kc n   1  (J,K∪{ j+n}) (J,K) − , R (J,K) = R X rl ξrl Rk+n,r √ j+n,k+n j+n, j+n mp r =1

and m   1  (J∪{ j},K) (J,K) − R (J,K) = R X ξ R √ jr jr j,k+n jj r +n,l+n , mp r =1

m   1  (J∪{k},K) (J,K) − R (J,K) = R X kr ξkr Rr +n, j+n . √ j+n,k kk mp r =1

Put m n 1  1  (J∪{ j},K) (J,K∪{ j+n}) (J,K) ε(J,K) = − X ξ R , ε = − X r j ξr j Rr,k+n , √ √ jl jl l+n,k jk j+n,k+n mp l=1 mp r =1 m n 1  1  (J∪{ j},K) (J∪{k},K) (J,K) ε(J,K) = − X ξ R , ε = − X lk ξlk Rl+n,k+n . √ √ kl kl l+n, j+n j+n,k j,k+n mp l=1 mp l=1

Then we get = R (J,K) ε(J,K) R (J,K) jk jj jk ,

(J,K) (J,K) (J,K) Rl+n,k+n = Rl+n,l+n εl+n,k+n ,

(J,K) (J,K) R (J,K) ε j,k+n , jl+n = R j j

(J,K) (J,K) (J,K) Rk+n, εk+n, j . j = Rjj

(12)

 1 Pr{|R j j |I{Q} > C} ≤ Pr |ε j |I{Q} > 4

(13)

Inequalities (10) and (11) imply that

for 1 ≤ j ≤ n and C >

√4 , y

and

 1 Pr{|Rl+n,l+n |I{Q} > C} ≤ Pr |εl+n |I{Q} > 4

(14)


√ for 1 ≤ l ≤ m and C > 2(2 + 3 y +

√2 ). y

Relations (12) give

Pr{|R jk |I{Q} > C} ≤ Pr{|R j j |I{Q} > C} + Pr{|ε jk |I{Q} > 1} for 1 ≤ j = k ≤ n, and Pr{|Rl+n,k+n |I{Q} > C} ≤ Pr{|Rl+n,l+n |I{Q} > C} + Pr{|εl+n,k+n |I{Q} > 1} for 1 ≤ l = k ≤ m. Similarly we obtain Pr{|Rl,k+n |I{Q} > C} ≤ Pr{|Rl,l |I{Q} > C} + Pr{|εl,k+n |I{Q} > 1} and Pr{|Rl+n,k |I{Q} > C} ≤ Pr{|Rk,k |I{Q} > C} + Pr{|εl+n,k |I{Q} > 1}. By Rosenthal’s inequality (see [11], Lemma 8.1) we find that m  q q 1  ( j) q  ( j) q |R | E j |ε jk |q ≤ C q q 2 (nv)− 2 (Im Rkk ) 2 + q q (np)−qκ−1 n l=1 k,l+n

for 1 ≤ j = k ≤ n, and  q q q ( j+n) E j+n |ε j+n,k+n |q ≤ C q q 2 (nv)− 2 (Im Rk+n,k+n ) 2 + q q (np)−qκ−1

n 1  ( j+n) q  |R | , n r =1 k+n,r

 q q q ( j+n) E j |ε j,k+n |q ≤ C q q 2 (nv)− 2 (Im Rk+n,k+n ) 2 + q q (np)−qκ−1

n 1  ( j+n) q  |R | , n r =1 k+n,r +n

 q q q ( j+n) E j+n |ε j+n,k+n |q ≤ C q q 2 (nv)− 2 (Im Rk+n,k+n ) 2 + q q (np)−qκ−1 for 1 ≤ j = k ≤ m. Note that

n 1  ( j+n) q  |R | n r =1 k+n,r


1 c ; Q} ≤ Pr{A(4) (sv, J, K) ; Q} 4 1 + Pr{|ε(J,K) | > ; A(4) (sv, J, K); Q}, j 4 1 c (1) Pr{|ε(J,K) j+n | > ; Q} ≤ Pr{A (sv, J, K) ; Q} 4 (1) + Pr{|ε(J,K) j+n | > 1/4; A (sv, J, K); Q},

Pr{|ε(J,K) |> j

(2) Pr{|ε(J,K) jk | > 1; Q} ≤ Pr{A (sv, J, K) ; Q} c

(2) + Pr{|ε(J,K) jk | > 1; A (sv, J, K); Q}, (J,K) Pr{|εl+n,k+n | > 1; Q} ≤ Pr{A(3) (sv, J, K) ; Q} c

(J,K) | > 1; A(3) (sv, J, K); Q}, + Pr{|εl+n,k+n (4) Pr{|ε(J,K) j+n,k | > 1; Q} ≤ Pr{A (sv, J, K) ; Q} c

(4) + Pr{|ε(J,K) j+n,k | > 1; A (sv, J, K); Q}, (J,K) (4) Pr{|εk, j+n (v)| > 1; Q} ≤ Pr{A (sv, J, K) ; Q} c

(J,K) + Pr{|εk,l+n (v)| > 1; A(4) (sv, J, K); Q}.

Using Chebyshev’s inequality we get   | > 1/4; Q; A(4) } ≤ C q E E j |ε j |q I{Q(J,K) }I{A(4) }. Pr{|ε(J,K) j Applying the triangle inequality, the results of Lemmas 5–8, the property of the multiplicative gradient descent of the resolvent, we arrive at the inequality q    q2 1 qs qs 1 E j I{A (sv, J, K)}|ε j | ≤ C + + (nv)q np np (np)2κ q q q  2  2  2  2 q

  2 2 q qs 1 q 1 q s . + + + nv np (np)2κ nv (np)2κ (np)2 (4)



q

q

2

If we set q ∼ log2 n, nv > C log4 n, np > C log κ n, we get E j |ε j |q I{A(4) (sv, J ∪ { j}, K)} ≤ Cn −c log n . Moreover, the constant c can be made arbitrarily large. We may obtain similar estimates for the quantities εl+n , ε jk , ε j+n,k , ε j,k+n , ε j+n,k+n . Inequalities (13), (14) imply


Pr{|R (J,K) | I{Q} > C} ≤ Pr{A(4) (sv, J ∪ { j}, K) } + Cn −c log n , jj c

(J,K) Pr{|Rl+n,l+n | I{Q} > C} ≤ Pr{A(1) (sv, J, K ∪ {l}) } + Cn −c log n , c

(2) −c log n , Pr{|R (J,K) jk | I{Q} > C} ≤ Pr{A (sv, J, K ∪ {l}) } + Cn c

c

Pr{|R j+n,k | I{Q} > C} ≤ Pr{A(4) (sv, J, K ∪ { j}) } + Cn −c log n , c

Pr{|Rk+n, j | I{Q} > C} ≤ Pr{A(4) (sv, J, K ∪ { j}) } + Cn −c log n , c

Pr{|Rk+n, j+n | I{Q} > C} ≤ Pr{A(3) (sv, J, K ∪ { j}) } + Cn −c log n . The last inequalities give −c log n max Pr{|R (J,K) j,k | I{Q} > C} ≤ Cn

j,k∈Jc ∪Kc

+

max max{Pr{Ac (s0 v, J ∪ { j}, K)}, Pr{Ac (s0 v, J, K ∪ {k})}}.

j∈Jc ,k∈Kc

Note that kv ≤ C log n for v ≥ v0 = n −1 log4 n. So, choosing c large enough, we get Pr{Ac (v) ∩ Q} ≤ Cn −c log n . 

This completes the proof of the theorem.

Corollary 5 Under the conditions of Theorem 3, for any v ≥ v0 and q ≤ c log n there exists a constant H such that E |R jk |q I{Q} ≤ H q , for j, k ∈ T ∪ (T(1) + n). Proof We may write E |R jk |q I{Q} ≤ E |R jk |q I{Q} I{A(v)} + E |R jk |q I{Q} I{Ac (v)}. Combine this inequality with |R jk | ≤ v−1 , we find that −q

E |R jk |q I{Q} ≤ C q + v0 E I{Q} I{Ac (v)}. Applying Theorem 3, we obtain what is required.




5.2 Estimation of Tn Recall that a := an (z) = Im b(z) +

log(n) log(np) . np|b(z)|

Consider the smoothing of the indicator h γ (x): ⎧ ⎪ ⎨1, for |x| ≤ γ a, a| h γ (x, v) = 1 − ||x|−γ , for γ a ≤ |x| ≤ 2γ a, γa ⎪ ⎩ 0, for |x| > 2γ a. Note that I Qγ (v) ≤ h γ (| n (u + iv)|, v) ≤ I Q2γ (v) , where, as before,

Qγ (v) =

kv 

{| n (u + is0ν v)| ≤ γ an (u, s0ν v)}.

ν=0

We will estimate the value Dn := E |Tn |q h qγ (| n |, v). It is easy to see that E |Tn |q I{Q} ≤ Dn . To estimate Dn we use the approach developed in [13–16] that goes back to Stein’s method. Let ϕ(z) := z|z|q−2 . Set

n := Tn h γ (| n |, v). T

Then we can write

n ϕ(T

n ). Dn := E T

The equality  1− y sn (z) + ysn2 (z) = b(z) n (z) + y 2n (z) Tn = 1 + z − z


implies that for z ∈ Dε

Consider

|Tn | I{Q} ≤ C. B := A(1) ∩ A(2) ∩ A(3) ∩ A(4) ,

where A(ν) were defined in (6) and (7). Then Dn ≤ E |Tn |q I{Q} I{B} + Cn −c log n . By the definition of Tn , we may rewrite the last inequality as Dn :=

n 1

n ) I{B} + Cn −c log n . E ε j R j j h γ (| n |, v)ϕ(T n j=1

Set

Dn = Dn(1) + Dn(2) + Cn −c log n ,

(15)

where Dn(1) :=

n 1

n ) I{B}, E ε j1 R j j h γ (| n |, v)ϕ(T n j=1

Dn(2) :=

n 1

n ) I{B}, E ε j R j j h γ (| n |, v)ϕ(T n j=1

ε j := ε j2 + ε j3 . We have

and this yields

n 1

sn (z) 1 sn (z) + ε j1 R j j = n j=1 2n 2nz

n 1   C C   Im sn (z) + . ε j1 R j j  ≤  n j=1 nv n

Using Hölder’s inequality, we get |Dn(1) | ≤

Can (z) q−1 Dn q . nv

(16)


Further, consider

n( j) = E j T

n , Tn( j) = E j Tn , (nj) = E j n . T Represent Dn(2) in the form Dn(2) = Dn(21) + · · · + Dn(24) ,

(17)

where S y (z) 

n( j) ) I{B}, E ε j h γ (| (nj) |, v)ϕ(T n j=1 n

Dn(21) := Dn(22)

n 1

n( j) ) I{B}, := E ε j (R j j − S y (z))h γ (| (nj) |, v)ϕ(T n j=1

Dn(23) :=

n 1

n( j) ) I{B}, E ε j R j j (h γ (| n |, v) − h γ (| (nj) |, v))ϕ(T n j=1

Dn(24) :=

n 1

n ) − ϕ(T

n( j) )) I{B}. E ε j R j j h γ (| n |, v)(ϕ(T n j=1

ε j = 0, we find Since E j S y (z) 

n( j) ) I{Bc }. E ε j h γ (| (nj) |, v)ϕ(T n j=1 n

Dn(21) =

From here it is easy to obtain that |Dn(21) | ≤ Cn −c log n . Note that Eq. (4) implies n = where bn (z) = z −

Tn , bn (z)

1−y + y S y (z) + ysn (z). z

The last can also be rewritten as bn (z) = z −

1−y + 2y S y (z) + y n (z) = b(z) + y n (z). z

(18)


5.2.1

Estimation of Dn(22)

Using the representation of R j j , we may write n(22) + D

n(22) , Dn(22) = D where 

n( j) ) I{B}, n(22) := S y (z) E ε2j R j j h γ (| (nj) |, v)ϕ(T D n j=1 n



n(22) := y S y (z)

n( j) ) I{B}. D E ε j n R j j h γ (| (nj) |, v)ϕ(T n j=1 n

By Hölder’s inequality,

n(22) | ≤ |D

n q q−1 C  q1  E E j | ε j || n ||R j j |h γ (| (nj) |, v) I{B} Dn q . n j=1

Further,     E j | ε j || n ||R j j |h γ (| (nj) |, v) I{B} ≤ C E j | ε j || n |h γ (| (nj) |, v) I{B} . It is obvious that   ε j || n |h γ (| (nj) |, v) I{B} E j |    ε j || n |h γ (| (nj) |, v) I{B} I{|b(z)| ≤ |Tn |} ≤ E j |    ε j || n |h γ (| (nj) |, v) I{B} I{|b(z)| > |Tn |}. + E j | We have | n |h γ (| (nj) |, v) I{B} ≤ | n |h γ (| n |, v) I{B} + | n ||h γ (| (nj) |, v) − h γ (| (nj) |, v)| I{B}. Since we have |an | ≤ |b(z)| for z ∈ Dε in the last inequality, we can write  1 ≥ c|b(z)| |bn (z)|h(| (nj) |, v) ≥ |b(z)| − γ an (z) − nv and therefore | n | ≤

C|Tn | . |b(z)|

(19)


From here get   | n |h γ (| n |, v) I{B}I{ |Tn | ≤ |b(z)|} ≤ C |Tn |h( n |, v). Further, we note that if

√ |Tn | ≥ |b(z)|, then

| n |h γ (| n |, v) ≤ γ an (z)h γ (| n |, v) ≤ C|b(z)|h γ (| n |, v)  ≤ C |Tn |h γ (| n |, v). Using this we conclude that 1 1  

n | ε j || n |h γ (| (nj) |, v) I{B} ≤ E j2 | ε j |2 I{| (nj) | ≤ Can (z)}I{B} E j2 |T E j | 1 C + ε j |2 I{| (nj) | ≤ Can (z)}I{B} E j2 | an (z)   1 1  . × E j2 | n |2 | n − (nj) |2 I max{| n |, | (nj) |} ≤ C an (z) + nv

Applying Lemmas 7 and 8, we obtain 1   (a (z)) 2   an (z) + nv 1  21 n E j | . E ε j || n |h γ (| (nj) |, v) I{B} ≤ C + | T | + n 1 1 j nvan (z) (nv) 2 (np) 2 1

Using that an (z) ≥ C/nv (see Lemma 3), we get  (a (z)) 21   1  21 1 n ( j) . (20) E j | E j |Tn | + + ε j || n |h γ (| n |, v) I{B} ≤ C 1 1 nv (nv) 2 (np) 2 Combining inequalities (19), (20), and

1 nv

≤C

q−1



|Dn(22) | ≤ C(βn Dn q +

5.2.2



an (z) , nv

we get

2q−1

βn Dn 2q ).

Estimation of Dn(23)

Note that |h γ (| n |, v) − |h γ (| (nj) |, v)||R j j | I{B} C | n − (nj) | I{max{| n |, | (nj) |} ≤ 2γ an (z)} I{B}. ≤ an (z) Using Hölder’s inequality and Cauchy’s inequality, we get

(21)


F. Götze et al. q−1 q q C  q1 E [E j | ε j |2 I{Q}I(B)] 2 [E j | n − (nj) |2 I{Q}I(B)] 2 Dn q . nan (z) j=1

n

Dn(23) ≤

Applying Lemmas 7, 8, 10, obtain Dn(23)



Can−1 (z)

1 √  C an (z)  Can2 (z)  an (z) √ √ + √ np nv nv nv

1

+

Can2 (z) 1 2

(nv) (np)

1 2

Taking into account that Im b(z) ≥ Dn(23) ≤C

5.2.3

+ C nv

C 1 2

(nv) (np)

1 2

 q−1 + n −c log n Dn q .

(an (z)nv ≥ C), the last gives the bound

 a (z) q−1 C  q−1 n + Dn q = βn Dn q . nv np

Estimation of Dn(24)

By Taylor’s theorem, we have Dn(24) =

n 1

n − T

n( j) )ϕ (T

n( j) + τ (T

n − T

n( j) )) I{B}, E ε j R j j h γ (| n |, v)(T n j=1

where τ is uniform distributed on the interval [0, 1] random variables independent on all other. Since I{B} = 1 yields |R j j | ≤ C, we find that |Dn(24) | ≤

n C

n − T

n( j) ||ϕ (T

n( j) + τ (T

n − T

n( j) ))| I{B}. E | ε j |h γ (| n |, v)|T n j=1

Taking into account the inequality  ( j) q−2 

n( j) + τ (T

n − T

n( j) ))| ≤ Cq |T

n |

n − T

n( j) |q−2 , |ϕ (T + q q−2 |T obtain |Dn(24) |

n Cq 

n − T

n( j) ||T

n( j) |q−2 I{B} ≤ E | ε j |h γ (| n |, v)|T n j=1

+

n Cq q−1 

n − T

n( j) |q−1 I{B} =: D

n(24) + D n(24) . E | ε j |h γ (| n |, v)|T n j=1


Applying Hölder’s inequality, get n   q q−2 ( j) q 2 

n(24) ≤ Cq

n − T

n( j) | I{B}} 2 E q |T

n | . D E q E j {| ε j |h γ (| n |, v)|T n j=1

Jensen’s inequality gives n   q q−2 2 

n(24) ≤ Cq

n − T

n( j) | I{B}} 2 Dn q . D E q E j {| ε j |h γ (| n |, v)|T n j=1

n(24) we have to get bound for To estimate D q   q

n − T

n( j) | I{B} 2 . ε j |h γ (| n |, v)|T V j2 := E E j |

By Cauchy’s inequality, we have q

V j2 ≤ E(V j(1) ) 4 (V j(2) ) 4 , q

q

(22)

where

n − T

n( j) |2 h 2γ (| n |, v) I{B}. ε j |2 I{ Q2γ (v)} I{B}, V j(2) := E j |T V j(1) := E j | Estimation of V j(1) Lemma 7 gives C , E j |ε j2 |2 I{ Q2γ (v)} I{B} ≤ np and Lemma 8, in its turn, gives E j |ε j3 |2 I{ Q2γ (v)} I{B} ≤

C an (z). nv

Summing up the obtained estimates, we arrive at the following inequality V j(1) ≤ Estimation of V j(2)

C Can (z) + =: βn (z). nv np

(23)

n − T

n( j) . Since T

n = Tn h γ (| n |, v) and T

n( j) = E j T

n , To estimate V j(2) we consider T we have


 

n( j) = (Tn − Tn( j) )h γ (| n |, v) − E j (Tn − Tn( j) )h γ (| n |, v)

n − T T    ( j) ( j) ( j) + Tn [h γ (| n |, v) − h γ (| n |, v)] − E j h γ (| n |, v) − h γ (| n |, v) .

Further, we note that Tn = n bn = n b(z) + y 2n . Then Tn − Tn( j) = ( n − (nj) )(b(z) + 2y (nj) ) + y( n − (nj) )2 − y E j ( n − (nj) )2 . (24) We get  

n( j) = (b(z) + 2y (n j) ) ( n − (n j) )h γ (| n |, v) − E j ( n − (n j) )h γ (| n |, v)

n − T T   ( j) ( j) + y ( n − n )2 − E j ( n − n )2 h γ (| n |, v)   ( j)  − y E j ( n − n )2 h γ (| n |, v) − E j h γ (| n |, v)   ( j) ( j) (25) − Tn E j h γ (| n |, v) − h γ (| n |, v) .

Now return to estimation of V j(2) . Equality (25) implies V j(2) ≤ 4|b(z)|2 E j | n − (nj) |2 h 4γ (| n |, v) I{B}  2 + 4|b(z)|2 E j ( n − (nj) )h γ (| n |, v) E j h 2γ (| n |, v) I{B} + 8y 2 E j | (nj) |2 | n − (nj) |2 h 4γ (| n |, v) I{B}  2 + 8y 2 E j ( n − (nj) )h γ (| n |, v) E j | (nj) |2 h 2γ (| n |, v) I{B}   + 2y 2 E j | n − (nj) |4 h 4γ (| n |, v) I{B}  2 + 2y 2 E j ( n − (nj) )2 h γ (| n |, v) E j h 2γ (| n |, v) I{B}  2 + y 2 E j ( n − (nj) )2 h γ (| n |, v) E j h 2γ (| n |, v) I{B}  2  2 + y 2 E j ( n − (nj) )2 E j h γ (| n |, v) E j h 2γ (| n |, v) I{B} 2  + |Tn( j) |2 E j h γ (| n |, v) − h γ (| (nj) |, v) h 2γ (| n |, v) I{B}   2 + |Tn( j) |2 E j h γ (| n |, v) − h γ (| (nj) |, v) E j h 2γ (| n |, v) I{B}. We can rewrite it as

V j(2) ≤ A1 + A2 + A3 + A4 ,


  ( j) A1 = C|b(z)|2 E j | n − n |2 h 2γ (| n |, v) h 2γ (| n |, v) + E j h 2γ (| n |, v) I{B},  ( j)  ( j) ( j) A2 = C E j | n − n |2 h 2γ (| n |, v) | n |2 h 2γ (| n |, v) + E j | n |2 h 2γ (| n |, v) I{B},   ( j) A3 = C E j | n − n |4 h 2γ (| n |, v) h 2γ (| n |, v) + E j h 2γ (| n |, v) I{B}, 2    ( j) ( j) A4 = C|Tn |2 E j h γ (| n |, v) − h γ (| n |, v) h 2γ (| n |, v) + E j h 2γ (| n |, v) I{B}.

Applying h γ (| n |, v) ≤ 1 and Lemma 10, we find that A1 ≤ C|b(z)|2

an (z) nv



an2 (z) 1 1 an (z) + + + (nv)2 (np)(nv) np(nv) (nv)2



and a 3 (z) A2 ≤ C n nv Note that A3 ≤



an2 (z) 1 1 an (z) + + + 2 (nv) (np)(nv) np(nv) (nv)2

.

C E j | n − (nj) |2 h 2γ (| n |, v). n 2 v2

Combining Lemma 10 and an (z) ≥ Can (z) A3 ≤ (nv)3





1 , nv

get

an2 (z) an (z) 1 1 + + + 2 (nv) (np)(nv) np(nv) (nv)2

 .

Applying the triangle inequality, we obtain 2  ( j)   ( j) ( j) A4 = C E j h γ (| n |, v) − h γ (| n |, v) |Tn |2 h 2γ (| n |, v) + E j |Tn |2 h 2γ (| n |, v) 2    ( j)

n |2 + E j |Tn − Tn( j) |2 h 2γ (| n |, v) ≤ C E j h γ (| n |, v) − h γ (| n |, v) E j |T 2    ( j)

n |2 + |Tn − Tn( j) |2 h 2γ (| n |, v) . + C E j h γ (| n |, v) − h γ (| n |, v) |T

Further, since   h γ (| n |, v) − h γ (| ( j) |, v) ≤ n

C | n − (nj) | γ an (z)

× I{min{| n |, | (nj) |} ≤ 2γ an (z)}, we may write

(26)


A4 ≤

C an2 (z) + ×

n |2 | + E j |T

n |2 )| n − (nj) |2 I{min{| n |, | (nj) |} ≤ Can (z)} E j (|T

C E j | n − (nj) |2 I{min{| n |, | (nj) |} ≤ 2γ an (z)} 2 an (z) (h 2γ (| n |, v)|Tn − Tn( j) |2 + E j |Tn − Tn( j) |2 h 2γ (| n |, v)).

By representation (24), |Tn − Tn( j) |2 h 2γ (| n |, v) ≤ (|b(z)|2 + an2 (z))| n − (nj) |2 h 2γ (| n |, v). Given that |Tn | ≤ C| n ||bn (z)|, arrive at the inequality   A4 ≤ C |b(z)|2 + an2 (z) E j | n − (nj) |2 I{min{| n |, | (nj) |} ≤ Can (z)}. Applying Lemma 10, obtain    Ca (z)  a 2 (z) 1 C an (z) n n + + + . A4 ≤ C |b(z)|2 + an2 (z) nv (nv)2 (np)(nv) np(nv) (nv)2 Combining the estimates obtained for A1 , . . . , A4 we conclude that    Ca (z)  a 2 (z) an (z) 1 C n n . + V j(2) ≤C |b(z)|2 + an2 (z) + + nv (nv)2 (np)(nv) np(nv) (nv)2 (27) Inequalities (22), (23) and (27) imply the bound q

q

V j2 ≤C q βn4 (z) q  q4  a (z)  a 2 (z)  1 C  4 an (z) n n + + + . × |b(z)|2 + an2 (z) nv (nv)2 (np)(nv) np(nv) (nv)2 (28) Note that

n  q−2  

n(24) ≤ Cq 1 D V j Dn q . n j=1

Then inequality (28) yields   21  a (z)  a 2 (z) n n |b(z)|2 + an2 (z) nv (nv)2

q−2 1 C  21 an (z) + + Dn q . + (np)(nv) np(nv) (nv)2 1

n(24) ≤ Cqβn2 (z) D


Using that |b(z)| ≤ Can (z) (see Lemma 4) we may rewrite it as q−2

n(24) ≤ Cqβn2 Dn q . D

5.2.4

(29)

Estimation of  Dn(24)

Recall that n q q−1 

n − T

n( j) |q−1 h γ (| n |, v) I{B}. n(24) = C q E | ε j ||T D n j=1

Using (25) and (24), we get

n − T

n( j) | ≤ |T

C C (an (z) + |b(z)|) + 2 2 + |Tn( j) ||h(| n |, v) − h(| (nj) |, v)|. nv n v

Further, (26) gives

n − T

n( j) | ≤ |T

C C (an (z) + |b(z)|) + 2 2 nv n v C ( j) |T || n − (nj) |I{| n | ≤ Can (z)}. + an (z) n

Applying the triangle inequality, obtain

n − T

n( j) | ≤ |T

C C (an (z) + |b(z)|) + 2 2 nv n v C |Tn || n − (nj) |I{max{| n |, | (nj) |} ≤ Can (z)} + an (z) C |Tn − Tn( j) || n − (nj) |I{min{| n |, | (nj) |} ≤ Can (z)}. + an (z)

Inequality (24), the relations |Tn | ≤ C| n ||b(z)| and

n − T

n( j) | ≤ |T

1 nv

≤ Can (z) imply

C (an (z) + |b(z)|). nv

The last gives n q q−1 q−2  n(24) ≤ C q an (z) 1

n − T

n( j) |h γ (| n |, v) I{B}. D E | ε j ||T (nv)q−2 n j=1


Taking into account

n − T

n( j) |h γ (| n |, v) I{B} ≤ E(V j(1) V j(2) ) 21 E | ε j ||T and inequalities (23), (27), we get q q−1 1 n(24) ≤ C q an (z) βn2 an (z) D (nv)q−2 nv  a 2 (z) 1 C  21 an (z) × n 2+ + + . (nv) (np)(nv) np(nv) (nv)2 q−1

We rewrite it as

√ n(24) ≤ C q q q−1 βnq nv. D

(30)

Combining relations (15), (16), (17), (18), (21), (29), (30), we find that q−1 2q−1 q−2 1 √ Dn ≤Cn −c log n + Cβn Dn q + Cβn2 Dn 2q + Cβn2 q Dn q + C q q q−1 βnq nv.

Applying Young’s inequality, we conclude that q

Dn ≤C q q 2 (1 +

√ nvq q/2 )βnq .

6 Appendix Lemma 3 For any z ∈ Dε we have |b(z)| ≤



2 Im b(z).

Proof Note that b(z) =

1 2y



z−

b2  a 2  z− . z z

It is enough to prove that  Re

z−

a 2  b2  z− ≤ 0. z z

We have  Re

z−

 b2  a 2  (1 − y)2  z− = (u 2 − v2 ) 1 + − 2(1 + y). z z |z|4

(31)


It may be rewritten as  Re

z−

  2 b2  (1 − y)2  a 2  2 1 + (1 − y) . z− = |z|2 1 + − 2(1 + y) − 2v z z |z|4 |z|4

It is straightforward to check that for 1 −



y ≤ |z| ≤ 1 +



y

 (1 − y)2  − 2(1 + y) ≤ 0. |z|2 1 + |z|4 √ √ √ This implies that for 1 − y ≤ |z| ≤ 1 + y inequality (31) holds. Since 1 − y ≤ √ |z| for all z ∈ Dε , to prove (31) for z ∈ Dε is enough to consider |z| ≥ 1 + y. For this case we have  Re

z−

b2  a 2  √ √ z− ≤ u 2 − v2 + (1 − y)2 − 2(1 + y) ≤ u 2 − (1 + y)2 ≤ 0. z z

Thus Lemma is proved. √ √ Lemma 4 For 1 − y ≤ |u| ≤ 1 + y the inequality



|b(z)| ≤ Can (z) holds. Proof Note that 1−y b(z) = z − + 2y S y (z) = z and an (z) = Im

!

It is easy to show that for 1 −



 Re Indeed,

z−

z−

1 − y 2 − 4y z

 1 1 − y 2 1 + . − 4y + z nv np

y ≤ |u| ≤ 1 +

z−

!



y

 1 − y 2 − 4y ≤ 0. z

 1 − y 2 (1 − y)2 − 4y ≤ u 2 + − 2(1 + y). z u2 √ √ The last expression is not positive for 1 − y ≤ |u| ≤ 1 + y. From the negativity of the real part it follows that 

Re

z−


! Im

!   1 − y 2 1 − y 2 1    z− − 4y ≥ √  z − − 4y  z z 2 

This implies the required. The Lemma is proved. Lemma 5 Under the conditions of Theorem 3, for j ∈ J and l ∈ K we have c

c

C (J,K) max |ε(J,K) . j1 |, |εl+n,1 | ≤ nv Proof For simplicity, we consider only the case J = ∅ and K = ∅. Note that m − n  1  m − n − 1  Tr R − − Tr R( j) − 2m z |z|  1 1  Tr R − Tr R( j) − . = 2m 2mz

ε j1 =

Applying Schur’s formula, we get |ε j1 | ≤

1 . nv

The second inequality is proved similarly. Lemma 6 For any z ∈ Dε



lim (nvb(z)) = ∞.

n→∞

Proof First we note that b(z) =

1 (z − a)(z + a)(z − b)(z + b). z

It is straightforward to check that, for z ∈ Dε   √ √ √ √ |b(z)| ≥ C min{|1 − y − z|, |1 − y + z|, |1 + y − z||1 + y + z|} ≥ C β + v.

Further,

  nv|b(z)| ≥ C(nv0 / β) β + v0 ≥ nv0 → ∞. 

Thus Lemma is proved. Lemma 7 Under the conditions of Theorem 3, for all j ∈ Jc the inequalities 2 E j |ε(J,K) j2 | ≤

and

m μ4 1   (J∪{ j},K) 2 R np n l=1 l+n,l+n

Rate of Convergence for Sparse Sample Covariance Matrices (J,K) 2 El+n |εl+n,2 | ≤


n μ4 1   (J,K∪{l}) q R . np n j=1 j j

are valid. In addition for q > 2 we have q q E j |ε(J,K) j2 | ≤ C

 q q2

+

q

(np) 2

and (J,K) q | ≤ Cq El+n |εl+n,2

 q q2 (np)

q 2

m 1   (J∪{ j},K) q qq R  2qκ+1 (np) n l=1 l+n,l+n

+

n 1   (J,K∪{l}) q qq R  2qκ+1 (np) n j=1 j j

for l ∈ Kc . Proof For simplicity, consider the case J = ∅ and K = ∅. The first two inequalities are obvious. We consider only q > 2. Applying Rosenthal’s inequality (see [11], Lemma 8.1), obtain for any q > 2 E j |ε j2 |q =

m  q 1   ( j) 2 E (X ξ − p)R  j jl jl l+n,l+n  q (mp) l=1

2 Cq  q   ( j) q2 E j |X 2jl ξ jl − p|2 |Rl+n,l+n |2 q (mp) l=1 m



q

+ qq

m 

( j)

E j |X 2jl ξ jl − p|q |Rl+n,l+n |q



l=1 m m 1  q   q qq C  ( j) ( j) 2 2 2 (qμ ) |R | +  μ |Rl+n,l+n |q ≤ q q 4 2q l+n,l+n m l=1 (mp) 2 (mp) 2 l=1 q m  (qμ ) 2  q  1 mq 4 ( j)  μ2q |R |q . (32) ≤ Cq q + q (mp) m l=1 l+n,l+n (mp) 2 q

Recall that  μr = E |X jk ξ jk |r and by the conditions of the Theorem  μ2q ≤ C q p(np)q−2qκ−2 μ4+δ . Substituting the last inequality in (32), we get


E j |ε j2 | ≤ C q

q

 q q2 (mp)

q 2

+

m 1  qq ( j) |R |q . (mp)2qκ+1 m l=1 l+n,l+n



The second inequality can be proved similarly. Lemma 8 Under the conditions of Theorem 3 for all j ∈ TJ , l ∈ T1K 2 E j |ε(J,K) j3 | ≤

(J∪{ j},K)

C Im sn nv

,

(J,K) 2 El+n |εl+n,3 | ≤

(J,K∪{l})

C Im sn nv

are valid. In addition for q > 2 we have q E j |ε(J,K) j3 |

  C Im s (J∪{ j},K)  q2 q n ≤ C q q (nv)− 2 nv m q 3q q 1  (J∪{ j},K) 2 Im Rl+n,l+n + q 2 (nv)− 2 (np)−qκ−1 n l=1  m m 1    (J∪{ j},K) q + q 2q (np)−2qκ 2  Rl+n,k+n  , n l=1 k=1 q

  C Im s (J∪{ j},K)  q2 q n (J,K) q El+n |εl+n,3 | ≤ C q q q (nv)− 2 nv m q 3q q 1  (J,K∪{l+n}) Im R j j + q 2 (nv)− 2 (np)−qκ−1 n j=1 n  n      (J,K∪{l+n}) q 2q −2qκ −2 + q (np) n  Rk j  . j=1 k=1

Proof It suffices to apply the inequality from Corollary 1 of [17]. Lemma 9 Under the conditions of the theorem , the bound E j |R j j − E j R j j |2 I{Q} I{B} ≤ C

 a (z) A2 (z)  n + 0 + Cn −c log n nv np

is valid. Proof Consider the equality Rjj = −

1 z−

1−y z

( j)

+ ysn (z)

  1 + εj Rjj .




It implies Rjj − Ej Rjj = −

1 z−

1−y z

+

It is well-known that y S y (z) + z − Consequently,

( j) ysn (z)

 

εj Rjj − Ej εj Rjj .

1 1−y = . z S y (z)

 1 − y  √   ≥ y. y S y (z) + z − z

Further, note that for a sufficiently small γ there is a constant H such that     z −

    I{Q} ≤ H I{Q}. ( j) + ysn (z)  1

1−y z

Hence,  E j |R j j − E j R j j |2 I{Q} I{B} ≤ H 2 E j | ε j |2 |R j j |2 I{Q} I{B}  + E j I{Q} I{B} E j | ε j |2 |R j j |2 . It is easy to see that E j | ε j |2 |R j j |2 I{Q} I{B} ≤ C E j | ε j |2 I{Q} I{B} ≤ C Introduce events

It is obvious that

1 C an (z) + . nv np

 1 Q( j) = | (nj) | ≤ 2γ an (z) + . nv I{Q} ≤ I{Q} I{Q( j) }.

Consequently, ε j |2 |R j j |2 ≤ E j I{Q} I{B} E j | ε j |2 |R j j |2 I{Q( j) }. E j I{Q} I{B} E j | Further, consider  Q = {| n | ≤ 2γ an (z)}. We have Q}. I{Q( j) } ≤ I{


Then it follows that ε j |2 |R j j |2 ≤ E j I{Q} I{B} E j | ε j |2 |R j j |2 I{ Q}. E j I{Q} I{B} E j | Next, the inequality "c } ε j |2 |R j j |2 I{ Q} ≤ E j | ε j |2 |R j j |2 I{ Q}I{ B} + E j | ε j |2 |R j j |2 I{ Q}I{B E j |

(33)

holds. By the conditions C0 and the inequality |R j j | ≤ v0−1 , we obtain the bound "c } ≤ Cn −c log n . ε j |2 |R j j |2 I{ Q}I{B E j | For the first term on the right side of (33) get  a (z) C n . ε j |2 |R j j |2 I{ Q}I{ B} ≤ C + E j | nv np 

This completes the proof. Lemma 10 Under the conditions of the theorem, we have E j | n − (nj) |2 I{Q} I{B} ≤

( j)

1 C  Can (z)  an2 (z) an (z) + + + nv (nv)2 (np)(nv) np(nv) (nv)2 −c log n +n .

( j)

n = sn (z) − S y (z). By the Schur complement formula, Proof Set

(nj) = n −

m  1  1 1+ X jl X jk ξ jl ξ jk [R ( j) ]2 k+n,l+n R j j . 2n np l.k=1

( j)

n is measurable with respect to M( j) , we may write Since

(nj) ) − E j { n −

(nj) }. n − (nj) = ( n − Introduce the notation 1  2 1  2 2 (X jl ξ jl − p)[R( j) ]l+n,l+n , η j2 = X jl X jk ξ jl ξ jk [R( j) ]k+n,l+n . np l=1 np l=1 m

η j1 =

m


In these notation n −

(nj)

m  1  ( j) 2 1 1+ = [R ]l+n,l+n (R j j − E j R j j ) n n l=1

+

1 1 (η j1 + η j2 )R j j − E j (η j1 + η j2 )R j j . n n

Note that E j |η j1 |2 I{Q} I{B} ≤ Since

m 2 C p   ( j) 2  I{Q( j) } I{B( j) }. [R ] l+n,l+n n 2 p 2 l=1

m    ( j)   ( j) 2  R l+n,k+n 2 ≤ C Im R ( j) [R ]l+n,l+n  ≤ l+n,l+n , v k=1

Theorem 3 gives E j |η j1 |2 I{Q} I{B} ≤

m 2 C 1  Can (z) ( j) Im Rl+n,l+n I{Q( j) } I{B( j) } ≤ . 2 npv n l=1 npv2

Similarly, for the second moment of η j2 we have the estimate m 2 C p 2   ( j) 2 [R ] l+n,k+n  I{Q( j) } I{B( j) } E j |η j2 | I{Q} I{B} ≤ 2 2 n p l,k=1 2



C C Tr |R|4 I{Q( j) } I{B( j) } ≤ 3 an (z). n2 nv

From the above estimates and Lemma 9 we conclude that E j | n − (nj) |2 I{Q} I{B} Can2 (z)(z) E j |R j j − E j R j j |2 I{Q} I{B} n 2 v2  a (z) C  Can (z)  Can (z)  C n + + n −c log n . ≤ + + np(nv)2 (nv)3 (nv)2 np (nv) ≤

The lemma is proved.



298

F. Götze et al.

Lemma 11 There is an absolute constant C > 0 such that for any z = u + iv satisfying √ √ 1 − y − v ≤ |u| ≤ 1 + y + v and v > 0 the inequality | n | ≤ C min

 |T |   n , |Tn | |b(z)|

(34)

is valid. Proof Make the change of variables by setting 1  1− y w= √ z− , y z and

 √ w y + yw2 + 4(1 − y) z= , 2

√  S(w) = y S y (z),

 sn (w) =



ysn (z).

In this notation, we can rewrite the main equation in the form sn2 (w) = Tn . 1 + w sn (w) + It is easy to see that 1 sn (z) −  S(w)). n = √ ( y Now it suffices to repeat the proof of Lemma B.1 from [18]. Note that this Lemma implies (34) for w satisfying | Re w| ≤ 2 + Im w. From this we conclude that inequal√ √ ity (34) holds for z = u + iv such that 1 − y ≤ |u| ≤ 1 + y. The Lemma is proved.  Lemma 12 Under conditions of Theorem 3 we have  1  1  + , for all − ∞ < u < ∞ ≥ 1 − Cn −Q Pr | n (u + i V )| ≤ C|S y (u + i V )|2 n np

Proof First we note that |R jk | ≤ 1/V , for v = V for j, l = 1, . . . n, This implies that |bn (u + i V )| ≥

1 1 − . |S y (u + i V )| V

We may choose V so large that |bn (u + i V )| ≥

2 . |S y (z)|

Rate of Convergence for Sparse Sample Covariance Matrices

299

From here it follows n n  C|S (z)|2    C|S y (z)|   y    εj Rjj ≤ εj | n | ≤   n n j=1 j=1

 C|S y (z)|   |ε j ||R j j − S y (z)| . n j=1 n

+ Using Eq. (9), we get

n n 1   1     ε j  + C|S y (z)|2 |ε j |2 . | n | ≤ C|S y (z)|2  n j=1 n j=1

From this inequality it follows that n n q 1   1     εj + E |ε j |2q . E | n |q ≤ C q |S y (z)|2q E  n j=1 n j=1

(35)

Applying Lemmas 5, 7, 8, we get n 1 Cq E |ε j |2q ≤ . n j=1 (np)q

To estimate the first term in the right hand side of (35), we use triangle inequality n n n n q q q q  1  1  1   1          E ε j  ≤ Cq E  ε j1  + E  ε j2  + E  ε j3  =: S1 + S2 + S3 . n n n n j=1

By Lemma 5

j=1

j=1

j=1

n 1  q Cq   E ε j1  ≤ q . n j=1 n

To estimate the second and the third terms we use the inequality of [17, Theorem1]. We get Cq qq Cq qq S2 ≤ , S ≤ . 3 (np)q nq It is enough to repeat the proof of [17, Theorem 3]. We omit here the details. Thus Lemma 12 is proved. 

300

F. Götze et al.

References 1. Wishart, J.: The generalised product moment distribution in samples from a normal multivariate population. Biometrika 20A(1/2), 32–52 (1928) 2. Wigner, E.P.: Characteristic vectors of bordered matrices with infinite dimensions. Ann. Math. 62(3), 548–564 (1955) 3. Marchenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.) 72(4), 507–536 (1967) 4. Telatar, E.: Capacity of multi-antenna Gaussian channels. Eur. Trans. Telecomm. 10(6), 585– 595 (1999) 5. Erd˝os, L., Knowles, A., Yau, H.-T., Yin, J.: Spectral statistics of Erd˝os-Rényi graphs I: Local semicircle law. Ann. Probab. 41(3B), 2279–2375 (2013) 6. Erd˝os, L., Knowles, A., Yau, H.-T., Yin, J.: Spectral statistics of Erd˝os-Rényi graphs II: eigenvalue spacing and the extreme eigenvalues. Commun. Math. Phys. 314(3), 587–640 (2012) 7. Lee, J.O., Schnelli, K.: Tracy-Widom distribution for the largest eigenvalue of real sample covariance matrices with general population. Ann. Appl. Probab. 26(6), 3786–3839 (2016) 8. Hwang, J.Y., Lee, J.O., Schnelli, K.: Local law and Tracy-Widom limit for sparse sample covariance matrices. Ann. Appl. Probab. 29(5), 3006–3036 (2019) 9. Hwang, J.Y., Lee, J.O., Yang, W.: Local law and Tracy-Widom limit for sparse stochastic block models. Bernoulli 26(3), 2400–2435 (2020) 10. Lee, J.O., Schnelli, K.: Local law and Tracy-Widom limit for sparse random matrices. Probab. Theory Relat. Fields 171, 543–616 (2018) 11. Götze, F., Tikhomirov, A.N.: Rate of Convergence of the Expected Spectral Distribution Function to the Marchenko–Pastur Law (2014). https://arxiv.org/abs/1412.6284 12. Bay, Z.D., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices, 2nd edn. Springer Series in Statistics, New York, NY, USA (2010) 13. Götze, F., Tikhomirov, A.N.: Optimal bounds for convergence of expected spectral distributions to the semi-circular law. Probab. Theory Relat. Fields. 165, 163–233 (2016) 14. Götze, F., Naumov, A.A., Tikhomirov, A.N.: On the local semicircular law for Wigner ensembles. Bernoulli 24(3), 2358–2400 (2018) 15. Götze, F., Naumov, A.A., Tikhomirov, A.N.: Local semicircle law under moment conditions: the stieltjes transform, rigidity, and delocalization. Theory Probab. Appl. 62(1), 58–83 (2018) 16. Götze, F., Naumov, A.A., Tikhomirov, A.N.: Local semicircle law under fourth moment condition. J. Theor. Probab. 33, 1327–1362 (2020) 17. Götze, F., Naumov, A.A., Tikhomirov, A.N.: Moment inequalities for linear and nonlinear statistics. Theory Probab. Appl. 65(1), 1–16 (2020) 18. Götze, F., Naumov, A.A., Tikhomirov, A.N.: Local Semicircle Law under Moment Conditions. Part I: The Stieltjes Transform (2016). https://arxiv.org/abs/1510.07350

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces Martin Wahl

Abstract We establish non-asymptotic lower bounds for the estimation of principal subspaces. As applications, we obtain new results for the excess risk of principal component analysis and the matrix denoising problem. Keywords Van Trees inequality · Cramér-Rao inequality · Group equivariance · Orthogonal group · Haar measure · Principal subspace · Doubly substochastic matrix

1 Introduction Many learning algorithms and statistical procedures rely on the spectral decomposition of some empirical matrix or operator. Leading examples are principal component analysis (PCA) and its extensions to kernel PCA or manifold learning. In modern statistics and data science, such methods are typically studied in a high-dimensional or infinite-dimensional setting. Moreover, a major focus is on non-asymptotic results, that is, one seeks results that depend optimally on the underlying parameters (see, e.g., [16, 19] for two recent developments). In this paper, we are concerned with non-asymptotic lower bounds for the estimation of principal subspaces, that is the eigenspace of the, say d, leading eigenvalues. As stated in [5], it is highly nontrivial to obtain such lower bounds which depend optimally on all underlying parameters, in particular the eigenvalues and d. In fact, in contrast to asymptotic settings where one can apply the local asymptotic minimax theorem [12], it seems unavoidable to use some more sophisticated facts on the underlying parameter space of all orthonormal bases in order to obtain nonasymptotic lower bounds. A state-of-the-art result, obtained in [5, 23], provides a non-asymptotic lower bound for the spiked covariance model with two groups of eigenvalues. To state their result, consider the statistical model defined by M. Wahl (B) Faculty of Mathematics, Bielefeld University, Bielefeld, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_8

301

302

M. Wahl

(PU )U ∈O( p) ,

PU = N(0, U U T )⊗n ,

(1)

where O( p) denotes the orthogonal group,  = diag(λ1 , . . . , λ p ) is a diagonal matrix with λ1 ≥ · · · ≥ λ p > 0 and N(0, U U T ) denotes a Gaussian distribution with expectation zero and covariance matrix U U T . This statistical model corresponds to observing n independent N(0, U U T )-distributed random variables X 1 , . . . , X n , and we will write EU to denote expectation with respect to X 1 , . . . , X n having law PU . Moreover, in this model, the d-th principal subspace (resp. its corresponding orthogonal projection) is given by P≤d (U ) = i≤d u i u iT , where u 1 , . . . , u p are the columns of U ∈ O( p). Theorem 1 ([5]) Consider the statistical model (1) with λ1 = · · · = λd > λd+1 = · · · = λ p > 0. Then there is an absolute constant c > 0 such that  d( p − d) λ λ  d d+1 inf sup EU  Pˆ − P≤d (U )22 ≥ c · min , d, p − d , n (λd − λd+1 )2 Pˆ U ∈O( p) ˆ 1 , . . . , X n ) with values in where the infimum is taken over all estimators Pˆ = P(X the class of all orthogonal projections on R p of rank d and  · 2 denotes the HilbertSchmidt (or Frobenius) norm. The proof is based on applying lower bounds under metric entropy conditions [25] combined with the metric entropy of the Grassmann manifold [20]. This (Grassmann) approach has been applied to many other principal subspaces estimation problems and spiked structures (see, e.g., [4, 6, 10, 17]). In principle, it can also be applied to settings with decaying eigenvalues by considering spiked submodels. Yet, since this leads to lower bounds of a specific multiplicative form, it seems difficult to recover the optimal weighted eigenvalue expressions appearing in the asymptotic limit [7] and in the non-asymptotic upper bounds from [15, 21]. To overcome this difficulty, [24] proposed a new approach based on a van Treestype inequality with reference measure being the Haar measure on the special orthogonal group S O( p). The key ingredient is to explore the group equivariance of the statistical model (1), allowing to take advantage of the Fisher information geometry more efficiently. For instance, a main consequence of the developed theory is the following non-asymptotic analogue of the local asymptotic minimax theorem. Theorem 2 ([24]) Consider the statistical model (1) with λ1 ≥ · · · ≥ λ p > 0. Then there are absolute constants c, C > 0 such that, for every h ≥ C, we have  inf Pˆ

S O( p)

EU  Pˆ − P≤d (U )22 πh (U )dU ≥ c ·

 i≤d j>d

min

λi λ j 1  , , n (λi − λ j )2 h 2 p

1

ˆ 1 , . . . , X n ), where the infimum is taken over all R p× p -valued estimators Pˆ = P(X dU denotes the Haar measure on S O( p), and the prior density πh is given by

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

πh (U ) = 

303

exp(hp tr U ) S O( p) exp(hp tr U ) dU

with tr U denoting the trace of U . Clearly, for the from a non-asymptotic point of view optimal choice h = C, Theorem 2 implies Theorem 1, as can be seen from inserting 2d( p − d)/ p ≥ min(d, p − d). Moreover, as shown in [24, Section 1.3], Theorem 2 can be used to derive tight non-asymptotic minimax lower bounds under standard eigenvalue conditions from functional PCA or kernel PCA. The goal of this paper is to extend the theory of [24] in several directions. First, we provide a lower bound for the excess risk of PCA. This loss function can be written as a weighted squared loss and a variation of the approach in [24] allows to deal with it. More precisely, we establish a slightly complementary van Trees-type inequality tailored for principal subspace estimation problems and dealing solely with the uniform prior. Interestingly, such uniform prior densities lead to trivial results in van Trees inequality from [24] (as well as in previous classical van Trees approaches [11, 22]). Finally, we provide lower bounds that are characterized by doubly substochastic matrices whose entries are bounded by the inverses of the different Fisher information directions, confirming recent non-asymptotic upper bounds that hold for the principal subspaces of the empirical covariance operator.

2 A Van Trees Inequality for the Estimation of Principal Subspaces In this section, we state a general van Trees-type inequality tailored for principal subspace estimation problems. Applications to more concrete settings are presented in Sect. 4. Let (X, F , (PU )U ∈O( p) ) be a statistical model with parameter space being the orthogonal group O( p). Let (A, ·, ·) be a real inner product space of dimension m ∈ N and let ψ : O( p) → A be a derived parameter. We suppose that O( p) acts (from the left, measurable) on X and A such that (A1) (PU )U ∈O( p) is O( p)-equivariant (i.e., PV U (V E) = PU (E) for all U, V ∈ O( p) and all E ∈ F ) and U → PU (E) is measurable for all E ∈ F . (A2) ψ is O( p)-equivariant (i.e., ψ(V U ) = V ψ(U ) for all U, V ∈ O( p)). (A3) U a, U b = a, b for all a, b ∈ A and all U ∈ O( p). Condition (A1) says that for a random variable X with distribution PU , the random variable V X has distribution PV U . For more background on statistical models under group action, the reader is deferred to [8, 9] and to [24, Sect. 2.3]. Next, we specify the allowed loss functions. Let (v1 , . . . , vm ) : O( p) → Am be such that for all j = 1, . . . , m,

304

M. Wahl

(A4) v1 (U ), . . . , vm (U ) is an orthonormal basis of A for all U ∈ O( p). (A5) v j are O( p)-equivariant (i.e., v j (V U ) = V v j (U ) for all U, V ∈ O( p)). For w ∈ Rm >0 we now define the loss function lw : O( p) × A → R≥0 ,

lw (U, a) =

m 

wk vk (U ), a − ψ(U )2 .

k=1

If w1 = · · · = wm = 1, then lw does not depend on v1 , . . . , vm and is equal to the squared norm in A l(1,...,1) (U, a) = a − ψ(U )2 = a − ψ(U ), a − ψ(U ).

(2)

For general w, the loss function lw is itself invariant in the sense that lw (V U, V a) = lw (U, a) for all U, V ∈ O( p), a ∈ A,

(3)

ˆ ) based on an obseras can be seen from (A2), (A3) and (A5). For an estimator ψ(X ˆ )), where EU denotes vation X from the model, the lw -risk is defined as EU lw (U, ψ(X expectation when X has distribution PU . In order to formulate our abstract main result, we also need some differentiability conditions on ψ and the v j . We assume that ψ and v j are differentiable at the identity matrix I p in the sense that for all ξ ∈ so( p), all a ∈ A and all j = 1, . . . , m, we have , a = dψ(I p )ξ, a. t→0 t  v j (exp(tξ )) − v j (I p ) , a = dv j (I p )ξ, a. (A7) lim t→0 t Here, dψ(I p )ξ and dv j (I p )ξ denote the directional derivatives at I p defined on the Lie algebra so( p) on S O( p) (i.e., the tangent space of O( p) at I p ). Since A is finite-dimensional, conditions (A6) and (A7) can also formulated in the normsense (e.g., limt→0 t −1 (ψ(exp(tξ )) − ψ(I p )) − dψ(I p )ξ  = 0 for all ξ ∈ so( p)). For some background on the special orthogonal group S O( p) and its Lie algebra so( p), see, e.g., [24, Sect. 2.1]. 2 between two probability measures P Q is defined Recall that the  χdP-divergence 2 as χ (P, Q) = ( dQ )2 dQ − 1. (A6) lim

 ψ(exp(tξ )) − ψ(I p )

Proposition 1 Assume (A1)–(A7). Let ξ1 , . . . , ξm ∈ so( p) be such that Pexp(tξ j ) P I p for all j = 1, . . . , m and all t small enough. Suppose that there are a1 , . . . , am ∈ (0, ∞] such that, for all j = 1, . . . , m, 1 2 χ (Pexp(tξ j ) , P I p ) = a −1 j , t→0 t 2 lim

ˆ ) with values in A, we have Then, for all estimators ψˆ = ψ(X

(4)

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

305

m  2 v j (I p ), dψ(I p )ξ j 



ˆ )) dU ≥ EU lw (U, ψ(X O( p)

m 

−1 w−1 j aj

j=1 m 

+

j=1

k=1

wk−1

m 

2 . vk (I p ), dv j (I p )ξ j 

j=1

Remark 1 In applications, the a −1 j ∈ [0, ∞) will be different Fisher information directions. We use the inverse notation because it will be more suitable to solve the final optimization problem in Sect. 5.2. Remark 2 Let us briefly compare Proposition 1 to [24, Proposition 1 and Theorem 3], where a more general van Trees inequality is presented. In fact, the bound [24, Theorem 3] has a more classical form and involves a general prior, an average over the prior in the numerator and Fisher informations of the prior in the denominator. Yet, while these Fisher informations are zero for the uniform prior considered in Proposition 1, the averages in the numerator are zero as well. Hence, [24, Theorem 3] is trivial for the uniform prior. The reason that we can deal with the uniform prior lies in the fact that in addition to the equivariance of the statistical model, we also require equivariance of the derived parameter and invariance of the loss function.

3 Proof of Proposition 1 We provide a proof which manifests Proposition 1 as a Cramér-Rao-type inequality for equivariant estimators.

3.1 Reduction to a Pointwise Risk We use [24, Lemma 4] in order to reduce the Bayes risk of Proposition 1 to a pointwise risk minimized over the class of all equivariant estimators. For completeness we briefly repeat the (standard) argument. Let ψ˜ be an arbitrary estimator with values in A. Without loss of generality we may restrict ourselves to estima˜ < ∞. (Indeed, by (A2) and (A3) we know tors with bounded norm supx∈X ψ(x) ˜ > Cw = that supU ∈O( p) ψ(U ) = C < ∞. Hence, for x ∈ X such that ψ(x) 1/2 2C(wmax /wmin ) with wmax = maxk wk < ∞ and wmin = mink wk > 0, we have 1/2 1/2 1/2 1/2 1/2 1/2 ˜ > wmin Cw − wmax C = wmax C.) Hence, we lw (U, 0) ≤ wmax C, while lw (U, ψ(x)) can construct  ˆ ˜ x) d V, ψ(x) = x ∈ X. V T ψ(V O( p)

306

M. Wahl

By [24, Lemma 4] this defines an O( p)-equivariant estimator (that is, it holds that ˆ x) = U ψ(x) ˆ ψ(U for all x ∈ X and all U ∈ O( p)) satisfying 

˜ )) dU ≥ EU lw (U, ψ(X O( p)



ˆ )) dU, EU lw (U, ψ(X O( p)

where we used (A1) and the facts that the loss function lw is convex in the second argument and satisfies (3). Moreover, using that ψˆ is O( p)-equivariant, it follows ˆ )) is constant over U ∈ O( p). again from [24, Lemma 4] that the risk EU lw (U, ψ(X Hence, we arrive at  ˜ )) dU ≥ ˆ )), EU lw (U, ψ(X inf E I p lw (I p , ψ(X inf ψ˜

ψˆ O( p)-equivariant

O( p)

and it suffices to lower bound the right-hand side.

3.2 A Pointwise Cramér-Rao Inequality for Equivariant Estimators The classical Cramér-Rao inequality provides a lower bound for the (co-)variance of unbiased estimators. In this section, we show that in our context, a similar lower bound can be proved for the class of all equivariant estimators. Lemma 1 Assume (A1)–(A7). Let ξ1 , . . . , ξm ∈ so( p) be such that Pexp(tξ j ) P I p for all j = 1, . . . , m and all t small enough. Suppose that there are a1 , . . . , am ∈ (0, ∞] such that limt→0 χ 2 (Pexp(tξ j ) , P I p )/t 2 = a −1 j for all j = 1, . . . , m. Then, for ˆ any O( p)-equivariant estimator ψ(X ) with values in A, we have m  2 v j (I p ), dψ(I p )ξ j 

ˆ )) ≥ E I p lw (I p , ψ(X

m  j=1

−1 w−1 j aj

j=1 m 

+

wk−1

k=1

m 

2 . vk (I p ), dv j (I p )ξ j 

j=1

Proof As shown in Sect. 3.1, we can restrict ourselves to estimators ψˆ that are bounded. For U j = exp(tξ j ), j = 1, . . . , m, consider the expression m 

ˆ ) − ψ(I p ) − E I p v j (I p ), ψ(X

j=1

Clearly (5) is equal to

m  j=1

ˆ ) − ψ(U jT ). E I p v j (I p ), ψ(X

(5)

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces m 

v j (I p ), ψ(U jT ) − ψ(I p ).

307

(6)

j=1

ˆ we have On the other hand, using (A1)–(A3), (A5) and the equivariance of ψ, ˆ ) − ψ(U jT ) E I p v j (I p ), ψ(X ˆ ) − U jT ψ(I p ) ˆ jT X ) − ψ(U jT ) = EU j v j (I p ), U jT ψ(X = EU j v j (I p ), ψ(U dP ˆ ) − ψ(I p ) = E I p U j (X )v j (U j ), ψ(X ˆ ) − ψ(I p ). = EU j v j (U j ), ψ(X dP I p Hence, (5) is also equal to m m   dPU j ˆ ) − ψ(I p ) − E I p ˆ ) − ψ(I p ). v j (I p ), ψ(X (X )v j (U j ), ψ(X dP I p j=1 j=1

EI p

(7) Using (5)–(7), Parseval’s identity, (A4) and the Cauchy-Schwarz inequality (twice), we arrive at m 

2

v j (I p ), ψ(U Tj ) − ψ(I p )

j=1 m m dP  2   Uj ˆ ) − ψ(I p ) − E I ˆ ) − ψ(I p ) = EI p v j (I p ), ψ(X (X )v j (U j ), ψ(X p dP I p j=1



= EI p

j=1

m  m  k=1



≤ EI p

m 

v j (I p ), vk (I p ) −

j=1

m dP  Uj j=1

ˆ ) − ψ(I p )2 wk vk (I p ), ψ(X

dP I p

2 ˆ ) − ψ(I p ) (X )v j (U j ), vk (I p ) vk (I p ), ψ(X



k=1 m m m dP  2 

   Uj × EI p wk−1 v j (I p ), vk (I p ) − (X )v j (U j ), vk (I p ) . dP I p k=1

j=1

j=1

The first term on the right-hand side is equal to the lw -risk of ψˆ at I p . Moreover, the second term can be written as m  k=1

wk−1

m

 

2

v j (U j ) − v j (I p ), vk (I p )

+ EI p

j=1

m  dP  U j=1

j

dP I p

 2 . (X ) − 1 v j (U j ), vk (I p )

In particular, we have proved that ˆ )) ≥  E I p lw (I p , ψ(X m k=1

D2 wk−1 (Dk2 + E I p (Bk + Ck )2 )

(8)

308

M. Wahl

with m  v j (I p ), ψ(U jT ) − ψ(I p ), D= j=1 m  Dk = v j (U j ) − v j (I p ), vk (I p ), j=1

Bk =

m   dPU j j=1

Ck =

dP I p

m   dPU j j=1

dP I p

 (X ) − 1 v j (I p ), vk (I p ),  (X ) − 1 v j (U j ) − v j (I p ), vk (I p ).

We now invoke a limiting argument to deduce Lemma 1 from (8). For this, recall that U j = exp(tξ j ), ξ j ∈ so( p), multiply numerator and denominator by 1/t 2 and let t → 0. First, by (A6) and (A7), we have  1 D→− v j (I p ), dψ(I p )ξ j , t j=1 m

 1 Dk → vk (I p ), dv j (I p )ξ j  t j=1 m

as t → 0. Moreover, by assumption (4), we have m m m  1  −1 1  −1 2 −1 2 w E B = w χ (P , P ) → w−1 as t → 0. I U I p j p k j aj t 2 k=1 k t 2 j=1 j j=1

On the other hand, Ck is asymptotically negligible, as can be seen from    1  2 1 E I p Ck2 ≤ 2 χ (PU j , P I p ) v j (U j ) − v j (I p ), vk (I p )2 → 0 2 t t j=1 j=1 m

m

as t → 0. Here, we used (4) and (A7). Thus, m m  1  −1 −1 2 2 w E (B + C ) − w E B I k k I p p k k k t 2 k=1 k=1



m 1  −1 w (2(E I p Bk2 )1/2 (E I p Ck2 )1/2 + E I p Ck2 ) → 0 t 2 k=1 k

as t → 0. The proof now follows from inserting these limits into (8).



Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

309

4 Applications In this section, we specialize our lower bounds in the context of principal component analysis (PCA) and a low-rank matrix denoising model. In doing so, we will focus on the derived parameter ψ(U ) = P≤d (U ) =



u i u iT ,

U ∈ O( p),

i≤d

where 1 ≤ d ≤ p and u 1 , . . . , u p are the columns of U ∈ O( p). We discuss several loss functions based on the Hilbert-Schmidt distance and the excess risk in the reconstruction error.

4.1 PCA and the Subspace Distance In this section, we consider the statistical model given in (1) (PU )U ∈O( p) ,

PU = N(0, U U T )⊗n ,

with  = diag(λ1 , . . . , λ p ) and λ1 ≥ · · · ≥ λ p > 0. The following theorem proved in Sect. 5 applies Proposition 1 to the above model, derived parameter P≤d , and the squared Hilbert-Schmidt loss (cf. Sect. 5.1 below). Theorem 3 Consider the statistical model (1). Then, for each δ > 0, we have  inf Pˆ

O( p)

EU  Pˆ − P≤d (U )22 dU ≥ Iδ

ˆ 1 , . . . , X n ) and with infimum taken over all R p× p -valued estimators Pˆ = P(X Iδ =

 1 λi λ j max xi j : 0 ≤ xi j ≤ n2 (λi −λ for all i ≤ d, j > d, 2 j) 1 + 2δ i≤d j>d  xi j ≤ δ for all j > d, i≤d



xi j ≤ δ for all i ≤ d .

j>d

Remark 3 We write i ≤ d for i ∈ {1, . . . , d} and j > d for j ∈ {d + 1, . . . , p}. Remark 4 A (non-square) matrix (xi j ) is called doubly substochastic (cf. [2, Sect. 2]) if

310

M. Wahl

xi j ≥ 0  xi j ≤ 1

for all i, j, for all j,

i



xi j ≤ 1

for all i.

j

Hence, choosing δ = 1, Theorem 3 holds with I1 =

 1 xi j : (xi j ) doubly substochastic with max 3 i≤d j>d xi j ≤

2 λi λ j n (λi −λ j )2

for all i ≤ d, j > d .

Remark 5 That doubly substochastic matrices play a central role is no coincidence. Such a structure also appears in the upper bounds for the principal subspaces of the empirical covariance operator (see, e.g., [21]). To explain this, let X 1 , . . . , X n be independent Gaussian  random variables with expectation zero and covariance n ˆ = n −1 i=1 X i X iT be the empirical covariance matrix. Moreover, matrix and let ˆ and let let λ1 ≥ · · · ≥ λ p (resp. λˆ 1 ≥ · · · ≥ λˆ p ) be the eigenvalues of (resp. ) ˆ u 1 , . . . , u p (resp. uˆ 1 , . . . , uˆ p ) be the corresponding eigenvectors of (resp. ).   T T Then, for P≤d = i≤d u i u i and Pˆ≤d = i≤d uˆ i uˆ i , we have (cf. [15, 21])  Pˆ≤d − P≤d 22 = 2



xi j with xi j = u i , uˆ j 2 .

i≤d j>d

There are two completely different possibilities to bound this squared HilbertSchmidt distance. First, by Bessel’s inequality, we always have the trivial bounds  i≤d

xi j ≤ 1 and



xi j ≤ 1.

j>d

On the other hand, using perturbative methods, we have the central limit theorem  d nxi j = nu i , uˆ j 2 → N 0,

λi λ j  , (λi − λ j )2

(provided that λi = λ j are simple eigenvalues, see, e.g., [1, 14] for a non-asymptotic version of this result). Hence, the lower bound in Theorem 3 can be interpreted as the fact that we can not do essentially better than the best mixture of trivial and perturbative bounds.

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

311

Remark 6 A simple and canonical choice of the xi j in Theorem 3 is given by xi j = min

λi λ j 1 , , n (λi − λ j )2 p

2

in which case we rediscover the lower bound [24, Theorem 1]. Yet, let us point out that the result in Theorem 2 is stronger in the √ sense that it allows for priors that are highly concentrated around I p (cf. h of size n), while Theorem 3 provides a lower bound for the uniform prior. Remark 7 In general it seems difficult to find a simple closed form expression for the lower bound in Theorem 3. One exception is given by the case d = 1, in which case we have Iδ =

 2  λ λ 1 1 j , δ . min 1 + 2δ n j>1 (λ1 − λ j )2

Remark 8 Using decision-theoretic arguments, Theorem 3 can be extended to random variables with values in a Hilbert space (see [24, Sect. 1.4] for the details).

4.2 PCA and the Excess Risk Theorem 3 provides a lower bound for the squared Hilbert-Schmidt distance  Pˆ − ˆ ˆ P≤d (U )22 . If the estimator √ P is itself an orthogonal projection of rank d, then  P − P≤d (U )22 is equal to 2 times the Euclidean norm of the sines of the canonical angles between the corresponding subspaces (see, e.g., [2, Chap. VII.1]). This socalled sin distance is a well-studied distance in linear algebra, numerical analysis and statistics (see, e.g., [2, 13, 26]). In the context of statistical learning, another important loss function arises if one introduces PCA as an empirical risk minimization problem with respect to the reconstruction error (see, e.g. [3, 18, 21]). For 1 ≤ d ≤ p, let Pd be the set of all orthogonal projections P : R p → R p of rank d. Consider the statistical model defined by (1). Then the reconstruction error is defined by RU (P) = EU X − P X 2 ,

P ∈ Pd , U ∈ O( p),

where X denotes a random variable with distribution N(0, U U T ) under EU (independent of X 1 , . . . , X n ), and it is easy to see that P≤d (U ) ∈ arg min P∈Pd RU (P). Hence, the performance of Pˆ ∈ Pd can be measured by its excess risk defined by

312

M. Wahl

ˆ = RU ( P) ˆ − min RU (P) = RU ( P) ˆ − RU (P≤d (U )). EU ( P) P∈Pd

(9)

ˆ can be written in the form lw for some suitable In Sect. 5.3, we show that EU ( P) choices for A, v and w, and Proposition 1 yields the following theorem. Theorem 4 Consider the statistical model (1) with the excess risk loss function from (9). Then, for any μ ∈ [λd+1 , λd ], we have 

ˆ dU ≥ Jμ EU EU ( P)

inf Pˆ

O( p)

ˆ 1 , . . . , X n ) and with infimum taken over all Pd -valued estimators Pˆ = P(X Jμ =

 1 λi λ j xi j : 0 ≤ xi j ≤ n1 λi −λ for all i ≤ d, j > d, max j 3 i≤d j>d  xi j ≤ μ − λ j for all j > d, i≤d



xi j ≤ λi − μ for all i ≤ d .

j>d

Remark 9 The lower bound is similar to the mixture bounds established in [21]. In particular, as in the case of the squared Hilbert-Schmidt distance, the term n −1 λi λ j /(λi − λ j ) corresponds to the size of certain weighted projector norms, while the other two constrains correspond to trivial bounds. Hence, our lower bound strengthens the reciprocal dependence of the excess risk on spectral gaps (the excess risk might be small in both cases, small and large gaps). An important special case is given when λd > λd+1 and the last two restrictions in Jμ are satisfied for xi j = n −1 λi λ j /(λi − λ j ). In particular, letting μ = (λd + λd+1 )/2, they are satisfied if and only if λd+1  λi λd  λ j ≤ n and ≤ n, μ − λd+1 i≤d λi − λd+1 λd − μ j>d λd − λ j as can be seen from a monotonicity argument. A simple modification leads to the following corollary. Corollary 1 We have  inf Pˆ

provided that

  λi λ j ˆ dU ≥ 1 EU EU ( P) , 3n i≤d j>d λi − λ j O( p)

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

  λj  n λd λi ≤ . + λd − λd+1 i≤d λi − λd+1 λ − λj 2 j>d d

313

(10)

Remark 10 Condition (10) is the main condition of [21] under which perturbation bounds for the empirical covariance operator are developed (cf. [21, Remark 3.15]). Example 1 If for some α > 0, we have λ j = e−α j , j = 1, . . . , p, then there are constants c1 , c2 > 0 depending only on α such that 

ˆ dU ≥ c1 EU EU ( P)

inf Pˆ

O( p)

de−αd , provided that d ≤ c2 n n

and 2d ≤ p. A matching upper bound can be found in [21, Sect. 2.4]. Example 2 If for some α > 0, we have λ j = j −α−1 , j = 1, . . . , p, then there are constants c1 , c2 > 0 depending only on α such that 

ˆ dU ≥ c1 d EU EU ( P)

inf Pˆ

O( p)

2−α

n

, provided that d 2 log d ≤ c2 n

and 2d ≤ p. A matching upper bound (under more restrictive conditions on d) can be found in [21, Sect. 2.4]. The condition on d can be further relaxed by using [15].

4.3 Low-Rank Matrix Denoising For a diagonal matrix  = diag(λ1 , . . . , λ p ) with λ1 ≥ · · · ≥ λ p ≥ 0, consider the family of probability measures (PU )U ∈O( p) with PU being the distribution of X = U U T + σ W, where σ > 0 and W = (Wi j )1≤i, j≤ p is drawn from the GOE ensemble, that is a symmetric random matrix whose upper triangular entries are independent zero mean Gaussian random variables with EWi2j = 1 for 1 ≤ i < j ≤ p and EWii2 = 2 for i = 1, . . . , p. Alternatively, this model can be defined on X = R p( p+1)/2 with (PU )U ∈O( p) ,

PU = N(vech(U U T ), σ 2 W ),

(11)

where symmetric matrices a ∈ R p× p are transformed into vectors using vech(a) = (a11 , a21 , . . . , a p1 , a22 , a32 . . . , a p2 , . . . , a pp ) ∈ R p( p+1)/2 , and W is the covariance matrix of vech(W ). The following theorem is the analogue of Theorem 3.

314

M. Wahl

Theorem 5 Consider the statistical model (11). Then for each δ > 0, we have  inf Pˆ

O( p)

EU  Pˆ − P≤d (U )22 dU ≥ Iδ

ˆ 1 , . . . , X n ) and with infimum taken over all R p× p -valued estimators Pˆ = P(X Iδ =

 1 2 max xi j : 0 ≤ xi j ≤ (λi2σ for all i ≤ d, j > d, −λ j )2 1 + 2δ i≤d j>d  xi j ≤ δ for all j > d, i≤d



xi j ≤ δ for all i ≤ d .

j>d

Example 3 Suppose that rank() = d ≤ p − d. Then, setting xi j = min

σ2 λi2

,

 σ 2 ( p − d)  1  1 , we get I1 ≥ min ,1 . p−d 3 i≤d λi2

Ignoring the minimum with 1, I1 gives the size of a first order perturbation expansion for X = U U T + σ W (see [15] for more details and a corresponding upper bound).

5 Proofs for Sect. 4 In this section we show how Theorems 3–5 can be obtained by an application of Proposition 1.

5.1 Specialization to Principal Subspaces We start with specializing Proposition 1 in the case where ψ(U ) = P≤d (U ) =



u i u iT = U

i≤d



ei eiT U T ,

U ∈ O( p),

i≤d

where 1 ≤ d ≤ p is a natural number. Here u i = U ei is the i-th column of U and e1 , . . . , e p denotes the standard basis in R p . We consider A = R p× p endowed with the trace inner product a, b2 = tr(a T b), a, b ∈ R p× p and choose v : O( p) → R p× p ,

v(U ) = (u k u lT )1≤k,l≤ p .

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

315

p× p

Hence, for w ∈ R>0 , we consider the loss function defined by lw (U, a) =

p p  

wkl u k u lT , a − P≤d (U )22 ,

U ∈ O( p), a ∈ R p× p .

k=1 l=1

In particular, if wkl = 1 for all k, l, then lw (U, a) = a − P≤d (U )22 = a − P≤d (U ), a − P≤d (U )2 is the squared Hilbert-Schmidt (or Frobenius) distance. Note that, in contrast to Sect. 2, we consider a double index in this section. We equip A with the group action given by conjugation U · a = U aU T , a ∈ R p× p , U ∈ O( p). Using this definition it is easy to see that (A2), (A3), (A4) and (A5) are satisfied. Moreover, the following lemma computes the derivatives in (A6) and (A7) in this case. Lemma 2 For ξ ∈ so( p), we have   (i) d P≤d (I p )ξ = ξ i≤d ei eiT − i≤d ei eiT ξ , (ii) dvi j (I p )ξ = ξ ei e Tj − ei e Tj ξ . In particular, for i = j and L (i j) = ei e Tj − e j eiT ∈ so( p), we have (i) d P≤d (I p )L (i j) = −d P≤d (I p )L ( ji) = −ei e Tj − e j eiT if i ≤ d and j > d, (ii) dvi j (I p )L (i j) = ei eiT − e j e Tj . Remark 11 We have d P≤d (I p )L (kl) = 0 if k, l ≤ d or k, l > d. Proof For U ∈ S O( p) and ξ ∈ so( p), we have d P≤d (I p )ξ = f  (0) with f : p× p , t → i≤d (exp(tξ )ei )(exp(tξ )ei )T . Hence, using (d/dt) exp(tξ ) = R→R ξ exp(tξ ), (i) follows. Claim (ii) can be shown analogously and (iii) and (iv) fol low from inserting ξ = L (i j) into (i) and (ii), respectively. Corollary 2 Consider the above setting with ψ = P≤d . Suppose that (A1) holds and that there is a bilinear form I : so( p) × so( p) → R such that lim

1

t→0 t 2

χ 2 (Pexp(tξ ) , P I p ) = I(ξ, ξ )

for all ξ ∈ so( p).

(12)

ˆ ) with values in A = R p× p and every z i j ∈ R, Then, for every estimator Pˆ = P(X i ≤ d, j > d, we have 

ˆ )) dU EU lw (U, P(X O( p)

≥  i≤d j>d

(13) 

2 zi j

i≤d j>d 2 (wi j + w ji )−1 ai−1 j zi j +

 i≤d

wii−1

 j>d

2 zi j

+

 j>d

w−1 jj

 i≤d

2 , zi j

316

M. Wahl

(i j) where ai−1 , L (i j) ) and L (i j) = ei e Tj − e j eiT , i ≤ d, j > d. j = I(L

Remark 12 The assumption that w has strictly  positive entries2 can be dropped. To ˆ ) be an estimator with ˆ see this, let Pˆ = P(X O( p) EU  P(X )2 dU < ∞. Then the left-hand side of (13) is continuous in w (by the dominated convergence theorem), while the right-hand side is non-increasing in all coordinates of w. Hence, by a p× p limiting argument, (13) also hold for w ∈ R≥0 , provided that the latter condition on Pˆ holds and using the (standard) convention ∞ · 0 = 0. Proof We choose ξ (i j) = yi j L (i j) and ξ ( ji) = −y ji L ( ji) = y ji L (i j) for i ≤ d and j > d and we set ξ (lk) = 0 in all other cases. Then, by Lemma 2, the sum in the numerator of Proposition 1 is equal to   vi j (I p ), d P≤d (I p )ξ (i j) 2 + v ji (I p ), d P≤d (I p )ξ ( ji) 2 i≤d j>d

=



j>d i≤d

yi j ei e Tj , −ei e Tj



e j eiT 2

+ y ji e j eiT , −ei e Tj − e j eiT 2



i≤d j>d

=−



(yi j + y ji ).

i≤d j>d

On the other hand, for 1 ≤ k, l ≤ p, the term in the squared brackets in the denominator is equal to  i≤d j>d

vkl (I p ), dvi j (I p )ξ (i j) 2 +

 vkl (I p ), dv ji (I p )ξ ( ji) 2 j>d i≤d

 = (yi j ek elT , ei eiT − e j e Tj 2 + y ji ek elT , ei eiT − e j e Tj 2 ) i≤d j>d

and the latter is equal to ⎧ (yk j + y jk ), k = l ≤ d, ⎪ ⎪ ⎪ ⎪ ⎨ j>d − (yik + yki ), k = l > d, ⎪ ⎪ ⎪ i≤d ⎪ ⎩ 0, else. Hence the second term in the denominator is equal to  i≤d

wii−1

 j>d

(yi j + y ji )

2

+

 j>d

w−1 jj

 i≤d

(yi j + y ji )

2

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

317

Finally, the Fisher information term is equal to     −1 −1 2 (i j) ( ji) 2 wi−1 , ξ (i j) ) + w−1 , ξ ( ji) ) = ai j (wi j yi j + w−1 j I(ξ ji I(ξ ji y ji ). i≤d j>d

i≤d j>d

Plugging all these formulas into Proposition 1, we get that, for every yi j , y ji ∈ R, i ≤ d, j > d, the left-hand side in (13) is lower bounded by   

2 (yi j + y ji )

i≤d j>d

−1 2 −1 2 ai−1 j (wi j yi j + w ji y ji ) +

i≤d j>d



−1 wii



i≤d

2  2 .  (yi j + y ji ) + w−1 (yi j + y ji ) jj

j>d

j>d

i≤d

For a, b > 0 and z ∈ R, it is easy to see that minimizing a −1 x 2 + b−1 y 2 subject to x + y = z leads to the value (a + b)−1 z 2 . Applying this with a = wi j ai j , b = w ji ai j and z = z i j , the claim follows. 

5.2 A Simple Optimization Problem We now consider the optimization problem  max   z i j ∈R i≤d, j>d i≤d j>d

2 bi−1 j zi j

+

 i≤d

2 zi j

i≤d j>d

wii−1

 j>d

2 zi j

+

 j>d

w−1 jj



2 ,

(14)

zi j

i≤d

where wii , w j j ∈ [0, ∞) and bi j ∈ [0, ∞] for i ≤ d and j > d (in applications we −1 −1 have bi−1 j = (wi j + w ji ) ai j ), and we use the convention 0 · ∞ = 0. If d = 1, then a solution to (14) can be given explicitly. Lemma 3 Suppose that d = 1 and that w11 = w j j = 1 and b1 j ∈ (0, ∞] for all j > 1. Then a solution of (14) is given by z 1 j = (1 + b1−1j )−1 , leading to the maximum 

(1 + b1−1j )−1

   1  1 ≥ min min(b1 j , 1), 1 = min b1 j , 1 . (15)  4 4 1+ (1 + b1−1j )−1 j>1 j>1 j>1

j>1

Proof Obviously, the choices z 1 j = (1 + b1−1j )−1 lead to the expression on the lefthand side of (15). The value defined through (14) is also upper bounded by the left-hand side of (15), as can be seen by inserting the (Cauchy-Schwarz) inequal-

318

M. Wahl

   ity ( j>1 (1 + b1−1j )−1 )−1 ( j>1 z 1 j )2 ≤ j>1 (1 + b1−1j )z 12 j into (14). Finally, the inequality in (15) follows from inserting x/(1 + x) ≥ (1/2) min(x, 1), x ≥ 0.  In general, it seems more difficult to give an explicit formula for (14) using only bi j and ∧. Yet, the following lower bound is sufficient for our purposes. In the special case of Lemma 3, it gives the second bound in (15). Lemma 4 For each δ > 0, the value defined through (14) is lower bounded by maximize

1  xi j 1 + 2δ i≤d j>d

subject to 0 ≤ xi j ≤ bi j  xi j ≤ δw j j

for all i ≤ d, j > d,

(16)

for all j > d,

i≤d



xi j ≤ δwii

for all i ≤ d.

j>d

Proof Let z i j = xi j be real values satisfying the constraints in (16). Then we have  i≤d j>d

2 bi−1 j zi j +



wii−1



i≤d

j>d

2 zi j

+



w−1 jj



j>d

2 zi j

≤ (1 + 2δ)

i≤d



zi j .

i≤d j>d



Inserting this into (14), the claim follows

Remark 13 If bi j = ∞, then the first constraint in (16) can be written as 0 ≤ xi j < ∞.

5.3 End of Proofs of the Consequences Proof (Proof of Theorem 3) By [24, Lemma 1], Condition (12) is satisfied with I(ξ, ξ ) =

p n  2 (λi − λ j )2 ξ , 2 i, j=1 i j λi λ j

ξ ∈ so( p).

n R p , the statistical model in (1) Moreover, letting O( p) act coordinate-wise on i=1 satisfies (A1) (cf. [24, Example 1]). Hence, applying Corollary 2 with wkl = 1 for −1 −1 −1 (i j) , L (i j) ) = all k, l, the claim follows from Lemma 4 with bi−1 j = 2 ai j = 2 I(L  n(λi − λ j )2 /(2λi λ j ) ∈ [0, ∞). Proof (Proof of Theorem 4) The main remaining point is to show that the excess risk p× p ˆ Pˆ ∈ Pd , is of the form lw (U, P) ˆ for some w ∈ R≥0 EU ( P), . This can be deduced from [21, Lemma 2.6]. 

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

319

Lemma 5 For Pˆ ∈ Pd and μ ∈ [λd+1 , λd ], we have ˆ = EU ( P)

p p  

ˆ wkl u k u lT , Pˆ − P≤d (U )22 = lw (U, P)

k=1 l=1

with wkl = λk − μ for k ≤ d, l ≥ 1 and wkl = μ − λk for k > d, l ≥ 1. Proof For brevity we write P≤d = P≤d (U ) and Pk = Pk (U ) = u k u kT . By [21, Lemma 2.6], we have ˆ = EU ( P)

  ˆ 22 + ˆ 22 . (λk − μ)Pk (I − P) (μ − λk )Pk P k≤d

k>d

Inserting ˆ 22 = Pk (P≤d − P) ˆ 22 , Pk (I − P)

k ≤ d,

ˆ 22 , ˆ 22 = Pk ( Pˆ − P≤d )22 = Pk (P≤d − P) Pk P

k > d,

we obtain ˆ = EU ( P)

  ˆ 22 + ˆ 22 (λk − μ)Pk (P≤d − P) (μ − λk )Pk (P≤d − P) k≤d

k>d

p p   ˆ l 22 + ˆ l 22 , = (λk − μ)Pk (P≤d − P)P (μ − λk )Pk (P≤d − P)P k≤d l=1

k>d l=1

and the claim follows from inserting the identity Pk a Pl 22 = u k u lT , a22 , a ∈  R p× p . Applying Lemma 5, we get 

ˆ dU = inf EU EU ( P)

inf

ˆ d P∈P

ˆ d P∈P

O( p)



ˆ dU, EU lw (U, P) O( p)

where the infimum is over all estimators Pˆ with values in Pd ⊆ R p× p . Hence, applyp× p ing Corollary 2 and Remark 12 with w = (wkl ) ∈ R≥0 from Lemma 5, the claim follows from Lemma 4 with −1 −1 −1 (i j) , L (i j) ) bi−1 j = (wi j + w ji ) ai j = (λi − μ + (μ − λ j )) I(L

=

n(λi − λ j ) ∈ [0, ∞). λi λ j

320

M. Wahl

Proof (Proof of Theorem 5) By [24, Lemma 3], Condition (12) is satisfied with I(ξ, ξ ) =

p p 1  2 ξ (λi − λ j )2 , 2σ 2 i=1 j=1 i j

ξ ∈ so( p). d

Moreover, let O( p) act on R p× p by conjugation. Since V W V T = W , V ∈ O( p), d we have V X V T = V U X (V U )T + σ W , meaning that the statistical model in (11) satisfies (A1) (cf. [24, Example 2]). Hence, applying Corollary 2 with wkl = 1, −1 −1 −1 (i j) , L (i j) ) = (λi − the claim follows from Lemma 4 with bi−1 j = 2 ai j = 2 I(L 2 2  λ j ) /(2σ ) ∈ [0, ∞). Acknowledgements This research has been supported by a Feodor Lynen Fellowship of the Alexander von Humboldt Foundation. The author would like to thank two anonymous referees for their helpful comments and remarks.

References 1. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis, 2nd edn. Wiley, New York (1984) 2. Bhatia, R.: Matrix Analysis. Springer, New York (1997) 3. Blanchard, G., Bousquet, O., Zwald, L.: Statistical properties of kernel principal component analysis. Mach. Learn. 66, 259–294 (2007) 4. Cai, T., Li, H., Ma, R.: Optimal structured principal subspace estimation: metric entropy and minimax rates. J. Mach. Learn. Res. 22(Paper No. 46), 45 (2021) 5. Cai, T.T., Ma, Z., Wu, Y.: Sparse PCA: optimal rates and adaptive estimation. Ann. Statist. 41(6), 3074–3110 (2013) 6. Cai, T.T., Zhang, A.: Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. Ann. Statist. 46(1), 60–89 (2018) 7. Dauxois, J., Pousse, A., Romain, Y.: Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. J. Multivar. Anal. 12(1), 136–154 (1982) 8. Eaton, M.L.: Group invariance applications in statistics. NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 1. Institute of Mathematical Statistics, Hayward, CA; American Statistical Association, Alexandria, VA, (1989) 9. Eaton, M.L.: Multivariate Statistics: A Vector Space Approach. Institute of Mathematical Statistics, Beachwood, OH (2007). (Reprint of the 1983 original) 10. Gao, C., Ma, Z., Ren, Z., Zhou, H.H.: Minimax estimation in sparse canonical correlation analysis. Ann. Statist. 43(5), 2168–2197 (2015) 11. Gill, R.D., Levit, B.Y.: Applications of the Van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli 1(1–2), 59–79 (1995) 12. Hájek, J.: Local asymptotic minimax and admissibility in estimation. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability. Theory of Statistics, vol. I, pp. 175–194 (1972) 13. Ipsen, I.C.F.: An overview of relative sin θ theorems for invariant subspaces of complex matrices. J. Comput. Appl. Math. 123, 131–153 (2000) 14. Jirak, M., Wahl, M.: Relative perturbation bounds with applications to empirical covariance operators. Adv. Math. 412, Paper No. 108808, 59pp. (2023)

Van Trees Inequality, Group Equivariance, and Estimation of Principal Subspaces

321

15. Jirak, M., Wahl, M.: Perturbation bounds for eigenspaces under a relative gap condition. Proc. Am. Math. Soc. 148(2), 479–494 (2020) 16. Koltchinskii, V., Lounici, K.: Normal approximation and concentration of spectral projectors of sample covariance. Ann. Statist. 45(1), 121–157 (2017) 17. Ma, Z., Li, X.: Subspace perspective on canonical correlation analysis: dimension reduction and minimax rates. Bernoulli 26(1), 432–470 (2020) 18. Milbradt, C., Wahl, M.: High-probability bounds for the reconstruction error of PCA. Statist. Probab. Lett. 161, 108741, 6 (2020) 19. Naumov, A., Spokoiny, V., Ulyanov, V.: Bootstrap confidence sets for spectral projectors of sample covariance. Probab. Theory Relat. Fields 174(3–4), 1091–1132 (2019) 20. Pajor, A.: Metric entropy of the Grassmann manifold. In: Convex Geometric Analysis (Berkeley, CA, 1996). Mathematical Sciences Research Institute Publications, vol. 34, pp. 181–188. Cambridge University Press, Cambridge (1999) 21. Reiss, M., Wahl, M.: Nonasymptotic upper bounds for the reconstruction error of PCA. Ann. Statist. 48(2), 1098–1123 (2020) 22. Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer, New York (2009). (Revised and extended from the 2004 French original) 23. Vu, V.Q., Lei, J.: Minimax sparse principal subspace estimation in high dimensions. Ann. Statist. 41(6), 2905–2947 (2013) 24. Wahl, M.: Lower bounds for invariant statistical models with applications to principal component analysis. Ann. Inst. Henri Poincaré Probab. Stat. 58(3), 1565–1589 (2022) 25. Yang, Y., Barron, A.: Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27(5), 1564–1599 (1999) 26. Yu, Y., Wang, T., Samworth, R.J.: A useful variant of the Davis-Kahan theorem for statisticians. Biometrika 102, 315–323 (2015)

Sparse Constrained Projection Approximation Subspace Tracking Denis Belomestny

and Ekaterina Krymova

Abstract In this paper we revisit the well-known constrained (orthogonal) projection approximation subspace tracking algorithm and derive, for the first time, non-asymptotic error bounds. Furthermore, we introduce a novel sparse constrained projection approximation subspace tracking algorithm, which is able to exploit sparsity in the underlying covariance structure. We present a non-asymptotic analysis of the proposed algorithm and study its empirical performance on simulated and real data. Keywords Subspace tracking · Sparse eigenvectors · Multidimensional signal · Orthogonal iterations · Spike model

1 Introduction Subspace tracking methods are intensively used in statistical and signal processing community. Given observations of n-dimensional noise-corrupted signal, one is interested in real-time estimating (tracking) a subspace spanning the eigenvectors corresponding to the first d largest eigenvalues of the signal covariance matrix. Over the past few decades a wide variety of subspace tracking methods was developed, which found applications in data compression, filtering, speech enhancement, etc. (see also [3, 12] and references therein). Many methods are based on the adaptive variants of orthogonal iteration method [14], where the covariance matrix is subD. Belomestny Faculty of Mathematics, University of Duisburg-Essen, Thea-Leymann-Straße 9, 45127 Essen, Germany e-mail: [email protected] URL: https://www.uni-due.de/ hm0124/ E. Krymova (B) EPF Lausanne & ETH Zürich, Swiss Data Science Center, INN, Station 14, 1015 Lausanne, Switzerland e-mail: [email protected] URL: http://ekkrym.github.io © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_9

323

324

D. Belomestny and E. Krymova

stituted by its empirical analogue. These methods include Gram-Schmidt or QR orthogonalisation and have a computational complexity O(nd 2 ) in each time step [16, 32, 37]. Fast subspace tracking methods have lower computational complexity O(nd). The gain in computation comlexity may be achieved by relaxation of orthonormality condition, e.g. PAST [34], FAST [4], or, for example, by replacing the orthogonalisation step with approximate analogues, e.g. LORAF [31], Oja scheme [26], FDPM [12], FRANS [5]. The methods, which enforce orthogonality and have higher computationally complexity, typically demonstrate much faster convergence rates [3]. Fast subspace tracking methods with the computational complexity O(nd) were proposed in [1, 2, 6]. Throughout the paper we refer to a class of an orthogonal subspace tracking method methods as constraint projection approximation subspace tracking method (CPAST) with complexity O(n 2 d) (NP1 method and (4) in [16]) for the derivation of a novel sparse method. Fast implementation of CPAST [2, 16], orthogonal PAST (OPAST) [2] and has computational complexity O(nd). Despite popularity of the subspace tracking methods, only few non-asymptotic results are known about their convergence. The asymptotic convergence of the PAST algorithm was first established in [35, 36] using a general theory of stability for ordinary differential equations. However, no finite sample error bounds are available in the literature. Furthermore, in the case of a high-dimensional signal the empirical covariance matrix estimator performs poorly if the number of observations is small. A common way to improve the estimation quality in this case is to impose sparsity assumptions on the signal itself or on the eigensubspace of the underlying covariance matrix. As a motivating example, consider a problem of transmission of a musical signal though hearing aids device. The structure of the signal is often investigated in the time-spectral domain [24]. Typically, the spectral analysis is applied to subsequently windowed signal frames, which gives local information about frequency distribution in time. The musical notes are known to have logarithmically spaced frequencies [19]. To capture this property wavelet-type CQT-transform is typically used [8]. The window sizes in CQT-transform are properly adjusted to obtain better resolution for different frequencies. In the time-spectral domain, musical signals demonstrate local in time sparse spectral structure. For example, consider the spectrogram matrix of CQT-transform of Bach Siciliano for Oboe and Piano in Fig. 1, where x-axis corresponds to the number of the window frame. For each frame number in x-axis, values in y-axis form a vector with the amplitudes in CQT decomposition. One can observe that a musical signal exhibits time-varying spectral structure with local regions of stationarity and small number of frequencies corresponding to large amplitudes (sparsity). We aim to formalize and to exploit the sparsity property of the coefficients of wavelet-type transforms in the time and frequency domain assuming that the transform is performed in online fashion, in assumption of stationary time-series. The time of transition between the intervals of homogeneity could be estimated by the change point detection procedures, e.g. from [29]. Empirical studies suggest that it is crucial to estimate several eigenvectors (d > 1) to preserve good reconstruction quality [25]. 
Furthermore, the orthogonality of the eigenvector estimates is important for the musical signal processing as otherwise the result gets severely distorted. In [23] a sparse modification of the orthogonal iteration

Sparse Constrained Projection …

325

Fig. 1 CQT-Spectrogram of Bach Siciliano for Oboe and Piano. The deep blue color corresponds the transform amplitudes values close to zero, the red color corresponds to the higher values of amplitudes

algorithm was proposed in the case where all observations are available. A thorough analysis in [23] shows that under appropriate sparsity assumptions on the leading eigenvectors, the orthogonal iteration algorithm combined with thresholding allows to perform eigensubspace estimation in high-dimensional setting. In [38] authors proposed an algorithm tailored for streaming sparse principal component analysis when observations are arriving in blocks (see also references therein for sparse and non-sparse principal components analysis methods). As a parameter of the method one should set the sparsity parameter of a solution, where the sparsity is controlled by a number of non-zero components of an eigenvector. In [21, 22] sparse subspace tracking methods based on OPAST and FAPI solutions were proposed, where the sparsity constraint is imposed on the orthonormal filtering matrix, with complexity for orthonormal version O(nd 2 ) + O(d 3 ) and linear complexity for the approximate non-orthogonal version. Sparse subspace tracking method l1 -PAST [39] has slightly less computational complexity O(nd 2 ) + O(d 2 ), but is lacking orthogonality of the solution at each step. Based on CPAST method, we propose a novel procedure called sparse constraint projection approximation subspace tracking method (SCPAST), which can be used for efficient subspace tracking in the case of high-dimensional signal with sparse structure. Computational complexity of the proposed method is O(nds) + O(sd 2 ), where s is related to the sparsity of the underlying eigensubspace. Therefore in the sparse case SCPAST method has better complexity than of the sparse subspace tracking methods and preserves orthogonality along with exploiting the structure of the data. Another contribution of our paper is a non-asymptotic convergence analysis of CPAST and SCPAST algorithms showing the advantage of the SCPAST algorithm in the case of sparse covariance structure. Note that the faster equivalent of CPAST, OPAST, has the same convergence rate as CPAST. Last but not the least, we analyze numerical performance of SCPAST algorithm on simulated and real

326

D. Belomestny and E. Krymova

data. In particular, the problem of tracking the leading subspace of a music signal is considered. The structure of the paper is as follows. In Sect. 1.1 we consider a general subspace tracking problem. In Sect. 2 we review the CPAST algorithm and propose SCPAST method for the sparse setting. Error bounds for CPAST (OPAST) and SCPAST are presented in Sect. 3. Section 4 contains a numerical study of the proposed algorithm on synthetic and real data. Outlines of the proofs are collected in Sect. 5. Proofs of the intermediate results in Sect. 3 are given in Appendix A.

1.1 Main Setup We consider first a general problem of dominant subspace adaptive estimation given incoming noisy observations x(t) ∈ Rn : x(t) = s(t) + σ(t)ξ(t), t = 1, . . . , T,

(1)

where a signal s(t) ∈ Rn is corrupted by a vector ξ(t) ∈ Rn with independent standard Gaussian components. The signal s(t) is modeled as s(t) = A(t)η(t), where A(t) is a deterministic n × d matrix of rank d with d  n and η(t) is a random vector in Rd independent of ξ(t), such that E[η(t)] = 0, E[ηi2 (t)] = 1, i = 1, . . . , d. For simplicity assume that d is known. Under these assumptions, the process x(t) has a covariance matrix Σ(t) which may be decomposed in the following way Σ(t) = E[x(t)x  (t)] = A(t)A (t) + σ 2 (t)In ,

(2)

 where In stands for the unit matrix in Rn . Note that the dmatrix A(t)A (t) has the  rank d and by the spectral decomposition A(t)A (t) = i=1 λi (t)vi (t)vi (t), where vi (t) ∈ Rn , i = 1, . . . , d, are the eigenvectors of A(t)A (t) corresponding to the eigenvalues λ1 (t) ≥ λ2 (t) ≥ · · · ≥ λd (t) > 0. It follows from (2) that the first d eigenvalues of Σ(t) are λ1 (t) + σ 2 (t), . . . , λd (t) + σ 2 (t), whereas the remaining n − d eigenvalues are equal to σ 2 (t). Since λd (t) > 0, the subspace of the first d dominant eigenvectors is identifiable, which means that one can separate the subspace corresponding to the first d eigenvectors of A(t)A (t) from the subspace of the rest of the eigenvectors. The subspace tracking methods aim to estimate the subspace span(v1 (t), . . . , vd (t)) based on the observations (x(k))tk=1 . The overall number of observations T is assumed to be fixed and known.

Sparse Constrained Projection …

327

2 Methods Many of subspace tracking methods rely on a heuristic assumption of slowly varying in time Σ(t), which leads to an idea of discounting the past observations in the covariance estimator, for example, as follows γ (t) = Σ

t 

γ t−i x(i)x  (i),

(3)

i=0

where 0 < γ ≤ 1 is the so-called forgetting factor. In the stationary regime, that is, if Σ(t) is a constant matrix, one would use γ = 1. We mainly focus the theoretical analysis of the subspace tracking methods on the stationary case, that is, assuming: Σ(t) = Σ, A(t) = A, vi (t) = vi , λi (t) = λi , i = 1, . . . , d, σ 2 (t) = σ 2 . In this situation one would like to keep all the observations to estimate V , therefore to use the estimator (3) for Σ with γ = 1. For notational simplicity, from now skip the dependence on γ for the station  = ary case and use the notation Σ(t) for the empirical covariance matrix Σ(t) 1 t   x(i)x (i) and V (t) for the estimators of the matrix containing the domii=1 t nant eigenvectors. We assume that the random vectors η(t) and ξ(t) have independent N (0, 1) components for t = 1, . . . , T . Under these assumptions the covariance matrix (2) becomes Σ=

d 

λi vi vi + σ 2 In = V Λd V  + σ 2 In ,

(4)

i=1 d where V is n × d matrix with columns {vi }i=1 , Λd is d × d diagonal matrix with d {λi }i=1 on the diagonal. Note that the observational model (1) in stationary case can be alternatively written as a spike model

x(t) =

d  

λi u i (t)vi + σξ(t),

(5)

i=1

where u i (t) are i.i.d. standard Gaussian random variables independent from ξ(t). We assume that σ 2 and λi , i = 1, . . . , d are known. In such case we can always normalize the data; therefore without loss of generality we put σ 2 = 1. We stress that our goal is a recovery of a d-dimensional subspace of the first eigenvectors, therefore we are not interested in the estimation of each particular eigenvector and we need a condition for the separation of the d-dimensional subspace τ λd ≥ λ1 ,

(6)

328

D. Belomestny and E. Krymova

that is, the gap between λd and λd+1 = 0 is sufficiently large. Define a distance l between two subspaces W and Q spanning orthonormal columns w1 , . . . , wd and q1 , . . . , qd correspondingly via l(W, Q) = l(W, Q) = W W  − Q Q  2 ,

(7)

where W = {w1 , . . . , wd } and Q = {q1 , . . . , qd } are matrices in Rn×d with orthonormal columns, the spectral norm A of a matrix A ∈ Rn×d is defined as A = 2 . Throughout the paper we use notations p ∨ q = max( p, q), supx∈Rd ,x=0 Ax x2 p ∧ q = min( p, q).

2.1 CPAST For the general model (1) and non-stationary case, constrained projection approximation subspace tracking (CPAST) method allows to iteratively compute a matrix γ (t), t = 1, . . . , T , containing the estimators of the first d leading eigenvectors of V γ (t) (see (3)) based on sequentially arriving x( j), j = 1, . . . , t. The the matrix Σ 0 and consists of the γ (0) = V procedure starts with some initial approximation V following two steps γ (t − 1); γ,V (t) = Σ γ (t)V – multiplication: Σ  γ (t) = Σ γ,V (t)[Σ γ,V γ,V (t)]−1/2 . (t)Σ – orthogonalisation: V In the “stationary” case (γ = 1) the method may be regarded as the “online”– version of the orthogonal iterations scheme [14] for computing the eigensubspace of the non-negatively definite matrix. First multiplication step requires O(n 2 d) flops, whereas the second with application of QR factorization needs O(nd 2 ) flops. In the stationary case one can write both CPAST updating steps as (t − 1)]−1/2 . (t) = [Σ(t) (t − 1)][V  (t − 1) Σ 2 (t)V  V V

(8)

In the Sect. 2.2 we propose a modification of CPAST method for estimating a ddimensional subspace of the observed data under sparsity assumption on the leading eigenvectors of the covariance matrix Σ.

2.2 Sparse CPAST Assume the first d leading eigenvectors vi , i = 1, . . . , d, of Σ in (4) have most of their entries close to zero, namely, each vi fulfills the so-called weak-lr ball condition [11, 18]: for r ∈ (0, 2), |vi |(k) ≤ si k −1/r , k = 1, . . . , n.

Sparse Constrained Projection …

329

where |vi |(k) is the k-th largest coordinate of vi . The weak-lr ball condition is known to be more general than lr ball condition (which is qr ≤ s for q ∈ Rn , r ∈ (0, 2), s ≥ 1). It combines different definitions of sparsity especially related to sparseness of wavelet coefficients of functions from the smooth classes [13]. A proposed sparse modification of CPAST for the sparse leading eigenvectors relies on the orthogonal iteration scheme with an additional thresholding step (cf. [18, 23]). Define a thresholding function g(x, β) with a thresholding parameter β > 0 and x ∈ R via x − β ≤ g(x, β) ≤ x + β, g(x, β)1|x|≤β = 0.

(9)

The popular choices of the thresholding function are the hard-thresholding function g H (x, β) given by (10) g H (x, β) = x1(|x|≥β) and the soft-thresholding function g S (x, β) = (|x| − β)+ · sign(x). We denote by g(V, β) a result of thresholding of each column vi ∈ Rn , i = 1, . . . , d of a matrix V ∈ Rn×d with a corresponding component of a thresholding vector β = [β1 , . . . , βd ] , i = 1, . . . , d. Thus the matrix g(V, β) contains entries g(vi j , βi ), i = 1, . . . , d, j = 1, . . . , n, (t) an iterative From now on, by a slight abuse of notation, we will denote by V estimator obtained with the help of the modified CPAST, given t observations. To (t0 ), we use the following modification of a standard get the initial approximation V SPCA scheme (see [18, 23]): t0  0 ) = i=1 1. compute based on t0 observations: Σ(t x(i)x  (i)/t0 ;  0 ): 2. define a set of indices G, corresponding to large diagonal elements of Σ(t   log(n ∨ t0 )  G = k : Σkk (t0 ) > 1 + γ0 t0 √ ) for γ0 ≥ 3 2 log(n∨T . log(n∨t0 ) 0  0 ) corresponding to the row and column indices  (t0 ) be a submatrix of Σ(t 3. let Σ in G × G; (t0 ) take the first d eigenvectors of Σ 0 (t0 ) completed with zeros 4. as columns of V in the coordinates {1, . . . , n}\G to the vectors of length n. Now we describe a sparse modification of CPAST, which we called SCPAST. We (t0 ) obtained by the above procedure. Then for t = t0 + 1, . . . , T, we start with V perform the following steps

330

D. Belomestny and E. Krymova

(t − 1), (t) = Σ(t)  V 1. multiplication: Υ (t), β(t)), where g is a thresholding β (t) = g(Υ 2. thresholding: define a matrix Υ function satisfying (9) and β(t) is the corresponding thresholding vector; (t) = Υ β (t)Υ β (t)]−1/2 . β (t)[Υ 3. orthogonalization: V (t) allows to avoid auxiliary orthogRemark 1 Thresholding of the columns of Υ onalization. Futhermore, the weak-lr sparsity of the vector ζv with the components  d 2 j=1 λ j v jk , k = 1, . . . , n implies the weak-lr sparsity of v j , j = 1, . . . , d [28]. Therefore the thresholding step after multiplication ensures to sparseness of the esti2 mator of V . Note that ζvk is the variance of the k-th coordinate of the signal part of observations.

3 Error Bounds for CPAST and SCPAST Convergence of CPAST. We show that with high probability the subspace spanning (t) is close, in terms of l, to the subspace spanning V when the the CPAST estimator V 0 (t0 ) = V number of observation is large enough. Assume that the initial estimator V is constructed from t0 first observations by means of the singular value decomposition  0 ). of Σ(t Theorem 1 Suppose that the eigengap condition (6) holds and √ √ λ1 + 1 , t0 ≥ 4 2Rmax λd √ √ √ where Rmax = 5 n − d + 5 6 ln(n ∨ T ). Then after t − t0 iterations we get with probability at least 1 − C0 (n ∨ t)−2 , (t)) ≤C1 l(V, V

λd + 1 n − d λ1 + 1 log(n ∨ t) + C2 , t t λ2d λ2d

(11)

where C0 , C2 are absolute constants and C1 depends on τ . Remark 2 The second term on the right-hand side of (11) corresponds to the error of separating the first d eigenvectors from the rest. The first term is an average error of estimating all components of d leading eigenvectors. It originates from the interaction of the noise terms with the different coordinates, see [7]. Convergence of SCPAST. First we define parameter β(t) as √ the thresholding ) the components βi (t), i = follows. For t = t0 + 1, . . . , T and a ≥ 3 2 log(n∨T log(n∨t0 ) 1, . . . , d of the vector β(t) are given by βi (t) = a (λi + 1)

log(n ∨ t) . t

(12)

Sparse Constrained Projection …

331

Remark 3 Recall that we assumed that d and the eigenvalues λ1 , . . . , λd are known. In the case of unknown d and λ1 , . . . , λd one might first estimate the eigenvalues of 0 (t0 ) defined in the previous section and then select the largest set of eigenvalues Σ satisfying the eigengap condition (6) with some parameter τ (see [23] for more details). Denote by S(t) the set of indices of “large” eigenvectors components (S stands for “signal”), that is, for a fixed t,

S(t) =

j : |vi j | ≥ bh i

log(n ∨ t) , for some i = 1, . . . , d , t



√ . In fact, the quantity h 2 /t is an estimate of the noise where h i = λλii+1 and b = √0.1a i τ d variance in the entries of the i-th leading eigenvector [7]. The number of “large” entries of the first d leading eigenvectors to estimate thus might be estimated by the cardinality of S(t), which we denote by card(S(t)). card(S(t)) as   One can bound  d log(n∨t) . From card(S(t)) ≤ i=1 card(S j (t)), where S j (t)) = j : |vi j | ≥ bh i t

Lemma 14 d ≤ card(S(t)) ≤ C M(t), where C depends on b, r and ⎡ ⎤   d  s rj log(n ∨ t) −r/2 ⎦. M(t) = n ∧ ⎣ hr t j=1 j

(13)

Note that in the sparse case, the number of non-zero components card(S(t)) is much smaller than n. For example, if v j r ≤ s, j = 1 . . . , d, then sr M(t) ≤ n ∧ d r hd



log(n ∨ t) t

−r/2

.

r

The value hs r is often referred to as an effective dimension of the vector v j . Thus j M(t) is the number of effective coordinates of v j , j = 1, . . . , d in the case of disjoint S j (t). Since h d 2 /t is an upper-bound for the estimation error for the components of the first d leading eigenvectors, the right hand side of the above inequality gives, up to a logarithmic term, the overall number of components of the d leading eigenvectors to estimate. The next theorem gives non-asymptotic bounds for a distance between (t). V and V Theorem 2 Let  λ1 + 1   √ log(n ∨ T ), t0 ≥ C1 h d M 1/2 (T ) + C2 λd

(14)

where C1 depends on τ in (6), r , a, C2 depends on τ . After t iterations one has with probability at least 1 − C0 (n ∨ t)−2 ,

332

D. Belomestny and E. Krymova

(t)) ≤C1 h 2d M(t) l(V, V

λ1 + 1 log(n ∨ t) log(n ∨ t) + C2 . t t λ2d

(15)

with some absolute constant C0 > 0. Remark 4 The second term in (15) is the same as in the non-sparse case, see Theorem 1. This term is always present as an error of separating the first d eigenvectors from the rest eigenvectors regardless how sparse they are. The first term in (15) and (11) is responsible for the interaction of the noise with different coordinates of the signal. The average error of estimating one entry of the first d leading eigenvectors based on t observation can be bounded by 1t λdλ+1 2 , see [27]. The number of compod nents to be estimated in SCPAST for each vector is bounded by M(t) (see (13)), which is small compared to n in the sparse case. Thus, the first term in (15) can be significantly smaller than the first one in (11), provided the first d leading eigenvectors are sparse. Note also that the computational complexity of SCPAST at each step t = t0 + 1, . . . , T is O(ndcard(S(t))) + O(d 2 card(S(t))) with probability given by Theorem (2).

4 Numerical Results Artificial data. To illustrate the advantage of using SCPAST for the sparse case, we generate T = 2000 observations from (5) for the case of a single spike, that is, d = 1 and n = 1024. Our aim is to estimate the leading eigenvector v1 . We shall use three functions depicted in subplots (a) of Figs. 2, 3 and 4 with different sparsity levels in the wavelet domain. The observations are generated for the noise level σ = 1 and following cases of maximal eigenvalue λ1 ∈ {5, 30, 100}. We used the Symmlet 8 basis from the Matlab package SPCALab to transform the initial data into the wavelet domain. We applied CPAST and SCPAST for the recovery of wavelet coefficients of the vector v1 and v1 ) then transformed the estimates to the initial domain and computed the error l(v1 , depending on the number of observations. The results for the hard thresholding (10) with the a = 1.5 are shown in Figs. 2, 3 and 4 in subplots (b)–(d). Note that one peak function has sparser wavelet coefficients than those of three peak functions and the error of the recovery with SCPAST is significantly smaller for the case of one peak function. Real data example. Natural acoustic signals like the musical ones exhibit a highly varying temporal structure; therefore there is a need in adaptive unsupervised methods for signal processing which reduce the complexity of the signal. In [20] a method was proposed which reduces the spectral complexity of music signals using the adaptive segmentation of the signal in the spectral domain for the principal component analysis for listeners with cochlear hearing loss. In the following we apply CPAST and SCPAST as an alternative method for the complexity reduction of music signals. To illustrate the use of SCPAST and CPAST we set the memory parameter γ = 0.9

Sparse Constrained Projection …

333

Fig. 2 The components of the leading eigenvector to recover a step function, b–d contain the results for the error l(v1 , v1 ) for λ1 = {5, 30, 100}

to be able to adapt to the changes in the spectral domain of the signal. We focus on the first leading eigenvector recovery. As an example we consider a piece from Bach Siciliano for Oboe and Piano. A wavelet-kind CQT-transform [8] is computed for the signal (see a spectrogram of the transform in Fig. 1). The warmer colors correspond to the higher values of the amplitudes of the harmonics present in the signal at a particular time frame. It is clear that the signal has some regions of “stationarity” (e.g. approximately in time frame interval [1200, 2600]). We regard the corresponding spectrogram as a matrix with 4500 observations of 168-dimensional signal modeled by (16) and apply SCPAST and CPAST methods to recover the leading eigenvector v1 . Figure 5 contains the results of the recovery of the leading eigenvalue with 168 components. Leading eigenvector recovered by SCPAST has most of the components set to zero (regions with blue color), whereas CPAST keeps almost all the components non-zero. Non-zero components of the leading eigenvector estimate obtained by SCPAST correspond to the entries of original signal spectrogram Fig. 1 with the highest values (red and dark red). Therefore the results show that SCPAST method allows obtaining sparse representation of the leading eigenvectors and seems to be promising for construction of the structure preserving compressed representations of the signals.

334

D. Belomestny and E. Krymova

Fig. 3 The components of the leading eigenvector to recover a three peeks function, b–d contain the results for the error l(v1 , v1 ) for λ1 = {5, 30, 100}

5 Outlines of the Proofs Denote by V¯ a matrix with n − d column vectors vi , i = d + 1, . . . , n, which comd of the matrix V to the orthonormal basis plete the orthonormal columns {vi }i=1 t n . From (5) one gets a in R . Denote by X (t) a matrix with the columns {x(i)}i=1 representation X (t) = V Λd U  (t) + σΞ (t), t = 1, . . . , T, 1/2

(16)

where U (t) ∈ Rt×d , Ξ (t) ∈ Rn×t are matrices with independent N (0, 1) entries, V n , Λd is a diagonal matrix with λi , is the orthonormal matrix with columns {vi }i=1 i = 1, . . . , d on the diagonal. Denote a set of indices to the small components of leading eigenvectors as N (t) = {1, . . . , n}\S(t) (where N here stands for “noise”). From (16) the empirical covariance matrix can be decomposed as 1 1 1/2     = V Λ1/2 Σ(t) d U (t)U (t)Λd V + Ξ (t)Ξ (t) t t 1 1 1/2 1/2 + V Λd U  (t)Ξ  (t) + Ξ (t)U (t)Λd V  . t t

(17)

Sparse Constrained Projection …

335

Fig. 4 The components of the leading eigenvector to recover a one peek function, b–d contain the results for the error l(v1 , v1 ) for λ1 = {5, 30, 100}

Fig. 5 CQT-Spectrogram of the leading eigenvector recovered by CPAST and SCPAST with the memory parameter γ = 0.9 for Bach Siciliano for Oboe and Piano

336

D. Belomestny and E. Krymova

It is well known [14] that the distance (7) between subspaces W and Q, spanning n × d matrices with orthonormal columns W and Q correspondingly, is related to dth principal angle between subspaces W and Q as l(W, Q) = sin2 φd (W, Q), where the principal angles 0 ≤ φ1 ≤ · · · ≤ φd between subspaces W and Q are recursively defined as [15] xi , yi , where xi 2 yi 2   x, y arccos . {xi , yi } = arg min x∈W, y∈Q, x2 y2

φi (W, Q) = arccos

x⊥x j , y⊥y j , j   √   1, u = t0 + 1, . . . , t the term u  u1 V  Ξ (u) V  Ξ (u) − Id  is bounded from    √ √ 1  above by 3( d + p log(n ∨ t)), u u V¯ Ξ (u)[V¯  Ξ (u)] − In−d  by    √ √  √ 3( n − d + p log(n ∨ t)), u  u1 V  Ξ (u)[V¯  Ξ (u)]  by 1 + 2 p log(n∨t) u   √ √ √  1 ¯   ( n − d + d + p log(n ∨ t)). Finally, u V Ξ (u)U (u) is bounded by u

 1 + 2p

 √ log(n ∨ t) √ ( n − d + d + p log(n ∨ t)). √ u

√ Each of the bounds holds with the probability 1 − (n ∨ t)−3 for p = 6. Using the union bound we get the statement of the Lemma for the intersection of events with the probability 1 − C0 (t − t0 )(n ∨ t)−3 . Proof of Lemma 3 The proof is based on Davis sin θ Theorem 4, Lemma 10, and Weyl’s Theorem [30]. From Davis sin θ Theorem

Sparse Constrained Projection …

341

  (Σ(t  0 ) − Σ)V 2 (t0 )) ≤  l(V, V   ,  0) 2 λd + 1 − λd+1 Σ(t

(29)

where λd+1 (A) is a (d + 1)-th singular value of the matrix A T A. Weil’s theorem gives for j = 1, . . . , n  0 ))| ≤ Σ(t  0 ) − Σ. |λ j + 1 − λ j (Σ(t Therefore the denominator in (29) may be bounded as  0 ))| ≥ λd − 2Σ(t  0 ) − Σ. |λd + 1 − λd+1 (Σ(t From (27)     1  1        Σ(t0 ) − Σ ≤λ1  U (t0 ) U (t0 ) − Id  +  Ξ (t0 )Ξ (t0 ) − In   t0 t    1     + 2 λ1   t U (t0 )Ξ (t0 ) , 0 by Lemmas 10 and 11  0 ))| ≥ (1 + o(1))λd . |λd + 1 − λd+1 (Σ(t From (28) and Lemma 10 one has that with probability 1 − C0 (n ∨ t)−2  0 ) − Σ)V  ≤ (λ1 + 1) (Σ(t



 n + t0

 log(n ∨ t) . t0

Combining the last two inequalities we get the statement of Lemma. Proof of Lemma 4 First we prove (23) for the pair (t0 , r (t0 )) which satisfies by induction for all k = 1, . . . , K , and some ρ ∈ (α0 , 1) 

    ρ 1 2 α2 R α1 α0 . 1 − r (t0 ) + R − √ > √ α0 1 − ρ t0 ρ t0

We have for K = 1 r (t0 + 1) ≤ 

α0 r (t0 ) + α1 √tR+1 0

1 − r 2 (t0 ) −

α2 √tR+1 0

≤ ρr (t0 ) +

R ρα1 . √ α0 t0 + 1

(30)

342

D. Belomestny and E. Krymova

Furthermore suppose that (30) holds for K = L , then 



r (t + L) ≤ ρ r (t0 ) + R 2

L

α1 α0

 L k=1

ρ L+1−k √ t0 + k

2

and r (t0 + L + 1) ≤ 

R α0 r (t + L) + α1 √t+L+1

R 1 − r 2 (t + L) − α2 √t+L+1   L+1 1+L−k α1 ρ . ≤ ρ L r (t0 ) + R √ α0 k=1 t0 + k

A sufficient condition for the above formula to hold reads as  ⎞2 ⎛   k−1 k− j   R ρ α ⎠ − √α2 R > α0 . 1 − ⎝ρk−1r (t0 ) + 1 √ α0 j=1 t0 + j ρ t0 + k  Note that k−1 j=1 Furthermore

k− j √ρ t0 + j



ρ √1 , therefore the above condition is fulfilled given (30). 1−ρ t0

 r (t0 + K + 1) ≤ ρ K r (t0 ) + R

α1 α0

 K +1 k=1

ρ1+K −k , √ t0 + k

where for K > K 0 (ρ), t0 > 1 and jK ,ρ = log(K )/(2 log(1/ρ)) j K ,ρ K   ρ K −k ρj + ≤ √ √ t0 + K − j t0 + k k=1 j=0



K −1  j= j K ,ρ

ρj √ t0 + K − j +1

1 1 1 1  + √ 1 − ρ t0 + K − j K ,ρ 1 − ρ K + t0

and r (t0 + K + 1) 

R α1 ρ . √ α0 1 − ρ K + t0 + 1

From (30) the condition on the starting value r (t0 ) is  r (t0 ) ≤

  1 R α0 2 Rα1 ρ 1 − α2 √ + − √ . ρ α0 1 − ρ t0 t0

(31)

Sparse Constrained Projection …

343

Thus the number of initial observations t0 for (31) to be satisfied given r (t0 ) ≤ α √1t0 reads as      1 1 1 R α0 2 α1 α √ ≤ 1 − α2 √ + −R √ . ρ α0 1 − ρ t0 t0 t0 Therefore, taking into account (26) and ρ > α0 , the sufficient condition on

√ t0 is

# $ ρ α1 2R √ α0 1−ρ 2α2 R $+  t0 ≥ # . α2 α2 1 − ρ20 1 − ρ20 α2 . Set ρ = ρ() = 1 − (1 − α0 ). It is easy to From Lemma 3 and (26) α = Rmax 1−α 0 check that α0 < ρ() < 1 for  ∈ (0, 1/2]. Recall Rmax = R(T ), therefore



t0 ≥

1 2Rmax α2 . √ (1 − α0 )3/2 1 −  α1 √ R , thus (using | ln(1 − x)| ≥ x α0 t0# +K √ $ α0 1 ln T . Put  = 1/2 to get the (1−α0 ) α1

The value K 0 (ρ) might be defined by ρ K ≤ for x ∈ (0, 1)) it is sufficient to set K ≥ result.

Proof of Lemma 5 Using the triangle inequality    ◦ (t), V ) ≤ l(Υ ◦ (t)). ◦ (t), V ) + l(Υ ◦ (t), V l(V

(32)

◦ (t − 1), V ). Using the ◦ (t), V ) ≤ tan φd (Σ ◦ (t)V We bound the first term by l 1/2 (Υ variational definition of tan φd ◦ (t − 1)x ◦ (t)V V¯  Σ ◦ (t − 1)x ◦ (t)V x2 =1 V  Σ

◦ (t), V ) ≤ max l 1/2 (Υ

The right hand side may be bounded with ◦ (t − 1)x + V¯  (Σ ◦ (t − 1)x ◦ (t) − Σ)V V¯  Σ V . ◦ (t − 1)x − V  (Σ ◦ (t − 1)x ◦ (t) − Σ)V x2 =1 V  Σ V max

Triangle inequality gives ◦ (t) − Σ)V¯  ≤(Σ ◦ (t) − Σ ◦ (t))V¯  + (Σ ◦ (t) − Σ)V¯ . (Σ Note that

344

D. Belomestny and E. Krymova

   ◦ (t) − Σ ◦ (t) = Σ S (t) − Σ S (t) 0 , Σ 0 0   0 −VS (t)Λd VN (t) Σ ◦ (t) − Σ = , −VN (t)Λd VS (t) −VN (t)Λd VN (t)

(33) (34)

S (t) − where VS (t) is a submatrix of V with the row indices in S(t). Decompose Σ Σ S (t) using (27) and (33)    1 ◦ (t) − Σ ◦ (t))V¯  ≤ λ1 V¯ S VS (t)  U (t) U (t) − Id  (Σ  t      1 1         +  t Ξ S (t)Ξ S (t) − I S (t) + 2 λ1  t U (t) Ξ S (t) , where Ξ S (t) is t × card(S(t)) matrix, U (t) is t × d matrix. The elements of both matrices are i.i.d. N (0, 1). Using V¯  V = V¯ S (t)VS (t) + V¯ N (t)VN (t) = 0 we may bound V¯ S VS (t) ≤ V¯ N VN (t) ≤ VN (t) ≤ VN (t)F ,  where  · F is Frobenius norm, i.e. AF = tr(A A) for any matrix A, is small since it depends only on the components of the eigenvectors below the corresponding thresholds (see Lemma 13 and definition (13)) VN (t)2F =

d  i=1



d  j=1

vi,N (t)2 %

& r r 2 t 2 s j /(bh j )r log(n ∨ t) b2 h 2j r ∧ n 2 − r [log(n ∨ t)] 2 t

≤C M(t)h 2d

log(n ∨ t) , t

where C depends on d, r . From Lemmas 10, 11 and 12 with the probability 1 − C0 (n ∨ t)−3 one can bound   log(n ∨ t) M 1/2 (t) ◦ ◦ ¯  (Σ (t) − Σ (t))V  ≤C1 λ1 h d √ + C2 ( λ1 ∨ 1) . √ t t From (34) V¯  (Σ ◦ (t) − Σ) ≤ λ1 VN (t). Thus log(n ∨ t) ◦ 1/2 ¯  (Σ (t) − Σ)V  ≤C1 λ1 h d M (t) t  log(n ∨ t) . + C2 ( λ1 ∨ 1) t

(35)

Sparse Constrained Projection …

345

where C1 depends on r , d and C2 is a constant. Similarly, from (34) V  (Σ ◦ (t) − Σ) ≤ |VN (t)(1 + o(1)) and



log(n ∨ t) t  log(n ∨ t) , + C2 ( λ1 ∨ 1) t

◦

V (Σ (t) − Σ) ≤C1 λ1 h d M

1/2

(t)

(36)

where C1 depends on r , d and C2 is a constant. ◦ (t)) = l(Υ ◦ (t), Υ ◦,β (t)) relies on Wedin’s sin θ ◦ (t), V The bound on l(Υ Theorem 3 ◦ (t − 1) − Σ ◦,β (t)2 ◦ (t)V Σ ◦ (t), Υ ◦,β (t)) ≤ l(Υ . (37) ◦ (t − 1)) ◦ (t)V λ d (Σ ◦ (t − 1) − Υ ◦,β (t − 1) ≤ Z (t) F , where Z i j (t) is a matrix ◦ (t)V Note that Σ ◦ ◦ (t)V with the entries Z i j (t) = β j (t) if i ∈ S(t) and Z i j (t) = 0 if i ∈ N (t). Thus Σ d 2 ◦,β 2  (t) ≤ C M(t) i=1 βi (t) and from (12) (t − 1) − Υ d 

log(n ∨ t)  log(n ∨ t) 2 hd (λi + 1) ≤ da 2 λ2d t t i=1 d

βi2 (t) ≤ a 2

i=1

◦ (t − 1) − Υ ◦ (t)V ◦,β (t)2 ≤ C M(t)λ2d log(n∨t) h 2d , where C  depends on That is Σ t d, a and r . ◦ (t − To bound the denominator of (37) note that one may decompose Σ V 2 2 2 ◦  1)x = Σ z 1  + Σ z 2  , where V (t − 1)x = z 1 + z 2 , z 1 ∈ ran(V ) and z 2 ∈ ◦ (t − 1)x2 ≥ Σ z 1 2 . Using z 1 ∈ ran(V ) one has ran(V¯ ). Thus Σ V ◦ (t − 1)) Σ z 1  ≥ (λd + 1)z 1  ≥ (λd + 1) cos(V, V and taking into account (34) we get 1/2

λd



 ◦ (t − 1) ≥(λd + 1) cos(V, V ◦ (t − 1)) ◦ (t)V Σ ◦ (t) − Σ ◦ (t) − λ1 VN (t). − Σ

Thus using (33) and Lemmas 10 and 11 and summarizing the bounds for denominator and nominator in (37) we get

l

1/2

◦ (t)) ≤ ◦ (t), V (Υ

λd C M 1/2 (t)



log(n∨t) hd t

◦ (t − 1)) − E ◦ (t) (λd + 1) cos(V, V

   log(n∨t) √ . where E ◦ (t) = C1 λ1 h d M 1/2 (t) + C2 ( λ1 ∨ 1) t

,

346

D. Belomestny and E. Krymova

Combining the last inequality, (35), (36), (32) and (6) we get the result in the flavor of (1) with probability 1 − C0 (n ∨ t)−3 for one step of SCPAST algorithm. To get the bounds uniformly in u = t0 + 1, . . . , t, similarly to Lemma 2 we define the following events, each of which  occurs √ withprobability 1 − C0 (n ∨ √  u  u1 U (u) U (u) − Id  < 2 d + 2 p log(n ∨ t); an event t)−3 : an event   √ √  √  u  u1 Ξ S (u)Ξ S (u) − I S (u) < 2 card(S(t)) + 2 p log(n ∨ t); event u  u1    √ √ √ U (u) Ξ  (u) < 1 + 2 p log(n∨t) ( card(S(t)) + d + p log(n ∨ t)). For the S

u

intersection of the above events for u = t0 + 1, . . . , t and using Lemma 12, we get the statement of the Lemma. Proof of Lemma 6 Using Wedin sin θ Theorem 3 ◦ (t0 )) ≤ l(V, V

◦ (t0 ))2 V  (Σ − Σ . ◦ (t0 ))2 (λd − λd+1 (Σ

(38)

◦ (t0 )) = λd+1 + o(λ1 ) and Using Weyl theorem [30] it may be shown that λd+1 (Σ ◦  (t0 ))| ≥ λd (1 + o(1)). From (36) with probability 1 − (n ∨ thus |λd − λd+1 (Σ T )−2  log(n ∨ t) ◦ (t0 ) − Σ)V  ≤C1 λ1 h d M 1/2 (t0 ) (Σ t0   log(n ∨ T ) + C2 ( λ1 ∨ 1) . t0 Thus r (t0 ) ≤

√α t0

holds with probability 1 − (n ∨ T )−2 , where 

α=

1 C1 λ1 h d M 1/2 (t0 ) + C2 λd

√  λ1 + 1  log(n ∨ T ). λd

Proof of Lemma 7 The proof follows from Lemma 4 applied to (25) with α0 = λd1+1 , α1 = α2 = λλd1 +1 , and initial conditions given by Lemma 6. +1 d λi v 2ji , j = 1, . . . , n and for Proof of Lemma 8 Following [28] define η j = i=1  ( ' 0) ◦ (t0 ) = V (t0 ) . To show that V 0 < a− < 1 define G + = j : η j > a− γ0 log(n∨t t0 one has to prove that for the proper choice of γ0 and a− it holds G ⊆ G + ⊆ S(t0 ) j j (t0 ) ∼ (1 + with probability 1 − C0 (n ∨ T )−2 . To show that we first note that Σ d 2 2 i=1 λi v ji )ξ/t0 , where ξ is χt0 r. v. Therefore

Sparse Constrained Projection …

347

⎧ ⎫ ⎨, # $⎬  √ j j (t0 ) > 1 + γ0 log(n ∨ t0 )/ t0 Σ P(G ⊂ G + ) =P ⎩ + ⎭ j ∈G /  ⎧ ⎫ 0) ⎬ ⎨ξ γ0 (1 − a− ) log(n∨t t0  ≤nP −1> ⎩ t0 log(n∨t0 ) ⎭ 1+a γ − 0

t0

√ 2 2 2 n(n ∨ t0 )−(γ0 (1−a− ) /4)(1+o(1)) . ≤ γ0

√ √ Thus G ⊂G + holds with probability 1 − C0 (n ∨ T )−2 , e.g. for a− = 1 − 2/ 3, √ ) γ0 ≥ 3 2 log(n∨T . Note that for any j ∈ G + there exists i ∈ {1, . . . , d}, λi v 2ji ≥ log(n∨t0 )  log(n∨t0 ) a− γ0 , thus for G + ⊂ S(t0 ) to hold it is sufficient that d t0  a− dλi

log(n ∨ T ) λi + 1 log(n ∨ t0 ) >b . t0 t0 λi2

◦ (t0 ) = V (t0 ) with probability Thus for sufficiently big T , G ∩ S(t0 ) = G, that is V −2 1 − C0 (n ∨ T ) . Proof of Lemma 9 From Lemma 8 with probability 1 − (n ∨ T )−2 the results of the original and oracle version of the zero-step estimation procedure coincide, that (t0 ). First let us show that the similar statement holds for V ◦ (t0 + 1) ◦ (t0 ) = V is V ◦  (t0 ) = V (t0 ) holds it is (t0 + 1). Denote t1 = t0 + 1. On the event for which V and V (t0 ) = Σ(t ◦ (t0 ). From the construction of V ◦ (t0 ), the  1 )V  1 )V (t1 ) = Σ(t true that Υ ◦ N (t0 ) has zero entries. Note that S(t0 ) ⊆ S(t1 ) and N (t1 ) ⊆ N (t0 ). Thus submatrix V ◦ (t1 )k,l = Σ k,S (t1 ) Υ vl,S (t0 ),

(39)

◦ (t0 ) is a vector of size card(S(t1 )) containing the components of  vl◦ (t0 ) where  vl,S k,S (t1 ) is a row containing the components of k-th row of Σ(t  1) indexed by S(t1 ), Σ indexed by S(t0 ). Let us show that for k ∈ N (t1 ) with high probability

 (t1 )k,l ≤ βl (t1 ) = a (λl + 1) Υ

log(n ∨ t1 ) . t1

Therefore after the thresholding step the components in N (t1 ) are set to zero with high probability. From (16) k,S (t1 ) =Vk Λd U (t1 ) U (t1 )Λd VS (t1 ) t1 Σ 1/2

1/2

+ Ξk (t1 )Ξ S (t1 ) + Vk Λd U (t1 ) Ξ S (t1 ) 1/2

+

1/2 Ξk (t1 )U (t1 )Λd VS (t1 ),

(40)

348

D. Belomestny and E. Krymova

where Ξk (t1 ) is k-th row of Ξ (t1 ). Denote by VS◦ (t1 ) a matrix containing the first d eigenvalues of Σ S (t1 ) as columns (recall (24)) and by V¯ S◦ (t1 ) a matrix with card(S(t1 )) − d columns which complete columns of VS◦ (t1 ) to the orthonormal basis in Rcard(S(t1 )) . Note that V¯ S◦ (t0 )V¯ S◦, (t0 ) + VS◦ (t0 )VS◦, (t0 ) = I S (t0 ). Plugging in above equality in (39) (before VS (t1 )) in the view of (40) one gets (t1 )k,l = q11 + q12 + q12 + q14 + q21 + q22 + q22 + q24 , Υ where q1∗ depend on V¯ S◦ (t0 ) and q2∗ depend on VS◦ (t0 ). Let us first bound the terms  q11 and q21 . To this end we add  and subtract Vk Λd VS (t1 ) in the first term in (40) and   use that  t11 U (t1 ) U (t1 ) − I  = o(1), thus ◦ |q11 | ≤ (1 + o(1))Vk Λd V¯ S◦, (t0 ) vl,S (t0 )

(41)

 ¯◦ where it was  also used that VS (t1 )VS (t0 ) ≤ 1. Consider k ∈ N (t1 ), that is, |Vkl | = 1) . Using the definition of βk (t1 ) (recall (12)) |vkl | ≤ b h i2 log(n∨t t1

   d  d   λi + 1  λi + 1 b b 1/2  , Vk Λd  ≤ βk (t1 ) Vk Λd  ≤ βk (t1 ) . (42) a λ +1 a (λk + 1)λi i=1 k i=1 Thus using (42) the term (41) may be bounded as   d  λi + 1 b ◦ vl,S (t0 ) |q11 | ≤ (1 + o(1)) βk (t1 )V¯ S◦, (t0 ) a λ +1 i=1 l and in the same way it can be shown that   d  λi + 1 b ◦, ◦ |q21 | ≤ (1 + o(1)) βk (t1 )VS (t0 ) . vl,S (t0 ) a λ +1 i=1 l Next

◦ |q12 | =|Ξk (t1 )Ξ S (t1 )V¯ S◦ (t0 )V¯ S◦, (t0 ) vl,S (t0 )|/t1 ◦ ≤ζ(k, S(t1 ))Ξ S (t1 )VS◦, (t0 ) vl,S (t0 )/t1 ,

where ζ(k, l, S(t1 )) =

t1 q12 .  ◦ ◦ ¯ Ξ S (t1 )VS (t0 )V¯ S◦, (t0 ) vl,S (t0 )

Sparse Constrained Projection …

349

◦ S (t0 ) and doesn’t depend on Ξk (t1 ), k ∈ Note that  vl,S (t0 ) is the l-th eigenvector of Σ N (t1 ), and since N (t1 ) ⊆ N (t0 ), Ξk (t1 ) is independent from Ξ S (t1 ), thus ζ(k, l, S(t1 )) has N (0, 1) distribution. Define the event for which it holds:

|ζ(k, l, S(t1 ))| ≤ and Ξ S (t1 ) ≤



t1 +



c1 log(n ∨ t)

  card(S(t1 )) + 2 log(n ∨ t).

For big enough t1 (guarantied by (14)) t1 dominates card(S(t1 )) and log(n ∨ t). Thus 1 ◦ vl,S (t0 ) |q12 | ≤ |ζ(k, S(t1 ))|Ξ S (t1 )VS◦, (t1 ) t1  1/2 1 c1 log(n ∨ t) ◦ ≤ V ◦, (t1 ) βl (t1 ) vl,S (t0 ). a λl + 1 log(n ∨ t1 ) S On the events defined in the end of the proof of Lemma 5 the bound for the term q13 is as follows 1 1/2 ◦ Vk Λd U (t1 ) Ξ S (t1 )V¯ S◦, (t1 ) vl,S (t0 ) t1 b card(S(t1 )) log(n ∨ t) ◦ βl (t1 )V¯ S◦, (t1 ) ≤ vl,S (t0 ). √ a λd t1 log(n ∨ t1 )

|q13 | ≤

1 ))) = o(1) (see Supplementary materials for [23] p.16) it follows that From λ1d card(S(t t1 |q13 | = o(βl (t1 )). To bound the term q14 one may use the same argument as for q12

|q14 | ≤

1 1/2 ◦ |g(k, S(t1 ))|U (t1 )Λd V¯ S◦, (t1 ) vl,S (t0 ), t1

where g(k, l, S(t1 )) =

t1 q14 1/2  ◦ U (t1 )Λd VS (t1 )V¯ S◦ (t1 )V¯ S◦, (t1 ) vl,S (t0 )

◦ vl,S (t0 ), furthermore g(k, l, S(t1 )) has N (0, 1) disis independent from U (t1 ) and   √ √ tribution. On the intersection of events {U (t1 ) ≤ t1 + d + 2 log(n ∨ t)} and  {|g(k, l, S(t1 ))| ≤ c1 log(n ∨ t)}. with probability 1 − C0 (n ∨ t)−3

1 1/2 ◦ |q14 | ≤ g(k, l, S(t1 ))U (t1 )Λd V¯ S◦, (t1 ) vl,S (t0 ) t1   1 c1 λ1 1/2 ¯ ◦, log(n ∨ t) ◦ ≤ . VS (t1 ) vl,S (t0 )βl (t1 ) a λl + 1 log(n ∨ t1 )

350

D. Belomestny and E. Krymova

In the similar way term q22 may be bounded as follows   1  ◦,  ◦ ◦  vl,S (t0 ) q22 ≤  Ξk Ξ S(t1 ) VS (t1 )  VS (t1 ) t1 b c2 log(n ∨ t) ◦ β j (t1 ) V ◦, (t1 ) ≤ (1 + o(1)) vl,S (t0 ). a λl + 1 log(n ∨ t1 ) S The bound on the term q23 reads as 1 1/2 ◦ |q23 | = Vk Λd U (t1 ) Ξ S (t1 )VS◦ (t1 )VS◦, (t1 ) vl,S (t0 ) t1  log(n ∨ t1 ) 1/2 ◦ Vk Λd VS◦, (t1 ) vl,S (t0 ) = o(βl (t1 )). ≤2 t1 Similarly to the case of q14 one can show that 1 1/2 ◦ vl,S (t0 ) |q24 | ≤ Ξk U (t1 )Λd VS (t1 )VS◦ (t1 )VS◦, (t1 ) t1  c2 λ1 1 log(n ∨ t) ◦ ≤ βl (t1 ) V ◦, (t1 ) vl,S (t0 ). λl + 1 a log(n ∨ t1 ) S ◦ ◦ vl,S (t0 ) = 1 + o(1) and V¯ S◦, (t1 ) vl,S (t0 ) = o(1), Note that (see [23]) VS◦, (t1 ) # $ 4 4 that is from above bounds i=1 |q1i | = o i=1 |q2i | . Therefore

(t1 )k,l Υ

  d  λ j + 1 b ≤ βl (t1 ) a λ +1 j=1 l  log(n ∨ t) 1 + βl (t1 ) 2c1 a log(n ∨ t1 )

 d Observe that j=1 log(n ∨ T ) thus a≥



λ j +1 λl +1





λ1 + 1 . λl + 1

√ √ τ d and λ1 /λ j ≤ τ . Let us bound log(n ∨ t) by

√ ) 0.9a − 2c1 log(n∨T log(n ∨ T ) log(n∨t0 ) , b= 2c1 . √ √ log(n ∨ t0 ) τ d

N◦ (t0 + 1) = 0, and so, (t1 )k,l | ≤ βl (t1 ) and V Therefore one gets for all k ∈ N (t1 ): |Υ ◦   VN (t0 + 1) = VN (t0 + 1). N (u), u = t0 + 2, . . . , t we consider the events defined N◦ (u) = V To show that V for the standard normal random variables z(k, l, S(u)) and g(k, l, S(u))

Sparse Constrained Projection …

351

{z(k, l, S(u)) ≤



{|g(k, l, S(u))| ≤

c1 log(n ∨ t)},



c1 log(n ∨ t)}.

Using the union bound  P

'

,

z(k, l, S(u)) ≤



( c1 log(n ∨ t)

l∈N (u),k=1,...,d, u=t0 +1,...,t

≤1−

t d   

P{z(k, l, S(u)) ≤

k=1 u=t0 +1 l∈N (t)

≤ 1 − nd(t − t0 )P{z(k, l, S(t)) ≤



c1 log(n ∨ t)}

 c1 log(n ∨ t)}

≤ 1 − C0 n(t − t0 )log(n ∨ t)−1 (n ∨ t)−c1 /2 . Take c1 ≥ 9 to obtain the statement of the lemma.

8 Concentration of the Spectral Norm of the Perturbation Lemma 10 ([33]) Let X be a t × n matrix with i.i.d. N (0, 1) entries. The following result holds true     1  2  ≥ E X X − I (t, n, p) ≤ 2(n ∨ t)− p /2 , P  n 1 t where E 1 (t, n, p) = 3 max

 n t

+p



log(n∨t) √ , t



n t

+

p



log(n∨t) √ t

2  .

Lemma 11 ([9]) Let X and Y be t × q and t × m matrices, q > m, with i.i.d. N (0, 1) entries then for any 0 < x < 1/2 and c > 0   c2 log(n∨t) 3x 2 log(n∨t) P X  Y  ≥ t E 2 (t, q, m, x, c) ≤ e− 2 + qe− 16 , where E 2 (t, q, m, x, c) =

  √  m log(n∨t) q √ . 1 + x log(n∨t) + + c t t t t

Lemma 12 ([23]) There exist constants C˜ 1 ,C¯ 1 depending on r and C¯ 2 , C˜ 2 :

352

D. Belomestny and E. Krymova

λ1 M 1/2 (t) log(n ∨ t) ˜ ˜ E 1 (t, card(S(t)), p) ≤ C1 h d + C2 √ t t 1/2 M (t) h d log(n ∨ t) E 2 (t, card(S(t)), d, 4, p) ≤ C¯ 1 √ + C¯ 2 . t t h1 Theorem 3 (Wedin sin θ) Let A and B be n × k, n ≥ k, full-column rank matrices. Let the columns of a n × (n − k + 1) matrix U be the orthogonal matrices spanning the orthogonal complement of range of B. If the λmin (A) ≥  ≥ 0 then l(A, B) ≤ A U 2 /2 ≤ B − A2 /2 . Theorem 4 (Davis sin θ) [10] Let A and B be the symmetric matrices with the decomposition A = W1 Λ1 W1 + W2 Λ2 W2 and B = U1 1 U1 + U2 2 U2 , with conditions [U1 , U2 ] is orthogonal, W2 is orthonormal and W1 W2 = 0, the eigenvalues of Λ1 W1 W1 are contained in the interval (a1 , a2 ) and the eigenvalues of 1 are laying outside of the interval (a1 − , a2 − ) for some  > 0 then l(W1 , U1 ) ≤ U2 (B − A)W1 2 −1 λ−2 min (W1 ). Lemma 13 ([23]) The norms of the subvectors v j,N (t) of v j satisfy % v j,N (t)  ≤ 2

r

sj 2 2 − r (bh j (t))r



log(n ∨ t) t

−r/2

& ∧ n b2 h 2j (t)

log(n ∨ t) . t

Lemma 14 ([23]) Bound on the effective dimension card(S(t)) is given by ⎡ d ≤ card(S(t)) ≤ C M(t) = C ⎣n ∧

d  j=1

s rj h −r j



log(n ∨ t) t

−r/2

⎤ ⎦.

Lemma 15 For a χ2t random variable ζt the following bounds hold [17] P(ζt > t (1 + ε)) ≤ e−3tε

2

/16

, 0 < ε < 1/2,

P(ζt < t (1 − ε)) ≤ e−tε /4 , 0 < ε < 1, √ 2 2 P(ζt > t (1 + ε)) ≤ √ e−tε /4 , 0 < ε < t 1/16 , t > 16. ε t 2

Sparse Constrained Projection …

353

References 1. Abed-Meraim, K., Attallah, S., Chkeif, A., Hua, Y.: Orthogonal OJA algorithm. IEEE Signal Proc. Lett. 7(5), 116–119 (2000). https://doi.org/10.1109/97.841157 2. Abed-Meraim, K., Chkeif, A., Hua, Y.: Fast orthonormal past algorithm. IEEE Signal Proc. Lett. 7(3), 60–62 (2000). https://doi.org/10.1109/97.823526 3. Adali, T., Haykin, S.: Adaptive Signal Processing: Next Generation Solutions, vol. 55. Wiley (2010). https://doi.org/10.1002/9780470575758 4. Attallah, S., Abed-Meraim, K.: Fast algorithms for subspace tracking. IEEE Signal Proc. Lett. 8(7), 203–206 (2001). https://doi.org/10.1109/97.928678 5. Attallah, S., Abed-Meraim, K.: Low-cost adaptive algorithm for noise subspace estimation. Electron. Lett. 38(12), 1 (2002). https://doi.org/10.1049/el:20020388 6. Badeau, R., David, B., Richard, G.: Fast approximated power iteration subspace tracking. IEEE Trans. Signal Proc. 53(8), 2931–2941 (2005). https://doi.org/10.1109/TSP.2005.850378 7. Birnbaum, A., Johnstone, I.M., Nadler, B., Paul, D.: Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Stat. 41(3), 1055 (2013). https://doi.org/10.1214/12-AOS1014 8. Brown, J.C.: Calculation of a constant q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991). https://doi.org/10.1121/1.400476 9. Davidson, K.R., Szarek, S.J.: Local operator theory, random matrices and banach spaces. Handbook Geom. Banach Spaces 1(317–366), 131 (2001). https://doi.org/10.1016/S18745849(01)80010-3 10. Davis, C., Kahan, W.M.: The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7(1), 1–46 (1970). https://doi.org/10.1137/0707001 11. Donoho, D.L.: Unconditional bases are optimal bases for data compression and for statistical estimation. Appl. Comput. Harmon. Anal. 1(1), 100–115 (1993). https://doi.org/10.1006/ ACHA.1993.1008 12. Doukopoulos, X.G., Moustakides, G.V.: Fast and stable subspace tracking. IEEE Trans. Signal Proc. 56(4), 1452–1465 (2008). https://doi.org/10.1109/TSP.2007.909335 13. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001). https://doi.org/10.1198/016214501753382273 14. Golub, G.H., Van Loan, C.F.: Matrix computations, vol. 3. JHU Press (2012) 15. Hardt, M., Price, E.: The noisy power method: a meta algorithm with applications. In: Advances in Neural Information Processing Systems, pp. 2861–2869 (2014). https://doi.org/10.5555/ 2969033.2969146 16. Hua, Y., Xiang, Y., Chen, T., Abed-Meraim, K., Miao, Y.: A new look at the power method for fast subspace tracking. Digit. Signal Proc. 9(4), 297–314 (1999). https://doi.org/10.1006/dspr. 1999.0348 17. Johnstone, I.M.: Chi-square oracle inequalities. Lecture Notes-Monograph Series, pp. 399–418 (2001). https://doi.org/10.1214/lnms/1215090080 18. Johnstone, I.M., Lu, A.Y.: Sparse Principal Components Analysis (2009). arXiv:0901.4392 19. Klapuri, A., Davy, M.: Signal processing methods for music transcription. Springer Sci. Bus. Media (2007). https://doi.org/10.1007/0-387-32845-9 20. Krymova, E., Nagathil, A., Belomestny, D., Martin, R.: Segmentation of music signals based on explained variance ratio for applications in spectral complexity reduction. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 206–210. IEEE (2017). https://doi.org/10.1109/ICASSP.2017.7952147 21. Lassami, N., Abed-Meraim, K., Aïssa-El-Bey, A.: Low cost subspace tracking algorithms for sparse systems. 
In: 2017 25th European Signal Processing Conference (EUSIPCO), pp. 1400– 1404. IEEE (2017). https://doi.org/10.23919/EUSIPCO.2017.8081439 22. Lassami, N., Aïssa-El-Bey, A., Abed-Meraim, K.: Low cost sparse subspace tracking algorithms. Signal Proc. 173, 107–522 (2020). https://doi.org/10.1016/j.sigpro.2020.107522 23. Ma, Z., et al.: Sparse principal component analysis and iterative thresholding. Ann. Stat. 41(2), 772–801 (2013). https://doi.org/10.1214/13-AOS1097

354

D. Belomestny and E. Krymova

24. Nagathil, A., Martin, R.: Optimal signal reconstruction from a constant-q spectrum. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 349–352. IEEE (2012). https://doi.org/10.1109/ICASSP.2012.6287888 25. Nagathil, A., Weihs, C., Martin, R.: Spectral complexity reduction of music signals for mitigating effects of cochlear hearing loss. IEEE/ACM Trans. Audio Speech Lang. Proc. 24(3), 445–458 (2016). https://doi.org/10.1109/TASLP.2015.2511623 26. Oja, E.: Principal components, minor components, and linear neural networks. Neural Netw. 5(6), 927–935 (1992). https://doi.org/10.1016/S0893-6080(05)80089-9 27. Paul, D.: Nonparametric Estimation of Principal Components. Stanford University (2004) 28. Paul, D., Johnstone, I.M.: Augmented sparse principal component analysis for high dimensional data (2012). arXiv:1202.1242 29. Spokoiny, V.: Multiscale local change point detection with applications to value-at-risk. Ann. Stat. 37(3), 1405–1436 (2009). https://doi.org/10.1214/08-AOS612 30. Stewart, G., Sun, J.G.: Matrix Perturbation Theory (computer science and scientific computing). Academic Press Boston (1990) 31. Strobach, P.: Low-rank adaptive filters. IEEE Trans. Signal Proc. 44(12), 2932–2947 (1996). https://doi.org/10.1109/78.553469 32. Valizadeh, A., Karimi, M.: Fast subspace tracking algorithm based on the constrained projection approximation. EURASIP J. Adv. Signal Proc. 2009, 9 (2009). https://doi.org/10.1155/2009/ 576972 33. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices (2010). arXiv:1011.3027. https://doi.org/10.1017/CBO9780511794308.006 34. Yang, B.: Projection approximation subspace tracking. IEEE Trans. Signal Proc. 43(1), 95–107 (1995). https://doi.org/10.1109/78.365290 35. Yang, B.: Asymptotic convergence analysis of the projection approximation subspace tracking algorithms. Signal Proc. 50(1–2), 123–136 (1996). https://doi.org/10.1016/01651684(96)00008-4 36. Yang, B.: Convergence analysis of the subspace tracking algorithms past and pastd. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996. ICASSP96. Conference Proceedings, vol. 3, pp. 1759–1762. IEEE (1996). https://doi.org/10.1109/ ICASSP.1996.544206 37. Yang, J.F., Kaveh, M.: Adaptive eigensubspace algorithms for direction or frequency estimation and tracking. IEEE Trans. Acoust. Speech Signal Proc. 36(2), 241–251 (1988). https://doi.org/ 10.1109/29.1516 38. Yang, W., Xu, H.: Streaming sparse principal component analysis. In: International Conference on Machine Learning, pp. 494–503 (2015). https://doi.org/10.5555/3045118.3045172 39. Yang, X., Sun, Y., Zeng, T., Long, T., Sarkar, T.K.: Fast STAP method based on past with sparse constraint for airborne phased array radar. IEEE Trans. Signal Proc. 64(17), 4550–4561 (2016). https://doi.org/10.1109/TSP.2016.2569471

Bernstein–von Mises Theorem and Misspecified Models: A Review Natalia Bochkina

Abstract This is a review of asymptotic and non-asymptotic behaviour of Bayesian methods under model specification. In particular we focus on consistency, i.e. convergence of the posterior distribution to the point mass at the best parametric approximation to the true model, and conditions for it to be locally Gaussian around this point. For well specified regular models, variance of the Gaussian approximation coincides with the Fisher information, making Bayesian inference asymptotically efficient. In this review, we discuss how this is affected by model misspecification. In particular, we highlight contribution of Volodia Spokoiny to this area. We also discuss approaches to adjust Bayesian inference to make it asymptotically efficient under model misspecification. Keywords Bayesian model · Bernstein-von Mises theorem · Misspecified models · Optimality · Posterior contraction rate

1 Introduction Consider a family of probability models P(Y | θ ) indexed by parameter θ ∈  for observations y, and a prior distribution π on the parameter space . In a classical Bayesian approach, the posterior distribution p(θ | Y) is used for statistical inference [40]. / Denote the true distribution of observations P0 , and we consider the case P0 ∈ {P(· | θ ), θ ∈ }, i.e. the model is misspecified. Such case arises in many applications, particularly in complex models where the numerical evaluation of posterior distribution under the ideal probability model takes a long time to run, leading to increased use of approximate models with faster computing time. A typical example is approximating complex dependence structure by pairwise dependence only [2, 41]. N. Bochkina (B) University of Edinburgh and Maxwell Institute, Edinburgh, UK e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_10

355

356

N. Bochkina

For well-specified regular models, the classical Bernstein–von Mises theorem states that for n independent identically distributed (iid) observations, for large n, the posterior distribution behaves approximately as a normal distribution centered on the true value of the parameter with a random shift, with both the posterior variance and the variance of the random shift being asymptotically equal to the inverse Fisher information, thus making the Bayesian inference asymptotically consistent and efficient. This was extended to locally asymptotically normal (LAN) models for  ∈ R p with fixed p [40], for LAN models with growing p, etc. A version of Bernstein–von Mises theorem is also available for nonregular models such as locally asymptotically exponential models [10, 21], models with parameter on the boundary of the parameter space [8] which hold under misspecified models. Here, however, we focus on “regular” models where estimators are asymptotically Gaussian. However, under model misspecification, Bayesian model is no longer optimal, as the posterior variance does not match the minimal lower bound on the variance of unbiased estimators [23, 30]). Therefore, the standard way of constructing a posterior distribution following the Bayes theorem may not be appropriate for particular purposes, e.g. inference or prediction [28]. Different ways to construct a distribution of θ given Y that produces inference appropriate for the purpose of the analysis have been proposed. A natural aim for such a method is to behave like a standard Bayesian method when the model is well specified, i.e. when P0 ∈ {P(·|θ ), θ ∈ }, and to provide at least asymptotically optimal inference, from the frequentist perspective, under model misspecification. In this review we focus on regular misspecified models. The review is organised as follows. We start with the summary of frequentist results for regular misspecified models (Sect. 2). In Sect. 3, we formulate classical Bernstein–von Mises theorem and in Sect. 4 we discuss the analogue of Bernstein–von Mises theorem under model misspecification, particularly the conditions when this local Gaussian approximation holds. In Sect. 5, we review the proposed methods to construct distribution p(θ | Y ) that results in improved inference under model misspecification compared to the standard Bayesian approach. We conclude with discussion and open questions. Definitions. For a vector x ∈ R p , ||x|| denotes the Euclidean norm of x, and for a matrix A ∈ R p× p , ||A|| denotes the spectral (operator) norm of A.

2 Frequentist Results for Misspecified Models 2.1 Probability Model A probability-based set up is as follows. Consider a measurable space (Y, A) and let P be a set of probability distributions on (Y, A), and assume that Y ∼ P(· | θ ), θ ∈  ⊆ R p with finite p (which may or may not be allowed to grow with n),

Bernstein–von Mises Theorem and Misspecified Models: A Review

357

{P(· | θ ), θ ∈ } ⊂ P, where Y is n-dimensional random variable, P(y | θ ) is the probability density function (with respect to the Lebesque or counting measure). The true distribution of Y is denoted as P0 , with density p0 .

2.2 Best Parameter Given the parametric family and the true distribution of the data, the “best” parameter is defined by (1) θ  = arg min K L(P(·|θ ), P0 ) θ∈





 where K L(P1 , P0 ) = log dd PP01 d P0 is the Kullback-Leibler divergence between probability measures P0 and P1 . Usually, it is either assumed that the model misspecification is such that θ  is the parameter of interest (e.g. in machine learning or quasi-likelihood approaches, this is often done by construction). However, there are alternative approaches when θ  differs from the parameter of interest θ0 (e.g. [27]).

2.3 Regular Models We consider a regular setting, under the following assumptions. 1.  is an open set. 2. Maximum of E log p(θ | Y) over θ ∈  is attained at a single point θ  ((1)). 3. D(θ ) and V (θ ) are finite positive definite matrices for all θ in some neighbourhood of θ  , where

V (θ ) = E[∇ log p(θ | Y)(∇ log p(θ | Y))T ],

(2)

D(θ ) = −E∇ log p(θ | Y). 2

Here (and throughout the paper) the expectation is taken with respect to the true distribution of the data, Y ∼ P0 (which is sometimes emphasised by writing E P0 ), and ∇ is the differentiation operator with respect to θ . If the model is correctly specified, D(θ  ) = V (θ  ). Typically, two main types of models are considered in theory: independent identically distributed (iid) models: Y = (Y1 , . . . , Yn ) with independent Yi with the same pdf or pmf p(· | θ ), and more generally locally asymptotically normal (LAN) models [40]. We give the definition of LAN models stated in [22] that applies to a misspecified model.

358

N. Bochkina

Definition 1 Stochastic local asymptotic normality (LAN) condition: given an interior point θ  ∈  and a rate δn → 0, there exist random vectors n,θ  and a nonsingular matrix D0 such that the sequence n,θ  is bounded in probability, and for every compact set K ∈ R p ,    log pθ  +δn h (Y1 , . . . , Yn )  T T   − h D0 n,θ − 0.5 h D0 h  → 0 sup  log pθ  (Y1 , . . . , Yn ) h∈K as n → ∞ in (outer) P0(n) -probability.

√ For iid model and iid true distribution, with possibly misspecified density, δn = 1/ n 2   and D0 is the limit of δn D(θ ) as n → ∞ where D(θ ) is defined by (2). Also, 

n,θ 

=n

−1/2

D0−1

n 

∇ log p(Yi | θ  )

i=1

which has mean 0 and variance D0−1 V0 D0−1 , called the sandwich covariance, where V0 is the limit of δn2 V (θ  ) as n → ∞, and V (θ  ) is defined by (2).

2.4 Nonasymptotic LAN Condition Spokoiny [35] provides a non-asymptotic version of LAN expansion under model misspecification, and non-asymptotic bounds on consistency of the MLE and coverage of MLE-based and likelihood-based confidence regions. These conditions have been updated in [30]. Define the stochastic term ζ (θ ) = log p(Y | θ ) − E log p(Y | θ ). 1. There exist g > 0 and positive-definite p × p matrix V such that for any |λ| ≤ g,   T γ ∇ζ (θ  ) 2 ≤ eλ /2 , sup E P0 exp λ p ||V γ || γ ∈R If such matrix V exists, it satisfies Var(∇ζ (θ  )) ≤ V . Spokoiny and Panov [37] show that this holds as long as the following condition holds for some g˜ > 0 and C ∈ (0, ∞):

E exp u T ∇ζ (θ  ) ≤ C. sup u∈R p : ||u||≤g˜

Bernstein–von Mises Theorem and Misspecified Models: A Review

359

2. There exists ω > 0 such that for any ||D 1/2 (θ − θ  )|| ≤ r and |λ| ≤ g,   T γ (∇ζ (θ ) − ∇ζ (θ  )) 2 ≤ eλ /2 . sup E exp λ p ω||V γ || γ ∈R In later work, [36] assumes that ∇ζ (θ ) is independent of θ by introducing an augmented model in the context of an inverse problem (the approach the author refers to as calming) thus making this condition unnecessary. 3. Conditions on E log p(Y | θ ): ∇E log p(Y | θ  ) = 0 and that the second derivative is continuous in the neighbourhood ||D 1/2 (θ − θ  )|| ≤ r : ||D −1 D(θ ) − I p || ≤ δ(r ) where I p is p × p identity matrix, D(θ ) = −∇ 2 E log p(Y | θ ) and D = D(θ  ). This condition is taken from [30], in [35] this condition was written in terms of E log p(Y | θ ) − E log p(Y | θ  ) rather than its second derivative D(θ ) which are similar due to ∇E log p(Y | θ  ) = 0. In later papers this condition is rewritten in terms of the moments of the third and fourth derivatives of E log p(Y | θ ) [37]. Under the conditions for the stochastic terms, using results of [37], for any p × p matrix B such that trace(BV B T ) < ∞, the random term can be bounded nonasymptotically as follows: P ||B(∇ζ (θ ) − ∇ζ (θ  ))|| ≥ c0 ωz(B, x) ≤ 2e−x where c0 is an absolute constant and

z(B, x) = trace(BV B T ) + 2x 1/2 trace((BV B T )2 ) + 2x||BV B T ||. Note that if these conditions hold, then for ||D 1/2 (θ − θ  )|| ≤ r , with probability at least 1 − 2e−x ,     log p(Y | θ ) − (θ − θ  )T ∇ζ (θ  ) − 0.5(θ − θ  )T D(θ − θ  )    p(Y | θ ) ≤ 0.5δ(r )r 2 + c0 r ωz D −1/2 , x , which is a non-asymptotic analogue of LAN expansion under a possibly misspecified model with δn = ||D||−1/2 , D0 = δn2 D (or its limit as δn → 0), h = (θ − θ  )/δn with 1/2 ||D0 h|| ≤ δn r and n,θ  = δn D0−1 ∇ζ (θ  ). Example 1 Consider a model with p-dimensional θ and iid observations Y1 , . . . , Yn . Assume that the true observations are also iid but they may have a different true distribution with finite positive definite V0 = E [∇ log p(Yi | θ  )]T ∇ log p(Yi | θ  ) and D0 = −E∇ 2 log√p(Yi | θ  ). Therefore, D = D(θ  ) = n D0 and V = V (θ  ) = 1/2 nV0 . Denote r0 = r/ n the radius of the local neighbourhood ||D0 (θ − θ  )|| ≤ r0 .

360

N. Bochkina

If ∇ζ (θ ) does not depend on θ (e.g. for p(· | θ ) from an exponential family with natural parameter), then ω = 0 in Condition 2 and the upper bound in the LAN condition is 0.5nr03 which tends to 0 if r0 = o(n −1/3 ), or equivalently if r = o(n 1/6 ).

2.5 Optimal Variance for Unbiased Estimators For regular models discussed in Sect. 2.3, the lower information bound for the variance of unbiased estimators of θ  when the true model is unknown, is the sandwich covariance [14, 45]: ˆ ≥ D −1 V D −1 Var(θ) where V = V (θ  ) and D = D(θ  ) as defined by (2). This is an analogue of the Cramer-Rao inequality for regular misspecified models. In frequentist inference, the MLE is asymptotically unbiased and its variance is approximately the sandwich covariance, i.e. inference based on the MLE for misspecified models is asymptotically efficient [45]. When the true model is known but a misspecified model is used, e.g. for computational convenience, it is possible in principle to achieve the smallest variance, inverse Fisher information, using a misspecified model with additional adjustment [14]. We illustrate it on the model used in [38] in Sect. 5.5.

3 Bernstein–von Mises Theorem for Correctly Specified Models For a correctly specified parametric model {P(· | θ ), θ ∈ } with a density p(· | θ ) and a prior distribution with density π(θ ), Bayesian inference is conducted using the posterior distribution p(θ | y) = 

p(y | θ )π(θ ) , θ ∈ .  p(y | θ )π(θ )dθ

We formulate the Bernstein-von Mises theorem in a regular setting defined in Sect. 2.3, under the additional assumption that prior density π(θ ) is continuous for in a neighbourhood of θ  , following [40]. Theorem 1 (Bernstein–von Mises theorem) For a well-specified regular parametric model { p(y | θ ), θ ∈ } with P0 = Pθ  under the regularity assumptions listed in Sect. 2.3, LAN condition with D0 , for a prior density π(θ ) continuous for in a neighbourhood of θ  , then

Bernstein–von Mises Theorem and Misspecified Models: A Review

361 P0∞

sup |P(δn−1 (θ − θ  ) ∈ A | Y1 , . . . , Yn ) − N (A; n,θ  , D0−1 )| → 0 A

as n → ∞, where the supremum is taken over measurable sets A, and n,θ  weakly converges to N (0, D0−1 ). Matching variances of the posterior distribution and of the random shift n,θ  , that are equal to the inverse Fisher information, make Bayesian inference efficient asymptotically, from the frequentist perspective. As the random variable n,θ  is bounded with high probability and δn → 0, this theorem also implies consistency of the posterior distribution of θ .

4 Bernstein–von Mises Theorem and Model Misspecification 4.1 Bayesian Inference Under Model Misspecification Given a prior distribution with density π(θ ), the posterior distribution is constructed as the conditional distribution of the parameter θ given data y using Bayes theorem. A more general approach, often referred to as a Gibbs posterior distribution, is where the posterior distribution is defined using a loss function (θ, y) and a prior distribution with density π(θ ): p(θ | y) = 

e− (θ,y) π(θ ) , θ ∈ . − (θ,y) π(θ )dθ θ∈ e

If the loss function (θ, y) is chosen to be − log p(y | θ ), this approach leads to the usual posterior distribution. As well as differing in the interpretation, the key technical difference to the classical Bayesian approach is that function e− (θ,y) does not integrate to 1 over y. This approach is used in applications where only moment conditions are available ([11], Huber function can be used as a loss for robust inference, etc. Bissiri et al. [7] provide a decision-theoretical justification of this approach, by showing that this distribution minimises the following loss function with respect to probability measure ν on ,  

(y, θ )ν(dθ ) + K L(ν, π ).

The authors argue that for iid observations, this is a Bayesian equivalent of θˆ = arg min θ∈

n 1 (yi , θ ). n i=1

(3)

362

N. Bochkina

In the latter approach the interest is in a point estimator whereas in the former approach the interest is in a distribution over θ given y. When applying Bayesian approach under model misspecification, the key question is whether Bayesian inference remains asymptotically efficient, i.e. whether the Bernstein–von Mises theorem holds with the posterior variance being close asymptotically to the sandwich covariance.

4.2 Concentration A necessary condition for a Bernstein–von Mises—type result is to prove that the posterior distribution concentrates in the limit at the point mass at θ  . Consistency of the posterior distribution can be defined as follows. Given a semimetric d between the class of probability models {Pθ , θ ∈ } and the true distribution P0 , as for any ε > 0, P0∞

P(d(Pθ , P0 ) > ε | Y1 , . . . , Yn ) → 0

as n → ∞.

Here Pθ is the probability distribution associated with probability density (mass) function pθ . For a misspecified model, the distance between {Pθ , θ ∈ } and P0 may be positive, so it is not always possible to achieve consistency. However, it may be possible to prove concentration at the probability model with the best parameter θ  : as for any ε > 0, P0∞

P(d(Pθ , Pθ  ) > ε | Y1 , . . . , Yn ) → 0

as n → ∞.

This is referred to as posterior concentration. It is also often of interest to prove that the posterior distribution contracts at some rate (usually the corresponding minimax rate), namely that there exists a sequence εn such that for any sequence Mn growing to infinity, P0∞

P(d(Pθ , Pθ  ) > Mn εn | Y1 , . . . , Yn ) → 0

as n → ∞.

(4)

The main paper on posterior contraction rate under model misspecification is [22]. Their results apply to nonparametric models with θ = P. One of their conditions is formulated in terms of the covering number for testing under misspecification. Definition 2 Given ε > 0, define Nt (ε, P, d, P0 , Pθ  ), the covering number for testing under misspecification, as the minimal number N of convex sets B1 , . . . , B N of probability measures on (Y, A) needed to cover the set {P ∈ P : ε < d(P, Pθ  ) < 2ε} such that, for every i,

Bernstein–von Mises Theorem and Misspecified Models: A Review

363

inf sup − log E P0 [d P/d Pθ  ]η ≥ ε2 /4.

P∈Bi 0 0,

(B(ε, Pθ  , P0 )) > 0

where 



d Pθ B(ε, Pθ  , P0 ) = θ : −E P0 log d Pθ 



 ≤ ε , −E P0 2

d Pθ d Pθ 



2 ≤ε

2

,

and sup Nt (η, P, d, P0 , Pθ  ) < ∞. η>ε

Then, for every ε > 0, as n → ∞, P(d(Pθ , Pθ  ) > ε | Y1 , . . . , Yn ) → 0 as n → ∞. The authors also consider the case when the best approximation θ  is not unique. Their results are illustrated on consistency of density estimation using mixture models, and on nonparametric regression models using a convex set of prior models for the regression function. Grünwald [16] demonstrates that the convexity of P is crucial. Grünwald and van Ommen [17] show via simulations that if {Pθ , θ ∈ } is not convex, the posterior distribution does not concentrate on θ  but instead it concentrates on the best approximation of P0 in the convex hull of the class of the parametric models Conv({Pθ , θ ∈ }): P˜ = arg

min

P∈Conv({Pθ , θ∈})

K L(P, P0 ).

(5)

We discuss their example in more detail in Sect. 4.4. In particular, the authors say that it is possible to achieve consistency, i.e. for the posterior distribution to converge to the point mass at θ  if for any η ∈ (0, 1],

364

N. Bochkina

 E P0

d Pθ d Pθ 



≤1

for all θ ∈ .

(6)

Note that this condition is reminiscent of one of the conditions of [22] who assume that the above condition holds for η = 1, and a similar condition is present in the definition of the covering numbers under model misspecification. Further, [18] relax this condition to be upper bounded by 1 + u for some small u > 0. Bhattacharya et al. [3] study the posterior contraction rate for a particular type of semi-metric d that is matched to the considered misspecified model. For such a matched semi-metric, they show that the posterior contraction rate is determined only by the prior mass condition of the posterior contraction theorem of Ghoshal and van der Vaart (2007), and does not involve the entropy condition. In their setting, the pseudo-likelihood is a power of a probability density, so condition (6) holds. The authors consider only examples of misspecified models with a convex parameter space. Syring and Martin [39] study the concentration rate of Gibbs posteriors in a (semi-metric) d under more general losses and semi-metrics, focusing on iid models and iid true distribution, and discuss a setting where Yi ’s are independent but not necessarily identically distributed. Their assumptions also include the condition on the prior mass of KL neighbourhood but not the entropy; they use a different additional assumption instead. See Sect. 5.3.2 for details.

4.3 Bernstein–von Mises—Type Results Under Model Misspecification The first Bernstein–von Mises—type result under model misspecification was formulated by [23]. The authors state that for misspecified LAN models (see Definition 1), under assumptions of Theorem 2.1 in [22] with rate δn , P0∞

sup |P(δn−1 (θ − θ  ) ∈ A | Y1 , . . . , Yn ) − N (A; n,θ  , D0−1 )| → 0 A

as n → ∞, i.e. the posterior distribution converges to the Gaussian distribution in the total variation distance. Panov and Spokoiny [30] state the BvM for semi-parametric possibly misspecified models, with flat or a Gaussian prior distribution, in a non-asymptotic setting. Here we only state conditions and statement for a parametric model. In addition to the non-asymptotic LAN assumptions stated in Sect. 2.3, the following assumptions are made. 1. Small bias condition: the norm of the bias of the penalised estimator ||θπ − θ  || is small where θπ is defined by

Bernstein–von Mises Theorem and Misspecified Models: A Review

θπ = arg min[K L( p(·|θ ), p0 ) − log π(θ )] θ∈

365

(7)

= arg min[−E log p(Y | θ ) − log π(θ )]. θ∈

2. Identifiability: ||D −1 V || ≤ a 2 ∈ (0, ∞). 3. Global deterministic condition: for ||D 1/2 (θ − θ  )|| > r , E log p(Y | θ ) − E log p(Y | θ  ) ≥ −||D 1/2 (θ − θ  )||b(r ) with b(r ) growing to infinity as r grows to infinity. Panov and Spokoiny [30] have a stronger condition however, it is possible to show that this condition is sufficient to bound the tail of the posterior distribution on ||D 1/2 (θ − θ  )|| > r for large r . 4. Global stochastic condition: for any r > 0 there exists g(r ) > 0 such that for any |λ| ≤ g(r ),   T γ ∇ζ (θ ) 2 ≤ eλ /2 , sup E exp λ sup p ||V γ || 1/2  ||D (θ−θ )||≤r γ ∈R where the expectation is taken with respect to Y ∼ P0 . Under these assumptions, [30] formulated a non-asymptotic version of Bernstein– von Mises theorem for misspecified models. Theorem 2 (Theorem 1 in [30]) Suppose that the assumptions stated in this section hold, and consider a flat prior π(θ ) = 1 for all θ ∈ . Then, for any measurable A, with probability at least 1 − 4e−x , ˜ |P( D˜ 1/2 (θ − θ  − ,n ) ∈ A | Y) − N (A; 0, I p )| ≤ e(r ) − 1,

(8)

˜ ) = r 2 (δ(r ) + ˆ ,n = D˜ −1 ∇ζ (θ  ) and (r where θˆ is the MLE of θ , D˜ = D(θ), −1/2 −x , x)) + 8e . 6ωz( D˜ The authors also prove a similar result with posterior mean and posterior precision ˜ matrix instead of θ  + ,n and D. In their Theorem 2, the authors extend this result to a Gaussian prior θ ∼ N (0, G −2 ) such that ||G 2 D −1 || ≤  < 1/2, trace[(G 2 D −1 )2 ] ≤ δ 2 , ||(D + G 2 )−1 G 2 θ  || ≤ β. Then, the posterior is approximated by the Gaussian distribution, namely, Eq. (8) −x holds with the upper bound replaced by e2(r )+8e (1 + τ ) + e−x − 1 where 1/2  . τ = 0.5 (1 + )(3β + z(D −1/2 , x))2 + δ 2

366

N. Bochkina

In particular, the authors show that for a high dimensional parameter, the upper bound is small is x is large and p 3 /n is small. Spokoiny and Panov [37] have shown a similar result under assumption that the stochastic term is a constant, with posterior distribution centered either at the MLE θˆ or at the posterior mean. The authors also apply these results to nonparametric problems. Chib et al. [11] study Bayesian exponentially-tilted empirical likelihood posterior distributions, which are defined by moment conditions rather than by a likelihood or loss function. The authors show the BvM result for well-specified and misspecified models under fairly general conditions.

4.4 Example: Misspecified Linear Model Now we consider the example of a misspecified model given in [17] where the Bayesian approach considered by the authors fails, and we apply theory of [30] to analyse it. In particular, we check whether the small bias condition holds, i.e. whether θπ defined by (7) is close to θ  defined by (1). The authors considered the following linear model Yi = β0 +

p 

β j X i j + i , i ∼ N (0, σ 2 ), i = 1, . . . , n

(9)

j=1

independently, with a conjugate prior distribution on its parameters: β = (β0 , β1 , . . . , β p )T ∼ N (0, c−1 σ 2 G), τ := σ −2 ∼ (a, b)

(10)

independently, with G = I p+1 . The values of the hyperparameters were chosen to be c = 1, a = 1, b = 40. The true distribution of the data, i.e. the data generating mechanism, is as follows: X i j ∼ N (0, 1), Z i ∼ Ber n(0.5), ⎛ ⎞ p  2 Yi j | Z i ∼ N ⎝ βtr ue, j X i j , σtr ue (1 + Z i )⎠ ,

(11)

j=1

independently for i = 1, . . . , n and j = 1, . . . , p. The true values were taken as n = 100, p = 40, σtr2 ue = 1/40, βtr ue, j = 0.1 for j = 1, 2, 3, 4 and βtr ue, j = 0 otherwise. Note that βtr ue,0 = 0, i.e. there is no intercept, and that the variance of Yi given X is σtr2 ue (0.5 · 1 + 0.5 · 2) = 1.5σtr2 ue . Now we work out the best parameter for this model defined by (1) and the point at which posterior distribution concentrates asymptotically (7), and whether they are close or not.

Bernstein–von Mises Theorem and Misspecified Models: A Review

367

The log likelihood for the considered model is L(β, τ ) = −0.5τ (Y − Xβ)T (Y − Xβ) + 0.5n log τ, and logarithm of the posterior distribution of θ = (β, τ ) is L π (β, τ ) = L(β, τ ) − 0.5cτβ T β + 0.5 p log τ + (a − 1) log τ − bτ. Negative Kullback–Leibler distance K L( pθ , p0 ) (up to an additive constant independent of unknown parameters), is 3 EL(β, τ ) = 0.5n log τ − 0.5τ ||β − βtr ue ||22 − nσtr2 ue τ 4

(12)

where the expectation is taken under the true model, using E(Y T Y | X ) = βtrT ue X T Xβtr ue + 1.5nσtr2 ue and EX T X = I p+1 . Then, the best parameter, i.e. the parameter maximising expression (12) is β  = βtr ue , τ  −1 = σ  2 =

3 2 σ , 2 tr ue

(13)

as stated in [17]. Now we study the value of the parameters where the posterior concentrates which minimises 3 EL π (β, τ ) = (0.5(n + p) + a − 1) log τ − τ b − nσtr2 ue τ 4 −0.5τ (1/c + 1)−1 ||βtr ue ||22

(14) (15)

−0.5(1 + c)τ ||β − βtr ue /(1 + c)||22 and which are equal to βπ = (1 + c)−1 βtr ue , τπ −1 = σπ 2 =

1.5σtr2 ue

(16) −1 −1

+ 2b/n + c(1 + c) n 1 + p/n + 2(a − 1)/n

||βtr ue ||22

.

Hence, (βπ , σπ 2 ) is close to (β  , σ  2 ) if the following conditions hold: 1. 2. 3. 4. 5.

c = o(1) b/n = o(σtr2 ue ) c(1 + c)−1 n −1 ||βtr ue ||22 = o(σtr2 ue ) p/n = o(1) (a − 1)/n = o(1).

368

N. Bochkina

The choice of the parameters given in [17] is the following: 1. 2. 3. 4. 5.

c=1 b/n = 0.4, σtr2 ue = 0.025 c(1 + c)−1 n −1 ||βtr ue ||22 = 0.0002, p/n = 0.5 (a − 1)/n = 0

i.e. conditions 3 and 5 hold whereas the remaining conditions do not hold. So, it is possible to tune hyperparameters so that all conditions, except condition 4, hold, e.g. by taking small b and c leading to weakly informative priors with large variances. Condition p/n = o(1) is due to the choice of the conjugate prior for β with its prior variance proportional to the variance of the noise; if the prior variance of β does not depend on the noise variance, then this condition is not necessary. For instance, it is easy to show using the same technique, that considering a non-conjugate prior β ∼ N (0, τ0−1 0 ) and τ ∼ (a, b), with ||0 || = 1, under the following conditions 1. τ0 = o(1) 2. b/n = o(σtr2 ue ) 3. (a − 1)/n = o(1) leads to θπ being close to θ  . These conditions are satisfied e.g. with small τ0 , small b and a = 1, provided σtr2 ue is not much smaller than 1/n.

5 “Optimising” Bayesian Inference Under Model Misspecification 5.1 Asymptotic Risk of Parameter Estimation Under a Misspecified Model Müller [28] showed that the asymptotic frequentist risk associated with misspecified Bayesian estimators is inferior to that of an artificial posterior which is normally distributed, centred at the maximum likelihood estimator and with the sandwich covariance matrix. This provided theoretical justification for constructing a (quasi-) posterior distribution based on a misspecified model such that its posterior variance is approximately the sandwich covariance. Several such approaches have been used that we discuss below.

Bernstein–von Mises Theorem and Misspecified Models: A Review

369

5.2 Composite Likelihoods Composite likelihoods (also known as pseudo-likelihoods) have been studied by [25], and they are defined as follows. Denote by {A1 , . . . , A K } a set of marginal or conditional events with associated likelihoods L k (θ ; y) ∝ P(y ∈ Ak ; θ ). Then, a composite likelihood is the weighted product L c (θ ; y) =

K 

[L k (θ ; y)]wk ,

k=1

where wk are nonnegative weights to be chosen. It is often used to simplify the model for dependence structure in time series and in spatial models, with one of the most famous examples given by [2] of approximating spatial dependence by the product of conditional densities of a single observation given its neighbours. Selection of unequal weights to improve efficiency in the context of particular applications and a review of frequentist inference for this approach is discussed by [41]. For the discussion of connection of the choice of weights with the Bayesian inference under empirical likelihood see [34]. A typical example is when L k (θ ; y) is the marginal likelihood for yk (with K = n). Unless more information is available, it is generally difficult to estimate individual weights from the sample however this formulation gave rise to a number of approaches with randomised weights (w1 , . . . , wn ). The idea of composite likelihood is used to sample the powers (weights of the contributions of individual samples wi ) from some probability distribution. The typical choice of a joint Dirichlet distribution for the weights corresponds to Bayesian bootstrap and is discussed in Sect. 5.4. Other choices of weights and their effect on the corresponding posterior inference are discussed in [44]. As far as I am aware, currently there are no BvM results for other randomisation schemes, apart from a joint Dirichlet distribution. Choosing the same weight wk = w for all k leads to fractional or tempered posterior distributions discussed in Sect. 5.3.

5.3 Generalised (Gibbs) Posterior Distribution 5.3.1

Definition and Interpretation

Let (y, θ ) be some loss function. Then, generalised posterior distribution is given by exp{−η (y, θ )}π(θ ) (17) pη (θ | y1 , . . . yn ) =  exp{−η (y, θ )π(θ )dθ where η is the parameter that adjusts for misspecification. This parameter is called the learning rate (in machine learning), inverse temperature. Taking (y, θ ) =

370

N. Bochkina

n

i=1 i (yi , θ ) for some loss i associated with observation yi given parameter θ corresponds to the assumption that observations yi are independent. Taking (y, θ ) = − log p(y | θ ) and η = 1 leads to the classical Bayesian inference. Different functions may be used for different types of model misspecification and different inference purposes, e.g. Huber function for robust parameter estimation. This is also known as a Gibbs posterior in Bayesian literature, exponential weighting in frequentist literature [12], typically with i = ||yi − θ ||22 , and it is used as a model for PAC-Bayesian approach in machine learning. Lately it has also been referred to as a fractional posterior and as a tempered posterior. Grünwald and van Ommen [17] argue that if there exists η¯ ≤ 1 such that for all 0 < η ≤ η¯    p(y | θ ) η dy ≤ 1 for all θ ∈ , p0 (y) p(y | θ  )

then the generalised posterior with η < η¯ is asymptotically consistent, i.e. converges to the point mass at pθ  . For η such that condition (6) holds, the authors interpret the generalised posterior as a posterior distribution based on the reweighted true likelihood:   p(y | θ ) η p˜ η,θ  (y | θ ) = p0 (y) p(y | θ  ) which is interpreted as a density on the probability space augmented by an unobserved event if this function integrates to a positive value less than 1. This is due to the following: if this density was used as a density of y given θ to construct the likelihood, then this would correspond to a correctly specified model, since the KL distance between p˜ η,θ  (y | θ ) and p0 is minimised at θ  and p˜ η,θ  (· | θ  ) = p0 (·), and the corresponding posterior would be a proper posterior and it coincides with the generalised posterior. This is done for interpretation only, as it is not possible to use p˜ η,θ  for inference in practice due to unknown p0 and θ  . In the following section we discuss known results about concentration of the Gibbs posterior distribution.

5.3.2

Concentration and Posterior Contraction Rate

Bhattacharya et al. [3] study the posterior contraction rate for a particular type of semi-metric d that is matched to the considered misspecified model. They consider generalised Bayesian approach pη (θ | y) = 

[ p(y | θ )]η π(θ ) [ p(y | θ )]η π(θ )dθ

with p(y | θ ) being a density of a probability measure with respect to some measure μ, and the semi-metric based on Renyi divergence with matching index η:

Bernstein–von Mises Theorem and Misspecified Models: A Review

Dη(n) (θ, θ  ) = −

371

1  log A(n) η (θ, θ ) 1−η

 where A(n) η (θ, θ ) is the integral defined in (6) for all n observation y = (y1 , . . . , yn ):  A(n) η (θ, θ )

 = E P0

p(Y | θ ) p(Y | θ  )

η 

.

The authors show that since p(y | θ ) is a density of a probability measure, condition  A(n) η (θ, θ ) ≤ 1 (equation (6)) holds. Also, the authors show that for η → 1−, the generalised posterior converges to the corresponding posterior distribution. Therefore, their approach is not shown to apply to so called Gibbs posteriors where other loss functions (rather than a negative log density) can be used to specify the (pseudo)likelihood. Following [17], one may argue that in the setting considered by [3], it is not necessary to use η < 1 to adjust the inference to achieve posterior consistency (it may be necessary e.g. to achieve asymptotic efficiency). Under the assumptions of  [3], A(n) η (θ, θ ) ≤ 1 for all θ and η ∈ (0, 1), and, due to convergence of the generalised posterior to the posterior as η → 1−, the posterior distribution (with η = 1) is asymptotically consistent. Syring and Martin [39] study the concentration rate of a Gibbs posterior (17) in (semi-metric) d under the following fairly general assumptions. The authors focus on iid models and iid true distribution, and discuss a setting where Yi ’s are independent but not necessarily identically distributed. Condition 1 There exist η, ¯ K , r > 0 such that for all η ∈ (0, η) ¯ and for all sufficiently small δ > 0, for θ ∈ , d(θ, θ  ) > δ ⇒ log E exp(−η( (Y1 ; θ ) − (Y1 ; θ  ; Y1 ))) < −K ηδr . KL neighbourhood condition. For a sequence (εn ) such that εn → 0 and nεnr → ∞ as n → ∞, there exists C1 ∈ (0, ∞) such that for all n large enough, log (B K L (εr )) > −C1 nεnr , where B K L (R) = θ ∈  : −E[ (Y; θ  ) − (Y; θ )] ≤ R

Var[ (Y; θ  ) − (Y; θ )] ≤ R . Theorem 3 (Theorem 3.2, [39]) Under Condition 1 and KL neighbourhood condition, for a fixed η, the Gibbs posterior distribution defined by (17) satisfies (4) with asymptotic concentration rate εn .

372

N. Bochkina

The authors show that this also holds for ηn → 0 as long as ηn nεnr → ∞ and in the KL neighbourhood condition C1 nεnr is replaced by C1 ηn nεnr . The also how that this holds for a random ηˆ as long as with high probability c−1 ηn ≤ ηˆ ≤ cηn for ηn → 0 and some c ≥ 1. The authors also discuss that condition (6) can be relaxed to hold on n = {θ ∈  : ||θ || ≤ n } for a sequence (n ) increasing to infinity, under stronger conditions (see Theorem 4.1 in [39]). The authors also discuss that conditions of this theorem are related to the entropy condition of [22] and verify this condition for convex as a function of θ . In the iid setting, Condition 1 combines several conditions of [35] for a single observation Y1 since log E exp(−η( (Y1 ; θ ) − (Y1 ; θ  ))) = −η(E (Y1 ; θ ) − E (Y1 ; θ  )) + log E exp(η(ζ1 (θ ) − ζ1 (θ  ))),

(18)

where ζ1 (θ ) = E (Y1 , θ ) − (Y1 , θ ), except that the authors assume that this condition holds for all θ ∈  whereas in [35] the conditions are split into local (in a neighbourhood of θ  ) and global (for all θ ∈ ), with the global conditions being weaker. As the authors discuss, their Condition 1 can hold if ζ1 (θ ) − ζ1 (θ  ) has subGaussian tails, and if for η small enough the first term (which is negative) in (18) is sufficiently greater in absolute value than the second term. More specifically, assume that there exist r, K 1 > 0 such that −[E (Y1 ; θ ) − E (Y1 ; θ  )] < −K 1 [d(θ, θ  )]r , θ, θ  ∈ , and that the sub-Gaussian tail condition holds with some b > 0 log E exp(η(ζ1 (θ ) − ζ1 (θ  ))) ≤ bη2 ||θ − θ  ||22 /2 which can be verified through Conditions 1 and 2 [35], and there exist K 2 > 0 such that ||θ − θ  ||22 ≤ K 2 [d(θ, θ  )]r for all θ, θ  ∈ . Then, log E exp(−η( (Y1 ; θ ) − (Y1 ; θ  ))) ≤ −η[d(θ, θ  )]r [K 1 − bηK 2 /2] < −K η[d(θ, θ  )]r if η < η¯ = min(1 + o(1), 2K 1 /(bK 2 )) and K = K 1 − bηK ¯ 2 /2. Since the inequality η < η¯ is strict, as long as 2K 1 /(bK 2 ) > 1, we can take η = 1. The authors suggest that case r = 2 corresponds to regular problems, i.e. where ∇E (θ  ) = 0 and ∇ 2 E (θ  ) is positive and continuous in the neighbourhood of θ  , and the sub-Gaussian tails condition, whereas nonregular problems may require other values of r , e.g. r = 1 if θ  is on the boundary of the parameter space [8], or if there is a finite jump at θ [10]. We illustrate the latter on an example.

Bernstein–von Mises Theorem and Misspecified Models: A Review

373

Example 2 Now we check if Condition 1 holds for a density with jump. Consider a density p(y|θ ) that is 0 for y < θ and the right hand side limit lim y→θ+ p(y|θ ) = λ > 0, for instance with p(y|θ ) = e−(y−θ) for y > θ , and the true density p0 (y) such that p0 (y) = 0 for y < θ0 and lim y→θ0 + p0 (y) = c0 > 0. Then,  E log p(Y | θ ) = −

∞ θ0

(y − θ ) p0 (y)dy + log(0)I (θ > θ0 )

= θ − EY + log(0)I (θ > θ0 ) which is minimised at θ  = θ0 . This implies that for θ ∈ 0 = (−∞, θ  ], E log p(Y | θ ) − E log p(Y | θ  ) = θ − θ  So if d(x, y) = |x − y| then this holds with K 1 = 1 and r = 1. The stochastic term is ζ (θ ) = Y − EY for θ ∈ 0 , and for θ ∈ 0 log E exp(η(ζ (θ ) − ζ (θ  ))) = 0, and Condition 1 holds for θ ∈ (−∞, θ  ] with K = 1 and r = 1.

5.3.3

Estimation of η

There are various approaches to estimation of η that lead to the posterior distribution concentrating at θ  , e.g. [18, 32]; see a review [46]. Here I will give a very brief discussion. There are two key issues: firstly, this parameter models misspecification so it cannot be estimated in a usual Bayesian way (e.g. by putting a hyperprior), and secondly, a relevant estimator depends on the aim of the inference. 1. When predictive inference is of interest, the Safe-Bayes estimator of [16] further explored in [18], may be appropriate:  ηˆ = arg min −

n 

 E log p(yi | θ ) pη (θ | y1:(i−1) )dθ

i=1

where pη (θ | y1:(i−1) ) is the generalised posterior distribution with parameter η based on i − 1 samples (if i = 1 then it is the prior distribution). 2. Now we discuss estimators of η when estimation of θ is of interest, in particular frequentist coverage of credible posterior regions Cα such that Pη (θ ∈ Cα | Y) = 1 − α. Under the conditions of Gaussian approximation of the posterior, if asymptotic coverage of asymptotic credible balls Cα is of interest, then it is sufficient to check that asymptotic credible balls ˆ T ≤ χ p2 (α) ˆ T Dη (θ − θ) (θ − θ)

374

N. Bochkina

are inside the frequentist confidence balls with the sandwich covariance (θ − θˆ )T DV −1 D(θ − θˆ )T ≤ χ p2 (α), i.e. it is sufficient to check that the largest eigenvalue of Dη = ηD is not smaller than the largest eigenvalue of DV −1 D. Therefore, η is chosen so that the largest eigenvalue of the posterior precision matrix ηD matches that largest eigenvalue of the sandwich precision matrix DV −1 D, i.e. the “oracle” value is ||DV −1 D|| η = , ||D|| and it can be estimated if estimates of V = V (θ  ) and D = D(θ  ) are available. Holmes and Walker [19] used the Fisher information number to calibrate this parameter: trace(DV −1 D) . η = trace(D) As the authors say, it is the sum of the marginal Fisher information for each dimension, which can be used as a summary for the amount of information in a sample about parameters. Pauli et al. [31] propose ηˆ = trace(V −1 (θˆc )D(θˆc ))/ dim(θ ) which asymptotically is the average of the mutual eigenvalues of D(θˆc )) with respect to V (θˆc ). Remark | n 1 Suppose that conditions of [30] hold for some pseudo likelihood p(y p(yi | θ ) and θ ∈  ⊆ R. Then, these conditions hold for p(y | θ )η = θ) = i=1 n η 2 i=1 p(yi | θ ) with Vη = η V and Dη = ηD, r η = r η and bη such that bη (r η ) = b(r ). Therefore, the posterior variance Dη−1 coincides with the sandwich variance if ηD = D 2 V −1 , i.e. if η = DV −1 . See also [13] in the context of linear regression.

5.4 Nonparametric Model for Uncertainty in p0 and Bootstrap Posterior 5.4.1

Nonparametric Model and Connection to Bootstrap

Lyddon et al. [26] proposed to take into the account uncertainty about the parametric model by modelling the distribution of the data nonparametrically, e.g. using a Dirichlet process prior with the base model being the considered parametric model: yi ∼ F, i = 1, . . . , F ∼ D P(α, p(· | θ )).

Bernstein–von Mises Theorem and Misspecified Models: A Review

375

For iid observations yi , when α → 0, this approach corresponds to Bayesian bootstrap [33], with the following sampling of (θ ( j) ) Bj=1 from the bootstrap posterior: θ ( j) = θ (F j ) with F j (x) =

n 

α ji δ yi (x)

(19)

and α j = (α j1 , . . . α jn ) ∼ Dirichlet (1, . . . , 1),

(20)

j=1



where θ (F) = arg min θ∈

(θ, y)d F(y)

(21)

with (θ, y) = − log p(y | θ ). More generally, for a possibly different loss function, the authors refer to this as the loss-likelihood (LL) bootstrap approach. The authors argue that using this procedure induces a prior distribution on θ defined as P(θ ∈ A) = P(F : θ (F) ∈ A).

5.4.2

Asymptotic Normality of Bootstrap Posterior

Lyddon et al. [26] show that the sample from the loss-likelihood bootstrap has asymptotically normal distribution with sandwich covariance matrix, weakly, under the following assumptions. 1.  is a compact and convex subset of a p-dimensional Euclidean space. 2. The loss function :  × R → R is a measurable bounded from below function, with  (θ, y) p0 (y)dy < ∞ for all θ ∈  3. (Identifiability). There exists a unique minimizing parameter value θ  = arg min θ∈

 (θ, y) p0 (y)dy,

and for all δ > 0 where exists ε > 0 such that lim inf P( sup n

|θ−θ  |>δ

n 1 [ (θ, yi ) − (θ  , yi )] > ε) = 1 n i=1

4. Smoothness of loss: there exists an open ball B containing θ  such that 2 (θ, Y )|] < ∞ E P0 [|∇ Ikk (θ, Y )|] < ∞ and E[|∇ j (θ, Y )∇km

for k = 1, 2, 3 and for all corresponding indices Ik , i.e. I1 ∈ {1 : p}, I2 ∈ {1 : p}2 , ( j, k, m) = I3 ∈ {1 : p}3 .

376

N. Bochkina

5. For θ ∈ B, the corresponding information matrices V (θ ) and D(θ ) are positive definite with all elements being finite, where V (θ ) = D(θ )



∇ (θ, y)∇ T (θ, y) p0 (y)dy,  = ∇ 2 (θ, y) p0 (y)dy.

Theorem 4 (Theorem 1 in [26]) Let θ˜n be a loss-likelihood bootstrap sample of a parameter defined by (19) and (21) with loss function , given n iid observations (y1 , . . . , yn ), and let PL L be its probability measure. Under the above assumptions, for any Borel set A ∈ R p , as n → ∞, PL L (n 1/2 (θ˜n − θˆn ∈ A)) → P(Z ∈ A) where Z ∼ N p (0, DV −1 D), θˆn = arg minθ∈

1 n

n i=1

(θ, yi ) and

V = V (θ  ), D = D(θ  , y). Therefore, the inference approach based on the loss-likelihood bootstrap is asymptotically efficient in the case the true distribution of the observations is unknown. Strictly speaking, this is not a Bernstein–von Mises theorem, since the convergence is not in the total variation distance, and hence it does not guarantee approximation of (θ ∈ A | Y) by the corresponding Gaussian probabilities for all Borel sets A. Also, assumption of compactness of  is not present in other results on posterior concentration, so it should be possible to relax this assumption. Another interesting problem is how to modify this approach to take into the account a given a prior π that results in coherent and efficient inference about parameter θ . Newton et al. [29] proposed such a solution, by replacing the loss function in the optimisation problem (21) by the loss function penalised by negative log prior, however for their choice of weights, the authors give a heuristic argument that their method approximates the target posterior with posterior covariance D −1 rather than with the sandwich covariance.

5.4.3

Other Bootstrap-Based Approaches

Another approach is called bagged posterior or “BayesBag” which applies bagging proposed by [9] to the Bayesian posterior [42]. The idea is to select subsets of data as in bootstrap, compute posterior distribution for each of these subsets of data and average these posteriors. Formally, the bagged posterior is defined by p Bayes Bag (θ | Y) =

1  π(θ | Y(i,N ) ) |I | Y ∈I (i,N )

Bernstein–von Mises Theorem and Misspecified Models: A Review

377

of the original data Y = (Y1 , . . . , Yn ) and bootstrap data sets Y(i,N ) of size N as the observed data. Huggins and Miller [20] show that under a range of conditions, for √ iid true distribution of the data and iid model, bagged posterior distribution of n(θ − E(θ | Y)) converges weakly to a Gaussian distribution centered at 0 with covariance matrix D −1 /c + D −1 V D −1 /c where c = lim(N /n). Hence, this approach does better than the usual posterior distribution with e.g. N = n − n 0 for some small finite constant n 0 leading to c = 1, however it is still not efficient.

5.5 Curvature Adjustment The generalised posterior approach uses a single parameter η to adjust for model misspecification. In general, it is possible to use this approach to obtain variance adjustment—and hence asymptotically optimal and valid posterior inference—only for one-dimensional parameter θ . In the case of higher dimensions, [32] proposed to use curvature adjustment in the following way. For a possibly misspecified parametric family { p(˙|θ ), θ ∈ } and prior π(θ ), consider the following family of posterior distributions: π A (θ |y) = p(Aθ | y) ∝ p(y | Aθ )π(Aθ ). Then, the idea is to find an estimator of the“oracle” matrix A in this class of admissible transforms determined by the condition that the posterior variance of Aθ is optimal, i.e. under the condition A T D −1 A = D −1 V D −1 in the case the true parametric model is unknown and V = V (θ  ) and V = V (θ  ) defined by (2), and under condition A T D −1 A = Vtr−1ue if the true parametric model is known with Vtr ue being the Fisher information under the true parametric model. For the case of the unknown parametric model, D can be estimated by the posterior precision matrix of θ , and estimation of V is usually more challenging. Stoehr and Friel [38] applied an affine version of the transform, i.e. they considered p(Aθ + b | y) misspecified models with known true parametric model, estimating b and A so that the posterior mean and the posterior variance of this distribution coincide with the posterior mean and the posterior variance under the posterior distribution with the true parametric model.

378

N. Bochkina

6 Discussion and Open Questions The approach of [30, 37] allows to address numerically the approximation properties of misspecified Bayesian inference and to verify whether it is close to being efficient, or whether a further adjustment is needed. While the authors have the assumption of a flat or Gaussian prior, for many model it is fairly straightforward to extend this to a larger class of continuous priors, in some cases with a continuous second derivative of the log likelihood. There are many other interesting aspects of inference under model misspecification that are not considered here, such as optimality of predictive inference, model selection, etc. It would be interesting to explore the connection between PAC-Bayesian inequalities and the conditional distribution p(θ | y) defined as the solution of the optimisation problem (3). Other interesting approaches include BvM for Variational Bayes under model misspecification [43], BvM for median and quantiles under classical and Gibbs posterior [4, 5]. Interestingly, [? ] show that Bayesian neural networks show inconsistency similar to that discussed in [17], applying variational Bayes leads to BNN becoming consistent; it would be interesting to study whether it is possible to achieve asymptotic efficiency. Another version of robust Bayes-like estimation is proposed by [1] that does not involve Kullback–Leibler distance but is based only on Hellinger distance between the true distribution and the parametric family. Construction (asymptotically) efficient more general Bayesian inference under model misspecification (which is also computationally tractable) is a very active research area, with several promising solutions such as bootstrap posterior and curvature adjustment, however there is still no general unifying framework to encompass these approaches or to provide a coherent general framework. Fractional posterior allows a potentially simpler procedure for model correction which involves a single tuning parameter even if the parameter is multivariate which may result in conservative inference which can be sufficient for some problems but it is unlikely to be efficient in general. Linear curvature adjustment appear to work in practice and it is applicable to the models with no independence structure but there is no decision— theoretic justification for this is available yet; such justification is likely to involve geometry of the model space and its local linear adjustment. The open question in bootstrap-based posterior inference is the use of a given prior and its extension to data without independence structure which is likely to come from its Bayesian nonparametric interpretation. Knoblauch et al. [24] proposed an approach combining generalised variational inference, PAC-Bayes and other approaches into a single principled framework; they give conditions for consistency of their approach but not for efficiency. Fong et al. [15] propose a novel view to constructing a generalised posterior distribution, so it would be interesting to study its efficiency. Acknowledgements This review was in part motivated by the discussion of the author with Peter Grünwald, Pierre Jacob and Jeffrey Miller during a Research in Groups meeting sponsored by the International Centre for Mathematical Sciences in Edinburgh, UK.

Bernstein–von Mises Theorem and Misspecified Models: A Review

379

References 1. Baraud, Y., Birgé, L.: Robust Bayes-like estimation: Rho-Bayes estimation. Ann. Stat. 48(6), 3699–3720, 12 2020 2. Besag, J.: On the statistical analysis of dirty pictures (with discussion). J. Roy. Statist. Soc. B 48, 259–302 (1986) 3. Bhattacharya, A., Pati, D., Yang, Y.: Bayesian fractional posteriors. Ann. Statist. 47(1), 39–66, 02 2019 4. Bhattacharya, I., Ghosal, S.: Bayesian inference on multivariate medians and quantiles. Statistica Sinica (2019) 5. Bhattacharya, I., Martin, R.: Gibbs posterior inference on multivariate quantiles. J. Stat. Plann. Infer. 218, 106–121 (2022) 6. Bissiri, P.G., Holmes, C.C., Walker, S.G.: A general framework for updating belief distributions. J. R. Statist. Soc.: Ser. B (Statistical Methodology) (2016) 7. Bochkina, N.A., Green, P.J.: The Bernstein–von Mises theorem and nonregular models. Ann. Statist. 42(5), 1850–1878, 10 2014 8. Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996) 9. Chernozhukov, V., Hong, H.: Likelihood estimation and inference in a class of nonregular econometric models. Econometrica 72, 1445–1480 (2004) 10. Chib, S., Shin, M., Simoni, A.: Bayesian estimation and comparison of moment condition models. J. Am. Statist. Assoc. 113(524), 1656–1668 (2018) 11. Dalalyan, A., Tsybakov, A.B.: Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72, 39–61 (2008) 12. de Heide, R., Kirichenko, A., Mehta, N., Grünwald, P.: Safe-Bayesian generalized linear regression. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2623 –2633. PMLR (2020) 13. Diong, M.L., Chaumette, E., Vincent, F.: On the efficiency of maximum-likelihood estimators of misspecified models. In: 25th European Signal Processing Conference (EUSIPCO) (2017) 14. Fong, E., Holmes, C., Walker, S.G.: Martingale posterior distributions. J. R. Statist. Soc.: Ser. B (Statistical Methodology) (2023) 15. Grünwald, P.: The safe Bayesian. In: International Conference on Algorithmic Learning Theory, pp. 169–183. Springer (2012) 16. Grünwald, P., van Ommen, T.: Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12(4), 1069–1103, 12 2017 17. Grünwald, P.D., Mehta, N.A.: Fast rates for general unbounded loss functions: from ERM to generalized Bayes. J. Mach. Learn. Res. 21(56), 1–80 (2020) 18. Holmes, C.C., Walker, S.G.: Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104(2), 497–503 (2017) 19. Huggins, J., Miller, J.: Reproducible model selection using bagged posteriors. Bayesian. Anal. 18(1), 79–104 (2023) 20. Ibragimov, I., Hasminskij, R.: Statistical Estimation: Asymptotic Theory. Springer (1981) 21. Kleijn, B.J.K., van der Vaart, A.W.: Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34(2), 837–877, 04 2006 22. Kleijn, B.J.K., van der Vaart, A.W.: The Bernstein–von-Mises theorem under misspecification. Electron. J. Statist. 6, 354–381 (2012) 23. Knoblauch, J., Jewson, J., Damoulas, T.: Generalized variational inference: Three arguments for deriving new posteriors (2021). arXiv:1904.02063 24. Lindsay, B.: Composite likelihood methods. Contemp. Math. 80, 221–239 (1988) 25. Lyddon, S.P., Holmes, C.C., Walker, S.G.: General Bayesian updating and the loss-likelihood bootstrap. Biometrika 106, 465–478 (2019) 26. Miller, J.W., Dunson, D.B.: Robust Bayesian inference via coarsening. J. Am. Statist. 
Assoc. 114(527), 1113–1125 (2019)

380

N. Bochkina

27. Müller, U.K.: Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. Econometrica 81(5), 1805–1849 (2013) 28. Newton, M.A., Polson, N.G., Xu, J.: Weighted Bayesian bootstrap for scalable posterior distributions. Canadian J. Statist. (2021) 29. Panov, M., Spokoiny, V.: Finite sample Bernstein–von Mises theorem for semiparametric problems. Bayesian Anal. 10(3), 665–710, 09 2015 30. Pauli, F., Racugno, W., Ventura, L.: Bayesian composite marginal likelihoods. Statistica Sinica 21, 149–164 (2012) 31. Ribatet, M., Cooley, D., Davison, A.C.: Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica 22, 813–845 (2012) 32. Rubin, D.B.: The Bayesian bootstrap. Ann. Statist. 9, 130–134 (1981) 33. Schennach, S.M.: Bayesian exponentially tilted empirical likelihood. Biometrika 92(1), 31–46 (2005) 34. Spokoiny, V.: Parametric estimation. Finite sample theory. Ann. Statist. 40(6), 2877–2909, 12 2012 35. Spokoiny, V.: Bayesian inference for nonlinear inverse problems (2020). arXiv:1912.12694 36. Spokoiny, V., Panov, M.: Accuracy of Gaussian approximation in nonparametric Bernstein–von Mises (2020). arXiv:1910.06028 37. Stoehr, J., Friel, N.: Calibration of conditional composite likelihood for Bayesian inference on Gibbs random fields. In: Lebanon, G., Vishwanathan, S.V.N. (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 921–929. PMLR (2015) 38. Syring, N., Martin, R.: Gibbs posterior concentration rates under sub-exponential type losses. Bernoulli 29(2), 1080–1108 (2023) 39. Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press (2000) 40. Varin, C., Reid, N., Firth, D.: An overview of composite likelihood methods. Statistica Sinica 21, 5–42 (2011) 41. Waddell, P.J., Kishino, H., Ota, R.: Very fast algorithms for evaluating the stability of ml and Bayesian phylogenetic trees from sequence data. In Genome Inf. 13, 82–92 (2002) 42. Wang, Y., Blei, D.M.: Variational Bayes under model misspecification. In: In Advances in Neural Information Processing Systems (2019) 43. Wang, Y., Kucukelbir, A., Blei, D.M.: Robust probabilistic modeling with Bayesian data reweighting. In: Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3646–3655. PMLR (2017) 44. White, H.: Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25 (1982) 45. Wu, P.-S., Martin, R.: A comparison of learning rate selection methods in generalized Bayesian inference (2020). arxiv:2012.11349 46. Zhang, Y., Nalisnick, E.: On the inconsistency of Bayesian inference for misspecified neural networks. In: Third Symposium on Advances in Approximate Bayesian Inference (2021)

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems Maxim Panov

Abstract We consider the problem of Bayesian semiparametric inference and aim to obtain an upper bound on the error of Gaussian approximation of the posterior distribution for the target parameter. This type of result can be seen as a nonasymptotic version of semiparametric Bernstein–von Mises (BvM). The provided bound is explicit in the dimension of the target parameter and in the dimension of sieve approximation of the full parameter. As a result, we can introduce the so-called critical dimension pn of the sieve approximation, the maximal dimension for which the BvM result remains valid. In various particular statistical models, we show the necessity of the condition “ pn2 q/n is small”, where q is the dimension of the target parameter and n is the sample size, for the BvM result to be valid under the general assumptions on the model. Keywords Prior · Posterior · Bayesian inference · Semiparametric · Critical dimension

1 Introduction The Bayesian approach is one of the main directions of modern mathematical statistics. It studies the so-called posterior distribution, i.e., the one obtained by correction of the prior distribution to account for the observed data. Bernstein–von Mises (BvM) theorem states that the posterior distribution is asymptotically close to the normal distribution with a mean close to the MLE and covariance matrix close to the inverse of Fisher information matrix. Thus, the BvM result gives theoretical grounds for the Bayesian computations of mean and covariance. It justifies the usage of elliptic credible sets based on the first two moments of the posterior. In this work we are going to quantify the error of the normal approximation of the posterior distribution for the finite sample size which is important for the practical applications of the M. Panov (B) Skolkovo Institute of Science and Technology, Moscow, Russia e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_11

381

382

M. Panov

approximation. The results of this work were partially announced without proofs in [33].

1.1 Problem Statement Let us denote by Y ∈ Rn the observed random data and by IP –their distribution. We assume that the unknown data distribution IP belongs to a given parametric family (IPφ ) : Y ∼ IP = IPφ ∗ ∈ (IPφ , φ ∈ Φ), where Φ is the parameter space and φ ∗ ∈ Φ denotes the true parameter value. In the present study, we focus on semiparametric problems where the full parameter φ has infinite dimension while the target of estimation is its low dimensional component θ ∗ = Π0 φ ∗ def

given by the mapping Π0 : Φ → IR q with q ∈ N being the dimension of target parameter. In what follows we consider the classical semiparametric setup with φ = (θ , ψ) , where θ is the target of estimation, and ψ = {ψ j }∞ j=1 ∈ 2 is called nuisance parameter. The more general case with θ = Π0 φ can also be studied in our framework but requires certain technical conditions on operator Π0 to hold (see, for example, [18]). In the real-world problems it is unrealistic to assume that true distribution belongs to the parametric family even if it is significantly rich. In this work, we allow the parametric model to be misspecified, i.e. the distribution IP might not belong to the family (IPφ , φ ∈ Φ) . Then the “true” value φ ∗ of parameter φ can be defined as φ ∗ = arg max IEL (φ), φ∈Φ

 def d IP where L (φ) = L (φ  Y ) = log dμφ (Y ) is the log-likelihood for the family (IPφ ) 0 with respect to some dominating measure μ0 . In misspecified case φ ∗ determines the best fit to IP among the distributions in the considered parametric family; see [16, 26, 27] and references therein. The target of estimation θ ∗ is still determined by the mapping Π0 via θ ∗ = Π0 φ ∗ . Semiparametric Bernstein–von Mises Theorem Let us assume that the prior measure Π on the parameter set Φ is given. Below we will focus on the analysis of the posterior measure on Φ which determines the posterior distribution  of φ for a given Y and can be obtained by normalization of the product exp L (φ) Π (dφ) . Let us assume that the prior measure Π has the density π(φ) with respect to the Lebesgue measure Φ : Π (dφ) = π(φ)dφ . Then this posterior measure can be written as

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

φ|Y

  exp L (φ) π(φ)dφ.



383

(1)

It is important to know that one can still write the formula (1) in the case when L (φ) does not equal to the true log-likelihood of the data distribution for any φ , i.e. the model is misspecified. In such a case, Eq. (1) defines so-called quasi-posterior distribution, see, for example, the work [16]. In this work, we study properties of the posterior distribution for the target parameter ϑ = Π0 φ given the data Y . Let us note that here and further in the text we will denote the target parameter by ϑ , when it accounts for the posterior random variable, and by θ in all other situations. In this case the Eq. (1) can be written as  ϑ|Y

  exp L (φ) π(φ)dφ,



(2)

H

where H is the subspace corresponding to the parameter η . Celebrated Bernstein– von Mises theorem (BvM) states that the posterior distribution is asymptotically normal. More specifically, in the semiparametric case the centering by some efficient estimate  θ of parameter θ ∗ (e.g., profile MLE) and normalization by square root of efficient Fisher information matrix D˘ e2 given by (4) ensures the convergence in total variation to the standard normal distribution: θ ) | Y → N (0, I q ) , D˘ e (ϑ −  where I q is an identity matrix of dimension q . Below we will show that the smoothness of log-likelihood L (φ) ensures the Gaussian approximation of the posterior measure. We will focus on the accuracy of such an approximation as a function of the dimensions p of full parameter and q of target parameter. We will focus on the case of the non-informative prior. The results for the smooth priors also follow, see [34]. We note that depending on the prior both in parametric models and, especially, in their non-parametric counterparts the effect of penalization can be significant even asymptotically, see [42]. Our goal is to prove that under the reasonable conditions the posterior distribution of the target parameter (2) is close to the normal distribution with properly selected mean and covariance even for the finite sample size. Also, we will describe the bounds on the sample size and dimension which ensure the validity of the BvM result.

1.2 Related Work The name of BvM result goes back to Sergey Bernstein who proved the asymptotic normality of posterior for Bernoulli distribution and general prior distribution [4], and Richard von Mises who extended it to multinomial distribution [32]. The classical version of BvM theorem is formulated for the standard parametric setup with fixed parametric model and large sample size (see the detailed review in [30] and [44]). However, in the modern statistical applications very complex mod-

384

M. Panov

els are essential, while the sample size is usually limited (see the detailed review of modern high dimensional statistics in [9]). Thus, there is a need to extend classical results to such non-classical situations. Let us note the works [7, 17, 19, 20] where the Bayesian models with growing parameter dimension are studied. Nonparametric and semiparametric models are especially challenging with some simple questions like consistency appearing to be a hard problem, see [2, 6, 37]. The normality of posterior measure is even more complicated, see [38]. Some results for particular semi-and nonparametric models can be found in [12, 24, 25, 31]. The profile likelihood approach to semiparametric BvM was studied in [14]. The BvM result for i.i.d. case with fairly general assumptions on the model was obtained in [5]. The work [11] studied the asymptotic normality of the posterior in semiparametric models where functional parameter is generated by Gaussian process. The work [35] proved the BvM result for linear functionals of density while in [13] it was extended to the case of more general models and functionals. The contraction properties of the posterior the data distribution from exponential family were studied in [36]. Finally, the work [3] studied the asymptotic normality of posterior distribution for exponential family distributions with growing parameter dimension. However, all these results are limited only to asymptotic case or to very particular model classes such as Gaussian, exponential family or i.i.d. models. In this work, we prove the variant of BvM theorem for fairly broad class of parametric and semiparametric models. The important feature of our analysis is focus on the fixed size of the data. In classical statistical theory one usually assumes the conditions of local asymptotic normality to hold, while the models have fixed parameter dimension and the sample size grows to infinity, see [30] and [23]. Let us also note the works by Gusev [21] and [22], where for the i.i.d. models asymptotic expansions of posterior densities, moments of random variables and Bayesian risks were considered. The second order expansions for Bayesian estimates in i.i.d. case were studied by Burnashev [10]. The construction of the theory for finite datasets is a challenging task as the majority of methods and approaches in classical statistics are developed for the asymptotic case assuming that sample size tends to infinity. Relatively small number of results is known for the finite samples, see, for example, [8]. The other peculiarity of our work is the possibility to account for model misspecification, i.e. the case when the true data distribution does not belong to the considered parametric family. The interested reader may look on one of few works on misspecified models in Bayesian context [26]. In this work, we consider a semiparametric problem where the dimension of the full parameter is large or infinite while target parameter has smaller dimension. The component of the full parameter vector orthogonal to the target parameter is called a nuisance parameter. In Bayesian approach, the goal of semiparametric estimation is the marginal posterior of the target parameter, see the work by Castillo [11]. The typical examples include estimation of functionals, estimation of function value at a point and simply estimation of a given sub-vector of the full parameter vector. 
Interestingly, in semiparametric BvM nuisance parameter impacts the results only via the projection of normalized the gradient of log-likelihood on the target subspace and via normalized Fisher information matrix, see [5]. Usually in this case the estimation

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

385

methods are based on the notion of the worst parametric submodel, see the book by Kosorok [28]). Moreover, it is assumed that there exists the estimation method for the nuisance parameter achieving certain convergence rate, see the work by Cheng and Kosorok [14]). Such assumptions significantly simplify the work with the problem but do not allow to extract the quantitative relations between the dimension of the parameter space and the information contained in the data. This paper is basing on finite sample BvM results of [40] and [34] for parametric and semiparametric cases respectively. The results of the current paper improve on [34] and were partially announced without proofs in [33]. The similar techniques were used to obtain a variant of nonparametric Bernstein-von Mises theorem in [42].

2 Sieve Approach The infinite dimensional parameter requires certain treatment to allow the solutions given finite data sample. In this work, we will use projection estimators [15] also known as a sieve approach. This approach assumes that instead of estimation of full parameter vector, we only fit first m most important components leaving others equal to 0 . In the Bayesian framework the analogous effect is achieved by assuming that the prior distribution has singular mass at point 0 for all the components of nuisance parameter except some finite number. Let η = {η j }mj=1 be a vector of first m components of the nuisance parameter φ . Further we will consider decompositions of the nuisance parameter φ = (η, κ) and of the full parameter φ = (θ, η, κ) . def We also denote L (θ , η, κ) = L (φ) . Then the “true” point φ ∗ can be written as φ ∗ = (θ ∗ , η∗ , κ ∗ ) . The approximation by projection estimator is equivalent to setting def κ ≡ 0 and we can define projection of the full parameter by υ = (θ , η) ∈ R p with p = q + m being its dimensionality. Let Υ be the domain of definition of υ and denote the υ -approximation of the log-likelihood as def

L(υ) = L (θ, η, 0). The “true” value of the projection on the (θ , η) -subspace is υ ∗ = (θ ∗s , η∗s ) = arg max IE L(υ). def

def

υ=(θ ,η)

The approximation of the nuisance parameter φ by m -dimensional parameter η results in bias which consists of two parts. Firstly, there is a difference between the true value of the target θ ∗ and its approximation θ ∗s . Secondly, the effective Fisher information matrix D˘ e2 differs from the one for the projection estimator D˘ 2 . Below we will focus on obtaining finite sample bounds on the error of the posterior approximation by corresponding normal distribution in the space of parameters (θ, η) . We will quantify the impact of the bias arising due to the use of the projection estimator.

386

M. Panov

3 Semiparametric Bernstein-von Mises Theorem 3.1 Parametric Estimation: Main Definitions We assume that the large positive constant x is fixed in a way to define on the sample space a set Ω(x) of dominating probability. We say that the random set Ω(x) is set of dominating probability if   IP Ω(x) ≥ 1 − Ce−x . One of the main elements of our construction is ( p × p) matrix D 2 which is called Fisher information matrix: def (3) D 2 = −∇ 2 IE L(υ ∗ ). Here and below we work under the conditions which are similar to the classical conditions on regular parametric family (see the book by Ibragimov and Khasminskii [23]). We implicitly assume that the log-likelihood function L(υ) is sufficiently smooth in υ , and denote by ∇ L(υ) its gradient, and by ∇ 2 IE L(υ) Hessian of mathematical expectation IE L(υ) . Also, define ξ = D −1 ∇ L(υ ∗ ). def

The definition of υ ∗ implies that ∇ IE L(υ ∗ ) = 0 and, consequently, IEξ = 0 . For (θ , η) -model consider the block representation of vector ∇ L(υ ∗ ) and matrix D 2 from (3):

2

∇θ L(υ ∗ ) D A 2 . = , D ∇ L(υ ∗ ) = ∇η L(υ ∗ ) A H 2 Define also (q × q) -matrices D˘ 2 and D˘ e2 , and random vector ξ˘ ∈ IR q : −1 def def  D02 = −∇ 2 IEL (φ ∗ ), D˘ e2 = Π0 D0−2 Π0 ,

(4)

˘ 2 def

D = D 2 − AH −2 A , def def ξ˘ = D˘ −1 ∇˘ θ L(υ ∗ ), where ∇˘ θ = ∇θ − AH −2 ∇η .

(5)

Matrix D˘ 2 of size q × q is usually called efficient Fisher information matrix. Further everywhere in the text we denote by a the Euclidean norm of vector a , and for matrix A we denote its operator norm by A . The order on square matrices is defined in a standard way, i.e. A > B means that matrix A − B is positive definite.

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

387

3.2 Conditions Our approach requires that certain number of conditions is satisfied. The list of condition is close to the one in [39], where one can find the discussion of the conditions and examples showing that theses conditions are not restrictive and are satisfied for the majority of classical statistical models such that linear regression and generalized linear models. The conditions can be split in local and global. Local conditions describe the behaviour of the process L(υ) on a local set υ ∈ Υ0 (r0 ) fro some fixed value r0 , where  def  Υ0 (r) = υ ∈ Υ : D(υ − υ ∗ ) ≤ r . (6) Let us note that below we implicitly assume that the point υ ∗ is an interior point of the set Υ . Global conditions should be satisfied on the whole set Υ . Let us define the stochastic component ζ (υ) of the log-likelihood L(υ) : def

ζ (υ) = L(υ) − IE L(υ). We start from the conditions on finite exponential moments. (ED0 ) There exist constant ν0 > 0 , positive definite ( p × p) -matrix V 2 satisfying Var{∇ζ (υ ∗ )} ≤ V 2 , and constant g0 > 0 such that ν 2 μ2 γ ∇ζ (υ ∗ ) ≤ 0 , ∀μ : |μ| ≤ g0 . sup log IE exp μ V γ 2 γ ∈IR p (ED2 ) There exist constants ν0 , ω > 0 and for each r > 0 constant g(r) > 0 such that for all υ ∈ Υ0 (r) it holds

2 μ γ 1 ∇ ζ (υ)γ 2 sup log IE exp ω Dγ 1 · Dγ 2 γ 1 ,γ 2 ∈IR p



ν02 μ2 , ∀μ : |μ| ≤ g(r). 2

Below we assume that g(r) ≥ g for some g > 0 and for all r > 0 . Define def

D 2 (υ) = −∇ 2 IE L(υ). Then D 2 = D 2 (υ ∗ ) . The following condition is required to ensure the smoothness of mathematical expectation of the log-likelihood IE L(υ) in a local set υ ∈ Υ0 (r0 ) : (L0 ) For any r ≤ r0 there exists a constant δ(r) > 0 such that on a set Υ0 (r) it holds:

−1 2

D D (υ)D −1 − I p ≤ δ(r).

388

M. Panov def

Let us introduce the notation L(υ, υ ∗ ) = L(υ) − L(υ ∗ ) for the logarithm of the log-likelihood ratio. The global identifiability condition reads as: (Lr) There exists a constant b > 0 such that for any r > 0 it holds −IE L(υ, υ ∗ ) ≥ br2 , r = D(υ − υ ∗ ) . We also need to introduce some identifiability conditions. We start by introducing the information and covariance matrices in the block form:



2 2 D A V B 2 , V . = D2 = A H 2 B Q2 Identifiability conditions in [39] ensure that matrix D 2 is positive definite and satisfies the condition a2 D 2 ≥ V 2 for some a > 0 . Here we rewrite it in the block form which is essential for the (θ , η) -model. (I) There exist constants a > 0 and 0 ≤ ν < 1 such that a2 D 2 ≥ V 2 , and

a2 H 2 ≥ Q 2 ,

a2 D 2 ≥ V 2

D −1 AH −2 A D −1 ≤ ν.

The quantity $\nu$ bounds the angle between the subspaces of the target and nuisance parameters in the tangent space. The regularity condition (I) ensures that this angle is not too small, i.e. that the target and nuisance parameters are identifiable. In particular, the matrix $\breve{D}^2$ is positive definite under the condition (I).

The following condition is required to bound the bias arising from the approximation of the infinite dimensional nuisance parameter $\psi$ by its finite dimensional component $\eta$:

(B) Let the values $\rho_s \le \tfrac{1}{2}$ and $b_s \le \tfrac{1}{2}$ be such that
$$\|\breve{D}_e(\theta^* - \theta^*_s)\|^2 \le \rho_s, \qquad \|I_q - \breve{D}_e \breve{D}^{-2} \breve{D}_e\| \le b_s.$$

For the validity of our results we will need the dimension $m$ of the projection to be fixed in such a way that the values $\rho_s$ and $b_s$ are sufficiently small. These values can be upper bounded under the usual smoothness conditions on $\phi$. An example of the computation of $\rho_s$ and $b_s$ can be found in [34].


Along with the local set $\Upsilon_0(r)$ let us define local sets for the target and nuisance parameters:
$$\Theta_0(r) \stackrel{\mathrm{def}}{=} \{\theta : \|\breve{D}(\theta - \theta^*)\| \le r\}, \qquad H_0(r) \stackrel{\mathrm{def}}{=} \{\breve{\eta} : \|H(\breve{\eta} - \breve{\eta}^*)\| \le r\},$$
where $\breve{\eta} = \eta + H^{-2} A^{\top} \theta$ and $\breve{\eta}^* = \eta^* + H^{-2} A^{\top} \theta^*$. The change of variables from $(\theta, \eta)$ to $(\theta, \breve{\eta})$ allows us to account for the interaction between the target and nuisance parameters. Let us note that the following equality holds: $\|D(\upsilon - \upsilon^*)\|^2 = \|\breve{D}(\theta - \theta^*)\|^2 + \|H(\breve{\eta} - \breve{\eta}^*)\|^2$. Thus, we obtain the representation
$$\Upsilon_0(r) = \Theta_0(r) \times H_0(r). \qquad (7)$$

Also, we define an anisotropic local set based on the representation (7): $\Upsilon_0(h, r) = \Theta_0(h) \times H_0(r)$. Let us note that for $h < r$ it holds $\Upsilon_0(h, r) \subset \Upsilon_0(r)$. The usage of the anisotropic local set $\Upsilon_0(h, r)$ allows us to account for the peculiarities of the semiparametric problem.

The formulation of the results involves the radius $r_0$ and the quantity $♦(r_0, x)$. The radius $r_0$ separates the local zone $\Upsilon_0(r_0)$, which is a vicinity of the central point $\upsilon^*$, from its complement $\Upsilon \setminus \Upsilon_0(r_0)$, for which we establish a large deviation result. Define
$$♦(r_0, x) \stackrel{\mathrm{def}}{=} \bigl(\delta(r_0) + 6\nu_0\, z_H(x)\, \omega\bigr)\, r_0, \qquad (8)$$
where $z_H(x) \stackrel{\mathrm{def}}{=} 2 p^{1/2} + \sqrt{2x} + \mathrm{g}^{-1}(\mathrm{g}^{-2} x + 1)\, 4p$. The value $♦(r_0, x)$ determines the quality of the local approximation of the gradient of the log-likelihood ratio $\nabla L(\upsilon, \upsilon^*)$ by the linear function $D(\upsilon - \upsilon^*)$, see the details in [1]. The term $\delta(r_0) r_0$ measures the error of a quadratic approximation of the expected log-likelihood $I\!E\, L(\upsilon)$ due to the condition $(L_0)$, while the second term $6\nu_0\, z_H(x)\, \omega\, r_0$ controls the stochastic term and involves the entropy of the parameter space, which enters the definition of $z_H(x)$.

3.3 Posterior Contraction

Posterior Tail Probability for the Full Parameter. The first step of our analysis is to show that the posterior distribution of $\upsilon$ given $Y$ concentrates in a small vicinity $\Upsilon_0(r_0)$ of the central point $\upsilon^*$ for a properly chosen $r_0$. This phenomenon is usually called posterior contraction. Let us consider the uniform prior distribution $\pi(\upsilon) \equiv 1$, $\upsilon \in \Upsilon$. The contraction properties of the posterior can be described by the following random variable:


$$I\!P\bigl(\upsilon \notin \Upsilon_0(r_0) \mid Y\bigr) = \frac{\int_{\Upsilon \setminus \Upsilon_0(r_0)} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}{\int_{\Upsilon} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}.$$

Theorem 1 (Theorem 10 from [34]) Let the conditions of Sect. 3.2 hold. Then on $\Omega_{r_0}(x)$ the following inequality holds:
$$I\!P\bigl(\upsilon \notin \Upsilon_0(r_0) \mid Y\bigr) \le \exp\{r_0 ♦(r_0, x) + \nu_0(r_0)\}\, b^{-p/2}\, I\!P\bigl(\|\gamma\|^2 \ge b r_0^2\bigr), \qquad (9)$$
where
$$\nu_0(r_0) \stackrel{\mathrm{def}}{=} -\log I\!P\bigl(\|\gamma + \xi\| \le r_0 \mid Y\bigr),$$
$♦(r_0, x)$ is defined in (8) and $\gamma \in I\!R^p$ is a standard Gaussian vector. If $r_0 \ge z_B(x) + z(p, x)$ with $z(p, x) = \sqrt{p} + \sqrt{2x}$ and $z_B(x)$ defined in (32), then on $\Omega(x)$ it holds
$$\nu_0(r_0) \le 2 e^{-x}. \qquad (10)$$

We provide the proof of Theorem 1 in Sect. 5.3 to make the paper self-contained. This result gives us a simple sufficient condition on the value of $r_0$ which ensures the concentration of the posterior on the set $\Upsilon_0(r_0)$.

Corollary 1 Let the conditions of Theorem 1 hold. Then the additional inequality $b r_0^2 \ge z^2\bigl(p,\, x + \tfrac{p}{2}\log\tfrac{e}{b}\bigr)$ allows us to obtain the bound
$$I\!P\bigl(\upsilon \notin \Upsilon_0(r_0) \mid Y\bigr) \le \exp\{r_0 ♦(r_0, x) + 2 e^{-x} - x\}$$
on a set $\Omega(x)$ of probability not less than $1 - 4 e^{-x}$.

This result follows from Theorem 1 in view of Lemma 2 (see Sect. 4.1 below).

Posterior Tail Probability for the Target Parameter. The next step of our analysis is to ensure that the posterior distribution of the target parameter $\vartheta$ given $Y$ concentrates in a small vicinity $\Theta_0(h_0) = \{\theta : \|\breve{D}(\theta - \theta^*)\| \le h_0\}$ of the central point $\theta^* = \Pi_0 \upsilon^*$ for a proper choice of $h_0$. The contraction properties of the posterior for the target parameter can be described by the following random variable:
$$I\!P\bigl(\vartheta \notin \Theta_0(h_0) \mid Y\bigr) = \frac{\int_{\Upsilon} 1\bigl(\theta \notin \Theta_0(h_0)\bigr) \exp\{L(\upsilon, \upsilon^*)\}\, \pi(\upsilon)\, d\upsilon}{\int_{\Upsilon} \exp\{L(\upsilon, \upsilon^*)\}\, \pi(\upsilon)\, d\upsilon}.$$
Working under the same assumption on the uniformity of the prior distribution, i.e. $\pi(\upsilon) \equiv 1$, $\upsilon \in \Upsilon$, we obtain the following formula for the probability of large deviations:
$$I\!P\bigl(\vartheta \notin \Theta_0(h_0) \mid Y\bigr) = \frac{\int_{\Upsilon} \exp\{L(\upsilon, \upsilon^*)\}\, 1\bigl(\theta \notin \Theta_0(h_0)\bigr)\, d\upsilon}{\int_{\Upsilon} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}. \qquad (11)$$


Let us note that for $h_0 \le r_0$ it holds
$$I\!P\bigl(\upsilon \notin \Upsilon_0(r_0) \mid Y\bigr) \le I\!P\bigl(\upsilon \notin \Upsilon_0(h_0, r_0) \mid Y\bigr) \le \frac{\int_{\Upsilon \setminus \Upsilon_0(h_0, r_0)} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}{\int_{\Upsilon_0(h_0, r_0)} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}.$$

We aim to show that we can take $h_0$ much smaller than $r_0$ in such a way that the posterior distribution concentrates on the set $\Upsilon_0(h_0, r_0)$ and, correspondingly, the posterior distribution of the target parameter concentrates on the set $\Theta_0(h_0)$.

Theorem 2 Let us assume that the conditions of Theorem 1 hold. Under the assumption that $h_0 \ge z_{\breve{B}}(x) + \sqrt{q} + \sqrt{2x}$, on a set $\Omega(x)$ the following inequality holds
$$I\!P\bigl(\upsilon \notin \Upsilon_0(h_0, r_0) \mid Y\bigr) \le \rho_0(h_0, r_0, x)\, e^{-x},$$
where
$$\rho_0(h_0, r_0, x) \stackrel{\mathrm{def}}{=} e^{r_0 ♦(r_0, x) + 2 e^{-x}} + 4\, e^{2 h_0 ♦(r_0, x) + 2 e^{-x}}. \qquad (12)$$
Moreover, on a set $\Omega(x)$ it holds
$$I\!P\bigl(\vartheta \notin \Theta_0(h_0) \mid Y\bigr) \le I\!P\bigl(\upsilon \notin \Upsilon_0(h_0, r_0) \mid Y\bigr) \le \rho_0(h_0, r_0, x)\, e^{-x}. \qquad (13)$$

The result (9) shows the concentration of the full parameter on the set $\Upsilon_0(r_0)$ with $r_0^2 \ge C(p + x)$. The result (13) shows that one can obtain concentration on a smaller set $\Upsilon_0(h_0, r_0) = \Theta_0(h_0) \times H_0(r_0)$ with $h_0^2 \ge C(q + x)$, which allows us to improve the bounds on the sizes of the concentration sets both for the full parameter $\upsilon = (\theta, \eta)$ and for the target one $\theta$.

It is also useful to extend these results by providing a bound for the expectation of a quadratic form of the target parameter on the local set. For an arbitrary $\lambda \in I\!R^q$ with $\|\lambda\| = 1$ let us define
$$\rho_2(h_0, r_0) \stackrel{\mathrm{def}}{=} \frac{\int_{\Upsilon \setminus \Upsilon_0(h_0, r_0)} |\lambda^{\top} \breve{D}(\theta - \theta^{\circ})|^2 \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon}{\int_{\Upsilon} \exp\{L(\upsilon, \upsilon^*)\}\, d\upsilon},$$
where $\theta^{\circ} \stackrel{\mathrm{def}}{=} \theta^* + \breve{D}^{-1} \breve{\xi}$.

Theorem 3 Let us assume that the conditions of Theorem 1 hold. Under the assumption that $h_0 \ge z_{\breve{B}}(x) + \sqrt{q} + \sqrt{2x}$, on a set $\Omega(x)$ the following inequality holds
$$\rho_2(h_0, r_0) \le 3\, \rho_0(h_0, r_0, x)\, e^{-x}. \qquad (14)$$


3.4 Gaussian Approximation of Posterior Distribution

Local Gaussian Approximation of Posterior Distribution: Upper Bound. In this section we aim to obtain upper bounds on the quality of the Gaussian approximation of the posterior distribution. It is convenient to introduce the local expectation $I\!E^{\circ}$: for a random variable $\eta$ define
$$I\!E^{\circ} \eta \stackrel{\mathrm{def}}{=} I\!E\bigl[\eta\, 1\bigl(\upsilon \in \Upsilon_0(h_0, r_0)\bigr) \mid Y\bigr].$$
The following theorem gives a precise statement about the upper bound on the local posterior mean. Let us introduce the variable
$$\theta^{\circ} \stackrel{\mathrm{def}}{=} \theta^* + \breve{D}^{-1} \breve{\xi}, \qquad (15)$$
where the matrix $\breve{D}$ is defined in (4) and the vector $\breve{\xi}$ is defined in (5). The quantity $\theta^{\circ}$ can be seen as an approximation of the maximum likelihood estimate by a first order Taylor expansion. It is used in our work for the convenience and brevity of the formulation of the results. The results can be easily extended to the case of the MLE or another asymptotically efficient estimate in place of $\theta^{\circ}$.

Theorem 4 Let the inequality (30) hold. Then for any function $f : I\!R^q \to I\!R_+$ on $\Omega_{r_0}(x)$ it holds
$$I\!E^{\circ} f\bigl(\breve{D}(\vartheta - \theta^{\circ})\bigr) \le \exp\bigl\{2 h_0 ♦(r_0, x) + \nu(h_0)\bigr\}\, I\!E f(\gamma), \qquad (16)$$
where $\gamma \sim \mathcal{N}(0, I_q)$.

In the following corollary we consider the particular cases $f(u) = |\lambda^{\top} u|^2$ and $f(u) = 1(u \in A)$ for a measurable set $A$.

Corollary 2 Let us assume that the conditions of Sect. 3.2 hold. Then for any $\lambda \in I\!R^q$ with $\|\lambda\| = 1$ on $\Omega(x)$ it holds
$$I\!E\bigl[|\lambda^{\top} \breve{D}(\vartheta - \theta^{\circ})|^2 \mid Y\bigr] \le \exp\bigl\{2 h_0 ♦(r_0, x) + 2 e^{-x}\bigr\} + 3\, \rho_0(h_0, r_0, x)\, e^{-x}.$$
For any measurable set $A \subseteq I\!R^q$ on $\Omega(x)$ it holds
$$I\!P\bigl(\breve{D}(\vartheta - \theta^{\circ}) \in A \mid Y\bigr) \le \exp\bigl\{2 h_0 ♦(r_0, x) + 2 e^{-x}\bigr\}\, I\!P(\gamma \in A) + \rho_0(h_0, r_0, x)\, e^{-x}, \qquad (17)$$
where $\gamma$ is a standard Gaussian vector in $I\!R^q$.


This result can be easily obtained from the result of Theorem 4 and the large deviation results (13) and (14). The following corollary describes an upper bound in the case of a different centering and normalization.

Corollary 3 Let $D_1$ be a symmetric $(q \times q)$-matrix satisfying the condition $\|I_q - D_1^{-1} \breve{D}^2 D_1^{-1}\| \le \alpha$. Let also $\widetilde{\theta} \in I\!R^q$ be some vector satisfying $\|\breve{D}(\theta^{\circ} - \widetilde{\theta})\| \le \beta$, and let $\delta_0 \stackrel{\mathrm{def}}{=} D_1(\theta^{\circ} - \widetilde{\theta})$. Then for any measurable set $A \subset I\!R^q$ on $\Omega(x)$ it holds
$$\begin{aligned}
I\!P\bigl(D_1(\vartheta - \widetilde{\theta}) \in A \mid Y\bigr)
&\le \exp\bigl\{2 h_0 ♦(r_0, x) + 2 e^{-x}\bigr\}\, I\!P\bigl(D_1 \breve{D}^{-1} \gamma + \delta_0 \in A\bigr) + \rho_0(h_0, r_0, x)\, e^{-x} \\
&\le \exp\bigl\{2 h_0 ♦(r_0, x) + 2 e^{-x}\bigr\} \Bigl[ I\!P(\gamma \in A) + \tfrac{1}{2}\sqrt{\alpha^2 q + (1 + \alpha)^2 \beta^2} \Bigr] + \rho_0(h_0, r_0, x)\, e^{-x}.
\end{aligned} \qquad (18)$$

Local Gaussian Approximation of Posterior Distribution: Lower Bound. Now we continue with the lower bound on the quality of the Gaussian approximation of the posterior distribution.

Theorem 5 Let the inequality (28) hold. Then for any function $f : I\!R^q \to I\!R_+$ on $\Omega_{r_0}(x)$ it holds
$$I\!E^{\circ} f\bigl(\breve{D}(\vartheta - \theta^{\circ})\bigr) \ge \exp\bigl\{-2 h_0 ♦(r_0, x)\bigr\}\, I\!E\bigl[f(\gamma)\, 1\bigl(\|\gamma + \breve{\xi}\| \le h_0\bigr)\bigr],$$
where $\gamma \sim \mathcal{N}(0, I_q)$.

This result can be proved by analogy with the proof of Theorem 4. It shows that the local posterior mean allows for a lower bound which is close to the expectation of the corresponding function of a standard Gaussian random variable, up to (small) multiplicative and additive constants. As a corollary we state the result for quadratic and indicator functions $f(u)$.

Corollary 4 Let $x > 3$. Then for any $\lambda \in I\!R^q$ with $\|\lambda\| = 1$ on $\Omega(x)$ it holds
$$I\!E\bigl[|\lambda^{\top} \breve{D}(\vartheta - \theta^{\circ})|^2 \mid Y\bigr] \ge \exp\bigl\{-2 h_0 ♦(r_0, x) - 3 e^{-x}\bigr\}.$$
Then for any measurable set $A \subset I\!R^q$ on $\Omega(x)$ it holds
$$I\!P\bigl(\breve{D}(\vartheta - \theta^{\circ}) \in A \mid Y\bigr) \ge \exp\bigl\{-2 h_0 ♦(r_0, x)\bigr\}\, I\!P(\gamma \in A) - e^{-x}. \qquad (19)$$
Let $D_1$ be a symmetric $(q \times q)$-matrix satisfying the condition $\|I_q - D_1^{-1} \breve{D}^2 D_1^{-1}\| \le \alpha$. Let also $\widetilde{\theta} \in I\!R^q$ be some vector satisfying $\|\breve{D}(\theta^{\circ} - \widetilde{\theta})\| \le \beta$, and let $\delta_0 \stackrel{\mathrm{def}}{=} D_1(\theta^{\circ} - \widetilde{\theta})$. Then for any measurable set $A \subset I\!R^q$ on $\Omega(x)$ it holds
$$\begin{aligned}
I\!P\bigl(D_1(\vartheta - \widetilde{\theta}) \in A \mid Y\bigr)
&\ge \exp\bigl\{-2 h_0 ♦(r_0, x)\bigr\}\, I\!P\bigl(D_1 \breve{D}^{-1} \gamma + \delta_0 \in A\bigr) - e^{-x} \\
&\ge \exp\bigl\{-2 h_0 ♦(r_0, x)\bigr\} \Bigl[ I\!P(\gamma \in A) - \tfrac{1}{2}\sqrt{\alpha^2 q + (1 + \alpha)^2 \beta^2} \Bigr] - e^{-x}.
\end{aligned} \qquad (20)$$


Proofs of these results are similar to the proofs of Corollaries 2 and 3.

Main Theorem. In this section we formulate the BvM theorem for the posterior distribution of the target parameter $\vartheta$, given by the formula (2), in the case of the uniform prior distribution, i.e. $\pi(\upsilon) \equiv 1$ on $\Upsilon$. Let us define the posterior mean $\bar{\vartheta}$ and the posterior covariance matrix $S^2$:
$$\bar{\vartheta} \stackrel{\mathrm{def}}{=} I\!E\bigl[\vartheta \mid Y\bigr], \qquad S^2 \stackrel{\mathrm{def}}{=} \mathrm{Cov}(\vartheta \mid Y) \stackrel{\mathrm{def}}{=} I\!E\bigl[(\vartheta - \bar{\vartheta})(\vartheta - \bar{\vartheta})^{\top} \mid Y\bigr]. \qquad (21)$$
Let us also recall the definition of the vector $\theta^{\circ}$ from (15): $\theta^{\circ} = \theta^* + \breve{D}^{-1} \breve{\xi}$. Below we present a variant of the semiparametric BvM theorem in the considered non-asymptotic framework, which shows that the vector $\bar{\vartheta}$ is close to $\theta^{\circ}$, $S^2$ is close to $\breve{D}^{-2}$, and the distribution of the vector $\breve{D}(\vartheta - \theta^{\circ})$ conditional on the data $Y$ is close to the standard normal distribution. We remind that $C$ is a common symbol for the absolute constants involved and $x$ is some positive number ensuring that $e^{-x}$ is small. By $\Omega(x)$ we denote a set of random events such that $I\!P(\Omega(x)) \ge 1 - C e^{-x}$. The precise values of the constants $C$ will be specified below.

Theorem 6 Let the conditions of Sect. 3.2 hold. Additionally let us assume that $h_0 \ge z_{\breve{B}}(x) + \sqrt{q} + \sqrt{2x}$. Let the prior distribution be uniform on $\Upsilon$. Then there exists a set $\Omega(x)$ of probability greater than $1 - 4 e^{-x}$ such that on $\Omega(x)$ it holds
$$\|\breve{D}_e(\bar{\vartheta} - \theta^{\circ})\|^2 \le 2(1 + b_s)\Delta(h_0, r_0, x) + \rho_s, \qquad \|I_q - \breve{D}_e S^2 \breve{D}_e\| \le b_s + 2(1 + b_s)\Delta(h_0, r_0, x),$$
where $\bar{\vartheta}$ and $S^2$ are defined in (21), $\Delta(h_0, r_0, x) \stackrel{\mathrm{def}}{=} 2 h_0 ♦(r_0, x) + C \rho_0(h_0, r_0, x) e^{-x}$, $♦(r_0, x)$ is defined in (27), and $\rho_0(h_0, r_0, x)$ is defined in (12). Moreover, on $\Omega(x)$ for any measurable set $A \subset I\!R^q$ it holds
$$\begin{aligned}
\exp\bigl\{-\Delta(h_0, r_0, x)\bigr\} \Bigl[ I\!P(\gamma \in A) - \tfrac{1}{2} q^{1/2} b_s \Bigr] - e^{-x}
&\le I\!P\bigl(\breve{D}_e(\vartheta - \theta^{\circ}) \in A \mid Y\bigr) \\
&\le \exp\bigl\{\Delta(h_0, r_0, x)\bigr\} \Bigl[ I\!P(\gamma \in A) + \tfrac{1}{2} q^{1/2} b_s \Bigr] + \rho_0(h_0, r_0, x)\, e^{-x},
\end{aligned}$$

where vector γ ∈ IR q has standard Gaussian distribution. The condition “ h0 ♦(r0 , x) is small” implies the BvM result, i.e. the closeness of centered and normalized posterior measure to the Gaussian one in total variation. Classical asymptotic results can be easily obtained as corollaries for many standard models such as linear regression, semiparametric linear regression and generalized linear models (see discussion in [34, Sect. 5]). The obtained result can be generalized in the following way.


Corollary 5 Under the conditions of Theorem 6, for any measurable set $A \subset I\!R^q$ on the set $\Omega(x)$ of probability greater than $1 - 4 e^{-x}$ it holds
$$\begin{aligned}
\exp\bigl\{-2 h_0 ♦(r_0, x)\bigr\} \bigl[ I\!P(\gamma \in A) - \tau \bigr] - e^{-x}
&\le I\!P\bigl(S^{-1}(\vartheta - \bar{\vartheta}) \in A \mid Y\bigr) \\
&\le \exp\bigl\{2 h_0 ♦(r_0, x) + 2 e^{-x}\bigr\} \bigl[ I\!P(\gamma \in A) + \tau \bigr] + \rho_0(h_0, r_0, x)\, e^{-x},
\end{aligned}$$
where $\tau \stackrel{\mathrm{def}}{=} \tfrac{1}{2} \Delta(h_0, r_0, x) \sqrt{q + (1 + \Delta(h_0, r_0, x))^2} + \tfrac{1}{2} q^{1/2} b_s$ with $\Delta(h_0, r_0, x) \stackrel{\mathrm{def}}{=} 4 h_0 ♦(r_0, x) + 6\bigl(1 + \rho_0(h_0, r_0, x)\bigr) e^{-x}$, and the vector $\gamma \in I\!R^q$ has a standard normal distribution.

This corollary is important since the matrix $\breve{D}$ and the vector $\theta^{\circ}$ are not known in practical scenarios, while the matrix $S^{-1}$ and the vector $\bar{\vartheta}$ can be computed directly. In the case when the dimension $q$ is fixed, the validity condition remains "$h_0 ♦(r_0, x)$ is small". Moreover, the result can be extended to the case when $q$ increases with the sample size but $h_0 ♦(r_0, x)\, q^{1/2}$ is still small. The classical asymptotic results immediately follow for many statistical models, see Sect. 3.5. The results for the uniform prior can be extended to the case of a general prior $\Pi(d\upsilon)$ which has a smooth density $\pi(\upsilon)$, see [34, Sect. 2.3].
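To make the Gaussian approximation statement concrete, here is a small numerical sketch (our own illustration with hypothetical variable names, not part of the paper): in a Gaussian linear model with a flat prior, the posterior of the full parameter is exactly Gaussian, and the marginal posterior precision of the target block equals the efficient information $\breve{D}^2 = D^2 - A H^{-2} A^{\top}$, so the BvM approximation holds without error in this special case.

```python
# Minimal sketch (our illustration, assuming a Gaussian linear model with known
# noise level and a flat prior): the marginal posterior precision of the target
# coordinates equals the Schur complement D^2 - A H^{-2} A^T.
import numpy as np

rng = np.random.default_rng(0)
n, q, m = 500, 2, 5                      # sample size, target dim, nuisance dim
sigma = 1.0
X = rng.normal(size=(n, q + m))
beta_true = rng.normal(size=q + m)
Y = X @ beta_true + sigma * rng.normal(size=n)

info_full = X.T @ X / sigma**2           # full Fisher information
D2 = info_full[:q, :q]                   # theta-theta block
A = info_full[:q, q:]                    # theta-eta block
H2 = info_full[q:, q:]                   # eta-eta block
D2_breve = D2 - A @ np.linalg.solve(H2, A.T)   # efficient Fisher information

# Exact posterior under a flat prior: N(beta_hat, sigma^2 (X'X)^{-1})
post_cov = sigma**2 * np.linalg.inv(X.T @ X)

# Marginal posterior covariance of theta is the inverse efficient information
print(np.allclose(np.linalg.inv(post_cov[:q, :q]), D2_breve))   # True
```

In more general models the identity only holds approximately, and the bounds of Theorem 6 quantify the corresponding error.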

3.5 Critical Dimension and Examples

In this section we show how the general results obtained above relate to classical asymptotic results in statistics. Let us consider a model with i.i.d. observations $Y = (Y_1, \ldots, Y_n)$ from a distribution $P$ belonging to the parametric family $(P_{\upsilon}, \upsilon \in \Upsilon)$ on the space of observations $\mathcal{Y}_1$. Each value $\upsilon \in \Upsilon$ determines the probability measure $I\!P_{\upsilon} = P_{\upsilon}^{\otimes n}$ on the space $\mathcal{Y} = \mathcal{Y}_1^n$. We again assume that there exists a measure $\mu_0$ dominating $(P_{\upsilon})$ and a density $p(y, \upsilon) \stackrel{\mathrm{def}}{=} d P_{\upsilon}/d\mu_0(y)$. Let us denote $\ell(y, \upsilon) = \log p(y, \upsilon)$. The parametric assumption $Y_i \sim P_{\upsilon^*} \in (P_{\upsilon})$ for some $\upsilon^* \in \Upsilon$ allows us to write the log-likelihood
$$L(\upsilon) = \sum_{i=1}^{n} \ell(Y_i, \upsilon).$$

The structure of i.i.d. observations Yi allows us to rewrite the conditions (L r) , (ED0 ) , (ED2 ) , (L0 ) and (I) in terms of marginal distributions. We will not do it here due to the limited space and direct the interested reader to [39]. Instead, we will simply require the following conditions on the parameters arising in general conditions.


(IID) Let the data $Y_1, \ldots, Y_n$ be i.i.d. random variables. Let us assume that there exist constants $\omega^*, \delta^*, \mathrm{g}_1$, matrices $v_0^2, F_0$ and a function $\mathrm{g}_1(u)$ such that the conditions $(ED_0)$, $(ED_2)$, $(L_0)$ and $(L\mathrm{r})$ hold with $V^2 = n v_0^2$, $D^2 = n F_0$, $\omega = \omega^*/n^{1/2}$, $\delta(r) = \delta^* r / n^{1/2}$, $\mathrm{g}_0 = \mathrm{g}_1 n^{1/2}$ and $\mathrm{g}(r) = \mathrm{g}_1(u)\, n^{1/2}$.

We note that part of the conditions in (IID), like the ones on the information matrices, are direct consequences of the i.i.d. assumption, while the others require some additional mild conditions, see [39].

Large Deviations for I.I.D. Data. In this section we provide sufficient conditions which ensure a small probability of the event $\{\upsilon \notin \Upsilon_{\mathrm{loc}}(u_0) \mid Y\}$ for $\Upsilon_{\mathrm{loc}}(u) = \{\upsilon \in \Upsilon : \|F_0^{1/2}(\upsilon - \upsilon^*)\| \le u\}$ and fixed $u_0$. Theorem 2 shows that one can achieve exponential concentration by taking $r_0 > C(z_B(x) + z(p, x))$ for a fixed constant $C$, with $z_B(x)$ and $z(p, x)$ defined in (31) and (25), respectively. By taking $r_0 = n^{1/2} u_0$ we can formulate the following proposition.

Proposition 1 Let us assume that the condition (IID) holds. If for some $u_0 > 0$
$$n^{1/2} u_0 > C\bigl(z_B(x) + z(p, x)\bigr), \qquad (22)$$
then the following inequality holds
$$I\!P\bigl(\upsilon \notin \Upsilon_{\mathrm{loc}}(u_0) \mid Y\bigr) \le e^{-x}.$$

Remark 1 The present result allows us to determine the values of $u_0$ and $n$ which ensure the large deviation result. Taking the condition (I) into account, the condition (22) can be written as $n u_0^2 \gtrsim x + p$. In other words, we obtain the concentration of the posterior on a set $\Upsilon_{\mathrm{loc}}(u_0)$ with $u_0^2$ of the order $p/n$. This result corresponds to the classical root-n contraction in statistics. By analogy one can show that the target parameter concentrates in the corresponding set of radius proportional to $\sqrt{q/n}$.

Local Estimation for I.I.D. Data. The next step is to quantify the error of the local Gaussian approximation in the i.i.d. case. This error is determined by the value $♦(r_0, x)$ given by (27).

Proposition 2 Let us assume the condition (IID). Then the results of Theorem 8 and its corollaries hold for the i.i.d. model with $r_0^2 = n u_0^2$. In particular, we obtain the local linear approximation of the log-likelihood gradient in the direction of the target parameter:
$$I\!P\Bigl( \sup_{\theta \in \Theta_0(r)} \bigl\| \breve{D}^{-1}\bigl(\breve{\nabla}_{\theta} L(\upsilon) - \breve{\nabla}_{\theta} L(\upsilon^*)\bigr) + \breve{D}(\theta - \theta^*) \bigr\| \ge ♦(r, x) \Bigr) \le e^{-x},$$
where $♦(r_0, x) = \bigl(\delta^* r_0 + 6 \nu_0 z_H(x)\, \omega^*\bigr)\, r_0 / n^{1/2}$.

Let us briefly discuss how the obtained results can be applied to the classical asymptotic scenario with $n \to \infty$. Let us assume that the dimension of the sieve approximation tends to infinity with the sample size, $p = p_n \to \infty$. Denote $r_0^2 = C p_n$


for some constant $C$ ensuring the large deviation result, and also $h_0^2 = C q$. Let us note that $z_H(x)$ is of the order $\sqrt{p_n}$. Thus, for a sufficiently large sample size $n$ we obtain that $♦(r_0, x) \asymp p_n / n^{1/2}$. Then we obtain the following theorem.

Theorem 7 Let the condition (IID) hold. Let also $p_n \to \infty$ and $p_n^2 q / n \to 0$. Then the result of Theorem 6 holds with $h_0 ♦(r_0, x) = C \sqrt{p_n^2 q / n}$ and $D^2 = n F_0$, where $F_0$ is the Fisher information matrix of the distribution $(P_{\upsilon})$ for a single observation at the point $\upsilon^*$.

Let us note that the result can be extended beyond the i.i.d. case (see the discussion in [34, Sect. 4]). In particular, for regression models it can also be shown that $♦(r_0, x) \asymp (p + x)/\sqrt{n}$, whereas $h_0^2 \asymp q + x$, i.e. the inference is possible if "$p^2 q / n$ is small". Let us also note that for the full parameter the results of Proposition 7 require the condition $p_n^3 / n \to 0$ for the BvM result to be valid. In our earlier work [34] the same condition was required for the semiparametric case, i.e. the present paper improves on that result.

4 Tools

4.1 Some Inequalities for Normal Distribution

This section contains some useful facts about the properties of the Gaussian distribution. Everywhere in this section $\gamma$ denotes the standard normal vector in $I\!R^p$.

Lemma 1 (Lemma 6 from [34]) For any $u \in I\!R^p$, any unit vector $a \in I\!R^p$ and any $z > 0$ the following bounds hold
$$I\!P\bigl(\|\gamma - u\| \ge z\bigr) \le \exp\bigl\{-z^2/4 + p/2 + \|u\|^2/2\bigr\}, \qquad (23)$$
$$I\!E\bigl[|\gamma^{\top} a|^2\, 1\bigl(\|\gamma - u\| \ge z\bigr)\bigr] \le \bigl(2 + |u^{\top} a|^2\bigr) \exp\bigl\{-z^2/4 + p/2 + \|u\|^2/2\bigr\}. \qquad (24)$$

The following lemma is a simple corollary of Lemma 1 in [29]. It describes the concentration of the norm $\|\gamma\|$ of a Gaussian random vector.

Lemma 2 For any $x > 0$ it holds
$$I\!P\bigl(\|\gamma\| \ge z(p, x)\bigr) \le \exp\{-x\}, \qquad \text{where} \qquad z(p, x) \stackrel{\mathrm{def}}{=} \sqrt{p} + \sqrt{2x}. \qquad (25)$$
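The bound of Lemma 2 is easy to check numerically. The following Monte Carlo sketch (our own illustration, not part of the paper; the parameter values are arbitrary) compares the empirical tail probability with the bound $e^{-x}$.

```python
# Quick Monte Carlo sanity check of Lemma 2 (illustration only):
# P(||gamma|| >= sqrt(p) + sqrt(2x)) <= exp(-x) for gamma ~ N(0, I_p).
import numpy as np

rng = np.random.default_rng(1)
p, x, n_sim = 20, 3.0, 200_000
gamma = rng.normal(size=(n_sim, p))
z = np.sqrt(p) + np.sqrt(2 * x)
emp = np.mean(np.linalg.norm(gamma, axis=1) >= z)
print(f"empirical tail {emp:.5f} <= bound {np.exp(-x):.5f}")
```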

The following lemma bounds the Kullback-Leibler divergence K (IP, IP ◦ ) between two Gaussian distributions IP and IP ◦ .


Lemma 3 (Lemma 8 from [34]) Let $I\!P = \mathcal{N}(b, \Sigma)$ and $I\!P^{\circ} = \mathcal{N}(b^{\circ}, \Sigma^{\circ})$ for positive definite matrices $\Sigma$ and $\Sigma^{\circ}$. If
$$\|\Sigma^{-1/2} \Sigma^{\circ} \Sigma^{-1/2} - I_p\| \le \epsilon \le 1/2, \qquad \mathrm{tr}\bigl[\bigl(\Sigma^{-1/2} \Sigma^{\circ} \Sigma^{-1/2} - I_p\bigr)^2\bigr] \le \delta^2,$$
then
$$K(I\!P, I\!P^{\circ}) = -I\!E_0 \log \frac{d I\!P^{\circ}}{d I\!P} \le \frac{\delta^2}{2} + \frac{1}{2}(b - b^{\circ})^{\top} (\Sigma^{\circ})^{-1} (b - b^{\circ}) \le \frac{\delta^2}{2} + \frac{1 + \epsilon}{2}(b - b^{\circ})^{\top} \Sigma^{-1} (b - b^{\circ}).$$
For any measurable set $A \subset I\!R^p$ it holds
$$\bigl|I\!P(A) - I\!P^{\circ}(A)\bigr| \le \sqrt{K(I\!P, I\!P^{\circ})/2} \le \frac{1}{2}\sqrt{\delta^2 + (1 + \epsilon)(b - b^{\circ})^{\top} \Sigma^{-1} (b - b^{\circ})}.$$
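As a small numerical illustration (our own sketch, not part of the paper; the two Gaussian laws and the test set are chosen arbitrarily), one can compare the Monte Carlo difference of probabilities with the Pinsker-type bound $\sqrt{K(I\!P, I\!P^{\circ})/2}$.

```python
# Illustration of the Pinsker-type bound in Lemma 3 (our sketch):
# |P(A) - P°(A)| <= sqrt(KL(P, P°)/2) for two Gaussian laws and a half-space A.
import numpy as np

rng = np.random.default_rng(2)
p = 3
b, b0 = np.zeros(p), 0.2 * np.ones(p)
Sigma = np.eye(p)
Sigma0 = np.eye(p) + 0.05 * np.diag(np.arange(p))

# Closed-form KL( N(b, Sigma) || N(b0, Sigma0) )
S0inv = np.linalg.inv(Sigma0)
kl = 0.5 * (np.trace(S0inv @ Sigma) - p
            + (b0 - b) @ S0inv @ (b0 - b)
            + np.log(np.linalg.det(Sigma0) / np.linalg.det(Sigma)))

n_sim = 200_000
XP = rng.multivariate_normal(b, Sigma, size=n_sim)
XQ = rng.multivariate_normal(b0, Sigma0, size=n_sim)
in_A = lambda x: x[:, 0] + x[:, 1] > 0.3        # an arbitrary measurable set
diff = abs(np.mean(in_A(XP)) - np.mean(in_A(XQ)))
print(f"|P(A) - P°(A)| = {diff:.4f} <= sqrt(KL/2) = {np.sqrt(kl / 2):.4f}")
```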

4.2 Linear Approximation of Log-likelihood Gradient and Other Tools

In this work we review some important results from [39] and generalize them to the semiparametric case. The bracketing result describes the linear approximation of the gradient of the log-likelihood $\nabla L(\upsilon)$ in the vicinity of the central point $\upsilon^*$. The upper function method allows us to show that the MLE $\widetilde{\upsilon}$ lies in this vicinity with dominating probability. We assume that the value $x$ is fixed in such a way that $e^{-x}$ is sufficiently small. We can set $x = C \log p$ if $p$ is large. Let us assume that $r = r_0$ is fixed so as to separate the local and global regions. We start with the basic result from [41]:

Theorem 8 (Spokoiny, 2013) Assume that conditions $(ED_2)$ and $(L_0)$ hold. Then for all $r \le r_0$



$$I\!P\Bigl( \sup_{\upsilon \in \Upsilon_0(r)} \bigl\| D^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*)\bigr) + D(\upsilon - \upsilon^*) \bigr\| \ge ♦(r, x) \Bigr) \le e^{-x}, \qquad (26)$$
where $z_H(x) \stackrel{\mathrm{def}}{=} 2 p^{1/2} + \sqrt{2x} + \mathrm{g}^{-1}(\mathrm{g}^{-2} x + 1)\, 4p$ and
$$♦(r, x) \stackrel{\mathrm{def}}{=} \bigl(\delta(r) + 6 \nu_0 z_H(x)\, \omega\bigr)\, r. \qquad (27)$$

This result allows us to bound the error of the local linear approximation of the log-likelihood gradient uniformly over $\upsilon \in \Upsilon_0(r_0)$. The quantity $♦(r_0, x)$ measures the quality of the approximation of the gradient of the log-likelihood ratio $\nabla L(\upsilon, \upsilon^*)$ by the linear process $D(\upsilon - \upsilon^*)$. The first term $\delta(r_0) r_0$ bounds the error of the linear approximation for the expectation of the log-likelihood gradient $\nabla I\!E\, L(\upsilon, \upsilon^*)$ according to condition $(L_0)$, while the second term $6 \nu_0 z_H(x)\, \omega\, r_0$ controls the stochastic part and is proportional to the entropy of the local set, which enters the definition of $z_H(x)$.

Let us define the quadratic process $\mathbb{L}(\upsilon, \upsilon^*)$:
$$\mathbb{L}(\upsilon, \upsilon^*) \stackrel{\mathrm{def}}{=} (\upsilon - \upsilon^*)^{\top} \nabla L(\upsilon^*) - \|D(\upsilon - \upsilon^*)\|^2/2.$$
The bound below follows from (26) and describes the quality of the quadratic approximation of $L(\upsilon)$ in the local vicinity of $\upsilon^*$.

Corollary 6 (Spokoiny, 2013) Let the conditions $(ED_2)$ and $(L_0)$ from Sect. 3.2 hold for some $r_0 > 0$. Then on the set $\Omega_{r_0}(x)$ of dominating probability not less than $1 - e^{-x}$ it holds
$$|L(\upsilon, \upsilon^*) - \mathbb{L}(\upsilon, \upsilon^*)| \le r_0 ♦(r_0, x), \qquad \upsilon \in \Upsilon_0(r_0), \qquad (28)$$

where $♦(r_0, x)$ is defined in (27) and $\Upsilon_0(r_0)$ is defined in (6). The result (28) is an improved version of the bound from [39, Theorem 3.1]. The result of Theorem 8 can be extended to the projection of the gradient in the direction of the target:

Corollary 7 Let the conditions $(L_0)$, $(ED_2)$ and (I) hold. Then for all $r \le r_0$ and all $\upsilon \in \Upsilon_0(r)$ the following inequality holds
$$I\!P\Bigl( \sup_{\theta \in \Theta_0(r)} \bigl\| \breve{D}^{-1}\bigl(\breve{\nabla}_{\theta} L(\upsilon) - \breve{\nabla}_{\theta} L(\upsilon^*)\bigr) + \breve{D}(\theta - \theta^*) \bigr\| \ge ♦(r, x) \Bigr) \le e^{-x}, \qquad (29)$$

where $♦(r, x)$ is defined by (27). This result is a key instrument for obtaining a more accurate error bound for the Gaussian approximation of the posterior for the target parameter. Based on Corollary 7, we can construct an approximation of the log-likelihood ratio in the target direction by the corresponding quadratic form. First, let us define
$$L^{\circ}(\theta, \breve{\eta}) \stackrel{\mathrm{def}}{=} L(\theta, \breve{\eta} - H^{-2} A^{\top} \theta),$$
where we have introduced $L(\theta, \eta) \stackrel{\mathrm{def}}{=} L(\upsilon)$ for $\upsilon = (\theta, \eta)$. Then, let us consider the following decomposition of the log-likelihood ratio:
$$L(\upsilon, \upsilon^*) = L^{\circ}(\theta, \breve{\eta}) - L^{\circ}(\theta^*, \breve{\eta}^*) = \bigl( L^{\circ}(\theta, \breve{\eta}) - L^{\circ}(\theta^*, \breve{\eta}) \bigr) + \bigl( L^{\circ}(\theta^*, \breve{\eta}) - L^{\circ}(\theta^*, \breve{\eta}^*) \bigr).$$
Also, define

$$\breve{L}(\theta, \theta^*) \stackrel{\mathrm{def}}{=} \breve{\xi}^{\top} \breve{D}(\theta - \theta^*) - \|\breve{D}(\theta - \theta^*)\|^2/2,$$

where $\breve{\xi} = \breve{D}^{-1} \breve{\nabla}_{\theta} L(\upsilon^*)$. Based on Corollary 7 we can prove the following result.

Corollary 8 Let the conditions of Corollary 7 hold. Then on the set $\Omega_{r_0}(x)$ of probability greater than $1 - e^{-x}$ for any $\upsilon \in \Upsilon_0(r_0)$ it holds
$$\bigl| L(\upsilon, \upsilon^*) - \breve{L}(\theta, \theta^*) - \bigl( L^{\circ}(\theta^*, \breve{\eta}) - L^{\circ}(\theta^*, \breve{\eta}^*) \bigr) \bigr| \le \|\breve{D}(\theta - \theta^*)\|\, ♦(r_0, x). \qquad (30)$$

This result gives the basis for the accurate bounds on the Gaussian approximation of the posterior distribution of the target parameter. Now we switch to the study of the concentration properties of the stochastic part of the log-likelihood.

Theorem 9 (Spokoiny and Zhilova, 2013) Let the condition $(ED_0)$ hold. Then for $\mathrm{g}_0^2 > p$ and $x \le \mathrm{g}_0^2/4$ the random vector $\xi = D^{-1} \nabla L(\upsilon^*)$ on the set $\Omega_B(x)$ of probability greater than $1 - 2 e^{-x}$ satisfies the inequality
$$\|\xi\|^2 \le z_B^2(x), \qquad (31)$$

where
$$z_B^2(x) \stackrel{\mathrm{def}}{=} p_B + 6 \lambda_B x, \qquad B \stackrel{\mathrm{def}}{=} D^{-1} V^2 D^{-1}, \qquad p_B \stackrel{\mathrm{def}}{=} \mathrm{tr}(B), \qquad \lambda_B \stackrel{\mathrm{def}}{=} \lambda_{\max}(B). \qquad (32)$$

Let us note that the result (31) can be found in [43]. This theorem can be easily extended to the case of the projection onto the target subspace.

Corollary 9 Let the conditions $(ED_0)$ and (I) hold. Then the random vector $\breve{\xi} = \breve{D}^{-1} \breve{\nabla}_{\theta} L(\upsilon^*)$ on a set $\Omega_B(x)$ of probability greater than $1 - 2 e^{-x}$ satisfies the inequality $\|\breve{\xi}\|^2 \le z_{\breve{B}}^2(x)$, where
$$z_{\breve{B}}^2(x) \stackrel{\mathrm{def}}{=} p_{\breve{B}} + 6 \lambda_{\breve{B}} x, \qquad \breve{B} \stackrel{\mathrm{def}}{=} \breve{D}^{-1} \breve{V}^2 \breve{D}^{-1}, \qquad \breve{V}^2 \stackrel{\mathrm{def}}{=} V^2 - B Q^{-2} B^{\top}, \qquad p_{\breve{B}} \stackrel{\mathrm{def}}{=} \mathrm{tr}(\breve{B}), \qquad \lambda_{\breve{B}} \stackrel{\mathrm{def}}{=} \lambda_{\max}(\breve{B}).$$
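The flavour of the quadratic-form deviation bound (31) can be illustrated by simulation. The sketch below (our own illustration under a Gaussian surrogate for $\xi$, which is not the exact setting of Theorem 9; the matrix and the parameters are arbitrary) checks that the empirical tail probability stays below $2e^{-x}$.

```python
# Monte Carlo sketch (Gaussian surrogate, illustration only) of the bound
# P(||xi||^2 >= p_B + 6*lambda_B*x) <= 2*exp(-x), p_B = tr(B), lambda_B = lambda_max(B).
import numpy as np

rng = np.random.default_rng(3)
p, x, n_sim = 10, 2.0, 200_000
M = rng.normal(size=(p, p))
B = M @ M.T / p                                   # an arbitrary positive-definite matrix
p_B, lam_B = np.trace(B), np.linalg.eigvalsh(B).max()
xi = rng.multivariate_normal(np.zeros(p), B, size=n_sim)
emp = np.mean(np.sum(xi**2, axis=1) >= p_B + 6 * lam_B * x)
print(f"empirical tail {emp:.5f} <= bound {2 * np.exp(-x):.5f}")
```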

The following theorem allows us to construct large deviation bounds for the log-likelihood using the upper function method.

Theorem 10 (Spokoiny, 2012) Let us assume that condition $(ED_2)$ holds. Also, let us assume that condition $(L\mathrm{r})$ holds, i.e. for all $\upsilon \in \Upsilon \setminus \Upsilon_0(r_0)$ it holds


$$-I\!E\, L(\upsilon, \upsilon^*) \ge b\, \|D(\upsilon - \upsilon^*)\|^2.$$
Let also $b r_0 \ge 2\bigl(z_B(x) + ◊(r_0, x)\bigr)$, where $◊(r, x) = 6 \nu_0\, z_H\bigl(x + \log(2r/r_0)\bigr)\, \omega$. Then for $\upsilon \in \Upsilon \setminus \Upsilon_0(r_0)$ the inequality
$$L(\upsilon, \upsilon^*) \le -b\, \|D(\upsilon - \upsilon^*)\|^2/2$$

(33)

holds on a set Ω(x) of probability greater than 1 − 4e−x . The result (33) is close to [39, Theorem 4.2].

5 Proofs of Main Results

5.1 Proof of Corollary 7

Let us note that the following sequence of equalities holds

$$\bigl\| D^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*)\bigr) + D(\upsilon - \upsilon^*) \bigr\|
= \bigl\| K^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*) + D^2(\upsilon - \upsilon^*)\bigr) \bigr\|
= \bigl\| K^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*)\bigr) + K^{\top}(\upsilon - \upsilon^*) \bigr\|,$$
where the matrix $K$ is defined as
$$K \stackrel{\mathrm{def}}{=} \begin{pmatrix} \breve{D} & A H^{-1} \\ O & H \end{pmatrix}, \qquad \text{and consequently} \qquad K^{-1} = \begin{pmatrix} \breve{D}^{-1} & -\breve{D}^{-1} A H^{-2} \\ O & H^{-1} \end{pmatrix},$$
so that $D^2 = K K^{\top}$ and $\|D^{-1} w\| = \|K^{-1} w\|$ for any vector $w$. By projecting onto the subspace corresponding to $\theta$ we obtain
$$\bigl\| \breve{D}^{-1}\bigl(\breve{\nabla}_{\theta} L(\upsilon) - \breve{\nabla}_{\theta} L(\upsilon^*)\bigr) + \breve{D}(\theta - \theta^*) \bigr\|
\le \bigl\| K^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*)\bigr) + K^{\top}(\upsilon - \upsilon^*) \bigr\|
= \bigl\| D^{-1}\bigl(\nabla L(\upsilon) - \nabla L(\upsilon^*)\bigr) + D(\upsilon - \upsilon^*) \bigr\|.$$

5.2 Proof of Corollary 8 Let us define the following quantity

402

M. Panov def ˘ ˘ = L ◦ (θ , η) ˘ − L ◦ (θ ∗ , η) ˘ − L(θ, α(θ , θ ∗ , η) θ ∗ ).

Its gradient in the direction of θ is given by ˘ = ∇˘ θ L(υ) − ∇˘ θ L(υ ∗ ) + D˘ 2 (θ − θ ∗ ). ∇θ α(θ , θ ∗ , η) Thus, we obtain

˘ = (θ − θ ∗ ) ∇θ α(θ  , θ ∗ , η), ˘ α(θ , θ ∗ , η)

where θ  is some point from the segment between θ ∗ and θ . We can bound 

   α(θ , θ ∗ , η) ˘ − θ ∗ ) · D˘ −1 ∇θ α(θ  , θ ∗ , η) ˘  ≤ D(θ ˘ and the required result follows due to (29).

5.3 Proof of Theorem 1 Let u(υ) = b D(υ − υ ∗ ) 2 /2 . Then by the change of variables we obtain    b p/2 det(D) exp −u(υ) dυ p/2 (2π ) Υ \Υ0 (r0 )      b p/2 det(D) ≤ exp −b D(υ − υ ∗ ) 2 /2 dυ = IP γ 2 ≥ br20 . p/2 (2π ) Υ \Υ0 (r0 )

For the integral in the numerator (11) on Ω(x) with the help of (33) we obtain 



Υ \Υ0 (r0 )





exp L(υ, υ ) dυ ≤

 Υ \Υ0 (r0 )

  exp −u(υ) dυ.

Below we use the fact that the function L(υ, υ ∗ ) = ξ D(υ − υ ∗ ) − D(υ − υ ∗ ) 2 /2 is proportional to the standard normal law. Define √ def m(ξ ) = − ξ 2 /2 + log(det D) − p log( 2π ). Then √ m(ξ ) + L(υ, υ ∗ ) = − D(υ − υ ∗ ) − ξ 2 /2 + log(det D) − p log( 2π ) is (conditionally on the data Y ) the logarithm of the density of the normal distribution with mean υ 0 = υ ∗ + D −1 ξ and covariance matrix D −2 . Then for the integral in the denominator we obtain

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

 Υ0 (r0 )

403

  exp L(υ, υ ∗ ) dυ  ≥ exp{−r0 ♦(r0 , x) − m(ξ )}

Υ0 (r0 )

  exp L(υ, υ ∗ ) + m(ξ ) dυ.

(34)

From inequality (34) by definition of ν(r0 ) it follows that  Υ0 (r0 )

  exp{L(υ, υ ∗ )} dυ ≥ exp −r0 ♦(r0 , x) − m(ξ ) − ν(r0 ) .

From inequality (37) for the local integral

 Υ0 (r0 )

(35)

  exp L(υ, υ ∗ ) dυ we obtain

    IP υ ∈ / Υ0 (r0 )  Y ≤ exp r0 ♦(r0 , x) + ν0 (r0 ) + m(ξ ) 

 Υ \Υ0 (r0 )

  exp −u(υ) dυ.

Finally,     exp m(ξ ) = exp − ξ 2 /2 (2π )− p/2 det(D) ≤ (2π )− p/2 det(D) and we obtain the statement (9). The bound (10) also follows       ν0 (r0 ) = − log IP γ + ξ ≤ r0  Y ≤ − log IP γ + ξ ≤ r0  Y    ≤ − log IP γ ≤ z( p, x)  Y ≤ 2e−x .

5.4 Proof of Theorem 2 Let us show that with dominating probability the posterior measure of the set Υ0 (r0 ) \ Υ0 (h0 , r0 ) is exponentially small given the condition that h20 is larger than C(q + x) . √ √ Lemma 4 Let us assume that the bound (26) holds. If h0 ≥ z B˘ (x) + q + 2x and x > 2 , the on the set Ω(x) it holds   ∗ ∗ e L(υ,υ ) dυ ≤ (1 + 4δ0 e−x ) e L(υ,υ ) dυ, Υ0 (r0 )

Υ0 (h0 ,r0 )

def

δ0 = e

2h0 ♦(r0 ,x)+2e−x

.

Proof Let h0 < h1 < . . . < h K = r0 be an increasing sequence. Let us define def Uk = Θ0 (hk ) × H0 (r0 ) and consider the bound (16) which allows to bound an ∗ def  integral over the set Uk \ Uk−1 . By defining T = U0 e L(υ,υ ) dυ we obtain

404

M. Panov 

1 T



U K \U0

=

e L(υ,υ ) dυ =

k=1



K 

 K  1 ∗ e L(υ,υ ) dυ T Uk \Uk−1

Uk

∗ ˘ e L(υ,υ ) 1( D(θ



U0

k=1

− θ ∗ ) > hk−1 )dυ

∗ e L(υ,υ ) dυ



K 

  e2hk ♦(r0 ,x)+ν(h0 ) IP γ + ξ˘ > hk−1 .

k=1

√ √ The choice h0 = z B˘ (x) + q + 2x ensures by (25) that the following inequality holds   ≥ 1 − e−x . IP γ + ξ˘ ≤ h0 √ √ Let us further define hk = z B˘ (x) + q + 2x + k 2 . Then it holds hk − h0 ≤ k . Under the condition 2♦(r0 , x) ≤ 1/2 the following sequence of inequalities holds: 1 T ≤





U K \U0

K 

e

e L(υ,υ ) dυ ≤

K 

  e2hk ♦(r0 ,x)+ν(h0 ) IP γ + ξ˘ > hk−1

k=1

2hk ♦(r0 ,x)+ν(h0 ) −x−(k−1)2 /2

e

≤e

2h0 ♦(r0 ,x)+ν(h0 )−x

k=1

∞ 

2

ek/2−(k−1)

/2

k=1

≤ 4e2h0 ♦(r0 ,x)+ν(h0 )−x .  Further, it is easy to see that 

   θ∈ / Θ0 (h0 ), υ ∈ Υ ⊂ Υ \ Υ0 (h0 , r0 ) .

Thus, for the integral in the numerator of (11) the bound (33) leads to the following bound:         exp L(υ, υ ∗ ) 1 θ ∈ / Θ0 (h0 ) dυ ≤ exp L(υ, υ ∗ ) dυ. Υ

Υ \Υ0 (h0 ,r0 )

Based on Lemma 4, we obtain: 



Υ0 (r0 )\Υ0 (h0 ,r0 )

e L(υ,υ ) dυ ≤ 4δ0 e−x

 Υ0 (h0 ,r0 )



e L(υ,υ ) dυ ≤ 4δ0 e−x





Υ0 (r0 )

e L(υ,υ ) dυ.

Finally, 

    exp L(υ, υ ∗ ) 1 θ ∈ / Θ0 (r0 ) dυ    ∗ Υ exp L(υ, υ ) dυ    ∗ Υ \Υ0 (h0 ,r0 ) exp L(υ, υ ) dυ    ≤ ∗ Υ exp L(υ, υ ) dυ   IP ϑ ∈ / Θ0 (h0 )  Y = 

Υ

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

 ≤

Υ \Υ0 (r0 )

405

     exp L(υ, υ ∗ ) dυ + Υ0 (r0 )\Υ0 (h0 ,r0 ) exp L(υ, υ ∗ ) dυ    ∗ Υ exp L(υ, υ ) dυ

≤ e2r0 ♦(r0 ,x)+2e

−x

−x

+ 4e2h0 ♦(r0 ,x)+2e

−x

−x

,

where the last transition is made with the help of Theorem 1.

5.5 Proof of Theorem 3 Let us bound numerator and denominator separately. Let us define λ0 = D

−1



˘ Dλ , 0

where 0 is the vector of all zeros of dimension p − q . Let us note that λ0 2 = λ 2 = 1 . We can split the numerator in two summands: 

    λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ Υ \Υ0 (h0 ,r0 )      λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ = Υ \Υ (r )  0 0     λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ. + Υ0 (r0 )\Υ0 (h0 ,r0 )

For the first summand the usage of (33) leads to the bound  Υ \Υ0 (r0 )

    λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ





Υ \Υ0 (r0 )

 =

Υ \Υ0 (r0 )

    λ D(θ ˘ − θ ◦ )2 exp −b D(υ − υ ∗ ) 2 /2 dυ     λ D(υ − υ ∗ )2 exp −b D(υ − υ ∗ ) 2 /2 dυ 0

√ 2     1( γ 2 ≥ br2 ). = exp (−( p/2) log b − log(det D) + p log( 2π ) IE λ 0γ 0 Then usage of (24) allows to achieve 

    λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ Υ \Υ0 (r0 )   √ 2 ≤ 2 exp −( p/2) log b − log(det D) + p log( 2π ) − br20 /4 + p/2 λ0 .

406

M. Panov

Now we take br20 ≥ (2 p + 2) log(e/b) + 4x on Ω(x) and obtain  Υ \Υ0 (r0 )

    λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ

√  2  ≤ exp − log(det D) + p log( 2π ) − x λ0 √   = exp − log(det D) + p log( 2π ) − x .

Now let us bound the second summand of the numerator. Similarly to the Lemma 4: def h0 < h1 < . . . < h K = r0 is an increasing sequence. Define Uk = Θ0 (hk ) × H0 (r0 ) and not the bound (16) allows to bound the integral over Uk \ Uk−1 . Let us ∗ further defin T = U0 e L(υ,υ ) . Then, the sequence of inequalities follows:  K      1 λ D(θ λ D(θ ˘ − θ ◦ )2 e L(υ,υ ∗ ) dυ = ˘ − θ ◦ )2 e L(υ,υ ∗ ) dυ T Uk \Uk−1 U K \U0 k=1    K λ D(θ ˘ − θ ◦ )2 e L(υ,υ ∗ ) 1( D(θ ˘ − θ ∗ ) > hk−1 )dυ  Uk  ≤ L(υ,υ ∗ ) dυ U0 e

1 T



k=1



K 

 2   e2hk ♦(r0 ,x)+ν(h0 ) IE λ γ  1 γ + ξ˘ > hk−1 ≤ (2 + |λ ξ˘ |2 )4e2h0 ♦(r0 ,x)+ν(h0 )−x

k=1

≤ 8(1 + λ 2 e−x )e2h0 ♦(r0 ,x)+ν(h0 )−x ≤ 9e2h0 ♦(r0 ,x)+ν(h0 )−x

√ √ for h0 ≥ z B˘ (x) + q + 2x on Ω(x) , x > 3 . Then we can proceed with the denominator with the help of (35):  Υ0 (r0 )

  exp{L(υ, υ ∗ )} dυ ≥ exp −r0 ♦(r0 , x) − m(ξ ) − ν0 (r0 ) .

Finally, on Ω(x) we obtain the sequence of inequalities     λ D(θ ˘ − θ ◦ )2 exp L(υ, υ ∗ ) dυ    ρ2 (h0 , r0 ) = ∗ Υ exp L(υ, υ ) dυ √    2 exp − log(det D) + p log( 2π ) − x   ≤ exp −r0 ♦(r0 , x) − m(ξ ) − ν0 (r0 ) 

Υ \Υ0 (h0 ,r0 )

= 2er0 ♦(r0 ,x)− ξ

2

/2 +ν0 (r0 )−x

+ 9e2h0 ♦(r0 ,x)+ν(h0 )−x .

≤ 2er0 ♦(r0 ,x)+ν0 (r0 )−x + 9e2h0 ♦(r0 ,x)+ν(h0 )−x ≤ 2er0 ♦(r0 ,x)+2e

−x

−x

+ 9e2h0 ♦(r0 ,x)+2e

and the result of the Theorem 3 follows.

−x

−x

,

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

407

5.6 Proof of Theorem 4 ˘ ˘ − θ ∗ ) 2 /2 is ˘ − θ ∗ ) − D(θ Let us note the that the quantity L(θ, θ ∗ ) = ξ˘ D(θ proportional to the log-density of the normal distribution. More precisely, define √ def ˘ − q log( 2π). m(ξ˘ ) = − ξ˘ 2 /2 + log(det D) Then √ ˘ ˘ − θ ∗ ) − ξ˘ 2 /2 + log(det D) ˘ − q log( 2π ) m(ξ˘ ) + L(θ, θ ∗ ) = − D(θ is (conditionally on Y ) log-density of the normal distribution with mean θ ◦ = θ ∗ + D˘ −1 ξ˘ and covariance matrix D˘ −2 . Thus, for any nonnegative function f : IR q → IR+ we obtain  Υ0 (h0 ,r0 )

    ˘ − θ ◦ ) dυ exp L(υ, υ ∗ ) + m(ξ˘ ) f D(θ



≤  ·

Θ0 (h0 )

H 0 (r0 )

    ˘ , θ ∗ ) + m(ξ˘ ) f D(θ ˘ − θ ∗ ) ♦(r0 , x) + L(θ ˘ − θ ◦ ) dθ exp D(θ

 ˘ η˘ ∗ )} d η, ˘ exp L◦ (η,

˘ η˘ ∗ ) = L ◦ (θ ∗ , η) ˘ − L ◦ (θ ∗ , η˘ ∗ ) . Then we get where L◦ (η,  Υ

    ˘ − θ ◦ ) dυ exp L(υ, υ ∗ ) + m(ξ˘ ) f D(θ

     ˘ ˘ η˘ ∗ ) d η˘ ≤ IEe γ +ξ ♦(r0 ,x) 1 γ + ξ˘ ≤ h0 f (γ ) exp L◦ (η, H 0 (r0 )    ˘ η˘ ∗ ) d η, ˘ exp L◦ (η, ≤ eh0 ♦(r0 ,x) IE f (γ ) H 0 (r0 )

where γ is a standard Gaussian vector in IR q . Then, we get  Υ

    ˘ − θ ◦ ) dυ exp L(υ, υ ∗ ) f D(θ ≤ exp{h0 ♦(r0 , x) − m(ξ˘ )} IE f (γ )

For the integral in the denominator it holds

 H 0 (r0 )

  ˘ η˘ ∗ ) d η. ˘ exp L◦ (η,

(36)

408



M. Panov

  exp L(υ, υ ∗ ) dυ ≥e

−h0 ♦(r0 ,x)−m(ξ˘ )

 Υ0 (h0 ,r0 )

    ˘ , θ ∗ ) + m(ξ˘ ) exp L◦ (η, ˘ η˘ ∗ ) dθ d η. ˘ exp L(θ

We can bound from the below  Υ0 (r0 )

  exp L(υ, υ ∗ ) dυ      ˘ , θ ∗ ) + m(ξ˘ ) dθ ˘ η˘ ∗ ) d η˘ exp L(θ exp L◦ (η, Θ0 (h0 ) H0 (r0 )    ˘) −h ♦(r ,x)−m( ξ 0 0 ˘ ˘ η˘ ∗ ) d η. ˘ IE1{ γ + ξ ≤ h0 } exp L◦ (η, (37) =e ˘

≥ e−h0 ♦(r0 ,x)−m(ξ )



H0 (r0 )

From (36) and (37) the desired result follows  Υ

    ˘ − θ ◦ ) dυ   exp L(υ, υ ∗ ) f D(θ    ≤ exp 2h0 ♦(r0 , x) + ν(h0 ) IE f (γ ). ∗ Υ exp L(υ, υ ) dυ

5.7 Proof of Corollary 3  The firs statement in (20) follows from Theorem 4 with f (u) = 1 D1 D˘ −1 u + δ 0 ∈  def θ ) it holds A . Further on Ω(x) for δ 0 = D1 (θ ◦ −  ˘ ◦ − θ ) 2 ≤ (1 + α) D(θ θ ) 2 ≤ (1 + α)β 2 . δ 0 2 = D1 (θ ◦ −  To prove (18) we compute Kullback-Leibler divergence and use Pinsker inequality. Let γ be a standard Gaussian vector in IR q . The random variable D1 D˘ −1 γ + δ 0 def is Gaussian with mean δ 0 and covariance matrix B1−1 = D1 D˘ −2 D1 . Obviously, IIq − B1 = IIq − D1−1 D˘ 2 D1−1 ≤ α. Then, by Lemma 3 for any measurable set A it holds     1  α 2 q + (1 + α)2 β 2 . IP D1 D˘ −1 γ + δ 0 ∈ A  Y ≤ IP γ ∈ A + 2

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

409

5.8 Proof of Theorem 6 We need to show that on Ω(x) th following inequalities hold

˘ 2 D˘ ≤ 2Δ∗ , ˘ − θ ◦ ) 2 ≤ 2Δ∗ , I q − DS D(ϑ where

  def Δ∗ = max Δ+ , Δ− , and we define

def

Δ+ = 2h0 ♦(r0 , x) + 2e−x +

def

−x − −x ∗ and 3ρ  −x Δ = 2h0 ♦(r0 , x) + 3e . Thus, Δ ≤ 2h0 ♦(r0 , x) +  0 (h0 , r0 , x)e 3 1 + ρ0 (h0 , r0 , x) e . def ˘ − θ ◦ ) . From Corollaries 2 and 4 it holds that for any λ ∈ IR q Consider η = D(ϑ with λ = 1 : 2  (38) exp(−Δ− ) ≤ IE ◦ λ η ≤ exp(Δ+ ).

Define first two moments of the random variable η :   def ˘ 2 D. ˘ S◦2 = IE ◦ (η − η)(η − η) = DS

η = IE ◦ η, def

Let us consider the following technical result. ∗ Lemma  us assume that the inequality (38) holds. Then for Δ =  +5 Let − max Δ , Δ ≤ 1/2 the following two inequalities hold

η 2 ≤ 2Δ∗ ,

S◦2 − I q ≤ 2Δ∗ .

(39)

Proof Let u be an arbitrary unit vector on IR q . From (38) we obtain 2  exp(−Δ− ) ≤ IE ◦ u η ≤ exp(Δ+ ). Also, we note that

Thus,

2  IE ◦ u η = u S◦2 u + |u η|2 . exp(−Δ− ) ≤ u S◦2 u + |u η|2 ≤ exp(Δ+ ).

(40)

In a similar way u = η/ η and γ ∼ N (0, I q ) 2  2  − IE ◦ u (η − η) ≥ e−Δ IE u (γ − η) 1{ γ + ξ˘ ≤ h0 }  2  − − = e−Δ 1 + η 2 − e−Δ IE u (γ − η) 1{ γ + ξ˘ ≥ h0 }. Consider the reminder term

410

M. Panov

 2 − e−Δ IE u (γ − η) 1{ γ + ξ˘ ≥ h0 }  2 − − 2 = e−Δ IE u γ  1{ γ + ξ˘ ≥ h0 } + e−Δ η IE1{ γ + ξ˘ ≥ h0 }. By inequalities (23) and (24):   − 2 − 2 e−Δ η IE1{ γ + ξ˘ ≥ r0 } ≤ e−Δ η exp −r20 /4 + q/2 + ξ˘ 2 /2 ,  2   − − e−Δ IE u γ  1{ γ + ξ˘ ≥ r0 } ≤ e−Δ (2 + |u ξ˘ |2 ) exp −r20 /4 + q/2 + ξ˘ 2 /2   − ≤ e−Δ (2 + ξ˘ 2 ) exp −r20 /4 + q/2 + ξ˘ 2 /2 .

According to (31) on Ω(x) it holds that ξ˘ 2 ≤ z 2B˘ (x) . Consider h20 ≥ 4x + 2q + 2z 2B˘ (x) , then we get   u S◦2 u ≥ exp(−Δ− ) 1 + η 2 − exp{−x + log κ} , def

where κ = 2 + η 2 + z 2B˘ (x) . We can take x > log Δ∗ + log κ . Let η 2 > 2Δ∗ , then u S◦2 u ≥ (1 + Δ∗ ) exp(−Δ∗ ). This inequality contradicts (40) for Δ∗ ≤ 1/2 . Indeed, for u = η/ η the upper bound (40) implies, that u S◦2 u ≤ exp{Δ∗ } − 2Δ∗ . We get the contradiction as ex − 2x ≤ (1 + x)e−x for 0 ≤ x ≤ 1/2 . Also note that S◦2 − I q ≤ 2Δ∗ directly follows from (38) due to the elementary inequality ex − 1 ≤ 2x for 0 ≤ x ≤ 1/2 . Thus, the inequalities (39) hold.  The bound on the first moment ϑ = IE ◦ ϑ implies the inequality

D(ϑ ˘ − θ ◦ ) 2 ≤ 2Δ∗ , whereas the second bound gives

DS ˘ 2 D˘ − I q ≤ 2Δ∗ . The last result of the theorem directly follows from inequalities (17) and (19).

5.9 Proof of Theorem 7 The bracketing result of Corollary 6 and large deviation result of Theorem 10 are valid if the sample size n satisfies the condition n ≥ C( pn + x) for a fixed constant

On Accuracy of Gaussian Approximation in Bayesian Semiparametric Problems

411

C . It appears that BvM result requires stronger condition. Indeed, in i.i.d. case it holds √ √ 2 (xn )  pn + xn , ω  1/ n, δ(r0 )  r0 / n, z H where a  b means that a = O(b) and b = O(a) with n → ∞ . The radius r0 should satisfy the condition r20 ≥ C( pn + x) , to ensure the large deviation result for full parameter, and for the target parameter h20 ≥ C(q + x) . Thus, we obtain  2 (xn )ω)r0 ≥ C ( pn + x)2 (q + x)/n. h0 ♦(r0 , x) = h0 (δ(r0 ) + 3ν0 z H If we fix x = Cq then the BvM result requires the condition pn2 q/n → 0 for n → ∞. Acknowledgements The research was supported by the Russian Science Foundation grant 20-7110135.

References 1. Andresen, A., Spokoiny, V.: Critical dimension in profile semiparametric estimation. Electron. J. Statist. 8(2), 3077–3125 (2014). https://doi.org/10.1214/14-EJS982 2. Barron, A., Schervish, M.J., Wasserman, L.: The consistency of posterior distributions in nonparametric problems. Ann. Stat. 27, 536–561 (1996). https://doi.org/10.1214/aos/1018031206 3. Belloni, A., Chernozhukov, V.: Posterior inference in curved exponential families under increasing dimensions. Econ. J. 17(2), S75–S100 (2014). https://doi.org/10.1111/ectj.12027 4. Bernstein, S.: Lecture Notes on Probability Theory. Kharkiv University (1917) 5. Bickel, P.J., Kleijn, B.J.K.: The semiparametric Bernstein-von Mises theorem. Ann. Stat. 40(1), 206–237 (2012). https://doi.org/10.1214/11-AOS921 6. Bochkina, N.: Consistency of the posterior distribution in generalized linear inverse problems. Inverse Probl. 29(9), 095010 (2013). https://doi.org/10.1088/0266-5611/29/9/095010 7. Boucheron, S., Gassiat, E.: A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 3, 114–148 (2009). https://doi.org/10.1214/08-EJS262 8. Boucheron, S., Massart, P.: A high-dimensional Wilks phenomenon. Probab. Theory Relat. Fields. 150, 405–433 (2011). https://doi.org/10.1007/s00440-010-0278-7 9. Buhlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications, 1st edn. Springer Publishing Company, Incorporated (2011). https://doi.org/10. 1007/978-3-642-20192-9 10. Burnashev, M.V.: Investigation of second order properties of statistical estimators in a scheme of independent observations. Izv. Akad. Nauk USSR Ser. Mat. 45(3), 509–539 (1981) 11. Castillo, I.: A semiparametric Bernstein—von Mises theorem for Gaussian process priors. Probab. Theory Relat. Fields. 152, 53–99 (2012). https://doi.org/10.1007/s00440-010-0316-5 12. Castillo, I., Nickl, R.: Nonparametric Bernstein-von Mises theorems in Gaussian white noise. Ann. Stat. 41(4), 1999–2028 (2013). https://doi.org/10.1214/13-AOS1133 13. Castillo, I., Rousseau, J.: A general bernstein–von mises theorem in semiparametric models (2013). arXiv:1305.4482 [math.ST] 14. Cheng, G., Kosorok, M.R.: General frequentist properties of the posterior profile distribution. Ann. Stat. 36(4), 1819–1853 (2008). https://doi.org/10.1214/07-AOS536


15. Chentsov, N.N.: A bound for an unknown distribution density in terms of the observations. Dokl. Akad. Nauk USSR 147(1), 45–48 (1962) 16. Chernozhukov, V., Hong, H.: An mcmc approach to classical estimation. J. Econ. 115(2), 293–346 (2003). https://doi.org/10.1016/S0304-4076(03)00100-3 17. Cox, D.D.: An analysis of Bayesian inference for nonparametric regression. Ann. Stat. 21(2), 903–923 (1993). https://doi.org/10.1214/aos/1176349157 18. Ermakov, M.: On semiparametric statistical inferences in the moderate deviation zone. J. Math. Sci. 152(6), 869–874 (2008). https://doi.org/10.1007/s10958-008-9104-5 19. Freedman, D.: On the Bernstein-von Mises theorem with infinite-dimensional parameters. Ann. Stat. 27(4), 1119–1140 (1999). https://doi.org/10.1214/aos/1017938917 20. Ghosal, S.: Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5(2), 315–331 (1999). https://doi.org/10.2307/3318438 21. Gusev S.I.: Asymptotic expansions associated with some statistical estimators in the smooth case. i. expansions of random variables. Theory Probab. Appl. 20(3), 488–514 (1975) 22. Gusev S.I.: Asymptotic expansions associated with some statistical estimators in the smooth case. ii. expansions of moments and distributions. Theory Probab. Appl. 21(1), 16–33 (1976) 23. Ibragimov, I., Khas’minskij, R.: Statistical Estimation. Springer-Verlag, Asymptotic theory. New York - Heidelberg -Berlin (1981) 24. Kim, Y.: The Bernstein—von Mises theorem for the proportional hazard model. Ann. Stat. 34(4), 1678–1700 (2006). https://doi.org/10.1214/009053606000000533 25. Kim, Y., Lee, J.: A Bernstein—von Mises theorem in the nonparametric right-censoring model. Ann. Stat. 32(4), 1492–1512 (2004). https://doi.org/10.1214/009053604000000526 26. Kleijn, B.J.K., van der Vaart, A.W.: Misspecification in infinite-dimensional Bayesian statistics. Ann. Stat. 34(2), 837–877 (2006). https://doi.org/10.1214/009053606000000029 27. Kleijn, B.J.K., van der Vaart, A.W.: The Bernstein-von-Mises theorem under misspecification. Electron. J. Stat. 6, 354–381 (2012). https://doi.org/10.1214/12-EJS675 28. Kosorok, M.R.: Springer series in statistics. Introduction to Empirical Processes and Semiparametric Inference (2008). https://doi.org/10.1007/978-0-387-74978-5 29. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28(5), 1302–1338 (2000). https://doi.org/10.1214/aos/1015957395 30. Le Cam, L., Yang, G.L.: Asymptotics in Statistics: Some Basic Concepts. Springer in Statistics (1990). https://doi.org/10.1007/978-1-4612-1166-2 31. Leahu, H.: On the bernstein-von mises phenomenon in the gaussian white noise model. Electron. J. Stat. 5, 373–404 (2011). https://doi.org/10.1214/11-EJS611 32. Mises, R.: Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik. Mary S, Rosenberg (1931) 33. Panov, M.: Nonasymptotic approach to bayesian semiparametric inference. Dokl. Math. 93(2), 155–158 (2016). https://doi.org/10.1134/S1064562416020101 34. Panov, M., Spokoiny, V.: Finite sample Bernstein—von Mises theorem for semiparametric problems. Bayesian Anal. 10(3), 665–710 (2015). https://doi.org/10.1214/14-BA926 35. Rivoirard, V., Rousseau, J.: Bernstein—von Mises theorem for linear functionals of the density. Ann. Stat. 40(3), 1489–1523 (2012). https://doi.org/10.1214/12-AOS1004 36. Rivoirard, V., Rousseau, J.: Posterior concentration rates for infinite dimensional exponential families. Bayesian Anal. 7(2), 311–334 (2012). 
https://doi.org/10.1214/12-BA710 37. Schwartz, L.: On bayes procedures. Probab. Theory Relat. Fields. 4(1), 10–26 (1965) 38. Shen, X.: Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Am. Stat. Assoc. 97(457), 222–235 (2002). https://doi.org/10.1198/016214502753479365 39. Spokoiny, V.: Parametric estimation. Finite sample theory. Ann. Stat. 40(6), 2877–2909 (2012). https://doi.org/10.1214/12-AOS1054 40. Spokoiny, V.: Bernstein—von Mises Theorem for growing parameter dimension (2013). Manuscript. arXiv:1302.3430 41. Spokoiny, V.: Penalized maximum likelihood estimation and effective dimension. AIHP (2015). https://doi.org/10.1214/15-AIHP720


42. Spokoiny, V., Panov, M.: Accuracy of gaussian approximation in nonparametric bernstein–von mises theorem (2019). arXiv preprint arXiv:1910.06028 43. Spokoiny, V., Zhilova, M.: Sharp deviation bounds for quadratic forms. Math. Methods Stat. 22(2), 100–113 (2013). https://doi.org/10.3103/S1066530713020026 44. van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press (2000). https://doi. org/10.1017/CBO9780511802256

Statistical Theory Motivated by Applications

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

Marianne Bléhaut, Xavier D’Haultfœuille, Jérémy L’Hour, and Alexandre B. Tsybakov

Abstract The synthetic control method is an econometric tool to evaluate causal effects when only one unit is treated. While initially aimed at evaluating the effect of large-scale macroeconomic changes with very few available control units, it has increasingly been used in place of more well-known microeconometric tools in a broad range of applications, but its properties in this context are unknown. This paper introduces an alternative to the synthetic control method, which is developed both in the usual asymptotic framework and in the high-dimensional scenario. We propose an estimator of the average treatment effect that is doubly robust, consistent and asymptotically normal. It is also immunized against first-step selection mistakes. We illustrate these properties using Monte Carlo simulations and applications to both standard and potentially high-dimensional settings, and offer a comparison with the synthetic control method.

Keywords Treatment effect · Synthetic control · Covariate balancing · High-dimension

M. Bléhaut CREDOC, 142 rue du Chevaleret, 75013 Paris, France e-mail: [email protected] X. D’Haultfœuille · J. L’Hour · A. B. Tsybakov (B) CREST, ENSAE-Paris, Institut Polytechnique de Paris, 5 Avenue Le Chatelier, 91120 Palaiseau, France e-mail: [email protected] X. D’Haultfœuille e-mail: [email protected] J. L’Hour e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_12


1 Introduction

The synthetic control method [2–4] is one of the most recent additions to the empiricist’s toolbox, gaining popularity not only in economics, but also in political science, medicine, etc. It provides a sound methodology in many settings where only long aggregate panel data is available to the researcher. The method has been specifically developed in a context where a single sizeable unit such as a country, a state or a city undergoes a large-scale policy change (referred to as the treatment or intervention hereafter), while only a moderate number of control units (the donor pool) is available to construct a counterfactual through a synthetic unit. This unit is defined as a convex combination of units from the donor pool that best resembles the treated unit before the intervention. Then the treatment effect is estimated from the difference in outcomes between the treated unit and its synthetic unit after the intervention takes place. In contexts such as those described above, the synthetic unit possesses several appealing properties (for more details on such properties, see the recent survey of [1]). First, it does not lead to extrapolation outside the support of the data: because weights are non-negative, the counterfactual never takes a value outside of the convex hull defined by the donor pool. Second, one can simply assess its fit, making it easy to judge the quality of the counterfactual. Third, the synthetic unit is sparse: the number of control units receiving a non-zero weight is at most equal to the dimension of the matching variable plus one. The method still has some limitations, in particular when applied to micro data, for which it was not initially intended. In such cases, the number of untreated units $n_0$ is typically greater than the dimension $p$ of variables $X$ used to construct the synthetic units. Then, as soon as the treated unit falls into the convex hull defined by the donor pool, the synthetic control solution is not uniquely defined (see in particular [5]). Second, and still related to the fact that the method was not developed for micro data, there is, to the best of our knowledge, no asymptotic theory available for synthetic control yet. This means, in particular, that inference cannot be conducted in a standard way. A third issue is related to variable selection. The standard synthetic control method, as advocated in [2], not only minimizes the norm $\|\cdot\|_V$ (defined, for a vector $a$ of dimension $p$ and a diagonal positive-definite matrix $V$, as $\|a\|_V = \sqrt{a^{\top} V a}$) between the characteristics of the treated and those of its synthetic unit under constraints, but also employs a bi-level optimization program over the weighting matrix $V$ so as to obtain the best possible pre-treatment fit. Diagonal elements of $V$ are interpreted as a measure of the predicting power of each characteristic for the outcome (see, e.g., [1, 2]). This approach has been criticized for being unstable and yielding unreproducible results, see in particular [33]. We consider an alternative to the synthetic control that addresses these issues. Specifically, we consider a parametric form for the synthetic control weights, $W_i = h(X_i^{\top} \beta_0)$, where we estimate the unknown parameter $\beta_0$. This approach warrants the uniqueness of the solution in low-dimensional cases where $p < n_0$. With micro data, it may thus be seen as a particular solution of the synthetic control method. We show that the average treatment on the treated (ATT) parameter can be estimated


with a two-step GMM estimator, where β0 is computed in a first step so that the reweighted control group matches some features of the treated units. A key result is the double robustness of the estimator, as defined by [8]. Specifically, we show that misspecifications in the synthetic control weights do not prevent valid inference if the outcome regression function is linear for the control group. We then turn to the high-dimensional case where p is large, possibly greater than n 0 . This case actually corresponds to the initial set-up of the synthetic control method, and is therefore crucial to take into consideration. We depart from the synthetic control method by introducing an 1 penalization term in the minimization program used to estimate β0 . We thus perform variable selection in a similar way as the Lasso, but differently from the synthetic control method, which relies on the aforementioned optimization over V (leading to overweighting the variables that are good predictors of the outcome and underweighting the others). We also study the asymptotic properties of our estimator. Building on double robustness, we construct an estimator that is immunized against first-step selection mistakes in the sense defined for example by [21] or [19]. This construction requires an extra step, which models the outcome regression function and provides a bias correction, a theme that has also been developed in [6, 14] and [5]. We show that both in the low-and high-dimensional case, the estimator is consistent and asymptotically normal. Consequently, we develop inference based on asymptotic approximation, which can be used in place of permutation tests when randomization of the treatment is not warranted. Apart from its close connection with the synthetic control method, the present paper is related to the literature on treatment effect evaluation through propensity score weighting and covariate balancing. Several efforts have been made to include balance between covariates as an explicit objective for estimation with or without relation to the propensity score (e.g.[27, 28]). Our paper is in particular related to that of [30], who integrate propensity score estimation and covariate balancing in the same framework. We extend their paper by considering the case of high-dimensional covariates. Note that the covariate balancing idea is related to the calibration estimation in survey sampling, see in particular [24]. It also partakes in the econometric literature addressing variable selection, and more generally the use of machine learning tools, when estimating a treatment effect, especially but not exclusively in a high-dimensional framework. The lack of uniformity for inference after a selection step has been raised in a series of papers by [36–38], echoing earlier papers by [35] who put into question the credibility of many empirical policy evaluation results. One recent solution proposed to circumvent this post-selection conundrum is the use of double-selection procedures [11, 19, 20, 25]. For example, [13] highlight the dangers of selecting controls exclusively in their relation to the outcome and propose a three-step procedure that helps selecting more controls and guards against omitted variable biases much more than a simple “postsingle-selection” estimator, as it is usually done by selecting covariates based on either their relation with the outcome or with the treatment variable, but rarely both. 
[25] extends the main approach of [13] by allowing for heterogeneous treatment effects, proposing an estimator that is robust to either model selection mistakes in


propensity scores or in outcome regression. However, [25] proves the root-n consistency of his estimator only when both models hold true (see Assumption 3 and Theorem 3 therein). Our paper is also related to the work of [7], who consider treatment effect estimation under the assumption of a linear conditional expectation for the outcome equation. As we do, they also estimate balancing weights, to correct for the bias arising in this high-dimensional setting, but because of their linearity assumption, they do not require to estimate a propensity score. Their method is then somewhat simpler than ours, but it does not enjoy the double-robustness property of ours. Finally, a recent work by [17] parallel to ours1 suggests a Lasso-type procedure assuming logistic propensity score and linear specification for both treated and untreated items. Their method is similar but still slightly different from ours. The main focus in [17] is to prove asymptotic normality under possibly weaker conditions √ on the sparsity levels of the parameters. Namely, they allow for sparsity up to o( n/ log( p)) or, in some cases up to o(n/ log( p)) where n is the sample size, when the eigenvalues of the population Gram matrix do not depend on n. Similar results can be developed for our method at the expense of additional technical effort. We have opted not to pursue in this direction since it only helps to include relatively non-sparse models that are not of interest for the applications we have in mind. More recently, the papers of [40] and [45] written independently have been brought to our attention. Tan [45] studies double-robustness of a similar estimator but assumes a logistic propensity score model while we leave room for other possibilities. The setting in [40] is not restricted to the logistic model. However, that paper considers propensity score only as a mean to select the variables that should be fully balanced and that will enter in the final propensity score. This leads to a more complex estimation procedure. The authors of these two papers do not seem to be aware of the prior work of ours [16]. The paper is organized as follows. Section 2 introduces the set-up and the identification strategy behind our estimator. Section 3 presents the estimator both in the low-and high-dimensional case and studies its asymptotic properties. Section 4 examines the finite sample properties of our estimator through a Monte Carlo experiment. Section 5 revisits the dataset of [34] to compare our procedure with other highdimensional econometric tools and the effect of the large-scale tobacco control program of [2] for a comparison with synthetic control. Section 6 concludes. All proofs are in the Appendix.

¹ Our results have been presented as early as 2016 at the North American and European Summer Meetings of the Econometric Society (see https://www.econometricsociety.org/sites/default/files/regions/program_ESEM2016.pdf), and again in 2017 during the IAAE Meeting in Sapporo, see [16].
2 Covariate Balancing Weights and Double Robustness

We are interested in the effect of a binary treatment, coded by D = 1 for the treated and D = 0 for the non-treated. We let Y(0) and Y(1) denote the potential outcome under no treatment and under the treatment, respectively. The observed outcome is then Y = DY(1) + (1 − D)Y(0). We also observe a random vector X ∈ R^p of pretreatment characteristics. The quantity of interest is the average treatment effect on the treated (ATT), defined as θ0 = E[Y(1) − Y(0)|D = 1]. Here and in what follows, we assume that the random variables are such that all the considered expectations are finite.

Since no individual is observed in both treatment states, identification of the counterfactual E[Y(0)|D = 1] is achieved through the following two ubiquitous conditions.

Assumption 1 (Nested Support) P[D = 1|X] < 1 almost surely and 0 < P[D = 1] < 1.

Assumption 2 (Mean Independence) E[Y(0)|X, D = 1] = E[Y(0)|X, D = 0].

Assumption 1, a version of the usual common support condition, requires that there exist control units for any possible value of the covariates in the population. Since the ATT is the parameter of interest, we never reconstruct a counterfactual for control units, so P[D = 1|X] > 0 is not required. Assumption 2 states that, conditional on a set of observed confounding factors, the expected potential outcome under no treatment is the same for treated and control individuals. This assumption is a weaker form of the classical conditional independence assumption (Y(0), Y(1)) ⟂ D | X.

In policy evaluation settings, the counterfactual is usually identified and estimated as a weighted average of non-treated unit outcomes:

    θ0 = E[Y(1)|D = 1] − E[W Y(0)|D = 0],    (1)

where W is a random variable called the weight. Popular choices for the weight are the following:

1. Linear regression: W = E[D X^T] E[(1 − D) X X^T]^{-1} X, also referred to as the Oaxaca-Blinder weight [32];
2. Propensity score: W = P[D = 1|X]/(1 − P[D = 1|X]);
3. Matching: W = P(D = 1) f_{X|D=1}(X)/[P(D = 0) f_{X|D=0}(X)];²
4. Synthetic controls: see [2].

² Assuming here that the conditional densities f_{X|D=1} and f_{X|D=0} exist. Of course, P(D = 1) f_{X|D=1}(X)/[P(D = 0) f_{X|D=0}(X)] = P[D = 1|X]/(1 − P[D = 1|X]), but the methods of estimation of W differ in the two cases.
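To make these weighting schemes concrete, here is a small numerical sketch (ours, not taken from the paper). It computes the propensity-score and Oaxaca-Blinder weights on simulated data, the simulated design and the use of scikit-learn's logistic regression being illustrative assumptions, and checks that the reweighted control covariate means match those of the treated group, which is the balancing property formalized in (2) below.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p = 5000, 5
    X = rng.normal(size=(n, p))
    D = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 1))))

    # Propensity-score weight W = e(X) / (1 - e(X)), with e(X) estimated by logistic regression
    e_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
    W_ps = e_hat / (1 - e_hat)

    # Oaxaca-Blinder (linear regression) weight W = E[D X'] E[(1-D) X X']^{-1} X
    a = (D[:, None] * X).mean(axis=0)            # empirical E[D X]
    B = (X * (1 - D)[:, None]).T @ X / n         # empirical E[(1-D) X X']
    W_ob = X @ np.linalg.solve(B, a)

    # Both weights (approximately) balance the covariates: E[D X] vs. E[W (1-D) X]
    for name, W in [("propensity score", W_ps), ("Oaxaca-Blinder", W_ob)]:
        imbalance = (D[:, None] * X).mean(axis=0) - ((W * (1 - D))[:, None] * X).mean(axis=0)
        print(name, np.round(imbalance, 3))

The Oaxaca-Blinder weight balances the empirical moments exactly by construction, while the logistic plug-in weight does so only approximately, which is part of the motivation for estimating weights directly from balancing equations below.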
In this paper, we propose another choice of weight W, which can be seen as a parametric alternative to the synthetic control. An advantage is that it is well-defined whether or not the number of untreated observations n0 is greater than the dimension p of X, whereas the synthetic control estimator is not uniquely defined when n0 > p. Formally, we look for weights W that:

(i) satisfy a balancing condition as in the synthetic control method;
(ii) are positive;
(iii) depend only on the covariates;
(iv) can be used whether n0 > p or n0 ≤ p (high-dimensional regime).

Satisfying a balancing condition means that

    E[D X] = E[W(1 − D)X].    (2)

Up to a proportionality constant, this is equivalent to E[X|D = 1] = E[W X|D = 0]. In words, W balances the first moment of the observed covariates between the treated and the control group. The definition of the observable covariates X is left to the econometrician and can include transformations of the original covariates so as to match more features of their distribution. The idea behind such weights relies on the principle of "covariate balancing" as in, e.g., [30]. The following lemma shows that under Assumption 1 weights satisfying the balancing condition always exist.

Lemma 1 If Assumption 1 holds, the propensity score weight W = P[D = 1|X]/(1 − P[D = 1|X]) satisfies the balancing condition (2).

The proof of this lemma is straightforward by plugging the expression of W into Equation (2) and using the law of iterated expectations. Note that W is not the unique solution of (2). The linear regression weight W = E[D X^T] E[(1 − D) X X^T]^{-1} X also satisfies the balancing condition, but it can be negative and its use is problematic in the high-dimensional regime.

Lemma 1 would suggest solving a binary choice model to obtain estimators of P[D = 1|X] and of the weight W in a first step, and then plugging W into (1) to estimate θ0. However, an inconsistent estimator of the propensity score leads to an inconsistent estimator of θ0 and does not guarantee that the corresponding weight will achieve covariate balancing. Finally, estimation of the propensity score can be problematic when there are very few treated units. For these reasons, we consider another approach where estimation is based directly on the balancing equations:

    E[(D − (1 − D)W) X] = 0.    (3)

An important advantage of this approach over the usual one based on maximum likelihood estimation of the propensity score is its double robustness (for a definition, see, e.g., [8]). Indeed, let W1 denote the weights identified by (3) under a misspecified model on the propensity score. It turns out that if the balancing equations (3) hold for W1, the estimated treatment effect will still be consistent provided that E[Y(0)|X] is linear in X. The formal result is given in Theorem 1 below.
Theorem 1 (Double Robustness) Let Assumptions 1–2 hold and let w : R^p → (0, +∞) be a measurable function such that E[w(X)|Y|] < ∞ and E[w(X)‖X‖_2] < ∞, where ‖·‖_2 denotes the Euclidean norm. Assume the balancing condition

    E[(D − (1 − D)w(X)) X] = 0.    (4)

Then, for any μ ∈ R^p, the ATT θ0 can be expressed as

    θ0 = (1/P(D = 1)) E[(D − (1 − D)w(X))(Y − X^T μ)]    (5)

in each of the following two cases:
1. E[Y(0)|X] = X^T μ0 for some μ0 ∈ R^p;
2. P[D = 1|X] = w(X)/(1 + w(X)).

In (5), the effect of X is taken out from Y in a linear fashion, while the effect of X on D is taken out by re-weighting the control group to obtain the same mean for X. Theorem 1 shows that an estimator based on (5) enjoys the double robustness property. Theorem 1 is similar to the result of [32] for the Oaxaca-Blinder estimator, which is obtained under the assumption that the propensity score follows specifically a log-logistic model in the propensity-score-well-specified case. Theorem 1 is more general. It can be applied under parametric modeling of W as well as in nonparametric settings.

In this paper, we consider a parametric model for w(X). Namely, we assume that P[D = 1|X] = G(X^T β0) for some unknown β0 ∈ R^p and some known strictly increasing cumulative distribution function G. Then w(X) = h(X^T β0) with h = G/(1 − G), and β0 is identified by the balancing condition

    E[(D − (1 − D)h(X^T β0)) X] = 0.    (6)

Clearly, h is a positive strictly increasing function, which implies that its primitive H is strictly convex. A classical example is to take G as the c.d.f. of the logistic distribution, in which case h(u) = H(u) = exp(u) for u ∈ R. The strict convexity of H implies that β0 is the unique solution of a strictly convex program:

    β0 = arg min_{β ∈ R^p} E[(1 − D)H(X^T β) − D X^T β].    (7)

This program is well-defined whether or not P[D = 1|X] = G(X^T β0). Note also that definitions (6) and (7) are equivalent provided that E[h(X^T β)‖X‖_2] < ∞ for β in a vicinity of β0. Indeed, it follows from the dominated convergence theorem that, under this assumption and due to the fact that any convex function is locally Lipschitz, differentiation under the expectation sign is legitimate in (7).
We are now ready to state the main identification theorem justifying the use of the ATT estimation methods developed below. It is a straightforward corollary of Theorem 1.

Theorem 2 (Parametric Double Robustness) Let Assumptions 1–2 hold. Assume that β0 ∈ R^p and a positive strictly increasing function h are such that E[h(X^T β0)‖X‖_2] < ∞, E[h(X^T β0)|Y|] < ∞ and condition (6) holds. Then, for any μ ∈ R^p, the ATT θ0 satisfies

    θ0 = (1/P(D = 1)) E[(D − (1 − D)h(X^T β0))(Y − X^T μ)],    (8)

in each of the following two cases.
1. There exists μ0 ∈ R^p such that E[Y(0)|X] = X^T μ0.
2. P[D = 1|X] = G(X^T β0) with G = h/(1 + h).

At this stage, the parameter μ in Equation (8) does not play any role and can, for example, be zero. However, we will see below that in the high-dimensional regime, choosing μ carefully is crucial to obtain an "immunized" estimator of θ0 that enjoys the desirable asymptotic properties.
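To illustrate case 1 of Theorem 2 numerically, the following sketch (our own illustration, not the authors' code; the data-generating process, the sample size and the use of scipy are assumptions made for the example) deliberately misspecifies the propensity score while keeping E[Y(0)|X] linear, and shows that the weighting estimator based on (8) with μ = 0 still centers on the true ATT.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, p = 20000, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

    # True propensity is NOT of the form G(X'beta): the logistic model used below is misspecified.
    p_true = 0.1 + 0.8 * (np.abs(X[:, 1]) < 1)
    D = rng.binomial(1, p_true)
    # The outcome regression IS linear (case 1 of Theorem 2); constant treatment effect theta0.
    mu0, theta0 = np.array([1.0, 2.0, -1.0, 0.5]), 1.0
    Y = X @ mu0 + theta0 * D + rng.normal(size=n)

    # Balancing step (7) with the logistic choice H(u) = h(u) = exp(u)
    obj = lambda b: np.mean((1 - D) * np.exp(X @ b) - D * (X @ b))
    beta_hat = minimize(obj, np.zeros(p), method="BFGS").x

    # Plug-in ATT (8) with mu = 0: consistent despite the misspecified propensity model
    w = np.exp(X @ beta_hat)
    theta_hat = np.mean((D - (1 - D) * w) * Y) / np.mean(D)
    print(round(theta_hat, 3), "(true ATT =", theta0, ")")

Because the fitted weights satisfy the empirical balancing equations for the included covariates, the linear part X^T μ0 of the outcome cancels between the two groups, which is exactly the mechanism behind Theorem 1.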

3 A Parametric Alternative to Synthetic Control

We now assume that we observe a sample (D_i, X_i, Y_i), i = 1, ..., n, of i.i.d. random variables with the same distribution as (D, X, Y).

3.1 Estimation With Low-Dimensional Covariates

Consider first an asymptotic regime where the dimension p of the covariates is fixed while the sample size n tends to infinity. We call it the low-dimensional regime. Define an estimator of β0 via the empirical counterpart of (7):

    β̂_ld ∈ arg min_{β ∈ R^p} (1/n) Σ_{i=1}^n [(1 − D_i)H(X_i^T β) − D_i X_i^T β].    (9)

Next, we plug β̂_ld into the empirical counterpart of (8) to obtain the following estimator of θ0:

    θ̂_ld := (1/Σ_{i=1}^n D_i) Σ_{i=1}^n [D_i − (1 − D_i)h(X_i^T β̂_ld)] Y_i.
Note that if X includes the intercept, θ̂_ld satisfies the desirable property of location invariance, namely it does not change if we replace all Y_i by Y_i + c, for any c ∈ R.

Set Z := (D, X, Y), Z_i := (D_i, X_i, Y_i) and introduce the function

    g(Z, θ, (β, μ)) := [D − (1 − D)h(X^T β)][Y − X^T μ] − Dθ.

Then the estimator θ̂_ld satisfies

    (1/n) Σ_{i=1}^n g(Z_i, θ̂_ld, (β̂_ld, 0)) = 0.    (10)

This estimator is a two-step GMM. It is consistent and asymptotically normal under mild regularity conditions, with asymptotic variance E[g²(Z, θ0, (β0, μ0))]/E(D)², where

    μ0 = E[h′(X^T β0) X X^T | D = 0]^{-1} E[h′(X^T β0) X Y | D = 0].    (11)

This can be shown by standard techniques (see, e.g., Sect. 6 in [39]). Notice that since h′(X^T β0) > 0, the vector μ0 is the coefficient of the weighted population regression of Y on X for the control group. This observation is useful for the derivation of the "immunized" estimator in the high-dimensional case, to which we now turn.
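As an illustration of the two-step procedure (9)–(10), here is a minimal sketch in Python (our own, under the logistic choice h(u) = H(u) = exp(u); the numerical optimizer and the finite-sample variance formula below are the obvious plug-in versions and are not taken verbatim from the paper).

    import numpy as np
    from scipy.optimize import minimize

    def att_low_dim(Y, D, X):
        """Two-step GMM estimator (9)-(10) with h(u) = H(u) = exp(u), plus a plug-in standard error."""
        n, p = X.shape
        # Step 1: balancing estimate of beta0, Eq. (9)
        obj = lambda b: np.mean((1 - D) * np.exp(X @ b) - D * (X @ b))
        beta = minimize(obj, np.zeros(p), method="BFGS").x
        w = np.exp(X @ beta)
        # Step 2: plug-in ATT, Eq. (10)
        theta = np.mean((D - (1 - D) * w) * Y) / np.mean(D)
        # Weighted least-squares coefficient mu0 of Eq. (11); here h'(u) = exp(u) as well
        W0 = (1 - D) * w
        mu = np.linalg.solve((X * W0[:, None]).T @ X, (X * W0[:, None]).T @ Y)
        # Plug-in version of the asymptotic variance E[g^2(Z, theta0, (beta0, mu0))] / E(D)^2
        g = (D - (1 - D) * w) * (Y - X @ mu) - D * theta
        se = np.sqrt(np.mean(g ** 2) / np.mean(D) ** 2 / n)
        return theta, se

A 95% confidence interval is then theta ± 1.96·se, which is the asymptotic approximation used in place of permutation tests later in the paper.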

3.2 Estimation With High-Dimensional Covariates

We now consider that p may grow with n, with possibly p ≫ n. This can be of interest in several situations. First, in macroeconomic problems, n is actually small and p may easily be of comparable size. For example, in the tobacco control program application of [2], the control group size is limited because the observational unit is the state, but many pre-treatment outcomes are included among the covariates. Section 5.2 revisits this example. Second, a researcher may want to consider a flexible form for the weights by including transformations of the covariates. For instance, one may want to interact categorical variables with other covariates or consider, e.g., different powers of continuous variables if one wants to allow for flexible non-linear effects. See Sect. 5.1 for an application considering such transformations. Third, one may want to balance not only the first moments of the distribution of the covariates but also the second moments, the covariances, the third moments and so on, to make the distribution more similar between the treated and the control group. In this case, high-dimensional settings seem to be of interest as well.

In the high-dimensional regime, the GMM estimator in (9) is, in general, not consistent. We therefore propose an alternative Lasso-type method by adding to (9) an ℓ1 penalization term:
    β̂ ∈ arg min_{β ∈ R^p} ( (1/n) Σ_{i=1}^n [(1 − D_i)H(X_i^T β) − D_i X_i^T β] + λ Σ_{j=1}^p ψ_j |β_j| ).    (12)

Here, λ > 0 is an overall penalty parameter set to dominate the noise in the gradient of the objective function, and (ψ_j)_{j=1,...,p} are covariate-specific penalty loadings set so as to grant good asymptotic properties. The penalty loadings can be adjusted using the algorithm presented in Appendix 7. This type of penalization offers several advantages. First, the program (12) has almost surely a unique solution when the entries of X have a continuous distribution, cf. Lemma 5 in [46], which cannot be granted for its non-penalized version (9). Second, it yields a sparse solution, in the sense that some entries of the vector of estimated coefficients are set exactly to zero if the penalty is large enough, which is not the case for estimators based on ℓ2 penalization. The ℓ0-penalized estimator shares the same sparsity property but is very costly to compute, whereas (12) can be easily solved by computationally efficient methods, see, e.g., [29]. The use of covariate-specific penalty loadings goes back to [15]; the particular choice of penalty loadings that we consider below is inspired by [10]. A drawback of penalizing by the ℓ1-norm is that it induces a bias in the estimation of the coefficients. But this is not an issue here since we are ultimately interested in estimating θ0 rather than β0. The solution β̂ of (12) only plays the role of a pilot estimator. The estimator β̂ is consistent as n tends to infinity under assumptions analogous to those used in [15, 18] for the Lasso with quadratic loss, see Theorem 4 below.

As in the low-dimensional case (see Sect. 3.1), one is then tempted to consider the plug-in estimator for the ATT based on Equation (8) with μ = 0:

    θ̃ = (1/Σ_{i=1}^n D_i) Σ_{i=1}^n [D_i − (1 − D_i)h(X_i^T β̂)] Y_i.    (13)

We refer to this estimator as the naive plug-in estimator. However, as mentioned above, the Lasso estimator β̂ of the nuisance parameter β0 is not asymptotically unbiased. In the high-dimensional regime where p grows with n, naive plug-in estimators suffer from a regularization bias and may not be asymptotically normal with zero mean, as illustrated for example in [13, 19, 21]. Therefore, following the general approach of [19, 21], we develop an immunized estimator that, at the first order, is insensitive to β̂. We show that this estimator is asymptotically normal with mean zero and an asymptotic variance that does not depend on the properties of the pilot estimator β̂. The idea is to choose the parameter μ in (8) such that the expected gradient of the estimating function g(Z, θ, (β, μ)) with respect to β is zero when taken at (θ0, β0). This holds for μ = μ0, where μ0 satisfies

    E[(1 − D) h′(X^T β0)(Y − X^T μ0) X] = 0.    (14)
Notice that if the corresponding matrix is invertible we get the low-dimensional solution (11). Clearly, μ0 depends on unknown quantities and we need to estimate it. To this end, observe that Equation (14) corresponds to the first-order condition of a weighted least-squares problem,³ namely

    μ0 = arg min_{μ ∈ R^p} E[(1 − D) h′(X^T β0)(Y − X^T μ)²].    (15)

Since X is high-dimensional we cannot estimate μ0 via the empirical counterpart of (15). Instead, we consider a Lasso-type estimator

    μ̂ ∈ arg min_{μ ∈ R^p} ( (1/n) Σ_{i=1}^n (1 − D_i) h′(X_i^T β̂)(Y_i − X_i^T μ)² + λ̃ Σ_{j=1}^p ψ̃_j |μ_j| ).    (16)

Here, similarly to (12), λ̃ > 0 is an overall penalty parameter set to dominate the noise in the gradient of the objective function and (ψ̃_j)_{j=1,...,p} are covariate-specific penalty loadings. Importantly, by estimating μ0 we do not introduce, at least asymptotically, an additional source of variability since, by construction, the gradient of the moment condition (8) (if we consider it as a function of β rather than of β0) with respect to (β, μ) vanishes at the point (β0, μ0). Finally, the immunized ATT estimator is defined as

    θ̂ := (1/Σ_{i=1}^n D_i) Σ_{i=1}^n [D_i − (1 − D_i)h(X_i^T β̂)](Y_i − X_i^T μ̂).

Intuitively, the immunized procedure corrects the naive plug-in estimator in the case where the balancing program has "missed" a covariate that is very important to predict the outcome:

    θ̂ = θ̃ − [ (1/n_1) Σ_{i: D_i=1} X_i − (1/n_1) Σ_{i: D_i=0} h(X_i^T β̂) X_i ]^T μ̂,

where n_1 is the number of treated observations. This has a flavor of the Frisch-Waugh-Lovell partialling-out procedure for model selection, as observed by [13] and further developed in [21].

To summarize, the estimation procedure in the high-dimensional regime consists of the three following steps. Each step is computationally simple as it needs at most to minimize a convex function (an illustrative implementation sketch is given after the three steps):

³ The assumptions under which we prove the results below guarantee that μ0 defined here is unique. Extension to the case of multiple solutions can be worked out as well. It is technically more involved but in our opinion does not add much to the understanding of the problem.
1. (Balancing step.) For a given level of penalty λ and positive covariate-specific penalty loadings (ψ_j)_{j=1}^p, compute β̂ defined by

       β̂ ∈ arg min_{β ∈ R^p} ( (1/n) Σ_{i=1}^n [(1 − D_i)H(X_i^T β) − D_i X_i^T β] + λ Σ_{j=1}^p ψ_j |β_j| ).    (17)

2. (Immunization step.) For a given level of penalty λ̃ and covariate-specific penalty loadings (ψ̃_j)_{j=1}^p, and using β̂ obtained in the previous step, compute μ̂ defined by

       μ̂ ∈ arg min_{μ ∈ R^p} ( (1/n) Σ_{i=1}^n (1 − D_i) h′(X_i^T β̂)(Y_i − X_i^T μ)² + λ̃ Σ_{j=1}^p ψ̃_j |μ_j| ).    (18)

3. (ATT estimation.) Estimate the ATT using the immunized estimator:

       θ̂ = (1/Σ_{i=1}^n D_i) Σ_{i=1}^n [D_i − (1 − D_i)h(X_i^T β̂)](Y_i − X_i^T μ̂).    (19)
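The following sketch implements the three steps in Python. It is a minimal illustration of ours, not the authors' code: it fixes the logistic choice h(u) = H(u) = exp(u), solves both ℓ1-penalized programs by a plain proximal-gradient loop with a crude constant step size, and takes the penalty levels and loadings as given inputs rather than computing them as in Assumption 8 and Appendix 7.

    import numpy as np

    def soft_threshold(z, t):
        # Elementwise soft-thresholding: proximal operator of the weighted l1 norm
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_grad(grad, x0, penalties, step=0.05, n_iter=5000):
        # Basic ISTA loop: gradient step on the smooth part, prox step on the l1 part
        x = x0.copy()
        for _ in range(n_iter):
            x = soft_threshold(x - step * grad(x), step * penalties)
        return x

    def immunized_att(Y, D, X, lam, lam_mu, psi, psi_mu):
        """Steps (17)-(19) with h(u) = H(u) = exp(u); penalty levels and loadings supplied by the user."""
        n, p = X.shape
        # Step 1 (balancing): gradient of the smooth part of (17)
        grad_bal = lambda b: X.T @ ((1 - D) * np.exp(X @ b) - D) / n
        beta = prox_grad(grad_bal, np.zeros(p), lam * psi)
        w = np.exp(X @ beta)
        # Step 2 (immunization): weighted Lasso (18) on the control group, weights h'(X'beta) = exp(X'beta)
        W0 = (1 - D) * w
        grad_out = lambda m: -2 * X.T @ (W0 * (Y - X @ m)) / n
        mu = prox_grad(grad_out, np.zeros(p), lam_mu * psi_mu)
        # Step 3: immunized ATT (19), with the plug-in standard error of Theorem 3
        theta = np.mean((D - (1 - D) * w) * (Y - X @ mu)) / np.mean(D)
        g = (D - (1 - D) * w) * (Y - X @ mu) - D * theta
        se = np.sqrt(np.mean(g ** 2) / np.mean(D) ** 2 / n)
        return theta, se, beta, mu

In practice a line search (or coordinate descent) should replace the fixed step size, and the loadings ψ_j, ψ̃_j would be obtained from the iterative algorithm of Appendix 7.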

3.3 Asymptotic Properties

The current framework poses several challenges to achieving asymptotically valid inference. First, X can be high-dimensional, since we allow for p ≫ n provided that sparsity conditions are met (see Assumption 3 below). Second, the ATT estimation is affected by the estimation of the nuisance parameters (β0, μ0) and we wish to neutralize their influence. Finally, the ℓ1-penalized estimators we use for β0 and μ0 are not conventional. The estimator of β0 relies on a non-standard loss function and, to our knowledge, the properties of β̂ that we need are not available in the literature, cf., e.g., [26] and the references therein. The estimator of μ0 is close to the usual Lasso except for the weights that depend on β̂. In general, discrepancy in the weights can induce an extra bias. Thus, it is not granted that such an estimator achieves properties close to the Lasso. We show below that this holds true under our assumptions. Our proof techniques may be of interest for other problems of similar type.

Let η = (β, μ) denote the vector of two nuisance parameters and recall that Z = (D, X, Y). In what follows, we write for brevity g(Z, θ, η) instead of g(Z, θ, (β, μ)). In particular, for the value η0 := (β0, μ0) we have

    E g(Z, θ0, η0) = 0.    (20)
Hereafter, the notation a ≲ b means that a ≤ cb for some constant c > 0 independent of the sample size n. We denote by Φ and Φ⁻¹ the cumulative distribution function and the quantile function of a standard normal random variable, respectively. We use the symbol E_n(·) to denote the empirical average, that is, E_n(a) = n⁻¹ Σ_{i=1}^n a_i for a = (a_1, ..., a_n). Finally, for a vector δ = (δ_1, ..., δ_p) ∈ R^p and a subset S ⊆ {1, ..., p} we consider the restricted vector δ_S = (δ_j I(j ∈ S))_{j=1}^p, where I(·) denotes the indicator function, and we set ‖δ‖_0 := Card{1 ≤ j ≤ p : δ_j ≠ 0}, ‖δ‖_1 := Σ_{j=1}^p |δ_j|, ‖δ‖_2 := (Σ_{j=1}^p δ_j²)^{1/2} and ‖δ‖_∞ := max_{j=1,...,p} |δ_j|.

We now state the assumptions used to prove the asymptotic results.

Assumption 3 (Sparsity restrictions) The nuisance parameter is sparse in the following sense: ‖β0‖_0 ≤ s_β, ‖μ0‖_0 ≤ s_μ for some integers s_β, s_μ ∈ [1, p].

Assumption 4 (Conditions on function h) Function h is increasing, twice continuously differentiable on R and (i) the second derivative h″ is Lipschitz on any compact subset of R, (ii) either inf_{u∈R} h′(u) ≥ c_2 or ‖β0‖_1 ≤ c_3, where c_2 > 0 and c_3 > 0 are constants independent of n.

We also need some conditions on the distribution of the data. The random vectors Z_i = (D_i, X_i, Y_i) are assumed to be i.i.d. copies of Z = (D, X, Y) with D ∈ {0, 1}, X ∈ R^p and Y ∈ R. Throughout the paper, we assume that Z depends on n, so that in fact we deal with a triangular array of random vectors. This dependence on n is needed for rigorously stating asymptotic results since we consider the setting where the dimension p = p(n) is a function of n. Thus, in what follows Z is indexed by n but for brevity we typically suppress this dependence in the notation. On the other hand, all constants denoted by c (with various indices) and K appearing below are independent of n.

Assumption 5 (Conditions on the distribution of data) The random vectors Z_i are i.i.d. copies of Z = (D, X, Y) with D ∈ {0, 1}, X ∈ R^p and Y ∈ R satisfying (6), (14), (20) and the following conditions:
(i) There exist constants K > 0 and c_1 > 0 such that max{‖X‖_∞, |X^T β0|, |Y − X^T μ0|} ≤ K (a.s.), and 0 < P(D = 1) < 1.
(ii) Non-degeneracy conditions. There exists c_1 > 0 such that for all j = 1, ..., p,
    min{ E[(Y − X^T μ0 − θ0)² | D = 1], E[h²(X^T β0)(Y − X^T μ0)² | D = 0],
         E[(X^T e_j)² | D = 1], E[h²(X^T β0)(X^T e_j)² | D = 0],
         E[(h′(X^T β0))²(Y − X^T μ0)²(X^T e_j)² | D = 0] } ≥ c_1,

where e_j denotes the jth canonical basis vector in R^p.

Assumption 6 (Condition on the population Gram matrix of the control group) The population Gram matrix of the control group Σ := E((1 − D) X X^T) is such that

    min_{v ∈ R^p \ {0}} (v^T Σ v)/‖v‖_2² ≥ κ_Σ,    (21)

where the minimal eigenvalue κ_Σ is a positive number.

Note that, in view of Assumption 5, κ_Σ is uniformly bounded: κ_Σ ≤ e_1^T Σ e_1 ≤ K².

Remark 1 If (i) c_p := sup_x P[D = 1|X = x] < 1 and (ii) E[X X^T] is nonsingular, with smallest eigenvalue equal to λ_min, then Assumption 6 holds with κ_Σ = (1 − c_p) λ_min. Condition (i) is a slight reinforcement of Assumption 1, which holds for instance (since X is bounded) if x ↦ P[D = 1|X = x] is continuous.

Assumption 7 (Dimension restrictions) The integers p, s = max(s_β, s_μ) ∈ [1, p/2] and the value κ_Σ > 0 are functions of n satisfying the following growth conditions:
(i) s² log(p)/(κ_Σ² √n) → 0 as n → ∞,
(ii) s/κ_Σ = o(p) as n → ∞,
(iii) log(p) = o(n^{1/3}) as n → ∞.

Finally, we define the penalty loadings for the estimation of the nuisance parameters. The gradients of the estimating function with respect to the nuisance parameters are

    ∇_μ g(Z, θ, η) = −[D − (1 − D)h(X^T β)] X,
    ∇_β g(Z, θ, η) = −(1 − D) h′(X^T β)[Y − X^T μ] X.

For each i = 1, ..., n, we define the random vector U_i ∈ R^{2p} with entries corresponding to these gradients: for 1 ≤ j ≤ p,

    U_{i,j} := −[D_i − (1 − D_i)h(X_i^T β0)] X_{i,j},
    U_{i,j+p} := −(1 − D_i) h′(X_i^T β0)[Y_i − X_i^T μ0] X_{i,j},

where X_{i,j} is the jth entry of X_i.
Assumption 8 (Penalty Loadings) Let c > 1 and γ ∈ (0, 2p) be such that log(1/γ) ≲ log(p) and γ = o(1) as n → ∞. The penalty level and loadings for the estimation of β0 satisfy

    λ := c Φ⁻¹(1 − γ/2p)/√n,
    ψ_{j,max} ≥ ψ_j ≥ ( n⁻¹ Σ_{i=1}^n U_{i,j}² )^{1/2}  for j = 1, ..., p.

The penalty level and loadings for the estimation of μ0 satisfy

    λ̃ := 2c Φ⁻¹(1 − γ/2p)/√n,
    ψ̃_{j,max} ≥ ψ̃_j ≥ ( n⁻¹ Σ_{i=1}^n U_{i,j+p}² )^{1/2}  for j = 1, ..., p.

Here, the upper bounds on the loadings are

    ψ_{j,max} = ( (1/n) Σ_{i=1}^n max_{|u| ≤ K} [(1 − D_i)h(u) − D_i]² X_{i,j}² )^{1/2},
    ψ̃_{j,max} = ( (1/n) Σ_{i=1}^n (1 − D_i) max_{|u| ≤ K} |h′(u)|² [Y_i − u]² X_{i,j}² )^{1/2},

representing feasible majorants of (n⁻¹ Σ_{i=1}^n U_{i,j}²)^{1/2} under our assumptions.

The values n⁻¹ Σ_{i=1}^n U_{i,j}² depend on the unknown parameters. Thus, we cannot choose ψ_j, ψ̃_j equal to these values, but we can take them equal to the upper bounds ψ_j = ψ_{j,max} and ψ̃_j = ψ̃_{j,max}. A more flexible iterative approach to choosing feasible loadings is discussed in Appendix 7.

The following theorem constitutes the main asymptotic result of the paper.

Theorem 3 (Asymptotic Normality of the Immunized Estimator) Let Assumptions 3–8 hold. Then the immunized estimator θ̂ defined in Equation (19) satisfies

    σ̂⁻¹ √n (θ̂ − θ0) →_D N(0, 1) as n → ∞,

where

    σ̂² := ( (1/n) Σ_{i=1}^n g²(Z_i, θ̂, η̂) ) ( (1/n) Σ_{i=1}^n D_i )⁻²

is a consistent estimator of the asymptotic variance, and →_D denotes convergence in distribution.

The proof is given in Appendix 8. An important point underlying the root-n convergence and asymptotic normality of θ̂ is the fact that the expected gradient of g with
respect to η is zero at η0. In granting this property, we follow the general methodology of estimation in the presence of high-dimensional nuisance parameters developed in [10, 12, 13, 21], among other papers by the same authors. The second important ingredient of the proof is to ensure that the estimator η̂ converges fast enough to the nuisance parameter η0. Its rate of convergence is given in the following theorem.

Theorem 4 (Nuisance Parameter Estimation) Under Assumptions 3–8, we have, with probability tending to 1 as n → ∞,

    ‖β̂ − β0‖_1 ≲ (s_β/κ_Σ) √(log(p)/n),    (22)

    ‖μ̂ − μ0‖_1 ≲ (s/κ_Σ) √(log(p)/n).    (23)

The proof is given in Appendix 8. Theorem 4, together with the fact that θ̂ is immunized, implies that β̂ and μ̂ have no asymptotic effect on θ̂. In related work, conditions similar to (22) and (23) have been imposed rather than established. We refer in particular to (36) of [21] and Assumption 3-(ii) in [25]. Note, on the other hand, that contrary to [25] (see Assumption 3-(i)), we do not require that both x ↦ G(x^T β̂) and x ↦ x^T μ̂ are consistent for x ↦ P[D = 1|X = x] and x ↦ E[Y(0)|X = x]. Theorem 3 only requires (20) to hold. By Theorem 2 above, this is the case if either P[D = 1|X] = G(X^T β0) or E[Y(0)|X] = X^T μ0, in which case either x ↦ G(x^T β̂) or x ↦ x^T μ̂ is consistent, but not necessarily both.

Remark 2 The rate in (23) depends not only on the sparsity index s_μ of μ0, but on the maximum s = max(s_β, s_μ). This is natural since one should account for the accuracy of the preliminary estimator β̂ used to obtain μ̂.

Remark 3 Inspection of the proof shows that Theorem 4 remains valid under weaker assumptions, namely, we can replace Assumption 7 on s, p, κ_Σ by the condition (55). We also note that Assumption 7(i) in Theorem 3 can be modified to (s/√n) log(p) → 0 if κ_Σ is a constant independent of n, as it is assumed, for example, in [17]. Such a modification would require a substantially more involved proof but only improves upon considering relatively non-sparse cases. This does not seem of much added value for using the methods in practice when the sparsity index is typically small. Moreover, in the high-dimensional scenario we find it more important to specify the dependency of the growth conditions on κ_Σ.

Remark 4 The accuracy of β̂, μ̂ and ultimately θ̂ may be affected by the exact value of c_p := sup_x P[D = 1|X = x]. When c_p is close to 1, our bounds on the performance of the estimators deteriorate, see Theorem 4 and Remark 1. Such a phenomenon could be expected. If P[D = 1|X = x] gets closer to 1, it becomes more difficult to estimate the distribution of Y conditional on D = 0 and X, which is required to estimate E[Y(0)|D = 1] and, in turn, θ0. Related to this, [31] show that, in the
absence of further restrictions on x ↦ P[D = 1|X = x] and x ↦ E(Y(0)|X = x), the condition sup_x P[D = 1|X = x] < 1 is necessary for root-n consistency in the problem of estimation of the ATT (see Theorem 4.1 in [31]).⁴

4 Monte Carlo Simulations

The aim of this experiment is twofold: to illustrate the better properties of the immunized estimator over the naive plug-in, and to compare it with other estimators. In particular, we compare it with a similar estimator proposed by [25].

We consider the following DGP. The p covariates are distributed as X ∼ N(0, Σ), where the (i, j)th element of the covariance matrix satisfies Σ_{i,j} = 0.5^{|i−j|}. The treatment equation follows a logit model, P(D = 1|X) = Λ(X^T γ0) with Λ(u) = 1/(1 + exp(−u)). The potential outcomes satisfy Y(0) = exp(X^T μ0) + ε, Y(1) = Y(0) + ζ0 X^T γ0, where ζ0 is a constant, ε ∼ N(0, 1) and ε is independent of (D, X). We assume the following form for the jth entries of γ0 and μ0:

    γ0j = ργ (−1)^j / j²  if j ≤ 10,  and 0 otherwise;
    μ0j = ρμ (−1)^j / j²  if j ≤ 10,  ρμ (−1)^{j+1} / (p − j + 1)²  if j ≥ p − 9,  and 0 otherwise,

where ργ and ρμ are positive constants. We are thus in a strictly sparse setting for both equations, where only ten covariates play a role in the treatment assignment and twenty in the outcome. Figure 1 depicts the precise pattern of the corresponding coefficients for p = 30. The constants ργ and ρμ fix the signal-to-noise ratio. More precisely, ργ is set so that R² = 0.3 in the latent model for D, and ρμ is set so that R² = 0.8 in the model for Y(0). Finally, we let ζ0 = [V(exp(X^T μ0))/(5 V(Y(0)))]^{1/2}. This implies that the variance of the individual treatment effect ζ0 X^T γ0 is one fifth of the variance of Y(0). In this set-up, the ATT satisfies E[Y(1) − Y(0)|D = 1] = ζ0 E[Z Λ(Z)]/E[Λ(Z)], with Z ∼ N(0, γ0^T Σ γ0). We compute it using Monte Carlo simulations.

We consider several estimators of the ATT. The first is the naive plug-in estimator defined in (13). Next, we consider our proposed estimator defined in (19), with H(x) = h(x) = exp(x). We also consider the estimator proposed by [25]. This estimator is also defined by (19), but β̂ and μ̂ therein are obtained by a logit Lasso and an unweighted Lasso regression, respectively. For these three estimators, the penalty loadings for the first-step estimators are set as in Appendix 7. The last estimator, called the oracle hereafter, is our low-dimensional estimator defined by (10),⁴

⁴ Their result is on the average treatment effect (ATE), rather than on the ATT, but their proof can be simply adapted to the ATT.
Fig. 1 Sparsity patterns of γ0 (crosses) and μ0 (circles), plotted as coefficient value against the index j, for p = 30. Notes In this example, ργ = ρμ = 1. The central region of the graph represents the coefficients of γ0 and μ0 associated with variables that play a role in neither the selection equation nor the outcome equation. The left region shows the coefficients associated with variables that are important for both equations. In the right region, only the entries of μ0 differ from 0, meaning that these variables determine the outcome equation but not the selection equation

where in the first step we only include the ten covariates affecting the treatment.

For all estimators, we construct 95% confidence intervals on the ATT using the normal approximation and asymptotic variance estimators. We estimate the asymptotic variance of the naive plug-in estimator acting as if we were in a low-dimensional setting. This means that this estimator would have an asymptotic coverage of 0.95 if p remained fixed.

In our DGP, the variables X_j with j ≥ p − 9 matter in the outcome equation but are irrelevant in the treatment assignment rule. Given that the propensity score is correctly specified, the four estimators are consistent. The oracle estimator should be the most accurate since it incorporates the information on which covariates matter in the propensity score. Because the balancing program misses some covariates that are relevant for the outcome variable (namely, the X_j with j ≥ p − 9), the naive plug-in estimator is expected to be asymptotically biased. The immunized procedure should correct for this bias.

Table 1 displays the results. We consider several values of n and p such that the dimension p is relatively high compared to n. For every pair (n, p), we report the root mean squared error (RMSE), the bias and the coverage rate of the confidence intervals associated with each estimator. Our estimator performs well in all settings, with a correct coverage rate and often the lowest RMSE over all estimators. The oracle always has a coverage rate close to 0.95 and a bias very close to 0, as one could expect, but it does not always exhibit the lowest RMSE. This is because, intuitively,

Table 1 Monte-Carlo simulations

                       n = 500                  n = 1,000                n = 2,000
                  RMSE    Bias    CR       RMSE    Bias    CR       RMSE    Bias    CR
p = 50
  Naive plug-in   0.312   0.264   0.62     0.228   0.195   0.601    0.169   0.144   0.608
  Immunized       0.186   0.102   0.872    0.121   0.059   0.907    0.084   0.03    0.907
  Farrell         0.197   0.117   0.857    0.132   0.075   0.885    0.093   0.046   0.887
  Oracle          0.202  -0.017   0.929    0.143  -0.025   0.938    0.105  -0.03    0.935
p = 200
  Naive plug-in   0.318   0.274   0.587    0.238   0.208   0.557    0.179   0.156   0.544
  Immunized       0.185   0.108   0.883    0.125   0.066   0.897    0.085   0.037   0.911
  Farrell         0.195   0.121   0.87     0.135   0.081   0.873    0.095   0.052   0.882
  Oracle          0.194  -0.023   0.936    0.141  -0.029   0.946    0.108  -0.034   0.934
p = 500
  Naive plug-in   0.339   0.297   0.532    0.247   0.216   0.522    0.185   0.165   0.494
  Immunized       0.202   0.128   0.841    0.128   0.07    0.881    0.087   0.044   0.912
  Farrell         0.212   0.141   0.81     0.14    0.086   0.854    0.098   0.059   0.866
  Oracle          0.205  -0.019   0.927    0.146  -0.033   0.931    0.105  -0.032   0.938
p = 1,000
  Naive plug-in   0.345   0.309   0.485    0.258   0.23    0.478    0.194   0.175   0.449
  Immunized       0.199   0.133   0.835    0.135   0.082   0.862    0.09    0.051   0.885
  Farrell         0.211   0.146   0.807    0.146   0.097   0.823    0.101   0.066   0.853
  Oracle          0.194  -0.017   0.947    0.14   -0.022   0.943    0.103  -0.026   0.939

Notes RMSE and CR stand respectively for root mean squared error and coverage rate. The nominal coverage rate is 0.95. The results are based on 10,000 simulations for each (n, p). The naive plug-in and immunized estimators are defined in (13) and (19), respectively. "Farrell" is the estimator considered by [25]. "Oracle" is defined by (10), where in the first step we only include the ten covariates affecting the treatment

the immunized estimator and that of [25] trade off variance with some bias in their first steps, sometimes resulting in a slightly lower RMSE for the final estimator. Note that the bias of the final estimator, though asymptotically negligible, results in a slight undercoverage of the confidence intervals. Yet, even with n = 500 and p = 1,000, the coverage rate is still 0.84, which is much higher than the value 0.485 for the naive plug-in estimator. The latter exhibits a large bias and a low coverage rate even for p = 50 and n = 2,000. This shows the importance of correcting for the bias of the first-step estimator, even when p/n is quite small. Finally, the estimator of [25] exhibits a performance similar to the immunized estimator, though it displays a slightly larger RMSE and a smaller coverage rate with this particular DGP.
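For concreteness, the following short sketch (ours; the choice ργ = ρμ = 1 follows the note of Fig. 1 rather than the R²-based calibration described above, which is omitted here, and sample variances replace population variances in the definition of ζ0) generates one sample from the DGP of this section.

    import numpy as np

    def simulate_dgp(n, p, rho_gamma=1.0, rho_mu=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Covariates: X ~ N(0, Sigma) with Sigma_{ij} = 0.5^{|i-j|}
        Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        # Sparse coefficient vectors gamma0 and mu0
        j = np.arange(1, p + 1)
        gamma0 = np.where(j <= 10, rho_gamma * (-1.0) ** j / j ** 2, 0.0)
        mu0 = np.where(j <= 10, rho_mu * (-1.0) ** j / j ** 2, 0.0)
        mu0 = mu0 + np.where(j >= p - 9, rho_mu * (-1.0) ** (j + 1) / (p - j + 1) ** 2, 0.0)
        # Logit treatment assignment and potential outcomes
        D = rng.binomial(1, 1 / (1 + np.exp(-X @ gamma0)))
        y0 = np.exp(X @ mu0) + rng.normal(size=n)
        zeta0 = np.sqrt(np.var(np.exp(X @ mu0)) / (5 * np.var(y0)))   # variance ratio of 1/5
        y1 = y0 + zeta0 * (X @ gamma0)
        Y = D * y1 + (1 - D) * y0
        return Y, D, X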
5 Empirical Applications

5.1 Job Training Program

We revisit [34], who examines the ability of econometric methods to recover the causal effect of employment programs.⁵ This dataset was first built to assess the impact of the National Supported Work (NSW) program. The NSW is a transitional, subsidized work experience program targeted towards people with longstanding employment problems: ex-offenders, former drug addicts, women who were long-term recipients of welfare benefits and school dropouts. The quantity of interest is the average effect of the program for the participants on 1978 yearly earnings. The treated group gathers people who were randomly assigned to this program from the population at risk (with a sample size of n_1 = 185). Two control groups are available. The first one comes from the Panel Study of Income Dynamics (PSID) (sample size n_0 = 2,490). The second one comes from the experiment (sample size n_0 = 260) and is therefore directly comparable to the treated group. It provides us with a benchmark for the ATT. Hereafter, we use the group of participants and the PSID sample to compute our estimator and compare it with other competitors and the experimental benchmark.

To allow for a flexible specification, we follow [25] by taking the raw covariates of the dataset (age, education, black, hispanic, married, dummy variable of no degree, income in 1974, income in 1975, dummy variables of no earnings in 1974 and in 1975), two-by-two interactions between the continuous and dummy variables, two-by-two interactions between the dummy variables, and powers up to degree 5 of the continuous variables. Continuous variables are linearly rescaled to [0, 1]. We end up with 172 variables to select from.

The experimental benchmark for the ATT estimate is $1,794, with a standard error of 671. We compare several estimators: the naive plug-in estimator, the immunized plug-in estimator, the doubly-robust estimator of [25], the double-post-selection linear estimator of [13], and the plain OLS estimator including all the covariates. For the four penalized estimators, the penalty loadings on the first-step estimators are set as in Appendix 7. Table 2 displays the results. Our immunized estimator and that of [25] give credible values for the ATT with respect to the experimental benchmark, with similar standard errors.⁶ Notably, only these two estimators out of the five considered display a significant positive impact, as does the experimental benchmark. The immunized estimator offers a substantial improvement in bias and standard error over the naive plug-in estimator, in line with the evidence from the Monte Carlo experiment.

⁵ For more discussion on the NSW program and the controversy regarding econometric estimates of causal effects based on nonexperimental data, see [34] and the subsequent contributions by [22, 23, 44].
⁶ Farrell [25]'s estimate shown in Table 2 differs from that displayed in Farrell's paper because, contrary to him, we have not automatically included education, the dummy of no degree and the 1974 income in the set of theory pre-selected covariates. When doing so, the results are slightly better but not qualitatively different for this estimator. We chose not to do so as it would bias the comparison with other estimators, which do not include a set of pre-selected variables.
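The covariate expansion described above can be reproduced, up to the exact bookkeeping of the original authors, with a few lines of numpy. The function below is our own illustrative sketch: the split into continuous and dummy variables is an assumption, and the resulting column count need not be exactly 172 depending on which redundant columns are dropped.

    import numpy as np
    from itertools import combinations

    def expand_covariates(X_cont, X_dum):
        """X_cont: (n, c) continuous covariates; X_dum: (n, d) dummy covariates."""
        # Rescale continuous variables linearly to [0, 1]
        lo, hi = X_cont.min(axis=0), X_cont.max(axis=0)
        Xc = (X_cont - lo) / np.where(hi > lo, hi - lo, 1.0)
        blocks = [Xc, X_dum]
        # Two-by-two interactions between continuous and dummy variables
        blocks += [Xc[:, [i]] * X_dum[:, [j]] for i in range(Xc.shape[1]) for j in range(X_dum.shape[1])]
        # Two-by-two interactions between dummy variables
        blocks += [X_dum[:, [i]] * X_dum[:, [j]] for i, j in combinations(range(X_dum.shape[1]), 2)]
        # Powers up to degree 5 of the continuous variables (degree 1 is already included)
        blocks += [Xc ** k for k in range(2, 6)]
        return np.column_stack(blocks)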

Table 2 ATT estimates on the NSW program

                                    Experimental       Naive             Immunized
                                    benchmark (1)      plug-in (2)       estimator (3)
Point estimate                      1,794.34           401.89            1,608.99
Standard error                      (671.00)           (746.07)          (705.38)
95% confidence interval             [519; 3,046]       [-1,060; 1,864]   [226; 2,991]
# variables in propensity score     None               9                 9
# variables in outcome equation     None               None              12

                                    Farrell (2015)     BCH (2014)        OLS
                                    (4)                (5)               estimator (6)
Point estimate                      1,420.43           226.82            83.17
Standard error                      (670.32)           (867.51)          (1,184.48)
95% confidence interval             [107; 2,734]       [-1,473; 1,927]   [-2,238; 2,405]
# variables in propensity score     3                  16                None
# variables in outcome equation     13                 10                172

Notes For details on the estimators, see the text. Standard errors and confidence intervals are based on the asymptotic distribution.

Finally, the OLS estimator in Column (6) is introduced as a benchmark, in order to demonstrate the behavior of a very simple procedure that does not use any selection at all. Unsurprisingly, this estimator has a substantially larger standard error than the other considered estimators.

5.2 California Tobacco Control Program

Proposition 99 is one of the first and most ambitious large-scale tobacco control programs, implemented in 1989 in California. It includes a vast array of measures, including an increase in cigarette taxation of 25 cents per pack and a significant effort in prevention and education. In particular, the tax revenues generated by Proposition 99 were used to fund anti-smoking campaigns. [2] analyzes the impact of the law on tobacco consumption in California. Since this program was only enforced in California, it is a nice example where the synthetic control method applies whereas more standard public policy evaluation tools cannot be used. It is possible to reproduce a synthetic California by reweighting other states so as to mimic California's behavior.
Fig. 2 The effect of Proposition 99 on per capita tobacco consumption

For this purpose, [2] consider the following covariates: retail price of cigarettes, state log income per capita, percentage of the population between 15 and 24, and per capita beer consumption (all 1980–1988 averages). Cigarette consumption for the years 1970 to 1975, 1980 and 1988 is also included. Using the same variables, we conduct the same analysis with our estimator.

Figure 2 displays the estimated effect of Proposition 99 using the immunized estimator. We find almost no effect of the policy over the pre-treatment period, giving credibility to the counterfactual employed. A steady decline takes place after 1988 and, in the long run, tobacco consumption is estimated to have decreased by about 30 packs per capita per year in California as a consequence of the policy. The variance is larger towards the end of the period because the covariates are measured in the pre-treatment period and become less relevant as predictors. Note also that, by construction, including the 1970 to 1975, 1980 and 1988 cigarette consumptions among the covariates yields a very good fit at these dates because of the immunization procedure. The fit is not perfect, however, because of the shrinkage induced by the ℓ1-penalization.

Figure 3 displays a comparison between the immunized estimator and the synthetic control method. The dashed green line is the synthetic control counterfactual. It does not match exactly the plot of [2], in which the weights given to each predictor are optimized to best fit the outcome over the whole pre-treatment period. Instead, the green curve in Fig. 3 optimizes the predictor weights using only the years 1970 through 1975, 1980 and 1988. This strategy brings a fairer comparison with our estimator, which does not use California's per capita tobacco consumption outside those dates to optimize the fit. In such a case, the years from 1976 to 1987, excluding 1980, can be used as a sort of placebo test.

Both our estimator and the synthetic control are credible counterfactuals as they are able to closely match California's pre-treatment tobacco consumption. They offer a sizable improvement over a sample average over the U.S. states that did not implement any tobacco control program. Furthermore, even if our estimator gives a result relatively
Fig. 3 Cigarette consumption in California, actual and counterfactual

similar to the synthetic control, it displays a smoother pattern, especially towards the end of the 1980s. The estimated treatment effect appears to be larger with the immunized estimate than with the synthetic control. However, it is hard to conclude that this difference is significant because, in the absence of any asymptotic theory for the synthetic control estimator, it is unclear how one could test the difference between the two. In fact, the availability of a standard asymptotic approximation for confidence intervals is an advantage of our method.

6 Conclusion

In this paper, we propose an estimator that makes a link between the synthetic control method, typically used with aggregated data and n_0 smaller than or of the same order as p, and treatment effect methods used for micro data for which n_0 ≫ p. Our method accommodates both settings. In the low-dimensional regime, it pins down one of the
solutions of the synthetic control problem, which admits an infinity of solutions. In the high-dimensional regime, the estimator is a regularized and immunized version of the low-dimensional one and thus differs from the synthetic control estimator. The simulations and applications suggest that it works well in practice.

In our study, we have focused on specific procedures based on ℓ1-penalization and proved that they achieve good asymptotic behavior in a possibly high-dimensional regime under sparsity restrictions. Other types of estimators could be explored using these ideas. For example, in the high-dimensional regime, our strategy can be used with the whole spectrum of sparsity-related penalization techniques, such as the group Lasso, fused Lasso, adaptive Lasso and Slope, among others.

Acknowledgements We thank participants at the 2016 North American and European Meetings of the Econometric Society, the 2017 IAAE Meeting and CREST internal seminars for their useful comments and discussions. We acknowledge funding from Investissements d'Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).

7 Algorithm for Feasible Penalty Loadings

Consider the ideal penalty loadings for the estimation of β0, defined as

    λ := c Φ⁻¹(1 − γ/2p)/√n,
    ψ̄_j^β := ( (1/n) Σ_{i=1}^n [(1 − D_i)h(X_i^T β0) − D_i]² X_{i,j}² )^{1/2}  for j = 1, ..., p,

and the ideal penalty loadings for the estimation of μ0:

    λ̃ := 2c Φ⁻¹(1 − γ/2p)/√n,
    ψ̄_j^μ := ( (1/n) Σ_{i=1}^n (1 − D_i) h′(X_i^T β0)² (Y_i − X_i^T μ0)² X_{i,j}² )^{1/2}  for j = 1, ..., p.

Here c > 1 is an absolute constant and γ > 0 is a tuning parameter, while β0 and μ0 are the true coefficients. We follow [13] and set γ = .05/p and c = 1.1.

We first estimate the ideal penalty loadings (ψ̄_j^β)_{j=1}^p of the balancing step using the following algorithm. Set a small constant ε > 0 and a maximal number of iterations k_0.

1. Start by using a preliminary estimate β^(0) of β0. For example, take β^(0) with the entry corresponding to the intercept equal to log(Σ_{i=1}^n D_i / Σ_{i=1}^n (1 − D_i)) and all other entries equal to zero. Then, for all j = 1, ..., p, set

       ψ_j^(0) = ( (1/n) Σ_{i=1}^n [(1 − D_i)h(X_i^T β^(0)) − D_i]² X_{i,j}² )^{1/2}.

   At step k, set ψ_j^(k) = ( (1/n) Σ_{i=1}^n [(1 − D_i)h(X_i^T β^(k)) − D_i]² X_{i,j}² )^{1/2}, j = 1, ..., p.
2. Estimate the model by the penalized balancing Equation (12), using the penalty level λ and the penalty loadings found previously, to obtain β^(k).
3. Stop if max_{j=1,...,p} |ψ_j^(k) − ψ_j^(k−1)| ≤ ε or if k > k_0. Set k = k + 1 and go to step 1 otherwise.

Asymptotic validity of this approach is established analogously to [10, Lemma 11]. Estimation of the penalty loadings on the immunization step follows a similar procedure, where we replace β0 in the formula for the ideal loadings by its estimator obtained in the balancing step.
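A direct transcription of this iterative scheme (our sketch; the convergence tolerance, the iteration cap, the assumption that the intercept is the first column of X, and the reuse of a generic solver for (12) such as the proximal-gradient sketch of Sect. 3.2 are illustrative choices) looks as follows.

    import numpy as np
    from scipy.stats import norm

    def feasible_loadings(Y, D, X, solve_penalized, c=1.1, eps=1e-4, k_max=15):
        """Iterative feasible loadings for the balancing step, with h(u) = exp(u).
        solve_penalized(lam, psi) is assumed to return the solution of (12) for the
        given penalty level and loadings."""
        n, p = X.shape
        gamma = 0.05 / p
        lam = c * norm.ppf(1 - gamma / (2 * p)) / np.sqrt(n)

        def loadings(b):
            r2 = ((1 - D) * np.exp(X @ b) - D) ** 2
            return np.sqrt(np.mean(r2[:, None] * X ** 2, axis=0))

        # Step 1: preliminary estimate (intercept-only)
        beta = np.zeros(p)
        beta[0] = np.log(D.sum() / (1 - D).sum())
        psi = loadings(beta)
        for _ in range(k_max):
            beta = solve_penalized(lam, psi)            # step 2: solve (12)
            psi_new = loadings(beta)                    # step 1 update at iteration k
            if np.max(np.abs(psi_new - psi)) <= eps:    # step 3: stopping rule
                psi = psi_new
                break
            psi = psi_new
        return lam, psi, beta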

8 Proofs

Proof (Theorem 1) First, note that we have E[(1 − D)w(X)X] = E[DX]. As a result, for any μ ∈ R^p,

    E[(D − (1 − D)w(X))(Y − X^T μ)] = E[(D − (1 − D)w(X)) Y].

Since (1 − D)Y = (1 − D)Y(0) and DY = DY(1), the value θ0 satisfies the moment condition (8) if and only if

    E[w(X)Y(1 − D)] = E[DY(0)].

By the Mean Independence assumption, E(D|X)E(Y(0)|X) = E(DY(0)|X). Thus,

    E[w(X)Y(1 − D)] = E[E(w(X)Y(0)(1 − D)|X)] = E[w(X)(1 − E(D|X))E(Y(0)|X)].    (24)

We consider the two cases of the theorem separately.

1. In the linear case E(Y(0)|X) = X^T μ0 we have

    E[w(X)Y(1 − D)] = E[w(X)(1 − D)X^T μ0] = E[D X^T μ0] = E[DE(Y(0)|X)] = E[DY(0)].
The first equality here is due to (24). The second equality follows from the fact that E[(1 − D)w(X)X] = E[DX]. The last equality uses the Mean Independence assumption.

2. The propensity score satisfies P(D = 1|X) = w(X)/(1 + w(X)). In this case, using (24) we have

    E[w(X)Y(1 − D)] = E[w(X)(1 − P(D = 1|X))E(Y(0)|X)] = E[P(D = 1|X)E(Y(0)|X)]
                    = E[E(D|X)E(Y(0)|X)] = E[DY(0)],

where the last equality follows from the Mean Independence assumption.  □

Proof (Theorem 3) Denote the observed data by Z_i = (Y_i, D_i, X_i), and by π0 the probability of being treated: π0 := P(D = 1). The estimating moment function for θ0 is g(Z, θ, η) := [D − (1 − D)h(X^T β)][Y − X^T μ] − Dθ. Recall that we define (θ0, η0) as the values satisfying E g(Z, θ0, η0) = 0. All these quantities depend on the sample size n but, for the sake of brevity, we suppress this dependency in the notation except for the cases when we need it explicitly.

By the Taylor expansion and the linearity of the estimating function g in θ, there exists t ∈ (0, 1) such that

    E_n[g(Z, θ̂, η̂)] = E_n[g(Z, θ0, η̂)] + π̂ (θ0 − θ̂)
                    = π̂ (θ0 − θ̂) + E_n[g(Z, θ0, η0)] + (η̂ − η0)^T E_n[∇_η g(Z, θ0, η0)]
                      + (1/2)(η̂ − η0)^T E_n[∇²_η g(Z, θ0, η̃)] (η̂ − η0),

where η̃ := t η0 + (1 − t) η̂ and π̂ = (1/n) Σ_{i=1}^n D_i. The immunized estimator satisfies E_n[g(Z, θ̂, η̂)] = 0. Thus, we obtain

    π̂ √n (θ̂ − θ0) = √n E_n[g(Z, θ0, η0)]                                   =: I_1
                    + √n (η̂ − η0)^T E_n[∇_η g(Z, θ0, η0)]                   =: I_2
                    + 2⁻¹ √n (η̂ − η0)^T E_n[∇²_η g(Z, θ0, η̃)] (η̂ − η0)     =: I_3.

Now, to prove Theorem 3 we proceed as follows. First, we show that I_1 converges in distribution to a zero-mean normal random variable with variance E g²(Z, θ0, η0) (Step 1 below), while I_2 and I_3 tend to zero in probability (Steps 2 and 3). This and the fact that π̂ → π0 (a.s.) imply that √n(θ̂ − θ0) is asymptotically normal with some positive variance, which in turn implies that θ̂ → θ0 in probability. Finally,
using this property we prove that E_n[g²(Z_i, θ̂, η̂)] → E g²(Z, θ0, η0) in probability (Step 4). Combining Steps 1 to 4 and using the Slutsky lemma leads to the result of the theorem. Thus, to complete the proof of the theorem it remains to establish Steps 1 to 4.

Step 1. In this part, we write g_{i,n} := g(Z_i, θ0, η0) (making the dependence on n explicit in the notation). Recall that E g_{i,n} = 0. We apply the Lindeberg-Feller central limit theorem for triangular arrays by checking a Lyapunov condition. Since (g_{i,n})_{1≤i≤n} are i.i.d., it suffices to prove that

    lim sup_{n→∞} E(g_{1,n}^{2+δ}) / (E g_{1,n}²)^{1+δ/2} < ∞    (25)

for some δ > 0. Assumption 5(i) implies that E(g_{1,n}^{2+δ}) is bounded uniformly in n. Moreover,

    E(g_{1,n}²) = π0 E[(Y(1) − X^T μ0 − θ0)² | D = 1] + (1 − π0) E[h(X^T β0)²(Y(0) − X^T μ0)² | D = 0].

Due to Assumption 5(ii) we have E(g_{1,n}²) ≥ c_1, where c_1 > 0 does not depend on n. Thus, (25) holds. It follows that (E(g_{1,n}²))^{-1/2} I_1 converges in distribution to a standard normal random variable.

Step 2. Set ψ_{j+p} = ψ̃_j, j = 1, ..., p, and denote by Ψ the diagonal matrix of dimension 2p with diagonal elements ψ_j, j = 1, ..., 2p. Let also Ψ̄ be the diagonal matrix of dimension 2p with diagonal elements ψ̄_j = (n⁻¹ Σ_{i=1}^n U_{i,j}²)^{1/2}, j = 1, ..., 2p. By Assumption 8 we have ψ_j ≥ ψ̄_j. Hence,

    |I_2| ≤ ‖Ψ(η̂ − η0)‖_1 ‖Ψ⁻¹ √n E_n[∇_η g(Z, θ0, η0)]‖_∞ ≤ ‖Ψ(η̂ − η0)‖_1 ‖Ψ̄⁻¹ √n E_n[∇_η g(Z, θ0, η0)]‖_∞.

Here, the term ‖Ψ̄⁻¹ √n E_n[∇_η g(Z, θ0, η0)]‖_∞ is a maximum of self-normalized sums of the variables U_{i,j} and it can be bounded by using standard inequalities for self-normalized sums, cf. Lemma 2 below. From the orthogonality conditions and Assumption 5 we have E(U_{i,j}) = 0, E(U_{i,j}²) ≥ c_1 and E(|U_{i,j}|³) < ∞ for any i and any j. By Lemma 2 and the fact that Φ⁻¹(1 − a) ≤ √(−2 log(a)) for all a ∈ (0, 1) we obtain that, with probability tending to 1 as n → ∞,

    ‖Ψ̄⁻¹ √n E_n[∇_η g(Z, θ0, η0)]‖_∞ ≤ Φ⁻¹(1 − γ/2p) ≤ √(2 log(2p/γ)) ≲ √(log(p)).

Next, inequalities (41) and (46) imply that ‖Ψ(η̂ − η0)‖_1 ≲ (s/κ_Σ) √(log(p)/n) with probability tending to 1 as n → ∞. Using these facts and the growth condition (i) in Assumption 7 we conclude that I_2 converges to 0 in probability as n → ∞.

Step 3. Let H := E_n[∇²_η g(Z, θ0, η̃)] ∈ R^{2p×2p} and let h_{k,j} denote the elements of the matrix H. We have
    |I_3| ≤ (√n/2) ‖η̂ − η0‖_1² max_{1≤k,j≤2p} |h_{k,j}|.    (26)

We now control the random variable max_{1≤k,j≤2p} |h_{k,j}|. To do this, we first note that

    ∂²g(Z, θ, η)/∂β∂β^T = −(1 − D) h″(X^T β)[Y − X^T μ] X X^T,
    ∂²g(Z, θ, η)/∂μ∂β^T = ∂²g(Z, θ, η)/∂β∂μ^T = (1 − D) h′(X^T β) X X^T,
    ∂²g(Z, θ, η)/∂μ∂μ^T = 0.

It follows that

    max_{1≤k,j≤2p} |h_{k,j}| ≤ max{ max_{1≤k,j≤p} |h̃_{k,j}|, max_{1≤k,j≤p} |h̄_{k,j}| },

where

    h̃_{k,j} = (1/n) Σ_{i=1}^n (1 − D_i) h″(X_i^T β̃)(Y_i − X_i^T μ̃) X_{i,k} X_{i,j},

and

    h̄_{k,j} = (1/n) Σ_{i=1}^n (1 − D_i) h′(X_i^T β̃) X_{i,k} X_{i,j}.

We now evaluate separately the terms max_{1≤k,j≤p} |h̃_{k,j}| and max_{1≤k,j≤p} |h̄_{k,j}|. Note that h̃_{k,j} can be decomposed as h̃_{k,j} = h̃_{k,j,1} + h̃_{k,j,2} + h̃_{k,j,3} + h̃_{k,j,4}, where

    h̃_{k,j,1} = n⁻¹ Σ_{i=1}^n (1 − D_i) h″(X_i^T β0)(Y_i − X_i^T μ0) X_{i,k} X_{i,j},
    h̃_{k,j,2} = n⁻¹ Σ_{i=1}^n (1 − D_i) h″(X_i^T β0) X_i^T(μ0 − μ̃) X_{i,k} X_{i,j},
    h̃_{k,j,3} = n⁻¹ Σ_{i=1}^n (1 − D_i)(h″(X_i^T β̃) − h″(X_i^T β0))(Y_i − X_i^T μ0) X_{i,k} X_{i,j},
    h̃_{k,j,4} = n⁻¹ Σ_{i=1}^n (1 − D_i)(h″(X_i^T β̃) − h″(X_i^T β0)) X_i^T(μ0 − μ̃) X_{i,k} X_{i,j}.    (27)
It follows from Assumption 5 that, for all k, j,

    |h̃_{k,j,2}| ≤ C ‖μ0 − μ̂‖_1.    (28)

Here and in what follows we denote by C positive constants depending only on K that can be different on different appearances. Next, from Assumptions 5 and 4(i) we obtain that, for all k, j,

    |h̃_{k,j,3}| ≤ C ‖β0 − β̂‖_1,    |h̃_{k,j,4}| ≤ C ‖β0 − β̂‖_1 ‖μ0 − μ̂‖_1.    (29)

From (28), (29), Theorem 4 and the growth condition (i) in Assumption 7 we find that, with probability tending to 1 as n → ∞,

    max_{1≤k,j≤p} (|h̃_{k,j,2}| + |h̃_{k,j,3}| + |h̃_{k,j,4}|) ≲ (s/κ_Σ) √(log(p)/n) ≲ 1.    (30)

Next, again from Assumption 5, we deduce that |E(h̃_{k,j,1})| ≤ C, while by Hoeffding's inequality

    P(|h̃_{k,j,1} − E(h̃_{k,j,1})| ≥ x) ≤ 2 exp(−C n x²),  ∀x > 0.

This and the union bound imply that there exists C > 0 large enough such that, with probability tending to 1 as n → ∞,

    max_{1≤k,j≤p} |h̃_{k,j,1}| ≤ C(1 + √(log(p)/n)) ≲ 1.    (31)

Finally, combining (30) and (31) we obtain that, with probability tending to 1 as n → ∞, max_{1≤k,j≤p} |h̃_{k,j}| ≲ 1. Quite similarly, we get that, with probability tending to 1 as n → ∞, max_{1≤k,j≤p} |h̄_{k,j}| ≲ 1. Thus, with probability tending to 1 as n → ∞ we have max_{1≤k,j≤2p} |h_{k,j}| ≲ 1. On the other hand, ‖η̂ − η0‖_1 ≲ (s/κ_Σ) √(log(p)/n) with probability tending to 1 as n → ∞ due to Theorem 4. Using these facts together with (26) and the growth condition (i) in Assumption 7 we conclude that I_3 tends to 0 in probability as n → ∞.

Step 4. We now prove that if θ̂ → θ0 in probability then (1/n) Σ_{i=1}^n g²(Z_i, θ̂, η̂) → E g²(Z, θ0, η0) in probability as n → ∞. We have

    g(Z, θ, η) = [D − (1 − D)h(X^T β)][Y − X^T μ] − Dθ.
Theorem 4 and the growth condition (i) in Assumption 7 imply that ‖β0 − β̂‖_1 is bounded by 1 on an event A_n of probability tending to 1 as n → ∞. Using Assumption 5 we deduce that, on the event A_n, the values X_i^T β0 and X_i^T β̂ for all i belong to a subset of R of diameter at most 2K. On the other hand, Assumption 4(i) implies that the function h is bounded and Lipschitz on any compact subset of R. Therefore, using again Assumption 5 we find that, on the event A_n, we have |h(X_i^T β0) − h(X_i^T β̂)| ≤ C ‖β0 − β̂‖_1 for all i. It follows from this remark and from Assumption 5 that, on the event A_n,

    |g(Z_i, θ̂, η̂) − g(Z_i, θ0, η0)| ≤ |θ̂ − θ0| + |h(X_i^T β0) − h(X_i^T β̂)||Y_i − X_i^T μ0| + |h(X_i^T β̂)||X_i^T(μ0 − μ̂)|
                                    ≤ |θ̂ − θ0| + C(‖β0 − β̂‖_1 + ‖μ0 − μ̂‖_1 + ‖μ0 − μ̂‖_1 ‖β0 − β̂‖_1) =: ζ_n

for all i. Also note that, due to Assumption 5, the random variables g(Z_i, θ0, η0) are a.s. uniformly bounded. Thus, using the equality b² − a² = (b − a)² + 2a(b − a), ∀a, b ∈ R, we get that, on the event A_n,

    |g²(Z_i, θ̂, η̂) − g²(Z_i, θ0, η0)| ≤ C(ζ_n² + ζ_n)

for all i. Using this fact together with Theorem 4, the growth condition (i) in Assumption 7 and the convergence θ̂ → θ0 in probability, we find that (1/n) Σ_{i=1}^n g²(Z_i, θ̂, η̂) − (1/n) Σ_{i=1}^n g²(Z_i, θ0, η0) → 0 in probability as n → ∞. We conclude by applying the law of large numbers to the sum (1/n) Σ_{i=1}^n g²(Z_i, θ0, η0).  □

Proof (Theorem 4) Proof of (22). Recall that β̂ is defined as

    β̂ ∈ arg min_{β ∈ R^p} ( (1/n) Σ_{i=1}^n [(1 − D_i)H(X_i^T β) − D_i X_i^T β] + λ Σ_{j=1}^p ψ_j |β_j| )    (32)

with penalty loadings satisfying Assumption 8. Let Ψ_d ∈ R^{p×p} be the diagonal matrix with diagonal entries ψ_1, ..., ψ_p. Let also Ψ̄_d ∈ R^{p×p} be the diagonal matrix with diagonal entries ψ̄_j = (n⁻¹ Σ_{i=1}^n U_{i,j}²)^{1/2}, j = 1, ..., p. We denote by S_0 ⊆ {1, ..., p} the set of indices of the non-zero components of β0. By assumption, Card(S_0) = ‖β0‖_0 ≤ s_β.

Step 1: Concentration inequality. We first bound the sup-norm of the gradient of the objective function using Lemma 2. Recall that for 1 ≤ j ≤ p we have U_{i,j} = [(1 − D_i)h(X_i^T β0) − D_i] X_{i,j}. Set S_j := (1/√n) Σ_{i=1}^n U_{i,j}/ψ̄_j and consider the event

    B := { max_{1≤j≤p} | (1/n) Σ_{i=1}^n U_{i,j}/ψ̄_j | ≤ λ/c }.

447

By construction, the random variables Ui, j are i.i.d., E(Ui, j ) = 0, E(Ui,2 j ) ≥ c1 and E(|Ui, j |3 ) ≤ C by Assumptions 5 and 4. Using these remarks, 8 and Lemma 2 we obtain + ,   √ c P BC = P √ max |S j | > c−1 (1 − γ/2 p)/ n n 1≤ j≤ p , + = P max |S j | > −1 (1 − γ/2 p) = o(1) as n → ∞. 1≤ j≤ p

Step 2 : Restricted Eigenvalue condition for the empirical Gram matrix. The empirical Gram matrix is  := 

n n 1 1 (1 − Di )X i X iT = (1 − Di )2 X i X iT . n i=1 n i=1

We also recall the Restricted Eigenvalue (RE) condition [15]. For a non-empty subset S ⊆ {1, . . . , p} and α > 0, define the set:   C[S, α] := v ∈ R p : v SC 1 ≤ αv S 1 , v = 0

(33)

where S C stands for the complement of S. Then, for given s ∈ {1, . . . , p} and α > 0,  satisfies the RE(s, α) condition if there exists κ(  ) > 0 such that the matrix  min

S⊆{1,..., p}: Card(S)≤s

v vT   ). ≥ κ( v∈C[S,α] v S 2 2 min

(34)

We now use Lemma 3, stated and proved in Sect. 9 below. Note that Assumption 7 implies (55) therein and set Vi = (1 − Di )X i . Then, for any s ∈ [1, p/2] and α > 0,  satisfies the RE(s, α) condition, with κ(  ) = c∗ κ where c∗ ∈ (0, 1) is an absolute  constant, with probability tending to 1 as n → ∞. Step 3 : Basic inequality. At this step, we prove that with probability tending to 1  satisfies the following inequality (further called the basic inequality): as n → ∞, β    − β0 ) ≤ 2λ d β0 1 − d β  − β0 )1 ,  − β0 )T   1 + 2λ d (β  (β τ (β c where τ > 0 is a constant that does not depend on n.  we have By optimality of β n   1 1 , [γβ (X i , Di ) − γβ0 (X i , Di )] ≤ λ d β0 1 − d β n i=1

(35)

448

M. Bléhaut et al.

where γβ (X, D) := (1 − D)H (X T β) − D X T β. Subtracting the inner product of the  − β0 on both sides we find gradient ∇β γβ0 (X i , Di ) and β n   1  − β0 )T X i ] [γβ (X i , Di ) − γβ0 (X i , Di ) − (1 − Di )h(X iT β0 ) − Di (β n i=1 n       − β0 )T X i . 1 −1 ≤ λ d β0 1 − d β (1 − Di )h(X iT β0 ) − Di (β n i=1

(36) Using Taylor expansion we get that there exists 0 ≤ t ≤ 1 such that n   1  − β0 )T X i γβ (X i , Di ) − γβ0 (X i , Di ) − (1 − Di )h(X iT β0 ) − Di (β n i=1 0 / n 1 1  T T T ˜  − β0 ), = (β − β0 ) (1 − Di )X i X i h (X i β) (β 2 n i=1

 + (1 − t)β0 . Plugging this into (36) and using the facts that | where β˜ = t β i ai bi | ≤ a1 b∞ and ψ j ≥ ψ¯ j we get that, on the event B, which occurs with probability tending to 1 as n → ∞: 1  (β − β0 )T 2

/

0 n 1 ˜ ( β − β0 ) (1 − Di )X i X iT h (X iT β) n i=1

n    1  β1 − (1 − Di )h(X iT β0 ) − Di ( β − β0 )T X i ≤ λ d β0 1 − d  n i=1 - n -1  X    i, j ≤ λ d β0 1 − d  β − β0 )1 β1 + max (1 − Di )h(X iT β0 ) − Di - d ( 1≤ j≤ p - n ψ¯ j i=1

  λ ≤ λ d β0 1 − d  β − β0 )1 . β1 + d ( c

(37)

By Assumption 4 we have h > 0, which implies that the left-hand side of (37) is non-negative. Hence we have, under the event B,    − β0 )1 .  1 + (λ/c)d (β 0 ≤ λ d β0 1 − d β which implies that  1 ≤ c0 d β0 1 d β

(38)

where c0 = (c + 1)/(c − 1). By Assumption 8, we have max j ψ j ≤ max j ψ j,max ≤ ψ¯ where ψ¯ > 0 is a constant that does not depend on n. On the other hand,

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

449

Assumption 5(ii) and the fact that the random variables Ui,2 j are uniformly bounded √ implies that min j ψ j ≥ c1 /2 := ψ with probability tending to 1 as n → ∞ (this follows immediately from Hoeffding’s inequality, the union bound and the fact that log( p)/n → 0 due to Assumption 7(i)). These remarks and (38) imply that, with probability tending to 1 as n → ∞, ¯  1 ≤ c0 ψ β0 1 . β ψ

(39)

¯  1 ≤ c0 (ψ/ψ)c We now use Assumption 4(ii). If β0 1 ≤ c3 , then β 3 with proba¯ T ˜ bility tending to 1 as n → ∞, so that min h (X β) ≥ h (−K max(1, c0 ψ )c3 ) > 0 ψ

i

i=1,...,n

where we have used Assumptions 5(i) and 4. Otherwise, given Assumption 4(ii), ˜ ≥ c2 . It follows that h ≥ c2 on the whole real line so obviously min h (X iT β) i=1,...,n

there exists τ > 0 that does not depend on n such that, with probability tending to 1 as n → ∞, 0 / n  1 ˜ v, ∀ v ∈ R p . v ≤ v T τ vT  (1 − Di )X i X iT h (X iT β) (40) n i=1  − β0 and combining it with inequality (37) yields (35). Using (40) with v = β  We prove that with probability tending to 1 Step 4 : Control of the 1 -error for β. as n → ∞,  − β0 )1 ≤ d (β

 c  4ψ¯ 2 λs β , c − 1 c∗ τ κ

 − β0 1 ≤ β

 c  4ψ¯ 2 λs β . c − 1 ψc∗ τ κ

(41)

It suffices to prove the first inequality in (41). The second inequality follows as an immediate consequence.  1 . By We will use the basic inequality (35). First, we bound d β0 1 − d β the triangular inequality, S0 )1 . S0 1 ≤ d (β0,S0 − β d β0,S0 1 − d β Furthermore, SC 1 = 2d β0,SC 1 − d β0,SC 1 − d β SC 1 d β0,S0C 1 − d β 0 0 0 0 SC )1 ≤ 2d β0,SC 1 − d (β0,SC − β 0

≤ −d (β0,S0C

0

SC )1 . −β 0

The last inequality follows from the fact that β0,S0C 1 = 0. Hence,

0

450

M. Bléhaut et al.

 − β0 )1  1 + 1 d (β d β0 1 − d β c + + , , 1 1  SC − β0,SC )1 . ≤ 1+ d (β S0 − β0,S0 )1 − 1 − d (β 0 0 c c

(42)

Plugging this result in (35) we get, with probability tending to 1 as n → ∞,  − β0 )  − β0 )T   (β (β 2 1+ , , + 1 2λ S0 − β0,S0 )1 − 1 − 1 d (β SC − β0,SC )1 , 1+ d (β ≤ 0 0 τ c c

(43)

and thus  − β0 ) +  − β0 )T   (β (β

1 2λ   − β0 )1 ≤ 4λ d (β S0 − β0,S0 )1 1− d (β τ c τ √ 4λψ¯ sβ S0 − β0,S0 2 , ≤ β τ (44)

where we have used the fact that Card(S0 ) ≤ sβ due to Assumption 3. Recall that  − β0 ) ≥ 0 we obtain a  − β0 )T   (β c > 1. From inequality (43) and the fact that (β  − β0 ) ∈ C[S0 , c0 ] for d (β  − β0 ), which in turn implies (with cone condition d (β ¯  − β0 ∈ C[S0 , c0 ψ/ψ] for probability tending to 1 as n → ∞) a cone condition β   β − β0 . Therefore, using (34) (where we recall that κ( ) = c∗ κ ), we obtain that, with probability tending to 1 as n → ∞, + , √ 4λψ¯ sβ 2λ 1  ( β − β0 ) + ( β − β0 )T  β − β0 )1 ≤ 1− d ( τ c τ

3

 ( β − β0 ) ( β − β0 )T  . c ∗ κ

Using here the inequality ab ≤ (a 2 + b2 )/2, ∀a, b > 0, we find that, with probability tending to 1 as n → ∞, + , 2 ¯2 1   − β0 ) + 2λ 1 − 1 d (β  − β0 )1 ≤ 8λ ψ sβ .  (β (β − β0 )T  2 2 τ c τ c∗ κ which implies the first inequality in (41). Since λ  complete.

*

log( p)/n the proof of (22) is 

Proof of (23) Recall that  μ is defined as: ⎞ p n     1  Yi − X T μ 2 + λ  μ ∈ arg min ⎝ (1 − Di )h (X iT β) ψ j |μ j |⎠ . i p n μ∈R i=1 j=1 ⎛

(45)

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

451

Let  ∈ R p× p denote the diagonal matrix with diagonal entries ψ1 , . . . , ψ p . We will prove that, with probability tending to 1 as n → ∞,  ( μ − μ0 )1 

s κ

log( p) . n

(46)

Using an argument analogous to that after (38) we easily get that (46) implies (23). Step 1 : Concentration inequality. Define Vi j := (1 − Di )h (X iT β0 )  1 n T √ Yi − X i μ0 X i, j , S j := n i=1 Vi j /ψ j and consider the event ! B :=

- n . - V - λ 2 ij max . -≤ n 1≤ j≤ p - i=1 ψ j c

The random variables Vi j , i = 1, . . . , n, are i.i.d. and E(Vi j ) = 0, E(Vi2j ) ≥ c1 and E(|Vi j |3 ) < C for all i, j by Assumptions 4 and 5. Using Assumptions 3, 8, and Lemma 2 we obtain + , √ c P(B C ) = P √ max |S j | > c−1 (1 − γ/2 p)/ n n 1≤ j≤ p + , = P max |S j | > −1 (1 − γ/2 p) = o(1) as n → ∞. 1≤ j≤ p

μ. Introduce the notation γβ,μ (Z i ) = (1 − Step 2 : Control of the 1 -error for   2 Di )h (X iT β) Yi − X iT μ . It follows from the definition of  μ that n   1 [γ (Z i ) − γβ,μ0 (Z i )] ≤ λ  μ0 1 −   μ1 . n i=1 β,μ

Here ⎛ ⎞ n n  1 1 [γ μ − μ0 )T ⎝ (1 − Di )h (X iT  μ − μ0 ) β)X i X iT ⎠ ( μ (Z i ) − γ β, β,μ0 (Z i )] = ( n n i=1

i=1

n 2 (1 − Di )h (X iT  μ). + β)(Yi − X iT μ0 )X iT (μ0 −  n i=1

Therefore,

452

M. Bléhaut et al.

μ − μ0 ) T (

n 1 (1 − Di )h (X iT  β)X i X iT n

μ − μ0 ) (

i=1

n   2 ≤λ  μ0 1 −   μ1 + (1 − Di )h (X iT β0 )(Yi − X iT μ0 )X iT ( μ − μ0 ) + Rn , (47) n i=1

where Rn = =

n   2  − h (X T β0 ) (Yi − X T μ0 )X T ( (1 − Di ) h (X iT β) i i i μ − μ0 ) n i=1 n 2 ˜ T (β  − β0 )X T ( (1 − Di )(Yi − X iT μ0 )h (X iT β)X i i μ − μ0 ) n i=1

 + (1 − t)β0 for some t ∈ [0, 1]. Introducing the matrix with β˜ = t β A=

n 2 ˜ i XT , (1 − Di )(Yi − X iT μ0 )h (X iT β)X i n i=1

we can write  − β0 )T A( μ − μ0 ). R n = (β From (47) we deduce that on the event B that occurs with probability tending to 1 as n → ∞,

n  1  i X T ( (1 − Di )h (X iT β)X μ − μ0 ) μ − μ0 )T ( i n i=1   λ  − β0 )T A( μ1 +  ( μ − μ0 )1 + (β μ − μ0 ). ≤λ  μ0 1 −   c  = β˜ for t = 1) to obtain that, with probability We now use (40) (noticing that β tending to 1 as n → ∞,   λ  ( τ ( μ − μ0 )T  μ − μ0 ) ≤ λ  μ0 1 −   μ1 +  ( μ − μ0 )1 c  − β0 )T A( + (β μ − μ0 ). (48) Next, observe that, with probability tending to 1 as n → ∞,  − β0 )  − β0 )T   − β0 )T A(  (  (β μ − μ0 ) ≤ C(β μ − μ0 ) − (τ /2)( μ − μ0 )T  (β (49)

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

453

− where C > 0 is a constant that does not depend on n. To see this, set u i = (β ˜ We have μ − μ0 )T X i and ai = (Yi − X iT μ0 )h (X iT β). β0 )T X i vi = ( n   4a τ  i  ( u i vi − vi2 ( β − β0 )T A( μ − μ0 ) − (τ /2)( μ − μ0 )T  (1 − Di ) μ − μ0 ) = 2n τ i=1

n 1  ≤ (1 − Di )ai2 u i2 . τn i=1

This implies (49) since (22) and Assumptions 5(i) and 4 garantee that, with probability tending to 1 as n → ∞, we have maxi |ai | ≤ C for a constant C > 0 that does not depend on n. We also note that, due to (45), with probability tending to 1 as n → ∞, 2  − β0 )  λ sβ  sβ log( p) .  − β0 )T   (β (β κ nκ

(50)

Combining (48), (49) and (50) we finally get that, with probability tending to 1 as n → ∞, + , cs ¯ β log( p) 1 τ  ( μ − μ0 )T  μ1 +  ( μ − μ0 )1 + , (51) μ − μ0 ) ≤ λ  μ0 1 −   ( 2 c nκ

where c¯ > 0 is a constant that does not depend on n. Let S1 ⊆ {1, . . . , p} denote the set of indices of non-zero components of μ0 . By assumption, Card(S1 ) = μ0 0 ≤ sμ .  d , S0 by μ0 ,  μ,  , S1 , The same argument as in (42) (where we replace β0 , β, respectively) yields 1 μ1 +  ( μ − μ0 )1  μ0 1 −   c    1 1  (  ( ≤ 1+ μ S1 − μ0,S1 )1 − 1 − μ S1C − μ0,S1C )1 . c c This and (51) imply that, with probability tending to 1 as n → ∞,  cs ¯ β log( p) 1 τ  ( μ − μ0 )T  μ − μ0 )1 ≤ 4λ  ( μ S1 − μ0,S1 )1 + , μ − μ0 ) + λ 1 −  ( ( 2 c nκ

(52)

where we have used the fact that c > 1. We now consider two cases : cs ¯ β log( p) μ S1 − μ0,S1 )1 ≤ . In this case, inequality (52) implies 1. λ  ( nκ μ − μ0 )1   (

sβ log( p) λ nκ

454

M. Bléhaut et al.

* and (46) follows immediately since log( p)/n  λ . Consequently, (23) holds in this case. cs ¯ β log( p) μ S1 − μ0,S1 )1 > . Then with probability tending to 1 as n → ∞ 2. λ  ( nκ we have  1 τ  (  ( μ − μ0 ) + λ 1 − μ − μ0 )1 ≤ 5λ  ( μ S1 − μ0,S1 )1 . μ − μ0 )T  ( 2 c This inequality is analogous to (44). In particular, it implies the cone condition, which now takes the form  ( μ S0C − μ0,S0C )1 ≤ α ( μ S0 − μ0,S0 )1 with α = 5c/(c − 1). Therefore, we can use an argument based on the Restricted Eigenvalue condition, which is completely analogous to that after inequality (44) (we omit the details here). It leads to the following analog of (45):  ( μ − μ0 ) + Cλ  ( μ − μ0 )1  μ − μ0 )T  (

(λ )2 sμ , κ

where C > 0 is a constant that does not depend on n. Since λ  get (46). Thus, the proof of (23) is complete.

*

(53)

log( p)/n we 

9 Auxiliary Lemmas Lemma 2 (Deviation of maximum of self-normalized sums) Consider S j :=

n  i=1

Ui, j

n 

Ui,2 j

−1/2

,

i=1

where Ui, j are independent random variables across i with mean zero and for all i, j we have E[|Ui, j |3 ] ≤ C1 , E[Ui,2 j ] ≥ C2 for some positive constants C1 , C2 independent of n. Let p = p(n) satisfy the condition log( p) = o(n 1/3 ) and let γ = γ(n) ∈ (0, 2 p) be such that log(1/γ)  log( p). Then, + P

, max |S j | > −1 (1 − γ/2 p) = γ (1 + o(1))

1≤ j≤ p

as n → ∞. Proof We use a corollary of a result from [41] given by [10] (p. 2409), which in our case can be stated as follows. Let S j and Ui, j satisfy the assumptions of the present lemma. If there exist positive numbers  > 0, γ > 0 such that 0 < −1 (1 − γ/2 p) ≤

1/2

n 1/6 − 1, 1/3  C1

C2

(54)

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

then,

+ P

455

, + , A max |S j | > −1 (1 − γ/2 p) ≤ γ 1 + 3 , 1≤ j≤ p 

where A > 0 is an absolute constant. * Now, since −1 (1 − γ/2 p) ≤ 2 log(2 p/γ) and we assume that log(1/γ)  log( p) and log( p) = o(n 1/3 ) condition (54) is satisfied with  = (n) = (n 1/3 / log( p))1/4 for n large enough. Then (n) → ∞ as n → ∞ and the lemma follows.  Lemma 3 Let s ∈ [1, p/2] be an integer and α > 0. Let V ∈ R p be a random vector such that V ∞ ≤ M < ∞ (a.s.), and set  = E(V V T ). Let  satisfy (21) and s/κ = o( p) as n → ∞, and s 

nκ2 . log( p) log3 (n)

(55)

Consider i.i.d. random vectors V1 , . . . , Vn with the same distribution as V . Then, for all n large enough with probability at least 1 − exp(−C log( p) log3 (n)) where n 1 = C > 0 is a constant depending only on M the empirical matrix  Vi ViT n i=1  ) = c∗ κ where c∗ ∈ (0, 1) is an absolute satisfies the RE(s,α) condition with κ( constant. Proof We set the parameters of Theorem 22 in [43] as follows s0 = s, k0 = α, and due to (21) we have, in the notation of that theorem, K 2 (s0 , 3k0 ,  1/2 ) ≤ 1/κ and ρ ≥ κ . Also note that  1/2 e j 22 = E[(V T e j )2 ] ≤ M 2 , where e j denotes the jth canonical basis vector in R p (this, in particular, implies that κ ≤ M 2 ). Thus, the value d defined in Theorem 22 in [43] satisfies d  s/κ and condition d ≤ p holds true for n large enough due to (55). Next, note that condition n ≥ x log3 (x) is satisfied for all x ≤ n/ log3 (n) and n ≥ 3, so that the penultimate display formula in Theorem 22 of [43] can be written as d log( p)/ρ  n/ log3 (n). Given the above bounds for d and ρ, we have a sufficient condition for this inequality in the form s/κ2  n/ log3 (n), which is granted by (55). Thus, all the conditions of Theorem 22 in [43] are satisfied and we find that, for all n large enough, with probability at least 1 − exp(−C log( p) log3 (n)) where C > 0 is a constant depending only on M we have v vT  min ≥ (1 − 5δ)κ , (56) min S⊆{1,..., p}: v∈C[S,α] v S 2 2 Card(S)=s

where δ ∈ (0, 1/5) (remark that there is a typo in Theorem 22 in [43] that is corrected 2 in (56): the last formula of that theorem should be (1 − 5δ) 1/2 u2 ≤ X√u ≤ n 1/2 (1 + 3δ) u2 where 0 < δ < 1/5 [[42]]). It remains to note that though at first glance (56) differs from (34) (in (56) we have Card(S) = s rather than Card(S) ≤ s), these two conditions are equivalent. Indeed, as shown in [9, p. 3607],

456

M. Bléhaut et al.



S⊆{1,..., p}:Card(S)≤s

s      v ∈ R p : v S C 1 ≤ αv S 1 = v ∈ R p : v1 ≤ (1 + α) v ∗j j=1

where v1∗ ≥ · · · ≥ v ∗p denotes a non-increasing rearrangement of |v1 |, . . . , |v p |. On the other hand, ∪

S⊆{1,..., p}:Card(S)=s

    v ∈ R p : v S C 1 ≤ αv S 1 ⊇ v ∈ R p : v S C (v) 1 ≤ αv S∗ (v) 1 ∗



= v ∈ R p : v1 ≤ (1 + α)

s 

v ∗j



j=1

where S∗ (v) is the set of s largest in absolute value components of v.



References 1. Abadie, A.: Using synthetic controls: Feasibility, data requirements, and methodological aspects. J. Econ. Lit. 59(2), 391–425 (2021). https://doi.org/10.1257/jel.20191450 2. Abadie, A., Diamond, A., Hainmueller, J.: Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. J. Am. Stat. Assoc. 105(490), 493–505 (2010) 3. Abadie, A., Diamond, A., Hainmueller, J.: Comparative politics and the synthetic control method. Am. J. Polit. Sci. 59(2), 495–510 (2015). https://doi.org/10.1111/ajps.12116 4. Abadie, A., Gardeazabal, J.: The economic costs of conflict: a case study of the Basque country. Am. Econ. Rev. 93(1), 113–132 (2003) 5. Abadie, A., L’Hour, J.: A penalized synthetic control estimator for disaggregated data. J. Am. Stat. Assoc. 1–18 (2021). https://doi.org/10.1080/01621459.2021.1971535 6. Arkhangelsky, D., Athey, S., Hirshberg, D.A., Imbens, G.W., Wager, S.: Synthetic difference in differences. Working Paper wp2019_907, CEMFI (2019) 7. Athey, S., Imbens, G.W., Wager, S.: Approximate residual balancing: debiased inference of average treatment effects in high dimensions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 80(4), 597–623 (2018). https://doi.org/10.1111/rssb.12268 8. Bang, H., Robins, J.M.: Doubly robust estimation in missing data and causal inference models. Biom. 61(4), 962–973 (2005). https://doi.org/10.1111/j.1541-0420.2005.00377.x 9. Bellec, P.C., Lecué, G., Tsybakov, A.B.: Slope meets lasso: improved oracle bounds and optimality. Ann. Stat. 46(6B), 3603–3642 (2018) 10. Belloni, A., Chen, D., Chernozhukov, V., Hansen, C.: Sparse models and methods for optimal instruments with an application to eminent domain. Econ. 80(6), 2369–2429 (2012). https:// doi.org/10.3982/ECTA9626 11. Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013). https://doi.org/10.3150/11-BEJ410 12. Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C.: Program evaluation and causal inference with high-dimensional data. Econ. 85(1), 233–298 (2017). https://doi.org/10.3982/ ECTA12723 13. Belloni, A., Chernozhukov, V., Hansen, C.: Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81(2), 608–650 (2014). https://doi.org/10.1093/ restud/rdt044 14. Ben-Michael, E., Feller, A., Rothstein, J.: The augmented synthetic control method. J. Am. Stat. Assoc. 0(ja), 1–34 (2021). https://doi.org/10.1080/01621459.2021.1929245

An Alternative to Synthetic Control for Models with Many Covariates Under Sparsity

457

15. Bickel, P., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009) 16. Blèhaut, M., D’Haultfoeuille, X., L’Hour, J., Tsybakov, A.B.: A parametric generalization of the synthetic control method, with high dimension. In: 2017 IAAE Meeting, pp. 0–53. Sapporo. https://editorialexpress.com/conference/IAAE2017/program/IAAE2017.html (2017) 17. Bradic, J., Wager, S., Zhu, Y.: Sparsity double robust inference of average treatment effects (2019). arXiv preprint arXiv:1905.00744 18. Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 169–194 (2007) 19. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J.: Double/debiased machine learning for treatment and structural parameters. Econ. J. 21(1), C1–C68 (2018). https://doi.org/10.1111/ectj.12097 20. Chernozhukov, V., Hansen, C., Spindler, M.: Post-selection and post-regularization inference in linear models with many controls and instruments. Am. Econ. Rev. 105(5), 486–90 (2015) 21. Chernozhukov, V., Hansen, C., Spindler, M.: Valid post-selection and post-regularization inference: An elementary, general approach. Annu. Rev. Econ. 7(1), 649–688 (2015). https://doi. org/10.1146/annurev-economics-012315-015826 22. Dehejia, R.H., Wahba, S.: Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. J. Am. Stat. Assoc. 94(448), 1053–1062 (1999) 23. Dehejia, R.H., Wahba, S.: Propensity score-matching methods for nonexperimental causal studies. Rev. Econ. Stat. 84(1), 151–161 (2002) 24. Deville, J.C., Särndal, C.E.: Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87(418), 376–382 (1992) 25. Farrell, M.H.: Robust inference on average treatment effects with possibly more covariates than observations. J. Econ. 189(1), 1–23 (2015). https://doi.org/10.1016/j.jeconom.2015.06. 017 26. van de Geer, S.A.: Estimating and Testing Under Sparsity. Springer (2016) 27. Graham, B.S., Pinto, C.C.D.X., Egel, D.: Inverse probability tilting for moment condition models with missing data. Rev. Econ. Stud. 79(3), 1053–1079 (2012) 28. Hainmueller, J.: Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Polit. Anal. 20(1), 25–46 (2012). https:// doi.org/10.1093/pan/mpr025 29. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data mining, Inference and Prediction, 2nd edn. Springer (2009) 30. Imai, K., Ratkovic, M.: Covariate balancing propensity score. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(1), 243–263 (2014). https://doi.org/10.1111/rssb.12027 31. Khan, S., Tamer, E.: Irregular identification, support conditions, and inverse weight estimation. Econ. 78(6), 2021–2042 (2010) 32. Kline, P.: Oaxaca-blinder as a reweighting estimator. Am. Econ. Rev. 101(3), 532–37 (2011). https://doi.org/10.1257/aer.101.3.532 33. Klößner, S., Kaul, A., Pfeifer, G., Schieler, M.: Comparative politics and the synthetic control method revisited: A note on abadie et al.(2015). Swiss J. Econ. Stat. 154(1), 11 (2018) 34. LaLonde, R.J.: Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 76(4), 604–20 (1986) 35. Leamer, E.E.: Let’s take the con out of econometrics. Am. Econ. Rev. 73(1), 31–43 (1983) 36. Leeb, H., Pötscher, B.M.: Model selection and inference: facts and fiction. Econ. Theory. 21(01), 21–59 (2005) 37. 
Leeb, H., Pötscher, B.M.: Recent developments in model selection and related areas. Econ. Theory. 24, 319–322 (2008). https://doi.org/10.1017/S0266466608080134 38. Leeb, H., Pötscher, B.M.: Sparse estimators and the oracle property, or the return of Hodges’ estimator. J. Econ. 142(1), 201–211 (2008) 39. Newey, W.K., McFadden, D.: Chapter 36 large sample estimation and hypothesis testing. In: Handbook of Econometrics, vol. 4, pp. 2111–2245. Elsevier (1994). https://doi.org/10.1016/ S1573-4412(05)80005-4

458

M. Bléhaut et al.

40. Ning, Y., Sida, P., Imai, K.: Robust estimation of causal effects via a high-dimensional covariate balancing propensity score. Biom. 107(3), 533–554 (2020). https://doi.org/10.1093/biomet/ asaa020 41. de la Peña, V.H., Lai, T.L., Shao, Q.M.: Self-Normalized Processes: Limit Theory and Statistical Applications, 1st edn. Springer-Verlag Berlin Heidelberg (2009). https://doi.org/10.1007/9783-540-85636-8 42. Rudelson, M.: Personal Communication (2020) 43. Rudelson, M., Zhou, S.: Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory. 59(6), 3434–3447 (2013) 44. Smith, J., Todd, P.: Does matching overcome LaLonde’s critique of nonexperimental estimators? J. Econ. 125(1–2), 305–353 (2005) 45. Tan, Z.: Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Ann. Statist. 48(2), 811–837 (2020). https://doi.org/10.1214/19AOS1824 46. Tibshirani, R.J.: The lasso problem and uniqueness. Electron. J. Stat. 7, 1456–1490 (2013)

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models Christoph Breunig and Xiaohong Chen

Abstract This paper considers adaptive, minimax estimation of a quadratic functional in a nonparametric instrumental variables (NPIV) model, which is an important problem in optimal estimation of a nonlinear functional of an ill-posed inverse regression with an unknown operator. We first show that a leave-one-out, sieve NPIV estimator of the quadratic functional can attain a convergence rate that coincides with the lower bound previously derived in [10]. The minimax rate is achieved by the optimal choice of the sieve dimension (a key tuning parameter) that depends on the smoothness of the NPIV function and the degree of ill-posedness, both are unknown in practice. We next propose a Lepski-type data-driven choice of the key sieve dimension adaptive to the unknown NPIV model features. The adaptive estimator of the quadratic functional is shown to attain the minimax optimal rate in the severely √ illposed case and in the regular mildly ill-posed case, but up to a multiplicative log n factor in the irregular mildly ill-posed case. Keywords Nonparametric instrumental variables · Ill-posed inverse problem with an unknown operator · Quadratic functional · Minimax estimation · Leave-one-out · Adaptation · Lepski’s method

We are grateful to Volodia Spokoiny for his friendship and insightful comments, and admire his creativity and high standard on research. We thank Enno Mammen and an anonymous referee for helpful comments, and Cristina Butucea, Tim Christensen, Richard Nickl and Sasha Tsybakov for initial discussions. Early versions have been presented at the conference “Celebrating Whitney Newey’s Contributions to Econometrics” at MIT in May 2019, the North American Summer Meeting of the Econometric Society in Seattle in June 2019, and the conference “Foundations of Modern Statistics” in occasion of Volodia Spokoiny’s 60th birthday at WIAS in November 2019. C. Breunig Department of Economics, Bonn University, Bonn 53113, Germany e-mail: [email protected] X. Chen (B) Cowles Foundation for Research in Economics, Yale University, Box 208281, New Haven, CT 06520, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Belomestny et al. (eds.), Foundations of Modern Statistics, Springer Proceedings in Mathematics & Statistics 425, https://doi.org/10.1007/978-3-031-30114-8_13

459

460

C. Breunig and X. Chen

1 Introduction Long before the recent popularity of instrumental variables in modern machine learning causal inference, reinforcement learning and biostatistics, the instrumental variables technique has been widely used in economics. For instance, instrumental variables regressions are frequently used to account for omitted variables, mis-measured regressors, endogeneity in simultaneous equations and other complex situations in economic observational data. In economics and other social sciences, as well as in medical research, it is very difficult to estimate causal effects when treatment assignment is not randomized. Instrumental variables are commonly used to provide exogenous variation that is associated with the treatment status, but not with the outcome variable (beyond its direct effect on the treatments). To avoid mis-specification of parametric functional forms, nonparametric instrumental variables (NPIV) regressions have gained popularity in econometrics and modern causal inference in statistics and machine learning. The simplest NPIV model n is drawn from an unknown joint disassumes that a random sample {(Yi , X i , Wi )}i=1 tribution of (Y, X, W ) satisfying Y = h 0 (X ) + U,

E[U |W ] = 0,

(1)

where h 0 is an unknown continuous function, X is a d-dimensional vector of continuous endogenous regressors in the sense that E[U |X ] = 0, W is a vector of conditioning ssvariables (instrumental variables) such that E[U |W ] = 0. The structural function h 0 can be identified as a solution to an integral equation of first kind with an unknown operator:  E[Y |W = w] = (T h 0 )(w) :=

h 0 (x) f X |W (x|w)d x,

where the conditional density f X |W (and hence the conditional expectation operator T ) is unknown. Under mild conditions, the conditional density f X |W is continuous and the operator T smoothes out “low regular” (or wiggly) parts of h 0 . This makes the nonparametric estimation (recovery) of h 0 a difficult ill-posed inverse problem with an unknown smoothing operator T . See, for example, [3, 9, 12, 24, 33] and [17]. For a given smoothness of h 0 , the difficulty of recovering h 0 depends on the smoothing property of the conditional expectation operator T . The literature distinguishes between the mildly and severely ill-posed regimes, and the optimal convergence rates for nonparametrically estimating h 0 are different in the two regimes. This paper considers adaptive, minimax rate-optimal estimation of a quadratic functional of h 0 in the NPIV model (1):  f (h 0 ) := h 20 (x)μ(x)d x (2)

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

461

for a known positive, continuous weighting function μ, which is assumed to be uniformly bounded below from zero and from above on some subset of of the support of X . Let  h be a sieve NPIV estimator of the NPIV function h 0 (see e.g., [3]). [11] and [10] considered inference on a slightly more general nonlinear functional h). However, there is no result on any g(h 0 ) using plug-in sieve NPIV estimator g( adaptive, minimax rate-optimal estimation of any nonlinear functional g(h 0 ) of the NPIV function h 0 yet. Since a quadratic functional is a leading example of a smooth nonlinear functional in h 0 , [10, Theorem C.1] established the minimax lower bound for estimating a quadratic functional f (h 0 ) in a NPIV model. They also point out that a plug-in sieve NPIV estimator f ( h) of the quadratic functional f (h 0 ) can achieve the lower bound in the severely ill-posed regime, but fails to achieve the lower bound in the mildly ill-posed regime. Moreover, none of the existing work considers adaptive minimax rate-optimal estimation of the quadratic functional f (h 0 ) in a NPIV model. In this paper, we first propose a simple leave-one-out sieve NPIV estimator  f J for the quadratic functional f (h 0 ), and establish an upper bound on its convergence rate. By choosing the sieve dimension J optimally to balance the squared bias and the variance parts, we show that the resulting convergence rate of  f J − f (h 0 ) coincides with the lower bound of [10, Theorem C.1]. In this sense the estimator  f J is minimax rate-optimal for f (h 0 ) regardless whether the NPIV model is severely ill-posed or mildly ill-posed. In particular, for the severely ill-posed case, the optimal convergence rate is of the order (log n)−α , where α > 0 depends on the smoothness of the NPIV function h 0 and the degree of severe ill-posedness. For the mildly ill-posed case, the optimal convergence rate of  f J − f (h 0 ) exhibits the so-called elbow phenomena: the rate is of the parametric order n −1/2 for the regular mildly ill-posed case, and is of the order n −β for the irregular mildly ill-posed case, where β ∈ (0, 1/2) depends on the smoothness of h 0 , the dimension of X and the degree of mild ill-posedness. The minimax optimal estimation rate of  f J − f (h 0 ) is achieved by the optimal choice of the sieve dimension J (a key tuning parameter) that depends on the unknown smoothness of h 0 and the unknown degree of ill-posedness. We next propose a data driven choice J of the sieve dimension based on a modified Lepski method.1 The modification is needed to account for the estimation of the unknown degree of illposedness. The adaptive, leave-one-out sieve NPIV estimator  f J of f (h 0 ) is shown to attain the minimax optimal rate in the severely√ill-posed case and in the regular mildly ill-posed case, but up to a multiplicative log n in the irregular mildly illposed case. We note that even for adaptive estimation of a quadratic functional of a direct regression in a Gaussian white noise model, [20] already shown that the extra √ log n factor is the necessary price to pay for adaptation to the unknown smoothness of the regression function. Previously for the nonparametric estimation of h 0 in the NPIV model (1), [25] considers adaptive estimation of h 0 in L 2 norm using a model selection procedure. Breunig and Johannes [5] consider adaptive estimation of a linear functional of the NPIV function h 0 in a root-mean squared error metric using a combined model selection and Lepski method. 
These papers obtain adaptive rate of convergence up 1

See [29, 30] and [31] for detailed descriptions of the original Lepski principle.

462

C. Breunig and X. Chen

 to a multiplicative factor of log(n) (of the minimax optimal rate) in both severely ill-posed and mildly ill-posed cases. Chen et al. [13] propose adaptive estimation of h 0 in L ∞ norm using a modified Lepski method and tight random matrix inequalities to account for the estimated measure of ill-posedness. They show that their datadriven procedure attains the minimax optimal rate in L ∞ norm and is fully adaptive to the unknown smoothness of h 0 in both severely and mildly ill-posed regimes. Our data-driven choice of the sieve dimension is closest to that of [13], which might explain why we also obtain minimax optimal adaptivity for the quadratic functional f (h 0 ) in both severely and mildly ill-posed regimes. While [5, 25] and [13] use plug-in sieve NPIV estimators in their adaptive estimation of a linear functional of h 0 , we use a leave-one-out sieve NPIV estimator  f J for the quadratic functional f (h 0 ) = h 20 (x)μ(x)d x. Recently [4] propose a test statistic that is based on a standardized leave-one-out estimator of a quadratic distance for a null hypothesis of E[(h 0 (X ) − h R (X ))2 μ(X )] = 0 in a NPIV model (for some parametric, semiparametric or shape restricted h R ). They construct an adaptive minimax test using a random exponential scan procedure. We use the unstandardized leave-one-out estimator  f J in our modified Lepski procedure for adaptive minimax estimation of f (h 0 ) in a NPIV model. It is well-known that adaptive minimax testing and adaptive minimax estimation are related but different (see, e.g., [23]). In particular, while both papers apply a tight Bernstein-type inequality for U-statistics ([26]) in the proofs, the adaptive optimal rates are different. For instance, the adaptive minimax L 2 separation rate of testing in [4] is always slower than n −1/2 , while our adaptive minimax estimation for f (h 0 ) can achieve the parametric rate of n −1/2 for regular mildly ill-posed NPIV models. Minimax rate-optimal estimation of a quadratic functional in density and direct regression (in Gaussian white noise) settings has a long history in statistics. See, for example, [1, 8, 16, 18, 20–22, 28] and the references therein. To the best of our knowledge, there are not many published papers on minimax estimation of a quadratic functional in difficult inverse problems. See [6, 7, 15] and [27] for deconvolutions and inverse regressions in Gaussian sequence models. Moreover, [15] seems the only published work on adaptive estimation of a quadratic functional in a special deconvolution (with a known operator). Our paper is the first to propose a simple estimator that is adaptive minimax rate-optimal for a quadratic functional in a NPIV model, and also contributes to inverse problems with unknown operators. The rest of the paper is organized as follows. Section 2 presents the leave-oneout sieve NPIV estimator of the quadratic functional f (h 0 ), and derives its optimal convergence rates. Section 3 first presents a simple data-driven procedure of choosing the sieve dimension using a modified Lepski method. It then establishes the optimal convergence rates of our adaptive estimator of the quadratic functional. Section 4 provides a brief conclusion and discusses several extensions. All proofs can be found in the Appendices 5–7.

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

463

2 Minimax Optimal Quadratic Functional Estimation This section consists of three parts. The first subsection introduces model preliminaries and notation. Section 2.2 introduces a simple leave-one-out, sieve NPIV estimator of the quadratic functional f (h 0 ). Section 2.3 establishes the convergence rate of the proposed estimator, and shows that the convergence rate coincides with the lower bound and hence is optimal.

2.1 Preliminaries and Notation We first introduce notation that is used throughout the paper. For any random vector V with support V, we let L 2 (V ) = {φ : V → R, φ L 2 (V ) < ∞} with the norm  φ L 2 (V ) = E[φ2 (V )]. If {an } and {bn } are sequences of positive numbers, we use the notation an  bn if lim supn→∞ an /bn < ∞ and an ∼ bn if an  bn and bn  an . We consider a known positive, continuous weighting function μ, which is assumed to be uniformly bounded below from zero and from above on some subset of X , denoted by X μ . Denote L 2μ = {h : Xμ → R, hμ < ∞} with the norm  h 2 (x)μ(x)d x. We consider basis functions {ψ j } j≥1 to approximate the hμ = NPIV function h 0 . Its orthonormalized analog with respect to  · μ is denoted by j } j≥1 . We assume that the structural function h 0 belongs to the Sobolev ellipsoid {ψ 

H2 ( p, L) = h ∈ L 2μ :



j 2 p/d h,  ψ j 2μ ≤ L , for d/2 < p < ∞, 0 < L < ∞ .

j=1

Let T : L 2 (X ) → L 2 (W ) denote the conditional expectation operator given by (T h)(w) = E[h(X )|W = w]. Finally let {ψ1 , ..., ψ J } and {b1 , ..., b K } be collections of sieve basis functions of dimension J and K for approximating functions in L 2 (X ) and L 2 (W ), respectively. We define the sieve measure of ill-posedness which, roughly speaking, measures how much the conditional expectation operator T smoothes out h. Following [3] the sieve L 2μ measure of ill-posedness is τJ =

√ hμ f (h) = sup , h∈ J ,h=0 T h L 2 (W ) h∈ J ,h=0 T h L 2 (W ) sup

where  J = clsp{ψ1 , ..., ψ J } ⊂ L 2 (X ) denotes the sieve spaces for the endogenous variables. We call a NPIV model (1) (i) mildly ill-posed if τ j ∼ j a/d for some a > 0; and (ii) severely ill-posed if τ j ∼ exp( 21 j a/d ) for some a > 0.

464

C. Breunig and X. Chen

2.2 A Leave-one-out, Sieve NPIV Estimator n Let {(Yi , X i , Wi )}i=1 denote a random sample from the NPIV model (1). The sieve NPIV (or series 2SLS) estimator  h of h 0 can be written in matrix form as follows (see, e.g., [10])

  Y/n  h(·) = ψ J (·) [  PB ]−   PB Y = ψ J (·) AB where PB = B(B  B)− B  and Y = (Y1 , . . . , Yn ) , ψ J (x) = (ψ1 (x), . . . , ψ J (x))

 = (ψ J (X 1 ), . . . , ψ J (X n ))

b K (w) = (b1 (w), . . . , b K (w))

B = (b K (W1 ), . . . , b K (Wn ))

−1  −1  = n[  PB ]−   B(B  B)− is an estimator of A = [S  G −1 and A b S] S G b , with S = E[b K (Wi )ψ J (X i ) ] and G b = E[b K (Wi )b K (Wi ) ]. As pointed out by [10], although one could estimate f (h 0 ) by the plug-in sieve NPIV estimator f ( h), it fails to achieve the minimax lower bound. We propose a leave-one-out sieve NPIV estimator for the quadratic functional f (h 0 ) as follows:

 fJ =

2  G μ A  b K (Wi  )Yi  Yi b K (Wi ) A n(n − 1) 1≤i d/2).

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

465

Assumption 1 (i) T [h − h 0 ] = 0 for any h ∈ L 2μ implies that f (h) = f (h 0 ); (ii) supw∈W E[Y 2 |W = w] ≤ σ 2Y < ∞ and E[Y 4 ] < ∞; (iii) the densities of X and W are Lebesgue continuous and uniformly bounded below from zero and from above on the closed rectangular supports X ⊂ Rd and W ⊂ Rdw , respectively.  Assumption 2 τ J J (log J )/n = O(1). K Below we let  K g(w) = b K (w) G −1 b E[b (W )g(W )] denote the sieve LS pro2 jection of g ∈ L (W ) onto B K = clsp{b1 , ..., b K }.

Assumption 3 (i) suph∈ J τ J ( K T − T )h L 2 (W ) /hμ ≤ v J where v J < 1 for all J and v J → 0 as J → ∞. (ii) there exists a constant C > 0 such that τ J T (h 0 −  J h 0 ) L 2 (W ) ≤ Ch 0 −  J h 0 μ . For a r × c matrix M with r ≤ c and full row rank r we let Ml− denote its left pseudoinverse, namely (M  M)− M  where  denotes transpose and − denotes generalized inverse. Below,  ·  respectively denotes the vector 2 norm when applied to a vector and the operator norm A := supx:x=1 Ax when applied to a matrix A. −1/2 Let (s1 , . . . , s J ) denote the singular values, in non-increasing order, of G b SG −1/2 . μ −1/2 In particular s J = smin (G b SG −1/2 ). μ

− −1/2 ≤ D for some constant D > 0. Assumption 4 diag(s1 , . . . , s J ) G b SG −1/2 μ l Discussion of Assumptions: Assumption 1(i) ensures identification of the nonlinear functional f (h 0 ). Assumption 2 restricts the growth of the sieve dimension J . Assumption 3(i) is a mild condition on the approximation properties of the basis used for the instrument space and is first imposed in [13]. In fact, ( K T − T )h L 2 (W ) = 0 for all h ∈  J when the basis functions for B K (with K ≥ J ) and  J form either a Riesz basis or an eigenfunction basis for the conditional expectation operator. Assumption 3(ii) is the usual L 2 “stability condition" imposed in the NPIV literature (cf. Assumption 6 in [3]). Note that Assumption 3(ii) is also automatically satisfied by Riesz bases. Assumption 4 is a modification of the sieve measure of ill-posedness and was used by [19]. Assumption 4 is also related to the extended link condition in [5] to establish optimal upper bounds in the context of minimax optimal estimation of linear functionals in NPIV models. Finally we note that by definition, s J satisfies sJ =

inf

h∈ J ,h=0

 K T h L 2 (W ) ≤ τ J−1 hμ

(3)

for all K = K (J ) ≥ J > 0. Assumption 3(i) further implies that sJ ≥

T h L 2 (W ) ( K T − T )h L 2 (W ) − sup = cτ τ J−1 , h∈ J ,h=0 hμ hμ h∈ J ,h=0 inf

(4)

for some constant cτ > 0. We shall maintain Assumption 3(i) and use the equivalence of s J and τ J−1 in the paper.

466

C. Breunig and X. Chen

The next result provides an upper bound on the rate of convergence for the estimator  fJ. Theorem 1 Let Assumptions 1–3 hold. Then:   f J − f (h 0 ) = O p

τ J2

√ n

J

 h 0 , ψ J  (G −1/2 S)− + τ J h 0 −  J h 0 μ μ l b 2 + − h 0 −  J h 0 μ . √ n

(5) If in addition h 0 ∈ H2 ( p, L) and Assumption 4 holds, then: 1. Mildly ill-posed case: choosing J ∼ n 2d/(4( p+a)+d) implies  f J − f (h 0 ) =





p+a)+d) , if p ≤ a + d/4, O p n −4 p/(4(

O p n −1/2 , if p > a + d/4.

(6)

2. Severely ill-posed case: choosing d/a  4p + d log log n J ∼ log n − 2a implies

 f J − f (h 0 ) = O p (log n)−2 p/a .

(7)

Theorem 1 presents an upper bound on the convergence rates of  f J to f (h 0 ). When the sieve dimension J is chosen optimally, the convergence rate (6) coincides with the minimax lower bound in [10, Theorem C.1] for the mildly ill-posed case, while the convergence rate (7) coincides with the minimax lower bound in [10, Theorem C.1] for the severely ill-posed case. Moreover, within the mildly ill-posed case, depending on the smoothness of h 0 relatively to the dimension of X and the degree of mildly ill-posedness a, either the first or the second variance term in (5) dominates, which leads to the so-called elbow phenomenon: the regular case with a parametric rate of n −1/2 when p > a + d/4; and the irregular case with a nonparametric rate when p ≤ a + d/4. In particular, Theorem 1 shows that the simple leave-one-out estimator  f J is minimax rate optimal provided that the sieve dimension J is chosen optimally. Chen and Christensen [10, Theorem C.1] actually established lower bound for estimating a quadratic functional of a derivative of h 0 in a NPIV model as well. Using Fourier, spline and wavelet bases, we can easily show that our simple leaveone-out, sieve NPIV estimator of the quadratic functional of a derivative of h 0 also achieve the lower bound, and hence is minimax rate-optimal. We do not present such a result here since it is a very minor extension of Theorem 1.

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

467

3 Rate Adaptive Estimation The minimax rate of convergence depends on the optimal choice of sieve dimension J , which depends on the unknown smoothness p of the true NPIV function h 0 and the unknown degree of ill-posedness. In this section we propose a data-driven choice of the sieve dimension J based on a modified Lepski method; see [29, 30] and [31] for early development of this popular method. In this section we follow [13] and let  J be a tensor-product Cohen-DaubechiesVial (CDV) wavelet (see, e.g., chap. 4.3.5 of [23]) or dyadic B-spline sieve (see, e.g., Appendix A.1 of [13]) for H2 ( p, L). Let T denote the set of possible sieve dimensions J . For example for (order r ) B-splines, T = {J = (2l + r − 1)d : l ∈ N ∪ {0}}. Since  f J is based on a sieve NPIV estimator, we can simply use a random index set  I that is proposed in [13] for their sup-norm rate adaptive sieve NPIV estimation of h 0 :  I = {J ∈ T : 0.1(log Jmax )2 ≤ J ≤ Jmax }, where

   √ −1 + + , s −1 J n <  s J log J ≤ 10 log J Jmax = min J ∈ T :  + J J

(8)

 s J is the smallest singular value of (B  B/n)−1/2 (B  /n)G −1/2 , and J + = min{ j ∈ μ T : j > J }. We define our data driven choice J of “optimal” sieve dimension for estimating f (h 0 ) as follows: 

(J ) + V (J  )) for all J  ∈  J = min J ∈  I : | fJ −  f J  | ≤ c0 (V I with J  > J (9) for some constant c0 > 0 and  (J ) = V

J (log n) 1 ∨√ , 2 n n sJ

(10)

where a ∨ b := max{a, b}. The random index set  I is used to compute our data s −1 driven choice (9) since the unknown measure of ill-posedness τ J is estimated by J . J }, where J = sup We introduce a non-random index set I = {J ∈ T : J ≤

  J ∈ T : τ J J (log J )/n ≤ c¯ for some sufficiently large constant c¯ > 0. Let B = {h ∈ L 2μ : h∞ ≤ L} and p > p ≥ 3d/4. The following assumption strengthens some conditions imposed in the previous section. Assumption 5 (i) suph∈H2 ( p,L)∩B h −  J hμ ≤ c J − p/d for some finite constant c > 0 for all p ∈ [ p, p], with  J being CDV wavelet or dyadic B-spline basis; (ii) supw∈W E[Y 4 |W = w] ≤ σ 4Y < ∞; (iii) Assumptions 3(ii) and 4 hold for all J ∈ I.

468

C. Breunig and X. Chen

The next result establishes an upper bound for the adaptive estimator  f J. Theorem 2 Let Assumptions 1(i)(iii), 3(i), and 5 hold. Then, we have in the 1. mildly ill-posed case: sup

sup

p∈[ p, p] h 0 ∈H2 ( p,L)∩B

 

f J − f (h 0 ) > C1rn = o(1) Ph 0  

(11)

for some constant C1 > 0 and where rn =

 √

log n/n n −1/2 ,

4 p/(4( p+a)+d)

, if p ≤ a + d/4, if p > a + d/4.

2. severely ill-posed case: sup

sup

p∈[ p, p] h 0 ∈H2 ( p,L)∩B

 

f J − f (h 0 ) > C2 (log n)−2 p/a = o(1) Ph 0  

(12)

for some constant C2 > 0. Theorem 2 shows that our data-driven choice of the key sieve dimension can lead to fully adaptive rate-optimal estimation of f (h 0 ) for both the severely ill-posed √ case and the regular mildly ill-posed case, while it has to pay a price of an extra log n factor for the irregular mildly ill-posed case (i.e., when p ≤ a + d/4). We note that when a = 0 in the mildly ill-posed case, the NPIV model (1) becomes the regression model with X = W . Thus our result is in √ agreement with the theory in [20], which showed that one must pay a factor of log n penalty in adaptive estimation of a quadratic functional in a Gaussian white noise model when p ≤ d/4. In adaptive estimation of a nonparametric regression function E[Y |X = ·] = h(·), it is known that Lepski method has the tendency of choosing small sieve dimension, and hence may not perform well in empirical work. We wish to point out that due to the ill-posedness of the NPIV model (1), the optimal sieve dimension for estimating f (h 0 ) is smaller than the optimal sieve dimension for estimating f (E[Y |X = ·]). Therefore, we suspect that our simple adaptive estimator of a quadratic functional of a NPIV function will perform well in finite samples.

4 Conclusion and Extensions In this paper we first show that a simple leave-one-out sieve NPIV estimator of the quadratic functional f (h 0 ) is minimax rate optimal. We then propose an adaptive leave-one-out sieve NPIV estimator of the f (h 0 ) based on a modified Lepski method to account for the unknown degree of ill-posedness. We show that the adaptive estimator achieves the minimax optimal rate for the severely ill-posed case and for

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

469

√ the regular mildly ill-posed case, while a multiplicative log n term is the price to pay for the irregular mildly ill-posed NPIV problem. Like all existing work using Lepski method, implementation of our data-driven choice relies on a calibration constant. To improve finite sample performance over the original Lepski method, [34] suggest a propagation approach, [14] and [35] propose bootstrap calibrations in kernel density estimation and in linear regressions with Gaussian errors respectively. [13] propose a bootstrap implementation of a modified Lepski method in their minimax adaptive sup-norm estimation in a NPIV model, and show its good performance in finite samples. Their bootstrap implementation can be easily extended to calibrate the constant in our adaptive estimation of the quadratic functional in a NPIV model. We leave this to future refinement. Our results can be extended in several directions. First, we can relax the Sobolev ball assumption imposed on h 0 in the NPIV model. We can let the NPIV function h 0 belong to a bump algebra space. The result by [16] on minimax estimation of a quadratic functional under sparsity constraints can be useful for this extension. Second, we focus on adaptive estimation of a quadratic functional of the NPIV function h 0 in this paper. There are works on minimax-rate estimation and adaptive estimation for more general smooth nonlinear functionals of densities and of nonparametric regressions; see, e.g., [2, 32] and the references therein. We can combine our approach here with those in the literature for extensions to other smooth nonlinear functionals of the NPIV function h 0 . Such an extension will allow for adaptive minimax estimation of nonlinear policy functionals in economics and modern causal inference.

5 Proofs of Results in Section 2 Recall the 2SLS projection of h onto  J is given by: −1  −1 K (J ) (W )h(X )] = ψ J (x) A E[b K (J ) (W )h(X )]. Q J h(x) = ψ J (x) [S  G −1 b S] S G b E[b

For a r × c matrix M with r ≤ c and full row rank r we let Ml− denote its left −1/2 J = G −1/2 ψ J and  pseudoinverse, namely (M  M)− M  . Let ψ b K = G b b K . Thus, μ 1/2 −1/2 − we have AG b = (G b S)l and 1/2

G 1/2 μ AG b

−1/2

= (G b

SG −1/2 )l− . μ

In particular, we can write −1/2

Q J h(x) = ψ J (x) (G b J (x) (G =ψ b

−1/2

S)l− E[ b K (J ) (W )h(X )] SG −1/2 )l− E[ b K (J ) (W )h(X )]. μ

470

C. Breunig and X. Chen

The minimal or maximal eigenvalue of a quadratic matrix M is denoted by λmin (M) or λmax (M). Recall that  fJ =

1  G μ A  b K (Wi  ). Yi Yi  b K (Wi ) A n(n − 1) i=i 

Proof (Proof of Theorem 1.) Proof of Result (5). Note that   2 −1/2 f (Q J h 0 ) = ψ J (x) (G b S)l− E[ b K (W )h 0 (X )] μ(x)d x 2 −1/2 − = G 1/2 S)l E[ b K (W )h 0 (X )] =  E[V J ]2 μ (G b K  using the notation Vi J = Yi G 1/2 μ Ab (Wi ). Thus, the definition of the estimator f J implies



1 Vi j Vi  j − E[V1 j ]2 (13) n(n − 1) j=1 i=i    1  G μ A  b K (Wi  ), + Yi Yi  b K (Wi ) A G μ A − A n(n − 1) i=i  J

 f J − f (Q J h 0 ) =

(14) where we bound both summands on the right hand side separately in the following. Consider the summand in (13), we observe J 

2  E Vi j Vi  j − E[V1 j ]2  j=1 i=i  J

= 2n(n − 1)(n − 2)

E





 V1 j V2 j − E[V1 j ]2 V3 j  V2 j  − E[V1 j  ]2

j, j  =1







I

+ n(n − 1)

J j, j  =1



E





 V1 j V2 j − E[V1 j ]2 V1 j  V2 j  − E[V1 j  ]2 . 



II

By Assumption 1(ii) it holds supw∈W E[Y 2 |W = w] ≤ σ 2Y , which together with b K (W ) ≤ σ 2Y . To bound the summand I we [4, Lemma E.7] implies λmax Var (Y  observe that

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

I =

J

471

E[V1 j ] E[V1 j  ] Cov(V1 j , V1 j  ) = E[V1J ] Cov(V1J , V1J ) E[V1J ]

j, j  =1

2

−1/2 ≤ λmax Var (Y  b K (W )) (G b SG −1/2 )l− E[V1J ] μ 2 −1/2 = σ 2Y Q J h 0 , ψ J μ (G b S)l− −1/2

by using the notation Vi J = Yi (G b I I = n(n − 1)

SG −1/2 )l− b K (Wi ). Consider I I . We observe μ J 

J

E[V1 j V1 j  ]2 − n(n − 1)

j, j  =1

≤ n(n − 1)

J

E[V1 j ]2

2

j=1

E[V1 j V1 j  ]2 ≤ 2σ 2Y n(n − 1)s −4 J J

j, j  =1

where the last inequality stems from [4, Lemma E.1] together with supw∈W E[Y 2 |W = w] ≤ σ 2Y . Consequently, we obtain   E

J

2 1 Vi j Vi  j − E[V1 j ]2  ≤ 4σ 4Y n(n − 1)  j=1 i=i



1 Q J h 0 , ψ J  (G −1/2 S)− 2 + J μ l b n n 2 s 4J

 .

(15) The second summand in (14) can be bounded following the same proof as that of [4, Lemma E.4] (replacing their (Yi − h 0 (X i )) with our Yi and our Assumption 1(ii)), which yields   f J − f (Q J h 0 ) = O p

√  1 J −1/2 − J  √ Q J h 0 , ψ μ (G b S)l + 2 . n ns J

J μ = (G −1/2 SG −1/2 )− E[ b K (J ) Next, by the definition of Q J we have: Q J h 0 , ψ μ b l (W )h 0 (X )]. Thus, we have Q J h 0 , ψ J  (G −1/2 S)− ≤ h 0 , ψ J  (G −1/2 S)− μ μ b b l l E[ b K (J ) (W )(h 0 (X ) −  J h 0 (X ))] , + s −2 J By inequality (4) and Assumption 3(ii), we have   E[ b K (J ) (W )(h 0 (X ) −  J h 0 (X ))] = O τ J2  K T (h 0 −  J h 0 ) L 2 (W ) s −2 J

= O τ J h 0 −  J h 0 μ . It remains to evaluate

472

C. Breunig and X. Chen

  f (Q J h 0 ) − f (h 0 ) = Q J h 0 2μ −  J h 0 2μ + h 0 −  J h 0 2μ . Consider the first summand on the right hand side. There exist unitary matrices M1 , J such that E[b˙ K (J ) (W )ψ˙ J (X ) ] has an upper M2 with b˙ K := M1 b K and ψ˙ J := M2 ψ J × J matrix diag(s1 , . . . , s J ) and is zero otherwise. We thus derive 2 −1/2 − K (J )  ) E[ b (W )h (X )] Q J h 0 2μ = (G b SG −1/2 0 μ l =

J

2 ˙ s −2 j E[b j (W )h 0 (X )] =

j=1

J

h 0 , ψ˙ j 2μ =  J h 0 2μ ,

j=1

and hence f (Q J h 0 ) − f (h 0 ) = −h 0 −  J h 0 2μ . This completes the proof of Result (5). For the proofs of Results (6) and (7), we note that h 0 ∈ H2 ( p, L) implies h 0 −  J h 0 μ ≤ L J − p/d . Moreover, by inequality (4) and Assumption 4 we have:   J  −1/2 −1/2 − − J  J   −1/2 −1 h 0 , ψ (G = h 0 ,  ≤ Dc  S) ψ (G S G ) τ 2j h 0 ,  ψ j 2μ . μ μ μ τ l l b b j=1

These bounds are used below to derive the concrete rates of convergence in the mildly and severely ill-posed regimes. Proof of Result (6) for the mildly ill-posed case. The choice of J ∼ n 2d/(4( p+a)+d) implies n −2 τ J4 J ∼ n −2 J 1+4a/d ∼ n −8 p/(4( p+a)+d) and for the bias term J −4 p/d ∼ n −8 p/(4( p+a)+d) . We now distinguish between the two regularity cases of the result. First, consider the case p ≤ a + d/4, where the mapping j → j 2(a− p)/d+1/2 is increasing and consequently, we observe n −1

J J j 2 τ 2 ∼ n −1 j 2 j 2 p/d−1/2 j 2(a− p)/d+1/2

h 0 , ψ

h 0 , ψ μ j μ j=1

j=1

n

−1

J 2(a− p)/d+1/2 ∼ n −8 p/(4( p+a)+d) .

Moreover, we obtain n −1 τ J2 J −2 p/d ∼ n −1 J 2(a− p)/d  n −8 p/(4( p+a)+d) . Finally, it remains to consider the case p > a + d/4. In this case, we have that

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

473

J J j 2 τ 2  j 2 j 2 p/d = O(1)

h 0 , ψ

h 0 , ψ μ J μ j=1

j=1

2 −1/2 and consequently, the second variance term satisfies n −1 Q J h 0 , ψ J μ (G b S)l− = O(n −1 ) which is the dominating rate and thus, completes the proof of the result. Proof of Result (7) for the severely ill-posed case. The choice of d/a  4p + d log log n J ∼ log n − 2a implies d/a  4p + d log log n n −2 τ J4 J ∼ n −2 J exp(2J a/d ) ∼ log n − (log n)−(4 p+d)/a ∼ (log n)−4 p/a . 2a

We further analyze for the bias part J

−4 p/d

−4 p/a  4p + d log log n ∼ log n − ∼ (log n)−4 p/a . 2a

Moreover, since the mapping j → j −2 p/d exp( j a/d ) is increasing we obtain n −1

J J

h 0 ,  ψ j 2μ τ 2j ∼ n −1

h 0 ,  ψ j 2μ j 2 p/d j −2 p/d exp( j a/d ) j=1

j=1

 n −1 exp(J a/d )J −2 p/d ∼ (log n)−2 p/a (log n)−(2 p+d)/a  (log n)−4 p/a

and finally n −1 τ J2 J −2 p/d ∼ n −1 exp(J a/d )J −2 p/d  (log n)−4 p/a , which shows the result.



6 Proofs of Results in Section 3  We denote H = p∈[ p, p] H2 ( p, L) ∩ B and recall that B = B(L) = {h : h∞ < L}. Below, we make use of the notation 

(J ) + V (J  )) for all J  ∈  J = J ∈  I : | fJ −  f J  | ≤ c0 (V I with J  > J and recall the definition  I = {J ∈ T : 0.1(log Jmax )2 ≤ J ≤ Jmax }. We denote

474

C. Breunig and X. Chen

 J (c) = sup{J ∈ T : τ J J (log J )/n ≤ c} for some constant c > 0. The oracle choice of the dimension parameter is given by 

J0 = J0 ( p, c0 ) = sup J ∈ T : V (J ) ≤ c0 J

−2 p/d



J (log n) 1 2 ∨√ , V (J ) = τ J n n (16)

for some constant c0 > 0. We introduce the set s J − s J | ≤ ηs J for all J ∈ I} En∗ = {J0 ∈ J} ∩ {| for some η ∈ (0, 1). Proof (Proof of Theorem 2.) Proof of Result (11) for the mildly ill-posed case. Due to [13, Lemma B.5] we have J (c1 ) ≤ Jmax ≤ J (c2 ) for some constants c1 , c2 > 0 on En∗ . The definition J = min J ∈J J implies J ≤ J0 on the set En∗ and hence, we obtain      f J −  f J0  1En∗ +|  f J0 − f (h 0 )| 1En∗ f J − f (h 0 ) 1En∗ ≤     ( J) + V (J0 ) 1E ∗ +|  ≤ c0 V f J0 − f (h 0 )|. n On the set En∗ , we have | s J − s J | ≤ ηs J , for some η ∈ (0, 1), which implies  s −2 J ≤ −2 −2 (·) in (10) we have s J (1 − η) and thus, by the definition of V √         + s −2 J0 1E ∗ log n ∨ √1 + |  f J − f (h 0 ) 1En∗ ≤ c0 (1 − η)−2 s −2 f J0 − f (h 0 )|. J J0 n J n n 2 Using inequality (4) together with Assumption 3(i) yields s −2 J ≤ cτ τ J for all J , see inequality (4). Consequently, from the definition of V (·) in (16) we infer:

√      1 log n c 0 cτ 2  2 ∗ ∨ √ +| τ J J + τ f J0 − f (h 0 )| 1 0 E  J n 0 J (1 − η)2 n n   c 0 cτ ≤ V ( J) + V (J0 ) 1En∗ +|  f J0 − f (h 0 )| (1 − η)2 2c0 cτ V (J0 ) + |  f J0 − f (h 0 )| ≤ (1 − η)2

   f J − f (h 0 ) 1En∗ ≤

for n sufficiently large, where the last inequality is due to V ( J) 1En∗ ≤ V (J0 ) since J ≤ J0 on En∗ . By Lemmas 3 and 7 it holds P(En∗ ) = 1 + o(1). √ The definition of the oracle choice in (16) implies J0 ∼ (n/ log n)2d/(4( p+a)+d) in the mildly ill-posed case. Thus, we obtain n −2 (log n)τ J40 J0 ∼ n −2 (log n)J0

1+4a/d

 ∼ ( log n/n)8 p/(4( p+a)+d)

Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models

475

which coincides with the rate for the bias term. We now distinguish between the two cases in the mildly ill-posed case. First, consider the case p ≤ a + d/4. In this case, the mapping j → j 2(a− p)/d+1/2 is increasing in j and consequently, we observe n

−1

J0

 j 2  J 2(a− p)/d+1/2 n −1  ( log n/n)8 p/(4( p+a)+d) . τ 2j h 0 , ψ μ 0

j=1

Moreover, using h 0 ∈ H2 ( p, L), i.e., n −1 τ J20



 2 2 p/d j≥1 h 0 , ψ j μ j

≤ L, we obtain

 j 2  ( log n/n)8 p/(4( p+a)+d) .

h 0 , ψ μ

j>J0

Finally, it remains to consider the case p > a + d/4, where as in the proof of Theo 0 j 2 = O(1), implying n −1 Q J0 h 0 , ψ J0 μ (G −1/2 S)− 2 τ 2j h 0 , ψ rem 1 we have Jj=1 μ b l = O(n −1 ) which is the dominating rate and thus, completes the proof for the mildly ill-posed case. Proof of Result (12) for the severely ill-posed case. We have      f J − f (h 0 ) 1En∗ ≤   f J − f (Q Jh 0 ) 1En∗ +  2 ≤ 2σ Y s −2 J (c

 2)

J (c2 ) log J (c2 ) n−1

max

J (c1 )≤J ≤J (c2 )

| f (Q J h 0 − h 0 )| 1En∗ −1/2

 Q J (c2 ) h 0 , ψ J (c2 ) μ (G b + √ n

S)l− 





−2 p/d + J (c1 )

with probability approaching one by Lemma 5. From [13, Lemma B.2] it holds, in the severely ill-posed case, J0+ = inf{J ∈ T : J > J0 } ≥ J (c1 ) for all n sufficiently large and thus, by the definition of J (·) we have  

−2 p/d  f J − f (h 0 ) 1En∗ ≤ (2σ 2Y + 1) C J (c2 ) with probability approaching one, using that J (c1 ) ≥ C J (c2 ) for some constant C > 0. From the definition of J (·) we have (c log n)d/a ≤ J (c2 ) for any c ∈ (0, 1) and n sufficiently large. This implies  

 f J − f (h 0 ) 1En∗ = O p (log n)−2 p/a , which completes the proof.



476

C. Breunig and X. Chen

7 Supplementary Lemmas We first introduce additional notation. First we consider a U-statistic Un,1 =

2 R1 (Z i , Z i  ) n(n − 1) i