English · 656 pages · 2020
Statistical Inference via Convex Optimization
Princeton Series in Applied Mathematics

Ingrid Daubechies (Duke University); Weinan E (Princeton University); Jan Karel Lenstra (Centrum Wiskunde & Informatica, Amsterdam); Endre Süli (University of Oxford), Series Editors

The Princeton Series in Applied Mathematics features high-quality advanced texts and monographs in all areas of applied mathematics. The series includes books of a theoretical and general nature as well as those that deal with the mathematics of specific applications and real-world scenarios. For a full list of titles in the series, go to https://press.princeton.edu/series/princeton-series-in-applied-mathematics

Statistical Inference via Convex Optimization, Anatoli Juditsky and Arkadi Nemirovski
A Dynamical Systems Theory of Thermodynamics, Wassim M. Haddad
Formal Verification of Control System Software, Pierre-Loïc Garoche
Rays, Waves, and Scattering: Topics in Classical Mathematical Physics, John A. Adam
Mathematical Methods in Elasticity Imaging, Habib Ammari, Elie Bretin, Josselin Garnier, Hyeonbae Kang, Hyundae Lee, and Abdul Wahab
Hidden Markov Processes: Theory and Applications to Biology, M. Vidyasagar
Topics in Quaternion Linear Algebra, Leiba Rodman
Mathematical Analysis of Deterministic and Stochastic Problems in Complex Media Electromagnetics, G. F. Roach, I. G. Stratis, and A. N. Yannacopoulos
Stability and Control of Large-Scale Dynamical Systems: A Vector Dissipative Systems Approach, Wassim M. Haddad and Sergey G. Nersesov
Matrix Completions, Moments, and Sums of Hermitian Squares, Mihály Bakonyi and Hugo J. Woerdeman
Modern Anti-windup Synthesis: Control Augmentation for Actuator Saturation, Luca Zaccarian and Andrew R. Teel
Totally Nonnegative Matrices, Shaun M. Fallat and Charles R. Johnson
Graph Theoretic Methods in Multiagent Networks, Mehran Mesbahi and Magnus Egerstedt
Matrices, Moments and Quadrature with Applications, Gene H. Golub and Gérard Meurant
Control Theoretic Splines: Optimal Control, Statistics, and Path Planning, Magnus Egerstedt and Clyde Martin
Robust Optimization, Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski
Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, Francesco Bullo, Jorge Cortés, and Sonia Martínez
Algebraic Curves over a Finite Field, J. W. P. Hirschfeld, G. Korchmáros, and F. Torres
Wave Scattering by Time-Dependent Perturbations: An Introduction, G. F. Roach
Genomic Signal Processing, Ilya Shmulevich and Edward R. Dougherty
The Traveling Salesman Problem: A Computational Study, David L. Applegate, Robert E. Bixby, Vašek Chvátal, and William J. Cook
Positive Definite Matrices, Rajendra Bhatia
Impulsive and Hybrid Dynamical Systems: Stability, Dissipativity, and Control, Wassim M. Haddad, VijaySekhar Chellaboina, and Sergey G. Nersesov
Statistical Inference via Convex Optimization Anatoli Juditsky Arkadi Nemirovski
Princeton University Press Princeton and Oxford
Copyright © 2020 by Princeton University Press

Requests for permission to reproduce material from this work should be sent to [email protected]

Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540
6 Oxford Street, Woodstock, Oxfordshire OX20 1TR
press.princeton.edu

All Rights Reserved

ISBN 9780691197296
ISBN (ebook) 9780691200316

British Library Cataloging-in-Publication Data is available

Editorial: Susannah Shoemaker and Lauren Bucca
Production Editorial: Nathan Carr
Production: Jacquie Poirier
Publicity: Matthew Taylor and Katie Lewis
Jacket/Cover Credit: Adapted from François de Kresz, "Excusez-moi, excusez-moi...," 1974
Copyeditor: Bhisham Bherwani

The publisher would like to acknowledge the authors of this volume for providing the camera-ready copy from which this book was printed.

This book has been composed in LaTeX

Printed on acid-free paper.
Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
Contents
List of Figures
Preface
Acknowledgements
Notational Conventions
About Proofs
On Computational Tractability
1 Sparse Recovery via ℓ1 Minimization
  1.1 Compressed Sensing: What is it about?
    1.1.1 Signal Recovery Problem
    1.1.2 Signal Recovery: Parametric and nonparametric cases
    1.1.3 Compressed Sensing via ℓ1 minimization: Motivation
  1.2 Validity of sparse signal recovery via ℓ1 minimization
    1.2.1 Validity of ℓ1 minimization in the noiseless case
    1.2.2 Imperfect ℓ1 minimization
    1.2.3 Regular ℓ1 recovery
    1.2.4 Penalized ℓ1 recovery
    1.2.5 Discussion
  1.3 Verifiability and tractability issues
    1.3.1 Restricted Isometry Property and s-goodness of random matrices
    1.3.2 Verifiable sufficient conditions for Qq(s, κ)
    1.3.3 Tractability of Q∞(s, κ)
  1.4 Exercises for Chapter 1
  1.5 Proofs
    1.5.1 Proofs of Theorems 1.3, 1.4
    1.5.2 Proof of Theorem 1.5
    1.5.3 Proof of Proposition 1.7
    1.5.4 Proof of Propositions 1.8 and 1.12
    1.5.5 Proof of Proposition 1.10
    1.5.6 Proof of Proposition 1.13

2 Hypothesis Testing
  2.1 Preliminaries from Statistics: Hypotheses, Tests, Risks
    2.1.1 Hypothesis Testing Problem
    2.1.2 Tests
    2.1.3 Testing from repeated observations
    2.1.4 Risk of a simple test
    2.1.5 Two-point lower risk bound
  2.2 Hypothesis Testing via Euclidean Separation
    2.2.1 Situation
    2.2.2 Pairwise Hypothesis Testing via Euclidean Separation
    2.2.3 Euclidean Separation, Repeated Observations, and Majority Tests
    2.2.4 From Pairwise to Multiple Hypotheses Testing
  2.3 Detectors and Detector-Based Tests
    2.3.1 Detectors and their risks
    2.3.2 Detector-based tests
  2.4 Simple observation schemes
    2.4.1 Simple observation schemes—Motivation
    2.4.2 Simple observation schemes—The definition
    2.4.3 Simple observation schemes—Examples
    2.4.4 Simple observation schemes—Main result
    2.4.5 Simple observation schemes—Examples of optimal detectors
  2.5 Testing multiple hypotheses
    2.5.1 Testing unions
    2.5.2 Testing multiple hypotheses "up to closeness"
    2.5.3 Illustration: Selecting the best among a family of estimates
  2.6 Sequential Hypothesis Testing
    2.6.1 Motivation: Election polls
    2.6.2 Sequential hypothesis testing
    2.6.3 Concluding remarks
  2.7 Measurement Design in simple observation schemes
    2.7.1 Motivation: Opinion polls revisited
    2.7.2 Measurement Design: Setup
    2.7.3 Formulating the MD problem
  2.8 Affine detectors beyond simple observation schemes
    2.8.1 Situation
    2.8.2 Main result
  2.9 Beyond the scope of affine detectors: lifting the observations
    2.9.1 Motivation
    2.9.2 Quadratic lifting: Gaussian case
    2.9.3 Quadratic lifting—Does it help?
    2.9.4 Quadratic lifting: Sub-Gaussian case
    2.9.5 Generic application: Quadratically constrained hypotheses
  2.10 Exercises for Chapter 2
    2.10.1 Two-point lower risk bound
    2.10.2 Around Euclidean Separation
    2.10.3 Hypothesis testing via ℓ1 separation
    2.10.4 Miscellaneous exercises
  2.11 Proofs
    2.11.1 Proof of the observation in Remark 2.8
    2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations
    2.11.3 Proof of Theorem 2.23
    2.11.4 Proof of Proposition 2.37
    2.11.5 Proof of Proposition 2.43
    2.11.6 Proof of Proposition 2.46
3 From Hypothesis Testing to Estimating Functionals
  3.1 Estimating linear forms on unions of convex sets
    3.1.1 The problem
    3.1.2 The estimate
    3.1.3 Main result
    3.1.4 Near-optimality
    3.1.5 Illustration
  3.2 Estimating N convex functions on unions of convex sets
    3.2.1 Outline
    3.2.2 Estimating N convex functions: Problem setting
    3.2.3 Bisection estimate: Construction
    3.2.4 Building Bisection estimate
    3.2.5 Bisection estimate: Main result
    3.2.6 Illustration
    3.2.7 Estimating N convex functions: An alternative
  3.3 Estimating linear forms beyond simple observation schemes
    3.3.1 Situation and goal
    3.3.2 Construction and main results
    3.3.3 Estimation from repeated observations
    3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters
  3.4 Estimating quadratic forms via quadratic lifting
    3.4.1 Estimating quadratic forms, Gaussian case
    3.4.2 Estimating quadratic form, sub-Gaussian case
  3.5 Exercises for Chapter 3
  3.6 Proofs
    3.6.1 Proof of Proposition 3.3
    3.6.2 Verifying 1-convexity of the conditional quantile
    3.6.3 Proof of Proposition 3.4
    3.6.4 Proof of Proposition 3.14

4 Signal Recovery by Linear Estimation
  Overview
  4.1 Preliminaries: Executive summary on Conic Programming
    4.1.1 Cones
    4.1.2 Conic problems and their duals
    4.1.3 Schur Complement Lemma
  4.2 Near-optimal linear estimation from Gaussian observations
    4.2.1 Situation and goal
    4.2.2 Building a linear estimate
    4.2.3 Byproduct on semidefinite relaxation
  4.3 From ellitopes to spectratopes
    4.3.1 Spectratopes: Definition and examples
    4.3.2 Semidefinite relaxation on spectratopes
    4.3.3 Linear estimates beyond ellitopic signal sets and ‖·‖₂ risk
  4.4 Linear estimates of stochastic signals
    4.4.1 Minimizing Euclidean risk
    4.4.2 Minimizing ‖·‖-risk
  4.5 Linear estimation under uncertain-but-bounded noise
    4.5.1 Uncertain-but-bounded noise
    4.5.2 Mixed noise
  4.6 Calculus of ellitopes/spectratopes
  4.7 Exercises for Chapter 4
    4.7.1 Linear estimates vs. Maximum Likelihood
    4.7.2 Measurement Design in Signal Recovery
    4.7.3 Around semidefinite relaxation
    4.7.4 Around Propositions 4.4 and 4.14
    4.7.5 Signal recovery in Discrete and Poisson observation schemes
    4.7.6 Numerical lower-bounding of minimax risk
    4.7.7 Around S-Lemma
    4.7.8 Miscellaneous exercises
  4.8 Proofs
    4.8.1 Preliminaries
    4.8.2 Proof of Proposition 4.6
    4.8.3 Proof of Proposition 4.8
    4.8.4 Proof of Lemma 4.17
    4.8.5 Proofs of Propositions 4.5, 4.16 and 4.19
    4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20
5 Signal Recovery Beyond Linear Estimates
  Overview
  5.1 Polyhedral estimation
    5.1.1 Motivation
    5.1.2 Generic polyhedral estimate
    5.1.3 Specifying sets Hδ for basic observation schemes
    5.1.4 Efficient upper-bounding of R[H] and contrast design, I
    5.1.5 Efficient upper-bounding of R[H] and contrast design, II
    5.1.6 Assembling estimates: Contrast aggregation
    5.1.7 Numerical illustration
    5.1.8 Calculus of compatibility
  5.2 Recovering signals from nonlinear observations by Stochastic Optimization
    5.2.1 Problem setting
    5.2.2 Assumptions
    5.2.3 Estimating via Sample Average Approximation
    5.2.4 Stochastic Approximation estimate
    5.2.5 Numerical illustration
    5.2.6 "Single-observation" case
  5.3 Exercises for Chapter 5
    5.3.1 Estimation by Stochastic Optimization
  5.4 Proofs
    5.4.1 Proof of (5.8)
    5.4.2 Proof of Lemma 5.6
    5.4.3 Verification of (5.44)
    5.4.4 Proof of Proposition 5.10

Solutions to Selected Exercises
  6.1 Solutions for Chapter 1
  6.2 Solutions for Chapter 2
    6.2.1 Two-point lower risk bound
    6.2.2 Around Euclidean Separation
    6.2.3 Hypothesis testing via ℓ1 separation
    6.2.4 Miscellaneous exercises
  6.3 Solutions for Chapter 3
  6.4 Solutions for Chapter 4
    6.4.1 Linear Estimates vs. Maximum Likelihood
    6.4.2 Measurement Design in Signal Recovery
    6.4.3 Around semidefinite relaxation
    6.4.4 Around Propositions 4.4 and 4.14
    6.4.5 Numerical lower-bounding of minimax risk
    6.4.6 Around S-Lemma
    6.4.7 Miscellaneous exercises
  6.5 Solutions for Chapter 5
    6.5.1 Estimation by Stochastic Optimization

Appendix: Executive Summary on Efficient Solvability of Convex Optimization Problems

Bibliography

Index
List of Figures
1.1 Top: true 256×256 image; bottom: sparse in the wavelet basis approximations of the image. The wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the "energy data" presented in the figure.
1.2 Single-pixel camera.
1.3 Regular and penalized ℓ1 recovery of nearly s-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive vector's entries is shown). Problem sizes are m = 256 and n = 2m = 512, the noise level is σ = 0.01, the deviation from s-sparsity is ‖x − x_s‖₁ = 1, and the contrast pair is (H = √(n/m) A, ‖·‖∞). In penalized recovery, λ = 2s; the parameter ρ of regular recovery is set to σ·ErfcInv(0.005/n).
1.4 Erroneous ℓ1 recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o – true signal, + – recovery. Bottom: time domain.
2.1 "Gaussian Separation" (Example 2.5): Optimal test deciding on whether the mean of a Gaussian r.v. belongs to the domain A (H1) or to the domain B (H2). Hyperplane o–o separates the acceptance domains for H1 ("left" half-space) and for H2 ("right" half-space).
2.2 Drawing for Proposition 2.4.
2.3 Positron Emission Tomography (PET).
2.4 Nine hypotheses on the location of the mean µ of observation ω ∼ N(µ, I₂), each stating that µ belongs to a specific polygon.
2.5 Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.
2.6 3 candidate hypotheses in the probabilistic simplex ∆₃.
2.7 PET scanner.
2.8 Frames from a "movie."
3.1 Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01-risk bounds max_{1≤i,j≤100} ρ_{ij} (as in (3.15)) for different observation sample sizes K.
3.2 Bisection via Hypothesis Testing.
3.3 A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.
3.4 Histograms of recovery errors in experiments, 1,000 simulations per experiment.
4.1 True distribution of temperature U∗ = B(x) at time t0 = 0.01 (left) along with its recovery Û via the optimal linear estimate (center) and the "naive" recovery Ũ (right).
5.1 Recovery errors for the near-optimal linear estimate (circles) and for the polyhedral estimates yielded by Proposition 5.8 (Poly-I, pentagrams) and by the construction from Section 5.1.4 (Poly-II, triangles), 20 simulations per each value of σ.
5.2 Left: functions h; right: moduli of strong monotonicity of the operators F(·) in {z : ‖z‖₂ ≤ R} as functions of R. Dashed lines – case A (logistic sigmoid), solid lines – case B (linear regression), dash-dotted lines – case C (hinge function), dotted line – case D (ramp sigmoid).
5.3 Mean errors and CPU times for SA (solid lines) and SAA (dashed lines) estimates as functions of the number of observations K. o – case A (logistic link), x – case B (linear link), + – case C (hinge function), □ – case D (ramp sigmoid).
5.4 Solid curve: M_{ω^K}(z); dashed curve: H_{ω^K}(z). True signal x (solid vertical line): +0.081; SAA estimate (unique minimizer of H_{ω^K}, dashed vertical line): −0.252; ML estimate (global minimizer of M_{ω^K} on [−20, 20]): −20.00; local minimizer of M_{ω^K} closest to x (dotted vertical line): −0.363.
5.5 Mean errors and CPU times for standard deviation λ = 1 (solid line) and λ = 0.1 (dashed line).
6.1 Recovery of a 1200×1600 image at different noise levels, ill-posed case ∆ = 0.
6.2 Recovery of a 1200×1600 image at different noise levels, well-posed case ∆ = 0.25.
6.3 Smooth curve: f; "bumpy" curve: recovery; gray cloud: observations. In all experiments, n = 8192, κ = 2, p0 = p1 = p2 = 2, σ = 0.5, L_ι = (10π)^ι, 0 ≤ ι ≤ 2.
PREFACE

When speaking about links between Statistics and Optimization, what comes to mind first is the indispensable role played by optimization algorithms in the "computational toolbox" of Statistics (think about the numerical implementation of the fundamental Maximum Likelihood method). However, on second thought, we should conclude that no matter how significant this role could be, the fact that it comes to our mind first primarily reflects the weaknesses of Optimization rather than its strengths; were the optimization algorithms used in Statistics as efficient and as reliable as, say, Linear Algebra techniques, nobody would think about special links between Statistics and Optimization, just as nobody usually thinks about special links between Statistics and Linear Algebra. When computational, rather than methodological, issues are concerned, we start to think about links of Statistics with Optimization, Linear Algebra, Numerical Analysis, etc., only when the computational tools offered to us by these disciplines do not work well and need the attention of experts in those disciplines.

The goal of this book is to present other types of links between Optimization and Statistics, ones that have little in common with algorithms and number-crunching. What we are speaking about are the situations where Optimization theory (theory, not algorithms!) seems to be of methodological value in Statistics, acting as the source of statistical inferences with provably optimal, or nearly optimal, performance. In this context, we focus on utilizing Convex Programming theory, mainly due to its power, but also due to the desire to end up with inference routines reducing to solving convex optimization problems and thus implementable in a computationally efficient fashion.
Therefore, while we do not mention computational issues explicitly, we do remember that at the end of the day we need a number, and in this respect intrinsically computation-friendly convex optimization models are the first choice. The three topics we intend to consider are:

A. Sparsity-oriented Compressive Sensing. Here the role of Convex Optimization theory as a creative tool motivating the construction of inference procedures is relatively less important than in the two other topics. This being said, its role is by far non-negligible in the analysis of Compressive Sensing routines (it allows one, e.g., to derive from "first principles" the necessary and sufficient conditions for the validity of ℓ1 recovery). On this account, and also due to its popularity and the fact that it is now one of the major "customers" of advanced convex optimization algorithms, we believe that Compressive Sensing is worthy of being considered.

B. Pairwise and Multiple Hypothesis Testing, including sequential tests, estimation of linear functionals, and some rudimentary design of experiments.

C. Recovery of signals from noisy observations of their linear images.

B and C are the topics where, as of now, the approaches we present in this book appear to be the most successful.

The exposition does not require prior knowledge of Statistics and Optimization; as far as these disciplines are concerned, all necessary facts and concepts are incorporated into the text. The actual prerequisites are basic Calculus, Probability, and Linear Algebra.

Selection and treatment of our topics are inspired by a kind of "philosophy"
which can be explained to an expert as follows. Compare two well-known results of nonparametric statistics ("⟨...⟩" marks fragments irrelevant to the discussion to follow):

Theorem A [I. Ibragimov & R. Khasminskii [124], 1979] Given α, L, k, let X be the set of all functions f : [0, 1] → R with (α, L)-Hölder continuous k-th derivative. For a given t, the minimax risk of estimating f(t), f ∈ X, from noisy observations

\[ y = f|_{\Gamma_n} + \xi, \quad \xi \sim \mathcal{N}(0, I_n), \]

taken along the n-point equidistant grid Γ_n, up to a factor C(β) = ⟨...⟩, β := k + α, is (L n^{-β})^{1/(2β+1)}, and the upper risk bound is attained at an estimate affine in y, explicitly given by ⟨...⟩.
Theorem B [D. Donoho [64], 1994] Let X ⊂ R^N be a convex compact set, A be an n × N matrix, and g(·) be a linear form on X. The minimax, over f ∈ X, risk of recovering g(f) from the noisy observations

\[ y = Af + \xi, \quad \xi \sim \mathcal{N}(0, I_n), \]

within factor 1.2 is attained at an estimate affine in y which, along with its risk, can be built efficiently by solving the convex optimization problem ⟨...⟩.

In many respects, A and B are similar: both are theorems on minimax optimal estimation of a given linear form of an unknown "signal" f known to belong to a given convex set X from observations, corrupted by Gaussian noise, of the image of f under a linear mapping,¹ and both are associated with efficiently computable near-optimal—in a minimax sense—estimators which happen to be affine in the observations. There is, however, a significant structural difference: A gives an explicit "closed form" analytic description of the minimax risk as a function of n and the smoothness parameters of f, along with an explicit description of the near-optimal estimator. Numerous results of this type—let us call them descriptive—form the backbone of the deep and rich theory of Nonparametric Statistics. This being said, the strong "explanation power" of descriptive results has its price: we need to impose assumptions, sometimes quite restrictive, on the entities involved.
For example, A says nothing about what happens with the minimax risk/estimate when, in addition to smoothness, other a priori information on f, like monotonicity or convexity, is available, and/or when "direct" observations of f|_{Γ_n} are replaced with observations of a linear image of f (say, the convolution of f with a given kernel; more often than not, this is what happens in applications); descriptive answers to the questions just posed require a dedicated (and sometimes quite problematic) investigation more or less "from scratch." In contrast, the explanation power of B is basically nonexistent: the statement presents "closed form" expressions neither for the near-optimal estimate nor for its worst-case risk. As compensation, B makes only (relatively) mild general structural assumptions about the model (convexity and compactness of X, linear dependence of y on f), and all the rest—the near-optimal estimate and its risk—can be found by efficient computation. Moreover, we know in advance that the risk, whatever it happens to be, is within 20% of the actual minimax risk achievable under the circumstances. In this respect, B is an operational, rather than a descriptive, result: it explains how to act to achieve the (nearly) best possible performance, with no a priori prediction of what this performance will be. This hardly is a "big issue" in applications—with huge computational power readily available, efficient computability is basically as good as a "simple explicit formula." We
¹ Infinite dimensionality of X in A is of no importance—nothing changes when replacing the original X with its n-dimensional image under the mapping f ↦ f|_{Γ_n}.
strongly believe that as far as applications of high-dimensional statistics are concerned, operational results, possessing a much broader scope than their descriptive counterparts, are of significant importance and potential. Our main motivation when writing this book was to contribute to the body of operational results in Statistics, and this is what Chapters 2–5 to follow are about.

Anatoli Juditsky & Arkadi Nemirovski
March 6, 2019
ACKNOWLEDGEMENTS

We are greatly indebted to H. Edwin Romeijn, who initiated the creation of the Ph.D. course "Topics in Data Science"; the lecture notes for this course form the seed of the book to follow. We gratefully acknowledge support from NSF Grant CCF-1523768, Statistical Inference via Convex Optimization; this research project is the source of basically all novel results presented in Chapters 2–5.

Our deepest gratitude goes to Lucien Birgé, who encouraged us to write this monograph, and to Stephen Boyd, who many years ago taught one of the authors the "operational philosophy" motivating the research we are presenting. Our separate thanks go to those who decades ago guided our first steps along the road which led to this book—Rafail Khasminskii, Yakov Tsypkin, and Boris Polyak.

We are deeply indebted to our colleagues Alekh Agarwal, Aharon Ben-Tal, Fabienne Comte, Arnak Dalalyan, David Donoho, Céline Duval, Valentine Genon-Catalot, Alexander Goldenshluger, Yuri Golubev, Zaid Harchaoui, Gérard Kerkyacharian, Vladimir Koltchinskii, Oleg Lepski, Pascal Massart, Eric Moulines, Axel Munk, Aleksander Nazin, Yuri Nesterov, Dominique Picard, Alexander Rakhlin, Philippe Rigollet, Alex Shapiro, Vladimir Spokoiny, Alexandre Tsybakov, and Frank Werner for their advice and remarks. We would like to thank Elitsa Marielle, Andrey Kulunchakov, and Hlib Tsyntseus for their assistance in preparing the manuscript.

It was our pleasure to collaborate with Princeton University Press on this project. We highly appreciate the valuable comments of the anonymous referees, which helped to improve the initial text. We are greatly impressed by the professionalism of the Princeton University Press editors, in particular Lauren Bucca, Nathan Carr, and Susannah Shoemaker, and also by their care and patience. Needless to say, responsibility for all drawbacks of the book is ours.

A. J. & A. N.
NOTATIONAL CONVENTIONS

Vectors and matrices. By default, all vectors are column ones; to write them down, we use "Matlab notation": the vector \(\begin{bmatrix}1\\2\\3\end{bmatrix}\) is written as [1; 2; 3]. More generally, for vectors/matrices A, B, C, ..., Z of the same "width" (or of the same "height"), [A; B; C; ...; Z] is the matrix obtained by vertical (or horizontal) concatenation of A, B, C, etc. For example, for what in the "normal" notation is written down as

\[ A = \begin{bmatrix}1&2\\3&4\end{bmatrix}, \quad B = \begin{bmatrix}5&6\end{bmatrix}, \quad C = \begin{bmatrix}7\\8\end{bmatrix}, \]

we have

\[ [A;B] = \begin{bmatrix}1&2\\3&4\\5&6\end{bmatrix} = [1,2;3,4;5,6], \qquad [A,C] = \begin{bmatrix}1&2&7\\3&4&8\end{bmatrix} = [1,2,7;3,4,8]. \]
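For readers who prefer to experiment, the same concatenation rules can be mimicked in NumPy (a sketch of ours; the book itself uses Matlab notation, and NumPy is simply one convenient stand-in):

```python
import numpy as np

# np.vstack plays the role of [A; B] (vertical concatenation),
# np.hstack plays the role of [A, C] (horizontal concatenation).
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6]])
C = np.array([[7], [8]])

AB = np.vstack([A, B])   # [A; B]
AC = np.hstack([A, C])   # [A, C]

print(AB.tolist())  # [[1, 2], [3, 4], [5, 6]]
print(AC.tolist())  # [[1, 2, 7], [3, 4, 8]]
```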
Blanks in matrices replace (blocks of) zero entries. For example,

\[ \begin{bmatrix}1&&\\&2&\\3&4&5\end{bmatrix} = \begin{bmatrix}1&0&0\\0&2&0\\3&4&5\end{bmatrix}. \]
Diag{A1, A2, ..., Ak} stands for a block-diagonal matrix with diagonal blocks A1, A2, ..., Ak. For example,

\[ \mathrm{Diag}\{1,2,3\} = \begin{bmatrix}1&&\\&2&\\&&3\end{bmatrix}, \qquad \mathrm{Diag}\{[1,2];[3;4]\} = \begin{bmatrix}1&2&\\&&3\\&&4\end{bmatrix}. \]

For an m×n matrix A, dg(A) is the diagonal of A—a vector of dimension min[m, n] with entries A_ii, 1 ≤ i ≤ min[m, n].
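The two constructions above can be sketched in NumPy as follows (the names `Diag` and `dg` mirror the book's notation; the implementations are ours, not the book's):

```python
import numpy as np

def Diag(*blocks):
    """Block-diagonal matrix Diag{A1, ..., Ak}; scalars count as 1x1 blocks."""
    blocks = [np.atleast_2d(B) for B in blocks]
    m = sum(B.shape[0] for B in blocks)
    n = sum(B.shape[1] for B in blocks)
    out = np.zeros((m, n), dtype=int)
    i = j = 0
    for B in blocks:
        out[i:i + B.shape[0], j:j + B.shape[1]] = B
        i += B.shape[0]
        j += B.shape[1]
    return out

def dg(A):
    """dg(A): vector of entries A[i, i], 0 <= i < min(m, n)."""
    A = np.asarray(A)
    return np.array([A[i, i] for i in range(min(A.shape))])

print(Diag(1, 2, 3).tolist())               # [[1, 0, 0], [0, 2, 0], [0, 0, 3]]
print(Diag([1, 2], [[3], [4]]).tolist())    # [[1, 2, 0], [0, 0, 3], [0, 0, 4]]
print(dg([[1, 2, 3], [4, 5, 6]]).tolist())  # [1, 5]
```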
Standard linear spaces in our book are Rn (the space of ndimensional column vectors), Rm×n (the space of m × n real matrices), and Sn (the space of n × n real symmetric matrices). All these linear spaces are equipped with the standard inner product: X hA, Bi = Aij Bij = Tr(AB T ) = Tr(BAT ) = Tr(AT B) = Tr(B T A); i,j
in the case when A = a and B = b are column vectors, this simplifies to ⟨a, b⟩ = a^T b = b^T a, and when A, B are symmetric, there is no need to write B^T in Tr(AB^T). Usually, we denote vectors by lowercase, and matrices by uppercase letters; sometimes, however, lowercase letters are used also for matrices. Given a linear mapping A(x) : E_x → E_y, where E_x, E_y are standard linear spaces, one can define the conjugate mapping A*(y) : E_y → E_x via the identity

⟨A(x), y⟩ = ⟨x, A*(y)⟩ ∀(x ∈ E_x, y ∈ E_y).

One always has (A*)* = A. When E_x = R^n, E_y = R^m and A(x) = Ax, one has A*(y) = A^T y; when E_x = R^n, E_y = S^m, so that A(x) = ∑_{i=1}^n x_i A_i, A_i ∈ S^m, we have

A*(Y) = [Tr(A_1 Y); ...; Tr(A_n Y)].
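The identity ⟨A(x), Y⟩ = ⟨x, A*(Y)⟩ for the mapping A(x) = ∑_i x_i A_i can be checked numerically; the sketch below uses randomly generated symmetric matrices (illustrative data of ours, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
# random A_i in S^m defining A(x) = sum_i x_i A_i : R^n -> S^m
As = [rng.standard_normal((m, m)) for _ in range(n)]
As = [0.5 * (M + M.T) for M in As]          # symmetrize

def A_map(x):
    return sum(x[i] * As[i] for i in range(n))

def A_star(Y):
    # conjugate mapping: A*(Y) = [Tr(A_1 Y); ...; Tr(A_n Y)]
    return np.array([np.trace(As[i] @ Y) for i in range(n)])

x = rng.standard_normal(n)
Y = rng.standard_normal((m, m))
Y = 0.5 * (Y + Y.T)                          # Y in S^m

# <A(x), Y> = Tr(A(x) Y) must equal <x, A*(Y)> = x^T A*(Y)
lhs = np.trace(A_map(x) @ Y)
rhs = x @ A_star(Y)
```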
Z^n is the set of n-dimensional integer vectors.
Norms. For 1 ≤ p ≤ ∞ and for a vector x = [x_1; ...; x_n] ∈ R^n, ‖x‖_p is the standard p-norm of x:

‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p} for 1 ≤ p < ∞, and ‖x‖_∞ = max_i |x_i| = lim_{p′→∞} ‖x‖_{p′}.

The spectral norm (the largest singular value) of a matrix A is denoted by ‖A‖_{2,2}; notation for other norms of matrices is specified when used.

Standard cones. R_+ is the nonnegative ray on the real axis; R^n_+ stands for the n-dimensional nonnegative orthant, the cone comprised of all entrywise nonnegative vectors from R^n; S^n_+ stands for the positive semidefinite cone in S^n, the cone comprised of all positive semidefinite matrices from S^n.

Miscellaneous.
• For matrices A, B, the relation A ⪯ B, or, equivalently, B ⪰ A, means that A, B are symmetric matrices of the same size such that B − A is positive semidefinite; we write A ⪰ 0 to express the fact that A is a symmetric positive semidefinite matrix. The strict version A ≻ B (⇔ B ≺ A) of A ⪰ B means that A − B is positive definite (and, as above, A and B are symmetric matrices of the same size).
• A Linear Matrix Inequality (LMI, a.k.a. semidefinite constraint) in variables x is the constraint on x stating that a symmetric matrix affinely depending on x is positive semidefinite. When x ∈ R^n, an LMI reads

A_0 + ∑_i x_i A_i ⪰ 0 [A_i ∈ S^m, 0 ≤ i ≤ n].
• N(µ, Θ) stands for the Gaussian distribution with mean µ and covariance matrix Θ. Poisson(µ) denotes the Poisson distribution with parameter µ ∈ R_+, i.e., the distribution of a random variable taking values i = 0, 1, 2, ... with probabilities (µ^i/i!)e^{−µ}. Uniform([a, b]) is the uniform distribution on the segment [a, b].
• For a probability distribution P,
  • ξ ∼ P means that ξ is a random variable with distribution P. Sometimes we express the same fact by writing ξ ∼ p(·), where p is the density of P taken w.r.t. some reference measure (the latter always is fixed by the context);
  • E_{ξ∼P}{f(ξ)} is the expectation of f(ξ), ξ ∼ P; when P is clear from the context, this notation can be shortened to E_ξ{f(ξ)}, or E_P{f(ξ)}, or even E{f(ξ)}. Similarly, Prob_{ξ∼P}{...}, Prob_ξ{...}, Prob_P{...}, and Prob{...} denote the P-probability of the event specified inside the braces.
• O(1)'s stand for positive absolute constants, that is, positive reals with numerical values (completely independent of the parameters of the situation at hand) which we do not want, or are too lazy, to write down explicitly, as in |sin(x)| ≤ O(1)|x|.
• ∫_Ω f(ξ)Π(dξ) stands for the integral, taken w.r.t. the measure Π over the domain Ω, of the function f.
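The vector and matrix norms defined above map directly onto numpy; a small sketch (illustrative values of ours):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
n1 = np.linalg.norm(x, 1)        # ||x||_1 = sum of |x_i| = 7
n2 = np.linalg.norm(x, 2)        # ||x||_2 = Euclidean norm = 5
ninf = np.linalg.norm(x, np.inf) # ||x||_inf = max |x_i| = 4

A = np.diag([3.0, 1.0])
# spectral norm ||A||_{2,2}: the largest singular value of A
nspec = np.linalg.norm(A, 2)     # = 3 for this diagonal matrix
```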
ABOUT PROOFS The book is basically selfcontained in terms of proofs of the statements to follow. Simple proofs usually are placed immediately after the corresponding statements; more technical proofs are transferred to dedicated sections titled “Proof of ...” at the end of each chapter, and this is where a reader should look for “missing” proofs.
ON COMPUTATIONAL TRACTABILITY In the main body of the book, one can frequently meet sentences like “Φ(·) is an efficiently computable convex function,” or “X is a computationally tractable convex set,” or “(P ) is an explicit, and therefore efficiently solvable, convex optimization problem.” For an “executive summary” on what these words actually mean, we refer the reader to the Appendix.
Chapter One

Sparse Recovery via ℓ1 Minimization

In this chapter, we overview basic results of Compressed Sensing, a relatively new and rapidly developing area in Statistics and Signal Processing dealing with recovering signals (vectors x from some R^n) from their noisy observations Ax + η (A is a given m × n sensing matrix, η is observation noise) in the case when the number of observations m is much smaller than the signal's dimension n, but is essentially larger than the "true" dimension, the number of nonzero entries, of the signal. This setup leads to a deep, elegant, and highly innovative theory and possesses quite significant application potential. It should be added that along with plain sparsity (a small number of nonzero entries), Compressed Sensing deals with other types of "low-dimensional structure" hidden in high-dimensional signals, most notably the case of low-rank matrix recovery, when the signal is a matrix and "sparse" signals are matrices with low rank, and the case of block sparsity, where the signal is a block vector and sparsity means that only a small number of blocks are nonzero. In our presentation, we do not consider these extensions and restrict ourselves to the simplest sparsity paradigm.
1.1 COMPRESSED SENSING: WHAT IS IT ABOUT?

1.1.1 Signal Recovery Problem
One of the basic problems in Signal Processing is the problem of recovering a signal x ∈ R^n from noisy observations

y = Ax + η  (1.1)
of a linear image of the signal under a given sensing mapping x ↦ Ax : R^n → R^m; in (1.1), η is the observation error. The matrix A in (1.1) is called the sensing matrix. Recovery problems of the outlined type arise in many applications, including, but by far not reducing to,
• communications, where x is the signal sent by the transmitter, y is the signal recorded by the receiver, and A represents the communication channel (reflecting, e.g., the dependence of the decay in the signal's amplitude on the transmitter-receiver distance); here η typically is modeled as the standard (zero mean, unit covariance matrix) m-dimensional Gaussian noise;¹

¹While the "physical" noise indeed is often Gaussian with zero mean, its covariance matrix is not necessarily the unit matrix. Note, however, that a zero mean Gaussian noise η always can be represented as Qξ with standard Gaussian ξ. Assuming that Q is known and is nonsingular (which indeed is so when the covariance matrix of η is positive definite), we can rewrite (1.1) equivalently as Q⁻¹y = [Q⁻¹A]x + ξ and treat Q⁻¹y and Q⁻¹A as our new observation and new sensing matrix; the new observation noise ξ is indeed standard. Thus, in the case of Gaussian zero mean observation noise, to assume the noise standard Gaussian is the same as to assume that its covariance matrix is known.
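The whitening step in the footnote can be sketched numerically (random illustrative data of ours, not from the book):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# zero-mean Gaussian noise eta = Q xi with nonsingular Q and standard xi
Q = rng.standard_normal((m, m)) + 3 * np.eye(m)
xi = rng.standard_normal(m)
y = A @ x + Q @ xi

# whitening: pass to Q^{-1} y = [Q^{-1} A] x + xi,
# so the new observation noise is standard Gaussian
Qinv = np.linalg.inv(Q)
y_new, A_new = Qinv @ y, Qinv @ A
```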
• image reconstruction, where the signal x is an image (a 2D array in usual photography, or a 3D array in tomography) and y is the data acquired by the imaging device. Here η in many cases (although not always) can again be modeled as the standard Gaussian noise;
• linear regression, arising in a wide range of applications. In linear regression, one is given m pairs of "inputs" a_i ∈ R^n to a "black box" and the corresponding outputs y_i ∈ R. Sometimes we have reason to believe that the output is a noise-corrupted version of the "existing in nature," but unobservable, "ideal output" y_i* = x^T a_i which is just a linear function of the input (this is called the "linear regression model," with inputs a_i called "regressors"). Our goal is to convert the actual observations (a_i, y_i), 1 ≤ i ≤ m, into estimates of the unknown "true" vector of parameters x. Denoting by A the matrix with rows [a_i]^T and assembling the individual observations y_i into a single observation y = [y_1; ...; y_m] ∈ R^m, we arrive at the problem of recovering the vector x from noisy observations of Ax. Here again the most popular model for η is the standard Gaussian noise.

1.1.2 Signal Recovery: Parametric and nonparametric cases
Recovering the signal x from the observation y would be easy if there were no observation noise (η = 0) and the rank of the matrix A were equal to the dimension n of the signals. In this case, which can arise only when m ≥ n ("more observations than unknown parameters"), and which is typical in this range of m and n, the desired x would be the unique solution to the system of linear equations, and finding x would be a simple problem of Linear Algebra. Aside from this trivial "enough observations, no noise" case, people over the years have looked at the following two versions of the recovery problem:

Parametric case: m ≫ n, η is nontrivial noise with zero mean, say, standard Gaussian. This is the classical statistical setup, with the emphasis on how to use the numerous available observations in order to suppress in the recovery, to the extent possible, the influence of observation noise.

Nonparametric case: m ≪ n.²

²Of course, this is a blatant simplification: the nonparametric case covers also a variety of important and by far nontrivial situations in which m is comparable to n or larger than n (or even ≫ n). However, this simplification is very convenient, and we will use it in this introduction.

If addressed literally, this case seems to be senseless: when the number of observations is less than the number of unknown parameters, even in the noiseless case we arrive at the necessity to solve an underdetermined (fewer equations than unknowns) system of linear equations. Linear Algebra says that if solvable, the system has infinitely many solutions. Moreover, the solution set (an affine subspace of positive dimension) is unbounded, meaning that the solutions are in no sense close to each other. A typical way to make the case of m ≪ n meaningful is to add to the observations (1.1) some a priori information about the signal. In traditional Nonparametric Statistics, this additional information is summarized in a bounded convex set X ⊂ R^n, given to us in advance, known to contain the true signal x. This set usually is such that every signal x ∈ X can be approximated by a linear combination of s = 1, 2, ..., n vectors
from a properly selected basis known to us in advance ("dictionary" in the slang of signal processing) within accuracy δ(s), where δ(s) is a function, known in advance, approaching 0 as s → ∞. In this situation, with appropriate A (e.g., just the unit matrix, as in the denoising problem), we can select some s ≪ m and try to recover x as if it were a vector from the linear span E_s of the first s vectors of the outlined basis [54, 86, 124, 112, 208]. In the "ideal case," x ∈ E_s, recovering x in fact reduces to the case where the dimension of the signal is s ≪ m rather than n ≫ m, and we arrive at the well-studied situation of recovering a signal of low (compared to the number of observations) dimension. In the "realistic case" of x δ(s)-close to E_s, the deviation of x from E_s results in an additional component of the recovery error ("bias"); a typical result of traditional Nonparametric Statistics quantifies the resulting error and minimizes it in s [86, 124, 178, 222, 223, 230, 239]. Of course, this outline of the traditional approach to "nonparametric" (with n ≫ m) recovery problems is extremely sketchy, but it captures the most important fact in our context: with the traditional approach to nonparametric signal recovery, one assumes that after representing the signals by the vectors of their coefficients in a properly selected basis, the n-dimensional signal to be recovered can be well approximated by an s-sparse (at most s nonzero entries) signal, with s ≪ n, and this sparse approximation can be obtained by zeroing out all but the first s entries of the signal vector. The assumption just formulated indeed takes place for signals obtained by discretization of smooth uni- and multivariate functions, and this class of signals for several decades was the main, if not the only, focus of Nonparametric Statistics.

Compressed Sensing. The situation changed dramatically around the year 2000 as a consequence of important theoretical breakthroughs due to D. Donoho, T. Tao, J. Romberg, E. Candès, and J.J. Fuchs, among many other researchers [49, 44, 45, 46, 48, 67, 68, 69, 70, 93, 94]; as a result of these breakthroughs, a novel and rich area of research, called Compressed Sensing, emerged. In the Compressed Sensing (CS) setup of the Signal Recovery problem, as in the traditional Nonparametric Statistics approach to the m ≪ n case, it is assumed that after passing to an appropriate basis, the signal to be recovered is s-sparse (has ≤ s nonzero entries, with s ≪ m), or is well approximated by an s-sparse signal. The difference with the traditional approach is that now we assume nothing about the location of the nonzero entries. Thus, the a priori information about the signal x both in the traditional and in the CS settings is summarized in a set X known to contain the signal x we want to recover. The difference is that in the traditional setting, X is a bounded convex and "nice" (well approximated by its low-dimensional cross-sections) set, while in CS this set is, computationally speaking, a "monster": already in the simplest case of recovering exactly s-sparse signals, X is the union of all s-dimensional coordinate planes, which is a heavily combinatorial entity. Note that in many applications we indeed can assume that the true vector of parameters x is sparse. Consider, e.g., the following story about signal detection. There are n locations where signal transmitters could be placed, and m locations with receivers. The contribution of a signal of unit magnitude originating in location j to the signal measured by receiver i is a known quantity A_ij, and signals originating in different locations merely sum up in the receivers. Thus, if x is the n-dimensional vector with entries x_j representing the magnitudes of the signals transmitted in locations j = 1, 2, ..., n, then the m-dimensional vector y of measurements of the m receivers is
Ax + η, where η is the observation noise. Given y, we intend to recover x. Now, if the receivers are, say, hydrophones registering noises emitted by submarines in a certain part of the Atlantic, with the tentative positions of the "submarines" discretized with resolution 500 m, the dimension of the vector x (the number of points in the discretization grid) may be in the range of tens of thousands, if not tens of millions. At the same time, presumably, there is only a handful of "submarines" (i.e., nonzero entries in x) in the area. To "see" sparsity in everyday life, look at the 256 × 256 image at the top of Figure 1.1. The image can be thought of as a 256² = 65,536-dimensional vector comprised of the pixels' intensities in gray scale, and there is not much sparsity in this vector. However, when representing the image in the wavelet basis, whatever it means, we get a "nearly sparse" vector of wavelet coefficients (this is true for typical "non-pathological" images). At the bottom of Figure 1.1 we see what happens when we zero out all but a small percentage of the wavelet coefficients largest in magnitude and replace the true image by its sparse, in the wavelet basis, approximations. This simple visual illustration, along with numerous similar examples, shows the "everyday presence" of sparsity and the possibility of utilizing it when compressing signals. The difficulty, however, is that simple compression (compute the coefficients of the signal in an appropriate basis and then keep, say, 10% of the largest in magnitude coefficients) requires us to start with digitalizing the signal, that is, representing it as an array of all its coefficients in some orthonormal basis. These coefficients are inner products of the signal with the vectors of the basis; for a "physical" signal, like speech or image, these inner products are computed by analog devices, with subsequent discretization of the results.
After the measurements are discretized, processing the signal (denoising, compression, storing, etc.) can be fully computerized. The major (to some extent, already actualized) advantage of Compressed Sensing is in the possibility of reducing the "analog effort" in the outlined process: instead of computing in analog fashion n linear forms of the n-dimensional signal x (its coefficients in a basis), we use an analog device to compute m ≪ n other linear forms of the signal and then use the signal's sparsity in a basis known to us in order to recover the signal reasonably well from these m observations. In our "picture illustration" this technology would work (in fact, works; it is called the "single pixel camera" [83]; see Figure 1.2) as follows: in reality, the digital 256 × 256 image at the top of Figure 1.1 was obtained by an analog device, a digital camera which gets on input an analog signal (light of varying intensity along the field of view caught by the camera's lens) and discretizes the light's intensity in every pixel to get the digitalized image. We then can compute the wavelet coefficients of the digitalized image, compress its representation by keeping, say, just 10% of the leading coefficients, etc., but "the damage is already done": we have already spent our analog resources to get the entire digitalized image. The technology utilizing Compressed Sensing would work as follows: instead of measuring and discretizing the light intensity in each of the 65,536 pixels, we compute (using an analog device) the integral, taken over the field of view, of the product of the light intensity and an analog-generated "mask." We repeat this for, say, 20,000 different masks, thus obtaining measurements of 20,000 linear forms of our 65,536-dimensional signal. Next we utilize, via the Compressed Sensing machinery, the signal's sparsity in the wavelet basis in order to recover the signal from these 20,000 measurements. With this approach, we reduce the "analog component" of the signal processing effort,
Figure 1.1: Top: true 256 × 256 image; bottom: sparse in the wavelet basis approximations of the image, keeping 1% (97.83% of energy), 5% (99.51%), 10% (99.82%), and 25% (99.97%) of the leading wavelet coefficients. The wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the "energy data" presented in the figure.
Figure 1.2: Single-pixel camera.
at the price of increasing the "computerized component" of the effort (instead of a ready-to-use digitalized image directly given by 65,536 analog measurements, we need to recover the image by applying computationally nontrivial decoding algorithms to our 20,000 "indirect" measurements). When taking pictures with your camera or iPad, the game is not worth the candle: the analog component of taking usual pictures is cheap enough, and decreasing it at the cost of nontrivial decoding of the digitalized measurements would be counterproductive. There are, however, important applications where the advantages stemming from the reduced "analog effort" significantly outweigh the drawbacks caused by the necessity to use nontrivial computerized decoding [96, 176].

1.1.3 Compressed Sensing via ℓ1 minimization: Motivation

1.1.3.1 Preliminaries
In principle there is nothing surprising in the fact that under reasonable assumptions on the m × n sensing matrix A we may hope to recover from noisy observations of Ax an s-sparse signal x, with s ≪ m. Indeed, assume for the sake of simplicity that there are no observation errors, and let Col_j[A] be the j-th column of A. If we knew the locations j_1 < j_2 < ... < j_s of the nonzero entries of x, identifying x could be reduced to solving the system of linear equations

∑_{ℓ=1}^s x_{j_ℓ} Col_{j_ℓ}[A] = y

with m equations and s ≪ m unknowns; assuming every s columns of A to be linearly independent (a quite unrestrictive assumption on a matrix with m ≥ s rows), the solution to the above system is unique, and is exactly the signal we are looking for. Of course, the assumption that we know the locations of nonzeros in x makes the recovery problem completely trivial. However, it suggests the following course of action: given the noiseless observation y = Ax of an s-sparse signal x, let us solve the
combinatorial optimization problem

min_z { ‖z‖_0 : Az = y },  (1.2)
where ‖z‖_0 is the number of nonzero entries of z. Clearly, the problem has a solution with the value of the objective at most s. Moreover, it is immediately seen that if every 2s columns of A are linearly independent (which again is a very unrestrictive assumption on the matrix A provided that m ≥ 2s), then the true signal x is the unique optimal solution to (1.2). What was said so far can be extended to the case of noisy observations and "nearly s-sparse" signals x. For example, assuming that the observation error is "uncertain-but-bounded," specifically, that some known norm ‖·‖ of this error does not exceed a given ε > 0, and that the true signal is s-sparse, we could solve the combinatorial optimization problem

min_z { ‖z‖_0 : ‖Az − y‖ ≤ ε }.  (1.3)
Assuming that every m × 2s submatrix Ā of A not only has linearly independent columns (i.e., a trivial kernel), but also is reasonably well conditioned,

‖Āw‖ ≥ C^{−1}‖w‖_2 for all (2s)-dimensional vectors w,

with some constant C, it is immediately seen that the true signal x underlying the observation and the optimal solution x̂ of (1.3) are close to each other within accuracy of order of ε: ‖x − x̂‖_2 ≤ 2Cε. It is easily seen that the resulting error bound is basically as good as it could be.
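Problems of type (1.2) can, in principle, be attacked by enumerating supports of growing cardinality and solving a least-squares problem for each; the sketch below (our illustrative helper l0_recover, not the book's code) does exactly that on a tiny noiseless instance, and also computes the support counts that make this approach hopeless at scale:

```python
import itertools
import math
import numpy as np

def l0_recover(A, y, tol=1e-9):
    """Brute-force solution of (1.2): min ||z||_0 s.t. Az = y.
    Enumerates candidate supports of growing cardinality and, for each,
    tests solvability of the restricted system by least squares."""
    m, n = A.shape
    for s in range(n + 1):
        for support in itertools.combinations(range(n), s):
            if s == 0:
                if np.linalg.norm(y) <= tol:
                    return np.zeros(n)
                continue
            cols = A[:, list(support)]
            zs, *_ = np.linalg.lstsq(cols, y, rcond=None)
            if np.linalg.norm(cols @ zs - y) <= tol:
                z = np.zeros(n)
                z[list(support)] = zs
                return z
    return None

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 10))
x = np.zeros(10)
x[[2, 7]] = [1.5, -2.0]          # 2-sparse signal, noiseless data
z = l0_recover(A, A @ x)         # recovers x on this tiny instance

# the enumeration blows up combinatorially: C(100, 5) ~ 7.53e7
# candidate supports of cardinality 5, and C(200, 20) ~ 1.61e27
counts = (math.comb(100, 5), math.comb(200, 20))
```

The counts reproduce the orders of magnitude that rule out brute force beyond toy sizes.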
We see that the difficulties with recovering sparse signals stem not from a lack of information; they are of a purely computational nature: (1.2) is a difficult combinatorial problem. As far as known theoretical complexity guarantees are concerned, they are no better than a "brute force" search through all guesses on where the nonzeros of x are located: inspecting first the only option that there are no nonzeros in x at all, then the n options that there is exactly one nonzero, for every one of the n locations of this nonzero, then the n(n − 1)/2 options that there are exactly two nonzeros, etc., until the current option results in a solvable system of linear equations Az = y in variables z with entries restricted to vanish outside the locations prescribed by the current option. The running time of this "brute force" search, beyond the range of small values of s and n (by far too small to be of any applied interest), is by many orders of magnitude larger than what we can afford in reality.³

³When s = 5 and n = 100, a sharp upper bound on the number of linear systems we should process before termination in the "brute force" algorithm is ≈ 7.53e7: a lot, but perhaps doable. When n = 200 and s = 20, the number of systems to be processed jumps to ≈ 1.61e27, which is by many orders of magnitude beyond our "computational grasp"; we would be unable to carry out that many computations even if the fate of mankind were at stake. And from the perspective of Compressed Sensing, n = 200 still is a completely toy size, 3–4 orders of magnitude less than we would like to handle.

A partial remedy is as follows. Well, if we do not know how to minimize under linear constraints the "bad" objective ‖z‖_0, as in (1.2), let us "approximate" this objective with one which we do know how to minimize. The true objective is separable: ‖z‖_0 = ∑_{j=1}^n ξ(z_j), where ξ(s) is the function on the axis equal to 0 at the origin and equal to 1 otherwise. As a matter of fact, the separable functions which
we do know how to minimize under linear constraints are the sums of convex functions of z_1, ..., z_n. The most natural candidate for the role of a convex approximation of ξ(s) is |s|; with this approximation, (1.2) converts into the ℓ1 minimization problem

min_z { ‖z‖_1 := ∑_{j=1}^n |z_j| : Az = y },  (1.4)

and (1.3) becomes the convex optimization problem

min_z { ‖z‖_1 : ‖Az − y‖ ≤ ε }.  (1.5)
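Problem (1.4) is indeed an explicit LP, and small instances can be solved with off-the-shelf solvers. Below is a sketch (our own helper l1_recover built on scipy.optimize.linprog, not the book's code) of the standard reformulation min ∑_j t_j subject to −t ≤ z ≤ t, Az = y:

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, y):
    """Solve problem (1.4): min ||z||_1 s.t. Az = y, as the LP
    min sum(t) s.t. z - t <= 0, -z - t <= 0, Az = y, in variables (z, t)."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum of t_j
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])            # z - t <= 0 and -z - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])         # Az = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=(None, None))              # z, t free (t >= 0 is implied)
    return res.x[:n]

# for a random Gaussian sensing matrix, l1 minimization typically
# recovers a sufficiently sparse signal exactly from noiseless data
rng = np.random.default_rng(0)
A = rng.standard_normal((25, 40))
x = np.zeros(40)
x[[3, 17]] = [1.0, -2.0]                            # 2-sparse signal
z = l1_recover(A, A @ x)
```

Note that z is always feasible (Az = y) with ‖z‖_1 ≤ ‖x‖_1, since x itself is feasible; exact recovery z = x is what the theory developed below is about.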
Both problems are efficiently solvable, which is nice; the question, however, is how relevant these problems are in our context: whether it is true that they do recover the "true" s-sparse signals in the noiseless case, or "nearly recover" these signals when the observation error is small. Since we want to be able to handle any s-sparse signal, the validity of ℓ1 recovery, that is, its ability to recover well every s-sparse signal, depends solely on the sensing matrix A. Our current goal is to understand which sensing matrices are "good" in this respect.
1.2 VALIDITY OF SPARSE SIGNAL RECOVERY VIA ℓ1 MINIMIZATION

What follows is based on the standard basic results of Compressed Sensing theory originating from [19, 49, 45, 44, 46, 47, 48, 67, 69, 70, 93, 94, 232] and augmented by the results of [129, 130, 132, 133].⁴

1.2.1 Validity of ℓ1 minimization in the noiseless case
The minimal requirement on the sensing matrix A which makes ℓ1 minimization valid is to guarantee the correct recovery of exactly s-sparse signals in the noiseless case, and we start with investigating this property.

1.2.1.1 Notational convention
From now on, for a vector x ∈ R^n,
• I_x = {j : x_j ≠ 0} stands for the support of x; we also set I_x^+ = {j : x_j > 0} and I_x^− = {j : x_j < 0} [so that I_x = I_x^+ ∪ I_x^−];
• for a subset I of the index set {1, ..., n}, x_I stands for the vector obtained from x by zeroing out the entries with indices not in I, and I^o for the complement of I: I^o = {i ∈ {1, ..., n} : i ∉ I};
• for s ≤ n, x^s stands for the vector obtained from x by zeroing out all but the s entries largest in magnitude.⁵ Note that x^s is the best s-sparse approximation of x in all ℓ_p norms, 1 ≤ p ≤ ∞;
• for s ≤ n and p ∈ [1, ∞], we set ‖x‖_{s,p} = ‖x^s‖_p; note that ‖·‖_{s,p} is a norm.

⁴In fact, in the latter source, an extension of sparsity, the so-called block sparsity, is considered; in what follows, we restrict the results of [130] to the case of plain sparsity.

⁵Note that in general x^s is not uniquely defined by x and s, since the s-th largest among the magnitudes of the entries of x can be achieved at several entries. In our context, it does not matter how ties of this type are resolved; for the sake of definiteness, we can assume that when ordering the entries of x according to their magnitudes, from the largest to the smallest, entries of equal magnitude are ordered in the order of their indices.

1.2.1.2 s-Goodness
Definition of s-goodness. Let us say that an m × n sensing matrix A is s-good if, whenever the true signal x underlying noiseless observations is s-sparse, this signal will be recovered exactly by ℓ1 minimization. In other words, A is s-good if, whenever y in (1.4) is of the form y = Ax with s-sparse x, x is the unique optimal solution to (1.4).

Nullspace property. There is a simple-looking necessary and sufficient condition for a sensing matrix A to be s-good: the nullspace property originating from [70]. After this property is guessed, it is easy to see that it indeed is necessary and sufficient for s-goodness; we, however, prefer to derive this condition from "first principles," which can easily be done via Convex Optimization. Thus, in the case in question, as in many other cases, there is no necessity to be smart to arrive at the truth via a "lucky guess"; it suffices to be knowledgeable and use the standard tools. Let us start with a necessary condition for A to be such that whenever x is s-sparse, x is an optimal solution (perhaps not the unique one) of the optimization problem

min_z { ‖z‖_1 : Az = Ax };  (P[x])

we refer to the latter property of A as weak s-goodness. Our first observation is as follows:

Proposition 1.1. If A is weakly s-good, then the following condition holds true: whenever I is a subset of {1, ..., n} of cardinality ≤ s, we have

∀w ∈ Ker A: ‖w_I‖_1 ≤ ‖w_{I^o}‖_1.  (1.6)
Proof is immediate. Assume A is weakly s-good, and let us verify (1.6). Let I be an s-element subset of {1, ..., n}, and let x be an s-sparse vector with support I. Since A is weakly s-good, x is an optimal solution to (P[x]). Rewriting the latter problem in the form of an LP, that is, as

min_{z,t} { ∑_j t_j : t_j + z_j ≥ 0, t_j − z_j ≥ 0, Az = Ax },

and invoking the LP optimality conditions, the necessary and sufficient condition for
z = x to be the z-component of an optimal solution is the existence of λ_j^+, λ_j^−, µ ∈ R^m (Lagrange multipliers for the constraints t_j − z_j ≥ 0, t_j + z_j ≥ 0, and Az = Ax, respectively) such that

(a) λ_j^+ + λ_j^− = 1 ∀j,
(b) λ^+ − λ^− + A^T µ = 0,
(c) λ_j^+ (|x_j| − x_j) = 0 ∀j,
(d) λ_j^− (|x_j| + x_j) = 0 ∀j,
(e) λ_j^+ ≥ 0 ∀j,
(f) λ_j^− ≥ 0 ∀j.  (1.7)
From (c, d), we have λ_j^+ = 1, λ_j^− = 0 for j ∈ I_x^+ and λ_j^+ = 0, λ_j^− = 1 for j ∈ I_x^−. From (a) and the nonnegativity of λ_j^± it follows that for j ∉ I_x we should have −1 ≤ λ_j^+ − λ_j^− ≤ 1. With this in mind, the above optimality conditions admit eliminating the λ's and reduce to the following conclusion:

(!) x is an optimal solution to (P[x]) if and only if there exists a vector µ ∈ R^m such that the j-th entry of A^T µ is −1 if x_j > 0, +1 if x_j < 0, and a real from [−1, 1] if x_j = 0.

Now let w ∈ Ker A be a vector with the same signs of the entries w_i, i ∈ I, as those of the entries of x. Then

0 = µ^T Aw = [A^T µ]^T w = ∑_j [A^T µ]_j w_j
⇒ ∑_{j∈I_x} |w_j| = −∑_{j∈I_x} [A^T µ]_j w_j = ∑_{j∉I_x} [A^T µ]_j w_j ≤ ∑_{j∉I_x} |w_j|

(we have used the fact that [A^T µ]_j = −sign x_j = −sign w_j for j ∈ I_x and |[A^T µ]_j| ≤ 1 for all j). Since I can be an arbitrary s-element subset of {1, ..., n} and the pattern of signs of an s-sparse vector x supported on I can be arbitrary, (1.6) holds true. ✷

1.2.1.3 Nullspace property
In fact, it can be shown that (1.6) is not only necessary, but also sufficient, for weak s-goodness of A; we, however, skip this verification, since our goal so far was to guess the condition for s-goodness, and this goal has already been achieved: from what we already know it immediately follows that a necessary condition for s-goodness is for the inequality in (1.6) to be strict whenever w ∈ Ker A is nonzero. Indeed, we already know that if A is s-good, then for every I of cardinality s and every nonzero w ∈ Ker A it holds that ‖w_I‖_1 ≤ ‖w_{I^o}‖_1. If the latter inequality for some I and w in question holds as an equality, then A clearly is not s-good, since the s-sparse signal x = w_I is not the unique optimal solution to (P[x]): the vector −w_{I^o} is a different feasible solution to the same problem with the same value of the objective. We conclude that for A to be s-good, a necessary condition is

∀(0 ≠ w ∈ Ker A, I, Card(I) ≤ s): ‖w_I‖_1 < ‖w_{I^o}‖_1.
By the standard compactness argument, this is the same as the existence of γ ∈ (0, 1) such that

∀(w ∈ Ker A, I, Card(I) ≤ s): ‖w_I‖₁ ≤ γ‖w_{I^o}‖₁,

or, which is the same, the existence of κ ∈ (0, 1/2) such that

∀(w ∈ Ker A, I, Card(I) ≤ s): ‖w_I‖₁ ≤ κ‖w‖₁.

Finally, the supremum of ‖w_I‖₁ over I of cardinality s is the norm ‖w‖_{s,1} (the sum of the s largest magnitudes of entries) of w, so that the condition we are processing can finally be formulated as

∃κ ∈ (0, 1/2): ‖w‖_{s,1} ≤ κ‖w‖₁ ∀w ∈ Ker A. (1.8)
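Condition (1.8) can be probed numerically: computing ‖w‖_{s,1} is a one-liner, and sampling Ker A may exhibit a violating w, although failure to find one certifies nothing. A minimal sketch in Python (the function names are ours, numpy/scipy assumed):

```python
import numpy as np
from scipy.linalg import null_space

def norm_s1(w, s):
    """||w||_{s,1}: the sum of the s largest magnitudes of entries of w."""
    return float(np.sort(np.abs(w))[::-1][:s].sum())

def find_nullspace_violator(A, s, kappa, trials=2000, seed=0):
    """Randomized search for w in Ker A with ||w||_{s,1} > kappa * ||w||_1.
    Finding one disproves (1.8) for this kappa; finding none proves nothing."""
    rng = np.random.default_rng(seed)
    basis = null_space(A)                 # orthonormal basis of Ker A
    for _ in range(trials):
        w = basis @ rng.standard_normal(basis.shape[1])
        if norm_s1(w, s) > kappa * np.abs(w).sum() + 1e-12:
            return w
    return None
```

Such a search can only refute the nullspace condition; certifying it is the subject of Section 1.3.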
The resulting nullspace condition is in fact necessary and sufficient for A to be s-good:

Proposition 1.2. Condition (1.8) is necessary and sufficient for A to be s-good.

Proof. We have already seen that the nullspace condition is necessary for s-goodness. To verify sufficiency, let A satisfy the nullspace condition, and let us prove that A is s-good. Indeed, let x be an s-sparse vector, and y an optimal solution to (P[x]); all we need is to prove that y = x. Let I be the support of x, and w = y − x, so that w ∈ Ker A. By the nullspace property we have

‖w_I‖₁ ≤ κ‖w‖₁ = κ[‖w_I‖₁ + ‖w_{I^o}‖₁] = κ[‖w_I‖₁ + ‖y_{I^o}‖₁]
⇒ ‖w_I‖₁ ≤ κ/(1−κ) ‖y_{I^o}‖₁
⇒ ‖x‖₁ = ‖x_I‖₁ = ‖y_I − w_I‖₁ ≤ ‖y_I‖₁ + κ/(1−κ) ‖y_{I^o}‖₁ ≤ ‖y_I‖₁ + ‖y_{I^o}‖₁ = ‖y‖₁,

where the concluding ≤ is due to κ ∈ [0, 1/2). Since x is a feasible, and y is an optimal, solution to (P[x]), the resulting inequality ‖x‖₁ ≤ ‖y‖₁ must be an equality, which, again due to κ ∈ [0, 1/2), is possible only when y_{I^o} = 0. Thus, y has the same support I as x, and w = x − y ∈ Ker A is supported on the s-element set I; by the nullspace property, we should have ‖w_I‖₁ ≤ κ‖w‖₁ = κ‖w_I‖₁, which is possible only when w = 0. ✷

1.2.2
Imperfect ℓ1 minimization
We have found a necessary and sufficient condition for ℓ1 minimization to recover exactly s-sparse signals in the noiseless case. More often than not, both these assumptions are violated: instead of s-sparse signals, we should speak about "nearly s-sparse" ones, quantifying the deviation from sparsity by the distance from the signal x underlying the observations to its best s-sparse approximation x^s. Similarly, we should allow for nonzero observation noise. With noisy observations and/or imperfect sparsity, we cannot hope to recover the signal exactly. All we may hope for is to recover it with some error depending on the level of observation noise and the "deviation from s-sparsity," and tending to zero as the level and deviation tend to 0. We are about to quantify the nullspace property to allow for an instructive "error analysis."
1.2.2.1
Contrast matrices and quantifications of the nullspace property
By itself, the nullspace property says something about the signals from the kernel of the sensing matrix. We can reformulate it equivalently to say something important about all signals. Namely, observe that given sparsity s and κ ∈ (0, 1/2), the nullspace property

‖w‖_{s,1} ≤ κ‖w‖₁ ∀w ∈ Ker A (1.9)

is satisfied if and only if for a properly selected constant C one has⁶

‖w‖_{s,1} ≤ C‖Aw‖₂ + κ‖w‖₁ ∀w. (1.10)
Indeed, (1.10) clearly implies (1.9); to get the inverse implication, note that for every h orthogonal to Ker A it holds ‖Ah‖₂ ≥ σ‖h‖₂, where σ > 0 is the minimal positive singular value of A. Now, given w ∈ R^n, we can decompose w into the sum of w̄ ∈ Ker A and h ∈ (Ker A)^⊥, so that

‖w‖_{s,1} ≤ ‖w̄‖_{s,1} + ‖h‖_{s,1} ≤ κ‖w̄‖₁ + √s ‖h‖_{s,2} ≤ κ[‖w‖₁ + ‖h‖₁] + √s ‖h‖₂
≤ κ‖w‖₁ + [κ√n + √s]‖h‖₂ ≤ σ⁻¹[κ√n + √s] ‖Aw‖₂ + κ‖w‖₁

(note that Ah = Aw), as required in (1.10), with C = σ⁻¹[κ√n + √s].
Condition Q₁(s, κ). For our purposes, it is convenient to present condition (1.10) in the following flexible form:

‖w‖_{s,1} ≤ s‖H^T Aw‖ + κ‖w‖₁, (1.11)
where H is an m × N contrast matrix and ‖·‖ is some norm on R^N. Whenever a pair (H, ‖·‖), called a contrast pair, satisfies (1.11), we say that (H, ‖·‖) satisfies condition Q₁(s, κ). From what we have seen,

If A possesses the nullspace property with some sparsity level s and some κ ∈ (0, 1/2), then there are many ways to select pairs (H, ‖·‖) satisfying Q₁(s, κ), e.g., to take H = CI_m with appropriately large C and ‖·‖ = ‖·‖₂.

Conditions Q_q(s, κ). As we will see in a while, it makes sense to embed the condition Q₁(s, κ) into a parametric family of conditions Q_q(s, κ), where the parameter q runs through [1, ∞]. Specifically,

Given an m × n sensing matrix A, sparsity level s ≤ n, and κ ∈ (0, 1/2), we say that an m × N matrix H and a norm ‖·‖ on R^N satisfy condition Q_q(s, κ) if

‖w‖_{s,q} ≤ s^{1/q}‖H^T Aw‖ + κ s^{1/q − 1}‖w‖₁ ∀w ∈ R^n. (1.12)

Let us make two immediate observations on relations between the conditions:

A. When a pair (H, ‖·‖) satisfies condition Q_q(s, κ), the pair also satisfies all conditions Q_{q′}(s, κ) with 1 ≤ q′ ≤ q.

⁶ Note that (1.9) is exactly the φ²(s, κ)-Compatibility condition of [231] with φ(s, κ) = C/√s; see also [232] for the analysis of relationships of this condition with other assumptions (e.g., a similar Restricted Eigenvalue assumption of [20]) used to analyse ℓ1 minimization procedures.
Indeed, in the situation in question, for 1 ≤ q′ ≤ q it holds

‖w‖_{s,q′} ≤ s^{1/q′ − 1/q}‖w‖_{s,q} ≤ s^{1/q′ − 1/q}[s^{1/q}‖H^T Aw‖ + κ s^{1/q − 1}‖w‖₁] = s^{1/q′}‖H^T Aw‖ + κ s^{1/q′ − 1}‖w‖₁,

where the first inequality is the standard inequality between ℓ_p norms of the s-dimensional vector w^s.

B. When a pair (H, ‖·‖) satisfies condition Q_q(s, κ) and 1 ≤ s′ ≤ s, the pair ((s/s′)^{1/q} H, ‖·‖) satisfies the condition Q_q(s′, κ).

Indeed, in the situation in question we clearly have for 1 ≤ s′ ≤ s:

‖w‖_{s′,q} ≤ ‖w‖_{s,q} ≤ (s′)^{1/q}‖[(s/s′)^{1/q} H]^T Aw‖ + κ s^{1/q − 1}‖w‖₁ ≤ (s′)^{1/q}‖[(s/s′)^{1/q} H]^T Aw‖ + κ (s′)^{1/q − 1}‖w‖₁.
1.2.3
Regular ℓ1 recovery
Given the observation scheme (1.1) with an m × n sensing matrix A, we define the regular ℓ1 recovery of x via observation y as

x̂_reg(y) ∈ Argmin_u {‖u‖₁ : ‖H^T(Au − y)‖ ≤ ρ}, (1.13)
where the contrast matrix H ∈ R^{m×N}, the norm ‖·‖ on R^N, and ρ > 0 are parameters of the construction. The role of the Q-conditions we have introduced is clear from the following
Theorem 1.3. Let s be a positive integer, q ∈ [1, ∞], and κ ∈ (0, 1/2). Assume that a pair (H, ‖·‖) satisfies the condition Q_q(s, κ) associated with A, and let

Ξ_ρ = {η : ‖H^T η‖ ≤ ρ}. (1.14)
Then for all x ∈ R^n and η ∈ Ξ_ρ one has

‖x̂_reg(Ax + η) − x‖_p ≤ [4(2s)^{1/p}/(1 − 2κ)] [ρ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q. (1.15)
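In computational terms, with ‖·‖ = ‖·‖_∞ the recovery (1.13) is a linear program in (u, t): minimize Σ_j t_j subject to −t ≤ u ≤ t and −ρ ≤ H^T(Au − y) ≤ ρ. A minimal sketch (our naming; scipy's HiGHS solver assumed):

```python
import numpy as np
from scipy.optimize import linprog

def regular_l1(y, A, H, rho):
    """Regular l1 recovery (1.13) with ||.||_inf as the contrast norm:
        min_u ||u||_1  s.t.  ||H^T (A u - y)||_inf <= rho,
    written as an LP in (u, t) with |u_j| <= t_j."""
    n = A.shape[1]
    G = H.T @ A                      # N x n
    b = H.T @ y                      # N
    N = G.shape[0]
    I = np.eye(n)
    c = np.concatenate([np.zeros(n), np.ones(n)])      # minimize sum(t)
    A_ub = np.block([[ I, -I],                         #  u - t <= 0
                     [-I, -I],                         # -u - t <= 0
                     [ G, np.zeros((N, n))],           #  G u <= rho + b
                     [-G, np.zeros((N, n))]])          # -G u <= rho - b
    b_ub = np.concatenate([np.zeros(2 * n), rho + b, rho - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * n), method="highs")
    return res.x[:n]
```

In the noiseless case with ρ → 0, this reduces to the classical ℓ1-minimization program (P[x]).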
The above result can be slightly strengthened by replacing the assumption that (H, ‖·‖) satisfies Q_q(s, κ) with some κ < 1/2 with a weaker one (by observation A from Section 1.2.2.1): that (H, ‖·‖) satisfies Q₁(s, κ) with κ < 1/2 and satisfies Q_q(s, ϰ) with some (perhaps large) ϰ:

Theorem 1.4. Given A, an integer s > 0, and q ∈ [1, ∞], assume that (H, ‖·‖) satisfies the condition Q₁(s, κ) with κ < 1/2 and the condition Q_q(s, ϰ) with some ϰ ≥ κ, and let Ξ_ρ be given by (1.14). Then for all x ∈ R^n and η ∈ Ξ_ρ it holds:

‖x̂_reg(Ax + η) − x‖_p ≤ [4(2s)^{1/p} [1 + ϰ − κ]^{q(p−1)/(p(q−1))}/(1 − 2κ)] [ρ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q. (1.16)
For proofs of Theorems 1.3 and 1.4, see Section 1.5.1. Before commenting on the above results, let us present their alternative versions.
1.2.4
Penalized ℓ1 recovery
Penalized ℓ1 recovery of signal x from its observation (1.1) is

x̂_pen(y) ∈ Argmin_u {‖u‖₁ + λ‖H^T(Au − y)‖}, (1.17)
where H ∈ R^{m×N}, a norm ‖·‖ on R^N, and a positive real λ are parameters of the construction.

Theorem 1.5. Given A, a positive integer s, and q ∈ [1, ∞], assume that (H, ‖·‖) satisfies the conditions Q_q(s, ϰ) and Q₁(s, κ) with κ < 1/2 and ϰ ≥ κ. Then
(i) Let λ ≥ 2s. Then for all x ∈ R^n, y ∈ R^m it holds:

‖x̂_pen(y) − x‖_p ≤ [4λ^{1/p}/(1 − 2κ)] [1 + ϰλ/(2s) − κ]^{q(p−1)/(p(q−1))} [‖H^T(Ax − y)‖ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q. (1.18)

In particular, with λ = 2s we have:

‖x̂_pen(y) − x‖_p ≤ [4(2s)^{1/p}/(1 − 2κ)] [1 + ϰ − κ]^{q(p−1)/(p(q−1))} [‖H^T(Ax − y)‖ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q. (1.19)

(ii) Let ρ ≥ 0, and let Ξ_ρ be given by (1.14). Then for all x ∈ R^n and all η ∈ Ξ_ρ one has:

λ ≥ 2s ⇒ ‖x̂_pen(Ax + η) − x‖_p ≤ [4λ^{1/p}/(1 − 2κ)] [1 + ϰλ/(2s) − κ]^{q(p−1)/(p(q−1))} [ρ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q;
λ = 2s ⇒ ‖x̂_pen(Ax + η) − x‖_p ≤ [4(2s)^{1/p}/(1 − 2κ)] [1 + ϰ − κ]^{q(p−1)/(p(q−1))} [ρ + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q. (1.20)
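With ‖·‖ = ‖·‖_∞, the penalized recovery (1.17) is again a linear program, now in (u, t, τ) with τ standing for ‖H^T(Au − y)‖_∞. A minimal sketch (our naming, scipy assumed):

```python
import numpy as np
from scipy.optimize import linprog

def penalized_l1(y, A, H, lam):
    """Penalized l1 recovery (1.17) with ||.||_inf as the contrast norm:
        min_u ||u||_1 + lam * ||H^T (A u - y)||_inf,
    written as an LP in (u, t, tau) with |u_j| <= t_j and
    |[H^T (A u - y)]_k| <= tau."""
    n = A.shape[1]
    G = H.T @ A
    b = H.T @ y
    N = G.shape[0]
    I = np.eye(n)
    c = np.concatenate([np.zeros(n), np.ones(n), [lam]])
    A_ub = np.block([[ I, -I, np.zeros((n, 1))],                #  u - t <= 0
                     [-I, -I, np.zeros((n, 1))],                # -u - t <= 0
                     [ G, np.zeros((N, n)), -np.ones((N, 1))],  #  G u - tau <= b
                     [-G, np.zeros((N, n)), -np.ones((N, 1))]]) # -G u - tau <= -b
    b_ub = np.concatenate([np.zeros(2 * n), b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * n) + [(0, None)], method="highs")
    return res.x[:n]
```

Per Theorem 1.5, one would call it with lam = 2s (or any lam ≥ 2s).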
For proof, see Section 1.5.2.

1.2.5
Discussion
Some remarks are in order.

A. Qualitatively speaking, Theorems 1.3, 1.4, and 1.5 say the same thing: when Q-conditions are satisfied, the regular and penalized recoveries reproduce the true signal exactly when there is no observation noise and the signal is s-sparse. In the presence of observation error η and imperfect sparsity, the signal is recovered within an error which can be upper-bounded by the sum of two terms, one proportional to the magnitude of the observation noise and one proportional to the deviation ‖x − x^s‖₁ of the signal from s-sparse ones. In the penalized recovery, the observation error is measured in the scale given by the contrast matrix and the norm ‖·‖ (as ‖H^T η‖), and in the regular recovery by an a priori upper bound ρ on ‖H^T η‖; when ρ ≥ ‖H^T η‖, η belongs to Ξ_ρ and thus the bounds (1.15) and (1.16) are applicable to the actual observation error η. Clearly, in qualitative terms, an error bound of this type is the best we may hope for. Now let us look at the quantitative aspect. Assume that in the regular recovery we use ρ ≈ ‖H^T η‖, and in the penalized one λ = 2s. In this case, the error bounds (1.15), (1.16), and (1.20), up to factors C depending solely on κ and ϰ, are the same, specifically,

‖x̂ − x‖_p ≤ C s^{1/p} [‖H^T η‖ + ‖x − x^s‖₁/s], 1 ≤ p ≤ q. (!)
Is this error bound bad or good? The answer depends on many factors, including how well we select H and ‖·‖. To get some orientation, consider the trivial case of direct observations, where the matrix A is square and, moreover, proportional to the unit matrix: A = αI. Let us assume in addition that x is exactly s-sparse. In this case, the simplest way to ensure condition Q_q(s, κ), even with κ = 0, is to take ‖·‖ = ‖·‖_{s,q} and H = s^{−1/q}α^{−1}I, so that (!) becomes

‖x̂ − x‖_p ≤ Cα^{−1} s^{1/p − 1/q} ‖η‖_{s,q}, 1 ≤ p ≤ q. (!!)
As far as the dependence of the bound on the magnitude ‖η‖_{s,q} of the observation noise is concerned, this dependence is as good as it can be: even if we knew in advance the positions of the s entries of x of largest magnitudes, we would be unable to recover x in the q-norm with error ≤ α⁻¹‖η‖_{s,q}. In addition, with the s largest magnitudes of entries in η equal to each other, the ‖·‖_p norm of the recovery error clearly cannot be guaranteed to be less than α⁻¹‖η‖_{s,p} = α⁻¹ s^{1/p − 1/q}‖η‖_{s,q}. Thus, at least for s-sparse signals x, our error bound is basically the best one can get already in the "ideal" case of direct observations.

B. Given that (H, ‖·‖) obeys Q₁(s, κ) with some κ < 1/2, the larger the q such that the pair (H, ‖·‖) obeys the condition Q_q(s, ϰ) with a given ϰ ≥ κ (recall that ϰ can be ≥ 1/2) and s, the larger the range p ≤ q of values of p where the error bounds (1.16) and (1.20) are applicable. This is in full accordance with the fact that if a pair (H, ‖·‖) obeys condition Q_q(s, κ), it also obeys all conditions Q_{q′}(s, κ) with 1 ≤ q′ ≤ q (item A in Section 1.2.2.1).

C. The flexibility offered by the contrast matrix H and the norm ‖·‖ allows us to adjust, to some extent, the recovery to the "geometry of observation errors." For example, when η is "uncertain but bounded," say, when all we know is that ‖η‖₂ ≤ δ with some given δ, all that matters (on top of the requirement for (H, ‖·‖) to obey Q-conditions) is how large ‖H^T η‖ can be when ‖η‖₂ ≤ δ. In particular, when ‖·‖ = ‖·‖₂, the error bound "is governed" by the spectral norm of H. Consequently, if we have a technique allowing us to design H such that (H, ‖·‖₂) obeys Q-condition(s) with given parameters, it makes sense to look for a design with as small a spectral norm of H as possible. In contrast to this, in the case of Gaussian noise most interesting for applications,

y = Ax + η, η ∼ N(0, σ² I_m), (1.21)
looking at the spectral norm of H, with ‖·‖₂ in the role of ‖·‖, is counterproductive, since a typical realization of η has Euclidean norm of order √m σ and thus is quite large when m is large. In this case, quantifying "the magnitude" of H^T η by the product of the spectral norm of H and the Euclidean norm of η is completely misleading: in typical cases, this product will grow rapidly with the number of observations m, completely ignoring the fact that η is random with zero mean.⁷ What is much better suited for the case of Gaussian noise is the ‖·‖_∞ norm in the role of ‖·‖ and the norm of H which is "the maximum of ‖·‖₂ norms of the columns in H," denoted ‖H‖_{1,2}. Indeed, with η ∼ N(0, σ²I_m), the entries of H^T η are Gaussian with zero mean and variance bounded by σ²‖H‖²_{1,2}, so that ‖H^T η‖_∞ is the maximum of the magnitudes of N zero mean Gaussian random variables with standard deviations bounded by σ‖H‖_{1,2}. As a result,

Prob{‖H^T η‖_∞ ≥ ρ} ≤ 2N Erfc(ρ/(σ‖H‖_{1,2})) ≤ N exp{−ρ²/(2σ²‖H‖²_{1,2})}, (1.22)

where

Erfc(s) = Prob_{ξ∼N(0,1)}{ξ ≥ s} = (1/√(2π)) ∫_s^∞ e^{−t²/2} dt

is the (slightly rescaled) complementary error function. It follows that the typical values of ‖H^T η‖_∞, η ∼ N(0, σ²I_m), are of order of at most σ√(ln N) ‖H‖_{1,2}. In the applications we consider in this chapter, we have N = O(m), so that with σ and ‖H‖_{1,2} given, the typical values of ‖H^T η‖_∞ are nearly independent of m. The bottom line is that ℓ1 minimization is capable of handling large-scale Gaussian observation noise incomparably better than "uncertain-but-bounded" observation noise of similar magnitude (measured in Euclidean norm).

⁷ The simplest way to see the difference is to look at a particular entry h^T η of H^T η. Operating with spectral norms, we upper-bound this entry by ‖h‖₂‖η‖₂, and the second factor for η ∼ N(0, σ²I_m) is typically as large as σ√m. This is in sharp contrast to the fact that typical values of h^T η are of order of σ‖h‖₂, independently of what m is!
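The tail bound (1.22) yields an explicit, computable choice of ρ ensuring Prob{η ∈ Ξ_ρ} ≥ 1 − ε, namely ρ = σ ErfcInv(ε/(2N))‖H‖_{1,2}. A minimal sketch (our naming; scipy's `norm.isf` plays the role of ErfcInv in the book's normalization):

```python
import numpy as np
from scipy.stats import norm

def rho_eps(H, sigma, eps):
    """rho such that Prob{||H^T eta||_inf <= rho} >= 1 - eps
    for eta ~ N(0, sigma^2 I_m):
        rho = sigma * ErfcInv(eps / (2N)) * ||H||_{1,2},
    where Erfc(s) = P{N(0,1) >= s}, so ErfcInv = norm.isf."""
    N = H.shape[1]
    col_norm = np.linalg.norm(H, axis=0).max()   # ||H||_{1,2}
    return sigma * norm.isf(eps / (2.0 * N)) * col_norm
```

Note the mild, logarithmic growth of this ρ in N, in contrast to the √m growth of the worst-case Euclidean bound.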
D. As far as the comparison of the regular and penalized ℓ1 recoveries with the same pair (H, ‖·‖) is concerned, the situation is as follows. Assume for the sake of simplicity that (H, ‖·‖) satisfies Q_q(s, κ) with some s and κ < 1/2, and let the observation error be random. Given ε ∈ (0, 1), let

ρ_ε[H, ‖·‖] = min{ρ : Prob{η : ‖H^T η‖ ≤ ρ} ≥ 1 − ε}; (1.23)

this is nothing but the smallest ρ such that

Prob{η ∈ Ξ_ρ} ≥ 1 − ε (1.24)
(see (1.14)), and thus the smallest ρ for which the error bound (1.15) for the regular ℓ1 recovery holds true with probability 1 − ε (or at least the smallest ρ for which the latter claim is supported by Theorem 1.3). With ρ = ρ_ε[H, ‖·‖], the regular ℓ1 recovery guarantees (and that is the best guarantee one can extract from Theorem 1.3) that

(#) For some set Ξ, Prob{η ∈ Ξ} ≥ 1 − ε, of "good" realizations of η ∼ N(0, σ²I_m), one has

‖x̂(Ax + η) − x‖_p ≤ [4(2s)^{1/p}/(1 − 2κ)] [ρ_ε[H, ‖·‖] + ‖x − x^s‖₁/(2s)], 1 ≤ p ≤ q, (1.25)

whenever x ∈ R^n and η ∈ Ξ.

The error bound (1.19) (where we set ϰ = κ) says that (#) holds true for the penalized ℓ1 recovery with λ = 2s. The latter observation suggests that the penalized ℓ1 recovery associated with (H, ‖·‖) and λ = 2s is better than its regular counterpart, the reason being twofold. First, in order to ensure (#) with the regular recovery, the "built-in" parameter ρ of this recovery should be set to ρ_ε[H, ‖·‖], and the latter quantity is not always easy to identify. In contrast to this, the construction of the penalized ℓ1 recovery is completely independent of a priori assumptions on the structure of observation errors, while automatically ensuring (#) for the error model we use. Second, and more importantly, for the penalized recovery the bound (1.25) is no more than the "worst, with confidence 1 − ε, case," while the typical values of the quantity ‖H^T η‖ which indeed participates in the error bound (1.18) may be essentially smaller than ρ_ε[H, ‖·‖]. Numerical experience fully supports the above claim: the difference in observed performance of the two routines in question, although not dramatic, is definitely in favor of the penalized recovery. The only potential disadvantage of the latter routine is that the penalty parameter λ should be tuned to the level s of sparsity we aim at, while the regular recovery is free of any guess of this type. Of course, the "tuning" is rather loose: all we need (and experiments show that we indeed need this) is the relation λ ≥ 2s, so that a rough upper bound on s will do. Note, however, that bound (1.18) deteriorates as λ grows. Finally, we remark that when H is m × N and η ∼ N(0, σ²I_m), we have

ρ_ε[H, ‖·‖_∞] ≤ σ ErfcInv(ε/(2N))‖H‖_{1,2} ≤ σ√(2 ln(N/ε)) ‖H‖_{1,2}

(see (1.22)); here ErfcInv(δ) is the inverse complementary error function:

Erfc(ErfcInv(δ)) = δ, 0 < δ < 1. (1.26)
How it works. Here we present a small numerical illustration. We observe in Gaussian noise m = n/2 randomly selected terms of an n-element "time series" z = (z₁, ..., z_n) and want to recover this series under the assumption that the series is "nearly s-sparse in the frequency domain," that is, that z = Fx with ‖x − x^s‖₁ ≤ δ, where F is the n × n matrix of the Inverse Discrete Cosine Transform, x^s is the vector obtained from x by zeroing out all but the s entries of largest magnitudes, and δ upper-bounds the distance from x to s-sparse signals. Denoting by A the m × n submatrix of F corresponding to the time instants t where z_t is observed, our observation becomes y = Ax + σξ, where ξ is standard Gaussian noise. After the signal in the frequency domain, that is, x, is recovered by ℓ1 minimization (let the recovery be x̂), we recover the signal in the time domain as ẑ = Fx̂. In Figure 1.3, we present four test signals of different (near-)sparsity, along with their regular and penalized ℓ1 recoveries. The data in Figure 1.3 clearly show how the quality of ℓ1 recovery deteriorates as the number s of "essential nonzeros" of the signal in the frequency domain grows. It is also seen that the penalized recovery meaningfully outperforms the regular one in the range of sparsities up to 64.
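The setup of this experiment is easy to reproduce. The sketch below (our naming; scipy's `idct` assumed) builds the partial-IDCT sensing matrix, an exactly s-sparse frequency-domain signal for simplicity, and the noisy observation, leaving the ℓ1 recovery itself to any LP solver:

```python
import numpy as np
from scipy.fft import idct

def partial_dct_setup(n=512, s=16, sigma=0.01, seed=0):
    """Build the experiment's data: observe m = n/2 randomly selected entries
    of z = F x, with F the n x n orthonormal Inverse Discrete Cosine
    Transform matrix and x s-sparse."""
    rng = np.random.default_rng(seed)
    F = idct(np.eye(n), axis=0, norm="ortho")    # orthonormal IDCT matrix
    rows = rng.choice(n, size=n // 2, replace=False)
    A = F[rows]                                  # m x n sensing matrix
    x = np.zeros(n)
    x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
    y = A @ x + sigma * rng.standard_normal(n // 2)
    return A, x, y
```

Feeding (A, y) to a regular or penalized ℓ1 recovery and then applying F to the result reproduces the time-domain recovery ẑ = Fx̂.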
Top plots: regular ℓ1 recovery; bottom plots: penalized ℓ1 recovery (panels s = 16, 32, 64, 128).

Recovery errors, regular ℓ1 recovery:

            s = 16    s = 32    s = 64    s = 128
‖z − ẑ‖₂    0.2417    0.3871    0.8178    4.8256
‖z − ẑ‖∞    0.0343    0.0514    0.1744    0.8272

Recovery errors, penalized ℓ1 recovery:

            s = 16    s = 32    s = 64    s = 128
‖z − ẑ‖₂    0.1399    0.2385    0.4216    5.3431
‖z − ẑ‖∞    0.0177    0.0362    0.1023    0.9141

Figure 1.3: Regular and penalized ℓ1 recovery of nearly s-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive entries of each vector is shown). Problem sizes are m = 256 and n = 2m = 512, the noise level is σ = 0.01, the deviation from s-sparsity is ‖x − x^s‖₁ = 1, and the contrast pair is (H = √(n/m) A, ‖·‖_∞). In the penalized recovery, λ = 2s; the parameter ρ of the regular recovery is set to σ · ErfcInv(0.005/n).
1.3

VERIFIABILITY AND TRACTABILITY ISSUES
The good news about ℓ1 recovery stated in Theorems 1.3, 1.4, and 1.5 is "conditional": we assume that we are smart enough to point out a pair (H, ‖·‖) satisfying condition Q₁(s, κ) with κ < 1/2 (and condition Q_q(s, ϰ) with a "moderate" ϰ⁸). The related issues are twofold:

1. First, we do not know in which range of s, m, and n these conditions, or even the nullspace property (which is weaker than Q₁(s, κ), κ < 1/2), can be satisfied; and without the nullspace property, ℓ1 minimization becomes useless, at least when we want to guarantee its validity whatever be the s-sparse signal we want to recover;

2. Second, it is unclear how to verify whether a given sensing matrix A satisfies the nullspace property for a given s, or whether a given pair (H, ‖·‖) satisfies the condition Q_q(s, κ) with given parameters.

What is known about these crucial issues can be outlined as follows.

1. It is known that for given m, n with m ≪ n (say, m/n ≤ 1/2), there exist m × n sensing matrices which are s-good for values of s "nearly as large as m," specifically, for s ≤ O(1) m/ln(n/m).⁹ Moreover, there are natural families of matrices where this level of goodness "is a rule." E.g., when drawing an m × n matrix at random from the Gaussian or Rademacher distributions (i.e., when filling the matrix with independent realizations of a random variable which is either a standard (zero mean, unit variance) Gaussian one, or takes values ±1 with probabilities 0.5), the result will be s-good, for the outlined value of s, with probability approaching 1 as m and n grow. All this remains true when instead of speaking about matrices A satisfying "plain" nullspace properties, we speak about matrices A for which it is easy to point out a pair (H, ‖·‖) satisfying the condition Q₂(s, κ) with, say, κ = 1/4. The above results can be considered good news.

The bad news is that we do not know how to check efficiently, given s and a sensing matrix A, that the matrix is s-good, just as we do not know how to check that A admits good (i.e., satisfying Q₁(s, κ) with κ < 1/2) pairs (H, ‖·‖). Even worse: we do not know an efficient recipe allowing us to build, given m, an m × 2m matrix A which is provably s-good for s larger than O(1)√m, which is a much smaller "level of goodness" than the one promised by theory for randomly generated matrices.¹⁰ The "common life" analogy of this situation would be as follows: you know that 90% of the bricks in your wall are made of gold, and at the same time, you do not know how to tell a golden brick from a usual one.

2. There exist verifiable sufficient conditions for s-goodness of a sensing matrix, similarly to verifiable sufficient conditions for a pair (H, ‖·‖) to satisfy condition

⁸ Q_q(s, ϰ) is always satisfied with "large enough" ϰ, e.g., ϰ = s, but such values of ϰ are of no interest: the associated bounds on p-norms of the recovery error are straightforward consequences of the bounds on the ‖·‖₁ norm of this error yielded by the condition Q₁(s, κ).
⁹ Recall that O(1)'s denote positive absolute constants: appropriately chosen numbers like 0.5, or 1, or perhaps 100,000. We could, in principle, replace all O(1)'s with specific numbers; following standard mathematical practice, we do not do it, partly out of laziness, partly because the particular values of these numbers in our context are irrelevant.
¹⁰ Note that the naive algorithm "generate m × 2m matrices at random until an s-good matrix, with s promised by the theory, is generated" is not an efficient recipe, since we still do not know how to check s-goodness efficiently.
Q_q(s, κ). The bad news is that when m ≪ n, these verifiable sufficient conditions can be satisfied only when s ≤ O(1)√m; once again, a much narrower range of values of s than the one where typical randomly selected sensing matrices are s-good. In fact, s = O(√m) is so far the best known sparsity level for which we know individual s-good m × n sensing matrices with m ≤ n/2.

1.3.1
Restricted Isometry Property and s-goodness of random matrices
There are several sufficient conditions for s-goodness, equally difficult to verify, but provably satisfied for typical random sensing matrices. The best known of them is the Restricted Isometry Property (RIP), defined as follows:

Definition 1.6. Let k be an integer and δ ∈ (0, 1). We say that an m × n sensing matrix A possesses the Restricted Isometry Property with parameters δ and k, RIP(δ, k), if for every k-sparse x ∈ R^n one has

(1 − δ)‖x‖₂² ≤ ‖Ax‖₂² ≤ (1 + δ)‖x‖₂².
(1.27)
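RIP is not efficiently verifiable, but (1.27) is easy to probe on random k-sparse vectors: a found violation disproves RIP(δ, k), while the absence of violations proves nothing. A minimal randomized sketch (our naming):

```python
import numpy as np

def rip_counterexample(A, delta, k, trials=1000, seed=0):
    """Search random k-sparse x for a violation of (1.27):
        (1 - delta)||x||_2^2 <= ||Ax||_2^2 <= (1 + delta)||x||_2^2.
    Returns a violating x, or None if none was found (which proves nothing)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    for _ in range(trials):
        x = np.zeros(n)
        x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
        ratio = np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2
        if ratio < 1 - delta or ratio > 1 + delta:
            return x
    return None
```

Exhaustive verification would require examining all (n choose k) supports, which is exactly the intractability discussed below.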
It turns out that for natural ensembles of random m × n matrices, a typical matrix from the ensemble satisfies RIP(δ, k) with small δ and k "nearly as large as m," and that RIP(1/6, 2s) implies the nullspace condition, and more. The simplest versions of the corresponding results are as follows.

Proposition 1.7. Given δ ∈ (0, 1/5], with properly selected positive c = c(δ), d = d(δ), f = f(δ), for all m ≤ n and all positive integers k such that

k ≤ m/(c ln(n/m) + d), (1.28)

the probability for a random m × n matrix A with independent N(0, 1/m) entries to satisfy RIP(δ, k) is at least 1 − exp{−f m}.
For proof, see Section 1.5.3.

Proposition 1.8. Let A ∈ R^{m×n} satisfy RIP(δ, 2s) for some δ < 1/3 and positive integer s. Then
(i) The pair (H = s^{−1/2}(1−δ)^{−1/2} I_m, ‖·‖₂) satisfies the condition Q₂(s, δ/(1−δ)) associated with A;
(ii) The pair (H = (1−δ)^{−1} A, ‖·‖_∞) satisfies the condition Q₂(s, δ/(1−δ)) associated with A.

For proof, see Section 1.5.4.

1.3.2
Verifiable sufficient conditions for Qq (s, κ)
When speaking about verifiable sufficient conditions for a pair (H, ‖·‖) to satisfy Q_q(s, κ), it is convenient to restrict ourselves to the case where H, like A, is an m × n matrix and ‖·‖ = ‖·‖_∞.

Proposition 1.9. Let A be an m × n sensing matrix, and let s ≤ n be a sparsity level.
Given an m × n matrix H and q ∈ [1, ∞], let us set

ν_{s,q}[H] = max_{j≤n} ‖Col_j[I − H^T A]‖_{s,q}, (1.29)

where Col_j[C] is the jth column of matrix C. Then

‖w‖_{s,q} ≤ s^{1/q}‖H^T Aw‖_∞ + ν_{s,q}[H]‖w‖₁ ∀w ∈ R^n, (1.30)

implying that the pair (H, ‖·‖_∞) satisfies the condition Q_q(s, s^{1−1/q} ν_{s,q}[H]).

Proof is immediate. Setting V = I − H^T A, we have

‖w‖_{s,q} = ‖[H^T A + V]w‖_{s,q} ≤ ‖H^T Aw‖_{s,q} + ‖Vw‖_{s,q}
≤ s^{1/q}‖H^T Aw‖_∞ + Σ_j |w_j| ‖Col_j[V]‖_{s,q} ≤ s^{1/q}‖H^T Aw‖_∞ + ν_{s,q}[H]‖w‖₁. ✷
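Unlike RIP or the nullspace property, the quantity ν_{s,q}[H] of (1.29) is directly computable. A minimal sketch (our naming) evaluating it together with the certified κ = s^{1−1/q} ν_{s,q}[H]:

```python
import numpy as np

def nu_sq(H, A, s, q):
    """nu_{s,q}[H] = max_j ||Col_j[I - H^T A]||_{s,q} of (1.29), where
    ||v||_{s,q} is the l_q norm of the s largest magnitudes of entries of v."""
    V = np.eye(A.shape[1]) - H.T @ A
    top = np.sort(np.abs(V), axis=0)[::-1][:s]   # s largest magnitudes per column
    cols = top.max(axis=0) if np.isinf(q) else (top ** q).sum(axis=0) ** (1.0 / q)
    return float(cols.max())

def kappa_from_nu(H, A, s, q):
    """(H, ||.||_inf) satisfies Q_q(s, kappa) with kappa = s^{1-1/q} nu_{s,q}[H]."""
    return s ** (1.0 - 1.0 / q) * nu_sq(H, A, s, q)
```

If the returned κ is below 1/2, the pair (H, ‖·‖_∞) certifies the validity of the ℓ1 recoveries of Section 1.2.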
Observe that the function ν_{s,q}[H] is an efficiently computable convex function of H, so that the set

H^κ_{s,q} = {H ∈ R^{m×n} : ν_{s,q}[H] ≤ s^{1/q − 1} κ} (1.31)

is a computationally tractable convex set. When this set is nonempty for some κ < 1/2, every point H in this set is a contrast matrix such that (H, ‖·‖_∞) satisfies the condition Q_q(s, κ); that is, we can find contrast matrices making ℓ1 minimization valid. Moreover, we can design the contrast matrix, e.g., by minimizing the function ‖H‖_{1,2} over H^κ_{s,q}, thus optimizing the sensitivity of the corresponding ℓ1 recoveries to Gaussian observation noise; see items C, D in Section 1.2.5.

Explanation. The sufficient condition for s-goodness of A stated in Proposition 1.9 looks as if it came out of thin air; in fact it is a particular case of a simple and general construction, as follows. Let f(x) be a real-valued convex function on R^n, and let X ⊂ R^n be a nonempty bounded polytope represented as

X = {x ∈ Conv{g₁, ..., g_N} : Ax = 0},

where Conv{g₁, ..., g_N} = {Σᵢ λᵢgᵢ : λ ≥ 0, Σᵢ λᵢ = 1} is the convex hull of the vectors g₁, ..., g_N. Our goal is to upper-bound the maximum Opt = max_{x∈X} f(x); this is a meaningful problem, since maximizing a convex function over a polyhedron exactly is typically a computationally intractable task. Let us act as follows: clearly, for any matrix H of the same size as A we have max_{x∈X} f(x) = max_{x∈X} f([I − H^T A]x), since on X we have [I − H^T A]x = x. As a result,

Opt := max_{x∈X} f(x) = max_{x∈X} f([I − H^T A]x) ≤ max_{x∈Conv{g₁,...,g_N}} f([I − H^T A]x) = max_{j≤N} f([I − H^T A]g_j).
We get a parametric upper bound on Opt, the parameter being H: the bound max_{j≤N} f([I − H^T A]g_j). This bound is convex in H, and thus is well suited for minimization over this parameter. The result of Proposition 1.9 is inspired by this construction as applied to the
nullspace property: given an m × n sensing matrix A and setting

X = {x ∈ R^n : ‖x‖₁ ≤ 1, Ax = 0} = {x ∈ Conv{±e₁, ..., ±e_n} : Ax = 0}

(the eᵢ are the basic orths in R^n), A is s-good if and only if

Opt_s := max_{x∈X} {f(x) := ‖x‖_{s,1}} < 1/2.

A verifiable sufficient condition for this, as yielded by the above construction, is the existence of an m × n matrix H such that

max_{j≤n} max[f([I_n − H^T A]e_j), f(−[I_n − H^T A]e_j)] < 1/2,

or, which is the same,

max_j ‖Col_j[I_n − H^T A]‖_{s,1} < 1/2.
This observation brings to our attention the matrix I − H^T A, with varying H, and the idea of expressing sufficient conditions for s-goodness and related properties in terms of this matrix.

1.3.3
Tractability of Q∞ (s, κ)
As we have already mentioned, the conditions Q_q(s, κ) are intractable, in the sense that we do not know how to verify whether a given pair (H, ‖·‖) satisfies the condition. Surprisingly, this is not the case with the strongest of these conditions, the one with q = ∞. Namely,

Proposition 1.10. Let A be an m × n sensing matrix, s a sparsity level, and κ ≥ 0. Then whenever a pair (H̄, ‖·‖) satisfies the condition Q_∞(s, κ), there exists an m × n matrix H such that

‖Col_j[I_n − H^T A]‖_{s,∞} = ‖Col_j[I_n − H^T A]‖_∞ ≤ s^{−1}κ, 1 ≤ j ≤ n

(so that (H, ‖·‖_∞) satisfies Q_∞(s, κ) by Proposition 1.9), and also

‖H^T η‖_∞ ≤ ‖H̄^T η‖ ∀η ∈ R^m. (1.32)

In addition, the m × n contrast matrix H such that the pair (H, ‖·‖_∞) satisfies the condition Q_∞(s, κ) with as small a κ as possible can be found as follows. Consider the n LP programs

Opt_i = min_{ν,h} {ν : ‖A^T h − e_i‖_∞ ≤ ν}, (#_i)

where e_i is the ith basic orth of R^n. Let Opt_i, h^i, i = 1, ..., n, be the optimal values and optimal solutions to these problems; we set H = [h¹, ..., hⁿ]; the corresponding value of κ is

κ* = s max_i Opt_i.

Besides this, there exists a transparent alternative description of the quantities Opt_i (and thus of κ*); specifically,

Opt_i = max_x {x_i : ‖x‖₁ ≤ 1, Ax = 0}. (1.33)
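The n LP programs (#_i) are small and can be solved with any LP code. A minimal sketch (our naming; scipy's HiGHS solver assumed) assembling the contrast matrix H = [h¹, ..., hⁿ] and κ* = s·max_i Opt_i:

```python
import numpy as np
from scipy.optimize import linprog

def optimal_contrast(A, s):
    """Build H by solving the n LPs (#_i):
        Opt_i = min_{nu,h} { nu : ||A^T h - e_i||_inf <= nu },
    and return (H, kappa_*) with kappa_* = s * max_i Opt_i."""
    m, n = A.shape
    cols, opts = [], []
    ones = np.ones((n, 1))
    for i in range(n):
        # variables z = [h (m entries), nu]; constraints +-(A^T h - e_i) <= nu
        c = np.concatenate([np.zeros(m), [1.0]])
        A_ub = np.block([[A.T, -ones], [-A.T, -ones]])
        e = np.zeros(n); e[i] = 1.0
        b_ub = np.concatenate([e, -e])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * m + [(0, None)], method="highs")
        cols.append(res.x[:m]); opts.append(res.fun)
    return np.column_stack(cols), s * max(opts)
```

If the returned κ* is below 1/2, the pair (H, ‖·‖_∞) certifies s-goodness of A; by the proposition, no pair satisfying Q_∞(s, ·) can do better.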
For proof, see Section 1.5.5. Taken along with (1.32) and the error bounds of Theorems 1.3, 1.4, and 1.5, Proposition 1.10 says that:

As far as the condition Q_∞(s, κ) is concerned, we lose nothing by restricting ourselves to pairs (H ∈ R^{m×n}, ‖·‖_∞) with contrast matrices H satisfying the condition

|[I_n − H^T A]_{ij}| ≤ s^{−1}κ, (1.34)

implying that (H, ‖·‖_∞) satisfies Q_∞(s, κ). The good news is that (1.34) is an explicit convex constraint on H (in fact, even on H and κ jointly), so that we can solve design problems where we want to optimize a convex function of H under the requirement that (H, ‖·‖_∞) satisfies the condition Q_∞(s, κ) (and, perhaps, additional convex constraints on H and κ).

1.3.3.1
Mutual Incoherence
The simplest (and up to some point in time, the only) verifiable sufficient condition for s-goodness of a sensing matrix A is expressed in terms of the mutual incoherence of A, defined as

μ(A) = max_{i≠j} |Col_i^T[A] Col_j[A]| / ‖Col_i[A]‖₂². (1.35)

This quantity is well defined whenever A has no zero columns (otherwise A is not even 1-good). Note that when A is normalized to have all columns of equal ‖·‖₂ lengths,¹¹ μ(A) is small when the columns of A are nearly mutually orthogonal. The standard related result is that

Whenever A and a positive integer s are such that 2μ(A)/(1 + μ(A)) < 1/s, A is s-good.

It is immediately seen that the latter condition is weaker than what we can get with the aid of (1.34):

Proposition 1.11. Let A be an m × n matrix, and let the columns of the m × n matrix H be given by

Col_j(H) = Col_j(A) / [(1 + μ(A)) ‖Col_j(A)‖₂²], 1 ≤ j ≤ n.

Then

|[I_n − H^T A]_{ij}| ≤ μ(A)/(1 + μ(A)) ∀i, j. (1.36)

¹¹ As far as ℓ1 minimization is concerned, this normalization is nonrestrictive: we can always enforce it by diagonal scaling of the signal underlying observations (1.1), and ℓ1 minimization in scaled variables is the same as weighted ℓ1 minimization in the original variables.
In particular, when 2μ(A)/(1 + μ(A)) < 1/s, A is s-good.
Proof. With H as above, the diagonal entries of I − H^T A are equal to 1 − 1/(1 + μ(A)) = μ(A)/(1 + μ(A)), while by definition of mutual incoherence the magnitudes of the off-diagonal entries of I − H^T A are ≤ μ(A)/(1 + μ(A)) as well, implying (1.36). The "in particular" claim is given by (1.36) combined with Proposition 1.9. ✷
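Mutual incoherence is cheap to compute, and the resulting certificate is immediate: the test 2μ(A)/(1 + μ(A)) < 1/s passes exactly for s < (1 + μ(A))/(2μ(A)). A minimal sketch (our naming):

```python
import numpy as np

def mutual_incoherence(A):
    """mu(A) = max_{i != j} |Col_i[A]^T Col_j[A]| / ||Col_i[A]||_2^2  (1.35)."""
    G = A.T @ A
    R = np.abs(G) / np.diag(G)[:, None]   # row i scaled by ||Col_i||_2^2
    np.fill_diagonal(R, 0.0)
    return float(R.max())

def certified_goodness(A):
    """Largest s certified by the incoherence test 2mu/(1+mu) < 1/s,
    i.e., the largest integer s with s < (1+mu)/(2mu)."""
    mu = mutual_incoherence(A)
    if mu == 0.0:
        return A.shape[1]  # exactly orthogonal columns: the test passes for every s
    return int(np.ceil((1.0 + mu) / (2.0 * mu))) - 1
```

As Proposition 1.13 below makes precise, no certificate of this verifiable kind can go beyond s of order √m when A is essentially nonsquare.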
1.3.3.2
From RIP to conditions Qq (·, κ)
It turns out that when A satisfies RIP(δ, k) and q ≥ 2, it is easy to point out pairs (H, ‖·‖) satisfying Q_q(t, κ) with a desired κ > 0 and properly selected t:

Proposition 1.12. Let A be an m × n sensing matrix satisfying RIP(δ, 2s) with some s and some δ ∈ (0, 1), and let q ∈ [2, ∞] and κ > 0 be given. Then

(i) Whenever a positive integer t satisfies
$$t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac q{q-1}} s^{\frac q{2q-2}},\ s\right], \tag{1.37}$$
the pair $\left(H = \frac{t^{-1/q}}{\sqrt{1-\delta}}\,I_m,\ \|\cdot\|_2\right)$ satisfies Q_q(t, κ);

(ii) Whenever a positive integer t satisfies (1.37), the pair $\left(H = \frac{s^{1/2}t^{-1/q}}{1-\delta}\,A,\ \|\cdot\|_\infty\right)$ satisfies Q_q(t, κ).
For the proof, see Section 1.5.4. The most important consequence of Proposition 1.12 deals with the case q = ∞ and states that when s-goodness of a sensing matrix A can be ensured by the difficult-to-verify condition RIP(δ, 2s) with, say, δ = 0.2, a somewhat worse level of sparsity, t = O(1)√s with a properly selected absolute constant O(1), can be certified via the condition Q_∞(t, 1/3)—there exists a pair (H, ‖·‖_∞) satisfying this condition. The point is that by Proposition 1.10, if the condition Q_∞(t, 1/3) can be satisfied at all, a pair (H, ‖·‖_∞) satisfying it can be found efficiently. Unfortunately, the significant "drop-down" in the level of sparsity when passing from unverifiable RIP to verifiable Q_∞ is inevitable; this bad news is what is on our agenda now.

1.3.3.3 Limits of performance of verifiable sufficient conditions for goodness
Proposition 1.13. Let A be an m × n sensing matrix which is "essentially nonsquare," specifically, such that 2m ≤ n, and let q ∈ [1, ∞]. Whenever a positive integer s and an m × n matrix H are linked by the relation
$$\|\mathrm{Col}_j[I_n - H^TA]\|_{s,q} < \frac12 s^{\frac1q-1},\quad 1\le j\le n, \tag{1.38}$$
one has
$$s \le \sqrt m. \tag{1.39}$$
As a result, the sufficient condition for the validity of Q_q(s, κ) with κ < 1/2 from Proposition 1.9 can never be satisfied when s > √m. Similarly, the verifiable sufficient condition Q_∞(s, κ), κ < 1/2, for s-goodness of A cannot be satisfied when s > √m.

SPARSE RECOVERY VIA ℓ1 MINIMIZATION

Figure 1.4: Erroneous ℓ_1 recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o – true signal, + – recovery. Bottom: time domain.
For the proof, see Section 1.5.6. We see that unless A is "nearly square," our (same as all others known to us) verifiable sufficient conditions for s-goodness are unable to justify this property for "large" s. This unpleasant fact is in full accordance with the already mentioned fact that no individual provably s-good "essentially nonsquare" m × n matrices with s ≥ O(1)√m are known. Matrices for which our verifiable sufficient conditions do establish s-goodness with s as large as O(1)√m do exist.

How it works: Numerical illustration. Let us apply our machinery to the 256×512 randomly selected submatrix A of the matrix of the 512×512 Inverse Discrete Cosine Transform which we used in the experiments reported in Figure 1.3. These experiments exhibit nice performance of ℓ_1 minimization when recovering sparse (even nearly sparse) signals with as many as 64 nonzeros. In fact, the level of goodness of A is at most 24, as is witnessed in Figure 1.4. In order to upper-bound the level of goodness of a matrix A, one can try to maximize the convex function ‖w‖_{s,1} over the set W = {w : Aw = 0, ‖w‖_1 ≤ 1}: if, for a given s, the maximum of ‖·‖_{s,1} over W is ≥ 1/2, the matrix is not s-good—it does not possess the nullspace property. Now, while global maximization of the convex function ‖w‖_{s,1} over W is difficult, we can try to find suboptimal solutions as follows. Let us start with a vector w^1 ∈ W of ‖·‖_1 norm 1, and let u^1 be obtained from w^1 by replacing the s entries of w^1 of largest magnitudes by the signs of these entries and zeroing out all other entries, so that [w^1]^T u^1 = ‖w^1‖_{s,1}. After u^1 is found, let us solve the LO program max_w {[u^1]^T w : w ∈ W}. w^1 is a feasible solution to this problem, so that for the optimal solution w^2 we have [u^1]^T w^2 ≥ [u^1]^T w^1 =
‖w^1‖_{s,1}; this inequality, by virtue of what u^1 is, implies that ‖w^2‖_{s,1} ≥ ‖w^1‖_{s,1}, and, by construction, w^2 ∈ W. We now can iterate the construction, with w^2 in the role of w^1, to get w^3 ∈ W with ‖w^3‖_{s,1} ≥ ‖w^2‖_{s,1}, etc. Proceeding in this way, we generate a sequence of points from W with monotonically increasing value of the objective ‖·‖_{s,1} we want to maximize. We terminate this recurrence either when the achieved value of the objective becomes ≥ 1/2 (then we know for sure that A is not s-good, and can proceed to investigating s-goodness for a smaller value of s) or when the recurrence gets stuck—the observed progress in the objective falls below a given threshold, say, 10⁻⁶. When this happens, we can restart the process from a new starting point randomly selected in W, after getting stuck restart again, etc., until we exhaust our time budget. The output of the process is the best of the points we have generated—the one with the largest ‖·‖_{s,1}. Applying this approach to the matrix A in question, a couple of minutes of computation show that the matrix is at most 24-good.
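The restart procedure just described is straightforward to implement; below is a hedged sketch (Python; SciPy's `linprog` stands in for the LO solver, and the small random A is an illustrative assumption rather than the 256×512 matrix of the experiment):

```python
import numpy as np
from scipy.optimize import linprog

def s_norm(w, s):
    """||w||_{s,1}: sum of the s largest magnitudes of entries of w."""
    return np.sort(np.abs(w))[::-1][:s].sum()

def maximize_over_W(A, u):
    """The LO program max { u^T w : A w = 0, ||w||_1 <= 1 }, via w = p - q, p,q >= 0."""
    m, n = A.shape
    res = linprog(np.concatenate([-u, u]),           # linprog minimizes
                  A_ub=np.ones((1, 2 * n)), b_ub=[1.0],
                  A_eq=np.hstack([A, -A]), b_eq=np.zeros(m),
                  bounds=(0, None))
    return res.x[:n] - res.x[n:]

def ascent_on_W(A, s, iters=20, tol=1e-6):
    """One run of the recurrence: returns the final w in W and the history of
    ||w||_{s,1} values. If the value reaches 1/2, A is provably not s-good."""
    # starting point: a nullspace vector of A, normalized to ||.||_1 = 1
    w = np.linalg.svd(A)[2][-1]
    w = w / np.abs(w).sum()
    vals = [s_norm(w, s)]
    for _ in range(iters):
        u = np.zeros_like(w)                         # u with u^T w = ||w||_{s,1}
        top = np.argsort(-np.abs(w))[:s]
        u[top] = np.sign(w[top])
        w = maximize_over_W(A, u)
        vals.append(s_norm(w, s))
        if vals[-1] - vals[-2] < tol:                # recurrence got stuck
            break
    return w, vals

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 30))
w, vals = ascent_on_W(A, s=3)
# the recurrence never decreases the objective, and w stays in W
assert all(b >= a - 1e-6 for a, b in zip(vals, vals[1:]))
assert np.abs(A @ w).max() < 1e-6 and np.abs(w).sum() <= 1 + 1e-6
print("||w||_{s,1} along the recurrence:", [round(v, 4) for v in vals])
```

In a full implementation one would wrap `ascent_on_W` in a loop of random restarts, keeping the best point found, exactly as the text describes.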
One can ask how it may happen that the previous experiments with recovering 64-sparse signals went fine, when in fact some 25-sparse signals cannot be recovered by ℓ_1 minimization even in the ideal noiseless case. The answer is simple: in our experiments, we dealt with randomly selected signals, and typical randomly selected data are much nicer, whatever be the purpose of a numerical experiment, than the worst-case data. It is also interesting to understand which goodness we can certify using our verifiable sufficient conditions. Computations show that the fully verifiable (and strongest in our scale of sufficient conditions for s-goodness) condition Q_∞(s, κ) can be satisfied with κ < 1/2 when s is as large as 7 (with κ = 0.4887), and cannot be satisfied with κ < 1/2 when s = 8. As for Mutual Incoherence, it can only justify 3-goodness, no more. We can hardly be happy with the resulting bounds—goodness at least 7 and at most 24; however, it could be worse.
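The certificate behind such computations is the one from Proposition 1.10: for each i one solves the LP Opt_i = min_h ‖e^i − A^T h‖_∞, and κ_* = s·max_i Opt_i is the best constant achievable in a condition Q_∞(s, ·). A sketch under illustrative dimensions (Python with SciPy; the small random A is a stand-in for the matrix of the experiment):

```python
import numpy as np
from scipy.optimize import linprog

def opt_i(A, i):
    """Opt_i = min_h ||e^i - A^T h||_inf, as the LP min { t : -t e <= e^i - A^T h <= t e }."""
    m, n = A.shape
    e_i = np.zeros(n); e_i[i] = 1.0
    ones = np.ones((n, 1))
    c = np.zeros(m + 1); c[-1] = 1.0                 # variables (h, t), minimize t
    A_ub = np.vstack([np.hstack([-A.T, -ones]),      #  e^i - A^T h <= t e
                      np.hstack([A.T, -ones])])      #  A^T h - e^i <= t e
    b_ub = np.concatenate([-e_i, e_i])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m + [(0, None)])
    return res.fun

rng = np.random.default_rng(2)
m, n = 8, 16
A = rng.standard_normal((m, n))
opts = np.array([opt_i(A, i) for i in range(n)])

# Q_inf(s, kappa) with kappa < 1/2 is certifiable iff s * max_i Opt_i < 1/2
s_certified = int(np.floor(0.5 / opts.max() - 1e-9))
print("max_i Opt_i =", opts.max(), "-> certified s-goodness up to s =", s_certified)

# sanity check of the LP duality (1.33) for i = 0:
# Opt_0 = max { x_0 : A x = 0, ||x||_1 <= 1 }, with x = p - q, p,q >= 0
c = np.zeros(2 * n); c[0], c[n] = -1.0, 1.0
dual = linprog(c, A_ub=np.ones((1, 2 * n)), b_ub=[1.0],
               A_eq=np.hstack([A, -A]), b_eq=np.zeros(m), bounds=(0, None))
assert abs(-dual.fun - opts[0]) < 1e-6
```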
1.4
EXERCISES FOR CHAPTER 1
Exercise 1.1. The k-th Hadamard matrix H_k (here k is a nonnegative integer) is the n_k × n_k matrix, n_k = 2^k, given by the recurrence
$$H_0 = [1];\qquad H_{k+1} = \begin{bmatrix} H_k & H_k\\ H_k & -H_k\end{bmatrix}.$$
In the sequel, we assume that k > 0. Now comes the exercise:

1. Check that H_k is a symmetric matrix with entries ±1 whose columns are mutually orthogonal, so that H_k/√n_k is an orthogonal matrix.
2. Check that when k > 0, H_k has just two distinct eigenvalues, √n_k and −√n_k, each of multiplicity m_k := 2^{k−1} = n_k/2.
3. Prove that whenever f is an eigenvector of H_k, one has
$$\|f\|_\infty \le \|f\|_1/\sqrt{n_k}.$$
Derive from this observation the conclusion as follows:
Let a^1, ..., a^{m_k} ∈ R^{n_k} be mutually orthogonal unit vectors which are eigenvectors of H_k with eigenvalue √n_k (by the above, the dimension of the eigenspace of H_k associated with the eigenvalue √n_k is m_k, so that the required a^1, ..., a^{m_k} do exist), and let A be the m_k × n_k matrix with the rows [a^1]^T, ..., [a^{m_k}]^T. For every x ∈ Ker A it holds
$$\|x\|_\infty \le \frac{1}{\sqrt{n_k}}\|x\|_1,$$
whence A satisfies the nullspace property whenever the sparsity s satisfies 2s < √n_k = √(2m_k). Moreover, there exists (and can be found efficiently) an m_k × n_k contrast matrix H = H_k such that for every s < ½√n_k, the pair (H_k, ‖·‖_∞) satisfies the condition Q_∞(s, κ_s = s/√n_k) associated [...]

Exercise 1.5. Utilize the results of Exercise 1.3 in a numerical experiment as follows.
• select n as an integer power 2^k of 2, say, n = 2^{10} = 1024;
• select a "representative" sequence M of values of m, 1 ≤ m < n, including values of m close to n and "much smaller" than n, say, M = {2, 5, 8, 16, 32, 64, 128, 256, 512, 768, 896, 960, 992, 1008, 1016, 1020, 1022, 1023};
• for every m ∈ M ,
– generate at random an m × n submatrix A of the n × n Hadamard matrix H_k and utilize the result of item 4 of Exercise 1.3 in order to find the largest s such that the s-goodness of A can be certified via the condition Q_∞(·, ·); call s(m) the resulting value of s;
– generate a moderate sample of Gaussian m × n sensing matrices A_i with independent N(0, 1/m) entries and use the construction from Exercise 1.2 to upper-bound the largest s for which a matrix from the sample satisfies RIP(1/3, 2s); call ŝ(m) the largest—over your A_i's—of the resulting upper bounds.

The goal of the exercise is to compare the computed values of s(m) and ŝ(m); in other words, we again want to understand how the "theoretically perfect" RIP compares to the "conservative restricted scope" condition Q_∞.
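The Hadamard recurrence of Exercise 1.1 and the facts in its items 1–3 are easy to check numerically; a sketch (Python/NumPy):

```python
import numpy as np

def hadamard(k):
    """The k-th Hadamard matrix via H_0 = [1], H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    H = np.ones((1, 1))
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

k = 4
H = hadamard(k)
n = 2 ** k

assert np.array_equal(H, H.T)                  # symmetric
assert np.all(np.abs(H) == 1)                  # entries +-1
assert np.allclose(H.T @ H, n * np.eye(n))     # columns mutually orthogonal

# two distinct eigenvalues, +-sqrt(n), each of multiplicity n/2 (k > 0)
eig = np.linalg.eigvalsh(H)                    # ascending order
assert np.allclose(eig[: n // 2], -np.sqrt(n))
assert np.allclose(eig[n // 2:], np.sqrt(n))

# every eigenvector f satisfies ||f||_inf <= ||f||_1 / sqrt(n):
# |lam| |f_i| = |(H f)_i| <= ||f||_1 with |lam| = sqrt(n)
_, V = np.linalg.eigh(H)
for f in V.T:
    assert np.abs(f).max() <= np.abs(f).sum() / np.sqrt(n) + 1e-10
```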
1.5
PROOFS
1.5.1
Proofs of Theorems 1.3 and 1.4
All we need is to prove Theorem 1.4, since Theorem 1.3 is the particular case ϰ = κ < 1/2 of Theorem 1.4. Let us fix x ∈ R^n and η ∈ Ξ_ρ, and let us set x̂ = x̂_reg(Ax + η). Let also I ⊂ {1, ..., n} be the set of indexes of the s entries of x of largest magnitudes, I^o be the complement of I in {1, ..., n}, and, for w ∈ R^n, let w_I and w_{I^o} be the vectors obtained from w by zeroing out the entries with indexes j ∉ I and j ∉ I^o, respectively, and keeping the remaining entries intact. Finally, let z = x̂ − x.

1°. By the definition of Ξ_ρ and due to η ∈ Ξ_ρ, we have
$$\|H^T([Ax+\eta] - Ax)\| \le \rho, \tag{1.40}$$
so that x is a feasible solution to the optimization problem specifying x̂, whence ‖x̂‖_1 ≤ ‖x‖_1. We therefore have
$$\|\hat x_{I^o}\|_1 = \|\hat x\|_1 - \|\hat x_I\|_1 \le \|x\|_1 - \|\hat x_I\|_1 = \|x_I\|_1 + \|x_{I^o}\|_1 - \|\hat x_I\|_1 \le \|z_I\|_1 + \|x_{I^o}\|_1, \tag{1.41}$$
and therefore ‖z_{I^o}‖_1 ≤ ‖x̂_{I^o}‖_1 + ‖x_{I^o}‖_1 ≤ ‖z_I‖_1 + 2‖x_{I^o}‖_1. It follows that
$$\|z\|_1 = \|z_I\|_1 + \|z_{I^o}\|_1 \le 2\|z_I\|_1 + 2\|x_{I^o}\|_1. \tag{1.42}$$
Further, by the definition of x̂ we have ‖H^T([Ax+η] − Ax̂)‖ ≤ ρ, which combines with (1.40) to imply that
$$\|H^TA(\hat x - x)\| \le 2\rho. \tag{1.43}$$
2°. Since (H, ‖·‖) satisfies Q_1(s, ϰ), we have
$$\|z\|_{s,1} \le s\|H^TAz\| + \varkappa\|z\|_1.$$
By (1.43), it follows that ‖z‖_{s,1} ≤ 2sρ + ϰ‖z‖_1, which combines with the evident
inequality ‖z_I‖_1 ≤ ‖z‖_{s,1} (recall that Card(I) = s) and with (1.42) to imply that ‖z_I‖_1 ≤ 2sρ + ϰ‖z‖_1 ≤ 2sρ + 2ϰ‖z_I‖_1 + 2ϰ‖x_{I^o}‖_1, whence
$$\|z_I\|_1 \le \frac{2s\rho + 2\varkappa\|x_{I^o}\|_1}{1-2\varkappa}.$$
Invoking (1.42), we conclude that
$$\|z\|_1 \le \frac{4s}{1-2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.44}$$
3°. Since (H, ‖·‖) satisfies Q_q(s, κ), we have
$$\|z\|_{s,q} \le s^{\frac1q}\|H^TAz\| + \kappa s^{\frac1q-1}\|z\|_1,$$
which combines with (1.44) and (1.43) to imply that
$$\|z\|_{s,q} \le 2s^{\frac1q}\rho + \kappa s^{\frac1q}\,\frac{4\rho + 2s^{-1}\|x_{I^o}\|_1}{1-2\varkappa} \le \frac{4s^{\frac1q}[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right] \tag{1.45}$$
(we have taken into account that ϰ < 1/2 and κ ≥ ϰ). Let θ be the (s+1)-st largest magnitude of entries of z, and let w = z − z^s. Now (1.45) implies that
$$\theta \le \|z\|_{s,q}\,s^{-\frac1q} \le \frac{4[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].$$
Hence, invoking (1.44), we have
$$\|w\|_q \le \|w\|_\infty^{\frac{q-1}{q}}\|w\|_1^{\frac1q} \le \theta^{\frac{q-1}{q}}\|z\|_1^{\frac1q} \le \theta^{\frac{q-1}{q}}\left[\frac{4s}{1-2\varkappa}\right]^{\frac1q}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right]^{\frac1q} \le \frac{4s^{\frac1q}[1+\kappa-\varkappa]^{\frac{q-1}{q}}}{1-2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].$$
Taking into account (1.45) and the fact that the supports of z^s and w do not intersect, we get
$$\|z\|_q \le 2^{\frac1q}\max[\|z^s\|_q, \|w\|_q] = 2^{\frac1q}\max[\|z\|_{s,q}, \|w\|_q] \le \frac{4(2s)^{\frac1q}[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].$$
This bound combines with (1.44), the Moment inequality,¹² and with the relation ‖x_{I^o}‖_1 = ‖x − x^s‖_1 to imply (1.16). ✷

¹² The Moment inequality states that if (Ω, µ) is a space with measure and f is a µ-measurable real-valued function on Ω, then $\phi(\rho) = \rho\ln\left(\int_\Omega |f(\omega)|^{1/\rho}\,\mu(d\omega)\right)$ is a convex function of ρ on every segment ∆ ⊂ [0, 1] such that φ(·) is well defined at the endpoints of ∆. As a corollary, when x ∈ R^n and 1 ≤ p ≤ q ≤ ∞, one has
$$\|x\|_p \le \|x\|_1^{\frac{q-p}{p(q-1)}}\,\|x\|_q^{\frac{q(p-1)}{p(q-1)}}.$$
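The corollary of the Moment inequality quoted in the footnote can be sanity-checked numerically; a sketch (Python/NumPy, with exponents kept away from the degenerate case q = 1, an arbitrary restriction made here only for numerical stability):

```python
import numpy as np

def lp_norm(x, p):
    """||x||_p for finite p >= 1."""
    return (np.abs(x) ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(3)
for _ in range(200):
    x = rng.standard_normal(20)
    # 1.5 <= p <= q <= 10
    p, q = sorted(1.5 + 8.5 * rng.random(2))
    a = (q - p) / (p * (q - 1.0))           # exponent of ||x||_1
    b = q * (p - 1.0) / (p * (q - 1.0))     # exponent of ||x||_q; note a + b = 1
    assert lp_norm(x, p) <= lp_norm(x, 1.0) ** a * lp_norm(x, q) ** b * (1 + 1e-9)
```

The check a + b = 1 reflects that the bound is exactly the interpolation of ln‖x‖_{1/ρ} between ρ = 1 and ρ = 1/q.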
1.5.2
Proof of Theorem 1.5
Let us prove (i). Let us fix x ∈ R^n and η, and let us set x̂ = x̂_pen(Ax + η). Let also I ⊂ {1, ..., n} be the set of indexes of the s entries of x of largest magnitudes, I^o be the complement of I in {1, ..., n}, and, for w ∈ R^n, let w_I and w_{I^o} be the vectors obtained from w by zeroing out all entries with indexes not in I and not in I^o, respectively. Finally, let z = x̂ − x and ν = ‖H^Tη‖.

1°. We have ‖x̂‖_1 + λ‖H^T(Ax̂ − Ax − η)‖ ≤ ‖x‖_1 + λ‖H^Tη‖ and ‖H^T(Ax̂ − Ax − η)‖ = ‖H^T(Az − η)‖ ≥ ‖H^TAz‖ − ‖H^Tη‖, whence
$$\|\hat x\|_1 + \lambda\|H^TAz\| \le \|x\|_1 + 2\lambda\|H^T\eta\| = \|x\|_1 + 2\lambda\nu. \tag{1.46}$$
We have
$$\|\hat x\|_1 = \|x+z\|_1 = \|x_I + z_I\|_1 + \|x_{I^o} + z_{I^o}\|_1 \ge \|x_I\|_1 - \|z_I\|_1 + \|z_{I^o}\|_1 - \|x_{I^o}\|_1,$$
which combines with (1.46) to imply that ‖x_I‖_1 − ‖z_I‖_1 + ‖z_{I^o}‖_1 − ‖x_{I^o}‖_1 + λ‖H^TAz‖ ≤ ‖x‖_1 + 2λν, or, which is the same,
$$\|z_{I^o}\|_1 - \|z_I\|_1 + \lambda\|H^TAz\| \le 2\|x_{I^o}\|_1 + 2\lambda\nu. \tag{1.47}$$
Since (H, ‖·‖) satisfies Q_1(s, ϰ), we have ‖z_I‖_1 ≤ ‖z‖_{s,1} ≤ s‖H^TAz‖ + ϰ‖z‖_1, so that
$$(1-\varkappa)\|z_I\|_1 - \varkappa\|z_{I^o}\|_1 - s\|H^TAz\| \le 0. \tag{1.48}$$
Taking a weighted sum of (1.47) and (1.48), the weights being 1 and 2, respectively, we get
$$(1-2\varkappa)\left[\|z_I\|_1 + \|z_{I^o}\|_1\right] + (\lambda - 2s)\|H^TAz\| \le 2\|x_{I^o}\|_1 + 2\lambda\nu,$$
whence, due to λ ≥ 2s,
$$\|z\|_1 \le \frac{2\lambda\nu + 2\|x_{I^o}\|_1}{1-2\varkappa} \le \frac{2\lambda}{1-2\varkappa}\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.49}$$
Further, by (1.46) we have λ‖H^TAz‖ ≤ ‖x‖_1 − ‖x̂‖_1 + 2λν ≤ ‖z‖_1 + 2λν, which combines with (1.49) to imply that
$$\lambda\|H^TAz\| \le \frac{2\lambda\nu + 2\|x_{I^o}\|_1}{1-2\varkappa} + 2\lambda\nu = \frac{2\lambda\nu(2-2\varkappa) + 2\|x_{I^o}\|_1}{1-2\varkappa}. \tag{1.50}$$
From Q_q(s, κ) it follows that
$$\|z\|_{s,q} \le s^{\frac1q}\|H^TAz\| + \kappa s^{\frac1q-1}\|z\|_1,$$
which combines with (1.50) and (1.49) to imply that
$$\|z\|_{s,q} \le s^{\frac1q-1}\left[s\|H^TAz\| + \kappa\|z\|_1\right] \le s^{\frac1q-1}\,\frac{\left[4s\nu(1-\varkappa) + \frac{2s}{\lambda}\|x_{I^o}\|_1\right] + \kappa\left[2\lambda\nu + 2\|x_{I^o}\|_1\right]}{1-2\varkappa} \le \frac{4s^{\frac1q}}{1-2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right] \tag{1.51}$$
(recall that λ ≥ 2s, κ ≥ ϰ, and ϰ < 1/2). It remains to repeat the reasoning following (1.45) in item 3° of the proof of Theorem 1.4. Specifically, denoting by θ the (s+1)-st largest magnitude of entries of z, (1.51) implies that
$$\theta \le s^{-1/q}\|z\|_{s,q} \le \frac{4}{1-2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right], \tag{1.52}$$
so that for the vector w = z − z^s one has
$$\|w\|_q \le \theta^{1-\frac1q}\|w\|_1^{\frac1q} \le \frac{4(\lambda/2)^{\frac1q}}{1-2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q-1}{q}}\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]$$
(we have used (1.52) and (1.49)). Hence, taking into account that z^s and w have non-intersecting supports,
$$\|z\|_q \le 2^{\frac1q}\max[\|z^s\|_q, \|w\|_q] = 2^{\frac1q}\max[\|z\|_{s,q}, \|w\|_q] \le \frac{4\lambda^{\frac1q}}{1-2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]$$
(we have used (1.51) along with λ ≥ 2s and κ ≥ ϰ). This combines with (1.49) and the Moment inequality to imply (1.18). All remaining claims of Theorem 1.5 are immediate corollaries of (1.18). ✷

1.5.3
Proof of Proposition 1.7
1°. Assuming k ≤ m and selecting a set I of k distinct indices from {1, ..., n}, consider the m × k submatrix A_I of A comprised of the columns with indexes from I, and let u be a unit vector in R^k. The entries of the vector m^{1/2}A_Iu are independent N(0, 1) random variables, so that for the random variable $\zeta_u = \sum_{i=1}^m (m^{1/2}A_Iu)_i^2$ and γ ∈ (−1/2, 1/2) it holds (in what follows, expectations and probabilities are taken w.r.t. our ensemble of random A's)
$$\ln\big(E\{\exp\{\gamma\zeta_u\}\}\big) = m\ln\left(\frac1{\sqrt{2\pi}}\int e^{\gamma t^2 - \frac12 t^2}\,dt\right) = -\frac m2\ln(1-2\gamma).$$
Given α ∈ (0, 0.1] and selecting γ in such a way that $1-2\gamma = \frac1{1+\alpha}$, we get 0 < γ < 1/2 and therefore
$$\mathrm{Prob}\{\zeta_u > m(1+\alpha)\} \le E\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1+\alpha)\} = \exp\left\{-\tfrac m2\ln(1-2\gamma) - m\gamma(1+\alpha)\right\} = \exp\left\{\tfrac m2[\ln(1+\alpha) - \alpha]\right\} \le \exp\left\{-\tfrac m5\alpha^2\right\},$$
and similarly, selecting γ in such a way that $1-2\gamma = \frac1{1-\alpha}$, we get −1/2 < γ < 0 and therefore
$$\mathrm{Prob}\{\zeta_u < m(1-\alpha)\} \le E\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1-\alpha)\} = \exp\left\{-\tfrac m2\ln(1-2\gamma) - m\gamma(1-\alpha)\right\} = \exp\left\{\tfrac m2[\ln(1-\alpha) + \alpha]\right\} \le \exp\left\{-\tfrac m5\alpha^2\right\},$$
and we end up with
$$u \in \mathbf{R}^k,\ \|u\|_2 = 1\ \Rightarrow\ \begin{array}{l}\mathrm{Prob}\{A: \|A_Iu\|_2^2 > 1+\alpha\} \le \exp\{-\frac m5\alpha^2\},\\ \mathrm{Prob}\{A: \|A_Iu\|_2^2 < 1-\alpha\} \le \exp\{-\frac m5\alpha^2\}.\end{array} \tag{1.53}$$
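The two-sided bound (1.53) is a standard χ² deviation estimate and can be sanity-checked by simulating ζ_u directly (Python/NumPy; the particular m, α, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, alpha, trials = 100, 0.1, 20000

# zeta_u = ||sqrt(m) A_I u||_2^2 is chi-square with m degrees of freedom
zeta = (rng.standard_normal((trials, m)) ** 2).sum(axis=1)

bound = np.exp(-m * alpha ** 2 / 5.0)
upper = (zeta > m * (1 + alpha)).mean()   # empirical Prob{zeta > m(1+alpha)}
lower = (zeta < m * (1 - alpha)).mean()   # empirical Prob{zeta < m(1-alpha)}

assert upper <= bound and lower <= bound
print(f"upper tail {upper:.3f}, lower tail {lower:.3f}, bound {bound:.3f}")
```

The exponential bound is quite loose for these parameters, so the empirical frequencies sit comfortably below it.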
2°. As above, let α ∈ (0, 0.1], let
$$M = 1+2\alpha,\qquad \epsilon = \frac{\alpha}{2(1+2\alpha)},$$
and let us build an ε-net on the unit sphere S in R^k as follows. We start with a point u^1 ∈ S; after {u^1, ..., u^t} ⊂ S is already built, we check whether there is a point in S at ‖·‖_2 distance > ε from all points of the set. If it is the case, we add such a point to the net built so far and proceed with building the net; otherwise we terminate with the net {u^1, ..., u^t}. By compactness of S and due to ε > 0, this process eventually terminates; upon termination, we have at our disposal a collection {u^1, ..., u^N} of unit vectors such that every two of them are at ‖·‖_2 distance > ε from each other, and every point of S is at distance at most ε from some point of the collection. We claim that the cardinality N of the resulting set can be bounded as
$$N \le \left(\frac{2+\epsilon}{\epsilon}\right)^k = \left(\frac{4+9\alpha}{\alpha}\right)^k \le \left(\frac5\alpha\right)^k. \tag{1.54}$$
Indeed, the interiors of the ‖·‖_2 balls of radius ε/2 centered at the points u^1, ..., u^N are mutually disjoint, and their union is contained in the ‖·‖_2 ball of radius 1 + ε/2 centered at the origin; comparing the volume of the union and that of the ball, we arrive at (1.54).

3°. Consider the event E comprised of all realizations of A such that for all k-element subsets I of {1, ..., n} and all t ≤ N it holds
$$1-\alpha \le \|A_Iu^t\|_2^2 \le 1+\alpha. \tag{1.55}$$
By (1.53) and the union bound,
$$\mathrm{Prob}\{A \notin E\} \le 2N\binom nk\exp\left\{-\tfrac m5\alpha^2\right\}. \tag{1.56}$$
We claim that
$$A \in E\ \Rightarrow\ 1-2\alpha \le \|A_Iu\|_2^2 \le 1+2\alpha\quad\forall\ I \subset \{1,...,n\},\ \mathrm{Card}(I)=k,\ u \in \mathbf{R}^k,\ \|u\|_2=1. \tag{1.57}$$
Indeed, let A ∈ E, let us fix I ⊂ {1, ..., n}, Card(I) = k, and let M be the maximal value of the quadratic form f(u) = u^TA_I^TA_Iu on the unit ‖·‖_2 ball B, centered at the origin, in R^k. On this ball, f is Lipschitz continuous with constant 2M w.r.t. ‖·‖_2; denoting by ū a maximizer of the form on B, we lose nothing when assuming that ū is a unit vector. Now let u^s be the point of our net which is at ‖·‖_2 distance at most ε from ū. We have M = f(ū) ≤ f(u^s) + 2Mε ≤ 1 + α + 2Mε, whence
$$M \le \frac{1+\alpha}{1-2\epsilon} = 1+2\alpha,$$
implying the right inequality in (1.57). Now let u be a unit vector in R^k, and u^s be a point of the net at ‖·‖_2-distance ≤ ε from u. We have
$$f(u) \ge f(u^s) - 2M\epsilon \ge 1-\alpha-2\,\frac{1+\alpha}{1-2\epsilon}\,\epsilon = 1-2\alpha,$$
justifying the first inequality in (1.57). The bottom line is:
$$\delta \in (0, 0.2],\ 1\le k\le n\ \Rightarrow\ \mathrm{Prob}\{A:\ A\ \text{does not satisfy RIP}(\delta,k)\} \le \underbrace{2\left(\frac{10}{\delta}\right)^k}_{\le (20/\delta)^k}\binom nk\exp\left\{-\frac{m\delta^2}{20}\right\}. \tag{1.58}$$
Indeed, setting α = δ/2, we have seen that whenever A ∈ E, we have (1−δ) ≤ ‖Au‖_2^2 ≤ (1+δ) for all unit k-sparse u, which is nothing but RIP(δ, k); with this in mind, (1.58) follows from (1.56) and (1.54).

4°. It remains to verify that with properly selected—depending solely on δ—positive quantities c, d, f, for every k ≥ 1 satisfying (1.28) the right-hand side in (1.58) is at most exp{−fm}. Passing to logarithms, our goal is to ensure the relation
$$G := a(\delta)m - b(\delta)k - \ln\binom nk \ge mf(\delta) > 0,\qquad a(\delta) = \frac{\delta^2}{20},\quad b(\delta) = \ln\frac{20}{\delta}, \tag{1.59}$$
provided that k ≥ 1 satisfies (1.28). Let k satisfy (1.28) with some c, d to be specified later, and let y = k/m. Assuming d ≥ 3, we have 0 ≤ y ≤ 1/3. Now, it is well known that
$$C := \ln\binom nk \le n\left[\frac kn\ln\left(\frac nk\right) + \frac{n-k}n\ln\left(\frac n{n-k}\right)\right],$$
whence
$$C \le n\left[\frac mn\,y\ln\left(\frac n{my}\right) + \frac{n-k}n\underbrace{\ln\left(1+\frac k{n-k}\right)}_{\le\frac k{n-k}}\right] \le n\left[\frac mn\,y\ln\left(\frac n{my}\right) + \frac kn\right] = my\ln\left(\frac n{my}\right) + my \le 2my\ln\left(\frac n{my}\right)$$
(recall that n ≥ m and y ≤ 1/3). It follows that
$$G = a(\delta)m - b(\delta)k - C \ge a(\delta)m - b(\delta)ym - 2my\ln\left(\frac n{my}\right) = m\underbrace{\left[a(\delta) - b(\delta)y - 2y\ln\left(\frac1y\right) - 2y\ln\left(\frac nm\right)\right]}_{H},$$
and all we need is to select c, d in such a way that (1.28) would imply that H ≥ f with some positive f = f(δ). This is immediate: we can find u(δ) > 0 such that when 0 ≤ y ≤ u(δ), we have 2y ln(1/y) + b(δ)y ≤ (1/3)a(δ); selecting d(δ) ≥ 3 large enough, (1.28) would imply y ≤ u(δ), and thus would imply
$$H \ge \frac23 a(\delta) - 2y\ln\left(\frac nm\right).$$
Now we can select c(δ) large enough for (1.28) to ensure that 2y ln(n/m) ≤ (1/3)a(δ). With the c, d just specified, (1.28) implies that H ≥ (1/3)a(δ), and we can take the latter quantity as f(δ). ✷
1.5.4
Proof of Propositions 1.8 and 1.12
Let x ∈ R^n, and let x^1, ..., x^q be obtained from x by the following construction: x^1 is obtained from x by zeroing all but the s entries of largest magnitudes; x^2 is obtained by the same procedure applied to x − x^1; x^3—by the same procedure applied to x − x^1 − x^2; and so on; the process terminates at the first step q at which x = x^1 + ... + x^q. Note that for j ≥ 2 we have ‖x^j‖_∞ ≤ s^{−1}‖x^{j−1}‖_1 and ‖x^j‖_1 ≤ ‖x^{j−1}‖_1, whence also $\|x^j\|_2 \le \sqrt{\|x^j\|_\infty\|x^j\|_1} \le s^{-1/2}\|x^{j-1}\|_1$. It is easily seen that if A satisfies RIP(δ, 2s), then for every two s-sparse vectors u, v with non-overlapping supports we have
$$|v^TA^TAu| \le \delta\|u\|_2\|v\|_2. \tag{*}$$
Indeed, for s-sparse u, v, let I be an index set of cardinality ≤ 2s containing the supports of u and v, so that, denoting by A_I the submatrix of A comprised of the columns with indexes from I, we have v^TA^TAu = v_I^T[A_I^TA_I]u_I. By RIP, the eigenvalues λ_i = 1 + µ_i of the symmetric matrix Q = A_I^TA_I are in-between 1−δ and 1+δ; representing u_I and v_I by the vectors w and z of their coordinates in the orthonormal eigenbasis of Q, we get $|v^TA^TAu| = |\sum_i\lambda_iw_iz_i| = |\sum_iw_iz_i + \sum_i\mu_iw_iz_i| \le |w^Tz| + \delta\|w\|_2\|z\|_2$. It remains to note that w^Tz = u_I^Tv_I = 0 and ‖w‖_2 = ‖u‖_2, ‖z‖_2 = ‖v‖_2.
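For a small matrix, the RIP constant δ = δ(2s) can be computed exactly by enumerating all 2s-column submatrices, after which (∗) can be checked directly. A sketch under illustrative dimensions (Python/NumPy; note that for such a small random A the computed δ may exceed 1, in which case RIP is vacuous, but the inequality (∗) with this δ still holds):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
m, n, s = 8, 12, 2
A = rng.standard_normal((m, n)) / np.sqrt(m)

# delta(2s): smallest delta with (1-delta)||u||^2 <= ||A_I u||^2 <= (1+delta)||u||^2
delta = 0.0
for I in combinations(range(n), 2 * s):
    sv = np.linalg.svd(A[:, I], compute_uv=False)       # descending singular values
    delta = max(delta, abs(sv[0] ** 2 - 1.0), abs(sv[-1] ** 2 - 1.0))

# check (*): |v^T A^T A u| <= delta ||u||_2 ||v||_2 for s-sparse u, v
# with non-overlapping supports
G = A.T @ A
for _ in range(200):
    supp = rng.choice(n, size=2 * s, replace=False)
    iu, iv = supp[:s], supp[s:]
    u = np.zeros(n); u[iu] = rng.standard_normal(s)
    v = np.zeros(n); v[iv] = rng.standard_normal(s)
    assert abs(v @ G @ u) <= delta * np.linalg.norm(u) * np.linalg.norm(v) + 1e-10
```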
We have
$$\begin{array}{l}\|Ax^1\|_2\|Ax\|_2 \ge [x^1]^TA^TAx = \|Ax^1\|_2^2 + \sum_{j=2}^q [x^1]^TA^TAx^j\\[2pt] \qquad\ge \|Ax^1\|_2^2 - \delta\sum_{j=2}^q\|x^1\|_2\|x^j\|_2\quad\text{[by (*)]}\\[2pt] \qquad\ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\sum_{j=2}^q\|x^{j-1}\|_1 \ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\|x\|_1\\[2pt] \Rightarrow\ \|Ax^1\|_2^2 \le \|Ax^1\|_2\|Ax\|_2 + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\[2pt] \Rightarrow\ \|x^1\|_2 = \frac{\|x^1\|_2}{\|Ax^1\|_2^2}\,\|Ax^1\|_2^2 \le \frac{\|x^1\|_2}{\|Ax^1\|_2}\,\|Ax\|_2 + \delta s^{-1/2}\,\frac{\|x^1\|_2^2}{\|Ax^1\|_2^2}\,\|x\|_1\\[2pt] \Rightarrow\ \|x\|_{s,2} = \|x^1\|_2 \le \frac1{\sqrt{1-\delta}}\|Ax\|_2 + \frac{\delta s^{-1/2}}{1-\delta}\|x\|_1\quad\text{[by RIP}(\delta,2s)]\end{array} \tag{!}$$
and we see that the pair $\left(H = \frac{s^{-1/2}}{\sqrt{1-\delta}}\,I_m,\ \|\cdot\|_2\right)$ satisfies $Q_2\left(s, \frac{\delta}{1-\delta}\right)$, as claimed in Proposition 1.8.i. Moreover, when q ≥ 2, κ > 0, and an integer t ≥ 1 satisfy t ≤ s and $\kappa t^{1/q-1} \ge \frac{\delta s^{-1/2}}{1-\delta}$, by (!) we have
$$\|x\|_{t,q} \le \|x\|_{s,q} \le \|x\|_{s,2} \le \frac1{\sqrt{1-\delta}}\|Ax\|_2 + \kappa t^{1/q-1}\|x\|_1,$$
or, equivalently,
$$1 \le t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac q{q-1}} s^{\frac q{2q-2}},\ s\right]\ \Rightarrow\ \left(H = \frac{t^{-1/q}}{\sqrt{1-\delta}}\,I_m,\ \|\cdot\|_2\right)\ \text{satisfies}\ Q_q(t,\kappa),$$
as required in Proposition 1.12.i. Next, we have
$$\begin{array}{l}\|x^1\|_1\|A^TAx\|_\infty \ge [x^1]^TA^TAx = \|Ax^1\|_2^2 + \sum_{j=2}^q [x^1]^TA^TAx^j \ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\|x\|_1\quad\text{[exactly as above]}\\[2pt] \Rightarrow\ \|Ax^1\|_2^2 \le \|x^1\|_1\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\[2pt] \Rightarrow\ (1-\delta)\|x^1\|_2^2 \le \|x^1\|_1\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1\quad\text{[by RIP}(\delta,2s)]\\[2pt] \qquad\le s^{1/2}\|x^1\|_2\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\[2pt] \Rightarrow\ \|x\|_{s,2} = \|x^1\|_2 \le \frac{s^{1/2}}{1-\delta}\|A^TAx\|_\infty + \frac{\delta s^{-1/2}}{1-\delta}\|x\|_1\end{array} \tag{!!}$$
and we see that the pair $\left(H = \frac1{1-\delta}\,A,\ \|\cdot\|_\infty\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1-\delta}\right)$, as required in Proposition 1.8.ii. Moreover, when q ≥ 2, κ > 0, and an integer t ≥ 1 satisfy t ≤ s and $\kappa t^{1/q-1} \ge \frac{\delta}{1-\delta}s^{-1/2}$, we have by (!!)
$$\|x\|_{t,q} \le \|x\|_{s,q} \le \|x\|_{s,2} \le \frac{s^{1/2}}{1-\delta}\|A^TAx\|_\infty + \kappa t^{1/q-1}\|x\|_1,$$
or, equivalently,
$$1 \le t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac q{q-1}} s^{\frac q{2q-2}},\ s\right]\ \Rightarrow\ \left(H = \frac{s^{1/2}t^{-1/q}}{1-\delta}\,A,\ \|\cdot\|_\infty\right)\ \text{satisfies}\ Q_q(t,\kappa),$$
as required in Proposition 1.12.ii. ✷

1.5.5
Proof of Proposition 1.10
(i): Let H̄ ∈ R^{m×N} and ‖·‖ satisfy Q_∞(s, κ). Then for every k ≤ n we have
$$|x_k| \le \|\bar H^TAx\| + s^{-1}\kappa\|x\|_1,$$
or, which is the same by homogeneity,
$$\min_x\left\{\|\bar H^TAx\| - x_k\ :\ \|x\|_1 \le 1\right\} \ge -s^{-1}\kappa.$$
In other words, the optimal value Opt_k of the conic optimization problem¹³
$$\mathrm{Opt}_k = \min_{x,t}\left\{t - [e^k]^Tx\ :\ \|\bar H^TAx\| \le t,\ \|x\|_1 \le 1\right\},$$
where e^k ∈ R^n is the k-th basic orth, is ≥ −s^{−1}κ. Since the problem clearly is strictly feasible, this is the same as saying that the dual problem
$$\max_{\mu\in\mathbf{R},\,g\in\mathbf{R}^n,\,\eta\in\mathbf{R}^N}\left\{-\mu\ :\ A^T\bar H\eta + g = e^k,\ \|g\|_\infty \le \mu,\ \|\eta\|_* \le 1\right\},$$
where ‖·‖_* is the norm conjugate to ‖·‖,
$$\|u\|_* = \max_{\|h\|\le1} h^Tu,$$
has a feasible solution with the value of the objective ≥ −s^{−1}κ. It follows that there exist η = η^k and g = g^k such that
$$(a):\ e^k = A^Th^k + g^k;\qquad (b):\ h^k := \bar H\eta^k,\ \|\eta^k\|_* \le 1;\qquad (c):\ \|g^k\|_\infty \le s^{-1}\kappa. \tag{1.60}$$
Denoting H = [h^1, ..., h^n] and V = I − H^TA, we get Col_k[V^T] = e^k − A^Th^k = g^k, implying that ‖Col_k[V^T]‖_∞ ≤ s^{−1}κ. Since the latter inequality is true for all k ≤ n, we conclude that ‖Col_k[V]‖_{s,∞} = ‖Col_k[V]‖_∞ ≤ s^{−1}κ, 1 ≤ k ≤ n, whence, by Proposition 1.9, (H, ‖·‖_∞) satisfies Q_∞(s, κ). Moreover, for every η ∈ R^m and every k ≤ n we have, in view of (b) and (c),
$$[h^k]^T\eta = [\eta^k]^T\bar H^T\eta \le \|\eta^k\|_*\,\|\bar H^T\eta\|,$$
whence ‖H^Tη‖_∞ ≤ ‖H̄^Tη‖.

Now let us prove the "In addition" part of the proposition. Let H = [h^1, ..., h^n] be the contrast matrix specified in this part. We have
$$\big|[I_n - H^TA]_{ij}\big| = \big|[[e^i]^T - h_i^TA]_j\big| \le \|[e^i]^T - h_i^TA\|_\infty = \|e^i - A^Th_i\|_\infty \le \mathrm{Opt}_i,$$
implying by Proposition 1.9 that (H, ‖·‖_∞) does satisfy the condition Q_∞(s, κ_*) with κ_* = s·max_i Opt_i. Now assume that there exists a matrix H′ which, taken along with some norm ‖·‖, satisfies the condition Q_∞(s, κ) with κ < κ_*, and let us lead this assumption to a contradiction. By the already proved first part of Proposition 1.10, our assumption implies that there exists an m × n matrix H̄ = [h̄^1, ..., h̄^n] such that ‖Col_j[I_n − H̄^TA]‖_∞ ≤ s^{−1}κ for all j ≤ n, implying that |[[e^i]^T − h̄_i^TA]_j| ≤ s^{−1}κ for all i and j, or, which is the same, ‖e^i − A^Th̄_i‖_∞ ≤ s^{−1}κ for all i. Due to the origin of Opt_i, we have Opt_i ≤ ‖e^i − A^Th̄_i‖_∞ for all i,

¹³ For a summary on conic programming, see Section 4.1.
and we arrive at s^{−1}κ_* = max_i Opt_i ≤ s^{−1}κ, that is, κ_* ≤ κ, which is the desired contradiction. It remains to prove (1.33), which is just an exercise in LP duality: denoting by e an n-dimensional all-ones vector, we have
$$\begin{array}{rcl}\mathrm{Opt}_i &:=& \min_h\|e^i - A^Th\|_\infty = \min_{h,t}\left\{t\ :\ e^i - A^Th \le te,\ A^Th - e^i \le te\right\}\\ &=& \max_{\lambda,\mu}\left\{\lambda_i - \mu_i\ :\ \lambda,\mu\ge0,\ A[\lambda-\mu]=0,\ \textstyle\sum_i\lambda_i + \sum_i\mu_i = 1\right\}\quad\text{[LP duality]}\\ &=& \max_{x:=\lambda-\mu}\left\{x_i\ :\ Ax=0,\ \|x\|_1\le1\right\},\end{array}$$
where the concluding equality follows from the fact that the vectors x representable as λ − µ with λ, µ ≥ 0 satisfying ‖λ‖_1 + ‖µ‖_1 = 1 are exactly the vectors x with ‖x‖_1 ≤ 1. ✷

1.5.6
Proof of Proposition 1.13
Let H satisfy (1.38). Since ‖v‖_{s,1} ≤ s^{1−1/q}‖v‖_{s,q}, it follows that H satisfies, for some α < 1/2, the condition
$$\|\mathrm{Col}_j[I_n - H^TA]\|_{s,1} \le \alpha,\quad 1\le j\le n, \tag{1.61}$$
whence, as we know from Proposition 1.9,
$$\|x\|_{s,1} \le s\|H^TAx\|_\infty + \alpha\|x\|_1\quad\forall x\in\mathbf{R}^n.$$
It follows that s ≤ m, since otherwise there exists a nonzero s-sparse vector x with Ax = 0; for this x, the inequality above cannot hold true. Let us set n̄ = 2m, so that n̄ ≤ n, and let H̄ and Ā be the m × n̄ matrices comprised of the first 2m columns of H and A, respectively. Relation (1.61) implies that the matrix V = I_{n̄} − H̄^TĀ satisfies
$$\|\mathrm{Col}_j[V]\|_{s,1} \le \alpha < 1/2,\quad 1\le j\le \bar n. \tag{1.62}$$
Now, since the rank of H̄^TĀ is ≤ m, at least n̄ − m singular values of V are ≥ 1, and therefore the squared Frobenius norm ‖V‖_F² of V is at least n̄ − m. On the other hand, we can upper-bound this squared norm as follows. Observe that for every n̄-dimensional vector f one has
$$\|f\|_2^2 \le \max\left[\frac{\bar n}{s^2}, 1\right]\|f\|_{s,1}^2. \tag{1.63}$$
Indeed, by homogeneity it suffices to verify the inequality when ‖f‖_{s,1} = 1; besides, we can assume w.l.o.g. that the entries of f are nonnegative and f_1 ≥ f_2 ≥ ... ≥ f_{n̄}. We have f_s ≤ ‖f‖_{s,1}/s = 1/s; in addition, $\sum_{j=s+1}^{\bar n} f_j^2 \le (\bar n - s)f_s^2$. Now, due to ‖f‖_{s,1} = 1, for fixed f_s ∈ [0, 1/s] we have
$$\sum_{j=1}^s f_j^2 \le f_s^2 + \max_t\left\{\sum_{j=1}^{s-1} t_j^2\ :\ t_j \ge f_s,\ j\le s-1,\ \sum_{j=1}^{s-1} t_j = 1-f_s\right\}.$$
The maximum on the right-hand side is the maximum of a convex function
over a bounded polytope; it is achieved at an extreme point, that is, at a point where one of the t_j is equal to 1 − (s−1)f_s, and all remaining t_j are equal to f_s. As a result,
$$\sum_j f_j^2 \le (1-(s-1)f_s)^2 + (s-1)f_s^2 + (\bar n - s)f_s^2 \le (1-(s-1)f_s)^2 + (\bar n - 1)f_s^2.$$
The right-hand side in the latter inequality is convex in f_s and thus achieves its maximum over the range [0, 1/s] of allowed values of f_s at an endpoint, yielding $\sum_j f_j^2 \le \max[1, \bar n/s^2]$, as claimed.
Applying (1.63) to the columns of V and recalling that n̄ = 2m, we get
$$\|V\|_F^2 = \sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_2^2 \le \max\left[1, \frac{2m}{s^2}\right]\sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_{s,1}^2 \le 2\alpha^2 m\max\left[1, \frac{2m}{s^2}\right].$$
The left-hand side in this inequality, as we remember, is ≥ n̄ − m = m, and we arrive at
$$m \le 2\alpha^2 m\max[1, 2m/s^2].$$
Since α < 1/2, this inequality implies 2m/s² ≥ 2, whence s ≤ √m. It remains to prove that when m ≤ n/2, the condition Q_∞(s, κ) with κ < 1/2 can be satisfied only when s ≤ √m. This is immediate: by Proposition 1.10, assuming Q_∞(s, κ) satisfiable, there exists an m × n contrast matrix H such that |[I_n − H^TA]_{ij}| ≤ κ/s for all i, j, which, by the already proved part of Proposition 1.13, is impossible when s > √m. ✷
Chapter Two

Hypothesis Testing

Disclaimer for experts. In what follows, we allow for "general" probability and observation spaces, general probability distributions, etc., which, formally, would make it necessary to address the related measurability issues. In order to streamline our exposition, and taking into account that we do not expect our target audience to be experts in the formal nuances of measure theory, we decided to omit in the text comments (always self-evident for an expert) on measurability and replace them with a "disclaimer" as follows:

Below, unless the opposite is explicitly stated,
• all probability and observation spaces are Polish (complete separable metric) spaces equipped with σ-algebras of Borel sets;
• all random variables (i.e., functions from a probability space to some other space) take values in Polish spaces; these variables, like other functions we deal with, are Borel;
• all probability distributions we are dealing with are σ-additive Borel measures on the respective probability spaces; the same is true for all reference measures and probability densities taken w.r.t. these measures.

When an entity (a random variable, or a probability density, or a function, say, a test) is part of the data, the Borel property is a default assumption; e.g., the sentence "Let random variable η be a deterministic transformation of random variable ξ" should be read as "let η = f(ξ) for some Borel function f," and the sentence "Consider a test T deciding on hypotheses H_1, ..., H_L via observation ω ∈ Ω" should be read as "Consider a Borel function T on the Polish space Ω, the values of the function being subsets of the set {1, ..., L}." When an entity is built by us rather than being part of the data, the Borel property is (an always straightforwardly verifiable) property of the construction. For example, the statement "The test T given by ... is such that ..." should be read as "The test T given by ... is a Borel function of observations and is such that ...." On several occasions, we still use the word "Borel"; those not acquainted with the notion are welcome to just ignore this word.
2.1 PRELIMINARIES FROM STATISTICS: HYPOTHESES, TESTS, RISKS

2.1.1 Hypothesis Testing Problem
Hypothesis Testing is one of the fundamental problems of Statistics. Informally, this is the problem where one is given an observation—a realization of a random variable with unknown (at least partly) probability distribution—and wants to decide, based on this observation, between two or more hypotheses on the actual distribution of the observed variable. A formal setting convenient for us is as follows:
CHAPTER 2
Given are:
• an observation space Ω, where the observed random variable (r.v.) takes its values;
• L families P_ℓ of probability distributions on Ω.

We associate with these families L hypotheses H_1, ..., H_L, with H_ℓ stating that the probability distribution P of the observed r.v. belongs to the family P_ℓ (shorthand: H_ℓ : P ∈ P_ℓ). We shall say that the distributions from P_ℓ obey hypothesis H_ℓ. Hypothesis H_ℓ is called simple if P_ℓ is a singleton, and is called composite otherwise. Our goal is, given an observation—a realization ω of the r.v. in question—to decide which of the hypotheses is true.

2.1.2
Tests
Informally, a test is an inference procedure one can use in the above testing problem. Formally, a test for this testing problem is a function T(ω) of ω ∈ Ω; the value T(ω) of this function at a point ω is some subset of the set {1, ..., L}: T(ω) ⊂ {1, ..., L}. Given observation ω, the test accepts all hypotheses H_ℓ with ℓ ∈ T(ω) and rejects all hypotheses H_ℓ with ℓ ∉ T(ω). We call a test simple if T(ω) is a singleton for every ω; that is, whatever the observation, the test accepts exactly one of the hypotheses H_1, ..., H_L and rejects all other hypotheses.

Note: What we have defined is a deterministic test. Sometimes we shall also consider randomized tests, where the set of accepted hypotheses is a (deterministic) function of an observation ω and of a realization θ of a random parameter (which w.l.o.g. can be assumed to be uniformly distributed on [0, 1]) independent of ω. Thus, in a randomized test the inference depends both on the observation ω and on the outcome θ of "flipping a coin," while in a deterministic test the inference depends on the observation only. In fact, randomized testing can be reduced to deterministic testing. To this end it suffices to pass from our "actual" observation ω to the new observation ω_+ = (ω, θ), where θ ∼ Uniform[0, 1] is independent of ω; the ω-component of our new observation ω_+ is, as before, generated "by nature," and the θ-component is generated by us. Now, given families P_ℓ, 1 ≤ ℓ ≤ L, of probability distributions on the original observation space Ω, we can associate with them the families P_{ℓ,+} = {P × Uniform[0, 1] : P ∈ P_ℓ} of probability distributions on our new observation space Ω_+ = Ω × [0, 1]. Clearly, to decide on the hypotheses associated with the families P_ℓ via observation ω is the same as to decide on the hypotheses associated with the families P_{ℓ,+} via our new observation ω_+, and deterministic tests for the latter testing problem are exactly the randomized tests for the former one.

2.1.3
2.1.3 Testing from repeated observations
HYPOTHESIS TESTING

There are situations where an inference can be based on several observations ω1, ..., ωK rather than on a single one. Our related setup is as follows: We are given L families Pℓ, ℓ = 1, ..., L, of probability distributions on the observation space Ω and a collection

ω^K = (ω1, ..., ωK) ∈ Ω^K = Ω × ... × Ω (K factors)
and want to make conclusions on how the distribution of ω^K "is positioned" w.r.t. the families Pℓ, 1 ≤ ℓ ≤ L. We will be interested in three situations of this type, as follows.
2.1.3.1 Stationary K-repeated observations
In the case of stationary K-repeated observations, ω1, ..., ωK are drawn, independently of each other, from a distribution P. Our goal is to decide, given ω^K, on the hypotheses P ∈ Pℓ, ℓ = 1, ..., L. Equivalently: families Pℓ of probability distributions of ω ∈ Ω, 1 ≤ ℓ ≤ L, give rise to the families

Pℓ^{⊙,K} = {P^K = P × ... × P (K factors) : P ∈ Pℓ}

of probability distributions on Ω^K; we refer to the family Pℓ^{⊙,K} as the K-th diagonal power of the family Pℓ. Given observation ω^K ∈ Ω^K, we want to decide on the hypotheses

Hℓ^{⊙,K} : ω^K ∼ P^K ∈ Pℓ^{⊙,K}, 1 ≤ ℓ ≤ L.
2.1.3.2 Semi-stationary K-repeated observations
In the case of semi-stationary K-repeated observations, "nature" somehow selects a sequence P1, ..., PK of distributions on Ω, and then draws, independently across k, observations ωk, k = 1, ..., K, from these distributions:

ωk ∼ Pk, with the ωk independent across k ≤ K.

Our goal is to decide, given ω^K = (ω1, ..., ωK), on the hypotheses {Pk ∈ Pℓ, 1 ≤ k ≤ K}, ℓ = 1, ..., L. Equivalently: families Pℓ of probability distributions of ω ∈ Ω, 1 ≤ ℓ ≤ L, give rise to the families

Pℓ^{⊕,K} = {P^K = P1 × ... × PK : Pk ∈ Pℓ, 1 ≤ k ≤ K}

of probability distributions on Ω^K. Given observation ω^K ∈ Ω^K, we want to decide on the hypotheses

Hℓ^{⊕,K} : ω^K ∼ P^K ∈ Pℓ^{⊕,K}, 1 ≤ ℓ ≤ L.

In the sequel, we refer to the families Pℓ^{⊕,K} as the K-th direct powers of the families Pℓ. A closely related notion is that of the direct product

Pℓ^{⊕,K} = ⊕_{k=1}^{K} Pℓ,k

of K families Pℓ,k of probability distributions on Ωk, over k = 1, ..., K. By definition,

Pℓ^{⊕,K} = {P^K = P1 × ... × PK : Pk ∈ Pℓ,k, 1 ≤ k ≤ K}.
2.1.3.3 Quasi-stationary K-repeated observations
Quasi-stationary K-repeated observations ω1 ∈ Ω, ..., ωK ∈ Ω stemming from a family P of probability distributions on an observation space Ω are generated as follows: "in nature" there exists a random sequence ζ^K = (ζ1, ..., ζK) of "driving factors" (or states) such that for every k, ωk is a deterministic function of ζ1, ..., ζk,

ωk = θk(ζ1, ..., ζk),

and the conditional distribution P_{ωk | ζ1,...,ζk−1} of ωk given ζ1, ..., ζk−1 always (i.e., for all ζ1, ..., ζk−1) belongs to P. With the above mechanism, the collection ω^K = (ω1, ..., ωK) has some distribution P^K which depends on the distribution of the driving factors and on the functions θk(·). We denote by P^{⊗,K} the family of all distributions P^K which can be obtained in this fashion, and we refer to random observations ω^K with distribution P^K of the type just defined as the quasi-stationary K-repeated observations stemming from P.

The quasi-stationary version of our hypothesis testing problem reads: Given L families Pℓ, ℓ = 1, ..., L, of probability distributions on Ω and an observation ω^K ∈ Ω^K, decide on the hypotheses

Hℓ^{⊗,K} : P^K ∈ Pℓ^{⊗,K}, 1 ≤ ℓ ≤ L,

on the distribution P^K of the observation ω^K. A related notion is that of the quasi-direct product

Pℓ^{⊗,K} = ⊗_{k=1}^{K} Pℓ,k

of K families Pℓ,k of probability distributions on Ωk, over k = 1, ..., K. By definition, Pℓ^{⊗,K} comprises all probability distributions of random sequences ω^K = (ω1, ..., ωK), ωk ∈ Ωk, which can be generated as follows: "in nature" there exists a random sequence ζ^K = (ζ1, ..., ζK) of "driving factors" such that for every k ≤ K, ωk is a deterministic function of ζ^k = (ζ1, ..., ζk), and the conditional distribution of ωk given ζ^{k−1} always belongs to Pℓ,k.

The description of quasi-stationary K-repeated observations may seem overly complicated. However, this is exactly what happens in some important applications, e.g., in hidden Markov chains. Suppose that Ω = {1, ..., d} is a finite set, and that ωk ∈ Ω, k = 1, 2, ..., are generated as follows: "in nature" there exists a Markov chain with a D-element state space S split into d nonoverlapping bins, and ωk is the index β(ηk) of the bin to which the state ηk of the chain belongs. Every column Qj of the transition matrix Q of the chain (this column is a probability distribution on {1, ..., D}) generates a probability distribution Pj on Ω, specifically, the distribution of β(η), η ∼ Qj. Now, a family P of distributions on Ω induces the family Q[P] of all D × D stochastic matrices Q for which all D distributions Pj, j = 1, ..., D, belong to P. When Q ∈ Q[P], the observations ωk, k = 1, 2, ..., clearly are given by the above "quasi-stationary mechanism," with ηk in the role of the driving factors and P in the role of Pℓ. Thus, in the situation in question, given L families Pℓ, ℓ = 1, ..., L, of probability distributions on Ω, deciding on the hypotheses Q ∈ Q[Pℓ], ℓ = 1, ..., L, on the transition matrix Q of the Markov chain underlying our observations reduces to hypothesis testing via quasi-stationary K-repeated observations.
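A minimal simulation of this hidden-Markov mechanism (the particular chain with D = 4 states, d = 2 bins, and the transition matrix are illustrative assumptions, not data from the text):

```python
import random

random.seed(1)
D, d = 4, 2
beta = {0: 1, 1: 1, 2: 2, 3: 2}                  # bin index of each state
# Q[j]: distribution of the next state given current state j
# (a column of the transition matrix in the text's convention)
Q = [[0.7, 0.1, 0.1, 0.1],
     [0.2, 0.6, 0.1, 0.1],
     [0.1, 0.1, 0.6, 0.2],
     [0.1, 0.1, 0.2, 0.6]]

# Distribution P_j on Omega = {1, ..., d} induced by column Q_j:
# P_j(b) = probability that the next state falls into bin b
def induced(j):
    P = [0.0] * d
    for i, q in enumerate(Q[j]):
        P[beta[i] - 1] += q
    return P

eta, omegas = 0, []
for _ in range(10):
    eta = random.choices(range(D), weights=Q[eta])[0]
    omegas.append(beta[eta])                     # omega_k = bin of eta_k
print(induced(0), omegas)
```

The chain states η1, η2, ... play the role of the driving factors ζk, and each ωk depends on them only through the current state.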
2.1.4 Risk of a simple test
Let Pℓ, ℓ = 1, ..., L, be families of probability distributions on the observation space Ω; these families give rise to hypotheses

Hℓ : P ∈ Pℓ, ℓ = 1, ..., L,

on the distribution P of a random observation ω ∼ P. We are about to define the risks of a simple test T deciding on the hypotheses Hℓ, ℓ = 1, ..., L, via observation ω. Recall that simplicity means that, as applied to an observation, our test accepts exactly one hypothesis and rejects all other hypotheses.

Partial risks Riskℓ(T | H1, ..., HL) are the worst-case, over P ∈ Pℓ, P-probabilities of T rejecting the ℓ-th hypothesis when it is true, that is, when ω ∼ P:

Riskℓ(T | H1, ..., HL) = sup_{P∈Pℓ} Prob_{ω∼P}{ω : T(ω) ≠ {ℓ}}, ℓ = 1, ..., L.

Obviously, for ℓ fixed, the ℓ-th partial risk depends on how we order the hypotheses; when reordering them, we should reorder the risks as well. In particular, for a test T deciding on two hypotheses H, H′ we have

Risk1(T | H, H′) = Risk2(T | H′, H).

Total risk Risktot(T | H1, ..., HL) is the sum of all L partial risks:

Risktot(T | H1, ..., HL) = ∑_{ℓ=1}^{L} Riskℓ(T | H1, ..., HL).

Risk Risk(T | H1, ..., HL) is the maximum of all L partial risks:

Risk(T | H1, ..., HL) = max_{1≤ℓ≤L} Riskℓ(T | H1, ..., HL).
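For a finite observation space, the three risks can be computed by direct enumeration. A small Python sketch (the two families and the test are illustrative assumptions):

```python
# Partial, total, and max risks of a simple test on a finite Omega = {0, 1, 2}.
P1 = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]            # distributions obeying H1
P2 = [[0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]            # distributions obeying H2
families = {1: P1, 2: P2}

def T(omega):                                      # simple test: accept H1 on {0,1}
    return {1} if omega in (0, 1) else {2}

def partial_risk(ell):
    # worst-case, over P in the ell-th family, probability to reject H_ell
    return max(sum(p[w] for w in range(3) if T(w) != {ell})
               for p in families[ell])

risks = {ell: partial_risk(ell) for ell in (1, 2)}
total_risk = sum(risks.values())                   # Risk_tot
max_risk = max(risks.values())                     # Risk
print(risks, total_risk, max_risk)
```

Note the worst-case `max` over the family, matching the definition of the partial risk as a supremum over P ∈ Pℓ.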
Note that at first glance, we have defined risks for single-observation tests only; in fact, we have defined them for tests based on stationary, semi-stationary, and quasi-stationary K-repeated observations as well, since, as we remember from Section 2.1.3, the corresponding testing problems become single-observation testing problems after redefining the observations and families (ω^K in the role of ω and, say, Pℓ^{⊕,K} = ⊕_{k=1}^{K} Pℓ in the role of Pℓ).

Pay attention to the following two important observations:
• Partial risks of a simple test are defined in the worst-case fashion: as the maximal, over the true distributions P of observations compatible with the hypothesis in question, probability of rejecting this hypothesis.
• Risks of a simple test say what happens, statistically speaking, when the true distribution P of the observation obeys one of the hypotheses in question, and say nothing about what happens when P obeys none of the L hypotheses.

Remark 2.1. "The smaller the hypotheses, the smaller the risks." Specifically, given families of probability distributions Pℓ ⊂ Pℓ′, ℓ = 1, ..., L, on observation space Ω, along with hypotheses Hℓ : P ∈ Pℓ, Hℓ′ : P ∈ Pℓ′ on the distribution P of an observation ω ∈ Ω, every test T deciding on the "larger" hypotheses H1′, ..., HL′ can be considered as a test deciding on the smaller hypotheses H1, ..., HL as well, and the risks of the test can only drop when passing from the larger hypotheses to the smaller ones:

Pℓ ⊂ Pℓ′, 1 ≤ ℓ ≤ L ⇒ Risk(T | H1, ..., HL) ≤ Risk(T | H1′, ..., HL′).

For example, families of probability distributions Pℓ, 1 ≤ ℓ ≤ L, on Ω and a positive integer K induce three families of hypotheses on the distribution P^K of K-repeated observations:

Hℓ^{⊙,K} : P^K ∈ Pℓ^{⊙,K},
Hℓ^{⊕,K} : P^K ∈ Pℓ^{⊕,K} = ⊕_{k=1}^{K} Pℓ,
Hℓ^{⊗,K} : P^K ∈ Pℓ^{⊗,K} = ⊗_{k=1}^{K} Pℓ,   1 ≤ ℓ ≤ L

(see Section 2.1.3), and clearly

Pℓ^{⊙,K} ⊂ Pℓ^{⊕,K} ⊂ Pℓ^{⊗,K}.

It follows that when passing from quasi-stationary K-repeated observations to semi-stationary K-repeated observations, and then to stationary K-repeated observations, the risks of a test can only go down.
2.1.5 Two-point lower risk bound
The following well-known observation [162, 164] is nearly evident:

Proposition 2.2. Consider two simple hypotheses H1 : P = P1 and H2 : P = P2 on the distribution P of observation ω ∈ Ω, and assume that P1, P2 have densities p1, p2 w.r.t. some reference measure Π on Ω.¹ Then for any simple test T deciding on H1, H2 it holds that

Risktot(T | H1, H2) ≥ ∫_Ω min[p1(ω), p2(ω)] Π(dω).   (2.1)

¹This assumption is w.l.o.g.—we can take, as Π, the sum of the measures P1 and P2.
Note that the right-hand side in this relation is independent of how Π is selected.

Proof. Consider a simple test T, perhaps a randomized one, and let π(ω) be the probability for this test to accept H1 and reject H2 when the observation is ω. Since the test is simple, the probability for T to accept H2 and reject H1, the observation being ω, is 1 − π(ω). Consequently,

Risk1(T | H1, H2) = ∫_Ω (1 − π(ω)) p1(ω) Π(dω),
Risk2(T | H1, H2) = ∫_Ω π(ω) p2(ω) Π(dω),

whence

Risktot(T | H1, H2) = ∫_Ω [(1 − π(ω)) p1(ω) + π(ω) p2(ω)] Π(dω) ≥ ∫_Ω min[p1(ω), p2(ω)] Π(dω). ✷
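Proposition 2.2, together with the tie-adjusted likelihood ratio test described in Remark 2.3 below, is easy to check numerically when Ω is finite and Π is the counting measure; the bound (2.1) then reads ∑_ω min[p1(ω), p2(ω)], and the test attains it. A sketch (the two distributions are illustrative):

```python
# Two-point lower risk bound (2.1) on a finite Omega with counting measure:
# Risk_tot(T | H1, H2) >= sum_w min(p1(w), p2(w)), and the tie-adjusted
# likelihood ratio test attains the bound.
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.3, 0.6]

lower_bound = sum(min(a, b) for a, b in zip(p1, p2))

def pi(w):                       # probability of accepting H1 at observation w
    if p1[w] > p2[w]: return 1.0
    if p1[w] < p2[w]: return 0.0
    return 0.5                   # tie

risk1 = sum((1.0 - pi(w)) * p1[w] for w in range(3))  # P1-prob. to reject H1
risk2 = sum(pi(w) * p2[w] for w in range(3))          # P2-prob. to reject H2
print(lower_bound, risk1 + risk2)
```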
Remark 2.3. Note that the lower risk bound (2.1) is achievable: given an observation ω, the corresponding test T accepts H1 with probability 1 (i.e., π(ω) = 1) when p1(ω) > p2(ω), accepts H2 when p1(ω) < p2(ω) (i.e., π(ω) = 0 when p1(ω) < p2(ω)), and accepts H1 and H2 with probabilities 1/2 each in the case of a tie (i.e., π(ω) = 1/2 when p1(ω) = p2(ω)). This is nothing but the likelihood ratio test naturally adjusted to account for ties.

Example 2.1. Let Ω = R^d, let the reference measure Π be the Lebesgue measure on R^d, and let pχ(·) = N(µχ, Id), χ = 1, 2, be the Gaussian densities on R^d with unit covariance and means µχ. In this case, assuming µ1 ≠ µ2, the recipe from Remark 2.3 reduces to the following: Let

φ1,2(ω) = ½[µ1 − µ2]^T[ω − w], w = ½[µ1 + µ2].   (2.2)

Consider the simple test T which, given an observation ω, accepts H1 : p = p1 (and rejects H2 : p = p2) when φ1,2(ω) ≥ 0, and accepts H2 (and rejects H1) otherwise. For this test,

Risk1(T | H1, H2) = Risk2(T | H1, H2) = Risk(T | H1, H2) = ½ Risktot(T | H1, H2) = Erfc(½‖µ1 − µ2‖2)   (2.3)

(see (1.22) for the definition of Erfc), and the test is optimal in terms of its risk and its total risk. Note that optimality of T in terms of total risk is given by Proposition 2.2 and Remark 2.3; optimality in terms of risk is ensured by optimality in terms of total risk combined with the first equality in (2.3).

Example 2.1 admits an immediate and useful extension [36, 37, 84, 128]:

Example 2.2. Let Ω = R^d, let the reference measure Π be the Lebesgue measure on R^d, and let M1 and M2 be two nonempty closed convex sets in R^d with empty intersection and such that the convex optimization program

min_{µ1,µ2} {‖µ1 − µ2‖2 : µχ ∈ Mχ, χ = 1, 2}   (∗)

has an optimal solution µ∗1, µ∗2 (this definitely is the case when at least one of the sets M1, M2 is bounded). Let

φ1,2(ω) = ½[µ∗1 − µ∗2]^T[ω − w], w = ½[µ∗1 + µ∗2],   (2.4)

and let the simple test T deciding on the hypotheses

H1 : p = N(µ, Id) with µ ∈ M1,   H2 : p = N(µ, Id) with µ ∈ M2

be as follows (see Figure 2.1): given an observation ω, T accepts H1 (and rejects H2) when φ1,2(ω) ≥ 0, and accepts H2 (and rejects H1) otherwise.

[Figure 2.1: "Gaussian Separation" (Example 2.5): optimal test deciding on whether the mean of a Gaussian r.v. belongs to the domain A (H1) or to the domain B (H2). The hyperplane o–o separates the acceptance domains for H1 ("left" halfspace) and for H2 ("right" halfspace).]

Then

Risk1(T | H1, H2) = Risk2(T | H1, H2) = Risk(T | H1, H2) = ½ Risktot(T | H1, H2) = Erfc(½‖µ∗1 − µ∗2‖2),   (2.5)

and the test is optimal in terms of its risk and its total risk.

Justification of Example 2.2 is immediate. Let e be the ‖·‖2-unit vector in the direction of µ∗1 − µ∗2, and let ξ[ω] = e^T(ω − w). From the optimality conditions for (∗) it follows that

e^T µ ≥ e^T µ∗1 ∀µ ∈ M1 and e^T µ ≤ e^T µ∗2 ∀µ ∈ M2.

As a result, if µ ∈ M1 and the density of ω is pµ = N(µ, Id), the random variable ξ[ω] is a scalar Gaussian random variable with unit variance and expectation ≥ δ := ½‖µ∗1 − µ∗2‖2, implying that the pµ-probability for ξ[ω] to be negative (which is exactly the same as the pµ-probability for T to reject H1 and accept H2) is at most
Erfc(δ). Similarly, when µ ∈ M2 and the density of ω is pµ = N(µ, Id), ξ[ω] is a scalar Gaussian random variable with unit variance and expectation ≤ −δ, implying that the pµ-probability for ξ[ω] to be nonnegative (which is exactly the same as the probability for T to reject H2 and accept H1) is at most Erfc(δ). These observations imply the validity of (2.5). The test's optimality in terms of risks follows from the fact that the risks of a simple test deciding on our (now composite) hypotheses H1, H2 on the density p of observation ω can only be larger than the risks of a simple test deciding on the two simple hypotheses p = pµ∗1 and p = pµ∗2. In other words, the quantity Erfc(½‖µ∗1 − µ∗2‖2)—see Example 2.1—is a lower bound on the risk and on half of the total risk of a test deciding on H1, H2. With this in mind, the announced optimalities of T in terms of risks are immediate consequences of (2.5).

We remark that the (nearly self-evident) result stated in Example 2.2 seems to have first been noticed in [36]. Example 2.2 allows for substantial extensions in two directions: first, it turns out that the "Euclidean separation" underlying the test built in this example can be used to decide on hypotheses on the location of a "center" of a d-dimensional distribution far beyond the Gaussian observation model considered in this example. This extension will be our goal in the next section, based on the recent paper [110]. Less straightforward and, we believe, more instructive extensions, originating from [102], will be considered in Section 2.3.
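For concreteness, the Euclidean separation test of Example 2.2 can be carried out end to end when M1, M2 are two disjoint Euclidean balls, for which the optimal solution of (∗) is available in closed form (the closest points lie on the segment joining the centers). A Python sketch (the centers and radii are illustrative; the book's Erfc is the standard normal upper tail, cf. (1.22)):

```python
import math

# Euclidean separation (Example 2.2) for two disjoint balls in R^2, where
# the optimal solution of (*) is available in closed form. The book's Erfc
# is the standard normal upper tail (cf. (1.22)).
def erfc_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

c1, r1 = (0.0, 0.0), 1.0                            # M1: ball(c1, r1)
c2, r2 = (6.0, 0.0), 2.0                            # M2: ball(c2, r2)

dist = math.hypot(c2[0] - c1[0], c2[1] - c1[1])
u = ((c2[0] - c1[0]) / dist, (c2[1] - c1[1]) / dist)
mu1 = (c1[0] + r1 * u[0], c1[1] + r1 * u[1])        # mu*_1: closest point of M1
mu2 = (c2[0] - r2 * u[0], c2[1] - r2 * u[1])        # mu*_2: closest point of M2

delta = 0.5 * math.hypot(mu1[0] - mu2[0], mu1[1] - mu2[1])
w = ((mu1[0] + mu2[0]) / 2.0, (mu1[1] + mu2[1]) / 2.0)

def phi(omega):                                     # detector (2.4)
    return sum((a - b) * (o - c) for a, b, o, c in zip(mu1, mu2, omega, w)) / 2.0

def T(omega):                                       # accept H1 iff phi >= 0
    return 1 if phi(omega) >= 0.0 else 2

risk = erfc_tail(delta)                             # risk bound (2.5)
print(delta, round(risk, 6), T((0.5, 0.0)), T((5.0, 1.0)))
```

The closed-form closest points are valid whenever the balls are disjoint, i.e., when the center distance exceeds r1 + r2.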
2.2 HYPOTHESIS TESTING VIA EUCLIDEAN SEPARATION

2.2.1 Situation
In this section, we will be interested in testing hypotheses

Hℓ : P ∈ Pℓ, ℓ = 1, ..., L,   (2.6)

on the probability distribution of a random observation ω in the situation where the families of distributions Pℓ are obtained from a given family P of probability distributions by shifts. Specifically, we are given

• a family P of probability distributions on Ω = R^d such that all distributions from P possess densities with respect to the Lebesgue measure on R^d, and these densities are even functions on R^d;²
• a collection X1, ..., XL of nonempty closed and convex subsets of R^d, with at most one of the sets unbounded.

²Allowing for a slight abuse of notation, we write P ∈ P, where P is a probability distribution, to express the fact that P belongs to P (no abuse of notation so far), and write p(·) ∈ P (this is an abuse of notation), where p(·) is the density of the probability distribution P, to express the fact that P ∈ P.

These data specify L families Pℓ of distributions on R^d; Pℓ comprises the distributions of random vectors of the form x + ξ, where x ∈ Xℓ is deterministic, and ξ is random with distribution from P. Note that with this setup, deciding upon hypotheses (2.6) via observation ω ∼ P is exactly the same as deciding, given an observation

ω = x + ξ,   (2.7)
where x is a deterministic "signal" and ξ is random noise with distribution P known to belong to P, on the "position" of x w.r.t. X1, ..., XL, the ℓ-th hypothesis Hℓ saying that x ∈ Xℓ. The latter allows us to write down the ℓ-th hypothesis as Hℓ : x ∈ Xℓ (of course, this shorthand makes sense only within the scope of our current "signal plus noise" setup).

2.2.2 Pairwise Hypothesis Testing via Euclidean Separation

2.2.2.1 The simplest case
Consider nearly the simplest case of the situation from Section 2.2.1, where L = 2, X1 = {x1} and X2 = {x2}, x1 ≠ x2, are singletons, and P also is a singleton. Let the probability density of the only distribution from P be of the form

p(u) = f(‖u‖2), with f(·) a strictly monotonically decreasing function on the nonnegative ray.   (2.8)

This situation is a generalization of the one considered in Example 2.1, where we dealt with the special case of f, namely, with

p(u) = (2π)^{−d/2} e^{−u^T u/2}.

In the case in question our goal is to decide on two simple hypotheses

Hχ : p(u) = f(‖u − xχ‖2), χ = 1, 2,

on the density of observation (2.7). Let us set

δ = ½‖x1 − x2‖2, e = (x1 − x2)/‖x1 − x2‖2, φ(ω) = e^T ω − c, c = ½ e^T[x1 + x2],   (2.9)
and consider the test T which, given observation ω = x + ξ, accepts the hypothesis H1 : x = x1 when φ(ω) ≥ 0, and accepts the hypothesis H2 : x = x2 otherwise.

[Figure: the densities p1(·), p2(·) centered at x1 and x2; the hyperplane φ(ω) = 0 separates the acceptance region of H1 (φ(ω) > 0) from that of H2 (φ(ω) < 0).]
We have (cf. Example 2.1)

Risk1(T | H1, H2) = ∫_{ω: φ(ω) < 0} p1(ω) dω

[...]

• {pµ : µ ∈ M}: pµ(ω) = µω, with M = {µ : µω > 0, ω ∈ Ω, ∑_{ω∈Ω} µω = 1};
• F is the space of all real-valued functions on the finite set Ω.

Clearly, Discrete o.s. is simple; the function

f(µ) := ln(∫_Ω e^{φ(ω)} pµ(ω) Π(dω)) = ln(∑_{ω∈Ω} e^{φ(ω)} µω)

indeed is concave in µ ∈ M.

2.4.3.4 Direct products of simple observation schemes
Given K simple observation schemes

Ok = (Ωk, Πk; {pµ,k(·) : µ ∈ Mk}; Fk), 1 ≤ k ≤ K,

we can define their direct product

O^K = ∏_{k=1}^{K} Ok = (Ω^K, Π^K; {pµ : µ ∈ M^K}; F^K)

by modeling the situation where our observation is a tuple ω^K = (ω1, ..., ωK) with components ωk yielded, independently of each other, by the observation schemes Ok, namely, as follows:

• The observation space Ω^K is the direct product of the observation spaces Ω1, ..., ΩK, and the reference measure Π^K is the product of the measures Π1, ..., ΠK;
• The parameter space M^K is the direct product of the partial parameter spaces M1, ..., MK, and the distribution associated with parameter µ = (µ1, µ2, ..., µK) ∈ M^K = M1 × ... × MK is the probability distribution on Ω^K with density

pµ(ω^K) = ∏_{k=1}^{K} p_{µk,k}(ωk)

w.r.t. Π^K. In other words, a random observation ω^K ∼ pµ is a sample of observations ω1, ..., ωK drawn, independently of each other, from the distributions p_{µ1,1}, p_{µ2,2}, ..., p_{µK,K};
• The space F^K comprises all separable functions

φ(ω^K) = ∑_{k=1}^{K} φk(ωk)

with φk(·) ∈ Fk, 1 ≤ k ≤ K.

It is immediately seen that the direct product of simple o.s.'s is simple.

⁶The Large Binocular Telescope [16, 17] is a cutting-edge instrument for high-resolution optical/infrared astronomical imaging; it is the subject of a huge ongoing international project; see http://www.lbto.org. Nanoscale Fluorescent Microscopy (a.k.a. Poisson Biophotonics) is a revolutionary tool for cell imaging triggered by the advent of techniques [18, 113, 117, 211] (2014 Nobel Prize in Chemistry) allowing us to break the diffraction barrier and to view biological molecules "at work" at a resolution of 10–20 nm, yielding entirely new insights into the signalling and transport processes within cells.

When all the factors Ok, 1 ≤ k ≤ K, are the identical simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F), the direct product of the factors can be "truncated" to yield the K-th power (also called the stationary K-repeated version) of O, denoted by

[O]^K = (Ω^K, Π^K; {pµ^{(K)} : µ ∈ M}; F^{(K)})

and defined as follows:

• Ω^K and Π^K are exactly the same as in the direct product: Ω^K = Ω × ... × Ω, Π^K = Π × ... × Π (K factors each);
• the parameter space is M rather than the direct product of K copies of M, and the densities are

pµ^{(K)}(ω^K = (ω1, ..., ωK)) = ∏_{k=1}^{K} pµ(ωk).

In other words, random observations ω^K ∼ pµ^{(K)} are K-element samples with components drawn, independently of each other, from pµ;

• the space F^{(K)} comprises the separable functions

φ^{(K)}(ω^K) = ∑_{k=1}^{K} φ(ωk)
with identical components belonging to F (i.e., φ ∈ F).

It is immediately seen that a power of a simple o.s. is simple.

Remark 2.19. Gaussian, Poisson, and Discrete o.s.'s clearly are nondegenerate. It is also clear that the direct product of nondegenerate o.s.'s is nondegenerate.
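The power construction is easy to make concrete for the Discrete o.s.: densities multiply, and detectors from F^{(K)} add up. A sketch (the distribution µ and the detector φ are illustrative choices):

```python
import math

# K-th power of an o.s.: densities multiply, detectors from F^(K) add.
# Illustration for the Discrete o.s. on Omega = {0, 1, 2}.
mu = [0.5, 0.3, 0.2]                               # p_mu(omega) = mu_omega

def p_K(omega_K):                                  # density of omega^K
    prod = 1.0
    for w in omega_K:
        prod *= mu[w]
    return prod

def phi(w):                                        # a detector phi in F
    return 0.5 * math.log(mu[w] / (1.0 / 3.0))

def phi_K(omega_K):                                # separable detector in F^(K)
    return sum(phi(w) for w in omega_K)

sample = (0, 2, 1)
print(p_K(sample), phi_K(sample))
```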
2.4.4 Simple observation schemes—Main result
We are about to demonstrate that when deciding on convex (in a precise sense to be specified below) hypotheses in simple observation schemes, optimal detectors can be found efficiently by solving convex-concave saddle point problems. We start with an "executive summary" of convex-concave saddle point problems.

2.4.4.1 Executive summary of convex-concave saddle point problems
The results to follow are absolutely standard, and their proofs can be found in all textbooks on the subject; see, e.g., [221] or [15, Section D.4].

Let U and V be nonempty sets, and let Φ : U × V → R be a function. These data define an antagonistic game of two players, I and II, where player I selects a point u ∈ U, and player II selects a point v ∈ V; as an outcome of these selections, player I pays to player II the sum Φ(u, v). Clearly, player I is interested in minimizing this payment, and player II in maximizing it. The data U, V, Φ are known to the players in advance, and the question is, what should their selections be?

When player I makes his selection u first, and player II makes his selection v with u already known, player I should be ready to pay, for a selection u ∈ U, a toll as large as

Φ̄(u) = sup_{v∈V} Φ(u, v).

In this situation, a risk-averse player I would select u by minimizing the above worst-case payment, i.e., by solving the primal problem

Opt(P) = inf_{u∈U} Φ̄(u) = inf_{u∈U} sup_{v∈V} Φ(u, v)   (P)

associated with the data U, V, Φ. Similarly, if player II makes his selection v first, and player I selects u after v becomes known, player II should be ready to get, as a result of selecting v ∈ V, an amount as small as

Φ̲(v) = inf_{u∈U} Φ(u, v).

In this situation, a risk-averse player II would select v by maximizing the above worst-case payment, i.e., by solving the dual problem

Opt(D) = sup_{v∈V} Φ̲(v) = sup_{v∈V} inf_{u∈U} Φ(u, v).   (D)

Intuitively, the first situation is less preferable for player I than the second one, so that his guaranteed payment in the first situation, Opt(P), should be ≥ his guaranteed payment, Opt(D), in the second situation:

Opt(P) := inf_{u∈U} sup_{v∈V} Φ(u, v) ≥ sup_{v∈V} inf_{u∈U} Φ(u, v) =: Opt(D).

This fact, called Weak Duality, indeed is true. The central question related to the game is what the players should do when making their selections simultaneously, with no knowledge of the adversary's selection. There is a case when this question has a completely satisfactory answer—this is the case where Φ has a saddle point on U × V.

Definition 2.20. A point (u∗, v∗) ∈ U × V is called a saddle point⁷ of a function Φ(u, v) : U × V → R if Φ as a function of u ∈ U attains at this point its minimum, and as a function of v ∈ V its maximum, that is, if

Φ(u, v∗) ≥ Φ(u∗, v∗) ≥ Φ(u∗, v) ∀(u ∈ U, v ∈ V).

From the viewpoint of our game, a saddle point (u∗, v∗) is an equilibrium: when one of the players sticks to the selection stemming from this point, the other one has no incentive to deviate from his selection stemming from the point. Indeed, if player II selects v∗, there is no reason for player I to deviate from selecting u∗, since with another selection his loss (the payment) can only increase; similarly, when player I selects u∗, there is no reason for player II to deviate from v∗, since with any other selection his gain (the payment) can only decrease. As a result, if the cost function Φ has a saddle point on U × V, this saddle point (u∗, v∗) can be considered as a solution to the game, as the pair of preferred selections of rational players.

It can be easily seen that while Φ can have many saddle points, the values of Φ at all these points are equal to each other; we denote this common value by SadVal. If (u∗, v∗) is a saddle point and player I selects u = u∗, his worst-case loss, over the selections v ∈ V of player II, is SadVal, and if player I selects any u ∈ U, his worst-case loss, over the selections of player II, can only be ≥ SadVal. Similarly, when player II selects v = v∗, his worst-case gain, over the selections of player I, is SadVal, and if player II selects any v ∈ V, his worst-case gain, over the selections of player I, can only be ≤ SadVal.

Existence of saddle points of Φ (min in u ∈ U, max in v ∈ V) can be expressed in terms of the primal problem (P) and the dual problem (D):

⁷More precisely, "saddle point (min in u ∈ U, max in v ∈ V)"; we will usually skip the clarification in parentheses, since it will always be clear from the context which are the minimization variables and which are the maximization ones.
Proposition 2.21. Φ has a saddle point (min in u ∈ U, max in v ∈ V) if and only if problems (P) and (D) are solvable with equal optimal values:

Opt(P) := inf_{u∈U} sup_{v∈V} Φ(u, v) = sup_{v∈V} inf_{u∈U} Φ(u, v) =: Opt(D).   (2.55)

Whenever this is the case, the saddle points of Φ are exactly the pairs (u∗, v∗) comprised of optimal solutions to problems (P) and (D), and the value of Φ at every one of these points is the common value SadVal of Opt(P) and Opt(D).

Existence of a saddle point of a function is a "rare commodity," and the standard sufficient condition for it is convexity-concavity of Φ coupled with convexity of U and V. The precise statement is as follows:

Theorem 2.22 [Sion–Kakutani; see, e.g., [221] or [15, Theorems D.4.3, D.4.4]]. Let U ⊂ R^m, V ⊂ R^n be nonempty closed convex sets, with V bounded, and let Φ : U × V → R be a continuous function which is convex in u ∈ U for every fixed v ∈ V, and concave in v ∈ V for every fixed u ∈ U. Then the equality (2.55) holds true (although it may happen that Opt(P) = Opt(D) = −∞). If, in addition, Φ is coercive in u, meaning that the level sets {u ∈ U : Φ(u, v) ≤ a} are bounded for every a ∈ R and v ∈ V (equivalently: for every v ∈ V, Φ(ui, v) → +∞ along every sequence ui ∈ U going to ∞: ‖ui‖ → ∞ as i → ∞), then Φ admits saddle points (min in u ∈ U, max in v ∈ V).

Note that the "true" Sion–Kakutani Theorem is a bit stronger than Theorem 2.22; the latter, however, covers all our related needs.
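The minimax equality (2.55) guaranteed by Theorem 2.22 can be observed numerically on a toy convex-concave function; the function Φ(u, v) = u² − v² + uv on U = V = [−1, 1] (my illustrative choice, not from the text) is convex in u, concave in v, and has the saddle point (0, 0) with SadVal = 0. A grid check in Python:

```python
# Minimax equality (2.55) observed on a grid for the convex-concave
# toy function Phi(u, v) = u^2 - v^2 + u*v on U = V = [-1, 1];
# the saddle point is (0, 0) with SadVal = 0.
def Phi(u, v):
    return u * u - v * v + u * v

grid = [i / 100.0 for i in range(-100, 101)]

opt_P = min(max(Phi(u, v) for v in grid) for u in grid)  # inf_u sup_v
opt_D = max(min(Phi(u, v) for u in grid) for v in grid)  # sup_v inf_u
print(opt_P, opt_D)
```

Weak Duality (opt_D ≤ opt_P) holds for any function; equality here reflects convexity-concavity together with compactness of U and V.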
2.4.4.2 Main result
Theorem 2.23. Let O = (Ω, Π; {pµ : µ ∈ M}; F) be a simple observation scheme, and let M1, M2 be nonempty compact convex subsets of M. Then

(i) The function

Φ(φ, [µ; ν]) = ½[ln(∫_Ω e^{−φ(ω)} pµ(ω) Π(dω)) + ln(∫_Ω e^{φ(ω)} pν(ω) Π(dω))] : F × (M1 × M2) → R   (2.56)

is continuous on its domain, convex in φ(·) ∈ F, concave in [µ; ν] ∈ M1 × M2, and possesses a saddle point (min in φ ∈ F, max in [µ; ν] ∈ M1 × M2) (φ∗(·), [µ∗; ν∗]) on F × (M1 × M2). W.l.o.g. φ∗ can be assumed to satisfy the relation⁸

∫_Ω exp{−φ∗(ω)} pµ∗(ω) Π(dω) = ∫_Ω exp{φ∗(ω)} pν∗(ω) Π(dω).   (2.57)

⁸Note that F contains constants, and shifting the φ-component of a saddle point of Φ by a constant while keeping its [µ; ν]-component intact clearly yields another saddle point of Φ.

Denoting by ε⋆ the common value of the two quantities in (2.57), the saddle point value

min_{φ∈F} max_{[µ;ν]∈M1×M2} Φ(φ, [µ; ν])

is ln(ε⋆). Besides this, setting φ∗^a(·) = φ∗(·) − a, one has

(a) ∫_Ω exp{−φ∗^a(ω)} pµ(ω) Π(dω) ≤ exp{a} ε⋆ ∀µ ∈ M1,
(b) ∫_Ω exp{φ∗^a(ω)} pν(ω) Π(dω) ≤ exp{−a} ε⋆ ∀ν ∈ M2.   (2.58)

In view of Proposition 2.14 this implies that when deciding, via an observation ω ∈ Ω, on the hypotheses Hχ : ω ∼ pµ with µ ∈ Mχ, χ = 1, 2, the risks of the simple test T_{φ∗^a} based on the detector φ∗^a can be upper-bounded as follows:

Risk1(T_{φ∗^a} | H1, H2) ≤ exp{a} ε⋆,  Risk2(T_{φ∗^a} | H1, H2) ≤ exp{−a} ε⋆.

Moreover, (φ∗, ε⋆) form an optimal solution to the optimization problem

min_{φ,ǫ} {ǫ : ∫_Ω e^{−φ(ω)} pµ(ω) Π(dω) ≤ ǫ ∀µ ∈ M1, ∫_Ω e^{φ(ω)} pµ(ω) Π(dω) ≤ ǫ ∀µ ∈ M2}   (2.59)
(the minimum in (2.59) is taken over all ǫ > 0 and all Π-measurable functions φ(·), not just over φ ∈ F).

(ii) The dual problem associated with the saddle point data Φ, F, M1 × M2 is

max_{µ∈M1, ν∈M2} {Φ̲(µ, ν) := inf_{φ∈F} Φ(φ; [µ; ν])}.   (D)

The objective in this problem is in fact the logarithm of the Hellinger affinity of pµ and pν,

Φ̲(µ, ν) = ln(∫_Ω √(pµ(ω) pν(ω)) Π(dω)),   (2.60)

and this objective is concave and continuous on M1 × M2. The (µ, ν)-components of saddle points of Φ are exactly the maximizers (µ∗, ν∗) of the concave function Φ̲ on M1 × M2. Given such a maximizer [µ∗; ν∗] and setting

φ∗(ω) = ½ ln(pµ∗(ω)/pν∗(ω)),   (2.61)

we get a saddle point (φ∗, [µ∗; ν∗]) of Φ satisfying (2.57).

(iii) Let [µ∗; ν∗] be a maximizer of Φ̲ over M1 × M2. Let, further, ǫ ∈ [0, 1/2] be such that there exists some (perhaps randomized) test deciding, via observation ω ∈ Ω, on the two simple hypotheses

(A) : ω ∼ p(·) := pµ∗(·),  (B) : ω ∼ q(·) := pν∗(·)

with total risk ≤ 2ǫ. Then

ε⋆ ≤ 2√(ǫ(1 − ǫ)).   (2.62)

In other words, if the simple hypotheses (A), (B) can be decided upon, by any test, with total risk 2ǫ, then the risks of the simple test with detector φ∗ given by (2.61) on the composite hypotheses H1, H2 do not exceed 2√(ǫ(1 − ǫ)).

For the proof, see Section 2.11.3.
Remark 2.24. Assume that we are under the premise of Theorem 2.23 and that the simple o.s. in question is nondegenerate (see Section 2.4.2). Then ε⋆ < 1 if and only if the sets M1 and M2 do not intersect.

Indeed, by Theorem 2.23.i, ln(ε⋆) is the saddle point value of Φ(φ, [µ; ν]) on F × (M1 × M2), or, which is the same by Theorem 2.23.ii, the maximum of the function (2.60) on M1 × M2; since saddle points exist, this maximum is achieved at some pair [µ; ν] ∈ M1 × M2. Since (2.60) is ≤ 0, we conclude that ε⋆ ≤ 1, and clearly equality takes place if and only if ∫_Ω √(pµ(ω) pν(ω)) Π(dω) = 1 for some µ ∈ M1 and ν ∈ M2, or, which is the same, ∫_Ω (√(pµ(ω)) − √(pν(ω)))² Π(dω) = 0 for these µ and ν. Since pµ(·) and pν(·) are continuous and the support of Π is the entire Ω, the latter can happen if and only if pµ = pν for our µ, ν, or, by nondegeneracy of O, if and only if M1 ∩ M2 ≠ ∅. ✷
2.4.5 Simple observation schemes—Examples of optimal detectors
Theorem 2.23.i states that when the observation scheme O = (Ω, Π; {pµ : µ ∈ M}; F) is simple and we are interested in deciding on a pair of hypotheses on the distribution of observation ω ∈ Ω,

Hχ : ω ∼ pµ with µ ∈ Mχ, χ = 1, 2,

and the hypotheses are convex, meaning that the underlying parameter sets Mχ are convex and compact, building an optimal, in terms of its risk, detector φ∗—that is, solving the (in general, semi-infinite and infinite-dimensional) optimization problem (2.59)—reduces to solving a finite-dimensional convex problem. Specifically, an optimal solution (φ∗, ε⋆) can be built as follows:

1. We solve the optimization problem

Opt = max_{µ∈M1, ν∈M2} Φ̲(µ, ν) := ln(∫_Ω √(pµ(ω) pν(ω)) Π(dω))   (2.63)

of maximizing the Hellinger affinity (the quantity under the logarithm) of a pair of distributions obeying H1 and H2, respectively; for a simple o.s., the objective in this problem is concave and continuous, and optimal solutions do exist;

2. (Any) optimal solution [µ∗; ν∗] to (2.63) gives rise to an optimal detector φ∗ and its risk ε⋆, according to

φ∗(ω) = ½ ln(pµ∗(ω)/pν∗(ω)),  ε⋆ = exp{Opt}.

The risks of the simple test Tφ∗ associated with the above detector and deciding on H1, H2 satisfy the bounds

max[Risk1(Tφ∗ | H1, H2), Risk2(Tφ∗ | H1, H2)] ≤ ε⋆,
and the test is near-optimal, meaning that whenever the hypotheses H1, H2 (and in fact even the two simple hypotheses stating that ω ∼ pµ∗ and ω ∼ pν∗, respectively) can be decided upon by a test with total risk ≤ 2ǫ ≤ 1, Tφ∗ exhibits a "comparable" risk:

ε⋆ ≤ 2√(ǫ(1 − ǫ)).   (2.64)

The test Tφ∗ is just the maximum likelihood test induced by the probability densities pµ∗ and pν∗.
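When M1 = {µ} and M2 = {ν} are singletons in the Discrete o.s., step 1 of the recipe is vacuous and everything is explicit: φ∗(ω) = ½ ln(µω/νω), and ε⋆ is the Hellinger affinity ∑_ω √(µω νω); moreover, both sides of (2.57) evaluate to ε⋆. A sketch with illustrative µ, ν:

```python
import math

# Optimal detector for two *simple* hypotheses in the Discrete o.s.
# (M1 = {mu}, M2 = {nu}): step 1 of the recipe is vacuous, and step 2 gives
# phi*(w) = (1/2) ln(mu_w / nu_w), eps* = sum_w sqrt(mu_w * nu_w).
mu = [0.6, 0.3, 0.1]
nu = [0.1, 0.3, 0.6]

phi_star = [0.5 * math.log(m / n) for m, n in zip(mu, nu)]
eps_star = sum(math.sqrt(m * n) for m, n in zip(mu, nu))  # Hellinger affinity

# Both quantities balanced in (2.57) indeed evaluate to eps*:
side1 = sum(math.exp(-p) * m for p, m in zip(phi_star, mu))
side2 = sum(math.exp(p) * n for p, n in zip(phi_star, nu))
print(round(eps_star, 6), round(side1, 6), round(side2, 6))
```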
Note that once we know that (φ∗, ε⋆) form an optimal solution to (2.59), some kind of near-optimality of the test Tφ∗ is guaranteed already by Proposition 2.18. By this proposition, whenever in nature there exists a simple test T which decides on H1, H2 with risks Risk1, Risk2 bounded by some ǫ ≤ 1/2, the upper bound ε⋆ on the risks of Tφ∗ can be bounded according to (2.64). Our present near-optimality statement is slightly stronger: first, we allow T to have total risk ≤ 2ǫ, which is weaker than requiring both risks to be ≤ ǫ; second, and more important, now 2ǫ should upper-bound the total risk of T on a pair of simple hypotheses "embedded" into the hypotheses H1, H2; both these modifications extend the family of tests T to which we compare the test Tφ∗, and thus enrich the comparison. Let us look at how the above recipe works for our basic simple o.s.'s.

2.4.5.1 Gaussian o.s.
When O is a Gaussian o.s., that is, {pµ : µ ∈ M} are Gaussian densities with expectations µ ∈ M = R^d and common positive definite covariance matrix Θ, and F is the family of affine functions on Ω = R^d,

• M1, M2 can be arbitrary nonempty convex compact subsets of R^d;
• problem (2.63) becomes the convex optimization problem

Opt = −(1/8) min_{µ∈M1, ν∈M2} (µ − ν)ᵀ Θ⁻¹ (µ − ν);    (2.65)

• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.65) are

φ∗(ω) = (1/2) [µ∗ − ν∗]ᵀ Θ⁻¹ [ω − w],  w = (1/2)[µ∗ + ν∗],
ε⋆ = exp{ −(1/8) [µ∗ − ν∗]ᵀ Θ⁻¹ [µ∗ − ν∗] }.    (2.66)
Note that when Θ = I_d, the test Tφ∗ becomes exactly the optimal test from Example 2.1. The upper bound on the risks of this test established in Example 2.1 (in our present notation, this bound is Erfc((1/2)‖µ∗ − ν∗‖₂)) is slightly better than the bound ε⋆ = exp{−‖µ∗ − ν∗‖₂²/8} given by (2.66) when Θ = I_d. Note, however, that when speaking about the distance δ = ‖µ∗ − ν∗‖₂ between M1 and M2 allowing for a test with risks ≤ ǫ ≪ 1, the results of Example 2.1 and (2.66) say nearly the same thing: Example 2.1 says that δ should be ≥ 2 ErfcInv(ǫ), with ErfcInv defined in (1.26), and (2.66) says that δ should be ≥ 2√(2 ln(1/ǫ)). When ǫ → +0, the ratio of these two lower bounds on δ tends to 1. It should be noted that our general construction of optimal detectors, as applied to the Gaussian o.s. and a pair of convex hypotheses, results in an exactly optimal test and can be analyzed directly, without any "science" (see Example 2.1).
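As a small numerical illustration (ours, not from the book), the recipe (2.65)–(2.66) becomes fully explicit when Θ = σ²I and M1, M2 are Euclidean balls: for spherical Θ, the minimizer of (2.65) is the pair of closest points of the two balls, which lies on the segment between the centers. All names below are ours.

```python
import numpy as np

def gaussian_detector(c1, r1, c2, r2, sigma=1.0):
    """Detector (2.66) for M1 = B(c1, r1), M2 = B(c2, r2) under a Gaussian
    o.s. with Theta = sigma^2 * I; for spherical Theta the minimizer of
    (2.65) is the closest pair of points, lying on the segment [c1, c2]."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    direction = c2 - c1
    dist = np.linalg.norm(direction)
    assert dist > r1 + r2, "balls must be disjoint, otherwise eps_star = 1"
    u = direction / dist
    mu, nu = c1 + r1 * u, c2 - r2 * u            # optimal (mu*, nu*) of (2.65)
    w = 0.5 * (mu + nu)
    delta = (mu - nu) / sigma**2                 # Theta^{-1} [mu* - nu*]
    eps_star = np.exp(-(mu - nu) @ delta / 8.0)  # risk bound of (2.66)
    phi = lambda omega: 0.5 * delta @ (np.asarray(omega, float) - w)
    return phi, eps_star
```

The affine detector vanishes on the hyperplane through w orthogonal (in the Θ⁻¹ geometry) to µ∗ − ν∗, and the test Tφ∗ accepts H1 exactly when φ∗(ω) ≥ 0.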
HYPOTHESIS TESTING
2.4.5.2 Poisson o.s.
When O is a Poisson o.s., that is, M = R^d₊₊ is the interior of the nonnegative orthant in R^d, and pµ, µ ∈ M, is the density

pµ(ω) = ∏_{i=1}^d (µᵢ^{ωᵢ}/ωᵢ!) e^{−µᵢ},  ω = (ω1, ..., ωd) ∈ Z^d₊,

taken w.r.t. the counting measure Π on Ω = Z^d₊, and F is the family of affine functions on Ω, the recipe from the beginning of Section 2.4.5 reads as follows:

• M1, M2 can be arbitrary nonempty convex compact subsets of R^d₊₊ = {x ∈ R^d : x > 0};
• problem (2.63) becomes the convex optimization problem

Opt = −(1/2) min_{µ∈M1, ν∈M2} Σ_{i=1}^d (√µᵢ − √νᵢ)²;    (2.67)

• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.67) are

φ∗(ω) = (1/2) Σ_{i=1}^d ln(µ∗ᵢ/ν∗ᵢ) ωᵢ + (1/2) Σ_{i=1}^d [ν∗ᵢ − µ∗ᵢ],  ε⋆ = e^{Opt}.
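A quick numerical check (ours) of these formulas: for singleton sets M1 = {µ}, M2 = {ν} the minimization in (2.67) is trivial, and ε⋆ = e^{Opt} must coincide with the Hellinger affinity Σ_{ω∈Z^d₊} √(pµ(ω)pν(ω)), which can also be evaluated by brute-force summation. All names are ours.

```python
import itertools
import math

def poisson_opt(mu, nu):
    """Opt of (2.67) for singleton sets M1 = {mu}, M2 = {nu}."""
    return -0.5 * sum((math.sqrt(m) - math.sqrt(n)) ** 2 for m, n in zip(mu, nu))

def poisson_detector(mu, nu):
    """phi_*(omega) = (1/2) sum_i ln(mu_i/nu_i) omega_i + (1/2) sum_i (nu_i - mu_i)."""
    shift = 0.5 * sum(n - m for m, n in zip(mu, nu))
    coef = [0.5 * math.log(m / n) for m, n in zip(mu, nu)]
    return lambda omega: sum(c * w for c, w in zip(coef, omega)) + shift

def hellinger_affinity(mu, nu, T=60):
    """Brute-force sum of sqrt(p_mu * p_nu) over Z^d_+ truncated at T per coordinate."""
    def p(lam, w):
        return math.exp(-lam) * lam ** w / math.factorial(w)
    total = 0.0
    for omega in itertools.product(range(T), repeat=len(mu)):
        pm = math.prod(p(m, w) for m, w in zip(mu, omega))
        pn = math.prod(p(n, w) for n, w in zip(nu, omega))
        total += math.sqrt(pm * pn)
    return total
```

Truncating the sum at T = 60 per coordinate is ample for intensities of moderate size.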
2.4.5.3 Discrete o.s.
When O is a Discrete o.s., that is, Ω = {1, ..., d}, Π is the counting measure on Ω, M = {µ ∈ R^d : µ > 0, Σᵢ µᵢ = 1}, and

pµ(ω) = µ_ω,  ω = 1, ..., d, µ ∈ M,

the recipe from the beginning of Section 2.4.5 reads as follows:⁹

• M1, M2 can be arbitrary nonempty convex compact subsets of the relative interior M of the probabilistic simplex;
• problem (2.63) is equivalent to the convex program

ε⋆ = max_{µ∈M1, ν∈M2} Σ_{i=1}^d √(µᵢνᵢ);    (2.68)

• the optimal detector φ∗ given by an optimal solution (µ∗, ν∗) to (2.68) is

φ∗(ω) = (1/2) ln(µ∗_ω/ν∗_ω),    (2.69)

and the upper bound ε⋆ on the risks of this detector is given by (2.68).

⁹It should be mentioned that the results of this section as applied to the Discrete observation scheme are a simple particular case—that of finite Ω—of the results of [21, 22, 25] on distinguishing convex sets of probability distributions.
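For the Discrete o.s., the closed forms are simple enough to verify directly (our sketch, our names): with φ∗ of (2.69), both one-sided risks ∫ e^{−φ∗} pµ dΠ and ∫ e^{φ∗} pν dΠ evaluate exactly to Σᵢ √(µᵢνᵢ) for any fixed pair (µ, ν) of positive distributions; solving (2.68) then just picks out the worst such pair.

```python
import math

def discrete_detector(mu, nu):
    """phi_*(omega) = (1/2) ln(mu_omega / nu_omega), as a table over Omega = {0, ..., d-1}."""
    return [0.5 * math.log(m / n) for m, n in zip(mu, nu)]

def detector_risks(mu, nu):
    """Both one-sided risks of phi_*, together with eps_star of (2.68)."""
    phi = discrete_detector(mu, nu)
    risk1 = sum(m * math.exp(-p) for m, p in zip(mu, phi))    # int e^{-phi} p_mu dPi
    risk2 = sum(n * math.exp(p) for n, p in zip(nu, phi))     # int e^{+phi} p_nu dPi
    eps_star = sum(math.sqrt(m * n) for m, n in zip(mu, nu))  # Hellinger affinity
    return risk1, risk2, eps_star
```

Indeed, µ_ω e^{−φ∗(ω)} = √(µ_ω ν_ω) = ν_ω e^{φ∗(ω)}, so both sums telescope to the affinity.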
2.4.5.4 Kth power of a simple o.s.
Recall that the Kth power of a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F) (see Section 2.4.3.4) is the o.s.

[O]^K = (Ω^K, Π^K; {pµ^{(K)} : µ ∈ M}; F^{(K)}),

where Ω^K is the direct product of K copies of Ω, Π^K is the product of K copies of Π, the densities pµ^{(K)} are the product densities induced by K copies of the density pµ, µ ∈ M,

pµ^{(K)}(ω^K = (ω1, ..., ωK)) = ∏_{k=1}^K pµ(ωk),

and F^{(K)} is comprised of the functions

φ^{(K)}(ω^K = (ω1, ..., ωK)) = Σ_{k=1}^K φ(ωk)

stemming from functions φ ∈ F. Clearly, [O]^K is the observation scheme describing the stationary K-repeated observations ω^K = (ω1, ..., ωK) with ωk stemming from the o.s. O; see Section 2.3.2.3. As we remember, [O]^K is simple provided that O is so. Assuming O simple, it is immediately seen that, as applied to the o.s. [O]^K, the recipe from the beginning of Section 2.4.5 reads as follows:

• M1, M2 can be arbitrary nonempty convex compact subsets of M, and the corresponding hypotheses, Hχ^K, χ = 1, 2, state that the components ωk of observation ω^K = (ω1, ..., ωK) are drawn, independently of each other, from a distribution pµ with µ ∈ M1 (hypothesis H1^K) or µ ∈ M2 (hypothesis H2^K);
• problem (2.63) is the convex program

Opt(K) = max_{µ∈M1, ν∈M2} ln( ∫_{Ω^K} √( pµ^{(K)}(ω^K) pν^{(K)}(ω^K) ) Π^K(dω^K) )    (D_K)

(the objective being identically equal to K ln( ∫_Ω √(pµ(ω)pν(ω)) Π(dω) )), implying that any optimal solution to the "single-observation" problem (D₁) associated with M1, M2 is optimal for the "K-observation" problem (D_K) associated with M1, M2, and Opt(K) = K·Opt(1);
• the optimal detector φ∗^{(K)} given by an optimal solution (µ∗, ν∗) to (D₁) (this solution is optimal for (D_K) as well) is

φ∗^{(K)}(ω^K) = Σ_{k=1}^K φ∗(ωk),  φ∗(ω) = (1/2) ln( pµ∗(ω)/pν∗(ω) ),    (2.70)

and the upper bound ε⋆(K) on the risks of the detector φ∗^{(K)} on the pair of families of distributions obeying hypotheses H1^K or H2^K is

ε⋆(K) = e^{Opt(K)} = e^{K·Opt(1)} = [ε⋆(1)]^K.    (2.71)
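In practice one often reads (2.71) backward: given the single-observation risk ε⋆(1) and a target risk ǫ, the required number of observations is the smallest K with [ε⋆(1)]^K ≤ ǫ. A one-line helper (our name):

```python
import math

def observations_needed(eps_single, eps_target):
    """Smallest K with eps_star(K) = eps_star(1)**K <= eps_target, by (2.71)."""
    assert 0.0 < eps_single < 1.0 and 0.0 < eps_target < eps_single
    return math.ceil(math.log(1.0 / eps_target) / math.log(1.0 / eps_single))
```

The dependence on the target risk is only logarithmic, so even a mediocre single-observation detector becomes powerful after modestly many repetitions.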
The just outlined results on powers of simple observation schemes allow us to express the near-optimality of detector-based tests in simple o.s.'s in a nicer form.

Proposition 2.25. Let O = (Ω, Π; {pµ : µ ∈ M}; F) be a simple observation scheme, M1, M2 be two nonempty convex compact subsets of M, and (µ∗, ν∗) be an optimal solution to the convex optimization problem (cf. Theorem 2.23)

Opt = max_{µ∈M1, ν∈M2} ln( ∫_Ω √(pµ(ω)pν(ω)) Π(dω) ).

Let φ∗ and φ∗^{(K)} be the single- and K-observation detectors induced by (µ∗, ν∗) via (2.70). Let ǫ ∈ (0, 1/2), and assume that for some positive integer K in nature there exists a simple test T^K deciding, via K i.i.d. observations ω^K = (ω1, ..., ωK) with ωk ∼ pµ for some unknown µ ∈ M, on the hypotheses

Hχ^{(K)}: µ ∈ Mχ, χ = 1, 2,

with risks Risk1, Risk2 not exceeding ǫ. Then setting

K₊ = ⌈ 2K / (1 − ln(4(1 − ǫ))/ln(1/ǫ)) ⌉,

the simple test T_{φ∗^{(K₊)}} utilizing K₊ i.i.d. observations decides on H1^{(K₊)}, H2^{(K₊)} with risks ≤ ǫ. Note that K₊ "is of the order of K": K₊/K → 2 as ǫ → +0.
Proof. Applying item (iii) of Theorem 2.23 to the simple o.s. [O]^K, we see that what above was called ε⋆(K) satisfies

ε⋆(K) ≤ 2√(ǫ(1 − ǫ)).

By (2.71), we conclude that ε⋆(1) ≤ [2√(ǫ(1 − ǫ))]^{1/K}, whence, by the same (2.71),

ε⋆(T) ≤ [2√(ǫ(1 − ǫ))]^{T/K},  T = 1, 2, ....

Plugging T = K₊ into this bound, we get the inequality ε⋆(K₊) ≤ ǫ. It remains to recall that ε⋆(K₊) upper-bounds the risks of the test T_{φ∗^{(K₊)}} when deciding on H1^{(K₊)} vs. H2^{(K₊)}. ✷
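The conversion K ↦ K₊ of Proposition 2.25 is a one-liner (our name); note that K₊ ≥ 2K always, with the ratio slowly approaching 2 as ǫ decreases:

```python
import math

def k_plus(K, eps):
    """K+ = ceil( 2K / (1 - ln(4(1 - eps)) / ln(1/eps)) ) from Proposition 2.25."""
    assert 0.0 < eps < 0.5
    den = 1.0 - math.log(4.0 * (1.0 - eps)) / math.log(1.0 / eps)
    return math.ceil(2.0 * K / den)
```

For instance, matching a hypothetical optimal 10-observation test at risk ǫ = 0.01 costs the detector-based test 29 observations, not 20; the overhead factor shrinks toward 2 only logarithmically in 1/ǫ.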
2.5 TESTING MULTIPLE HYPOTHESES
So far, we have focused on detector-based tests deciding on pairs of hypotheses, and our "constructive" results were restricted to pairs of convex hypotheses dealing with a simple o.s.

O = (Ω, Π; {pµ : µ ∈ M}; F),    (2.72)

convexity of a hypothesis meaning that the family of probability distributions obeying the hypothesis is {pµ : µ ∈ X}, associated with a convex (in fact, convex compact) set X ⊂ M. In this section, we will be interested in pairwise testing of unions of convex hypotheses and in testing multiple (more than two) hypotheses.
2.5.1 Testing unions

2.5.1.1 Situation and goal
Let Ω be an observation space, and assume we are given two finite collections of families of probability distributions on Ω: families of red distributions Ri, 1 ≤ i ≤ r, and families of blue distributions Bj, 1 ≤ j ≤ b. These families give rise to r red and b blue hypotheses on the distribution P of an observation ω ∈ Ω, specifically, Ri: P ∈ Ri (red hypotheses) and Bj: P ∈ Bj (blue hypotheses). Assume that for every i ≤ r, j ≤ b we have at our disposal a simple detector-based test Tij capable of deciding on Ri vs. Bj. What we want is to assemble these tests into a test T deciding on the union R of the red hypotheses vs. the union B of the blue ones:

R: P ∈ R := ∪_{i=1}^r Ri,  B: P ∈ B := ∪_{j=1}^b Bj.

Here P, as always, stands for the probability distribution of observation ω ∈ Ω. Our motivation primarily stems from the case where Ri and Bj are convex hypotheses in a simple o.s. (2.72):

Ri = {pµ : µ ∈ Mi},  Bj = {pµ : µ ∈ Nj},

where Mi and Nj are convex compact subsets of M. In this case we indeed know how to build near-optimal tests deciding on Ri vs. Bj, and the question we have posed becomes, how do we assemble these tests into a test deciding on R vs. B, with

R: P ∈ R = {pµ : µ ∈ X},  X = ∪ᵢ Mi;  B: P ∈ B = {pµ : µ ∈ Y},  Y = ∪ⱼ Nj ?
While the structure of R, B is similar to that of Ri , Bj , there is a significant difference: the sets X, Y are, in general, nonconvex, and therefore the techniques we have developed fail to address testing R vs. B directly.
2.5.1.2 The construction
In the situation just described, let φij be the detectors underlying the tests Tij; w.l.o.g., we can assume these detectors balanced (see Section 2.3.2.2) with some risks ǫij:

∫_Ω e^{−φij(ω)} P(dω) ≤ ǫij ∀P ∈ Ri,
∫_Ω e^{φij(ω)} P(dω) ≤ ǫij ∀P ∈ Bj,  1 ≤ i ≤ r, 1 ≤ j ≤ b.    (2.73)

Let us assemble the detectors φij into a detector for R, B as follows:

φ(ω) = max_{1≤i≤r} min_{1≤j≤b} [φij(ω) − αij],    (2.74)

where the shifts αij are parameters of the construction.

Proposition 2.26. The risks of φ on R, B can be bounded as

∀P ∈ R: ∫_Ω e^{−φ(ω)} P(dω) ≤ max_{i≤r} [ Σ_{j=1}^b ǫij e^{αij} ],
∀P ∈ B: ∫_Ω e^{φ(ω)} P(dω) ≤ max_{j≤b} [ Σ_{i=1}^r ǫij e^{−αij} ].    (2.75)
Therefore, the risks of φ on R, B are upper-bounded by the quantity

ε⋆ = max[ max_{i≤r} Σ_{j=1}^b ǫij e^{αij},  max_{j≤b} Σ_{i=1}^r ǫij e^{−αij} ],    (2.76)

whence the risks of the simple test Tφ, based on the detector φ, deciding on R, B are upper-bounded by ε⋆.

Proof. Let P ∈ R, so that P ∈ Ri∗ for some i∗ ≤ r. Then

∫_Ω e^{−φ(ω)} P(dω) = ∫_Ω e^{min_{i≤r} max_{j≤b} [−φij(ω)+αij]} P(dω)
 ≤ ∫_Ω e^{max_{j≤b} [−φ_{i∗j}(ω)+α_{i∗j}]} P(dω) ≤ Σ_{j=1}^b ∫_Ω e^{−φ_{i∗j}(ω)+α_{i∗j}} P(dω)
 = Σ_{j=1}^b e^{α_{i∗j}} ∫_Ω e^{−φ_{i∗j}(ω)} P(dω)
 ≤ Σ_{j=1}^b ǫ_{i∗j} e^{α_{i∗j}}  [by (2.73) due to P ∈ Ri∗]
 ≤ max_{i≤r} [ Σ_{j=1}^b ǫij e^{αij} ].

Now let P ∈ B, so that P ∈ Bj∗ for some j∗. We have

∫_Ω e^{φ(ω)} P(dω) = ∫_Ω e^{max_{i≤r} min_{j≤b} [φij(ω)−αij]} P(dω)
 ≤ ∫_Ω e^{max_{i≤r} [φ_{ij∗}(ω)−α_{ij∗}]} P(dω) ≤ Σ_{i=1}^r ∫_Ω e^{φ_{ij∗}(ω)−α_{ij∗}} P(dω)
 = Σ_{i=1}^r e^{−α_{ij∗}} ∫_Ω e^{φ_{ij∗}(ω)} P(dω)
 ≤ Σ_{i=1}^r ǫ_{ij∗} e^{−α_{ij∗}}  [by (2.73) due to P ∈ Bj∗]
 ≤ max_{j≤b} [ Σ_{i=1}^r ǫij e^{−αij} ].

(2.75) is proved. The remaining claims of the proposition are readily given by (2.75) combined with Proposition 2.14. ✷

Optimal choice of shift parameters. The detector and the test considered in Proposition 2.26, like the resulting risk bound ε⋆, depend on the shifts αij. Let us optimize the risk bound w.r.t. these shifts. To this end, consider the r × b matrix

E = [ǫij]_{i≤r, j≤b}

and the symmetric (r + b) × (r + b) matrix

𝓔 = [ 0  E ; Eᵀ  0 ].

As is well known, the eigenvalues of the symmetric matrix 𝓔 are comprised of the pairs (σs, −σs), where σs are the singular values of E, and several zeros; in particular, the leading eigenvalue of 𝓔 is the spectral norm ‖E‖2,2 (the largest singular value) of the matrix E. Further, E is a matrix with positive entries, so that 𝓔 is a symmetric entrywise nonnegative matrix. By the Perron–Frobenius Theorem, the leading eigenvector of this matrix can be selected to be nonnegative. Denoting this nonnegative eigenvector [g; h], with r-dimensional g and b-dimensional h, and setting ρ = ‖E‖2,2, we have

ρg = Eh,
ρh = Eᵀg.    (2.77)

Observe that ρ > 0 (evident), whence both g and h are nonzero (since otherwise (2.77) would imply g = h = 0, which is impossible—the eigenvector [g; h] is
nonzero). Since h and g are nonzero nonnegative vectors, ρ > 0, and E is entrywise positive, (2.77) says that g and h are strictly positive vectors. The latter allows us to define the shifts αij according to

αij = ln(hj/gi).    (2.78)

With these shifts, we get

max_{i≤r} Σ_{j=1}^b ǫij e^{αij} = max_{i≤r} Σ_{j=1}^b ǫij hj/gi = max_{i≤r} (Eh)ᵢ/gi = max_{i≤r} ρ = ρ

(we have used the first relation in (2.77)), and

max_{j≤b} Σ_{i=1}^r ǫij e^{−αij} = max_{j≤b} Σ_{i=1}^r ǫij gi/hj = max_{j≤b} (Eᵀg)ⱼ/hj = max_{j≤b} ρ = ρ

(we have used the second relation in (2.77)). The bottom line is as follows:
Proposition 2.27. In the situation and the notation of Section 2.5.1.1, the risks of the detector (2.74) with shifts (2.77), (2.78) on the families R, B do not exceed the quantity ‖E‖2,2, E := [ǫij]_{i≤r, j≤b}. As a result, the risks of the simple test Tφ deciding on the hypotheses R, B do not exceed ‖E‖2,2 as well.

In fact, the shifts in the above proposition are the best possible; this is an immediate consequence of the following simple fact:

Proposition 2.28. Let E = [eij] be a nonzero entrywise nonnegative n × n symmetric matrix. Then the optimal value in the optimization problem

Opt = min_{αij} { max_{i≤n} Σ_{j=1}^n eij e^{αij} : αij = −αji }    (∗)

is equal to ‖E‖2,2. When the Perron–Frobenius eigenvector f of E can be selected positive, the problem is solvable, and an optimal solution is given by

αij = ln(fj/fi), 1 ≤ i, j ≤ n.    (2.79)
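Before the proof, a numerical sanity check of Propositions 2.27–2.28 (our sketch, our names): for an entrywise positive r × b matrix E, the vectors g, h and the value ρ of (2.77) can be read off the leading singular triplet of E, and with the shifts (2.78) every row sum Σⱼ ǫij e^{αij} and every column sum Σᵢ ǫij e^{−αij} collapses to ρ = ‖E‖2,2:

```python
import numpy as np

def optimal_shifts(E):
    """Shifts (2.78) from the leading singular triplet of an entrywise
    positive matrix E: E h = rho g and E^T g = rho h are relations (2.77)."""
    U, s, Vt = np.linalg.svd(E)
    g, h, rho = U[:, 0], Vt[0, :], s[0]
    if g[0] < 0:             # SVD sign ambiguity: make both vectors positive
        g, h = -g, -h
    alpha = np.log(h[None, :] / g[:, None])   # alpha_ij = ln(h_j / g_i)
    return alpha, rho

rng = np.random.default_rng(0)
E = rng.uniform(0.1, 1.0, size=(3, 4))        # entrywise positive "risk matrix"
alpha, rho = optimal_shifts(E)
rows = (E * np.exp(alpha)).sum(axis=1)        # sum_j eps_ij * e^{+alpha_ij}
cols = (E * np.exp(-alpha)).sum(axis=0)       # sum_i eps_ij * e^{-alpha_ij}
```

Both rows and cols come out identically equal to ρ, so the two maxima in (2.76) coincide with ‖E‖2,2, as Proposition 2.27 asserts.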
Proof. Let us prove that Opt ≤ ρ := ‖E‖2,2. Given ǫ > 0, we clearly can find an entrywise nonnegative symmetric matrix E′ with entries e′ij in-between eij and eij + ǫ such that the Perron–Frobenius eigenvector f of E′ can be selected positive (it suffices, e.g., to set e′ij = eij + ǫ). Selecting αij according to (2.79), we get a feasible solution to (∗) such that

∀i: Σⱼ eij e^{αij} ≤ Σⱼ e′ij fj/fi = ‖E′‖2,2,

implying that Opt ≤ ‖E′‖2,2. Passing to limit as ǫ → +0, we get Opt ≤ ‖E‖2,2. As a byproduct of our reasoning, if E admits a positive Perron–Frobenius eigenvector f, then (2.79) yields a feasible solution to (∗) with the value of the objective equal to ‖E‖2,2.
It remains to prove that Opt ≥ ‖E‖2,2. Assume that this is not the case, so that (∗) admits a feasible solution α̂ij such that

ρ̂ := max_i Σⱼ eij e^{α̂ij} < ρ := ‖E‖2,2.

By an arbitrarily small perturbation of E, we can make this matrix symmetric and entrywise positive while still satisfying the above strict inequality; to save notation, assume that already the original E is entrywise positive. Let f be a positive Perron–Frobenius eigenvector of E, and let, as above, αij = ln(fj/fi), so that

Σⱼ eij e^{αij} = Σⱼ eij fj/fi = ρ ∀i.

Setting δij = α̂ij − αij, we conclude that the convex functions

θi(t) = Σⱼ eij e^{αij + tδij}

all are equal to ρ at t = 0, and all are ≤ ρ̂ < ρ at t = 1, implying that θi(1) < θi(0) for every i. The latter, in view of the convexity of θi(·), implies that

θ′i(0) = Σⱼ eij e^{αij} δij = Σⱼ eij (fj/fi) δij < 0 ∀i.

Multiplying the resulting inequalities by fi² and summing up over i, we get

Σ_{i,j} eij fi fj δij < 0,

which is impossible: we have eij = eji and δij = −δji, implying that the left-hand side in the latter inequality is 0. ✷
2.5.2 Testing multiple hypotheses "up to closeness"
So far, we have considered detector-based simple tests deciding on pairs of hypotheses, specifically, convex hypotheses in simple o.s.'s (Section 2.4.4) and unions of convex hypotheses (Section 2.5.1).¹⁰ Now we intend to consider testing of multiple (perhaps more than 2) hypotheses "up to closeness"; the latter notion was introduced in Section 2.2.4.2.

¹⁰Strictly speaking, in Section 2.5.1 it was not explicitly stated that the unions under consideration involve convex hypotheses in simple o.s.'s; our emphasis was on how to decide on a pair of union-type hypotheses given pairwise detectors for the "red" and "blue" components of the unions from the pair. However, for now, the only situation where we indeed have at our disposal good pairwise detectors for red and blue components is that in which these components are convex hypotheses in a good o.s.
2.5.2.1 Situation and goal
Let Ω be an observation space, and let a collection P1, ..., PL of families of probability distributions on Ω be given. As always, the families Pℓ give rise to hypotheses Hℓ: P ∈ Pℓ on the distribution P of observation ω ∈ Ω. Assume also that we are given a closeness relation C on {1, ..., L}. Recall that, formally, a closeness relation is some set of pairs of indices (ℓ, ℓ′) ∈ {1, ..., L}; we interpret the inclusion (ℓ, ℓ′) ∈ C as the fact that hypothesis Hℓ "is close" to hypothesis Hℓ′. When (ℓ, ℓ′) ∈ C, we say that ℓ′ is close (or C-close) to ℓ. We always assume that

• C contains the diagonal: (ℓ, ℓ) ∈ C for every ℓ ≤ L ("each hypothesis is close to itself"), and
• C is symmetric: whenever (ℓ, ℓ′) ∈ C, we have also (ℓ′, ℓ) ∈ C ("if the ℓth hypothesis is close to the ℓ′th one, then the ℓ′th hypothesis is close to the ℓth one").

Recall that a test T deciding on the hypotheses H1, ..., HL via observation ω ∈ Ω is a procedure which, given on input ω ∈ Ω, builds some set T(ω) ⊂ {1, ..., L}, accepts all hypotheses Hℓ with ℓ ∈ T(ω), and rejects all other hypotheses.

Risks of an "up to closeness" test. The notion of the C-risk of a test was introduced in Section 2.2.4.2; we reproduce it here for the reader's convenience. Given closeness C and a test T, we define the C-risk RiskC(T | H1, ..., HL) of T as the smallest ǫ ≥ 0 such that

Whenever an observation ω is drawn from a distribution P ∈ ∪ℓ Pℓ, and ℓ∗ is such that P ∈ Pℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event "ℓ∗ ∉ T(ω) ('the true hypothesis Hℓ∗ is not accepted') or there exists ℓ′ not close to ℓ∗ such that Hℓ′ is accepted" is at most ǫ.

Equivalently: RiskC(T | H1, ..., HL) ≤ ǫ if and only if the following takes place:

Whenever an observation ω is drawn from a distribution P ∈ ∪ℓ Pℓ, and ℓ∗ is such that P ∈ Pℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event "ℓ∗ ∈ T(ω) ('the true hypothesis Hℓ∗ is accepted') and ℓ′ ∈ T(ω) implies that (ℓ∗, ℓ′) ∈ C ('all accepted hypotheses are C-close to the true hypothesis Hℓ∗')" is at least 1 − ǫ.

For example, consider the nine polygons presented in Figure 2.4 and associate with them nine hypotheses on a 2D "signal plus noise" observation ω = x + ξ, ξ ∼ N(0, I₂), with the ℓth hypothesis stating that x belongs to the ℓth polygon. We define closeness C on the collection of hypotheses presented in Figure 2.4 as "two hypotheses are close if and only if the corresponding polygons intersect," like A and B, or A and E. Now the fact that a test T has C-risk ≤ 0.01 would imply, in particular, that if the probability distribution P underlying the observed
ω obeys hypothesis A (i.e., the mean of P belongs to the polygon A), then with P-probability at least 0.99 the list of accepted hypotheses includes hypothesis A, and the only other hypotheses in this list are among hypotheses B, D, and E.

Figure 2.4: Nine hypotheses on the location of the mean µ of observation ω ∼ N(µ, I₂), each stating that µ belongs to a specific polygon.

2.5.2.2 "Building blocks" and construction
The construction we are about to present is, essentially, that used in Section 2.2.4.3, as applied to detector-generated tests. This being said, the presentation to follow is self-contained. The building blocks of our construction are pairwise detectors φℓℓ′(ω), 1 ≤ ℓ < ℓ′ ≤ L, for the pairs Pℓ, Pℓ′, along with (upper bounds on) the risks ǫℓℓ′ of these detectors:

∀(P ∈ Pℓ): ∫_Ω e^{−φℓℓ′(ω)} P(dω) ≤ ǫℓℓ′,
∀(P ∈ Pℓ′): ∫_Ω e^{φℓℓ′(ω)} P(dω) ≤ ǫℓℓ′,  1 ≤ ℓ < ℓ′ ≤ L.

Setting

φℓ′ℓ(ω) = −φℓℓ′(ω), ǫℓ′ℓ = ǫℓℓ′, 1 ≤ ℓ < ℓ′ ≤ L,
φℓℓ(ω) ≡ 0, ǫℓℓ = 1, 1 ≤ ℓ ≤ L,

we get what we refer to as a balanced system of detectors φℓℓ′ and risks ǫℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection P1, ..., PL, meaning that

φℓℓ′(ω) + φℓ′ℓ(ω) ≡ 0,  ǫℓℓ′ = ǫℓ′ℓ,  1 ≤ ℓ, ℓ′ ≤ L,
∀P ∈ Pℓ: ∫_Ω e^{−φℓℓ′(ω)} P(dω) ≤ ǫℓℓ′,  1 ≤ ℓ, ℓ′ ≤ L.    (2.80)

Given closeness C, we associate with it the symmetric L × L matrix C given by

Cℓℓ′ = { 0, (ℓ, ℓ′) ∈ C;  1, (ℓ, ℓ′) ∉ C.    (2.81)

Test TC. Let a collection of shifts αℓℓ′ ∈ R satisfying the relation

αℓℓ′ = −αℓ′ℓ, 1 ≤ ℓ, ℓ′ ≤ L    (2.82)
be given. The detectors φℓℓ′ and the shifts αℓℓ′ specify a test TC deciding on the hypotheses H1, ..., HL. Precisely, given an observation ω, the test TC accepts exactly those hypotheses Hℓ for which φℓℓ′(ω) − αℓℓ′ > 0 whenever ℓ′ is not C-close to ℓ:

TC(ω) = {ℓ : φℓℓ′(ω) − αℓℓ′ > 0 ∀(ℓ′ : (ℓ, ℓ′) ∉ C)}.

Proposition 2.29. (i) The C-risk of the test TC just defined is upper-bounded by the quantity

ε[α] = max_{ℓ≤L} Σ_{ℓ′=1}^L ǫℓℓ′ Cℓℓ′ e^{αℓℓ′}

with C given by (2.81).
(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α] is the quantity ε⋆ = ‖E‖2,2, where the L × L symmetric entrywise nonnegative matrix E is given by

E = [eℓℓ′ := ǫℓℓ′ Cℓℓ′]_{ℓ,ℓ′≤L}.

Assuming E admits a strictly positive Perron–Frobenius vector f, an optimal choice of the shifts is

αℓℓ′ = ln(fℓ′/fℓ), 1 ≤ ℓ, ℓ′ ≤ L,

resulting in ε[α] = ε⋆ = ‖E‖2,2.

Proof. (i): Setting φ̄ℓℓ′(ω) = φℓℓ′(ω) − αℓℓ′ and ǭℓℓ′ = ǫℓℓ′ e^{αℓℓ′}, (2.80) and (2.82) imply that

(a) φ̄ℓℓ′(ω) + φ̄ℓ′ℓ(ω) ≡ 0,  1 ≤ ℓ, ℓ′ ≤ L,
(b) ∀P ∈ Pℓ: ∫_Ω e^{−φ̄ℓℓ′(ω)} P(dω) ≤ ǭℓℓ′,  1 ≤ ℓ, ℓ′ ≤ L.    (2.83)

Now let ℓ∗ be such that the distribution P of observation ω belongs to Pℓ∗. Then for every ℓ′ the P-probability of the event φ̄ℓ∗ℓ′(ω) ≤ 0 is ≤ ǭℓ∗ℓ′ by (2.83.b), whence the P-probability of the event

E∗ = { ω : ∃ℓ′ : (ℓ∗, ℓ′) ∉ C & φ̄ℓ∗ℓ′(ω) ≤ 0 }

is upper-bounded by

Σ_{ℓ′:(ℓ∗,ℓ′)∉C} ǭℓ∗ℓ′ = Σ_{ℓ′=1}^L Cℓ∗ℓ′ ǫℓ∗ℓ′ e^{αℓ∗ℓ′} ≤ ε[α].

Assume that E∗ does not take place (as we have seen, this indeed is so with P-probability ≥ 1 − ε[α]). Then φ̄ℓ∗ℓ′(ω) > 0 for all ℓ′ such that (ℓ∗, ℓ′) ∉ C, implying, first, that Hℓ∗ is accepted by our test. Second, φ̄ℓ′ℓ∗(ω) = −φ̄ℓ∗ℓ′(ω) < 0 whenever (ℓ∗, ℓ′) ∉ C, or, due to the symmetry of closeness, whenever (ℓ′, ℓ∗) ∉ C, implying that the test TC rejects the hypothesis Hℓ′ when ℓ′ is not C-close to ℓ∗. Thus, the P-probability of the event "Hℓ∗ is accepted, and all accepted hypotheses are C-close
to Hℓ∗" is at least 1 − ε[α]. We conclude that the C-risk RiskC(TC | H1, ..., HL) of the test TC is at most ε[α]. (i) is proved.
(ii) is readily given by Proposition 2.28. ✷
2.5.2.3 Testing multiple hypotheses via repeated observations
In the situation of Section 2.5.2.1, given a balanced system of detectors φℓℓ′ and risks ǫℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection P1, ..., PL (see (2.80)) and a positive integer K, we can

• pass from the detectors φℓℓ′ and risks ǫℓℓ′ to the entities

φℓℓ′^{(K)}(ω^K = (ω1, ..., ωK)) = Σ_{k=1}^K φℓℓ′(ωk),  ǫℓℓ′^{(K)} = ǫℓℓ′^K,  1 ≤ ℓ, ℓ′ ≤ L;

• associate with the families Pℓ the families Pℓ^{(K)} of probability distributions underlying quasi-stationary K-repeated versions of observations ω ∼ P ∈ Pℓ—see Section 2.3.2.3—and thus arrive at the hypotheses Hℓ^K = Hℓ^{⊗,K} stating that the distribution P^K of the K-repeated observation ω^K = (ω1, ..., ωK), ωk ∈ Ω, belongs to the family Pℓ^{⊗,K} = ⊗_{k=1}^K Pℓ associated with Pℓ; see Section 2.1.3.3.

By Proposition 2.16 and (2.80), we arrive at the following analog of (2.80):

φℓℓ′^{(K)}(ω^K) + φℓ′ℓ^{(K)}(ω^K) ≡ 0,  ǫℓℓ′^{(K)} = ǫℓ′ℓ^{(K)} = ǫℓℓ′^K,  1 ≤ ℓ, ℓ′ ≤ L,
∀P^K ∈ Pℓ^{(K)}: ∫_{Ω^K} e^{−φℓℓ′^{(K)}(ω^K)} P^K(dω^K) ≤ ǫℓℓ′^{(K)},  1 ≤ ℓ, ℓ′ ≤ L.

Given shifts αℓℓ′ satisfying (2.82) and applying the construction from Section 2.5.2.2 to these shifts and our newly constructed detectors and risks, we arrive at the test TC^K deciding on the hypotheses H1^K, ..., HL^K via the K-repeated observation ω^K. Specifically, given an observation ω^K, the test TC^K accepts exactly those hypotheses Hℓ^K for which φℓℓ′^{(K)}(ω^K) − αℓℓ′ > 0 whenever ℓ′ is not C-close to ℓ:

TC^K(ω^K) = {ℓ : φℓℓ′^{(K)}(ω^K) − αℓℓ′ > 0 ∀(ℓ′ : (ℓ, ℓ′) ∉ C)}.

Invoking Proposition 2.29, we arrive at

Proposition 2.30. (i) The C-risk of the test TC^K just defined is upper-bounded by the quantity

ε[α, K] = max_{ℓ≤L} Σ_{ℓ′=1}^L ǫℓℓ′^K Cℓℓ′ e^{αℓℓ′}.

(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α, K] is the quantity ε⋆(K) = ‖E^{(K)}‖2,2, where the L × L symmetric entrywise nonnegative matrix E^{(K)} is given by

E^{(K)} = [eℓℓ′^{(K)} := ǫℓℓ′^K Cℓℓ′]_{ℓ,ℓ′≤L}.

Assuming E^{(K)} admits a strictly positive Perron–Frobenius vector f, an optimal
choice of the shifts is

αℓℓ′ = ln(fℓ′/fℓ), 1 ≤ ℓ, ℓ′ ≤ L,

resulting in ε[α, K] = ε⋆(K) = ‖E^{(K)}‖2,2.
2.5.2.4 Consistency and near-optimality
Observe that when closeness C is such that ǫℓℓ′ < 1 whenever ℓ, ℓ′ are not C-close to each other, the entries of the matrix E^{(K)} go to 0 as K → ∞ exponentially fast, whence the C-risk of the test TC^K also goes to 0 as K → ∞, meaning that the test TC^K is consistent. When, in addition, the families Pℓ correspond to convex hypotheses in a simple o.s., the test TC^K possesses a property of near-optimality similar to that stated in Proposition 2.25:

Proposition 2.31. Consider the special case of the situation from Section 2.5.2.1 where, given a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F), the families Pℓ of probability distributions are of the form Pℓ = {pµ : µ ∈ Nℓ}, where Nℓ, 1 ≤ ℓ ≤ L, are nonempty convex compact subsets of M. Let also the pairwise detectors φℓℓ′ and their risks ǫℓℓ′ underlying the construction from Section 2.5.2.2 be obtained by applying Theorem 2.23 to the pairs Nℓ, Nℓ′, so that for 1 ≤ ℓ < ℓ′ ≤ L one has

φℓℓ′(ω) = (1/2) ln( pµℓℓ′(ω)/pνℓℓ′(ω) ),  ǫℓℓ′ = exp{Optℓℓ′},

where

Optℓℓ′ = max_{µ∈Nℓ, ν∈Nℓ′} ln( ∫_Ω √(pµ(ω)pν(ω)) Π(dω) ),

and (µℓℓ′, νℓℓ′) form an optimal solution to the optimization problem on the right-hand side. Assume that for some positive integer K∗ in nature there exists a test T^{K∗} which decides with C-risk ǫ ∈ (0, 1/2), via the stationary K∗-repeated observation ω^{K∗}, on the hypotheses Hℓ^{(K∗)} stating that the components in ω^{K∗} are drawn, independently of each other, from a distribution P ∈ Pℓ, ℓ = 1, ..., L, and let

K = ⌈ 2 (1 + ln(L − 1)/ln(1/ǫ)) / (1 − ln(4(1 − ǫ))/ln(1/ǫ)) · K∗ ⌉.    (2.84)

Then the test TC^K yielded by the construction from Section 2.5.2.2 as applied to the above φℓℓ′, ǫℓℓ′, and the trivial shifts αℓℓ′ ≡ 0 decides on the hypotheses Hℓ^K—see Section 2.5.2.3—via quasi-stationary K-repeated observations ω^K, with C-risk ≤ ǫ. Note that K/K∗ → 2 as ǫ → +0.

Proof. Let

ǭ = max_{ℓ,ℓ′} { ǫℓℓ′ : ℓ < ℓ′, and ℓ, ℓ′ are not C-close to each other }.

Denoting by (ℓ∗, ℓ′∗) the corresponding maximizer, note that T^{K∗} induces a simple test T able to decide via stationary K∗-repeated observations ω^{K∗} on the pair of hypotheses Hℓ∗^{(K∗)}, Hℓ′∗^{(K∗)} with risks ≤ ǫ (it suffices to make T accept the first of the hypotheses in the pair and reject the second one whenever T^{K∗} on the same observation accepts Hℓ∗^{(K∗)}; otherwise T rejects the first hypothesis in the pair and accepts the second one). This observation, by the same argument as in the proof
of Proposition 2.25, implies that ǭ^{K∗} ≤ 2√(ǫ(1 − ǫ)) < 1, whence all entries in the matrix E^{(K)} do not exceed (ǭ^{K∗})^{K/K∗} ≤ [2√(ǫ(1 − ǫ))]^{K/K∗}, implying by Proposition 2.29 that the C-risk of the test TC^K does not exceed

ǫ(K) := (L − 1)[2√(ǫ(1 − ǫ))]^{K/K∗}.

It remains to note that for K given by (2.84) one has ǫ(K) ≤ ǫ.
✷
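The quantities appearing in this proof are easy to evaluate numerically (our sketch, our names): compute K from (2.84) and check that the resulting C-risk bound ǫ(K) indeed falls below ǫ while ǫ(K − 1) does not, i.e., that (2.84) gives essentially the smallest K working for this particular bound.

```python
import math

def k_of_284(K_star, eps, L):
    """K from (2.84), with the outer Ceil made explicit."""
    log_inv = math.log(1.0 / eps)
    num = 1.0 + math.log(L - 1) / log_inv
    den = 1.0 - math.log(4.0 * (1.0 - eps)) / log_inv
    return math.ceil(2.0 * K_star * num / den)

def crisk_bound(K, K_star, eps, L):
    """eps(K) = (L - 1) * (2 sqrt(eps(1 - eps)))^(K / K_star) from the proof above."""
    return (L - 1) * (2.0 * math.sqrt(eps * (1.0 - eps))) ** (K / K_star)
```

The dependence on the number of hypotheses L enters only through ln(L − 1), so matching a hypothetical optimal test remains cheap even for long hypothesis lists.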
Remark 2.32. Note that the tests TC and TC^K we have built may, depending on observations, accept no hypotheses at all, which sometimes is undesirable. Clearly, every test deciding on multiple hypotheses up to C-closeness can always be modified to ensure that some hypothesis is always accepted. To this end, it suffices, for instance, that the modified test accept exactly those hypotheses, if any, which are accepted by our original test, and accept, say, hypothesis # 1 when the original test accepts no hypotheses. It is immediate to see that the C-risk of the modified test cannot be larger than the risk of the original test.
2.5.3 Illustration: Selecting the best among a family of estimates
Let us illustrate our machinery for multiple hypothesis testing by applying it to the following situation. We are given:

• a simple nondegenerate observation scheme O = (Ω, Π; {pµ(·) : µ ∈ M}; F),
• a seminorm ‖·‖ on Rⁿ,¹¹
• a convex compact set X ⊂ Rⁿ along with a collection of M points xᵢ ∈ Rⁿ, 1 ≤ i ≤ M, and a positive D such that the ‖·‖-diameter of the set X⁺ = X ∪ {xᵢ : 1 ≤ i ≤ M} is at most D: ‖x − x′‖ ≤ D ∀(x, x′ ∈ X⁺),
• an affine mapping x ↦ A(x) from Rⁿ into the embedding space of M such that A(x) ∈ M for all x ∈ X,
• a tolerance ǫ ∈ (0, 1).

We observe a K-element sample ω^K = (ω1, ..., ωK) of observations

ωk ∼ pA(x∗), 1 ≤ k ≤ K,    (2.85)

independent across k, where x∗ ∈ Rⁿ is an unknown signal known to belong to X. Our "ideal goal" is to use ω^K in order to identify, with probability ≥ 1 − ǫ, the ‖·‖-closest to x∗ point among the points x1, ..., xM. The goal just outlined may be too ambitious, and in the sequel we focus on the relaxed goal as follows:

¹¹A seminorm on Rⁿ is defined by exactly the same requirements as a norm, except that now we allow zero seminorms for some nonzero vectors. Thus, a seminorm on Rⁿ is a nonnegative function ‖·‖ which is homogeneous (‖λx‖ = |λ|‖x‖) and satisfies the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖. A universal example is ‖x‖ = ‖Bx‖ₒ, where ‖·‖ₒ is a norm on some Rᵐ and B is an m × n matrix; whenever this matrix has a nontrivial kernel, ‖·‖ is a seminorm rather than a norm.
Given a positive integer N and a "resolution" θ > 1, consider the grid Γ = {rⱼ = Dθ⁻ʲ, 0 ≤ j ≤ N} and let

ρ(x) = min{ ρⱼ ∈ Γ : ρⱼ ≥ min_{1≤i≤M} ‖x − xᵢ‖ }.
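Computationally, ρ(x) amounts to one scan of the grid Γ (our sketch, our names; the Euclidean norm is used for illustration). Note that ρ(x) is well defined: the distance from x ∈ X to the points xᵢ never exceeds D = r₀, since the ‖·‖-diameter of X⁺ is at most D.

```python
import math

def rho_grid(x, points, D, theta, N):
    """Smallest grid value r_j = D * theta**(-j), 0 <= j <= N, that
    upper-bounds the distance from x to the point set; well defined
    because that distance is <= D = r_0."""
    dist = min(math.dist(x, p) for p in points)
    return min(D * theta ** (-j) for j in range(N + 1) if D * theta ** (-j) >= dist)
```

With a fine grid (θ close to 1, N large) the scan is still cheap, since it is linear in N.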
Given the design parameters α ≥ 1 and β ≥ 0, we want to specify a volume of observations K and an inference routine ω^K ↦ iα,β(ω^K) ∈ {1, ..., M} such that

∀(x∗ ∈ X): Prob{ ‖x∗ − x_{iα,β(ω^K)}‖ ≤ αρ(x∗) + β } ≥ 1 − ǫ.    (2.86)
Note that when passing from the "ideal" to the relaxed goal, the simplification is twofold: first, instead of the precise distance minᵢ ‖x∗ − xᵢ‖ from x∗ to {x1, ..., xM}, we look at the best upper bound ρ(x∗) on this distance taken from the grid Γ; second, we allow the factor α and the additive term β in mimicking the (discretized) distance ρ(x∗) by ‖x∗ − x_{iα,β(ω^K)}‖. The problem we have posed is quite popular in Statistics and originates from the estimate aggregation problem [185, 229, 101] as follows: let the xᵢ be candidate estimates of x∗ yielded by a number of a priori "models" of x∗ and perhaps some preliminary noisy observations of x∗. Given the xᵢ and a matrix B, we want to select among the vectors Bxᵢ the (nearly) best approximation of Bx∗ w.r.t. a given norm ‖·‖ₒ, utilizing additional observations ω^K of the signal. To bring this problem into our framework, it suffices to specify the seminorm as ‖x‖ = ‖Bx‖ₒ. We shall see in the meantime that in the context of this problem, the "discretization of distances" is, for all practical purposes, irrelevant: the dependence of the volume of observations on N is just logarithmic, so that we can easily handle a fine grid, like the one with θ = 1.001 and θ⁻ᴺ = 10⁻¹⁰. As for the factor α and the additive term β, they indeed could be "expensive in terms of applications," but the "nearly ideal" goal of making α close to 1 and β close to 0 may be unattainable.

2.5.3.1 The construction
Let us associate with i ≤ M and j, 0 ≤ j ≤ N, the hypothesis H_{ij} stating that the observations ω_k, independent across k (see (2.85)), stem from x_* ∈ X_{ij} := {x ∈ X : ‖x − x_i‖ ≤ r_j}. Note that the sets X_{ij} are convex and compact. We denote by J the set of all pairs (i, j) for which i ∈ {1, ..., M}, j ∈ {0, 1, ..., N}, and X_{ij} ≠ ∅. Further, we define the closeness C_{α,β} on the set of hypotheses H_{ij}, (i, j) ∈ J, as follows:

(ij, i′j′) ∈ C_{α,β} if and only if ‖x_i − x_{i′}‖ ≤ ᾱ(r_j + r_{j′}) + β, where ᾱ = (α − 1)/2    (2.87)
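In code, the closeness test (2.87) is a one-liner; the sketch below uses hypothetical containers `centers` (the points x_i) and `radii` (the grid radii r_j), and a caller-supplied seminorm:

```python
import numpy as np

def is_close(i, j, i2, j2, centers, radii, alpha, beta, norm=np.linalg.norm):
    """(ij, i'j') in C_{alpha,beta} per (2.87):
    ||x_i - x_{i'}|| <= abar * (r_j + r_{j'}) + beta, with abar = (alpha - 1) / 2."""
    abar = (alpha - 1.0) / 2.0
    return norm(centers[i] - centers[i2]) <= abar * (radii[j] + radii[j2]) + beta
```

Note that the relation is symmetric in the two pairs, as the definition requires.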
(here and in what follows, kℓ denotes the ordered pair (k, ℓ)). Applying Theorem 2.23, we can build, in a computation-friendly fashion, the system φ_{ij,i′j′}(ω), ij, i′j′ ∈ J, of optimal balanced detectors for the hypotheses H_{ij}, along
99
HYPOTHESIS TESTING
with the risks of these detectors, so that

φ_{ij,i′j′}(ω) ≡ −φ_{i′j′,ij}(ω)  ∀(ij, i′j′ ∈ J),
∫_Ω e^{−φ_{ij,i′j′}(ω)} p_{A(x)}(ω) Π(dω) ≤ ε_{ij,i′j′}  ∀(ij ∈ J, i′j′ ∈ J, x ∈ X_{ij}).
Let us say that a pair (α, β) is admissible if α ≥ 1, β ≥ 0, and

∀((i, j) ∈ J, (i′, j′) ∈ J, (ij, i′j′) ∉ C_{α,β}):  A(X_{ij}) ∩ A(X_{i′j′}) = ∅.

Note that checking admissibility of a given pair (α, β) is a computationally tractable task. Given an admissible pair (α, β), we associate with it a positive integer K = K(α, β) and an inference ω^K ↦ i_{α,β}(ω^K) as follows:

1. K = K(α, β) is the smallest integer such that the detector-based test T^K_{C_{α,β}} yielded by the machinery of Section 2.5.2.3 decides on the hypotheses H_{ij}, ij ∈ J, with C_{α,β}-risk not exceeding ε. Note that by admissibility, ε_{ij,i′j′} < 1 whenever (ij, i′j′) ∉ C_{α,β}, so that K(α, β) is well defined.

2. Given an observation ω^K, K = K(α, β), we define i_{α,β}(ω^K) as follows:

a) We apply to ω^K the test T^K_{C_{α,β}}. If the test accepts no hypothesis (case A), i_{α,β}(ω^K) is undefined. The observations ω^K resulting in case A comprise some set, which we denote by B; given ω^K, we can recognize whether or not ω^K ∈ B.

b) When ω^K ∉ B, the test T^K_{C_{α,β}} accepts some of the hypotheses H_{ij}; let the set of their indices ij be J(ω^K). We select from the pairs ij ∈ J(ω^K) the one with the largest j, and set i_{α,β}(ω^K) to be the first component, and j_{α,β}(ω^K) to be the second component, of the selected pair.

We have the following:

Proposition 2.33. Assuming (α, β) admissible, for the inference ω^K ↦ i_{α,β}(ω^K) just defined and for every x_* ∈ X, denoting by P^K_{x_*} the distribution of the stationary K-repeated observation ω^K stemming from x_*, one has

‖x_* − x_{i_{α,β}(ω^K)}‖ ≤ αρ(x_*) + β
(2.88)
with P^K_{x_*}-probability at least 1 − ε.

Proof. Let us fix x_* ∈ X, and let j_* = j_*(x_*) be the largest j ≤ N such that

r_j ≥ min_{i≤M} ‖x_* − x_i‖;

note that j_* is well defined due to r_0 = D ≥ ‖x_* − x_1‖, and that r_{j_*} = ρ(x_*). We specify i_* = i_*(x_*) ≤ M in such a way that

‖x_* − x_{i_*}‖ ≤ r_{j_*}.
(2.89)
Note that i_* is well defined and that observations (2.85) stemming from x_* obey the hypothesis H_{i_*j_*}. Let E be the set of those ω^K for which the predicate
100
CHAPTER 2
P: As applied to the observation ω^K, the test T^K_{C_{α,β}} accepts H_{i_*j_*}, and all hypotheses accepted by the test are C_{α,β}-close to H_{i_*j_*}

holds true. Taking into account that the C_{α,β}-risk of T^K_{C_{α,β}} does not exceed ε and that the hypothesis H_{i_*j_*} is true, the P^K_{x_*}-probability of the event E is at least 1 − ε. Let the observation ω^K satisfy

ω^K ∈ E.    (2.90)

Then

1. The test T^K_{C_{α,β}} accepts the hypothesis H_{i_*j_*}, that is, ω^K ∉ B. By construction of i_{α,β}(ω^K), j_{α,β}(ω^K) (see rule 2b above) and due to the fact that T^K_{C_{α,β}} accepts H_{i_*j_*}, we have j_{α,β}(ω^K) ≥ j_*.

2. The hypothesis H_{i_{α,β}(ω^K) j_{α,β}(ω^K)} is C_{α,β}-close to H_{i_*j_*}, so that

‖x_{i_*} − x_{i_{α,β}(ω^K)}‖ ≤ ᾱ(r_{j_*} + r_{j_{α,β}(ω^K)}) + β ≤ 2ᾱ r_{j_*} + β = 2ᾱ ρ(x_*) + β,

where the concluding inequality is due to the fact that, as we have already seen, j_{α,β}(ω^K) ≥ j_* when (2.90) takes place.

Invoking (2.89), we conclude that with P^K_{x_*}-probability at least 1 − ε it holds that

‖x_* − x_{i_{α,β}(ω^K)}‖ ≤ (2ᾱ + 1)ρ(x_*) + β = αρ(x_*) + β.    ✷

2.5.3.2 A modification
From the computational viewpoint, an obvious shortcoming of the construction presented in the previous section is the necessity to operate with M(N + 1) hypotheses, which might require computing as many as O(M²N²) detectors. We are about to present a modified construction, where we deal at most N + 1 times with just M hypotheses at a time (i.e., with a total of at most O(M²N) detectors). The idea is to replace simultaneous processing of all hypotheses H_{ij}, ij ∈ J, with processing them in stages j = 0, 1, ..., with stage j operating only with the hypotheses H_{ij}, i = 1, ..., M. The implementation of this idea is as follows. In the situation of Section 2.5.3, given the same entities Γ, (α, β), H_{ij}, X_{ij}, ij ∈ J, as at the beginning of Section 2.5.3.1, and specifying the closeness C_{α,β} according to (2.87), we now act as follows.

Preprocessing. For j = 0, 1, ..., N:

1. We identify the set I_j = {i ≤ M : X_{ij} ≠ ∅} and stop if this set is empty. If this set is nonempty,

2. we specify the closeness C^j_{α,β} on the set of hypotheses H_{ij}, i ∈ I_j, as a "slice" of the closeness C_{α,β}: H_{ij} and H_{i′j} (equivalently, i and i′) are C^j_{α,β}-close to each other if (ij, i′j) are C_{α,β}-close, that is,

‖x_i − x_{i′}‖ ≤ 2ᾱ r_j + β,  ᾱ = (α − 1)/2.
3. We build the optimal detectors φ_{ij,i′j}, along with their risks ε_{ij,i′j}, for all i, i′ ∈ I_j such that (i, i′) ∉ C^j_{α,β}. If ε_{ij,i′j} = 1 for a pair i, i′ of the latter type, that is,
A(X_{ij}) ∩ A(X_{i′j}) ≠ ∅, we claim that (α, β) is inadmissible and stop. Otherwise we find the smallest K = K_j such that the spectral norm of the symmetric M × M matrix E^{jK} with the entries

E^{jK}_{ii′} = ε^K_{ij,i′j} when i ∈ I_j, i′ ∈ I_j, (i, i′) ∉ C^j_{α,β}, and E^{jK}_{ii′} = 0 otherwise,

does not exceed ε̄ = ε/(N + 1). We then use the machinery of Section 2.5.2.3 to build the detector-based test T^{K_j}_{C^j_{α,β}}, which decides on the hypotheses H_{ij}, i ∈ I_j, with C^j_{α,β}-risk not exceeding ε̄.
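The search for K_j is a simple scan; in the sketch below, `eps_single` is a hypothetical matrix holding the single-observation risks ε_{ij,i′j} for the non-close pairs and zeros elsewhere (the risks of K-repeated detectors in a simple o.s. being the K-th powers of the single-observation risks), and the linear search is illustrative only:

```python
import numpy as np

def smallest_K_j(eps_single, eps_bar, K_max=1_000_000):
    """Smallest K such that the spectral norm of the matrix with entries
    eps_single**K (elementwise; zero entries stay zero) does not exceed eps_bar."""
    for K in range(1, K_max + 1):
        if np.linalg.norm(eps_single ** K, ord=2) <= eps_bar:
            return K
    raise ValueError("no suitable K up to K_max")
```

For a single non-close pair with single-observation risk 0.5 and ε̄ = 0.1, the smallest K is 4, since 0.5³ = 0.125 > 0.1 while 0.5⁴ = 0.0625 ≤ 0.1.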
It may happen that the outlined process stops when processing some value j̄ of j; if this does not happen, we set j̄ = N + 1. Now, if the process does stop, and stops with the claim that (α, β) is inadmissible, we call (α, β) inadmissible and terminate; in this case we fail to produce a desired inference. Note that if this is the case, (α, β) is inadmissible in the sense of Section 2.5.3.1 as well. When we do not stop with the inadmissibility claim, we call (α, β) admissible, and in this case we do produce an inference, specifically, as follows.

Processing observations:

1. We set ĵ = j̄ − 1, J̄ = {0, 1, ..., ĵ}, and K = K(α, β) = max_{0≤j≤ĵ} K_j. Note that J̄ is nonempty due to j̄ > 0.^{12}
2. Let ω^K = (ω_1, ..., ω_K), with components independent across k, stem from an unknown signal x_* ∈ X according to (2.85). We put Î_{−1}(ω^K) = {1, ..., M} = I_0.

a) For j = 0, 1, ..., ĵ, we act as follows. When processing j, we have at our disposal subsets Î_k(ω^K) ⊂ {1, ..., M}, −1 ≤ k < j. To build the set Î_j(ω^K),

i. we apply the test T^{K_j}_{C^j_{α,β}} to the initial K_j components of the observation ω^K. Let I^+_j(ω^K) be the set of indices i ∈ I_j of the hypotheses H_{ij} accepted by the test;

ii. it may happen that I^+_j(ω^K) = ∅; if so, we terminate;

iii. if I^+_j(ω^K) is nonempty, we look, one by one, at the indices i ∈ I^+_j(ω^K) and call the index i good if i ∈ Î_ℓ(ω^K) for every ℓ ∈ {−1, 0, ..., j − 1};

iv. we define Î_j(ω^K) as the set of good indices of I^+_j(ω^K) if this set is not empty, and proceed to the next value of j (if j < ĵ) or terminate (if j = ĵ). We terminate if there are no good indices in I^+_j(ω^K).

b) Upon termination, we have at our disposal the collection Î_j(ω^K), 0 ≤ j ≤ j̃(ω^K), of all sets Î_j(ω^K) we have built (this collection can be empty, which we encode by setting j̃(ω^K) = −1). When j̃(ω^K) = −1, our inference remains undefined. Otherwise we select from the set Î_{j̃(ω^K)}(ω^K) an index i_{α,β}(ω^K), say, the smallest one, and claim that the point x_{i_{α,β}(ω^K)} is the point among

^{12} All the sets X_{i0} contain X and thus are nonempty, so that I_0 = {1, ..., M} ≠ ∅, and thus we cannot stop at step j = 0 due to I_0 = ∅; the other possibility to stop at step j = 0 is ruled out by the fact that we are in the case where (α, β) is admissible.
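The stage-by-stage selection just described admits a compact sketch; here `stages` is a hypothetical list of pairs (K_j, test_j), with test_j(obs) standing in for the test T^{K_j}_{C^j_{α,β}} and returning the accepted index set I^+_j:

```python
def staged_inference(stages, omega):
    """Stage-by-stage selection: hat_I_j = good indices of I_j^+, where an index
    is good if it survived all earlier stages; returns the selected index
    i_{alpha,beta}(omega^K), or None when the inference is undefined."""
    survivors = None     # intersection of all the sets hat_I built so far
    built = []           # the collection of the sets hat_I_j
    for K_j, test_j in stages:
        accepted = test_j(omega[:K_j])                 # I_j^+
        good = accepted if survivors is None else accepted & survivors
        if not good:
            break                                      # terminate
        built.append(good)
        survivors = good
    return min(built[-1]) if built else None           # e.g., the smallest index
```

Intersecting with the previous stage alone suffices because the sets Î_j are nested by construction.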
x_1, ..., x_M "nearly closest" to x_*. We have the following analog of Proposition 2.33:

Proposition 2.34. Assuming (α, β) admissible, for the inference ω^K ↦ i_{α,β}(ω^K) just defined and for every x_* ∈ X, denoting by P^K_{x_*} the distribution of the stationary K-repeated observation ω^K stemming from x_*, one has

P^K_{x_*}{ω^K : i_{α,β}(ω^K) is well defined and ‖x_* − x_{i_{α,β}(ω^K)}‖ ≤ αρ(x_*) + β} ≥ 1 − ε.

Proof. Let us fix the signal x_* ∈ X underlying the observations ω^K. As in the proof of Proposition 2.33, let j_* be such that ρ(x_*) = r_{j_*}, and let i_* ≤ M be such that x_* ∈ X_{i_*j_*}. Clearly, i_* and j_* are well defined, and the hypotheses H_{i_*j}, 0 ≤ j ≤ j_*, are true. In particular, X_{i_*j} ≠ ∅ when j ≤ j_*, implying that i_* ∈ I_j, 0 ≤ j ≤ j_*, whence also ĵ ≥ j_*. For 0 ≤ j ≤ j_*, let E_j be the set of all realizations of ω^K such that

i_* ∈ I^+_j(ω^K)  and  (i_*, i) ∈ C^j_{α,β} for all i ∈ I^+_j(ω^K).
Since the C^j_{α,β}-risk of the test T^{K_j}_{C^j_{α,β}} is ≤ ε̄, we conclude that the P^K_{x_*}-probability of E_j is at least 1 − ε̄, whence the P^K_{x_*}-probability of the event

E = ∩_{j=0}^{j_*} E_j

is at least 1 − (N + 1)ε̄ = 1 − ε. Now let ω^K ∈ E. Then,

• By the definition of E_j, when j ≤ j_*, we have i_* ∈ I^+_j(ω^K), whence, by evident induction in j, i_* ∈ Î_j(ω^K) for all j ≤ j_*.

• We conclude from the above that j̃(ω^K) ≥ j_*. In particular, i := i_{α,β}(ω^K) is well defined and turned out to be good at step j̃ ≥ j_*, implying that i ∈ Î_{j_*}(ω^K) ⊂ I^+_{j_*}(ω^K).
Thus, i ∈ I^+_{j_*}(ω^K), which combines with the definition of E_{j_*} to imply that i and i_* are C^{j_*}_{α,β}-close to each other, whence

‖x_{i_{α,β}(ω^K)} − x_{i_*}‖ ≤ 2ᾱ r_{j_*} + β = 2ᾱ ρ(x_*) + β,

resulting in the desired relation

‖x_{i_{α,β}(ω^K)} − x_*‖ ≤ 2ᾱ ρ(x_*) + β + ‖x_{i_*} − x_*‖ ≤ [2ᾱ + 1]ρ(x_*) + β = αρ(x_*) + β.    ✷

2.5.3.3 "Near-optimality"
We augment the above constructions with the following

Proposition 2.35. Let, for some positive integer K̄, ε ∈ (0, 1/2), and a pair (a, b) ≥ 0, there exist an inference ω^{K̄} ↦ i(ω^{K̄}) ∈ {1, ..., M} such that whenever x_* ∈ X, we have

Prob_{ω^{K̄} ∼ P^{K̄}_{x_*}} {‖x_* − x_{i(ω^{K̄})}‖ ≤ aρ(x_*) + b} ≥ 1 − ε.

Then the pair (α = 2a + 3, β = 2b) is admissible in the sense of Section 2.5.3.1 (and thus in the sense of Section 2.5.3.2), and for the constructions in Sections 2.5.3.1 and 2.5.3.2 one has

K(α, β) ≤ Ceil( 2 · [1 + ln(M(N+1))/ln(1/ε)] / [1 − ln(4(1−ε))/ln(1/ε)] · K̄ ).    (2.91)

Proof. Consider the situation of Section 2.5.3.1 (the situation of Section 2.5.3.2 can be processed in a completely similar way). Observe that with α, β as above, there exists a simple test deciding on a pair of hypotheses H_{ij}, H_{i′j′} which are not C_{α,β}-close to each other via the stationary K̄-repeated observation ω^{K̄} with risk ≤ ε. Indeed, the desired test T is as follows: given ij ∈ J, i′j′ ∈ J, and an observation ω^{K̄}, we compute i(ω^{K̄}) and accept H_{ij} if and only if ‖x_{i(ω^{K̄})} − x_i‖ ≤ (a + 1)r_j + b, and accept H_{i′j′} otherwise. Let us check that the risk of this test indeed is at most ε. Assume, first, that H_{ij} takes place. The P^{K̄}_{x_*}-probability of the event

E: ‖x_{i(ω^{K̄})} − x_*‖ ≤ aρ(x_*) + b

is at least 1 − ε due to the origin of i(·), and ‖x_i − x_*‖ ≤ r_j since H_{ij} takes place, implying that ρ(x_*) ≤ r_j by the definition of ρ(·). Thus, in the case of E it holds that

‖x_{i(ω^{K̄})} − x_i‖ ≤ ‖x_{i(ω^{K̄})} − x_*‖ + ‖x_i − x_*‖ ≤ aρ(x_*) + b + r_j ≤ (a + 1)r_j + b.

We conclude that if H_{ij} is true and ω^{K̄} ∈ E, then the test T accepts H_{ij}, and thus the P^{K̄}_{x_*}-probability for the simple test T not to accept H_{ij} when the hypothesis takes place is ≤ ε.

Now let H_{i′j′} take place, and let E be the same event as above. When ω^{K̄} ∈ E, which happens with P^{K̄}_{x_*}-probability at least 1 − ε, for the same reasons as above we have ‖x_{i(ω^{K̄})} − x_{i′}‖ ≤ (a + 1)r_{j′} + b. It follows that when H_{i′j′} takes place and ω^{K̄} ∈ E, we have ‖x_{i(ω^{K̄})} − x_i‖ > (a + 1)r_j + b, since otherwise we would have

‖x_i − x_{i′}‖ ≤ ‖x_{i(ω^{K̄})} − x_i‖ + ‖x_{i(ω^{K̄})} − x_{i′}‖ ≤ (a + 1)r_j + b + (a + 1)r_{j′} + b = (a + 1)(r_j + r_{j′}) + 2b = ((α − 1)/2)(r_j + r_{j′}) + β,

which contradicts the fact that ij and i′j′ are not C_{α,β}-close. Thus, whenever H_{i′j′} holds true and E takes place, we have ‖x_{i(ω^{K̄})} − x_i‖ > (a + 1)r_j + b, implying by the definition of T that T accepts H_{i′j′}. Thus, the P^{K̄}_{x_*}-probability not to accept H_{i′j′} when the hypothesis is true is at most ε. From the fact that whenever (ij, i′j′) ∉ C_{α,β}, the hypotheses H_{ij}, H_{i′j′} can be decided upon, via K̄ observations, with risk ≤ ε < 0.5, it follows that for the ij, i′j′ in question the sets A(X_{ij}) and A(X_{i′j′}) do not intersect, so that (α, β) is an admissible pair.

As in the proof of Proposition 2.31, by basic properties of simple observation schemes, the fact that the hypotheses H_{ij}, H_{i′j′} with (ij, i′j′) ∉ C_{α,β} can be decided upon via the K̄-repeated observations (2.85) with risk ≤ ε < 1/2 implies that ε_{ij,i′j′} ≤ [2√(ε(1−ε))]^{1/K̄}, whence, again by basic results on simple observation schemes (look
once again at the proof of Proposition 2.31), the C_{α,β}-risk of the K-observation detector-based test T^K deciding on the hypotheses H_{ij}, ij ∈ J, up to closeness C_{α,β} does not exceed Card(J)·[2√(ε(1−ε))]^{K/K̄} ≤ M(N+1)·[2√(ε(1−ε))]^{K/K̄}, and (2.91) follows.    ✷

Comment. Proposition 2.35 says that in our problem, the "statistical toll" for quite large values of N and M is quite moderate: with ε = 0.01, resolution θ = 1.001 (which for all practical purposes is the same as no discretization of distances at all), D/r_N as large as 10^{10}, and M as large as 10,000, (2.91) reads K = Ceil(10.7 K̄), which is not a disaster! The actual statistical toll of our construction is in replacing the "existing in nature" a and b with α = 2a + 3 and β = 2b. And of course there is a huge computational toll for large M and N: we need to operate with a large (albeit polynomial in M, N) number of hypotheses and detectors.

2.5.3.4 Numerical illustration
As an illustration of the approach presented in this section, consider the following (toy) problem: A signal x_* ∈ R^n (one may think of x_* as the restriction to the equidistant n-point grid in [0, 1] of a function of continuous argument t ∈ [0, 1]) is observed according to

ω = Ax_* + ξ,  ξ ∼ N(0, σ²I_n),    (2.92)

where A is a "discretized integration":

(Ax)_s = (1/n) Σ_{j=1}^{s} x_j,  s = 1, ..., n.

We want to approximate x_* in the discrete version of the L_1 norm,

‖y‖ = (1/n) Σ_{s=1}^{n} |y_s|,  y ∈ R^n,

by a low-order polynomial. In order to build the approximation, we use a single observation ω as in (2.92) to build five candidate estimates x_i, i = 1, ..., 5, of x_*. Specifically, x_i is the least squares polynomial approximation of x_* of degree ≤ i − 1:

x_i = argmin_{y ∈ P_{i−1}} ‖Ay − ω‖²_2,
where P_κ is the linear space of algebraic polynomials, of degree ≤ κ, of the discrete argument s varying in {1, 2, ..., n}. After the candidate estimates are built, we use K additional observations (2.92) "to select the model," that is, to select among our estimates the ‖·‖-closest to x_*. In the experiment reported below we use n = 128 and σ = 0.01. The true signal x_* is the discretization of a piecewise linear function of continuous argument t ∈ [0, 1], with slope 2 to the left of t = 0.5 and slope −2 to the right of t = 0.5; at t = 0.5, the function has a jump. The a priori information on the true signal is that
Figure 2.5: Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.
it belongs to the box {x ∈ R^n : ‖x‖_∞ ≤ 1}. The signal and the sample polynomial approximations x_i of x_*, 1 ≤ i ≤ 5, are presented in the top plot in Figure 2.5; their actual ‖·‖-distances to x_* are as follows:

i              1      2      3      4      5
‖x_i − x_*‖    0.534  0.354  0.233  0.161  0.172
Setting ε = 0.01, N = 22, θ = 2^{1/4}, α = 3, and β = 0.05 resulted in K = 3. In a series of 1,000 simulations of the resulting inference, all 1,000 results correctly identified the candidate estimate x_4 ‖·‖-closest to x_*, in spite of the factor α = 3 in (2.88). Surprisingly, the same holds true when we use the resulting inference with the reduced values of K, namely, K = 1 and K = 2, although the theoretical guarantees deteriorate: with K = 1 and K = 2, the theory guarantees the validity of (2.88) with probabilities 0.77 and 0.97, respectively.
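The model-building part of this experiment is easy to reproduce. The sketch below follows (2.92) with a stand-in piecewise linear signal; the exact signal shape and the random seed are our assumptions, so the resulting distances will only roughly match the table above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 128, 0.01

# stand-in signal: slope 2 on [0, 0.5), slope -2 plus a jump at t = 0.5
t = (np.arange(1, n + 1) - 0.5) / n
x_star = np.where(t < 0.5, 2 * t, -2 * (t - 0.5))

# discretized integration (2.92): (Ax)_s = (1/n) * sum_{j <= s} x_j
A = np.tril(np.ones((n, n))) / n
omega = A @ x_star + sigma * rng.standard_normal(n)

# candidate estimates: least-squares fits by polynomials of degree <= i - 1
V = np.vander(np.arange(1, n + 1, dtype=float), 5, increasing=True)  # basis of P_4
estimates = [V[:, :i] @ np.linalg.lstsq(A @ V[:, :i], omega, rcond=None)[0]
             for i in range(1, 6)]

# discrete L1 distances (1/n) * sum_s |.| of the candidates to the true signal
dists = [float(np.abs(xi - x_star).mean()) for xi in estimates]
best = int(np.argmin(dists))   # 0-based index of the ||.||-closest candidate
```

The ‖·‖-closest candidate is then confirmed (or not) by the detector-based selection routine applied to the additional observations.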
2.6 SEQUENTIAL HYPOTHESIS TESTING

2.6.1 Motivation: Election polls
Let us consider the following "practical" question. One of L candidates for an office is about to be selected by a population-wide majority vote. Every member of the population votes for exactly one candidate. How do we predict the winner via an opinion poll? A (naive) model of the situation could be as follows. Let us represent the preference of a particular voter by his preference vector: a basic orth e in R^L, with a unit entry in position ℓ meaning that the voter is about to vote for the ℓ-th candidate. The
entries µ_ℓ of the average µ, over the population, of these vectors are the fractions of votes in favor of the ℓ-th candidate, and the elected candidate is the one "indexing" the largest of the µ_ℓ's. Now assume that we select at random, from the uniform distribution, a member of the population and observe his preference vector. Our observation ω is a realization of a discrete random variable taking values in the set Ω = {e_1, ..., e_L} of basic orths in R^L, and µ is the distribution of ω (technically, the density of this distribution w.r.t. the counting measure Π on Ω). Selecting a small threshold δ and assuming that the true (unknown to us) µ is such that the largest entry in µ is at least by δ larger than every other entry, and that µ_ℓ ≥ 1/N for all ℓ, N being the population size,^{13} we can model the population preference for the ℓ-th candidate with

µ ∈ M_ℓ = {µ ∈ R^L : µ_i ≥ 1/N ∀i, Σ_i µ_i = 1, µ_ℓ ≥ µ_i + δ ∀(i ≠ ℓ)} ⊂ M = {µ ∈ R^L : µ > 0, Σ_i µ_i = 1}.
In an (idealized) poll, we select at random a number K of voters and observe their preferences, thus arriving at a sample ω^K = (ω_1, ..., ω_K) of observations drawn, independently of each other, from an unknown distribution µ on Ω, with µ known to belong to ∪_{ℓ=1}^{L} M_ℓ. Therefore, to predict the winner is the same as to decide on L convex hypotheses H_1, ..., H_L in the Discrete o.s., with H_ℓ stating that ω_1, ..., ω_K are drawn, independently of each other, from a distribution µ ∈ M_ℓ. What we end up with is the problem of deciding on L convex hypotheses in the Discrete o.s. with an L-element Ω via stationary K-repeated observations.
Illustration. Consider two-candidate elections; now the goal of a poll is, given K realizations ω_1, ..., ω_K, independent of each other, of a random variable ω taking value χ ∈ {1, 2} with probability µ_χ, µ_1 + µ_2 = 1, to decide which is larger, µ_1 or µ_2. As explained above, we select somehow a threshold δ and impose on the unknown µ the a priori assumption that the gap between the largest and the next largest (in our case, just the smallest) entry of µ is at least δ, thus arriving at two hypotheses,

H_1: µ_1 ≥ µ_2 + δ,  H_2: µ_2 ≥ µ_1 + δ,

which is the same as

H_1: µ ∈ M_1 = {µ : µ_1 ≥ (1+δ)/2, µ_2 ≥ 0, µ_1 + µ_2 = 1},
H_2: µ ∈ M_2 = {µ : µ_2 ≥ (1+δ)/2, µ_1 ≥ 0, µ_1 + µ_2 = 1}.
We now want to decide on these two hypotheses via a stationary K-repeated observation. We are in the case of a simple (specifically, Discrete) o.s.; the optimal detector as given by Theorem 2.23 stems from the optimal solution (µ^*, ν^*) to the convex optimization problem

ε_⋆ = max_{µ ∈ M_1, ν ∈ M_2} [√(µ_1ν_1) + √(µ_2ν_2)];    (2.93)

the optimal balanced single-observation detector is

φ_*(ω) = f_*^T ω,  f_* = (1/2)[ln(µ^*_1/ν^*_1); ln(µ^*_2/ν^*_2)]

^{13} With the size N of the population in the range of tens of thousands and δ as 1/N, both these assumptions seem to be quite realistic.
(recall that we encoded the observations ω_k by basic orths from R²), the risk of this detector being ε_⋆. In other words,

µ^* = [(1+δ)/2; (1−δ)/2],  ν^* = [(1−δ)/2; (1+δ)/2],  ε_⋆ = √(1 − δ²),
f_* = (1/2)[ln((1+δ)/(1−δ)); ln((1−δ)/(1+δ))].

The optimal balanced K-observation detector and its risk are

φ^{(K)}_*(ω^K = (ω_1, ..., ω_K)) = f_*^T(ω_1 + ... + ω_K),  ε^{(K)}_⋆ = (1 − δ²)^{K/2}.
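Given the closed-form risk ε⋆^{(K)} = (1 − δ²)^{K/2}, the minimal poll size guaranteeing risk ε, that is, the smallest K with (1 − δ²)^{K/2} ≤ ε, is immediate to compute (a sketch; the function name is ours):

```python
import math

def poll_size(delta, eps):
    """Smallest K with (1 - delta**2)**(K/2) <= eps: take logarithms and
    round up, K = ceil(2 * ln(1/eps) / ln(1/(1 - delta**2)))."""
    return math.ceil(2 * math.log(1 / eps) / math.log(1 / (1 - delta ** 2)))
```

For instance, poll_size(0.1, 0.01) evaluates to 917, matching the corresponding two-candidate entry of Table 2.1.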
The near-optimal K-observation test T^K_{φ_*} accepts H_1 and rejects H_2 if φ^{(K)}_*(ω^K) ≥ 0; otherwise it accepts H_2 and rejects H_1. Both risks of this test do not exceed ε^{(K)}_⋆. Given a risk level ε, we can identify the minimal "poll size" K for which the risks Risk_1, Risk_2 of the test T^K_{φ_*} do not exceed ε. This poll size depends on ε and on our a priori "hypotheses separation" parameter δ: K = K_ε(δ). Some impression of this size can be obtained from Table 2.1, where, as in all subsequent "election illustrations," ε is set to 0.01. We see that while poll sizes for "landslide" elections are surprisingly low, reliable prediction of the results of "close run" elections requires surprisingly large polls. Note that this phenomenon reflects reality (to the extent to which the reality is captured by our model).^{14} Indeed, from Proposition 2.25 we know that our poll size is within an explicit factor, depending solely on ε, of the "ideal" poll sizes, the smallest ones which allow us to decide upon H_1, H_2 with risk ≤ ε. For ε = 0.01, this factor is about 2.85, meaning that when δ = 0.01, the ideal poll size is larger than 32,000. In fact, we can easily construct more accurate "numerical" lower bounds on the sizes of ideal polls, specifically, as follows. When computing the optimal detector φ_*, we get, as a byproduct, two distributions µ^*, ν^* obeying H_1, H_2, respectively. Denoting by µ^*_K and ν^*_K the distributions of K-element i.i.d. samples drawn from µ^* and ν^*, the risk of deciding on the two simple hypotheses on the distribution of ω^K, stating that this distribution is µ^*_K and ν^*_K, respectively, can be only smaller than the risk of deciding on H_1, H_2 via K-repeated stationary observations.
On the other hand, the former risk can be lower-bounded by one half of the total risk of deciding on our two simple hypotheses, and the latter risk admits a sharp lower bound given by Proposition 2.2, namely,

Σ_{i_1,...,i_K ∈ {1,2}} min[ Π_ℓ µ^*_{i_ℓ}, Π_ℓ ν^*_{i_ℓ} ] = E_{(i_1,...,i_K)} { min[ Π_ℓ (2µ^*_{i_ℓ}), Π_ℓ (2ν^*_{i_ℓ}) ] },

with the expectation taken w.r.t. independent tuples of K integers taking values 1 and 2 with probabilities 1/2.

^{14} In actual opinion polls, additional information is used. For instance, in reality voters can be split into groups according to their age, sex, education, income, etc., with variability of preferences within a group essentially lower than across the entire population. When planning a poll, respondents are selected at random within these groups, with a prearranged number of selections in every group, and their preferences are properly weighted, yielding more accurate predictions as compared to the case when the respondents are selected from the uniform distribution. In other words, in actual polls nontrivial a priori information on the "true" distribution of preferences is used, something we do not have in our naive model.

δ                    0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
K_{0.01}(δ), L = 2       25      88     287     917    2908    9206   29118   92098
K_{0.01}(δ), L = 5       32     114     373    1193    3784   11977   37885  119745

Table 2.1: Sample of values of the poll size K_{0.01}(δ) as a function of δ for 2-candidate (L = 2) and 5-candidate (L = 5) elections. Values of δ form a geometric progression with ratio 10^{−1/4}.

Of course, when K is in the range of a few tens and more, we cannot compute the 2^K-term sum above exactly; however, we can use Monte Carlo simulation in order to estimate the sum reliably with moderate accuracy, like 0.005, and use this estimate to lower-bound the value of K for which an "ideal" K-observation test decides on H_1, H_2 with risks ≤ 0.01. Here are the resulting lower bounds (along with the upper bounds from Table 2.1):
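The Monte Carlo estimation just described can be sketched as follows (names are ours; with µ* = [(1+δ)/2; (1−δ)/2] and ν* its mirror image, the averaged quantity is exactly the right-hand side of the identity above):

```python
import numpy as np

def mc_risk_lower_bound(delta, K, trials=100_000, seed=1):
    """Monte Carlo estimate of E min[prod_l 2*mu*_{i_l}, prod_l 2*nu*_{i_l}]
    over i.i.d. indices i_l uniform on {1, 2}; one half of this quantity
    lower-bounds the risk of any K-observation test deciding on H1, H2."""
    rng = np.random.default_rng(seed)
    p, q = (1 + delta) / 2, (1 - delta) / 2
    idx = rng.integers(0, 2, size=(trials, K))  # 0 encodes i_l = 1, 1 encodes i_l = 2
    log_mu = np.where(idx == 0, np.log(2 * p), np.log(2 * q)).sum(axis=1)
    log_nu = np.where(idx == 0, np.log(2 * q), np.log(2 * p)).sum(axis=1)
    return float(np.exp(np.minimum(log_mu, log_nu)).mean())
```

Working with sums of logarithms keeps the products of K factors numerically stable even for K in the tens of thousands.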
δ          0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
K_lower        14      51     166     534    1699    5379   17023   53820
K_upper        25      88     287     917    2908    9206   29122   92064

Lower (K_lower) and upper (K_upper) bounds on the "ideal" poll sizes.

We see that the poll sizes as yielded by our machinery are within factor 2 of the "ideal" poll sizes.

Clearly, the outlined approach can be extended to L-candidate elections with L ≥ 2. In our model of the corresponding problem we decide, via stationary K-repeated observations drawn from an unknown probability distribution µ on an L-element set, on the L hypotheses

H_ℓ: µ ∈ M_ℓ = {µ ∈ R^L : µ_i ≥ 1/N, i ≤ L, Σ_i µ_i = 1, µ_ℓ ≥ µ_{ℓ′} + δ ∀(ℓ′ ≠ ℓ)},  ℓ ≤ L.    (2.94)
Here δ > 0 is a threshold selected in advance, small enough to believe that the actual preferences of the voters correspond to some µ ∈ ∪_ℓ M_ℓ. Defining the closeness C in the strongest possible way (H_ℓ is close to H_{ℓ′} if and only if ℓ = ℓ′), predicting the outcome of elections with risk ε becomes the problem of deciding upon our multiple hypotheses with C-risk ≤ ε. Thus, we can use the pairwise detectors yielded by Theorem 2.23 to identify the smallest possible K = K_ε such that the test T^K_C from Section 2.5.2.3 is capable of deciding upon our L hypotheses with C-risk ≤ ε. A numerical illustration of the performance of this approach in 5-candidate elections is presented in Table 2.1 (where ε is set to 0.01).

2.6.2 Sequential hypothesis testing
In view of the above analysis, huge poll sizes are necessary when predicting the outcomes of "close run" elections. This, however, does not mean that nothing can be done to build more reasonable opinion polls. The classical related statistical idea, going back to Wald [236], is to pass to sequential tests, where the observations are processed one by one, and at every instant we either accept some of our hypotheses and terminate, or conclude that the observations obtained so far are insufficient to make a reliable inference and pass to the next observation. The idea is that a properly built sequential test, while still ensuring the desired risk, will be able to make "early decisions" when the distribution underlying the observations is "well inside" the true hypothesis and thus is far from the alternatives. Let us show
"
$
#
Figure 2.6: 3candidate hypotheses in probabilistic simplex ∆3 [area [area [area [area [area [area
A] A] B] B] C] C]
M1 M1s M2 M2s M3 M3s
dark dark dark dark dark dark
tetragon + light border strip: candidate A wins with margin ≥ δS tetragon: candidate A wins with margin ≥ δs > δS tetragon + light border strip: candidate B wins with margin ≥ δS tetragon: candidate B wins with margin ≥ δs > δS tetragon + light border strip: candidate C wins with margin ≥ δS tetragon: candidate C wins with margin ≥ δs > δS
Cs closeness: hypotheses in the tuple {Gs2ℓ−1 : µ ∈ Mℓ , Gs2ℓ : µ ∈ Mℓs , 1 ≤ ℓ ≤ 3} are not Cs close to each other if the corresponding M sets belong to different areas and at least one of the sets is painted dark, like M1s and M2 , but not M1 and M2 . how our machinery can be utilized to conceive a sequential test for the problem of predicting the outcome of Lcandidate elections. Thus, our goal is, given a small threshold δ, to decide upon L hypotheses (2.94). Let us act as follows. 1. We select a factor θ ∈ (0, 1), say, θ = 10−1/4 , and consider thresholds δ1 = θ, δ2 = θδ1 , δ3 = θδ2 , and so on, until for the first time we get a threshold ≤ δ; to save notation, we assume that this threshold is exactly δ, and let the number of the thresholds be S. 2. We split somehow (e.g., equally) the risk ǫ which we want to guarantee into S portions ǫs , 1 ≤ s ≤ S, so that ǫs are positive and S X
ǫs = ǫ.
s=1
3. For s ∈ {1, 2, ..., S}, we define, along with the hypotheses H_ℓ, the hypotheses

H_ℓ^s: µ ∈ M_ℓ^s = {µ ∈ M_ℓ : µ_ℓ ≥ µ_{ℓ′} + δ_s ∀(ℓ′ ≠ ℓ)},  ℓ = 1, ..., L

(see Figure 2.6), and introduce 2L hypotheses G^s_{2ℓ−1} = H_ℓ and G^s_{2ℓ} = H_ℓ^s, 1 ≤ ℓ ≤ L. It is convenient to color these hypotheses in L colors, with G^s_{2ℓ−1} = H_ℓ and G^s_{2ℓ} = H_ℓ^s assigned color ℓ. We also define the s-th closeness C_s as follows:

When s < S, hypotheses G^s_i and G^s_j are C_s-close to each other if either they are of the same color, or they are of different colors and both of them have odd indices (that is, one of them is H_ℓ and the other is H_{ℓ′} with ℓ ≠ ℓ′).
When s = S (in this case G^S_{2ℓ−1} = H_ℓ = G^S_{2ℓ}), hypotheses G^S_ℓ and G^S_{ℓ′} are C_S-close to each other if and only if they are of the same color, i.e., both coincide with the same hypothesis H_ℓ.

Observe that G^s_i is a convex hypothesis:

G^s_i: µ ∈ Y^s_i  [Y^s_{2ℓ−1} = M_ℓ, Y^s_{2ℓ} = M_ℓ^s].

The key observation is that when G^s_i and G^s_j are not C_s-close, the sets Y^s_i and Y^s_j are "separated" by at least δ_s, meaning that for some vector e ∈ R^L with just two nonvanishing entries, equal to 1 and −1, we have

min_{µ ∈ Y^s_i} e^T µ ≥ δ_s + max_{µ ∈ Y^s_j} e^T µ.    (2.95)

Indeed, let G^s_i and G^s_j not be C_s-close to each other. This means that the hypotheses are of different colors, say ℓ and ℓ′ ≠ ℓ, and at least one of them has an even index. W.l.o.g. we can assume that the even-indexed hypothesis is G^s_i, so that

Y^s_i ⊂ {µ : µ_ℓ − µ_{ℓ′} ≥ δ_s},

while Y^s_j is contained in the set {µ : µ_{ℓ′} ≥ µ_ℓ}. Specifying e as the vector with just two nonzero entries, the ℓ-th equal to 1 and the ℓ′-th equal to −1, we ensure (2.95).
4. For 1 ≤ s ≤ S, we apply the construction from Section 2.5.2.3 to identify the smallest K = K(s) for which the test T_s yielded by this construction, as applied to a stationary K-repeated observation, allows us to decide on the hypotheses G^s_1, ..., G^s_{2L} with C_s-risk ≤ ε_s. The required K exists due to the already mentioned separation of the members in a pair of not C_s-close hypotheses G^s_i, G^s_j. It is easily seen that K(1) ≤ K(2) ≤ ... ≤ K(S − 1). However, it may happen that K(S − 1) > K(S), the reason being that C_S is defined differently than C_s with s < S. We set

S = {s ≤ S : K(s) ≤ K(S)}.

For example, here is what we get in the L-candidate Opinion Poll problem when S = 8, δ = δ_S = 0.01, and for properly selected ε_s with Σ_{s=1}^{8} ε_s = 0.01:

L   K(1)  K(2)  K(3)  K(4)   K(5)   K(6)   K(7)    K(8)
2    177   617  1829  5099  15704  49699  153299  160118
5    208   723  2175  6204  19205  60781  188203  187718

Here S = 8, δ_s = 10^{−s/4}; S = {1, 2, ..., 8} when L = 2, and S = {1, 2, ..., 6} ∪ {8} when L = 5.
5. Our sequential test T_seq works in attempts (stages) s ∈ S: it tries to make conclusions after observing K(s), s ∈ S, realizations ω_k of ω. At the s-th attempt, we apply the test T_s to the collection ω^{K(s)} of observations obtained so far to decide on the hypotheses G^s_1, ..., G^s_{2L}. If T_s accepts some of these hypotheses and all accepted hypotheses are of the same color, let it be ℓ, the sequential test accepts the hypothesis H_ℓ and terminates; otherwise we continue to observe the realizations of ω (if s < S) or terminate with no hypotheses accepted/rejected (if s = S).

It is easily seen that the risk of the outlined sequential test T_seq does not exceed ε, meaning that whatever be the distribution µ ∈ ∪_{ℓ=1}^{L} M_ℓ underlying the observations
ω_1, ω_2, ..., ω_{K(S)} and ℓ_* such that µ ∈ M_{ℓ_*}, the µ-probability of the event

"T_seq accepts exactly one hypothesis, namely, H_{ℓ_*}"

is at least 1 − ε.
Indeed, observe, first, that the sequential test always accepts at most one of the hypotheses H_1, ..., H_L. Second, let ω_k ∼ µ with µ obeying H_{ℓ_*}. Consider the events E_s, s ∈ S, defined as follows:

• when s < S, E_s is the event "the test T_s as applied to the observation ω^{K(s)} does not accept the true hypothesis G^s_{2ℓ_*−1} = H_{ℓ_*}";

• E_S is the event "as applied to the observation ω^{K(S)}, the test T_S does not accept the true hypothesis G^S_{2ℓ_*−1} = H_{ℓ_*}, or accepts a hypothesis not C_S-close to G^S_{2ℓ_*−1}."
Note that by our selection of the K(s)'s, the µ-probability of E_s does not exceed ε_s, so that the µ-probability that none of the events E_s, s ∈ S, takes place is at least 1 − ε. To justify the above claim on the risk of the sequential test, all we need to verify is that when none of the events E_s, s ∈ S, takes place, the sequential test accepts the true hypothesis H_{ℓ_*}. Verification is immediate: let the observations be such that none of the E_s's takes place. We claim that in this case

(a) The sequential test does accept a hypothesis; if this does not happen at the s-th attempt with some s < S, it definitely happens at the S-th attempt.

Indeed, since E_S does not take place, T_S accepts G^S_{2ℓ_*−1}, and all other hypotheses, if any, accepted by T_S are C_S-close to G^S_{2ℓ_*−1}, implying by the construction of C_S that T_S does accept hypotheses, and all these hypotheses are of the same color. That is, the sequential test at the S-th attempt does accept a hypothesis.
(b) The sequential test does not accept a wrong hypothesis.
Indeed, assume that the sequential test accepts a wrong hypothesis Hℓ′, ℓ′ ≠ ℓ∗, and that this happens at the sth attempt, and let us lead this assumption to a contradiction. Observe that under our assumption the test Ts as applied to observation ω^{K(s)} does accept some hypothesis G^s_i, but does not accept the true hypothesis G^s_{2ℓ∗−1} = Hℓ∗. Indeed, assuming G^s_{2ℓ∗−1} to be accepted, its color, which is ℓ∗, should be the same as the color ℓ′ of G^s_i—we are in the case where the sequential test accepts Hℓ′ at the sth attempt! Since in fact ℓ′ ≠ ℓ∗, the above assumption leads to a contradiction. On the other hand, we are in the case where Es does not take place, that is, Ts does accept the true hypothesis G^s_{2ℓ∗−1}, and we arrive at the desired contradiction.
(a) and (b) provide us with the verification we were looking for.
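The staged accept-only-if-monochromatic logic of Tseq can be sketched in a few lines. Everything here is illustrative, not the book's implementation: each `stages[s]` is a hypothetical stand-in for the test Ts applied to the first K(s) observations, returning the colors of the hypotheses it accepts.

```python
def sequential_test(stages, S):
    """stages[s]() stands in for T_s applied to the first K(s) observations:
    it returns the list of colors of the hypotheses G^s_1, ..., G^s_{2L}
    it accepts (possibly empty)."""
    for s in range(S):
        colors = stages[s]()                  # colors accepted at attempt s
        if colors and len(set(colors)) == 1:  # some accepted, all of one color
            return colors[0]                  # accept H_color and terminate
    return None                               # attempt S reached: no conclusion

# toy run: attempt 1 is inconclusive (two colors accepted), attempt 2 is monochromatic
winner = sequential_test([lambda: [1, 2], lambda: [2, 2]], S=2)
```

Here `winner` comes out as 2; returning `None` corresponds to the (probability at most ǫ) event that all S attempts fail to produce a monochromatic set of accepted hypotheses.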
Discussion and illustration. It can be easily seen that when ǫs = ǫ/S for all s, the worst-case duration K(S) of our sequential test is within a factor logarithmic in SL of the duration of any other test capable of deciding on our L hypotheses with risk ǫ. At the same time, it is easily seen that when the distribution µ of our observation is “deeply inside” some set Mℓ, specifically, µ ∈ M^s_ℓ for some s ∈ S, s < S, then the µ-probability to terminate not later than just after K(s) realizations ωk of ω ∼ µ are observed and to infer correctly what the true hypothesis is, is at least 1 − ǫ. Informally speaking, in the case of “landslide” elections, a reliable prediction of the elections’ outcome will be made after a relatively small number of respondents are interviewed. Indeed, let s ∈ S and ωk ∼ µ ∈ M^s_ℓ, so that µ obeys the hypothesis G^s_{2ℓ}. Consider the s events Et, 1 ≤ t ≤ s, defined as follows:
• For t < s, Et occurs when the sequential test terminates at attempt t by accepting, instead of Hℓ, a wrong hypothesis Hℓ′, ℓ′ ≠ ℓ. Note that Et can take place only when Tt does not accept the true hypothesis G^t_{2ℓ−1} = Hℓ, and the
112
CHAPTER 2
µ-probability of this outcome is ≤ ǫt.
• Es occurs when Ts does not accept the true hypothesis G^s_{2ℓ} or accepts it along with some hypothesis G^s_j, 1 ≤ j ≤ 2L, of color different from ℓ. Note that we are in the situation where the hypothesis G^s_{2ℓ} is true, and, by construction of Cs, all hypotheses Cs-close to G^s_{2ℓ} are of the same color ℓ as G^s_{2ℓ}. Recalling what the Cs-risk is and that the Cs-risk of Ts is ≤ ǫs, we conclude that the µ-probability of Es is at most ǫs.
The bottom line is that the µ-probability of the event ⋃_{t≤s} Et is at most Σ_{t=1}^s ǫt ≤ ǫ; by construction of the sequential test, if the event ⋃_{t≤s} Et does not take place, the test terminates in the course of the first s attempts by accepting the correct hypothesis Hℓ. Our claim is justified.
Numerical illustration. To get an impression of the “power” of sequential hypothesis testing, here are data on the durations of nonsequential and sequential tests with risk ǫ = 0.01 for various values of δ; in the sequential tests, θ = 10^{−1/4} is used. The worst-case data for 2-candidate and 5-candidate elections are as follows (below, “volume” stands for the number of observations used by the test):

  δ      | K, L = 2 | S / K(S), L = 2 | K, L = 5 | S / K(S), L = 5
  0.5623 |       25 |   1 / 25        |       32 |   1 / 32
  0.3162 |       88 |   2 / 152       |      114 |   2 / 179
  0.1778 |      287 |   3 / 499       |      373 |   3 / 585
  0.1000 |      917 |   4 / 1594      |     1193 |   4 / 1870
  0.0562 |     2908 |   5 / 5056      |     3784 |   5 / 5931
  0.0316 |     9206 |   6 / 16005     |    11977 |   6 / 18776
  0.0177 |    29118 |   7 / 50624     |    37885 |   7 / 59391
  0.0100 |    92098 |   8 / 160118    |   119745 |   8 / 187720
Volume K of nonsequential test, number S of stages, and worst-case volume K(S) of sequential test as functions of threshold δ = δS. Risk ǫ is set to 0.01.

As it should be, the worst-case volume of the sequential test is significantly larger than the volume of the nonsequential test.15 This being said, look at what happens in the “average,” rather than the worst, case; specifically, let us look at the empirical distribution of the volume when the distribution µ of observations is selected in the L-dimensional probabilistic simplex ∆L = {µ ∈ R^L : µ ≥ 0, Σ_ℓ µℓ = 1} at random. Here are the empirical statistics of test volume obtained when drawing µ from the uniform distribution on ⋃_{ℓ≤L} Mℓ and running the sequential test16 on observations drawn from the selected µ:

  L | risk   | median | mean  | 60%  | 65%  | 70%  | 75%   | 80%   | 85%   | 90%   | 95%    | 100%
  2 | 0.0010 |    177 |  9182 |  177 |  397 |  617 |   617 |  1223 |  1829 |  8766 |  87911 | 160118
  5 | 0.0040 |   1449 | 18564 | 2175 | 4189 | 6204 | 12704 | 19205 | 39993 | 60781 | 124249 | 187718
Parameters (columns “median,” “mean”) and quantiles (columns “60%,” ..., “100%”) of the sample distribution of the observation volume of the sequential test for a given empirical risk (column “risk”). The data in the table are obtained from 1,000 experiments. We see that with the sequential test, “typical” numbers of observations before termination are much

15 The reason is twofold: first, for s < S we pass from deciding on L hypotheses to deciding on 2L of them; second, the desired risk ǫ is now distributed among several tests, so that each of them should be more reliable than the nonsequential test with risk ǫ.
16 Corresponding to δ = 0.01, θ = 10^{−1/4}, and ǫ = 0.01.
less than the worst-case values of these numbers. For example, in as much as 80% of the experiments these numbers were below quite reasonable levels, at least in the case L = 2. Of course, what is “typical,” and what is not, depends on how we generate the µ’s (the distribution from which µ is drawn is called the “prior Bayesian distribution”). Were our generation more likely to produce “close run” distributions, the advantages of sequential decision making would be reduced. This ambiguity is, however, unavoidable when attempting to go beyond worst-case-oriented analysis.

2.6.3 Concluding remarks
Application of our machinery to sequential hypothesis testing is in no sense restricted to the simple election model considered so far. A natural general setup we can handle is as follows: we are given a simple observation scheme O and a number L of related convex hypotheses, colored in d colors, on the distribution of an observation, with distributions obeying hypotheses of different colors being distinct from each other. Given the risk level ǫ, we want to decide (1 − ǫ)-reliably on the color of the distribution underlying the observations (i.e., the color of the hypothesis obeyed by this distribution) from stationary K-repeated observations, utilizing as small a number of observations as possible. For a detailed description of the related constructions and results, an interested reader is referred to [134].
2.7 MEASUREMENT DESIGN IN SIMPLE OBSERVATION SCHEMES

2.7.1 Motivation: Opinion polls revisited
Consider the same situation as in Section 2.6.1—we want to use an opinion poll to predict the winner in a population-wide election with L candidates. When addressing this situation earlier, no essential a priori information on the distribution of voters’ preferences was available. Now consider the case when the population is split into I groups (according to age, sex, income, etc.), with the ith group forming the fraction θi of the entire population, and we have at our disposal, at least for some i, nontrivial a priori information about the distribution p^i of the preferences across group #i (the ℓth entry p^i_ℓ in p^i is the fraction of voters of group i voting for candidate ℓ). For instance, we could know in advance that at least 90% of the members of group #1 vote for candidate #1, and at least 85% of the members of group #2 vote for candidate #2, while no information of this type is available for group #3. In this situation it would be wise to select respondents in the poll via a two-stage procedure: first select at random, with probabilities q1, ..., qI, the group from which the next respondent will be picked, and then select the respondent from this group at random according to the uniform distribution on the group. When the qi are proportional to the sizes of the groups (i.e., qi = θi for all i), we come back to selecting respondents at random from the uniform distribution over the entire population. The point, however, is that in the presence of a priori information it makes sense to use qi different from θi, specifically, to make the
ratios qi/θi “large” or “small” depending on whether the a priori information on group #i is poor or rich. The story we have just told is an example of a situation in which we can “design measurements”—draw observations from a distribution which is partly under our control. Indeed, what in fact happens in the story is the following. “In nature” there exist I probabilistic vectors p^1, ..., p^I of dimension L representing the distributions of voting preferences within the corresponding groups; the distribution of preferences across the entire population is p = Σ_i θi p^i. With the two-stage selection of respondents, the outcome of a particular interview becomes a pair (i, ℓ), with i identifying the group to which the respondent belongs, and ℓ identifying the candidate preferred by this respondent. In subsequent interviews, the pairs (i, ℓ)—these are our observations—are drawn, independently of each other, from the probability distribution on the pairs (i, ℓ), i ≤ I, ℓ ≤ L, with the probability of an outcome (i, ℓ) equal to p(i, ℓ) = qi p^i_ℓ. Thus, we find ourselves in the situation of stationary repeated observations stemming from the Discrete o.s. with observation space Ω of cardinality IL; the distribution from which the observations are drawn is a probabilistic vector µ of the form µ = Ax, where
• x = [p^1; ...; p^I] is the “signal” underlying our observations and representing the preferences of the population; this signal is selected by nature in the set X, known to us, defined in terms of our a priori information on p^1, ..., p^I:
X = {x = [x^1; ...; x^I] : x^i ∈ Πi, 1 ≤ i ≤ I},
(2.96)
where the Πi are the sets, given by our a priori information, of possible values of the preference vectors p^i of the voters from the ith group. In the sequel, we assume that the Πi are convex compact subsets of the positive part ∆°_L = {p ∈ R^L : p > 0, Σ_ℓ pℓ = 1} of the L-dimensional probabilistic simplex;
• A is a “sensing matrix” which, to some extent, is under our control; specifically,
A[x^1; ...; x^I] = [q1 x^1; q2 x^2; ...; qI x^I],
(2.97)
with q = [q1; ...; qI] fully controlled by us (up to the fact that q must be a probabilistic vector). Note that in the situation under consideration the hypotheses we want to decide upon can be represented by convex sets in the space of signals, with a particular hypothesis stating that the observations stem from a distribution µ on Ω, with µ belonging to the image of some convex compact set Xℓ ⊂ X under the mapping x ↦ µ = Ax. For example, when ν = Σ_i θi x^i, the hypotheses

Hℓ : ν ∈ Mℓ = {ν ∈ R^L : Σ_j νj = 1, νj ≥ 1/N, νℓ ≥ νℓ′ + δ ∀(ℓ′ ≠ ℓ)},  1 ≤ ℓ ≤ L,
considered in Section 2.6.1 can be expressed in terms of the signal x = [x^1; ...; x^I]:

Hℓ : µ = Ax, x ∈ Xℓ = { x = [x^1; ...; x^I] : x^i ≥ 0, Σ_ℓ x^i_ℓ = 1 ∀i ≤ I;
                        Σ_i θi x^i_ℓ ≥ Σ_i θi x^i_{ℓ′} + δ ∀(ℓ′ ≠ ℓ);
                        Σ_i θi x^i_j ≥ 1/N ∀j }.      (2.98)

The challenge we intend to address is as follows: so far, we were interested in inferences from observations drawn from distributions selected “by nature.” Now our goal is to make inferences from observations drawn from a distribution selected partly by nature and partly by us: nature selects the signal x, we select, from some given set, the matrix A, and the observations are drawn from the distribution Ax. As a result, we arrive at a question completely new for us: how do we utilize the freedom in selecting A in order to improve our inferences? (This is somewhat similar to what is called “design of experiments” in Statistics.)

2.7.2 Measurement Design: Setup
In what follows we address measurement design in simple observation schemes, and our setup is as follows (to make our intentions transparent, we illustrate the general setup by explaining how it should be specified to cover the outlined two-stage Opinion Poll Design (OPD) problem). Given are
• a simple observation scheme O = (Ω, Π; {pµ : µ ∈ M}; F), specifically, Gaussian, Poisson, or Discrete, with M ⊂ R^d. In OPD, O is the Discrete o.s. with Ω = {(i, ℓ) : 1 ≤ i ≤ I, 1 ≤ ℓ ≤ L}; that is, points of Ω are the potential outcomes “reference group, preferred candidate” of individual interviews.
• a nonempty closed convex signal space X ⊂ R^n, along with L nonempty convex compact subsets Xℓ of X, ℓ = 1, ..., L. In OPD, X is the set (2.96) comprised of tuples of allowed distributions of voters’ preferences in the various groups, and the Xℓ are the sets (2.98) of signals associated with the hypotheses Hℓ we intend to decide upon.
• a nonempty convex compact set Q in some R^N, along with a continuous mapping q ↦ Aq acting from Q into the space of d × n matrices such that
∀(x ∈ X, q ∈ Q) : Aq x ∈ M.
(2.99)
In OPD, Q is the set of probabilistic vectors q = [q1; ...; qI] specifying our measurement design, and Aq is the matrix of the mapping (2.97).
• a closeness C on the set {1, ..., L} (that is, a set C of pairs (i, j), 1 ≤ i, j ≤ L, such that (i, i) ∈ C for all i ≤ L and (j, i) ∈ C whenever (i, j) ∈ C), and a positive integer K. In OPD, the closeness C is as strict as it could be—i is close to j if and only if i = j,17 and K is the total number of interviews in the poll.

17 This closeness makes sense when the goal of the poll is to predict the winner; a less ambitious goal, e.g., to decide whether the winner will or will not belong to a particular set of candidates, would require a weaker closeness.
We associate with q ∈ Q and Xℓ, ℓ ≤ L, the nonempty convex compact sets M^ℓ_q in the space M,

M^ℓ_q = {Aq x : x ∈ Xℓ},

and the hypotheses H^ℓ_q on K-repeated stationary observations ω^K = (ω1, ..., ωK), with H^ℓ_q stating that the ωk, k = 1, ..., K, are drawn, independently of each other, from a distribution µ ∈ M^ℓ_q, ℓ = 1, ..., L. Closeness C can be thought of as a closeness on the collection of hypotheses H^1_q, H^2_q, ..., H^L_q. Given q ∈ Q, we can use the construction from Section 2.5.2 in order to build the test T^K_{φ∗} deciding on the hypotheses H^ℓ_q up to closeness C, with the C-risk of the test being the smallest allowed by the construction. Note that this C-risk depends on q; the “Measurement Design” (MD for short) problem we are about to consider is to select q ∈ Q which minimizes the C-risk of the associated test T^K_{φ∗}.

2.7.3 Formulating the MD problem
By Proposition 2.30, the C-risk of the test T^K_{φ∗} is upper-bounded by the spectral norm of the symmetric entrywise nonnegative L × L matrix E^{(K)}(q) = [ǫℓℓ′(q)]_{ℓ,ℓ′}, and this is what we intend to minimize in our MD problem. In the above formula, ǫℓℓ′(q) = ǫℓ′ℓ(q) are zeros if (ℓ, ℓ′) ∈ C. For (ℓ, ℓ′) ∉ C and 1 ≤ ℓ < ℓ′ ≤ L, the quantities ǫℓℓ′(q) = ǫℓ′ℓ(q) are defined depending on what the simple o.s. O is. Specifically:
• In the case of the Gaussian observation scheme (see Section 2.4.5.1), restriction (2.99) does not restrain the dependence of Aq on q at all (modulo the default constraint that Aq is a d × n matrix continuous in q ∈ Q), and ǫℓℓ′(q) = exp{K Optℓℓ′(q)}, where

Optℓℓ′(q) = max_{x∈Xℓ, y∈Xℓ′} −(1/8) [Aq(x − y)]^T Θ^{−1} [Aq(x − y)]      (Gq)
and Θ is the common covariance matrix of the Gaussian densities forming the family {pµ : µ ∈ M};
• In the case of Poisson o.s. (see Section 2.4.5.2), restriction (2.99) requires Aq x to be a positive vector whenever q ∈ Q and x ∈ X, and ǫℓℓ′(q) = exp{K Optℓℓ′(q)}, where

Optℓℓ′(q) = max_{x∈Xℓ, y∈Xℓ′} Σ_i [ √([Aq x]_i [Aq y]_i) − (1/2)[Aq x]_i − (1/2)[Aq y]_i ];      (Pq)
• In the case of Discrete o.s. (see Section 2.4.5.3), restriction (2.99) requires Aq x to be a positive probabilistic vector whenever q ∈ Q and x ∈ X, and

ǫℓℓ′(q) = [Optℓℓ′(q)]^K,

where

Optℓℓ′(q) = max_{x∈Xℓ, y∈Xℓ′} Σ_i √([Aq x]_i [Aq y]_i).      (Dq)
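For fixed µ = Aq x and ν = Aq y, the inner quantities in (Pq) and (Dq) are easy to evaluate directly: the Poisson exponent equals −(1/2) Σ_i (√µ_i − √ν_i)², hence is nonpositive, and the Discrete objective is the Hellinger affinity, which lies in (0, 1]; both risk bounds therefore decay geometrically in K. A sketch (the vectors below are made up for illustration, not taken from the text):

```python
import math

# hypothetical Poisson intensity vectors mu = A_q x, nu = A_q y
mu_p = [3.0, 1.5, 0.7]
nu_p = [2.2, 2.0, 1.1]

# objective of (P_q) at this (x, y): sum_i ( sqrt(mu_i nu_i) - mu_i/2 - nu_i/2 )
opt_poisson = sum(math.sqrt(m * n) - m / 2 - n / 2 for m, n in zip(mu_p, nu_p))
# same quantity written as -1/2 sum_i (sqrt(mu_i) - sqrt(nu_i))^2, hence <= 0
opt_alt = -0.5 * sum((math.sqrt(m) - math.sqrt(n)) ** 2 for m, n in zip(mu_p, nu_p))

# hypothetical discrete distributions mu = A_q x, nu = A_q y on a 4-point Omega
mu_d = [0.40, 0.30, 0.20, 0.10]
nu_d = [0.25, 0.25, 0.25, 0.25]

# objective of (D_q) at this (x, y): the Hellinger affinity sum_i sqrt(mu_i nu_i)
aff = sum(math.sqrt(m * n) for m, n in zip(mu_d, nu_d))

K = 50
risk_poisson = math.exp(K * opt_poisson)  # eps_{ll'}(q) in the Poisson case
risk_discrete = aff ** K                  # eps_{ll'}(q) in the Discrete case
```

Driving either bound below a target risk ǫ fixes the number K of observations one needs for the corresponding pair of hypotheses.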
The summary of the above observations is as follows. The norm ‖E^{(K)}‖_{2,2}—the quantity we are interested in minimizing in q ∈ Q—as a function of q ∈ Q is of the form

Ψ(q) = ψ(Opt(q)),  where Opt(q) := {Optℓℓ′(q) : (ℓ, ℓ′) ∉ C},      (2.100)
where the outer function ψ is an explicitly given real-valued function on R^N (N is the cardinality of the set of pairs (ℓ, ℓ′), 1 ≤ ℓ, ℓ′ ≤ L, with (ℓ, ℓ′) ∉ C) which is convex and nondecreasing in each argument. Indeed, denoting by Γ(S) the spectral norm of the L × L matrix S, note that Γ is a convex function of S, and this function is nondecreasing in every one of the entries of S, provided that S is restricted to be entrywise nonnegative.18 ψ(·) is obtained from Γ(S) by substituting for the entries Sℓℓ′ of S certain explicitly given convex, nonnegative, and nondecreasing functions of the variables z = {zℓℓ′ : (ℓ, ℓ′) ∉ C, 1 ≤ ℓ, ℓ′ ≤ L}. Namely,
• when (ℓ, ℓ′) ∈ C, we set Sℓℓ′ to zero;
• when (ℓ, ℓ′) ∉ C, we set Sℓℓ′ = exp{K zℓℓ′} in the case of Gaussian and Poisson o.s.’s, and Sℓℓ′ = max[0, zℓℓ′]^K in the case of Discrete o.s.
As a result, we indeed get a function ψ of z ∈ R^N which is convex and nondecreasing in every argument. Now, the Measurement Design problem we want to solve reads

Opt = min_{q∈Q} ψ(Opt(q)).      (2.101)
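The recipe just described is easy to evaluate numerically: given exponents z_{ℓℓ′} for the non-close pairs, build the matrix S and take its spectral norm. A sketch for the Gaussian/Poisson recipe S_{ℓℓ′} = exp{K z_{ℓℓ′}}; the numbers are made up for illustration:

```python
import numpy as np

L, K = 3, 20
C = {(0, 0), (1, 1), (2, 2)}            # strictest closeness: l close to l' iff l = l'

# hypothetical negative exponents Opt_{ll'}(q) for the non-close pairs (symmetric)
z = {(0, 1): -0.15, (0, 2): -0.30, (1, 2): -0.10}

S = np.zeros((L, L))                    # entries on C stay zero
for (a, b), val in z.items():           # Gaussian/Poisson recipe: S_{ll'} = exp{K z_{ll'}}
    S[a, b] = S[b, a] = np.exp(K * val)

risk_bound = np.linalg.norm(S, 2)       # spectral norm = bound on the C-risk
```

The spectral norm is at least the largest entry of S and at most the sum of all entries, so making every per-pair exponent small indeed makes the overall C-risk bound small.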
As we remember, the entries in the inner function Opt(q) are the optimal values of solvable convex optimization problems and as such are efficiently computable. When these entries are also convex functions of q ∈ Q, the objective in (2.101), due to the already established convexity and monotonicity properties of ψ, is a convex function of q, meaning that (2.101) is a convex and thus efficiently solvable problem. On the other hand, when some of the entries in Opt(q) are nonconvex in q, we can hardly expect (2.101) to be easy to solve. Unfortunately, convexity of the entries in Opt(q) in q turns out to be a “rare commodity.” For example, we can verify by inspection that the objectives in (Gq), (Pq), and (Dq), as functions of Aq (not of q!), are concave rather than convex. Thus, the optimal values of these problems, as functions of q, are maxima, over the parameters, of parametric families of concave functions of Aq (the parameters in these families are the optimization variables in (Gq)–(Dq)) and as such can hardly be expected to be convex as functions of Aq. And indeed, as a matter of fact, the MD problem usually is nonconvex and difficult to solve. We intend to consider a “Simple case” where this difficulty does not arise, i.e., the case where the objectives of the optimization problems specifying Optℓℓ′(q) are affine in q. In this case, Optℓℓ′(q) as a function of q is the maximum, over the parameters (the optimization variables in the corresponding problems), of a parametric family of affine functions of q and as such is convex. Our current goal is to understand what our sufficient condition for tractability of the MD problem—affinity in q of the objectives in the respective problems (Gq), (Pq), and (Dq)—actually means, and to show that this, by itself quite restrictive, assumption indeed takes place in some important applications.

18 The monotonicity follows from the fact that for an entrywise nonnegative S, we have
‖S‖_{2,2} = max_{x,y} {x^T S y : ‖x‖_2 ≤ 1, ‖y‖_2 ≤ 1} = max_{x,y} {x^T S y : ‖x‖_2 ≤ 1, ‖y‖_2 ≤ 1, x ≥ 0, y ≥ 0}.

2.7.3.1 Simple case, Discrete o.s.
Looking at the optimization problem (Dq), we see that the simplest way to ensure that its objective is affine in q is to assume that

Aq = Diag{Bq} A,      (2.102)
where A is some fixed d × n matrix, and B is some fixed d × (dim q) matrix such that Bq is positive whenever q ∈ Q. On top of this, we should ensure that when q ∈ Q and x ∈ X, Aq x is a positive probabilistic vector; this amounts to some restrictions linking Q, X, A, and B.
Illustration. The Opinion Poll Design problem of Section 2.7.1 provides an instructive example of the Simple case of Measurement Design in Discrete o.s. Recall that in this problem the voting population is split into I groups, with the ith group constituting a fraction θi of the entire population. The distribution of voters’ preferences in the ith group is represented by an unknown L-dimensional probabilistic vector x^i = [x^i_1; ...; x^i_L] (L is the number of candidates, x^i_ℓ is the fraction of voters in the ith group intending to vote for the ℓth candidate), known to belong to a given convex compact subset Πi of the “positive part” ∆°_L = {x ∈ R^L : x > 0, Σ_ℓ xℓ = 1} of the L-dimensional probabilistic simplex. We are given a threshold δ > 0 and want to decide on L hypotheses H1, ..., HL, with Hℓ stating that the population-wide vector y = Σ_{i=1}^I θi x^i of voters’ preferences belongs to the closed convex set

Yℓ = { y = Σ_{i=1}^I θi x^i : x^i ∈ Πi, 1 ≤ i ≤ I, yℓ ≥ yℓ′ + δ ∀(ℓ′ ≠ ℓ) }.
Note that Yℓ is the image, under the linear mapping

[x^1; ...; x^I] ↦ y(x) = Σ_i θi x^i,

of the compact convex set

Xℓ = { x = [x^1; ...; x^I] : x^i ∈ Πi, 1 ≤ i ≤ I, yℓ(x) ≥ yℓ′(x) + δ ∀(ℓ′ ≠ ℓ) },

which is a subset of the convex compact set

X = {x = [x^1; ...; x^I] : x^i ∈ Πi, 1 ≤ i ≤ I}.

The kth poll interview is organized as follows: we draw at random one of the I groups of voters, with probability qi of drawing the ith group, and then draw at random, from the uniform distribution on the group, the respondent to be interviewed. The outcome of
the interview—our observation ωk—is the pair (i, ℓ), where i is the group to which the respondent belongs, and ℓ is the candidate preferred by the respondent. This results in a sensing matrix Aq—see (2.97)—which is of the form (2.102), namely,

Aq = Diag{q1 I_L, q2 I_L, ..., qI I_L}   [q ∈ ∆_I],

and the outcome of the kth interview is drawn at random from the discrete probability distribution Aq x, where x ∈ X is the “signal” summarizing voters’ preferences in the groups. Given the total number of observations K, our goal is to decide with a given risk ǫ on our L hypotheses. Whether this goal is or is not achievable depends on K and on Aq. What we want is to find q for which the above goal can be attained with as small a K as possible; in the case in question, this reduces to solving, for various trial values of K, problem (2.101), which under the circumstances is an explicit convex optimization problem.
To get an impression of the potential of Measurement Design, we present a sample of numerical results. In all reported experiments, we use δ = 0.05, ǫ = 0.01, and equal fractions θi = I^{−1} for all groups. The sets Πi, 1 ≤ i ≤ I, are generated as follows: we pick at random a probabilistic vector p̄^i of dimension L and define Πi as the intersection of the box {p : p̄^i_ℓ − ui ≤ pℓ ≤ p̄^i_ℓ + ui} centered at p̄^i with the probabilistic simplex ∆L, where the ui, i = 1, ..., I, are prescribed “uncertainty levels.” Note that uncertainty level ui ≥ 1 is the same as the absence of any a priori information on the preferences of voters from the ith group. The results of our numerical experiments are as follows:

  L | I | Uncertainty levels u    | Kini | qopt                      | Kopt
  2 | 2 | [0.03;1.00]             | 1212 | [0.437;0.563]             | 1194
  2 | 2 | [0.02;1.00]             | 2699 | [0.000;1.000]             | 1948
  3 | 3 | [0.02;0.03;1.00]        | 3177 | [0.000;0.455;0.545]       | 2726
  5 | 4 | [0.02;0.02;0.03;1.00]   | 2556 | [0.000;0.131;0.322;0.547] | 2086
  5 | 4 | [1.00;1.00;1.00;1.00]   | 4788 | [0.250;0.250;0.250;0.250] | 4788
Effect of measurement design: poll sizes required for 0.99-reliable winner prediction when q = θ (column Kini) and q = qopt (column Kopt).

We see that measurement design allows us to reduce (for some data, quite significantly) the volume of observations as compared to the straightforward selection of respondents uniformly across the entire population. To compare our current model and results with those of Section 2.6.1, note that now we have more a priori information on the true distribution of voting preferences due to some a priori knowledge of preferences within groups, which allows us to reduce the poll sizes under both the straightforward and the optimal measurement designs.19 On the other hand, the difference between Kini and Kopt is fully due to the measurement design.

19 To illustrate this point, look at the last two lines of the table: utilizing a priori information allows us to reduce the poll size from 4,788 to 2,556 even with the straightforward measurement design.
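The two-stage respondent selection underlying these experiments can be sketched as a simulation. The group probabilities `q` and preference vectors below are made up for illustration (they are not the data of the experiments above); the empirical frequency of outcome (i, ℓ) should approach qi x^i_ℓ = (Aq x)_{(i,ℓ)}:

```python
import random

random.seed(0)

# hypothetical measurement design and signal
q = [0.3, 0.7]            # probability of drawing each group
x = [[0.8, 0.2],          # group 0: 80% prefer candidate 0
     [0.4, 0.6]]          # group 1: 40% / 60%

def interview():
    """One two-stage draw: first the group, then the preferred candidate."""
    i = random.choices(range(len(q)), weights=q)[0]
    l = random.choices(range(len(x[i])), weights=x[i])[0]
    return (i, l)

N = 200_000
counts = {}
for _ in range(N):
    o = interview()
    counts[o] = counts.get(o, 0) + 1

freq_11 = counts[(1, 1)] / N   # should be close to q[1] * x[1][1] = 0.42
```

With these numbers the four outcome frequencies settle near 0.24, 0.06, 0.28, and 0.42, i.e., near the entries of Aq x = [q1 x^1; q2 x^2].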
instead of L competing candidates running for office we have L competing drugs, and the population of patients the drugs are aimed at rather than the population of voters. For the sake of simplicity, assume that when a particular drug is administered to a particular patient, the outcome is binary: (positive) “effect” or “no effect” (what follows can be easily extended to the case of nonbinary categorical outcomes, like “strong positive effect,” “weak positive effect,” “negative effect,” and the like). Our goal is to organize a clinical study in order to decide on comparative drug efficiency, measured by the percentage of patients on whom a particular drug has an effect. The difference from organizing an opinion poll is that now we cannot just ask a respondent what his or her preferences are; we may only administer to a participant of the study a single drug of our choice and look at the result. As in the OPD problem, we assume that the population of patients is split into I groups, with the ith group comprising a fraction θi of the entire population. We model the situation as follows. We associate with a patient a Boolean vector of dimension 2L, with the ℓth entry in the vector equal to 1 or 0 depending on whether drug #ℓ has an effect on the patient, and the (L + ℓ)th entry complementing the ℓth one to 1 (that is, if the ℓth entry is χ, then the (L + ℓ)th entry is 1 − χ). Let x^i be the average of these vectors over the patients from group i. We define the “signal” x underlying our measurements as the vector [x^1; ...; x^I] and assume that our a priori information allows us to localize x in a closed convex subset X of the set

Y = {x = [x^1; ...; x^I] : x^i ≥ 0, x^i_ℓ + x^i_{L+ℓ} = 1, 1 ≤ i ≤ I, 1 ≤ ℓ ≤ L},

to which all our signals belong by construction. Note that the vector

y = Bx = Σ_i θi x^i
can be treated as a “population-wise distribution of drug effects”: yℓ, ℓ ≤ L, is the fraction, in the entire population of patients, of those patients on whom drug ℓ has an effect, and y_{L+ℓ} = 1 − yℓ. As a result, typical hypotheses related to comparison of the drugs, like “drug ℓ has an effect on a larger fraction, at least by margin δ, of patients than drug ℓ′,” become convex hypotheses on the signal x. In order to test hypotheses of this type, we can use the following two-stage procedure for observing drug effects. To get a particular observation, we select at random, with probability qiℓ, a pair (i, ℓ) from the set {(i, ℓ) : 1 ≤ i ≤ I, 1 ≤ ℓ ≤ L}, select a patient from group i according to the uniform distribution on the group, administer drug ℓ to the patient, and check whether the drug has an effect. Thus, a single observation is a triple (i, ℓ, χ), where χ = 0 if the administered drug has no effect on the patient, and χ = 1 otherwise. The probability of getting observation (i, ℓ, 1) is qiℓ x^i_ℓ, and the probability of getting observation (i, ℓ, 0) is qiℓ x^i_{L+ℓ}. Thus, we arrive at the Discrete o.s. where the distribution µ of observations is of the form µ = Aq x, with the rows of Aq indexed by triples ω = (i, ℓ, χ) ∈ Ω := {1, 2, ..., I} × {1, 2, ..., L} × {0, 1} and given by

(Aq [x^1; ...; x^I])_{i,ℓ,χ} = { qiℓ x^i_ℓ,     χ = 1,
                                 qiℓ x^i_{L+ℓ}, χ = 0.

Specifying the set Q of admissible measurement designs as a closed convex subset of the set of all nonvanishing discrete probability distributions on the set {1, 2, ..., I} × {1, 2, ..., L}, we find ourselves in the Simple case of Discrete o.s., as defined by
Figure 2.7: PET scanner
(2.102), and Aq x is a probabilistic vector whenever q ∈ Q and x ∈ Y.

2.7.3.2 Simple case, Poisson o.s.
Looking at the optimization problem (Pq), we see that the simplest way to ensure that its objective is affine in q is, as in the case of Discrete o.s., to assume that Aq = Diag{Bq}A, where A is some fixed d × n matrix, and B is some fixed d × (dim q) matrix such that Bq is positive whenever q ∈ Q. On top of this, we should ensure that when q ∈ Q and x ∈ X, Aq x is a positive vector; this amounts to some restrictions linking Q, X, A, and B.
Application Example: PET with time control. Positron Emission Tomography was already mentioned, as an example of Poisson o.s., in Section 2.4.3.2. As explained in that section, in PET we observe a random vector ω ∈ R^d with independent entries [ω]_i ∼ Poisson(µ_i), 1 ≤ i ≤ d, where the vector of parameters µ = [µ1; ...; µd] of the Poisson distributions is the linear image µ = Aλ of an unknown “signal” λ (the tracer’s density in the patient’s body) belonging to some known subset Λ of R^D_+, with entrywise nonnegative matrix A. Our goal is to make inferences about λ. Now, in an actual PET scan, the patient’s position w.r.t. the scanner is not the same during the entire study; the position is kept fixed during the ith time period, 1 ≤ i ≤ I, and changes from period to period in order to expose the entire “area of interest” to the scanner. For example, with the scanner shown in Figure 2.7, during the PET study the imaging table with the patient will be shifted several times along the axis of the scanning ring. As a result, the observed vector ω can be split into blocks ω^i, i = 1, ..., I, of data acquired during the ith period. On closer inspection, the corresponding block µ^i of µ is µ^i = qi A_i λ, where A_i is an entrywise nonnegative matrix known in advance, and qi is the duration of the ith period. In principle, the qi could be treated as nonnegative design variables subject to the “budget constraint” Σ_{i=1}^I qi = T, where T is the
total duration of the study,20 and perhaps some other convex constraints, say, positive lower bounds on the qi. It is immediately seen that the outlined situation is exactly as required in the Simple case of Poisson o.s.

2.7.3.3 Simple case, Gaussian o.s.
Looking at the optimization problem (Gq), we see that the simplest way to ensure that its objective is affine in q is to assume that the covariance matrix Θ is diagonal and

Aq = Diag{√q1, ..., √qd} A,      (2.103)

where A is a fixed d × n matrix and q runs through a convex compact subset of R^d_+. It turns out that there are situations where assumption (2.103) makes perfect sense. Let us start with a preamble. In Gaussian o.s.

ω = Ax + ξ,   A ∈ R^{d×n},  ξ ∼ N(0, Σ),  Σ = Diag{σ1^2, ..., σd^2},      (2.104)
the “physics” behind the observations in many cases is as follows. There are d sensors (receivers), the ith of them registering a continuous-time analog input depending linearly on the underlying signal x. On the time horizon on which the measurements are taken, this input is constant in time and is registered by the ith sensor on a time interval ∆i. The deterministic component of the measurement registered by sensor i is the integral of the corresponding input over ∆i, and the stochastic component of the measurement is obtained by integrating white Gaussian noise over the same interval. As far as this noise is concerned, what matters is that when the white noise affecting the ith sensor is integrated over a time interval ∆i, the result is a Gaussian random variable with zero mean and variance σi^2 |∆i| (here |∆i| is the length of ∆i), and the random variables obtained by integrating white noise over nonoverlapping segments are independent. Besides this, we assume that the noisy components of the measurements are independent across the sensors. Now, there could be two basic versions of the situation just outlined, both leading to the same observation model (2.104). In the first, “parallel,” version, all d sensors work in parallel on the same time horizon of duration 1. In the second, “sequential,” version, the sensors are activated and scanned one by one, each working unit time; thus, here the full time horizon is d, and the sensors register their respective inputs on consecutive time intervals of duration 1 each. In this second, “physical,” version of Gaussian o.s., we can, in principle, allow the sensors to register their inputs on consecutive time segments of varying durations q1 ≥ 0, q2 ≥ 0, ..., qd ≥ 0, with the additional (to nonnegativity) restriction that our total time budget is respected: Σ_i qi = d (perhaps along with some other convex constraints on the qi). Let us look at what observation scheme we end up with.
Assuming that (2.104) correctly represents our observations in the reference case where all the |∆_i| are equal to 1, the deterministic component of the measurement registered by sensor i over a time interval of duration q_i will be q_i Σ_j a_ij x_j, and the standard deviation of the noisy component will be σ_i √q_i, so that the measurements
[²⁰ T cannot be too large; aside from other considerations, the tracer disintegrates, and its density can be considered as nearly constant only on a properly restricted time horizon.]
HYPOTHESIS TESTING
become

z_i = σ_i √q_i ζ_i + q_i Σ_j a_ij x_j,   i = 1, ..., d,

with standard (zero mean, unit variance) Gaussian noises ζ_i independent of each other. Now, since we know the q_i, we can rescale the latter observations to make the standard deviation of the noisy component the same σ_i as in the reference case. Specifically, we lose nothing when assuming that our observations are

ω_i = z_i/√q_i = σ_i ζ_i + √q_i Σ_j a_ij x_j,   where ξ_i := σ_i ζ_i,
or, equivalently,
ω = A_q x + ξ,   A_q := Diag{√q_1, ..., √q_d} A,   ξ ∼ N(0, Diag{σ_1^2, ..., σ_d^2})   [A = [a_ij]]
where q runs through a convex compact subset Q of the simplex {q ∈ R^d_+ : Σ_i q_i = d}. Thus, if the “physical nature” of a Gaussian o.s. is sequential, then, making the activity times of the sensors our design variables, as is natural under the circumstances, we arrive at (2.103) and, as a result, end up with an easy-to-solve Measurements Design problem.
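To make the rescaling concrete, here is a small numerical sketch (all data—the matrix A, the budget allocation q, the noise levels σ_i, and the signal x—are made up for illustration): it builds the effective sensing matrix A_q = Diag{√q_1, ..., √q_d}A and checks that dividing the raw measurement z_i by √q_i brings the noise standard deviation back to σ_i while scaling row i of A by √q_i.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
A = rng.standard_normal((d, n))          # sensing matrix (illustrative)
sigma = np.array([0.5, 1.0, 0.2, 0.8])   # per-sensor noise levels sigma_i
q = np.array([2.0, 0.5, 1.0, 0.5])       # activity times, sum_i q_i = d
x = np.array([1.0, -2.0, 0.5])           # underlying signal
assert np.isclose(q.sum(), d)            # total time budget respected

# raw measurements: z_i = sigma_i * sqrt(q_i) * zeta_i + q_i * (A x)_i
zeta = rng.standard_normal(d)
z = sigma * np.sqrt(q) * zeta + q * (A @ x)

# rescaled observations omega_i = z_i / sqrt(q_i) obey omega = A_q x + xi
omega = z / np.sqrt(q)
A_q = np.diag(np.sqrt(q)) @ A
xi = omega - A_q @ x
assert np.allclose(xi, sigma * zeta)     # noise std is back to sigma_i
```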
2.8 AFFINE DETECTORS BEYOND SIMPLE OBSERVATION SCHEMES
On closer inspection, the “common denominator” of our basic simple o.s.’s—Gaussian, Poisson, and Discrete—is that in all these cases the minimal risk detector for a pair of convex hypotheses is affine. At first glance, this indeed is so for Gaussian and Poisson o.s.’s, where F is comprised of affine functions on the corresponding observation space Ω (R^d for Gaussian o.s., and Z^d_+ ⊂ R^d for Poisson o.s.), but is not so for Discrete o.s.—in that case, Ω = {1, ..., d}, and F is comprised of all functions on Ω, while “affine functions on Ω = {1, ..., d}” make no sense. Note, however, that we can encode (and from now on this is what we do) the points i = 1, ..., d of a d-element set by the basic orths e_i = [0; ...; 0; 1; 0; ...; 0] ∈ R^d, thus making our observation space Ω a subset of R^d. With this encoding, every real-valued function on {1, ..., d} becomes the restriction to Ω of an affine function. Note that when passing from our basic simple o.s.’s to their direct products, the minimum risk detectors for pairs of convex hypotheses remain affine. Now, in our context the following two properties of simple o.s.’s are essential:
A) the best—with the smallest possible risk—affine detector, like its risk, can be efficiently computed;
B) the smallest risk affine detector from A) is the best detector, in terms of risk, available under the circumstances, so that the associated test is near-optimal.
Note that as far as practical applications of detector-based hypothesis testing are concerned, one “can survive” without B) (near-optimality of the constructed detectors), while A) is a requisite.
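The encoding of discrete observations by basic orths can be spelled out in a few lines (an illustrative sketch): any function φ on {1, ..., d}, stored as the vector of its values, coincides on the encoded observation space with a linear function of the encoded observation.

```python
import numpy as np

d = 5
phi = np.array([2.0, -1.0, 0.5, 3.0, 0.0])   # arbitrary function on {1,...,d}

def encode(i, d):
    """Encode observation i in {1,...,d} as the basic orth e_i in R^d."""
    e = np.zeros(d)
    e[i - 1] = 1.0
    return e

# On the encoded observation space, phi is the restriction of the linear
# function omega -> phi^T omega, since phi^T e_i = phi_i:
for i in range(1, d + 1):
    assert phi @ encode(i, d) == phi[i - 1]
```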
CHAPTER 2
In this section we focus on families of probability distributions obeying A). This class turns out to be incomparably larger than the class of simple o.s.’s defined in Section 2.4; in particular, it includes nonparametric families of distributions. Staying within this much broader class, we are still able to construct, in a computationally efficient way, the best affine detectors, in a certain precise sense, for a pair of “convex” hypotheses, along with valid upper bounds on the risks of the detectors. What we cannot claim anymore, in general, is that the tests associated with such detectors are near-optimal. This being said, we believe that investigating possibilities for building tests and quantifying their performance in a computationally friendly manner is of value even when we cannot provably guarantee near-optimality of these tests. The results to follow originate from [135, 136].
2.8.1 Situation
In what follows, we fix an observation space Ω = R^d, and let P_j, 1 ≤ j ≤ J, be given families of probability distributions on Ω. Broadly put, our goal still is, given a random observation ω ∼ P with P ∈ ∪_{j≤J} P_j, to decide upon the hypotheses H_j : P ∈ P_j, j = 1, ..., J. We intend to address this goal in the case when the families P_j are simple—comprised of distributions whose moment-generating functions admit an explicit upper bound.
2.8.1.1 Preliminaries: Regular data and associated families of distributions
Definition 2.36. A. Regular data is a triple H, M, Φ(·,·), where
– H is a nonempty closed convex set in Ω = R^d, symmetric w.r.t. the origin,
– M is a closed convex set in some R^n,
– Φ(h; µ) : H × M → R is a continuous function convex in h ∈ H and concave in µ ∈ M.
B. Regular data H, M, Φ(·,·) define two families of probability distributions on Ω:
– the family of regular distributions R = R[H, M, Φ], comprised of all probability distributions P on Ω such that

∀h ∈ H ∃µ ∈ M : ln ∫_Ω exp{h^T ω} P(dω) ≤ Φ(h; µ);

– the family of simple distributions S = S[H, M, Φ], comprised of probability distributions P on Ω such that

∃µ ∈ M ∀h ∈ H : ln ∫_Ω exp{h^T ω} P(dω) ≤ Φ(h; µ).   (2.105)
For a probability distribution P ∈ S[H, M, Φ], every µ ∈ M satisfying (2.105) is referred to as a parameter of P w.r.t. S. Note that a distribution may have several different parameters.
Recall that, beginning with Section 2.3, the starting point in all our constructions is a “plausibly good” detector-based test which, given two families P_1 and P_2 of distributions with common observation space and repeated observations ω_1, ..., ω_t drawn from a distribution P ∈ P_1 ∪ P_2, decides whether P ∈ P_1 or P ∈ P_2. Our interest in the families of regular/simple distributions stems from the fact that when the families P_1 and P_2 are of this type, building such a test reduces to solving a convex-concave saddle point problem and thus can be carried out in a computationally efficient manner. We postpone the related construction and analysis to Section 2.8.2, and continue with presenting some basic examples of families of simple and regular distributions, along with a simple “calculus” of these families.
2.8.1.2 Basic examples of simple families of probability distributions
2.8.1.2.A. Sub-Gaussian distributions: Let H = Ω = R^d, let M be a closed convex subset of the set G_d = {µ = (θ, Θ) : θ ∈ R^d, Θ ∈ S^d_+}, where S^d_+ is the cone of positive semidefinite matrices in the space S^d of symmetric d × d matrices, and let Φ(h; θ, Θ) = θ^T h + (1/2) h^T Θh. Recall that a distribution P on Ω = R^d is called sub-Gaussian with sub-Gaussianity parameters θ ∈ R^d and Θ ∈ S^d_+ if

E_{ω∼P}{exp{h^T ω}} ≤ exp{θ^T h + (1/2) h^T Θh}   ∀h ∈ R^d.   (2.106)
Whenever this is the case, θ is the expected value of P. We shall use the notation ξ ∼ SG(θ, Θ) as shorthand for “the random vector ξ is sub-Gaussian with parameters θ, Θ.” It is immediately seen that when ξ ∼ N(θ, Θ), we also have ξ ∼ SG(θ, Θ), and (2.106) in this case is an identity rather than an inequality. With Φ as above, S[H, M, Φ] clearly contains every sub-Gaussian distribution P on R^d with sub-Gaussianity parameters (forming a parameter of P w.r.t. S) from M. In particular, S[H, M, Φ] contains all Gaussian distributions N(θ, Θ) with (θ, Θ) ∈ M.
2.8.1.2.B. Poisson distributions: Let H = Ω = R^d, let M be a closed convex subset of the d-dimensional nonnegative orthant R^d_+, and let

Φ(h = [h_1; ...; h_d]; µ = [µ_1; ...; µ_d]) = Σ_{i=1}^d µ_i [exp{h_i} − 1] : H × M → R.
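As a quick sanity check (illustrative sketch), for a univariate Poisson distribution the log-moment-generating function equals µ(e^h − 1) exactly, which is what Φ specializes to when d = 1; the truncated series below confirms this numerically:

```python
import math

def poisson_log_mgf(h, mu, kmax=200):
    """ln E[exp(h*omega)] for omega ~ Poisson(mu), by summing the series."""
    term = math.exp(-mu)                  # k = 0 term
    total = term
    for k in range(1, kmax):
        term *= math.exp(h) * mu / k      # next term of e^{-mu} (mu e^h)^k / k!
        total += term
    return math.log(total)

h, mu = 0.7, 3.0
# For Poisson, the bound Phi(h; mu) = mu*(e^h - 1) is attained with equality:
assert abs(poisson_log_mgf(h, mu) - mu * (math.exp(h) - 1)) < 1e-10
```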
The family S = S[H, M, Φ] contains all Poisson distributions Poisson[µ] with parameter vectors µ belonging to M; here Poisson[µ] is the distribution of a random d-dimensional vector with independent entries, the i-th entry being a Poisson random variable with parameter µ_i; µ is a parameter of Poisson[µ] w.r.t. S.
2.8.1.2.C. Discrete distributions. Consider a discrete random variable taking values in the d-element set {1, 2, ..., d}, and let us think of such a variable as a random
variable taking values e_i ∈ R^d, i = 1, ..., d, where e_i = [0; ...; 0; 1; 0; ...; 0] (1 in position i) are the standard basic orths in R^d. The probability distribution of such a variable can be identified with a point µ = [µ_1; ...; µ_d] of the d-dimensional probabilistic simplex

∆_d = {ν ∈ R^d_+ : Σ_{i=1}^d ν_i = 1},

where µ_i is the probability for the variable to take value e_i. With these identifications, setting H = R^d, specifying M as a closed convex subset of ∆_d, and setting

Φ(h = [h_1; ...; h_d]; µ = [µ_1; ...; µ_d]) = ln(Σ_{i=1}^d µ_i exp{h_i}),

the family S = S[H, M, Φ] contains the distributions of all discrete random variables taking values in {1, ..., d} with probabilities µ_1, ..., µ_d comprising a vector from M. This vector is a parameter of the corresponding distribution w.r.t. S.
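For the discrete case the bound is again an identity: with ω taking value e_i with probability µ_i, direct computation gives E exp{h^T ω} = Σ_i µ_i e^{h_i}. A short check (illustrative numbers):

```python
import math

d = 3
mu = [0.2, 0.5, 0.3]                    # a point of the probabilistic simplex
h = [1.0, -0.5, 0.25]
orths = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]

# E exp{h^T omega}, with omega running over the basic orths e_i:
mgf = sum(p * math.exp(sum(hj * ej for hj, ej in zip(h, e)))
          for p, e in zip(mu, orths))

# Phi(h; mu) = ln(sum_i mu_i exp{h_i}) reproduces it exactly, as h^T e_i = h_i:
Phi = math.log(sum(p * math.exp(hi) for p, hi in zip(mu, h)))
assert abs(math.log(mgf) - Phi) < 1e-12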
2.8.1.2.D. Distributions with bounded support. Consider the family P[X] of probability distributions supported on a closed and bounded convex set X ⊂ Ω = R^d, and let

φ_X(h) = max_{x∈X} h^T x

be the support function of X. We have the following result (to be refined in Section 2.8.1.3):

Proposition 2.37. For every P ∈ P[X] it holds

∀h ∈ R^d : ln ∫_{R^d} exp{h^T ω} P(dω) ≤ h^T e[P] + (1/8)[φ_X(h) + φ_X(−h)]²,   (2.107)

where e[P] = ∫_{R^d} ωP(dω) is the expectation of P, and the function on the right-hand side of (2.107) is convex. As a result, setting

H = R^d,  M = X,  Φ(h; µ) = h^T µ + (1/8)[φ_X(h) + φ_X(−h)]²,
we obtain regular data such that P[X] ⊂ S[H, M, Φ], with e[P] a parameter of a distribution P ∈ P[X] w.r.t. S. For the proof, see Section 2.11.4.
2.8.1.3 Calculus of regular and simple families of probability distributions
Families of regular and simple distributions admit a “fully algorithmic” calculus, with the main calculus rules as follows.
2.8.1.3.A. Direct summation. For 1 ≤ ℓ ≤ L, let regular data H_ℓ ⊂ Ω_ℓ = R^{d_ℓ},
M_ℓ ⊂ R^{n_ℓ}, Φ_ℓ(h^ℓ; µ_ℓ) : H_ℓ × M_ℓ → R be given. Let us set

Ω = Ω_1 × ... × Ω_L = R^d,  d = d_1 + ... + d_L,
H = H_1 × ... × H_L = {h = [h^1; ...; h^L] : h^ℓ ∈ H_ℓ, ℓ ≤ L},
M = M_1 × ... × M_L = {µ = [µ_1; ...; µ_L] : µ_ℓ ∈ M_ℓ, ℓ ≤ L} ⊂ R^n,  n = n_1 + ... + n_L,
Φ(h = [h^1; ...; h^L]; µ = [µ_1; ...; µ_L]) = Σ_{ℓ=1}^L Φ_ℓ(h^ℓ; µ_ℓ) : H × M → R.

Then H is a closed convex set in Ω = R^d, symmetric w.r.t. the origin, M is a nonempty closed convex set in R^n, Φ : H × M → R is a continuous convex-concave function, and clearly
• the family R[H, M, Φ] contains all product-type distributions P = P_1 × ... × P_L on Ω = Ω_1 × ... × Ω_L with P_ℓ ∈ R[H_ℓ, M_ℓ, Φ_ℓ], 1 ≤ ℓ ≤ L;
• the family S = S[H, M, Φ] contains all product-type distributions P = P_1 × ... × P_L on Ω = Ω_1 × ... × Ω_L with P_ℓ ∈ S_ℓ = S[H_ℓ, M_ℓ, Φ_ℓ], 1 ≤ ℓ ≤ L, a parameter of P w.r.t. S being the vector of parameters of the P_ℓ w.r.t. S_ℓ.
2.8.1.3.B. Mixing. For 1 ≤ ℓ ≤ L, let regular data H_ℓ ⊂ Ω = R^d, M_ℓ ⊂ R^{n_ℓ}, Φ_ℓ(h; µ_ℓ) : H_ℓ × M_ℓ → R be given, with compact M_ℓ. Let also ν = [ν_1; ...; ν_L] be a probabilistic vector. For a tuple P^L = {P_ℓ ∈ R[H_ℓ, M_ℓ, Φ_ℓ]}_{ℓ=1}^L, let Π[P^L, ν] be the ν-mixture of the distributions P_1, ..., P_L, defined as the distribution of a random vector ω ∈ Ω generated as follows: we draw at random, from the probability distribution ν on {1, ..., L}, an index ℓ, and then draw ω at random from the distribution P_ℓ. Finally, let P be the set of all probability distributions on Ω which can be obtained as Π[P^L, ν] from the outlined tuples P^L and vectors ν running through the probabilistic simplex ∆_L = {ν ∈ R^L : ν ≥ 0, Σ_ℓ ν_ℓ = 1}. Let us set

H = ∩_{ℓ=1}^L H_ℓ,
Ψ_ℓ(h) = max_{µ_ℓ∈M_ℓ} Φ_ℓ(h; µ_ℓ) : H_ℓ → R,
Φ(h; ν) = ln(Σ_{ℓ=1}^L ν_ℓ exp{Ψ_ℓ(h)}) : H × ∆_L → R.
Then H, ∆_L, Φ clearly is regular data (recall that all the M_ℓ are compact), and for every ν ∈ ∆_L and tuple P^L of the above type one has

P = Π[P^L, ν] ⇒ ln ∫_Ω e^{h^T ω} P(dω) ≤ Φ(h; ν)   ∀h ∈ H,   (2.108)

implying that P ⊂ S[H, ∆_L, Φ], ν being a parameter of P = Π[P^L, ν] ∈ P. Indeed, (2.108) is readily given by the fact that for P = Π[P^L, ν] ∈ P and h ∈ H it holds

ln E_{ω∼P}{e^{h^T ω}} = ln(Σ_{ℓ=1}^L ν_ℓ E_{ω∼P_ℓ}{e^{h^T ω}}) ≤ ln(Σ_{ℓ=1}^L ν_ℓ exp{Ψ_ℓ(h)}) = Φ(h; ν),

with the concluding inequality given by h ∈ H ⊂ H_ℓ and P_ℓ ∈ R[H_ℓ, M_ℓ, Φ_ℓ], 1 ≤ ℓ ≤ L.
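A quick numerical illustration of (2.108) (a sketch with made-up parameters): mix two two-point distributions, each lying in a regular family with the Hoeffding-type bound Ψ_ℓ(h) = µ_ℓ h + (b_ℓ − a_ℓ)² h²/8, and compare the exact log-MGF of the mixture with Φ(h; ν):

```python
import math

# Components: two-point distributions supported on [a_l, b_l]; by Hoeffding's
# lemma each satisfies ln E e^{h w} <= Psi_l(h) = mu_l*h + (b_l-a_l)^2*h^2/8.
comps = [(-1.0, 1.0, 0.5), (0.0, 3.0, 0.2)]   # (a, b, P({a}))
nu = [0.7, 0.3]                               # mixing weights

for h in [x / 5 for x in range(-10, 11)]:
    # exact log-MGF of the mixture, enumerating its four atoms
    mgf = sum(n * (p * math.exp(h * a) + (1 - p) * math.exp(h * b))
              for n, (a, b, p) in zip(nu, comps))
    # Phi(h; nu) = ln sum_l nu_l exp{Psi_l(h)}, the bound of (2.108)
    Phi = math.log(sum(
        n * math.exp((p * a + (1 - p) * b) * h + (b - a) ** 2 * h * h / 8)
        for n, (a, b, p) in zip(nu, comps)))
    assert math.log(mgf) <= Phi + 1e-12
```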
We have built a simple family of distributions S := S[H, ∆_L, Φ] which contains all mixtures of distributions from the given regular families R_ℓ := R[H_ℓ, M_ℓ, Φ_ℓ], 1 ≤ ℓ ≤ L; this makes S a simple outer approximation of the mixtures of distributions from the simple families S_ℓ := S[H_ℓ, M_ℓ, Φ_ℓ], 1 ≤ ℓ ≤ L. In this latter capacity, S has a drawback—the only parameter of the mixture P = Π[P^L, ν] of distributions P_ℓ ∈ S_ℓ is ν, while the parameters of the P_ℓ’s disappear. In some situations, this makes the outer approximation S of P too conservative. We are about to get rid, to some extent, of this drawback.

A modification. In the situation described at the beginning of 2.8.1.3.B, let a vector ν̄ ∈ ∆_L be given, and let

Φ̄(h; µ_1, ..., µ_L) = Σ_{ℓ=1}^L ν̄_ℓ Φ_ℓ(h; µ_ℓ) : H × (M_1 × ... × M_L) → R.

Let a d × d matrix Q ⪰ 0 satisfy

[Φ_ℓ(h; µ_ℓ) − Φ̄(h; µ_1, ..., µ_L)]² ≤ h^T Qh   ∀(h ∈ H, ℓ ≤ L, µ ∈ M_1 × ... × M_L),   (2.109)

and let

Φ(h; µ_1, ..., µ_L) = (3/5) h^T Qh + Φ̄(h; µ_1, ..., µ_L) : H × (M_1 × ... × M_L) → R.   (2.110)

Φ clearly is convex-concave and continuous on its domain, whence H = ∩_ℓ H_ℓ, M_1 × ... × M_L, Φ is regular data.

Proposition 2.38. In the situation just defined, denoting by P_ν̄ the family of all probability distributions P = Π[P^L, ν̄] stemming from tuples

P^L = {P_ℓ ∈ S[H_ℓ, M_ℓ, Φ_ℓ]}_{ℓ=1}^L,   (2.111)

one has P_ν̄ ⊂ S[H, M_1 × ... × M_L, Φ]. As a parameter of a distribution P = Π[P^L, ν̄] ∈ P_ν̄ with P^L as in (2.111), one can take µ^L = [µ_1; ...; µ_L].

Proof. It is easily seen that e^a ≤ a + exp{(3/5)a²} for all a. As a result, when a_ℓ, ℓ = 1, ..., L, satisfy Σ_ℓ ν̄_ℓ a_ℓ = 0, we have

Σ_ℓ ν̄_ℓ e^{a_ℓ} ≤ Σ_ℓ ν̄_ℓ a_ℓ + Σ_ℓ ν̄_ℓ exp{(3/5)a_ℓ²} ≤ exp{(3/5) max_ℓ a_ℓ²}.   (2.112)

Now let P^L be as in (2.111), and let h ∈ H = ∩_ℓ H_ℓ. Setting P = Π[P^L, ν̄], we have

ln ∫_Ω e^{h^T ω} P(dω) = ln(Σ_ℓ ν̄_ℓ ∫_Ω e^{h^T ω} P_ℓ(dω)) ≤ ln(Σ_ℓ ν̄_ℓ exp{Φ_ℓ(h; µ_ℓ)})
= Φ̄(h; µ_1, ..., µ_L) + ln(Σ_ℓ ν̄_ℓ exp{Φ_ℓ(h; µ_ℓ) − Φ̄(h; µ_1, ..., µ_L)})
≤_{(a)} Φ̄(h; µ_1, ..., µ_L) + (3/5) max_ℓ [Φ_ℓ(h; µ_ℓ) − Φ̄(h; µ_1, ..., µ_L)]² ≤_{(b)} Φ(h; µ_1, ..., µ_L),
where (a) is given by (2.112) as applied to a_ℓ = Φ_ℓ(h; µ_ℓ) − Φ̄(h; µ_1, ..., µ_L), and (b) is due to (2.109) and (2.110). The resulting inequality, which holds true for all h ∈ H, is all we need. ✷
2.8.1.3.C. I.i.d. summation. Let Ω = R^d be an observation space, (H, M, Φ) be regular data on this space, and λ = {λ_ℓ}_{ℓ=1}^L be a collection of reals. We can associate with the outlined entities new data (H_λ, M, Φ_λ) on Ω by setting

H_λ = {h ∈ Ω : ‖λ‖_∞ h ∈ H},   Φ_λ(h; µ) = Σ_{ℓ=1}^L Φ(λ_ℓ h; µ) : H_λ × M → R.
Now, given a probability distribution P on Ω, we can associate with it and with the above λ a new probability distribution P^λ on Ω: P^λ is the distribution of Σ_ℓ λ_ℓ ω_ℓ, where ω_1, ω_2, ..., ω_L are drawn, independently of each other, from P. An immediate observation is that the data (H_λ, M, Φ_λ) is regular, and whenever a probability distribution P belongs to S[H, M, Φ], the distribution P^λ belongs to S[H_λ, M, Φ_λ], and every parameter of P is a parameter of P^λ. In particular, when ω ∼ P ∈ S[H, M, Φ], the distribution P^L of the sum of L independent copies of ω belongs to S[H, M, LΦ].
2.8.1.3.D. Semidirect summation. For 1 ≤ ℓ ≤ L, let regular data H_ℓ ⊂ Ω_ℓ = R^{d_ℓ}, M_ℓ, Φ_ℓ be given. To avoid complications, we assume that for every ℓ,
• H_ℓ = Ω_ℓ,
• M_ℓ is bounded.
Let also an ǫ > 0 be given. We assume that ǫ is small, namely, Lǫ < 1. Let us aggregate the given regular data into new data by setting H = Ω := Ω_1 × ... × Ω_L = R^d, d = d_1 + ... + d_L, M = M_1 × ... × M_L, and let us define the function Φ(h; µ) : Ω × M → R as follows:

Φ(h = [h^1; ...; h^L]; µ = [µ_1; ...; µ_L]) = inf_{λ∈∆_ǫ} Σ_{ℓ=1}^L λ_ℓ Φ_ℓ(h^ℓ/λ_ℓ; µ_ℓ),
∆_ǫ = {λ ∈ R^L : λ_ℓ ≥ ǫ ∀ℓ & Σ_{ℓ=1}^L λ_ℓ = 1}.   (2.113)
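A small numerical check of the semidirect rule (an illustrative univariate sketch; the components are taken independent here for simplicity, while the rule itself needs no independence): for two bounded marginals with Hoeffding-type Φ_ℓ, the joint distribution of [ω¹; ω²] satisfies the bound with Φ given by (2.113), the infimum being approximated on a grid over ∆_ǫ.

```python
import math

# Marginals: two-point distributions; Phi_l(g; mu_l) = mu_l*g + c_l*g^2 with
# c_l = (b_l - a_l)^2/8 (Hoeffding), so lam*Phi_l(h/lam) = mu_l*h + c_l*h^2/lam.
comps = [(-1.0, 1.0, 0.5), (0.0, 2.0, 0.3)]   # (a, b, P({a}))
eps = 0.1

def Phi(h1, h2):
    best = float("inf")
    for k in range(1, 100):                    # grid over Delta_eps
        lam1 = eps + (1 - 2 * eps) * k / 100
        lam2 = 1 - lam1
        val = 0.0
        for lam, h, (a, b, p) in zip((lam1, lam2), (h1, h2), comps):
            mu, c = p * a + (1 - p) * b, (b - a) ** 2 / 8
            val += lam * (mu * (h / lam) + c * (h / lam) ** 2)
        best = min(best, val)
    return best

for h1, h2 in [(0.3, -0.5), (1.0, 0.4), (-0.7, -0.2)]:
    # exact log-MGF of the independent pair, enumerating the four atoms
    log_mgf = sum(math.log(p * math.exp(h * a) + (1 - p) * math.exp(h * b))
                  for h, (a, b, p) in zip((h1, h2), comps))
    assert log_mgf <= Phi(h1, h2) + 1e-9
```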
For evident reasons, the infimum in the description of Φ is achieved, and Φ is continuous. In addition, Φ is convex in h ∈ R^d and concave in µ ∈ M (verification is postponed for a moment); the consequence is that H = Ω = R^d, M, and Φ form regular data. We claim that

Whenever ω = [ω¹; ...; ω^L] is a random variable taking values in Ω = R^{d_1} × ... × R^{d_L}, and the marginal distributions P_ℓ, 1 ≤ ℓ ≤ L, of ω belong to the families S_ℓ = S[R^{d_ℓ}, M_ℓ, Φ_ℓ] for all 1 ≤ ℓ ≤ L, the distribution P of ω belongs to S = S[R^d, M, Φ], a parameter of P w.r.t. S being the vector comprised of the parameters of the P_ℓ w.r.t. S_ℓ.

Indeed, since P_ℓ ∈ S[R^{d_ℓ}, M_ℓ, Φ_ℓ], there exists µ̂_ℓ ∈ M_ℓ such that

ln(E_{ω^ℓ∼P_ℓ}{exp{g^T ω^ℓ}}) ≤ Φ_ℓ(g; µ̂_ℓ)   ∀g ∈ R^{d_ℓ}.
Let us set µ̂ = [µ̂_1; ...; µ̂_L], and let h = [h¹; ...; h^L] ∈ Ω be given. We can find λ ∈ ∆_ǫ such that

Φ(h; µ̂) = Σ_{ℓ=1}^L λ_ℓ Φ_ℓ(h^ℓ/λ_ℓ; µ̂_ℓ).

Applying the Hölder inequality, we get

E_{[ω¹;...;ω^L]∼P}{exp{Σ_ℓ [h^ℓ]^T ω^ℓ}} ≤ Π_{ℓ=1}^L (E_{ω^ℓ∼P_ℓ}{exp{[h^ℓ]^T ω^ℓ/λ_ℓ}})^{λ_ℓ},

whence

ln(E_{[ω¹;...;ω^L]∼P}{exp{Σ_ℓ [h^ℓ]^T ω^ℓ}}) ≤ Σ_{ℓ=1}^L λ_ℓ Φ_ℓ(h^ℓ/λ_ℓ; µ̂_ℓ) = Φ(h; µ̂).

We see that

ln(E_{[ω¹;...;ω^L]∼P}{exp{Σ_ℓ [h^ℓ]^T ω^ℓ}}) ≤ Φ(h; µ̂)   ∀h ∈ H = R^d,
and thus P ∈ S[R^d, M, Φ], as claimed. It remains to verify that the function Φ defined by (2.113) is indeed convex in h ∈ R^d and concave in µ ∈ M. Concavity in µ is evident. Further, the functions λ_ℓ Φ_ℓ(h^ℓ/λ_ℓ; µ) (as perspective transformations of the convex functions Φ_ℓ(·; µ)) are jointly convex in λ and h^ℓ, and so is Ψ(λ, h; µ) = Σ_{ℓ=1}^L λ_ℓ Φ_ℓ(h^ℓ/λ_ℓ; µ). Thus Φ(·; µ), obtained by partial minimization of Ψ in λ, is indeed convex.
2.8.1.3.E. Affine image. Let H, M, Φ be regular data, Ω be the embedding space of H, and x ↦ Ax + a be an affine mapping from Ω to Ω̄ = R^d̄, and let us set

H̄ = {h̄ ∈ R^d̄ : A^T h̄ ∈ H},  M̄ = M,  Φ̄(h̄; µ) = Φ(A^T h̄; µ) + a^T h̄ : H̄ × M̄ → R.

It is immediately seen that H̄, M̄, Φ̄ is regular data. Note that

Whenever the probability distribution P of a random variable ω belongs to R[H, M, Φ] (or belongs to S[H, M, Φ]), the distribution P̄[P] of the random variable ω̄ = Aω + a belongs to R[H̄, M̄, Φ̄] (respectively, belongs to S[H̄, M̄, Φ̄], and every parameter of P is a parameter of P̄[P]).

2.8.1.3.F. Incorporating support information. Consider the situation as follows. We are given regular data H ⊂ Ω = R^d, M, Φ and are interested in a family P of distributions known to belong to S[H, M, Φ]. In addition, we know that all distributions P from P are supported on a given closed convex set X ⊂ R^d. How could we incorporate this domain information to pass from the family S[H, M, Φ] containing P to a smaller family of the same type still containing P? We are about to give an answer in the simplest case of H = Ω. Denoting by φ_X(·) the support function of X and selecting somehow a closed convex set G ⊂ R^d containing
the origin, let us set

Φ̂(h; µ) = inf_{g∈G} [Φ⁺(h, g; µ) := Φ(h − g; µ) + φ_X(g)],

where Φ(h; µ) : R^d × M → R is the continuous convex-concave function participating in the original regular data. Assuming that Φ̂ is real-valued and continuous on the domain R^d × M (which definitely is the case when G is a compact set such that φ_X is finite and continuous on G), note that Φ̂ is convex-concave on this domain, so that R^d, M, Φ̂ is regular data. We claim that

The family S[R^d, M, Φ̂] contains P, provided the family S[R^d, M, Φ] does so, and the first of these two families is smaller than the second one.

Verification of the claim is immediate. Let P ∈ P, so that for properly selected µ = µ_P ∈ M and for all e ∈ R^d it holds

ln ∫_{R^d} exp{e^T ω} P(dω) ≤ Φ(e; µ_P).

On the other hand, for every g ∈ G we have φ_X(g) − g^T ω ≥ 0 on the support of P, whence for every h ∈ R^d one has

ln ∫_{R^d} exp{h^T ω} P(dω) ≤ ln ∫_{R^d} exp{h^T ω + φ_X(g) − g^T ω} P(dω) ≤ φ_X(g) + Φ(h − g; µ_P).

Since the resulting inequality holds true for all g ∈ G, we get

ln ∫_{R^d} exp{h^T ω} P(dω) ≤ Φ̂(h; µ_P)   ∀h ∈ R^d,

implying that P ∈ S[R^d, M, Φ̂]; because P ∈ P is arbitrary, the first part of the claim is justified. The inclusion S[R^d, M, Φ̂] ⊂ S[R^d, M, Φ] is readily given by the inequality Φ̂ ≤ Φ, and the latter is due to Φ̂(h; µ) ≤ Φ(h − 0; µ) + φ_X(0) = Φ(h; µ) (recall that 0 ∈ G and φ_X(0) = 0).
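A minimal univariate sketch of this refinement (made-up numbers): with X = [−1, 1], so φ_X(g) = |g|, original bound Φ(h; µ) = µh + h²/2, and G a compact interval containing the origin, the grid infimum below confirms both parts of the claim—Φ̂ ≤ Φ, and Φ̂ still upper-bounds the log-MGF of a distribution supported on X.

```python
import math

# X = [-1, 1] (phi_X(g) = |g|); original regular data with Phi(h) = mu*h + h^2/2
# (valid for P below by Hoeffding's lemma); G = [-5, 5], scanned on a grid.
mu = 0.2
p, atoms = 0.4, (-1.0, 1.0)      # two-point P supported on X, with mean mu
assert abs(p * atoms[0] + (1 - p) * atoms[1] - mu) < 1e-12

def Phi(h):
    return mu * h + h * h / 2

def Phi_hat(h):
    grid = [g / 100 for g in range(-500, 501)]
    return min(Phi(h - g) + abs(g) for g in grid)

for h in [x / 2 for x in range(-10, 11)]:
    log_mgf = math.log(p * math.exp(h * atoms[0]) + (1 - p) * math.exp(h * atoms[1]))
    assert Phi_hat(h) <= Phi(h) + 1e-12       # the refined family is smaller
    assert log_mgf <= Phi_hat(h) + 1e-9       # ...and still contains P
```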
Illustration: Distributions with bounded support revisited. In Section 2.8.1.2, given a convex compact set X ⊂ R^d with support function φ_X, we checked that the data H = R^d, M = X, Φ(h; µ) = h^T µ + (1/8)[φ_X(h) + φ_X(−h)]² is regular, and the family S[R^d, M, Φ] contains the family P[X] of all probability distributions supported on X. Moreover, for every µ ∈ M = X, the family S[R^d, {µ}, Φ|_{R^d×{µ}}] contains all distributions supported on X with expectation e[P] = µ. Note that Φ(h; e[P]) describes well the behavior of the logarithm F_P(h) = ln ∫_{R^d} e^{h^T ω} P(dω) of the moment-generating function of P ∈ P[X] when h is small (indeed, F_P(h) = h^T e[P] + O(‖h‖²) as h → 0), and by far overestimates F_P(h) when h is large. Utilizing the above construction, we replace Φ with the real-valued, convex-concave, and continuous on R^d × M = R^d × X (see Exercise 2.22) function

Φ̂(h; µ) = inf_g [Ψ̂(h, g; µ) := (h − g)^T µ + (1/8)[φ_X(h − g) + φ_X(−h + g)]² + φ_X(g)] ≤ Φ(h; µ).   (2.114)
It is easy to see that Φ̂(·;·) still ensures the inclusion P ∈ S[R^d, {e[P]}, Φ̂|_{R^d×{e[P]}}] for every distribution P ∈ P[X], and it “reproduces F_P(h) reasonably well” for both small and large h. Indeed, since F_P(h) ≤ Φ̂(h; e[P]) ≤ Φ(h; e[P]), for small h the function Φ̂(h; e[P]) reproduces F_P(h) even better than Φ(h; e[P]) does, and we clearly have

Φ̂(h; µ) ≤ (h − h)^T µ + (1/8)[φ_X(h − h) + φ_X(−h + h)]² + φ_X(h) = φ_X(h)   ∀µ,

and φ_X(h) is a correct description of F_P(h) for large h.

2.8.2 Main result

2.8.2.1 Situation & Construction
Assume we are given two collections of regular data with common Ω = R^d and common H, specifically, the collections (H, M_χ, Φ_χ), χ = 1, 2. We start with constructing a specific detector for the associated families of regular probability distributions P_χ = R[H, M_χ, Φ_χ], χ = 1, 2. When building the detector, we impose on the regular data in question the following

Assumption I: The regular data (H, M_χ, Φ_χ), χ = 1, 2, are such that the convex-concave function

Ψ(h; µ_1, µ_2) = (1/2)[Φ_1(−h; µ_1) + Φ_2(h; µ_2)] : H × (M_1 × M_2) → R   (2.115)

has a saddle point (min in h ∈ H, max in (µ_1, µ_2) ∈ M_1 × M_2).

A simple sufficient condition for the existence of a saddle point of (2.115) is

Condition A: The sets M_1 and M_2 are compact, and the function

Ψ̄(h) = max_{µ_1∈M_1, µ_2∈M_2} Ψ(h; µ_1, µ_2)

is coercive on H, meaning that Ψ̄(h_i) → ∞ along every sequence h_i ∈ H with ‖h_i‖_2 → ∞ as i → ∞.

Indeed, under Condition A, by the Sion–Kakutani Theorem (Theorem 2.22) it holds

SadVal[Ψ] := inf_{h∈H} max_{µ_1∈M_1,µ_2∈M_2} Ψ(h; µ_1, µ_2) [= inf_{h∈H} Ψ̄(h)] = sup_{µ_1∈M_1,µ_2∈M_2} inf_{h∈H} Ψ(h; µ_1, µ_2) [= sup_{µ_1,µ_2} Ψ̲(µ_1, µ_2)],

so that the optimization problems

(P): Opt(P) = min_{h∈H} Ψ̄(h)   and   (D): Opt(D) = max_{µ_1∈M_1,µ_2∈M_2} Ψ̲(µ_1, µ_2)

have equal optimal values. Under Condition A, problem (P) clearly is a problem of minimizing a continuous coercive function over a closed set and as such is solvable; thus, Opt(P) = Opt(D) is a real. Problem (D) clearly is the problem of maximizing over a compact set an upper semicontinuous (since Ψ is continuous) function taking real values and, perhaps, the value −∞, and not identically equal to −∞ (since Opt(D) is a real), and thus (D) is solvable. As a result, (P) and (D) are solvable with common optimal values, and therefore Ψ has a saddle point.
2.8.2.2 Main Result

An immediate (and essential) observation is as follows:

Proposition 2.39. In the situation of Section 2.8.2.1, let h ∈ H be such that the quantities

Ψ_1(h) = sup_{µ_1∈M_1} Φ_1(−h; µ_1),   Ψ_2(h) = sup_{µ_2∈M_2} Φ_2(h; µ_2)

are finite. Consider the affine detector

φ_h(ω) = h^T ω + κ,   κ := (1/2)[Ψ_1(h) − Ψ_2(h)].

Then

Risk[φ_h | R[H, M_1, Φ_1], R[H, M_2, Φ_2]] ≤ exp{(1/2)[Ψ_1(h) + Ψ_2(h)]}.

Proof. Let h satisfy the premise of the proposition. For every µ_1 ∈ M_1, we have Φ_1(−h; µ_1) ≤ Ψ_1(h), and for every P ∈ R[H, M_1, Φ_1] we have

∫_Ω exp{−h^T ω} P(dω) ≤ exp{Φ_1(−h; µ_1)}
for properly selected µ_1 ∈ M_1. Thus,

∫_Ω exp{−h^T ω} P(dω) ≤ exp{Ψ_1(h)}   ∀P ∈ R[H, M_1, Φ_1],

whence also

∫_Ω exp{−h^T ω − κ} P(dω) ≤ exp{Ψ_1(h) − κ} = exp{(1/2)[Ψ_1(h) + Ψ_2(h)]}   ∀P ∈ R[H, M_1, Φ_1].

Similarly, for every µ_2 ∈ M_2, we have Φ_2(h; µ_2) ≤ Ψ_2(h), and for every P ∈ R[H, M_2, Φ_2] we have

∫_Ω exp{h^T ω} P(dω) ≤ exp{Φ_2(h; µ_2)}

for properly selected µ_2 ∈ M_2. Thus,

∫_Ω exp{h^T ω} P(dω) ≤ exp{Ψ_2(h)}   ∀P ∈ R[H, M_2, Φ_2],

and

∫_Ω exp{h^T ω + κ} P(dω) ≤ exp{Ψ_2(h) + κ} = exp{(1/2)[Ψ_1(h) + Ψ_2(h)]}   ∀P ∈ R[H, M_2, Φ_2]. ✷
An immediate corollary is as follows:
Proposition 2.40. In the situation of Section 2.8.2.1 and under Assumption I, let us associate with a saddle point (h_∗; µ_1^∗, µ_2^∗) of the convex-concave function (2.115) the following entities:
• the risk

ǫ_⋆ := exp{Ψ(h_∗; µ_1^∗, µ_2^∗)};   (2.116)

this quantity is uniquely defined by the saddle point value of Ψ and thus is independent of how we select a saddle point;
• the detector φ_∗(ω)—the affine function of ω ∈ R^d given by

φ_∗(ω) = h_∗^T ω + a,   a = (1/2)[Φ_1(−h_∗; µ_1^∗) − Φ_2(h_∗; µ_2^∗)].   (2.117)
Then Risk[φ_∗ | R[H, M_1, Φ_1], R[H, M_2, Φ_2]] ≤ ǫ_⋆.

Consequences. Assume we are given L collections (H, M_ℓ, Φ_ℓ) of regular data on a common observation space Ω = R^d and with common H, and let P_ℓ = R[H, M_ℓ, Φ_ℓ] be the corresponding families of regular distributions. Assume also that for every pair (ℓ, ℓ′), 1 ≤ ℓ < ℓ′ ≤ L, the pair (H, M_ℓ, Φ_ℓ), (H, M_ℓ′, Φ_ℓ′) of regular data satisfies Assumption I, so that the convex-concave functions

Ψ_ℓℓ′(h; µ_ℓ, µ_ℓ′) = (1/2)[Φ_ℓ(−h; µ_ℓ) + Φ_ℓ′(h; µ_ℓ′)] : H × (M_ℓ × M_ℓ′) → R,   1 ≤ ℓ < ℓ′ ≤ L,

have saddle points (h_ℓℓ′^∗; (µ_ℓ^∗, µ_ℓ′^∗)) (min in h ∈ H, max in (µ_ℓ, µ_ℓ′) ∈ M_ℓ × M_ℓ′). These saddle points give rise to the affine detectors

φ_ℓℓ′(ω) = [h_ℓℓ′^∗]^T ω + (1/2)[Φ_ℓ(−h_ℓℓ′^∗; µ_ℓ^∗) − Φ_ℓ′(h_ℓℓ′^∗; µ_ℓ′^∗)],   1 ≤ ℓ < ℓ′ ≤ L,

and the quantities

ǫ_ℓℓ′ = exp{(1/2)[Φ_ℓ(−h_ℓℓ′^∗; µ_ℓ^∗) + Φ_ℓ′(h_ℓℓ′^∗; µ_ℓ′^∗)]},   1 ≤ ℓ < ℓ′ ≤ L;

by Proposition 2.40, ǫ_ℓℓ′ are upper bounds on the risks, taken w.r.t. P_ℓ, P_ℓ′, of the detectors φ_ℓℓ′:

∫_Ω e^{−φ_ℓℓ′(ω)} P(dω) ≤ ǫ_ℓℓ′  ∀P ∈ P_ℓ  &  ∫_Ω e^{φ_ℓℓ′(ω)} P(dω) ≤ ǫ_ℓℓ′  ∀P ∈ P_ℓ′,   1 ≤ ℓ < ℓ′ ≤ L.

Setting φ_ℓℓ′(·) = −φ_ℓ′ℓ(·) and ǫ_ℓℓ′ = ǫ_ℓ′ℓ when L ≥ ℓ > ℓ′ ≥ 1, and φ_ℓℓ(·) ≡ 0, ǫ_ℓℓ = 1, 1 ≤ ℓ ≤ L, we get a system of detectors and risks satisfying (2.80) and, consequently, can use these “building blocks” in the machinery developed so far for pairwise and multiple hypothesis testing from single and repeated observations (stationary, semi-stationary, and quasi-stationary).
Numerical example. To get some impression of how Proposition 2.40 extends the grasp of our computation-friendly machinery of test design, consider the following toy problem:
We are given an observation

ω = Ax + σ A Diag{√x_1, ..., √x_n} ξ,   (2.118)

where
• the unknown signal x is known to belong to a given convex compact subset M of the interior of R^n_+;
• A is a given n × n matrix of rank n, σ > 0 is a given noise intensity, and ξ ∼ N(0, I_n).

Our goal is to decide, via a K-repeated version of observations (2.118), on the pair of hypotheses x ∈ X_χ, χ = 1, 2, where X_1, X_2 are given nonempty convex compact subsets of M.

Note that an essential novelty, as compared to the standard Gaussian o.s., is that now we deal with zero mean Gaussian noise with covariance matrix Θ(x) = σ² A Diag{x} A^T depending on the true signal—the larger the signal, the greater the noise. We can easily process the situation in question utilizing the machinery developed in this section. Namely, let us set

H_χ = R^n,  M_χ = {(x, Diag{x}) : x ∈ X_χ} ⊂ R^n_+ × S^n_+,
Φ_χ(h; x, Ξ) = h^T Ax + (σ²/2) h^T [AΞA^T] h : H_χ × M_χ → R,   χ = 1, 2.
It is immediately seen that for χ = 1, 2, H_χ, M_χ, Φ_χ is regular data, and that the distribution P of observation (2.118) stemming from a signal x ∈ X_χ belongs to S[H_χ, M_χ, Φ_χ], so that we can use Proposition 2.40 to build an affine detector for the families P_χ, χ = 1, 2, of distributions of observations (2.118) stemming from signals x ∈ X_χ. The corresponding recipe boils down to finding a saddle point (h_∗; x_∗, y_∗) of the simple convex-concave function

Ψ(h; x, y) = (1/2) h^T A(y − x) + (σ²/4) h^T A Diag{x + y} A^T h

(min in h ∈ R^n, max in (x, y) ∈ X_1 × X_2). Such a point clearly exists and is easily found, and gives rise to the affine detector

φ_∗(ω) = h_∗^T ω + a,   a = (1/4) σ² h_∗^T A Diag{x_∗ − y_∗} A^T h_∗ − (1/2) h_∗^T A[x_∗ + y_∗],

such that

Risk[φ_∗ | P_1, P_2] ≤ exp{(1/2) h_∗^T A[y_∗ − x_∗] + (σ²/4) h_∗^T A Diag{x_∗ + y_∗} A^T h_∗}.   (2.119)
Note that we could also process the situation by defining the regular data as H_χ = R^n, M_χ^+ = X_χ, Φ_χ^+, χ = 1, 2, where

Φ_χ^+(h; x) = h^T Ax + (σ²θ/2) h^T AA^T h,   θ = max_{x∈X_1∪X_2} ‖x‖_∞,

which, basically, means passing from our actual observations (2.118) to the “more noisy” observations given by the Gaussian o.s.

ω = Ax + η,   η ∼ N(0, σ²θ AA^T).   (2.120)
It is easily seen that, for this Gaussian o.s., the risk Risk[φ_# | P_1, P_2] of the optimal detector φ_# can be upper-bounded by the risk Risk[φ_# | P_1^+, P_2^+] known to us, where P_χ^+ is the family of distributions of observations (2.120) induced by signals x ∈ X_χ. Note that Risk[φ_# | P_1^+, P_2^+] is seemingly the best risk bound available to us “within the realm of detector-based tests in simple o.s.’s.” The goal of the small numerical experiment we are about to report on is to understand how our new risk bound (2.119) compares to the “old” bound Risk[φ_# | P_1^+, P_2^+]. We use n = 16,

X_1 = {x ∈ R^16 : 0.001 ≤ x_1 ≤ δ, 0.001 ≤ x_i ≤ 1, 2 ≤ i ≤ 16},
X_2 = {x ∈ R^16 : 2δ ≤ x_1 ≤ 1, 0.001 ≤ x_i ≤ 1, 2 ≤ i ≤ 16},

and σ = 0.1. The “separation parameter” δ is set to 0.1. Finally, the 16 × 16 matrix A has condition number 100 (singular values 0.01^{(i−1)/15}, 1 ≤ i ≤ 16) and randomly oriented systems of left and right singular vectors. With this setup, a typical numerical result is as follows:
• the right-hand side in (2.119) is 0.4346, implying that with the detector φ_∗, a 6-repeated observation is sufficient to decide on our two hypotheses with risk ≤ 0.01;
• the quantity Risk[φ_# | P_1^+, P_2^+] is 0.8825, meaning that with the detector φ_#, we need at least a 37-repeated observation to guarantee risk ≤ 0.01.
When the separation parameter δ participating in the descriptions of X_1, X_2 is reduced to 0.01, the risks in question grow to 0.9201 and 0.9988, respectively (a 56-repeated observation suffices to decide on the hypotheses with risk 0.01 when φ_∗ is used, vs. a 3685-repeated observation needed when φ_# is used). The bottom line is that the new developments can indeed improve significantly the performance of our inferences.

2.8.2.3 Sub-Gaussian and Gaussian cases
For χ = 1, 2, let U_χ be a nonempty closed convex set in R^d, and V_χ be a compact convex subset of the interior of the positive semidefinite cone S^d_+. We assume that U_1 is compact. Setting

H_χ = Ω = R^d,  M_χ = U_χ × V_χ,  Φ_χ(h; θ, Θ) = θ^T h + (1/2) h^T Θh : H_χ × M_χ → R,  χ = 1, 2,   (2.121)
we get two collections (H, M_χ, Φ_χ), χ = 1, 2, of regular data. As we know from Section 2.8.1.2, for χ = 1, 2, the families of distributions S[R^d, M_χ, Φ_χ] contain the families SG[U_χ, V_χ] of sub-Gaussian distributions on R^d with sub-Gaussianity parameters (θ, Θ) ∈ U_χ × V_χ (see (2.106)), as well as the families G[U_χ, V_χ] of Gaussian distributions on R^d with parameters (θ, Θ) (expectation and covariance matrix) running through U_χ × V_χ. Besides this, the pair of regular data in question clearly satisfies Condition A. Consequently, the test T_∗^K given by the above construction
as applied to the collections of regular data (2.121) is well defined and allows us to decide on the hypotheses H_χ : P ∈ R[R^d, U_χ, V_χ], χ = 1, 2, on the distribution P underlying the K-repeated observation ω^K. The same test can also be used to decide on the stricter hypotheses H_χ^G, χ = 1, 2, stating that the observations ω_1, ..., ω_K are i.i.d. and drawn from a Gaussian distribution P belonging to G[U_χ, V_χ]. Our goal now is to process the situation in question in detail and to refine our conclusions on the risk of the test T_∗^1 when the Gaussian hypotheses H_χ^G are considered and the situation is symmetric, that is, when V_1 = V_2. Observe, first, that the convex-concave function Ψ from (2.115) in the current setting becomes

Ψ(h; θ_1, Θ_1, θ_2, Θ_2) = (1/2) h^T [θ_2 − θ_1] + (1/4) h^T Θ_1 h + (1/4) h^T Θ_2 h.   (2.122)
We are interested in solutions to the saddle point problem

    min_{h∈R^d}  max_{θ1∈U1, θ2∈U2, Θ1∈V1, Θ2∈V2}  Ψ(h; θ1, Θ1, θ2, Θ2)     (2.123)
associated with the function (2.122). From the structure of Ψ and compactness of U1, V1, V2, combined with the fact that Vχ, χ = 1, 2, are comprised of positive definite matrices, it immediately follows that saddle points do exist, and a saddle point (h∗; θ1∗, Θ1∗, θ2∗, Θ2∗) satisfies the relations

    (a)  h∗ = [Θ1∗ + Θ2∗]^{−1}[θ1∗ − θ2∗],
    (b)  h∗^T(θ1 − θ1∗) ≥ 0 ∀θ1 ∈ U1,  h∗^T(θ2∗ − θ2) ≥ 0 ∀θ2 ∈ U2,
    (c)  h∗^T Θ1 h∗ ≤ h∗^T Θ1∗ h∗ ∀Θ1 ∈ V1,  h∗^T Θ2 h∗ ≤ h∗^T Θ2∗ h∗ ∀Θ2 ∈ V2.
                                                                    (2.124)
From (2.124.a) it immediately follows that the affine detector φ∗(·) and risk ǫ⋆, as given by (2.116) and (2.117), are

    φ∗(ω) = h∗^T[ω − w∗] + ½ h∗^T[Θ1∗ − Θ2∗] h∗,  w∗ = ½[θ1∗ + θ2∗];
    ǫ⋆ = exp{−¼ [θ1∗ − θ2∗]^T [Θ1∗ + Θ2∗]^{−1} [θ1∗ − θ2∗]}
       = exp{−¼ h∗^T [Θ1∗ + Θ2∗] h∗}.
                                                                    (2.125)
Note that in the symmetric case (where V1 = V2), there always exists a saddle point of Ψ with Θ1∗ = Θ2∗,²¹ and the test T∗^1 associated with such a saddle point is quite transparent: it is the maximum likelihood test for two Gaussian distributions, N(θ1∗, Θ∗) and N(θ2∗, Θ∗), where Θ∗ is the common value of Θ1∗ and Θ2∗. The bound ǫ⋆ on the risk of the test is nothing but the Hellinger affinity of these two Gaussian distributions, or, equivalently,

    ǫ⋆ = exp{−⅛ [θ1∗ − θ2∗]^T Θ∗^{−1} [θ1∗ − θ2∗]}.
²¹ Indeed, from (2.122) it follows that when V1 = V2, the function Ψ(h; θ1, Θ1, θ2, Θ2) is symmetric w.r.t. Θ1, Θ2, implying similar symmetry of the function Ψ̄(θ1, Θ1, θ2, Θ2) = min_{h∈H} Ψ(h; θ1, Θ1, θ2, Θ2). Since Ψ̄ is concave, the set M of its maximizers over M1 × M2 (which, as we know, is nonempty) is symmetric w.r.t. the swap of Θ1 and Θ2 and is convex, implying that if (θ1, Θ1, θ2, Θ2) ∈ M, then (θ1, ½[Θ1 + Θ2], θ2, ½[Θ1 + Θ2]) ∈ M as well, and the latter point is the desired component of the saddle point of Ψ with Θ1 = Θ2.
138
CHAPTER 2
We arrive at the following result:

Proposition 2.41. In the symmetric sub-Gaussian case (i.e., in the case of (2.121) with V1 = V2), the saddle point problem (2.122), (2.123) admits a saddle point of the form (h∗; θ1∗, Θ∗, θ2∗, Θ∗), and the associated affine detector and its risk are given by

    φ∗(ω) = h∗^T[ω − w∗],  w∗ = ½[θ1∗ + θ2∗];
    ǫ⋆ = exp{−⅛ [θ1∗ − θ2∗]^T Θ∗^{−1} [θ1∗ − θ2∗]}.

As a result, when deciding, via ω^K, on the "sub-Gaussian hypotheses" Hχ, χ = 1, 2, the risk of the test T∗^K associated with φ∗^{(K)}(ω^K) := Σ_{t=1}^K φ∗(ω_t) is at most ǫ⋆^K.
In the symmetric single-observation Gaussian case, that is, when V1 = V2 and we apply the test T∗ = T∗^1 to observation ω ≡ ω1 in order to decide on the hypotheses H^G_χ, χ = 1, 2, the above risk bound can be improved:

Proposition 2.42. Consider the symmetric case V1 = V2 = V, let (h∗; θ1∗, Θ1∗, θ2∗, Θ2∗) be the "symmetric" saddle point (with Θ1∗ = Θ2∗ = Θ∗) of the function Ψ given by (2.122), and let φ∗ be the affine detector given by (2.124) and (2.125):

    φ∗(ω) = h∗^T[ω − w∗],  h∗ = ½ Θ∗^{−1}[θ1∗ − θ2∗],  w∗ = ½[θ1∗ + θ2∗].

Let also

    δ = √(h∗^T Θ∗ h∗) = ½ √([θ1∗ − θ2∗]^T Θ∗^{−1} [θ1∗ − θ2∗]),

so that

    δ² = h∗^T[θ1∗ − w∗] = h∗^T[w∗ − θ2∗]  and  ǫ⋆ = exp{−½ δ²}.     (2.126)

Let, further, α ≤ δ², β ≤ δ². Then

    (a)  ∀(θ ∈ U1, Θ ∈ V):  Prob_{ω∼N(θ,Θ)}{φ∗(ω) ≤ α} ≤ Erfc(δ − α/δ),
    (b)  ∀(θ ∈ U2, Θ ∈ V):  Prob_{ω∼N(θ,Θ)}{φ∗(ω) ≥ −β} ≤ Erfc(δ − β/δ).
                                                                    (2.127)
In particular, when deciding, via a single observation ω, on the Gaussian hypotheses H^G_χ, χ = 1, 2, with H^G_χ stating that ω ∼ N(θ, Θ) with (θ, Θ) ∈ Uχ × V, the risk of the test T∗^1 associated with φ∗ is at most Erfc(δ).

Proof. Let us prove (a) (the proof of (b) is completely similar). For θ ∈ U1, Θ ∈ V we have

    Prob_{ω∼N(θ,Θ)}{φ∗(ω) ≤ α} = Prob_{ω∼N(θ,Θ)}{h∗^T[ω − w∗] ≤ α}
      = Prob_{ξ∼N(0,I)}{h∗^T[θ + Θ^{1/2}ξ − w∗] ≤ α}
      = Prob_{ξ∼N(0,I)}{[Θ^{1/2}h∗]^T ξ ≤ α − h∗^T[θ − w∗]}
           [and h∗^T[θ − w∗] ≥ h∗^T[θ1∗ − w∗] = δ² by (2.124.b) and (2.126)]
      ≤ Prob_{ξ∼N(0,I)}{[Θ^{1/2}h∗]^T ξ ≤ α − δ²}
      = Erfc([δ² − α]/‖Θ^{1/2}h∗‖_2)
      ≤ Erfc([δ² − α]/‖Θ∗^{1/2}h∗‖_2)
           [due to δ² − α ≥ 0 and h∗^T Θ h∗ ≤ h∗^T Θ∗ h∗ by (2.124.c)]
      = Erfc([δ² − α]/δ).
The "in particular" part of the proposition is readily given by (2.127) as applied with α = β = 0. ✷

Note that the progress, as compared to our results on the minimum risk detectors for convex hypotheses in the Gaussian o.s., is that we no longer assume that the covariance matrix is fixed once and for all. Now neither the mean nor the covariance matrix of the observed Gaussian random variable is known in advance. In this setting, the mean runs through a closed convex set (depending on the hypothesis), and the covariance runs, independently of the mean, through a given convex compact subset of the interior of the positive definite cone; this subset should be common for both hypotheses we are deciding upon.
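The improvement claimed in Proposition 2.42 is easy to quantify numerically. In the sketch below, Erfc(t) denotes, as in the proposition, the Gaussian upper tail Prob{ξ ≥ t}, ξ ∼ N(0,1) (this reading of the notation is an assumption of the sketch), expressed through Python's standard complementary error function; one sees that the Gaussian-case bound Erfc(δ) is uniformly smaller than the Hellinger-type bound ǫ⋆ = exp{−δ²/2}:

```python
import math

def normal_upper_tail(t):
    """Erfc(t) in the sense used here: Prob{xi >= t} for xi ~ N(0,1),
    via the standard complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

for delta in (1.0, 2.0, 3.0):
    gauss_bound = normal_upper_tail(delta)       # risk bound of Proposition 2.42
    hellinger_bound = math.exp(-0.5 * delta**2)  # bound eps_star of Proposition 2.41
    print(delta, round(gauss_bound, 5), round(hellinger_bound, 5))
```

At δ = 1 the two bounds are roughly 0.159 vs. 0.607, so the refinement is far from cosmetic.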
2.9

BEYOND THE SCOPE OF AFFINE DETECTORS: LIFTING THE OBSERVATIONS

2.9.1

Motivation
The detectors considered in Section 2.8 were affine functions of observations. Note, however, that what an observation is, to some extent, depends on us. To give an instructive example, consider the Gaussian observation ζ = A[u; 1] + ξ ∈ R^d, where u is an unknown signal known to belong to a given set U ⊂ R^n, u ↦ A[u; 1] is a given affine mapping from R^n into the observation space R^d, and ξ is zero mean Gaussian observation noise with covariance matrix Θ known to belong to a given convex compact subset V of the interior of the positive semidefinite cone S^d_+. Treating the observation "as is," a detector affine in the observation is affine in [u; ξ]. On the other hand, we can treat as our observation the image of the actual observation ζ under any deterministic mapping, e.g., the "quadratic lifting" ζ ↦ (ζ, ζζ^T). A detector affine in the new observation is quadratic in u and ξ: we get access to a wider set of detectors as compared to those affine in ζ! At first glance, applying our "affine detectors" machinery to appropriate "nonlinear liftings" of actual observations, we can handle quite complicated detectors, e.g., polynomial, of arbitrary degree, in ζ. The bottleneck here stems from the fact that in general it is difficult to "cover" the distribution of a "nonlinearly lifted" observation ζ (even one as simple as the Gaussian observation above) by an explicitly defined family of regular distributions, and such a "covering" is what we need in order to apply our affine detector machinery to the lifted observation. It turns out, however, that in some important cases the desired covering is achievable. We are about to demonstrate that this takes place in the case of the quadratic lifting ζ ↦ (ζ, ζζ^T) of a (sub-)Gaussian observation ζ, and the resulting quadratic detectors allow us to handle some important inference problems which are far beyond the grasp of "genuinely affine" detectors.
2.9.2
Quadratic lifting: Gaussian case
Given a positive integer d, we define E^d as the linear space R^d × S^d equipped with the inner product

    ⟨(z, S), (z′, S′)⟩ = z^T z′ + ½ Tr(SS′).
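The point of this inner product is that an affine functional on E^d, evaluated at the quadratic lifting (z, zz^T), is exactly a quadratic function of z: ⟨(z, zz^T), (h, H)⟩ = h^T z + ½ z^T H z. A quick sanity check of this identity (with hypothetical random data):

```python
import numpy as np

def inner_Ed(z, S, z2, S2):
    """Inner product on E^d = R^d x S^d: <(z,S),(z',S')> = z^T z' + (1/2) Tr(S S')."""
    return z @ z2 + 0.5 * np.trace(S @ S2)

rng = np.random.default_rng(0)
z = rng.standard_normal(3)
h = rng.standard_normal(3)
H = rng.standard_normal((3, 3))
H = H + H.T                                            # make H symmetric

affine_in_lifting = inner_Ed(z, np.outer(z, z), h, H)  # affine functional at (z, zz^T)
quadratic_in_z = h @ z + 0.5 * z @ H @ z               # the same value, quadratic in z
print(np.isclose(affine_in_lifting, quadratic_in_z))   # True
```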
Note that the quadratic lifting z ↦ (z, zz^T) maps the space R^d into E^d. In the sequel, an instrumental role is played by the following result.

Proposition 2.43. (i) Assume we are given
• a nonempty and bounded subset U of R^n;
• a convex compact set V contained in the interior of the cone S^d_+ of positive semidefinite d × d matrices;
• a d × (n + 1) matrix A.

These data specify the family G_A[U, V] of distributions of quadratic liftings (ζ, ζζ^T) of Gaussian random vectors ζ ∼ N(A[u; 1], Θ) stemming from u ∈ U and Θ ∈ V. Let us select some

1. γ ∈ (0, 1),
2. a convex compact subset Z of the set Z^n = {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1} such that

    Z(u) := [u; 1][u; 1]^T ∈ Z  ∀u ∈ U,     (2.128)
3. a positive definite d × d matrix Θ∗ ∈ S^d_+ and δ ∈ [0, 2] such that

    Θ ⪯ Θ∗  ∀Θ ∈ V  &  ‖Θ^{1/2} Θ∗^{−1/2} − I_d‖ ≤ δ  ∀Θ ∈ V,     (2.129)

where ‖·‖ is the spectral norm,²² and set

    H = Hγ := {(h, H) ∈ R^d × S^d : −γΘ∗^{−1} ⪯ H ⪯ γΘ∗^{−1}},

    Φ_{A,Z}(h, H; Θ) = −½ ln Det(I − Θ∗^{1/2} H Θ∗^{1/2}) + ½ Tr([Θ − Θ∗]H)
        + δ(2+δ) ‖Θ∗^{1/2} H Θ∗^{1/2}‖_F² / (2(1 − ‖Θ∗^{1/2} H Θ∗^{1/2}‖))
        + ½ φ_Z( B^T { [H, h; h^T, 0] + [H, h]^T [Θ∗^{−1} − H]^{−1} [H, h] } B ) : H × V → R,
                                                                    (2.130)

where B is given by

    B = [ A ; [0, ..., 0, 1] ],     (2.131)

the function

    φ_Z(Y) := max_{Z∈Z} Tr(ZY)     (2.132)

is the support function of Z, and ‖·‖_F is the Frobenius norm.

Function Φ_{A,Z} is continuous on its domain, convex in (h, H) ∈ H and concave

²² It is easily seen that with δ = 2, the second relation in (2.129) is satisfied for all Θ such that 0 ⪯ Θ ⪯ Θ∗, so that the restriction δ ≤ 2 is w.l.o.g.
in Θ ∈ V, so that (H, V, Φ_{A,Z}) is regular data. Besides this,

(#) Whenever u ∈ R^n is such that [u; 1][u; 1]^T ∈ Z and Θ ∈ V, the Gaussian random vector ζ ∼ N(A[u; 1], Θ) satisfies the relation

    ∀(h, H) ∈ H :  ln E_{ζ∼N(A[u;1],Θ)} { exp{½ ζ^T H ζ + h^T ζ} } ≤ Φ_{A,Z}(h, H; Θ).     (2.133)

The latter relation combines with (2.128) to imply that G_A[U, V] ⊂ S[H, V, Φ_{A,Z}].
In addition, Φ_{A,Z} is coercive in (h, H): Φ_{A,Z}(h_i, H_i; Θ) → +∞ as i → ∞ whenever Θ ∈ V, (h_i, H_i) ∈ H, and ‖(h_i, H_i)‖ → ∞, i → ∞.

(ii) Let two collections of entities from (i), (Vχ, Θ∗^{(χ)}, δχ, γχ, Aχ, Zχ), χ = 1, 2, with common d be given, giving rise to the sets Hχ, matrices Bχ, and functions Φ_{Aχ,Zχ}(h, H; Θ), χ = 1, 2. These collections specify the families of normal distributions

    Gχ = {N(v, Θ) : Θ ∈ Vχ & ∃u ∈ U : v = Aχ[u; 1]},  χ = 1, 2.

Consider the convex-concave saddle point problem

    SV = min_{(h,H)∈H1∩H2}  max_{Θ1∈V1, Θ2∈V2}  Φ(h, H; Θ1, Θ2),
    Φ(h, H; Θ1, Θ2) := ½ [Φ_{A1,Z1}(−h, −H; Θ1) + Φ_{A2,Z2}(h, H; Θ2)].
                                                                    (2.134)

A saddle point (H∗, h∗; Θ1∗, Θ2∗) in this problem does exist, and the induced quadratic detector

    φ∗(ω) = ½ ω^T H∗ ω + h∗^T ω + a,
    a := ½ [Φ_{A1,Z1}(−h∗, −H∗; Θ1∗) − Φ_{A2,Z2}(h∗, H∗; Θ2∗)],
                                                                    (2.135)

when applied to the families of Gaussian distributions Gχ, χ = 1, 2, has the risk Risk[φ∗ | G1, G2] ≤ ǫ⋆ := e^{SV}, that is,

    (a)  ∫_{R^d} e^{−φ∗(ω)} P(dω) ≤ ǫ⋆  ∀P ∈ G1,
    (b)  ∫_{R^d} e^{φ∗(ω)} P(dω) ≤ ǫ⋆  ∀P ∈ G2.
                                                                    (2.136)
For proof, see Section 2.11.5.
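Once the saddle point of (2.134) has been computed (by whatever convex optimization solver one prefers), applying the resulting detector is trivial. A minimal sketch follows; the particular h, H, a below are hypothetical, chosen only to illustrate the sign rule of a detector-based test (accept the first hypothesis when φ∗(ω) ≥ 0, the second otherwise):

```python
import numpy as np

def quadratic_detector(h, H, a):
    """phi(omega) = (1/2) omega^T H omega + h^T omega + a, as in (2.135)."""
    return lambda omega: 0.5 * omega @ H @ omega + h @ omega + a

def decide(phi, omega):
    """Single-observation detector-based test: hypothesis 1 iff phi(omega) >= 0."""
    return 1 if phi(omega) >= 0 else 2

# hypothetical detector separating zero-mean observations by magnitude:
# a negative quadratic part pushes large observations toward hypothesis 2
phi = quadratic_detector(h=np.zeros(2), H=-np.eye(2), a=2.0)
print(decide(phi, np.array([0.5, 0.5])))  # 1 (phi = 1.75 >= 0)
print(decide(phi, np.array([3.0, 0.0])))  # 2 (phi = -2.5 < 0)
```

Note that this purely quadratic, h = 0 shape is exactly the kind of detector that a variance-discriminating problem (third row of Table 2.2 below) calls for.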
Remark 2.44. Note that the computational effort to solve (2.134) reduces dramatically in the "easy case" of the situation described in item (ii) of Proposition 2.43 where

• the observations are direct, meaning that Aχ[u; 1] ≡ u, u ∈ R^d, χ = 1, 2;
• the sets Vχ are comprised of positive definite diagonal matrices, and the matrices Θ∗^{(χ)} are diagonal as well, χ = 1, 2;
• the sets Zχ, χ = 1, 2, are convex compact sets of the form

    Zχ = {Z ∈ S^{d+1} : Z ⪰ 0, Tr(Z Q_j^χ) ≤ q_j^χ, 1 ≤ j ≤ Jχ}
with diagonal matrices Q_j^χ,²³ and these sets intersect the interior of the positive semidefinite cone S^{d+1}_+.

In this case, the convex-concave saddle point problem (2.134) admits a saddle point (h∗, H∗; Θ1∗, Θ2∗) where h∗ = 0 and H∗ is diagonal.

Justifying the remark. In the easy case, we have Bχ = I_{d+1} and therefore

    Mχ(h, H) := Bχ^T { [H, h; h^T, 0] + [H, h]^T ([Θ∗^{(χ)}]^{−1} − H)^{−1} [H, h] } Bχ
      = [ H + H([Θ∗^{(χ)}]^{−1} − H)^{−1}H ,  h + H([Θ∗^{(χ)}]^{−1} − H)^{−1}h ;
          h^T + h^T([Θ∗^{(χ)}]^{−1} − H)^{−1}H ,  h^T([Θ∗^{(χ)}]^{−1} − H)^{−1}h ]

and

    φ_{Zχ}(Z) = max_W { Tr(ZW) : W ⪰ 0, Tr(W Q_j^χ) ≤ q_j^χ, 1 ≤ j ≤ Jχ }
              = min_λ { Σ_j q_j^χ λ_j : λ ≥ 0, Z ⪯ Σ_j λ_j Q_j^χ },
where the last equality is due to semidefinite duality.²⁴ From the second representation of φ_{Zχ}(·) and the fact that all Q_j^χ are diagonal, it follows that φ_{Zχ}(Mχ(−h, H)) = φ_{Zχ}(Mχ(h, H)) (indeed, with diagonal Q_j^χ, if λ is feasible for the minimization problem participating in the representation when Z = Mχ(h, H), it clearly remains feasible when Z is replaced with Mχ(−h, H)). This, in turn, combines straightforwardly with (2.130) to imply that when replacing h∗ with 0 in a saddle point (h∗, H∗; Θ1∗, Θ2∗) of (2.134), we end up with another saddle point of (2.134). In other words, when solving (2.134), we can from the very beginning set h to 0, thus converting (2.134) into the convex-concave saddle point problem

    SV = min_{H : (0,H)∈H1∩H2}  max_{Θ1∈V1, Θ2∈V2}  Φ(0, H; Θ1, Θ2).     (2.137)

Taking into account that we are in the case where all matrices from the sets Vχ, like the matrices Θ∗^{(χ)} and all the matrices Q_j^χ, χ = 1, 2, are diagonal, it is immediate to verify that Φ(0, H; Θ1, Θ2) = Φ(0, EHE; Θ1, Θ2) for any d × d diagonal matrix E with diagonal entries ±1. Due to convexity-concavity of Φ, this implies that (2.137) admits a saddle point (0, H∗; Θ1∗, Θ2∗) with H∗ invariant w.r.t. transformations H∗ ↦ EH∗E with the above E, that is, with diagonal H∗, as claimed. ✷

2.9.3
Quadratic lifting—Does it help?
Assume that for χ = 1, 2, we are given

• affine mappings u ↦ Aχ(u) = Aχ[u; 1] : R^{nχ} → R^d,
• nonempty convex compact sets Uχ ⊂ R^{nχ},
• nonempty convex compact sets Vχ ⊂ int S^d_+.

These data define families Gχ of Gaussian distributions on R^d: Gχ is comprised of all

²³ In terms of the sets Uχ, this assumption means that the latter sets are given by linear inequalities on the squares of entries in u.
²⁴ See Section 4.1 (or [187, Section 7.1]) for more details.
distributions N(Aχ(u), Θ) with u ∈ Uχ and Θ ∈ Vχ. The data also define families SGχ of sub-Gaussian distributions on R^d: SGχ is comprised of all sub-Gaussian distributions with parameters (Aχ(u), Θ), (u, Θ) ∈ Uχ × Vχ. Assume we observe a random variable ζ ∈ R^d drawn from a distribution P known to belong to G1 ∪ G2, and our goal is to decide, from a stationary K-repeated version of our observation, on the pair of hypotheses Hχ : P ∈ Gχ, χ = 1, 2; we refer to this situation as the Gaussian case, and we assume from now on that we are in this case.²⁵ At present, we have developed two approaches to building detector-based tests for H1, H2:

A. Utilizing the affine in ζ detector φ_aff given by a solution to the saddle point problem (see (2.122), (2.123), and set θχ = Aχ(uχ) with uχ running through Uχ)

    SadVal_aff = min_{h∈R^d}  max_{u1∈U1, u2∈U2, Θ1∈V1, Θ2∈V2}  [ ½ h^T[A2(u2) − A1(u1)] + ¼ h^T[Θ1 + Θ2] h ];
this detector satisfies the risk bound Risk[φ_aff | G1, G2] ≤ exp{SadVal_aff}.

Q. Utilizing the quadratic in ζ detector φ_lift given by Proposition 2.43.ii, with the risk bound Risk[φ_lift | G1, G2] ≤ exp{SadVal_lift}, with SadVal_lift given by (2.134).

A natural question is, which of these options results in a better risk bound. Note that we cannot just say "clearly, the second option is better, since there are more quadratic detectors than affine ones"; the difficulty is that the key relation (2.133), in the context of Proposition 2.43, is an inequality rather than an equality.²⁶ We are about to show that under reasonable assumptions, the second option indeed is better:

Proposition 2.45. In the situation in question, assume that the sets Vχ, χ = 1, 2, contain the largest elements, and that these elements are taken as the matrices Θ∗^{(χ)} participating in Proposition 2.43.ii. Let, further, the convex compact sets Zχ participating in Proposition 2.43.ii satisfy

    Zχ ⊂ Z̄χ := { Z = [W, u; u^T, 1] ⪰ 0 : u ∈ Uχ }     (2.138)

(this assumption does not restrict generality, since Z̄χ is, along with Uχ, a closed convex set which clearly contains all matrices [u; 1][u; 1]^T with u ∈ Uχ). Then

    SadVal_lift ≤ SadVal_aff,
(2.139)
that is, option Q is at least as efficient as option A.

²⁵ It is easily seen that what follows can be straightforwardly extended to the sub-Gaussian case, where the hypotheses we would decide upon state that P ∈ SGχ.
²⁶ One cannot make (2.133) an equality by redefining the right-hand side function: it will lose the convexity-concavity properties required in our context.
    ρ      σ1    σ2    unrestricted H and h    H = 0    h = 0
    0.5    2     2     0.31                    0.31     1.00
    0.5    1     4     0.24                    0.39     0.62
    0.01   1     4     0.41                    1.00     0.41

Table 2.2: Risk of the quadratic detector φ(ζ) = h^T ζ + ½ ζ^T H ζ + κ.
Proof. Let Aχ = [Āχ, aχ]. Looking at (2.122) (where one should substitute θχ = Aχ(uχ) with uχ running through Uχ) and taking into account that Θχ ⪯ Θ∗^{(χ)} ∈ Vχ when Θχ ∈ Vχ, we conclude that

    SadVal_aff = min_h  max_{u1∈U1, u2∈U2}  [ ½ h^T[Ā2 u2 − Ā1 u1 + a2 − a1] + ¼ h^T[Θ∗^{(1)} + Θ∗^{(2)}] h ].     (2.140)

At the same time, we have by Proposition 2.43.ii:

    SadVal_lift = min_{(h,H)∈H1∩H2}  max_{Θ1∈V1, Θ2∈V2}  ½ [Φ_{A1,Z1}(−h, −H; Θ1) + Φ_{A2,Z2}(h, H; Θ2)]
      ≤ min_{h∈R^d}  max_{Θ1∈V1, Θ2∈V2}  ½ [Φ_{A1,Z1}(−h, 0; Θ1) + Φ_{A2,Z2}(h, 0; Θ2)]
      = min_{h∈R^d} ½ [ ½ max_{Z1∈Z1} Tr( Z1 [0, −Ā1^T h; −h^T Ā1, −2h^T a1 + h^T Θ∗^{(1)} h] )
          + ½ max_{Z2∈Z2} Tr( Z2 [0, Ā2^T h; h^T Ā2, 2h^T a2 + h^T Θ∗^{(2)} h] ) ]
              [by direct computation utilizing (2.130)]
      ≤ min_{h∈R^d} ½ [ ½ max_{u1∈U1} [−2u1^T Ā1^T h − 2a1^T h + h^T Θ∗^{(1)} h]
          + ½ max_{u2∈U2} [2u2^T Ā2^T h + 2a2^T h + h^T Θ∗^{(2)} h] ]
              [due to (2.138)]
      = SadVal_aff,

where the concluding equality is due to (2.140). ✷
Numerical illustration. To get an impression of the performance of quadratic detectors as compared to affine ones under the premise of Proposition 2.45, we present here the results of an experiment where U1 = U1^ρ = {u ∈ R^12 : u_i ≥ ρ, 1 ≤ i ≤ 12}, U2 = U2^ρ = −U1^ρ, A1 = A2 ∈ R^{8×13}, and Vχ = {Θ∗^{(χ)} = σχ² I_8} are singletons. The risks of affine, quadratic, and "purely quadratic" (with h set to 0) detectors on the associated families G1, G2 are given in Table 2.2. We see that

• when deciding on families of Gaussian distributions with a common covariance matrix and expectations varying in the convex sets associated with the families, passing from the affine detectors described by Proposition 2.41 to quadratic detectors does not affect the risk (first row in the table). This should be expected: we are in the scope of the Gaussian o.s., where minimum risk affine detectors are optimal among all possible detectors.
• When deciding on families of Gaussian distributions in the case where distributions from different families can have close expectations (third row in the table), affine detectors are useless, while the quadratic ones are not, provided that Θ∗^{(1)} differs from Θ∗^{(2)}. This is how it should be: we are in the case where the first moments of the distribution of observation bear no definitive information on the family to which this distribution belongs, making affine detectors useless. In contrast, quadratic detectors are able to utilize the information (valuable when Θ∗^{(1)} ≠ Θ∗^{(2)}) "stored" in the second moments of the observation.
• "In general" (second row in the table), both the affine and the purely quadratic components in a quadratic detector are useful; suppressing one of them may increase significantly the attainable risk.
2.9.4
Quadratic lifting: Sub-Gaussian case
The sub-Gaussian version of Proposition 2.43 is as follows:

Proposition 2.46. (i) Assume we are given

• a nonempty and bounded subset U of R^n;
• a convex compact set V contained in the interior of the cone S^d_+ of positive semidefinite d × d matrices;
• a d × (n + 1) matrix A.

These data specify the family SG_A[U, V] of distributions of quadratic liftings (ζ, ζζ^T) of sub-Gaussian random vectors ζ with sub-Gaussianity parameters (A[u; 1], Θ) stemming from u ∈ U and Θ ∈ V. Let us select some

1. reals γ, γ⁺ such that 0 < γ < γ⁺ < 1,
2. a convex compact subset Z of the set Z^n = {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1} such that relation (2.128) takes place,
3. a positive definite d × d matrix Θ∗ ∈ S^d_+ and δ ∈ [0, 2] such that (2.129) takes place.

These data specify the closed convex sets

    H = Hγ := {(h, H) ∈ R^d × S^d : −γΘ∗^{−1} ⪯ H ⪯ γΘ∗^{−1}},
    Ĥ = Ĥ^{γ,γ⁺} := {(h, H, G) ∈ R^d × S^d × S^d : −γΘ∗^{−1} ⪯ H ⪯ γΘ∗^{−1}, 0 ⪯ G ⪯ γ⁺Θ∗^{−1}, H ⪯ G}

and the functions

    Ψ_{A,Z}(h, H, G) = −½ ln Det(I − Θ∗^{1/2} G Θ∗^{1/2})
        + ½ φ_Z( B^T { [H, h; h^T, 0] + [H, h]^T [Θ∗^{−1} − G]^{−1} [H, h] } B ) : Ĥ → R,

    Ψ^δ_{A,Z}(h, H, G; Θ) = −½ ln Det(I − Θ∗^{1/2} G Θ∗^{1/2}) + ½ Tr([Θ − Θ∗]G)
        + δ(2+δ) ‖Θ∗^{1/2} G Θ∗^{1/2}‖_F² / (2(1 − ‖Θ∗^{1/2} G Θ∗^{1/2}‖))
        + ½ φ_Z( B^T { [H, h; h^T, 0] + [H, h]^T [Θ∗^{−1} − G]^{−1} [H, h] } B ) : Ĥ × {0 ⪯ Θ ⪯ Θ∗} → R,
                                                                    (2.141)

where B is given by (2.131) and φ_Z(·) is the support function of Z given by (2.132),
along with

    Φ_{A,Z}(h, H) = min_G { Ψ_{A,Z}(h, H, G) : (h, H, G) ∈ Ĥ } : H → R,
    Φ^δ_{A,Z}(h, H; Θ) = min_G { Ψ^δ_{A,Z}(h, H, G; Θ) : (h, H, G) ∈ Ĥ } : H × {0 ⪯ Θ ⪯ Θ∗} → R.

Φ_{A,Z}(h, H) is convex and continuous on its domain, and Φ^δ_{A,Z}(h, H; Θ) is continuous on its domain, convex in (h, H) ∈ H and concave in Θ ∈ {0 ⪯ Θ ⪯ Θ∗}. Besides this,

(##) Whenever u ∈ R^n is such that [u; 1][u; 1]^T ∈ Z and Θ ∈ V, the sub-Gaussian random vector ζ with parameters (A[u; 1], Θ) satisfies the relation

    ∀(h, H) ∈ H:
    (a)  ln E_ζ { exp{½ ζ^T H ζ + h^T ζ} } ≤ Φ_{A,Z}(h, H),
    (b)  ln E_ζ { exp{½ ζ^T H ζ + h^T ζ} } ≤ Φ^δ_{A,Z}(h, H; Θ),
                                                                    (2.142)
which combines with (2.128) to imply that
    SG_A[U, V] ⊂ S[H, V, Φ_{A,Z}]  &  SG_A[U, V] ⊂ S[H, V, Φ^δ_{A,Z}].
(2.143)
In addition, Φ_{A,Z} and Φ^δ_{A,Z} are coercive in (h, H): Φ_{A,Z}(h_i, H_i) → +∞ and Φ^δ_{A,Z}(h_i, H_i; Θ) → +∞ as i → ∞ whenever Θ ∈ V, (h_i, H_i) ∈ H, and ‖(h_i, H_i)‖ → ∞, i → ∞.

(ii) Let two collections of data from (i), (Vχ, Θ∗^{(χ)}, δχ, γχ, γχ⁺, Aχ, Zχ), χ = 1, 2, with common d be given, giving rise to the sets Hχ, matrices Bχ, and functions Φ_{Aχ,Zχ}(h, H), Φ^{δχ}_{Aχ,Zχ}(h, H; Θ), χ = 1, 2. These collections specify the families SGχ = SG_{Aχ}[Uχ, Vχ] of sub-Gaussian distributions. Consider the convex-concave saddle point problem

    SV = min_{(h,H)∈H1∩H2}  max_{Θ1∈V1, Θ2∈V2}  Φ^{δ1,δ2}(h, H; Θ1, Θ2),
    Φ^{δ1,δ2}(h, H; Θ1, Θ2) := ½ [Φ^{δ1}_{A1,Z1}(−h, −H; Θ1) + Φ^{δ2}_{A2,Z2}(h, H; Θ2)].
                                                                    (2.144)

A saddle point (H∗, h∗; Θ1∗, Θ2∗) in this problem does exist, and the induced quadratic detector

    φ∗(ω) = ½ ω^T H∗ ω + h∗^T ω + a,
    a := ½ [Φ^{δ1}_{A1,Z1}(−h∗, −H∗; Θ1∗) − Φ^{δ2}_{A2,Z2}(h∗, H∗; Θ2∗)],

when applied to the families of sub-Gaussian distributions SGχ, χ = 1, 2, has the risk Risk[φ∗ | SG1, SG2] ≤ ǫ⋆ := e^{SV}.
As a result,

    (a)  ∫_{R^d} e^{−φ∗(ω)} P(dω) ≤ ǫ⋆  ∀P ∈ SG1,
    (b)  ∫_{R^d} e^{φ∗(ω)} P(dω) ≤ ǫ⋆  ∀P ∈ SG2.
Similarly, the convex minimization problem

    Opt = min_{(h,H)∈H1∩H2}  Φ(h, H),
    Φ(h, H) := ½ [Φ_{A1,Z1}(−h, −H) + Φ_{A2,Z2}(h, H)],
                                                                    (2.145)

is solvable, and the quadratic detector induced by its optimal solution (h∗, H∗),

    φ∗(ω) = ½ ω^T H∗ ω + h∗^T ω + a,
    a := ½ [Φ_{A1,Z1}(−h∗, −H∗) − Φ_{A2,Z2}(h∗, H∗)],
                                                                    (2.146)

when applied to the families of sub-Gaussian distributions SGχ, χ = 1, 2, has the risk Risk[φ∗ | SG1, SG2] ≤ ǫ⋆ := e^{Opt}, so that the bounds (a) and (b) above hold for the φ∗ and ǫ⋆ just defined.

For proof, see Section 2.11.6.

Remark 2.47. Proposition 2.46 offers two options for building quadratic detectors for the families SG1, SG2: those based on the saddle point of (2.144) and on the optimal solution to (2.145). Inspecting the proof, the number of options can be increased to 4: we can replace any of the functions Φ^{δχ}_{Aχ,Zχ}, χ = 1, 2 (or both these functions simultaneously), with Φ_{Aχ,Zχ}. The second of the original two options is exactly what we get when replacing both Φ^{δχ}_{Aχ,Zχ}, χ = 1, 2, with Φ_{Aχ,Zχ}. It is easily seen that, depending on the data, each of these four options can be the best, i.e., can result in the smallest risk bound. Thus, it makes sense to keep all these options in mind and to use the one which, under the circumstances, results in the best risk bound. Note that the risk bounds are efficiently computable, so that identifying the best option is easy.

2.9.5
Generic application: Quadratically constrained hypotheses
Propositions 2.43 and 2.46 operate with Gaussian/sub-Gaussian observations ζ with matrix parameters Θ running through convex compact subsets V of int S^d_+, and means of the form A[u; 1], with "signals" u running through given sets U ⊂ R^n. The constructions, however, involve additional entities: convex compact sets Z ⊂ Z^n := {Z ∈ S^{n+1}_+ : Z_{n+1,n+1} = 1} containing the quadratic liftings [u; 1][u; 1]^T of all signals u ∈ U. Other things being equal, the smaller Z is, the smaller the associated function Φ_{A,Z} (or Φ^δ_{A,Z}), and consequently, the smaller the (upper bounds on the) risks of the quadratic in ζ detectors we end up with. In order to implement these constructions, we need to understand how to build the required sets Z in an "economical" way. There is a relatively simple case when it is easy to get reasonable candidates for the role of Z: the case of a quadratically constrained signal set U,

    U = {u ∈ R^n : f_k(u) := u^T Q_k u + 2q_k^T u ≤ b_k, 1 ≤ k ≤ K}.     (2.147)

Indeed, the constraints f_k(u) ≤ b_k are just linear constraints on the quadratic lifting [u; 1][u; 1]^T of u:

    u^T Q_k u + 2q_k^T u ≤ b_k  ⇔  Tr(F_k [u; 1][u; 1]^T) ≤ b_k,  F_k = [Q_k, q_k; q_k^T, 0] ∈ S^{n+1}.
Consequently, in the case of (2.147), the simplest candidate for the role of Z is the set

    Z = {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1, Tr(F_k Z) ≤ b_k, 1 ≤ k ≤ K}.     (2.148)
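The equivalence behind (2.148) is purely mechanical and easy to check in code. A small sketch (with hypothetical random Q, q, u) that builds F = [Q, q; q^T, 0] and verifies Tr(F [u;1][u;1]^T) = u^T Q u + 2q^T u:

```python
import numpy as np

def lift_constraint(Q, q):
    """F = [Q, q; q^T, 0] in S^{n+1}, so that Tr(F [u;1][u;1]^T) = u^T Q u + 2 q^T u."""
    n = Q.shape[0]
    F = np.zeros((n + 1, n + 1))
    F[:n, :n] = Q
    F[:n, n] = q
    F[n, :n] = q
    return F

rng = np.random.default_rng(1)
n = 4
Q = rng.standard_normal((n, n))
Q = Q + Q.T                                            # symmetric Q
q = rng.standard_normal(n)
u = rng.standard_normal(n)
Z_u = np.outer(np.append(u, 1.0), np.append(u, 1.0))   # the lifting [u;1][u;1]^T
print(np.isclose(np.trace(lift_constraint(Q, q) @ Z_u), u @ Q @ u + 2 * q @ u))  # True
```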
This set clearly is closed and convex (the latter even when U itself is not convex), and it indeed contains the quadratic liftings [u; 1][u; 1]^T of all points u ∈ U. We also need compactness of Z; the latter definitely takes place when the quadratic constraints describing U contain a constraint of the form u^T u ≤ R², which, in turn, can be ensured, basically "for free," when U is bounded. It should be stressed that the "ideal" choice of Z would be the convex hull Z[U] of all rank 1 matrices [u; 1][u; 1]^T with u ∈ U; this definitely is the smallest convex set which contains the quadratic liftings of all points from U. Moreover, Z[U] is closed and bounded, provided U is so. The difficulty is that Z[U] can be computationally intractable (and thus useless in our context) already for pretty simple sets U of the form (2.147). The set (2.148) is a simple outer approximation of Z[U], and this approximation can be very loose: for instance, when U = {u : −1 ≤ u_k ≤ 1, 1 ≤ k ≤ n} is just the unit box in R^n, the set (2.148) is

    {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1, |Z_{k,n+1}| ≤ 1, 1 ≤ k ≤ n};

this set is not even bounded, while Z[U] clearly is bounded. There is, essentially, just one generic case when the set (2.148) is exactly equal to Z[U]: the case where U = {u : u^T Q u ≤ c}, Q ≻ 0, is an ellipsoid centered at the origin; the fact that in this case the set given by (2.148) is exactly Z[U] is a consequence of what is called the S-Lemma.
Applying the recipe of (2.148) to the latter description of U , we arrive at a significantly less conservative outer approximation of Z[U ], specifically, Z = {Z ∈ Sn+1 : Z 0, Zn+1,n+1 = 1, Zkk ≤ 1, 1 ≤ k ≤ n}. Not only the resulting set Z is bounded; we can get a reasonable “upper bound” on the discrepancy between Z and Z[U ]. Namely, denoting by Z o the matrix obtained from a symmetric n × n matrix Z by zeroing out the entry Zn+1,n+1 and keeping the remaining entries intact, we have Z o [U ] := {Z o : Z ∈ Z[U ]} ⊂ Z o := {Z o : Z ∈ Z} ⊂ O(1) ln(n + 1)Z o . This is a particular case of a general result (which goes back to [191]; we shall get this result as a byproduct of our forthcoming considerations, specifically, Proposition 4.6) as follows:
Let U be a bounded set given by a system of convex quadratic constraints without linear terms:

    U = {u ∈ R^n : u^T Q_k u ≤ c_k, 1 ≤ k ≤ K},  Q_k ⪰ 0, 1 ≤ k ≤ K,

and let Z be the associated set (2.148):

    Z = {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1, Tr(Z Diag{Q_k, 0}) ≤ c_k, 1 ≤ k ≤ K}.
Then

    Z^o[U] := {Z^o : Z ∈ Z[U]} ⊂ Z^o := {Z^o : Z ∈ Z} ⊂ 3 ln(√3(K + 1)) Z^o[U].

Note that when K = 1 (i.e., U is an ellipsoid centered at the origin), the factor 3 ln(√3(K + 1)), as was already mentioned, can be replaced by 1. One might think that the factor 3 ln(√3(K + 1)) is too large to be of interest; well, this is nearly the best factor one can get under the circumstances, and a nice fact is that the factor is "nearly independent" of K.

Finally, we remark that, as in the case of the box, we can try to reduce the conservatism of the outer approximation (2.148) of Z[U] by passing from the initial description of U to an equivalent one. The standard recipe here is to augment the constraints in the description of U with their quadratic consequences; for example, we can augment a pair of linear constraints q_i^T u ≤ c_i, q_j^T u ≤ c_j, assuming there is such a pair, with the quadratic constraint (c_i − q_i^T u)(c_j − q_j^T u) ≥ 0. While this constraint is redundant as far as the description of U itself is concerned, adding it reduces, sometimes significantly, the set given by (2.148). Informally speaking, the transition from (2.147) to (2.148) is by itself "too stupid" to utilize the fact (known to every kid) that the product of two nonnegative quantities is nonnegative; when augmenting the linear constraints in the description of U by their pairwise products, we somehow compensate for this stupidity. Unfortunately, while "computationally tractable" assistance of this type allows us to reduce the conservatism of (2.148), it usually does not allow us to eliminate it completely: a grave "fact of life" is that even in the case of the unit box U, the set Z[U] is computationally intractable. Scientifically speaking: maximizing quadratic forms over the unit box U is provably an NP-hard problem; were we able to get a computationally tractable description of Z[U], we would be able to solve this NP-hard problem efficiently, implying that P=NP.
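The pairwise-product augmentation just mentioned is also mechanical: (c_i − q_i^T u)(c_j − q_j^T u) ≥ 0 expands into a single linear constraint Tr(G [u;1][u;1]^T) ≥ 0 on the lifting. A sketch (with hypothetical q_i, q_j, c_i, c_j) constructing the symmetric matrix G and checking the identity:

```python
import numpy as np

def product_constraint(qi, ci, qj, cj):
    """Symmetric G in S^{n+1} with Tr(G [u;1][u;1]^T) = (ci - qi^T u)(cj - qj^T u)."""
    n = qi.size
    G = np.zeros((n + 1, n + 1))
    G[:n, :n] = 0.5 * (np.outer(qi, qj) + np.outer(qj, qi))  # quadratic part
    G[:n, n] = -0.5 * (ci * qj + cj * qi)                    # linear part (halved: it enters twice)
    G[n, :n] = G[:n, n]
    G[n, n] = ci * cj                                        # constant part
    return G

rng = np.random.default_rng(2)
qi, qj = rng.standard_normal(3), rng.standard_normal(3)
ci, cj = 1.5, 0.7
u = rng.standard_normal(3)
Z_u = np.outer(np.append(u, 1.0), np.append(u, 1.0))
lhs = np.trace(product_constraint(qi, ci, qj, cj) @ Z_u)
print(np.isclose(lhs, (ci - qi @ u) * (cj - qj @ u)))  # True
```

Adding the constraints Tr(G_ij Z) ≥ 0 to (2.148) for all available pairs (i, j) shrinks the outer approximation at the cost of at most K(K+1)/2 extra linear constraints.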
While we do not know for sure that the latter is not the case, the "informal odds" are strongly against this possibility. The bottom line is that while the approach we are discussing could in some situations result in quite conservative tests, "some" is by far not the same as "always"; on the positive side, this approach allows us to process some important problems. We are about to present a simple and instructive illustration.

2.9.5.1
Simple change detection
In Figure 2.8, you see a sample of frames from a “movie” in which a noisy picture of a dog gradually transforms into a noisy picture of a lady; several initial frames differ just by realizations of noise, and starting from some instant, the “signal” (the deterministic component of the image) starts to drift from the dog towards the lady. What, in your opinion, is the change point—the first time instant where
Figure 2.8: Frames from a "movie" (frames #1–#8, #15, #20, #28, and #30 shown).
the signal component of the image differs from the signal component of the initial image? A simple model of the situation is as follows: we observe, one by one, vectors (in fact, 2D arrays, but we can “vectorize” them) ωt = xt + ξt , t = 1, 2, ..., K,
(2.149)
where the x_t are deterministic components of the observations and the ξ_t are random noises. It may happen that for some τ ∈ {2, 3, ..., K}, the vectors x_t are independent of t when t < τ, and x_τ differs from x_{τ−1} ("τ is a change point"); if this is the case, τ is uniquely defined by x^K = (x_1, ..., x_K). The alternative is that x_t is independent of t for all 1 ≤ t ≤ K ("no change"). The goal is to decide, based on the observation ω^K = (ω_1, ..., ω_K), whether there was a change point, and if yes, then, perhaps, to localize it. The model we have just described is the simplest case of "change detection," where, given noisy observations on some time horizon, one is interested in detecting a "change" in some time series underlying the observations. In our simple model, this time series is comprised of the deterministic components x_t of the observations, and a "change at time τ" is understood in the most straightforward way: as the fact that x_τ differs from the preceding x_t's, which are equal to each other. In more complicated situations, our observations are obtained from the underlying time series {x_t} by a nonanticipative transformation, like

    ω_t = Σ_{s=1}^t A_{ts} x_s + ξ_t,  t = 1, ..., K,
and we still want to detect the change, if any, in the time series {xt }. As an instructive example, consider observations, taken along an equidistant time grid, of the positions of an aircraft which “normally” flies with constant velocity, but at some time instant can start to maneuver. In this situation, the underlying time series is comprised of the velocities of the aircraft at consecutive time instants, observations are obtained from this time series by integration, and to detect a maneuver means to detect that on the observation horizon, there was a change in the series of velocities. Change detection is the subject of a huge literature dealing with a wide range of models differing from each other in • whether we deal with direct observations of the time series of interest, as in (2.149), or with indirect ones (in the latter case, there is a wide spectrum of options related to how the observations depend on the underlying time series), • what are the assumptions on the noise, • what happens with the xt ’s after the change—do they jump from their common value prior to time τ to a new common value starting with this time, or start to depend on time (and if yes, then how), etc. A significant role in change detection is played by hypothesis testing; as far as affine/quadraticdetectorbased techniques developed in this section are concerned, their applications in the context of change detection are discussed in [50]. In what follows, we focus on the simplest of these applications. Situation and goal. We consider the situation as follows:
152
CHAPTER 2
1. Our observations are given by (2.149) with noises ξ_t ∼ N(0, σ²I_d) independent across t = 1, ..., K. We do not know σ a priori; what we know is that σ is independent of t and belongs to a given segment [σ̲, σ̄], with 0 < σ̲ ≤ σ̄;
2. Observations (2.149) arrive one by one, so that at time t, 2 ≤ t ≤ K, we have at our disposal the observation ω^t = (ω_1, ..., ω_t).

Our goal is to build a system of inferences T_t, 2 ≤ t ≤ K, such that T_t as applied to ω^t either infers that there was a change at time t or earlier, in which case we terminate, or infers that so far there has been no change, in which case we either proceed to time t + 1 (if t < K), or terminate (if t = K) with a "no change" conclusion. We are given ǫ ∈ (0, 1) and want our collection of inferences to satisfy the bound ǫ on the probability of false alarm (i.e., on the probability of terminating somewhere on the time horizon 2, 3, ..., K with a "there was a change" conclusion in the situation where there was no change: x_1 = ... = x_K). Under this restriction, we want to make as small as possible the probability of a miss (of not detecting the change at all in the situation where there was a change).

The "small probability of a miss" desire should be clarified. When the noise is nontrivial, we have no chance to detect very small changes while respecting the bound on the probability of false alarm. A realistic goal is to make as small as possible the probability of missing a not too small change, which can be formalized as follows. Given ρ > 0 and tolerances ǫ, ε ∈ (0, 1), let us look for a system of inferences {T_t : 2 ≤ t ≤ K} such that
• the probability of false alarm is at most ǫ, and
• the probability of a "ρ-miss" (the probability of detecting no change when there was a change of energy ≥ ρ², i.e., when there was a change at time τ and, moreover, it holds ‖x_τ − x_1‖_2² ≥ ρ²) is at most ε.

What we are interested in is achieving the goal just formulated with as small a ρ as possible.

Construction.
Let us select a large "safety parameter" R, like R = 10⁸ or even R = 10⁸⁰, so that we can assume that for all time series we are interested in it holds ‖x_t − x_τ‖_2² ≤ R² for all t, τ.²⁷ Let us associate with ρ > 0 the "signal hypotheses" H_t^ρ, t = 2, 3, ..., K, on the distribution of the observation ω^K given by (2.149), with H_t^ρ stating that at time t there is a change, of energy at least ρ², in the time series {x_t}_{t=1}^K underlying the observation ω^K:

x_1 = x_2 = ... = x_{t−1}  and  ‖x_t − x_{t−1}‖_2² = ‖x_t − x_1‖_2² ≥ ρ²

(and, on top of that, ‖x_t − x_τ‖_2² ≤ R² for all t, τ). Let us augment these hypotheses by the null hypothesis H_0 stating that there is no change at all: the observation ω^K stems from a stationary time series x_1 = x_2 = ... = x_K. We are about to use our machinery of detector-based tests in order to build a system of tests deciding, with partial risks ǫ and ε, on the null hypothesis vs. the "signal alternative" ⋃_t H_t^ρ for as small a ρ as possible. The implementation is as follows. Given ρ > 0 such that ρ² < R², consider two

²⁷ R is needed only to make the domains we are working with bounded, thus allowing us to apply the theory we have developed so far. The actual value of R does not enter our constructions and conclusions.
153
HYPOTHESIS TESTING
hypotheses, G_1 and G_2^ρ, on the distribution of the observation

ζ = x + ξ ∈ R^d.   (2.150)
Both hypotheses state that ξ ∼ N(0, σ²I_d) with unknown σ known to belong to a given segment Δ := [√2 σ̲, √2 σ̄]. In addition, G_1 states that x = 0, and G_2^ρ that ρ² ≤ ‖x‖_2² ≤ R². We can use the result of Proposition 2.43.ii to build a detector quadratic in ζ for the families of distributions P_1, P_2^ρ obeying the hypotheses G_1, G_2^ρ, respectively. To this end it suffices to apply the proposition to the collections

V_χ = {σ²I_d : σ ∈ Δ},  Θ_*^{(χ)} = 2σ̄²I_d,  δ_χ = 1 − σ̲/σ̄,  γ_χ = 0.999,  A_χ = I_d,  Z_χ,   [χ = 1, 2]

where

Z_1 = {[0; ...; 0; 1][0; ...; 0; 1]ᵀ} ⊂ S_+^{d+1},
Z_2 = Z_2^ρ = {Z ∈ S_+^{d+1} : Z_{d+1,d+1} = 1, 1 + R² ≥ Tr(Z) ≥ 1 + ρ²}.
The (upper bound on the) risk of the quadratic in ζ detector yielded by a saddle point of the function (2.134), as given by Proposition 2.43.ii, is immediate: by the same argument as used when justifying Remark 2.44, in the situation in question one can look for a saddle point with h = 0, H = ηI_d, and identifying the required η reduces to solving the univariate convex problem

Opt(ρ) = min_η { (1/2)[ −(d/2) ln(1 − σ̂⁴η²) − (d/2) σ̂²(1 − σ̲²/σ̄²)η + dδ(2+δ)σ̂⁴η²/(1 + σ̂²η) + ρ²η/(2(1 − σ̂²η)) ] : −γ ≤ σ̂²η ≤ 0 },
σ̂ = √2 σ̄,  δ = 1 − σ̲/σ̄,

which can be done in no time by bisection. The resulting detector and the upper bound on its risk are given by the optimal solution η(ρ) to the latter problem according to

φ_{*ρ}(ζ) = (1/2) η(ρ) ζᵀζ + a(ρ),
a(ρ) = (d/4)[ ln((1 − σ̂²η(ρ))/(1 + σ̂²η(ρ))) − σ̂²(1 − σ̲²/σ̄²) η(ρ) ] − ρ²η(ρ)/(d(1 − σ̂²η(ρ)))
with

Risk[φ_{*ρ} | P_1, P_2] ≤ Risk(ρ) := e^{Opt(ρ)}
(observe that R appears neither in the definition of the optimal detector nor in the risk bound). It is immediately seen that Opt(ρ) → 0 as ρ → +0 and Opt(ρ) → −∞ as ρ → +∞, implying that given κ ∈ (0, 1), we can easily find by bisection ρ = ρ(κ) such that Risk(ρ) = κ; in what follows, we assume w.l.o.g. that R > ρ(κ) for the value of κ we end with; see below. Next, let us pass from the detector φ∗ρ(κ) (·) to its shift φ∗,κ (ζ) = φ∗ρ(κ) (ζ) + ln(ε/κ), so that for the simple test T κ which, given observation ζ, accepts G1 and rejects
G_2^{ρ(κ)} whenever φ_{*,κ}(ζ) ≥ 0, and accepts G_2^{ρ(κ)} and rejects G_1 otherwise, it holds

Risk_1(T^κ | G_1, G_2^{ρ(κ)}) ≤ κ²/ε,   Risk_2(T^κ | G_1, G_2^{ρ(κ)}) ≤ ε;   (2.151)
see Proposition 2.14 and (2.48).

We are nearly done. Given κ ∈ (0, 1), consider the system of tests T_t^κ, t = 2, 3, ..., K, as follows. At time t ∈ {2, 3, ..., K}, given the observations ω_1, ..., ω_t stemming from (2.149), let us form the vector ζ_t = ω_t − ω_1 and compute the quantity φ_{*,κ}(ζ_t). If this quantity is negative, we claim that the change has already taken place and terminate; otherwise we claim that so far, there was no change, and proceed to time t + 1 (if t < K) or terminate (if t = K).

The risk analysis for the resulting system of inferences is immediate. Observe that

(!) For every t = 2, 3, ..., K:
• if there is no change on the time horizon 1, ..., t: x_1 = x_2 = ... = x_t (case A), the probability for T_t^κ to conclude that there was a change is at most κ²/ε;
• if, on the other hand, ‖x_t − x_1‖_2² ≥ ρ²(κ) (case B), then the probability for T_t^κ to conclude that so far there was no change is at most ε.

Indeed, we clearly have
ζ_t = [x_t − x_1] + ξ^t,

where ξ^t = ξ_t − ξ_1 ∼ N(0, σ̃²I_d) with σ̃ = √2 σ ∈ [√2 σ̲, √2 σ̄]. Our action at time t is nothing but the application of the test T^κ to the observation ζ_t. In case A the distribution of this observation obeys the hypothesis G_1, and the probability for T_t^κ to claim that there was a change is at most κ²/ε by the first inequality in (2.151). In case B, the distribution of ζ_t obeys the hypothesis G_2^{ρ(κ)}, and thus the probability for T_t^κ to claim that there was no change on the time horizon 1, ..., t is ≤ ε by the second inequality in (2.151).

In view of (!), the probability of false alarm for the system of inferences {T_t^κ}_{t=2}^K is at most (K − 1)κ²/ε, and specifying κ as

κ = √(ǫε/(K − 1)),
we make this probability ≤ ǫ. The resulting procedure, by the same (!), detects a change at time t ∈ {2, 3, ..., K} with probability at least 1 − ε, provided that the energy of this change is at least ρ_*², with

ρ_* = ρ(√(ǫε/(K − 1))).   (2.152)
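The protocol just described is easy to sketch in code. The snippet below is a minimal, illustrative implementation of the sequential scheme: it keeps the "compare ζ_t = ω_t − ω_1 against a detector and stop at the first alarm" logic, but replaces the optimized quadratic detector φ_{*,κ} with a cruder energy test whose threshold is calibrated by a Bonferroni bound over the K − 1 comparisons. All numerical settings (d, K, σ, ǫ) are hypothetical and much smaller than in the movie experiment discussed below.

```python
import numpy as np
from scipy.stats import chi2

def detect_change(omega, sigma_bar, eps):
    """Scan observations omega (K x d array, one row per time instant);
    return the 1-based time of the first declared change, or None for a
    'no change' conclusion."""
    K, d = omega.shape
    # zeta_t = omega_t - omega_1 is N(x_t - x_1, 2*sigma^2 I_d) with
    # sigma <= sigma_bar; under 'no change', zeta'zeta / (2*sigma^2) is
    # chi-square with d degrees of freedom.  Bonferroni over K-1 tests:
    threshold = 2.0 * sigma_bar**2 * chi2.ppf(1.0 - eps / (K - 1), df=d)
    for t in range(1, K):
        zeta = omega[t] - omega[0]
        if zeta @ zeta > threshold:   # energy too large: declare a change
            return t + 1
    return None

d, K = 64, 9
x = np.zeros((K, d))
x[4:] += 2.0                          # change at time t = 5, energy 4*d = 256
rng = np.random.default_rng(0)
omega = x + rng.standard_normal((K, d))        # noisy observations, sigma = 1
print(detect_change(omega, sigma_bar=1.0, eps=0.01))
print(detect_change(np.zeros((K, d)), sigma_bar=1.0, eps=0.01))  # None
```

With these (hypothetical) settings the threshold is about 207, well below the change energy 256, so even the noiseless sequence x itself triggers detection at t = 5.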
In fact we can say a bit more:
Proposition 2.48. Let the deterministic sequence x_1, ..., x_K underlying observations (2.149) be such that for some t it holds ‖x_t − x_1‖_2² ≥ ρ_*², with ρ_* given by
(2.152). Then the probability for the system of inferences we have built to detect a change at time t or earlier is at least 1 − ε.

Indeed, under the premise of the proposition, the probability for T_t^κ to claim that a change already took place is at least 1 − ε, and this probability can only be smaller than the probability to detect a change on the time horizon 2, 3, ..., t.

How it works. As applied to the "movie" story we started with, the outlined procedure works as follows. The images in question are of size 256 × 256, so that we are in the case of d = 256² = 65536. The images are represented by 2D arrays in gray scale, that is, as 256 × 256 matrices with entries in the range [0, 255]. In the experiment to be reported (just as in the movie) we assumed the maximal noise intensity σ̄ to be 10, and used σ̲ = σ̄/√2. The reliability tolerances ǫ, ε were set to 0.01, and K was set to 9, resulting in ρ_*² = 7.38·10⁶, which corresponds to the per pixel energy ρ_*²/65536 = 112.68, just 12% above the allowed expected per pixel energy of the noise (the latter is σ̄² = 100). The resulting detector is

φ_*(ζ) = −2.7138 ζᵀζ/10⁵ + 366.9548.

In other words, the test T_t^κ claims that the change took place when the average, over pixels, per pixel energy in the difference ω_t − ω_1 is at least 206.33, which is pretty close to the expected per pixel energy (200.0) in the noise ξ_t − ξ_1 affecting the difference ω_t − ω_1.

Finally, this is how the system of inferences just described worked in simulations. The underlying sequence of images is obtained from the "basic sequence"

x̄_t = D + 0.0357(t − 1)(L − D),  t = 1, 2, ...,²⁸   (2.153)

where D is the image of the dog and L is the image of the lady (up to noise, these are the first and the last frames in Figure 2.8). To get the observations in a particular simulation, we augment this sequence from the left by a random number of images D in such a way that with probability 1/2 there was no change of image on the time horizon 1, 2, ..., 9, and with probability 1/2 there was a change at a time instant τ chosen at random from the uniform distribution on {2, 3, ..., 9}. The observation is obtained by taking the first nine images in the resulting sequence and adding to them observation noises, independent across the images, drawn at random from N(0, 100·I_65536). In the series of 3,000 simulations of this type we have not observed a single false alarm, while the empirical probability of a miss was 0.0553. Besides, the change at time t, if detected, was never detected with a delay of more than 1. Finally, in the particular "movie" in Figure 2.8 the change takes place at time t = 3, and the system of inferences we have just developed discovered the change at time 4. How does this compare to the time when you managed to detect the change?

"Numerical near-optimality." Recall that beyond the realm of simple o.s.'s we

²⁸ The coefficient 0.0357 corresponds to a 28-frame linear transition from D to L.
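The numbers in the "How it works" paragraph above are easy to verify; the following snippet checks that the reported detector's zero-crossing corresponds to the stated per-pixel energy threshold, and that the other per-pixel figures are mutually consistent:

```python
d = 256 * 256                           # number of pixels
c, a = -2.7138 / 1e5, 366.9548          # detector phi_*(zeta) = c * zeta'zeta + a
per_pixel_threshold = (-a / c) / d      # per-pixel energy at which phi_* hits 0
print(per_pixel_threshold)              # ~206.33
assert abs(per_pixel_threshold - 206.33) < 0.05
assert 2 * 10.0**2 == 200.0             # expected per-pixel energy of xi_t - xi_1
assert abs(7.38e6 / d - 112.68) < 0.2   # per-pixel energy corresponding to rho_*^2
```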
have no theoretical guarantees of near-optimality for the inferences we are developing. This does not mean, however, that we cannot quantify the conservatism of our techniques numerically. To give an example, let us forget, for the sake of simplicity, about change detection per se and focus on the auxiliary problem we have introduced above, that of deciding upon the hypotheses G_1 and G_2^ρ via observation (2.150), and suppose that we want to decide on these two hypotheses from a single observation with risk ≤ ǫ, for a given ǫ ∈ (0, 1). Whether this is possible or not depends on ρ; let us denote by ρ₊ the smallest ρ for which we can meet the risk specification with our detector-based approach (ρ₊ is nothing but what was above called ρ(ǫ)), and by ρ̲ the smallest ρ for which there exists "in nature" a simple test deciding on G_1 vs. G_2^ρ with risk ≤ ǫ. We can consider the ratio ρ₊/ρ̲ as the "index of conservatism" of our approach. Now, ρ₊ is given by an efficient computation; what about ρ̲? Well, there is a simple way to get a lower bound on ρ̲, namely, as follows. Observe that if the composite hypotheses G_1, G_2^ρ can be decided upon with risk ≤ ǫ, the same holds true for the two simple hypotheses stating that the distribution of observation (2.150) is P_1 or P_2, respectively, where P_1, P_2 correspond to the cases where
• (P_1): ζ is drawn from N(0, 2σ̄²I_d);
• (P_2): ζ is obtained by adding N(0, 2σ̄²I_d)-noise to a random signal u, independent of the noise, uniformly distributed on the sphere {‖u‖_2 = ρ}.
Indeed, P_1 obeys hypothesis G_1, and P_2 is a mixture of distributions obeying G_2^ρ; as a result, a simple test T deciding (1 − ǫ)-reliably on G_1 vs. G_2^ρ would induce a test deciding equally reliably on P_1 vs. P_2, specifically, the test which, given observation ζ, accepts P_1 if T on the same observation accepts G_1, and accepts P_2 otherwise. We can now use a two-point lower bound (Proposition 2.2) to lower-bound the risk of deciding on P_1 vs. P_2.
Because both distributions are spherically symmetric, computing this bound reduces to computing a similar bound for the univariate distributions of ζᵀζ induced by P_1 and P_2, and these univariate distributions are easy to compute. The resulting lower risk bound depends on ρ, and we can find the largest ρ for which the bound is ≥ 0.01, and use this ρ in the role of ρ̲; the associated indexes of conservatism can only be larger than the true ones. Let us look at what these indexes are for the data used in our change detection experiment, that is, ǫ = 0.01, d = 256² = 65536, σ̄ = 10, σ̲ = σ̄/√2. Computation shows that in this case we have

ρ₊ = 2702.4,  ρ₊/ρ̲ ≤ 1.04:

nearly no conservatism at all! When eliminating the uncertainty in the intensity of the noise by increasing σ̲ from σ̄/√2 to σ̄, we get

ρ₊ = 668.46,  ρ₊/ρ̲ ≤ 1.15,

still not that much conservatism!
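The reduction just described is easy to carry out numerically. The sketch below (with small, illustrative dimensions, not those of the movie experiment) computes (1 − ‖P_1 − P_2‖_TV)/2, a quantity lower-bounding the risk of any simple test deciding on P_1 vs. P_2, using the spherical symmetry of both distributions: the total variation distance can be computed between the induced univariate distributions of ζᵀζ, which are a scaled chi-square under P_1 and a scaled noncentral chi-square under P_2.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def risk_lower_bound(d, sigma_bar, rho):
    """(1 - TV(P1, P2)) / 2, computed via the distribution of
    s = zeta'zeta / (2 sigma_bar^2): chi2(d) under P1 and noncentral
    chi2(d, rho^2 / (2 sigma_bar^2)) under P2."""
    nc = rho**2 / (2.0 * sigma_bar**2)
    hi = d + nc + 40.0 * np.sqrt(2.0 * d + 4.0 * nc)   # covers both tails
    s = np.linspace(1e-8, hi, 200001)
    ds = s[1] - s[0]
    tv = 0.5 * np.abs(chi2.pdf(s, d) - ncx2.pdf(s, d, nc)).sum() * ds
    return 0.5 * (1.0 - tv)

b_small = risk_lower_bound(d=64, sigma_bar=1.0, rho=0.5)    # tiny change
b_large = risk_lower_bound(d=64, sigma_bar=1.0, rho=30.0)   # big change
print(b_small, b_large)
```

For a tiny ρ the bound stays near 1/2 (the hypotheses are nearly indistinguishable from a single observation), and it collapses toward 0 as ρ grows; scanning ρ for the largest value with bound ≥ ǫ yields the surrogate for ρ̲ used above.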
2.10 EXERCISES FOR CHAPTER 2

2.10.1 Two-point lower risk bound
Exercise 2.1. Let p and q be two probability distributions, distinct from each other, on the d-element observation space Ω = {1, ..., d}, and consider two simple hypotheses on the distribution of observation ω ∈ Ω, H_1: ω ∼ p, and H_2: ω ∼ q.
1. Is it true that there always exists a simple deterministic test deciding on H_1, H_2 with risk < 1/2?
2. Is it true that there always exists a simple randomized test deciding on H_1, H_2 with risk < 1/2?
3. Is it true that when quasi-stationary K-repeated observations are allowed, one can decide on H_1, H_2 with arbitrarily small risk, provided K is large enough?

2.10.2 Around Euclidean Separation
Exercise 2.2. Justify the "immediate observation" in Section 2.2.2.3.B.

Exercise 2.3. 1) Prove Proposition 2.9.
Hint: You can find useful the following simple observation (prove it, provided you indeed use it): Let f(ω), g(ω) be probability densities taken w.r.t. a reference measure P on an observation space Ω, and let ǫ ∈ (0, 1/2] be such that

2ǭ := ∫_Ω min[f(ω), g(ω)] P(dω) ≤ 2ǫ.

Then

∫_Ω √(f(ω)g(ω)) P(dω) ≤ 2√(ǫ(1 − ǫ)).

2) Justify the illustration in Section 2.2.3.2.C.

2.10.3 Hypothesis testing via ℓ1 separation
Let d be a positive integer, and let the observation space Ω be the finite set {1, ..., d} equipped with the counting reference measure.²⁹ Probability distributions on Ω can be identified with points p of the d-dimensional probabilistic simplex

Δ_d = {p ∈ R^d : p ≥ 0, Σ_{i=1}^d p_i = 1};
²⁹ The counting measure on a discrete (finite or countable) set Ω is the measure which assigns mass 1 to every point of Ω, so that the measure of a subset of Ω is the cardinality of the subset when the latter is finite, and is +∞ otherwise.
the i-th entry p_i in p ∈ Δ_d is the probability for the random variable distributed according to p to take value i ∈ {1, ..., d}. With this interpretation, p is the probability density taken w.r.t. the counting measure on Ω. Assume B and W are two nonintersecting nonempty closed convex subsets of Δ_d; we interpret B and W as the sets of black and white probability distributions on Ω, and our goal is to find the optimal, in terms of its total risk, test deciding on the hypotheses H_1: p ∈ B, H_2: p ∈ W via a single observation ω ∼ p.
Warning: Everywhere in this section, "test" means "simple test."

Exercise 2.4. Our first goal is to find the optimal test, in terms of its total risk, deciding on the hypotheses H_1, H_2 via a single observation ω ∼ p ∈ B ∪ W. To this end we consider the convex optimization problem

Opt = min_{p∈B, q∈W} [ f(p, q) := Σ_{i=1}^d |p_i − q_i| ]   (2.154)

and let (p*, q*) be an optimal solution to this problem (it clearly exists).
1. Extract from the optimality conditions that there exist reals ρ_i ∈ [−1, 1], 1 ≤ i ≤ d, such that

ρ_i = { 1, if p*_i > q*_i;  −1, if p*_i < q*_i }   (2.155)

and

ρᵀ(p − p*) ≥ 0 ∀p ∈ B  &  ρᵀ(q − q*) ≤ 0 ∀q ∈ W.   (2.156)

2. Extract from the previous item that the test T which, given an observation ω ∈ {1, ..., d}, accepts H_1 with probability π_ω = (1 + ρ_ω)/2 and accepts H_2 with complementary probability, has its total risk equal to

Σ_{ω∈Ω} min[p*_ω, q*_ω],   (2.157)
and thus is minimax optimal in terms of the total risk.

Comments. Exercise 2.4 describes an efficiently computable and, in terms of worst-case total risk, optimal simple test deciding on a pair of "convex" composite hypotheses on the distribution of a discrete random variable. While it seems an attractive result, we believe that by itself this result is useless, since typically in the testing problem in question a single observation is by far not enough for a reasonable inference; such an inference requires observing several independent realizations ω_1, ..., ω_K of the random variable in question. And the construction presented in Exercise 2.4 says nothing on how to adjust the test to the case of repeated observations. Of course, when ω^K = (ω_1, ..., ω_K) is a K-element i.i.d. sample drawn from a probability distribution p on Ω = {1, ..., d}, ω^K can be thought of as a single observation of a discrete random variable taking value in the set Ω_K = Ω × ⋯ × Ω (K factors), the probability distribution p^K of ω^K being readily given by p. So, why not
apply the construction from Exercise 2.4 to ω^K in the role of ω? On close inspection, this idea fails. One of the reasons for this failure is that the cardinality of Ω_K (which, among other factors, is responsible for the computational complexity of implementing the test in Exercise 2.4) blows up exponentially as K grows. Another, even more serious, complication is that p^K depends on p nonlinearly, so that the family of distributions p^K of ω^K induced by a convex family of distributions p of ω (convexity meaning that the p's in question fill a convex subset of the probabilistic simplex) is not convex; and convexity of the sets B, W in the context of Exercise 2.4 is crucial. Thus, passing from a single realization of a discrete random variable to a sample of K > 1 independent realizations of the variable results in severe structural and quantitative complications "killing," at least at first glance, the approach undertaken in Exercise 2.4.³⁰

In spite of the above pessimistic conclusions, the single-observation test from Exercise 2.4 admits a meaningful multi-observation modification, which is the subject of our next exercise.

Exercise 2.5. There is a straightforward way to use the optimal, in terms of its total risk, single-observation test built in Exercise 2.4 in the "multi-observation" environment. Specifically, following the notation from Exercise 2.4, let ρ ∈ R^d, p*, q* be the entities built in that exercise, so that p* ∈ B, q* ∈ W, all entries in ρ belong to [−1, 1], and

{ρᵀp ≥ α := ρᵀp* ∀p ∈ B} & {ρᵀq ≤ β := ρᵀq* ∀q ∈ W} & α − β = ρᵀ[p* − q*] = ‖p* − q*‖₁.

Given an i.i.d. sample ω^K = (ω_1, ..., ω_K) with ω_t ∼ p, where p ∈ B ∪ W, we could try to decide on the hypotheses H_1: p ∈ B, H_2: p ∈ W as follows. Let us set ζ_t = ρ_{ω_t}. For large K, given ω^K, the observable quantity ζ^K := (1/K) Σ_{t=1}^K ζ_t will, by the Law of Large Numbers, with overwhelming probability be close to E_{ω∼p}{ρ_ω} = ρᵀp, and the latter quantity is ≥ α when p ∈ B and is ≤ β < α when p ∈ W. Consequently, selecting a "comparison level" ℓ ∈ (β, α), we can decide on the hypotheses p ∈ B vs. p ∈ W by computing ζ^K, comparing the result to ℓ, accepting the hypothesis p ∈ B when ζ^K ≥ ℓ, and accepting the alternative p ∈ W otherwise.

The goal of this exercise is to quantify the above qualitative considerations. To this end let us fix ℓ ∈ (β, α) and K, and ask ourselves the following questions:
A. For p ∈ B, how do we upper-bound the probability Prob_{p^K}{ζ^K ≤ ℓ}?
B. For p ∈ W, how do we upper-bound the probability Prob_{p^K}{ζ^K ≥ ℓ}?
Here p^K is the probability distribution of the i.i.d. sample ω^K = (ω_1, ..., ω_K) with ω_t ∼ p. The simplest way to answer these questions is to use Bernstein's bounding scheme. Specifically, to answer question A, let us select γ ≥ 0 and observe that for

³⁰ Though directly extending the optimal single-observation test to the case of repeated observations encounters significant technical difficulties, it was carried out in some specific situations. For instance, in [122, 123] such an extension has been proposed for the case of sets B and W of distributions which are dominated by bi-alternating capacities (see, e.g., [8, 12, 35], and references therein); explicit constructions of the test were proposed for some special sets of distributions [121, 196, 209].
every probability distribution p on {1, 2, ..., d} it holds

π_{K,−}[p] · e^{−γℓ} ≤ E_{p^K}{ e^{−γζ^K} } = [ Σ_{i=1}^d p_i e^{−γρ_i/K} ]^K,   π_{K,−}[p] := Prob_{p^K}{ζ^K ≤ ℓ},

whence

ln(π_{K,−}[p]) ≤ K ln( Σ_{i=1}^d p_i e^{−γρ_i/K} ) + γℓ,

implying, via the substitution γ = µK, that

∀µ ≥ 0 : ln(π_{K,−}[p]) ≤ Kψ_−(µ, p),   ψ_−(µ, p) = ln( Σ_{i=1}^d p_i e^{−µρ_i} ) + µℓ.

Similarly, setting π_{K,+}[p] = Prob_{p^K}{ζ^K ≥ ℓ}, we get

∀ν ≥ 0 : ln(π_{K,+}[p]) ≤ Kψ_+(ν, p),   ψ_+(ν, p) = ln( Σ_{i=1}^d p_i e^{νρ_i} ) − νℓ.
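As a small numerical companion to these bounds, the sketch below evaluates ψ_− and ψ_+ and the resulting exponent κ for a toy, illustrative instance in which B and W are singletons {p*} and {q*} (so the inner maxima over p ∈ B, q ∈ W are trivial; in general they would be convex problems), with ℓ set to the midpoint of (β, α):

```python
import numpy as np
from scipy.optimize import minimize_scalar

p_star = np.array([0.5, 0.3, 0.2])     # hypothetical "black" distribution
q_star = np.array([0.2, 0.3, 0.5])     # hypothetical "white" distribution
rho = np.sign(p_star - q_star)         # entries in [-1, 1]
alpha, beta = rho @ p_star, rho @ q_star
ell = 0.5 * (alpha + beta)             # midpoint acceptance level

psi_minus = lambda mu: np.log(p_star @ np.exp(-mu * rho)) + mu * ell
psi_plus = lambda nu: np.log(q_star @ np.exp(nu * rho)) - nu * ell
kappa = max(minimize_scalar(psi_minus, bounds=(0.0, 50.0), method="bounded").fun,
            minimize_scalar(psi_plus, bounds=(0.0, 50.0), method="bounded").fun)

delta = alpha - beta                   # = ||p* - q*||_1 for this choice of rho
print(delta, kappa)                    # kappa < 0: risk decays as exp(K * kappa)
```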
Now comes the exercise:
1. Extract from the above observations that

Risk(T^{K,ℓ} | H_1, H_2) ≤ exp{Kκ},   κ = max[ max_{p∈B} inf_{µ≥0} ψ_−(µ, p), max_{q∈W} inf_{ν≥0} ψ_+(ν, q) ],

where T^{K,ℓ} is the K-observation test which accepts the hypothesis H_1: p ∈ B when ζ^K ≥ ℓ and accepts the hypothesis H_2: p ∈ W otherwise.
2. Verify that ψ_−(µ, p) is convex in µ and concave in p, and similarly for ψ_+(ν, q), so that

max_{p∈B} inf_{µ≥0} ψ_−(µ, p) = inf_{µ≥0} max_{p∈B} ψ_−(µ, p),   max_{q∈W} inf_{ν≥0} ψ_+(ν, q) = inf_{ν≥0} max_{q∈W} ψ_+(ν, q).

Thus, computing κ reduces to minimizing over the nonnegative ray the convex functions φ_−(µ) = max_{p∈B} ψ_−(µ, p) and φ_+(ν) = max_{q∈W} ψ_+(ν, q).
3. Prove that when ℓ = (1/2)[α + β], one has
κ ≤ −(1/12)∆²,   ∆ = α − β = ‖p* − q*‖₁.
161
HYPOTHESIS TESTING
the majority version of this building block. Our first observation is that building the minimum risk singleobservation test reduces to solving a convex optimization problem. Exercise 2.6. Let, as above, B and W be nonempty nonintersecting closed convex subsets of probabilistic simplex ∆d . Show that the problem of finding the best—in terms of its risk—randomized singleobservation test deciding on H1 : p ∈ B vs. H2 : p ∈ W via observation ω ∼ p reduces to solving a convex optimization problem. Write down this problem as an explicit LO program when B and W are polyhedral sets given by polyhedral representations: B W
= =
{p : ∃u : PB p + QB u ≤ aB }, {p : ∃u : PW p + QW u ≤ aW }.
We see that the “ideal building block”—the minimumrisk singleobservation test—can be built efficiently. What is at this point unclear is whether this block is of any use for majority modifications, that is, whether the risk of this test < 1/2— this is what we need for the majority version of the minimumrisk singleobservation test to be consistent. Exercise 2.7. Extract from Exercise 2.4 that in the situation of this section, denoting by ∆ the optimal value in the optimization problem (2.154), one has 1. The risk of any singleobservation test, deterministic or randomized, is ≥ 21 − ∆ 4 2. There exists a singleobservation randomized test with risk ≤ 12 − ∆ 8 , and thus the risk of the minimum risk singleobservation test given by Exercise 2.6 does not exceed 12 − ∆ 8 < 1/2 as well. Pay attention to the fact that ∆ > 0 (since, by assumption, B and W do not intersect). The bottom line is that in the situation of this section, given a target value ǫ of risk and assuming stationary repeated observations are allowed, we have (at least) three options to meet the risk specifications: 1. To start with the optimal—in terms of its total risk—singleobservation detector as explained in Exercise 2.4, and then to pass to its multiobservation version built in Exercise 2.5; 2. To use the majority version of the minimumrisk randomized singleobservation test built in Exercise 2.6; 3. To use the test based on the minimum risk detector for B, W , as explained in the main body of Chapter 2. In all cases, we have to specify the number K of observations which guarantees that the risk of the resulting multiobservation test is at most a given target ǫ. A bound on K can be easily obtained by utilizing the results on the risk of a detectorbased test in a Discrete o.s. from the main body of Chapter 2 along with riskrelated results of Exercises 2.5, 2.6, and 2.7. Exercise 2.8.
162
CHAPTER 2
Run numerical experiments to see if one of the three options above always dominates the others (that is, requires a smaller sample of observations to ensure the same risk). Let us now focus on a theoretical comparison of the detectorbased test and the majority version of the minimumrisk singleobservation test (options 1 and 2 above) in the general situation described at the beginning of Section 2.10.3. Given ǫ ∈ (0, 1), the corresponding sample sizes Kd and Km are completely determined by the relevant “measure of closeness” between B and W . Specifically, • For Kd , the closeness measure is ρd (B, W ) = 1 −
max
p∈B,q∈W
X√
pω q ω ;
(2.158)
ω
1 − ρd (B, W ) is the minimal risk of a detector for B, W , and for ρd (B, W ) and ǫ small, we have Kd ≈ ln(1/ǫ)/ρd (B, W ) (why?). • Given ǫ, Km is fully specified by the minimal risk ρ of simple randomized singleobservation test T deciding on the hypotheses associated with B, W . By Exercise 2.7, we have ρ = 12 − δ, where δ is within absolute constant factor of the optimal value ∆ = minp∈B,q∈W kp − qk1 of (2.154). The risk bound for the Kobservation majority version of T is the probability to get at least K/2 heads in K independent tosses of coin with probability to get heads in a single toss equal to ρ = 1/2 − δ. When ρ is not close to 0 and ǫ is small,pthe (1 − ǫ)quantile of the number of heads in our K coin tosses is Kρ + O(1) K ln(1/ǫ) = p K/2−δK +O(1) K ln(1/ǫ) (why?). Km is the smallest K for which this quantile is < K/2, so that Km is of the order of ln(1/ǫ)/δ 2 , or, which is the same, of the order of ln(1/ǫ)/∆2 . We see that the closeness between B and W “responsible for Km ” is 2
ρm (B, W ) = ∆2 =
min
p∈B,q∈W
kp − qk1
,
and K_m is of the order of ln(1/ǫ)/ρ_m(B, W). The goal of the next exercise is to compare ρ_d and ρ_m.

Exercise 2.9. Prove that in the situation of this section one has

(1/8) ρ_m(B, W) ≤ ρ_d(B, W) ≤ (1/2) √(ρ_m(B, W)).   (2.159)
Relation (2.159) suggests that while K_d never is "much larger" than K_m (this we know in advance: in repeated versions of the Discrete o.s., a properly built detector-based test provably is nearly optimal), K_m might be much larger than K_d. This indeed is the case:

Exercise 2.10. Given δ ∈ (0, 1/2), let B = {[δ; 0; 1 − δ]} and W = {[0; δ; 1 − δ]}. Verify that in this case the numbers of observations K_d and K_m, resulting in a given risk ǫ ≪ 1 of the multi-observation tests, as functions of δ, are proportional to 1/δ and 1/δ², respectively. Compare the numbers when ǫ = 0.01 and δ ∈ {0.01; 0.05; 0.1}.
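The computation behind Exercise 2.10 takes a few lines; the sketch below evaluates both closeness measures for the stated B and W and checks the claimed scalings ρ_d = δ and ρ_m = 4δ²:

```python
import numpy as np

eps = 0.01
for delta in (0.01, 0.05, 0.1):
    p = np.array([delta, 0.0, 1.0 - delta])   # the single element of B
    q = np.array([0.0, delta, 1.0 - delta])   # the single element of W
    rho_d = 1.0 - np.sqrt(p * q).sum()        # = delta for this pair
    rho_m = np.abs(p - q).sum() ** 2          # = 4 * delta**2
    assert abs(rho_d - delta) < 1e-12
    assert abs(rho_m - 4.0 * delta**2) < 1e-12
    # the two sample-size estimates, K_d ~ ln(1/eps)/rho_d, K_m ~ ln(1/eps)/rho_m
    print(delta, np.log(1.0 / eps) / rho_d, np.log(1.0 / eps) / rho_m)
```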
2.10.4 Miscellaneous exercises
Exercise 2.11. Prove that the conclusion in Proposition 2.18 remains true when the test T in the premise of the proposition is randomized.

Exercise 2.12. Let p_1(ω), p_2(ω) be two positive probability densities, taken w.r.t. a reference measure Π on an observation space Ω, and let P_χ = {p_χ}, χ = 1, 2. Find the optimal, in terms of its risk, balanced detector for P_χ, χ = 1, 2.

Exercise 2.13. Recall that the exponential distribution on Ω = R_+, with parameter µ > 0, is the distribution with the density p_µ(ω) = µe^{−µω}, ω ≥ 0. Given positive reals α < β, consider two families of exponential distributions, P_1 = {p_µ : 0 < µ ≤ α} and P_2 = {p_µ : µ ≥ β}. Build the optimal, in terms of its risk, balanced detector for P_1, P_2. What happens with the risk of the detector you have built when the families P_χ, χ = 1, 2, are replaced with their convex hulls?

Exercise 2.14. [Follow-up to Exercise 2.13] Assume that the "lifetime" ζ of a lightbulb is a realization of a random variable with exponential distribution (i.e., with density p_µ(ζ) = µe^{−µζ}, ζ ≥ 0; in particular, the expected lifespan of a lightbulb in this model is 1/µ).³² Given a lot of lightbulbs, you should decide whether they were produced under normal conditions (resulting in µ ≤ α = 1) or under abnormal ones (resulting in µ ≥ β = 1.5). To this end, you can select at random K lightbulbs and test them. How many lightbulbs should you test in order to make a 0.99-reliable conclusion? Answer this question in the situations when the observation ω in a test is
1. the lifespan of a lightbulb (i.e., ω ∼ p_µ(·));
2. the minimum ω = min[ζ, δ] of the lifespan ζ ∼ p_µ(·) and the allowed duration δ > 0 of your test (i.e., if the lightbulb you are testing does not "die" on the time horizon δ, you terminate the test);
3. ω = χ_{ζ<δ}, where δ > 0 is the allowed test duration (i.e., you observe whether or not a lightbulb "dies" on the time horizon δ, but do not register the lifespan when it is < δ).
Consider the values 0.25, 0.5, 1, 2, 4 of δ.

Exercise 2.15.
³² In Reliability, the probability distribution of the lifespan ζ of an organism or a technical device is characterized by the failure rate

λ(t) = lim_{δ→+0} Prob{t ≤ ζ ≤ t + δ} / (δ · Prob{ζ ≥ t})

(so that for small δ, λ(t)δ is the conditional probability to "die" in the time interval [t, t + δ] provided the organism or device is still "alive" at time t). The exponential distribution corresponds to the case of failure rate independent of t; in applications, this indeed is often the case, except for "very small" and "very large" values of t.
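A quick numerical sanity check related to Exercises 2.13-2.14, under the standard fact from this chapter that for two simple hypotheses with densities p, q the risk of the optimal balanced detector is the Hellinger affinity ∫√(pq): for exponential densities with parameters α and β this affinity has the closed form 2√(αβ)/(α + β), and K i.i.d. observed lifespans drive the risk down as affinity^K. The sample size obtained this way is only a rough, illustrative estimate for item 1 of Exercise 2.14.

```python
import numpy as np

alpha, beta = 1.0, 1.5
affinity = 2.0 * np.sqrt(alpha * beta) / (alpha + beta)   # Int sqrt(p_a p_b), closed form

# cross-check the closed form by direct numerical integration of
# sqrt(alpha*exp(-alpha*w) * beta*exp(-beta*w)) over w >= 0
w = np.linspace(0.0, 60.0, 600001)
dw = w[1] - w[0]
integrand = np.sqrt(alpha * beta) * np.exp(-0.5 * (alpha + beta) * w)
assert abs(integrand.sum() * dw - affinity) < 1e-3

# number of bulbs needed so that affinity**K <= 0.01 (0.99-reliable decision)
K_required = int(np.ceil(np.log(0.01) / np.log(affinity)))
print(affinity, K_required)   # affinity ~ 0.98, K_required = 226
```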
[Follow-up to Exercise 2.14] In the situation of Exercise 2.14, build a sequential test for deciding on the null hypothesis "the lifespan of a lightbulb from a given lot is ζ ∼ p_µ(·) with µ ≤ 1" (recall that p_µ(z) is the exponential density µe^{−µz} on the ray {z ≥ 0}) vs. the alternative "the lifespan is ζ ∼ p_µ(·) with µ > 1." In this test, you can select a number K of lightbulbs from the lot, switch them on at time 0, and record the actual lifetimes of the lightbulbs you are testing. As a result, at the end of (any) observation interval ∆ = [0, δ], you observe K independent realizations of the r.v. min[ζ, δ], where ζ ∼ p_µ(·) with some unknown µ. In your sequential test, you are welcome to make conclusions at the endpoints δ_1 < δ_2 < ... < δ_S of several observation intervals.
Note: We deliberately skip details of the problem's setting; how you decide on these missing details is part of your solution to the exercise.

Exercise 2.16. In Section 2.6, we consider a model of elections where every member of the population was supposed to cast a vote. Enrich the model by incorporating the option for a voter not to participate in the elections at all. Implement the Sequential test for the resulting model and run simulations.

Exercise 2.17. Work out the following extension of the Opinion Poll Design problem. You are given two finite sets, Ω_1 = {1, ..., I} and Ω_2 = {1, ..., M}, along with L nonempty closed convex subsets Y_ℓ of the set

Δ_{IM} = { [y_{im} > 0]_{i,m} : Σ_{i=1}^I Σ_{m=1}^M y_{im} = 1 }
of all nonvanishing probability distributions on Ω = Ω1 × Ω2 = {(i, m) : 1 ≤ i ≤ I, 1 ≤ m ≤ M }. Sets Yℓ are such that all distributions from Yℓ have a common marginal distribution θℓ > 0 of i: M X
m=1
yim = θiℓ , 1 ≤ i ≤ I, ∀y ∈ Yℓ , 1 ≤ ℓ ≤ L.
Your observations $\omega_1,\omega_2,...$ are sampled, independently of each other, from a distribution partly selected “by nature,” and partly by you. Specifically, nature selects ℓ ≤ L and a distribution $y\in Y_\ell$, and you select an I-dimensional probabilistic vector q from a given convex compact subset Q of the positive part of the I-dimensional probabilistic simplex. Let $y_i$ be the conditional distribution of $m\in\Omega_2$ given i induced by y, so that $y_i$ is the M-dimensional probabilistic vector with entries
\[
[y_i]_m=\frac{y_{im}}{\sum_{\mu\leq M}y_{i\mu}}=\frac{y_{im}}{\theta^\ell_i}.
\]
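A minimal sketch (hypothetical data, not from the book) of drawing one observation ω = (i, m): i is drawn from the poll design q, then m from the conditional distribution $y_i$ just defined:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: I = 2, M = 3; y is a positive distribution on Omega1 x Omega2.
y = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.25, 0.20]])
q = np.array([0.5, 0.5])               # poll design: distribution of i, chosen by us

theta = y.sum(axis=1)                  # marginal distribution of i under y
y_cond = y / theta[:, None]            # rows are the conditional distributions y_i

def draw_observation():
    i = rng.choice(len(q), p=q)                   # we draw i from q ...
    m = rng.choice(y.shape[1], p=y_cond[i])       # ... nature draws m from y_i
    return i, m

sample = [draw_observation() for _ in range(5)]
print(sample)
```

The two-stage scheme makes the distribution of ω a product of our design q and nature's conditional distributions, which is what the Measurement Design problem below optimizes over.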
In order to generate $\omega_t=(i_t,m_t)\in\Omega$, you draw $i_t$ at random from the distribution q, and then nature draws $m_t$ at random from the distribution $y_{i_t}$. Given a closeness relation C, your goal is to decide, up to closeness C, on the hypotheses $H_1,...,H_L$, with $H_\ell$ stating that the distribution y selected by nature belongs to $Y_\ell$. Given an “observation budget” (a number K of observations $\omega_k$ you can use), you want to find a probabilistic vector q which results in the test with as
small a C-risk as possible. Pose this Measurement Design problem as an efficiently solvable convex optimization problem.

Exercise 2.18. [Probabilities of deviations from the mean] The goal of what follows is to present the most straightforward application of simple families of distributions—bounds on probabilities of deviations of random vectors from their means. Let $\mathcal H\subset\Omega=\mathbf R^d$, $\mathcal M$, $\Phi$ be regular data such that $0\in{\rm int}\,\mathcal H$, $\mathcal M$ is compact, $\Phi(0;\mu)=0$ for all $\mu\in\mathcal M$, and $\Phi(h;\mu)$ is differentiable at h = 0 for every $\mu\in\mathcal M$. Let, further, $\bar P\in\mathcal S[\mathcal H,\mathcal M,\Phi]$, and let $\bar\mu\in\mathcal M$ be a parameter of $\bar P$. Prove that
1. $\bar P$ possesses expectation $e[\bar P]$, and $e[\bar P]=\nabla_h\Phi(0;\bar\mu)$;
2. for every linear form $e^T\omega$ on Ω it holds
\[
\pi:=\bar P\big\{\omega:\ e^T(\omega-e[\bar P])\geq 1\big\}\leq\exp\Big\{\inf_{t\geq 0:\,te\in\mathcal H}\big[\Phi(te;\bar\mu)-t\,e^T\nabla_h\Phi(0;\bar\mu)-t\big]\Big\}. \tag{2.160}
\]
What are the consequences of (2.160) for sub-Gaussian distributions?

Exercise 2.19. [Testing convex hypotheses on mixtures] Consider the situation as follows. For given positive integers K and L and for χ = 1, 2, given are
• nonempty convex compact signal sets $U_\chi\subset\mathbf R^{n_\chi}$,
• regular data $\mathcal H^\chi_{k\ell}\subset\mathbf R^{d_k}$, $\mathcal M^\chi_{k\ell}$, $\Phi^\chi_{k\ell}$, and affine mappings $u_\chi\mapsto A^\chi_{k\ell}[u_\chi;1]:\mathbf R^{n_\chi}\to\mathbf R^{d_k}$ such that
\[
u_\chi\in U_\chi\ \Rightarrow\ A^\chi_{k\ell}[u_\chi;1]\in\mathcal M^\chi_{k\ell},\qquad 1\leq k\leq K,\ 1\leq\ell\leq L,
\]
• probabilistic vectors $\mu^k=[\mu^k_1;...;\mu^k_L]$, $1\leq k\leq K$.
We can associate with the outlined data families of probability distributions $\mathcal P_\chi$ on the observation space $\Omega=\mathbf R^{d_1}\times...\times\mathbf R^{d_K}$ as follows. For χ = 1, 2, $\mathcal P_\chi$ is comprised of all probability distributions P of random vectors $\omega^K=[\omega_1;...;\omega_K]\in\Omega$ generated as follows: We select
• a signal $u_\chi\in U_\chi$,
• a collection of probability distributions $P_{k\ell}\in\mathcal S[\mathcal H^\chi_{k\ell},\mathcal M^\chi_{k\ell},\Phi^\chi_{k\ell}]$, $1\leq k\leq K$, $1\leq\ell\leq L$, in such a way that $A^\chi_{k\ell}[u_\chi;1]$ is a parameter of $P_{k\ell}$:
\[
\forall h\in\mathcal H^\chi_{k\ell}:\ \ln\mathbf E_{\omega_k\sim P_{k\ell}}\{e^{h^T\omega_k}\}\leq\Phi^\chi_{k\ell}(h;A^\chi_{k\ell}[u_\chi;1]);
\]
• we generate the components $\omega_k$, $k=1,...,K$, independently across k, from the mixture $\Pi[\{P_{k\ell}\}_{\ell=1}^{L},\mu^k]$ of the distributions $P_{k\ell}$, $\ell=1,...,L$; that is, we draw at random,
from the distribution $\mu^k$ on $\{1,...,L\}$, an index ℓ, and then draw $\omega_k$ from the distribution $P_{k\ell}$.
Prove that when setting
\[
\begin{array}{rcl}
\mathcal H^\chi&=&\big\{h=[h_1;...;h_K]\in\mathbf R^{d=d_1+...+d_K}:\ h_k\in\bigcap_{\ell=1}^{L}\mathcal H^\chi_{k\ell},\ 1\leq k\leq K\big\},\\
\mathcal M^\chi&=&\{0\}\subset\mathbf R,\\
\Phi^\chi(h;\mu)&=&\displaystyle\sum_{k=1}^{K}\ln\Big(\sum_{\ell=1}^{L}\mu^k_\ell\exp\big\{\max_{u_\chi\in U_\chi}\Phi^\chi_{k\ell}(h_k;A^\chi_{k\ell}[u_\chi;1])\big\}\Big):\ \mathcal H^\chi\times\mathcal M^\chi\to\mathbf R,
\end{array}
\]
we obtain the regular data such that
$\mathcal P_\chi\subset\mathcal S[\mathcal H^\chi,\mathcal M^\chi,\Phi^\chi]$.
Explain how to use this observation to compute via Convex Programming an affine detector and its risk for the families of distributions $\mathcal P_1$ and $\mathcal P_2$.

Exercise 2.20. [Mixture of sub-Gaussian distributions] Let $P_\ell$ be sub-Gaussian distributions on $\mathbf R^d$ with sub-Gaussianity parameters $\theta_\ell,\Theta$, $1\leq\ell\leq L$, with a common Θ-parameter, and let $\nu=[\nu_1;...;\nu_L]$ be a probabilistic vector. Consider the ν-mixture $P=\Pi[P^L,\nu]$ of the distributions $P_\ell$, so that ω ∼ P is generated as follows: we draw at random from the distribution ν an index ℓ, and then draw ω at random from the distribution $P_\ell$. Prove that P is sub-Gaussian with sub-Gaussianity parameters $\bar\theta=\sum_\ell\nu_\ell\theta_\ell$ and (any) $\bar\Theta$ chosen to satisfy
\[
\bar\Theta\succeq\Theta+\tfrac{6}{5}[\theta_\ell-\bar\theta][\theta_\ell-\bar\theta]^T\quad\forall\ell,
\]
in particular, according to any one of the following rules:
1. $\bar\Theta=\Theta+\tfrac{6}{5}\max_\ell\|\theta_\ell-\bar\theta\|_2^2\,I_d$,
2. $\bar\Theta=\Theta+\tfrac{6}{5}\sum_\ell(\theta_\ell-\bar\theta)(\theta_\ell-\bar\theta)^T$,
3. $\bar\Theta=\Theta+\tfrac{6}{5}\sum_\ell\theta_\ell\theta_\ell^T$, provided that $\nu_1=...=\nu_L=1/L$.

Exercise 2.21.
The goal of this exercise is to give a simple sufficient condition for quadratic lifting “to work” in the Gaussian case. Namely, let $A_\chi$, $U_\chi$, $V_\chi$, $G_\chi$, χ = 1, 2, be as in Section 2.9.3, with the only difference that now we do not assume the compact sets $U_\chi$ to be convex, and let $\mathcal Z_\chi$ be convex compact subsets of the sets $\mathcal Z^{n_\chi}$—see item i.2 of Proposition 2.43—such that
\[
[u_\chi;1][u_\chi;1]^T\in\mathcal Z_\chi\quad\forall u_\chi\in U_\chi,\ \chi=1,2. \tag{$*$}
\]
Augmenting the above data with $\Theta_\chi,\delta_\chi$ such that $V=V_\chi$, $\Theta_*=\Theta_*^{(\chi)}$, $\delta=\delta_\chi$ satisfy (2.129), χ = 1, 2, and invoking Proposition 2.43.ii, we get at our disposal a quadratic detector $\phi_{\rm lift}$ such that ${\rm Risk}[\phi_{\rm lift}\,|\,G_1,G_2]\leq\exp\{{\rm SadVal}_{\rm lift}\}$, with ${\rm SadVal}_{\rm lift}$ given by (2.134). A natural question is, when is ${\rm SadVal}_{\rm lift}$ negative, meaning that our quadratic detector indeed “is working”—its risk is < 1, implying that when repeated observations are allowed, tests based upon this detector are consistent—able to decide on the hypotheses $H_\chi: P\in G_\chi$, χ = 1, 2, on the distribution of observation ζ ∼ P with any desired small risk ǫ ∈ (0, 1). With our computation-oriented ideology, this is not too important a question, since we can answer it via efficient computation. This being said, there is no harm in a “theoretical” answer which could provide us with additional insight. The goal of the exercise is to justify a simple result on the subject. Here is the exercise:
In the situation in question, assume that $V_1=V_2=\{\Theta_*\}$, which allows us to set $\Theta_*^{(\chi)}=\Theta_*$, $\delta_\chi=0$, χ = 1, 2. Prove that in this case a necessary and sufficient condition for ${\rm SadVal}_{\rm lift}$ to be negative is that the convex compact sets
\[
\mathcal U_\chi=\{B_\chi ZB_\chi^T:\ Z\in\mathcal Z_\chi\}\subset\mathbf S^{d+1}_{+},\ \chi=1,2,
\]
do not intersect with each other.

Exercise 2.22. Prove that if X is a nonempty convex compact set in $\mathbf R^d$, then the function $\widehat\Phi(h;\mu)$ given by (2.114) is real-valued and continuous on $\mathbf R^d\times X$ and is convex in h and concave in µ.

Exercise 2.23.
The goal of what follows is to refine the change detection procedure (let us refer to it as the “basic” one) developed in Section 2.9.5.1. The idea is pretty simple. With the notation from Section 2.9.5.1, in the basic procedure, when testing the null hypothesis $H_0$ vs. the signal hypothesis $H^\rho_t$, we look at the difference $\zeta_t=\omega_t-\omega_1$ and try to decide whether the energy of the deterministic component $x_t-x_1$ of $\zeta_t$ is 0, as is the case under $H_0$, or is $\geq\rho^2$, as is the case under $H^\rho_t$. Note that if $\sigma\in[\underline\sigma,\overline\sigma]$ is the actual intensity of the observation noise, then the noise component of $\zeta_t$ is $\mathcal N(0,2\sigma^2I_d)$; other things being equal, the larger the noise in $\zeta_t$, the larger ρ should be to allow for a reliable—with a given reliability level—decision.
Now note that under the hypothesis $H^\rho_t$ we have $x_1=...=x_{t-1}$, so that the deterministic component of the difference $\widetilde\zeta_t=\omega_t-\frac{1}{t-1}\sum_{s=1}^{t-1}\omega_s$ is exactly the same as for the difference $\zeta_t=\omega_t-\omega_1$, while the noise component in $\widetilde\zeta_t$ is $\mathcal N(0,\sigma_t^2I_d)$ with
\[
\sigma_t^2=\sigma^2+\tfrac{1}{t-1}\sigma^2=\tfrac{t}{t-1}\sigma^2.
\]
Thus, the intensity of noise in $\widetilde\zeta_t$ is at most the same as in $\zeta_t$, and this intensity, in contrast to that for $\zeta_t$, decreases as t grows. Here comes the exercise:
Let reliability tolerances ǫ, ε ∈ (0, 1) be given, and let our goal be to design a system of inferences $\mathcal T_t$, $t=2,3,...,K$, which, when used in the same fashion as the tests $\mathcal T^\kappa_t$ were used in the basic procedure, results in false alarm probability at most ǫ and in probability to miss a change of energy $\geq\rho^2$ at most ε. Needless to say, we want to achieve this goal with as small a ρ as possible. Think how to utilize the above observation to refine the basic procedure, eventually reducing (and provably not increasing) the required value of ρ. Implement the basic and the refined change detection procedures and compare their quality (the resulting values of ρ), e.g., on the data used in the experiment reported in Section 2.9.5.1.
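A quick numeric sanity check of the variance comparison (illustrative values, not part of the exercise): the noise variance factor for $\widetilde\zeta_t$ is $t/(t-1)$, which never exceeds the factor 2 of $\zeta_t$ and decreases in t:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 1.0, 200_000          # illustrative noise level and Monte Carlo sample size

factors = [t / (t - 1) for t in range(2, 7)]   # Var(noise of zeta~_t) / sigma^2
print(factors)  # 2.0, 1.5, 1.333..., 1.25, 1.2 -- decreasing, never above 2

# Monte Carlo check at t = 5 (no change, common deterministic component taken as 0)
t = 5
omegas = rng.normal(0.0, sigma, size=(t, n))
zeta_tilde = omegas[-1] - omegas[:-1].mean(axis=0)
print(zeta_tilde.var())          # ~ t/(t-1) * sigma^2 = 1.25
```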
2.11 PROOFS

2.11.1 Proof of the observation in Remark 2.8
We have to prove that if $p=[p_1;...;p_K]\in B=[0,1]^K$, then the probability $P_M(p)$ of the event
“The total number of heads in K independent coin tosses, with probability $p_k$ to get heads in the k-th toss, is at least M”
is a nondecreasing function of p: if $p'\leq p''$, $p',p''\in B$, then $P_M(p')\leq P_M(p'')$. To see it, let us associate with $p\in B$ a subset of B, specifically, $B_p=\{x\in B:\ 0\leq x_k\leq p_k,\ 1\leq k\leq K\}$, and a function $\chi_p(x):B\to\{0,1\}$ which is equal to 0 at every point $x\in B$ where the number of entries $x_k$ satisfying $x_k\leq p_k$ is less than M, and is equal to 1 otherwise. It is immediately seen that
\[
P_M(p)\equiv\int_B\chi_p(x)\,dx \tag{2.161}
\]
(since with respect to the uniform distribution on B, the events $E_k=\{x\in B:\ x_k\leq p_k\}$ are independent across k and have probabilities $p_k$, and the right-hand side in (2.161) is exactly the probability, taken w.r.t. the uniform distribution on B, of the event “at least M of the events $E_1,...,E_K$ take place”). But the right-hand side in (2.161) clearly is nondecreasing in $p\in B$, since $\chi_p$, by construction, is the characteristic function of the set
\[
B[p]=\{x:\ \text{at least }M\text{ of the entries }x_k\text{ of }x\text{ satisfy }x_k\leq p_k\},
\]
and these sets clearly grow when p increases entrywise. ✷
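The monotonicity just established is easy to confirm numerically; below is a small sketch (not from the book, with illustrative probabilities) computing $P_M(p)$ exactly by the standard Poisson-binomial recursion:

```python
def tail_prob(p, M):
    """P{at least M heads in len(p) independent tosses, P(head in toss k) = p[k]}."""
    dist = [1.0]  # dist[j] = P{j heads among the tosses processed so far}
    for pk in p:
        new = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            new[j] += prob * (1 - pk)   # toss gives tails
            new[j + 1] += prob * pk     # toss gives heads
        dist = new
    return sum(dist[M:])

p_lo = [0.2, 0.3, 0.4, 0.5]
p_hi = [0.3, 0.3, 0.5, 0.6]   # entrywise >= p_lo
print(tail_prob(p_lo, 2), tail_prob(p_hi, 2))  # the second is at least the first
```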
2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations
2.11.2.A Situation and goal. We are in the case QS—see Section 2.2.3.1—of the setting described at the beginning of Section 2.2.3. It suffices to verify that if $H_\ell$, ℓ ∈ {1, 2}, is true, then the probability for $\mathcal T^K_{\rm maj}$ to reject $H_\ell$ is at most the quantity $\epsilon_K$ defined in (2.23). Let us verify this statement in the case of ℓ = 1; the reasoning for ℓ = 2 “mirrors” the one to follow. It is clear that our situation and goal can be formulated as follows:
• “In nature” there exists a random sequence $\zeta^K=(\zeta_1,...,\zeta_K)$ of driving factors and a collection of deterministic functions $\theta_k(\zeta^k=(\zeta_1,...,\zeta_k))$³³ taking values in $\Omega=\mathbf R^d$ such that our k-th observation is $\omega_k=\theta_k(\zeta^k)$. Additionally, the conditional distribution $P_{\omega_k|\zeta^{k-1}}$ of $\omega_k$ given $\zeta^{k-1}$ always belongs to the family $\mathcal P_1$ comprised of distributions of random vectors of the form x + ξ, where the deterministic x belongs to $X_1$ and the distribution of ξ belongs to $\mathcal P^d_\gamma$.
• There exist deterministic functions $\chi_k:\Omega\to\{0,1\}$ and an integer M, 1 ≤ M ≤ K, such that the test $\mathcal T^K_{\rm maj}$, as applied to the observation $\omega^K=(\omega_1,...,\omega_K)$, rejects $H_1$
³³ As always, given a K-element sequence, say, $\zeta_1,...,\zeta_K$, we write $\zeta^t$, t ≤ K, as a shorthand for the fragment $\zeta_1,...,\zeta_t$ of this sequence.
if and only if the number of 1’s among the quantities $\chi_k(\omega_k)$, 1 ≤ k ≤ K, is at least M. In the situation of Proposition 2.6, M = ⌋K/2⌊ and the $\chi_k(\cdot)$ are in fact independent of k: $\chi_k(\omega)=1$ if and only if φ(ω) ≤ 0.³⁴
• What we know is that the conditional probability of the event $\chi_k(\omega_k=\theta_k(\zeta^k))=1$, $\zeta^{k-1}$ being given, is at most $\epsilon_\star$:
\[
P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}\leq\epsilon_\star\quad\forall\zeta^{k-1}.
\]
Indeed, $P_{\omega_k|\zeta^{k-1}}\in\mathcal P_1$. As a result,
\[
P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\phi(\omega_k)\leq 0\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\phi(\omega_k)<0\}\leq\epsilon_\star,
\]
where the second equality is due to the fact that φ(ω) is a nonconstant affine function and $P_{\omega_k|\zeta^{k-1}}$, along with all distributions from $\mathcal P_1$, has a density, and the inequality is given by the origin of $\epsilon_\star$, which upper-bounds the risk of the single-observation test underlying $\mathcal T^K_{\rm maj}$.
What we want to prove is that under the circumstances we have just summarized, we have
\[
P_{\omega^K}\{\omega^K=(\omega_1,...,\omega_K):\ {\rm Card}\{k\leq K:\chi_k(\omega_k)=1\}\geq M\}\ \leq\ \epsilon_M:=\sum_{M\leq k\leq K}\binom{K}{k}\epsilon_\star^k(1-\epsilon_\star)^{K-k}, \tag{2.162}
\]
where $P_{\omega^K}$ is the distribution of $\omega^K=\{\omega_k=\theta_k(\zeta^k)\}_{k=1}^{K}$ induced by the distribution of the hidden factors. There is nothing to prove when $\epsilon_\star=1$, since in this case $\epsilon_M=1$. Thus, we assume from now on that $\epsilon_\star<1$.

2.11.2.B Achieving the goal, step 1. Our reasoning, inspired by that used to justify Remark 2.8, is as follows. Consider a sequence of random variables $\eta_k$, 1 ≤ k ≤ K, uniformly distributed on [0, 1] and independent of each other and of $\zeta^K$, and consider the new driving factors $\lambda_k=[\zeta_k;\eta_k]$ and the new observations³⁵
\[
\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]=\Theta_k(\lambda^k=(\lambda_1,...,\lambda_k)) \tag{2.163}
\]
driven by these new driving factors, and let $\psi_k(\mu_k=[\omega_k;\eta_k])=\chi_k(\omega_k)$. It is immediately seen that
• $\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]$ is a deterministic function, $\Theta_k(\lambda^k)$, of $\lambda^k$, and the conditional distribution $P_{\mu_k|\lambda^{k-1}}$ of $\mu_k$ given $\lambda^{k-1}=[\zeta^{k-1};\eta^{k-1}]$ is the product distribution $P_{\omega_k|\zeta^{k-1}}\times U$ on $\Omega\times[0,1]$, where U is the uniform distribution on [0, 1]. In particular,
\[
\pi_k(\lambda^{k-1}):=P_{\mu_k|\lambda^{k-1}}\{\mu_k=[\omega_k;\eta_k]:\chi_k(\omega_k)=1\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}\leq\epsilon_\star. \tag{2.164}
\]
³⁴ In fact, we need to write φ(ω) < 0 instead of φ(ω) ≤ 0; we replace the strict inequality with its nonstrict version in order to make our reasoning applicable to the case of ℓ = 2, where nonstrict inequalities do arise. Clearly, replacing in the definition of $\chi_k$ the strict inequality with the nonstrict one, we only increase the “rejection domain” of $H_1$, so that the upper bound on the probability of this domain we are about to get is automatically valid for the true rejection domain.
³⁵ In this display, as in what follows, whenever some of the variables λ, ω, ζ, η, µ appear in the same context, it should always be understood that $\zeta_t$ and $\eta_t$ are components of $\lambda_t=[\zeta_t;\eta_t]$, $\mu_t=[\omega_t;\eta_t]=\Theta_t(\lambda^t)$, and $\omega_t=\theta_t(\zeta^t)$. To remind us about these “hidden relations,” we sometimes write something like $\phi(\omega_k=\theta_k(\zeta^k))$ to stress that we are speaking about the value of the function φ at the point $\omega_k=\theta_k(\zeta^k)$.
• We have
\[
P_{\lambda^K}\{\lambda^K:{\rm Card}\{k\leq K:\psi_k(\mu_k=\Theta_k(\lambda^k))=1\}\geq M\}=P_{\omega^K}\{\omega^K=(\omega_1,...,\omega_K):{\rm Card}\{k\leq K:\chi_k(\omega_k)=1\}\geq M\}, \tag{2.165}
\]
where $P_{\omega^K}$ is as in (2.162), and $\Theta_k(\cdot)$ is defined in (2.163).
Now let us define $\psi^+_k(\lambda^k)$ as follows:
• when $\psi_k(\Theta_k(\lambda^k))=1$, or, which is the same, $\chi_k(\omega_k=\theta_k(\zeta^k))=1$, we set $\psi^+_k(\lambda^k)=1$ as well;
• when $\psi_k(\Theta_k(\lambda^k))=0$, or, which is the same, $\chi_k(\omega_k=\theta_k(\zeta^k))=0$, we set $\psi^+_k(\lambda^k)=1$ whenever
\[
\eta_k\leq\gamma_k(\lambda^{k-1}):=\frac{\epsilon_\star-\pi_k(\lambda^{k-1})}{1-\pi_k(\lambda^{k-1})}
\]
and $\psi^+_k(\lambda^k)=0$ otherwise.
Let us make the following immediate observations:
(A) Whenever $\lambda^k$ is such that $\psi_k(\mu_k=\Theta_k(\lambda^k))=1$, we also have $\psi^+_k(\lambda^k)=1$;
(B) The conditional probability of the event $\psi^+_k(\lambda^k)=1$, given $\lambda^{k-1}=[\zeta^{k-1};\eta^{k-1}]$, is exactly $\epsilon_\star$.
Indeed, let $P_{\lambda_k|\lambda^{k-1}}$ be the conditional distribution of $\lambda_k$ given $\lambda^{k-1}$. Let us fix $\lambda^{k-1}$. The event $E=\{\lambda_k:\psi^+_k(\lambda^k)=1\}$, by construction, is the union of two nonoverlapping events:
\[
E_1=\{\lambda_k=[\zeta_k;\eta_k]:\chi_k(\theta_k(\zeta^k))=1\},\qquad E_2=\{\lambda_k=[\zeta_k;\eta_k]:\chi_k(\theta_k(\zeta^k))=0,\ \eta_k\leq\gamma_k(\lambda^{k-1})\}.
\]
Taking into account that the conditional distribution of $\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]$, $\lambda^{k-1}$ being fixed, is the product distribution $P_{\omega_k|\zeta^{k-1}}\times U$, we conclude in view of (2.164) that
\[
\begin{array}{rcl}
P_{\lambda_k|\lambda^{k-1}}\{E_1\}&=&P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}=\pi_k(\lambda^{k-1}),\\
P_{\lambda_k|\lambda^{k-1}}\{E_2\}&=&P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=0\}\,U\{\eta\leq\gamma_k(\lambda^{k-1})\}=(1-\pi_k(\lambda^{k-1}))\gamma_k(\lambda^{k-1}),
\end{array}
\]
which combines with the definition of $\gamma_k(\cdot)$ to imply (B).
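The ψ⁺ construction can be probed on a toy example (illustrative values, not from the book): starting from an indicator that fires with probability π ≤ ǫ⋆ and topping it up, via the independent uniform η, on an event of probability (1−π)γ, one gets total probability exactly ǫ⋆:

```python
import numpy as np

rng = np.random.default_rng(2)
pi_k, eps_star = 0.3, 0.4                # illustrative: pi_k <= eps_star < 1

gamma = (eps_star - pi_k) / (1 - pi_k)   # threshold for the auxiliary uniform eta
exact = pi_k + (1 - pi_k) * gamma        # P{psi+ = 1}, computed exactly
print(exact)                             # equals eps_star

# Monte Carlo confirmation
n = 200_000
fires = rng.random(n) < pi_k             # psi = 1 with probability pi_k
eta = rng.random(n)
psi_plus = fires | (~fires & (eta <= gamma))
print(psi_plus.mean())                   # ~ 0.4
```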
2.11.2.C Achieving the goal, step 2. By (A) combined with (2.165) we have
\[
\begin{array}{l}
P_{\omega^K}\{\omega^K:{\rm Card}\{k\leq K:\chi_k(\omega_k)=1\}\geq M\}\\
\qquad=P_{\lambda^K}\{\lambda^K:{\rm Card}\{k\leq K:\psi_k(\mu_k=\Theta_k(\lambda^k))=1\}\geq M\}\\
\qquad\leq P_{\lambda^K}\{\lambda^K:{\rm Card}\{k\leq K:\psi^+_k(\lambda^k)=1\}\geq M\},
\end{array}
\]
and all we need to verify is that the concluding quantity in this chain is upper-bounded by the quantity $\epsilon_M$ given by (2.162). Invoking (B), it is enough to prove the following claim:
(!) Let $\lambda^K=(\lambda_1,...,\lambda_K)$ be a random sequence with probability distribution P, let $\psi^+_k(\lambda^k)$ take values 0 and 1 only, and let for every k ≤ K the conditional probability for $\psi^+_k(\lambda^k)$ to take value 1, $\lambda^{k-1}$ being fixed, be equal to $\epsilon_\star$, for all $\lambda^{k-1}$. Then the P-probability of the event $\{\lambda^K:{\rm Card}\{k\leq K:\psi^+_k(\lambda^k)=1\}\geq M\}$ is equal to $\epsilon_M$ given by (2.162).
This is immediate. For integers k, m, 1 ≤ k ≤ K, m ≥ 0, let $\chi^k_m(\lambda^k)$ be the characteristic function of the event $\{\lambda^k:{\rm Card}\{t\leq k:\psi^+_t(\lambda^t)=1\}=m\}$, and let
\[
\pi^k_m=P\{\lambda^K:\chi^k_m(\lambda^k)=1\}.
\]
We have the following evident recurrence:
\[
\chi^k_m(\lambda^k)=\chi^{k-1}_m(\lambda^{k-1})(1-\psi^+_k(\lambda^k))+\chi^{k-1}_{m-1}(\lambda^{k-1})\psi^+_k(\lambda^k),\quad k=1,2,...,
\]
augmented by the “boundary conditions” $\chi^0_m=0$, m > 0, $\chi^0_0=1$, $\chi^{k-1}_{-1}=0$ for all k ≥ 1. Taking expectation w.r.t. P and utilizing the fact that the conditional expectation of $\psi^+_k(\lambda^k)$ given $\lambda^{k-1}$ is, identically in $\lambda^{k-1}$, equal to $\epsilon_\star$, we get
\[
\pi^k_m=\pi^{k-1}_m(1-\epsilon_\star)+\pi^{k-1}_{m-1}\epsilon_\star,\ k=1,...,K;\qquad \pi^0_0=1,\ \ \pi^0_m=0,\ m>0,\ \ \pi^{k-1}_{-1}=0,\ k=1,2,...,
\]
whence
\[
\pi^k_m=\begin{cases}\binom{k}{m}\epsilon_\star^m(1-\epsilon_\star)^{k-m},& m\leq k,\\ 0,& m>k.\end{cases}
\]
Therefore,
\[
P\{\lambda^K:{\rm Card}\{k\leq K:\psi^+_k(\lambda^k)=1\}\geq M\}=\sum_{M\leq k\leq K}\pi^K_k=\epsilon_M,
\]
as required.
✷
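The closed-form solution of the recurrence above is easy to confirm numerically; a small sketch (illustrative values, not from the book) running the recurrence and comparing with the binomial formula:

```python
from math import comb

eps, K = 0.3, 6   # illustrative eps_star and number of observations

# Run the recurrence pi^k_m = pi^{k-1}_m (1 - eps) + pi^{k-1}_{m-1} eps
pi = [1.0]                       # pi^0_0 = 1
for k in range(1, K + 1):
    pi = [(pi[m] if m < len(pi) else 0.0) * (1 - eps)
          + (pi[m - 1] if m >= 1 else 0.0) * eps
          for m in range(k + 1)]

closed = [comb(K, m) * eps**m * (1 - eps)**(K - m) for m in range(K + 1)]
M = 3
tail = sum(pi[M:])               # this is eps_M of (2.162)
print(max(abs(a - b) for a, b in zip(pi, closed)))  # ~ 0
```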
2.11.3 Proof of Theorem 2.23
1°. Since O is a simple o.s., the function Φ(φ, [µ; ν]) given by (2.56) is a well-defined real-valued function on $\mathcal F\times(\mathcal M\times\mathcal M)$ which is concave in [µ; ν]; convexity of the function in $\phi\in\mathcal F$ is evident. Since both $\mathcal F$ and $\mathcal M$ are convex sets coinciding with their relative interiors, convexity-concavity and real-valuedness of Φ on $\mathcal F\times(\mathcal M\times\mathcal M)$ imply the continuity of Φ on the indicated domain. As a consequence, Φ is a convex-concave continuous real-valued function on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$. Now let
\[
\Phi([\mu;\nu])=\inf_{\phi\in\mathcal F}\Phi(\phi,[\mu;\nu]). \tag{2.166}
\]
Note that Φ([µ; ν]), being the infimum of a family of concave functions of $[\mu;\nu]\in\mathcal M\times\mathcal M$, is concave on $\mathcal M\times\mathcal M$. We claim that for $\mu,\nu\in\mathcal M$ the function
\[
\phi_{\mu,\nu}(\omega)=\tfrac12\ln(p_\mu(\omega)/p_\nu(\omega))
\]
(which, by definition of a simple o.s., belongs to $\mathcal F$) is an optimal solution to the right-hand side minimization problem in (2.166), so that
\[
\forall(\mu\in\mathcal M_1,\ \nu\in\mathcal M_2):\quad \Phi([\mu;\nu]):=\inf_{\phi\in\mathcal F}\Phi(\phi,[\mu;\nu])=\Phi(\phi_{\mu,\nu},[\mu;\nu])=\ln\Big(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big). \tag{2.167}
\]
Indeed, we have
\[
\exp\{-\phi_{\mu,\nu}(\omega)\}p_\mu(\omega)=\exp\{\phi_{\mu,\nu}(\omega)\}p_\nu(\omega)=g(\omega):=\sqrt{p_\mu(\omega)p_\nu(\omega)}, \tag{2.168}
\]
whence $\Phi(\phi_{\mu,\nu},[\mu;\nu])=\ln\big(\int_\Omega g(\omega)\Pi(d\omega)\big)$. On the other hand, for $\phi(\cdot)=\phi_{\mu,\nu}(\cdot)+\delta(\cdot)\in\mathcal F$ we have
\[
(a)\quad \int_\Omega g(\omega)\Pi(d\omega)=\int_\Omega\big[\sqrt{g(\omega)}e^{-\delta(\omega)/2}\big]\big[\sqrt{g(\omega)}e^{\delta(\omega)/2}\big]\Pi(d\omega)\leq\Big(\int_\Omega g(\omega)e^{-\delta(\omega)}\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega g(\omega)e^{\delta(\omega)}\Pi(d\omega)\Big)^{1/2}
\]
\[
(b)\quad =\Big(\int_\Omega e^{-\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega e^{\phi(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)^{1/2}\ \ \text{[by (2.168)]}\quad\Rightarrow\quad \ln\int_\Omega g(\omega)\Pi(d\omega)\leq\Phi(\phi,[\mu;\nu]),
\]
and thus Φ(φµ,ν , [µ; ν]) ≤ Φ(φ, [µ; ν]) for every φ ∈ F.
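As a numeric illustration of (2.167) (a sketch with illustrative values, not part of the proof): for two one-dimensional Gaussian densities $p_\mu=\mathcal N(\mu,1)$, $p_\nu=\mathcal N(\nu,1)$, the integral $\int\sqrt{p_\mu p_\nu}$ has the closed form $e^{-(\mu-\nu)^2/8}$, which direct numerical integration reproduces:

```python
import numpy as np

mu, nu = 0.0, 1.0
x = np.linspace(-10.0, 11.0, 400_001)        # fine grid covering both densities

p_mu = np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
p_nu = np.exp(-(x - nu) ** 2 / 2) / np.sqrt(2 * np.pi)

# Riemann sum for int sqrt(p_mu p_nu) dPi
affinity = float(np.sum(np.sqrt(p_mu * p_nu)) * (x[1] - x[0]))
closed_form = float(np.exp(-(mu - nu) ** 2 / 8))
print(affinity, closed_form)                 # both ~ 0.8825
```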
Remark 2.49. Note that the above reasoning did not use the fact that the minimization on the right-hand side of (2.166) is over $\phi\in\mathcal F$; in fact, this reasoning shows that $\phi_{\mu,\nu}(\cdot)$ minimizes Φ(φ, [µ; ν]) over all functions φ for which the integrals $\int_\Omega\exp\{-\phi(\omega)\}p_\mu(\omega)\Pi(d\omega)$ and $\int_\Omega\exp\{\phi(\omega)\}p_\nu(\omega)\Pi(d\omega)$ exist.
Remark 2.50. Note that the inequality in (b) can be an equality only when the inequality in (a) is so. In other words, if $\bar\phi$ is a minimizer of Φ(φ, [µ; ν]) over $\phi\in\mathcal F$, then, setting $\delta(\cdot)=\bar\phi(\cdot)-\phi_{\mu,\nu}(\cdot)$, the functions $\sqrt{g(\omega)}\exp\{-\delta(\omega)/2\}$ and $\sqrt{g(\omega)}\exp\{\delta(\omega)/2\}$, considered as elements of $L_2[\Omega,\Pi]$, are proportional to each other. Since g is positive and g, δ are continuous, while the support of Π is the entire Ω, this “$L_2$-proportionality” means that the functions in question differ by a constant factor, or, which is the same, that δ(·) is constant. Thus, the minimizers of Φ(φ, [µ; ν]) over $\phi\in\mathcal F$ are exactly the functions of the form $\phi(\omega)=\phi_{\mu,\nu}(\omega)+{\rm const}$.
2°. Let us verify that Φ(φ, [µ; ν]) has a saddle point (min in $\phi\in\mathcal F$, max in $[\mu;\nu]\in\mathcal M_1\times\mathcal M_2$). First, observe that on the domain of Φ it holds
\[
\Phi(\phi(\cdot)+a,[\mu;\nu])=\Phi(\phi(\cdot),[\mu;\nu])\quad\forall(a\in\mathbf R,\ \phi\in\mathcal F). \tag{2.169}
\]
Let us select some $\bar\mu\in\mathcal M$, and let P be the measure on Ω with density $p_{\bar\mu}$ w.r.t. Π. For $\phi\in\mathcal F$, the integrals $\int_\Omega e^{\pm\phi(\omega)}P(d\omega)$ are finite (since O is simple), implying that $\phi\in L_1[\Omega,P]$; note also that P is a probabilistic measure. Let now $\mathcal F_0=\{\phi\in\mathcal F:\int_\Omega\phi(\omega)P(d\omega)=0\}$, so that $\mathcal F_0$ is a linear subspace in $\mathcal F$, and all functions $\phi\in\mathcal F$ can be obtained by shifts of functions from $\mathcal F_0$ by constants. Now, by (2.169), to prove the existence of a saddle point of Φ on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$ is exactly the same as to prove the existence of a saddle point of Φ on $\mathcal F_0\times(\mathcal M_1\times\mathcal M_2)$. Let us verify that Φ(φ, [µ; ν]) indeed has a saddle point on $\mathcal F_0\times(\mathcal M_1\times\mathcal M_2)$. Because $\mathcal M_1\times\mathcal M_2$ is a convex compact set, and Φ is continuous on $\mathcal F_0\times(\mathcal M_1\times\mathcal M_2)$ and convex-concave, invoking the Sion-Kakutani Theorem we see that all we need in order to prove the existence of a saddle point is to verify that Φ is coercive in the first argument. In other words, we have to show that for every fixed $[\mu;\nu]\in\mathcal M_1\times\mathcal M_2$ one has $\Phi(\phi,[\mu;\nu])\to+\infty$ as $\phi\in\mathcal F_0$ and $\|\phi\|\to\infty$ (whatever be the norm ‖·‖ on $\mathcal F_0$; recall that $\mathcal F_0$ is a finite-dimensional linear space). Setting
\[
\Theta(\phi)=\Phi(\phi,[\mu;\nu])=\tfrac12\Big[\ln\Big(\int_\Omega e^{-\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)+\ln\Big(\int_\Omega e^{\phi(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)\Big]
\]
and taking into account that Θ is convex and finite on $\mathcal F_0$, in order to prove that Θ is coercive, it suffices to verify that $\Theta(t\phi)\to\infty$, $t\to\infty$, for every nonzero $\phi\in\mathcal F_0$, which is evident: since $\int_\Omega\phi(\omega)P(d\omega)=0$ and φ is nonzero, we have $\int_\Omega\max[\phi(\omega),0]P(d\omega)=\int_\Omega\max[-\phi(\omega),0]P(d\omega)>0$, whence φ > 0 and φ < 0 on sets of Π-positive measure, so that $\Theta(t\phi)\to\infty$ as $t\to\infty$ due to the fact that both $p_\mu(\cdot)$ and $p_\nu(\cdot)$ are continuous and everywhere positive.

3°. Now let $(\phi_*(\cdot);[\mu_*;\nu_*])$ be a saddle point of Φ on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$. Shifting, if necessary, $\phi_*(\cdot)$ by a constant (by (2.169), this does not affect the fact that $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of Φ), we can assume that
\[
\varepsilon_\star:=\int_\Omega\exp\{-\phi_*(\omega)\}p_{\mu_*}(\omega)\Pi(d\omega)=\int_\Omega\exp\{\phi_*(\omega)\}p_{\nu_*}(\omega)\Pi(d\omega),
\]
so that the saddle point value of Φ is
\[
\Phi_*:=\max_{[\mu;\nu]\in\mathcal M_1\times\mathcal M_2}\ \min_{\phi\in\mathcal F}\ \Phi(\phi,[\mu;\nu])=\Phi(\phi_*,[\mu_*;\nu_*])=\ln(\varepsilon_\star), \tag{2.170}
\]
as claimed in item (i) of the theorem.
Now let us prove (2.58). For $\mu\in\mathcal M_1$, we have
\[
\ln(\varepsilon_\star)=\Phi_*\geq\Phi(\phi_*,[\mu;\nu_*])=\tfrac12\ln\Big(\int_\Omega e^{-\phi_*(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)+\tfrac12\ln\Big(\int_\Omega e^{\phi_*(\omega)}p_{\nu_*}(\omega)\Pi(d\omega)\Big)=\tfrac12\ln\Big(\int_\Omega e^{-\phi_*(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)+\tfrac12\ln(\varepsilon_\star).
\]
Hence,
\[
\ln\Big(\int_\Omega e^{-\phi^a_*(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)=\ln\Big(\int_\Omega e^{-\phi_*(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)+a\leq\ln(\varepsilon_\star)+a,
\]
and (2.58.a) follows. Similarly, when $\nu\in\mathcal M_2$, we have
\[
\ln(\varepsilon_\star)=\Phi_*\geq\Phi(\phi_*,[\mu_*;\nu])=\tfrac12\ln\Big(\int_\Omega e^{-\phi_*(\omega)}p_{\mu_*}(\omega)\Pi(d\omega)\Big)+\tfrac12\ln\Big(\int_\Omega e^{\phi_*(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)=\tfrac12\ln(\varepsilon_\star)+\tfrac12\ln\Big(\int_\Omega e^{\phi_*(\omega)}p_\nu(\omega)\Pi(d\omega)\Big),
\]
so that
\[
\ln\Big(\int_\Omega e^{\phi^a_*(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)=\ln\Big(\int_\Omega e^{\phi_*(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)-a\leq\ln(\varepsilon_\star)-a,
\]
and (2.58.b) follows. We have proved all statements of item (i), except for the claim that the $\phi_*,\varepsilon_\star$ just defined form an optimal solution to (2.59). Note that by (2.58) as applied with a = 0, the pair in question is feasible for (2.59). Assuming that the problem admits a feasible solution $(\bar\phi,\epsilon)$ with $\epsilon<\varepsilon_\star$, let us lead this assumption to a contradiction. Note that $\bar\phi$ should be such that
\[
\int_\Omega e^{-\bar\phi(\omega)}p_{\mu_*}(\omega)\Pi(d\omega)<\varepsilon_\star\quad\&\quad\int_\Omega e^{\bar\phi(\omega)}p_{\nu_*}(\omega)\Pi(d\omega)<\varepsilon_\star,
\]
and consequently $\Phi(\bar\phi,[\mu_*;\nu_*])<\ln(\varepsilon_\star)$. On the other hand, Remark 2.49 says that $\Phi(\bar\phi,[\mu_*;\nu_*])$ cannot be less than $\min_{\phi\in\mathcal F}\Phi(\phi,[\mu_*;\nu_*])$, and the latter quantity is $\Phi(\phi_*,[\mu_*;\nu_*])$ because $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of Φ on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$. Thus, assuming that the optimal value in (2.59) is $<\varepsilon_\star$, we conclude that $\Phi(\phi_*,[\mu_*;\nu_*])\leq\Phi(\bar\phi,[\mu_*;\nu_*])<\ln(\varepsilon_\star)$, contradicting (2.170). Item (i) of Theorem 2.23 is proved.

4°. Let us prove item (ii) of Theorem 2.23. Relation (2.60) and concavity of the right-hand side of this relation in [µ; ν] were already proved; moreover, these relations were proved in the range $\mathcal M\times\mathcal M$ of [µ; ν]. Since this range coincides with its relative interior, the real-valued concave function Φ([µ; ν]) is continuous on $\mathcal M\times\mathcal M$ and thus is continuous on $\mathcal M_1\times\mathcal M_2$. Next, let $\phi_*$ be the φ-component of a saddle point of Φ on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$ (we already know that such a saddle point exists). By Proposition 2.21, the [µ; ν]-components of saddle points of Φ on $\mathcal F\times(\mathcal M_1\times\mathcal M_2)$ are exactly the maximizers of Φ([µ; ν]) on $\mathcal M_1\times\mathcal M_2$; let $[\mu_*;\nu_*]$ be such a maximizer. By the same proposition, $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of Φ, whence Φ(φ, [µ∗; ν∗]) attains its minimum over $\phi\in\mathcal F$ at $\phi=\phi_*$. We have also seen that Φ(φ, [µ∗; ν∗]) attains its minimum over $\phi\in\mathcal F$ at $\phi=\phi_{\mu_*,\nu_*}$. These observations combine with Remark 2.50 to imply that $\phi_*$ and $\phi_{\mu_*,\nu_*}$ differ by a constant, which, in view of (2.169), means that $(\phi_{\mu_*,\nu_*},[\mu_*;\nu_*])$ is a saddle point of Φ along with $(\phi_*,[\mu_*;\nu_*])$. (ii) is proved.

5°. It remains to prove item (iii) of Theorem 2.23. In the notation from (iii), the simple hypotheses (A) and (B) can be decided upon with total risk ≤ 2ǫ, and therefore, by Proposition 2.2,
\[
2\bar\epsilon:=\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\leq 2\epsilon.
\]
On the other hand, we have seen that the saddle point value of Φ is $\ln(\varepsilon_\star)$; since $[\mu_*;\nu_*]$ is a component of a saddle point of Φ, it follows that $\min_{\phi\in\mathcal F}\Phi(\phi,[\mu_*;\nu_*])=\ln(\varepsilon_\star)$. The left-hand side in this equality, by item 1°, is $\Phi(\phi_{\mu_*,\nu_*},[\mu_*;\nu_*])$, and we
arrive at 1 2
ln(ε⋆ ) = Φ( ln(pµ∗ (·)/pν∗ (·)), [µ∗ ; ν∗ ]) = ln so that ε⋆ =
Z q
pµ∗ (ω)pν∗ (ω)Π(dω) =
Ω
We now have ε⋆
Z q
pµ∗ (ω)pν∗ (ω)Π(dω) ,
Ω
Z p
p(ω)q(ω)Π(dω).
Ω
p R p R p = Ω p(ω)q(ω)Π(dω) = Ω min[p(ω), q(ω)] max[p(ω), q(ω)]Π(dω) 1/2 1/2 R R max[p(ω), q(ω)]Π(dω) ≤ Ω min[p(ω), q(ω)]Π(dω) 1/2 1/2 RΩ R (p(ω) + q(ω) − min[p(ω), q(ω)])Π(dω) = pΩ min[p(ω), q(ω)]Π(dω) Ω p = 2¯ ǫ(2 − 2¯ ǫ) ≤ 2 (1 − ǫ)ǫ,
where the concluding inequality is due to ǫ¯ ≤ ǫ ≤ 1/2. (iii) is proved, and the proof of Theorem 2.23 is complete. ✷ 2.11.4
Proof of Proposition 2.37
All we need is to verify (2.107) and to check that the righthand side function in this relation is convex. The latter is evident, since φX (h) + φX (−h) ≥ 2φX (0) = 0 and φX (h) + φX (−h) is convex. To verify (2.107), let us fix P ∈ P[X] and h ∈ Rd and set ν = hT e[P ], so that ν is the expectation of hT ω with ω ∼ P . Note that for ω ∼ P we have hT ω ∈ [−φX (−h), φX (h)] with P probability 1, whence −φX (−h) ≤ ν ≤ φX (h). In particular, when φX (h) + φX (−h) = 0, hT ω = ν with P probability 1, so that (2.107) definitely holds true. Now let η :=
1 2
[φX (h) + φX (−h)] > 0,
and let a=
1 2
[φX (h) − φX (−h)] , β = (ν − a)/η.
Denoting by Ph the distribution of hT ω induced by the distribution P of ω and noting that this distribution is supported on [−φX (−h), φX (h)] = [a − η, a + η] and has expectation ν, we get β ∈ [−1, 1] and γ :=
Z
exp{hT ω}P (dω) =
Z
a+η a−η
[es − λ(s − ν)]Ph (ds)
for all λ ∈ R. Hence, ln(γ) ≤ inf ln max [es − λ(s − ν)] a−η≤s≤a+η λ = a + inf ln max [et − ρ(t − [ν − a])] [substituting λ = ea ρ, s = a + t] ρ −η≤t≤η = a + inf ln max [et − ρ(t − ηβ)] ≤ a + ln max [et − ρ¯(t − ηβ) ρ
−η≤t≤η
−η≤t≤η
176
CHAPTER 2
with ρ¯ = (2η)−1 (eη − e−η ). The function g(t) = et − ρ¯(t − ηβ) is convex on [−η, η], and g(−η) = g(η) = cosh(η) + β sinh(η), which combines with the above computation to yield the relation ln(γ) ≤ a + ln(cosh(η) + β sinh(η)).
(2.171)
Thus, all we need to verify is that ∀(η > 0, β ∈ [−1, 1]) : βη + 21 η 2 − ln(cosh(η) + β sinh(η)) ≥ 0.
(2.172)
Indeed, if (2.172) holds true (2.171) implies that ln(γ) ≤ a + βη + 12 η 2 = ν + 21 η 2 , which, recalling what γ, ν, and η are, is exactly what we want to prove. Verification of (2.172) is as follows. The lefthand side in (2.172) is convex in β for β > − cosh(η) sinh(η) containing, due to η > 0, the range of β in (2.172). Furthermore, the minimum of the lefthand side of (2.172) over β > − coth(η) is attained at cosh(η) and is equal to β = sinh(η)−η η sinh(η) r(η) = 12 η 2 + 1 − η coth(η) − ln(sinh(η)/η). All we need to prove is that the latter quantity is nonnegative whenever η > 0. We have r′ (η) = η − coth(η) − η(1 − coth2 (η)) − coth(η) + η −1 = (η coth(η) − 1)2 η −1 ≥ 0, and since r(+0) = 0, we get r(η) ≥ 0 when η > 0. 2.11.5
✷
Proof of Proposition 2.43
2.11.5.A Proof of Proposition 2.43.i
A 1 . Let b = [0; ...; 0; 1] ∈ R , so that B = , and let A(u) = A[u; 1]. For bT any u ∈ Rn , h ∈ Rd , Θ ∈ Sd+ , and H ∈ Sd such that −I ≺ Θ1/2 HΘ1/2 ≺ I we have Ψ(h, H; u, Θ) :=ln Eζ∼N (A(u),Θ) exp{hT ζ + 21 ζ T Hζ} = ln Eξ∼N (0,I) exp{hT [A(u) + Θ1/2 ξ] + 21 [A(u) + Θ1/2 ξ]T H[A(u) + Θ1/2 ξ]} = − 12 ln Det(I − Θ1/2 HΘ1/2 ) + hT A(u) + 21 A(u)T HA(u) 1/2 1/2 1/2 −1 1/2 + 21 [HA(u) + h]T Θ ] Θ [HA(u) + h] T[I − Θ T HΘ 1 1 T 1/2 1/2 T = − 2 ln Det(I − Θ HΘ ) + 2 [u; 1] bh A + A hb + AT HA [u; 1] + 21 [u; 1]T B T [H, h]T Θ1/2 [I − Θ1/2 HΘ1/2 ]−1 Θ1/2 [H, h]B [u; 1] (2.173) due to hT A(u) = [u; 1]T bhT A[u; 1] = [u; 1]T AT hbT [u; 1] o
n+1
and HA(u) + h = [H, h]B[u; 1].
177
HYPOTHESIS TESTING
Observe that when (h, H) ∈ Hγ , we have −1 Θ1/2 [I − Θ1/2 HΘ1/2 ]−1 Θ1/2 = [Θ−1 − H]−1 [Θ−1 , ∗ − H]
so that (2.173) implies that for all u ∈ Rn , Θ ∈ V, and (h, H) ∈ Hγ , Ψ(h, H; u,Θ) ≤ − 12 ln Det(I − Θ1/2 HΘ1/2 ) −1 [H, h]B [u; 1] + 12 [u; 1]T bhT A + AT hbT + AT HA + B T [H, h]T [Θ−1 ∗ − H]  {z } Q[H,h]
= − 21 ln Det(I − Θ1/2 HΘ1/2 ) + 12 Tr(Q[H, h]Z(u)) ≤ − 21 ln Det(I − Θ1/2 HΘ1/2 ) + ΓZ (h, H), ΓZ (h, H) = 12 φZ (Q[H, h])
(2.174) (we have taken into account that Z(u) ∈ Z when u ∈ U , the premise of the proposition, and therefore Tr(Q[H, h]Z(u)) ≤ φZ (Q[H, h])). Note that the above function Q[H, h] is nothing but H h T −1 −1 + [H, h] [Θ − H] [H, h] B. (2.175) Q[H, h] = B T ∗ hT 2o . We need the following: Lemma 2.51. Let Θ∗ be a d × d symmetric positive definite matrix, let δ ∈ [0, 2], and let V be a closed convex subset of Sd+ such that −1/2
Θ ∈ V ⇒ {Θ Θ∗ } & {kΘ1/2 Θ∗
− Id k ≤ δ}
(2.176)
−1 (cf. (2.129)). Let also Ho := {H ∈ Sd : −Θ−1 ∗ ≺ H ≺ Θ∗ }. Then
∀(H, Θ) ∈ Ho × V : G(H; Θ) := − 12 ln Det(I − Θ1/2 HΘ1/2 ) 1/2 1/2 ≤ G+ (H; Θ) := − 12 ln Det(I − Θ∗ HΘ∗ ) + 21 Tr([Θ − Θ∗ ]H) 1/2 1/2 δ(2+δ) + kΘ∗ HΘ∗ k2F , 1/2 1/2 2(1−kΘ∗
HΘ∗
(2.177)
k)
where k · k is the spectral, and k · kF the Frobenius norm of a matrix. In addition, G+ (H, Θ) is a continuous function on Ho × V which is convex in H ∈ H o and concave (in fact, affine) in Θ ∈ V Proof. Let us set
1/2
1/2
d(H) = kΘ∗ HΘ∗ k,
so that d(H) < 1 for H ∈ Ho . For H ∈ Ho and Θ ∈ V fixed we have kΘ1/2 HΘ1/2 k = ≤
−1/2
1/2
1/2
−1/2
k[Θ1/2 Θ∗ ][Θ∗ HΘ∗ ][Θ1/2 Θ∗ ]T k −1/2 1/2 1/2 1/2 1/2 kΘ1/2 Θ∗ k2 kΘ∗ HΘ∗ k ≤ kΘ∗ HΘ∗ k = d(H) (2.178) 1/2 −1/2 (we have used the fact that 0 Θ Θ∗ implies kΘ Θ∗ k ≤ 1). Noting that
178
CHAPTER 2
kABkF ≤ kAkkBkF , a computation completely similar to the one in (2.178) yields 1/2
1/2
kΘ1/2 HΘ1/2 kF ≤ kΘ∗ HΘ∗ kF =: D(H).
(2.179)
Besides this, setting F (X) = − ln Det(X) : int Sd+ → R and equipping Sd with the 1/2 1/2 Frobenius inner product, we have ∇F (X) = −X −1 , so that with R0 = Θ∗ HΘ∗ , R1 = Θ1/2 HΘ1/2 , and ∆ = R1 − R0 , we have for properly selected λ ∈ (0, 1) and Rλ = λR0 + (1 − λ)R1 : F (I − R1 )
= = =
F (I − R0 − ∆) = F (I − R0 ) + h∇F (I − Rλ ), −∆i F (I − R0 ) + h(I − Rλ )−1 , ∆i F (I − R0 ) + hI, ∆i + h(I − Rλ )−1 − I, ∆i.
We conclude that F (I − R1 ) ≤ F (I − R0 ) + Tr(∆) + kI − (I − Rλ )−1 kF k∆kF .
(2.180)
Denoting by µi the eigenvalues of Rλ and noting that kRλ k ≤ max[kR0 k, kR1 k] = 1 = d(H) (see (2.178)), we have µi  ≤ d(H), and therefore eigenvalues νi = 1 − 1−µ i µi −1 − 1−µi of I − (I − Rλ ) satisfy νi  ≤ µi /(1 − µi ) ≤ µi /(1 − d(H)), whence kI − (I − Rλ )−1 kF ≤ kRλ kF /(1 − d(H)). Noting that kRλ kF ≤ max[kR0 kF , kR1 kF ] ≤ D(H)—see (2.179)—we conclude that kI − (I − Rλ )−1 kF ≤ D(H)/(1 − d(H)), so that (2.180) yields F (I − R1 ) ≤ F (I − R0 ) + Tr(∆) + D(H)k∆kF /(1 − d(H)). −1/2
Further, by (2.129) the matrix D = Θ1/2 Θ∗ 1/2
(2.181)
− I satisfies kDk ≤ δ, whence
1/2
1/2 = (I +D)R0 (I +DT )−R0 = DR0 +R0 DT +DR0 DT . ∆ = Θ1/2 HΘ HΘ {z }−Θ  ∗ {z ∗ } R1
R0
Consequently, k∆kF
kDR0 kF + kR0 DT kF + kDR0 DT kF ≤ [2kDk + kDk2 ]kR0 kF δ(2 + δ)kR0 kF = δ(2 + δ)D(H).
≤ ≤
This combines with (2.181) and the relation 1/2
1/2
Tr(∆) = Tr(Θ1/2 HΘ1/2 − Θ∗ HΘ∗ ) = Tr([Θ − Θ∗ ]H) to yield F (I − R1 )
≤
=
F (I − R0 ) + Tr([Θ − Θ∗ ]H) +
F (I − R0 ) + Tr([Θ − Θ∗ ]H) +
δ(2+δ) 1−d(H) D(H) 1/2 1/2 δ(2+δ) HΘ∗ k2F , 1/2 1/2 kΘ∗ 1−kΘ∗ HΘ∗ }
and we arrive at (2.177). It remains to prove that G+ (H; Θ) is convexconcave and continuous on Ho × V. The only component of this claim which is not completely evident is convexity of the function in H ∈ Ho . To see that it is the case, note that ln Det(S) is concave on the interior of the semidefinite cone, the function
179
HYPOTHESIS TESTING
f (u, v) =
u2 1−v
is convex and nondecreasing in u, v in the convex domain Π =
{(u, v) : u ≥ 0, v < 1}, and the function convex substitution of variables H 7→ into Π. .
1/2 2 kΘ1/2 ∗ HΘ∗ kF 1/2
1/2
is obtained from f by
1−kΘ∗ HΘ∗ k 1/2 1/2 1/2 1/2 (kΘ∗ HΘ∗ kF , kΘ∗ HΘ∗ k)
mapping Ho ✷
3o . Combining (2.177), (2.174), and (2.130) and the origin of Ψ—see (2.173)—we arrive at ∀((u, Θ) ∈ U × V, (h, H) ∈ Hγ = H) : ln Eζ∼N (A[u;1],Θ) exp{hT ζ + 12 ζ T Hζ} ≤ ΦA,Z (h, H; Θ),
as claimed in (2.133).
4o . Now let us check that ΦA,Z (h, H; Θ) : H × V → R is continuous and convexconcave. Recalling that the function G+ (H; Θ) from (2.177) is convexconcave and continuous on Ho ×V, all we need to verify is that ΓZ (h, H) is convex and continuous on H. Recalling that Z is a nonempty compact set, the function φZ (·) : Sd+1 → R is continuous, implying the continuity of ΓZ (h, H) = 12 φZ (Q[H, h]) on H = Hγ (Q[H, h] is defined in (2.175)). To prove convexity of ΓZ , note that Z is contained in Sn+1 + , implying that φZ (·) is convex and monotone. On the other hand, by the Schur Complement Lemma, we have S
:= =
{(h, (h, H) ∈ Hγ } H, G) : G Q[H, h], T G − [bh A + AT hbT + AT HA] (h, H, G) : [H, h]B
B T [H, h]T Θ−1 ∗ −H
0, γ (h, H) ∈ H ,
implying that S is convex. Since φZ (·) is monotone, we have {(h, H, τ ) : (h, H) ∈ Hγ , τ ≥ ΓZ (h, H)} = {(h, H, τ ) : ∃G : G Q[H, h], 2τ ≥ φZ (G), (h, H) ∈ Hγ }, and we see that the epigraph of ΓZ is convex (since the set S and the epigraph of φZ are so), as claimed. 5o . It remains to prove that ΦA,Z is coercive in H, h. Let Θ ∈ V and (hi , Hi ) ∈ Hγ with k(hi , Hi )k → ∞ as i → ∞, and let us prove that ΦA,Z (hi , Hi ; Θ) → ∞. Looking at the expression for ΦA,Z (hi , Hi ; Θ), it is immediately seen that all terms in this expression, except for the terms coming from φZ (·), remain bounded as i grows, so that all we need to verify is that the φZ (·)term goes to ∞ as i → ∞. Observe that Hi are uniformly bounded due to (hi , Hi ) ∈ Hγ , implying that khi k2 → ∞ as i → ∞. Denoting e = [0; ...; 0; 1] ∈ Rd+1 and, as before, b = [0; ...; 0; 1] ∈ Rn+1 , note that, by construction, B T e = b. Now let W ∈ Z, so −1 that Wn+1,n+1 = 1. Taking into account that the matrices [Θ−1 satisfy ∗ − Hi ] −1 γ αId [Θ−1 − H ] βI for some positive α, β due to H ∈ H , observe that i d i ∗ H i hi T −1 −1 + [Hi , hi ] [Θ−1 [Hi , hi ] = hTi [Θ−1 hi eeT + Ri , ∗ − Hi ] ∗ − Hi ] T hi {z }   {z } α kh k2 Qi =Q[Hi ,hi ]
i
i 2
180
CHAPTER 2
where αi ≥ α > 0 and kRi kF ≤ C(1 + khi k2 ). As a result, φZ (B T Qi B)
≥
=
Tr(W B T Qi B) = Tr(W B T [αi khi k22 eeT + Ri ]B) αi khi k22 Tr(W bbT ) −kBW B T kF kRi kF  {z } =Wn+1,n+1 =1
≥
αkhi k22
− C(1 + khi k2 )kBW B T kF ,
and the concluding quantity tends to ∞ as i → ∞ due to khi k2 → ∞, i → ∞. Part (i) is proved. 2.11.5.B Proof of Proposition 2.43.ii By (i) the function Φ(h, H; Θ1 , Θ2 ), as defined in (2.134), is continuous and convexconcave on the domain (H1 ∩ H2 ) × (V1 × V2 ) and is coercive in (h, H), H and V  {z }  {z } H
V
are closed and convex, and V in addition is compact, so that saddle point problem (2.134) is solvable (SionKakutani Theorem, a.k.a. Theorem 2.22). Now let (h∗ , H∗ ; Θ∗1 , Θ∗2 ) be a saddle point. To prove (2.136), let P ∈ G1 , that is, P = N (A1 [u; 1], Θ1 ) for some Θ1 ∈ V1 and some u with [u; 1][u; 1]T ∈ Z1 . Applying (2.133) to the first collection of data, with a given by (2.135), we get the first ≤ in the following chain: R 1 T T e− 2 ω H∗ ω−ω h∗ −a P (dω) ≤ ΦA1 ,Z1 (−h∗ , −H∗ ; Θ1 ) − a ln ≤ Φ (−h∗ , −H∗ ; Θ∗1 ) − a {z} = SV , {z} A1 ,Z1 (b)
(a)
where (a) is due to the fact that ΦA1 ,Z1 (−h∗ , −H∗ ; Θ1 ) + ΦA2 ,Z2 (h∗ , H∗ ; Θ2 ) attains its maximum over (Θ1 , Θ2 ) ∈ V1 × V2 at the point (Θ∗1 , Θ∗2 ), and (b) is due to the origin of a and the relation SV = 21 [ΦA1 ,Z1 (−h∗ , −H∗ ; Θ∗1 ) + ΦA2 ,Z2 (h∗ , H∗ ; Θ∗2 )]. The bound in (2.136.a) is proved. Similarly, let P ∈ G2 , that is, P = N (A2 [u; 1], Θ2 ) for some Θ2 ∈ V2 and some u with [u; 1][u; 1]T ∈ Z2 . Applying (2.133) to the second collection of data, with the same a as above, we get the first ≤ in the following chain: R 1 T T ln e 2 ω H∗ ω+ω h∗ +a P (dω) ≤ ΦA2 ,Z2 (h∗ , H∗ ; Θ2 ) + a (h , H ; Θ∗ ) + a {z} = SV , ≤ Φ {z} A2 ,Z2 ∗ ∗ 2 (b)
(a)
with exactly the same justification of (a) and (b) as above. The bound in (2.136.b) is proved. ✷ 2.11.6
Proof of Proposition 2.46
2.11.6.A Preliminaries We start with the following result: ¯ be a positive definite d × d matrix, Lemma 2.52. Let Θ u 7→ C(u) = A[u; 1]
B=
A 0, ..., 0, 1
, and let
HYPOTHESIS TESTING
181
be an affine mapping from Rn into Rd . Finally, let h ∈ Rd , H ∈ Sd and P ∈ Sd satisfy the relations ¯ 1/2 H Θ ¯ 1/2 . 0 P ≺ Id & P Θ (2.182)
¯ and for every u ∈ Rn it holds Then, ζ ∼ SG(C(u), Θ) n T 1 T o ≤ − 12 ln Det(I − P ) ln Eζ eh ζ+ 2 ζ Hζ h i T ¯ 1/2 H h −1 ¯ 1/2 + [H, h] Θ [I − P ] Θ [H, h] B[u; 1] + 21 [u; 1]T B T T h
(2.183)
¯ −1/2 P Θ ¯ −1/2 ): whenever h ∈ Rd , H ∈ Sd and G ∈ Sd Equivalently (set G = Θ satisfy the relations ¯ −1 & G H, 0G≺Θ (2.184)
¯ and every for every u ∈ Rn : one has for ζ ∼ SG(C(u), Θ) n T 1 T o ¯ 1/2 GΘ ¯ 1/2 ) ln Eζ eh ζ+ 2 ζ Hζ ≤ − 21 ln Det(I − Θ h i T ¯ −1 H h −1 + 21 [u; 1]T B T + [H, h] [ Θ − G] [H, h] B[u; 1]. T h
(2.185)
Proof. 1o . Let us start with the following observation:
Lemma 2.53. Let Θ ∈ Sd+ and S ∈ Rd×d be such that SΘS T ≺ Id . Then for every ν ∈ Rd one has o o n 1 T n T 1 T T T ln Eξ∼SG(0,Θ) eν Sξ+ 2 ξ S Sξ ≤ ln Eη∼N (ν,Id ) e 2 η SΘS η (2.186) = − 12 ln Det(Id − SΘS T ) + 21 ν T SΘS T (Id − SΘS T )−1 ν. Indeed, let ξ ∼ SG(0, Θ) and η ∼ N (ν, Id ) be independent. We have n T o n n oo n n T T oo 1 T T T Eξ eν Sξ+ 2 ξ S Sξ {z} = Eξ Eη e[Sξ] η = Eη Eξ e[S η] ξ o a n 1 T T ≤ Eη e 2 η SΘS η , {z} b
where a is due to η ∼ N (ν, Id ) and b is due to ξ ∼ SG(0, Θ). We have verified the inequality in (2.186); the equality in (2.186) is given by direct computation. ✷ 2o . Now, in the situation described in Lemma 2.52, by continuity it suffices to prove (2.183) in the case when P 0 in (2.182) is replaced with P ≻ 0. Under the premise of the lemma, given u ∈ Rn and assuming P ≻ 0, let us set µ = C(u) = A[u; 1], ¯ 1/2 [Hµ + h], and S = P 1/2 Θ ¯ −1/2 , so that S ΘS ¯ T = P ≺ Id , and let ν = P −1/2 Θ −1/2 ¯ −1/2 ¯ ¯ G=Θ PΘ , so that G H. Let ζ ∼ SG(µ, Θ). Representing ζ as ζ = µ + ξ
182
CHAPTER 2
¯ we have with ξ ∼ SG(0, Θ), o n T 1 T o n 1 T T = hT µ + 21 µT Hµ + ln Eξ e[h+Hµ] ξ+ 2 ξ Hξ ln Eζ eh ζ+ 2 ζ Hζ o n 1 T T ≤ hT µ + 12 µT Hµ + ln Eξ e[h+Hµ] ξ+ 2 ξ Gξ [since G H] n o = hT µ + 12 µT Hµ + ln Eξ eν
T
1
Sξ+ 2 ξ T S T Sξ
[since S T ν = h + Hµ and G = S T S] ¯ T ) + 1 ν T S ΘS ¯ T (Id − S ΘS ¯ T )−1 ν ≤ hT µ + 12 µT Hµ − 21 ln Det(Id − S ΘS 2 ¯ [by Lemma 2.53 with Θ = Θ] 1 1 1 T T ¯ 1/2 −1 ¯ 1/2 T = h µ + 2 µ Hµ − 2 ln Det(Id − P ) + 2 [Hµ + h] Θ (Id − P ) Θ [Hµ + h] [plugging in S and ν].
It is immediately seen that the concluding quantity in this chain is nothing but the righthand side quantity in (2.183). ✷ 2.11.6.B Completing the proof of Proposition 2.46. ¯ = Θ∗ , 1o . Let us prove (2.142.a). By Lemma 2.52 (see (2.185)) applied with Θ setting C(u) = A[u; 1], we have ∀ (h, H) ∈ H, G : 0n G γ + Θ−1 , G H, u ∈ Rn : [u; 1][u; 1]T ∈ Z : ∗ o 1 T T 1/2 1/2 ≤ − 12 ln Det(I − Θ∗ GΘ∗ ) ln Eζ∼SG(C(u),Θ∗ ) eh ζ+ 2 ζ Hζ h i T h H −1 −1 + 12 [u; 1]T B T + [H, h] [Θ − G] [H, h] B[u; 1] T ∗ h 1/2
1/2
≤ − 12 ln Det(I h − Θ∗ GΘ∗ ) + 21 φZ B T
H hT
h
i T −1 + [H, h] [Θ−1 − G] [H, h] B = ΨA,Z (h, H, G), ∗
(2.187) implying, due to the origin of ΦA,Z , that under the premise of (2.187) we have n T 1 T o ≤ ΦA,Z (h, H), ∀(h, H) ∈ H. ln Eζ∼SG(C(u),Θ∗ ) eh ζ+ 2 ζ Hζ
Taking into account that when ζ ∼ SG(C(u), Θ) with Θ ∈ V, we have also ζ ∼ SG(C(u), Θ∗ ); (2.142.a) follows.
2o . Now let us prove (2.142.b). All we need is to verify the relation + −1 n T ∀ (h, H) ∈ H, G : 0 G γ Θ∗ , G n H, u ∈ Ro: [u; 1][u; 1] ∈ Z, Θ ∈ V : ln Eζ∼SG(C(u),Θ) eh
T
1
ζ+ 2 ζ T Hζ
≤ ΨδA,Z (h, H, G; Θ);
(2.188) with this relation at our disposal (2.142.b) can be obtained by the same argument as the one we used in item 1o to derive (2.142.a). To establish (2.188), let us fix h, H, G, u, Θ satisfying the premise of (2.188); recall that under the premise of Proposition 2.46.i, we have 0 Θ Θ∗ . Now let λ ∈ (0, 1), and let Θλ = Θ + λ(Θ∗ − Θ), so that 0 ≺ Θλ Θ∗ , and let δλ = 1/2 −1/2 + −1 kΘλ Θ∗ − Id k, implying that δλ ∈ [0, 2]. We have 0 G γ + Θ−1 ∗ γ Θλ , ¯ ¯ that is, H, G satisfy (2.184) w.r.t. Θ = Θλ . As a result, for our h, G, H, u, the Θ
183
HYPOTHESIS TESTING
just defined and the ζ ∼ SG(C(u), Θλ ) relation (2.185) hold true: n T 1 T o 1/2 1/2 ≤ − 21 ln Det(I − Θλ GΘλ ) ln Eζ eh ζ+ 2 ζ Hζ h i T H h −1 −1 + [H, h] [Θ − G] [H, h] B[u; 1] + 12 [u; 1]T B T λ hT 1/2
1/2
≤ − 21 ln Det(I h− Θλ GΘλ )
+ 12 φZ B T
h
H hT
(2.189)
i T −1 + [H, h] [Θ−1 − G] [H, h] B λ
(recall that [u; 1][u; 1]T ∈ Z). As a result,
1 T T 1/2 1/2 ≤ − 21 ln Det(I − Θλ GΘλ ) ln Eζ∼SG(C(u),Θ) eh ζ+ 2 ζ Hζ H h −1 + 12 φZ B T + [H, h]T [Θ−1 [H, h] B . T ∗ − G] h
(2.190)
When deriving (2.190) from (2.189), we have used that — Θ Θλ , so that when ζ ∼ SG(C(u), Θ), we have also ζ ∼ SG(C(u), Θλ ), −1 −1 −1 — 0 Θλ Θ∗ and G ≺ Θ−1 [Θ−1 , ∗ , whence [Θλ − G] ∗ − G] n+1 — Z ⊂ S+ , whence φZ is monotone: φZ (M ) ≤ φZ (N ) whenever M N .
By Lemma 2.51 applied with Θλ in the role of Θ and δλ in the role of δ, we have 1/2
1/2
1/2
1/2
− 21 ln Det(I − Θλ GΘλ ) ≤ − 12 ln Det(I − Θ∗ GΘ∗ ) + 12 Tr([Θλ − Θ∗ ]G) 1/2 1/2 δλ (2+δλ ) kΘ∗ GΘ∗ k2F . + 1/2 1/2 2(1−kΘ∗
GΘ∗
k)
Consequently, (2.190) implies that
ln Eζ∼SG(C(u),Θ) e
hT ζ+
1 T ζ Hζ 2
+
1/2
1/2
δλ (2+δλ ) 1/2
2(1−kΘ ∗
+ 21 φZ
1/2
≤ − 12 ln Det(I − Θ∗ GΘ∗ ) + 21 Tr([Θλ − Θ∗ ]G) 1/2
GΘ ∗
BT
k)
1/2
kΘ∗ GΘ∗ k2F
H hT
h
−1 + [H, h]T [Θ−1 [H, h] B . ∗ − G]
The resulting inequality holds true for all small positive λ; taking lim inf of the righthand side as λ → +0, and recalling that Θ0 = Θ, we get 1 T T 1/2 1/2 ln Eζ∼SG(C(u),Θ) eh ζ+ 2 ζ Hζ ≤ − 21 ln Det(I − Θ∗ GΘ∗ ) + 12 Tr([Θ − Θ∗ ]G) +
1/2
δ(2+δ) 1/2
2(1−kΘ ∗
+ 12 φZ
1/2
GΘ ∗
BT
k)
H hT
1/2
kΘ∗ GΘ∗ k2F h
−1 + [H, h]T [Θ−1 − G] [H, h] B ∗
(note that under the premise of Proposition 2.46.i we clearly have lim inf λ→+0 δλ ≤ δ). The righthand side of the resulting inequality is nothing but ΨδA,Z (h, H, G; Θ)— see (2.141)—and we arrive at the inequality required in the conclusion of (2.188). 3o . To complete the proof of Proposition 2.46.i, it remains to show that functions ΦA,Z , ΦδA,Z , as announced in the proposition, possess continuity, convexityconcavity, and coerciveness properties. Let us verify that this indeed is so for ΦδA,Z ; the reasoning which follows, with obvious simplifications, is applicable to ΦA,Z as well.
184
CHAPTER 2
Observe, first, that for exactly the same reasons as in item 4o of the proof of Proposition 2.43, the function ΨδA,Z (h, H, G; Θ) is realvalued, continuous and convexconcave on the domain + −1 + −1 b × V = {(h, H, G) : −γ + Θ−1 H ∗ H γ Θ∗ , 0 G γ Θ∗ , H G} × V.
The function ΦδA,Z (h, H; Θ) : H × V → R is obtained from ΨδA,Z (h, H, G; Θ) by the following two operations: we first minimize ΨδA,Z (h, H, G; Θ) over G linked to (h, H) by the convex constraints 0 G γ + Θ−1 and G H, thus obtaining a ∗ function + −1 ¯ Φ(h, H; Θ) : {(h, H) : −γ + Θ−1 ∗ H γ Θ∗ } ×V → R ∪ {+∞} ∪ {−∞}. {z }  ¯ H
¯ ¯ ¯ Second, we restrict the function Φ(h, H; Θ) from H×V onto H×V. For (h, H) ∈ H, the set of G’s linked to (h, H) by the above convex constraints clearly is a nonempty ¯ is a realvalued convexconcave function on H ¯ ×V. From compact set; as a result, Φ δ δ continuity of ΨA,Z on its domain it immediately follows that ΨA,Z is bounded and uniformly continuous on every bounded subset of this domain. This implies that ¯ ¯ × V, where B ¯ is a bounded Φ(h, H; Θ) is bounded in every domain of the form B ¯ ¯ subset of H, and is continuous on B × V in Θ ∈ V with properly selected modulus ¯ Furthermore, by construction, H ⊂ int H, ¯ of continuity independent of (h, H) ∈ B. implying that if B is a convex compact subset of H, it belongs to the interior of ¯ of H. ¯ Since Φ ¯ is bounded on B ¯×V a properly selected convex compact subset B ¯ and is convex in (h, H), the function Φ is a Lipschitz continuous in (h, H) ∈ B with Lipschitz constant which can be selected to be independent of Θ ∈ V. Taking into account that H is convex and closed, the bottom line is that ΦδA,Z is not just realvalued convexconcave function on the domain H × V, but is also continuous on this domain. Coerciveness of ΦδA,Z (h, H; Θ) in (h, H) is proved in exactly the same way as the similar property of function (2.130); see item 5o in the proof of Proposition 2.43. The proof of item (i) of Proposition 2.46 is complete. 4o . Item (ii) of Proposition 2.46 can be derived from item (i) of the proposition following the steps of the proof of (ii) of Proposition 2.43. ✷
Chapter Three From Hypothesis Testing to Estimating Functionals In this chapter we extend the techniques developed in Chapter 2 beyond the hypothesis testing problem and apply them to estimating properly structured scalar functionals of the unknown signal, specifically: • In simple observation schemes—linear (and more generally, N convex; see Section 3.2) functionals on unions of convex sets (Sections 3.1 and 3.2); • Beyond simple observation schemes—linear and quadratic functionals on convex sets (Sections 3.3 and 3.4).
3.1
ESTIMATING LINEAR FORMS ON UNIONS OF CONVEX SETS
The key to the subsequent developments in this section and in Sections 3.3 and 3.4 is the following simple observation. Let P = {Px : x ∈ X } be a parametric family of distributions on Rd , X being a convex subset of some Rm . Suppose that given a linear form g T x on Rm and an observation ω ∼ Px stemming from unknown signal x ∈ X , we want to recover g T x, and intend to use for this purpose an affine function hT ω + κ of the observation. How do we ensure that the recovery, with a given probability 1 − ǫ, deviates from g T x by at most a given margin ρ, for all x ∈ X? Let us focus on one “half” of the answer: how to ensure that the probability of the event hT ω + κ > g T x + ρ does not exceed ǫ/2, for every x ∈ X . The answer becomes easy when assuming that we have at our disposal an upper bound on the exponential moments of the distributions from the family—a function Φ(h; x) such that Z T ln eh ω Px (dω) ≤ Φ(h; x) ∀(h ∈ Rn , x ∈ X ). Indeed, for obvious reasons, in this case the Px probability of the event hT ω + κ − g T x > ρ is at most exp{Φ(h; x) − [g T x + ρ − κ]}. To add some flexibility, note that when α > 0, the event in question is the same as the event (h/α)T ω + κ/α > [g T x + ρ]/α; thus we arrive at a parametric family of upper bounds exp{Φ(h/α; x) − [g T x + ρ − κ]/α}, α > 0, on the Px probability of our “bad” event. It follows that a sufficient condition for this probability to be ≤ ǫ/2, for a given x ∈ X , is the existence of α > 0 such that exp{Φ(h/α; x) − [g T x + ρ − κ]/α} ≤ ǫ/2,
186
CHAPTER 3
or Φ(h/α; x) − [g T x + ρ − κ]/α ≤ ln(ǫ/2), or, which again is the same, the existence of α > 0 such that αΦ(h/α; x) + α ln(2/ǫ) − g T x ≤ ρ − κ. In other words, a sufficient condition for the relation Probω∼Px {hT ω + κ > g T x + ρ} ≤ ǫ/2 is inf [αΦ(h/α; x) + α ln(2/ǫ) − g T x] ≤ ρ − κ.
α>0
If we want the bad event in question to take place with Px probability ≤ ǫ/2 whatever be x ∈ X , the sufficient condition for this is sup inf [αΦ(h/α; x) + α ln(2/ǫ) − g T x] ≤ ρ − κ.
x∈X α>0
(3.1)
Now assume that X is convex and compact, and Φ(h; x) is continuous, convex in h, and concave in x. In this case the function αΦ(h/α; x) is convex in (h, α) in the domain α > 0 1 and is concave in x, so that we can switch sup and inf, thus arriving at the sufficient condition (3.2) ∃α > 0 : max αΦ(h/α; x) + α ln(2/ǫ) − g T x ≤ ρ − κ, x∈X
for the validity of the relation
∀x ∈ X : Probω∼Px hT ω + κ − g T x ≤ ρ ≥ 1 − ǫ/2.
Note that our sufficient condition is expressed in terms of a convex constraint on h, κ, ρ, α. Consider also the dramatic simplification allowed by the convexityconcavity of Φ: in (3.1), every x ∈ X should be “served” by its own α, so that (3.1) is an infinite system of constraints on h, ρ, κ. In contrast, in (3.2) all x ∈ X are “served” by a single α. The developments in this section and Sections 3.3 and 3.4 are no more than implementations, under various circumstances, of the simple idea we have just outlined. 3.1.1
The problem
Let O = (Ω, Π, {pµ (·) : µ ∈ M}, F) be a simple observation scheme (see Section 2.4.2). The problem we consider in this section is as follows: We are given a positive integer K and I nonempty convex compact sets Xj ⊂ Rn , along with affine mappings Aj (·) : Rn → RM such that Aj (x) ∈ M whenever x ∈ Xj , 1 ≤ j ≤ I. In addition, we are given a linear function 1 This is due to the following standard fact: if f (h) is a convex function, then the projective transformation αf (h/α) of f is convex in (h, α) in the domain α > 0.
187
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
g T x on Rn . Given random observation ω K = (ω1 , ..., ωK ) with ωk drawn, independently across k, from pAj (x) with j ≤ I and x ∈ Xj , we want to recover g T x. It should be stressed that we do not know j and x underlying our observation. Given reliability tolerance ǫ ∈ (0, 1), we quantify the performance of a candidate estimate—a Borel function gb(·) : Ω → R—by the worstcase, over j and x, width of a (1 − ǫ)confidence interval. Precisely, we say that gb(·) is (ρ, ǫ)reliable if g (ω) − g T x > ρ} ≤ ǫ. ∀(j ≤ I, x ∈ Xj ) : Probω∼pAj (x) {b
(3.3)
We define the ǫrisk of the estimate as Riskǫ [b g ] = inf {ρ : gb is (ρ, ǫ)reliable} ,
i.e., Riskǫ [b g ] is the smallest ρ such that gb is (ρ, ǫ)reliable. The technique we are about to develop originates from [131] where estimating a linear form on a convex compact set in a simple o.s. (i.e., the case I = 1 of the problem at hand) was considered, and where it was proved that in this situation the estimate X gb(ω K ) = φ(ωk ) + κ k
with properly selected φ ∈ F and κ ∈ R is nearoptimal. The problem of estimating linear functionals of a signal in Gaussian o.s. has a long history; see, e.g., [38, 40, 124, 125, 125, 127, 126, 170, 179] and references therein. In particular, in the case of I = 1, using different techniques, a similar fact was proved by D. Donoho [64] in 1991; related results in the case of I > 1 are available in [41, 42]. 3.1.2
The estimate
In the sequel, we associate with the simple o.s. O = (Ω, Π, {pµ (·) : µ ∈ M}, F) in question the function Z ΦO (φ; µ) = ln eφ(ω) pµ (ω)Π(dω) , (φ, µ) ∈ F × M. Recall that by definition of a simple o.s., this function is realvalued on F × M, concave in µ ∈ M, convex in φ ∈ F, and continuous on F × M (the latter follows from convexityconcavity and relative openness of M and F). Let us associate with a pair (i, j), 1 ≤ i, j ≤ I, the functions Φij (α, φ; x, y)
=
Ψij (α, φ)
= =
1 2
KαΦO (φ/α; Ai (x)) + KαΦ O (−φ/α; Aj (y)) +g T (y − x) + 2α ln(2I/ǫ) : {α > 0, φ ∈ F } × [Xi × Xj ] → R, max Φij (α, φ; x, y)
x∈Xi ,y∈Xj 1 [Ψi,+ (α, φ) 2
+ Ψj,− (α, φ)] : {α > 0} × F → R
188
CHAPTER 3
where Ψℓ,+ (β, ψ)
=
Ψℓ,− (β, ψ)
=
max KβΦO (ψ/β; Aℓ (x)) − g T x + β ln(2I/ǫ) :
x∈Xℓ
{β > 0, ψ ∈ F} → R, max KβΦO (−ψ/β; Aℓ (x)) + g T x + β ln(2I/ǫ) :
x∈Xℓ
{β > 0, ψ ∈ F} → R.
Note that the function αΦO (φ/α; Ai (x)) is obtained from the continuous convexconcave function ΦO (·, ·) by projective transformation in the convex argument, and affine substitution in the concave argument, so that the former function is convexconcave and continuous on the domain {α > 0, φ ∈ X } × Xi . By similar argument, the function αΦO (−φ/α; Aj (y)) is convexconcave and continuous on the domain {α > 0, φ ∈ F} × Xj . These observations combine with compactness of Xi and Xj to imply that Ψij (α, φ) is a realvalued continuous convex function on the domain F + = {α > 0} × F. Observe that functions Ψii (α, φ) are nonnegative on F + . Indeed, selecting some x ¯ ∈ Xi , and setting µ = Ai (¯ x), we have µ) + ΦO (−φ/α; µ)]K + ln(2I/ǫ) Ψii (α, φ) ≥ Φii (α, φ; x ¯, x ¯) = α 21 [ΦO (φ/α; ≥ α ΦO (0; µ) K + ln(2I/ǫ) = α ln(2I/ǫ) ≥ 0  {z } =0
(we have used convexity of ΦO in the first argument). Functions Ψij give rise to convex and feasible optimization problems Optij = Optij (K) =
min
(α,φ)∈F +
Ψij (α, φ).
(3.4)
By its origin, Optij is either a real, or −∞; by the observation above, Optii are nonnegative. Our estimate is as follows. 1. For 1 ≤ i, j ≤ I, we select some feasible solutions αij , φij to problems (3.4) (the less the values of the corresponding objectives, the better) and set ρij κij gij (ω K ) ρ
= = = =
Ψij (αij , φij ) = 21 [Ψi,+ (αij , φij ) + Ψj,− (αij , φij )] 1 [Ψj,− (αij , φij ) − Ψi,+ (αij , φij )] 2 P K k=1 φij (ωk ) + κij max1≤i,j≤I ρij .
2. Given observation ω K , we specify the estimate gb(ω K ) as follows: ri cj gb(ω K )
= = =
maxj≤I gij (ω K ) mini≤I gij (ω K ) 1 [mini≤I ri + maxj≤I cj ] . 2
(3.5)
(3.6)
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
3.1.3
189
Main result
Proposition 3.1. The ǫrisk of the estimate gb(ω K ) can be upperbounded as follows: Riskǫ [b g ] ≤ ρ. (3.7) Proof. Let the common distribution p of components ωk independent across k in observation ω K be pAℓ (u) for some ℓ ≤ I and u ∈ Xℓ . Let us fix these ℓ and u; we denote µ = Aℓ (u) and let pK stand for the distribution of ω K . 1o . We have
Ψℓ,+ (αℓj , φℓj ) = maxx∈Xℓ Kαℓj ΦO (φℓj /αℓj , Aℓ (x)) − g T x + αℓj ln(2I/ǫ) T ≥ Kαℓj ΦO (φ [since u ∈ Xℓ and µ = Aℓ (u)] R ℓj /αℓj , µ) − g u + αℓj ln(2I/ǫ) = Kαℓj ln exp{φℓj (ω)/αℓj }pµ (ω)Π(dω) − g T u + αℓj ln(2I/ǫ) [by definition of ΦO ] n o −1 P T = αℓj ln EωK ∼pK exp{αℓj φ (ω )} − g u + α ln(2I/ǫ) ℓj k ℓj k o n −1 = αℓj ln EωK ∼pK exp{αℓj [gℓj (ω K ) − κℓj ]} − g T u + αℓj ln(2I/ǫ) n o −1 = αℓj ln EωK ∼pK exp{αℓj [gℓj (ω K ) − g T u − ρℓj ]} + ρℓj − κℓj + αℓj ln(2I/ǫ) K T ≥ αℓj ln ProbωK ∼pK gℓj (ω ) > g u + ρℓj + ρℓj − κℓj + αℓj ln(2I/ǫ) ⇒ ǫ ) αℓj ln ProbωK ∼pK gℓj (ω K ) > g T u + ρℓj ≤ Ψℓ,+ (αℓj , φℓj ) + κℓj − ρℓj + αℓj ln( 2I ǫ = αℓj ln( 2I ) [by (3.5)],
and we arrive at
Similarly,
ǫ ProbωK ∼pK gℓj (ω K ) > g T u + ρℓj ≤ . 2I
(3.8)
Ψℓ,− (αiℓ , φiℓ ) = maxy∈Xℓ Kαiℓ ΦO (−φiℓ /αiℓ , Aℓ (y)) + g T y + αiℓ ln(2I/ǫ) T ≥ Kαiℓ ΦO (−φ [since u ∈ Xℓ and µ = Aℓ (u)] R iℓ /αiℓ , µ) + g u + αiℓ ln(2I/ǫ) = Kαiℓ ln exp{−φiℓ (ω)/αiℓ }pµ (ω)Π(dω) + g T u + αiℓ ln(2I/ǫ) [by definition of ΦO ] −1 P T = αiℓ ln EωK ∼pK exp{−αiℓ φ (ω )} + g u + α ln(2I/ǫ) iℓ k iℓ k −1 = αiℓ ln EωK ∼pK exp{αiℓ [−giℓ (ω K ) + κiℓ ]} + g T u + αiℓ ln(2I/ǫ) −1 K T = αiℓ ln EωK ∼pK exp{α u − ρiℓ ]} + ρiℓ + κiℓ + αiℓ ln(2I/ǫ) iℓ [−giℓ (ω ) + g ≥ αiℓ ln ProbωK ∼pK giℓ (ω K ) < g T u − ρiℓ + ρiℓ + κiℓ + αiℓ ln(2I/ǫ) ⇒ ǫ ) αiℓ ln ProbωK ∼pK giℓ (ω K ) < g T u − ρiℓ ≤ Ψℓ,− (αiℓ , φiℓ ) − κiℓ − ρiℓ + αiℓ ln( 2I ǫ = αiℓ ln( 2I ) [by (3.5)],
and we arrive at ǫ ProbωK ∼pK giℓ (ω K ) < g T u − ρiℓ ≤ . 2I
(3.9)
2o . Let E = {ω K : gℓj (ω K ) ≤ g T u + ρℓj , giℓ (ω K ) ≥ g T u − ρiℓ , 1 ≤ i, j ≤ I}. From (3.8) and (3.9) and the union bound it follows that pK probability of the event E is ≥ 1 − ǫ. As a result, all we need to complete the proof of the proposition
190
CHAPTER 3
is to verify that ω K ∈ E ⇒ b g (ω K ) − g T u ≤ ρℓ := max[max ρiℓ , max ρℓj ], i
j
(3.10)
since clearly ρℓ ≤ ρ := maxi,j ρij . To this end, let us fix ω K ∈ E, and let E be the I × I matrix with entries Eij = gij (ω K ), 1 ≤ i, j ≤ I. The quantity ri —see (3.6)—is the maximum of the entries in the ith row of E, while the quantity cj is the minimum of the entries in the jth column of E. In particular, ri ≥ Eij ≥ cj for all i, j, implying that ri ≥ cj for all i, j. Now, let ∆ = [g T u − ρℓ , g T u + ρℓ ]. Since ω K ∈ E, we have Eℓℓ = gℓℓ (ω K ) ≥ g T u − ρℓℓ ≥ g T u − ρℓ and Eℓj = gℓj (ω K ) ≤ g T u + ρℓj ≤ g T u + ρℓ for all j, implying that rℓ = maxj Eℓj ∈ ∆. Similarly, ω K ∈ E implies that Eℓℓ = gℓℓ (ω K ) ≤ g T u + ρℓ and Eiℓ = giℓ (ω K ) ≥ g T u − ρiℓ ≥ g T u − ρℓ for all i, implying that cℓ = mini Eiℓ ∈ ∆. We see that both rℓ and cℓ belong to ∆; since r∗ := mini ri ≤ rℓ and, as have already seen, ri ≥ cℓ for all i, we conclude that r∗ ∈ ∆. By a similar argument, c∗ := maxj cj ∈ ∆ as well. By construction, gb(ω K ) = 21 [r∗ + c∗ ], that is, gb(ω K ) ∈ ∆, and the conclusion in (3.10) indeed takes place. ✷
Remark 3.2. Let us consider a special case of I = 1. In this case, given a Krepeated observation of the signal in a simple o.s., our construction yields an estimate of a linear form g T x of unknown signal x, known to belong to a given convex compact set X1 . This estimate is K X gb(ω K ) = φ(ωk ) + κ, (3.11) k=1
and is associated with the optimization problem
{Ψ(α, φ) := 21 [Ψ+ (α, φ) + Ψ− (α, φ)]} , Ψ+ (α, φ) = max KαΦO (φ/α, A1 (x)) − g T x + α ln(2/ǫ) , x∈X1 Ψ− (α, φ) = max KαΦO (−φ/α, A1 (x)) + g T x + α ln(2/ǫ) . min
α>0,φ∈F
x∈X1
By Proposition 3.1, when α, φ is a feasible solution to the problem and κ = 12 [Ψ− (α, φ) − Ψ+ (α, φ)], the ǫrisk of estimate (3.11) does not exceed Ψ(α, φ). 3.1.4
Nearoptimality
Observe that by properly selecting φij and αij we can make, in a computationally efficient manner, the upper bound ρ on the ǫrisk of the above estimate arbitrarily close to Opt(K) = max Optij (K). 1≤i,j≤I
We are about to demonstrate that the quantity Opt(K) “nearly lowerbounds” the minimax optimal ǫrisk Risk∗ǫ (K) = inf Riskǫ [b g ], g b(·)
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
191
the infimum being taken over all estimates (all Borel functions of ω K ). The precise statement is as follows: Proposition 3.3. In the situation of this section, let ǫ ∈ (0, 1/2) and K be a positive integer. Then for every integer K satisfying K/K > one has
2 ln(2I/ǫ) 1 ln 4ǫ(1−ǫ)
Opt(K) ≤ Risk∗ǫ (K).
(3.12)
(3.13)
In addition, in the special case where for every i, j there exists xij ∈ Xi ∩ Xj such that Ai (xij ) = Aj (xij ) one has K ≥ K ⇒ Opt(K) ≤ For proof, see Section 3.6.1. 3.1.5
2 ln(2I/ǫ) Risk∗ǫ (K). 1 ln 4ǫ(1−ǫ)
(3.14)
Illustration
We illustrate our construction with the simplest possible example in which Xi = {xi } are singletons in Rn , i = 1, ..., I, and the observation scheme is Gaussian. Thus, setting yi = Ai (xi ) ∈ Rm , the observation’s components ωk , 1 ≤ k ≤ K, stemming from the signal xi , are drawn, independently of each other, from the normal distribution N (yi , Im ). The family F of functions φ associated with Gaussian o.s. is the family of all affine functions φ(ω) = φ0 +ϕT ω on the observation space (which at present is Rm ); we identify φ ∈ F with the pair (φ0 , ϕ). The function ΨO associated with the Gaussian observation scheme with mdimensional observations is ΦO (φ; µ) = φ0 + ϕT µ + 21 ϕT ϕ : (R × Rm ) × Rm → R; a straightforward computation shows that in the case in question, setting θ = ln(2I/ǫ),
192
CHAPTER 3
we have Ψi,+ (α, φ)
= =
Ψj,− (α, φ)
=
Optij
= = = =
Kα φ0 + ϕT yi /α + 21 ϕT ϕ/α2 + αθ − g T xi K T ϕ ϕ + αθ, Kαφ0 + KϕT yi − g T xi + 2α K T −Kαφ0 − KϕT yj + g T xj + ϕ ϕ + αθ, 2α inf 1 [Ψi,+ (α, φ) + Ψj,− (α, φ)] α>0,φ 2 K T K T 1 T ϕ [yi − yj ] + inf ϕ ϕ + αθ g [xj − xi ] + inf 2 ϕ α>0 2α 2 √ K T 1 T ϕ [yi − yj ] + 2Kθkϕk2 g [xj − xi ] + inf 2 ϕ 2 p 1 T g [xj − xi ], kyi − yj k2 ≤ 2p2θ/K 2 −∞, kyi − yj k2 > 2 2θ/K.
We see that we can put φ0 = 0, and that setting
p I = {(i, j) : kyi − yj k2 ≤ 2 2θ/K},
Optij (K) is finite if and only if (i, j) ∈ I and is −∞ otherwise. In both cases, the optimization problem specifying Optij has no optimal solution.2 Indeed, this clearly is the case when (i, j) 6∈ I; when (i, j) ∈ I, a minimizing sequence is, e.g., φ0 ≡ 0, ϕ ≡ 0, αi → 0, but its limit is not in the minimization domain (on this domain, α should be positive). In this particular case, the simplest way to overcome the difficulty is to restrict the optimization domain F + in (3.4) with its compact subset {α ≥ 1/R, φ0 = 0, kϕk2 ≤ R} with large R, like R = 1010 or 1020 . Then we specify the entities participating in (3.5) as 0, (i, j) ∈ I T φij (ω) = ϕij ω, ϕij = −R[y − y ]/ky − y k , (i, j) 6∈ I i j i j 2 ( 1/R, (i, j) ∈ I q αij = K 2θ R, (i, j) 6∈ I resulting in κij
= = =
1 [Ψ (αij , φij ) − Ψi,+ (αij , φij )] 2 h j,− 1 −KϕTij yj + g T xj + 2αKij ϕTij ϕij 2 1 T g [xi + xj ] − K ϕT [y + yj ] 2 2 ij i
+ αij θ − KϕTij yi + g T xi −
K 2αij
ϕTij ϕij − αij θ
i
2 Handling this case was exactly the reason why in our construction we required φ , α ij ij to be feasible, and not necessary optimal, solutions to the optimization problems (3.4).
193
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
and ρij
= = = =
1 2
[Ψi,+ (αij , φij ) + Ψj,− (αij , φij )] K T K T 1 KϕTij yi − g T xi + ϕij ϕij + αij θ − KϕTij yj + g T xj + ϕij ϕij + αij θ 2 2αij 2αij K T K T 1 T ϕij φij + αij θ + 2 g [xj − xi ] + ϕij [yi − yj ] 2αij 2 1 T −1 g [x − x ] + R θ, (i, j) ∈ I, j i 2 √ (3.15) 1 T K 2Kθ − g [x − x ] + [ ky − y k ]R, (i, j) 6∈ I. j i i j 2 2 2
In the numerical experiment we report on we use n = 20, m = 10, and I = 100, with xi , i ≤ I, drawn independently of each other from N (0, In ), and yi = Axi with randomly generated matrix A (specifically, matrix with independent N (0, 1) entries normalized to have unit spectral norm). The linear form to be recovered is the first coordinate of x, the confidence parameter is set to ǫ = 0.01, and R = 1020 . Results of a typical experiment are presented in Figure 3.1.
2.5 2
1.5
1
0.5 0 20
30
40
50
100
200
300
Figure 3.1: Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01risk bounds max1≤i,j≤100 ρij (as in (3.15)) for different observation sample sizes K.
3.2
ESTIMATING N CONVEX FUNCTIONS ON UNIONS OF CONVEX SETS
In this section, we apply our testing machinery to the estimation problem as follows. Given are: • a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F), • a signal space X ⊂ Rn along with the affine mapping x 7→ A(x) : X → M, • a realvalued function f on X.
Given observation ω ∼ pA(x∗ ) stemming from unknown signal x∗ known to belong to X, we want to recover f (x∗ ).
194
CHAPTER 3
OHIW
ULJKW
OHIW
ULJKW
7HVW OHIW YV ULJKW
7HVW OHIW YV ULJKW
1HZ ORFDOL]HU OHIW DFFHSWHG
ULJKW
OHIW
, 7HVW OHIW YV ULJKW
,, 7HVW OHIW YV ULJKW
1HZ ORFDOL]HU OHIW DFFHSWHG
1HZ ORFDOL]HU ULJKW DFFHSWHG
1HZ ORFDOL]HU
1HZ ORFDOL]HU
ULJKW DFFHSWHG
, ,, DFFHSW OHIW
1HZ ORFDOL]HU , ,, DFFHSW ULJKW
1HZ ORFDOL]HU , ,, LQ GLVDJUHHPHQW
D
E
F
Figure 3.2: Bisection via Hypothesis Testing.
Our approach imposes severe restrictions on f (satisfied, e.g., when f is linear, or linearfractional, or is the maximum of several linear functions); as a compensation, we allow for rather “complex” X—finite unions of convex sets. 3.2.1
Outline
Though the estimator we develop is, in a nutshell, quite simple, its formal description turns out to be rather involved.3 For this reason we start its presentation with an informal outline, which exposes some simple ideas underlying its construction. Consider the situation where the signal space X is the 2D rectangle as presented on the top of Figure 3.2.(a), and let the function to be recovered be f (x) = x1 . Thus, “nature” has somehow selected x = [x1 , x2 ] in the rectangle, and we observe a Gaussian random vector with the mean A(x) and known covariance matrix, where A(·) is a given affine mapping. Note that hypotheses f (x) ≥ b and f (x) ≤ a translate into convex hypotheses on the expectation of the observed Gaussian r.v., so that we can use our hypothesis testing machinery to decide on hypotheses of this type and to localize f (x) in a (hopefully, small) segment by a Bisectiontype process. Before describing the process, let us make a terminological agreement. In the sequel we shall use pairwise hypothesis testing in the situation where it may happen that neither of the hypotheses we are deciding upon is true. In this case, we will say that the outcome of a test is correct if the rejected hypothesis indeed is wrong (the accepted hypothesis can be wrong as well, but the latter can happen 3 It should be mentioned that the proposed estimation procedure is a “close relative” of the binary search algorithm of [77].
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
195
only in the case when both our hypotheses are wrong). This is what the Bisection might look like.

1. Were we able to decide reliably on the left and the right hypotheses in Figure 3.2.(a), that is, to understand via observations whether x belongs to the left or to the right half of the original rectangle, our course of action would be clear: depending on this decision, we would replace our original rectangle with a smaller rectangle localizing x, as shown in Figure 3.2.(a), and then iterate this process. The difficulty, of course, is that our left and right hypotheses intersect, so that it is impossible to decide on them reliably.

2. In order to make the left and right hypotheses distinguishable from each other, we could act as shown in Figure 3.2.(b), by shrinking the left and the right rectangles and inserting a rectangle in the middle ("no man's land"). Assuming that the width of the middle rectangle allows us to decide reliably on our new left and right hypotheses, and utilizing the available observation, we can localize x either in the left, or in the right rectangle, as shown in Figure 3.2.(b). Specifically, assume that our "left vs. right" test correctly rejected the right hypothesis. Then x can be located either in the left, or in the middle rectangle shown on the top, and thus x is in the new left localizer, which is the union of the left and the middle original rectangles. Similarly, if our test correctly rejects the left hypothesis, then we can take, as the new localizer of x, the union of the original right and middle rectangles. Note that our localization is as reliable as our test is, and that it reduces the width of the localizer by a factor close to 2, provided the width of the middle rectangle is small compared to the width of the original localizer of x.
We can iterate this process until we arrive at a localizer so narrow that the corresponding separator, the "no man's land" (this part cannot be too narrow, since it should allow for a reliable decision on the current left and right hypotheses), becomes too large to allow reducing significantly the localizer's width. Note that in this implementation of the binary search (same as in the implementation proposed in [77]), starting from the second step of the Bisection, the hypotheses to decide upon depend on the observations (e.g., when x belongs to the middle part of the three-rectangle localizer in Figure 3.2, deciding on "left vs. right" can, depending on the observation, result in accepting either the left or the right hypothesis, leading to different updated localizers). Analysing this situation usually brings about complications we would like to avoid.

3. A simple modification of the Bisection allows us to circumvent the difficulties related to testing random hypotheses. Indeed, let us consider the following construction: given the current localizer for x (at the first step, the initial rectangle), we consider two "three-rectangle" partitions of it, as presented in Figure 3.2.(c). In the first partition, the left rectangle is the left half of the original rectangle; in the second partition, the right rectangle is the right half of the original rectangle. We then run two "left vs. right" tests, the first on the pair of left and right hypotheses stemming from the first partition, and the second on the pair of left and right hypotheses stemming from the second partition. Assuming that in both tests the rejected hypotheses indeed were wrong, the results of these tests allow us to make the following conclusions:

• when both tests reject the right hypotheses from the corresponding pairs, x is located in the left half of the initial rectangle (since otherwise in the second test
196
CHAPTER 3
the rejected hypothesis would in fact be true, contradicting the assumption that both tests make no wrong rejections);

• when both tests reject the left hypotheses from the corresponding pairs, x is located in the right half of the original rectangle (for exactly the same reasons as in the previous case);

• when the tests "disagree," rejecting hypotheses of different types (like left in the first, and right in the second test), x is located in the union of the two middle rectangles we deal with. Indeed, otherwise x would be either in the left rectangles of both our three-rectangle partitions, or in the right rectangles of both of them. Since we have assumed that in both tests no wrong rejections took place, in the first case both tests must reject the right hypotheses, and in the second both should reject the left hypotheses, while neither of these events took place.

Now, in the first two cases we can safely say to which of the "halves," left or right, of the initial rectangle x belongs, and take this half as the new localizer. In the third case, we take as a new localizer for x the middle rectangle shown at the bottom of Figure 3.2 and terminate our estimation process; the new localizer already is narrow! In the proposed algorithm, unless we terminate at the very first step, we carry out the second step exactly in the same way as the first one, with the localizer of x yielded by the first step in the role of the initial localizer, then carry out, in the same way, the third step, etc., until termination either due to running into a disagreement, or due to reaching a prescribed number of steps. Upon termination, we return the last localizer for x which we have built, and claim that f(x) = x1 belongs to the projection of this localizer onto the x1-axis. In all tests from the above process, we use the same observation.
Note that in the present situation, in contrast to that discussed earlier, reutilizing a single observation creates no difficulties, since with no wrong rejections in the pairwise tests we use, the pairs of hypotheses participating in the tests are not random at all: they are uniquely defined by f(x) = x1. Indeed, with no wrong rejections, prior to termination everything is as if we were running deterministic Bisection, that is, were updating subsequent rectangles ∆t containing x according to the rules

• ∆1 is a rectangle containing x given in advance,
• ∆t+1 is precisely the half of ∆t containing x (say, the left half in the case of a tie).

Thus, given x and assuming that there are no wrong rejections, the situation is as if a single observation were used in L tests running in "parallel" rather than sequentially. The only elaboration caused by the sequential nature of our process is "risk accumulation": we want the probability of error in one or more of our L tests to be less than the desired risk ǫ of wrong "bracketing" of f(x), implying, in the absence of something better, that the risks of the individual tests should be at most ǫ/L. These risks, in turn, define the allowed width of separators and thus the accuracy to which f(x) can be estimated. It should be noted that the number L of steps of Bisection is always a moderate integer (since otherwise the width of the "no man's land," which at the concluding Bisection steps is of order of 2^{-L}, would be too small to allow for deciding on the concluding pairs of our hypotheses with risk ǫ/L, at least when our observations possess nonnegligible volatility). As a result, "the cost" of Bisection turns out to be significantly lower than in the case where every test uses its own observation.
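The consensus rule sketched above can be made concrete in a few lines. The following Python fragment (our own illustrative sketch, not the book's implementation) runs the two "left vs. right" tests per step, halves the localizer on consensus, and terminates on disagreement; the test oracle here is a hypothetical stand-in that answers from the true value of f(x), as if no wrong rejections ever occur, whereas in the actual construction the answers come from detector-based tests applied to the observation.

```python
def bisect_by_consensus(f_x, a0, b0, L, margin):
    """Consensus bisection for localizing f(x) in [a0, b0].

    f_x    : true value of f(x) (used only by the idealized test oracle)
    L      : maximum number of bisection steps
    margin : half-width of the "no man's land" separating tested hypotheses
    """
    def test(a, b):
        # Idealized "left vs. right" test deciding between f(x) <= a and
        # f(x) >= b; with no wrong rejections any answer is allowed when
        # a < f_x < b, so we simply answer with the side f_x is closer to.
        return 'left' if f_x <= (a + b) / 2 else 'right'

    a, b = a0, b0
    for _ in range(L):
        c = (a + b) / 2
        # Test I: partition whose left rectangle is the left half [a, c]
        t1 = test(c, min(c + 2 * margin, b))
        # Test II: partition whose right rectangle is the right half [c, b]
        t2 = test(max(c - 2 * margin, a), c)
        if t1 == t2 == 'left':
            b = c          # both tests reject "right": keep the left half
        elif t1 == t2 == 'right':
            a = c          # both tests reject "left": keep the right half
        else:
            # disagreement: f(x) lies in the union of the two middle strips
            return (max(c - 2 * margin, a), min(c + 2 * margin, b))
    return (a, b)
```

With no wrong rejections, the returned segment always contains f(x), and on disagreement its width is at most four times the margin, mirroring the "already narrow" termination case in the text.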
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
197
From the above sketch of our construction it is clear that all that matters is our ability to decide on pairs of hypotheses {x ∈ X : f(x) ≤ a} and {x ∈ X : f(x) ≥ b}, with a and b given, via an observation drawn from pA(x). In our outline, these were convex hypotheses in the Gaussian o.s., and in this case we can use the detector-based pairwise tests yielded by Theorem 2.23. Applying the machinery developed in Section 2.5.1, we could also handle the case when the sets {x ∈ X : f(x) ≤ a} and {x ∈ X : f(x) ≥ b} are unions of a moderate number of convex sets (e.g., f is affine, and X is the union of a number of convex sets), the o.s. in question still being simple, and this is the situation we intend to consider.

3.2.2 Estimating N-convex functions: Problem setting
In the rest of this section, we consider the situation as follows. We are given
1. a simple o.s. O = (Ω, P, {pµ(·) : µ ∈ M}, F),
2. a convex compact set X ⊂ R^n along with a collection of I convex compact sets Xi ⊂ X,
3. an affine mapping x ↦ A(x) : X → M,
4. a continuous function f(x) : X → R which is N-convex, meaning that for every a ∈ R the sets X^{a,≥} = {x ∈ X : f(x) ≥ a} and X^{a,≤} = {x ∈ X : f(x) ≤ a} can be represented as unions of at most N closed convex sets Xν^{a,≥}, Xν^{a,≤}:

X^{a,≥} = ∪_{ν=1}^N Xν^{a,≥},   X^{a,≤} = ∪_{ν=1}^N Xν^{a,≤}.

For some unknown x known to belong to X = ∪_{i=1}^I Xi, we have at our disposal an observation ω^K = (ω1, ..., ωK) with i.i.d. ωt ∼ pA(x)(·), and our goal is to estimate from this observation the quantity f(x). Given tolerances ρ > 0, ǫ ∈ (0, 1), let us call a candidate estimate f̂(ω^K) (ρ, ǫ)-reliable (cf. (3.3)) if for every x ∈ X, with the pA(x)-probability at least 1 − ǫ, it holds |f̂(ω^K) − f(x)| ≤ ρ or, which is the same,

∀(x ∈ X) : Prob_{ω^K ∼ pA(x)×...×pA(x)} { |f̂(ω^K) − f(x)| > ρ } ≤ ǫ.

3.2.2.1 Examples of N-convex functions
Example 3.1. [Minima and maxima of linear-fractional functions] Every function which can be obtained from linear-fractional functions gν(x)/hν(x) (gν, hν are affine functions on X and the hν are positive on X) by taking maxima and minima is N-convex for appropriately selected N, due to the following immediate observations:
• a linear-fractional function g(x)/h(x) with denominator positive on X is 1-convex on X;
• if f(x) is N-convex, so is −f(x);
198
CHAPTER 3
• if fi(x) is Ni-convex, i = 1, 2, ..., I, then f(x) = max_i fi(x) is N-convex with

N = max[ ∏_i Ni , Σ_i Ni ],

due to

{x ∈ X : f(x) ≤ a} = ∩_{i=1}^I {x : fi(x) ≤ a},
{x ∈ X : f(x) ≥ a} = ∪_{i=1}^I {x : fi(x) ≥ a}.

Note that the first set is the intersection of I unions of convex sets with Ni components in the ith union, and thus is the union of ∏_i Ni convex sets. The second set is the union of I unions of convex sets with Ni elements in the ith union, and thus is the union of Σ_i Ni convex sets.
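The two level-set identities behind this counting are easy to check numerically; here is a small sanity check (our own illustration, with arbitrarily chosen linear-fractional functions on X = [0, 1], not data from the book):

```python
# Two linear-fractional functions with denominators positive on X = [0, 1].
def f1(x):
    return (x + 1) / (x + 2)

def f2(x):
    return (2 - x) / (x + 1)

def fmax(x):
    return max(f1(x), f2(x))

a = 0.9
for i in range(1001):
    x = i / 1000
    # sublevel set of the max = intersection of the sublevel sets
    assert (fmax(x) <= a) == (f1(x) <= a and f2(x) <= a)
    # superlevel set of the max = union of the superlevel sets
    assert (fmax(x) >= a) == (f1(x) >= a or f2(x) >= a)
```

Each level set of f1, f2 is an interval (a single convex set), so here fmax is N-convex with N = max[1·1, 1+1] = 2, matching the count in the text.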
Example 3.2. [Conditional quantile] Let S = {s1 < s2 < ... < sM} ⊂ R. For a nonvanishing probability distribution q on S and α ∈ [0, 1], let χα[q] be the regularized α-quantile of q defined as follows: we pass from q to a distribution q̄ on [s1, sM] by spreading uniformly the mass qν, 1 < ν ≤ M, over [sν−1, sν], and assigning mass q1 to the point s1; χα[q] is the usual α-quantile of the resulting distribution q̄:

χα[q] = min{ s ∈ [s1, sM] : q̄{[s1, s]} ≥ α }.

[Figure: regularized quantile as a function of α, M = 4; the piecewise linear graph passes through the points (q1, s1), (q1 + q2, s2), (q1 + q2 + q3, s3), (1, s4).]

Given, along with S, a finite set T, let X be a convex compact set in the space of nonvanishing probability distributions on S × T. For τ ∈ T, consider the conditional, given t = τ, distribution pτ(·) of s ∈ S induced by a distribution p(·, ·) ∈ X:

pτ(µ) = p(µ, τ) / Σ_{ν=1}^M p(ν, τ),  1 ≤ µ ≤ M,

where p(µ, τ) is the p-probability for (s, t) to take value (sµ, τ), and pτ(µ) is the pτ-probability for s to take value sµ, 1 ≤ µ ≤ M. The function χα[pτ] : X → R turns out to be 1-convex; for verification see Section 3.6.2.
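The regularized quantile admits a direct computation; the following Python helper (our own naming and interface, a sketch rather than anything from the book) implements the definition above: mass q[0] sits at s[0], and mass q[k], k ≥ 1, is spread uniformly over [s[k-1], s[k]].

```python
def regularized_quantile(s, q, alpha):
    """chi_alpha[q] for a distribution q on s[0] < s[1] < ... < s[M-1].

    Mass q[0] is placed at s[0]; mass q[k], k >= 1, is spread uniformly
    over [s[k-1], s[k]].  Returns min{t : qbar([s[0], t]) >= alpha}.
    """
    if alpha <= q[0]:
        return s[0]
    cum = q[0]
    for k in range(1, len(s)):
        if alpha <= cum + q[k]:
            # linear interpolation inside [s[k-1], s[k]]
            return s[k - 1] + (alpha - cum) / q[k] * (s[k] - s[k - 1])
        cum += q[k]
    return s[-1]
```

For instance, with s = [0, 1, 2, 3] and the uniform q = [0.25, 0.25, 0.25, 0.25], the quantile equals s1 = 0 for all α ≤ 0.25 and then grows linearly, exactly the piecewise linear shape shown in the figure for M = 4.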
3.2.3 Bisection estimate: Construction
While the construction to be presented admits numerous refinements, we focus here on its simplest version.

3.2.3.1 Preliminaries
Upper and lower feasibility/infeasibility; sets Zi^{a,≥} and Zi^{a,≤}. Let a be a real. We associate with a a collection of upper a-sets defined as follows: we look at the sets Xi ∩ Xν^{a,≥}, 1 ≤ i ≤ I, 1 ≤ ν ≤ N, and arrange the nonempty sets from this family into a sequence Zi^{a,≥}, 1 ≤ i ≤ I_{a,≥}. Here I_{a,≥} = 0 if all sets in the family are empty; in the latter case, we call a upper-infeasible, and call it upper-feasible otherwise. Similarly, we associate with a the collection of lower a-sets Zi^{a,≤}, 1 ≤ i ≤ I_{a,≤}, by arranging into a sequence all nonempty sets from the family Xi ∩ Xν^{a,≤}, and call a lower-feasible or lower-infeasible depending on whether I_{a,≤} is positive or zero. Note that upper and lower a-sets are nonempty convex compact sets, and

X^{a,≥} := {x ∈ X : f(x) ≥ a} = ∪_{1≤i≤I_{a,≥}} Zi^{a,≥},
X^{a,≤} := {x ∈ X : f(x) ≤ a} = ∪_{1≤i≤I_{a,≤}} Zi^{a,≤}.
Right tests. Given a segment ∆ = [a, b] of positive length with lower-feasible a, we associate with this segment a right test (a function T^K_{∆,r}(ω^K) taking values right and left) and a risk σ_{∆,r} ≥ 0, as follows:

1. if b is upper-infeasible, T^K_{∆,r}(·) ≡ left and σ_{∆,r} = 0;
2. if b is upper-feasible, the collections of "right sets" {A(Z_i^{b,≥})}_{i≤I_{b,≥}} and of "left sets" {A(Z_j^{a,≤})}_{j≤I_{a,≤}} are nonempty, and the test is given by the construction from Section 2.5.1 as applied to these sets and the stationary K-repeated version of O; specifically,

• for 1 ≤ i ≤ I_{b,≥}, 1 ≤ j ≤ I_{a,≤}, we build the detectors

φ^K_{ij∆}(ω^K) = Σ_{t=1}^K φ_{ij∆}(ωt),

with φ_{ij∆}(ω) given by

(r_{ij∆}, s_{ij∆}) ∈ Argmin_{r ∈ Z_i^{b,≥}, s ∈ Z_j^{a,≤}} ln( ∫_Ω √( p_{A(r)}(ω) p_{A(s)}(ω) ) Π(dω) ),
φ_{ij∆}(ω) = ½ ln( p_{A(r_{ij∆})}(ω) / p_{A(s_{ij∆})}(ω) ).

We set

ǫ_{ij∆} = ∫_Ω √( p_{A(r_{ij∆})}(ω) p_{A(s_{ij∆})}(ω) ) Π(dω)

and build the I_{b,≥} × I_{a,≤} matrix E_{∆,r} = [ǫ^K_{ij∆}]_{1≤i≤I_{b,≥}, 1≤j≤I_{a,≤}};
• we define σ_{∆,r} as the spectral norm of E_{∆,r}. We compute the Perron-Frobenius eigenvector [g^{∆,r}; h^{∆,r}] of the matrix

[ 0 , E_{∆,r} ; E_{∆,r}^T , 0 ],

so that (see Section 2.5.1.2)

g^{∆,r} > 0, h^{∆,r} > 0, σ_{∆,r} g^{∆,r} = E_{∆,r} h^{∆,r}, σ_{∆,r} h^{∆,r} = E_{∆,r}^T g^{∆,r}.

Finally, we define the matrix-valued function

D_{∆,r}(ω^K) = [ φ^K_{ij∆}(ω^K) − ln(h_j^{∆,r}) + ln(g_i^{∆,r}) ]_{1≤i≤I_{b,≥}, 1≤j≤I_{a,≤}}.

The test T^K_{∆,r}(ω^K) takes value right iff the matrix D_{∆,r}(ω^K) has a nonnegative row, and takes value left otherwise.
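To make the linear algebra in this recipe concrete, here is an illustrative Python sketch (our own naming, not the book's code): it computes the spectral norm σ of an entrywise-positive matrix E together with positive vectors g, h satisfying the Perron-Frobenius relations above, via power iteration on E Eᵀ, and then applies the "nonnegative row" decision rule to a matrix of detector values.

```python
import math

def pf_data(E, iters=200):
    """For an entrywise-positive matrix E (list of rows), return
    (sigma, g, h) with sigma*g = E h and sigma*h = E^T g; sigma is
    the spectral norm of E (top eigenvalue of [[0, E], [E^T, 0]])."""
    m, n = len(E), len(E[0])
    g = [1.0] * m
    for _ in range(iters):                     # power iteration on E E^T
        h = [sum(E[i][j] * g[i] for i in range(m)) for j in range(n)]
        g = [sum(E[i][j] * h[j] for j in range(n)) for i in range(m)]
        nrm = math.sqrt(sum(v * v for v in g))
        g = [v / nrm for v in g]
    h = [sum(E[i][j] * g[i] for i in range(m)) for j in range(n)]  # E^T g
    sigma = math.sqrt(sum(v * v for v in h))
    h = [v / sigma for v in h]
    return sigma, g, h

def right_test(phi, g, h):
    """Accept 'right' iff D = [phi_ij - ln h_j + ln g_i] has a
    nonnegative row; accept 'left' otherwise."""
    D = [[phi[i][j] - math.log(h[j]) + math.log(g[i])
          for j in range(len(h))] for i in range(len(g))]
    return 'right' if any(all(d >= 0 for d in row) for row in D) else 'left'
```

In the actual test, phi[i][j] would be the accumulated detector value φ^K_{ij∆}(ω^K); large positive detector values (observation looks "right") produce a nonnegative row, large negative ones do not.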
Given δ > 0, κ > 0, we call a segment ∆ = [a, b] δ-good (right) if a is lower-feasible, b > a, and σ_{∆,r} ≤ δ. We call a δ-good (right) segment ∆ = [a, b] κ-maximal if the segment [a, b − κ] is not δ-good (right).

Left tests. The "mirror" version of the above is as follows. Given a segment ∆ = [a, b] of positive length with upper-feasible b, we associate with this segment a left test (a function T^K_{∆,l}(ω^K) taking values right and left) and a risk σ_{∆,l} ≥ 0, as follows:

1. if a is lower-infeasible, T^K_{∆,l}(·) ≡ right and σ_{∆,l} = 0;
2. if a is lower-feasible, we set T^K_{∆,l} ≡ T^K_{∆,r}, σ_{∆,l} = σ_{∆,r}.
Given δ > 0, κ > 0, we call a segment ∆ = [a, b] δ-good (left) if b is upper-feasible, b > a, and σ_{∆,l} ≤ δ. We call a δ-good (left) segment ∆ = [a, b] κ-maximal if the segment [a + κ, b] is not δ-good (left).

Explanation: When a < b, a is lower-feasible, and b is upper-feasible, so that the sets

X^{a,≤} = {x ∈ X : f(x) ≤ a},  X^{b,≥} = {x ∈ X : f(x) ≥ b}

are nonempty, the right and the left tests T^K_{∆,l}, T^K_{∆,r} are identical to each other and coincide with the minimal risk test, built as explained in Section 2.5.1, deciding, via stationary K-repeated observations, on the "location" of the distribution pA(x) underlying the observations: whether this location is left (the left hypothesis stating that x ∈ X and f(x) ≤ a, whence A(x) ∈ ∪_{1≤i≤I_{a,≤}} A(Z_i^{a,≤})), or right (the right hypothesis stating that x ∈ X and f(x) ≥ b, whence A(x) ∈ ∪_{1≤i≤I_{b,≥}} A(Z_i^{b,≥})).

When a is lower-feasible and b is not upper-feasible, the right hypothesis is empty, and the left test associated with [a, b], naturally, always accepts the left hypothesis; similarly, when a is lower-infeasible and b is upper-feasible, the right test associated with [a, b] always accepts the right hypothesis. A segment [a, b] with a < b is δ-good (left) if the right hypothesis corresponding to the segment is nonempty, and the left test T^K_{∆,l} associated with [a, b] decides on the right and the left hypotheses with risk ≤ δ, and similarly for a δ-good (right) segment [a, b].
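A κ-maximal δ-good segment can be found by the simple backward scan the construction prescribes: test the candidates b, b − κ, b − 2κ, ... until goodness first fails. A hedged sketch (the δ-goodness oracle is abstracted into a callable; all names are ours):

```python
def kappa_maximal_right(c, b, kappa, is_good):
    """Scan v_k = b - k*kappa, k = 0, 1, ..., for the first k with
    [c, v_k] not delta-good (right); return v_{k-1}, which is then
    delta-good and kappa-maximal.  `is_good(a, b)` must return False
    for degenerate segments (b <= a), which guarantees termination.
    Returns None if even [c, b] is not good (caller then terminates)."""
    if not is_good(c, b):
        return None
    k = 1
    while is_good(c, b - k * kappa):
        k += 1
    return b - (k - 1) * kappa
```

As a toy oracle one can declare a segment "good" iff its length is at least some threshold (wider separation means smaller test risk); the returned endpoint is then the smallest multiple-of-κ shrinkage of b that stays good.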
3.2.4 Building Bisection estimate

3.2.4.1 Control parameters
The control parameters of the Bisection estimate are
1. a positive integer L, the maximum allowed number of bisection steps;
2. tolerances δ ∈ (0, 1) and κ > 0.

3.2.4.2 Bisection estimate: Construction
The estimate of f(x) (x is the signal underlying our observations: ωt ∼ pA(x)) is given by the following recurrence run on the observation ω̄^K = (ω̄_1, ..., ω̄_K) at our disposal:

1. Initialization. We find a valid upper bound b_0 on max_{u∈X} f(u) and a valid lower bound a_0 on min_{u∈X} f(u) and set ∆_0 = [a_0, b_0]. We assume w.l.o.g. that a_0 < b_0; otherwise the estimation is trivial. Note: f(x) ∈ ∆_0.

2. Bisection Step ℓ, 1 ≤ ℓ ≤ L. Given the localizer ∆_{ℓ−1} = [a_{ℓ−1}, b_{ℓ−1}] with a_{ℓ−1} < b_{ℓ−1}, we act as follows:

a) We set c_ℓ = ½[a_{ℓ−1} + b_{ℓ−1}]. If c_ℓ is not upper-feasible, we set ∆_ℓ = [a_{ℓ−1}, c_ℓ] and pass to 2e, and if c_ℓ is not lower-feasible, we set ∆_ℓ = [c_ℓ, b_{ℓ−1}] and pass to 2e.
Note: When the rule requires us to pass to 2e, the set ∆_{ℓ−1}\∆_ℓ does not intersect f(X); in particular, in such a case f(x) ∈ ∆_ℓ provided that f(x) ∈ ∆_{ℓ−1}.

b) When c_ℓ is both upper- and lower-feasible, we check whether the segment [c_ℓ, b_{ℓ−1}] is δ-good (right). If this is not the case, we terminate and claim that f(x) ∈ ∆̄ := ∆_{ℓ−1}; otherwise we find v_ℓ, c_ℓ < v_ℓ ≤ b_{ℓ−1}, such that the segment ∆_{ℓ,rg} = [c_ℓ, v_ℓ] is δ-good (right) κ-maximal.
Note: In terms of the outline of our strategy presented in Section 3.2.1, termination when the segment [c_ℓ, b_{ℓ−1}] is not δ-good (right) corresponds to the case when the current localizer is too small to allow for a "no man's land" wide enough to ensure a low-risk decision on the left and the right hypotheses. To find v_ℓ, we check the candidates v_ℓ^k = b_{ℓ−1} − kκ, k = 0, 1, ..., until arriving for the first time at a segment [c_ℓ, v_ℓ^k] which is not δ-good (right), and take as v_ℓ the quantity v_ℓ^{k−1} (because k ≥ 1, the resulting value of v_ℓ is well defined and clearly meets the above requirements).

c) Similarly, we check whether the segment [a_{ℓ−1}, c_ℓ] is δ-good (left). If this is not the case, we terminate and claim that f(x) ∈ ∆̄ := ∆_{ℓ−1}; otherwise we find u_ℓ, a_{ℓ−1} ≤ u_ℓ < c_ℓ, such that the segment ∆_{ℓ,lf} = [u_ℓ, c_ℓ] is δ-good (left) κ-maximal.
Note: The rules for building u_ℓ are completely similar to those for v_ℓ.

d) We compute T^K_{∆_{ℓ,rg},r}(ω̄^K) and T^K_{∆_{ℓ,lf},l}(ω̄^K). If T^K_{∆_{ℓ,rg},r}(ω̄^K) = T^K_{∆_{ℓ,lf},l}(ω̄^K) ("consensus"), we set

∆_ℓ = [a_ℓ, b_ℓ] = { [c_ℓ, b_{ℓ−1}], if T^K_{∆_{ℓ,rg},r}(ω̄^K) = right,
                     [a_{ℓ−1}, c_ℓ], if T^K_{∆_{ℓ,rg},r}(ω̄^K) = left       (3.16)
and pass to 2e. Otherwise (“disagreement”) we terminate and claim that
f(x) ∈ ∆̄ = [u_ℓ, v_ℓ].

e) We pass to step ℓ + 1 when ℓ < L; otherwise we terminate with the claim that f(x) ∈ ∆̄ := ∆_L.

3. Output of the estimation procedure is the segment ∆̄ built upon termination and claimed to contain f(x) (see rules 2b–2e); the midpoint of this segment is the estimate of f(x) yielded by our procedure.

3.2.5 Bisection estimate: Main result
Our main result on Bisection is as follows:

Proposition 3.4. Consider the situation described at the beginning of Section 3.2.2, and let ǫ ∈ (0, 1/2) be given. Then

(i) [reliability of Bisection] For every positive integer L and every κ > 0, Bisection with control parameters

L,  δ = ǫ/(2L),  κ   (3.17)

is (1 − ǫ)-reliable: for every x ∈ X, the pA(x)-probability of the event f(x) ∈ ∆̄ (∆̄ is the Bisection output as defined above) is at least 1 − ǫ.

(ii) [near-optimality] Let ρ > 0 and a positive integer K̄ be such that "in nature" there exists a (ρ, ǫ)-reliable estimate f̂(·) of f(x), x ∈ X := ∪_{i≤I} Xi, via the stationary K̄-repeated observation ω^{K̄} with ω_k ∼ pA(x), 1 ≤ k ≤ K̄. Given ρ̂ > 2ρ, the Bisection estimate utilizing stationary K-repeated observations, with

K ≥ [ 2 ln(2LNI/ǫ) / ln( 1/(4ǫ(1−ǫ)) ) ] K̄,   (3.18)

the control parameters of the estimate being

L = ⌈ log2( (b_0 − a_0)/(2ρ̂) ) ⌉,  δ = ǫ/(2L),  κ = ρ̂ − 2ρ,   (3.19)

is (ρ̂, ǫ)-reliable. Note that K is only "slightly larger" than K̄.

For proof, see Section 3.6.3. Note that the observation count K of the Bisection estimate as given by (3.18) is larger than K̄ only by a factor (at most) logarithmic in N, I, L, and 1/ǫ; note also that L is just logarithmic in 1/ρ̂. Assume, e.g., that for some γ > 0 "in nature" there exist (ǫ^γ, ǫ)-reliable estimates, parameterized by ǫ ∈ (0, 1/2), utilizing K̄ = K̄(ǫ) observations. Then Bisection with the volume of observation and control parameters given by (3.18) and (3.19), where ρ̂ = 3ρ = 3ǫ^γ and K̄ = K̄(ǫ), is (3ǫ^γ, ǫ)-reliable and requires K = K(ǫ) repeated observations with lim_{ǫ→+0} K(ǫ)/K̄(ǫ) ≤ 2.
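The inflation factor in (3.18) is easy to evaluate numerically; the snippet below (our own illustration, with arbitrarily chosen parameter values) confirms that the factor shrinks toward 2 as ǫ → +0 with L, N, I fixed:

```python
import math

def inflation_factor(L, N, I, eps):
    """Factor 2*ln(2*L*N*I/eps) / ln(1/(4*eps*(1-eps))) from (3.18):
    Bisection needs at most this multiple of the observations that an
    'ideal' (rho, eps)-reliable estimate uses."""
    return (2 * math.log(2 * L * N * I / eps)
            / math.log(1 / (4 * eps * (1 - eps))))

# The factor decreases toward 2 as eps -> +0 (here L = 8, N = 1, I = 2).
for eps in (1e-2, 1e-4, 1e-8, 1e-16):
    print(eps, inflation_factor(8, 1, 2, eps))
```

For moderate reliability levels the multiple is still small (about 5 at ǫ = 0.01 for these parameters), which is the sense in which K is "only slightly larger" than K̄.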
3.2.6 Illustration
To illustrate bisection-based estimation of an N-convex function, consider the following situation.4 There are M devices ("receivers") recording a signal u known to belong to a given convex compact and nonempty set U ⊂ R^n; the output of the ith receiver is the vector

y_i = A_i u + σξ ∈ R^m   [ξ ∼ N(0, I_m)],

where the A_i are given m × n matrices (you may think of M allowed positions for a single receiver, and of y_i as the output of the receiver when the latter is in position i). Our observation ω is one of the vectors y_i, 1 ≤ i ≤ M, with the index i unknown to us ("we observe a noisy record of a signal, but do not know the position in which this record was taken"). Given ω, we want to recover a given linear function g(u) = e^T u of the signal.

The problem can be modeled as follows. Consider the M sets

X_i = { x = [x_1; ...; x_M] ∈ R^{Mn} = R^n × ... × R^n (M factors) : x_j = 0, j ≠ i; x_i ∈ U }

along with the linear mapping

A[x_1; ...; x_M] = Σ_{i=1}^M A_i x_i : R^{Mn} → R^m

and the linear function

f([x_1; ...; x_M]) = e^T Σ_i x_i : R^{Mn} → R.

Let X be a convex compact set in R^{Mn} containing all the sets X_i, 1 ≤ i ≤ M. Observe that the problem we are interested in is nothing but the problem of recovering f(x) via the observation

ω = Ax + σξ,  ξ ∼ N(0, I_m),   (3.20)

where the unknown signal x is known to belong to the union ∪_{i=1}^M X_i of known convex compact sets X_i. As a result, our problem can be solved via the machinery developed in this section.

Numerical illustration. In the numerical experiments to be reported, we use n = 128, m = 64, and M = 2. The data is generated as follows:
• The set U ⊂ R^{128} of candidate signals is comprised of restrictions onto the equidistant (n = 128)-point grid in [0, 1] of twice differentiable functions h(t) of continuous argument t ∈ [0, 1] satisfying the relations

|h(0)| ≤ 1, |h′(0)| ≤ 1, |h′′(t)| ≤ 1, 0 ≤ t ≤ 1.

For the discretized signal u = [h(0); h(1/n); ...; h(1 − 1/n)] this translates into the system of convex constraints

|u_1| ≤ 1,  n|u_2 − u_1| ≤ 1,  n²|u_{i+1} − 2u_i + u_{i−1}| ≤ 1, 2 ≤ i ≤ n − 1.

4 Our goal is to illustrate a mathematical construction rather than to work out a particular application; the reader is welcome to invent a plausible "covering story."
Characteristic           min     median   mean    max
error bound              0.008   0.015    0.014   0.015
actual error             0.001   0.002    0.002   0.005
# of Bisection steps     5       7.00     6.60    8
Table 3.1: Data of 10 Bisection experiments, σ = 0.01. In the table, the "error bound" is the half-length of the final localizer, which is a 0.99-reliable upper bound on the estimation error; the "actual error" is the actual estimation error.

• We look to estimate the discretized counterpart of the integral ∫_0^1 h(t)dt, specifically, the quantity e^T u = α Σ_{i=1}^n u_i. The normalizing constant α is selected to ensure max_{u∈U} e^T u = 1, min_{u∈U} e^T u = −1, allowing us to run Bisection over ∆_0 = [−1, 1].
• We generate A_1 as an (m = 64) × (n = 128) matrix with singular values σ_i = θ^{i−1}, 1 ≤ i ≤ m, with θ selected from the requirement σ_m = 0.1. The system of left singular vectors of A_1 is obtained from the system of basic orths in R^n by a random rotation. The matrix A_2 was selected as A_2 = A_1 S, where S is a symmetry w.r.t. the axis e, that is,

S e = e  &  S h = −h whenever h is orthogonal to e.   (3.21)

Signals u underlying the observations are selected at random in U.
• The reliability 1 − ǫ of the estimate is set to 0.99, while the maximal allowed number L of Bisection steps is set to 8. We use a single observation (3.20) (i.e., K = 1 in our general scheme) with σ = 0.01.

The results of our experiments are presented in Table 3.1. Observe that in the considered problem there exists an intrinsic obstacle to high accuracy estimation, even in the case of noiseless observations and invertible matrices A_i, i = 1, 2 (recall that we are in the case of M = 2). Indeed, assume that there exist u ∈ U, u′ ∈ U such that A_1 u = A_2 u′ and e^T u ≠ e^T u′. Since we do not know which of the matrices, A_1 or A_2, underlies the observation, and A_1 u = A_2 u′, there is no way to distinguish between the two cases we have described, implying that the quantity

ρ = max_{u,u′∈U} { ½ e^T(u − u′) : A_1 u = A_2 u′ }   (3.22)

is a lower bound on the worst-case, over signals from U, error of reliable recovery of e^T u, independently of how small the noise is. In the reported experiments, we used A_2 = A_1 S with S linked to e (see (3.21)); with this selection of S, e, and A_2, and invertible A_1, the lower bound ρ would be trivial: just zero. Note that the selected A_1 is not invertible, resulting in a positive ρ. However, computation shows that with our data this positive ρ is negligibly small (about 2.0e-5). When we destroy the link between e and S, the estimation problem can become intrinsically more difficult, and the performance of our estimation procedure can deteriorate. Let us look at what happens when we keep A_1 and A_2 = A_1 S exactly as they are, but replace the linear form to be estimated with e^T u, e being randomly selected.5 The corresponding results are presented in Table 3.2. The data in the
5 In the experiments to be reported, e is selected as follows: we start with a random unit vector drawn from the uniform distribution on the unit sphere in R^n and then normalize it to
"Difficult" signals, data over 10 experiments:

Characteristic           min     median   mean    max
error bound              0.057   0.457    0.441   1.000
actual error             0.001   0.297    0.350   1.000
# of Bisection steps     1       1.00     2.20    5

Error bound vs. ρ, experiments sorted according to the values of ρ:

ρ             0.022  0.028  0.154  0.170  0.213  0.248  0.250  0.500  0.605  0.924
error bound   0.057  0.063  0.219  0.239  0.406  0.508  0.516  0.625  0.773  1.000

Random signals, data over 10 experiments:

Characteristic           min     median   mean    max
error bound              0.016   0.274    0.348   1.000
actual error             0.005   0.066    0.127   0.556
# of Bisection steps     1       2.00     2.80    7

Error bound vs. ρ, experiments sorted according to the values of ρ:

ρ             0.010  0.085  0.177  0.243  0.294  0.334  0.337  0.554  0.630  0.762
error bound   0.016  0.182  0.376  0.438  0.602  0.029  0.031  0.688  0.125  1.000

Table 3.2: Results of experiments with randomly selected linear form, σ = 0.01.
top part of the table match "difficult" signals u, those participating in forming the lower bound (3.22) on the recovery error, while the data in the bottom part of the table correspond to randomly selected signals.6 Observe that when estimating a randomly selected linear form, the error bounds indeed deteriorate as compared to those in Table 3.1. We see also that the resulting error bounds are in reasonably good agreement with the lower bound ρ, illustrating the basic property of nearly optimal estimates: the guaranteed performance of an estimate can be bad or good, but it is always nearly as good as is possible under the circumstances. As for the actual estimation errors, in some experiments they are significantly less than the error bounds, especially when random signals are used.

3.2.7 Estimating N-convex functions: An alternative
Observe that the problem of estimating an N-convex function on the union of convex sets posed in Section 3.2.2 can be processed not only by Bisection. An alternative is as follows. In the notation of Section 3.2.2, we start with computing the range ∆ of the function f on the set X = ∪_{i≤I} Xi, that is, we compute the quantities

f̲ = min_{x∈X} f(x),  f̄ = max_{x∈X} f(x)

5 (continued) have max_{u∈U} e^T u − min_{u∈U} e^T u = 2.
6 Precisely, to generate a signal u, we draw a point ū at random from the uniform distribution on the sphere of radius 10√n, and take as u the point of U closest to ū in ‖·‖_2.
and set ∆ = [f̲, f̄]. We assume that this segment is not a singleton; otherwise estimating f is trivial. Let L ∈ Z_+ and let δ_L = (f̄ − f̲)/L be the desired estimation accuracy. We split ∆ into L segments ∆_ℓ of equal length δ_L and consider the sets

X_{iℓ} = { x ∈ X_i : f(x) ∈ ∆_ℓ },  1 ≤ i ≤ I, 1 ≤ ℓ ≤ L.

Since f is N-convex, each set X_{iℓ} is the union of M_{iℓ} ≤ N² convex compact sets X_{iℓj}, 1 ≤ j ≤ M_{iℓ}. Thus, we have at our disposal a collection of at most ILN² convex compact sets; let us eliminate from this collection the empty sets and arrange the nonempty ones into a sequence Y_1, ..., Y_M, M ≤ ILN². Note that ∪_{s≤M} Y_s = X, so that the goal set in Section 3.2.2 can be reformulated as follows:

For some unknown x known to belong to X = ∪_{s=1}^M Y_s, we have at our disposal
an observation ω^K = (ω_1, ..., ω_K) with i.i.d. ω_t ∼ pA(x)(·); we aim at estimating the quantity f(x) from this observation.

The sets Y_s give rise to M hypotheses H_1, ..., H_M on the distribution of the observations ω_t, 1 ≤ t ≤ K; according to H_s, ω_t ∼ pA(x)(·) with some x ∈ Y_s. Let us define a closeness C on the set of our M hypotheses as follows. Given s ≤ M, the set Y_s is some X_{i(s)ℓ(s)j(s)}; we say that two hypotheses, H_s and H_{s′}, are C-close if the segments ∆_{ℓ(s)} and ∆_{ℓ(s′)} intersect. Observe that when H_s and H_{s′} are not C-close, the convex compact sets Y_s and Y_{s′} do not intersect, since the values of f on Y_s belong to ∆_{ℓ(s)}, the values of f on Y_{s′} belong to ∆_{ℓ(s′)}, and the segments ∆_{ℓ(s)} and ∆_{ℓ(s′)} do not intersect.

Now let us apply to the hypotheses H_1, ..., H_M our machinery for testing up to closeness C; see Section 2.5.2. Assuming that whenever H_s and H_{s′} are not C-close, the risks ǫ_{ss′} defined in Section 2.5.2.2 are < 1,7 we, given a tolerance ǫ ∈ (0, 1), can find K = K(ǫ) such that the stationary K-repeated observation ω^K allows us to decide (1 − ǫ)-reliably on H_1, ..., H_M up to closeness C. As applied to ω^K, the corresponding test T^K will accept some (perhaps, none) of the hypotheses; let the indexes of the accepted hypotheses form the set S = S(ω^K). We convert S into an estimate f̂(ω^K) of f(x), x ∈ X = ∪_{s≤M} Y_s being the signal underlying our observation, as follows:

• when S = ∅, the estimate is, say, (f̲ + f̄)/2;
• when S is nonempty, we take the union ∆(S) of the segments ∆_{ℓ(s)}, s ∈ S, and our estimate is the average of the largest and the smallest elements of ∆(S).

It is immediately seen that if the signal x underlying our stationary K-repeated observation ω^K belongs to some Y_{s∗}, so that the hypothesis H_{s∗} is true, and the outcome S of T^K contains s∗ and is such that for all s ∈ S the hypotheses H_s and H_{s∗} are C-close to each other, then |f(x) − f̂(ω^K)| ≤ δ_L. Note that since the C-risk of T^K is ≤ ǫ, the pA(x)-probability to get such a "good" outcome, and thus to get |f(x) − f̂(ω^K)| ≤ δ_L, is at least 1 − ǫ.

7 In standard simple o.s.'s, this is the case whenever for the s, s′ in question the images of Y_s and Y_{s′} under the mapping x ↦ A(x) do not intersect. Because for such s, s′, Y_s and Y_{s′} do not intersect, this definitely is the case when A(·) is an embedding.
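The conversion of accepted hypotheses into an estimate is mechanical once the test has run; here is an illustrative sketch (our own naming; the closeness test T^K itself is abstracted into the set of accepted bin indices):

```python
def alternative_estimate(f_lo, f_hi, L, accepted_bins):
    """Estimate f(x) from the 1-based indices of accepted range bins.

    The range [f_lo, f_hi] is split into L bins of length
    (f_hi - f_lo)/L; an accepted hypothesis asserts that f(x) lies in
    its bin.  The estimate is the average of the largest and smallest
    elements of the union of accepted bins, or the midpoint of the
    whole range if nothing was accepted."""
    if not accepted_bins:
        return (f_lo + f_hi) / 2
    d = (f_hi - f_lo) / L
    lo = f_lo + (min(accepted_bins) - 1) * d   # left end of leftmost bin
    hi = f_lo + max(accepted_bins) * d         # right end of rightmost bin
    return (lo + hi) / 2
```

If, as C-closeness guarantees on a "good" outcome, only intersecting (hence neighboring) bins can be simultaneously accepted, the estimate is within δ_L of any value in a bin containing f(x).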
3.2.7.1 Numerical illustration
Our illustration deals with the situation when I = 1, X = X_1 is a convex compact set, and f(x) is fractional-linear: f(x) = a^T x / c^T x with denominator positive on X. Specifically, assume we are given noisy measurements of voltages V_i at some nodes i and currents I_{ij} in some arcs (i, j) of an electric circuit, and want to recover the resistance of a particular arc (i∗, j∗):

r_{i∗j∗} = (V_{j∗} − V_{i∗}) / I_{i∗j∗}.
The observation noises are assumed to be N(0, σ²) and independent across the measurements. In our experiment, we work with the data as follows:
[Circuit diagram: input node, output node; a: arc of interest; b: arcs without current measurements; c: node without voltage measurement.]

x = [voltages at nodes; currents in arcs],  Ax = [observable voltages; observable currents]

• Currents are measured in all arcs except for a, b.
• Voltages are measured at all nodes except for c.
• We want to recover the resistance of arc b.
• X: conservation of current, except for the input/output nodes; zero voltage at the input node; nonnegative currents; current in arc b at least 1, total of currents at most 33; Ohm's Law, with resistances of arcs between 1 and 10.
We are in the situation of N = 1 and I = 1, implying M = L. When using L = 8, the projections of the sets Ys , 1 ≤ s ≤ L = 8, onto the 2D plane of variables
(Vj∗ − Vi∗ , Ii∗ j∗ ) are the “stripes” shown below: I i∗ j ∗
V j ∗ − V i∗ The range of the unknown resistance turns out to be ∆ = [1, 10]. We set ǫ = 0.01, and instead of looking for K such that the Krepeated observation allows us to recover 0.99reliably the resistance in the arc of interest within accuracy ∆/L, we look for the largest observation noise σ allowing us to achieve the desired recovery with a single observation. The results for L = 8, 16, 32 are as follows: L δL σ σopt /σ ≤ σ σopt /σ ≤
8 9/8 ≈ 1.13 0.024 1.31 0.031 1.01
16 9/16 ≈ 0.56 0.010 1.31 0.013 1.06
32 9/32 ≈ 0.28 0.005 1.33 0.006 1.08
In the above table:
• σ_opt is the largest σ for which "in nature" there exists a test deciding on H_1, ..., H_L with C-risk ≤ 0.01;
• Underlined data: the risks ǫ_{ss′} of pairwise tests are bounded via risks of optimal detectors; the C-risk of T is bounded by
 ‖[ǫ_{ss′} χ_{ss′}]_{s,s′=1}^L‖_{2,2}, where χ_{ss′} = 1 if (s, s′) ∉ C and χ_{ss′} = 0 if (s, s′) ∈ C;
see Proposition 2.29;
• "Slanted" data: the risks ǫ_{ss′} of pairwise tests are bounded via the error function; the C-risk of T is bounded by
 max_s Σ_{s′:(s,s′)∉C} ǫ_{ss′}
(it is immediately seen that in the case of Gaussian o.s., this indeed is a legitimate risk bound).
Figure 3.3: A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.
3.2.7.2 Estimating dissipated power
The alternative approach to estimating N convex functions proposed in Section 3.2.7 can be combined with the quadratic lifting described in Section 2.9 to yield, under favorable circumstances, estimates of quadratic and quadratic fractional functions. We are about to consider an instructive example of this type. Figure 3.3 represents a DC circuit. We have access to repeated noisy measurements of currents in some arcs and voltages at some nodes, with the voltage of the ground node equal to 0. The arcs are oriented; this orientation, however, is of no relevance in our context and therefore is not displayed. Our goal is to use these observations to estimate the power dissipated in a given “arc of interest.” The a priori information is as follows: • the (unknown) arc resistances are known to belong to a given range [r, R], with 0 < r < R < ∞; • the currents and the voltages are linked by Kirchhoff’s laws:
– at every node, the sum of currents in the outgoing arcs is equal to the sum of currents in the incoming arcs plus the external current at the node. In our circuit, there are just two external currents, one at the ground node and one at the input node c.
– the voltages and the currents are linked by Ohm’s law: for every (inner) arc γ, we have Iγ rγ = Vj(γ) − Vi(γ) where Iγ is the current in the arc, rγ is the arc’s resistance, Vs is the voltage at node s, and i(γ), j(γ) are the initial and the terminal nodes linked by arc γ; • magnitudes of all currents and voltages are bounded by 1. We assume that the measurements of observable currents and voltages are affected by zero mean Gaussian noise with scalar covariance matrix θ2 I, with unknown θ from a given range [σ, σ]. Processing the problem. We specify the “signal” underlying our observation as
a collection u of the voltages at nine nodes and currents Iγ in 16 (inner) arcs γ of the circuit, augmented by the external current Io at the input node (so that −Io is the external current at the ground node). Thus, our singletime observation is ζ = Au + θξ,
(3.23)
where A extracts from u four entries (currents in two arcs b and external current and voltage at the input node c), ξ ∼ N (0, I4 ), and θ ∈ [σ, σ]. Our a priori information on u states that u belongs to the compact set U given by the quadratic constraints, namely, as follows:
 U = { u = {I_γ, I_o, V_i} :
   I_γ² ≤ 1, V_i² ≤ 1 ∀γ, i;  u^T J^T J u ≤ 0;
   (a) [V_{j(γ)} − V_{i(γ)}]²/R − I_γ[V_{j(γ)} − V_{i(γ)}] ≤ 0,
       I_γ[V_{j(γ)} − V_{i(γ)}] − [V_{j(γ)} − V_{i(γ)}]²/r ≤ 0  ∀γ;
   (b) r I_γ² − I_γ[V_{j(γ)} − V_{i(γ)}] ≤ 0,
       I_γ[V_{j(γ)} − V_{i(γ)}] − R I_γ² ≤ 0  ∀γ }      (3.24)
where Ju = 0 expresses the first Kirchhoff law, and the quadratic constraints (a) and (b) account for Ohm's Law in the situation when we do not know the exact resistances but only their range [r, R]. Note that groups (a) and (b) of constraints in (3.24) are "logical consequences" of each other, and thus one of the groups seems to be redundant. However, on closer inspection, quadratic inequalities valid on U do not tighten the outer approximation Z of Z[U], and thus are redundant in our context, only when these inequalities can be obtained from the inequalities we do include into the description of Z "in a linear fashion"—by taking weighted sums with nonnegative coefficients. This is not how (b) is obtained from (a). As a result, to get a smaller Z, it makes sense to keep both (a) and (b). The dissipated power we are interested in estimating is the quadratic function
 f(u) = I_{γ∗}[V_{j∗} − V_{i∗}] = [u; 1]^T G [u; 1],
where γ∗ = (i∗, j∗) is the arc of interest, and G ∈ S^{n+1}, n = dim u, is a properly built matrix. In order to build an estimate, we "lift quadratically" the observations, ζ ↦ ω = (ζ, ζζ^T), and pass from the domain U of actual signals to the outer approximation Z of the quadratic lifting of U:
 Z := {Z ∈ S^{n+1} : Z ⪰ 0, Z_{n+1,n+1} = 1, Tr(Q_s Z) ≤ c_s, 1 ≤ s ≤ S}
   ⊃ {[u; 1][u; 1]^T : u ∈ U}.
Here the matrix Q_s ∈ S^{n+1} represents the left-hand side F_s(u) of the s-th quadratic constraint in the description (3.24) of U: F_s(u) ≡ [u; 1]^T Q_s [u; 1], and c_s is the right-hand side of the s-th constraint. We process the problem similarly to what was done in Section 3.2.7.1, where our goal was to estimate a fractional-linear function. Specifically,
1. We compute the range of f on U; the smallest value f̲ of f on U clearly is zero, and an upper bound f̄ on the maximum of f(u) over u ∈ U is the optimal value in the convex optimization problem
 f̄ = max_{Z∈Z} Tr(GZ).
2. Given a positive integer L, we split the range [f , f ] into L segments ∆ℓ = [aℓ−1 , aℓ ] of equal length δL = (f − f )/L and define convex compact sets Zℓ = {Z ∈ Z : aℓ−1 ≤ Tr(GZ) ≤ aℓ }, 1 ≤ ℓ ≤ L, so that u ∈ U, f (u) ∈ ∆ℓ ⇒ [u; 1][u; 1]T ∈ Zℓ , 1 ≤ ℓ ≤ L. 3. We specify L quadratically constrained hypotheses H1 , ..., HL on the distribution of observation (3.23), with Hℓ stating that ζ ∼ N (Au, θ2 I4 ) with some u ∈ U satisfying f (u) ∈ ∆ℓ (so that [u; 1][u; 1]T ∈ Zℓ ), and θ belongs to the above segment [σ, σ]]. We equip our hypotheses with a closeness relation C; specifically, we consider Hℓ and Hℓ′ Cclose if and only if the segments ∆ℓ and ∆ℓ′ intersect. 4. We use Propositions 2.43.ii and 2.40 to build detectors φℓℓ′ quadratic in ζ for the families of distributions obeying Hℓ and Hℓ′ , respectively, along with upper bounds ǫℓℓ′ on the risks of these detectors. Finally, we use the machinery from Section 2.5.2 to find the smallest K and a test TCK , based on a stationary Krepeated version of observation (3.23), able to decide on H1 , ..., HL with Crisk ≤ ǫ, where ǫ ∈ (0, 1) is a given tolerance. Finally, given stationary Krepeated observation (3.23), we apply to it test TCK , look at the hypotheses, if any, accepted by the test, and build the union ∆ of the corresponding segments ∆ℓ . If ∆ = ∅, we estimate f (u) as the midpoint of the power range [f , f ]; otherwise the estimate is the mean of the largest and the smallest points in ∆. It is easily seen that for this estimate, the probability for the estimation error to be > δℓ is ≤ ǫ. The numerical results we present√here correspond to the circuit presented in Figure 3.3. We set σ = 0.01, σ = σ/ 2, [r, R] = [1, 2], ǫ = 0.01, and L = 8. The simulation setting is as follows: the computed range [f , f ] of the dissipated power is [0, 0.821], so that the estimate built recovers the dissipated power within accuracy 0.103 and reliability 0.99. The resulting value of K is K = 95. 
In all 500 simulation runs, the actual recovery error was less than the bound 0.103, and the average error was as small as 0.041.
3.3 ESTIMATING LINEAR FORMS BEYOND SIMPLE OBSERVATION SCHEMES
We are about to show that the techniques developed in Section 2.8 can be applied to building estimates of linear and quadratic forms of the parameters of observed distributions. As compared to the machinery of Section 3.2, our new approach has a somewhat restricted scope: we do not estimate general N-convex functions, nor do we handle domains which are unions of convex sets; now we need the function to be linear (perhaps after quadratic lifting of observations) and the domain to
be convex.⁸ As a compensation, we are not limited to simple observation schemes anymore—our approach is in fact a natural extension of the approach developed in Section 3.1 beyond simple o.s.'s. In this section, we focus on estimating linear forms; estimating quadratic forms will be our subject in Section 3.4.

3.3.1 Situation and goal
Consider the situation as follows: given are Euclidean spaces Ω = E_H, E_M, E_X along with
• regular data (see Section 2.8.1.1) H ⊂ E_H, M ⊂ E_M, Φ(·;·) : H × M → R, with 0 ∈ int H,
• a nonempty convex compact set X ⊂ E_X,
• an affine mapping x ↦ A(x) : E_X → E_M such that A(X) ⊂ M,
• a continuous convex calibrating function υ(x) : X → R,
• a vector g ∈ E_X and a constant c specifying the linear form G(x) = ⟨g, x⟩ + c : E_X → R,⁹
• a tolerance ǫ ∈ (0, 1).
These data specify, in particular, the family P = S[H, M, Φ] of probability distributions on Ω = E_H; see Section 2.8.1.1. Given random observation
 ω ∼ P(·)    (3.25)
where P ∈ P is such that
 ∀h ∈ H : ln(∫_{E_H} e^{⟨h,ω⟩} P(dω)) ≤ Φ(h; A(x))    (3.26)
for some x ∈ X (that is, A(x) is a parameter, as defined in Section 2.8.1.1, of the distribution P), we want to recover the quantity G(x).
ǫ-risk. Given ρ > 0, we call an estimate ĝ(·) : E_H → R (ρ, ǫ, υ(·))-accurate if for all pairs x ∈ X, P ∈ P satisfying (3.26) it holds
 Prob_{ω∼P}{|ĝ(ω) − G(x)| > ρ + υ(x)} ≤ ǫ.
If ρ_∗ is the infimum of those ρ for which the estimate ĝ is (ρ, ǫ, υ(·))-accurate, then clearly ĝ is (ρ_∗, ǫ, υ(·))-accurate; we shall call ρ_∗ the ǫ-risk of the estimate ĝ taken
⁸ The latter is just for the sake of simplicity, to not overload the presentation to follow. An interested reader will certainly be able to reproduce the corresponding construction of Section 3.1 in the situation of this section.
⁹ From now on, ⟨u, v⟩ denotes the inner product of vectors u, v belonging to a Euclidean space; which space it is will always be clear from the context.
w.r.t. the data G(·), X, υ(·), and (A, H, M, Φ):
 Risk_ǫ(ĝ(·)|G, X, υ, A, H, M, Φ)
  = min{ ρ : Prob_{ω∼P}{ω : |ĝ(ω) − G(x)| > ρ + υ(x)} ≤ ǫ
         ∀(x, P) : x ∈ X, P ∈ P, ln ∫ e^{h^Tω} P(dω) ≤ Φ(h; A(x)) ∀h ∈ H }.    (3.27)
When G, X, υ, A, H, M, and Φ are clear from the context, we shorten Risk_ǫ(ĝ(·)|G, X, υ, A, H, M, Φ) to Risk_ǫ(ĝ(·)). Given the data listed at the beginning of this section, we are about to build, in a computationally efficient fashion, an affine estimate ĝ(ω) = ⟨h_∗, ω⟩ + κ along with ρ_∗ such that the estimate is (ρ_∗, ǫ, υ(·))-accurate.

3.3.2 Construction and main results
Let us set
 H⁺ = {(h, α) : h ∈ E_H, α > 0, h/α ∈ H},
so that H⁺ is a nonempty convex set in E_H × R_+, and let
 (a) Ψ_+(h, α) = sup_{x∈X} [αΦ(h/α, A(x)) − G(x) − υ(x)] : H⁺ → R,
 (b) Ψ_−(h, β) = sup_{x∈X} [βΦ(−h/β, A(x)) + G(x) − υ(x)] : H⁺ → R,    (3.28)
so that Ψ_± are convex real-valued functions on H⁺ (recall that Φ is convex-concave and continuous on H × M, while A(X) is a compact subset of M). Our starting point is quite simple:

Proposition 3.5. Given ǫ ∈ (0, 1), let h̄, ᾱ, β̄, κ̄, ρ̄ be a feasible solution to the system of convex constraints
 (a_1) (h, α) ∈ H⁺,
 (a_2) (h, β) ∈ H⁺,
 (b_1) α ln(ǫ/2) ≥ Ψ_+(h, α) − ρ + κ,
 (b_2) β ln(ǫ/2) ≥ Ψ_−(h, β) − ρ − κ    (3.29)
in variables h, α, β, ρ, κ. Setting
 ĝ(ω) = ⟨h̄, ω⟩ + κ̄,
we obtain an estimate with ǫ-risk at most ρ̄.
Proof. Let ǫ ∈ (0, 1), h̄, ᾱ, β̄, κ̄, ρ̄ satisfy the premise of the proposition, and let x ∈ X, P satisfy (3.26). We have
 Prob_{ω∼P}{ĝ(ω) > G(x) + ρ̄ + υ(x)}
  = Prob_{ω∼P}{⟨h̄,ω⟩/ᾱ > [G(x) + ρ̄ − κ̄ + υ(x)]/ᾱ}
  ≤ [∫ e^{⟨h̄,ω⟩/ᾱ} P(dω)] e^{−[G(x)+ρ̄−κ̄+υ(x)]/ᾱ}
  ≤ e^{Φ(h̄/ᾱ, A(x))} e^{−[G(x)+ρ̄−κ̄+υ(x)]/ᾱ}.
As a result,
 ᾱ ln(Prob_{ω∼P}{ĝ(ω) > G(x) + ρ̄ + υ(x)})
  ≤ ᾱΦ(h̄/ᾱ, A(x)) − G(x) − ρ̄ + κ̄ − υ(x)
  ≤ Ψ_+(h̄, ᾱ) − ρ̄ + κ̄  [by definition of Ψ_+ and due to x ∈ X]
  ≤ ᾱ ln(ǫ/2)  [by (3.29.b_1)],
so that Prob_{ω∼P}{ĝ(ω) > G(x) + ρ̄ + υ(x)} ≤ ǫ/2. Similarly,
 Prob_{ω∼P}{ĝ(ω) < G(x) − ρ̄ − υ(x)}
  = Prob_{ω∼P}{−⟨h̄,ω⟩/β̄ > [−G(x) + ρ̄ + κ̄ + υ(x)]/β̄}
  ≤ [∫ e^{−⟨h̄,ω⟩/β̄} P(dω)] e^{[G(x)−ρ̄−κ̄−υ(x)]/β̄}
  ≤ e^{Φ(−h̄/β̄, A(x))} e^{[G(x)−ρ̄−κ̄−υ(x)]/β̄}.
Thus
 β̄ ln(Prob_{ω∼P}{ĝ(ω) < G(x) − ρ̄ − υ(x)})
  ≤ β̄Φ(−h̄/β̄, A(x)) + G(x) − ρ̄ − κ̄ − υ(x)
  ≤ Ψ_−(h̄, β̄) − ρ̄ − κ̄  [by definition of Ψ_− and due to x ∈ X]
  ≤ β̄ ln(ǫ/2)  [by (3.29.b_2)],
and Prob_{ω∼P}{ĝ(ω) < G(x) − ρ̄ − υ(x)} ≤ ǫ/2. ✷
Corollary 3.6. In the situation described in Section 3.3.1, let Φ satisfy the relation
 Φ(0; μ) ≥ 0 ∀μ ∈ M.    (3.30)
Then
 (a) Ψ̂_+(h) := inf_α {Ψ_+(h, α) + α ln(2/ǫ) : α > 0, (h, α) ∈ H⁺}
             = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} [αΦ(h/α, A(x)) − G(x) − υ(x) + α ln(2/ǫ)],
 (b) Ψ̂_−(h) := inf_α {Ψ_−(h, α) + α ln(2/ǫ) : α > 0, (h, α) ∈ H⁺}
             = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} [αΦ(−h/α, A(x)) + G(x) − υ(x) + α ln(2/ǫ)],    (3.31)
and the functions Ψ̂_± : E_H → R are convex. Furthermore, let h̄, κ̄, ρ̃ be a feasible solution to the system of convex constraints
 Ψ̂_+(h) ≤ ρ − κ, Ψ̂_−(h) ≤ ρ + κ    (3.32)
in variables h, ρ, κ. Then the estimate
 ĝ(ω) = ⟨h̄, ω⟩ + κ̄
of G(x), x ∈ X, has ǫ-risk at most ρ̃:
 Risk_ǫ(ĝ(·)|G, X, υ, A, H, M, Φ) ≤ ρ̃.    (3.33)
Relation (3.32) (and thus the risk bound (3.33)) clearly holds true when h̄ is a candidate solution to the convex optimization problem
 Opt = min_h { Ψ̂(h) := ½[Ψ̂_+(h) + Ψ̂_−(h)] },    (3.34)
and
 ρ̃ = Ψ̂(h̄), κ̄ = ½[Ψ̂_−(h̄) − Ψ̂_+(h̄)].
As a result, by properly selecting h̄, we can make (an upper bound on) the ǫ-risk of the estimate ĝ(·) arbitrarily close to Opt, and equal to Opt when optimization problem (3.34) is solvable.

Proof. Let us first verify the identities in (3.31). The function
 Θ_+(h, α; x) = αΦ(h/α, A(x)) − G(x) − υ(x) + α ln(2/ǫ) : H⁺ × X → R
is convex-concave and continuous, and X is compact, whence by the Sion-Kakutani Theorem
 Ψ̂_+(h) := inf_α {Ψ_+(h, α) + α ln(2/ǫ) : α > 0, (h, α) ∈ H⁺}
          = inf_{α>0,(h,α)∈H⁺} max_{x∈X} Θ_+(h, α; x)
          = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} Θ_+(h, α; x)
          = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} [αΦ(h/α, A(x)) − G(x) − υ(x) + α ln(2/ǫ)],
as required in (3.31.a). As we know, Ψ_+(h, α) is a real-valued continuous function on H⁺, so that Ψ̂_+ is convex on E_H, provided that the function is real-valued. Now, let x̄ ∈ X, and let e be a subgradient of φ(h) = Φ(h; A(x̄)) taken at h = 0. For h ∈ E_H and all α > 0 such that (h, α) ∈ H⁺ we have
 Ψ_+(h, α) ≥ αΦ(h/α; A(x̄)) − G(x̄) − υ(x̄) + α ln(2/ǫ)
           ≥ α[Φ(0; A(x̄)) + ⟨e, h/α⟩] − G(x̄) − υ(x̄) + α ln(2/ǫ)
           ≥ ⟨e, h⟩ − G(x̄) − υ(x̄)
(we have used (3.30)), and therefore Ψ_+(h, α) as a function of α is bounded from below on the set {α > 0 : h/α ∈ H}. In addition, this set is nonempty, since H contains a neighbourhood of the origin. Thus, Ψ̂_+ is real-valued and convex on E_H. Verification of (3.31.b) and of the fact that Ψ̂_−(h) is a real-valued convex function on E_H is completely similar.
Now, given a feasible solution (h̄, κ̄, ρ̃) to (3.32), let us select some ρ̄ > ρ̃. Taking into account the definition of Ψ̂_±, we can find ᾱ and β̄ such that
 (h̄, ᾱ) ∈ H⁺ and Ψ_+(h̄, ᾱ) + ᾱ ln(2/ǫ) ≤ ρ̄ − κ̄,
 (h̄, β̄) ∈ H⁺ and Ψ_−(h̄, β̄) + β̄ ln(2/ǫ) ≤ ρ̄ + κ̄,
implying that the collection (h̄, ᾱ, β̄, κ̄, ρ̄) is a feasible solution to (3.29). Invoking Proposition 3.5, we get
 Prob_{ω∼P}{ω : |ĝ(ω) − G(x)| > ρ̄ + υ(x)} ≤ ǫ
for all (x ∈ X, P ∈ P) satisfying (3.26). Since ρ̄ can be selected arbitrarily close to ρ̃, ĝ(·) indeed is a (ρ̃, ǫ, υ(·))-accurate estimate. ✷
3.3.3 Estimation from repeated observations
Assume that in the situation described in Section 3.3.1 we have access to K observations ω_1, ..., ω_K sampled, independently of each other, from a probability distribution P, and aim to build the estimate based on these K observations rather than on a single observation. We can immediately reduce this new situation to the previous one, just by redefining the data. Specifically, given the initial data H ⊂ E_H, M ⊂ E_M, Φ(·;·) : H × M → R, X ⊂ E_X, υ(·), A(·), G(x) = ⟨g, x⟩ + c (see Section 3.3.1) and a positive integer K, let us update part of the data, namely, replace H ⊂ E_H with
 H^K := H × ... × H (K factors) ⊂ E_H^K := E_H × ... × E_H (K factors),
and replace Φ(·,·) : H × M → R with
 Φ_K(h^K = (h_1, ..., h_K); μ) = Σ_{i=1}^K Φ(h_i; μ) : H^K × M → R.
It is immediately seen that the updated data satisfy all requirements imposed on the data in Section 3.3.1, and that whenever x ∈ X and a Borel probability distribution P on E_H are linked by (3.26), x and the distribution P^K of the K-element i.i.d. sample ω^K = (ω_1, ..., ω_K) drawn from P are linked by the relation
 ∀h^K = (h_1, ..., h_K) ∈ H^K :
 ln(∫_{E_H^K} e^{⟨h^K,ω^K⟩} P^K(dω^K)) = Σ_i ln(∫_{E_H} e^{⟨h_i,ω_i⟩} P(dω_i)) ≤ Φ_K(h^K; A(x)).
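In the scalar Gaussian case, for instance, the additivity of Φ_K over the components of h^K is just the additivity of the log-moment-generating function over independent observations (a minimal sketch; function names are ours):

```python
def phi_gauss(h, mu, sigma2):
    """Exact log-MGF of a single observation omega ~ N(mu, sigma2):
    ln E exp(h * omega) = h * mu + 0.5 * sigma2 * h**2."""
    return h * mu + 0.5 * sigma2 * h * h

def phi_K(hs, mu, sigma2):
    """Log-MGF of the i.i.d. sample (omega_1, ..., omega_K) at
    h^K = (h_1, ..., h_K): by independence it is the sum of the
    single-observation log-MGFs, i.e. Phi_K(h^K; mu) = sum_i Phi(h_i; mu)."""
    return sum(phi_gauss(h, mu, sigma2) for h in hs)
```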
Applying to our new data the construction from Section 3.3.2, we arrive at "repeated observation" versions of Proposition 3.5 and Corollary 3.6. Note that the resulting convex constraints/objectives are symmetric w.r.t. permutations of the components h_1, ..., h_K of h^K, implying that we lose nothing when restricting ourselves to collections h^K with all components equal to each other; it is convenient to denote the common value of these components h/K. With this observation in mind, Proposition 3.5 and Corollary 3.6 translate into the following statements (we use the assumptions and the notation from the previous sections):

Proposition 3.7. Given ǫ ∈ (0, 1) and a positive integer K, let
 (a) Ψ_+(h, α) = sup_{x∈X} [αΦ(h/α, A(x)) − G(x) − υ(x)] : H⁺ → R,
 (b) Ψ_−(h, β) = sup_{x∈X} [βΦ(−h/β, A(x)) + G(x) − υ(x)] : H⁺ → R,
and let h̄, ᾱ, β̄, κ̄, ρ̄ be a feasible solution to the system of convex constraints
 (a_1) (h, α) ∈ H⁺,
 (a_2) (h, β) ∈ H⁺,
 (b_1) αK^{−1} ln(ǫ/2) ≥ Ψ_+(h, α) − ρ + κ,
 (b_2) βK^{−1} ln(ǫ/2) ≥ Ψ_−(h, β) − ρ − κ    (3.35)
217
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
in variables h, α, β, ρ, κ. Setting
 ĝ(ω^K) = ⟨h̄, (1/K) Σ_{i=1}^K ω_i⟩ + κ̄,
we obtain an estimate of G(x) via independent K-repeated observations ω_i ∼ P, i = 1, ..., K, with ǫ-risk on X not exceeding ρ̄. In other words, whenever x ∈ X and a Borel probability distribution P on E_H are linked by (3.26), one has
 Prob_{ω^K∼P^K}{ω^K : |ĝ(ω^K) − G(x)| > ρ̄ + υ(x)} ≤ ǫ.    (3.36)
Corollary 3.8. In the situation described at the beginning of Section 3.3.1, let Φ satisfy relation (3.30), and let a positive integer K be given. Then
 (a) Ψ̂_{+,K}(h) := inf_α {Ψ_+(h, α) + K^{−1}α ln(2/ǫ) : α > 0, (h, α) ∈ H⁺}
   = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} [αΦ(h/α, A(x)) − G(x) − υ(x) + K^{−1}α ln(2/ǫ)],
 (b) Ψ̂_{−,K}(h) := inf_α {Ψ_−(h, α) + K^{−1}α ln(2/ǫ) : α > 0, (h, α) ∈ H⁺}
   = sup_{x∈X} inf_{α>0,(h,α)∈H⁺} [αΦ(−h/α, A(x)) + G(x) − υ(x) + K^{−1}α ln(2/ǫ)],
and the functions Ψ̂_{±,K} : E_H → R are convex. Furthermore, let h̄, κ̄, ρ̃ be a feasible solution to the system of convex constraints
 Ψ̂_{+,K}(h) ≤ ρ − κ, Ψ̂_{−,K}(h) ≤ ρ + κ    (3.37)
in variables h, ρ, κ. Then the ǫ-risk of the estimate
 ĝ(ω^K) = ⟨h̄, (1/K) Σ_{i=1}^K ω_i⟩ + κ̄
of G(x), x ∈ X, is at most ρ̃, implying that whenever x ∈ X and a Borel probability distribution P on E_H are linked by (3.26), relation (3.36) holds true.
Relation (3.37) clearly holds true when h̄ is a candidate solution to the convex optimization problem
 Opt_K = min_h {Ψ̂_K(h) := ½[Ψ̂_{+,K}(h) + Ψ̂_{−,K}(h)]},    (3.38)
and
 ρ̃ = Ψ̂_K(h̄), κ̄ = ½[Ψ̂_{−,K}(h̄) − Ψ̂_{+,K}(h̄)].
As a result, by properly selecting h̄, we can make (an upper bound on) the ǫ-risk of the estimate ĝ(·) arbitrarily close to Opt_K, and equal to Opt_K when optimization problem (3.38) is solvable.
From now on, if not explicitly stated otherwise, we deal with K-repeated observations; to get back to the single-observation case, it suffices to set K = 1.
3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters

Consider the simplest case of the situation from Sections 3.3.1 and 3.3.3, where
• H = E_H = R^d, M = E_M = R^d × S^d_+, and Φ(h; μ, M) = h^Tμ + ½h^TMh : R^d × (R^d × S^d_+) → R, so that S[H, M, Φ] is the family of all sub-Gaussian distributions on R^d;
• X ⊂ E_X = R^{n_x} is a nonempty convex compact set;
• A(x) = (Ax + a, M(x)), where A is a d × n_x matrix, and M(x) is a symmetric d × d matrix affinely depending on x such that M(x) ⪰ 0 when x ∈ X;
• υ(x) is a convex continuous function on X;
• G(x) is an affine function on E_X.
In the case in question, (3.30) clearly takes place, and the left-hand sides in constraints (3.37) become
 Ψ̂_{+,K}(h) = sup_{x∈X} inf_{α>0} [h^T[Ax + a] + (1/(2α))h^TM(x)h + K^{−1}α ln(2/ǫ) − G(x) − υ(x)]
            = max_{x∈X} {√(2K^{−1} ln(2/ǫ)[h^TM(x)h]) + h^T[Ax + a] − G(x) − υ(x)},
 Ψ̂_{−,K}(h) = sup_{x∈X} inf_{α>0} [−h^T[Ax + a] + (1/(2α))h^TM(x)h + K^{−1}α ln(2/ǫ) + G(x) − υ(x)]
            = max_{x∈X} {√(2K^{−1} ln(2/ǫ)[h^TM(x)h]) − h^T[Ax + a] + G(x) − υ(x)}.
Thus, system (3.37) reads
 a^Th + max_{x∈X} [√(2K^{−1} ln(2/ǫ)[h^TM(x)h]) + h^TAx − G(x) − υ(x)] ≤ ρ − κ,
 −a^Th + max_{x∈X} [√(2K^{−1} ln(2/ǫ)[h^TM(x)h]) − h^TAx + G(x) − υ(x)] ≤ ρ + κ.
We arrive at the following version of Corollary 3.8:
Proposition 3.9. In the situation described at the beginning of Section 3.3.4, given ǫ ∈ (0, 1), let h̄ be a feasible solution to the convex optimization problem
 Opt_K = min_{h∈R^d} Ψ̂_K(h),    (3.39)
where
 Ψ̂_K(h) := ½[ max_{x∈X} {√(2K^{−1} ln(2/ǫ)[h^TM(x)h]) + h^TAx − G(x) − υ(x)} + a^Th
            + max_{y∈X} {√(2K^{−1} ln(2/ǫ)[h^TM(y)h]) − h^TAy + G(y) − υ(y)} − a^Th ],
the first bracketed term (including +a^Th) being Ψ̂_{+,K}(h) and the second (including −a^Th) being Ψ̂_{−,K}(h). Then, setting
 κ̄ = ½[Ψ̂_{−,K}(h̄) − Ψ̂_{+,K}(h̄)], ρ̄ = Ψ̂_K(h̄),
the affine estimate
 ĝ(ω^K) = (1/K) Σ_{i=1}^K h̄^Tω_i + κ̄
has ǫ-risk, taken w.r.t. the data listed at the beginning of this section, at most ρ̄.

It is immediately seen that optimization problem (3.39) is solvable provided that
 ∩_{x∈X} Ker(M(x)) = {0},
and an optimal solution h_∗ to the problem, taken along with
 κ_∗ = ½[Ψ̂_{−,K}(h_∗) − Ψ̂_{+,K}(h_∗)],    (3.40)
yields the affine estimate
 ĝ_∗(ω^K) = (1/K) Σ_{i=1}^K h_∗^Tω_i + κ_∗
with ǫ-risk, taken w.r.t. the data listed at the beginning of this section, at most Opt_K.

3.3.4.1 Consistency
Assuming υ(x) ≡ 0, we can easily answer the natural question "when is the proposed estimation scheme consistent?", meaning that for every ǫ ∈ (0, 1) it allows us to achieve arbitrarily small ǫ-risk, provided that K is large enough. Specifically, denoting by g^Tx the linear part of G(x): G(x) = g^Tx + c, from Proposition 3.9 it is immediately seen that a necessary and sufficient condition for consistency is the existence of h̄ ∈ R^d such that h̄^TAx = g^Tx for all x ∈ X − X, or, equivalently, the condition that g is orthogonal to the intersection of the kernel of A with the linear span of X − X. Indeed, under this assumption, for every fixed ǫ ∈ (0, 1) we clearly have lim_{K→∞} Ψ̂_K(h̄) = 0, implying that lim_{K→∞} Opt_K = 0, with Ψ̂_K and Opt_K given by (3.39). On the other hand, if the condition is violated, then there exist x′, x″ ∈ X such that Ax′ = Ax″ and G(x′) ≠ G(x″); we lose nothing when assuming that G(x″) > G(x′). Looking at (3.39), we see that
 Ψ̂_K(h) ≥ ½[√(2K^{−1} ln(2/ǫ)[h^TM(x′)h]) + h^TAx′ − G(x′) + a^Th
           + √(2K^{−1} ln(2/ǫ)[h^TM(x″)h]) − h^TAx″ + G(x″) − a^Th]
        ≥ ½[G(x″) − G(x′)],
whence Opt_K, for all K, is lower-bounded by the positive quantity ½[G(x″) − G(x′)].
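The orthogonality condition is easy to test numerically: vectors of Ker(A) ∩ span(X − X) are exactly v = Sc with (AS)c = 0, where the columns of S span X − X. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def is_consistent(A, g, S, tol=1e-9):
    """Check that g is orthogonal to Ker(A) ∩ span(S), where the columns
    of S span X - X; vectors of the intersection have the form
    v = S @ c with (A @ S) @ c = 0."""
    AS = A @ S
    _, sv, Vt = np.linalg.svd(AS)       # full SVD: Vt is square
    rank = int((sv > tol).sum())
    C = Vt[rank:].T                     # columns: basis of Null(A @ S)
    if C.size == 0:
        return True                     # the intersection is {0}
    # g must annihilate every vector S @ c of the intersection
    return bool(np.all(np.abs(g @ (S @ C)) < tol))
```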
3.3.4.2 Direct product case

Further simplifications are possible in the direct product case, where, in addition to what was assumed at the beginning of Section 3.3.4,
• E_X = E_U × E_V and X = U × V, with convex compact sets U ⊂ E_U = R^{n_u} and V ⊂ E_V = R^{n_v};
• A(x = (u, v)) = [Au + a, M(v)] : U × V → R^d × S^d, with M(v) ⪰ 0 for v ∈ V;
• G(x = (u, v)) = g^Tu + c depends solely on u; and
• υ(x = (u, v)) = ̺(u) depends solely on u.
It is immediately seen that in the direct product case problem (3.39) reads
 Opt_K = min_{h∈R^d} { ½[φ_U(A^Th − g) + φ_U(−A^Th + g)] + max_{v∈V} √(2K^{−1} ln(2/ǫ) h^TM(v)h) },    (3.41)
where
 φ_U(f) = max_{u∈U} [u^Tf − ̺(u)].    (3.42)
Assuming ∩_{v∈V} Ker(M(v)) = {0}, the problem is solvable, and its optimal solution h_∗ gives rise to the affine estimate
 ĝ_∗(ω^K) = (1/K) Σ_i h_∗^Tω_i + κ_∗,  κ_∗ = ½[φ_U(−A^Th_∗ + g) − φ_U(A^Th_∗ − g)] − a^Th_∗ + c,
with ǫ-risk ≤ Opt_K.
Near-optimality. In addition to the assumption that we are in the direct product case, assume that υ(·) ≡ 0 and, for the sake of simplicity, that M(v) ≻ 0 whenever v ∈ V. In this case (3.39) reads
 Opt_K = min_h max_{v∈V} { Θ(h, v) := ½[φ_U(A^Th − g) + φ_U(−A^Th + g)] + √(2K^{−1} ln(2/ǫ) h^TM(v)h) }.
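In the simplest instance (̺ ≡ 0 and U a box, our illustrative assumption) the function φ_U of (3.42) is available in closed form, since the maximum decouples coordinatewise; this makes the objective of (3.41) easy to evaluate for a fixed h. A sketch:

```python
import numpy as np

def phi_box(f, lo, hi):
    """phi_U(f) = max_{u in U} u^T f for the box U = {u : lo <= u <= hi}
    (the varrho = 0 case of (3.42)): each coordinate is optimized
    separately, picking whichever endpoint gives the larger product."""
    return float(np.sum(np.maximum(lo * f, hi * f)))
```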
with ǫrisk ≤ OptK . Nearoptimality. In addition to the assumption that we are in the direct product case, assume that υ(·) ≡ 0 and, for the sake of simplicity, that M (v) ≻ 0 whenever v ∈ V . In this case (3.39) reads OptK = minh maxv∈V Θ(h, v) := 21 [φU (AT h − g) + φU (−AT h + g)] p −1 T + 2K ln(2/ǫ)h M (v)h . Hence, taking into account that Θ(h, v) clearly is convex in h and concave in v, while V is a convex compact set, by the SionKakutani Theorem we get also OptK =
maxv∈V Opt(v) = minh 21 [φU (AT h − g) + φU (−AT h + g)] p + 2K −1 ln(2/ǫ)hT M (v)h .
(3.43)
Now consider the problem of estimating g T u from independent observations ωi , i ≤ K, sampled from N (Au + a, M (v)), where unknown u is known to belong to U and v ∈ V is known. Let ρǫ (v) be the minimax ǫrisk of recovery: g (ω K ) − g T u > ρ} ≤ ǫ ∀u ∈ U , ρǫ (v) = inf ρ : ProbωK ∼[N (Au+a,M (v))]K {ω K : b g b(·)
where the inf is taken over all Borel functions ĝ(·) : R^{Kd} → R. Invoking [131, Theorem 3.1], it is immediately seen that whenever ǫ < 1/4, one has
 ρ_ǫ(v) ≥ [2 ln(2/ǫ) / ln(1/(4ǫ))]^{−1} Opt(v).
Since the family SG(U, V) of all sub-Gaussian distributions on R^d with parameters (Au + a, M(v)), u ∈ U, v ∈ V, contains all Gaussian distributions N(Au + a, M(v)) induced by (u, v) ∈ U × V, we arrive at the following conclusion:
Proposition 3.10. In the just described situation, the minimax optimal ǫ-risk
 Risk^opt_ǫ(K) = inf_{ĝ(·)} Risk_ǫ(ĝ(·))
of recovering g^Tu from a K-repeated i.i.d. sub-Gaussian observation with parameters (Au + a, M(v)), (u, v) ∈ U × V, is within a moderate factor of the upper bound Opt_K on the ǫ-risk, taken w.r.t. the same data, of the affine estimate ĝ_∗(·) yielded by an optimal solution to (3.41); namely,
 Opt_K ≤ [2 ln(2/ǫ) / ln(1/(4ǫ))] Risk^opt_ǫ(K).

3.3.4.3 Numerical illustration
The numerical illustration we are about to discuss models the situation in which we want to recover a linear form of a signal x known to belong to a given convex compact subset X via indirect observations Ax affected by sub-Gaussian "relative noise," meaning that the variance of the observation is larger the larger the signal is. Specifically, our observation is ω ∼ SG(Ax, M(x)), where
 x ∈ X = {x ∈ R^n : 0 ≤ x_j ≤ j^{−α}, 1 ≤ j ≤ n},  M(x) = σ² Σ_{j=1}^n x_j Θ_j,    (3.44)
where A ∈ R^{d×n} and Θ_j ∈ S^d_+, j = 1, ..., n, are given matrices; the linear form to be estimated is G(x) = g^Tx. The entities g, A, and {Θ_j}_{j=1}^n and the reals α ≥ 0 ("degree of smoothness") and σ > 0 ("noise intensity") are parameters of the estimation problem we intend to process. The parameters g, A, Θ_j are as follows:
• g ≥ 0 is selected at random and then normalized to have
 max_{x∈X} g^Tx = max_{x,y∈X} g^T[x − y] = 2;
• we deal with the case of n > d ("deficient observations"); the d nonzero singular values of A were set to θ^{−(i−1)/(d−1)}, i = 1, ..., d, where the "condition number" θ ≥ 1 is a parameter; the orthonormal systems U and V of the first d left and, respectively, right singular vectors of A were drawn at random from rotationally invariant distributions;
• the positive semidefinite d × d matrices Θ_j are orthogonal projectors on randomly selected subspaces of R^d of dimension ⌊d/2⌋;
• in all our experiments, we consider the single-observation case K = 1 and use υ(·) ≡ 0.
Note that X possesses a ≥-largest point x̄, whence M(x) ⪯ M(x̄) whenever x ∈ X; as a result, sub-Gaussian distributions with matrix parameter M(x), x ∈ X, can also be thought of as having matrix parameter M(x̄). One of the goals of the experiment is to understand how much we might lose were we to replace M(·) with the conservative M̂(x) ≡ M(x̄), that is, were we to ignore the fact that small signals result
in low-noise observations. In our experiment we use d = 32, m = 48, α = 2, θ = 2, and σ = 0.01. With these parameters, we generated at random, as described above, 10 collections {g, A, Θ_j, j ≤ d}, thus arriving at 10 estimation problems. For each problem, we apply the outlined machinery to build an estimate of g^Tx affine in ω, as yielded by the optimal solution to (3.39), and compute the upper bound Opt on the (ǫ = 0.01)-risk of this estimate. In fact, for each problem, we build two estimates and two risk bounds: the first for the problem "as is," and the second for the aforementioned "direct product envelope" of the problem, where the mapping x ↦ M(x) is replaced with the conservative x ↦ M̂(x) := M(x̄). The results are as follows:

        min    median  mean   max
        0.138  0.190   0.212  0.299
        0.150  0.210   0.227  0.320

Upper bounds on 0.01-risk, data over 10 estimation problems [d = 32, m = 48, α = 2, θ = 2, σ = 0.01]. First row: ω ∼ SG(Ax, M(x)); second row: ω ∼ SG(Ax, M(x̄)).
Note the significant "noise amplification" in the estimate (about 20 times the observation noise level σ) and the high risk variability across the experiments. Seemingly, both phenomena stem from the fact that we have highly deficient observations (n/d = 1.5) combined with a random orientation of the 16-dimensional kernel of A.
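For reference, random test matrices A with the singular-value profile described above (d nonzero singular values θ^{−(i−1)/(d−1)} and rotationally invariant orthonormal singular systems) can be drawn along the following lines (a sketch; the exact sampler used in the experiment is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_A(d, n, theta):
    """d x n matrix (n > d) whose d nonzero singular values are
    theta**(-(i-1)/(d-1)), i = 1..d, with random orthonormal left/right
    singular systems obtained via QR of Gaussian matrices."""
    sv = theta ** (-np.arange(d) / (d - 1))
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # left system
    V, _ = np.linalg.qr(rng.standard_normal((n, d)))   # right system
    return U @ np.diag(sv) @ V.T
```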
3.4 ESTIMATING QUADRATIC FORMS VIA QUADRATIC LIFTING
In the situation of Section 3.3.1, passing from the "original" observations (3.25) to their quadratic lifting, we can use the machinery just developed to estimate quadratic, rather than linear, forms of the underlying parameters. We investigate the related possibilities in the cases of Gaussian and sub-Gaussian observations. The results of this section form an essential extension of the results of [39, 81], where a similar approach to estimating quadratic functionals of the mean of a Gaussian vector was used.

3.4.1 Estimating quadratic forms, Gaussian case

3.4.1.1 Preliminaries

Consider the situation where we are given
• a nonempty bounded set U in R^m;
• a nonempty convex compact subset V of the positive semidefinite cone S^d_+;
• a matrix Θ_∗ ≻ 0 such that Θ_∗ ⪰ Θ for all Θ ∈ V;
• an affine mapping u ↦ A[u; 1] : R^m → Ω = R^d, where A is a given d × (m + 1) matrix;
• a convex continuous function ̺(·) on S^{m+1}_+.
A pair (u ∈ U, Θ ∈ V) specifies the Gaussian random vector ζ ∼ N(A[u; 1], Θ) and thus
specifies the probability distribution P[u, Θ] of (ζ, ζζ^T). Let Q(U, V) be the family of probability distributions on Ω = R^d × S^d stemming in this way from Gaussian distributions with parameters from U × V. Our goal is to cover the family Q(U, V) by a family of the type S[N, M, Φ]. It is convenient to represent a linear form on Ω = R^d × S^d as h^Tz + ½Tr(HZ), where (h, H) ∈ R^d × S^d is the "vector of coefficients" of the form, and (z, Z) ∈ R^d × S^d is the argument of the form. We assume that for some δ ∈ [0, 2] it holds
 ‖Θ^{1/2}Θ_∗^{−1/2} − I_d‖ ≤ δ ∀Θ ∈ V,    (3.45)
where ‖·‖ is the spectral norm (cf. (2.129)). Finally, we set
 b = [0; ...; 0; 1] ∈ R^{m+1},  B = [A; b^T],
and
 Z^+ = {W ∈ S^{m+1}_+ : W_{m+1,m+1} = 1}.
The statement below is nothing but a straightforward reformulation of Proposition 2.43.i:

Proposition 3.11. In the just described situation, let us select γ ∈ (0, 1) and set
 H = H_γ := {(h, H) ∈ R^d × S^d : −γΘ_∗^{−1} ⪯ H ⪯ γΘ_∗^{−1}},
 M^+ = V × Z^+,
 Φ(h, H; Θ, Z) = −½ ln Det(I − Θ_∗^{1/2}HΘ_∗^{1/2}) + ½Tr([Θ − Θ_∗]H)
   + [δ(2 + δ)/(2(1 − ‖Θ_∗^{1/2}HΘ_∗^{1/2}‖))] ‖Θ_∗^{1/2}HΘ_∗^{1/2}‖²_F + Γ(h, H; Z) : H × M^+ → R,
where ‖·‖ is the spectral norm, ‖·‖_F is the Frobenius norm, and
 Γ(h, H; Z) = ½ Tr(Z[bh^TA + A^Thb^T + A^THA + B^T[H, h]^T[Θ_∗^{−1} − H]^{−1}[H, h]B])
            = ½ Tr(Z B^T ([H, h; h^T, 0] + [H, h]^T[Θ_∗^{−1} − H]^{−1}[H, h]) B).
Then H, M^+, Φ is a regular data, and for every (u, Θ) ∈ R^m × V it holds
 ∀(h, H) ∈ H : ln E_{ζ∼N(A[u;1],Θ)} {e^{h^Tζ + ½ζ^THζ}} ≤ Φ(h, H; Θ, [u; 1][u; 1]^T).
Besides this, the function Φ(h, H; Θ, Z) is coercive in the convex argument: whenever (Θ, Z) ∈ M^+ and (h_i, H_i) ∈ H with ‖(h_i, H_i)‖ → ∞ as i → ∞, we have Φ(h_i, H_i; Θ, Z) → ∞, i → ∞.
Estimating quadratic form: Situation and goal
Let us assume that we are given a sample ζ^K = (ζ_1, ..., ζ_K) of identically distributed observations

    ζ_i ∼ N(A[u; 1], M(v)), 1 ≤ i ≤ K,    (3.46)

independent across i, where

• (u, v) is an unknown "signal" known to belong to a given set U × V, where
  – U ⊂ R^m is a compact set, and
  – V ⊂ R^k is a compact convex set;
• A is a given d × (m+1) matrix, and v ↦ M(v) : R^k → S^d is an affine mapping such that M(v) ⪰ 0 whenever v ∈ V.

We are also given a convex calibrating function ̺(Z) : S^{m+1}_+ → R and a "functional of interest"

    F(u, v) = [u; 1]^T Q [u; 1] + q^T v,    (3.47)

where Q and q are a known (m+1) × (m+1) symmetric matrix and a known k-dimensional vector, respectively. Our goal is to estimate the value F(u, v) for unknown (u, v) known to belong to U × V. Given a tolerance ǫ ∈ (0, 1), we quantify the quality of a candidate estimate ĝ(ζ^K) of F(u, v) by the smallest ρ such that for all (u, v) ∈ U × V it holds

    Prob_{ζ^K ∼ [N(A[u;1],M(v))]^K} { |ĝ(ζ^K) − F(u, v)| > ρ + ̺([u; 1][u; 1]^T) } ≤ ǫ.

3.4.1.3 Construction and result
Let

    V = {M(v) : v ∈ V},

so that V is a convex compact subset of the positive semidefinite cone S^d_+. Let us select some

1. matrix Θ_* ≻ 0 such that Θ_* ⪰ Θ for all Θ ∈ V;
2. convex compact subset Z of the set Z^+ = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} such that [u; 1][u; 1]^T ∈ Z for all u ∈ U;
3. real γ ∈ (0, 1) and a nonnegative real δ such that (3.45) takes place.

We further set (cf. Proposition 3.11)

    B = [A; [0, ..., 0, 1]] ∈ R^{(d+1)×(m+1)},
    H = H_γ := {(h, H) ∈ R^d × S^d : −γΘ_*^{−1} ⪯ H ⪯ γΘ_*^{−1}},
    M = V × Z,
    Φ(h, H; Θ, Z) = −½ ln Det(I − Θ_*^{1/2} H Θ_*^{1/2}) + ½ Tr([Θ − Θ_*]H)
                    + [δ(2+δ) / (2(1 − ‖Θ_*^{1/2} H Θ_*^{1/2}‖))] ‖Θ_*^{1/2} H Θ_*^{1/2}‖_F² + Γ(h, H; Z) : H × M → R,    (3.48)

where

    Γ(h, H; Z) = ½ Tr( Z [ b h^T A + A^T h b^T + A^T H A + B^T [H, h]^T [Θ_*^{−1} − H]^{−1} [H, h] B ] )
               = ½ Tr( Z B^T [ [H, h; h^T, 0] + [H, h]^T [Θ_*^{−1} − H]^{−1} [H, h] ] B ),

and treat as observation the quadratic lifting of observation (3.46), that is, our observation is

    ω^K = {ω_i = (ζ_i, ζ_i ζ_i^T)}_{i=1}^K, with independent ζ_i ∼ N(A[u; 1], M(v)).    (3.49)
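In code, the lifting (3.49) is mechanical; a minimal numpy sketch (the sizes and the Gaussian stand-ins for the ζ_i are illustrative, not the book's data):

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 5, 8
zetas = rng.standard_normal((K, d))     # stand-ins for the observations (3.46)

# quadratic lifting (3.49): omega_i = (zeta_i, zeta_i zeta_i^T)
omegas = [(z, np.outer(z, z)) for z in zetas]

for z, Z in omegas:
    assert np.allclose(Z, Z.T)                 # the matrix part lands in S^d
    assert np.isclose(np.trace(Z), z @ z)      # Tr(zz^T) = ||z||_2^2
    assert np.linalg.matrix_rank(Z) == 1       # rank one by construction
```

The point of the lift is that a functional linear in ω = (ζ, ζζ^T), namely h^T ζ + ½ Tr(H ζζ^T), is exactly a quadratic form in ζ.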
Note that by Proposition 3.11 the function Φ(h, H; Θ, Z) : H × M → R is a continuous convex-concave function which is coercive in the convex argument and satisfies

    ∀(u ∈ U, v ∈ V, (h, H) ∈ H) : ln E_{ζ∼N(A[u;1],M(v))} { e^{½ ζ^T H ζ + h^T ζ} } ≤ Φ(h, H; M(v), [u; 1][u; 1]^T).    (3.50)
We are about to demonstrate that when estimating the functional of interest (3.47) at a point (u, v) ∈ U × V via observation (3.49), we are in the situation considered in Section 3.3 and can utilize the corresponding machinery. Indeed, let us specify the data introduced in Section 3.3.1 as follows:

• H = {f = (h, H) ∈ H} ⊂ E_H = R^d × S^d, with H defined in (3.48), and the inner product on E_H defined as

    ⟨(h, H), (h′, H′)⟩ = h^T h′ + ½ Tr(HH′);

• E_M = S^d × S^{m+1}, and M, Φ defined as in (3.48);
• E_X = R^k × S^{m+1}, X = V × Z;
• A(x = (v, Z)) = (M(v), Z); note that A is an affine mapping from E_X into E_M which maps X into M, as required in Section 3.3.1. Observe that when u ∈ U and v ∈ V, the common distribution P = P_{u,v} of the i.i.d. observations ω_i defined by (3.49) satisfies the relation

    ∀(f = (h, H) ∈ H) : ln E_{ω∼P} { e^{⟨f,ω⟩} } = ln E_{ζ∼N(A[u;1],M(v))} { e^{h^T ζ + ½ ζ^T H ζ} } ≤ Φ(h, H; M(v), [u; 1][u; 1]^T);    (3.51)

see (3.50);
• υ(x = (v, Z)) = ̺(Z) : X → R;
• we define the affine functional G(x) on E_X by the relation ⟨g, x := (v, Z)⟩ = q^T v + Tr(QZ); see (3.47). As a result, for x = (v, [u; 1][u; 1]^T) with v ∈ V and u ∈ U we have F(u, v) = G(x).

Applying Corollary 3.8 to the data just specified (which is legitimate, because our Φ clearly satisfies (3.30)), we arrive at the following result:

Proposition 3.12. In the situation just described, let us set
    Ψ̂_{+,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ(h/α, H/α; M(v), Z) − G(v, Z) − ̺(Z) + K^{−1} α ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} },

    Ψ̂_{−,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ(−h/α, −H/α; M(v), Z) + G(v, Z) − ̺(Z) + K^{−1} α ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} },    (3.52)
so that the functions Ψ̂_{±,K}(h, H) : R^d × S^d → R are convex. Furthermore, whenever h̄, H̄, ρ̄, κ̄ form a feasible solution to the system of convex constraints

    Ψ̂_{+,K}(h, H) ≤ ρ − κ,  Ψ̂_{−,K}(h, H) ≤ ρ + κ    (3.53)

in variables (h, H) ∈ R^d × S^d, ρ ∈ R, κ ∈ R, setting

    ĝ(ζ^K := (ζ_1, ..., ζ_K)) = (1/K) Σ_{i=1}^K [ h̄^T ζ_i + ½ ζ_i^T H̄ ζ_i ] + κ̄,    (3.54)

we get an estimate of the functional of interest F(u, v) = [u; 1]^T Q [u; 1] + q^T v via K independent observations ζ_i ∼ N(A[u; 1], M(v)), i = 1, ..., K, with the following property:

    ∀(u, v) ∈ U × V : Prob_{ζ^K ∼ [N(A[u;1],M(v))]^K} { |F(u, v) − ĝ(ζ^K)| > ρ̄ + ̺([u; 1][u; 1]^T) } ≤ ǫ.    (3.55)
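Before turning to the proof, note that once a feasible (h̄, H̄, κ̄) has been computed offline, evaluating estimate (3.54) on data is a mechanical averaging; a minimal sketch (the names and toy data are illustrative):

```python
import numpy as np

def g_hat(zetas: np.ndarray, h: np.ndarray, H: np.ndarray, kappa: float) -> float:
    # estimate (3.54): (1/K) sum_i [ h^T zeta_i + (1/2) zeta_i^T H zeta_i ] + kappa
    lin = zetas @ h
    quad = 0.5 * np.einsum("ki,ij,kj->k", zetas, H, zetas)
    return float((lin + quad).mean() + kappa)

# toy check with h = 0, H = I, kappa = 0: the estimate is the average of ||zeta_i||^2 / 2
zetas = np.array([[1.0, 0.0], [0.0, 2.0]])
assert g_hat(zetas, np.zeros(2), np.eye(2), 0.0) == 1.25
```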
Proof. Under the premise of the proposition, let us fix u ∈ U, v ∈ V, so that x := (v, Z := [u; 1][u; 1]^T) ∈ X. Denoting, as above, by P = P_{u,v} the distribution of ω := (ζ, ζζ^T) with ζ ∼ N(A[u; 1], M(v)), and invoking (3.51), we see that for the (x, P) just defined, relation (3.26) takes place. Applying Corollary 3.8, we conclude that

    Prob_{ζ^K ∼ [N(A[u;1],M(v))]^K} { |ĝ(ζ^K) − G(x)| > ρ̄ + ̺([u; 1][u; 1]^T) } ≤ ǫ.
It remains to note that by construction, for the x = (v, Z) in question it holds

    G(x) = q^T v + Tr(QZ) = q^T v + Tr(Q[u; 1][u; 1]^T) = q^T v + [u; 1]^T Q [u; 1] = F(u, v).  ✷

An immediate consequence of Proposition 3.12 is as follows:

Corollary 3.13. Under the premise and in the notation of Proposition 3.12, let (h, H) ∈ R^d × S^d. Setting

    ρ = ½ [ Ψ̂_{+,K}(h, H) + Ψ̂_{−,K}(h, H) ],
    κ = ½ [ Ψ̂_{−,K}(h, H) − Ψ̂_{+,K}(h, H) ],    (3.56)

the ǫ-risk of estimate (3.54) does not exceed ρ.
Indeed, with ρ and κ given by (3.56), h, H, ρ, κ satisfy (3.53).

3.4.1.4 Consistency
We are about to present a simple sufficient condition for the estimator defined in Proposition 3.12 to be consistent in the sense of Section 3.3.4.1. Specifically, in the situation and with the notation of Sections 3.4.1.1 and 3.4.1.3, assume that

A.1. ̺(·) ≡ 0;
A.2. V = {v̄} is a singleton and M(v̄) ≻ 0, which allows us to set Θ_* = M(v̄), to satisfy (3.45) with δ = 0, and to assume w.l.o.g. that F(u, v) = [u; 1]^T Q [u; 1], G(Z) = Tr(QZ);
A.3. the first m columns of the d × (m+1) matrix A are linearly independent.

By A.3, the columns of the (d+1) × (m+1) matrix B (see (3.48)) are linearly independent, so that we can find an (m+1) × (d+1) matrix C such that CB = I_{m+1}. Let us define (h̄, H̄) ∈ R^d × S^d from the relation

    [H̄, h̄; h̄^T, 0] = 2 (C^T Q C)^o,    (3.57)

where for a (d+1) × (d+1) matrix S, S^o is the matrix obtained from S by zeroing out the entry in cell (d+1, d+1). The consistency of our estimation machinery is given by the following simple statement:

Proposition 3.14. In the situation just described and under assumptions A.1–3, given ǫ ∈ (0, 1), consider the estimate

    ĝ_K(ζ^K) = (1/K) Σ_{k=1}^K [ h̄^T ζ_k + ½ ζ_k^T H̄ ζ_k ] + κ_K,

where

    κ_K = ½ [ Ψ̂_{−,K}(h̄, H̄) − Ψ̂_{+,K}(h̄, H̄) ]

and Ψ̂_{±,K} are given by (3.52). Then the ǫ-risk of ĝ_K(·) goes to 0 as K → ∞.
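The quantities Ψ̂_{±,K}(h̄, H̄) entering κ_K are defined in (3.52) through a one-dimensional convex minimization over α of a perspective-type function. The toy sketch below illustrates only that inner α-minimization pattern, with a hypothetical scalar stand-in φ(x) = x² in place of Φ; it is not the book's construction:

```python
import math

def inner_inf_alpha(h: float, c: float) -> float:
    # minimize over alpha > 0:  alpha * phi(h / alpha) + c * alpha,
    # with phi(x) = x^2 as a hypothetical stand-in for Phi; the perspective
    # alpha * phi(h / alpha) is jointly convex, so the objective is convex in
    # alpha and ternary search suffices
    phi = lambda x: x * x
    f = lambda a: a * phi(h / a) + c * a
    lo, hi = 1e-8, 1e4
    for _ in range(300):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f(0.5 * (lo + hi))

# for phi(x) = x^2 the minimum of h^2/alpha + c*alpha is 2|h|sqrt(c), at alpha = |h|/sqrt(c)
c = math.log(2 / 0.05)
assert abs(inner_inf_alpha(2.0, c) - 2 * 2.0 * math.sqrt(c)) < 1e-4
```

(The actual (3.52) additionally restricts α through the matrix inequalities −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1}, which lower-bound α.)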
For proof, see Section 3.6.4.

3.4.1.5 A modification

In the situation described at the beginning of Section 3.4.1.2, let a set W ⊂ U × V be given, and assume we are interested in estimating the value of F(u, v), as defined in (3.47), at points (u, v) ∈ W only. When reducing the "domain of interest" U × V to W, we can hopefully reduce the attainable ǫ-risk of recovery. Let us assume that we can point out a convex compact set 𝒲 ⊂ V × Z such that

    (u, v) ∈ W ⇒ (v, [u; 1][u; 1]^T) ∈ 𝒲.

A straightforward inspection justifies the following:

Remark 3.15. In the situation just described, the conclusion of Proposition 3.12 remains valid when the set U × V participating in (3.55) is reduced to W, and the set V × Z participating in relations (3.52) is reduced to 𝒲. This modification enlarges the feasible set of (3.53) and thus reduces the risk bound ρ̄.

3.4.2 Estimating quadratic form, sub-Gaussian case

3.4.2.1 Situation
In the rest of this section we are interested in the following situation: we are given K i.i.d. observations

    ζ_i ∼ SG(A[u; 1], M(v)), i = 1, ..., K    (3.58)

(i.e., the ζ_i are sub-Gaussian random vectors with parameters A[u; 1] ∈ R^d and M(v) ∈ S^d_+), where

• (u, v) is an unknown "signal" known to belong to a given set U × V, where
  – U ⊂ R^m is a compact set, and
  – V ⊂ R^k is a compact convex set;
• A is a given d × (m+1) matrix, and v ↦ M(v) : R^k → S^d is an affine mapping such that M(v) ⪰ 0 whenever v ∈ V.

We are also given a convex calibrating function ̺(Z) : S^{m+1}_+ → R and a "functional of interest"

    F(u, v) = [u; 1]^T Q [u; 1] + q^T v,    (3.59)

where Q and q are a known (m+1) × (m+1) symmetric matrix and a known k-dimensional vector, respectively. Our goal is to recover F(u, v), for unknown (u, v) known to belong to U × V, via observation (3.58). Note that the only difference between our present setting and the one considered in Section 3.4.1.1 is that now we allow for sub-Gaussian, and not necessarily Gaussian, observations.

3.4.2.2 Construction and result
Let

    V = {M(v) : v ∈ V},

so that V is a convex compact subset of the positive semidefinite cone S^d_+. Let us select some

1. matrix Θ_* ≻ 0 such that Θ_* ⪰ Θ for all Θ ∈ V;
2. convex compact subset Z of the set Z^+ = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} such that [u; 1][u; 1]^T ∈ Z for all u ∈ U;
3. reals γ, γ^+ ∈ (0, 1) with γ < γ^+ (say, γ = 0.99, γ^+ = 0.999).

Preliminaries. Given the data of the above description and δ ∈ [0, 2], we set (cf. Proposition 3.11)

    H = H_γ := {(h, H) ∈ R^d × S^d : −γΘ_*^{−1} ⪯ H ⪯ γΘ_*^{−1}},
    B = [A; [0, ..., 0, 1]] ∈ R^{(d+1)×(m+1)},
    M = V × Z,
    Ψ(h, H, G; Z) = −½ ln Det(I − Θ_*^{1/2} G Θ_*^{1/2})
                    + ½ Tr( Z B^T [ [H, h; h^T, 0] + [H, h]^T [Θ_*^{−1} − G]^{−1} [H, h] ] B ) :
                    H × {G : 0 ⪯ G ⪯ γ^+ Θ_*^{−1}} × Z → R,    (3.60)
    Ψ_δ(h, H, G; Θ, Z) = −½ ln Det(I − Θ_*^{1/2} G Θ_*^{1/2}) + ½ Tr([Θ − Θ_*]G)
                    + [δ(2+δ) / (2(1 − ‖Θ_*^{1/2} G Θ_*^{1/2}‖))] ‖Θ_*^{1/2} G Θ_*^{1/2}‖_F²
                    + ½ Tr( Z B^T [ [H, h; h^T, 0] + [H, h]^T [Θ_*^{−1} − G]^{−1} [H, h] ] B ) :
                    H × {G : 0 ⪯ G ⪯ γ^+ Θ_*^{−1}} × ({Θ : 0 ⪯ Θ ⪯ Θ_*} × Z) → R,
    Φ(h, H; Z) = min_G { Ψ(h, H, G; Z) : 0 ⪯ G ⪯ γ^+ Θ_*^{−1}, G ⪰ H } : H × Z → R,
    Φ_δ(h, H; Θ, Z) = min_G { Ψ_δ(h, H, G; Θ, Z) : 0 ⪯ G ⪯ γ^+ Θ_*^{−1}, G ⪰ H } :
                    H × ({Θ : 0 ⪯ Θ ⪯ Θ_*} × Z) → R.
The following statement is a straightforward reformulation of Proposition 2.46.i:

Proposition 3.16. In the situation described in Sections 3.4.2.1 and 3.4.2.2 we have:

(i) Φ is a well-defined real-valued continuous function on the domain H × Z; the function is convex in (h, H) ∈ H, concave in Z ∈ Z, and Φ(0; Z) ≥ 0. Furthermore, let (h, H) ∈ H, u ∈ U, v ∈ V, and let ζ ∼ SG(A[u; 1], M(v)). Then

    ln E_ζ { exp{h^T ζ + ½ ζ^T H ζ} } ≤ Φ(h, H; [u; 1][u; 1]^T).    (3.61)

(ii) Assume that

    ∀Θ ∈ V : ‖Θ^{1/2} Θ_*^{−1/2} − I_d‖ ≤ δ.    (3.62)

Then Φ_δ(h, H; Θ, Z) is a well-defined real-valued continuous function on the domain H × (V × Z); it is convex in (h, H) ∈ H, concave in (Θ, Z) ∈ V × Z, and Φ_δ(0; Θ, Z) ≥ 0. Furthermore, let (h, H) ∈ H, u ∈ U, v ∈ V, and let ζ ∼ SG(A[u; 1], M(v)). Then

    ln E_ζ { exp{h^T ζ + ½ ζ^T H ζ} } ≤ Φ_δ(h, H; M(v), [u; 1][u; 1]^T).    (3.63)

The estimate. Our construction of the estimate is completely similar to the case of Gaussian observations. Specifically, let us pass from observations (3.58) to their quadratic lifts, so that our observations become

    ω_i = (ζ_i, ζ_i ζ_i^T), 1 ≤ i ≤ K, with i.i.d. ζ_i ∼ SG(A[u; 1], M(v)).    (3.64)
As in the Gaussian case, we find ourselves in the situation considered in Section 3.3.3 and can use the corresponding constructions. Indeed, let us specify the data introduced in Section 3.3.1 and participating in the constructions of Section 3.3 as follows:

• H = {f = (h, H) ∈ H} ⊂ E_H = R^d × S^d, with H defined in (3.60), and the inner product on E_H defined as

    ⟨(h, H), (h′, H′)⟩ = h^T h′ + ½ Tr(HH′);

• E_M = S^d × S^{m+1}, and M, Φ defined as in (3.60);
• E_X = R^k × S^{m+1}, X = V × Z;
• A(x = (v, Z)) = (M(v), Z); note that A is an affine mapping from E_X into E_M mapping X into M, as required in Section 3.3. Observe that when u ∈ U and v ∈ V, the common distribution P = P_{u,v} of the i.i.d. observations ω_i defined by (3.64) satisfies the relation

    ∀(f = (h, H) ∈ H) : ln E_{ω∼P} { e^{⟨f,ω⟩} } = ln E_{ζ∼SG(A[u;1],M(v))} { e^{h^T ζ + ½ ζ^T H ζ} } ≤ Φ(h, H; [u; 1][u; 1]^T);    (3.65)

see (3.61). Moreover, in the case of (3.62), we also have

    ∀(f = (h, H) ∈ H) : ln E_{ω∼P} { e^{⟨f,ω⟩} } = ln E_{ζ∼SG(A[u;1],M(v))} { e^{h^T ζ + ½ ζ^T H ζ} } ≤ Φ_δ(h, H; M(v), [u; 1][u; 1]^T);    (3.66)

see (3.63);
• we set υ(x = (v, Z)) = ̺(Z);
• we define the affine functional G(x) on E_X by the relation G(x := (v, Z)) = q^T v + Tr(QZ); see (3.59). As a result, for x = (v, [u; 1][u; 1]^T) with v ∈ V and u ∈ U we have F(u, v) = G(x).

The result. Applying Corollary 3.8 to the data just specified (which is legitimate, because our Φ clearly satisfies (3.30)), we arrive at the following result:

Proposition 3.17. In the situation described in Sections 3.4.2.1 and 3.4.2.2, let us
set

    Ψ̂_{+,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ(h/α, H/α; Z) − G(v, Z) − ̺(Z) + αK^{−1} ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} },

    Ψ̂_{−,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ(−h/α, −H/α; Z) + G(v, Z) − ̺(Z) + αK^{−1} ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} }.    (3.67)
Thus, the functions Ψ̂_{±,K}(h, H) : R^d × S^d → R are convex. Furthermore, whenever h̄, H̄, ρ̄, κ̄ form a feasible solution to the system of convex constraints

    Ψ̂_{+,K}(h, H) ≤ ρ − κ,  Ψ̂_{−,K}(h, H) ≤ ρ + κ    (3.68)

in variables (h, H) ∈ R^d × S^d, ρ ∈ R, κ ∈ R, the estimate

    ĝ(ζ^K) = (1/K) Σ_{i=1}^K [ h̄^T ζ_i + ½ ζ_i^T H̄ ζ_i ] + κ̄

of F(u, v) = [u; 1]^T Q [u; 1] + q^T v via i.i.d. observations ζ_i ∼ SG(A[u; 1], M(v)), 1 ≤ i ≤ K, satisfies, for all (u, v) ∈ U × V,

    Prob_{ζ^K ∼ [SG(A[u;1],M(v))]^K} { |F(u, v) − ĝ(ζ^K)| > ρ̄ + ̺([u; 1][u; 1]^T) } ≤ ǫ.
Proof. Under the premise of the proposition, let us fix u ∈ U, v ∈ V, and let x = (v, Z := [u; 1][u; 1]^T). Denoting by P the distribution of ω := (ζ, ζζ^T) with ζ ∼ SG(A[u; 1], M(v)), and invoking (3.65), we see that for the (x, P) just defined, relation (3.26) takes place. Applying Corollary 3.8, we conclude that

    Prob_{ζ^K ∼ [SG(A[u;1],M(v))]^K} { |ĝ(ζ^K) − G(x)| > ρ̄ + ̺([u; 1][u; 1]^T) } ≤ ǫ.

It remains to note that by construction, for the x = (v, Z) in question it holds G(x) = q^T v + Tr(QZ) = q^T v + [u; 1]^T Q [u; 1] = F(u, v).  ✷
Remark 3.18. In the situation described in Sections 3.4.2.1 and 3.4.2.2, let δ ∈ [0, 2] be such that

    ‖Θ^{1/2} Θ_*^{−1/2} − I_d‖ ≤ δ  ∀Θ ∈ V.

Then the conclusion of Proposition 3.17 remains valid when the function Φ in (3.67)
is replaced with the function Φ_δ, that is, when Ψ̂_{±,K} are defined as

    Ψ̂_{+,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ_δ(h/α, H/α; M(v), Z) − G(v, Z) − ̺(Z) + αK^{−1} ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} },

    Ψ̂_{−,K}(h, H) := inf_α { max_{(v,Z)∈V×Z} [ αΦ_δ(−h/α, −H/α; M(v), Z) + G(v, Z) − ̺(Z) + αK^{−1} ln(2/ǫ) ] :
                              α > 0, −γαΘ_*^{−1} ⪯ H ⪯ γαΘ_*^{−1} }.
To justify Remark 3.18, it suffices to replace relation (3.65) in the proof of Proposition 3.17 with (3.66). Note that which of the two is better in terms of the risk of the resulting estimate—Proposition 3.17 "as is" or its modification presented in Remark 3.18—depends on the situation, so it makes sense to keep both options in mind.

3.4.2.3 Numerical illustration, direct observations
The problem. Our initial illustration is deliberately selected to be extremely simple: given direct noisy observations

    ζ = u + ξ

of an unknown signal u ∈ R^m known to belong to a given set U, we want to recover the "energy" u^T u of u. We are interested in an estimate of u^T u quadratic in ζ with as small as possible an ǫ-risk on U; here ǫ ∈ (0, 1) is a given design parameter. The details of our setup are as follows:

• U is the "spherical layer" U = {u ∈ R^m : r² ≤ u^T u ≤ R²}, where r and R, 0 ≤ r < R < ∞, are given. As a result, the "main ingredient" of the constructions from Sections 3.4.1.3 and 3.4.2.2—the convex compact subset Z of the set {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} containing all matrices [u; 1][u; 1]^T, u ∈ U—can be specified as

    Z = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1, 1 + r² ≤ Tr(Z) ≤ 1 + R²};

• ξ is either ∼ N(0, Θ) (Gaussian case) or ∼ SG(0, Θ) (sub-Gaussian case), with the matrix Θ known to be diagonal with all diagonal entries equal to each other and satisfying θσ² ≤ Θ_ii ≤ σ², 1 ≤ i ≤ d = m, with known θ ∈ [0, 1] and σ² > 0;
• the calibrating function is ̺(Z) = ς(Σ_{i=1}^m Z_ii), where ς is a convex continuous real-valued function on R_+.

Note that with this selection, the claim that the ǫ-risk of an estimate ĝ(·) is ≤ ρ means that whenever u ∈ U, one has

    Prob{ |ĝ(u + ξ) − u^T u| > ρ + ς(u^T u) } ≤ ǫ.    (3.69)
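To see what the optimized pair (η, κ) in this construction competes against, it may help to recall the naive unbiased baseline: in the Gaussian case E[ζ^T ζ] = ‖u‖₂² + mσ², so ζ^T ζ − mσ² is unbiased for the energy. A small simulation (the signal and sizes are illustrative, not from the book's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 64, 1.0
u = rng.standard_normal(m)
u *= 8.0 / np.linalg.norm(u)             # a signal with u^T u = 64

# E[zeta^T zeta] = ||u||^2 + m * sigma^2, so subtracting m*sigma^2 removes the bias;
# the chi-square fluctuation around this mean is what the eta, kappa of (3.70),
# optimized over the prior range r <= ||u||_2 <= R, trade off against
zeta = u[None, :] + sigma * rng.standard_normal((100_000, m))
naive = (zeta ** 2).sum(axis=1) - m * sigma ** 2
assert abs(naive.mean() - u @ u) < 0.5   # close to 64 up to Monte Carlo error
```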
Processing the problem. It is easily seen that in the situation in question the apparatus in Sections 3.4.1 and 3.4.2 translates into the following:
1. We lose nothing by restricting ourselves to estimates of the form

    ĝ(ζ) = ½ η ζ^T ζ + κ,    (3.70)

with properly selected scalars η and κ.
2. In the Gaussian case, η and κ are yielded by the convex optimization problem with only three variables α_+, α_−, and η, namely the problem

    min_{α_±, η} { Ψ̂(α_+, α_−, η) = ½ [ Ψ̂_+(α_+, η) + Ψ̂_−(α_−, η) ] : σ²η < α_+, −σ²η < α_− },    (3.71)

where

    Ψ̂_+(α_+, η) = −(dα_+/2) ln(1 − σ²η/α_+) + (d/2) σ²(1−θ) max[−η, 0] + dδ(2+δ)σ⁴η² / (2(α_+ − σ²η))
                   + max_{r² ≤ t ≤ R²} [ ( α_+η / (2(α_+ − σ²η)) − 1 ) t − ς(t) ] + α_+ ln(2/ǫ),
    Ψ̂_−(α_−, η) = −(dα_−/2) ln(1 + σ²η/α_−) + (d/2) σ²(1−θ) max[η, 0] + dδ(2+δ)σ⁴η² / (2(α_− + σ²η))
                   + max_{r² ≤ t ≤ R²} [ ( −α_−η / (2(α_− + σ²η)) + 1 ) t − ς(t) ] + α_− ln(2/ǫ),

with δ = 1 − √θ. Now, the η-component of a feasible solution to (3.71), augmented by the quantity

    κ = ½ [ Ψ̂_−(α_−, η) − Ψ̂_+(α_+, η) ],

yields estimate (3.70) with ǫ-risk on U not exceeding Ψ̂(α_+, α_−, η).
3. In the sub-Gaussian case, η and κ are yielded by the convex optimization problem with five variables α_±, g_±, and η, namely the problem

    min_{α_±, g_±, η} { Ψ̂(α_±, g_±, η) = ½ [ Ψ̂_+(α_+, g_+, η) + Ψ̂_−(α_−, g_−, η) ] :
                        0 ≤ σ²g_± < α_±, −α_+ < σ²η < α_−, η ≤ g_+, −η ≤ g_− },    (3.72)

where

    Ψ̂_+(α_+, g_+, η) = −(dα_+/2) ln(1 − σ²g_+/α_+) + α_+ ln(2/ǫ)
                        + max_{r² ≤ t ≤ R²} [ ( σ²η² / (2(α_+ − σ²g_+)) + ½η − 1 ) t − ς(t) ],
    Ψ̂_−(α_−, g_−, η) = −(dα_−/2) ln(1 − σ²g_−/α_−) + α_− ln(2/ǫ)
                        + max_{r² ≤ t ≤ R²} [ ( σ²η² / (2(α_− − σ²g_−)) − ½η + 1 ) t − ς(t) ].

The η-component of a feasible solution to (3.72), augmented by the quantity

    κ = ½ [ Ψ̂_−(α_−, g_−, η) − Ψ̂_+(α_+, g_+, η) ],

yields estimate (3.70) with ǫ-risk on U not exceeding Ψ̂(α_±, g_±, η).
Note that the Gaussian case of our “energy estimation” problem is well studied in the literature (see, among others, [19, 43, 81, 87, 90, 97, 120, 124, 147, 160]), mainly in the case ξ ∼ N (0, σ 2 Im ) of white Gaussian noise with exactly known variance σ 2 . Available results investigate analytically the interplay between the dimension m of signal, noise intensity σ 2 and the parameters R, r and offer estimates which are provably optimal, up to absolute constant factors. A nice property of the proposed
    d     r   R     θ    Relative 0.01-risk,  Relative 0.01-risk,    Optimality
                         Gaussian case        sub-Gaussian case      ratio
    64    0   16    1    0.34808              0.44469                1.22
    64    0   16    0.5  0.43313              0.44469                1.48
    64    0   128   1    0.04962              0.05181                1.28
    64    0   128   0.5  0.05064              0.05181                1.34
    64    8   80    1    0.07827              0.08376                1.28
    64    8   80    0.5  0.08095              0.08376                1.34
    256   0   32    1    0.19503              0.30457                1.28
    256   0   32    0.5  0.26813              0.30457                1.41
    256   0   512   1    0.01264              0.01314                1.28
    256   0   512   0.5  0.01289              0.01314                1.34
    256   16  160   1    0.03996              0.04501                1.28
    256   16  160   0.5  0.04255              0.04501                1.34
    1024  0   64    1    0.10272              0.21923                1.28
    1024  0   64    0.5  0.17032              0.21923                1.34
    1024  0   2048  1    0.00317              0.00330                1.28
    1024  0   2048  0.5  0.00324              0.00330                1.34
    1024  32  320   1    0.02019              0.02516                1.28
    1024  32  320   0.5  0.02273              0.02516                1.41

Table 3.3: Estimating the signal energy from direct observations.
approach is that (3.71) automatically takes care of the parameters and results in estimates with seemingly near-optimal performance, as witnessed by the numerical experiments we are about to present.

Numerical results. In the first series of experiments we use the trivial calibrating function ς(·) ≡ 0. A typical sample of numerical results is presented in Table 3.3. To avoid large numbers, we display in the table the relative 0.01-risk of the estimates, that is, the plain risk as given by (3.71) divided by R²; keeping this in mind, one will not be surprised that when extending the range [r, R] of allowed norms of the observed signal, all other components of the setup being fixed, the relative risk can decrease (the actual risk, of course, can only increase). Note that in all our experiments σ is set to 1. Along with the values of the relative 0.01-risk, we also present the values of "optimality ratios"—the ratios of the upper risk bounds given by (3.71) in the Gaussian case to (lower bounds on) the best 0.01-risks Risk*_{0.01} possible under the circumstances, defined as the infimum of the 0.01-risk over all estimates recovering ‖u‖₂² via a single observation ω = u + ξ. These lower bounds are obtained as follows. Let us select some values r_1 < r_2 in the allowed range [r, R] of ‖u‖₂, along with two values σ_1², σ_2² in the allowed range [θσ², σ²] = [θ, 1] of the diagonal entries of the diagonal matrices Θ, and consider two distributions of observations, P_1 and P_2, as follows: P_χ is the distribution of the random vector x + ξ, where x and ξ are independent, x is uniformly distributed on the sphere ‖x‖₂ = r_χ, and ξ ∼ N(0, σ_χ² I_d).
It is immediately seen that whenever the two simple hypotheses ω ∼ P_1 and ω ∼ P_2 cannot be decided upon via a single observation by a test with total risk (the sum, over the two hypotheses in question, of the probabilities for the test to reject the hypothesis when it is true) ≤ 2ǫ, the quantity δ = ½(r_2² − r_1²) is a lower bound on the optimal ǫ-risk Risk*_ǫ. In other words, denoting by p_χ(·) the density
of P_χ, we have: if 0.02 ≤ ∫ min[p_1(ω), p_2(ω)] dω, then Risk*_{0.01} ≥ ½(r_2² − r_1²).

3.4.2.4 Numerical illustration, indirect observations

The problem. Our second illustration deals with indirect observations

    ζ = Bu + ξ,    (3.74)

where

• B is a given d × m matrix with m > d ("deficient observations"),
• u ∈ R^m is a signal known to belong to a compact set U,
• ξ ∼ N(0, Θ) (Gaussian case) or ξ ∼ SG(0, Θ) (sub-Gaussian case) is the observation noise; Θ is a positive semidefinite d × d matrix known to belong to a given convex compact set V ⊂ S^d_+.

Our goal is to estimate the energy

    F(u) = (1/m) ‖u‖₂²
of the signal given observation (3.74). In our experiment, the data is specified as follows:

1. We think of u ∈ R^m as a discretization of a smooth function x(t) of continuous argument t ∈ [0, 1]: u_i = x(i/m), 1 ≤ i ≤ m. We set U = {u : ‖Su‖₂ ≤ 1}, where u ↦ Su is the finite-difference approximation of the mapping x(·) ↦ (x(0), x′(0), x″(·)), so that U is a natural discrete-time analog of the Sobolev-type ball {x : [x(0)]² + [x′(0)]² + ∫₀¹ [x″(t)]² dt ≤ 1}.
2. The d × m matrix B is of the form U D V^T, where U and V are randomly selected d × d and m × m orthogonal matrices, and the d diagonal entries of the diagonal d × m matrix D are of the form θ^{−(i−1)/(d−1)}, 1 ≤ i ≤ d.
3. The set V of admissible matrices Θ is the set of all diagonal d × d matrices with diagonal entries varying in [0, σ²].

Both σ and θ are components of the experiment setup.

Processing the problem. The described estimation problem is clearly covered by the setups considered in Sections 3.4.1 (Gaussian case) and 3.4.2 (sub-Gaussian case); in terms of these setups, it suffices to specify Θ_* as σ²I_d, M(v) as the identity mapping of V onto itself, the mapping u ↦ A[u; 1] as the mapping u ↦ Bu, and the set Z (which should be a convex compact subset of the set {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} containing all matrices of the form [u; 1][u; 1]^T, u ∈ U) as the set

    Z = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1, Tr( Z Diag{S^T S, 0} ) ≤ 1}.

As suggested by Propositions 3.12 (Gaussian case) and 3.17 (sub-Gaussian case), the estimates of F(u) = (1/m)‖u‖₂², linear in the "lifted observation" ω = (ζ, ζζ^T), stem from an optimal solution (h_*, H_*) to the convex optimization problem

    Opt = min_{h,H} ½ [ Ψ̂_+(h, H) + Ψ̂_−(h, H) ],    (3.75)

with Ψ̂_±(·) given by (3.52) in the Gaussian and by (3.67) in the sub-Gaussian case, with the number K of observations in (3.52) and (3.67) set to 1. The resulting
    d, m     Opt, Gaussian case   Opt, sub-Gaussian case   LwBnd
    8, 12    0.1362 (+65%)        0.1382 (+67%)            0.0825
    16, 24   0.1614 (+53%)        0.1640 (+55%)            0.1058
    32, 48   0.0687 (+46%)        0.0692 (+48%)            0.0469

Table 3.4: Upper bound (Opt) on the 0.01-risk of estimate (3.76), (3.75) vs. lower bound (LwBnd) on the 0.01-risk attainable under the circumstances. In the experiments, σ = 0.025 and θ = 10. Data in parentheses: excess of Opt over LwBnd.
estimate is

    ζ ↦ h_*^T ζ + ½ ζ^T H_* ζ + κ,  κ = ½ [ Ψ̂_−(h_*, H_*) − Ψ̂_+(h_*, H_*) ],    (3.76)
and the ǫ-risk of the estimate is (upper-bounded by) Opt. Problem (3.75) is a well-structured convex-concave saddle point problem and as such is beyond the "immediate scope" of the standard convex programming software toolbox, which is primarily aimed at solving well-structured convex minimization (or maximization) problems. However, applying conic duality, one can easily eliminate the inner maxima over v, Z in (3.52) and (3.67) and end up with a reformulation which can be solved numerically by CVX [108]; this is how we process (3.75) in our experiments.

Numerical results. In the experiments to be reported, we use the trivial calibrating function ̺(·) ≡ 0. We present some typical numerical results in Table 3.4. To qualify the performance of our approach, we present, along with the upper risk bounds for the computed estimates, simple lower bounds on the ǫ-risk. The origin of the lower bounds is as follows. Assume we have at our disposal a signal w ∈ U, and let

    t(w) = ‖Bw‖₂,  ρ = 2σ ErfcInv(ǫ),

where ErfcInv is the inverse error function as defined in (1.26). Setting θ(w) = max[1 − ρ/t(w), 0], observe that w′ := θ(w)w ∈ U and ‖Bw − Bw′‖₂ ≤ ρ, which, due to the origin of ρ, implies that there is no way to decide, via observation Bu + ξ, ξ ∼ N(0, σ²I_d), with risk < ǫ, on the two simple hypotheses u = w and u = w′. As an immediate consequence, the quantity

    φ(w) := ½ [ ‖w‖₂² − ‖w′‖₂² ] = ‖w‖₂² [1 − θ²(w)]/2

is a lower bound on the ǫ-risk, on U, of any estimate of ‖u‖₂². We can now try to maximize the resulting lower risk bound over U, thus arriving at the lower risk bound

    LwBnd = max_{w∈U} { ½ ‖w‖₂² (1 − θ²(w)) }.
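The quantities entering this lower bound are easy to compute numerically. The sketch below assumes (and this is an assumption about the convention of (1.26)) that ErfcInv inverts the standard Gaussian tail probability, and inverts it by bisection:

```python
import math

def gauss_tail(t: float) -> float:
    # Prob{ N(0,1) > t }
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def erfc_inv(eps: float) -> float:
    # bisection for the inverse tail: find t with gauss_tail(t) = eps
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gauss_tail(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phi_lower_bound(norm_w: float, t_w: float, sigma: float, eps: float) -> float:
    # phi(w) = ||w||^2 (1 - theta^2(w)) / 2 with theta(w) = max(1 - rho / t(w), 0)
    rho = 2.0 * sigma * erfc_inv(eps)
    theta = max(1.0 - rho / t_w, 0.0)
    return 0.5 * norm_w ** 2 * (1.0 - theta ** 2)

assert abs(erfc_inv(0.01) - 2.3263) < 1e-3             # upper 1% quantile of N(0,1)
assert phi_lower_bound(1.0, 1e9, 0.025, 0.01) < 1e-6   # huge t(w): theta ~ 1, bound vanishes
```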
On closer inspection, the latter problem is not convex, but this does not prevent building a suboptimal solution to it, and this is how the lower risk bounds in Table 3.4 were built (we omit the details). We see that the ǫ-risks of our estimates are within a moderate factor of the optimal ones. Figure 3.4 shows empirical error distributions of the estimates built in the three experiments reported in Table 3.4. When simulating the observations and estimates, we used N(0, σ²I_d) noise and selected signals in U by maximizing over U randomly selected linear forms. Finally, we note that already with fixed design parameters d, m, θ, and σ we deal with a family of estimation problems rather than with a single problem, the reason being that our U is an ellipsoid with half-axes essentially different from each other. In this situation, attainable risks heavily depend on how the right singular vectors of A are oriented with respect to the directions of the half-axes of U, so that the risks of our estimates vary significantly from instance to instance. Note also that the "sub-Gaussian experiments" were conducted on exactly the same data as the "Gaussian experiments" of the same sizes d and m.

Figure 3.4: Histograms of recovery errors in the experiments (panels: d = 8, m = 12; d = 16, m = 24; d = 32, m = 48; top row: Gaussian case, bottom row: sub-Gaussian case), 1,000 simulations per experiment.
3.5 EXERCISES FOR CHAPTER 3
Exercise 3.1. In the situation of Section 3.3.4, the design of a "good" estimate is reduced to solving the convex optimization problem (3.39). Note that the objective in this problem is, in a sense, "implicit"—the design variable is h, and the objective is obtained from an explicit convex-concave function of h and (x, y) by maximization over (x, y). There exist solvers able to process problems of this type efficiently; however, commonly used off-the-shelf solvers, like cvx, cannot handle them. The goal of the exercise to follow is to reformulate (3.39) as a semidefinite program, thus making it amenable to cvx. On immediate inspection, the situation we are interested in is as follows. We are given

• a nonempty convex compact set X ⊂ R^n along with an affine function M(x) taking values in S^d and such that M(x) ⪰ 0 when x ∈ X, and
• an affine function F(h) : R^d → R^n.
Given γ > 0, this data gives rise to the convex function

    Ψ(h) = max_{x∈X} [ F^T(h) x + γ √(h^T M(x) h) ],

and we want to find a "nice" representation of this function; specifically, we want to represent the inequality τ ≥ Ψ(h) by a bunch of LMIs in variables τ, h, and perhaps additional variables. To achieve our goal, we assume in the sequel that the set X^+ = {(x, M) : x ∈ X, M = M(x)} can be described by a system of linear and semidefinite constraints in variables x, M, and additional variables ξ, namely,

    X^+ = { (x, M) : ∃ξ :
        (a) s_i − a_i^T x − b_i^T ξ − Tr(C_i M) ≥ 0, i ≤ I,
        (b) S − A(x) − B(ξ) − C(M) ⪰ 0,
        (c) M ⪰ 0 }.

Here s_i ∈ R, S ∈ S^N are some constants, and A(·), B(·), C(·) are (homogeneous) linear functions taking values in S^N. We assume that this system of constraints is essentially strictly feasible, meaning that there exists a feasible solution at which the semidefinite constraints (b) and (c) are satisfied strictly (i.e., the left-hand sides of the LMIs are positive definite). Here comes the exercise:

1) Check that Ψ(h) is the optimal value in the semidefinite program

    Ψ(h) = max_{x,M,ξ,t} { F^T(h) x + γt :
        (a) s_i − a_i^T x − b_i^T ξ − Tr(C_i M) ≥ 0, i ≤ I,
        (b) S − A(x) − B(ξ) − C(M) ⪰ 0,
        (c) M ⪰ 0,
        (d) [ h^T M h, t; t, 1 ] ⪰ 0 }.    (P)
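Constraint (d) is the standard 2×2 Schur-type encoding of t² ≤ h^T M h (note that h^T M h is linear in the decision variable M, so (d) is indeed an LMI for fixed h), and maximizing γt therefore yields γ√(h^T M h); a quick numerical sanity check:

```python
import numpy as np

def lmi_d_holds(M: np.ndarray, h: np.ndarray, t: float) -> bool:
    # constraint (d) of (P): [[h^T M h, t], [t, 1]] is positive semidefinite,
    # which for this 2x2 matrix is exactly t^2 <= h^T M h
    q = float(h @ M @ h)
    lmi = np.array([[q, t], [t, 1.0]])
    return float(np.linalg.eigvalsh(lmi).min()) >= -1e-12

M = np.diag([2.0, 3.0])
h = np.array([1.0, 1.0])            # h^T M h = 5
assert lmi_d_holds(M, h, 5 ** 0.5)  # boundary case t = sqrt(5)
assert not lmi_d_holds(M, h, 2.5)   # 2.5^2 = 6.25 > 5: LMI fails
```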
2) Passing from (P) to the semidefinite dual of (P), build an explicit semidefinite representation of Ψ, that is, an explicit system S of LMIs in variables h, τ, and additional variables u such that

    {τ ≥ Ψ(h)} ⇔ {∃u : (τ, h, u) satisfies S}.

Exercise 3.2. Let us consider the following situation. Given an m × n "sensing matrix" A which is stochastic with columns from the probabilistic simplex

    ∆_m = { v ∈ R^m : v ≥ 0, Σ_i v_i = 1 }

and a nonempty closed subset U of ∆_n, we observe an M-element (M > 1) i.i.d. sample ζ^M = (ζ_1, ..., ζ_M) with ζ_k drawn from the discrete distribution Au_*, where u_* is an unknown probabilistic vector ("signal") known to belong to U. We handle the discrete distribution Au, u ∈ ∆_n, as a distribution on the vertices e_1, ..., e_m of ∆_m, so that the possible values of ζ_k are the basic orths e_1, ..., e_m in R^m. Our goal is to
recover the value F(u_*) of a given quadratic form F(u) = u^TQu + 2q^Tu. Observe that for u ∈ Δ_n we have u = [uu^T]1_n, where 1_k is the all-ones vector in R^k. This observation allows us to rewrite F(u) as a homogeneous quadratic form:
\[ F(u)=u^T\bar{Q}u,\qquad \bar{Q}=Q+\big[q1_n^T+1_nq^T\big]. \tag{3.77} \]
The goal of the exercise is to follow the approach developed in Section 3.4.1 for the Gaussian case in order to build an estimate ĝ(ζ^M) of F(u). To this end, consider the following construction. Let
\[ \mathcal{J}_M=\{(i,j): 1\le i<j\le M\},\qquad J_M=\mathrm{Card}(\mathcal{J}_M). \]
For ζ^M = (ζ_1, ..., ζ_M) with ζ_k ∈ {e_1, ..., e_m}, 1 ≤ k ≤ M, let
\[ \omega_{ij}[\zeta^M]=\tfrac{1}{2}\big[\zeta_i\zeta_j^T+\zeta_j\zeta_i^T\big],\quad (i,j)\in\mathcal{J}_M. \]
The estimates we are interested in are of the form
\[ \hat{g}(\zeta^M)=\mathrm{Tr}\Big(h\underbrace{\tfrac{1}{J_M}\sum_{(i,j)\in\mathcal{J}_M}\omega_{ij}[\zeta^M]}_{\omega[\zeta^M]}\Big)+\varkappa,
\]
where h ∈ S^m and κ ∈ R are the parameters of the estimate. Now comes the exercise:
1) Verify that when the ζ_k's stem from a signal u ∈ U, the expectation of ω[ζ^M] is a linear image Az[u]A^T of the matrix z[u] = uu^T ∈ S^n: denoting by P_u^M the distribution of ζ^M, we have
\[ E_{\zeta^M\sim P_u^M}\{\omega[\zeta^M]\}=Az[u]A^T. \tag{3.78} \]
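The identity (3.78) is easy to sanity-check numerically. Below is a minimal sketch (all dimensions and the data A, u are made up for illustration) that compares a Monte Carlo average of ω[ζ^M] with Az[u]A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 3, 4, 5          # illustrative sizes

# Random stochastic sensing matrix A: columns in the simplex Delta_m.
A = rng.random((m, n)); A /= A.sum(axis=0, keepdims=True)
u = rng.random(n); u /= u.sum()      # a signal u in Delta_n
p = A @ u                            # the induced discrete distribution Au

def omega(zeta):
    """omega[zeta^M]: average of (zeta_i zeta_j^T + zeta_j zeta_i^T)/2 over i < j."""
    M_ = len(zeta)
    S = sum(0.5 * (np.outer(zeta[i], zeta[j]) + np.outer(zeta[j], zeta[i]))
            for i in range(M_) for j in range(i + 1, M_))
    return S / (M_ * (M_ - 1) / 2)

reps = 5000
acc = np.zeros((m, m))
for _ in range(reps):
    idx = rng.choice(m, size=M, p=p)
    acc += omega(np.eye(m)[idx])     # rows of eye(m) are the basic orths
acc /= reps

target = A @ np.outer(u, u) @ A.T    # the claim (3.78)
print(np.abs(acc - target).max())    # small, shrinking like 1/sqrt(reps)
```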
Check that when setting
\[ Z_k=\{\omega\in S^k:\ \omega\succeq 0,\ \omega\ge 0,\ 1_k^T\omega 1_k=1\}, \]
where x ≥ 0 for a matrix x means that x is entrywise nonnegative, the image of Z_n under the mapping z ↦ AzA^T is contained in Z_m.
2) Let Δ^k = {z ∈ S^k : z ≥ 0, 1_k^T z 1_k = 1}, so that Z_k is the set of all positive semidefinite matrices from Δ^k. For µ ∈ Δ^m, let P_µ be the distribution of the random matrix w taking values in S^m as follows: the possible values of w are the matrices e_{ij} = ½[e_ie_j^T + e_je_i^T], 1 ≤ i ≤ j ≤ m; for every i ≤ m, w takes value e_{ii} with probability µ_{ii}, and for every i, j with i < j, w takes value e_{ij} with probability 2µ_{ij}. Let us set
\[ \Phi_1(h;\mu)=\ln\Big(\sum_{i,j=1}^m\mu_{ij}\exp\{h_{ij}\}\Big):\ S^m\times\Delta^m\to R, \]
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
so that Φ_1 is a continuous convex-concave function on S^m × Δ^m.
2.1. Prove that
\[ \forall(h\in S^m,\ \mu\in Z_m):\ \ln E_{w\sim P_\mu}\{\exp\{\mathrm{Tr}(hw)\}\}=\Phi_1(h;\mu). \]
2.2. Derive from 2.1 that, setting
\[ K=K(M)=\lfloor M/2\rfloor,\qquad \Phi_M(h;\mu)=K\Phi_1(h/K;\mu):\ S^m\times\Delta^m\to R, \]
Φ_M is a continuous convex-concave function on S^m × Δ^m such that Φ_M(0; µ) = 0 for all µ ∈ Z_m, and whenever u ∈ U, the following holds true:
Let P_{u,M} be the distribution of ω = ω[ζ^M], ζ^M ∼ P_u^M. Then for all u ∈ U, h ∈ S^m,
\[ \ln E_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\}\le\Phi_M(h;Az[u]A^T),\qquad z[u]=uu^T. \tag{3.79} \]
3) Combine the above observations with Corollary 3.6 to arrive at the following result:
Proposition 3.19. In the situation in question, let Z be a convex compact subset of Z_n such that uu^T ∈ Z for all u ∈ U. Given ε ∈ (0, 1), let
\[
\begin{array}{rcl}
\Psi_+(h,\alpha)&=&\max_{z\in Z}\big[\alpha\Phi_M(h/\alpha,AzA^T)-\mathrm{Tr}(\bar{Q}z)\big]:\ S^m\times\{\alpha>0\}\to R,\\[2pt]
\Psi_-(h,\alpha)&=&\max_{z\in Z}\big[\alpha\Phi_M(-h/\alpha,AzA^T)+\mathrm{Tr}(\bar{Q}z)\big]:\ S^m\times\{\alpha>0\}\to R,\\[2pt]
\widehat{\Psi}_+(h)&:=&\inf_{\alpha>0}\big[\Psi_+(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_M(h/\alpha,AzA^T)-\mathrm{Tr}(\bar{Q}z)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi_1(h/\beta,AzA^T)-\mathrm{Tr}(\bar{Q}z)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha],\\[2pt]
\widehat{\Psi}_-(h)&:=&\inf_{\alpha>0}\big[\Psi_-(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_M(-h/\alpha,AzA^T)+\mathrm{Tr}(\bar{Q}z)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi_1(-h/\beta,AzA^T)+\mathrm{Tr}(\bar{Q}z)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha].
\end{array}
\]
The functions Ψ̂_± are real-valued and convex on S^m, and every candidate solution h to the convex optimization problem
\[ \mathrm{Opt}=\min_h\Big\{\widehat{\Psi}(h):=\tfrac{1}{2}\big[\widehat{\Psi}_+(h)+\widehat{\Psi}_-(h)\big]\Big\} \tag{3.80} \]
induces the estimate
\[ \hat{g}_h(\zeta^M)=\mathrm{Tr}(h\omega[\zeta^M])+\varkappa(h),\qquad \varkappa(h)=\tfrac{1}{2}\big[\widehat{\Psi}_-(h)-\widehat{\Psi}_+(h)\big] \]
of the functional of interest (3.77) via observation ζ^M, with ε-risk on U not exceeding ρ = Ψ̂(h):
\[ \forall(u\in U):\ \mathrm{Prob}_{\zeta^M\sim P_u^M}\big\{|F(u)-\hat{g}_h(\zeta^M)|>\rho\big\}\le\epsilon. \]
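When Z is a singleton, computing Ψ̂_+ reduces to a one-dimensional convex minimization over β, since βΦ_1(h/β; ·) is the perspective of Φ_1 and hence convex in β > 0. A hedged numerical sketch (the data µ, h and the scalar standing for Tr(Q̄z) are made up; this is not the full maximization over Z):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

rng = np.random.default_rng(1)
m, K, eps = 3, 10, 0.05

def Phi1(h, mu):
    """Phi_1(h; mu) = ln sum_ij mu_ij exp(h_ij), computed stably."""
    return logsumexp(h.ravel(), b=mu.ravel())

# Made-up data: mu stands for A z A^T with Z a singleton; h is symmetric.
mu = rng.random((m, m)); mu = (mu + mu.T) / 2; mu /= mu.sum()
h = rng.standard_normal((m, m)); h = (h + h.T) / 2
trQz = 0.7                       # stand-in for Tr(Qbar z)

def inner(beta):
    # beta * Phi_1(h/beta; mu) - Tr(Qbar z) + (beta/K) ln(2/eps): convex in beta.
    return beta * Phi1(h / beta, mu) - trQz + (beta / K) * np.log(2 / eps)

res = minimize_scalar(inner, bounds=(1e-3, 1e3), method="bounded")
psi_plus = res.fun               # value of Psi-hat_+ at this h (singleton Z)
print(psi_plus, res.x)
```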
4) Consider an alternative way to estimate F(u), namely, as follows. Let u ∈ U. Given a pair of independent observations ζ_1, ζ_2 drawn from the distribution Au, let us convert them into the symmetric matrix
\[ \omega_{1,2}[\zeta^2]=\tfrac{1}{2}\big[\zeta_1\zeta_2^T+\zeta_2\zeta_1^T\big]. \]
The distribution P_{u,2} of this matrix is exactly the distribution P_{µ(z[u])}—see item B—where µ(z) = AzA^T : Δ^n → Δ^m. Now, given M = 2K observations ζ^{2K} = (ζ_1, ..., ζ_{2K}) stemming from signal u, we can split them into K consecutive pairs, giving rise
to K observations ω^K = (ω_1, ..., ω_K), ω_k = ω[[ζ_{2k−1}; ζ_{2k}]], drawn independently of each other from the probability distribution P_{µ(z[u])}; and the functional of interest (3.77) is a linear function Tr(Q̄z[u]) of z[u]. Assume that we are given a set Z as in the premise of Proposition 3.19. Observe that we are in the following situation:
Given K i.i.d. observations ω^K = (ω_1, ..., ω_K) with ω_k ∼ P_{µ(z)}, where z is an unknown signal known to belong to Z, we want to recover the value at z of the linear function G(v) = Tr(Q̄v) of v ∈ S^n. Besides this, we know that P_µ, for every µ ∈ Δ^m, satisfies the relation
\[ \forall(h\in S^m):\ \ln E_{\omega\sim P_\mu}\{\exp\{\mathrm{Tr}(h\omega)\}\}\le\Phi_1(h;\mu). \]
This situation fits the setting of Section 3.3.3, with the data specified as
\[ \mathcal{H}=E_H=S^m,\quad \mathcal{M}=\Delta^m\subset E_M=S^m,\quad \Phi=\Phi_1,\quad \mathcal{X}:=Z\subset E_X=S^n,\quad \mathcal{A}(z)=AzA^T. \]
Therefore, we can use the apparatus developed in that section to upper-bound the ε-risk of the affine estimate
\[ \mathrm{Tr}\Big(h\Big[\tfrac{1}{K}\sum_{k=1}^K\omega_k\Big]\Big)+\varkappa \]
of F(u) := G(z[u]) = u^TQ̄u, and to build the best estimate in terms of the upper risk bound; see Corollary 3.8. On closer inspection (carry it out!), the functions Ψ̂_± arising in (3.38) are exactly the functions Ψ̂_± specified in Proposition 3.19 for M = 2K. Thus, the approach to estimating F(u) via observations ζ^{2K} stemming from u ∈ U results in a family of estimates
\[ \tilde{g}_h(\zeta^{2K})=\mathrm{Tr}\Big(h\Big[\tfrac{1}{K}\sum_{k=1}^K\omega[[\zeta_{2k-1};\zeta_{2k}]]\Big]\Big)+\varkappa(h),\quad h\in S^m. \]
The resulting upper bound on the ε-risk of the estimate g̃_h is Ψ̂(h), where Ψ̂(·) is associated with M = 2K according to Proposition 3.19. In other words, this is exactly the upper bound on the ε-risk of the estimate ĝ_h offered by the proposition. Note, however, that the estimates g̃_h and ĝ_h are not identical:
\[
\begin{array}{rcl}
\tilde{g}_h(\zeta^{2K})&=&\mathrm{Tr}\Big(h\Big[\tfrac{1}{K}\sum_{k=1}^K\omega_{2k-1,2k}[\zeta^{2K}]\Big]\Big)+\varkappa(h),\\[2pt]
\hat{g}_h(\zeta^{2K})&=&\mathrm{Tr}\Big(h\Big[\tfrac{1}{K(2K-1)}\sum_{1\le i<j\le 2K}\omega_{ij}[\zeta^{2K}]\Big]\Big)+\varkappa(h).
\end{array}
\]
…π = Prob{η > ζ}, where η, ζ are discrete real-valued random variables, independent of each other, with distributions u and v, and π is a linear function of the “joint distribution” uv^T of η and ζ. This story gives rise to the aforementioned estimation problem with the unit sensing matrices P and Q. Assuming that there are “measurement errors”—instead of observing an action’s outcome “as is,” we observe a realization of a random variable with distribution depending, in a prescribed fashion, on the outcome—we arrive at problems where P and Q can be general-type stochastic matrices. As always, we encode the p possible values of η_k by the basic orths e_1, ..., e_p in R^p, and the q possible values of ζ_ℓ by the basic orths f_1, ..., f_q in R^q. We focus on estimates of the form
\[ \hat{g}_{h,\varkappa}(\eta^K,\zeta^L)=\Big[\tfrac{1}{K}\sum_k\eta_k\Big]^T h\Big[\tfrac{1}{L}\sum_\ell\zeta_\ell\Big]+\varkappa\qquad [h\in R^{p\times q},\ \varkappa\in R]. \]
This is what you are supposed to do:
1) (cf. item 2 in Exercise 3.2) Denoting by Δ_{mn} the set of nonnegative m × n matrices with unit sum of all entries (i.e., the set of all probability distributions on {1, ..., m} × {1, ..., n}) and assuming L ≥ K, let us set A(z) = PzQ^T : R^{r×s} → R^{p×q} and
\[ \Phi(h;\mu)=\ln\Big(\sum_{i=1}^p\sum_{j=1}^q\mu_{ij}\exp\{h_{ij}\}\Big):\ R^{p\times q}\times\Delta_{pq}\to R,\qquad \Phi_K(h;\mu)=K\Phi(h/K;\mu):\ R^{p\times q}\times\Delta_{pq}\to R. \]
Verify that A maps Δ_{rs} into Δ_{pq}, that Φ and Φ_K are continuous convex-concave functions on their domains, and that for every u ∈ Δ_r, v ∈ Δ_s the following holds true:
(!) When η^K = (η_1, ..., η_K), ζ^L = (ζ_1, ..., ζ_L) with mutually independent η_1, ..., ζ_L such that η_k ∼ Pu, ζ_ℓ ∼ Qv for all k, ℓ, we have
\[ \ln E_{\eta,\zeta}\Big\{\exp\Big\{\Big[\tfrac{1}{K}\sum_k\eta_k\Big]^T h\Big[\tfrac{1}{L}\sum_\ell\zeta_\ell\Big]\Big\}\Big\}\le\Phi_K(h;\mathcal{A}(uv^T)). \tag{3.82} \]
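For tiny p, q, K, L the claim (!) can be verified exactly, by enumerating all possible samples rather than simulating. A sketch (the matrices P, Q and signals u, v, h are random stand-ins):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
p, q, r, s, K, L = 2, 2, 2, 2, 2, 2    # tiny sizes so exact enumeration is cheap

def col_stochastic(a, b):
    Mx = rng.random((a, b)); return Mx / Mx.sum(axis=0, keepdims=True)

P, Q = col_stochastic(p, r), col_stochastic(q, s)
u = rng.random(r); u /= u.sum()
v = rng.random(s); v /= v.sum()
h = rng.standard_normal((p, q))

Pu, Qv = P @ u, Q @ v
mu = P @ np.outer(u, v) @ Q.T          # A(uv^T) = P uv^T Q^T

# Left-hand side of (3.82): exact expectation over all K+L-tuples of outcomes.
E, F = np.eye(p), np.eye(q)
lhs = 0.0
for eta in itertools.product(range(p), repeat=K):
    for zeta in itertools.product(range(q), repeat=L):
        prob = np.prod(Pu[list(eta)]) * np.prod(Qv[list(zeta)])
        eb = E[list(eta)].mean(axis=0)   # (1/K) sum_k eta_k
        zb = F[list(zeta)].mean(axis=0)  # (1/L) sum_l zeta_l
        lhs += prob * np.exp(eb @ h @ zb)
lhs = np.log(lhs)

# Right-hand side: Phi_K(h; mu) = K ln sum_ij mu_ij exp(h_ij / K).
rhs = K * np.log(np.sum(mu * np.exp(h / K)))
print(lhs, rhs)                          # lhs <= rhs
```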
2) Combine (!) with Corollary 3.6 to arrive at the following analog of Proposition 3.19:
Proposition 3.20. In the situation in question, let Z be a convex compact subset of Δ_{rs} such that uv^T ∈ Z for all u ∈ U, v ∈ V. Given ε ∈ (0, 1), let
\[
\begin{array}{rcl}
\Psi_+(h,\alpha)&=&\max_{z\in Z}\big[\alpha\Phi_K(h/\alpha,PzQ^T)-\mathrm{Tr}(Fz^T)\big]:\ R^{p\times q}\times\{\alpha>0\}\to R,\\[2pt]
\Psi_-(h,\alpha)&=&\max_{z\in Z}\big[\alpha\Phi_K(-h/\alpha,PzQ^T)+\mathrm{Tr}(Fz^T)\big]:\ R^{p\times q}\times\{\alpha>0\}\to R,\\[2pt]
\widehat{\Psi}_+(h)&:=&\inf_{\alpha>0}\big[\Psi_+(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_K(h/\alpha,PzQ^T)-\mathrm{Tr}(Fz^T)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi(h/\beta,PzQ^T)-\mathrm{Tr}(Fz^T)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha],\\[2pt]
\widehat{\Psi}_-(h)&:=&\inf_{\alpha>0}\big[\Psi_-(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_K(-h/\alpha,PzQ^T)+\mathrm{Tr}(Fz^T)+\alpha\ln(2/\epsilon)\big]\\
&=&\max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi(-h/\beta,PzQ^T)+\mathrm{Tr}(Fz^T)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha].
\end{array}
\]
The functions Ψ̂_± are real-valued and convex on R^{p×q}, and every candidate solution h to the convex optimization problem
\[ \mathrm{Opt}=\min_h\Big\{\widehat{\Psi}(h):=\tfrac{1}{2}\big[\widehat{\Psi}_+(h)+\widehat{\Psi}_-(h)\big]\Big\} \]
induces the estimate
\[ \hat{g}_h(\eta^K,\zeta^L)=\Big[\tfrac{1}{K}\sum_k\eta_k\Big]^T h\Big[\tfrac{1}{L}\sum_\ell\zeta_\ell\Big]+\varkappa(h),\qquad \varkappa(h)=\tfrac{1}{2}\big[\widehat{\Psi}_-(h)-\widehat{\Psi}_+(h)\big] \]
of the functional of interest (3.81) via observations η^K, ζ^L, with ε-risk on U × V not exceeding ρ = Ψ̂(h):
\[ \forall(u\in U,\ v\in V):\ \mathrm{Prob}\big\{|F(u,v)-\hat{g}_h(\eta^K,\zeta^L)|>\rho\big\}\le\epsilon, \]
the probability being taken w.r.t. the distribution of observations η^K, ζ^L stemming from signals u, v.
Exercise 3.4. [recovering mixture weights] The problem to be addressed in this exercise is as follows. We are given K probability distributions P_1, ..., P_K on an observation space Ω; let these distributions have densities p_k(·) w.r.t. some reference measure Π on Ω, and assume that Σ_k p_k(·) is positive on Ω. We are given also N independent observations ω_t ∼ P_µ, t = 1, ..., N, drawn from the distribution
\[ P_\mu=\sum_{k=1}^K\mu_kP_k, \]
where µ is an unknown “signal” known to belong to the probabilistic simplex Δ_K = {µ ∈ R^K : µ ≥ 0, Σ_k µ_k = 1}. Given ω^N = (ω_1, ..., ω_N), we want to recover the linear image Gµ of µ, where G ∈ R^{ν×K} is given.
We intend to measure the risk of a candidate estimate Ĝ(ω^N) : Ω × ... × Ω → R^ν by the quantity
\[ \mathrm{Risk}[\widehat{G}(\cdot)]=\sup_{\mu\in\Delta_K}\Big[E_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat{G}(\omega^N)-G\mu\|_2^2\big\}\Big]^{1/2}. \]
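Sampling from the mixture P_µ = Σ_k µ_k P_k is straightforward: draw a latent component index k ∼ µ, then ω ∼ P_k. A minimal sketch with hypothetical Gaussian components (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
K, d, N = 3, 2, 1000

# Hypothetical components P_k = N(nu_k, Sigma_k) and weights mu in Delta_K.
nus = rng.standard_normal((K, d))
Sigmas = np.array([np.eye(d) * (0.5 + k) for k in range(K)])
mu = np.array([0.5, 0.3, 0.2])

def sample_P_mu(n):
    """Draw n i.i.d. observations from P_mu = sum_k mu_k P_k."""
    ks = rng.choice(K, size=n, p=mu)   # latent component labels
    return np.array([rng.multivariate_normal(nus[k], Sigmas[k]) for k in ks]), ks

omega, ks = sample_P_mu(N)
# Empirical component frequencies approach mu as N grows.
print(np.bincount(ks, minlength=K) / N)
```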
3.4.A. Recovering a linear form. Let us start with the case when G = g^T is a 1 × K matrix.
3.4.A.1. Preliminaries. To motivate the construction to follow, consider the case when Ω is a finite set (obtained, e.g., by “fine discretization” of the “true” observation space). In this situation our problem becomes an estimation problem in the Discrete o.s.: given a stationary N-repeated observation stemming from a discrete probability distribution P_µ affinely parameterized by the signal µ ∈ Δ_K, we want to recover a linear form of µ. It is shown in Section 3.1—see Remark 3.2—that in this case a nearly optimal, in terms of its ε-risk, estimate is of the form
\[ \hat{g}(\omega^N)=\frac{1}{N}\sum_{t=1}^N\phi(\omega_t) \tag{3.83} \]
with properly selected φ. The difficulty with this approach is that as far as computations are concerned, optimal design of φ requires solving a convex optimization problem of design dimension of order of the cardinality of Ω, and this cardinality could be huge, as is the case when Ω is a discretization of a domain in Rd with d in the range of tens. To circumvent this problem, we are to simplify the outlined approach: from the construction of Section 3.1 we inherit the simple structure (3.83) of the estimator; taking this structure for granted, we are to develop an alternative design of φ. With this new design, we have no theoretical guarantees for the resulting estimates to be nearoptimal; we sacrifice these guarantees in order to reduce dramatically the computational effort of building the estimates.
3.4.A.2. Generic estimate. Let us select somehow L functions F_ℓ(·) on Ω such that
\[ \int F_\ell^2(\omega)p_k(\omega)\Pi(d\omega)<\infty,\quad 1\le\ell\le L,\ 1\le k\le K. \tag{3.84} \]
With λ ∈ R^L, consider estimates of the form
\[ \hat{g}_\lambda(\omega^N)=\frac{1}{N}\sum_{t=1}^N\Phi_\lambda(\omega_t),\qquad \Phi_\lambda(\omega)=\sum_\ell\lambda_\ell F_\ell(\omega). \tag{3.85} \]
1) Prove that
\[
\begin{array}{rcl}
\mathrm{Risk}[\hat{g}_\lambda]&\le&\mathrm{Risk}(\lambda)\\
&:=&\max_{k\le K}\Big\{\tfrac{1}{N}\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]^2p_k(\omega)\Pi(d\omega)+\Big(\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]p_k(\omega)\Pi(d\omega)-g^Te_k\Big)^2\Big\}^{1/2}\\
&=&\max_{k\le K}\Big\{\tfrac{1}{N}\lambda^TW_k\lambda+\big[e_k^T[M\lambda-g]\big]^2\Big\}^{1/2},
\end{array}\tag{3.86}
\]
where
\[ M=\Big[M_{k\ell}:=\int F_\ell(\omega)p_k(\omega)\Pi(d\omega)\Big]_{k\le K,\,\ell\le L},\qquad W_k=\Big[[W_k]_{\ell\ell'}:=\int F_\ell(\omega)F_{\ell'}(\omega)p_k(\omega)\Pi(d\omega)\Big]_{\ell\le L,\,\ell'\le L},\ 1\le k\le K, \]
and e_1, ..., e_K are the standard basic orths in R^K.
Note that Risk(λ) is a convex function of λ; this function is easy to compute, provided the matrices M and W_k, k ≤ K, are available. Assuming this is the case, we can solve the convex optimization problem
\[ \mathrm{Opt}=\min_{\lambda\in R^L}\mathrm{Risk}(\lambda) \tag{3.87} \]
and use the estimate (3.85) associated with the optimal solution to this problem; the risk of this estimate will be upper-bounded by Opt.
3.4.A.3. Implementation. When implementing the generic estimate we arrive at the “Measurement Design” question: how do we select the value of L and the functions F_ℓ, 1 ≤ ℓ ≤ L, resulting in a small (upper bound Opt on the) risk of the estimate (3.85) yielded by an optimal solution to (3.87)? We are about to consider three related options—Naive, Basic, and Maximum Likelihood (ML).
The Naive option is to take F_ℓ = p_ℓ, 1 ≤ ℓ ≤ L = K, assuming that this selection meets (3.84). For the sake of definiteness, consider the “Gaussian case,” where Ω = R^d, Π is the Lebesgue measure, and p_k is the Gaussian density with parameters ν_k, Σ_k:
\[ p_k(\omega)=(2\pi)^{-d/2}\mathrm{Det}(\Sigma_k)^{-1/2}\exp\big\{-\tfrac{1}{2}(\omega-\nu_k)^T\Sigma_k^{-1}(\omega-\nu_k)\big\}. \]
In this case, the Naive option leads to easily computable matrices M and W_k appearing in (3.86).
2) Check that in the Gaussian case, when setting
\[
\begin{array}{c}
\Sigma_{k\ell}=\big[\Sigma_k^{-1}+\Sigma_\ell^{-1}\big]^{-1},\quad \Sigma_{k\ell m}=\big[\Sigma_k^{-1}+\Sigma_\ell^{-1}+\Sigma_m^{-1}\big]^{-1},\quad \chi_k=\Sigma_k^{-1}\nu_k,\\[4pt]
\alpha_{k\ell}=\sqrt{\dfrac{\mathrm{Det}(\Sigma_{k\ell})}{(2\pi)^d\,\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)}},\qquad \beta_{k\ell m}=\sqrt{\dfrac{\mathrm{Det}(\Sigma_{k\ell m})}{(2\pi)^{2d}\,\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)\mathrm{Det}(\Sigma_m)}},
\end{array}
\]
we have
\[
\begin{array}{rcl}
M_{k\ell}&:=&\int p_\ell(\omega)p_k(\omega)\Pi(d\omega)\\
&=&\alpha_{k\ell}\exp\big\{\tfrac{1}{2}\big([\chi_k+\chi_\ell]^T\Sigma_{k\ell}[\chi_k+\chi_\ell]-\chi_k^T\Sigma_k\chi_k-\chi_\ell^T\Sigma_\ell\chi_\ell\big)\big\},\\[2pt]
[W_k]_{\ell m}&:=&\int p_\ell(\omega)p_m(\omega)p_k(\omega)\Pi(d\omega)\\
&=&\beta_{k\ell m}\exp\big\{\tfrac{1}{2}\big([\chi_k+\chi_\ell+\chi_m]^T\Sigma_{k\ell m}[\chi_k+\chi_\ell+\chi_m]-\chi_k^T\Sigma_k\chi_k-\chi_\ell^T\Sigma_\ell\chi_\ell-\chi_m^T\Sigma_m\chi_m\big)\big\}.
\end{array}
\]
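The d = 1 specialization of the first closed-form integral is easy to check against numerical quadrature; a sketch with made-up means and variances:

```python
import numpy as np

# d = 1 check of M_kl = \int p_l(w) p_k(w) dw for Gaussian densities.
nu = np.array([0.3, -0.5])   # means nu_k
s = np.array([0.8, 1.3])     # variances Sigma_k (scalars when d = 1)

def pdf(w, m, v):
    return np.exp(-(w - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

k, l = 0, 1
# Closed form from the exercise, specialized to d = 1.
Skl = 1.0 / (1.0 / s[k] + 1.0 / s[l])
chi = nu / s
alpha = np.sqrt(Skl / (2 * np.pi * s[k] * s[l]))
M_closed = alpha * np.exp(0.5 * ((chi[k] + chi[l]) ** 2 * Skl
                                 - chi[k] ** 2 * s[k] - chi[l] ** 2 * s[l]))

# Riemann-sum quadrature of the integral on a wide, fine grid.
w = np.linspace(-15.0, 15.0, 200001)
dw = w[1] - w[0]
M_quad = np.sum(pdf(w, nu[k], s[k]) * pdf(w, nu[l], s[l])) * dw
print(M_closed, M_quad)      # the two values agree
```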
Basic option. Though simple, the Naive option does not make much sense: when replacing the reference measure Π with another measure Π′ which has positive density θ(·) w.r.t. Π, the densities p_k are updated according to p_k(·) ↦ p′_k(·) = θ^{−1}(·)p_k(·), so that when selecting F′_ℓ = p′_ℓ, the matrices M and W_k become M′ and W′_k with
\[ M'_{k\ell}=\int\frac{p_k(\omega)p_\ell(\omega)}{\theta^2(\omega)}\Pi'(d\omega)=\int\frac{p_k(\omega)p_\ell(\omega)}{\theta(\omega)}\Pi(d\omega),\qquad
[W'_k]_{\ell m}=\int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^3(\omega)}\Pi'(d\omega)=\int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^2(\omega)}\Pi(d\omega). \]
We see that, in general, M ≠ M′ and W_k ≠ W′_k, which makes the Naive option rather unnatural. In the alternative Basic option we set
\[ L=K,\qquad F_\ell(\omega)=\pi_\ell(\omega):=\frac{p_\ell(\omega)}{\sum_kp_k(\omega)}. \]
The motivation is that the functions F_ℓ are invariant under replacing Π with Π′, so that here M = M′ and W_k = W′_k. Besides this, there are statistical arguments in favor of the Basic option, namely, as follows. Let Π_* be the measure with density Σ_k p_k(·) w.r.t. Π; taken w.r.t. Π_*, the densities of the P_k are exactly the above π_k(·), and Σ_k π_k(ω) ≡ 1. Now, (3.86) says that the risk of the estimate ĝ_λ can be upper-bounded by the function Risk(λ) defined in (3.86), and this function, in turn, can be upper-bounded by the function
\[
\begin{array}{rcl}
\mathrm{Risk}^+(\lambda)&:=&\Big\{\sum_k\tfrac{1}{N}\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]^2p_k(\omega)\Pi(d\omega)+\max_k\Big(\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]p_k(\omega)\Pi(d\omega)-g^Te_k\Big)^2\Big\}^{1/2}\\
&=&\Big\{\tfrac{1}{N}\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]^2\Pi_*(d\omega)+\max_k\Big(\int\big[\textstyle\sum_\ell\lambda_\ell F_\ell(\omega)\big]\pi_k(\omega)\Pi_*(d\omega)-g^Te_k\Big)^2\Big\}^{1/2}\\
&\le& K\,\mathrm{Risk}(\lambda)
\end{array}
\]
(we have used that the maximum of K nonnegative quantities is at most their sum, and the latter is at most K times the maximum of the quantities). Consequently, the risk of the estimate (3.85) stemming from an optimal solution to (3.87) can be
upper-bounded by the quantity
\[ \mathrm{Opt}^+:=\min_\lambda\mathrm{Risk}^+(\lambda)\qquad \big[\ge\mathrm{Opt}=\min_\lambda\mathrm{Risk}(\lambda)\big]. \]
And here comes the punchline:
3.1) Prove that both the quantity Opt defined in (3.87) and the above Opt⁺ depend only on the linear span of the functions F_ℓ, ℓ = 1, ..., L, and not on how the functions F_ℓ are selected in this span.
3.2) Prove that the selection F_ℓ = π_ℓ, 1 ≤ ℓ ≤ L = K, minimizes Opt⁺ among all possible selections L, {F_ℓ}_{ℓ=1}^L satisfying (3.84). Conclude that the selection F_ℓ = π_ℓ, 1 ≤ ℓ ≤ L = K, while not necessarily optimal in terms of Opt, definitely is meaningful: it optimizes the natural upper bound Opt⁺ on Opt. Observe that Opt⁺ ≤ KOpt, so that optimizing the upper bound Opt⁺ instead of Opt, although rough, is not completely meaningless.
A downside of the Basic option is that it seems problematic to get closed-form expressions for the associated matrices M and W_k; see (3.86). For example, in the Gaussian case the Naive choice of the F_ℓ's allows us to represent M and W_k in explicit closed form; in contrast, when selecting F_ℓ = π_ℓ, ℓ ≤ L = K, seemingly the only way to get M and W_k is to use Monte Carlo simulation. This being said, we indeed can use Monte Carlo simulation to compute M and W_k, provided we can sample from the distributions P_1, ..., P_K. In this respect, it should be stressed that with F_ℓ ≡ π_ℓ, the entries of M and W_k are expectations, w.r.t. P_1, ..., P_K, of functions of ω bounded in magnitude by 1, and thus well suited for Monte Carlo simulation.
Maximum Likelihood option. This choice of {F_ℓ}_{ℓ≤L} follows straightforwardly the idea of discretization we started with in this exercise. Specifically, we split Ω into L cells Ω_1, ..., Ω_L in such a way that the intersection of any two different cells is of Π-measure zero, and treat as our observations not the actual observations ω_t, but the indices of the cells to which the ω_t's belong. With our estimation scheme, this is the same as selecting F_ℓ as the characteristic function of Ω_ℓ, ℓ ≤ L.
Assuming that for distinct k, k′ the densities p_k, p_{k′} differ from each other Π-almost surely, the simplest discretization independent of how the reference measure is selected is the Maximum Likelihood discretization
\[ \Omega_\ell=\{\omega:\ \max_kp_k(\omega)=p_\ell(\omega)\},\quad 1\le\ell\le L=K; \]
with the ML option, we take, as the F_ℓ's, the characteristic functions of the sets Ω_ℓ, 1 ≤ ℓ ≤ L = K, just defined. As with the Basic option, the matrices M and W_k associated with the ML option can be found by Monte Carlo simulation.
We have discussed three simple options for selecting the F_ℓ's. In applications, one can compute the upper risk bounds Opt—see (3.87)—associated with each option and use the option with the best—the smallest—risk bound (a “smart” choice of the F_ℓ's). Alternatively, one can take as {F_ℓ, ℓ ≤ L} the union of the three collections yielded by the above options (and, perhaps, further extend this union). Note that the larger the collection of F_ℓ's, the smaller the associated Opt, so that the only price for combining different selections is the increased computational cost of solving (3.87).
3.4.A.4. Illustration. In the experimental part of this exercise you are expected
to
4.1) Run numerical experiments to compare the estimates yielded by the above three options (Naive, Basic, ML). Recommended setup:
• d = 8, K = 90;
• Gaussian case with the covariance matrices Σ_k of P_k selected at random,
\[ S_k=\mathrm{rand}(d,d),\qquad \Sigma_k=\frac{S_kS_k^T}{\|S_k\|^2}\quad [\|\cdot\|:\ \text{spectral norm}], \]
and the expectations ν_k of P_k selected at random from N(0, σ²I_d), with σ = 0.1;
• values of N: {10^s, s = 0, 1, ..., 5};
• linear form to be recovered: g^Tµ ≡ µ_1.
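A sketch of the recommended data generation (variable names are ours; `rng.random` plays the role of rand(d, d)):

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, sigma = 8, 90, 0.1

# Sigma_k = S_k S_k^T / ||S_k||^2 (spectral norm), nu_k ~ N(0, sigma^2 I_d).
Sigmas, nus = [], []
for _ in range(K):
    S = rng.random((d, d))
    Sig = S @ S.T / np.linalg.norm(S, 2) ** 2   # unit spectral norm by construction
    Sigmas.append(Sig)
    nus.append(sigma * rng.standard_normal(d))
Sigmas, nus = np.array(Sigmas), np.array(nus)

g = np.zeros(K); g[0] = 1.0                     # recover g^T mu = mu_1
Ns = [10 ** s for s in range(6)]                # sample sizes 1, 10, ..., 10^5
print(np.linalg.norm(Sigmas[0], 2))            # spectral norm of each Sigma_k is 1
```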
4.2†) Utilize the Cramer–Rao lower risk bound (see Proposition 4.37, Exercise 4.22) to upper-bound the level of conservatism Opt/Risk_* of the estimates built in item 4.1. Here Risk_* is the minimax risk in our estimation problem:
\[ \mathrm{Risk}_*=\inf_{\hat{g}(\cdot)}\mathrm{Risk}[\hat{g}(\omega^N)]=\inf_{\hat{g}(\cdot)}\sup_{\mu\in\Delta_K}\Big[E_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{|\hat{g}(\omega^N)-g^T\mu|^2\big\}\Big]^{1/2}, \]
where the inf is taken over all estimates.
3.4.B. Recovering linear images. Now consider the case when G is a general ν × K matrix. The analog of the estimate ĝ_λ(·) is now as follows: with somehow chosen F_1, ..., F_L satisfying (3.84), we select a ν × L matrix Λ = [λ_{iℓ}], set
\[ \Phi_\Lambda(\omega)=\Big[\sum_\ell\lambda_{1\ell}F_\ell(\omega);\ \sum_\ell\lambda_{2\ell}F_\ell(\omega);\ ...;\ \sum_\ell\lambda_{\nu\ell}F_\ell(\omega)\Big], \]
and estimate Gµ by
\[ \widehat{G}_\Lambda(\omega^N)=\frac{1}{N}\sum_{t=1}^N\Phi_\Lambda(\omega_t). \]
5) Prove the following counterpart of the results of item 3.4.A:
Proposition 3.21. The risk of the proposed estimate can be upper-bounded as follows:
\[ \mathrm{Risk}[\widehat{G}_\Lambda]:=\max_{\mu\in\Delta_K}\Big[E_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat{G}_\Lambda(\omega^N)-G\mu\|_2^2\big\}\Big]^{1/2}\le\mathrm{Risk}(\Lambda):=\max_{k\le K}\Psi(\Lambda,e_k), \]
where
\[
\begin{array}{rcl}
\Psi(\Lambda,\mu)&=&\Big[\tfrac{1}{N}\sum_{k=1}^K\mu_kE_{\omega\sim P_k}\big\{\|\Phi_\Lambda(\omega)\|_2^2\big\}+\|[\psi_\Lambda-G]\mu\|_2^2\Big]^{1/2}\\
&=&\Big[\|[\psi_\Lambda-G]\mu\|_2^2+\tfrac{1}{N}\sum_{k=1}^K\mu_k\int\Big[\sum_{i\le\nu}\big[\textstyle\sum_\ell\lambda_{i\ell}F_\ell(\omega)\big]^2\Big]P_k(d\omega)\Big]^{1/2},\\[4pt]
\mathrm{Col}_k[\psi_\Lambda]&=&E_{\omega\sim P_k}\{\Phi_\Lambda(\omega)\}=\Big[\int\big[\textstyle\sum_\ell\lambda_{1\ell}F_\ell(\omega)\big]P_k(d\omega);\ \cdots;\ \int\big[\textstyle\sum_\ell\lambda_{\nu\ell}F_\ell(\omega)\big]P_k(d\omega)\Big],\quad 1\le k\le K,
\end{array}
\]
and e_1, ..., e_K are the standard basic orths in R^K.
Note that exactly the same reasoning as in the case of the scalar Gµ ≡ g T µ demonstrates that a reasonable way to select L and Fℓ , ℓ = 1, ..., L, is to set L = K and Fℓ (·) = πℓ (·), 1 ≤ ℓ ≤ L.
3.6 PROOFS
3.6.1 Proof of Proposition 3.3
1°. Observe that Opt_ij(K) is the saddle point value in the convex-concave saddle point problem
\[ \mathrm{Opt}_{ij}(K)=\inf_{\alpha>0,\phi\in\mathcal{F}}\max_{x\in X_i,y\in X_j}\Big\{\tfrac{1}{2}K\alpha\big[\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))+\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))\big]+\tfrac{1}{2}g^T[y-x]+\alpha\ln(2I/\epsilon)\Big\}. \]
The domain of the maximization variable is compact and the cost function is continuous on its domain, whence, by the Sion–Kakutani Theorem, we also have
\[ \mathrm{Opt}_{ij}(K)=\max_{x\in X_i,y\in X_j}\Theta_{ij}(x,y),\qquad \Theta_{ij}(x,y)=\inf_{\alpha>0,\phi\in\mathcal{F}}\Big\{\tfrac{1}{2}K\alpha\big[\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))+\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))\big]+\alpha\ln(2I/\epsilon)\Big\}+\tfrac{1}{2}g^T[y-x]. \tag{3.88} \]
Note that
\[
\begin{array}{rcl}
\Theta_{ij}(x,y)&=&\inf_{\alpha>0,\psi\in\mathcal{F}}\Big\{\tfrac{1}{2}K\alpha\big[\Phi_{\mathcal{O}}(\psi;A_i(x))+\Phi_{\mathcal{O}}(-\psi;A_j(y))\big]+\alpha\ln(2I/\epsilon)\Big\}+\tfrac{1}{2}g^T[y-x]\\
&=&\inf_{\alpha>0}\Big\{\tfrac{1}{2}\alpha K\inf_{\psi\in\mathcal{F}}\big[\Phi_{\mathcal{O}}(\psi;A_i(x))+\Phi_{\mathcal{O}}(-\psi;A_j(y))\big]+\alpha\ln(2I/\epsilon)\Big\}+\tfrac{1}{2}g^T[y-x].
\end{array}
\]
Given x ∈ X_i, y ∈ X_j and setting µ = A_i(x), ν = A_j(y), we obtain
\[ \inf_{\psi\in\mathcal{F}}\big[\Phi_{\mathcal{O}}(\psi;A_i(x))+\Phi_{\mathcal{O}}(-\psi;A_j(y))\big]=\inf_{\psi\in\mathcal{F}}\Big[\ln\Big(\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)+\ln\Big(\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\Big]. \]
Since O is a good o.s., the function ψ̄(ω) = ½ ln(p_ν(ω)/p_µ(ω)) belongs to F, and
\[
\begin{array}{l}
\inf_{\psi\in\mathcal{F}}\Big[\ln\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)+\ln\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big]\\
\quad=\inf_{\delta\in\mathcal{F}}\Big[\ln\int\exp\{\bar{\psi}(\omega)+\delta(\omega)\}p_\mu(\omega)\Pi(d\omega)+\ln\int\exp\{-\bar{\psi}(\omega)-\delta(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big]\\
\quad=\inf_{\delta\in\mathcal{F}}\underbrace{\Big[\ln\int\exp\{\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)+\ln\int\exp\{-\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]}_{f(\delta)}.
\end{array}
\]
Observe that f(δ) clearly is a convex and even function of δ ∈ F; as such, it attains its minimum over δ ∈ F at δ = 0. The bottom line is that
\[ \inf_{\psi\in\mathcal{F}}\big[\Phi_{\mathcal{O}}(\psi;A_i(x))+\Phi_{\mathcal{O}}(-\psi;A_j(y))\big]=2\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big), \tag{3.89} \]
and
\[
\begin{array}{rcl}
\Theta_{ij}(x,y)&=&\inf_{\alpha>0}\Big\{\alpha\Big[K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big)+\ln(2I/\epsilon)\Big]\Big\}+\tfrac{1}{2}g^T[y-x]\\
&=&\begin{cases}\tfrac{1}{2}g^T[y-x],& K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big)+\ln(2I/\epsilon)\ge 0,\\ -\infty,&\text{otherwise.}\end{cases}
\end{array}
\]
This combines with (3.88) to imply that
\[ \mathrm{Opt}_{ij}(K)=\max_{x,y}\Big\{\tfrac{1}{2}g^T[y-x]:\ x\in X_i,\ y\in X_j,\ \Big[\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big]^K\ge\frac{\epsilon}{2I}\Big\}. \tag{3.90} \]
2°. We claim that under the premise of the proposition, for all i, j, 1 ≤ i, j ≤ I, one has Opt_ij(K) ≤ Risk*_ε(K̄), implying the validity of (3.13). Indeed, assume that for some pair i, j the opposite inequality holds true, Opt_ij(K) > Risk*_ε(K̄), and let us lead this assumption to a contradiction. Under our assumption, the optimization problem in (3.90) has a feasible solution (x̄, ȳ) such that
\[ r:=\tfrac{1}{2}g^T[\bar{y}-\bar{x}]>\mathrm{Risk}^*_\epsilon(\bar{K}), \tag{3.91} \]
implying, due to the origin of Risk*_ε(K̄), that there exists an estimate g̃(ω^{K̄}) such that for µ = A_i(x̄), ν = A_j(ȳ) it holds
\[
\begin{array}{l}
\mathrm{Prob}_{\omega^{\bar{K}}\sim p^{\bar{K}}_\nu}\big\{\tilde{g}(\omega^{\bar{K}})\le\tfrac{1}{2}g^T[\bar{x}+\bar{y}]\big\}\le\mathrm{Prob}_{\omega^{\bar{K}}\sim p^{\bar{K}}_\nu}\big\{|\tilde{g}(\omega^{\bar{K}})-g^T\bar{y}|\ge r\big\}\le\epsilon,\\
\mathrm{Prob}_{\omega^{\bar{K}}\sim p^{\bar{K}}_\mu}\big\{\tilde{g}(\omega^{\bar{K}})\ge\tfrac{1}{2}g^T[\bar{x}+\bar{y}]\big\}\le\mathrm{Prob}_{\omega^{\bar{K}}\sim p^{\bar{K}}_\mu}\big\{|\tilde{g}(\omega^{\bar{K}})-g^T\bar{x}|\ge r\big\}\le\epsilon.
\end{array}
\]
In other words, we can decide on the two simple hypotheses stating that observation ω^{K̄} obeys distribution p^{K̄}_µ or p^{K̄}_ν with risk ≤ ε. Consequently, setting Π^{K̄} = Π × ... × Π (K̄ factors) and p^{K̄}_θ(ω^{K̄}) = Π_{k=1}^{K̄} p_θ(ω_k), we have
\[ \int\min\big[p^{\bar{K}}_\mu(\omega^{\bar{K}}),p^{\bar{K}}_\nu(\omega^{\bar{K}})\big]\,\Pi^{\bar{K}}(d\omega^{\bar{K}})\le 2\epsilon. \]
Hence,
\[
\begin{array}{rcl}
\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^{\bar{K}}&=&\int\sqrt{p^{\bar{K}}_\mu(\omega^{\bar{K}})p^{\bar{K}}_\nu(\omega^{\bar{K}})}\,\Pi^{\bar{K}}(d\omega^{\bar{K}})\\
&=&\int\sqrt{\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]}\sqrt{\max\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]}\,\Pi^{\bar{K}}(d\omega^{\bar{K}})\\
&\le&\Big[\int\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\Big[\int\max\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\\
&=&\Big[\int\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\Big[\int\big[p^{\bar{K}}_\mu+p^{\bar{K}}_\nu-\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\\
&=&\Big[\int\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\Big[2-\int\min\big[p^{\bar{K}}_\mu,p^{\bar{K}}_\nu\big]\Pi^{\bar{K}}(d\omega^{\bar{K}})\Big]^{1/2}\\
&\le&2\sqrt{\epsilon(1-\epsilon)}.
\end{array}
\]
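The two elementary facts used in this chain—min + max = sum pointwise, and the Cauchy–Schwarz step—can be sanity-checked on discrete distributions; a sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
pm = rng.random(n); pm /= pm.sum()    # discrete stand-ins for p_mu, p_nu
pv = rng.random(n); pv /= pv.sum()

hell = np.sum(np.sqrt(pm * pv))       # \int sqrt(p_mu p_nu)
mn = np.sum(np.minimum(pm, pv))       # \int min(p_mu, p_nu)
mx = np.sum(np.maximum(pm, pv))       # \int max(p_mu, p_nu)

# min + max = p_mu + p_nu pointwise, so mx = 2 - mn; and Cauchy-Schwarz
# gives hell <= sqrt(mn * mx) = sqrt(mn * (2 - mn)).
print(hell, np.sqrt(mn * (2 - mn)))
```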
Therefore, for K satisfying (3.12) we have
\[ \Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^K\le\big[2\sqrt{\epsilon(1-\epsilon)}\big]^{K/\bar{K}}<\frac{\epsilon}{2I}, \]
which is the desired contradiction (recall that µ = A_i(x̄), ν = A_j(ȳ), and (x̄, ȳ) is feasible for (3.90)).
3°. Now let us prove that under the premise of the proposition, (3.14) takes place. To this end, let us set
\[ w_{ij}(s)=\max_{x\in X_i,y\in X_j}\Big\{\tfrac{1}{2}g^T[y-x]:\ \bar{K}\underbrace{\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big)}_{H(x,y)}+s\ge 0\Big\}. \tag{3.92} \]
As we have seen in item 1°—see (3.89)—one has
\[ H(x,y)=\inf_{\psi\in\mathcal{F}}\tfrac{1}{2}\big[\Phi_{\mathcal{O}}(\psi;A_i(x))+\Phi_{\mathcal{O}}(-\psi;A_j(y))\big], \]
that is, H(x, y) is the infimum of a parametric family of concave functions of (x, y) ∈ X_i × X_j, and as such is concave. Besides this, the optimization problem in (3.92) is feasible whenever s ≥ 0, a feasible solution being y = x = x_ij. At this feasible solution we have g^T[y − x] = 0, implying that w_ij(s) ≥ 0 for s ≥ 0. Observe also that from the concavity of H(x, y) it follows that w_ij(s) is concave on the ray {s ≥ 0}. Finally, we claim that
\[ w_{ij}(\bar{s})\le\mathrm{Risk}^*_\epsilon(\bar{K}),\qquad \bar{s}=-\ln\big(2\sqrt{\epsilon(1-\epsilon)}\big). \tag{3.93} \]
Indeed, w_ij(s) is nonnegative, concave, and bounded (since X_i, X_j are compact) on R_+, implying that w_ij(s) is continuous on {s > 0}. Assuming, contrary to what we need to prove, that w_ij(s̄) > Risk*_ε(K̄), there exists s′ ∈ (0, s̄) such that w_ij(s′) > Risk*_ε(K̄), and thus there exist x̄ ∈ X_i, ȳ ∈ X_j such that (x̄, ȳ) is feasible for the optimization problem specifying w_ij(s′) and (3.91) takes place. We have seen in item 2° that the latter relation implies that for µ = A_i(x̄), ν = A_j(ȳ) it holds
\[ \Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^{\bar{K}}\le 2\sqrt{\epsilon(1-\epsilon)}, \]
that is,
\[ \bar{K}\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big)+\bar{s}\le 0. \]
Hence,
\[ \bar{K}\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big)+s'<0, \]
contradicting the feasibility of (x̄, ȳ) for the optimization problem specifying w_ij(s′).
It remains to note that (3.93) combines with the concavity of w_ij(·) and the relation w_ij(0) ≥ 0 to imply that
\[ w_{ij}(\ln(2I/\epsilon))\le\vartheta w_{ij}(\bar{s})\le\vartheta\,\mathrm{Risk}^*_\epsilon(\bar{K}),\qquad\text{where }\vartheta=\ln(2I/\epsilon)/\bar{s}=\frac{2\ln(2I/\epsilon)}{\ln\big([4\epsilon(1-\epsilon)]^{-1}\big)}. \]
Invoking (3.90), we conclude that
\[ \mathrm{Opt}_{ij}(\bar{K})=w_{ij}(\ln(2I/\epsilon))\le\vartheta\,\mathrm{Risk}^*_\epsilon(\bar{K})\quad\forall i,j. \]
Finally, from (3.90) it immediately follows that Opt_ij(K) is nonincreasing in K (as K grows, the feasible set of the optimization problem in (3.90) shrinks), so that for K ≥ K̄ we have
\[ \mathrm{Opt}(K)\le\mathrm{Opt}(\bar{K})=\max_{i,j}\mathrm{Opt}_{ij}(\bar{K})\le\vartheta\,\mathrm{Risk}^*_\epsilon(\bar{K}), \]
and (3.14) follows. ✷

3.6.2 Verifying 1-convexity of the conditional quantile
Let r be a nonvanishing probability distribution on S, and let
\[ F_m(r)=\sum_{i=1}^mr_i,\quad 1\le m\le M, \]
so that 0 < F_1(r) < F_2(r) < ... < F_M(r) = 1. Denoting by P the set of all nonvanishing probability distributions on S, observe that for every r ∈ P, χ_α[r] is a piecewise linear function of α ∈ [0, 1] with breakpoints 0, F_1(r), F_2(r), F_3(r), ..., F_M(r), the values of the function at these breakpoints being s_1, s_1, s_2, s_3, ..., s_M. In particular, this function is equal to s_1 on [0, F_1(r)] and is strictly increasing on [F_1(r), 1]. Now let s ∈ R, and let
\[ P^{\le}_\alpha[s]=\{r\in P:\ \chi_\alpha[r]\le s\},\qquad P^{\ge}_\alpha[s]=\{r\in P:\ \chi_\alpha[r]\ge s\}. \]
Observe that the just introduced sets are cut off P by nonstrict linear inequalities, specifically:
• when s < s_1, we have P^≤_α[s] = ∅, P^≥_α[s] = P;
• when s = s_1, we have P^≤_α[s] = {r ∈ P : F_1(r) ≥ α}, P^≥_α[s] = P;
• when s > s_M, we have P^≤_α[s] = P, P^≥_α[s] = ∅;
• when s_1 < s ≤ s_M, for every r ∈ P the equation χ_γ[r] = s in the variable γ ∈ [0, 1] has exactly one solution γ(r), which can be found as follows: we specify k = k_s ∈ {1, ..., M−1} such that s_k < s ≤ s_{k+1} and set
\[ \gamma(r)=\frac{(s_{k+1}-s)F_k(r)+(s-s_k)F_{k+1}(r)}{s_{k+1}-s_k}. \]
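The breakpoint formula for γ(r) can be checked directly on a small example (the piecewise linear χ_α[r] below follows the description above; the particular numbers are made up):

```python
import numpy as np

svals = np.array([1.0, 2.0, 4.0, 7.0])    # s_1 < s_2 < s_3 < s_4
r = np.array([0.2, 0.3, 0.1, 0.4])        # a nonvanishing distribution
F = np.cumsum(r)                          # F_1(r), ..., F_M(r)

def chi(alpha):
    """chi_alpha[r]: equals s_1 on [0, F_1(r)], then interpolates the breakpoints."""
    return np.interp(alpha, np.concatenate(([0.0], F)),
                     np.concatenate(([svals[0]], svals)))

s = 3.0                                   # a value with s_1 < s <= s_M
k = np.searchsorted(svals, s) - 1         # k = k_s with s_k < s <= s_{k+1}
gamma = ((svals[k + 1] - s) * F[k] + (s - svals[k]) * F[k + 1]) / (svals[k + 1] - svals[k])
print(gamma, chi(gamma))                  # chi_{gamma(r)}[r] recovers s
```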
Since χ_α[r] is strictly increasing in α when α ∈ [F_1(r), 1], for s ∈ (s_1, s_M] we have
\[
\begin{array}{rcl}
P^{\le}_\alpha[s]&=&\{r\in P:\ \alpha\le\gamma(r)\}=\Big\{r\in P:\ \frac{(s_{k+1}-s)F_k(r)+(s-s_k)F_{k+1}(r)}{s_{k+1}-s_k}\ge\alpha\Big\},\\[4pt]
P^{\ge}_\alpha[s]&=&\{r\in P:\ \alpha\ge\gamma(r)\}=\Big\{r\in P:\ \frac{(s_{k+1}-s)F_k(r)+(s-s_k)F_{k+1}(r)}{s_{k+1}-s_k}\le\alpha\Big\}.
\end{array}
\]
As an immediate consequence of this description, given α ∈ [0, 1] and τ ∈ T and setting
\[ G_{\tau,\mu}(p)=\sum_{\iota=1}^{\mu}p(\iota,\tau),\quad 1\le\mu\le M, \]
and
\[ X^{s,\le}=\{p(\cdot,\cdot)\in X:\ \chi_\alpha[p^\tau]\le s\},\qquad X^{s,\ge}=\{p(\cdot,\cdot)\in X:\ \chi_\alpha[p^\tau]\ge s\}, \]
we get
\[
\begin{array}{lcl}
s<s_1&\Rightarrow& X^{s,\le}=\emptyset,\ X^{s,\ge}=X,\\
s=s_1&\Rightarrow& X^{s,\le}=\{p\in X:\ G_{\tau,1}(p)\ge\alpha G_{\tau,M}(p)\},\ X^{s,\ge}=X,\\
s>s_M&\Rightarrow& X^{s,\le}=X,\ X^{s,\ge}=\emptyset,\\
s_1<s\le s_M&\Rightarrow& X^{s,\le}=\Big\{p\in X:\ \frac{(s_{k+1}-s)G_{\tau,k}(p)+(s-s_k)G_{\tau,k+1}(p)}{s_{k+1}-s_k}\ge\alpha G_{\tau,M}(p)\Big\},\\
&& X^{s,\ge}=\Big\{p\in X:\ \frac{(s_{k+1}-s)G_{\tau,k}(p)+(s-s_k)G_{\tau,k+1}(p)}{s_{k+1}-s_k}\le\alpha G_{\tau,M}(p)\Big\},\quad k=k_s:\ s_k<s\le s_{k+1},
\end{array}
\]
implying 1-convexity of the conditional quantile on X (recall that the G_{τ,µ}(p) are linear in p). ✷

3.6.3 Proof of Proposition 3.4
3.6.3.1 Proof of Proposition 3.4.i
We call step ℓ essential if at this step rule 2d is invoked.
1°. Let x ∈ X be the true signal underlying the observation ω̄^K, so that ω̄_1, ..., ω̄_K are drawn from the distribution p_{A(x)} independently of each other. Consider the “ideal” estimate given by exactly the same rules as the Bisection procedure in Section 3.2.4.2 (in the sequel, we refer to the latter as the “true” one), with the tests T^K_{Δ_{ℓ,rg},r}(·), T^K_{Δ_{ℓ,lf},l}(·) in rule 2d replaced with the “ideal tests”
\[ \widehat{T}_{\Delta_{\ell,\mathrm{rg}},\mathrm{r}}=\widehat{T}_{\Delta_{\ell,\mathrm{lf}},\mathrm{l}}=\begin{cases}\text{right},& f(x)>c_\ell,\\ \text{left},& f(x)\le c_\ell.\end{cases} \]
Marking by * the entities produced by the resulting fully deterministic procedure, we arrive at the sequence of nested segments Δ*_ℓ = [a*_ℓ, b*_ℓ], 0 ≤ ℓ ≤ L* ≤ L, along with subsegments Δ*_{ℓ,rg} = [c*_ℓ, v*_ℓ], Δ*_{ℓ,lf} = [u*_ℓ, c*_ℓ] of Δ*_{ℓ−1}, defined for all *-essential values of ℓ, and the output segment Δ̄* claimed to contain f(x). Note that the ideal procedure cannot terminate due to arriving at a disagreement, and that f(x), as is immediately seen, is contained in all segments Δ*_ℓ, 0 ≤ ℓ ≤ L*, just as f(x) ∈ Δ̄*.
Let ℒ* be the set of all *-essential values of ℓ. For ℓ ∈ ℒ*, let the event E_ℓ[x] parameterized by x be defined as follows:
\[ E_\ell[x]=\begin{cases}
\big\{\omega^K:\ T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K)=\text{right or }T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K)=\text{right}\big\},& f(x)\le u^*_\ell,\\
\big\{\omega^K:\ T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K)=\text{right}\big\},& u^*_\ell<f(x)\le c^*_\ell,\\
\big\{\omega^K:\ T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K)=\text{left}\big\},& c^*_\ell<f(x)<v^*_\ell,\\
\big\{\omega^K:\ T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K)=\text{left or }T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K)=\text{left}\big\},& f(x)\ge v^*_\ell.
\end{cases} \tag{3.94} \]
2°. Observe that by construction and in view of Proposition 2.27 we have
\[ \forall\ell\in\mathcal{L}^*:\ \mathrm{Prob}_{\omega^K\sim p_{A(x)}\times...\times p_{A(x)}}\{E_\ell[x]\}\le 2\delta. \tag{3.95} \]
Indeed, let ℓ ∈ ℒ*.
• When f(x) ≤ u*_ℓ, we have x ∈ X and f(x) ≤ u*_ℓ ≤ c*_ℓ, implying that E_ℓ[x] takes place only when either the left test T^K_{Δ*_{ℓ,lf},l}, or the right test T^K_{Δ*_{ℓ,rg},r}, or both, accept wrong—right—hypotheses from the pairs of right and left hypotheses. Since the corresponding intervals ([u*_ℓ, c*_ℓ] for the left side test, [c*_ℓ, v*_ℓ] for the right side one) are δ-good left and right, respectively, the risks of the tests do not exceed δ, and the p_{A(x)}-probability of the event E_ℓ[x] is at most 2δ;
• when u*_ℓ < f(x) ≤ c*_ℓ, the event E_ℓ[x] takes place only when the right side test T^K_{Δ*_{ℓ,rg},r} accepts the wrong—right—hypothesis from the pair; as above, this can happen with p_{A(x)}-probability at most δ;
• when c*_ℓ < f(x) < v*_ℓ, the event E_ℓ[x] takes place only if the left test T^K_{Δ*_{ℓ,lf},l} accepts the wrong—left—hypothesis from the pair to which it was applied, which again happens with p_{A(x)}-probability ≤ δ;
• finally, when f(x) ≥ v*_ℓ, the event E_ℓ[x] takes place only when either the left side test T^K_{Δ*_{ℓ,lf},l} or the right side test T^K_{Δ*_{ℓ,rg},r}, or both, accept wrong—left—hypotheses from the pairs; as above, this can happen with p_{A(x)}-probability at most 2δ.
3°. Let L̄ = L̄(ω̄^K) be the last step of the true estimating procedure as run on the observation ω̄^K. We claim that the following holds true:
(!) Let E := ∪_{ℓ∈ℒ*} E_ℓ[x], so that the p_{A(x)}-probability of the event E, the observations stemming from x, is at most 2δL = ε (see (3.17), (3.95)). Assume that ω̄^K ∉ E. Then L̄(ω̄^K) ≤ L*, and only two cases are possible:
A. The true estimating procedure does not terminate due to arriving at a disagreement. In this case L* = L̄(ω̄^K) and the trajectories of the ideal and
the true procedures are identical (same localizers and essential steps, same output segments, etc.), and, in particular, f(x) ∈ Δ̄; or
B. The true estimating procedure terminates due to arriving at a disagreement. Then Δ_ℓ = Δ*_ℓ for ℓ < L̄, and f(x) ∈ Δ̄.
In view of A and B, the p_{A(x)}-probability of the event f(x) ∈ Δ̄ is at least 1 − ε, as claimed in Proposition 3.4.
To prove (!), note that the actions at step ℓ in the ideal and true procedures depend solely on Δ_{ℓ−1} and on the outcome of rule 2d. Taking into account that Δ_0 = Δ*_0, all we need to verify is the following claim:
(!!) Let ω̄^K ∉ E, and let ℓ ≤ L* be such that Δ_{ℓ−1} = Δ*_{ℓ−1}, whence also u_ℓ = u*_ℓ, c_ℓ = c*_ℓ, and v_ℓ = v*_ℓ. Assume that ℓ is essential (given that Δ_{ℓ−1} = Δ*_{ℓ−1}, this may happen if and only if ℓ is *-essential as well). Then either
C. At step ℓ the true procedure terminates due to disagreement, in which case f(x) ∈ Δ̄; or
D. At step ℓ there was no disagreement, in which case Δ_ℓ as given by (3.16) is identical to Δ*_ℓ as given by the ideal counterpart of (3.16) in the case of Δ*_{ℓ−1} = Δ_{ℓ−1}, that is, by the rule
\[ \Delta^*_\ell=\begin{cases}[c_\ell,b_{\ell-1}],& f(x)>c_\ell,\\ [a_{\ell-1},c_\ell],& f(x)\le c_\ell.\end{cases} \tag{3.96} \]
To verify (!!), let $\bar\omega^K$ and $\ell$ satisfy the premise of (!!). Note that due to $\Delta_{\ell-1}=\Delta^*_{\ell-1}$ we have $u_\ell=u^*_\ell$, $c_\ell=c^*_\ell$, and $v_\ell=v^*_\ell$, and thus also $\Delta^*_{\ell,\mathrm{lf}}=\Delta_{\ell,\mathrm{lf}}$, $\Delta^*_{\ell,\mathrm{rg}}=\Delta_{\ell,\mathrm{rg}}$. Consider first the case when the true estimation procedure terminates by disagreement at step $\ell$, so that $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)\ne T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)$. Assuming that $f(x)<u_\ell=u^*_\ell$, the relation $\bar\omega^K\notin E_\ell[x]$ combines with (3.94) to imply that $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\mathrm{left}$, which under disagreement is impossible. Assuming $f(x)>v_\ell=v^*_\ell$, the same argument results in $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\mathrm{right}$, which again is impossible. We conclude that in the case in question $u_\ell\le f(x)\le v_\ell$, i.e., $f(x)\in\bar\Delta$, as claimed in C. C is proved.
Now, suppose that there was a consensus at step $\ell$ in the true estimating procedure. Because $\bar\omega^K\notin E_\ell[x]$, this can happen only in the following four cases:
1. $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=\mathrm{left}$ and $f(x)\le u_\ell=u^*_\ell$;
2. $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=\mathrm{left}$ and $u_\ell<f(x)\le c_\ell=c^*_\ell$;
3. $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\mathrm{right}$ and $c_\ell<f(x)<v_\ell=v^*_\ell$;
4. $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\mathrm{right}$ and $v_\ell\le f(x)$.
Due to the consensus at step $\ell$, in situations 1 and 2 (3.16) says that $\Delta_\ell=[a_{\ell-1},c_\ell]$, which combines with (3.96) and $v_\ell=v^*_\ell$ to imply that $\Delta_\ell=\Delta^*_\ell$. Similarly, in situations 3 and 4, due to the consensus at step $\ell$, (3.16) implies that $\Delta_\ell=[c_\ell,b_{\ell-1}]$, which combines with $u_\ell=u^*_\ell$ and (3.96) to imply that $\Delta_\ell=\Delta^*_\ell$. D is proved. ✷
FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS
3.6.3.2
Proof of Proposition 3.4.ii
There is nothing to prove when $\frac{b_0-a_0}{2}\le\widehat\rho$, since in this case the estimate $\frac{a_0+b_0}{2}$, which does not use observations at all, is $(\widehat\rho,0)$-reliable. From now on we assume that $b_0-a_0>2\widehat\rho$, implying that $L$ is a positive integer.
1o. Observe, first, that if $a$ and $b$ are such that $a$ is lower-feasible, $b$ is upper-feasible, and $b-a>2\rho$, then for every $i\le I_{b,\ge}$ and $j\le I_{a,\le}$ there exists a test, based on $K$ observations, which decides with risk at most $\epsilon$ upon the hypotheses $H_1$, $H_2$ stating that the observations are drawn from $p_{A(x)}$ with $x\in Z^i_{b,\ge}$ ($H_1$) or with $x\in Z^j_{a,\le}$ ($H_2$). Indeed, it suffices to consider the test which accepts $H_1$ and rejects $H_2$ when $\widehat f(\omega^K)\ge\frac{a+b}{2}$, and accepts $H_2$ and rejects $H_1$ otherwise.
2o. With the parameters of Bisection chosen according to (3.19), by the already proved Proposition 3.4.i we have:
E. For every $x\in X$, the $p_{A(x)}$-probability of the event $f(x)\in\bar\Delta$, $\bar\Delta$ being the output segment of our Bisection, is at least $1-\epsilon$.
3o. We claim also that
F.1. Every segment $\Delta=[a,b]$ with $b-a>2\rho$ and lower-feasible $a$ is δ-good (right);
F.2. Every segment $\Delta=[a,b]$ with $b-a>2\rho$ and upper-feasible $b$ is δ-good (left);
F.3. Every κ-maximal δ-good (left or right) segment has length at most $2\rho+\kappa=\widehat\rho$.
As a result, for every essential step $\ell$, the lengths of the segments $\Delta_{\ell,\mathrm{rg}}$ and $\Delta_{\ell,\mathrm{lf}}$ do not exceed $\widehat\rho$.
Let us verify F.1 (the verification of F.2 is completely similar, and F.3 is an immediate consequence of the definitions and F.1-2). Let $[a,b]$ satisfy the premise of F.1. It may happen that $b$ is upper-infeasible, whence $\Delta=[a,b]$ is 0-good (right), and we are done. Now let $b$ be upper-feasible. As we have already seen, whenever $i\le I_{b,\ge}$ and $j\le I_{a,\le}$, the hypotheses stating that the $\omega_k$ are sampled from $p_{A(x)}$ for some $x\in Z^i_{b,\ge}$, respectively for some $x\in Z^j_{a,\le}$, can be decided upon with risk $\le\epsilon$, implying, as in the proof of Proposition 2.25, that
$$\epsilon_{ij\Delta}\le\bigl[2\sqrt{\epsilon(1-\epsilon)}\bigr]^{1/K}.$$
Hence, taking into account that the column and the row sizes of $E_{\Delta,\mathrm{r}}$ do not exceed $NI$,
$$\sigma_{\Delta,\mathrm{r}}\le NI\max_{i,j}\epsilon^K_{ij\Delta}\le NI\bigl[2\sqrt{\epsilon(1-\epsilon)}\bigr]\le\frac{\epsilon}{2L}=\delta$$
(we have used (3.19)); that is, $\Delta$ indeed is δ-good (right).
4o. Let us fix $x\in X$ and consider a trajectory of Bisection, the observation being drawn from $p_{A(x)}$. The output $\bar\Delta$ of the procedure is given by one of the following options:
1. At some step $\ell$ of Bisection, the process terminated according to the rules in 2b or 2c. In the first case, the segment $[c_\ell,b_{\ell-1}]$ has a lower-feasible left endpoint and is not δ-good (right), implying by F.1 that the length of this segment (which is half the length of $\bar\Delta=\Delta_{\ell-1}$) is $\le2\rho$, so that the length $|\bar\Delta|$ of $\bar\Delta$ is at most
$4\rho\le2\widehat\rho$. The same conclusion, by a completely similar argument, holds true if the process terminated at step $\ell$ according to rule 2c.
2. At some step $\ell$ of Bisection, the process terminated due to disagreement. In this case, by F.3, we have $|\bar\Delta|\le2\widehat\rho$.
3. Bisection terminated at step $L$, and $\bar\Delta=\Delta_L$. In this case, the termination clauses in rules 2b, 2c, and 2d were never invoked, clearly implying that $|\Delta_s|\le|\Delta_{s-1}|/2$, $1\le s\le L$, and thus $|\bar\Delta|=|\Delta_L|\le2^{-L}|\Delta_0|\le2\widehat\rho$ (see (3.19)).
Thus, we have $|\bar\Delta|\le2\widehat\rho$, implying that whenever the signal $x\in X$ underlying the observations and the output segment $\bar\Delta$ are such that $f(x)\in\bar\Delta$, the error of the Bisection estimate (which is the midpoint of $\bar\Delta$) is at most $\widehat\rho$. Invoking E, we conclude that the Bisection estimate is $(\widehat\rho,\epsilon)$-reliable. ✷

3.6.4
Proof of Proposition 3.14
Let us fix $\epsilon\in(0,1)$. Setting
$$\rho_K=\tfrac12\Bigl[\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\Bigr]$$
and invoking Corollary 3.13, all we need to prove is that in the case of A.13 one has
$$\limsup_{K\to\infty}\Bigl[\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\Bigr]\le0.\eqno(3.97)$$
To this end, note that in our current situation, (3.48) and (3.52) simplify to
$$\Phi(h,H;Z)=-\tfrac12\ln\mathrm{Det}\bigl(I-\Theta_*^{1/2}H\Theta_*^{1/2}\bigr)+\tfrac12\mathrm{Tr}\Bigl(Z\underbrace{B^T\Bigl(\begin{bmatrix}H&h\\h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\Bigr)B}_{Q(h,H)}\Bigr),$$
$$\widehat\Psi_{+,K}(h,H)=\inf_\alpha\Bigl\{\max_{Z\in\mathcal Z}\bigl[\alpha\Phi(h/\alpha,H/\alpha;Z)-\mathrm{Tr}(QZ)\bigr]+K^{-1}\alpha\ln(2/\epsilon):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Bigr\},$$
$$\widehat\Psi_{-,K}(h,H)=\inf_\alpha\Bigl\{\max_{Z\in\mathcal Z}\bigl[\alpha\Phi(-h/\alpha,-H/\alpha;Z)+\mathrm{Tr}(QZ)\bigr]+K^{-1}\alpha\ln(2/\epsilon):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Bigr\}.$$
Hence
$$\begin{aligned}
\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)
&\le\inf_\alpha\Bigl\{\max_{Z_1,Z_2\in\mathcal Z}\bigl[\alpha\Phi(\bar h/\alpha,\bar H/\alpha;Z_1)-\mathrm{Tr}(QZ_1)+\alpha\Phi(-\bar h/\alpha,-\bar H/\alpha;Z_2)+\mathrm{Tr}(QZ_2)\bigr]\\
&\qquad+2K^{-1}\alpha\ln(2/\epsilon):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Bigr\}\\
&=\inf_\alpha\Bigl\{\max_{Z_1,Z_2\in\mathcal Z}\Bigl[-\tfrac12\alpha\ln\mathrm{Det}\bigl(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\bigr)+2K^{-1}\alpha\ln(2/\epsilon)\\
&\qquad+\tfrac12\mathrm{Tr}\bigl(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\bigr)
+\tfrac12\mathrm{Tr}\bigl(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\bigr)\\
&\qquad+\underbrace{\mathrm{Tr}(Q[Z_2-Z_1])+\tfrac12\mathrm{Tr}\Bigl([Z_1-Z_2]B^T\begin{bmatrix}\bar H&\bar h\\\bar h^T&\end{bmatrix}B\Bigr)}_{T(Z_1,Z_2)}\Bigr]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Bigr\}.
\end{aligned}\eqno(3.98)$$
By (3.57) we have $\tfrac12B^T\Bigl[\begin{smallmatrix}\bar H&\bar h\\\bar h^T&\end{smallmatrix}\Bigr]B=B^T[C^TQC+J]B$, where the only nonzero entry, if any, in the $(d+1)\times(d+1)$ matrix $J$ is in the cell $(d+1,d+1)$. By definition of $B$ (see (3.48)), the only nonzero element, if any, in $\bar J=B^TJB$ is in the cell $(m+1,m+1)$, and we conclude that
$$\tfrac12B^T\begin{bmatrix}\bar H&\bar h\\\bar h^T&\end{bmatrix}B=(CB)^TQ(CB)+\bar J=Q+\bar J$$
(recall that $CB=I_{m+1}$). Now, when $Z_1,Z_2\in\mathcal Z$, the entries of $Z_1$, $Z_2$ in the cell $(m+1,m+1)$ both are equal to 1, whence
$$\tfrac12\mathrm{Tr}\Bigl([Z_1-Z_2]B^T\begin{bmatrix}\bar H&\bar h\\\bar h^T&\end{bmatrix}B\Bigr)=\mathrm{Tr}([Z_1-Z_2]Q)+\mathrm{Tr}([Z_1-Z_2]\bar J)=\mathrm{Tr}([Z_1-Z_2]Q),$$
implying that the quantity $T(Z_1,Z_2)$ in (3.98) is zero, provided $Z_1,Z_2\in\mathcal Z$. Consequently, (3.98) becomes
$$\begin{aligned}
\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\le\inf_\alpha\Bigl\{\max_{Z_1,Z_2\in\mathcal Z}\Bigl[&-\tfrac12\alpha\ln\mathrm{Det}\bigl(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\bigr)+2K^{-1}\alpha\ln(2/\epsilon)\\
&+\tfrac12\mathrm{Tr}\bigl(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\bigr)\\
&+\tfrac12\mathrm{Tr}\bigl(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\bigr)\Bigr]:\\
&\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Bigr\}.
\end{aligned}\eqno(3.99)$$
Now, for an appropriately selected real $c$ independent of $K$, for $\alpha$ allowed by (3.99), and all $Z_1,Z_2\in\mathcal Z$ we have (recall that $\mathcal Z$ is bounded)
$$\tfrac12\mathrm{Tr}\bigl(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\bigr)+\tfrac12\mathrm{Tr}\bigl(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\bigr)\le c/\alpha,$$
along with
$$-\tfrac12\alpha\ln\mathrm{Det}\bigl(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\bigr)\le c/\alpha.$$
Therefore, given $\delta>0$, we can find $\alpha=\alpha_\delta>0$ large enough to ensure that
$$-\gamma\alpha_\delta\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha_\delta\Theta_*^{-1}\quad\text{and}\quad 2c/\alpha_\delta\le\delta,$$
which combines with (3.99) to imply that
$$\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\le\delta+2K^{-1}\alpha_\delta\ln(2/\epsilon),$$
and (3.97) follows.
✷
Chapter Four

Signal Recovery by Linear Estimation

OVERVIEW

In this chapter we consider several variations of one of the most basic problems of high-dimensional statistics: signal recovery. In its simplest form the problem is as follows: given a positive definite $m\times m$ matrix $\Gamma$, an $m\times n$ matrix $A$, a $\nu\times n$ matrix $B$, and the indirect noisy observation $[\xi\sim\mathcal N(0,\Gamma)]$
ω = Ax + ξ
(4.1)
of an unknown "signal" $x$ known to belong to a given convex compact subset $\mathcal X$ of $\mathbf R^n$, we want to recover the linear image $Bx\in\mathbf R^\nu$ of $x$. We focus first on the case where the quality of a candidate recovery $\omega\mapsto\widehat x(\omega)$ is quantified by its worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$ error, that is, by the risk
$$\mathrm{Risk}[\widehat x(\cdot)|\mathcal X]=\sup_{x\in\mathcal X}\sqrt{\mathbf E_{\xi\sim\mathcal N(0,\Gamma)}\bigl\{\|\widehat x(Ax+\xi)-Bx\|_2^2\bigr\}}.\eqno(4.2)$$
The simplest and the most studied type of recovery is an affine one: $\widehat x(\omega)=H^T\omega+h$; assuming $\mathcal X$ to be symmetric w.r.t. the origin, we lose nothing when passing from affine estimates to linear ones, those of the form $\widehat x_H(\omega)=H^T\omega$. An advantage of linear estimates is that under favorable circumstances (e.g., when $\mathcal X$ is an ellipsoid), minimizing risk over linear estimates is an efficiently solvable problem, and there exists a huge body of literature on linear estimates optimal in terms of their risk (see, e.g., [6, 57, 82, 155, 156, 197, 206, 207] and references therein). Moreover, in the case of signal recovery from direct observations in white Gaussian noise (the case of $B=A=I_n$, $\Gamma=\sigma^2I_n$), there is a huge body of results on near-optimality of properly selected linear estimates among all possible recovery routines; see, e.g., [79, 88, 106, 124, 198, 230, 239] and references therein. A typical result of this type states that when recovering $x\in\mathcal X$ from direct observation $\omega=x+\sigma\xi$, $\xi\sim\mathcal N(0,I_m)$, where $\mathcal X$ is an ellipsoid of the form $\mathcal X=\{x\in\mathbf R^n:\sum_jj^{2\alpha}x_j^2\le L^2\}$ or the box $\{x\in\mathbf R^n:j^\alpha|x_j|\le L,\ j\le n\}$, with fixed $L<\infty$ and $\alpha>0$, the ratio of the risk of a properly selected linear estimate to the minimax risk
$$\mathrm{Risk_{opt}}[\mathcal X]:=\inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal X]$$
(4.3)
(the infimum is taken over all estimates, not necessarily linear) remains bounded, or even tends to 1, as $\sigma\to+0$, and this happens uniformly in $n$, with $\alpha$ and $L$ fixed.
Similar "near-optimality" results are known for the "diagonal" case, where $\mathcal X$ is an ellipsoid/box and $A$, $B$, $\Gamma$ are diagonal matrices. To the best of our knowledge, the only "general" (that is, not imposing severe restrictions on how the geometries of $\mathcal X$, $A$, $B$, $\Gamma$ are linked to each other) result on optimality of linear estimates is due to D. Donoho, who proved [64] that when recovering a linear form (i.e., in the case of one-dimensional $Bx$), the best risk over all linear estimates is within the factor 1.2 of the minimax risk. The primary goal of this chapter is to establish rather general results on near-optimality of properly built linear estimates as compared to all possible estimates. Results of this type are bound to impose some restrictions on $\mathcal X$, since there are cases (e.g., the case of a high-dimensional $\|\cdot\|_1$ ball $\mathcal X$) where linear estimates are by far nonoptimal. Our restrictions on $\mathcal X$ reduce to the existence of a special type of representation of $\mathcal X$ and are satisfied, e.g., when $\mathcal X$ is the intersection of $K<\infty$ ellipsoids/elliptic cylinders,
$$\mathcal X=\{x\in\mathbf R^n:x^TR_kx\le1,\ 1\le k\le K\}\qquad\Bigl[R_k\succeq0,\ \textstyle\sum_kR_k\succ0\Bigr];\eqno(4.4)$$
in particular, $\mathcal X$ can be a compact polytope, symmetric w.r.t. the origin, given by $2K$ linear inequalities $-1\le r_k^Tx\le1$, $1\le k\le K$, or, equivalently, $\mathcal X=\{x:x^T\underbrace{(r_kr_k^T)}_{R_k}x\le1,\ 1\le k\le K\}$. Another instructive example is a set of the form
$\mathcal X=\{x:\|Sx\|_p\le L\}$, where $p\ge2$ and $S$ is a matrix with trivial kernel. It should be stressed that while imposing some restrictions on $\mathcal X$, we require nothing from $A$, $B$, and $\Gamma$, aside from positive definiteness of the latter matrix. Our main result (Proposition 4.5) states, in particular, that with $\mathcal X$ given by (4.4) and with arbitrary $A$ and $B$, the risk of the properly selected linear estimate $\widehat x_{H_*}$, with both $H_*$ and the risk efficiently computable, satisfies the bound
$$\mathrm{Risk}[\widehat x_{H_*}|\mathcal X]\le O(1)\sqrt{\ln(K+1)}\,\mathrm{Risk_{opt}}[\mathcal X],\eqno(*)$$
where $\mathrm{Risk_{opt}}[\mathcal X]$ is the minimax risk, and $O(1)$ is an absolute constant. Note that the outlined result is an "operational" one: the risk of the provably nearly optimal estimate, and the estimate itself, are given by efficient computation. This is in sharp contrast with traditional results of nonparametric statistics, where near-optimal estimates and their risks are given in "closed analytical form," at the price of severe restrictions on the structure of the "data" $\mathcal X$, $A$, $B$, $\Gamma$. This being said, it should be stressed that one of the crucial components in our construction is quite classical: the idea, going back to M.S. Pinsker [198], of bounding the minimax risk from below via the Bayesian risk associated with a properly selected Gaussian prior.¹
The main body of the chapter originates from [138, 137] and is organized as follows.
• Section 4.1 presents basic results on Conic Programming and Conic Duality, the

¹[88, 198] address the problem of $\|\cdot\|_2$ recovery of a signal $x$ from direct observations ($A=B=I$) in the case when $\mathcal X$ is a high-dimensional ellipsoid with "regularly decreasing half-axes," like $\mathcal X=\{x\in\mathbf R^n:\sum_jj^{2\alpha}x_j^2\le L^2\}$ with $\alpha>0$. In this case Pinsker's construction shows that as $\sigma\to+0$, the risk of a properly built linear estimate is, uniformly in $n$, $(1+o(1))$ times the minimax risk. This is much stronger than $(*)$, and it seems unlikely that a similarly strong result holds true in the general case underlying $(*)$.
principal optimization tools utilized in all subsequent constructions and proofs.
• Section 4.2 contains the problem formulation (Section 4.2.1), the construction of the linear estimate we deal with (Section 4.2.2), and the central result on near-optimality of this estimate (Section 4.2.2.2). We also discuss the "expressive abilities" of the family of sets (we call them ellitopes) to which our main result applies.
• In Section 4.3 we extend the results of the previous section from ellitopes to their "matrix analogs," spectratopes, in the role of signal sets, passing simultaneously from the norm $\|\cdot\|_2$ in which the recovery error is measured to arbitrary spectratopic norms, those for which the unit ball of the conjugate norm is a spectratope. In addition, we allow the observation noise to have nonzero mean and to be non-Gaussian.
• Section 4.4 adjusts our preceding results on linear estimation to the case where the signals to be recovered possess stochastic components.
• Finally, Section 4.5 deals with "uncertain-but-bounded" observation noise, that is, noise selected "by nature," perhaps in an adversarial fashion, from a given bounded set.
4.1 PRELIMINARIES: EXECUTIVE SUMMARY ON CONIC PROGRAMMING

4.1.1 Cones
A cone in Euclidean space $E$ is a nonempty set $K$ which is closed w.r.t. taking conic combinations of its elements, that is, linear combinations with nonnegative coefficients. Equivalently: $K\subset E$ is a cone if $K$ is nonempty, and
• $x,y\in K\ \Rightarrow\ x+y\in K$;
• $x\in K,\ \lambda\ge0\ \Rightarrow\ \lambda x\in K$.
It is immediately seen that a cone is a convex set. We call a cone $K$ regular if it is closed, pointed (that is, it does not contain lines passing through the origin or, equivalently, $K\cap[-K]=\{0\}$), and possesses a nonempty interior. Given a cone $K\subset E$, we can associate with it its dual cone $K^*$ defined as
$$K^*=\{y\in E:\langle y,x\rangle\ge0\ \forall x\in K\}$$
[$\langle\cdot,\cdot\rangle$ is the inner product on $E$].
It is immediately seen that $K^*$ is a closed cone, and $K\subset(K^*)^*$. It is well known that
• if $K$ is a closed cone, it holds $K=(K^*)^*$;
• $K$ is a regular cone if and only if $K^*$ is so.
Examples of regular cones "useful in applications" are as follows:
1. Nonnegative orthants $\mathbf R^d_+=\{x\in\mathbf R^d:x\ge0\}$;
2. Lorentz cones $\mathbf L^d_+=\Bigl\{x\in\mathbf R^d:x_d\ge\sqrt{\sum_{i=1}^{d-1}x_i^2}\Bigr\}$;
3. Semidefinite cones $\mathbf S^d_+$ comprised of positive semidefinite symmetric $d\times d$ matrices. The semidefinite cone $\mathbf S^d_+$ lives in the space $\mathbf S^d$ of symmetric matrices equipped
with the Frobenius inner product
$$\langle A,B\rangle=\mathrm{Tr}(AB^T)=\mathrm{Tr}(AB)=\sum_{i,j=1}^dA_{ij}B_{ij},\qquad A,B\in\mathbf S^d.$$
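As a quick sanity check, both the Frobenius identity and the "easy" half of self-duality of $\mathbf S^d_+$ (that $\langle A,B\rangle\ge0$ whenever $A,B\succeq0$) can be verified numerically. The following sketch (Python/NumPy, purely illustrative; all names are ours) does so on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Frobenius inner product: <A,B> = Tr(AB^T) = sum_ij A_ij B_ij
A = rng.standard_normal((d, d)); A = A + A.T
B = rng.standard_normal((d, d)); B = B + B.T
assert np.isclose(np.trace(A @ B.T), (A * B).sum())

# "Easy" direction of self-duality: <P,Q> >= 0 for PSD P, Q
for _ in range(100):
    G = rng.standard_normal((d, d)); P = G @ G.T   # PSD by construction
    H = rng.standard_normal((d, d)); Q = H @ H.T   # PSD by construction
    assert np.trace(P @ Q) >= -1e-12
print("Frobenius identity and PSD self-duality spot-check passed")
```

The inequality $\mathrm{Tr}(PQ)\ge0$ for $P=GG^T$ is just $\mathrm{Tr}(GG^TQ)=\mathrm{Tr}(G^TQG)\ge0$.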
All cones listed so far are self-dual.
4. Let $\|\cdot\|$ be a norm on $\mathbf R^n$. The set $\{[x;t]\in\mathbf R^n\times\mathbf R:t\ge\|x\|\}$ is a regular cone, and the dual cone is $\{[y;\tau]:\|y\|_*\le\tau\}$, where
$$\|y\|_*=\max_x\{x^Ty:\|x\|\le1\}$$
is the norm on $\mathbf R^n$ conjugate to $\|\cdot\|$.
An additional example of a regular cone useful for the sequel is the conic hull of a convex compact set, defined as follows. Let $T$ be a convex compact set with a nonempty interior in Euclidean space $E$. We can associate with $T$ its closed conic hull
$$\mathbf T=\mathrm{cl}\,\underbrace{\bigl\{[t;\tau]\in E^+=E\times\mathbf R:\tau>0,\ t/\tau\in T\bigr\}}_{K^o(T)}.$$
It is immediately seen that $\mathbf T$ is a regular cone, and that to get this cone, one should add to the convex set $K^o(T)$ the origin of $E^+$. It is also clear that one can "see $T$ in $\mathbf T$": $T$ is nothing but the cross-section of the cone $\mathbf T$ by the hyperplane $\tau=1$ in $E^+=\{[t;\tau]\}$:
$$T=\{t\in E:[t;1]\in\mathbf T\}.$$
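The cross-section property $T=\{t:[t;1]\in\mathbf T\}$, as well as the norm-cone duality from example 4, lends itself to a direct numerical check. In the sketch below (illustrative only; taking $T$ to be the unit box and the norm pair $\|\cdot\|_1$, $\|\cdot\|_\infty$ is our choice of test data, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# T = unit box in R^2 (convex, compact, nonempty interior)
in_T = lambda t: bool(np.all(np.abs(t) <= 1))
# membership in K^o(T): tau > 0 and t/tau in T
in_cone = lambda t, tau: tau > 0 and in_T(t / tau)

# cross-section of the conic hull by the hyperplane tau = 1 recovers T
for _ in range(200):
    t = rng.uniform(-2, 2, size=2)
    assert in_cone(t, 1.0) == in_T(t)

# norm-cone duality: K = {[x;t] : t >= ||x||_1} has K* = {[y;s] : ||y||_inf <= s};
# check <[x;t],[y;s]> >= 0 on samples from the two cones
for _ in range(200):
    x = rng.uniform(-1, 1, size=3); t = np.abs(x).sum() + rng.uniform(0, 1)
    y = rng.uniform(-1, 1, size=3); s = np.abs(y).max() + rng.uniform(0, 1)
    assert x @ y + t * s >= -1e-12
print("conic-hull cross-section and norm-cone duality checks passed")
```

The last inequality is just Hölder: $x^Ty\ge-\|x\|_1\|y\|_\infty\ge-ts$.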
It is easily seen that the cone $\mathbf T_*$ dual to $\mathbf T$ is given by
$$\mathbf T_*=\{[g;s]\in E^+:s\ge\phi_T(-g)\},\quad\text{where }\phi_T(g)=\max_{t\in T}\langle g,t\rangle$$
is the support function of $T$.

4.1.2 Conic problems and their duals
Given regular cones $K_i\subset E_i$, $1\le i\le m$, consider an optimization problem of the form
$$\mathrm{Opt}(P)=\min_x\Bigl\{\langle c,x\rangle:\ A_ix-b_i\in K_i,\ i=1,...,m;\ Rx=r\Bigr\},\eqno(P)$$
where $x\mapsto A_ix-b_i$ are affine mappings acting from some Euclidean space $E$ to the spaces $E_i$ where the cones $K_i$ live. A problem in this form is called a conic problem on the cones $K_1,...,K_m$; the constraints $A_ix-b_i\in K_i$ on $x$ are called conic constraints. We call a conic problem $(P)$ strictly feasible if it admits a strictly feasible solution $\bar x$, meaning that $\bar x$ satisfies the equality constraints and satisfies strictly the conic constraints, i.e., $A_i\bar x-b_i\in\mathrm{int}\,K_i$.
One can associate with conic problem $(P)$ its dual, which also is a conic problem. The origin of the dual problem is the desire to obtain lower bounds on the optimal value $\mathrm{Opt}(P)$ of the primal problem $(P)$ in a systematic way, by linear aggregation
of constraints. Linear aggregation of constraints works as follows: let us equip every conic constraint $A_ix-b_i\in K_i$ with an aggregation weight, called a Lagrange multiplier, $y_i$ restricted to reside in the cone $K_i^*$ dual to $K_i$. Similarly, we equip the system $Rx=r$ of equality constraints in $(P)$ with a Lagrange multiplier $z$, a vector of the same dimension as $r$. Now let $x$ be a feasible solution to the conic problem, and let $y_i\in K_i^*$, $i\le m$, and $z$ be Lagrange multipliers. By the definition of the dual cone and due to $A_ix-b_i\in K_i$, $y_i\in K_i^*$, we have $\langle y_i,A_ix\rangle\ge\langle y_i,b_i\rangle$, $1\le i\le m$, and of course $z^TRx=r^Tz$. Summing up all resulting inequalities, we arrive at the scalar linear inequality
$$\Bigl\langle R^*z+\sum_iA_i^*y_i,\ x\Bigr\rangle\ge r^Tz+\sum_i\langle b_i,y_i\rangle,\eqno(!)$$
where $A_i^*$ are the conjugates of $A_i$: $\langle y,A_ix\rangle_{E_i}\equiv\langle A_i^*y,x\rangle_E$, and $R^*$ is the conjugate of $R$. By its origin, (!) is a consequence of the system of constraints in $(P)$ and as such is satisfied everywhere on the feasible domain of the problem. If we are lucky to get the objective of $(P)$ as the linear function of $x$ in the left hand side of (!), that is, if
$$R^*z+\sum_iA_i^*y_i=c,$$
(!) imposes a lower bound on the objective of the primal conic problem $(P)$ everywhere on the feasible domain of the primal problem, and the conic dual of $(P)$ is the problem
$$\mathrm{Opt}(D)=\max_{y_i,z}\Bigl\{r^Tz+\sum_i\langle b_i,y_i\rangle:\ y_i\in K_i^*,\ 1\le i\le m;\ R^*z+\sum_{i=1}^mA_i^*y_i=c\Bigr\}\eqno(D)$$
of maximizing this lower bound on $\mathrm{Opt}(P)$. The relations between the primal and the dual conic problems are the subject of the standard Conic Duality Theorem, as follows:

Theorem 4.1. [Conic Duality Theorem] Consider conic problem $(P)$ (where all $K_i$ are regular cones) along with its dual problem $(D)$. Then
1. Duality is symmetric: the dual problem $(D)$ is conic, and the conic dual of $(D)$ is (equivalent to) $(P)$;
2. Weak duality: it always holds that $\mathrm{Opt}(D)\le\mathrm{Opt}(P)$;
3. Strong duality: if one of the problems $(P)$, $(D)$ is strictly feasible and bounded,² then the other problem in the pair is solvable, and the optimal values of the problems are equal to each other. In particular, if both $(P)$ and $(D)$ are strictly feasible, then both problems are solvable with equal optimal values.

Remark 4.2. While the Conic Duality Theorem in the form just presented meets all our subsequent needs, it makes sense to note that in fact the Strong Duality part of

²For a minimization problem, boundedness means that the objective is bounded from below on the feasible set; for a maximization problem, that it is bounded from above on the feasible set.
the theorem can be strengthened by replacing strict feasibility with "essential strict feasibility," defined as follows: a conic problem in the form of $(P)$ (or, which is the same, of $(D)$) is called essentially strictly feasible if it admits a feasible solution $\bar x$ which satisfies strictly the non-polyhedral conic constraints, that is, $A_i\bar x-b_i\in\mathrm{int}\,K_i$ for all $i$ for which the cone $K_i$ is not polyhedral, i.e., is not given by a finite list of homogeneous linear inequality constraints.
The proof of the Conic Duality Theorem can be found in numerous sources, e.g., in [187, Section 7.1.3].

4.1.3
Schur Complement Lemma
The following simple fact is extremely useful:

Lemma 4.3. [Schur Complement Lemma] A symmetric block matrix
$$A=\begin{bmatrix}P&Q^T\\Q&R\end{bmatrix}$$
with $R\succ0$ is positive (semi)definite if and only if the matrix $P-Q^TR^{-1}Q$ is so.

Proof. With $u$, $v$ of the same sizes as $P$, $R$, we have
$$\min_v\,[u;v]^TA[u;v]=u^T[P-Q^TR^{-1}Q]u$$
(direct computation utilizing the fact that $R\succ0$). It follows that the quadratic form associated with $A$ is nonnegative everywhere if and only if the quadratic form with the matrix $P-Q^TR^{-1}Q$ is nonnegative everywhere (since the latter quadratic form is obtained from the former one by partial minimization). ✷
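The lemma is easy to test numerically: for random symmetric block matrices with $R\succ0$, positive semidefiniteness of $A$ and of its Schur complement $P-Q^TR^{-1}Q$ must coincide. A minimal illustrative sketch (NumPy; the test-matrix construction is ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def is_psd(M, tol=1e-7):
    """Positive semidefiniteness via the smallest eigenvalue of the symmetric part."""
    return bool(np.min(np.linalg.eigvalsh((M + M.T) / 2)) >= -tol)

n, m = 3, 2
for _ in range(100):
    G = rng.standard_normal((m, m))
    R = G @ G.T + np.eye(m)                   # R > 0 (positive definite)
    Q = rng.standard_normal((m, n))
    P = rng.standard_normal((n, n)); P = P @ P.T + rng.uniform(-1, 1) * np.eye(n)
    A = np.block([[P, Q.T], [Q, R]])          # symmetric block matrix
    schur = P - Q.T @ np.linalg.solve(R, Q)   # P - Q^T R^{-1} Q
    assert is_psd(A) == is_psd(schur)         # the lemma's equivalence
print("Schur complement criterion agrees with the direct eigenvalue test")
```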
4.2 NEAR-OPTIMAL LINEAR ESTIMATION FROM GAUSSIAN OBSERVATIONS

4.2.1 Situation and goal
Given an m × n matrix A, a ν × n matrix B, and an m × m matrix Γ ≻ 0, consider the problem of estimating the linear image Bx of an unknown signal x known to belong to a given set X ⊂ Rn via noisy observation ω = Ax + ξ, ξ ∼ N (0, Γ),
(4.5)
where $\xi$ is the observation noise. A candidate estimate in this case is a (Borel) function $\widehat x(\cdot):\mathbf R^m\to\mathbf R^\nu$, and the performance of such an estimate in what follows will be quantified by the Euclidean risk $\mathrm{Risk}[\widehat x|\mathcal X]$ defined by (4.2).

4.2.1.1
Ellitopes
From now on we assume that $\mathcal X\subset\mathbf R^n$ is a set given by
$$\mathcal X=\bigl\{x\in\mathbf R^n:\exists(y\in\mathbf R^{\bar n},t\in\mathcal T):\ x=Py,\ y^TR_ky\le t_k,\ 1\le k\le K\bigr\},$$
(4.6)
where
• $P$ is an $n\times\bar n$ matrix,
• $R_k\succeq0$ are $\bar n\times\bar n$ matrices with $\sum_kR_k\succ0$,
• $\mathcal T$ is a nonempty computationally tractable convex compact subset of $\mathbf R^K_+$ intersecting the interior of $\mathbf R^K_+$ and such that $\mathcal T$ is monotone, meaning that the relations $0\le\tau\le t$ and $t\in\mathcal T$ imply that $\tau\in\mathcal T$.³
Note that under our assumptions $\mathrm{int}\,\mathcal T\ne\emptyset$. In the sequel, we refer to a set of the form (4.6) with data $[P,\{R_k,1\le k\le K\},\mathcal T]$ satisfying the assumptions just formulated as an ellitope, and to (4.6) as an ellitopic representation of $\mathcal X$. Here are instructive examples of ellitopes (in all these examples, $P$ is the identity mapping; in the sequel, we call ellitopes of this type basic):
• when $K=1$, $\mathcal T=[0,1]$, and $R_1\succ0$, $\mathcal X$ is the ellipsoid $\{x:x^TR_1x\le1\}$;
• when $K\ge1$ and $\mathcal T=\{t\in\mathbf R^K:0\le t_k\le1,\ k\le K\}$, $\mathcal X$ is the intersection
$$\bigcap_{1\le k\le K}\{x:x^TR_kx\le1\}$$
of ellipsoids/elliptic cylinders centered at the origin. In particular, when $U$ is a $K\times n$ matrix of rank $n$ with rows $u_k^T$, $1\le k\le K$, and $R_k=u_ku_k^T$, $\mathcal X$ is the polytope $\{x:\|Ux\|_\infty\le1\}$, symmetric w.r.t. the origin;
• when $U$, $u_k$, and $R_k$ are as in the latter example and $\mathcal T=\{t\in\mathbf R^K_+:\sum_kt_k^{p/2}\le1\}$ for some $p\ge2$, we get $\mathcal X=\{x:\|Ux\|_p\le1\}$.
It should be added that the family of ellitope-representable sets is quite rich: this family admits a "calculus," so that more ellitopes can be constructed by taking intersections, direct products, linear images (direct and inverse), or arithmetic sums of ellitopes given by the above examples. In fact, the property of being an ellitope is preserved by nearly all basic operations with sets preserving convexity and symmetry w.r.t. the origin (a regrettable exception is taking the convex hull of a finite union); see Section 4.6.
As another example of an ellitope, instructive in the context of nonparametric statistics, consider the situation where our signals $x$ are discretizations of functions of a continuous argument running through a compact $d$-dimensional domain $D$, and the functions $f$ we are interested in are those satisfying a Sobolev-type smoothness constraint, namely, an upper bound on the $L_p(D)$-norm of $Lf$, where $L$ is a linear differential operator with constant coefficients. After discretization, this restriction can be modeled as $\|Lx\|_p\le1$, with a properly selected matrix $L$. As we already know from the above example, when $p\ge2$, the set $\mathcal X=\{x:\|Lx\|_p\le1\}$ is an ellitope, and as such is captured by our machinery. Note also that by the outlined calculus, imposing on the functions $f$ in question several Sobolev-type smoothness constraints with parameters $p\ge2$ still results in a set of signals which is an ellitope.
³The latter relation is "for free": given a nonempty convex compact set $\mathcal T\subset\mathbf R^K_+$, the right-hand side of (4.6) remains intact when passing from $\mathcal T$ to its "monotone hull" $\{\tau\in\mathbf R^K_+:\exists t\in\mathcal T:\tau\le t\}$, which already is a monotone convex compact set.
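The third basic example above, $\mathcal X=\{x:\|Ux\|_p\le1\}$, can be checked directly: taking $t_k=(u_k^Tx)^2$ shows that the ellitopic description with $\mathcal T=\{t\ge0:\sum_kt_k^{p/2}\le1\}$ matches the norm ball exactly. An illustrative sketch (NumPy; $U$ and $p$ are arbitrary test data of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
K, n, p = 5, 3, 4.0
U = rng.standard_normal((K, n))     # rows u_k^T; R_k = u_k u_k^T

def in_X(x):
    """X = {x : ||Ux||_p <= 1}."""
    return bool(np.sum(np.abs(U @ x) ** p) <= 1.0)

def in_ellitope(x):
    """Basic ellitope with T = {t >= 0 : sum_k t_k^{p/2} <= 1}; the smallest
    admissible t is t_k = (u_k^T x)^2, so membership reduces to one inequality."""
    t = (U @ x) ** 2                # x^T R_k x = (u_k^T x)^2
    return bool(np.sum(t ** (p / 2)) <= 1.0)

for _ in range(300):
    x = rng.uniform(-1, 1, size=n)
    assert in_X(x) == in_ellitope(x)
print("ellitope representation of {x : ||Ux||_p <= 1} verified on samples")
```

The two membership tests coincide because $\sum_kt_k^{p/2}=\sum_k|u_k^Tx|^p=\|Ux\|_p^p$ at the minimal choice of $t$.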
4.2.1.2
Estimates and their risks
In the outlined situation, a candidate estimate is a Borel function $\widehat x(\cdot):\mathbf R^m\to\mathbf R^\nu$; given observation (4.5), we recover $w=Bx$ as $\widehat x(\omega)$. In the sequel, we quantify the quality of an estimate by its worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$ recovery error
$$\mathrm{Risk}[\widehat x|\mathcal X]=\sup_{x\in\mathcal X}\Bigl[\mathbf E_{\xi\sim\mathcal N(0,\Gamma)}\bigl\{\|\widehat x(Ax+\xi)-Bx\|_2^2\bigr\}\Bigr]^{1/2},$$
and define the optimal, or minimax, risk as
$$\mathrm{Risk_{opt}}[\mathcal X]=\inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal X],\eqno(4.7)$$
where the infimum is taken over all Borel candidate estimates.

4.2.1.3
Main goal
The main goal of what follows is to demonstrate that an estimate linear in $\omega$,
$$\widehat x_H(\omega)=H^T\omega,$$
(4.8)
with a properly selected, efficiently computable matrix $H$, is near-optimal in terms of its risk.
Our first observation is that when $\mathcal X$ is the ellitope (4.6), replacing the matrices $A$ and $B$ with $AP$ and $BP$, respectively, we pass from the initial estimation problem of interest to the transformed problem, where the signal set is
$$\bar{\mathcal X}=\{y\in\mathbf R^{\bar n}:\exists t\in\mathcal T:y^TR_ky\le t_k,\ 1\le k\le K\},$$
and we want to recover $[BP]y$, $y\in\bar{\mathcal X}$, via observation $\omega=[AP]y+\xi$. It is obvious that the considered families of estimates (the family of all linear estimates and the family of all estimates), like the risks of the estimates, remain intact under this transformation; in particular,
$$\mathrm{Risk}[\widehat x|\mathcal X]=\sup_{y\in\bar{\mathcal X}}\Bigl(\mathbf E_\xi\bigl\{\|\widehat x([AP]y+\xi)-[BP]y\|_2^2\bigr\}\Bigr)^{1/2}.$$
Therefore, to save notation, from now on, unless explicitly stated otherwise, we assume that the matrix $P$ is the identity, so that $\mathcal X$ is the basic ellitope
$$\mathcal X=\bigl\{x\in\mathbf R^n:\exists t\in\mathcal T:x^TR_kx\le t_k,\ 1\le k\le K\bigr\}.\eqno(4.9)$$
We assume in the sequel that $B\ne0$, since otherwise one has $Bx=0$ for all $x\in\mathcal X$, and the estimation problem is trivial.

4.2.2
Building a linear estimate
We start with building a “presumably good” linear estimate. Restricting ourselves to linear estimates (4.8), we may be interested in the estimate with the smallest
risk, that is, the estimate associated with a $\nu\times m$ matrix $H$ which is an optimal solution to the optimization problem
$$\min_H\bigl\{R(H):=\mathrm{Risk}^2[\widehat x_H|\mathcal X]\bigr\}.$$
We have
$$R(H)=\max_{x\in\mathcal X}\mathbf E_\xi\{\|H^T\omega-Bx\|_2^2\}=\mathbf E_\xi\{\|H^T\xi\|_2^2\}+\max_{x\in\mathcal X}\|H^TAx-Bx\|_2^2=\mathrm{Tr}(H^T\Gamma H)+\max_{x\in\mathcal X}x^T(H^TA-B)^T(H^TA-B)x.$$
This function, while convex, can be hard to compute. For this reason, we use a linear estimate yielded by minimizing an efficiently computable convex upper bound on $R(H)$, which is built as follows. Let $\phi_{\mathcal T}$ be the support function of $\mathcal T$:
$$\phi_{\mathcal T}(\lambda)=\max_{t\in\mathcal T}\lambda^Tt:\ \mathbf R^K\to\mathbf R.$$
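For the sets $\mathcal T$ from our examples, $\phi_{\mathcal T}$ is available in closed form; e.g., for $\mathcal T=\{t\ge0:\sum_kt_k^{p/2}\le1\}$ with $p=4$ one gets $\phi_{\mathcal T}(\lambda)=\|\lambda\|_2$ for $\lambda\ge0$, by the Cauchy-Schwarz inequality. A small illustrative check (NumPy; a sketch under this specific choice of $\mathcal T$, not a general-purpose routine):

```python
import numpy as np

rng = np.random.default_rng(5)
K = 6

# T = {t >= 0 : sum_k t_k^2 <= 1}, i.e., the set T for p = 4; for lambda >= 0
# the support function has the closed form phi_T(lambda) = ||lambda||_2.
lam = rng.uniform(0.1, 1.0, size=K)
phi_closed = np.linalg.norm(lam)

best = 0.0
for _ in range(5000):
    t = np.abs(rng.standard_normal(K))
    t /= max(1.0, np.linalg.norm(t))        # push the sample into T
    assert lam @ t <= phi_closed + 1e-9     # lambda^T t never exceeds phi_T(lambda)
    best = max(best, lam @ t)

t_star = lam / np.linalg.norm(lam)          # the maximizer attains the closed form
assert abs(lam @ t_star - phi_closed) <= 1e-9
print("phi_T(lambda) =", phi_closed, "; best sampled value:", best)
```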
Observe that whenever $\lambda\in\mathbf R^K_+$ and $H$ are such that
$$[B-H^TA]^T[B-H^TA]\preceq\sum_k\lambda_kR_k,\eqno(4.10)$$
for $x\in\mathcal X$ it holds
$$\|Bx-H^TAx\|_2^2\le\phi_{\mathcal T}(\lambda).\eqno(4.11)$$
Indeed, in the case of (4.10) and with $x\in\mathcal X$, there exists $t\in\mathcal T$ such that $x^TR_kx\le t_k$ for all $k$, and consequently the vector $\bar t$ with entries $\bar t_k=x^TR_kx$ also belongs to $\mathcal T$, whence
$$\|Bx-H^TAx\|_2^2=x^T[B-H^TA]^T[B-H^TA]x\le\sum_k\lambda_kx^TR_kx=\lambda^T\bar t\le\phi_{\mathcal T}(\lambda),$$
which combines with (4.9) to imply (4.11). From (4.11) it follows that if $H$ and $\lambda\ge0$ are linked by (4.10), then
$$\mathrm{Risk}^2[\widehat x_H|\mathcal X]=\max_{x\in\mathcal X}\mathbf E\bigl\{\|Bx-H^T(Ax+\xi)\|_2^2\bigr\}$$
$$=\mathrm{Tr}(H^T\Gamma H)+\max_{x\in\mathcal X}\|[B-H^TA]x\|_2^2\ \le\ \mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda).$$
We see that the efficiently computable convex function
$$\widehat R(H)=\inf_\lambda\Bigl\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge0\Bigr\}$$
(which clearly is well defined due to compactness of $\mathcal T$ combined with $\sum_kR_k\succ0$) is an upper bound on $R(H)$.⁴ Note that by the Schur Complement Lemma the matrix

⁴It is well known that when $K=1$ (i.e., $\mathcal X$ is an ellipsoid), the above bounding scheme is exact: $\widehat R(\cdot)\equiv R(\cdot)$. For more complicated $\mathcal X$'s, $\widehat R(\cdot)$ could be larger than $R(\cdot)$, although the ratio $\widehat R(\cdot)/R(\cdot)$ is bounded by $O(\log(K))$; see Section 4.2.3.
inequality $(B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k$ is equivalent to the matrix inequality
$$\begin{bmatrix}\sum_k\lambda_kR_k&B^T-A^TH\\B-H^TA&I_\nu\end{bmatrix}\succeq0,$$
which is linear in $H$, $\lambda$. We have arrived at the following result:

Proposition 4.4. In the situation of this section, the risk of the "presumably good" linear estimate $\widehat x_{H_*}(\omega)=H_*^T\omega$ yielded by an optimal solution $(H_*,\lambda_*)$ to the (clearly solvable) convex optimization problem
$$\begin{aligned}
\mathrm{Opt}&=\min_{H,\lambda}\Bigl\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge0\Bigr\}\\
&=\min_{H,\lambda}\Bigl\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ \begin{bmatrix}\sum_k\lambda_kR_k&B^T-A^TH\\B-H^TA&I_\nu\end{bmatrix}\succeq0,\ \lambda\ge0\Bigr\}
\end{aligned}\eqno(4.12)$$
is upper-bounded by $\sqrt{\mathrm{Opt}}$.

4.2.2.1
Illustration: Recovering temperature distribution
Situation: A square steel plate was somewhat heated at time 0 and left to cool, the temperature along the perimeter of the plate being kept zero all the time. At time $t_1$, we measure the temperatures at $m$ points of the plate, and want to recover the distribution of the temperature along the plate at a given time $t_0$, $0<t_0<t_1$.
Physics, after suitable discretization of the spatial variables, offers the following model of the situation. We represent the distribution of temperature at time $t$ as the $(2N-1)\times(2N-1)$ matrix $U(t)=[u_{ij}(t)]_{i,j=1}^{2N-1}$, where $u_{ij}(t)$ is the temperature, at time $t$, at the point
$$P_{ij}=(p_i,p_j),\quad p_k=k/N-1,\quad 1\le i,j\le2N-1,$$
of the plate (in our model, the plate occupies the square $S=\{(p,q):|p|\le1,|q|\le1\}$). Here the positive integer $N$ is responsible for the spatial discretization. For $1\le k\le2N-1$, let us specify functions $\phi_k(s)$ on the segment $-1\le s\le1$ as follows:
$$\phi_{2\ell-1}(s)=c_{2\ell-1}\cos(\omega_{2\ell-1}s),\ \phi_{2\ell}(s)=c_{2\ell}\sin(\omega_{2\ell}s),\quad\omega_{2\ell-1}=(\ell-\tfrac12)\pi,\ \omega_{2\ell}=\ell\pi,$$
where the $c_k$ are readily given by the normalization condition $\sum_{i=1}^{2N-1}\phi_k^2(p_i)=1$; note that $\phi_k(\pm1)=0$. It is immediately seen that the matrices
$$\Phi^{k\ell}=[\phi_k(p_i)\phi_\ell(p_j)]_{i,j=1}^{2N-1},\quad1\le k,\ell\le2N-1,$$
form an orthonormal basis in the space of $(2N-1)\times(2N-1)$ matrices, so that we can write
$$U(t)=\sum_{k,\ell\le2N-1}x_{k\ell}(t)\Phi^{k\ell}.$$
The advantage of representing temperature fields in the basis $\{\Phi^{k\ell}\}_{k,\ell\le2N-1}$ stems from the fact that in this basis the heat equation governing the evolution of the temperature distribution in time becomes extremely simple, just
$$\frac{d}{dt}x_{k\ell}(t)=-(\omega_k^2+\omega_\ell^2)x_{k\ell}(t)\ \Rightarrow\ x_{k\ell}(t)=\exp\{-(\omega_k^2+\omega_\ell^2)t\}x_{k\ell}(0).^5$$
Now we can convert the situation into the one considered in our general estimation scheme, namely, as follows:
• We select some discretization parameter $N$ and treat $x=\{x_{k\ell}(0),\ 1\le k,\ell\le2N-1\}$ as the signal underlying our observations. In every potential application, we can safely upper-bound the magnitudes of the initial temperatures and thus the magnitude of $x$, say, by a constraint of the form
$$\sum_{k,\ell}x_{k\ell}^2(0)\le R^2$$
with properly selected $R$, which allows us to specify the domain $\mathcal X$ of the signal as the Euclidean ball: $\mathcal X=\{x\in\mathbf R^{(2N-1)\times(2N-1)}:\|x\|_2^2\le R^2\}.$
(4.13)
• Let the measurements of the temperature at time $t_1$ be taken at the points $P_{i(\nu),j(\nu)}$, $1\le\nu\le m$,⁶ and let them be affected by $\mathcal N(0,\sigma^2I_m)$ noise, so that our observation is $\omega=A(x)+\xi$, $\xi\sim\mathcal N(0,\sigma^2I_m)$. Here $x\mapsto A(x)$ is the linear mapping from $\mathbf R^{(2N-1)\times(2N-1)}$ into $\mathbf R^m$ given by
$$[A(x)]_\nu=\sum_{k,\ell=1}^{2N-1}e^{-(\omega_k^2+\omega_\ell^2)t_1}\phi_k(p_{i(\nu)})\phi_\ell(p_{j(\nu)})x_{k\ell}(0).\eqno(4.14)$$
• We want to recover the temperatures at time t₀ taken along some grid, say, the square (2K−1)×(2K−1) grid {Q_{ij} = (r_i, r_j), 1 ≤ i, j ≤ 2K−1}, where r_i = i/K − 1, 1 ≤ i ≤ 2K−1. In other words, we want to recover B(x), where the linear mapping x ↦ B(x) from R^{(2N−1)×(2N−1)} into R^{(2K−1)×(2K−1)} is given by
[B(x)]_{ij} = ∑_{k,ℓ=1}^{2N−1} e^{−(ω_k² + ω_ℓ²)t₀} φ_k(r_i) φ_ℓ(r_j) x_{kℓ}(0).
⁵ The explanation is simple: the functions φ_{kℓ}(p, q) = φ_k(p)φ_ℓ(q), k, ℓ = 1, 2, ..., form an orthogonal basis in L₂(S) and vanish on the boundary of S, and the heat equation
∂/∂t u(t; p, q) = (∂²/∂p² + ∂²/∂q²) u(t; p, q),
governing the evolution of the temperature field u(t; p, q), (p, q) ∈ S, with time t, in terms of the coefficients x_{kℓ}(t) of the temperature field in the orthogonal basis {φ_{kℓ}(p, q)}_{k,ℓ} becomes
d/dt x_{kℓ}(t) = −(ω_k² + ω_ℓ²) x_{kℓ}(t).
In our discretization, we truncate the expansion of u(t; p, q), keeping only the terms with k, ℓ ≤ 2N−1, and restrict the spatial variables to reside in the grid {P_{ij}, 1 ≤ i, j ≤ 2N−1}.
⁶ The construction can be easily extended to allow for measurement points outside of the grid {P_{ij}}.
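The discretized basis φ_k and the exponential damping exp{−(ω_k² + ω_ℓ²)t} at the heart of this model are easy to sanity-check numerically. The sketch below uses an illustrative grid size N = 8 (not the value used in the experiments) and verifies orthonormality of the φ_k on the grid as well as the enormous gap between the damping of low- and high-frequency coefficients:

```python
import math

# Sketch: the functions phi_k from the text, normalized on the grid
# p_i = i/N - 1 (1 <= i <= 2N-1) so that sum_i phi_k(p_i)^2 = 1.
N = 8
grid = [i / N - 1.0 for i in range(1, 2 * N)]

def omega(k):
    # omega_{2l-1} = (l - 1/2)*pi, omega_{2l} = l*pi
    l = (k + 1) // 2
    return (l - 0.5) * math.pi if k % 2 == 1 else l * math.pi

def phi(k):
    f = math.cos if k % 2 == 1 else math.sin
    raw = [f(omega(k) * p) for p in grid]
    c = 1.0 / math.sqrt(sum(v * v for v in raw))  # normalization constant c_k
    return [c * v for v in raw]

basis = [phi(k) for k in range(1, 2 * N)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def damping(k, l, t):
    # factor by which the (k,l) coefficient decays between time 0 and time t
    return math.exp(-(omega(k) ** 2 + omega(l) ** 2) * t)
```

On this grid the φ_k come out pairwise orthogonal with unit norm, and damping(15, 15, 0.03) is astronomically smaller than damping(1, 1, 0.03); this disparity is exactly the ill-posedness discussed below.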
271
SIGNAL RECOVERY BY LINEAR ESTIMATION
Ill-posedness. Our problem is a typical example of an ill-posed inverse problem, where one wants to recover a past state of a dynamical system which converges exponentially fast to equilibrium and thus rapidly "forgets" its past. More specifically, in our situation ill-posedness stems from the fact that, as is clearly seen from (4.14), the contributions of "high frequency" (i.e., with large ω_k² + ω_ℓ²) components x_{kℓ}(0) of the signal to A(x) decrease exponentially fast, at a high decay rate, as t₁ grows. As a result, high frequency components x_{kℓ}(0) are impossible to recover from noisy observations of A(x), unless the time instant t₁ is very small. As a kind of compensation, the contributions of high frequency components x_{kℓ}(0) to B(x) are also very small, provided that t₀ is not too small, implying that there is no need to recover high frequency components well, unless they are huge. Our linear estimate, roughly speaking, seeks the best tradeoff between these two opposite phenomena, utilizing (4.13) as the source of upper bounds on the magnitudes of the high frequency components of the signal.

Numerical results. In the experiment to be reported, we used N = 32, m = 100, K = 6, t₀ = 0.01, t₁ = 0.03 (i.e., the temperature is measured at time 0.03 at 100 points selected at random on a 63×63 square grid, and we want to recover the temperatures at time 0.01 along an 11×11 square grid). We used R = 15, that is,
X = {[x_{kℓ}]_{k,ℓ=1}^{63} : ∑_{k,ℓ} x_{kℓ}² ≤ 225},
and σ = 0.001. Under these circumstances, the risk of the best linear estimate turns out to be 0.3968.
Figure 4.1 shows a sample temperature distribution B(x) = U_*(t₀) at time t₀ resulting from a randomly selected signal x ∈ X, along with the recovery Û(t₀) of U_* by the optimal linear estimate and the naive "least squares" recovery Ũ(t₀) of U_*. The latter is defined as B(x_*), where x_* is the least squares recovery of the signal underlying observation ω:
x_*(ω) := argmin_x ‖A(x) − ω‖₂.
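The blow-up of the naive least squares recovery in such ill-posed problems is easy to reproduce in a toy one-dimensional analogue. The sketch below is purely illustrative (a diagonal model, a deterministic stand-in for noise, and a Tikhonov-type regularized inverse); it is not the book's estimate:

```python
import math

# Observe y_k = d_k * x_k + e_k with exponentially decaying d_k (a 1-D
# analogue of exp{-(omega_k^2 + omega_l^2) t1}); naive least squares
# divides by tiny d_k and amplifies the noise enormously.
n, sigma, R = 20, 1e-3, 1.0
d = [math.exp(-k) for k in range(n)]          # severe damping of high "frequencies"
x = [R / n] * n                               # a signal with sum x_k^2 <= R^2
e = [sigma * (-1) ** k for k in range(n)]     # deterministic stand-in for noise
y = [d[k] * x[k] + e[k] for k in range(n)]

x_ls = [y[k] / d[k] for k in range(n)]        # naive least squares inverse
x_reg = [d[k] * y[k] / (d[k] ** 2 + (sigma / R) ** 2)  # regularized inverse
         for k in range(n)]

def err(z):
    return math.sqrt(sum((z[k] - x[k]) ** 2 for k in range(n)))
```

Here err(x_ls) is of order sigma/d_{n-1} ≈ 10⁵, while err(x_reg) stays below 1: the regularization gives up on the hopeless high-frequency coordinates instead of amplifying noise, the same tradeoff that the optimal linear estimate automates.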
Notice the dramatic difference in performance between the "naive least squares" recovery and the optimal linear estimate.

4.2.2.2
Near-optimality of x̂_{H∗}
Proposition 4.5. The efficiently computable linear estimate x̂_{H∗}(ω) = H∗ᵀω yielded by an optimal solution to the optimization problem (4.12) is nearly optimal in terms of its risk:
Risk[x̂_{H∗}|X] ≤ √Opt ≤ 64√(45 ln 2 (ln K + 5 ln 2)) · Risk_opt[X],   (4.15)
where the minimax optimal risk Risk_opt[X] is given by (4.7).
For proof, see Section 4.8.5. Note that the "non-optimality factor" in (4.15) depends logarithmically on K and is completely independent of what A, B, Γ are and of the "details" R_k, T (see (4.9)) specifying the ellitope X.
Figure 4.1: True distribution of temperature U_* = B(x) at time t₀ = 0.01 (left; ‖U_*‖₂ = 2.01, ‖U_*‖_∞ = 0.347), along with its recovery Û via the optimal linear estimate (center; ‖Û − U_*‖₂ = 0.318, ‖Û − U_*‖_∞ = 0.078) and the "naive" recovery Ũ (right; ‖Ũ − U_*‖₂ = 44.82, ‖Ũ − U_*‖_∞ = 12.47).

4.2.2.3
Relaxing the symmetry requirement
Sets X of the form (4.6), which we called ellitopes, are convex compact sets, symmetric w.r.t. the origin, of special structure. This structure is rather flexible, but the symmetry is "built in." We are about to demonstrate that the symmetry requirement can, to some extent, be relaxed. Specifically, assume instead of (4.6) that the convex compact set X known to contain the signals x underlying observations (4.5) can be "sandwiched" by two ellitopes known to us and similar to each other with coefficient α ≥ 1:
X̲ := {x ∈ R^n : ∃(y ∈ R^n̄, t ∈ T) : x = Py, yᵀR_k y ≤ t_k, 1 ≤ k ≤ K} ⊂ X ⊂ αX̲,
with R_k and T possessing the properties postulated in Section 4.2.1.1. Let Opt and H∗ be the optimal value and an optimal solution of the optimization problem (4.12) associated with the data R₁, ..., R_K, T and the matrices Ā = AP, B̄ = BP in the role of A, B, respectively. It is immediately seen that the risk Risk[x̂_{H∗}|X] of the linear estimate x̂_{H∗}(ω) is at most α√Opt. On the other hand, we have Risk_opt[X̲] ≤ Risk_opt[X], and by Proposition 4.5 also √Opt ≤ O(1)√(ln(2K)) Risk_opt[X̲]. Taken together, these relations imply that
Risk[x̂_{H∗}|X] ≤ O(1) α √(ln(2K)) Risk_opt[X].   (4.16)
In other words, as far as the "level of non-optimality" of efficiently computable linear estimates is concerned, signal sets X which can be approximated by ellitopes within a factor α of order of 1 are nearly as good as ellitopes. To give an example: it is known that whenever the intersection X of K elliptic cylinders {x : (x − c_k)ᵀR_k(x − c_k) ≤ 1}, R_k ⪰ 0, concentric or not, is bounded and has a nonempty interior, X can be approximated by an ellipsoid within the factor
α = K + 2√K.⁷ Assuming w.l.o.g. that the approximating ellipsoid is centered at the origin, the level of non-optimality of a linear estimate is bounded by (4.16) with O(1)K in the role of α.

4.2.2.4
Comments
Note that bound (4.16) rapidly deteriorates as α grows, and this phenomenon to some extent "reflects the reality." For example, a perfect simplex X inscribed into the unit sphere in R^n is in-between two Euclidean balls centered at the origin with ratio of radii equal to n (i.e., α = n). It is immediately seen that with A = B = I, Γ = σ²I, in the range σ ≤ nσ² ≤ 1 of values of n and σ, we have
Risk_opt[X] ≈ √σ, Risk[x̂_{H∗}|X] = O(1)√(nσ),
with ≈ meaning "up to a factor logarithmic in n/σ." In other words, for large nσ linear estimates indeed are significantly (albeit not to the full extent of (4.16)) outperformed by nonlinear ones.
Another situation "bad for linear estimates" suggested by (4.15) is the one where the description (4.6) of X, albeit possible, requires a very large value of K. Here again (4.15) reflects to some extent the reality: when X is the unit ‖·‖₁ ball in R^n, (4.6) takes place with K = 2^{n−1}; consequently, the factor at Risk_opt[X] in the right-hand side of (4.15) becomes at least √n. On the other hand, with A = B = I, Γ = σ²I, in the range σ ≤ nσ² ≤ 1 of values of n, σ, the risks Risk_opt[X], Risk[x̂_{H∗}|X] are basically the same as in the case of X being the perfect simplex inscribed into the unit sphere in R^n, and linear estimates indeed are "heavily non-optimal" when nσ is large.

4.2.2.5
How near is "near-optimal": Numerical illustration

The "non-optimality factor" θ in the upper bound √Opt ≤ θ·Risk_opt[X] from Proposition 4.5, while logarithmic, seems to be unpleasantly large. On closer inspection, one can get numerically less conservative bounds on non-optimality factors. Here are some illustrations.
In the six experiments to be reported, we used n = m = ν = 32 and Γ = σ²I_m. In the first triple of experiments, X was the ellipsoid
X = {x ∈ R³² : ∑_{j=1}^{32} j²x_j² ≤ 1},
that is, P was the identity, K = 1, R₁ = ∑_{j=1}^{32} j² e_j e_jᵀ (the e_j are the basic orths), and T = [0, 1]. In the second triple of experiments, X was the box circumscribed around the above ellipsoid:
X = {x ∈ R³² : j|x_j| ≤ 1, 1 ≤ j ≤ 32}: P = I, K = 32, R_k = k² e_k e_kᵀ, k ≤ K, T = [0, 1]^K.
⁷ Namely, setting F(x) = −∑_{k=1}^K ln(1 − (x − c_k)ᵀR_k(x − c_k)) : int X → R and denoting by x̄ the analytic center argmin_{x∈int X} F(x), one has
{x : (x − x̄)ᵀF″(x̄)(x − x̄) ≤ 1} ⊂ X ⊂ {x : (x − x̄)ᵀF″(x̄)(x − x̄) ≤ [K + 2√K]²}.
X          σ        √Opt    LwB     √Opt/LwB
ellipsoid  0.0100   0.288   0.153   1.88
ellipsoid  0.0010   0.103   0.060   1.71
ellipsoid  0.0001   0.019   0.018   1.06
box        0.0100   0.698   0.231   3.02
box        0.0010   0.163   0.082   2.00
box        0.0001   0.021   0.020   1.06

Table 4.1: Performance of linear estimates (4.8), (4.12), m = n = 32, B = I.
In these experiments, B was the identity matrix, and A was a randomly rotated matrix common for all experiments, with singular values λ_j, 1 ≤ j ≤ 32, forming a geometric progression with λ₁ = 1 and λ₃₂ = 0.01. The experiments within a triple differed in the value of σ (0.01, 0.001, 0.0001).
The results of the experiments are presented in Table 4.1, where, as above, √Opt is the upper bound given by (4.12) on the risk Risk[x̂_{H∗}|X] of recovering Bx = x, x ∈ X, by the linear estimate yielded by (4.8) and (4.12), and LwB is the lower bound on Risk_opt[X] computed via the techniques outlined in Exercise 4.22 (we skip the details). Whatever your attitude to the "reality" reflected by the data in Table 4.1, this reality is much better than the theoretical upper bound on θ appearing in (4.15).

4.2.3
Byproduct on semidefinite relaxation
We are about to present a byproduct of the reasoning underlying Proposition 4.5, important in its own right. This byproduct is not directly related to Statistics; it relates to the quality of the standard semidefinite relaxation. Specifically, given a quadratic form xᵀCx and an ellitope X represented by (4.6), consider the problem
Opt_* = max_{x∈X} xᵀCx = max_y {yᵀPᵀCPy : ∃t ∈ T : yᵀR_k y ≤ t_k, k ≤ K}.   (4.17)
This problem can be NP-hard (this is already so when X is the unit box and C a general-type positive semidefinite matrix); however, Opt_* admits an efficiently computable upper bound given by semidefinite relaxation, as follows: whenever λ ≥ 0 is such that
PᵀCP ⪯ ∑_{k=1}^K λ_k R_k,
we clearly have, for y ∈ Ȳ := {y : ∃t ∈ T : yᵀR_k y ≤ t_k, k ≤ K},
yᵀPᵀCPy ≤ ∑_k λ_k yᵀR_k y ≤ φ_T(λ),
where the last ≤ is due to the fact that the vector with entries yᵀR_k y, 1 ≤ k ≤ K, belongs to T. As a result, the efficiently computable quantity
Opt = min_λ {φ_T(λ) : λ ≥ 0, PᵀCP ⪯ ∑_k λ_k R_k}   (4.18)
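For intuition, here is a minimal hand-checkable instance of (4.17)-(4.18), with P = I, T = [0, 1]^K and R_k = e_k e_kᵀ, so that X is the unit box and φ_T(λ) = ∑_k λ_k. The brute-force grid search below merely stands in for an SDP solver and is purely illustrative:

```python
def is_psd_2x2(m):
    # a symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0
    return m[0][0] + m[1][1] >= -1e-12 and m[0][0] * m[1][1] - m[0][1] * m[1][0] >= -1e-12

C = [[1.0, 1.0], [1.0, 1.0]]  # PSD, so max over the box is attained at a vertex

# Opt_* = max over the unit box of x^T C x, computed over the vertices
opt_star = max(sx * C[0][0] * sx + 2 * sx * C[0][1] * sy + sy * C[1][1] * sy
               for sx in (-1.0, 1.0) for sy in (-1.0, 1.0))

def relaxation_value(step=0.05, top=6.0):
    # brute-force (4.18): min { l1 + l2 : Diag(l) - C is PSD, l >= 0 }
    best = float("inf")
    steps = int(top / step) + 1
    for i in range(steps):
        for j in range(steps):
            l1, l2 = i * step, j * step
            d = [[l1 - C[0][0], -C[0][1]], [-C[1][0], l2 - C[1][1]]]
            if is_psd_2x2(d):
                best = min(best, l1 + l2)
    return best
```

Here Opt_* = 4 (attained at x = (1, 1)) and the relaxation also returns 4 (at λ = (2, 2)), so on this instance the bound Opt_* ≤ Opt happens to be tight; Proposition 4.6 below bounds how far apart the two can get in general.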
is an upper bound on Opt_*. We have the following

Proposition 4.6. Let C be a symmetric n×n matrix and X be given by ellitopic representation (4.6), and let Opt_* and Opt be given by (4.17) and (4.18). Then
Opt / (3 ln(√3 K)) ≤ Opt_* ≤ Opt.   (4.19)
For proof, see Section 4.8.2.
4.3
FROM ELLITOPES TO SPECTRATOPES
So far, the domains of signals we dealt with were ellitopes. In this section we demonstrate that essentially all our constructions and results can be extended to a much wider family of signal domains, namely, spectratopes.

4.3.1
Spectratopes: Definition and examples
We call a set X ⊂ R^n a basic spectratope if it admits a simple spectratopic representation, i.e., a representation of the form
X = {x ∈ R^n : ∃t ∈ T : R_k²[x] ⪯ t_k I_{d_k}, 1 ≤ k ≤ K},   (4.20)
where
S.1. R_k[x] = ∑_{i=1}^n x_i R^{ki} are symmetric d_k×d_k matrices linearly depending on x ∈ R^n (i.e., the "matrix coefficients" R^{ki} belong to S^{d_k});
S.2. T ⊂ R^K_+ is a set with the same properties as in the definition of an ellitope, that is, T is a convex compact subset of R^K_+ which contains a positive vector and is monotone: 0 ≤ t′ ≤ t ∈ T ⇒ t′ ∈ T;
S.3. Whenever x ≠ 0, it holds that R_k[x] ≠ 0 for at least one k ≤ K.
An immediate observation is as follows:

Remark 4.7. By the Schur Complement Lemma, the set (4.20) given by data satisfying S.1–2 can be represented as
X = {x ∈ R^n : ∃t ∈ T : [t_k I_{d_k}, R_k[x]; R_k[x], I_{d_k}] ⪰ 0, k ≤ K}.
By the latter representation, X is nonempty, closed, convex, symmetric w.r.t. the origin, and contains a neighbourhood of the origin. This set is bounded if and only if the data, in addition to S.1–2, satisfy S.3.

A spectratope X ⊂ R^ν is a set represented as a linear image of a basic spectratope:
X = {x ∈ R^ν : ∃(y ∈ R^n, t ∈ T) : x = Py, R_k²[y] ⪯ t_k I_{d_k}, 1 ≤ k ≤ K},   (4.21)
where P is a ν × n matrix, and R_k[·], T are as in S.1–3.
We associate with a basic spectratope (4.20), S.1–3, the following entities:

1. The size
D = ∑_{k=1}^K d_k;

2. Linear mappings
Q ↦ R_k[Q] = ∑_{i,j} Q_{ij} R^{ki} R^{kj} : S^n → S^{d_k}.
As is immediately seen, we have
R_k[xxᵀ] ≡ R_k²[x],   (4.22)
implying that R_k[Q] ⪰ 0 whenever Q ⪰ 0, whence R_k[·] is monotone:
Q′ ⪰ Q ⇒ R_k[Q′] ⪰ R_k[Q].   (4.23)
Besides this, we have
Q ⪰ 0 ⇒ E_{ξ∼N(0,Q)}{R_k²[ξ]} = E_{ξ∼N(0,Q)}{R_k[ξξᵀ]} = R_k[Q],   (4.24)
where the first equality is given by (4.22).

3. Linear mappings Λ_k ↦ R_k^*[Λ_k] : S^{d_k} → S^n given by
[R_k^*[Λ_k]]_{ij} = ½ Tr(Λ_k [R^{ki} R^{kj} + R^{kj} R^{ki}]), 1 ≤ i, j ≤ n.   (4.25)
It is immediately seen that R_k^*[·] is the conjugate of R_k[·]:
⟨Λ_k, R_k[Q]⟩_F = Tr(Λ_k R_k[Q]) = Tr(R_k^*[Λ_k] Q) = ⟨R_k^*[Λ_k], Q⟩_F,   (4.26)
where ⟨A, B⟩_F = Tr(AB) is the Frobenius inner product of symmetric matrices. Besides this, we have
Λ_k ⪰ 0 ⇒ R_k^*[Λ_k] ⪰ 0.   (4.27)
Indeed, R_k^*[Λ_k] is linear in Λ_k, so that it suffices to verify (4.27) for dyadic matrices Λ_k = ffᵀ; for such a Λ_k, (4.25) reads (R_k^*[ffᵀ])_{ij} = [R^{ki}f]ᵀ[R^{kj}f], that is, R_k^*[ffᵀ] is a Gram matrix and as such is ⪰ 0. Another way to arrive at (4.27) is to note that when Λ_k ⪰ 0 and Q = xxᵀ, the first quantity in (4.26) is nonnegative by (4.22), and therefore (4.26) states that xᵀR_k^*[Λ_k]x ≥ 0 for every x, implying R_k^*[Λ_k] ⪰ 0.

4. The linear space Λ_K = S^{d₁} × ... × S^{d_K} of all ordered collections Λ = {Λ_k ∈ S^{d_k}}_{k≤K}, along with the linear mapping
Λ ↦ λ[Λ] := [Tr(Λ₁); ...; Tr(Λ_K)] : Λ_K → R^K.
4.3.1.1
Examples of spectratopes

Example: Ellitopes. Every ellitope
X = {x ∈ R^ν : ∃(y ∈ R^n, t ∈ T) : x = Py, yᵀR_k y ≤ t_k, k ≤ K} [R_k ⪰ 0, ∑_k R_k ≻ 0]
is a spectratope as well. Indeed, let R_k = ∑_{j=1}^{p_k} r_{kj} r_{kj}ᵀ, p_k = Rank(R_k), be a dyadic representation of the positive semidefinite matrix R_k, so that
yᵀR_k y = ∑_j (r_{kj}ᵀ y)² for all y,
and let
T̂ = {{t_{kj} ≥ 0, 1 ≤ j ≤ p_k, 1 ≤ k ≤ K} : ∃t ∈ T : ∑_j t_{kj} ≤ t_k}, R_{kj}[y] = r_{kj}ᵀ y ∈ S¹ = R.
We clearly have
X = {x ∈ R^ν : ∃({t_{kj}} ∈ T̂, y) : x = Py, R_{kj}²[y] ⪯ t_{kj} I₁ for all k, j},
and the right-hand side is a legitimate spectratopic representation of X.

Example: "Matrix box." Let L be a positive definite d × d matrix. Then the "matrix box"
X = {X ∈ S^d : −L ⪯ X ⪯ L} = {X ∈ S^d : −I_d ⪯ L^{−1/2}XL^{−1/2} ⪯ I_d} = {X ∈ S^d : R²[X] := [L^{−1/2}XL^{−1/2}]² ⪯ I_d}
is a basic spectratope (augment R₁[·] := R[·] with K = 1, T = [0, 1]). As a result, a bounded set X ⊂ R^ν given by a system of "two-sided" Linear Matrix Inequalities, specifically,
X = {x ∈ R^ν : ∃t ∈ T : −√t_k L_k ⪯ S_k[x] ⪯ √t_k L_k, k ≤ K},
where the S_k[x] are symmetric d_k×d_k matrices linearly depending on x, L_k ≻ 0, and T satisfies S.2, is a basic spectratope:
X = {x ∈ R^ν : ∃t ∈ T : R_k²[x] ⪯ t_k I_{d_k}, k ≤ K} [R_k[x] = L_k^{−1/2} S_k[x] L_k^{−1/2}].
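As a quick illustration of the matrix box representation, here is a diagonal sanity check (a sketch: for L = Diag{l} and X = Diag{x}, both membership tests reduce to coordinate-wise conditions, so no eigenvalue computations are needed):

```python
# For diagonal L and X, -L <= X <= L means |x_i| <= l_i for every i,
# which is exactly the spectratopic test R[X]^2 = Diag{(x_i/l_i)^2} <= I.
def in_matrix_box_diag(x, l):
    return all(abs(xi) <= li for xi, li in zip(x, l))

def spectratope_test_diag(x, l):
    return all((xi / li) ** 2 <= 1.0 for xi, li in zip(x, l))
```

The two tests agree on every input, mirroring the equivalence of the three descriptions of the matrix box above.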
Like ellitopes, spectratopes admit a fully algorithmic calculus; see Section 4.6.

4.3.2
Semidefinite relaxation on spectratopes
Now let us extend Proposition 4.6 to our current situation. The extension reads as follows:

Proposition 4.8. Let C be a symmetric n×n matrix and X be given by the spectratopic representation
X = {x ∈ R^n : ∃(y ∈ R^µ, t ∈ T) : x = Py, R_k²[y] ⪯ t_k I_{d_k}, k ≤ K},   (4.28)
let
Opt_* = max_{x∈X} xᵀCx,
and let
Opt = min_{Λ={Λ_k}_{k≤K}} {φ_T(λ[Λ]) : Λ_k ⪰ 0, PᵀCP ⪯ ∑_k R_k^*[Λ_k]} [λ[Λ] = [Tr(Λ₁); ...; Tr(Λ_K)]].   (4.29)
Then (4.29) is solvable, and
Opt_* ≤ Opt ≤ 2 max[ln(2D), 1]·Opt_*, D = ∑_k d_k.   (4.30)
Let us verify the easy and instructive part of the proposition, namely, the left inequality in (4.30); the remaining claims will be proved in Section 4.8.3. The left inequality in (4.30) is readily given by the following

Lemma 4.9. Let X be spectratope (4.28) and Q ∈ S^n. Whenever Λ_k ∈ S^{d_k}_+ satisfy
PᵀQP ⪯ ∑_k R_k^*[Λ_k],
for all x ∈ X we have
xᵀQx ≤ φ_T(λ[Λ]), λ[Λ] = [Tr(Λ₁); ...; Tr(Λ_K)].

Proof of the lemma: Let x ∈ X, so that for some t ∈ T and y it holds that x = Py and R_k²[y] ⪯ t_k I_{d_k} for all k ≤ K. Consequently,
xᵀQx = yᵀPᵀQPy ≤ yᵀ(∑_k R_k^*[Λ_k])y = ∑_k Tr(R_k^*[Λ_k][yyᵀ])
 = ∑_k Tr(Λ_k R_k[yyᵀ]) [by (4.26)]
 = ∑_k Tr(Λ_k R_k²[y]) [by (4.22)]
 ≤ ∑_k t_k Tr(Λ_k I_{d_k}) [since Λ_k ⪰ 0 and R_k²[y] ⪯ t_k I_{d_k}]
 ≤ φ_T(λ[Λ]). ✷

4.3.3
Linear estimates beyond ellitopic signal sets and ‖·‖₂ risk
In Section 4.2 we developed a computationally efficient scheme for building "presumably good" linear estimates of the linear image Bx of an unknown signal x known to belong to a given ellitope X, in the case when the (squared) risk is defined as the worst, w.r.t. x ∈ X, expected squared Euclidean norm ‖·‖₂² of the recovery error. We are about to extend these results to the case when X is a spectratope, and the norm used to measure the recovery error, while not completely arbitrary, is not necessarily ‖·‖₂. Besides this, in what follows we also relax our assumptions on the observation noise.
4.3.3.1
Situation and goal
We consider the problem of recovering the image Bx ∈ R^ν of a signal x ∈ R^n known to belong to a given spectratope
X = {x ∈ R^n : ∃t ∈ T : R_k²[x] ⪯ t_k I_{d_k}, 1 ≤ k ≤ K}
from noisy observation
ω = Ax + ξ,   (4.31)
where A is a known m × n matrix, and ξ is random observation noise.

Observation noise. In typical signal processing applications, the distribution of noise is fixed and is a part of the data of the estimation problem. In order to cover some applications (e.g., the one in Section 4.3.3.7), we allow for "ambiguous" noise distributions; all we know is that this distribution belongs to a family P of Borel probability distributions on R^m associated with a given convex compact subset Π of the interior of the cone S^m_+ of positive semidefinite m×m matrices, "association" meaning that the matrix of second moments of every distribution P ∈ P is dominated by a matrix from Π:
P ∈ P ⇒ ∃Q ∈ Π : Var[P] := E_{ξ∼P}{ξξᵀ} ⪯ Q.   (4.32)
The actual distribution of noise in (4.31) is selected from P by nature (and may, e.g., depend on x). In the sequel, for a probability distribution P on R^m we write P ◁ Π to express the fact that the matrix of second moments of P is dominated by a matrix from Π:
{P ◁ Π} ⇔ {∃Θ ∈ Π : Var[P] ⪯ Θ}.

Quantifying risk. Given Π and a norm ‖·‖ on R^ν, we quantify the quality of a candidate estimate x̂(·) : R^m → R^ν by its (Π, ‖·‖)-risk on X, defined as
Risk_{Π,‖·‖}[x̂|X] = sup_{x∈X, P◁Π} E_{ξ∼P}{‖x̂(Ax + ξ) − Bx‖}.
Goal. As before, our focus is on linear estimates, i.e., estimates of the form
x̂_H(ω) = Hᵀω
given by m×ν matrices H. Our goal is to demonstrate that under some restrictions on the signal domain X, a "presumably good" linear estimate yielded by an optimal solution to an efficiently solvable convex optimization problem is near-optimal in terms of its risk among all estimates, linear and nonlinear alike.

4.3.3.2
Assumptions
Preliminaries: Conjugate norms. Recall that a norm ‖·‖ on a Euclidean space E, e.g., on R^k, gives rise to its conjugate norm
‖y‖_* = max_x {⟨y, x⟩ : ‖x‖ ≤ 1},
where ⟨·, ·⟩ is the inner product in E. Equivalently, ‖·‖_* is the smallest norm such that
⟨x, y⟩ ≤ ‖x‖ ‖y‖_* for all x, y.   (4.33)
It is well known that, taken twice, norm conjugation recovers the initial norm: (‖·‖_*)_* is exactly ‖·‖; in other words,
‖x‖ = max_y {⟨x, y⟩ : ‖y‖_* ≤ 1}.
The standard examples are the conjugates to the standard ℓ_p norms on E = R^k, p ∈ [1, ∞]: it turns out that (‖·‖_p)_* = ‖·‖_{p_*}, where p_* ∈ [1, ∞] is linked to p ∈ [1, ∞] by the symmetric relation
1/p + 1/p_* = 1,
so that 1_* = ∞, ∞_* = 1, 2_* = 2. The corresponding version of inequality (4.33) is called the Hölder inequality, an extension of the Cauchy–Schwarz inequality dealing with the case ‖·‖ = ‖·‖_* = ‖·‖₂.

Assumptions. From now on we make the following assumptions:

Assumption A: The unit ball B_* of the norm ‖·‖_* conjugate to the norm ‖·‖ in the formulation of our estimation problem is a spectratope:
B_* = {z ∈ R^ν : ∃y ∈ Y : z = My}, Y := {y ∈ R^q : ∃r ∈ R : S_ℓ²[y] ⪯ r_ℓ I_{f_ℓ}, 1 ≤ ℓ ≤ L},   (4.34)
where the right-hand side data are as required in a spectratopic representation.
Note that Assumption A is satisfied when ‖·‖ = ‖·‖_p with p ∈ [1, 2]: in this case,
B_* = {u ∈ R^ν : ‖u‖_{p_*} ≤ 1}, p_* = p/(p−1) ∈ [2, ∞],
so that B_* is an ellitope (see Section 4.2.1.1) and thus is a spectratope. Another potentially useful example of a norm ‖·‖ which obeys Assumption A is the nuclear norm ‖V‖_{Sh,1} on the space R^ν = R^{p×q} of p×q matrices, the sum of singular values of a matrix V. In this case the conjugate norm is the spectral norm ‖·‖ = ‖·‖_{2,2} on R^ν = R^{p×q}, and the unit ball of the latter norm is a spectratope:
{X ∈ R^{p×q} : ‖X‖ ≤ 1} = {X : ∃t ∈ T = [0, 1] : R²[X] ⪯ tI_{p+q}}, R[X] = [0, X; Xᵀ, 0].
Besides Assumption A, we make

Assumption B: The signal set X is a basic spectratope:
X = {x ∈ R^n : ∃t ∈ T : R_k²[x] ⪯ t_k I_{d_k}, 1 ≤ k ≤ K},
where the right-hand side data are as required in a spectratopic representation.
Note: Similarly to what we observed in Section 4.2.1.3 in the case of ellitopes, the situation where the signal set is a general-type spectratope can be straightforwardly reduced to the one where X is a basic spectratope.
In addition we make the following regularity assumption:

Assumption R: All matrices from Π are positive definite.

4.3.3.3
Building linear estimate
Let H ∈ R^{m×ν}. We clearly have
Risk_{Π,‖·‖}[x̂_H(·)|X] = sup_{x∈X, P◁Π} E_{ξ∼P}{‖[B − HᵀA]x − Hᵀξ‖}
 ≤ sup_{x∈X} ‖[B − HᵀA]x‖ + sup_{P◁Π} E_{ξ∼P}{‖Hᵀξ‖}
 = ‖B − HᵀA‖_{X,‖·‖} + Ψ_Π(H),   (4.35)
where
‖V‖_{X,‖·‖} = max_x {‖Vx‖ : x ∈ X} : R^{ν×n} → R,
Ψ_Π(H) = sup_{P◁Π} E_{ξ∼P}{‖Hᵀξ‖}.
As in Section 4.2.2, we need to derive efficiently computable convex upper bounds on the norm ‖·‖_{X,‖·‖} and the function Ψ_Π, which by themselves, while being convex, can be difficult to compute.

4.3.3.4
Upper-bounding ‖·‖_{X,‖·‖}
With Assumptions A, B in force, consider the spectratope
Z := X × Y = {[x; y] ∈ R^n × R^q : ∃s = [t; r] ∈ T × R : R_k²[x] ⪯ t_k I_{d_k}, 1 ≤ k ≤ K, S_ℓ²[y] ⪯ r_ℓ I_{f_ℓ}, 1 ≤ ℓ ≤ L}
 = {w = [x; y] ∈ R^n × R^q : ∃s ∈ S = T × R : U_i²[w] ⪯ s_i I_{g_i}, 1 ≤ i ≤ I = K + L}
with the U_i[·] readily given by the R_k[·] and S_ℓ[·]. Given a ν×n matrix V and setting
W[V] = ½ [0, VᵀM; MᵀV, 0],
we have
‖V‖_{X,‖·‖} = max_{x∈X} ‖Vx‖ = max_{x∈X, z∈B_*} zᵀVx = max_{x∈X, y∈Y} yᵀMᵀVx = max_{w∈Z} wᵀW[V]w.
Applying Proposition 4.8, we arrive at the following

Corollary 4.10. In the situation just defined, the efficiently computable convex function
‖V‖⁺_{X,‖·‖} = min_{Λ,Υ} {φ_T(λ[Λ]) + φ_R(λ[Υ]) : Λ = {Λ_k ∈ S^{d_k}_+}_{k≤K}, Υ = {Υ_ℓ ∈ S^{f_ℓ}_+}_{ℓ≤L},
 [∑_k R_k^*[Λ_k], ½VᵀM; ½MᵀV, ∑_ℓ S_ℓ^*[Υ_ℓ]] ⪰ 0},   (4.36)
where
φ_T(λ) = max_{t∈T} λᵀt, φ_R(λ) = max_{r∈R} λᵀr, λ[{Ξ₁, ..., Ξ_N}] = [Tr(Ξ₁); ...; Tr(Ξ_N)],
[R_k^*[Λ_k]]_{ij} = ½Tr(Λ_k[R^{ki}R^{kj} + R^{kj}R^{ki}]), where R_k[x] = ∑_i x_i R^{ki},
[S_ℓ^*[Υ_ℓ]]_{ij} = ½Tr(Υ_ℓ[S^{ℓi}S^{ℓj} + S^{ℓj}S^{ℓi}]), where S_ℓ[y] = ∑_i y_i S^{ℓi},
is a norm on R^{ν×n}, and this norm is a tight upper bound on ‖·‖_{X,‖·‖}; namely,
∀V ∈ R^{ν×n} : ‖V‖_{X,‖·‖} ≤ ‖V‖⁺_{X,‖·‖} ≤ 2 max[ln(2D), 1]·‖V‖_{X,‖·‖}, D = ∑_k d_k + ∑_ℓ f_ℓ.

4.3.3.5
Upper-bounding Ψ_Π(·)
The next step is to derive an efficiently computable convex upper bound on the function Ψ_Π stemming from a norm obeying Assumption A. The underlying observation is as follows:

Lemma 4.11. Let V be an m×ν matrix, Q ∈ S^m_+, and P be a probability distribution on R^m with Var[P] ⪯ Q. Let, further, ‖·‖ be a norm on R^ν with the unit ball B_* of the conjugate norm ‖·‖_* given by (4.34). Finally, let Υ = {Υ_ℓ ∈ S^{f_ℓ}_+}_{ℓ≤L} and a matrix Θ ∈ S^m satisfy the constraint
[Θ, ½VM; ½MᵀVᵀ, ∑_ℓ S_ℓ^*[Υ_ℓ]] ⪰ 0   (4.37)
(for notation, see (4.34), (4.36)). Then
E_{η∼P}{‖Vᵀη‖} ≤ Tr(QΘ) + φ_R(λ[Υ]).   (4.38)
Proof is immediate. In the case of (4.37), we have
‖Vᵀξ‖ = max_{z∈B_*} zᵀVᵀξ = max_{y∈Y} yᵀMᵀVᵀξ
 ≤ max_{y∈Y} [ξᵀΘξ + ∑_ℓ yᵀS_ℓ^*[Υ_ℓ]y] [by (4.37)]
 = max_{y∈Y} [ξᵀΘξ + ∑_ℓ Tr(S_ℓ^*[Υ_ℓ] yyᵀ)]
 = max_{y∈Y} [ξᵀΘξ + ∑_ℓ Tr(Υ_ℓ S_ℓ²[y])] [by (4.22) and (4.26)]
 ≤ ξᵀΘξ + max_{y,r} {∑_ℓ Tr(Υ_ℓ S_ℓ²[y]) : S_ℓ²[y] ⪯ r_ℓ I_{f_ℓ}, ℓ ≤ L, r ∈ R} [by (4.34)]
 ≤ ξᵀΘξ + max_{r∈R} ∑_ℓ Tr(Υ_ℓ) r_ℓ [by Υ_ℓ ⪰ 0]
 = ξᵀΘξ + φ_R(λ[Υ]).
Taking the expectation of both sides of the resulting inequality w.r.t. the distribution P of ξ, and taking into account that Tr(Var[P]Θ) ≤ Tr(QΘ) due to Θ ⪰ 0 (by (4.37)) and Var[P] ⪯ Q, we get (4.38). ✷
Note that when P = N(0, Q), the smallest upper bound on E_{η∼P}{‖Vᵀη‖} which can be extracted from Lemma 4.11 (this bound is efficiently computable) is tight; see Lemma 4.17 below. An immediate consequence of the bound in Lemma 4.11 is:

Corollary 4.12. Let
Γ(Θ) = max_{Q∈Π} Tr(QΘ)   (4.39)
and
Ψ̄_Π(H) = min_{{Υ_ℓ}_{ℓ≤L}, Θ∈S^m} {Γ(Θ) + φ_R(λ[Υ]) : Υ_ℓ ⪰ 0 ∀ℓ, [Θ, ½HM; ½MᵀHᵀ, ∑_ℓ S_ℓ^*[Υ_ℓ]] ⪰ 0}.   (4.40)
Then Ψ̄_Π(·) : R^{m×ν} → R is an efficiently computable convex upper bound on Ψ_Π(·).

Indeed, given Lemma 4.11, the only nonevident part of the corollary is that Ψ̄_Π(·) is a well-defined real-valued function; this is readily given by Lemma 4.44, stating, in particular, that the optimization problem in (4.40) is feasible, combined with the fact that the objective is coercive on the feasible set (i.e., is not bounded from above along any unbounded sequence of feasible solutions).

Remark 4.13. When Υ = {Υ_ℓ}_{ℓ≤L}, Θ is a feasible solution to the right-hand side problem in (4.40) and s > 0, the pair Υ′ = {sΥ_ℓ}_{ℓ≤L}, Θ′ = s^{−1}Θ also is a feasible solution. Since φ_R(·) and Γ(·) are positive homogeneous of degree 1, we conclude that Ψ̄_Π is in fact the infimum of the function
2√(Γ(Θ)φ_R(λ[Υ])) = inf_{s>0} [s^{−1}Γ(Θ) + sφ_R(λ[Υ])]
over Υ, Θ satisfying the constraints of the problem (4.40).
In addition, for every feasible solution Υ = {Υ_ℓ}_{ℓ≤L}, Θ to (4.40) with M[Υ] := ∑_ℓ S_ℓ^*[Υ_ℓ] ≻ 0, the pair Υ, Θ̂ = ¼HM M^{−1}[Υ] MᵀHᵀ is feasible for the problem as well, and 0 ⪯ Θ̂ ⪯ Θ (Schur Complement Lemma), so that Γ(Θ̂) ≤ Γ(Θ). As a result,
Ψ̄_Π(H) = inf_Υ {¼Γ(HM M^{−1}[Υ] MᵀHᵀ) + φ_R(λ[Υ]) : Υ = {Υ_ℓ ∈ S^{f_ℓ}_+}_{ℓ≤L}, M[Υ] ≻ 0}.   (4.41)

Illustration. Suppose that ‖u‖ = ‖u‖_p with p ∈ [1, 2], and let us apply the just described scheme for upper-bounding Ψ_Π, assuming {Q} ⊂ Π ⊂ {S ∈ S^m_+ : S ⪯ Q} for some given Q ≻ 0, so that Γ(Θ) = Tr(QΘ) for Θ ⪰ 0. The unit ball of the norm conjugate to ‖·‖, that is, of the norm ‖·‖_q, q = p/(p−1) ∈ [2, ∞], is the basic spectratope (in fact, ellitope)
B_* = {y ∈ R^ν : ∃r ∈ R := {r ∈ R^ν_+ : ‖r‖_{q/2} ≤ 1} : S_ℓ²[y] ≤ r_ℓ, 1 ≤ ℓ ≤ L = ν}, S_ℓ[y] = y_ℓ.
As a result, the Υ's from Remark 4.13 are collections of ν positive semidefinite 1×1 matrices, and we can identify them with ν-dimensional nonnegative vectors υ,
resulting in λ[Υ] = υ and M[Υ] = Diag{υ}. Furthermore, for nonnegative υ we clearly have φ_R(υ) = ‖υ‖_{p/(2−p)}, so the optimization problem in (4.41) now reads
Ψ̄_Π(H) = inf_{υ∈R^ν} {¼Tr(V Diag^{−1}{υ} Vᵀ) + ‖υ‖_{p/(2−p)} : υ > 0} [V = Q^{1/2}H],
and when setting a_ℓ = ‖Col_ℓ[V]‖₂, (4.41) becomes
Ψ̄_Π(H) = inf_{υ>0} {¼ ∑_ℓ a_ℓ²/υ_ℓ + ‖υ‖_{p/(2−p)}}.
This results in Ψ̄_Π(H) = ‖[a₁; ...; a_ν]‖_p. Recalling what the a_ℓ and V are, we end up with
∀P, Var[P] ⪯ Q : E_{ξ∼P}{‖Hᵀξ‖} ≤ Ψ̄_Π(H) := ‖[‖Row₁[HᵀQ^{1/2}]‖₂; ...; ‖Row_ν[HᵀQ^{1/2}]‖₂]‖_p.
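The scalar fact that makes this bound work, E{|g|^p} ≤ σ^p whenever E{g²} ≤ σ² and p ∈ [1, 2], can be checked in the Gaussian case against the closed-form absolute moment of N(0, 1); a small sketch:

```python
import math

def abs_moment_factor(p):
    # E{|g|^p} for g ~ N(0,1): 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)
```

For p = 2 the factor is exactly 1 (the second moment), for p = 1 it equals sqrt(2/pi) ≈ 0.798, and on the whole segment [1, 2] it never exceeds 1, which is what yields E{|ζ_i|^p} ≤ σ_i^p and hence E{‖ζ‖_p} ≤ ‖[σ₁; ...; σ_ν]‖_p.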
This result is quite transparent and could be obtained directly. Indeed, when Var[P] ⪯ Q and ξ ∼ P, the vector ζ = Hᵀξ clearly satisfies E{ζ_i²} ≤ σ_i² := ‖Row_i[HᵀQ^{1/2}]‖₂², implying, due to p ∈ [1, 2], that E{∑_i |ζ_i|^p} ≤ ∑_i σ_i^p, whence E{‖ζ‖_p} ≤ ‖[σ₁; ...; σ_ν]‖_p.

4.3.3.6
Putting things together
An immediate outcome of Corollaries 4.10 and 4.12 is the following recipe for building a "presumably good" linear estimate:

Proposition 4.14. In the situation of Section 4.3.3.1 and under Assumptions A, B, and R (see Section 4.3.3.2), consider the convex optimization problem (for notation, see (4.36) and (4.39))
Opt = min_{H,Λ,Υ,Υ′,Θ} {φ_T(λ[Λ]) + φ_R(λ[Υ]) + φ_R(λ[Υ′]) + Γ(Θ) :
 Λ = {Λ_k ⪰ 0, k ≤ K}, Υ = {Υ_ℓ ⪰ 0, ℓ ≤ L}, Υ′ = {Υ′_ℓ ⪰ 0, ℓ ≤ L},
 [∑_k R_k^*[Λ_k], ½[Bᵀ − AᵀH]M; ½Mᵀ[B − HᵀA], ∑_ℓ S_ℓ^*[Υ_ℓ]] ⪰ 0,
 [Θ, ½HM; ½MᵀHᵀ, ∑_ℓ S_ℓ^*[Υ′_ℓ]] ⪰ 0}.   (4.42)
The problem is solvable, and the H-component H∗ of its optimal solution yields the linear estimate x̂_{H∗}(ω) = H∗ᵀω such that
Risk_{Π,‖·‖}[x̂_{H∗}(·)|X] ≤ Opt.   (4.43)

Note that the only claim in Proposition 4.14 which is not an immediate consequence of Corollaries 4.10 and 4.12 is that problem (4.42) is solvable; this fact is readily given by the feasibility of the problem (by Lemma 4.44) and the coerciveness of the objective on the feasible set (recall that Γ(Θ) is coercive on S^m_+ due to Π ⊂ int S^m_+, and that y ↦ My is an onto mapping, since B_* is full-dimensional).
4.3.3.7
Illustration: Covariance matrix estimation
Suppose that we observe a sample
η^T = {η_k = Aξ_k}_{k≤T},   (4.44)
where A is a given m × n matrix, and ξ₁, ..., ξ_T are sampled, independently of each other, from a zero mean Gaussian distribution with unknown covariance matrix ϑ known to satisfy
γϑ_* ⪯ ϑ ⪯ ϑ_*,   (4.45)
where γ ≥ 0 and ϑ_* ≻ 0 are given. Our goal is to recover ϑ, and the norm on S^n in which the recovery error is measured satisfies Assumption A.

Processing the problem. We can process the problem just outlined as follows.

1. We represent the set {ϑ ∈ S^n_+ : γϑ_* ⪯ ϑ ⪯ ϑ_*} as the image of the matrix box
V = {v ∈ S^n : ‖v‖_{2,2} ≤ 1} [‖·‖_{2,2}: spectral norm]
under an affine mapping; specifically, we set
ϑ₀ = ((1+γ)/2) ϑ_*, σ = (1−γ)/2,
and treat the matrix
v = σ^{−1} ϑ_*^{−1/2}(ϑ − ϑ₀)ϑ_*^{−1/2} [⇔ ϑ = ϑ₀ + σ ϑ_*^{1/2} v ϑ_*^{1/2}]
as the signal underlying our observations. Note that our a priori information on ϑ reduces to v ∈ V.

2. We pass from the observations η_k to "lifted" observations η_kη_kᵀ ∈ S^m, so that
E{η_kη_kᵀ} = E{Aξ_kξ_kᵀAᵀ} = AϑAᵀ = A ϑ[v] Aᵀ, ϑ[v] := ϑ₀ + σ ϑ_*^{1/2} v ϑ_*^{1/2},
and treat as "actual" observations the matrices
ω_k = η_kη_kᵀ − Aϑ₀Aᵀ.
We have⁸
ω_k = 𝒜v + ζ_k with 𝒜v = σ A ϑ_*^{1/2} v ϑ_*^{1/2} Aᵀ and ζ_k = η_kη_kᵀ − Aϑ[v]Aᵀ.   (4.46)
Observe that the random matrices ζ₁, ..., ζ_T are i.i.d. with zero mean and covariance mapping 𝒬[v] (that of the random matrix-valued variable ζ = ηηᵀ − E{ηηᵀ}, η ∼ N(0, Aϑ[v]Aᵀ)).

⁸ In our current considerations, we need to operate with linear mappings acting from S^p to S^q. We treat S^k as a Euclidean space equipped with the Frobenius inner product ⟨u, v⟩ = Tr(uv) and denote linear mappings from S^p into S^q by capital calligraphic letters, like 𝒜, 𝒬, etc. Thus, 𝒜 in (4.46) denotes the linear mapping which, on closer inspection, maps a matrix v ∈ S^n into the matrix 𝒜v = A[ϑ[v] − ϑ[0]]Aᵀ.

3. Let us upper-bound the covariance mapping of ζ. Observe that 𝒬[v] is a symmetric linear mapping of S^m into itself, given by
⟨h, 𝒬[v]h⟩ = E{⟨h, ζ⟩²} = E{⟨h, ηηᵀ⟩²} − ⟨h, E{ηηᵀ}⟩², h ∈ S^m.
Given v ∈ V, let us set θ = ϑ[v], so that 0 ⪯ θ ⪯ ϑ_*, and let H(h) = θ^{1/2}AᵀhAθ^{1/2}. We have
⟨h, 𝒬[v]h⟩ = E_{ξ∼N(0,θ)}{Tr²(hAξξᵀAᵀ)} − Tr²(h·E_{ξ∼N(0,θ)}{AξξᵀAᵀ})
 = E_{χ∼N(0,I_n)}{Tr²(hAθ^{1/2}χχᵀθ^{1/2}Aᵀ)} − Tr²(hAθAᵀ)
 = E_{χ∼N(0,I_n)}{(χᵀH(h)χ)²} − Tr²(H(h)).
Writing H(h) = U Diag{λ}Uᵀ with orthogonal U, we get
E_{χ∼N(0,I_n)}{(χᵀH(h)χ)²} − Tr²(H(h)) = E_{χ̄∼N(0,I_n)}{(∑_i λ_i χ̄_i²)²} − (∑_i λ_i)² [χ̄ := Uᵀχ]
 = ∑_{i≠j} λ_iλ_j + 3∑_i λ_i² − (∑_i λ_i)² = 2∑_i λ_i² = 2Tr([H(h)]²).
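The last simplification is pure algebra: since (∑_i λ_i)² = ∑_{i≠j} λ_iλ_j + ∑_i λ_i², the combination above collapses to 2∑_i λ_i². A one-line numerical check of this identity (a sketch):

```python
def fourth_moment_combination(lam):
    # sum_{i != j} l_i l_j + 3 * sum_i l_i^2 - (sum_i l_i)^2
    off = sum(a * b for i, a in enumerate(lam) for j, b in enumerate(lam) if i != j)
    return off + 3.0 * sum(a * a for a in lam) - sum(lam) ** 2
```

For any vector of eigenvalues lam, the result equals 2 * sum(a*a for a in lam), i.e., 2Tr([H(h)]²).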
Thus,
⟨h, 𝒬[v]h⟩ = 2Tr([H(h)]²) = 2Tr(θ^{1/2}AᵀhAθAᵀhAθ^{1/2})
 ≤ 2Tr(θ^{1/2}AᵀhAϑ_*AᵀhAθ^{1/2}) [since 0 ⪯ θ ⪯ ϑ_*]
 = 2Tr(ϑ_*^{1/2}AᵀhAθAᵀhAϑ_*^{1/2}) ≤ 2Tr(ϑ_*^{1/2}AᵀhAϑ_*AᵀhAϑ_*^{1/2})
 = 2Tr(ϑ_*AᵀhAϑ_*AᵀhA).
We conclude that
∀v ∈ V : 𝒬[v] ⪯ 𝒬, ⟨e, 𝒬h⟩ = 2Tr(ϑ_*AᵀhAϑ_*AᵀeA), e, h ∈ S^m.   (4.47)
4. To continue, we need some additional notation for operating with the Euclidean spaces \(S^p\), \(p=1,2,...\).
• We set \(\bar p = \frac{p(p+1)}{2} = \dim S^p\), \(\mathcal{I}_p = \{(i,j): 1\le i\le j\le p\}\), and, for \((i,j)\in\mathcal{I}_p\),
\[
e_p^{ij} = \begin{cases} e_ie_i^T, & i=j,\\ \tfrac{1}{\sqrt2}[e_ie_j^T + e_je_i^T], & i<j,\end{cases}
\]
where the \(e_i\) are the standard basic orths in \(R^p\). Note that \(\{e_p^{ij}: (i,j)\in\mathcal{I}_p\}\) is the standard orthonormal basis in \(S^p\). Given \(v\in S^p\), we denote by \(X^p(v)\) the vector of coordinates of \(v\) in this basis:
\[
X^p_{ij}(v) = \mathrm{Tr}(ve_p^{ij}) = \begin{cases} v_{ii}, & i=j,\\ \sqrt2\,v_{ij}, & i<j,\end{cases}\qquad (i,j)\in\mathcal{I}_p.
\]
Similarly, for \(x\in R^{\bar p}\) we index the entries of \(x\) by pairs \(ij\), \((i,j)\in\mathcal{I}_p\), and set \(V^p(x) = \sum_{(i,j)\in\mathcal{I}_p} x_{ij}e_p^{ij}\), so that \(v\mapsto X^p(v)\) and \(x\mapsto V^p(x)\) are linear norm-preserving maps, inverse to each other, identifying the Euclidean spaces \(S^p\) and \(R^{\bar p}\) (recall that the inner products on these spaces are, respectively, the Frobenius and the standard one).
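The maps \(X^p(\cdot)\) and \(V^p(\cdot)\) are the standard "svec/smat" identification of \(S^p\) with \(R^{\bar p}\). A minimal implementation (ours; it uses the `np.triu_indices` ordering of \(\mathcal{I}_p\)) verifying the norm-preservation and inversion claims:

```python
import numpy as np

def X_p(v):
    """Coordinates of a symmetric matrix v in the orthonormal basis {e_p^{ij}}:
    v_ii for i = j and sqrt(2) * v_ij for i < j (one entry per pair in I_p)."""
    iu = np.triu_indices(v.shape[0])
    x = v[iu].copy()
    x[iu[0] != iu[1]] *= np.sqrt(2.0)
    return x

def V_p(x, p):
    """Inverse map: rebuild the symmetric matrix from its basis coordinates."""
    iu = np.triu_indices(p)
    vals = x.astype(float).copy()
    vals[iu[0] != iu[1]] /= np.sqrt(2.0)
    v = np.zeros((p, p))
    v[iu] = vals
    return v + v.T - np.diag(np.diag(v))

rng = np.random.default_rng(1)
p = 4
M = rng.standard_normal((p, p))
v = (M + M.T) / 2
x = X_p(v)

assert x.size == p * (p + 1) // 2                          # dim S^p = p(p+1)/2
assert np.allclose(np.linalg.norm(x), np.linalg.norm(v))   # Frobenius <-> Euclidean norm
assert np.allclose(V_p(x, p), v)                           # the maps are mutually inverse
```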
SIGNAL RECOVERY BY LINEAR ESTIMATION
• Recall that \(\mathcal{V}\) is the matrix box \(\{v\in S^n: v^2\preceq I_n\} = \{v\in S^n: \exists t\in\mathcal{T}:=[0,1]: v^2\preceq tI_n\}\). We denote by \(\mathcal{X}\) the image of \(\mathcal{V}\) under the mapping \(X^n\):
\[
\mathcal{X} = \{x\in R^{\bar n}: \exists t\in\mathcal{T}: R^2[x]\preceq tI_n\},\quad R[x] = \sum_{(i,j)\in\mathcal{I}_n} x_{ij}e_n^{ij},\quad \bar n = \tfrac12 n(n+1).
\]
Note that \(\mathcal{X}\) is a basic spectratope of size \(n\). Now we can assume that the signal underlying our observations is \(x\in\mathcal{X}\), and the observations themselves are
\[
w_k = X^m(\omega_k) = \underbrace{X^m(\mathcal{A}V^n(x))}_{=:Ax} + z_k,\quad z_k = X^m(\zeta_k).
\]
Note that \(z_k\in R^{\bar m}\), \(1\le k\le T\), are zero mean i.i.d. random vectors with covariance matrix \(Q[x]\) satisfying, in view of (4.47), the relation \(Q[x]\preceq Q\), where
\[
Q^{ij,k\ell} = 2\,\mathrm{Tr}(\vartheta_*A^Te_m^{ij}A\vartheta_*A^Te_m^{k\ell}A),\quad (i,j)\in\mathcal{I}_m,\ (k,\ell)\in\mathcal{I}_m.
\]
Our goal is to estimate \(\vartheta[v]-\vartheta[0]\) or, which is the same, to recover \(Bx := X^n(\vartheta[V^n(x)]-\vartheta[0])\). We assume that the norm in which the estimation error is measured is "transferred" from \(S^n\) to \(R^{\bar n}\); we denote the resulting norm on \(R^{\bar n}\) by \(\|\cdot\|\) and assume that the unit ball \(\mathcal{B}_*\) of the conjugate norm \(\|\cdot\|_*\) is given by a spectratopic representation:
\[
\{u\in R^{\bar n}: \|u\|_*\le 1\} = \{u\in R^{\bar n}: \exists y\in\mathcal{Y}: u = My\},\quad \mathcal{Y}:=\{y\in R^q: \exists r\in\mathcal{R}: S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\}. \tag{4.48}
\]
The formulated estimation problem fits the premises of Proposition 4.14; specifically:
• the signal \(x\) underlying our observation \(w^{(T)} = [w_1;...;w_T]\) is known to belong to the basic spectratope \(\mathcal{X}\subset R^{\bar n}\), and the observation itself is of the form
\[
w^{(T)} = A^{(T)}x + z^{(T)},\quad A^{(T)} = \underbrace{[A;...;A]}_{T},\quad z^{(T)} = [z_1;...;z_T];
\]
• the noise \(z^{(T)}\) is zero mean, and its covariance matrix is \(\preceq Q_T := \underbrace{\mathrm{Diag}\{Q,...,Q\}}_{T}\), which allows us to set \(\Pi = \{Q_T\}\);
• our goal is to recover \(Bx\), and the norm \(\|\cdot\|\) in which the recovery error is measured satisfies (4.48).
Proposition 4.14 supplies the linear estimate
\[
\widehat{x}(w^{(T)}) = \sum_{k=1}^T H_{*k}^Tw_k
\]
of \(Bx\), with \(H_* = [H_{*1};...;H_{*T}]\) stemming from the optimal solution to the convex optimization problem
\[
\mathrm{Opt} = \min_{H=[H_1;...;H_T],\Lambda,\Upsilon}\left\{\mathrm{Tr}(\Lambda) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi_{\{Q_T\}}(H_1,...,H_T):
\begin{array}{l}\Lambda\in S^n_+,\ \Upsilon = \{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[2pt]
\begin{bmatrix} \mathcal{R}^*[\Lambda] & \frac12[B^T - A^T\sum_kH_k]M\\ \frac12M^T[B - [\sum_kH_k]^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\end{array}\right\} \tag{4.49}
\]
where
\[
\mathcal{R}^*[\Lambda]\in S^{\bar n}:\quad (\mathcal{R}^*[\Lambda])^{ij,k\ell} = \mathrm{Tr}(\Lambda e_n^{ij}e_n^{k\ell}),\quad (i,j)\in\mathcal{I}_n,\ (k,\ell)\in\mathcal{I}_n,
\]
and (cf. (4.40))
\[
\Psi_{\{Q_T\}}(H_1,...,H_T) = \min_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon']):
\begin{array}{l}\Theta\in S^{\bar mT},\ \Upsilon' = \{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\[2pt]
\begin{bmatrix}\Theta & \frac12[H_1M;...;H_TM]\\ \frac12[M^TH_1^T,...,M^TH_T^T] & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\end{array}\right\}.
\]
5. Evidently, the function \(\Psi_{\{Q_T\}}([H_1,...,H_T])\) remains intact under permutations of \(H_1,...,H_T\); with this in mind, it is clear that permuting \(H_1,...,H_T\) while keeping \(\Lambda\) and \(\Upsilon\) intact is a symmetry of (4.49): such a transformation maps the feasible set onto itself and preserves the value of the objective. Since (4.49) is convex and solvable, it follows that there exists an optimal solution to the problem with \(H_1=...=H_T=H\). On the other hand,
\[
\begin{array}{rcl}
\Psi_{\{Q_T\}}(H,...,H) &=& \min\limits_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in S^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\right.\\
&&\qquad\left.\begin{bmatrix}\Theta & \frac12[HM;...;HM]\\ \frac12[M^TH^T,...,M^TH^T] & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\}\\[4pt]
&=& \inf\limits_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in S^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ \ell\le L\},\ \text{and the same LMI}\right\}\\[4pt]
&=& \inf\limits_{\Upsilon'}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}\Big(Q_T\cdot\tfrac14[HM;...;HM]\big[\textstyle\sum_\ell S_\ell^*[\Upsilon'_\ell]\big]^{-1}[M^TH^T,...,M^TH^T]\Big):\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ \ell\le L\}\right\}\\[4pt]
&=& \inf\limits_{\Upsilon'}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \tfrac{T}{4}\,\mathrm{Tr}\Big(QHM\big[\textstyle\sum_\ell S_\ell^*[\Upsilon'_\ell]\big]^{-1}M^TH^T\Big):\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ \ell\le L\}\right\}
\end{array}
\]
due to \(Q_T = \mathrm{Diag}\{Q,...,Q\}\), and we arrive at
\[
\Psi_{\{Q_T\}}(H,...,H) = \min_{\Upsilon',G}\left\{T\,\mathrm{Tr}(QG)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\ G\in S^{\bar m},\ \begin{bmatrix}G & \frac12HM\\ \frac12M^TH^T & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\} \tag{4.50}
\]
(we have used the Schur Complement Lemma combined with the fact that \(\sum_\ell S_\ell^*[\Upsilon'_\ell]\succ0\) whenever \(\Upsilon'_\ell\succ0\) for all \(\ell\); see Lemma 4.44).
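The Schur Complement step can be illustrated numerically: with a positive definite \(S\) playing the role of \(\sum_\ell S_\ell^*[\Upsilon'_\ell]\) and \(W\) playing the role of \(HM\), the smallest admissible \(\Theta\) is exactly the Schur complement \(\tfrac14WS^{-1}W^T\). A sketch with random data (sizes are our arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
m, q = 3, 4
W = rng.standard_normal((m, q))          # plays the role of H M
G0 = rng.standard_normal((q, q))
S = G0 @ G0.T + np.eye(q)                # plays the role of sum_l S_l^*[Y'_l]; S > 0

# By the Schur Complement Lemma, [[Theta, W/2], [W^T/2, S]] >= 0 iff
# Theta >= (1/4) W S^{-1} W^T, so the choice below is the minimal Theta.
Theta = 0.25 * W @ np.linalg.inv(S) @ W.T
block = np.block([[Theta, 0.5 * W], [0.5 * W.T, S]])
lam_min = np.linalg.eigvalsh(block).min()
print(lam_min)   # zero up to round-off: the block is PSD, with Theta on the boundary
```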
In view of the above observations, when replacing the variables \(H\) and \(G\) with \(H=TH\) and \(G=T^2G\), respectively, problem (4.49)-(4.50) becomes
\[
\begin{array}{rl}
\mathrm{Opt} = \min\limits_{H,G,\Lambda,\Upsilon,\Upsilon'}&\left\{\mathrm{Tr}(\Lambda)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\frac1T\mathrm{Tr}(QG):\right.\\
&\Lambda\in S^n_+,\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\
&\left.\begin{bmatrix}\mathcal{R}^*[\Lambda] & \frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0,\quad \begin{bmatrix}G & \frac12HM\\ \frac12M^TH^T & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\}
\end{array} \tag{4.51}
\]
and the estimate
\[
\widehat{x}(w^{(T)}) = \frac1T H^T\sum_{k=1}^T w_k
\]
brought about by an optimal solution to (4.51) satisfies \(\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}\,|\mathcal{X}]\le\mathrm{Opt}\), where \(\Pi=\{Q_T\}\).

4.3.3.8 Estimation from repeated observations
Consider the special case of the situation from Section 4.3.3.1 where the observation \(\omega\) in (4.31) is a \(T\)-element sample \(\omega = [\bar\omega_1;...;\bar\omega_T]\) with components
\[
\bar\omega_t = \bar Ax + \xi_t,\quad t=1,...,T,
\]
and the \(\xi_t\) are i.i.d. observation noises with zero mean distribution \(\bar P\) satisfying \(\bar P\triangleleft\bar\Pi\) for some convex compact set \(\bar\Pi\subset\mathrm{int}\,S^{\bar m}_+\). In other words, we are in the situation where
\[
A = \underbrace{[\bar A;...;\bar A]}_{T}\in R^{m\times n}\ \text{for some}\ \bar A\in R^{\bar m\times n}\ \text{and}\ m = T\bar m,\qquad
\Pi = \{Q = \underbrace{\mathrm{Diag}\{\bar Q,...,\bar Q\}}_{T},\ \bar Q\in\bar\Pi\}.
\]
The same argument as used in item 5 of Section 4.3.3.7 above justifies the following

Proposition 4.15. In the situation in question and under Assumptions A, B, and R, the linear estimate of \(Bx\) yielded by an optimal solution to problem (4.42) can be found as follows. Consider the convex optimization problem
\[
\begin{array}{rl}
\mathrm{Opt} = \min\limits_{\bar H,\Lambda,\Upsilon,\Upsilon',\bar\Theta}&\left\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\frac1T\Gamma(\bar\Theta):\right.\\
&\Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\
&\left.\begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k] & \frac12[B^T-\bar A^T\bar H]M\\ \frac12M^T[B-\bar H^T\bar A] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0,\quad \begin{bmatrix}\bar\Theta & \frac12\bar HM\\ \frac12M^T\bar H^T & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\}
\end{array} \tag{4.52}
\]
where
\[
\Gamma(\bar\Theta) = \max_{\bar Q\in\bar\Pi}\mathrm{Tr}(\bar Q\bar\Theta).
\]
The problem is solvable, and the estimate in question is yielded by the \(\bar H\)-component
\(\bar H_*\) of the optimal solution according to
\[
\widehat{x}([\bar\omega_1;...;\bar\omega_T]) = \frac1T\,\bar H_*^T\sum_{t=1}^T\bar\omega_t.
\]
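The reason the sizes of the optimization problem can be made independent of \(T\) is that the estimate only uses the average of the \(T\) observations, and averaging \(T\) i.i.d. noises divides the covariance by \(T\). A small numerical illustration of this last fact (a sketch; dimension, covariance, and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m_bar, T = 3, 8
C = rng.standard_normal((m_bar, m_bar))
Q_bar = C @ C.T + 0.5 * np.eye(m_bar)    # covariance of a single observation noise

# empirical covariance of the averaged noise (1/T) * sum_t xi_t
N = 200_000
xi = rng.multivariate_normal(np.zeros(m_bar), Q_bar, size=(N, T))
avg = xi.mean(axis=1)                    # shape (N, m_bar)
emp = avg.T @ avg / N

gap = np.abs(emp - Q_bar / T).max()
print(gap)                               # ~ 0: averaging divides the covariance by T
```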
The upper bound provided by Proposition 4.14 on the risk \(\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}(\cdot)|\mathcal{X}]\) of this estimate is equal to Opt. The advantage of this result over what is stated under the circumstances by Proposition 4.14 is that the sizes of the optimization problem (4.52) are independent of \(T\).

4.3.3.9 Near-optimality in the Gaussian case
The risk of the linear estimate \(\widehat{x}_{H_*}(\cdot)\) constructed in (4.42) can be compared to the minimax optimal risk of recovering \(Bx\), \(x\in\mathcal{X}\), from observations corrupted by zero mean Gaussian noise with covariance matrix from \(\Pi\). Formally, the minimax risk is defined as
\[
\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}] = \sup_{Q\in\Pi}\,\inf_{\widehat{x}(\cdot)}\,\sup_{x\in\mathcal{X}} E_{\xi\sim N(0,Q)}\{\|Bx-\widehat{x}(Ax+\xi)\|\} \tag{4.53}
\]
where the infimum is taken over all estimates.
Proposition 4.16. Under the premise and in the notation of Proposition 4.14, we have
\[
\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}] \ge \frac{\mathrm{Opt}}{64\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)}}, \tag{4.54}
\]
where
\[
D = \sum_k d_k,\qquad F = \sum_\ell f_\ell. \tag{4.55}
\]
Thus, the upper bound Opt on the risk \(\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]\) of the presumably good linear estimate \(\widehat{x}_{H_*}\) yielded by an optimal solution to optimization problem (4.42) is within a factor, logarithmic in the sizes of the spectratopes \(\mathcal{X}\) and \(\mathcal{B}_*\), of the Gaussian minimax risk \(\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]\). For the proof, see Section 4.8.5. The key component of the proof is the following fact, important in its own right (for the proof, see Section 4.8.4):

Lemma 4.17. Let \(Y\) be an \(N\times\nu\) matrix, let \(\|\cdot\|\) be a norm on \(R^\nu\) such that the unit ball \(\mathcal{B}_*\) of the conjugate norm is the spectratope (4.34), and let \(\zeta\sim N(0,Q)\) for some positive semidefinite \(N\times N\) matrix \(Q\). Then the best upper bound on \(\psi_Q(Y):=E\{\|Y^T\zeta\|\}\) yielded by Lemma 4.11, that is, the optimal value \(\mathrm{Opt}[Q]\) in the convex optimization problem (cf. (4.40))
\[
\mathrm{Opt}[Q] = \min_{\Theta,\Upsilon}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\mathrm{Tr}(Q\Theta):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ 1\le\ell\le L\},\ \Theta\in S^N,\ \begin{bmatrix}\Theta & \frac12YM\\ \frac12M^TY^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\} \tag{4.56}
\]
(for the notation, see Lemma 4.11 and (4.36)), satisfies the identity
\[
\forall(Q\succeq0):\ \mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q] := \min_{G,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\mathrm{Tr}(G):\ \Upsilon_\ell\succeq0,\ \begin{bmatrix}G & \frac12Q^{1/2}YM\\ \frac12M^TY^TQ^{1/2} & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\} \tag{4.57}
\]
and is a tight bound on \(\psi_Q(Y)\), namely,
\[
\psi_Q(Y)\le\mathrm{Opt}[Q]\le 22\sqrt{2\ln F+10\ln 2}\;\psi_Q(Y), \tag{4.58}
\]
where \(F=\sum_\ell f_\ell\) is the size of the spectratope (4.34). Besides this, for all \(\varkappa\ge1\) one has
\[
\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\varkappa}\right\}\ge\beta_\varkappa := 1-\frac{\mathrm{e}^{3/8}}{2}-2F\mathrm{e}^{-\varkappa^2/2}. \tag{4.59}
\]
In particular, selecting \(\varkappa=\sqrt{2\ln F+10\ln 2}\), we obtain
\[
\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\sqrt{2\ln F+10\ln 2}}\right\}\ge0.2100>\frac{3}{16}. \tag{4.60}
\]

4.4 LINEAR ESTIMATES OF STOCHASTIC SIGNALS
In the recovery problem considered so far in this chapter, the signal \(x\) underlying the observation \(\omega=Ax+\xi\) was "deterministic uncertain but bounded": all the a priori information on \(x\) was that \(x\in\mathcal{X}\) for a given signal set \(\mathcal{X}\). There is a well-known alternative model, where the signal \(x\) has a random component; specifically, \(x=[\eta;u]\), where the "stochastic component" \(\eta\) is random with (partly) known probability distribution \(P_\eta\), and the "deterministic component" \(u\) is known to belong to a given set \(\mathcal{X}\). As a typical example, consider a linear dynamical system given by
\[
\begin{array}{rcl}
y_{t+1} &=& P_ty_t+\eta_t+u_t,\\
\omega_t &=& C_ty_t+\xi_t,
\end{array}\qquad 1\le t\le T, \tag{4.61}
\]
where \(y_t\), \(\eta_t\), and \(u_t\) are, respectively, the state, the random "process noise," and the deterministic "uncertain but bounded" disturbance affecting the system at time \(t\), \(\omega_t\) is the output (it is what we observe at time \(t\)), and \(\xi_t\) is the observation noise. We assume that the matrices \(P_t\), \(C_t\) are known in advance. Note that the trajectory \(y=[y_1;...;y_T]\) of the states depends not only on the trajectories of the process noises \(\eta_t\) and disturbances \(u_t\), but also on the initial state \(y_1\), which can be modeled as a realization of either the initial noise \(\eta_0\) or the initial disturbance \(u_0\). When \(u_t\equiv0\), \(y_1=\eta_0\),
and the random vectors \(\{\eta_t, 0\le t\le T,\ \xi_t, 1\le t\le T\}\) are zero mean Gaussian and independent of each other, (4.61) is the model underlying the celebrated Kalman filter [143, 144, 171, 172]. Now, given model (4.61), we can use the equations of the model to represent the trajectory of the states as a linear image of the trajectory of noises \(\eta=\{\eta_t\}\) and the trajectory of disturbances \(u=\{u_t\}\), \(y=P\eta+Qu\) (recall that the initial state is either the component \(\eta_0\) of \(\eta\) or the component \(u_0\) of \(u\)), and our "full observation" becomes
\[
\omega=[\omega_1;...;\omega_T]=A[\eta;u]+\xi,\quad \xi=[\xi_1;...;\xi_T].
\]
A typical statistical problem associated with the outlined situation is to estimate the linear image \(B[\eta;u]\) of the "signal" \(x=[\eta;u]\) underlying the observation. For example, when speaking about (4.61), the goal could be to recover \(y_{T+1}\) ("forecast"). We arrive at the following estimation problem:

Given a noisy observation
\[
\omega=Ax+\xi\in R^m
\]
of a signal \(x=[\eta;u]\) with random component \(\eta\in R^p\) and deterministic component \(u\) known to belong to a given set \(\mathcal{X}\subset R^q\), we want to recover the image \(Bx\in R^\nu\) of the signal.

Here \(A\) and \(B\) are given matrices, \(\eta\) is independent of \(\xi\), and we have a priori (perhaps incomplete) information on the probability distribution \(P_\eta\) of \(\eta\); specifically, we know that \(P_\eta\in\mathcal{P}_\eta\) for a given family \(\mathcal{P}_\eta\) of probability distributions. Similarly, we assume that what we know about the noise \(\xi\) is that its distribution belongs to a given family \(\mathcal{P}_\xi\) of distributions on the observation space. Given a norm \(\|\cdot\|\) on the image space of \(B\), it makes sense to specify the risk of a candidate estimate \(\widehat{x}(\omega)\) by taking the expectation of the norm \(\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|\) of the error over both \(\xi\) and \(\eta\), and then taking the supremum of the result over the allowed distributions of \(\eta\), \(\xi\), and over \(u\in\mathcal{X}\):
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}] = \sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,P_\eta\in\mathcal{P}_\eta} E_{[\xi;\eta]\sim P_\xi\times P_\eta}\{\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|\}.
\]
When \(\|\cdot\|=\|\cdot\|_2\) and all distributions from \(\mathcal{P}_\xi\) and \(\mathcal{P}_\eta\) have zero means and finite covariance matrices, it is technically more convenient to operate with the Euclidean risk
\[
\mathrm{RiskEucl}[\widehat{x}] = \sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,P_\eta\in\mathcal{P}_\eta}\left[E_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|_2^2\right\}\right]^{1/2}.
\]
Our next goal is to show that, as far as the design of "presumably good" linear estimates \(\widehat{x}(\omega)=H^T\omega\) is concerned, the techniques developed so far straightforwardly extend to the case of signals with a random component.
4.4.1 Minimizing Euclidean risk
For the time being, assume that \(\mathcal{P}_\xi\) comprises all probability distributions \(P\) on \(R^m\) with zero mean and covariance matrix \(\mathrm{Cov}[P]=E_{\xi\sim P}\{\xi\xi^T\}\) running through a computationally tractable convex compact subset \(\mathcal{Q}_\xi\subset\mathrm{int}\,S^m_+\), and that \(\mathcal{P}_\eta\) comprises all probability distributions \(P\) on \(R^p\) with zero mean and covariance matrix running through a computationally tractable convex compact subset \(\mathcal{Q}_\eta\subset\mathrm{int}\,S^p_+\). Let, in addition, \(\mathcal{X}\) be a basic spectratope: \(\mathcal{X}=\{x\in R^q:\exists t\in\mathcal{T}: R_k^2[x]\preceq t_kI_{d_k},\ k\le K\}\), with our standard restrictions on \(\mathcal{T}\) and \(R_k[\cdot]\). Let us derive an efficiently solvable convex optimization problem "responsible" for a presumably good, in terms of its Euclidean risk, linear estimate. For a linear estimate \(H^T\omega\), \(u\in\mathcal{X}\), \(P_\xi\in\mathcal{P}_\xi\), \(P_\eta\in\mathcal{P}_\eta\), denoting by \(Q_\xi\) and \(Q_\eta\) the covariance matrices of \(P_\xi\) and \(P_\eta\), and partitioning \(A=[A_\eta,A_u]\) and \(B=[B_\eta,B_u]\) according to the partition \(x=[\eta;u]\), we have
\[
\begin{array}{l}
E_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|H^T(A[\eta;u]+\xi)-B[\eta;u]\|_2^2\right\}
= E_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|[H^TA_\eta-B_\eta]\eta+H^T\xi+[H^TA_u-B_u]u\|_2^2\right\}\\
\quad = u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u + E_{\xi\sim P_\xi}\left\{\mathrm{Tr}(H^T\xi\xi^TH)\right\} + E_{\eta\sim P_\eta}\left\{\mathrm{Tr}([B_\eta-H^TA_\eta]\eta\eta^T[B_\eta-H^TA_\eta]^T)\right\}\\
\quad = u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u + \mathrm{Tr}(H^TQ_\xi H) + \mathrm{Tr}([B_\eta-H^TA_\eta]Q_\eta[B_\eta-H^TA_\eta]^T).
\end{array}
\]
Hence, the squared Euclidean risk of the linear estimate \(\widehat{x}_H(\omega)=H^T\omega\) is
\[
\begin{array}{rcl}
\mathrm{Risk}^2_{\mathrm{Eucl}}[\widehat{x}_H] &=& \Phi(H)+\Psi_\xi(H)+\Psi_\eta(H),\\
\Phi(H) &=& \max_{u\in\mathcal{X}} u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u,\\
\Psi_\xi(H) &=& \max_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(H^TQH),\\
\Psi_\eta(H) &=& \max_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}([B_\eta-H^TA_\eta]Q[B_\eta-H^TA_\eta]^T).
\end{array}
\]
The functions \(\Psi_\xi\) and \(\Psi_\eta\) are convex and efficiently computable; the function \(\Phi(H)\), by Proposition 4.8, admits an efficiently computable convex upper bound
\[
\overline{\Phi}(H) = \min_\Lambda\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ [B_u-H^TA_u]^T[B_u-H^TA_u]\preceq\sum_kR_k^*[\Lambda_k]\right\}
\]
which is tight within the factor \(2\max[\ln(2\sum_kd_k),1]\) (see Proposition 4.8). Thus, the efficiently solvable convex problem yielding a presumably good linear estimate is
\[
\mathrm{Opt} = \min_H\left\{\overline{\Phi}(H)+\Psi_\xi(H)+\Psi_\eta(H)\right\};
\]
the Euclidean risk of the linear estimate \(H_*^T\omega\) yielded by the optimal solution to the problem is upper-bounded by \(\sqrt{\mathrm{Opt}}\) and is within the factor \(\sqrt{2\max[\ln(2\sum_kd_k),1]}\) of the minimal Euclidean risk achievable with linear estimates.
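The decomposition of the squared Euclidean risk into \(\Phi+\Psi_\xi+\Psi_\eta\) above relies only on zero means and independence, and is easy to verify numerically for a fixed \(u\). A sketch (all matrices, sizes, and the scalar covariances below are our arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, m, nu = 2, 3, 4, 3                       # sizes of eta, u, omega, Bx (arbitrary)
A_eta = rng.standard_normal((m, p)); A_u = rng.standard_normal((m, q))
B_eta = rng.standard_normal((nu, p)); B_u = rng.standard_normal((nu, q))
H = rng.standard_normal((m, nu))
u = rng.standard_normal(q)
s2_xi, s2_eta = 0.3, 0.7                       # Q_xi = s2_xi * I, Q_eta = s2_eta * I

# closed-form squared error, as in the decomposition in the text
E_u = (B_u - H.T @ A_u) @ u
D = B_eta - H.T @ A_eta
closed = E_u @ E_u + s2_xi * np.trace(H.T @ H) + s2_eta * np.trace(D @ D.T)

# Monte-Carlo estimate of E ||H^T (A [eta; u] + xi) - B [eta; u]||_2^2
N = 300_000
eta = rng.standard_normal((N, p)) * np.sqrt(s2_eta)
xi = rng.standard_normal((N, m)) * np.sqrt(s2_xi)
err = eta @ D.T + xi @ H + E_u                 # rows: D*eta + H^T*xi + E_u
mc = np.mean(np.sum(err ** 2, axis=1))
print(mc, closed)                              # agree up to sampling error
```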
4.4.2 Minimizing \(\|\cdot\|\)-risk
Now let \(\mathcal{P}_\xi\) comprise all probability distributions \(P\) on \(R^m\) with matrices of second moments \(\mathrm{Var}[P]=E_{\xi\sim P}\{\xi\xi^T\}\) running through a computationally tractable convex compact subset \(\mathcal{Q}_\xi\subset\mathrm{int}\,S^m_+\), and \(\mathcal{P}_\eta\) comprise all probability distributions \(P\) on \(R^p\) with matrices of second moments \(\mathrm{Var}[P]\) running through a computationally tractable convex compact subset \(\mathcal{Q}_\eta\subset\mathrm{int}\,S^p_+\). Let, as above, \(\mathcal{X}\) be a basic spectratope, \(\mathcal{X}=\{u\in R^n:\exists t\in\mathcal{T}: R_k^2[u]\preceq t_kI_{d_k},\ k\le K\}\), and let \(\|\cdot\|\) be such that the unit ball \(\mathcal{B}_*\) of the conjugate norm \(\|\cdot\|_*\) is a spectratope:
\[
\mathcal{B}_* = \{y:\|y\|_*\le1\} = \{y\in R^\nu:\exists(r\in\mathcal{R},z\in R^N): y=Mz,\ S_\ell^2[z]\preceq r_\ell I_{f_\ell},\ \ell\le L\},
\]
with our standard restrictions on \(\mathcal{T}\), \(\mathcal{R}\), \(R_k[\cdot]\), and \(S_\ell[\cdot]\). Here the efficiently solvable convex optimization problem "responsible" for a presumably good, in terms of its risk \(\mathrm{Risk}_{\|\cdot\|}\), linear estimate can be built as follows. For a linear estimate \(H^T\omega\), \(u\in\mathcal{X}\), \(P_\xi\in\mathcal{P}_\xi\), \(P_\eta\in\mathcal{P}_\eta\), denoting by \(Q_\xi\) and \(Q_\eta\) the matrices of second moments of \(P_\xi\) and \(P_\eta\), and partitioning \(A=[A_\eta,A_u]\) and \(B=[B_\eta,B_u]\) according to the partition \(x=[\eta;u]\), we have
\[
\begin{array}{l}
E_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|H^T(A[\eta;u]+\xi)-B[\eta;u]\|\right\}
= E_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|[H^TA_\eta-B_\eta]\eta+H^T\xi+[H^TA_u-B_u]u\|\right\}\\
\quad\le \|[B_u-H^TA_u]u\| + E_{\xi\sim P_\xi}\{\|H^T\xi\|\} + E_{\eta\sim P_\eta}\{\|[B_\eta-H^TA_\eta]\eta\|\}.
\end{array}
\]
It follows that for the linear estimate \(\widehat{x}_H(\omega)=H^T\omega\) one has
\[
\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H] &\le& \Phi(H)+\Psi_\xi(H)+\Psi_\eta(H),\\
\Phi(H) &=& \max_{u\in\mathcal{X}}\|[B_u-H^TA_u]u\|,\\
\Psi_\xi(H) &=& \sup_{P_\xi\in\mathcal{P}_\xi}E_{\xi\sim P_\xi}\{\|H^T\xi\|\},\\
\Psi_\eta(H) &=& \sup_{P_\eta\in\mathcal{P}_\eta}E_{\eta\sim P_\eta}\{\|[B_\eta-H^TA_\eta]\eta\|\}.
\end{array}
\]
As was shown in Section 4.3.3.3, the functions \(\Phi\), \(\Psi_\xi\), \(\Psi_\eta\) admit efficiently computable upper bounds as follows (for the notation, see Section 4.3.3.3):
\[
\begin{array}{rcl}
\Phi(H)\le\overline{\Phi}(H) &:=& \min\limits_{\Lambda,\Upsilon}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\right.\\
&&\qquad\left.\begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac12[B_u^T-A_u^TH]M\\ \frac12M^T[B_u-H^TA_u] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\};\\[6pt]
\Psi_\xi(H)\le\overline{\Psi}_\xi(H) &:=& \min\limits_{\Upsilon,G}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max\limits_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(GQ):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \begin{bmatrix}G & \frac12HM\\ \frac12M^TH^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\};\\[6pt]
\Psi_\eta(H)\le\overline{\Psi}_\eta(H) &:=& \min\limits_{\Upsilon,G}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max\limits_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}(GQ):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \begin{bmatrix}G & \frac12[B_\eta^T-A_\eta^TH]M\\ \frac12M^T[B_\eta-H^TA_\eta] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\};
\end{array}
\]
and these bounds are reasonably tight (for details on tightness, see Proposition 4.8 and Lemma 4.17). As a result, to get a presumably good linear estimate, one needs to solve the efficiently solvable convex optimization problem
\[
\mathrm{Opt} = \min_H\left\{\overline{\Phi}(H)+\overline{\Psi}_\xi(H)+\overline{\Psi}_\eta(H)\right\}.
\]
The linear estimate \(\widehat{x}_{H_*}=H_*^T\omega\) yielded by an optimal solution \(H_*\) to this problem admits the risk bound \(\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}]\le\mathrm{Opt}\). Note that the above derivation did not use the independence of \(\xi\) and \(\eta\).
4.5 LINEAR ESTIMATION UNDER UNCERTAIN-BUT-BOUNDED NOISE
So far, the main subject of our interest was recovering (linear images of) signals via indirect observations of these signals corrupted by random noise. In this section, we focus on alternative observation schemes, those with "uncertain-but-bounded" and "mixed" noise.

4.5.1 Uncertain-but-bounded noise
Consider the estimation problem where, given observation ω = Ax + η
(4.62)
of an unknown signal \(x\) known to belong to a given signal set \(\mathcal{X}\), one wants to recover the linear image \(Bx\) of \(x\). Here \(A\) and \(B\) are given \(m\times n\) and \(\nu\times n\) matrices. The situation looks exactly as before; the difference with our previous considerations is that now we do not assume the observation noise to be random: all we assume about \(\eta\) is that it belongs to a given compact set \(\mathcal{H}\) ("uncertain-but-bounded observation noise"). In the situation in question, a natural definition of the risk on \(\mathcal{X}\) of a candidate estimate \(\omega\mapsto\widehat{x}(\omega)\) is
\[
\mathrm{Risk}_{\mathcal{H},\|\cdot\|}[\widehat{x}\,|\mathcal{X}] = \sup_{x\in\mathcal{X},\,\eta\in\mathcal{H}}\|Bx-\widehat{x}(Ax+\eta)\|
\]
("\(\mathcal{H}\)-risk"). We are about to prove that when \(\mathcal{X}\), \(\mathcal{H}\), and the unit ball \(\mathcal{B}_*\) of the norm \(\|\cdot\|_*\) conjugate to \(\|\cdot\|\) are spectratopes, which we assume from now on, an efficiently computable linear estimate is near-optimal in terms of its \(\mathcal{H}\)-risk. Our initial observation is that in this case the model (4.62) reduces straightforwardly to a model without observation noise. Indeed, let \(\mathcal{Y}=\mathcal{X}\times\mathcal{H}\); then \(\mathcal{Y}\) is a spectratope, and we lose nothing by assuming that the signal underlying observation \(\omega\) is \(y=[x;\eta]\in\mathcal{Y}\):
\[
\omega = Ax+\eta = \bar Ay,\qquad \bar A = [A, I_m],
\]
while the entity to be recovered is
\[
Bx = \bar By,\qquad \bar B = [B, 0_{\nu\times m}].
\]
With these conventions, the \(\mathcal{H}\)-risk of a candidate estimate \(\widehat{x}(\cdot): R^m\to R^\nu\) becomes the quantity
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}\,|\mathcal{X}\times\mathcal{H}] = \sup_{y=[x;\eta]\in\mathcal{X}\times\mathcal{H}}\|\bar By-\widehat{x}(\bar Ay)\|,
\]
and we indeed arrive at the situation where the observation noise is identically zero. To avoid messy notation, let us assume that the outlined reduction has been carried out in advance, so that

(!) The problem of interest is to recover the linear image \(Bx\in R^\nu\) of an unknown signal \(x\) known to belong to a given spectratope \(\mathcal{X}\) (which, as always, we can assume w.l.o.g. to be basic) from the (noiseless) observation \(\omega=Ax\in R^m\). The risk of a candidate estimate is defined as
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}\,|\mathcal{X}] = \sup_{x\in\mathcal{X}}\|Bx-\widehat{x}(Ax)\|,
\]
where \(\|\cdot\|\) is a given norm with a spectratope \(\mathcal{B}_*\) (see (4.34)) as the unit ball of the conjugate norm:
\[
\begin{array}{rcl}
\mathcal{X} &=& \{x\in R^n:\exists t\in\mathcal{T}: R_k^2[x]\preceq t_kI_{d_k},\ k\le K\},\\
\mathcal{B}_* &=& \{z\in R^\nu:\exists y\in\mathcal{Y}: z=My\},\quad \mathcal{Y}:=\{y\in R^q:\exists r\in\mathcal{R}: S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\},
\end{array} \tag{4.63}
\]
with our standard restrictions on \(\mathcal{T}\), \(\mathcal{R}\) and \(R_k[\cdot]\), \(S_\ell[\cdot]\).

4.5.1.1 Building a linear estimate
Let us build a presumably good linear estimate. For a linear estimate \(\widehat{x}_H(\omega)=H^T\omega\), we have
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H|\mathcal{X}] = \max_{x\in\mathcal{X}}\|(B-H^TA)x\|
= \max_{[u;x]\in\mathcal{B}_*\times\mathcal{X}}[u;x]^T\begin{bmatrix} & \frac12(B-H^TA)\\ \frac12(B-H^TA)^T & \end{bmatrix}[u;x].
\]
Applying Proposition 4.8, we arrive at the following:

Proposition 4.18. In the situation of this section, consider the convex optimization problem
\[
\begin{array}{rl}
\mathrm{Opt}_\# = \min\limits_{H,\Upsilon=\{\Upsilon_\ell\},\Lambda=\{\Lambda_k\}}&\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Upsilon_\ell\succeq0,\ \Lambda_k\succeq0\ \forall(\ell,k),\right.\\
&\left.\begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac12[B-H^TA]^TM\\ \frac12M^T[B-H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\}
\end{array} \tag{4.64}
\]
where \(R_k^*[\cdot]\), \(S_\ell^*[\cdot]\) are induced by \(R_k[\cdot]\), \(S_\ell[\cdot]\), respectively, as explained in Section 4.3.1. The problem is solvable, and the risk of the linear estimate \(\widehat{x}_{H_*}(\cdot)\) yielded by the \(H\)-component of an optimal solution does not exceed \(\mathrm{Opt}_\#\). For the proof, see Section 4.8.6.1.

4.5.1.2 Near-optimality
Proposition 4.19. The linear estimate \(\widehat{x}_{H_*}\) yielded by Proposition 4.18 is near-optimal in terms of its risk:
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]\le\mathrm{Opt}_\#\le O(1)\ln(D)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}],\qquad D=\sum_kd_k+\sum_\ell f_\ell, \tag{4.65}
\]
where \(\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]\) is the minimax optimal risk: \(\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]=\inf_{\widehat{x}}\mathrm{Risk}_{\|\cdot\|}[\widehat{x}\,|\mathcal{X}]\), with the infimum taken w.r.t. all Borel estimates.
Remark 4.20. When \(\mathcal{X}\) and \(\mathcal{B}_*\) are ellitopes rather than spectratopes,
\[
\begin{array}{rcl}
\mathcal{X} &=& \{x\in R^n:\exists t\in\mathcal{T}: x^TR_kx\le t_k,\ k\le K\},\\
\mathcal{B}_* &:=& \{u\in R^\nu:\|u\|_*\le1\} = \{u\in R^\nu:\exists r\in\mathcal{R},z: u=Mz,\ z^TS_\ell z\le r_\ell,\ \ell\le L\}\\
&&\qquad[R_k\succeq0,\ \textstyle\sum_kR_k\succ0,\ S_\ell\succeq0,\ \sum_\ell S_\ell\succ0],
\end{array}
\]
problem (4.64) becomes
\[
\mathrm{Opt}_\# = \min_{H,\lambda,\mu}\left\{\phi_{\mathcal{R}}(\mu)+\phi_{\mathcal{T}}(\lambda):\ \lambda\ge0,\ \mu\ge0,\ \begin{bmatrix}\sum_k\lambda_kR_k & \frac12[B-H^TA]^TM\\ \frac12M^T[B-H^TA] & \sum_\ell\mu_\ell S_\ell\end{bmatrix}\succeq0\right\},
\]
and (4.65) can be strengthened to
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]\le\mathrm{Opt}_\#\le O(1)\ln(K+L)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}].
\]
For the proofs, see Section 4.8.6.

4.5.1.3 Nonlinear estimation
The uncertain-but-bounded model of observation error makes it easy to point out an efficiently computable near-optimal nonlinear estimate. Indeed, in the situation described at the beginning of Section 4.5.1, let us assume that the range of the observation error \(\eta\) is
\[
\mathcal{H} = \{\eta\in R^m:\|\eta\|_{(m)}\le\sigma\},
\]
where \(\|\cdot\|_{(m)}\) and \(\sigma>0\) are a given norm on \(R^m\) and a given error bound, and let us measure the recovery error by a given norm \(\|\cdot\|_{(\nu)}\) on \(R^\nu\). We can immediately point out a (nonlinear) estimate \(\widehat{x}_*\) optimal within factor 2 in terms of its \(\mathcal{H}\)-risk, as follows:
298
CHAPTER 4
Given \(\omega\), we solve the feasibility problem
\[
\text{find } x\in\mathcal{X}:\ \|Ax-\omega\|_{(m)}\le\sigma. \tag{F[\(\omega\)]}
\]
Let \(x_\omega\) be a feasible solution; we set \(\widehat{x}_*(\omega)=Bx_\omega\).
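For concreteness, here is a minimal sketch of the estimate \(\widehat{x}_*\) in the simplest setting where \(\mathcal{X}\) is the unit Euclidean ball and \(\|\cdot\|_{(m)}=\|\cdot\|_2\); a feasible solution of (F[\(\omega\)]) is found by projected gradient descent on \(\|Ax-\omega\|_2^2\). These choices, and the solver, are our illustrative assumptions, not the book's prescription:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, nu, sigma = 6, 4, 3, 0.5
A = rng.standard_normal((m, n))
B = rng.standard_normal((nu, n))

x_true = rng.standard_normal(n)
x_true /= max(1.0, np.linalg.norm(x_true))        # a signal in X (the unit ball)
eta = rng.standard_normal(m)
eta *= 0.5 * sigma / np.linalg.norm(eta)          # ||eta||_2 <= sigma
omega = A @ x_true + eta

def proj_X(x):                                    # Euclidean projection onto X
    nx = np.linalg.norm(x)
    return x if nx <= 1.0 else x / nx

# projected gradient on f(x) = ||Ax - omega||_2^2 over X,
# stopped as soon as the feasibility requirement of (F[omega]) is met
x = np.zeros(n)
step = 0.5 / np.linalg.norm(A, 2) ** 2            # 1/L for L = 2 ||A||_2^2
for _ in range(2000):
    x = proj_X(x - step * 2.0 * A.T @ (A @ x - omega))
    if np.linalg.norm(A @ x - omega) <= sigma:
        break

x_hat = B @ x                                     # the estimate  x_*(omega) = B x_omega
print(np.linalg.norm(A @ x - omega))              # <= sigma: x is feasible
```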
Note that the estimate is well-defined, since (F[\(\omega\)]) clearly is solvable, one of its feasible solutions being the true signal underlying observation \(\omega\). When \(\mathcal{X}\) is a computationally tractable convex compact set and \(\|\cdot\|_{(m)}\) is an efficiently computable norm, a feasible solution to (F[\(\omega\)]) can be found in a computationally efficient fashion. Let us make the following immediate observation:

Proposition 4.21. The estimate \(\widehat{x}_*\) is optimal within factor 2:
\[
\mathrm{Risk}_{\mathcal{H}}[\widehat{x}_*|\mathcal{X}]\le\mathrm{Opt}_* := \sup_{x,y}\left\{\|Bx-By\|_{(\nu)}:\ x,y\in\mathcal{X},\ \|A(x-y)\|_{(m)}\le2\sigma\right\}\le2\,\mathrm{Risk}_{\mathrm{opt},\mathcal{H}}, \tag{4.66}
\]
where \(\mathrm{Risk}_{\mathrm{opt},\mathcal{H}}\) is the infimum of the \(\mathcal{H}\)-risk over all estimates. The proof of the proposition is the subject of Exercise 4.28.

4.5.1.4 Quantifying risk
Note that Proposition 4.21 imposes no restrictions on \(\mathcal{X}\) and the norms \(\|\cdot\|_{(m)}\), \(\|\cdot\|_{(\nu)}\). The only, but essential, shortcoming of the estimate \(\widehat{x}_*\) is that we do not know, in general, what its \(\mathcal{H}\)-risk is. From (4.66) it follows that this risk is tightly (namely, within factor 2) upper-bounded by \(\mathrm{Opt}_*\), but this quantity, being the maximum of a convex function over some domain, can be difficult to compute. Aside from a handful of special cases where this difficulty does not arise, there is a generic situation where \(\mathrm{Opt}_*\) can be tightly upper-bounded by efficient computation. This is the situation where \(\mathcal{X}\) is the spectratope defined in (4.63), \(\|\cdot\|_{(m)}\) is such that the unit ball of this norm is a basic spectratope,
\[
\mathcal{B}_{(m)} := \{u:\|u\|_{(m)}\le1\} = \{u\in R^m:\exists p\in\mathcal{P}: Q_j^2[u]\preceq p_jI_{e_j},\ 1\le j\le J\},
\]
and the unit ball of the norm \(\|\cdot\|_{(\nu),*}\) conjugate to the norm \(\|\cdot\|_{(\nu)}\) is a spectratope,
\[
\mathcal{B}^*_{(\nu)} := \{v\in R^\nu:\|v\|_{(\nu),*}\le1\} = \{v:\exists(w\in R^N,r\in\mathcal{R}): v=Mw,\ S_\ell^2[w]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\},
\]
with the usual restrictions on \(\mathcal{P}\), \(\mathcal{R}\), \(Q_j[\cdot]\), and \(S_\ell[\cdot]\).

Proposition 4.22. In the situation in question, consider the convex optimization problem
\[
\begin{array}{rl}
\mathrm{Opt} = \min\limits_{\Lambda=\{\Lambda_k,k\le K\},\,\Upsilon=\{\Upsilon_\ell,\ell\le L\},\,\Sigma=\{\Sigma_j,j\le J\}}&\left\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\sigma^2\phi_{\mathcal{P}}(\lambda[\Sigma]):\right.\\
&\Lambda_k\succeq0,\ \Upsilon_\ell\succeq0,\ \Sigma_j\succeq0\ \forall(k,\ell,j),\\
&\left.\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon_\ell] & \frac12M^TB\\ \frac12B^TM & \sum_kR_k^*[\Lambda_k]+A^T[\sum_jQ_j^*[\Sigma_j]]A\end{bmatrix}\succeq0\right\}
\end{array} \tag{4.67}
\]
where \(R_k^*[\cdot]\) is associated with the mapping \(x\mapsto R_k[x]\) according to (4.25), \(S_\ell^*[\cdot]\) and \(Q_j^*[\cdot]\) are associated in the same fashion with the mappings \(w\mapsto S_\ell[w]\) and \(u\mapsto Q_j[u]\), respectively, and \(\phi_{\mathcal{T}}\), \(\phi_{\mathcal{R}}\), and \(\phi_{\mathcal{P}}\) are the support functions of the corresponding sets \(\mathcal{T}\), \(\mathcal{R}\), and \(\mathcal{P}\). The optimal value in (4.67) is an efficiently computable upper bound on the quantity \(\mathrm{Opt}_*\) defined in (4.66), and this bound is tight within the factor
\[
2\max[\ln(2D),1],\qquad D=\sum_kd_k+\sum_\ell f_\ell+\sum_je_j.
\]
The proof of the proposition is the subject of Exercise 4.29.

4.5.2 Mixed noise
So far, we have considered separately the cases of random and uncertain-but-bounded observation noise in (4.31). Note that both of these observation schemes are covered by the following "mixed" scheme:
\[
\omega = Ax+\xi+\eta,
\]
where, as above, \(A\) is a given \(m\times n\) matrix, \(x\) is an unknown deterministic signal known to belong to a given signal set \(\mathcal{X}\), \(\xi\) is random noise with distribution known to belong to a family \(\mathcal{P}\) of Borel probability distributions on \(R^m\) satisfying (4.32) for a given convex compact set \(\Pi\subset\mathrm{int}\,S^m_+\), and \(\eta\) is an "uncertain-but-bounded" observation error known to belong to a given set \(\mathcal{H}\). As before, our goal is to estimate \(Bx\in R^\nu\) via observation \(\omega\). In our present setting, given a norm \(\|\cdot\|\) on \(R^\nu\), we can quantify the performance of a candidate estimate \(\omega\mapsto\widehat{x}(\omega): R^m\to R^\nu\) by its risk
\[
\mathrm{Risk}_{\Pi,\mathcal{H},\|\cdot\|}[\widehat{x}\,|\mathcal{X}] = \sup_{x\in\mathcal{X},\,P\triangleleft\Pi,\,\eta\in\mathcal{H}} E_{\xi\sim P}\{\|Bx-\widehat{x}(Ax+\xi+\eta)\|\}.
\]
RiskΠ,H,k·k [b xX ] = RiskΠ,k·k [b xX + ],
so that we find ourselves in the situation of Section 4.3.3.1. Assuming that X and H are spectratopes, so is X + , meaning that all results of Section 4.3.3 on building presumably good linear estimates and their nearoptimality are applicable to our present setup.
4.6 CALCULUS OF ELLITOPES/SPECTRATOPES
We present here the rules of the calculus of ellitopes/spectratopes. We formulate these rules for ellitopes; the "spectratopic versions" of the rules are straightforward modifications of their "ellitopic versions."
• The intersection \(\mathcal{X} = \bigcap_{i=1}^I\mathcal{X}_i\) of ellitopes
\[
\mathcal{X}_i = \{x\in R^n:\ \exists(y^i\in R^{n_i},t^i\in\mathcal{T}_i):\ x = P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t^i_k,\ 1\le k\le K_i\}
\]
is an ellitope. Indeed, this is evident when \(\mathcal{X}=\{0\}\). Assuming \(\mathcal{X}\neq\{0\}\), we have
\[
\mathcal{X} = \{x\in R^n:\ \exists(y=[y^1;...;y^I]\in Y,\ t=(t^1,...,t^I)\in\mathcal{T}=\mathcal{T}_1\times...\times\mathcal{T}_I):\ x = Py := P_1y^1\ \&\ [y^i]^TR_{ik}y^i\le t^i_k,\ 1\le k\le K_i,\ 1\le i\le I\},
\]
\[
Y = \{[y^1;...;y^I]\in R^{n_1+...+n_I}:\ P_iy^i = P_1y^1,\ 2\le i\le I\}
\]
(note that \(Y\) can be identified with \(R^{\bar n}\) for a properly selected \(\bar n>0\)).
• The direct product \(\mathcal{X} = \prod_{i=1}^I\mathcal{X}_i\) of ellitopes
\[
\mathcal{X}_i = \{x^i\in R^{n_i}:\ \exists(y^i\in R^{\bar n_i},t^i\in\mathcal{T}_i):\ x^i = P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t^i_k,\ 1\le k\le K_i\},\quad 1\le i\le I,
\]
is an ellitope:
\[
\mathcal{X} = \left\{[x^1;...;x^I]\in R^{n_1}\times...\times R^{n_I}:\ \exists\left(\begin{array}{l}y=[y^1;...;y^I]\in R^{\bar n_1+...+\bar n_I},\\ t=(t^1,...,t^I)\in\mathcal{T}=\mathcal{T}_1\times...\times\mathcal{T}_I\end{array}\right):\ x = Py := [P_1y^1;...;P_Iy^I],\ [y^i]^TR_{ik}y^i\le t^i_k,\ 1\le k\le K_i,\ 1\le i\le I\right\}.
\]
• The linear image \(\mathcal{Z} = \{Rx: x\in\mathcal{X}\}\), \(R\in R^{p\times n}\), of an ellitope \(\mathcal{X} = \{x\in R^n:\exists(y\in R^{\bar n},t\in\mathcal{T}): x=Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}\) is an ellitope:
\[
\mathcal{Z} = \{z\in R^p:\ \exists(y\in R^{\bar n},t\in\mathcal{T}):\ z = [RP]y\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}.
\]
• The inverse linear image \(\mathcal{Z} = \{z\in R^q: Rz\in\mathcal{X}\}\), \(R\in R^{n\times q}\), of an ellitope \(\mathcal{X} = \{x\in R^n:\exists(y\in R^{\bar n},t\in\mathcal{T}): x=Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}\) under the linear mapping \(z\mapsto Rz: R^q\to R^n\) is an ellitope, provided that the mapping is an embedding: \(\mathrm{Ker}\,R=\{0\}\). Indeed, setting \(E=\{y\in R^{\bar n}: Py\in\mathrm{Im}\,R\}\), we get a linear subspace of \(R^{\bar n}\). If \(E=\{0\}\), \(\mathcal{Z}=\{0\}\) is an ellitope; if \(E\neq\{0\}\), we have
\[
\begin{array}{rcl}
\mathcal{Z} &=& \{z\in R^q:\ \exists(y\in E,t\in\mathcal{T}):\ z = \bar Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\},\\
\bar P &=& \Pi P,\ \text{where}\ \Pi:\mathrm{Im}\,R\to R^q\ \text{is the inverse of}\ z\mapsto Rz: R^q\to\mathrm{Im}\,R
\end{array}
\]
(\(E\) can be identified with some \(R^k\), and \(\Pi\) is well-defined since \(R\) is an embedding).
• The arithmetic sum \(\mathcal{X} = \{x=\sum_{i=1}^Ix^i:\ x^i\in\mathcal{X}_i,\ 1\le i\le I\}\) of ellitopes \(\mathcal{X}_i\) is an ellitope, with a representation readily given by those of \(\mathcal{X}_1,...,\mathcal{X}_I\). Indeed, \(\mathcal{X}\) is the image of \(\mathcal{X}_1\times...\times\mathcal{X}_I\) under the linear mapping \([x^1;...;x^I]\mapsto x^1+...+x^I\), and taking direct products and images under linear mappings preserves ellitopes.
• "\(\mathcal{S}\)-product." Let
\[
\mathcal{X}_i = \{x^i\in R^{n_i}:\ \exists(y^i\in R^{\bar n_i},t^i\in\mathcal{T}_i):\ x^i = P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t^i_k,\ 1\le k\le K_i\}
\]
be ellitopes, and let \(\mathcal{S}\) be a convex compact set in \(R^I_+\) which intersects the interior of \(R^I_+\) and is monotone: \(0\le s'\le s\in\mathcal{S}\) implies \(s'\in\mathcal{S}\). We associate with \(\mathcal{S}\) the set \(\mathcal{S}^{1/2} = \{s\in R^I_+: [s_1^2;...;s_I^2]\in\mathcal{S}\}\) of entrywise square roots of points from \(\mathcal{S}\); clearly, \(\mathcal{S}^{1/2}\) is a convex compact set. The \(\mathcal{X}_i\) and \(\mathcal{S}\) specify the \(\mathcal{S}\)-product of the sets \(\mathcal{X}_i\), \(i\le I\), defined as the set
\[
\mathcal{Z} = \{z=[z^1;...;z^I]:\ \exists(s\in\mathcal{S}^{1/2},\ x^i\in\mathcal{X}_i,\ i\le I):\ z^i = s_ix^i,\ 1\le i\le I\},
\]
or, equivalently,
\[
\mathcal{Z} = \{z=[z^1;...;z^I]:\ \exists(r=[r^1;...;r^I]\in\mathcal{R},\ y^1,...,y^I):\ z^i = P_iy^i\ \forall i\le I,\ [y^i]^TR_{ik}y^i\le r^i_k\ \forall(i\le I,\ k\le K_i)\},
\]
\[
\mathcal{R} = \{[r^1;...;r^I]\ge0:\ \exists(s\in\mathcal{S}^{1/2},\ t^i\in\mathcal{T}_i):\ r^i = s_i^2t^i\ \forall i\le I\}.
\]
Note also that the usual direct product of I ellitopes is their Sproduct, with S = [0, 1]I . • “Sweighted sum.” Let Xi ⊂ Rn be ellitopes, 1 ≤ i ≤ I, and let S ⊂ RI+ , S 1/2 be the same as in the previous rule. Then the Sweighted sum of the sets Xi ,
defined as
\[
\mathcal{X} = \{x:\ \exists(s\in\mathcal{S}^{1/2},\ x^i\in\mathcal{X}_i,\ i\le I):\ x = \sum_i s_ix^i\},
\]
is an ellitope. Indeed, the set in question is the image of the \(\mathcal{S}\)-product of the \(\mathcal{X}_i\) under the linear mapping \([z^1;...;z^I]\mapsto z^1+...+z^I\), and taking \(\mathcal{S}\)-products and linear images preserves the property of being an ellitope.
It should be stressed that the outlined "calculus rules" are fully algorithmic: the representation (4.6) of the result of an operation is readily given by the representations (4.6) of the operands.
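The algorithmic nature of the calculus is easy to make concrete: an ellitopic representation can be stored as the pair \((P,\{R_k\})\) (with \(\mathcal{T}\) kept implicit), and rules such as "linear image" and "direct product" act directly on this data. A minimal sketch (the `Ellitope` container and helper names are our own, not from the book):

```python
import numpy as np
from dataclasses import dataclass
from typing import List

@dataclass
class Ellitope:
    """X = {x = P y : exists t in T with y^T R_k y <= t_k}; T is kept implicit."""
    P: np.ndarray          # x = P y
    Rs: List[np.ndarray]   # the PSD matrices R_k acting on y

def linear_image(R: np.ndarray, X: Ellitope) -> Ellitope:
    """Z = {R x : x in X}: same quadratic forms, mapping matrix R P."""
    return Ellitope(R @ X.P, X.Rs)

def direct_product(X1: Ellitope, X2: Ellitope) -> Ellitope:
    """X1 x X2: block-diagonal P; each R_k is zero-padded to act on [y1; y2]."""
    n1, m1 = X1.P.shape
    n2, m2 = X2.P.shape
    P = np.zeros((n1 + n2, m1 + m2))
    P[:n1, :m1] = X1.P
    P[n1:, m1:] = X2.P
    Rs = [np.pad(R, ((0, m2), (0, m2))) for R in X1.Rs] \
       + [np.pad(R, ((m1, 0), (m1, 0))) for R in X2.Rs]
    return Ellitope(P, Rs)

# the unit disk in R^2 as an ellitope (P = I, one quadratic form R = I, T = [0,1])
disk = Ellitope(np.eye(2), [np.eye(2)])
seg = linear_image(np.array([[2.0, 0.0]]), disk)   # a segment: image under x -> [2,0]x
prod = direct_product(disk, disk)

assert seg.P.shape == (1, 2)
assert prod.P.shape == (4, 4) and len(prod.Rs) == 2
```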
4.7 EXERCISES FOR CHAPTER 4

4.7.1 Linear estimates vs. Maximum Likelihood
Exercise 4.1. Consider the problem posed at the beginning of Chapter 4: Given observation ω = Ax + σξ, ξ ∼ N (0, I) of unknown signal x known to belong to a given signal set X ⊂ Rn , we want to recover Bx. Let us consider the case where matrix A is square and invertible, B is the identity, and X is a computationally tractable convex compact set. As far as computational aspects are concerned, the situation is well suited for utilizing the “magic wand” of Statistics—the Maximum Likelihood (ML) estimate where the recovery of x is x bML (ω) = argmin kω − Ayk2 (ML) y∈X
—the signal which maximizes, over $y\in\mathcal X$, the likelihood (the probability density) of getting the observation we actually got. Indeed, with computationally tractable $\mathcal X$, (ML) is an explicit convex, and therefore efficiently solvable, optimization problem. Given the exclusive role played by the ML estimate in Statistics, perhaps the first question about our estimation problem is: how good is the ML estimate? The goal of this exercise is to show that in the situation we are interested in, the ML estimate can be "heavily nonoptimal," and this may happen even when the techniques we develop in Chapter 4 do result in an efficiently computable near-optimal linear estimate. To justify the claim, investigate the risk (4.2) of the ML estimate in the case where
$$\mathcal X=\Bigl\{x\in\mathbf{R}^n:x_1^2+\epsilon^{-2}\sum_{i=2}^n x_i^2\le1\Bigr\}\quad\&\quad A={\rm Diag}\{1,\epsilon^{-1},\dots,\epsilon^{-1}\},$$
$\epsilon$ and $\sigma$ are small, and $n$ is large, so that $\sigma^2(n-1)\ge2$. Accompany your theoretical analysis by numerical experiments—compare the empirical risks of the ML estimate with theoretical and empirical risks of the linear estimate optimal under the circumstances.
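For the numerical part, the recommended experimental setup (stated in full below) is easy to materialize; here is a minimal numpy sketch, with hypothetical helper names `make_signal` and `observe` of our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_signal(n, eps):
    """Recommended test signal: x = [cos(phi); sin(phi)*eps*zeta],
    phi ~ Uniform[0, 2*pi], zeta uniform on the unit sphere in R^(n-1)."""
    phi = rng.uniform(0.0, 2.0 * np.pi)
    zeta = rng.standard_normal(n - 1)
    zeta /= np.linalg.norm(zeta)            # uniform direction on the sphere
    return np.concatenate(([np.cos(phi)], np.sin(phi) * eps * zeta))

def observe(x, eps, sigma):
    """Observation omega = A x + sigma*xi with A = Diag{1, 1/eps, ..., 1/eps}."""
    a = np.concatenate(([1.0], np.full(x.size - 1, 1.0 / eps)))
    return a * x + sigma * rng.standard_normal(x.size)

n, eps, sigma = 256, 0.05, 0.05
x = make_signal(n, eps)
omega = observe(x, eps, sigma)
# x lies on the boundary of X = {x : x_1^2 + eps^(-2) sum_{i>=2} x_i^2 <= 1}:
lhs = x[0] ** 2 + (x[1:] ** 2).sum() / eps ** 2
print(lhs)  # cos^2(phi) + sin^2(phi) = 1
```

The ML recovery itself then amounts to projecting $A^{-1}\omega$-type data onto the ellipsoid $\mathcal X$, which any convex-optimization toolbox handles.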
303
SIGNAL RECOVERY BY LINEAR ESTIMATION
Recommended setup: $n$ runs through $\{256,1024,2048\}$, $\epsilon=\sigma$ runs through $\{0.01,0.05,0.1\}$, and signal $x$ is generated as $x=[\cos(\phi);\sin(\phi)\epsilon\zeta]$, where $\phi\sim{\rm Uniform}[0,2\pi]$ and random vector $\zeta$ is independent of $\phi$ and is distributed uniformly on the unit sphere in $\mathbf{R}^{n-1}$.

4.7.2 Measurement Design in Signal Recovery
Exercise 4.2. [Measurement Design in Gaussian o.s.] As a preamble to the exercise, please read the story about possible "physics" of Gaussian o.s. from Section 2.7.3.3. The summary of the story is as follows: We consider the Measurement Design version of signal recovery in Gaussian o.s.; specifically, we are allowed to use observations
$$\omega=A_qx+\sigma\xi\qquad[\xi\sim\mathcal N(0,I_m)],$$
where
$$A_q={\rm Diag}\{\sqrt{q_1},\sqrt{q_2},\dots,\sqrt{q_m}\}\,A,$$
with a given $A\in\mathbf{R}^{m\times n}$ and vector $q$ which we can select in a given convex compact set $Q\subset\mathbf{R}^m_+$. The signal $x$ underlying the observation is known to belong to a given ellitope $\mathcal X$. Your goal is to select $q\in Q$ and a linear recovery $\omega\mapsto G^T\omega$ of the image $Bx$ of $x\in\mathcal X$, with given $B$, resulting in the smallest worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$ recovery risk. Modify, according to this goal, problem (4.12). Is it possible to end up with a tractable problem? Work out in full detail the case when $Q=\{q\in\mathbf{R}^m_+:\sum_i q_i=m\}$.

Exercise 4.3. [follow-up to Exercise 4.2] A translucent bar of length $n=32$ is comprised of 32 consecutive segments of length 1 each, with density $\rho_i$ of the $i$-th segment known to belong to the interval $[\mu-\delta_i,\mu+\delta_i]$.

[Figure: sample translucent bar]

The bar is lit from the left end; when light passes through a segment with density $\rho$, the light's intensity is reduced by the factor $e^{-\alpha\rho}$. The light intensity at the left endpoint of the bar is 1. You can scan the segments one by one from left to right and measure the light intensity $\ell_i$ at the right endpoint of the $i$-th segment during time $q_i$; the result $z_i$ of the measurement is $\ell_ie^{\sigma\xi_i/\sqrt{q_i}}$, where the $\xi_i\sim\mathcal N(0,1)$ are independent across $i$. The total time budget is $n$, and you are interested in recovering the $m=n/2$-dimensional vector of densities of the right $m$ segments. Build an optimization problem responsible for near-optimal linear recovery with and without Measurement Design (in the latter case, we assume that each segment is observed during unit time) and compare the resulting near-optimal risks. Recommended data: $\alpha=0.01$, $\delta_i=1.2+\cos(4\pi(i-1)/n)$, $\mu=1.1\max_i\delta_i$, $\sigma=0.001$.
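The observation model of Exercise 4.3 can be simulated in a few lines; this is a sketch under the recommended data (variable names are ours, not the book's), useful for checking a recovery routine empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

# recommended data of Exercise 4.3
n = 32
alpha = 0.01
i = np.arange(1, n + 1)
delta = 1.2 + np.cos(4 * np.pi * (i - 1) / n)
mu = 1.1 * delta.max()
sigma = 1e-3

# a sample density profile rho_i drawn from [mu - delta_i, mu + delta_i]
rho = rng.uniform(mu - delta, mu + delta)

# light intensity at the right endpoint of segment i: each segment of
# density rho attenuates the intensity by the factor exp(-alpha * rho)
ell = np.exp(-alpha * np.cumsum(rho))

def measure(ell, q, sigma):
    """Observation model: z_i = ell_i * exp(sigma * xi_i / sqrt(q_i))."""
    xi = rng.standard_normal(ell.size)
    return ell * np.exp(sigma * xi / np.sqrt(q))

q = np.ones(n)   # no Measurement Design: each segment observed for unit time
z = measure(ell, q, sigma)
# taking logs makes the model linear in rho with additive Gaussian noise:
# -log z_i = alpha * sum_{j<=i} rho_j - sigma * xi_i / sqrt(q_i)
```

Working in the log domain, as the last comment indicates, reduces the problem to the standard Gaussian o.s. setting of Exercise 4.2.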
Exercise 4.4. Let $X$ be a basic ellitope in $\mathbf{R}^n$:
$$X=\{x\in\mathbf{R}^n:\exists t\in\mathcal T:x^TS_kx\le t_k,\ 1\le k\le K\}$$
with our usual restrictions on $S_k$ and $\mathcal T$. Let, further, $m$ be a given positive integer, and $x\mapsto Bx:\mathbf{R}^n\to\mathbf{R}^\nu$ be a given linear mapping. Consider the Measurement Design problem where you are looking for a linear recovery $\omega\mapsto\widehat x_H(\omega):=H^T\omega$ of $Bx$, $x\in X$, from observation
$$\omega=Ax+\sigma\xi\qquad[\sigma>0\text{ is given and }\xi\sim\mathcal N(0,I_m)]$$
in which the $m\times n$ sensing matrix $A$ is under your control—it is allowed to be any $m\times n$ matrix of spectral norm not exceeding 1. You are interested in selecting $H$ and $A$ in order to minimize the worst-case, over $x\in X$, expected $\|\cdot\|_2^2$ recovery error. Similarly to (4.12), this problem can be posed as
$$\mathrm{Opt}=\min_{H,\lambda,A}\left\{\sigma^2{\rm Tr}(H^TH)+\phi_{\mathcal T}(\lambda):\begin{bmatrix}\sum_k\lambda_kS_k&B^T-A^TH\\ B-H^TA&I_\nu\end{bmatrix}\succeq0,\ \|A\|\le1,\ \lambda\ge0\right\}, \tag{4.68}$$
where $\|\cdot\|$ stands for the spectral norm. The objective in this problem is the (upper bound on the) squared risk ${\rm Risk}^2[\widehat x_H|X]$, the sensing matrix being $A$. The problem is nonconvex, since the matrix participating in the semidefinite constraint is bilinear in $H$ and $A$.
A natural way to handle an optimization problem with objective and/or constraints bilinear in the decision variables $u,v$ is to use "alternating minimization," where one alternates optimization in $v$ for $u$ fixed and optimization in $u$ for $v$ fixed, the value of the variable fixed in a round being the result of optimization w.r.t. this variable in the previous round. Alternating minimizations are carried out until the value of the objective (which in the outlined process definitely improves from round to round) stops improving (or nearly so). Since the algorithm does not necessarily converge to the globally optimal solution to the problem of interest, it makes sense to run the algorithm several times from different, say, randomly selected, starting points. Now comes the exercise.
1. Implement Alternating Minimization as applied to (4.68). You may restrict your experimentation to the case where the sizes $m,n,\nu$ are quite moderate, in the range of tens, and $X$ is either the box $\{x:j^{2\gamma}x_j^2\le1,\ 1\le j\le n\}$ or the ellipsoid $\{x:\sum_{j=1}^n j^{2\gamma}x_j^2\le1\}$, where $\gamma$ is a nonnegative parameter (try $\gamma=0,1,2,3$). As for $B$, you can generate it at random, or enforce $B$ to have prescribed singular values, say, $\sigma_j=j^{-\theta}$, $1\le j\le\nu$, and a randomly selected system of singular vectors.
2. Identify cases where a globally optimal solution to (4.68) is easy to find and use this information in order to understand how reliable Alternating Minimization is in the application in question, reliability meaning the ability to identify near-optimal, in terms of the objective, solutions. If you are not satisfied with Alternating Minimization "as is," try to improve it.
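Before attacking (4.68) itself, alternating minimization is easy to prototype on a toy bilinear problem whose global optimum is known: the best rank-one approximation $\min_{u,v}\|M-uv^T\|_F^2$, where each partial minimization is ordinary least squares. A sketch (our setup, not the book's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy bilinear problem for Alternating Minimization: min_{u,v} ||M - u v^T||_F^2.
# The objective is bilinear in (u, v); each partial step is least squares.
M = rng.standard_normal((20, 15))

def alt_min(M, iters=200, seed=0):
    r = np.random.default_rng(seed)
    v = r.standard_normal(M.shape[1])
    obj = np.inf
    for _ in range(iters):
        u = M @ v / (v @ v)          # optimize u with v fixed
        v = M.T @ u / (u @ u)        # optimize v with u fixed
        new_obj = np.linalg.norm(M - np.outer(u, v)) ** 2
        if obj - new_obj < 1e-12:    # objective stopped improving
            break
        obj = new_obj
    return u, v, new_obj

# several random restarts, as recommended above
best = min(alt_min(M, seed=s)[2] for s in range(5))
# the true optimum: residual energy outside the top singular direction
opt = (np.linalg.svd(M, compute_uv=False)[1:] ** 2).sum()
print(best, opt)
```

On this problem the alternation is essentially the power method and reliably finds the global optimum; on (4.68) no such guarantee exists, which is exactly what item 2 asks you to probe.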
3. Modify (4.68) and your experiment to cover the cases where the constraint $\|A\|\le1$ on the sensing matrix is replaced with one of the following:
• $\|{\rm Row}_i[A]\|_2\le1$, $1\le i\le m$,
• $|A_{ij}|\le1$ for all $i,j$
(note that these two types of restrictions mimic what happens if you are interested in recovering (the linear image of) the vector of parameters in a linear regression model from noisy observations of the model's outputs at the $m$ points which you are allowed to select in the unit ball or unit box).
4. [Embedded Exercise] Recall that a $\nu\times n$ matrix $G$ admits singular value decomposition $G=UDV^T$ with orthogonal matrices $U\in\mathbf{R}^{\nu\times\nu}$ and $V\in\mathbf{R}^{n\times n}$ and diagonal $\nu\times n$ matrix $D$ with nonnegative and nonincreasing diagonal entries.⁹ These entries are uniquely defined by $G$ and are called singular values $\sigma_i(G)$, $1\le i\le\min[\nu,n]$. Singular values admit a characterization similar to the variational characterization of eigenvalues of a symmetric matrix; see, e.g., [15, Section A.7.3]:
Theorem 4.23. [VCSV—Variational Characterization of Singular Values] For a $\nu\times n$ matrix $G$ it holds
$$\sigma_i(G)=\min_{E\in\mathcal E_i}\ \max_{e\in E,\|e\|_2=1}\|Ge\|_2,\quad1\le i\le\min[\nu,n], \tag{4.69}$$
where $\mathcal E_i$ is the family of all subspaces in $\mathbf{R}^n$ of codimension $i-1$.
Corollary 4.24. [SVI—Singular Value Interlacement] Let $G$ and $G'$ be $\nu\times n$ matrices, and let $k={\rm Rank}(G-G')$. Then $\sigma_i(G)\ge\sigma_{i+k}(G')$, $1\le i\le\min[\nu,n]$, where, by definition, singular values of a $\nu\times n$ matrix with indexes $>\min[\nu,n]$ are zeros.
We denote by $\sigma(G)$ the vector of singular values of $G$ arranged in nonincreasing order. The function $\|G\|_{{\rm Sh},p}=\|\sigma(G)\|_p$ is called the Shatten $p$-norm of matrix $G$; this indeed is a norm on the space of $\nu\times n$ matrices, and the conjugate norm is $\|\cdot\|_{{\rm Sh},q}$, with $\frac1p+\frac1q=1$. An easy and important consequence of Corollary 4.24 is the following fact:
Corollary 4.25. Given a $\nu\times n$ matrix $G$, an integer $k$, $0\le k\le\min[\nu,n]$, and $p\in[1,\infty]$, (one of) the best approximations of $G$ in the Shatten $p$-norm among matrices of rank $\le k$ is obtained from $G$ by zeroing out all but the $k$ largest singular values, that is, the matrix $G_k=\sum_{i=1}^k\sigma_i(G){\rm Col}_i[U]{\rm Col}_i^T[V]$, where $G=UDV^T$ is the singular value decomposition of $G$.
Prove Theorem 4.23 and Corollaries 4.24 and 4.25.
5. Consider the Measurement Design problem (4.68) in the case when $X$ is an ellipsoid:
$$X=\Bigl\{x\in\mathbf{R}^n:\sum_{j=1}^n x_j^2/a_j^2\le1\Bigr\},$$
⁹We say that a rectangular matrix $D$ is diagonal if all entries $D_{ij}$ in $D$ with $i\ne j$ are zeros.
$A$ is an $m\times n$ matrix of spectral norm not exceeding 1, and there is no noise in the observations: $\sigma=0$. Find an optimal solution to this problem. Think how this result can be used to get a (hopefully) good starting point for Alternating Minimization in the case when $X$ is an ellipsoid and $\sigma$ is small.

4.7.3 Around semidefinite relaxation
Exercise 4.5. Let $X$ be an ellitope:
$$X=\{x\in\mathbf{R}^n:\exists(y\in\mathbf{R}^N,t\in\mathcal T):x=Py,\ y^TS_ky\le t_k,\ k\le K\}$$
with our standard restrictions on $\mathcal T$ and $S_k$. Representing $S_k=\sum_{j=1}^{r_k}s_{kj}s_{kj}^T$, we can pass from the initial ellitopic representation of $X$ to the spectratopic representation of the same set:
$$X=\{x\in\mathbf{R}^n:\exists(y\in\mathbf{R}^N,t^+\in\mathcal T^+):x=Py,\ [s_{kj}^Ty]^2\preceq t^+_{kj}I_1,\ 1\le k\le K,\ 1\le j\le r_k\},$$
$$\mathcal T^+=\Bigl\{t^+=\{t^+_{kj}\ge0\}:\exists t\in\mathcal T:\sum_{j=1}^{r_k}t^+_{kj}\le t_k,\ 1\le k\le K\Bigr\}.$$
If now $C$ is a symmetric $n\times n$ matrix and ${\rm Opt}_*=\max_{x\in X}x^TCx$, we have
$${\rm Opt}_*\le{\rm Opt}_e:=\min_{\lambda=\{\lambda_k\ge0\}}\Bigl\{\phi_{\mathcal T}(\lambda):P^TCP\preceq\sum\nolimits_k\lambda_kS_k\Bigr\},$$
$${\rm Opt}_*\le{\rm Opt}_s:=\min_{\Lambda=\{\Lambda_{kj}\ge0\}}\Bigl\{\phi_{\mathcal T^+}(\Lambda):P^TCP\preceq\sum\nolimits_{k,j}\Lambda_{kj}s_{kj}s_{kj}^T\Bigr\},$$
where the first relation is yielded by the ellitopic representation of $X$ and Proposition 4.6, and the second, on closer inspection (carry this inspection out!), by the spectratopic representation of $X$ and Proposition 4.8. Prove that ${\rm Opt}_e={\rm Opt}_s$.
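As a sanity check in the simplest case, take $X$ to be the unit Euclidean ball, i.e., the basic ellitope with $K=1$, $S_1=I$, $\mathcal T=[0,1]$. Then $\phi_{\mathcal T}(\lambda)=\lambda$ and the relaxation $\min\{\lambda\ge0:C\preceq\lambda I\}$ has the closed-form value $\max(\lambda_{\max}(C),0)$, which coincides with $\max_{\|x\|_2\le1}x^TCx$ exactly — so no SDP solver is needed to verify it numerically (a sketch under these assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# SDP relaxation for X = unit Euclidean ball: min{lambda >= 0 : C <= lambda*I}
# equals max(lambda_max(C), 0) and matches max_{||x||_2 <= 1} x^T C x.
n = 8
G = rng.standard_normal((n, n))
C = (G + G.T) / 2                      # a symmetric matrix

w, V = np.linalg.eigh(C)               # eigenvalues in nondecreasing order
relax = max(w[-1], 0.0)                # relaxation bound, in closed form
x_star = V[:, -1]                      # maximizer when lambda_max(C) >= 0
brute = x_star @ C @ x_star if w[-1] >= 0 else 0.0
print(relax, brute)   # equal: the relaxation is exact on ellipsoids
```

On general ellitopes the relaxation is no longer exact, and quantifying the gap is precisely what Proposition 4.6 and Exercise 4.6 below are about.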
Exercise 4.6. Proposition 4.6 provides us with an upper bound on the quality of the semidefinite relaxation as applied to the problem of upper-bounding the maximum of a homogeneous quadratic form over an ellitope. Extend the construction to the case where an inhomogeneous quadratic form is maximized over a shifted ellitope, so that the quantity to upper-bound is
$${\rm Opt}=\max_{x\in X}\bigl[f(x):=x^TAx+2b^Tx+c\bigr],\qquad X=\{x:\exists(y,t\in\mathcal T):x=Py+p,\ y^TS_ky\le t_k,\ 1\le k\le K\}$$
with our standard assumptions on $S_k$ and $\mathcal T$. Note: $X$ is centered at $p$, and a natural upper bound on Opt is
$${\rm Opt}\le f(p)+\widehat{\rm Opt},$$
where $\widehat{\rm Opt}$ is an upper bound on the quantity
$$\overline{\rm Opt}=\max_{x\in X}\bigl[f(x)-f(p)\bigr].$$
What you are interested in upper-bounding is the ratio $\widehat{\rm Opt}/\overline{\rm Opt}$.
Exercise 4.7. [estimating Kolmogorov widths of spectratopes/ellitopes]
4.7.A. Preliminaries: Kolmogorov and Gelfand widths. Let $\mathcal X$ be a convex compact set in $\mathbf{R}^n$, and let $\|\cdot\|$ be a norm on $\mathbf{R}^n$. Given a linear subspace $E$ in $\mathbf{R}^n$, let
$${\rm dist}_{\|\cdot\|}(x,E)=\min_{z\in E}\|x-z\|:\mathbf{R}^n\to\mathbf{R}_+$$
be the $\|\cdot\|$-distance from $x$ to $E$. The quantity
$${\rm dist}_{\|\cdot\|}(\mathcal X,E)=\max_{x\in\mathcal X}{\rm dist}_{\|\cdot\|}(x,E)$$
can be viewed as the worst-case $\|\cdot\|$-accuracy to which vectors from $\mathcal X$ can be approximated by vectors from $E$. Given a positive integer $m\le n$ and denoting by $\mathcal E_m$ the family of all linear subspaces in $\mathbf{R}^n$ of dimension $m$, the quantity
$$\delta_m(\mathcal X,\|\cdot\|)=\min_{E\in\mathcal E_m}{\rm dist}_{\|\cdot\|}(\mathcal X,E)$$
can be viewed as the best achievable quality of approximation, measured in $\|\cdot\|$, of vectors from $\mathcal X$ by vectors from an $m$-dimensional linear subspace of $\mathbf{R}^n$. This quantity is called the $m$-th Kolmogorov width of $\mathcal X$ w.r.t. $\|\cdot\|$. Observe that one has
$${\rm dist}_{\|\cdot\|}(x,E)=\max_\xi\{\xi^Tx:\|\xi\|_*\le1,\ \xi\in E^\perp\},\qquad{\rm dist}_{\|\cdot\|}(\mathcal X,E)=\max_{x\in\mathcal X,\ \|\xi\|_*\le1,\ \xi\in E^\perp}\xi^Tx, \tag{4.70}$$
where $E^\perp$ is the orthogonal complement of $E$.
1) Prove (4.70).
Hint: Represent ${\rm dist}_{\|\cdot\|}(x,E)$ as the optimal value in a conic problem on the cone $\mathbf K=\{[x;t]:t\ge\|x\|\}$ and use the Conic Duality Theorem.
Now consider the case when $\mathcal X$ is the unit ball of some norm $\|\cdot\|_{\mathcal X}$. In this case (4.70) combines with the definition of Kolmogorov width to imply that
$$\delta_m(\mathcal X,\|\cdot\|)=\min_{E\in\mathcal E_m}\max_{x\in\mathcal X}{\rm dist}_{\|\cdot\|}(x,E)=\min_{E\in\mathcal E_m}\ \max_{y\in E^\perp,\|y\|_*\le1}\ \max_{x:\|x\|_{\mathcal X}\le1}y^Tx=\min_{F\in\mathcal E_{n-m}}\ \max_{y\in F,\|y\|_*\le1}\|y\|_{\mathcal X,*}, \tag{4.71}$$
where $\|\cdot\|_{\mathcal X,*}$ is the norm conjugate to $\|\cdot\|_{\mathcal X}$. Note that when $\mathcal Y$ is a convex compact set in $\mathbf{R}^n$ and $|\cdot|$ is a norm on $\mathbf{R}^n$, the quantity
$$d_m(\mathcal Y,|\cdot|)=\min_{F\in\mathcal E_{n-m}}\max_{y\in\mathcal Y\cap F}|y|$$
has a name—it is called the $m$-th Gelfand width of $\mathcal Y$ taken w.r.t. $|\cdot|$. The "duality relation" (4.71) states that
When $\mathcal X$, $\mathcal Y$ are the unit balls of respective norms $\|\cdot\|_{\mathcal X}$, $\|\cdot\|_{\mathcal Y}$, for every $m<n$ the $m$-th Kolmogorov width of $\mathcal X$ taken w.r.t. $\|\cdot\|_{\mathcal Y,*}$ is the same as
the $m$-th Gelfand width of $\mathcal Y$ taken w.r.t. $\|\cdot\|_{\mathcal X,*}$.
The goal of the remaining part of the exercise is to use our results on the quality of semidefinite relaxation on ellitopes/spectratopes to infer efficiently computable upper bounds on Kolmogorov widths of a given set $\mathcal X\subset\mathbf{R}^n$. In the sequel we assume that
• $\mathcal X$ is a spectratope: $\mathcal X=\{x\in\mathbf{R}^n:\exists(t\in\mathcal T,u):x=Pu,\ R_k^2[u]\preceq t_kI_{d_k},\ k\le K\}$;
• the unit ball $\mathcal B_*$ of the norm conjugate to $\|\cdot\|$ is a spectratope: $\mathcal B_*=\{y:\|y\|_*\le1\}=\{y\in\mathbf{R}^n:\exists(r\in\mathcal R,z):y=Mz,\ S_\ell^2[z]\preceq r_\ell I_{f_\ell},\ \ell\le L\}$,
with our usual restrictions on $\mathcal T$, $\mathcal R$ and $R_k[\cdot]$, $S_\ell[\cdot]$.
4.7.B. Simple case: $\|\cdot\|=\|\cdot\|_2$. We start with the simple case where $\|\cdot\|=\|\cdot\|_2$, so that $\mathcal B_*$ is the ellitope $\{y:y^Ty\le1\}$. Let $D=\sum_k d_k$ be the size of the spectratope $\mathcal X$, and let $\varkappa=2\max[\ln(2D),1]$.
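Before turning to the optimization problem below, note that for $\|\cdot\|=\|\cdot\|_2$ the identity (4.70) from item 1) is easy to verify numerically, since both sides reduce to orthogonal projection. A quick sketch (here $Q$ is our stand-in for an orthonormal basis of $E$):

```python
import numpy as np

rng = np.random.default_rng(3)

# dist_{||.||_2}(x, E) for E = range(Q): the primal definition
# min_{z in E} ||x - z||_2 and the dual formula (4.70),
# max{xi^T x : ||xi||_2 <= 1, xi in E_perp}, both via projection.
n, d = 10, 4
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal basis of E
x = rng.standard_normal(n)

resid = x - Q @ (Q.T @ x)                  # component of x orthogonal to E
primal = np.linalg.norm(resid)             # min_{z in E} ||x - z||_2
xi = resid / np.linalg.norm(resid)         # feasible dual point in E_perp
dual = xi @ x                              # max_{xi in E_perp, ||xi|| <= 1} xi^T x
print(primal, dual)   # the two values coincide
```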
Given integer $m<n$, consider the convex optimization problem
$${\rm Opt}(m)=\min_{\Lambda=\{\Lambda_k,k\le K\},\,Y}\Bigl\{\phi_{\mathcal T}(\lambda[\Lambda]):\Lambda_k\succeq0\ \forall k,\ 0\preceq Y\preceq I_n,\ \sum\nolimits_k R_k^*[\Lambda_k]\succeq P^TYP,\ {\rm Tr}(Y)=n-m\Bigr\}. \tag{$P_m$}$$
2) Prove the following:
Proposition 4.26. Whenever $1\le\mu\le m<n$, one has
$${\rm Opt}(m)\le\varkappa\,\delta_m^2(\mathcal X,\|\cdot\|_2)\quad\&\quad\delta_m^2(\mathcal X,\|\cdot\|_2)\le\frac{m+1}{m+1-\mu}{\rm Opt}(\mu). \tag{4.72}$$
Moreover, the above upper bounds on $\delta_m(\mathcal X,\|\cdot\|_2)$ are "constructive," meaning that an optimal solution to $(P_\mu)$, $\mu\le m$, can be straightforwardly converted into a linear subspace $E^{m,\mu}$ of dimension $m$ such that
$${\rm dist}_{\|\cdot\|_2}(\mathcal X,E^{m,\mu})\le\sqrt{\frac{m+1}{m+1-\mu}{\rm Opt}(\mu)}.$$
Finally, ${\rm Opt}(\mu)$ is nonincreasing in $\mu<n$.
4.7.C. General case. Now consider the case when both $\mathcal X$ and the unit ball $\mathcal B_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes. As we are about to see, this case is essentially more difficult than the case of $\|\cdot\|=\|\cdot\|_2$, but something still can be done.
3) Prove the following statement:
(!) Given $m<n$, let $Y$ be an orthoprojector of $\mathbf{R}^n$ of rank $n-m$, and let collections $\Lambda=\{\Lambda_k\succeq0,\ k\le K\}$ and $\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\}$ satisfy the relation
$$\begin{bmatrix}\sum_k R_k^*[\Lambda_k]&\frac12P^TYM\\ \frac12M^TYP&\sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0. \tag{4.73}$$
Then
$${\rm dist}_{\|\cdot\|}(\mathcal X,{\rm Ker}\,Y)\le\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]). \tag{4.74}$$
As a result,
$$\delta_m(\mathcal X,\|\cdot\|)\le{\rm dist}_{\|\cdot\|}(\mathcal X,{\rm Ker}\,Y)\le{\rm Opt}:=\min_{\Lambda=\{\Lambda_k,k\le K\},\atop\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]):\begin{array}{l}\Lambda_k\succeq0\ \forall k,\ \Upsilon_\ell\succeq0\ \forall\ell,\\[2pt]\begin{bmatrix}\sum_k R_k^*[\Lambda_k]&\frac12P^TYM\\ \frac12M^TYP&\sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\end{array}\right\}. \tag{4.75}$$
4) Prove the following statement:
(!!) Let $m$, $n$, $Y$ be as in (!). Then
$$\delta_m(\mathcal X,\|\cdot\|)\le{\rm dist}_{\|\cdot\|}(\mathcal X,{\rm Ker}\,Y)\le\widehat{\rm Opt}:=\min_{\nu,\Lambda=\{\Lambda_k,k\le K\},\atop\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]):\begin{array}{l}\nu\ge0,\ \Lambda_k\succeq0\ \forall k,\ \Upsilon_\ell\succeq0\ \forall\ell,\\[2pt]\begin{bmatrix}\sum_k R_k^*[\Lambda_k]&\frac12P^TM\\ \frac12M^TP&\sum_\ell S_\ell^*[\Upsilon_\ell]+\nu M^T(I-Y)M\end{bmatrix}\succeq0\end{array}\right\}, \tag{4.76}$$
and $\widehat{\rm Opt}\le{\rm Opt}$, with Opt given by (4.75).
Statements (!) and (!!) suggest the following policy for upper-bounding the Kolmogorov width $\delta_m(\mathcal X,\|\cdot\|)$:
A. First, we select an integer $\mu$, $1\le\mu<n$, and solve the convex optimization problem
$$\min_{\Lambda,\Upsilon,Y}\left\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]):\begin{array}{l}0\preceq Y\preceq I,\ {\rm Tr}(Y)=n-\mu,\\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[2pt]\begin{bmatrix}\sum_k R_k^*[\Lambda_k]&\frac12P^TYM\\ \frac12M^TYP&\sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\end{array}\right\} \tag{$P^\mu$}$$
B. Next, we take the $Y$-component $Y^\mu$ of the optimal solution to $(P^\mu)$ and "round" it to an orthoprojector $Y$ of rank $n-m$ in the same fashion as in the case of $\|\cdot\|=\|\cdot\|_2$, that is, keep the eigenvectors of $Y^\mu$ intact and replace the $m$ smallest eigenvalues with zeros, and all remaining eigenvalues with ones.
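The rounding in step B is a few lines of numpy; a sketch under our own naming, applied to a random feasible $Y$ with $0\preceq Y\preceq I$:

```python
import numpy as np

rng = np.random.default_rng(5)

def round_to_projector(Y, m):
    """Round a symmetric matrix 0 <= Y <= I to an orthoprojector of rank
    n - m: keep the eigenvectors, zero out the m smallest eigenvalues and
    set the remaining ones to 1 (step B of the bounding scheme)."""
    w, V = np.linalg.eigh(Y)           # eigenvalues in nondecreasing order
    w_rounded = np.ones_like(w)
    w_rounded[:m] = 0.0                # m smallest eigenvalues -> 0
    return (V * w_rounded) @ V.T       # V diag(w_rounded) V^T

# a random feasible Y with 0 <= Y <= I
n, m = 12, 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = (U * rng.uniform(0, 1, n)) @ U.T
P = round_to_projector(Y, m)
print(np.linalg.matrix_rank(P), np.allclose(P @ P, P))  # rank n-m projector
```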
C. Finally, we solve the convex optimization problem
$${\rm Opt}^{m,\mu}=\min_{\Lambda,\Upsilon,\nu}\left\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]):\begin{array}{l}\nu\ge0,\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[2pt]\begin{bmatrix}\sum_k R_k^*[\Lambda_k]&\frac12P^TM\\ \frac12M^TP&\sum_\ell S_\ell^*[\Upsilon_\ell]+\nu M^T(I-Y)M\end{bmatrix}\succeq0\end{array}\right\} \tag{$P^{m,\mu}$}$$
By (!!), ${\rm Opt}^{m,\mu}$ is an upper bound on the Kolmogorov width $\delta_m(\mathcal X,\|\cdot\|)$ (and in fact also on ${\rm dist}_{\|\cdot\|}(\mathcal X,{\rm Ker}\,Y)$).
Observe all the complications we encounter when passing from the simple case $\|\cdot\|=\|\cdot\|_2$ to the case of a general norm $\|\cdot\|$ with a spectratope as the unit ball of the conjugate norm. Note that Proposition 4.26 gives both a lower bound $\sqrt{{\rm Opt}(m)/\varkappa}$ on the $m$-th Kolmogorov width of $\mathcal X$ w.r.t. $\|\cdot\|_2$, and a family of upper bounds $\sqrt{\frac{m+1}{m+1-\mu}{\rm Opt}(\mu)}$, $1\le\mu\le m$, on this width. As a result, we can approximate $\mathcal X$ by $m$-dimensional subspaces in the Euclidean norm in a "nearly optimal" fashion. Indeed, if for some $\epsilon$ and $k$ it holds $\delta_k(\mathcal X,\|\cdot\|_2)\le\epsilon$, then ${\rm Opt}(k)\le\varkappa\epsilon^2$ by Proposition 4.26 as applied with $m=k$. On the other hand, assuming $k<n/2$, the same proposition when applied with $m=2k$ and $\mu=k$ says that
$${\rm dist}_{\|\cdot\|_2}(\mathcal X,E^{2k,k})\le\sqrt{\frac{2k+1}{k+1}{\rm Opt}(k)}\le\sqrt{2\,{\rm Opt}(k)}\le\sqrt{2\varkappa}\,\epsilon.$$
Thus, if $\mathcal X$ can be approximated by a $k$-dimensional subspace within $\|\cdot\|_2$-accuracy $\epsilon$, we can efficiently get an approximation of "nearly the same quality" ($\sqrt{2\varkappa}\epsilon$ instead of $\epsilon$; recall that $\varkappa$ is just logarithmic in $D$) and "nearly the same dimension" ($2k$ instead of $k$). Neither of these options is preserved when passing from the Euclidean norm to a general one: in the latter case, we do not have lower bounds on Kolmogorov widths, and have no understanding of how tight our upper bounds are. Now, two concluding questions:
5) Why in step A of the above bounding scheme do we utilize statement (!) rather than the less conservative (since $\widehat{\rm Opt}\le{\rm Opt}$) statement (!!)?
6) Implement the scheme numerically and run experiments. Recommended setup:
• Given $\sigma>0$ and positive integers $n$ and $\kappa$, let $f$ be a function of continuous argument $t\in[0,1]$ satisfying the smoothness restriction
$$|f^{(k)}(t)|\le\sigma^k,\quad0\le t\le1,\ k=0,1,2,\dots,\kappa.$$
Specify $\mathcal X$ as the set of $n$-dimensional vectors $x$ obtained by restricting $f$ onto the $n$-point equidistant grid $\{t_i=i/n\}_{i=1}^n$. To this end, translate the description of $f$ into a bunch of two-sided linear constraints on $x$:
$$|d_{(k)}^T[x_i;x_{i+1};\dots;x_{i+k}]|\le\sigma^k,\quad1\le i\le n-k,\ 0\le k\le\kappa,$$
where $d_{(k)}\in\mathbf{R}^{k+1}$ is the vector of coefficients of the finite-difference approximation, with resolution $1/n$, of the $k$-th derivative: $d_{(0)}=1$, $d_{(1)}=n[-1;1]$, $d_{(2)}=n^2[1;-2;1]$, $d_{(3)}=n^3[-1;3;-3;1]$, $d_{(4)}=n^4[1;-4;6;-4;1]$, ....
• Recommended parameters: $n=32$, $m=8$, $\kappa=5$, $\sigma\in\{0.25,0.5,1,2,4\}$.
• Run experiments with $\|\cdot\|=\|\cdot\|_1$ and $\|\cdot\|=\|\cdot\|_2$.

Exercise 4.8. [more on semidefinite relaxation] The goal of this exercise is to extend SDP relaxation beyond ellitopes/spectratopes. SDP relaxation is aimed at upper-bounding the quantity
$${\rm Opt}_{\mathcal X}(B)=\max_{x\in\mathcal X}x^TBx\qquad[B\in\mathbf S^n],$$
where $\mathcal X\subset\mathbf{R}^n$ is a given set (which we from now on assume to be nonempty, convex, and compact). To this end we look for a computationally tractable convex compact set $\mathcal U\subset\mathbf S^n$ such that for every $x\in\mathcal X$ it holds $xx^T\in\mathcal U$; in this case, we refer to $\mathcal U$ as a set matching $\mathcal X$ (equivalent wording: "$\mathcal U$ matches $\mathcal X$"). Given such a set $\mathcal U$, the optimal value in the convex optimization problem
$${\rm Opt}_{\mathcal U}(B)=\max_{U\in\mathcal U}{\rm Tr}(BU) \tag{4.77}$$
is an efficiently computable convex upper bound on ${\rm Opt}_{\mathcal X}(B)$.
Given $\mathcal U$ matching $\mathcal X$, we can pass from $\mathcal U$ to the conic hull of $\mathcal U$—to the set
$$\mathbf U[\mathcal U]={\rm cl}\{(U,\mu)\in\mathbf S^n\times\mathbf R_+:\mu>0,\ U/\mu\in\mathcal U\},$$
which, as is immediately seen, is a closed convex cone contained in $\mathbf S^n\times\mathbf R_+$. The only point $(U,\mu)$ in this cone with $\mu=0$ has $U=0$ (since $\mathcal U$ is compact), and $\mathcal U=\{U:(U,1)\in\mathbf U\}=\{U:\exists\mu\le1:(U,\mu)\in\mathbf U\}$, so that the definition of ${\rm Opt}_{\mathcal U}$ can be rewritten equivalently as
$${\rm Opt}_{\mathcal U}(B)=\max_{U,\mu}\{{\rm Tr}(BU):(U,\mu)\in\mathbf U,\ \mu\le1\}.$$
The question, of course, is where to take a set $\mathcal U$ matching $\mathcal X$, and the answer depends on what we know about $\mathcal X$. For example, when $\mathcal X$ is a basic ellitope $\mathcal X=\{x\in\mathbf{R}^n:\exists t\in\mathcal T:x^TS_kx\le t_k,\ k\le K\}$ with our usual restrictions on $\mathcal T$ and $S_k$, it is immediately seen that
$$x\in\mathcal X\ \Rightarrow\ xx^T\in\mathcal U=\{U\in\mathbf S^n:U\succeq0,\ \exists t\in\mathcal T:{\rm Tr}(US_k)\le t_k,\ k\le K\}.$$
Similarly, when $\mathcal X$ is a basic spectratope $\mathcal X=\{x\in\mathbf{R}^n:\exists t\in\mathcal T:S_k^2[x]\preceq t_kI_{d_k},\ k\le K\}$ with our usual restrictions on $\mathcal T$ and $S_k[\cdot]$, it is immediately seen that
$$x\in\mathcal X\ \Rightarrow\ xx^T\in\mathcal U=\{U\in\mathbf S^n:U\succeq0,\ \exists t\in\mathcal T:S_k[U]\preceq t_kI_{d_k},\ k\le K\}.$$
One can verify that the semidefinite relaxation bounds on the maximum of a quadratic form on an ellitope/spectratope $\mathcal X$ derived in Sections 4.2.3 (for ellitopes) and 4.3.2 (for spectratopes) are nothing but the bounds (4.77) associated with the $\mathcal U$ just defined.
4.8.A. Matching via absolute norms. There are other ways to specify a set matching $\mathcal X$. The seemingly simplest of them is as follows. Let $p(\cdot)$ be an absolute norm on $\mathbf{R}^n$ (recall that this is a norm $p(x)$ which depends solely on ${\rm abs}[x]$, where ${\rm abs}[x]$ is the vector comprised of the magnitudes of the entries in $x$). We can convert $p(\cdot)$ into the norm $p^+(\cdot)$ on the space $\mathbf S^n$ as follows:
$$p^+(U)=p\bigl([p({\rm Col}_1[U]);\dots;p({\rm Col}_n[U])]\bigr)\qquad[U\in\mathbf S^n].$$
1.1) Prove that $p^+$ indeed is a norm on $\mathbf S^n$, and $p^+(xx^T)=p^2(x)$. Denoting by $q(\cdot)$ the norm conjugate to $p(\cdot)$, what is the relation between the norm $(p^+)_*(\cdot)$ conjugate to $p^+(\cdot)$ and the norm $q^+(\cdot)$?
1.2) Derive from 1.1 that whenever $p(\cdot)$ is an absolute norm such that $\mathcal X$ is contained in the unit ball $B_{p(\cdot)}=\{x:p(x)\le1\}$ of the norm $p$, the set
$$\mathcal U_{p(\cdot)}=\{U\in\mathbf S^n:U\succeq0,\ p^+(U)\le1\}$$
is matching $\mathcal X$. If, in addition,
$$\mathcal X\subset\{x:p(x)\le1,\ Px=0\}, \tag{4.78}$$
then the set
$$\mathcal U_{p(\cdot),P}=\{U\in\mathbf S^n:U\succeq0,\ p^+(U)\le1,\ PU=0\}$$
is matching $\mathcal X$.
Assume that in addition to $p(\cdot)$, we have at our disposal a computationally tractable closed convex set $\mathcal D$ such that whenever $p(x)\le1$, the vector $[x]^2:=[x_1^2;\dots;x_n^2]$ belongs to $\mathcal D$; in the sequel we call such a $\mathcal D$ square-dominating $p(\cdot)$. For example, when $p(\cdot)=\|\cdot\|_r$, we can take
$$\mathcal D=\begin{cases}\{y\in\mathbf{R}^n_+:\sum_i y_i\le1\},&r\le2,\\ \{y\in\mathbf{R}^n_+:\|y\|_{r/2}\le1\},&r>2.\end{cases}$$
Prove that in this situation the above construction can be refined: whenever $\mathcal X$ satisfies (4.78), the set
$$\mathcal U^{\mathcal D}_{p(\cdot),P}=\{U\in\mathbf S^n:U\succeq0,\ p^+(U)\le1,\ PU=0,\ {\rm dg}(U)\in\mathcal D\}\qquad[{\rm dg}(U)=[U_{11};U_{22};\dots;U_{nn}]]$$
matches $\mathcal X$.
Note: in the sequel, we suppress $P$ in the notation $\mathcal U_{p(\cdot),P}$ and $\mathcal U^{\mathcal D}_{p(\cdot),P}$ when $P=0$; thus, $\mathcal U_{p(\cdot)}$ is the same as $\mathcal U_{p(\cdot),0}$.
1.3) Check that when $p(\cdot)=\|\cdot\|_r$ with $r\in[1,\infty]$, one has
$$p^+(U)=\|U\|_r:=\begin{cases}\bigl(\sum_{i,j}|U_{ij}|^r\bigr)^{1/r},&1\le r<\infty,\\ \max_{i,j}|U_{ij}|,&r=\infty.\end{cases}$$
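Both the identity of item 1.3 and the relation $p^+(xx^T)=p^2(x)$ of item 1.1 are easy to spot-check numerically for $p=\|\cdot\|_r$ (a sketch; `p_plus` is our helper name):

```python
import numpy as np

rng = np.random.default_rng(6)

def p_plus(U, r):
    """p^+ for p = ||.||_r: the r-norm of the vector of columnwise r-norms;
    by item 1.3 this coincides with the entrywise r-norm of U."""
    return np.linalg.norm(np.linalg.norm(U, ord=r, axis=0), ord=r)

x = rng.standard_normal(6)
for r in (1.0, 2.0, 3.0):
    lhs = p_plus(np.outer(x, x), r)
    rhs = np.linalg.norm(x, ord=r) ** 2
    assert abs(lhs - rhs) < 1e-9      # p^+(x x^T) = p^2(x), as in item 1.1
```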
1.4) Let $\mathcal X=\{x\in\mathbf{R}^n:\|x\|_1\le1\}$ and $p(x)=\|x\|_1$, so that $\mathcal X\subset\{x:p(x)\le1\}$, and
$${\rm Conv}\{[x]^2:x\in\mathcal X\}\subset\mathcal D=\Bigl\{y\in\mathbf{R}^n_+:\sum_i y_i\le1\Bigr\}. \tag{4.79}$$
What are the bounds ${\rm Opt}_{\mathcal U_{p(\cdot)}}(B)$ and ${\rm Opt}_{\mathcal U^{\mathcal D}_{p(\cdot)}}(B)$? Is it true that the former (the latter) of the bounds is precise? Is it true that the former (the latter) bound is precise when $B\succeq0$?
1.5) Let $\mathcal X=\{x\in\mathbf{R}^n:\|x\|_2\le1\}$ and $p(x)=\|x\|_2$, so that $\mathcal X\subset\{x:p(x)\le1\}$ and (4.79) holds true. What are the bounds ${\rm Opt}_{\mathcal U_{p(\cdot)}}(B)$ and ${\rm Opt}_{\mathcal U^{\mathcal D}_{p(\cdot)}}(B)$? Is the former (the latter) bound precise?
1.6) Let $\mathcal X\subset\mathbf{R}^n_+$ be closed, convex, bounded, and with a nonempty interior. Verify that the set $\mathcal X^+=\{x\in\mathbf{R}^n:\exists y\in\mathcal X:{\rm abs}[x]\le y\}$ is the unit ball of an absolute norm $p_{\mathcal X}$, and this is the largest absolute norm $p(\cdot)$ such that $\mathcal X\subset\{x:p(x)\le1\}$. Derive from this observation that the norm $p_{\mathcal X}(\cdot)$ is the best (i.e., resulting in the least conservative bounding scheme) among absolute norms which allow us to upper-bound ${\rm Opt}_{\mathcal X}(B)$ via the construction from item 1.2.
4.8.B. "Calculus of matchings." Observe that the matching we have introduced admits a kind of "calculus." Specifically, consider the situation as follows: for $1\le\ell\le L$, we are given
• nonempty convex compact sets $X_\ell\subset\mathbf{R}^{n_\ell}$, $0\in X_\ell$, along with convex compact sets $\mathcal U_\ell\subset\mathbf S^{n_\ell}$ matching $X_\ell$, giving rise to the closed convex cones
$$\mathbf U_\ell={\rm cl}\{(U_\ell,\mu_\ell)\in\mathbf S^{n_\ell}\times\mathbf R_+:\mu_\ell>0,\ \mu_\ell^{-1}U_\ell\in\mathcal U_\ell\}.$$
We denote by $\vartheta_\ell(\cdot)$ the Minkowski functions of $X_\ell$: $\vartheta_\ell(y^\ell)=\inf\{t:t>0,\ t^{-1}y^\ell\in X_\ell\}:\mathbf{R}^{n_\ell}\to\mathbf{R}\cup\{+\infty\}$; note that $X_\ell=\{y^\ell:\vartheta_\ell(y^\ell)\le1\}$;
• $n_\ell\times n$ matrices $A_\ell$ such that $\sum_\ell A_\ell^TA_\ell\succ0$.
On top of that, we are given a monotone convex set $\mathcal T\subset\mathbf{R}^L_+$ intersecting the interior of $\mathbf{R}^L_+$. These data specify the convex set
$$\mathcal X=\{x\in\mathbf{R}^n:\exists t\in\mathcal T:\vartheta_\ell^2(A_\ell x)\le t_\ell,\ \ell\le L\}. \tag{$*$}$$
2.1) Prove the following:
Lemma 4.27. In the situation in question, the set
$$\mathcal U=\bigl\{U\in\mathbf S^n:U\succeq0\ \&\ \exists t\in\mathcal T:(A_\ell UA_\ell^T,t_\ell)\in\mathbf U_\ell,\ \ell\le L\bigr\}$$
is a closed and bounded convex set which matches $\mathcal X$. As a result, the efficiently computable quantity
$${\rm Opt}_{\mathcal U}(B)=\max_U\{{\rm Tr}(BU):U\in\mathcal U\}$$
is an upper bound on ${\rm Opt}_{\mathcal X}(B)=\max_{x\in\mathcal X}x^TBx$.
2.2) Prove that if $\mathcal X\subset\mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $m\times n$ matrix, and $\mathcal U$ matches $\mathcal X$, then the set $\mathcal V=\{V=PUP^T:U\in\mathcal U\}$ matches $\mathcal Y=\{y:\exists x\in\mathcal X:y=Px\}$.
2.3) Prove that if $\mathcal X\subset\mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $n\times m$ matrix of rank $m$, and $\mathcal U$ matches $\mathcal X$, then the set $\mathcal V=\{V\succeq0:PVP^T\in\mathcal U\}$ matches $\mathcal Y=\{y:Py\in\mathcal X\}$.
2.4) Consider the "direct product" case where $\mathcal X=X_1\times\dots\times X_L$. When specifying $A_\ell$ as the matrix which "cuts" the $\ell$-th block $A_\ell x=x^\ell$ of a block vector $x=[x^1;\dots;x^L]\in\mathbf{R}^{n_1}\times\dots\times\mathbf{R}^{n_L}$ and setting $\mathcal T=[0,1]^L$, we cover this situation by the setup under consideration. In the direct product case, the construction from item 2.1 is as follows: given the sets $\mathcal U_\ell$ matching $X_\ell$, we build the set
$$\mathcal U=\bigl\{U=[U^{\ell\ell'}\in\mathbf{R}^{n_\ell\times n_{\ell'}}]_{\ell,\ell'\le L}\in\mathbf S^{n_1+\dots+n_L}:U\succeq0,\ U^{\ell\ell}\in\mathcal U_\ell,\ \ell\le L\bigr\}$$
and claim that this set matches $\mathcal X$. Could we be less conservative? While we do not know how to be less conservative in general, we do know how in the special case when the $\mathcal U_\ell$ are built via absolute norms. Namely, let $p_\ell(\cdot):\mathbf{R}^{n_\ell}\to\mathbf{R}_+$, $\ell\le L$, be absolute norms, let the sets $\mathcal D_\ell$ be square-dominating $p_\ell(\cdot)$,
$$X_\ell\subset\widehat X_\ell=\{x^\ell\in\mathbf{R}^{n_\ell}:P_\ell x^\ell=0,\ p_\ell(x^\ell)\le1\},$$
and let
$$\mathcal U_\ell=\{U\in\mathbf S^{n_\ell}:U\succeq0,\ P_\ell U=0,\ p_\ell^+(U)\le1,\ {\rm dg}(U)\in\mathcal D_\ell\}.$$
In this case the above construction results in
$$\mathcal U=\left\{U=[U^{\ell\ell'}\in\mathbf{R}^{n_\ell\times n_{\ell'}}]_{\ell,\ell'\le L}\in\mathbf S^{n_1+\dots+n_L}_+:\begin{array}{l}P_\ell U^{\ell\ell}=0,\\ p_\ell^+(U^{\ell\ell})\le1,\\ {\rm dg}(U^{\ell\ell})\in\mathcal D_\ell\end{array},\ \ell\le L\right\}.$$
Now let
$$p([x^1;\dots;x^L])=\max[p_1(x^1),\dots,p_L(x^L)]:\mathbf{R}^{n_1}\times\dots\times\mathbf{R}^{n_L}\to\mathbf{R},$$
so that $p$ is an absolute norm and $\mathcal X\subset\{x=[x^1;\dots;x^L]:p(x)\le1,\ P_\ell x^\ell=0,\ \ell\le L\}$. Prove that in fact the set
$$\bar{\mathcal U}=\left\{U=[U^{\ell\ell'}\in\mathbf{R}^{n_\ell\times n_{\ell'}}]_{\ell,\ell'\le L}\in\mathbf S^{n_1+\dots+n_L}_+:\begin{array}{l}P_\ell U^{\ell\ell}=0,\ {\rm dg}(U^{\ell\ell})\in\mathcal D_\ell,\ \ell\le L,\\ p^+(U)\le1\end{array}\right\}$$
matches $\mathcal X$, and that we always have $\bar{\mathcal U}\subset\mathcal U$. Verify that in general this inclusion is strict.
4.8.C. Illustration: Nullspace property revisited. Recall the sparsity-oriented signal recovery via $\ell_1$ minimization from Chapter 1: Given an $m\times n$ sensing matrix $A$ and (noiseless) observation $y=Aw$ of an unknown signal $w$ known to have at most $s$ nonzero entries, we recover $w$ as
$$\widehat w\in\mathop{\rm Argmin}_z\{\|z\|_1:Az=y\}.$$
We called matrix $A$ $s$-good if, whenever $y=Aw$ with $s$-sparse $w$, the only optimal solution to the right-hand side optimization problem is $w$. The (difficult to verify!) necessary and sufficient condition for $s$-goodness is the Nullspace property:
$${\rm Opt}:=\max_z\bigl\{\|z\|_{(s)}:z\in{\rm Ker}\,A,\ \|z\|_1\le1\bigr\}<1/2,$$
where $\|z\|_{(k)}$ is the sum of the $k$ largest entries in the vector ${\rm abs}[z]$. A verifiable sufficient condition for $s$-goodness is
$$\widehat{\rm Opt}:=\min_H\max_j\|{\rm Col}_j[I-H^TA]\|_{(s)}<\frac12, \tag{4.80}$$
the reason being that, as is immediately seen, $\widehat{\rm Opt}$ is an upper bound on Opt (see Proposition 1.9 with $q=1$).
An immediate observation is that Opt is nothing but the maximum of a quadratic form over an appropriate convex compact set. Specifically, let
$$\mathcal X=\Bigl\{[u;v]\in\mathbf{R}^n\times\mathbf{R}^n:Au=0,\ \|u\|_1\le1,\ \sum_i|v_i|\le s,\ \|v\|_\infty\le1\Bigr\},\qquad B=\begin{bmatrix}&\frac12I_n\\ \frac12I_n&\end{bmatrix}.$$
Then
$${\rm Opt}_{\mathcal X}(B)=\max_{[u;v]\in\mathcal X}[u;v]^TB[u;v]=\max_{u,v}\Bigl\{u^Tv:Au=0,\ \|u\|_1\le1,\ \sum_i|v_i|\le s,\ \|v\|_\infty\le1\Bigr\}\underbrace{=}_{(a)}\max_u\bigl\{\|u\|_{(s)}:Au=0,\ \|u\|_1\le1\bigr\}={\rm Opt},$$
where $(a)$ is due to the well-known fact (prove it!) that whenever $s$ is a positive integer $\le n$, the extreme points of the set
$$V=\Bigl\{v\in\mathbf{R}^n:\sum_i|v_i|\le s,\ \|v\|_\infty\le1\Bigr\}$$
are exactly the vectors with at most $s$ nonzero entries, the nonzero entries being $\pm1$; as a result,
$$\forall(z\in\mathbf{R}^n):\ \max_{v\in V}z^Tv=\|z\|_{(s)}.$$
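The fact behind $(a)$ is easy to confirm by brute force in small dimension: the maximum of $z^Tv$ over $V$ is attained at a vertex, i.e., a $\pm1$ vector supported on at most $s$ entries, and equals $\|z\|_{(s)}$ (a sketch; `norm_s` is our helper name):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)

def norm_s(z, s):
    """||z||_(s): the sum of the s largest entries of abs[z]."""
    return np.sort(np.abs(z))[-s:].sum()

# brute-force max of z^T v over the vertices of
# V = {v : sum_i |v_i| <= s, ||v||_inf <= 1}: choose a support of size s
# and take signs matching z, giving sum_{i in support} |z_i|
n, s = 6, 3
z = rng.standard_normal(n)
vertex_max = max(np.abs(z)[list(idx)].sum() for idx in combinations(range(n), s))
assert abs(vertex_max - norm_s(z, s)) < 1e-12
```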
Now, $V$ is the unit ball of the absolute norm
$$r(v)=\min\{t:\|v\|_1\le st,\ \|v\|_\infty\le t\},$$
so that $\mathcal X$ is contained in the unit ball $\mathcal B$ of the absolute norm on $\mathbf{R}^{2n}$ specified as
$$p([u;v])=\max\{\|u\|_1,r(v)\}\qquad[u,v\in\mathbf{R}^n],$$
i.e., $\mathcal X=\{[u;v]:p([u;v])\le1,\ Au=0\}$. As a result, whenever $x=[u;v]\in\mathcal X$, the matrix
$$U=xx^T=\begin{bmatrix}U^{11}=uu^T&U^{12}=uv^T\\ U^{21}=vu^T&U^{22}=vv^T\end{bmatrix}$$
satisfies the condition $p^+(U)\le1$ (see item 1.2 above). In addition, this matrix clearly satisfies the condition $A[U^{11},U^{12}]=0$. It follows that the set
$$\mathcal U=\left\{U=\begin{bmatrix}U^{11}&U^{12}\\ U^{21}&U^{22}\end{bmatrix}\in\mathbf S^{2n}:U\succeq0,\ p^+(U)\le1,\ AU^{11}=0,\ AU^{12}=0\right\}$$
(which clearly is a nonempty convex compact set) matches $\mathcal X$. As a result, the efficiently computable quantity
$$\overline{\rm Opt}=\max_{U\in\mathcal U}{\rm Tr}(BU)=\max_U\left\{{\rm Tr}(U^{12}):U=\begin{bmatrix}U^{11}&U^{12}\\ U^{21}&U^{22}\end{bmatrix}\succeq0,\ p^+(U)\le1,\ AU^{11}=0,\ AU^{12}=0\right\} \tag{4.81}$$
is an upper bound on Opt. As a result, the verifiable condition $\overline{\rm Opt}<1/2$ is sufficient for $s$-goodness of $A$. Now comes the concluding part of the exercise:
3.1) Prove that $\overline{\rm Opt}\le\widehat{\rm Opt}$, so that (4.81) is less conservative than (4.80).
Hint: Apply Conic Duality to verify that
$$\widehat{\rm Opt}=\max_V\Bigl\{{\rm Tr}(V):V\in\mathbf{R}^{n\times n},\ AV=0,\ \sum_{i=1}^n r({\rm Col}_i[V^T])\le1\Bigr\}. \tag{4.82}$$
3.2) Run simulations with randomly generated Gaussian matrices $A$ and play with different values of $s$ to compare $\widehat{\rm Opt}$ and $\overline{\rm Opt}$. To save time, you can use toy sizes $m,n$, say, $m=18$, $n=24$.
4.7.4 Around Propositions 4.4 and 4.14

4.7.4.1 Optimizing linear estimates on convex hulls of unions of spectratopes
Exercise 4.9. Let
• $X_1,\dots,X_J$ be spectratopes in $\mathbf{R}^n$:
$$X_j=\{x\in\mathbf{R}^n:\exists(y\in\mathbf{R}^{N_j},t\in\mathcal T_j):x=P_jy,\ R_{kj}^2[y]\preceq t_kI_{d_{kj}},\ k\le K_j\},\ 1\le j\le J,\qquad R_{kj}[y]=\sum_{i=1}^{N_j}y_iR^{kji},$$
• $A\in\mathbf{R}^{m\times n}$ and $B\in\mathbf{R}^{\nu\times n}$ be given matrices,
• $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal B_*$ of the conjugate norm $\|\cdot\|_*$ is a spectratope:
$$\mathcal B_*:=\{u:\|u\|_*\le1\}=\{u\in\mathbf{R}^\nu:\exists(z\in\mathbf{R}^N,r\in\mathcal R):u=Mz,\ S_\ell^2[z]\preceq r_\ell I_{f_\ell},\ \ell\le L\},\qquad S_\ell[z]=\sum_{i=1}^N z_iS^{\ell i},$$
• $\Pi$ be a convex compact subset of the interior of the positive semidefinite cone $\mathbf S^m_+$,
with our standard restrictions on $R_{kj}[\cdot]$, $S_\ell[\cdot]$, $\mathcal T_j$ and $\mathcal R$. Let, further,
$$\mathcal X={\rm Conv}\Bigl(\bigcup_j X_j\Bigr)$$
be the convex hull of the union of the spectratopes $X_j$. Consider the situation where, given observation $\omega=Ax+\xi$ of an unknown signal $x$ known to belong to $\mathcal X$, we want to recover $Bx$. We assume that the matrix of second moments of the noise is dominated by a matrix from $\Pi$, and quantify the performance of a candidate estimate $\widehat x(\cdot)$ by its $\|\cdot\|$-risk
$${\rm Risk}_{\Pi,\|\cdot\|}[\widehat x|\mathcal X]=\sup_{x\in\mathcal X}\ \sup_{P:P\lhd\Pi}\mathbf E_{\xi\sim P}\{\|Bx-\widehat x(Ax+\xi)\|\},$$
where $P\lhd\Pi$ means that the matrix ${\rm Var}[P]=\mathbf E_{\xi\sim P}\{\xi\xi^T\}$ of second moments of the distribution $P$ is dominated by a matrix from $\Pi$. Prove the following:
Proposition 4.28. In the situation in question, consider the convex optimization problem
$${\rm Opt}=\min_{H,\Theta,\Lambda^j,\Upsilon^j,\Upsilon'}\left\{\max_j\bigl[\phi_{\mathcal T_j}(\lambda[\Lambda^j])+\phi_{\mathcal R}(\lambda[\Upsilon^j])\bigr]+\phi_{\mathcal R}(\lambda[\Upsilon'])+\Gamma_\Pi(\Theta):\begin{array}{l}\Lambda^j=\{\Lambda^j_k\succeq0,\ k\le K_j\},\ j\le J,\\ \Upsilon^j=\{\Upsilon^j_\ell\succeq0,\ \ell\le L\},\ j\le J,\\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\[2pt]\begin{bmatrix}\sum_k R_{kj}^*[\Lambda^j_k]&\frac12P_j^T[B^T-A^TH]M\\ \frac12M^T[B-H^TA]P_j&\sum_\ell S_\ell^*[\Upsilon^j_\ell]\end{bmatrix}\succeq0,\ j\le J,\\[2pt]\begin{bmatrix}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\end{array}\right\} \tag{4.83}$$
where, as usual,
$$\phi_{\mathcal T_j}(\lambda)=\max_{t\in\mathcal T_j}t^T\lambda,\quad\phi_{\mathcal R}(\lambda)=\max_{r\in\mathcal R}r^T\lambda,\quad\Gamma_\Pi(\Theta)=\max_{Q\in\Pi}{\rm Tr}(Q\Theta),\quad\lambda[U_1,\dots,U_S]=[{\rm Tr}(U_1);\dots;{\rm Tr}(U_S)],$$
$$S_\ell^*[\cdot]:\mathbf S^{f_\ell}\to\mathbf S^N:\ S_\ell^*[U]=\bigl[{\rm Tr}(S^{\ell p}US^{\ell q})\bigr]_{p,q\le N},\qquad R_{kj}^*[\cdot]:\mathbf S^{d_{kj}}\to\mathbf S^{N_j}:\ R_{kj}^*[U]=\bigl[{\rm Tr}(R^{kjp}UR^{kjq})\bigr]_{p,q\le N_j}.$$
Problem (4.83) is solvable, and the $H$-component $H_*$ of its optimal solution gives rise to the linear estimate $\widehat x_{H_*}(\omega)=H_*^T\omega$ such that
$${\rm Risk}_{\Pi,\|\cdot\|}[\widehat x_{H_*}|\mathcal X]\le{\rm Opt}. \tag{4.84}$$
Moreover, the estimate $\widehat x_{H_*}$ is near-optimal among linear estimates:
$${\rm Opt}\le O(1)\sqrt{\ln(D+F)}\,{\rm RiskOpt}_{\rm lin}\qquad\Bigl[D=\max_j\sum_{k\le K_j}d_{kj},\ F=\sum_{\ell\le L}f_\ell\Bigr] \tag{4.85}$$
where
$${\rm RiskOpt}_{\rm lin}=\inf_H\ \sup_{x\in\mathcal X,Q\in\Pi}\mathbf E_{\xi\sim\mathcal N(0,Q)}\bigl\{\|Bx-H^T(Ax+\xi)\|\bigr\}$$
is the best risk attainable by linear estimates in the current setting under zero-mean Gaussian observation noise.
It should be stressed that the convex hull of a union of spectratopes is not necessarily a spectratope, and that Proposition 4.28 states that the linear estimate stemming from (4.83) is near-optimal only among linear, not among all, estimates (the latter might indeed not be the case).

4.7.4.2 Recovering nonlinear vector-valued functions
Exercise 4.10. Consider the following situation: we are given a noisy observation

ω = Ax + ξ_x    [A ∈ R^{m×n}]

of the linear image Ax of an unknown signal x known to belong to a given spectratope X ⊂ R^n; here ξ_x is observation noise with distribution P_x which can depend on x. As in Section 4.3.3, we assume that we are given a computationally tractable convex compact set Π ⊂ int S^m_+ such that for every x ∈ X, Var[P_x] ⪯ Θ for some Θ ∈ Π; cf. (4.32). We want to recover the value f(x) of a given vector-valued function f : X → R^ν, and we measure the recovery error in a given norm |·| on R^ν.
SIGNAL RECOVERY BY LINEAR ESTIMATION
4.10.A. Preliminaries and the main observation. Let ∥·∥ be a norm on R^n, and let g(·) : X → R^ν be a function. Recall that g is called Lipschitz continuous on X w.r.t. the pair of norms ∥·∥ on the argument space and |·| on the image space if there exists L < ∞ such that

|g(x) − g(y)| ≤ L ∥x − y∥ for all x, y ∈ X;

every L with this property is called a Lipschitz constant of g. It is well known that in our finite-dimensional situation, whether g is Lipschitz continuous does not depend on how the norms ∥·∥, |·| are selected; this selection affects only the value(s) of the Lipschitz constant(s). Assume from now on that the function of interest f is Lipschitz continuous on X. Let us call a norm ∥·∥ on R^n appropriate for f if f is Lipschitz continuous with constant 1 on X w.r.t. ∥·∥, |·|. Our immediate observation is as follows:
Observation 4.29. In the situation in question, let ∥·∥ be appropriate for f. Then recovering f(x) is not more difficult than recovering x in the norm ∥·∥: every estimate x̂(ω) of x via ω such that x̂(·) ∈ X induces the "plug-in" estimate

f̂(ω) = f(x̂(ω))

of f(x), and the ∥·∥-risk

Risk_{∥·∥}[x̂ | X] = sup_{x∈X} E_{ξ∼P_x} {∥x̂(Ax + ξ) − x∥}

of the estimate x̂ upper-bounds the |·|-risk

Risk_{|·|}[f̂ | X] = sup_{x∈X} E_{ξ∼P_x} {|f̂(Ax + ξ) − f(x)|}

of the estimate f̂ induced by x̂:

Risk_{|·|}[f̂ | X] ≤ Risk_{∥·∥}[x̂ | X].

When f is defined and Lipschitz continuous with constant 1 w.r.t. ∥·∥, |·| on the entire R^n, this conclusion remains valid without the assumption that x̂ is X-valued.
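As a quick numerical sanity check of Observation 4.29 (not part of the exercise; the function and data below are hypothetical), take f(x) = ∥x∥₂, which is Lipschitz continuous with constant 1 w.r.t. ∥·∥₂ on all of R^n: by the reverse triangle inequality, the error of the plug-in estimate f(x̂) never exceeds the error of the underlying estimate x̂.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.linalg.norm                         # f(x) = ||x||_2, 1-Lipschitz w.r.t. ||.||_2

for _ in range(100):
    x = rng.standard_normal(5)             # a toy signal (hypothetical)
    x_hat = x + 0.1 * rng.standard_normal(5)   # any estimate of x
    # the plug-in estimate f(x_hat) inherits the accuracy of x_hat
    assert abs(f(x_hat) - f(x)) <= np.linalg.norm(x_hat - x) + 1e-12
```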
4.10.B. Consequences. Observation 4.29 suggests the following simple approach to solving the estimation problem we started with. Assuming that we have at our disposal a norm ∥·∥ on R^n such that

• ∥·∥ is appropriate for f, and
• ∥·∥ is good, meaning that the unit ball B∗ of the norm ∥·∥∗ conjugate to ∥·∥ is a spectratope given by an explicit spectratopic representation,

we use the machinery of linear estimation developed in Section 4.3.3 to build a near-optimal, in terms of its ∥·∥-risk, linear estimate of x via ω, and then convert this estimate into an estimate of f(x). By the above observation, the |·|-risk of the resulting estimate is upper-bounded by the ∥·∥-risk of the underlying linear estimate.

The construction just outlined needs a correction: in general, the linear estimate x̃(·) yielded by Proposition 4.14 (as does any nontrivial, that is, not identically zero, linear estimate) is not guaranteed to take values in X, which is, in general, required for
Observation 4.29 to be applicable. This correction is easy: it suffices to convert x̃ into the estimate x̂ defined by

x̂(ω) ∈ Argmin_{u∈X} ∥u − x̃(ω)∥.

This transformation preserves efficient computability of the estimate and ensures that the corrected estimate takes its values in X; at the same time, the "correction" x̃ ↦ x̂ nearly preserves the ∥·∥-risk:

Risk_{∥·∥}[x̂ | X] ≤ 2 Risk_{∥·∥}[x̃ | X].        (∗)
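The intuition behind (∗) is the triangle inequality: since x ∈ X and x̂(ω) is the point of X closest to x̃(ω), we have ∥x̂ − x̃∥ ≤ ∥x − x̃∥, whence ∥x̂ − x∥ ≤ 2∥x̃ − x∥ pointwise. A toy numerical sketch, with X taken, for illustration only, to be the unit Euclidean ball, where the metric projection is just radial scaling:

```python
import numpy as np

rng = np.random.default_rng(1)

def project_to_unit_ball(v):
    """Metric projection onto X = {u : ||u||_2 <= 1} (radial scaling)."""
    nv = np.linalg.norm(v)
    return v if nv <= 1.0 else v / nv

for _ in range(1000):
    x = rng.standard_normal(4)
    x /= max(1.0, np.linalg.norm(x))           # a signal in X
    x_tilde = x + rng.standard_normal(4)       # an arbitrary, possibly infeasible, estimate
    x_hat = project_to_unit_ball(x_tilde)      # corrected, X-valued estimate
    # pointwise version of (*): the correction at most doubles the error
    assert np.linalg.norm(x_hat - x) <= 2 * np.linalg.norm(x_tilde - x) + 1e-10
```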
Note that when ∥·∥ is a (general-type) Euclidean norm, ∥x∥² = x^T Qx for some Q ≻ 0, the factor 2 on the right-hand side can be discarded.

1) Justify (∗).

4.10.C. How to select ∥·∥. When implementing the outlined approach, the major question is how to select a norm ∥·∥ appropriate for f. The best choice would be to select the smallest among the norms appropriate for f (such a norm does exist under mild assumptions), because the smaller ∥·∥ is, the smaller the ∥·∥-risk of an estimate of x. This ideal can be achieved in rare cases only: first, it could be difficult to identify the smallest among the norms appropriate for f; second, our approach requires ∥·∥ to have an explicitly given spectratope as the unit ball of the conjugate norm. Let us look at a couple of "favorable cases," where the difficulties just outlined can be (partially) overcome.

Example: a norm-induced f. Let us start with the case, important in its own right, when f is a scalar functional which itself is a norm, and this norm has a spectratope as the unit ball of the conjugate norm, as is the case when f(·) = ∥·∥_r, r ∈ [1, 2], or when f(·) is the nuclear norm. In this case the smallest of the norms appropriate for f clearly is f itself, and none of the outlined difficulties arises. As an extension, f(x) may be obtained from a good norm ∥·∥ by operations preserving Lipschitz continuity with constant 1, such as f(x) = ∥x − c∥, or f(x) = Σ_i a_i ∥x − c_i∥ with Σ_i |a_i| ≤ 1, or f(x) = sup_{c∈C} ∥x − c∥ or inf_{c∈C} ∥x − c∥, or even something like

f(x) = sup_{α∈A} (or inf_{α∈A}) [ sup_{c∈C_α} (or inf_{c∈C_α}) ∥x − c∥ ].
In such a case, it seems natural to use this norm in our construction, although now it is, perhaps, not the smallest of the norms appropriate for f.

Now let us consider the general case. Note that in principle the smallest of the norms appropriate for a given Lipschitz continuous f admits a description. Specifically, assume that X has a nonempty interior (this is w.l.o.g., since we can always replace R^n with the linear span of X). A well-known fact of Analysis (the Rademacher Theorem) states that in this situation (more generally, when X is convex with a nonempty interior), a Lipschitz continuous f is differentiable almost everywhere in X^o = int X, and f is Lipschitz continuous with constant 1 w.r.t. a norm ∥·∥ if and only if

∥f′(x)∥_{∥·∥→|·|} ≤ 1

whenever x ∈ X^o is such that the derivative (a.k.a. the Jacobian) of f at x exists; here ∥Q∥_{∥·∥→|·|} is the matrix norm of a ν × n matrix Q induced by the norms ∥·∥ on R^n and |·| on R^ν:

∥Q∥_{∥·∥→|·|} := max_{∥x∥≤1} |Qx| = max_{∥x∥≤1, |y|_*≤1} y^T Qx = max_{|y|_*≤1, ∥x∥≤1} x^T Q^T y = ∥Q^T∥_{|·|_* → ∥·∥_*},

where ∥·∥_*, |·|_* are the conjugates of ∥·∥, |·|.

2) Prove that a norm ∥·∥ is appropriate for f if and only if the unit ball of the norm conjugate to ∥·∥ contains the set

B_{f,*} = cl Conv{z : ∃(x ∈ X_o, y, |y|_* ≤ 1) : z = [f′(x)]^T y},

where X_o is the set of all x ∈ X^o where f′(x) exists. Geometrically, B_{f,*} is the closed convex hull of the union of all images of the unit ball B_* of |·|_* under the linear mappings y ↦ [f′(x)]^T y stemming from x ∈ X_o. Equivalently: ∥·∥ is appropriate for f if and only if

∥u∥ ≥ ∥u∥_f := max_{z∈B_{f,*}} z^T u.        (!)
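For intuition about the induced matrix norm, consider the Euclidean case ∥·∥ = |·| = ∥·∥₂, where ∥Q∥_{∥·∥₂→∥·∥₂} is the largest singular value of Q; the following sketch (on hypothetical random data) also illustrates the identity ∥Q∥ = ∥Q^T∥ from the chain of equalities above.

```python
import numpy as np

rng = np.random.default_rng(6)
Q = rng.standard_normal((3, 5))               # hypothetical nu x n matrix

# for the pair (||.||_2, ||.||_2) the induced norm is the largest singular value
ind_norm = np.linalg.norm(Q, 2)

for _ in range(1000):
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)                    # ||x|| = 1
    assert np.linalg.norm(Q @ x) <= ind_norm + 1e-10

# the identity ||Q|| = ||Q^T|| from the chain of equalities above
assert np.isclose(ind_norm, np.linalg.norm(Q.T, 2))
```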
Check that ∥u∥_f is a norm, provided that B_{f,*} (this set by construction is a convex compact set symmetric w.r.t. the origin) possesses a nonempty interior; whenever this is the case, ∥·∥_f is the smallest of the norms appropriate for f. Derive from the above that the norms ∥·∥ we can use in our approach are the norms on R^n for which the unit ball of the conjugate norm is a spectratope containing B_{f,*}.

Example. Consider the case of componentwise quadratic f:

f(x) = [½ x^T Q_1 x; ½ x^T Q_2 x; ...; ½ x^T Q_ν x]    [Q_i ∈ S^n]

and |u| = ∥u∥_q with q ∈ [1, 2].¹⁰ In this case B_* = {u ∈ R^ν : ∥u∥_p ≤ 1}, p = q/(q−1) ∈ [2, ∞[, and f′(x) = [x^T Q_1; x^T Q_2; ...; x^T Q_ν]. Setting S = {s ∈ R^ν_+ : ∥s∥_{p/2} ≤ 1} and

S^{1/2} = {s ∈ R^ν_+ : [s_1²; ...; s_ν²] ∈ S} = {s ∈ R^ν_+ : ∥s∥_p ≤ 1},

the set

Z = {[f′(x)]^T u : x ∈ X, u ∈ B_*}

¹⁰ To save notation, we assume that the linear parts in the components of f are trivial, just zeros. In this respect, note that we can always subtract from f any linear mapping and reduce our estimation problem to two distinct problems: estimating separately the values at the signal x of the modified f and of the linear mapping we have subtracted (we know how to solve the latter problem reasonably well).
is contained in the set

Y = { y ∈ R^n : ∃(s ∈ S^{1/2}, x^i ∈ X, i ≤ ν) : y = Σ_i s_i Q_i x^i }.

The set Y is a spectratope, with a spectratopic representation readily given by that of X. Indeed, Y is nothing but the S-sum of the spectratopes Q_i X, i = 1, ..., ν; see Section 4.10. As a result, we can use the spectratope Y (when int Y ≠ ∅), or the arithmetic sum of Y and a small Euclidean ball (when int Y = ∅), as the unit ball of the norm conjugate to ∥·∥, thus ensuring that ∥·∥ is appropriate for f. We then can use ∥·∥ in order to build an estimate of f(·).

3.1) For illustration, work out the problem of recovering the value of a scalar quadratic form

f(x) = x^T M x, M = Diag{i^α, i = 1, ..., n}    [ν = 1, |·| is the absolute value]
from the noisy observation

ω = Ax + ση, A = Diag{i^β, i = 1, ..., n}, η ∼ N(0, I_n)        (4.86)

of a signal x known to belong to the ellipsoid X = {x ∈ R^n : ∥Px∥₂ ≤ 1}, P = Diag{i^γ, i = 1, ..., n}, where α, β, γ are given reals satisfying α − γ − β < −1/2. You could start with the simplest unbiased estimate x̃(ω) = [1^{−β} ω₁; 2^{−β} ω₂; ...; n^{−β} ω_n]
of x.

3.2) Work out the problem of recovering the norm

f(x) = ∥Mx∥_p, M = Diag{i^α, i = 1, ..., n}, p ∈ [1, 2],

from observation (4.86) with X = {x : ∥Px∥_r ≤ 1}, P = Diag{i^γ, i = 1, ..., n}, r ∈ [2, ∞].

4.7.4.3 Suboptimal linear estimation
Exercise 4.11. [recovery of large-scale signals] Consider the problem of estimating the image Bx ∈ R^ν of a signal x ∈ X from observation ω = Ax + σξ ∈ R^m in the simplest case where X = {x ∈ R^n : x^T Sx ≤ 1} is an ellipsoid (so that S ≻ 0), the recovery error is measured in ∥·∥₂, and ξ ∼ N(0, I_m). In this case, Problem (4.12), which we solve when building the "presumably good" linear estimate, reduces to

Opt = min_{H,λ} { λ + σ² ∥H∥²_F : [ λS , B^T − A^T H ; B − H^T A , I_ν ] ⪰ 0 },        (4.87)

where ∥·∥_F is the Frobenius norm of a matrix. An optimal solution H∗ to this problem results in the linear estimate x̂_{H∗}(ω) = H∗^T ω satisfying the risk bound

Risk[x̂_{H∗} | X] := max_{x∈X} √(E{∥Bx − H∗^T(Ax + σξ)∥₂²}) ≤ √Opt.

Now, (4.87) is an efficiently solvable convex optimization problem. However, when the sizes m, n of the problem are large, solving the problem by standard optimization techniques could become prohibitively time-consuming. The goal of what follows is to develop a relatively cheap computational technique for finding a good enough suboptimal solution to (4.87). In the sequel, we assume that A ≠ 0; otherwise (4.87) is trivial.
Now, (4.87) is an efficiently solvable convex optimization problem. However, when the sizes m, n of the problem are large, solving the problem by standard optimization techniques could become prohibitively timeconsuming. The goal of what follows is to develop a relatively cheap computational technique for finding a good enough suboptimal solution to (4.87). In the sequel, we assume that A 6= 0; otherwise (4.87) is trivial. 1) Prove that problem (4.87) can be reduced to a similar problem with S = In and diagonal positive semidefinite matrix A, the reduction requiring several singular value decompositions and multiplications of matrices of the same sizes as those of A, B, and S.
2) By item 1, we can assume from the very beginning that S = I_n and A = Diag{α₁, ..., α_n} with 0 ≤ α₁ ≤ α₂ ≤ ... ≤ α_n. Passing in (4.87) from the variables λ, H to the variables τ = √λ, G = H^T, the problem becomes

Opt = min_{G,τ} { τ² + σ² ∥G∥²_F : ∥B − GA∥ ≤ τ },        (4.88)

where ∥·∥ is the spectral norm. Now consider the following construction:

• Consider a partition {1, ..., n} = I₀ ∪ I₁ ∪ ... ∪ I_K of the index set {1, ..., n} into consecutive segments such that (a) I₀ is the set of those i, if any, for which α_i = 0, and I_k ≠ ∅ when k ≥ 1; (b) for k ≥ 1 the ratios α_j/α_i, i, j ∈ I_k, do not exceed θ > 1 (θ is the parameter of our construction); and (c) for 1 ≤ k < k′ ≤ K, the ratios α_j/α_i, i ∈ I_k, j ∈ I_{k′}, are > θ. The recipe for building the partition is self-evident, and we clearly have K ≤ ln(ᾱ/α̲)/ln(θ) + 1, where ᾱ is the largest of the α_i, and α̲ is the smallest of those α_i which are positive.
• For 1 ≤ k ≤ K, we denote by i_k the first index in I_k, set α_k = α_{i_k}, n_k = Card I_k, and define A_k as the n_k × n_k diagonal matrix with diagonal entries α_i, i ∈ I_k.
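One greedy way to implement the "self-evident recipe" (a sketch on hypothetical α's; this variant guarantees (a), (b) and the stated bound on K) is to scan the sorted α's and open a new segment whenever the ratio to the first element of the current segment would exceed θ:

```python
import numpy as np

def partition(alpha, theta):
    """Greedy split of indices 0..n-1 (alpha sorted ascending) into I0 (zero
    entries) and consecutive segments I1..IK with within-segment ratio <= theta."""
    assert theta > 1
    I0 = [i for i, a in enumerate(alpha) if a == 0]
    segs, start = [], len(I0)
    for i in range(len(I0), len(alpha)):
        if alpha[i] > theta * alpha[start]:   # ratio would exceed theta: new segment
            segs.append(list(range(start, i)))
            start = i
    if start < len(alpha):
        segs.append(list(range(start, len(alpha))))
    return I0, segs

alpha = np.array([0.0, 0.0, 0.5, 0.6, 1.4, 2.0, 7.9])
theta = 2.0
I0, segs = partition(alpha, theta)
K = len(segs)
assert K <= np.log(alpha[-1] / alpha[len(I0)]) / np.log(theta) + 1
for I in segs:                                # condition (b): ratios within a segment
    assert alpha[I[-1]] <= theta * alpha[I[0]]
```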
Now, given a ν × n matrix C, let us specify C_k, 0 ≤ k ≤ K, as the ν × n_k submatrix of C comprised of the columns with indices from I_k, and consider the following parametric optimization problems:

Opt*_k(τ) = min_{G_k ∈ R^{ν×n_k}} { ∥G_k∥²_F : ∥B_k − G_k A_k∥ ≤ τ },        (P*_k[τ])
Opt_k(τ) = min_{G_k ∈ R^{ν×n_k}} { ∥G_k∥²_F : ∥B_k − α_k G_k∥ ≤ τ },        (P_k[τ])
where τ ≥ 0 is the parameter, and 1 ≤ k ≤ K. Justify the following simple observations:

2.1) G_k is feasible for (P_k[τ]) if and only if the matrix G*_k = α_k G_k A_k^{−1} is feasible for (P*_k[τ]), and ∥G*_k∥_F ≤ ∥G_k∥_F ≤ θ∥G*_k∥_F, implying that

Opt*_k(τ) ≤ Opt_k(τ) ≤ θ² Opt*_k(τ);

2.2) Problems (P_k[τ]) are easy to solve: if B_k = U_k D_k V_k^T is the singular value decomposition of B_k and σ_{kℓ}, 1 ≤ ℓ ≤ ν_k := min[ν, n_k], are the diagonal entries of D_k, then an optimal solution to (P_k[τ]) is

Ĝ_k[τ] = [α_k]^{−1} U_k D_k[τ] V_k^T,

where D_k[τ] is the diagonal matrix obtained from D_k by truncating the diagonal entries: σ_{kℓ} ↦ [σ_{kℓ} − τ]₊ (from now on, a₊ = max[a, 0], a ∈ R). The optimal value in (P_k[τ]) is

Opt_k(τ) = [α_k]^{−2} Σ_{ℓ=1}^{ν_k} [σ_{kℓ} − τ]²₊.
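The closed-form recipe of 2.2) is soft-thresholding of the singular values of B_k; a small numerical sketch on hypothetical data, checking feasibility and the value Opt_k(τ):

```python
import numpy as np

rng = np.random.default_rng(3)
nu, nk, alpha_k, tau = 4, 3, 0.7, 0.5          # hypothetical sizes and data
Bk = rng.standard_normal((nu, nk))

# solve (P_k[tau]) by soft-thresholding the singular values of B_k
U, s, Vt = np.linalg.svd(Bk, full_matrices=False)
s_trunc = np.maximum(s - tau, 0.0)             # sigma -> [sigma - tau]_+
G_hat = (U * s_trunc) @ Vt / alpha_k
opt_val = (s_trunc ** 2).sum() / alpha_k ** 2  # Opt_k(tau)

# feasibility of the constraint and match with the claimed optimal value
assert np.linalg.norm(Bk - alpha_k * G_hat, 2) <= tau + 1e-10
assert np.isclose((G_hat ** 2).sum(), opt_val)
```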
2.3) If (τ, G) is a feasible solution to (4.88), then τ ≥ τ̲ := ∥B₀∥, and the matrices G_k, 1 ≤ k ≤ K, are feasible solutions to the problems (P*_k[τ]), implying that

Σ_k Opt*_k(τ) ≤ ∥G∥²_F.

And vice versa: if τ ≥ τ̲, the G_k, 1 ≤ k ≤ K, are feasible solutions to the problems (P*_k[τ]), and

K₊ = K when I₀ = ∅, K₊ = K + 1 when I₀ ≠ ∅,

then the matrix G = [0_{ν×n₀}, G₁, ..., G_K] and τ₊ = √K₊ τ form a feasible solution to (4.88).

Extract from these observations that if τ∗ is an optimal solution to the convex optimization problem

min_τ { θ²τ² + σ² Σ_{k=1}^K Opt_k(τ) : τ ≥ τ̲ }        (4.89)

and the G_{k,∗} are optimal solutions to the problems (P_k[τ∗]), then the pair

τ̂ = √K₊ τ∗, Ĝ = [0_{ν×n₀}, G*_{1,∗}, ..., G*_{K,∗}]    [G*_{k,∗} = α_k G_{k,∗} A_k^{−1}]

is a feasible solution to (4.88), and the value of the objective of the latter problem at this feasible solution is within the factor max[K₊, θ²] of the true optimal value Opt of this problem. As a result, Ĝ gives rise to a linear estimate with risk on X which is within the factor max[√K₊, θ] of the risk √Opt of the "presumably
good" linear estimate yielded by an optimal solution to (4.87). Notice that:

• After carrying out the singular value decompositions of the matrices B_k, 1 ≤ k ≤ K, specifying τ∗ and the G_{k,∗} requires solving a univariate convex minimization problem with an easy-to-compute objective, so the problem can be easily solved, e.g., by bisection;
• The computationally cheap suboptimal solution we end up with is not that bad, since K is "moderate," just logarithmic in the condition number ᾱ/α̲ of A.

Your next task is as follows:

3) To get an idea of the performance of the proposed synthesis of "suboptimal" linear estimation, run numerical experiments as follows:

• select some n and generate at random the n × n data matrices S, A, B;
• for "moderate" values of n, compute both the linear estimate yielded by the optimal solution to (4.12)¹¹ and the suboptimal estimate yielded by the above construction. Compare their risk bounds and the associated CPU times. For "large" n, where solving (4.12) becomes prohibitively time-consuming, compute only the suboptimal estimate in order to get an impression of how the corresponding CPU time grows with n.

Recommended setup:

• range of n: 50, 100 ("moderate" values), 1000, 2000 ("large" values);
• range of σ: {1.0, 0.01, 0.0001};
• generation of S, A, B: generate the matrices at random according to S = U_S Diag{1, 2, ..., n} U_S^T, A = U_A Diag{µ₁, ..., µ_n} V_A^T, B = U_B Diag{µ₁, ..., µ_n} V_B^T, where U_S, U_A, V_A, U_B, V_B are random orthogonal n × n matrices, and the µ_i form a geometric progression with µ₁ = 0.01 and µ_n = 1.

You could run the above construction for several values of θ and select the best, in terms of its risk bound, of the resulting suboptimal estimates.

4.11.A. Simple case. There is a trivial case where (4.88) is really easy; this is the case where the right orthogonal factors in the singular value decompositions of A and B are the same, that is, when B = W F V^T, A = U D V^T with orthogonal n × n matrices W, U, V and diagonal F, D.
This very special case is in fact of some importance: it covers the denoising situation where B = A, so that our goal is to denoise our observation of Ax given the a priori information x ∈ X

¹¹ When X is an ellipsoid, the semidefinite relaxation bound on the maximum of a quadratic form over x ∈ X is exact, so that we are in the case when an optimal solution to (4.12) yields the best, in terms of risk on X, linear estimate.
on x. In this situation, setting G = W^T H^T U, problem (4.88) becomes

Opt = min_G { ∥F − GD∥² + σ² ∥G∥²_F }.        (4.90)

Now comes the concluding part of the exercise:

4) Prove that in the situation in question an optimal solution G∗ to (4.90) can be selected to be diagonal, with diagonal entries γ_i, 1 ≤ i ≤ n, yielded by the optimal solution to the optimization problem

Opt = min_γ { f(γ) := max_{i≤n} (φ_i − γ_i δ_i)² + σ² Σ_{i=1}^n γ_i² }    [φ_i = F_ii, δ_i = D_ii].
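A numerical sketch of how this diagonal problem can be solved (with hypothetical data): at a fixed level t of the max-term, the inner minimization over γ has the closed-form solution γ_i = sign(φ_i)[|φ_i| − t]₊/δ_i, after which it remains to minimize a univariate convex function of t (here by a fine grid; bisection on the derivative would also do).

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 8, 0.3
phi = rng.standard_normal(n)                      # phi_i = F_ii (hypothetical data)
delta = rng.uniform(0.5, 2.0, n)                  # delta_i = D_ii > 0

def g(t):
    """Objective after the inner minimization at a fixed level t of the max-term:
       the minimizing gamma is gamma_i = sign(phi_i) [|phi_i| - t]_+ / delta_i."""
    gam = np.sign(phi) * np.maximum(np.abs(phi) - t, 0.0) / delta
    return t ** 2 + sigma ** 2 * (gam ** 2).sum()

# g is convex and univariate, and the minimizer lies in [0, max_i |phi_i|]
ts = np.linspace(0.0, np.abs(phi).max(), 5001)
t_star = ts[np.argmin([g(t) for t in ts])]
gam_star = np.sign(phi) * np.maximum(np.abs(phi) - t_star, 0.0) / delta

# sanity: f(gamma_star) is no worse than the profiled objective at t_star
f_val = np.max((phi - gam_star * delta) ** 2) + sigma ** 2 * (gam_star ** 2).sum()
assert f_val <= g(t_star) + 1e-10
```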
Exercise 4.12. [image reconstruction: follow-up to Exercise 4.11] A grayscale image can be represented by an m × n matrix x = [x_pq], 0 ≤ p < m, 0 ≤ q < n.

Let α > 0 be such that Z ⊂ ∆[α]. Prove that

X⁺ = { [x; z] : W[x; z] = 0, [x; z] ∈ Conv{v_ij = [g_i; h_j], 1 ≤ i ≤ n, 0 ≤ j ≤ p} },        (!)

where the g_i are the standard basic orths in R^n, h₀ = 0 ∈ R^p, and the α_j h_j, 1 ≤ j ≤ p, are the standard basic orths in R^p.

6.2) Derive from 5.1 that the efficiently computable convex function

Φ_SA(H) = inf_{C ∈ R^{(p+q)×ν}} max_{i,j} ∥(B − H^T A)g_i + C^T W v_ij∥

is an upper bound on Φ(H). In the sequel, we refer to Φ_SA(H) as the Sherali-Adams bound [214].

4.17.G. Combined bound. We can combine the above bounds, specifically, as follows:

7) Prove that the efficiently computable convex function

Φ_LBS(H) = inf_{(Λ±,C±,µ,µ₊)∈R} max_{i,j} G_ij(H, Λ±, C±, µ, µ₊),        (#)

where

G_ij(H, Λ±, C±, µ, µ₊) := −µ^T F g_i + µ₊^T W v_ij + min_t { ∥t∥ : t ≥ Max[ (B − H^T A − Λ₊F)g_i + C₊^T W v_ij , (−B + H^T A − Λ₋F)g_i + C₋^T W v_ij ] },
R = { (Λ±, C±, µ, µ₊) : Λ± ∈ R₊^{ν×(p+2q)}, C± ∈ R^{(p+q)×ν}, µ ∈ R₊^{p+2q}, µ₊ ∈ R₊^{p+q} },

is an upper bound on Φ(H), and that this Combined bound is at least as good as any of the Lagrange, Basic, or Sherali-Adams bounds.
4.17.H. How to select α? A shortcoming of the Sherali-Adams and Combined upper bounds on Φ is the presence of a degree of freedom, the positive vector α. Intuitively, we would like to select α so as to make the simplex ∆[α] ⊃ Z "as small as possible." It is unclear, however, what "as small as possible" means in our context, not to speak of how to select the required α after we agree on how we measure the "size" of ∆[α]. It turns out, however, that we can efficiently select the α resulting in the smallest volume of ∆[α].

8) Prove that minimizing the volume of ∆[α] ⊃ Z over α reduces to solving the following convex optimization problem:

inf_{α,u,v} { −Σ_{s=1}^p ln(α_s) : 0 ≤ α ≤ −v, Eu + Gv ≤ 1_n }.        (∗)
9) Run numerical experiments to evaluate the quality of the above bounds. It makes sense to generate problems where we know in advance the actual value of Φ, e.g., to take

X = {x ∈ ∆_n : x ≥ a}        (a)

with a ≥ 0 such that Σ_i a_i ≤ 1. In this case, we can easily list the extreme points of X (how?) and thus can easily compute Φ(H). In your experiments, you can use the matrices stemming from "presumably good" linear estimates yielded by the optimization problems

Opt = min_{H,Υ,Θ} { Φ(H) + φ_R(λ[Υ]) + Γ_X(Θ) : Υ = {Υ_ℓ ⪰ 0, ℓ ≤ L},
    [ Θ , ½ HM ; ½ M^T H^T , Σ_ℓ S*_ℓ[Υ_ℓ] ] ⪰ 0 }        (4.99)

where

Γ_X(Θ) = (1/K) max_{x∈X} Tr(Diag{Ax}Θ)

(see Corollary 4.12), with the actual Φ (which is available for our X), or the upper bounds on Φ (Lagrange, Basic, Sherali-Adams, and Combined) in the role of Φ. Note that it may make sense to test seven bounds rather than just four. Indeed, with additional constraints on the optimization variables in (#), we can get, besides the "pure" Lagrange, Basic, and Sherali-Adams bounds and their "three-component combination" (the Combined bound), pairwise combinations of the pure bounds as well. For example, to combine the Lagrange and Sherali-Adams bounds, it suffices to add to (#) the constraints Λ± = 0.

Exercise 4.18. The exercise to follow deals with recovering discrete probability distributions in the Wasserstein norm. The Wasserstein distance between probability distributions is extremely popular today in Statistics; it is defined as follows.¹⁷

¹⁷ The distance we consider stems from the Wasserstein 1-distance between discrete probability distributions. This is a particular case of the general Wasserstein p-distance between (not necessarily discrete) probability distributions.

Consider discrete random variables taking values in a finite observation space Ω = {1, 2, ..., n} which is equipped with
the metric {d_ij : 1 ≤ i, j ≤ n} satisfying the standard axioms.¹⁸ As always, we identify probability distributions on Ω with n-dimensional probabilistic vectors p = [p₁; ...; p_n], where p_i is the probability mass assigned by p to i ∈ Ω. The Wasserstein distance between probability distributions p and q is defined as

W(p, q) = min_{x=[x_ij]} { Σ_{ij} d_ij x_ij : x_ij ≥ 0, Σ_j x_ij = p_i, Σ_i x_ij = q_j, 1 ≤ i, j ≤ n }.        (4.100)
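Since (4.100) is a plain transportation LP, it can be computed with any LP solver. A sketch using scipy.optimize.linprog on a hypothetical 3-point space whose metric is that of the points 0, 1, 2 on the line (for this line metric, W(p, q) is the area between the two cumulative distribution functions):

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical 3-point space: points 0, 1, 2 on the line, d_ij = |i - j|
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

n = len(p)
A_eq = np.zeros((2 * n, n * n))        # variables x_ij, flattened row-major
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j x_ij = p_i
    A_eq[n + i, i::n] = 1.0            # sum_k x_ki = q_i
b_eq = np.concatenate([p, q])

res = linprog(c=d.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
# area between the CDFs [0.5, 0.8, 1.0] and [0.2, 0.5, 1.0]: 0.3 + 0.3 = 0.6
assert np.isclose(res.fun, 0.6)
```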
In other words, one may think of p and q as two distributions of unit mass on the points of Ω, and consider the mass transport problem of redistributing the mass assigned to points by the distribution p so as to get the distribution q. Denoting by x_ij the mass moved from point i to point j, the constraints Σ_j x_ij = p_i say that the total mass taken from point i is exactly p_i, the constraints Σ_i x_ij = q_j say that as a result of the transportation, the mass at point j will be exactly q_j, and the constraints x_ij ≥ 0 reflect the fact that transport of a negative mass is forbidden. Assuming that the cost of transporting a mass µ from point i to point j is d_ij µ, the Wasserstein distance W(p, q) between p and q is the cost of the cheapest transportation plan which converts p into q. As compared to other natural distances between discrete probability distributions, like ∥p − q∥₁, the advantage of the Wasserstein distance is that it allows us to model the situation (indeed arising in some applications) where the effect, measured in terms of the intended application, of changing the probability masses of points from Ω is small when the probability mass of a point is redistributed among close points.¹⁹

Now comes the first part of the exercise:

1) Let p, q be two probability distributions. Prove that

W(p, q) = max_{f∈R^n} { Σ_i f_i (p_i − q_i) : |f_i − f_j| ≤ d_ij ∀ i, j }.        (4.101)
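The maximization problem (4.101) is itself a small LP once each constraint |f_i − f_j| ≤ d_ij is split into two linear inequalities. A sketch (the helper name is hypothetical), checked on a 2-point space where moving unit mass across distance 1 clearly costs 1:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_dual(p, q, d):
    """Value of (4.101): max f^T (p - q) over |f_i - f_j| <= d_ij,
    with each absolute-value constraint split into two linear ones."""
    n = len(p)
    rows, rhs = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                r = np.zeros(n)
                r[i], r[j] = 1.0, -1.0    # f_i - f_j <= d_ij
                rows.append(r)
                rhs.append(d[i, j])
    res = linprog(c=-(p - q), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=(None, None))
    return -res.fun

# 2-point sanity check: moving unit mass across distance 1 costs 1
d = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.isclose(wasserstein_dual(np.array([1.0, 0.0]),
                                   np.array([0.0, 1.0]), d), 1.0)
```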
Treating a vector f ∈ R^n as a function on Ω, the value of the function at a point i ∈ Ω being f_i, (4.101) admits a very transparent interpretation: the Wasserstein distance W(p, q) between probability distributions p and q is the maximum of the inner products of p − q with functions f on Ω which are Lipschitz continuous, with constant 1, w.r.t. the metric d. When f is shifted by a constant, the inner product remains intact (since p − q is a vector with zero sum of entries). Therefore, denoting by

D = max_{i,j} d_ij

the d-diameter of Ω, we have

W(p, q) = max_f { f^T(p − q) : |f_i − f_j| ≤ d_ij, |f_i| ≤ D/2 ∀ i, j },        (4.102)

¹⁸ Namely, positivity: d_ij = d_ji ≥ 0, with d_ij = 0 if and only if i = j; and the triangle inequality: d_ik ≤ d_ij + d_jk for all triples i, j, k.
¹⁹ In fact, the Wasserstein distance shares this property with some other distances between distributions used in Probability Theory, such as the Skorohod, Prokhorov, or Ky Fan distances. What makes the Wasserstein distance so "special" is its representation (4.100) as the optimal value of a Linear Programming problem, responsible for efficient computational handling of this distance.
the reason being that every function f on Ω which is Lipschitz continuous, with constant 1, w.r.t. the metric d can be shifted by a constant to ensure ∥f∥_∞ ≤ D/2 (look at what happens when the shift ensures that min_i f_i = −D/2). Representation (4.102) shows that the Wasserstein distance is generated by a norm on R^n: for all probability distributions on Ω one has W(p, q) = ∥p − q∥_W, where ∥·∥_W is the Wasserstein norm on R^n given by

∥x∥_W = max_{f∈B∗} f^T x,
B∗ = {u ∈ R^n : u^T S_ij u ≤ 1, 1 ≤ i ≤ j ≤ n},
S_ij = d_ij^{−2} [e_i − e_j][e_i − e_j]^T for 1 ≤ i < j ≤ n, S_ii = 4D^{−2} e_i e_i^T for 1 ≤ i ≤ n,        (4.103)

where e₁, ..., e_n are the standard basic orths in R^n.

2) Let us equip the n-element set Ω = {1, ..., n} with the metric d_ij = 2 for i ≠ j, d_ij = 0 for i = j. What is the associated Wasserstein norm?
Note that the set B∗ in (4.103) is the unit ball of the norm conjugate to ∥·∥_W, and, as we see, this set is a basic ellitope. As a result, the estimation machinery developed in Chapter 4 is well suited for recovering discrete probability distributions in the Wasserstein norm. This observation motivates the concluding part of the exercise:

3) Consider the following situation: given an m × n column-stochastic matrix A and a ν × n column-stochastic matrix B, we observe K samples ω_k, 1 ≤ k ≤ K, independent of each other, drawn from the discrete probability distribution Ax ∈ ∆_m (as always, ∆_ν ⊂ R^ν is the probabilistic simplex in R^ν), with x ∈ ∆_n an unknown "signal" underlying the observations; realizations of ω_k are identified with the respective vertices f₁, ..., f_m of ∆_m. Our goal is to use the observations to estimate the distribution Bx ∈ ∆_ν. We are given a metric d on the set Ω_ν = {1, 2, ..., ν} of indices of the entries in Bx, and measure the recovery error in the Wasserstein norm ∥·∥_W associated with d. Build an explicit convex optimization problem responsible for a "presumably good" linear recovery of the form

x̂_H = (1/K) H^T Σ_{k=1}^K ω_k.

Exercise 4.19. [follow-up to Exercise 4.17] In Exercise 4.17, we built a "presumably good" linear estimate x̂_{H∗}(·), see (4.98), yielded by the H-component H∗ of an optimal solution to problem (4.99). The optimal value Opt in this problem is an upper bound on the risk Risk_{∥·∥}[x̂_{H∗} | X] (here and in what follows we use the same notation and impose the same assumptions as in Exercise 4.17). Recall that Risk_{∥·∥} is the worst, w.r.t. signals x ∈ X underlying our observations, expected norm of the recovery error. It makes sense also to provide upper bounds on the probabilities of deviations of the error's magnitude from its expected value, and this is the problem
we consider here; cf. Exercise 4.14.

1) Prove the following

Lemma 4.33. Let Q ∈ S^m_+, let K be a positive integer, and let p ∈ ∆_m. Let, further, ω^K = (ω₁, ..., ω_K) be i.i.d. random vectors, with ω_k taking the value e_j (e₁, ..., e_m being the standard basic orths in R^m) with probability p_j. Finally, let ξ_k = ω_k − E{ω_k} = ω_k − p, and ξ̂ = (1/K) Σ_{k=1}^K ξ_k. Then for every ε ∈ (0, 1) it holds

Prob{ ∥ξ̂∥₂² ≤ 12 ln(2m/ε)/K } ≥ 1 − ε.

Hint: use the classical Bernstein inequality: let X₁, ..., X_K be independent zero mean random variables taking values in [−M, M], and let σ_k² = E{X_k²}. Then for every t ≥ 0 one has

Prob{ Σ_{k=1}^K X_k ≥ t } ≤ exp( − t² / (2[Σ_k σ_k² + (1/3)Mt]) ).
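A Monte Carlo sketch of Lemma 4.33 (the distribution p and the sizes below are hypothetical): draw K basic orths from the distribution p, form ξ̂, and count how often ∥ξ̂∥₂² stays below 12 ln(2m/ε)/K.

```python
import numpy as np

rng = np.random.default_rng(7)
m, K, eps, trials = 5, 200, 0.1, 2000           # hypothetical sizes
p = np.array([0.4, 0.25, 0.15, 0.12, 0.08])
bound = 12 * np.log(2 * m / eps) / K            # the bound of Lemma 4.33

hits = 0
for _ in range(trials):
    counts = rng.multinomial(K, p)              # K i.i.d. basic orths, summed
    xi_hat = counts / K - p                     # hat{xi} = (1/K) sum_k (omega_k - p)
    hits += (xi_hat @ xi_hat) <= bound          # ||hat{xi}||_2^2 vs. the bound
assert hits / trials >= 1 - eps
```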
2) Consider the situation described in Exercise 4.17 with X = ∆_n. Specifically:

• Our observation is a sample ω^K = (ω₁, ..., ω_K) with i.i.d. components ω_k ∼ Ax, where x ∈ ∆_n is an unknown n-dimensional probabilistic vector, A is an m × n stochastic matrix (a nonnegative matrix with unit column sums), and ω ∼ Ax means that ω is a random vector taking the value e_i (the e_i being the standard basic orths in R^m) with probability [Ax]_i, 1 ≤ i ≤ m.
• Our goal is to recover Bx in a given norm ∥·∥; here B is a given ν × n matrix.
• We assume that the unit ball B∗ of the norm ∥·∥∗ conjugate to ∥·∥ is a spectratope:

B∗ = {u = My, y ∈ Y}, Y = {y ∈ R^N : ∃r ∈ R : S_ℓ²[y] ⪯ r_ℓ I_{f_ℓ}, ℓ ≤ L}.

Our goal is to build a presumably good linear estimate

x̂_H(ω^K) = H^T ω̂[ω^K], ω̂[ω^K] = (1/K) Σ_k ω_k.
Prove the following
Proposition 4.34. Let H, Θ, Υ be a feasible solution to the convex optimization problem

min_{H,Θ,Υ} { Φ(H) + φ_R(λ[Υ]) + Γ(Θ)/K : Υ = {Υ_ℓ ⪰ 0, ℓ ≤ L},
    [ Θ , ½ HM ; ½ M^T H^T , Σ_ℓ S*_ℓ[Υ_ℓ] ] ⪰ 0 },        (4.104)

where

Φ(H) = max_{j≤n} ∥Col_j[B − H^T A]∥,  Γ(Θ) = max_{x∈∆_n} Tr(Diag{Ax}Θ).

Then
(i) For every x ∈ ∆_n it holds

E_{ω^K} {∥Bx − x̂_H(ω^K)∥} ≤ Φ(H) + 2K^{−1/2} √(φ_R(λ[Υ]) Γ(Θ)) ≤ Φ(H) + φ_R(λ[Υ]) + Γ(Θ)/K.        (4.105)

(ii) Let ε ∈ (0, 1). For every x ∈ ∆_n, with

γ = 2√(3 ln(2m/ε)),

one has

Prob_{ω^K} { ∥Bx − x̂_H(ω^K)∥ ≤ Φ(H) + 2γK^{−1/2} √(φ_R(λ[Υ]) ∥Θ∥_{Sh,∞}) } ≥ 1 − ε.        (4.106)
3) Look at what happens when ν = m = n, A and B are the unit matrices, and H = I, i.e., let us understand how good the recovery of a discrete probability distribution by the empirical distribution of a K-element i.i.d. sample drawn from the original distribution is. Take as ∥·∥ the norm ∥·∥_p with p ∈ [1, 2], and show that for every x ∈ ∆_n and every ε ∈ (0, 1) one has

E {∥x − x̂_I(ω^K)∥_p} ≤ n^{1/p − 1/2} K^{−1/2},
Prob { ∥x − x̂_I(ω^K)∥_p ≤ 2√(3 ln(2n/ε)) n^{1/p − 1/2} K^{−1/2} } ≥ 1 − ε.
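A quick empirical check of the expectation bound above (with a hypothetical uniform x and p = 1): the mean ℓ₁-error of the empirical distribution stays below n^{1/p − 1/2} K^{−1/2} = √(n/K).

```python
import numpy as np

rng = np.random.default_rng(8)
n, K, trials = 10, 400, 1000                    # hypothetical sizes
x = np.full(n, 1.0 / n)                         # a distribution on n points; p = 1

errs = []
for _ in range(trials):
    emp = rng.multinomial(K, x) / K             # empirical distribution of K draws
    errs.append(np.abs(emp - x).sum())          # l_1 recovery error

# the claimed bound with p = 1: n^{1/p - 1/2} K^{-1/2} = sqrt(n / K)
assert np.mean(errs) <= np.sqrt(n / K)
```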
Exercise 4.20. [follow-up to Exercise 4.17] Consider the following situation. A retailer sells n items by offering customers, via the internet, bundles of m < n items, so that an offer is an m-element subset B of the set S = {1, ..., n} of items. A customer has personal preferences represented by a subset P of S, the customer's preference set. We assume that if an offer B intersects the preference set P of a customer, the latter buys an item drawn at random from the uniform distribution on B ∩ P, and if B ∩ P = ∅, the customer declines the offer. In the pilot stage we are interested in, the seller learns the market by making offers to K customers. Specifically, the seller draws the k-th customer, k ≤ K, at random from the uniform distribution on the population of customers, and makes the selected customer an offer drawn at random from the uniform distribution on the set S_{m,n} of all m-item offers. What is observed in the k-th experiment is the item, if any, bought by the customer, and we want to make statistical inferences from these observations.

The outlined observation scheme can be formalized as follows. Let S be the set of all subsets of the n-element set, so that S is of cardinality N = 2^n. The population of customers induces a probability distribution p on S: for P ∈ S, p_P is the fraction of customers whose preference set is P; we refer to p as the preference distribution. An outcome of a single experiment can be represented by a pair (ι, B), where B ∈ S_{m,n} is the offer used in the experiment, and ι is either 0 ("nothing is bought," P ∩ B = ∅) or a point from P ∩ B, the item which was bought, when P ∩ B ≠ ∅. Note that A_P is a probability distribution on the M-element set Ω = {(ι, B)} of possible outcomes, where M = (m + 1)·C(n, m) with C(n, m) the binomial coefficient. As a result, our observation scheme is fully specified by an M × N column-stochastic matrix A, known to us, with the
columns AP indexed by P ∈ S. When a customer is drawn at random from the uniform distribution on the population of customers, the distribution of the outcome clearly is Ap, where p is the (unknown) preference distribution. Our inferences should be based on the Kelement sample ω K = (ω1 , ..., ωK ), with ω1 , .., ωK drawn, independently of each other, from the distribution Ap. Now we can pose various inference problems, e.g., that of estimating p. We, however, intend to focus on a simpler problem—one of recovering Ap. In terms of our story, this makes sense: when we know Ap, we know, e.g., what the probability is for every offer to be “successful” (something indeed is bought) and/or to result in a specific profit, etc. With this knowledge at hand, the seller can pass from a “blind” offering policy (drawing an offer at random from the uniform distribution on the set Sm,n ) to something more rewarding. Now comes the exercise: 1. Use the results of Exercise 4.17 to build a “presumably good” linear estimate # " K 1 X K T ωk x bH (ω ) = H K k=1
of Ap (as always, we encode observations ω, which are elements of the M element set Ω, by standard basic orths in RM ). As the norm k·k quantifying the recovery error, use k · k1 and/or k · k2 . In order to avoid computational difficulties, use small m and n (e.g., m = 3 and n = 5). Compare your results with those for the PK 1 “straightforward” estimate K k=1 ωk (the empirical distribution of ω ∼ Ap). 2. Assuming that the “presumably good” linear estimate outperforms the straightforward one, how could this phenomenon be explained? Note that we have no nontrivial a priori information on p! Exercise 4.21. [Poisson Imaging] The Poisson Imaging Problem is to recover an unknown signal observed via the Poisson observation scheme. More specifically, assume that our observation is a realization of random vector ω ∈ Rm + with Poisson entries ωi = Poisson([Ax]i ) independent of each other. Here A is a given entrywise nonnegative m × n matrix, and x is an unknown signal known to belong to a given compact convex subset X of Rn+ . Our goal is to recover in a given norm k · k the linear image Bx of x, where B is a given ν × n matrix. We assume in the sequel that X is a subset cut off the ndimensional probabilistic simplex ∆n by a collection of linear equality and inequality constraints. The assumption X ⊂ ∆n isPnot too restrictive. Indeed, assume that we know in advance a linear inequality i αi xi ≤ 1 with P positive coefficients which is valid on X .20 Introducing slack variable s given by i αi xi + s = 1 and passing from signal x to the new signal [α1 x1 ; ...; αn xn ; s], after a straightforward modification of matrices A and B, we arrive at the situation where X is a subset of the probabilistic simplex. Our goal in the sequel is to build a presumably good linear estimate x bH (ω) = H T ω of Bx. 
^20 For example, in PET—see Section 2.4.3.2—where x is the density of a radioactive tracer injected into the patient undergoing the PET procedure, we know in advance the total amount Σ_i v_i x_i of the tracer, v_i being the volume of the i-th voxel.
As in Exercise 4.17, we start with upper-bounding the risk of a linear
346
CHAPTER 4
estimate. When representing ω = Ax + ξ_x, we arrive at zero-mean observation noise ξ_x with entries [ξ_x]_i = ω_i − [Ax]_i independent of each other and covariance matrix Diag{Ax}. We now can upper-bound the risk of a linear estimate x̂_H(·) in the same way as in Exercise 4.17. Specifically, denoting by Π_X the set of all diagonal matrices Diag{Ax}, x ∈ X, and by P_{i,x} the Poisson distribution with parameter [Ax]_i, we have
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H|\mathcal{X}]
= \sup_{x\in\mathcal{X}}\mathbf{E}_{\omega\sim P_{1,x}\times\cdots\times P_{m,x}}\left\{\|Bx - H^T\omega\|\right\}
= \sup_{x\in\mathcal{X}}\mathbf{E}_{\xi_x}\left\{\|[B - H^TA]x - H^T\xi_x\|\right\}
\le \underbrace{\sup_{x\in\mathcal{X}}\|[B - H^TA]x\|}_{\Phi(H)}
+ \underbrace{\sup_{\xi:\,\mathrm{Cov}[\xi]\in\Pi_{\mathcal{X}}}\mathbf{E}_{\xi}\left\{\|H^T\xi\|\right\}}_{\Psi^{\mathcal{X}}(H)}.
\]
In order to build a presumably good linear estimate, it suffices to build efficiently computable upper bounds Φ̄(H) on Φ(H) and Ψ̄^X(H) on Ψ^X(H), convex in H, and then take as H an optimal solution to the convex optimization problem
\[
\mathrm{Opt} = \min_{H}\left[\overline{\Phi}(H) + \overline{\Psi}^{\mathcal{X}}(H)\right].
\]
As in Exercise 4.17, assume from now on that ‖·‖ is an absolute norm, and the unit ball B_* of the conjugate norm is a spectratope:
\[
\mathcal{B}_* := \{u : \|u\|_*\le 1\} = \{u : \exists (r\in\mathcal{R},\, y) : u = My,\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L\}.
\]
Observe that
• In order to build Φ̄, we can use exactly the same techniques as those developed in Exercise 4.17. Indeed, as far as building Φ̄ is concerned, the only difference with the situation of Exercise 4.17 is that in the latter A was a column-stochastic matrix, while now A is just an entrywise nonnegative matrix. Note, however, that when upper-bounding Φ in Exercise 4.17, we never used the fact that A is column-stochastic.
• In order to upper-bound Ψ^X, we can use the bound (4.40) of Exercise 4.17.
The bottom line is that in order to build a presumably good linear estimate, we need to solve the convex optimization problem
\[
\mathrm{Opt} = \min_{H,\Upsilon,\Theta}\left\{\overline{\Phi}(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Gamma_{\mathcal{X}}(\Theta):
\begin{array}{l}
\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[3pt]
\left[\begin{array}{cc}\Theta & \tfrac12 HM\\[2pt] \tfrac12 M^TH^T & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell]\end{array}\right]\succeq0
\end{array}\right\} \tag{$P$}
\]
where
\[
\Gamma_{\mathcal{X}}(\Theta) = \max_{x\in\mathcal{X}}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)
\]
(cf. problem (4.99)), with Φ̄ yielded by any construction from Exercise 4.17, e.g., the least conservative Combined upper bound on Φ. What in our present situation differs significantly from the situation of Exercise 4.17 is the bounds on the probabilities of large deviations (for the Discrete o.s., established in Exercise 4.19). The goal of what follows is to establish these bounds for
347
SIGNAL RECOVERY BY LINEAR ESTIMATION
Poisson Imaging. Here is what you are supposed to do:
1. Let ω ∈ R^m be a random vector with independent entries ω_i ∼ Poisson(µ_i), and let µ = [µ₁;...;µ_m]. Prove that whenever h ∈ R^m, γ > 0, and δ ≥ 0, one has
\[
\ln\mathrm{Prob}\{h^T\omega > h^T\mu + \delta\} \le \sum_i\left[\exp\{\gamma h_i\}-1\right]\mu_i - \gamma h^T\mu - \gamma\delta. \tag{4.107}
\]
2. Taking for granted (or see, e.g., [178]) that $e^x - x - 1 \le \frac{x^2}{2(1-x/3)}$ when $x < 3$, prove that in the situation of item 1 one has for t > 0:
\[
0\le\gamma<\frac{3}{\|h\|_\infty} \;\Rightarrow\; \ln\mathrm{Prob}\{h^T\omega > h^T\mu + t\} \le \frac{\gamma^2\sum_i h_i^2\mu_i}{2(1-\gamma\|h\|_\infty/3)} - \gamma t. \tag{4.108}
\]
Derive from the latter fact that
\[
\mathrm{Prob}\left\{h^T\omega > h^T\mu + \delta\right\} \le \exp\left\{-\frac{\delta^2}{2\left[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3\right]}\right\}, \tag{4.109}
\]
and conclude that
\[
\mathrm{Prob}\left\{|h^T\omega - h^T\mu| > \delta\right\} \le 2\exp\left\{-\frac{\delta^2}{2\left[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3\right]}\right\}. \tag{4.110}
\]
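Bound (4.110) is easy to probe numerically. The following Monte Carlo sketch (ours, not the book's; a Knuth-type Poisson sampler and toy values of h, µ, δ) checks that the empirical two-sided tail stays below the right-hand side of (4.110):

```python
# Monte Carlo sanity check (ours) of the Bernstein-type bound (4.110) for a
# linear form of independent Poisson entries. All numbers are toy values.
import math, random

random.seed(0)

def poisson(lam):
    # Knuth's algorithm; adequate for the small rates used here
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mu = [0.5, 1.5, 2.0, 3.0]
h = [1.0, -0.5, 0.25, 0.75]
delta = 4.0
mean = sum(hi * mi for hi, mi in zip(h, mu))
# right-hand side of (4.110)
bound = 2.0 * math.exp(-delta**2 / (2.0 * (sum(hi*hi*mi for hi, mi in zip(h, mu))
                                           + max(abs(hi) for hi in h) * delta / 3.0)))
N = 20000
hits = sum(1 for _ in range(N)
           if abs(sum(hi * poisson(mi) for hi, mi in zip(h, mu)) - mean) > delta)
freq = hits / N
assert freq <= bound + 0.01, (freq, bound)   # empirical tail respects (4.110)
```

On this toy instance the bound is quite loose (as deviation bounds of this kind usually are), but it is never violated.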
3. Extract from (4.110) the following

Proposition 4.35. In the situation and under the assumptions of Exercise 4.21, let Opt be the optimal value, and (H, Υ, Θ) be a feasible solution to problem (P). Whenever x ∈ X and ε ∈ (0,1), denoting by P_x the distribution of observations stemming from x (i.e., the distribution of the random vector ω with independent entries ω_i ∼ Poisson([Ax]_i)), one has
\[
\mathbf{E}_{\omega\sim P_x}\left\{\|Bx-\widehat{x}_H(\omega)\|\right\}
\le \overline{\Phi}(H) + 2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\,\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}
\le \overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) \tag{4.111}
\]
and
\[
\mathrm{Prob}_{\omega\sim P_x}\left\{\|Bx-\widehat{x}_H(\omega)\| \le \overline{\Phi}(H)
+ 4\sqrt{\left[\tfrac{2}{9}\ln^2(2m/\epsilon)\mathrm{Tr}(\Theta) + \ln(2m/\epsilon)\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)\right]\phi_{\mathcal{R}}(\lambda[\Upsilon])}\right\} \ge 1-\epsilon. \tag{4.112}
\]
Note that in the case of [Ax]_i ≥ 1 for all x ∈ X and all i we have Tr(Θ) ≤ Tr(Diag{Ax}Θ), so that in this case the P_x-probability of the event
\[
\left\{\omega : \|Bx-\widehat{x}_H(\omega)\| \le \overline{\Phi}(H) + O(1)\ln(2m/\epsilon)\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma_{\mathcal{X}}(\Theta)}\right\}
\]
is at least 1 − ε.

4.7.6 Numerical lower-bounding of the minimax risk
Exercise 4.22.
4.22.A. Motivation. From the theoretical viewpoint, the results on near-optimality of presumably good linear estimates stated in Propositions 4.5 and 4.16 seem
to be quite strong and general. This being said, for a practically oriented user the "non-optimality factors" arising in these propositions can be too large to make any practical sense. This drawback of our theoretical results is not too crucial—what matters in applications is whether the risk of a proposed estimate is appropriate for the application in question, and not by how much it could be improved were we smart enough to build the "ideal" estimate; results of the latter type, from a practical viewpoint, offer no more than some "moral support." Nevertheless, the "moral support" has its value, and it makes sense to strengthen it by improving the lower risk bounds as compared to those underlying Propositions 4.5 and 4.16. In this respect, an appealing idea is to pass from lower risk bounds yielded by theoretical considerations to computation-based ones. The goal of this exercise is to develop some methodology yielding computation-based lower risk bounds. We start with the main ingredient of this methodology—the classical Cramer-Rao bound.
4.22.B. Cramer-Rao bound. Consider the situation as follows: we are given
• an observation space Ω equipped with a reference measure Π, basic examples being (A) Ω = R^m with Lebesgue measure Π, and (B) a (finite or countable) discrete set Ω with counting measure Π;
• a convex compact set Θ ⊂ R^k and a family P = {p(ω,θ) : θ ∈ Θ} of probability densities, taken w.r.t. Π.
Our goal is, given an observation ω ∼ p(·,θ) stemming from an unknown θ known to belong to Θ, to recover θ. We quantify the risk of a candidate estimate θ̂ as
\[
\mathrm{Risk}[\widehat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left\{\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega)-\theta\|_2^2\right\}\right\}^{1/2}, \tag{4.113}
\]
and define the "ideal" minimax risk as
\[
\mathrm{Risk}_{\mathrm{opt}} = \inf_{\widehat{\theta}}\mathrm{Risk}[\widehat{\theta}],
\]
the infimum being taken w.r.t. all estimates, or, which is the same, all bounded estimates (indeed, passing from a candidate estimate θ̂ to the projected estimate θ̂_Θ(ω) = argmin_{θ∈Θ}‖θ̂(ω) − θ‖₂ will only reduce the estimate's risk). The Cramer-Rao inequality [58, 205], which we intend to use,^21 is a certain relation between the covariance matrix of a bounded estimate and its bias; this relation is valid under mild regularity assumptions on the family P, specifically, as follows:
^21 As a matter of fact, the classical Cramer-Rao inequality dealing with unbiased estimates is not sufficient for our purposes "as is." What we need to build is a "bias-enabled" version of this inequality. Such an inequality may be developed using a Bayesian argument [99, 233].
1) p(ω,θ) > 0 for all ω ∈ Ω, θ ∈ Θ, and p(ω,θ) is differentiable in θ, with ∇_θ p(ω,θ) continuous in θ ∈ Θ;
2) The Fisher Information matrix
\[
\mathcal{I}(\theta) = \int_\Omega \frac{\nabla_\theta p(\omega,\theta)\,[\nabla_\theta p(\omega,\theta)]^T}{p(\omega,\theta)}\,\Pi(d\omega)
\]
is well-defined for all θ ∈ Θ;
3) There exists a function M(ω) ≥ 0 such that ∫_Ω M(ω)Π(dω) < ∞ and ‖∇_θ p(ω,θ)‖₂ ≤ M(ω) for all ω ∈ Ω, θ ∈ Θ.
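As a quick numeric illustration of the Fisher information just defined (a sketch of ours, not from the book): for the scalar shift family p(·,θ) = N(θ, σ²) taken w.r.t. Lebesgue measure, I(θ) = σ⁻², which a simple trapezoid-rule quadrature over a truncated range reproduces:

```python
# Sketch (ours): numerically evaluate the Fisher information
# I(theta) = ∫ [d/dθ p(ω,θ)]² / p(ω,θ) dω for p(·,θ) = N(θ, σ²),
# where d/dθ p = p·(ω−θ)/σ², so the integrand is p(ω)·((ω−θ)/σ²)².
# Closed form: I(theta) = 1/σ², independent of θ.
import math

def fisher_info_gaussian_shift(theta, sigma, lo=-12.0, hi=12.0, n=20001):
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        w = lo + i * step
        p = math.exp(-(w - theta)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        val = p * ((w - theta) / sigma**2) ** 2
        total += val * (step if 0 < i < n - 1 else step / 2)   # trapezoid rule
    return total

assert abs(fisher_info_gaussian_shift(0.3, 1.0) - 1.0) < 1e-6
assert abs(fisher_info_gaussian_shift(0.0, 2.0) - 0.25) < 1e-6
```

The truncation to [−12, 12] is harmless here because the integrand decays like a Gaussian.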
The derivation of the Cramer-Rao bound is as follows. Let θ̂(ω) be a bounded estimate, and let
\[
\phi(\theta) = [\phi_1(\theta);\dots;\phi_k(\theta)] = \int_\Omega \widehat{\theta}(\omega)\,p(\omega,\theta)\,\Pi(d\omega)
\]
be the expected value of the estimate. By item 3), φ(θ) is differentiable on Θ, with the Jacobian $\phi'(\theta) = \left[\frac{\partial\phi_i(\theta)}{\partial\theta_j}\right]_{i,j\le k}$ given by
\[
\phi'(\theta)h = \int_\Omega \widehat{\theta}(\omega)\,h^T\nabla_\theta p(\omega,\theta)\,\Pi(d\omega),\quad h\in\mathbf{R}^k.
\]
Besides this, recalling that ∫_Ω p(ω,θ)Π(dω) ≡ 1 and invoking item 3), we have ∫_Ω h^T∇_θ p(ω,θ)Π(dω) = 0, whence, in view of the previous identity,
\[
\phi'(\theta)h = \int_\Omega [\widehat{\theta}(\omega)-\phi(\theta)]\,h^T\nabla_\theta p(\omega,\theta)\,\Pi(d\omega),\quad h\in\mathbf{R}^k.
\]
Therefore, for all g, h ∈ R^k we have
\[
\begin{array}{rcl}
[g^T\phi'(\theta)h]^2 &=& \left[\int_\Omega [g^T(\widehat{\theta}-\phi(\theta))][h^T\nabla_\theta p(\omega,\theta)/p(\omega,\theta)]\,p(\omega,\theta)\,\Pi(d\omega)\right]^2\\[4pt]
&\le& \left[\int_\Omega g^T[\widehat{\theta}-\phi(\theta)][\widehat{\theta}-\phi(\theta)]^Tg\,p(\omega,\theta)\,\Pi(d\omega)\right]
\times \int_\Omega [h^T\nabla_\theta p(\omega,\theta)/p(\omega,\theta)]^2 p(\omega,\theta)\,\Pi(d\omega)\\[4pt]
&&\qquad\text{[by the Cauchy inequality]}\\[2pt]
&=& \left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right],
\end{array}
\]
where Cov_{θ̂}(θ) is the covariance matrix $\mathbf{E}_{\omega\sim p(\cdot,\theta)}\{[\widehat{\theta}(\omega)-\phi(\theta)][\widehat{\theta}(\omega)-\phi(\theta)]^T\}$ of θ̂(ω) induced by ω ∼ p(·,θ). We have arrived at the inequality
\[
\left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right] \ge [g^T\phi'(\theta)h]^2 \quad \forall(g,h\in\mathbf{R}^k,\ \theta\in\Theta). \tag{$*$}
\]
For θ ∈ Θ fixed, let J be a positive definite matrix such that J ⪰ I(θ), whence by (∗) it holds
\[
\left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{J}h\right] \ge [g^T\phi'(\theta)h]^2 \quad \forall(g,h\in\mathbf{R}^k). \tag{$**$}
\]
For g fixed, the maximum of the right-hand side quantity in (∗∗) over h satisfying h^TJh ≤ 1 is g^Tφ'(θ)J^{-1}[φ'(θ)]^Tg, and we arrive at the Cramer-Rao inequality
\[
\forall(\theta\in\Theta,\ \mathcal{J}\succeq\mathcal{I}(\theta),\ \mathcal{J}\succ0):\quad \mathrm{Cov}_{\widehat{\theta}}(\theta)\succeq\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T \tag{4.114}
\]
\[
\left[\mathrm{Cov}_{\widehat{\theta}}(\theta)=\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{\theta}-\phi(\theta)][\widehat{\theta}-\phi(\theta)]^T\right\},\quad \phi(\theta)=\mathbf{E}_{\omega\sim p(\cdot,\theta)}\{\widehat{\theta}(\omega)\}\right]
\]
which holds true for every bounded estimate θ̂(·). Note also that for every θ ∈ Θ
and every bounded estimate θ̂ we have
\[
\begin{array}{rcl}
\mathrm{Risk}^2[\widehat{\theta}] &\ge& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega)-\theta\|_2^2\right\}
= \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|[\widehat{\theta}(\omega)-\phi(\theta)]+[\phi(\theta)-\theta]\|_2^2\right\}\\[4pt]
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega)-\phi(\theta)\|_2^2\right\} + \|\phi(\theta)-\theta\|_2^2
+ 2\underbrace{\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{\theta}(\omega)-\phi(\theta)]^T[\phi(\theta)-\theta]\right\}}_{=0}\\[4pt]
&=& \mathrm{Tr}(\mathrm{Cov}_{\widehat{\theta}}(\theta)) + \|\phi(\theta)-\theta\|_2^2.
\end{array}
\]
Hence, in view of (4.114), for every bounded estimate θ̂ it holds
\[
\forall(\mathcal{J}\succ0:\ \mathcal{J}\succeq\mathcal{I}(\theta)\ \forall\theta\in\Theta):\quad
\mathrm{Risk}^2[\widehat{\theta}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta)-\theta\|_2^2\right],\quad
\phi(\theta)=\mathbf{E}_{\omega\sim p(\cdot,\theta)}\{\widehat{\theta}(\omega)\}. \tag{4.115}
\]
The fact that we considered the risk of estimating "the entire" θ rather than a given vector-valued function f(θ): Θ → R^ν plays no special role, and in fact the Cramer-Rao inequality admits the following modification, yielded by a completely similar reasoning:
Proposition 4.36. In the situation described in item 4.22.B and under assumptions 1)–3) of this item, let f(·): Θ → R^ν be a bounded Borel function, and let f̂(ω) be a bounded estimate of f(θ) via observation ω ∼ p(·,θ). Then, setting for θ ∈ Θ
\[
\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\widehat{f}(\omega)\right\},\qquad
\mathrm{Cov}_{\widehat{f}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{f}(\omega)-\phi(\theta)][\widehat{f}(\omega)-\phi(\theta)]^T\right\},
\]
one has
\[
\forall(\theta\in\Theta,\ \mathcal{J}\succeq\mathcal{I}(\theta),\ \mathcal{J}\succ0):\quad \mathrm{Cov}_{\widehat{f}}(\theta)\succeq\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T.
\]
As a result, for
\[
\mathrm{Risk}[\widehat{f}] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{f}(\omega)-f(\theta)\|_2^2\right\}\right]^{1/2}
\]
it holds
\[
\forall(\mathcal{J}\succ0:\ \mathcal{J}\succeq\mathcal{I}(\theta)\ \forall\theta\in\Theta):\quad
\mathrm{Risk}^2[\widehat{f}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta)-f(\theta)\|_2^2\right].
\]
Now comes the first part of the exercise: 1) Derive from (4.115) the following
Proposition 4.37. In the situation of item 4.22.B, let
• Θ ⊂ R^k be a ‖·‖₂-ball of radius r > 0,
• the family P be such that I(θ) ⪯ J for some J ≻ 0 and all θ ∈ Θ.
Then the minimax optimal risk satisfies the bound
\[
\mathrm{Risk}_{\mathrm{opt}} \ge \frac{rk}{r\sqrt{\mathrm{Tr}(\mathcal{J})}+k}. \tag{4.116}
\]
In particular, when J = α^{−1}I_k, we have
\[
\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sqrt{\alpha k}}{r+\sqrt{\alpha k}}. \tag{4.117}
\]
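The two forms of the bound agree: substituting Tr(J) = k/α into (4.116) recovers (4.117), which the following small check (ours; toy numbers) confirms:

```python
# Sketch (ours): the Cramer-Rao minimax lower bound (4.116) and its
# isotropic specialization (4.117) coincide when J = I_k/alpha.
import math

def cramer_rao_lower_bound(r, k, trace_J):
    # bound (4.116): Risk_opt >= r*k / (r*sqrt(Tr(J)) + k)
    return r * k / (r * math.sqrt(trace_J) + k)

def cramer_rao_isotropic(r, k, alpha):
    # bound (4.117): Risk_opt >= r*sqrt(alpha*k) / (r + sqrt(alpha*k))
    return r * math.sqrt(alpha * k) / (r + math.sqrt(alpha * k))

r, k, alpha = 2.0, 16, 0.01
assert abs(cramer_rao_lower_bound(r, k, k / alpha)
           - cramer_rao_isotropic(r, k, alpha)) < 1e-12
```

The bound grows with r and k and shrinks as the Fisher information (here, Tr(J)) grows, exactly as the discussion around (4.118) below describes.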
Hint. Assuming w.l.o.g. that Θ is centered at the origin, and given a bounded estimate θ̂ with risk R, let φ(θ) be associated with the estimate via (4.115). Select γ ∈ (0,1) and consider two cases: (a) there exists θ ∈ ∂Θ such that ‖φ(θ) − θ‖₂ > γr, and (b) ‖φ(θ) − θ‖₂ ≤ γr for all θ ∈ ∂Θ. In the case of (a), lower-bound R by max_{θ∈Θ} ‖φ(θ) − θ‖₂; see (4.115). In the case of (b), lower-bound R² by max_{θ∈Θ} Tr(φ'(θ)J^{−1}[φ'(θ)]^T)—see (4.115)—and use the Gauss Divergence theorem to lower-bound the latter quantity in terms of the flux of the vector field φ(·) over ∂Θ. When implementing the above program, you might find useful the following fact (prove it!):
Lemma 4.38. Let Φ be an n × n matrix, and J be a positive definite n × n matrix. Then
\[
\mathrm{Tr}(\Phi\mathcal{J}^{-1}\Phi^T) \ge \frac{\mathrm{Tr}^2(\Phi)}{\mathrm{Tr}(\mathcal{J})}.
\]
4.22.C. Application to signal recovery. Proposition 4.37 allows us to build computation-based lower risk bounds in the signal recovery problem considered in Section 4.2, in particular, the problem where one wants to recover the linear image Bx of an unknown signal x known to belong to a given ellitope
\[
X = \{x\in\mathbf{R}^n : \exists t\in\mathcal{T}: x^TS_\ell x\le t_\ell,\ \ell\le L\}
\]
(with our usual restrictions on S_ℓ and T) via observation ω = Ax + σξ, ξ ∼ N(0, I_m), and the risk of a candidate estimate, as in Section 4.2, is defined according to (4.113).^22 It is convenient to assume that the matrix B (which in our general setup can be an arbitrary ν × n matrix) is a nonsingular n × n matrix.^23
^22 In fact, the approach to be developed can be applied to signal recovery problems involving Discrete/Poisson observation schemes and norms different from ‖·‖₂ used to measure the recovery error, signal-dependent noises, etc.
^23 This assumption is nonrestrictive. Indeed, when B ∈ R^{ν×n} with ν < n, we can add to B n − ν zero rows, which keeps our estimation problem intact. When ν ≥ n, we can add to B a small perturbation to ensure Ker B = {0}, which, for small enough perturbation, again keeps our estimation problem basically intact. It remains to note that when Ker B = {0} we can replace R^ν with the image space of B, which again does not affect the estimation problem we are interested in.
Under this
assumption, setting
\[
\mathcal{Y} = B^{-1}X = \{y\in\mathbf{R}^n : \exists t\in\mathcal{T}: y^T[B^{-1}]^TS_\ell B^{-1}y\le t_\ell,\ \ell\le L\}
\]
and Ā = AB^{−1}, we lose nothing when replacing the sensing matrix A with Ā and treating as our signal y ∈ Y rather than x ∈ X. Note that in our new situation A is replaced with Ā, X with Y, and B is the unit matrix I_n. For the sake of simplicity, we assume from now on that A (and therefore Ā) has trivial kernel. Finally, let S̃_ℓ ⪰ S_ℓ be positive definite matrices close to S_ℓ, e.g., S̃_ℓ = S_ℓ + 10^{−100}I_n. Setting S̄_ℓ = [B^{−1}]^T S̃_ℓ B^{−1} and
\[
\bar{\mathcal{Y}} = \{y\in\mathbf{R}^n : \exists t\in\mathcal{T}: y^T\bar{S}_\ell y\le t_\ell,\ \ell\le L\},
\]
we get S̄_ℓ ≻ 0 and Ȳ ⊂ Y. Therefore, any lower bound on the ‖·‖₂-risk of recovering y ∈ Ȳ via observation ω = AB^{−1}y + σξ, ξ ∼ N(0, I_m), automatically is a lower bound on the minimax risk Risk_opt corresponding to our original problem of interest.
Now assume that we can point out a k-dimensional linear subspace E in R^n and positive reals r, γ such that
(i) the ‖·‖₂-ball Θ = {θ ∈ E : ‖θ‖₂ ≤ r} is contained in Ȳ;
(ii) the restriction Ā_E of Ā onto E satisfies the relation Tr(Ā_E^*Ā_E) ≤ γ (Ā_E^*: R^m → E is the conjugate of the linear map Ā_E: E → R^m).
Consider the auxiliary estimation problem obtained from the (reformulated) problem of interest by replacing the signal set Ȳ with Θ. Since Θ ⊂ Ȳ, the minimax risk in the auxiliary problem is a lower bound on the minimax risk Risk_opt we are interested in. On the other hand, the auxiliary problem is nothing but the problem of recovering the parameter θ ∈ Θ from observation ω ∼ N(Āθ, σ²I), which is just a special case of the problem considered in item 4.22.B. As is immediately seen, the Fisher Information matrix in this problem is independent of θ and equals σ^{−2}Ā_E^*Ā_E: e^T I(θ)e = σ^{−2}e^TĀ_E^*Ā_E e, e ∈ E. Invoking Proposition 4.37, we arrive at the lower bound on the minimax risk in the auxiliary problem (and thus in the problem of interest as well):
\[
\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sigma k}{r\sqrt{\gamma}+\sigma k}. \tag{4.118}
\]
The resulting risk bound depends on r, k, γ and is larger the smaller γ is and the larger k and r are.
Lower-bounding Risk_opt. In order to make the bounding scheme just outlined give its best, we need a mechanism which allows us to generate k-dimensional "disks" Θ ⊂ Ȳ along with the associated quantities r, γ. In order to design such a mechanism, it is convenient to represent k-dimensional linear subspaces of R^n as the image spaces of orthogonal n × n projectors P of rank k. Such a projector P gives rise to the disk Θ_P of radius r = r_P contained in Ȳ, where r_P is the largest ρ such that the set {y ∈ Im P : y^TPy ≤ ρ²} is contained in Ȳ ("condition
C(r)"), and we can equip the disk with γ satisfying (ii) if and only if Tr(PĀ^TĀP) ≤ γ, or, which is the same (recall that P is an orthogonal projector),
\[
\mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma \tag{4.119}
\]
("condition D(γ)"). Now, when P is a nonzero orthogonal projector, the simplest sufficient condition for the validity of C(r) is the existence of t ∈ T such that ∀(y ∈ R^n, ℓ ≤ L): y^TPS̄_ℓPy ≤ t_ℓ r^{−2} y^TPy, or, which is the same,
\[
\exists s:\ r^2s\in\mathcal{T}\ \ \&\ \ P\bar{S}_\ell P \preceq s_\ell P,\ \ell\le L. \tag{4.120}
\]
Let us rewrite (4.119) and (4.120) as a system of linear matrix inequalities. This is what you are supposed to do:
2.1) Prove the following simple fact:
Observation 4.39. Let Q be a positive definite matrix, R be a nonzero positive semidefinite matrix, and let s be a real. Then RQR ⪯ sR if and only if sQ^{−1} ⪰ R.
2.2) Extract from the above observation the conclusion as follows. Let T be the conic hull of T:
\[
\mathbf{T} = \mathrm{cl}\{[s;\tau] : \tau>0,\ s/\tau\in\mathcal{T}\} = \{[s;\tau] : \tau>0,\ s/\tau\in\mathcal{T}\}\cup\{0\}.
\]
Consider the system of constraints
\[
[s;\tau]\in\mathbf{T}\ \ \&\ \ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L\ \ \&\ \ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma,\quad P\ \text{is an orthogonal projector of rank}\ k\ge1 \tag{\#}
\]
in variables [s;τ], k, γ, and P. Every feasible solution to this system gives rise to a k-dimensional Euclidean subspace E ⊂ R^n (the image space of P) such that the Euclidean ball Θ in E centered at the origin of radius r = 1/√τ, taken along with γ, satisfies conditions (i)–(ii). Consequently, such a feasible solution yields the lower bound
\[
\mathrm{Risk}_{\mathrm{opt}} \ge \psi_{\sigma,k}(\gamma,\tau) := \frac{\sigma k}{\sqrt{\gamma}+\sigma\sqrt{\tau}\,k}
\]
on the minimax risk in the problem of interest. Ideally, to utilize item 2.2 to lower-bound Risk_opt, we should look through k =
1,...,n and maximize for every k the lower risk bound ψ_{σ,k}(γ,τ) under constraints (#), thus arriving at the problem
\[
\min_{[s;\tau],\gamma,P}\left\{\frac{\sigma}{\psi_{\sigma,k}(\gamma,\tau)}=\frac{\sqrt{\gamma}}{k}+\sigma\sqrt{\tau}\ :\
[s;\tau]\in\mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma,\ P\ \text{an orthogonal projector of rank}\ k\right\} \tag{$P_k$}
\]
This problem seems to be computationally intractable, since the constraints of (P_k) include the nonconvex restriction on P to be a projector of rank k. A natural convex relaxation of this constraint is 0 ⪯ P ⪯ I_n, Tr(P) = k. The (minor) remaining difficulty is that the objective in (P_k) is nonconvex. Note, however, that to minimize √γ/k + σ√τ is basically the same as to minimize the convex function γ/k² + σ²τ, which is a tight "proxy" of the squared objective of (P_k). We arrive at a convex "proxy" of (P_k)—the problem
\[
\min_{[s;\tau],\gamma,P}\left\{\gamma/k^2+\sigma^2\tau\ :\
[s;\tau]\in\mathbf{T},\ 0\preceq P\preceq I_n,\ \mathrm{Tr}(P)=k,\ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L,\ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma\right\} \tag{$P[k]$}
\]
k = 1,...,n. Problem (P[k]) clearly is solvable, and the P-component P^{(k)} of its optimal solution gives rise to a collection of orthogonal projectors P_κ^{(k)}, κ = 1,...,n, obtained from P^{(k)} by "rounding"—to get P_κ^{(k)}, we replace the κ leading eigenvalues of P^{(k)} with ones, and the remaining eigenvalues with zeros, while keeping the eigenvectors intact. We can now for every κ = 1,...,n fix the P-variable in (P_k) as P_κ^{(k)} and solve the resulting problem in the remaining variables [s;τ] and γ, which is easy—with P fixed, the problem clearly reduces to minimizing τ under the convex constraints s_ℓS̄_ℓ^{−1} ⪰ P, ℓ ≤ L, [s;τ] ∈ T on [s;τ]. As a result, for every k ∈ {1,...,n}, we get n lower bounds on Risk_opt, that is, a total of n² lower risk bounds, of which we select the best—the largest. Now comes the next part of the exercise:
3) Implement the outlined program numerically and compare the lower bound on the minimax risk with the upper risk bounds of presumably good linear estimates yielded by Proposition 4.4. Recommended setup:
• Sizes: m = n = ν = 16.
• A, B: B = I_n, A = Diag{a₁,...,a_n} with a_i = i^{−α} and α running through {0, 1, 2}.
• X = {x ∈ R^n : x^TS_ℓx ≤ 1, ℓ ≤ L} (i.e., T = [0,1]^L) with randomly generated S_ℓ.
• Range of L: {1, 4, 16}. For L in this range, you can generate S_ℓ, ℓ ≤ L, as S_ℓ = R_ℓR_ℓ^T with R_ℓ = randn(n, p), where p = ⌊n/L⌋.
• Range of σ: {1.0, 0.1, 0.01, 0.001, 0.0001}.
Exercise 4.23. [follow-up to Exercise 4.22]
1) Prove the following version of Proposition 4.37:
Proposition 4.40. In the situation of item 4.22.B and under Assumptions 1)–3) from this item, let
• ‖·‖ be a norm on R^k such that ‖θ‖₂ ≤ κ‖θ‖ for all θ ∈ R^k,
• Θ ⊂ R^k be a ‖·‖-ball of radius r > 0,
• the family P be such that I(θ) ⪯ J for some J ≻ 0 and all θ ∈ Θ.
Then the minimax optimal risk
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} = \inf_{\widehat{\theta}(\cdot)}\sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\theta-\widehat{\theta}(\omega)\|^2\right\}\right]^{1/2}
\]
of recovering parameter θ ∈ Θ from observation ω ∼ p(·,θ) in the norm ‖·‖ satisfies the bound
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{rk}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})}+k}. \tag{4.121}
\]
In particular, when J = α^{−1}I_k, we get
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{r\sqrt{\alpha k}}{r\kappa+\sqrt{\alpha k}}. \tag{4.122}
\]
2) Apply Proposition 4.40 to get lower bounds on the minimax ‖·‖-risk in the following estimation problems:
2.1) Given indirect observation ω = Aθ + σξ, ξ ∼ N(0, I_m), of an unknown vector θ known to belong to Θ = {θ ∈ R^k : ‖θ‖_p ≤ r} with given A, Ker A = {0}, p ∈ [2,∞], r > 0, we want to recover θ in ‖·‖_p.
2.2) Given indirect observation ω = LθR + σξ, where θ is an unknown µ × ν matrix known to belong to the Schatten norm ball Θ = {θ ∈ R^{µ×ν} : ‖θ‖_{Sh,p} ≤ r}, we want to recover θ in ‖·‖_{Sh,p}. Here L ∈ R^{m×µ}, Ker L = {0}, and R ∈ R^{ν×n}, Ker R^T = {0}, are given matrices, p ∈ [2,∞], and ξ is a random Gaussian m × n matrix (i.e., the entries of ξ are N(0,1) random variables independent of each other).
2.3) Given a K-repeated observation ω^K = (ω₁,...,ω_K) with i.i.d. components ω_t ∼ N(0,θ), 1 ≤ t ≤ K, with unknown θ ∈ S^n known to belong to the matrix box Θ = {θ : β₋I_n ⪯ θ ⪯ β₊I_n} with given 0 < β₋ < β₊ < ∞, we want to recover θ in the spectral norm.
Exercise 4.24. [More on the Cramer-Rao risk bound] Let us fix µ ∈ (1,∞) and a norm ‖·‖ on R^k; let ‖·‖_* be the norm conjugate to ‖·‖, and µ_* = µ/(µ−1). Assume that we are in the situation of item 4.22.B and under assumptions 1) and 3) from this item; as for assumption 2), we now replace it with the assumption that the quantity
\[
I_{\|\cdot\|_*,\mu_*}(\theta) := \left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}
\]
is well-defined and bounded on Θ; in the sequel, we set
\[
I_{\|\cdot\|_*,\mu_*} = \sup_{\theta\in\Theta} I_{\|\cdot\|_*,\mu_*}(\theta).
\]
1) Prove the following variant of the Cramer-Rao risk bound:
Proposition 4.41. In the situation described at the beginning of item 4.22.D, let Θ ⊂ R^k be a ‖·‖-ball of radius r. Then the minimax ‖·‖-risk of recovering θ ∈ Θ via observation ω ∼ p(·,θ) can be lower-bounded as
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] := \inf_{\widehat{\theta}(\cdot)}\sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega)-\theta\|^\mu\right\}\right]^{1/\mu} \ge \frac{rk}{rI_{\|\cdot\|_*,\mu_*}+k}, \tag{4.123}
\]
\[
I_{\|\cdot\|_*,\mu_*} = \max_{\theta\in\Theta}\left[I_{\|\cdot\|_*,\mu_*}(\theta) := \left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}\right].
\]
Example I: Gaussian case, estimating shift. Let µ = 2, and let p(·,θ) = N(Aθ, σ²I_m) with A ∈ R^{m×k}. Then ∇_θ ln(p(ω,θ)) = σ^{−2}A^T(ω − Aθ), so that
\[
\begin{array}{rcl}
\int\|\nabla_\theta\ln(p(\omega,\theta))\|_*^2\,p(\omega,\theta)d\omega &=& \sigma^{-4}\int\|A^T(\omega-A\theta)\|_*^2\,p(\omega,\theta)d\omega\\[3pt]
&=& \sigma^{-4}\frac{1}{[\sqrt{2\pi}\sigma]^m}\int\|A^T\omega\|_*^2\exp\{-\tfrac{\omega^T\omega}{2\sigma^2}\}d\omega\\[3pt]
&=& \sigma^{-4}\frac{1}{[2\pi]^{m/2}}\int\|A^T\sigma\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi\\[3pt]
&=& \sigma^{-2}\frac{1}{[2\pi]^{m/2}}\int\|A^T\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi,
\end{array}
\]
whence
\[
I_{\|\cdot\|_*,2} = \sigma^{-1}\underbrace{\left[\mathbf{E}_{\xi\sim N(0,I_m)}\left\{\|A^T\xi\|_*^2\right\}\right]^{1/2}}_{\gamma_{\|\cdot\|}(A)}.
\]
Consequently, assuming Θ to be a ‖·‖-ball of radius r in R^k, lower bound (4.123) becomes
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{rk}{rI_{\|\cdot\|_*,2}+k} = \frac{rk}{r\sigma^{-1}\gamma_{\|\cdot\|}(A)+k} = \frac{r\sigma k}{r\gamma_{\|\cdot\|}(A)+\sigma k}. \tag{4.124}
\]
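The quantity γ_{‖·‖}(A) entering (4.124) is a plain Gaussian expectation and is easy to estimate by Monte Carlo; the sketch below (ours, not the book's; toy sizes, pure stdlib) does so for A = I_k and the three conjugate norms used in the direct-observation examples:

```python
# Monte Carlo sketch (ours) of gamma_{||.||}(I_k) = [E ||xi||_*^2]^{1/2},
# xi ~ N(0, I_k): for ||.||_2 the conjugate norm is again ||.||_2, so
# gamma = sqrt(k) exactly; for ||.||_1 the conjugate is ||.||_inf, and for
# ||.||_inf it is ||.||_1.
import math, random

random.seed(1)
k, N = 16, 20000

def gamma_mc(conj_norm):
    acc = 0.0
    for _ in range(N):
        xi = [random.gauss(0.0, 1.0) for _ in range(k)]
        acc += conj_norm(xi) ** 2
    return math.sqrt(acc / N)

g2   = gamma_mc(lambda v: math.sqrt(sum(x * x for x in v)))   # conjugate of ||.||_2
ginf = gamma_mc(lambda v: max(abs(x) for x in v))             # conjugate of ||.||_1
g1   = gamma_mc(lambda v: sum(abs(x) for x in v))             # conjugate of ||.||_inf

assert abs(g2 - math.sqrt(k)) < 0.1    # = 4 for k = 16, up to MC error
assert ginf < g2 < g1                  # matches the three regimes in the text
```

The ordering γ for ‖·‖₁ recovery ≪ γ for ‖·‖₂ ≪ γ for ‖·‖∞ is exactly what drives the three risk bounds that follow.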
The case of direct observations. To see "how it works," consider the case m = k, A = I_k of direct observations, and let Θ = {θ ∈ R^k : ‖θ‖ ≤ r}. Then
• We have γ_{‖·‖₁}(I_k) ≤ O(1)√(ln k), whence the ‖·‖₁ risk bound is
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_1}[\Theta] \ge O(1)\frac{r\sigma k}{r\sqrt{\ln(k)}+\sigma k}\qquad [\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_1\le r\}].
\]
• We have γ_{‖·‖₂}(I_k) = √k, whence the ‖·‖₂ risk bound is
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_2}[\Theta] \ge \frac{r\sigma\sqrt{k}}{r+\sigma\sqrt{k}}\qquad [\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_2\le r\}].
\]
• We have γ_{‖·‖∞}(I_k) ≤ O(1)k, whence the ‖·‖∞ risk bound is
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_\infty}[\Theta] \ge O(1)\frac{r\sigma}{r+\sigma}\qquad [\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_\infty\le r\}].
\]
In fact, the above examples are essentially covered by the following:
Observation 4.42. Let ‖·‖ be a norm on R^k, and let Θ = {θ ∈ R^k : ‖θ‖ ≤ r}. Consider the problem of recovering signal θ ∈ Θ via observation ω ∼ N(θ, σ²I_k). Let
\[
\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim N(\theta,\sigma^2I)}\left\{\|\widehat{\theta}(\omega)-\theta\|^2\right\}\right]^{1/2}
\]
be the ‖·‖-risk of an estimate θ̂(·), and let
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] = \inf_{\widehat{\theta}(\cdot)}\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta]
\]
be the associated minimax risk. Assume that the norm ‖·‖ is absolute and symmetric w.r.t. permutations of the coordinates. Then
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{r\sigma k}{\sqrt{2\ln(ek)}\,r\alpha_*+\sigma k},\qquad \alpha_* = \|[1;\dots;1]\|_*. \tag{4.125}
\]
Here is the concluding part of the exercise:
2) Prove the observation and compare the lower risk bound it yields with the ‖·‖-risk of the "plug-in" estimate χ̂(ω) ≡ ω.
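For concreteness (a sketch of ours, not the book's; toy r, σ, k), here are the values of α_* and of the lower bound (4.125) for the three standard norms; the larger α_* is, the weaker the bound:

```python
# Sketch (ours): alpha_* = ||[1;...;1]||_* in (4.125) for the three norms
# used above, and the resulting lower bounds, with toy parameters.
import math

k, r, sigma = 16, 1.0, 0.1
alphas = {
    "l1":   1.0,            # conjugate of ||.||_1 is ||.||_inf: ||1||_inf = 1
    "l2":   math.sqrt(k),   # ||.||_2 is self-conjugate: ||1||_2 = sqrt(k)
    "linf": float(k),       # conjugate of ||.||_inf is ||.||_1: ||1||_1 = k
}

def bound_4_125(alpha_star):
    return r * sigma * k / (math.sqrt(2 * math.log(math.e * k)) * r * alpha_star
                            + sigma * k)

bounds = {name: bound_4_125(a) for name, a in alphas.items()}
# the bound is monotonically decreasing in alpha_*
assert bounds["l1"] > bounds["l2"] > bounds["linf"]
```

This reproduces, up to the O(1) factors, the three direct-observation bounds listed above.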
Example II: Gaussian case, estimating covariance. Let µ = 2, let K be a positive integer, and let our observation ω be a collection of K i.i.d. samples ω_t ∼ N(0,θ), 1 ≤ t ≤ K, with unknown θ known to belong to a given convex compact subset Θ of the interior of the positive semidefinite cone S^n_+. Given ω₁,...,ω_K, we want to recover θ in the Schatten norm ‖·‖_{Sh,s} with s ∈ [1,∞]. Our estimation problem is covered by the setup of Exercise 4.22 with P comprised of the product probability densities p(ω,θ) = ∏_{t=1}^K g(ω_t,θ), θ ∈ Θ, where g(·,θ) is the density of N(0,θ). We have
\[
\nabla_\theta\ln(p(\omega,\theta)) = \sum_t\nabla_\theta\ln(g(\omega_t,\theta)) = \frac12\sum_t\left[\theta^{-1}\omega_t\omega_t^T\theta^{-1}-\theta^{-1}\right]
= \frac12\,\theta^{-1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T-I_n\right)\right]\theta^{-1/2}. \tag{4.126}
\]
With some effort [149] it can be proved that when K ≥ n, which we assume from now on, for random vectors ξ₁,...,ξ_K, independent across t and sampled from the standard Gaussian distribution N(0, I_n), for every u ∈ [1,∞] one
has
\[
\left[\mathbf{E}\left\{\left\|\sum_{t=1}^{K}[\xi_t\xi_t^T-I_n]\right\|_{Sh,u}^2\right\}\right]^{1/2} \le Cn^{\frac12+\frac1u}\sqrt{K} \tag{4.127}
\]
with an appropriate absolute constant C. Consequently, for θ ∈ Θ and all u ∈ [1,∞] we have
\[
\begin{array}{rcl}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{Sh,u}^2\right\}
&=& \frac14\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\theta^{-1/2}\left[\sum_t[\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T-I_n\right]\theta^{-1/2}\right\|_{Sh,u}^2\right\}\quad[\text{by (4.126)}]\\[4pt]
&=& \frac14\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\theta^{-1/2}\left[\sum_t\xi_t\xi_t^T-I_n\right]\theta^{-1/2}\right\|_{Sh,u}^2\right\}\quad[\text{setting }\theta^{-1/2}\omega_t=\xi_t]\\[4pt]
&\le& \frac14\|\theta^{-1/2}\|_{Sh,\infty}^4\,\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\sum_t\xi_t\xi_t^T-I_n\right\|_{Sh,u}^2\right\}\quad[\text{since }\|AB\|_{Sh,u}\le\|A\|_{Sh,\infty}\|B\|_{Sh,u}]\\[4pt]
&\le& \frac14\|\theta^{-1/2}\|_{Sh,\infty}^4\left[Cn^{\frac12+\frac1u}\sqrt{K}\right]^2\quad[\text{by (4.127)}]
\end{array}
\]
and we arrive at
\[
\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{Sh,u}^2\right\}\right]^{1/2} \le \frac{C}{2}\|\theta^{-1}\|_{Sh,\infty}\,n^{\frac12+\frac1u}\sqrt{K}. \tag{4.128}
\]
Now assume that Θ is a ‖·‖_{Sh,s}-ball of radius r < 1 centered at I_n:
\[
\Theta = \{\theta\in\mathbf{S}^n : \|\theta-I_n\|_{Sh,s}\le r\}. \tag{4.129}
\]
In this case the estimation problem from Example II is within the scope of Proposition 4.41, and the quantity I_{‖·‖_*,2} as defined in (4.123) can be upper-bounded as follows:
\[
I_{\|\cdot\|_*,2} = \max_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{Sh,s_*}^2\right\}\right]^{1/2}
\le O(1)n^{\frac12+\frac1{s_*}}\sqrt{K}\max_{\theta\in\Theta}\|\theta^{-1}\|_{Sh,\infty}\ [\text{see (4.128)}]
\le O(1)\frac{n^{\frac12+\frac1{s_*}}\sqrt{K}}{1-r}.
\]
We can now use Proposition 4.41 to lower-bound the minimax ‖·‖_{Sh,s}-risk, thus arriving at
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{Sh,s}}[\Theta] \ge O(1)\frac{n(1-r)r}{\sqrt{K}\,n^{\frac12-\frac1s}r+n(1-r)} \tag{4.130}
\]
(note that we are in the case of k = dim θ = n(n+1)/2).
Let us compare this lower risk bound with the ‖·‖_{Sh,s}-risk of the "plug-in" estimate
\[
\widehat{\theta}(\omega) = \frac1K\sum_{t=1}^{K}\omega_t\omega_t^T.
\]
Assuming θ ∈ Θ, we have
\[
\begin{array}{rcl}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|K[\widehat{\theta}(\omega)-\theta]\|_{Sh,s}^2\right\}
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\sum_t[\omega_t\omega_t^T-\theta]\right\|_{Sh,s}^2\right\}\\[4pt]
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\theta^{1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T-I_n\right)\right]\theta^{1/2}\right\|_{Sh,s}^2\right\}\\[4pt]
&=& \mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\theta^{1/2}\left[\sum_t[\xi_t\xi_t^T-I_n]\right]\theta^{1/2}\right\|_{Sh,s}^2\right\}\\[4pt]
&\le& \|\theta^{1/2}\|_{Sh,\infty}^4\,\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\sum_t[\xi_t\xi_t^T-I_n]\right\|_{Sh,s}^2\right\}\\[4pt]
&\le& \|\theta^{1/2}\|_{Sh,\infty}^4\left[Cn^{\frac12+\frac1s}\sqrt{K}\right]^2\quad[\text{see (4.127)}],
\end{array}
\]
and we arrive at
\[
\mathrm{Risk}_{\|\cdot\|_{Sh,s}}[\widehat{\theta}|\Theta] \le O(1)\max_{\theta\in\Theta}\|\theta\|_{Sh,\infty}\frac{n^{\frac12+\frac1s}}{\sqrt{K}}.
\]
In the case of (4.129), where ‖θ‖_{Sh,∞} ≤ 1 + r ≤ 2 for all θ ∈ Θ, the latter bound becomes
\[
\mathrm{Risk}_{\|\cdot\|_{Sh,s}}[\widehat{\theta}|\Theta] \le O(1)\frac{n^{\frac12+\frac1s}}{\sqrt{K}}. \tag{4.131}
\]
For the sake of simplicity, assume that r in (4.129) is 1/2 (what actually matters below is that r ∈ (0,1) is bounded away from 0 and from 1). In this case the lower bound (4.130) on the minimax ‖·‖_{Sh,s}-risk reads
\[
\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{Sh,s}}[\Theta] \ge O(1)\min\left[\frac{n^{\frac12+\frac1s}}{\sqrt{K}},\,1\right].
\]
When K is "large": K ≥ n^{1+2/s}, this lower bound matches, within an absolute constant factor, the upper bound (4.131) on the risk of the plug-in estimate, so that the latter estimate is near-optimal. When K < n^{1+2/s}, the lower risk bound becomes O(1), so that here a nearly optimal estimate is the trivial estimate θ̂(ω) ≡ I_n.
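The large-K regime can be probed by simulation. The sketch below (ours, not the book's; toy n, K, pure stdlib) checks that the spectral-norm error of the plug-in estimate at θ = I_n is of order √(n/K), in line with (4.131) at s = ∞; the spectral norm is computed by plain power iteration, which never overestimates the norm:

```python
# Monte Carlo sketch (ours) of the plug-in covariance estimate
# theta_hat = (1/K) sum_t omega_t omega_t^T at theta = I_n: its
# spectral-norm error should be O(sqrt(n/K)).
import math, random

random.seed(2)
n, K = 5, 2000

E = [[0.0] * n for _ in range(n)]          # error matrix theta_hat - I_n
for _ in range(K):
    w = [random.gauss(0.0, 1.0) for _ in range(n)]
    for i in range(n):
        for j in range(n):
            E[i][j] += w[i] * w[j] / K
for i in range(n):
    E[i][i] -= 1.0

def spectral_norm(M, iters=200):
    # power iteration on a symmetric matrix; returns ||M v||/||v|| at the
    # last step, which is always a lower bound on the true spectral norm
    v = [1.0] * len(M)
    nrm = 0.0
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(len(M))) for i in range(len(M))]
        nrm = math.sqrt(sum(x * x for x in u))
        v = [x / nrm for x in u]
    return nrm

err = spectral_norm(E)
assert err < 5.0 * math.sqrt(n / K), err    # within the O(sqrt(n/K)) regime
```

With n = 5 and K = 2000 the observed error is roughly 2√(n/K) ≈ 0.1, comfortably inside the asserted envelope.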
4.7.7 Around S-Lemma

The S-Lemma is a classical result of extreme importance in Semidefinite Optimization. Basically, the lemma states that when the ellitope X in Proposition 4.6 is an ellipsoid, (4.19) can be strengthened to Opt = Opt_*. In fact, the S-Lemma is even stronger:
Lemma 4.43. [S-Lemma] Consider two quadratic forms f(x) = x^TAx + 2a^Tx + α and g(x) = x^TBx + 2b^Tx + β such that g(x̄) < 0 for some x̄. Then the implication g(x) ≤ 0 ⇒ f(x) ≤ 0 takes place if and only if for some λ ≥ 0 it holds f(x) ≤ λg(x) for all x, or, which is the same, if and only if the Linear Matrix Inequality
\[
\begin{bmatrix}\lambda B - A & \lambda b - a\\ \lambda b^T - a^T & \lambda\beta-\alpha\end{bmatrix}\succeq 0
\]
in the scalar variable λ has a nonnegative solution.
A proof of the S-Lemma can be found, e.g., in [15, Section 3.5.2]. The goal of the subsequent exercises is to get "tight" tractable outer approximations of sets obtained from ellitopes by quadratic lifting. We fix an ellitope
\[
X = \{x\in\mathbf{R}^n : \exists t\in\mathcal{T}: x^TS_kx\le t_k,\ 1\le k\le K\} \tag{4.132}
\]
where, as always, S_k are positive semidefinite matrices with positive definite sum, and T is a computationally tractable convex compact subset of R^K_+ such that t ∈ T implies t' ∈ T whenever 0 ≤ t' ≤ t, and T contains a positive vector.
Exercise 4.25.
Let us associate with the ellitope X given by (4.132) the sets
\[
\mathcal{X} = \mathrm{Conv}\{xx^T : x\in X\},\qquad
\widehat{\mathcal{X}} = \{Y\in\mathbf{S}^n : Y\succeq0,\ \exists t\in\mathcal{T}: \mathrm{Tr}(S_kY)\le t_k,\ 1\le k\le K\},
\]
so that X, X̂ are convex compact sets containing the origin, and X̂ is computationally tractable along with T. Prove that
1. When K = 1, we have X = X̂;
2. We always have X ⊂ X̂ ⊂ 3 ln(√3 K) X.
Exercise 4.26.
For x ∈ R^n let
\[
Z(x) = [x;1][x;1]^T,\qquad \bar{Z}[x] = \begin{bmatrix}xx^T & x\\ x^T & 0\end{bmatrix},\qquad C = \begin{bmatrix}0 & 0\\ 0^T & 1\end{bmatrix}
\]
(so that Z(x) = Z̄[x] + C), and let us associate with the ellitope X given by (4.132) the sets
\[
\mathcal{X}^+ = \mathrm{Conv}\{\bar{Z}[x] : x\in X\},\qquad
\widehat{\mathcal{X}}^+ = \left\{Y=\begin{bmatrix}U & u\\ u^T & 0\end{bmatrix}\in\mathbf{S}^{n+1} : Y+C\succeq0,\ \exists t\in\mathcal{T}: \mathrm{Tr}(S_kU)\le t_k,\ 1\le k\le K\right\},
\]
so that X⁺, X̂⁺ are convex compact sets containing the origin, and X̂⁺ is computationally tractable along with T. Prove that
1. When K = 1, we have X⁺ = X̂⁺;
2. We always have X⁺ ⊂ X̂⁺ ⊂ 3 ln(√(3(K+1))) X⁺.
4.7.8
Miscellaneous exercises
Exercise 4.27. Let X ⊂ R^n be a convex compact set, let b ∈ R^n, and let A be an m × n matrix. Consider the problem of affine recovery ω ↦ h^Tω + c of the linear function Bx = b^Tx of x ∈ X from indirect observation ω = Ax + σξ, ξ ∼ N(0, I_m).
Given tolerance ε ∈ (0,1), we are interested in minimizing the worst-case, over x ∈ X, width of the (1−ε)-confidence interval, that is, the smallest ρ such that
\[
\mathrm{Prob}\{\xi: b^Tx-h^T(Ax+\sigma\xi)-c > \rho\}\le\epsilon/2\ \ \&\ \ \mathrm{Prob}\{\xi: b^Tx-h^T(Ax+\sigma\xi)-c < -\rho\}\le\epsilon/2\quad\forall x\in X.
\]
Pose the problem as a convex optimization problem, and consider in detail the case where X is the box {x ∈ R^n : |a_j x_j| ≤ 1, 1 ≤ j ≤ n}, where a_j > 0 for all j.
Exercise 4.28. Prove Proposition 4.21.
Exercise 4.29. Prove Proposition 4.22.
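For Exercise 4.27, note that the error of the affine estimate h^Tω + c at x is Gaussian with mean (b − A^Th)^Tx − c and standard deviation σ‖h‖₂, so the smallest admissible half-width is max_x |(b − A^Th)^Tx − c| plus σ‖h‖₂ times the (1−ε/2)-quantile of N(0,1); over the symmetric box the worst-case |mean| with c = 0 is Σ_j |(b − A^Th)_j|/a_j. A sketch of this objective (ours, not the book's; toy data):

```python
# Sketch (ours) of the half-width objective for the box case of Exercise
# 4.27: rho(h, c) = max_{x in box} |(b - A^T h)^T x - c| + sigma*||h||_2*q,
# where q is the (1 - eps/2)-quantile of N(0,1); over the box
# {|a_j x_j| <= 1} the max equals sum_j |(b - A^T h)_j| / a_j + |c|.
import math
from statistics import NormalDist

def halfwidth(h, c, A, b, a, sigma, eps):
    m, n = len(A), len(b)
    d = [b[j] - sum(A[i][j] * h[i] for i in range(m)) for j in range(n)]
    worst_bias = sum(abs(d[j]) / a[j] for j in range(n)) + abs(c)
    q = NormalDist().inv_cdf(1.0 - eps / 2.0)
    return worst_bias + sigma * math.sqrt(sum(x * x for x in h)) * q

# toy data; h0 is chosen so that b = A^T h0 and the bias term vanishes
A = [[1.0, 0.5], [0.0, 1.0]]
b = [1.0, 1.0]
a = [1.0, 2.0]
h0 = [1.0, 0.5]
rho = halfwidth(h0, 0.0, A, b, a, 1.0, 0.05)
assert 2.0 < rho < 2.4          # = sqrt(1.25) * q_{0.975} ~ 2.19
```

Minimizing ρ(h, c) over (h, c) is a convex problem, since both the worst-case bias and the ‖h‖₂ term are convex in (h, c).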
4.8 PROOFS

4.8.1 Preliminaries
4.8.1.1 Technical lemma
Lemma 4.44. Given a basic spectratope
\[
\mathcal{X} = \{x\in\mathbf{R}^n : \exists t\in\mathcal{T}: R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\} \tag{4.133}
\]
and a positive definite n × n matrix Q, and setting Λ_k = R_k[Q] (for notation, see Section 4.3.1), we get a collection of positive semidefinite matrices, and Σ_k R_k^*[Λ_k] is positive definite. As corollaries,
(i) whenever M_k, k ≤ K, are positive definite matrices, the matrix Σ_k R_k^*[M_k] is positive definite;
(ii) the set Q_T = {Q ⪰ 0 : R_k[Q] ⪯ T I_{d_k}, k ≤ K} is bounded for every T.
Proof. Let us prove the first claim. Assuming the opposite, we would be able to find a nonzero vector y such that Σ_k y^T R_k^*[Λ_k] y ≤ 0, whence
\[
0 \ge \sum_k y^TR_k^*[\Lambda_k]y = \sum_k \mathrm{Tr}(R_k^*[\Lambda_k][yy^T]) = \sum_k \mathrm{Tr}(\Lambda_kR_k[yy^T])
\]
(we have used (4.26), (4.22)). Since Λ_k = R_k[Q] ⪰ 0 due to Q ⪰ 0—see (4.23)—it follows that Tr(Λ_kR_k[yy^T]) = 0 for all k. Now, the linear mapping R_k[·] is monotone, and Q is positive definite, implying that Q ⪰ r_k yy^T for some r_k > 0, whence Λ_k ⪰ r_kR_k[yy^T], and therefore Tr(Λ_kR_k[yy^T]) = 0 implies that Tr(R_k^2[yy^T]) = 0, that is, R_k[yy^T] = R_k^2[y] = 0. Since R_k[·] takes values in S^{d_k}, we get R_k[y] = 0 for all k, which is impossible due to y ≠ 0 and property S.3; see Section 4.3.1.
To verify (i), note that when M_k are positive definite, we can find γ > 0 such that Λ_k ⪯ γM_k for all k ≤ K; invoking (4.27), we conclude that R_k^*[Λ_k] ⪯ γR_k^*[M_k], whence Σ_k R_k^*[M_k] is positive definite along with Σ_k R_k^*[Λ_k].
To verify (ii), assume, contrary to what should be proved, that Q_T is unbounded. Since Q_T is closed and convex, it must possess a nonzero recessive direction, that is, there should exist a nonzero positive semidefinite matrix D such that R_k[D] ⪯ 0 for all k. Selecting positive definite matrices M_k, the matrices R_k^*[M_k] are positive semidefinite (see Section 4.3.1), and their sum S is positive definite by (i). We have
\[
0 \ge \sum_k \mathrm{Tr}(R_k[D]M_k) = \sum_k \mathrm{Tr}(DR_k^*[M_k]) = \mathrm{Tr}(DS),
\]
where the first inequality is due to M_k ⪰ 0, and the first equality is due to (4.26). The resulting inequality is impossible due to 0 ≠ D ⪰ 0 and S ≻ 0, which is the desired contradiction. ✷
4.8.1.2
Noncommutative Khintchine Inequality
We will use a deep result from Functional Analysis (the "Noncommutative Khintchine Inequality") due to Lust-Piquard [175], Pisier [199], and Buchholz [34]; see [228, Theorem 4.6.1]:

Theorem 4.45. Let $Q_i\in\mathbf{S}^n$, $1\le i\le I$, and let $\xi_i$, $i=1,\dots,I$, be independent Rademacher ($\pm1$ with probabilities $1/2$) or $\mathcal{N}(0,1)$ random variables. Then for all $t\ge0$ one has
$$\mathrm{Prob}\Big\{\Big\|\sum_{i=1}^I\xi_iQ_i\Big\|\ge t\Big\}\le 2n\exp\Big\{-\frac{t^2}{2v_Q}\Big\},$$
where $\|\cdot\|$ is the spectral norm and $v_Q=\big\|\sum_{i=1}^IQ_i^2\big\|$.

We need the following immediate consequence of the theorem:
Lemma 4.46. Given spectratope (4.20), let $Q\in\mathbf{S}^n_+$ be such that
$$\mathcal{R}_k[Q]\preceq\rho t_kI_{d_k},\ 1\le k\le K,\qquad(4.134)$$
for some $t\in\mathcal{T}$ and some $\rho\in(0,1]$. Then
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi\notin\mathcal{X}\}\le\min\Big[2De^{-\frac{1}{2\rho}},\,1\Big],\qquad D:=\sum_{k=1}^Kd_k.$$

Proof. Setting $\xi=Q^{1/2}\eta$, $\eta\sim\mathcal{N}(0,I_n)$, we have
$$R_k[\xi]=R_k[Q^{1/2}\eta]=:\sum_{i=1}^n\eta_i\bar R^{ki}=\bar R_k[\eta]$$
with
$$\sum_i[\bar R^{ki}]^2=\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\big\{\bar R_k^2[\eta]\big\}=\mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\big\{R_k^2[\xi]\big\}=\mathcal{R}_k[Q]\preceq\rho t_kI_{d_k}$$
due to (4.24). Hence, by Theorem 4.45,
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\|R_k[\xi]\|^2\ge t_k\}=\mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_n)}\{\|\bar R_k[\eta]\|^2\ge t_k\}\le 2d_ke^{-\frac{1}{2\rho}}.$$
We conclude that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi\notin\mathcal{X}\}\le\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\exists k:\ \|R_k[\xi]\|^2>t_k\}\le 2De^{-\frac{1}{2\rho}}.\qquad✷$$
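The matrix tail bound driving Lemma 4.46 can be sanity-checked by simulation. The sketch below (numpy only; the matrices $Q_i$ are an arbitrary illustrative choice, not taken from the text) compares the empirical tail of $\|\sum_i\xi_iQ_i\|$ for Rademacher $\xi_i$ against the Theorem 4.45 bound $2n\exp\{-t^2/2\}$ at the level $t\sqrt{v_Q}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, I, trials, t = 4, 6, 20000, 3.0

# Random symmetric matrices Q_i (illustrative choice).
Qs = []
for _ in range(I):
    A = rng.standard_normal((n, n))
    Qs.append((A + A.T) / 2)
Qs = np.array(Qs)

# v_Q = || sum_i Q_i^2 ||, spectral norm.
vQ = np.linalg.norm(sum(Q @ Q for Q in Qs), ord=2)

# Empirical tail of || sum_i xi_i Q_i || at level t * sqrt(v_Q).
signs = rng.choice([-1.0, 1.0], size=(trials, I))
norms = np.array([np.linalg.norm(np.tensordot(s, Qs, axes=1), ord=2) for s in signs])
empirical = np.mean(norms >= t * np.sqrt(vQ))
bound = 2 * n * np.exp(-t**2 / 2)  # theorem's bound at this level

print(empirical, bound)
assert empirical <= bound
```

In practice the empirical tail sits well below the bound, reflecting the slack in the dimensional factor $2n$.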
The ellitopic version of Lemma 4.46 is as follows:

Lemma 4.47. Given ellitope (4.9), let $Q\in\mathbf{S}^n_+$ be such that
$$\mathrm{Tr}(R_kQ)\le\rho t_k,\ 1\le k\le K,\qquad(4.135)$$
for some $t\in\mathcal{T}$ and some $\rho\in(0,1]$. Then
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi\notin\mathcal{X}\}\le 2K\exp\Big\{-\frac{1}{3\rho}\Big\}.$$

Proof. Observe that if $P\in\mathbf{S}^n_+$ satisfies $\mathrm{Tr}(P)\le1$, we have
$$\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\big\{\exp\{\tfrac13\eta^TP\eta\}\big\}\le\sqrt3.\qquad(4.136)$$
Indeed, we lose nothing when assuming that $P=\mathrm{Diag}\{\lambda_1,\dots,\lambda_n\}$ with $\lambda_i\ge0$, $\sum_i\lambda_i\le1$. In this case
$$\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\big\{\exp\{\tfrac13\eta^TP\eta\}\big\}=f(\lambda):=\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\Big\{\exp\Big\{\tfrac13\sum_i\lambda_i\eta_i^2\Big\}\Big\}.$$
The function $f$ is convex, so that its maximum on the simplex $\{\lambda\ge0:\ \sum_i\lambda_i\le1\}$ is achieved at a vertex, that is,
$$f(\lambda)\le\mathbf{E}_{\eta\sim\mathcal{N}(0,1)}\big\{\exp\{\tfrac13\eta^2\}\big\}=\sqrt3;$$
(4.136) is proved. Note that (4.136) implies that
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_n)}\big\{\eta:\ \eta^TP\eta>s\big\}<\sqrt3\exp\{-s/3\},\ s\ge0.\qquad(4.137)$$

Now let $Q$ and $t$ satisfy the Lemma's premise. Setting $\xi=Q^{1/2}\eta$, $\eta\sim\mathcal{N}(0,I_n)$, for $k\le K$ such that $t_k>0$ we have
$$\xi^TR_k\xi=\rho t_k\,\eta^TP_k\eta,\quad P_k:=[\rho t_k]^{-1}Q^{1/2}R_kQ^{1/2}\succeq0\ \ \&\ \ \mathrm{Tr}(P_k)=[\rho t_k]^{-1}\mathrm{Tr}(QR_k)\le1,$$
so that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\big\{\xi:\ \xi^TR_k\xi>s\rho t_k\big\}\le\sqrt3\exp\{-s/3\},\ s\ge0,\qquad(4.138)$$
where the inequality is due to (4.137). Relation (4.138) was established for $k$ with $t_k>0$; it is trivially true when $t_k=0$, since in this case $Q^{1/2}R_kQ^{1/2}=0$ due to $\mathrm{Tr}(QR_k)\le0$ and $R_k,Q\in\mathbf{S}^n_+$. Setting $s=1/\rho$, we get from (4.138) that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\big\{\xi^TR_k\xi>t_k\big\}\le\sqrt3\exp\Big\{-\frac{1}{3\rho}\Big\},\ k\le K,$$
and the conclusion of the lemma follows due to the union bound (and $\sqrt3<2$). ✷
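The elementary bound (4.136) admits a closed-form check: for $P=\mathrm{Diag}\{\lambda\}$ one has $\mathbf{E}\exp\{\frac13\eta^TP\eta\}=\prod_i(1-2\lambda_i/3)^{-1/2}$ by the $\chi^2$ moment generating function $\mathbf{E}\exp\{t\eta^2\}=(1-2t)^{-1/2}$, $t<1/2$. A quick numerical verification over random points of the simplex:

```python
import numpy as np

rng = np.random.default_rng(1)

def mgf_third(lams):
    # E exp{(1/3) * sum_i lam_i * eta_i^2} for eta ~ N(0, I),
    # via the chi-square mgf E exp{t * eta^2} = (1 - 2t)^(-1/2), t < 1/2.
    return np.prod((1.0 - 2.0 * lams / 3.0) ** -0.5)

sqrt3 = np.sqrt(3.0)
for _ in range(1000):
    lam = rng.dirichlet(np.ones(5))      # a random point of {lam >= 0, sum lam = 1}
    assert mgf_third(lam) <= sqrt3 + 1e-12

# The maximum is attained at a vertex of the simplex, where it equals sqrt(3):
assert abs(mgf_third(np.array([1.0, 0, 0, 0, 0])) - sqrt3) < 1e-12
```

This is exactly the convexity-plus-vertex argument of the proof, evaluated numerically instead of analytically.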
4.8.1.3
Anderson's Lemma

Below we use a simple-looking but far from trivial fact.

Anderson's Lemma [4]. Let $f$ be a nonnegative even ($f(x)\equiv f(-x)$) summable function on $\mathbf{R}^N$ such that the level sets $\{x:\ f(x)\ge t\}$ are convex for all $t$, and let $X\subset\mathbf{R}^N$ be a closed convex set symmetric w.r.t. the origin. Then for every $y\in\mathbf{R}^N$,
$$\int_{X+ty}f(z)\,dz$$
is a nonincreasing function of $t\ge0$. In particular, if $\zeta$ is a zero mean $N$-dimensional Gaussian random vector, then for every $y\in\mathbf{R}^N$
$$\mathrm{Prob}\{\zeta\notin y+X\}\ge\mathrm{Prob}\{\zeta\notin X\}.$$
Hence, for every norm $\|\cdot\|$ on $\mathbf{R}^N$ it holds that
$$\mathrm{Prob}\{\zeta:\ \|\zeta-y\|>\rho\}\ge\mathrm{Prob}\{\zeta:\ \|\zeta\|>\rho\}\quad\forall(y\in\mathbf{R}^N,\ \rho\ge0).$$
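Anderson's Lemma is easy to visualize by simulation: shifting a centered Gaussian away from the origin can only increase the probability of escaping a symmetric convex set. A minimal Monte Carlo sketch (the dimension, radius, and shifts are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_samples, rho = 2, 200_000, 1.0
zeta = rng.standard_normal((n_samples, N))   # zero mean Gaussian, covariance I_2

# Prob{||zeta - y|| > rho} should be >= Prob{||zeta|| > rho} for every shift y.
p_centered = np.mean(np.linalg.norm(zeta, axis=1) > rho)
for y in ([0.5, 0.0], [1.0, 1.0], [0.0, 2.0]):
    p_shifted = np.mean(np.linalg.norm(zeta - np.array(y), axis=1) > rho)
    assert p_shifted >= p_centered

print(p_centered)   # close to exp(-1/2) = Prob{chi^2_2 > 1}
```

Using the same sample for the centered and shifted probabilities pairs the comparison and keeps the Monte Carlo noise from obscuring the monotonicity.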
4.8.2
Proof of Proposition 4.6
1°. We need the following:

Lemma 4.48. Let $S$ be a positive semidefinite $N\times N$ matrix with trace $\le1$, and let $\xi$ be an $N$-dimensional Rademacher random vector (i.e., the entries in $\xi$ are independent and take values $\pm1$ with probabilities $1/2$). Then
$$\mathbf{E}\big\{\exp\{\tfrac13\xi^TS\xi\}\big\}\le\sqrt3,$$
implying that
$$\mathrm{Prob}\{\xi^TS\xi>s\}\le\sqrt3\exp\{-s/3\},\ s\ge0.$$

Proof. Let $S=\sum_i\sigma_ih^i[h^i]^T$ be the eigenvalue decomposition of $S$, so that $[h^i]^Th^i=1$, $\sigma_i\ge0$, and $\sum_i\sigma_i\le1$. The function
$$F(\sigma_1,\dots,\sigma_N)=\mathbf{E}\Big\{e^{\frac13\sum_i\sigma_i\,\xi^Th^i[h^i]^T\xi}\Big\}$$
is convex on the simplex $\{\sigma\ge0,\ \sum_i\sigma_i\le1\}$ and thus attains its maximum over the simplex at a vertex, implying that for some $f=h^i$, $f^Tf=1$, it holds
$$\mathbf{E}\big\{e^{\frac13\xi^TS\xi}\big\}\le\mathbf{E}\big\{e^{\frac13(f^T\xi)^2}\big\}.$$
Let $\zeta\sim\mathcal{N}(0,1)$ be independent of $\xi$. We have
$$\begin{aligned}
\mathbf{E}_\xi\big\{\exp\{\tfrac13(f^T\xi)^2\}\big\}
&=\mathbf{E}_\xi\Big\{\mathbf{E}_\zeta\big\{\exp\{[\sqrt{2/3}\,f^T\xi]\zeta\}\big\}\Big\}
=\mathbf{E}_\zeta\Big\{\mathbf{E}_\xi\big\{\exp\{[\sqrt{2/3}\,f^T\xi]\zeta\}\big\}\Big\}\\
&=\mathbf{E}_\zeta\Big\{\prod_{j=1}^N\mathbf{E}_{\xi_j}\big\{\exp\{\sqrt{2/3}\,\zeta f_j\xi_j\}\big\}\Big\}
=\mathbf{E}_\zeta\Big\{\prod_{j=1}^N\cosh\big(\sqrt{2/3}\,\zeta f_j\big)\Big\}\\
&\le\mathbf{E}_\zeta\Big\{\prod_{j=1}^N\exp\{\tfrac13\zeta^2f_j^2\}\Big\}
=\mathbf{E}_\zeta\big\{\exp\{\tfrac13\zeta^2\}\big\}=\sqrt3.\qquad✷
\end{aligned}$$

2°. The right inequality in (4.19) has been justified in Section 4.2.3. To prove the left inequality in (4.19), let $\mathbf{T}$ be the closed conic hull of $\mathcal{T}$ (see Section 4.1.1), and let us consider the conic problem
$$\mathrm{Opt}_\#=\max_{Q,t}\Big\{\mathrm{Tr}(P^TCPQ):\ Q\succeq0,\ \mathrm{Tr}(QR_k)\le t_k\ \forall k\le K,\ [t;1]\in\mathbf{T}\Big\}.\qquad(4.139)$$
We claim that
$$\mathrm{Opt}=\mathrm{Opt}_\#.\qquad(4.140)$$
Indeed, (4.139) clearly is a strictly feasible and bounded conic problem, so that its optimal value is equal to the optimal value of its conic dual (Conic Duality Theorem). Taking into account that the cone $\mathbf{T}_*$ dual to $\mathbf{T}$ is $\{[g;s]:\ s\ge\phi_{\mathcal{T}}(-g)\}$ (see Section 4.1.1), we therefore get
$$\begin{aligned}
\mathrm{Opt}_\#&=\min_{\lambda,[g;s],L}\Big\{s:\ \mathrm{Tr}\big(\big[\textstyle\sum_k\lambda_kR_k-L\big]Q\big)-\textstyle\sum_k[\lambda_k+g_k]t_k=\mathrm{Tr}(P^TCPQ)\ \forall(Q,t),\\
&\hspace{5em}\lambda\ge0,\ L\succeq0,\ s\ge\phi_{\mathcal{T}}(-g)\Big\}\\
&=\min_{\lambda,[g;s],L}\Big\{s:\ \textstyle\sum_k\lambda_kR_k-L=P^TCP,\ g=-\lambda,\ \lambda\ge0,\ L\succeq0,\ s\ge\phi_{\mathcal{T}}(-g)\Big\}\\
&=\min_{\lambda}\Big\{\phi_{\mathcal{T}}(\lambda):\ \textstyle\sum_k\lambda_kR_k\succeq P^TCP,\ \lambda\ge0\Big\}=\mathrm{Opt},
\end{aligned}$$
as claimed.
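Before Lemma 4.48 is put to work in the rounding argument that follows, both of its bounds can be sanity-checked by simulation (the matrix $S$ below is a random trace-normalized PSD matrix, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 8, 200_000

# A random PSD matrix S normalized to have trace 1.
A = rng.standard_normal((N, N))
S = A @ A.T
S /= np.trace(S)

xi = rng.choice([-1.0, 1.0], size=(trials, N))   # Rademacher vectors
quad = np.einsum('ti,ij,tj->t', xi, S, xi)       # xi^T S xi, per trial

# Moment bound: E exp{xi^T S xi / 3} <= sqrt(3).
assert np.mean(np.exp(quad / 3.0)) <= np.sqrt(3.0)

# Tail bound: Prob{xi^T S xi > s} <= sqrt(3) * exp(-s/3).
for s in (1.0, 2.0, 4.0):
    assert np.mean(quad > s) <= np.sqrt(3.0) * np.exp(-s / 3.0)
```

For a generic $S$ (eigenvalues spread out rather than concentrated in one direction) the empirical moment is far below $\sqrt3$; the bound is tight only near rank-one $S$, mirroring the vertex argument in the proof.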
3°. With Lemma 4.48 and (4.140) at our disposal, we can now complete the proof of Proposition 4.6 by adjusting the technique from [191]. Specifically, problem (4.139) clearly is solvable; let $Q_*,t^*$ be an optimal solution to the problem. Next, let us set $R_*=Q_*^{1/2}$, $\bar C=R_*P^TCPR_*$, let $\bar C=UDU^T$ be the eigenvalue decomposition of $\bar C$, and let $\bar R_k=U^TR_*R_kR_*U$. Observe that
$$\begin{aligned}
\mathrm{Tr}(\bar C)&=\mathrm{Tr}(R_*P^TCPR_*)=\mathrm{Tr}(Q_*P^TCP)=\mathrm{Opt}_\#=\mathrm{Opt},\\
\mathrm{Tr}(\bar R_k)&=\mathrm{Tr}(R_*R_kR_*)=\mathrm{Tr}(Q_*R_k)\le t^*_k.
\end{aligned}$$
Now let $\xi$ be a Rademacher random vector. For $k$ with $t^*_k>0$, applying Lemma 4.48 to the matrices $\bar R_k/t^*_k$, we get for $s>0$
$$\mathrm{Prob}\{\xi^T\bar R_k\xi>st^*_k\}\le\sqrt3\exp\{-s/3\};\qquad(4.141)$$
if $k$ is such that $t^*_k=0$, we have $\mathrm{Tr}(\bar R_k)=0$, that is, $\bar R_k=0$ (since $\bar R_k\succeq0$), and (4.141) holds true as well. Now let
$$s_*=3\ln(\sqrt3K),$$
so that $\sqrt3\exp\{-s/3\}<1/K$ when $s>s_*$. The latter relation combines with (4.141) to imply that for every $s>s_*$ there exists a realization $\bar\xi$ of $\xi$ such that
$$\bar\xi^T\bar R_k\bar\xi\le st^*_k\ \forall k.$$
Let us set $\bar y=\frac{1}{\sqrt s}R_*U\bar\xi$. Then
$$\bar y^TR_k\bar y=s^{-1}\bar\xi^TU^TR_*R_kR_*U\bar\xi=s^{-1}\bar\xi^T\bar R_k\bar\xi\le t^*_k\ \forall k,$$
implying that $\bar x:=P\bar y\in\mathcal{X}$, and
$$\bar x^TC\bar x=s^{-1}\bar\xi^TU^T\underbrace{R_*P^TCPR_*}_{\bar C}U\bar\xi=s^{-1}\bar\xi^TD\bar\xi=s^{-1}\mathrm{Tr}(D)=s^{-1}\mathrm{Tr}(\bar C)=s^{-1}\mathrm{Opt}.$$
Thus, $\mathrm{Opt}_*:=\max_{x\in\mathcal{X}}x^TCx\ge s^{-1}\mathrm{Opt}$ whenever $s>s_*$, which implies the left inequality in (4.19). ✷

4.8.3
Proof of Proposition 4.8
The proof follows the lines of the proof of Proposition 4.6. First, passing from $C$ to the matrix $\bar C=P^TCP$, the situation clearly reduces to the one where $P=I$. To save notation, in the rest of the proof we assume that $P$ is the identity. Second, from Lemma 4.44 and the fact that the level sets of $\phi_{\mathcal{T}}(\cdot)$ on the nonnegative orthant are bounded (since $\mathcal{T}$ contains a positive vector), it immediately follows that problem (4.29) is feasible with bounded level sets of the objective, so that the problem is solvable. The left inequality in (4.30) was proved in Section 4.3.2. Thus, all we need is to prove the right inequality in (4.30).

1°. Let $\mathbf{T}$ be the closed conic hull of $\mathcal{T}$ (see Section 4.1.1). Consider the conic problem
$$\mathrm{Opt}_\#=\max_{Q,t}\Big\{\mathrm{Tr}(CQ):\ Q\succeq0,\ \mathcal{R}_k[Q]\preceq t_kI_{d_k}\ \forall k\le K,\ [t;1]\in\mathbf{T}\Big\}.\qquad(4.142)$$
This problem clearly is strictly feasible; by Lemma 4.44, the feasible set of the problem is bounded, so the problem is solvable. We claim that $\mathrm{Opt}_\#=\mathrm{Opt}$. Indeed, (4.142) is a strictly feasible and bounded conic problem, so that its optimal value is equal to the one in its conic dual, that is,
$$\begin{aligned}
\mathrm{Opt}_\#&=\min_{\Lambda=\{\Lambda_k\}_{k\le K},[g;s],L}\Big\{s:\ \mathrm{Tr}\big(\big[\textstyle\sum_k\mathcal{R}_k^*[\Lambda_k]-L\big]Q\big)-\textstyle\sum_k[\mathrm{Tr}(\Lambda_k)+g_k]t_k=\mathrm{Tr}(CQ)\ \forall(Q,t),\\
&\hspace{5em}\Lambda_k\succeq0\ \forall k,\ L\succeq0,\ s\ge\phi_{\mathcal{T}}(-g)\Big\}\\
&=\min_{\Lambda,[g;s],L}\Big\{s:\ \textstyle\sum_k\mathcal{R}_k^*[\Lambda_k]-L=C,\ g=-\lambda[\Lambda],\ \Lambda_k\succeq0\ \forall k,\ L\succeq0,\ s\ge\phi_{\mathcal{T}}(-g)\Big\}\\
&=\min_{\Lambda}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \textstyle\sum_k\mathcal{R}_k^*[\Lambda_k]\succeq C,\ \Lambda_k\succeq0\ \forall k\Big\}=\mathrm{Opt},
\end{aligned}$$
as claimed.
2°. Problem (4.142), as we already know, is solvable; let $Q_*,t^*$ be an optimal solution to the problem. Next, let us set $R_*=Q_*^{1/2}$, $\widehat C=R_*CR_*$, and let $\widehat C=UDU^T$ be the eigenvalue decomposition of $\widehat C$, so that the matrix $D=U^TR_*CR_*U$ is diagonal, and the trace of this matrix is
$$\mathrm{Tr}(R_*CR_*)=\mathrm{Tr}(CQ_*)=\mathrm{Opt}_\#=\mathrm{Opt}.$$
Now let $V=R_*U$, and let $\xi=V\eta$, where $\eta$ is an $n$-dimensional Rademacher random vector (independent entries taking values $\pm1$ with probabilities $1/2$). We have
$$\xi^TC\xi=\eta^T[V^TCV]\eta=\eta^T[U^TR_*CR_*U]\eta=\eta^TD\eta\equiv\mathrm{Tr}(D)=\mathrm{Opt}\qquad(4.143)$$
(recall that $D$ is diagonal) and
$$\mathbf{E}_\xi\{\xi\xi^T\}=\mathbf{E}_\eta\{V\eta\eta^TV^T\}=VV^T=R_*UU^TR_*=R_*^2=Q_*.$$
From the latter relation,
$$\mathbf{E}_\xi\big\{R_k^2[\xi]\big\}=\mathbf{E}_\xi\big\{\mathcal{R}_k[\xi\xi^T]\big\}=\mathcal{R}_k\big[\mathbf{E}_\xi\{\xi\xi^T\}\big]=\mathcal{R}_k[Q_*]\preceq t^*_kI_{d_k},\ 1\le k\le K.\qquad(4.144)$$
On the other hand, with properly selected symmetric matrices $\bar R^{ki}$ we have
$$R_k[Vy]=\sum_i y_i\bar R^{ki}$$
identically in $y\in\mathbf{R}^n$, whence
$$\mathbf{E}_\xi\big\{R_k^2[\xi]\big\}=\mathbf{E}_\eta\big\{R_k^2[V\eta]\big\}=\mathbf{E}_\eta\Big\{\Big[\sum_i\eta_i\bar R^{ki}\Big]^2\Big\}=\sum_{i,j}\mathbf{E}_\eta\{\eta_i\eta_j\}\bar R^{ki}\bar R^{kj}=\sum_i[\bar R^{ki}]^2.$$
This combines with (4.144) to imply that
$$\sum_i[\bar R^{ki}]^2\preceq t^*_kI_{d_k},\ 1\le k\le K.\qquad(4.145)$$

3°. Let us fix $k\le K$. Assuming $t^*_k>0$ and applying Theorem 4.45, we derive from (4.145) that
$$\mathrm{Prob}\{\eta:\ \|\bar R_k[\eta]\|^2>t^*_k/\rho\}<2d_ke^{-\frac{1}{2\rho}},$$
and recalling the relation between $\xi$ and $\eta$, we arrive at
$$\mathrm{Prob}\{\xi:\ \|R_k[\xi]\|^2>t^*_k/\rho\}<2d_ke^{-\frac{1}{2\rho}}\ \ \forall\rho\in(0,1].\qquad(4.146)$$
Note that when $t^*_k=0$, (4.145) implies $\bar R^{ki}=0$ for all $i$, so that $R_k[\xi]\equiv\bar R_k[\eta]\equiv0$, and (4.146) holds for those $k$ as well.

Now let us set $\rho=\frac{1}{2\max[\ln(2D),1]}$. For this $\rho$, the sum over $k\le K$ of the right-hand sides in inequalities (4.146) is $\le1$, implying that there exists a realization $\bar\xi$ of $\xi$ such that
$$\|R_k[\bar\xi]\|^2\le t^*_k/\rho\ \ \forall k,$$
or, equivalently,
$$\bar x:=\rho^{1/2}\bar\xi\in\mathcal{X}$$
(recall that $P=I$), implying that
$$\mathrm{Opt}_*:=\max_{x\in\mathcal{X}}x^TCx\ge\bar x^TC\bar x=\rho\,\bar\xi^TC\bar\xi=\rho\,\mathrm{Opt}$$
(the concluding equality is due to (4.143)), and we arrive at the right inequality in (4.30). ✷

4.8.4
Proof of Lemma 4.17
1°. Let us verify (4.57). When $Q\succ0$, passing from the variables $(\Theta,\Upsilon)$ in problem (4.56) to the variables $(G=Q^{1/2}\Theta Q^{1/2},\Upsilon)$, the problem becomes exactly the optimization problem in (4.57), implying that $\mathrm{Opt}[Q]=\overline{\mathrm{Opt}}[Q]$ when $Q\succ0$ (here and below, $\overline{\mathrm{Opt}}[Q]$ stands for the optimal value of the problem in (4.57)). As is easily seen, both sides in this equality are continuous in $Q\succeq0$, and (4.57) follows.

2°. Let us prove (4.59). Setting $\zeta=Q^{1/2}\eta$ with $\eta\sim\mathcal{N}(0,I_N)$ and $Z=Q^{1/2}Y$, to justify (4.59) we have to show that when $\varkappa\ge1$ one has
$$\bar\delta=\frac{\mathrm{Opt}[Q]}{4\varkappa}\ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\|\ge\bar\delta\}\ge\beta_\varkappa:=1-\frac{e^{3/8}}{2}-2Fe^{-\varkappa^2/2},\qquad(4.147)$$
where (cf. (4.57))
$$[\mathrm{Opt}[Q]=]\ \overline{\mathrm{Opt}}[Q]:=\min_{\Theta,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\Big\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\mathrm{Tr}(\Theta):\ \Upsilon_\ell\succeq0,\ \begin{bmatrix}\Theta&\frac12ZM\\ \frac12M^TZ^T&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\Big\}.\qquad(4.148)$$
Justification of (4.147) is as follows.

2.1°. Let us represent $\overline{\mathrm{Opt}}[Q]$ as the optimal value of a conic problem. Setting
$$\mathbf{K}=\mathrm{cl}\{[r;s]:\ s>0,\ r/s\in\mathcal{R}\},$$
we ensure that
$$\mathcal{R}=\{r:\ [r;1]\in\mathbf{K}\},\qquad\mathbf{K}_*=\{[g;s]:\ s\ge\phi_{\mathcal{R}}(-g)\},$$
where $\mathbf{K}_*$ is the cone dual to $\mathbf{K}$. Consequently, (4.148) reads
$$\overline{\mathrm{Opt}}[Q]=\min_{\Theta,\Upsilon,\theta}\left\{\theta+\mathrm{Tr}(\Theta):\ \begin{array}{ll}(a)&\Upsilon_\ell\succeq0,\ 1\le\ell\le L,\\[2pt](b)&\begin{bmatrix}\Theta&\frac12ZM\\ \frac12M^TZ^T&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0,\\[2pt](c)&[-\lambda[\Upsilon];\theta]\in\mathbf{K}_*\end{array}\right\}.\qquad(P)$$
2.2°. Now let us prove that there exist a matrix $W\in\mathbf{S}^q_+$ and $r\in\mathcal{R}$ such that
$$\mathcal{S}_\ell[W]\preceq r_\ell I_{f_\ell},\ \ell\le L,\qquad(4.149)$$
and
$$\overline{\mathrm{Opt}}[Q]\le\sum_i\sigma_i(ZMW^{1/2}),\qquad(4.150)$$
where $\sigma_1(\cdot)\ge\sigma_2(\cdot)\ge\dots$ are singular values.

To get the announced result, let us pass from problem $(P)$ to its conic dual. Applying Lemma 4.44 we conclude that $(P)$ is strictly feasible; in addition, $(P)$ clearly is bounded, so that the problem $(D)$ dual to $(P)$ is solvable with optimal value $\overline{\mathrm{Opt}}[Q]$. Let us build $(D)$. Denoting by
$$\Lambda_\ell\succeq0,\ \ell\le L,\qquad\begin{bmatrix}G&-R\\-R^T&W\end{bmatrix}\succeq0,\qquad[r;\tau]\in\mathbf{K}$$
the Lagrange multipliers for the respective constraints in $(P)$, and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the following aggregated constraint:
$$\mathrm{Tr}(\Theta G)+\mathrm{Tr}\Big(W\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\Big)+\sum_\ell\mathrm{Tr}(\Lambda_\ell\Upsilon_\ell)-\sum_\ell r_\ell\mathrm{Tr}(\Upsilon_\ell)+\theta\tau\ge\mathrm{Tr}(ZMR^T).$$
To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial conic constraints like $\Lambda_\ell\succeq0$, $1\le\ell\le L$, the restriction that the left-hand side in the aggregated constraint, identically in $\Theta$, $\Upsilon_\ell$, and $\theta$, is equal to the objective of $(P)$, that is,
$$G=I,\qquad\mathcal{S}_\ell[W]+\Lambda_\ell-r_\ell I_{f_\ell}=0,\ 1\le\ell\le L,\qquad\tau=1,$$
and maximize, under the resulting restrictions, the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at
$$\overline{\mathrm{Opt}}[Q]=\max_{W,R,r}\Big\{\mathrm{Tr}(ZMR^T):\ W\succeq R^TR,\ r\in\mathcal{R},\ \mathcal{S}_\ell[W]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\Big\}$$
(note that $r\in\mathcal{R}$ is equivalent to $[r;1]\in\mathbf{K}$, and $W\succeq R^TR$ is the same as $\begin{bmatrix}I&-R\\-R^T&W\end{bmatrix}\succeq0$). Now, to say that $R^TR\preceq W$ is exactly the same as to say that $R=SW^{1/2}$ with the spectral norm $\|S\|_{2,2}$ of $S$ not exceeding 1, so that
$$\overline{\mathrm{Opt}}[Q]=\max_{W,S,r}\Big\{\underbrace{\mathrm{Tr}(ZM[SW^{1/2}]^T)}_{=\mathrm{Tr}([ZMW^{1/2}]S^T)}:\ W\succeq0,\ \|S\|_{2,2}\le1,\ r\in\mathcal{R},\ \mathcal{S}_\ell[W]\preceq r_\ell I_{f_\ell},\ \ell\le L\Big\}.$$
We can immediately eliminate the $S$-variable, using the well-known fact that for a $p\times q$ matrix $J$ it holds
$$\max_{S\in\mathbf{R}^{p\times q},\,\|S\|_{2,2}\le1}\mathrm{Tr}(JS^T)=\|J\|_{\mathrm{Sh},1},$$
where $\|J\|_{\mathrm{Sh},1}$ is the nuclear norm (the sum of singular values) of $J$. We arrive at
$$\overline{\mathrm{Opt}}[Q]=\max_{W,r}\Big\{\|ZMW^{1/2}\|_{\mathrm{Sh},1}:\ r\in\mathcal{R},\ W\succeq0,\ \mathcal{S}_\ell[W]\preceq r_\ell I_{f_\ell},\ \ell\le L\Big\}.$$
The resulting problem clearly is solvable, and its optimal solution $W$ ensures the target relations (4.149) and (4.150).

2.3°. Given $W$ satisfying (4.149) and (4.150), let $UJV=W^{1/2}M^TZ^T$ be the singular value decomposition of $W^{1/2}M^TZ^T$, so that $U$ and $V$ are, respectively, $q\times q$ and $N\times N$ orthogonal matrices, and $J$ is a $q\times N$ matrix with diagonal $\sigma=[\sigma_1;\dots;\sigma_p]$, $p=\min[q,N]$, and zero off-diagonal entries; the diagonal entries $\sigma_i$, $1\le i\le p$, are the singular values of $W^{1/2}M^TZ^T$ or, which is the same, of $ZMW^{1/2}$. Therefore, by (4.150) we have
$$\sum_i\sigma_i\ge\overline{\mathrm{Opt}}[Q].\qquad(4.151)$$
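The identity $\max_{\|S\|_{2,2}\le1}\mathrm{Tr}(JS^T)=\|J\|_{\mathrm{Sh},1}$ used above to eliminate the $S$-variable is attained at $S=UV^T$, with $U,V$ taken from the SVD $J=U\Sigma V^T$. A quick numerical check (random $J$, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 3, 5
J = rng.standard_normal((p, q))

# Nuclear norm of J = sum of its singular values.
U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
nuclear = sigma.sum()

# The maximizer S = U V^T has spectral norm 1 and attains Tr(J S^T) = ||J||_{Sh,1}.
S = U @ Vt
assert abs(np.linalg.norm(S, ord=2) - 1.0) < 1e-10
assert abs(np.trace(J @ S.T) - nuclear) < 1e-10

# No other feasible S does better: spot-check random contractions.
for _ in range(1000):
    A = rng.standard_normal((p, q))
    A /= max(np.linalg.norm(A, ord=2), 1.0)   # force spectral norm <= 1
    assert np.trace(J @ A.T) <= nuclear + 1e-10
```

This is the standard duality between the spectral and nuclear norms, here verified rather than proved.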
Now consider the following construction. Let $\eta\sim\mathcal{N}(0,I_N)$; we denote by $\upsilon$ the vector comprised of the first $p$ entries of $V\eta$; note that $\upsilon\sim\mathcal{N}(0,I_p)$, since $V$ is orthogonal. We then augment, if necessary, $\upsilon$ by $q-p$ $\mathcal{N}(0,1)$ random variables independent of each other and of $\eta$ to obtain a $q$-dimensional random vector $\upsilon'\sim\mathcal{N}(0,I_q)$, and set $\chi=U\upsilon'$. Because $U$ is orthogonal, we also have $\chi\sim\mathcal{N}(0,I_q)$. Observe that
$$\chi^TW^{1/2}M^TZ^T\eta=\chi^TUJV\eta=[\upsilon']^TJ\upsilon=\sum_{i=1}^p\sigma_i\upsilon_i^2.\qquad(4.152)$$
To continue we need two simple observations.

(i) One has
$$\alpha:=\mathrm{Prob}\Big\{\sum_{i=1}^p\sigma_i\upsilon_i^2<\frac14\sum_{i=1}^p\sigma_i\Big\}\le\frac{e^{3/8}}{2}.\qquad(4.153)$$
Indeed, we may assume $\sigma:=\sum_i\sigma_i>0$, and let us apply the Cramer bounding scheme. Namely, given $\gamma>0$, consider the random variable
$$\omega=\exp\Big\{\frac14\gamma\sum_i\sigma_i-\gamma\sum_i\sigma_i\upsilon_i^2\Big\}.$$
Note that $\omega>0$ a.s., and $\omega>1$ when $\sum_{i=1}^p\sigma_i\upsilon_i^2<\frac14\sum_{i=1}^p\sigma_i$, so that $\alpha\le\mathbf{E}\{\omega\}$, or, equivalently, thanks to $\upsilon\sim\mathcal{N}(0,I_p)$,
$$\ln(\alpha)\le\ln(\mathbf{E}\{\omega\})=\frac14\gamma\sum_i\sigma_i+\sum_i\ln\big(\mathbf{E}\{\exp\{-\gamma\sigma_i\upsilon_i^2\}\}\big)\le\frac14\gamma\sigma-\frac12\sum_i\ln(1+2\gamma\sigma_i).$$
The function $-\sum_i\ln(1+2\gamma\sigma_i)$ is convex in $[\sigma_1;\dots;\sigma_p]\ge0$; therefore, its maximum over the simplex $\{\sigma_i\ge0,\ i\le p,\ \sum_i\sigma_i=\sigma\}$ is attained at a vertex, and we get
$$\ln(\alpha)\le\frac14\gamma\sigma-\frac12\ln(1+2\gamma\sigma).$$
Minimizing the right-hand side in $\gamma>0$, we arrive at (4.153).

(ii) Whenever $\varkappa\ge1$, one has
$$\mathrm{Prob}\{\|MW^{1/2}\chi\|_*>\varkappa\}\le2F\exp\{-\varkappa^2/2\},\qquad(4.154)$$
with $F$ given by (4.55).

Indeed, setting $\rho=1/\varkappa^2\le1$ and $\omega=\sqrt\rho\,W^{1/2}\chi$, we get $\omega\sim\mathcal{N}(0,\rho W)$. Let us apply Lemma 4.46 to $Q=\rho W$, with $\mathcal{R}$ in the role of $\mathcal{T}$, $L$ in the role of $K$, and $S_\ell[\cdot]$ in the role of $R_k[\cdot]$. Denoting
$$\mathcal{Y}:=\{y:\ \exists r\in\mathcal{R}:\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L\},$$
we have $\mathcal{S}_\ell[Q]=\rho\,\mathcal{S}_\ell[W]\preceq\rho r_\ell I_{f_\ell}$, $\ell\le L$, with $r\in\mathcal{R}$ (see (4.149)), so we are under the premise of Lemma 4.46 (with $\mathcal{Y}$ in the role of $\mathcal{X}$ and thus with $F$ in the role of $D$). Applying the lemma, we conclude that
$$\mathrm{Prob}\big\{\chi:\ \varkappa^{-1}W^{1/2}\chi\notin\mathcal{Y}\big\}\le2F\exp\{-1/(2\rho)\}=2F\exp\{-\varkappa^2/2\}.$$
Recalling that $\mathcal{B}_*=M\mathcal{Y}$, we see that $\mathrm{Prob}\{\chi:\ \varkappa^{-1}MW^{1/2}\chi\notin\mathcal{B}_*\}$ is indeed upper-bounded by the right-hand side of (4.154), and (4.154) follows.

2.4°. Now, for $\varkappa\ge1$, let
$$E_\varkappa=\Big\{(\chi,\eta):\ \|MW^{1/2}\chi\|_*\le\varkappa,\ \sum_i\sigma_i\upsilon_i^2\ge\frac14\sum_i\sigma_i\Big\},$$
and let $E_\varkappa^+=\{\eta:\ \exists\chi:\ (\chi,\eta)\in E_\varkappa\}$. For $\eta\in E_\varkappa^+$ there exists $\chi$ such that $(\chi,\eta)\in E_\varkappa$, leading to
$$\varkappa\|Z^T\eta\|\ge\|MW^{1/2}\chi\|_*\,\|Z^T\eta\|\ge\chi^TW^{1/2}M^TZ^T\eta=\sum_i\sigma_i\upsilon_i^2\ge\frac14\sum_i\sigma_i\ge\frac14\overline{\mathrm{Opt}}[Q]$$
(we have used (4.152) and (4.151)). Thus,
$$\eta\in E_\varkappa^+\ \Rightarrow\ \|Z^T\eta\|\ge\frac{\overline{\mathrm{Opt}}[Q]}{4\varkappa}.$$
On the other hand, due to (4.153) and (4.154), for our random $(\chi,\eta)$ it holds
$$\mathrm{Prob}\{E_\varkappa\}\ge1-\frac{e^{3/8}}{2}-2Fe^{-\varkappa^2/2}=\beta_\varkappa,$$
and the marginal distribution of $\eta$ is $\mathcal{N}(0,I_N)$, implying that $\mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_N)}\{\eta\in E_\varkappa^+\}\ge\beta_\varkappa$. (4.147) is proved.

3°. As was explained in the beginning of item 2°, (4.147) is exactly the same as (4.59). The latter relation clearly implies (4.60) which, in turn, implies the right inequality in (4.58). ✷

4.8.5
Proofs of Propositions 4.5, 4.16 and 4.19
Below, we focus on the proof of Proposition 4.16; Propositions 4.5 and 4.19 will be derived from it in Sections 4.8.5.2 and 4.8.6.2, respectively.

4.8.5.1
Proof of Proposition 4.16

In what follows, we use the assumptions and the notation of Proposition 4.16.

1°. Let
$$\Phi(H,\Lambda,\Upsilon,\Upsilon',\Theta;Q)=\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\mathrm{Tr}(Q\Theta):\ \mathcal{M}\times\Pi\to\mathbf{R},$$
where
$$\mathcal{M}=\left\{(H,\Lambda,\Upsilon,\Upsilon',\Theta):\ \begin{array}{l}\Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\[3pt]\begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0,\\[3pt]\begin{bmatrix}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\end{array}\right\}.$$
Looking at (4.42), we see immediately that the optimal value Opt in (4.42) is nothing but
$$\mathrm{Opt}=\min_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\Big[\overline\Phi(H,\Lambda,\Upsilon,\Upsilon',\Theta):=\max_{Q\in\Pi}\Phi(H,\Lambda,\Upsilon,\Upsilon',\Theta;Q)\Big].\qquad(4.155)$$
Note that the sets $\mathcal{M}$ and $\Pi$ are closed and convex, $\Pi$ is compact, and $\Phi$ is a continuous convex-concave function on $\mathcal{M}\times\Pi$. In view of these observations, the fact that $\Pi\subset\operatorname{int}\mathbf{S}^m_+$ combines with the Sion-Kakutani Theorem to imply that $\Phi$ possesses a saddle point $(H_*,\Lambda_*,\Upsilon_*,\Upsilon'_*,\Theta_*;Q_*)$ (min in $(H,\Lambda,\Upsilon,\Upsilon',\Theta)$, max in $Q$) on $\mathcal{M}\times\Pi$, whence Opt is the saddle point value of $\Phi$ by (4.155). We conclude that for properly selected $Q_*\in\Pi$ it holds
$$\begin{aligned}
\mathrm{Opt}&=\min_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\Phi(H,\Lambda,\Upsilon,\Upsilon',\Theta;Q_*)\\
&=\min_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\mathrm{Tr}(Q_*\Theta):\ (H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}\Big\}\\
&=\min_{H,\Lambda,\Upsilon,\Upsilon',G}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\mathrm{Tr}(G):\\
&\hspace{4em}\Lambda=\{\Lambda_k\succeq0\},\ \Upsilon=\{\Upsilon_\ell\succeq0\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0\},\\
&\hspace{4em}\begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0,\ \begin{bmatrix}G&\frac12Q_*^{1/2}HM\\ \frac12M^TH^TQ_*^{1/2}&\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\}\\
&=\min_{H,\Lambda,\Upsilon}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Psi(H):\ \Lambda=\{\Lambda_k\succeq0\},\ \Upsilon=\{\Upsilon_\ell\succeq0\},\\
&\hspace{4em}\begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\Big\},
\end{aligned}$$
where
$$\Psi(H):=\min_{G,\Upsilon'}\Big\{\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\mathrm{Tr}(G):\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\ \begin{bmatrix}G&\frac12Q_*^{1/2}HM\\ \frac12M^TH^TQ_*^{1/2}&\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\},\qquad(4.156)$$
Opt is given by (4.42), and the equalities are due to (4.56) and (4.57).
From now on we assume that the noise $\xi$ in observation (4.31) is $\xi\sim\mathcal{N}(0,Q_*)$. We also assume that $B\ne0$, since otherwise the conclusion of Proposition 4.16 is evident.

2°. $\epsilon$-risk. In Proposition 4.16, we are speaking about the $\|\cdot\|$-risk of an estimate, that is, the maximal, over signals $x\in\mathcal{X}$, expected norm $\|\cdot\|$ of the error of recovering $Bx$; what we need to prove is that the minimax optimal risk $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]$ as given by (4.53) can be lower-bounded by a quantity "of order of" Opt. To this end, of course, it suffices to build such a lower bound for the quantity
$$\mathrm{RiskOpt}_{\|\cdot\|}:=\inf_{\widehat x(\cdot)}\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,Q_*)}\big\{\|Bx-\widehat x(Ax+\xi)\|\big\},$$
since this quantity is a lower bound on $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}$. Technically, it is more convenient to work with the $\epsilon$-risk defined in terms of "$\|\cdot\|$-confidence intervals" rather than in terms of the expected norm of the error. Specifically, in the sequel we will heavily use the minimax $\epsilon$-risk defined as
$$\mathrm{RiskOpt}_\epsilon=\inf_{\widehat x,\rho}\Big\{\rho:\ \mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q_*)}\{\|Bx-\widehat x(Ax+\xi)\|>\rho\}\le\epsilon\ \forall x\in\mathcal{X}\Big\},$$
where $\widehat x$ in the infimum runs through the set of all Borel estimates. When $\epsilon\in(0,1)$ is once and forever fixed (in the sequel, we use $\epsilon=\frac18$), we can use the $\epsilon$-risk to lower-bound $\mathrm{RiskOpt}_{\|\cdot\|}$, since by evident reasons
$$\mathrm{RiskOpt}_{\|\cdot\|}\ge\epsilon\cdot\mathrm{RiskOpt}_\epsilon.\qquad(4.157)$$
Consequently, all we need in order to prove Proposition 4.16 is to lower-bound $\mathrm{RiskOpt}_{\frac18}$ by a "not too small" multiple of Opt, and this is our current objective.

3°. Let $W$ be a positive semidefinite $n\times n$ matrix, let $\eta\sim\mathcal{N}(0,W)$ be a random signal, and let $\xi\sim\mathcal{N}(0,Q_*)$ be independent of $\eta$; the vectors $(\eta,\xi)$ induce the random vector $\omega=A\eta+\xi\sim\mathcal{N}(0,AWA^T+Q_*)$. Consider the Bayesian version of the estimation problem where, given $\omega$, we are interested in recovering $B\eta$. Recall that, because $[\omega;B\eta]$ is zero mean Gaussian, the conditional expectation $\mathbf{E}_\omega\{B\eta\}$ of $B\eta$ given $\omega$ is linear in $\omega$: $\mathbf{E}_\omega\{B\eta\}=\bar H^T\omega$ for some $\bar H$ depending on $W$ only.²⁴ Therefore, denoting by $P_\omega$ the conditional probability distribution given $\omega$, for any $\rho>0$ and estimate $\widehat x(\cdot)$ one has
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\widehat x(A\eta+\xi)\|\ge\rho\}=\mathbf{E}_\omega\big\{\mathrm{Prob}_\omega\{\|B\eta-\widehat x(\omega)\|\ge\rho\}\big\}\ge\mathbf{E}_\omega\big\{\mathrm{Prob}_\omega\{\|B\eta-\mathbf{E}_\omega\{B\eta\}\|\ge\rho\}\big\}=\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\bar H^T(A\eta+\xi)\|\ge\rho\},$$
with the inequality given by the Anderson Lemma as applied to the shift of the Gaussian distribution $P_\omega$ by its mean. Applying the Anderson Lemma again, we

[Footnote 24: We have used the following standard fact [172]: let $\zeta=[\omega;\eta]\sim\mathcal{N}(0,S)$, the covariance matrix of the marginal distribution of $\omega$ being nonsingular. Then the conditional distribution of $\eta$ given $\omega$ is Gaussian, with mean linearly depending on $\omega$ and covariance matrix independent of $\omega$.]
get
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\bar H^T(A\eta+\xi)\|\ge\rho\}=\mathbf{E}_\xi\big\{\mathrm{Prob}_\eta\{\|(B-\bar H^TA)\eta-\bar H^T\xi\|\ge\rho\}\big\}\ge\mathrm{Prob}_\eta\{\|(B-\bar H^TA)\eta\|\ge\rho\},$$
and, by "symmetric" reasoning,
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\bar H^T(A\eta+\xi)\|\ge\rho\}\ge\mathrm{Prob}_\xi\{\|\bar H^T\xi\|\ge\rho\}.$$
We conclude that for any $\widehat x(\cdot)$
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\widehat x(\omega)\|\ge\rho\}\ge\max\Big\{\mathrm{Prob}_\eta\{\|(B-\bar H^TA)\eta\|\ge\rho\},\ \mathrm{Prob}_\xi\{\|\bar H^T\xi\|\ge\rho\}\Big\}.\qquad(4.158)$$

4°. Let $H$ be an $m\times\nu$ matrix. Applying Lemma 4.17 to $N=m$, $Y=\bar H$, $Q=Q_*$, we get from (4.59)
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q_*)}\{\|\bar H^T\xi\|\ge[4\varkappa]^{-1}\Psi(\bar H)\}\ge\beta_\varkappa\ \ \forall\varkappa\ge1,\qquad(4.159)$$
where $\Psi(H)$ is defined by (4.156). Similarly, applying Lemma 4.17 to $N=n$, $Y=(B-\bar H^TA)^T$, $Q=W$, we obtain
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,W)}\{\|(B-\bar H^TA)\eta\|\ge[4\varkappa]^{-1}\Phi(W,\bar H)\}\ge\beta_\varkappa\ \ \forall\varkappa\ge1,\qquad(4.160)$$
where
$$\Phi(W,H)=\min_{\Upsilon=\{\Upsilon_\ell,\ell\le L\},\Theta}\Big\{\mathrm{Tr}(W\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \Upsilon_\ell\succeq0\ \forall\ell,\ \begin{bmatrix}\Theta&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\Big\}.\qquad(4.161)$$
Let us put $\rho(W,\bar H)=[8\varkappa]^{-1}[\Psi(\bar H)+\Phi(W,\bar H)]$; combining (4.160) with (4.159), we conclude that
$$\max\Big\{\mathrm{Prob}_\eta\{\|(B-\bar H^TA)\eta\|\ge\rho(W,\bar H)\},\ \mathrm{Prob}_\xi\{\|\bar H^T\xi\|\ge\rho(W,\bar H)\}\Big\}\ge\beta_\varkappa,$$
and the same inequality holds if $\rho(W,\bar H)$ is replaced with the smaller quantity
$$\bar\rho(W)=[8\varkappa]^{-1}\inf_H[\Psi(H)+\Phi(W,H)].$$
Now, the latter bound combines with (4.158) to imply the following result:

Lemma 4.49. Let $W$ be a positive semidefinite $n\times n$ matrix, and let $\varkappa\ge1$. Then for any estimate $\widehat x(\cdot)$ of $B\eta$ given observation $\omega=A\eta+\xi$, where $\eta\sim\mathcal{N}(0,W)$ is independent of $\xi\sim\mathcal{N}(0,Q_*)$, one has
$$\mathrm{Prob}_{\eta,\xi}\Big\{\|B\eta-\widehat x(\omega)\|\ge[8\varkappa]^{-1}\inf_H[\Psi(H)+\Phi(W,H)]\Big\}\ge\beta_\varkappa=1-\frac{e^{3/8}}{2}-2Fe^{-\varkappa^2/2},$$
where $\Psi(H)$ and $\Phi(W,H)$ are defined, respectively, by (4.156) and (4.161).

In particular, for
$$\varkappa=\bar\varkappa:=\sqrt{2\ln F+10\ln2}\qquad(4.162)$$
it holds
$$\mathrm{Prob}_{\eta,\xi}\Big\{\|B\eta-\widehat x(\omega)\|\ge[8\bar\varkappa]^{-1}\inf_H[\Psi(H)+\Phi(W,H)]\Big\}>\tfrac{3}{16}.$$
5°. For $0<\kappa\le1$, let us set
$$\begin{aligned}
(a)\quad&\mathcal{W}_\kappa=\{W\in\mathbf{S}^n_+:\ \exists t\in\mathcal{T}:\ \mathcal{R}_k[W]\preceq\kappa t_kI_{d_k},\ 1\le k\le K\},\\
(b)\quad&\mathcal{Z}=\Big\{(\Upsilon=\{\Upsilon_\ell,\ell\le L\},\Theta,H):\ \Upsilon_\ell\succeq0\ \forall\ell,\ \begin{bmatrix}\Theta&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\Big\}.
\end{aligned}$$
Note that $\mathcal{W}_\kappa$ is a nonempty convex and compact (by Lemma 4.44) set such that $\mathcal{W}_\kappa=\kappa\mathcal{W}_1$, and $\mathcal{Z}$ is a nonempty closed convex set. Consider the parametric saddle point problem
$$\mathrm{Opt}(\kappa)=\max_{W\in\mathcal{W}_\kappa}\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\Big[E(W;\Upsilon,\Theta,H):=\mathrm{Tr}(W\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Psi(H)\Big].\qquad(4.163)$$
This problem is convex-concave; utilizing the fact that $\mathcal{W}_\kappa$ is compact and contains positive definite matrices, it is immediately seen that the Sion-Kakutani theorem ensures the existence of a saddle point whenever $\kappa\in(0,1]$. We claim that
$$0<\kappa\le1\ \Rightarrow\ \mathrm{Opt}(\kappa)\ge\sqrt\kappa\,\mathrm{Opt}(1).\qquad(4.164)$$
Indeed, $\mathcal{Z}$ is invariant w.r.t. the scalings
$$(\Upsilon=\{\Upsilon_\ell,\ell\le L\},\Theta,H)\mapsto(\theta\Upsilon:=\{\theta\Upsilon_\ell,\ell\le L\},\ \theta^{-1}\Theta,\ H),\quad[\theta>0].$$
Taking into account that $\phi_{\mathcal{R}}(\lambda[\theta\Upsilon])=\theta\phi_{\mathcal{R}}(\lambda[\Upsilon])$, we get
$$E(W):=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}E(W;\Upsilon,\Theta,H)=\inf_{\theta>0}\ \inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}E(W;\theta\Upsilon,\theta^{-1}\Theta,H)=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\Big[2\sqrt{\mathrm{Tr}(W\Theta)\,\phi_{\mathcal{R}}(\lambda[\Upsilon])}+\Psi(H)\Big].$$
Because $\Psi$ is nonnegative, we conclude that whenever $W\succeq0$ and $\kappa\in(0,1]$, one has $E(\kappa W)\ge\sqrt\kappa\,E(W)$. This combines with $\mathcal{W}_\kappa=\kappa\mathcal{W}_1$ to imply that
$$\mathrm{Opt}(\kappa)=\max_{W\in\mathcal{W}_\kappa}E(W)=\max_{W\in\mathcal{W}_1}E(\kappa W)\ge\sqrt\kappa\max_{W\in\mathcal{W}_1}E(W)=\sqrt\kappa\,\mathrm{Opt}(1),$$
and (4.164) follows.

6°. We claim that
$$\mathrm{Opt}(1)=\mathrm{Opt},\qquad(4.165)$$
where Opt is given by (4.42) (and, as we have seen, by (4.156) as well). Note that (4.165) combines with (4.164) to imply that
$$0<\kappa\le1\ \Rightarrow\ \mathrm{Opt}(\kappa)\ge\sqrt\kappa\,\mathrm{Opt}.\qquad(4.166)$$
Verification of (4.165) is given by the following computation. By the Sion-Kakutani Theorem,
$$\begin{aligned}
\mathrm{Opt}(1)&=\max_{W\in\mathcal{W}_1}\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\big[\mathrm{Tr}(W\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Psi(H)\big]\\
&=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\max_{W\in\mathcal{W}_1}\big[\mathrm{Tr}(W\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Psi(H)\big]\\
&=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\Big[\Psi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max_W\big\{\mathrm{Tr}(\Theta W):\ W\succeq0,\ \exists t\in\mathcal{T}:\ \mathcal{R}_k[W]\preceq t_kI_{d_k},\ k\le K\big\}\Big]\\
&=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\Big[\Psi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max_{W,t}\big\{\mathrm{Tr}(\Theta W):\ W\succeq0,\ [t;1]\in\mathbf{T},\ \mathcal{R}_k[W]\preceq t_kI_{d_k},\ k\le K\big\}\Big],
\end{aligned}$$
where $\mathbf{T}$ is the closed conic hull of $\mathcal{T}$. On the other hand, using Conic Duality combined with the fact that $\mathbf{T}_*=\{[g;s]:\ s\ge\phi_{\mathcal{T}}(-g)\}$, we obtain
$$\begin{aligned}
&\max_{W,t}\big\{\mathrm{Tr}(\Theta W):\ W\succeq0,\ [t;1]\in\mathbf{T},\ \mathcal{R}_k[W]\preceq t_kI_{d_k},\ k\le K\big\}\\
&=\min_{Z,[g;s],\Lambda=\{\Lambda_k\}}\Big\{s:\ \begin{array}{l}Z\succeq0,\ [g;s]\in\mathbf{T}_*,\ \Lambda_k\succeq0,\ k\le K,\\ -\mathrm{Tr}(ZW)-g^Tt+\sum_k\mathrm{Tr}(\mathcal{R}_k^*[\Lambda_k]W)-\sum_kt_k\mathrm{Tr}(\Lambda_k)=\mathrm{Tr}(\Theta W)\ \forall(W\in\mathbf{S}^n,\ t\in\mathbf{R}^K)\end{array}\Big\}\\
&=\min_{Z,[g;s],\Lambda=\{\Lambda_k\}}\Big\{s:\ Z\succeq0,\ s\ge\phi_{\mathcal{T}}(-g),\ \Lambda_k\succeq0,\ k\le K,\ \Theta=\textstyle\sum_k\mathcal{R}_k^*[\Lambda_k]-Z,\ g=-\lambda[\Lambda]\Big\}\\
&=\min_\Lambda\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Theta\preceq\textstyle\sum_k\mathcal{R}_k^*[\Lambda_k]\Big\},
\end{aligned}$$
and we arrive at
$$\begin{aligned}
\mathrm{Opt}(1)&=\inf_{\Upsilon,\Theta,H,\Lambda}\Big\{\Psi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \begin{array}{l}\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Theta\preceq\sum_k\mathcal{R}_k^*[\Lambda_k],\\ \begin{bmatrix}\Theta&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\end{array}\Big\}\\
&=\inf_{\Upsilon,H,\Lambda}\Big\{\Psi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \begin{array}{l}\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\\ \begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\end{array}\Big\}\\
&=\mathrm{Opt}\quad[\text{see (4.156)}].
\end{aligned}$$
7°. Now we can complete the proof. For $\kappa\in(0,1]$, let $W_\kappa$ be the $W$-component of a saddle point solution to the saddle point problem (4.163). Then, by (4.166),
$$\sqrt\kappa\,\mathrm{Opt}\le\mathrm{Opt}(\kappa)=\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\big[\mathrm{Tr}(W_\kappa\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Psi(H)\big]=\inf_H\big[\Phi(W_\kappa,H)+\Psi(H)\big].\qquad(4.167)$$
On the other hand, applying Lemma 4.46 to $Q=W_\kappa$ and $\rho=\kappa$, we obtain, in view of the relations $0<\kappa\le1$, $W_\kappa\in\mathcal{W}_\kappa$,
$$\delta(\kappa):=\mathrm{Prob}_{\zeta\sim\mathcal{N}(0,I_n)}\{W_\kappa^{1/2}\zeta\notin\mathcal{X}\}\le2De^{-\frac{1}{2\kappa}},\qquad(4.168)$$
with $D$ given by (4.55). In particular, when setting
$$\widehat\kappa=\frac{1}{2\ln D+10\ln2}\qquad(4.169)$$
(note that we reserve $\bar\varkappa$ for the quantity (4.162)), we obtain $\delta(\widehat\kappa)\le1/16$. Therefore,
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,W_{\widehat\kappa})}\{\eta\notin\mathcal{X}\}\le\frac{1}{16}.\qquad(4.170)$$
Now let
$$\varrho_*:=\frac{\mathrm{Opt}}{8\sqrt{(2\ln F+10\ln2)(2\ln D+10\ln2)}}.\qquad(4.171)$$
All we need in order to achieve our goal of justifying (4.54) is to show that
$$\mathrm{RiskOpt}_{\frac18}\ge\varrho_*,\qquad(4.172)$$
since, given the latter relation, (4.54) will be immediately given by (4.157) as applied with $\epsilon=\frac18$.

To prove (4.172), assume, contrary to what should be proved, that the $\frac18$-risk is $<\varrho_*$, and let $\bar x(\cdot)$ be an estimate with $\frac18$-risk $\varrho'<\varrho_*$. We can utilize $\bar x$ to estimate $B\eta$ in the Bayesian problem of recovering $B\eta$ from observation $\omega=A\eta+\xi$, $(\eta,\xi)\sim\mathcal{N}(0,\Sigma)$ with $\Sigma=\mathrm{Diag}\{W_{\widehat\kappa},Q_*\}$. From (4.170) we conclude that
$$\begin{aligned}
\mathrm{Prob}_{(\eta,\xi)\sim\mathcal{N}(0,\Sigma)}\{\|B\eta-\bar x(A\eta+\xi)\|>\varrho'\}
&\le\mathrm{Prob}_{(\eta,\xi)\sim\mathcal{N}(0,\Sigma)}\{\|B\eta-\bar x(A\eta+\xi)\|>\varrho',\ \eta\in\mathcal{X}\}\\
&\quad+\mathrm{Prob}_{\eta\sim\mathcal{N}(0,W_{\widehat\kappa})}\{\eta\notin\mathcal{X}\}\le\frac18+\frac1{16}=\frac3{16}.
\end{aligned}\qquad(4.173)$$
On the other hand, by (4.167) we have
$$\inf_H\big[\Phi(W_{\widehat\kappa},H)+\Psi(H)\big]=\mathrm{Opt}(\widehat\kappa)\ge\sqrt{\widehat\kappa}\,\mathrm{Opt}=[8\bar\varkappa]\varrho_*$$
with $\bar\varkappa$ given by (4.162). Thus, by Lemma 4.49, for any estimate $\widehat x(\cdot)$ of $B\eta$ via observation $\omega=A\eta+\xi$ it holds
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta-\widehat x(A\eta+\xi)\|\ge\varrho_*\}\ge\beta_{\bar\varkappa}>3/16;$$
in particular, this relation should hold true for $\widehat x(\cdot)\equiv\bar x(\cdot)$, but the latter is impossible: the $\frac3{16}$-risk of $\bar x$ is $\le\varrho'<\varrho_*$; see (4.173). ✷
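The numerical constants in the final step can be verified directly. The computation below (plain Python) checks that the choice (4.169) indeed makes $\delta\le1/16$ in (4.168), and that the choice (4.162) makes $\beta_{\bar\varkappa}>3/16$ in Lemma 4.49, for any sizes $D,F\ge1$:

```python
import math

# delta(kappa_hat) <= 1/16, cf. (4.168)-(4.170): 2D * exp(-1/(2*kappa_hat))
# with kappa_hat = 1/(2 ln D + 10 ln 2) equals 2D * D^{-1} * 2^{-5} = 1/16.
for D in (1, 2, 10, 1000):
    kappa_hat = 1.0 / (2 * math.log(D) + 10 * math.log(2))
    assert 2 * D * math.exp(-1.0 / (2 * kappa_hat)) <= 1 / 16 + 1e-12

# beta_{bar-kappa} > 3/16, cf. Lemma 4.49 and (4.162):
# 1 - e^{3/8}/2 - 2F * exp(-bar_kappa^2/2) = 1 - e^{3/8}/2 - 1/16 ~= 0.210.
for F in (1, 2, 10, 1000):
    bar_kappa = math.sqrt(2 * math.log(F) + 10 * math.log(2))
    beta = 1 - math.exp(3 / 8) / 2 - 2 * F * math.exp(-bar_kappa**2 / 2)
    assert beta > 3 / 16
```

Both margins are exact algebraic identities up to floating point ($2F e^{-\bar\varkappa^2/2}=1/16$ by construction), which is why the constants $10\ln2$ appear in (4.162) and (4.169).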
4.8.5.2
Proof of Proposition 4.5
We shall extract Proposition 4.5 from the following result, meaningful in its own right (it can be considered as an "ellitopic refinement" of Proposition 4.16):

Proposition 4.50. Consider the problem of recovering the linear image $Bx\in\mathbf{R}^\nu$ of an unknown signal $x$ known to belong to a given signal set $\mathcal{X}\subset\mathbf{R}^n$ from the noisy observation
$$\omega=Ax+\xi\in\mathbf{R}^m\qquad[\xi\sim\mathcal{N}(0,\Gamma),\ \Gamma\succ0],$$
the recovery error being measured in a norm $\|\cdot\|$ on $\mathbf{R}^\nu$. Assume that $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ are ellitopes:
$$\begin{aligned}
\mathcal{X}&=\{x\in\mathbf{R}^n:\ \exists t\in\mathcal{T}:\ x^TR_kx\le t_k,\ k\le K\},\\
\mathcal{B}_*&=\{u\in\mathbf{R}^\nu:\ \exists(r\in\mathcal{R},\,y):\ u=My,\ y^TS_\ell y\le r_\ell,\ \ell\le L\},
\end{aligned}\qquad(4.174)$$
with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$, $R_k$, and $S_\ell$ (as always, we lose nothing when assuming that the ellitope $\mathcal{X}$ is basic). Consider the optimization problem
$$\begin{aligned}
\mathrm{Opt}_\#=\min_{\Theta,H,\lambda,\mu,\mu'}\Big\{&\phi_{\mathcal{T}}(\lambda)+\phi_{\mathcal{R}}(\mu)+\phi_{\mathcal{R}}(\mu')+\mathrm{Tr}(\Gamma\Theta):\ \lambda\ge0,\ \mu\ge0,\ \mu'\ge0,\\
&\begin{bmatrix}\sum_k\lambda_kR_k&\frac12[B-H^TA]^TM\\ \frac12M^T[B-H^TA]&\sum_\ell\mu_\ell S_\ell\end{bmatrix}\succeq0,\ \begin{bmatrix}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mu'_\ell S_\ell\end{bmatrix}\succeq0\Big\}.
\end{aligned}\qquad(4.175)$$
The problem is solvable, and the linear estimate $\widehat x_{H_*}(\omega)=H_*^T\omega$ yielded by the $H$-component of an optimal solution to the problem satisfies the risk bound
$$\mathrm{Risk}_{\Gamma,\|\cdot\|}[\widehat x_{H_*}\,|\,\mathcal{X}]:=\max_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx-\widehat x_{H_*}(Ax+\xi)\|\}\le\mathrm{Opt}_\#.$$
Furthermore, the estimate $\widehat x_{H_*}(\cdot)$ is near-optimal:
$$\mathrm{Opt}_\#\le64\sqrt{(3\ln K+15\ln2)(3\ln L+15\ln2)}\,\mathrm{RiskOpt},\qquad(4.176)$$
where RiskOpt is the minimax optimal risk
$$\mathrm{RiskOpt}=\inf_{\widehat x}\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx-\widehat x(Ax+\xi)\|\},$$
the infimum being taken w.r.t. all estimates.
Proposition 4.50 ⇒ Proposition 4.5: Clearly, the situation considered in Proposition 4.5 is a particular case of the setting of Proposition 4.50, namely, the case where $\mathcal{B}_*$ is the standard Euclidean ball, $\mathcal{B}_*=\{u\in\mathbf{R}^\nu:\ u^Tu\le1\}$. In this case, problem (4.175) reads
$$\begin{aligned}
\mathrm{Opt}_\#&=\min_{\Theta,H,\lambda,\mu,\mu'}\Big\{\phi_{\mathcal{T}}(\lambda)+\mu+\mu'+\mathrm{Tr}(\Gamma\Theta):\ \lambda\ge0,\ \mu\ge0,\ \mu'\ge0,\\
&\hspace{6em}\begin{bmatrix}\sum_k\lambda_kR_k&\frac12[B-H^TA]^T\\ \frac12[B-H^TA]&\mu I_\nu\end{bmatrix}\succeq0,\ \begin{bmatrix}\Theta&\frac12H\\ \frac12H^T&\mu'I_\nu\end{bmatrix}\succeq0\Big\}\\
&=\min_{\Theta,H,\lambda,\mu,\mu'}\Big\{\phi_{\mathcal{T}}(\lambda)+\mu+\mu'+\mathrm{Tr}(\Gamma\Theta):\ \lambda\ge0,\ \mu\ge0,\ \mu'\ge0,\\
&\hspace{6em}\mu\Big[\sum_k\lambda_kR_k\Big]\succeq\tfrac14[B-H^TA]^T[B-H^TA],\ \mu'\Theta\succeq\tfrac14HH^T\Big\}\quad[\text{Schur Complement Lemma}]\\
&=\min_{\chi,H}\Big\{\sqrt{\phi_{\mathcal{T}}(\chi)}+\sqrt{\mathrm{Tr}(H\Gamma H^T)}:\ \chi\ge0,\ \begin{bmatrix}\sum_k\chi_kR_k&[B-H^TA]^T\\ B-H^TA&I_\nu\end{bmatrix}\succeq0\Big\}
\end{aligned}$$
[by eliminating $\mu$, $\mu'$; note that $\phi_{\mathcal{T}}(\cdot)$ is positively homogeneous of degree 1].
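The Schur Complement Lemma invoked above states that, for a positive definite lower-right block $C$, the block matrix $[A,\,B;\,B^T,\,C]$ is positive semidefinite iff $A\succeq BC^{-1}B^T$. A quick numerical illustration of both directions (random matrices, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(5)

def min_eig(M):
    # Smallest eigenvalue of a (symmetrized) matrix.
    return np.min(np.linalg.eigvalsh((M + M.T) / 2))

n, m = 3, 2
for _ in range(200):
    B = rng.standard_normal((n, m))
    Cr = rng.standard_normal((m, m))
    C = Cr @ Cr.T + 0.5 * np.eye(m)          # positive definite lower-right block

    Er = rng.standard_normal((n, n))
    P = Er @ Er.T                            # PSD "slack"
    A = B @ np.linalg.inv(C) @ B.T + P       # so A - B C^{-1} B^T = P >= 0
    block = np.block([[A, B], [B.T, C]])
    assert min_eig(block) >= -1e-9           # PSD Schur complement => PSD block

    A_bad = B @ np.linalg.inv(C) @ B.T - 0.5 * np.eye(n)
    block_bad = np.block([[A_bad, B], [B.T, C]])
    assert min_eig(block_bad) < 0            # Schur complement -0.5 I => not PSD
```

In the derivation above the lemma is applied with $C=\mu I_\nu$ and $C=\mu'I_\nu$, turning the two linear matrix inequalities of (4.175) into the quadratic-in-$H$ conditions that $\mu$ and $\mu'$ are then optimized out of.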
Comparing the resulting representation of $\mathrm{Opt}_\#$ with (4.12), we see that the upper bound Opt on the risk of the linear estimate $\widehat x_{H_*}$ appearing in (4.15) is $\le\mathrm{Opt}_\#$. Combining this observation with (4.176) and the evident relation
$$\mathrm{RiskOpt}=\inf_{\widehat x}\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx-\widehat x(Ax+\xi)\|_2\}\le\inf_{\widehat x}\sup_{x\in\mathcal{X}}\sqrt{\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx-\widehat x(Ax+\xi)\|_2^2\}}=\mathrm{Riskopt}$$
(recall that we are in the case of $\|\cdot\|=\|\cdot\|_2$), we arrive at (4.15) and thus justify Proposition 4.5. ✷

Proof of Proposition 4.50. It is immediately seen that problem (4.175) is nothing but problem (4.42) in the case when the spectratopes $\mathcal{X}$, $\mathcal{B}_*$ and the set $\Pi$ participating in Proposition 4.14 are, respectively, the ellitopes given by (4.174) and the singleton $\{\Gamma\}$. Thus, Proposition 4.50 is, essentially, a particular case of Proposition 4.16. The only refinement in Proposition 4.50 as compared to Proposition 4.16 is the form of the logarithmic "nonoptimality" factor in (4.176); the similar factor in Proposition 4.16 is expressed in terms of the spectratopic sizes $D$, $F$ of $\mathcal{X}$ and $\mathcal{B}_*$ (the total ranks of the matrices $R_k$, $k\le K$, and $S_\ell$, $\ell\le L$, in the case of (4.174)), while in (4.176) the nonoptimality factor is expressed in terms of the ellitopic sizes $K$, $L$ of $\mathcal{X}$ and $\mathcal{B}_*$. Strictly speaking, to arrive at this (slight, since the sizes in question are under logarithms) refinement, we would have to reproduce, with minimal modifications, the reasoning of items 2°-7° of Section 4.8.5.1, with $\Gamma$ in the role of $Q_*$, and slightly refine Lemma 4.17 underlying this reasoning. Instead of carrying out this plan literally, we detail the "local modifications" to be made in the proof of Proposition 4.16 in order to prove Proposition 4.50. Here are these modifications:

A. The collections of matrices $\Lambda=\{\Lambda_k\succeq0,\ k\le K\}$, $\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\}$ should be replaced by collections of nonnegative reals $\lambda\in\mathbf{R}^K_+$ or $\mu\in\mathbf{R}^L_+$, and the vectors
380
CHAPTER 4
$\lambda[\Lambda]$, $\lambda[\Upsilon]$ by the vectors $\lambda$ and $\mu$. Expressions like $\mathcal R_k[W]$, $\mathcal R_k^*[\Lambda_k]$, and $\mathcal S_\ell^*[\Upsilon_\ell]$ should be replaced, respectively, with $\mathrm{Tr}(R_kW)$, $\lambda_k R_k$, and $\mu_\ell S_\ell$. Finally, $Q_*$ should be replaced with $\Gamma$, and scalar matrices, like $t_k I_{d_k}$, should be replaced with the corresponding reals, like $t_k$.
B. The role of Lemma 4.17 is now played by
Lemma 4.51. Let $Y$ be an $N \times \nu$ matrix, and let $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal B_*$ of the conjugate norm is the ellitope
\[ \mathcal B_* = \left\{u \in \mathbf{R}^\nu : \exists (r \in \mathcal R,\ y):\ u = My,\ y^T S_\ell y \le r_\ell,\ \ell \le L\right\}, \tag{4.174} \]
and let $\zeta \sim \mathcal N(0,Q)$ for some positive semidefinite $N \times N$ matrix $Q$. Then the best upper bound on $\psi_Q(Y) := \mathbf{E}\{\|Y^T\zeta\|\}$ yielded by Lemma 4.11, that is, the optimal value $\mathrm{Opt}[Q]$ in the convex optimization problem (cf. (4.40))
\[ \mathrm{Opt}[Q] = \min_{\Theta,\mu}\left\{ \phi_{\mathcal R}(\mu) + \mathrm{Tr}(Q\Theta) : \mu \ge 0,\ \begin{bmatrix} \Theta & \tfrac12 YM \\ \tfrac12 M^T Y^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\} \]
satisfies for all $Q \succeq 0$ the identity
\[ \mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q] := \min_{G,\mu}\left\{ \phi_{\mathcal R}(\mu) + \mathrm{Tr}(G) : \mu \ge 0,\ \begin{bmatrix} G & \tfrac12 Q^{1/2} YM \\ \tfrac12 M^T Y^T Q^{1/2} & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\}, \tag{4.177} \]
and is a tight bound on $\psi_Q(Y)$. Namely,
\[ \psi_Q(Y) \le \mathrm{Opt}[Q] \le 22\sqrt{3\ln L + 15\ln 2}\;\psi_Q(Y), \]
where $L$ is the size of the ellitope $\mathcal B_*$; see (4.174). Furthermore, for all $\varkappa \ge 1$ one has
\[ \mathrm{Prob}_\zeta\left\{ \|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\varkappa} \right\} \ge \beta_\varkappa := 1 - \frac{e^{3/8}}{2} - 2Le^{-\varkappa^2/3}. \tag{4.178} \]
In particular, when selecting $\varkappa = \sqrt{3\ln L + 15\ln 2}$, we obtain
\[ \mathrm{Prob}_\zeta\left\{ \|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\sqrt{3\ln L + 15\ln 2}} \right\} \ge \beta_\varkappa = 0.2100\ldots > \tfrac{3}{16}. \]
Proof of Lemma 4.51 follows the lines of the proof of Lemma 4.17, with Lemma 4.47 substituted for Lemma 4.46.
1o. Relation (4.177) can be verified in exactly the same fashion as in the case of Lemma 4.17.
2o. Let us set $\zeta = Q^{1/2}\eta$ with $\eta \sim \mathcal N(0, I_N)$ and $Z = Q^{1/2}Y$. Observe that to prove (4.178) is the same as to show that when $\varkappa \ge 1$ one has
\[ \bar\delta = \frac{\mathrm{Opt}[Q]}{4\varkappa} \ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\| \ge \bar\delta\} \ge \beta_\varkappa := 1 - \frac{e^{3/8}}{2} - 2Le^{-\varkappa^2/3}, \tag{4.179} \]
where
\[ \mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q] := \min_{\Theta,\mu}\left\{ \phi_{\mathcal R}(\mu) + \mathrm{Tr}(\Theta) : \mu \ge 0,\ \begin{bmatrix} \Theta & \tfrac12 ZM \\ \tfrac12 M^T Z^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\}. \tag{4.180} \]
Justification of (4.179) goes as follows.
2.1o. Let us represent $\mathrm{Opt}[Q]$ as the optimal value of a conic problem. Setting $\mathbf K = \mathrm{cl}\{[r;s] : s > 0,\ r/s \in \mathcal R\}$, we ensure that
\[ \mathcal R = \{r : [r;1] \in \mathbf K\}, \qquad \mathbf K_* = \{[g;s] : s \ge \phi_{\mathcal R}(-g)\}, \]
where $\mathbf K_*$ is the cone dual to $\mathbf K$. Consequently, (4.180) reads
\[ \mathrm{Opt}[Q] = \min_{\Theta,\mu,\theta}\left\{ \theta + \mathrm{Tr}(\Theta) : \begin{array}{lr} \mu \ge 0 & (a) \\[2pt] \begin{bmatrix} \Theta & \tfrac12 ZM \\ \tfrac12 M^T Z^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 & (b) \\[2pt] {}[-\mu;\theta] \in \mathbf K_* & (c) \end{array} \right\}. \tag{$P_E$} \]
2.2o. Now let us prove that there exist a matrix $W \in \mathbf S^q_+$ and $r \in \mathcal R$ such that
\[ \mathrm{Tr}(WS_\ell) \le r_\ell,\ \ell \le L, \tag{4.181} \]
and
\[ \mathrm{Opt}[Q] \le \sum_i \sigma_i(ZMW^{1/2}), \tag{4.182} \]
where $\sigma_1(\cdot) \ge \sigma_2(\cdot) \ge \dots$ are the singular values. To get the announced result, let us pass from problem $(P_E)$ to its conic dual. $(P_E)$ clearly is strictly feasible and bounded, so that the problem $(D_E)$ dual to $(P_E)$ is solvable with optimal value $\mathrm{Opt}[Q]$. Denoting by $\lambda_\ell \ge 0$, $\ell \le L$, $\begin{bmatrix} G & -R \\ -R^T & W \end{bmatrix} \succeq 0$, and $[r;\tau] \in \mathbf K$ the Lagrange multipliers for the respective constraints in $(P_E)$, and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the aggregated constraint
\[ \mathrm{Tr}(\Theta G) + \mathrm{Tr}\Big(W\sum_\ell \mu_\ell S_\ell\Big) + \sum_\ell \lambda_\ell\mu_\ell - \sum_\ell r_\ell\mu_\ell + \theta\tau \ge \mathrm{Tr}(ZMR^T). \]
To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial constraints, the restriction that the left-hand side of the aggregated constraint is equal to the objective of $(P_E)$ identically in $\Theta$, $\mu_\ell$, and $\theta$, that is,
\[ G = I, \qquad \mathrm{Tr}(WS_\ell) + \lambda_\ell - r_\ell = 0,\ 1 \le \ell \le L, \qquad \tau = 1, \]
and maximize the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at
\[ \mathrm{Opt}[Q] = \max_{W,R,r}\left\{ \mathrm{Tr}(ZMR^T) : W \succeq R^TR,\ r \in \mathcal R,\ \mathrm{Tr}(WS_\ell) \le r_\ell,\ 1 \le \ell \le L \right\} \]
(note that $r \in \mathcal R$ is equivalent to $[r;1] \in \mathbf K$, and $W \succeq R^TR$ is the same as $\begin{bmatrix} I & -R \\ -R^T & W \end{bmatrix} \succeq 0$). Exactly as in the proof of Lemma 4.17, the above representation of $\mathrm{Opt}[Q]$ implies that
\[ \mathrm{Opt}[Q] = \max_{W,r}\left\{ \|ZMW^{1/2}\|_{\mathrm{Sh},1} : r \in \mathcal R,\ W \succeq 0,\ \mathrm{Tr}(WS_\ell) \le r_\ell,\ \ell \le L \right\}. \]
The resulting problem clearly is solvable, and its optimal solution $W$ ensures the target relations (4.181) and (4.182).
2.3o. Given $W$ satisfying (4.181) and (4.182), we proceed exactly as in item 2.3o of the proof of Lemma 4.17, thus arriving at three random vectors $(\chi, \upsilon, \eta)$ with marginal distributions $\mathcal N(0,I_q)$, $\mathcal N(0,I_q)$, and $\mathcal N(0,I_N)$, respectively, such that
\[ \chi^T W^{1/2} M^T Z^T \eta = \sum_{i=1}^p \sigma_i \upsilon_i^2, \tag{4.183} \]
where $p = \min[q,N]$ and $\sigma_i = \sigma_i(ZMW^{1/2})$. As in item 3o.i of the proof of Lemma 4.17, we have
(i)
\[ \alpha := \mathrm{Prob}\left\{ \sum_{i=1}^p \sigma_i\upsilon_i^2 < \frac12\sum_{i=1}^p \sigma_i \right\} \le \frac{e^{3/8}}{2}\ \ [= 0.7275\ldots]. \tag{4.184} \]
The role of item 3o.ii in the aforementioned proof is now played by
(ii) Whenever $\varkappa \ge 1$, one has
\[ \mathrm{Prob}\{\|MW^{1/2}\chi\|_* > \varkappa\} \le 2L\exp\{-\varkappa^2/3\}, \tag{4.185} \]
with $L$ as defined in (4.174). Indeed, setting $\rho = 1/\varkappa^2 \le 1$ and $\omega = \sqrt{\rho}\,W^{1/2}\chi$, we get $\omega \sim \mathcal N(0,\rho W)$. Let us apply Lemma 4.47 to $Q = \rho W$, with $\mathcal R$ in the role of $\mathcal T$, $L$ in the role of $K$, and the $S_\ell$'s in the role of the $R_k$'s. Denoting
\[ \mathcal Y := \{y : \exists r \in \mathcal R :\ y^T S_\ell y \le r_\ell,\ \ell \le L\}, \]
we have $\mathrm{Tr}(QS_\ell) = \rho\,\mathrm{Tr}(WS_\ell) \le \rho r_\ell$, $\ell \le L$, with $r \in \mathcal R$ (see (4.181)), so we are under the premise of Lemma 4.47 (with $\mathcal Y$ in the role of $\mathcal X$ and therefore with $L$ in the role of $K$). Applying the lemma, we conclude that
\[ \mathrm{Prob}\left\{\chi : \varkappa^{-1}W^{1/2}\chi \notin \mathcal Y\right\} \le 2L\exp\{-1/(3\rho)\} = 2L\exp\{-\varkappa^2/3\}. \]
Recalling that $\mathcal B_* = M\mathcal Y$, we see that $\mathrm{Prob}\{\chi : \varkappa^{-1}MW^{1/2}\chi \notin \mathcal B_*\}$ is indeed upper-bounded by the right-hand side of (4.185), and (4.185) follows.
With (i) and (ii) at our disposal, we complete the proof of Lemma 4.51 in exactly the same way as in items 2.4o and 3o of the proof of Lemma 4.17. ✷
C. As a result of substituting Lemma 4.51 for Lemma 4.17, the counterpart of Lemma 4.49 used in item 4o of the proof of Proposition 4.16 now reads as follows:
Lemma 4.52. Let $W$ be a positive semidefinite $n \times n$ matrix, and let $\varkappa \ge 1$. Then for any estimate $\widehat x(\cdot)$ of $B\eta$ given observation $\omega = A\eta + \xi$ with $\eta \sim \mathcal N(0,W)$ and $\xi \sim \mathcal N(0,\Gamma)$ independent of each other, one has
\[ \mathrm{Prob}_{\eta,\xi}\left\{ \|B\eta - \widehat x(\omega)\| \ge [8\varkappa]^{-1}\inf_H\big[\Psi(H) + \Phi(W,H)\big] \right\} \ge \beta_\varkappa = 1 - \frac{e^{3/8}}{2} - 2Le^{-\varkappa^2/3}, \]
where $\Psi(H)$ and $\Phi(W,H)$ are defined, respectively, by (4.156) (where $Q_*$ should be set to $\Gamma$) and (4.161). In particular, for
\[ \varkappa = \bar\varkappa := \sqrt{3\ln K + 15\ln 2} \]
the latter probability is $> 3/16$.
D. We substitute the reference to Lemma 4.46 in item 7o of the proof with a reference to Lemma 4.47, resulting in replacing
• the definition of $\delta(\varkappa)$ in (4.168) with
\[ \delta(\varkappa) := \mathrm{Prob}_{\zeta\sim\mathcal N(0,I_n)}\{W_\varkappa^{1/2}\zeta \notin \mathcal X\} \le 3Ke^{-\varkappa^2/3}, \]
• definition (4.169) of $\bar\varkappa$ with
\[ \bar\varkappa = \sqrt{3\ln K + 15\ln 2}, \]
• and, finally, definition (4.171) of $\varrho_*$ with
\[ \varrho_* := \frac{\mathrm{Opt}}{8\sqrt{(3\ln L + 15\ln 2)(3\ln K + 15\ln 2)}}. \]

4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20

4.8.6.1 Proof of Proposition 4.18
The only claim of the proposition which is not an immediate consequence of Proposition 4.8 is that problem (4.64) is solvable; let us justify this claim. Let $F = \mathrm{Im}\,A$. Clearly, feasibility of a candidate solution $(H,\Lambda,\Upsilon)$ to the problem depends solely on the restriction of the linear mapping $z \mapsto H^Tz$ onto $F$, so that adding to the constraints of the problem the requirement that the restriction of this mapping onto the orthogonal complement of $F$ in $\mathbf{R}^m$ is identically zero, we get an equivalent problem. It is immediately seen that in the resulting problem, for every $a \in \mathbf{R}$ the feasible solutions with value of the objective $\le a$ form a compact set, so that the latter problem (and thus the original one) indeed is solvable. ✷

4.8.6.2 Proof of Proposition 4.19
We are about to derive Proposition 4.19 from Proposition 4.16. Observe that in the situation of the latter Proposition, setting formally Π = {0}, problem (4.42) becomes problem (4.64), so that Proposition 4.19 looks like the special case Π = {0} of Proposition 4.16. However, the premise of the latter proposition forbids specializing Π as {0}—this would violate the regularity assumption R which is part of the premise. The difficulty, however, can be easily resolved. Assume w.l.o.g. that the image space of A is the entire Rm (otherwise we could from the very beginning replace Rm with the image space of A), and let us pass from our current noiseless recovery problem of interest (!)—see Section 4.5.1—to its “noisy modification,” the differences with (!) being • noisy observation ω = Ax + σξ, σ > 0, ξ ∼ N (0, Im ); • risk quantification of a candidate estimate x b(·) according to
\[ \mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat x \mid \mathcal X] = \sup_{x\in\mathcal X}\mathbf{E}_{\xi\sim\mathcal N(0,I_m)}\left\{\|Bx - \widehat x(Ax+\sigma\xi)\|\right\}, \]
the corresponding minimax optimal risk being
\[ \mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal X] = \inf_{\widehat x(\cdot)} \mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat x \mid \mathcal X]. \]
Proposition 4.16 does apply to the modified problem; it suffices to specify $\Pi$ as $\{\sigma^2 I_m\}$. According to this proposition, the quantity
\[ \begin{array}{rl} \mathrm{Opt}[\sigma] = \min\limits_{H,\Lambda,\Upsilon,\Upsilon',\Theta} \Big\{ & \phi_{\mathcal T}(\lambda[\Lambda]) + \phi_{\mathcal R}(\lambda[\Upsilon]) + \phi_{\mathcal R}(\lambda[\Upsilon']) + \sigma^2\mathrm{Tr}(\Theta) : \\ & \Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\}, \\ & \begin{bmatrix} \sum_k \mathcal R_k^*[\Lambda_k] & \tfrac12 [B^T - A^TH]M \\ \tfrac12 M^T[B - H^TA] & \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0, \\ & \begin{bmatrix} \Theta & \tfrac12 HM \\ \tfrac12 M^TH^T & \sum_\ell \mathcal S_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Big\} \end{array} \]
satisfies the relation
\[ \mathrm{Opt}[\sigma] \le O(1)\ln(\mathcal D)\,\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal X] \tag{4.186} \]
with $\mathcal D$ defined in (4.65). Looking at problem (4.64) we immediately conclude that $\mathrm{Opt}_\# \le \mathrm{Opt}[\sigma]$. Thus, all we need in order to extract the target relation (4.65) from (4.186) is to prove that the minimax optimal risk $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ defined in Proposition 4.19 satisfies the relation
\[ \liminf_{\sigma\to+0} \mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal X] \le \mathrm{Risk}_{\mathrm{opt}}[\mathcal X]. \tag{4.187} \]
To prove this relation, let us fix $r > \mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$, so that for some Borel estimate $\widehat x(\cdot)$ it holds
\[ \sup_{x\in\mathcal X}\|Bx - \widehat x(Ax)\| < r. \tag{4.188} \]
Were we able to ensure that $\widehat x(\cdot)$ is bounded and continuous, we would be done, since in this case, due to the compactness of $\mathcal X$, it clearly holds
\[ \liminf_{\sigma\to+0}\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal X] \le \liminf_{\sigma\to+0}\sup_{x\in\mathcal X}\mathbf{E}_{\xi\sim\mathcal N(0,I_m)}\{\|Bx - \widehat x(Ax+\sigma\xi)\|\} \le \sup_{x\in\mathcal X}\|Bx - \widehat x(Ax)\| < r, \]
and since $r > \mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ is arbitrary, (4.187) would follow. Thus, all we need to do is to verify that a given Borel estimate $\widehat x(\cdot)$ satisfying (4.188) can be updated into a bounded and continuous estimate satisfying the same relation. Verification is as follows:
1. Setting $\beta = \max_{x\in\mathcal X}\|Bx\|$ and replacing the estimate $\widehat x$ with its truncation
\[ \widetilde x(\omega) = \begin{cases} \widehat x(\omega), & \|\widehat x(\omega)\| \le 2\beta, \\ 0, & \text{otherwise,} \end{cases} \]
for any $x \in \mathcal X$ we only reduce the norm of the recovery error. At the same time, $\widetilde x$ is Borel and bounded. Thus, we lose nothing when assuming in the rest of the proof that $\widehat x(\cdot)$ is Borel and bounded.
2. For $\epsilon > 0$, let $\widehat x_\epsilon(\omega) = (1+\epsilon)\widehat x(\omega/(1+\epsilon))$ and let $\mathcal X_\epsilon = (1+\epsilon)\mathcal X$. Observe that
\[ \sup_{x\in\mathcal X_\epsilon}\|Bx - \widehat x_\epsilon(Ax)\| = \sup_{y\in\mathcal X}\|B[1+\epsilon]y - \widehat x_\epsilon(A[1+\epsilon]y)\| = \sup_{y\in\mathcal X}\|B[1+\epsilon]y - [1+\epsilon]\widehat x(Ay)\| = [1+\epsilon]\sup_{y\in\mathcal X}\|By - \widehat x(Ay)\|, \]
implying, in view of (4.188), that for small enough positive $\epsilon$ we have
\[ \bar r := \sup_{x\in\mathcal X_\epsilon}\|Bx - \widehat x_\epsilon(Ax)\| < r. \tag{4.189} \]
3. Finally, let $A^\dagger$ be the pseudoinverse of $A$, so that $AA^\dagger z = z$ for every $z \in \mathbf{R}^m$ (recall that the image space of $A$ is the entire $\mathbf{R}^m$). Given $\rho > 0$, let $\theta_\rho(\cdot)$ be a nonnegative smooth function on $\mathbf{R}^m$ with integral 1 which vanishes outside of the ball of radius $\rho$ centered at the origin, and let
\[ \widehat x_{\epsilon,\rho}(\omega) = \int_{\mathbf{R}^m} \widehat x_\epsilon(\omega - z)\theta_\rho(z)\,dz \]
be the convolution of $\widehat x_\epsilon$ and $\theta_\rho$. Since $\widehat x_\epsilon(\cdot)$ is Borel and bounded, this convolution is a well-defined smooth function on $\mathbf{R}^m$. Because $\mathcal X$ contains a neighbourhood of the origin, for all small enough $\rho > 0$, all $z$ from the support of $\theta_\rho$, and all $x \in \mathcal X$ the point $x - A^\dagger z$ belongs to $\mathcal X_\epsilon$. For such $\rho$ and any $x \in \mathcal X$ we have
\[ \|Bx - \widehat x_\epsilon(Ax - z)\| = \|Bx - \widehat x_\epsilon(A[x - A^\dagger z])\| \le \|BA^\dagger z\| + \|B[x - A^\dagger z] - \widehat x_\epsilon(A[x - A^\dagger z])\| \le C\rho + \bar r \]
with a properly selected constant $C$ independent of $\rho$ (we have used (4.189); note that for our $\rho$ and $x$ we have $x - A^\dagger z \in \mathcal X_\epsilon$). We conclude that for properly selected $r' < r$, $\rho > 0$, and all $x \in \mathcal X$ we have
\[ \|Bx - \widehat x_\epsilon(Ax - z)\| \le r' \quad \forall (z \in \mathrm{supp}\,\theta_\rho), \]
implying, by construction of $\widehat x_{\epsilon,\rho}$, that
\[ \forall (x \in \mathcal X):\ \|Bx - \widehat x_{\epsilon,\rho}(Ax)\| \le r' < r. \]
The resulting estimate $\widehat x_{\epsilon,\rho}$ is the continuous and bounded estimate satisfying (4.188) we were looking for. ✷

4.8.6.3 Justification of Remark 4.20
Justification of Remark 4.20 is obtained by repeating, word for word, the proof of Proposition 4.19, with Proposition 4.50 in the role of Proposition 4.16.
Chapter Five

Signal Recovery Beyond Linear Estimates

OVERVIEW

In this chapter, as in Chapter 4, we focus on signal recovery. In contrast to the previous chapter, on our agenda now are
• a special kind of nonlinear estimate, the polyhedral estimate (Section 5.1), an alternative to the linear estimates which were our subject in Chapter 4. We demonstrate that, as applied to the same estimation problem as in Chapter 4 (recovery of an unknown signal via a noisy observation of a linear image of the signal), polyhedral estimation possesses the same attractive properties as linear estimation, that is, efficient computability and near-optimality, provided the signal set is an ellitope/spectratope. Besides this, we show that properly built polyhedral estimates are near-optimal in several special cases where linear estimates can be heavily suboptimal;
• recovering signals from noisy observations of nonlinear images of the signal. Specifically, we consider signal recovery in generalized linear models, where the expected value of an observation is a known nonlinear transformation of the signal we want to recover, in contrast to observation model (4.1), where this expectation is linear in the signal.
5.1 POLYHEDRAL ESTIMATION

5.1.1 Motivation
The estimation problem we were considering so far is as follows: We want to recover the image Bx ∈ Rν of unknown signal x known to belong to signal set X ⊂ Rn from a noisy observation ω = Ax + ξx ∈ Rm , where ξx is observation noise (index x in ξx indicates that the distribution Px of the observation noise may depend on x). Here X is a given nonempty convex compact set, and A and B are given m × n and ν × n matrices; in addition, we are given a norm k · k on Rν in which the recovery error is measured. We have seen that if X is an ellitope/spectratope then, under reasonable assumptions on observation noise and k · k, an appropriate efficiently computable estimate linear in ω is nearoptimal. Note that the ellitopic/spectratopic structure of X is crucial here. What follows is motivated by the desire to build an alternative estimation scheme which works beyond the ellitopic/spectratopic case, where linear estimates can become “heavily nonoptimal.”
Motivating example. Consider the simple-looking problem of recovering $Bx = x$ in the $\|\cdot\|_2$ norm from direct observations ($Ax = x$) corrupted by standard Gaussian noise $\xi \sim \mathcal N(0,\sigma^2 I)$, and let $\mathcal X$ be the unit $\|\cdot\|_1$ ball:
\[ \mathcal X = \Big\{x \in \mathbf{R}^n : \sum_i |x_i| \le 1\Big\}. \]
In this situation, one can easily build the linear estimate $\widehat x_H(\omega) = H^T\omega$ which is optimal in terms of the worst-case, over $x \in \mathcal X$, expected squared risk:
\[ \mathrm{Risk}^2[\widehat x_H \mid \mathcal X] := \max_{x\in\mathcal X}\mathbf{E}\left\{\|\widehat x_H(\omega) - Bx\|_2^2\right\} = \max_{x\in\mathcal X}\|[I - H^T]x\|_2^2 + \sigma^2\mathrm{Tr}(HH^T) = \max_{i\le n}\|\mathrm{Col}_i[I - H^T]\|_2^2 + \sigma^2\mathrm{Tr}(HH^T). \]
Clearly, the optimal $H$ is a scalar matrix $hI$, the optimal $h$ is the minimizer of the univariate quadratic function $(1-h)^2 + \sigma^2 n h^2$, and the best squared risk attainable with linear estimates is
\[ R^2 = \min_h\left[(1-h)^2 + \sigma^2 n h^2\right] = \frac{n\sigma^2}{1 + n\sigma^2}. \]
On the other hand, consider the following nonlinear estimate $\widehat x(\omega)$: given observation $\omega$, specify $\widehat x(\omega)$ as an optimal solution to the optimization problem
\[ \mathrm{Opt}(\omega) = \min_{y\in\mathcal X}\|y - \omega\|_\infty. \]
Note that for every $\rho > 0$ the probability that the true signal satisfies $\|x - \omega\|_\infty \le \rho\sigma$ ("event $E$") is at least $1 - 2n\exp\{-\rho^2/2\}$, and if this event happens, then both $x$ and $\widehat x$ belong to the box $\{y : \|y - \omega\|_\infty \le \rho\sigma\}$, implying that $\|x - \widehat x\|_\infty \le 2\rho\sigma$. In addition, we always have $\|x - \widehat x\|_2 \le \|x - \widehat x\|_1 \le 2$, since $x \in \mathcal X$ and $\widehat x \in \mathcal X$. We therefore have
\[ \|x - \widehat x\|_2 \le \sqrt{\|x - \widehat x\|_\infty\,\|x - \widehat x\|_1} \le \begin{cases} 2\sqrt{\rho\sigma}, & \omega \in E, \\ 2, & \omega \notin E, \end{cases} \]
whence
\[ \mathbf{E}\left\{\|\widehat x - x\|_2^2\right\} \le 4\rho\sigma + 8n\exp\{-\rho^2/2\}. \tag{$*$} \]
Assuming $\sigma \le 2n\exp\{-1/2\}$ and specifying $\rho = \sqrt{2\ln(2n/\sigma)}$, we get $\rho \ge 1$ and $2n\exp\{-\rho^2/2\} \le \sigma$, implying that the right-hand side in $(*)$ is at most $8\rho\sigma$. In other words, for our nonlinear estimate $\widehat x(\omega)$ it holds
\[ \mathrm{Risk}^2[\widehat x \mid \mathcal X] \le 8\sqrt{2\ln(2n/\sigma)}\,\sigma. \]
When $n\sigma^2$ is of order of 1, the latter bound on the squared risk is of order of $\sigma\sqrt{\ln(1/\sigma)}$, while the best squared risk achievable with linear estimates under the circumstances is of order of 1. We conclude that when $\sigma$ is small and $n$ is large (specifically, of order of $1/\sigma^2$), the best linear estimate is by far inferior to our nonlinear estimate: the ratio of the corresponding squared risks is as large as $\frac{O(1)}{\sigma\sqrt{\ln(1/\sigma)}}$, a factor which is "by far" worse than the nonoptimality factor in the case of an ellitope/spectratope $\mathcal X$.
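The nonlinear estimate of the example is easy to compute in practice: $\min_{y\in\mathcal X}\|y-\omega\|_\infty$ over the $\ell_1$ ball is a linear program. The sketch below (sizes, noise level, and seed are all illustrative, not taken from the text) solves this LP with SciPy and checks the two facts used in the argument: $\mathrm{Opt}(\omega) \le \|x-\omega\|_\infty$, since the true signal is feasible, and the resulting $\ell_\infty$-proximity of $\widehat x$ to $x$ obtained via the triangle inequality.

```python
import numpy as np
from scipy.optimize import linprog

def linf_project_to_l1_ball(omega):
    """Solve min_{||y||_1 <= 1} ||y - omega||_inf as an LP.

    Variables: y (free), s (s >= |y|), t (the l_inf value to minimize).
    """
    n = omega.size
    c = np.zeros(2 * n + 1)
    c[-1] = 1.0                                   # minimize t
    I = np.eye(n)
    Z = np.zeros((n, n))
    ones = np.ones((n, 1))
    zcol = np.zeros((n, 1))
    A_ub = np.block([
        [ I, Z, -ones],                           #  y - t <= omega
        [-I, Z, -ones],                           # -y - t <= -omega
        [ I, -I, zcol],                           #  y - s <= 0
        [-I, -I, zcol],                           # -y - s <= 0
    ])
    b_ub = np.concatenate([omega, -omega, np.zeros(2 * n)])
    A_ub = np.vstack([A_ub, np.concatenate([np.zeros(n), np.ones(n), [0.0]])])
    b_ub = np.append(b_ub, 1.0)                   # sum(s) <= 1: the l1 ball
    bounds = [(None, None)] * n + [(0, None)] * n + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.fun                     # estimate and Opt(omega)

rng = np.random.default_rng(0)
n, sigma = 50, 0.05
x = np.zeros(n); x[0] = 1.0                       # a vertex of the l1 ball
xi = rng.standard_normal(n)
omega = x + sigma * xi
y, opt = linf_project_to_l1_ball(omega)
# x itself is feasible, hence Opt(omega) <= ||x - omega||_inf = sigma*||xi||_inf
assert opt <= sigma * np.abs(xi).max() + 1e-7
# triangle inequality: ||x - y||_inf <= ||x - omega||_inf + ||omega - y||_inf
assert np.abs(x - y).max() <= opt + sigma * np.abs(xi).max() + 1e-6
```

Both assertions are deterministic consequences of feasibility, not of the randomness of the draw, so they hold for every seed.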
The construction of the nonlinear estimate $\widehat x$ which we have just built¹ admits a natural extension yielding what we shall call the polyhedral estimate, and our present goal is to design and to analyze presumably good estimates of this type.

5.1.2 Generic polyhedral estimate
A generic polyhedral estimate is built as follows: Given the data $A \in \mathbf{R}^{m\times n}$, $B \in \mathbf{R}^{\nu\times n}$, $\mathcal X \subset \mathbf{R}^n$ of our recovery problem (where $\mathcal X$ is a computationally tractable convex compact set) and a "reliability tolerance" $\epsilon \in (0,1)$, we somehow specify a positive integer $N$ along with $N$ linear forms $h_\ell^Tz$ on the space $\mathbf{R}^m$ where observations live. These forms define linear forms $g_\ell^Tx := h_\ell^TAx$ on the space $\mathbf{R}^n$ of signals. Assuming that the observation noise $\xi_x$ is zero mean for every $x \in \mathcal X$, the "plug-in" estimates $h_\ell^T\omega$ are unbiased estimates of the forms $g_\ell^Tx$. Assume that the vectors $h_\ell$ are selected in such a way that
\[ \forall (x \in \mathcal X):\ \mathrm{Prob}\{|h_\ell^T\xi_x| > 1\} \le \epsilon/N \quad \forall \ell. \tag{5.1} \]
In this situation, setting $H = [h_1,\dots,h_N]$ (in the sequel, $H$ is referred to as the contrast matrix), we can ensure that whatever be the signal $x \in \mathcal X$ underlying our observation $\omega = Ax + \xi_x$, the observable vector $H^T\omega$ satisfies the relation
\[ \mathrm{Prob}\left\{\|H^T\omega - H^TAx\|_\infty > 1\right\} \le \epsilon. \tag{5.2} \]
With the polyhedral estimation scheme, we act as if all information about $x$ contained in our observation $\omega$ were represented by $H^T\omega$, and we estimate $Bx$ by $B\bar x$, where $\bar x = \bar x(\omega)$ is any vector from $\mathcal X$ compatible with this information, specifically, such that $\bar x$ solves the feasibility problem
\[ \text{find } \bar x \in \mathcal X \text{ such that } \|H^T\omega - H^TA\bar x\|_\infty \le 1. \]
Note that this feasibility problem can be unsolvable with positive probability; all we know in this respect is that the latter probability is $\le \epsilon$, since by construction the true signal $x$ underlying the observation $\omega$ is, with probability at least $1-\epsilon$, a feasible solution. In other words, such an $\bar x$ is not always well defined. To circumvent this difficulty, let us define $\bar x$ as
\[ \bar x \in \mathop{\mathrm{Argmin}}_u\left\{\|H^T\omega - H^TAu\|_\infty : u \in \mathcal X\right\}, \tag{5.3} \]
so that $\bar x$ always is well defined and belongs to $\mathcal X$, and estimate $Bx$ by $B\bar x$. Thus, a polyhedral estimate is specified by an $m \times N$ contrast matrix $H = [h_1,\dots,h_N]$ with columns $h_\ell$ satisfying (5.1) and is as follows: given observation $\omega$, we build $\bar x = \bar x(\omega) \in \mathcal X$ according to (5.3) and estimate $Bx$ by $\widehat x_H(\omega) = B\bar x(\omega)$.
The rationale behind the polyhedral estimation scheme is the desire to reduce complex
The rationale behind polyhedral estimation scheme is the desire to reduce complex 1 In fact, this estimate is nearly optimal under the circumstances in a meaningful range of values of n and σ.
estimating problems to those of estimating linear forms. To the best of our knowledge, this approach was first used in [192] (see also [185, Chapter 2]) in connection with recovering from direct observations (restrictions onto regular grids of) multivariate functions from Sobolev balls. Recently, the ideas underlying the results of [192] have been taken up in the MIND estimator of [109], then applied to multiple testing in [203]. What follows is based on [139].
$(\epsilon,\|\cdot\|)$-risk. Given a desired "reliability tolerance" $\epsilon \in (0,1)$, it is convenient to quantify the performance of a polyhedral estimate by its $(\epsilon,\|\cdot\|)$-risk
\[ \mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x(\cdot) \mid \mathcal X] = \inf\left\{\rho : \mathrm{Prob}\left\{\|Bx - \widehat x(Ax+\xi_x)\| > \rho\right\} \le \epsilon\ \ \forall x \in \mathcal X\right\}, \tag{5.4} \]
that is, the worst, over $x \in \mathcal X$, size of the "$(1-\epsilon)$-reliable $\|\cdot\|$-confidence interval" associated with the estimate $\widehat x(\cdot)$. An immediate observation is as follows:
Proposition 5.1. In the situation in question, denoting by $\mathcal X_s = \frac12(\mathcal X - \mathcal X)$ the symmetrization of $\mathcal X$, given a contrast matrix $H = [h_1,\dots,h_N]$ with columns satisfying (5.1), the quantity
\[ R[H] = \max_z\left\{\|Bz\| : \|H^TAz\|_\infty \le 2,\ z \in 2\mathcal X_s\right\} \tag{5.5} \]
is an upper bound on the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat x_H(\cdot)$:
\[ \mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H \mid \mathcal X] \le R[H]. \]
Proof is immediate. Let us fix $x \in \mathcal X$, and let $E$ be the set of all realizations of $\xi_x$ such that $\|H^T\xi_x\|_\infty \le 1$, so that $P_x(E) \ge 1-\epsilon$ by (5.2). Let us fix a realization $\xi \in E$ of the observation noise, and let $\omega = Ax + \xi$ and $\bar x = \bar x(Ax+\xi)$. Then $u = x$ is a feasible solution to the optimization problem (5.3) with value of the objective $\le 1$, implying that the value of this objective at the optimal solution $\bar x$ is $\le 1$ as well, so that $\|H^TA[x - \bar x]\|_\infty \le 2$. Besides this, $z = x - \bar x \in 2\mathcal X_s$. We see that $z$ is a feasible solution to (5.5), whence $\|B[x - \bar x]\| = \|Bx - \widehat x_H(\omega)\| \le R[H]$. It remains to note that the latter relation holds true whenever $\omega = Ax + \xi$ with $\xi \in E$, and the $P_x$-probability of the latter inclusion is at least $1-\epsilon$, whatever be $x \in \mathcal X$. ✷
What is ahead. In what follows, our focus will be on the following questions pertinent to the design of polyhedral estimates:
1. Given the data of our estimation problem and a tolerance $\delta \in (0,1)$, how do we find a set $\mathcal H_\delta$ of vectors $h \in \mathbf{R}^m$ satisfying the relation
\[ \forall (x \in \mathcal X):\ \mathrm{Prob}\left\{|h^T\xi_x| > 1\right\} \le \delta. \tag{5.6} \]
With our approach, after the number $N$ of columns in a contrast matrix has been selected, we choose the columns of $H$ from $\mathcal H_\delta$ with $\delta = \epsilon/N$, $\epsilon$ being the given reliability tolerance of the estimate we are designing. Thus, the problem of constructing the sets $\mathcal H_\delta$ arises; the larger $\mathcal H_\delta$, the better.
2. The upper bound $R[H]$ on the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat x_H$ is, in general, difficult to compute: it is the maximum of a convex function over a computationally tractable convex set. Thus, similarly to the case of linear
estimates, we need techniques for computationally efficient upper bounding of $R[\cdot]$.
3. With the "raw materials" (the sets $\mathcal H_\delta$) and efficiently computable upper bounds on the risk of candidate polyhedral estimates at our disposal, how do we design the polyhedral estimate which is best in terms of (the upper bound on) its risk?
We are about to consider these questions one by one.

5.1.3 Specifying sets $\mathcal H_\delta$ for basic observation schemes
To specify reasonable sets $\mathcal H_\delta$ we need to make some assumptions on the distributions of observation noises we want to handle. In the sequel we restrict ourselves to the following three special cases:
• Sub-Gaussian case: For every $x \in \mathcal X$, the observation noise $\xi_x$ is sub-Gaussian with parameters $(0, \sigma^2 I_m)$, where $\sigma > 0$, i.e., $\xi_x \sim \mathcal{SG}(0,\sigma^2 I_m)$.
• Discrete case: $\mathcal X$ is a convex compact subset of the probabilistic simplex $\Delta_n = \{x \in \mathbf{R}^n : x \ge 0,\ \sum_i x_i = 1\}$, $A$ is a column-stochastic matrix, and
\[ \omega = \frac1K\sum_{k=1}^K \zeta_k \]
with random vectors $\zeta_k$ independent across $k \le K$, $\zeta_k$ taking value $e_i$ with probability $[Ax]_i$, $i = 1,\dots,m$, the $e_i$ being the basic orths in $\mathbf{R}^m$.
• Poisson case: $\mathcal X$ is a convex compact subset of the nonnegative orthant $\mathbf{R}^n_+$, $A$ is entrywise nonnegative, and the observation $\omega$ stemming from $x \in \mathcal X$ is a random vector with entries $\omega_i \sim \mathrm{Poisson}([Ax]_i)$ independent across $i$.
The associated sets $\mathcal H_\delta$ can be built as follows.

5.1.3.1 Sub-Gaussian case
When $h \in \mathbf{R}^m$ is deterministic and $\xi$ is sub-Gaussian with parameters $(0,\sigma^2 I_m)$, we have
\[ \mathrm{Prob}\{|h^T\xi| > 1\} \le 2\exp\left\{-\frac{1}{2\sigma^2\|h\|_2^2}\right\}. \]
Indeed, when $h \ne 0$ and $\gamma > 0$, we have
\[ \mathrm{Prob}\{h^T\xi > 1\} \le \exp\{-\gamma\}\,\mathbf{E}\left\{\exp\{\gamma h^T\xi\}\right\} \le \exp\left\{\tfrac12\sigma^2\gamma^2\|h\|_2^2 - \gamma\right\}. \]
Minimizing the resulting bound in $\gamma > 0$, we get $\mathrm{Prob}\{h^T\xi > 1\} \le \exp\{-\frac{1}{2\|h\|_2^2\sigma^2}\}$; the same reasoning as applied to $-h$ in the role of $h$ results in $\mathrm{Prob}\{h^T\xi < -1\} \le \exp\{-\frac{1}{2\|h\|_2^2\sigma^2}\}$. Consequently,
\[ \pi_G(h) := \underbrace{\sigma\sqrt{2\ln(2/\delta)}}_{\vartheta_G}\,\|h\|_2 \le 1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi| > 1\} \le \delta, \]
and we can set
\[ \mathcal H_\delta = \mathcal H_\delta^G := \{h : \pi_G(h) \le 1\}. \]
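The implication above is easy to probe numerically. The sketch below (with illustrative $m$, $\sigma$, and $\delta$) takes Gaussian noise, the simplest sub-Gaussian instance, scales $h$ so that $\pi_G(h) = 1$, verifies that the analytic bound then equals $\delta$ exactly, and checks the guarantee by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, delta = 10, 0.3, 0.1
theta_G = sigma * np.sqrt(2 * np.log(2 / delta))

h = rng.standard_normal(m)
h *= 1.0 / (theta_G * np.linalg.norm(h))      # scale h so that pi_G(h) = 1

# with pi_G(h) = 1, the bound 2*exp(-1/(2*sigma^2*||h||_2^2)) equals delta
analytic = 2 * np.exp(-1.0 / (2 * sigma**2 * np.linalg.norm(h)**2))
assert abs(analytic - delta) < 1e-10

# empirical frequency of |h^T xi| > 1 for xi ~ N(0, sigma^2 I_m)
N = 200_000
xi = sigma * rng.standard_normal((N, m))
freq = np.mean(np.abs(xi @ h) > 1.0)
assert freq <= delta                          # the guarantee of the display above
```

For Gaussian noise the true violation probability is about $2\bar\Phi(\sqrt{2\ln(2/\delta)}) \approx 0.014$ here, well below $\delta = 0.1$, illustrating the (expected) slack of the Chernoff bound.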
5.1.3.2 Discrete case
Given $x \in \mathcal X$, setting $\mu = Ax$ and $\eta_k = \zeta_k - \mu$, we get
\[ \omega = Ax + \underbrace{\frac1K\sum_{k=1}^K \eta_k}_{\xi_x}. \]
Given $h \in \mathbf{R}^m$,
\[ h^T\xi_x = \frac1K\sum_k \underbrace{h^T\eta_k}_{\chi_k}. \]
The random variables $\chi_1,\dots,\chi_K$ are independent, zero mean, and clearly satisfy
\[ \mathbf{E}\{\chi_k^2\} \le \sum_i [Ax]_i h_i^2, \qquad |\chi_k| \le 2\|h\|_\infty\ \ \forall k. \]
Applying Bernstein's inequality² we get (cf. Exercise 4.19)
\[ \mathrm{Prob}\{|h^T\xi_x| > 1\} = \mathrm{Prob}\Big\{\Big|\sum_k \chi_k\Big| > K\Big\} \le 2\exp\left\{-\frac{K}{2\sum_i [Ax]_i h_i^2 + \frac43\|h\|_\infty}\right\}. \tag{5.7} \]
Setting
\[ \pi_D(h) = \sqrt{\vartheta_D^2\max_{x\in\mathcal X}\sum_i [Ax]_i h_i^2 + \varrho_D^2\|h\|_\infty^2}, \qquad \vartheta_D = \sqrt{\frac{2\ln(2/\delta)}{K}}, \quad \varrho_D = \frac{8\ln(2/\delta)}{3K}, \]
after a completely straightforward computation we conclude from (5.7) that
\[ \pi_D(h) \le 1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x| > 1\} \le \delta \quad \forall x \in \mathcal X. \]
Thus, in the Discrete case we can set $\mathcal H_\delta = \mathcal H_\delta^D := \{h : \pi_D(h) \le 1\}$.
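A small Monte-Carlo sketch of the Discrete-case construction (all sizes and the seed are illustrative). For $\mathcal X = \Delta_n$ the maximum of $\sum_i [Ax]_i h_i^2$ over $x \in \mathcal X$ is linear in $x$, hence attained at a vertex $e_j$ of the simplex:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, K, delta = 6, 4, 400, 0.05
A = rng.random((m, n)); A /= A.sum(axis=0)    # column-stochastic
x = rng.random(n); x /= x.sum()               # a point of the simplex
mu = A @ x                                    # also a probability vector

L = np.log(2 / delta)
theta2 = 2 * L / K
rho = 8 * L / (3 * K)

def pi_D(h):
    # max over the simplex of sum_i [Ax]_i h_i^2 is attained at a vertex e_j
    smax = max((A[:, j] * h**2).sum() for j in range(n))
    return np.sqrt(theta2 * smax + rho**2 * np.max(np.abs(h))**2)

h = rng.standard_normal(m)
h /= pi_D(h)                                  # scale so that pi_D(h) = 1
assert abs(pi_D(h) - 1.0) < 1e-9

# empirical check of Prob{|h^T xi_x| > 1} <= delta for the multinomial scheme
N = 20_000
counts = rng.multinomial(K, mu, size=N) / K   # the empirical distribution omega
viol = np.mean(np.abs((counts - mu) @ h) > 1.0)
assert viol <= delta
```

Since the Bernstein-type bound is conservative, the empirical violation frequency typically lands far below $\delta$.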
5.1.3.3 Poisson case
In the Poisson case, for $x \in \mathcal X$, setting $\mu = Ax$, we have $\omega = Ax + \xi_x$ with $\xi_x = \omega - \mu$. It turns out that for every $h \in \mathbf{R}^m$ one has
\[ \forall t \ge 0:\ \mathrm{Prob}\left\{|h^T\xi_x| \ge t\right\} \le 2\exp\left\{-\frac{t^2}{2\left[\sum_i h_i^2\mu_i + \frac13\|h\|_\infty t\right]}\right\} \tag{5.8} \]
²The classical Bernstein inequality states that if $X_1,\dots,X_K$ are independent zero mean scalar random variables with finite variances $\sigma_k^2$ such that $|X_k| \le M$ a.s., then for every $t > 0$ one has
\[ \mathrm{Prob}\{X_1 + \dots + X_K > t\} \le \exp\left\{-\frac{t^2}{2\left[\sum_k \sigma_k^2 + \frac13 Mt\right]}\right\}. \]
(for verification, see Exercise 4.21 or Section 5.4.1). As a result, we conclude via a straightforward computation that, setting
\[ \pi_P(h) = \sqrt{\vartheta_P^2\max_{x\in\mathcal X}\sum_i [Ax]_i h_i^2 + \varrho_P^2\|h\|_\infty^2}, \qquad \vartheta_P = \sqrt{2\ln(2/\delta)}, \quad \varrho_P = \tfrac43\ln(2/\delta), \]
we ensure that
\[ \pi_P(h) \le 1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x| > 1\} \le \delta \quad \forall x \in \mathcal X. \]
Thus, in the Poisson case we can set $\mathcal H_\delta = \mathcal H_\delta^P := \{h : \pi_P(h) \le 1\}$.
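A parallel Monte-Carlo sketch for the Poisson case (illustrative data; for simplicity $\mathcal X$ is taken here to be a single point, so the maximum over $x \in \mathcal X$ reduces to the single vector $\mu = Ax$ below):

```python
import numpy as np

rng = np.random.default_rng(3)
m, delta = 6, 0.05
mu = rng.uniform(0.5, 3.0, size=m)            # mu = Ax for the single signal x

L = np.log(2 / delta)
theta2_P = 2 * L
rho_P = (4.0 / 3.0) * L

def pi_P(h, mu):
    # with a singleton X, the max over x of sum_i [Ax]_i h_i^2 is just at mu
    return np.sqrt(theta2_P * (mu * h**2).sum() + rho_P**2 * np.max(np.abs(h))**2)

h = rng.standard_normal(m)
h /= pi_P(h, mu)                              # scale so that pi_P(h) = 1

# empirical check of Prob{|h^T (omega - mu)| > 1} <= delta
N = 20_000
omega = rng.poisson(mu, size=(N, m))
viol = np.mean(np.abs((omega - mu) @ h) > 1.0)
assert viol <= delta
```

As in the Discrete case, the tail bound behind $\pi_P$ is conservative, so the empirical frequency is typically well below $\delta$.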
5.1.4 Efficient upper-bounding of $R[H]$ and contrast design, I
The scheme for upper-bounding $R[H]$ to be presented in this section (an alternative, completely different scheme will be presented in Section 5.1.5) is inspired by our motivating example. Note that there is a special case of (5.5) where $R[H]$ is easy to compute: the case where $\|\cdot\|$ is the uniform norm $\|\cdot\|_\infty$, whence
\[ R[H] = \widehat R[H] := 2\max_{i\le\nu}\max_x\left\{\mathrm{Row}_i^T[B]x : x \in \mathcal X_s,\ \|H^TAx\|_\infty \le 1\right\} \]
is the maximum of $\nu$ efficiently computable convex functions. It turns out that when $\|\cdot\| = \|\cdot\|_\infty$, it is not only easy to compute $R[H]$, but to optimize this risk bound in $H$ as well.³ These observations underlie the forthcoming developments in this section: under appropriate assumptions, we bound the risk of a polyhedral estimate with contrast matrix $H$ via the efficiently computable quantity $\widehat R[H]$ and then show that the resulting risk bounds can be efficiently optimized w.r.t. $H$. We shall also see that in some situations which are simple for analytical analysis, like that of the motivating example, the resulting estimates are nearly minimax optimal.

5.1.4.1 Assumptions
We stay within the setup introduced in Section 5.1.1, which we augment with the following assumptions:
A.1. $\|\cdot\| = \|\cdot\|_r$ with $r \in [1,\infty]$.
A.2. We have at our disposal a sequence $\gamma = \{\gamma_i > 0,\ i \le \nu\}$ and $\rho \in [1,\infty]$ such that the image of $\mathcal X_s$ under the mapping $x \mapsto Bx$ is contained in the "scaled $\|\cdot\|_\rho$ ball"
\[ \mathcal Y = \{y \in \mathbf{R}^\nu : \|\mathrm{Diag}\{\gamma\}y\|_\rho \le 1\}. \tag{5.9} \]

5.1.4.2 Simple observation
Let $B_\ell^T$ be the $\ell$-th row in $B$, $1 \le \ell \le \nu$. Let us make the following observation:
³On closer inspection, in the situation considered in the motivating example the $\|\cdot\|_\infty$-optimal contrast matrix $H$ is proportional to the unit matrix, and the quantity $\widehat R[H]$ can be easily translated into an upper bound on, say, the $\|\cdot\|_2$-risk of the associated polyhedral estimate.
Proposition 5.2. In the situation described in Section 5.1.1, let us assume that Assumptions A.1--2 hold. Let $\epsilon \in (0,1)$ and a positive integer $N \ge \nu$ be given; let also $\pi(\cdot)$ be a norm on $\mathbf{R}^m$ such that
\[ \forall (h : \pi(h) \le 1,\ x \in \mathcal X):\ \mathrm{Prob}\{|h^T\xi_x| > 1\} \le \epsilon/N. \]
Next, let a matrix $H = [H_1,\dots,H_\nu]$ with $H_\ell \in \mathbf{R}^{m\times m_\ell}$, $m_\ell \ge 1$, and positive reals $\varsigma_\ell$, $\ell \le \nu$, satisfy the relations
\[ \begin{array}{ll} (a) & \pi(\mathrm{Col}_j[H]) \le 1,\ 1 \le j \le N; \\ (b) & \max_x\left\{B_\ell^Tx : x \in \mathcal X_s,\ \|H_\ell^TAx\|_\infty \le 1\right\} \le \varsigma_\ell,\ 1 \le \ell \le \nu. \end{array} \tag{5.10} \]
Then the quantity $R[H]$ as defined in (5.5) can be upper-bounded as follows:
\[ R[H] \le \Psi(\varsigma) := 2\max_w\left\{\|[w_1/\gamma_1;\dots;w_\nu/\gamma_\nu]\|_r : \|w\|_\rho \le 1,\ 0 \le w_\ell \le \gamma_\ell\varsigma_\ell,\ \ell \le \nu\right\}, \tag{5.11} \]
which combines with Proposition 5.1 to imply that
\[ \mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H \mid \mathcal X] \le \Psi(\varsigma). \tag{5.12} \]
The function $\Psi$ is nondecreasing on the nonnegative orthant and is easy to compute.
Proof. Let $z = 2\bar z$ be a feasible solution to (5.5), so that $\bar z \in \mathcal X_s$ and $\|H^TA\bar z\|_\infty \le 1$. Let $y = B\bar z$; then $y \in \mathcal Y$ (see (5.9)) due to $\bar z \in \mathcal X_s$ and A.2, that is, $\|\mathrm{Diag}\{\gamma\}y\|_\rho \le 1$. Besides this, by (5.10.b) the relations $\bar z \in \mathcal X_s$ and $\|H^TA\bar z\|_\infty \le 1$ combine with the symmetry of $\mathcal X_s$ w.r.t. the origin to imply that $|y_\ell| = |B_\ell^T\bar z| \le \varsigma_\ell$, $\ell \le \nu$. Taking into account that $\|\cdot\| = \|\cdot\|_r$ by A.1, we see that
\[ \begin{array}{rcl} R[H] &=& \max_z\left\{\|Bz\|_r : z \in 2\mathcal X_s,\ \|H^TAz\|_\infty \le 2\right\} \\ &\le& 2\max_y\left\{\|y\|_r : |y_\ell| \le \varsigma_\ell,\ \ell \le \nu,\ \|\mathrm{Diag}\{\gamma\}y\|_\rho \le 1\right\} \\ &=& 2\max_w\left\{\|[w_1/\gamma_1;\dots;w_\nu/\gamma_\nu]\|_r : \|w\|_\rho \le 1,\ 0 \le w_\ell \le \gamma_\ell\varsigma_\ell,\ \ell \le \nu\right\}, \end{array} \]
as stated in (5.11). It is evident that $\Psi$ is nondecreasing on the nonnegative orthant. Computing $\Psi$ can be carried out as follows:
1. When $r = \infty$, we need to compute
\[ \max_{\ell\le\nu}\max_w\left\{w_\ell/\gamma_\ell : \|w\|_\rho \le 1,\ 0 \le w_j \le \gamma_j\varsigma_j,\ j \le \nu\right\}, \]
so that evaluating $\Psi$ reduces to solving $\nu$ simple convex optimization problems;
2. When $\rho = \infty$, we clearly have $\Psi(\varsigma) = 2\|[\bar w_1/\gamma_1;\dots;\bar w_\nu/\gamma_\nu]\|_r$ with $\bar w_\ell = \min[1,\gamma_\ell\varsigma_\ell]$;
3. When $1 \le r,\rho < \infty$, passing from the variables $w_\ell$ to the variables $u_\ell = w_\ell^\rho$, we get
\[ \Psi^r(\varsigma) = 2^r\max_u\left\{\sum_\ell \gamma_\ell^{-r}u_\ell^{r/\rho} : \sum_\ell u_\ell \le 1,\ 0 \le u_\ell \le (\gamma_\ell\varsigma_\ell)^\rho\right\}. \]
When $r \le \rho$, the optimization problem on the right-hand side is the easily solvable problem of maximizing a simple concave function over a simple convex compact set. When $\infty > r > \rho$, this problem can be solved by Dynamic Programming.
✷
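The two closed-form cases above ($r = \infty$ and $\rho = \infty$) can be verified numerically; the sketch below uses hypothetical $\gamma$ and $\varsigma$, checks that the maximizer $\bar w_\ell = \min[1,\gamma_\ell\varsigma_\ell]$ is feasible and attains the value, and that random feasible points never beat it:

```python
import numpy as np

def psi_r_inf(gamma, sig):
    # r = infinity: setting all but one coordinate of w to zero is optimal,
    # so Psi = 2 * max_l min(1, gamma_l * sig_l) / gamma_l
    return 2 * np.max(np.minimum(1.0, gamma * sig) / gamma)

def psi_rho_inf(gamma, sig, r):
    # rho = infinity: the coordinates decouple, w_bar_l = min(1, gamma_l*sig_l)
    w = np.minimum(1.0, gamma * sig)
    return 2 * np.linalg.norm(w / gamma, ord=r)

gamma = np.array([0.5, 1.0, 2.0])
sig = np.array([3.0, 0.4, 0.1])
# hand computation: gamma*sig = [1.5, 0.4, 0.2] -> w_bar = [1, 0.4, 0.2],
# w_bar/gamma = [2, 0.4, 0.1], so psi_r_inf = 2 * 2 = 4
assert abs(psi_r_inf(gamma, sig) - 4.0) < 1e-12

val = psi_rho_inf(gamma, sig, 2)
w_star = np.minimum(1.0, gamma * sig)          # feasible and attains val
assert np.max(w_star) <= 1.0 and np.all(w_star <= gamma * sig + 1e-12)
assert abs(val - 2 * np.linalg.norm(w_star / gamma)) < 1e-12

rng = np.random.default_rng(4)
for _ in range(1000):                          # random feasible w never beat val
    w = rng.random(3) * np.minimum(1.0, gamma * sig)
    assert 2 * np.linalg.norm(w / gamma) <= val + 1e-9
```

The random-point check relies on the fact that the objective is coordinatewise monotone, so any feasible $w \le \bar w$ gives a value at most that of $\bar w$.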
Comment. When we want to recover $Bx$ in $\|\cdot\|_\infty$ (i.e., we are in the case of $r = \infty$), under the premise of Proposition 5.2 we clearly have $\Psi(\varsigma) \le 2\max_\ell \varsigma_\ell$, resulting in the bound
\[ \mathrm{Risk}_{\epsilon,\|\cdot\|_\infty}[\widehat x_H \mid \mathcal X] \le 2\max_{\ell\le\nu}\varsigma_\ell. \]
Note that this bound in fact does not require Assumption A.2 (which is satisfied, for any $\rho$, with large enough $\gamma_i$'s).

5.1.4.3 Specifying contrasts
Risk bound (5.12) allows for an easy design of contrast matrices. Recalling that $\Psi$ is monotone on the nonnegative orthant, all we need is to select $h_\ell$'s satisfying (5.10) and resulting in the smallest possible $\varsigma_\ell$'s, which is what we are about to do now.
Preliminaries. Given a vector $b \in \mathbf{R}^m$ and a norm $s(\cdot)$ on $\mathbf{R}^m$, consider the convex-concave saddle point problem
\[ \mathrm{Opt} = \inf_{g\in\mathbf{R}^m}\max_{x\in\mathcal X_s}\left\{\phi(g,x) := [b - A^Tg]^Tx + s(g)\right\} \tag{SP} \]
along with the induced primal and dual problems
\[ \mathrm{Opt}(P) = \inf_{g\in\mathbf{R}^m}\left\{\overline\phi(g) := \max_{x\in\mathcal X_s}\phi(g,x)\right\} = \inf_{g\in\mathbf{R}^m}\left\{s(g) + \max_{x\in\mathcal X_s}[b - A^Tg]^Tx\right\} \tag{P} \]
and
\[ \mathrm{Opt}(D) = \max_{x\in\mathcal X_s}\left\{\underline\phi(x) := \inf_{g\in\mathbf{R}^m}\phi(g,x)\right\} = \max_{x\in\mathcal X_s}\inf_{g\in\mathbf{R}^m}\left\{b^Tx - [Ax]^Tg + s(g)\right\} = \max_x\left\{b^Tx : x \in \mathcal X_s,\ q(Ax) \le 1\right\} \tag{D} \]
where $q(\cdot)$ is the norm conjugate to $s(\cdot)$ (we have used the evident fact that $\inf_{g\in\mathbf{R}^m}[f^Tg + s(g)]$ is either $-\infty$ or $0$ depending on whether $q(f) > 1$ or $q(f) \le 1$). Since $\mathcal X_s$ is compact, we have $\mathrm{Opt}(P) = \mathrm{Opt}(D) = \mathrm{Opt}$ by the Sion-Kakutani Theorem. Besides this, $(D)$ is solvable (which is evident), and $(P)$ is solvable as well, since $\overline\phi(g)$ is continuous due to the compactness of $\mathcal X_s$ and $\overline\phi(g) \ge s(g)$, so that $\overline\phi(\cdot)$ has bounded level sets. Let $\bar g$ be an optimal solution to $(P)$, let $\bar x$ be an optimal solution to $(D)$, and let $\bar h$ be the $s(\cdot)$-unit normalization of $\bar g$, so that $s(\bar h) = 1$ and $\bar g = s(\bar g)\bar h$. Now let us make the following observation:
(5.13)
In addition, for any matrix G = [g 1 , ..., g M ] ∈ Rm×M with s(g j ) ≤ 1, j ≤ M , one has maxx bT x: x ∈ Xs , kGT Axk∞ ≤ 1 (5.14) = maxx bT x : x ∈ Xs , kGT Axk∞ ≤ 1 ≥ Opt. Proof. Let x be a feasible solution to the problem in (5.13). Replacing, if
395
SIGNAL RECOVERY BEYOND LINEAR ESTIMATES
necessary, x with −x, we can assume that bT x = bT x. We now have bT x
= ≤ ≤
bT x = [¯ g T Ax − s(¯ g )] + [b − AT g¯]T x + s(¯ g) {z }  g )=Opt(P ) ≤φ(¯
¯ T Ax − s(¯ ¯ T Ax −s(¯ Opt(P ) + [s(¯ g )h g )] ≤ Opt(P ) + s(¯ g ) h g)  {z } ≤1
Opt(P ) = Opt,
as claimed in (5.13). Now, the equality in (5.14) is due to the symmetry of Xs w.r.t. the origin. To verify the inequality in (5.14), note that x ¯ satisfies the relations x ¯ ∈ Xs and q(A¯ x) ≤ 1, implying, due to the fact that the columns of G are of s(·)norm ≤ 1, that x ¯ is a feasible solution to the optimization problems in (5.14). As a result, the second quantity in (5.14) is at least bT x ¯ = Opt(D) = Opt, and (5.14) follows. ✷ Comment. Note that problem (P ) has a very transparent origin. In the situation of Section 5.1.1, assume that our goal is, to estimate, given observation ω = Ax+ξx , the value at x ∈ X of the linear function bT x, and we want to use for this purpose an estimate gb(ω) = g T ω + γ affine in ω. Given ǫ ∈ (0, 1), how do we construct a presumably good in terms of its ǫrisk estimate? Let us show that a meaningful answer is yielded by the optimal solution to (P ). Indeed, we have bT x − gb(Ax + ξx ) = [b − AT g]T x − γ − g T ξx .
Assume that we have at our disposal a norm s(·) on Rm such that ∀(h ∈ Rm , s(h) ≤ 1, x ∈ X ) : Prob{ξx : hT ξx  > 1} ≤ ǫ, or, which is the same, ∀(g ∈ Rm , x ∈ X ) : Prob{ξx : g T ξx  > s(g)} ≤ ǫ. Then we can safely upperbound the ǫrisk of a candidate estimate gb(·) by the quantity ρ = max [b − AT g]T x − γ +s(g). x∈X {z }  bias B(g, γ)
Observe that for g fixed, the minimal, over γ, bias is
M (g) := max[b − AT g]x. x∈Xs
Postponing verification of this claim, here is the conclusion: in the present setting, problem (P ) is nothing but the problem of building the best in terms of the upper bound ρ on the ǫrisk affine estimate of linear function bT x. It remains to justify the above claim, which is immediate: on one hand, for all u ∈ X , v ∈ X we have B(g, γ) ≥ [b − AT g]T u − γ,
B(g, γ) ≥ −[b − AT g]T v + γ
396
CHAPTER 5
implying that \(B(g,\gamma)\ge\frac12[b-A^Tg]^T[u-v]\) for all \(u\in\mathcal X\), \(v\in\mathcal X\), that is, \(B(g,\gamma)\ge M(g)\). On the other hand, let
\[
M_+(g)=\max_{x\in\mathcal X}[b-A^Tg]^Tx,\qquad M_-(g)=-\min_{x\in\mathcal X}[b-A^Tg]^Tx,
\]
so that \(M(g)=\frac12[M_+(g)+M_-(g)]\). Setting \(\bar\gamma=\frac12[M_+(g)-M_-(g)]\), we have
\[
\begin{array}{rcl}
\max_{x\in\mathcal X}[b-A^Tg]^Tx-\bar\gamma&=&M_+(g)-\bar\gamma=\tfrac12[M_+(g)+M_-(g)]=M(g),\\
\min_{x\in\mathcal X}[b-A^Tg]^Tx-\bar\gamma&=&-M_-(g)-\bar\gamma=-\tfrac12[M_+(g)+M_-(g)]=-M(g).
\end{array}
\]
That is, \(B(g,\bar\gamma)=M(g)\). Combining these observations, we arrive at \(\min_\gamma B(g,\gamma)=M(g)\), as claimed. ✷
Contrast design. Proposition 5.2 and Observation 5.3 allow for a straightforward solution of the associated contrast design problem, at least in the case of the sub-Gaussian, Discrete, and Poisson observation schemes. Indeed, in these cases, when designing a contrast matrix with N columns, with our approach we are supposed to select its columns in the respective sets \(\mathcal H_{\epsilon/N}\); see Section 5.1.3. Note that these sets, while shrinking as N grows, are "nearly independent" of N, since the norms \(\pi_G,\pi_D,\pi_P\) in the description of the respective sets \(\mathcal H^G_\delta,\mathcal H^D_\delta,\mathcal H^P_\delta\) depend on 1/δ via factors logarithmic in 1/δ. It follows that we lose nearly nothing when assuming that N ≥ ν. Let us act as follows: we set N = ν, specify \(\bar\pi(\cdot)\) as the norm (\(\pi_G\), \(\pi_D\), or \(\pi_P\)) associated with the observation scheme (sub-Gaussian, Discrete, or Poisson) in question, and set \(\delta=\epsilon/\nu\). We solve ν convex optimization problems
\[
\mathrm{Opt}_\ell=\min_{g\in\mathbf{R}^m}\Big\{\phi_\ell(g):=\max_{x\in X_s}\phi_\ell(g,x)\Big\},\qquad\phi_\ell(g,x)=[B_\ell-A^Tg]^Tx+\bar\pi(g).\tag{P_\ell}
\]
Next, we convert an optimal solution \(g_\ell\) to \((P_\ell)\) into a vector \(h_\ell\in\mathbf{R}^m\) by representing \(g_\ell=\bar\pi(g_\ell)h_\ell\) with \(\bar\pi(h_\ell)=1\), and set \(\mathrm{Col}_\ell[H]=h_\ell\). As a result, we obtain an m × ν contrast matrix \(H=[h_1,...,h_\nu]\) which, taken along with N = ν, the quantities
\[
\varsigma_\ell=\mathrm{Opt}_\ell,\quad1\le\ell\le\nu,\tag{5.15}
\]
and \(\pi(\cdot)\equiv\bar\pi(\cdot)\), satisfies, in view of the first claim of Observation 5.3 as applied with \(s(\cdot)\equiv\bar\pi(\cdot)\), the premise of Proposition 5.2. Consequently, by Proposition 5.2 we have
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x^H|\mathcal X]\le\Psi([\mathrm{Opt}_1;...;\mathrm{Opt}_\nu]).\tag{5.16}
\]
Comment. Optimality of the outlined contrast design for the sub-Gaussian, Discrete, and Poisson observation schemes stems, within the framework set by Proposition 5.2, from the second claim of Observation 5.3, which states that when N ≥ ν and the columns of the m × N contrast matrix \(H=[h_1,...,h_N]\) belong to the set \(\mathcal H_{\epsilon/N}\) associated with the observation scheme in question—i.e., the norm π(·) in the proposition is the norm \(\pi_G\), \(\pi_D\), or \(\pi_P\) associated with δ = ε/N—the quantities \(\varsigma_\ell\) participating in (5.10.b) cannot be less than \(\mathrm{Opt}_\ell\).
Indeed, the norm π(·) from Proposition 5.2 is ≥ the norm \(\bar\pi(\cdot)\) participating in \((P_\ell)\) (because the value ε/N in the definition of π(·) is at most ε/ν), implying, by (5.10.a), that the columns of a matrix H obeying the premise of the proposition satisfy the relation \(\bar\pi(\mathrm{Col}_j[H])\le1\). Invoking the second part of Observation 5.3 with \(s(\cdot)\equiv\bar\pi(\cdot)\), \(b=B_\ell\), and G = H, and taking (5.10.b) into account, we conclude that \(\varsigma_\ell\ge\mathrm{Opt}_\ell\) for all ℓ, as claimed.
Since the bound on the risk of a polyhedral estimate offered by Proposition 5.2 improves as the \(\varsigma_\ell\)'s decrease, we see that, as far as this bound is concerned, the outlined design procedure is the best possible, provided N ≥ ν. An attractive feature of the contrast design just presented is that it is completely independent of the entities participating in Assumptions A.1–2—these entities affect the theoretical risk bounds of the resulting polyhedral estimate, but not the estimate itself.

5.1.4.4
Illustration: Diagonal case
Let us consider the diagonal case of our estimation problem, where
• \(\mathcal X=\{x\in\mathbf{R}^n:\|Dx\|_\rho\le1\}\), where D is a diagonal matrix with positive diagonal entries \(D_{\ell\ell}=:d_\ell\);
• m = ν = n, and A and B are diagonal matrices with diagonal entries \(0<A_{\ell\ell}=:a_\ell\), \(0<B_{\ell\ell}=:b_\ell\);
• \(\|\cdot\|=\|\cdot\|_r\);
• we are in the sub-Gaussian case, that is, the observation noise \(\xi_x\) is \((0,\sigma^2I_n)\)-sub-Gaussian for every \(x\in\mathcal X\).
Let us implement the approach developed in Sections 5.1.4.1–5.1.4.3.
1. Given the reliability tolerance ε, we set
\[
\delta=\epsilon/n,\qquad\vartheta_G:=\sigma\sqrt{2\ln(2/\delta)}=\sigma\sqrt{2\ln(2n/\epsilon)},\tag{5.17}
\]
and
\[
\mathcal H=\mathcal H^G_\delta=\{h\in\mathbf{R}^n:\ \pi_G(h):=\vartheta_G\|h\|_2\le1\}.
\]
2. We solve ν = n convex optimization problems \((P_\ell)\) associated with \(\bar\pi(\cdot)\equiv\pi_G(\cdot)\), which is immediate: the resulting contrast matrix is \(H=\vartheta_G^{-1}I_n\), and
\[
\mathrm{Opt}_\ell=\varsigma_\ell:=b_\ell\min[\vartheta_G/a_\ell,\,1/d_\ell].\tag{5.18}
\]
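The diagonal-case recipe is simple enough to run end to end. Below is a minimal numerical sketch with illustrative parameters of our own choosing (`a`, `b`, `d` play the roles of the diagonals of A, B, D); it computes \(\vartheta_G\) of (5.17), the contrast matrix \(H=\vartheta_G^{-1}I_n\) with the quantities \(\varsigma_\ell\) of (5.18), and the risk bound of Proposition 5.4 below:

```python
import numpy as np

# Diagonal case of Section 5.1.4.4 -- a sketch with illustrative parameters
# of our own choosing (a, b, d are the diagonals of A, B, D).
n, eps, sigma = 8, 0.01, 0.1
rho, r = 2.0, 2.0
ell = np.arange(1, n + 1)
a = ell ** -1.0          # A = Diag{a_l}
b = ell ** -1.0          # B = Diag{b_l}
d = np.ones(n)           # D = Diag{d_l}

# (5.17): delta = eps/n, theta_G = sigma * sqrt(2 ln(2n/eps))
theta_G = sigma * np.sqrt(2.0 * np.log(2.0 * n / eps))

# contrast H = theta_G^{-1} I_n; (5.18): varsigma_l = b_l min[theta_G/a_l, 1/d_l]
H = np.eye(n) / theta_G
varsigma = b * np.minimum(theta_G / a, 1.0 / d)

# Proposition 5.4: n_bar = n if sum_l chi_l <= 1, else the smallest n_bar
# with sum_{l <= n_bar} chi_l > 1, where chi_l = (theta_G d_l / a_l)^rho
chi = (theta_G * d / a) ** rho
cs = np.cumsum(chi)
n_bar = n if cs[-1] <= 1 else int(np.argmax(cs > 1)) + 1

# risk bound 2 [ sum_{l <= n_bar} (theta_G b_l / a_l)^r ]^{1/r}
risk_bound = 2.0 * np.sum((theta_G * b[:n_bar] / a[:n_bar]) ** r) ** (1.0 / r)
print(n_bar, risk_bound)
```

Note that each column h of H satisfies \(\pi_G(h)=\vartheta_G\|h\|_2=1\), i.e., the columns sit on the boundary of \(\mathcal H^G_\delta\), as the design prescribes.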
Risk analysis. The (ε,‖·‖)-risk of the resulting polyhedral estimate \(\widehat x(\cdot)\) can be bounded via Proposition 5.2. Note that setting \(\gamma_\ell=d_\ell/b_\ell\), \(1\le\ell\le n\), we meet Assumptions A.1–2, and the above choice of H, N = n, and \(\varsigma_\ell\) satisfies the premise of Proposition 5.2. By this proposition,
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x^H|\mathcal X]\le\Psi:=2\max_w\Big\{\|[w_1/\gamma_1;...;w_n/\gamma_n]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell\Big\}.\tag{5.19}
\]
Let us work out what happens in the simple case where
\[
(a)\ \ 1\le\rho\le r<\infty;\qquad(b)\ \ a_\ell/d_\ell\ \text{and}\ b_\ell/a_\ell\ \text{are nonincreasing in}\ \ell.\tag{5.20}
\]
Proposition 5.4. In the simple case just defined, let \(\bar n=n\) when
\[
\sum_{\ell=1}^n(\vartheta_Gd_\ell/a_\ell)^\rho\le1;
\]
otherwise, let \(\bar n\) be the smallest integer such that
\[
\sum_{\ell=1}^{\bar n}(\vartheta_Gd_\ell/a_\ell)^\rho>1,
\]
with \(\vartheta_G\) given by (5.17). Then for the contrast matrix \(H=\vartheta_G^{-1}I_n\) one has
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x^H|\mathcal X]\le\Psi\le2\Big[\sum_{\ell=1}^{\bar n}(\vartheta_Gb_\ell/a_\ell)^r\Big]^{1/r}.
\]
Proof. Consider the optimization problem specifying Ψ in (5.19). Setting θ = r/ρ ≥ 1, let us pass in this problem from the variables \(w_\ell\) to the variables \(z_\ell=w_\ell^\rho\), so that
\[
\Psi^r=2^r\max_z\Big\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r:\ \sum_\ell z_\ell\le1,\ 0\le z_\ell\le(d_\ell\varsigma_\ell/b_\ell)^\rho\Big\}\le2^r\Gamma,
\]
where
\[
\Gamma=\max_z\Big\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r:\ \sum_\ell z_\ell\le1,\ 0\le z_\ell\le\chi_\ell:=(\vartheta_Gd_\ell/a_\ell)^\rho\Big\}
\]
(we have used (5.18)). Note that Γ is the optimal value in the problem of maximizing a convex (since θ ≥ 1) function \(\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r\) over a bounded polyhedral set, so that the maximum is attained at an extreme point \(\bar z\) of the feasible set. By the standard characterization of extreme points, the (clearly nonempty) set I of positive entries of \(\bar z\) is as follows: denoting by I′ the set of indexes ℓ ∈ I such that \(\bar z_\ell\) is on its upper bound, \(\bar z_\ell=\chi_\ell\), the cardinality |I′| of I′ is at least |I| − 1. Since \(\sum_{\ell\in I'}\bar z_\ell=\sum_{\ell\in I'}\chi_\ell\le1\) and the \(\chi_\ell\) are nondecreasing in ℓ by (5.20.b), we conclude that
\[
\sum_{\ell=1}^{|I'|}\chi_\ell\le1,
\]
implying that \(|I'|<\bar n\) provided that \(\bar n<n\), so that in this case \(|I|\le\bar n\); and of course \(|I|\le\bar n\) when \(\bar n=n\). Next, we have
\[
\Gamma=\sum_{\ell\in I}\bar z_\ell^\theta(b_\ell/d_\ell)^r\le\sum_{\ell\in I}\chi_\ell^\theta(b_\ell/d_\ell)^r=\sum_{\ell\in I}(\vartheta_Gb_\ell/a_\ell)^r,
\]
and since \(b_\ell/a_\ell\) is nonincreasing in ℓ and \(|I|\le\bar n\), the latter quantity is at most \(\sum_{\ell=1}^{\bar n}(\vartheta_Gb_\ell/a_\ell)^r\). ✷
Application. Consider the "standard case" [72, 74] where
\[
0<\sigma\sqrt{\ln(2n/\epsilon)}\le1,\qquad a_\ell=\ell^{-\alpha},\quad b_\ell=\ell^{-\beta},\quad d_\ell=\ell^{\kappa},
\]
with β ≥ α ≥ 0, κ ≥ 0, and (β − α)r < 1. In this case, for large n, namely,
\[
n\ge c\,\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}}\qquad\Big[\vartheta_G=\sigma\sqrt{2\ln(2n/\epsilon)}\Big]\tag{5.21}
\]
(here and in what follows, the factors denoted by c and C depend solely on α, β, κ, r, ρ), we get
\[
\bar n\le C\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}},
\]
resulting in
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x|\mathcal X]\le C\vartheta_G^{\frac{\beta+\kappa+1/\rho-1/r}{\alpha+\kappa+1/\rho}}.\tag{5.22}
\]
Setting \(x=D^{-1}y\), \(\bar\alpha=\alpha+\kappa\), \(\bar\beta=\beta+\kappa\), and treating y, rather than x, as the signal underlying the observation, we obtain an estimation problem similar to the original one, in which α, β, κ, and \(\mathcal X\) are replaced, respectively, with \(\bar\alpha\), \(\bar\beta\), \(\bar\kappa=0\), and \(\mathcal Y=\{y:\|y\|_\rho\le1\}\), while A, B are replaced with \(\bar A=\mathrm{Diag}\{\ell^{-\bar\alpha},\ell\le n\}\), \(\bar B=\mathrm{Diag}\{\ell^{-\bar\beta},\ell\le n\}\). When n is large enough, namely \(n\ge\sigma^{-\frac{1}{\bar\alpha+1/\rho}}\), \(\mathcal Y\) contains the "coordinate box"
\[
\mathcal Y'=\{y:\ |y_\ell|\le m^{-1/\rho}\ \text{for}\ m/2\le\ell\le m,\ y_\ell=0\ \text{otherwise}\}
\]
of dimension ≥ m/2, where \(m\ge c\sigma^{-\frac{1}{\bar\alpha+1/\rho}}\). Observe that for all \(y\in\mathcal Y'\) we have \(\|\bar Ay\|_2\le Cm^{-\bar\alpha}\|y\|_2\) and \(\|\bar By\|_r\ge cm^{-\bar\beta}\|y\|_r\). This observation, when combined with the Fano inequality, implies (cf. [79]) that for ε ≪ 1 the minimax optimal, w.r.t. the family of all Borel estimates, (ε,‖·‖_r)-risk on the signal set \(D^{-1}\mathcal Y'\subset\mathcal X\) is at least
\[
c\sigma^{\frac{\bar\beta+1/\rho-1/r}{\bar\alpha+1/\rho}}.
\]
In other words, in this situation the upper bound (5.22) on the risk of the polyhedral estimate is within a factor logarithmic in n/ε of the minimax risk. In particular, and not surprisingly, in the case of β = 0 the polyhedral estimate attains the well-known optimal rates [72, 109].

5.1.5 Efficient upper-bounding of R[H] and contrast design, II
5.1.5.1 Outline
In this section we develop an alternative approach to the design of polyhedral estimates which resembles in many respects the approach to building linear estimates from Chapter 4. Recall that the principal technique underlying the design of a presumably good linear estimate \(\widehat x^H(\omega)=H^T\omega\) was upper-bounding the maximal risk of the estimate—the maximum of a quadratic form, depending on H as a parameter, over the signal set \(\mathcal X\)—and we were looking for a bounding scheme allowing us to efficiently optimize the bound in H. The design of a presumably good polyhedral estimate also reduces to minimizing
the optimal value in a parametric maximization problem (5.5) over the contrast matrix H. However, while the design of a presumably good linear estimate reduces to unconstrained minimization, to conceive a polyhedral estimate we need to minimize the bound R[H] on the estimation risk under a restriction on the contrast matrix H: the columns \(h_\ell\) of this matrix should satisfy condition (5.1). In other words, in the case of the polyhedral estimate the "design parameter" affects the constraints of the optimization problem rather than the objective.

Our strategy can be outlined as follows. Let us denote by \(B_*=\{u\in\mathbf{R}^\nu:\|u\|_*\le1\}\) the unit ball of the norm \(\|\cdot\|_*\) conjugate to the norm ‖·‖ in the formulation of the estimation problem in Section 5.1.2. Assume that we have at our disposal a technique for bounding quadratic forms on the set \(B_*\times X_s\); in other words, we have an efficiently computable convex function \(\mathfrak M(M)\) on \(\mathbf S^{\nu+n}\) such that
\[
\mathfrak M(M)\ge\max_{[u;z]\in B_*\times X_s}[u;z]^TM[u;z]\quad\forall M\in\mathbf S^{\nu+n}.\tag{5.23}
\]
Note that the upper bound R[H], as defined in (5.5), on the risk of a candidate polyhedral estimate \(\widehat x^H\) is nothing but
\[
R[H]=2\max_{[u;z]}\Big\{[u;z]^T\underbrace{\left[\begin{array}{cc}0&\frac12B\\\frac12B^T&0\end{array}\right]}_{B_+}[u;z]:\ u\in B_*,\ z\in X_s,\ z^TA^Th_\ell h_\ell^TAz\le1,\ \ell\le N\Big\}.\tag{5.24}
\]
Given \(\lambda\in\mathbf{R}^N_+\), the constraints \(z^TA^Th_\ell h_\ell^TAz\le1\) in (5.24) can be aggregated to yield the quadratic constraint
\[
z^TA^T\Theta_\lambda Az\le\mu_\lambda,\qquad\Theta_\lambda=H\mathrm{Diag}\{\lambda\}H^T,\quad\mu_\lambda=\sum_\ell\lambda_\ell.
\]
Observe that for every λ ≥ 0 we have
\[
R[H]\le2\,\mathfrak M\Big(\underbrace{\left[\begin{array}{cc}0&\frac12B\\\frac12B^T&-A^T\Theta_\lambda A\end{array}\right]}_{B_+[\Theta_\lambda]}\Big)+2\mu_\lambda.\tag{5.25}
\]
Indeed, let [u;z] be a feasible solution to the optimization problem in (5.24) specifying R[H]. Then
\[
[u;z]^TB_+[u;z]=[u;z]^TB_+[\Theta_\lambda][u;z]+z^TA^T\Theta_\lambda Az;
\]
the first term on the right-hand side is \(\le\mathfrak M(B_+[\Theta_\lambda])\) since \([u;z]\in B_*\times X_s\), and the second term, as we have already seen, is \(\le\mu_\lambda\); (5.25) follows.
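The aggregation step can be checked numerically: for any λ ≥ 0 and any z satisfying the N individual constraints, \(z^TA^T\Theta_\lambda Az=\sum_\ell\lambda_\ell(h_\ell^TAz)^2\le\sum_\ell\lambda_\ell=\mu_\lambda\). A sketch with random illustrative data (all dimensions and seeds are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 5, 4, 6
A = rng.standard_normal((m, n))
H = rng.standard_normal((m, N))            # columns h_1, ..., h_N
lam = rng.uniform(0.1, 1.0, size=N)        # lambda >= 0

Theta = H @ np.diag(lam) @ H.T             # Theta_lambda
mu = lam.sum()                             # mu_lambda

# pick a z with (h_l^T A z)^2 <= 1 for all l (rescale a random point)
z = rng.standard_normal(n)
z = z / max(1.0, np.abs(H.T @ A @ z).max())

# aggregated constraint: z^T A^T Theta_lambda A z <= mu_lambda
lhs = float(z @ A.T @ Theta @ A @ z)
print(lhs, mu)
```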
Now assume that we have at our disposal a computationally tractable cone
\[
\mathbf H\subset\mathbf S^m_+\times\mathbf R_+
\]
satisfying the following assumption:

C. Whenever (Θ,µ) ∈ 𝐇, we can efficiently find an m × N matrix \(H=[h_1,...,h_N]\) and a nonnegative vector \(\lambda\in\mathbf{R}^N_+\) such that
\[
\begin{array}{ll}
(a)&h_\ell\ \text{satisfies (5.1)},\ 1\le\ell\le N,\\
(b)&\Theta=H\mathrm{Diag}\{\lambda\}H^T,\\
(c)&\sum_i\lambda_i\le\mu.
\end{array}\tag{5.26}
\]
The following simple observation is crucial to what follows:

Proposition 5.5. Consider the estimation problem posed in Section 5.1.1, and let the efficiently computable convex function \(\mathfrak M\) and the computationally tractable closed convex cone 𝐇 satisfy (5.23) and Assumption C, respectively. Consider the convex optimization problem
\[
\mathrm{Opt}=\min_{\tau,\Theta,\mu}\Big\{2\tau+2\mu:\ (\Theta,\mu)\in\mathbf H,\ \mathfrak M(B_+[\Theta])\le\tau\Big\},\qquad
B_+[\Theta]=\left[\begin{array}{cc}0&\frac12B\\\frac12B^T&-A^T\Theta A\end{array}\right].\tag{5.27}
\]
Given a feasible solution (τ,Θ,µ) to this problem, by C we can efficiently convert it to (H,λ) such that \(H=[h_1,...,h_N]\) with \(h_\ell\) satisfying (5.1) and λ ≥ 0 with \(\sum_\ell\lambda_\ell\le\mu\). We have \(R[H]\le2\tau+2\mu\), whence the (ε,‖·‖)-risk of the polyhedral estimate \(\widehat x^H\) satisfies the bound
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x^H|\mathcal X]\le2\tau+2\mu.\tag{5.28}
\]
Consequently, we can efficiently construct polyhedral estimates with (ε,‖·‖)-risk arbitrarily close to Opt (and with risk exactly Opt, provided problem (5.27) is solvable).

Proof. The proof is readily given by the reasoning preceding the proposition. Indeed, with τ, Θ, µ, H, λ as in the premise of the proposition, the columns \(h_\ell\) of H satisfy (5.1) by C, implying, by Proposition 5.1, that \(\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x^H|\mathcal X]\le R[H]\). Besides this, C says that for our H, λ it holds that \(\Theta=\Theta_\lambda\) and \(\mu_\lambda\le\mu\), so that (5.25) combines with the constraints of (5.27) to imply that \(R[H]\le2\tau+2\mu\), and (5.28) follows by Proposition 5.1. ✷

The approach to the design of polyhedral estimates we develop in this section amounts to reducing the construction of the estimate (i.e., of the contrast matrix H) to finding (nearly) optimal solutions to (5.27). Implementing this approach requires devising techniques for constructing cones 𝐇 satisfying C along with efficiently computable functions \(\mathfrak M(\cdot)\) satisfying (5.23). These tasks are the subjects of the sections to follow.

5.1.5.2
Specifying cones H
We specify cones 𝐇 in the case when the number N of columns in the candidate contrast matrices is m, under the following assumption on the given reliability tolerance ε and the observation scheme in question:

D. There is a computationally tractable convex compact subset \(\mathcal Z\subset\mathbf R^m_+\)
intersecting \(\mathrm{int}\,\mathbf R^m_+\) such that the norm
\[
\pi(h)=\max_{z\in\mathcal Z}\sqrt{\sum_iz_ih_i^2}
\]
induced by \(\mathcal Z\) satisfies the relation
\[
\pi(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\epsilon/m\quad\forall x\in\mathcal X.
\]
Note that condition D is satisfied for the sub-Gaussian, Discrete, and Poisson observation schemes; according to the results of Section 5.1.3,
• in the sub-Gaussian case, it suffices to take \(\mathcal Z=\{2\sigma^2\ln(2m/\epsilon)[1;...;1]\}\);
• in the Discrete case, it suffices to take
\[
\mathcal Z=\frac{4\ln(2m/\epsilon)}{K}A\mathcal X+\frac{64\ln^2(2m/\epsilon)}{9K^2}\Delta^m,
\]
where \(A\mathcal X=\{Ax:x\in\mathcal X\}\) and \(\Delta^m=\{y\in\mathbf R^m:y\ge0,\ \sum_iy_i=1\}\);
• in the Poisson case, it suffices to take
\[
\mathcal Z=2\ln(2m/\epsilon)A\mathcal X+\tfrac{16}{9}\ln^2(2m/\epsilon)\Delta^m,
\]
with \(A\mathcal X\) and \(\Delta^m\) as above.
Note that in all these cases \(\mathcal Z\) depends on ε and m only "marginally"—logarithmically.

Under Assumption D, the cone 𝐇 can be built as follows:
• When \(\mathcal Z\) is a singleton, \(\mathcal Z=\{\bar z\}\), so that π(·) is a scaled Euclidean norm, we set
\[
\mathbf H=\Big\{(\Theta,\mu)\in\mathbf S^m_+\times\mathbf R_+:\ \mu\ge\sum_i\bar z_i\Theta_{ii}\Big\}.
\]
Given \((\Theta,\mu)\in\mathbf H\), the m × m matrix H and \(\lambda\in\mathbf R^m_+\) are built as follows: setting \(S=\mathrm{Diag}\{\sqrt{\bar z_1},...,\sqrt{\bar z_m}\}\), we compute the eigenvalue decomposition of the matrix SΘS,
\[
S\Theta S=U\mathrm{Diag}\{\lambda\}U^T
\]
with orthonormal U, and set \(H=S^{-1}U\), thus ensuring \(\Theta=H\mathrm{Diag}\{\lambda\}H^T\). Since \(\mu\ge\sum_i\bar z_i\Theta_{ii}\), we have \(\sum_i\lambda_i=\mathrm{Tr}(S\Theta S)\le\mu\). Finally, a column h of H is of the form \(S^{-1}f\) with a \(\|\cdot\|_2\)-unit vector f, implying that
\[
\pi(h)=\sqrt{\sum_i\bar z_i[S^{-1}f]_i^2}=\sqrt{\sum_if_i^2}=1,
\]
so that h satisfies (5.1) by D.
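The singleton-case conversion of (Θ,µ) into (H,λ) is a plain eigenvalue decomposition. A minimal numerical sketch (random illustrative Θ and z̄, with µ taken at its smallest admissible value):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
zbar = rng.uniform(0.5, 2.0, size=m)           # the singleton Z = {zbar}
G = rng.standard_normal((m, m))
Theta = G @ G.T                                # some Theta >= 0
mu = float(zbar @ np.diag(Theta))              # smallest admissible mu >= sum_i zbar_i Theta_ii

S = np.diag(np.sqrt(zbar))
lam, U = np.linalg.eigh(S @ Theta @ S)         # S Theta S = U Diag{lam} U^T
H = np.linalg.inv(S) @ U                       # H = S^{-1} U

# then Theta = H Diag{lam} H^T, sum(lam) = Tr(S Theta S) <= mu, and
# each column h of H has pi(h) = sqrt(sum_i zbar_i h_i^2) = 1
pi_cols = np.sqrt(zbar @ (H ** 2))
```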
• When \(\mathcal Z\) is not a singleton, we set
\[
\phi(r)=\max_{z\in\mathcal Z}z^Tr,\qquad\varkappa=2\sqrt{6\ln(2\sqrt3\,m)},\qquad
\mathbf H=\{(\Theta,\mu)\in\mathbf S^m_+\times\mathbf R_+:\ \mu\ge\varkappa\,\phi(\mathrm{dg}(\Theta))\},\tag{5.29}
\]
where dg(Q) stands for the diagonal of a (square) matrix Q. Note that φ(r) > 0 whenever r ≥ 0, r ≠ 0, since \(\mathcal Z\) contains a positive vector. The justification of this construction, along with an efficient (randomized) algorithm for converting a pair \((\Theta,\mu)\in\mathbf H\) into (H,λ) satisfying, together with (Θ,µ), the requirements of C, is given by the following

Lemma 5.6. Let the norm π(·) satisfy D.
(i) Whenever H is an m × m matrix with columns \(h_\ell\) satisfying \(\pi(h_\ell)\le1\) and \(\lambda\in\mathbf R^m_+\), we have
\[
\Big(\Theta_\lambda=H\mathrm{Diag}\{\lambda\}H^T,\ \ \mu=\varkappa\sum_i\lambda_i\Big)\in\mathbf H.
\]
(ii) Given \((\Theta,\mu)\in\mathbf H\) with Θ ≠ 0, we find a decomposition \(\Theta=QQ^T\) with an m × m matrix Q and fix an orthonormal m × m matrix V with the magnitudes of its entries not exceeding \(\sqrt{2/m}\) (e.g., the orthonormal scaling of the matrix of the cosine transform). When µ > 0, we set \(\lambda=\frac{\mu}{m}[1;...;1]\in\mathbf R^m\) and consider the random matrix
\[
H_\chi=\sqrt{\tfrac{m}{\mu}}\,Q\,\mathrm{Diag}\{\chi\}V,
\]
where χ is an m-dimensional Rademacher random vector. We have
\[
H_\chi\mathrm{Diag}\{\lambda\}H_\chi^T\equiv\Theta,\qquad\lambda\ge0,\qquad\sum_i\lambda_i=\mu.\tag{5.30}
\]
Moreover, the probability of the event
\[
\pi(\mathrm{Col}_\ell[H_\chi])\le1\quad\forall\ell\le m\tag{5.31}
\]
is at least 1/2. Thus, generating independent samples of χ and terminating with \(H=H_\chi\) when the latter matrix satisfies (5.31), we terminate with probability 1 with (H,λ) satisfying C, and the probability for the outlined procedure to terminate in the course of the first M = 1,2,... steps is at least \(1-2^{-M}\).
When µ = 0, we have Θ = 0 (since µ = 0 implies φ(dg(Θ)) = 0, which with Θ ⪰ 0 is possible only when Θ = 0); thus, when µ = 0, we set \(H=0_{m\times m}\) and \(\lambda=0_{m\times1}\).

Note that the lemma states, essentially, that the cone 𝐇 is a tight, up to a factor logarithmic in m, inner approximation of the set
\[
\Big\{(\Theta,\mu):\ \exists(\lambda\in\mathbf R^m_+,\ H\in\mathbf R^{m\times m}):\ \Theta=H\mathrm{Diag}\{\lambda\}H^T,\ \pi(\mathrm{Col}_\ell[H])\le1,\ \ell\le m,\ \mu\ge\sum_\ell\lambda_\ell\Big\}.
\]
For the proof, see Section 5.4.2.
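Part (ii) of the lemma is easy to try out numerically. The sketch below (random illustrative Θ and µ; whether (Θ,µ) actually lies in 𝐇 depends on κ and \(\mathcal Z\), which we do not model) uses a normalized Sylvester–Hadamard matrix as the orthonormal V with small entries, which requires m to be a power of 2, and verifies that the identity (5.30) holds for the drawn realization of χ:

```python
import numpy as np

def sylvester_hadamard(k):
    """2^k x 2^k +-1 Hadamard matrix (Sylvester construction)."""
    Hd = np.ones((1, 1))
    for _ in range(k):
        Hd = np.block([[Hd, Hd], [Hd, -Hd]])
    return Hd

rng = np.random.default_rng(2)
m = 8                                          # power of 2 for the Hadamard construction
G = rng.standard_normal((m, m))
Theta = G @ G.T                                # some Theta > 0
mu = 3.0                                       # an illustrative mu > 0

Q = np.linalg.cholesky(Theta)                  # Theta = Q Q^T
V = sylvester_hadamard(3) / np.sqrt(m)         # orthonormal, |V_ij| = 1/sqrt(m) <= sqrt(2/m)
lam = np.full(m, mu / m)                       # lambda = (mu/m)[1;...;1]
chi = rng.choice([-1.0, 1.0], size=m)          # Rademacher vector

H_chi = np.sqrt(m / mu) * Q @ np.diag(chi) @ V
# (5.30): H_chi Diag{lam} H_chi^T = Q Diag{chi} V V^T Diag{chi} Q^T = Theta, for every chi
```

The identity holds deterministically; only the column-norm event (5.31) is random, which is why the conversion is a (rapidly terminating) rejection sampler.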
5.1.5.3
Specifying functions M
In this section we focus on computationally efficient upper-bounding of maxima of quadratic forms over convex compact sets symmetric w.r.t. the origin by semidefinite relaxation, our goal being to specify a "presumably good" efficiently computable convex function \(\mathfrak M(\cdot)\) satisfying (5.23).

Cones compatible with convex sets. Given a nonempty convex compact set \(\mathcal Y\subset\mathbf R^N\), we say that a cone 𝐘 is compatible with \(\mathcal Y\) if
• 𝐘 is a closed convex computationally tractable cone contained in \(\mathbf S^N_+\times\mathbf R_+\);
• one has
\[
\forall(V,\tau)\in\mathbf Y:\ \max_{y\in\mathcal Y}y^TVy\le\tau;\tag{5.32}
\]
• 𝐘 contains a pair (V,τ) with V ≻ 0;
• the relations (V,τ) ∈ 𝐘 and τ′ ≥ τ imply that (V,τ′) ∈ 𝐘.⁴

[Footnote 4: The latter requirement comes "for free": passing from a computationally tractable closed convex cone \(\mathbf Y\subset\mathbf S^N_+\times\mathbf R_+\) satisfying (5.32) to the cone \(\mathbf Y^+=\{(V,\tau):\exists\bar\tau\le\tau:(V,\bar\tau)\in\mathbf Y\}\), we get a cone larger than 𝐘 and still compatible with \(\mathcal Y\). It will be clear from the sequel that in our context, the larger a cone compatible with \(\mathcal Y\) is, the better.]

We call a cone 𝐘 sharp if 𝐘 is a closed convex cone contained in \(\mathbf S^N_+\times\mathbf R_+\) such that the only pair (V,τ) ∈ 𝐘 with τ = 0 is the pair (0,0)—equivalently, such that a sequence \(\{(V_i,\tau_i)\in\mathbf Y,\ i\ge1\}\) is bounded if and only if the sequence \(\{\tau_i,\ i\ge1\}\) is bounded. Note that whenever the linear span of \(\mathcal Y\) is the entire \(\mathbf R^N\), every cone compatible with \(\mathcal Y\) is sharp.

Observe that if \(\mathcal Y\subset\mathbf R^N\) is a nonempty convex compact set and 𝐘 is a cone compatible with a shift \(\mathcal Y-a\) of \(\mathcal Y\), then 𝐘 is compatible with \(\mathcal Y_s\). Indeed, when shifting a set \(\mathcal Y\), its symmetrization \(\frac12[\mathcal Y-\mathcal Y]\) remains intact, so that we may assume that 𝐘 is compatible with \(\mathcal Y\) itself. Now let (V,τ) ∈ 𝐘 and \(y,y'\in\mathcal Y\). We have
\[
[y-y']^TV[y-y']+\underbrace{[y+y']^TV[y+y']}_{\ge0}=2\big[y^TVy+[y']^TVy'\big]\le4\tau,
\]
whence for \(z=\frac12[y-y']\) it holds \(z^TVz\le\tau\). Since every \(z\in\mathcal Y_s\) is of the form \(\frac12[y-y']\) with \(y,y'\in\mathcal Y\), the claim follows.
Note that the claim can be "nearly inverted": if \(0\in\mathcal Y\) and 𝐘 is compatible with \(\mathcal Y_s\), then the "widening" of 𝐘—the cone
\[
\mathbf Y^+=\{(V,\tau):(V,\tau/4)\in\mathbf Y\}
\]
—is compatible with \(\mathcal Y\) (evident, since when \(0\in\mathcal Y\), every vector from \(\mathcal Y\) is proportional, with coefficient 2, to a vector from \(\mathcal Y_s\)).

Constructing functions \(\mathfrak M\). The role of compatibility in our context becomes clear from the following observation:

Proposition 5.7. In the situation described in Section 5.1.1, assume that we have at our disposal cones 𝐗 and 𝐔 compatible, respectively, with \(X_s\) and with the unit
ball \(B_*=\{u\in\mathbf R^\nu:\|u\|_*\le1\}\) of the norm \(\|\cdot\|_*\) conjugate to the norm ‖·‖. Given \(M\in\mathbf S^{\nu+n}\), let us set
\[
\mathfrak M(M)=\inf_{X,t,U,s}\big\{t+s:\ (X,t)\in\mathbf X,\ (U,s)\in\mathbf U,\ \mathrm{Diag}\{U,X\}\succeq M\big\}.\tag{5.33}
\]
Then \(\mathfrak M\) is a real-valued efficiently computable convex function on \(\mathbf S^{\nu+n}\) such that (5.23) takes place: for every \(M\in\mathbf S^{\nu+n}\) it holds
\[
\mathfrak M(M)\ge\max_{[u;z]\in B_*\times X_s}[u;z]^TM[u;z].
\]
In addition, when 𝐗 and 𝐔 are sharp, the infimum in (5.33) is achieved.

Proof. The proof is immediate. Since the objective of the optimization problem specifying \(\mathfrak M(M)\) is nonnegative on the feasible set, the fact that \(\mathfrak M\) is real-valued is equivalent to the problem's feasibility, and the latter is readily given by the fact that 𝐗 is a cone containing a pair (X,t) with X ≻ 0, and similarly for 𝐔. Convexity of \(\mathfrak M\) is evident. To verify (5.23), let (X,t,U,s) form a feasible solution to the optimization problem in (5.33). When \([u;z]\in B_*\times X_s\), we have
\[
[u;z]^TM[u;z]\le u^TUu+z^TXz\le s+t,
\]
where the first inequality is due to the ⪰-constraint in (5.33), and the second is due to the fact that 𝐔 is compatible with \(B_*\) and 𝐗 is compatible with \(X_s\). Since the resulting inequality holds true for all feasible solutions to the optimization problem in (5.33), (5.23) follows. Finally, when 𝐗 and 𝐔 are sharp, (5.33) is a feasible conic problem with bounded level sets of the objective, and as such is solvable. ✷

5.1.5.4
Putting things together
The following statement, combining the results of Propositions 5.7 and 5.5, summarizes our second approach to the design of the polyhedral estimate.

Proposition 5.8. In the situation of Section 5.1.1, assume that we have at our disposal cones 𝐗 and 𝐔 compatible, respectively, with \(X_s\) and with the unit ball \(B_*\) of the norm conjugate to ‖·‖. Given a reliability tolerance ε ∈ (0,1) along with a positive integer N and a computationally tractable cone 𝐇 satisfying Assumption C, consider the (clearly feasible) convex optimization problem
\[
\mathrm{Opt}=\min_{\Theta,\mu,X,t,U,s}\Big\{f(t,s,\mu):=2(t+s+\mu):\ (X,t)\in\mathbf X,\ (U,s)\in\mathbf U,\ (\Theta,\mu)\in\mathbf H,\ \left[\begin{array}{cc}U&\frac12B\\\frac12B^T&A^T\Theta A+X\end{array}\right]\succeq0\Big\}.\tag{5.34}
\]
Let (Θ,µ,X,t,U,s) be a feasible solution to (5.34). Invoking C, we can convert, in a computationally efficient manner, (Θ,µ) into (H,λ) such that the columns of the m × N contrast matrix H satisfy (5.1), \(\Theta=H\mathrm{Diag}\{\lambda\}H^T\), and \(\mu\ge\sum_\ell\lambda_\ell\). The
(ε,‖·‖)-risk of the polyhedral estimate \(\widehat x^H\) satisfies the bound
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x^H|\mathcal X]\le f(t,s,\mu).\tag{5.35}
\]
In particular, we can build, in a computationally efficient manner, polyhedral estimates with risks arbitrarily close to Opt (and with risk Opt, provided that (5.34) is solvable).

Proof. Let (Θ,µ,X,t,U,s) form a feasible solution to (5.34). The semidefinite constraint in (5.34) is equivalent (conjugate both sides of it by \(\mathrm{Diag}\{-I_\nu,I_n\}\)) to
\[
\left[\begin{array}{cc}U&-\frac12B\\-\frac12B^T&A^T\Theta A+X\end{array}\right]=\mathrm{Diag}\{U,X\}-\underbrace{\left[\begin{array}{cc}0&\frac12B\\\frac12B^T&-A^T\Theta A\end{array}\right]}_{=:M}\succeq0,
\]
whence for the function \(\mathfrak M\) defined in (5.33) one has
\[
\mathfrak M(M)\le t+s.
\]
Since \(\mathfrak M\), by Proposition 5.7, satisfies (5.23), invoking Proposition 5.5 we arrive at \(R[H]\le2(\mu+\mathfrak M(M))\le f(t,s,\mu)\). By Proposition 5.1 this implies the target relation (5.35). ✷

5.1.5.5
Compatibility: Basic examples and calculus
Our approach to the design of polyhedral estimates utilizing the recipe described in Proposition 5.8 relies upon our ability to equip convex "sets of interest" (in our context, these are the symmetrization \(X_s\) of the signal set and the unit ball \(B_*\) of the norm conjugate to the norm ‖·‖) with compatible cones.⁵ Below, we discuss two principal sources of such cones, namely (a) spectratopes/ellitopes and (b) absolute norms. More examples of compatible cones can be constructed using a "compatibility calculus": assuming we are given a finite collection of convex sets (operands) and apply to them some basic operation, such as taking the intersection, the arithmetic sum, a direct or inverse linear image, or the convex hull of the union, cones compatible with the result of the operation can be easily (in a fully algorithmic fashion) obtained from cones compatible with the operands; see Section 5.1.8 for the principal calculus rules.

In view of Proposition 5.8, the larger the cones 𝐗 and 𝐔 compatible with \(X_s\) and \(B_*\), the better—the wider is the optimization domain in (5.34) and, consequently, the smaller is the (best) risk bound achievable with the recipe presented in the proposition. Given a convex compact set \(\mathcal Y\subset\mathbf R^N\), the "ideal"—the largest—candidate for the role of a cone compatible with \(\mathcal Y\) would be
\[
\mathbf Y_*=\{(V,\tau)\in\mathbf S^N_+\times\mathbf R_+:\ \tau\ge\max_{y\in\mathcal Y}y^TVy\}.
\]
However, this cone is typically intractable; therefore, we look for "as large as possible" tractable inner approximations of \(\mathbf Y_*\).

[Footnote 5: Recall that we already know how to specify the second element of the construction, the cone 𝐇.]

5.1.5.5.A. Cones compatible with ellitopes/spectratopes are readily given by semidefinite relaxation. Specifically, when
\[
\mathcal Y=\{y\in\mathbf R^N:\ \exists(r\in\mathcal R,\ z\in\mathbf R^K):\ y=Mz,\ R_\ell^2[z]\preceq r_\ell I_{d_\ell},\ \ell\le L\},\qquad R_\ell[z]=\sum_jz_jR^{\ell j},\ R^{\ell j}\in\mathbf S^{d_\ell},
\]
with our standard restrictions on \(\mathcal R\), invoking Proposition 4.8 it is immediately seen that the set
\[
\mathbf Y=\Big\{(V,\tau)\in\mathbf S^N_+\times\mathbf R_+:\ \exists\Lambda=\{\Lambda_\ell\in\mathbf S^{d_\ell}_+,\ \ell\le L\}:\ \phi_{\mathcal R}(\lambda[\Lambda])\le\tau,\ M^TVM\preceq\sum_\ell\mathcal R^*_\ell[\Lambda_\ell]\Big\}\tag{5.36}
\]
is a closed convex cone which is compatible with \(\mathcal Y\); here, as usual,
\[
[\mathcal R^*_\ell[\Lambda_\ell]]_{ij}=\mathrm{Tr}(R^{\ell i}\Lambda_\ell R^{\ell j}),\qquad\lambda[\Lambda]=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_L)],\qquad\phi_{\mathcal R}(\lambda)=\max_{r\in\mathcal R}r^T\lambda.
\]
Similarly, when \(\mathcal Y\) is an ellitope,
\[
\mathcal Y=\{y\in\mathbf R^N:\ \exists(r\in\mathcal R,\ z\in\mathbf R^K):\ y=Mz,\ z^TR_\ell z\le r_\ell,\ \ell\le L\},
\]
with our standard restrictions on the \(R_\ell\), invoking Proposition 4.6, the set
\[
\mathbf Y=\Big\{(V,\tau)\in\mathbf S^N_+\times\mathbf R_+:\ \exists\lambda\in\mathbf R^L_+:\ M^TVM\preceq\sum_\ell\lambda_\ell R_\ell,\ \phi_{\mathcal R}(\lambda)\le\tau\Big\}\tag{5.37}
\]
is a closed convex cone which is compatible with \(\mathcal Y\). In both cases, 𝐘 is sharp, provided that the image space of M is the entire \(\mathbf R^N\). Note that in both these cases 𝐘 is a reasonably tight inner approximation of \(\mathbf Y_*\): whenever \((V,\tau)\in\mathbf Y_*\), we have \((V,\theta\tau)\in\mathbf Y\) with a moderate θ (specifically, \(\theta=O(1)\ln\big(2\sum_\ell d_\ell\big)\) in the spectratopic and \(\theta=O(1)\ln(2L)\) in the ellitopic case; see Propositions 4.8 and 4.6, respectively).

5.1.5.5.B. Compatibility via absolute norms. Preliminaries. Recall that a norm p(·) on \(\mathbf R^N\) is called absolute if p(x) is a function of the vector \(\mathrm{abs}[x]:=[|x_1|;...;|x_N|]\) of the magnitudes of entries of x. It is well known that an absolute norm p is monotone on \(\mathbf R^N_+\), so that \(\mathrm{abs}[x]\le\mathrm{abs}[x']\) implies \(p(x)\le p(x')\), and that the norm
\[
p_*(x)=\max_{y:p(y)\le1}x^Ty
\]
conjugate to p(·) is absolute along with p. Let us say that an absolute norm r(·) fits an absolute norm p(·) on \(\mathbf R^N\) if for every vector x with \(p(x)\le1\) the entrywise square \([x]^2=[x_1^2;...;x_N^2]\) of x satisfies \(r([x]^2)\le1\). For example, the largest norm r(·) which fits the absolute norm \(p(\cdot)=\|\cdot\|_s\), \(s\in[1,\infty]\), is
\[
r(\cdot)=\begin{cases}\|\cdot\|_1,&1\le s\le2,\\\|\cdot\|_{s/2},&s\ge2.\end{cases}
\]
An immediate observation is that an absolute norm p(·) on \(\mathbf R^N\) can be "lifted" to a norm on \(\mathbf S^N\), specifically, the norm
\[
p^+(Y)=p\big([p(\mathrm{Col}_1[Y]);...;p(\mathrm{Col}_N[Y])]\big):\ \mathbf S^N\to\mathbf R_+,\tag{5.38}
\]
where \(\mathrm{Col}_j[Y]\) is the j-th column of Y. It is immediately seen that when p is an absolute norm, the right-hand side of (5.38) indeed is a norm on \(\mathbf S^N\) satisfying the identity
\[
p^+(xx^T)=p^2(x),\quad x\in\mathbf R^N.\tag{5.39}
\]
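The identity (5.39) is a one-line computation: the j-th column of \(xx^T\) is \(x_jx\), so its p-norm is \(|x_j|p(x)\), and applying p once more to the vector of these column norms gives \(p(x)\cdot p(x)\). A quick numerical check for \(p=\|\cdot\|_s\) (function names are ours):

```python
import numpy as np

def p(x, s):                        # the absolute norm p = ||.||_s
    return np.linalg.norm(x, ord=s)

def p_plus(Y, s):                   # the lifted norm (5.38)
    col_norms = np.array([p(Y[:, j], s) for j in range(Y.shape[1])])
    return p(col_norms, s)

rng = np.random.default_rng(3)
x = rng.standard_normal(5)
# p^+(x x^T) versus p^2(x), identity (5.39), for several s
vals = {s: (p_plus(np.outer(x, x), s), p(x, s) ** 2) for s in (1, 1.5, 2, 3)}
```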
Absolute norms and compatibility. Our interest in absolute norms is motivated by the following immediate observation:

Observation 5.9. Let p(·) be an absolute norm on \(\mathbf R^N\), and let r(·) be another absolute norm which fits p(·), both norms being computationally tractable. These norms give rise to the computationally tractable and sharp closed convex cone
\[
\mathbf P=\mathbf P_{p(\cdot),r(\cdot)}=\Big\{(V,\tau)\in\mathbf S^N_+\times\mathbf R_+:\ \exists(W\in\mathbf S^N,\ w\in\mathbf R^N_+):\ V\preceq W+\mathrm{Diag}\{w\},\ [p^+]_*(W)+r_*(w)\le\tau\Big\},\tag{5.40}
\]
where \([p^+]_*(\cdot)\) is the norm on \(\mathbf S^N\) conjugate to \(p^+(\cdot)\) and \(r_*(\cdot)\) is the norm on \(\mathbf R^N\) conjugate to r(·); this cone is compatible with the unit ball of the norm p(·) (and thus with any convex compact subset of this ball).

Verification is immediate. The fact that 𝐏 is a computationally tractable, sharp, closed convex cone is evident. Now let \((V,\tau)\in\mathbf P\), so that \(V\succeq0\) and \(V\preceq W+\mathrm{Diag}\{w\}\) with \([p^+]_*(W)+r_*(w)\le\tau\). For x with \(p(x)\le1\) we have
\[
\begin{array}{rcl}
x^TVx&\le&x^T[W+\mathrm{Diag}\{w\}]x=\mathrm{Tr}(W[xx^T])+w^T[x]^2\\
&\le&p^+(xx^T)\,[p^+]_*(W)+r([x]^2)\,r_*(w)=p^2(x)\,[p^+]_*(W)+r([x]^2)\,r_*(w)\\
&\le&[p^+]_*(W)+r_*(w)\ \le\ \tau
\end{array}
\]
(we have used (5.39) along with the facts that \(p(x)\le1\) and that r(·) fits p(·)), whence \(x^TVx\le\tau\) for all x with \(p(x)\le1\). ✷

Let us look at the proposed construction in the case where \(p(\cdot)=\|\cdot\|_s\), \(s\in[1,\infty]\), and \(r(\cdot)=\|\cdot\|_{\bar s}\), \(\bar s=\max[s/2,1]\). Setting \(s_*=\frac{s}{s-1}\), \(\bar s_*=\frac{\bar s}{\bar s-1}\), we clearly have
\[
[p^+]_*(W)=\|W\|_{s_*}:=\begin{cases}\big(\sum_{i,j}|W_{ij}|^{s_*}\big)^{1/s_*},&s_*<\infty,\\\max_{i,j}|W_{ij}|,&s_*=\infty,\end{cases}\qquad r_*(w)=\|w\|_{\bar s_*},\tag{5.41}
\]
resulting in
\[
\mathbf P_s:=\mathbf P_{\|\cdot\|_s,\|\cdot\|_{\bar s}}=\Big\{(V,\tau):\ V\in\mathbf S^N_+,\ \exists(W\in\mathbf S^N,\ w\in\mathbf R^N_+):\ V\preceq W+\mathrm{Diag}\{w\},\ \|W\|_{s_*}+\|w\|_{\bar s_*}\le\tau\Big\}.\tag{5.42}
\]
By Observation 5.9, \(\mathbf P_s\) is compatible with the unit ball of the \(\|\cdot\|_s\) norm on \(\mathbf R^N\) (and therefore with every closed convex subset of this ball).
When s = 1, so that \(s_*=\bar s_*=\infty\), (5.42) results in
\[
\mathbf P_1=\Big\{(V,\tau):\ V\succeq0,\ \exists(W\in\mathbf S^N,\ w\in\mathbf R^N_+):\ V\preceq W+\mathrm{Diag}\{w\},\ \|W\|_\infty+\|w\|_\infty\le\tau\Big\}=\{(V,\tau):\ V\succeq0,\ \|V\|_\infty\le\tau\},\tag{5.43}
\]
and it is easily seen that the situation is as good as it could be, namely,
\[
\mathbf P_1=\{(V,\tau):\ V\succeq0,\ \max_{\|x\|_1\le1}x^TVx\le\tau\}.
\]
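That \(\mathbf P_1\) coincides with the ideal cone is easy to check numerically: the convex form \(x^TVx\) attains its maximum over the \(\ell_1\) ball at a vertex \(\pm e_i\), so the maximum equals \(\max_iV_{ii}\), which for PSD V is exactly \(\|V\|_\infty\) (since \(|V_{ij}|\le\sqrt{V_{ii}V_{jj}}\)). A sketch with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 5
G = rng.standard_normal((N, N))
V = G @ G.T                                    # V >= 0

# the convex form x^T V x is maximized over the l1 ball at a vertex +-e_i:
exact = np.diag(V).max()                       # = max_{||x||_1 <= 1} x^T V x

# Monte-Carlo: random points of the l1 sphere never beat the vertices
X = rng.standard_normal((10000, N))
X = X / np.abs(X).sum(axis=1, keepdims=True)
mc = np.einsum('ki,ij,kj->k', X, V, X).max()
print(mc, exact)
```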
It can be shown (see Section 5.4.3) that when \(s\in[2,\infty]\), so that \(\bar s_*=\frac{s}{s-2}\), (5.42) results in
\[
\mathbf P_s=\{(V,\tau):\ V\succeq0,\ \exists(w\in\mathbf R^N_+):\ V\preceq\mathrm{Diag}\{w\}\ \&\ \|w\|_{\frac{s}{s-2}}\le\tau\}.\tag{5.44}
\]
Note that \(\mathbf P_2=\{(V,\tau):V\succeq0,\ \|V\|_{2,2}\le\tau\}\), and this is exactly the largest cone compatible with the unit Euclidean ball. When s ≥ 2, the unit ball \(\mathcal Y\) of the norm \(\|\cdot\|_s\) is an ellitope:
\[
\{y\in\mathbf R^N:\|y\|_s\le1\}=\{y\in\mathbf R^N:\ \exists(t\ge0,\ \|t\|_{\bar s}\le1):\ y^TR_\ell y:=y_\ell^2\le t_\ell,\ \ell\le L=N\},
\]
so that one of the cones compatible with \(\mathcal Y\) is given by (5.37) with the identity matrix in the role of M. As is immediately seen, the latter cone is nothing but the cone (5.44).

5.1.5.6
Near-optimality of the polyhedral estimate in the spectratopic sub-Gaussian case
As an instructive application of the approach developed so far, consider the special case of the estimation problem stated in Section 5.1.1, where
1. The signal set \(\mathcal X\) and the unit ball \(B_*\) of the norm conjugate to ‖·‖ are spectratopes:
\[
\begin{array}{rcl}
\mathcal X&=&\{x\in\mathbf R^n:\ \exists t\in\mathcal T:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\},\\
B_*&=&\{z\in\mathbf R^\nu:\ \exists y\in\mathcal Y:\ z=My\},\quad\mathcal Y:=\{y\in\mathbf R^q:\ \exists r\in\mathcal R:\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\}
\end{array}
\]
(cf. Assumptions A, B in Section 4.3.3.2; as always, we lose nothing by assuming the spectratope \(\mathcal X\) to be basic).
2. For every \(x\in\mathcal X\), the observation noise \(\xi_x\) is sub-Gaussian: \(\xi_x\sim\mathrm{SG}(0,\sigma^2I_m)\).
We are about to show that in the present situation the polyhedral estimate constructed in Sections 5.1.5.2–5.1.5.4, i.e., yielded by an efficiently computable (high-accuracy near-) optimal solution to the optimization problem (5.34), is near-optimal in the minimax sense. Given a reliability tolerance ε ∈ (0,1), the recipe for constructing the m × m contrast matrix H presented in Proposition 5.8 is as follows:
• Set
\[
\mathcal Z=\{\vartheta^2[1;...;1]\},\qquad\vartheta=\sigma\varkappa,\quad\varkappa=\sqrt{2\ln(2m/\epsilon)},
\]
and utilize the construction from Section 5.1.5.2, thus arriving at the cone
\[
\mathbf H=\{(\Theta,\mu)\in\mathbf S^m_+\times\mathbf R_+:\ \sigma^2\varkappa^2\mathrm{Tr}(\Theta)\le\mu\}
\]
satisfying the requirements of Assumption C.
• Specify the cones 𝐗 and 𝐔 compatible with \(X_s=\mathcal X\) and \(B_*\), respectively, according to (5.36).
The resulting problem (5.34), after immediate straightforward simplifications, reads
\[
\begin{array}{rl}
\mathrm{Opt}=\min\limits_{\Theta,U,\Lambda,\Upsilon}&\Big\{\phi_{\mathcal R}(\lambda[\Upsilon])+\phi_{\mathcal T}(\lambda[\Lambda])+\sigma^2\varkappa^2\mathrm{Tr}(\Theta):\\[2pt]
&\Theta\succeq0,\ U\succeq0,\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[2pt]
&M^TUM\preceq\sum_\ell S_\ell^*[\Upsilon_\ell],\quad\left[\begin{array}{cc}U&\frac12B\\\frac12B^T&A^T\Theta A+\sum_k\mathcal R_k^*[\Lambda_k]\end{array}\right]\succeq0\Big\}
\end{array}\tag{5.45}
\]
where, as always,
\[
[\mathcal R_k^*[\Lambda_k]]_{ij}=\mathrm{Tr}(R^{ki}\Lambda_kR^{kj})\ \ \big[R_k[x]=\textstyle\sum_ix_iR^{ki}\big],\qquad
[S_\ell^*[\Upsilon_\ell]]_{ij}=\mathrm{Tr}(S^{\ell i}\Upsilon_\ell S^{\ell j})\ \ \big[S_\ell[u]=\textstyle\sum_iu_iS^{\ell i}\big],
\]
and
\[
\lambda[\Lambda]=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_K)],\qquad\lambda[\Upsilon]=[\mathrm{Tr}(\Upsilon_1);...;\mathrm{Tr}(\Upsilon_L)],\qquad\phi_{\mathcal W}(f)=\max_{w\in\mathcal W}w^Tf.
\]
Let now
\[
\mathrm{RiskOpt}_\epsilon=\inf_{\widehat x(\cdot)}\sup_{x\in\mathcal X}\inf\Big\{\rho:\ \mathrm{Prob}_{\xi\sim\mathcal N(0,\sigma^2I)}\{\|Bx-\widehat x(Ax+\xi)\|>\rho\}\le\epsilon\Big\}
\]
be the minimax optimal (ε,‖·‖)-risk of estimating Bx in the Gaussian observation scheme where \(\xi_x\sim\mathcal N(0,\sigma^2I_m)\) independently of \(x\in\mathcal X\).

Proposition 5.10. When ε ≤ 1/8, the polyhedral estimate \(\widehat x^H\) yielded by a feasible near-optimal, in terms of the objective, solution to problem (5.45) is minimax optimal within a logarithmic factor, namely,
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x^H|\mathcal X]\le O(1)\sqrt{\ln\Big(\sum_kd_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\,\mathrm{RiskOpt}_{1/8}
\le O(1)\sqrt{\ln\Big(\sum_kd_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\,\mathrm{RiskOpt}_\epsilon,
\]
where O(1) is an absolute constant.
See Section 5.4.4 for the proof. Discussion. It is worth mentioning that the approach described in Section 5.1.4 is complementary to the approach developed in this section. In fact, it is easily seen that the bound Opt for the risk of the polyhedral estimate stemming from (5.34) is suboptimal in the simple situation described in the motivating example from Section 5.1.1. Indeed, let X be the unit k · k1 ball, k · k = k · k2 , and let us consider the problem of estimating x ∈ X from the direct observation ω = x + ξ with Gaussian observation noise ξ ∼ N (0, σ 2 I). We equip the ball B∗ = {u ∈ Rn : kuk2 ≤ 1}
SIGNAL RECOVERY BEYOND LINEAR ESTIMATES
with the cone 𝒰 = P₂ = {(U, τ): U ⪰ 0, ‖U‖_{2,2} ≤ τ}, and X with the cone 𝒳 = P₁ = {(X, t): X ⪰ 0, ‖X‖_∞ ≤ t} (note that both cones are the largest, w.r.t. inclusion, cones compatible with the respective sets). The corresponding problem (5.34) reads
\[
\begin{aligned}
\mathrm{Opt}&=\min_{\Theta,X,U}\left\{2\Big[\kappa^2\sigma^2\,\mathrm{Tr}(\Theta)+\max_iX_{ii}+\|U\|_{2,2}\Big]:\ \Theta\succeq0,\ X\succeq0,\ U\succeq0,\ \left[\begin{array}{cc}U&\tfrac12I_n\\ \tfrac12I_n&\Theta+X\end{array}\right]\succeq0\right\}\\
&=\min_{\Theta,X,\tau}\left\{2\Big[\kappa^2\sigma^2\,\mathrm{Tr}(\Theta)+\max_iX_{ii}+\tau\Big]:\ \Theta\succeq0,\ X\succeq0,\ \left[\begin{array}{cc}\tau I_n&\tfrac12I_n\\ \tfrac12I_n&\Theta+X\end{array}\right]\succeq0\right\}.
\end{aligned}
\tag{5.46}
\]
Observe that every n × n matrix of the form Q = EP, where E is diagonal with diagonal entries ±1 and P is a permutation matrix, induces a symmetry (Θ, X, τ) ↦ (QΘQ^T, QXQ^T, τ) of the second optimization problem in (5.46), that is, a transformation which maps the feasible set onto itself and keeps the objective intact. Since the problem is convex and solvable, we conclude that it has an optimal solution which remains intact under the symmetries in question, i.e., a solution with scalar matrices Θ = θI_n and X = uI_n. As a result,
\[ \mathrm{Opt}=\min_{\theta\ge0,u\ge0,\tau}\Big\{2(\kappa^2\sigma^2n\theta+u+\tau):\ \tau(\theta+u)\ge\tfrac14\Big\}=2\min\big[\kappa\sigma\sqrt n,\,1\big].
\tag{5.47}
\]
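As a quick sanity check on (5.47), the reduced scalar problem can be minimized numerically. The following sketch (grid ranges and parameter values are ad hoc choices of ours, not from the text) recovers the closed-form value 2 min[κσ√n, 1]; since, for fixed θ+u, the linear part of the objective is minimized at a vertex, it suffices to search along the two axes:

```python
import numpy as np

def opt_reduced(kappa, sigma, n):
    """Minimize 2(kappa^2*sigma^2*n*theta + u + tau) over theta, u >= 0 with
    tau(theta+u) >= 1/4; at the optimum tau = 1/(4(theta+u)), and axis points
    (u = 0 or theta = 0) suffice because the objective is linear in (theta, u)."""
    a = kappa ** 2 * sigma ** 2 * n
    grid = np.linspace(1e-6, 10.0, 20001)
    pure_theta = 2.0 * (a * grid + 0.25 / grid)   # u = 0, tau eliminated
    pure_u = 2.0 * (grid + 0.25 / grid)           # theta = 0, tau eliminated
    return min(pure_theta.min(), pure_u.min())

kappa, sigma, n = 1.0, 0.05, 100                  # sample parameters
val = opt_reduced(kappa, sigma, n)
closed_form = 2.0 * min(kappa * sigma * np.sqrt(n), 1.0)
print(val, closed_form)                           # the two agree to grid accuracy
```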
A similar derivation shows that the value Opt remains intact if we replace the set X = {x : ‖x‖₁ ≤ 1} with X = {x : ‖x‖_s ≤ 1}, s ∈ [1, 2], and the cone 𝒳 = P₁ with 𝒳 = P_s; see (5.42). Since the Θ-component of an optimal solution to (5.46) can be selected to be scalar, the contrast matrix H we end up with can be selected to be the unit matrix. An unpleasant observation is that when s < 2, the quantity Opt given by (5.47) "heavily overestimates" the actual risk of the polyhedral estimate with H = I_n. Indeed, the analysis of this estimate in Section 5.1.4 results in the risk bound (up to a factor logarithmic in n) min[σ^{1−s/2}, σ√n], which can be much less than Opt = 2 min[κσ√n, 1], e.g., in the case of large n and σ√n = O(1).

5.1.6 Assembling estimates: Contrast aggregation
The good news is that whenever the approaches to the design of polyhedral estimates presented in Sections 5.1.4 and 5.1.5 are applicable, they can be utilized simultaneously. The underlying observation is that

(!) In the problem setting described in Section 5.1.2, a collection of K candidate polyhedral estimates can be assembled into a single polyhedral estimate whose (upper bound on the) risk, as given by Proposition 5.1, is nearly the minimum of the risks of the estimates we aggregate.

Indeed, given an observation scheme (that is, a collection of probability distributions P_x of noises ξ_x, x ∈ X), assume we have at our disposal norms π_δ(·) : R^m → R parameterized by δ ∈ (0, 1) such that, for every h, π_δ(h) grows as δ decreases,
and
\[ \forall(x\in\mathcal X,\ \delta\in(0,1),\ h\in\mathbf R^m):\ \pi_\delta(h)\le1\ \Rightarrow\ \mathrm{Prob}_{\xi\sim P_x}\{\xi:h^T\xi>1\}\le\delta. \]
Assume also (as is indeed the case in all our constructions) that we ensure (5.1) by imposing on the columns h_ℓ of an m × N contrast matrix H the restrictions π_{ǫ/N}(h_ℓ) ≤ 1. Now suppose that, given the risk tolerance ǫ ∈ (0, 1), we have generated K candidate contrast matrices H_k ∈ R^{m×N_k} such that π_{ǫ/N_k}(Col_j[H_k]) ≤ 1, j ≤ N_k, so that the (ǫ, ‖·‖)-risk of the polyhedral estimate yielded by the contrast matrix H_k does not exceed
\[ R_k=\max_x\big\{\|Bx\|:\ x\in2\mathcal X_s,\ \|H_k^TAx\|_\infty\le2\big\}. \]
Let us combine the contrast matrices H_1, ..., H_K into a single contrast matrix H with N = N_1 + ... + N_K columns by normalizing the columns of the concatenated matrix [H_1, ..., H_K] to have π_{ǫ/N}-norms equal to 1, so that
\[ H=[\bar H_1,\dots,\bar H_K],\qquad \mathrm{Col}_j[\bar H_k]=\theta_{jk}\,\mathrm{Col}_j[H_k]\quad\forall(k\le K,\,j\le N_k) \]
with
\[ \theta_{jk}=\frac1{\pi_{\epsilon/N}(\mathrm{Col}_j[H_k])}\ \ge\ \vartheta_k:=\min_{h\ne0}\frac{\pi_{\epsilon/N_k}(h)}{\pi_{\epsilon/N}(h)}, \]
where the concluding ≥ is due to π_{ǫ/N_k}(Col_j[H_k]) ≤ 1. We claim that in terms of (ǫ, ‖·‖)-risk, the polyhedral estimate yielded by H is "almost as good" as the best of the polyhedral estimates yielded by the contrast matrices H_1, ..., H_K; specifically,6
\[ R[H]:=\max_x\big\{\|Bx\|:\ x\in2\mathcal X_s,\ \|H^TAx\|_\infty\le2\big\}\ \le\ \min_k\vartheta_k^{-1}R_k. \]
The justification is readily given by the following observation: when ϑ ∈ (0, 1), we have
\[ R_{k,\vartheta}:=\max_x\big\{\|Bx\|:\ x\in2\mathcal X_s,\ \|H_k^TAx\|_\infty\le2/\vartheta\big\}\ \le\ R_k/\vartheta. \]
Indeed, when x is a feasible solution to the maximization problem specifying R_{k,ϑ}, ϑx is a feasible solution to the problem specifying R_k, implying that ϑ‖Bx‖ ≤ R_k. It remains to note that we clearly have R[H] ≤ min_k R_{k,ϑ_k}. The bottom line is that the aggregation just described of the contrast matrices H_1, ..., H_K into a single contrast matrix H results in a polyhedral estimate which, in terms of the upper bound R[·] on its (ǫ, ‖·‖)-risk, is, up to the factor ϑ̄ = max_k ϑ_k^{−1}, not worse than the best of the K estimates yielded by the original contrast matrices. Consequently, if π_δ(·) grows slowly as δ decreases, the "price" ϑ̄ of assembling the original estimates is quite moderate. For example, in our basic cases (sub-Gaussian, Discrete, and Poisson), ϑ̄ is logarithmic in max_k N_k^{−1}(N_1 + ... + N_K), and ϑ̄ = 1 + o(1) as ǫ → +0 for K, N_1, ..., N_K fixed.

6 This is the precise "quantitative expression" of the observation (!).
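In the sub-Gaussian case one may take π_δ(h) = σ√(2 ln(1/δ)) ‖h‖₂ (an assumption consistent with the construction in Section 5.1.5.2, not a quote from the text), and the aggregation is then a simple column renormalization. A minimal numpy sketch with hypothetical sizes:

```python
import numpy as np

def pi(h, delta, sigma=1.0):
    # assumed sub-Gaussian "reliability" norm: pi_delta(h) = sigma*sqrt(2 ln(1/delta))*||h||_2
    return sigma * np.sqrt(2.0 * np.log(1.0 / delta)) * np.linalg.norm(h)

def aggregate(contrasts, eps):
    """Concatenate contrast matrices H_1,...,H_K and renormalize every column
    to have pi_{eps/N}-norm equal to 1, N being the total number of columns."""
    N = sum(H.shape[1] for H in contrasts)
    cols = [H[:, j] / pi(H[:, j], eps / N)
            for H in contrasts for j in range(H.shape[1])]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
eps = 0.1
H1 = rng.standard_normal((20, 5))
H2 = rng.standard_normal((20, 8))
# normalize the original matrices to pi_{eps/N_k}-norm 1, as in the text
H1 /= np.array([pi(H1[:, j], eps / 5) for j in range(5)])
H2 /= np.array([pi(H2[:, j], eps / 8) for j in range(8)])
H = aggregate([H1, H2], eps)
# for this pi, theta_k = sqrt(ln(N_k/eps)/ln(N/eps)), so the aggregation price is mild
print(H.shape)   # (20, 13)
```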
5.1.7 Numerical illustration
We are about to illustrate the numerical performance of polyhedral estimates by comparing it to the performance of a "presumably good" linear estimate. Our setup is deliberately simple: the signal set X is just the unit box {x ∈ R^n : ‖x‖_∞ ≤ 1}, and B ∈ R^{n×n} is "numerical double integration": for a δ > 0,
\[ B_{ij}=\begin{cases}\delta^2(i-j+1),&j\le i,\\ 0,&j>i,\end{cases} \]
so that x, modulo boundary effects, is the second-order finite-difference derivative of w = Bx:
\[ x_i=\frac{w_i-2w_{i-1}+w_{i-2}}{\delta^2},\quad 2<i\le n, \]
and Ax comprises m randomly selected entries of Bx. The observation is ω = Ax + ξ, ξ ∼ N(0, σ²I_m), and the recovery norm is ‖·‖₂. In other words, we want to recover the restriction of a twice differentiable function of one variable to the n-point regular grid on the segment Δ = [0, nδ] from noisy observations of this restriction taken at m randomly selected points of the grid. The a priori information on the function is that the magnitude of its second-order derivative does not exceed 1. Note that in the considered situation both the linear estimate x̂_H yielded by Proposition 4.14 and the polyhedral estimate x̂_H yielded by Proposition 5.7 are near-optimal in the minimax sense in terms of their ‖·‖₂- or (ǫ, ‖·‖₂)-risks. In the experiments reported in Figure 5.1, we used n = 64, m = 32, and δ = 4/n (i.e., Δ = [0, 4]); the reliability parameter for the polyhedral estimate was set to ǫ = 0.1. For noise levels σ ∈ {0.1, 0.01, 0.001, 0.0001} we generated 20 random signals x from X and recorded the ‖·‖₂ recovery errors of the linear and the polyhedral estimates. In addition to testing the nearly optimal polyhedral estimate PolyI yielded by Proposition 5.8 as applied in the framework of item 5.1.5.5.A, we also recorded the performance of the polyhedral estimate PolyII yielded by the construction from Section 5.1.4. The observed ‖·‖₂ recovery errors of the three estimates are plotted in Figure 5.1. All three estimates exhibit similar empirical performance in these simulations.
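The double-integration matrix B and the random row selector A described above can be generated as follows (a sketch; the particular row-selection convention is our assumption):

```python
import numpy as np

def double_integration(n, delta):
    # B_ij = delta^2 * (i - j + 1) for j <= i, and 0 for j > i (1-based i, j)
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, n + 1)[None, :]
    return np.where(j <= i, delta ** 2 * (i - j + 1.0), 0.0)

n, m = 64, 32
delta = 4.0 / n
B = double_integration(n, delta)
rng = np.random.default_rng(1)
rows = np.sort(rng.choice(n, size=m, replace=False))
A = B[rows]                           # Ax = m randomly selected entries of Bx

# sanity check: the second-order finite difference of w = Bx recovers x
x = rng.uniform(-1.0, 1.0, n)         # a signal from the unit box
w = B @ x
d2 = (w[2:] - 2.0 * w[1:-1] + w[:-2]) / delta ** 2
print(np.max(np.abs(d2 - x[2:])))     # zero up to rounding
```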
However, when the noise level becomes small, the polyhedral estimates seem to outperform the linear one. In addition, the estimate PolyII seems to "work" better than, or at the very worst similarly to, PolyI, in spite of the fact that in the situation in question the estimate PolyI, in contrast to PolyII, is provably near-optimal.

5.1.8 Calculus of compatibility
The principal rules of the calculus of compatibility are as follows (verification of the rules is straightforward and is therefore skipped): 1. [passing to a subset] When Y ′ ⊂ Y are convex compact subsets of RN and a cone Y is compatible with Y, the cone is compatible with Y ′ as well.
[Figure 5.1 appears here; its four panels show recovery errors for σ = 0.1, 0.01, 0.001, and 0.0001.]

Figure 5.1: Recovery errors for the near-optimal linear estimate (circles) and for polyhedral estimates yielded by Proposition 5.8 (PolyI, pentagrams) and by the construction from Section 5.1.4 (PolyII, triangles), 20 simulations per each value of σ.
2. [finite intersection] Let cones 𝒴^j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J. Then the cone
\[ \mathcal Y=\mathrm{cl}\Big\{(V,\tau)\in\mathbf S^N_+\times\mathbf R_+:\ \exists\big((V_j,\tau_j)\in\mathcal Y^j,\,j\le J\big):\ V\preceq\sum_jV_j,\ \sum_j\tau_j\le\tau\Big\} \]
is compatible with Y = ∩_j Y_j. The closure operation can be skipped when all the cones 𝒴^j are sharp, in which case 𝒴 is sharp as well.

3. [convex hulls of finite unions] Let cones 𝒴^j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J, and let there exist (V, τ) such that V ≻ 0 and
\[ (V,\tau)\in\mathcal Y:=\bigcap_j\mathcal Y^j. \]
Then 𝒴 is compatible with Y = Conv{∪_j Y_j} and, in addition, is sharp provided that at least one of the 𝒴^j is sharp.

4. [direct product] Let cones 𝒴^j be compatible with convex compact sets Y_j ⊂ R^{N_j}, j = 1, ..., J. Then the cone
\[ \mathcal Y=\Big\{(V,\tau)\in\mathbf S^{N_1+\dots+N_J}_+\times\mathbf R_+:\ \exists(V_j,\tau_j)\in\mathcal Y^j:\ V\preceq\mathrm{Diag}\{V_1,\dots,V_J\}\ \&\ \tau\ge\sum_j\tau_j\Big\} \]
is compatible with Y = Y_1 × ... × Y_J. This cone is sharp, provided that all the 𝒴^j are so.

5. [linear image] Let cone 𝒴 be compatible with convex compact set Y ⊂ R^N, let A be a K × N matrix, and let Z = AY. The cone
\[ \mathcal Z=\mathrm{cl}\big\{(V,\tau)\in\mathbf S^K_+\times\mathbf R_+:\ \exists U\succeq A^TVA:\ (U,\tau)\in\mathcal Y\big\} \]
is compatible with Z. The closure operation can be skipped whenever 𝒴 is either sharp or complete, completeness meaning that (V, τ) ∈ 𝒴 and 0 ⪯ V′ ⪯ V imply that (V′, τ) ∈ 𝒴. The cone 𝒵 is sharp, provided 𝒴 is so and the rank of A is K.

6. [inverse linear image] Let cone 𝒴 be compatible with convex compact set Y ⊂ R^N, let A be an N × K matrix with trivial kernel, and let Z = A^{−1}Y := {z ∈ R^K : Az ∈ Y}. The cone
\[ \mathcal Z=\mathrm{cl}\big\{(V,\tau)\in\mathbf S^K_+\times\mathbf R_+:\ \exists U:\ A^TUA\succeq V\ \&\ (U,\tau)\in\mathcal Y\big\} \]
is compatible with Z. The closure operation can be skipped whenever 𝒴 is sharp, in which case 𝒵 is sharp as well.

7. [arithmetic summation] Let cones 𝒴^j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J. Then the arithmetic sum Y = Y_1 + ... + Y_J of the sets Y_j can be equipped with a compatible cone readily given by the cones 𝒴^j; this cone is sharp, provided all the 𝒴^j are so. Indeed, the arithmetic sum of the Y_j is the linear image of the direct product of the Y_j's under the mapping [y^1; ...; y^J] ↦ y^1 + ... + y^J, and it remains to combine rules 4 and 5; note that the cone yielded by rule 4 is complete, so that when applying rule 5, the closure operation can be skipped.
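To see the logic of rule 2 in action numerically: if (V_j, τ_j) certifies that y^T V_j y ≤ τ_j on Y_j and V ⪯ Σ_j V_j, then y^T V y ≤ Σ_j τ_j on the intersection of the Y_j. A Monte Carlo sanity check with two ellipsoids (an illustration of ours, not part of the calculus above):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5

def rand_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

Q1, Q2 = rand_spd(N), rand_spd(N)     # Y_j = {y : y^T Q_j y <= 1}
tau1, tau2 = 0.7, 1.3
V1, V2 = tau1 * Q1, tau2 * Q2         # then max_{y in Y_j} y^T V_j y <= tau_j
V = V1 + V2                           # V <= V_1 + V_2 (with equality here)

# sample points of Y_1 ∩ Y_2 and check y^T V y <= tau_1 + tau_2
worst = 0.0
for _ in range(5000):
    y = rng.standard_normal(N)
    y /= max(np.sqrt(y @ Q1 @ y), np.sqrt(y @ Q2 @ y))  # push onto the intersection's boundary
    worst = max(worst, float(y @ V @ y))
print(worst <= tau1 + tau2)   # True
```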
5.2 RECOVERING SIGNALS FROM NONLINEAR OBSERVATIONS BY STOCHASTIC OPTIMIZATION
The "common denominator" of all estimation problems considered so far in this chapter is that what we observed was obtained by adding noise to a linear image of the unknown signal to be recovered. In this section we consider the problem of signal estimation in the case where the observation is obtained by adding noise to a nonlinear transformation of the signal.

5.2.1 Problem setting
A motivating example for what follows is provided by the logistic regression model, where
• the unknown signal to be recovered is a vector x known to belong to a given signal set X ⊂ R^n, which we assume to be a nonempty convex compact set;
• our observation ω^K = {ω_k = (η_k, y_k), 1 ≤ k ≤ K} stemming from a signal x is as follows:
– the regressors η_1, ..., η_K are i.i.d. realizations of an n-dimensional random
vector η with distribution Q independent of x and such that Q possesses a finite and positive definite matrix Eη∼Q {ηη T } of second moments;
– the labels y_k are generated as follows: y_k is a Bernoulli random variable independent of the "history" η_1, ..., η_{k−1}, y_1, ..., y_{k−1}, and the conditional, given η_k, probability for y_k to be 1 is φ(η_k^T x), where
\[ \phi(s)=\frac{\exp\{s\}}{1+\exp\{s\}}. \]
In this model, the standard (and very well-studied) approach to estimating the signal x underlying the observations is to use the Maximum Likelihood (ML) estimate: the logarithm of the conditional, given η_k, 1 ≤ k ≤ K, probability of getting the observed labels as a function of a candidate signal z is
\[
\ell(z,\omega^K)=\sum_{k=1}^K\Big[y_k\ln\phi(\eta_k^Tz)+(1-y_k)\ln\big(1-\phi(\eta_k^Tz)\big)\Big]
=\Big[\sum_ky_k\eta_k\Big]^Tz-\sum_k\ln\big(1+\exp\{\eta_k^Tz\}\big),
\tag{5.48}
\]
and the ML estimate of the "true" signal x underlying our observation ω^K is obtained by maximizing the log-likelihood ℓ(z, ω^K) over z ∈ X:
\[ \hat x_{\mathrm{ML}}(\omega^K)\in\mathop{\mathrm{Argmax}}_{z\in\mathcal X}\ \ell(z,\omega^K),
\tag{5.49}
\]
which is a convex optimization problem.
The problem we intend to consider (referred to as the generalized linear model (GLM) in Statistics) can be viewed as a natural generalization of the logistic regression just presented and is as follows: Our observation depends on an unknown signal x known to belong to a given convex compact set X ⊂ R^n and is
\[ \omega^K=\{\omega_k=(\eta_k,y_k),\,1\le k\le K\}
\tag{5.50}
\]
with ω_k, 1 ≤ k ≤ K, which are i.i.d. realizations of a random pair (η, y) with distribution P_x such that
• the regressor η is a random n × m matrix with some probability distribution Q independent of x;
• the label y is an m-dimensional random vector such that the conditional distribution of y given η induced by P_x has the expectation f(η^T x):
\[ E^\eta_x\{y\}=f(\eta^Tx),
\tag{5.51}
\]
where E^η_x{y} is the conditional expectation of y given η stemming from the distribution P_x of ω = (η, y), and f(·) : R^m → R^m (the "link function") is a given mapping. Note that the logistic regression model corresponds to the case where m = 1,
f(s) = exp{s}/(1 + exp{s}), and y takes values 0, 1, with the conditional probability of taking value 1 given η equal to f(η^T x). Another example is provided by the model
\[ y=f(\eta^Tx)+\xi, \]
where ξ is a random vector with zero mean independent of η, say, ξ ∼ N(0, σ²I_m). Note that in the latter case the ML estimate of the signal x underlying the observations is
\[ \hat x_{\mathrm{ML}}(\omega^K)\in\mathop{\mathrm{Argmin}}_{z\in\mathcal X}\sum_k\|y_k-f(\eta_k^Tz)\|_2^2.
\tag{5.52}
\]
In contrast to what happens with logistic regression, now the optimization problem—“Nonlinear Least Squares”—responsible for the ML estimate typically is nonconvex and can be computationally difficult. Following [140], we intend to impose on the data of the estimation problem we have just described (namely, on X , f (·), and the distributions Px , x ∈ X , of the pair (η, y)) assumptions which allow us to reduce our estimation problem to a problem with convex structure—a strongly monotone variational inequality represented by a stochastic oracle. At the end of the day, this will lead to a consistent estimate of the signal, with explicit “finite sample” accuracy guarantees. 5.2.2
Assumptions
Preliminaries: Monotone vector fields. A monotone vector field on R^m is a single-valued, everywhere defined mapping g(·) : R^m → R^m which possesses the monotonicity property
\[ [g(z)-g(z')]^T[z-z']\ge0\quad\forall z,z'\in\mathbf R^m. \]
We say that such a field is monotone with modulus κ ≥ 0 on a closed convex set Z ⊂ R^m if
\[ [g(z)-g(z')]^T[z-z']\ge\kappa\|z-z'\|_2^2\quad\forall z,z'\in Z, \]
and say that g is strongly monotone on Z if the modulus of monotonicity of g on Z is positive. It is immediately seen that for a vector field g which is continuously differentiable on a closed convex set Z with a nonempty interior, a necessary and sufficient condition for being monotone with modulus κ on this set is
\[ d^Tg'(z)d\ge\kappa\,d^Td\quad\forall(d\in\mathbf R^m,\,z\in Z).
\tag{5.53}
\]
Basic examples of monotone vector fields are: • gradient fields ∇φ(x) of continuously differentiable convex functions of m variables or, more generally, the vector fields [∇x φ(x, y); −∇y φ(x, y)] stemming from continuously differentiable functions φ(x, y) which are convex in x and concave in y; • “diagonal” vector fields f (x) = [f1 (x1 ); f2 (x2 ); ...; fm (xm )] with monotonically nondecreasing univariate components fi (·). If, in addition, the fi (·) are continuously differentiable with positive first order derivatives, then the associated field f is strongly monotone on every compact convex subset of Rm , the monotonicity modulus depending on the subset.
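For a differentiable scalar link, the monotonicity modulus on a segment [−R, R] is just the minimum of the derivative there; a quick numeric illustration of ours with the logistic sigmoid (R is an arbitrary choice):

```python
import numpy as np

f = lambda s: 1.0 / (1.0 + np.exp(-s))     # logistic sigmoid
fp = lambda s: f(s) * (1.0 - f(s))         # its derivative

R = 2.0
s = np.linspace(-R, R, 4001)
modulus = fp(s).min()                      # attained at the endpoints +-R
print(modulus, fp(R))

# Monte Carlo check of the monotonicity inequality on Z = [-R, R]
rng = np.random.default_rng(6)
z, zp = rng.uniform(-R, R, 100000), rng.uniform(-R, R, 100000)
gap = (f(z) - f(zp)) * (z - zp) - modulus * (z - zp) ** 2
print(gap.min())                           # nonnegative up to rounding
```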
Monotone vector fields on R^n admit a simple calculus which includes, in particular, the following two rules:
I. [affine substitution of argument]: If f(·) is a monotone vector field on R^m and A is an n × m matrix, the vector field g(x) = Af(A^T x + a) is monotone on R^n; if, in addition, f is monotone with modulus κ ≥ 0 on a closed convex set Z ⊂ R^m and X ⊂ R^n is closed, convex, and such that A^T x + a ∈ Z whenever x ∈ X, then g is monotone with modulus σ²κ on X, where σ is the n-th singular value of A (i.e., the largest γ such that ‖A^T x‖₂ ≥ γ‖x‖₂ for all x).
II. [summation]: If S is a Polish space, f(x, s) : R^m × S → R^m is a Borel vector-valued function which is monotone in x for every s ∈ S, and µ(ds) is a Borel probability measure on S such that the vector field
\[ F(x)=\int_Sf(x,s)\mu(ds) \]
is well-defined for all x, then F(·) is monotone. If, in addition, X is a closed convex set in R^m and f(·, s) is monotone on X with Borel in s modulus κ(s) for every s ∈ S, then F is monotone on X with modulus ∫_S κ(s)µ(ds).

Assumptions. In what follows, we make the following assumptions on the ingredients of the estimation problem posed in Section 5.2.1:
• A.1. The vector field f(·) is continuous and monotone, and the vector field
F(z) = E_{η∼Q}{ηf(η^T z)}
is well-defined (and therefore is monotone along with f by I, II);
• A.2. The signal set X is a nonempty convex compact set, and the vector field F is monotone with positive modulus κ on X;
• A.3. For a properly selected M < ∞ and every x ∈ X it holds
\[ E_{(\eta,y)\sim P_x}\{\|\eta y\|_2^2\}\le M^2.
\tag{5.54}
\]
A simple sufficient condition for the validity of Assumptions A.1–3 with properly selected M < ∞ and κ > 0 is as follows:
• The distribution Q of η has finite moments of all orders, and E_{η∼Q}{ηη^T} ≻ 0;
• f is continuously differentiable, and d^T f′(z)d > 0 for all d ≠ 0 and all z. Besides this, f is of polynomial growth: for some constants C ≥ 0 and p ≥ 0 and all z one has ‖f(z)‖₂ ≤ C(1 + ‖z‖₂^p).
Verification of sufficiency is straightforward. The principal observation underlying the construction we are about to discuss is as follows.

Proposition 5.11. With Assumptions A.1–3 in force, let us associate with a pair
(η, y) ∈ R^{n×m} × R^m the vector field
\[ G_{(\eta,y)}(z)=\eta f(\eta^Tz)-\eta y:\ \mathbf R^n\to\mathbf R^n.
\tag{5.55}
\]
Then for every x ∈ X we have
\[
\begin{array}{llll}
(a)&E_{(\eta,y)\sim P_x}\{G_{(\eta,y)}(z)\}&=F(z)-F(x)&\forall z\in\mathbf R^n,\\
(b)&\|F(z)\|_2&\le M&\forall z\in\mathcal X,\\
(c)&E_{(\eta,y)\sim P_x}\{\|G_{(\eta,y)}(z)\|_2^2\}&\le4M^2&\forall z\in\mathcal X.
\end{array}
\tag{5.56}
\]
Proof is immediate. Indeed, let x ∈ X. Then
\[ E_{(\eta,y)\sim P_x}\{\eta y\}=E_{\eta\sim Q}\big\{E^\eta_x\{\eta y\}\big\}=E_\eta\big\{\eta f(\eta^Tx)\big\}=F(x) \]
(we have used (5.51) and the definition of F), whence
\[ E_{(\eta,y)\sim P_x}\{G_{(\eta,y)}(z)\}=E_{(\eta,y)\sim P_x}\{\eta f(\eta^Tz)-\eta y\}=E_{\eta\sim Q}\{\eta f(\eta^Tz)\}-F(x)=F(z)-F(x), \]
as stated in (5.56.a). Besides this, for x, z ∈ X, taking into account that the marginal distribution of η induced by P_z is Q, we have
\[
\begin{aligned}
E_{(\eta,y)\sim P_x}\{\|\eta f(\eta^Tz)\|_2^2\}&=E_{\eta\sim Q}\{\|\eta f(\eta^Tz)\|_2^2\}\\
&=E_{\eta\sim Q}\big\{\|E_{y\sim P^\eta_z}\{\eta y\}\|_2^2\big\}&&[\text{since }E_{y\sim P^\eta_z}\{y\}=f(\eta^Tz)]\\
&\le E_{\eta\sim Q}\big\{E_{y\sim P^\eta_z}\{\|\eta y\|_2^2\}\big\}&&[\text{by Jensen's inequality}]\\
&=E_{(\eta,y)\sim P_z}\{\|\eta y\|_2^2\}\le M^2&&[\text{by A.3 due to }z\in\mathcal X].
\end{aligned}
\]
This combines with the relation E_{(η,y)∼P_x}{‖ηy‖₂²} ≤ M², given by A.3 due to x ∈ X, to imply (5.56.b) and (5.56.c). ✷

Consequences. Our goal is to recover the signal x ∈ X underlying observations (5.50), and under Assumptions A.1–3, x is a root of the monotone vector field
\[ G(z)=F(z)-F(x),\qquad F(z)=E_{\eta\sim Q}\{\eta f(\eta^Tz)\};
\tag{5.57}
\]
we know that this root belongs to X, and this root is unique because G(·) is strongly monotone on X along with F(·). Now, the problem of finding a root, known to belong to a given convex compact set X, of a vector field G which is strongly monotone on this set is known to be computationally tractable, provided we have access to an "oracle" which, given on input a point z ∈ X, returns the value G(z) of the field at this point. The latter is not exactly the case in the situation we are interested in: the field G is the expectation of a random field,
\[ G(z)=E_{(\eta,y)\sim P_x}\big\{\eta f(\eta^Tz)-\eta y\big\}, \]
and we do not know a priori the distribution over which the expectation is taken. However, we can sample from this distribution—the samples are exactly the observations (5.50)—and we can use these samples to approximate G and use
this approximation to approximate the signal x.7 Two standard implementations of this idea are Sample Average Approximation (SAA) and Stochastic Approximation (SA). We are about to consider these two techniques as applied to the situation we are in.

5.2.3 Estimating via Sample Average Approximation
The idea underlying SAA is quite transparent: given observations (5.50), let us approximate the field of interest G with its empirical counterpart
\[ G_{\omega^K}(z)=\frac1K\sum_{k=1}^K\big[\eta_kf(\eta_k^Tz)-\eta_ky_k\big]. \]
By the Law of Large Numbers, as K → ∞, the empirical field G_{ω^K} converges to the field of interest G, so that under mild regularity assumptions, when K is large, G_{ω^K} will, with overwhelming probability, be close to G uniformly on X. Due to the strong monotonicity of G, this implies that a set of "near-zeros" of G_{ω^K} on X will be close to the zero x of G, which is nothing but the signal we want to recover. The only question is how we can consistently define a "near-zero" of G_{ω^K} on X.8 A convenient notion of a "near-zero" in our context is provided by the concept of a weak solution to a variational inequality with a monotone operator, defined as follows (we restrict the general definition to the situation of interest):

Let X ⊂ R^n be a nonempty convex compact set, and let H(z) : X → R^n be a monotone (i.e., [H(z) − H(z′)]^T[z − z′] ≥ 0 for all z, z′ ∈ X) vector field. A vector z_* ∈ X is called a weak solution to the variational inequality (VI) associated with H, X when
\[ H^T(z)[z-z_*]\ge0\quad\forall z\in\mathcal X. \]

Let X ⊂ R^n be a nonempty convex compact set and H be monotone on X. It is well known that
• The VI associated with H, X (let us denote it by VI(H, X)) always has a weak solution. It is clear that if z̄ ∈ X is a root of H, then z̄ is a weak solution to VI(H, X).9
• When H is continuous on X, every weak solution z̄ to VI(H, X) is also a strong solution, meaning that
\[ H^T(\bar z)(z-\bar z)\ge0\quad\forall z\in\mathcal X.
\tag{5.58}
\]
Indeed, (5.58) clearly holds true when z = z̄. Assuming z ≠ z̄ and setting z_t = z̄ + t(z − z̄), 0 < t ≤ 1, we have H^T(z_t)(z_t − z̄) ≥ 0 (since z̄ is a weak solution),

7 The observation expressed by Proposition 5.11, however simple, and the resulting course of actions seem to be new. In retrospect, one can recognize unperceived ad hoc utilization of this approach in the Perceptron and Isotron algorithms; see [1, 2, 29, 62, 116, 141, 142, 210] and references therein.
8 Note that we in general cannot define a "near-zero" of G_{ω^K} on X as a root of G_{ω^K} on this set—while G does have a root belonging to X, nobody told us that the same holds true for G_{ω^K}.
9 Indeed, when z̄ ∈ X and H(z̄) = 0, monotonicity of H implies that H^T(z)[z − z̄] = [H(z) − H(z̄)]^T[z − z̄] ≥ 0 for all z ∈ X, that is, z̄ is a weak solution to the VI.
whence H^T(z_t)(z − z̄) ≥ 0 (since z − z̄ is a positive multiple of z_t − z̄). Passing to the limit as t → +0 and invoking the continuity of H, we get H^T(z̄)(z − z̄) ≥ 0, as claimed.
• When H is the gradient field of a continuously differentiable convex function on X (such a field indeed is monotone), weak (or strong, which in the case of continuous H is the same) solutions to VI(H, X) are exactly the minimizers of the function on X.
Note also that a strong solution to VI(H, X) with monotone H is always a weak one: if z̄ ∈ X satisfies H^T(z̄)(z − z̄) ≥ 0 for all z ∈ X, then H^T(z)(z − z̄) ≥ 0 for all z ∈ X, since by monotonicity H^T(z)(z − z̄) ≥ H^T(z̄)(z − z̄). In the sequel, we utilize the following simple and well-known fact:

Lemma 5.12. Let X be a convex compact set, and let H be a monotone vector field on X with monotonicity modulus κ > 0, i.e.,
\[ [H(z)-H(z')]^T[z-z']\ge\kappa\|z-z'\|_2^2\quad\forall z,z'\in\mathcal X. \]
Further, let z̄ be a weak solution to VI(H, X). Then the weak solution to VI(H, X) is unique, and besides this,
\[ H^T(z)[z-\bar z]\ge\kappa\|z-\bar z\|_2^2\quad\forall z\in\mathcal X.
\tag{5.59}
\]
Proof: Under the premise of the lemma, let z ∈ X and let z̄ be a weak solution to VI(H, X) (recall that it does exist). Setting z_t = z̄ + t(z − z̄), for t ∈ (0, 1) we have
\[ H^T(z)[z-z_t]\ge H^T(z_t)[z-z_t]+\kappa\|z-z_t\|_2^2\ge\kappa\|z-z_t\|_2^2, \]
where the first ≥ is due to the strong monotonicity of H, and the second ≥ is due to the fact that H^T(z_t)[z − z_t] is proportional, with positive coefficient, to H^T(z_t)[z_t − z̄], and the latter quantity is nonnegative since z̄ is a weak solution to the VI in question. We end up with H^T(z)(z − z_t) ≥ κ‖z − z_t‖₂²; passing to the limit as t → +0, we arrive at (5.59). To prove the uniqueness of the weak solution, assume that besides the weak solution z̄ there exists a weak solution z̃ distinct from z̄, and let us set z′ = ½[z̄ + z̃]. Since both z̄ and z̃ are weak solutions, both the quantities H^T(z′)[z′ − z̄] and H^T(z′)[z′ − z̃] should be nonnegative, and because the sum of these quantities is 0, both of them are zero. Thus, applying (5.59) to z = z′, we get z′ = z̄, whence z̃ = z̄ as well. ✷

Now let us come back to the estimation problem under consideration. Let Assumptions A.1–3 hold, so that the vector fields G_{(η_k,y_k)}(z) defined in (5.55), and therefore the vector field G_{ω^K}(z), are continuous and monotone. When using the SAA, we compute a weak solution x̂(ω^K) to VI(G_{ω^K}, X) and treat it as the SAA estimate of the signal x underlying observations (5.50). Since the vector field G_{ω^K}(·) is monotone with efficiently computable values, provided that so is f, computing (a high accuracy approximation to) a weak solution to VI(G_{ω^K}, X) is a computationally tractable problem (see, e.g., [189]).
Moreover, utilizing the techniques from [30, 204, 220, 212, 213], under mild regularity assumptions additional to A.1–3 one can get a nonasymptotical upper bound on, say, the expected k · k22 error of the SAA estimate as a function of the sample size K and find out the rate at which this bound converges to 0 as K → ∞; this analysis, however, goes beyond our scope.
Let us specify the SAA estimate in the logistic regression model. In this case we have f(u) = (1 + e^{−u})^{−1}, and
\[
G_{(\eta_k,y_k)}(z)=\left[\frac{\exp\{\eta_k^Tz\}}{1+\exp\{\eta_k^Tz\}}-y_k\right]\eta_k,\qquad
G_{\omega^K}(z)=\frac1K\sum_{k=1}^K\left[\frac{\exp\{\eta_k^Tz\}}{1+\exp\{\eta_k^Tz\}}-y_k\right]\eta_k
=\nabla_z\left\{\frac1K\left[\sum_k\ln\big(1+\exp\{\eta_k^Tz\}\big)-y_k\eta_k^Tz\right]\right\}.
\]
In other words, G_{ω^K}(z) is proportional, with the negative coefficient −1/K, to the gradient field of the log-likelihood ℓ(z, ω^K); see (5.48). As a result, in the case in question weak solutions to VI(G_{ω^K}, X) are exactly the maximizers of the log-likelihood ℓ(z, ω^K) over z ∈ X; that is, for logistic regression the SAA estimate is nothing but the Maximum Likelihood estimate x̂_ML(ω^K) as defined in (5.49).10 On the other hand, in the "nonlinear least squares" example described in Section 5.2.1 with (for the sake of simplicity, scalar) monotone f(·), the vector field G_{ω^K}(·) is given by
\[ G_{\omega^K}(z)=\frac1K\sum_{k=1}^K\big[f(\eta_k^Tz)-y_k\big]\eta_k, \]
which is different (provided that f is nonlinear) from the gradient field
\[ 2\sum_{k=1}^Kf'(\eta_k^Tz)\big[f(\eta_k^Tz)-y_k\big]\eta_k \]
of the minus log-likelihood appearing in (5.52). As a result, in this case the ML estimate (5.52) is, in general, different from the SAA estimate (and, in contrast to the ML estimate, the SAA estimate is easy to compute).

10 This phenomenon is specific to the logistic regression model. The equality of the SAA and the ML estimates in this case is due to the fact that the logistic sigmoid f(s) = exp{s}/(1 + exp{s}) "happens" to satisfy the identity f′(s) = f(s)(1 − f(s)). When replacing the logistic sigmoid with f(s) = φ(s)/(1 + φ(s)), with a differentiable, monotonically nondecreasing, positive φ(·), the SAA estimate becomes the weak solution to VI(Φ, X) with
\[ \Phi(z)=\sum_k\left[\frac{\phi(\eta_k^Tz)}{1+\phi(\eta_k^Tz)}-y_k\right]\eta_k. \]
On the other hand, the gradient field of the minus log-likelihood
\[ -\sum_k\Big[y_k\ln(f(\eta_k^Tz))+(1-y_k)\ln(1-f(\eta_k^Tz))\Big] \]
we need to minimize when computing the ML estimate is
\[ \Psi(z)=\sum_k\frac{\phi'(\eta_k^Tz)}{\phi(\eta_k^Tz)}\left[\frac{\phi(\eta_k^Tz)}{1+\phi(\eta_k^Tz)}-y_k\right]\eta_k. \]
When K > 1 and φ is not an exponent, Φ and Ψ are "essentially different," so that the SAA estimate typically will differ from the ML one.
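The identity G_{ω^K} = −(1/K)∇_z ℓ(z, ω^K) in the logistic case is easy to verify numerically against a finite-difference gradient; a quick sketch of ours on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 5, 50
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

eta = rng.standard_normal((K, n))
x_true = rng.standard_normal(n) / np.sqrt(n)
y = (rng.random(K) < sigmoid(eta @ x_true)).astype(float)

def G_emp(z):
    # empirical field: (1/K) * sum_k (f(eta_k^T z) - y_k) * eta_k
    return ((sigmoid(eta @ z) - y)[:, None] * eta).mean(axis=0)

def loglik(z):
    s = eta @ z
    return float(np.sum(y * s - np.log1p(np.exp(s))))

z0 = rng.standard_normal(n)
h = 1e-6
grad = np.array([(loglik(z0 + h * e) - loglik(z0 - h * e)) / (2 * h)
                 for e in np.eye(n)])
gap = np.max(np.abs(G_emp(z0) + grad / K))   # vanishes up to finite-difference error
print(gap)
```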
5.2.4 Stochastic Approximation estimate
The Stochastic Approximation (SA) estimate stems from a simple algorithm—Subgradient Descent—for solving the variational inequality VI(G, X). Were the values of the vector field G(·) available, one could approximate a root x ∈ X of this VI using the recurrence
\[ z_k=\mathrm{Proj}_{\mathcal X}[z_{k-1}-\gamma_kG(z_{k-1})],\quad k=1,2,\dots,K, \]
where
• Proj_X[z] is the metric projection of R^n onto X: Proj_X[z] = argmin_{u∈X} ‖z − u‖₂;
• γ_k > 0 are given stepsizes;
• the initial point z_0 is an arbitrary point of X.
It is well known that under Assumptions A.1–3 this recurrence, with properly selected stepsizes and started at a point from X, allows us to approximate the root of G (in fact, the unique weak solution to VI(G, X)) to any desired accuracy, provided K is large enough. However, we are in the situation where the actual values of G are not available; the standard way to cope with this difficulty is to replace in the above recurrence the "unobservable" values G(z_{k−1}) of G with their unbiased random estimates G_{(η_k,y_k)}(z_{k−1}). This modification gives rise to Stochastic Approximation (going back to [146])—the recurrence
z_k = Proj_X[z_{k−1} − γ_k G_{(η_k,y_k)}(z_{k−1})], 1 ≤ k ≤ K,
(5.60)
where z_0 is a once and forever chosen point from X, and γ_k > 0 are deterministic stepsizes. The next item on our agenda is the (well-known) convergence analysis of SA under Assumptions A.1–3. To this end, observe that the z_k are deterministic functions of the initial fragments
\[ \omega^k=\{\omega_t,\,1\le t\le k\}\ \sim\ P_x^k:=\underbrace{P_x\times\dots\times P_x}_{k} \]
of our sequence of observations ω^K = {ω_k = (η_k, y_k), 1 ≤ k ≤ K}: z_k = Z_k(ω^k). Let us set
\[ D_k(\omega^k)=\tfrac12\|Z_k(\omega^k)-x\|_2^2=\tfrac12\|z_k-x\|_2^2,\qquad d_k=E_{\omega^k\sim P_x^k}\{D_k(\omega^k)\}, \]
where x ∈ X is the signal underlying observations (5.50). Note that, as is well known, the metric projection onto a closed convex set X is contracting:
\[ \forall(z\in\mathbf R^n,\,u\in\mathcal X):\ \|\mathrm{Proj}_{\mathcal X}[z]-u\|_2\le\|z-u\|_2. \]
Consequently, for 1 ≤ k ≤ K it holds
\[
\begin{aligned}
D_k(\omega^k)&=\tfrac12\|\mathrm{Proj}_{\mathcal X}[z_{k-1}-\gamma_kG_{\omega_k}(z_{k-1})]-x\|_2^2\\
&\le\tfrac12\|z_{k-1}-\gamma_kG_{\omega_k}(z_{k-1})-x\|_2^2\\
&=\tfrac12\|z_{k-1}-x\|_2^2-\gamma_kG_{\omega_k}(z_{k-1})^T(z_{k-1}-x)+\tfrac12\gamma_k^2\|G_{\omega_k}(z_{k-1})\|_2^2.
\end{aligned}
\]
Taking expectations w.r.t. ω^k ∼ P_x^k on both sides of the resulting inequality and keeping in mind relations (5.56) along with the fact that z_{k−1} ∈ X, we get
\[ d_k\le d_{k-1}-\gamma_kE_{\omega^{k-1}\sim P_x^{k-1}}\big\{G(z_{k-1})^T(z_{k-1}-x)\big\}+2\gamma_k^2M^2.
\tag{5.61}
\]
Recalling that we are in the case where G is strongly monotone on X with modulus κ > 0, x is the weak solution to VI(G, X), and z_{k−1} takes values in X, invoking (5.59) we see that the expectation in (5.61) is at least 2κd_{k−1}, and we arrive at the relation
\[ d_k\le(1-2\kappa\gamma_k)d_{k-1}+2\gamma_k^2M^2.
\tag{5.62}
\]
We put
\[ S=\frac{2M^2}{\kappa^2},\qquad\gamma_k=\frac1{\kappa(k+1)}.
\tag{5.63}
\]
Let us verify by induction in k that for k = 0, 1, ..., K it holds
\[ d_k\le(k+1)^{-1}S.
\tag{$*_k$}
\]
Base k = 0. Let D stand for the ‖·‖₂ diameter of X, and let z_± ∈ X be such that ‖z_+ − z_−‖₂ = D. By (5.56) we have ‖F(z)‖₂ ≤ M for all z ∈ X, and by the strong monotonicity of G(·) on X we have
\[ [G(z_+)-G(z_-)]^T[z_+-z_-]=[F(z_+)-F(z_-)]^T[z_+-z_-]\ge\kappa\|z_+-z_-\|_2^2=\kappa D^2. \]
By the Cauchy inequality, the left-hand side in the concluding ≥ is at most 2MD, and we get
\[ D\le\frac{2M}\kappa, \]
whence S ≥ D²/2. On the other hand, due to the origin of d_0 we have d_0 ≤ D²/2. Thus, (∗_0) holds true. Inductive step (∗_{k−1}) ⇒ (∗_k). Now assume that (∗_{k−1}) holds true for some k, 1 ≤ k ≤ K, and let us prove that (∗_k) holds true as well. Observe that κγ_k = (k + 1)^{−1} ≤ 1/2, so that
d_k ≤ d_{k−1}(1 − 2κγ_k) + 2γ_k²M²   [by (5.62)]
    ≤ (S/k)(1 − 2κγ_k) + 2γ_k²M²   [by (∗_{k−1}) and due to κγ_k ≤ 1/2]
    = (S/k)(1 − 2/(k + 1)) + S/(k + 1)²
    = (S/(k + 1))[(k − 1)/k + 1/(k + 1)] ≤ S/(k + 1),

so that (∗_k) holds true. The induction is complete. Recalling that d_k = ½E{‖z_k − x‖₂²}, we arrive at the following:

Proposition 5.13. Under Assumptions A.1–3 and with the stepsizes

γ_k = 1/(κ(k + 1)), k = 1, 2, ...,   (5.64)

for every signal x ∈ X the sequence of estimates x̂_k(ω^k) = z_k given by the SA recurrence (5.60), with ω_k = (η_k, y_k) defined in (5.50), obeys the error bound

E_{ω^k∼P_x^k}{‖x̂_k(ω^k) − x‖₂²} ≤ 4M²/(κ²(k + 1)), k = 0, 1, ...,   (5.65)

P_x being the distribution of (η, y) stemming from signal x.
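To make the recurrence and the stepsize rule concrete, here is a minimal sketch of SA with the stepsizes (5.64) in case B (linear regression), where the oracle G_ω(z) = η(η^T z − y) has expectation z − x and the modulus of strong monotonicity is κ = 1; the dimension, sample size, seed, and unit noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, kappa = 20, 20_000, 1.0
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                  # true signal on the unit sphere

def proj_ball(z, r=1.0):
    """Projection onto X = {x : ||x||_2 <= r}."""
    nz = np.linalg.norm(z)
    return z if nz <= r else z * (r / nz)

z = np.zeros(n)                          # z0, once and forever chosen point of X
for k in range(1, K + 1):
    eta = rng.standard_normal(n)         # regressor ~ N(0, I_n)
    y = eta @ x + rng.standard_normal()  # case B label: y ~ N(eta^T x, 1)
    G = eta * (eta @ z - y)              # E{G} = z - x, strongly monotone with kappa = 1
    z = proj_ball(z - G / (kappa * (k + 1)))   # SA recurrence with stepsizes (5.64)

print(round(float(np.linalg.norm(z - x)), 3))
```

The final error is of order 1/√K, consistent with the bound (5.65).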
5.2.5 Numerical illustration
To illustrate the above developments, we present here the results of some numerical experiments. Our deliberately simplistic setup is as follows:
• X = {x ∈ R^n : ‖x‖₂ ≤ 1};
• the distribution Q of η is N(0, I_n);
• f is the monotone vector field on R given by one of the following four options:
  A. f(s) = exp{s}/(1 + exp{s}) (“logistic sigmoid”);
  B. f(s) = s (“linear regression”);
  C. f(s) = max[s, 0] (“hinge function”);
  D. f(s) = min[1, max[s, 0]] (“ramp sigmoid”);
• the conditional distribution of y given η induced by P_x is
  – the Bernoulli distribution with probability f(η^T x) of outcome 1 in the case of A (i.e., A corresponds to the logistic model),
  – the Gaussian distribution N(f(η^T x), I_n) in cases B–D.
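A minimal sketch of drawing one observation (η, y) under the four options above; unit noise variance is assumed in the Gaussian cases B–D, and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# The four monotone links of the experiment
links = {
    "A": lambda s: np.exp(s) / (1.0 + np.exp(s)),         # logistic sigmoid
    "B": lambda s: s,                                     # linear regression
    "C": lambda s: np.maximum(s, 0.0),                    # hinge function
    "D": lambda s: np.minimum(1.0, np.maximum(s, 0.0)),   # ramp sigmoid
}

def draw_observation(x, case, rng):
    """One observation (eta, y) from P_x in the setup above."""
    eta = rng.standard_normal(x.shape[0])   # eta ~ N(0, I_n)
    s = float(links[case](eta @ x))
    if case == "A":                         # Bernoulli with success probability f(eta^T x)
        y = float(rng.random() < s)
    else:                                   # Gaussian with mean f(eta^T x), unit variance
        y = s + float(rng.standard_normal())
    return eta, y

x = rng.standard_normal(100)
x /= np.linalg.norm(x)                      # signal on the unit sphere
eta, y = draw_observation(x, "A", rng)
print(y in (0.0, 1.0))   # True
```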
Note that when m = 1 and η ∼ N(0, I_n), one can easily compute the field F(z). Indeed, for every z ∈ R^n \ {0} we have

η = (zz^T/‖z‖₂²)η + (I − zz^T/‖z‖₂²)η =: (zz^T/‖z‖₂²)η + η_⊥,

and due to the independence of η^T z and η_⊥,

F(z) = E_{η∼N(0,I)}{η f(η^T z)} = E_{η∼N(0,I)}{(zz^T/‖z‖₂²)η f(η^T z)} = (z/‖z‖₂) E_{ζ∼N(0,1)}{ζ f(‖z‖₂ ζ)},

so that F(z) is proportional to z/‖z‖₂, with proportionality coefficient h(‖z‖₂) = E_{ζ∼N(0,1)}{ζ f(‖z‖₂ ζ)}. In Figure 5.2 we present the plots of the function h(t) for the situations A–D, along with the moduli of strong monotonicity of the corresponding mappings F on the ‖·‖₂ ball of radius R centered at the origin, as functions of R. The dimension n in all experiments was set to 100, and the number of observations K was 400; 1,000; 4,000; 10,000; and 40,000. For each combination of parameters we ran 10 simulations, with the signals x underlying observations (5.50) drawn randomly from the uniform distribution on the unit sphere (the boundary of X).
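The coefficient h(t) = E_{ζ∼N(0,1)}{ζ f(tζ)} is easy to estimate by Monte Carlo; a sketch with two closed-form sanity checks (h(t) = t in case B, and h(t) = t/2 for t > 0 in case C, since E{ζ²·1{ζ>0}} = 1/2):

```python
import numpy as np

rng = np.random.default_rng(2)

def h(f, t, nsamples=400_000):
    """Monte Carlo estimate of h(t) = E_{zeta ~ N(0,1)} { zeta * f(t * zeta) }."""
    zeta = rng.standard_normal(nsamples)
    return float(np.mean(zeta * f(t * zeta)))

print(h(lambda s: s, 0.7))                      # case B: h(t) = t, so about 0.7
print(h(lambda s: np.maximum(s, 0.0), 0.7))     # case C: h(t) = t/2, so about 0.35
```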
426
CHAPTER 5
Figure 5.2: Left: functions h; right: moduli of strong monotonicity of the operators F(·) on {z : ‖z‖₂ ≤ R} as functions of R. Dashed lines: case A (logistic sigmoid); solid lines: case B (linear regression); dash-dotted lines: case C (hinge function); dotted lines: case D (ramp sigmoid).
In each experiment, we computed the SAA and the SA estimates (note that in cases A and B the SAA estimate is the Maximum Likelihood estimate as well). The SA stepsizes γ_k were selected according to (5.64) with “empirically tuned” κ.¹¹ Namely, given observations ω_k = (η_k, y_k), k ≤ K (see (5.50)), we used them to build the SA estimate in two stages:
— at the tuning stage, we generate a random “training signal” x′ ∈ X and then generate labels y′_k as if x′ were the actual signal. For instance, in the case of A, y′_k is assigned value 1 with probability f(η_k^T x′) and value 0 with the complementary probability. After the “training signal” and the associated labels are generated, we run SA on the resulting artificial observations with different values of κ, compute the accuracy of the resulting estimates, and select the value of κ resulting in the best recovery;
— at the execution stage, we run SA on the actual data with stepsizes (5.64) specified by the κ found at the tuning stage.
The results of some numerical experiments are presented in Figure 5.3. Note that the CPU time for SA includes both the tuning and the execution stages. The conclusion from these experiments is that, as far as estimation quality is concerned, the SAA estimate marginally outperforms the SA one, while being significantly more time consuming. Note also that the dependence of recovery errors on K observed in our experiments is consistent with the convergence rate O(1/√K) established by Proposition 5.13.

Comparison with Nonlinear Least Squares. Observe that in the case m = 1 of scalar monotone field f, the SAA estimate yielded by our approach as applied to observation ω^K is the minimizer of the convex function

H_{ω^K}(z) = (1/K) Σ_{k=1}^{K} [v(η_k^T z) − y_k η_k^T z],   v(r) = ∫₀^r f(s) ds,
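The two-stage tuning of κ described above can be sketched as follows for case A; the candidate grid of κ values, the dimensions, and the seed are illustrative choices, not the ones used in the book's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))      # = exp(s)/(1 + exp(s)), the case A link

def sa_estimate(etas, ys, kappa):
    """SA with stepsizes gamma_k = 1/(kappa (k+1)), projected onto the unit ball."""
    z = np.zeros(etas.shape[1])
    for k, (eta, y) in enumerate(zip(etas, ys), start=1):
        z = z - (eta * sigmoid(eta @ z) - eta * y) / (kappa * (k + 1))
        nz = np.linalg.norm(z)
        if nz > 1.0:
            z /= nz                       # projection onto X = unit ball
    return z

n, K = 50, 3000
etas = rng.standard_normal((K, n))
x = rng.standard_normal(n); x /= np.linalg.norm(x)
ys = (rng.random(K) < sigmoid(etas @ x)).astype(float)        # actual labels

# Tuning stage: relabel the regressors as if a random x' were the signal
xp = rng.standard_normal(n); xp /= np.linalg.norm(xp)
ys_train = (rng.random(K) < sigmoid(etas @ xp)).astype(float)
grid = [0.01, 0.03, 0.1, 0.3, 1.0]                            # illustrative candidates
kappa = min(grid, key=lambda c: np.linalg.norm(sa_estimate(etas, ys_train, c) - xp))

# Execution stage: run SA on the actual data with the tuned kappa
xhat = sa_estimate(etas, ys, kappa)
print(round(float(np.linalg.norm(xhat - x)), 3))
```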
11 We could get (lower bounds on) the moduli of strong monotonicity of the vector fields F (·) we are interested in analytically, but this would be boring and conservative.
427
SIGNAL RECOVERY BEYOND LINEAR ESTIMATES
[Figure 5.3, left panel: mean estimation error ‖x̂_k(ω^k) − x‖₂; right panel: CPU time (sec).]
Figure 5.3: Mean errors and CPU times for SA (solid lines) and SAA estimates (dashed lines) as functions of the number of observations K. o: case A (logistic link); x: case B (linear link); +: case C (hinge function); □: case D (ramp sigmoid).
on the signal set X. When f is the logistic sigmoid, H_{ω^K}(·) is exactly the convex loss function leading to the ML estimate in the logistic regression model. As we have already mentioned, this is not the case for a general GLM. Consider, e.g., the situation where the regressors and the signals are reals, the distribution of the regressor η is N(0, 1), and the conditional distribution of y given η is N(f(ηx), σ²), with f(s) = arctan(s). In this situation the ML estimate stemming from observation ω^K is the minimizer on X of the function

M_{ω^K}(z) = (1/K) Σ_{k=1}^{K} [y_k − arctan(η_k z)]².   (5.66)
The latter function is typically nonconvex and can be multiextremal. For example, when running simulations¹² we from time to time observed situations similar to the one presented in Figure 5.4. Of course, in our toy situation of scalar x the existence of several local minima of M_{ω^K}(·) is not an issue: we can easily compute the ML estimate by a brute force search along a dense grid. What to do in the multidimensional case is another question. We could also add that in the simulations which led to Figure 5.4, both the SAA and the ML estimates exhibited nearly the same performance in terms of the estimation error: in 1,000 experiments, the median of the observed recovery errors was 0.969 for the ML and 0.932 for the SAA estimate. When increasing the number of observations to 1,000, the empirical median (taken over 1,000 simulations) of recovery errors became 0.079 for the ML and 0.085 for the SAA estimate.

¹² In these simulations, the “true” signal x underlying the observations was drawn from N(0, 1), the number K of observations was also random, with uniform distribution on {1, ..., 20}, and X = [−20, 20], σ = 3 were used.
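The comparison can be reproduced in a few lines; the sketch below evaluates both the nonconvex ML objective M_{ω^K} of (5.66) and the convex surrogate H_{ω^K} (using the closed-form antiderivative v(r) = ∫₀^r arctan(s) ds = r·arctan(r) − ½ln(1 + r²)) on a dense grid over X = [−20, 20], following the simulation protocol of footnote 12; the grid resolution and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

sigma = 3.0
K = int(rng.integers(1, 21))                 # K uniform on {1,...,20}, as in footnote 12
x = float(rng.standard_normal())             # "true" scalar signal drawn from N(0,1)
eta = rng.standard_normal(K)
y = np.arctan(eta * x) + sigma * rng.standard_normal(K)

grid = np.linspace(-20.0, 20.0, 4001)        # brute force search over X = [-20, 20]
r = eta[:, None] * grid[None, :]

M = np.mean((y[:, None] - np.arctan(r)) ** 2, axis=0)     # ML objective (5.66), nonconvex
v = r * np.arctan(r) - 0.5 * np.log1p(r ** 2)             # v(r) = int_0^r arctan(s) ds
H = np.mean(v - y[:, None] * r, axis=0)                   # convex SAA objective

z_ml, z_saa = grid[np.argmin(M)], grid[np.argmin(H)]
print(round(float(z_ml), 2), round(float(z_saa), 2))
```

Because H is convex in z (its second derivative is a mean of terms η²/(1 + η²z²) ≥ 0), its grid minimizer is the global one, while M may have several local minima.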
Figure 5.4: Solid curve: M_{ω^K}(z); dashed curve: H_{ω^K}(z). True signal x (solid vertical line): +0.081; SAA estimate (unique minimizer of H_{ω^K}; dashed vertical line): −0.252; ML estimate (global minimizer of M_{ω^K} on [−20, 20]): −20.00; closest-to-x local minimizer of M_{ω^K} (dotted vertical line): −0.363.

5.2.6 “Single-observation” case
Let us look at the special case of our estimation problem where the sequence η_1, ..., η_K of regressors in (5.50) is deterministic. At first glance, this situation goes beyond our setup, in which the regressors should be drawn i.i.d. from some distribution Q. However, we can circumvent this “contradiction” by saying that we are now